Overview
Sequence data generated through the CCGP will comprise whole genome resequencing (WGS) data from roughly 22,000 individuals representing 150 genera and 230 unique species across the landscape of California. For most genera, to fully utilize the landscape WGS data, CCGP will additionally generate high-quality reference genomes built from PacBio long-read HiFi sequence data, Dovetail Omni-C conformational scaffolding data, and RNAseq based on Illumina short reads.
To ensure that this truly massive data set is available beyond CCGP researchers, we will make all data publicly available through several genetic databases hosted by NCBI. (Please see the CCGP Data Release Policies and Timelines for more detail.)
Within the NCBI organization, there are different types of elements (i.e. records) that allow us to identify the type of data we want to submit:
BioProjects are at the top of the organizational hierarchy and contain the metadata information for a particular set of elements (such as those listed below). BioProjects can also be “upgraded” to Umbrella BioProjects. These types of BioProjects allow us (as users of NCBI) to navigate throughout the whole hierarchy and can also include multiple BioProjects or Umbrella BioProjects.
BioSamples contain descriptions of biological source materials used in experimental assays. In our case, these directly represent the samples that have been used (or will be used) to prepare libraries for our sequencing efforts.
Assemblies provide information on the structure of assembled genomes, assembly names and other metadata, statistical reports, and links to genomic sequence data.
SRA (Sequence Read Archive) records contain library prep methods and sequencing data.
Organization
1 - CCGP Umbrella BioProject
The project organization on NCBI starts with the CCGP Umbrella BioProject (PRJNA720569). This will cover all the resources that have been (and will be) generated through CCGP efforts. This is the main node of our project.
2 - Genus-level “Species Project” Umbrella BioProjects
Each genus-level Umbrella BioProject generally corresponds to a full CCGP project, which might include several species and subspecies in a genus, as well as the reference genome species. For example, for the project “Genetic diversity in widespread and endemic manzanita species”, the CCGP team generates a genus-level Umbrella BioProject (PRJNA720709) called “Arctostaphylos spp. Genomic Resources”. These genus-level Umbrella BioProjects and their metadata will be generated by the CCGP team. They will have generic descriptions and do not require input from PIs.
3 - Species-level Umbrella BioProjects
Each genus-level Umbrella BioProject will further contain Umbrella BioProjects for each species within that genus. Additionally, for species that are used to generate the reference genome, there will be two species-level BioProjects:
A. The reference genome
B. The WGS data.
All species-level Umbrella BioProjects will be generated by the CCGP team, but will require some metadata from PIs/project members.
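The three levels described so far can be sketched as a small tree. This is only an illustration of the nesting, not an NCBI API or a CCGP tool: the `BioProject` class, its fields, and the `build_example` helper are hypothetical names. The two accessions are the ones quoted above; the species-level entries are placeholders with no accession, and placing the reference genome and WGS BioProjects under the species-level Umbrella is an assumption about how they are nested.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class BioProject:
    """One node in the hierarchy (hypothetical model, not an NCBI record format)."""
    title: str
    accession: Optional[str] = None  # None: no accession given in this document
    umbrella: bool = False           # only Umbrella BioProjects may hold children
    children: List["BioProject"] = field(default_factory=list)

    def add(self, child: "BioProject") -> "BioProject":
        """Attach a child BioProject; only allowed on Umbrella BioProjects."""
        if not self.umbrella:
            raise ValueError(f"{self.title!r} is not an Umbrella BioProject")
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "BioProject"]]:
        """Yield (depth, node) pairs in top-down order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def build_example() -> BioProject:
    # Level 1: the CCGP Umbrella BioProject (accession from the text above).
    ccgp = BioProject("CCGP Umbrella BioProject", "PRJNA720569", umbrella=True)
    # Level 2: a genus-level Umbrella BioProject (accession from the text above).
    genus = ccgp.add(
        BioProject("Arctostaphylos spp. Genomic Resources", "PRJNA720709", umbrella=True)
    )
    # Level 3: a species-level Umbrella BioProject (placeholder, no accession).
    species = genus.add(BioProject("Reference genome species (placeholder)", umbrella=True))
    # Assumption: the two species-level BioProjects sit under the species Umbrella.
    species.add(BioProject("A. Reference genome (placeholder)"))
    species.add(BioProject("B. WGS data (placeholder)"))
    return ccgp


if __name__ == "__main__":
    for depth, node in build_example().walk():
        label = node.accession or "no accession listed"
        print("  " * depth + f"{node.title} [{label}]")
```

Running the script prints the tree with one indentation step per level, which makes it easy to check that every species-level entry hangs off a genus-level Umbrella BioProject.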
4 - Reference genome assembly: sequencing data
Sequencing data used for the reference genome assembly will be submitted to NCBI as separate BioProjects, as shown below. The CCGP will submit all of the assembly data, but will require some metadata from PIs/project members for the samples (BioSamples) used to generate the reference genome.