Title: Optimizing Genomic Data Storage for Wide Accessibility
1 Optimizing Genomic Data Storage for Wide Accessibility
Joint Genome Institute (JGI)
NERSC Center
Computational Research Division (LBNL)
2 Collaborators
Nancy Meyer, NERSC - HPSS
Harvard Holmes, NERSC - HPSS
Jonathan Carter, NERSC - User Services
Horst Simon, NERSC Center Director
Susan Lucas, JGI-PGF - Head, Production Sequencing
Arthur Kobayashi, JGI-PGF - Production Informatics
Eddy Rubin, JGI Director
Arie Shoshani, LBNL Computational Research Division
Millions of Microbes Everywhere
3
- General Goals
- Genomic Data
- Life after the Human Genome Project
- NERSC Storage Systems
- Data Management
- Future Directions
4 General Goals
- Distribute, archive, and enhance access to the data generated at DOE's Joint Genome Institute (JGI) Production Genomic Facility (PGF)
- Serve as a resource for community access to these data.
- Establish a long-term collaboration between the JGI and the NERSC Center.
- High Performance Storage System (HPSS)
5 Environmental Genomics: Carbon Cycle
6 Environmental Genomics
- < 1% of microbes are culturable
- Many unculturables live in interdependent consortia of considerable diversity
- Aim to recover genome-scale sequences and reveal metabolic capabilities
- How can we understand the action of microbes at the molecular level?
- What is the structure of natural microbial populations? What is a microbial species?
7 Future environmental targets for JGI
Whole metagenome shotgun sequencing and targeted
fosmid-based methods can be used to recover
useful draft genomes
- Newman and Banfield, Science 2002
8 JGI Microbial Program
- JGI microbial sequencing targets a broad range of bacteria and archaea with relevance to:
- Bioremediation
- Carbon Sequestration
- Global Climate Change
- Biodiversity
- Biomass Conversion
- Energy Production
- Disease
9 Plants, Animals, Fungi
EUCARYA
BACTERIA
ARCHAEA
10 JGI Microbial Program
Lactic acid bacteria
- Lactobacillus gasseri (Klaenhammer)
- Oenococcus oeni (Mills)
Complex polysaccharide degradation
- Clostridium thermocellum (Wu)
- Microbulbifer degradans (Weiner) (complements white rot fungus sequence)
Phototrophic bacteria
- Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)
Toxic waste degradation and microbial ecology
- Desulfuromonas acetoxidans (Lovley)
- Desulfovibrio desulfuricans
Microbes in extreme environments
- Psychrobacter (Thomashow)
- Methanococcoides burtonii (Sowers, Cavicchioli)
Infectious diseases of plants and animals
- Ehrlichia chaffeensis (Yu)
- Pseudomonas syringae (Lindow)
Anaerobic methane-oxidizing consortium, "ball of bugs" (DeLong, Monterey Bay): one (or two?!) reverse methanogenic archaea in core plus sulfur-reducing bacterium on surface
11 JGI - Then and Now
- Then
  - Single project - Human Genome (chromosomes 5, 16, 19)
  - All data sent to NCBI/GenBank for storage and distribution
  - Minimum local responsibility for data stewardship
  - Relatively low production sequencing rate
- Now
  - Dozens of whole genome projects (2 million to more than a billion bases each)
  - Multiple species (microbial to vertebrates)
  - Complex environmental genomic communities
  - Full responsibility for data storage and distribution
  - Limited storage capacity
  - Production sequencing rate is increasing
12 JGI Monthly Production
[Chart: monthly production in millions of bases; panels: 5yr History, 12 months]
13 This is Not Raw Data
  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT
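A listing like the one above mixes position numbers and spacing with the bases themselves. As a minimal sketch (the function name `flatten_sequence` and the allowed alphabet `ACGTN` are assumptions for illustration, not part of the JGI tooling), the display form can be reduced to one contiguous string before any downstream processing:

```python
def flatten_sequence(block):
    """Drop position numbers and whitespace from a numbered sequence listing.

    Hypothetical helper: keeps only letters, upper-cases them, and rejects
    anything outside the assumed nucleotide alphabet ACGTN.
    """
    bases = "".join(ch for ch in block if ch.isalpha()).upper()
    unexpected = set(bases) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return bases


if __name__ == "__main__":
    sample = "1 CAGGTCAACG GATCATCTGT\n61 GGGTGTCCTA"
    print(len(flatten_sequence(sample)))
```

Because the helper ignores digits and whitespace entirely, it does not depend on how the listing is wrapped across lines.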
14 Neither is This
15 These are the Raw Data
16 Genome Sequencing
Start with genomic DNA
Make sheared fragments
Sequence both ends of fragments
Reconstruct genome computationally
High-throughput computational analysis
Provide genome and tools to community
17 Paired Plasmid Sequencing
18 JGI Data Production
- Millions of files per month of raw trace data
- 100 assembled projects per month (50MB-250MB) and several large assembled projects per year
- More data are being generated than ever before
- Currently trace data are maintained online only while projects are in process.
- Whole completed projects are available to download. They are large and contain millions of files.
19 JGI Raw Data Organization
Project: Series of Libraries that define a genome
Library: Series of Plates
Plate: 384 Clones
Clone: 2 Lanes
1 Lane: 1MB each, distributed into 4 files
- 1 FASTA file: 1KB
- 1 scf file: 50KB
- 1 abd file: 250KB
- 1 rsd/ab1 file: 650KB
In May-03, PGF ran 2.5 million successful lanes
- 2.5TB/month
- 10 million files
- (0.75TB/month (9 TB/year) non-trace files)
This does not include any assembly, database or metadata!
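The volume figures on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and takes the per-file sizes exactly as stated; all variable names are illustrative.

```python
# Per-lane file sizes as listed on the slide, in KB (decimal units assumed).
FILE_SIZES_KB = {"fasta": 1, "scf": 50, "abd": 250, "rsd_ab1": 650}

LANES_PER_MONTH = 2_500_000        # successful lanes in May-03
CLONES_PER_PLATE = 384
LANES_PER_CLONE = 2

bytes_per_lane = sum(FILE_SIZES_KB.values()) * 1_000          # four files per lane
files_per_month = LANES_PER_MONTH * len(FILE_SIZES_KB)
tb_per_month = LANES_PER_MONTH * bytes_per_lane / 1e12
lanes_per_plate = CLONES_PER_PLATE * LANES_PER_CLONE
non_trace_tb_per_year = 0.75 * 12                             # 0.75 TB/month

if __name__ == "__main__":
    print(f"bytes per lane:   {bytes_per_lane:,}")
    print(f"files per month:  {files_per_month:,}")
    print(f"TB per month:     {tb_per_month:.2f}")
    print(f"lanes per plate:  {lanes_per_plate}")
    print(f"non-trace TB/yr:  {non_trace_tb_per_year:.1f}")
```

Under these assumptions the four files add up to 951 KB per lane, which is the "1MB each" figure after rounding; 2.5 million lanes then give roughly 2.4 TB, consistent with the rounded 2.5TB/month, and four files per lane give the stated 10 million files.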
20 Current Access to JGI Data
- Access to these data is in demand by scientific fields that were not anticipated by the Human Genome Project
  - Microbiologists
  - Environmental Scientists
  - Evolutionary Scientists
  - GtL projects
- The computational sophistication of the user community is uneven, at best. Not everyone will want the same kind of files.
- GenBank is not capable of serving all of the JGI's needs.
21 Current Access to JGI Data (cont.)
- The data are processed by researchers using iterative and pattern-matching techniques, often requiring access to data that span several projects and genomes. This is different from the Human Genome Project.
- Currently, this requires downloading projects and then unpacking the project files to access the data. There are millions of files to unpack, and transfer of whole project files is slow.
- At best, the raw data used to generate the sequences in a project are very difficult to retrieve and interrogate.
22 NERSC Storage Systems
- DOE's largest unclassified storage systems, with a current archival capacity of 8 PB
- Robust and available 24x7 with high reliability and excellent network connectivity
- Very configurable; currently provide good service for both large streaming data and concurrent direct access.
- Experienced and innovative staff are adding new capabilities and distributing storage as the NERSC Center data requirements change over time.
23 Distribute and Enhance Access
- 1. Initially, we plan to hold all the sequence data online or near-line.
  - We will prototype and select the best way to do this:
    - distributed file systems
    - local file systems
    - cached web servers
    - tools.
- 2. Collaborate with JGI to organize and cluster the sequence data so they can be retrieved in meaningful pieces.
24 Distribute and Enhance Access (cont.)
- 3. Distribute the data between JGI and NERSC/HPSS
  - Develop tools and methodologies to move the data between JGI and NERSC/HPSS for timely access to sequence data as they are being generated.
  - Incorporate this into regular site backups
- 4. Build a web interface to the data providing a consistent view of the data (allowing the data to be distributed underneath) with a link to the data at JGI for ease of access.
25 Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the OPM tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project
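To make the metadata requirements concrete, here is a minimal sketch of what such a schema could look like. It uses Python's built-in `sqlite3` module and mirrors the Project / Library / Plate / Lane hierarchy from the raw data organization slide. Every table name, column name, and constraint is a hypothetical illustration, not the schema JGI or NERSC adopted, and SQLite stands in for whatever database system is eventually chosen.

```python
import sqlite3

# Hypothetical schema: NOT NULL enforces metadata entry, CHECK constraints
# provide simple correctness checks on what is entered.
SCHEMA = """
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    organism   TEXT NOT NULL
);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(project_id)
);
CREATE TABLE plate (
    plate_id   INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(library_id)
);
CREATE TABLE trace_file (
    file_id   INTEGER PRIMARY KEY,
    plate_id  INTEGER NOT NULL REFERENCES plate(plate_id),
    well      INTEGER NOT NULL CHECK (well BETWEEN 1 AND 384),
    lane      INTEGER NOT NULL CHECK (lane IN (1, 2)),
    file_type TEXT NOT NULL
              CHECK (file_type IN ('fasta', 'scf', 'abd', 'rsd', 'ab1')),
    path      TEXT NOT NULL UNIQUE
);
"""


def open_catalog():
    """Create an in-memory catalog with foreign-key enforcement turned on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def files_for_project(conn, project_id, file_type):
    """Query interface sketch: paths of one file type across a whole project."""
    rows = conn.execute(
        """SELECT t.path
           FROM trace_file t
           JOIN plate p   ON p.plate_id = t.plate_id
           JOIN library l ON l.library_id = p.library_id
           WHERE l.project_id = ? AND t.file_type = ?
           ORDER BY t.path""",
        (project_id, file_type),
    ).fetchall()
    return [row[0] for row in rows]
```

With a catalog like this, a user could ask for all files of one type in a project without downloading and unpacking the project, and an entry with an impossible lane number or a missing plate would be rejected at insert time.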
26 Data Organization Requirements (cont.)
3. Robust massive file movement
-- from daily generated files into NERSC's HPSS
-- ensure correctness in spite of system, network, and HPSS transient failures
-- automated reporting of errors / failures
-- possible use of HRM technology
4. Managing annotations of genomic data
-- need to support history of annotation, perhaps by version hierarchy
-- need for a controlled vocabulary (an ontology) for searching the annotations
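The file movement requirement (item 3) can be illustrated with a retry-and-verify loop. This is only a sketch under stated assumptions: a local `shutil.copyfile` stands in for the real transfer into HPSS, SHA-256 is an arbitrary choice of checksum, and the names `move_with_verification`, `copy`, and `report` are hypothetical. It does not show HRM or any HPSS interface.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large trace archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_with_verification(src, dst, copy=shutil.copyfile,
                           max_attempts=3, delay_s=0.0, report=print):
    """Copy src to dst, retrying on transient errors and checksum mismatches.

    `copy` is a stand-in for the real transfer mechanism (an assumption);
    `report` receives one message per failed attempt. Returns the number of
    the attempt that succeeded, or raises RuntimeError after max_attempts.
    """
    src, dst = Path(src), Path(dst)
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(src, dst)
            if sha256_of(dst) == expected:
                return attempt
            report(f"attempt {attempt}: checksum mismatch for {src.name}")
        except OSError as exc:
            report(f"attempt {attempt}: {exc}")
        if attempt < max_attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"giving up on {src} after {max_attempts} attempts")
```

Passing the transfer function and the reporting function as parameters keeps the verification logic independent of the transport, so the same loop could wrap a different mover and feed an automated error report instead of printing.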
27 Future Goals
- 1. Hold more partial and raw data online
- 2. Enhance searching these data using annotated databases.
- Enhance current iterative processing of the data by moving some of this processing close to the data.
  - For example, some programs could run on the web server with access to a local file system of data for matches and selections of data.
NERSC to become the repository of DOE genomic data, focusing on microbial and environmental genomics
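As an illustration of moving processing close to the data, the sketch below shows the kind of selection a server-side program could run against a local directory of FASTA files, returning only the matching records instead of shipping whole projects to the user. The directory layout, the `*.fasta` naming, and the function names are assumptions made for the example.

```python
import re
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, parts = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(parts)
                header, parts = line[1:], []
            elif line:
                parts.append(line.upper())
    if header is not None:
        yield header, "".join(parts)


def select_matches(data_dir, pattern):
    """Return (file name, record header, match offset) for each matching record.

    `pattern` is a regular expression over the sequence; only the first
    match per record is reported.
    """
    motif = re.compile(pattern)
    hits = []
    for path in sorted(Path(data_dir).glob("*.fasta")):
        for header, sequence in read_fasta(path):
            match = motif.search(sequence)
            if match:
                hits.append((path.name, header, match.start()))
    return hits
```

A call such as `select_matches("/data/project_x", "TTGACA")` (a hypothetical path) would scan the files where they are stored and send back a short list of hits rather than millions of files.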