Optimizing Genomic Data Storage for Wide Accessibility - PowerPoint PPT Presentation

About This Presentation
Title:

Optimizing Genomic Data Storage for Wide Accessibility

Description:

Life after the Human Genome Project. NERSC Storage Systems. Data Management. Future Directions ... Single project - Human Genome (ch 5,16,& 19) ... – PowerPoint PPT presentation

Slides: 28
Provided by: alic145

Transcript and Presenter's Notes

Title: Optimizing Genomic Data Storage for Wide Accessibility


1
Optimizing Genomic Data Storage for Wide Accessibility
Joint Genome Institute (JGI), NERSC Center, Computational Research Division (LBNL)
2
Collaborators
  • Nancy Meyer, NERSC - HPSS
  • Harvard Holmes, NERSC - HPSS
  • Jonathan Carter, NERSC - User Services
  • Horst Simon, NERSC Center Director
  • Susan Lucas, JGI-PGF - Head, Production Sequencing
  • Arthur Kobayashi, JGI-PGF - Production Informatics
  • Eddy Rubin, JGI Director
  • Arie Shoshani, LBNL Computational Research Division
Millions of Microbes Everywhere
3
  • General Goals
  • Genomic Data
  • Life after the Human Genome Project
  • NERSC Storage Systems
  • Data Management
  • Future Directions

4
General Goals
  • Distribute, archive, and enhance access to the
    data generated at DOE's Joint Genome
    Institute (JGI) Production Genomic Facility (PGF)
  • Serve as a resource for community access to these
    data.
  • Establish a long term collaboration between the
    JGI and the NERSC Center.
  • High Performance Storage System (HPSS)

5
Environmental Genomics: Carbon Cycle
6
Environmental Genomics
  • < 1% of microbes are culturable
  • Many unculturables live in interdependent
    consortia of considerable diversity
  • Aim to recover genome-scale sequences and reveal
    metabolic capabilities
  • How can we understand the action of microbes at
    the molecular level?
  • What is the structure of natural microbial
    populations? What is a microbial species?

7
Future environmental targets for JGI
Whole metagenome shotgun sequencing and targeted
fosmid-based methods can be used to recover
useful draft genomes
  • Newman and Banfield, Science 2002

8
JGI Microbial Program
  • JGI microbial sequencing targets a broad range of
    bacteria and archaea with relevance to:
  • Bioremediation
  • Carbon Sequestration
  • Global Climate Change
  • Biodiversity
  • Biomass Conversion
  • Energy Production
  • Disease

9
Plants, Animals, Fungi
EUCARYA
BACTERIA
ARCHAEA
10
JGI Microbial Program
  • Lactic acid bacteria: Lactobacillus gasseri (Klaenhammer), Oenococcus oeni (Mills)
  • Complex polysaccharide degradation: Clostridium thermocellum (Wu), Microbulbifer degradans (Weiner) (complements white rot fungus sequence)
  • Phototrophic bacteria: Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)
  • Toxic waste degradation and microbial ecology: Desulfuromonas acetoxidans (Lovley), Desulfovibrio desulfuricans
  • Microbes in extreme environments: Psychrobacter (Thomashow), Methanococcoides burtonii (Sowers, Cavicchioli)
  • Infectious diseases of plants and animals: Ehrlichia chaffeensis (Yu), Pseudomonas syringae (Lindow)
Anaerobic methane oxidizing consortium, a "ball of bugs" (DeLong, Monterey Bay): one (or two?!) reverse methanogenic archaea in the core plus a sulfur reducing bacterium on the surface
11
JGI - Then & Now
  • Then
  • Single project - Human Genome (chromosomes 5, 16, & 19)
  • All data sent to NCBI/GenBank for storage and
    distribution
  • Minimum local responsibility for data stewardship
  • Relatively low production sequencing rate
  • Now
  • Dozens of whole genome projects (2 million to
    more than a billion bases, each)
  • Multiple species (microbial to vertebrates)
  • Complex environmental genomic communities
  • Full responsibility for data storage and
    distribution
  • Limited storage capacity
  • Production sequencing rate is increasing

12
JGI Monthly Production
[Charts: monthly production in millions of bases, shown as a 5-year history and over 12 months]
13
This is Not Raw Data
  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT
14
Neither is This
15
These are the Raw Data
16
Genome Sequencing
  • Start with genomic DNA
  • Make sheared fragments
  • Sequence both ends of fragments
  • Reconstruct genome computationally
  • High-throughput computational analysis
  • Provide genome and tools to community
17
Paired Plasmid Sequencing
18
JGI Data Production
  • Millions of files per month of raw trace data
  • 100 assembled projects per month (50MB-250MB each) and
    several large assembled projects per year
  • More data are being generated than ever before
  • Currently trace data are maintained online only
    while projects are in process.
  • Whole completed projects are available to
    download. They are large and contain millions of
    files.

19
JGI Raw Data Organization
Project: Series of Libraries that define a genome
Library: Series of Plates
Plate: 384 Clones
Clone: 2 Lanes
1 Lane: 1MB each, distributed into 4 files
  • 1 FASTA file: 1KB
  • 1 scf file: 50KB
  • 1 abd file: 250KB
  • 1 rsd/ab1 file: 650KB
In May-03, PGF ran 2.5 million successful lanes: 2.5TB/month, 10 million files (0.75TB/month (9 TB/year) non-trace files).
This does not include any assembly, database or metadata!
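The monthly totals on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 10^9 KB) and assumes that "non-trace files" means every per-lane file except the rsd/ab1 file, a reading chosen because it reproduces the 0.75TB/month figure. The names `FILE_SIZES_KB` and `monthly_volume_tb` are made up for this example.

```python
# Per-lane file sizes in KB, as listed on the slide.
FILE_SIZES_KB = {"fasta": 1, "scf": 50, "abd": 250, "rsd/ab1": 650}
LANES_PER_MONTH = 2_500_000   # successful lanes reported for May 2003
KB_PER_TB = 1_000_000_000     # decimal units (assumption)


def monthly_volume_tb(kinds):
    """Return TB per month for the given file kinds across all lanes."""
    per_lane_kb = sum(FILE_SIZES_KB[kind] for kind in kinds)
    return per_lane_kb * LANES_PER_MONTH / KB_PER_TB


# All four files per lane: 951 KB, which the slide rounds to 1MB per lane.
all_files_tb = monthly_volume_tb(FILE_SIZES_KB)

# Assumed reading of "non-trace": everything except the rsd/ab1 file.
non_trace_tb = monthly_volume_tb(["fasta", "scf", "abd"])
non_trace_tb_per_year = non_trace_tb * 12

# Four files per lane gives the file count per month.
files_per_month = len(FILE_SIZES_KB) * LANES_PER_MONTH
```

Using the exact per-lane sizes gives slightly under 2.4 TB per month for all files; the slide's 2.5TB/month follows from rounding each lane up to 1MB.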
20
Current Access to JGI Data
  • Access to these data is in demand by scientific
    fields that were not anticipated by the Human
    Genome Project
  • Microbiologists
  • Environmental Scientists
  • Evolutionary Scientists
  • GtL projects
  • The computational sophistication of the user
    community is uneven, at best. Not everyone will
    want the same kind of files.
  • GenBank is not capable of serving all of
    JGI's needs.

21
Current Access to JGI Data (cont.)
  • The data are processed by researchers using
    iterative and pattern matching techniques often
    requiring access to data that spans several
    projects and genomes. This is different from the
    Human Project.
  • Currently, this requires downloads of projects
    and then unpacking the project files to access
    the data. Millions of files to unpack and slow
    transfer of whole project files.
  • At best, the raw data used to generate the
    sequences in a project are very difficult to
    retrieve and interrogate.

22
NERSC Storage Systems
  • DOE's largest unclassified storage systems, with a
    current archival capacity of 8 PB
  • Robust and available 24x7 with high reliability
    and excellent network connectivity
  • Very configurable and currently provides good
    service for both large streaming data and
    concurrent direct access.
  • Experienced and innovative staff are adding new
    capabilities and distributing storage as the
    NERSC Center data requirements change over time.

23
Distribute and Enhance Access
  • 1. Initially, we plan to hold all the sequence
    data online or near-line.
  • We will prototype and select the best way to
    do this
  • distributed file systems
  • local file systems
  • cached web servers
  • tools.
  • 2. Collaborate with JGI to organize and cluster
    the sequence data so they can be retrieved in
    meaningful pieces.

24
Distribute and Enhance Access (cont.)
  • 3. Distribute the data between JGI and NERSC/HPSS
  • Develop tools and methodologies to move the data
    between JGI and NERSC/HPSS for timely access to
    sequence data as they are being generated.
  • Incorporate this into regular site backups
  • 4. Build a web interface to the data providing a
    consistent view of the data (allowing the data to
    be distributed underneath) with a link to the
    data at JGI for ease of access.

25
Data Organization Requirements
1. Metadata for the files being collected
   -- schema definition development
   -- the database system to support the metadata
   -- query interfaces to query the metadata
   -- possible rapid prototyping using the OPM tools
2. Data entry tools for the metadata
   -- procedure to enforce metadata entry
   -- checks on the correctness of the metadata entered
None of this was contemplated in the Human Project
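One way to picture the metadata requirements above is a small relational catalog whose constraints act as the entry checks. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the presentation does not say which database system or schema was used, and every table, column, and function name here is hypothetical.

```python
import sqlite3

# Hypothetical schema: projects contain libraries, libraries contain files.
SCHEMA = """
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE library (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    name TEXT NOT NULL
);
CREATE TABLE lane_file (
    id INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(id),
    path TEXT NOT NULL UNIQUE,
    kind TEXT NOT NULL CHECK (kind IN ('fasta', 'scf', 'abd', 'ab1')),
    size_bytes INTEGER NOT NULL CHECK (size_bytes > 0)
);
"""


def open_catalog(db_path=":memory:"):
    """Create the catalog and turn on foreign-key enforcement."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def register_file(conn, library_id, path, kind, size_bytes):
    """Insert one file record; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO lane_file (library_id, path, kind, size_bytes) "
                "VALUES (?, ?, ?, ?)",
                (library_id, path, kind, size_bytes),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def files_for_project(conn, project_name):
    """Query interface: list file paths belonging to one project."""
    rows = conn.execute(
        "SELECT f.path FROM lane_file f "
        "JOIN library l ON f.library_id = l.id "
        "JOIN project p ON l.project_id = p.id "
        "WHERE p.name = ? ORDER BY f.path",
        (project_name,),
    )
    return [row[0] for row in rows]
```

Putting the checks in the schema means a record with an unknown file kind, a missing parent library, or a duplicate path is refused at entry time rather than discovered later.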
26
Data Organization Requirements (cont.)
3. Robust massive file movement
   -- from daily generated files into NERSC's HPSS
   -- ensure correctness in spite of system, network, and HPSS transient failures
   -- automated reporting of errors / failures
   -- possible use of HRM technology
4. Managing annotations of genomic data
   -- need to support history of annotation, perhaps by version hierarchy
   -- need for a controlled vocabulary (an ontology) for searching the annotations
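The file movement requirement in item 3 (correctness despite transient failures, with errors reported) can be sketched as a copy, verify, and retry loop. This is a minimal illustration, not the HPSS or HRM interface: a local file copy stands in for the real transfer, and `transfer_with_retry`, `sha256_of`, and the `copy` parameter are names invented for this example.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_retry(src, dst, copy=shutil.copyfile, attempts=3, delay_s=0.0):
    """Copy src to dst, verify by checksum, and retry on transient errors.

    `copy` is a stand-in for the real transfer call (assumption).
    Returns (ok, errors), with one message per failed attempt.
    """
    src, dst = Path(src), Path(dst)
    expected = sha256_of(src)
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            copy(src, dst)
            if sha256_of(dst) == expected:
                return True, errors
            errors.append(f"attempt {attempt}: checksum mismatch")
        except OSError as exc:
            errors.append(f"attempt {attempt}: {exc}")
        time.sleep(delay_s)  # pause before the next attempt
    return False, errors
```

The returned error list is what an automated report would be built from; a production version would also need to bound total time and handle partially written destination files.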
27
Future Goals
  • 1. Hold more partial and raw data online
  • 2. Enhance searching these data using annotated
    databases.
  • 3. Enhance current iterative processing of the data
    by moving some of this processing close to the
    data.
  • For example, some programs could run on the web
    server with access to a local file system of data
    for matches and selections of data.

NERSC to become the repository of DOE genomic
data focusing on microbial and environmental
genomics