Optimizing Genomic Data Storage for Wide Accessibility presentation

Title: Optimizing Genomic Data Storage for Wide Accessibility

1
Optimizing Genomic Data Storage for Wide Accessibility
Joint Genome Institute (JGI)
NERSC Center
Computational Research Division (LBNL)
2
Collaborators
Nancy Meyer, NERSC - HPSS
Harvard Holmes, NERSC - HPSS
Jonathan Carter, NERSC - User Services
Horst Simon, NERSC Center Director
Susan Lucas, JGI-PGF - Head, Production Sequencing
Arthur Kobayashi, JGI-PGF - Production Informatics
Eddy Rubin, JGI Director
Arie Shoshani, LBNL Computational Research Division
Millions of Microbes Everywhere
3

General Goals
Genomic Data
Life after the Human Genome Project
NERSC Storage Systems
Data Management
Future Directions

4
General Goals

Distribute, archive, and enhance access to the
data generated at DOE's Joint Genome Institute
(JGI) Production Genomic Facility (PGF)
Serve as a resource for community access to these
data.
Establish a long-term collaboration between the
JGI and the NERSC Center.
High Performance Storage System (HPSS)

5
Environmental Genomics: Carbon Cycle
6
Environmental Genomics

< 1% of microbes are culturable
Many unculturables live in interdependent
consortia of considerable diversity
Aim to recover genome-scale sequences and reveal
metabolic capabilities
How can we understand the action of microbes at
the molecular level?
What is the structure of natural microbial
populations? What is a microbial species?

7
Future environmental targets for JGI
Whole metagenome shotgun sequencing and targeted
fosmid-based methods can be used to recover
useful draft genomes

Newman and Banfield, Science 2002

8
JGI Microbial Program

JGI microbial sequencing targets a broad range of
bacteria and archaea with relevance to
Bioremediation
Carbon Sequestration
Global Climate Change
Biodiversity
Biomass Conversion
Energy Production
Disease

9
Plants, Animals, Fungi
EUCARYA
BACTERIA
ARCHAEA
10
JGI Microbial Program
Lactic acid bacteria: Lactobacillus gasseri (Klaenhammer), Oenococcus oeni (Mills)
Complex polysaccharide degradation: Clostridium thermocellum (Wu), Microbulbifer degradans (Weiner) (complements white rot fungus sequence)
Phototrophic bacteria: Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)
Toxic waste degradation and microbial ecology: Desulfuromonas acetoxidans (Lovley), Desulfovibrio desulfuricans
Microbes in extreme environments: Psychrobacter (Thomashow), Methanococcoides burtonii (Sowers, Cavicchioli)
Infectious diseases of plants and animals: Ehrlichia chaffeensis (Yu), Pseudomonas syringae (Lindow)
Anaerobic methane-oxidizing consortium, "ball of bugs" (DeLong, Monterey Bay): one (or two?!) reverse methanogenic archaea in core plus sulfur-reducing bacterium on surface
11
JGI - Then and Now

Then
Single project - Human Genome (chromosomes 5, 16, 19)
All data sent to NCBI/GenBank for storage and
distribution
Minimum local responsibility for data stewardship
Relatively low production sequencing rate
Now
Dozens of whole-genome projects (2 million to
more than a billion bases each)
Multiple species (microbial to vertebrates)
Complex environmental genomic communities
Full responsibility for data storage and
distribution
Limited storage capacity
Production sequencing rate is increasing

12
JGI Monthly Production
[Charts: production in millions of bases, shown as a 5-year history and over 12 months]
13
This is Not Raw Data
  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT
14
Neither is This
15
These are the Raw Data
16
Genome Sequencing
Start with genomic DNA
Make sheared fragments
Sequence both ends of fragments
High-throughput computational analysis
Reconstruct genome computationally
Provide genome and tools to community
17
Paired Plasmid Sequencing
18
JGI Data Production

Millions of files per month of raw trace data
100 assembled projects per month (50 MB-250 MB) and
several large assembled projects per year
More data are being generated than ever before
Currently trace data are maintained online only
while projects are in process.
Whole completed projects are available to
download. They are large and contain millions of
files.

19
JGI Raw Data Organization
Project: Series of Libraries that define a genome
Library: Series of Plates
Plate: 384 Clones
Clone: 2 Lanes
1 Lane: 1 MB each, distributed into 4 files
1 FASTA file: 1 KB
1 scf file: 50 KB
1 abd file: 250 KB
1 rsd/ab1 file: 650 KB
In May 2003, PGF ran 2.5 million successful lanes
2.5 TB/month, 10 million files
(0.75 TB/month (9 TB/year) non-trace files)
This does not include any assembly, database or
metadata!
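The monthly totals on this slide follow from the per-lane file sizes, and the short Python sketch below reproduces that arithmetic. It assumes decimal units (1 TB = 10^9 KB), and it assumes that "non-trace files" means every per-lane file except the rsd/ab1 file, a reading chosen only because it reproduces the 0.75 TB/month figure. The names are illustrative.

```python
# Per-lane file sizes in KB, as listed on the slide.
FILE_SIZES_KB = {"FASTA": 1, "scf": 50, "abd": 250, "rsd/ab1": 650}
LANES_PER_MONTH = 2_500_000   # successful lanes reported for May 2003
KB_PER_TB = 1_000_000_000     # assumption: decimal units


def monthly_volume(lanes=LANES_PER_MONTH, sizes_kb=FILE_SIZES_KB):
    """Return (file count, total TB, TB excluding the rsd/ab1 file)."""
    files = lanes * len(sizes_kb)
    total_tb = lanes * sum(sizes_kb.values()) / KB_PER_TB
    # Assumption: "non-trace" is read here as everything except rsd/ab1.
    other_kb = sum(kb for name, kb in sizes_kb.items() if name != "rsd/ab1")
    other_tb = lanes * other_kb / KB_PER_TB
    return files, total_tb, other_tb


files, total_tb, other_tb = monthly_volume()
print(f"{files:,} files/month")
print(f"{total_tb:.2f} TB/month in total")
print(f"{other_tb:.2f} TB/month ({other_tb * 12:.1f} TB/year) without rsd/ab1")
```

The four listed files add up to 951 KB per lane, so the computed total comes out slightly below the slide's 2.5 TB/month, which rounds each lane up to 1 MB.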
20
Current Access to JGI Data

Access to these data is in demand by scientific
fields that were not anticipated by the Human
Genome Project
Microbiologists
Environmental Scientists
Evolutionary Scientists
GtL projects
The computational sophistication of the user
community is uneven, at best. Not everyone will
want the same kind of files.
GenBank is not capable of serving all of the
JGI's needs.

21
Current Access to JGI Data (cont.)

Researchers process the data with iterative and
pattern-matching techniques that often require
access to data spanning several projects and
genomes. This is different from the Human Genome
Project.
Currently, this requires downloading whole
projects and then unpacking the project files to
access the data. That means millions of files to
unpack and slow transfers of whole project files.
At best, the raw data used to generate the
sequences in a project are very difficult to
retrieve and interrogate.

22
NERSC Storage Systems

DOE's largest unclassified storage systems, with a
current archival capacity of 8 PB
Robust and available 24x7 with high reliability
and excellent network connectivity
Highly configurable; currently provide good
service for both large streaming data and
concurrent direct access.
Experienced and innovative staff are adding new
capabilities and distributing storage as the
NERSC Center data requirements change over time.

23
Distribute and Enhance Access

1. Initially, we plan to hold all the sequence
data online or near-line.
We will prototype and select the best way to
do this
distributed file systems
local file systems
cached web servers
tools.
2. Collaborate with JGI to organize and cluster
the sequence data so they can be retrieved in
meaningful pieces.

24
Distribute and Enhance Access (cont.)

3. Distribute the data between JGI and NERSC/HPSS
Develop tools and methodologies to move the data
between JGI and NERSC/HPSS for timely access to
sequence data as they are being generated.
Incorporate this into regular site backups
4. Build a web interface to the data providing a
consistent view of the data (allowing the data to
be distributed underneath) with a link to the
data at JGI for ease of access.

25
Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the OPM tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project
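As a rough illustration of the metadata requirements above (a schema, a database behind it, a query interface, and enforced checks on entry), here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in and is not the OPM tools mentioned on the slide. The table name, column names, and allowed file types are hypothetical.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE lane_file (
    path      TEXT PRIMARY KEY,
    project   TEXT NOT NULL,
    library   TEXT NOT NULL,
    plate     TEXT NOT NULL,
    file_type TEXT NOT NULL
        CHECK (file_type IN ('fasta', 'scf', 'abd', 'rsd', 'ab1')),
    size_kb   INTEGER NOT NULL CHECK (size_kb > 0)
)
"""


def open_catalog():
    """Create an in-memory catalog with the constraints above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def register(conn, record):
    """Insert one metadata record; return False if it violates a constraint."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO lane_file VALUES "
                "(:path, :project, :library, :plate, :file_type, :size_kb)",
                record,
            )
        return True
    except sqlite3.IntegrityError:
        return False


def files_for_project(conn, project):
    """Query interface: list file paths belonging to one project."""
    rows = conn.execute(
        "SELECT path FROM lane_file WHERE project = ? ORDER BY path",
        (project,),
    )
    return [row[0] for row in rows]


conn = open_catalog()
good = {"path": "projA/lib1/plate7/c01.scf", "project": "projA",
        "library": "lib1", "plate": "plate7", "file_type": "scf",
        "size_kb": 50}
bad = dict(good, path="projA/lib1/plate7/c02.xyz", file_type="xyz")
print(register(conn, good), register(conn, bad))
print(files_for_project(conn, "projA"))
```

Putting the checks in the schema means that no entry tool can bypass them, which is one way to read "procedure to enforce metadata entry".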
26
Data Organization Requirements (cont.)
3. Robust massive file movement
-- from daily generated files into NERSC's HPSS
-- ensure correctness in spite of system, network, and HPSS transient failures
-- automated reporting of errors / failures
-- possible use of HRM technology
4. Managing annotations of genomic data
-- need to support history of annotation, perhaps by version hierarchy
-- need for a controlled vocabulary (an ontology) for searching the annotations
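The "robust massive file movement" requirement (correctness despite transient failures, with automated error reporting) can be sketched as a copy-verify-retry loop. This is a generic illustration, not HRM and not an HPSS interface: the `copy` argument is a hypothetical stand-in for the real transfer step, and a local file copy is used so the sketch runs anywhere.

```python
import hashlib
import logging
import shutil
import tempfile
import time
from pathlib import Path

log = logging.getLogger("transfer")


def sha256_of(path):
    """Checksum a file in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_verified(src, dst, copy=shutil.copyfile, retries=3, delay=0.0):
    """Copy src to dst, retrying until checksums match or retries run out.

    Returns the number of attempts used; raises RuntimeError on give-up.
    """
    expected = sha256_of(src)
    for attempt in range(1, retries + 1):
        try:
            copy(src, dst)
            if sha256_of(dst) == expected:
                return attempt
            log.warning("checksum mismatch for %s (attempt %d)", src, attempt)
        except OSError as err:
            log.warning("transfer of %s failed (attempt %d): %s",
                        src, attempt, err)
        time.sleep(delay)
    raise RuntimeError(f"giving up on {src} after {retries} attempts")


def make_flaky_copy():
    """Build a copy function that fails once, to simulate a transient error."""
    calls = {"n": 0}

    def flaky(src, dst):
        calls["n"] += 1
        if calls["n"] == 1:
            raise OSError("simulated transient failure")
        shutil.copyfile(src, dst)

    return flaky


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "lane.scf"
    dst = Path(tmp) / "archived.scf"
    src.write_bytes(b"trace data" * 1000)
    attempts = transfer_verified(src, dst, copy=make_flaky_copy())
    print("attempts needed:", attempts)
```

Logged warnings stand in for the automated error reporting; a production system would also need to record which files were confirmed, so that an interrupted run can resume.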
27
Future Goals

1. Hold more partial and raw data online
2. Enhance searching of these data using annotated
databases.
3. Enhance current iterative processing of the data
by moving some of this processing close to the
data.
For example, some programs could run on the web
server with access to a local file system of data
for matches and selections of data.

NERSC to become the repository of DOE genomic
data focusing on microbial and environmental
genomics
