Title: Parallel de novo Assembly of Large Genomes from Paired Short Reads
1Parallel de novo Assembly of Large Genomes from
Paired Short Reads
- Srinivas Aluru
- Electrical and Computer Engineering
- Iowa State University
- Computer Science and Engineering
- Indian Institute of Technology Bombay
2Sequencing Technology (1977-2005)
- Sanger Sequencing
- 500-1000 bp per run
- Up to 96 runs
3Sequencing long DNA
Multiple copies of the same source
Sequence Unordered genome fragments
Randomly fragment the copies
4Next-gen Sequencing
454
- 36 gt 50 gt 75 bp
- 100 200 million reads
- 400-500 bp
- 1.25 million reads
- Polonator
- 26 bp
- 200 400 million reads
Illumina Genome Analyzer
5High-throughput sequencing
- 10-fold increase per year
- Throughput of Solid sequencer doubling every
quarter recently - In 2008, Illumina customers sequenced 10 X
GenBank size in 2005 (Science 2009) - 15 million for 5-7X of Sanger vs. 48K for 30X
of Illumina for human genome (June 2009)
6The Resulting Data Deluge
- Broad this year
- 50 Gbases in 12 days
- 4GB/instrument/day
- 98 instruments
- 400 Gbases/day
- 4 TB/day raw data
- 1.5 Petabytes/year
- Broad next year
- 200 Gbases in 8 days
- 25GB/instrument/day
- 100 instruments
- 2.5 Tbases/day
- 28 TB/day raw data
- 10 Petabytes/year
CERN 15 PB/year
7Sanger De novo genome assembly
Overlap-Layout-Consensus
pairs O(n2l2) run-time
O(n) pairs O(nl2) run-time
8Genome Sequencing - Examples
- Human 1 B (Venter et al. 2001)
- 10,000 CPU hours for overlaps
- 10,000 CPU hours for the rest
- Parallel assembler (3 months
- ? 1 day Kalyanaraman Aluru
- JPDC 2007)
- B73 maize genome 30 million
- (WSU, UAZ, ISU, CSHL
- Science Nov. 2009)
9Next-Gen Graph based assembly
ACGTAAC
- Build de Bruijn graph from reads.
- Each read is a path
- Genome is a super path
- Use paired read constraints as
- guide
CGTA
CGTA
CGTA
GTAA
ACGT
ACGT
ACGT
GTAA
GTAA
GTAA
TAAC
TAAC
TAAC
AACG
TAAC
10Rapidly Growing Area
- ALLPATHS (Butler et al. 2008)
- EULER-SR (Chaisson et al. 2008)
- SHARCGS (Dohm et al. 2008)
- Velvet (Zerbino and Birney 2008)
- Bidirected graphs (Medvedev Brudno 2008)
- YAGA (Jackson, Schnable and Aluru 2009)
- ABySS (Simpson et al. 2009)
- Key Challenge
- memory-intensive graph models for high quality
- high-coverage to compensate for short reads
11Generating Paired Reads
12Multiple Fragment Types
13Error Correction
K-spectrum approach (Chaisson et al.,
Bioinformatics 2008) SHREC (Shroder et al.,
Bioinformatics 2009) Reptile (Yang and Aluru,
2010)
14Importance of Error Correction
Bartonella Henselae 330X coverage, 36bp
Illumina reads
Avg Contig Max Contig N50 N75 Contigs
Before 4474 29042 8260 3144 374
After 7094 79281 17397 4675 231
15Bidirected de Bruijn Graph
-
TCGCCGACTACT
-
-
-
-
16Bidirected String Graph
- Edges labeled with two strings
T/A
C/T
G/G
C/A
17Parallel Graph Representation
- Distributed tuple List for edges
- ltu, v, du, dv, Cu, Cvgt
- ltv, u, dv, du, Cv, Cugt
- 2k bits to represent node IDs
- Sorting tuples
- by node label edges incident to a node in local
memory - by (smaller node label, larger node label) both
tuples corresponding to an edge in local memory
18Parallel Graph Construction
- Reads are partitioned to processors based on
total size - Each processor generates (k1)-molecules (edges)
from its short reads - Sort to eliminate duplicates and keep frequency
count
19Graph Compaction
- Compact chains into single edges
- Reduction to familiar list ranking
- List ranking
- Distributed linked list with adjacency
information - Find distance from each node to end of list
- Extend
- Multiple linked lists
- Identify nodes with multiple edges
- Undirected
20Graph Compaction
- Edges in the graph are nodes for list ranking
- Sort by edges to assign unique edge labels
- Sort by nodes to bring edges incident to a node
together - Identify nodes on a chain
- Mark adjacent edges on chains
- Perform undirected list ranking
21Transforming Edge Compaction to List Ranking
The transformation can be accomplished using a
constant number of parallel sorts.
22Errors Detection and Removal
23Repeats and the String Graph
Repeat 1
Repeat 2
While a path through the graph corresponds to
the genome, it is not obvious which path it is
Paired reads provide approximate distance
constraints that can be used to choose between
alternate traversals
24Measuring path support
- Given a path in the graph
- Each (k1)-molecule in a read maps to some
position in the graph - Can check to see if the observed distance in the
path matches distance constraints
25Clustering Path Support
- Important idea can specify the amount of
expected support for two edges in a path - For each fragment type, much of some range should
be supported - Can cluster observed support into support ranges
and check to see if these observed ranges match
expected ranges - Use single linkage clustering to combine multiple
ranges into a single unit - Key Idea Clustering remains invariant in the
face of intervening path changes - can be precalculated prior to graph traversal
- These partial clusters can be converted to true
clusters during path walking
26Creating a path extension lock
For each edge in the path, we create an expected
support lock given a candidate edge extension
27Finding a Key (cluster) that Fits
- Convert precalculated partial clusters to
clusters specific to this path - Check calculated clusters against the lock
- Fits if
- Lock is a subrange of key
- Key fits tightly against one side of the lock
Fit
No Fit
28Parallel Short Sequence Assembler
- YAGA (Yet Another Genome Assembler)
- 12,000 line C implementation
- A parallel support library built on MPI
- A runtime library that allows alternate
implementation of underlying functionality (eases
debugging) - Computational methods as permutations and
transformations on arrays of tuples.
Assembler
Computation Primitives
Runtime Library
Parallel
Serial
MPI
29Experimental Results
- Synthetic data from previously assembled genomes
- Protocol I 900100, 4300600
- Protocol II - 33030, 66060, 1100100, 2200200,
4400400 - 30-50bp reads, 1 error rate, and 300X coverage
30Experimental Results
Organism Length Max N50 N75 N90 Mis Num Percent
C. pneumoniae 1.0 867 867 867 132 0.0 2 99.9
S. pneumoniae 2.1 321 137 92 77 0.0 19 95.5
E. coli 5.4 378 231 104 42 0.5 42 94.0
S. cerevisiae 12.2 290 107 75 25 0.8 148 94.1
D. melanogaster 120.3 855 102 43 12 1.5 1,687 91.2
Organism Length Max N50 N75 N90 Mis Num Percent
E. coli 5.4 224 85 43 10 0.2 91 90.0
S. cerevisiae 12.2 225 71 34 11 0.8 225 90.1
NX score is the maximum length L such that at
least X percent of the genome is covered by
assembled contigs of length L
31Performance Results (Blue Gene/L)
- D. Melanogaster
- 900 million reads and 512 processors
- 50 mins. for data transfer
- 100 mins. for the parallel phases
- 20 mins. for remaining
32Conclusions and Future Work
- HPC needed to scale graph methods to large
genomes - Method applicable to heterogeneous reads
- Can we do novo assemble large mammalian and plant
genomes? - Error correction for repeat-rich genomes?
- Parallel path walking?
- Improve assembly using multiple parameters?
- Scaffolding using graph and (k1) pair clusters?
33Acknowledgements