Parallel de novo Assembly of Large Genomes from Paired Short Reads presentation

About This Presentation

Transcript and Presenter's Notes

Title: Parallel de novo Assembly of Large Genomes from Paired Short Reads

1
Parallel de novo Assembly of Large Genomes from
Paired Short Reads

Srinivas Aluru
Electrical and Computer Engineering
Iowa State University
Computer Science and Engineering
Indian Institute of Technology Bombay

2
Sequencing Technology (1977-2005)

Sanger Sequencing
500-1000 bp per run
Up to 96 runs

ABI 3700

3
Sequencing long DNA
Multiple copies of the same source
Sequence Unordered genome fragments
Randomly fragment the copies
4
Next-gen Sequencing
454

36 gt 50 gt 75 bp
100 200 million reads

400-500 bp
1.25 million reads
Polonator
26 bp
200 400 million reads

Illumina Genome Analyzer

5
High-throughput sequencing

10-fold increase per year
Throughput of Solid sequencer doubling every
quarter recently
In 2008, Illumina customers sequenced 10 X
GenBank size in 2005 (Science 2009)
15 million for 5-7X of Sanger vs. 48K for 30X
of Illumina for human genome (June 2009)

6
The Resulting Data Deluge

Broad this year
50 Gbases in 12 days
4GB/instrument/day
98 instruments
400 Gbases/day
4 TB/day raw data
1.5 Petabytes/year

Broad next year
200 Gbases in 8 days
25GB/instrument/day
100 instruments
2.5 Tbases/day
28 TB/day raw data
10 Petabytes/year

CERN 15 PB/year
7
Sanger De novo genome assembly
Overlap-Layout-Consensus
pairs O(n2l2) run-time
O(n) pairs O(nl2) run-time
8
Genome Sequencing - Examples

Human 1 B (Venter et al. 2001)
10,000 CPU hours for overlaps
10,000 CPU hours for the rest
Parallel assembler (3 months
? 1 day Kalyanaraman Aluru
JPDC 2007)
B73 maize genome 30 million
(WSU, UAZ, ISU, CSHL
Science Nov. 2009)

9
Next-Gen Graph based assembly
ACGTAAC

Build de Bruijn graph from reads.
Each read is a path
Genome is a super path
Use paired read constraints as
guide

CGTA
CGTA
CGTA
GTAA
ACGT
ACGT
ACGT
GTAA
GTAA
GTAA

TACG

TAAC
TAAC
TAAC
AACG
TAAC
10
Rapidly Growing Area

ALLPATHS (Butler et al. 2008)
EULER-SR (Chaisson et al. 2008)
SHARCGS (Dohm et al. 2008)
Velvet (Zerbino and Birney 2008)
Bidirected graphs (Medvedev Brudno 2008)
YAGA (Jackson, Schnable and Aluru 2009)
ABySS (Simpson et al. 2009)
Key Challenge
memory-intensive graph models for high quality
high-coverage to compensate for short reads

11
Generating Paired Reads
12
Multiple Fragment Types
13
Error Correction
K-spectrum approach (Chaisson et al.,
Bioinformatics 2008) SHREC (Shroder et al.,
Bioinformatics 2009) Reptile (Yang and Aluru,
2010)
14
Importance of Error Correction
Bartonella Henselae 330X coverage, 36bp
Illumina reads
Avg Contig Max Contig N50 N75 Contigs
Before 4474 29042 8260 3144 374
After 7094 79281 17397 4675 231
15
Bidirected de Bruijn Graph

-
TCGCCGACTACT
-

-

-
-
16
Bidirected String Graph

Edges labeled with two strings

T/A
C/T
G/G
C/A
17
Parallel Graph Representation

Distributed tuple List for edges
ltu, v, du, dv, Cu, Cvgt
ltv, u, dv, du, Cv, Cugt
2k bits to represent node IDs
Sorting tuples
by node label edges incident to a node in local
memory
by (smaller node label, larger node label) both
tuples corresponding to an edge in local memory

18
Parallel Graph Construction

Reads are partitioned to processors based on
total size
Each processor generates (k1)-molecules (edges)
from its short reads
Sort to eliminate duplicates and keep frequency
count

19
Graph Compaction

Compact chains into single edges
Reduction to familiar list ranking
List ranking
Distributed linked list with adjacency
information
Find distance from each node to end of list
Extend
Multiple linked lists
Identify nodes with multiple edges
Undirected

20
Graph Compaction

Edges in the graph are nodes for list ranking
Sort by edges to assign unique edge labels
Sort by nodes to bring edges incident to a node
together
Identify nodes on a chain
Mark adjacent edges on chains
Perform undirected list ranking

21
Transforming Edge Compaction to List Ranking
The transformation can be accomplished using a
constant number of parallel sorts.
22
Errors Detection and Removal
23
Repeats and the String Graph
Repeat 1
Repeat 2
While a path through the graph corresponds to
the genome, it is not obvious which path it is
Paired reads provide approximate distance
constraints that can be used to choose between
alternate traversals
24
Measuring path support

Given a path in the graph
Each (k1)-molecule in a read maps to some
position in the graph
Can check to see if the observed distance in the
path matches distance constraints

25
Clustering Path Support

Important idea can specify the amount of
expected support for two edges in a path
For each fragment type, much of some range should
be supported
Can cluster observed support into support ranges
and check to see if these observed ranges match
expected ranges
Use single linkage clustering to combine multiple
ranges into a single unit
Key Idea Clustering remains invariant in the
face of intervening path changes
can be precalculated prior to graph traversal
These partial clusters can be converted to true
clusters during path walking

26
Creating a path extension lock
For each edge in the path, we create an expected
support lock given a candidate edge extension
27
Finding a Key (cluster) that Fits

Convert precalculated partial clusters to
clusters specific to this path
Check calculated clusters against the lock
Fits if
Lock is a subrange of key
Key fits tightly against one side of the lock

Fit
No Fit
28
Parallel Short Sequence Assembler

YAGA (Yet Another Genome Assembler)
12,000 line C implementation
A parallel support library built on MPI
A runtime library that allows alternate
implementation of underlying functionality (eases
debugging)
Computational methods as permutations and
transformations on arrays of tuples.

Assembler
Computation Primitives
Runtime Library
Parallel
Serial
MPI
29
Experimental Results

Synthetic data from previously assembled genomes
Protocol I 900100, 4300600
Protocol II - 33030, 66060, 1100100, 2200200,
4400400
30-50bp reads, 1 error rate, and 300X coverage

30
Experimental Results
Organism Length Max N50 N75 N90 Mis Num Percent
C. pneumoniae 1.0 867 867 867 132 0.0 2 99.9
S. pneumoniae 2.1 321 137 92 77 0.0 19 95.5
E. coli 5.4 378 231 104 42 0.5 42 94.0
S. cerevisiae 12.2 290 107 75 25 0.8 148 94.1
D. melanogaster 120.3 855 102 43 12 1.5 1,687 91.2
Organism Length Max N50 N75 N90 Mis Num Percent
E. coli 5.4 224 85 43 10 0.2 91 90.0
S. cerevisiae 12.2 225 71 34 11 0.8 225 90.1
NX score is the maximum length L such that at
least X percent of the genome is covered by
assembled contigs of length L
31
Performance Results (Blue Gene/L)

D. Melanogaster
900 million reads and 512 processors
50 mins. for data transfer
100 mins. for the parallel phases
20 mins. for remaining

32
Conclusions and Future Work

HPC needed to scale graph methods to large
genomes
Method applicable to heterogeneous reads
Can we do novo assemble large mammalian and plant
genomes?
Error correction for repeat-rich genomes?
Parallel path walking?
Improve assembly using multiple parameters?
Scaffolding using graph and (k1) pair clusters?

33
Acknowledgements

Write a Comment

User Comments (0)

About PowerShow.com

Parallel de novo Assembly of Large Genomes from Paired Short Reads PowerPoint PPT Presentation