Title: Design 1
1Celera Assembler
Arthur L. Delcher Senior Research
Scientist CBCB University of Maryland
2Whole Genome Shotgun Sequencing
WGS Sequencing WGS Assembly Performance
3Mate-Pair Shotgun DNA Sequencing
DNA target sample
SHEAR & SIZE
e.g., 10Kbp ± 8% std.dev.
CLONE & END SEQUENCE (automated)
End Reads / Mate Pairs
550bp
10,000bp
4Shotgun DNA Sequencing (Technology)
5Whole Genome Shotgun Sequencing
- Collect 10X sequence in a 1-to-1 ratio of two
types of read pairs: short (2Kbp) and long (10Kbp).
35 million reads for Human.
- Collect another 20X in clone coverage of 50Kbp
end sequence pairs. 1.2 million pairs for Human.
- Early simulations showed that if repeats were
considered black boxes, one could still cover
99.7% of the genome unambiguously.
BAC 3
BAC 5
6Clone-by-Clone Genome Sequencing
7Sequencing Factory
8Celera's Sequencing Factory (circa 2001)
- 300 ABI 3700 DNA Sequencers
- 50 Production Staff
- 20,000 sq. ft. of wet lab
- 20,000 sq. ft. of sequencing space
- 800 tons of A/C (160,000 cfm)
- $1 million / year for electrical service
- $10 million / month for reagents
9Human Data (April 2000)
- Collected 27.27 million reads = 5.11X coverage
- 21.04 million are paired (77%) = 10.52 million
pairs:
- 2Kbp: 5.045M pairs, 98.6% true, <6% std.dev.
- 10Kbp: 4.401M pairs, 98.6% true, <8% std.dev.
- 50Kbp: 1.071M pairs, 90.0% true, <15% std.dev.
- (validated against finished Chrom. 21 sequence)
- The clones cover the genome 38.7X
- Data is from 5 individuals (one at roughly 3X,
4 others at 0.5X)
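The clone coverage figure can be checked with simple arithmetic: sum the bases spanned by all clones and divide by the genome length. The sketch below uses the pair counts and insert sizes from this slide. The function names and the genome length passed in are assumptions for illustration, not values stated on the slide.

```python
# Pair counts and insert sizes are from the slide; genome_size is an assumed input.
LIBRARIES = [
    # (insert size in bp, number of pairs)
    (2_000, 5.045e6),
    (10_000, 4.401e6),
    (50_000, 1.071e6),
]


def spanned_bases(libraries):
    """Total number of bases spanned by all clones (insert size x pair count)."""
    return sum(size * pairs for size, pairs in libraries)


def clone_coverage(libraries, genome_size):
    """Clone coverage: bases spanned by clones divided by the genome length."""
    return spanned_bases(libraries) / genome_size


def implied_genome_size(libraries, coverage):
    """Genome length that would yield the given clone coverage."""
    return spanned_bases(libraries) / coverage
```

With an assumed genome length of 3 x 10^9 bp this gives about 35.9X. The 38.7X on the slide corresponds to an effective genome length of about 2.78 Gbp under this calculation.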
10Pairs Give Order & Orientation
Assembly without pairs results in contigs whose
order and orientation are not known.
Pairs, especially groups of corroborating ones,
link the contigs into scaffolds where the size of
gaps is well characterized.
Mean & Std.Dev. is known
[Figure labels: Contig, Consensus (15-30Kbp), Reads, 2-pair, Scaffold]
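One way to see why corroborating pairs characterize a gap well: each pair gives its own gap estimate (library mean minus the bases the insert covers inside the two contigs), and several estimates can be combined. The sketch below assumes independent, roughly Gaussian insert sizes and uses inverse-variance weighting. The function name and the input tuple layout are hypothetical, and this is not claimed to be the scaffolder's actual computation.

```python
import math


def estimate_gap(links):
    """Combine mate-pair links spanning one gap into a single estimate.

    links: list of (insert_mean, insert_std, bases_in_left, bases_in_right),
    where bases_in_left / bases_in_right are the parts of the insert that lie
    inside the left and right contig. Returns (gap_mean, gap_std).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for mean, std, in_left, in_right in links:
        gap = mean - in_left - in_right   # what is left of the insert is gap
        weight = 1.0 / (std * std)        # tighter libraries count for more
        weighted_sum += weight * gap
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("need at least one link")
    return weighted_sum / weight_total, math.sqrt(1.0 / weight_total)
```

For example, two 10Kbp links with 800bp std.dev. that imply gaps of 1000bp and 1400bp combine to a 1200bp estimate with a std.dev. of 800/sqrt(2), about 566bp.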
11Anatomy of a WGS Assembly
Consensus
Reads (of several haplotypes)
SNPs External Reads
12Whole Genome Shotgun Assembly
WGS Sequencing WGS Assembly Performance
13Assembler Design Philosophy
- Detect repeats and so avoid being misled by them;
leave them for last.
- Make 1st order use of mate-pairs: first to
circumnavigate and later to fill in repeats.
- Make all the sure moves first.
- Tiered phases that get progressively more
aggressive.
- Output a complete audit trail of the evidence for
assembly.
14Assembly Pipeline (circa 2006)
Trim & Screen
- Reads (typically 800bp) are quality-trimmed so
that the average error rate is 0.5%, with 1 in 1000
having more than 2% error. Average trim length
is 500-900bp, depending on the genome (590bp for
human in year 2000).
- Contaminant and vector sequence is removed.
- Repeat screening makes run time and overlap graph
size reasonable, e.g. 10^6 overlaps per Alu read
must be avoided.
- Now we dynamically limit repetitive overlaps in
the overlap phase.
- gatekeeper program to vet inputs/assign IDs.
- Reads stored in compressed, random-access
binary store.
Overlapper
Unitiger
Scaffolder
Repeat Rez I, II
15Assembly Pipeline
Overlap Detection
Find all overlaps ≥ 40bp allowing 6% mismatch.
Use k-mer seed matches (k = 22) with O(nd)
extension, where extension quits when the
probability of seeing the given number of errors
for the amount of sequence aligned is less than
1/1,000,000.
Avoid k-mers whose occurrence count c is such
that there is less than ε (10^-6) chance of seeing
c occurrences given it is part of an R-fold
(R = 50) or less repeat in a genome of length G
(3 x 10^9).
Dynamic tuple selection is a form of automatic
repeat screening, implying that overlaps involving
ubiquitous repetitive sequence may be missing.
K, ε, and R were chosen to give us an appropriate
tradeoff between time, space, and sensitivity.
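A minimal sketch of the seeding idea: index every k-mer, skip k-mers that occur in too many reads, and report read pairs that share a surviving seed as overlap candidates. The fixed cap `max_count` is a simplified stand-in for the probabilistic threshold described above, the function name is hypothetical, and only the forward strand is considered. The extension step is not shown.

```python
from collections import defaultdict
from itertools import combinations


def seed_candidates(reads, k=22, max_count=50):
    """Return read-index pairs that share at least one non-repetitive k-mer.

    reads: list of DNA strings. A k-mer found in more than max_count reads
    is treated as repetitive and ignored (assumed simplification).
    """
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)

    pairs = set()
    for rids in index.values():
        if len(rids) < 2 or len(rids) > max_count:
            continue  # unshared seed, or too common to be informative
        pairs.update(combinations(sorted(rids), 2))
    return pairs
```

Each candidate pair would then go to an alignment extension step. Lowering `max_count` trades sensitivity in repeats for a smaller candidate set.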
16Assembly Pipeline
Error correction
If a k-mer (k = 10) of a read matches a k-mer from
an overlapping read, then the bases in that k-mer
of the read are confirmed.
If a base is not confirmed, and the
1-neighborhood of an overlapping k-mer matches
it, then there is a vote for correction. The
majority correction vote is applied to the
sequence.
Sequences are not actually changed, but overlaps
are re-evaluated as SNPs are corrected.
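The confirmation step can be sketched as follows, under strong simplifying assumptions: overlaps are ungapped and given as a start offset of the other read relative to this read. The function name and the input layout are hypothetical, and the 1-neighborhood voting step is not shown.

```python
def confirmed_bases(read, overlaps, k=10):
    """Mark the bases of `read` that lie in a k-mer matched by an overlapping read.

    overlaps: list of (other_seq, offset), where other_seq[0] aligns with
    read[offset] (offset may be negative). Ungapped alignment is assumed.
    Returns a list of booleans, one per base of `read`.
    """
    confirmed = [False] * len(read)
    for other, offset in overlaps:
        for i in range(len(read) - k + 1):
            j = i - offset  # start of the aligned window in the other read
            if j < 0 or j + k > len(other):
                continue    # window falls outside the overlap
            if read[i:i + k] == other[j:j + k]:
                for p in range(i, i + k):
                    confirmed[p] = True
    return confirmed
```

Bases left unconfirmed after all overlaps are examined are the candidates for a correction vote.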
17Assembly Pipeline
Find all overlaps ≥ 40bp allowing 6% mismatch.
18Assembly Pipeline
Compute all overlap-consistent sub-assemblies:
Unitigs (Uniquely Assembled Contigs)
19OVERLAP GRAPH
Edge Types
Regular Dovetail
Prefix Dovetail
Suffix Dovetail
E.G.
Edges are annotated with deltas of overlaps
20The Unitig Reduction
1. Remove Transitively Inferrable Overlaps
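A minimal sketch of this step on a simplified overlap graph: an edge a -> c is dropped when some read b has edges a -> b and b -> c. The function name is hypothetical. Read orientation and overlap lengths, which a real overlap graph must check before calling an edge inferrable, are ignored here.

```python
def transitive_reduction(edges):
    """Remove edges (a, c) for which a two-step path a -> b -> c exists.

    edges: a list or set of (from_read, to_read) dovetail overlaps.
    Orientation and overlap-length consistency are not checked (assumption).
    """
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    reduced = set()
    for a, c in edges:
        # Look for an intermediate read b with a -> b and b -> c.
        inferrable = any(c in succ.get(b, ()) for b in succ[a] if b != c)
        if not inferrable:
            reduced.add((a, c))
    return reduced
```

On four reads tiling a region with edges 1->2, 2->3, 3->4, 1->3 and 2->4, only the three adjacent edges remain.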
21The Unitig Reduction
2. Collapse Unique Connector Overlaps
22Unitigs Definition
Chordal Subgraph with no conflicting edges.
23Unitig Theorem (Myers, JCB 95)
- (1) Remove contained fragments
- (2) Remove transitively inferred edges
- (3) Collapse into unitigs
- (4) Restore transitively inferred (t.i.) edges
between unitig ends.
- THM: Shortest Common Superstring of unitigs =
Shortest Common Superstring of reads
- Caveat: SCS is not the right objective for
assembly.
24Revised Unitigger Algorithm
- The preceding is computationally expensive.
- Current unitigger finds the best overlap on
each end of each read: its best buddy.
- Unitigs are chains of mutually unique best
buddies: adjacent reads are best buddies of each
other and of no other read.
- This takes time and space linear in the number of
reads.
- In rare cases results are different from graph
reduction.
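The chaining rule can be sketched as below, assuming the best overlap on each end of each read has already been chosen and ignoring read orientation. The function name and the two input dictionaries are hypothetical. Circular chains are not reported by this sketch.

```python
from collections import Counter


def best_buddy_unitigs(reads, best_left, best_right):
    """Chain reads whose best overlaps are mutual and unique.

    best_right[a] = b means b is the best overlap off a's right end;
    best_left[b] = a means a is the best overlap off b's left end.
    Reads missing from a dictionary have no overlap on that end.
    """
    right_votes = Counter(best_right.values())  # how many reads pick b on their right
    left_votes = Counter(best_left.values())    # how many reads pick a on their left

    nxt = {}
    for a in reads:
        b = best_right.get(a)
        if b is None:
            continue
        mutual = best_left.get(b) == a
        unique = right_votes[b] == 1 and left_votes[a] == 1
        if mutual and unique:
            nxt[a] = b

    has_prev = set(nxt.values())
    unitigs = []
    for r in reads:
        if r in has_prev:
            continue  # not the start of a chain
        chain = [r]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        unitigs.append(chain)
    return unitigs
```

Each read is visited a constant number of times, in line with the linear time and space noted above. A read that two different reads both name as their best buddy ends a chain, which is where a repeat boundary would typically show up.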
25Branch Point Extension
- A repeat boundary reflected on an underlying
sequence read.
- Compare peers to detect branch pts.
- Make sure you get a read-length into each
repeat-induced gap (most Alu-sized elements are
resolved).
[Figure labels: Genome; reads A, C, D]
26Bubble Smoothing
[Figure: bubble example; labels 412, 352, 486, 245]
27Assembly Pipeline
Unique
Repetitive
28Identifying Unique DNA Stretches
Discriminator Statistic is the log-odds ratio of
the probability that a unitig is unique DNA versus
2-copy DNA.
[Figure: arrival intervals for a repetitive DNA unitig and a unique DNA unitig]
[Figure: Dist. for Unique and Dist. for Repetitive on an axis marked -10, 0, 10; regions labeled Definitely Repetitive, Don't Know, Definitely Unique]
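A sketch of such a log-odds statistic, assuming read start positions arrive as a Poisson process along the genome (the Poisson model, natural logarithms, and all function and parameter names are assumptions, not taken from the slide). With k reads in a unitig spanning rho bases, the expected count is rho * F / G for unique DNA and twice that for 2-copy DNA, so the log ratio of the two Poisson probabilities simplifies to rho * F / G - k * ln 2.

```python
import math


def a_statistic(rho, k, total_reads, genome_size):
    """Log-odds (natural log) that a unitig is unique rather than 2-copy DNA.

    rho: bases spanned by the unitig; k: reads in the unitig;
    total_reads / genome_size: global read arrival rate (assumed Poisson).
    """
    expected = rho * total_reads / genome_size  # expected reads if unique
    return expected - k * math.log(2)


def poisson_log_pmf(k, lam):
    """log P(K = k) for a Poisson distribution with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def a_statistic_direct(rho, k, total_reads, genome_size):
    """Same statistic computed directly from the two Poisson probabilities."""
    lam = rho * total_reads / genome_size
    return poisson_log_pmf(k, lam) - poisson_log_pmf(k, 2 * lam)
```

A unitig with about the expected number of reads scores positive, while one with about twice the expected number scores negative. The thresholds used to call a unitig unique or repetitive depend on the logarithm base chosen.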
29Assembly Pipeline
Scaffold U-unitigs with confirmed pairs
30Assembly Pipeline
Fill repeat gaps with doubly anchored positive
unitigs
Unitig > 0
31Assembly Pipeline
Fill repeat gaps with assembled, singly anchored
reads
32Surrogates
- Stones containing more than 1 read are added to
contigs as consensus sequence only, without
underlying reads.
- Called surrogates.
- Allows repeat unitigs to be put in multiple
positions in the assembly, but leaves regions
without underlying read coverage.
- We later attempt to resolve surrogates by
assigning reads from the original repeat unitig
to the separate surrogate copies, based on mate
pairs.
33CelAsm Weaknesses
- No dynamic read trimming.
- However, the latest version has fixed this: it
offers trimming based on overlaps.
- Unitigging treats overlaps as boolean values: no
quality variations are considered.
- Unitigging ignores mate pairs.
- Unitig A-stat is sometimes unreliable for
non-random sequencing, haplotype variation,
repeat-screening.
- Scaffolding has no provision for (moderate or
worse) polymorphism.
- Read overlaps are ignored after unitigging.