Title: DNA Sequencing and Assembly
1DNA Sequencing and Assembly
2DNA sequencing
- How we obtain the sequence of nucleotides of a
species
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
3Which representative of the species?
- Which human?
- Answer one
- Answer two: it doesn't matter
- Polymorphism rate: number of letter changes between two different members of a species
- Humans: 1/1,000 to 1/10,000
- Other organisms have much higher polymorphism rates
4DNA sequencing vectors
- DNA is broken up ("shake") into DNA fragments
- Each fragment is inserted into a vector at a known location (restriction site)
- Vector: circular genome (bacterium, plasmid)
5Different types of vectors
6DNA sequencing gel electrophoresis
- Start at primer (restriction site)
- Grow DNA chain
- Include dideoxynucleoside (modified a, c, g, t)
- Stops reaction at all possible points
- Separate products by length, using gel electrophoresis
7Electrophoresis diagrams
8Output of gel electrophoresis: a read
- A read: 500-700 nucleotides
- Bases:          A  C  G  A  A  T  C  A  G  ...  A
- Quality scores: 16 18 21 23 25 15 28 30 32 ... 21
- Quality score = -10 × log10 Prob(Error)
- Reads can be obtained from the leftmost and rightmost ends of the insert
- Double-barreled sequencing: both leftmost and rightmost ends are sequenced
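The quality-score formula can be inverted to recover an error probability per base. A minimal sketch, assuming the scores follow the -10 × log10 Prob(Error) definition on this slide; the function names are illustrative and do not come from any sequencing toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Quality scores listed for the example read on this slide.
scores = [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]

# Expected number of wrong calls among these bases: sum of per-base error probabilities.
expected_errors = sum(phred_to_error_prob(q) for q in scores)
```

A score of 20 corresponds to a 1-in-100 chance of error, and a score of 30 to 1 in 1,000.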
9Method to sequence segments longer than 500
- Take a genomic segment
- Cut many times at random (Shotgun)
- Get one or two reads (500 bp each) from each segment
10Reconstructing the Sequence (Fragment Assembly)
- Cover region with reads at 7-fold redundancy (7X)
- Overlap reads and extend to reconstruct the original genomic region
11Definition of Coverage
- Length of genomic segment: L
- Number of reads: n
- Length of each read: l
- Definition: Coverage C = n l / L
- How much coverage is enough? (Lander-Waterman model)
- Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
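The coverage definition and the gap estimate can be checked numerically. This is a rough sketch under the usual Lander-Waterman simplifications: reads start uniformly at random, and the minimum overlap needed to join two reads is ignored, so the expected number of gaps is approximated as n e^(-C). The read length of 500 is an assumption taken from the earlier slides, and the function names are illustrative.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len


def uncovered_fraction(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate fraction of bases covered by no read: e^(-C)."""
    return math.exp(-coverage(n_reads, read_len, genome_len))


def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate expected number of gaps: n * e^(-C)."""
    return n_reads * math.exp(-coverage(n_reads, read_len, genome_len))


# 20,000 reads of 500 bp over 1,000,000 bp gives C = 10.
gaps_per_megabase = expected_gaps(20_000, 500, 1_000_000)
```

With these numbers the estimate comes out at a little under one gap per 1,000,000 nucleotides, in line with the figure quoted on the slide.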
12Challenges with Fragment Assembly
- Sequencing errors
  - 1-2% of bases are wrong
- Repeats
  - False overlap due to repeat
- Computation: O(N^2), where N = number of reads
13Repeats
- Bacterial genomes: 5%
- Mammals: 50%
- Repeat types
  - Low-Complexity DNA (e.g. ATATATATACATA)
  - Microsatellite repeats: (a1...ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
  - Common Repeat Families
    - SINE (Short Interspersed Nuclear Elements), e.g. ALU: 300-long, 10^6 copies
    - LINE (Long Interspersed Nuclear Elements): 500-5,000-long, 200,000 copies
    - MIR
    - LTR/Retroviral
  - Other
    - Genes that are duplicated then diverge (paralogs)
    - Recent duplications, 100,000-long, very similar copies
14What can we do about repeats?
- Two main approaches
- Cluster the reads
- Link the reads
16Strategies for sequencing a whole genome
1. Hierarchical: Clone-by-clone
   - Break genome into many long pieces
   - Map each long piece onto the genome
   - Sequence each piece with shotgun
   - Example: Yeast, Worm, Human, Rat
2. Online version of (1): Walking
   - Break genome into many long pieces
   - Start sequencing each piece with shotgun
   - Construct map as you go
   - Example: Rice genome
3. Whole genome shotgun
   - One large shotgun pass on the whole genome
   - Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu
17Hierarchical Sequencing
18Hierarchical Sequencing Strategy
- Obtain a large collection of BAC clones
- Map them onto the genome (Physical Mapping)
- Select a minimum tiling path
- Sequence each clone in the path with shotgun
- Assemble
- Put everything together
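Selecting a minimum tiling path can be illustrated with a greedy interval cover. This is only a sketch: it assumes physical mapping has already assigned each clone a (start, end) interval in genome coordinates, which is a simplification of what a real map provides, and the function name and input format are hypothetical.

```python
def minimum_tiling_path(clones, genome_len):
    """Pick the fewest clones whose intervals cover [0, genome_len).

    clones: list of (name, start, end) half-open intervals.
    Returns the chosen clone names in order, or None if a gap cannot be covered.
    """
    clones = sorted(clones, key=lambda c: c[1])  # sort by start position
    path, covered, i = [], 0, 0
    while covered < genome_len:
        best = None
        # Among clones starting at or before the covered prefix,
        # keep the one that reaches furthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            return None  # no clone extends the covered region
        path.append(best[0])
        covered = best[2]
    return path


clones = [("A", 0, 150), ("B", 100, 250), ("C", 50, 200),
          ("D", 180, 400), ("E", 240, 400)]
path = minimum_tiling_path(clones, 400)
```

In this toy input, clones C and E are redundant and the greedy choice keeps A, B and D. A practical selection would also require a minimum overlap between consecutive clones, which this sketch does not enforce.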
19Methods of physical mapping
- Goal
  - Make a map of the locations of each clone relative to one another
  - Use the map to select a minimal set of clones to sequence
- Methods
  - Hybridization
  - Digestion
201. Hybridization
- Short words, the probes (p1, ..., pn), attach to complementary words
- Construct many probes
- Treat each BAC with all probes
- Record which ones attach to it
- Same words attaching to BACs X, Y => overlap
21Hybridization Computational Challenge
- Matrix: m probes × n clones
- Entry (i, j) = 1 if pi hybridizes to Cj; 0 otherwise
- Definition: Consecutive-ones matrix
  - A matrix in which the 1s are consecutive
- Computational problem
  - Reorder the probes so that the matrix is in consecutive-ones form
  - Can be solved in O(m^3) time (m >> n)
- Unfortunately, data is not perfect
[Figure: 0/1 hybridization matrix over probes p1 ... pm and clones C1 ... Cn, shown before reordering and after reordering (probes pi1 ... pim, clones Cj1 ... Cjn), where the 1s form consecutive runs]
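To make the consecutive-ones problem concrete, here is a brute-force sketch that tries every probe order. It is not the O(m^3) method mentioned on the slide, and it takes factorial time, so it is only usable on toy inputs. It assumes error-free data and reads "consecutive-ones form" as: after reordering the probes (rows), the 1s in each clone's column are contiguous. That reading, and the function names, are assumptions.

```python
from itertools import permutations


def ones_are_consecutive(values):
    """True if all 1s in the sequence form a single contiguous run."""
    ones = [i for i, v in enumerate(values) if v == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_probe_order(matrix):
    """matrix[i][j] = 1 if probe i hybridizes to clone j.

    Returns a row (probe) order that makes the 1s in every column
    consecutive, or None if no such order exists.
    """
    n_probes, n_clones = len(matrix), len(matrix[0])
    for order in permutations(range(n_probes)):
        if all(ones_are_consecutive([matrix[i][j] for i in order])
               for j in range(n_clones)):
            return list(order)
    return None


# Four probes (rows) and three clones (columns), with the probes shuffled.
shuffled = [[0, 1, 1],
            [1, 0, 0],
            [0, 0, 1],
            [1, 1, 0]]
order = find_probe_order(shuffled)
```

With noisy hybridization data an exact order often does not exist, which is the difficulty the last bullet points at.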
222. Digestion
- Restriction enzymes cut DNA where specific words appear
- Cut each clone separately with an enzyme
- Run fragments on a gel and measure length
- Clones Ca, Cb both have fragments of length li, lj, lk => overlap
- Double digestion
  - Cut with enzyme A, enzyme B, then enzymes A + B
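The idea that shared fragment lengths suggest overlap can be sketched as a comparison of two clones' fragment lists. This assumes each clone's digest is available as a list of measured lengths, and that two lengths count as the same fragment when they differ by at most a tolerance (gel measurements are imprecise). The function name, the tolerance parameter and the example lengths are illustrative only.

```python
def shared_fragments(frags_a, frags_b, tolerance=0):
    """Count fragments of clone A that pair up with distinct fragments of
    clone B whose lengths differ by at most `tolerance`."""
    a, b = sorted(frags_a), sorted(frags_b)
    i = j = matches = 0
    # Walk both sorted lists in parallel, pairing lengths that are close enough.
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


clone_a = [1200, 3400, 560, 7800]
clone_b = [3400, 560, 910, 7800]
n_shared = shared_fragments(clone_a, clone_b)
```

How many shared fragments are enough to call an overlap is a separate decision that this sketch leaves open.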
23Whole-Genome Shotgun Sequencing
24Whole Genome Shotgun Sequencing
- Genome is cloned into inserts: plasmids (2-10 Kbp), cosmids (40 Kbp)
- Forward-reverse linked reads (500 bp each) at a known distance
25The Overlap-Layout-Consensus approach
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs (many heuristics)
3. Link contigs to form supercontigs
4. Derive consensus sequence, e.g. ..ACGATTACAATAGGTT..
261. Find Overlapping Reads
- Sort all k-mers in reads (k ~ 24)
- Find pairs of reads sharing a k-mer
- Extend to full alignment; throw away if not > 95% similar
TAGATTACACAGATTAC
TAGATTACACAGATTAC
271. Find Overlapping Reads
- One caveat: repeats
- A k-mer that appears N times initiates N^2 comparisons
- ALU: 1,000,000 times
- Solution:
  - Discard all k-mers that appear more than c × Coverage times (c ~ 10)
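The k-mer seeding step and the repeat filter can be sketched together. A few simplifications are assumed: a hash index replaces the sort described on the slides, reverse complements are ignored, and a k-mer's frequency is measured as the number of reads containing it. The function name and the `max_occurrences` parameter (standing in for c × Coverage) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=24, max_occurrences=None):
    """Return the set of read-index pairs (i, j), i < j, sharing a k-mer.

    k-mers found in more than `max_occurrences` reads are treated as
    repeat-derived and skipped.
    """
    index = defaultdict(set)  # k-mer -> ids of reads containing it
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if max_occurrences is not None and len(read_ids) > max_occurrences:
            continue  # too frequent: likely a repeat
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGTACGGTCA", "CGGTCATTGAC", "TTTTTTTTTTT"]
pairs = candidate_overlaps(reads, k=5)
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold; that verification step is not shown here.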
281. Find Overlapping Reads
- Create local multiple alignments from the
overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
291. Find Overlapping Reads (contd)
- Correct errors using multiple alignment
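Reads in the multiple alignment (the third read has a gap at column 4 and a T at column 17):
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Base and quality score in each read, before -> after correction:
Read   Column 4         Column 17
1      A 15 -> A 15     C 20 -> C 20
2      A 25 -> A 25     C 35 -> C 35
3      -    -> A 0      T 30 -> C 0
4      A 40 -> A 40     C 35 -> C 35
5      A 25 -> A 25     C 40 -> C 40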
- Score alignments
- Accept alignments with good scores
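One way to picture the correction step is a quality-weighted vote within a single alignment column, where any call that loses the vote is replaced by the winner and given quality 0. This is a sketch of that idea using the base/quality values shown on this slide; it assumes a gap carries quality 0, and a real assembler would likely demand stronger support before overwriting a base. The function name is illustrative.

```python
from collections import defaultdict


def correct_column(column):
    """column: list of (base, quality) calls for one alignment column,
    with '-' marking a gap.

    Returns the column with every disagreeing call replaced by the
    quality-weighted winner at quality 0.
    """
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    winner = max(weight, key=weight.get)
    return [(base, quality) if base == winner else (winner, 0)
            for base, quality in column]


c_column = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
a_column = [("A", 15), ("A", 25), ("-", 0), ("A", 40), ("A", 25)]
fixed_c = correct_column(c_column)
fixed_a = correct_column(a_column)
```

Setting the corrected call's quality to 0 means it adds no weight of its own in later voting.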
30Basic principle of assembly
- Repeats confuse us
- Ability to merge two reads <=> ability to detect repeats
- We can dismiss as repeat any overlap of < t similarity
- Role of error correction
  - Discards 90% of single-letter sequencing errors
  - => Threshold t increases
312. Merge Reads into Contigs (contd)
- Merge reads up to potential repeat boundaries (Myers, 1995)
322. Merge Reads into Contigs (contd)
- Ignore non-maximal reads
- Merge only maximal reads into contigs
332. Merge Reads into Contigs (contd)
- Ignore hanging reads when detecting repeat boundaries
[Figure: reads a and b, with a hanging read caused by a sequencing error]
342. Merge Reads into Contigs (contd)
- Insert non-maximal reads whenever unambiguous
353. Link Contigs into Supercontigs
- Normal density
- Too dense: overcollapsed? (Myers et al. 2000)
- Inconsistent links: overcollapsed?
363. Link Contigs into Supercontigs (contd)
- Find all links between unique contigs
- Connect contigs incrementally, if >= 2 links
373. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of
overcollapsed contigs
383. Link Contigs into Supercontigs
- Define G = (V, E): V = contigs; E = {(A, B) such that d(A, B) < C}
- Reason to do so: efficiency; full shortest paths cannot be computed
393. Link Contigs into Supercontigs
- Define T: contigs linked to either contig A or contig B
- Fill gap between A and B if there is a path in G passing only through contigs in T
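The gap-filling rule amounts to a restricted path search: look for a path from A to B in G whose intermediate contigs all belong to T. A minimal breadth-first sketch, assuming G is already available as an adjacency dictionary (building its edges from the distance bound is not shown); the function name and the toy graph are hypothetical.

```python
from collections import deque


def gap_path(graph, a, b, allowed):
    """Breadth-first search for a path from contig a to contig b whose
    intermediate contigs all belong to `allowed` (the set T).

    graph: dict mapping each contig to an iterable of neighbouring contigs.
    Returns the path as a list of contigs, or None if there is none.
    """
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:  # walk back to a via parent links
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent and (nxt == b or nxt in allowed):
                parent[nxt] = node
                queue.append(nxt)
    return None


graph = {"A": ["X", "Y"], "X": ["B"], "Y": ["Z"], "Z": ["B"]}
path = gap_path(graph, "A", "B", allowed={"Y", "Z"})
```

Restricting the search to T keeps it local to the gap, which fits the efficiency concern raised on the previous slide.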
404. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
- Derive multiple alignment from pairwise read alignments
- Derive each consensus base by weighted voting
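Weighted voting over a multiple alignment can be sketched column by column. The slide does not say what the weights are; this sketch assumes one weight per read per column (quality scores would be a natural choice) and writes gaps as '-'. The function name and input layout are illustrative.

```python
from collections import defaultdict


def consensus(rows, weights):
    """rows: equal-length aligned reads, with '-' marking a gap.
    weights[r][c]: weight of row r's call at column c.

    Returns the consensus string; columns where the gap wins are dropped.
    """
    result = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, row_weights in zip(rows, weights):
            votes[row[col]] += row_weights[col]
        best = max(votes, key=votes.get)
        if best != "-":
            result.append(best)
    return "".join(result)


rows = ["TAGATTACA",
        "TAG-TTACA",
        "TAGATTGCA"]
uniform = [[1] * 9 for _ in rows]
cons = consensus(rows, uniform)
```

With uniform weights this reduces to a majority vote; raising the weight of a single call can flip the outcome of its column.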
41Mouse Genome
- Several heuristics for iteratively:
  - Breaking supercontigs that are suspicious
  - Rejoining supercontigs
- Size of problem: 32,000,000 reads
- Time: 15 days, 1 processor
- Memory: 28 Gb
- N50 Contig size: 16.3 Kb -> 24.8 Kb
- N50 Supercontig size: 0.265 Mb -> 16.9 Mb
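The N50 statistic used in these figures is the largest length such that contigs of at least that length contain half or more of all assembled bases. A small sketch of the computation, with a hypothetical function name and made-up contig lengths:

```python
def n50(lengths):
    """Largest length such that contigs at least that long hold
    at least half of the total bases. Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


example = n50([2, 3, 4, 5, 6])
```

In the example the total is 20, and the two longest contigs (6 and 5) already hold 11 bases, so the N50 is 5.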
42Mouse Assembly
43Sequencing in the (near) future