Title: DNA Sequencing
1DNA Sequencing
2Whole Genome Shotgun Sequencing
genome
plasmids (2 10 Kbp)
forward-reverse paired reads
known dist
cosmids (40 Kbp)
500 bp
500 bp
3Steps to Assemble a Genome
Some Terminology read a 500-900 long word
that comes out of sequencer mate pair a pair
of reads from two ends of the same insert
fragment contig a contiguous sequence formed
by several overlapping reads with no
gaps supercontig an ordered and oriented
set (scaffold) of contigs, usually by
mate pairs consensus sequence
derived from the sequene multiple alignment
of reads in a contig
1. Find overlapping reads
2. Merge some good pairs of reads into longer
contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
..ACGATTACAATAGGTT..
42. Merge Reads into Contigs
- Overlap graph
- Nodes reads r1..rn
- Edges overlaps (ri, rj, shift, orientation,
score)
Reads that come from two regions of the genome
(blue and red) that contain the same repeat
Note of course, we dont know the color
of these nodes
52. Merge Reads into Contigs
Unique Contig
Overcollapsed Contig
- We want to merge reads up to potential repeat
boundaries
62. Merge Reads into Contigs
- Ignore non-maximal reads
- Merge only maximal reads into contigs
72. Merge Reads into Contigs
r
r1
r2
r3
- Remove transitively inferable overlaps
- If read r overlaps to the right reads r1, r2, and
r1 overlaps r2, then (r, r2) can be inferred by
(r, r1) and (r1, r2)
82. Merge Reads into Contigs
9Overlap graph after forming contigs
Unitigs Gene Myers, 95
10Repeats, errors, and contig lengths
- Repeats shorter than read length are easily
resolved - Read that spans across a repeat disambiguates
order of flanking regions - Repeats with more base pair diffs than sequencing
error rate are OK - We throw overlaps between two reads in different
copies of the repeat - To make the genome appear less repetitive, try
to - Increase read length
- Decrease sequencing error rate
- Role of error correction
- Discards up to 98 of single-letter sequencing
errors - decreases error rate
- ? decreases effective repeat content
- ? increases contig length
113. Link Contigs into Supercontigs
Normal density
Too dense ? Overcollapsed
Inconsistent links ? Overcollapsed?
123. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ? 2 links
supercontig (aka scaffold)
133. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat
contigs
144. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
- Derive multiple alignment from pairwise read
alignments
Derive each consensus base by weighted
voting (Alternative take maximum-quality letter)
15Some Assemblers
- PHRAP
- Early assembler, widely used, good model of read
errors - Overlap O(n2) ? layout (no mate pairs) ?
consensus - Celera
- First assembler to handle large genomes (fly,
human, mouse) - Overlap ? layout ? consensus
- Arachne
- Public assembler (mouse, several fungi)
- Overlap ? layout ? consensus
- Phusion
- Overlap ? clustering ? PHRAP ? assemblage ?
consensus - Euler
- Indexing ? Euler graph ? layout by picking paths
? consensus
16Quality of assembliesmouse
Terminology N50 contig length If we sort contigs
from largest to smallest, and start Covering the
genome in that order, N50 is the length Of the
contig that just covers the 50th percentile.
7.7X sequence coverage
17Quality of assembliesdog
7.5X sequence coverage
18History of WGA
1997
- 1982 ?-virus, 48,502 bp
- 1995 h-influenzae, 1 Mbp
- 2000 fly, 100 Mbp
- 2001 present
- human (3Gbp), mouse (2.5Gbp), rat, chicken, dog,
chimpanzee, several fungal genomes
Lets sequence the human genome with the shotgun
strategy
That is impossible, and a bad idea anyway
Phil Green
Gene Myers
19Genomes Sequenced
- http//www.genome.gov/10002154
20Phylogeny Tree Reconstruction
21Phylogenetic Trees
- Nodes species
- Edges time of independent evolution
- Edge length represents evolution time
- AKA genetic distance
- Not necessarily chronological time
22Inferring Phylogenetic Trees
- Trees can be inferred by several criteria
- Morphology of the organisms
- Can lead to mistakes!
- Sequence comparison
- Example
- Orc ACAGTGACGCCCCAAACGT
- Elf ACAGTGACGCTACAAACGT
- Dwarf CCTGTGACGTAACAAACGA
- Hobbit CCTGTGACGTAGCAAACGA
- Human CCTGTGACGTAGCAAACGA
23Phylogeny and sequence comparison
- Basic principles
- Degree of sequence difference is proportional to
length of independent sequence evolution - Only use positions where alignment is certain
avoid areas with (too many) gaps
24Distance between two sequences
- Given sequences xi, xj,
- Define
-
- dij distance between the two sequences
- One possible definition
- dij fraction f of sites u where xiu ? xju
- Better scores are derived by modeling evolution
as a continuous change process not covered here - Jukes Kantor, Kimura, etc.
25A simple clustering method for building tree
- UPGMA (unweighted pair group method using
arithmetic averages) - Or the Average Linkage Method
- Given two disjoint clusters Ci, Cj of sequences,
- 1
- dij ?p ?Ci, q ?Cjdpq
- Ci ? Cj
- Claim that if Ck Ci ? Cj, then distance to
another cluster Cl is - dil Ci djl Cj
- dkl
- Ci Cj
-
Proof ?Ci,Cl dpq ?Cj,Cl dpq dkl
(Ci Cj)
Cl Ci/(CiCl) ?Ci,Cl dpq
Cj/(CjCl) ?Cj,Cl dpq
(Ci Cj) Ci dil Cj djl
(Ci Cj)
26Algorithm Average Linkage
1
- Initialization
- Assign each xi into its own cluster Ci
- Define one leaf per sequence, height 0
- Iteration
- Find two clusters Ci, Cj s.t. dij is min
- Let Ck Ci ? Cj
- Define node connecting Ci, Cj,
- place it at height dij/2
- Delete Ci, Cj
- Termination
- When two clusters i, j remain,
- place root at height dij/2
4
3
5
2
2
3
5
1
4
27Example
4
3
2
1
y
z
x
w
v
28Ultrametric Distances and Molecular Clock
- Definition
- A distance function d(.,.) is ultrametric if for
any three distances dij ? dik ? dij, it is true
that - dij ? dik dij
- The Molecular Clock
- The evolutionary distance between species x and y
is 2? the Earth time to reach the nearest common
ancestor - That is, the molecular clock has constant rate in
all species
The molecular clock results in ultrametric
distances
years
1
4
2
3
5
29Ultrametric Distances Average Linkage
1
4
2
3
5
- Average Linkage is guaranteed to reconstruct
correctly a binary tree with ultrametric
distances - Proof Exercise
30Weakness of Average Linkage
- Molecular clock all species evolve at the same
rate (Earth time) - However, certain species (e.g., mouse, rat)
evolve much faster - Example where UPGMA messes up
AL tree
Correct tree
3
2
1
3
4
4
2
1
31Additive Distances
1
4
12
8
3
13
7
9
5
11
10
6
2
- Given a tree, a distance measure is additive if
the distance between any pair of leaves is the
sum of lengths of edges connecting them - Given a tree T additive distances dij, can
uniquely reconstruct edge lengths - Find two neighboring leaves i, j, with common
parent k - Place parent node k at distance dkm ½ (dim
djm dij) from any node m ? i, j
32Additive Distances
z
x
w
y
- For any four leaves x, y, z, w, consider the
three sums - d(x, y) d(z, w)
- d(x, z) d(y, w)
- d(x, w) d(y, z)
- One of them is smaller than the other two, which
are equal - d(x, y) d(z, w) lt d(x, z) d(y, w)
d(x, w) d(y, z)
33Reconstructing Additive Distances Given T
x
T
D
y
5
4
3
z
3
4
w
7
6
v
If we know T and D, but do not know the length of
each leaf, we can reconstruct those lengths
34Reconstructing Additive Distances Given T
x
T
D
y
z
w
v
35Reconstructing Additive Distances Given T
D
x
T
y
z
a
w
D1
v
dax ½ (dvx dwx dvw)
day ½ (dvy dwy dvw)
daz ½ (dvz dwz dvw)
36Reconstructing Additive Distances Given T
D1
x
T
y
5
4
b
3
z
3
a
4
c
w
7
D2
6
d(a, c) 3 d(b, c) d(a, b) d(a, c) 3 d(c,
z) d(a, z) d(a, c) 7 d(b, x) d(a, x)
d(a, b) 5 d(b, y) d(a, y) d(a, b) 4 d(a,
w) d(z, w) d(a, z) 4 d(a, v) d(z, v)
d(a, z) 6 Correct!!!
v
D3
37Neighbor-Joining
- Guaranteed to produce the correct tree if
distance is additive - May produce a good tree even when distance is not
additive - Step 1 Finding neighboring leaves
- Define
- Dij (N 2) dij ?k?i dik ?k?j djk
- Claim The above magic trick ensures that Dij
is minimal iff i, j are neighbors
1
3
0.1
0.1
0.1
0.4
0.4
4
2
38Algorithm Neighbor-joining
- Initialization
- Define T to be the set of leaf nodes, one per
sequence - Let L T
- Iteration
- Pick i, j s.t. Dij is minimal
- Define a new node k, and set dkm ½ (dim djm
dij) for all m ? L -
- Add k to T, with edges of lengths dik ½ (dij
ri rj), djk dij dik -
- where ri (N 2)-1 ?k?i dik
- Remove i, j from L
- Add k to L
- Termination
- When L consists of two nodes, i, j, and the edge
between them of length dij
39Parsimony direct method not using distances
- One of the most popular methods
- GIVEN multiple alignment
- FIND tree history of substitutions explaining
alignment - Idea
- Find the tree that explains the observed
sequences with a minimal number of substitutions - Two computational subproblems
- Find the parsimony cost of a given tree (easy)
- Search through all tree topologies (hard)
40Example Parsimony cost of one column
A Final cost C 1
A
A, B Cost C1
A B A A
A
A
B
A
B
A
A
A
41Parsimony Scoring
- Given a tree, and an alignment column u
- Label internal nodes to minimize the number of
required substitutions - Initialization
- Set cost C 0 node k 2N 1 (last leaf)
- Iteration
- If k is a leaf, set Rk xku // Rk is
simply the character of kth species -
- If k is not a leaf,
- Let i, j be the daughter nodes
- Set Rk Ri ? Rj if intersection is nonempty
- Set Rk Ri ? Rj, and C 1, if intersection
is empty - Termination
- Minimal cost of tree for column u, C
42Example
B
A,B
A
B
A
A,B
A
A
A
A
B
B
A
B
A
A
A
A
B
A
B
A
B
43Traceback to find ancestral nucleotides
- Traceback
- Choose an arbitrary nucleotide from R2N 1 for
the root - Having chosen nucleotide r for parent k,
- If r ? Ri choose r for daughter i
- Else, choose arbitrary nucleotide from Ri
- Easy to see that this traceback produces some
assignment of cost C
44Example
Admissible with Traceback
x
B
Still optimal, but inadmissible with Traceback
A
A, B
B
A
x
A
B
A
A, B
A
B
B
x
B
x
A
A
B
B
A
A
A
B
B
B
A
B
A
A
x
A
x
A
A
B
B
45Probabilistic Methods
xroot
t1
t2
x1
x2
- A more refined measure of evolution along a tree
than parsimony - P(x1, x2, xroot t1, t2) P(xroot) P(x1 t1,
xroot) P(x2 t2, xroot) - If we use Jukes-Cantor, for example, and x1
xroot A, x2 C, t1 t2 1, - pA?¼(1 3e-4a) ?¼(1 e-4a) (¼)3(1
3e-4a)(1 e-4a)
46Probabilistic Methods
xroot
xu
x2
xN
x1
- If we know all internal labels xu,
- P(x1, x2, , xN, xN1, , x2N-1 T, t)
P(xroot)?j?rootP(xj xparent(j), tj, parent(j)) - Usually we dont know the internal labels,
therefore - P(x1, x2, , xN T, t) ?xN1 ?xN2 ?x2N-1
P(x1, x2, , x2N-1 T, t)
47Felsensteins Likelihood Algorithm
- To calculate P(x1, x2, , xN T, t)
- Initialization
- Set k 2N 1
- Iteration Compute P(Lk a) for all a ? ?
- If k is a leaf node
- Set P(Lk a) 1(a xk)
- If k is not a leaf node
- 1. Compute P(Li b), P(Lj b) for all b, for
daughter nodes i, j - 2. Set P(Lk a) ?b,c P(b a, ti) P(Li b)
P(c a, tj) P(Lj c) - Termination
- Likelihood at this column P(x1, x2, , xN
T, t) ?aP(L2N-1 a)P(a)
Let P(Lk a) denote the prob. of all the
leaves below node k, given that the residue at
k is a
48Probabilistic Methods
- Given M (ungapped) alignment columns of N
sequences, -
- Define likelihood of a tree
- L(T, t) P(Data T, t) ?m1M P(x1m, , xnm,
T, t) - Maximum Likelihood Reconstruction
- Given data X (xij), find a topology T and
length vector t that maximize likelihood L(T, t)
49Current popular methods
- HUNDREDS of programs available!
- http//evolution.genetics.washington.edu/phylip/so
ftware.htmlmethods - Some recommended programs
- DiscreteParsimony-based
- Rec-1-DCM3
- http//www.cs.utexas.edu/users/tandy/mp.html
- Tandy Warnow and colleagues
- Probabilistic
- SEMPHY
- http//www.cs.huji.ac.il/labs/compbio/semphy/
- Nir Friedman and colleagues