DNA Sequencing - PowerPoint PPT Presentation

1 / 48
About This Presentation
Title:

DNA Sequencing

Description:

and red) that contain. the same repeat. CS262 Lecture 9, Win07, Batzoglou ... Discards up to 98% of single-letter sequencing errors. decreases error rate ... – PowerPoint PPT presentation

Number of Views:91
Avg rating:3.0/5.0
Slides: 49
Provided by: root
Category:
Tags: dna | sequencing

less

Transcript and Presenter's Notes

Title: DNA Sequencing


1
DNA Sequencing
2
Whole Genome Shotgun Sequencing
genome
plasmids (2 10 Kbp)
forward-reverse paired reads
known dist
cosmids (40 Kbp)
500 bp
500 bp
3
Steps to Assemble a Genome
Some Terminology read a 500-900 long word
that comes out of sequencer mate pair a pair
of reads from two ends of the same insert
fragment contig a contiguous sequence formed
by several overlapping reads with no
gaps supercontig an ordered and oriented
set (scaffold) of contigs, usually by
mate pairs consensus sequence
derived from the sequene multiple alignment
of reads in a contig
1. Find overlapping reads
2. Merge some good pairs of reads into longer
contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
..ACGATTACAATAGGTT..
4
2. Merge Reads into Contigs
  • Overlap graph
  • Nodes reads r1..rn
  • Edges overlaps (ri, rj, shift, orientation,
    score)

Reads that come from two regions of the genome
(blue and red) that contain the same repeat
Note of course, we dont know the color
of these nodes
5
2. Merge Reads into Contigs
Unique Contig
Overcollapsed Contig
  • We want to merge reads up to potential repeat
    boundaries

6
2. Merge Reads into Contigs
  • Ignore non-maximal reads
  • Merge only maximal reads into contigs

7
2. Merge Reads into Contigs
r
r1
r2
r3
  • Remove transitively inferable overlaps
  • If read r overlaps to the right reads r1, r2, and
    r1 overlaps r2, then (r, r2) can be inferred by
    (r, r1) and (r1, r2)

8
2. Merge Reads into Contigs
9
Overlap graph after forming contigs
Unitigs Gene Myers, 95
10
Repeats, errors, and contig lengths
  • Repeats shorter than read length are easily
    resolved
  • Read that spans across a repeat disambiguates
    order of flanking regions
  • Repeats with more base pair diffs than sequencing
    error rate are OK
  • We throw overlaps between two reads in different
    copies of the repeat
  • To make the genome appear less repetitive, try
    to
  • Increase read length
  • Decrease sequencing error rate
  • Role of error correction
  • Discards up to 98 of single-letter sequencing
    errors
  • decreases error rate
  • ? decreases effective repeat content
  • ? increases contig length

11
3. Link Contigs into Supercontigs
Normal density
Too dense ? Overcollapsed
Inconsistent links ? Overcollapsed?
12
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ? 2 links
supercontig (aka scaffold)
13
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat
contigs
14
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
  • Derive multiple alignment from pairwise read
    alignments

Derive each consensus base by weighted
voting (Alternative take maximum-quality letter)
15
Some Assemblers
  • PHRAP
  • Early assembler, widely used, good model of read
    errors
  • Overlap O(n2) ? layout (no mate pairs) ?
    consensus
  • Celera
  • First assembler to handle large genomes (fly,
    human, mouse)
  • Overlap ? layout ? consensus
  • Arachne
  • Public assembler (mouse, several fungi)
  • Overlap ? layout ? consensus
  • Phusion
  • Overlap ? clustering ? PHRAP ? assemblage ?
    consensus
  • Euler
  • Indexing ? Euler graph ? layout by picking paths
    ? consensus

16
Quality of assembliesmouse
Terminology N50 contig length If we sort contigs
from largest to smallest, and start Covering the
genome in that order, N50 is the length Of the
contig that just covers the 50th percentile.
7.7X sequence coverage
17
Quality of assembliesdog
7.5X sequence coverage
18
History of WGA
1997
  • 1982 ?-virus, 48,502 bp
  • 1995 h-influenzae, 1 Mbp
  • 2000 fly, 100 Mbp
  • 2001 present
  • human (3Gbp), mouse (2.5Gbp), rat, chicken, dog,
    chimpanzee, several fungal genomes

Lets sequence the human genome with the shotgun
strategy
That is impossible, and a bad idea anyway
Phil Green
Gene Myers
19
Genomes Sequenced
  • http//www.genome.gov/10002154

20
Phylogeny Tree Reconstruction
21
Phylogenetic Trees
  • Nodes species
  • Edges time of independent evolution
  • Edge length represents evolution time
  • AKA genetic distance
  • Not necessarily chronological time

22
Inferring Phylogenetic Trees
  • Trees can be inferred by several criteria
  • Morphology of the organisms
  • Can lead to mistakes!
  • Sequence comparison
  • Example
  • Orc ACAGTGACGCCCCAAACGT
  • Elf ACAGTGACGCTACAAACGT
  • Dwarf CCTGTGACGTAACAAACGA
  • Hobbit CCTGTGACGTAGCAAACGA
  • Human CCTGTGACGTAGCAAACGA

23
Phylogeny and sequence comparison
  • Basic principles
  • Degree of sequence difference is proportional to
    length of independent sequence evolution
  • Only use positions where alignment is certain
    avoid areas with (too many) gaps

24
Distance between two sequences
  • Given sequences xi, xj,
  • Define
  • dij distance between the two sequences
  • One possible definition
  • dij fraction f of sites u where xiu ? xju
  • Better scores are derived by modeling evolution
    as a continuous change process not covered here
  • Jukes Kantor, Kimura, etc.

25
A simple clustering method for building tree
  • UPGMA (unweighted pair group method using
    arithmetic averages)
  • Or the Average Linkage Method
  • Given two disjoint clusters Ci, Cj of sequences,
  • 1
  • dij ?p ?Ci, q ?Cjdpq
  • Ci ? Cj
  • Claim that if Ck Ci ? Cj, then distance to
    another cluster Cl is
  • dil Ci djl Cj
  • dkl
  • Ci Cj

Proof ?Ci,Cl dpq ?Cj,Cl dpq dkl
(Ci Cj)
Cl Ci/(CiCl) ?Ci,Cl dpq
Cj/(CjCl) ?Cj,Cl dpq

(Ci Cj) Ci dil Cj djl
(Ci Cj)
26
Algorithm Average Linkage
1
  • Initialization
  • Assign each xi into its own cluster Ci
  • Define one leaf per sequence, height 0
  • Iteration
  • Find two clusters Ci, Cj s.t. dij is min
  • Let Ck Ci ? Cj
  • Define node connecting Ci, Cj,
  • place it at height dij/2
  • Delete Ci, Cj
  • Termination
  • When two clusters i, j remain,
  • place root at height dij/2

4
3
5
2
2
3
5
1
4
27
Example
4
3
2
1
y
z
x
w
v
28
Ultrametric Distances and Molecular Clock
  • Definition
  • A distance function d(.,.) is ultrametric if for
    any three distances dij ? dik ? dij, it is true
    that
  • dij ? dik dij
  • The Molecular Clock
  • The evolutionary distance between species x and y
    is 2? the Earth time to reach the nearest common
    ancestor
  • That is, the molecular clock has constant rate in
    all species

The molecular clock results in ultrametric
distances
years
1
4
2
3
5
29
Ultrametric Distances Average Linkage
1
4
2
3
5
  • Average Linkage is guaranteed to reconstruct
    correctly a binary tree with ultrametric
    distances
  • Proof Exercise

30
Weakness of Average Linkage
  • Molecular clock all species evolve at the same
    rate (Earth time)
  • However, certain species (e.g., mouse, rat)
    evolve much faster
  • Example where UPGMA messes up

AL tree
Correct tree
3
2
1
3
4
4
2
1
31
Additive Distances
1
4
12
8
3
13
7
9
5
11
10
6
2
  • Given a tree, a distance measure is additive if
    the distance between any pair of leaves is the
    sum of lengths of edges connecting them
  • Given a tree T additive distances dij, can
    uniquely reconstruct edge lengths
  • Find two neighboring leaves i, j, with common
    parent k
  • Place parent node k at distance dkm ½ (dim
    djm dij) from any node m ? i, j

32
Additive Distances
z
x
w
y
  • For any four leaves x, y, z, w, consider the
    three sums
  • d(x, y) d(z, w)
  • d(x, z) d(y, w)
  • d(x, w) d(y, z)
  • One of them is smaller than the other two, which
    are equal
  • d(x, y) d(z, w) lt d(x, z) d(y, w)
    d(x, w) d(y, z)

33
Reconstructing Additive Distances Given T
x
T
D
y
5
4
3
z
3
4
w
7
6
v
If we know T and D, but do not know the length of
each leaf, we can reconstruct those lengths
34
Reconstructing Additive Distances Given T
x
T
D
y
z
w
v
35
Reconstructing Additive Distances Given T
D
x
T
y
z
a
w
D1
v
dax ½ (dvx dwx dvw)
day ½ (dvy dwy dvw)
daz ½ (dvz dwz dvw)
36
Reconstructing Additive Distances Given T
D1
x
T
y
5
4
b
3
z
3
a
4
c
w
7
D2
6
d(a, c) 3 d(b, c) d(a, b) d(a, c) 3 d(c,
z) d(a, z) d(a, c) 7 d(b, x) d(a, x)
d(a, b) 5 d(b, y) d(a, y) d(a, b) 4 d(a,
w) d(z, w) d(a, z) 4 d(a, v) d(z, v)
d(a, z) 6 Correct!!!
v
D3
37
Neighbor-Joining
  • Guaranteed to produce the correct tree if
    distance is additive
  • May produce a good tree even when distance is not
    additive
  • Step 1 Finding neighboring leaves
  • Define
  • Dij (N 2) dij ?k?i dik ?k?j djk
  • Claim The above magic trick ensures that Dij
    is minimal iff i, j are neighbors

1
3
0.1
0.1
0.1
0.4
0.4
4
2
38
Algorithm Neighbor-joining
  • Initialization
  • Define T to be the set of leaf nodes, one per
    sequence
  • Let L T
  • Iteration
  • Pick i, j s.t. Dij is minimal
  • Define a new node k, and set dkm ½ (dim djm
    dij) for all m ? L
  • Add k to T, with edges of lengths dik ½ (dij
    ri rj), djk dij dik
  • where ri (N 2)-1 ?k?i dik
  • Remove i, j from L
  • Add k to L
  • Termination
  • When L consists of two nodes, i, j, and the edge
    between them of length dij

39
Parsimony direct method not using distances
  • One of the most popular methods
  • GIVEN multiple alignment
  • FIND tree history of substitutions explaining
    alignment
  • Idea
  • Find the tree that explains the observed
    sequences with a minimal number of substitutions
  • Two computational subproblems
  • Find the parsimony cost of a given tree (easy)
  • Search through all tree topologies (hard)

40
Example Parsimony cost of one column
A Final cost C 1
A
A, B Cost C1
A B A A
A
A
B
A
B
A
A
A
41
Parsimony Scoring
  • Given a tree, and an alignment column u
  • Label internal nodes to minimize the number of
    required substitutions
  • Initialization
  • Set cost C 0 node k 2N 1 (last leaf)
  • Iteration
  • If k is a leaf, set Rk xku // Rk is
    simply the character of kth species
  • If k is not a leaf,
  • Let i, j be the daughter nodes
  • Set Rk Ri ? Rj if intersection is nonempty
  • Set Rk Ri ? Rj, and C 1, if intersection
    is empty
  • Termination
  • Minimal cost of tree for column u, C

42
Example
B
A,B
A
B
A
A,B
A
A
A
A
B
B
A
B
A
A
A
A
B
A
B
A
B
43
Traceback to find ancestral nucleotides
  • Traceback
  • Choose an arbitrary nucleotide from R2N 1 for
    the root
  • Having chosen nucleotide r for parent k,
  • If r ? Ri choose r for daughter i
  • Else, choose arbitrary nucleotide from Ri
  • Easy to see that this traceback produces some
    assignment of cost C

44
Example
Admissible with Traceback
x
B
Still optimal, but inadmissible with Traceback
A
A, B
B
A
x
A
B
A
A, B
A
B
B
x
B
x
A
A
B
B
A
A
A
B
B
B
A
B
A
A
x
A
x
A
A
B
B
45
Probabilistic Methods
xroot
t1
t2
x1
x2
  • A more refined measure of evolution along a tree
    than parsimony
  • P(x1, x2, xroot t1, t2) P(xroot) P(x1 t1,
    xroot) P(x2 t2, xroot)
  • If we use Jukes-Cantor, for example, and x1
    xroot A, x2 C, t1 t2 1,
  • pA?¼(1 3e-4a) ?¼(1 e-4a) (¼)3(1
    3e-4a)(1 e-4a)

46
Probabilistic Methods
xroot
xu
x2
xN
x1
  • If we know all internal labels xu,
  • P(x1, x2, , xN, xN1, , x2N-1 T, t)
    P(xroot)?j?rootP(xj xparent(j), tj, parent(j))
  • Usually we dont know the internal labels,
    therefore
  • P(x1, x2, , xN T, t) ?xN1 ?xN2 ?x2N-1
    P(x1, x2, , x2N-1 T, t)

47
Felsensteins Likelihood Algorithm
  • To calculate P(x1, x2, , xN T, t)
  • Initialization
  • Set k 2N 1
  • Iteration Compute P(Lk a) for all a ? ?
  • If k is a leaf node
  • Set P(Lk a) 1(a xk)
  • If k is not a leaf node
  • 1. Compute P(Li b), P(Lj b) for all b, for
    daughter nodes i, j
  • 2. Set P(Lk a) ?b,c P(b a, ti) P(Li b)
    P(c a, tj) P(Lj c)
  • Termination
  • Likelihood at this column P(x1, x2, , xN
    T, t) ?aP(L2N-1 a)P(a)

Let P(Lk a) denote the prob. of all the
leaves below node k, given that the residue at
k is a
48
Probabilistic Methods
  • Given M (ungapped) alignment columns of N
    sequences,
  • Define likelihood of a tree
  • L(T, t) P(Data T, t) ?m1M P(x1m, , xnm,
    T, t)
  • Maximum Likelihood Reconstruction
  • Given data X (xij), find a topology T and
    length vector t that maximize likelihood L(T, t)

49
Current popular methods
  • HUNDREDS of programs available!
  • http//evolution.genetics.washington.edu/phylip/so
    ftware.htmlmethods
  • Some recommended programs
  • DiscreteParsimony-based
  • Rec-1-DCM3
  • http//www.cs.utexas.edu/users/tandy/mp.html
  • Tandy Warnow and colleagues
  • Probabilistic
  • SEMPHY
  • http//www.cs.huji.ac.il/labs/compbio/semphy/
  • Nir Friedman and colleagues
Write a Comment
User Comments (0)
About PowerShow.com