Dynamic Programming: Edit Distance - PowerPoint PPT Presentation

About This Presentation
Title:

Dynamic Programming: Edit Distance

Description:

An Introduction to Bioinformatics Algorithms. Dynamic Programming: Edit Distance ... An Introduction to Bioinformatics Algorithms. www.bioalgorithms.info. The ... – PowerPoint PPT presentation

Slides: 87
Provided by: SophieDa9

Transcript and Presenter's Notes

Title: Dynamic Programming: Edit Distance


1
Dynamic Programming: Edit Distance
2
Outline
  • DNA Sequence Comparison: First Success Stories
  • Change Problem
  • Manhattan Tourist Problem
  • Longest Paths in Graphs
  • Sequence Alignment
  • Edit Distance
  • Longest Common Subsequence Problem
  • Dot Matrices

3
Outline: Changes
  • Manhattan Tourist Problem: add the tourist picture
    from the book. The slide "MTP: An Example" uses
    the style from the book; the others do not, so redo
    the follow-up slides. Give an extra slide with
    Manhattan Tourist code.
  • LCS: in the long series of matrices
    explaining alignments, the matrices are shifted and
    need to be realigned.

4
DNA Sequence Comparison: First Success Story
  • Finding sequence similarities with genes of known
    function is a common approach to inferring the
    function of a newly sequenced gene
  • In 1984 Russell Doolittle and colleagues found
    similarities between a cancer-causing gene and the
    normal growth factor (PDGF) gene

5
Cystic Fibrosis
  • Cystic fibrosis (CF) is a chronic and frequently
    fatal disease that causes abnormally high levels
    of mucus in the glands. CF primarily
    affects the respiratory system in children.
  • Mucus is a slimy material that coats many
    epithelial surfaces and is secreted into fluids
    such as saliva

6
Cystic Fibrosis: Inheritance
  • In the early 1980s biologists hypothesized that CF is
    a genetic disorder caused by mutations in a gene
    that remained unknown until 1989
  • Heterozygous carriers are asymptomatic
  • A person must be homozygous recessive in this gene
    to be diagnosed with CF

7
Cystic Fibrosis Finding the Gene
8
Finding Similarities between the Cystic Fibrosis
Gene and ATP Binding Proteins
  • ATP binding proteins are present on the cell membrane
    and act as transport channels
  • In 1989 biologists found a similarity between the
    cystic fibrosis gene and ATP binding proteins
  • This suggested a plausible function for the cystic
    fibrosis gene, given that CF involves sweat secretion
    with abnormally high sodium levels

9
Cystic Fibrosis Mutation Analysis
  • If a high % of cystic fibrosis (CF) patients have
    a certain mutation in the gene and healthy
    individuals don't, then that could be an indicator
    of a mutation that is related to CF
  • A certain mutation was found in 70% of CF
    patients, convincing evidence that it is a
    predominant genetic diagnostic marker for CF

10
Cystic Fibrosis and CFTR Gene
11
Cystic Fibrosis and the CFTR Protein
  • The CFTR (Cystic Fibrosis Transmembrane conductance
    Regulator) protein acts in the cell membrane
    of epithelial cells that secrete mucus
  • These cells line the airways of the nose, lungs,
    the stomach wall, etc.

12
Mechanism of Cystic Fibrosis
  • The CFTR protein (1480 amino acids) regulates a
    chloride ion channel
  • Adjusts the wateriness of fluids secreted by
    the cell
  • Those with cystic fibrosis are missing one single
    amino acid in their CFTR
  • Mucus ends up being too thick, affecting many
    organs

13
Bring in the Bioinformaticians
  • Similarities between a gene with known function
    and a gene with unknown function allow biologists
    to infer the function of the gene
  • Computing a similarity score between two genes
    tells how likely it is that they have similar
    functions
  • Dynamic programming is a technique for revealing
    similarities between sequences
  • The Change Problem is a good problem to introduce
    the idea of dynamic programming

14
The Change Problem
Goal: Convert some amount of money M into given
denominations, using the fewest possible number
of coins
Input: An amount of money M, and an array of d
denominations $c = (c_1, c_2, \ldots, c_d)$, in
decreasing order of value ($c_1 > c_2 > \cdots > c_d$)
Output: A list of d integers $i_1, i_2, \ldots, i_d$ such
that $c_1 i_1 + c_2 i_2 + \cdots + c_d i_d = M$ and
$i_1 + i_2 + \cdots + i_d$ is minimal
15
Change Problem Example
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Only one coin is needed to make change for the
values 1, 3, and 5
16
Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value            1   2   3   4   5   6   7   8   9  10
Min # of coins   1   2   1   2   1   2   -   2   -   2
However, two coins are needed to make change for
the values 2, 4, 6, 8, and 10.
17
Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value            1   2   3   4   5   6   7   8   9  10
Min # of coins   1   2   1   2   1   2   3   2   3   2
Lastly, three coins are needed to make change for
the values 7 and 9
18
Change Problem Recurrence
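This example is expressed by the following
recurrence relation:
$$\mathrm{bestNumCoins}(M) = \min \begin{cases} \mathrm{bestNumCoins}(M-1) + 1 \\ \mathrm{bestNumCoins}(M-3) + 1 \\ \mathrm{bestNumCoins}(M-5) + 1 \end{cases}$$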
19
Change Problem Recurrence (contd)
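Given the denominations $c = (c_1, c_2, \ldots, c_d)$, the
recurrence relation is:
$$\mathrm{bestNumCoins}(M) = \min \begin{cases} \mathrm{bestNumCoins}(M-c_1) + 1 \\ \mathrm{bestNumCoins}(M-c_2) + 1 \\ \quad\vdots \\ \mathrm{bestNumCoins}(M-c_d) + 1 \end{cases}$$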
20
Change Problem A Recursive Algorithm
RecursiveChange(M, c, d)
    if M = 0
        return 0
    bestNumCoins ← infinity
    for i ← 1 to d
        if M ≥ c_i
            numCoins ← RecursiveChange(M − c_i, c, d)
            if numCoins + 1 < bestNumCoins
                bestNumCoins ← numCoins + 1
    return bestNumCoins
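
The recursive algorithm above can be sketched in Python as follows. This is only an illustrative translation: the function name `recursive_change` and the use of a tuple for the denominations are assumptions, not part of the slides.

```python
import math

def recursive_change(M, c):
    """Fewest coins from denominations c that sum to M (exponential time)."""
    if M == 0:
        return 0
    best_num_coins = math.inf
    for coin in c:
        if M >= coin:
            # Solve the smaller problem, then add the coin just used.
            num_coins = recursive_change(M - coin, c)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins

# Denominations 1, 3 and 5 from the Change Problem example.
print([recursive_change(m, (5, 3, 1)) for m in range(1, 11)])
```

Small amounts finish quickly, but the running time grows exponentially with M because the same sub-amounts are solved again and again.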

21
The RecursiveChange Tree
[Figure: recursion tree of RecursiveChange for M = 77 and c = (1,3,7). The root 77 branches to 76, 74, and 70, and every subproblem branches again in the same way; the value 70 appears repeatedly in the tree.]
22
RecursiveChange Is Not Efficient
  • It recalculates the optimal coin combination for
    a given amount of money repeatedly
  • e.g., M = 77, c = (1,3,7)
  • Optimal coin combo for 70 cents is computed 9
    times!

23
RecursiveChange Is Not Efficient
  • Optimal coin combo for 50 cents is computed
    billions of times!

24
We Can Do Better
  • We're re-computing values in our algorithm more
    than once
  • Save results of each computation for 0 to M
  • This way, we can look up an
    already computed value instead of re-computing it
    each time
  • Running time: O(M·d), where M is the amount of money
    and d is the number of denominations
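
One way to "save results" with minimal changes to the recursive version is memoization. The sketch below uses Python's `functools.lru_cache` for the lookup table; that choice, and the name `memoized_change`, are assumptions made for illustration rather than the method shown on the slides.

```python
import math
from functools import lru_cache

def memoized_change(M, c):
    """Fewest coins summing to M; each amount from 0 to M is solved once."""
    c = tuple(c)

    @lru_cache(maxsize=None)
    def best(m):
        if m == 0:
            return 0
        # default=math.inf covers amounts that no denomination fits into.
        return min((best(m - coin) + 1 for coin in c if m >= coin),
                   default=math.inf)

    return best(M)

print(memoized_change(77, (7, 3, 1)))
```

Each of the M + 1 amounts is evaluated once and tries d denominations, which matches the M·d running time stated above.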

25
The Change Problem Dynamic Programming
DPChange(M, c, d)
    bestNumCoins[0] ← 0
    for m ← 1 to M
        bestNumCoins[m] ← infinity
        for i ← 1 to d
            if m ≥ c_i
                if bestNumCoins[m − c_i] + 1 < bestNumCoins[m]
                    bestNumCoins[m] ← bestNumCoins[m − c_i] + 1
    return bestNumCoins[M]
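
A bottom-up Python sketch of DPChange follows. It returns the whole table rather than only the last entry so the intermediate values can be inspected; that return value and the name `dp_change` are illustrative assumptions.

```python
import math

def dp_change(M, c):
    """Return best_num_coins[0..M], the fewest coins needed for each amount."""
    best_num_coins = [0] + [math.inf] * M
    for m in range(1, M + 1):
        for coin in c:
            if m >= coin and best_num_coins[m - coin] + 1 < best_num_coins[m]:
                best_num_coins[m] = best_num_coins[m - coin] + 1
    return best_num_coins

table = dp_change(9, (1, 3, 7))
print(table)       # one entry per amount 0..9
print(table[9])    # answer for M = 9
```

The two nested loops make the O(M·d) running time visible directly.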

26
DPChange Example
[Figure: the bestNumCoins table filled in one amount at a time, from m = 0 up to m = 9, for c = (1,3,7) and M = 9. The completed table is:]
m               0   1   2   3   4   5   6   7   8   9
bestNumCoins    0   1   2   1   2   3   2   1   2   3
27
Manhattan Tourist Problem (MTP)
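Imagine seeking a path from source to sink,
traveling only eastward and southward, that passes the
greatest number of attractions (*) in the Manhattan
grid
[Figure: Manhattan street grid with attractions marked (*), the source at one corner and the sink at the opposite corner]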
29
Manhattan Tourist Problem Formulation
Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct
vertices, one labeled source and the other
labeled sink
Output: A longest path in G from source to
sink
30
MTP An Example
[Figure: a weighted grid with rows indexed by the i coordinate and columns by the j coordinate (0 to 4), running from source to sink, with a weight on every edge and a path score at every vertex]
31
MTP Greedy Algorithm Is Not Optimal
[Figure: a weighted grid from source to sink on which a greedy path is traced]
Promising start, but leads to bad choices!
32
MTP Simple Recursive Program
MT(n, m)
    if n = 0 or m = 0
        return the score of the only path to (n, m), which runs along the border of the grid
    x ← MT(n − 1, m) + length of the edge from (n − 1, m) to (n, m)
    y ← MT(n, m − 1) + length of the edge from (n, m − 1) to (n, m)
    return max{x, y}

33
MTP Simple Recursive Program
MT(n, m)
    x ← MT(n − 1, m) + length of the edge from (n − 1, m) to (n, m)
    y ← MT(n, m − 1) + length of the edge from (n, m − 1) to (n, m)
    return max{x, y}
  • What's wrong with this approach?

34
MTP Dynamic Programming
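[Figure: top-left corner of the grid. Along row 0 the edge of weight 1 from the source gives S0,1 = 1; down column 0 the edge of weight 5 gives S1,0 = 5]
  • Calculate the optimal path score for each vertex in
    the graph
  • Each vertex's score is the maximum, over its
    predecessor vertices, of the predecessor's score
    plus the weight of the connecting edge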

35
MTP Dynamic Programming (contd)
[Figure: the next diagonal of the grid is scored: S0,2 = 3, S1,1 = 4, S2,0 = 8]
36
MTP Dynamic Programming (contd)
[Figure: the next diagonal of the grid is scored: S0,3 = 8, S1,2 = 13, S2,1 = 9, S3,0 = 8]
37
MTP Dynamic Programming (contd)
[Figure: the next diagonal of the grid is scored: S1,3 = 8, S2,2 = 12, S3,1 = 9]
Greedy algorithm fails!
38
MTP Dynamic Programming (contd)
[Figure: the next diagonal of the grid is scored: S2,3 = 15, S3,2 = 9]
39
MTP Dynamic Programming (contd)
Done! (showing all back-traces)
[Figure: the completed grid, with S3,3 = 16 at the sink. The vertex scores are:]
        j=0  j=1  j=2  j=3
i=0       0    1    3    8
i=1       5    4   13    8
i=2       8    9   12   15
i=3       8    9    9   16
40
MTP Recurrence
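Computing the score for a point (i,j) by the
recurrence relation:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$
The running time is O(nm) for an n by m grid (n = #
of rows, m = # of columns)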
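
The grid recurrence can be sketched in Python as below. The edge weights are an assumption: they are one reading of the flattened figures in the worked MTP example, chosen so that the computed scores agree with the values quoted there (for instance S1,2 = 13 and S3,3 = 16). The function name and the `down`/`right` layout are likewise illustrative.

```python
def manhattan_tourist(down, right):
    """Longest-path scores for every vertex of a grid.

    down[i][j]  : weight of the edge from (i, j) to (i + 1, j)
    right[i][j] : weight of the edge from (i, j) to (i, j + 1)
    """
    n = len(down)        # number of southward steps
    m = len(right[0])    # number of eastward steps
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: only one way in
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):            # first row: only one way in
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s

# Assumed edge weights (see the note above).
down = [[5, 3, 10, -5], [3, 5, -3, 2], [0, 0, -5, 1]]
right = [[1, 2, 5], [-5, 1, -5], [-5, 3, 3], [0, 0, 0]]
for row in manhattan_tourist(down, right):
    print(row)
```

Every vertex is visited once and inspects two incoming edges, hence the O(nm) running time.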
41
The City of Manhattan Is Not A Perfect Grid
What about diagonals?
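  • The score at point B is the maximum, over every point A with an edge into B (diagonal edges included), of the score at A plus the weight of the edge (A, B)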

42
Traveling in an Arbitrary Graph
Computing the score for point x is given by the
recurrence relation:
$$s_x = \max_{y \in \mathrm{Predecessors}(x)} \left( s_y + \text{weight of edge } (y, x) \right)$$
  • Predecessors(x): the set of vertices that have
    edges leading to x
  • The running time for a graph with E edges is O(E),
    since each edge is evaluated once

43
Traveling in an Arbitrary Graph
  • The only hitch is that one must decide on the
    order in which to visit the vertices
  • By the time the vertex x is analyzed, the values
    s_y for all its predecessors y should be computed
  • We need to traverse the vertices in some order
  • Try to find such an order for a directed cycle
  • ???

44
DAG: Directed Acyclic Graph
  • Since Manhattan is not a perfect regular grid, we
    represent it as a DAG
  • DAG for the "dressing in the morning" problem

45
Topological Ordering
  • A labeling of the vertices of the graph is called a
    topological ordering of the DAG if every edge of
    the DAG connects a vertex with a smaller label to
    a vertex with a larger label
  • In other words, if vertices are positioned on a
    line in an increasing order of labels then all
    edges go from left to right.
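
A topological ordering can be computed with Kahn's algorithm, sketched below in Python. The slides do not say which algorithm to use, so this choice is an assumption, and the clothing items and edges in the example are hypothetical stand-ins for a "dressing in the morning" DAG.

```python
from collections import deque

def topological_order(vertices, edges):
    """Return the vertices in an order where every edge goes left to right."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:   # all predecessors of v are already placed
                ready.append(v)
    if len(order) != len(vertices):
        raise ValueError("directed cycle: no topological ordering exists")
    return order

# Hypothetical example graph.
vertices = ["underwear", "pants", "socks", "shoes", "shirt", "jacket"]
edges = [("underwear", "pants"), ("pants", "shoes"),
         ("socks", "shoes"), ("shirt", "jacket")]
print(topological_order(vertices, edges))
```

For a directed cycle no vertex ever reaches indegree zero, so the function raises an error instead of returning an order.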

46
Topological ordering
  • 2 different topological orderings of the DAG

47
Longest Path in DAG Problem
  • Goal: Find a longest path between two vertices in
    a weighted DAG
  • Input: A weighted DAG G with source and sink
    vertices
  • Output: A longest path in G from source to sink

48
Longest Path in a DAG Dynamic Programming
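  • Suppose vertex v has indegree 3 and predecessors
    u1, u2, u3
  • Longest path to v from source is
    $$s_v = \max \begin{cases} s_{u_1} + \text{weight of edge from } u_1 \text{ to } v \\ s_{u_2} + \text{weight of edge from } u_2 \text{ to } v \\ s_{u_3} + \text{weight of edge from } u_3 \text{ to } v \end{cases}$$
  • In general:
    $$s_v = \max_u \left( s_u + \text{weight of edge from } u \text{ to } v \right)$$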
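
The general recurrence can be sketched in Python by visiting vertices in topological order. The sketch relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9+); the function name, the edge-dictionary format, and the small example graph with its weights are all hypothetical.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def longest_path(edges, source, sink):
    """edges maps (u, v) -> weight. Returns (score, path) from source to sink."""
    predecessors = {source: set(), sink: set()}
    for u, v in edges:
        predecessors.setdefault(u, set())
        predecessors.setdefault(v, set()).add(u)
    score = {x: -math.inf for x in predecessors}
    score[source] = 0
    back = {}
    # static_order() yields every vertex after all of its predecessors.
    for v in TopologicalSorter(predecessors).static_order():
        for u in predecessors[v]:
            candidate = score[u] + edges[(u, v)]
            if candidate > score[v]:
                score[v] = candidate
                back[v] = u
    if score[sink] == -math.inf:
        return -math.inf, []          # sink is not reachable from source
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return score[sink], path[::-1]

# Hypothetical DAG: v has indegree 3 with predecessors u1, u2, u3.
example_edges = {("source", "u1"): 2, ("source", "u2"): 4, ("source", "u3"): 1,
                 ("u1", "u2"): 1, ("u1", "v"): 5, ("u2", "v"): 1, ("u3", "v"): 7}
print(longest_path(example_edges, "source", "v"))
```

Because each edge is examined once, the work after sorting is proportional to the number of edges.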
49
Traversing the Manhattan Grid
  • 3 different strategies:
  • a) Column by column
  • b) Row by row
  • c) Along diagonals
[Figure: three panels, a), b) and c), one for each traversal strategy]
50
Sequence Alignment Two Row Representation
Given 2 DNA sequences v and w:
v (m = 7)
w (n = 6)
Alignment: a 2 × k matrix (k > m, n)
letters of v:  A  T  -  G  T  A  T  -  T
letters of w:  A  T  C  G  -  A  -  C  T
4 matches, 2 insertions, 2 deletions
51
Aligning DNA Sequences
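V = ATCTGATG (m = 8)
W = TGCATAC (n = 7)
matches: 4, mismatches: 1, insertions: 2, deletions: 2
[Figure: alignment of V and W as a two-row matrix, with columns marked as match, mismatch, insertion, or deletion; insertions and deletions together are called indels]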
52
Common Subsequence Alignment without Mismatches
  • Given two sequences
  • $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$
  • A common subsequence of v and w is a sequence
    of positions in
  • v: $1 \le i_1 < i_2 < \cdots < i_t \le m$
  • and a sequence of positions in
  • w: $1 \le j_1 < j_2 < \cdots < j_t \le n$
  • such that, for every k from 1 to t, the $i_k$-th
    letter of v equals the $j_k$-th letter of w

53
Common Subsequence Example
i coords:       0  1  2  2  3  3  4  5  6  7  8
elements of v:     A  T  -  C  -  T  G  A  T  C
elements of w:     -  T  G  C  A  T  -  A  -  C
j coords:       0  0  1  2  3  4  5  5  6  6  7
Path: (0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
Matches (shown in red on the slide) are the columns in which both rows hold the same letter
positions in v: 2 < 3 < 4 < 6 < 8
positions in w: 1 < 3 < 5 < 6 < 7
Every common subsequence is a path in the 2-D grid
54
Longest Common Subsequence Good Alignment
without Mismatches
  • Given two sequences
  • $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$
  • The Longest Common Subsequence (LCS) of v and w
    is a sequence of positions in
  • v: $1 \le i_1 < i_2 < \cdots < i_t \le m$
  • and a sequence of positions in
  • w: $1 \le j_1 < j_2 < \cdots < j_t \le n$
  • such that, for every k from 1 to t, the $i_k$-th
    letter of v equals the $j_k$-th letter of w AND t is maximal

55
LCS Dynamic Programming
  • Find the LCS of two strings
Input: A weighted graph G with two distinct
vertices, one labeled source and one labeled sink
Output: A longest path in G from source to
sink
  • Solve using an LCS edit graph whose diagonal
    edges have weight +1

56
LCS Problem as Manhattan Tourist Problem
[Figure: grid with the letters A T C T G A T C along the columns (j = 0 to 8) and the letters T G C A T A C along the rows (i = 0 to 7)]
57
Edit Graph for LCS Problem
[Figure: edit graph on the grid with columns A T C T G A T C (j = 0 to 8) and rows T G C A T A C (i = 0 to 7)]
58
Edit Graph for LCS Problem
[Figure: edit graph on the grid with columns A T C T G A T C (j = 0 to 8) and rows T G C A T A C (i = 0 to 7)]
Every path is a common subsequence. Every
diagonal edge adds an extra element to the common
subsequence.
LCS Problem: Find a path with the maximum
number of diagonal edges
59
Computing LCS
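Let $v_i$ = prefix of v of length i: $v_1 \ldots v_i$,
and $w_j$ = prefix of w of length j: $w_1 \ldots w_j$.
The length of LCS($v_i$, $w_j$) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$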
60
Computing LCS (contd)
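$$s_{i,j} = \max \begin{cases} s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$
[Figure: the vertex (i,j) with its three incoming edges: from (i-1,j) with weight 0, from (i,j-1) with weight 0, and from (i-1,j-1) with weight 1]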
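
The recurrence translates into a short table-filling routine. The Python sketch below computes only the scores; the function name is an assumption, and the strings ATGTTAT and ATCGTAC in the usage line are the ones that appear in the alignment examples of this presentation.

```python
def lcs_length_table(v, w):
    """s[i][j] = length of an LCS of the prefixes v[:i] and w[:j]."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])          # indel edges, weight 0
            if v[i - 1] == w[j - 1]:                         # diagonal edge, weight 1
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s

table = lcs_length_table("ATGTTAT", "ATCGTAC")
print(table[-1][-1])   # score at the sink
```

Python strings are 0-indexed, so the letter written v_i in the recurrence is `v[i - 1]` in the code.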
61
Every Path in the Grid Corresponds to an
Alignment
       0 1 2 2 3 4
V =      A T - G T
W =      A T C G -
       0 1 2 3 4 4
[Figure: the corresponding path in the grid for V = ATGT and W = ATCG]
62
Aligning Sequences without Insertions and
Deletions Hamming Distance
Given two DNA sequences v and w:
v: ATATATAT
w: TATATATA
  • The Hamming distance dH(v, w) = 8 is large
    but the sequences are very similar

63
Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v: -ATATATAT
w: TATATATA-
  • The edit distance d(v, w) = 2.
  • Hamming distance neglects insertions and
    deletions in DNA

64
Edit Distance
  • Levenshtein (1966) introduced edit distance
    between two strings as the minimum number of
    elementary operations (insertions, deletions, and
    substitutions) to transform one string into the
    other

d(v,w) = minimum number of elementary operations
to transform v → w
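
Edit distance under this definition can be computed with the standard dynamic programming table. The Python sketch below assumes unit cost for each insertion, deletion, and substitution; the function name is illustrative, and the test strings ATATATAT and TATATATA are taken from the Hamming distance comparison in this presentation.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # delete all i letters of v[:i]
    for j in range(m + 1):
        d[0][j] = j                       # insert all j letters of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = d[i - 1][j - 1] + (v[i - 1] != w[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1,    # insertion
                          substitution)       # match (cost 0) or substitution
    return d[n][m]

print(edit_distance("ATATATAT", "TATATATA"))
```

The table has (n + 1)(m + 1) entries and each is filled in constant time.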
65
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter
of v with the i-th letter of w
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.
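
To illustrate why the task is trivial, Hamming distance is a single pass over paired letters. This is a minimal Python sketch; the function name and the error raised for unequal lengths are assumptions.

```python
def hamming_distance(v, w):
    """Number of positions at which equal-length strings v and w differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))

print(hamming_distance("ATATATAT", "TATATATA"))
```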

66
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter
of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.
Edit distance may compare the i-th letter of v
with the j-th letter of w:
V = -ATATATAT
W = TATATATA-
Just one shift makes it all line up.
Edit distance: d(v, w) = 2
Computing edit distance is a non-trivial task.
67
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter
of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Edit distance may compare the i-th letter of v
with the j-th letter of w:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2 (one insertion and one
deletion)
How to find which j goes with which i?
68
Edit Distance Example
  • TGCATAT → ATCCGAT in 5 steps
  • TGCATAT → (delete last T)
  • TGCATA → (delete last A)
  • TGCAT → (insert A at front)
  • ATGCAT → (substitute C for the G in position 3)
  • ATCCAT → (insert G before last A)
  • ATCCGAT (Done)


69
Edit Distance Example
  • What is the edit distance? 5?


70
Edit Distance Example (contd)
  • TGCATAT → ATCCGAT in 4 steps
  • TGCATAT → (insert A at front)
  • ATGCATAT → (delete the T in position 6)
  • ATGCAAT → (substitute G for the A in position 5)
  • ATGCGAT → (substitute C for the G in position 3)
  • ATCCGAT (Done)

71
Edit Distance Example (contd)
  • Can it be done in 3 steps???
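
A sequence of edit operations like the ones above can be replayed mechanically to confirm that it reaches the target string. The Python sketch below is a hypothetical helper: the `(operation, position, letter)` encoding with 1-based positions is an assumption made for illustration.

```python
def apply_edits(s, edits):
    """Apply (operation, position, letter) edits in order; positions are 1-based.

    Returns the list of intermediate strings, starting with s itself.
    """
    history = [s]
    for op, pos, letter in edits:
        k = pos - 1
        if op == "insert":          # letter becomes the character at `pos`
            s = s[:k] + letter + s[k:]
        elif op == "delete":        # letter is ignored
            s = s[:k] + s[k + 1:]
        elif op == "substitute":
            s = s[:k] + letter + s[k + 1:]
        else:
            raise ValueError(f"unknown operation: {op}")
        history.append(s)
    return history

# The four operations: insert A at front, delete position 6,
# substitute G at position 5, substitute C at position 3.
four_steps = [("insert", 1, "A"), ("delete", 6, None),
              ("substitute", 5, "G"), ("substitute", 3, "C")]
print(apply_edits("TGCATAT", four_steps))
```

Replaying a candidate sequence only shows that the distance is at most the number of steps; showing that no shorter sequence exists requires the dynamic programming computation.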

72
The Alignment Grid
  • Every alignment path is from source to sink

73
Alignment as a Path in the Edit Graph
       0 1 2 2 3 4 5 6 7 7
v =      A T _ G T T A T _
w =      A T C G T _ A _ C
       0 1 2 3 4 5 5 6 6 7
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
74
Alignments in Edit Graph (contd)
  • Horizontal and vertical edges represent indels in v
    and w, with score 0.
  • Diagonal edges represent matches, with score 1.
  • The score of the alignment path is 5.

75
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
76
Alignment as a Path in the Edit Graph
Old Alignment
       0122345677
v =     AT_GTTAT_
w =     ATCGT_A_C
       0123455667
New Alignment
       0122345677
v =     AT_GTTAT_
w =     ATCG_TA_C
       0123445667
77
Alignment as a Path in the Edit Graph
       0122345677
v =     AT_GTTAT_
w =     ATCGT_A_C
       0123455667
Path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
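
The correspondence between an alignment and a path can be made concrete with a few lines of Python. In this sketch the function name and the default gap symbol `_` are assumptions; the two rows in the usage line are the alignment shown above.

```python
def alignment_path(v_row, w_row, gap="_"):
    """Return the edit-graph path of an alignment and its number of matches."""
    if len(v_row) != len(w_row):
        raise ValueError("both rows of an alignment must have the same length")
    i = j = matches = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1                 # consumed a letter of v
        if b != gap:
            j += 1                 # consumed a letter of w
        if a == b and a != gap:
            matches += 1           # diagonal edge with score 1
        path.append((i, j))
    return path, matches

path, score = alignment_path("AT_GTTAT_", "ATCGT_A_C")
print(path)
print(score)
```

Each column advances i, j, or both, so an alignment with k columns always yields a path with k edges.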
78
Alignment Dynamic Programming
79
Dynamic Programming Example
Initialize 1st row and 1st column to be all
zeroes. Or, to be more precise, initialize 0th
row and 0th column to be all zeroes.
[Figure: dynamic programming table with the 0th row and 0th column filled with zeroes]
80
Dynamic Programming Example
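$$s_{i,j} = \max \begin{cases} \text{value from NW} + 1, & \text{if } v_i = w_j \\ \text{value from North (top)} \\ \text{value from West (left)} \end{cases}$$
[Figure: the table with row 1 and column 1 filled in with ones]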
81
Alignment Backtracking
  • Arrows show where the score
    originated from.
  • ↑ if from the top
  • ← if from the left
  • ↖ if v_i = w_j

82
Backtracking Example
Find a match in row and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since v_i = w_j, s_{i,j} = s_{i-1,j-1} + 1:
s_{2,2} = s_{1,1} + 1 = 1 + 1
s_{2,5} = s_{1,4} + 1 = 1 + 1
s_{4,2} = s_{3,1} + 1 = 1 + 1
s_{5,2} = s_{4,1} + 1 = 1 + 1
s_{7,2} = s_{6,1} + 1 = 1 + 1
[Figure: the table with the row 2 and column 2 entries filled in]
83
Backtracking Example
Continuing with the dynamic programming
algorithm gives this result.
[Figure: the completed dynamic programming table; the entry at the sink (bottom-right corner) is 5]
84
Alignment Dynamic Programming
85
Alignment Dynamic Programming
This recurrence corresponds to the Manhattan
Tourist problem (three incoming edges into a
vertex) with all horizontal and vertical edges
weighted by zero.
86
LCS Algorithm
LCS(v, w)
    for i ← 0 to n
        s_{i,0} ← 0
    for j ← 1 to m
        s_{0,j} ← 0
    for i ← 1 to n
        for j ← 1 to m
            s_{i,j} ← max { s_{i-1,j},
                            s_{i,j-1},
                            s_{i-1,j-1} + 1, if v_i = w_j }
            b_{i,j} ← "↑" if s_{i,j} = s_{i-1,j}
                      "←" if s_{i,j} = s_{i,j-1}
                      "↖" if s_{i,j} = s_{i-1,j-1} + 1
    return (s_{n,m}, b)



87
Now What?
  • LCS(v,w) created the alignment grid
  • Now we need a way to read the best alignment of v
    and w
  • Follow the arrows backwards from sink

88
Printing LCS Backtracking
PrintLCS(b, v, i, j)
    if i = 0 or j = 0
        return
    if b_{i,j} = "↖"
        PrintLCS(b, v, i − 1, j − 1)
        print v_i
    else
        if b_{i,j} = "↑"
            PrintLCS(b, v, i − 1, j)
        else
            PrintLCS(b, v, i, j − 1)
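
The table-filling and backtracking steps can be combined in one Python sketch. Several details are assumptions made for illustration: the arrow labels are stored as the strings "up", "left" and "diag", ties are broken in favor of "up", and the backtracking is written as a loop that returns the subsequence instead of printing it recursively.

```python
def lcs(v, w):
    """Return (length of an LCS, backtracking table b)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arrow = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, arrow = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, arrow = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, arrow
    return s[n][m], b

def read_lcs(b, v, i, j):
    """Follow the arrows backwards from (i, j) and collect matched letters."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

v, w = "ATGTTAT", "ATCGTAC"
length, b = lcs(v, w)
print(length, read_lcs(b, v, len(v), len(w)))
```

The loop form avoids Python's recursion limit on long sequences; the order of visited cells is the same as in the recursive version.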