Title: Dynamic Programming: Edit Distance
1Dynamic Programming: Edit Distance
2Outline
- DNA Sequence Comparison First Success Stories
- Change Problem
- Manhattan Tourist Problem
- Longest Paths in Graphs
- Sequence Alignment
- Edit Distance
- Longest Common Subsequence Problem
- Dot Matrices
3Outline CHANGE
- Manhattan Tourist Problem: ADD tourist picture from the book. The slide "MTP An Example" uses the style from the book; the others do not, so redo the follow-up slides. GIVE an extra slide with Manhattan Tourist code.
- LCS in the matrix: the long series of matrices explaining alignments is shifted and needs to be realigned.
4DNA Sequence Comparison First Success Story
- Finding sequence similarities with genes of known function is a common approach to inferring the function of a newly sequenced gene
- In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene
5Cystic Fibrosis
- Cystic fibrosis (CF) is a chronic and frequently fatal disease of the body's mucus glands (abnormally high level of mucus in glands). CF primarily affects the respiratory systems in children.
- Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva
6Cystic Fibrosis Inheritance
- In the early 1980s biologists hypothesized that CF is a genetic disorder caused by mutations in a gene that remained unknown until 1989
- Heterozygous carriers are asymptomatic
- A person must be homozygous recessive in this gene to be diagnosed with CF
7Cystic Fibrosis Finding the Gene
8Finding Similarities between the Cystic Fibrosis
Gene and ATP binding proteins
- ATP binding proteins are present on the cell membrane and act as transport channels
- In 1989 biologists found similarity between the cystic fibrosis gene and ATP binding proteins
- This suggests a plausible function for the cystic fibrosis gene, given that CF involves sweat secretion with an abnormally high sodium level
9Cystic Fibrosis Mutation Analysis
- If a high % of cystic fibrosis (CF) patients have a certain mutation in the gene and the normal patients don't, then that could be an indicator of a mutation that is related to CF
- A certain mutation was found in 70% of CF patients, convincing evidence that it is a predominant genetic diagnostics marker for CF
10Cystic Fibrosis and CFTR Gene
11Cystic Fibrosis and the CFTR Protein
- The CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in the cell membrane of epithelial cells that secrete mucus
- These cells line the airways of the nose, lungs, the stomach wall, etc.
12Mechanism of Cystic Fibrosis
- The CFTR protein (1480 amino acids) regulates a chloride ion channel
- Adjusts the wateriness of fluids secreted by the cell
- Those with cystic fibrosis are missing one single amino acid in their CFTR
- Mucus ends up being too thick, affecting many organs
13Bring in the Bioinformaticians
- Similarities between a gene with known function and a gene with unknown function allow biologists to infer the function of the unknown gene
- Computing a similarity score between two genes tells how likely it is that they have similar functions
- Dynamic programming is a technique for revealing similarities between sequences
- The Change Problem is a good problem to introduce the idea of dynamic programming
14The Change Problem
Goal: Convert some amount of money M into given denominations, using the fewest possible number of coins
Input: An amount of money M, and an array of d denominations $c = (c_1, c_2, \dots, c_d)$, in decreasing order of value ($c_1 > c_2 > \dots > c_d$)
Output: A list of d integers $i_1, i_2, \dots, i_d$ such that $c_1 i_1 + c_2 i_2 + \dots + c_d i_d = M$ and $i_1 + i_2 + \dots + i_d$ is minimal
15Change Problem Example
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Only one coin is needed to make change for the
values 1, 3, and 5
16Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value:           1  2  3  4  5  6  7  8  9  10
Min # of coins:  1  2  1  2  1  2  -  2  -  2
However, two coins are needed to make change for the values 2, 4, 6, 8, and 10.
17Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value:           1  2  3  4  5  6  7  8  9  10
Min # of coins:  1  2  1  2  1  2  3  2  3  2
Lastly, three coins are needed to make change for the values 7 and 9
18Change Problem Recurrence
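This example is expressed by the following recurrence relation:
$$\mathrm{bestNumCoins}_M = \min \begin{cases} \mathrm{bestNumCoins}_{M-1} + 1 \\ \mathrm{bestNumCoins}_{M-3} + 1 \\ \mathrm{bestNumCoins}_{M-5} + 1 \end{cases}$$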
19Change Problem Recurrence (contd)
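Given the denominations $c = (c_1, c_2, \dots, c_d)$, the recurrence relation is:
$$\mathrm{bestNumCoins}_M = \min \begin{cases} \mathrm{bestNumCoins}_{M-c_1} + 1 \\ \mathrm{bestNumCoins}_{M-c_2} + 1 \\ \quad \vdots \\ \mathrm{bestNumCoins}_{M-c_d} + 1 \end{cases}$$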
20Change Problem A Recursive Algorithm
RecursiveChange(M, c, d)
    if M = 0
        return 0
    bestNumCoins ← infinity
    for i ← 1 to d
        if M ≥ c_i
            numCoins ← RecursiveChange(M − c_i, c, d)
            if numCoins + 1 < bestNumCoins
                bestNumCoins ← numCoins + 1
    return bestNumCoins
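The RecursiveChange pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the deck's own code: the function name `recursive_change` is an assumption, and the denomination count d is taken from the length of `c` instead of being passed separately.

```python
import math


def recursive_change(M, c):
    """Brute-force minimum number of coins from denominations c that sum to M."""
    if M == 0:
        return 0
    best_num_coins = math.inf
    for coin in c:
        if M >= coin:
            # Solve the smaller problem M - coin, then add the coin just used.
            num_coins = recursive_change(M - coin, c)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins


# Denominations 1, 3, and 5 as in the change example; 9 needs three coins.
print(recursive_change(9, (5, 3, 1)))  # 3
```

Only small values of M are practical here, because the same subproblems are solved over and over.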
21The RecursiveChange Tree
[Figure: recursion tree of RecursiveChange for M = 77 and c = (1,3,7). The root 77 branches to 76, 74, and 70; every node branches again by subtracting 1, 3, or 7, so values such as 70 appear many times in the tree.]
22RecursiveChange Is Not Efficient
- It recalculates the optimal coin combination for a given amount of money repeatedly
- e.g., M = 77, c = (1,3,7)
- Optimal coin combo for 70 cents is computed 9 times!
- Optimal coin combo for 50 cents is computed billions of times!
24We Can Do Better
- We're re-computing values in our algorithm more than once
- Save results of each computation for 0 to M
- This way, we can do a reference call to find an already computed value, instead of re-computing each time
- Running time: O(M·d), where M is the amount of money and d is the number of denominations
25The Change Problem Dynamic Programming
DPChange(M, c, d)
    bestNumCoins_0 ← 0
    for m ← 1 to M
        bestNumCoins_m ← infinity
        for i ← 1 to d
            if m ≥ c_i
                if bestNumCoins_{m − c_i} + 1 < bestNumCoins_m
                    bestNumCoins_m ← bestNumCoins_{m − c_i} + 1
    return bestNumCoins_M
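A possible Python rendering of DPChange is sketched below. The name `dp_change` is an assumption, and the sketch returns the whole table instead of only the last entry so that the intermediate values can be inspected.

```python
import math


def dp_change(M, c):
    """Return the table best[0..M] of minimum coin counts for denominations c."""
    best = [0] + [math.inf] * M
    for m in range(1, M + 1):
        for coin in c:
            # Reuse the stored answer for m - coin instead of recomputing it.
            if m >= coin and best[m - coin] + 1 < best[m]:
                best[m] = best[m - coin] + 1
    return best


table = dp_change(9, (7, 3, 1))
print(table)      # [0, 1, 2, 1, 2, 3, 2, 1, 2, 3]
print(table[-1])  # 3
```

The two nested loops make the O(M·d) running time visible: each of the M table entries looks at each of the d denominations once.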
26DPChange Example
c = (1,3,7), M = 9
[Figure: the bestNumCoins array filled in one entry at a time, from m = 0 up to m = 9. The completed array is shown below.]
m:             0  1  2  3  4  5  6  7  8  9
bestNumCoins:  0  1  2  1  2  3  2  1  2  3
27Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the largest number of attractions (*) in the Manhattan grid
[Figure: the Manhattan grid with attractions marked (*), the source, and the sink]
29Manhattan Tourist Problem Formulation
Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled sink
Output: A longest path in G from source to sink
30MTP An Example
[Figure: a weighted grid with i (row) and j (column) coordinates 0 to 4, showing the source, the sink, the edge weights, and the score computed at each vertex]
31MTP Greedy Algorithm Is Not Optimal
[Figure: a weighted grid from source to sink on which the greedy path is traced: promising start, but leads to bad choices!]
32MTP Simple Recursive Program
MT(n, m)
    if n = 0 or m = 0
        return the length of the straight path from the source to (n, m) along the border of the grid
    x ← MT(n−1, m) + length of the edge from (n−1, m) to (n, m)
    y ← MT(n, m−1) + length of the edge from (n, m−1) to (n, m)
    return max{x, y}
33MTP Simple Recursive Program
MT(n, m)
    x ← MT(n−1, m) + length of the edge from (n−1, m) to (n, m)
    y ← MT(n, m−1) + length of the edge from (n, m−1) to (n, m)
    return max{x, y}
- What's wrong with this approach?
34MTP Dynamic Programming
[Figure: the top-left corner of the grid with the first scores filled in: $s_{0,1} = 1$, $s_{1,0} = 5$]
- Calculate the optimal path score for each vertex in the graph
- Each vertex's score is the maximum, over its prior vertices, of the prior vertex's score plus the weight of the respective edge
35MTP Dynamic Programming (contd)
[Figure: the grid after the next step: $s_{0,2} = 3$, $s_{1,1} = 4$, $s_{2,0} = 8$]
36MTP Dynamic Programming (contd)
[Figure: the grid after the next step: $s_{0,3} = 8$, $s_{1,2} = 13$, $s_{2,1} = 9$, $s_{3,0} = 8$]
37MTP Dynamic Programming (contd)
[Figure: the grid after the next step: $s_{1,3} = 8$, $s_{2,2} = 12$, $s_{3,1} = 9$]
greedy alg. fails!
38MTP Dynamic Programming (contd)
[Figure: the grid after the next step: $s_{2,3} = 15$, $s_{3,2} = 9$]
39MTP Dynamic Programming (contd)
[Figure: the completed grid, showing all back-traces: $s_{3,3} = 16$]
Done!
40MTP Recurrence
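Computing the score for a point (i,j) by the recurrence relation:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$
The running time is n x m for an n by m grid (n = # of rows, m = # of columns)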
41The City of Manhattan Is Not A Perfect Grid
What about diagonals?
- The score at point B is given by
42Traveling in an Arbitrary Graph
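Computing the score for point x is given by the recurrence relation:
$$s_x = \max_{y \in \mathrm{Predecessors}(x)} \left( s_y + \text{weight of the edge } (y, x) \right)$$
- Predecessors(x): the set of vertices that have edges leading to x
- The running time for a graph with E edges is O(E), since each edge is evaluated once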
43Traveling in an Arbitrary Graph (contd)
- The only hitch is that one must decide on the order in which to visit the vertices
- By the time the vertex x is analyzed, the values $s_y$ for all its predecessors y should be computed
- We need to traverse the vertices in some order
- Try to find such an order for a directed cycle: ???
44DAG Directed Acyclic Graph
- Since Manhattan is not a perfect regular grid, we represent it as a DAG
- DAG for the "dressing in the morning" problem
45Topological Ordering
- A labeling of the vertices of the graph is called a topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label
- In other words, if the vertices are positioned on a line in increasing order of labels, then all edges go from left to right.
46Topological ordering
- 2 different topological orderings of the DAG
47Longest Path in DAG Problem
- Goal: Find a longest path between two vertices in a weighted DAG
- Input: A weighted DAG G with source and sink vertices
- Output: A longest path in G from source to sink
48Longest Path in a DAG Dynamic Programming
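- Suppose vertex v has indegree 3 and predecessors $u_1, u_2, u_3$
- Longest path to v from source is:
$$s_v = \max \begin{cases} s_{u_1} + \text{weight of edge from } u_1 \text{ to } v \\ s_{u_2} + \text{weight of edge from } u_2 \text{ to } v \\ s_{u_3} + \text{weight of edge from } u_3 \text{ to } v \end{cases}$$
- In general:
$$s_v = \max_{u} \left( s_u + \text{weight of edge from } u \text{ to } v \right)$$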
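As an illustration of the general recurrence, the sketch below computes a longest-path score in a weighted DAG by first building a topological ordering and then scoring the vertices in that order. The ordering step uses Kahn's algorithm, which is a choice made for this sketch and is not named in the slides. The edge-dictionary format, the function name `longest_path`, and the example graph are also assumptions.

```python
import math
from collections import deque


def longest_path(edges, source, sink):
    """edges maps (u, v) -> weight; returns the longest source-to-sink path length."""
    succs, preds, indegree = {}, {}, {}
    for (u, v), w in edges.items():
        succs.setdefault(u, []).append(v)
        succs.setdefault(v, [])
        preds.setdefault(v, []).append((u, w))
        preds.setdefault(u, [])
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)

    # Kahn's algorithm: repeatedly remove a vertex with no unprocessed predecessors.
    queue = deque(v for v, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(indegree):
        raise ValueError("graph has a directed cycle; no topological ordering exists")

    # s[v] = max over predecessors u of s[u] + weight(u, v), evaluated in order.
    s = {v: -math.inf for v in order}
    s[source] = 0
    for v in order:
        for u, w in preds[v]:
            s[v] = max(s[v], s[u] + w)
    return s[sink]


# Hypothetical DAG: the best route a -> b -> d has length 1 + 10 = 11.
edges = {("a", "b"): 1, ("a", "c"): 5, ("b", "c"): 2, ("b", "d"): 10, ("c", "d"): 3}
print(longest_path(edges, "a", "d"))  # 11
```

Because every predecessor appears earlier in the ordering, each $s_u$ is final by the time it is used, and each edge is examined once.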
49Traversing the Manhattan Grid
- 3 different strategies:
- a) Column by column
- b) Row by row
- c) Along diagonals
50Sequence Alignment Two Row Representation
Given 2 DNA sequences v and w:
v: ATGTTAT (m = 7)
w: ATCGTAC (n = 7)
Alignment: 2 × k matrix (k ≥ m, n)
letters of v:  A   T   --  G   T   T   A   T   --
letters of w:  A   T   C   G   T   --  A   --  C
5 matches
2 insertions
2 deletions
51Aligning DNA Sequences
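V = ATCTGATG (m = 8)
W = TGCATAC (n = 7)
matches: 4, mismatches: 1, insertions: 2, deletions: 2
[Figure: alignment of V and W with a match, a mismatch, a deletion, and an insertion marked; insertions and deletions together are called indels]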
52Common Subsequence Alignment without Mismatches
- Given two sequences
- $v = v_1 v_2 \dots v_m$ and $w = w_1 w_2 \dots w_n$
- A Common Subsequence of v and w is a sequence of positions in
- v: $1 \le i_1 < i_2 < \dots < i_t \le m$
- and a sequence of positions in
- w: $1 \le j_1 < j_2 < \dots < j_t \le n$
- such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every $1 \le k \le t$
53Common Subsequence Example
i coords:       0   1   2   2   3   3   4   5   6   7   8
elements of v:      A   T   --  C   --  T   G   A   T   C
elements of w:      --  T   G   C   A   T   --  A   --  C
j coords:       0   0   1   2   3   4   5   5   6   6   7
(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
Positions of the matches:
positions in v: 2 < 3 < 4 < 6 < 8
positions in w: 1 < 3 < 5 < 6 < 7
Every common subsequence is a path in the 2-D grid
54Longest Common Subsequence Good Alignment
without Mismatches
- Given two sequences
- $v = v_1 v_2 \dots v_m$ and $w = w_1 w_2 \dots w_n$
- The Longest Common Subsequence (LCS) of v and w is a sequence of positions in
- v: $1 \le i_1 < i_2 < \dots < i_t \le m$
- and a sequence of positions in
- w: $1 \le j_1 < j_2 < \dots < j_t \le n$
- such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every $1 \le k \le t$, AND t is maximal
55LCS Dynamic Programming
- Find the LCS of two strings
Input: A weighted graph G with two distinct vertices, one labeled source, one labeled sink
Output: A longest path in G from source to sink
- Solve using an LCS edit graph with diagonals replaced with +1 edges
56LCS Problem as Manhattan Tourist Problem
[Figure: grid with columns j = 0 to 8 labeled by the letters A T C T G A T C and rows i = 0 to 7 labeled by the letters T G C A T A C]
57Edit Graph for LCS Problem
[Figure: edit graph with columns j = 0 to 8 labeled by the letters A T C T G A T C and rows i = 0 to 7 labeled by the letters T G C A T A C]
58Edit Graph for LCS Problem
[Figure: edit graph with columns j = 0 to 8 labeled by the letters A T C T G A T C and rows i = 0 to 7 labeled by the letters T G C A T A C]
Every path is a common subsequence. Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: Find a path with the maximum number of diagonal edges
59Computing LCS
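Let $v_i$ = prefix of v of length i: $v_1 \dots v_i$
and $w_j$ = prefix of w of length j: $w_1 \dots w_j$
The length of LCS($v_i$, $w_j$) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$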
60Computing LCS (contd)
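$$s_{i,j} = \max \begin{cases} s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$
[Figure: vertex (i,j) with incoming edges of weight 0 from (i-1,j) and (i,j-1), and an incoming edge of weight 1 from (i-1,j-1)]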
61Every Path in the Grid Corresponds to an
Alignment
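[Figure: grid with the letters of W = ATCG along one side and the letters of V = ATGT along the other, with the path of the alignment below marked]
      0  1  2  2  3  4
V =      A  T  -  G  T
W =      A  T  C  G  -
      0  1  2  3  4  4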
62Aligning Sequences without Insertions and
Deletions Hamming Distance
Given two DNA sequences v and w:
v: ATATATAT
w: TATATATA
- The Hamming distance $d_H(v, w) = 8$ is large, but the sequences are very similar
63Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v: -- A T A T A T A T
w: T  A T A T A T A --
- The edit distance d(v, w) = 2.
- Hamming distance neglects insertions and deletions in DNA
64Edit Distance
- Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other
d(v,w) = MIN number of elementary operations to transform v → w
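One standard way to compute this distance is a dynamic programming table over prefixes, in the same spirit as the earlier problems. The sketch below is an illustration under that assumption and is not taken from the slides: the function name `edit_distance` is hypothetical, and each insertion, deletion, and substitution is assumed to cost 1.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions, and substitutions turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # delete all i letters of the prefix of v
    for j in range(m + 1):
        d[0][j] = j              # insert all j letters of the prefix of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                   # deletion
                          d[i][j - 1] + 1,                   # insertion
                          d[i - 1][j - 1] + substitution)    # match or substitution
    return d[n][m]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
```

The table has (n+1)(m+1) entries and each is filled in constant time.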
65Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8. Computing Hamming distance is a trivial task.
Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2 (one insertion and one deletion). Computing edit distance is a non-trivial task.
How to find what j goes with what i ???
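To make the contrast concrete, Hamming distance needs only a single pass over the two strings. The helper name `hamming_distance` below is an assumption for this sketch.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    # The i-th letter of v is only ever compared with the i-th letter of w.
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```

No choice of which j goes with which i is involved, which is why this is the easy case.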
68Edit Distance Example
- TGCATAT → ATCCGAT in 5 steps
- TGCATAT → (delete last T)
- TGCATA → (delete last A)
- TGCAT → (insert A at front)
- ATGCAT → (substitute C for the 3rd letter, G)
- ATCCAT → (insert G before last A)
- ATCCGAT (Done)
- What is the edit distance? 5?
70Edit Distance Example (contd)
- TGCATAT → ATCCGAT in 4 steps
- TGCATAT → (insert A at front)
- ATGCATAT → (delete the 6th letter, T)
- ATGCAAT → (substitute G for the 5th letter, A)
- ATGCGAT → (substitute C for the 3rd letter, G)
- ATCCGAT (Done)
- Can it be done in 3 steps???
72The Alignment Grid
- Every alignment path is from source to sink
73Alignment as a Path in the Edit Graph
      0  1  2  2  3  4  5  6  7  7
v =      A  T  _  G  T  T  A  T  _
w =      A  T  C  G  T  _  A  _  C
      0  1  2  3  4  5  5  6  6  7
- Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
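The correspondence between an alignment and its path can be sketched as a small Python function: a letter in the v row advances the first coordinate, and a letter in the w row advances the second. The name `alignment_to_path` and the `gap` parameter are assumptions for this illustration.

```python
def alignment_to_path(v_row, w_row, gap="_"):
    """Convert the two rows of an alignment into a list of edit-graph vertices."""
    if len(v_row) != len(w_row):
        raise ValueError("both alignment rows must have the same number of columns")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1  # this column consumes a letter of v
        if b != gap:
            j += 1  # this column consumes a letter of w
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
# [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
```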
74Alignments in Edit Graph (contd)
- Horizontal and vertical edges represent indels in v and w, with score 0.
- Diagonal edges represent matches, with score 1.
- The score of the alignment path is 5.
75Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
76Alignment as a Path in the Edit Graph
Old Alignment
    0122345677
v = AT_GTTAT_
w = ATCGT_A_C
    0123455667
New Alignment
    0122345677
v = AT_GTTAT_
w = ATCG_TA_C
    0123445667
77Alignment as a Path in the Edit Graph
    0122345677
v = AT_GTTAT_
w = ATCGT_A_C
    0123455667
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
78Alignment Dynamic Programming
79Dynamic Programming Example
Initialize the 1st row and 1st column to be all zeroes. Or, to be more precise, initialize the 0th row and 0th column to be all zeroes.
[Figure: score table with the 0th row and 0th column set to 0]
80Dynamic Programming Example
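$$s_{i,j} = \max \begin{cases} \text{value from NW} + 1, & \text{if } v_i = w_j \\ \text{value from North (top)} \\ \text{value from West (left)} \end{cases}$$
[Figure: score table with the 0th row and 0th column set to 0, and the 1st row and 1st column filled with 1s]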
81Alignment Backtracking
- Arrows show where the score originated from:
- ↓ if from the top
- → if from the left
- ↘ if $v_i = w_j$
82Backtracking Example
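Find the matches in row 2 and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since $v_i = w_j$, $s_{i,j} = s_{i-1,j-1} + 1$:
$s_{2,2} = s_{1,1} + 1 = 1 + 1$
$s_{2,5} = s_{1,4} + 1 = 1 + 1$
$s_{4,2} = s_{3,1} + 1 = 1 + 1$
$s_{5,2} = s_{4,1} + 1 = 1 + 1$
$s_{7,2} = s_{6,1} + 1 = 1 + 1$
[Figure: score table with rows 0 and 1 and columns 0 and 1 filled in, and row 2 and column 2 being filled]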
83Backtracking Example
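Continuing with the dynamic programming algorithm gives this result.
          j  0  1  2  3  4  5  6  7
                A  T  C  G  T  A  C
i = 0        0  0  0  0  0  0  0  0
i = 1  A     0  1  1  1  1  1  1  1
i = 2  T     0  1  2  2  2  2  2  2
i = 3  G     0  1  2  2  3  3  3  3
i = 4  T     0  1  2  2  3  4  4  4
i = 5  T     0  1  2  2  3  4  4  4
i = 6  A     0  1  2  2  3  4  5  5
i = 7  T     0  1  2  2  3  4  5  5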
84Alignment Dynamic Programming
85Alignment Dynamic Programming
This recurrence corresponds to the Manhattan
Tourist problem (three incoming edges into a
vertex) with all horizontal and vertical edges
weighted by zero.
86LCS Algorithm
LCS(v, w)
    for i ← 0 to n
        s_{i,0} ← 0
    for j ← 1 to m
        s_{0,j} ← 0
    for i ← 1 to n
        for j ← 1 to m
            s_{i,j} ← max { s_{i-1,j},
                            s_{i,j-1},
                            s_{i-1,j-1} + 1, if v_i = w_j }
            b_{i,j} ← { "↓", if s_{i,j} = s_{i-1,j}
                        "→", if s_{i,j} = s_{i,j-1}
                        "↘", if s_{i,j} = s_{i-1,j-1} + 1 }
    return (s_{n,m}, b)
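A Python sketch of the LCS table computation follows. Several details are assumptions made for this illustration and are not fixed by the pseudocode: the backtracking entries are stored as the strings "diag", "up", and "left", and when several cases tie the diagonal is preferred, then "up".

```python
def lcs(v, w):
    """Return (length of an LCS of v and w, table b of backtracking pointers)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # row 0 and column 0 stay 0
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])
            is_match = v[i - 1] == w[j - 1]
            if is_match:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
            # Record which predecessor produced the score (tie-breaking assumed).
            if is_match and best == s[i - 1][j - 1] + 1:
                b[i][j] = "diag"
            elif best == s[i - 1][j]:
                b[i][j] = "up"
            else:
                b[i][j] = "left"
    return s[n][m], b


length, pointers = lcs("ATGTTAT", "ATCGTAC")
print(length)  # 5
```

The length 5 agrees with the score of the alignment path for these two sequences.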
87Now What?
- LCS(v,w) created the alignment grid
- Now we need a way to read the best alignment of v and w
- Follow the arrows backwards from sink
88Printing LCS Backtracking
PrintLCS(b, v, i, j)
    if i = 0 or j = 0
        return
    if b_{i,j} = "↘"
        PrintLCS(b, v, i−1, j−1)
        print v_i
    else
        if b_{i,j} = "↓"
            PrintLCS(b, v, i−1, j)
        else
            PrintLCS(b, v, i, j−1)
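The backtracking step can be sketched in Python as below. To keep the example self-contained it rebuilds the pointer table first. The helper names (`build_pointers`, `collect_lcs`, `longest_common_subsequence`), the pointer labels "diag", "up", and "left", and the tie-breaking order are all assumptions of this sketch, and the letters are collected into a list instead of being printed one at a time.

```python
def build_pointers(v, w):
    """Fill the LCS score table and return the table of backtracking pointers."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                # Taking a matching diagonal is always optimal for LCS.
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "up"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "left"
    return b


def collect_lcs(b, v, i, j, out):
    """Recursive backtracking in the style of PrintLCS; appends letters to out."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "diag":
        collect_lcs(b, v, i - 1, j - 1, out)
        out.append(v[i - 1])   # emitted after the recursive call, so order is preserved
    elif b[i][j] == "up":
        collect_lcs(b, v, i - 1, j, out)
    else:
        collect_lcs(b, v, i, j - 1, out)


def longest_common_subsequence(v, w):
    out = []
    collect_lcs(build_pointers(v, w), v, len(v), len(w), out)
    return "".join(out)


print(longest_common_subsequence("ATGTTAT", "ATCGTAC"))  # ATGTA
```

For long sequences an iterative loop would avoid Python's recursion limit; the recursive form is kept here only because it mirrors the pseudocode.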