Title: Introduction to Dynamic Programming
1Introduction to Dynamic Programming
- Pair-wise Sequence Alignment Distance
2Outline
- DNA Sequence Comparison First Success Stories
- Dynamic Programming
- Change Problem
- Manhattan Tourist Problem
- Longest Paths in Graphs
- Pair-wise Sequence Alignment
- Edit Distance
- Longest Common Subsequence (LCS) Problem
3DNA Sequence Comparison First Success Story
- Finding sequence similarities with genes of known
function is a common approach to inferring a newly
sequenced gene's function
- In 1984, Russell Doolittle and colleagues found
similarities between a cancer-causing gene and the
normal growth factor (PDGF) gene
4Cystic Fibrosis
- Cystic fibrosis (CF) is a chronic and frequently
fatal genetic disease of the body's mucus glands
(abnormally high level of mucus in glands). CF
primarily affects the respiratory systems in
children.
- Mucus is a slimy material that coats many
epithelial surfaces and is secreted into fluids
such as saliva
- In the 1980s, biologists hypothesized that CF is an
autosomal recessive disorder caused by mutations
in a gene that remained unknown until 1989
5Finding Similarities between the Cystic Fibrosis
Gene and ATP binding proteins
- ATP binding proteins are present on the cell membrane
and act as transport channels
- In 1989, biologists found similarity between the
cystic fibrosis gene and ATP binding proteins
- This is a plausible function for the cystic fibrosis gene,
given the fact that CF involves sweat secretion
with abnormally high sodium levels
6Cystic Fibrosis and the CFTR Protein
- The CFTR (Cystic Fibrosis Transmembrane conductance
Regulator) protein acts in the cell membrane
of some epithelial cells that secrete mucus
- These special cells might line the airways of the
nose, lungs, the stomach wall, etc.
7Mechanism of Cystic Fibrosis
- The CFTR protein (1480 amino acids) regulates a
chloride ion channel
- Adjusts the wateriness of fluids secreted by
the cell
- Those with cystic fibrosis are missing one single
amino acid in their CFTR
- Mucus ends up being too thick, affecting many
organs
8Cystic Fibrosis Finding the Gene
9Cystic Fibrosis Mutation Analysis
- If a high % of cystic fibrosis (CF) patients have
a certain mutation in the gene and the normal
patients all have the wild type, then that could
be an indicator of a mutation that is related to
CF
- A certain mutation was found in 70% of CF
patients, convincing evidence that it is a
predominant genetic diagnostics marker for CF
10Cystic Fibrosis and CFTR Gene
The CF gene is on the long arm of chromosome 7
11Bring in the Bioinformaticians
- Similarities between two genes with known
and unknown function alert biologists to some
possibilities
- Computing a similarity score between two genes
tells how likely it is that they have similar
functions
- Dynamic programming is a commonly used technique
for pair-wise sequence alignment
- The Change Problem and the Manhattan Tourist Problem
are good problems to introduce the idea of
dynamic programming
12The Change Problem
- Specify the problem precisely
- Goal: Convert some amount of money M into given
denominations, using the fewest possible number
of coins
- Input: An amount of money M, and an array of d
denominations $c = (c_1, c_2, \ldots, c_d)$, in
decreasing order of value ($c_1 > c_2 > \cdots > c_d$)
- Output: A list of d integers $i_1, i_2, \ldots, i_d$ such
that
- $c_1 i_1 + c_2 i_2 + \cdots + c_d i_d = M$
- and $i_1 + i_2 + \cdots + i_d$ is minimal
13A Correct But VERY Slow Algorithm
- BruteForceChange(M, c, d)
-   smallestNumberOfCoins ← ∞
-   for each (i1, …, id) from (0, …, 0) to (M/c1, …, M/cd)
-     valueOfCoins ← Σ ik·ck (k = 1 to d)
-     if valueOfCoins = M
-       numberOfCoins ← Σ ik (k = 1 to d)
-       if numberOfCoins < smallestNumberOfCoins
-         smallestNumberOfCoins ← numberOfCoins
-         bestChange ← (i1, i2, …, id)
-   return (bestChange)
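The exhaustive search above can be sketched in Python as follows. This is a minimal illustration, not the original author's code: the function name `brute_force_change` and the use of `itertools.product` to enumerate every tuple (i1, …, id) are assumptions made for this example.

```python
from itertools import product


def brute_force_change(M, c):
    """Try every tuple (i1, ..., id) with 0 <= ik <= M // ck.

    Returns the tuple that sums to M with the fewest coins,
    or None if no combination makes exact change.
    """
    smallest_number_of_coins = float("inf")
    best_change = None
    # One range per denomination: how many coins of that value to use.
    ranges = [range(M // ck + 1) for ck in c]
    for combo in product(*ranges):
        value_of_coins = sum(i * ck for i, ck in zip(combo, c))
        if value_of_coins == M:
            number_of_coins = sum(combo)
            if number_of_coins < smallest_number_of_coins:
                smallest_number_of_coins = number_of_coins
                best_change = combo
    return best_change


# Hypothetical usage: change for 9 with denominations 7, 3, 1.
print(brute_force_change(9, (7, 3, 1)))
```

The number of tuples examined is the product of the range sizes, which is why this approach becomes very slow as M grows.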
14Change Problem Example
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Only one coin is needed to make change for the
values 1, 3, and 5
15Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value:           1  2  3  4  5  6  7  8  9  10
Min # of coins:  1  2  1  2  1  2  -  2  -  2
However, two coins are needed to make change for
the values 2, 4, 6, 8, and 10.
16Change Problem Example (contd)
Given the denominations 1, 3, and 5, what is the
minimum number of coins needed to make change for
a given value?
Value:           1  2  3  4  5  6  7  8  9  10
Min # of coins:  1  2  1  2  1  2  3  2  3  2
Lastly, three coins are needed to make change for
the values 7 and 9
17Change Problem Recurrence
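This example is expressed by the following
recurrence relation:
$$\text{minNumCoins}(M) = \min \begin{cases} \text{minNumCoins}(M-1) + 1 \\ \text{minNumCoins}(M-3) + 1 \\ \text{minNumCoins}(M-5) + 1 \end{cases}$$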
18Change Problem Recurrence (contd)
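Given the denominations $c = (c_1, c_2, \ldots, c_d)$, the
recurrence relation is:
$$\text{minNumCoins}(M) = \min \begin{cases} \text{minNumCoins}(M-c_1) + 1 \\ \text{minNumCoins}(M-c_2) + 1 \\ \quad\vdots \\ \text{minNumCoins}(M-c_d) + 1 \end{cases}$$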
19Change Problem A Recursive Algorithm
- RecursiveChange(M, c, d)
-   if M = 0
-     return 0
-   bestNumCoins ← infinity
-   for i ← 1 to d
-     if M ≥ ci
-       numCoins ← RecursiveChange(M - ci, c, d)
-       if numCoins + 1 < bestNumCoins
-         bestNumCoins ← numCoins + 1
-   return bestNumCoins
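A direct Python transcription of the recursive pseudocode might look like the sketch below. The name `recursive_change` is an assumption for this example, and the parameter d is dropped because Python can iterate over the denominations directly.

```python
import math


def recursive_change(M, c):
    """Minimum number of coins for amount M, by plain recursion."""
    if M == 0:
        return 0
    best_num_coins = math.inf
    for ci in c:
        if M >= ci:
            # Use one coin of value ci, then solve the smaller amount.
            num_coins = recursive_change(M - ci, c)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins


# Keep M small: the number of recursive calls grows very quickly.
print(recursive_change(9, (1, 3, 7)))
```

If no combination of coins makes exact change, the function returns infinity.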
20RecursiveChange Is Not Efficient
- It recalculates the optimal coin combination for
a given amount of money repeatedly
- e.g., M = 77, c = (1, 3, 7)
- Optimal coin combo for 70 cents is computed 9
times!
21The RecursiveChange Tree
[Figure: recursion tree of RecursiveChange for M = 77, c = (1, 3, 7). The root 77 branches to 76, 74, and 70, each of which branches again; the amount 70 appears in several different branches.]
22We Can Do Better
- We're recomputing values in our algorithm more
than once
- Save the results of each computation for 0 to M
(memoization)
- This way, we can look up an already computed
value instead of recomputing it each time
- Running time: O(M · d), where M is the value of money
and d is the number of denominations
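One way to illustrate saving results is a dictionary that stores the answer for each amount the first time it is computed. The sketch below is a hedged example: `memoized_change` and its inner helper `best` are hypothetical names, and returning the number of stored entries is added only to show how few distinct subproblems exist.

```python
import math


def memoized_change(M, c):
    """Return (minimum coins for M, number of distinct amounts solved)."""
    memo = {0: 0}  # amount -> minimum number of coins

    def best(m):
        if m not in memo:
            # Each amount is solved once; later requests are lookups.
            memo[m] = min(
                (best(m - ci) + 1 for ci in c if m >= ci),
                default=math.inf,
            )
        return memo[m]

    return best(M), len(memo)


coins, solved = memoized_change(77, (1, 3, 7))
print(coins, solved)
```

With M = 77 and c = (1, 3, 7), at most 78 amounts (0 through 77) are ever solved, each examining d = 3 denominations.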
23DPChange Example
[Figure: DPChange filling in the table of minimum coin counts one amount at a time, for amounts 0 through 9, with c = (1, 3, 7) and M = 9.]
24The Change Problem Dynamic Programming
- DPChange(M, c, d)
-   bestNumCoins[0] ← 0
-   for m ← 1 to M
-     bestNumCoins[m] ← infinity
-     for i ← 1 to d
-       if m ≥ ci
-         if bestNumCoins[m - ci] + 1 < bestNumCoins[m]
-           bestNumCoins[m] ← bestNumCoins[m - ci] + 1
-   return bestNumCoins[M]
What is the complexity of this algorithm?
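The bottom-up table can be sketched in Python as below. The name `dp_change` is an assumption, and the sketch returns the whole table rather than only the last entry so that the values for every amount can be inspected.

```python
import math


def dp_change(M, c):
    """Return a list where entry m is the minimum number of coins for m."""
    best_num_coins = [0] + [math.inf] * M
    for m in range(1, M + 1):          # M iterations
        for ci in c:                   # d iterations each
            if m >= ci and best_num_coins[m - ci] + 1 < best_num_coins[m]:
                best_num_coins[m] = best_num_coins[m - ci] + 1
    return best_num_coins


# Denominations 1, 3, 5 and amounts 1 through 10.
print(dp_change(10, (5, 3, 1))[1:])
```

The two nested loops run M and d times respectively, so the table is filled in O(M · d) steps.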
25Dynamic Programming
- Basic idea: solve an instance of a problem by
taking advantage of computed solutions for
smaller subparts of the problem
- Initializing from the smallest cases
- Caching subproblem solutions (memoization) rather
than recomputing them
- Assumes a recursive relation between the current
problem and its smaller subparts
26Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) that
travels only eastward and southward and visits the
greatest number of attractions in the Manhattan
grid
[Figure: Manhattan street grid with the source and sink marked.]
27Manhattan Tourist Problem Formulation
Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct
vertices, one labeled source and the other
labeled sink
Output: A longest path in G from source to
sink
28MTP An Example
[Figure: weighted grid with i coordinates (rows) and j coordinates (columns) 0 to 4, source at the top left and sink at the bottom right, showing edge weights and the best score computed at each vertex.]
29MTP Greedy Algorithm Is Not Optimal
[Figure: weighted grid from source to sink in which the greedy path makes a promising start, but leads to bad choices!]
30MTP Recurrence
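Computing the score for a point (i, j) by the
recurrence relation:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$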
31MTP Simple Recursive Program
- MT(n, m)
-   x ← MT(n - 1, m) + length of the edge from (n - 1, m) to (n, m)
-   y ← MT(n, m - 1) + length of the edge from (n, m - 1) to (n, m)
-   return max{x, y}
- What's wrong with this approach?
32MTP Dynamic Programming
[Figure: top-left corner of the grid. The source has score 0; S0,1 = 1 and S1,0 = 5.]
- Calculate the optimal path score for each vertex in the
graph
- Each vertex's score is the maximum, over its
predecessor vertices, of the predecessor's score plus
the weight of the edge connecting them
33MTP Dynamic Programming (contd)
[Figure: next diagonal of the grid filled in. S0,2 = 3, S1,1 = 4, S2,0 = 8.]
34MTP Dynamic Programming (contd)
[Figure: next diagonal of the grid filled in. S0,3 = 8, S1,2 = 13, S2,1 = 9, S3,0 = 8.]
35MTP Dynamic Programming (contd)
[Figure: next diagonal of the grid filled in. S1,3 = 8, S2,2 = 12, S3,1 = 9. The greedy algorithm fails here!]
36MTP Dynamic Programming (contd)
[Figure: next diagonal of the grid filled in. S2,3 = 15, S3,2 = 9.]
37MTP Dynamic Programming (contd)
[Figure: completed grid, showing all back-traces. S3,3 = 16. Done!]
38The Dynamic-Programming Manhattan Tourist
Algorithm
- ManhattanTourist(wd, wr, n, m)
-   s(0, 0) ← 0
-   for i ← 1 to n
-     s(i, 0) ← s(i - 1, 0) + wd(i, 0)
-   for j ← 1 to m
-     s(0, j) ← s(0, j - 1) + wr(0, j)
-   for i ← 1 to n
-     for j ← 1 to m
-       s(i, j) ← max{ s(i - 1, j) + wd(i, j), s(i, j - 1) + wr(i, j) }
-   return s(n, m)
What's the complexity of this algorithm?
The running time is O(n × m) for an n by m grid.
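The grid algorithm can be sketched in Python as follows. The function name `manhattan_tourist`, the list-of-lists layout of the two weight matrices, and the small 2 by 2 example grid are all assumptions made for illustration; they are not the grid from the slides.

```python
def manhattan_tourist(down, right):
    """Length of a longest source-to-sink path in a weighted grid.

    down[i - 1][j]  is the weight of the edge (i - 1, j) -> (i, j).
    right[i][j - 1] is the weight of the edge (i, j - 1) -> (i, j).
    """
    n = len(down)          # number of downward steps
    m = len(right[0])      # number of rightward steps
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: only moves down
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):            # first row: only moves right
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Hypothetical 2 x 2 grid (3 x 3 vertices).
down = [[1, 0, 2],
        [4, 3, 1]]
right = [[3, 2],
         [0, 5],
         [1, 4]]
print(manhattan_tourist(down, right))
```

Every vertex is visited once and does a constant amount of work, which matches the O(n × m) running time.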
39Manhattan Is Not A Perfect Grid
What about diagonals?
- The score at point B is given by
40DAG Directed Acyclic Graph
- Since Manhattan is not a perfect regular grid, we
may represent it as a DAG
- DAG for the "dressing in the morning" problem
41Topological ordering
- 2 different topological orderings of the DAG
42Longest Path in DAG Problem
- Goal: Find a longest path between two vertices in
a weighted DAG
- Input: A weighted DAG G with source and sink
vertices
- Output: A longest path in G from source to sink
43Manhattan Is Not A Perfect Grid (contd)
Computing the score for point x is given by the
recurrence relation:
$$s_x = \max_{y \in \text{Predecessors}(x)} \{ s_y + \text{weight of the edge from } y \text{ to } x \}$$
- Predecessors(x): the set of vertices that have
edges leading to x
- The running time for a DAG G(V, E) (V is
the set of all vertices and E is the set of all
edges) is O(E), since each edge is evaluated
once
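The predecessor-based recurrence can be sketched in Python by visiting vertices in a topological order, so that every predecessor's score is known before it is needed. The function name `longest_path_length`, the dictionary layout, and the four-vertex example graph are assumptions for this illustration; `graphlib.TopologicalSorter` is from the Python standard library (3.9 and later).

```python
import math
from graphlib import TopologicalSorter


def longest_path_length(pred, source, sink):
    """pred maps each vertex x to {y: weight of the edge y -> x}."""
    # TopologicalSorter expects a mapping: vertex -> its predecessors.
    order = TopologicalSorter(
        {x: set(ys) for x, ys in pred.items()}
    ).static_order()
    s = {}
    for x in order:
        if x == source:
            s[x] = 0
            continue
        # Vertices not reachable from the source keep a score of -infinity.
        s[x] = max(
            (s[y] + w for y, w in pred.get(x, {}).items()),
            default=-math.inf,
        )
    return s[sink]


# Hypothetical DAG: A -> B (2), A -> C (1), B -> C (3), B -> D (1), C -> D (4).
pred = {"B": {"A": 2}, "C": {"A": 1, "B": 3}, "D": {"B": 1, "C": 4}}
print(longest_path_length(pred, "A", "D"))
```

Each edge contributes exactly one term to one `max`, which is the sense in which each edge is evaluated once.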
44Aligning Sequences without Insertions and
Deletions Hamming Distance
Given two DNA sequences V and W:
V = ATATATAT
W = TATATATA
- The Hamming distance $d_H(V, W) = 8$ is large,
but the sequences are very similar
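Hamming distance counts the positions at which two equal-length sequences differ. A minimal Python sketch (the name `hamming_distance` is an assumption for this example):

```python
def hamming_distance(v, w):
    """Number of positions i where the ith letters of v and w differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))
```

Because only letters at the same position are compared, a single shift makes every position a mismatch.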
45Aligning Sequences with Insertions and Deletions
However, by shifting one sequence over one
position:
V = -ATATATAT
W = TATATATA-
- Using Hamming distance neglects insertions and
deletions in DNA
- The edit distance d(v, w) = 2.
46Edit Distance
- Levenshtein (1966) introduced the edit distance of
two strings as the minimum number of elementary
operations (insertions, deletions, and
substitutions) to transform one string into the
other
- d(v, w) = minimum number of elementary operations
to transform v → w
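Edit distance can itself be computed by dynamic programming over prefixes of the two strings. The sketch below is a standard textbook formulation, not code from the slides; the name `edit_distance` is an assumption.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform v into w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between v[:i] and w[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all i letters
    for j in range(m + 1):
        d[0][j] = j                      # insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # match or substitution
    return d[n][m]


print(edit_distance("ATATATAT", "TATATATA"))
print(edit_distance("TGCATAT", "ATCCGAT"))
```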
47Edit Distance (contd)
Comparing the ith letter of v with the ith letter of w:
V = ATATATAT
W = TATATATA
Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance d(v, w) = 2 (one insertion and
one deletion)
48Edit Distance Example
- 5 edit operations: TGCATAT → ATCCGAT
- TGCATAT → (delete last T)
- TGCATA → (delete last A)
- TGCAT → (insert A at front)
- ATGCAT → (substitute C for 3rd G)
- ATCCAT → (insert G before last A)
- ATCCGAT (Done)
-TG-CATAT
ATCCGAT--
49Edit Distance Example (contd)
- 4 edit operations: TGCATAT → ATCCGAT
- TGCATAT → (insert A at front)
- ATGCATAT → (delete 6th T)
- ATGCAAT → (substitute G for 5th A)
- ATGCGAT → (substitute C for 3rd G)
- ATCCGAT (Done)
-TGCATAT
ATCCG-AT
50Alignment 2 row representation
Given 2 DNA sequences v and w:
v = ATGTTAT (m = 7)
w = ATCGTAC (n = 7)
Alignment: a 2 × k matrix (k ≥ m, n)
letters of v:  A T - G T T A T -
letters of w:  A T C G T - A - C
5 matches, 2 insertions, 2 deletions
51The Alignment Grid
- 2 sequences used for the grid
- V = ATGTTAT
- W = ATCGTAC
- Every alignment path is from source to sink
- Look for the path with the optimal score.
Definition of score?
52Longest Common Subsequence (LCS) Problem
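- Given two sequences $v = v_1 v_2 \ldots v_m$ and
$w = w_1 w_2 \ldots w_n$
- The LCS of v and w is a sequence of positions
in v: $1 \le i_1 < i_2 < \cdots < i_t \le m$
- and a sequence of positions in
w: $1 \le j_1 < j_2 < \cdots < j_t \le n$
- such that the $i_t$-th letter of v equals the $j_t$-th letter of w, and t is maximal
What is the score function here?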
53LCS Example
[Figure: LCS example grid, with elements of v along the i coords and elements of w along the j coords. Matches shown in red.]
Path: (0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
positions in v: 2 < 3 < 4 < 6 < 8
positions in w: 1 < 3 < 5 < 6 < 7
The LCS Problem can be expressed using a grid
similar to the Manhattan Tourist Problem grid
54Edit Graph for LCS Problem
[Figure: edit graph for the LCS problem. One axis is labeled with the letters A T C T G A T C.]
55LCS Dynamic Programming
- Find the LCS of two strings
- Input: A weighted graph G with two distinct
vertices, one labeled source and one labeled sink
- Output: A longest path in G from source to
sink
- Solve using an LCS edit graph with diagonals
replaced with +1 edges if they correspond to
matches; all other edges have weight 0.
56Computing LCS
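Let $v_i$ = prefix of v of length i: $v_1 \ldots v_i$
and $w_j$ = prefix of w of length j: $w_1 \ldots w_j$
The length of LCS($v_i$, $w_j$) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$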
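The prefix-based computation can be sketched in Python as a table indexed by prefix lengths. The name `lcs_table` is an assumption for this example; row 0 and column 0 correspond to empty prefixes and stay at zero.

```python
def lcs_table(v, w):
    """s[i][j] = length of an LCS of the prefixes v[:i] and w[:j]."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            # Best of coming from the top or from the left (weight 0 edges).
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            # Diagonal edge of weight 1 exists only on a match.
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s


table = lcs_table("ATGTTAT", "ATCGTAC")
print(table[-1][-1])
```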
57Align Two Strings
- Given the strings of DNA
- v = ATGTTAT
- w = ATCGTAC
- One possible alignment of the strings:
AT_GTTAT_
ATCGT_A_C
LCS score = 5. However, is this the optimal
alignment?
58Dynamic Programming Example
- There are no matches in the beginning of the
sequence
- Label column i = 1 to be all zero, and row j = 1 to
be all zero
[Figure: alignment grid with the first row and first column filled with 0.]
59Dynamic Programming Example
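[Figure: alignment grid with the first row and column at 0 and the next cells filled with 1.]
$$S_{i,j} = \max \begin{cases} \text{value from NW} + 1, & \text{if } v_i = w_j \\ \text{value from North (top)} \\ \text{value from West (left)} \end{cases}$$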
60Alignment Backtracking
- Arrows show where the score
originated from:
- ↓ if from the top
- → if from the left
- ↘ if $v_i = w_j$
61Backtracking Example
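Find the matches in row 2 and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since $v_i = w_j$, $S(i,j) = S(i-1,j-1) + 1$:
S(2,2) = S(1,1) + 1 = 1 + 1 = 2
S(2,5) = S(1,4) + 1 = 1 + 1 = 2
S(4,2) = S(3,1) + 1 = 1 + 1 = 2
S(5,2) = S(4,1) + 1 = 1 + 1 = 2
S(7,2) = S(6,1) + 1 = 1 + 1 = 2
[Figure: alignment grid with row 1 filled with 1, row 2 filled with 2 from column 2 onward, and columns 1 and 2 of the remaining rows filled with 1 and 2.]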
62Backtracking Example
Continuing with the scoring algorithm gives this
result (rows are letters of v, columns are letters of w):

         A  T  C  G  T  A  C
      0  0  0  0  0  0  0  0
   A  0  1  1  1  1  1  1  1
   T  0  1  2  2  2  2  2  2
   G  0  1  2  2  3  3  3  3
   T  0  1  2  2  3  4  4  4
   T  0  1  2  2  3  4  4  4
   A  0  1  2  2  3  4  5  5
   T  0  1  2  2  3  4  5  5
63Now What?
- LCS(v, w) created the alignment grid
- Now we need a way to read the best alignment of v
and w
- Follow the arrows backwards from the sink
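Following the arrows backwards can be sketched in Python without storing the arrows explicitly: at each cell, the move that produced the score is recovered by comparing it with its neighbors. The function name `lcs` is an assumption, and when several back-traces exist this sketch returns just one of them.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    # Walk back from the sink (n, m) toward the source (0, 0).
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            letters.append(v[i - 1])     # diagonal arrow: a match
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1                       # arrow from the top
        else:
            j -= 1                       # arrow from the left
    return "".join(reversed(letters))


print(lcs("ATGTTAT", "ATCGTAC"))
```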
64The LCS Problem
- The previous example was a solution to the
Longest Common Subsequence (LCS) problem, the
simplest form of sequence similarity analysis.
- In this form of alignment we eliminate mismatches
and allow only insertions and deletions.