Title: Dynamic Programming: Sequence alignment
1Dynamic Programming Sequence alignment
2DNA Sequence Comparison First Success Story
- Finding sequence similarities with genes of known
function is a common approach to infer a newly
sequenced genes function - In 1984 Russell Doolittle and colleagues found
similarities between cancer-causing gene and
normal growth factor (PDGF) gene - A normal growth gene switched on at the wrong
time causes cancer !
3Cystic Fibrosis
- Cystic fibrosis (CF) is a chronic and frequently
fatal genetic disease of the body's mucus glands.
CF primarily affects the respiratory systems in
children. - If a high of cystic fibrosis (CF) patients have
a certain mutation in the gene and the normal
patients dont, then that could be an indicator
of a mutation that is related to CF - A certain mutation was found in 70 of CF
patients, convincing evidence that it is a
predominant genetic diagnostics marker for CF
4Cystic Fibrosis and CFTR Gene
5Bring in the Bioinformaticians
- Gene similarities between two genes with known
and unknown function alert biologists to some
possibilities - Computing a similarity score between two genes
tells how likely it is that they have similar
functions - Dynamic programming is a technique for revealing
similarities between genes
6Motivating Dynamic Programming
7Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source
Sink
8Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source
Sink
9Manhattan Tourist Problem Formulation
Goal Find the longest path in a weighted grid.
Input A weighted grid G with two distinct
vertices, one labeled source and the other
labeled sink
Output A longest path in G from source to
sink
10MTP An Example
0
1
2
3
4
j coordinate
source
3
2
4
0
9
5
3
0
0
1
0
4
3
2
2
3
2
4
13
1
1
6
5
4
2
0
7
3
4
19
15
2
i coordinate
4
5
2
4
1
3
3
0
2
3
20
3
8
5
6
5
sink
2
1
3
2
23
4
11MTP Greedy Algorithm Is Not Optimal
1
2
5
source
3
10
5
5
2
5
1
3
5
3
1
4
2
3
promising start, but leads to bad choices!
5
0
2
0
0
0
0
sink
18
12MTP Simple Recursive Program
- MT(n,m)
- if n0 or m0
- return MT(n,m)
- x ? MT(n-1,m)
- length of the edge from (n-
1,m) to (n,m) - y ? MT(n,m-1)
- length of the edge from
(n,m-1) to (n,m) - return maxx,y
Whats wrong with this approach?
13Heres whats wrong
- M(n,m) needs M(n, m-1) and M(n-1, m)
- Both of these need M(n-1, m-1)
- So M(n-1, m-1) will be computed at least twice
- Dynamic programming the same idea as this
recursive algorithm, but keep all intermediate
results in a table and reuse
14MTP Dynamic Programming
j
0
1
source
1
0
1
S0,1 1
i
5
1
5
S1,0 5
- Calculate optimal path score for each vertex in
the graph - Each vertexs score is the maximum of the prior
vertices score plus the weight of the respective
edge in between
15MTP Dynamic Programming (contd)
j
0
1
2
source
1
2
0
1
3
S0,2 3
i
5
3
-5
1
5
4
S1,1 4
3
2
8
S2,0 8
16MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
S3,0 8
i
5
10
3
1
-5
1
5
4
13
S1,2 13
3
5
-5
2
8
9
S2,1 9
0
3
8
S3,0 8
17MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
S1,3 8
3
5
-3
3
-5
2
8
9
12
S2,2 12
0
0
0
3
8
9
S3,1 9
greedy alg. fails!
18MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
3
5
-3
2
3
3
-5
2
8
9
12
15
S2,3 15
0
0
-5
0
0
3
8
9
9
S3,2 9
19MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
Done!
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
(showing all back-traces)
3
5
-3
2
3
3
-5
2
8
9
12
15
0
0
-5
1
0
0
0
3
8
9
9
16
S3,3 16
20MTP Recurrence
Computing the score for a point (i,j) by the
recurrence relation
The running time is n x m for a n by m grid (n
of rows, m of columns)
21Manhattan Is Not A Perfect Grid
What about diagonals?
- The score at point B is given by
22Manhattan Is Not A Perfect Grid (contd)
Computing the score for point x is given by the
recurrence relation
- Predecessors (x) set of vertices that have
edges leading to x - The running time for a graph G(V, E)
(V is the set of all vertices and E is
the set of all edges) is O(E) since each
edge is evaluated once
23Traveling in the Grid
- By the time the vertex x is analyzed, the values
sy for all its predecessors y should be computed
otherwise we are in trouble. - We need to traverse the vertices in some order
- For a grid, can traverse vertices row by row,
column by column, or diagonal by diagonal - If we had, instead of a (directed) grid, an
arbitrary DAG (directed acyclic graph) ?
24Traversing the Manhattan Grid
a)
b)
- 3 different strategies
- a) Column by column
- b) Row by row
- c) Along diagonals
c)
25Topological Ordering
- A numbering of vertices of the graph is called
topological ordering of the DAG if every edge of
the DAG connects a vertex with a smaller label to
a vertex with a larger label - In other words, if vertices are positioned on a
line in an increasing order of labels then all
edges go from left to right. - How to obtain a topological sort (???)
26Longest Path in DAG Problem
- Goal Find a longest path between two vertices in
a weighted DAG - Input A weighted DAG G with source and sink
vertices - Output A longest path in G from source to sink
27Longest Path in DAG Dynamic Programming
- Suppose vertex v has indegree 3 and predecessors
u1, u2, u3 - Longest path to v from source is
- In General
- sv maxu (su weight of edge from u to v)
su1 weight of edge from u1 to v su2 weight
of edge from u2 to v su3 weight of edge from
u3 to v
28Alignment
29Aligning DNA Sequences
Alignment 2 x k matrix ( k ? m, n )
n 8
V ATCTGATG
m 7
W TGCATAC
V
W
30Longest Common Subsequence (LCS) Alignment
without Mismatches
- Given two sequences
- v v1 v2vm and w w1 w2wn
- The LCS of v and w is a sequence of positions
in - v 1
- and a sequence of positions in
- w 1
- such that it -th letter of v equals to jt-letter
of w and t is maximal
31LCS Example
i coords
elements of v
A
T
--
C
T
G
A
T
C
--
elements of w
--
T
G
C
T
--
A
--
C
A
j coords
(0,0)?
(1,0)?
(2,1)?
(2,2)?
(3,3)?
(3,4)?
(4,5)?
(5,5)?
(6,6)?
(7,6)?
(8,7)
positions in v
2 Matches shown in red
positions in w
1 Every common subsequence is a path in 2-D grid
32Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
33LCS Problem as Manhattan Tourist Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
34Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
35Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
Every path is a common subsequence. Every
diagonal edge adds an extra element to common
subsequence LCS Problem Find a path with maximum
number of diagonal edges
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
36Backtracking
- si,j allows us to compute the length of LCS for
vi and wj - sm,n gives us the length of the LCS for v and w
- How do we print the actual LCS ?
- At each step, we chose an optimal decision si,j
max () - Record which of the choices was made in order to
obtain this max
37Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
38Printing LCS Backtracking
- PrintLCS(b,v,i,j)
- if i 0 or j 0
- return
- if bi,j
- PrintLCS(b,v,i-1,j-1)
- print vi
- else
- if bi,j
- PrintLCS(b,v,i-1,j)
- else
- PrintLCS(b,v,i,j-1)
39From LCS to Alignment
- The Longest Common Subsequence problemthe
simplest form of sequence alignment allows only
insertions and deletions (no mismatches). - In the LCS Problem, we scored 1 for matches and 0
for indels - Consider penalizing indels and mismatches with
negative scores - Simplest scoring schema
- 1 match premium
- -µ mismatch penalty
- -s indel penalty
40Simple Scoring
- When mismatches are penalized by µ, indels are
penalized by s, - and matches are rewarded with 1,
- the resulting score is
- matches µ(mismatches) s (indels)
41The Global Alignment Problem
- Find the best alignment between two strings under
a given scoring schema - Input Strings v and w and a scoring schema
- Output Alignment of maximum score
- ?? -s
- 1 if match
- -µ if mismatch
- si-1,j-1 1 if vi wj
- si,j max s i-1,j-1 -µ if vi ? wj
- s i-1,j - s
- s i,j-1 - s
m mismatch penalty s indel penalty
42Scoring Matrices
- To generalize scoring, consider a (41) x(41)
scoring matrix d. - In the case of an amino acid sequence alignment,
the scoring matrix would be a (201)x(201) size.
The addition of 1 is to include the score for
comparison of a gap character -. - This will simplify the algorithm as follows
- si-1,j-1 d (vi, wj)
- si,j max s i-1,j d (vi, -)
- s i,j-1 d (-, wj)
43Making a Scoring Matrix
- Scoring matrices are created based on biological
evidence. - Alignments can be thought of as two sequences
that differ due to mutations. - Some of these mutations have little effect on the
proteins function, therefore some penalties,
d(vi , wj), will be less harsh than others.
44Scoring Matrix Example
- Notice that although R and K are different amino
acids, they have a positive score. - Why? They are both positively charged amino
acids? will not greatly change function of
protein.
45Conservation
- Amino acid changes that tend to preserve the
physico-chemical properties of the original
residue - Polar to polar
- aspartate ? glutamate
- Nonpolar to nonpolar
- alanine ? valine
- Similarly behaving residues
- leucine to isoleucine
46Local vs. Global Alignment
- The Global Alignment Problem tries to find the
longest path between vertices (0,0) and (n,m) in
the edit graph. - The Local Alignment Problem tries to find the
longest path among paths between arbitrary
vertices (i,j) and (i, j) in the edit graph. - In the edit graph with negatively-scored edges,
Local Alignment may score higher than Global
Alignment
47Local Alignment Example
Local alignment
Global alignment
48Local Alignments Why?
- Two genes in different species may be similar
over short conserved regions and dissimilar over
remaining regions. - Example
- Homeobox genes have a short region called the
homeodomain that is highly conserved between
species. - A global alignment would not find the homeodomain
because it would try to align the ENTIRE sequence
49The Local Alignment Problem
- Goal Find the best local alignment between two
strings - Input Strings v, w and scoring matrix d
- Output Alignment of substrings of v and w whose
alignment score is maximum among all possible
alignment of all possible substrings
50The Problem Is
- Long run time O(n4)
- - In the grid of size n x n there are n2
vertices (i,j) that may serve as a source. - - For each such vertex computing alignments
from (i,j) to (i,j) takes O(n2) time. - This can be remedied by allowing every point to
be the starting point
51The Local Alignment Recurrence
- The largest value of si,j over the whole edit
graph is the score of the best local alignment. - The recurrence
0 si,j max
si-1,j-1 d (vi, wj) s
i-1,j d (vi, -) s i,j-1
d (-, wj)
52Affine Gap Penalties
- In nature, a series of k indels often come as a
single event rather than a series of k single
nucleotide events
ATA__GC ATATTGC
ATAG_GC AT_GTGC
Normal scoring would give the same score for both
alignments
53Accounting for Gaps
- Gaps- contiguous sequence of spaces in one of the
rows - Score for a gap of length x is
- -(? sx)
- where ? 0 is the penalty for introducing a
gap - gap opening penalty
- ? will be large relative to s
- gap extension penalty
- because you do not want to add too much of a
penalty for extending the gap.
54Affine gap penalty in DP
- When computing si,j, need to look at si,j-1,
si,j-2, si,j-3,. and si-1,j, si-2,j, - Each cell needs O(n) time for update
- O(n2) cells
- Therefore, O(n3) algorithm
- We can still do this in O(n2) time
55Affine Gap Penalty Recurrences
Continue Gap in w (deletion)
si,j s i-1,j - s max s
i-1,j (?s) si,j s i,j-1 - s
max s i,j-1 (?s) si,j
si-1,j-1 d (vi, wj) max s i,j
s i,j
Start Gap in w (deletion) from middle
Continue Gap in v (insertion)
Start Gap in v (insertion)from middle
Match or Mismatch
End deletion from top
End insertion from bottom