Dynamic Programming: Sequence alignment - PowerPoint PPT Presentation

1 / 55

About This Presentation

Title:

Dynamic Programming: Sequence alignment

Description:

If a high % of cystic fibrosis (CF) patients have a certain mutation in the gene ... Cystic Fibrosis and CFTR Gene : Bring in the Bioinformaticians ... – PowerPoint PPT presentation

Number of Views:103

Avg rating:3.0/5.0

Slides: 56

Provided by: Saurab5

Category:

more less

Transcript and Presenter's Notes

Title: Dynamic Programming: Sequence alignment

1
Dynamic Programming Sequence alignment

CS 498 SS
Saurabh Sinha

2
DNA Sequence Comparison First Success Story

Finding sequence similarities with genes of known
function is a common approach to infer a newly
sequenced genes function
In 1984 Russell Doolittle and colleagues found
similarities between cancer-causing gene and
normal growth factor (PDGF) gene
A normal growth gene switched on at the wrong
time causes cancer !

3
Cystic Fibrosis

Cystic fibrosis (CF) is a chronic and frequently
fatal genetic disease of the body's mucus glands.
CF primarily affects the respiratory systems in
children.
If a high of cystic fibrosis (CF) patients have
a certain mutation in the gene and the normal
patients dont, then that could be an indicator
of a mutation that is related to CF
A certain mutation was found in 70 of CF
patients, convincing evidence that it is a
predominant genetic diagnostics marker for CF

4
Cystic Fibrosis and CFTR Gene
5
Bring in the Bioinformaticians

Gene similarities between two genes with known
and unknown function alert biologists to some
possibilities
Computing a similarity score between two genes
tells how likely it is that they have similar
functions
Dynamic programming is a technique for revealing
similarities between genes

6
Motivating Dynamic Programming
7
Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source

Sink
8
Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source

Sink
9
Manhattan Tourist Problem Formulation
Goal Find the longest path in a weighted grid.
Input A weighted grid G with two distinct
vertices, one labeled source and the other
labeled sink
Output A longest path in G from source to
sink
10
MTP An Example
0
1
2
3
4
j coordinate
source
3
2
4
0
9
5
3
0
0
1
0
4
3
2
2
3
2
4
13
1
1
6
5
4
2
0
7
3
4
19
15
2
i coordinate
4
5
2
4
1
3
3
0
2
3
20
3
8
5
6
5
sink
2
1
3
2
23
4
11
MTP Greedy Algorithm Is Not Optimal
1
2
5
source
3
10
5
5
2
5
1
3
5
3
1
4
2
3
promising start, but leads to bad choices!
5
0
2
0
0
0
0
sink
18
12
MTP Simple Recursive Program

MT(n,m)
if n0 or m0
return MT(n,m)
x ? MT(n-1,m)
length of the edge from (n-
1,m) to (n,m)
y ? MT(n,m-1)
length of the edge from
(n,m-1) to (n,m)
return maxx,y

Whats wrong with this approach?
13
Heres whats wrong

M(n,m) needs M(n, m-1) and M(n-1, m)
Both of these need M(n-1, m-1)
So M(n-1, m-1) will be computed at least twice
Dynamic programming the same idea as this
recursive algorithm, but keep all intermediate
results in a table and reuse

14
MTP Dynamic Programming
j
0
1
source
1
0
1
S0,1 1
i
5
1
5
S1,0 5

Calculate optimal path score for each vertex in
the graph
Each vertexs score is the maximum of the prior
vertices score plus the weight of the respective
edge in between

15
MTP Dynamic Programming (contd)
j
0
1
2
source
1
2
0
1
3
S0,2 3
i
5
3
-5
1
5
4
S1,1 4
3
2
8
S2,0 8
16
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
S3,0 8
i
5
10
3
1
-5
1
5
4
13
S1,2 13
3
5
-5
2
8
9
S2,1 9
0
3
8
S3,0 8
17
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
S1,3 8
3
5
-3
3
-5
2
8
9
12
S2,2 12
0
0
0
3
8
9
S3,1 9
greedy alg. fails!
18
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
3
5
-3
2
3
3
-5
2
8
9
12
15
S2,3 15
0
0
-5
0
0
3
8
9
9
S3,2 9
19
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
Done!
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
(showing all back-traces)
3
5
-3
2
3
3
-5
2
8
9
12
15
0
0
-5
1
0
0
0
3
8
9
9
16
S3,3 16
20
MTP Recurrence
Computing the score for a point (i,j) by the
recurrence relation
The running time is n x m for a n by m grid (n
of rows, m of columns)
21
Manhattan Is Not A Perfect Grid
What about diagonals?

The score at point B is given by

22
Manhattan Is Not A Perfect Grid (contd)
Computing the score for point x is given by the
recurrence relation

Predecessors (x) set of vertices that have
edges leading to x
The running time for a graph G(V, E)
(V is the set of all vertices and E is
the set of all edges) is O(E) since each
edge is evaluated once

23
Traveling in the Grid

By the time the vertex x is analyzed, the values
sy for all its predecessors y should be computed
otherwise we are in trouble.
We need to traverse the vertices in some order
For a grid, can traverse vertices row by row,
column by column, or diagonal by diagonal
If we had, instead of a (directed) grid, an
arbitrary DAG (directed acyclic graph) ?

24
Traversing the Manhattan Grid
a)
b)

3 different strategies
a) Column by column
b) Row by row
c) Along diagonals

c)
25
Topological Ordering

A numbering of vertices of the graph is called
topological ordering of the DAG if every edge of
the DAG connects a vertex with a smaller label to
a vertex with a larger label
In other words, if vertices are positioned on a
line in an increasing order of labels then all
edges go from left to right.
How to obtain a topological sort (???)

26
Longest Path in DAG Problem

Goal Find a longest path between two vertices in
a weighted DAG
Input A weighted DAG G with source and sink
vertices
Output A longest path in G from source to sink

27
Longest Path in DAG Dynamic Programming

Suppose vertex v has indegree 3 and predecessors
u1, u2, u3
Longest path to v from source is
In General
sv maxu (su weight of edge from u to v)

su1 weight of edge from u1 to v su2 weight
of edge from u2 to v su3 weight of edge from
u3 to v
28
Alignment
29
Aligning DNA Sequences
Alignment 2 x k matrix ( k ? m, n )
n 8
V ATCTGATG
m 7
W TGCATAC
V
W
30
Longest Common Subsequence (LCS) Alignment
without Mismatches

Given two sequences
v v1 v2vm and w w1 w2wn
The LCS of v and w is a sequence of positions
in
v 1
and a sequence of positions in
w 1
such that it -th letter of v equals to jt-letter
of w and t is maximal

31
LCS Example
i coords
elements of v
A
T
--
C
T
G
A
T
C
--
elements of w
--
T
G
C
T
--
A
--
C
A
j coords
(0,0)?
(1,0)?
(2,1)?
(2,2)?
(3,3)?
(3,4)?
(4,5)?
(5,5)?
(6,6)?
(7,6)?
(8,7)
positions in v
2 Matches shown in red
positions in w
1 Every common subsequence is a path in 2-D grid
32
Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
33
LCS Problem as Manhattan Tourist Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
34
Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
35
Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
Every path is a common subsequence. Every
diagonal edge adds an extra element to common
subsequence LCS Problem Find a path with maximum
number of diagonal edges
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
36
Backtracking

si,j allows us to compute the length of LCS for
vi and wj
sm,n gives us the length of the LCS for v and w
How do we print the actual LCS ?
At each step, we chose an optimal decision si,j
max ()
Record which of the choices was made in order to
obtain this max

37
Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
38
Printing LCS Backtracking

PrintLCS(b,v,i,j)
if i 0 or j 0
return
if bi,j
PrintLCS(b,v,i-1,j-1)
print vi
else
if bi,j
PrintLCS(b,v,i-1,j)
else
PrintLCS(b,v,i,j-1)

39
From LCS to Alignment

The Longest Common Subsequence problemthe
simplest form of sequence alignment allows only
insertions and deletions (no mismatches).
In the LCS Problem, we scored 1 for matches and 0
for indels
Consider penalizing indels and mismatches with
negative scores
Simplest scoring schema
1 match premium
-µ mismatch penalty
-s indel penalty

40
Simple Scoring

When mismatches are penalized by µ, indels are
penalized by s,
and matches are rewarded with 1,
the resulting score is
matches µ(mismatches) s (indels)

41
The Global Alignment Problem

Find the best alignment between two strings under
a given scoring schema
Input Strings v and w and a scoring schema
Output Alignment of maximum score
?? -s
1 if match
-µ if mismatch
si-1,j-1 1 if vi wj
si,j max s i-1,j-1 -µ if vi ? wj
s i-1,j - s
s i,j-1 - s

m mismatch penalty s indel penalty

42
Scoring Matrices

To generalize scoring, consider a (41) x(41)
scoring matrix d.
In the case of an amino acid sequence alignment,
the scoring matrix would be a (201)x(201) size.
The addition of 1 is to include the score for
comparison of a gap character -.
This will simplify the algorithm as follows
si-1,j-1 d (vi, wj)
si,j max s i-1,j d (vi, -)
s i,j-1 d (-, wj)

43
Making a Scoring Matrix

Scoring matrices are created based on biological
evidence.
Alignments can be thought of as two sequences
that differ due to mutations.
Some of these mutations have little effect on the
proteins function, therefore some penalties,
d(vi , wj), will be less harsh than others.

44
Scoring Matrix Example

Notice that although R and K are different amino
acids, they have a positive score.
Why? They are both positively charged amino
acids? will not greatly change function of
protein.

45
Conservation

Amino acid changes that tend to preserve the
physico-chemical properties of the original
residue
Polar to polar
aspartate ? glutamate
Nonpolar to nonpolar
alanine ? valine
Similarly behaving residues
leucine to isoleucine

46
Local vs. Global Alignment

The Global Alignment Problem tries to find the
longest path between vertices (0,0) and (n,m) in
the edit graph.
The Local Alignment Problem tries to find the
longest path among paths between arbitrary
vertices (i,j) and (i, j) in the edit graph.
In the edit graph with negatively-scored edges,
Local Alignment may score higher than Global
Alignment

47
Local Alignment Example
Local alignment
Global alignment
48
Local Alignments Why?

Two genes in different species may be similar
over short conserved regions and dissimilar over
remaining regions.
Example
Homeobox genes have a short region called the
homeodomain that is highly conserved between
species.
A global alignment would not find the homeodomain
because it would try to align the ENTIRE sequence

49
The Local Alignment Problem

Goal Find the best local alignment between two
strings
Input Strings v, w and scoring matrix d
Output Alignment of substrings of v and w whose
alignment score is maximum among all possible
alignment of all possible substrings

50
The Problem Is

Long run time O(n4)
- In the grid of size n x n there are n2
vertices (i,j) that may serve as a source.
- For each such vertex computing alignments
from (i,j) to (i,j) takes O(n2) time.
This can be remedied by allowing every point to
be the starting point

51
The Local Alignment Recurrence

The largest value of si,j over the whole edit
graph is the score of the best local alignment.
The recurrence

0 si,j max
si-1,j-1 d (vi, wj) s
i-1,j d (vi, -) s i,j-1
d (-, wj)

52
Affine Gap Penalties

In nature, a series of k indels often come as a
single event rather than a series of k single
nucleotide events

ATA__GC ATATTGC
ATAG_GC AT_GTGC
Normal scoring would give the same score for both
alignments
53
Accounting for Gaps

Gaps- contiguous sequence of spaces in one of the
rows
Score for a gap of length x is
-(? sx)
where ? 0 is the penalty for introducing a
gap
gap opening penalty
? will be large relative to s
gap extension penalty
because you do not want to add too much of a
penalty for extending the gap.

54
Affine gap penalty in DP

When computing si,j, need to look at si,j-1,
si,j-2, si,j-3,. and si-1,j, si-2,j,
Each cell needs O(n) time for update
O(n2) cells
Therefore, O(n3) algorithm
We can still do this in O(n2) time

55
Affine Gap Penalty Recurrences
Continue Gap in w (deletion)
si,j s i-1,j - s max s
i-1,j (?s) si,j s i,j-1 - s
max s i,j-1 (?s) si,j
si-1,j-1 d (vi, wj) max s i,j
s i,j
Start Gap in w (deletion) from middle
Continue Gap in v (insertion)
Start Gap in v (insertion)from middle
Match or Mismatch
End deletion from top
End insertion from bottom

Write a Comment

User Comments (0)