Dynamic Programming: Sequence alignment - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

Dynamic Programming: Sequence alignment

Description:

If a high % of cystic fibrosis (CF) patients have a certain mutation in the gene ... Cystic Fibrosis and CFTR Gene : Bring in the Bioinformaticians ... – PowerPoint PPT presentation

Number of Views:103
Avg rating:3.0/5.0
Slides: 56
Provided by: Saurab5
Category:

less

Transcript and Presenter's Notes

Title: Dynamic Programming: Sequence alignment


1
Dynamic Programming Sequence alignment
  • CS 498 SS
  • Saurabh Sinha

2
DNA Sequence Comparison First Success Story
  • Finding sequence similarities with genes of known
    function is a common approach to infer a newly
    sequenced genes function
  • In 1984 Russell Doolittle and colleagues found
    similarities between cancer-causing gene and
    normal growth factor (PDGF) gene
  • A normal growth gene switched on at the wrong
    time causes cancer !

3
Cystic Fibrosis
  • Cystic fibrosis (CF) is a chronic and frequently
    fatal genetic disease of the body's mucus glands.
    CF primarily affects the respiratory systems in
    children.
  • If a high of cystic fibrosis (CF) patients have
    a certain mutation in the gene and the normal
    patients dont, then that could be an indicator
    of a mutation that is related to CF
  • A certain mutation was found in 70 of CF
    patients, convincing evidence that it is a
    predominant genetic diagnostics marker for CF

4
Cystic Fibrosis and CFTR Gene
5
Bring in the Bioinformaticians
  • Gene similarities between two genes with known
    and unknown function alert biologists to some
    possibilities
  • Computing a similarity score between two genes
    tells how likely it is that they have similar
    functions
  • Dynamic programming is a technique for revealing
    similarities between genes

6
Motivating Dynamic Programming
7
Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source












Sink
8
Dynamic programming exampleManhattan Tourist
Problem
Imagine seeking a path (from source to sink) to
travel (only eastward and southward) with the
most number of attractions () in the Manhattan
grid
Source












Sink
9
Manhattan Tourist Problem Formulation
Goal Find the longest path in a weighted grid.
Input A weighted grid G with two distinct
vertices, one labeled source and the other
labeled sink
Output A longest path in G from source to
sink
10
MTP An Example
0
1
2
3
4
j coordinate
source
3
2
4
0
9
5
3
0
0
1
0
4
3
2
2
3
2
4
13
1
1
6
5
4
2
0
7
3
4
19
15
2
i coordinate
4
5
2
4
1
3
3
0
2
3
20
3
8
5
6
5
sink
2
1
3
2
23
4
11
MTP Greedy Algorithm Is Not Optimal
1
2
5
source
3
10
5
5
2
5
1
3
5
3
1
4
2
3
promising start, but leads to bad choices!
5
0
2
0
0
0
0
sink
18
12
MTP Simple Recursive Program
  • MT(n,m)
  • if n0 or m0
  • return MT(n,m)
  • x ? MT(n-1,m)
  • length of the edge from (n-
    1,m) to (n,m)
  • y ? MT(n,m-1)
  • length of the edge from
    (n,m-1) to (n,m)
  • return maxx,y

Whats wrong with this approach?
13
Heres whats wrong
  • M(n,m) needs M(n, m-1) and M(n-1, m)
  • Both of these need M(n-1, m-1)
  • So M(n-1, m-1) will be computed at least twice
  • Dynamic programming the same idea as this
    recursive algorithm, but keep all intermediate
    results in a table and reuse

14
MTP Dynamic Programming
j
0
1
source
1
0
1
S0,1 1
i
5
1
5
S1,0 5
  • Calculate optimal path score for each vertex in
    the graph
  • Each vertexs score is the maximum of the prior
    vertices score plus the weight of the respective
    edge in between

15
MTP Dynamic Programming (contd)
j
0
1
2
source
1
2
0
1
3
S0,2 3
i
5
3
-5
1
5
4
S1,1 4
3
2
8
S2,0 8
16
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
S3,0 8
i
5
10
3
1
-5
1
5
4
13
S1,2 13
3
5
-5
2
8
9
S2,1 9
0
3
8
S3,0 8
17
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
S1,3 8
3
5
-3
3
-5
2
8
9
12
S2,2 12
0
0
0
3
8
9
S3,1 9
greedy alg. fails!
18
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
3
5
-3
2
3
3
-5
2
8
9
12
15
S2,3 15
0
0
-5
0
0
3
8
9
9
S3,2 9
19
MTP Dynamic Programming (contd)
j
0
1
2
3
source
1
2
5
0
1
3
8
Done!
i
5
3
10
-5
-5
1
-5
1
5
4
13
8
(showing all back-traces)
3
5
-3
2
3
3
-5
2
8
9
12
15
0
0
-5
1
0
0
0
3
8
9
9
16
S3,3 16
20
MTP Recurrence
Computing the score for a point (i,j) by the
recurrence relation
The running time is n x m for a n by m grid (n
of rows, m of columns)
21
Manhattan Is Not A Perfect Grid
What about diagonals?
  • The score at point B is given by

22
Manhattan Is Not A Perfect Grid (contd)
Computing the score for point x is given by the
recurrence relation
  • Predecessors (x) set of vertices that have
    edges leading to x
  • The running time for a graph G(V, E)
    (V is the set of all vertices and E is
    the set of all edges) is O(E) since each
    edge is evaluated once

23
Traveling in the Grid
  • By the time the vertex x is analyzed, the values
    sy for all its predecessors y should be computed
    otherwise we are in trouble.
  • We need to traverse the vertices in some order
  • For a grid, can traverse vertices row by row,
    column by column, or diagonal by diagonal
  • If we had, instead of a (directed) grid, an
    arbitrary DAG (directed acyclic graph) ?

24
Traversing the Manhattan Grid
a)
b)
  • 3 different strategies
  • a) Column by column
  • b) Row by row
  • c) Along diagonals

c)
25
Topological Ordering
  • A numbering of vertices of the graph is called
    topological ordering of the DAG if every edge of
    the DAG connects a vertex with a smaller label to
    a vertex with a larger label
  • In other words, if vertices are positioned on a
    line in an increasing order of labels then all
    edges go from left to right.
  • How to obtain a topological sort (???)

26
Longest Path in DAG Problem
  • Goal Find a longest path between two vertices in
    a weighted DAG
  • Input A weighted DAG G with source and sink
    vertices
  • Output A longest path in G from source to sink

27
Longest Path in DAG Dynamic Programming
  • Suppose vertex v has indegree 3 and predecessors
    u1, u2, u3
  • Longest path to v from source is
  • In General
  • sv maxu (su weight of edge from u to v)

su1 weight of edge from u1 to v su2 weight
of edge from u2 to v su3 weight of edge from
u3 to v
28
Alignment
29
Aligning DNA Sequences
Alignment 2 x k matrix ( k ? m, n )
n 8
V ATCTGATG
m 7
W TGCATAC
V
W
30
Longest Common Subsequence (LCS) Alignment
without Mismatches
  • Given two sequences
  • v v1 v2vm and w w1 w2wn
  • The LCS of v and w is a sequence of positions
    in
  • v 1
  • and a sequence of positions in
  • w 1
  • such that it -th letter of v equals to jt-letter
    of w and t is maximal

31
LCS Example
i coords
elements of v
A
T
--
C
T
G
A
T
C
--
elements of w
--
T
G
C
T
--
A
--
C
A
j coords
(0,0)?
(1,0)?
(2,1)?
(2,2)?
(3,3)?
(3,4)?
(4,5)?
(5,5)?
(6,6)?
(7,6)?
(8,7)
positions in v
2 Matches shown in red
positions in w
1 Every common subsequence is a path in 2-D grid
32
Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
33
LCS Problem as Manhattan Tourist Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
34
Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
35
Edit Graph for LCS Problem
A
T
C
T
G
A
T
C
j
0
1
2
3
4
5
6
7
8
Every path is a common subsequence. Every
diagonal edge adds an extra element to common
subsequence LCS Problem Find a path with maximum
number of diagonal edges
0
i
T
1
G
2
C
3
A
4
T
5
A
6
C
7
36
Backtracking
  • si,j allows us to compute the length of LCS for
    vi and wj
  • sm,n gives us the length of the LCS for v and w
  • How do we print the actual LCS ?
  • At each step, we chose an optimal decision si,j
    max ()
  • Record which of the choices was made in order to
    obtain this max

37
Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
38
Printing LCS Backtracking
  • PrintLCS(b,v,i,j)
  • if i 0 or j 0
  • return
  • if bi,j
  • PrintLCS(b,v,i-1,j-1)
  • print vi
  • else
  • if bi,j
  • PrintLCS(b,v,i-1,j)
  • else
  • PrintLCS(b,v,i,j-1)

39
From LCS to Alignment
  • The Longest Common Subsequence problemthe
    simplest form of sequence alignment allows only
    insertions and deletions (no mismatches).
  • In the LCS Problem, we scored 1 for matches and 0
    for indels
  • Consider penalizing indels and mismatches with
    negative scores
  • Simplest scoring schema
  • 1 match premium
  • -µ mismatch penalty
  • -s indel penalty

40
Simple Scoring
  • When mismatches are penalized by µ, indels are
    penalized by s,
  • and matches are rewarded with 1,
  • the resulting score is
  • matches µ(mismatches) s (indels)

41
The Global Alignment Problem
  • Find the best alignment between two strings under
    a given scoring schema
  • Input Strings v and w and a scoring schema
  • Output Alignment of maximum score
  • ?? -s
  • 1 if match
  • -µ if mismatch
  • si-1,j-1 1 if vi wj
  • si,j max s i-1,j-1 -µ if vi ? wj
  • s i-1,j - s
  • s i,j-1 - s

m mismatch penalty s indel penalty

42
Scoring Matrices
  • To generalize scoring, consider a (41) x(41)
    scoring matrix d.
  • In the case of an amino acid sequence alignment,
    the scoring matrix would be a (201)x(201) size.
    The addition of 1 is to include the score for
    comparison of a gap character -.
  • This will simplify the algorithm as follows
  • si-1,j-1 d (vi, wj)
  • si,j max s i-1,j d (vi, -)
  • s i,j-1 d (-, wj)


43
Making a Scoring Matrix
  • Scoring matrices are created based on biological
    evidence.
  • Alignments can be thought of as two sequences
    that differ due to mutations.
  • Some of these mutations have little effect on the
    proteins function, therefore some penalties,
    d(vi , wj), will be less harsh than others.

44
Scoring Matrix Example
  • Notice that although R and K are different amino
    acids, they have a positive score.
  • Why? They are both positively charged amino
    acids? will not greatly change function of
    protein.

45
Conservation
  • Amino acid changes that tend to preserve the
    physico-chemical properties of the original
    residue
  • Polar to polar
  • aspartate ? glutamate
  • Nonpolar to nonpolar
  • alanine ? valine
  • Similarly behaving residues
  • leucine to isoleucine

46
Local vs. Global Alignment
  • The Global Alignment Problem tries to find the
    longest path between vertices (0,0) and (n,m) in
    the edit graph.
  • The Local Alignment Problem tries to find the
    longest path among paths between arbitrary
    vertices (i,j) and (i, j) in the edit graph.
  • In the edit graph with negatively-scored edges,
    Local Alignment may score higher than Global
    Alignment

47
Local Alignment Example
Local alignment
Global alignment
48
Local Alignments Why?
  • Two genes in different species may be similar
    over short conserved regions and dissimilar over
    remaining regions.
  • Example
  • Homeobox genes have a short region called the
    homeodomain that is highly conserved between
    species.
  • A global alignment would not find the homeodomain
    because it would try to align the ENTIRE sequence

49
The Local Alignment Problem
  • Goal Find the best local alignment between two
    strings
  • Input Strings v, w and scoring matrix d
  • Output Alignment of substrings of v and w whose
    alignment score is maximum among all possible
    alignment of all possible substrings

50
The Problem Is
  • Long run time O(n4)
  • - In the grid of size n x n there are n2
    vertices (i,j) that may serve as a source.
  • - For each such vertex computing alignments
    from (i,j) to (i,j) takes O(n2) time.
  • This can be remedied by allowing every point to
    be the starting point

51
The Local Alignment Recurrence
  • The largest value of si,j over the whole edit
    graph is the score of the best local alignment.
  • The recurrence

0 si,j max
si-1,j-1 d (vi, wj) s
i-1,j d (vi, -) s i,j-1
d (-, wj)

52
Affine Gap Penalties
  • In nature, a series of k indels often come as a
    single event rather than a series of k single
    nucleotide events

ATA__GC ATATTGC
ATAG_GC AT_GTGC
Normal scoring would give the same score for both
alignments
53
Accounting for Gaps
  • Gaps- contiguous sequence of spaces in one of the
    rows
  • Score for a gap of length x is
  • -(? sx)
  • where ? 0 is the penalty for introducing a
    gap
  • gap opening penalty
  • ? will be large relative to s
  • gap extension penalty
  • because you do not want to add too much of a
    penalty for extending the gap.

54
Affine gap penalty in DP
  • When computing si,j, need to look at si,j-1,
    si,j-2, si,j-3,. and si-1,j, si-2,j,
  • Each cell needs O(n) time for update
  • O(n2) cells
  • Therefore, O(n3) algorithm
  • We can still do this in O(n2) time

55
Affine Gap Penalty Recurrences
Continue Gap in w (deletion)
si,j s i-1,j - s max s
i-1,j (?s) si,j s i,j-1 - s
max s i,j-1 (?s) si,j
si-1,j-1 d (vi, wj) max s i,j
s i,j
Start Gap in w (deletion) from middle
Continue Gap in v (insertion)
Start Gap in v (insertion)from middle
Match or Mismatch
End deletion from top
End insertion from bottom
Write a Comment
User Comments (0)
About PowerShow.com