Dynamic-Programming%20Strategies%20for%20Analyzing%20Biomolecular%20Sequences - PowerPoint PPT Presentation

About This Presentation
Title:

Dynamic-Programming%20Strategies%20for%20Analyzing%20Biomolecular%20Sequences

Description:

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences – PowerPoint PPT presentation

Number of Views:144
Avg rating:3.0/5.0
Slides: 90
Provided by: Veriton
Category:

less

Transcript and Presenter's Notes

Title: Dynamic-Programming%20Strategies%20for%20Analyzing%20Biomolecular%20Sequences


1
Dynamic-Programming Strategies for Analyzing
Biomolecular Sequences
2
Dynamic Programming
  • Dynamic programming is a class of solution
    methods for solving sequential decision problems
    with a compositional cost structure.
  • Richard Bellman was one of the principal founders
    of this approach.

3
Two key ingredients
  • Two key ingredients for an optimization problem
    to be suitable for a dynamic-programming solution

1. optimal substructures
2. overlapping subproblems
Subproblems are dependent. (otherwise, a
divide-and-conquer approach is the choice.)
Each substructure is optimal. (Principle of
optimality)
4
Three basic components
  • The development of a dynamic-programming
    algorithm has three basic components
  • The recurrence relation (for defining the value
    of an optimal solution)
  • The tabular computation (for computing the value
    of an optimal solution)
  • The traceback (for delivering an optimal
    solution).

5
Fibonacci numbers
The Fibonacci numbers are defined by the
following recurrence



6
How to compute F10?


7
Tabular computation
  • The tabular computation can avoid recompuation.

F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
0 1 1 2 3 5 8 13 21 34 55
8
Maximum-sum interval
  • Given a sequence of real numbers a1a2an , find a
    consecutive subsequence with the maximum sum.

9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
For each position, we can compute the maximum-sum
interval starting at that position in O(n) time.
Therefore, a naive algorithm runs in O(n2) time.
9
O-notation an asymptotic upper bound
  • f(n) O(g(n)) iff there exist two positive
    constant c and n0 such that 0 f(n) cg(n)
    for all n n0

cg(n)
f(n)
n0
10
How functions grow?
30n 92n log n 26n2 0.68n3 2n
100 0.003 sec. 0.003 sec. 0.0026 sec. 0.68 sec. 4 x 1016 yr.
100,000 3.0 sec. 2.6 min. 3.0 days 22 yr.
function
n
(Assume one million operations per second.)
  • For large data sets, algorithms with a complexity
    greater than O(n log n) are often impractical!

11
Maximum-sum interval(The recurrence relation)
  • Define S(i) to be the maximum sum of the
    intervals ending at position i.

If S(i-1) lt 0, concatenating ai with its previous
interval gives less sum than ai itself.
12
Maximum-sum interval(Tabular computation)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum sum
13
Maximum-sum interval(Traceback)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum-sum interval 6 -2 8 4
14
Longest increasing subsequence(LIS)
  • The longest increasing subsequence is to find a
    longest increasing subsequence of a given
    sequence of distinct integers a1a2an .
  • e.g. 9 2 5 3 7 11 8 10 13 6
  • 3 7
  • 7 10 13
  • 7 11
  • 3 5 11 13

are increasing subsequences.
We want to find a longest one.
are not increasing subsequences.
15
A naive approach for LIS
  • Let Li be the length of a longest increasing
    subsequence ending at position i.

Li 1 max j 0..i-1Lj aj lt ai(use a
dummy a0 minimum, and L00)
9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 ?
16
A naive approach for LIS
  • Li 1 max j 0..i-1 Lj aj lt ai

9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 4 5
6 3
The maximum length
The subsequence 2, 3, 7, 8, 10, 13 is a longest
increasing subsequence. This method runs in O(n2)
time.
17
Binary search
  • Given an ordered sequence x1x2 ... xn, where
    x1ltx2lt ... ltxn, and a number y, a binary search
    finds the largest xi such that xilt y in O(log n)
    time.

n/2
...
n/4
n
18
Binary search
  • How many steps would a binary search reduce the
    problem size to 1?n n/2 n/4 n/8 n/16
    ... 1

How many steps? O(log n) steps.
19
An O(n log n) method for LIS
  • Define BestEndk to be the smallest number of an
    increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
BestEnd3
11
8
8
8
BestEnd4
10
10
BestEnd5
13
BestEnd6
20
An O(n log n) method for LIS
  • Define BestEndk to be the smallest number of an
    increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
6
BestEnd3
11
8
8
8
8
BestEnd4
For each position, we perform a binary search to
update BestEnd. Therefore, the running time is
O(n log n).
10
10
10
BestEnd5
13
13
BestEnd6
21
Longest Common Subsequence (LCS)
  • A subsequence of a sequence S is obtained by
    deleting zero or more symbols from S. For
    example, the following are all subsequences of
    president pred, sdn, predent.
  • The longest common subsequence problem is to find
    a maximum-length common subsequence between two
    sequences.

22
LCS
  • For instance,
  • Sequence 1 president
  • Sequence 2 providence
  • Its LCS is priden.

president providence
23
LCS
  • Another example
  • Sequence 1 algorithm
  • Sequence 2 alignment
  • One of its LCS is algm.

a l g o r i t h m a l i g n m e n t
24
How to compute LCS?
  • Let Aa1a2am and Bb1b2bn .
  • len(i, j) the length of an LCS between
    a1a2ai and b1b2bj
  • With proper initializations, len(i, j)can be
    computed as follows.

25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
Dot Matrix
  • Sequence ACTTAACT
  • Sequence BCGGATCAT

C G G A T C A T







CTTAACT
30
Pairwise Alignment
  • Sequence A CTTAACT
  • Sequence B CGGATCAT
  • An alignment of A and B

C---TTAACTCGGATCA--T
Sequence A
Sequence B
31
Pairwise Alignment
  • Sequence A CTTAACT
  • Sequence B CGGATCAT
  • An alignment of A and B

Mismatch
Match
C---TTAACTCGGATCA--T
Deletion gap
Insertion gap
32
Alignment Graph
  • Sequence A CTTAACT
  • Sequence B CGGATCAT

C G G A T C A T







CTTAACT
C---TTAACTCGGATCA--T
33
A simple scoring scheme
  • Match 8 (w(x, y) 8, if x y)
  • Mismatch -5 (w(x, y) -5, if x ? y)
  • Each gap symbol -3 (w(-,x)w(x,-)-3)

C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score
34
An optimal alignment-- the alignment of maximum
score
  • Let Aa1a2am and Bb1b2bn .
  • Si,j the score of an optimal alignment between
    a1a2ai and b1b2bj
  • With proper initializations, Si,j can be
    computedas follows.

35
Computing Si,j

j
w(ai,bj)
w(ai,-)
i
w(-,bj)
Sm,n
36
Initializations

C G G A T C A T
0 -3 -6 -9 -12 -15 -18 -21 -24
-3
-6
-9
-12
-15
-18
-21
CTTAACT
37
S3,5 ?

C G G A T C A T
0 -3 -6 -9 -12 -15 -18 -21 -24
-3 8 5 2 -1 -4 -7 -10 -13
-6 5 3 0 -3 7 4 1 -2
-9 2 0 -2 -5 ?
-12
-15
-18
-21
CTTAACT
38
S3,5 ?

C G G A T C A T
0 -3 -6 -9 -12 -15 -18 -21 -24
-3 8 5 2 -1 -4 -7 -10 -13
-6 5 3 0 -3 7 4 1 -2
-9 2 0 -2 -5 5 -1 -4 9
-12 -1 -3 -5 6 3 0 7 6
-15 -4 -6 -8 3 1 -2 8 5
-18 -7 -9 -11 0 -2 9 6 3
-21 -10 -12 -14 -3 8 6 4 14
CTTAACT
optimal score
39
C T T A A C TC G G A T C A T
8 5 5 8 -5 8 -3 8 14

C G G A T C A T
0 -3 -6 -9 -12 -15 -18 -21 -24
-3 8 5 2 -1 -4 -7 -10 -13
-6 5 3 0 -3 7 4 1 -2
-9 2 0 -2 -5 5 -1 -4 9
-12 -1 -3 -5 6 3 0 7 6
-15 -4 -6 -8 3 1 -2 8 5
-18 -7 -9 -11 0 -2 9 6 3
-21 -10 -12 -14 -3 8 6 4 14
CTTAACT
40
Global Alignment vs. Local Alignment
  • global alignment
  • local alignment

41
An optimal local alignment
  • Si,j the score of an optimal local alignment
    ending at ai and bj
  • With proper initializations, Si,j can be
    computedas follows.

42
local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
0 0 0 0 0 0 0 0 0
0 8 5 2 0 0 8 5 2
0 5 3 0 0 8 5 3 13
0 2 0 0 0 8 5 2 11
0 0 0 0 8 5 3 ?
0
0
0
CTTAACT
43
local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
0 0 0 0 0 0 0 0 0
0 8 5 2 0 0 8 5 2
0 5 3 0 0 8 5 3 13
0 2 0 0 0 8 5 2 11
0 0 0 0 8 5 3 13 10
0 0 0 0 8 5 2 11 8
0 8 5 2 5 3 13 10 7
0 5 3 0 2 13 10 8 18
CTTAACT
The best score
44
A C - TA T C A T 8-38-38 18

C G G A T C A T
0 0 0 0 0 0 0 0 0
0 8 5 2 0 0 8 5 2
0 5 3 0 0 8 5 3 13
0 2 0 0 0 8 5 2 11
0 0 0 0 8 5 3 13 10
0 0 0 0 8 5 2 11 8
0 8 5 2 5 3 13 10 7
0 5 3 0 2 13 10 8 18
CTTAACT
The best score
45
Affine gap penalties
  • Match 8 (w(x, y) 8, if x y)
  • Mismatch -5 (w(x, y) -5, if x ? y)
  • Each gap symbol -3 (w(-,x)w(x,-)-3)
  • Each gap is charged an extra gap-open penalty -4.

-4
-4
C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score 12 4 4 4
46
Affine gap penalties
  • A gap of length k is penalized x ky.

gap-open penalty
  • Three cases for alignment endings
  • ...x...x
  • ...x...-
  • ...-...x

gap-symbol penalty
an aligned pair
a deletion
an insertion
47
Affine gap penalties
  • Let D(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj ending with
    a deletion.
  • Let I(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj ending with
    an insertion.
  • Let S(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj.

48
Affine gap penalties
49
Affine gap penalties
-y
w(ai,bj)
-x-y
D
-x-y
S
I
-y
50
PAM (percent accepted mutation)
  • Scoring function is critical element
  • above SF is not good enough
  • Developing a new scoring function
  • Based on a set of native protein sequences

51
20 x 20 substitution matrix
52
PAM-250
53
PAM 1
The Dayhoff Matrix  Proteins evolve through a
succession of independent point mutations, that
are accepted in a population and subsequently can
be observed in the sequence pool. (Dayhoff, M.O.
et al. (1978) Atlas of Protein Sequence and
Structure. Vol. 5, Suppl. 3 National Biomedical
Research Foundation, Washington D.C. U.S.A).
First step Pair Exchange Frequencies  A PAM
(Percent Accepted Mutation) is one accepted point
mutation on the path between two sequences, per
100 residues.
54
PAM 2 Second step Frequencies of Occurrence of
each amino acid
         1978        1991 L       0.085      
0.091 A       0.087       0.077 G      
0.089       0.074 S       0.070      
0.069 V       0.065       0.066 E      
0.050       0.062 T       0.058      
0.059 K       0.081       0.059 I      
0.037       0.053 D       0.047      
0.052 R       0.041       0.051 P      
0.051       0.051 N       0.040      
0.043 Q       0.038       0.041 F      
0.040       0.040 Y       0.030      
0.032 M       0.015       0.024 H      
0.034       0.023 C       0.033      
0.020 W       0.010       0.014
Computing the relative frequency of occurrence of
a.a. over a large, Sufficiently varied protein
sequence set i.e occurrence in the sample space
55
PAM 3 Third step Relative Mutabilities
All values are taken relative to alanine, which
is arbitrarily set at 100.
         1978        1991 A         100        
100 C          20          44 D        
106          86 E         102         
77 F          41          51 G         
49          50 H          66         
91 I          96         103 K         
56          72 L          40         
54 M          94          93 N        
134         104 P          56         
58 Q          93          84 R         
65          83 S         120        
117 T          97         107 V         
74          98 W          18         
25 Y          41          50
56
PAM 4 Fourth step Mutation Probability matrix
The probability that an amino acid in row a of
the matrix will replace the amino acid in column
b the mutability of amino acid b, multiplied by
the pair exchange frequency for aa divided by the
sum of all pair exchange frequencies for amino
acid a
Matrix M is a Markov model of evolution
57
PAM 5 Last step the log-odds matrix
log to base 10 a value of 1 would mean that the
corresponding pair has been observed 10 times
more frequently than expected by chance.  The
most commonly used matrix is the matrix from the
1978 edition of the Dayhoff atlas, at PAM 250
this is also frequently referred to as the PAM250
matrix. 
k-PAM defines as
58
PAM-250
59
BLOSUM Matrices
  • BLOSUM matrices are based on local alignments.
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences with no less than 62 divergence.
  • All BLOSUM matrices are based on observed
    alignments they are not extrapolated from
    comparisons of closely related proteins.
  • BLOSUM 62 is the default matrix in BLAST 2.0.
    Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships. A search for
    distant relatives may be more sensitive with a
    different matrix.

60
BLOSUM62 is the BLAST default
61
BLOSUM 62
62
Relationship between PAM BLOSUM
63
Gap Penalty
64
k best local alignments
  • Smith-Waterman(Smith and Waterman, 1981
    Waterman and Eggert, 1987)
  • linear-space versionsim (Huang and Miller, 1991)
  • linear-space variantssim2 (Chao et al., 1995)
    sim3 (Chao et al., 1997)
  • FASTA(Wilbur and Lipman, 1983 Lipman and
    Pearson, 1985)
  • linear-space band alignment (Chao et al., 1992)
  • BLAST(Altschul et al., 1990 Altschul et al.,
    1997)
  • restricted affine gap penalties (Chao, 1999)

65
FASTA
  1. Find runs of identities, and identify regions
    with the highest density of identities.
  2. Re-score using PAM matrix, and keep top scoring
    segments.
  3. Eliminate segments that are unlikely to be part
    of the alignment.
  4. Optimize the alignment in a band.

66
FASTA
Step 1 Find runes of identities, and identify
regions with the highest density of identities.
67
FASTA
Step 2 Re-score using PAM matrix, andkeep top
scoring segments.
68
FASTA
Step 3 Eliminate segments that are unlikely to
be part of the alignment.
69
FASTA
Step 4 Optimize the alignment in a band.
70
BLAST
  • Build the hash table for Sequence A.
  • Scan Sequence B for hits.
  • Extend hits.

71
BLAST
Step 1 Build the hash table for Sequence A.
(3-tuple example)
For protein sequences Seq. A ELVISAdd xyz to
the hash table if Score(xyz, ELV) ? TAdd
xyz to the hash table if Score(xyz, LVI) ?
TAdd xyz to the hash table if Score(xyz,
VIS) ? T
For DNA sequences Seq. A AGATCGAT
12345678 AAAAAC..AGA 1..ATC 3..CGA
5..GAT 2 6..TCG 4..TTT

72
BLAST
Step2 Scan sequence B for hits.
73
BLAST
Step2 Scan sequence B for hits.
Step 3 Extend hits.
BLAST 2.0 saves the time spent in extension, and
considers gapped alignments.
hit
Terminate if the score of the sxtension fades
away.
74
Remarks
  • Filtering is based on the observation that a good
    alignment usually includes short identical or
    very similar fragments.
  • The idea of filtration was used in both FASTA and
    BLAST.

75
Linear-space ideasHirschberg, 1975 Myers and
Miller, 1988
m/2
76
Two subproblems½ original problem size
m/4
m/2
3m/4
77
Four subproblems¼ original problem size
m/4
m/2
3m/4
78
Time and Space Complexity
  • Space O(MN)
  • TimeO(MN)(1 ½ ¼ ) O(MN)

2
79
Band Alignment
Sequence A

Sequence B
80
Band Alignment in Linear Space
The remaining subproblems are no longer only half
of the original problem. In worst case, this
could cause an additional log n factor in time.
81
Band Alignment in Linear Space
82
Multiple sequence alignment (MSA)
  • The multiple sequence alignment problem is to
    simultaneously align more than two sequences.

Seq1 GCTC Seq2 AC Seq3 GATC
GC-TC A---C G-ATC
83
How to score an MSA?
  • Sum-of-Pairs (SP-score)

GC-TC A---C
Score

GC-TC A---C G-ATC
GC-TC G-ATC
Score
Score


A---C G-ATC
Score
84
Defining scores for alignment columns
  • infocon Stojanovic et al., 1999
  • Each column is assigned a score that measures its
    information content, based on the frequencies of
    the letters both within the column and within the
    alignment.

CGGATCATGGACTTAACATTGAAGAGAACATAGTA
85
Defining scores (contd)
  • phylogen Stojanovic et al., 1999
  • columns are scored based on the evolutionary
    relationships among the sequences implied by a
    supplied phylogenetic tree.

Score 2
Score 1
T
T
TTTCC
T
T
T
C
T
T
T T T C C
T T T C C
86
MSA for three sequences
  • an O(n3) algorithm

87
General MSA
  • For k sequences of length n O(nk)
  • NP-Complete (Wang and Jiang)
  • The exact multiple alignment algorithms for many
    sequences are not feasible.
  • Some approximation algorithms are given.(e.g.,
    2- l/k for any fixed l by Bafna et al.)

88
Progressive alignment
  • A heuristic approach proposed by Feng and
    Doolittle.
  • It iteratively merges the most similar pairs.
  • Once a gap, always a gap

The time for progressive alignment in most cases
is roughly the order of the time for computing
all pairwise alignment, i.e., O(k2n2) .
A B C D E
89
Concluding remarks
  • Three essential components of the
    dynamic-programming approach
  • the recurrence relation
  • the tabular computation
  • the traceback
  • The dynamic-programming approach has been used in
    a vast number of computational problems in
    bioinformatics.
Write a Comment
User Comments (0)
About PowerShow.com