Title: DynamicProgramming Strategies for Analyzing Biomolecular Sequences
1Dynamic-Programming Strategies for Analyzing
Biomolecular Sequences
- Kun-Mao Chao (???)
- Department of Computer Science and Information
Engineering - National Taiwan University, Taiwan
- E-mail kmchao_at_csie.ntu.edu.tw
- WWW http//www.csie.ntu.edu.tw/kmchao
2Dynamic Programming
- Dynamic programming is a class of solution
methods for solving sequential decision problems
with a compositional cost structure. - Richard Bellman was one of the principal founders
of this approach.
3Two key ingredients
- Two key ingredients for an optimization problem
to be suitable for a dynamic-programming solution
2. overlapping subproblems
1. optimal substructures
Subproblems are dependent. (otherwise, a
divide-and-conquer approach is the choice.)
Each substructure is optimal. (Principle of
optimality)
4Three basic components
- The development of a dynamic-programming
algorithm has three basic components - The recurrence relation (for defining the value
of an optimal solution) - The tabular computation (for computing the value
of an optimal solution) - The traceback (for delivering an optimal
solution).
5Fibonacci numbers
The Fibonacci numbers are defined by the
following recurrence
6How to compute F10?
7Tabular computation
- The tabular computation can avoid recompuation.
8Maximum-sum interval
- Given a sequence of real numbers a1a2an , find a
consecutive subsequence with the maximum sum.
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
For each position, we can compute the maximum-sum
interval starting at that position in O(n) time.
Therefore, a naive algorithm runs in O(n2) time.
9O-notation an asymptotic upper bound
- f(n) O(g(n)) iff there exist two positive
constant c and n0 such that 0 f(n) cg(n)
for all n n0
cg(n)
f(n)
n0
10How functions grow?
function
n
(Assume one million operations per second.)
- For large data sets, algorithms with a complexity
greater than O(n log n) are often impractical!
11Maximum-sum interval(The recurrence relation)
- Define S(i) to be the maximum sum of the
intervals ending at position i.
If S(i-1) lt 0, concatenating ai with its previous
interval gives less sum than ai itself.
12Maximum-sum interval(Tabular computation)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum sum
13Maximum-sum interval(Traceback)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum-sum interval 6 -2 8 4
14Two fundamental problems we recently solved (I)
(joint work with Lin and Jiang)
- Given a sequence of real numbers of length n and
an upper bound U, find a consecutive subsequence
of length at most U with the maximum sum --- an
O(n)-time algorithm.
U 3 9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
15Joint work with Huang, Jiang and Lin
- Xiaoqiu Huang (???)Iowa State University, USA
- Tao Jiang (??) University of California
Riverside, USA - Yaw-Ling Lin (???)Providence University, Taiwan
16Two fundamental problems we recently solved (II)
(joint work with Lin and Jiang)
- Given a sequence of real numbers of length n and
a lower bound L, find a consecutive subsequence
of length at least L with the maximum average.
--- an O(n log L)-time algorithm.
L 4 3 2 14 6 6 2 10 2 6 6 14 2 1
17Another example
- Given a sequence as follows 2, 6.6, 6.6, 3, 7,
6, 7, 2 - and L 2, the highest-average interval is the
squared area, which has the average value 20/3. - 2, 6.6, 6.6, 3, 7, 6, 7, 2
18CG rich regions
- Our method can be used to locate a region of
length at least L with the highest CG ratio in
O(n log L) time.
ATGACTCGAGCTCGTCA 00101011011011010
Search for an interval of length at least L with
the highest average.
19Length-unconstrained version
3 2 14 6 6 2 10 2 6 6 14 2 1
The maximum element is the answer. It can be done
in O(n) time.
20Q computing density in O(1) time?
- prefix-sum(i) S1S2Si,
- all n prefix sums are computable in O(n) time.
- sum(i, j) prefix-sum(j) prefix-sum(i-1)
- density(i, j) sum(i, j) / (j-i1)
j
i
prefix-sum(j)
prefix-sum(i-1)
21A naive algorithm
- A simple shift algorithm can compute the
highest-average interval of a fixed length in
O(n) time - Try L, L1, L2, ..., n. In total, O(n2).
22An O(nL)-time algorithm (Huang, CABIOS 94)
- Observing that there exists an optimal interval
of length bounded by 2L, we immediately have an
O(nL)-time algorithm.
We can bisect a region of length gt 2L into two
segments, where each of them is of length gt L.
23Good partners
- Finding good partner g(i) for each i1, 2, ,
n-L1.
i L
g(i)
L
maximing avgi, g(i)
24Right-Skew Decomposition
- Partition S into substrings S1,S2,,Sk such that
- each Si is a right-skew substring of S
- the average of any prefix is always less than or
equal to the average of the remaining suffix. - density(S1) gt density(S2) gt gt density(Sk)
- Lin, Jiang, Chao
- Unique
- Computable in linear time.
25An O(n log L)-time algorithm (Lin, Jiang, Chao,
JCSS02)
- Decreasingly right-skew decomposition (O(n) time)
5
6
7.5
5
9
8
9
8
7
8
9 7 8 1 9 8
3 7 2 8
26An O(n log L)-time algorithm
- Jumping tables that allows binary search.
- O(log L) time for each position to find its good
partner, therefore the total running time is O(n
log L). - We also implemented a program that linearly scans
right-skew segments for the good partnership. Our
empirical tests showed that it ran in linear time
empirically.
27Goldwasswer, Kao, and Lus recent progress
- An O(n) time algorithm for the problem
- Appeared in Proceedings of the Second Workshop on
Algorithms in Bioinformatics (WABI), Rome, Itay,
Sep. 17-21, 2002.
28A new important observation
- i lt j lt g(j) lt g(i) implies
- density(i, g(i)) is no more than density(j, g(j))
29k non-overlapping maximum-average segments (Lin,
Huang, Jiang, Chao, Bioinformatics)
- Maintain all candidates in a priority queue.
- A new k-best algorithm algorithms on trees (Lin
and Chao (2003))
30Longest increasing subsequence(LIS)
- The longest increasing subsequence is to find a
longest increasing subsequence of a given
sequence of distinct integers a1a2an .
- e.g. 9 2 5 3 7 11 8 10 13 6
- 3 7
- 7 10 13
- 7 11
- 3 5 11 13
are increasing subsequences.
We want to find a longest one.
are not increasing subsequences.
31A naive approach for LIS
- Let Li be the length of a longest increasing
subsequence ending at position i.
Li 1 max j 0..i-1Lj aj lt ai(use a
dummy a0 minimum, and L00)
9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 ?
32A naive approach for LIS
- Li 1 max j 0..i-1 Lj aj lt ai
9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 4 5
6 3
The maximum length
The subsequence 2, 3, 7, 8, 10, 13 is a longest
increasing subsequence. This method runs in O(n2)
time.
33Binary search
- Given an ordered sequence x1x2 ... xn, where
x1ltx2lt ... ltxn, and a number y, a binary search
finds the largest xi such that xilt y in O(log n)
time.
n/2
...
n/4
n
34Binary search
- How many steps would a binary search reduce the
problem size to 1?n n/2 n/4 n/8 n/16
... 1
How many steps? O(log n) steps.
35An O(n log n) method for LIS
- Define BestEndk to be the smallest number of an
increasing subsequence of length k.
9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
BestEnd3
11
8
8
8
BestEnd4
10
10
BestEnd5
13
BestEnd6
36An O(n log n) method for LIS
- Define BestEndk to be the smallest number of an
increasing subsequence of length k.
9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
6
BestEnd3
11
8
8
8
8
BestEnd4
For each position, we perform a binary search to
update BestEnd. Therefore, the running time is
O(n log n).
10
10
10
BestEnd5
13
13
BestEnd6
37Longest Common Subsequence (LCS)
- A subsequence of a sequence S is obtained by
deleting zero or more symbols from S. For
example, the following are all subsequences of
president pred, sdn, predent. - The longest common subsequence problem is to find
a maximum-length common subsequence between two
sequences.
38LCS
- For instance,
- Sequence 1 president
- Sequence 2 providence
- Its LCS is priden.
president providence
39LCS
- Another example
- Sequence 1 algorithm
- Sequence 2 alignment
- One of its LCS is algm.
a l g o r i t h m a l i g n m e n t
40How to compute LCS?
- Let Aa1a2am and Bb1b2bn .
- len(i, j) the length of an LCS between
a1a2ai and b1b2bj - With proper initializations, len(i, j)can be
computed as follows.
41(No Transcript)
42(No Transcript)
43(No Transcript)
44(No Transcript)
45Bioinformatics
46Bioinformatics and Computational Biology-Related
Journals
- Bioinformatics (previously called CABIOS)
- Bulletin of Mathematical Biology
- Computers and Biomedical Research
- Genome Research
- Genomics
- Journal of Bioinformatics and Computational
Biology - Journal of Computational Biology
- Journal of Molecular Biology
- Nature
- Science
47Bioinformatics and Computational Biology-Related
Conferences
- the first IEEE Computer Society Bioinformatics
Conference (CSB 2002, CA, USA) - Intelligent Systems for Molecular Biology (ISMB
2003, Brisbane, Australia) - Pacific Symposium on Biocomputing(PSB 2003,
Kauai, Hawaii, USA) - The Seventh Annual International Conference on
Research in Computational Molecular
Biology (RECOMB 2003, Berlin, Germany)
48Bioinformatics and Computational Biology-Related
Books
- Calculating the Secrets of Life Applications of
the Mathematical Sciences in Molecular Biology,
by Eric S. Lander and Michael S. Waterman (1995) - Introduction to Computational Biology Maps,
Sequences, and Genomes, by Michael S. Waterman
(1995) - Introduction to Computational Molecular Biology,
by Joao Carlos Setubal and Joao Meidanis (1996) - Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology, by
Dan Gusfield (1997) - Computational Molecular Biology An Algorithmic
Approach, by Pavel Pevzner (2000) - Introduction to Bioinformatics, by Arthur M. Lesk
(2002)
49Useful Websites
- MIT Biology Hypertextbook
- http//www.mit.edu8001/afs/athena/course/other/es
gbio/www/7001main.html - The International Society for Computational
Biology - http//www.iscb.org/
- National Center for Biotechnology Information
(NCBI, NIH) - http//www.ncbi.nlm.nih.gov/
- European Bioinformatics Institute (EBI)
- http//www.ebi.ac.uk/
- DNA Data Bank of Japan (DDBJ)
- http//www.ddbj.nig.ac.jp/
50Sequence Alignment
51Dot Matrix
- Sequence ACTTAACT
- Sequence BCGGATCAT
C G G A T C A T
CTTAACT
52Pairwise Alignment
- Sequence A CTTAACT
- Sequence B CGGATCAT
- An alignment of A and B
C---TTAACTCGGATCA--T
Sequence A
Sequence B
53Pairwise Alignment
- Sequence A CTTAACT
- Sequence B CGGATCAT
- An alignment of A and B
Mismatch
Match
C---TTAACTCGGATCA--T
Deletion gap
Insertion gap
54Alignment Graph
- Sequence A CTTAACT
- Sequence B CGGATCAT
C G G A T C A T
CTTAACT
C---TTAACTCGGATCA--T
55A simple scoring scheme
- Match 8 (w(x, y) 8, if x y)
- Mismatch -5 (w(x, y) -5, if x ? y)
- Each gap symbol -3 (w(-,x)w(x,-)-3)
C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score
56An optimal alignment-- the alignment of maximum
score
- Let Aa1a2am and Bb1b2bn .
- Si,j the score of an optimal alignment between
a1a2ai and b1b2bj - With proper initializations, Si,j can be
computedas follows.
57Computing Si,j
j
w(ai,bj)
w(ai,-)
i
w(-,bj)
Sm,n
58Initializations
C G G A T C A T
CTTAACT
59S3,5 ?
C G G A T C A T
CTTAACT
60S3,5 ?
C G G A T C A T
CTTAACT
optimal score
61C T T A A C TC G G A T C A T
8 5 5 8 -5 8 -3 8 14
C G G A T C A T
CTTAACT
62Global Alignment vs. Local Alignment
- global alignment
- local alignment
63An optimal local alignment
- Si,j the score of an optimal local alignment
ending at ai and bj - With proper initializations, Si,j can be
computedas follows.
64local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T
CTTAACT
65local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T
CTTAACT
The best score
66A C - TA T C A T 8-38-38 18
C G G A T C A T
CTTAACT
The best score
67Affine gap penalties
- Match 8 (w(x, y) 8, if x y)
- Mismatch -5 (w(x, y) -5, if x ? y)
- Each gap symbol -3 (w(-,x)w(x,-)-3)
- Each gap is charged an extra gap-open penalty -4.
-4
-4
C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score 12 4 4 4
68Affine gap panalties
- A gap of length k is penalized x ky.
gap-open penalty
- Three cases for alignment endings
- ...x...x
- ...x...-
- ...-...x
gap-symbol penalty
an aligned pair
a deletion
an insertion
69Affine gap penalties
- Let D(i, j) denote the maximum score of any
alignment between a1a2ai and b1b2bj ending with
a deletion. - Let I(i, j) denote the maximum score of any
alignment between a1a2ai and b1b2bj ending with
an insertion. - Let S(i, j) denote the maximum score of any
alignment between a1a2ai and b1b2bj.
70Affine gap penalties
71Affine gap penalties
-y
w(ai,bj)
-x-y
D
-x-y
S
I
-y
72k best local alignments
- Smith-Waterman(Smith and Waterman, 1981
Waterman and Eggert, 1987) - FASTA(Wilbur and Lipman, 1983 Lipman and
Pearson, 1985) - BLAST(Altschul et al., 1990 Altschul et al.,
1997)
73k best local alignments
- Smith-Waterman(Smith and Waterman, 1981
Waterman and Eggert, 1987) - linear-space version sim (Huang and Miller,
1991) - linear-space variants sim2 (Chao et al., 1995)
sim3 (Chao et al., 1997) - Chaining local alignments Gap3 (Huang and Chao,
2003) - FASTA(Wilbur and Lipman, 1983 Lipman and
Pearson, 1985) - linear-space band alignment (Chao et al., 1992)
- BLAST(Altschul et al., 1990 Altschul et al.,
1997) - Enhanced PatternHunter (Yang et al., coming soon)
74FASTA
- Find runs of identities, and identify regions
with the highest density of identities. - Re-score using PAM matrix, and keep top scoring
segments. - Eliminate segments that are unlikely to be part
of the alignment. - Optimize the alignment in a band.
75FASTA
Step 1 Find runes of identities, and identify
regions with the highest density of identities.
Sequence B
Sequence A
76FASTA
Step 2 Re-score using PAM matrix, andkeep top
scoring segments.
77FASTA
Step 3 Eliminate segments that are unlikely to
be part of the alignment.
78FASTA
Step 4 Optimize the alignment in a band.
79BLAST
- Basic Local Alignment Search Tool(by Altschul,
Gish, Miller, Myers and Lipman) - The central idea of the BLAST algorithm is that a
statistically significant alignment is likely to
contain a high-scoring pair of aligned words.
80The maximal segment pair measure
- A maximal segment pair (MSP) is defined to be the
highest scoring pair of identical length segments
chosen from 2 sequences.(for DNA Identities
5 Mismatches -4)
- The MSP score may be computed in time
proportional to the product of their lengths.
(How?) An exact procedure is too time consuming. - BLAST heuristically attempts to calculate the MSP
score.
the highest scoring pair
81BLAST
- Build the hash table for Sequence A.
- Scan Sequence B for hits.
- Extend hits.
82BLAST
Step 1 Build the hash table for Sequence A.
(3-tuple example)
For protein sequences Seq. A ELVISAdd xyz to
the hash table if Score(xyz, ELV) ? TAdd
xyz to the hash table if Score(xyz, LVI) ?
TAdd xyz to the hash table if Score(xyz,
VIS) ? T
For DNA sequences Seq. A AGATCGAT
12345678 AAAAAC..AGA 1..ATC 3..CGA
5..GAT 2 6..TCG 4..TTT
83BLAST
Step2 Scan sequence B for hits.
84BLAST
Step2 Scan sequence B for hits.
Step 3 Extend hits.
BLAST 2.0 saves the time spent in extension, and
considers gapped alignments.
hit
Terminate if the score of the sxtension fades
away. (That is, when we reach a segment pair
whose score falls a certain distance below the
best score found for shorter extensions.)
85Remarks
- Filtering is based on the observation that a good
alignment usually includes short identical or
very similar fragments. - The idea of filtration was used in both FASTA and
BLAST.
86Linear-space ideasHirschberg, 1975 Myers and
Miller, 1988
m/2
87Two subproblems½ original problem size
m/4
m/2
3m/4
88Four subproblems¼ original problem size
m/4
m/2
3m/4
89Time and Space Complexity
- Space O(MN)
- TimeO(MN)(1 ½ ¼ ) O(MN)
2
90Band Alignment(Joint work with W. Pearson and W.
Miller)
Sequence A
Sequence B
91Band Alignment in Linear Space
The remaining subproblems are no longer only half
of the original problem. In worst case, this
could cause an additional log n factor in time.
92Band Alignment in Linear Space
93Multiple sequence alignment (MSA)
- The multiple sequence alignment problem is to
simultaneously align more than two sequences.
Seq1 GCTC Seq2 AC Seq3 GATC
GC-TC A---C G-ATC
94How to score an MSA?
GC-TC A---C
Score
GC-TC A---C G-ATC
GC-TC G-ATC
Score
Score
A---C G-ATC
Score
95MSA for three sequences
96General MSA
- For k sequences of length n O(nk)
- NP-Complete (Wang and Jiang)
- The exact multiple alignment algorithms for many
sequences are not feasible. - Some approximation algorithms are given.(e.g.,
2- l/k for any fixed l by Bafna et al.)
97Progressive alignment
- A heuristic approach proposed by Feng and
Doolittle. - It iteratively merges the most similar pairs.
- Once a gap, always a gap
The time for progressive alignment in most cases
is roughly the order of the time for computing
all pairwise alignment, i.e., O(k2n2) .
A B C D E
98Concluding remarks
- Three essential components of the
dynamic-programming approach - the recurrence relation
- the tabular computation
- the traceback
- The dynamic-programming approach has been used in
a vast number of computational problems in
bioinformatics.