Title: LCS and Extensions to Global and Local Alignment
1LCS and Extensions to Global and Local Alignment
- Dr. Nancy Warter-Perez
- June 26, 2003
2Overview
- Recursion
- Recursive solution to hydrophobicity sliding
window problem - LCS
- Smith-Waterman Algorithm
- Extensions to LCS
- Global Alignment
- Local Alignment
- Affine Gap Penalties
- Programming Workshop 6
3Project References
- http//www.sbc.su.se/arne/kurser/swell/pairwise_a
lignments.html - Computational Molecular Biology An Algorithmic
Approach, Pavel Pevzner - Introduction to Computational Biology Maps,
sequences, and genomes, Michael Waterman - Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology, Dan
Gusfield
4Recursion
- Problems can be solved iteratively or recursively
- Recursion is useful in cases where you are
building upon a partial solution - Consider the hydrophobicity problem
5- Main.cpp
- include ltiostreamgt
- include ltstringgt
- using namespace std
- include "hydro.h"
- double hydro25 1.8,0,2.5,-3.5,-3.5,2.8,-0.4,-
3.2,4.5,0,-3.9,3.8,1.9,-3.5,0, - -1.6,-3.5,-4.5,-0.8,-0.7,0,4.2,-0.9,0,-1.3
- void main ()
- string seq int ws, i
- cout ltlt "This program will compute the
hydrophobicity of an sequence of amino acids.\n" - cout ltlt "Please enter the sequence "ltlt
flush cin gtgt seq - for(i 0 i lt seq.size() i)
- if((seq.data()i gt 'a') (seq.data()i
lt 'z')) - seq.at(i) seq.data()i - 32
- cout ltlt "Please enter the window size "ltlt
flush cin gtgt ws
6- Hydro.cpp
- include ltiostreamgt
- include ltstringgt
- using namespace std
- include "hydro.h"
- void print_hydro(string seq, int ws, int i,
double sum) - void compute_hydro(string seq, int ws)
- cout ltlt "\n\nThe hydrophocity values are" ltlt
endl - print_hydro(seq, ws, seq.size()-1, 0)
-
- void print_hydro(string seq, int ws, int i,
double sum) - if(i -1)
- return
- if(i gt seq.size() - ws)
- sum hydroseq.data()i - 'A'
- else
- sum sum - hydroseq.data()iws - 'A'
hydroseq.data()i - 'A' - print_hydro(seq, ws, i-1, sum)
7hydro.h extern double hydro25 void
compute_hydro(string seq, int ws)
8Dynamic Programming
- Applied to optimization problems
- Useful when
- Problem can be recursively divided into
sub-problems - Sub-problems are not independent
9Longest Common Subsequence (LCS) Problem
- Reference Pevzner
- Can have insertion and deletions but no
substitutions (no mismatches) - Ex V ATCTGAT
- W TGCATA
- LCS TCTA
10LCS Problem (cont.)
- Similarity score
- si-1,j
- si,j max si,j-1
- si-1,j-1 1, if vi wj
- On board example Pevzner Fig 6.1
11Indels insertions and deletions (e.g., gaps)
- alignment of V and W
- V rows of similarity matrix (vertical axis)
- W columns of similarity matrix (horizontal
axis) - Space (gap) in W ? (UP)
- insertion
- Space (gap) in V ? (LEFT)
- deletion
- Match (no mismatch in LCS) (DIAG)
12LCS(V,W) Algorithm
- for i 0 to n
- si,0 0
- for j 1 to m
- s0,j 0
- for i 1 to n
- for j 1 to m
- if vi wj
- si,j si-1,j-1 1 bi,j DIAG
- else if si-1,j gt si,j-1
- si,j si-1,j bi,j UP
- else
- si,j si,j-1 bi,j LEFT
13Print-LCS(V,i,j)
- if i 0 or j 0
- return
- if bi,j DIAG
- PRINT-LCS(V, i-1, j-1)
- print vi
- else if bi,j UP
- PRINT-LCS(V, i-1, j)
- else
- PRINT-LCS(V, I, j-1)
14Classic Papers
- Needleman, S.B. and Wunsch, C.D. A General Method
Applicable to the Search for Similarities in
Amino Acid Sequence of Two Proteins. J. Mol.
Biol., 48, pp. 443-453, 1970.(http//poweredge.sta
nford.edu/BioinformaticsArchive/ClassicArticlesArc
hive/needlemanandwunsch1970.pdf) - Smith, T.F. and Waterman, M.S. Identification of
Common Molecular Subsequences. J. Mol. Biol.,
147, pp. 195-197, 1981.(http//poweredge.stanford.
edu/BioinformaticsArchive/ClassicArticlesArchive/s
mithandwaterman1981.pdf) - Smith, T.F. The History of the Genetic Sequence
Databases. Genomics, 6, pp. 701-707, 1990.
(http//poweredge.stanford.edu/BioinformaticsArchi
ve/ClassicArticlesArchive/smith1990.pdf)
15Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
16Smith-Waterman (2 of 3)
- The formula for Hij follows by considering the
possibilities for ending the segments at any ai
and bj. - If ai and bj are associated, the similarity is
- Hi-l,j-l s(ai,bj).
- (2) If ai is at the end of a deletion of length
k, the similarity is - Hi k, j - Wk .
- (3) If bj is at the end of a deletion of length
1, the similarity is - Hi,j-l - Wl. (typo in paper)
- (4) Finally, a zero is included to prevent
calculated negative similarity, indicating no
similarity up to ai and bj.
17Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
18Extend LCS to Global Alignment
- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM
19Extend to Local Alignment
- 0 (no negative scores)
- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM
20Discussion on adding affine gap penalties
- Affine gap penalty
- Score for a gap of length x
- -(? ?x)
- Where
- ? gt 0 is the insert gap penalty
- ? gt 0 is the extend gap penalty
21Alignment with Gap Penalties Can apply to global
or local (w/ zero) algorithms
- ?si,j max ?si-1,j - ?
- si-1,j - (? ?)
- ?si,j max ?si1,j-1 - ?
- si,j-1 - (? ?)
- si-1,j-1 ?(vi, wj)
- si,j max ?si,j
- ?si,j
- Note keeping with traversal order in Figure 6.1,
? is replaced by ?, and ? is replaced by ?
22Programming Workshop 6
- Implement LCS
- LCS(V,W)
- b and s are global matrices
- Print-LCS(V,i,j)
- Write a program that uses LCS and Print-LCS. The
program should prompt the user for 2 sequences
and print the longest common sequence.