LCS and Extensions to Global and Local Alignment - PowerPoint PPT Presentation

1 / 22

About This Presentation

Title:

LCS and Extensions to Global and Local Alignment

Description:

Smith-Waterman Algorithm. Extensions to LCS ... Smith-Waterman (1 of 3) Algorithm. The two molecular sequences will be A=a1a2. ... Smith-Waterman (2 of 3) ... – PowerPoint PPT presentation

Number of Views:174

Avg rating:3.0/5.0

Slides: 23

Provided by: lanct

Category:

more less

Transcript and Presenter's Notes

Title: LCS and Extensions to Global and Local Alignment

1
LCS and Extensions to Global and Local Alignment

Dr. Nancy Warter-Perez
June 26, 2003

2
Overview

Recursion
Recursive solution to hydrophobicity sliding
window problem
LCS
Smith-Waterman Algorithm
Extensions to LCS
Global Alignment
Local Alignment
Affine Gap Penalties
Programming Workshop 6

3
Project References

http//www.sbc.su.se/arne/kurser/swell/pairwise_a
lignments.html
Computational Molecular Biology An Algorithmic
Approach, Pavel Pevzner
Introduction to Computational Biology Maps,
sequences, and genomes, Michael Waterman
Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology, Dan
Gusfield

4
Recursion

Problems can be solved iteratively or recursively
Recursion is useful in cases where you are
building upon a partial solution
Consider the hydrophobicity problem

Main.cpp
include ltiostreamgt
include ltstringgt
using namespace std
include "hydro.h"
double hydro25 1.8,0,2.5,-3.5,-3.5,2.8,-0.4,-
3.2,4.5,0,-3.9,3.8,1.9,-3.5,0,
-1.6,-3.5,-4.5,-0.8,-0.7,0,4.2,-0.9,0,-1.3
void main ()
string seq int ws, i
cout ltlt "This program will compute the
hydrophobicity of an sequence of amino acids.\n"
cout ltlt "Please enter the sequence "ltlt
flush cin gtgt seq
for(i 0 i lt seq.size() i)
if((seq.data()i gt 'a') (seq.data()i
lt 'z'))
seq.at(i) seq.data()i - 32
cout ltlt "Please enter the window size "ltlt
flush cin gtgt ws

Hydro.cpp
include ltiostreamgt
include ltstringgt
using namespace std
include "hydro.h"
void print_hydro(string seq, int ws, int i,
double sum)
void compute_hydro(string seq, int ws)
cout ltlt "\n\nThe hydrophocity values are" ltlt
endl
print_hydro(seq, ws, seq.size()-1, 0)
void print_hydro(string seq, int ws, int i,
double sum)
if(i -1)
return
if(i gt seq.size() - ws)
sum hydroseq.data()i - 'A'
else
sum sum - hydroseq.data()iws - 'A'
hydroseq.data()i - 'A'
print_hydro(seq, ws, i-1, sum)

7
hydro.h extern double hydro25 void
compute_hydro(string seq, int ws)
8
Dynamic Programming

Applied to optimization problems
Useful when
Problem can be recursively divided into
sub-problems
Sub-problems are not independent

9
Longest Common Subsequence (LCS) Problem

Reference Pevzner
Can have insertion and deletions but no
substitutions (no mismatches)
Ex V ATCTGAT
W TGCATA
LCS TCTA

10
LCS Problem (cont.)

Similarity score
si-1,j
si,j max si,j-1
si-1,j-1 1, if vi wj
On board example Pevzner Fig 6.1

11
Indels insertions and deletions (e.g., gaps)

alignment of V and W
V rows of similarity matrix (vertical axis)
W columns of similarity matrix (horizontal
axis)
Space (gap) in W ? (UP)
insertion
Space (gap) in V ? (LEFT)
deletion
Match (no mismatch in LCS) (DIAG)

12
LCS(V,W) Algorithm

for i 0 to n
si,0 0
for j 1 to m
s0,j 0
for i 1 to n
for j 1 to m
if vi wj
si,j si-1,j-1 1 bi,j DIAG
else if si-1,j gt si,j-1
si,j si-1,j bi,j UP
else
si,j si,j-1 bi,j LEFT

13
Print-LCS(V,i,j)

if i 0 or j 0
return
if bi,j DIAG
PRINT-LCS(V, i-1, j-1)
print vi
else if bi,j UP
PRINT-LCS(V, i-1, j)
else
PRINT-LCS(V, I, j-1)

14
Classic Papers

Needleman, S.B. and Wunsch, C.D. A General Method
Applicable to the Search for Similarities in
Amino Acid Sequence of Two Proteins. J. Mol.
Biol., 48, pp. 443-453, 1970.(http//poweredge.sta
nford.edu/BioinformaticsArchive/ClassicArticlesArc
hive/needlemanandwunsch1970.pdf)
Smith, T.F. and Waterman, M.S. Identification of
Common Molecular Subsequences. J. Mol. Biol.,
147, pp. 195-197, 1981.(http//poweredge.stanford.
edu/BioinformaticsArchive/ClassicArticlesArchive/s
mithandwaterman1981.pdf)
Smith, T.F. The History of the Genetic Sequence
Databases. Genomics, 6, pp. 701-707, 1990.
(http//poweredge.stanford.edu/BioinformaticsArchi
ve/ClassicArticlesArchive/smith1990.pdf)

15
Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
16
Smith-Waterman (2 of 3)

The formula for Hij follows by considering the
possibilities for ending the segments at any ai
and bj.
If ai and bj are associated, the similarity is
Hi-l,j-l s(ai,bj).
(2) If ai is at the end of a deletion of length
k, the similarity is
Hi k, j - Wk .
(3) If bj is at the end of a deletion of length
1, the similarity is
Hi,j-l - Wl. (typo in paper)
(4) Finally, a zero is included to prevent
calculated negative similarity, indicating no
similarity up to ai and bj.

17
Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
18
Extend LCS to Global Alignment

si-1,j ?(vi, -)
si,j max si,j-1 ?(-, wj)
si-1,j-1 ?(vi, wj)
?(vi, -) ?(-, wj) -? fixed gap penalty
?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM

19
Extend to Local Alignment

0 (no negative scores)
si-1,j ?(vi, -)
si,j max si,j-1 ?(-, wj)
si-1,j-1 ?(vi, wj)
?(vi, -) ?(-, wj) -? fixed gap penalty
?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM

20
Discussion on adding affine gap penalties

Affine gap penalty
Score for a gap of length x
-(? ?x)
Where
? gt 0 is the insert gap penalty
? gt 0 is the extend gap penalty

21
Alignment with Gap Penalties Can apply to global
or local (w/ zero) algorithms

?si,j max ?si-1,j - ?
si-1,j - (? ?)
?si,j max ?si1,j-1 - ?
si,j-1 - (? ?)
si-1,j-1 ?(vi, wj)
si,j max ?si,j
?si,j
Note keeping with traversal order in Figure 6.1,
? is replaced by ?, and ? is replaced by ?

22
Programming Workshop 6

Implement LCS
LCS(V,W)
b and s are global matrices
Print-LCS(V,i,j)
Write a program that uses LCS and Print-LCS. The
program should prompt the user for 2 sequences
and print the longest common sequence.

Write a Comment

User Comments (0)