Title: Algorithms for Pairwise Sequence Alignment
1Algorithms for Pairwise Sequence Alignment
- Craig A. Struble, Ph.D.
- Marquette University
2Overview
- Pairwise Sequence Alignment
- Dynamic Programming Solution
- Global Alignment
- Local Alignment
- BLAST and FASTA
3Pairwise Sequence Alignment
- As weve seen, sequence similarity is an
indicator of homology - There are other uses for sequence similarity
- Database queries
- Comparative genomics
4Pairwise Sequence Alignment
- Example
- Which one is better?
HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
5Scoring
- To compare two sequence alignments, calculate a
score - PAM or BLOSUM matrices
- Matches and mismatches
- Gap penalty
- Initiating a gap
- Gap extension penalty
- Extending a gap
6Example
- Gap penalty -8
- Gap extension -8
A E G H W
A 5 -1 0 -2 -3
E -1 6 -3 0 -3
H -2 0 -2 10 -3
P -1 -1 -2 -2 -4
W -3 -3 -3 -3 15
HEAGAWGHE-E
--P-AW-HEAE
(-8) (-8) (-1) 5 15 (-8) 10 6
(-8) 6 9
HEAGAWGHE-E
Exercise Calculate for
P-A--W-HEAE
7Formal Description
- Problem PairSeqAlign
- Input Two sequences x,y
- Scoring matrix s
- Gap penalty d
- Gap extension penalty e
-
- Output The optimal sequence alignment
8How Difficult Is This?
- Consider two sequences of length n
- There are
- possible global alignments, and we need to find
an optimal one from amongst those!
9So what?
- So at n 20, we have over 120 billion possible
alignments - We want to be able to align much, much longer
sequences - Some proteins have 1000 amino acids
- Genes can have several thousand base pairs
10Dynamic Programming
- General algorithmic development technique
- Reuses the results of previous computations
- Store intermediate results in a table for reuse
- Look up in table for earlier result to build from
11Global Alignment
- Needleman-Wunsch 1970
- Idea Build up optimal alignment from optimal
alignments of subsequences
HEAG --P- -25
Add score from table
HEAGA --P-A -20
HEAG- --P-A -33
HEAGA --P -33
Gap with bottom
Top and bottom
Gap with top
12Global Alignment
- Notation
- xi ith letter of string x
- yj jth letter of string y
- x1..i Prefix of x from letters 1 through I
- F matrix of optimal scores
- F(i,j) represents optimal score lining up x1..i
with y1..j - d gap penalty
- s scoring matrix
13Global Alignment
- The work is to build up F
- Initialize F(0,0) 0, F(i,0) id, F(0,j)jd
- Fill from top left to bottom right using the
recursive relation
14Global Alignment
yj aligned to gap
F(i-1,j-1) F(i,j-1)
F(i-1,j) F(i,j)
Move ahead in both
s(xi,yj)
d
d
xi aligned to gap
While building the table, keep track of where
optimal score came from, reverse arrows
15Example
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16
W -24
H -32
E -40
A -48
E -56
16Completed Table
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
17Traceback
- Trace arrows back from the lower right to top
left - Diagonal both
- Up upper gap
- Left lower gap
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
HEAGAWGHE-E --P-AW-HEAE
18Summary
- Uses recursion to fill in intermediate results
table - Uses O(nm) space and time
- O(n2) algorithm
- Feasible for moderate sized sequences, but not
for aligning whole genomes.
19Local Alignment
- Smith-Waterman (1981)
- Another dynamic programming solution
20Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
21Traceback
Start at highest score and traceback to first 0
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
AWGHE AW-HE
22Summary
- Similar to global alignment algorithm
- For this to work, expected match with random
sequence must have negative score. - Behavior is like global alignment otherwise
- Similar extensions for repeated and overlap
matching - Care must be given to gap penalties to maintain
O(nm) time complexity
23Repeat and Overlap Matches
- Repeat matches allow for sections of a sequence
to match repeatedly - Repeated domain or motif
- Overlap matches
- Matching when the two sequences overlap
- Does not penalize overhanging ends
x
x
y
y
24BLAST
- O(n2) algorithms are too slow for large scale
searches - BLAST developed by Altschul et al (1990)
- Uses probabilistic approach to searching
- Idea True alignments will have a short stretch
of identities (perfect match)
25BLAST Overview
- Make a list of neighborhood words
- Length 3 for proteins, 11 for nucleic acids
- Match query with score higher than some threshold
- Usually 2 bits per residue
- Scans database for words
- When a hit is obtained, extends the match in both
direction as ungapped alignment
26FASTA
- Pearson Lipman (1988)
- Find all matching words of length ktup
- 1 or 2 for proteins, 4 or 6 for DNA
- Look for diagonals supporting word matches
- Extend with ungapped alignment
- Join ungapped regions with gaps