Pair-wise Sequence Alignment (II)

About This Presentation

Title:

Pair-wise Sequence Alignment (II)

Description:

Title: PowerPoint Presentation Author: heringa Last modified by: Jaap Heringa Created Date: 2/20/2003 6:00:44 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Number of Views:117

Avg rating:3.0/5.0

Slides: 53

Provided by: heri4

Category:

more less

Transcript and Presenter's Notes

Title: Pair-wise Sequence Alignment (II)

1
Introduction to bioinformatics 2008 Lecture 6
Pair-wise Sequence Alignment (II)
2
What can sequence alignment tell us about
structure HSSP Sander Schneider, 1991
30 sequence identity
3
Sequence alignmentHistory of Dynamic Programming
algorithm
1970 Needleman-Wunsch global pair-wise
alignment Needleman SB, Wunsch CD (1970) A
general method applicable to the search for
similarities in the amino acid sequence of two
proteins, J Mol Biol. 48(3)443-53. 1981
Smith-Waterman local pair-wise alignment Smith,
TF, Waterman, MS (1981) Identification of common
molecular subsequences. J. Mol. Biol. 147,
195-197.
4
Global dynamic programming
j-1 j
i-1 i
Value from residue exchange matrix
H(i-1,j-1) S(i,j) H(i-1,j) - g H(i,j-1) - g
diagonal vertical horizontal
H(i,j) Max
This is a recursive formula
5
Dynamic programming
j
i
The cell i, j contains the alignment score of
the best scoring alignment score of subsequence
1..i and 1..j, that is, the subsequences up to
i, j Cell i, j does not know what that best
scoring alignment is (it is one out of many
possibilities)
6
Global dynamic programmingPAM250, Gap 6 (linear)
S P E A R E
0 -6 -12 -18 -24 -30 -36
S -6 2 -4 -10 -16 -22 -28
H -12 -4 2 -3 -9 -14 -20
A -18 -10 -3 0 -1 -7 -13
K -24 -16 -9 -3 -1 2 -4
E -30 -22 -15 -5 -3 -2 6
S P E A R E
S 2 1 0 1 0 0
H -1 0 1 -1 2 1
A 1 1 0 2 -2 0
K 0 -1 0 -1 3 0
E 0 -1 4 0 -1 4
These values are copied from the PAM250 matrix
(see earlier slide) and represent the S(i, j)
values in the DP formula (back two slides)
SPEARE S-HAKE
The easy algorithm is only for linear gap
penalties
Higgs Attwood, p. 124 Note There are errors
in the matrices!!
7
Global dynamic programmingPAM250, Gap 6 (linear)
S P E A R E
0 -6 -12 -18 -24 -30 -36
S -6 2 -4 -10 -16 -22 -28
H -12 -4 2 -3 -9 -14 -20
A -18 -10 -3 2 -1 -7 -13
K -24 -16 -9 -3 1 2 -4
E -30 -22 -15 -5 -3 -2 6
S P E A R E
S 2 1 0 1 0 0
H -1 0 1 -1 2 1
A 1 1 0 2 -2 0
K 0 -1 0 -1 3 0
E 0 -1 4 0 -1 4
These values are copied from the PAM250 matrix
(see earlier slide)
Start in left upper cell before either sequence
(circled in red). Path will end in lower right
cell (circled in blue)
SPEARE S-HAKE
The easy algorithm is only for linear gap
penalties
Higgs Attwood, p. 124 Note There are errors
in the matrices!!
8
DP is a two-step process

Forward step calculate scores
Trace back start at highest score and
reconstruct the path leading to the highest score
These two steps lead to the highest scoring
alignment (the optimal alignment)
This is guaranteed when you use DP!

9
Global dynamic programming
j-1 j
i-1 i
H(i-1,j-1) S(i,j) H(i-1,j) - g H(i,j-1) - g
diagonal vertical horizontal
H(i,j) Max

Problem with simple DP approach
Can only do linear gap penalties
Not suitable for affine and concave penalties,
but algorithm can be extended to deal with affine
penalties (preceding lecture)

10
Global dynamic programmingusing affine penalties
j-2 j-1 j
Looking back from cell (i, j) we can adapt the
algorithm for affine gap penalties by looking
back to four more cells (magenta)
i-2 i-1 i
If you came from here, gap was already open, so
apply gap-extension penalty
If you came from here, gap was opened so apply
gap-open penalty
11
Time and memory complexity of DP

The time complexity is O(n2) if you would align
two sequences of n residues, you would need to
perform n2 algorithmic steps (square search
matrix has n2 cells that need to be filled)
The memory (space) complexity is also O(n2) if
you would align two sequences of n residues, you
would need a square search matrix of n by n
containing n2 cells

12
Global dynamic programmingall types of gap
penalties
j-1
The complexity of this DP algorithm is increased
from O(n2) to O(n3)
The gap length is known exactly and so any gap
penalty regime can be used
i-1
MaxS0ltxlti-1, j-1 - Gap(i-x-1) Si-1,j-1 MaxSi-1,
0ltyltj-1 - Gap(j-y-1)
Si,j si,j Max
13
Global dynamic programmingif affine penalties
are used
j-1
i-1
MaxS0ltxlti-1, j-1 -Go -(i-x-1)Ge Si-1,j-1 MaxSi
-1, 0ltyltj-1 -Go -(j-y-1)Ge
Si,j si,j Max
14
DP recipe for using affine gap penalties
j-1
i-1

Mi,j is optimal alignment (highest scoring
alignment until i,j)
Check
preceding row until j-2 apply appropriate score
and gap penalties
preceding row until i-2 apply appropriate score
and gap penalties
and celli-1, j-1 apply score for celli-1,
j-1

15
Global dynamic programmingAffine penalties
Gapo10, Gape2
D W V T A L K
0 -12 -14 -16 -18 -20 -22 -24
T -12 8 -9 -6 -5 -9 -11 -14
D -14 0 9 2 2 3 -5 -3 -34
W -16 -13 25 11 5 4 9 0 -21
V -18 -10 -4 37 21 19 19 15 -16
L -20 -14 -2 23 46 31 37 26 1
K -22 -12 -9 17 33 53 39 50 14
-34 -29 -1 17 39 27 50
D W V T A L K
T 8 3 8 11 9 9 8
D 12 1 6 8 8 4 8
W 1 25 2 3 2 6 5
V 6 2 12 8 8 10 6
L 4 6 10 9 6 14 5
K 8 5 6 8 7 5 13
These values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
The extra bottom row and rightmost column give
the final global alignment scores
16
DP is a two-step process

Forward step calculate scores
Trace back start at highest score and
reconstruct the path leading to the highest score
These two steps lead to the highest scoring
alignment (the optimal alignment)
This is guaranteed when you use DP!

17
Global dynamic programming
18
Semi-global pairwise alignment

Global alignment all gaps are penalised
Semi-global alignment N- and C-terminal gaps
(end-gaps) are not penalised
MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS

End-gaps
End-gaps
19
Semi-global dynamic programmingPAM250, Gap 6
(linear)
S P E A R E
0 0 0 0 0 0 0
S 0 2 1 0 1 0 0
H 0 -1 2 2 -1 3 1
A 0 1 0 2 4 -2 3
K 0 0 0 0 1 7 1
E 0 0 -1 4 0 1 11
S P E A R E
S 2 1 0 1 0 0
H -1 0 1 -1 2 1
A 1 1 0 2 -2 0
K 0 -1 0 -1 3 0
E 0 -1 4 0 -1 4
These values are copied from the PAM250 matrix
(see earlier slide)
Start in left upper cell before either sequence
(circled in red). Path will end in cell anywhere
in the bottom row or rightmost columns with the
highest score
SPEARE -SHAKE
The easy algorithm is only for linear gap
penalties
Higgs Attwood, p. 124 Note There are errors
in the matrices!!
20
Semi-global dynamic programming- two examples
with different gap penalties -
These values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
21
Semi-global pairwise alignment

Applications of semi-global
Finding a gene in genome
Placing marker onto a chromosome
Generally if one sequence is much longer than
the other
Danger if gap penalties high -- really bad
alignments for divergent sequences

22
There are three kinds of alignments

Global alignment
Semi-global alignment
Local alignment

23
Local dynamic programming (Smith Waterman,
1981)
LCFVMLAGSTVIVGTR
E D A S T I L C G S
Negative numbers
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open, extension)
AGSTVIVG A-STILCG
24
Local dynamic programming (Smith and Waterman,
1981)basic algorithm
j-1 j
i-1 i
H(i-1,j-1) S(i,j) H(i-1,j) - g H(i,j-1) - g 0
diagonal vertical horizontal
H(i,j) Max
25
Example local alignment of two sequences

Align two DNA sequences
GAGTGA
GAGGCGA (note the length difference)?
Parameters of the algorithm
Match score(A,A) 1
Mismatch score(A,T) -1
Gap g -2

Mi-1, j-1 1
Mi, j-1 2
max
Mi, j
Mi-1, j 2
0
26
The algorithm. Step 1 init
Mi-1, j-1 1

Create the matrix
Initiation
No beginning row/column
Just apply the equation

Mi, j-1 2
Mi, j
max
Mi-1, j 2
0
27
The algorithm. Step 2 fill in
Mi-1, j-1 1
Mi, j-1 2
Mi, j
max

Perform the forward step

Mi-1, j 2
0
28
The algorithm. Step 2 fill in
Mi-1, j-1 1
Mi, j-1 2
Mi, j
max

Perform the forward step

Mi-1, j 2
0
29
The algorithm. Step 2 fill in
Mi-1, j-1 1
Mi, j-1 2
Mi, j
max

Were done
Find the highest cell anywhere in the matrix

Mi-1, j 2
0
30
The algorithm. Step 3 trace back
Mi-1, j-1 1

Reconstruct path leading to highest scoring cell
Trace back until zero alignment path can begin
and terminate anywhere in matrix
Alignment GAG
GAG

Mi, j-1 2
Mi, j
max
Mi-1, j 2
0
31
Local dynamic programmingMatch/mismatch 1/-1,
Gap 2
A T G A C G T
0 0 0 0 0 0 0 0
T 0 0 1 0 0 0 0 1
A 0 1 0 0 1 0 0 0
G 0 0 0 1 0 0 1 0
A 0 1 0 0 2 0 0 0
C 0 0 0 0 0 3 1 0
T 0 0 1 0 0 0 2 2
Fill the matrix (forward pass), then do trace
back from highest cell anywhere in the matrix
till you reach 0 or the beginning of a sequence
32
Local dynamic programmingMatch/mismatch 1/-1,
Gap 2
A T G A C G T
0 0 0 0 0 0 0 0
T 0 0 1 0 0 0 0 1
A 0 1 0 0 1 0 0 0
G 0 0 0 1 0 0 1 0
A 0 1 0 0 2 0 0 0
C 0 0 0 0 0 3 1 0
T 0 0 1 0 0 0 2 2
GAC GAC
Fill the matrix (forward pass), then do trace
back from highest cell anywhere in the matrix
till you reach 0 or the beginning of a sequence
33
Local dynamic programming (Smith Waterman,
1981)
j-1
This is the general DP algorithm, which is
suitable for linear, affine and concave
penalties, although for the example here affine
penalties are used
i-1
Gap opening penalty
Si,j MaxS0ltxlti-1,j-1 - Pi - (i-x-1)Px Si,j
Si-1,j-1 Si,j Max Si-1,0ltyltj-1 - Pi -
(j-y-1)Px 0
Si,j Max
Gap extension penalty
34
Local dynamic programming
35
Global or Local Pairwise alignment
B
B
C
A
A
B
A
A
C
A
B
C
A
Local
B
Local
A
B
C
A
B
C
B
A
Global
Global
A
B
C
A
36
Globin fold ? protein myoglobin PDB 1MBN
Alpha-helices are labelled A (blue) to H
(red). The D helix can be missing in some
globins What happens with the alignment if
D-helix containing globin sequences are aligned
with D-less ones?
37
? sandwich ? protein immunoglobulin PDB 7FAB
Immunoglobulinstructures have variable regions
where numbers of amino acids can vary
substantially
38
TIM barrel ? / ? protein Triose phosphate
IsoMerase PDB 1TIM
The evolutionary history of this protein family
has been the subject of rigorous debate.
Arguments have been made in favor of both
convergent and divergent evolution. Because of
the general lack of sequence homology, the
ancestry of this molecule is still a mystery.
39
What does all this mean for alignments?

Alignments need to be able to skip secondary
structural elements to complete domains (i.e.
putting gaps opposite these motifs in the shorter
sequence).
Depending on gap penalties chosen, the algorithm
might have difficulty with making such long gaps
(for example when using high affine gap
penalties), resulting in incorrect alignment.
Alignments are only meaningful for homologous
sequences (with a common ancestor)

40
There are three kinds of pairwise alignments

Global alignment align all residues in both
sequences all gaps are penalised
Semi-global alignment align all residues in
both sequences end gaps are not penalised (zero
end gap penalties)
Local alignment align one part of each
sequence end gaps are not applicable

41
Easy global DP recipe for using affine gap
penalties (after Gotoh)
j-1
Penalty Pi gap_lengthPe
MaxS0ltxlti-1, j-1 - Pi - (i-x-1)Px Si-1,j-1 MaxS
i-1, 0ltyltj-1 - Pi - (j-y-1)Px
Si,j si,j Max
i-1

Mi,j is optimal alignment (highest scoring
alignment until i, j)
At each cell i, j in search matrix, check Max
coming from
any cell in preceding row until j-2 add score
for celli, j minus appropriate gap penalties
any cell in preceding column until i-2 add score
for celli, j minus appropriate gap penalties
or celli-1, j-1 add score for celli, j
Select highest scoring cell in bottom row and
rightmost column and do trace-back

42
Lets do an example global alignmentGotohs DP
algorithm with affine gap penalties (PAM250,
Pi10, Pe2)
D W V T A L K
0 -12 -14 -16 -18 -20 -22 -24
T -12 0 -17 -14 -13 -17 -19 -22 -22
D -14 -8 -7 -14 -14 -13 -42
W -16 -21 9 -13 -19 -18
V -18 -18 -20 13 -3 -16
L -20 -22 -18 -1 14 -1 -14
K -22 -20 -21 -12
-24 -42 -41 -18 -16 -14 -12 0
D W V T A L K
T 0 -5 0 3 1 1 0
D 4 -7 -2 0 0 -4 0
W -7 17 -6 -5 -6 -2 -3
V -2 -6 4 0 0 2 -2
L -4 -2 2 1 -2 6 -3
K 0 -3 -2 0 -1 -3 5
PAM250
Cell (D2, T4) can alternatively come from two
cells (same score) high-road or low-road
Row and column 0 are filled with 0, -12, -14,
-16, if global alignment is used (for
N-terminal end-gaps) also extra row and column
at the end to calculate the score including
C-terminal end-gap penalties. Note that only
non-diagonal arrows are indicated for clarity
(no arrow means that you go back to earlier
diagonal cell).
43
Lets do another example semi-global
alignmentGotohs DP algorithm with affine gap
penalties (PAM250, Pi10, Pe2)
D W V T A L K
T 0 -5 0 3 1 1 0
D 4 -7 -2 0 0 -4 0
W -7 17 -6 -5 -6 -2 -3
V -2 -6 4 0 0 2 -2
L -4 -2 2 1 -2 6 -3
K 0 -3 -2 0 -1 -3 5
D W V T A L K
T 0 -5 0 3
D 4 -7 -7
W -7 21 -13
V -2 -13 25 9
L
K
PAM250
Starting row and column 0, and extra column at
right or extra row at bottom is not necessary
when using semi global alignment (zero end-gaps).
Rest works as under global alignment.
44
Easy local DP recipe for using affine gap
penalties (after Gotoh)
j-1
Penalty Pi gap_lengthPe
Si,j MaxS0ltxlti-1,j-1 - Pi - (i-x-1)Px Si,j
Si-1,j-1 Si,j Max Si-1,0ltyltj-1 - Pi -
(j-y-1)Px 0
Si,j Max
i-1

Mi,j is optimal alignment (highest scoring
alignment until i, j)
At each cell i, j in search matrix, check Max
coming from
any cell in preceding row until j-2 add score
for celli, j minus appropriate gap penalties
any cell in preceding column until i-2 add score
for celli, j minus appropriate gap penalties
celli-1, j-1 add score for celli, j
or 0
Select highest scoring cell anywhere in matrix
and do trace-back until zero-valued cell or start
of sequence(s)

45
Lets do yet another example local
alignmentGotohs DP algorithm with affine gap
penalties (PAM250, Pi10, Pe2)
D W V T A L K
T 0 -5 0 3 1 1 0
D 4 -7 -2 0 0 -4 0
W -7 17 -6 -5 -6 -2 -3
V -2 -6 4 0 0 2 -2
L -4 -2 2 1 -2 6 -3
K 0 -3 -2 0 -1 -3 5
D W V T A L K
T 0 0 0 3
D 4 0 0 0
W 0 21 0 0
V 0 0 25 9
L 0 0 11
K 0 0
PAM250
Extra start/end columns/rows not necessary (no
end-gaps). Each negative scoring cell is set to
zero. Highest scoring cell may be found anywhere
in search matrix after calculating it. Trace
highest scoring cell back to first cell with zero
value (or the beginning of one or both sequences)
46
Dot plots

Way of representing (visualising) sequence
similarity without doing dynamic programming (DP)
Make search matrix as for DP, but locally
represent sequence similarity by averaging using
a sliding window

47
Dot-plots
Dot plots are calculated using a diagonal window
of preset length that is slid through the search
matrix --typically the central cell holds the
window score (e.g. sum, average)
48
Dot-plotsa simple way to visualise sequence
similarity
Filter 6/10 residues have to match...
Can be a bit messy, though...
49
Dot-plots, what about...

Insertions/deletions -- DNA and proteins
Duplications (e.g. tandem repeats) DNA and
proteins
Inversions -- DNA

Dot plots are calculated using a diagonal window
of preset length that is slid through the search
matrix --typically the central cell holds the
window score (e.g. sum, average)
50
Dot-plots, self-comparison
51

a heuristic
Heuristics A rule of thumb that often helps in
solving a certain class of problems, but makes no
guarantees.Perkins, DN (1981) The Mind's Best
Work

52
For your first exam D1Make sure you
understand and can carry out 1. the simple DP
algorithm for global, semi-global and local
alignment (using linear gap penalties but make
sure you know the extension of the basic
algorithm for affine gap penalties) and 2. The
general DP algorithm for global, semi-global and
local alignment (using linear, affine and concave
gap penalties)!

Write a Comment

User Comments (0)