Title: Sequence analysis of nucleic acids and proteins: part 1
1Sequence analysis of nucleic acids and proteins
part 1
Similarity search Lecture by Terry Speed
- Based on Chapter 3 of
Post-genome Bioinformatics by Minoru
Kanehisa, Oxford University
Press, 2000
2 Pairwise sequence alignment by the dynamic
programming algorithm. The algorithm involves
finding the optimal path in the path matrix. (a),
which is equivalent to searching the optimal
solution in the search tree (b).
- (a) Path Matrix (b) Search Tree
A
I
M
S
A
M
O
S
X
X
. . . . .
. . . . . . . . .
Alignment AIM-S A-MOS
Pruning by an optimization function
3 Methods for computing the optimal score in the
dynamic programming algorithm (a ) the gap
penalty is a constant. (b) the gap
penalty is a linear function of the gap length.
Di-1, j-1
d
ws(i), t(j)
d
b
Di, j(2)
Di,j
ws(i), t(j)
b
Di,j(1)
Di,j(3)
4 Concepts of global and local optimality in the
pairwise sequence alignment. The distinction is
made as to how the initial values are assigned to
the path matrix.
(a) Global vs. Global
(b) Local vs. Global
0
0 0 . . . . . . 0
(c) Local vs. Local
0 0 . . . . . . 0
. . . . 0
X
5Dynamic programming to find edit distances
- Edit operation M, R, I, D - Edit transcript
A string over the alphabet M, R, I, D that
describes a transformation of one string into
another. Example R D I M D M M A -
T H S A - R T - S - Edit
(Levens(h)tein) distance The minimum number of
edit operations necessary to transform one
string into another. (Note matches are not
counted.) Example R D I M D M 1 1 1
0 1 0 4
6The recurrence
- Stage position in the edit transcript -
State I, D, M, or R - Optimal value function
D(i, j) where D(i, j) edit distance of
Seq11...i and Seq21...j - Recurrence
relation 1 D(i-1, j) D(i, j) min 1
D(i, j-1) t(i, j) D(i-1, j-1) ,
where t(i, j)
7The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 M 1 A
2 T 3 H 4 S 5
8The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 M 1
A 2 T 3 H 4 S 5
9The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 M
1 A 2 T 3 H 4 S 5
10The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
M 1 A 2 T 3 H 4 S 5
11The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 A 2 2 T 3 3 H 4 4 S 5 5
12The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 A 2 2 T 3 3 H 4 4 S 5 5
13The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 A 2 2 T 3 3 H 4 4 S 5 5
14The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 H 4 4 S 5 5
15The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 S 5 5
16The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
17The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
18The solutions - 1
1 0 1 1 0 3 D M R R M M A T H S - A R
T S
19The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
20The solutions - 2
1 0 1 0 1 0 3 D M I M D M M A - T H S
- A R T - S
21The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
22The solutions - 3
1 1 0 1 0 3 R R M D M M A T H S A R
T - S
Life must be lived forwards and understood
backwards. - Søren Kierkegaard
23BLOSUM62 SCORING MATRIX
134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQI
TPEDLASETLLI
137 LDSNSVDLVLMGVPPRNVEV
EAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM
DD 6
DR -2
From Henikoff 1996
24Scoring Matrices
- Physical/Chemical similarities
- comparing two sequences according to the
properties of their residues may highlight
regions of structural similarity - Identity matrices
- by stressing only identities in the alignment,
stretches of sequence that may have diverged will
not penalise any remaining common features
25Scoring Matrices (ctd)
- As the direct source of residue by residue
comparison scores the scoring matrix you choose
will have a major impact on the alignment
calculated - The most commonly used will be one of the
mutation matrices - PAM, BLOSUM
- The matrix that performs best will be the matrix
that reflects the evolutionary separation of the
sequences being aligned