Sequence analysis of nucleic acids and proteins: part 1 - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence analysis of nucleic acids and proteins: part 1

Description:

Sequence analysis of nucleic acids and proteins: part 1 ... of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000 ... – PowerPoint PPT presentation

Number of Views:123

Avg rating:3.0/5.0

Slides: 26

Provided by: cen7187

Category:

more less

Transcript and Presenter's Notes

Title: Sequence analysis of nucleic acids and proteins: part 1

1
Sequence analysis of nucleic acids and proteins
part 1
Similarity search Lecture by Terry Speed

Based on Chapter 3 of
Post-genome Bioinformatics by Minoru
Kanehisa, Oxford University
Press, 2000

2
Pairwise sequence alignment by the dynamic
programming algorithm. The algorithm involves
finding the optimal path in the path matrix. (a),
which is equivalent to searching the optimal
solution in the search tree (b).

(a) Path Matrix (b) Search Tree

A
I
M
S
A
M
O
S
X
X
. . . . .
. . . . . . . . .
Alignment AIM-S A-MOS
Pruning by an optimization function
3
Methods for computing the optimal score in the
dynamic programming algorithm (a ) the gap
penalty is a constant. (b) the gap
penalty is a linear function of the gap length.

(a) (b)

Di-1, j-1
d
ws(i), t(j)
d
b
Di, j(2)
Di,j
ws(i), t(j)
b
Di,j(1)
Di,j(3)
4
Concepts of global and local optimality in the
pairwise sequence alignment. The distinction is
made as to how the initial values are assigned to
the path matrix.
(a) Global vs. Global
(b) Local vs. Global
0
0 0 . . . . . . 0
(c) Local vs. Local
0 0 . . . . . . 0
. . . . 0
X
5
Dynamic programming to find edit distances
- Edit operation M, R, I, D - Edit transcript
A string over the alphabet M, R, I, D that
describes a transformation of one string into
another. Example R D I M D M M A -
T H S A - R T - S - Edit
(Levens(h)tein) distance The minimum number of
edit operations necessary to transform one
string into another. (Note matches are not
counted.) Example R D I M D M 1 1 1
0 1 0 4
6
The recurrence
- Stage position in the edit transcript -
State I, D, M, or R - Optimal value function
D(i, j) where D(i, j) edit distance of
Seq11...i and Seq21...j - Recurrence
relation 1 D(i-1, j) D(i, j) min 1
D(i, j-1) t(i, j) D(i-1, j-1) ,
where t(i, j)
7
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 M 1 A
2 T 3 H 4 S 5
8
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 M 1
A 2 T 3 H 4 S 5
9
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 M
1 A 2 T 3 H 4 S 5
10
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
M 1 A 2 T 3 H 4 S 5
11
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 A 2 2 T 3 3 H 4 4 S 5 5
12
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 A 2 2 T 3 3 H 4 4 S 5 5
13
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 A 2 2 T 3 3 H 4 4 S 5 5
14
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 H 4 4 S 5 5
15
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 S 5 5
16
The tabulation , D(i, j)
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
17
The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
18
The solutions - 1
1 0 1 1 0 3 D M R R M M A T H S - A R
T S
19
The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
20
The solutions - 2
1 0 1 0 1 0 3 D M I M D M M A - T H S
- A R T - S
21
The traceback
Seq2(j) A R T S Seq1(i) 0 1 2 3 4 0 0 1 2
3 4 M 1 1 1 2 3 4 A 2 2 1 2 3 4 T 3 3 2 2 2 3 H 4
4 3 3 3 3 S 5 5 4 4 4 3
22
The solutions - 3
1 1 0 1 0 3 R R M D M M A T H S A R
T - S
Life must be lived forwards and understood
backwards. - Søren Kierkegaard
23
BLOSUM62 SCORING MATRIX
134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQI
TPEDLASETLLI
137 LDSNSVDLVLMGVPPRNVEV
EAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM
DD 6
DR -2
From Henikoff 1996
24
Scoring Matrices

Physical/Chemical similarities
comparing two sequences according to the
properties of their residues may highlight
regions of structural similarity
Identity matrices
by stressing only identities in the alignment,
stretches of sequence that may have diverged will
not penalise any remaining common features

25
Scoring Matrices (ctd)

As the direct source of residue by residue
comparison scores the scoring matrix you choose
will have a major impact on the alignment
calculated
The most commonly used will be one of the
mutation matrices
PAM, BLOSUM
The matrix that performs best will be the matrix
that reflects the evolutionary separation of the
sequences being aligned