Sequence Alignment as a Similarity Metric I

1
Sequence Alignment as a Similarity Metric I
Jan 17 2008
Jonas S Almeida, MDACC
http://jonasalmeida.info
jalmeida_at_mdanderson.org
This document: http://ibl.mdanderson.org/jalmeida/lixo/Sequence_Alignment_I.ppt
2
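The scoring model (cont)
Two types of linear scores for gapping: linear and affine
[Figure: gap score - s(g) against gap length g, origin at 0, with curves labeled linear and affine; annotations: gap open penalty, gap extension penalty]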
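The two gap models named on this slide can be written as small functions. This is only a sketch: the slide's own formulas did not survive, so it assumes the common convention that a linear gap of length g costs g times d, while an affine gap costs d to open plus e for each further position. The function names and the numeric penalties are hypothetical.

```python
def linear_gap(g, d):
    """Score of a gap of length g under the linear model (assumed form: -g*d)."""
    return -g * d


def affine_gap(g, d, e):
    """Score of a gap of length g under the affine model.

    Assumed form: pay the open penalty d once, then the extension
    penalty e for each of the remaining g - 1 positions.
    """
    if g <= 0:
        return 0
    return -d - (g - 1) * e


if __name__ == "__main__":
    # Hypothetical penalties, chosen only to show how the two models diverge.
    for g in range(1, 6):
        print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these illustrative values the affine model is harsher on short gaps and more lenient on long ones, which is the usual motivation for it.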
3
The probabilistic rationale for s(g)y(g)
(Practical note: log base 2 is more usual)
Exerc. 2.2.
4
Needleman-Wunsch global alignment
Alignment of sequences $x_i,\ i = 1,2,\dots,n$ and $y_j,\ j = 1,2,\dots,m$
Dynamic programming of the best alignment: the scoring matrix is built by aligning successively bigger segments of the two sequences
$F(i,j)$ = score of the best alignment between $x(1,\dots,i)$ and $y(1,\dots,j)$.
Trick: reach this value by using the result of aligning smaller segments
5
Needleman-Wunsch
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
$s(x_i,y_j)$: from scoring matrix S
$d$: linear gap penalty
Traceback: for each iteration keep track of which term was the maximum value (the index)
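The recurrence and the traceback bookkeeping can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names, the pointer labels, and the toy match/mismatch score are assumptions, and a real run would take s from a substitution matrix such as PAM or BLOSUM.

```python
def simple_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d.

    Returns (score, aligned_x, aligned_y). ptr[i][j] records which of the
    three terms of the recurrence gave the maximum at cell (i, j).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: x aligned to gaps only
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):          # first row: y aligned to gaps only
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from F(n, m) to F(0, 0), following the stored pointers.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT", simple_score, d=1))
```

When several terms tie, this sketch keeps the first one listed, so it reports one optimal alignment out of possibly many.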
6
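Filling F
[Figure: the F matrix with columns y1 ... ym; first row initialized to 0, -d, ..., -jd, ..., -md; first column initialized to 0, -d, ..., -id, ..., -nd; interior cells F(1,1), ..., F(i,j), ..., F(n,m)]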
7
Traceback procedure
Steps (x, y): (1,1) (0,1) (1,1) (1,0)
x steps: 1 0 1 1, giving x = ( 1 -- 2 3 )
y steps: 1 1 1 0, giving y = ( 1 2 3 -- )
$F(n,m) = s(1,1) - d + s(2,3) - d$
8
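Filling F (simple filling - slower)
[Figure: the F matrix with columns y1 ... ym; first row initialized to 0, -d, ..., -jd, ..., -md; first column initialized to 0, -d, ..., -id, ..., -nd; interior cells F(1,1), ..., F(i,j), ..., F(n,m)]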
9
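Filling F (diagonal filling - faster vectorization)
[Figure: the F matrix with columns y1 ... ym; first row initialized to 0, -d, ..., -jd, ..., -md; first column initialized to 0, -d, ..., -id, ..., -nd; interior cells F(1,1), ..., F(i,j), ..., F(n,m)]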
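The two fill orders can be compared directly. The sketch below assumes that "diagonal filling" means sweeping anti-diagonals (cells with constant i + j), since every cell on such a diagonal depends only on the two previous diagonals and can therefore be computed independently. It uses plain Python lists; with an array library each anti-diagonal could become a single vector operation, which is presumably the speed-up the slide refers to. Function names and the toy score are hypothetical.

```python
def simple_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def init_matrix(n, m, d):
    """F with the boundary conditions F(i,0) = -i*d and F(0,j) = -j*d."""
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = -i * d
    for j in range(m + 1):
        F[0][j] = -j * d
    return F


def fill_row_by_row(x, y, s, d):
    """Simple filling: one cell at a time, row after row."""
    n, m = len(x), len(y)
    F = init_matrix(n, m, d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F


def fill_by_antidiagonal(x, y, s, d):
    """Diagonal filling: all cells with i + j == k are computed together."""
    n, m = len(x), len(y)
    F = init_matrix(n, m, d)
    for k in range(2, n + m + 1):
        rows = range(max(1, k - m), min(n, k - 1) + 1)
        # No cell in this batch reads another cell of the same batch.
        values = [max(F[i - 1][k - i - 1] + s(x[i - 1], y[k - i - 1]),
                      F[i - 1][k - i] - d,
                      F[i][k - i - 1] - d)
                  for i in rows]
        for i, v in zip(rows, values):
            F[i][k - i] = v
    return F


if __name__ == "__main__":
    a = fill_row_by_row("GATTACA", "GCATGCT", simple_score, 1)
    b = fill_by_antidiagonal("GATTACA", "GCATGCT", simple_score, 1)
    print(a == b)
```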
10
Scale dependency of algorithms
Big-O notation
Proportionality between running time, t, and size, n, of input arguments
Example: the identification of the F matrix in dynamic programming is $O(n^2)$
Generalization: $O(n^b)$, with $b = \tan(a)$
[Figure: log(t) plotted against log(n), origin at 0; a straight line at angle a]
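The exponent b can be read off as the slope of a straight-line fit in log-log space. The sketch below counts cell updates as a stand-in for running time (an assumption made so that the result is deterministic; wall-clock timings would be noisier) and fits the slope by ordinary least squares. All names are hypothetical.

```python
import math


def count_fill_steps(n):
    """Number of cell updates needed to fill an n-by-n F matrix."""
    steps = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            steps += 1
    return steps


def loglog_slope(ns, ts):
    """Least-squares slope of log(t) against log(n), i.e. the exponent b."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(t) for t in ts]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


if __name__ == "__main__":
    sizes = [10, 20, 40, 80]
    steps = [count_fill_steps(n) for n in sizes]
    print(loglog_slope(sizes, steps))
```

For the quadratic fill the fitted slope comes out at 2 up to floating-point error; replacing the step counts with measured times gives an empirical estimate of b for any implementation.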
11
1. Global alignment by the Needleman-Wunsch algorithm
$s(x_i,y_j)$: from scoring matrix S
$d$: linear gap penalty
[Figure: the F matrix with first row 0, -d, -2d, ..., -md and first column 0, -d, -2d, ..., -nd; interior cells F(1,1), ..., F(n,m); the total score is F(n,m); trace back from F(n,m)]
12
2. Local alignment by the Smith-Waterman algorithm
To break uninformative pairwise combinations
[Figure: the F matrix with first row and first column set to 0; interior cells F(1,1), ..., F(n,m); the total score is the max over the matrix; trace back from that cell]
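A minimal sketch of the local variant follows. It assumes the usual Smith-Waterman form: the same recurrence as the global case with an extra 0 term, a zero first row and column, a score taken as the maximum anywhere in F, and a traceback that stops at the first 0. The function names and the toy scoring values are hypothetical.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def smith_waterman(x, y, s, d):
    """Local alignment with linear gap penalty d.

    Returns (score, aligned_x, aligned_y) for one best-scoring local segment.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the maximum until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG", simple_score, d=2))
```

The traceback here re-derives the winning term from F instead of storing pointers, which saves memory at the cost of a few extra comparisons.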
13
3. Repeated matches
(Local alignment)
[Figure: the F matrix with top row 0, F(0,1), F(0,2), ..., F(0,m), F(0,m+1); labels: threshold value, total score, trace back]
14
4. Overlap matches
[Figure: the F matrix with top row 0, F(0,1), F(0,2), ..., F(0,m), F(0,m+1); labels: threshold value, (Local alignment), (Global alignment), total score, 0, max, trace back]
15
Considering linear gap penalties in global alignment by NW algorithm
$s(x_i,y_j)$: from scoring matrix S
$d$: linear gap penalty
16
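Two types of linear scores for gapping: linear and affine
[Figure: gap score - s(g) against gap length g, origin at 0, with curves labeled linear and affine; annotations: gap open penalty, gap extension penalty]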
17
Considering linear gap penalties in global alignment by NW algorithm
[Figure: the matrix indexed by i and j]
$O(n^2)$
18
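Considering affine gap penalties in global alignment by NW algorithm
[Figure: the matrix indexed by i and j, with a third index k]
$O(n^2)$ over indices i, j; $O(n^3)$ over indices i, j, k
It can be partially simplified in some cases, but $2 < b < 3$
[Figure: log(t) plotted against log(n), origin at 0; $O(n^b)$ with $b = \tan(a)$]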
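One way to see where the third index comes from: if the gap score is an arbitrary function of gap length, each cell has to scan every earlier cell in its row and column. The sketch below assumes that textbook formulation; the gap function gamma, the function names, and the penalty values are hypothetical.

```python
def simple_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def nw_general_gap(x, y, s, gamma):
    """Global alignment score where gamma(g) is the score of a gap of length g.

    The two inner loops over k make the cost cubic in sequence length.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = gamma(i)
    for j in range(1, m + 1):
        F[0][j] = gamma(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = F[i - 1][j - 1] + s(x[i - 1], y[j - 1])
            for k in range(i):            # gap of length i - k ending at row i
                best = max(best, F[k][j] + gamma(i - k))
            for k in range(j):            # gap of length j - k ending at column j
                best = max(best, F[i][k] + gamma(j - k))
            F[i][j] = best
    return F[n][m]


if __name__ == "__main__":
    affine = lambda g: -2 - (g - 1) * 1   # hypothetical d = 2, e = 1
    print(nw_general_gap("AAAA", "AA", simple_score, affine))
```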
19
However,
20
(No Transcript)
21
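3 state Finite State Automaton (FSA)
States: M (1,1), Ix (0,1), Iy (1,0); each state has its own matrix over i = 0,...,n and j = 0,...,n
Transitions:
M to M: $s(x_i,y_j)$
M to Ix: $-d$
Ix to Ix: $-e$
Ix to M: $s(x_i,y_j)$
M to Iy: $-d$
Iy to Iy: $-e$
Iy to M: $s(x_i,y_j)$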
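The three states translate into three matrices filled together, which keeps affine gaps at quadratic cost. The sketch below is an assumption-laden illustration, not the slide's own code: it takes Ix to mean "x aligned to a gap" (advance i only) and Iy the reverse, which may be the opposite of the slide's (0,1)/(1,0) labels, and it allows no direct move between Ix and Iy, matching the seven transitions listed. Names and penalties are hypothetical.

```python
NEG_INF = float("-inf")


def simple_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def nw_affine(x, y, s, d, e):
    """Global alignment score with gap open penalty d and extension penalty e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Any state may move to M, paying the substitution score.
            M[i][j] = s(x[i - 1], y[j - 1]) + max(M[i - 1][j - 1],
                                                  Ix[i - 1][j - 1],
                                                  Iy[i - 1][j - 1])
            # Open a gap from M (-d) or extend an existing one (-e).
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    print(nw_affine("AAAA", "AA", simple_score, d=2, e=1))
```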
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
p.31
26
Alternative formulations of FSA
27
(No Transcript)
28
The basic idea of preprocessing sequences
1) Embed
2) Sort (re-index database)
3) Search template by divide and conquer
4) Use best matches as seeds for alignment
29
The statistics of global sequence comparison
Determined with reference to random model
1. Three possibilities
i) Real non-homologous sequences
ii) Real sequences that are shuffled (composition
is preserved)
iii) Sequences generated randomly based upon model
2. Statistics of global comparison obtained by
Monte Carlo experiments
[Figure: L(S > ?) on an axis from 0 to 1, plotted against ?]
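A Monte Carlo experiment of the shuffled kind (possibility ii above) might look like the sketch below. It shuffles one sequence, which preserves its composition, rescores the global alignment each time, and reports the fraction of shuffled scores at least as high as the observed one. The sequences, scoring values, number of trials, and function names are all hypothetical.

```python
import random


def simple_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (hypothetical stand-in for a scoring matrix S)."""
    return match if a == b else mismatch


def nw_score(x, y, s, d):
    """Global alignment score with linear gap penalty d, keeping two rows only."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                           prev[j] - d,
                           cur[j - 1] - d))
        prev = cur
    return prev[-1]


def shuffled_scores(x, y, s, d, trials, seed):
    """Scores of x against random permutations of y (composition preserved)."""
    rng = random.Random(seed)
    letters = list(y)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(nw_score(x, "".join(letters), s, d))
    return scores


def empirical_tail(scores, observed):
    """Fraction of random scores that reach or exceed the observed score."""
    return sum(sc >= observed for sc in scores) / len(scores)


if __name__ == "__main__":
    x, y = "ACGTACGT", "ACGTTGCA"
    observed = nw_score(x, y, simple_score, 1)
    null = shuffled_scores(x, y, simple_score, 1, trials=200, seed=0)
    print(observed, empirical_tail(null, observed))
```

Fixing the seed makes a run repeatable; in practice many more trials and realistic sequence lengths would be needed before the tail estimate means much.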
30
(No Transcript)
31
The statistics of local sequence comparison
  • Without gaps

Variations on Smith-Waterman and Sellers will find segments whose scores cannot be improved by extension or trimming -> HSPs (High Scoring Segment Pairs)
> Sequence 1
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
> Sequence 2
GSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFKAHQE

S = 28 (Needleman)
GSAQVKGHGKKVADALTNAVAHVDD-M-PN-A-LSALSDLHAHKL
GS    G      D L  A  H  D    N A L       AH
GSGYLVGDSLTFVDLLV-A-QHTADLLAANAALLDEFPQFKAHQE

S = 43 (Smith Waterman)
GHGKKVADALT--N-AVA-HVDDMPNALSALSD
G G  V D LT     VA H  D   A  AL D
GSGYLVGDSLTFVDLLVAQHTADLLAANAALLD

S = 33 (Repeated Local)
GHGKKVADALT.KKVADALTNAVAHVDDMP.
G G  V D LT    AD L    A  D  P
GSGYLVGDSLTFQHTADLLAANAALLDEFPQ

S = 29 (Repeated Overlap)
GHGKKVADALT--N-AVA-HVDD-M-PN-A-LSALSDLHAHKL
G G  V D LT     VA H  D    N A L       AH
GSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFKAHQE
32
The statistics of local sequence comparison
33
Homework: Write a program that takes as input arguments two sequences and a substitution matrix, and uses the Needleman-Wunsch algorithm to align them. Test it with two protein sequences and either a PAM or BLOSUM matrix. Return the m file and the input arguments by email.