Lecture 5 Pairwise sequence alignment algorithms

About This Presentation

Title:

Lecture 5 Pairwise sequence alignment algorithms

Description:

Local - Smith Waterman. Heuristics for the real world. BLAST. Fundamentals. When we compare sequences, we are looking for evidence that they have diverged ... – PowerPoint PPT presentation

Number of Views:173

Avg rating:3.0/5.0

Slides: 114

Provided by: karch7

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 5 Pairwise sequence alignment algorithms

1
Lecture 5Pairwise sequence alignment algorithms

Rachel Karchin
BME 580.688/580.488 and CS 600.488 Spring 2009

2
Outline

Fundamentals and basics
Exact alignment algorithms
Global Needleman-Wunsch
Local - Smith Waterman
Heuristics for the real world
BLAST

3
Fundamentals

When we compare sequences, we are looking for
evidence that they have diverged from a common
ancestor by a process of mutation and selection.

Durbin et al. Biological Sequence Analysis
4
Fundamentals

Assume a simplified model of evolution
Substitutions
Deletions
Insertions

Durbin et al. Biological Sequence Analysis
5
Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
What is a shared equivalent position?
6
Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
GACCC GACAA
GACCC-- --GACAA
GACC-C GACAA-
G-ACCC GACAA-
GACCC- GA-CAA
GACCC- GAC-AA
7
Pairwise alignment basics
Tabular representation
GACCC GACAA
Two sequences
8
Pairwise alignment basics
GACCC GACAA
Tabular representation
Two sequences
9
Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
10
Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
11
Pairwise alignment

Which of all of these alignments is the best one?

12
Pairwise alignment

Requires a scoring system that gives points or
penalties for matches and gaps

13
Pairwise alignment

Also requires efficient search strategy to select
the best alignment or best k alignments out of
very large number of possible alignments

14
Scoring system

Match points
Indel costs

What do we want out of such a scoring system?
15
Scoring matrices

PAM matrices invented by Margaret Dayhoff in
1978.
Based on mutations in 70 protein families
Reconstructed the ancestral sequence of each
family

16
Scoring matrices

Count number of changes for each amino acid and
divide by a normalization factor of exposure to
mutation
Assumes that the observed substitution is the
result of a single mutation event.

17
Scoring matrices

PAM1 gives the probability of a 1 change in
sequence
One substitution per 100 amino acids
PAM100, PAM250 etc. are linear extrapolations of
PAM1
PAM250 is probability of 250 change in sequence

Tutorial to do it yourself http//www.biorecipes
.com/Dayhoff/code.html
18
Scoring matrices
Do these look like probabilities?
19
Scoring matrices
Do these look like probabilities?
20
Scoring matrices

Henikoff and Henikoff (1992)
Based on amino acid substitutions in conserved
amino acid patterns from the BLOCKS database
BLOSUMn from blocks which are n identical

21
Scoring matrices

BLOCKS found with motif software which finds
patterns of amino acids at arbitrary intervals
Protein families defined by presence of a common
motif
Motifs strung together into ungapped blocks of
sequence

22
BLOCKS
http//blocks.fhcrc.org/
23
Nucleotide substitution matrices

How might you design one?

24
Gap penalties

Simplest approach is a constant gap penalty
Cost for opening a gap d
Every gap gets same penalty, regardless of length

25
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -10
26
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -5
What has changed?
27
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed?
28
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed? How might we assess which
value of d is best?
29
Gap penalties

Linear gap penalty
Cost for opening a gap d
Cost for gap extenstion d
Each gap of one base (or residue) gets penalized
the same, regardless of whether or not it is
contiguous with other gaps
Alignment with fewer gaps favored
Penalty for large gap is same as many small gaps

30
Gap penalties

Linear gap penalty
Cost for opening a gap d
Cost for gap extenstion d
Each gap of one base (or residue) gets penalized
the same, regardless of whether or not it is
contiguous with other gaps
Alignment with fewer gaps favored
Penalty for large gap is same as many small gaps

What do you think about this from a biological
point of view?
31
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap linear d -10
What has changed?
32
Gap penalties

Affine gap penalty
Biologically, more likely to see a single gap of
10 than 10 single gaps
We can model this using a gap opening penalty o
and a gap extension penalty e
Gap of length l gets penalty o (l-1) e

33
Gap penalties

Affine gap penalty
Biologically, more likely to see a single gap of
10 than 10 single gaps
We can model this using a gap opening penalty o
and a gap extension penalty e
Gap of length l gets penalty o (l-1) e

Should e be more or less negative than o ? Why?
34
Gap penalties

Affine gap penalty
Biologically, more likely to see a single gap of
10 than 10 single gaps
We can model this using a gap opening penalty o
and a gap extension penalty e
Gap of length l gets penalty o (l-1) e

Should e be more or less negative than o
? Why? Less. Because a few large gaps are
morerealistic than manys mall gaps. Want to
encourage gap extension.
35
Gap penalties

Build your intuition about parameters for
pairwise sequence alignment.
Gap penalty functions often written as ?(.)

http//fasta.bioch.virginia.edu/noptalign/
36
Dynamic programming for sequence alignment

Scoring system
Search strategy

37
Dynamic programming for sequence alignment

Key insight is that we can align prefixes and
then extend them one position at a time
Efficient search through the huge alignment space
is done by caching intermediate results

38
Dynamic programming for sequence alignment

General idea

Matrix Initialization
Two sequences
GACCC ACAA
39
Dynamic programming for sequence alignment

General idea

Matrix Initialization
Why?
Two sequences
GACCC ACAA
40
Dynamic programming for sequence alignment

General idea

Aligning each nucleotidewith the empty string
isequivalent to putting a gap atthe beginning
of the alignment. What kind of gap penalty is
being used here?
Matrix Initialization
Why?
Two sequences
GACCC ACAA
41
Dynamic programming for sequence alignment

General idea

What is this?
Matrix Initialization
Two sequences
GACCC ACAA
42
Dynamic programming for sequence alignment

General idea

What is this?
Score for aligning two empty sequences
Matrix Initialization
Two sequences
GACCC ACAA
43
Dynamic programming for sequence alignment

General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
44
To fill in each M(i,j), need to look where?

General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
45
M(i-1,j)
M(i-1,j-1)
To fill in each M(i,j), need to look where?
M(i,j)
M(i,j-1)

General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
46
Dynamic programming for sequence alignment

General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
What else do we need here?
47
Dynamic programming for sequence alignment
constant
48
Dynamic programming for sequence alignment
Whats this?
constant
49
Dynamic programming for sequence alignment
Whats this?
constant
50
Dynamic programming for sequence alignment
Whats this?
constant
51
Dynamic programming for sequence alignment
constant
52
Dynamic programming for sequence alignment
constant
53
Dynamic programming for sequence alignment
constant
54
Dynamic programming for sequence alignment
constant
55
Dynamic programming for sequence alignment
constant
56
Dynamic programming for sequence alignment
constant
57
Dynamic programming for sequence alignment
constant
58
Dynamic programming for sequence alignment
constant
59
Dynamic programming for sequence alignment
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
60
Dynamic programming for sequence alignment
constant
61
Dynamic programming for sequence alignment
Only one winner is shown for each
maxoperation, but there are often ties, with
thewinner selected at random. Thus, there are
many optimal alignmentsand even more that are
slightly less than optimal (sub-optimal
alignments)
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
62
Dynamic programming for sequence alignment
Time and space complexity?
constant
63
Dynamic programming for sequence alignment
Traceback
Where do we start?
64
Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
65
Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
Bottom right cell givesus the optimal global
alignment score
66
Dynamic programming for sequence alignment
Traceback
Want to identify the alignmentwith over the
length of bothsequences Backtrack over the
winningpath How?
67
Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was reached.
68
Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was
reached. Forward pointers tell us
69
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
70
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
71
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
72
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
CCC CAA
73
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
ACCC ACAA
74
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
GACCC -ACAA
75
Global pairwise sequence alignment

Global pairwise sequence alignment by dynamic
programming introduced by Needleman and Wunsch in
Known as the Needleman-Wunsch algorithm.

In what situations might we want a global
sequence alignment?
In what situations might this not be desirable?

77
Local pairwise sequence alignment

General idea

Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
78
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Zero win gt start ofa new alignment
(reset) Alignment can begin andend anywhere in
the matrix
79
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
80
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
81
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
82
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
83
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
84
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
85
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
86
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
87
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
88
Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
89
Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
Largest value in the matrixis score of the best
localalignment between subsequences
90
Local pairwise sequence alignment
Traceback
91
Local pairwise sequence alignment
Traceback
92
Local pairwise sequence alignment
Traceback
93
Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
94
Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
CCC CCA
95
Local pairwise sequence alignment
Traceback
For this to work, what is requirement for
expected value of a randommatch? Why?
96
Local pairwise sequence alignment

Smith-Waterman (1981)
Gotoh (affine gap scores) (1982)

97
Motivation for heuristic alignments

For two sequences of length m and n
Time to get optimal alignment score is T(mn)
T notation like O notation but provides upper and
lower bounds on asymptotic behavior (O is only
for upper)
Space required to recover the alignment (naively
T(mn) but can reduce to T(mn)
Not realistic for large scale problems
Sequence (or sequence database size) in the
billions

98
Big picture strategy for heuristic alignments

Do fast preprocessing to identify regions where
more expensive algorithms can be used

Filter candidates byseeking a high
scoringalignment near each one
Generate candidate patterns that
indicatepresence of a high scoring alignment
Repeat withprogressivelymoresensitivealgoirthm
s
Courtesy of WU lecture 1
99
Big picture strategy for heuristic alignments

Key issues to address
What are good candidate patterns?
How do we generate and find them efficiently?
How do we eliminate redundant candidates?
How do we search for alignments near candidates?
How do we score (and filter) alignments?

Courtesy of WU lectures on alignment problems
100
BLAST
Biologically real alignments very likelyto
contain a short stretch of identitiesor very
high scoring matches.
Parallel computing for bioinformatics and
computational biology, Zomaya
101
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
102
BLAST
Q
D
ACTGA
GACTGC
Word List
ACTCTGTGA
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
103
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
104
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
What are they?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
105
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
106
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
What kind of data structures might you use to
store the words? The word/seed associations?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
107
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
108
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
109
BLAST
HSP(high-scoring pair)
GACTGC ACT CTG
GACTGC ACTG
D
GACTGC
Seeds?
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
110
BLAST

Combine consistent HSPs
They cant overlap
They must be in same order in Q and D sequences
These consistent local aligments are assessed
for statistical significance

Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
111
BLAST

Parameterization includes choice of W (word
length) and T (threshold for scores in the
initial word list)

How will make W and T longer or shorter impact
the finallist of statistically significant local
alignments produced by BLAST?What about its
speed?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
112
What you should know

What alignments mean in terms of a model of
sequence evolution
Start to develop intuition about dynamic
programming and how to use it
Basic ideas behind heuristic alignment approach
such as BLAST

113
Credits

Materials for this lecture were taken from many
sources including the following web pages, which
you may want to explore on your own

http//www.life.umd.edu/labs/delwiche/bsci348s/lec
/PAMmatrices.html
http//web.cecs.pdx.edu/ps/CapStone03/dynvis/Simi
larityApplet.html
http//fasta.bioch.virginia.edu/noptalign/

Write a Comment

User Comments (0)