Title: Lecture 5 Pairwise sequence alignment algorithms
1Lecture 5Pairwise sequence alignment algorithms
- Rachel Karchin
- BME 580.688/580.488 and CS 600.488 Spring 2009
2Outline
- Fundamentals and basics
- Exact alignment algorithms
- Global Needleman-Wunsch
- Local - Smith Waterman
- Heuristics for the real world
- BLAST
3Fundamentals
- When we compare sequences, we are looking for
evidence that they have diverged from a common
ancestor by a process of mutation and selection.
Durbin et al. Biological Sequence Analysis
4Fundamentals
- Assume a simplified model of evolution
- Substitutions
- Deletions
- Insertions
Durbin et al. Biological Sequence Analysis
5Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
What is a shared equivalent position?
6Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
GACCC GACAA
GACCC-- --GACAA
GACC-C GACAA-
G-ACCC GACAA-
GACCC- GA-CAA
GACCC- GAC-AA
7Pairwise alignment basics
Tabular representation
GACCC GACAA
Two sequences
8Pairwise alignment basics
GACCC GACAA
Tabular representation
Two sequences
9Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
10Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
11Pairwise alignment
- Which of all of these alignments is the best one?
12Pairwise alignment
- Requires a scoring system that gives points or
penalties for matches and gaps
13Pairwise alignment
- Also requires efficient search strategy to select
the best alignment or best k alignments out of
very large number of possible alignments
14Scoring system
What do we want out of such a scoring system?
15Scoring matrices
- PAM matrices invented by Margaret Dayhoff in
1978. - Based on mutations in 70 protein families
- Reconstructed the ancestral sequence of each
family
16Scoring matrices
- Count number of changes for each amino acid and
divide by a normalization factor of exposure to
mutation - Assumes that the observed substitution is the
result of a single mutation event.
17Scoring matrices
- PAM1 gives the probability of a 1 change in
sequence - One substitution per 100 amino acids
- PAM100, PAM250 etc. are linear extrapolations of
PAM1 - PAM250 is probability of 250 change in sequence
Tutorial to do it yourself http//www.biorecipes
.com/Dayhoff/code.html
18Scoring matrices
Do these look like probabilities?
19Scoring matrices
Do these look like probabilities?
20Scoring matrices
- Henikoff and Henikoff (1992)
- Based on amino acid substitutions in conserved
amino acid patterns from the BLOCKS database - BLOSUMn from blocks which are n identical
21Scoring matrices
- BLOCKS found with motif software which finds
patterns of amino acids at arbitrary intervals - Protein families defined by presence of a common
motif - Motifs strung together into ungapped blocks of
sequence
22BLOCKS
http//blocks.fhcrc.org/
23Nucleotide substitution matrices
- How might you design one?
24Gap penalties
- Simplest approach is a constant gap penalty
- Cost for opening a gap d
- Every gap gets same penalty, regardless of length
25Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -10
26Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -5
What has changed?
27Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed?
28Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed? How might we assess which
value of d is best?
29Gap penalties
- Linear gap penalty
- Cost for opening a gap d
- Cost for gap extenstion d
- Each gap of one base (or residue) gets penalized
the same, regardless of whether or not it is
contiguous with other gaps - Alignment with fewer gaps favored
- Penalty for large gap is same as many small gaps
30Gap penalties
- Linear gap penalty
- Cost for opening a gap d
- Cost for gap extenstion d
- Each gap of one base (or residue) gets penalized
the same, regardless of whether or not it is
contiguous with other gaps - Alignment with fewer gaps favored
- Penalty for large gap is same as many small gaps
What do you think about this from a biological
point of view?
31Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap linear d -10
What has changed?
32Gap penalties
- Affine gap penalty
- Biologically, more likely to see a single gap of
10 than 10 single gaps - We can model this using a gap opening penalty o
and a gap extension penalty e - Gap of length l gets penalty o (l-1) e
33Gap penalties
- Affine gap penalty
- Biologically, more likely to see a single gap of
10 than 10 single gaps - We can model this using a gap opening penalty o
and a gap extension penalty e - Gap of length l gets penalty o (l-1) e
Should e be more or less negative than o ? Why?
34Gap penalties
- Affine gap penalty
- Biologically, more likely to see a single gap of
10 than 10 single gaps - We can model this using a gap opening penalty o
and a gap extension penalty e - Gap of length l gets penalty o (l-1) e
Should e be more or less negative than o
? Why? Less. Because a few large gaps are
morerealistic than manys mall gaps. Want to
encourage gap extension.
35Gap penalties
- Build your intuition about parameters for
pairwise sequence alignment. - Gap penalty functions often written as ?(.)
http//fasta.bioch.virginia.edu/noptalign/
36Dynamic programming for sequence alignment
- Scoring system
- Search strategy
37Dynamic programming for sequence alignment
- Key insight is that we can align prefixes and
then extend them one position at a time - Efficient search through the huge alignment space
is done by caching intermediate results
38Dynamic programming for sequence alignment
Matrix Initialization
Two sequences
GACCC ACAA
39Dynamic programming for sequence alignment
Matrix Initialization
Why?
Two sequences
GACCC ACAA
40Dynamic programming for sequence alignment
Aligning each nucleotidewith the empty string
isequivalent to putting a gap atthe beginning
of the alignment. What kind of gap penalty is
being used here?
Matrix Initialization
Why?
Two sequences
GACCC ACAA
41Dynamic programming for sequence alignment
What is this?
Matrix Initialization
Two sequences
GACCC ACAA
42Dynamic programming for sequence alignment
What is this?
Score for aligning two empty sequences
Matrix Initialization
Two sequences
GACCC ACAA
43Dynamic programming for sequence alignment
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
44To fill in each M(i,j), need to look where?
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
45M(i-1,j)
M(i-1,j-1)
To fill in each M(i,j), need to look where?
M(i,j)
M(i,j-1)
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
46Dynamic programming for sequence alignment
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
What else do we need here?
47Dynamic programming for sequence alignment
constant
48Dynamic programming for sequence alignment
Whats this?
constant
49Dynamic programming for sequence alignment
Whats this?
constant
50Dynamic programming for sequence alignment
Whats this?
constant
51Dynamic programming for sequence alignment
constant
52Dynamic programming for sequence alignment
constant
53Dynamic programming for sequence alignment
constant
54Dynamic programming for sequence alignment
constant
55Dynamic programming for sequence alignment
constant
56Dynamic programming for sequence alignment
constant
57Dynamic programming for sequence alignment
constant
58Dynamic programming for sequence alignment
constant
59Dynamic programming for sequence alignment
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
60Dynamic programming for sequence alignment
constant
61Dynamic programming for sequence alignment
Only one winner is shown for each
maxoperation, but there are often ties, with
thewinner selected at random. Thus, there are
many optimal alignmentsand even more that are
slightly less than optimal (sub-optimal
alignments)
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
62Dynamic programming for sequence alignment
Time and space complexity?
constant
63Dynamic programming for sequence alignment
Traceback
Where do we start?
64Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
65Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
Bottom right cell givesus the optimal global
alignment score
66Dynamic programming for sequence alignment
Traceback
Want to identify the alignmentwith over the
length of bothsequences Backtrack over the
winningpath How?
67Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was reached.
68Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was
reached. Forward pointers tell us
69Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
70Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
71Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
72Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
CCC CAA
73Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
ACCC ACAA
74Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
GACCC -ACAA
75Global pairwise sequence alignment
- Global pairwise sequence alignment by dynamic
programming introduced by Needleman and Wunsch in
- Known as the Needleman-Wunsch algorithm.
76- In what situations might we want a global
sequence alignment? - In what situations might this not be desirable?
77Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
78Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Zero win gt start ofa new alignment
(reset) Alignment can begin andend anywhere in
the matrix
79Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
80Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
81Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
82Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
83Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
84Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
85Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
86Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
87Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
88Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
89Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
Largest value in the matrixis score of the best
localalignment between subsequences
90Local pairwise sequence alignment
Traceback
91Local pairwise sequence alignment
Traceback
92Local pairwise sequence alignment
Traceback
93Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
94Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
CCC CCA
95Local pairwise sequence alignment
Traceback
For this to work, what is requirement for
expected value of a randommatch? Why?
96Local pairwise sequence alignment
- Smith-Waterman (1981)
- Gotoh (affine gap scores) (1982)
97Motivation for heuristic alignments
- For two sequences of length m and n
- Time to get optimal alignment score is T(mn)
- T notation like O notation but provides upper and
lower bounds on asymptotic behavior (O is only
for upper) - Space required to recover the alignment (naively
T(mn) but can reduce to T(mn) - Not realistic for large scale problems
- Sequence (or sequence database size) in the
billions
98Big picture strategy for heuristic alignments
- Do fast preprocessing to identify regions where
more expensive algorithms can be used
Filter candidates byseeking a high
scoringalignment near each one
Generate candidate patterns that
indicatepresence of a high scoring alignment
Repeat withprogressivelymoresensitivealgoirthm
s
Courtesy of WU lecture 1
99Big picture strategy for heuristic alignments
- Key issues to address
- What are good candidate patterns?
- How do we generate and find them efficiently?
- How do we eliminate redundant candidates?
- How do we search for alignments near candidates?
- How do we score (and filter) alignments?
Courtesy of WU lectures on alignment problems
100BLAST
Biologically real alignments very likelyto
contain a short stretch of identitiesor very
high scoring matches.
Parallel computing for bioinformatics and
computational biology, Zomaya
101BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
102BLAST
Q
D
ACTGA
GACTGC
Word List
ACTCTGTGA
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
103BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
104BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
What are they?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
105BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
106BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
What kind of data structures might you use to
store the words? The word/seed associations?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
107BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
108BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
109BLAST
HSP(high-scoring pair)
GACTGC ACT CTG
GACTGC ACTG
D
GACTGC
Seeds?
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
110BLAST
- Combine consistent HSPs
- They cant overlap
- They must be in same order in Q and D sequences
- These consistent local aligments are assessed
for statistical significance
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
111BLAST
- Parameterization includes choice of W (word
length) and T (threshold for scores in the
initial word list)
How will make W and T longer or shorter impact
the finallist of statistically significant local
alignments produced by BLAST?What about its
speed?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
112What you should know
- What alignments mean in terms of a model of
sequence evolution - Start to develop intuition about dynamic
programming and how to use it - Basic ideas behind heuristic alignment approach
such as BLAST
113Credits
- Materials for this lecture were taken from many
sources including the following web pages, which
you may want to explore on your own
http//www.life.umd.edu/labs/delwiche/bsci348s/lec
/PAMmatrices.html
http//web.cecs.pdx.edu/ps/CapStone03/dynvis/Simi
larityApplet.html
http//fasta.bioch.virginia.edu/noptalign/