Title: Pairwise alignment
1. Pairwise alignment
- Score-based alignments, dynamic programming
- Problems with score-based alignments
- Markov chains, hidden Markov models, examples
- The Viterbi algorithm
- Pair HMMs
- Forward, Backward, and Posterior probabilities
- Posterior Decoding, comparison with score-based alignments
2. Sequence alignment
- To study how nucleotide (or amino acid) sequences evolve, first a precise matching between homologous nucleotides in two (homologous) nucleotide sequences has to be made.
- Homology is an old concept in biology, and means having a similar form. In evolutionary biology it means, more specifically, having a similar form because of shared ancestry. For example, the skeletons of mammals are homologous in this sense, but eyes have evolved independently a number of times, so they're not homologous (even though they may be very similar in form and function).
- The word is also used for nucleotide sequences. Here it means sequence similarity because of shared ancestry (although it is often used as a synonym of sequence similarity). Ancestry is tricky to determine, especially when sequences are short, highly divergent, or contain many repetitive patterns (because of the possibility that these repeats have arisen independently). So in practice, sequence similarity is often used as a proxy for homology. Other characteristics, such as the function or 3D structure of the protein encoded by the DNA sequence, are used as well.
Example: Hexokinase I (a glucose-metabolism enzyme) is found in mammals, plants, and yeast.
3. Needleman-Wunsch algorithm
- Two sequences are considered homologous if they're likely to have evolved from a common ancestor, because of their sequence similarity.
- To quantify this, it is first assumed that during evolution the sequences have been changed by two processes: (1) single-nucleotide substitutions, and (2) short sequence insertions and deletions (indels). Other mutations do occur (e.g. multiple-nucleotide substitutions, duplications, inversions, ...), but they're usually ignored because of their relatively low frequency and/or the technical difficulty of dealing with them.
- Considering only nucleotide substitutions and indels, the homology between two sequences can be summarized by an alignment. It implies a minimum set of substitutions and indels for that homology.
- Historically, the first algorithm for calculating alignments is the Needleman-Wunsch algorithm (1970). It finds the alignment that optimizes a score, which penalizes both nucleotide mismatches at aligned positions and gaps. This approach is a form of weighted parsimony.
- The score function has the following form:
- S = Σ_{(n,m) aligned} M( u(n), v(m) ) + Σ_{gaps} h(length),
- where M is the score matrix, and h(k) is the gap penalty function. The latter is almost always taken to be a linear affine function,
- h(k) = a + b (k - 1),
- where a is the gap opening penalty, and b is the gap extension penalty.
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
4. Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
An alignment can be represented as a path through a table, with the sequences along the edges of the table.
[Figure: the alignment table, with one sequence along the top and the other along the side; the marked cells trace the path corresponding to the alignment above.]
5. Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
The problem with determining the top-scoring alignment is that there are very many paths. If we ignore the diagonal steps (i.e. allow only gaps...), the number of paths can be computed recursively, by adding the entries above and to the left and writing the result in the box. The numbers you get are called Pascal's triangle, or the binomial coefficients. The number of alignments of sequences of length N and M is therefore at least C(N+M, N). Simply going through all possible alignments is therefore impossible.
[Table: the alignment table filled with the number of gap-only paths reaching each cell, obtained by adding the entries above and to the left; the entries are the binomial coefficients of Pascal's triangle, e.g. the first rows read 1 1 1 1 ..., 1 2 3 4 ..., 1 3 6 10 ..., 1 4 10 20 ...]
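To get a feel for this number, the lower bound above is easy to evaluate. A minimal sketch in standard Python; the lengths 12 and 14 are those of the example sequences:

```python
from math import comb

# Lower bound on the number of alignments of sequences of length N and M:
# the number of gap-only (non-diagonal) paths through an (N+1) x (M+1) table
# is the binomial coefficient C(N+M, N).
N, M = 12, 14                  # lengths of the example sequences
print(comb(N + M, N))          # 9657700 -- already large for tiny sequences
print(comb(300 + 300, 300))    # astronomically large for realistic lengths
```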
6. Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
- Needleman-Wunsch is, yet again, an application of dynamic programming. Here, the intermediate problem to solve is the computation of the maximum score S(n,m) of an alignment of the first n nucleotides of sequence u and the first m nucleotides of sequence v.
- The recursive formula for this score is
- S(n,m) = max{ S(n-1,m-1) + M( u(n), v(m) ),
-              max_{k≥1} [ S(n-k,m) + h(k) ],
-              max_{k≥1} [ S(n,m-k) + h(k) ] }
- (To make the formulas look clean, boundary conditions such as S(-1,n) = 0, and the allowed range of k, are not written explicitly.)
- This algorithm takes O(n³) time. It can be improved to O(n²) by using three tables which collect the scores of alignments that (1) end in an aligned pair, (2) end in a gap in u, and (3) end in a gap in v. The linear affine form of h allows simple updating of the gap-ending scores using neighbouring values.
[Table: the score table for the example, partially filled in. Scoring parameters: match +10, mismatch -5, gap opening -10, gap extension -5.]
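For illustration, below is a minimal sketch of the recursion exactly as written above (score only, no traceback, and without the three-table O(n²) speed-up). The scoring values are the ones quoted with the table; the function and variable names are mine, not from any alignment package:

```python
def nw_score(u, v, match=10, mismatch=-5, gap_open=-10, gap_extend=-5):
    """Global alignment score via the direct Needleman-Wunsch recursion.

    S[n][m] is the best score for aligning the first n symbols of u with
    the first m symbols of v; h(k) = gap_open + gap_extend * (k - 1).
    """
    def M(a, b):                       # substitution score
        return match if a == b else mismatch

    def h(k):                          # affine penalty for a gap of length k
        return gap_open + gap_extend * (k - 1)

    N, Mv = len(u), len(v)
    NEG = float("-inf")
    S = [[NEG] * (Mv + 1) for _ in range(N + 1)]
    S[0][0] = 0.0
    for n in range(N + 1):
        for m in range(Mv + 1):
            if n == 0 and m == 0:
                continue
            best = NEG
            if n > 0 and m > 0:        # aligned pair u[n-1], v[m-1]
                best = max(best, S[n - 1][m - 1] + M(u[n - 1], v[m - 1]))
            for k in range(1, n + 1):  # gap of length k in v
                best = max(best, S[n - k][m] + h(k))
            for k in range(1, m + 1):  # gap of length k in u
                best = max(best, S[n][m - k] + h(k))
            S[n][m] = best
    return S[N][Mv]

print(nw_score("CGACATTAAATA", "CGACGTTAACGATT"))
```

The two nested loops over k make this the O(n³) version mentioned above; the three-table trick removes them.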
7. Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
With the table filled out, the bottom-right entry gives the score of the best-scoring alignment. By starting in this entry and re-computing the score, you can figure out which term of the max formula gave rise to this score. Go to the corresponding entry and repeat until you arrive at the top-left entry. The path now corresponds to the alignment. This algorithm computes the best-scoring global alignment: all nucleotides participate, and gaps at the beginning or end are penalized as much as gaps in the middle. The Smith-Waterman algorithm is a variation of Needleman-Wunsch which does not penalize gaps at the beginning or end; it computes the best-scoring local alignment.
[Table: the same score table, with the traceback path marking the best global alignment; same scoring parameters as before.]
8. Alignments are often wrong
[Figure: sensitivity for true homology, and fraction of incorrect bases in the alignment.]
9. Biases in alignments
[Figure: (A) gap wander (Holmes & Durbin, JCB 5, 1998); (B, C) gap attraction; (D) gap annihilation.]
10. Problems with score-based alignments
- The Needleman-Wunsch algorithm (and its elaborations) optimize a score to arrive at a single preferred alignment. This approach is unsatisfactory for several reasons:
- 1. It is unclear how to determine the score matrix and gap penalties given only unaligned sequences.
- 2. Even if somehow the underlying evolutionary process is known (e.g. as substitution and indel rates), it is unclear how the scores and penalties relate to it.
- 3. It is impossible to recover the true alignment with certainty, particularly if the sequences are quite divergent. But how to quantify the alignment's reliability?
- 4. The N-W alignments are biased in certain ways, and the method provides no obvious way to address the problem.
- These observations suggest that a probabilistic (or statistical) approach to alignments might be helpful.
11. Markov chains
- A Markov chain is a discrete-time stochastic process X(t) that has the Markov property.
- Discrete time means that t ∈ T, where T is the set of integers ..., -2, -1, 0, 1, 2, ..., a discrete set.
- The state space may be either discrete or continuous. We'll only consider discrete state spaces here (but continuous state spaces are also useful in biology, e.g. to model the evolution of gene expression levels).
- The Markov property is the property that the future dynamics of X(t), given the present state, is independent of the past states. Symbolically,
- P( X(t+1)=y_{t+1} | X(t)=y_t, X(t-1)=y_{t-1}, X(t-2)=y_{t-2}, ... ) = P( X(t+1)=y_{t+1} | X(t)=y_t )
- This means that the behaviour of X(t) is completely determined by the matrices
- A_pq(t) = P( X(t+1)=q | X(t)=p )
- (Strictly speaking, they are matrices only if the state space is finite.) These matrices are called the transition matrices of X(t). They may well depend on t. If they do not, X(t) is called a time-homogeneous Markov chain.
12. Markov chains
- Suppose you know the probabilities that the Markov chain is in a particular state at time t; let's say you know the vector u, where
- u_a = P( X(t)=a )
- Then the probability that X(t+1)=b is
- P( X(t+1)=b ) = Σ_a P( X(t)=a ) P( X(t+1)=b | X(t)=a )   (*)
-              = Σ_a u_a A_ab = (u A)_b
- because the summation (*) is just the definition of the product of a vector and a matrix. Applying the same formula with u ← u A some number of times, you get
- P( X(t+n)=b ) = (u A^n)_b
- Conditioning on X(t)=a means setting u = (0, ..., 1, ..., 0), with the 1 at the coordinate corresponding to a:
- P( X(t+n)=b | X(t)=a ) = (A^n)_ab
13. Markov chains
- Simple example: two states, and transition probabilities between them.
- This system models the fact that yesterday's weather is a reasonable prediction for today's weather.
- Here is a typical realization.
14. Hidden Markov model (HMM)
Elaboration of the simple weather model, now as an HMM. The underlying states, which are considered unobservable, are H and L, representing periods of high and low atmospheric pressure. The observable is the weather (sunny or rainy). Here is a typical realization of this HMM:
H H H H H H H H H L L L H H H H H L L L L H H
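Because an HMM is a generative model, a realization like the one above can be produced with a few lines of simulation. The transition and emission probabilities below are made-up illustrative numbers; the slide itself does not specify them:

```python
import random

# Hidden states H (high pressure) and L (low pressure); observed weather.
trans = {"H": {"H": 0.8, "L": 0.2},           # assumed transition probabilities
         "L": {"H": 0.3, "L": 0.7}}
emit  = {"H": {"sunny": 0.9, "rainy": 0.1},   # assumed emission probabilities
         "L": {"sunny": 0.3, "rainy": 0.7}}

def draw(dist):
    """Draw a key of `dist` with probability proportional to its value."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key                                # guard against rounding

def simulate(n_steps, start="H"):
    states, observations = [], []
    x = start
    for _ in range(n_steps):
        states.append(x)                      # hidden state
        observations.append(draw(emit[x]))    # emitted (observable) symbol
        x = draw(trans[x])                    # next hidden state
    return states, observations

states, obs = simulate(23)
print(" ".join(states))
print(" ".join(obs))
```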
15. Hidden Markov models
- A hidden Markov model (HMM) is a Markov chain X(t), together with a stochastic variable Y(t) representing emissions or observations.
- X(t) has the usual Markov property. Y(t) is only dependent on X(t) (more precisely, Y(t) is conditionally independent of all other Y(s) given X(t)). These dependencies are summarized in this picture:
- [Figure: graphical structure of an HMM (Wikipedia, "Hidden Markov model").]
- The state X(t) is considered unobservable; the only observables are the Y(t).
- (Note: the Y(t) may also emit the empty symbol, denoted ε. If these are used, they are simply left out of the emitted symbol sequence. Only this symbol sequence is considered to be observable; this then also makes the time t unobservable, since it is unknown how many empty symbols have been output.)
- Usually the Markov chain starts at t=0, and in a particular state, which is called the start state. Often one of the states in the Markov chain is labeled the end state, and the process stops when it reaches this state. In this way HMMs can be made to produce finite symbol sequences.
16. Hidden Markov models
- The unobserved Markov chain is defined as usual: a number of states 1, 2, ..., N, and a transition matrix A.
- The Y(t) are stochastic variables whose discrete state space is called the emission alphabet. A probability distribution P( Y=y | X=x ) determines the emissions, conditional on the state of X(t).
- HMMs are generative models: they specify how a sequence of observations is generated randomly. This can be useful for simulations. Their power however derives from the existence of a number of efficient algorithms that compute the answers to important questions relating to HMMs.
- Questions you may want to ask include:
- (i) What is the most likely state path X(t) generating a given observed sequence Y?
- (ii) What is the likelihood of a given model generating a sequence Y?
- (iii) What is the posterior probability that a symbol Y_i from the sequence Y was generated by some state X?
- (iv) For a given sequence Y, what matrix A and emission probabilities P( Y=y | X=x ) maximize the likelihood of Y?
- These questions are solved by (i) the Viterbi algorithm, (ii) the Forward or Backward algorithms, (iii) the Forward and Backward algorithms combined, and (iv) the Baum-Welch algorithm.
17. Profile HMMs
Profile HMMs are used to model DNA sequence motifs in longer DNA sequences. Examples of sequence motifs are DNA binding motifs for transcription factors, or protein domains (in which case amino acids can also be used as the alphabet).
18. Profile HMMs
Adding the x states (which emit nucleotides according to the background distribution) allows the motif to be anywhere inside a longer DNA sequence. The black dots are silent states (states that do not emit symbols).
19. Profile HMMs
Adding more silent states allows letters in the motif to be skipped (with a certain probability), so as to model e.g. protein domains with slightly variable lengths.
20. Profile HMMs
Similarly, insert states allow the insertion
of, in principle, any length of sequence
somewhere within the motif.
21. Profile HMMs
This is the profile HMM architecture used in the HMMER (Sean Eddy, Washington University) and SAM (David Haussler, UCSC) packages. An extensive list of HMMs for protein domains can be found in the Pfam database.
A loopback transition allows multiple motifs to occur in a single sequence (but note that in this model the sequence always includes at least one motif; using this model to find motifs will therefore always find one).
22. Viterbi algorithm
Andrew Viterbi (1935-), Italian-American engineer
- This algorithm determines the most likely state path conditional on producing the observed sequence Y. Straightforward computation would involve examining all possible paths, which is virtually impossible. The Viterbi algorithm solves the problem efficiently, using dynamic programming, in a way that is very similar to the Needleman-Wunsch algorithm (except that this one is one-dimensional, but see later).
- The intermediate result that can be computed recursively is
- v(i,x) = the probability of the most probable path that begins in the start state, emits the first i symbols of Y, and ends up in state x.
- Suppose for convenience that all states except the start state emit a symbol. Then v(0,start) = 1, and v(0,x) = 0 for all other states x.
- For i > 0, v can be computed recursively:
- v(i,x) = max_s v(i-1,s) A_sx P( Y=Y_i | X=x )   (*)
- The factor v(i-1,s) has already been computed. After all v(i,x) have been computed, a traceback procedure recovers the most probable path.
- (Note 1: the convention has been taken that a state emits its corresponding symbol as soon as it is entered. One can also choose to emit once a state is left. Further, emissions may be assigned to transitions rather than states; this is called a Mealy model or machine; the conventional choice is called a Moore model.)
- (Note 2: if some states x do not emit a symbol, the factor v(i-1,s) must be replaced by v(i,s) in (*). This makes the formula self-referential. For the Viterbi algorithm, the computation (*) can simply be carried out a number of times, until none of the terms changes any further. Depending on the structure of the Markov chain, computing the v(i,x) in a suitable order may decrease the number of iterations required.)
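A minimal sketch of recursion (*) for an ordinary single-sequence HMM, using the weather model as an example. It works in log space to avoid underflow; the probabilities are again assumed numbers, an initial distribution is used instead of an explicit start state, and there are no silent states:

```python
from math import log

states  = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}                    # assumed initial distribution
trans   = {"H": {"H": 0.8, "L": 0.2},
           "L": {"H": 0.3, "L": 0.7}}
emit    = {"H": {"sunny": 0.9, "rainy": 0.1},
           "L": {"sunny": 0.3, "rainy": 0.7}}

def viterbi(Y):
    """Most probable state path for the observed sequence Y."""
    # v[i][x]: log-probability of the best path emitting Y[:i+1] and ending in x.
    v = [{x: log(start_p[x]) + log(emit[x][Y[0]]) for x in states}]
    back = []
    for i in range(1, len(Y)):
        v.append({})
        back.append({})
        for x in states:
            best_s = max(states, key=lambda s: v[i - 1][s] + log(trans[s][x]))
            back[-1][x] = best_s
            v[i][x] = v[i - 1][best_s] + log(trans[best_s][x]) + log(emit[x][Y[i]])
    # Traceback from the best final state.
    path = [max(states, key=lambda x: v[-1][x])]
    for b in reversed(back):
        path.append(b[path[-1]])
    return list(reversed(path))

print(viterbi(["sunny", "sunny", "rainy", "rainy", "sunny"]))
```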
23. Pair HMMs
- Up until now, the HMMs emitted a single sequence of symbols. A straightforward generalization allows the emission of two symbol sequences at the same time. Such HMMs are called pair HMMs.
- Each state emits one symbol to each sequence. As before, either or both may be the empty symbol ε.
- Pair HMMs can be used to model sequence alignments. Simultaneously emitted nucleotides correspond to homologous pairs, and are correlated because of their shared ancestry. States that emit nucleotides to only one sequence model gaps in the other sequence.
24. Viterbi for pair HMMs
- To show what changes when going from standard to pair HMMs, here is the Viterbi algorithm for pair HMMs. The observable sequences are denoted Y and Z.
- The probabilities to be computed are v(i,j,x), and represent the probability of the most probable path that begins in the start state, emits the first i symbols from Y and the first j symbols from Z, and ends up in state x.
- The recursion is
- v(i,j,x) = max_s v(i-p, j-q, s) A_sx P( Y=Y_i, Z=Z_j | X=x )
- Here, p=1 if x emits into Y, 0 otherwise; similarly q=1 if x emits into Z, 0 otherwise. If x does not emit into both sequences, the emission probability changes accordingly (e.g. P( Y=Y_i | X=x ) for emission into Y only).
- The maximum path probability can be read off from v(length(Y), length(Z), end); the actual path can be obtained by backtracing.
- For simplicity I assume that none of the states (except for start) is silent, i.e. they all emit into at least one sequence. If not, the recursion again becomes self-referential, and the same remarks as for the one-dimensional Viterbi algorithm apply.
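To make the recursion concrete, here is a sketch for a minimal three-state pair HMM (M = aligned pair, IY = gap in Z, IZ = gap in Y). The architecture and all parameter values are assumed for illustration only, there is no explicit end state, and only the best log-probability is returned (a traceback would recover the alignment):

```python
import math

# Minimal three-state pair HMM: M emits one symbol into each sequence,
# IY emits into Y only (gap in Z), IZ emits into Z only (gap in Y).
# All numbers below are assumed, illustrative parameters.
states  = ["M", "IY", "IZ"]
emits_Y = {"M": 1, "IY": 1, "IZ": 0}      # p: does the state emit into Y?
emits_Z = {"M": 1, "IY": 0, "IZ": 1}      # q: does the state emit into Z?
trans = {"M":  {"M": 0.8, "IY": 0.1, "IZ": 0.1},
         "IY": {"M": 0.6, "IY": 0.4, "IZ": 0.0},
         "IZ": {"M": 0.6, "IY": 0.0, "IZ": 0.4}}
start = {"M": 0.9, "IY": 0.05, "IZ": 0.05}

def emission(x, a, b):
    """P(emitted symbols | state x); a and/or b is None if not emitted."""
    if x == "M":
        return 0.16 if a == b else 0.03   # joint prob. of a pair (sums to 1 over 16 pairs)
    return 0.25                            # background for single-sequence emission

def pair_viterbi(Y, Z):
    NEG = float("-inf")
    # v[i][j][x]: log-prob of the best path emitting Y[:i] and Z[:j], ending in x.
    v = [[{x: NEG for x in states} for _ in range(len(Z) + 1)]
         for _ in range(len(Y) + 1)]
    for i in range(len(Y) + 1):
        for j in range(len(Z) + 1):
            for x in states:
                p, q = emits_Y[x], emits_Z[x]
                if i < p or j < q or (i == 0 and j == 0):
                    continue
                a = Y[i - 1] if p else None
                b = Z[j - 1] if q else None
                e = math.log(emission(x, a, b))
                if i == p and j == q:                 # first emitting state after start
                    v[i][j][x] = math.log(start[x]) + e
                best = max((v[i - p][j - q][s] + math.log(trans[s][x])
                            for s in states
                            if trans[s][x] > 0 and v[i - p][j - q][s] > NEG),
                           default=NEG)
                if best + e > v[i][j][x]:
                    v[i][j][x] = best + e
    return max(v[len(Y)][len(Z)].values())            # best log-probability

print(pair_viterbi("CGACATTAAATA", "CGACGTTAACGATT"))
```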
25. Viterbi for pair HMMs
- The Viterbi algorithm computes the single path X(0), X(1), ..., X(t) through the Markov chain that has the highest probability of simultaneously emitting the two observed sequences, denoted Y and Z. The path encodes which nucleotides from Y and Z were emitted simultaneously, i.e. it encodes the alignment.
- Symbolically, the Viterbi algorithm first computes
- max_X P( X, Y, Z )
- and then the traceback part computes
- arg max_X P( X, Y, Z ) = arg max_X P( X | Y, Z )
- The last equation holds because P( X | Y, Z ) = P( X, Y, Z ) / P( Y, Z ), and because the denominator does not depend on X, the joint and conditional probabilities reach their maxima for the same path X.
26. Alignments: best-scoring / most likely path (Needleman-Wunsch, Smith-Waterman, Viterbi)
[Figure: a single best path through the alignment table of ATCATCTGCAGT and ACCGTTCACAATGGAT.]
27. Likelihood: the Forward algorithm
- The Viterbi algorithm is very similar to the Needleman-Wunsch algorithm (the states play the role of the index to the three tables mentioned briefly above); it has the same time complexity, and indeed for the three-state HMM of the example you can give a score matrix and gap penalties that make the two compute identical alignments.
- The HMM approach has the important advantage over score-based approaches such as N-W that it is fairly straightforward to determine the model parameters if you know how the sequences evolve, i.e. if you know the evolutionary model (drawback 1 of score-based aligners).
- Drawback 2 concerned determining the model parameters (or the evolutionary model) from sequences alone. The HMM approach also has a method to do this, by maximum-likelihood parameter estimation. For this, it is necessary to be able to calculate the likelihood. This involves considering (summing over) all possible paths, instead of just the most likely one. The Forward algorithm does this.
28. The Forward algorithm (for pair HMMs)
- The algorithm is very similar to the Viterbi algorithm (without traceback). The probabilities to be computed are slightly different: w(i,j,x) now represents the sum of the probabilities of all paths through the Markov chain that emit the first i symbols from Y and the first j symbols from Z, and end up in state x. The final result, which appears in the table entry w(length(Y), length(Z), end), is the likelihood of the sequences Y and Z.
- Symbolically, the Forward algorithm computes
- P( Y, Z ) = Σ_X P( X, Y, Z )
- The recursion is
- w(i,j,x) = Σ_s w(i-p, j-q, s) A_sx P( Y=Y_i, Z=Z_j | X=x )
- Here, p and q have the same definitions as before, and the same caveats-for-the-sake-of-simplicity apply.
- Note: dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case these are linear equations, which in general must be solved by matrix inversion. In practice, most situations involve 1x1 matrices, which of course can be dealt with by a simple division.
- Note: the numbers w(i,j,x) are not conditional probabilities; they are not even true probabilities, as in some cases they can be > 1.
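The only change relative to the Viterbi recursion is that the maximum becomes a sum. A sketch for the single-sequence case (same assumed weather HMM as before); the pair-HMM version is identical except that the table is indexed by (i, j, x):

```python
states  = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}                    # assumed initial distribution
trans   = {"H": {"H": 0.8, "L": 0.2},
           "L": {"H": 0.3, "L": 0.7}}
emit    = {"H": {"sunny": 0.9, "rainy": 0.1},
           "L": {"sunny": 0.3, "rainy": 0.7}}

def forward(Y):
    """Likelihood P(Y) = sum over all state paths of P(path, Y)."""
    # w[i][x]: summed probability of all paths emitting Y[:i+1] and ending in x.
    # (Plain probabilities, not logs; fine for short sequences.)
    w = [{x: start_p[x] * emit[x][Y[0]] for x in states}]
    for i in range(1, len(Y)):
        w.append({x: sum(w[i - 1][s] * trans[s][x] for s in states) * emit[x][Y[i]]
                  for x in states})
    return sum(w[-1].values())

print(forward(["sunny", "sunny", "rainy", "rainy", "sunny"]))
```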
29. The Backward algorithm (for pair HMMs)
- The Backward algorithm computes the same thing as the Forward algorithm: the likelihood that the HMM emits the sequences Y and Z; symbolically, P( Y, Z ) = Σ_X P( X, Y, Z ). So why bother? The answer is that the tables of intermediate results are different, and the Forward and Backward tables can be used together to compute other interesting quantities, such as posterior probabilities.
- The algorithm is very similar to the Forward algorithm, but starts the recursion at the bottom-right end of the table, working its way (backwards) to the top-left.
- The quantities u(i,j,x) now represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the end state. The emission associated with x itself is not included in this likelihood.
- Initialization is u(length(Y), length(Z), end) = 1, and the recursion is
- u(i,j,x) = Σ_s u(i+p_s, j+q_s, s) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s )
- Here, p and q have the same definitions as before, but now refer to the state s rather than x.
- Note: dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case these are linear equations, which in general must be solved by matrix inversion. In practice, most situations involve 1x1 matrices, which can be dealt with by a simple division.
30. The Backward algorithm: sampling from the posterior
- Recall that the quantities u(i,j,x) computed by the Backward algorithm represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the end state.
- In other words, u(i,j,x) gives the probability that the model will produce the remainder of Y and Z, conditional on starting in state x. These probabilities can be used to sample from the posterior distribution of paths that generate Y and Z:
- x ← start, i ← 0, j ← 0
- While x is not the end state:
-   compute r_s = u(i+p_s, j+q_s, s) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) for all states s;
-   normalize the r_s so that they sum to 1, and draw a state s from this distribution;
-   x ← s, i ← i+p_s, j ← j+q_s.
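The same recipe in code, for the single-sequence case (assumed weather HMM; no explicit end state, so the loop simply runs until the whole sequence has been emitted):

```python
import random

states  = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}                    # assumed initial distribution
trans   = {"H": {"H": 0.8, "L": 0.2},
           "L": {"H": 0.3, "L": 0.7}}
emit    = {"H": {"sunny": 0.9, "rainy": 0.1},
           "L": {"sunny": 0.3, "rainy": 0.7}}

def backward_table(Y):
    # u[i][x]: probability of emitting Y[i+1:], given the chain is in x at step i
    # (the emission of x itself is not included, as in the slides).
    u = [{x: 1.0 for x in states} for _ in Y]
    for i in range(len(Y) - 2, -1, -1):
        u[i] = {x: sum(trans[x][s] * emit[s][Y[i + 1]] * u[i + 1][s] for s in states)
                for x in states}
    return u

def draw(weights):
    """Draw a key with probability proportional to its (unnormalized) weight."""
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for key, wgt in weights.items():
        acc += wgt
        if r < acc:
            return key
    return key

def sample_posterior_path(Y):
    """Sample a state path from P( X | Y ), using the backward table."""
    u = backward_table(Y)
    path = [draw({x: start_p[x] * emit[x][Y[0]] * u[0][x] for x in states})]
    for i in range(1, len(Y)):
        x = path[-1]
        path.append(draw({s: trans[x][s] * emit[s][Y[i]] * u[i][s] for s in states}))
    return path

print(sample_posterior_path(["sunny", "sunny", "rainy", "rainy", "sunny"]))
```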
31. Alignments: posterior probabilities
[Figure: posterior alignment probabilities for the sequences ATCATCTGCAGT and ACCGTTCACAATGGAT (after Durbin, Eddy, Krogh & Mitchison 1998).]
32. Posterior probabilities
- Using the tables of intermediate results w(i,j,x) and u(i,j,x) computed by the Forward and Backward algorithms, it is possible to compute certain posterior probabilities.
- For example, let us denote by (i ~ j, x) the event that the HMM emitted the symbols Y_i and Z_j from state x (assuming that x emits into both sequences; an alternative interpretation can be formulated when it does not). The probability of this event, and of the HMM producing the sequences Y and Z, is
- P( Y, Z, i ~ j, x ) = w(i,j,x) u(i,j,x)
- Since the likelihood of emitting Y and Z is
- P( Y, Z ) = w(length(Y), length(Z), end) = u(0,0,start),
- the probability of the event (i ~ j, x) conditional on observing Y and Z is
- P( i ~ j, x | Y, Z ) = w(i,j,x) u(i,j,x) / u(0,0,start)
- If the state x represents the "aligned" state in an alignment HMM, this is the posterior probability that Y_i and Z_j are aligned.
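For the single-sequence case the same computation looks as follows (assumed weather HMM again): the forward table w, the backward table u, and posteriors w·u / P(Y). The pair-HMM version differs only in that the tables are indexed by (i, j, x):

```python
states  = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}                    # assumed initial distribution
trans   = {"H": {"H": 0.8, "L": 0.2},
           "L": {"H": 0.3, "L": 0.7}}
emit    = {"H": {"sunny": 0.9, "rainy": 0.1},
           "L": {"sunny": 0.3, "rainy": 0.7}}

def forward_table(Y):
    # w[i][x]: summed probability of all paths emitting Y[:i+1] and ending in x.
    w = [{x: start_p[x] * emit[x][Y[0]] for x in states}]
    for i in range(1, len(Y)):
        w.append({x: sum(w[i - 1][s] * trans[s][x] for s in states) * emit[x][Y[i]]
                  for x in states})
    return w

def backward_table(Y):
    # u[i][x]: probability of emitting Y[i+1:], given the chain is in x at step i
    # (the emission of x itself is not included).
    u = [{x: 1.0 for x in states} for _ in Y]
    for i in range(len(Y) - 2, -1, -1):
        u[i] = {x: sum(trans[x][s] * emit[s][Y[i + 1]] * u[i + 1][s] for s in states)
                for x in states}
    return u

def posteriors(Y):
    """P( X(i)=x | Y ) = w(i,x) * u(i,x) / P(Y)."""
    w, u = forward_table(Y), backward_table(Y)
    pY = sum(w[-1].values())
    return [{x: w[i][x] * u[i][x] / pY for x in states} for i in range(len(Y))]

for i, p in enumerate(posteriors(["sunny", "sunny", "rainy", "rainy", "sunny"])):
    print(i, {x: round(p[x], 3) for x in states})
```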
33. Posterior probabilities
34. Posterior probabilities (continued)
- As another example, let us consider the event that the HMM transitioned from state x to state s just after emitting the symbols Y_i and Z_j (from state x; again assuming it emits into both sequences, although this is not essential). Denote this event by (i ~ j, x→s). The probability of this event, and of the HMM producing the sequences Y and Z, is
- P( Y, Z, i ~ j, x→s ) = w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s)
- where now p and q refer to the emission of state s. Dividing by the total likelihood of observing the sequences Y and Z results in the conditional probability of (i ~ j, x→s):
- P( i ~ j, x→s | Y, Z ) = w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s) / u(0,0,start)
- Conditional on the sequences Y and Z, and on the position within these sequences, this is the expected number of times that the transition x→s was taken by the HMM. Now we can marginalize over the position variables i and j:
- E( number of times the transition x→s was taken | emitting Y, Z )
-   = Σ_{i,j} w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s) / u(0,0,start)
- The conditional expected number of times that particular symbols were emitted from particular states can be computed in a very similar way.
35. Posteriors: good predictors of accuracy
36. Posterior Decoding
- The good correspondence between posterior probabilities and the actual frequencies with which particular states were visited suggests that posteriors may be interpreted as the reliability of any particular prediction. The idea of Posterior Decoding (PD) is to use these posteriors to construct alignments. PD is not a single algorithm, but a set of closely related procedures.
- Generally, PD maximizes some overall measure of the posterior probability over a set of paths. The paths can consist of HMM states, or of alignment columns. The overall measure can be a (weighted) sum of posteriors, or the product of posteriors; a sketch of the sum variant follows below.
- As before, this is accomplished by a dynamic programming algorithm.
- For alignments, the product-of-posteriors measure turned out to work best.
- PD is a heuristic procedure: in contrast to e.g. the Viterbi algorithm, there is no direct (probabilistic) interpretation of the resulting alignment. For instance, when HMM states are used to build paths, the resulting PD path may not be a valid path through the HMM.
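As a concrete sketch of one such procedure: given a matrix post[i][j] of posterior probabilities that Y_i and Z_j are aligned (computed as on slide 32), a Needleman-Wunsch-like DP finds the alignment whose aligned pairs have maximal total posterior; the product-of-posteriors variant is the same DP applied to log-posteriors. The matrix below is a made-up toy example, not the output of any model:

```python
def posterior_decode(post):
    """Alignment maximizing the summed posterior of its aligned pairs.

    post[i][j] = posterior probability that Y[i] and Z[j] are aligned;
    unaligned (gapped) symbols contribute nothing to the objective.
    """
    n, m = len(post), len(post[0])
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + post[i - 1][j - 1],  # align Y[i-1], Z[j-1]
                          D[i - 1][j],                            # Y[i-1] in a gap
                          D[i][j - 1])                            # Z[j-1] in a gap
    # Traceback to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif D[i][j] == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return D[n][m], list(reversed(pairs))

# Toy posterior matrix for a 3-symbol Y and a 4-symbol Z.
post = [[0.7, 0.2, 0.0, 0.0],
        [0.1, 0.6, 0.2, 0.0],
        [0.0, 0.1, 0.3, 0.5]]
print(posterior_decode(post))
```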
37. Posterior decoding
[Figure: an alignment column can be produced by several HMM states, labeled a-e.]
Paths: alignment columns. Posteriors: the sum of the posteriors of all states contributing to a column. E.g. the posterior for Y_i and Z_j to be aligned is P( i ~ j, e | Y, Z ); the posterior for Y_i to be involved in a gap is Σ_j P( i ~ j, a or i ~ j, b or i ~ j, c or i ~ j, d | Y, Z ). To emphasize the marginalization over the gap position j within the second sequence, the resulting algorithm is called Marginalized Posterior Decoding.
38. Alignments: posterior probabilities
[Figure: posterior probabilities for the alignment of ATCATCTGCAGT and ACCGTTCACAATGGAT (after Durbin, Eddy, Krogh & Mitchison 1998).]
39. Alignments: posterior probabilities
[Figure: as above.]
40. Alignments: posterior probabilities
[Figure: as above.]
41. Posterior Decoding is better than Viterbi
42. Posterior decoding is better than score-based alignment
43. Summary
- HMMs are a powerful framework for modelling sequential data with randomness.
- Modelling alignments using HMMs provides a connection between alignment models and models of sequence evolution.
- It provides a practical and conceptual framework for dealing with uncertainties.
- Posterior decoding is more accurate and less biased than either score-based approaches or Viterbi.
- The Baum-Welch algorithm (not described here) allows efficient parameter estimation from unaligned sequences.