1
Pairwise alignment
  • Score-based alignments, dynamic programming
  • Problems with score-based alignments
  • Markov chains, hidden Markov models, examples
  • The Viterbi algorithm
  • Pair HMMs
  • Forward, Backward, and Posterior probabilities
  • Posterior Decoding, comparison with score-based
    alignments

2
Sequence alignment
  • To study how nucleotide (or amino acid) sequences
    evolve, first a precise matching between
    homologous nucleotides in two (homologous)
    nucleotide sequences has to be made.
  • Homology is an old concept in biology, and means having a similar form. In evolutionary biology, it means, more specifically, having a similar form because of shared ancestry. For example, the skeletons of mammals are homologous in this sense, but eyes have evolved independently a number of times, so they're not homologous (even though they may be very similar in form and function).
  • The word is also used for nucleotide sequences. Here it means sequence similarity because of shared ancestry (although it is often used as a synonym of sequence similarity). Ancestry is tricky to determine, especially when sequences are short, highly divergent, or contain many repetitive patterns (because of the possibility that these repeats have arisen independently). So in practice, sequence similarity is often used as a proxy for homology. Other characteristics, such as the function or 3D structure of the protein encoded by the DNA sequence, are used as well.

Hexokinase I (glucose metabolism enzyme) is
found in mammals, plants, and yeast.
3
Needleman-Wunsch algorithm
  • Two sequences are considered homologous if they're likely to have evolved from a common ancestor, because of their sequence similarity.
  • To quantify this, it is first assumed that during evolution the sequences have been changed by two processes: (1) single nucleotide substitutions, and (2) short sequence insertions and deletions (indels). Other mutations do occur (e.g. multiple-nucleotide substitutions, duplications, inversions, ...), but they're usually ignored because of their relatively low frequency and/or the technical difficulty of dealing with them.
  • Considering only nucleotide substitutions and
    indels, the homology between 2 sequences can be
    summarized by an alignment. It implies a minimum
    set of substitutions and indels for that
    homology.
  • Historically the first algorithm for calculating
    alignments is the Needleman-Wunsch algorithm
    (1970). It finds the alignment that optimizes a
    score, which penalizes both nucleotide
    mis-matches at aligned positions, and gaps. This
    approach is a form of weighted parsimony.
  • The score function has the following form:
  • S = Σ_(n,m) aligned M(n,m) + Σ_gaps h(length),
  • where M(n,m) is the score matrix, and h(k) is the gap penalty function. The latter is almost always taken to be a linear affine function,
  • h(k) = a + b (k-1),
  • where a is the gap opening penalty, and b is the gap extension penalty.
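To make this score function concrete, here is a minimal Python sketch (the function names are my own; the parameter values are the example ones used on the later slides: match +10, mismatch -5, gap opening -10, gap extension -5) that scores a given alignment, written as two equal-length rows with '-' for gaps, under the affine gap penalty h(k) = a + b (k-1).

```python
def gap_penalty(k, a=-10, b=-5):
    """Affine gap penalty h(k) = a + b*(k-1) for a gap of length k."""
    return a + b * (k - 1)

def alignment_score(row1, row2, match=10, mismatch=-5):
    """Score S = sum of M(n,m) over aligned pairs + sum of h(length) over gaps."""
    assert len(row1) == len(row2)
    score, gap1, gap2 = 0, 0, 0
    for x, y in zip(row1, row2):
        if x == '-':                        # inside a gap in sequence 1
            gap1 += 1
        elif y == '-':                      # inside a gap in sequence 2
            gap2 += 1
        else:                               # aligned pair: close any open gaps first
            if gap1: score += gap_penalty(gap1); gap1 = 0
            if gap2: score += gap_penalty(gap2); gap2 = 0
            score += match if x == y else mismatch
    if gap1: score += gap_penalty(gap1)     # gap running to the end
    if gap2: score += gap_penalty(gap2)
    return score

# Example: the short alignment used on the next slides
print(alignment_score("CGACATTAA--ATA", "CGACGTTAACGATT"))
```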

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
4
Needleman-Wunsch algorithm
CGACATTAA--ATA CGACGTTAACGATT
An alignment can be represented as a path through
a table, with the sequences along the edges of
the table.
(Table: one sequence along the top edge and the other along the side; arrows trace the path through the table that corresponds to the alignment above.)
5
Needleman-Wunsch algorithm
CGACATTAA--ATA CGACGTTAACGATT
The problem with determining the top-scoring alignment is that there are very many paths. If we ignore the diagonal steps (i.e. only gaps...), the number of paths can be computed recursively, by adding the entries above and to the left and writing the result in the box. The numbers you get are called Pascal's triangle, or the binomial coefficients. The number of alignments of sequences of length N and M is therefore at least (N+M choose N) = (N+M)! / (N! M!). Simply going through all possible alignments is therefore impossible.
C G A C G T T A A C G A T T
C 1 1 1 1 1 1 1 1 1 1
G 1 2 3 4 5 6 7 8 9
A 1 3 6 10 15 21 28 36
C 1 4 10 20 35 56 84
G 1 5 15 35 70 126
T 1 6 21 56 126
T 1 7 28 84
A 1 8 36
A 1 9
A 1
T
A
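This binomial lower bound can be checked directly; a short sketch (Python's math.comb; the function name is mine):

```python
import math

def alignment_count_lower_bound(N, M):
    """Number of gap-only paths (horizontal/vertical steps only) through the
    table: the binomial coefficient C(N+M, N), i.e. the Pascal's-triangle
    entry built up above.  This is a lower bound on the number of alignments."""
    return math.comb(N + M, N)

print(alignment_count_lower_bound(12, 14))      # the example sequences above
print(alignment_count_lower_bound(300, 300))    # already astronomically large
```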
6
Needleman-Wunsch algorithm
CGACATTAA--ATA CGACGTTAACGATT
  • Needleman-Wunsch is, yet again, an application of
    dynamic programming. Here, the intermediate
    problem to solve is the computation of the
    maximum score S(n,m) of the first n nucleotides
    of sequence u, and the first m nucleotides of
    sequence v.
  • The recursive formula for this score is
  • S(n,m) = max { S(n-1,m-1) + M( u(n), v(m) ),
  •               max_{k≥1} S(n-k,m) + h(k),
  •               max_{k≥1} S(n,m-k) + h(k) }
  • (To make the formulas look clean, boundary conditions such as S(-1,n) = 0, and the allowed range of k, are not written explicitly.)
  • This algorithm takes O(n³) time. It can be improved to O(n²) by using 3 tables which collect the scores that (1) end in an aligned pair, (2) end in a gap in u, and (3) end in a gap in v. The linear affine form of h allows simple updating of the gap-ending scores using neighbouring values.
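A minimal sketch of the dynamic-programming recursion in Python. To keep it short it uses a simple linear gap penalty (a fixed cost per gapped position) rather than the affine form, so a single table suffices; the full affine-gap version needs the three tables mentioned above. The parameter values follow the example slides, and a traceback over the filled table (next slide) would recover the alignment itself.

```python
def needleman_wunsch_score(u, v, match=10, mismatch=-5, gap=-5):
    """Best global alignment score with a linear gap penalty (gap per symbol).

    S(n,m) = max( S(n-1,m-1) + M(u_n, v_m),
                  S(n-1,m)   + gap,
                  S(n,  m-1) + gap )
    """
    n, m = len(u), len(v)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap                       # boundary: leading gap in v
    for j in range(1, m + 1):
        S[0][j] = j * gap                       # boundary: leading gap in u
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i-1][j-1] + (match if u[i-1] == v[j-1] else mismatch)
            S[i][j] = max(diag, S[i-1][j] + gap, S[i][j-1] + gap)
    return S[n][m]                              # bottom-right entry

print(needleman_wunsch_score("CGACATTAAATA", "CGACGTTAACGATT"))
```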

(Partially filled dynamic-programming score table for the two example sequences. Scoring parameters: match +10, mismatch -5, gap opening -10, gap extension -5.)
7
Needleman-Wunsch algorithm
CGACATTAA--ATA CGACGTTAACGATT
With the table filled out, the bottom-right entry
gives the score of the best-scoring
alignment. By starting in this entry, and
re-computing the score, you can figure out which
term of the max formula gave rise to this
score. Go to the corresponding entry and repeat
until you arrive at the top-left entry. The path
now corresponds to the alignment. This algorithm computes the best-scoring global alignment: all nucleotides participate, and gaps at the beginning or end are penalized as much as gaps in the middle. The Smith-Waterman algorithm is a variation of Needleman-Wunsch which does not penalize gaps at the beginning or end. It computes the best-scoring local alignment.
(The same partially filled score table, with the traceback path. Scoring parameters: match +10, mismatch -5, gap opening -10, gap extension -5.)
8
Alignments are often wrong
(Plots: sensitivity for true homology; incorrect bases in alignment.)
9
Biases in alignments
(Panels: A, gap wander (Holmes & Durbin, JCB 5, 1998); B and C, gap attraction; D, gap annihilation.)
10
Problems with score-based alignments
  • The Needleman-Wunsch algorithm (and its elaborations) optimizes a score to arrive at a single preferred alignment. This approach is unsatisfactory for several reasons:
  • 1. It is unclear how to determine the score
    matrix and gap penalties given only unaligned
    sequences.
  • 2. Even if somehow the underlying evolutionary
    process is known (e.g. as substitution and indel
    rates) it is unclear how the scores and penalties
    relate to this.
  • 3. It is impossible to recover the true alignment with certainty, particularly if the sequences are quite divergent. But how to quantify the alignment's reliability?
  • 4. The N-W alignments are biased in certain ways, and the approach provides no obvious way to address the problem.
  • These observations suggest that a probabilistic (or statistical) approach to alignments might be helpful.

11
Markov chains
  • A Markov chain is a discrete-time stochastic
    process X(t) that has the Markov property.
  • Discrete time means that t ∈ T, where T is the set of integers {..., -2, -1, 0, 1, 2, ...}, a discrete set.
  • The state space may be either discrete or continuous. We'll only consider discrete state spaces here (but continuous state spaces are also useful in biology, e.g. to model the evolution of gene expression levels).
  • The Markov property is the property that the future dynamics of X(t), given the present state, is independent of the past states. Symbolically,
  • P( X(t+1)=y_{t+1} | X(t)=y_t, X(t-1)=y_{t-1}, X(t-2)=y_{t-2}, ... ) = P( X(t+1)=y_{t+1} | X(t)=y_t )
  • This means that the behaviour of X(t) is completely determined by the matrices
  • A_pq(t) = P( X(t+1)=q | X(t)=p )
  • (Strictly speaking, they are matrices only if the state space is finite.) These matrices are called the transition matrices of X(t). They may well depend on t. If they do not, X(t) is called a time-homogeneous Markov chain.

12
Markov chains
  • Suppose you know the probabilities that the Markov chain is in a particular state at time t; let's say you know the vector u, where
  • u_a = P( X(t) = a )
  • Then the probability that X(t+1) = b is
  • Σ_a P( X(t)=a ) P( X(t+1)=b | X(t)=a )
  • (*) = Σ_a u_a A_ab
  • = (u A)_b
  • because the summation (*) is just the definition of the product of a vector and a matrix. Applying the same formula with u → u A some number of times, you get
  • P( X(t+n)=b ) = (u A^n)_b
  • Conditioning on X(t)=a means setting u = (0, ..., 1, ..., 0), with the 1 at the coordinate corresponding to a:
  • P( X(t+n)=b | X(t)=a ) = (A^n)_ab
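A small numerical illustration of these formulas (the two-state transition probabilities below are invented for the example, not taken from the next slide's figure):

```python
import numpy as np

# A[p, q] = P( X(t+1)=q | X(t)=p ); an illustrative two-state chain
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

u = np.array([1.0, 0.0])   # condition on X(t) = state 0: u = (1, 0, ..., 0)

# P( X(t+n)=b | X(t)=a ) = (A^n)_ab, or equivalently propagate u -> u A, n times
for n in range(1, 6):
    print(n, u @ np.linalg.matrix_power(A, n))
```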

13
Markov chains
  • Simple example: two states, and transition probabilities between them.
  • This system models the fact that yesterday's weather is a reasonable prediction for today's weather.
  • Here is a typical realization:

14
Hidden Markov model (HMM)
Elaboration of the simple weather model, now as an HMM. The underlying states, which are considered unobservable, are H and L, representing periods of high and low atmospheric pressure. The observable is the weather (sunny or rainy). Here is a typical realization of this HMM:
H H H H H H H H H L L L H H H H H L L L L H H
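A generative sketch of this weather HMM in Python. The transition and emission probabilities of the slide's figure are not reproduced in this transcript, so the numbers below are assumptions chosen only for illustration.

```python
import random

transition = {'H': {'H': 0.7, 'L': 0.3},            # hidden pressure states
              'L': {'H': 0.3, 'L': 0.7}}
emission   = {'H': {'sunny': 0.8, 'rainy': 0.2},     # observable weather
              'L': {'sunny': 0.3, 'rainy': 0.7}}

def sample(dist):
    """Draw a key from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k                                         # guard against rounding

def simulate(n, start='H'):
    """Generate n steps of (hidden state, observed weather)."""
    states, weather = [], []
    x = start
    for _ in range(n):
        states.append(x)
        weather.append(sample(emission[x]))
        x = sample(transition[x])
    return states, weather

print(simulate(23))
```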
15
Hidden Markov models
  • A hidden Markov model (HMM) is a Markov chain
    X(t), and a stochastic variable Y(t) representing
    emissions or observations.
  • X(t) has the usual Markov property. Y(t) is only
    dependent on X(t) (more precisely, Y(t) is
    conditionally independent of all other Y(s) given
    X(t).) These dependencies are summarized in this
    picture


  • (Wikipedia: Hidden Markov model)
  • The state X(t) is considered unobservable; the only observables are the Y(t).
  • (Note: the Y(t) may also emit the empty symbol, denoted ε. If these are used, they are simply left out of the emitted symbol sequence. Only this symbol sequence is considered to be observable; this then also makes the time t unobservable, since it is unknown how many empty symbols have been output.)
  • Usually the Markov chain starts at t=0, and in a particular state, which is called the start state. Often one of the states in the Markov chain is labeled the end state, and the process stops when it reaches this state. In this way HMMs can be made to produce finite symbol sequences.

16
Hidden Markov models
  • The unobserved Markov chain is defined as usual: a number of states 1, 2, ..., N, and a transition matrix A.
  • The Y(t) are a stochastic variable whose discrete state space is called the emission alphabet. A probability distribution P( Y=y | X=x ) determines the emissions conditional on the state of X(t).
  • HMMs are generative models: they specify how a sequence of observations is generated randomly. This can be useful for simulations. Their power however derives from the existence of a number of efficient algorithms that compute the answer to important questions relating to HMMs.
  • Questions you may want to ask include
  • (i) What is the most likely state path X(t)
    generating a given observed sequence Y?
  • (ii) What is the likelihood of a given model
    generating a sequence Y?
  • (iii) What is the posterior probability that a
    symbol Yi from the sequence Y was generated by
    some state X?
  • (iv) For a given sequence Y, what matrix A and emission probabilities P( Y=y | X=x ) maximize the likelihood of Y?
  • These questions are solved by (i) the Viterbi
    algorithm, (ii) the Forward or Backward
    algorithms, (iii) the Forward and Backward
    algorithms combined, and (iv) the Baum-Welch
    algorithm.

17
Profile HMMs
Profile HMMs are used to model DNA sequence
motifs in longer DNA sequences. Examples of
sequence motifs are DNA binding motifs for
transcription factors, or protein domains (in
which case amino acids can also be used as
alphabet).
18
Profile HMMs
Adding the x states (which emit nucleotides according to the background distribution) allows the motif to be anywhere inside a longer DNA sequence. The black dots are silent states (states that do not emit symbols).
19
Profile HMMs
Adding more silent states allows letters in the motif to be skipped (with a certain probability), to model, e.g., protein domains with slightly variable lengths.
20
Profile HMMs
Similarly, insert states allow the insertion
of, in principle, any length of sequence
somewhere within the motif.
21
Profile HMMs
This is the profile HMM architecture used in the
HMMER and SAM packages. An extensive list of
HMMs for protein domains can be found in the Pfam
database.
HMMER: Sean Eddy (Washington U.); SAM: David Haussler (UCSC).
A loopback transition allows multiple motifs to
occur in a single sequence (but note that in this
model, the sequence always includes at least one
motif. Using this model to find motifs will
always get you one).
22
Viterbi algorithm
Andrew Viterbi (1935-), Italian-American engineer
  • This algorithm determines the most likely state path conditional on producing the observed sequence Y. Straightforward computation would involve examining all possible paths, which is virtually impossible. The Viterbi algorithm solves the problem efficiently, using dynamic programming, in a way that is very similar to the Needleman-Wunsch algorithm (except that this one is in 1 dimension, but see later).
  • The intermediate result that can be computed recursively is
  • v(i,x) = the probability of the most probable path that begins in the start state, emits the first i symbols of Y, and ends up in state x.
  • Suppose for convenience that all states except the start state emit a symbol. Then v(0,start) = 1, and v(0,x) = 0 for other states x.
  • For i > 0, v can be computed recursively:
  • v(i,x) = max_s v(i-1,s) A_sx P( Y=Y_i | X=x )   (*)
  • The factor v(i-1,s) has already been computed. After all v(i,x) have been computed, a traceback procedure recovers the most probable path. (A small Python sketch follows after the notes below.)
  • (Note 1: the convention has been taken that a state emits its corresponding symbol as soon as it is entered. One can also choose to emit once a state is left. Further, emissions may be assigned to transitions rather than states; this is called a Mealy model or machine; the conventional choice is called a Moore model.)
  • (Note 2: if some states x do not emit a symbol, the factor v(i-1,s) must be replaced by v(i,s) in (*). This makes the formula self-referential. For the Viterbi algorithm, the computation (*) can simply be carried out a number of times, until none of the terms changes any further. Depending on the structure of the Markov chain, computing the v(i,x) in a suitable order may decrease the number of iterations required.)
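A minimal sketch of the recursion (*) in Python, for an HMM in which every state except start emits a symbol. Log space is used only to avoid numerical underflow; the weather example and its probabilities are the illustrative assumptions used earlier, not values from the slides.

```python
import math

def viterbi(obs, states, trans, emit, start='start'):
    """Most probable state path.  trans[s][x] = A_sx, emit[x][y] = P(Y=y | X=x).
    v[i][x] is the log-probability of the most probable path emitting obs[:i]
    and ending in state x; ptr stores the argmax for the traceback."""
    log = lambda p: math.log(p) if p > 0 else float('-inf')
    v = [{start: 0.0}] + [{x: float('-inf') for x in states} for _ in obs]
    ptr = [{} for _ in range(len(obs) + 1)]
    for i in range(1, len(obs) + 1):
        for x in states:
            if x == start:
                continue
            best_s = max(v[i-1], key=lambda s: v[i-1][s] + log(trans.get(s, {}).get(x, 0)))
            v[i][x] = (v[i-1][best_s] + log(trans.get(best_s, {}).get(x, 0))
                       + log(emit[x].get(obs[i-1], 0)))
            ptr[i][x] = best_s
    x = max((s for s in v[-1] if s != start), key=lambda s: v[-1][s])
    path = [x]                                   # traceback
    for i in range(len(obs), 1, -1):
        x = ptr[i][x]
        path.append(x)
    return list(reversed(path))

states = ['start', 'H', 'L']
trans = {'start': {'H': 0.5, 'L': 0.5},
         'H': {'H': 0.7, 'L': 0.3},
         'L': {'H': 0.3, 'L': 0.7}}
emit = {'H': {'sunny': 0.8, 'rainy': 0.2},
        'L': {'sunny': 0.3, 'rainy': 0.7}}
print(viterbi(['sunny', 'sunny', 'rainy', 'rainy', 'sunny'], states, trans, emit))
```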

23
Pair HMMs
  • Up until now, the HMMs emitted a single sequence
    of symbols. A straightforward generalization
    allows the emission of two symbol sequences at
    the same time. Such HMMs are called pair HMMs.
  • Each state emits one symbol to each sequence. As before, either or both may be the empty symbol (written '-' instead of ε).
  • Pair HMMs can be used to model sequence
    alignments. Simultaneously emitted nucleotides
    correspond to homologous pairs, and are
    correlated because of their shared ancestry.
    States that emit nucleotides to only one sequence
    model gaps in the other sequence.

24
Viterbi for pair HMMs
  • To show what changes in going from standard to pair HMMs, here is the Viterbi algorithm for pair HMMs. The observable sequences are denoted Y and Z.
  • The probabilities to be computed are v(i,j,x),
    and represent the probability of the most
    probable path that begins in the start state,
    emits the first i symbols from Y and the first j
    symbols from Z, and ends up in state x.
  • The recursion is
  • v(i,j,x) = max_s v(i-p, j-q, s) A_sx P( Y=Y_i, Z=Z_j | X=x )
  • Here, p=1 if x emits into Y and 0 otherwise; similarly, q=1 if x emits into Z and 0 otherwise. If x does not emit into both sequences, the emission probability should change accordingly (e.g. P( Y=Y_i | X=x ) for emission into Y only). (A sketch follows after this list.)
  • The maximum path probability can be read off from v(length(Y), length(Z), end); the actual path can be obtained by backtracing.
  • For simplicity I assume that none of the states
    (except for start) is silent, i.e. they all
    emit into at least one sequence. If not, the
    recursion again becomes self-referential, and the
    same remarks as with the 1-dimensional Viterbi
    algorithm apply.
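A sketch of this pair-HMM Viterbi recursion in Python, computing only the maximum path log-probability (a traceback table would recover the alignment as before). The three-state model and all probabilities in the usage example are assumptions for illustration, not values from the slides.

```python
import math

def pair_viterbi(Y, Z, states, trans, emit, emits_Y, emits_Z, start='start', end='end'):
    """v[i][j][x]: log-prob of the most probable path emitting Y[:i] and Z[:j]
    and ending in state x.  emits_Y[x]/emits_Z[x] are the p/q flags of the
    recursion; every state except start/end is assumed to emit into Y, Z or both."""
    log = lambda p: math.log(p) if p > 0 else float('-inf')
    nY, nZ = len(Y), len(Z)
    v = [[{x: float('-inf') for x in states} for _ in range(nZ + 1)] for _ in range(nY + 1)]
    v[0][0][start] = 0.0
    for i in range(nY + 1):
        for j in range(nZ + 1):
            for x in states:
                if x in (start, end):
                    continue
                p, q = emits_Y[x], emits_Z[x]
                if i - p < 0 or j - q < 0 or (i, j) == (0, 0):
                    continue
                e = emit[x](Y[i-1] if p else None, Z[j-1] if q else None)
                best = max(v[i-p][j-q][s] + log(trans.get(s, {}).get(x, 0.0))
                           for s in states)
                v[i][j][x] = best + log(e)
    # finish with a transition into the end state
    return max(v[nY][nZ][s] + log(trans.get(s, {}).get(end, 0.0)) for s in states)

# Illustrative 3-state model: M emits an aligned pair, X a symbol of Y only
# (gap in Z), I a symbol of Z only (gap in Y).  All probabilities are invented.
p_pair   = lambda a, b: 0.2 if a == b else 0.05 / 3
p_single = lambda a, b: 0.25
states  = ['start', 'M', 'X', 'I', 'end']
trans   = {'start': {'M': 0.9, 'X': 0.05, 'I': 0.05},
           'M': {'M': 0.8, 'X': 0.05, 'I': 0.05, 'end': 0.1},
           'X': {'M': 0.6, 'X': 0.3, 'end': 0.1},
           'I': {'M': 0.6, 'I': 0.3, 'end': 0.1}}
emit    = {'M': p_pair, 'X': p_single, 'I': p_single}
emits_Y = {'M': 1, 'X': 1, 'I': 0}
emits_Z = {'M': 1, 'X': 0, 'I': 1}
print(pair_viterbi("CGACATTAAATA", "CGACGTTAACGATT", states, trans, emit, emits_Y, emits_Z))
```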

25
Viterbi for pair HMMs
  • The Viterbi algorithm computes the single path
    X(0),X(1),...,X(t) through the Markov chain
    that has the highest probability of
    simultaneously emitting the two observed
    sequences, denoted Y and Z. The path encodes
    which nucleotides from Y and Z were emitted
    simultaneously, i.e. it encodes the alignment.
  • Symbolically, the Viterbi algorithm first computes
  • max_X P( X, Y, Z )
  • and then the traceback part computes
  • arg max_X P( X, Y, Z ) = arg max_X P( X | Y, Z )
  • The last equation holds because P( X | Y, Z ) = P( X, Y, Z ) / P( Y, Z ), and because the denominator does not depend on X, the full and conditional probabilities reach their maxima for the same path X.

26
Alignments: best-scoring / most likely path (Needleman-Wunsch, Smith-Waterman, Viterbi)
(Figure: the most likely alignment of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.)
27
Likelihood: the Forward algorithm
  • The Viterbi algorithm is very similar to the Needleman-Wunsch algorithm (the states play the role of the index to the three tables mentioned briefly before); it has the same time complexity, and indeed for the three-state HMM of the example, you can give a score matrix and gap penalties that make them compute identical alignments.
  • The HMM approach has the important advantage over score-based approaches such as N-W that it is fairly straightforward to determine the model parameters if you know how the sequences evolve, i.e. if you know the evolutionary model (drawback 1 of score-based aligners).
  • Drawback 2 concerned determining the model
    parameters (or the evolutionary model) from
    sequences alone. The HMM approach also has a
    method to do this, by maximum likelihood
    parameter estimators. For this, it is necessary
    to be able to calculate the likelihood. This
    involves considering (summing over) all possible
    paths, instead of just the most likely one. The
    Forward algorithm does this.

28
The Forward algorithm (for pair HMMs)
  • The algorithm is very similar to the Viterbi algorithm (without traceback). The probabilities to be computed are slightly different: w(i,j,x) now represents the sum of probabilities of paths through the Markov chain that emit the first i symbols from Y and the first j symbols from Z, and end up in state x. The final result, which appears in the table entry w(length(Y), length(Z), end), is the likelihood of the sequences Y and Z.
  • Symbolically, the Forward algorithm computes
  • P( Y, Z ) = Σ_X P( X, Y, Z )
  • The recursion is
  • w(i,j,x) = Σ_s w(i-p, j-q, s) A_sx P( Y=Y_i, Z=Z_j | X=x )
  • Here, p and q have the same definitions as before, and the same caveats for the sake of simplicity apply. (A sketch for the single-sequence case follows below.)
  • Note: dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which of course can be dealt with by a simple division.
  • Note: the numbers w(i,j,x) are not conditional probabilities; they are not even true probabilities, as in some cases they can be > 1.
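A sketch of the Forward recursion for the simpler single-sequence case (the pair-HMM version only changes the indexing, as above). It reuses the illustrative weather HMM; with no explicit end state, the likelihood is obtained by summing over the possible final states.

```python
def forward(obs, states, trans, emit, start='start'):
    """w[i][x]: summed probability of all paths that emit obs[:i] and end in
    state x (the sum-over-paths analogue of the Viterbi maximization)."""
    w = [{x: 0.0 for x in states} for _ in range(len(obs) + 1)]
    w[0][start] = 1.0
    for i in range(1, len(obs) + 1):
        for x in states:
            if x == start:
                continue
            w[i][x] = (sum(w[i-1][s] * trans.get(s, {}).get(x, 0.0) for s in states)
                       * emit[x].get(obs[i-1], 0.0))
    return sum(w[-1][x] for x in states if x != start)   # likelihood P(obs)

trans = {'start': {'H': 0.5, 'L': 0.5},
         'H': {'H': 0.7, 'L': 0.3},
         'L': {'H': 0.3, 'L': 0.7}}
emit = {'H': {'sunny': 0.8, 'rainy': 0.2},
        'L': {'sunny': 0.3, 'rainy': 0.7}}
print(forward(['sunny', 'rainy', 'rainy'], ['start', 'H', 'L'], trans, emit))
```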

29
The Backward algorithm (for pair HMMs)
  • The Backward algorithm computes the same thing as the Forward algorithm: the likelihood that the HMM emits the sequences Y and Z; symbolically, P( Y, Z ) = Σ_X P( X, Y, Z ). So why bother? The answer is that the tables of intermediate results are different, and the Forward and Backward tables can be used together to compute other interesting quantities, such as posterior probabilities.
  • The algorithm is very similar to the Forward algorithm, but starts the recursion at the bottom-right end of the table, working its way (backwards) to the top-left.
  • The quantities u now represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the end state. The emission associated with x is not included in this likelihood.
  • Initialization is u(length(Y), length(Z), end) = 1, and the recursion is
  • u(i,j,x) = Σ_s u(i+p_s, j+q_s, s) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s )
  • Here, p and q have the same definitions as before, but now refer to the state s rather than x.
  • Note: dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which can be dealt with by a simple division.

30
The Backward algorithm: sampling from the posterior
  • Recall that the quantities u computed by the Backward algorithm represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the end state.
  • In other words, u(i,j,x) gives the probability that the model will produce the remainder of Y and Z, conditional on starting in state x. These probabilities can be used to sample from the posterior distribution of paths that generate Y and Z:
  • x ← start, i ← 0, j ← 0
  • While x is not end:
  •   Compute r_s = u(i+p_s, j+q_s, s) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) for all states s
  •   Normalize the r_s so that they sum to 1, and draw a state s from this distribution
  •   x ← s, i ← i+p_s, j ← j+q_s
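A self-contained sketch of this sampling procedure for the simpler single-sequence case (for a pair HMM only the indexing changes). Without an explicit end state the loop simply runs for the length of the observation, and the weather probabilities are again the illustrative assumptions used earlier.

```python
import random

def sample_posterior_path(obs, states, trans, emit, start='start'):
    """Draw a state path from P(path | obs): compute the Backward table u, then
    walk forward choosing each next state s with probability proportional to
    r_s = A_xs * P(obs[i] | s) * u[i+1][s]."""
    n = len(obs)
    emitting = [x for x in states if x != start]
    u = [{x: 0.0 for x in states} for _ in range(n + 1)]
    for x in states:
        u[n][x] = 1.0                              # no explicit end state here
    for i in range(n - 1, -1, -1):                 # Backward recursion
        for x in states:
            u[i][x] = sum(trans.get(x, {}).get(s, 0.0) * emit[s].get(obs[i], 0.0)
                          * u[i+1][s] for s in emitting)
    path, x = [], start
    for i in range(n):                             # forward sampling pass
        r = {s: trans.get(x, {}).get(s, 0.0) * emit[s].get(obs[i], 0.0) * u[i+1][s]
             for s in emitting}
        total = sum(r.values())
        pick, acc = random.random() * total, 0.0
        for s, weight in r.items():
            acc += weight
            if pick < acc:
                break
        path.append(s)
        x = s
    return path

trans = {'start': {'H': 0.5, 'L': 0.5},
         'H': {'H': 0.7, 'L': 0.3},
         'L': {'H': 0.3, 'L': 0.7}}
emit = {'H': {'sunny': 0.8, 'rainy': 0.2},
        'L': {'sunny': 0.3, 'rainy': 0.7}}
print(sample_posterior_path(['sunny', 'rainy', 'rainy'], ['start', 'H', 'L'], trans, emit))
```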

31
Alignments: posterior probabilities (Durbin, Eddy, Krogh, Mitchison 1998)
(Figure: posterior probabilities for the alignment of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.)
32
Posterior probabilities
  • Using the tables of intermediate results w(i,j,x) and u(i,j,x) computed by the Forward and Backward algorithms, it is possible to compute certain posterior probabilities.
  • For example, let's denote by (i, j, x) the event that the HMM emitted the symbols Y_i and Z_j from state x (assuming that x emits into both sequences; an alternative interpretation can be formulated when x does not). The probability of this event, and of the HMM producing the sequences Y and Z, is
  • P( Y, Z, (i, j, x) ) = w(i,j,x) u(i,j,x)
  • Since the likelihood of emitting Y and Z is
  • P( Y, Z ) = w(length(Y), length(Z), end) = u(0, 0, start),
  • the probability of the event (i, j, x) conditional on observing Y and Z is
  • P( (i, j, x) | Y, Z ) = w(i,j,x) u(i,j,x) / u(0, 0, start)
  • If the state x represents the aligned state in an alignment HMM, this is the posterior probability that Y_i and Z_j are aligned.
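A compact sketch of this Forward-Backward combination for the single-sequence case, again with the illustrative weather HMM: it computes the posterior probability that symbol number i was emitted by state x, the single-sequence analogue of the formula above (w(i,x) u(i,x) / likelihood).

```python
def posterior_state_probabilities(obs, states, trans, emit, start='start'):
    """P( symbol i emitted by state x | obs ) = w(i,x) * u(i,x) / P(obs)."""
    n = len(obs)
    emitting = [x for x in states if x != start]
    # Forward table: w[i][x] = P(obs[:i], path ends in x)
    w = [{x: 0.0 for x in states} for _ in range(n + 1)]
    w[0][start] = 1.0
    for i in range(1, n + 1):
        for x in emitting:
            w[i][x] = (sum(w[i-1][s] * trans.get(s, {}).get(x, 0.0) for s in states)
                       * emit[x].get(obs[i-1], 0.0))
    # Backward table: u[i][x] = P(obs[i:] | in state x after emitting obs[:i]);
    # the emission of x itself is not included (it is already counted in w)
    u = [{x: 0.0 for x in states} for _ in range(n + 1)]
    for x in states:
        u[n][x] = 1.0
    for i in range(n - 1, -1, -1):
        for x in states:
            u[i][x] = sum(trans.get(x, {}).get(s, 0.0) * emit[s].get(obs[i], 0.0)
                          * u[i+1][s] for s in emitting)
    likelihood = u[0][start]                  # equals sum_x w[n][x] as well
    return [{x: w[i][x] * u[i][x] / likelihood for x in emitting}
            for i in range(1, n + 1)]

trans = {'start': {'H': 0.5, 'L': 0.5},
         'H': {'H': 0.7, 'L': 0.3},
         'L': {'H': 0.3, 'L': 0.7}}
emit = {'H': {'sunny': 0.8, 'rainy': 0.2},
        'L': {'sunny': 0.3, 'rainy': 0.7}}
for i, post in enumerate(posterior_state_probabilities(
        ['sunny', 'rainy', 'rainy'], ['start', 'H', 'L'], trans, emit), start=1):
    print(i, post)
```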

33
Posterior probabilities
34
Posterior probabilities (continued)
  • As another example, let's consider the event that the HMM transitioned from state x to s just after emitting the symbols Y_i and Z_j from state x (again assuming that x emits into both sequences; this is not essential). Denote this event by (i, j, x→s). The probability of this event, and of the HMM producing the sequences Y and Z, is
  • P( Y, Z, (i, j, x→s) ) = w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s)
  • where again p and q refer to the emission of state s. Dividing by the total likelihood of observing the sequences Y and Z results in the conditional probability of (i, j, x→s):
  • P( (i, j, x→s) | Y, Z ) = w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s) / u(0, 0, start)
  • Conditional on the sequences Y and Z, and the position within these sequences, this is the expected number of times that the transition x→s was taken by the HMM. Now we can marginalize over the position variables i and j:
  • E( number of times that the transition x→s was taken | emitting Y, Z )
  • = Σ_{i,j} w(i,j,x) A_xs P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) u(i+p, j+q, s) / u(0, 0, start)
  • The conditional expected number of times that
    particular symbols were emitted from particular
    states can be computed in a very similar way.

35
Posteriors: good predictors of accuracy
36
Posterior Decoding
  • The good correspondence between posterior probabilities and the actual frequencies with which particular states were visited suggests that posteriors may be interpreted as the reliability of any particular prediction. The idea of Posterior Decoding is to use these posteriors to construct alignments. PD is not a single algorithm, but a set of closely related procedures.
  • Generally, P.D. maximizes some overall measure of the posterior probability over a set of paths.
    - The paths can consist of HMM states, or of alignment columns.
    - The overall measure can be a (weighted) sum of posteriors, or the product of posteriors.
  • As before, this is accomplished by a dynamic
    programming algorithm.
  • For alignments, the product-of-posteriors measure
    turned out to work best.
  • PD is a heuristic procedure: in contrast to, e.g., the Viterbi algorithm, there is no direct (probabilistic) interpretation of the alignment. For instance, when HMM states are used to build paths, the resulting PD path may not be a valid path through the HMM.

37
Posterior decoding
(Diagram: pair-HMM states labeled a, b, c, d, e.)
Paths: alignment columns. Posteriors: sum of posteriors for all states contributing to a column. E.g. the posterior for Y_i and Z_j to be aligned is P( (i, j, e) | Y, Z ); the posterior for Y_i to be involved in a gap is Σ_j P( (i, j, a) or (i, j, b) or (i, j, c) or (i, j, d) | Y, Z ). To emphasize the marginalization of the gap position j within the second sequence, the resulting algorithm is called Marginalized Posterior Decoding.
38
Alignments: posterior probabilities (Durbin, Eddy, Krogh, Mitchison 1998)
(Figure: posterior probabilities for the alignment of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.)
39
Alignments: posterior probabilities (Durbin, Eddy, Krogh, Mitchison 1998)
(Figure: posterior probabilities for the alignment of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.)
40
Alignments: posterior probabilities (Durbin, Eddy, Krogh, Mitchison 1998)
(Figure: posterior probabilities for the alignment of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.)
41
Posterior Decoding is better than Viterbi
42
Posterior decoding is better than score-based
alignment
43
Summary
  • HMMs are a powerful framework to model sequential
    data with randomness
  • Modelling alignments using HMMs provides a connection between alignment models and models of sequence evolution
  • HMMs provide a practical and conceptual framework to deal with uncertainties
  • Posterior decoding is more accurate and less biased than either score-based approaches or Viterbi.
  • The Baum-Welch algorithm (not described here) allows efficient parameter estimation from unaligned sequences