Title: Parameter Estimation For HMM
1. Parameter Estimation For HMM
Background readings: Chapter 3.3 in the book Biological Sequence Analysis, Durbin et al., 2001.
2. Reminder: Hidden Markov Model
Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_st.
Emission probabilities: p(X_i = b | S_i = s) = e_s(b).
3. Reminder: Most probable state path
Given an output sequence x = (x_1,...,x_L), a most probable path s = (s_1,...,s_L) is one which maximizes p(s | x).
4. Reminder: Viterbi's algorithm for the most probable path
[Figure: the HMM unrolled over positions 1,...,L: hidden states s_1, s_2,...,s_{L-1}, s_L emitting symbols x_1, x_2,...,x_{L-1}, x_L, preceded by the special initial state 0.]
We add the special initial state 0.
Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
For i = 1 to L, for each state l:
  v_l(i) = e_l(x_i) · max_k [v_k(i-1) a_kl]
  ptr_i(l) = argmax_k [v_k(i-1) a_kl]   (storing the previous state for reconstructing the path)
Termination: the probability of the most probable path is p(s_1,...,s_L, x_1,...,x_L) = max_k v_k(L); the path itself is recovered by tracing the ptr pointers back from argmax_k v_k(L).
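A minimal Python sketch of this recursion (the state/transition/emission containers states, a, e, a0 and their dictionary layout are illustrative assumptions, not notation from the slides; in practice the recursion is run in log space to avoid underflow):

def viterbi(x, states, a, e, a0):
    """Most probable state path for an emitted sequence x.
    a[k][l]: transition probability k -> l; e[k][b]: emission probability of b in state k;
    a0[k]: transition probability from the special initial state 0 to k."""
    v = [{k: a0[k] * e[k][x[0]] for k in states}]      # v_k(1)
    ptr = []
    for i in range(1, len(x)):
        vi, pi = {}, {}
        for l in states:
            # best predecessor k for state l at this position
            k_best = max(states, key=lambda k: v[-1][k] * a[k][l])
            pi[l] = k_best
            vi[l] = e[l][x[i]] * v[-1][k_best] * a[k_best][l]
        v.append(vi)
        ptr.append(pi)
    # termination: probability of the most probable path, then trace back via ptr
    s_last = max(states, key=lambda k: v[-1][k])
    p_best = v[-1][s_last]
    path = [s_last]
    for pi in reversed(ptr):
        path.append(pi[path[-1]])
    return list(reversed(path)), p_best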
5. Predicting CpG islands via the most probable path
Output symbols: A, C, G, T (4 letters). Markov chain states: four '+' states and four '-' states, two for each letter (8 states). The transition probabilities a_st and emission probabilities e_k(b) will be discussed soon. The most probable path found by Viterbi's algorithm predicts CpG islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few false negatives are found.
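As an illustration, the hypothetical viterbi sketch above could mark predicted CpG-island positions; the '+'/'-' state names and the variables dna, states, a, e, a0 are assumed to be defined as on this slide:

# dna: the input DNA string; states, a, e, a0: the 8-state model parameters (assumed given)
plus_states = {"A+", "C+", "G+", "T+"}
path, _ = viterbi(dna, states, a, e, a0)
predicted_island_positions = [i for i, s in enumerate(path) if s in plus_states]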
6. Reminder: finding the most probable state
f_l(i) = p(x_1,...,x_i, s_i = l): the probability of a path which emits (x_1,...,x_i) and in which s_i = l.
b_l(i) = p(x_{i+1},...,x_L | s_i = l): the probability of emitting (x_{i+1},...,x_L) given that s_i = l.
- The forward algorithm finds f_k(i) = p(x_1,...,x_i, s_i = k), k = 1,...,m.
- The backward algorithm finds b_k(i) = p(x_{i+1},...,x_L | s_i = k), k = 1,...,m.
- Return p(S_i = k | x) = f_k(i) b_k(i) / p(x), k = 1,...,m.
- To compute this for every i, simply run the forward and backward algorithms once, and compute f_k(i) b_k(i) for every i, k.
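A sketch of the forward and backward passes and of the posterior p(s_i = k | x), using the same hypothetical containers as the Viterbi sketch above (again, a real implementation would scale or work in log space):

def forward(x, states, a, e, a0):
    """f[i][k] = p(x_1..x_{i+1}, s_{i+1} = k), with 0-based list indices."""
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for xi in x[1:]:
        f.append({l: e[l][xi] * sum(f[-1][k] * a[k][l] for k in states) for l in states})
    return f

def backward(x, states, a, e):
    """b[i][k] = p(x_{i+2}..x_L | s_{i+1} = k); the last entry is all ones."""
    b = [{k: 1.0 for k in states}]
    for xi in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][xi] * b[0][l] for l in states) for k in states})
    return b

def posteriors(x, states, a, e, a0):
    """Return ([{k: p(s_i = k | x)} for each position i], p(x))."""
    f, b = forward(x, states, a, e, a0), backward(x, states, a, e)
    px = sum(f[-1][k] for k in states)                 # p(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))], px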
7. Finding the probability that a letter is in a CpG island via the algorithm for the most probable state
The probability that an occurrence of G is in a CpG island (a '+' state) is
  Σ_{s ∈ '+'} p(S_i = s | x) = Σ_{s ∈ '+'} f_s(i) b_s(i) / p(x),
where the summation is formally over the four '+' states, but actually only the state G+ needs to be considered (why?).
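Using the hypothetical posteriors helper above (and assuming x, states, a, e, a0 are defined), the probability that position i lies in a CpG island would be the posterior mass of the '+' states at i:

post, _ = posteriors(x, states, a, e, a0)
p_island_at_i = sum(post[i][s] for s in ("A+", "C+", "G+", "T+"))
# if x[i] == "G", only "G+" contributes, since the other '+' states cannot emit G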
8. Parameter Estimation for HMM
An HMM model is defined by the parameters a_kl and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
9. Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence which is assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ) (or p(x^j | θ, HMM)).
10. ML Parameter Estimation for HMM
The elements of the training set {x^1,...,x^n} are assumed to be independent, so p(x^1,...,x^n | θ) = ∏_j p(x^j | θ). ML parameter estimation looks for a θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.
11. Data for HMM
Possible properties of (the sequences in) the training set:
- For each x^j, what is our information on the states s_i? (The symbols x_i are usually known.)
- The size (number of sequences) of the training set.
12. Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x^1,...,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.
By the ML method, we look for the parameters θ* which maximize the probability of the sample set:
p(x^1,...,x^n | θ*) = max_θ p(x^1,...,x^n | θ).
13. Case 1: the ML method
For each x^j we have
p(x^j | θ) = ∏_{k,l} a_kl^{m_kl} · ∏_{k,b} e_k(b)^{m_k(b)},
where m_kl = #{i : s_{i-1} = k, s_i = l} (in x^j) and m_k(b) = #{i : s_i = k, x_i = b} (in x^j).
14. Case 1 (cont.)
By the independence of the x^j's, p(x^1,...,x^n | θ) = ∏_j p(x^j | θ).
Thus, if A_kl = #(transitions from k to l) in the training set and E_k(b) = #(emissions of symbol b from state k) in the training set, we have
p(x^1,...,x^n | θ) = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)}.
15. Case 1 (cont.)
So we need to find a_kl's and e_k(b)'s which maximize
F = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)},
subject to: for every state k, Σ_l a_kl = 1 with a_kl ≥ 0, and Σ_b e_k(b) = 1 with e_k(b) ≥ 0.
16. Case 1 (cont.)
Rewriting, we need to maximize
F = ∏_k [ ∏_l a_kl^{A_kl} ] · ∏_k [ ∏_b e_k(b)^{E_k(b)} ].
17. Case 1 (cont.)
If, for each state k, we maximize the factor ∏_l a_kl^{A_kl} (subject to Σ_l a_kl = 1) and the factor ∏_b e_k(b)^{E_k(b)} (subject to Σ_b e_k(b) = 1) separately, then we will also maximize F. Each of these is a simpler ML problem, which is treated next.
18. A simpler case: ML parameter estimation for a die
Let X be a random variable with 6 values x_1,...,x_6 denoting the six outcomes of a die. Here the parameters are θ = {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, with Σ_i θ_i = 1. Assume that the data is one sequence:
Data = (x_6, x_1, x_1, x_3, x_2, x_2, x_3, x_4, x_5, x_2, x_6).
So we have to maximize
p(Data | θ) = θ_6 θ_1 θ_1 θ_3 θ_2 θ_2 θ_3 θ_4 θ_5 θ_2 θ_6 = θ_1^2 θ_2^3 θ_3^2 θ_4 θ_5 θ_6^2,
subject to θ_1 + θ_2 + θ_3 + θ_4 + θ_5 + θ_6 = 1 and θ_i ≥ 0.
19. Side comment: Sufficient Statistics
- To compute the probability of the data in the die example we only need to record the number of times N_i the die fell on side i (namely, N_1, N_2,...,N_6).
- We do not need to recall the entire sequence of outcomes.
- {N_i, i = 1,...,6} is a sufficient statistic for multinomial sampling.
20. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(Data) is a sufficient statistic if, for any two datasets D and D',
  s(D) = s(D')  implies  P(D | θ) = P(D' | θ).
Exercise: define a sufficient statistic for the HMM model.
21. Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function instead, which is easier to do and gives an identical answer:
log p(Data | θ) = Σ_i N_i log θ_i.
22. Finding the Maximum
Setting the derivatives of the constrained log-likelihood to zero and rearranging terms gives N_i = λ θ_i for each i. Dividing the jth equation by the ith equation gives θ_j = θ_i N_j / N_i. Summing from j = 1 to 6 gives 1 = Σ_j θ_j = (θ_i / N_i) Σ_j N_j, hence θ_i = N_i / Σ_j N_j.
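A quick numeric check of this result on the die data of slide 18 (a sketch; the list encoding of the data is an assumption):

from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]    # the sequence from slide 18
counts = Counter(data)                       # the sufficient statistics N_1,...,N_6
theta_mle = {i: counts[i] / len(data) for i in range(1, 7)}
# gives theta_1 = 2/11, theta_2 = 3/11, theta_3 = 2/11, theta_4 = theta_5 = 1/11, theta_6 = 2/11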
23. Generalization: a distribution with any number k of outcomes
Let X be a random variable with k values x_1,...,x_k denoting the k outcomes of i.i.d. experiments, with parameters θ = {θ_1, θ_2,...,θ_k} (θ_i is the probability of x_i). Again, the data is one sequence of length n:
Data = (x_{i_1}, x_{i_2},...,x_{i_n}).
Then we have to maximize
p(Data | θ) = θ_1^{N_1} θ_2^{N_2} ··· θ_k^{N_k}, where N_i is the number of occurrences of x_i in the data,
subject to θ_1 + θ_2 + ... + θ_k = 1.
24. Generalization for k outcomes (cont.)
By a treatment identical to the die case, the maximum is obtained when, for all i, N_i = λ θ_i.
Hence the MLE is given by θ_i = N_i / Σ_j N_j = N_i / n.
25. Fractional Exponents
Some models allow N_i's which are not integers (e.g., when we are uncertain of a die outcome and consider it '6' with 20% confidence and '5' with 80%). We can still write
p(Data | θ) = θ_1^{N_1} θ_2^{N_2} ··· θ_k^{N_k},
and the same analysis yields θ_i = N_i / Σ_j N_j.
26. Applying the ML method to HMM
Let A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to maximize
F = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)},
subject to Σ_l a_kl = 1 and Σ_b e_k(b) = 1 for every state k.
27. Applying to HMM (cont.)
We apply the previous technique, for each state k, to the parameters {a_kl : l = 1,...,m} and {e_k(b) : b ∈ Σ}, which gives the optimal ML parameters:
a_kl = A_kl / Σ_{l'} A_{kl'},   e_k(b) = E_k(b) / Σ_{b'} E_k(b').
28. Adding pseudocounts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k. The estimates above then use A_kl + r_kl and E_k(b) + r_k(b) in place of the raw counts.
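A sketch of the count-and-normalize estimates of slides 26-28; the representation of the training set as (sequence, state-path) pairs and the pseudocount dictionaries are assumptions:

from collections import defaultdict

def ml_estimate(training_set, states, symbols, r_a=None, r_e=None):
    """Case 1 estimates: a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b').
    training_set: list of (x, s) pairs with fully known state paths.
    r_a[(k, l)], r_e[(k, b)]: optional pseudocounts r_kl and r_k(b).
    Assumes every state gets a nonzero total count (supply pseudocounts otherwise)."""
    A = defaultdict(float)          # A[(k, l)]: observed transitions k -> l
    E = defaultdict(float)          # E[(k, b)]: observed emissions of b from k
    for x, s in training_set:
        for i in range(1, len(x)):
            A[s[i - 1], s[i]] += 1
        for xi, si in zip(x, s):
            E[si, xi] += 1
    for key, r in (r_a or {}).items():      # add prior pseudocounts to the raw counts
        A[key] += r
    for key, r in (r_e or {}).items():
        E[key] += r
    a = {k: {l: A[k, l] / sum(A[k, m] for m in states) for l in states} for k in states}
    e = {k: {b: E[k, b] / sum(E[k, c] for c in symbols) for b in symbols} for k in states}
    return a, e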
29. Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x^1,...,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.
We just showed a method which finds the (unique) parameters θ* which maximize p(x^1,...,x^n | θ*) = max_θ p(x^1,...,x^n | θ).
30. Case 2: State paths are unknown
In this case only the values of the x_i's of the input sequences are known. This is an ML problem with missing data: we wish to find θ* so that p(x | θ*) = max_θ p(x | θ). For each sequence x, p(x | θ) = Σ_s p(x, s | θ), taken over all state paths s.
31. Case 2: State paths are unknown
So we need to maximize p(x | θ) = Σ_s p(x, s | θ), where the summation is over all state paths s which produce the output sequence x. Finding a θ which maximizes Σ_s p(x, s | θ) is hard, unlike finding a θ which maximizes p(x, s | θ) for a single pair (x, s), i.e., when the state path is known.
32. ML Parameter Estimation for HMM
The general process for finding θ in this case is:
- Start with an initial value of θ.
- Find θ' so that p(x^1,...,x^n | θ') > p(x^1,...,x^n | θ).
- Set θ = θ'.
- Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
34. Baum-Welch training
We start with some values of a_kl and e_k(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ' such that p(x | θ') > p(x | θ). Each iteration consists of a few steps.
35. Baum-Welch, step 1
Count the expected number of state transitions. For each sequence x^j, for each i and for each pair k, l, compute the posterior state-transition probabilities
P(s_{i-1} = k, s_i = l | x^j, θ).
36. Recall the forward algorithm
The task: compute f(h_i) = P(x_1,...,x_i, h_i) for i = 1,...,L (namely, considering the evidence up to time slot i).
Initial step: P(x_1, h_1) = P(h_1) P(x_1 | h_1).
Recall that the backward algorithm is similar: multiplying matrices from the end to the start and summing over the hidden states.
37. Baum-Welch training
Claim: P(x_1,...,x_L, s_{i-1} = k, s_i = l) = f_k(i-1) a_kl e_l(x_i) b_l(i).
38. Step 1: computing P(s_{i-1} = k, s_i = l | x^j, θ)
P(x_1,...,x_L, s_{i-1} = k, s_i = l)
  = P(x_1,...,x_{i-1}, s_{i-1} = k) · a_kl e_l(x_i) · P(x_{i+1},...,x_L | s_i = l)
  = f_k(i-1) a_kl e_l(x_i) b_l(i).
Dividing by p(x^j), which is obtained from the forward algorithm, gives
P(s_{i-1} = k, s_i = l | x^j, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / p(x^j).
39. Step 1 (end)
For each pair (k, l), compute the expected number of state transitions from k to l:
A_kl = Σ_j Σ_i P(s_{i-1} = k, s_i = l | x^j, θ).
40. Baum-Welch, step 2
For each state k and each symbol b, compute the expected number of emissions of b from k:
E_k(b) = Σ_j Σ_{i : x_i = b} P(s_i = k | x^j, θ).
41. Baum-Welch, step 3
Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b), as in Case 1. These values define θ'.
It can be shown that p(x^1,...,x^n | θ') > p(x^1,...,x^n | θ), i.e., θ' increases the probability of the data. This procedure is iterated until some convergence criterion is met.
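A sketch of one Baum-Welch iteration, built on the hypothetical forward/backward helpers above; production implementations use scaling or log space, and typically also re-estimate the initial-state probabilities:

def baum_welch_step(seqs, states, symbols, a, e, a0):
    """One iteration: expected counts A_kl and E_k(b) (steps 1-2), then re-estimation (step 3)."""
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {b: 0.0 for b in symbols} for k in states}
    for x in seqs:
        f = forward(x, states, a, e, a0)
        bwd = backward(x, states, a, e)
        px = sum(f[-1][k] for k in states)                 # p(x^j | theta)
        for i in range(1, len(x)):                         # step 1: expected transitions
            for k in states:
                for l in states:
                    A[k][l] += f[i - 1][k] * a[k][l] * e[l][x[i]] * bwd[i][l] / px
        for i in range(len(x)):                            # step 2: expected emissions
            for k in states:
                E[k][x[i]] += f[i][k] * bwd[i][k] / px
    # step 3: new parameters from the expected counts, exactly as in Case 1
    a_new = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e_new = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a_new, e_new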
42. Case 2 (state paths are unknown): Viterbi training
We again start from given values of a_kl and e_k(b), which define prior values of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize
p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n),
where s*(x^j) is the most probable (under θ) path for x^j.
43. Case 2 (state paths are unknown): Viterbi training
Each iteration:
1. Find the set {s*(x^j)} of most probable paths, which maximize p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n).
2. Find the θ' which maximizes p(s*(x^1),...,s*(x^n) | θ', x^1,...,x^n).
   (Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ'.)
3. Set θ = θ' and repeat. Stop when the paths are not changed.
44. Case 2 (state paths are unknown): Viterbi training
p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n) can be expressed in closed form (since we are using a single path for each x^j), so this time convergence is achieved when the optimal paths are not changed any more.
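A sketch of the Viterbi-training loop, reusing the hypothetical viterbi and ml_estimate helpers from the earlier blocks; pseudocounts can be passed to ml_estimate if some state never appears in the decoded paths:

def viterbi_training(seqs, states, symbols, a, e, a0):
    paths = None
    while True:
        # 1. most probable path for each sequence under the current parameters
        new_paths = [viterbi(x, states, a, e, a0)[0] for x in seqs]
        if new_paths == paths:          # 3. stop when the decoded paths no longer change
            return a, e
        paths = new_paths
        # 2. re-estimate the parameters from the (sequence, decoded path) pairs, as in Case 1
        a, e = ml_estimate(list(zip(seqs, paths)), states, symbols)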