Title: Parameter Estimation For HMM
1. Parameter Estimation For HMM
Background readings: Chapter 3.3 in the book Biological Sequence Analysis, Durbin et al., 2001.
2. Reminder: Hidden Markov Model
Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_st.
Emission probabilities: p(X_i = b | S_i = s) = e_s(b).
3. Reminder: Most probable state path
Given an output sequence x = (x_1,...,x_L), a most probable path s = (s_1,...,s_L) is one which maximizes p(s | x).
4. Reminder: Viterbi's algorithm for the most probable path
[Figure: the HMM unrolled over positions 1,...,L: hidden states s_1, s_2,...,s_{L-1}, s_L emitting symbols x_1, x_2,...,x_{L-1}, x_L, preceded by the special initial state 0.]
We add the special initial state 0.
Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
For i = 1 to L, for each state l:
  v_l(i) = e_l(x_i) · max_k [v_k(i-1) a_kl]
  ptr_i(l) = argmax_k [v_k(i-1) a_kl]   (storing the previous state for reconstructing the path)
Termination: the probability of the most probable path is p(s_1,...,s_L, x_1,...,x_L) = max_k v_k(L); the path itself is recovered by tracing the ptr pointers back from argmax_k v_k(L).
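A minimal Python sketch of this recursion (the state/transition/emission containers states, a, e, a0 and their dictionary layout are illustrative assumptions, not notation from the slides; in practice the recursion is run in log space to avoid underflow):

def viterbi(x, states, a, e, a0):
    """Most probable state path for an emitted sequence x.
    a[k][l]: transition probability k -> l; e[k][b]: emission probability of b in state k;
    a0[k]: transition probability from the special initial state 0 to k."""
    v = [{k: a0[k] * e[k][x[0]] for k in states}]      # v_k(1)
    ptr = []
    for i in range(1, len(x)):
        vi, pi = {}, {}
        for l in states:
            # best predecessor k for state l at this position
            k_best = max(states, key=lambda k: v[-1][k] * a[k][l])
            pi[l] = k_best
            vi[l] = e[l][x[i]] * v[-1][k_best] * a[k_best][l]
        v.append(vi)
        ptr.append(pi)
    # termination: probability of the most probable path, then trace back via ptr
    s_last = max(states, key=lambda k: v[-1][k])
    p_best = v[-1][s_last]
    path = [s_last]
    for pi in reversed(ptr):
        path.append(pi[path[-1]])
    return list(reversed(path)), p_best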
5. Predicting CpG islands via the most probable path
Output symbols: A, C, G, T (4 letters). Markov chain states: four '+' states and four '-' states, two for each letter (8 states). The transition probabilities a_st and emission probabilities e_k(b) will be discussed soon. The most probable path found by Viterbi's algorithm predicts CpG islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few false negatives are found.
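As an illustration, the hypothetical viterbi sketch above could mark predicted CpG-island positions; the '+'/'-' state names and the variables dna, states, a, e, a0 are assumed to be defined as on this slide:

# dna: the input DNA string; states, a, e, a0: the 8-state model parameters (assumed given)
plus_states = {"A+", "C+", "G+", "T+"}
path, _ = viterbi(dna, states, a, e, a0)
predicted_island_positions = [i for i, s in enumerate(path) if s in plus_states]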
6. Reminder: finding the most probable state
f_l(i) = p(x_1,...,x_i, s_i = l): the probability of a path which emits (x_1,...,x_i) and in which s_i = l.
b_l(i) = p(x_{i+1},...,x_L | s_i = l): the probability of emitting (x_{i+1},...,x_L) given that s_i = l.
- The forward algorithm finds f_k(i) = p(x_1,...,x_i, s_i = k), k = 1,...,m.
- The backward algorithm finds b_k(i) = p(x_{i+1},...,x_L | s_i = k), k = 1,...,m.
- Return p(S_i = k | x) = f_k(i) b_k(i) / p(x), k = 1,...,m.
- To compute this for every i, simply run the forward and backward algorithms once, and compute f_k(i) b_k(i) for every i, k.
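A sketch of the forward and backward passes and of the posterior p(s_i = k | x), using the same hypothetical containers as the Viterbi sketch above (again, a real implementation would scale or work in log space):

def forward(x, states, a, e, a0):
    """f[i][k] = p(x_1..x_{i+1}, s_{i+1} = k), with 0-based list indices."""
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for xi in x[1:]:
        f.append({l: e[l][xi] * sum(f[-1][k] * a[k][l] for k in states) for l in states})
    return f

def backward(x, states, a, e):
    """b[i][k] = p(x_{i+2}..x_L | s_{i+1} = k); the last entry is all ones."""
    b = [{k: 1.0 for k in states}]
    for xi in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][xi] * b[0][l] for l in states) for k in states})
    return b

def posteriors(x, states, a, e, a0):
    """Return ([{k: p(s_i = k | x)} for each position i], p(x))."""
    f, b = forward(x, states, a, e, a0), backward(x, states, a, e)
    px = sum(f[-1][k] for k in states)                 # p(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))], px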
7. Finding the probability that a letter is in a CpG island via the algorithm for the most probable state
The probability that an occurrence of G is in a CpG island (a '+' state) is
  Σ_{s ∈ '+'} p(S_i = s | x) = Σ_{s ∈ '+'} f_s(i) b_s(i) / p(x),
where the summation is formally over the four '+' states, but actually only the state G+ needs to be considered (why?).
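Using the hypothetical posteriors helper above (and assuming x, states, a, e, a0 are defined), the probability that position i lies in a CpG island would be the posterior mass of the '+' states at i:

post, _ = posteriors(x, states, a, e, a0)
p_island_at_i = sum(post[i][s] for s in ("A+", "C+", "G+", "T+"))
# if x[i] == "G", only "G+" contributes, since the other '+' states cannot emit G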
8. Parameter Estimation for HMM
An HMM model is defined by the parameters a_kl and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
9. Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence which is assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ) (or p(x^j | θ, HMM)).
10. ML Parameter Estimation for HMM
The elements of the training set {x^1,...,x^n} are assumed to be independent, so p(x^1,...,x^n | θ) = ∏_j p(x^j | θ). ML parameter estimation looks for a θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.
11. Data for HMM
Possible properties of (the sequences in) the training set:
- For each x^j, what is our information on the states s_i? (The symbols x_i are usually known.)
- The size (number of sequences) of the training set.
12. Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x^1,...,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.
By the ML method, we look for the parameters θ* which maximize the probability of the sample set:
p(x^1,...,x^n | θ*) = max_θ p(x^1,...,x^n | θ).
13. Case 1: the ML method
For each x^j we have
p(x^j | θ) = ∏_{k,l} a_kl^{m_kl} · ∏_{k,b} e_k(b)^{m_k(b)},
where m_kl = #{i : s_{i-1} = k, s_i = l} (in x^j) and m_k(b) = #{i : s_i = k, x_i = b} (in x^j).
14. Case 1 (cont.)
By the independence of the x^j's, p(x^1,...,x^n | θ) = ∏_j p(x^j | θ).
Thus, if A_kl = #(transitions from k to l) in the training set and E_k(b) = #(emissions of symbol b from state k) in the training set, we have
p(x^1,...,x^n | θ) = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)}.
15. Case 1 (cont.)
So we need to find a_kl's and e_k(b)'s which maximize
F = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)},
subject to: for every state k, Σ_l a_kl = 1 with a_kl ≥ 0, and Σ_b e_k(b) = 1 with e_k(b) ≥ 0.
16. Case 1 (cont.)
Rewriting, we need to maximize
F = ∏_k [ ∏_l a_kl^{A_kl} ] · ∏_k [ ∏_b e_k(b)^{E_k(b)} ].
17. Case 1 (cont.)
If, for each state k, we maximize the factor ∏_l a_kl^{A_kl} (subject to Σ_l a_kl = 1) and the factor ∏_b e_k(b)^{E_k(b)} (subject to Σ_b e_k(b) = 1) separately, then we will also maximize F. Each of these is a simpler ML problem, which is treated next.
18. A simpler case: ML parameter estimation for a die
Let X be a random variable with 6 values x_1,...,x_6 denoting the six outcomes of a die. Here the parameters are θ = {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, with Σ_i θ_i = 1. Assume that the data is one sequence:
Data = (x_6, x_1, x_1, x_3, x_2, x_2, x_3, x_4, x_5, x_2, x_6).
So we have to maximize
p(Data | θ) = θ_6 θ_1 θ_1 θ_3 θ_2 θ_2 θ_3 θ_4 θ_5 θ_2 θ_6 = θ_1^2 θ_2^3 θ_3^2 θ_4 θ_5 θ_6^2,
subject to θ_1 + θ_2 + θ_3 + θ_4 + θ_5 + θ_6 = 1 and θ_i ≥ 0.
19. Side comment: Sufficient Statistics
- To compute the probability of the data in the die example we only need to record the number of times N_i the die fell on side i (namely, N_1, N_2,...,N_6).
- We do not need to recall the entire sequence of outcomes.
- {N_i, i = 1,...,6} is a sufficient statistic for multinomial sampling.
20. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(Data) is a sufficient statistic if, for any two datasets D and D',
  s(D) = s(D')  implies  P(D | θ) = P(D' | θ).
Exercise: define a sufficient statistic for the HMM model.
21. Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function instead, which is easier to do and gives an identical answer:
log p(Data | θ) = Σ_i N_i log θ_i.
22. Finding the Maximum
Setting the derivatives of the constrained log-likelihood to zero and rearranging terms gives N_i = λ θ_i for each i. Dividing the jth equation by the ith equation gives θ_j = θ_i N_j / N_i. Summing from j = 1 to 6 gives 1 = Σ_j θ_j = (θ_i / N_i) Σ_j N_j, hence θ_i = N_i / Σ_j N_j.
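A quick numeric check of this result on the die data of slide 18 (a sketch; the list encoding of the data is an assumption):

from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]    # the sequence from slide 18
counts = Counter(data)                       # the sufficient statistics N_1,...,N_6
theta_mle = {i: counts[i] / len(data) for i in range(1, 7)}
# gives theta_1 = 2/11, theta_2 = 3/11, theta_3 = 2/11, theta_4 = theta_5 = 1/11, theta_6 = 2/11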
23. Generalization: a distribution with any number k of outcomes
Let X be a random variable with k values x_1,...,x_k denoting the k outcomes of i.i.d. experiments, with parameters θ = {θ_1, θ_2,...,θ_k} (θ_i is the probability of x_i). Again, the data is one sequence of length n:
Data = (x_{i_1}, x_{i_2},...,x_{i_n}).
Then we have to maximize
p(Data | θ) = θ_1^{N_1} θ_2^{N_2} ··· θ_k^{N_k}, where N_i is the number of occurrences of x_i in the data,
subject to θ_1 + θ_2 + ... + θ_k = 1.
24. Generalization for k outcomes (cont.)
By a treatment identical to the die case, the maximum is obtained when, for all i, N_i = λ θ_i.
Hence the MLE is given by θ_i = N_i / Σ_j N_j = N_i / n.
25. Fractional Exponents
Some models allow N_i's which are not integers (e.g., when we are uncertain of a die outcome and consider it '6' with 20% confidence and '5' with 80%). We can still write
p(Data | θ) = θ_1^{N_1} θ_2^{N_2} ··· θ_k^{N_k},
and the same analysis yields θ_i = N_i / Σ_j N_j.
26. Applying the ML method to HMM
Let A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to maximize
F = ∏_{k,l} a_kl^{A_kl} · ∏_{k,b} e_k(b)^{E_k(b)},
subject to Σ_l a_kl = 1 and Σ_b e_k(b) = 1 for every state k.
27. Applying to HMM (cont.)
We apply the previous technique, for each state k, to the parameters {a_kl : l = 1,...,m} and {e_k(b) : b ∈ Σ}, which gives the optimal ML parameters:
a_kl = A_kl / Σ_{l'} A_{kl'},   e_k(b) = E_k(b) / Σ_{b'} E_k(b').
28. Adding pseudocounts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k. The estimates above then use A_kl + r_kl and E_k(b) + r_k(b) in place of the raw counts.
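A sketch of the count-and-normalize estimates of slides 26-28; the representation of the training set as (sequence, state-path) pairs and the pseudocount dictionaries are assumptions:

from collections import defaultdict

def ml_estimate(training_set, states, symbols, r_a=None, r_e=None):
    """Case 1 estimates: a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b').
    training_set: list of (x, s) pairs with fully known state paths.
    r_a[(k, l)], r_e[(k, b)]: optional pseudocounts r_kl and r_k(b).
    Assumes every state gets a nonzero total count (supply pseudocounts otherwise)."""
    A = defaultdict(float)          # A[(k, l)]: observed transitions k -> l
    E = defaultdict(float)          # E[(k, b)]: observed emissions of b from k
    for x, s in training_set:
        for i in range(1, len(x)):
            A[s[i - 1], s[i]] += 1
        for xi, si in zip(x, s):
            E[si, xi] += 1
    for key, r in (r_a or {}).items():      # add prior pseudocounts to the raw counts
        A[key] += r
    for key, r in (r_e or {}).items():
        E[key] += r
    a = {k: {l: A[k, l] / sum(A[k, m] for m in states) for l in states} for k in states}
    e = {k: {b: E[k, b] / sum(E[k, c] for c in symbols) for b in symbols} for k in states}
    return a, e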
29. Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x^1,...,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.
We just showed a method which finds the (unique) parameters θ* which maximize p(x^1,...,x^n | θ*) = max_θ p(x^1,...,x^n | θ).
30. Case 2: State paths are unknown
In this case only the values of the x_i's of the input sequences are known. This is an ML problem with missing data: we wish to find θ* so that p(x | θ*) = max_θ p(x | θ). For each sequence x, p(x | θ) = Σ_s p(x, s | θ), taken over all state paths s.
31. Case 2: State paths are unknown
So we need to maximize p(x | θ) = Σ_s p(x, s | θ), where the summation is over all state paths s which produce the output sequence x. Finding a θ which maximizes Σ_s p(x, s | θ) is hard, unlike finding a θ which maximizes p(x, s | θ) for a single pair (x, s), i.e., when the state path is known.
32. ML Parameter Estimation for HMM
The general process for finding θ in this case is:
- Start with an initial value of θ.
- Find θ' so that p(x^1,...,x^n | θ') > p(x^1,...,x^n | θ).
- Set θ = θ'.
- Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
34. Baum-Welch training
We start with some values of a_kl and e_k(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ' such that p(x | θ') > p(x | θ). Each iteration consists of a few steps.
35. Baum-Welch, step 1
Count the expected number of state transitions. For each sequence x^j, for each i and for each pair k, l, compute the posterior state-transition probabilities
P(s_{i-1} = k, s_i = l | x^j, θ).
36. Recall the forward algorithm
The task: compute f(h_i) = P(x_1,...,x_i, h_i) for i = 1,...,L (namely, considering the evidence up to time slot i).
Initial step: P(x_1, h_1) = P(h_1) P(x_1 | h_1).
Recall that the backward algorithm is similar: multiplying matrices from the end to the start and summing over the hidden states.
37. Baum-Welch training
Claim: P(x_1,...,x_L, s_{i-1} = k, s_i = l) = f_k(i-1) a_kl e_l(x_i) b_l(i).
38. Step 1: computing P(s_{i-1} = k, s_i = l | x^j, θ)
P(x_1,...,x_L, s_{i-1} = k, s_i = l)
  = P(x_1,...,x_{i-1}, s_{i-1} = k) · a_kl e_l(x_i) · P(x_{i+1},...,x_L | s_i = l)
  = f_k(i-1) a_kl e_l(x_i) b_l(i).
Dividing by p(x^j), which is obtained from the forward algorithm, gives
P(s_{i-1} = k, s_i = l | x^j, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / p(x^j).
39. Step 1 (end)
For each pair (k, l), compute the expected number of state transitions from k to l:
A_kl = Σ_j Σ_i P(s_{i-1} = k, s_i = l | x^j, θ).
40. Baum-Welch, step 2
For each state k and each symbol b, compute the expected number of emissions of b from k:
E_k(b) = Σ_j Σ_{i : x_i = b} P(s_i = k | x^j, θ).
41. Baum-Welch, step 3
Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b), as in Case 1. These values define θ'.
It can be shown that p(x^1,...,x^n | θ') > p(x^1,...,x^n | θ), i.e., θ' increases the probability of the data. This procedure is iterated until some convergence criterion is met.
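A sketch of one Baum-Welch iteration, built on the hypothetical forward/backward helpers above; production implementations use scaling or log space, and typically also re-estimate the initial-state probabilities:

def baum_welch_step(seqs, states, symbols, a, e, a0):
    """One iteration: expected counts A_kl and E_k(b) (steps 1-2), then re-estimation (step 3)."""
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {b: 0.0 for b in symbols} for k in states}
    for x in seqs:
        f = forward(x, states, a, e, a0)
        bwd = backward(x, states, a, e)
        px = sum(f[-1][k] for k in states)                 # p(x^j | theta)
        for i in range(1, len(x)):                         # step 1: expected transitions
            for k in states:
                for l in states:
                    A[k][l] += f[i - 1][k] * a[k][l] * e[l][x[i]] * bwd[i][l] / px
        for i in range(len(x)):                            # step 2: expected emissions
            for k in states:
                E[k][x[i]] += f[i][k] * bwd[i][k] / px
    # step 3: new parameters from the expected counts, exactly as in Case 1
    a_new = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e_new = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a_new, e_new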
42. Case 2 (state paths are unknown): Viterbi training
We again start from given values of a_kl and e_k(b), which define prior values of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize
p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n),
where s*(x^j) is the most probable (under θ) path for x^j.
43. Case 2 (state paths are unknown): Viterbi training
Each iteration:
1. Find the set {s*(x^j)} of most probable paths, which maximize p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n).
2. Find the θ' which maximizes p(s*(x^1),...,s*(x^n) | θ', x^1,...,x^n).
   (Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ'.)
3. Set θ = θ' and repeat. Stop when the paths are not changed.
44. Case 2 (state paths are unknown): Viterbi training
p(s*(x^1),...,s*(x^n) | θ, x^1,...,x^n) can be expressed in closed form (since we are using a single path for each x^j), so this time convergence is achieved when the optimal paths are not changed any more.
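A sketch of the Viterbi-training loop, reusing the hypothetical viterbi and ml_estimate helpers from the earlier blocks; pseudocounts can be passed to ml_estimate if some state never appears in the decoded paths:

def viterbi_training(seqs, states, symbols, a, e, a0):
    paths = None
    while True:
        # 1. most probable path for each sequence under the current parameters
        new_paths = [viterbi(x, states, a, e, a0)[0] for x in seqs]
        if new_paths == paths:          # 3. stop when the decoded paths no longer change
            return a, e
        paths = new_paths
        # 2. re-estimate the parameters from the (sequence, decoded path) pairs, as in Case 1
        a, e = ml_estimate(list(zip(seqs, paths)), states, symbols)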