Title: Hidden Markov Models
1. Hidden Markov Models
- Richard Golden
- (following the approach of Chapter 9 of Manning and Schütze, 2000)
- REVISION DATE: April 15 (Tuesday), 2003
2. VMM (Visible Markov Model)
3. HMM Notation
- State Sequence Variables: X1, ..., X_{T+1}
- Output Sequence Variables: O1, ..., OT
- Set of Hidden States: {S1, ..., SN}
- Output Alphabet: {K1, ..., KM}
- Initial State Probabilities (π1, ..., πN): πi = p(X1 = Si), i = 1, ..., N
- State Transition Probabilities (a_ij), i, j = 1, ..., N: a_ij = p(X_{t+1} = Sj | Xt = Si), t = 1, ..., T
- Emission Probabilities (b_ij), i = 1, ..., N, j = 1, ..., M: b_ij = p(Ot = Kj | Xt = Si), t = 1, ..., T
4. HMM State-Emission Representation
[State-emission diagram: hidden states S1, S2 and output symbols K1, K2, K3, with π1 = 1, π2 = 0; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5; b11 = 0.6, b12 = 0.1, b13 = 0.3, b21 = 0.1, b22 = 0.7, b23 = 0.2.]
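The parameter values in this diagram are used in every worked example that follows, so it helps to have them in one place. A minimal sketch in Python/NumPy (the names pi, A, and B are mine, not from the slides):

    import numpy as np

    # Initial state probabilities: pi[i] = p(X1 = S(i+1))
    pi = np.array([1.0, 0.0])

    # Transition probabilities: A[i, j] = p(X(t+1) = S(j+1) | Xt = S(i+1))
    A = np.array([[0.7, 0.3],
                  [0.5, 0.5]])

    # Emission probabilities: B[i, k] = p(Ot = K(k+1) | Xt = S(i+1))
    B = np.array([[0.6, 0.1, 0.3],
                  [0.1, 0.7, 0.2]])

    # Each row of A and B is a probability distribution.
    assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)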
5. Arc-Emission Representation
- Note that sometimes a Hidden Markov Model is represented by having the emission arrows come off the arcs.
- In this situation you would have a lot more emission arrows, because there are a lot more arcs.
- But the transition and emission probabilities are the same; it just takes longer to draw on your PowerPoint presentation (self-conscious presentation).
6. Fundamental Questions for HMMs
- MODEL FIT
- How can we compute the likelihood of observations and hidden states given known emission and transition probabilities? Compute p(Dog/NOUN, is/VERB, Good/ADJ | a_ij, b_km)
- How can we compute the likelihood of observations given known emission and transition probabilities? Compute p(Dog, is, Good | a_ij, b_km)
7. Fundamental Questions for HMMs
- INFERENCE
- How can we infer the sequence of hidden states given the observations and the known emission and transition probabilities?
- Maximize p(Dog/?, is/?, Good/? | a_ij, b_km) with respect to the unknown labels
8. Fundamental Questions for HMMs
- LEARNING
- How can we estimate the emission and transition probabilities given observations, assuming that the hidden states are observable during the learning process?
- How can we estimate the emission and transition probabilities given observations only?
9. Direct Calculation of Model Fit (note use of Markov Assumptions), Part 1
Follows directly from the definition of a conditional probability: p(o, x) = p(o | x) p(x)
EXAMPLE: p(Dog/NOUN, is/VERB, Good/ADJ | a_ij, b_ij)
= p(Dog, is, Good | NOUN, VERB, ADJ, a_ij, b_ij) × p(NOUN, VERB, ADJ | a_ij, b_ij)
10. Direct Calculation of Likelihood of Labeled Observations (note use of Markov Assumptions), Part 2
EXAMPLE: Compute p(Dog/NOUN, is/VERB, Good/ADJ | a_ij, b_km)
11. Graphical Algorithm Representation of Direct Calculation of Likelihood of Observations and Hidden States (not hard!)
Note that "Good" is the name of the dog, so it is a NOUN!
The likelihood of a particular labeled sequence of observations (e.g., p(Dog/NOUN, is/VERB, Good/NOUN | a_ij, b_km)) may be computed using the direct calculation method with the following simple graphical algorithm. Specifically,
p(K3/S1, K2/S2, K1/S1 | a_ij, b_km) = π1 b13 a12 b22 a21 b11
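As a quick numerical check of this product, using the parameter values from slide 4 (the variable names are mine):

    pi1 = 1.0
    a11, a12, a21, a22 = 0.7, 0.3, 0.5, 0.5
    b11, b12, b13 = 0.6, 0.1, 0.3
    b21, b22, b23 = 0.1, 0.7, 0.2

    # start in S1, emit K3, jump to S2, emit K2, jump to S1, emit K1
    print(pi1 * b13 * a12 * b22 * a21 * b11)   # 0.0189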
12. Extension to the case where the likelihood of the observations given the parameters is needed (e.g., p(Dog, is, Good | a_ij, b_ij))
KILLER EQUATION!!!!! p(Dog, is, Good | a_ij, b_ij) is the sum of the labeled-sequence likelihoods p(Dog, is, Good, labeling | a_ij, b_ij) over every possible labeling of the three words.
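The "killer equation" is this sum over every possible state sequence. A brute-force sketch of the direct calculation for the 3-word example used in these slides, assuming the parameters from slide 4 and the emit-and-jump ordering of slide 11 (function and variable names are mine):

    from itertools import product
    import numpy as np

    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.5, 0.5]])
    B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
    obs = [2, 1, 0]                       # the sequence K3, K2, K1 as 0-based symbol indices

    def labeled_likelihood(labels, obs):
        """p(observations, labels): start, then alternately emit and jump, as on slide 11."""
        prob = pi[labels[0]]
        for t, o in enumerate(obs):
            prob *= B[labels[t], o]
            if t + 1 < len(labels):
                prob *= A[labels[t], labels[t + 1]]
        return prob

    # Sum over all N**T labelings -- the expensive direct calculation.
    total = sum(labeled_likelihood(x, obs) for x in product(range(2), repeat=len(obs)))
    print(total)    # 0.0315, which matches the forward/backward results on later slides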
13. Efficiency of Calculations is Important (e.g., Model Fit)
- Assume 1 multiplication per microsecond.
- Assume an N = 1000 word vocabulary and a T = 7 word sentence.
- (2T+1)·N^(T+1) multiplications by direct calculation: (2·7+1)·1000^(7+1) multiplications is about 475,000 million years of computer time!!!
- 2N²T multiplications using the forward method is about 14 seconds of computer time!!!
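A back-of-the-envelope check of these two figures, using the operation counts quoted above and one multiplication per microsecond:

    N, T = 1000, 7
    microseconds_per_year = 1e6 * 60 * 60 * 24 * 365

    direct = (2 * T + 1) * N ** (T + 1)      # direct calculation: 1.5e25 multiplications
    forward = 2 * N ** 2 * T                 # forward method: 14 million multiplications

    print(direct / microseconds_per_year)    # roughly 4.8e11 years, i.e. about 475,000 million years
    print(forward / 1e6)                     # 14.0 seconds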
14. Forward, Backward, and Viterbi Calculations
- Forward calculation methods are thus very useful.
- Forward, Backward, and Viterbi Calculations will
now be discussed.
15. Forward Calculations Overview
[Trellis diagram: hidden states S1 and S2 unrolled over times 2, 3, and 4, with output symbols K1, K2, K3 at each time and the transition/emission probabilities of the example HMM from slide 4.]
16. Forward Calculations: Time 2 (1-word example)
NOTE that α1(2) + α2(2) is the likelihood of the observation/word K3 in this 1-word example.
[Trellis diagram restricted to time 2, with the relevant transition and emission probabilities.]
17. Forward Calculations: Time 3 (2-word example)
[Trellis diagram through time 3, highlighting α1(3).]
18. Forward Calculations: Time 4 (3-word example)
[Trellis diagram through time 4.]
19. Forward Calculation of Likelihood Function (emit and jump)
- t = 1 (0 words): α1(1) = π1 = 1.0; α2(1) = π2 = 0.0; L(1) = α1(1) + α2(1) = 1.0
- t = 2 (1 word): α1(2) = α1(1) a11 b13 + α2(1) a21 b23 = 0.21; α2(2) = α1(1) a12 b13 + α2(1) a22 b23 = 0.09; L(2) = α1(2) + α2(2) = 0.3
- t = 3 (2 words): α1(3) = α1(2) a11 b12 + α2(2) a21 b22 = 0.0462; α2(3) = α1(2) a12 b12 + α2(2) a22 b22 = 0.0378; L(3) = α1(3) + α2(3) = 0.084
- t = 4 (3 words): α1(4) = α1(3) a11 b11 + α2(3) a21 b21 = 0.021294; α2(4) = α1(3) a12 b11 + α2(3) a22 b21 = 0.010206; L(4) = α1(4) + α2(4) = 0.0315
Here L(t) = p(o1, ..., o(t-1)) is the likelihood of the first t-1 words of the observation sequence K3, K2, K1.
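A minimal emit-and-jump forward pass that reproduces this table, assuming the example HMM of slide 4 (the function and array names are mine):

    import numpy as np

    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.5, 0.5]])
    B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
    obs = [2, 1, 0]                      # K3, K2, K1 as 0-based symbol indices

    def forward(pi, A, B, obs):
        """Row t holds alpha(t+1): alpha[t, i] = p(o1, ..., ot, X(t+1) = S(i+1))."""
        alpha = np.zeros((len(obs) + 1, len(pi)))
        alpha[0] = pi
        for t, o in enumerate(obs):
            alpha[t + 1] = (alpha[t] * B[:, o]) @ A   # emit o from each state, then jump
        return alpha

    alpha = forward(pi, A, B, obs)
    print(alpha)            # [1, 0], [0.21, 0.09], [0.0462, 0.0378], [0.021294, 0.010206]
    print(alpha[-1].sum())  # 0.0315 = L(4), the likelihood of the 3-word sequence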
20. Backward Calculations Overview
[Trellis diagram: same layout and probabilities as the forward overview on slide 15.]
21. Backward Calculations: Time 4
[Trellis diagram restricted to time 4, showing emissions b11 = 0.6 and b21 = 0.1.]
22. Backward Calculations: Time 3
[Trellis diagram restricted to time 3, showing emissions b11 = 0.6 and b21 = 0.1.]
23. Backward Calculations: Time 2
NOTE that β1(2) + β2(2) is the likelihood of the observation/word sequence K2, K1 in this 2-word example.
[Trellis diagram for times 2 through 4.]
24. Backward Calculations: Time 1
[Trellis diagram over all time steps.]
25. Backward Calculation of Likelihood Function (EMIT AND JUMP)
- t = 1: β1(1) = b13 [a11 β1(2) + a12 β2(2)] = 0.0315; β2(1) = b23 [a21 β1(2) + a22 β2(2)] = 0.029; L(1) = π1 β1(1) + π2 β2(1) = 0.0315
- t = 2: β1(2) = b12 [a11 β1(3) + a12 β2(3)] = 0.045; β2(2) = b22 [a21 β1(3) + a22 β2(3)] = 0.245; L(2) = β1(2) + β2(2) = 0.290
- t = 3: β1(3) = b11 [a11 β1(4) + a12 β2(4)] = 0.6; β2(3) = b21 [a21 β1(4) + a22 β2(4)] = 0.1; L(3) = β1(3) + β2(3) = 0.7
- t = 4: β1(4) = 1; β2(4) = 1; L(4) = 1
The β values are computed from right (t = 4) to left (t = 1); L(t) = p(o_t, ..., oT).
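A matching emit-and-jump backward pass, in the same hypothetical setup as the forward sketch on slide 19:

    import numpy as np

    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.5, 0.5]])
    B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
    obs = [2, 1, 0]                      # K3, K2, K1

    def backward(A, B, obs):
        """Row t holds beta(t+1): beta[t, i] = p(o(t+1), ..., oT | X(t+1) = S(i+1))."""
        beta = np.ones((len(obs) + 1, A.shape[0]))
        for t in range(len(obs) - 1, -1, -1):
            beta[t] = B[:, obs[t]] * (A @ beta[t + 1])   # emit o(t+1) from each state, then jump
        return beta

    beta = backward(A, B, obs)
    print(beta)             # [0.0315, 0.029], [0.045, 0.245], [0.6, 0.1], [1, 1]
    print(pi @ beta[0])     # 0.0315 again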
26. You get the same answer going forward or backward!!
- Backward (slide 25): L = π1 β1(1) + π2 β2(1) = 0.0315
- Forward (slide 19): L = α1(4) + α2(4) = 0.0315
27. The Forward-Backward Method
- Note the forward method computes α_i(t) = p(o1, ..., o(t-1), Xt = Si).
- Note the backward method computes β_i(t) = p(o_t, ..., oT | Xt = Si) (for t > 1).
- We can do the forward-backward method, which computes p(K1, ..., KT) using the formula p(o1, ..., oT) = Σ_i α_i(t) β_i(t) (using any choice of t = 1, ..., T+1!).
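A quick numerical check of this identity, plugging in the α and β values from the tables on slides 19 and 25:

    import numpy as np

    # rows are times t = 1, 2, 3, 4
    alpha = np.array([[1.0, 0.0], [0.21, 0.09], [0.0462, 0.0378], [0.021294, 0.010206]])
    beta  = np.array([[0.0315, 0.029], [0.045, 0.245], [0.6, 0.1], [1.0, 1.0]])

    # sum_i alpha_i(t) * beta_i(t) gives the same likelihood for every choice of t
    print((alpha * beta).sum(axis=1))   # [0.0315, 0.0315, 0.0315, 0.0315]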
28. Example Forward-Backward Calculation!
Combining the forward table (slide 19) with the backward table (slide 25), Σ_i α_i(t) β_i(t) = 0.0315 for every choice of t. For example:
- t = 2: α1(2) β1(2) + α2(2) β2(2) = 0.21 × 0.045 + 0.09 × 0.245 = 0.0315
- t = 3: α1(3) β1(3) + α2(3) β2(3) = 0.0462 × 0.6 + 0.0378 × 0.1 = 0.0315
29. Solution to Problem 1
- The hard part of the 1st problem was to find the likelihood of the observations for an HMM.
- We can now do this using either the forward, backward, or forward-backward method.
30. Solution to Problem 2: Viterbi Algorithm (Computing Most Probable Labeling)
- Consider direct calculation of labeled observations.
- Previously we summed these likelihoods together across all possible labelings to solve the first problem, which was to compute the likelihood of the observations given the parameters (the hard part of HMM Question 1!).
- We solved this problem using the forward or backward method.
- Now we want to compute all possible labelings and their respective likelihoods and pick the labeling which is the largest!
EXAMPLE: Compute p(Dog/NOUN, is/VERB, Good/ADJ | a_ij, b_km)
31. Efficiency of Calculations is Important (e.g., Most Likely Labeling Problem)
- Just as in the forward-backward calculations, we can solve the problem of computing the likelihood of every possible one of the N^T labelings efficiently.
- Instead of millions of years of computing time we can solve the problem in several seconds!!
32. Viterbi Algorithm Overview (same setup as forward algorithm)
[Trellis diagram: same layout and probabilities as the forward overview on slide 15.]
33. Forward Calculations: Time 2 (1-word example)
[Trellis diagram restricted to time 2, with π1 = 1 and π2 = 0.]
34. Backtracking: Time 2 (1-word example)
[Same trellis as slide 33, with the back-pointers used for backtracking.]
35. Forward Calculations (2-word example)
[Trellis diagram through time 3.]
36. BACKTRACKING (2-word example)
[Trellis diagram through time 3, with back-pointers marked.]
37. Formal Analysis of 2-word case
38. Forward Calculations: Time 4 (3-word example)
[Trellis diagram through time 4.]
39. Backtracking to Obtain Labeling for 3-word case
[Trellis diagram through time 4, with back-pointers marked.]
40. Formal Analysis of 3-word case
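The procedure pictured on slides 32-39 is a forward calculation with max in place of sum, followed by backtracking through the stored back-pointers. A minimal sketch for the 3-word example, assuming the slide-4 parameters (the names viterbi, delta, and psi are mine):

    import numpy as np

    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.5, 0.5]])
    B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
    obs = [2, 1, 0]                                  # K3, K2, K1

    def viterbi(pi, A, B, obs):
        """Most probable labeling: the forward recursion with max in place of sum."""
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))                     # best score of any labeling ending in state j
        psi = np.zeros((T, N), dtype=int)            # back-pointer: best predecessor of state j
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A       # scores[i, j]: come from state i, jump to j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta[-1].argmax())]             # backtracking, as on slides 34, 36, and 39
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())

    print(viterbi(pi, A, B, obs))   # ([0, 1, 0], 0.0189): S1, S2, S1, the product computed on slide 11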
41. Third Fundamental Question: Parameter Estimation
- Make an initial guess for a_ij and b_km.
- Compute the probability that one hidden state follows another, given a_ij and b_km and the sequence of observations (computed using the forward-backward algorithm).
- Compute the probability of an observed output symbol given a hidden state, given a_ij and b_km and the sequence of observations (computed using the forward-backward algorithm).
- Use these computed probabilities to make an improved guess for a_ij and b_km.
- Repeat this process until convergence.
- It can be shown that this algorithm does in fact converge to the correct choice for a_ij and b_km, assuming that the initial guess was close enough. (A sketch of one re-estimation pass is given below.)
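A sketch of one pass of this re-estimation procedure (the Baum-Welch / EM update) for a single observation sequence. It uses the common textbook convention in which α includes the emission at the current time; the function and variable names are mine, and this is an illustration rather than the slides' own derivation:

    import numpy as np

    def baum_welch_step(pi, A, B, obs):
        """One EM re-estimation pass: forward-backward, then improved guesses for pi, A, B."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N)); beta = np.ones((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()

        # gamma[t, i]: probability of state i at time t; xi[t, i, j]: probability of the jump i -> j
        gamma = alpha * beta / likelihood
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for k in range(B.shape[1]):
            new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
        new_B /= gamma.sum(axis=0)[:, None]
        return new_pi, new_A, new_B

    # One "guess -> improved guess" step on the 3-word sequence K3, K2, K1, starting from a flat guess.
    obs = [2, 1, 0]
    pi0, A0, B0 = np.array([0.5, 0.5]), np.full((2, 2), 0.5), np.full((2, 3), 1.0 / 3.0)
    print(baum_welch_step(pi0, A0, B0, obs))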