Title: Markov Models
1. Markov Models
- Like the Bayesian network, a Markov model is a graph composed of
  - states that represent the state of a process
  - edges that indicate how to move from one state to another, where each edge is annotated with a probability indicating the likelihood of taking that transition
- Unlike the Bayesian network, the Markov model's nodes are meant to convey temporal states, so a transition from state 1 to state 2 means that at time 1 you are in state 1 and at time 2 you have moved on to state 2
- An ordinary Markov model contains states that are observable, so the transition probabilities are the only mechanism that determines the state transitions
- We will find a more useful version of the Markov model to be the hidden Markov model, covered in a few slides
2. A Markov Model
- In the Markov model, we move from state to state based on simple probabilities
  - going from S3 to S2 has likelihood a32
  - going from S3 to S3 has likelihood a33
  - going from S3 to S4 has likelihood a34
  - likelihoods are usually estimated statistically
- We will use our Markov model to compute the likelihoods of a number of state transitions that might be of interest. For instance, if we start in S1, what is the probability of going from S1 to S2 to S3 to S4 to S5 and back to S1? What is the probability of going from S1 to S1 to S1 to S1 to S2? Etc.
3. Example: Weather Forecasting
- On any day, it will be either rainy/snowy, cloudy, or sunny
- We have the following probability matrix denoting, given any particular day's weather, what the weather will be like tomorrow
  - the probability that tomorrow will be sunny, given today is sunny, is 0.8
  - the probability that tomorrow will be rainy/snowy, given today is cloudy, is 0.2
- To compute the probability of a sequence, we multiply the transition probabilities together: if today is sunny, the probability that the next two days will be sunny is 0.8 × 0.8 = 0.64, and the probability that the next three days will be cloudy is 0.1 × 0.6 × 0.6 = 0.036
            R/S    Cloudy   Sunny
R/S         .4     .3       .3
Cloudy      .2     .6       .2
Sunny       .1     .1       .8

(rows: today's weather; columns: tomorrow's weather)
4. Continued
- Let's assume today is cloudy and find the most likely sequence of three days (for simplicity, assume rainy really means rainy or snowy)
- There are 9 such sequences:
  - cloudy, cloudy, cloudy: .6 × .6 = .36
  - cloudy, cloudy, rainy: .6 × .2 = .12
  - cloudy, cloudy, sunny: .6 × .2 = .12
  - cloudy, rainy, cloudy: .2 × .3 = .06
  - cloudy, rainy, rainy: .2 × .4 = .08
  - cloudy, rainy, sunny: .2 × .3 = .06
  - cloudy, sunny, cloudy: .2 × .1 = .02
  - cloudy, sunny, rainy: .2 × .1 = .02
  - cloudy, sunny, sunny: .2 × .8 = .16
- So the most likely sequence is three cloudy days in a row, because today is cloudy
- But what if we didn't know what today would be?
5. Enhanced Example
- Let's assume the probability of the first day being cloudy is .5, rainy .2, and sunny .3
  - these are our prior probabilities
- Since we do not know that the first day is cloudy, we now have 27 possible combinations
  - CCC, CCR, CCS, CRC, CRR, CRS, CSC, CSR, CSS, RCC, RCR, RCS, RRC, RRR, RRS, RSC, RSR, RSS, SCC, SCR, SCS, SRC, SRR, SRS, SSC, SSR, SSS
- The most likely sequence is now SSS: .3 × .8 × .8 = .192, even though cloudy is the most likely first day (the probability for CCC is .5 × .6 × .6 = .18)
- So, as with a Bayesian network, we have prior probabilities and multiply them by our conditional probabilities, which here are known as transition probabilities
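A small sketch of this brute-force computation in Python, using the priors above and the transition table from slide 3 (the dictionary names are my own):

```python
from itertools import product

# transition matrix from slide 3: trans[today][tomorrow]
trans = {
    "rainy":  {"rainy": .4, "cloudy": .3, "sunny": .3},
    "cloudy": {"rainy": .2, "cloudy": .6, "sunny": .2},
    "sunny":  {"rainy": .1, "cloudy": .1, "sunny": .8},
}
prior = {"cloudy": .5, "rainy": .2, "sunny": .3}   # prior probabilities for day 1

# enumerate all 27 three-day sequences and keep the most probable one
best = max(
    product(prior, repeat=3),
    key=lambda s: prior[s[0]] * trans[s[0]][s[1]] * trans[s[1]][s[2]],
)
p = prior[best[0]] * trans[best[0]][best[1]] * trans[best[1]][best[2]]
print(best, p)   # ('sunny', 'sunny', 'sunny') 0.192
```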
6. HMM
- Most interesting AI problems cannot be solved by a plain Markov model because there are unknown states in our real-world problems
  - in speech recognition, we can build a Markov model to predict the next word in an utterance by using the probabilities of how often any given word follows another
  - how often does "lamb" follow "little"?
- A hidden Markov model (HMM) is a Markov model where the probabilities are actually probabilistic functions based in part on the current state, which is hidden (unknown or unobservable)
  - determining which transition to take requires more knowledge than merely the state transition probabilities
7. Example: Speech Recognition
- We have observations: the acoustic signal
- But hidden from us is the intention that created the signal
  - for instance, at time t1 we know what the signal looks like in terms of data, but we don't know what the intended sound was (the phoneme or letter or word)
- The goal in speech recognition is to identify the actual utterance (in terms of phonetic units or words)
  - but the phonemes/words are hidden from us
- We add to our model hidden (unobservable) states and appropriate probabilities for transitions
  - the observables are not states in our network, but are attached to the transition links
  - the hidden states are the elements of the utterance (e.g., phonemes), which are what we are trying to identify
  - we must search the HMM to determine what hidden state sequence best represents the input utterance
8. Example HMM
- Here, x1, x2, and x3 are the hidden states
- y1, y2, y3, y4 are the observations
- The aij are the transition probabilities of moving from state i to state j
- The bij make up the output probabilities from hidden node i to observation j; that is, what is the probability of seeing output yj given that we are in state xi?
- Three problems are associated with HMMs:
  1. Given an HMM, compute the probability of a given output sequence
  2. Given an HMM and an output sequence, compute the most likely state transitions
  3. Given an HMM and an output sequence, compute the transition probabilities
9. Formal Definition of an HMM
- The HMM is a graph G = (V, E)
  - V is the set of vertices (nodes, states)
  - E is the set of directed edges, the transitions between pairs of nodes
- The HMM must have three sets of probabilities:
  - each node in V that can be the first state of a sequence has a prior probability (we can denote nodes that cannot be the first state as having prior probability 0)
  - for each state transition (edge in E), we need a transition probability
  - for each node that has an associated observation, we need an output probability
- Commonly an HMM will represent some k distinct time periods, where the states at time i are completely connected to the states at times i-1 and i+1 (although not always)
- So, if there are n states and o possible observations at any time, there would be n prior probabilities, n²(k-1) transition probabilities, and n·o output probabilities
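As a concrete illustration, in the common time-homogeneous case (one transition matrix shared across all k-1 steps) the three sets of probabilities can be stored as arrays; a sketch, with shapes and names of my own choosing:

```python
import numpy as np

n, o = 3, 4                       # n hidden states, o possible observations
prior = np.full(n, 1.0 / n)       # n prior probabilities
trans = np.full((n, n), 1.0 / n)  # n x n transition probabilities, trans[i, j] = aij
emit  = np.full((n, o), 1.0 / o)  # n x o output probabilities, emit[i, y] = bi(y)
```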
10. Some Sample HMMs
11. HMM Problem 1
- As stated previously, there are three problems that we can solve with our HMM
- Problem 1: given an HMM and an output sequence, compute the probability of generating that particular output sequence (e.g., what is the likelihood of seeing this particular sequence of observations?)
- We have an observation sequence O = O1 O2 O3 … Ok
- Recall that we have three types of probabilities: prior probabilities, transition probabilities, and output probabilities
- We generate every possible sequence of hidden states s1 … sk through the HMM from 1 to k and compute
  - p(s1) · bs1(O1) · as1s2 · bs2(O2) · as2s3 · bs3(O3) · … · ask-1sk · bsk(Ok)
  - where p is the prior probability, a is the transition probability, and b is the output probability
- Since there are a number of sequences through the HMM, we compute the above probability for each sequence and sum them up
12. Brief Example
We have 3 time units t1, t2, t3, and each has 2 states, s1 and s2, with p(s1 at t1) = .8 and p(s2 at t1) = .2. There are 3 possible outputs: A, B, C.

Our transition probabilities are p(s1→s1) = .7, p(s1→s2) = .3 and p(s2→s2) = .6, p(s2→s1) = .4.

Our output probabilities are p(A|s1) = .5, p(B|s1) = .4, p(C|s1) = .1 and p(A|s2) = .7, p(B|s2) = .3, p(C|s2) = 0.

What is the probability of generating A, B, C? The possible hidden state sequences are:

s1 s1 s1: .8 × .5 × .7 × .4 × .7 × .1 = 0.00784
s1 s1 s2: .8 × .5 × .7 × .4 × .3 × 0 = 0.0
s1 s2 s1: .8 × .5 × .3 × .3 × .4 × .1 = 0.00144
s1 s2 s2: .8 × .5 × .3 × .3 × .6 × 0 = 0.0
s2 s1 s1: .2 × .7 × .4 × .4 × .7 × .1 = 0.001568
s2 s1 s2: .2 × .7 × .4 × .4 × .3 × 0 = 0.0
s2 s2 s1: .2 × .7 × .6 × .3 × .4 × .1 = 0.001008
s2 s2 s2: .2 × .7 × .6 × .3 × .6 × 0 = 0.0

The likelihood of the sequence A, B, C is 0.00784 + 0.00144 + 0.001568 + 0.001008 = 0.011856.
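A short sketch that reproduces this enumeration (it prints the same total):

```python
from itertools import product

prior = {"s1": .8, "s2": .2}
trans = {("s1", "s1"): .7, ("s1", "s2"): .3,
         ("s2", "s1"): .4, ("s2", "s2"): .6}
emit = {"s1": {"A": .5, "B": .4, "C": .1},
        "s2": {"A": .7, "B": .3, "C": .0}}
obs = ["A", "B", "C"]

total = 0.0
for seq in product(["s1", "s2"], repeat=len(obs)):
    p = prior[seq[0]] * emit[seq[0]][obs[0]]      # prior * output for time 1
    for t in range(1, len(obs)):
        p *= trans[(seq[t-1], seq[t])] * emit[seq[t]][obs[t]]
    total += p
print(total)   # 0.011856
```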
13. More Efficient Solution
- You might notice that there is a lot of repetition in our computation from the last slide
- In fact, there are n^k possible sequences, so the brute-force computation is O(k·n^k)
- When we compute s2 s2 s2, we had already computed s1 s2 s2, so the last half of the computation was already done
- By using dynamic programming, we can reduce the number of computations
  - this is particularly relevant when the sequence is far longer than 3 states and has far more states per time unit than 2
- We use a dynamic programming algorithm called the Forward algorithm (see the next slide)
- Even though we have a reasonably efficient means of solving problem 1, there is little need to solve this problem by itself!
14. The Forward Algorithm
- We solve the problem in three steps
- The initialization step sets the probabilities of starting at each initial state at time 1:
  - α1(i) = p(i)·bi(O1) for all states i
  - that is, the probability of starting at some state i is the prior probability of i times the output probability of seeing observation O1 from state i
- The main step is recursive, for all times after 1:
  - αt+1(j) = [Σi αt(i)·aij] · bj(Ot+1) for all states j at time t+1
  - that is, at time t+1 the probability of being at state j is the sum, over all states i at time t leading to state j, of αt(i)·aij, times the output probability of seeing Ot+1 at time t+1
- The final step is to sum up the probabilities of ending in each of the states at the final time n (sum αn(j) over all states j)
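A minimal numpy sketch of these three steps (array shapes and names are my own):

```python
import numpy as np

def forward(prior, trans, emit, obs):
    """p(observation sequence) by the forward algorithm, in O(n^2 * k) time.

    prior: (n,) prior probabilities; trans: (n, n), trans[i, j] = aij;
    emit: (n, o), emit[i, y] = bi(y); obs: list of observation indices.
    """
    alpha = prior * emit[:, obs[0]]           # initialization: alpha_1(i)
    for y in obs[1:]:                         # recursion over times 2..k
        alpha = (alpha @ trans) * emit[:, y]  # alpha_{t+1}(j) = sum_i alpha_t(i) aij * bj(y)
    return alpha.sum()                        # termination: sum over final states
```

Running it on the slide-12 numbers (prior [.8, .2], transitions [[.7, .3], [.4, .6]], outputs [[.5, .4, .1], [.7, .3, 0]], and obs [0, 1, 2] for A, B, C) returns the same 0.011856 as the enumeration.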
15. HMM Problem 2
- Given a sequence of observations, compute the optimal sequence of state transitions that would cause those observations
- Alternatively, we could say that the optimal sequence best explains the observations
- We need to define what we mean by optimal:
  - the sequence that contains the most individual states with the highest likelihoods?
    - this sequence would contain the most states that appear to be correct states; notice that this solution does not take transitions into account
  - the sequence that contains the largest number of correct pairs of states?
    - this would take transitions into account
  - or the largest number of correct triples, correct quadruples, etc.?
  - the sequence that is the most likely (probable) overall?
16. The Viterbi Algorithm
- We do not know which of the sequences generated for problem 1 is actually the best path; we didn't keep track of that
- But through recursion and dynamic programming, we did keep track of portions of paths
- So we will again use recursion
- The recursive step works like this:
  - let's assume that at some time t, we know the best paths to all states
  - at time t+1, we extend each of the best paths to time t by finding the best transition from a state at time t to each state at t+1
    - that is, for each state at time t+1 we have to find the combination of best path to time t plus transition to t+1 that is best
  - we not only compute the new probability, but remember the path to this point
17. Viterbi Formally Described
- Initialization step
  - δ1(i) = p(i)·bi(O1), the same as in the forward algorithm
  - ψ1(i) = 0; this array will record the state that maximized the path leading to each state
- The recursive step
  - δt+1(j) = [maxi δt(i)·aij] · bj(Ot+1); here we look at all of the previous states i at time t, compute the state transition from t to t+1 that gives us the maximum value of δt(i)·aij, and multiply that by the likelihood of this state being true given this time unit's observation (see the next slide for a visual representation)
  - ψt+1(j) = argmaxi δt(i)·aij; which i from the possible preceding states led to the maximum value? Store that
18. Continued
- Termination step
  - p* = maxi δn(i); the probability of the selected path is the largest probability found in the final time step of the last recursive call
  - q* = argmaxi δn(i); this is the last state reached on that path
- Path backtracking
  - now that we have found the best path, we backtrack using the array ψ, starting at ψn(q*), until we reach time unit 1

At time t-1, we know the best paths to reach each of the states. Now at time t, we look at each state si and try to extend the best paths from t-1 to t.
19. Viterbi in Pseudocode
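A minimal Python sketch of the algorithm, following the formal description on slides 17 and 18 (dictionary-based, with names of my own choosing):

```python
def viterbi(prior, trans, emit, obs):
    """Most likely hidden state sequence for obs.

    prior[i], trans[i][j] and emit[i][y] hold the p, a and b probabilities.
    Returns (best path, its probability).
    """
    states = list(prior)
    # initialization: delta_1(i) = p(i) * bi(O1); no predecessors yet
    delta = {i: prior[i] * emit[i][obs[0]] for i in states}
    psi = []                              # psi[t][j] = best predecessor of j at time t+1
    for y in obs[1:]:                     # recursive step
        back, new_delta = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[i] * trans[i][j])
            back[j] = best
            new_delta[j] = delta[best] * trans[best][j] * emit[j][y]
        psi.append(back)
        delta = new_delta
    # termination: q* = argmax delta_n(i), then backtrack through psi
    q = max(states, key=delta.get)
    path = [q]
    for back in reversed(psi):
        q = back[q]
        path.append(q)
    return list(reversed(path)), max(delta.values())
```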
20. Example: Rainy and Sunny Days
- Your colleague in another city either walks to work or drives every day, and his decision is usually based on the weather
- Given daily emails that include whether he walked or drove to work, you want to guess the most likely sequence of rainy and sunny days
- Two hidden states: rainy and sunny
- Two observables: walking and driving
- Assume equal likelihood of the first day being rainy or sunny
- Transition probabilities
  - given yesterday was rainy: rainy = .7, sunny = .3
  - given yesterday was sunny: rainy = .4, sunny = .6
- Output (emission) probabilities
  - given rainy: walking = .1, driving = .9
  - given sunny: walking = .8, driving = .2
- Given that your colleague drove, walked, then drove, what is the most likely sequence of days?
21. Example Continued
Day 1 is easy to compute: prior probability × output probability. The initial path to each day-1 state is merely from the start.
22. Example Continued
We determine that from day 1 it is more likely to reach sunny from rainy; it is also more likely to reach rainy from rainy. So day 2's path to sunny comes from rainy, and day 2's path to rainy comes from rainy as well.
23. Example Concluded
From day 2, it is more likely to reach sunny from sunny, and it is more likely to reach rainy from sunny; day 3's most likely state is rainy. Since we reached day 3's rainy state from sunny, and we reached day 2's sunny state from rainy, the most likely path is rainy, sunny, rainy.
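As a check, running the viterbi sketch from slide 19 on this example reproduces the path (the state and observation labels are mine):

```python
prior = {"rainy": .5, "sunny": .5}
trans = {"rainy": {"rainy": .7, "sunny": .3},
         "sunny": {"rainy": .4, "sunny": .6}}
emit = {"rainy": {"walk": .1, "drive": .9},
        "sunny": {"walk": .8, "drive": .2}}

path, p = viterbi(prior, trans, emit, ["drive", "walk", "drive"])
print(path, p)   # ['rainy', 'sunny', 'rainy'] 0.03888
```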
24. Why Problem 2?
- Unlike problem 1, which didn't seem to have any useful AI applications, problem 2 has many different types of AI problems that it can solve
- It can be used to solve any number of credit assignment problems:
  - given a speech signal, what was uttered (what phonemes or words)?
  - given a set of symptoms, what disease(s) is the patient suffering from?
  - given a misspelled word, which word was intended?
  - given a series of events, what caused them?
- What we have is a set of observations (symptoms, manifestations), and we want to explain them
- The HMM and Viterbi give us the ability to generate the best explanation, where "best" means the most likely sequence through all of the states
25. How Do We Obtain Our Probabilities?
- We saw that one of the issues with Bayesian probabilities was gathering accurate probabilities
- As with Bayesian probabilities, we need both prior probabilities and transition probabilities (the probability of moving from one state to another)
- But here we also need output (or emission) probabilities
- We can accumulate probabilities through counting
  - given N cases, how many started at state s1? s2? s3?
    - although, do we have enough cases to give us a good representative mix of probabilities?
  - given N cases, out of all state transitions, how often do we move from s1 to s2? From s2 to s3? Etc.
    - again, are there enough cases to give us a good distribution for transition probabilities?
- How do we obtain the output probabilities? That is, how do we determine the likelihood of seeing output Oi in state Sj?
  - that's trickier, and that's where HMM problem 3 comes in
26. HMM Problem 3
- The final problem for HMMs is the most interesting and also the most challenging
- Given an HMM and an output sequence, update the various probabilities
- It turns out that there is an algorithm for modifying probabilities given a set of correct test cases
- The algorithm is called the Baum-Welch algorithm (an instance of the Expectation-Maximization, or EM, algorithm), and it uses as a component the forward-backward algorithm
  - we already saw the forward portion; now we will take a look at the backward portion, which, as you might expect, is very similar
27. Forward-Backward
- We compute the forward probabilities as before
  - computing αt(i) for each time unit t and each state i
- The backward portion is similar but reversed
  - computing βt(i) for each time unit t and each state i
- Initialization step
  - βT(i) = 1; unlike the forward algorithm, which used the prior probabilities, here we start at 1 (notice also that we start at the final time T, not time 1)
- Recursive step
  - βt(i) = Σj aij·bj(Ot+1)·βt+1(j); the probability of reaching state i at time t, working backwards, is the sum over transitions to all states j at time t+1 of the transition probability aij, times the probability of seeing output Ot+1 from state j, times the backward probability of state j at time t+1
  - this recursive step is almost the same as the step in the forward algorithm, except that we use β instead of α
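A sketch of the backward pass, in the same array style as the forward sketch on slide 14:

```python
import numpy as np

def backward(trans, emit, obs):
    """beta_t(i) for every time t and state i, per the recursion above."""
    n, k = trans.shape[0], len(obs)
    beta = np.ones((k, n))              # initialization: beta_T(i) = 1
    for t in range(k - 2, -1, -1):      # recursion, moving right to left
        # beta_t(i) = sum_j aij * bj(O_{t+1}) * beta_{t+1}(j)
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    return beta
```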
28. Baum-Welch (EM)
- Now that we have computed all the forward and backward path probabilities, how do we use them?
- First, we need to add a new value: the probability of being in state i at time t and transitioning to state j, which we will call ξt(i, j)
- Fortunately, once we have run the forward-backward algorithm, this is easy to compute:
  - ξt(i, j) = αt(i)·aij·bj(Ot+1)·βt+1(j) / denominator
- Before describing the denominator, let's understand the numerator
  - this is the product of the probability of being at state i at time t, the transition probability of going from i to j, the output probability of seeing Ot+1 at time t+1, and the probability of being at state j at time t+1
  - that is, it is the value derived by the forward algorithm for state i at time t, times the value derived by the backward algorithm for state j at time t+1, times the transition and output probabilities
29. Continued
- The denominator is a normalizing value so that all of our probabilities ξt(i, j), over all states i and j, add up to 1 for time t
  - so it is merely the sum, over all i and all j, of αt(i)·aij·bj(Ot+1)·βt+1(j)
- Now we have some additional work
- We add γt(i) = Σj ξt(i, j), summing over all j at time t
  - this represents the expected number of times we are at state i at time t
  - if we sum up γt(i) over all times t, we have the expected number of times we are in state i
- Now recall that we may have started with improper probabilities (prior, transition, and output)
30. Re-estimation
- By running the system on some test cases, we can accumulate probabilities of how likely a transition is, how likely we are to start in a given state (prior probability), and how likely a state is for a given observation
- At this point of the Baum-Welch algorithm, we have accumulated the summations (from the previous slide) over the various states we have visited
  - p(observation i | state j) = expected number of times we saw observation i in state j / expected number of times we were in state j (our output probabilities)
  - p(state j | state i) = expected number of transitions from i to j / expected number of times we were in state i (our transition probabilities)
  - p(state i) = α1(i)·β1(i) / Σi α1(i)·β1(i) (this is the prior probability)
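A compact sketch of one such re-estimation pass for a single training sequence, combining forward-backward with the formulas above (in practice the expected counts are accumulated over many sequences before dividing):

```python
import numpy as np

def baum_welch_step(prior, trans, emit, obs):
    """One EM iteration: returns updated (prior, trans, emit)."""
    n, k = trans.shape[0], len(obs)
    alpha, beta = np.zeros((k, n)), np.ones((k, n))
    alpha[0] = prior * emit[:, obs[0]]
    for t in range(1, k):                                   # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    for t in range(k - 2, -1, -1):                          # backward pass
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    # xi_t(i, j) = alpha_t(i) aij bj(O_{t+1}) beta_{t+1}(j), normalized per t
    xi = np.zeros((k - 1, n, n))
    for t in range(k - 1):
        x = alpha[t][:, None] * trans * emit[:, obs[t + 1]] * beta[t + 1]
        xi[t] = x / x.sum()
    # gamma_t(i): probability of being in state i at time t
    gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
    new_prior = gamma[0]
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit)
    for y in range(emit.shape[1]):                          # expected output counts
        new_emit[:, y] = gamma[np.array(obs) == y].sum(axis=0)
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_prior, new_trans, new_emit
```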
31. Continued
- The math may be elusive, and the amount of computation required is intensive, but we now have the ability to
  - start with estimated probabilities (they don't even have to be very good)
  - use training examples to adjust the probabilities
  - and continue until the probabilities stabilize
    - that is, between iterations of Baum-Welch they do not change (or their change is less than a given error threshold)
- So HMMs can be said to learn the proper probabilities through training examples
  - each training example is merely the observations and the expected output (hidden states)
- The better the initial probabilities, the more likely it is that the algorithm will converge to a stable state quickly; the worse the initial probabilities, the longer it will take
32. Example: Determining the Weather
- Here we have an HMM that attempts to determine, for each day, whether it was hot or cold
  - the observations are the number of ice cream cones a person ate that day (1-3)
  - the following probabilities are estimates that we will correct through learning
            given C   given H   given START
p(1)        0.7       0.1
p(2)        0.2       0.2
p(3)        0.1       0.7
p(C)        0.8       0.1       0.5
p(H)        0.1       0.8       0.5
p(STOP)     0.1       0.1       0

Rows p(1)-p(3): if today is cold (C) or hot (H), how many cones did I probably eat?
Rows p(C)-p(STOP): if today is cold or hot (or we are at the start), what will tomorrow probably be?
33. Computing a Path Through the HMM
- Assume we know that the person ate, in order, the following cones: 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, …
- What days were hot and what days were cold?
- P(day i is hot | the cone sequence) = αi(H)·βi(H) / (αi(C)·βi(C) + αi(H)·βi(H))
  - the α(H), β(H), α(C), and β(C) values were all computed using the forward-backward algorithm
- We started with guesses for our initial probabilities
- Now that we have run one iteration of forward-backward, we can apply re-estimation
- Sum up the values of our computations P(C, 1) and P(C)
- Recompute P(1 | C) = Σ P(C, 1) / Σ P(C)
  - we do the same for P(C, 2) and P(C, 3) to compute P(2 | C) and P(3 | C), as well as the hot days for P(1 | H), P(2 | H), P(3 | H)
- And we recompute P(C | C), P(C | H), etc.
- Now our probabilities are more accurate (although not necessarily correct)
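A sketch of the per-day posterior above, reusing the forward recursion and the backward sketch from slide 27 (the state indices C = 0 and H = 1, and the variable names, are my own; prior, trans, emit, and obs are assumed to hold the table's values):

```python
import numpy as np

def forward_all(prior, trans, emit, obs):
    """Forward pass that keeps alpha_t(i) for every time t."""
    alpha = np.zeros((len(obs), len(prior)))
    alpha[0] = prior * emit[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha

C, H = 0, 1                         # assumed state indices: cold, hot
alpha = forward_all(prior, trans, emit, obs)
beta = backward(trans, emit, obs)   # from the slide-27 sketch
p_hot = alpha[:, H] * beta[:, H] / (alpha * beta).sum(axis=1)  # P(day t is hot | cones)
```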
34. Continued
- We update the probabilities (see below)
  - since our original probabilities impact how good these estimates are, we repeat the entire process with another iteration of forward-backward followed by re-estimation
  - we continue to do this until our probabilities converge to a stable state
- So our initial probabilities are important only in that they impact the number of iterations required to reach these stable probabilities
            given C   given H   given START
p(1)        0.6765    0.0584
p(2)        0.2188    0.4251
p(3)        0.1047    0.5165
p(C)        0.8757    0.0925    0.1291
p(H)        0.109     0.8652    0.8709
p(STOP)     0.0153    0.0423    0
35. Convergence and Perplexity
- This system converged in 10 iterations to the probabilities shown in the table below
- Our original transition probabilities were part of our model of the weather
  - updating them is fine, but what would happen if we had started with different probabilities, say p(H|C) = .25 instead of .1?
  - the perplexity of a model is essentially the degree to which we will be surprised by the results of our model because of the guesses we made when assigning a probability like p(H|C)
- We want our model to have minimal perplexity so that it is most realistic
            given C   given H   given START
p(1)        0.6406    7.1E-05
p(2)        0.1481    0.5343
p(3)        0.2113    0.4657
p(C)        0.9338    0.0719    5.1E-15
p(H)        0.0662    0.865     1.0
p(STOP)     1.0E-15   0.0632    0
36. Two Problems With HMMs
- There are two primary problems with using HMMs
- The first is minor: what if a probability (whether output or transition) is 0?
  - because we are multiplying probabilities together, any path that goes through that state would have a probability of 0 and so would never be selected
  - to get around this problem, we replace any 0 probabilities with some minimum probability (say .001)
- The other is the complexity of the search
  - imagine we are using an HMM for speech recognition where the hidden states are the possible phonemes (say there are 30 of them) and the utterance consists of some 100 phonemes (perhaps 20 words)
  - recall that brute-force enumeration of paths is O(T·N^T), where N is 30 and T is 100! Ouch
  - so we might use a beam search to reduce the number of possible paths searched
37. Beam Search
- A beam search is a combination of the heuristic search idea and breadth-first search
- The beam search algorithm examines all of the next states accessible and evaluates them
  - for an HMM, the evaluation is the probability α or β, depending on whether we are doing a forward or a backward pass
- To reduce the complexity of the search, only some of the states at each time interval are retained
  - we might keep the top k, where k is a constant (known as the beam width), or we can use a threshold value and prune away states that do not exceed it
  - if we discard a state, we are actually discarding the entire path that led us to that state (recall that this is the path that had the highest probability leading to that particular state at that time)
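A sketch of one beam-pruned step under these assumptions (fixed beam width, dictionary-based probabilities as in the earlier sketches; the function and argument names are hypothetical):

```python
import heapq

def beam_step(paths, trans, emit, y, beam_width):
    """Extend each surviving (probability, path) pair by one observation y,
    keeping only the best path into each state and the top beam_width states."""
    extended = {}
    for p, path in paths:
        i = path[-1]
        for j in trans[i]:
            q = p * trans[i][j] * emit[j][y]
            if j not in extended or q > extended[j][0]:
                extended[j] = (q, path + [j])   # best path reaching state j
    # discard all but the beam_width most probable states (and their whole paths)
    return heapq.nlargest(beam_width, extended.values(), key=lambda e: e[0])
```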
38. Forms of HMMs
- One of the most common forms of HMM is the ergodic model: a fully connected model, that is, one in which every state has an edge to every other state
- From earlier in the lecture, we saw a slide of examples; the bull/bear market and the cloudy/sunny/rainy day models are examples
- The weather/ice cream cone example could be thought of as an ergodic model, but instead we would prefer to envision each day as being in a new state, and this leads us to the forward trellis model

Each variant of HMM has its own training algorithms, although they are all based on Baum-Welch.
39. Bakis and Factorial HMMs
- The Bakis model is one used to denote precise temporal changes: states transition left to right across the model, where each state represents a new time unit
  - states may also loop back onto themselves
  - this is often used in speech recognition, for instance to represent portions of a phonetic unit
  - see below to the left
- Factorial HMMs are used when the system is so complex that a single state in the model cannot represent the state of the process
  - at time i there will be multiple states, all of which lead to multiple successor states, and all of which have emission probabilities for the observation input (see the figure below)
40. Hierarchical HMM
- We use this model when each state is itself a self-contained probabilistic model, including its own hidden nodes
- That is, a state has its own internal HMM
- The rationale for having an HHMM is that each state can represent a sequence of observations instead of a one-to-one mapping between observation and state
- For instance, q2 might consist of 3 or more observations, as shown in the figure
41. N-Grams
- In an N-gram HMM, the transition probabilities are conditioned not just on the previous time unit, but on the n-1 prior time units
- The N-gram is primarily used in natural language understanding or genetics-type problems, where we can accumulate the transition probabilities from some corpus of data
- The bi-gram is the most common form of n-gram used in natural language understanding
- Below is some bigram data for the frequency of two-letter pairs in English (out of 2000 words); a short counting sketch follows the table
- Tri-grams are also somewhat commonly used, but it is rare to go beyond tri-grams
TH 50   AT 25   ST 20
ER 40   EN 25   IO 18
ON 39   ES 25   LE 18
AN 38   OF 25   IS 17
RE 36   OR 25   OU 17
HE 33   NT 24   AR 16
IN 31   EA 22   AS 16
ED 30   TI 22   DE 16
ND 30   TO 22   RT 16
HA 26   IT 20   VE 16
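A minimal sketch of accumulating such counts from a corpus (the toy corpus string here is made up):

```python
from collections import Counter

def bigram_counts(corpus):
    """Accumulate letter-pair frequencies, as in the table above."""
    pairs = Counter()
    for word in corpus.upper().split():
        pairs.update(word[i:i + 2] for i in range(len(word) - 1))
    return pairs

print(bigram_counts("the theory of the thing").most_common(3))
# [('TH', 4), ('HE', 3), ('EO', 1)]
```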
42. Applications for HMMs
- The first impressive use of HMMs in AI was for speech recognition (in the late 1980s)
- Since then, many other applications have been explored
  - handwritten character recognition
  - natural language understanding
    - word sense disambiguation
    - machine translation
    - word matching (for misspelled words)
    - semantic tagging of words (could be useful for the semantic web)
  - bioinformatics (e.g., protein structure prediction, gene analysis and sequencing predictions)
  - market predictions
  - diagnosis of mechanical systems