Title: Machine Learning
1. Machine Learning
2. The Markov Property
- A stochastic process has the Markov property if the conditional probability of future states of the process depends only upon the present state.
- i.e. what I'm likely to do next depends only on where I am now, NOT on how I got here.
- P(qt | qt-1, ..., q1) = P(qt | qt-1)
- Which processes have the Markov property?
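One way to get a feel for that question: a minimal sketch (not from the slides) of a made-up two-state chain, where sampling the next state looks only at the current state and never at the history.

```python
import random

# Hypothetical two-state chain used only to illustrate the Markov property:
# the distribution over the next state is looked up from the CURRENT state alone,
# so the history q1 .. q_{t-1} never enters the computation.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(start, steps):
    """Sample a state sequence; P(q_t | q_{t-1}, ..., q_1) = P(q_t | q_{t-1})."""
    state, path = start, [start]
    for _ in range(steps):
        nxt = TRANSITIONS[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        path.append(state)
    return path

print(sample_chain("sunny", 10))
```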
3. Markov model for Dow Jones
4. The Dishonest Casino
- A casino has two dice
- Fair die
- P(1) = P(2) = ... = P(5) = P(6) = 1/6
- Loaded die
- P(1) = P(2) = ... = P(5) = 1/10, P(6) = 1/2
- I think the casino switches back and forth between fair and loaded die once every 20 turns, on average
5. My dishonest casino model
This is a hidden Markov model (HMM).
[State diagram: FAIR and LOADED states; each state has self-transition probability 0.95 and switches to the other state with probability 0.05.]
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
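As an illustration, here is a minimal sketch of this model as a generative process. The transition and emission numbers come from this slide; the 0.7/0.3 start probabilities are borrowed from the forward-algorithm example later in the deck.

```python
import random

# Casino HMM from the slide: two hidden states, each emitting die rolls 1-6.
STATES = ["FAIR", "LOADED"]
TRANS = {"FAIR": {"FAIR": 0.95, "LOADED": 0.05},
         "LOADED": {"LOADED": 0.95, "FAIR": 0.05}}
EMIT = {"FAIR":   {r: 1/6 for r in range(1, 7)},
        "LOADED": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}
START = {"FAIR": 0.7, "LOADED": 0.3}   # start probabilities from the later example

def roll_sequence(n):
    """Generate (hidden state, observed roll) pairs; only the rolls are visible."""
    state = random.choices(STATES, weights=[START[s] for s in STATES])[0]
    out = []
    for _ in range(n):
        roll = random.choices(list(EMIT[state]), weights=list(EMIT[state].values()))[0]
        out.append((state, roll))
        state = random.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
    return out

print(roll_sequence(20))
```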
6. Elements of a Hidden Markov Model
- A finite set of states Q = {q1, ..., qK}
- A set of transition probabilities between states, A, where each aij in A is the probability of going from state i to state j
- The probability of starting in each state, P = {p1, ..., pK}, where each pk in P is the probability of starting in state k
- A set of emission probabilities, B, where each bi(oj) in B is the probability of observing output oj when in state i
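A minimal sketch of how these parameter sets might be packaged in code, instantiated with the casino numbers from slide 5 (the class and field names are my own, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """Container matching the slide's notation: Q, A, B, and the start distribution P."""
    states: list   # Q = {q1, ..., qK}
    trans: dict    # A: trans[i][j] = aij, prob. of going from state i to state j
    emit: dict     # B: emit[i][o] = bi(o), prob. of observing o when in state i
    start: dict    # P: start[k] = pk, prob. of starting in state k

# The dishonest casino as an instance of this structure:
casino = HMM(
    states=["FAIR", "LOADED"],
    trans={"FAIR": {"FAIR": 0.95, "LOADED": 0.05},
           "LOADED": {"LOADED": 0.95, "FAIR": 0.05}},
    emit={"FAIR": {r: 1/6 for r in range(1, 7)},
          "LOADED": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}},
    start={"FAIR": 0.7, "LOADED": 0.3},
)
```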
7. My dishonest casino model
This is a HIDDEN Markov model because the states are not directly observable. If the fair die were red and the unfair die were blue, then the Markov model would NOT be hidden.
[Same state diagram as slide 5: FAIR and LOADED, self-transition 0.95, switch 0.05.]
8. HMMs are good for
- Speech Recognition
- Gene Sequence Matching
- Text Processing
  - Part of speech tagging
  - Information extraction
- Handwriting recognition
9. The Three Basic Problems for HMMs
- Given an observation sequence O = (o1 o2 ... oT) of events from the observation alphabet, and an HMM model λ = (A, B, π)
- Problem 1 (Evaluation)
- What is P(O | λ), the probability of the observation sequence, given the model?
- Problem 2 (Decoding)
- What sequence of states Q = (q1 q2 ... qT) best explains the observations?
- Problem 3 (Learning)
- How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
10. The Evaluation Problem
- Given an observation sequence O and HMM λ, compute P(O | λ)
- Helps us pick which model is the best one
O = 1, 6, 6, 2, 6, 3, 6, 6
11. Computing P(O | λ)
- Naïve approach: try every path through the model
- Sum the probabilities of all possible paths
- This is intractable for long sequences: O(N^T)
- What we do instead:
- The Forward Algorithm: O(N^2 T)
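For contrast, a sketch of the naive approach on the casino model defined earlier: enumerate all N^T state paths, multiply start, transition, and emission probabilities along each, and sum. The forward algorithm (sketched after the worked example below) gives the same answer in O(N^2 T).

```python
from itertools import product

STATES = ["FAIR", "LOADED"]
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {"FAIR": {"FAIR": 0.95, "LOADED": 0.05},
         "LOADED": {"LOADED": 0.95, "FAIR": 0.05}}
EMIT = {"FAIR": {r: 1/6 for r in range(1, 7)},
        "LOADED": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}

def naive_likelihood(obs):
    """Sum path probabilities over all N**T state sequences (intractable for long T)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):      # N**T paths
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[path[t-1]][path[t]] * EMIT[path[t]][obs[t]]
        total += p
    return total

print(naive_likelihood([1, 6, 6, 2, 6, 3, 6, 6]))
```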
12. The Forward Algorithm
13. The inductive step
- Computation of αt(j) by summing over all previous values αt-1(i) for all i:
  αt(j) = [ Σi αt-1(i) · aij ] · bj(ot)
[Diagram: each hidden state at time t-1 contributes its αt-1(i) times the transition probability aij to αt(j).]
14. Forward Algorithm Example
Model:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Start probabilities:
P(fair) = 0.7, P(loaded) = 0.3
Observation sequence: 1, 6, 6, 2
[Forward trellis: columns α1(i) through α4(i) for State 1 (fair) and State 2 (loaded); α1(fair) is initialised to 0.7 × 1/6 and α1(loaded) to 0.3 × 1/10, and each arc is weighted by the transition probability times the emission probability (e.g. 0.05 × 1/6, 0.95 × 1/2).]
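A minimal sketch of the forward algorithm applied to this example, assuming the 0.95/0.05 transition probabilities from slide 5 together with the start probabilities and observation sequence above:

```python
STATES = ["FAIR", "LOADED"]
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {"FAIR": {"FAIR": 0.95, "LOADED": 0.05},
         "LOADED": {"LOADED": 0.95, "FAIR": 0.05}}
EMIT = {"FAIR": {r: 1/6 for r in range(1, 7)},
        "LOADED": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}

def forward(obs):
    """Return P(O | lambda) via the forward recursion, O(N^2 T)."""
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
                 for j in STATES}
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha.values())

# alpha_1(FAIR) starts at 0.7 * 1/6 and alpha_1(LOADED) at 0.3 * 1/10, as in the trellis
print(forward([1, 6, 6, 2]))
```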
15. Markov model for Dow Jones
16. Forward trellis for Dow Jones
17. The Decoding Problem
- What sequence of states Q = (q1 q2 ... qT) best explains the observation sequence O = (o1 o2 ... oT)?
- Helps us find the path through a model.
[Example: part-of-speech states ART, N, V, ADV over the sentence "The dog sat quietly".]
18. The Decoding Problem
- What sequence of states Q = (q1 q2 ... qT) best explains the observation sequence O = (o1 o2 ... oT)?
- Viterbi Decoding
- a slight modification of the forward algorithm
- the major difference is the maximization over previous states
- Note: the most likely state sequence is not the same as the sequence of most likely states
19. The Viterbi Algorithm
20. The Forward inductive step
[Diagram: the αt-1(j) values at observation ot-1 feed forward into the states at observation ot.]
21. The Viterbi inductive step
Keep track of which state was the predecessor at each step:
  vt(j) = [ maxi vt-1(i) · aij ] · bj(ot)
[Diagram: the vt-1(i) values at observation ot-1 feed into observation ot; only the maximising predecessor is kept.]
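A minimal Viterbi sketch on the same casino model: the same shape as the forward pass, but with a max in place of the sum, plus backpointers so the best state sequence can be recovered.

```python
STATES = ["FAIR", "LOADED"]
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {"FAIR": {"FAIR": 0.95, "LOADED": 0.05},
         "LOADED": {"LOADED": 0.95, "FAIR": 0.05}}
EMIT = {"FAIR": {r: 1/6 for r in range(1, 7)},
        "LOADED": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}

def viterbi(obs):
    """Return (best path probability, most likely state sequence)."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}   # v_1(i) = pi_i * b_i(o_1)
    backptr = []
    for o in obs[1:]:
        new_v, ptr = {}, {}
        for j in STATES:
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t); remember the argmax predecessor
            best_i = max(STATES, key=lambda i: v[i] * TRANS[i][j])
            new_v[j] = v[best_i] * TRANS[best_i][j] * EMIT[j][o]
            ptr[j] = best_i
        backptr.append(ptr)
        v = new_v
    # Trace back from the best final state
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return v[last], path[::-1]

print(viterbi([1, 6, 6, 2, 6, 6, 6, 6]))
```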
22. Viterbi for Dow Jones
23. The Learning Problem
- Given O, how do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
- In other words: how do we build a hidden Markov model that best models what we observe?
24. Baum-Welch Local Maximization
- 1st step: you determine
- The number of hidden states, N
- The emission (observation) alphabet
- 2nd step: randomly assign values to
- A, the transition probabilities
- B, the observation (emission) probabilities
- π, the starting state probabilities
- 3rd step: let the machine re-estimate
- A, B, π
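The re-estimation itself uses expected counts computed from a forward and a backward pass. Below is a minimal, unscaled sketch (suitable only for short sequences; the function names are my own, not anything defined on these slides):

```python
# Minimal Baum-Welch re-estimation sketch for a discrete HMM (no scaling).
# Names follow the slides: A = transitions, B = emissions, pi = starting probabilities.
STATES = ["FAIR", "LOADED"]
SYMBOLS = list(range(1, 7))

def forward(obs, A, B, pi):
    """alpha[t][i] = P(o_1 .. o_t, q_t = i)."""
    alpha = [{i: pi[i] * B[i][obs[0]] for i in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in STATES) * B[j][o] for j in STATES})
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1} .. o_T | q_t = i)."""
    beta = [{i: 1.0 for i in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in STATES) for i in STATES})
    return beta

def reestimate(obs, A, B, pi):
    """One EM iteration: expected-count re-estimates of (A, B, pi)."""
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
    likelihood = sum(alpha[-1][i] for i in STATES)
    T = len(obs)
    # gamma[t][i]: prob. of being in state i at time t, given O and the current model
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in STATES} for t in range(T)]
    # xi[t][i][j]: prob. of a transition i -> j at time t
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / likelihood
               for j in STATES} for i in STATES} for t in range(T - 1)]
    new_pi = {i: gamma[0][i] for i in STATES}
    new_A = {i: {j: sum(xi[t][i][j] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1)) for j in STATES} for i in STATES}
    new_B = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                    sum(gamma[t][i] for t in range(T)) for k in SYMBOLS} for i in STATES}
    return new_A, new_B, new_pi, likelihood
```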
25. Estimation Formulae
26. Learning transitions
27. Math
28. Estimation of starting probs.
[Formula shown as an image; callout: this is the number of transitions from i at time t.]
29. Estimation Formulae
30. Estimation Formulae
31. What are we maximizing again?
32. The game is
- EITHER the current model is at a local maximum, and
- reestimate = current model
- OR our reestimate will be slightly better, and
- reestimate != current model
- SO we feed the reestimate back in as the current model, over and over, until we can't improve any more.
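Continuing the slide-24 sketch above (this reuses that block's reestimate(), STATES, and SYMBOLS), the loop described here might look like:

```python
import random

def random_model():
    """Random initial A, B, pi (step 2 of Baum-Welch)."""
    def norm(d):
        z = sum(d.values())
        return {k: v / z for k, v in d.items()}
    A = {i: norm({j: random.random() for j in STATES}) for i in STATES}
    B = {i: norm({k: random.random() for k in SYMBOLS}) for i in STATES}
    pi = norm({i: random.random() for i in STATES})
    return A, B, pi

obs = [1, 6, 6, 2, 6, 3, 6, 6]
A, B, pi = random_model()
prev_ll = 0.0
while True:
    A, B, pi, ll = reestimate(obs, A, B, pi)   # each pass cannot decrease P(O | lambda)
    if ll - prev_ll <= ll * 1e-6:              # "until we can't improve any more"
        break
    prev_ll = ll
print(A, B, pi)
```

Different random initialisations in step 2 can converge to different local maxima, which is exactly the caveat on the next slide.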
33. Caveats
- This is a kind of hill-climbing technique
- It often has serious problems with local maxima
- You don't know when you're done
34. So how else could we do this?
- Standard gradient descent techniques?
- Hill climb?
- Beam search?
- Genetic Algorithm?
35. Back to the fundamental question
- Which processes have the Markov property?
- What if a hidden state variable is included (as in an HMM)?