Part II. Statistical NLP

Transcript and Presenter's Notes

1
Advanced Artificial Intelligence
  • Part II. Statistical NLP

Hidden Markov Models
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken (or adapted) from David Meir Blei; figures from Manning and Schuetze and from Rabiner
2
Contents
  • Markov Models
  • Hidden Markov Models
  • Three problems - three algorithms
  • Decoding
  • Viterbi
  • Baum-Welch
  • Next chapter
  • Application to part-of-speech-tagging
    (POS-tagging)
  • Largely chapter 9 of Statistical NLP, Manning and
    Schuetze, or Rabiner, A tutorial on HMMs and
    selected applications in Speech Recognition,
    Proc. IEEE

3
Motivations and Applications
  • Part-of-speech tagging / Sequence tagging
  • The representative put chairs on the table
  • AT NN VBD NNS IN AT NN
  • AT JJ NN VBZ IN AT NN
  • Some tags
  • AT = article, NN = singular or mass noun, VBD =
    verb (past tense), NNS = plural noun, IN =
    preposition, JJ = adjective

4
Bioinformatics
  • Durbin et al. Biological Sequence Analysis,
    Cambridge University Press.
  • Several applications, e.g. proteins
  • From primary structure ATCPLELLLD
  • Infer secondary structure HHHBBBBBC..

5
Other Applications
  • Speech Recognition
  • From acoustic signals
  • Infer the sentence
  • Robotics
  • From sensory readings
  • Infer the trajectory / location

6
What is a (Visible) Markov Model ?
  • Graphical Model (Can be interpreted as Bayesian
    Net)
  • Circles indicate states
  • Arrows indicate probabilistic dependencies
    between states
  • State depends only on the previous state
  • The past is independent of the future given the
    present.
  • Recall the introduction to N-grams!

7
Markov Model Formalization
[Figure: Markov chain of states S → S → S → S → S]
  • (S, Π, A)
  • S = {s_1, ..., s_N} are the values for the hidden states
  • Limited Horizon (Markov Assumption)
  • Time Invariant (Stationary)
  • Transition Matrix A
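The accompanying equations are missing from the transcript; in this notation (following Manning and Schuetze, ch. 9) they read:

```latex
% Limited horizon (Markov assumption)
P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)
% Time invariance (stationarity)
P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)
% Transition matrix
a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i), \qquad a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1
```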

8
Markov Model Formalization
[Figure: Markov chain of states S → S → S → S → S with transition probabilities A on the arrows]
  • (S, Π, A)
  • S = {s_1, ..., s_N} are the values for the hidden states
  • Π = {π_i} are the initial state probabilities
  • A = {a_ij} are the state transition probabilities

9
What is the probability of a sequence of states ?
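The slide's formula is missing from the transcript; by the chain rule and the Markov assumption it is:

```latex
P(X_1, \ldots, X_T) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_T \mid X_{T-1})
                    = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
```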
10
What is an HMM?
  • Graphical Model
  • Circles indicate states
  • Arrows indicate probabilistic dependencies
    between states
  • HMM = Hidden Markov Model

11
What is an HMM?
  • Green circles are hidden states
  • Dependent only on the previous state

12
What is an HMM?
  • Purple nodes are observed states
  • Dependent only on their corresponding hidden
    state
  • The past is independent of the future given the
    present

13
HMM Formalism
[Figure: hidden state chain S → S → S → S → S, each state emitting an observation K]
  • (S, K, Π, A, B)
  • S = {s_1, ..., s_N} are the values for the hidden states
  • K = {k_1, ..., k_M} are the values for the observations

14
HMM Formalism
[Figure: hidden state chain with transition probabilities A between the states S and emission probabilities B from each state to its observation K]
  • (S, K, Π, A, B)
  • Π = {π_i} are the initial state probabilities
  • A = {a_ij} are the state transition probabilities
  • B = {b_ik} are the observation state probabilities
  • Note: sometimes one uses B = {b_ijk}
  • the output then depends on the previous state /
    transition as well
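Written out (the defining equations are not in the transcript), the standard state-emission definitions are:

```latex
\pi_i = P(X_1 = s_i)
a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)
b_{ik} = P(O_t = k \mid X_t = s_i)
% arc-emission variant:
b_{ijk} = P(O_t = k \mid X_t = s_i,\; X_{t+1} = s_j)
```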

15
The crazy soft drink machine
  • Fig. 9.2 of Manning and Schuetze

16
Probability of lem, ice?
  • Sum over all paths taken through the HMM
  • Start in CP
  • 1 x 0.3 x 0.7 x 0.1 = 0.021
  • 1 x 0.3 x 0.3 x 0.7 = 0.063
  • Total: 0.021 + 0.063 = 0.084
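A minimal brute-force sketch of this sum, assuming the soft drink machine's parameters from Fig. 9.2 (states CP and IP, start in CP; outputs cola, ice_t, lem). The entries that actually enter this calculation are the ones visible in the two products above; the remaining entries are assumptions that do not affect this particular result.

```python
import itertools

# Assumed crazy soft drink machine parameters (Fig. 9.2 style):
states = ["CP", "IP"]
pi = {"CP": 1.0, "IP": 0.0}                              # start in CP
A = {"CP": {"CP": 0.7, "IP": 0.3},                       # transitions
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},      # emissions
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

def prob_obs(obs):
    """Sum P(path) * P(obs | path) over all hidden state paths."""
    total = 0.0
    for path in itertools.product(states, repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

print(prob_obs(["lem", "ice_t"]))   # 0.3*0.7*0.1 + 0.3*0.3*0.7 = 0.084
```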

17
HMMs and Bayesian Nets (1)
[Figure: Markov chain x_1 → ... → x_{t-1} → x_t → x_{t+1} → ... → x_T]
18
HMM and Bayesian Nets (2)
[Figure: HMM as a Bayesian network: hidden chain x_1 → ... → x_{t-1} → x_t → x_{t+1} → ... → x_T, each hidden state x emitting an observation o]
Because of d-separation, the future states and
observations are conditionally independent of the past
states and observations given the present state.
The past is independent of the future given the
present.
19
Inference in an HMM
  • Compute the probability of a given observation
    sequence
  • Given an observation sequence, compute the most
    likely hidden state sequence
  • Given an observation sequence and set of possible
    models, which model most closely fits the data?

20
Decoding
[Figure: observation sequence o_1, ..., o_{t-1}, o_t, o_{t+1}, ...]
Given an observation sequence and a model,
compute the probability of the observation
sequence
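The derivation on the following Decoding slides did not survive the transcript; the standard computation, and the reason dynamic programming is needed, is:

```latex
P(O \mid \mu) = \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
             = \sum_{x_1 \cdots x_T} \pi_{x_1}\, b_{x_1 o_1}
               \prod_{t=2}^{T} a_{x_{t-1} x_t}\, b_{x_t o_t}
% Direct evaluation sums over N^T state sequences, roughly O(T N^T) work;
% the forward procedure reduces this to O(N^2 T).
```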
21
Decoding
22
Decoding
23
Decoding
24
Decoding
25
Decoding
26
Dynamic Programming
27
Forward Procedure
  • Special structure gives us an efficient solution
    using dynamic programming.
  • Intuition: the probability of the first t
    observations is the same for all possible t+1
    length state sequences.
  • Define
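The definition and recursion are missing from the transcript; in one common convention (Rabiner's) the forward variable is:

```latex
\alpha_i(t) = P(o_1 \cdots o_t,\; x_t = i \mid \mu)
% Initialization
\alpha_i(1) = \pi_i\, b_{i o_1}
% Induction
\alpha_j(t+1) = \Big[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \Big]\, b_{j o_{t+1}}
% Termination
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)
```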

28
Forward Procedure
29
Forward Procedure
30
Forward Procedure
31
Forward Procedure
32
Forward Procedure
33
Forward Procedure
34
Forward Procedure
35
Forward Procedure
36
Dynamic Programming
37
Backward Procedure
[Figure: HMM as a Bayesian network: hidden chain x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T with observations o_1, ..., o_T]
Probability of the rest of the observation sequence
given the state at time t
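In the same convention as the forward variable, the backward variable and its recursion (not in the transcript) are:

```latex
\beta_i(t) = P(o_{t+1} \cdots o_T \mid x_t = i, \mu)
% Initialization
\beta_i(T) = 1
% Induction
\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)
```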
38
(No Transcript)
39
Decoding Solution
[Figure: HMM trellis over hidden states x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T and observations o_1, ..., o_T]
Forward Procedure
Backward Procedure
Combination
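The three expressions on this slide are missing from the transcript; with the conventions above they are:

```latex
% Forward procedure
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)
% Backward procedure
P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, b_{i o_1}\, \beta_i(1)
% Combination (valid for any t)
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)
```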
40
(No Transcript)
41
Best State Sequence
  • Find the state sequence that best explains the
    observations
  • Two approaches
  • Individually most likely states
  • Most likely sequence (Viterbi)
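For the first approach, the per-state posterior (its formula is not in the transcript) is, in the notation above:

```latex
\gamma_i(t) = P(x_t = i \mid O, \mu)
            = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)},
\qquad \hat{x}_t = \arg\max_{i} \gamma_i(t)
```

Choosing each state individually can yield a sequence that is itself very unlikely (it may even contain a zero-probability transition), which is why the Viterbi sequence is usually preferred.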

42
Best State Sequence (1)
43
Best State Sequence (2)
  • Find the state sequence that best explains the
    observations
  • Viterbi algorithm

44
Viterbi Algorithm
[Figure: trellis of hidden states x_1, ..., x_{t-1} leading to state j at time t, over observations o_1, ..., o_{t-1}, o_t, o_{t+1}, ..., o_T]
The state sequence which maximizes the
probability of seeing the observations to time
t-1, landing in state j, and seeing the
observation at time t
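In symbols (the slide's equations are missing), this defines:

```latex
\delta_j(t) = \max_{x_1 \cdots x_{t-1}} P(x_1 \cdots x_{t-1},\; o_1 \cdots o_{t-1},\; x_t = j,\; o_t \mid \mu)
% Initialization
\delta_j(1) = \pi_j\, b_{j o_1}
% Recursion, with back-pointers \psi
\delta_j(t+1) = \max_{i} \delta_i(t)\, a_{ij}\, b_{j o_{t+1}},
\qquad \psi_j(t+1) = \arg\max_{i} \delta_i(t)\, a_{ij}
```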
45
Viterbi Algorithm
[Figure: trellis over hidden states x_1, ..., x_{t-1}, x_t, x_{t+1}]
Recursive Computation
46
Viterbi Algorithm
[Figure: trellis over hidden states x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T]
Compute the most likely state sequence by working
backwards
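A minimal NumPy sketch of the recursion and backtrace just described (not from the slides; the function and variable names are my own):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for an observation sequence.

    pi:  (N,)   initial state probabilities
    A:   (N,N)  transitions, A[i, j] = P(x_{t+1}=j | x_t=i)
    B:   (N,M)  emissions,   B[i, k] = P(o_t=k | x_t=i)
    obs: list of observation indices o_1 ... o_T
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path probability ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Work backwards from the most likely final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path))
```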
47
HMMs and Bayesian Nets (1)
[Figure: Markov chain x_1 → ... → x_{t-1} → x_t → x_{t+1} → ... → x_T]
48
HMM and Bayesian Nets (2)
[Figure: HMM as a Bayesian network: hidden chain x_1 → ... → x_{t-1} → x_t → x_{t+1} → ... → x_T, each hidden state x emitting an observation o]
Because of d-separation, the future states and
observations are conditionally independent of the past
states and observations given the present state.
The past is independent of the future given the
present.
49
Inference in an HMM
  • Compute the probability of a given observation
    sequence
  • Given an observation sequence, compute the most
    likely hidden state sequence
  • Given an observation sequence and set of possible
    models, which model most closely fits the data?

50
Dynamic Programming
51
Parameter Estimation
[Figure: HMM trellis with transition probabilities A and emission probabilities B]
  • Given an observation sequence, find the model
    that is most likely to produce that sequence.
  • No analytic method
  • Given a model and observation sequence, update
    the model parameters to better fit the
    observations.

52
(No Transcript)
53
Parameter Estimation
[Figure: HMM trellis with transition probabilities A and emission probabilities B]
Probability of traversing an arc
Probability of being in state i
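The formulas for these two quantities are missing from the transcript; in the standard Baum-Welch notation (Manning and Schuetze write p_t(i, j), Rabiner writes ξ_t(i, j)) they are:

```latex
% Probability of traversing the arc i -> j at time t
p_t(i, j) = P(x_t = i,\; x_{t+1} = j \mid O, \mu)
          = \frac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{P(O \mid \mu)}
% Probability of being in state i at time t
\gamma_i(t) = P(x_t = i \mid O, \mu)
            = \frac{\alpha_i(t)\, \beta_i(t)}{P(O \mid \mu)}
            = \sum_{j=1}^{N} p_t(i, j) \quad (t < T)
```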
54
Parameter Estimation
[Figure: HMM trellis with transition probabilities A and emission probabilities B]
Now we can compute the new estimates of the model
parameters.
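The re-estimation formulas themselves are not in the transcript; the standard Baum-Welch updates are:

```latex
\hat{\pi}_i = \gamma_i(1)
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} p_t(i, j)}{\sum_{t=1}^{T-1} \gamma_i(t)}
\hat{b}_{ik} = \frac{\sum_{t \,:\, o_t = k} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}
```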
55
Instance of Expectation Maximization
  • We have that P(O | new model) ≥ P(O | old model)
    after every re-estimation step
  • We may get stuck in a local maximum (or even a
    saddle point)
  • Nevertheless, Baum-Welch is usually effective

56
Some Variants
  • So far, ergodic models
  • All states are connected
  • Not always wanted
  • Epsilon or null-transitions
  • Not all states/transitions emit output symbols
  • Parameter tying
  • Assuming that certain parameters are shared
  • Reduces the number of parameters that have to be
    estimated
  • Logical HMMs (Kersting, De Raedt, Raiko)
  • Working with structured states and observation
    symbols
  • Working with log probabilities and addition
    instead of multiplication of probabilities
    (typically done)
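As a small illustration of the last point (my own sketch, not from the slides): one step of the forward recursion carried out with log probabilities, so that products become sums and underflow is avoided. The array names are hypothetical; scipy.special.logsumexp does the stable summation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_step_log(log_alpha, log_A, log_b_next):
    """One forward-recursion step in log space.

    log_alpha:  (N,)   log alpha_i(t)
    log_A:      (N,N)  log a_ij
    log_b_next: (N,)   log b_j(o_{t+1})
    """
    # log alpha_j(t+1) = logsumexp_i(log alpha_i(t) + log a_ij) + log b_j(o_{t+1})
    return logsumexp(log_alpha[:, None] + log_A, axis=0) + log_b_next
```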

57
The Most Important Thing
[Figure: HMM trellis with transition probabilities A and emission probabilities B]
We can use the special structure of this model to
do a lot of neat math and solve problems that are
otherwise not solvable.
58
HMMs from an Agent Perspective
  • AI: A Modern Approach
  • AI is the study of rational agents
  • Third part by Wolfram Burgard on Reinforcement
    learning
  • HMMs can also be used here
  • Typically one is interested in P(state | observations)

59
Example
  • Possible states
  • snow, no snow
  • Observations
  • skis, no skis
  • Questions
  • Was there snow the day before yesterday (given a
    sequence of observations)?
  • Is there snow now (given a sequence of
    observations)?
  • Will there be snow tomorrow, given a sequence of
    observations? Next week?

60
HMM and Agents
  • Question: compute P(x_t | o_1 ... o_T)
  • Case 1: often called smoothing
  • t < T: see last time
  • Only the part of the trellis between t and T is
    needed

61
HMM and Agents
  • Case 2: often called filtering
  • t = T: the last time point
  • Can we make it recursive? I.e., go from T-1 to T?

62
HMM and Agents
  • Case 2: often called filtering
  • t = T: the last time point

63
HMM and Agents
  • Case 3: often called prediction
  • t = T+1 (or T+k): not yet seen
  • Interesting: recursive
  • Easily extended towards k > 1
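The recursions referred to in Cases 2 and 3 are not in the transcript; with the forward variable defined earlier they can be written as:

```latex
% Filtering (t = T)
P(x_T = i \mid o_1 \cdots o_T) = \frac{\alpha_i(T)}{\sum_{j} \alpha_j(T)},
\qquad \alpha_j(T) = \Big[\sum_{i} \alpha_i(T-1)\, a_{ij}\Big]\, b_{j o_T}
% Prediction (t = T+1)
P(x_{T+1} = j \mid o_1 \cdots o_T) = \sum_{i} P(x_T = i \mid o_1 \cdots o_T)\, a_{ij}
% For k > 1, apply the transition matrix k times.
```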

64
Extensions
  • Use Dynamic Bayesian networks instead of HMMs
  • One state corresponds to a Bayesian Net
  • Observations can become more complex
  • Involve actions of the agent as well
  • Cf. Wolfram Burgard's part