Title: HMM for POS Tagging
Slide 1: HMM for POS Tagging
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- Feb 4, 2008
- Acknowledgement: some slides from Ralph Grishman and Nicolas Nicolov
Slide 2: Outline
- HMMs and the Viterbi algorithm
Slide 3: Machine Learning based POS Tagging
- Statistical approaches
- Machine learning of rules
- Role of corpus
  - No corpus: hand-written rules
  - No machine learning: hand-written rules
  - Unsupervised learning from raw data
  - Supervised learning from annotated data
Slide 4: The Basic Idea
- For a string of words
  W = w1 w2 w3 ... wn
- find the string of POS tags
  T = t1 t2 t3 ... tn
- which maximizes P(T | W)
- i.e., the probability of tag string T given that the word string was W
- i.e., that W was tagged T
Slide 5: But, the Sparse Data Problem ...
- Rich models often require vast amounts of data
- Naive approach: count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string ...
- Too many possible combinations
Slide 6: POS Tagging as Sequence Classification
- We are given a sentence (an "observation" or sequence of observations)
  - Secretariat is expected to race tomorrow
- What is the best sequence of tags that corresponds to this sequence of observations?
- Probabilistic view
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1 ... wn
Slide 7: Getting to HMMs
- We want, out of all sequences of n tags t1 ... tn, the single tag sequence such that P(t1 ... tn | w1 ... wn) is highest
- The hat (^) means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"
Slide 8: Getting to HMMs
- This equation is guaranteed to give us the best tag sequence
- But how to make it operational? How to compute this value?
- Intuition of Bayesian classification
  - Use Bayes' rule to transform this equation into a set of other probabilities that are easier to compute
Slide 9: Goal of POS Tagging
- We want the best set of tags for a sequence of words (a sentence)
  - W = a sequence of words
  - T = a sequence of tags
- Our goal: argmax_T P(T | W)
- Example
  - P((NN NN P DET ADJ NN) | (heat oil in a large pot))
Slide 10: Reminder: Apply Bayes' Theorem (1763)

  P(T | W) = P(W | T) P(T) / P(W)

  posterior = likelihood × prior / marginal likelihood

- Our goal: to maximize the posterior!
- Reverend Thomas Bayes, Presbyterian minister (1702-1761)
Slide 11: How to Count
- P(W | T) and P(T) can be counted from a large hand-tagged corpus, then smoothed to get rid of the zeroes
Slide 12: Count P(W | T) and P(T)
- Assume each word in the sequence depends only on its corresponding tag:

  P(W | T) ≈ ∏ᵢ P(wᵢ | tᵢ)
Slide 13: Count P(T)
- Make a Markov assumption and use N-grams over tags: each tag depends only on a short history of preceding tags ...
- P(T) is then a product of the probabilities of the N-grams that make it up; with bigrams:

  P(T) ≈ ∏ᵢ P(tᵢ | tᵢ₋₁)
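The bigram version of this count can be sketched in a few lines of Python. The tiny tagged corpus below is invented for illustration; the estimator itself is the plain maximum-likelihood ratio count(t_{i-1}, t_i) / count(t_{i-1}) described above:

```python
# MLE tag-bigram model: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}).
# The tagged sentences here are made up for illustration.
from collections import Counter

tagged_sentences = [['DET', 'NN', 'VB'], ['DET', 'ADJ', 'NN', 'VB'],
                    ['NN', 'VB', 'DET', 'NN']]

unigrams, bigrams = Counter(), Counter()
for tags in tagged_sentences:
    seq = ['<s>'] + tags                      # pad with a start-of-sentence tag
    unigrams.update(seq[:-1])                 # counts of the conditioning tag
    bigrams.update(zip(seq[:-1], seq[1:]))    # counts of adjacent tag pairs

def p(tag, prev):
    """MLE bigram probability P(tag | prev); 0 for unseen histories."""
    return bigrams[(prev, tag)] / unigrams[prev] if unigrams[prev] else 0.0

# P(T) for T = DET NN VB is the product of its bigram probabilities
tags = ['DET', 'NN', 'VB']
prob = 1.0
for prev, tag in zip(['<s>'] + tags, tags):
    prob *= p(tag, prev)
print(prob)   # = P(DET|<s>) * P(NN|DET) * P(VB|NN) = 2/3 * 2/3 * 1
```

In practice these raw counts would still need smoothing, as slide 11 notes, to avoid zero probabilities for unseen bigrams.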
Slide 14: Example: a Moore Machine
- Goal: What is the most probable sequence of animals if you hear "Moo, Hello, Quack"?
Slide 15: A Hidden Markov Model (HMM)
Slide 16: The State Space of a Moore Machine
Slide 17: Viterbi Decoding of a Moore Machine
- Trellis for the observation sequence moo, hello, quack (reconstructed from the slide):

  state | t0 | t1 (moo) | t2 (hello) | t3 (quack) | t4 (end)
  START |  1 |    0     |     0      |     0      |    0
  COW   |  0 |   0.9    |   0.045    |     0      |    0
  DUCK  |  0 |    0     |   0.108    |   0.0324   |    0
  END   |  0 |    0     |     0      |     0      | 0.00648

- COW at t1: 1 × 1 × 0.9 = 0.9
- COW at t2: 0.9 × 0.5 × 0.1 = 0.045
- DUCK at t2: 0.9 × 0.3 × 0.4 = 0.108
- DUCK at t3: max(0.108 × 0.5 × 0.6 = 0.0324, 0.045 × 0.3 × 0.6 = 0.0081) = 0.0324
- END at t4: 0.0324 × 0.2 × 1 = 0.00648
Slide 18: Computing Probabilities
- viterbi[s, t] = max over s' of ( viterbi[s', t-1] × transition probability P(s | s') × emission probability P(token_t | s) )
- for each (s, t), record which (s', t-1) contributed the maximum (the back pointer)
Slide 19: Analyzing
Slide 20: A Simple POS HMM
Slide 21: Word Emission Probabilities P(word | state)
- A two-word language: "fish" and "sleep"
- Suppose in our training corpus,
  - "fish" appears 8 times as a noun and 4 times as a verb
  - "sleep" appears twice as a noun and 6 times as a verb
- Emission probabilities
  - Noun
    - P(fish | noun) = 0.8
    - P(sleep | noun) = 0.2
  - Verb
    - P(fish | verb) = 0.4
    - P(sleep | verb) = 0.6
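The counts above turn into these probabilities by simple maximum-likelihood estimation, P(word | tag) = count(word, tag) / count(tag); a minimal sketch:

```python
# Emission probabilities from the slide's counts:
# P(word | tag) = count(word, tag) / count(tag).
from collections import Counter

counts = Counter({('fish', 'noun'): 8, ('fish', 'verb'): 4,
                  ('sleep', 'noun'): 2, ('sleep', 'verb'): 6})

tag_totals = Counter()
for (word, tag), c in counts.items():
    tag_totals[tag] += c          # noun: 10, verb: 10

emit = {(word, tag): c / tag_totals[tag] for (word, tag), c in counts.items()}
print(emit[('fish', 'noun')])     # 0.8
print(emit[('sleep', 'verb')])    # 0.6
```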
Slide 22: Viterbi Probabilities
Slides 23-24: Token 1: fish
Slide 25: Token 1: fish
Slide 26: Token 2: sleep (if "fish" is a verb)
Slide 27: Token 2: sleep (if "fish" is a verb)
Slide 28: Token 2: sleep (if "fish" is a noun)
Slide 29: Token 2: sleep (if "fish" is a noun)
Slide 30: Token 2: sleep - take maximum, set back pointers
Slide 31: Token 2: sleep - take maximum, set back pointers
Slide 32: Token 3: end
Slide 33: Token 3: end - take maximum, set back pointers
Slide 34: Decode: fish = noun, sleep = verb
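The token-by-token walkthrough above can be replayed step by step in Python using the slide-21 emission probabilities. The transition probabilities of the POS HMM are only shown as a figure in the slides and are not recoverable from the text, so the values below are invented for illustration:

```python
# Replaying the fish/sleep trellis by hand, one token at a time.
# Emission probabilities come from slide 21; the transition probabilities
# are ASSUMED (illustrative values, not the ones in the slides' figure).

emit = {('fish', 'noun'): 0.8, ('sleep', 'noun'): 0.2,
        ('fish', 'verb'): 0.4, ('sleep', 'verb'): 0.6}
trans = {('start', 'noun'): 0.8, ('start', 'verb'): 0.2,   # assumed
         ('noun', 'noun'): 0.1, ('noun', 'verb'): 0.8,     # assumed
         ('verb', 'noun'): 0.4, ('verb', 'verb'): 0.1,     # assumed
         ('noun', 'end'): 0.1, ('verb', 'end'): 0.5}       # assumed

# Token 1: fish
v1 = {t: trans[('start', t)] * emit[('fish', t)] for t in ('noun', 'verb')}
# Token 2: sleep -- take the maximum over predecessors, set back pointers
v2, back = {}, {}
for t in ('noun', 'verb'):
    prev = max(('noun', 'verb'), key=lambda p: v1[p] * trans[(p, t)])
    v2[t] = v1[prev] * trans[(prev, t)] * emit[('sleep', t)]
    back[t] = prev
# Token 3: end -- take the maximum and follow the back pointer
last = max(('noun', 'verb'), key=lambda t: v2[t] * trans[(t, 'end')])
decode = [back[last], last]
print(decode)   # -> ['noun', 'verb']: fish = noun, sleep = verb
```

With these assumed transitions the decode matches slide 34: fish is tagged noun, sleep is tagged verb.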