1. LING 138/238 / SYMBSYS 138: Intro to Computer Speech and Language Processing
- Lecture 6: Part of Speech Tagging (II)
- October 14, 2004
- Neal Snider
Thanks to Dan Jurafsky, Jim Martin, Dekang Lin,
and Bonnie Dorr for some of the examples and
details in these slides!
2. Week 3: Part of Speech Tagging
- Part of speech tagging
- Parts of speech
- What's POS tagging good for, anyhow?
- Tag sets
- Rule-based tagging
- Statistical tagging
- TBL tagging
3. Rule-based Tagging
- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags, leaving the correct tag for each word.
4. Three Methods for POS Tagging
- Rule-based tagging
- (ENGTWOL)
- Stochastic (Probabilistic) tagging
- HMM (Hidden Markov Model) tagging
- Transformation-based tagging
- Brill tagger
5. Statistical Tagging
- Based on probability theory
- Today we'll go over a few basic ideas of probability theory
- Then we'll do HMM and TBL tagging.
6. Probability and Part of Speech Tags
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
7. Probability and Part of Speech Tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these:
- All words: just count all the words in the dictionary
- # of ways to get a verb: the number of words which are verbs!
- If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) = 10000/50000 = 1/5 = .20
8. Probability and Independent Events
- What's the probability of picking two verbs randomly from the dictionary?
- Events are independent, so multiply the probabilities (see the sketch below):
- P(w1 = V, w2 = V) = P(V) × P(V)
- = 1/5 × 1/5
- = 0.04
- What if events are not independent?
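A minimal Python sketch of the counting behind the last two slides, using the hypothetical 50,000-entry dictionary with 10,000 verb entries from the example:

```python
# Probability of drawing a verb from a (hypothetical) 50,000-entry dictionary
# with 10,000 verb entries, and of drawing two verbs independently.
total_entries = 50_000
verb_entries = 10_000

p_verb = verb_entries / total_entries  # P(V) = 10000/50000 = 0.20
p_two_verbs = p_verb * p_verb          # independent draws: multiply

print(p_verb)       # 0.2
print(p_two_verbs)  # ~0.04
```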
9. Conditional Probability
- Written P(A|B).
- Let's say A is "it's raining."
- Let's say P(A) in drought-stricken California is .01
- Let's say B is "it was sunny ten minutes ago"
- P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
- P(A|B) is probably way less than P(A)
- Let's say P(A|B) is .0001
10. Conditional Probability and Tags
- P(Verb) is the probability of a randomly selected word being a verb.
- P(Verb|race) is "what's the probability of a word being a verb, given that it's the word 'race'?"
- 'Race' can be a noun or a verb.
- It's more likely to be a noun.
- P(Verb|race) can be estimated by looking at some corpus and asking "out of all the times we saw 'race', how many were verbs?" (see the counting sketch below)
- In the Brown corpus, P(Noun|race) = 96/98 ≈ .98, so P(Verb|race) ≈ .02
- How do we calculate this for a tag sequence, say P(NN|DT)?
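A hedged sketch of the counting estimate just described, over a tiny invented tagged corpus (not the Brown corpus); the function name is illustrative:

```python
# Estimate P(tag | word) by relative frequency over a tagged corpus.
# The toy corpus below is invented purely for illustration.
tagged_corpus = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("they", "PRP"), ("race", "VB"), ("home", "NN"),
    ("the", "DT"), ("race", "NN"), ("ended", "VBD"),
]

def p_tag_given_word(tag, word, corpus):
    """P(tag | word) = count(word tagged as tag) / count(word)."""
    word_count = sum(1 for w, _ in corpus if w == word)
    pair_count = sum(1 for w, t in corpus if w == word and t == tag)
    return pair_count / word_count if word_count else 0.0

print(p_tag_given_word("VB", "race", tagged_corpus))  # 1/3 in this toy corpus
```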
11. Stochastic Tagging
- Based on the probability of a certain tag occurring given various possibilities
- Necessitates a training corpus
- No probabilities for words not in the corpus.
- The training corpus may be too different from the test corpus.
12. Stochastic Tagging (cont.)
- Simple method: choose the most frequent tag in the training text for each word!
- Result: 90% accuracy
- Why?
- Baseline: others will do better
- HMM is an example
13. HMM Tagger
- Intuition: Pick the most likely tag for this word.
- HMM taggers choose the tag sequence that maximizes this formula:
- P(word|tag) × P(tag|previous n tags)
- Let T = t1, t2, ..., tn and W = w1, w2, ..., wn. Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
14. HMM Tagger
- argmaxT P(T|W)
- = argmaxT P(W|T) P(T)   (Bayes' Rule: P(T|W) = P(W|T)P(T)/P(W), and the denominator P(W) is the same for every T, so it can be dropped)
- = argmaxT P(w1...wn | t1...tn) P(t1...tn)
- Remember, we are trying to find the sequence T that will maximize P(T|W), so this equation is calculated over the whole sentence.
15. HMM Tagger
- Assume each word is dependent only on its own POS tag; it is independent of the other words around it
- argmaxT P(w1|t1) P(w2|t2) ... P(wn|tn) × P(t1) P(t2|t1) ... P(tn|tn-1)   (sketched in code below)
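Under these independence assumptions the score of a candidate tag sequence factors into emission terms P(wi|ti) and transition terms P(ti|ti-1). A minimal scoring sketch, with invented probability tables and a start pseudo-tag standing in for P(t1):

```python
# Score one candidate tag sequence T for a word sequence W under the
# factorization prod_i P(w_i|t_i) * P(t_i|t_{i-1}).
# The probability tables below are invented purely for illustration.
emission = {("the", "DT"): 0.6, ("dog", "NN"): 0.004, ("runs", "VBZ"): 0.002}
transition = {("<s>", "DT"): 0.3, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.2}

def score(words, tags):
    prob = 1.0
    prev = "<s>"  # sentence-start pseudo-tag in place of P(t1)
    for w, t in zip(words, tags):
        prob *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return prob

print(score(["the", "dog", "runs"], ["DT", "NN", "VBZ"]))  # ~1.4e-07
```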
16. Bigram HMM Tagger
- Also assume that a tag's probability is dependent only on the previous tag
- For each word and possible tag, we need to calculate:
- P(wi|ti) × P(ti|ti-1)
- then multiply this over each word and its possible tags across the sequence
17. Bigram HMM Tagger
- How do we compute P(ti|ti-1)?
- c(ti-1, ti) / c(ti-1)
- How do we compute P(wi|ti)?
- c(wi, ti) / c(ti)
- How do we compute the most probable tag sequence?
- The Viterbi algorithm (sketched below)
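A compact sketch of Viterbi decoding for a bigram HMM tagger. In practice the transition and emission tables come from the corpus counts above (c(ti-1, ti)/c(ti-1) and c(wi, ti)/c(ti)); the tiny tables here mix the Brown-derived numbers quoted on the next slide with invented values, and the function name is illustrative:

```python
# Viterbi decoding: find the tag sequence maximizing
# prod_i P(w_i|t_i) * P(t_i|t_{i-1}).
def viterbi(words, tags, transition, emission, start="<s>"):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (transition.get((start, t), 0.0)
                * emission.get((words[0], t), 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # choose the previous tag that gives the highest path probability
            prob, path = max(
                (p * transition.get((prev, t), 0.0), path)
                for prev, (p, path) in best.items()
            )
            new_best[t] = (prob * emission.get((w, t), 0.0), path + [t])
        best = new_best
    return max(best.values())  # (best probability, best tag sequence)

tags = ["TO", "VB", "NN"]
transition = {("<s>", "TO"): 0.2, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emission = {("to", "TO"): 0.9, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}
print(viterbi(["to", "race"], tags, transition, emission))  # picks ['TO', 'VB']
```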
18. An Example
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- to/TO race/???   the/DT race/???
- ti = argmaxj P(tj|ti-1) P(wi|tj)
- i = index of the word in the sequence, j = index among the possible tags
- max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
- Brown corpus estimates (checked in the snippet below):
- P(NN|TO) = .021, P(race|NN) = .00041, product = .000007
- P(VB|TO) = .34, P(race|VB) = .00003, product = .00001
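A two-line check of this comparison using the numbers quoted above (before rounding the products are roughly 8.6e-06 and 1.0e-05, so the VB reading wins):

```python
# Compare the two candidate taggings of "race" after "to/TO",
# using the probabilities quoted on the slide.
p_nn = 0.021 * 0.00041  # P(NN|TO) * P(race|NN) ~ 8.6e-06
p_vb = 0.34 * 0.00003   # P(VB|TO) * P(race|VB) ~ 1.0e-05
print("VB" if p_vb > p_nn else "NN")  # -> VB
```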
19. Viterbi Algorithm
- [Slide shows a trellis diagram over states S1-S5]
20Transformation-Based Tagging (Brill Tagging)
- Combination of Rule-based and stochastic tagging
methodologies - Like rule-based because rules are used to specify
tags in a certain environment - Like stochastic approach because machine learning
is usedwith tagged corpus as input - Input
- tagged corpus
- dictionary (with most frequent tags)
21. Transformation-Based Tagging (cont.)
- Basic Idea:
- Set the most probable tag for each word as a start value
- Change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order
- Training is done on a tagged corpus:
- (1) Write a set of rule templates
- (2) Among the set of rules, find the one with the highest score
- (3) Continue from (2) until a lowest-score threshold is passed
- (4) Keep the ordered set of rules
- Rules make errors that are corrected by later rules
22. TBL: Rule Application
- The tagger labels every word with its most-likely tag
- For example, 'race' has the following probabilities in the Brown corpus:
- P(NN|race) = .98
- P(VB|race) = .02
- Transformation rules then make changes to tags, e.g.:
- "Change NN to VB when the previous tag is TO" (sketched in code below)
- is/VBZ expected/VBN to/TO race/NN tomorrow/NN becomes is/VBZ expected/VBN to/TO race/VB tomorrow/NN
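A minimal sketch of applying that single transformation to an initially tagged sequence; the helper name is illustrative:

```python
# Apply one Brill-style transformation to a list of (word, tag) pairs:
# change NN to VB when the previous tag is TO.
def apply_rule(tagged, from_tag="NN", to_tag="VB", prev_tag="TO"):
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sentence))
# [('is', 'VBZ'), ('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```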
23. TBL: Rule Learning
- 2 parts to a rule:
- Triggering environment
- Rewrite rule
- The range of triggering environments of templates (from Manning & Schütze 1999: 363)
- [Table: schemas 1-9, each marking which of the surrounding tags ti-3, ti-2, ti-1, ti, ti+1, ti+2, ti+3 form the triggering environment]
24. TBL: The Tagging Algorithm
- Step 1: Label every word with its most likely tag (from the dictionary)
- Step 2: Check every possible transformation and select the one which most improves the tagging
- Step 3: Re-tag the corpus, applying the rule
- Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus
- RESULT: An ordered sequence of transformation rules (see the learning-loop sketch below)
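A rough sketch of this greedy learning loop. To keep it short, candidate rules are drawn from a single template ("change tag A to tag B when the previous tag is Z"), which is a simplification of the full template set; all function names and the toy data are illustrative:

```python
# Greedy TBL learning (simplified): repeatedly pick the candidate rule that
# most reduces tagging errors against the gold-standard corpus.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

def errors(tagged, gold):
    return sum(1 for (_, t), (_, g) in zip(tagged, gold) if t != g)

def learn_rules(current, gold, tagset, max_rules=10):
    rules = []
    for _ in range(max_rules):
        candidates = [(a, b, z) for a in tagset for b in tagset
                      for z in tagset if a != b]
        best = min(candidates, key=lambda r: errors(apply_rule(current, *r), gold))
        if errors(apply_rule(current, *best), gold) >= errors(current, gold):
            break  # no candidate improves the tagging any further
        current = apply_rule(current, *best)
        rules.append(best)  # rules are kept in the order they were learned
    return rules

gold = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
start = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]  # most-likely tags
print(learn_rules(start, gold, ["TO", "VB", "NN", "DT"]))  # [('NN', 'VB', 'TO')]
```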
25. TBL: Rule Learning (cont.)
- Problem: Could apply transformations ad infinitum!
- Constrain the set of transformations with templates:
- "Replace tag X with tag Y, provided tag Z or word Z appears in some position"
- Rules are learned in an ordered sequence
- Rules may interact.
- Rules are compact and can be inspected by humans
26. Templates for TBL
27. TBL: Problems
- The first 100 rules achieve 96.8% accuracy; the first 200 rules achieve 97.0% accuracy
- Execution speed: the TBL tagger is slower than the HMM approach
- Learning speed: Brill's implementation takes over a day (600k tokens)
- BUT:
- (1) Learns a small number of simple, non-stochastic rules
- (2) Can be made to work faster with FSTs
- (3) Best-performing algorithm on unknown words
28. Tagging Unknown Words
- New words added to (newspaper) language: 20 per month
- Plus many proper names
- Unknown words increase error rates by 1-2%
- Method 1: assume they are nouns
- Method 2: assume the unknown words have a probability distribution similar to words occurring only once in the training set.
- Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN (sketched below).
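A minimal sketch of Method 3: guessing the tag of an out-of-vocabulary word from simple morphological cues. The suffix rules and function name below are hand-picked for illustration, not learned from data:

```python
# Guess a tag for an unknown word from capitalization and suffix cues.
def guess_unknown_tag(word):
    if word[0].isupper():
        return "NNP"   # capitalized: likely a proper noun
    if word.endswith("ed"):
        return "VBN"   # words ending in -ed tend to be past participles
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"        # fall back to Method 1: assume it is a noun

print(guess_unknown_tag("blogified"))  # VBN
print(guess_unknown_tag("Fooberg"))    # NNP
```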
29. Using Morphological Information
30. Evaluation
- The result is compared with a manually coded "Gold Standard" (see the accuracy sketch below)
- Typically accuracy reaches 96-97%
- This may be compared with the result for a baseline tagger (one that uses no context).
- Important: 100% is impossible even for human annotators.
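A small sketch of this evaluation: token-level accuracy of a tagger's output against a gold standard (both toy tag sequences below are invented):

```python
# Token-level tagging accuracy: fraction of tokens whose predicted tag
# matches the gold-standard tag.
def accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

gold_tags = ["DT", "NN", "VBZ", "IN", "DT", "NN"]
pred_tags = ["DT", "NN", "VBZ", "IN", "DT", "VB"]
print(accuracy(pred_tags, gold_tags))  # 0.8333...
```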