Title: CMSC 723 / LING 645: Intro to Computational Linguistics
1 CMSC 723 / LING 645: Intro to Computational Linguistics
September 22, 2004
Porter Stemmer, Intro to Probabilistic NLP and N-grams (Chapter 6.1-6.3)
Prof. Bonnie J. Dorr, Dr. Christof Monz, TA: Adam Lee
2 Computational Morphology (continued)
- The Rules and the Lexicon
- General versus Specific
- Regular versus Irregular
- Accuracy, speed, space
- The Morphology of a language
- Approaches
- Lexicon only
- Lexicon and Rules
- Finite-state Automata
- Finite-state Transducers
- Rules only
3 Lexicon-Free Morphology: Porter Stemmer
- Lexicon-free FST approach
- By Martin Porter (1980): http://www.tartarus.org/~martin/PorterStemmer/
- Cascade of substitutions given specific conditions (see the usage sketch below)
- GENERALIZATIONS
- GENERALIZATION
- GENERALIZE
- GENERAL
- GENER
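For experimentation, the Porter algorithm is available in standard toolkits. A minimal sketch, assuming the NLTK package is installed (nltk.stem.PorterStemmer is NLTK's implementation, which differs in a few small details from the 1980 algorithm; it is not part of these slides):

```python
# Minimal sketch: running the Porter stemmer via NLTK (assumes `pip install nltk`).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The cascade from the slide: each form should reduce toward the same stem (GENER).
for word in ["generalizations", "generalization", "generalize", "general"]:
    print(word, "->", stemmer.stem(word))
```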
4 Porter Stemmer
- Definitions
- C = a string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by a consonant)
- V = a string of one or more vowels
- M = measure, roughly corresponding to the number of syllables (see the sketch below)
- Words take the form (C)(VC)^M(V)
- M=0: TR, EE, TREE, Y, BY
- M=1: TROUBLE, OATS, TREES, IVY
- M=2: TROUBLES, PRIVATE, OATEN, ORRERY
- Conditions
- <S>: the stem ends with S
- <v>: the stem contains a vowel
- <d>: the stem ends with a double consonant, e.g., -TT, -SS
- <o>: the stem ends CVC, where the second C is not W, X or Y, e.g., -WIL, -HOP
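A rough illustration of how the measure M can be computed from the (C)(VC)^M(V) decomposition. The helper names and the regex trick are my own, not from the slides; the test words are the slide's examples:

```python
import re

VOWELS = "aeiou"

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: A, E, I, O, U are vowels; Y preceded by a consonant is a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word: str) -> int:
    """Number of VC sequences in the (C)(VC)^M(V) decomposition."""
    word = word.lower()
    pattern = "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))
    # Collapse runs (e.g. "ccvvc" -> "cvc"), then count vowel-consonant transitions.
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)
    return collapsed.count("vc")

# Examples from the slide:
assert measure("tree") == 0 and measure("by") == 0
assert measure("trouble") == 1 and measure("oats") == 1
assert measure("private") == 2 and measure("orrery") == 2
```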
5 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 1: Plural Nouns and Third Person Singular Verbs (regex sketch below)
- SSES → SS    caresses → caress
- IES → I    ponies → poni, ties → ti
- SS → SS    caress → caress
- S → ε    cats → cat
- Step 2a: Verbal Past Tense and Progressive Forms
- (M>0) EED → EE    feed → feed, agreed → agree
- (i) (<v>) ED → ε    plastered → plaster, bled → bled
- (ii) (<v>) ING → ε    motoring → motor, sing → sing
- Step 2b: If 2a.i or 2a.ii was successful, Cleanup
- AT → ATE    conflat(ed) → conflate
- BL → BLE    troubl(ed) → trouble
- IZ → IZE    siz(ed) → size
- (<d> and not (stem ends in L, S or Z)) → single letter    hopp(ing) → hop, tann(ed) → tan; but hiss(ing) → hiss, fizz(ed) → fizz
- (M=1 and <o>) → E    fail(ing) → fail, fil(ing) → file
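A toy rendering of Step 1 as an ordered regex cascade (rule order and helper names are my own; this covers only the plural/third-singular step, not the full algorithm):

```python
import re

# Ordered rules for Step 1: the first pattern that matches the end of the word fires.
STEP1_RULES = [
    (re.compile(r"sses$"), "ss"),   # caresses -> caress
    (re.compile(r"ies$"), "i"),     # ponies -> poni, ties -> ti
    (re.compile(r"ss$"), "ss"),     # caress -> caress (unchanged)
    (re.compile(r"s$"), ""),        # cats -> cat
]

def step1(word: str) -> str:
    for pattern, replacement in STEP1_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

assert step1("caresses") == "caress"
assert step1("ponies") == "poni"
assert step1("caress") == "caress"
assert step1("cats") == "cat"
```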
6 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 3: Y → I
- (<v>) Y → I    happy → happi
- sky → sky
7 Porter Stemmer
- Step 4: Derivational Morphology I: Multiple Suffixes
- (M>0) ATIONAL → ATE    relational → relate
- (M>0) TIONAL → TION    conditional → condition, rational → rational
- (M>0) ENCI → ENCE    valenci → valence
- (M>0) ANCI → ANCE    hesitanci → hesitance
- (M>0) IZER → IZE    digitizer → digitize
- (M>0) ABLI → ABLE    conformabli → conformable
- (M>0) ALLI → AL    radicalli → radical
- (M>0) ENTLI → ENT    differentli → different
- (M>0) ELI → E    vileli → vile
- (M>0) OUSLI → OUS    analogousli → analogous
- (M>0) IZATION → IZE    vietnamization → vietnamize
- (M>0) ATION → ATE    predication → predicate
- (M>0) ATOR → ATE    operator → operate
- (M>0) ALISM → AL    feudalism → feudal
- (M>0) IVENESS → IVE    decisiveness → decisive
- (M>0) FULNESS → FUL    hopefulness → hopeful
- (M>0) OUSNESS → OUS    callousness → callous
8 Porter Stemmer
- Step 5: Derivational Morphology II: More Multiple Suffixes
- (M>0) ICATE → IC    triplicate → triplic
- (M>0) ATIVE → ε    formative → form
- (M>0) ALIZE → AL    formalize → formal
- (M>0) ICITI → IC    electriciti → electric
- (M>0) ICAL → IC    electrical → electric
- (M>0) FUL → ε    hopeful → hope
- (M>0) NESS → ε    goodness → good
9 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 6: Derivational Morphology III: Single Suffixes
- (M>1) AL → ε    revival → reviv
- (M>1) ANCE → ε    allowance → allow
- (M>1) ENCE → ε    inference → infer
- (M>1) ER → ε    airliner → airlin
- (M>1) IC → ε    gyroscopic → gyroscop
- (M>1) ABLE → ε    adjustable → adjust
- (M>1) IBLE → ε    defensible → defens
- (M>1) ANT → ε    irritant → irrit
- (M>1) EMENT → ε    replacement → replac
- (M>1) MENT → ε    adjustment → adjust
- (M>1) ENT → ε    dependent → depend
- (M>1 and (<S> or <T>)) ION → ε    adoption → adopt
- (M>1) OU → ε    homologou → homolog
- (M>1) ISM → ε    communism → commun
- (M>1) ATE → ε    activate → activ
- (M>1) ITI → ε    angulariti → angular
- (M>1) OUS → ε    homologous → homolog
- (M>1) IVE → ε    effective → effect
10 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 7a: Cleanup
- (M>1) E → ε    probate → probat, rate → rate
- (M=1 and not <o>) E → ε    cease → ceas
- Step 7b: More Cleanup
- (M>1 and <d> and <L>) → single letter    controll → control, roll → roll
11 Porter Stemmer
- Errors of Omission (forms that should conflate but do not)
- European / Europe
- analysis / analyzes
- matrices / matrix
- noise / noisy
- explain / explanation
- Errors of Commission (forms that conflate but should not)
- organization / organ
- doing / doe
- generalization / generic
- numerical / numerous
- university / universe
- From Krovetz '93
12 Why (not) Statistics for NLP?
- Pro
- Disambiguation
- Error Tolerant
- Learnable
- Con
- Not always appropriate
- Difficult to debug
13 Weighted Automata/Transducers
- Speech recognition: storing a pronunciation lexicon
- Augmentation of FSAs: each arc is associated with a probability (see the sketch below)
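A minimal illustration of the idea. The states, arcs, and numbers here are invented for illustration (they are not the actual "about" network on the next slide): a weighted automaton is an FSA whose arcs carry probabilities, so the probability of a path is the product of its arc probabilities.

```python
# Toy weighted automaton: arcs[state][symbol] = (next_state, probability).
# States, phone symbols, and probabilities are made up for illustration.
arcs = {
    0: {"ax": (1, 0.68), "ix": (1, 0.32)},   # reduced first vowel variants
    1: {"b": (2, 1.00)},
    2: {"aw": (3, 1.00)},
    3: {"t": (4, 0.70), "dx": (4, 0.30)},    # final stop may surface as a flap
}
FINAL_STATE = 4

def path_probability(symbols):
    """Multiply arc probabilities along the path spelled out by `symbols`."""
    state, prob = 0, 1.0
    for symbol in symbols:
        if symbol not in arcs.get(state, {}):
            return 0.0                       # no such arc: path not accepted
        state, p = arcs[state][symbol]
        prob *= p
    return prob if state == FINAL_STATE else 0.0

print(path_probability(["ax", "b", "aw", "t"]))   # 0.68 * 1.0 * 1.0 * 0.7 ≈ 0.48
```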
14 Pronunciation network for "about" [figure]
15 Noisy Channel [figure]
16 Probability Definitions
- Experiment (trial)
- A repeatable procedure with well-defined possible outcomes
- Sample space
- The complete set of outcomes
- Event
- Any subset of outcomes from the sample space
- Random variable
- An uncertain outcome in a trial
17 More Definitions
- Probability
- How likely is it to get a particular outcome?
- The rate of getting that outcome across all trials
- e.g., the probability of drawing a spade from 52 well-shuffled playing cards
- Distribution: the probabilities associated with each outcome a random variable can take
- Each outcome has a probability between 0 and 1
- The sum of all outcome probabilities is 1
18 Conditional Probability
- What is P(A|B)?
- First, what is P(A)?
- P(It is raining) = .06
- Now what about P(A|B)?
- P(It is raining | It was clear 10 minutes ago) = .004
- Note: P(A,B) = P(A|B) P(B)
- Also: P(A,B) = P(B,A)
19 Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) P(B) iff A, B independent
- P(heads, tails) = P(heads) P(tails) = .5 × .5 = .25
- P(doctor, blue-eyes) = P(doctor) P(blue-eyes) = .01 × .2 = .002
- What else holds if A, B are independent?
- P(A|B) = P(A) iff A, B independent
- Also P(B|A) = P(B) iff A, B independent
20 Bayes' Theorem
- Swaps the order of dependence (see the equation below)
- Sometimes it is easier to estimate one kind of dependence than the other
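The theorem itself, in standard form (the slide presumably showed it as an image), together with the consequence used by the noisy channel slides that follow:

```latex
% Bayes' theorem: lets us swap which conditional we estimate.
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}

% In the noisy-channel setting of the next slides (hypothesis H, observation O),
% P(O) is constant across hypotheses, so
\operatorname*{argmax}_{H} P(H \mid O) \;=\; \operatorname*{argmax}_{H} P(O \mid H)\,P(H)
```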
21 What does this have to do with the Noisy Channel Model?
[Diagram: the observed output (O) is related to the hypothesized source (H) via the noisy channel]
22 Noisy Channel Applied to Word Recognition
- argmax_w P(w|O) = argmax_w P(O|w) P(w)
- Simplifying assumptions
- the pronunciation string is correct
- word boundaries are known
- Problem
- Given the phone sequence [n iy], what is the correct dictionary word?
- What do we need?
- [n iy]: knee, neat, need, new
23 What is the most likely word given [n iy]?
- Now compute the likelihood P([n iy]|w) for each candidate word, then multiply by the prior P(w) (sketch below)
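A sketch of the computation with invented numbers (the priors and likelihoods below are placeholders for illustration only; the lecture's actual table is not reproduced here):

```python
# Hypothetical priors P(w) and likelihoods P([n iy] | w) -- illustrative values only.
prior = {"knee": 0.00002, "neat": 0.00005, "need": 0.0004, "new": 0.002}
likelihood = {"knee": 1.0, "neat": 0.5, "need": 0.1, "new": 0.4}

# Noisy channel / Bayes: pick the word maximizing P(O|w) * P(w).
scores = {w: likelihood[w] * prior[w] for w in prior}
best = max(scores, key=scores.get)
print(best, scores[best])
# With these made-up numbers the unigram model prefers "new";
# the next slide adds context (N-grams) to repair exactly this kind of mistake.
```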
24 Why N-grams?
- Compute likelihood P([n iy]|w), then multiply
- The unigram approach ignores context
- Need to factor in context (N-grams)
- Use P(need|I) instead of just P(need)
- Note: P(new|I) < P(need|I)
25 Next Word Prediction (borrowed from J. Hirschberg)
- From a NY Times story...
- Stocks plunged this ...
- Stocks plunged this morning, despite a cut in interest rates ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...
26 Next Word Prediction (cont.)
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.
27 Human Word Prediction
- Domain knowledge
- Syntactic knowledge
- Lexical knowledge
28 Claim
- A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques
- Compute
- the probability of a sequence
- the likelihood of words co-occurring
29 Why would we want to do this?
- Rank the likelihood of sequences containing various alternative hypotheses
- Assess the likelihood of a hypothesis
30 Why is this useful?
- Speech recognition
- Handwriting recognition
- Spelling correction
- Machine translation systems
- Optical character recognizers
31 Handwriting Recognition
- Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)
- NLP to the rescue...
- "gub" is not a word
- gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank
32 Real Word Spelling Errors
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of.
- He is trying to fine out.
33 For Spell Checkers
- Collect a list of commonly substituted words
- piece/peace, whether/weather, their/there ...
- Example: "On Tuesday, the whether ..." → "On Tuesday, the weather ..."
34 Language Model
- Definition: a language model is a model that enables one to compute the probability, or likelihood, of a sentence S, P(S)
- Let's look at different ways of computing P(S) in the context of word prediction
35 Word Prediction: Simple vs. Smart
- Simple: every word follows every other word with equal probability (0-gram)
- Assume V is the size of the vocabulary
- Likelihood of a sentence S of length n is 1/V × 1/V × ... × 1/V
- If English has 100,000 words, the probability of each next word is 1/100,000 = .00001
- Smarter: the probability of each next word is related to its word frequency (unigram)
- Likelihood of sentence S = P(w1) × P(w2) × ... × P(wn)
- Assumes the probability of each word is independent of the probabilities of the other words
- Even smarter: look at the probability given the previous words (N-gram)
- Likelihood of sentence S = P(w1) × P(w2|w1) × ... × P(wn|wn-1)
- Assumes the probability of each word depends on the probabilities of the preceding words
36 Chain Rule
- Conditional probability
- P(A1,A2) = P(A1) P(A2|A1)
- The Chain Rule generalizes to multiple events
- P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1, ..., An-1)
- Examples
- P(the dog) = P(the) P(dog | the)
- P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
37 Relative Frequencies and Conditional Probabilities
- Relative word frequencies are better than equal probabilities for all words
- In a corpus with 10K word types, each word would have P(w) = 1/10K
- This does not match our intuition that different words are more likely to occur (e.g., "the")
- Conditional probability is more useful than individual relative word frequencies
- "dog" may be relatively rare in a corpus
- But if we see "barking", P(dog|barking) may be very large
38 For a Word String
- In general, the probability of a complete string of words w1 ... wn is
- P(w1 ... wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 ... wn-1)
- But this approach to determining the probability of a word sequence is not very helpful in general
39 Markov Assumption
- How do we compute P(wn|w1 ... wn-1)? Trick: instead of P(rabbit | I saw a), we use P(rabbit | a)
- This lets us collect statistics in practice
- A bigram model: P(the barking dog) = P(the|<start>) P(barking|the) P(dog|barking)
- Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past
- Specifically, for N=2 (bigram): P(w1 ... wn) ≈ P(w1) ∏ P(wk|wk-1) (written out below)
- Order of a Markov model = length of prior context
- bigram is first order, trigram is second order, ...
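Putting the exact decomposition and the bigram approximation side by side, in standard notation consistent with the formulas above:

```latex
% Exact chain-rule decomposition of a word string:
P(w_1^n) \;=\; \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})

% First-order Markov (bigram) approximation, with w_0 = \text{<start>}:
P(w_1^n) \;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-1})
```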
40 Counting Words in Corpora
- What is a word?
- e.g., are "cat" and "cats" the same word?
- "September" and "Sept"?
- "zero" and "oh"?
- Is "seventy-two" one word or two? "AT&T"?
- Punctuation?
- How many words are there in English?
- Where do we find the things to count?
41 Corpora
- Corpora are (generally online) collections of text and speech
- Examples
- Brown Corpus (1M words)
- Wall Street Journal and AP News corpora
- ATIS, Broadcast News (speech)
- TDT (text and speech)
- Switchboard, Call Home (speech)
- TRAINS, FM Radio (speech)
42 Training and Testing
- Probabilities come from a training corpus, which is used to design the model
- overly narrow corpus: probabilities don't generalize
- overly general corpus: probabilities don't reflect the task or domain
- A separate test corpus is used to evaluate the model, typically using standard metrics
- held-out test set
- cross-validation
- evaluation differences should be statistically significant
43 Terminology
- Sentence: unit of written language
- Utterance: unit of spoken language
- Word form: the inflected form that appears in the corpus
- Lemma: lexical forms having the same stem, part of speech, and word sense
- Types (V): number of distinct words that might appear in a corpus (vocabulary size)
- Tokens (N): total number of words in a corpus (see the counting sketch below)
- Types seen so far (T): number of distinct words seen so far in the corpus (smaller than V and N)
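A quick sketch of the type/token distinction (the sample sentence is my own):

```python
from collections import Counter

# Hypothetical toy corpus for illustration.
corpus = "the dog barks and the dog bites the postman".split()

tokens = len(corpus)        # N: total number of running words
types = len(set(corpus))    # T: distinct words seen so far
counts = Counter(corpus)

print(f"N = {tokens} tokens, T = {types} types")   # N = 9 tokens, T = 6 types
print(counts.most_common(3))                       # [('the', 3), ('dog', 2), ...]
```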
44 Simple N-Grams
- An N-gram model uses the previous N-1 words to predict the next one: P(wn | wn-N+1 ... wn-1) (extraction sketch below)
- unigrams: P(dog)
- bigrams: P(dog | big)
- trigrams: P(dog | the big)
- quadrigrams: P(dog | chasing the big)
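Extracting the n-grams themselves is a one-liner over a token list; a sketch (the function name is mine, the sentence is the slides' running example):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I want to eat Chinese food".split()
print(ngrams(sentence, 2))
# [('I', 'want'), ('want', 'to'), ('to', 'eat'), ('eat', 'Chinese'), ('Chinese', 'food')]
print(ngrams(sentence, 3))
# [('I', 'want', 'to'), ('want', 'to', 'eat'), ('to', 'eat', 'Chinese'), ('eat', 'Chinese', 'food')]
```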
45 Using N-Grams
- Recall that
- N-gram: P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1)
- Bigram: P(w1 ... wn) ≈ P(w1) ∏ P(wk|wk-1)
- For a bigram grammar
- P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence
- Example: P(I want to eat Chinese food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese)
46 A Bigram Grammar Fragment from BERP
Eat on .16 Eat Thai .03
Eat some .06 Eat breakfast .03
Eat lunch .06 Eat in .02
Eat dinner .05 Eat Chinese .02
Eat at .04 Eat Mexican .02
Eat a .04 Eat tomorrow .01
Eat Indian .04 Eat dessert .007
Eat today .03 Eat British .001
47 Additional BERP Grammar
<start> I .25       Want some .04
<start> I'd .06     Want Thai .01
<start> Tell .04    To eat .26
<start> I'm .02     To have .14
I want .32          To spend .09
I would .29         To be .02
I don't .08         British food .60
I have .04          British restaurant .15
Want to .65         British cuisine .01
Want a .05          British lunch .01
48 Computing Sentence Probability
- P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
- vs. "I want to eat Chinese food" ≈ .00015
- Probabilities seem to capture syntactic facts and world knowledge
- "eat" is often followed by an NP
- British food is not too popular
- N-gram models can be trained by counting and normalization (see the sketch below)
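A sketch of the computation using the bigram probabilities from the BERP fragment above (only the eight probabilities below are taken from the slides; the function name and <start> spelling are mine):

```python
# Bigram probabilities copied from the BERP fragment on the previous slides.
bigram_prob = {
    ("<start>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
    ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.56,
}

def sentence_prob(words):
    """Multiply bigram probabilities, conditioning the first word on <start>."""
    p = 1.0
    for prev, word in zip(["<start>"] + words, words):
        p *= bigram_prob.get((prev, word), 0.0)
    return p

print(sentence_prob("I want to eat British food".split()))   # ~8.1e-06
print(sentence_prob("I want to eat Chinese food".split()))   # ~1.5e-04
```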
49 BERP Bigram Counts
          I     Want   To    Eat   Chinese  Food  Lunch
I         8     1087   0     13    0        0     0
Want      3     0      786   0     6        8     6
To        3     0      10    860   3        0     12
Eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
Food      19    0      17    0     0        0     0
Lunch     4     0      0     0     0        1     0
50 BERP Bigram Probabilities: Use Unigram Counts
- Normalization: divide the bigram count by the unigram count of the first word
          I     Want   To    Eat   Chinese  Food  Lunch
          3437  1215   3256  938   213      1506  459
- Computing the probability of "I I"
- P(I|I) = C(I I)/C(I) = 8 / 3437 = .0023 (see the sketch below)
- A bigram grammar is an N×N matrix of probabilities, where N is the vocabulary size
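A sketch that derives bigram probabilities from the counts on the previous two slides (the count dictionaries simply transcribe part of those tables; the function name is mine):

```python
# Bigram and unigram counts transcribed from the BERP slides.
bigram_count = {("I", "I"): 8, ("I", "Want"): 1087, ("Want", "To"): 786,
                ("To", "Eat"): 860, ("Eat", "Chinese"): 19, ("Chinese", "Food"): 120}
unigram_count = {"I": 3437, "Want": 1215, "To": 3256, "Eat": 938,
                 "Chinese": 213, "Food": 1506, "Lunch": 459}

def bigram_mle(prev, word):
    """Relative-frequency (maximum likelihood) estimate: C(prev word) / C(prev)."""
    return bigram_count.get((prev, word), 0) / unigram_count[prev]

print(round(bigram_mle("I", "I"), 4))          # 8 / 3437    = 0.0023
print(round(bigram_mle("I", "Want"), 4))       # 1087 / 3437 ≈ 0.3163 (the .32 on slide 47)
print(round(bigram_mle("Eat", "Chinese"), 4))  # 19 / 938    ≈ 0.0203 (the .02 on slide 46)
```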
51 Learning a Bigram Grammar
- The formula P(wn|wn-1) = C(wn-1 wn)/C(wn-1) is used for bigram parameter estimation
- Relative frequency
- Maximum Likelihood Estimation (MLE): the parameter set maximizes the likelihood of the training set T given the model M, P(T|M)
52 What do we learn about the language?
- What about...
- P(I | I) = .0023
- P(I | want) = .0025
- P(I | food) = .013
- What's being captured with...
- P(want | I) = .32
- P(to | want) = .65
- P(eat | to) = .26
- P(food | Chinese) = .56
- P(lunch | eat) = .055
53 Readings for next time