Title: BASIC TECHNIQUES IN STATISTICAL NLP
1. BASIC TECHNIQUES IN STATISTICAL NLP
- Word prediction, n-grams, smoothing
2. Statistical Methods in NLE
- Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
  - VARIETY (no programmer can really take into account all possibilities)
  - AMBIGUITY (need to have ways of choosing between alternatives)
- In a number of NLE applications, statistical methods are very common
- The simplest application: WORD PREDICTION
3. We are good at word prediction
Stocks plunged this morning, despite a cut in interest...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began...
4. Real Spelling Errors
- They are leaving in about fifteen minuets to go to her house
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than one year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of this problem.
- He is trying to fine out.
5. Handwriting recognition
- From Woody Allen's Take the Money and Run (1969)
- Allen (a bank robber) walks up to the teller and hands her a note that reads, "I have a gun. Give me all your cash."
- The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun," Allen says.
- "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.
6. Applications of word prediction
- Spelling checkers
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users
7. Statistics and word prediction
- The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error
- I.e., to compute P(w | w1 ... wN-1) for all words w, and predict as the next word the one for which this (conditional) probability is highest (a small sketch in code follows below)
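To make the idea concrete, here is a minimal Python sketch (not part of the original slides) of prediction by maximum conditional probability; the toy probability table and the function names are assumptions purely for illustration.

# Toy stand-in for conditional probabilities P(w | history); in practice
# these would be estimated from a corpus (see the following slides).
toy_probs = {
    ("a", "cut"): {"in": 0.5, "of": 0.3, "above": 0.2},
}

def predict_next(history, vocab, cond_probs):
    """Return the word w that maximizes P(w | history)."""
    dist = cond_probs.get(tuple(history), {})
    return max(vocab, key=lambda w: dist.get(w, 0.0))

print(predict_next(["a", "cut"], ["in", "of", "above"], toy_probs))  # -> 'in'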
8. Using corpora to estimate probabilities
- But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
- The simplest method: Maximum Likelihood Estimation (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
- "Maximum" because it doesn't waste any probability mass on events not in the corpus
9. Maximum Likelihood Estimation for conditional probabilities
- In order to estimate P(w | w1 ... wN), we can instead use the ratio of counts C(w1 ... wN w) / C(w1 ... wN)
- Cfr. the definition of conditional probability:
- P(A | B) = P(A, B) / P(B)
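A minimal Python sketch of the MLE estimate for bigram probabilities, assuming whitespace tokenization of a made-up toy corpus:

from collections import Counter

# MLE bigram probabilities by relative frequency: P(w | v) = C(v w) / C(v).
corpus = "the big dog bit the big cat the big dog ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w, v):
    """P(w | v) estimated as the relative frequency of the bigram 'v w'."""
    return bigram_counts[(v, w)] / unigram_counts[v]

print(p_mle("dog", "big"))  # C(big dog) = 2, C(big) = 3  ->  0.66...
print(p_mle("cat", "big"))  # C(big cat) = 1, C(big) = 3  ->  0.33...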
10. Aside: counting words in corpora
- Keep in mind that it's not always so obvious what a word is (cfr. yesterday)
- In text:
  - He stepped out into the hall, was delighted to encounter a brother. (From the Brown corpus.)
- In speech:
  - I do uh main- mainly business data processing
- LEMMAS: cats vs cat
- TYPES vs. TOKENS
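A quick Python illustration of the types vs. tokens distinction, under the simplifying assumptions of whitespace tokenization and lowercasing (which is not real lemmatization):

text = "He stepped out into the hall , was delighted to encounter a brother ."
tokens = text.split()                       # running words (and punctuation marks)

print("tokens:", len(tokens))               # every occurrence counts
print("types :", len(set(tokens)))          # distinct word forms
print("lowercased types:", len({t.lower() for t in tokens}))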
11. The problem: sparse data
- In principle, we would like the n of our models to be fairly large, to model long-distance dependencies such as:
  - Sue SWALLOWED the large green ...
- However, in practice, sequences of words of length greater than 3 hardly ever occur in our corpora! (See below)
- (Part of the) solution: we APPROXIMATE the probability of a word given all previous words
12. The Markov Assumption
- The probability of being in a certain state only depends on the previous state:
  P(Xn = Sk | X1 ... Xn-1) = P(Xn = Sk | Xn-1)
- This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite
- (N-gram models / Markov models can be seen as probabilistic finite-state automata)
13. The Markov assumption for language: n-gram models
- Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n-1 words (N-GRAM model)
14. Bigrams and trigrams
- Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
- P(Wn | W1 ... Wn-1) ≈ P(Wn | Wn-2, Wn-1)   (trigram case)
- P(W1 ... Wn) ≈ ∏ P(Wi | Wi-2, Wi-1)
- What the bigram model means in practice:
  - Instead of P(rabbit | Just the other day I saw a)
  - we use P(rabbit | a)
- Unigram: P(dog)   Bigram: P(dog | big)   Trigram: P(dog | the, big)
15. The chain rule
- So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
- E.g., P(the big dog) = P(the) P(big | the) P(dog | the big)
- Then we use the Markov assumption to reduce this to manageable proportions
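A minimal Python sketch of the chain rule under the bigram Markov assumption: the sentence probability becomes a product of bigram probabilities. The probability table here is invented for illustration, with <s> marking the sentence start.

# P(w1 ... wn) ≈ product over i of P(wi | wi-1), with <s> as sentence start.
bigram_p = {
    ("<s>", "the"): 0.2, ("the", "big"): 0.1, ("big", "dog"): 0.3,
}

def sentence_prob(words, bigram_p):
    """Approximate P(sentence) as the product of bigram probabilities."""
    prob = 1.0
    for prev, w in zip(["<s>"] + words, words):
        prob *= bigram_p.get((prev, w), 0.0)
    return prob

print(sentence_prob(["the", "big", "dog"], bigram_p))  # 0.2 * 0.1 * 0.3 = 0.006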
16. Example: the Berkeley Restaurant Project (BERP) corpus
- BERP is a speech-based restaurant consultant
- The corpus contains user queries; examples include:
  - I'm looking for Cantonese food
  - I'd like to eat dinner someplace nearby
  - Tell me about Chez Panisse
  - I'm looking for a good place to eat breakfast
17. Computing the probability of a sentence
- Given a corpus like BERP, we can compute the probability of a sentence like "I want to eat Chinese food"
- Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
- P(I want to eat Chinese food) ≈
  P(I | sentence start) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
18. Bigram counts
19. How the bigram probabilities are computed
- Example: P(I | I)
- C(I I) = 8
- C(I) = 8 + 1087 + 13 + ... = 3437
- P(I | I) = 8 / 3437 = .0023
20. Bigram probabilities
21. The probability of the example sentence
- P(I want to eat Chinese food) ≈
  P(I | sentence start) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
- = .25 × .32 × .65 × .26 × .002 × .60 = .000016
22. Examples of actual bigram probabilities computed using BERP
23. Visualizing an n-gram-based language model: the Shannon/Miller/Selfridge method
- For unigrams:
  - Choose a random value r between 0 and 1
  - Print out the word w whose probability interval contains r
- For bigrams:
  - Choose a random bigram P(w | <s>) to start
  - Then pick bigrams to follow as before
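A Python sketch of the generation procedure, assuming a small hand-made table of bigram probabilities with <s> and </s> as sentence boundaries (the table is an illustration, not BERP or Shakespeare data):

import random

bigram_p = {
    "<s>":   {"I": 0.6, "Tell": 0.4},
    "I":     {"want": 0.7, "would": 0.3},
    "want":  {"to": 1.0},
    "to":    {"eat": 0.8, "go": 0.2},
    "eat":   {"</s>": 1.0},
    "go":    {"</s>": 1.0},
    "would": {"</s>": 1.0},
    "Tell":  {"</s>": 1.0},
}

def sample_next(prev):
    """Pick w with probability P(w | prev): choose r in [0,1) and walk the
    cumulative distribution until it exceeds r."""
    r = random.random()
    cumulative = 0.0
    for w, p in bigram_p[prev].items():
        cumulative += p
        if r < cumulative:
            return w
    return w  # guard against floating-point rounding

def generate():
    words, prev = [], "<s>"
    while True:
        nxt = sample_next(prev)
        if nxt == "</s>":
            return " ".join(words)
        words.append(nxt)
        prev = nxt

print(generate())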
24. The Shannon/Miller/Selfridge method trained on Shakespeare
25. Approximating Shakespeare, cont'd
26. A more formal evaluation mechanism
27. The downside
- The entire Shakespeare oeuvre consists of:
  - 884,647 tokens (N)
  - 29,066 types (V)
  - 300,000 bigrams
- All of Jane Austen's novels (on Manning and Schuetze's website):
  - N = 617,091 tokens
  - V = 14,585 types
28. Comparing Austen n-grams: unigrams
In person she was inferior to

1-gram   P(.)              P(.)              P(.)              P(.)
1        the .034          the .034          the .034          the .034
2        to .032           to .032           to .032           to .032
3        and .030          and .030          and .030
8        was .015          was .015
13       she .011
1701     inferior .00005
29. Comparing Austen n-grams: bigrams
In person she was inferior to

2-gram   P(.|person)       P(.|she)          P(.|was)          P(.|inferior)
1        and .099          had .141          not .065          to .212
2        who .099          was .122          a .052
23       she .009
                                             inferior 0
30. Comparing Austen n-grams: trigrams
In person she was inferior to

3-gram   P(.|In,person)    P(.|person,she)   P(.|she,was)      P(.|was,inferior)
1        UNSEEN            did .05           not .057          UNSEEN
2                          was .05           very .038
                                             inferior 0
31. Maybe with a larger corpus?
- Words such as "ergativity" are unlikely to be found outside a corpus of linguistic articles
- More generally: Zipf's law
32. Zipf's law for the Brown corpus
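A small Python sketch of how one might check Zipf's law (frequency roughly inversely proportional to rank) on a corpus; "corpus.txt" is just a placeholder path, not a file provided with these slides.

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:      # placeholder corpus file
    tokens = f.read().lower().split()

counts = Counter(tokens).most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        word, freq = counts[rank - 1]
        # Under Zipf's law, rank * freq stays roughly constant.
        print(f"rank {rank:5d}  freq {freq:7d}  rank*freq {rank*freq:9d}  {word}")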
33. Addressing the zeroes
- SMOOTHING: re-evaluating some of the zero-probability and low-probability n-grams and assigning them non-zero probabilities
  - Add-one
  - Witten-Bell
  - Good-Turing
- BACK-OFF: using the probabilities of lower-order n-grams when higher-order ones are not available
  - Backoff
  - Linear interpolation
34. Add-one (Laplace's Law)
35. Effect on BERP bigram counts
36. Add-one bigram probabilities
37. The problem
38. The problem
- Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28!
- Too much probability mass gets removed from n-grams actually encountered
- (more precisely, the discount factor c*/c shows how sharply the original counts are reduced; see the sketch below)
39. Witten-Bell Discounting
- How can we get a better estimate of the probabilities of things we haven't seen?
- The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn't happened yet
- How often do such events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type)
40. Witten-Bell: the equations
- Total probability mass assigned to zero-frequency N-grams: T / (N + T), where N is the number of tokens and T the number of observed types
- (NB: T is OBSERVED types, not V)
- So each of the Z zero-count N-grams gets the probability T / (Z (N + T))
41. Witten-Bell: why discounting
- Now of course we have to take away something (the "discount") from the probability of the events already seen: an N-gram seen c times gets probability c / (N + T) instead of c / N
42. Witten-Bell for bigrams
- We relativize the types to the previous word: for a context w, T(w) is the number of distinct word types seen after w, and the zero-count mass T(w) / (C(w) + T(w)) is divided among the Z(w) unseen continuations (sketched in code below)
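A minimal Python sketch of Witten-Bell smoothing for bigrams on a made-up toy corpus, following the equations above; the helper names are assumptions for illustration.

from collections import Counter, defaultdict

corpus = "I want to eat Chinese food I want to eat lunch".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

followers = defaultdict(Counter)        # followers[v][w] = C(v w)
for (v, w), c in bigrams.items():
    followers[v][w] = c

def p_wb(w, v):
    seen = followers[v]
    C = sum(seen.values())              # total bigram count for context v
    T = len(seen)                       # observed continuation types T(v)
    Z = len(vocab) - T                  # unseen continuation types Z(v)
    if w in seen:
        return seen[w] / (C + T)        # discounted probability of a seen bigram
    return T / (Z * (C + T)) if Z else 0.0   # share of the zero-count mass

print(p_wb("to", "want"))     # seen: 2 / (2 + 1) = 0.67 (discounted from 1.0)
print(p_wb("lunch", "want"))  # unseen: 1 / (6 * 3) = 0.056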
43. Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus
Word Add-One Witten-Bell
I .68 .97
want .42 .94
to .69 .96
eat .37 .88
Chinese .12 .91
food .48 .94
lunch .22 .91
44. One last discounting method...
- The best-known discounting method is GOOD-TURING (Good, 1953)
- Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of N-grams that occurred once
- For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred (see the sketch below)
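A minimal Python sketch of the Good-Turing re-estimation idea on a made-up toy corpus; Nc is the number of distinct bigrams seen c times.

from collections import Counter

corpus = "I want to eat Chinese food I want to eat lunch".split()
bigrams = Counter(zip(corpus, corpus[1:]))

count_of_counts = Counter(bigrams.values())      # Nc: how many bigrams occur c times
V = len(set(corpus))
N0 = V * V - len(bigrams)                        # possible bigrams never seen
N1 = count_of_counts[1]                          # bigrams seen exactly once

# Revised count for a never-seen bigram, as on the slide: c0* = N1 / N0.
print("c* for unseen bigrams:", N1 / N0)

# More generally, Good-Turing sets c* = (c + 1) * N(c+1) / Nc for seen counts.
def gt_count(c):
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

print("c* for bigrams seen once:", gt_count(1))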
45. Combining estimators
- A method often used (generally in combination with discounting methods) is to use lower-order estimates to help with higher-order ones:
  - Backoff (Katz, 1987)
  - Linear interpolation (Jelinek and Mercer, 1980)
46. Backoff: the basic idea
47. Backoff with discounting
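To illustrate the combining idea, here is a minimal Python sketch of linear interpolation (the Jelinek-Mercer variant mentioned above) mixing trigram, bigram and unigram MLE estimates. The corpus and the fixed lambda weights are assumptions for illustration; in practice the lambdas are tuned on held-out data, and Katz backoff would additionally require discounting and normalization.

from collections import Counter

corpus = "I want to eat Chinese food I want to eat lunch".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) ≈ l3*P_MLE(w | u v) + l2*P_MLE(w | v) + l1*P_MLE(w)."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("eat", "want", "to"))    # supported by trigram, bigram and unigram evidence
print(p_interp("lunch", "want", "to"))  # unseen trigram and bigram: falls back on the unigram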
48. Readings
- Jurafsky and Martin, chapter 6
- The Statistics Glossary
- Word prediction:
  - For mobile phones
  - For disabled users
- Further reading: Manning and Schuetze, chapter 6 (Good-Turing)
49. Acknowledgments
- Some of the material in these slides was taken from lecture notes by Diane Litman and James Martin