Title: LING 406 Intro to Computational Linguistics: Estimation and Smoothing
1 LING 406 Intro to Computational Linguistics: Estimation and Smoothing
- Richard Sproat
- URL: http://catarina.ai.uiuc.edu/L406_08/
2 This Lecture
- N-gram models
- Sparse data
- Smoothing
- Add One
- Witten-Bell
- Good-Turing
- Backoff
- Other issues
- Good-Turing and Word Frequency Distributions
- Good-Turing and Morphological Productivity
3 N-gram models
- Remember the chain rule:
  P(w1 w2 w3 ... wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) ...
- Problem: we can't model all of these conditional probabilities.
- N-gram models approximate P(w1 w2 w3 ... wn) by setting a bound on the amount of previous context.
- This is the Markov assumption, and n-grams are often termed Markov models (a bigram sketch follows below).
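To make the Markov assumption concrete, here is a minimal Python sketch of the bigram approximation, in which a sentence's probability is the product of P(wi | wi-1). The probability table and the <s>/</s> sentence markers are invented for illustration and are not figures from the lecture.

```python
# Minimal sketch of the Markov assumption: the chain rule conditions each word on its
# full history, while a bigram model truncates the history to just the previous word.
# The probability table below is a made-up illustration, not data from the lecture.

p_bigram = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "chinese"): 0.07,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.40,
}

def p_sentence_bigram(words):
    """P(w1 ... wn) approximated as the product of P(wi | wi-1)."""
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram.get((prev, cur), 0.0)
    return prob

print(p_sentence_bigram(["<s>", "i", "want", "chinese", "food", "</s>"]))
# 0.25 * 0.33 * 0.07 * 0.52 * 0.40 ≈ 0.0012
```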
4 N-gram models
5 For example
6 Implementational detail
7 Example from the Berkeley Restaurant Project (BERP), approximately 10,000 sentences
8 BERP example
9 BERP example
10 BERP bigram counts
11 BERP bigram probabilities
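The BERP slides turn a table of bigram counts into conditional probabilities; the sketch below shows that same row-normalization step, P(w2 | w1) = C(w1 w2) / C(w1), on made-up counts rather than the actual BERP figures.

```python
# Sketch of turning a bigram count table into a conditional probability table by
# row-normalizing: P(w2 | w1) = C(w1 w2) / C(w1). The counts are toy numbers in the
# spirit of the BERP tables, not the actual BERP figures.
from collections import Counter

bigram_counts = Counter({
    ("i", "want"): 800,
    ("i", "would"): 60,
    ("want", "to"): 600,
    ("want", "chinese"): 6,
})
unigram_counts = Counter({"i": 2500, "want": 900})

def bigram_prob(w1, w2):
    """MLE bigram probability from counts."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(round(bigram_prob("i", "want"), 3))       # 0.32
print(round(bigram_prob("want", "chinese"), 4)) # 0.0067
```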
12 What do we learn about the language?
13 Approximating Shakespeare
- As we increase the value of N, the accuracy of the n-gram model increases (a generation sketch follows this list).
- Generating sentences with random unigrams:
  - Every enter now severally so, let
  - Hill he late speaks or! a more to leg less first you enter
- With bigrams:
  - What means, sir. I confess she? then all sorts, he is trim, captain.
  - Why dost stand forth thy canopy, forsooth he is this palpable hit the King Henry.
- Trigrams:
  - Sweet prince, Falstaff shall die.
  - This shall forbid it should be branded, if renown made it empty.
- Tetragrams:
  - What! I will go seek the traitor Gloucester.
  - Will you not tell me who I am?
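One way to reproduce the exercise above is to train an n-gram model and sample from it word by word; below is a rough bigram-sampling sketch in Python. The tiny training text is a placeholder, whereas the slide's output came from a model trained on Shakespeare's complete works.

```python
# Sketch of the "approximating Shakespeare" exercise: sample random sentences from an
# n-gram (here bigram) model trained on whatever text you supply. The tiny training
# text is a placeholder; the results on the slide used Shakespeare's complete works.
import random
from collections import Counter, defaultdict

text = "sweet prince falstaff shall die . will you not tell me who i am ?".split()
tokens = ["<s>"] + text + ["</s>"]

successors = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    successors[w1][w2] += 1

def generate(max_len=20):
    """Sample one sentence by repeatedly drawing the next word from P(w | previous)."""
    word, out = "<s>", []
    for _ in range(max_len):
        choices = successors[word]
        word = random.choices(list(choices), weights=choices.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```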
14 Approximating Shakespeare
- There are 884,647 tokens, with 29,066 word-form types, in Shakespeare's works.
- Shakespeare produced 300,000 bigram types out of 844 million possible bigrams; so 99.96% of the possible bigrams were never seen (have zero entries in the table).
- Tetragrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
- The zeroes in the table are causing problems: we are being forced down a path of selecting only the tetragrams that Shakespeare used, not a very good model of Shakespeare, in fact.
- This is the sparse data problem.
15 Sparse data
- In fact the sparse data problem extends beyond zeroes.
- the occurs about 28,000 times in Shakespeare, so by the MLE:
  P(the) = 28000/884647 ≈ .032
- womenkind occurs once, so:
  P(womenkind) = 1/884647 ≈ .0000011
- Do we believe this?
16 N-gram training sensitivity
- If we repeated the Shakespeare experiment but trained on a Wall Street Journal corpus, there would be little overlap in the output.
- This has major implications for corpus selection or design.
17 Some useful empirical observations: a review
- A small number of events occur with high frequency.
- A large number of events occur with low frequency.
- You can quickly collect statistics on the high-frequency events.
- You might have to wait an arbitrarily long time to get valid statistics on low-frequency events.
- Some of the zeroes in the table are really zeroes. But others are simply low-frequency events you haven't seen yet.
- Whatever are we to do?
18 Smoothing: general issues
- Smoothing techniques manipulate the counts of the seen and unseen cases, replacing each count c by an adjusted count c*.
- Alternatively, we can view smoothing as producing an adjusted probability P* from an original probability P.
- More sophisticated smoothing techniques try to arrange it so that the probability estimates of the higher counts are not changed too much, since we tend to trust those.
19 Smoothing: Add One
20 Add One
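The Add One slides can be summarized as a small Python sketch of Laplace smoothing for bigrams, using the standard formulation P*(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V); the counts and vocabulary below are illustrative.

```python
# A minimal sketch of add-one (Laplace) smoothing for bigrams:
# P*(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V), where V is the vocabulary size.
# Counts and vocabulary here are illustrative.
from collections import Counter

vocab = {"i", "want", "to", "eat", "chinese", "food"}
V = len(vocab)
bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6, ("want", "chinese"): 1})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7, "eat": 5, "chinese": 2, "food": 2})

def p_addone(w1, w2):
    """Add-one smoothed bigram probability."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_addone("i", "want"))    # seen bigram
print(p_addone("i", "food"))    # unseen bigram still gets non-zero probability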
21 Witten-Bell
22 Witten-Bell
23 Witten-Bell
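As a rough companion to the Witten-Bell slides, here is a sketch of the method in its simple unigram form, where the probability mass reserved for unseen types is T / (N + T); the counts and the assumed total vocabulary size are made up.

```python
# Sketch of Witten-Bell smoothing in its simple unigram form: the mass reserved for
# unseen types is T / (N + T), where N is the number of tokens seen and T the number
# of distinct types seen; this mass is split evenly over the Z unseen types, and seen
# types get c / (N + T). The counts and vocabulary size are made up.
from collections import Counter

counts = Counter({"the": 10, "of": 5, "want": 2, "chinese": 1})
N = sum(counts.values())          # tokens observed
T = len(counts)                   # types observed
total_vocab = 10                  # assumed size of the full vocabulary
Z = total_vocab - T               # types never observed

def p_witten_bell(word):
    """Witten-Bell smoothed unigram probability."""
    if counts[word] > 0:
        return counts[word] / (N + T)
    return T / (Z * (N + T))

print(p_witten_bell("the"))       # a seen word
print(p_witten_bell("food"))      # an unseen word gets a small share of T/(N+T)
```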
24 Good-Turing
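A minimal sketch of the Good-Turing re-estimate discussed on this slide, assuming the standard formulation c* = (c + 1) N_{c+1} / N_c; the frequency spectrum is invented, and the smoothing of the N_c values needed for large c is omitted.

```python
# Sketch of the core Good-Turing re-estimate: a count c is replaced by
# c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of types seen exactly c times,
# and the total probability mass assigned to unseen events is N_1 / N. In practice the
# N_c counts must be smoothed for large c (not shown); the frequency spectrum is invented.
from collections import Counter

counts = Counter({"the": 5, "of": 2, "want": 2, "chinese": 1, "food": 1, "eat": 1})
N = sum(counts.values())                         # total tokens
freq_of_freqs = Counter(counts.values())         # N_c: how many types occur c times

def adjusted_count(c):
    """Good-Turing adjusted count c*; falls back to c when N_{c+1} is zero."""
    if freq_of_freqs[c + 1] == 0:
        return c
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

p_unseen_total = freq_of_freqs[1] / N            # mass reserved for all unseen types
print(adjusted_count(1), p_unseen_total)         # 1.33..., 0.25
```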
25 Backoff
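Below is a deliberately simplified backoff sketch, not the full Katz scheme presumably on the slide: it just falls back to the unigram estimate when a bigram is unseen, and omits the discounting and alpha normalization that proper backoff requires.

```python
# Simplified backoff sketch: use the bigram estimate when the bigram has been seen,
# otherwise fall back to the unigram estimate. Full Katz backoff additionally discounts
# the seen n-grams (e.g., with Good-Turing) and scales the backed-off mass by a
# normalizing alpha so the distribution still sums to one; that bookkeeping is omitted.
from collections import Counter

bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7, "eat": 5})
N = sum(unigram_counts.values())

def p_backoff(w1, w2):
    """Bigram probability, backing off to the unigram when the bigram count is zero."""
    if bigram_counts[(w1, w2)] > 0:
        return bigram_counts[(w1, w2)] / unigram_counts[w1]
    return unigram_counts[w2] / N            # back off to the unigram estimate

print(p_backoff("i", "want"))   # seen bigram: conditional MLE
print(p_backoff("i", "eat"))    # unseen bigram: unigram probability of "eat"
```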
26 Deleted Interpolation
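A sketch of interpolation for trigrams: the trigram, bigram, and unigram MLE estimates are mixed with weights summing to one. In deleted interpolation the lambdas are estimated on held-out data (e.g., by EM); here they are fixed by hand purely for illustration.

```python
# Sketch of (deleted) interpolation for trigrams: mix the trigram, bigram, and unigram
# MLE estimates with weights that sum to one. The lambdas here are hand-picked for
# illustration; in deleted interpolation they are tuned on held-out data.
from collections import Counter

trigram_counts = Counter({("i", "want", "to"): 5})
bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7})
N = sum(unigram_counts.values())

LAMBDAS = (0.6, 0.3, 0.1)        # weights for trigram, bigram, unigram; must sum to 1

def p_interp(w1, w2, w3):
    """Interpolated trigram probability."""
    p3 = trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)] if bigram_counts[(w1, w2)] else 0.0
    p2 = bigram_counts[(w2, w3)] / unigram_counts[w2] if unigram_counts[w2] else 0.0
    p1 = unigram_counts[w3] / N
    l3, l2, l1 = LAMBDAS
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("i", "want", "to"))
```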
27 Kneser-Ney modeling
- Lower-order n-grams are only used when higher-order n-grams are lacking.
- So build these lower-order n-grams to suit that situation (sketched below).
- New York is frequent.
- York is not too frequent, except after New.
- If the previous word is New, then we don't care about the unigram estimate of York.
- If the previous word is not New, then we don't want to be counting all those cases when New occurs before York.
28 Kneser-Ney Modeling
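The New York example can be captured with continuation counts, as in the sketch below: instead of asking how often York occurs, ask how many distinct words it follows. The toy bigram set is invented, and the exact Kneser-Ney formulas on the slide are not reproduced here.

```python
# Sketch of the Kneser-Ney "continuation" idea behind the New York example: instead of
# asking how often a word occurs, ask how many distinct words it follows. "York" may be
# frequent, but almost all of its bigram types have "New" on the left, so its
# continuation probability is small. The toy bigram set is invented for illustration.
from collections import Counter

bigrams = Counter({
    ("new", "york"): 50,
    ("the", "food"): 10,
    ("good", "food"): 5,
    ("chinese", "food"): 5,
})

def continuation_prob(word):
    """P_cont(w) = (# distinct bigram types ending in w) / (# distinct bigram types)."""
    ending_in_w = sum(1 for (w1, w2) in bigrams if w2 == word)
    return ending_in_w / len(bigrams)

print(continuation_prob("york"))   # follows only "new": 1/4
print(continuation_prob("food"))   # follows three different words: 3/4
# In full (interpolated) Kneser-Ney this continuation estimate replaces the raw unigram
# in the backed-off/interpolated term.
```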
29 Guess the training source
30 Guess the training source
31 Guess the training source
32 Guess the training source
33 For thine own amusement
- http://catarina.ai.uiuc.edu/ngramgen
34 Estimation techniques: miscellanea
- What if you have reason to doubt your counts? In some work that used to be recent but is now not so recent (Riley, Roark and Sproat, 2003), we've tried to generalize Good-Turing to the case where the counts are fractional, as in the (lattice) output of a speech recognizer.
- Chen and Goodman (1998), http://citeseer.nj.nec.com/22209.html, is an oft-cited study of these various techniques (and many others) and how effective they are.
- By the way, we haven't said anything about how one measures effectiveness.
- There are a couple of ways:
  - Actually use the n-gram language model in a real system (such as an ASR system)
  - Measure the perplexity on some held-out corpus (a sketch follows below)
- We'll get to those later.
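Since perplexity will come up later, here is a small sketch of how it is computed on held-out text: the exponential of the average negative log-probability per token. The uniform stand-in model is an assumption used only to keep the example self-contained.

```python
# Sketch of measuring perplexity on a held-out corpus: perplexity is the inverse
# probability of the test set, normalized by its length, i.e. exp of the average
# negative log-probability per token. The probability function here is a stand-in;
# in practice it would be a smoothed n-gram model trained on separate data.
import math

def perplexity(test_tokens, prob_fn):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | history))."""
    log_sum = 0.0
    for i, w in enumerate(test_tokens):
        p = prob_fn(w, test_tokens[:i])     # model probability of w given its history
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_tokens))

# Toy stand-in model: uniform over a 1000-word vocabulary -> perplexity is 1000.
print(perplexity(["i", "want", "chinese", "food"], lambda w, hist: 1 / 1000))
```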
35 Smoothing isn't just for n-grams
- The Good-Turing estimate of the probability mass of the unseen cases is related to the growth of the vocabulary.
- It gives you a measure of how likely it is that there are more where that came from.
- Hence it can be used to measure the productivity of a process.
36 Growth rate of the vocabulary (Baayen 2001)
37 Measures of morphological productivity
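One standard measure here is Baayen's productivity index P = n1 / N, the Good-Turing estimate of the probability that the next token of a morphological category is a previously unseen type; the sketch below computes it over an invented frequency list (the slide's actual measures and figures are not reproduced).

```python
# Sketch of Baayen's productivity index for an affix: P = n1 / N, where n1 is the number
# of hapax legomena (types occurring exactly once) formed with the affix and N is the
# total number of tokens containing the affix. This is the Good-Turing estimate of the
# probability that the next token with this affix is a new type. The frequency list is invented.
from collections import Counter

# token counts for word types formed with some affix, e.g. English -ness
type_counts = Counter({"happiness": 120, "darkness": 45, "awareness": 30,
                       "sheepishness": 1, "relatedness": 1, "wariness": 1})

n1 = sum(1 for c in type_counts.values() if c == 1)    # hapax legomena
N = sum(type_counts.values())                          # affix tokens
P = n1 / N
print(P)   # higher P -> more productive process
```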
38 Some sample P scores from Dutch and English
39 Related points
- Baayen and Sproat (1996) showed that the best predictor of the prior probability of a given usage of an unseen morphologically complex word is the most frequent usage among the hapax legomena (see http://acl.ldc.upenn.edu/J/J96/J96-2001.pdf).
- Sproat and Shih (1996) showed that root compounds in Chinese are productive, using a Good-Turing estimate.
40 Chinese root compounds
41 Summary
- N-gram models are an approximation to the correct model as given by the chain rule.
- N-gram models are relatively easy to use, but suffer from severe sparse data problems.
- There are a variety of techniques for ameliorating sparse data problems.
- These techniques relate more generally to word frequency distributions and are useful in areas beyond n-gram modeling.