Title: Advanced Smoothing, Evaluation of Language Models
1. Advanced Smoothing, Evaluation of Language Models
2. Witten-Bell Discounting
- A zero n-gram is just an n-gram you haven't seen yet... but every n-gram in the corpus was unseen once, so...
- How many times did we see an n-gram for the first time? Once for each n-gram type (T)
- Estimate the total probability mass of unseen bigrams as T / (N + T)
- View the training corpus as a series of events, one for each token (N) and one for each new type (T)
- We can divide that probability mass equally among unseen bigrams... or we can condition the probability of an unseen bigram on the first word of the bigram
- Discount values for Witten-Bell are much more reasonable than for Add-One (see the sketch below)
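A minimal sketch of the unconditioned Witten-Bell idea, assuming toy bigram counts and a hypothetical vocabulary size V: seen bigrams keep mass c / (N + T), and the reserved mass T / (N + T) is shared equally among the unseen ones (the conditioned variant would use per-history counts instead).

```python
from collections import Counter

# Toy bigram counts (hypothetical); in practice these come from a training corpus.
bigram_counts = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("a", "cat"): 1})

N = sum(bigram_counts.values())   # number of bigram tokens observed
T = len(bigram_counts)            # number of distinct bigram types observed
V = 1000                          # assumed vocabulary size
Z = V * V - T                     # number of unseen bigram types

def witten_bell_prob(bigram):
    """Seen bigrams get c / (N + T); unseen ones share the reserved T / (N + T)."""
    c = bigram_counts.get(bigram, 0)
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))

print(witten_bell_prob(("the", "cat")))       # a seen bigram
print(witten_bell_prob(("unicorn", "flew")))  # an unseen bigram
```

The seen bigrams sum to N / (N + T) and the unseen ones to T / (N + T), so the distribution still sums to one.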
3. Good-Turing Discounting
- Re-estimate the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
- Estimate: c* = (c + 1) N_{c+1} / N_c, where N_c is the number of n-gram types seen c times
- E.g. N0's adjusted count is a function of the count of n-grams that occur once, N1
- Assumes
  - word bigrams follow a binomial distribution
  - we know the number of unseen bigrams (V x V - seen); see the sketch below
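A minimal sketch of the adjusted-count formula, assuming toy bigram counts and a hypothetical vocabulary size V; real implementations also smooth the N_c curve so that higher counts (where N_{c+1} may be zero) stay well defined.

```python
from collections import Counter

# Toy bigram counts (hypothetical).
bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1, ("c", "a"): 2})

# Counts-of-counts: N_c = number of bigram types seen exactly c times.
counts_of_counts = Counter(bigram_counts.values())
V = 100                                           # assumed vocabulary size
counts_of_counts[0] = V * V - len(bigram_counts)  # unseen bigram types (V x V - seen)

def adjusted_count(c):
    """Good-Turing: c* = (c + 1) * N_{c+1} / N_c (falls back to c if N_{c+1} is 0)."""
    if counts_of_counts.get(c + 1, 0) == 0:
        return c
    return (c + 1) * counts_of_counts[c + 1] / counts_of_counts[c]

print(adjusted_count(0))   # adjusted count for unseen bigrams, driven by N1
print(adjusted_count(1))   # adjusted count for bigrams seen once
```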
4. Interpolation and Backoff
- Typically used in addition to smoothing techniques / discounting
- Example: trigrams
  - Smoothing gives some probability mass to all the trigram types not observed in the training data
  - We could make a more informed decision! How?
- If backoff finds an unobserved trigram in the test data, it will back off to bigrams (and ultimately to unigrams)
- Backoff doesn't treat all unseen trigrams alike
- When we have observed a trigram, we will rely solely on the trigram counts
5. Backoff methods (e.g. Katz '87)
- For e.g. a trigram model
  - Compute unigram, bigram and trigram probabilities
- In use
  - Where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability
  - E.g. "An omnivorous unicorn" (see the sketch below)
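A minimal sketch of the fall-back order only, built from a hypothetical toy corpus; Katz backoff additionally discounts the higher-order counts and scales the lower-order estimate with a normalizing backoff weight alpha, both omitted here.

```python
from collections import Counter

# Toy corpus (hypothetical) from which to count n-grams.
tokens = "an omnivorous unicorn ate an apple".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def backoff_prob(w1, w2, w3):
    """Use the trigram estimate if observed, else the bigram, else the unigram."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / N   # a completely unseen word would still need smoothing

print(backoff_prob("an", "omnivorous", "unicorn"))  # observed trigram
print(backoff_prob("a", "purple", "unicorn"))       # falls back to the unigram
```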
6. Smoothing: Simple Interpolation
- Trigram is very context-specific, very noisy
- Unigram is context-independent, smooth
- Interpolate trigram, bigram, and unigram for the best combination (see the sketch below)
- Find the λs, 0 < λ_i < 1, by optimizing on held-out data
- Almost good enough
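A minimal sketch of linear interpolation with fixed, illustrative weights; the component probabilities and λ values are hypothetical, and tuning the λs on held-out data is sketched after the next slide.

```python
# Interpolate trigram, bigram, and unigram estimates:
# P(w | w1, w2) ~ l3 * P(w | w1, w2) + l2 * P(w | w2) + l1 * P(w)
LAMBDAS = (0.6, 0.3, 0.1)   # (l3, l2, l1), chosen to sum to 1

def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=LAMBDAS):
    l3, l2, l1 = lambdas
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Hypothetical component estimates for one word in context: the trigram was
# unseen, but the bigram and unigram still contribute probability mass.
print(interpolated_prob(p_trigram=0.0, p_bigram=0.02, p_unigram=0.001))
```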
7. Smoothing: Held-out estimation
- Finding parameter values
  - Split the data into training, held-out, and test sets
  - Try lots of different values for the λs on the held-out data, pick the best (see the sketch below)
  - Test on the test data
- Sometimes, can use tricks like EM (expectation maximization) to find values
- How much data for training, held-out, test?
  - Answer: enough test data to be statistically significant (1000s of words perhaps)
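A minimal grid-search sketch for the interpolation λs, assuming a hypothetical `heldout` list of per-word (trigram, bigram, unigram) estimates computed from the training counts; EM would find these weights without an explicit grid.

```python
import math

# Hypothetical held-out data: per-word component estimates
# (p_trigram, p_bigram, p_unigram) computed from the training counts.
heldout = [(0.0, 0.02, 0.001), (0.1, 0.05, 0.002), (0.0, 0.0, 0.0005)]

def heldout_log_likelihood(lambdas):
    """Log-likelihood of the held-out words under the interpolated model."""
    l3, l2, l1 = lambdas
    return sum(math.log(l3 * p3 + l2 * p2 + l1 * p1) for p3, p2, p1 in heldout)

# Coarse grid over (l3, l2, l1) with every weight kept positive and summing to 1.
grid = [i / 10 for i in range(1, 9)]
candidates = [(round(1 - l2 - l1, 10), l2, l1)
              for l2 in grid for l1 in grid if l1 + l2 < 0.95]
best = max(candidates, key=heldout_log_likelihood)
print(best, heldout_log_likelihood(best))
```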
8. Summary
- N-gram probabilities can be used to estimate the likelihood
  - of a word occurring in a context (the preceding N-1 words)
  - of a sentence occurring at all
- Smoothing techniques deal with the problem of unseen words in a corpus
9. Practical Issues
- Represent and compute language model probabilities in log format (see the sketch below)
- p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
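A minimal sketch of the log-space computation with hypothetical per-word probabilities; summing logs avoids the underflow that multiplying many small probabilities would cause.

```python
import math

# p1 * p2 * p3 * p4 = exp(log p1 + log p2 + log p3 + log p4)
word_probs = [0.001, 0.02, 0.005, 0.0001]        # hypothetical per-word probabilities

log_prob = sum(math.log(p) for p in word_probs)  # accumulate in log space
print(log_prob)                                  # sentence log-probability
print(math.exp(log_prob))                        # matches the direct product, 1e-11
```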
10. Class-based n-grams
- P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) × P(w_i | c_i)  (see the sketch below)
- Factored Language Models
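A minimal sketch of the class-based bigram factorization, using hypothetical word classes and toy probability tables; the class assignments and numbers are illustrative only.

```python
# Class-based bigram: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i)
word_class = {"monday": "DAY", "tuesday": "DAY", "on": "PREP"}  # hypothetical classes
p_class_given_class = {("PREP", "DAY"): 0.2}   # P(c_i | c_{i-1})
p_word_given_class = {("monday", "DAY"): 0.15} # P(w_i | c_i)

def class_bigram_prob(prev_word, word):
    c_prev, c = word_class[prev_word], word_class[word]
    return (p_class_given_class.get((c_prev, c), 0.0)
            * p_word_given_class.get((word, c), 0.0))

print(class_bigram_prob("on", "monday"))   # 0.2 * 0.15 = 0.03
```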
11. Evaluating language models
- We need evaluation metrics to determine how well our language models predict the next word
- Intuition: one should average over the probability of new words
12. Some basic information theory
- Evaluation metrics for language models
- Information theory: measures of information
  - Entropy
  - Perplexity
13. Entropy
- Average length of the most efficient coding for a random variable
- Binary encoding
14. Entropy
- Example: betting on horses
- 8 horses, each horse is equally likely to win
- (Binary) message required: 001, 010, 011, 100, 101, 110, 111, 000
- A 3-bit message is required
15. Entropy
- 8 horses, some horses are more likely to win
  - Horse 1: 1/2, code 0
  - Horse 2: 1/4, code 10
  - Horse 3: 1/8, code 110
  - Horse 4: 1/16, code 1110
  - Horses 5-8: 1/64 each, codes 111100, 111101, 111110, 111111
16. Perplexity
- Entropy H
- Perplexity = 2^H
- Intuitively: the weighted average number of choices a random variable has to make
- Equally likely horses: Entropy = 3, Perplexity = 2^3 = 8
- Biased horses: Entropy = 2, Perplexity = 2^2 = 4 (see the sketch below)
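A minimal sketch that recomputes the two horse-race examples above: entropy as -Σ p log2 p and perplexity as 2^H.

```python
import math

def entropy(probs):
    """H = -sum(p * log2 p) over the outcomes of the random variable."""
    return -sum(p * math.log2(p) for p in probs)

equal = [1 / 8] * 8                                      # 8 equally likely horses
biased = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4    # the biased race above

for dist in (equal, biased):
    h = entropy(dist)
    print(h, 2 ** h)   # prints 3.0 8.0 and 2.0 4.0
```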
17. Entropy
- Uncertainty measure (Shannon)
  - Given a random variable X: H(X) = -Σ_i p_i log_r p_i
  - r = 2, p_i = probability that the event is i
- Biased coin (p = 0.8 / 0.2)
  - H = -0.8 lg 0.8 - 0.2 lg 0.2 = 0.258 + 0.464 = 0.722
- Unbiased coin
  - H = -2 × 0.5 lg 0.5 = 1
- lg = log2 (log base 2)
- Entropy H(X) = Shannon uncertainty
- Perplexity
  - (average) branching factor
  - weighted average number of choices a random variable has to make
  - Formula: 2^H, directly related to the entropy value H
- Examples
  - Biased coin: 2^0.722 ≈ 1.65
  - Unbiased coin: 2^1 = 2
18. Entropy and Word Sequences
- Given a word sequence W = w1 ... wn
- Entropy for word sequences of length n in language L:
  H(w1 ... wn) = -Σ p(w1 ... wn) log p(w1 ... wn)
  - summed over all sequences of length n in language L
- Entropy rate for word sequences of length n:
  (1/n) H(w1 ... wn) = -(1/n) Σ p(w1 ... wn) log p(w1 ... wn)
- Entropy rate of the language: H(L) = lim_{n→∞} -(1/n) Σ p(w1 ... wn) log p(w1 ... wn)
  - n is the number of words in the sequence
- Shannon-McMillan-Breiman theorem: H(L) = lim_{n→∞} -(1/n) log p(w1 ... wn)
  - select a sufficiently large n
  - it is then possible to take a single sequence instead of summing over all possible w1 ... wn
  - a long sequence will contain many shorter sequences
19. Entropy of a sequence
- Finite sequence: strings from a language L
- Entropy rate (per-word entropy)
20. Entropy of a language
- Entropy rate of language L
- Shannon-McMillan-Breiman Theorem
  - If a language is stationary and ergodic, a single sequence, if it is long enough, is representative of the language (see the sketch below)
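A minimal sketch of putting the theorem to work: estimate the per-word (cross-)entropy of a single long test sequence as H ≈ -(1/n) log2 p(w1 ... wn), and the perplexity as 2^H. The toy corpus and the add-one-smoothed unigram model standing in for the language model are assumptions for illustration.

```python
import math
from collections import Counter

# Toy training and test data (hypothetical).
train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()

counts = Counter(train)
N = len(train)
V = len(counts)                     # vocabulary size of the toy model

def unigram_prob(w):
    """Add-one-smoothed unigram probability, so p > 0 for any test word."""
    return (counts[w] + 1) / (N + V)

# H ~= -(1/n) * log2 p(w1 ... wn), taken over one long sequence.
log2_prob = sum(math.log2(unigram_prob(w)) for w in test)
H = -log2_prob / len(test)          # per-word entropy estimate, in bits
print(H, 2 ** H)                    # entropy and perplexity of the test sequence
```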