Title: LING 406 Intro to Computational Linguistics: Estimation and Smoothing
1 LING 406 Intro to Computational Linguistics: Estimation and Smoothing
- Richard Sproat
- URL: http://catarina.ai.uiuc.edu/L406_08/
2 This Lecture
- N-gram models
- Sparse data
- Smoothing
- Add One
- Witten-Bell
- Good-Turing
- Backoff
- Other issues
- Good-Turing and Word Frequency Distributions
- Good-Turing and Morphological Productivity
3 N-gram models
- Remember the chain rule:
  P(w1 w2 w3 ... wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) ...
- Problem: we can't model all of these conditional probabilities.
- N-gram models approximate P(w1 w2 w3 ... wn) by setting a bound on the amount of previous context.
- This is the Markov assumption, and n-grams are often termed Markov models (a bigram sketch follows below).
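To make the Markov assumption concrete, here is a minimal Python sketch of the bigram approximation, in which a sentence's probability is the product of P(wi | wi-1). The probability table and the <s>/</s> sentence markers are invented for illustration and are not figures from the lecture.

```python
# Minimal sketch of the Markov assumption: the chain rule conditions each word on its
# full history, while a bigram model truncates the history to just the previous word.
# The probability table below is a made-up illustration, not data from the lecture.

p_bigram = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "chinese"): 0.07,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.40,
}

def p_sentence_bigram(words):
    """P(w1 ... wn) approximated as the product of P(wi | wi-1)."""
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram.get((prev, cur), 0.0)
    return prob

print(p_sentence_bigram(["<s>", "i", "want", "chinese", "food", "</s>"]))
# 0.25 * 0.33 * 0.07 * 0.52 * 0.40 ≈ 0.0012
```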
4 N-gram models
5 For example
6 Implementational detail
7 Example from the Berkeley Restaurant Project (BERP), approximately 10,000 sentences
8 BERP example
9 BERP example
10 BERP bigram counts
11 BERP bigram probabilities
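The BERP slides turn a table of bigram counts into conditional probabilities; the sketch below shows that same row-normalization step, P(w2 | w1) = C(w1 w2) / C(w1), on made-up counts rather than the actual BERP figures.

```python
# Sketch of turning a bigram count table into a conditional probability table by
# row-normalizing: P(w2 | w1) = C(w1 w2) / C(w1). The counts are toy numbers in the
# spirit of the BERP tables, not the actual BERP figures.
from collections import Counter

bigram_counts = Counter({
    ("i", "want"): 800,
    ("i", "would"): 60,
    ("want", "to"): 600,
    ("want", "chinese"): 6,
})
unigram_counts = Counter({"i": 2500, "want": 900})

def bigram_prob(w1, w2):
    """MLE bigram probability from counts."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(round(bigram_prob("i", "want"), 3))       # 0.32
print(round(bigram_prob("want", "chinese"), 4)) # 0.0067
```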
12 What do we learn about the language?
13 Approximating Shakespeare
- As we increase the value of N, the accuracy of the n-gram model increases (a generation sketch follows this list).
- Generating sentences with random unigrams:
  - Every enter now severally so, let
  - Hill he late speaks or! a more to leg less first you enter
- With bigrams:
  - What means, sir. I confess she? then all sorts, he is trim, captain.
  - Why dost stand forth thy canopy, forsooth he is this palpable hit the King Henry.
- Trigrams:
  - Sweet prince, Falstaff shall die.
  - This shall forbid it should be branded, if renown made it empty.
- Tetragrams:
  - What! I will go seek the traitor Gloucester.
  - Will you not tell me who I am?
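One way to reproduce the exercise above is to train an n-gram model and sample from it word by word; below is a rough bigram-sampling sketch in Python. The tiny training text is a placeholder, whereas the slide's output came from a model trained on Shakespeare's complete works.

```python
# Sketch of the "approximating Shakespeare" exercise: sample random sentences from an
# n-gram (here bigram) model trained on whatever text you supply. The tiny training
# text is a placeholder; the results on the slide used Shakespeare's complete works.
import random
from collections import Counter, defaultdict

text = "sweet prince falstaff shall die . will you not tell me who i am ?".split()
tokens = ["<s>"] + text + ["</s>"]

successors = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    successors[w1][w2] += 1

def generate(max_len=20):
    """Sample one sentence by repeatedly drawing the next word from P(w | previous)."""
    word, out = "<s>", []
    for _ in range(max_len):
        choices = successors[word]
        word = random.choices(list(choices), weights=choices.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```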
14 Approximating Shakespeare
- There are 884,647 tokens, with 29,066 word-form types, in Shakespeare's works.
- Shakespeare produced 300,000 bigram types out of 844 million possible bigrams; so 99.96% of the possible bigrams were never seen (have zero entries in the table).
- Tetragrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
- The zeroes in the table are causing problems: we are being forced down a path of selecting only the tetragrams that Shakespeare used, not a very good model of Shakespeare, in fact.
- This is the sparse data problem.
15 Sparse data
- In fact the sparse data problem extends beyond zeroes.
- the occurs about 28,000 times in Shakespeare, so by the MLE:
  P(the) = 28000/884647 ≈ .032
- womenkind occurs once, so:
  P(womenkind) = 1/884647 ≈ .0000011
- Do we believe this?
16 N-gram training sensitivity
- If we repeated the Shakespeare experiment but trained on a Wall Street Journal corpus, there would be little overlap in the output.
- This has major implications for corpus selection or design.
17 Some useful empirical observations: a review
- A small number of events occur with high frequency.
- A large number of events occur with low frequency.
- You can quickly collect statistics on the high-frequency events.
- You might have to wait an arbitrarily long time to get valid statistics on low-frequency events.
- Some of the zeroes in the table are really zeroes. But others are simply low-frequency events you haven't seen yet.
- Whatever are we to do?
18 Smoothing: general issues
- Smoothing techniques manipulate the counts of the seen and unseen cases, replacing each count c by an adjusted count c*.
- Alternatively, we can view smoothing as producing an adjusted probability P* from an original probability P.
- More sophisticated smoothing techniques try to arrange it so that the probability estimates of the higher counts are not changed too much, since we tend to trust those.
19 Smoothing: Add One
20 Add One
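The Add One slides can be summarized as a small Python sketch of Laplace smoothing for bigrams, using the standard formulation P*(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V); the counts and vocabulary below are illustrative.

```python
# A minimal sketch of add-one (Laplace) smoothing for bigrams:
# P*(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V), where V is the vocabulary size.
# Counts and vocabulary here are illustrative.
from collections import Counter

vocab = {"i", "want", "to", "eat", "chinese", "food"}
V = len(vocab)
bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6, ("want", "chinese"): 1})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7, "eat": 5, "chinese": 2, "food": 2})

def p_addone(w1, w2):
    """Add-one smoothed bigram probability."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_addone("i", "want"))    # seen bigram
print(p_addone("i", "food"))    # unseen bigram still gets non-zero probability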
21 Witten-Bell
22 Witten-Bell
23 Witten-Bell
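As a rough companion to the Witten-Bell slides, here is a sketch of the method in its simple unigram form, where the probability mass reserved for unseen types is T / (N + T); the counts and the assumed total vocabulary size are made up.

```python
# Sketch of Witten-Bell smoothing in its simple unigram form: the mass reserved for
# unseen types is T / (N + T), where N is the number of tokens seen and T the number
# of distinct types seen; this mass is split evenly over the Z unseen types, and seen
# types get c / (N + T). The counts and vocabulary size are made up.
from collections import Counter

counts = Counter({"the": 10, "of": 5, "want": 2, "chinese": 1})
N = sum(counts.values())          # tokens observed
T = len(counts)                   # types observed
total_vocab = 10                  # assumed size of the full vocabulary
Z = total_vocab - T               # types never observed

def p_witten_bell(word):
    """Witten-Bell smoothed unigram probability."""
    if counts[word] > 0:
        return counts[word] / (N + T)
    return T / (Z * (N + T))

print(p_witten_bell("the"))       # a seen word
print(p_witten_bell("food"))      # an unseen word gets a small share of T/(N+T)
```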
24 Good-Turing
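A minimal sketch of the Good-Turing re-estimate discussed on this slide, assuming the standard formulation c* = (c + 1) N_{c+1} / N_c; the frequency spectrum is invented, and the smoothing of the N_c values needed for large c is omitted.

```python
# Sketch of the core Good-Turing re-estimate: a count c is replaced by
# c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of types seen exactly c times,
# and the total probability mass assigned to unseen events is N_1 / N. In practice the
# N_c counts must be smoothed for large c (not shown); the frequency spectrum is invented.
from collections import Counter

counts = Counter({"the": 5, "of": 2, "want": 2, "chinese": 1, "food": 1, "eat": 1})
N = sum(counts.values())                         # total tokens
freq_of_freqs = Counter(counts.values())         # N_c: how many types occur c times

def adjusted_count(c):
    """Good-Turing adjusted count c*; falls back to c when N_{c+1} is zero."""
    if freq_of_freqs[c + 1] == 0:
        return c
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

p_unseen_total = freq_of_freqs[1] / N            # mass reserved for all unseen types
print(adjusted_count(1), p_unseen_total)         # 1.33..., 0.25
```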
25 Backoff
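Below is a deliberately simplified backoff sketch, not the full Katz scheme presumably on the slide: it just falls back to the unigram estimate when a bigram is unseen, and omits the discounting and alpha normalization that proper backoff requires.

```python
# Simplified backoff sketch: use the bigram estimate when the bigram has been seen,
# otherwise fall back to the unigram estimate. Full Katz backoff additionally discounts
# the seen n-grams (e.g., with Good-Turing) and scales the backed-off mass by a
# normalizing alpha so the distribution still sums to one; that bookkeeping is omitted.
from collections import Counter

bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7, "eat": 5})
N = sum(unigram_counts.values())

def p_backoff(w1, w2):
    """Bigram probability, backing off to the unigram when the bigram count is zero."""
    if bigram_counts[(w1, w2)] > 0:
        return bigram_counts[(w1, w2)] / unigram_counts[w1]
    return unigram_counts[w2] / N            # back off to the unigram estimate

print(p_backoff("i", "want"))   # seen bigram: conditional MLE
print(p_backoff("i", "eat"))    # unseen bigram: unigram probability of "eat"
```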
26 Deleted Interpolation
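A sketch of interpolation for trigrams: the trigram, bigram, and unigram MLE estimates are mixed with weights summing to one. In deleted interpolation the lambdas are estimated on held-out data (e.g., by EM); here they are fixed by hand purely for illustration.

```python
# Sketch of (deleted) interpolation for trigrams: mix the trigram, bigram, and unigram
# MLE estimates with weights that sum to one. The lambdas here are hand-picked for
# illustration; in deleted interpolation they are tuned on held-out data.
from collections import Counter

trigram_counts = Counter({("i", "want", "to"): 5})
bigram_counts = Counter({("i", "want"): 8, ("want", "to"): 6})
unigram_counts = Counter({"i": 10, "want": 9, "to": 7})
N = sum(unigram_counts.values())

LAMBDAS = (0.6, 0.3, 0.1)        # weights for trigram, bigram, unigram; must sum to 1

def p_interp(w1, w2, w3):
    """Interpolated trigram probability."""
    p3 = trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)] if bigram_counts[(w1, w2)] else 0.0
    p2 = bigram_counts[(w2, w3)] / unigram_counts[w2] if unigram_counts[w2] else 0.0
    p1 = unigram_counts[w3] / N
    l3, l2, l1 = LAMBDAS
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("i", "want", "to"))
```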
27 Kneser-Ney modeling
- Lower-order n-grams are only used when higher-order n-grams are lacking.
- So build these lower-order n-grams to suit that situation (sketched below).
- New York is frequent.
- York is not too frequent, except after New.
- If the previous word is New, then we don't care about the unigram estimate of York.
- If the previous word is not New, then we don't want to be counting all those cases when New occurs before York.
28 Kneser-Ney Modeling
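The New York example can be captured with continuation counts, as in the sketch below: instead of asking how often York occurs, ask how many distinct words it follows. The toy bigram set is invented, and the exact Kneser-Ney formulas on the slide are not reproduced here.

```python
# Sketch of the Kneser-Ney "continuation" idea behind the New York example: instead of
# asking how often a word occurs, ask how many distinct words it follows. "York" may be
# frequent, but almost all of its bigram types have "New" on the left, so its
# continuation probability is small. The toy bigram set is invented for illustration.
from collections import Counter

bigrams = Counter({
    ("new", "york"): 50,
    ("the", "food"): 10,
    ("good", "food"): 5,
    ("chinese", "food"): 5,
})

def continuation_prob(word):
    """P_cont(w) = (# distinct bigram types ending in w) / (# distinct bigram types)."""
    ending_in_w = sum(1 for (w1, w2) in bigrams if w2 == word)
    return ending_in_w / len(bigrams)

print(continuation_prob("york"))   # follows only "new": 1/4
print(continuation_prob("food"))   # follows three different words: 3/4
# In full (interpolated) Kneser-Ney this continuation estimate replaces the raw unigram
# in the backed-off/interpolated term.
```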
29 Guess the training source
30 Guess the training source
31 Guess the training source
32 Guess the training source
33 For thine own amusement
- http://catarina.ai.uiuc.edu/ngramgen
34 Estimation techniques: miscellanea
- What if you have reason to doubt your counts? In some work that used to be recent but is now not so recent (Riley, Roark and Sproat, 2003), we've tried to generalize Good-Turing to the case where the counts are fractional, as in the (lattice) output of a speech recognizer.
- Chen and Goodman (1998), http://citeseer.nj.nec.com/22209.html, is an oft-cited study of these various techniques (and many others) and how effective they are.
- By the way, we haven't said anything about how one measures effectiveness.
- There are a couple of ways:
  - Actually use the n-gram language model in a real system (such as an ASR system)
  - Measure the perplexity on some held-out corpus (a sketch follows below)
- We'll get to those later.
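Since perplexity will come up later, here is a small sketch of how it is computed on held-out text: the exponential of the average negative log-probability per token. The uniform stand-in model is an assumption used only to keep the example self-contained.

```python
# Sketch of measuring perplexity on a held-out corpus: perplexity is the inverse
# probability of the test set, normalized by its length, i.e. exp of the average
# negative log-probability per token. The probability function here is a stand-in;
# in practice it would be a smoothed n-gram model trained on separate data.
import math

def perplexity(test_tokens, prob_fn):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | history))."""
    log_sum = 0.0
    for i, w in enumerate(test_tokens):
        p = prob_fn(w, test_tokens[:i])     # model probability of w given its history
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_tokens))

# Toy stand-in model: uniform over a 1000-word vocabulary -> perplexity is 1000.
print(perplexity(["i", "want", "chinese", "food"], lambda w, hist: 1 / 1000))
```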
35 Smoothing isn't just for n-grams
- The Good-Turing estimate of the probability mass of the unseen cases is related to the growth of the vocabulary.
- It gives you a measure of how likely it is that there are more where that came from.
- Hence it can be used to measure the productivity of a process.
36 Growth rate of the vocabulary (Baayen 2001)
37 Measures of morphological productivity
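One standard measure here is Baayen's productivity index P = n1 / N, the Good-Turing estimate of the probability that the next token of a morphological category is a previously unseen type; the sketch below computes it over an invented frequency list (the slide's actual measures and figures are not reproduced).

```python
# Sketch of Baayen's productivity index for an affix: P = n1 / N, where n1 is the number
# of hapax legomena (types occurring exactly once) formed with the affix and N is the
# total number of tokens containing the affix. This is the Good-Turing estimate of the
# probability that the next token with this affix is a new type. The frequency list is invented.
from collections import Counter

# token counts for word types formed with some affix, e.g. English -ness
type_counts = Counter({"happiness": 120, "darkness": 45, "awareness": 30,
                       "sheepishness": 1, "relatedness": 1, "wariness": 1})

n1 = sum(1 for c in type_counts.values() if c == 1)    # hapax legomena
N = sum(type_counts.values())                          # affix tokens
P = n1 / N
print(P)   # higher P -> more productive process
```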
38 Some sample P scores from Dutch and English
39 Related points
- Baayen and Sproat (1996) showed that the best predictor of the prior probability of a given usage of an unseen morphologically complex word is the most frequent usage among the hapax legomena (see http://acl.ldc.upenn.edu/J/J96/J96-2001.pdf).
- Sproat and Shih (1996) showed that root compounds in Chinese are productive, using a Good-Turing estimate.
40 Chinese root compounds
41 Summary
- N-gram models are an approximation to the correct model as given by the chain rule.
- N-gram models are relatively easy to use, but suffer from severe sparse data problems.
- There are a variety of techniques for ameliorating sparse data problems.
- These techniques relate more generally to word frequency distributions and are useful in areas beyond n-gram modeling.