LING 406 Intro to Computational Linguistics: Estimation and Smoothing

1
LING 406 Intro to Computational Linguistics: Estimation and Smoothing
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L406_08/

2
This Lecture
  • N-gram models
  • Sparse data
  • Smoothing
  • Add One
  • Witten-Bell
  • Good-Turing
  • Backoff
  • Other issues
  • Good-Turing and Word Frequency Distributions
  • Good-Turing and Morphological Productivity

3
N-gram models
  • Remember the chain rule:
  • P(w1 w2 w3 . . . wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) . . .
  • The problem is that we can't model all these
    conditional probabilities
  • N-gram models approximate P(w1 w2 w3 . . . wn) by
    setting a bound on the amount of previous
    context.
  • This is the Markov assumption, and n-grams are
    often termed Markov models (a bigram sketch
    follows this slide)
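
A minimal Python sketch of the bigram (N = 2) approximation, using a tiny made-up corpus; the corpus and helper names are illustrative, not from the slides:

    from collections import Counter

    # Toy corpus with sentence-boundary markers; the sentences are
    # illustrative, not from the BERP data on the following slides.
    corpus = [
        "<s> i want chinese food </s>".split(),
        "<s> i want to eat </s>".split(),
        "<s> tell me about chinese food </s>".split(),
    ]

    unigram_counts = Counter(w for sent in corpus for w in sent)
    bigram_counts = Counter(
        (w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:])
    )

    def p_bigram(w2, w1):
        """MLE estimate of P(w2 | w1) = c(w1 w2) / c(w1)."""
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    def p_sentence(sent):
        """Bigram (Markov) approximation to the chain rule:
        P(w1 ... wn) is approximated by the product of P(wi | wi-1)."""
        p = 1.0
        for w1, w2 in zip(sent, sent[1:]):
            p *= p_bigram(w2, w1)
        return p

    print(p_sentence("<s> i want chinese food </s>".split()))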

4
N-gram models
5
For example
6
Implementational detail
7
Example from Berkeley Restaurant Project (BERP),
approximately 10,000 sentences
8
BERP example
9
BERP example
10
BERP bigram counts
11
BERP bigram probabilities
12
What do we learn about the language?
13
Approximating Shakespeare
  • As we increase the value of N, the accuracy of
    the n-gram model increases (a small generation
    sketch follows this slide)
  • Generating sentences with random unigrams
  • Every enter now severally so, let
  • Hill he late speaks or! a more to leg less first
    you enter
  • With bigrams
  • What means, sir. I confess she? then all sorts,
    he is trim, captain.
  • Why dost stand forth thy canopy, forsooth; he is
    this palpable hit the King Henry.
  • Trigrams
  • Sweet prince, Falstaff shall die.
  • This shall forbid it should be branded, if renown
    made it empty.
  • Tetragrams
  • What! I will go seek the traitor Gloucester.
  • Will you not tell me who I am?
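
The sentences above come from exactly this kind of random generation. A rough Python sketch of the bigram case, assuming a bigram_counts table like the one in the earlier sketch:

    import random
    from collections import defaultdict

    def generate(bigram_counts, max_len=20):
        """Sample a sentence by repeatedly drawing the next word in
        proportion to its bigram count after the current word."""
        successors = defaultdict(list)   # word -> [(next word, count), ...]
        for (w1, w2), c in bigram_counts.items():
            successors[w1].append((w2, c))

        word, out = "<s>", []
        for _ in range(max_len):
            if not successors[word]:     # dead end: no observed successor
                break
            words, weights = zip(*successors[word])
            word = random.choices(words, weights=weights)[0]
            if word == "</s>":
                break
            out.append(word)
        return " ".join(out)

    print(generate(bigram_counts))       # e.g. "i want chinese food"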

14
Approximating Shakespeare
  • There are 884,647 tokens, with 29,066 word form
    types, in Shakespeare's works
  • Shakespeare produced 300,000 bigram types out of
    844 million possible bigrams, so 99.96% of the
    possible bigrams were never seen (have zero
    entries in the table).
  • Tetragrams are worse: what's coming out looks
    like Shakespeare because it is Shakespeare.
  • The zeroes in the table are causing problems: we
    are being forced down a path of selecting only
    the tetragrams that Shakespeare used, not a very
    good model of Shakespeare, in fact
  • This is the sparse data problem

15
Sparse data
  • In fact the sparse data problem extends beyond
    zeroes
    'the' occurs about 28,000 times in Shakespeare, so
    by the MLE
  • P(the) = 28,000/884,647 ≈ .032
  • 'womenkind' occurs once, so
  • P(womenkind) = 1/884,647 ≈ .0000011
  • Do we believe this?

16
N-gram training sensitivity
  • If we repeated the Shakespeare experiment but
    trained on a Wall Street Journal corpus, there
    would be little overlap in the output
  • This has major implications for corpus selection
    or design

17
Some useful empirical observations: a review
  • A small number of events occur with high
    frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high
    frequency events
  • You might have to wait an arbitrarily long time
    to get valid statistics on low frequency events
  • Some of the zeroes in the table are really
    zeroes. But others are simply low frequency
    events you haven't seen yet (the count-of-counts
    sketch below makes this concrete).
  • Whatever are we to do?
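
One way to make these observations concrete is a count-of-counts table (how many types occur once, twice, and so on); it is also the table Good-Turing smoothing will need later. A minimal sketch, reusing the hypothetical unigram_counts from the earlier sketch:

    from collections import Counter

    def count_of_counts(freqs):
        """Nc table: Nc = number of types that occur exactly c times."""
        return Counter(freqs.values())

    nc = count_of_counts(unigram_counts)
    # nc[1] is the number of hapax legomena; in natural text it
    # typically dwarfs the number of very frequent types.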

18
Smoothing: general issues
  • Smoothing techniques manipulate the counts of the
    seen and unseen cases and replace each count c by
    an adjusted count c*.
  • Alternatively we can view smoothing as producing
    an adjusted probability P* from an original
    probability P.
  • More sophisticated smoothing techniques try to
    arrange it so that the probability estimates of
    the higher counts are not changed too much, since
    we tend to trust those. (An Add-One sketch
    follows this slide.)
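
As a sketch of the adjusted-count idea, here is its simplest instance, the Add-One (Laplace) method of the next slides, written out with both the adjusted probability and the equivalent adjusted count c*; the function names are mine, not from the slides:

    def add_one_prob(c, n, v):
        """Add-One (Laplace) estimate P* = (c + 1) / (N + V), where c
        is the observed count, N the total number of tokens and V the
        vocabulary size."""
        return (c + 1) / (n + v)

    def add_one_adjusted_count(c, n, v):
        """Equivalent adjusted count c* = (c + 1) * N / (N + V),
        chosen so that c* / N equals the Add-One probability."""
        return (c + 1) * n / (n + v)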

19
Smoothing: Add One
20
Add One
21
Witten-Bell
22
Witten-Bell
23
Witten-Bell
24
Good-Turing
25
Backoff
26
Deleted Interpolation
27
Kneser-Ney modeling
  • Lower-order n-grams are only used when
    higher-order n-grams are lacking
  • So build these lower-order n-grams to suit that
    situation
  • 'New York' is frequent
  • 'York' is not too frequent except after 'New'
  • If the previous word is 'New' then we don't care
    about the unigram estimate of 'York'
  • If the previous word is not 'New' then we don't
    want to be counting all those cases when 'New'
    occurs before 'York' (a continuation-count sketch
    follows this slide)
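
A sketch of the continuation-count idea only (not a full interpolated Kneser-Ney model), again assuming the hypothetical bigram_counts table from the earlier sketches:

    def continuation_prob(w, bigram_counts):
        """Kneser-Ney-style continuation probability: proportional to
        the number of distinct words that w has been observed to
        follow, rather than to w's raw frequency. 'York' scores low
        even though it is frequent, because it follows almost nothing
        but 'New'."""
        distinct_histories = len({w1 for (w1, w2) in bigram_counts if w2 == w})
        total_bigram_types = len(bigram_counts)
        return distinct_histories / total_bigram_types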

28
Kneser-Ney Modeling
29
Guess the training source
30
Guess the training source
31
Guess the training source
32
Guess the training source
33
For thine own amusement
  • http://catarina.ai.uiuc.edu/ngramgen

34
Estimation techniques: miscellanea
  • What if you have reason to doubt your counts? In
    what used to be recent, but is now not so recent,
    work (Riley, Roark and Sproat, 2003), we've tried
    to generalize Good-Turing to the case where the
    counts are fractional, as in the (lattice) output
    of a speech recognizer.
  • Chen and Goodman (1998)
    http://citeseer.nj.nec.com/22209.html is an
    oft-cited study of these various techniques (and
    many others) and how effective they are.
  • By the way, we haven't said anything about how
    one measures effectiveness.
  • There are a couple of ways:
  • Actually use the n-gram language model in a real
    system (such as an ASR system)
  • Measure the perplexity on some held-out corpus
    (a perplexity sketch follows this slide)
  • We'll get to those later
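
A minimal sketch of the perplexity measure mentioned above; log2_prob is a hypothetical stand-in for whatever smoothed model is being evaluated:

    def perplexity(held_out_sentences, log2_prob):
        """Perplexity = 2 ** (-(1/N) * sum of log2 sentence
        probabilities), where N is the total number of tokens in the
        held-out corpus and log2_prob(sent) returns log2 P(sent)
        under the model."""
        total_logp = sum(log2_prob(sent) for sent in held_out_sentences)
        n_tokens = sum(len(sent) for sent in held_out_sentences)
        return 2.0 ** (-total_logp / n_tokens)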

35
Smoothing isn't just for n-grams
  • The Good-Turing estimate of the probability mass
    of the unseen cases is related to the growth of
    the vocabulary
  • It gives you a measure of how likely it is that
    there are more where that came from
  • Hence it can be used to measure the productivity
    of a process (see the sketch below)
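
A sketch of that use, assuming freqs maps each type formed by some process to its token count; the Good-Turing estimate of the unseen mass is N1/N, the share of hapax legomena:

    def unseen_mass(freqs):
        """Good-Turing estimate of the probability that the next
        token is a previously unseen type: N1 / N, where N1 is the
        number of hapax legomena and N the total number of tokens."""
        n1 = sum(1 for c in freqs.values() if c == 1)
        n = sum(freqs.values())
        return n1 / n

    # Applied to the token counts of words formed by one
    # morphological process, this gives a hapax-based productivity
    # score: a large share of hapaxes suggests there are more where
    # that came from.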

36
Growth rate of the vocabulary (Baayen 2001)
37
Measures of morphological productivity
38
Some sample P scores from Dutch and English
39
Related points
  • Baayen and Sproat (1996) showed that the best
    predictor of the prior probability of a given
    usage of an unseen morphologically complex word
    is the most frequent usage among the hapax
    legomena (see
    http://acl.ldc.upenn.edu/J/J96/J96-2001.pdf).
  • Sproat and Shih (1996) showed that root compounds
    in Chinese are productive using a Good-Turing
    estimate

40
Chinese root compounds
41
Summary
  • N-gram models are an approximation to the correct
    model as given by the chain rule
  • N-gram models are relatively easy to use, but
    suffer from severe sparse data problems
  • There are a variety of techniques for
    ameliorating sparse data problems
  • These techniques relate more generally to word
    frequency distributions and are useful in areas
    beyond n-gram modeling