Chapter 6: Statistical Inference: n-gram Models over Sparse Data



1
Chapter 6: Statistical Inference: n-gram Models
over Sparse Data
  • TDM Seminar
  • Jonathan Henke
  • http://www.sims.berkeley.edu/jhenke/Tdm/TDM-Ch6.ppt

2
Basic Idea
  • Examine short sequences of words
  • How likely is each sequence?
  • Markov Assumption: a word is affected only by
    its "prior local context" (the last few words)

3
Possible Applications
  • OCR / voice recognition: resolve ambiguity
  • Spelling correction
  • Machine translation
  • Confirming the author of a newly discovered work
  • Shannon game

4
Shannon Game
  • Claude E. Shannon. "Prediction and Entropy of
    Printed English." Bell System Technical Journal
    30:50-64, 1951.
  • Predict the next word, given (n-1) previous words
  • Determine probability of different sequences by
    examining training corpus

5
Forming Equivalence Classes (Bins)
  • n-gram: a sequence of n words (see the sketch
    after this list)
  • bigram
  • trigram
  • four-gram
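
As an illustration (not part of the original slides), here is a minimal Python sketch of extracting n-grams from a token sequence; the function name extract_ngrams and the example sentence (taken from the Persuasion quote later in the deck) are assumptions for this sketch:

    def extract_ngrams(tokens, n):
        # Slide a window of length n over the token list and
        # return each window as a tuple (an n-gram).
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "in person she was inferior to both sisters".split()
    print(extract_ngrams(tokens, 2))   # bigrams
    print(extract_ngrams(tokens, 3))   # trigrams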

6
Reliability vs. Discrimination
  • large green ___________
  • tree? mountain? frog? car?
  • swallowed the large green ________
  • pill? broccoli?

7
Reliability vs. Discrimination
  • larger n: more information about the context of
    the specific instance (greater discrimination)
  • smaller n: more instances in the training data,
    better statistical estimates (more reliability)

8
Selecting an n
Vocabulary (V) = 20,000 words

n             Number of bins
2 (bigrams)   400,000,000
3 (trigrams)  8,000,000,000,000
4 (4-grams)   1.6 x 10^17
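
As a worked check (not part of the original slide), the bin counts are simply the number of possible n-grams over the vocabulary:

    B = V^n
    20{,}000^2 = 4 \times 10^8, \quad
    20{,}000^3 = 8 \times 10^{12}, \quad
    20{,}000^4 = 1.6 \times 10^{17}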
9
Statistical Estimators
  • Given the observed training data
  • How do you develop a model (probability
    distribution) to predict future events?

10
Statistical Estimators
Example corpus: five Jane Austen novels
N = 617,091 words; V = 14,585 unique words
Task: predict the next word of the trigram "inferior to ________"
from the test data (Persuasion): "In person, she was inferior to both sisters."
11
Instances in the Training Corpus: "inferior to ________"
12
Maximum Likelihood Estimate
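
The equation on this slide was an image and did not survive the transcript; for reference, the standard maximum likelihood estimate for an n-gram is:

    P_{\mathrm{MLE}}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n)}{N}, \qquad
    P_{\mathrm{MLE}}(w_n \mid w_1 \cdots w_{n-1}) = \frac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})}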
13
Actual Probability Distribution
14
Actual Probability Distribution
15
Smoothing
  • Develop a model which decreases the probability
    of seen events and allows for the occurrence of
    previously unseen n-grams
  • a.k.a. discounting methods
  • Validation: smoothing methods which utilize a
    second batch of test data.

16
Laplace's Law (adding one)
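
The add-one formula itself appears only as an image in the original deck; for reference, the usual statement is:

    P_{\mathrm{Lap}}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + 1}{N + B}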
17
Laplace's Law (adding one)
18
Laplace's Law
19
Lidstone's Law
P = probability of a specific n-gram
C = count of that n-gram in the training data
N = total number of n-grams in the training data
B = number of bins (possible n-grams)
λ = a small positive number
M.L.E.: λ = 0;  Laplace's Law: λ = 1;  Jeffreys-Perks Law: λ = ½
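
The Lidstone formula (shown only as an image in the original slide) is, for reference:

    P_{\mathrm{Lid}}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda}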
20
Jeffreys-Perks Law
21
Objections to Lidstone's Law
  • Need an a priori way to determine λ.
  • Predicts all unseen events to be equally likely.
  • Gives probability estimates linear in the M.L.E.
    frequency.

22
Smoothing
  • Lidstone's Law (incl. Laplace's Law and
    Jeffreys-Perks Law) modifies the observed counts.
  • Other methods modify probabilities.

23
Held-Out Estimator
  • How much of the probability distribution should
    be held out to allow for previously unseen
    events?
  • Validate by holding out part of the training
    data.
  • How often do events unseen in training data occur
    in validation data?
  • (e.g., to choose λ for the Lidstone model)

24
Held-Out Estimator
  • r = C(w1 … wn)
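
The rest of this slide's formula is missing from the transcript; for reference, the standard held-out estimate, with N_r = the number of n-grams seen r times in training and T_r = the total occurrences of those n-grams in the held-out data, is:

    P_{\mathrm{ho}}(w_1 \cdots w_n) = \frac{T_r}{N_r \, N}, \qquad r = C(w_1 \cdots w_n)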

25
Testing Models
  • Hold out 5-10% for testing
  • Hold out 10% for validation (smoothing)
  • For testing, it is useful to test on multiple
    sets of data and report the variance of the results.
  • Are results (good or bad) just the result of
    chance?

26
Cross-Validation (a.k.a. deleted estimation)
  • Use data for both training and validation
  • Divide the training data into 2 parts
  • Train on A, validate on B
  • Train on B, validate on A
  • Combine two models

[Diagram: the data is split into parts A and B; Model 1 trains on A and validates on B; Model 2 trains on B and validates on A; Models 1 and 2 are then combined into the final model.]
27
Cross-Validation
  • Two estimates

N_r^a = number of n-grams occurring r times in the a-th
part of the training set
T_r^{ab} = total number of occurrences of those n-grams
in the b-th part
Combined estimate (arithmetic mean)
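
The two estimates and their combination appear only as images in the original; written out in the slide's notation, the two estimates are T_r^{ab}/(N_r^a N) and T_r^{ba}/(N_r^b N), and the standard combined (deleted) estimate is:

    P_{\mathrm{del}}(w_1 \cdots w_n) = \frac{T_r^{ab} + T_r^{ba}}{N \, (N_r^a + N_r^b)}, \qquad r = C(w_1 \cdots w_n)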
28
Good-Turing Estimator
  • r* = adjusted frequency
  • N_r = number of n-gram types which occur r times
  • E(N_r) = expected value
  • E(N_{r+1}) < E(N_r)
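
The Good-Turing adjustment itself is not in the transcript; the standard form, for reference, is:

    r^* = (r + 1) \, \frac{E(N_{r+1})}{E(N_r)}, \qquad
    P_{\mathrm{GT}}(w_1 \cdots w_n) = \frac{r^*}{N}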

29
Discounting Methods
  • First, determine the held-out probability.
  • Absolute discounting: decrease the probability of
    each observed n-gram by subtracting a small
    constant.
  • Linear discounting: decrease the probability of
    each observed n-gram by multiplying it by the same
    proportion.
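
One formulation consistent with these descriptions (not taken from the slide, so treat it as a reference sketch), with δ the subtracted constant, α the discounted proportion, and N_0 the number of unseen bins, is:

    P_{\mathrm{abs}}(w_1 \cdots w_n) =
      \begin{cases} (r - \delta)/N & \text{if } r > 0 \\ (B - N_0)\,\delta / (N_0 N) & \text{otherwise} \end{cases}

    P_{\mathrm{lin}}(w_1 \cdots w_n) =
      \begin{cases} (1 - \alpha)\, r / N & \text{if } r > 0 \\ \alpha / N_0 & \text{otherwise} \end{cases}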

30
Combining Estimators
  • (Sometimes a trigram model is best, sometimes a
    bigram model is best, and sometimes a unigram
    model is best.)
  • How can you develop a model to utilize different
    length n-grams as appropriate?

31
Simple Linear Interpolation (a.k.a. finite mixture
models, a.k.a. deleted interpolation)
  • weighted average of unigram, bigram, and trigram
    probabilities
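
The interpolation formula is only an image in the original; the usual trigram form, with the λ_i summing to 1, is:

    P_{\mathrm{li}}(w_n \mid w_{n-2}, w_{n-1}) =
      \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})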

32
Katz's Backing-Off
  • Use the n-gram probability when there is enough
    training data
  • (when the adjusted count > k; k usu. 0 or 1)
  • If not, back off to the (n-1)-gram probability
  • (Repeat as needed)
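
The back-off equation is missing from the transcript; a schematic version of the recursion, with d a discount and α a back-off weight (both history-dependent in the full formulation, chosen so the probabilities normalize), is:

    P_{\mathrm{bo}}(w_n \mid w_1 \cdots w_{n-1}) =
      \begin{cases}
        (1 - d)\, \dfrac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})} & \text{if } C(w_1 \cdots w_n) > k \\
        \alpha \, P_{\mathrm{bo}}(w_n \mid w_2 \cdots w_{n-1}) & \text{otherwise}
      \end{cases}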

33
Problems with Backing-Off
  • If bigram w1 w2 is common
  • but trigram w1 w2 w3 is unseen
  • may be a meaningful gap, rather than a gap due to
    chance and scarce data
  • i.e., a grammatical null
  • May not want to back off to the lower-order
    probability