Statistical Inference: n-gram Models over Sparse Data

1
Statistical Inference: n-gram Models over Sparse Data
  • Huang Like
  • 2007-04-27

2
Outline
  • Purpose of Statistical NLP
  • Forming Equivalence Classes
  • Statistical Estimators
  • Combining Estimators
  • Conclusions

3
Purpose of Statistical NLP
  • Doing statistical inference for NLP
  • Why statistical inference for NLP tasks?
  • Some NLP data is generated by an unknown distribution
  • We want to make inferences about that distribution
  • How? Training data → equivalence classes (EC)
  • Finding good statistical estimators for each EC
  • Combining multiple estimators

4
Example: Language Modelling
  • Purpose: predict the next word given the previous words
  • Applications
  • speech or optical character recognition
  • spelling correction
  • handwriting recognition
  • machine translation
  • Methods applicable to
  • word sense disambiguation
  • probabilistic parsing

5
Outline
  • Purpose of Statistical NLP
  • Forming Equivalence Classes
  • Statistical Estimators
  • Combining Estimators
  • Conclusions

6
Reliability vs. Discrimination
  • Prediction: a mapping from the past to the future
  • From classificatory features to a target feature
  • Independence assumption: the data does not depend on other features (or only weakly)
  • More features:
  • More bins, greater discrimination
  • Less training data per bin, lower statistical reliability

7
n-gram models
  • Predicting the next word: estimating P(w_m | w_1 ... w_m-1)
  • Using the history h = w_1 ... w_m-1
  • Markov assumption: only the prior n-1 words affect w_m, so P(w_m | h) = P(w_m | w_m-n+1 ... w_m-1)
  • n = 2: bigram; n = 3: trigram;
  • n = 4: four-gram
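A minimal sketch of this truncation of the history, with illustrative helper and variable names that are not from the slides:

```python
# Sketch: applying the (n-1)-order Markov assumption to a history.
# Function and variable names are illustrative, not from the presentation.

def markov_context(history, n):
    """Return the part of the history an n-gram model actually conditions on."""
    return tuple(history[-(n - 1):]) if n > 1 else tuple()

history = "sue swallowed the large green".split()
print(markov_context(history, 2))   # ('green',)          -> bigram context
print(markov_context(history, 3))   # ('large', 'green')  -> trigram context
```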

8
How large should n be?
  • Large n captures long-distance dependencies
  • Example: Sue swallowed the large green ____
  • The longer context predicts pill, frog
  • The local context alone predicts tree, car, mountain
  • But high-order n-grams are not practical
  • For a 20,000-word vocabulary, a bigram model has 400 million bins (20,000^2)
  • A trigram model has 8 trillion bins (20,000^3)

9
n-grams for Austen's novels
  • Data available from Project Gutenberg
  • 40 MB of clean, plain-ASCII files
  • Training data: Emma, Mansfield Park, Pride and Prejudice, Sense and Sensibility
  • Testing: Persuasion
  • Corpus: N = 617,091 words; vocabulary: V = 14,585 word types
  • Leaving out all punctuation
  • Keeping case distinctions

10
Outline
  • Purpose of Statistical NLP
  • Forming Equivalence Classes
  • Statistical Estimators
  • Combining Estimators
  • Conclusions

11
Statistical Estimators
  • n-gram model
  • P(w_n | w_1 ... w_n-1) = P(w_1 ... w_n) / P(w_1 ... w_n-1)
  • C(w_1 w_2 ... w_n) = frequency of the n-gram w_1 w_2 ... w_n
  • h = w_1 ... w_n-1, the history of preceding words
  • N = number of training instances
  • Problem: what if r = C(w_1 w_2 ... w_n) is 0 or 1?
  • Smoothing, using N_0, N_1, N_2, ..., T_1, T_2, ...
  • N_r = number of distinct n-grams with r instances
  • T_r = r · N_r = total count of n-grams with r instances
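A short sketch, with illustrative names and a toy corpus, of collecting the quantities the later estimators build on: the n-gram counts C, the counts of counts N_r, and T_r = r · N_r:

```python
from collections import Counter

# Illustrative helpers for C(w_1 ... w_n), N_r, and T_r (not from the presentation).
def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "she was inferior to both sisters she was".split()
counts = ngram_counts(tokens, 2)            # C(w_1 w_2) for every bigram seen
N_r = Counter(counts.values())              # N_r: number of distinct bigrams seen r times
T_r = {r: r * N_r[r] for r in N_r}          # T_r: total count of bigrams seen r times
print(N_r[1], N_r[2], T_r)                  # 5 singletons, 1 bigram seen twice
```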

12
Maximum Likelihood Estimation (MLE)
  • MLE estimates from relative frequencies
  • P(as) = C(as) / N
  • P(as) = 8/10, P(more) = 1/10, P(a) = 1/10, P(x) = 0 for all x ∉ {as, more, a}
  • P_MLE(w_1 ... w_n) = C(w_1 ... w_n) / N
  • P_MLE(w_n | w_1 ... w_n-1) = C(w_1 ... w_n) / C(w_1 ... w_n-1)
  • P_MLE gives the highest probability to the training sample
  • Why? It wastes no probability mass on unseen events
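A minimal sketch of these MLE formulas; the toy corpus and function names are illustrative, chosen so the unigram numbers match the slide's 8/10 example:

```python
from collections import Counter

def p_mle_unigram(w, tokens):
    # P_MLE(w) = C(w) / N: the relative frequency in the training sample
    return tokens.count(w) / len(tokens)

def p_mle_bigram(w_prev, w, tokens):
    # P_MLE(w | w_prev) = C(w_prev w) / C(w_prev)
    bigrams = Counter(zip(tokens, tokens[1:]))
    c_prev = tokens.count(w_prev)
    return bigrams[(w_prev, w)] / c_prev if c_prev else 0.0

tokens = "as as as as as as as as more a".split()
print(p_mle_unigram("as", tokens))         # 0.8, i.e. P(as) = 8/10
print(p_mle_unigram("x", tokens))          # 0.0 for anything unseen: MLE's problem
print(p_mle_bigram("as", "more", tokens))  # C(as more) / C(as) = 1/8
```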

13
MLE's problem and solution
  • When using the model to predict test data
  • Many (a majority) of the n-gram types are unseen
  • A zero probability propagates and wipes out the other probabilities
  • Is more data a solution?
  • Never a general solution
  • Consider all the numbers that can follow "the year"
  • Solution:
  • Decrease the probability of seen events
  • Give non-zero probability to unseen events

14
Using MLE for n-grams of Austen
  • Sentence: "In person she was inferior to both sisters"
  • Unigram: not the best, but still useful for prediction
  • Bigram: generally increases the probability
  • Trigram: can work brilliantly, sometimes
  • P(was | person she) = 0.5
  • Four-gram: useless
  • Intuition: use the highest-order n-gram possible
  • Zero probabilities still occur

15
Laplace's law, etc.
  • Laplace's law, or "adding one"
  • P_LAP(w_1 ... w_n) = (C(w_1 ... w_n) + 1) / (N + B); f_LAP(w_1 ... w_n) = (C(w_1 ... w_n) + 1) · N / (N + B)
  • For r > 0: f_LAP < f_MLE = r
  • For r = 0: f_LAP > 0 (while f_MLE = r = 0)
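A sketch of add-one smoothing for bigrams, taking B = V^2 bins; the toy corpus and names are illustrative:

```python
from collections import Counter

def p_laplace(ngram, counts, N, B):
    # P_LAP = (C + 1) / (N + B): every bin, seen or unseen, gets one extra count
    return (counts[ngram] + 1) / (N + B)

tokens = "she was inferior to both sisters".split()
counts = Counter(zip(tokens, tokens[1:]))
N = sum(counts.values())                 # number of bigram tokens in training
B = len(set(tokens)) ** 2                # number of bigram bins, B = V^2
print(p_laplace(("she", "was"), counts, N, B))  # seen bigram, discounted
print(p_laplace(("was", "she"), counts, N, B))  # unseen bigram still gets 1/(N + B)
```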

16
P_LAP gives too much probability to N_0
  • Depends on the size of the vocabulary V (the set of word types w_i)
  • B >> N
  • AP corpus:
  • N = 22 M, V = 400 K
  • B = V^2 = 160 G
  • N_0 = 75 G
  • 46.5% of the probability mass goes to the N_0 unseen bigrams (N_0 · f_LAP / N)
  • Actually, only 9.2% of the word instances in the AP test data are unseen

17
(No Transcript)
18
Lidstone's law and the Jeffreys-Perks law (ELE)
  • P_Lid(w_1 ... w_n) = (C(w_1 ... w_n) + λ) / (N + Bλ)
  • Interpolation between MLE and uniform: P_Lid(w_1 ... w_n) = μ · C(w_1 ... w_n)/N + (1 − μ)/B
  • μ = N / (N + λB)
  • λ = 0.5: the Jeffreys-Perks law (Expected Likelihood Estimation, ELE)
  • Better than P_LAP, but problems remain
  • Which value for λ?
  • Still linear in the MLE, and dependent on B and N
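A sketch of Lidstone's law; with λ = 0.5 it is the Jeffreys-Perks law (ELE), with λ = 1 it reduces to Laplace's law. The toy corpus and names are illustrative:

```python
from collections import Counter

def p_lidstone(ngram, counts, N, B, lam=0.5):
    # P_Lid = (C + lambda) / (N + B * lambda); lambda = 0.5 is ELE,
    # lambda = 1 is Laplace's law, lambda -> 0 approaches the MLE.
    return (counts[ngram] + lam) / (N + B * lam)

tokens = "she was inferior to both sisters".split()
counts = Counter(zip(tokens, tokens[1:]))
N = sum(counts.values())
B = len(set(tokens)) ** 2
print(p_lidstone(("she", "was"), counts, N, B))           # ELE (lambda = 0.5)
print(p_lidstone(("she", "was"), counts, N, B, lam=1.0))  # same as add-one
```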

19
Applying MLE and ELE to Austen
  • P(not | was): 0.065 → 0.036
  • Bigram: P_ELE < P_MLE
  • Still too much discounting? Yes
  • P(she was inferior to both sisters)
  • Bigram ELE: P_ELE = 6.89 × 10^-20 (λ = 0.5)
  • Worse than the unigram MLE
  • Lower probability than P_MLE

20
Held-out estimation
  • T_r = Σ_{w_1 ... w_n : C(w_1 ... w_n) = r} C_ho(w_1 ... w_n)
  • T_r = total count in the held-out (HO) data of n-grams whose training count is r
  • T_r = T_r,ho where C_train(w_1 ... w_n) = r
  • T_r / N_r = the average frequency in HO of such an n-gram
  • f_ho(w_1 ... w_n) = T_r,ho / N_r
  • Use the held-out frequency to estimate P
  • P_ho(w_1 ... w_n) = f_ho(w_1 ... w_n) / N
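A sketch of held-out estimation with illustrative names and toy data; the final division by N is read here as dividing by the number of n-grams in the held-out text, so the per-type estimates stay on a probability scale:

```python
from collections import Counter, defaultdict

def held_out_probs(train_tokens, heldout_tokens, n=2):
    """Estimate P_ho for each training count r = 1, 2, ... (toy sketch)."""
    train = Counter(zip(*[train_tokens[i:] for i in range(n)]))
    heldout = Counter(zip(*[heldout_tokens[i:] for i in range(n)]))
    N_ho = sum(heldout.values())          # n-gram tokens in the held-out data

    # N_r: n-gram types with training count r; T_r: their total held-out count
    N_r, T_r = Counter(), defaultdict(int)
    for gram, r in train.items():
        N_r[r] += 1
        T_r[r] += heldout[gram]

    # f_ho = T_r / N_r, and P_ho = f_ho / N for an n-gram with training count r
    return {r: (T_r[r] / N_r[r]) / N_ho for r in N_r}

train = "she was inferior to both sisters she was happy".split()
heldout = "she was not inferior she was kind".split()
print(held_out_probs(train, heldout))
```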

21
Why held-out data?
  • Training data:
  • Study and develop the model
  • Train the model
  • Test data:
  • Should never be looked at during development
  • Held-out data: a simulation of the test data
  • Training data = training set + held-out set
  • HO simulates testing (HO ≈ testing)

22
A gold standard to evaluate f
  • In the training data:
  • N_r = number of n-gram types with training count f_train = r
  • T_r = total number of n-gram instances with f_train = r
  • In the test data:
  • T_r = total number of occurrences in the test data of n-grams whose training count is r
  • f_emp = T_r / N_r = Σ C_test(w_1 ... w_n) / N_r, summed over the (w_1 ... w_n) for which C_train(w_1 ... w_n) = r
  • Example:
  • N_0 = 10,000; 3 of these n-grams occur in the test data, with counts 1, 1, 2
  • f_emp = (1 + 1 + 2) / 10,000 = 0.0004

23
Cross-validation (deleted estimation)
  • Divide the N training instances into two parts, 0 and 1
  • N_r^a = number of n-grams with count r in part a
  • T_r^ab = total occurrences in part b of the n-grams that have count r in part a
  • P_ho(w_1 ... w_n) = T_r^01 / (N_r^0 · N) or T_r^10 / (N_r^1 · N)
  • P_del(w_1 ... w_n) = (T_r^01 + T_r^10) / ((N_r^0 + N_r^1) · N)
  • Effective, close to the gold standard
  • Overestimates P for r = 0
  • Underestimates P for r = 1
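A sketch of deleted estimation with illustrative names: count n-grams in each half, measure how often the r-count n-grams of one half occur in the other, and pool the two directions as in the P_del formula above:

```python
from collections import Counter, defaultdict

def deleted_estimation(tokens_a, tokens_b, n=2):
    """P_del for each training count r, pooling both split directions (toy sketch)."""
    def stats(part, other_part):
        part_counts = Counter(zip(*[part[i:] for i in range(n)]))
        other_counts = Counter(zip(*[other_part[i:] for i in range(n)]))
        N_r, T_r = Counter(), defaultdict(int)
        for gram, r in part_counts.items():
            N_r[r] += 1                    # N_r^a: r-count n-grams in this part
            T_r[r] += other_counts[gram]   # T_r^ab: their occurrences in the other part
        return N_r, T_r

    N_r0, T_r01 = stats(tokens_a, tokens_b)
    N_r1, T_r10 = stats(tokens_b, tokens_a)
    N = len(tokens_a) + len(tokens_b)      # total training size, as on the slide
    return {r: (T_r01[r] + T_r10[r]) / ((N_r0[r] + N_r1[r]) * N)
            for r in set(N_r0) | set(N_r1)}

half_a = "she was inferior to both sisters".split()
half_b = "she was not inferior to anyone".split()
print(deleted_estimation(half_a, half_b))
```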

24
Leave one out
  • Two sets of training data: divide the training data into parts of size (N − 1, 1)
  • P_del(w_1 ... w_n) = (T_r^01 + T_r^10) / ((N_r^0 + N_r^1) · N), where C(w_1 ... w_n) = r
  • Rotate the left-out part and repeat N times
  • Closely related to the Good-Turing method

25
Good-Turing estimation
  • Good (1953) attributes the method to Turing
  • Based on a binomial distribution assumption
  • Works in many situations, including n-grams
  • P_GT = r* / N, where r* = (r + 1) · E(N_{r+1}) / E(N_r)
  • Redistributes probability mass from seen to unseen events
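A sketch of the basic Good-Turing adjusted count r* = (r + 1) · N_{r+1} / N_r, using the observed N_r in place of the expectations; the names and toy data are illustrative:

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    # r* = (r + 1) * N_{r+1} / N_r, with observed N_r standing in for E(N_r)
    N_r = Counter(counts.values())
    return {r: (r + 1) * N_r[r + 1] / N_r[r] for r in N_r if N_r[r + 1] > 0}

tokens = "she was inferior to both sisters she was she".split()
counts = Counter(zip(tokens, tokens[1:]))
N = sum(counts.values())
r_star = good_turing_adjusted_counts(counts)
# P_GT for an n-gram seen r times is r*/N; the leftover mass covers unseen n-grams.
# The largest r gets no estimate here because N_{r+1} = 0 (the issue on the next slide).
print({r: rs / N for r, rs in r_star.items()})
```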

26
Issues with Using Good-Turing
  • Problems:
  • r* = 0 for the largest r, because E(N_{r+1}) = 0
  • N_r is monotonically decreasing but not smooth
  • Solutions:
  • Adjust r only when r < k (e.g. k = 10)
  • Use a smoothed value S(r) instead of N_r
  • Renormalize so the probability values sum to 1

27
Simple Good-Turing
  • Due to Gale and Sampson (1995)
  • What is S(r)?
  • For low r, use S(r) = N_r directly
  • For high r:
  • Replace N_r by S(r) = A · r^b with b < −1, i.e. fit log N_r = a + b · log r
  • Estimate a and b by linear regression over the high-r region
  • The probabilities must still sum to 1 (renormalize)
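A sketch of the regression step only: fit log N_r = a + b · log r by least squares and use the smoothed S(r) inside the Good-Turing formula. The full Gale and Sampson procedure also switches between raw and smoothed estimates and renormalizes; the toy counts of counts here are invented:

```python
import math
from collections import Counter

def fit_smoothed_Nr(N_r):
    """Fit log N_r = a + b * log r and return S(r) = exp(a) * r**b."""
    rs = sorted(N_r)
    xs = [math.log(r) for r in rs]
    ys = [math.log(N_r[r]) for r in rs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return lambda r: math.exp(a) * r ** b   # S(r): smooth, defined for every r >= 1

N_r = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 7: 4}       # toy, non-smooth counts of counts
S = fit_smoothed_Nr(N_r)                                # here the slope b comes out below -1
r_star = {r: (r + 1) * S(r + 1) / S(r) for r in N_r}    # Good-Turing with S(r) in place of N_r
print({r: round(v, 2) for r, v in r_star.items()})
```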

28
Briefly noted
  • Absolute discounting
  • P_abs(w_1 ... w_n) = (r − δ) / N if r > 0; P_abs(w_1 ... w_n) = (B − N_0) · δ / (N_0 · N) if r = 0
  • δ ≈ 0.77 works best, except for r = 1
  • Linear discounting
  • P_ld(w_1 ... w_n) = (1 − α) · r / N if r > 0; P_ld(w_1 ... w_n) = α / N_0 if r = 0
  • Hard to justify:
  • Discounts too much from high-count events
  • High-count events are the most reliable statistically
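A sketch of absolute discounting as given by the formula above: subtract δ from every seen count and share the saved mass evenly among the N_0 unseen bins. The toy corpus and names are illustrative:

```python
from collections import Counter

def p_abs(ngram, counts, N, B, delta=0.77):
    # Seen: (r - delta) / N.  Unseen: the saved mass (B - N_0) * delta / N,
    # i.e. delta per seen type, split evenly over the N_0 unseen bins.
    N_0 = B - len(counts)
    r = counts[ngram]
    if r > 0:
        return (r - delta) / N
    return (B - N_0) * delta / (N_0 * N)

tokens = "she was inferior to both sisters".split()
counts = Counter(zip(tokens, tokens[1:]))
N = sum(counts.values())
B = len(set(tokens)) ** 2
print(p_abs(("she", "was"), counts, N, B))   # seen bigram, slightly discounted
print(p_abs(("was", "she"), counts, N, B))   # unseen bigram gets a small share
```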

29
Outline
  • Purpose of Statistical NLP
  • Forming Equivalence Classes
  • Statistical Estimators
  • Combining Estimators
  • Conclusions

30
Combining Estimators
  • Simple linear interpolation
  • P_li(w_n | w_n-2 w_n-1) = λ_1 P(w_n) + λ_2 P(w_n | w_n-1) + λ_3 P(w_n | w_n-2 w_n-1)
  • The weights λ_i are trained with the EM algorithm
  • Gives good results
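A sketch of simple linear interpolation of unigram, bigram, and trigram MLE estimates; the weights here are fixed by hand rather than trained with EM, and the corpus and names are illustrative:

```python
from collections import Counter

def p_interpolated(w, context, tokens, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w2 w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2 w1)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    w2, w1 = context
    p1 = uni[w] / len(tokens)
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    l1, l2, l3 = lambdas     # in practice the lambdas are trained with EM on held-out data
    return l1 * p1 + l2 * p2 + l3 * p3

tokens = "in person she was inferior to both sisters".split()
print(p_interpolated("inferior", ("she", "was"), tokens))
```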

31
Katz's backing-off
32
Language models for Austen
  • CMU-Cambridge Toolkit
  • Katz's back-off with Good-Turing smoothing
  • Trigram is better than bigram
  • Four-gram is slightly worse
  • The back-off model is ineffective when the long context is bad

33
Outline
  • Purpose of Statistical NLP
  • Forming Equivalence Classes
  • Statistical Estimators
  • Combining Estimators
  • Conclusions

34
Conclusions
  • According to Chen and Goodman (1996, 1998, 1999)
  • Kneser-Ney is the best
  • According to Church and Gale (1991)
  • Good-Turing is the best
  • Bigrams, 22 M-word text

35
  • Thanks!