Part II. Statistical NLP

1
Advanced Artificial Intelligence
  • Part II. Statistical NLP

N-Grams
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr, Leila Kosseim and others
2
Contents
  • Short recap: motivation for SNLP
  • Probabilistic language models
  • N-grams
  • Predicting the next word in a sentence
  • Language guessing
  • Largely chapter 6 of Statistical NLP, Manning and
    Schuetze.
  • And chapter 6 of Speech and Language Processing,
    Jurafsky and Martin

3
Motivation
  • Human Language is highly ambiguous at all levels
  • acoustic level: recognize speech vs. wreck a
    nice beach
  • morphological level: saw = to see (past), saw
    (noun), to saw (present, inf)
  • syntactic level: I saw the man on the hill with a
    telescope
  • semantic level: One book has to be read by every
    student

4
NLP and Statistics
  • Statistical Disambiguation
  • Define a probability model for the data
  • Compute the probability of each alternative
  • Choose the most likely alternative

5
Language Models
Speech recognisers use a noisy channel model.
The source generates a sentence s with probability P(s).
The channel transforms the text into an acoustic signal a with probability P(a | s).
The task of the speech recogniser is to find, for a given speech signal a, the most likely sentence s:
$$\hat{s} = \arg\max_s P(s \mid a) = \arg\max_s \frac{P(a \mid s)\, P(s)}{P(a)} = \arg\max_s P(a \mid s)\, P(s)$$
6
Language Models
  • Speech recognisers employ two statistical models
  • a language model P(s)
  • an acoustic model P(a | s)

7
Language Model
  • Definition
  • A language model is a model that enables one to
    compute the probability, or likelihood, of a
    sentence s, P(s).
  • Let's look at different ways of computing P(s) in
    the context of Word Prediction

8
Example of bad language model
9
A bad language model
10
A bad language model
11
A Good Language Model
  • Determine reliable sentence probability estimates
  • P(And nothing but the truth) ≈ 0.001
  • P(And nuts sing on the roof) ≈ 0

12
Shannon game: Word Prediction
  • Predicting the next word in the sequence
  • Statistical natural language ...
  • The cat is thrown out of the ...
  • The large green ...
  • Sue swallowed the large green ...

13
Claim
  • A useful part of the knowledge needed to allow
    Word Prediction can be captured using simple
    statistical techniques.
  • Compute
  • probability of a sequence
  • likelihood of words co-occurring
  • Why would we want to do this?
  • Rank the likelihood of sequences containing
    various alternative hypotheses
  • Assess the likelihood of a hypothesis

14
Applications
  • Spelling correction
  • Mobile phone texting
  • Speech recognition
  • Handwriting recognition
  • Disabled users

15
Spelling errors
  • They are leaving in about fifteen minuets to go
    to her house.
  • The study was conducted mainly be John Black.
  • Hopefully, all with continue smoothly in my
    absence.
  • Can they lave him my messages?
  • I need to notified the bank of.
  • He is trying to fine out.

16
Handwriting recognition
  • Assume a note is given to a bank teller, which
    the teller reads as I have a gub. (cf. Woody
    Allen)
  • NLP to the rescue .
  • gub is not a word
  • gun, gum, Gus, and gull are words, but gun has a
    higher probability in the context of a bank

17
For Spell Checkers
  • Collect list of commonly substituted words
  • piece/peace, whether/weather, their/there ...
  • Example: On Tuesday, the whether ... --> On
    Tuesday, the weather ...

18
Language Models
How to assign probabilities to word sequences?
The probability of a word sequence $w_{1,n}$ is decomposed into a product of conditional probabilities.
$$P(w_{1,n}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_{1,n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{1,i-1})$$
19
Language Models
  • In order to simplify the model, we assume that
  • each word only depends on the 2 preceding words:
    $P(w_i \mid w_{1,i-1}) = P(w_i \mid w_{i-2}, w_{i-1})$
  • 2nd order Markov model, trigram
  • that the probabilities are time invariant
    (stationary)
  • $P(W_i = c \mid W_{i-2} = a, W_{i-1} = b) = P(W_k = c \mid W_{k-2} = a, W_{k-1} = b)$
  • Final formula: $P(w_{1,n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$

20
Simple N-Grams
  • An N-gram model uses the previous N-1 words to
    predict the next one
  • $P(w_n \mid w_{n-N+1}\, w_{n-N+2} \ldots w_{n-1})$
  • unigrams: P(dog)
  • bigrams: P(dog | big)
  • trigrams: P(dog | the big)
  • quadrigrams: P(dog | chasing the big)

21
A Bigram Grammar Fragment
Eat on      .16    Eat Thai       .03
Eat some    .06    Eat breakfast  .03
Eat lunch   .06    Eat in         .02
Eat dinner  .05    Eat Chinese    .02
Eat at      .04    Eat Mexican    .02
Eat a       .04    Eat tomorrow   .01
Eat Indian  .04    Eat dessert    .007
Eat today   .03    Eat British    .001
22
Additional Grammar
<start> I     .25    Want some           .04
<start> I'd   .06    Want Thai           .01
<start> Tell  .04    To eat              .26
<start> I'm   .02    To have             .14
I want        .32    To spend            .09
I would       .29    To be               .02
I don't       .08    British food        .60
I have        .04    British restaurant  .15
Want to       .65    British cuisine     .01
Want a        .05    British lunch       .01
23
Computing Sentence Probability
  • P(I want to eat British food) = P(I | <start>)
    P(want | I) P(to | want) P(eat | to) P(British | eat)
    P(food | British) = .25 x .32 x .65 x .26 x .001 x .60
    = .0000081
  • vs.
  • P(I want to eat Chinese food) = .00015
  • Probabilities seem to capture "syntactic" facts and
    "world knowledge"
  • eat is often followed by an NP
  • British food is not too popular
  • N-gram models can be trained by counting and
    normalization
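
The sentence probability can be recomputed with a few lines of Python. This is only a sketch: the dictionary holds just the six bigram probabilities quoted on the slides, and the names `BIGRAM_P` and `sentence_probability` are illustrative, not from the original deck. The product of the six factors comes to roughly 8.1e-6.

```python
# Bigram probabilities copied from the grammar fragments on the slides.
BIGRAM_P = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}


def sentence_probability(words, bigram_p):
    """Multiply P(w_i | w_{i-1}) over the sentence, starting from <start>."""
    prob = 1.0
    prev = "<start>"
    for word in words:
        # An unseen bigram gets probability 0 under this unsmoothed model.
        prob *= bigram_p.get((prev, word), 0.0)
        prev = word
    return prob


p = sentence_probability("I want to eat British food".split(), BIGRAM_P)
print(f"{p:.3g}")
```

Any bigram missing from the table drives the whole product to zero, which is the sparseness problem that smoothing addresses.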

24
Some adjustments
  • product of probabilities: numerical underflow for
    long sentences
  • so instead of multiplying the probs, we add the
    log of the probs
  • P(I want to eat British food)
  • Computed using
  • log(P(I | <s>)) + log(P(want | I)) + log(P(to | want))
    + log(P(eat | to)) + log(P(British | eat))
    + log(P(food | British))
  • = log(.25) + log(.32) + log(.65) + log(.26)
    + log(.001) + log(.6)
  • = -11.722
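
A short sketch of the log-space computation, assuming natural logarithms (which is what reproduces the value -11.722). The variable names are illustrative.

```python
import math

# The six bigram probabilities used for "I want to eat British food".
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.6]

# Summing natural logs avoids underflow when many small factors are multiplied.
log_prob = sum(math.log(p) for p in probs)
print(round(log_prob, 3))

# Exponentiating recovers the plain product (only safe for short sentences).
print(math.exp(log_prob))
```

Comparing two hypotheses only requires comparing their log probabilities, so the exponentiation step is normally skipped.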

25
Why use only bi- or tri-grams?
  • Markov approximation is still costly
  • with a 20 000 word vocabulary
  • bigram needs to store 400 million parameters
  • trigram needs to store 8 trillion parameters
  • using a language model > trigram is impractical
  • to reduce the number of parameters, we can
  • do stemming (use stems instead of word types)
  • group words into semantic classes
  • seen once --> same as unseen
  • ...

28
Building n-gram Models
  • Data preparation
  • Decide training corpus
  • Clean and tokenize
  • How do we deal with sentence boundaries?
  • I eat. I sleep.
  • (I eat) (eat I) (I sleep)
  • <s> I eat <s> I sleep <s>
  • (<s> I) (I eat) (eat <s>) (<s> I) (I sleep)
    (sleep <s>)
  • Use statistical estimators
  • to derive good probability estimates based on
    training data.
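
The boundary-marker scheme can be sketched as follows. The function name `bigrams_with_boundaries` is hypothetical, and the sketch assumes sentences arrive already split and stripped of punctuation.

```python
def bigrams_with_boundaries(sentences, marker="<s>"):
    """Return bigrams over the token stream <s> s1 <s> s2 <s> ..."""
    tokens = [marker]
    for sentence in sentences:
        tokens.extend(sentence.split())
        tokens.append(marker)
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))


pairs = bigrams_with_boundaries(["I eat", "I sleep"])
print(pairs)
```

With the marker in place, the spurious cross-sentence bigram (eat I) no longer appears.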

29
Statistical Estimators
  • Maximum Likelihood Estimation (MLE)
  • Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's and Jeffreys-Perks Laws
    (ELE)
  • ( Validation
  • Held Out Estimation
  • Cross Validation )
  • Witten-Bell smoothing
  • Good-Turing smoothing
  • Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff

31
Maximum Likelihood Estimation
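  • Choose the parameter values which give the
    highest probability on the training corpus
  • Let $C(w_1, \ldots, w_n)$ be the frequency of n-gram
    $w_1, \ldots, w_n$
  • $P_{MLE}(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})}$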

32
Example 1: P(event)
  • in a training corpus, we have 10 instances of
    come across
  • 8 times, followed by as
  • 1 time, followed by more
  • 1 time, followed by a
  • with MLE, we have
  • P(as | come across) = 0.8
  • P(more | come across) = 0.1
  • P(a | come across) = 0.1
  • P(X | come across) = 0 where X ∉ {as, more, a}
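
The example amounts to taking relative frequencies. A minimal sketch, with the counts taken from the slide and the helper name `p_mle` chosen for illustration:

```python
from collections import Counter

# Counts of the word following "come across" in the training corpus.
next_word_counts = Counter({"as": 8, "more": 1, "a": 1})


def p_mle(word, counts):
    """Relative frequency C(come across, word) / C(come across)."""
    total = sum(counts.values())
    return counts[word] / total  # a Counter returns 0 for unseen words


for w in ["as", "more", "a", "the"]:
    print(w, p_mle(w, next_word_counts))
```

The last line of output shows the zero assigned to a continuation that never occurred in training.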

33
Problem with MLE: data sparseness
  • What if a sequence never appears in training
    corpus? P(X) = 0
  • come across the men --> prob = 0
  • come across some men --> prob = 0
  • come across 3 men --> prob = 0
  • MLE assigns a probability of zero to unseen
    events
  • probability of an n-gram involving unseen words
    will be zero!

34
Maybe with a larger corpus?
  • Some words or word combinations are unlikely to
    appear !!!
  • Recall
  • Zipf's law
  • f ∝ 1/r

35
Problem with MLE: data sparseness (cont)
  • in (Bahl et al 83)
  • training with 1.5 million words
  • 23% of the trigrams from another part of the same
    corpus were previously unseen.
  • So MLE alone is not a good enough estimator

36
Discounting or Smoothing
  • MLE is usually unsuitable for NLP because of the
    sparseness of the data
  • We need to allow for possibility of seeing events
    not seen in training
  • Must use a Discounting or Smoothing technique
  • Decrease the probability of previously seen
    events to leave a little bit of probability for
    previously unseen events

38
Many smoothing techniques
  • Add-one
  • Add-delta
  • Witten-Bell smoothing
  • Good-Turing smoothing
  • Church-Gale smoothing
  • Absolute-discounting
  • Kneser-Ney smoothing
  • ...

39
Add-one Smoothing (Laplace's law)
  • Pretend we have seen every n-gram at least once
  • Intuitively
  • new_count(n-gram) = old_count(n-gram) + 1
  • The idea is to give a little bit of the
    probability space to unseen events

40
Add-one Example
unsmoothed bigram counts
2nd word
unsmoothed normalized bigram probabilities
41
Add-one Example (cont)
add-one smoothed bigram counts
add-one normalized bigram probabilities
42
Add-one, more formally
  • $P_{Add1}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + 1}{N + B}$
  • N = nb of n-grams in training corpus
  • B = nb of bins (of possible n-grams)
  • $B = V^2$ for bigrams
  • $B = V^3$ for trigrams etc.
  • where V is the size of the vocabulary
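
A small sketch of add-one smoothing over joint bigram probabilities, assuming the usual Laplace form (C + 1) / (N + B) with N and B as defined above. The toy vocabulary and counts are hypothetical.

```python
from collections import Counter


def add_one_probability(ngram, counts, n_tokens, n_bins):
    """Laplace estimate (C + 1) / (N + B) for a single n-gram."""
    return (counts[ngram] + 1) / (n_tokens + n_bins)


# Toy bigram data (hypothetical): 3 vocabulary items, so B = V**2 = 9 bins.
vocab = ["a", "b", "c"]
counts = Counter({("a", "b"): 3, ("b", "c"): 2, ("c", "a"): 1})
N = sum(counts.values())  # 6 bigram tokens
B = len(vocab) ** 2       # 9 possible bigrams

seen = add_one_probability(("a", "b"), counts, N, B)
unseen = add_one_probability(("a", "a"), counts, N, B)
total = sum(add_one_probability((x, y), counts, N, B) for x in vocab for y in vocab)
print(seen, unseen, total)
```

The estimates over all nine bins sum to 1, and the six unseen bins together receive 6/15 of the mass even in this tiny example.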

43
Problem with add-one smoothing
  • bigrams starting with Chinese are boosted by a
    factor of 8 ! (1829 / 213)

unsmoothed bigram counts
add-one smoothed bigram counts
44
Problem with add-one smoothing (cont)
  • Data from the AP from (Church and Gale, 1991)
  • Corpus of 22,000,000 bigrams
  • Vocabulary of 273,266 words (i.e. 74,674,306,760
    possible bigrams - or bins)
  • 74,671,100,000 bigrams were unseen
  • And each unseen bigram was given a frequency of
    0.000137

f_MLE                 f_empirical           f_add-one
(freq. from           (freq. from           (add-one
training data)        held-out data)        smoothed freq.)
0                     0.000027              0.000137
1                     0.448                 0.000274
2                     1.25                  0.000411
3                     2.24                  0.000548
4                     3.23                  0.000685
5                     4.21                  0.000822

f_add-one is too high for f_MLE = 0 and too low for f_MLE >= 1.

  • Total probability mass given to unseen bigrams
  • (74,671,100,000 x 0.000137) / 22,000,000 = 0.465
    !!!!
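
The unseen-mass figure is straightforward arithmetic on the numbers quoted from Church and Gale. A quick check, with variable names chosen for illustration:

```python
unseen_bigrams = 74_671_100_000  # bins with zero training count
f_add_one_unseen = 0.000137      # smoothed frequency given to each unseen bigram
corpus_size = 22_000_000         # bigram tokens in the corpus

unseen_mass = unseen_bigrams * f_add_one_unseen / corpus_size
print(round(unseen_mass, 3))
```

Almost half of the probability mass ends up on bigrams that were never observed.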

45
Problem with add-one smoothing
  • every previously unseen n-gram is given a low
    probability
  • but there are so many of them that too much
    probability mass is given to unseen events
  • adding 1 to a frequent bigram does not change much
  • but adding 1 to low-frequency bigrams (including
    unseen ones) boosts them too much!
  • In NLP applications that are very sparse,
    Laplace's Law actually gives far too much of the
    probability space to unseen events.

47
Add-delta smoothing (Lidstone's law)
  • instead of adding 1, add some other (smaller)
    positive value λ
  • most widely used value for λ = 0.5
  • if λ = 0.5, Lidstone's Law is called
  • the Expected Likelihood Estimation (ELE)
  • or the Jeffreys-Perks Law
  • better than add-one, but still
48
Adding λ / Lidstone's law
The expected frequency of a trigram in a random sample of size N is therefore
$$f^*(w, w', w'') = \mu\, f(w, w', w'') + (1 - \mu)\, \frac{N}{B}, \qquad \mu = \frac{N}{N + B\lambda}$$
--> relative discounting
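
A sketch of Lidstone's law, assuming the standard form (C + λ) / (N + Bλ). The second function writes the same estimate as a mixture of the MLE and the uniform distribution with weight μ = N / (N + Bλ); the function names and toy sizes are hypothetical.

```python
def lidstone_probability(count, n_tokens, n_bins, lam=0.5):
    """Lidstone estimate (C + lam) / (N + B * lam); lam=1 is add-one, lam=0.5 is ELE."""
    return (count + lam) / (n_tokens + n_bins * lam)


def lidstone_as_interpolation(count, n_tokens, n_bins, lam=0.5):
    """Same estimate written as mu * C/N + (1 - mu) * 1/B."""
    mu = n_tokens / (n_tokens + n_bins * lam)
    return mu * count / n_tokens + (1 - mu) / n_bins


N, B = 6, 9  # toy sizes (hypothetical)
for c in (0, 1, 3):
    print(c, lidstone_probability(c, N, B), lidstone_as_interpolation(c, N, B))
```

Both columns agree, which is the sense in which add-λ smoothing is a linear interpolation between the observed relative frequency and a uniform prior.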
50
Validation / Held-out Estimation
  • How do we know how much of the probability space
    to hold out for unseen events?
  • i.e. we need a good way to guess λ in advance
  • Held-out data
  • We can divide the training data into two parts
  • the training set used to build initial estimates
    by counting
  • the held out data used to refine the initial
    estimates (i.e. see how often the bigrams that
    appeared r times in the training text occur in
    the held-out text)

51
Held Out Estimation
  • For each n-gram $w_1 \ldots w_n$ we compute
  • $C_{tr}(w_1 \ldots w_n)$: the frequency of $w_1 \ldots w_n$ in the
    training data
  • $C_{ho}(w_1 \ldots w_n)$: the frequency of $w_1 \ldots w_n$ in the held
    out data
  • Let
  • $r$ = the frequency of an n-gram in the training
    data
  • $N_r$ = the number of different n-grams with
    frequency r in the training data
  • $T_r$ = the sum of the counts of all n-grams in the
    held-out data that appeared r times in the
    training data
  • $T$ = total number of n-grams in the held out data
  • So $P_{ho}(w_1 \ldots w_n) = \frac{T_r}{T} \cdot \frac{1}{N_r}$, where $r = C_{tr}(w_1 \ldots w_n)$


52
Some explanation
$T_r / T$: probability in held-out data for all n-grams
appearing r times in the training data
$1 / N_r$: since we have $N_r$ different n-grams in the
training data that occurred r times, let's share
this probability mass equally among them

  • ex: assume
  • if r = 5 and 10 different n-grams (types) occur 5
    times in training
  • --> $N_5 = 10$
  • if all the n-grams (types) that occurred 5 times
    in training, occurred in total (n-gram tokens) 20
    times in the held-out data
  • --> $T_5 = 20$
  • assume the held-out data contains 2000 n-grams
    (tokens)
  • --> each such n-gram gets (20 / 2000) x (1 / 10) = 0.001
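
The example can be worked through in code. This sketch assumes the held-out estimate is the held-out mass T_r / T shared equally among the N_r n-gram types, as the explanation above describes; the function name is hypothetical.

```python
def held_out_probability(t_r, n_r, t_total):
    """P_ho = T_r / (N_r * T) for any n-gram seen r times in training."""
    return t_r / (n_r * t_total)


# Values from the example: N_5 = 10, T_5 = 20, T = 2000.
p = held_out_probability(t_r=20, n_r=10, t_total=2000)
print(p)
```

Each of the ten n-gram types seen five times in training therefore gets the same held-out probability.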


53
Cross-Validation
  • Held Out estimation is useful if there is a lot
    of data available
  • If not, we can use each part of the data both as
    training data and as held out data.
  • Main methods
  • Deleted Estimation (two-way cross validation)
  • Divide data into part 0 and part 1
  • In one model use 0 as the training data and 1 as
    the held out data
  • In another model use 1 as training and 0 as held
    out data.
  • Do a weighted average of the two models
  • Leave-One-Out
  • Divide data into N parts (N = nb of tokens)
  • Leave 1 token out each time
  • Train N language models

54
Comparison
Empirical results for bigram data (Church and Gale)

f    f_emp      f_GT       f_add1     f_held-out
0    0.000027   0.000027   0.000137   0.000037
1    0.448      0.446      0.000274   0.396
2    1.25       1.26       0.000411   1.24
3    2.24       2.24       0.000548   2.23
4    3.23       3.24       0.000685   3.22
5    4.21       4.22       0.000822   4.22
6    5.23       5.19       0.000959   5.20
7    6.21       6.21       0.00109    6.21
8    7.21       7.24       0.00123    7.18
9    8.26       8.25       0.00137    8.18
55
Dividing the corpus
  • Training
  • Training data (80% of total data)
  • To build initial estimates (frequency counts)
  • Held out data (10% of total data)
  • To refine initial estimates (smoothed estimates)
  • Testing
  • Development test data (5% of total data)
  • To test while developing
  • Final test data (5% of total data)
  • To test at the end
  • But how do we divide?
  • Randomly select data (ex. sentences, n-grams)
  • Advantage: test data is very similar to training
    data
  • Cut large chunks of consecutive data
  • Advantage: results are lower, but more realistic
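
One way to cut the corpus into consecutive chunks with the proportions above is sketched below. The function `split_corpus` and the placeholder corpus are hypothetical; a random split would shuffle the sentences first.

```python
def split_corpus(sentences, fractions=(0.80, 0.10, 0.05, 0.05)):
    """Cut consecutive chunks: training, held-out, development test, final test."""
    n = len(sentences)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes whatever remains so no sentence is dropped.
        end = n if i == len(fractions) - 1 else start + round(n * frac)
        parts.append(sentences[start:end])
        start = end
    return parts


corpus = [f"sentence {i}" for i in range(100)]
train, held_out, dev_test, final_test = split_corpus(corpus)
print(len(train), len(held_out), len(dev_test), len(final_test))
```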

56
Developing and Testing Models
  • Write an algorithm
  • Train it
  • With training set + held-out data
  • Test it
  • With development set
  • Note things it does wrong and revise it
  • Repeat 1-5 until satisfied
  • Only then, evaluate and publish results
  • With final test set
  • Better to give final results by testing on n
    smaller samples of the test data and averaging

58
Witten-Bell smoothing
  • intuition
  • An unseen n-gram is one that just did not occur
    yet
  • When it does happen, it will be its first
    occurrence
  • So give to unseen n-grams the probability of
    seeing a new n-gram

59
Witten-Bell: the equations
  • Total probability mass assigned to zero-frequency
    N-grams
  • (NB T is OBSERVED types, not V)
  • So each zero N-gram gets the probability

60
Witten-Bell: why discounting
  • Now of course we have to take away something
    (discount) from the probability of the events
    seen more than once

61
Witten-Bell for bigrams
  • We relativize the types to the previous word
  • this probability mass, must be distributed in
    equal parts over all unseen bigrams
  • $Z(w_1)$ = number of unseen n-grams starting with
    $w_1$

  • for each unseen event

62
Small example
  • all unseen bigrams starting with a will share a
    probability mass of
  • each unseen bigrams starting with a will have an
    equal part of this

63
Small example (cont)
  • all unseen bigrams starting with b will share a
    probability mass of
  • each unseen bigrams starting with b will have an
    equal part of this

64
Small example (cont)
  • all unseen bigrams starting with c will share a
    probability mass of
  • each unseen bigrams starting with c will have an
    equal part of this

65
More formally
  • Unseen bigrams
  • To get from the probabilities back to the counts,
    we know that

  • $N(w_1)$ = nb of bigrams starting with $w_1$
  • so we get

66
The restaurant example
  • The original counts were
  • T(w) = number of different seen bigram types
    starting with w
  • we have a vocabulary of 1616 words, so we can
    compute
  • Z(w) = number of unseen bigram types starting
    with w
  • Z(w) = 1616 - T(w)
  • N(w) = number of bigram tokens starting with w
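
A sketch of Witten-Bell smoothing for bigrams built on the quantities T(w), Z(w) and N(w). It assumes the usual conditional formulation: a seen bigram gets C(w1 w2) / (N(w1) + T(w1)), and the reserved mass T(w1) / (N(w1) + T(w1)) is shared equally among the Z(w1) unseen bigrams. The toy counts are hypothetical, and the sketch assumes w1 was seen at least once in training.

```python
from collections import Counter, defaultdict


def witten_bell_bigram(bigram_counts, vocab):
    """Return a function p(w2, w1) giving the smoothed P(w2 | w1)."""
    n_tokens = Counter()          # N(w1): bigram tokens starting with w1
    followers = defaultdict(set)  # distinct words seen after w1
    for (w1, w2), c in bigram_counts.items():
        n_tokens[w1] += c
        followers[w1].add(w2)

    def prob(w2, w1):
        n = n_tokens[w1]
        t = len(followers[w1])    # T(w1): seen bigram types starting with w1
        z = len(vocab) - t        # Z(w1): unseen bigram types starting with w1
        c = bigram_counts.get((w1, w2), 0)
        if c > 0:
            return c / (n + t)    # discounted estimate for a seen bigram
        return t / (z * (n + t))  # equal share of the reserved mass

    return prob


vocab = ["a", "b", "c"]
counts = {("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2}
p = witten_bell_bigram(counts, vocab)
print(p("b", "a"), p("a", "a"))
```

For each history in this toy example the probabilities over the vocabulary sum to 1.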

67
Witten-Bell smoothed count
  • the count of the unseen bigram I lunch
  • the count of the seen bigram want to
  • Witten-Bell smoothed bigram counts

68
Witten-Bell smoothed probabilities
Witten-Bell normalized bigram probabilities
70
Good-Turing Estimator
  • Based on the assumption that words have a
    binomial distribution
  • Works well in practice (with large corpora)
  • Idea
  • Re-estimate the probability mass of n-grams with
    zero (or low) counts by looking at the number of
    n-grams with higher counts
  • Re-estimated count: $c^* = (c + 1) \frac{N_{c+1}}{N_c}$
  • $N_{c+1}$ = nb of n-grams that occur c+1 times
  • $N_c$ = nb of n-grams that occur c times
71
Good-Turing Estimator (cont)
  • In practice $c^*$ is not used for all counts c
  • large counts (> a threshold k) are assumed to be
    reliable
  • If c > k (usually k = 5)
  • $c^* = c$
  • If c <= k: use the re-estimated count $c^*$
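
A minimal sketch of the Good-Turing re-estimate, assuming the basic form c* = (c + 1) N_{c+1} / N_c for counts up to k and leaving larger counts unchanged. Keeping c as-is when N_{c+1} = 0 is an assumption of this sketch, and the function name and toy counts are hypothetical.

```python
from collections import Counter


def good_turing_counts(ngram_counts, k=5):
    """Map each observed count c to its re-estimated count c*."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c
    adjusted = {}
    for c in sorted(freq_of_freq):
        n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
        if c > k or n_next == 0:
            adjusted[c] = float(c)  # large (or unsupported) counts are kept
        else:
            adjusted[c] = (c + 1) * n_next / n_c
    return adjusted


# Three n-grams seen once, two seen twice, one seen three times.
toy_counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_counts(toy_counts)
print(adjusted)
```

Real corpora need smoothed N_c values for the higher counts, since N_{c+1} is often zero or noisy there.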

73
Combining Estimators
  • so far, we gave the same probability to all
    unseen n-grams
  • we have never seen the bigrams
  • journal of: P_unsmoothed(of | journal) = 0
  • journal from: P_unsmoothed(from | journal) = 0
  • journal never: P_unsmoothed(never | journal) = 0
  • all models so far will give the same probability
    to all 3 bigrams
  • but intuitively, journal of is more probable
    because...
  • of is more frequent than from and never
  • unigram probability P(of) > P(from) > P(never)

74
Combining Estimators (cont)
  • observation
  • unigram model suffers less from data sparseness
    than bigram model
  • bigram model suffers less from data sparseness
    than trigram model
  • so use a lower model estimate, to estimate
    probability of unseen n-grams
  • if we have several models of how the history
    predicts what comes next, we can combine them in
    the hope of producing an even better model

76
Simple Linear Interpolation
  • Solve the sparseness in a trigram model by mixing
    with bigram and unigram models
  • Also called
  • linear interpolation,
  • finite mixture models
  • deleted interpolation
  • Combine linearly
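  • $P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2}, w_{n-1})$
  • where $0 \le \lambda_i \le 1$ and $\sum_i \lambda_i = 1$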
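
Simple linear interpolation is a weighted sum of the three component estimates. In this sketch the weights (0.1, 0.3, 0.6) and the component probabilities are hypothetical; in practice the weights would be tuned on held-out data.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """P_li = l1 * P(w) + l2 * P(w | w_prev) + l3 * P(w | w_prev2, w_prev)."""
    l1, l2, l3 = lambdas
    # The weights must form a convex combination.
    assert all(0.0 <= lam <= 1.0 for lam in lambdas)
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# Unseen trigram (MLE 0) with a seen bigram and unigram.
p = interpolate(p_unigram=0.01, p_bigram=0.2, p_trigram=0.0)
print(p)
```

Even though the trigram estimate is zero, the interpolated probability stays positive because the lower-order models contribute.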

78
General Linear Interpolation
  • In simple linear interpolation, the weights $\lambda_i$
    are constant
  • So the unigram estimate is always combined with
    the same weight, regardless of whether the
    trigram is accurate (because there is lots of
    data) or poor
  • We can have a more general and powerful model
    where $\lambda_i$ are a function of the history h:
    $P_{li}(w \mid h) = \sum_i \lambda_i(h)\, P_i(w \mid h)$
  • where $0 \le \lambda_i(h) \le 1$ and $\sum_i \lambda_i(h) = 1$
  • Having a specific $\lambda(h)$ per n-gram is not a good
    idea, but we can set a $\lambda(h)$ according to the
    frequency of the n-gram

80
Backoff Smoothing
Smoothing of conditional probabilities: p(Angeles | to, Los)
If "to Los Angeles" is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los).
However, the actual probability is probably close to the bigram probability p(Angeles | Los).
81
Backoff Smoothing
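(Wrong) back-off smoothing of trigram probabilities:
if $C(w', w'', w) > 0$: $P^*(w \mid w', w'') = P(w \mid w', w'')$
else if $C(w'', w) > 0$: $P^*(w \mid w', w'') = P(w \mid w'')$
else if $C(w) > 0$: $P^*(w \mid w', w'') = P(w)$
else: $P^*(w \mid w', w'') = 1 / (\text{number of words})$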
82
Backoff Smoothing
Problem: not a probability distribution
Solution: combination of back-off and frequency discounting
$$P(w \mid w_1,\ldots,w_k) = \begin{cases} C(w_1,\ldots,w_k,w) / N & \text{if } C(w_1,\ldots,w_k,w) > 0 \\ \alpha(w_1,\ldots,w_k)\, P(w \mid w_2,\ldots,w_k) & \text{otherwise} \end{cases}$$
83
Backoff Smoothing
The backoff factor is defined such that the probability mass assigned to unobserved trigrams
$$\sum_{w:\, C(w_1,\ldots,w_k,w) = 0} \alpha(w_1,\ldots,w_k)\, P(w \mid w_2,\ldots,w_k)$$
is identical to the probability mass discounted from the observed trigrams
$$1 - \sum_{w:\, C(w_1,\ldots,w_k,w) > 0} P(w \mid w_1,\ldots,w_k)$$
Therefore, we get
$$\alpha(w_1,\ldots,w_k) = \frac{1 - \sum_{w:\, C(w_1,\ldots,w_k,w) > 0} P(w \mid w_1,\ldots,w_k)}{1 - \sum_{w:\, C(w_1,\ldots,w_k,w) > 0} P(w \mid w_2,\ldots,w_k)}$$
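
The role of the backoff factor can be illustrated on a bigram model that backs off to unigrams. This is a sketch under stated assumptions, not the slides' exact scheme: observed bigrams are discounted by a fixed amount (0.5) and normalised by the history count, the lower-order model is the unigram MLE, and every queried word is in the vocabulary. The function name and toy text are hypothetical.

```python
from collections import Counter


def make_backoff_bigram(tokens, discount=0.5):
    """Bigram model that backs off to unigram MLE, with alpha chosen to normalise."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # C(v) counted as a bigram history
    total = len(tokens)

    def p_uni(w):
        return unigrams[w] / total

    def p_seen(w, v):
        # Discounted relative frequency for an observed bigram (assumed scheme).
        return (bigrams[(v, w)] - discount) / history[v]

    def alpha(v):
        seen = [w for w in unigrams if bigrams[(v, w)] > 0]
        left_over = 1.0 - sum(p_seen(w, v) for w in seen)
        return left_over / (1.0 - sum(p_uni(w) for w in seen))

    def prob(w, v):
        if bigrams[(v, w)] > 0:
            return p_seen(w, v)
        return alpha(v) * p_uni(w)

    return prob, sorted(unigrams)


prob, vocab = make_backoff_bigram("a b a c a b b".split())
for v in vocab:
    print(v, sum(prob(w, v) for w in vocab))
```

Each printed sum is 1 up to floating-point rounding, which is exactly what the choice of alpha guarantees and what the "wrong" back-off scheme fails to provide.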
84
Other applications of LM
  • Author / Language identification
  • hypothesis: texts that resemble each other (same
    author, same language) share similar
    characteristics
  • In English, the character sequence "ing" is more
    probable than in French
  • Training phase
  • construction of the language model
  • with pre-classified documents (known
    language/author)
  • Testing phase
  • evaluation of unknown text (comparison with
    language model)

85
Example: Language identification
  • bigram of characters
  • characters: 26 letters (case insensitive)
  • possible variations: case sensitivity,
    punctuation, beginning/end of sentence marker, ...

86
1. Train a language model for English
2. Train a language model for French
3. Evaluate probability of a sentence with LM-English and LM-French
4. Highest probability --> language of sentence
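
The four steps can be sketched with character bigram models. Everything concrete here is an assumption made for illustration: the two one-line training texts are hypothetical stand-ins for real corpora, the space is kept as a 27th symbol, and add-one smoothing is used so that unseen character pairs do not zero out a score.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalise(text):
    """Lower-case and keep only the 26 letters and the space."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)


def train_char_bigram_model(text):
    """Return a function giving add-one smoothed log P(c2 | c1)."""
    text = normalise(text)
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    v = len(ALPHABET)

    def log_prob(c1, c2):
        return math.log((pair_counts[(c1, c2)] + 1) / (first_counts[c1] + v))

    return log_prob


def score(sentence, log_prob):
    """Sum of log P(c2 | c1) over the character bigrams of the sentence."""
    s = normalise(sentence)
    return sum(log_prob(a, b) for a, b in zip(s, s[1:]))


def identify(sentence, models):
    """Pick the language whose model gives the highest log probability."""
    return max(models, key=lambda lang: score(sentence, models[lang]))


models = {
    "english": train_char_bigram_model(
        "the cat sat on the mat and the dog is in the house with the children"),
    "french": train_char_bigram_model(
        "le chat est sur le tapis et le chien est dans la maison avec les enfants"),
}
print(identify("the dog is in the house", models))
print(identify("le chien est dans la maison", models))
```

With realistic training corpora the same comparison works on sentences that share no words with the training data, because the character bigram statistics of the two languages differ.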
87
Claim
  • A useful part of the knowledge needed to allow
    Word Prediction can be captured using simple
    statistical techniques.
  • Compute
  • probability of a sequence
  • likelihood of words co-occurring
  • It can be useful to do this.