BASIC TECHNIQUES IN STATISTICAL NLP
1
BASIC TECHNIQUES IN STATISTICAL NLP
  • Word prediction, n-grams, smoothing

2
Statistical Methods in NLE
  • Two characteristics of NL make it desirable to
    endow programs with the ability to LEARN from
    examples of past use
  • VARIETY (no programmer can really take into
    account all possibilities)
  • AMBIGUITY (need to have ways of choosing between
    alternatives)
  • In a number of NLE applications, statistical
    methods are very common
  • The simplest application: WORD PREDICTION

3
We are good at word prediction
Stocks plunged this morning, despite a cut in
interest ...
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall ...
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall
Street began ...
4
Real Spelling Errors
They are leaving in about fifteen minuets to go to her house.
The study was conducted mainly be John Black.
The design an construction of the system will take more than one year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of this problem.
He is trying to fine out.
5
Handwriting recognition
  • From Woody Allen's Take the Money and Run (1969)
  • Allen (a bank robber) walks up to the teller and
    hands him a note that reads: "I have a gun. Give
    me all your cash."
  • The teller, however, is puzzled, because he reads
    "I have a gub." "No, it's gun", Allen says.
  • "Looks like 'gub' to me," the teller says, then
    asks another teller to help him read the note,
    then another, and finally everyone is arguing
    over what the note means.

6
Applications of word prediction
  • Spelling checkers
  • Mobile phone texting
  • Speech recognition
  • Handwriting recognition
  • Disabled users

7
Statistics and word prediction
  • The basic idea underlying the statistical
    approach to word prediction is to use the
    probabilities of SEQUENCES OF WORDS to choose
    the most likely next word / correction of
    spelling error
  • I.e., to compute
  • P(w | w1 ... wn-1)
  • for all words w, and predict as next word the one
    for which this (conditional) probability is
    highest.
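As an illustration of this idea (not part of the original slides), here is a minimal Python sketch of the prediction step, assuming we already have a table of conditional probabilities; the table and its numbers are invented for the example:

```python
# A hypothetical table of conditional probabilities P(w | history).
cond_prob = {("I", "want"): {"to": 0.65, "a": 0.10, "some": 0.05}}

def predict_next(history, cond_prob):
    """Return the word w with the highest P(w | history), or None if the history is unseen."""
    candidates = cond_prob.get(tuple(history))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(predict_next(("I", "want"), cond_prob))  # -> 'to'
```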
8
Using corpora to estimate probabilities
  • But where do we get these probabilities? Idea:
    estimate them by RELATIVE FREQUENCY.
  • The simplest method: the Maximum Likelihood
    Estimate (MLE). Count the number of words in a
    corpus, then count how many times a given
    sequence is encountered.
  • "Maximum" because it doesn't waste any
    probability mass on events not in the corpus

9
Maximum Likelihood Estimation for conditional
probabilities
  • In order to estimate P(w | w1 ... wn-1), we can
    use instead
  • P(w | w1 ... wn-1) = C(w1 ... wn-1 w) / C(w1 ... wn-1)
  • Cf.
  • P(A|B) = P(A,B) / P(B)
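A minimal sketch of this count-based (MLE) estimate in Python, using a toy corpus invented for the example:

```python
from collections import Counter

# MLE: P(w | w1 ... wn-1) = C(w1 ... wn-1 w) / C(w1 ... wn-1),
# estimated from raw counts over a toy corpus.
corpus = "i want to eat chinese food i want to eat dinner".split()

def mle_cond_prob(corpus, history, w):
    n = len(history) + 1
    ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    contexts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))
    history = tuple(history)
    if contexts[history] == 0:
        return 0.0
    return ngrams[history + (w,)] / contexts[history]

print(mle_cond_prob(corpus, ("want",), "to"))          # 2 / 2 = 1.0 in this toy corpus
print(mle_cond_prob(corpus, ("to", "eat"), "dinner"))  # 1 / 2 = 0.5
```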

10
Aside: counting words in corpora
  • Keep in mind that it's not always so obvious what
    a word is (cf. yesterday)
  • In text:
  • He stepped out into the hall, was delighted to
    encounter a brother. (From the Brown corpus.)
  • In speech:
  • I do uh main- mainly business data processing
  • LEMMAS: cats vs. cat
  • TYPES vs. TOKENS
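A small Python illustration of the types vs. tokens distinction (the sentence is made up for the example):

```python
# Tokens are running words; types are distinct word forms.
text = "the cat sat on the mat because the cat was tired"
tokens = text.split()
types = set(tokens)
print(len(tokens), len(types))  # 11 tokens, 8 types
```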

11
The problem: sparse data
  • In principle, we would like the n of our models
    to be fairly large, to model long-distance
    dependencies such as
  • Sue SWALLOWED the large green ...
  • However, in practice, sequences of words of
    length greater than 3 hardly ever occur in our
    corpora! (See below)
  • (Part of the) solution: we APPROXIMATE the
    probability of a word given all previous words

12
The Markov Assumption
  • The probability of being in a certain state only
    depends on the previous state
  • P(Xn = sk | X1 ... Xn-1) = P(Xn = sk | Xn-1)
  • This is equivalent to the assumption that the
    next state only depends on the previous m inputs,
    for m finite
  • (N-gram models / Markov models can be seen as
    probabilistic finite state automata)

13
The Markov assumption for language: n-gram models
  • Making the Markov assumption for word prediction
    means assuming that the probability of a word
    only depends on the previous n-1 words (N-GRAM
    model)

14
Bigrams and trigrams
  • Typical values of n are 2 or 3 (BIGRAM or TRIGRAM
    models)
  • P(Wn | W1 ... Wn-1) ≈ P(Wn | Wn-2, Wn-1)
  • P(W1 ... Wn) ≈ ∏i P(Wi | Wi-2, Wi-1)
  • What the bigram model means in practice:
  • Instead of P(rabbit | Just the other day I saw a)
  • We use P(rabbit | a)
  • Unigram: P(dog)  Bigram: P(dog | big)  Trigram:
    P(dog | the, big)

15
The chain rule
  • So how can we compute the probability of
    sequences of words longer than 2 or 3? We use the
    CHAIN RULE
  • E.g.,
  • P(the big dog) = P(the) P(big | the) P(dog | the big)
  • Then we use the Markov assumption to reduce this
    to manageable proportions

16
Example: the Berkeley Restaurant Project (BERP)
corpus
  • BERP is a speech-based restaurant consultant
  • The corpus contains user queries; examples
    include:
  • I'm looking for Cantonese food
  • I'd like to eat dinner someplace nearby
  • Tell me about Chez Panisse
  • Im looking for a good place to eat breakfast

17
Computing the probability of a sentence
  • Given a corpus like BERP, we can compute the
    probability of a sentence like "I want to eat
    Chinese food"
  • Making the bigram assumption and using the chain
    rule, the probability can be approximated as
    follows:
  • P(I want to eat Chinese food) ≈
  • P(I | sentence start) P(want | I)
    P(to | want) P(eat | to)
  • P(Chinese | eat) P(food | Chinese)
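A hedged Python sketch of this bigram approximation, multiplying conditional probabilities along the sentence; the probability table simply reuses the numbers quoted a few slides below and is not the real BERP model:

```python
def bigram_sentence_prob(words, bigram_prob, start="<s>"):
    """Multiply P(w_i | w_{i-1}) along the sentence, starting from a start marker."""
    prob, prev = 1.0, start
    for w in words:
        prob *= bigram_prob.get((prev, w), 0.0)  # unseen bigrams get 0 without smoothing
        prev = w
    return prob

# Probabilities taken from the worked example on the later slide.
bigram_prob = {("<s>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
               ("to", "eat"): .26, ("eat", "Chinese"): .002, ("Chinese", "food"): .60}
print(bigram_sentence_prob("I want to eat Chinese food".split(), bigram_prob))  # ~ .000016
```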

18
Bigram counts
19
How the bigram probabilities are computed
  • Example: P(I | I)
  • C(I, I) = 8
  • C(I) = 8 + 1087 + 13 + ... = 3437
  • P(I | I) = 8 / 3437 = .0023

20
Bigram probabilities
21
The probability of the example sentence
  • P(I want to eat Chinese food) ≈
  • P(I | sentence start) P(want | I) P(to | want)
    P(eat | to) P(Chinese | eat) P(food | Chinese)
  • = .25 × .32 × .65 × .26 × .002 × .60 ≈ .000016

22
Examples of actual bigram probabilities computed
using BERP
23
Visualizing an n-gram-based language model: the
Shannon/Miller/Selfridge method
  • For unigrams:
  • Choose a random value r between 0 and 1
  • Print out the word w whose probability interval
    contains r
  • For bigrams:
  • Choose a random bigram (<s>, w) according to its
    probability P(w | <s>)
  • Then pick bigrams to follow it in the same way
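A minimal Python sketch of this generation procedure for bigrams; the probability table is a toy stand-in for one estimated from a corpus:

```python
import random

def generate(bigram_prob, start="<s>", end="</s>", max_len=20):
    """Sample a sentence word by word from P(w | previous word)."""
    sentence, prev = [], start
    for _ in range(max_len):
        candidates = bigram_prob.get(prev)
        if not candidates:
            break
        words = list(candidates)
        w = random.choices(words, weights=[candidates[x] for x in words])[0]
        if w == end:
            break
        sentence.append(w)
        prev = w
    return " ".join(sentence)

# A toy bigram table; a real one would be estimated from a corpus such as Shakespeare.
bigram_prob = {"<s>": {"I": 0.6, "Tell": 0.4},
               "I": {"want": 1.0}, "want": {"to": 1.0}, "to": {"eat": 1.0},
               "eat": {"</s>": 1.0}, "Tell": {"me": 1.0}, "me": {"</s>": 1.0}}
print(generate(bigram_prob))
```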

24
The Shannon/Miller/Selfridge method trained on
Shakespeare
25
Approximating Shakespeare, cont'd
26
A more formal evaluation mechanism
  • Entropy
  • Cross-entropy
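These measures are only named here; as a rough sketch (assuming the standard definitions, not anything specific to these slides), cross-entropy is the average negative log2 probability per word, and perplexity is 2 raised to it:

```python
import math

def cross_entropy(word_probs):
    """Average negative log2 probability per word of a test text."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

# The per-word probabilities the bigram model assigned to the example sentence.
probs = [.25, .32, .65, .26, .002, .60]
h = cross_entropy(probs)
print(h, 2 ** h)  # cross-entropy in bits, and the corresponding perplexity
```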

27
The downside
  • The entire Shakespeare oeuvre consists of
  • 884,647 tokens (N)
  • 29,066 types (V)
  • 300,000 bigrams
  • All of Jane Austen's novels (on Manning and
    Schuetze's website)
  • N = 617,091 tokens
  • V = 14,585 types

28
Comparing Austen n-grams: unigrams
In person she was inferior to

Unigram predictions, ranked by P(.) (the ranking is the same at every position, since unigrams ignore context):
1. the .034
2. to .032
3. and .030
...
8. was .015
...
13. she .011
...
1701. inferior .00005
29
Comparing Austen n-grams: bigrams
In person she was inferior to

Bigram predictions, ranked by probability for each context:
P(. | person):   1. and .099   2. who .099   ...   23. she .009
P(. | she):      1. had .141   2. was .122
P(. | was):      1. not .065   2. a .052     ...   inferior 0
P(. | inferior): 1. to .212
30
Comparing Austen n-grams: trigrams
In person she was inferior to

Trigram predictions, ranked by probability for each context:
P(. | In, person):    UNSEEN
P(. | person, she):   1. did .05    2. was .05
P(. | she, was):      1. not .057   2. very .038   ...   inferior 0
P(. | was, inferior): UNSEEN
31
Maybe with a larger corpus?
  • Words such as "ergativity" are unlikely to be
    found outside a corpus of linguistics articles
  • More generally: Zipf's law (see the sketch below)
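A hedged Python sketch of checking Zipf's law (rank × frequency roughly constant); the filename is a placeholder for any plain-text corpus:

```python
from collections import Counter

# The filename is a placeholder for any plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words).most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        word, freq = counts[rank - 1]
        print(rank, word, freq, rank * freq)  # rank * freq stays roughly constant if Zipf holds
```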

32
Zipf's law for the Brown corpus
33
Addressing the zeroes
  • SMOOTHING is re-evaluating some of the
    zero-probability and low-probability n-grams,
    assigning them non-zero probabilities
  • Add-one
  • Witten-Bell
  • Good-Turing
  • BACK-OFF is using the probabilities of lower
    order n-grams when higher order ones are not
    available
  • Backoff
  • Linear interpolation

34
Add-one (Laplace's Law)
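The formula itself appears only as an image in the original slides; the sketch below assumes the standard add-one formula for bigrams, P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V), applied to a toy corpus:

```python
from collections import Counter

corpus = "i want to eat chinese food i want to eat dinner".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)  # vocabulary size

def addone_prob(w1, w2):
    """Add-one smoothed bigram probability: (C(w1, w2) + 1) / (C(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(addone_prob("want", "to"))      # a seen bigram
print(addone_prob("want", "dinner"))  # an unseen bigram still gets a non-zero probability
```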
35
Effect on BERP bigram counts
36
Add-one bigram probabilities
37
The problem
38
The problem
  • Add-one has a huge effect on probabilities: e.g.,
    P(to | want) went from .65 to .28!
  • Too much probability gets removed from the
    n-grams actually encountered
  • (more precisely: the discount factor ...)

39
Witten-Bell Discounting
  • How can we get a better estimate of the
    probabilities of things we haven't seen?
  • The Witten-Bell algorithm is based on the idea
    that a zero-frequency N-gram is just an event
    that hasn't happened yet
  • How often do these events happen? We model this
    by the probability of seeing an N-gram for the
    first time (we just count the number of times we
    first encountered a type)

40
Witten-Bell: the equations
  • Total probability mass assigned to zero-frequency
    N-grams: T / (N + T)
  • (NB: T is the number of OBSERVED types, not V)
  • So each zero N-gram gets the probability
    T / (Z (N + T)), where Z is the number of
    zero-count N-grams

41
Witten-Bell: why discounting?
  • Now of course we have to take away something
    (discount) from the probability of the events
    we have already seen

42
Witten-Bell for bigrams
  • We relativize the types to the previous word

43
Add-one vs. Witten-Bell discounts for unigrams in
the BERP corpus
Word Add-One Witten-Bell
I .68 .97
want .42 .94
to .69 .96
eat .37 .88
Chinese .12 .91
food .48 .94
lunch .22 .91
44
One last discounting method ...
  • The best-known discounting method is GOOD-TURING
    (Good, 1953)
  • Basic insight: re-estimate the probability of
    N-grams with zero counts by looking at the number
    of bigrams that occurred once
  • For example, the revised count for bigrams that
    never occurred is estimated by dividing N1, the
    number of bigrams that occurred once, by N0, the
    number of bigrams that never occurred
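A sketch of the Good-Turing re-estimate, assuming the standard formula c* = (c + 1) N(c+1) / N(c); the bigram counts and the number of unseen bigrams below are invented:

```python
from collections import Counter

# Invented bigram counts; N[c] is the number of bigram types seen exactly c times.
bigram_counts = Counter({("i", "want"): 2, ("want", "to"): 2, ("to", "eat"): 2,
                         ("eat", "chinese"): 1, ("chinese", "food"): 1,
                         ("eat", "dinner"): 1})
N = Counter(bigram_counts.values())
N[0] = 100  # hypothetical number of bigrams that never occurred

def good_turing_count(c):
    """Good-Turing re-estimated count: c* = (c + 1) * N[c + 1] / N[c]."""
    return (c + 1) * N[c + 1] / N[c] if N[c] else 0.0

print(good_turing_count(0))  # revised count for unseen bigrams: N1 / N0
print(good_turing_count(1))  # revised count for bigrams seen once
```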

45
Combining estimators
  • A method often used (generally in combination
    with discounting methods) is to use lower-order
    estimates to help with higher-order ones
  • Backoff (Katz, 1987)
  • Linear interpolation (Jelinek and Mercer, 1980)
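A minimal sketch of linear interpolation (in the Jelinek and Mercer style); the weights and the dummy component estimators are placeholders, not values from the slides:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P(w | w1, w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1, w2), with l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

# Toy usage with constant dummy estimators standing in for corpus-based ones.
print(interpolated_prob("food", "eat", "chinese",
                        p_uni=lambda w: 0.01,
                        p_bi=lambda w, c: 0.2,
                        p_tri=lambda w, c1, c2: 0.5))
```

In practice the lambdas are tuned on held-out data rather than fixed by hand.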

46
Backoff the basic idea
47
Backoff with discounting
48
Readings
  • Jurafsky and Martin, chapter 6
  • The Statistics Glossary
  • Word prediction
  • For mobile phones
  • For disabled users
  • Further reading: Manning and Schuetze, chapter 6
    (Good-Turing)

49
Acknowledgments
  • Some of the material in these slides was taken
    from lecture notes by Diane Litman and James
    Martin