Title: BASIC TECHNIQUES IN STATISTICAL NLP
1. BASIC TECHNIQUES IN STATISTICAL NLP
- Word prediction, n-grams, smoothing
2. Statistical Methods in NLE
- Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
  - VARIETY (no programmer can really take into account all possibilities)
  - AMBIGUITY (need to have ways of choosing between alternatives)
- In a number of NLE applications, statistical methods are very common
- The simplest application: WORD PREDICTION
3. We are good at word prediction
Stocks plunged this morning, despite a cut in interest...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began...
4. Real Spelling Errors
- They are leaving in about fifteen minuets to go to her house
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than one year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of this problem.
- He is trying to fine out.
5. Handwriting recognition
- From Woody Allen's Take the Money and Run (1969)
- Allen (a bank robber) walks up to the teller and hands her a note that reads, "I have a gun. Give me all your cash."
- The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun," Allen says.
- "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.
6. Applications of word prediction
- Spelling checkers
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users
7. Statistics and word prediction
- The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error
- I.e., to compute P(w | w1 ... wN-1) for all words w, and predict as the next word the one for which this (conditional) probability is highest (a small sketch in code follows below)
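To make the idea concrete, here is a minimal Python sketch (not part of the original slides) of prediction by maximum conditional probability; the toy probability table and the function names are assumptions purely for illustration.

# Toy stand-in for conditional probabilities P(w | history); in practice
# these would be estimated from a corpus (see the following slides).
toy_probs = {
    ("a", "cut"): {"in": 0.5, "of": 0.3, "above": 0.2},
}

def predict_next(history, vocab, cond_probs):
    """Return the word w that maximizes P(w | history)."""
    dist = cond_probs.get(tuple(history), {})
    return max(vocab, key=lambda w: dist.get(w, 0.0))

print(predict_next(["a", "cut"], ["in", "of", "above"], toy_probs))  # -> 'in'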
8. Using corpora to estimate probabilities
- But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
- The simplest method: Maximum Likelihood Estimation (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
- "Maximum" because it doesn't waste any probability mass on events not in the corpus
9. Maximum Likelihood Estimation for conditional probabilities
- In order to estimate P(w | w1 ... wN), we can instead use the ratio of counts C(w1 ... wN w) / C(w1 ... wN)
- Cfr. the definition of conditional probability:
- P(A | B) = P(A, B) / P(B)
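A minimal Python sketch of the MLE estimate for bigram probabilities, assuming whitespace tokenization of a made-up toy corpus:

from collections import Counter

# MLE bigram probabilities by relative frequency: P(w | v) = C(v w) / C(v).
corpus = "the big dog bit the big cat the big dog ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w, v):
    """P(w | v) estimated as the relative frequency of the bigram 'v w'."""
    return bigram_counts[(v, w)] / unigram_counts[v]

print(p_mle("dog", "big"))  # C(big dog) = 2, C(big) = 3  ->  0.66...
print(p_mle("cat", "big"))  # C(big cat) = 1, C(big) = 3  ->  0.33...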
10. Aside: counting words in corpora
- Keep in mind that it's not always so obvious what a word is (cfr. yesterday)
- In text:
  - He stepped out into the hall, was delighted to encounter a brother. (From the Brown corpus.)
- In speech:
  - I do uh main- mainly business data processing
- LEMMAS: cats vs cat
- TYPES vs. TOKENS
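A quick Python illustration of the types vs. tokens distinction, under the simplifying assumptions of whitespace tokenization and lowercasing (which is not real lemmatization):

text = "He stepped out into the hall , was delighted to encounter a brother ."
tokens = text.split()                       # running words (and punctuation marks)

print("tokens:", len(tokens))               # every occurrence counts
print("types :", len(set(tokens)))          # distinct word forms
print("lowercased types:", len({t.lower() for t in tokens}))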
11. The problem: sparse data
- In principle, we would like the n of our models to be fairly large, to model long-distance dependencies such as:
  - Sue SWALLOWED the large green ...
- However, in practice, sequences of words of length greater than 3 hardly ever occur in our corpora! (See below)
- (Part of the) solution: we APPROXIMATE the probability of a word given all previous words
12. The Markov Assumption
- The probability of being in a certain state only depends on the previous state:
  P(Xn = Sk | X1 ... Xn-1) = P(Xn = Sk | Xn-1)
- This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite
- (N-gram models / Markov models can be seen as probabilistic finite-state automata)
13. The Markov assumption for language: n-gram models
- Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n-1 words (N-GRAM model)
14. Bigrams and trigrams
- Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
- P(Wn | W1 ... Wn-1) ≈ P(Wn | Wn-2, Wn-1)   (trigram case)
- P(W1 ... Wn) ≈ ∏ P(Wi | Wi-2, Wi-1)
- What the bigram model means in practice:
  - Instead of P(rabbit | Just the other day I saw a)
  - we use P(rabbit | a)
- Unigram: P(dog)   Bigram: P(dog | big)   Trigram: P(dog | the, big)
15. The chain rule
- So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
- E.g., P(the big dog) = P(the) P(big | the) P(dog | the big)
- Then we use the Markov assumption to reduce this to manageable proportions
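A minimal Python sketch of the chain rule under the bigram Markov assumption: the sentence probability becomes a product of bigram probabilities. The probability table here is invented for illustration, with <s> marking the sentence start.

# P(w1 ... wn) ≈ product over i of P(wi | wi-1), with <s> as sentence start.
bigram_p = {
    ("<s>", "the"): 0.2, ("the", "big"): 0.1, ("big", "dog"): 0.3,
}

def sentence_prob(words, bigram_p):
    """Approximate P(sentence) as the product of bigram probabilities."""
    prob = 1.0
    for prev, w in zip(["<s>"] + words, words):
        prob *= bigram_p.get((prev, w), 0.0)
    return prob

print(sentence_prob(["the", "big", "dog"], bigram_p))  # 0.2 * 0.1 * 0.3 = 0.006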
16. Example: the Berkeley Restaurant Project (BERP) corpus
- BERP is a speech-based restaurant consultant
- The corpus contains user queries; examples include:
  - I'm looking for Cantonese food
  - I'd like to eat dinner someplace nearby
  - Tell me about Chez Panisse
  - I'm looking for a good place to eat breakfast
17. Computing the probability of a sentence
- Given a corpus like BERP, we can compute the probability of a sentence like "I want to eat Chinese food"
- Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
- P(I want to eat Chinese food) ≈
  P(I | sentence start) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
18. Bigram counts
19. How the bigram probabilities are computed
- Example: P(I | I)
- C(I I) = 8
- C(I) = 8 + 1087 + 13 + ... = 3437
- P(I | I) = 8 / 3437 = .0023
20. Bigram probabilities
21. The probability of the example sentence
- P(I want to eat Chinese food) ≈
  P(I | sentence start) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
- = .25 × .32 × .65 × .26 × .002 × .60 = .000016
22. Examples of actual bigram probabilities computed using BERP
23. Visualizing an n-gram-based language model: the Shannon/Miller/Selfridge method
- For unigrams:
  - Choose a random value r between 0 and 1
  - Print out the word w whose probability interval contains r
- For bigrams:
  - Choose a random bigram P(w | <s>) to start
  - Then pick bigrams to follow as before
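A Python sketch of the generation procedure, assuming a small hand-made table of bigram probabilities with <s> and </s> as sentence boundaries (the table is an illustration, not BERP or Shakespeare data):

import random

bigram_p = {
    "<s>":   {"I": 0.6, "Tell": 0.4},
    "I":     {"want": 0.7, "would": 0.3},
    "want":  {"to": 1.0},
    "to":    {"eat": 0.8, "go": 0.2},
    "eat":   {"</s>": 1.0},
    "go":    {"</s>": 1.0},
    "would": {"</s>": 1.0},
    "Tell":  {"</s>": 1.0},
}

def sample_next(prev):
    """Pick w with probability P(w | prev): choose r in [0,1) and walk the
    cumulative distribution until it exceeds r."""
    r = random.random()
    cumulative = 0.0
    for w, p in bigram_p[prev].items():
        cumulative += p
        if r < cumulative:
            return w
    return w  # guard against floating-point rounding

def generate():
    words, prev = [], "<s>"
    while True:
        nxt = sample_next(prev)
        if nxt == "</s>":
            return " ".join(words)
        words.append(nxt)
        prev = nxt

print(generate())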
24. The Shannon/Miller/Selfridge method trained on Shakespeare
25. Approximating Shakespeare, cont'd
26. A more formal evaluation mechanism
27. The downside
- The entire Shakespeare oeuvre consists of:
  - 884,647 tokens (N)
  - 29,066 types (V)
  - 300,000 bigrams
- All of Jane Austen's novels (on Manning and Schuetze's website):
  - N = 617,091 tokens
  - V = 14,585 types
28. Comparing Austen n-grams: unigrams
In person she was inferior to

1-gram   P(.)              P(.)              P(.)              P(.)
1        the .034          the .034          the .034          the .034
2        to .032           to .032           to .032           to .032
3        and .030          and .030          and .030
8        was .015          was .015
13       she .011
1701     inferior .00005
29. Comparing Austen n-grams: bigrams
In person she was inferior to

2-gram   P(.|person)       P(.|she)          P(.|was)          P(.|inferior)
1        and .099          had .141          not .065          to .212
2        who .099          was .122          a .052
23       she .009
                                             inferior 0
30. Comparing Austen n-grams: trigrams
In person she was inferior to

3-gram   P(.|In,person)    P(.|person,she)   P(.|she,was)      P(.|was,inferior)
1        UNSEEN            did .05           not .057          UNSEEN
2                          was .05           very .038
                                             inferior 0
31. Maybe with a larger corpus?
- Words such as "ergativity" are unlikely to be found outside a corpus of linguistic articles
- More generally: Zipf's law
32. Zipf's law for the Brown corpus
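A small Python sketch of how one might check Zipf's law (frequency roughly inversely proportional to rank) on a corpus; "corpus.txt" is just a placeholder path, not a file provided with these slides.

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:      # placeholder corpus file
    tokens = f.read().lower().split()

counts = Counter(tokens).most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        word, freq = counts[rank - 1]
        # Under Zipf's law, rank * freq stays roughly constant.
        print(f"rank {rank:5d}  freq {freq:7d}  rank*freq {rank*freq:9d}  {word}")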
33. Addressing the zeroes
- SMOOTHING: re-evaluating some of the zero-probability and low-probability n-grams and assigning them non-zero probabilities
  - Add-one
  - Witten-Bell
  - Good-Turing
- BACK-OFF: using the probabilities of lower-order n-grams when higher-order ones are not available
  - Backoff
  - Linear interpolation
34. Add-one (Laplace's Law)
35. Effect on BERP bigram counts
36. Add-one bigram probabilities
37. The problem
38. The problem
- Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28!
- Too much probability mass gets removed from n-grams actually encountered
- (more precisely, the discount factor c*/c shows how sharply the original counts are reduced; see the sketch below)
39. Witten-Bell Discounting
- How can we get a better estimate of the probabilities of things we haven't seen?
- The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn't happened yet
- How often do such events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type)
40. Witten-Bell: the equations
- Total probability mass assigned to zero-frequency N-grams: T / (N + T), where N is the number of tokens and T the number of observed types
- (NB: T is OBSERVED types, not V)
- So each of the Z zero-count N-grams gets the probability T / (Z (N + T))
41. Witten-Bell: why discounting
- Now of course we have to take away something (the "discount") from the probability of the events already seen: an N-gram seen c times gets probability c / (N + T) instead of c / N
42. Witten-Bell for bigrams
- We relativize the types to the previous word: for a context w, T(w) is the number of distinct word types seen after w, and the zero-count mass T(w) / (C(w) + T(w)) is divided among the Z(w) unseen continuations (sketched in code below)
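A minimal Python sketch of Witten-Bell smoothing for bigrams on a made-up toy corpus, following the equations above; the helper names are assumptions for illustration.

from collections import Counter, defaultdict

corpus = "I want to eat Chinese food I want to eat lunch".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

followers = defaultdict(Counter)        # followers[v][w] = C(v w)
for (v, w), c in bigrams.items():
    followers[v][w] = c

def p_wb(w, v):
    seen = followers[v]
    C = sum(seen.values())              # total bigram count for context v
    T = len(seen)                       # observed continuation types T(v)
    Z = len(vocab) - T                  # unseen continuation types Z(v)
    if w in seen:
        return seen[w] / (C + T)        # discounted probability of a seen bigram
    return T / (Z * (C + T)) if Z else 0.0   # share of the zero-count mass

print(p_wb("to", "want"))     # seen: 2 / (2 + 1) = 0.67 (discounted from 1.0)
print(p_wb("lunch", "want"))  # unseen: 1 / (6 * 3) = 0.056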
43. Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus
Word Add-One Witten-Bell
I .68 .97
want .42 .94
to .69 .96
eat .37 .88
Chinese .12 .91
food .48 .94
lunch .22 .91
44. One last discounting method...
- The best-known discounting method is GOOD-TURING (Good, 1953)
- Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of N-grams that occurred once
- For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred (see the sketch below)
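A minimal Python sketch of the Good-Turing re-estimation idea on a made-up toy corpus; Nc is the number of distinct bigrams seen c times.

from collections import Counter

corpus = "I want to eat Chinese food I want to eat lunch".split()
bigrams = Counter(zip(corpus, corpus[1:]))

count_of_counts = Counter(bigrams.values())      # Nc: how many bigrams occur c times
V = len(set(corpus))
N0 = V * V - len(bigrams)                        # possible bigrams never seen
N1 = count_of_counts[1]                          # bigrams seen exactly once

# Revised count for a never-seen bigram, as on the slide: c0* = N1 / N0.
print("c* for unseen bigrams:", N1 / N0)

# More generally, Good-Turing sets c* = (c + 1) * N(c+1) / Nc for seen counts.
def gt_count(c):
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

print("c* for bigrams seen once:", gt_count(1))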
45. Combining estimators
- A method often used (generally in combination with discounting methods) is to use lower-order estimates to help with higher-order ones:
  - Backoff (Katz, 1987)
  - Linear interpolation (Jelinek and Mercer, 1980)
46. Backoff: the basic idea
47. Backoff with discounting
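To illustrate the combining idea, here is a minimal Python sketch of linear interpolation (the Jelinek-Mercer variant mentioned above) mixing trigram, bigram and unigram MLE estimates. The corpus and the fixed lambda weights are assumptions for illustration; in practice the lambdas are tuned on held-out data, and Katz backoff would additionally require discounting and normalization.

from collections import Counter

corpus = "I want to eat Chinese food I want to eat lunch".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) ≈ l3*P_MLE(w | u v) + l2*P_MLE(w | v) + l1*P_MLE(w)."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("eat", "want", "to"))    # supported by trigram, bigram and unigram evidence
print(p_interp("lunch", "want", "to"))  # unseen trigram and bigram: falls back on the unigram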
48. Readings
- Jurafsky and Martin, chapter 6
- The Statistics Glossary
- Word prediction:
  - For mobile phones
  - For disabled users
- Further reading: Manning and Schuetze, chapter 6 (Good-Turing)
49. Acknowledgments
- Some of the material in these slides was taken from lecture notes by Diane Litman and James Martin