Title: CMSC 723 / LING 645: Intro to Computational Linguistics
1 CMSC 723 / LING 645: Intro to Computational Linguistics
September 22, 2004
Porter Stemmer, Intro to Probabilistic NLP and N-grams (Chapter 6.1-6.3)
Prof. Bonnie J. Dorr, Dr. Christof Monz, TA: Adam Lee
2 Computational Morphology (continued)
- The Rules and the Lexicon
- General versus Specific
- Regular versus Irregular
- Accuracy, speed, space
- The Morphology of a language
- Approaches
- Lexicon only
- Lexicon and Rules
- Finite-state Automata
- Finite-state Transducers
- Rules only
3 Lexicon-Free Morphology: Porter Stemmer
- Lexicon-free FST approach
- By Martin Porter (1980): http://www.tartarus.org/~martin/PorterStemmer/
- Cascade of substitutions given specific conditions (see the usage sketch below)
- GENERALIZATIONS
- GENERALIZATION
- GENERALIZE
- GENERAL
- GENER
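For experimentation, the Porter algorithm is available in standard toolkits. A minimal sketch, assuming the NLTK package is installed (nltk.stem.PorterStemmer is NLTK's implementation, which differs in a few small details from the 1980 algorithm; it is not part of these slides):

```python
# Minimal sketch: running the Porter stemmer via NLTK (assumes `pip install nltk`).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The cascade from the slide: each form should reduce toward the same stem (GENER).
for word in ["generalizations", "generalization", "generalize", "general"]:
    print(word, "->", stemmer.stem(word))
```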
4 Porter Stemmer
- Definitions
- C = a string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by a consonant)
- V = a string of one or more vowels
- M = measure, roughly corresponding to the number of syllables (see the sketch below)
- Words take the form (C)(VC)^M(V)
- M=0: TR, EE, TREE, Y, BY
- M=1: TROUBLE, OATS, TREES, IVY
- M=2: TROUBLES, PRIVATE, OATEN, ORRERY
- Conditions
- <S>: the stem ends with S
- <v>: the stem contains a vowel
- <d>: the stem ends with a double consonant, e.g., -TT, -SS
- <o>: the stem ends CVC, where the second C is not W, X or Y, e.g., -WIL, -HOP
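A rough illustration of how the measure M can be computed from the (C)(VC)^M(V) decomposition. The helper names and the regex trick are my own, not from the slides; the test words are the slide's examples:

```python
import re

VOWELS = "aeiou"

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: A, E, I, O, U are vowels; Y preceded by a consonant is a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word: str) -> int:
    """Number of VC sequences in the (C)(VC)^M(V) decomposition."""
    word = word.lower()
    pattern = "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))
    # Collapse runs (e.g. "ccvvc" -> "cvc"), then count vowel-consonant transitions.
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)
    return collapsed.count("vc")

# Examples from the slide:
assert measure("tree") == 0 and measure("by") == 0
assert measure("trouble") == 1 and measure("oats") == 1
assert measure("private") == 2 and measure("orrery") == 2
```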
5 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 1: Plural Nouns and Third Person Singular Verbs (regex sketch below)
- SSES → SS    caresses → caress
- IES → I    ponies → poni, ties → ti
- SS → SS    caress → caress
- S → ε    cats → cat
- Step 2a: Verbal Past Tense and Progressive Forms
- (M>0) EED → EE    feed → feed, agreed → agree
- (i) (<v>) ED → ε    plastered → plaster, bled → bled
- (ii) (<v>) ING → ε    motoring → motor, sing → sing
- Step 2b: If 2a.i or 2a.ii was successful, Cleanup
- AT → ATE    conflat(ed) → conflate
- BL → BLE    troubl(ed) → trouble
- IZ → IZE    siz(ed) → size
- (<d> and not (stem ends in L, S or Z)) → single letter    hopp(ing) → hop, tann(ed) → tan; but hiss(ing) → hiss, fizz(ed) → fizz
- (M=1 and <o>) → E    fail(ing) → fail, fil(ing) → file
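A toy rendering of Step 1 as an ordered regex cascade (rule order and helper names are my own; this covers only the plural/third-singular step, not the full algorithm):

```python
import re

# Ordered rules for Step 1: the first pattern that matches the end of the word fires.
STEP1_RULES = [
    (re.compile(r"sses$"), "ss"),   # caresses -> caress
    (re.compile(r"ies$"), "i"),     # ponies -> poni, ties -> ti
    (re.compile(r"ss$"), "ss"),     # caress -> caress (unchanged)
    (re.compile(r"s$"), ""),        # cats -> cat
]

def step1(word: str) -> str:
    for pattern, replacement in STEP1_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

assert step1("caresses") == "caress"
assert step1("ponies") == "poni"
assert step1("caress") == "caress"
assert step1("cats") == "cat"
```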
6 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 3: Y → I
- (<v>) Y → I    happy → happi
- sky → sky
7 Porter Stemmer
- Step 4: Derivational Morphology I: Multiple Suffixes
- (M>0) ATIONAL → ATE    relational → relate
- (M>0) TIONAL → TION    conditional → condition, rational → rational
- (M>0) ENCI → ENCE    valenci → valence
- (M>0) ANCI → ANCE    hesitanci → hesitance
- (M>0) IZER → IZE    digitizer → digitize
- (M>0) ABLI → ABLE    conformabli → conformable
- (M>0) ALLI → AL    radicalli → radical
- (M>0) ENTLI → ENT    differentli → different
- (M>0) ELI → E    vileli → vile
- (M>0) OUSLI → OUS    analogousli → analogous
- (M>0) IZATION → IZE    vietnamization → vietnamize
- (M>0) ATION → ATE    predication → predicate
- (M>0) ATOR → ATE    operator → operate
- (M>0) ALISM → AL    feudalism → feudal
- (M>0) IVENESS → IVE    decisiveness → decisive
- (M>0) FULNESS → FUL    hopefulness → hopeful
- (M>0) OUSNESS → OUS    callousness → callous
8 Porter Stemmer
- Step 5: Derivational Morphology II: More Multiple Suffixes
- (M>0) ICATE → IC    triplicate → triplic
- (M>0) ATIVE → ε    formative → form
- (M>0) ALIZE → AL    formalize → formal
- (M>0) ICITI → IC    electriciti → electric
- (M>0) ICAL → IC    electrical → electric
- (M>0) FUL → ε    hopeful → hope
- (M>0) NESS → ε    goodness → good
9 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 6: Derivational Morphology III: Single Suffixes
- (M>1) AL → ε    revival → reviv
- (M>1) ANCE → ε    allowance → allow
- (M>1) ENCE → ε    inference → infer
- (M>1) ER → ε    airliner → airlin
- (M>1) IC → ε    gyroscopic → gyroscop
- (M>1) ABLE → ε    adjustable → adjust
- (M>1) IBLE → ε    defensible → defens
- (M>1) ANT → ε    irritant → irrit
- (M>1) EMENT → ε    replacement → replac
- (M>1) MENT → ε    adjustment → adjust
- (M>1) ENT → ε    dependent → depend
- (M>1 and (<S> or <T>)) ION → ε    adoption → adopt
- (M>1) OU → ε    homologou → homolog
- (M>1) ISM → ε    communism → commun
- (M>1) ATE → ε    activate → activ
- (M>1) ITI → ε    angulariti → angular
- (M>1) OUS → ε    homologous → homolog
- (M>1) IVE → ε    effective → effect
10 Porter Stemmer
Conditions: <S> ends with S; <v> contains a vowel; <d> ends with a double consonant; <o> ends CVC, where the second C is not W, X or Y
- Step 7a: Cleanup
- (M>1) E → ε    probate → probat, rate → rate
- (M=1 and not <o>) E → ε    cease → ceas
- Step 7b: More Cleanup
- (M>1 and <d> and <L>) → single letter    controll → control, roll → roll
11 Porter Stemmer
- Errors of Omission (forms that should conflate but do not)
- European / Europe
- analysis / analyzes
- matrices / matrix
- noise / noisy
- explain / explanation
- Errors of Commission (forms that conflate but should not)
- organization / organ
- doing / doe
- generalization / generic
- numerical / numerous
- university / universe
- From Krovetz '93
12 Why (not) Statistics for NLP?
- Pro
- Disambiguation
- Error Tolerant
- Learnable
- Con
- Not always appropriate
- Difficult to debug
13 Weighted Automata/Transducers
- Speech recognition: storing a pronunciation lexicon
- Augmentation of FSAs: each arc is associated with a probability (see the sketch below)
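A minimal illustration of the idea. The states, arcs, and numbers here are invented for illustration (they are not the actual "about" network on the next slide): a weighted automaton is an FSA whose arcs carry probabilities, so the probability of a path is the product of its arc probabilities.

```python
# Toy weighted automaton: arcs[state][symbol] = (next_state, probability).
# States, phone symbols, and probabilities are made up for illustration.
arcs = {
    0: {"ax": (1, 0.68), "ix": (1, 0.32)},   # reduced first vowel variants
    1: {"b": (2, 1.00)},
    2: {"aw": (3, 1.00)},
    3: {"t": (4, 0.70), "dx": (4, 0.30)},    # final stop may surface as a flap
}
FINAL_STATE = 4

def path_probability(symbols):
    """Multiply arc probabilities along the path spelled out by `symbols`."""
    state, prob = 0, 1.0
    for symbol in symbols:
        if symbol not in arcs.get(state, {}):
            return 0.0                       # no such arc: path not accepted
        state, p = arcs[state][symbol]
        prob *= p
    return prob if state == FINAL_STATE else 0.0

print(path_probability(["ax", "b", "aw", "t"]))   # 0.68 * 1.0 * 1.0 * 0.7 ≈ 0.48
```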
14 Pronunciation network for "about" [figure]
15 Noisy Channel [figure]
16 Probability Definitions
- Experiment (trial)
- A repeatable procedure with well-defined possible outcomes
- Sample space
- The complete set of outcomes
- Event
- Any subset of outcomes from the sample space
- Random variable
- An uncertain outcome in a trial
17 More Definitions
- Probability
- How likely is it to get a particular outcome?
- The rate of getting that outcome across all trials
- e.g., the probability of drawing a spade from 52 well-shuffled playing cards
- Distribution: the probabilities associated with each outcome a random variable can take
- Each outcome has a probability between 0 and 1
- The sum of all outcome probabilities is 1
18 Conditional Probability
- What is P(A|B)?
- First, what is P(A)?
- P(It is raining) = .06
- Now what about P(A|B)?
- P(It is raining | It was clear 10 minutes ago) = .004
- Note: P(A,B) = P(A|B) P(B)
- Also: P(A,B) = P(B,A)
19 Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) P(B) iff A, B independent
- P(heads, tails) = P(heads) P(tails) = .5 × .5 = .25
- P(doctor, blue-eyes) = P(doctor) P(blue-eyes) = .01 × .2 = .002
- What else holds if A, B are independent?
- P(A|B) = P(A) iff A, B independent
- Also P(B|A) = P(B) iff A, B independent
20 Bayes' Theorem
- Swaps the order of dependence (see the equation below)
- Sometimes it is easier to estimate one kind of dependence than the other
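The theorem itself, in standard form (the slide presumably showed it as an image), together with the consequence used by the noisy channel slides that follow:

```latex
% Bayes' theorem: lets us swap which conditional we estimate.
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}

% In the noisy-channel setting of the next slides (hypothesis H, observation O),
% P(O) is constant across hypotheses, so
\operatorname*{argmax}_{H} P(H \mid O) \;=\; \operatorname*{argmax}_{H} P(O \mid H)\,P(H)
```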
21 What does this have to do with the Noisy Channel Model?
[Diagram: the observed output (O) is related to the hypothesized source (H) via the noisy channel]
22 Noisy Channel Applied to Word Recognition
- argmax_w P(w|O) = argmax_w P(O|w) P(w)
- Simplifying assumptions
- the pronunciation string is correct
- word boundaries are known
- Problem
- Given the phone sequence [n iy], what is the correct dictionary word?
- What do we need?
- [n iy]: knee, neat, need, new
23 What is the most likely word given [n iy]?
- Now compute the likelihood P([n iy]|w) for each candidate word, then multiply by the prior P(w) (sketch below)
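A sketch of the computation with invented numbers (the priors and likelihoods below are placeholders for illustration only; the lecture's actual table is not reproduced here):

```python
# Hypothetical priors P(w) and likelihoods P([n iy] | w) -- illustrative values only.
prior = {"knee": 0.00002, "neat": 0.00005, "need": 0.0004, "new": 0.002}
likelihood = {"knee": 1.0, "neat": 0.5, "need": 0.1, "new": 0.4}

# Noisy channel / Bayes: pick the word maximizing P(O|w) * P(w).
scores = {w: likelihood[w] * prior[w] for w in prior}
best = max(scores, key=scores.get)
print(best, scores[best])
# With these made-up numbers the unigram model prefers "new";
# the next slide adds context (N-grams) to repair exactly this kind of mistake.
```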
24 Why N-grams?
- Compute likelihood P([n iy]|w), then multiply
- The unigram approach ignores context
- Need to factor in context (N-grams)
- Use P(need|I) instead of just P(need)
- Note: P(new|I) < P(need|I)
25 Next Word Prediction (borrowed from J. Hirschberg)
- From a NY Times story...
- Stocks plunged this ...
- Stocks plunged this morning, despite a cut in interest rates ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...
26 Next Word Prediction (cont.)
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.
27 Human Word Prediction
- Domain knowledge
- Syntactic knowledge
- Lexical knowledge
28 Claim
- A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques
- Compute
- the probability of a sequence
- the likelihood of words co-occurring
29 Why would we want to do this?
- Rank the likelihood of sequences containing various alternative hypotheses
- Assess the likelihood of a hypothesis
30 Why is this useful?
- Speech recognition
- Handwriting recognition
- Spelling correction
- Machine translation systems
- Optical character recognizers
31 Handwriting Recognition
- Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)
- NLP to the rescue...
- "gub" is not a word
- gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank
32 Real Word Spelling Errors
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of.
- He is trying to fine out.
33 For Spell Checkers
- Collect a list of commonly substituted words
- piece/peace, whether/weather, their/there ...
- Example: "On Tuesday, the whether ..." → "On Tuesday, the weather ..."
34 Language Model
- Definition: a language model is a model that enables one to compute the probability, or likelihood, of a sentence S, P(S)
- Let's look at different ways of computing P(S) in the context of word prediction
35 Word Prediction: Simple vs. Smart
- Simple: every word follows every other word with equal probability (0-gram)
- Assume V is the size of the vocabulary
- Likelihood of a sentence S of length n is 1/V × 1/V × ... × 1/V
- If English has 100,000 words, the probability of each next word is 1/100,000 = .00001
- Smarter: the probability of each next word is related to its word frequency (unigram)
- Likelihood of sentence S = P(w1) × P(w2) × ... × P(wn)
- Assumes the probability of each word is independent of the probabilities of the other words
- Even smarter: look at the probability given the previous words (N-gram)
- Likelihood of sentence S = P(w1) × P(w2|w1) × ... × P(wn|wn-1)
- Assumes the probability of each word depends on the probabilities of the preceding words
36 Chain Rule
- Conditional probability
- P(A1,A2) = P(A1) P(A2|A1)
- The Chain Rule generalizes to multiple events
- P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1, ..., An-1)
- Examples
- P(the dog) = P(the) P(dog | the)
- P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
37 Relative Frequencies and Conditional Probabilities
- Relative word frequencies are better than equal probabilities for all words
- In a corpus with 10K word types, each word would have P(w) = 1/10K
- This does not match our intuition that different words are more likely to occur (e.g., "the")
- Conditional probability is more useful than individual relative word frequencies
- "dog" may be relatively rare in a corpus
- But if we see "barking", P(dog|barking) may be very large
38 For a Word String
- In general, the probability of a complete string of words w1 ... wn is
- P(w1 ... wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 ... wn-1)
- But this approach to determining the probability of a word sequence is not very helpful in general
39 Markov Assumption
- How do we compute P(wn|w1 ... wn-1)? Trick: instead of P(rabbit | I saw a), we use P(rabbit | a)
- This lets us collect statistics in practice
- A bigram model: P(the barking dog) = P(the|<start>) P(barking|the) P(dog|barking)
- Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past
- Specifically, for N=2 (bigram): P(w1 ... wn) ≈ P(w1) ∏ P(wk|wk-1) (written out below)
- Order of a Markov model = length of prior context
- bigram is first order, trigram is second order, ...
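Putting the exact decomposition and the bigram approximation side by side, in standard notation consistent with the formulas above:

```latex
% Exact chain-rule decomposition of a word string:
P(w_1^n) \;=\; \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})

% First-order Markov (bigram) approximation, with w_0 = \text{<start>}:
P(w_1^n) \;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-1})
```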
40 Counting Words in Corpora
- What is a word?
- e.g., are "cat" and "cats" the same word?
- "September" and "Sept"?
- "zero" and "oh"?
- Is "seventy-two" one word or two? "AT&T"?
- Punctuation?
- How many words are there in English?
- Where do we find the things to count?
41 Corpora
- Corpora are (generally online) collections of text and speech
- Examples
- Brown Corpus (1M words)
- Wall Street Journal and AP News corpora
- ATIS, Broadcast News (speech)
- TDT (text and speech)
- Switchboard, Call Home (speech)
- TRAINS, FM Radio (speech)
42 Training and Testing
- Probabilities come from a training corpus, which is used to design the model
- overly narrow corpus: probabilities don't generalize
- overly general corpus: probabilities don't reflect the task or domain
- A separate test corpus is used to evaluate the model, typically using standard metrics
- held-out test set
- cross-validation
- evaluation differences should be statistically significant
43 Terminology
- Sentence: unit of written language
- Utterance: unit of spoken language
- Word form: the inflected form that appears in the corpus
- Lemma: lexical forms having the same stem, part of speech, and word sense
- Types (V): number of distinct words that might appear in a corpus (vocabulary size)
- Tokens (N): total number of words in a corpus (see the counting sketch below)
- Types seen so far (T): number of distinct words seen so far in the corpus (smaller than V and N)
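A quick sketch of the type/token distinction (the sample sentence is my own):

```python
from collections import Counter

# Hypothetical toy corpus for illustration.
corpus = "the dog barks and the dog bites the postman".split()

tokens = len(corpus)        # N: total number of running words
types = len(set(corpus))    # T: distinct words seen so far
counts = Counter(corpus)

print(f"N = {tokens} tokens, T = {types} types")   # N = 9 tokens, T = 6 types
print(counts.most_common(3))                       # [('the', 3), ('dog', 2), ...]
```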
44 Simple N-Grams
- An N-gram model uses the previous N-1 words to predict the next one: P(wn | wn-N+1 ... wn-1) (extraction sketch below)
- unigrams: P(dog)
- bigrams: P(dog | big)
- trigrams: P(dog | the big)
- quadrigrams: P(dog | chasing the big)
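Extracting the n-grams themselves is a one-liner over a token list; a sketch (the function name is mine, the sentence is the slides' running example):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I want to eat Chinese food".split()
print(ngrams(sentence, 2))
# [('I', 'want'), ('want', 'to'), ('to', 'eat'), ('eat', 'Chinese'), ('Chinese', 'food')]
print(ngrams(sentence, 3))
# [('I', 'want', 'to'), ('want', 'to', 'eat'), ('to', 'eat', 'Chinese'), ('eat', 'Chinese', 'food')]
```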
45 Using N-Grams
- Recall that
- N-gram: P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1)
- Bigram: P(w1 ... wn) ≈ P(w1) ∏ P(wk|wk-1)
- For a bigram grammar
- P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence
- Example: P(I want to eat Chinese food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese)
46 A Bigram Grammar Fragment from BERP
Eat on .16 Eat Thai .03
Eat some .06 Eat breakfast .03
Eat lunch .06 Eat in .02
Eat dinner .05 Eat Chinese .02
Eat at .04 Eat Mexican .02
Eat a .04 Eat tomorrow .01
Eat Indian .04 Eat dessert .007
Eat today .03 Eat British .001
47 Additional BERP Grammar
<start> I .25       Want some .04
<start> I'd .06     Want Thai .01
<start> Tell .04    To eat .26
<start> I'm .02     To have .14
I want .32          To spend .09
I would .29         To be .02
I don't .08         British food .60
I have .04          British restaurant .15
Want to .65         British cuisine .01
Want a .05          British lunch .01
48 Computing Sentence Probability
- P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
- vs. "I want to eat Chinese food" ≈ .00015
- Probabilities seem to capture syntactic facts and world knowledge
- "eat" is often followed by an NP
- British food is not too popular
- N-gram models can be trained by counting and normalization (see the sketch below)
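A sketch of the computation using the bigram probabilities from the BERP fragment above (only the eight probabilities below are taken from the slides; the function name and <start> spelling are mine):

```python
# Bigram probabilities copied from the BERP fragment on the previous slides.
bigram_prob = {
    ("<start>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
    ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.56,
}

def sentence_prob(words):
    """Multiply bigram probabilities, conditioning the first word on <start>."""
    p = 1.0
    for prev, word in zip(["<start>"] + words, words):
        p *= bigram_prob.get((prev, word), 0.0)
    return p

print(sentence_prob("I want to eat British food".split()))   # ~8.1e-06
print(sentence_prob("I want to eat Chinese food".split()))   # ~1.5e-04
```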
49 BERP Bigram Counts
          I     Want   To    Eat   Chinese  Food  Lunch
I         8     1087   0     13    0        0     0
Want      3     0      786   0     6        8     6
To        3     0      10    860   3        0     12
Eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
Food      19    0      17    0     0        0     0
Lunch     4     0      0     0     0        1     0
50 BERP Bigram Probabilities: Use Unigram Counts
- Normalization: divide the bigram count by the unigram count of the first word
          I     Want   To    Eat   Chinese  Food  Lunch
          3437  1215   3256  938   213      1506  459
- Computing the probability of "I I"
- P(I|I) = C(I I)/C(I) = 8 / 3437 = .0023 (see the sketch below)
- A bigram grammar is an N×N matrix of probabilities, where N is the vocabulary size
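A sketch that derives bigram probabilities from the counts on the previous two slides (the count dictionaries simply transcribe part of those tables; the function name is mine):

```python
# Bigram and unigram counts transcribed from the BERP slides.
bigram_count = {("I", "I"): 8, ("I", "Want"): 1087, ("Want", "To"): 786,
                ("To", "Eat"): 860, ("Eat", "Chinese"): 19, ("Chinese", "Food"): 120}
unigram_count = {"I": 3437, "Want": 1215, "To": 3256, "Eat": 938,
                 "Chinese": 213, "Food": 1506, "Lunch": 459}

def bigram_mle(prev, word):
    """Relative-frequency (maximum likelihood) estimate: C(prev word) / C(prev)."""
    return bigram_count.get((prev, word), 0) / unigram_count[prev]

print(round(bigram_mle("I", "I"), 4))          # 8 / 3437    = 0.0023
print(round(bigram_mle("I", "Want"), 4))       # 1087 / 3437 ≈ 0.3163 (the .32 on slide 47)
print(round(bigram_mle("Eat", "Chinese"), 4))  # 19 / 938    ≈ 0.0203 (the .02 on slide 46)
```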
51 Learning a Bigram Grammar
- The formula P(wn|wn-1) = C(wn-1 wn)/C(wn-1) is used for bigram parameter estimation
- Relative frequency
- Maximum Likelihood Estimation (MLE): the parameter set maximizes the likelihood of the training set T given the model M, P(T|M)
52 What do we learn about the language?
- What about...
- P(I | I) = .0023
- P(I | want) = .0025
- P(I | food) = .013
- What's being captured with...
- P(want | I) = .32
- P(to | want) = .65
- P(eat | to) = .26
- P(food | Chinese) = .56
- P(lunch | eat) = .055
53 Readings for next time