1
CMSC 723 / LING 645: Intro to Computational
Linguistics
September 22, 2004: Porter Stemmer, Intro to
Probabilistic NLP and N-grams (Chapter 6.1-6.3)
Prof. Bonnie J. Dorr, Dr. Christof Monz, TA: Adam Lee
2
Computational Morphology (continued)
  • The Rules and the Lexicon
  • General versus Specific
  • Regular versus Irregular
  • Accuracy, speed, space
  • The Morphology of a language
  • Approaches
  • Lexicon only
  • Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers
  • Rules only

3
Lexicon-Free Morphology: Porter Stemmer
  • Lexicon-Free FST Approach
  • By Martin Porter (1980):
    http://www.tartarus.org/~martin/PorterStemmer/
  • Cascade of substitutions given specific
    conditions
  • GENERALIZATIONS
  • GENERALIZATION
  • GENERALIZE
  • GENERAL
  • GENER
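For concreteness, here is a minimal usage sketch with NLTK's off-the-shelf Porter stemmer applied to the cascade example above. This assumes the nltk package is installed and is not part of the original slides; exact outputs can vary slightly across Porter-stemmer variants.

```python
# Minimal sketch: running NLTK's Porter stemmer on the cascade example above.
# Assumes nltk is installed; output may vary slightly between Porter variants.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["generalizations", "generalization", "generalize", "general"]:
    print(word, "->", stemmer.stem(word))
# All four are expected to reduce toward the same truncated stem ("gener").
```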

4
Porter Stemmer
  • Definitions
  • C = a string of one or more consonants, where a
    consonant is any letter other than A, E, I, O, U,
    or Y preceded by a consonant
  • V = a string of one or more vowels
  • M = measure, corresponding roughly to the number
    of syllables; every word has the form (C)(VC)^M(V)
    (see the sketch after this list)
  • M=0: TR, EE, TREE, Y, BY
  • M=1: TROUBLE, OATS, TREES, IVY
  • M=2: TROUBLES, PRIVATE, OATEN, ORRERY
  • Conditions
  • *S  - the stem ends with S
  • *v* - the stem contains a vowel
  • *d  - the stem ends with a double consonant, e.g., -TT, -SS
  • *o  - the stem ends in CVC, where the second C is not W, X
    or Y, e.g., -WIL, -HOP
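A short sketch of how the measure M could be computed from this definition. The helper names (cv_pattern, measure) and the simplified handling of Y are this sketch's own, not Porter's original code.

```python
import re

def cv_pattern(word: str) -> str:
    """Map each letter to C or V; Y counts as a vowel only when it follows a consonant."""
    out = []
    for ch in word.lower():
        if ch in "aeiou" or (ch == "y" and out and out[-1] == "C"):
            out.append("V")
        else:
            out.append("C")
    return "".join(out)

def measure(word: str) -> int:
    """M = number of VC sequences after collapsing runs: (C)(VC)^M(V)."""
    collapsed = re.sub(r"(.)\1+", r"\1", cv_pattern(word))
    return collapsed.count("VC")

# Examples from the slide:
assert measure("tree") == 0 and measure("by") == 0
assert measure("trouble") == 1 and measure("ivy") == 1
assert measure("private") == 2 and measure("orrery") == 2
```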

5
Porter Stemmer
Conditions: *S = ends with S   *v* = contains a vowel   *d = ends with a
double consonant   *o = ends in CVC, where the second C is not W, X or Y
  • Step 1: Plural Nouns and Third Person Singular
    Verbs
  • SSES → SS    caresses → caress
  • IES → I      ponies → poni, ties → ti
  • SS → SS      caress → caress
  • S → ε        cats → cat

Step 2a: Verbal Past Tense and Progressive Forms
  • (M>0) EED → EE       feed → feed, agreed → agree
  • (i)  (*v*) ED → ε    plastered → plaster, bled → bled
  • (ii) (*v*) ING → ε   motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii was successful, Cleanup
  • AT → ATE    conflat(ed) → conflate
  • BL → BLE    troubl(ed) → trouble
  • IZ → IZE    siz(ed) → size
  • (*d and not (*L or *S or *Z)) → single letter
    hopp(ing) → hop, tann(ed) → tan; but hiss(ing) → hiss, fizz(ed) → fizz
  • (M=1 and *o) → E    fail(ing) → fail, fil(ing) → file
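As a concrete illustration of the "cascade of substitutions given specific conditions", here is a toy sketch of just the Step 1 rules; the structure and names are illustrative only, not the full algorithm.

```python
import re

# Toy sketch of the Step 1 cascade (plural endings only). Rule order matters,
# and the first matching rule wins, mirroring the cascade idea above.
STEP1_RULES = [
    (re.compile(r"sses$"), "ss"),   # caresses -> caress
    (re.compile(r"ies$"),  "i"),    # ponies -> poni, ties -> ti
    (re.compile(r"ss$"),   "ss"),   # caress -> caress
    (re.compile(r"s$"),    ""),     # cats -> cat
]

def step1(word: str) -> str:
    for pattern, replacement in STEP1_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

assert step1("caresses") == "caress"
assert step1("ponies") == "poni"
assert step1("caress") == "caress"
assert step1("cats") == "cat"
```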
6
Porter Stemmer
Conditions: *S = ends with S   *v* = contains a vowel   *d = ends with a
double consonant   *o = ends in CVC, where the second C is not W, X or Y
  • Step 3: Y → I
  • (*v*) Y → I    happy → happi, sky → sky

7
Porter Stemmer
  • Step 4: Derivational Morphology I: Multiple Suffixes
  • (m>0) ATIONAL → ATE    relational → relate
  • (m>0) TIONAL → TION    conditional → condition, rational → rational
  • (m>0) ENCI → ENCE      valenci → valence
  • (m>0) ANCI → ANCE      hesitanci → hesitance
  • (m>0) IZER → IZE       digitizer → digitize
  • (m>0) ABLI → ABLE      conformabli → conformable
  • (m>0) ALLI → AL        radicalli → radical
  • (m>0) ENTLI → ENT      differentli → different
  • (m>0) ELI → E          vileli → vile
  • (m>0) OUSLI → OUS      analogousli → analogous
  • (m>0) IZATION → IZE    vietnamization → vietnamize
  • (m>0) ATION → ATE      predication → predicate
  • (m>0) ATOR → ATE       operator → operate
  • (m>0) ALISM → AL       feudalism → feudal
  • (m>0) IVENESS → IVE    decisiveness → decisive
  • (m>0) FULNESS → FUL    hopefulness → hopeful
  • (m>0) OUSNESS → OUS    callousness → callous

8
Porter Stemmer
  • Step 5: Derivational Morphology II: More Multiple Suffixes
  • (m>0) ICATE → IC    triplicate → triplic
  • (m>0) ATIVE → ε     formative → form
  • (m>0) ALIZE → AL    formalize → formal
  • (m>0) ICITI → IC    electriciti → electric
  • (m>0) ICAL → IC     electrical → electric
  • (m>0) FUL → ε       hopeful → hope
  • (m>0) NESS → ε      goodness → good

9
Porter Stemmer
Conditions: *S = ends with S   *v* = contains a vowel   *d = ends with a
double consonant   *o = ends in CVC, where the second C is not W, X or Y
  • Step 6: Derivational Morphology III: Single Suffixes
  • (m>1) AL → ε       revival → reviv
  • (m>1) ANCE → ε     allowance → allow
  • (m>1) ENCE → ε     inference → infer
  • (m>1) ER → ε       airliner → airlin
  • (m>1) IC → ε       gyroscopic → gyroscop
  • (m>1) ABLE → ε     adjustable → adjust
  • (m>1) IBLE → ε     defensible → defens
  • (m>1) ANT → ε      irritant → irrit
  • (m>1) EMENT → ε    replacement → replac
  • (m>1) MENT → ε     adjustment → adjust
  • (m>1) ENT → ε      dependent → depend
  • (m>1 and (*S or *T)) ION → ε    adoption → adopt
  • (m>1) OU → ε       homologou → homolog
  • (m>1) ISM → ε      communism → commun
  • (m>1) ATE → ε      activate → activ
  • (m>1) ITI → ε      angulariti → angular
  • (m>1) OUS → ε      homologous → homolog
  • (m>1) IVE → ε      effective → effect

10
Porter Stemmer
Conditions: *S = ends with S   *v* = contains a vowel   *d = ends with a
double consonant   *o = ends in CVC, where the second C is not W, X or Y
  • Step 7a: Cleanup
  • (m>1) E → ε              probate → probat, rate → rate
  • (m=1 and not *o) E → ε   cease → ceas
  • Step 7b: More Cleanup
  • (m>1 and *d and *L) → single letter    controll → control, roll → roll

11
Porter Stemmer
  • Errors of Omission
  • European / Europe
  • analysis / analyzes
  • matrices / matrix
  • noise / noisy
  • explain / explanation
  • Errors of Commission
  • organization / organ
  • doing / doe
  • generalization / generic
  • numerical / numerous
  • university / universe

From Krovetz 93
12
Why (not) Statistics for NLP?
  • Pro
  • Disambiguation
  • Error Tolerant
  • Learnable
  • Con
  • Not always appropriate
  • Difficult to debug

13
Weighted Automata/Transducers
  • Speech recognition: storing a pronunciation
    lexicon
  • Augmentation of an FSA: each arc is associated
    with a probability
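One simple way to represent such a weighted automaton is as a table of arcs carrying probabilities. The states, phone labels, and numbers below are invented for illustration only and are not read off the pronunciation network on the next slide.

```python
# Sketch of a weighted automaton: state -> list of (phone, next_state, prob).
# All states, phones, and probabilities here are hypothetical placeholders.
weighted_fsa = {
    "start": [("ax", "s1", 0.68), ("", "s1", 0.32)],   # optional initial schwa
    "s1":    [("b", "s2", 1.0)],
    "s2":    [("aw", "s3", 0.84), ("ae", "s3", 0.16)],
    "s3":    [("t", "end", 0.54), ("dx", "end", 0.46)],
}

def path_probability(path):
    """Multiply arc probabilities along a path of (state, phone) pairs."""
    prob = 1.0
    for state, phone in path:
        prob *= next(p for ph, _, p in weighted_fsa[state] if ph == phone)
    return prob

# e.g. the pronunciation [ax b aw t]:
print(path_probability([("start", "ax"), ("s1", "b"), ("s2", "aw"), ("s3", "t")]))
```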

14
Pronunciation network for "about"
15
Noisy Channel
16
Probability Definitions
  • Experiment (trial)
  • Repeatable procedure with well-defined possible
    outcomes
  • Sample space
  • Complete set of outcomes
  • Event
  • Any subset of outcomes from sample space
  • Random Variable
  • Uncertain outcome in a trial

17
More Definitions
  • Probability
  • How likely is it to get a particular outcome?
  • Rate of getting that outcome across all trials
  • Probability of drawing a spade from 52
    well-shuffled playing cards: 13/52 = .25
  • Distribution Probabilities associated with each
    outcome a random variable can take
  • Each outcome has probability between 0 and 1
  • The sum of all outcome probabilities is 1.

18
Conditional Probability
  • What is P(A|B)?
  • First, what is P(A)?
  • P(It is raining) = .06
  • Now what about P(A|B)?
  • P(It is raining | It was clear 10 minutes
    ago) = .004

Note: P(A,B) = P(A|B) P(B). Also, P(A,B) = P(B,A).
19
Independence
  • What is P(A,B) if A and B are independent?
  • P(A,B) = P(A) P(B) iff A, B independent
  • P(heads, tails) = P(heads) P(tails) = .5 x .5
    = .25
  • P(doctor, blue-eyes) = P(doctor) P(blue-eyes)
    = .01 x .2 = .002
  • What if A and B are independent?
  • P(A|B) = P(A) iff A, B independent
  • Also P(B|A) = P(B) iff A, B independent

20
Bayes Theorem
  • Swap the order of dependence
  • Sometimes easier to estimate one kind of
    dependence than the other
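The equation itself (presumably displayed as an image in the original slide) is the standard statement of Bayes' theorem:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}
```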

21
What does this have to do with the Noisy Channel
Model?
[Figure: a hidden word sequence (H) passes through a noisy channel to
produce the observation (O)]
22
Noisy Channel Applied to Word Recognition
  • argmax_w P(w|O) = argmax_w P(O|w) P(w)
  • Simplifying assumptions
  • the pronunciation string is correct
  • word boundaries are known
  • Problem
  • Given the pronunciation [n iy], what is the
    correct dictionary word?
  • What do we need?

[ni]: knee, neat, need, new
23
What is the most likely word given [ni]?
  • Compute the prior P(w)
  • Now compute the likelihood P([ni]|w), then multiply
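A toy sketch of that computation, picking the word that maximizes P([ni]|w) P(w). All probabilities below are illustrative placeholders rather than figures from the lecture.

```python
# Toy noisy-channel decision rule: argmax_w P(O|w) P(w) for O = [ni].
# All numbers are hypothetical placeholders for illustration only.
prior = {"knee": 0.000024, "neat": 0.00031, "need": 0.00056, "new": 0.001}   # P(w)
likelihood = {"knee": 1.00, "neat": 0.52, "need": 0.11, "new": 0.36}         # P([ni]|w)

posterior = {w: likelihood[w] * prior[w] for w in prior}
best = max(posterior, key=posterior.get)
print(best, posterior[best])
```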

24
Why N-grams?
  • Compute likelihood P([ni]|w), then multiply
  • The unigram approach ignores context
  • Need to factor in context (N-gram)
  • Use P(need|I) instead of just P(need)
  • Note: P(new|I) < P(need|I)

25
Next Word Prediction (borrowed from J. Hirschberg)
  • From a NY Times story...
  • Stocks plunged this ____.
  • Stocks plunged this morning, despite a cut in
    interest rates
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    ...
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began

26
Next Word Prediction (cont)
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last Tuesday's terrorist attacks.

27
Human Word Prediction
  • Domain knowledge
  • Syntactic knowledge
  • Lexical knowledge

28
Claim
  • A useful part of the knowledge needed to allow
    Word Prediction can be captured using simple
    statistical techniques.
  • Compute
  • probability of a sequence
  • likelihood of words co-occurring

29
Why would we want to do this?
  • Rank the likelihood of sequences containing
    various alternative hypotheses
  • Assess the likelihood of a hypothesis

30
Why is this useful?
  • Speech recognition
  • Handwriting recognition
  • Spelling correction
  • Machine translation systems
  • Optical character recognizers

31
Handwriting Recognition
  • Assume a note is given to a bank teller, which
    the teller reads as "I have a gub." (cf. Woody
    Allen)
  • NLP to the rescue...
  • "gub" is not a word
  • gun, gum, Gus, and gull are words, but gun has a
    higher probability in the context of a bank

32
Real Word Spelling Errors
  • They are leaving in about fifteen minuets to go
    to her house.
  • The study was conducted mainly be John Black.
  • The design an construction of the system will
    take more than a year.
  • Hopefully, all with continue smoothly in my
    absence.
  • Can they lave him my messages?
  • I need to notified the bank of.
  • He is trying to fine out.

33
For Spell Checkers
  • Collect list of commonly substituted words
  • piece/peace, whether/weather, their/there ...
  • Example: "On Tuesday, the whether..." →
    "On Tuesday, the weather..."

34
Language Model
  • Definition: A language model is a model that
    enables one to compute the probability, or
    likelihood, of a sentence S, P(S).
  • Let's look at different ways of computing P(S) in
    the context of Word Prediction

35
Word Prediction: Simple vs. Smart
  • Simple: every word follows every other word with
    equal probability (0-gram)
  • Assume V is the size of the vocabulary
  • Likelihood of a sentence S of length n is
    1/V x 1/V x ... x 1/V
  • If English has 100,000 words, the probability of
    each next word is 1/100,000 = .00001
  • Smarter: the probability of each next word is
    related to its word frequency (unigram)
  • Likelihood of sentence S = P(w1) x P(w2) x ... x P(wn)
  • Assumes the probability of each word is independent
    of the probabilities of the other words
  • Even smarter: look at the probability given the
    previous words (N-gram)
  • Likelihood of sentence S = P(w1) x P(w2|w1) x ... x P(wn|wn-1)
  • Assumes the probability of each word depends on
    the preceding words

36
Chain Rule
  • Conditional Probability
  • P(A1,A2) = P(A1) P(A2|A1)
  • The Chain Rule generalizes to multiple events
  • P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ...
    P(An|A1, ..., An-1)
  • Examples
  • P(the dog) = P(the) P(dog | the)
  • P(the dog bites) = P(the) P(dog | the) P(bites |
    the dog)

37
Relative Frequencies and Conditional Probabilities
  • Relative word frequencies are better than equal
    probabilities for all words
  • In a corpus with 10K word types, each word would
    have P(w) = 1/10K
  • This does not match our intuition that different
    words are more likely to occur (e.g., "the")
  • Conditional probability is more useful than
    individual relative word frequencies
  • "dog" may be relatively rare in a corpus
  • But if we see "barking", P(dog|barking) may be
    very large

38
For a Word String
  • In general, the probability of a complete string
    of words w1...wn is
    P(w1...wn) =
    P(w1) P(w2|w1) P(w3|w1w2) ... P(wn|w1...wn-1)
  • But this approach to determining the probability
    of a word sequence is not very helpful in general

39
Markov Assumption
  • How do we compute P(wn|w1...wn-1)? Trick: instead
    of P(rabbit | I saw a), we use P(rabbit | a).
  • This lets us collect statistics in practice
  • A bigram model: P(the barking dog) =
    P(the|<start>) P(barking|the) P(dog|barking)
  • Markov models are the class of probabilistic
    models that assume we can predict the
    probability of some future unit without looking
    too far into the past
  • Specifically, for N=2 (bigram):
    P(w1...wn) ≈ ∏k P(wk|wk-1)
  • Order of a Markov model = length of prior context
  • bigram is first order, trigram is second order, ...
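In symbols, the exact chain-rule decomposition and the bigram (first-order Markov) approximation are:

```latex
P(w_1^{n}) \;=\; \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
\;\;\approx\;\;
\prod_{k=1}^{n} P(w_k \mid w_{k-1})
```

where w_0 is taken to be the sentence-initial marker <start>.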

40
Counting Words in Corpora
  • What is a word?
  • e.g., are cat and cats the same word?
  • September and Sept?
  • zero and oh?
  • Is seventy-two one word or two? AT&T?
  • Punctuation?
  • How many words are there in English?
  • Where do we find the things to count?

41
Corpora
  • Corpora are (generally online) collections of
    text and speech
  • Examples
  • Brown Corpus (1M words)
  • Wall Street Journal and AP News corpora
  • ATIS, Broadcast News (speech)
  • TDT (text and speech)
  • Switchboard, Call Home (speech)
  • TRAINS, FM Radio (speech)

42
Training and Testing
  • Probabilities come from a training corpus, which
    is used to design the model.
  • an overly narrow corpus: probabilities don't
    generalize
  • an overly general corpus: probabilities don't
    reflect the task or domain
  • A separate test corpus is used to evaluate the
    model, typically using standard metrics
  • held out test set
  • cross validation
  • evaluation differences should be statistically
    significant

43
Terminology
  • Sentence: unit of written language
  • Utterance: unit of spoken language
  • Word Form: the inflected form that appears in
    the corpus
  • Lemma: lexical forms having the same stem, part
    of speech, and word sense
  • Types (V): number of distinct words that might
    appear in a corpus (vocabulary size)
  • Tokens (N): total number of words in a corpus
  • Types seen so far (T): number of distinct words
    seen so far in the corpus (smaller than V and N)

44
Simple N-Grams
  • An N-gram model uses the previous N-1 words to
    predict the next one:
    P(wn | wn-N+1, wn-N+2, ..., wn-1)
  • unigrams: P(dog)
  • bigrams: P(dog | big)
  • trigrams: P(dog | the big)
  • quadrigrams: P(dog | chasing the big)

45
Using N-Grams
  • Recall that
  • N-gram: P(wn|w1...wn-1) ≈ P(wn|wn-N+1...wn-1)
  • Bigram: P(w1...wn) ≈ ∏k P(wk|wk-1)
  • For a bigram grammar
  • P(sentence) can be approximated by multiplying
    all the bigram probabilities in the sequence
  • Example: P(I want to eat Chinese food) ≈
    P(I | <start>) P(want | I) P(to | want) P(eat | to)
    P(Chinese | eat) P(food | Chinese)

46
A Bigram Grammar Fragment from BERP
Eat on       .16    Eat Thai       .03
Eat some     .06    Eat breakfast  .03
Eat lunch    .06    Eat in         .02
Eat dinner   .05    Eat Chinese    .02
Eat at       .04    Eat Mexican    .02
Eat a        .04    Eat tomorrow   .01
Eat Indian   .04    Eat dessert    .007
Eat today    .03    Eat British    .001
47
Additional BERP Grammar
<start> I    .25    Want some           .04
<start> I'd  .06    Want Thai           .01
<start> Tell .04    To eat              .26
<start> I'm  .02    To have             .14
I want       .32    To spend            .09
I would      .29    To be               .02
I don't      .08    British food        .60
I have       .04    British restaurant  .15
Want to      .65    British cuisine     .01
Want a       .05    British lunch       .01
48
Computing Sentence Probability
  • P(I want to eat British food) = P(I|<start>)
    P(want|I) P(to|want) P(eat|to) P(British|eat)
    P(food|British) = .25 x .32 x .65 x .26 x .001 x .60
    ≈ .0000081
  • vs. I want to eat Chinese food ≈ .00015
  • Probabilities seem to capture syntactic facts and
    world knowledge
  • "eat" is often followed by an NP
  • British food is not too popular
  • N-gram models can be trained by counting and
    normalization
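A minimal sketch of the computation above. The probability table copies the handful of BERP entries shown on the two preceding slides; the function name and the simplified <start> handling are just illustrative scaffolding.

```python
# Sketch: score a sentence by multiplying bigram probabilities.
# Probabilities below are the BERP fragment from the preceding slides.
bigram_prob = {
    ("<start>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("eat", "Chinese"): 0.02,
    ("British", "food"): 0.60, ("Chinese", "food"): 0.56,
}

def sentence_probability(words):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <start>."""
    prob = 1.0
    for prev, cur in zip(["<start>"] + words, words):
        prob *= bigram_prob[(prev, cur)]
    return prob

print(sentence_probability("I want to eat British food".split()))   # ~8.1e-06
print(sentence_probability("I want to eat Chinese food".split()))   # ~1.5e-04
```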

49
BERP Bigram Counts
          I     Want   To    Eat   Chinese  Food   Lunch
I         8     1087   0     13    0        0      0
Want      3     0      786   0     6        8      6
To        3     0      10    860   3        0      12
Eat       0     0      2     0     19       2      52
Chinese   2     0      0     0     0        120    1
Food      19    0      17    0     0        0      0
Lunch     4     0      0     0     0        1      0
(rows = first word, columns = second word)
50
BERP Bigram Probabilities: Use Unigram Counts
  • Normalization: divide the bigram count by the
    unigram count of the first word

          I      Want   To     Eat   Chinese  Food   Lunch
          3437   1215   3256   938   213      1506   459

  • Computing the probability of "I I"
  • P(I|I) = C(I,I)/C(I) = 8 / 3437 = .0023
  • A bigram grammar is an NxN matrix of
    probabilities, where N is the vocabulary size
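A small sketch of this normalization step: each bigram count is divided by the unigram count of its first word. The counts are the BERP figures from the two tables above (only a fragment of the full vocabulary); the function name is this sketch's own.

```python
# Sketch: maximum likelihood bigram estimates from the BERP count fragment.
unigram_count = {"I": 3437, "Want": 1215, "To": 3256, "Eat": 938,
                 "Chinese": 213, "Food": 1506, "Lunch": 459}
bigram_count = {("I", "I"): 8, ("I", "Want"): 1087, ("Want", "To"): 786,
                ("To", "Eat"): 860, ("Eat", "Chinese"): 19,
                ("Chinese", "Food"): 120}

def bigram_probability(prev, cur):
    """MLE estimate P(cur | prev) = C(prev, cur) / C(prev)."""
    return bigram_count.get((prev, cur), 0) / unigram_count[prev]

print(round(bigram_probability("I", "I"), 4))          # 0.0023
print(round(bigram_probability("Eat", "Chinese"), 4))  # 0.0203
```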

51
Learning a Bigram Grammar
  • The formula P(wn|wn-1) = C(wn-1 wn)/C(wn-1) is
    used for bigram parameter estimation
  • Relative Frequency
  • Maximum Likelihood Estimation (MLE): the parameter
    set maximizes the likelihood of the training set T
    given the model M, i.e., P(T|M)

52
What do we learn about the language?
  • What about...
  • P(I | I) = .0023
  • P(I | want) = .0025
  • P(I | food) = .013
  • What's being captured with...
  • P(want | I) = .32
  • P(to | want) = .65
  • P(eat | to) = .26
  • P(food | Chinese) = .56
  • P(lunch | eat) = .055

53
Readings for next time
  • J&M, Chapter 5 and Sections 7.1-7.3