Natural Language Processing

Transcript and Presenter's Notes



1
Natural Language Processing
  • Lecture Notes 6

2
Word Prediction
  • Stocks plunged this

3
Word Prediction
  • Stocks plunged this morning, despite a cut in
    interest

4
Word Prediction
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall

5
Word Prediction
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began

6
Word Prediction
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last

7
Word Prediction
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last Tuesday's terrorist attacks.

8
Word Prediction
  • So, we can predict future words in an utterance
  • How?
  • Domain knowledge
  • Syntactic knowledge
  • Lexical knowledge
  • We will use probabilities

9
Word Prediction
  • If you can predict the next word, you can predict
    the likelihood of sequences containing various
    alternative words.
  • That will help us with POS tagging, word sense
    disambiguation (WSD), spelling correction,
    handwriting recognition, speech recognition, and
    augmentative communication

10
N-Grams: The big red dog
  • Unigrams: P(dog)
  • Bigrams: P(dog | red)
  • Trigrams: P(dog | big red)
  • Four-grams: P(dog | the big red)
  • In general, we'll be dealing with
  • P(Word | Some fixed prefix)
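As an illustration, a minimal Python sketch of estimating such conditional probabilities from raw counts; the toy corpus is invented for the example:

```python
from collections import Counter

# Toy corpus, invented for this illustration.
corpus = "the big red dog saw the big red ball".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

# P(dog | red) = count(red dog) / count(red)
p = bigram_counts[("red", "dog")] / unigram_counts["red"]
print(p)  # 0.5 -- "red" is followed once by "dog" and once by "ball"
```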

11
Using N-Grams
  • P(I want to eat Chinese food) = P(I | start) P(want | I)
    P(to | I want) … P(food | I want to eat Chinese)
  • Markov assumptions
  • Bigrams: P(I | start) P(want | I) P(to | want) …
    P(food | Chinese)
  • Trigrams
  • P(I | start) P(want | I) P(to | I want) … P(food | eat
    Chinese)
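A hedged Python sketch of the bigram (Markov) approximation; the count tables below are invented for the example, and a "<s>" symbol stands in for "start":

```python
# Hypothetical bigram and unigram counts (invented; a real model
# would estimate them from a corpus). "<s>" stands in for "start".
bigram_counts = {("<s>", "I"): 8, ("I", "want"): 3, ("want", "to"): 3,
                 ("to", "eat"): 2, ("eat", "Chinese"): 1, ("Chinese", "food"): 1}
unigram_counts = {"<s>": 10, "I": 10, "want": 4, "to": 5, "eat": 3, "Chinese": 1}

def sentence_prob(sentence):
    """P(sentence) under the bigram (first-order Markov) assumption:
    the product of P(w_i | w_{i-1}) over the sentence."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts.get((prev, cur), 0) / unigram_counts[prev]
    return p

print(sentence_prob("I want to eat Chinese food"))
```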

12
BERP Table: Counts (Berkeley Restaurant Project)

This isn't the complete table. E.g., "I" occurs
3437 times (see p. 201 in the 1st edition)
13
BERP Table: Bigram Probabilities
14
An Aside on Logs
  • You don't really do all those multiplications; the
    numbers are too small and lead to underflow
  • Convert the probabilities to logs and then do
    additions (as sketched below)
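For example, a small Python sketch (with made-up probabilities) showing why adding logs is safer than multiplying raw probabilities:

```python
import math

# Sixty made-up conditional probabilities of 0.01 each.
probs = [0.01] * 60

product = 1.0
for p in probs:
    product *= p
print(product)      # 1e-120 -- a longer utterance would underflow to 0.0

log_prob = sum(math.log(p) for p in probs)
print(log_prob)     # about -276.3 -- an ordinary, safe float
```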

15
Generation
  • Choose N-grams with non-zero probabilities and
    string them together to get a feeling for the
    accuracy of the N-gram model (see the sketch below)
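A minimal sketch of this kind of generation with a bigram model, using an invented toy corpus and Python's random.choices to pick each next word in proportion to its bigram count:

```python
import random
from collections import Counter

# A tiny invented corpus; "<s>" and "</s>" mark sentence boundaries.
corpus = "<s> the cat sat on the mat </s> <s> the dog sat on the rug </s>".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

def next_word(prev):
    # Only continuations with non-zero bigram counts are candidates.
    candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == prev}
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights)[0]

word, generated = "<s>", []
while word != "</s>":
    word = next_word(word)
    generated.append(word)
print(" ".join(generated[:-1]))  # e.g. "the cat sat on the rug"
```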

16
Shakespeare
  • Unigrams
  • Every enter now severally so, let
  • Hill he late speaks or! A more to leg less first
    you enter
  • Bigrams
  • What means, sir. I confess she? Then all sorts,
    he is trim, captain.
  • Why dost stand forth thy canopy, forsooth; he is
    this palpable hit the King Henry.

17
Shakespeare
  • Trigrams
  • Sweet prince, Falstaff shall die.
  • This shall forbid it should be branded, if renown
    made it empty
  • Quadrigrams
  • What! I will go seek the traitor Gloucester
  • Will you not tell me who I am?

18
Observations
  • A small number of events occur with high
    frequency
  • You can collect reliable statistics on these
    events with relatively small samples
  • A large number of events occur with small
    frequency
  • You might have to wait a long time to gather
    statistics on the low frequency events

19
Observations
  • Some zeroes are really zeroes
  • Meaning that they represent events that can't or
    shouldn't occur
  • On the other hand, some zeroes aren't really
    zeroes
  • They represent low frequency events that simply
    didn't occur in the corpus

20
Dealing with the Problem of Zero Counts
  • Don't use higher-order N-grams
  • Smoothing
  • Add-one
  • Witten-Bell
  • Backoff

21
Discounting or Smoothing
  • MLE is usually unsuitable for NLP because of the
    sparseness of the data
  • We need to allow for the possibility of seeing
    events not seen in training
  • Must use a Discounting or Smoothing technique
  • Decrease the probability of previously seen
    events to leave a little bit of probability for
    previously unseen events

22
Add-one Smoothing (Laplace's law)
  • Pretend we have seen every n-gram at least once
  • Intuitively
  • new_count(n-gram) = old_count(n-gram) + 1
  • The idea is to give a little bit of the
    probability space to unseen events
  • P(wi | wi-1) = (count(wi-1 wi) + 1) / (count(wi-1) + V)
  • In later slides, count(wi-1) is referred to as N
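A minimal Python sketch of the add-one estimate; the counts below are invented toy numbers, not the actual BERP table:

```python
def add_one_prob(w_prev, w, bigram_counts, unigram_counts, V):
    """Add-one (Laplace) estimate of P(w | w_prev):
    (count(w_prev w) + 1) / (count(w_prev) + V)."""
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + V)

# Invented toy counts, with V = 1616 as in the example slides.
bigram_counts = {("want", "to"): 50}
unigram_counts = {"want": 100}
V = 1616

print(add_one_prob("want", "to", bigram_counts, unigram_counts, V))     # seen:   51 / 1716
print(add_one_prob("want", "lunch", bigram_counts, unigram_counts, V))  # unseen:  1 / 1716
```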

23
Add-one Example (V = 1616)
unsmoothed bigram counts (2nd word across the columns)
unsmoothed normalized bigram probabilities
24
Add-one Example (V = 1616)
add-one smoothed bigram counts
add-one normalized bigram probabilities
25
Problem: add-one smoothing (V = 1616)
  • the denominator for bigrams starting with "Chinese"
    is boosted by a factor of about 8 (1829 / 213, i.e.,
    from count(Chinese) = 213 to 213 + 1616 = 1829)

unsmoothed bigram counts
add-one smoothed bigram counts
26
Problem with add-one smoothing
  • every previously unseen n-gram is given a low
    probability
  • but there are so many of them that too much
    probability mass is given to unseen events
  • adding 1 to a frequent bigram does not change it much
  • but adding 1 to low-count bigrams (including unseen
    ones) boosts them too much!
  • In NLP applications, where the data are very sparse,
    Laplace's Law actually gives far too much of the
    probability space to unseen events

27
Witten-Bell smoothing
  • intuition
  • An unseen n-gram is one that just did not occur
    yet
  • When it does happen, it will be its first
    occurrence
  • So, give unseen n-grams the probability of seeing
    a new n-gram type

28
Witten-Bell: the equations
  • Total probability mass assigned to zero-frequency
    unigrams (T = number of observed types, N = number
    of word instances/tokens)
  • So each zero-count N-gram gets the probability
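The equations themselves appeared on the slide as images; the following is a reconstruction from the standard Witten-Bell formulation, with Z = the number of unseen types, so treat the exact notation as an assumption:

```latex
% Total probability mass reserved for zero-frequency events
\sum_{i : c_i = 0} p_i^* = \frac{T}{N + T}

% Shared equally, each zero-count event gets
p_i^* = \frac{T}{Z\,(N + T)} \qquad \text{for } c_i = 0
```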

29
Witten-Bell: why discounting
  • Now of course we have to take away something
    (discount) from the probability of the events that
    have actually been seen
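In the standard formulation (again a reconstruction, not the slide's own image), each seen event is discounted to:

```latex
p_i^* = \frac{c_i}{N + T} \qquad \text{for } c_i > 0
```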

30
Witten-Bell: for bigrams
  • We relativize the types to the previous word
  • this probability mass must be distributed in
    equal parts over all unseen bigrams
  • Z(w1) = number of unseen bigrams starting with
    w1

  • for each unseen event
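Reconstructed bigram versions of the formulas, following the standard Witten-Bell presentation (the slide's images are not in the transcript, so the notation is an assumption):

```latex
% Probability mass reserved for unseen bigrams starting with w_1
\sum_{w_2 : C(w_1 w_2) = 0} p^*(w_2 \mid w_1) = \frac{T(w_1)}{N(w_1) + T(w_1)}

% Each of the Z(w_1) unseen bigrams starting with w_1 gets
p^*(w_2 \mid w_1) = \frac{T(w_1)}{Z(w_1)\,\bigl(N(w_1) + T(w_1)\bigr)}

% Seen bigrams are discounted to
p^*(w_2 \mid w_1) = \frac{C(w_1 w_2)}{N(w_1) + T(w_1)} \qquad \text{if } C(w_1 w_2) > 0
```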

31
Small example
  • all unseen bigrams starting with a will share a
    probability mass of
  • each unseen bigram starting with a will have an
    equal part of this

32
Small example (cont)
  • all unseen bigrams starting with b will share a
    probability mass of
  • each unseen bigram starting with b will have an
    equal part of this

33
Small example (cont)
  • all unseen bigrams starting with c will share a
    probability mass of
  • each unseen bigram starting with c will have an
    equal part of this

34
Back to Counts
  • Unseen bigrams
  • To get from the probabilities back to the counts, we
    know that
  • N(w1) = number of bigram tokens starting with w1
  • C(w2|w1) here means count(w1 w2)
  • so we get
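A reconstruction of the missing count formulas, assuming the usual relation c* = p* × N(w1):

```latex
% Unseen bigrams
c^*(w_1 w_2) = \frac{T(w_1)}{Z(w_1)} \cdot \frac{N(w_1)}{N(w_1) + T(w_1)}

% Seen bigrams
c^*(w_1 w_2) = C(w_1 w_2) \cdot \frac{N(w_1)}{N(w_1) + T(w_1)}
```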

35
The restaurant example
  • The original counts were
  • T(w) = number of different seen bigram types
    starting with w
  • we have a vocabulary of 1616 words, so we can
    compute
  • Z(w) = number of unseen bigram types starting
    with w
  • Z(w) = 1616 - T(w)
  • N(w) = number of bigram tokens starting with w

36
Witten-Bell smoothed count
  • the count of the unseen bigram "I lunch"
  • the count of the seen bigram "want to"
  • Witten-Bell smoothed bigram counts

37
Witten-Bell smoothed probabilities
Witten-Bell normalized bigram probabilities
38
Simple Linear Interpolation
  • Solve the sparseness in a trigram model by mixing
    with bigram and unigram models
  • Also called
  • linear interpolation,
  • finite mixture models
  • deleted interpolation
  • Combine linearly
  • Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) +
    λ3 P(wn | wn-2, wn-1)
  • where 0 ≤ λi ≤ 1 and Σi λi = 1
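A minimal Python sketch of the interpolated estimate; the component probabilities and weights are placeholders:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w_n | w_{n-2}, w_{n-1}) = l1*P(w_n) + l2*P(w_n | w_{n-1})
    + l3*P(w_n | w_{n-2}, w_{n-1}), with the weights summing to 1."""
    l1, l2, l3 = lambdas
    assert abs((l1 + l2 + l3) - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Made-up component probabilities: the trigram estimate is zero (unseen),
# but the interpolated estimate is not.
print(interpolated_prob(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # about 0.0032
```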