Statistical NLP - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Statistical NLP
Introduction to Statistical NLP
2
Textbook
  • Manning, C. D., Schütze, H.
  • Foundations of Statistical Natural Language
    Processing. The MIT Press. 1999.

3
Rationalist versus Empiricist Approaches to Language I
  • Question: What prior knowledge should be built
    into our models of NLP?
  • Rationalist answer: A significant part of the
    knowledge in the human mind is not derived by the
    senses but is fixed in advance, presumably by
    genetic inheritance (Chomsky's poverty-of-the-stimulus
    argument).
  • Empiricist answer: The brain is able to perform
    association, pattern recognition, and
    generalization, and thus the structures of
    natural language can be learned.

4
Rationalist versus Empiricist Approaches to Language II
  • Chomskyan/generative linguists seek to describe
    the language module of the human mind (the
    I-language), for which data such as text (the
    E-language) provide only indirect evidence,
    which can be supplemented by native speakers'
    intuitions.
  • Empiricist approaches are interested in
    describing the E-language as it actually occurs.
  • Chomskyans make a distinction between linguistic
    competence and linguistic performance. They
    believe that linguistic competence can be
    described in isolation, while Empiricists reject
    this notion.

5
Today's Approach to NLP
  • From 1970-1989, people were concerned with the
    science of the mind and built small (toy) systems
    that attempted to behave intelligently.
  • Recently, there has been more interest in
    engineering practical solutions using automatic
    learning (knowledge induction).
  • While Chomskyans tend to concentrate on
    categorical judgements about very rare types of
    sentences, statistical NLP practitioners
    concentrate on common types of sentences.

6
NLP: The Main Issues
  • Why is NLP difficult?
  • many words, many phenomena --> many rules
  • the French dictionary DELAS: 100,000 words
    (700,000 inflected forms)
  • sentences, clauses, phrases, constituents,
    coordination, negation, imperatives/questions,
    inflections, parts of speech, pronunciation,
    topic/focus, and much more!
  • irregularity (exceptions, exceptions to the
    exceptions, ...)
  • potato -> potatoes (tomato, hero, ...), photo ->
    photos, and even both mango -> mangos or
    mango -> mangoes
  • Adjective/Noun order: new book, electrical
    engineering, general regulations, flower garden,
    garden flower, ... but Governor General

7
Difficulties in NLP (cont.)
  • ambiguity
  • books: NOUN or VERB?
  • you need many books vs. she books her flights
    online
  • No left turn weekdays 4-6 pm / except transit
    vehicles
  • when may transit vehicles turn? Always? Never?
  • Thank you for not smoking, drinking, eating or
    playing radios without earphones.
  • Thank you for not eating without earphones??
  • or even Thank you for not drinking without
    earphones!?
  • My neighbor's hat was taken by the wind. He tried
    to catch it.
  • ...catch the wind or ...catch the hat?

8
(Categorical) Rules or Statistics?
  • Preferences
  • clear cases: context clues: "she books" --> "books"
    is a verb
  • rule: if an ambiguous word (verb/non-verb) is
    preceded by a matching personal pronoun -> the word
    is a verb
  • less clear cases: pronoun reference
  • she/he/it refers to the most recent noun or
    pronoun (?) (but maybe we can specify exceptions)
  • selectional
  • catching hat >> catching wind (but why not?)
  • semantic
  • never thank someone for drinking on a bus! (but what
    about the earphones?)

9
Solutions
  • Don't guess if you know:
  • morphology (inflections)
  • lexicons (lists of words)
  • unambiguous names
  • perhaps some (really) fixed phrases
  • syntactic rules?
  • Use statistics (based on real-world data) for
    preferences (only?)
  • No doubt, but this is the big question!

10
Statistical NLP
  • Imagine:
  • Each sentence W = w1, w2, ..., wn gets a
    probability P(W|X) in a context X (think of it in
    the intuitive sense for now)
  • For every possible context X, sort all the
    imaginable sentences W according to P(W|X)
  • Ideal situation:
  • best sentence = the most probable sentence in
    context X
  • NB: the same holds for interpretation
  • P(W) ≈ 0 for ungrammatical sentences

11
Real World Situation
  • Unable to specify the set of grammatical sentences
    today using fixed categorical rules (maybe
    never)
  • Use a statistical model based on REAL-WORLD DATA
    and care about the best sentence only
    (disregarding the grammaticality issue); a small
    sketch of this idea follows below
  • best sentence = argmax over W of P(W), i.e., the
    most probable sentence under the model
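To make the argmax idea concrete, here is a minimal sketch of choosing the most probable sentence from a set of candidates. The bigram scorer and all probabilities below are invented placeholders, not the model the slides have in mind; they only illustrate scoring and selection.

```python
import math

def sentence_logprob(words, bigram_prob):
    """Sum log P(w_i | w_{i-1}) over the sentence; "<s>" marks the start."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(bigram_prob.get((prev, w), 1e-6))  # tiny floor for unseen pairs
        prev = w
    return logp

# Toy probabilities, made up purely for illustration.
bigram_prob = {
    ("<s>", "she"): 0.2,
    ("she", "books"): 0.05,
    ("books", "flights"): 0.3,
    ("she", "book"): 0.001,
    ("book", "flights"): 0.1,
}

candidates = [["she", "books", "flights"], ["she", "book", "flights"]]
best = max(candidates, key=lambda W: sentence_logprob(W, bigram_prob))
print(best)  # the candidate scored highest by the toy model
```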

12
Why is NLP Difficult?
  • NLP is difficult because Natural Language is
    highly ambiguous.
  • Example: "Our company is training workers" has 3
    parses (i.e., syntactic analyses).
  • "List the sales of the products produced in 1973
    with the products produced in 1972" has 455
    parses.
  • Therefore, a practical NLP system must be good at
    making disambiguation decisions of word sense,
    word category, syntactic structure, and semantic
    scope.

13
Methods that don't work well
  • Maximizing coverage while minimizing ambiguity is
    inconsistent with symbolic NLP.
  • Furthermore, hand-coded syntactic constraints
    and preference rules are time-consuming to build,
    do not scale up well, and are brittle in the face
    of the extensive use of metaphor in language.
  • Example: if we code
  • animate being --> swallow --> physical object
  • it fails on sentences like:
  • "I swallowed his story, hook, line, and sinker."
  • "The supernova swallowed the planet."

14
What Statistical NLP can do for us
  • Disambiguation strategies that rely on
    hand-coding produce a knowledge acquisition
    bottleneck and perform poorly on naturally
    occurring text.
  • A Statistical NLP approach seeks to solve these
    problems by automatically learning lexical and
    structural preferences from corpora. In
    particular, Statistical NLP recognizes that there
    is a lot of information in the relationships
    between words.
  • The use of statistics offers a good solution to
    the ambiguity problem: statistical models are
    robust, generalize well, and behave gracefully in
    the presence of errors and new data.

15
Things that can be done with Text Corpora I: Word Counts
  • Word counts help us find out:
  • what the most common words in the text are;
  • how many words are in the text (word tokens and
    word types);
  • what the average frequency of each word in the
    text is (see the counting sketch below).
  • Limitation of word counts: most words appear very
    infrequently, and it is hard to predict much about
    the behavior of words that do not occur often in
    a corpus. -> Zipf's Law.
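A small sketch of these word counts using only Python's standard library; the letters-and-apostrophes tokenizer is a simplifying assumption, not a serious tokenization scheme.

```python
import re
from collections import Counter

def word_counts(text):
    tokens = re.findall(r"[A-Za-z']+", text.lower())   # crude tokenizer (assumption)
    counts = Counter(tokens)
    n_tokens = len(tokens)                             # word tokens
    n_types = len(counts)                              # word types
    avg_freq = n_tokens / n_types if n_types else 0.0  # average frequency per type
    return counts, n_tokens, n_types, avg_freq

counts, n_tokens, n_types, avg = word_counts("the cat sat on the mat and the cat slept")
print(counts.most_common(3), n_tokens, n_types, round(avg, 2))
```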

16
Common words in Tom Sawyer
  • Word Freq. Use
  • the 3332 determiner (article)
  • and 2972 conjunction
  • a 1775 determiner
  • to 1725 preposition, verbal infinitive marker
  • of 1440 preposition
  • was 1161 auxiliary verb
  • it 1027 (personal/expletive) pronoun
  • in 906 preposition
  • that 877 complementizer, demonstrative
  • he 877 (personal) pronoun
  • I 783 (personal) pronoun
  • his 772 (possessive) pronoun
  • you 686 (personal) pronoun
  • Tom 679 proper noun
  • with 642 preposition

17
Things that can be done with Text Corpora II: Zipf's Law
  • If we count up how often each word type of a
    language occurs in a large corpus and then list
    the words in order of their frequency of
    occurrence, we can explore the relationship
    between the frequency of a word, f, and its
    position in the list, known as its rank, r.
  • Zipf's Law says that f ∝ 1/r.
  • Significance of Zipf's Law: for most words, our
    data about their use will be exceedingly sparse.
    Only for a few words will we have a lot of
    examples.

18
Zipf's and Mandelbrot's Laws
  • Zipf's law:
  • f ∝ 1/r                          (1)
  • There is a constant k such that
  • f · r = k                        (2)
  • Mandelbrot's law:
  • f = P (r + ρ)^(-B)               (3)
  • log f = log P - B log(r + ρ)     (4)
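A quick way to check Zipf's law empirically on any tokenized text: rank the word types by frequency and print f, r, and f·r for a few ranks, in the spirit of the Tom Sawyer tables on the following slides. The ranks chosen below are arbitrary.

```python
from collections import Counter

def zipf_table(tokens, ranks=(1, 2, 3, 10, 20, 50, 100)):
    ranked = Counter(tokens).most_common()       # [(word, freq), ...] sorted by frequency
    for r in ranks:
        if r <= len(ranked):
            word, f = ranked[r - 1]
            print(f"{word:<12} f={f:<6} r={r:<5} f*r={f * r}")

# Usage: zipf_table(tokens) for any list of word tokens; under Zipf's law the
# f*r column should stay roughly constant.
```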

19
  • Word  Freq. (f)  Rank (r)  f·r
  • turned 51 200 10200
  • you'll 30 300 9000
  • name 21 400 8400
  • comes 16 500 8000
  • group 13 600 7800
  • lead 11 700 7700
  • Friends 10 800 8000
  • begin 9 900 8100
  • family 8 1000 8000
  • brushed 4 2000 8000
  • sins 2 3000 6000
  • Could 2 4000 8000
  • Applausive 1 8000 8000

20
Frequencies of frequencies in Tom Sawyer
  • Word Frequency   Frequency of Frequency
  • 1 3993
  • 2 1292
  • 3 664
  • 4 410
  • 5 243
  • 6 199
  • 7 172
  • 8 131
  • 9 82
  • 10 91
  • 11-50 540
  • 51-100 99
  • > 100 102
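A sketch of how a frequencies-of-frequencies table like the one above can be computed: count the words, then count how many word types share each frequency.

```python
from collections import Counter

def freq_of_freqs(tokens):
    word_freqs = Counter(tokens)          # word -> frequency
    return Counter(word_freqs.values())   # frequency -> number of word types with that frequency

fof = freq_of_freqs("the cat sat on the mat and the cat slept".split())
print(sorted(fof.items()))                # [(1, 5), (2, 1), (3, 1)]
```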

21
Zipf's law in Tom Sawyer
  • Word  Freq. (f)  Rank (r)  f·r
  • the 3332 1 3332
  • and 2972 2 5944
  • a 1775 3 5325
  • he 877 10 8770
  • but 410 20 8400
  • be 294 30 8820
  • there 222 40 8880
  • one 172 50 8600
  • about 158 60 9480
  • more 138 70 9660
  • never 124 80 9920
  • Oh 116 90 10440
  • two 104 100 10400

22
Things that can be done with Text Corpora III: Collocations
  • A collocation is any turn of phrase or accepted
    usage where somehow the whole is perceived as
    having an existence beyond the sum of its parts
    (e.g., disk drive, make up, bacon and eggs).
  • Collocations are important for machine
    translation.
  • Collocations can be extracted from a text (for
    example, the most common bigrams can be
    extracted). However, since these bigrams are
    often insignificant (e.g., "at the", "of a"),
    they can be filtered (a small extraction sketch
    follows below).
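A sketch of bigram extraction with filtering. The NYT tables that follow filter by part-of-speech patterns such as A N and N N; since that requires a tagger, the stopword filter below is a rough stand-in (an assumption, not the slides' method) that discards function-word pairs like "of the" and "in a".

```python
from collections import Counter

# Small ad-hoc stopword list, chosen for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "on", "for", "and", "that",
             "at", "be", "by", "with", "from", "as", "is", "has", "he", "said"}

def common_bigrams(tokens, top=10, filtered=True):
    bigrams = Counter(zip(tokens, tokens[1:]))         # count adjacent word pairs
    if filtered:
        bigrams = Counter({bg: c for bg, c in bigrams.items()
                           if bg[0].lower() not in STOPWORDS
                           and bg[1].lower() not in STOPWORDS})
    return bigrams.most_common(top)

# Usage: common_bigrams(tokens) on a tokenized corpus; with filtered=False the
# output is dominated by pairs like ("of", "the"), much as in the unfiltered table.
```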

23
Commonest bigrams in the NYT
  • Frequency Word 1 Word 2
  • 80871 of the
  • 58841 in the
  • 26430 to the
  • 21842 on the
  • 21839 for the
  • 18568 and the
  • 16121 that the
  • 15630 at the
  • 15494 to be
  • 13899 in a
  • 13689 of a

24
Commonest bigrams in the NYT
  • Frequency Word 1 Word 2
  • 13361 by the
  • 13183 with the
  • 12622 from the
  • 11428 New York
  • 10007 he said
  • 9775 as a
  • 9231 is a
  • 8753 has been
  • 8573 for a

25
Filtered common bigrams in the NYT
  • Frequency Word 1 Word 2 POS pattern
  • 11487 New York A N
  • 7261 United States A N
  • 5412 Los Angeles N N
  • 3301 last year A N
  • 3191 Saudi Arabia N N
  • 2699 last week A N
  • 2514 vice president A N
  • 2378 Persian Gulf A N
  • 2161 San Francisco N N
  • 2106 President Bush N N
  • 2001 Middle East A N
  • 1942 Saddam Hussein N N
  • 1867 Soviet Union A N
  • 1850 White House A N
  • 1633 United Nations A N
  • 1337 York City N N
  • 1328 oil prices N N
  • 1210 next year A N

26
Things that can be done with Text Corpora IV: Concordances
  • Finding concordances corresponds to finding the
    different contexts in which a given word occurs.
  • One can use a Key Word In Context (KWIC)
    concordancing program.
  • Concordances are useful both for building
    dictionaries for learners of foreign languages
    and for guiding statistical parsers.
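A minimal KWIC sketch: for each occurrence of a key word, print a fixed-width window of left and right context, similar in spirit to the display on the next slide. The window and width values are arbitrary choices.

```python
def kwic(tokens, keyword, width=35, window=8):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left[-width:]:>{width}}  {tok}  {right[:width]}")
    return lines

for line in kwic("Tom lifted his lip and showed the vacancy".split(), "showed"):
    print(line)
```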

27
KWIC display
  • 1 could find a target. The librarian showed
    off - running hither and thither w
  • 2 elights in. The young lady teachers showed
    off - bending sweetly over pupils
  • 3 ingly. The young gentlemen teachers showed
    off with small scoldings and other
  • 4 seeming vexation). The little girls showed
    off in various ways, and the littl
  • 5 n various ways, and the little boys showed
    off with such diligence that the a
  • 6 t genuwyne? Tom lifted his lip and showed the
    vacancy. Well, all right, sai
  • 7 is little finger for a pen. Then he showed
    Huckleberry how to make an H and an
  • 8 ow's face was haggard, and his eyes showed the
    fear that was upon him. When he
  • 9 not overlook the fact that Tom even showed a
    marked aversion to these inquests
  • 10 own. Two or three glimmering lights showed
    where it lay, peacefully sleeping,