Title: Statistical NLP
1 Statistical NLP
Introduction to Statistical NLP
2 Textbook
- Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
3 Rationalist versus Empiricist Approaches to Language I
- Question: What prior knowledge should be built into our models of NLP?
- Rationalist answer: A significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky's "poverty of the stimulus" argument).
- Empiricist answer: The brain is able to perform association, pattern recognition, and generalization, and thus the structures of natural language can be learned.
4 Rationalist versus Empiricist Approaches to Language II
- Chomskyan/generative linguists seek to describe the language module of the human mind (the I-language), for which data such as texts (the E-language) provide only indirect evidence, which can be supplemented by native speakers' intuitions.
- Empiricist approaches are interested in describing the E-language as it actually occurs.
- Chomskyans make a distinction between linguistic competence and linguistic performance. They believe that linguistic competence can be described in isolation, while Empiricists reject this notion.
5 Today's Approach to NLP
- From 1970 to 1989, people were concerned with the science of the mind and built small (toy) systems that attempted to behave intelligently.
- Recently, there has been more interest in engineering practical solutions using automatic learning (knowledge induction).
- While Chomskyans tend to concentrate on categorical judgements about very rare types of sentences, statistical NLP practitioners concentrate on common types of sentences.
6 NLP: The Main Issues
- Why is NLP difficult?
- many words, many phenomena --> many rules
- the French dictionary DELAS: 100,000 words (700,000 inflected forms)
- sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
- irregularity (exceptions, exceptions to the exceptions, ...)
- potato -> potatoes (tomato, hero, ...), but photo -> photos, and even both mango -> mangos or mango -> mangoes
- Adjective/Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, ... but Governor General
7 Difficulties in NLP (cont.)
- ambiguity
- books: NOUN or VERB?
- you need many books vs. she books her flights online
- No left turn weekdays 4-6 pm / except transit vehicles
- when may transit vehicles turn? Always? Never?
- Thank you for not smoking, drinking, eating or playing radios without earphones.
- Thank you for not eating without earphones??
- or even Thank you for not drinking without earphones!?
- My neighbor's hat was taken by the wind. He tried to catch it.
- ...catch the wind or ...catch the hat?
8 (Categorical) Rules or Statistics?
- Preferences:
- clear cases: context clues: she books --> books is a verb
- rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun --> the word is a verb
- less clear cases: pronoun reference
- she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
- selectional:
- catching hat >> catching wind (but why not?)
- semantic:
- never thank for drinking in a bus! (but what about the earphones?)
9 Solutions
- Don't guess if you know:
- morphology (inflections)
- lexicons (lists of words)
- unambiguous names
- perhaps some (really) fixed phrases
- syntactic rules?
- Use statistics (based on real-world data) for preferences (only?)
- No doubt, but this is the big question!
10 Statistical NLP
- Imagine:
- Each sentence W = w1 w2 ... wn gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
- For every possible context X, sort all the imaginable sentences W according to P(W|X)
- Ideal situation:
- best sentence (most probable in context X): W_best = argmax_W P(W|X)
- NB: same for the best interpretation
- P(W) = 0 for "ungrammatical" sentences
11 Real World Situation
- Unable to specify the set of grammatical sentences today using fixed categorical rules (maybe never)
- Use a statistical model based on REAL WORLD DATA and care about the best sentence only (disregarding the grammaticality issue)
- best sentence: W_best = argmax_W P(W) (a toy sketch follows below)
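To make the argmax concrete, here is a minimal sketch; the unigram model, its probabilities, and the candidate sentences are all invented for illustration and are not from the slides:

```python
import math

# Toy unigram "language model": P(W) is the product of per-word
# probabilities. All probabilities below are made up for illustration.
UNIGRAM = {"she": 0.04, "books": 0.002, "her": 0.03,
           "flights": 0.001, "online": 0.0005}

def log_p(sentence):
    """Log P(W) under the toy unigram model (unknown words get a tiny floor)."""
    return sum(math.log(UNIGRAM.get(w, 1e-9)) for w in sentence.split())

candidates = ["she books her flights online",
              "she book her flight online"]
# W_best = argmax_W P(W): pick the most probable candidate sentence
print(max(candidates, key=log_p))
```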
12 Why is NLP Difficult?
- NLP is difficult because natural language is highly ambiguous.
- Example: Our company is training workers has 3 parses (i.e., syntactic analyses).
- List the sales of the products produced in 1973 with the products produced in 1972 has 455 parses.
- Therefore, a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope.
13 Methods that don't work well
- Maximizing coverage while minimizing ambiguity is inconsistent with symbolic NLP.
- Furthermore, hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are brittle in the face of the extensive use of metaphor in language.
- Example: if we code the selectional restriction
- animate being --> swallow --> physical object
- then metaphorical uses break the rule:
- I swallowed his story, hook, line, and sinker
- The supernova swallowed the planet.
14 What Statistical NLP can do for us
- Disambiguation strategies that rely on hand-coding produce a knowledge acquisition bottleneck and perform poorly on naturally occurring text.
- A Statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from corpora. In particular, Statistical NLP recognizes that there is a lot of information in the relationships between words.
- The use of statistics offers a good solution to the ambiguity problem: statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data.
15 Things that can be done with Text Corpora I: Word Counts
- Word counts to find out:
- What are the most common words in the text?
- How many words are in the text (word tokens and word types)?
- What is the average frequency of each word in the text?
- Limitation of word counts: Most words appear very infrequently, and it is hard to predict much about the behavior of words that do not occur often in a corpus ⇒ Zipf's Law. (A counting sketch follows below.)
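A minimal counting sketch in Python; the crude tokenizer and the file name tom_sawyer.txt are assumptions for the example:

```python
from collections import Counter
import re

def word_counts(text):
    """Crude tokenization on letters/apostrophes, then count occurrences."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # running words (word tokens)
    n_types = len(counts)    # distinct words (word types)
    return counts, n_tokens, n_types

with open("tom_sawyer.txt", encoding="utf-8") as f:  # assumed file name
    counts, n_tokens, n_types = word_counts(f.read())

print(counts.most_common(15))                  # most common words
print(n_tokens, n_types, n_tokens / n_types)   # average frequency per type
```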
16 Common words in Tom Sawyer
- Word Freq. Use
- the 3332 determiner (article)
- and 2972 conjunction
- a 1775 determiner
- to 1725 preposition, verbal infinitive marker
- of 1440 preposition
- was 1161 auxiliary verb
- it 1027 (personal/expletive) pronoun
- in 906 preposition
- that 877 complementizer, demonstrative
- he 877 (personal) pronoun
- I 783 (personal) pronoun
- his 772 (possessive) pronoun
- you 686 (personal) pronoun
- Tom 679 proper noun
- with 642 preposition
17 Things that can be done with Text Corpora II: Zipf's Law
- If we count up how often each word type of a language occurs in a large corpus and then list the words in order of their frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r.
- Zipf's Law says that f ∝ 1/r.
- Significance of Zipf's Law: For most words, our data about their use will be exceedingly sparse. Only for a few words will we have a lot of examples.
18 Zipf's and Mandelbrot's Laws
- Zipf's law:
- f ∝ 1/r (1)
- i.e., there is a constant k such that
- f · r = k (2)
- Mandelbrot's law:
- f = P (r + ρ)^(-B) (3)
- log f = log P - B log(r + ρ) (4)
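A sketch that checks equation (2) empirically and evaluates Mandelbrot's formula (3); the `tokens` list is assumed to come from a tokenizer such as the one above, and P, rho, B are the free parameters of (3), to be fitted to data:

```python
from collections import Counter

def zipf_check(tokens, ranks=(1, 10, 100, 1000)):
    """Print f, r, and f*r for a few ranks; Zipf predicts f*r ~ constant k."""
    by_rank = Counter(tokens).most_common()   # words sorted by frequency
    for r in ranks:
        if r <= len(by_rank):
            word, f = by_rank[r - 1]
            print(f"{word:>12}  f={f:6d}  r={r:5d}  f*r={f * r}")

def mandelbrot(r, P, rho, B):
    """Mandelbrot's refinement: f = P * (r + rho) ** (-B)."""
    return P * (r + rho) ** (-B)
```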
19
- Word Freq. (f) Rank (r) f · r
- turned 51 200 10200
- you'll 30 300 9000
- name 21 400 8400
- comes 16 500 8000
- group 13 600 7800
- lead 11 700 7700
- Friends 10 800 8000
- begin 9 900 8100
- family 8 1000 8000
- brushed 4 2000 8000
- sins 2 3000 6000
- Could 2 4000 8000
- Applausive 1 8000 8000
20 Frequencies of frequencies in Tom Sawyer
- Word Frequency   Frequency of Frequency
- 1 3993
- 2 1292
- 3 664
- 4 410
- 5 243
- 6 199
- 7 172
- 8 131
- 9 82
- 10 91
- 11-50 540
- 51-100 99
- > 100 102
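The table above can be computed with two nested Counters; a minimal sketch, again assuming a `tokens` list from a tokenizer like the one earlier:

```python
from collections import Counter

def freq_of_freqs(tokens):
    """Map each frequency to the number of word types that have it."""
    word_freq = Counter(tokens)          # word -> frequency
    return Counter(word_freq.values())   # frequency -> number of word types

# e.g. fof = freq_of_freqs(tokens); fof[1] counts words seen exactly once
```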
21 Zipf's law in Tom Sawyer
- Word Freq. (f) Rank (r) f · r
- the 3332 1 3332
- and 2972 2 5944
- a 1775 3 5325
- he 877 10 8770
- but 410 20 8200
- be 294 30 8820
- there 222 40 8880
- one 172 50 8600
- about 158 60 9480
- more 138 70 9660
- never 124 80 9920
- Oh 116 90 10440
- two 104 100 10400
22 Things that can be done with Text Corpora III: Collocations
- A collocation is any turn of phrase or accepted usage where somehow the whole is perceived as having an existence beyond the sum of its parts (e.g., disk drive, make up, bacon and eggs).
- Collocations are important for machine translation.
- Collocations can be extracted from a text (for example, the most common bigrams can be extracted). However, since these bigrams are often insignificant (e.g., at the, of a), they can be filtered (a sketch follows below).
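A sketch of raw bigram extraction (POS filtering is sketched after the filtered table below); `tokens` is again an assumed word list:

```python
from collections import Counter

def common_bigrams(tokens, n=10):
    """Count adjacent word pairs and return the n most frequent ones."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)
```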
23 Commonest bigrams in the NYT
- Frequency Word 1 Word 2
- 80871 of the
- 58841 in the
- 26430 to the
- 21842 on the
- 21839 for the
- 18568 and the
- 16121 that the
- 15630 at the
- 15494 to be
- 13899 in a
- 13689 of a
24 Commonest bigrams in the NYT (cont.)
- Frequency Word 1 Word 2
- 13361 by the
- 13183 with the
- 12622 from the
- 11428 New York
- 10007 he said
- 9775 as a
- 9231 is a
- 8753 has been
- 8573 for a
25 Filtered common bigrams in the NYT
- Frequency Word 1 Word 2 POS pattern
- 11487 New York A N
- 7261 United States A N
- 5412 Los Angeles N N
- 3301 last year A N
- 3191 Saudi Arabia N N
- 2699 last week A N
- 2514 vice president A N
- 2378 Persian Gulf A N
- 2161 San Francisco N N
- 2106 President Bush N N
- 2001 Middle East A N
- 1942 Saddam Hussein N N
- 1867 Soviet Union A N
- 1850 White House A N
- 1633 United Nations A N
- 1337 York City N N
- 1328 oil prices N N
- 1210 next year A N
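One way to produce a filtered list like the one above is to tag the text and keep only adjective/noun patterns. A sketch using NLTK's tagger; the exact set of Penn Treebank tags treated as A and N here is an assumption:

```python
from collections import Counter
import nltk  # requires the 'averaged_perceptron_tagger' data package

# POS patterns to keep (A N and N N), approximated with Penn Treebank tags
KEEP = {("JJ", "NN"), ("JJ", "NNS"), ("JJ", "NNP"),
        ("NN", "NN"), ("NN", "NNS"), ("NNP", "NNP")}

def filtered_bigrams(tokens, n=10):
    """Most frequent bigrams whose POS pattern is adjective/noun or noun/noun."""
    tagged = nltk.pos_tag(tokens)   # [(word, tag), ...]
    kept = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in KEEP
    )
    return kept.most_common(n)
```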
26 Things that can be done with Text Corpora IV: Concordances
- Finding concordances corresponds to finding the different contexts in which a given word occurs.
- One can use a Key Word In Context (KWIC) concordancing program (a minimal sketch follows below).
- Concordances are useful both for building dictionaries for learners of foreign languages and for guiding statistical parsers.
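A minimal KWIC sketch; it does naive substring matching over the raw text, whereas a real concordancer would match word boundaries:

```python
def kwic(text, keyword, width=35):
    """Print every occurrence of `keyword` with `width` chars of context."""
    i = text.find(keyword)
    while i != -1:
        left = text[max(0, i - width):i]
        right = text[i + len(keyword):i + len(keyword) + width]
        print(f"{left:>{width}} {keyword} {right:<{width}}")
        i = text.find(keyword, i + 1)

# e.g. kwic(open("tom_sawyer.txt").read(), "showed") produces a display
# like the one on the next slide (file name assumed)
```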
27 KWIC display
- 1 could find a target. The librarian showed off - running hither and thither w
- 2 elights in. The young lady teachers showed off - bending sweetly over pupils
- 3 ingly. The young gentlemen teachers showed off with small scoldings and other
- 4 seeming vexation). The little girls showed off in various ways, and the littl
- 5 n various ways, and the little boys showed off with such diligence that the a
- 6 t genuwyne? Tom lifted his lip and showed the vacancy. Well, all right, sai
- 7 is little finger for a pen. Then he showed Huckleberry how to make an H and an
- 8 ow's face was haggard, and his eyes showed the fear that was upon him. When he
- 9 not overlook the fact that Tom even showed a marked aversion to these inquests
- 10 own. Two or three glimmering lights showed where it lay, peacefully sleeping,