Title: Statistical NLP
1 Statistical NLP
Introduction to Statistical NLP
2 Textbook
- Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
3 Rationalist versus Empiricist Approaches to Language I
- Question: What prior knowledge should be built into our models of NLP?
- Rationalist answer: A significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky's "poverty of the stimulus" argument).
- Empiricist answer: The brain is able to perform association, pattern recognition, and generalization, and thus the structures of natural language can be learned.
4 Rationalist versus Empiricist Approaches to Language II
- Chomskyan/generative linguists seek to describe the language module of the human mind (the I-language), for which data such as texts (the E-language) provide only indirect evidence, which can be supplemented by native speakers' intuitions.
- Empiricist approaches are interested in describing the E-language as it actually occurs.
- Chomskyans make a distinction between linguistic competence and linguistic performance. They believe that linguistic competence can be described in isolation, while Empiricists reject this notion.
5 Today's Approach to NLP
- From 1970 to 1989, people were concerned with the science of the mind and built small (toy) systems that attempted to behave intelligently.
- Recently, there has been more interest in engineering practical solutions using automatic learning (knowledge induction).
- While Chomskyans tend to concentrate on categorical judgements about very rare types of sentences, statistical NLP practitioners concentrate on common types of sentences.
6 NLP: The Main Issues
- Why is NLP difficult?
- many words, many phenomena --> many rules
- the French dictionary DELAS: 100,000 words (700,000 inflected forms)
- sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
- irregularity (exceptions, exceptions to the exceptions, ...)
- potato -> potatoes (tomato, hero, ...), but photo -> photos, and even both mango -> mangos or mango -> mangoes
- Adjective/Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, ... but Governor General
7 Difficulties in NLP (cont.)
- ambiguity
- books: NOUN or VERB?
- you need many books vs. she books her flights online
- No left turn weekdays 4-6 pm / except transit vehicles
- when may transit vehicles turn? Always? Never?
- Thank you for not smoking, drinking, eating or playing radios without earphones.
- Thank you for not eating without earphones??
- or even Thank you for not drinking without earphones!?
- My neighbor's hat was taken by the wind. He tried to catch it.
- ...catch the wind or ...catch the hat?
8 (Categorical) Rules or Statistics?
- Preferences:
- clear cases: context clues: she books --> books is a verb
- rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun --> the word is a verb
- less clear cases: pronoun reference
- she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
- selectional:
- catching hat >> catching wind (but why not?)
- semantic:
- never thank for drinking in a bus! (but what about the earphones?)
9 Solutions
- Don't guess if you know:
- morphology (inflections)
- lexicons (lists of words)
- unambiguous names
- perhaps some (really) fixed phrases
- syntactic rules?
- Use statistics (based on real-world data) for preferences (only?)
- No doubt, but this is the big question!
10 Statistical NLP
- Imagine:
- Each sentence W = w1 w2 ... wn gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
- For every possible context X, sort all the imaginable sentences W according to P(W|X)
- Ideal situation:
- best sentence (most probable in context X): W_best = argmax_W P(W|X)
- NB: same for the best interpretation
- P(W) = 0 for "ungrammatical" sentences
11 Real World Situation
- Unable to specify the set of grammatical sentences today using fixed categorical rules (maybe never)
- Use a statistical model based on REAL WORLD DATA and care about the best sentence only (disregarding the grammaticality issue)
- best sentence: W_best = argmax_W P(W) (a toy sketch follows below)
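To make the argmax concrete, here is a minimal sketch; the unigram model, its probabilities, and the candidate sentences are all invented for illustration and are not from the slides:

```python
import math

# Toy unigram "language model": P(W) is the product of per-word
# probabilities. All probabilities below are made up for illustration.
UNIGRAM = {"she": 0.04, "books": 0.002, "her": 0.03,
           "flights": 0.001, "online": 0.0005}

def log_p(sentence):
    """Log P(W) under the toy unigram model (unknown words get a tiny floor)."""
    return sum(math.log(UNIGRAM.get(w, 1e-9)) for w in sentence.split())

candidates = ["she books her flights online",
              "she book her flight online"]
# W_best = argmax_W P(W): pick the most probable candidate sentence
print(max(candidates, key=log_p))
```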
12 Why is NLP Difficult?
- NLP is difficult because natural language is highly ambiguous.
- Example: Our company is training workers has 3 parses (i.e., syntactic analyses).
- List the sales of the products produced in 1973 with the products produced in 1972 has 455 parses.
- Therefore, a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope.
13 Methods that don't work well
- Maximizing coverage while minimizing ambiguity is inconsistent with symbolic NLP.
- Furthermore, hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are brittle in the face of the extensive use of metaphor in language.
- Example: if we code the selectional restriction
- animate being --> swallow --> physical object
- then metaphorical uses break the rule:
- I swallowed his story, hook, line, and sinker
- The supernova swallowed the planet.
14 What Statistical NLP can do for us
- Disambiguation strategies that rely on hand-coding produce a knowledge acquisition bottleneck and perform poorly on naturally occurring text.
- A Statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from corpora. In particular, Statistical NLP recognizes that there is a lot of information in the relationships between words.
- The use of statistics offers a good solution to the ambiguity problem: statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data.
15 Things that can be done with Text Corpora I: Word Counts
- Word counts to find out:
- What are the most common words in the text?
- How many words are in the text (word tokens and word types)?
- What is the average frequency of each word in the text?
- Limitation of word counts: Most words appear very infrequently, and it is hard to predict much about the behavior of words that do not occur often in a corpus ⇒ Zipf's Law. (A counting sketch follows below.)
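A minimal counting sketch in Python; the crude tokenizer and the file name tom_sawyer.txt are assumptions for the example:

```python
from collections import Counter
import re

def word_counts(text):
    """Crude tokenization on letters/apostrophes, then count occurrences."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # running words (word tokens)
    n_types = len(counts)    # distinct words (word types)
    return counts, n_tokens, n_types

with open("tom_sawyer.txt", encoding="utf-8") as f:  # assumed file name
    counts, n_tokens, n_types = word_counts(f.read())

print(counts.most_common(15))                  # most common words
print(n_tokens, n_types, n_tokens / n_types)   # average frequency per type
```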
16 Common words in Tom Sawyer
- Word Freq. Use
- the 3332 determiner (article)
- and 2972 conjunction
- a 1775 determiner
- to 1725 preposition, verbal infinitive marker
- of 1440 preposition
- was 1161 auxiliary verb
- it 1027 (personal/expletive) pronoun
- in 906 preposition
- that 877 complementizer, demonstrative
- he 877 (personal) pronoun
- I 783 (personal) pronoun
- his 772 (possessive) pronoun
- you 686 (personal) pronoun
- Tom 679 proper noun
- with 642 preposition
17 Things that can be done with Text Corpora II: Zipf's Law
- If we count up how often each word type of a language occurs in a large corpus and then list the words in order of their frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r.
- Zipf's Law says that f ∝ 1/r.
- Significance of Zipf's Law: For most words, our data about their use will be exceedingly sparse. Only for a few words will we have a lot of examples.
18 Zipf's and Mandelbrot's Laws
- Zipf's law:
- f ∝ 1/r (1)
- i.e., there is a constant k such that
- f · r = k (2)
- Mandelbrot's law:
- f = P (r + ρ)^(-B) (3)
- log f = log P - B log(r + ρ) (4)
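A sketch that checks equation (2) empirically and evaluates Mandelbrot's formula (3); the `tokens` list is assumed to come from a tokenizer such as the one above, and P, rho, B are the free parameters of (3), to be fitted to data:

```python
from collections import Counter

def zipf_check(tokens, ranks=(1, 10, 100, 1000)):
    """Print f, r, and f*r for a few ranks; Zipf predicts f*r ~ constant k."""
    by_rank = Counter(tokens).most_common()   # words sorted by frequency
    for r in ranks:
        if r <= len(by_rank):
            word, f = by_rank[r - 1]
            print(f"{word:>12}  f={f:6d}  r={r:5d}  f*r={f * r}")

def mandelbrot(r, P, rho, B):
    """Mandelbrot's refinement: f = P * (r + rho) ** (-B)."""
    return P * (r + rho) ** (-B)
```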
19
- Word Freq. (f) Rank (r) f · r
- turned 51 200 10200
- you'll 30 300 9000
- name 21 400 8400
- comes 16 500 8000
- group 13 600 7800
- lead 11 700 7700
- Friends 10 800 8000
- begin 9 900 8100
- family 8 1000 8000
- brushed 4 2000 8000
- sins 2 3000 6000
- Could 2 4000 8000
- Applausive 1 8000 8000
20 Frequencies of frequencies in Tom Sawyer
- Word Frequency   Frequency of Frequency
- 1 3993
- 2 1292
- 3 664
- 4 410
- 5 243
- 6 199
- 7 172
- 8 131
- 9 82
- 10 91
- 11-50 540
- 51-100 99
- > 100 102
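The table above can be computed with two nested Counters; a minimal sketch, again assuming a `tokens` list from a tokenizer like the one earlier:

```python
from collections import Counter

def freq_of_freqs(tokens):
    """Map each frequency to the number of word types that have it."""
    word_freq = Counter(tokens)          # word -> frequency
    return Counter(word_freq.values())   # frequency -> number of word types

# e.g. fof = freq_of_freqs(tokens); fof[1] counts words seen exactly once
```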
21 Zipf's law in Tom Sawyer
- Word Freq. (f) Rank (r) f · r
- the 3332 1 3332
- and 2972 2 5944
- a 1775 3 5325
- he 877 10 8770
- but 410 20 8200
- be 294 30 8820
- there 222 40 8880
- one 172 50 8600
- about 158 60 9480
- more 138 70 9660
- never 124 80 9920
- Oh 116 90 10440
- two 104 100 10400
22 Things that can be done with Text Corpora III: Collocations
- A collocation is any turn of phrase or accepted usage where somehow the whole is perceived as having an existence beyond the sum of its parts (e.g., disk drive, make up, bacon and eggs).
- Collocations are important for machine translation.
- Collocations can be extracted from a text (for example, the most common bigrams can be extracted). However, since these bigrams are often insignificant (e.g., at the, of a), they can be filtered (a sketch follows below).
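A sketch of raw bigram extraction (POS filtering is sketched after the filtered table below); `tokens` is again an assumed word list:

```python
from collections import Counter

def common_bigrams(tokens, n=10):
    """Count adjacent word pairs and return the n most frequent ones."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)
```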
23 Commonest bigrams in the NYT
- Frequency Word 1 Word 2
- 80871 of the
- 58841 in the
- 26430 to the
- 21842 on the
- 21839 for the
- 18568 and the
- 16121 that the
- 15630 at the
- 15494 to be
- 13899 in a
- 13689 of a
24 Commonest bigrams in the NYT (cont.)
- Frequency Word 1 Word 2
- 13361 by the
- 13183 with the
- 12622 from the
- 11428 New York
- 10007 he said
- 9775 as a
- 9231 is a
- 8753 has been
- 8573 for a
25 Filtered common bigrams in the NYT
- Frequency Word 1 Word 2 POS pattern
- 11487 New York A N
- 7261 United States A N
- 5412 Los Angeles N N
- 3301 last year A N
- 3191 Saudi Arabia N N
- 2699 last week A N
- 2514 vice president A N
- 2378 Persian Gulf A N
- 2161 San Francisco N N
- 2106 President Bush N N
- 2001 Middle East A N
- 1942 Saddam Hussein N N
- 1867 Soviet Union A N
- 1850 White House A N
- 1633 United Nations A N
- 1337 York City N N
- 1328 oil prices N N
- 1210 next year A N
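One way to produce a filtered list like the one above is to tag the text and keep only adjective/noun patterns. A sketch using NLTK's tagger; the exact set of Penn Treebank tags treated as A and N here is an assumption:

```python
from collections import Counter
import nltk  # requires the 'averaged_perceptron_tagger' data package

# POS patterns to keep (A N and N N), approximated with Penn Treebank tags
KEEP = {("JJ", "NN"), ("JJ", "NNS"), ("JJ", "NNP"),
        ("NN", "NN"), ("NN", "NNS"), ("NNP", "NNP")}

def filtered_bigrams(tokens, n=10):
    """Most frequent bigrams whose POS pattern is adjective/noun or noun/noun."""
    tagged = nltk.pos_tag(tokens)   # [(word, tag), ...]
    kept = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in KEEP
    )
    return kept.most_common(n)
```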
26 Things that can be done with Text Corpora IV: Concordances
- Finding concordances corresponds to finding the different contexts in which a given word occurs.
- One can use a Key Word In Context (KWIC) concordancing program (a minimal sketch follows below).
- Concordances are useful both for building dictionaries for learners of foreign languages and for guiding statistical parsers.
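A minimal KWIC sketch; it does naive substring matching over the raw text, whereas a real concordancer would match word boundaries:

```python
def kwic(text, keyword, width=35):
    """Print every occurrence of `keyword` with `width` chars of context."""
    i = text.find(keyword)
    while i != -1:
        left = text[max(0, i - width):i]
        right = text[i + len(keyword):i + len(keyword) + width]
        print(f"{left:>{width}} {keyword} {right:<{width}}")
        i = text.find(keyword, i + 1)

# e.g. kwic(open("tom_sawyer.txt").read(), "showed") produces a display
# like the one on the next slide (file name assumed)
```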
27 KWIC display
- 1 could find a target. The librarian showed off - running hither and thither w
- 2 elights in. The young lady teachers showed off - bending sweetly over pupils
- 3 ingly. The young gentlemen teachers showed off with small scoldings and other
- 4 seeming vexation). The little girls showed off in various ways, and the littl
- 5 n various ways, and the little boys showed off with such diligence that the a
- 6 t genuwyne? Tom lifted his lip and showed the vacancy. Well, all right, sai
- 7 is little finger for a pen. Then he showed Huckleberry how to make an H and an
- 8 ow's face was haggard, and his eyes showed the fear that was upon him. When he
- 9 not overlook the fact that Tom even showed a marked aversion to these inquests
- 10 own. Two or three glimmering lights showed where it lay, peacefully sleeping,