Title: The Dream
2. The Dream
- It'd be great if machines could
  - Process our email (usefully)
  - Translate languages accurately
  - Help us manage, summarize, and aggregate information
  - Use speech as a UI (when needed)
  - Talk to us / listen to us
- But they can't
  - Language is complex, ambiguous, flexible, and subtle
  - Good solutions need linguistics and machine learning knowledge
- So ...
3. What is NLP?
- Fundamental goal: deep understanding of broad language
  - Not just string processing or keyword matching!
- End systems that we want to build
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering, trend finding
  - Modest: spelling correction, text categorization
4. Speech Systems
- Automatic Speech Recognition (ASR)
  - Audio in, text out
  - SOTA: 0.3% error for digit strings, 5% for dictation, 50% for TV
- Text to Speech (TTS)
  - Text in, audio out
  - SOTA: totally intelligible (if sometimes unnatural)
- Speech systems currently
  - Model the speech signal
  - Model language
5. Machine Translation
- Translation systems encode
  - Something about fluent language
  - Something about how two languages correspond (middle of term)
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators
6. Information Extraction
- Information Extraction (IE)
  - Unstructured text to database entries
- SOTA: perhaps 70% accuracy for multi-sentence templates, 90% for single easy fields
7. Question Answering
- Question Answering
  - More than search
  - Ask general comprehension questions of a document collection
  - Can be really easy: "What's the capital of Wyoming?"
  - Can be harder: "How many US state capitals are also their largest cities?"
  - Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match
8. What is Nearby NLP?
- Computational Linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive Science
  - Figuring out how the human brain works
  - Includes the bits that do language
  - Humans: the only working NLP prototype!
- Speech?
  - Mapping audio signals to text
  - Traditionally separate from NLP; converging?
  - Two components: acoustic models and language models
  - Language models are in the domain of statistical NLP
9. What is this Class?
- Three aspects to the course
  - Linguistic issues
    - What is the range of language phenomena?
    - What are the knowledge sources that let us disambiguate?
    - What representations are appropriate?
  - Technical methods
    - Learning and parameter estimation
    - Increasingly complex model structures
    - Efficient algorithms: dynamic programming, search
  - Engineering methods
    - Issues of scale
    - Sometimes, very ugly hacks
- We'll focus on what makes the problems hard, and what works in practice
10. Class Requirements and Goals
- Class requirements
  - Uses a variety of skills / knowledge
    - Basic probability and statistics
    - Basic linguistics background
    - Decent coding skills (Java)
  - Most people are probably missing one of the above
  - We'll address some review concepts in sections, TBD
- Class goals
  - Learn the issues and techniques of statistical NLP
  - Build the real tools used in NLP (language models, taggers, parsers, translation systems)
  - Be able to read current research papers in the field
  - See where the gaping holes in the field are!
11. Rationalist versus Empiricist Approaches to Language (I)
- Question: what prior knowledge should be built into our models of NLP?
- Rationalist answer: a significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky: poverty of the stimulus).
- Empiricist answer: the brain is able to perform association, pattern recognition, and generalization, and thus the structures of natural language can be learned.
12. Rationalist versus Empiricist Approaches to Language (II)
- Chomskyan/generative linguists seek to describe the language module of the human mind (the I-language), for which data such as text (the E-language) provide only indirect evidence that can be supplemented by native speakers' intuitions.
- Empiricist approaches are interested in describing the E-language as it actually occurs.
- Chomskyans make a distinction between linguistic competence and linguistic performance. They believe that linguistic competence can be described in isolation, while empiricists reject this notion.
13. Empiricist Approaches
- Seek methods that can work on raw text as it exists
- Knowledge induction (automatic learning) rather than hand-coded disambiguation
- American structuralism
- The work of Shannon
- Assign probabilities to linguistic events, rather than concentrating on categorical judgments about rare types of sentences
14. Examples
- In addition to this, she insisted that women were regarded as a different existence from men unfairly.
- take a while, sort of / kind of
- I kind of love you.
15. Some Early NLP History
- 1950s
  - Foundational work: automata, information theory, etc.
  - First speech systems
  - Machine translation (MT) hugely funded by the military (imagine that)
  - Toy models: MT using basically word substitution
  - Optimism!
- 1960s and 1970s: NLP Winter
  - Bar-Hillel (FAHQT) and ALPAC reports kill MT
  - Work shifts to deeper models, syntax
  - ... but toy domains / grammars (SHRDLU, LUNAR)
- 1980s: The Empirical Revolution
  - Expectations get reset
  - Corpus-based methods become central
  - Deep analysis often traded for robust and simple approximations
  - Evaluate everything
16. Today's Approach to NLP
- From 1970 to 1989, people were concerned with the science of the mind and built small (toy) systems that attempted to behave intelligently.
- Recently, there has been more interest in engineering practical solutions using automatic learning (knowledge induction).
- While Chomskyans tend to concentrate on categorical judgments about very rare types of sentences, statistical NLP practitioners concentrate on common types of sentences.
17. Why is NLP Difficult?
- NLP is difficult because natural language is highly ambiguous.
- Example: "The company is training workers" has two or more parse trees (i.e., syntactic analyses); see the sketch after this list.
- "List the sales of the products produced in 1973 with the products produced in 1972" has 455 parses.
- Therefore, a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope.
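To make the ambiguity concrete, here is a minimal sketch, assuming NLTK is installed, that enumerates both analyses of the sentence above under an invented toy grammar (the grammar is illustrative, not from the lecture): one parse treats "is" as an auxiliary of the verb "training", the other treats "training workers" as a noun-noun predicate.

    # Minimal sketch (assumes the nltk package): a toy CFG under which
    # "the company is training workers" has exactly two parses.
    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | N N | N
    VP -> V NP | Aux V NP
    Det -> 'the'
    N -> 'company' | 'training' | 'workers'
    Aux -> 'is'
    V -> 'is' | 'training'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the company is training workers".split()):
        tree.pretty_print()  # prints one tree per syntactic analysis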
18. Methods that Don't Work Well
- Maximizing coverage while minimizing ambiguity is inconsistent with symbolic NLP.
- Furthermore, hand-coded syntactic constraints and preference rules are time consuming to build, do not scale up well, and are brittle in the face of the extensive use of metaphor in language.
- Example: if we code
  - animate being -> swallow -> physical object
  - "I swallowed his story, hook, line, and sinker."
  - "The supernova swallowed the planet."
19. Classical NLP: Parsing
- Write symbolic or logical rules
- Use deduction systems to prove parses from words
  - Minimal grammar on the "Fed raises" sentence: 36 parses
  - Simple 10-rule grammar: 592 parses
  - Real-size grammar: many millions of parses
- This scaled very badly and didn't yield broad-coverage tools
20. NLP Annotation
- Much of NLP is annotating text with structure that specifies how it's assembled.
  - Syntax: grammatical structure
  - Semantics: meaning, either lexical or compositional
21. What Made NLP Hard?
- The core problems
  - Ambiguity
  - Sparsity
  - Scale
  - Unmodeled variables
22. Problem: Ambiguities
- Headlines
  - Iraqi Head Seeks Arms
  - Ban on Nude Dancing on Governor's Desk
  - Juvenile Court to Try Shooting Defendant
  - Teacher Strikes Idle Kids
  - Stolen Painting Found by Tree
  - Kids Make Nutritious Snacks
  - Local HS Dropouts Cut in Half
  - Hospitals Are Sued by 7 Foot Doctors
- Why are these funny?
23. Syntactic Ambiguities
- Maybe we're sunk on funny headlines, but normal, boring sentences are unambiguous?
  - Our company is training workers.
  - Fed raises interest rates 0.5% in a measure against inflation
24. Dark Ambiguities
- Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
- Unknown words and new usages
- Solution: we need mechanisms to focus attention on the best analyses; probabilistic techniques do this
25. Semantic Ambiguities
- Even correct tree-structured syntactic analyses don't always nail down the meaning
  - Every morning someone's alarm clock wakes me up
  - John's boss said he was doing better
26. Other Levels of Language
- Tokenization/morphology
  - What are the words, and what is the sub-word structure?
  - Often simple rules work (a period after "Mr." isn't a sentence break)
  - Relatively easy in English; other languages are harder
    - Segmentation
    - Morphology
- Discourse: how do sentences relate to each other?
- Pragmatics: what intent is expressed by the literal meaning, and how should one react to an utterance?
- Phonetics: acoustics and physical production of sounds
- Phonology: how sounds pattern in a language
27. Disambiguation for Applications
- Sometimes life is easy
  - Can do text classification pretty well just knowing the set of words used in the document; same for authorship attribution
  - Word-sense disambiguation is not usually needed for web search because of majority effects or intersection effects ("jaguar habitat" isn't about the car)
- Sometimes only certain ambiguities are relevant
- Other times, all levels can be relevant (e.g., translation)
  - "he hoped to record a world record"
28. Problem: Scale
- People did know that language was ambiguous!
  - ... but they hoped that all interpretations would be good ones (or ruled out pragmatically)
  - ... they didn't realize how bad it would be
29. Corpora
- A corpus is a collection of text
  - Often annotated in some way
  - Sometimes just lots of text
  - Balanced vs. uniform corpora
- Examples
  - Newswire collections: 500M words
  - Brown corpus: 1M words of tagged "balanced" text
  - Penn Treebank: 1M words of parsed WSJ
  - Canadian Hansards: 10M words of aligned French / English sentences
  - The Web: billions of words of who knows what
30. Corpus-Based Methods
- A corpus like a treebank gives us three important tools
- It gives us broad coverage
31. Corpus-Based Methods
- It gives us statistical information
  - This is a very different kind of subject/object asymmetry than what many linguists are interested in.
32. Corpus-Based Methods
- It lets us check our answers!
33. Problem: Sparsity
- However, sparsity is always a problem
  - New unigram (word), bigram (word pair), and rule rates in newswire; the sketch below shows one way to measure such rates
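As a rough illustration, here is a minimal sketch of how one could measure such novelty rates on held-out text; the function name and the train/test token lists are hypothetical, not from the lecture.

    # Minimal sketch: fraction of held-out unigrams and bigrams that were
    # never seen in training text; both arguments are lists of word strings.
    def novel_rates(train_tokens, test_tokens):
        seen_unigrams = set(train_tokens)
        seen_bigrams = set(zip(train_tokens, train_tokens[1:]))
        test_bigrams = list(zip(test_tokens, test_tokens[1:]))
        new_uni = sum(w not in seen_unigrams for w in test_tokens) / len(test_tokens)
        new_bi = sum(b not in seen_bigrams for b in test_bigrams) / len(test_bigrams)
        return new_uni, new_bi  # the bigram rate is typically much higher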
34. The (Effective) NLP Cycle
- Pick a problem (usually some disambiguation)
- Get a lot of data (usually a labeled corpus)
- Build the simplest thing that could possibly work
- Repeat
  - See what the most common errors are
  - Figure out what information a human would use
  - Modify the system to exploit that information
    - Feature engineering
    - Representation design
    - Machine learning methods
- We're going to do this over and over again
35. Language isn't Adversarial
- One nice thing: we know NLP can be done!
- Language isn't adversarial
  - It's produced with the intent of being understood
  - With some understanding of language, you can often tell what knowledge sources are relevant
- But most variables go unmodeled
  - Some knowledge sources aren't easily available (real-world knowledge, complex models of other people's plans)
  - Some kinds of features are beyond our technical ability to model (especially cross-sentence correlations)
36. Epistemological accuracy!!
37. What Statistical NLP Can Do for Us
- Disambiguation strategies that rely on hand-coding produce a knowledge acquisition bottleneck and perform poorly on naturally occurring text.
- A statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from corpora. In particular, statistical NLP recognizes that there is a lot of information in the relationships between words.
- The use of statistics offers a good solution to the ambiguity problem: statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data.
38. Corpora
- Brown Corpus: 1 million words
- British National Corpus: 100 million words
- American National Corpus: 10 million words -> 100 million
- Penn Treebank: parsed WSJ text
- Canadian Hansard: parallel corpus (bilingual)
- Dictionaries
  - Longman Dictionary of Contemporary English
  - WordNet (hierarchy of synsets)
39. Things that Can Be Done with Text Corpora (I): Word Counts
- Word counts, to find out:
  - What the most common words in the text are
  - How many words are in the text (word tokens and word types)
  - What the average frequency of each word in the text is
  - (A minimal counting sketch follows this list.)
- Limitation of word counts: most words appear very infrequently, and it is hard to predict much about the behavior of words that do not occur often in a corpus -> Zipf's Law.
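A minimal sketch of these counts using only the Python standard library; the regex tokenizer is a crude stand-in chosen for illustration, not a serious tokenizer.

    # Minimal sketch: word tokens vs. word types and average frequency.
    import re
    from collections import Counter

    def word_counts(text):
        tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenization
        counts = Counter(tokens)
        n_tokens = len(tokens)   # word tokens (occurrences)
        n_types = len(counts)    # word types (distinct words)
        return counts.most_common(10), n_tokens, n_types, n_tokens / n_types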
40. Things that Can Be Done with Text Corpora (II): Zipf's Law
- If we count up how often each word type of a language occurs in a large corpus and then list the words in order of their frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r.
- Zipf's Law says that f ∝ 1/r
- Significance of Zipf's Law: for most words, our data about their use will be exceedingly sparse. Only for a few words will we have a lot of examples.
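One quick way to eyeball Zipf's Law on a token list (for instance, the tokens from the sketch above) is to check that r * f stays roughly constant down the ranked list; a minimal sketch:

    # Minimal sketch: under f ∝ 1/r, the product r * f should be roughly
    # constant, and log f vs. log r should be close to a line of slope -1.
    from collections import Counter

    def zipf_check(tokens, top=20):
        ranked = Counter(tokens).most_common(top)
        for r, (word, f) in enumerate(ranked, start=1):
            print(f"{r:>4}  {word:<15} f={f:<8} r*f={r * f}")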
41. Common words in Tom Sawyer
42. Frequencies of frequencies in Tom Sawyer
43. Zipf's law in Tom Sawyer
44. Zipf's law in Tom Sawyer
45. Zipf's Law
46. Zipf's law for the Brown corpus
47. Mandelbrot's formula for the Brown corpus
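For reference, the curve on this slide is Mandelbrot's generalization of Zipf's Law as given in the Manning and Schütze textbook; a LaTeX rendering of the formula, where P, B, and ρ are parameters fit to the corpus:

    % Mandelbrot's generalization: frequency f as a function of rank r,
    % with parameters P, B, and \rho fit to the corpus; Zipf's Law is
    % the special case B = 1, \rho = 0.
    f = P\,(r + \rho)^{-B}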
48. Things that Can Be Done with Text Corpora (III): Collocations
- A collocation is any turn of phrase or accepted usage where somehow the whole is perceived as having an existence beyond the sum of its parts (e.g., "disk drive", "make up", "bacon and eggs").
- Collocations are important for machine translation.
- Collocations can be extracted from a text (for example, the most common bigrams can be extracted). However, since these bigrams are often insignificant (e.g., "at the", "of a"), they can be filtered.
49. Examples
- drive -> disk drive; make up
50. Examples
- Common bigrams: "of the", "in the", "to the", "on the"; but also "New York", "he said", "as a"
- Filtering: adjective-noun, noun-noun (a filtering sketch follows below)
  - "last year", "next year"
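A minimal sketch of that part-of-speech filter, assuming NLTK and its tokenizer and tagger models are installed; the tag patterns follow the adjective-noun and noun-noun filter named on this slide.

    # Minimal sketch: count bigrams, then keep only adjective-noun and
    # noun-noun pairs (assumes nltk plus its tokenizer and tagger data).
    import nltk
    from collections import Counter

    KEEP = {("JJ", "NN"), ("NN", "NN")}  # coarse POS patterns to keep

    def filtered_bigrams(text, top=20):
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        pairs = Counter(zip(tagged, tagged[1:]))
        kept = Counter({
            (w1, w2): c
            for ((w1, t1), (w2, t2)), c in pairs.items()
            if (t1[:2], t2[:2]) in KEEP  # JJ*/NN* tags collapse to JJ/NN
        })
        return kept.most_common(top)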
51. Commonest bigrams in the NYT
52. Filtered common bigrams in the NYT
53. Things that Can Be Done with Text Corpora (IV): Concordances
- Finding concordances corresponds to finding the different contexts in which a given word occurs.
- One can use a Key Word In Context (KWIC) concordancing program; a minimal sketch follows below.
- Concordances are useful both for building dictionaries for learners of foreign languages and for guiding statistical parsers.
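A minimal KWIC sketch over a plain token list; the window and column widths are arbitrary choices for illustration.

    # Minimal sketch of a KWIC display: print every hit for the target
    # word with a fixed window of context on either side.
    def kwic(tokens, target, window=6, width=40):
        for i, tok in enumerate(tokens):
            if tok.lower() == target.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left[-width:]:>{width}} | {tok} | {right[:width]}")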
54. KWIC display
55. Syntactic frames for "showed" in Tom Sawyer
56. Why Study NLP Statistically?
- Up until the late 1980s, NLP was mainly investigated using a rule-based approach.
- However, rules appear too strict to characterize people's use of language.
- This is because people tend to stretch and bend rules in order to meet their communicative needs.
- Methods for making the modeling of language more accurate are needed, and statistical methods appear to provide the necessary flexibility.
57. Subdivisions of NLP
- Parts of speech and morphology (words, their syntactic function in sentences, and the various forms they can take)
- Phrase structure and syntax (regularities and constraints of word order and phrase structure)
- Semantics (the study of the meaning of words (lexical semantics) and of how word meanings are combined into the meanings of sentences, etc.)
- Pragmatics (the study of how knowledge about the world and language conventions interact with literal meaning)
58. Topics Covered in this Course
59. Tools and Resources Used
- Probability / statistical theory: statistical distributions, Bayesian decision theory
- Linguistic knowledge: morphology, syntax, semantics, and pragmatics
- Corpora: bodies of marked or unmarked text to which statistical methods and current linguistic knowledge can be applied in order to discover novel linguistic theories or interesting and useful knowledge organization
60. Textbook and Other Useful Information
- Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schütze, MIT Press, 1999.
- Course website: borame.cs.pusan.ac.kr