Transcript and Presenter's Notes

Title: The Dream



2
The Dream
  • It'd be great if machines could
  • Process our email (usefully)
  • Translate languages accurately
  • Help us manage, summarize, and aggregate
    information
  • Use speech as a UI (when needed)
  • Talk to us / listen to us
  • But they can't
  • Language is complex, ambiguous, flexible, and
    subtle
  • Good solutions need linguistics and machine
    learning knowledge
  • So...

3
What is NLP?
  • Fundamental goal: deep understanding of broad
    language
  • Not just string processing or keyword matching!
  • End systems that we want to build:
  • Ambitious: speech recognition, machine
    translation, information extraction, dialog
    interfaces, question answering, trend finding
  • Modest: spelling correction, text categorization

4
Speech Systems
  • Automatic Speech Recognition (ASR)
  • Audio in, text out
  • SOTA: 0.3% for digit strings, 5% for dictation,
    50% for TV
  • Text to Speech (TTS)
  • Text in, audio out
  • SOTA: totally intelligible (if sometimes
    unnatural)
  • Speech systems currently:
  • Model the speech signal
  • Model language

5
Machine Translation
  • Translation systems encode
  • Something about fluent language
  • Something about how two languages correspond
    (middle of term)
  • SOTA: for easy language pairs, better than
    nothing, but more an understanding aid than a
    replacement for human translators

6
Information Extraction
  • Information Extraction (IE)
  • Unstructured text to database entries
  • SOTA: perhaps 70% accuracy for multi-sentence
    templates, 90% for single easy fields

7
Question Answering
  • Question Answering
  • More than search
  • Ask general comprehension questions of a document
    collection
  • Can be really easy: "What's the capital of
    Wyoming?"
  • Can be harder: "How many US states' capitals are
    also their largest cities?"
  • Can be open-ended: "What are the main issues in
    the global warming debate?"
  • SOTA: can do factoids, even when the text isn't a
    perfect match

8
What is nearby NLP?
  • Computational Linguistics
  • Using computational methods to learn more about
    how language works
  • We end up doing this and using it
  • Cognitive Science
  • Figuring out how the human brain works
  • Includes the bits that do language
  • Humans: the only working NLP prototype!
  • Speech?
  • Mapping audio signals to text
  • Traditionally separate from NLP, converging?
  • Two components: acoustic models and language
    models
  • Language models are in the domain of statistical
    NLP

9
What is this Class?
  • Three aspects to the course
  • Linguistic Issues
  • What is the range of language phenomena?
  • What are the knowledge sources that let us
    disambiguate?
  • What representations are appropriate?
  • Technical Methods
  • Learning and parameter estimation
  • Increasingly complex model structures
  • Efficient algorithms: dynamic programming, search
  • Engineering Methods
  • Issues of scale
  • Sometimes, very ugly hacks
  • We'll focus on what makes the problems hard, and
    what works in practice

10
Class Requirements and Goals
  • Class requirements
  • Uses a variety of skills / knowledge
  • Basic probability and statistics
  • Basic linguistics background
  • Decent coding skills (Java)
  • Most people are probably missing one of the above
  • We'll address some review concepts in sections,
    TBD
  • Class goals
  • Learn the issues and techniques of statistical
    NLP
  • Build the real tools used in NLP (language
    models, taggers, parsers, translation systems)
  • Be able to read current research papers in the
    field
  • See where the gaping holes in the field are!

11
Rationalist versus Empiricist Approaches to Language
(I)
  • Question: What prior knowledge should be built
    into our models of NLP?
  • Rationalist Answer: A significant part of the
    knowledge in the human mind is not derived by the
    senses but is fixed in advance, presumably by
    genetic inheritance (Chomsky: poverty of the
    stimulus).
  • Empiricist Answer: The brain is able to perform
    association, pattern recognition, and
    generalization and, thus, the structures of
    Natural Language can be learned.

12
Rationalist versus Empiricist Approaches to Language
(II)
  • Chomskyan/generative linguists seek to describe
    the language module of the human mind (the
    I-language), for which data such as text (the
    E-language) provide only indirect evidence, which
    can be supplemented by native speakers'
    intuitions.
  • Empiricist approaches are interested in
    describing the E-language as it actually occurs.
  • Chomskyans make a distinction between linguistic
    competence and linguistic performance. They
    believe that linguistic competence can be
    described in isolation while Empiricists reject
    this notion.

13
Empiricist
  • Seeks methods that can work on raw text as it
    exists
  • Knowledge induction (automatic learning), not
    hand-coded disambiguation
  • American structuralism
  • The work of Shannon
  • Assigns probabilities to linguistic events,
    rather than concentrating on categorical
    judgments about rare types of sentences

14
  • In additions to this, she insisted that women
    were regarded as a different existence from men
    unfairly.
  • take a while, sort of / kind of
  • I kind of love you.

15
Some Early NLP History
  • 1950s
  • Foundational work automata, information theory,
    etc.
  • First speech systems
  • Machine translation (MT): hugely funded by the
    military (imagine that)
  • Toy models: MT using basically word substitution
  • Optimism!
  • 1960s and 1970s: NLP Winter
  • Bar-Hillel (FAHQT) and ALPAC reports kill MT
  • Work shifts to deeper models, syntax
  • but toy domains / grammars (SHRDLU, LUNAR)
  • 1980s: The Empirical Revolution
  • Expectations get reset
  • Corpus-based methods become central
  • Deep analysis often traded for robust and simple
    approximations
  • Evaluate everything

16
Today's Approach to NLP
  • From 1970-1989, people were concerned with the
    science of the mind and built small (toy) systems
    that attempted to behave intelligently.
  • Recently, there has been more interest in
    engineering practical solutions using automatic
    learning (knowledge induction).
  • While Chomskyans tend to concentrate on
    categorical judgements about very rare types of
    sentences, statistical NLP practitioners
    concentrate on common types of sentences.

17
Why is NLP Difficult?
  • NLP is difficult because Natural Language is
    highly ambiguous.
  • Example: "The company is training workers" has 2
    or more parse trees (i.e., syntactic analyses);
    see the sketch below.
  • "List the sales of the products produced in 1973
    with the products produced in 1972" has 455
    parses.
  • Therefore, a practical NLP system must be good at
    making disambiguation decisions of word sense,
    word category, syntactic structure, and semantic
    scope.
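To make the ambiguity counts above concrete, here is a minimal sketch (not from the course materials; the tiny grammar is invented) using NLTK's chart parser. Under this toy grammar the example sentence gets more than one tree: one reads "is training" as the verb, another reads "training workers" as a noun phrase.

```python
# Minimal sketch (not from the slides): a tiny hand-written CFG under which the
# example sentence receives more than one parse. Grammar and sentence are toys.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N | Det N N | N N | N
VP  -> Aux V NP | Aux NP
Det -> 'the'
N   -> 'company' | 'training' | 'workers'
V   -> 'training'
Aux -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the company is training workers".split()):
    print(tree)   # one tree per syntactic analysis
```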

18
Methods that don't work well
  • Maximizing coverage while minimizing ambiguity is
    inconsistent with symbolic NLP.
  • Furthermore, hand-coded syntactic constraints and
    preference rules are time consuming to build, do
    not scale up well and are brittle in the face of
    the extensive use of metaphor in language.
  • Example: if we code
  • animate being --> swallow --> physical object
  • "I swallowed his story, hook, line, and
    sinker."
  • "The supernova swallowed the planet."

19
Classical NLP Parsing
  • Write symbolic or logical rules
  • Use deduction systems to prove parses from words
  • Minimal grammar on the "Fed raises" sentence: 36
    parses
  • Simple 10-rule grammar: 592 parses
  • Real-size grammar: many millions of parses
  • This scaled very badly, didn't yield
    broad-coverage tools

20
NLP Annotation
  • Much of NLP is annotating text with structure
    that specifies how it's assembled.
  • Syntax: grammatical structure
  • Semantics: meaning, either lexical or
    compositional

21
What Made NLP Hard?
  • The core problems:
  • Ambiguity
  • Sparsity
  • Scale
  • Unmodeled Variables

22
Problem: Ambiguities
  • Headlines
  • Iraqi Head Seeks Arms
  • Ban on Nude Dancing on Governor's Desk
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • Stolen Painting Found by Tree
  • Kids Make Nutritious Snacks
  • Local HS Dropouts Cut in Half
  • Hospitals Are Sued by 7 Foot Doctors
  • Why are these funny?

23
Syntactic Ambiguities
  • Maybe we're sunk on funny headlines, but normal,
    boring sentences are unambiguous?
  • Our company is training workers.
  • Fed raises interest rates 0.5% in a measure
    against inflation

24
Dark Ambiguities
  • Dark ambiguities: most analyses are shockingly
    bad (meaning, they don't have an interpretation
    you can get your mind around)
  • Unknown words and new usages
  • Solution: We need mechanisms to focus attention
    on the best ones; probabilistic techniques do this

25
Semantic Ambiguities
  • Even correct tree-structured syntactic analyses
    don't always nail down the meaning
  • Every morning someone's alarm clock wakes me up
  • John's boss said he was doing better

26
Other Levels of Language
  • Tokenization/morphology
  • What are the words, what is the sub-word
    structure?
  • Often simple rules work (a period after "Mr."
    isn't a sentence break; see the sketch below)
  • Relatively easy in English, other languages are
    harder
  • Segmentation
  • Morphology
  • Discourse: how do sentences relate to each other?
  • Pragmatics: what intent is expressed by the
    literal meaning, how to react to an utterance?
  • Phonetics: acoustics and physical production of
    sounds
  • Phonology: how sounds pattern in a language
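As a small illustration of the "simple rules often work" point above, here is a minimal sketch (mine, not from the slides) of sentence splitting: a naive split on periods breaks after "Mr.", while a toy abbreviation list handles the common case.

```python
# Minimal sketch (mine): naive sentence splitting vs. a tiny abbreviation-aware
# rule, illustrating that a period after "Mr." is not a sentence break.
import re

text = "Mr. Smith met Dr. Jones. They discussed NLP. It went well."
ABBREVIATIONS = {"Mr.", "Dr.", "Mrs.", "Ms.", "Prof."}   # toy list, not exhaustive

naive = re.split(r"(?<=\.)\s+", text)
print(naive)        # wrongly splits after "Mr." and "Dr."

sentences, current = [], []
for token in text.split():
    current.append(token)
    if token.endswith(".") and token not in ABBREVIATIONS:
        sentences.append(" ".join(current))
        current = []
if current:
    sentences.append(" ".join(current))
print(sentences)    # three sentences, with "Mr." and "Dr." kept intact
```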

27
Disambiguation for Applications
  • Sometimes life is easy
  • Can do text classification pretty well just
    knowing the set of words used in the document;
    same for authorship attribution (see the sketch
    below)
  • Word-sense disambiguation not usually needed for
    web search because of majority effects or
    intersection effects ("jaguar habitat" isn't the
    car)
  • Sometimes only certain ambiguities are relevant
  • Other times, all levels can be relevant (e.g.,
    translation)

he hoped to record a world record
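To make the "just knowing the set of words" point concrete, here is a minimal bag-of-words text classification sketch (assumption: scikit-learn is available; the documents, labels, and topic names are invented toy data, not from the course).

```python
# Minimal sketch: bag-of-words text classification with a Naive Bayes model.
# All documents and labels below are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "the team won the match in overtime",
    "stocks fell as the central bank raised rates",
    "the striker scored two goals",
    "quarterly earnings beat market expectations",
]
train_labels = ["sports", "finance", "sports", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["the bank raised rates again",
                   "the striker scored late goals"]))
# expected on this toy data: ['finance' 'sports']
```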
28
Problem: Scale
  • People did know that language was ambiguous!
  • but they hoped that all interpretations would be
    good ones (or ruled out pragmatically)
  • they didn't realize how bad it would be

29
Corpora
  • A corpus is a collection of text
  • Often annotated in some way
  • Sometimes just lots of text
  • Balanced vs. uniform corpora
  • Examples
  • Newswire collections: 500M words
  • Brown corpus: 1M words of tagged balanced text
  • Penn Treebank: 1M words of parsed WSJ
  • Canadian Hansards: 10M words of aligned French /
    English sentences
  • The Web: billions of words of who knows what
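For a first hands-on look at corpora like the ones listed above, here is a minimal sketch (mine; it assumes nltk is installed and downloads the freely available Brown and Treebank samples).

```python
# Minimal sketch: peeking at the Brown corpus and the Penn Treebank sample that
# ship with NLTK's data packages (assumes nltk is installed).
import nltk
nltk.download("brown", quiet=True)
nltk.download("treebank", quiet=True)
from nltk.corpus import brown, treebank

print(len(brown.words()))           # roughly 1M word tokens of balanced text
print(brown.tagged_words()[:5])     # (word, part-of-speech tag) pairs
print(treebank.parsed_sents()[0])   # one parsed WSJ sentence from the sample
```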

30
Corpus-Based Methods
  • A corpus like a treebank gives us three important
    tools
  • It gives us broad coverage

31
Corpus-Based Methods
  • It gives us statistical information

This is a very different kind of
subject/object asymmetry than what many linguists
are interested in.
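As one tiny example of the statistical information an annotated corpus provides, here is a minimal sketch (mine; it assumes the NLTK Treebank sample is installed) that counts how often a chosen word form appears under each part-of-speech tag.

```python
# Minimal sketch: tag statistics for one word form in the NLTK Treebank sample.
# Counts depend on the small sample and may be zero for rare words.
from collections import Counter
import nltk
nltk.download("treebank", quiet=True)
from nltk.corpus import treebank

word = "raises"    # ambiguous between a verb (VBZ) and a plural noun (NNS)
tags = Counter(tag for w, tag in treebank.tagged_words() if w.lower() == word)
print(tags)
```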
32
Corpus-Based Methods
  • It lets us check our answers!

33
Problem: Sparsity
  • However, sparsity is always a problem
  • New unigram (word), bigram (word pair), and rule
    rates in newswire (see the sketch below)
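The sketch below (mine; 'news.txt' stands in for any large plain-text newswire file) measures the sparsity point directly: the fraction of previously unseen unigrams and bigrams in each successive chunk of text.

```python
# Minimal sketch: rate of new unigrams and bigrams as a corpus is read.
# 'news.txt' is a hypothetical local plain-text file.
def new_item_rate(items, chunk=10000):
    seen, new = set(), 0
    for i, item in enumerate(items, 1):
        if item not in seen:
            seen.add(item)
            new += 1
        if i % chunk == 0:
            print(f"items {i - chunk + 1}-{i}: {new / chunk:.1%} new")
            new = 0

tokens = open("news.txt", encoding="utf-8").read().lower().split()
new_item_rate(tokens)                           # unigram (word) sparsity
new_item_rate(list(zip(tokens, tokens[1:])))    # bigram (word pair) sparsity
```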

34
The (Effective) NLP Cycle
  • Pick a problem (usually some disambiguation)
  • Get a lot of data (usually a labeled corpus)
  • Build the simplest thing that could possibly work
  • Repeat
  • See what the most common errors are
  • Figure out what information a human would use
  • Modify the system to exploit that information
  • Feature engineering
  • Representation design
  • Machine learning methods
  • We're going to do this over and over again

35
Language isn't Adversarial
  • One nice thing: we know NLP can be done!
  • Language isn't adversarial
  • It's produced with the intent of being understood
  • With some understanding of language, you can
    often tell what knowledge sources are relevant
  • But most variables go unmodeled
  • Some knowledge sources aren't easily available
    (real-world knowledge, complex models of other
    people's plans)
  • Some kinds of features are beyond our technical
    ability to model (especially cross-sentence
    correlations)

36
  • Epistemological accuracy!!

37
What Statistical NLP can do for us
  • Disambiguation strategies that rely on
    hand-coding produce a knowledge acquisition
    bottleneck and perform poorly on naturally
    occurring text.
  • A Statistical NLP approach seeks to solve these
    problems by automatically learning lexical and
    structural preferences from corpora. In
    particular, Statistical NLP recognizes that there
    is a lot of information in the relationships
    between words.
  • The use of statistics offers a good solution to
    the ambiguity problem: statistical models are
    robust, generalize well, and behave gracefully in
    the presence of errors and new data.

38
Corpora
  • Brown Corpus: 1 million words
  • British National Corpus: 100 million words
  • American National Corpus: 10 million words -> 100
    million
  • Penn Treebank: parsed WSJ text
  • Canadian Hansard: parallel corpus (bilingual)
  • Dictionaries
  • Longman Dictionary of Contemporary English
  • WordNet (hierarchy of synsets)

39
Things that can be done with Text Corpora
(I) Word Counts
  • Word counts, to find out (see the sketch below):
  • What are the most common words in the text.
  • How many words are in the text (word tokens and
    word types).
  • What the average frequency of each word in the
    text is.
  • Limitation of word counts: Most words appear very
    infrequently, and it is hard to predict much about
    the behavior of words that do not occur often in
    a corpus. => Zipf's Law.
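Here is a minimal word-count sketch (mine; 'tom_sawyer.txt' is a hypothetical local plain-text file) showing tokens, types, average frequency per type, and the most common words.

```python
# Minimal sketch: basic word counts over a plain-text file.
import re
from collections import Counter

text = open("tom_sawyer.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z']+", text)
counts = Counter(tokens)

print("word tokens:", len(tokens))
print("word types:", len(counts))
print("average frequency per type:", len(tokens) / len(counts))
print("most common words:", counts.most_common(10))
```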

40
Things that can be done with Text Corpora
(II) Zipf's Law
  • If we count up how often each word type of a
    language occurs in a large corpus and then list
    the words in order of their frequency of
    occurrence, we can explore the relationship
    between the frequency of a word, f, and its
    position in the list, known as its rank, r.
  • Zipf's Law says that f ∝ 1/r (see the sketch
    below)
  • Significance of Zipf's Law: For most words, our
    data about their use will be exceedingly sparse.
    Only for a few words will we have a lot of
    examples.
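The sketch below (mine; same hypothetical 'tom_sawyer.txt' file as above) checks Zipf's law directly: if f is roughly proportional to 1/r, the product f * r should stay in the same ballpark across ranks.

```python
# Minimal sketch: frequency times rank for a few ranks, as a rough Zipf check.
import re
from collections import Counter

text = open("tom_sawyer.txt", encoding="utf-8").read().lower()
ranked = Counter(re.findall(r"[a-z']+", text)).most_common()

for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(f"rank {rank:>4}  {word:<12}  freq {freq:>6}  f*r = {freq * rank}")
```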

41
Common words in Tom Sawyer
42
Frequencies of frequencies in Tom Sawyer
43
Zipf's law in Tom Sawyer
44
Zipf's law in Tom Sawyer
45
Zipf's Law
46
Zipf's law for the Brown corpus
47
Mandelbrot's formula for the Brown corpus
48
Things that can be done with Text Corpora
(III) Collocations
  • A collocation is any turn of phrase or accepted
    usage where somehow the whole is perceived as
    having an existence beyond the sum of its parts
    (e.g., disk drive, make up, bacon and eggs).
  • Collocations are important for machine
    translation.
  • Collocations can be extracted from a text (for
    example, the most common bigrams can be
    extracted). However, since these bigrams are
    often insignificant (e.g., "at the", "of a"),
    they can be filtered; see the sketch below.
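Here is a minimal bigram-extraction sketch (mine; 'nyt_sample.txt' is a hypothetical local file). The slides filter candidates by part-of-speech pattern (adjective + noun, noun + noun); as a dependency-free stand-in, this sketch simply drops bigrams containing common function words.

```python
# Minimal sketch: count bigrams, then filter out those containing function
# words (a simpler stand-in for the POS-pattern filter described in the slides).
import re
from collections import Counter

STOP = {"the", "a", "an", "of", "in", "to", "on", "at", "and", "for",
        "he", "she", "it", "as", "said", "is", "was", "that", "by"}

tokens = re.findall(r"[a-z']+", open("nyt_sample.txt", encoding="utf-8").read().lower())
bigrams = Counter(zip(tokens, tokens[1:]))

print(bigrams.most_common(10))     # dominated by "of the", "in the", ...
filtered = [(pair, n) for pair, n in bigrams.most_common() if not set(pair) & STOP]
print(filtered[:10])               # closer to real collocations
```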


50
  • Common bigrams: of the, in the, to the, on the;
    also New York, he said, as a
  • Filtering: adjective + noun, noun + noun
  • e.g., last year, next year

51
Commonest bigrams in the NYT
52
Filtered common bigrams in the NYT
53
Things that can be done with Text Corpora
(IV) Concordances
  • Finding concordances corresponds to finding the
    different contexts in which a given word occurs.
  • One can use a Key Word In Context (KWIC)
    concordancing program.
  • Concordances are useful both for building
    dictionaries for learners of foreign languages
    and for guiding statistical parsers.
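Here is a minimal KWIC sketch (mine; 'tom_sawyer.txt' is a hypothetical local file): every occurrence of a keyword is printed with a fixed window of context, aligned on the keyword.

```python
# Minimal sketch: a Key Word In Context (KWIC) display for one keyword.
import re

def kwic(tokens, keyword, window=5):
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  {tok}  {right}")

tokens = re.findall(r"[A-Za-z']+", open("tom_sawyer.txt", encoding="utf-8").read())
kwic(tokens, "showed")
```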

54
KWIC display
55
Syntactic frames for "showed" in Tom Sawyer
56
Why study NLP Statistically?
  • Up until the late 1980s, NLP was mainly
    investigated using a rule-based approach.
  • However, rules appear too strict to characterize
    people's use of language.
  • This is because people tend to stretch and bend
    rules in order to meet their communicative
    needs.
  • Methods for making the modeling of language
    more accurate are needed, and statistical methods
    appear to provide the necessary flexibility.

57
Subdivisions of NLP
  • Parts of Speech and Morphology (words, their
    syntactic function in sentences, and the various
    forms they can take).
  • Phrase Structure and Syntax (regularities and
    constraints of word order and phrase structure).
  • Semantics (the study of the meaning of words
    (lexical semantics) and of how word meanings are
    combined into the meaning of sentences, etc.)
  • Pragmatics (the study of how knowledge about the
    world and language conventions interact with
    literal meaning).

58
Topics Covered in this course
59
Tools and Resources Used
  • Probability/Statistical Theory: Statistical
    Distributions, Bayesian Decision Theory.
  • Linguistics Knowledge: Morphology, Syntax,
    Semantics, and Pragmatics.
  • Corpora: Bodies of marked or unmarked text to
    which statistical methods and current linguistic
    knowledge can be applied in order to discover
    novel linguistic theories or interesting and
    useful knowledge organization.

60
Textbook and other useful information
  • Foundations of Statistical Natural Language
    Processing, by Chris Manning and Hinrich Schütze,
    MIT Press, 1999.
  • Course Website: borame.cs.pusan.ac.kr