Title: Welcome to Introduction to Natural Language Processing CIS 530
1 Welcome to Introduction to Natural Language
Processing, CIS 530
2 CIS 530 General Information, Spring 2008
- Instructor: Mitch Marcus
- Email: mitch_at_cis.upenn.edu
- Office: Levine 503
- Office Hours: To be arranged
- TA: TBA
- Meeting Time and Location
- TR 4:30pm-6:00pm
- Towne 307
- Prerequisites
- An intro to AI OR an intro to natural language
syntax.
- Course Work
- Several problem sets in the earlier part of the
course; a course project during the second half
of the course.
3 Textbooks
- Required
- Jurafsky & Martin, SPEECH and LANGUAGE
PROCESSING: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, Draft of 2nd edition (will be
provided)
- Chris Manning and Hinrich Schütze, Foundations of
Statistical Natural Language Processing, MIT
Press, 1999. (available online from the Penn
campus at http://cognet.mit.edu/library/books/view?isbn=0262133601 )
- Various supplementary readings.
- Other Useful Books
- An introductory Python book (such as Lutz &
Ascher, Learning Python, O'Reilly, 1999) if you
don't already know Python (available online
through Safari)
- David Beazley, Python Essential Reference, 2nd
Edition, New Riders, 2001.
4 NLP Grand Challenges
- The Ultimate Goal: For computers to use NL as
effectively as humans do.
- Reading and writing text
- Abstracting
- Monitoring
- Extraction into Databases
- Interactive Dialogue: Natural, effective access
to computer systems
- Informal Speech Input and Output
- Translation: Input and Output in Multiple
Languages
5 The Past: Human Machine Interfaces
- SHRDLU (Winograd, 1969)
- A fragile demonstration of the fundamental vision
- PUNDIT (Hirschman, Palmer, 1989)
- Information Extraction from Real World texts
6 SHRDLU: A demonstration proof
- Person: PICK UP A BIG RED BLOCK.
- Computer: OK. (does it)
- Person: GRASP THE PYRAMID.
- Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU
MEAN.
- Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE
YOU ARE HOLDING AND PUT IT INTO THE BOX.
- Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK
WHICH IS TALLER THAN THE ONE I AM HOLDING.
- Computer: OK. (does it)
- Person: WHAT DOES THE BOX CONTAIN?
- Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
- Person: WHAT IS THE PYRAMID SUPPORTED BY?
- Computer: THE BOX.
8 The Past: Information Extraction
- PUNDIT (Palmer et al. 1987)
- Sample CASREP automatic summary
- FAILURE OF ONE OF TWO SACS. UNIT HAD LOW OUTPUT
AIR PRESSURE. RESULTED IN SLOW GAS TURBINE
START. TROUBLESHOOTING REVEALED NORMAL SAC LUBE
OIL PRESSURE AND TEMPERATURE. EROSION OF IMPELLOR
BLADE TIP EVIDENT. CAUSE OF EROSION OF IMPELLOR
BLADE UNDETERMINED. NEW SAC RECEIVED.
10 The Past: Crucial flaws in the paradigm
- These systems worked well, BUT
- Usually, only for a small set of examples
- Person-years of work to port to new applications
and, often, to extend coverage on a single
application
- Very limited and inconsistent coverage of English
11 Interactive systems often worked well
- because of a magical fact: People automatically
adapt and limit their language given a small set
of exemplars if the underlying linguistic
generalizations are HABITABLE
- This won't handle non-interactive language
12 The State of NLP
- NLP Past
- Rich Representations
- NLP Present
- Powerful Statistical Disambiguation
13 An Early Robust Statistical NLP Application
- A Statistical Model For Etymology (Church 85)
- Determining etymology is crucial for
text-to-speech
14 An Early Robust Statistical NLP Application
- Etymology can be determined reasonably accurately
from statistics computed from letter sequences:
trigrams!
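The idea of guessing a word's origin from letter-trigram statistics can be sketched roughly as follows. This is only an illustration, not Church's actual model: the function names, the add-one smoothing, and the tiny hand-picked word lists are all assumptions made for the example.

```python
import math
from collections import Counter

def letter_trigrams(word):
    """Return the letter trigrams of a word, padded with boundary markers."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words_by_origin):
    """Count letter trigrams separately for each origin label."""
    return {origin: Counter(t for w in words for t in letter_trigrams(w))
            for origin, words in words_by_origin.items()}

def log_score(word, counts, vocab_size):
    """Sum of add-one smoothed log relative frequencies of the word's trigrams."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size))
               for t in letter_trigrams(word))

def guess_origin(word, models):
    """Pick the origin whose trigram statistics give the word the highest score."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen trigrams
    return max(models, key=lambda o: log_score(word, models[o], vocab_size))

# Toy training words, chosen by hand for illustration only (an assumption).
TOY_DATA = {
    "italian": ["spaghetti", "piccolo", "cappuccino",
                "graffiti", "zucchini", "confetti"],
    "german": ["kindergarten", "schadenfreude", "zeitgeist",
               "doppelganger", "wanderlust", "poltergeist"],
}

models = train(TOY_DATA)
print(guess_origin("libretti", models))
print(guess_origin("weltgeist", models))
```

With realistic amounts of training data the same scoring scheme would be applied per language of origin; the toy lists here are far too small to say anything about real accuracy.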
15 A Central Challenge: Extracting Meaning
Text or speech -> ??Meaning Extractor?? -> Meaning
16 Meaning representations should capture
- Entities
- Of some type: Nation, Know-how
- Events and relations
- Predicates with arguments
- Recursively
- And More
- Quantifiers
The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea.
17 Literal vs. Implicit Meaning
- Cognitive beings automatically
- combine literal meaning
- with world knowledge
- to see implicit meaning
- Q: Whose greed? Q: Whose ambition?
- Understanding this involves inferring implicit
meaning
- Recent NLP has focused on robust extraction of
shallow, literal meaning
The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea, a
Pakistani government official said Monday. The
transfers were made during the late 1980s and in
the early and mid 1990s, and were motivated by
"personal greed and ambition," an official said.
18 Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
19 Word Unigram Representation
- The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea, a
Pakistani government official said Monday. Khan
made the confession in a written statement
submitted "a couple of days ago" to investigators
probing allegations of nuclear proliferation by
Pakistan, the official told The Associated Press
on condition of anonymity. The transfers were
made during the late 1980s and in the early and
mid 1990s, and were motivated by "personal greed
and ambition," the official said. The official
said the transfers were not authorized by the
government.
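A word unigram representation treats the passage as a bag of words and discards word order. A minimal sketch, assuming a simple regular-expression tokenizer (the helper names `tokenize` and `unigram_counts` are hypothetical, not from the course materials):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unigram_counts(text):
    """Map each word to the number of times it occurs; order is discarded."""
    return Counter(tokenize(text))

# Last sentence of the passage above.
sentence = ("The official said the transfers were not authorized "
            "by the government.")
counts = unigram_counts(sentence)
print(counts["the"], counts["official"])
print(sum(counts.values()), "tokens,", len(counts), "distinct words")
```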
20 Word Bigram Representation
- The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea, a
Pakistani government official said Monday. Khan
made the confession in a written statement
submitted "a couple of days ago" to investigators
probing allegations of nuclear proliferation by
Pakistan, the official told The Associated Press
on condition of anonymity. The transfers were
made during the late 1980s and in the early and
mid 1990s, and were motivated by "personal greed
and ambition," the official said. The official
said the transfers were not authorized by the
government.
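A word bigram representation keeps each word together with the word that precedes it, so a little local order survives. A minimal sketch, again assuming a simple regular-expression tokenizer and an unsmoothed relative-frequency estimate (the helper names are hypothetical):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bigram_counts(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

def conditional_prob(bigrams, prev, word):
    """Unsmoothed estimate of P(word | prev) from the bigram counts."""
    context_total = sum(c for (first, _), c in bigrams.items() if first == prev)
    return bigrams[(prev, word)] / context_total if context_total else 0.0

# Last sentence of the passage above.
sentence = ("The official said the transfers were not authorized "
            "by the government.")
bigrams = bigram_counts(tokenize(sentence))
print(bigrams[("the", "official")])
print(conditional_prob(bigrams, "the", "official"))
```

In this one sentence "the" is followed by three different words, so the unsmoothed estimate spreads its probability evenly over them; any unseen pair gets probability zero, which is the problem smoothing and backoff address.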
21 Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
22 Syntax Representation: Treebank
- TreeBank includes
- Part of speech (not shown here)
- Syntactic structure
23 1995: A breakthrough in parsing
- 10^6 words of Treebank Annotation + Machine
Learning = Robust Parsers
[Diagram: training sentences and answers -> Training Program -> Models; input sentence -> Parser (using the Models) -> Trees]
Example input sentence: The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea
- 1990: Best hand-built parsers: 40-60% accuracy
(guess)
- 1995: Statistical parsers: 90% accuracy
24 Rich Linguistic Representations + Powerful
Machine Learning = Robust, Effective NLP
- 1970s, 80s: Focus on Linguistic Representations
- 1990s, early 2000s: Focus on Machine Learning
- Recently: New work combining the two
25 Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
26 Shallow Verb Semantics: Propbank
[Figure: parse tree over the sentence "The founder of Pakistan's nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea", with constituents labeled NP, PP, VP, SBAR and S]
- PropBank adds
- Lexical semantics of verbs
27 A Very First Experiment: Recovering Semantic
Structure Automatically
[Diagram: Treebank training sentences -> Training Program -> Models -> Parser; Propbank training sentences -> Training Program -> Models -> Semantic Analyzer. Pipeline: Sentences -> Parser -> Trees -> Semantic Analyzer -> Semantic Relations]
Semantic Relations Retrieved (Hacioglu et al.,
Univ of Colorado)
28 Explicit Semantic Representations
Events:
A1: act Acknowledge
  agent E1: object Founder, name "Abdul Qadeer Khan"
    description A3: act Establish, agent E1,
      organization E2: object Agency, description "Pakistan's nuclear department"
  proposition A2: act Transfer, agent E1
    theme E4: object Know-How, description "nuclear technology"
    destination (and E5: object Nation, name "Iran"
                     E6: object Nation, name "Libya"
                     E7: object Nation, name "North Korea")
Entities:
E1 Founder: Names: Abdul Qadeer Khan; Descriptions: The founder of Pakistan's nuclear department, he
E2 Agency: Descriptions: Pakistan's nuclear department
E3 Nation: Names: Pakistan
E4 Know-How: Descriptions: nuclear technology
E5 Nation: Names: Iran
E6 Nation: Names: Libya
E7 Nation: Names: North Korea
Relation types (with argument slots):
Establish: Agent, Org
Subsidiary: SubOrg, SuperOrg
Acknowledge: Person, Fact
Transfer: Agent, Item, Dest
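The entity and event records on this slide can be sketched as plain data structures. This is only an illustrative encoding: the class names, field names, and the `destinations` helper are assumptions, and E5 is taken to be Iran, following the slide's entity list.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    ident: str                      # e.g. "E1"
    kind: str                       # e.g. "Founder", "Nation"
    names: list = field(default_factory=list)
    descriptions: list = field(default_factory=list)

@dataclass
class Event:
    ident: str                      # e.g. "A1"
    act: str                        # e.g. "Acknowledge"
    roles: dict                     # role -> entity id, event id, or list of ids

entities = {e.ident: e for e in [
    Entity("E1", "Founder", names=["Abdul Qadeer Khan"]),
    Entity("E2", "Agency", descriptions=["Pakistan's nuclear department"]),
    Entity("E3", "Nation", names=["Pakistan"]),
    Entity("E4", "Know-How", descriptions=["nuclear technology"]),
    Entity("E5", "Nation", names=["Iran"]),
    Entity("E6", "Nation", names=["Libya"]),
    Entity("E7", "Nation", names=["North Korea"]),
]}

events = {ev.ident: ev for ev in [
    Event("A1", "Acknowledge", {"agent": "E1", "proposition": "A2"}),
    Event("A2", "Transfer", {"agent": "E1", "theme": "E4",
                             "destination": ["E5", "E6", "E7"]}),
    Event("A3", "Establish", {"agent": "E1", "organization": "E2"}),
]}

def destinations(event_id):
    """Names of the entities filling the destination role of an event."""
    return [entities[i].names[0] for i in events[event_id].roles["destination"]]

# Follow A1's proposition link to the Transfer event it acknowledges.
acknowledged = events["A1"].roles["proposition"]
print(events[acknowledged].act, "to", destinations(acknowledged))
```

Events point to other events by identifier (A1's proposition is A2), which is one simple way to capture the recursive predicate-argument structure.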
29 Vision: Building decoders for literal meaning
[Diagram, labeled "Current State of Art": training sentences and correct parse trees -> Training Program -> Models; input sentence -> Parser (using the Models)]
Example input sentence: The founder of Pakistan's nuclear program, Abdul
Qadeer Khan, has admitted he transferred nuclear
technology to Iran, Libya and North Korea
30 Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Also, higher representations require lower
31 PASCAL: Recognizing Textual Entailment
- Pattern Analysis, Statistical Modelling and
Computational Learning: a Network of Excellence
sponsored by the EU as part of its IST program.
32 Approximate Syllabus
33 Unit I: Intro & Word-Based Methods
- Introduction to Python
- N-Gram Word-Based Models of Syntax
- Word Distributions
- Smoothing & Backoff
- Entropy and Relative Entropy
- Word Classes and Part of Speech Tagging
- Tag Set Design
- Hidden Markov Models
- Transformation-based Learning
- Speech Recognition
- Why is Speech Recognition Hard?
- HMMs for speech
34 Unit II - Parsing
- Introduction to Syntactic Analysis
- Context Free Models & CF Parsing for NL Syntax
- Statistical Parsing of CFGs
- Probabilistic CFGs
- Generative Statistical Models
- Discriminative Models for Parsing
- Enriched Models for NL Syntax
- The Inadequacy of CF Models
- Feature Structures and Unification
- Tree Adjoining Grammars
35 Syllabus III: Meaning
- Lexical Semantics
- Word Sense Disambiguation: Decision Lists, SVMs
- Logical Form and Semantics
- Introduction to Logical Form
- Mapping from Syntactic Structures to LF
- Quantifier scope and Cooper Storage
- Entailment & Logical Inference
- Discourse & Pragmatics
- Reference and Anaphora
- Text Coherence & Discourse Structure
36 Syllabus IV: Putting the Pieces Together
- Machine Translation
- Synchronous TAGs
- Statistical Translation
- Generation & Summarization
- Text Planning, Content Determination and
Realization
- Statistical Techniques for Summarization