Title: Fall 2005
1EECS 595 / LING 541 / SI 661/761
Natural Language Processing
- Fall 2005
- Lecture Notes 1
2Introduction
3Course logistics
- Instructor: Prof. Dragomir Radev (radev_at_umich.edu)
- Ph.D., Computer Science, Columbia University
- Formerly at IBM TJ Watson Research Center
- Times: Thursdays 2:40-5:25 PM, in 411 West Hall
- Office hours: TBA, 3080 West Hall Connector
- Course home page: http://www.si.umich.edu/radev/NLP-fall2005
4Example (from a famous movie)
Dave Bowman: Open the pod bay doors, HAL.
HAL: I'm sorry Dave. I'm afraid I can't do that.
5Example
I saw her fall
- How many different interpretations does the above
sentence have? How many of them are
reasonable/grammatical?
8Example 1
The Standard and Poor's 500 and the Nasdaq
composite index both reached four-year highs
Thursday as investors, unfazed by oil prices
nearing $70 per barrel, welcomed a raft of strong
earnings reports.
14Example 2
Accenture posts higher earnings
Consulting and technology services firm beats estimates; stock gains in after-hours trading.
July 7, 2005, 4:35 PM EDT
NEW YORK (Reuters) - Accenture Ltd., one
of the world's largest consulting and technology
services firms, posted a higher quarterly profit
Thursday boosted by a rebound in consulting
demand. Fiscal third-quarter net income more
than doubled to about $484 million, or 51 cents a
share, from $210 million, or 37 cents a share, a
year earlier, the company said. Analysts had
expected earnings of 43 cents a share, according
to First Call. Accenture stock rose about 2
percent in after-hours trading after falling
nearly 6 percent in regular New York Stock
Exchange trading.
15- Gary Larson (The Far Side) cartoon
- What we say to dogs:
"Okay Ginger! I've had it! You stay out of the garbage! Understand, Ginger?"
- What they hear:
"Blah Ginger! blah blah blah blah blah blah blah blah blah blah blah Ginger?"
16Time Warner to hold off on Cablevision
But top Time Warner execs said it may eventually be interested in the cable assets.
July 8, 2005, 7:20 PM EDT
SUN VALLEY, Idaho (Reuters) - A top
Time Warner Inc. executive said Friday it could
not bid for Cablevision until it completes a deal
to buy Adelphia Communications Corp., splashing
cold water on early buyout speculation. Time
Warner is in a joint deal with Comcast Corp. to
buy bankrupt cable provider Adelphia
Communications Corp. "We can't do anything else
until we get it (Adelphia) integrated," said Don
Logan, chairman of Time Warner's media and
communications group. But he added, "We've
always said we are interested in Cablevision. ...
Anything is possible." In June, the Dolan family
offered Cablevision shareholders about $33.50 per
share in a $7.9 billion deal to take the company
private. Analysts and one of Cablevision's top
investors have said the offer is too low and
could put the cable system, which serves 3
million customers in the New York area, into play
for other suitors, including Time Warner Cable
and Comcast. Wall Street analysts said in June
that Time Warner, if it were to bid, could top
the offer with a $35 to $40 per share bid. Time
Warner is the parent company of this Web site.
Time Warner chief executive Dick Parsons said on
Friday his company's decision about whether to
buy Cablevision Corp. rests on whether the Dolan
family decides to put it up for sale. "Chuck
(Dolan) controls it and it's not as if we could
take it away from him," Parsons said during a
break at the Allen & Co. conference in Sun
Valley, Idaho. "When he's ready to bring that
asset to market he knows we're here." Parsons
would not comment on whether he has had recent
conversations with Dolan about buying
Cablevision. Parsons said he and Dolan agree
that cable assets are undervalued and that now is
a good time to buy them. Time Warner is the
parent company of CNN/Money.
17Stocks edge up
Major gauges make tentative gains at Friday's open after steep Fed-inspired selloff.
July 1, 2005, 9:46 AM EDT
NEW YORK (CNN/Money) - Stocks inched higher early Friday,
recovering some from the big selloff after the
Federal Reserve boosted interest rates again, and
signaled it didn't intend to pause anytime soon.
The Dow Jones industrial average (down 99.51 to
10,274.97), the broader Standard & Poor's
500 (up 2.50 to 1,193.83) index and the
Nasdaq composite (up 4.84 to 2,061.80)
all added a few points in the early going, with
the Nasdaq lagging the blue chip indicators a
bit. Stocks ended a mixed quarter on a down note
Thursday, with the Dow losing more than 100
points after the Fed raised the target for its
fed funds rate, an overnight bank lending rate,
another quarter point to 3.25 percent. In the
closely watched statement, the central bankers
acknowledged the impact of higher energy prices
and other negatives, but said the economic
expansion remains on track. They also pledged to
keep raising rates at a "measured" pace, all of
which suggested that they don't plan to pause in
the near term. Gains early Friday were broad
based, with 27 out of 30 Dow issues rising. In
corporate news, Microsoft (up $0.02 to $24.86)
has settled antitrust claims made by
IBM (unchanged at $74.20), the
companies said Friday. The software leader will
pay IBM $775 million as part of the deal. A
number of economic reports were due around 10
a.m. ET. The Institute for Supply Management's
manufacturing index for June was expected to have
risen to 51.5 in the month from 51.4 in May,
according to a consensus of economists surveyed
by Briefing.com. The revised read on June
consumer sentiment from the University of
Michigan was also due, as was the May read on
construction spending. Treasury prices slipped
after Thursday's big rally. The fall raised the
yield on the 10-year note to 3.94 percent from
3.92 percent late Thursday. Treasury prices and
yields move in opposite directions. In currency
trading, the dollar jumped versus the euro and
the yen. U.S. light crude oil for August
delivery rose 32 cents to trade at $56.82 a
barrel in electronic trading. Crude set a record
closing price for a nearby futures contract at
$60.54 on Monday. COMEX gold fell $1.20 to
$435.90 an ounce. In global trade, Asian-Pacific
markets ended mostly lower, and European markets
rose at midday.
18Google cracks $300
Shares of the popular search engine pass $300 for the first time and are now up 260% since IPO.
June 27, 2005, 5:52 PM EDT
By Paul R. La Monica, CNN/Money senior writer
NEW YORK (CNN/Money) - Shares of Google, the popular
search-engine company, surpassed the $300 level
for the first time on Monday, sparking memories
of the dot-com stock craze of the late 1990s.
Google gained 2.3 percent to finish at $304.10,
slightly below its high for the day of $304.30.
The stock has now gained nearly 260 percent since
it went public last August at $85 a share. Much
of the optimism surrounding Google comes from the
fact that it is the leader in the white-hot
online advertising industry. The company reported
much better than expected sales and earnings for
the first quarter, thanks to a booming market for
online advertising, particularly ads tied to
specific keyword searches. And during the past
few weeks, Google has released several new
features -- including a desktop search function
for businesses and a test version of a
personalized home page tool -- that should help
the company remain competitive against rivals
Yahoo! and Microsoft. Several analysts have also
speculated that Google will soon launch an online
payment service that could compete against eBay's
PayPal. In addition, many investors have been
betting that the company, which now has a market
value of nearly $85 billion, will soon be added
to the benchmark S&P 500 index. But the stock's
meteoric rise as of late -- shares have surged
more than 50 percent since the company reported
first-quarter results in mid-April -- has some
analysts thinking that the stock could take a hit
in the near future. "You might see the stock
pause temporarily," said Marianne Wolk, an
analyst with Susquehanna Financial Group. "For
the longer term, we're still very bullish but in
the very short term it wouldn't be a surprise to
see the stock stabilize or pull back." The key
for Google will be how strong its second quarter
results are. Google is set to report these
numbers on July 21. Analysts expect Google's
sales, excluding revenues it shares with
affiliates, a figure known as traffic acquisition
costs or TAC, to come in at $840 million, nearly
double last year's levels. Earnings, excluding
certain one-time charges, are forecast at $1.21,
an increase of 121 percent from a year ago. Wolk
thinks that Google should meet these targets but
does not believe the company will report results
that are significantly better than consensus
projections. And if Google does not continue to
beat estimates, the stock could take a bath.
"For Google to keep heading higher, it's
absolutely critical that they keep hitting
numbers. Everyone now believes the story," said
John Tinker, an analyst with ThinkEquity
Partners. Still, many investors are finding it
hard to bet against Google because it has been
posting extremely strong levels of sales growth
and healthy profit margins as a public company.
So the comparisons to the late 1990s, when shares
of many unprofitable Internet companies soared
solely due to hype, may not be apt. To that end,
Google is expected to generate nearly $3.6
billion in sales, excluding TAC, and revenue of $5
billion next year as the company continues to
benefit from a shift of advertising dollars from
more mainstream media sources such as television,
radio, and newspapers, to the Web. In addition
to its ubiquitous search engine, Google has
branched out into related areas in order to
capitalize on the boom in online advertising. The
company has a comparison shopping site, Froogle,
a free e-mail service called Gmail which features
ads embedded in e-mails, and a local search site
that operates as kind of a Web version of the
Yellow Pages. Google also has expanded rapidly
abroad, with sales from outside the U.S.
accounting for nearly 40 percent of total sales
in the first quarter. What's more, some argue
that Google is not overvalued, since it continues
to trade at a discount to its top rival, Yahoo.
However, this gap has narrowed significantly as
of late. Google's price-to-earnings ratio, based
on 2005 earnings estimates, is 58. Yahoo trades
at 61.5 times earnings estimates for this year.
"Google is not an undiscovered stock any more,"
said Tinker. "It's no longer inefficiently
priced." And Google also potentially faces the
issue of the summer sluggishness that typically
affects Internet stocks. Last year, shares of
several Internet companies plunged in July as
results did not live up to lofty expectations.
19Silly sentences
- Children make delicious snacks
- Stolen painting found by tree
- I saw the Grand Canyon flying to New York
- Court to try shooting defendant
- Ban on nude dancing on Governor's desk
- Red tape holds up new bridges
- Iraqi head seeks arms
- Blair wins on budget, more lies ahead
- Local high school dropouts cut in half
- Hospitals are sued by seven foot doctors
- In America a woman has a baby every 15 minutes.
How does she do that?
20Main problems in language
- Novel words and usages
- Blogs, little r me,7342.67
- Spam as verb, email
- Inconsistencies
- Beverly Hills, Beverly Sills
- junior college, college junior
- pet spray, pet llama
- Parsing problems
- Cup holder
- Federal Reserve Board Chairman
- Implicature/reasoning
- World knowledge
- Subjectivity, scoping, negation
21Types of ambiguity
- Morphological: Joe is quite impossible. Joe is quite important.
- Phonetic: Joe's finger got number.
- Part of speech: Joe won the first round.
- Syntactic: Call Joe a taxi.
- PP attachment: Joe ate pizza with a fork. Joe ate pizza with meatballs. Joe ate pizza with Mike. Joe ate pizza with pleasure.
- Sense: Joe took the bar exam.
- Modality: Joe may win the lottery.
- Subjectivity: Joe believes that stocks will rise.
- Scoping: Joe likes ripe apples and pears.
- Negation: Joe likes his pizza with no cheese and tomatoes.
- Referential: Joe yelled at Mike. He had broken the bike. Joe yelled at Mike. He was angry at him.
- Reflexive: John bought him a present. John bought himself a present.
- Ellipsis and parallelism: Joe gave Mike a beer and Jeremy a glass of wine.
- Metonymy: Boston called and left a message for Joe.
22Synonyms/paraphrases
The S&P 500 climbed 6.93, or 0.56 percent, to
1,243.72, its best close since June
12, 2001. The Nasdaq gained 12.22, or 0.56
percent, to 2,198.44 for its best showing since
June 8, 2001. The DJIA rose 68.46, or
0.64 percent, to 10,705.55, its highest level
since March 15.
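The three sentences above describe the same kind of event with different verbs (climbed, gained, rose). As a rough sketch of how such paraphrases might be mapped onto one common structure, the Python snippet below uses a hand-written regular expression. The pattern, the field names, and the `extract_moves` helper are illustrative assumptions rather than course material, and the pattern only covers the exact phrasing on this slide.

```python
import re

# Hypothetical pattern: it only handles the phrasing used on this slide.
MOVE_PATTERN = re.compile(
    r"The (?P<index>[\w&]+(?: \d+)?) "
    r"(?P<verb>climbed|gained|rose) "
    r"(?P<change>[\d.]+), or (?P<percent>[\d.]+) percent, "
    r"to (?P<close>[\d,.]*\d)"
)

TEXT = (
    "The S&P 500 climbed 6.93, or 0.56 percent, to 1,243.72, "
    "its best close since June 12, 2001. "
    "The Nasdaq gained 12.22, or 0.56 percent, to 2,198.44 "
    "for its best showing since June 8, 2001. "
    "The DJIA rose 68.46, or 0.64 percent, to 10,705.55, "
    "its highest level since March 15."
)


def extract_moves(text):
    """Return one normalized record per index movement matched in text."""
    records = []
    for match in MOVE_PATTERN.finditer(text):
        records.append({
            "index": match.group("index"),
            # All three verbs in the pattern express an upward move.
            "direction": "up",
            "change": float(match.group("change")),
            "percent": float(match.group("percent")),
            # Strip thousands separators before converting.
            "close": float(match.group("close").replace(",", "")),
        })
    return records


records = extract_moves(TEXT)
```

A rule like this breaks as soon as the wording changes (for example "added" or "was up"), which is one motivation for the learned paraphrase models mentioned later in these notes.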
23What is Natural Language Processing?
- Natural Language Processing (NLP) is the study of the computational treatment of natural language.
- NLP draws on research in Linguistics, Theoretical Computer Science, Mathematics and Statistics, Artificial Intelligence, Psychology, etc.
24NLP
- Information extraction
- Named entity recognition
- Trend analysis
- Subjectivity analysis
- Text classification
- Anaphora resolution, alias resolution
- Cross-document crossreference
- Parsing
- Semantic analysis
- Word sense disambiguation
- Word clustering
- Question answering
- Summarization
- Document retrieval (filtering, routing)
- Structured text (relational tables)
- Paraphrasing and paraphrasing/entailment ID
- Text generation
- Machine translation
25What is needed (1) linguistic knowledge
- Examples
- Zipf's law: rank(w_i) × freq(w_i) = const
- Collocations
- Strong beer but *powerful beer
- Big sister but *large sister
- Stocks rise but ?stocks ascend (225,000 hits on Google vs. 47 hits)
- Constituents
- Children eat pizza.
- They eat pizza.
- My cousin's neighbor's children eat pizza.
- _ Eat pizza!
- Burstiness
- P(ct2ctgt1)
- How to get it
- Manual rules
- Automatically acquired from large text
collections (corpora)
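The Zipf's law item on this slide says that a word's rank multiplied by its frequency stays roughly constant. One way to check this on a corpus is to tabulate that product for each word. The snippet below is a minimal sketch in Python; the function name `rank_frequency_table` and the whitespace tokenization in the usage note are assumptions made for illustration.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return (word, rank, freq, rank * freq) tuples, most frequent first."""
    counts = Counter(tokens)
    table = []
    # most_common() sorts by descending frequency; rank 1 is the top word.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        table.append((word, rank, freq, rank * freq))
    return table
```

To try it, read a large plain-text file, split it on whitespace, and inspect the last column of the first few hundred rows. A short text will not show the pattern; the law is a statistical tendency that only appears in large collections.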
26Linguistics
- Knowledge about language
- Phonetics and phonology - the study of sounds
- Morphology - the study of word components
- Syntax - the study of sentence and phrase structure
- Lexical semantics - the study of the meanings of words
- Compositional semantics - how to combine words
- Pragmatics - how to accomplish goals
- Discourse conventions - how to deal with units
larger than utterances
27What is needed (2) mathematical and
computational tools
- Language models
- Estimation methods
- Hidden Markov Models (HMM) for sequences
- Context-free grammars (CFG) for trees
- Conditional Random Fields (CRF)
- Generative/discriminative models
- Maximum entropy models
- Random walks
- Latent semantic indexing (LSI)
- Representation issues
- Feature engineering
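The first two items on this slide, language models and estimation methods, can be illustrated with a bigram model. The sketch below estimates P(word | previous word) from counts, with optional add-k smoothing. The function names, the `<s>` and `</s>` boundary markers, and the three-sentence toy corpus are assumptions made for illustration, not the course's own code.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}  # words the model can predict, including end-of-sentence
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, model, k=0.0):
    """P(word | prev); k=0 is the maximum-likelihood estimate, k=1 is add-one."""
    history_counts, bigram_counts, vocab = model
    numerator = bigram_counts[(prev, word)] + k
    denominator = history_counts[prev] + k * len(vocab)
    return numerator / denominator


model = train_bigram_model(["stocks rise", "stocks fall", "prices rise"])
```

With k=0 an unseen bigram such as ("prices", "fall") gets probability zero; with k=1 every in-vocabulary continuation receives some probability mass, at the cost of discounting the observed ones.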
28Theoretical Computer Science
- Automata
- Deterministic and non-deterministic finite-state automata
- Push-down automata
- Grammars
- Regular grammars
- Context-free grammars
- Context-sensitive grammars
- Complexity
- Algorithms
- Dynamic programming
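Dynamic programming, the last item on this slide, is often introduced in NLP through minimum edit distance between two strings. The sketch below is one possible Python version; it assumes unit costs for insertion, deletion, and substitution (Levenshtein distance), and other cost schemes give different values.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if s_char == t_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept in memory, since each cell depends on the current and the previous row alone.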
29Mathematics and Statistics
- Probabilities
- Statistical models
- Hypothesis testing
- Linear algebra
- Optimization
- Numerical methods
30Artificial Intelligence
- Logic
- First-order logic
- Predicate calculus
- Agents
- Speech acts
- Planning
- Constraint satisfaction
- Machine learning
31Existing applications
- Web search
- Natural language interfaces to databases
- Parsing job postings
- Military intelligence
- Summarizing medical records
- Information extraction for databases
- Wrapper induction
32Potential applications
- Trend recognition
- Db conversion: named entity extraction, classification, relation extraction
- Detecting change
- Summarization
- Social network analysis
- Assigning subjectivity scores (stars)
- Sentiment classification
- Alignment of text w/ other signal (time series)
- Record linkage
33Current work at CLAIR
- Semi-supervised entity and relation extraction
- Subjectivity analysis factuality extraction
- Protein interaction recognition
- Text summarization
- Text mining from the Web
- Lexical network models of the Web
- Syntactic alignment
- Chronology recovery
- Classification
34Final remarks
- Language is not adversarial
- It is used to convey useful information
- Hard to extract this information automatically
- Need to use NLP
- Inference: mathematics, statistics, machine learning
- Networks/fields
- Graph theory
- Differential equations
- Statistics/optimization
- Linguistics/KR/AI
- Sequence alignment
- Linear algebra/vector analysis
35Ambiguity
I saw her fall.
- The categories of knowledge of language can be thought of as ambiguity-resolving components
- How many different interpretations does the above sentence have?
- How can each ambiguous piece be resolved?
- Does speech input make the sentence even more ambiguous?
Time flies like an arrow.
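One way to make the question "how many interpretations?" concrete is to count the parse trees a grammar assigns to a sentence. The sketch below counts parses with a CKY-style chart in Python. The tiny grammar and lexicon are hypothetical and cover only two readings of "I saw her fall" (seeing her fall down, and seeing the fall that belongs to her); a fuller grammar would also license the "saw = cut with a saw" readings.

```python
from collections import defaultdict

# Hypothetical toy lexicon and grammar in Chomsky normal form.
LEXICON = {
    "I": {"NP"},
    "saw": {"V"},
    "her": {"NP", "Det"},  # pronoun or possessive determiner
    "fall": {"N", "VP"},   # noun or bare verb phrase
}
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),   # saw [her fall] as a noun phrase
    ("VP", "V", "S"),    # saw [her fall] as a clause
    ("NP", "Det", "N"),
]


def count_parses(words, start="S"):
    """Return the number of parse trees rooted in `start` for the words."""
    n = len(words)
    # chart[(i, j)][category] = number of trees of that category over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, ()):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(start, 0)


readings = count_parses("I saw her fall".split())
```

Under this toy grammar `readings` is 2. Adding lexical entries (for example "saw" as a noun, or "flies" and "like" for the second sentence) changes the count, which shows how much ambiguity depends on the grammar and lexicon.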
36The alphabet soup (NLP vs. CL vs. SP vs. HLT vs. NLE)
- NLP (Natural Language Processing)
- CL (Computational Linguistics)
- SP (Speech Processing)
- HLT (Human Language Technology)
- NLE (Natural Language Engineering)
- Other areas of research: Speech and Text Generation, Speech and Text Understanding, Information Extraction, Information Retrieval, Dialogue Processing, Inference
- Related areas: Spelling Correction, Grammar Correction, Text Summarization
37Some demos
- AT&T Labs Text to Speech (http://www.research.att.com/projects/tts/demo.html)
- Babelfish (http://babelfish.altavista.com)
- OneAcross (http://www.oneacross.com)
- AskJeeves (http://www.ask.com)
- IONaut (http://www.ionaut.com:8400): seems to be down
- NSIR (http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi)
- AnswerBus (http://www.answerbus.com)
- NewsInEssence (http://www.newsinessence.com)
42The Turing Test
- Alan Turing: the Turing test (language as test for intelligence)
- Three participants: a computer and two humans (one is an interrogator)
- Interrogator's goal: to tell the machine and human apart
- Machine's goal: to fool the interrogator into believing that a person is responding
- Other human's goal: to help the interrogator reach his goal
Q: Please write me a sonnet on the topic of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: 105621 (after a pause)
43Some brief history
- Foundational insights (40s and 50s): automaton (Turing), probabilities, information theory (Shannon), formal languages (Backus and Naur), noisy channel and decoding (Shannon), first systems (Davis et al., Bell Labs)
- Two camps (57-70): symbolic and stochastic.
- Transformation grammar (Harris, Chomsky), artificial intelligence (Minsky, McCarthy, Shannon, Rochester), automated theorem proving and problem solving (Newell and Simon)
- Bayesian reasoning (Mosteller and Wallace)
- Corpus work (Kucera and Francis)
44Some brief history
- Four paradigms (70-83): stochastic (IBM), logic-based (Colmerauer, Pereira and Warren, Kay, Bresnan), NLU (Winograd, Schank, Fillmore), discourse modelling (Grosz and Sidner)
- Empiricism and finite-state models redux (83-93): Kaplan and Kay (phonology and morphology), Church (syntax)
- Late years (94-03): strong integration of different techniques, different areas (including speech and IR), probabilistic models, machine learning
45The state of the art and the near-term future
- World-Wide Web (WWW)
- Sample scenarios
- generate weather reports in two languages
- teaching deaf people to speak
- translate Web pages into different languages
- speak to your appliances
- find restaurants
- answer questions
- grade essays (?)
- closed-captioning in many languages
- automatic description of a soccer game
46Structure of the course
- Three major parts
- Linguistic, mathematical, and computational background
- Computational models of morphology, syntax, semantics, discourse, pragmatics
- Applications: text generation, machine translation, information extraction, etc.
- Three major goals
- Learn the basic principles and theoretical issues underlying natural language processing
- Learn techniques and tools used to develop practical, robust systems that can communicate with users in one or more languages
- Gain insight into many open research problems in natural language
47Readings
- Speech and Language Processing (Daniel Jurafsky and James Martin), Prentice-Hall, 2000, ISBN 0-13-095069-6
- Handouts given in class
- 1-2 chapters per week
Optional readings: Natural Language Understanding by Allen; Foundations of Statistical Natural Language Processing by Manning and Schütze.
48Grading
- Four homework assignments (40%)
- Midterm (15%)
- Final project (20%)
- Final exam (25%)
- Additional requirements for SI 761
49Assignments
- (subject to change)
- Finite-state modeling, part of speech tagging, and information extraction
- Fsmtools/lextools/JMX (Bell Labs, Penn)
- Tagging and parsing
- Brill tagger/Charniak parser (JHU, Brown)
- Machine translation
- GIZA/Rewrite decoder (Aachen, JHU, ISI)
- Text generation
- FUF/Surge (Columbia)
50Syllabus
51Other meetings
- CLAIR meeting
- (TBA)
- Artificial Intelligence Seminar
- (Tuesdays 4-5:30)
- STIET
- (Thursdays 4-5:30)
52Projects
Each student will be responsible for designing
and completing a research project that
demonstrates the ability to use concepts from the
class in addressing a practical problem. A
significant part of the final grade will depend
on the project assignment. Students can elect to
do a project on an assigned topic, or to select a
topic of their own. The final version of the
project will be put on the World Wide Web, and
will be defended in front of the class at the end
of the semester (procedure TBA). In some cases
(and only with instructor's approval), students
may be allowed to work in pairs when the
project's scope is significant.
53Sample projects
- Noun phrase parser
- Paraphrase identification
- Question answering
- NL access to databases
- Named entity tagging
- Rhetorical parsing
- Anaphora resolution, entity crossreference
- Document and sentence alignment
- Using bioinformatics methods
- Encyclopedia
- Information extraction
- Speech processing
- Sentence normalization
- Text summarization
- Sentence compression
- Definition extraction
- Crossword puzzle generation
- Prepositional phrase attachment
- Machine translation
- Generation
- Semi-structured document parsing
- Semantic analysis of short queries
- User-friendly summarization
- Number classification
- Domain-specific PP attachment
- Time-dependent fact extraction
54Main research forums and other pointers
- Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP, Coling, HLT, EACL/NAACL, AMTA/MT Summit, ICSLP/Eurospeech
- Journals: Computational Linguistics, Natural Language Engineering, Information Retrieval, Information Processing and Management, ACM Transactions on Information Systems, ACM TALIP, ACM TSLP
- University centers: Columbia, CMU, JHU, Brown, UMass, MIT, UPenn, USC/ISI, NMSU, Michigan, Maryland, Edinburgh, Cambridge, Saarland, Sheffield, and many others
- Industrial research sites: IBM, SRI, BBN, MITRE, MSR, (AT&T, Bell Labs, PARC)
- Startups: Language Weaver, Ask.com, LCC
- The Anthology: http://www.aclweb.org/anthology
56What this course is NOT
- EECS 597 / LING 792 / SI 661: Language and Information, last taught in Winter 2005, essentially an introduction to corpus-based and statistical NLP.
- Topics covered: introduction to computational linguistics, information theory, data compression and coding, N-gram models, clustering, lexicography, collocations, text summarization, information extraction, question answering, word sense disambiguation, analysis of style, and other topics.
- SI 760: Information Retrieval, last taught Winter 2005.
- Topics covered: information need, IR models, documents, queries, query languages, relevance, retrieval evaluation, reference collections, query expansion and relevance feedback, indexing and searching, XML retrieval, language modeling approaches, crawling the Web, hyperlink analysis, measuring the Web, similarity and clustering, social network analysis for IR, hubs and authorities, PageRank and HITS, focused crawling, relevance transfer, question answering
- The new advanced NLP/IR course, to be offered Winter 2006.
- An undergraduate Linguistics course such as Ling 212: Intro to the Symbolic Analysis of Language or Ling 320: Programming for Linguistics and Language Studies
57Other sites
- Johns Hopkins University (Jason Eisner): http://www.cs.jhu.edu/jason/465/
- Cornell University (Lillian Lee): http://courses.cs.cornell.edu/cs674/2002SP/
- Stanford University (Chris Manning): http://www.stanford.edu/class/cs224n/
- JHU Summer workshop: http://www.clsp.jhu.edu/ws2003/calendar/preliminary.shtml
58Readings
- J&M Chapters 1, 2
- What is Computational Linguistics by Hans Uszkoreit: http://www.coli.uni-sb.de/hansu/what_is_cl.html
- Lecture notes 1