Title: NLP: A Sampler
1NLP: A Sampler
- Several NLP Application Areas
- NLP from a Cognitive Science Perspective
2A Quick Tour of Several NLP Application Areas
- Information Extraction into Databases
- Text Summarization
- Machine Translation
3Information Extraction into Databases
4Information Extraction (Distillation)
- Information extraction is the identification, in text, of specified classes of
  - entities and named entities
  - relations
  - events
- For relations and events, this includes finding the participants and modifiers (date, time, location, etc.).
- We then build a database about a given relation or event
  - people's jobs
  - people's whereabouts
  - merger and acquisition activity
  - disease outbreaks
  - genomics relations
5Extraction Example
- George Garrick, 40 years old, president of the
London-based European Information Services Inc.,
was appointed chief executive officer of Nielsen
Marketing Research, USA.
George Garrick, 40 years old,
Nielsen Marketing Research, USA.
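As a minimal sketch of how the sentence above could become one database row, the following uses a single hand-written regular expression. The pattern, the field names, and the function `extract_appointment` are illustrative assumptions and cover only this sentence shape; extraction systems of the kind described in these slides rely on trained models, not one pattern.

```python
import re

SENTENCE = (
    "George Garrick, 40 years old, president of the London-based "
    "European Information Services Inc., was appointed chief executive "
    "officer of Nielsen Marketing Research, USA."
)

# Hypothetical pattern for:
#   <person>, <age> years old, <old post> of <old org>,
#   was appointed <new post> of <new org>.
APPOINTMENT = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)+), "
    r"(?P<age>\d+) years old, "
    r"(?P<old_post>[a-z ]+?) of (?:the )?(?P<old_org>.+?), "
    r"was appointed (?P<new_post>[a-z ]+?) of (?P<new_org>.+?)\.$"
)

def extract_appointment(sentence):
    """Return one database row as a dict, or None if the pattern fails."""
    match = APPOINTMENT.search(sentence)
    if match is None:
        return None
    row = match.groupdict()
    row["age"] = int(row["age"])  # store the age as a number, not text
    return row

record = extract_appointment(SENTENCE)
for field, value in record.items():
    print(f"{field}: {value}")
```

The participants (person, organizations) and modifiers (age, posts) each land in their own field, which is what makes the result queryable as a database.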
6Named Entities
- The who, where, when, and how much in a sentence
- The task: identify atomic elements of information in text
  - person names
  - company/organization names
  - locations
  - dates and times
  - percentages
  - monetary amounts
7Won't simple lists solve the problem?
- too numerous to include in dictionaries
- changing constantly
- appear in many variant forms
- subsequent occurrences might be abbreviated
- list search/matching doesn't perform well
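A small sketch of why exact list matching falls short. The name list and the three example sentences are made up for illustration; the point is only that a verbatim lookup finds the full form and misses an abbreviated later mention and a variant spelling.

```python
# A tiny, made-up name list; a real one would be far larger and still incomplete.
ORGANIZATIONS = {"Nielsen Marketing Research", "European Information Services Inc."}

def find_listed_names(text, names=ORGANIZATIONS):
    """Return the listed names that occur verbatim in the text."""
    return sorted(name for name in names if name in text)

first_mention = "He joined Nielsen Marketing Research as president."
later_mention = "Nielsen said the move was planned."          # abbreviated form
variant_form = "He joined Nielsen Mktg. Research as president."  # variant form

for text in (first_mention, later_mention, variant_form):
    print(find_listed_names(text), "<-", text)
```

Only the first sentence yields a match; the other two mention the same organization but return an empty list.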
8Applications
- Information Extraction
- Summary generation
- Machine Translation
- Document organization/classification
- Automatic indexing of books
- Improve Internet search results (location Clinton/South Carolina vs. President Clinton)
9I-Aliass Threat tracker
10Named Entity Detection in Chinese
(Figure: Chinese text with detected entities labeled Location, Date, Location, Organization, Location; from a slide by Papineni)
11Levels of BBN Statistical Analysis
Yugoslav President Slobodan Milosevic received on
Thursday the representatives of the Association
of Yugoslav Banks, headed by its president Milos
Milosavljevic, who is also the general director
of JugoBanka.
Name finding, parsing, co-reference
12Information Extraction from Propositions
Propositions are normalized connections from the
parse trees. Entities and relations are
extracted statistically from propositions.
(Figure labels: Person, ORG, GPE, Date)
13Text Summarization
14Information overload
- The problem
  - 4 billion URLs indexed by Google
  - 200 TB of data on the Web (Lyman and Varian 03)
- Possible approaches
  - information retrieval
  - document clustering
  - information extraction
  - visualization
  - question answering
  - text summarization
- (next several slides adapted from Drago Radev, U. Michigan)
15What happened?
MILAN, Italy, April 18. A small airplane crashed
into a government building in heart of Milan,
setting the top floors on fire, Italian police
reported. There were no immediate reports on
casualties as rescue workers attempted to clear
the area in the city's financial district. Few
details of the crash were available, but news
reports about it immediately set off fears that
it might be a terrorist act akin to the Sept. 11
attacks in the United States. Those fears
sent U.S. stocks tumbling to session lows in late
morning trading. Witnesses reported hearing a
loud explosion from the 30-story office building,
which houses the administrative offices of the
local Lombardy region and sits next to the city's
central train station. Italian state television
said the crash put a hole in the 25th floor of
the Pirelli building. News reports said smoke
poured from the opening. Police and ambulances
rushed to the building in downtown Milan. No
further details were immediately available.
How many victims?
When, where?
Says who?
Was it a terrorist act?
What was the target?
16http://www1.cs.columbia.edu/nlp/newsblaster/summaries/2006-03-26-09-23-19-030.html
17Machine Translation
18Why use computers in translation?
- Too much translation for humans
- Technical materials too boring for humans
- Greater consistency required
- Need results more quickly
- Not everything needs to be top quality
- Reduce costs
- Any one of these may justify machine translation or computer aids
- (next several slides adapted from Language Weaver)
19Statistical Machine Translation Technology
Spanish/English Bilingual Text
English Text
Statistical Analysis
Statistical Analysis
Que hambre tengo yo
20How A Statistical MT System Learns
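The slides do not say which model Language Weaver used. As one classic illustration of learning word translations from bilingual text, the sketch below runs expectation-maximization for a simplified IBM Model 1 (no NULL word). The three sentence pairs and the helper names `train_model1` and `best_translation` are assumptions made up for this example.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(english_word | spanish_word) with EM on sentence pairs."""
    english_vocab = {e for _, english in pairs for e in english}
    # Start from a uniform distribution over English words.
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (english, spanish) co-occurrences
        total = defaultdict(float)  # expected occurrences of each Spanish word
        for spanish, english in pairs:
            for e in english:
                # E-step: split one unit of credit for e among the Spanish words.
                norm = sum(t[(e, s)] for s in spanish)
                for s in spanish:
                    share = t[(e, s)] / norm
                    count[(e, s)] += share
                    total[s] += share
        # M-step: renormalize expected counts into probabilities.
        for (e, s), c in count.items():
            t[(e, s)] = c / total[s]
    return t

def best_translation(spanish_word, t):
    """Most probable English word for a Spanish word under the table t."""
    candidates = {e: p for (e, s), p in t.items() if s == spanish_word}
    return max(candidates, key=candidates.get)

# Made-up toy bilingual text: (Spanish tokens, English tokens).
PAIRS = [
    ("la casa".split(), "the house".split()),
    ("la flor".split(), "the flower".split()),
    ("casa".split(), "house".split()),
]

t = train_model1(PAIRS)
for word in ("la", "casa", "flor"):
    print(word, "->", best_translation(word, t))
```

No dictionary is supplied: the pairing of "flor" with "flower" emerges because "la" is better explained by "the", which appears with it in two sentences.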
21Translating a New Document
22Language Weaver
Source: Aljazeera, January 8, 2005
23 Translingual Chat Instant Messaging
Original
Translation
24Language Weaver (Al Jazeera 8/2007)
LanguageWeaver Demo Website
25Language Weaver Hybrid Translation Technology
- Chinese Source Text, Sample 1: (Chinese characters were lost to an encoding error and are not recoverable)
- Language Weaver Experimental Syntax MT, Sample 1:
The motor show, has always been the 'barometer' of a
national car consumption and market potential.
The Beijing International Auto Show has more than
1,200 exhibitors from 24 countries and 8 days of
receiving more than 40 million visitors, setting
a new record in China's auto show, are deeply
aware of the automobile market signals. "China
is one of the largest automobile market in the
world. Over the years, this phrase implies more
auto businessmen. But now, more and more facts
indicates that it is to become a reality. Data
from the Motor Show is very convincing. The
Beijing Qingnian Bao Report on-the-spot
investigation showed that about 35 percent of
35-year-old visitors, 62.1 percent of the
respondents said that the truck was mainly to buy
a car in the near future to collect information,
even at the exhibition may purchase or suitable
products 76 of respondents indicated in the
past two years to buy private cars. Since the
beginning of this year, the strong growth of the
domestic car market. According to the figures
released by the National Bureau of Statistics, in
the first four months, the country produced
267,900 vehicles, up 27.6 percent in particular,
in April, the production of 90,000 vehicles, an
increase of 50.5 over the same period last year,
setting a record high for the monthly output
growth over the past 10-odd years. In terms of
sales in the first quarter, manufacturing
enterprises in the country sold 188,000 cars, up
22 percent over the same period of last year, up
10.5 percent 11,000 vehicles, dropping by nearly
25 percent lower than the beginning of the year.
26Broadcast Monitoring BBN MAPS Language Weaver
MT
28Why MT Is Hard: Ambiguity
- Syntactic ambiguity: I saw the man on the hill with the telescope
- Lexical ambiguity
  - E: book
  - S: libro, reservar
- Semantic ambiguity
  - Homography: ball (E) = pelota, baile (S)
  - Polysemy: kill (E) = matar, acabar (S)
  - Semantic granularity: esperar (S) = wait, expect, hope (E); be (E) = ser, estar (S); fish (E) = pez, pescado (S)
29Why MT Is Hard: Divergences
- Meaning of two translationally equivalent phrases is distributed differently in the two languages
- Example
  - English: RUN INTO ROOM
  - Spanish: ENTER IN ROOM RUNNING
30NLP in Cognitive Science: Language is Computational
31Auxiliaries (helping verbs) in English
- John could eat.
- John has eaten.
- John was eating.
- John could be eating.
- John was eaten.
- John could have been eaten.
- John could have been being eaten.
- *John could eaten.
- *John has eating.
- *John was eat.
- *John could being eating.
- *John was eat.
- *John could been eaten.
- *John have be eating.
32Grammar is really algorithmic: Affix Hopping
- Auxiliary templates (Affixes in blue italic)
  - One of (could 0), (Pres), or (Past), followed by
  - (have en) (be ing) (be en) V
- Algorithm
  - Pick verb
  - Pick template
  - Pick optional items
  - Hop affixes over following items
  - Glue and do morphology
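The algorithm above can be sketched in a few lines. This is one possible reading of the slide, with assumptions stated: the first slot is (could 0), (Pres), or (Past); the three (stem affix) pairs are optional and keep their order; "0" is the zero affix; and the small `MORPHOLOGY` table (third person singular only) is hand-written for this example.

```python
# (stem, affix) pairs; a bare tense affix has no stem of its own.
FIRST_SLOT = {"could": ("could", "0"), "Pres": (None, "Pres"), "Past": (None, "Past")}
OPTIONAL = [("have", "en"), ("be", "ing"), ("be", "en")]  # fixed template order

# Hand-written morphology table for the stems used here (an assumption).
MORPHOLOGY = {
    ("eat", "0"): "eat", ("eat", "en"): "eaten", ("eat", "ing"): "eating",
    ("eat", "Pres"): "eats", ("eat", "Past"): "ate",
    ("have", "0"): "have", ("have", "en"): "had", ("have", "ing"): "having",
    ("have", "Pres"): "has", ("have", "Past"): "had",
    ("be", "0"): "be", ("be", "en"): "been", ("be", "ing"): "being",
    ("be", "Pres"): "is", ("be", "Past"): "was",
}

def build_sequence(verb, first, use_optional):
    """Pick verb, template, and optional items; lay them out in order."""
    chosen = [FIRST_SLOT[first]]
    chosen += [pair for pair, keep in zip(OPTIONAL, use_optional) if keep]
    flat = []
    for stem, affix in chosen:
        if stem is not None:
            flat.append(("stem", stem))
        flat.append(("affix", affix))
    flat.append(("stem", verb))
    return flat

def hop_affixes(flat):
    """Hop every affix over the item that follows it."""
    words, i = [], 0
    while i < len(flat):
        kind, value = flat[i]
        if kind == "affix":
            stem = flat[i + 1][1]          # the item being hopped over
            words.append((stem, value))
            i += 2
        else:
            words.append((value, None))    # a stem no affix lands on
            i += 1
    return words

def generate(subject, verb, first, use_optional=(False, False, False)):
    """Glue stems and affixes and apply morphology."""
    words = hop_affixes(build_sequence(verb, first, use_optional))
    glued = [stem if affix is None else MORPHOLOGY[(stem, affix)]
             for stem, affix in words]
    return f"{subject} {' '.join(glued)}."

print(generate("John", "eat", "could", (True, True, True)))
print(generate("John", "eat", "Pres", (True, False, False)))
```

Because every affix lands on the item right after it, the procedure can produce "John could have been being eaten." but has no way to produce strings such as "John could eaten."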
33Pronouns: Topological Constraints on the Syntax Tree
- (1) Everyone he knows likes John.
- (2) Everyone John knows likes him.
- (3) John is liked by everyone he knows.
- (4) He is liked by everyone John knows.
- How are 1 and 2 related? Mean the same?
- How are 1 and 3 related? Mean the same?
- How are 3 and 4 related? Mean the same?
- How are 2 and 4 related? Mean the same?
34C-Command
- Node X c-commands node Y, if
  - every branching node dominating X also dominates Y, and
  - neither X nor Y dominates the other.
- Topological constraint: A pronoun cannot c-command its antecedent
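The definition translates directly into a check on a tree with parent links. The sketch below is illustrative: the `Node` class, the simplified bracketings of "Everyone he knows likes John." and "He is liked by everyone John knows.", and the rule that a node does not c-command itself are all assumptions made for this example.

```python
class Node:
    """A syntax-tree node that records its parent."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def ancestors(self):
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

def dominates(a, b):
    """True if a is a proper ancestor of b."""
    return any(anc is a for anc in b.ancestors())

def c_commands(x, y):
    """Every branching node dominating x dominates y; neither dominates the other."""
    if x is y or dominates(x, y) or dominates(y, x):
        return False
    branching = [n for n in x.ancestors() if len(n.children) > 1]
    return all(dominates(n, y) for n in branching)

# "Everyone he knows likes John." (simplified, assumed bracketing)
he1, john1 = Node("he"), Node("John")
sentence1 = Node("S", [
    Node("NP", [Node("everyone"), Node("RelClause", [he1, Node("knows")])]),
    Node("VP", [Node("likes"), john1]),
])

# "He is liked by everyone John knows." (simplified, assumed bracketing)
he4, john4 = Node("He"), Node("John")
sentence4 = Node("S", [
    he4,
    Node("VP", [Node("is liked"), Node("PP", [
        Node("by"),
        Node("NP", [Node("everyone"), Node("RelClause", [john4, Node("knows")])]),
    ])]),
])

print(c_commands(he1, john1))  # pronoun buried in the relative clause
print(c_commands(he4, john4))  # pronoun is the subject of the whole sentence
```

Under these bracketings the pronoun c-commands "John" only in the second sentence, so the constraint rules out coreference there while leaving it possible in the first.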
35A Fragment from the TNT Tutor
- Word processing makes typing easy
- Make a typo?
- No problem!
- Just back up, type over the mistake and it's gone.
- And, it eliminates retyping.
- Need a second draft?
- What does each "it" refer to?
36Reference: Discourse Stack Constraints
- Word processing makes typing easy
- Make a typo? No problem! Just back up, type over the mistake and it's gone.
- And, it eliminates retyping. Need a second draft?
- Intonation signals block structure!
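A minimal sketch of the stack idea, under assumptions: each discourse block carries one topic, "it" resolves to the topic of the innermost open block, and the block boundaries (which the slide says intonation signals) are given by hand here. The class `DiscourseStack` and the topic labels are hypothetical.

```python
class DiscourseStack:
    """Open discourse blocks, innermost last; each block carries one topic."""
    def __init__(self):
        self._topics = []

    def open_block(self, topic):
        self._topics.append(topic)

    def close_block(self):
        return self._topics.pop()

    def resolve_it(self):
        """Resolve "it" to the topic of the innermost open block."""
        return self._topics[-1]

stack = DiscourseStack()
stack.open_block("word processing")  # Word processing makes typing easy
stack.open_block("the mistake")      # Make a typo? ... type over the mistake
inner_it = stack.resolve_it()        # "... and it's gone."
stack.close_block()                  # the typo block ends
outer_it = stack.resolve_it()        # "And, it eliminates retyping."

print(inner_it)
print(outer_it)
```

Once the inner block is popped, its topic is no longer a candidate, which is why the second "it" goes back to word processing instead of the more recent "mistake".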
37Reference: Pragmatics requires Inference
- Today was Jack's birthday.
- Penny and Janet went to the store.
- They were going to get him a present.
- Janet decided to get a top.
- "Don't do that," said Penny.
- Jack has a top.
- He will make you take it back.
- Charniak 72
38Reference: Pragmatics requires Inference II
- The city council refused to give the students a permit for the demonstration because
  - they feared violence.
  - they advocated revolution.
- Winograd 70
39Pronouns Coreference
- Syntactic tree provides topological constraints
- Discourse structure provides stack constraints
- Remaining possibilities ranked using world
knowledge