Title: Christopher Manning
1Christopher Manning
- CS300 talk Fall 2000
- manning_at_cs.stanford.edu
- http://nlp.stanford.edu/manning/
2Research areas of interest: NLP/CL
- Statistical NLP models: combining linguistic and statistical sophistication
- NLP and ML methods for extracting meaning relations from webpages, medical texts, etc.
- Information extraction and text mining
- Lexical and structural acquisition from raw text
- Using robust NLP: dialect/style, readability, ...
- Using pragmatics, genre, NLP in web searching
- Computational lexicography and the visualization of linguistic information
3Models for language
- What is the motivation for statistical models for understanding language?
- From the beginning, logics and logical reasoning were invented for handling natural language understanding
- Logics have a language-like form that draws from and meshes well with natural languages
- Where are the numbers?
4Sophisticated grammars for NL
- From NP → Det Adj N there developed precise and sophisticated grammar formalisms (such as LFG, HPSG)
5The Problem of Ambiguity
- Any broad-coverage grammar is hugely ambiguous (often hundreds of parses for 20-word sentences).
- Making the grammar more comprehensive only makes the ambiguity problem worse.
- Traditional (symbolic) NLP methods don't provide a solution.
- Selectional restrictions fail because creative/metaphorical use of language is everywhere:
- I swallowed his story
- The supernova swallowed up the planet
6The problem of ambiguity: close up
- The post office will hold out discounts and service concessions as incentives.
- 12 words. Real language. At least 83 parses.
8Statistical NLP methods
- P(to | Sarah drove)
- P(time is verb | Time flies like an arrow)
- P(NP → Det Adj N | mother = VP[drive])
- Statistical NLP methods:
- Estimate grammar parameters by gathering counts from texts or structured analyses of texts
- Assign probabilities to various things to determine the likelihood of word sequences, sentence structure, and interpretation
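Estimating grammar parameters from counts can be sketched as a relative-frequency calculation. This is only an illustration: the toy list of observed rules and the function name `estimate_rule_probs` are assumptions made for this example, not data or code from the talk.

```python
from collections import Counter

# Hypothetical toy "treebank": one (lhs, rhs) pair per rule use observed in parsed text.
observed_rules = [
    ("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("Pronoun",)),
    ("NP", ("NP", "PP")), ("VP", ("V", "NP")), ("VP", ("V", "NP", "PP")),
]

def estimate_rule_probs(rules):
    """Relative-frequency estimate: P(lhs -> rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

probs = estimate_rule_probs(observed_rules)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

With more data the same counting gives the kind of rule probabilities a probabilistic grammar needs; real systems would also smooth the estimates, which this sketch leaves out.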
9Probabilistic Context-Free Grammars
Rules and probabilities:
NP → Det N       0.4
NP → NPposs N    0.1
NP → Pronoun     0.2
NP → NP PP       0.1
NP → N           0.2
Tree (figure): [NP [NP Det N] PP]
P(subtree above) = 0.1 x 0.4 = 0.04
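The subtree probability on this slide is the product of the probabilities of the rules used in it. A minimal sketch, using the NP rule probabilities from the slide; the tuple-based tree encoding and the name `tree_prob` are assumptions made for this illustration.

```python
import math

# NP rule probabilities as given on the slide.
RULE_PROBS = {
    ("NP", ("Det", "N")): 0.4,
    ("NP", ("NPposs", "N")): 0.1,
    ("NP", ("Pronoun",)): 0.2,
    ("NP", ("NP", "PP")): 0.1,
    ("NP", ("N",)): 0.2,
}

def tree_prob(tree):
    """Multiply the probabilities of every rule used in the tree.

    A tree is (label, [children]); a bare string is an unexpanded node
    and contributes a factor of 1 to the product.
    """
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# The subtree from the slide: NP -> NP PP, with the inner NP -> Det N.
subtree = ("NP", [("NP", ["Det", "N"]), "PP"])
print(math.isclose(tree_prob(subtree), 0.1 * 0.4))
```

The five NP rules sum to 1, as the expansions of a single category must in a PCFG.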
10Why Probabilistic Grammars?
- The predictions about grammaticality and ambiguity of categorical grammars are not in accord with human perceptions or engineering needs.
- Categorical grammars aren't predictive
- They don't tell us what sounds natural
- Probabilistic grammars model error tolerance, online lexical acquisition, and have been amazingly successful as an engineering tool
- They capture a lot of world knowledge for free
- Relevant to linguistic change and variation, too!
11Example: near
- In Middle English, near was an adjective (Maling)
- But, today, is it an adjective or a preposition?
- The near side of the moon
- We were near the station
- Not just a word with multiple parts of speech! There is evidence of blending:
- We were nearer the bus stop than the train
- He has never been nearer the center of the financial establishment
12Research aim
- Most current statistical models are quite simple (linguistically and also statistically)
- Aim: To combine the good features of statistical NLP methods with the sophistication of rich linguistic analyses.
13Lexicalising a CFG
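Tree (figure) for "looked inside the box", each node annotated with its head word:
VP[looked] → V[looked] PP[inside]
PP[inside] → P[inside] NP[box]
NP[box] → D[the] N[box]
- A lexicalized CFG can capture probabilistic dependencies between words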
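Lexicalizing a tree amounts to passing each phrase's head word up to its parent. A minimal sketch for the "looked inside the box" example follows; the head table `HEAD_CHILD`, the tuple tree encoding, and the function name `lexicalize` are assumptions made for this illustration.

```python
# Hypothetical head table: for each phrasal category, the child category that is its head.
HEAD_CHILD = {"VP": "V", "PP": "P", "NP": "N"}

def lexicalize(tree):
    """Return (annotated_tree, head_word).

    A tree is (category, [children]) for a phrase,
    or (category, word) for a preterminal such as ("V", "looked").
    """
    cat, body = tree
    if isinstance(body, str):
        return (f"{cat}[{body}]", body), body
    annotated, head = [], None
    for child in body:
        sub, child_head = lexicalize(child)
        annotated.append(sub)
        if child[0] == HEAD_CHILD[cat]:
            head = child_head  # the head child's word percolates up
    return (f"{cat}[{head}]", annotated), head

tree = ("VP", [("V", "looked"),
               ("PP", [("P", "inside"),
                       ("NP", [("D", "the"), ("N", "box")])])])
lexicalized, head = lexicalize(tree)
print(lexicalized[0])
```

Once every node carries its head word, rule probabilities can be conditioned on word pairs such as (looked, inside), which is the kind of dependency the slide refers to.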
14Left-corner parsing
- The memory requirements of standard parsers do not match human linguistic processing. What humans find hardest, center embedding:
- The man that the woman the priest met knows couldn't help
- is really the bread-and-butter of standard CFG parsing:
- (((a b)))
- As an alternative, left-corner parsing does capture this.
15Parsing and (stack) complexity
- She ruled that the contract between the union and
company dictated that claims from both sides
should be bargained over or arbitrated.
16Tree geometry vs. stack depth
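- Stack depth by parsing strategy (TD = top-down, LC = left-corner, BU = bottom-up):
- Kim's friend's mother's car smells. TD 5, LC 1, BU 1
- Kim thinks Sandy knows she likes green apples. TD 1, LC 1, BU 7
- The rat that the cat that Kim likes chased died. TD 3, LC 3, BU 7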
17Probabilistic Left-Corner Grammars
- Use richer probabilistic conditioning
- Left corner and goal category rather than just parent
- P(NP → Det Adj N | Det, S)
- Allow left-to-right online parsing (which can hope to explain how people build partial interpretations online)
- Easy integration with lexicalization, part-of-speech tagging models, etc.
Tree (figure): S over NP over Det Adj N
18Probabilistic Head-driven Grammars
- The heads of phrases are the source of the main constraining information about a sentence structure
- We work out from heads by following the dependency order of the sentence
- The crucial property is that we have always built and have available to us for conditioning all governing heads and all less oblique dependents of the same head
- We can also easily integrate phrase length
19Information from the web: The problem
- When people see web pages, they understand their meaning
- By and large. To the extent that they don't, there's a gradual degradation
- When computers see web pages, they get only character strings and HTML tags
20The human view
21The intelligent agent view
- Ford Motor Company - Home Page
- trucks, SUV, mazda, volvo, lincoln, mercury, jaguar, aston martin, ford
- Company corporate home page
- WIDTH=768
- HREF="default.asp?pageid=473" onmouseover="logoOver('fordscript');rolloverText('ht0')" onmouseout="logoOut('fordscript');rolloverText('ht0')"
- ...pt.gif" ALT="Learn more about Ford Motor Company" WIDTH="521" HEIGHT="39"
22The problem (cont.)
- We'd like computers to see meanings as well, so that computer agents could more intelligently process the web
- These desires have led to XML, RDF, agent markup languages, and a host of other proposals and technologies which attempt to impose more syntax and semantics on the web in order to make life easier for agents.
23Thesis
- The problem can't and won't be solved by mandating a universal semantics for the web
- The solution is rather agents that can understand the human web by text and image processing
24(1) The semantics
- Are there adequate and adequately understood methods for marking up pages with such a consistent semantics, in such a way that it would support simple reasoning by agents?
- No.
25What are some AI people saying?
- "Anyone familiar with AI must realize that the study of knowledge representation, at least as it applies to the commonsense knowledge required for reading typical texts such as newspapers, is not going anywhere fast. This subfield of AI has become notorious for the production of countless non-monotonic logics and almost as many logics of knowledge and belief, and none of the work shows any obvious application to actual knowledge-representation problems. Indeed, the only person who has had the courage to actually try to create large knowledge bases full of commonsense knowledge, Doug Lenat, is believed by everyone save himself to be failing in his attempt." (Charniak 1993: xvii-xviii)
26(2) Pragmatics not semantics
- pragmatic: relating to matters of fact or practical affairs, often to the exclusion of intellectual or artistic matters
- pragmatics: linguistics concerned with the relationship of the meaning of sentences to their meaning in the environment in which they occur
- A lot of the meaning in web pages (as in any communication) derives from the context: what is referred to in the philosophy of language tradition as pragmatics
- Communication is situated
27Pragmatics on the web
- Information supplied is incomplete: humans will interpret it
- Numbers are often missing units
- A rubber band for sale at a stationery site is a very different item to a rubber band on a metal lathe
- A sidelight means something different to a glazier than to a regular person
- Humans will evaluate content using information about the site, and the style of writing
- value filtering
28(3) The world changes
- The way in which business is being done is changing at an astounding rate
- or at least that's what the ads from ebusiness companies scream at us
- Semantic needs and usages evolve (like languages) more rapidly than standards (cf. the Académie française)
- People use words that aren't in the dictionary.
- Their listeners understand them.
29(4) Interoperation
- Ontology: a shared formal conceptualization of a particular domain
- Meaning transfer frequently has to occur across the subcommunities that are currently designing ML languages, and then all the problems reappear, and the current proposals don't do much to help
30Many products cross industries
- http://www.interfilm-usa.com/Polyester.htm
- Interfilm offers a complete range of SKC's Skyrol brand polyester films for use in a wide variety of packaging and industrial processes.
- Gauges: 48 - 1400
- Typical End Uses: Packaging, Electrical, Labels, Graphic Arts, Coating and Laminating
- labels: milk jugs, beer/wine, combination forms, laminated coupons, ...
31(5) Pain but no gain
- A lot of the time people won't put in information according to standards for semantic/agent markup, even if they exist.
- Three reasons:
- Laziness: Only 0.3% of sites currently use the (simple) Dublin Core metadata standard.
- Profits: Having an easily robot-crawlable site is a recipe for turning what you sell into a commodity, and hence making little profit
- Cheats: There are people out there that will abuse any standard, if it's profitable
32(6) Less structure to come
- "the convergence of voice and data is creating the next key interface between people and their technology. By 2003, an estimated $450 billion worth of e-commerce transactions will be voice-commanded."
- Question: will these customers speak XML tags?
- Intel ad, NYT, 28 Sep 2000
- Data Source: Forrester Research.
33The connection to language
- Decker et al., IEEE Internet Computing (2000):
- The Web is the first widely exploited many-to-many data-interchange medium, and it poses new requirements for any exchange format:
- Universal expressive power
- Syntactic interoperability
- Semantic interoperability
- But human languages have all these properties, and maintain superior expressivity and interoperability through their flexibility and context dependence
34NLP and information access
- Solution: use robust natural language processing and machine learning techniques
- NLP comes into its own when you want to do more than just standard IR.
- E.g., defined information needs over text:
- An apartment with 2 bedrooms in Menlo Park for less than $1,500.
- Where was there an airline accident today?
- What proteins is this gene known to regulate?
35Example of extracting textual relations: Real Estate Ads
- System starts with plain text of ads
- These are hardly exactly English
- But an unstructured information source, close to English
- Chosen as lowest common denominator
- Output: database records
- A variety of tables giving information about:
- the property: bedrooms, garages, price
- the real estate agency
- inspection times
36Real Estate Ads: Input
- 2067206v1
- March 02, 1998
- MADDINGTON 89,000
- OPEN 1.00 - 1.45
- U 11 / 10 BERTRAM ST
- NEW TO MARKET Beautiful
- 3 brm freestanding
- villa, close to shops bus
- Owner moved to Melbourne
- ideally suit 1st home buyer,
- investor 55 and over.
- Brian Hazelden 0418 958 996
- R WHITE LEEMING 9332 3477
37Real Estate Ads: Output
- Output is database tables
- But the general idea in slot-filler format:
- SUBURB: MADDINGTON
- ADDRESS: (11,10,BERTRAM,ST)
- INSPECTION: (1.00,1.45,11/Nov/98)
- BEDROOMS: 3
- TYPE: HOUSE
- AGENT: BRIAN HAZELDEN
- BUS PHONE: 9332 3477
- MOB PHONE: 0418 958 996
- Manning & Whitelaw, U. Sydney 1998: in daily use at News Corp.
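To make the slot-filler idea concrete, here is a small sketch that fills a few of the slots from an abridged copy of the input ad. It is not the system described in the talk: the regular expressions in `PATTERNS` and the function `fill_slots` are hypothetical and fit only this one ad layout, which is exactly why a robust system needs more than pattern matching.

```python
import re

# Abridged copy of the ad from the input slide.
AD_TEXT = """MADDINGTON 89,000
OPEN 1.00 - 1.45
U 11 / 10 BERTRAM ST
NEW TO MARKET Beautiful
3 brm freestanding
villa, close to shops bus
Brian Hazelden 0418 958 996
R WHITE LEEMING 9332 3477"""

# Hypothetical patterns, written only for this one ad layout.
PATTERNS = {
    "SUBURB": re.compile(r"^([A-Z]+) [\d,]+$", re.M),
    "BEDROOMS": re.compile(r"(\d+) ?(?:brm|br)\b", re.I),
    "INSPECTION": re.compile(r"OPEN (\d+\.\d+) - (\d+\.\d+)"),
    "MOB PHONE": re.compile(r"\b(04\d{2} \d{3} \d{3})\b"),
    "BUS PHONE": re.compile(r"\b(\d{4} \d{4})$", re.M),
}

def fill_slots(text):
    """Return a dict of slot name -> extracted value for every pattern that matches."""
    slots = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            groups = m.groups()
            slots[name] = groups[0] if len(groups) == 1 else groups
    return slots

record = fill_slots(AD_TEXT)
for name, value in record.items():
    print(name, value)
```

Slots such as TYPE (HOUSE, inferred from "freestanding villa") or AGENT are deliberately left out: they need lexical knowledge and context rather than surface patterns.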
40One needs a little NLP
- There is no semantic coding to use
- Standard IR doesn't work
- suburbs
- the Paddington of the west
- one hour's drive from Sydney
- real estate agent
- prices
- recently sold for x. Was y now z. Rent.
- bedrooms
- multi-property ads
41Text Segmentation
- Real-estate ads have a hierarchical text structure!
- SOUTHPORT UNIT SPECIALS
- 58,900 o.n.o. 2 brm close to water and shops.
- 114,000 "Grandview", excellent value, good returns
- LJ Coleman Real Estate
- Contact Steve 5527 0572
- GLEBE 2br yd 250 4br yd 430
- COOGEE 3br yd 320 1br 150
- BALMAIN 1br 180
- H.R. Licensed FEE 9516-3211
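The multi-property lines above show why segmentation matters: one line can hold several listings that share a suburb. A minimal sketch of splitting them follows. The `LISTING` pattern, the reading of "yd" as a yard flag, and the reading of the trailing number as a price are all assumptions made for this illustration, not details given in the talk.

```python
import re

LINES = [
    "GLEBE 2br yd 250 4br yd 430",
    "COOGEE 3br yd 320 1br 150",
    "BALMAIN 1br 180",
]

# Hypothetical pattern: "<n>br", an optional " yd" flag, then a number read as the price.
LISTING = re.compile(r"(\d+)br( yd)? (\d+)")

def segment(line):
    """Split one multi-property line into one record per listing, sharing the suburb."""
    suburb, rest = line.split(" ", 1)
    return [
        {"suburb": suburb, "bedrooms": int(b), "yard": bool(yd), "price": int(p)}
        for b, yd, p in LISTING.findall(rest)
    ]

records = [rec for line in LINES for rec in segment(line)]
for rec in records:
    print(rec)
```

The shared suburb is the top level of the hierarchy and each listing is a child of it; the agency line that closes the ad would attach one level higher still.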
42The End