Title: Information pragmatics: A Natural Language Processing Approach
1. Information pragmatics: A Natural Language Processing Approach
- Christopher Manning
- CSLI IAP meeting
- November 2000
- http://nlp.stanford.edu/manning/
2. The problem
- When people see web pages, they understand their meaning
- By and large. To the extent that they don't, there's a gradual degradation
- When computers see web pages, they get only character strings and HTML tags
3. The human view
4. The intelligent agent view
- Ford Motor Company - Home Page
- "trucks, SUV, mazda, volvo, lincoln, mercury, jaguar, aston martin, ford"
- "Company corporate home page"
- WIDTH=768
- HREF="default.asp?pageid=473" onmouseover="logoOver('fordscript');rolloverText('ht0')" onmouseout="logoOut('fordscript');rolloverText('ht0')" pt.gif" ALT="Learn more about Ford Motor Company" WIDTH="521" HEIGHT="39"
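Even from this tag soup an agent can recover a little explicit metadata. As a sketch (the HTML string below is a simplified reconstruction of such a page, not Ford's actual source), Python's standard `html.parser` pulls out the meta keywords and image ALT text:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> keyword/description content and <img> ALT text."""
    def __init__(self):
        super().__init__()
        self.keywords = []
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.keywords.append(attrs.get("content", ""))
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

# Simplified, reconstructed fragment in the spirit of the slide's example
html = """<html><head><title>Ford Motor Company - Home Page</title>
<meta name="keywords" content="trucks, SUV, mazda, volvo, ford">
</head><body>
<img src="fordscript.gif" ALT="Learn more about Ford Motor Company">
</body></html>"""

p = MetaExtractor()
p.feed(html)
print(p.keywords)
print(p.alt_texts)
```

This recovers only what the page author happened to put in ALT and meta attributes; the rest of the page's meaning is still locked up in the rendering.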
5. The problem (cont.)
- We'd like computers to see meanings as well, so that computer agents could more intelligently process the web
- These desires have led to XML, RDF, agent markup languages, and a host of other proposals and technologies which attempt to impose more syntax and semantics on the web in order to make life easier for agents.
- E.g., Guha (Epinions CTO, ex-Cyc, 1999): "Very little of the information on the web is machine understandable. Need to move from a repository of data to a Web of Knowledge. RDF and the Open Directory might enable us to reach this goal."
6. Ontologies
- The answer, it is suggested, is ontologies
- Shared formal conceptualizations of particular domains: concepts, relations, objects, and constraints
- An ontology is a specification of a conceptualization that is designed for reuse across multiple applications
- Ontologies: controlled vocabularies, taxonomy, OO database schema, knowledge-representation system
- "Ontologies, as specifications of the concepts in a given field, and of the relationships among those concepts, provide insight into the nature of information produced by that field and are an essential ingredient for any attempts to arrive at a shared understanding of concepts in a field."
7. Why is this idea appealing?
- An ontology is really a dictionary. A data dictionary.
- In the world of closed company databases, one had a clear semantics for fields and tables, and the ability to combine information across them by well-specified logical means
- In the world-wide web, you have a mess
- The desire for a global or industry-wide ontology is a desire to bring back the good old days.
8. Thesis
- The problem can't and won't be solved by mandating a universal semantics for the web.
9. Nuanced Thesis (1)
- Structured knowledge is important, and there will be increasing use of structure and keys, just as we started using zip codes, and then the post office started barcoding.
- These processes all offer the opportunity to increase speed and precision, and agents will want to use them when available and reliable
- But successful agents will need to be able to work even when this information isn't there.
- The post office still delivers your mail, even when the zip code is missing or wrong.
10. Nuanced Thesis/Theses? (2)
- There will never be a complete, explicit, and unambiguous semantics for everything needed on the web, or even a non-trivial chunk of it, both because of the scale of the problem and the speed of change
- Much of the semantic knowledge needs instead to reside in the agent
- The agent needs to be able to understand the human web, by reasoning using contextual information and its own knowledge, and various kinds of text and image processing
11. XML?
- I'm not saying that XML won't be used much. It certainly will be used widely
- e.g., "News organizations moving to adopt NewsML for efficient production of electronic news" (Reuters, 11 October 2000)
- Internally, it will be used for most content (except tabular data), so that content can be easily retargeted for browsers, WAP, iMode, and whatever comes next
- Some sites will publish XML to outside users.
12. Will XML be published?
- "Another lesson of transitions is that the old way persists for a very long time. The 4.0-level browsers will be with us for the foreseeable future."
- Dave Winer (reacting to similar conclusions of Jakob Nielsen)
- If you're going to be serving HTML for the foreseeable future, why bother complicating your life by serving something else as well?
- Especially when it doesn't look better to the user
- Or people might charge for XML, while giving HTML away for free
13. XML
- Even when it is published, XML goes only a small way toward enabling knowledge transfer
- It is simply a syntax
- The same meanings can be encoded in it in many ways, and conversely, different meanings can be encoded in the same way.
- This is what suggests the need for a clearly mandated semantics for web markup
14. Explicit, usable web semantics
- Will such a thing work?
- That is, will web pages be consistently marked up with a uniform explicit semantics that is easily processed by agents, so that they don't have to deal with that messy HTML that underlies what humans look at?
- I think not. For a bunch of reasons.
15. (1) The semantics
- Are there adequate and adequately understood methods for marking up pages with such a consistent semantics, in such a way that it would support simple reasoning by agents?
- No.
16. What are some AI people saying?
- "Anyone familiar with AI must realize that the study of knowledge representation (at least as it applies to the commonsense knowledge required for reading typical texts such as newspapers) is not going anywhere fast. This subfield of AI has become notorious for the production of countless non-monotonic logics and almost as many logics of knowledge and belief, and none of the work shows any obvious application to actual knowledge-representation problems. Indeed, the only person who has had the courage to actually try to create large knowledge bases full of commonsense knowledge, Doug Lenat, is believed by everyone save himself to be failing in his attempt." (Charniak 1993: xvii-xviii)
17. (2) Many of the problems are pragmatics, not semantics
- pragmatic: relating to matters of fact or practical affairs, often to the exclusion of intellectual or artistic matters
- pragmatics: linguistics concerned with the relationship of the meaning of sentences to their meaning in the environment in which they occur
- A lot of the meaning in web pages (as in any communication) derives from the context: what is referred to in the philosophy-of-language tradition as pragmatics
- Communication is situated
18. The crêperie
- After making use of 3 different picture search engines, and spending at least ½ an hour on the site of a very dedicated French photographer, I had found the setting for my story: a crêperie.
- Well, almost. The visuals didn't really convey what I needed, so let me settle for a worse-quality picture of a gyro shop.
19. Not actually a crêperie
20. Important points
- Multimedia information sources are vital
- The meaning of a text is strongly determined by its context of use
- Indeed, you can think of language as conveying the minimal amount of information necessary given the context and assumed shared knowledge
- Humans are used to communicating even when they don't completely hear or understand the signal, even if this example is a bit extreme
21. Pragmatics on the web
- Information supplied is incomplete; humans will interpret it
- Numbers are often missing units
- A "rubber band" for sale at a stationery site is a very different item from a "rubber band" on a metal lathe
- A "sidelight" means something different to a glazier than to a regular person
- Humans will evaluate content using information about the site, and the style of writing
- value filtering
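How might an agent use site context to resolve sense differences like "rubber band"? A toy Lesk-style overlap heuristic, where the cue-word sets below are invented for illustration rather than drawn from any real lexicon:

```python
# Toy word-sense cue sets; invented for illustration only.
SENSES = {
    "rubber band": {
        "stationery": {"elastic", "paper", "office", "stationery", "clips"},
        "lathe": {"drive", "belt", "metal", "lathe", "machine", "spindle"},
    },
}

def disambiguate(term, page_words):
    """Pick the sense whose cue words overlap most with the page's words
    (a Lesk-style heuristic: context determines meaning)."""
    page = {w.lower() for w in page_words}
    senses = SENSES[term]
    return max(senses, key=lambda s: len(senses[s] & page))

print(disambiguate("rubber band", "office stationery paper clips".split()))
print(disambiguate("rubber band", "metal lathe drive spindle speeds".split()))
```

A real system would learn these context distributions from text statistics rather than enumerate them by hand, but the principle is the same: the surrounding page stands in for the missing explicit semantics.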
22. (3) The world changes
- The way in which business is being done is changing at an astounding rate
- or at least that's what the ads from e-business companies scream at us
- Semantic needs and usages evolve (like languages) more rapidly than standards (cf. the Académie française)
- People use words that aren't in the dictionary.
- Their listeners understand them.
23. Rapid change
- Last year "Rambus" wasn't a concept in computer memory classification; now it is
- Cell phones have long had attributes like size and battery life
- Now whether they support WAP is an attribute
- In a couple of years' time that attribute will probably have disappeared again
- People will introduce new products when they're ready, not when some committee has added the terms to an ontology
24. (4) Interoperation
- Ontology: a shared formal conceptualization of a particular domain
- Meaning transfer frequently has to occur across the subcommunities that are currently designing ML languages, and then all the problems reappear, and the current proposals don't do much to help
25. Many products cross industries
- http://www.interfilm-usa.com/Polyester.htm
- "Interfilm offers a complete range of SKC's Skyrol brand polyester films for use in a wide variety of packaging and industrial processes."
- Gauges: 48 - 1400
- Typical End Uses: Packaging, Electrical, Labels, Graphic Arts, Coating and Laminating
- labels: milk jugs, beer/wine, combination forms, laminated coupons,
26. Mismatches
- When interoperation involves distinct domains, or just distinct subcommunities within an industry, semantic mismatch ensues
- Local representational power conflicts with global consistency: you want to advertise your new feature
- Your own needs will take priority
- Systems will need to deal with this heterogeneity
- Integration of information across XML markup languages is scarcely easier than integration of the same information represented in HTML.
27. Semantic mismatches
- Different usages
- Cell phone = mobile phone
- Data projector = beamer
- Different levels of specialized vocabulary
- water table: the strip of wood that points outward at the bottom of the door
- hydrologists mean something very different by "water table"
- Ambiguity of reference
- Is "C.D. Manning" the same person as "Christopher Manning"?
28. Name matching / object identity knowledge
- Database theory is built around ideas of unique identifiers, determinate relational operations,
- (Human) natural language processing is built around context-embedded reasoning about issues of identity and meaning
- Around Stanford, "the president" is John Hennessy
- Elsewhere it's, well, either Gore or Bush
- Integrating information sources requires probabilistic reasoning about object identity
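The slides pose the "C.D. Manning" question without answering it, but one minimal sketch of such identity reasoning is to treat two name strings as compatible when their surnames match and the given names match at least by initial. A real record-linkage system would weigh far more evidence (affiliation, co-occurring terms, dates); this only shows the shape of the heuristic:

```python
def name_compatible(a: str, b: str) -> bool:
    """Heuristic: two name strings MAY denote the same person if surnames
    match and each paired given-name token matches in full or by initial."""
    def parse(name):
        # "C.D. Manning" -> ["c", "d", "manning"]
        return [t.rstrip(".").lower() for t in name.replace(".", ". ").split()]
    ta, tb = parse(a), parse(b)
    if ta[-1] != tb[-1]:              # surnames must agree
        return False
    for x, y in zip(ta[:-1], tb[:-1]):
        if x != y and x[0] != y[0]:   # allow initial vs. full given name
            return False
    return True

print(name_compatible("C.D. Manning", "Christopher Manning"))  # True
print(name_compatible("C. Manning", "Gerard Manning"))         # False
```

Note the asymmetry with databases: the function returns "compatible", not "identical"; turning that into a probability of coreference is exactly the contextual reasoning the slide calls for.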
29. (5) Pain but no gain
- A lot of the time people won't put in information according to standards for semantic/agent markup, even if they exist.
- Three reasons:
30. (5.1) Pain, no gain
- Laziness
- Only 0.3% of sites currently use the (simple) Dublin Core metadata standard (Lawrence and Giles 1999).
- Even fewer are likely to use something that is more work
- Why? They don't appear to perceive much value, I guess. What would change this?
31. Inconsistency: digital cameras
- Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
- Image Capture Device: Total Pixels Approx. 3.34 million; Effective Pixels Approx. 3.24 million
- Image sensor: Total Pixels Approx. 2.11 million-pixel
- Imaging sensor: Total Pixels Approx. 2.11 million; 1,688 (H) x 1,248 (V)
- CCD: Total Pixels Approx. 3,340,000 (2,140 H x 1,560 V)
- Effective Pixels: Approx. 3,240,000 (2,088 H x 1,550 V)
- Recording Pixels: Approx. 3,145,000 (2,048 H x 1,536 V)
- These all came off the same manufacturer's website!!
- And this is a very technical domain. Try sofa beds.
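To make these spec lines comparable, an agent has to normalize each to an approximate pixel count, trying the vendor's various phrasings in turn. A heuristic sketch (the fallback order and patterns are my own, not any manufacturer's schema):

```python
import re

def total_pixels(spec):
    """Heuristically normalize one vendor pixel-spec line to a pixel count.
    Tries 'N million' first, then an explicit 6+ digit figure, then H x V."""
    s = spec.replace(",", "")                       # "3,340,000" -> "3340000"
    m = re.search(r"([\d.]+)\s*million", s, re.I)   # "Approx. 3.34 million"
    if m:
        return int(round(float(m.group(1)) * 1_000_000))
    m = re.search(r"\b(\d{6,})\b", s)               # "3340000"
    if m:
        return int(m.group(1))
    m = re.search(r"(\d{3,})\s*\(?H?\)?\s*x\s*\(?(\d{3,})", s, re.I)
    if m:                                           # "2048 H x 1536 V"
        return int(m.group(1)) * int(m.group(2))
    return None

print(total_pixels("Total Pixels Approx. 3.34 million"))
print(total_pixels("2,048 H x 1,536 V"))
```

Even this toy shows the point: no markup standard was violated here, because none was used; the "schema" has to be reverse-engineered from prose.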
32. (5.2) Pain, no gain
- Sell the sizzle, not the steak
- The way businesses make money is by selling something at a profit (for more than necessary)
- The way you do this is by getting people to want it from you:
- advertising
- site stickiness ("while I'm here")
- trust
- Newspaper advertisements rarely contain spec sheets
33. (5.2) Pain, no gain
- Having an easily robot-crawlable site is a recipe for turning what you sell into a commodity
- This may open new markets
- But most would prefer not to be in this business
- Having all your goods turned into a commodity by a shopping bot isn't in your best interest:
- the profits are very low
34. (5.3) Gain, no pain
- The web is a nasty, free-wheeling place
- There are people out there who will abuse the intended use and semantics of any standard, provided they see opportunities to profit from doing so
- An agent cannot simply believe the semantics
- It will have to reason skeptically, based on all contextual and world knowledge available to it.
35. (6) Less structure to come
- "... the convergence of voice and data is creating the next key interface between people and their technology. By 2003, an estimated $450 billion worth of e-commerce transactions will be voice-commanded."
- Question: will these customers speak XML tags?
- Intel ad, NYT, 28 Sep 2000
- Data source: Forrester Research.
36. Summary so far
- With large-scale distributed information sources like the web, everyone suddenly needs to deal with highly heterogeneous data sources of uncertain correctness and value, where there are frequent semantic mismatches in which terms are used or in what they mean. Contextual information is often needed to determine the meaning or reference of terms. In other words, the problems look a lot like Natural Language Processing, regardless of whether the data is text as narrowly defined.
37. The connection to language
- Decker et al., IEEE Internet Computing (2000):
- "The Web is the first widely exploited many-to-many data-interchange medium, and it poses new requirements for any exchange format":
- Universal expressive power
- Syntactic interoperability
- Semantic interoperability
- But human languages have all these properties, and maintain superior expressivity and interoperability through their flexibility and context dependence
38. The direction to go
- Successful agents will need prior knowledge, and will use ontologies, etc., to help interpret web pages; they become a locus of semantics
- But they will also depend on contextual knowledge and reasoning in the face of uncertain information.
- They will use well-marked-up information, if available and trusted, but they will be able to extract their own metadata from information intended for humans, regardless of the form in which the information appears.
39. The scale of the problem
- The web is too big for it to be likely that humans will hand-enter metadata for most pages
- Hand-building ontologies and reasoning systems hasn't been very successful
- Agents must be able to extract propositions or relations from information intended for humans
- A useful observation in seeking this goal is that text statistics can often be used as a surrogate for world knowledge
40. Processing textual data
- Use language technology to add value to data by:
- interpretation
- transformation
- value filtering
- augmentation (providing metadata)
- Two motivations:
- The large amount of information in textual form
- Information integration needs NLP-style methods
41. Knowledge Extraction Vision
- Multi-dimensional meta-data extraction
42. Task: Text Categorization
- Take a document and assign it a label representing its content.
- Classic example: decide whether a newspaper article is about politics, business, or sports
- But there are many relevant web uses for the same technology:
- Is this page a laser printer product page?
- Does this company accept overseas orders?
- What kind of job does this job posting describe?
- What kind of position does this list of responsibilities describe?
- What position does this list of skills best fit?
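All of these web questions have the same shape as the newspaper example, so the same machinery applies. A self-contained multinomial Naive Bayes sketch (the four-document training set is invented purely to show the shape of the task):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes text categorizer with add-one smoothing."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)        # class frequencies
        self.counts = defaultdict(Counter)   # per-class word counts
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        n_docs, v = sum(self.priors.values()), len(self.vocab)
        best, best_score = None, -math.inf
        for label in self.priors:
            score = math.log(self.priors[label] / n_docs)
            total = sum(self.counts[label].values())
            for w in doc.lower().split():
                score += math.log((self.counts[label][w] + 1) / (total + v))
            if score > best_score:
                best, best_score = label, score
        return best

# Invented toy training set: product pages vs. job postings
clf = NaiveBayes().fit(
    ["laser printer 1200 dpi toner duplex",
     "inkjet printer photo paper cartridge",
     "senior java developer position salary benefits",
     "job posting software engineer full time"],
    ["product", "product", "job", "job"])

print(clf.predict("color laser printer with duplex"))  # product
print(clf.predict("engineer position with benefits"))  # job
```

In practice one would train on thousands of labeled pages, but even this skeleton illustrates why no hand-built ontology is needed to answer "is this a laser printer product page?".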
43. Task: Information Extraction / Wrapper Induction
- A lot of information that could be represented in a structured, semantically clear format isn't
- It may be costly, not desired, or not in one's control (screen scraping) to change this.
- Information extraction systems:
- Find and understand relevant parts of texts.
- Produce a structured representation of the relevant information: relations (in the DB sense)
- Goal: being able to answer semantic queries using unstructured natural language sources
44. Example: Classified Ads
- 2067206v1
- March 02, 1998
- MADDINGTON 89,000
- OPEN 1.00 - 1.45
- U 11 / 10 BERTRAM ST
- NEW TO MARKET Beautiful
- 3 brm freestanding
- villa, close to shops & bus
- Owner moved to Melbourne
- ideally suit 1st home buyer,
- investor & 55 and over.
- Brian Hazelden 0418 958 996
- R WHITE LEEMING 9332 3477
45. Real Estate Ads: Output
- Output is database tables
- But the general idea, in slot-filler format:
- SUBURB: MADDINGTON
- ADDRESS: (11, 10, BERTRAM, ST)
- INSPECTION: (1.00, 1.45, 11/Nov/98)
- BEDROOMS: 3
- TYPE: HOUSE
- AGENT: BRIAN HAZELDEN
- BUS PHONE: 9332 3477
- MOB PHONE: 0418 958 996
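As a sketch of the extraction step for a few of these slots: real systems learn their extractors (that is the "wrapper induction" part), but hand-written patterns show the slot-filler shape of the output. The ad text is trimmed from the previous slide:

```python
import re

# The ad from the previous slide, slightly trimmed
ad = """MADDINGTON 89,000
OPEN 1.00 - 1.45
U 11 / 10 BERTRAM ST
NEW TO MARKET Beautiful
3 brm freestanding
villa, close to shops & bus
Brian Hazelden 0418 958 996
R WHITE LEEMING 9332 3477"""

def extract(ad):
    """Fill a few of the slots with hand-written patterns (a sketch)."""
    slots = {}
    m = re.search(r"^([A-Z]{3,})\s+([\d,]+)", ad, re.M)   # suburb + price line
    if m:
        slots["SUBURB"] = m.group(1)
        slots["PRICE"] = int(m.group(2).replace(",", ""))
    m = re.search(r"(\d+)\s*brm", ad, re.I)               # "3 brm" -> bedrooms
    if m:
        slots["BEDROOMS"] = int(m.group(1))
    m = re.search(r"\d{4} \d{3} \d{3}", ad)               # mobile: 4-3-3 digits
    if m:
        slots["MOB_PHONE"] = m.group(0)
    m = re.search(r"\d{4} \d{4}", ad)                     # business: 4-4 digits
    if m:
        slots["BUS_PHONE"] = m.group(0)
    return slots

print(extract(ad))
```

The brittleness of these patterns (what if the price is "offers over 89,000"?) is precisely why the learned, statistical versions of this task matter.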
46. (No transcript)
47. Why doesn't text search (IR) work?
- What you search for in real estate advertisements:
- Suburbs. You might think easy, but:
- Real estate agents: Coldwell Banker, Mosman
- Phrases: "Only 45 minutes from Parramatta"
- Multiple property ads have different suburbs
- Money: you want a range, not a textual match
- Multiple amounts: "was $155K, now $145K"
- Variations: "offers in the high $700s", but not "rents for $270"
- Bedrooms: similar issues (br, bdr, beds, B/R)
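The money case shows why normalization, not string matching, is needed before range queries are possible. A heuristic sketch mapping assorted price mentions to dollar values (the offsets for "low/mid/high N00s" are invented midpoints, and a fuller version would return all mentions, not just the first):

```python
import re

def normalize_price(text):
    """Map a price mention onto a number of dollars (first match wins)."""
    t = text.lower().replace(",", "")
    m = re.search(r"\$?(\d+(?:\.\d+)?)\s*k\b", t)   # "155K"
    if m:
        return int(float(m.group(1)) * 1000)
    m = re.search(r"(low|mid|high)\s+(\d+)s", t)    # "high 700s"
    if m:
        offset = {"low": 20, "mid": 50, "high": 80}[m.group(1)]
        return (int(m.group(2)) + offset) * 1000
    m = re.search(r"\$?(\d{5,})", t)                # "$145,000"
    if m:
        return int(m.group(1))
    return None                                     # "rents for $270": not a sale price

print(normalize_price("offers in the high 700s"))
```

Once prices are numbers, "find me something under $150K" becomes a range query over extracted relations rather than a doomed keyword search.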
48. Task: Parsing (modern statistical parsers)
- A greatly increased ability to do accurate, robust, broad-coverage parsing
- Achieved by converting parsing into a classification task and using ML methods
- Statistical methods (fairly) accurately resolve structural and real-world ambiguities
- Quick: rather than cubic-time complete parse algorithms, find the best parse in linear time
- Provide probabilistic language models that can be integrated with speech recognition systems.
49. From structure to meaning
- Syntactic structures aren't meanings, but heads and dependents essentially give one relations
- orders(president, review(spectrum(wireless)))
- We don't do issues of noun phrase scope, but that's probably too hard for robust NLP
- Remaining problems: synonymy and polysemy
- Words have multiple meanings
- Several words can mean the same thing
- But there are statistical methods for these tasks
- So the goal of transforming a text into relations of facts is close
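The head-dependent relations can be rendered as the nested term on the slide. A toy recursive rendering, with a hypothetical dependency analysis of "The president orders a review of wireless spectrum" (a real parser would produce the analysis; only the rendering step is shown):

```python
def to_term(word, deps):
    """Render a head -> dependents analysis as a nested relational term."""
    args = deps.get(word, [])
    if not args:
        return word                      # leaves render as themselves
    return "%s(%s)" % (word, ", ".join(to_term(a, deps) for a in args))

# Hypothetical dependency analysis (head -> ordered dependents)
deps = {
    "orders": ["president", "review"],
    "review": ["spectrum"],
    "spectrum": ["wireless"],
}
print(to_term("orders", deps))  # orders(president, review(spectrum(wireless)))
```

The gap between this term and a meaning is exactly the slide's point: "review" and "spectrum" still need word-sense and synonymy resolution before two such terms from different pages can be matched.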
50. Precision: Semantic markup
- The story so far:
- We can get a fair way with text learning!
- In some places, moderate accuracy is okay
- But often business needs precision
- as Gio Wiederhold points out in his talks
- These methods may not offer sufficient accuracy
51. Precision: Semantic markup
- This is where semantic markup comes back in
- If a page has reliable semantic markup, such a program can use it to provide much higher accuracy levels
- Agents will need to check the provided markup
- But deciding whether provided semantic markup is trustworthy is a much easier (and hence more reliable) decision than working out the meaning from unstructured text
52. Data verification
- Humans are very good at checking whether data is reasonable:
- 5525 Beverly Place, Pittsburgh
- 361-5525
- They know if content is reasonable by content analysis
53. Data verification
- Most programs are dumb
- especially if they expect to just rely on semantic markup
- Again, one needs unstructured text classification and learning:
- one needs to check that field contents are reasonable
- Richly semantically marked-up data has a real use here, since it allows agents to continue to learn (especially as usage changes over time)
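A sketch of such a content-reasonableness check, using the address and phone examples from the previous slide. The patterns are illustrative, US-centric heuristics; a deployed verifier would be a learned classifier:

```python
import re

def plausible(field, value):
    """Cheap pattern checks that a field's content looks like its label."""
    checks = {
        # phone: only digits and separators, at least 7 characters
        "phone": lambda v: re.fullmatch(r"[\d\s()+.-]{7,}", v) is not None,
        # street address: leading house number, then a word
        "address": lambda v: re.match(r"\d+\s+[A-Za-z]", v) is not None,
    }
    return checks[field](value)

print(plausible("phone", "361-5525"))                         # True
print(plausible("phone", "5525 Beverly Place, Pittsburgh"))   # False
print(plausible("address", "5525 Beverly Place, Pittsburgh")) # True
```

This is the skeptical-agent idea in miniature: even when markup labels a field "phone", the agent checks that the contents could actually be a phone number before believing it.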
54. Conclusion
- Rich semantic markup has an important place: improving the precision of agent understanding
- But there will be no substitute for agents that can work with unstructured data
- part of that data is text: what I know about!
- but visual and other information is also incredibly important
- one really needs to use how a page looks
- All of it involves reasoning from uncertain, situated information, more in the style of NLP
55. Thank you!