Title: Information pragmatics: A Natural Language Processing Approach
1. Information pragmatics: A Natural Language Processing Approach
- Christopher Manning
- CSLI IAP meeting
- November 2000
- http://nlp.stanford.edu/manning/
2. The problem
- When people see web pages, they understand their meaning
- By and large. To the extent that they don't, there's a gradual degradation
- When computers see web pages, they get only character strings and HTML tags
3. The human view
4. The intelligent agent view
- Ford Motor Company - Home Page
- "trucks, SUV, mazda, volvo, lincoln, mercury, jaguar, aston martin, ford"
- "Company corporate home page"
- WIDTH=768
- HREF="default.asp?pageid=473" onmouseover="logoOver('fordscript');rolloverText('ht0')" onmouseout="logoOut('fordscript');rolloverText('ht0')" pt.gif" ALT="Learn more about Ford Motor Company" WIDTH="521" HEIGHT="39"
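Even from this tag soup an agent can recover a little explicit metadata. As a sketch (the HTML string below is a simplified reconstruction of such a page, not Ford's actual source), Python's standard `html.parser` pulls out the meta keywords and image ALT text:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> keyword/description content and <img> ALT text."""
    def __init__(self):
        super().__init__()
        self.keywords = []
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.keywords.append(attrs.get("content", ""))
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

# Simplified, reconstructed fragment in the spirit of the slide's example
html = """<html><head><title>Ford Motor Company - Home Page</title>
<meta name="keywords" content="trucks, SUV, mazda, volvo, ford">
</head><body>
<img src="fordscript.gif" ALT="Learn more about Ford Motor Company">
</body></html>"""

p = MetaExtractor()
p.feed(html)
print(p.keywords)
print(p.alt_texts)
```

This recovers only what the page author happened to put in ALT and meta attributes; the rest of the page's meaning is still locked up in the rendering.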
5. The problem (cont.)
- We'd like computers to see meanings as well, so that computer agents could more intelligently process the web
- These desires have led to XML, RDF, agent markup languages, and a host of other proposals and technologies which attempt to impose more syntax and semantics on the web in order to make life easier for agents.
- E.g., Guha (Epinions CTO, ex-Cyc, 1999): "Very little of the information on the web is machine understandable. Need to move from a repository of data to a Web of Knowledge. RDF and the Open Directory might enable us to reach this goal."
6. Ontologies
- The answer, it is suggested, is ontologies
- Shared formal conceptualizations of particular domains: concepts, relations, objects, and constraints
- An ontology is a specification of a conceptualization that is designed for reuse across multiple applications
- Ontologies: controlled vocabularies, taxonomy, OO database schema, knowledge-representation system
- "Ontologies, as specifications of the concepts in a given field, and of the relationships among those concepts, provide insight into the nature of information produced by that field and are an essential ingredient for any attempts to arrive at a shared understanding of concepts in a field."
7. Why is this idea appealing?
- An ontology is really a dictionary. A data dictionary.
- In the world of closed company databases, one had a clear semantics for fields and tables, and the ability to combine information across them by well-specified logical means
- In the world-wide web, you have a mess
- The desire for a global or industry-wide ontology is a desire to bring back the good old days.
8. Thesis
- The problem can't and won't be solved by mandating a universal semantics for the web.
9. Nuanced Thesis (1)
- Structured knowledge is important, and there will be increasing use of structure and keys, just as we started using zip codes, and then the post office started barcoding.
- These processes all offer the opportunity to increase speed and precision, and agents will want to use them when available and reliable
- But successful agents will need to be able to work even when this information isn't there.
- The post office still delivers your mail, even when the zip code is missing or wrong.
10. Nuanced Thesis/Theses? (2)
- There will never be a complete, explicit, and unambiguous semantics for everything needed on the web, or even a non-trivial chunk of it, both because of the scale of the problem and the speed of change
- Much of the semantic knowledge needs instead to reside in the agent
- The agent needs to be able to understand the human web, by reasoning using contextual information and its own knowledge, and various kinds of text and image processing
11. XML?
- I'm not saying that XML won't be used much. It certainly will be used widely
- e.g., "News organizations moving to adopt NewsML for efficient production of electronic news" (Reuters, 11 October 2000)
- Internally, it will be used for most content (except tabular data), so that content can be easily retargeted for browsers, WAP, iMode, and whatever comes next
- Some sites will publish XML to outside users.
12. Will XML be published?
- "Another lesson of transitions is that the old way persists for a very long time. The 4.0-level browsers will be with us for the foreseeable future."
- Dave Winer (reacting to similar conclusions of Jakob Nielsen)
- If you're going to be serving HTML for the foreseeable future, why bother complicating your life by serving something else as well?
- Especially when it doesn't look better to the user
- Or people might charge for XML, while giving HTML away for free
13. XML
- Even when it is published, XML goes only a small way toward enabling knowledge transfer
- It is simply a syntax
- The same meanings can be encoded in it in many ways, and conversely, different meanings can be encoded in the same way.
- This is what suggests the need for a clearly mandated semantics for web markup
14. Explicit, usable web semantics
- Will such a thing work?
- That is, will web pages be consistently marked up with a uniform explicit semantics that is easily processed by agents, so that they don't have to deal with that messy HTML that underlies what humans look at?
- I think not. For a bunch of reasons.
15. (1) The semantics
- Are there adequate and adequately understood methods for marking up pages with such a consistent semantics, in such a way that it would support simple reasoning by agents?
- No.
16. What are some AI people saying?
- "Anyone familiar with AI must realize that the study of knowledge representation (at least as it applies to the commonsense knowledge required for reading typical texts such as newspapers) is not going anywhere fast. This subfield of AI has become notorious for the production of countless non-monotonic logics and almost as many logics of knowledge and belief, and none of the work shows any obvious application to actual knowledge-representation problems. Indeed, the only person who has had the courage to actually try to create large knowledge bases full of commonsense knowledge, Doug Lenat, is believed by everyone save himself to be failing in his attempt." (Charniak 1993: xvii-xviii)
17. (2) Many of the problems are pragmatics, not semantics
- pragmatic: relating to matters of fact or practical affairs, often to the exclusion of intellectual or artistic matters
- pragmatics: linguistics concerned with the relationship of the meaning of sentences to their meaning in the environment in which they occur
- A lot of the meaning in web pages (as in any communication) derives from the context: what is referred to in the philosophy-of-language tradition as pragmatics
- Communication is situated
18. The crêperie
- After making use of 3 different picture search engines, and spending at least ½ an hour on the site of a very dedicated French photographer, I had found the setting for my story: a crêperie.
- Well, almost. The visuals didn't really convey what I needed, so let me settle for a worse-quality picture of a gyro shop.
19. Not actually a crêperie
20. Important points
- Multimedia information sources are vital
- The meaning of a text is strongly determined by its context of use
- Indeed, you can think of language as conveying the minimal amount of information necessary given the context and assumed shared knowledge
- Humans are used to communicating even when they don't completely hear or understand the signal, even if this example is a bit extreme
21. Pragmatics on the web
- Information supplied is incomplete; humans will interpret it
- Numbers are often missing units
- A "rubber band" for sale at a stationery site is a very different item from a "rubber band" on a metal lathe
- A "sidelight" means something different to a glazier than to a regular person
- Humans will evaluate content using information about the site, and the style of writing
- value filtering
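How might an agent use site context to resolve sense differences like "rubber band"? A toy Lesk-style overlap heuristic, where the cue-word sets below are invented for illustration rather than drawn from any real lexicon:

```python
# Toy word-sense cue sets; invented for illustration only.
SENSES = {
    "rubber band": {
        "stationery": {"elastic", "paper", "office", "stationery", "clips"},
        "lathe": {"drive", "belt", "metal", "lathe", "machine", "spindle"},
    },
}

def disambiguate(term, page_words):
    """Pick the sense whose cue words overlap most with the page's words
    (a Lesk-style heuristic: context determines meaning)."""
    page = {w.lower() for w in page_words}
    senses = SENSES[term]
    return max(senses, key=lambda s: len(senses[s] & page))

print(disambiguate("rubber band", "office stationery paper clips".split()))
print(disambiguate("rubber band", "metal lathe drive spindle speeds".split()))
```

A real system would learn these context distributions from text statistics rather than enumerate them by hand, but the principle is the same: the surrounding page stands in for the missing explicit semantics.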
22. (3) The world changes
- The way in which business is being done is changing at an astounding rate
- or at least that's what the ads from e-business companies scream at us
- Semantic needs and usages evolve (like languages) more rapidly than standards (cf. the Académie française)
- People use words that aren't in the dictionary.
- Their listeners understand them.
23. Rapid change
- Last year "Rambus" wasn't a concept in computer memory classification; now it is
- Cell phones have long had attributes like size and battery life
- Now whether they support WAP is an attribute
- In a couple of years' time that attribute will probably have disappeared again
- People will introduce new products when they're ready, not when some committee has added the terms to an ontology
24. (4) Interoperation
- Ontology: a shared formal conceptualization of a particular domain
- Meaning transfer frequently has to occur across the subcommunities that are currently designing ML languages, and then all the problems reappear, and the current proposals don't do much to help
25. Many products cross industries
- http://www.interfilm-usa.com/Polyester.htm
- "Interfilm offers a complete range of SKC's Skyrol brand polyester films for use in a wide variety of packaging and industrial processes."
- Gauges: 48 - 1400
- Typical End Uses: Packaging, Electrical, Labels, Graphic Arts, Coating and Laminating
- labels: milk jugs, beer/wine, combination forms, laminated coupons,
26. Mismatches
- When interoperation involves distinct domains, or just distinct subcommunities within an industry, semantic mismatch ensues
- Local representational power conflicts with global consistency: you want to advertise your new feature
- Your own needs will take priority
- Systems will need to deal with this heterogeneity
- Integration of information across XML markup languages is scarcely easier than integration of the same information represented in HTML.
27. Semantic mismatches
- Different usages
- Cell phone = mobile phone
- Data projector = beamer
- Different levels of specialized vocabulary
- water table: the strip of wood that points outward at the bottom of the door
- hydrologists mean something very different by "water table"
- Ambiguity of reference
- Is "C.D. Manning" the same person as "Christopher Manning"?
28. Name matching / object identity knowledge
- Database theory is built around ideas of unique identifiers, determinate relational operations,
- (Human) natural language processing is built around context-embedded reasoning about issues of identity and meaning
- Around Stanford, "the president" is John Hennessy
- Elsewhere it's, well, either Gore or Bush
- Integrating information sources requires probabilistic reasoning about object identity
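The slides pose the "C.D. Manning" question without answering it, but one minimal sketch of such identity reasoning is to treat two name strings as compatible when their surnames match and the given names match at least by initial. A real record-linkage system would weigh far more evidence (affiliation, co-occurring terms, dates); this only shows the shape of the heuristic:

```python
def name_compatible(a: str, b: str) -> bool:
    """Heuristic: two name strings MAY denote the same person if surnames
    match and each paired given-name token matches in full or by initial."""
    def parse(name):
        # "C.D. Manning" -> ["c", "d", "manning"]
        return [t.rstrip(".").lower() for t in name.replace(".", ". ").split()]
    ta, tb = parse(a), parse(b)
    if ta[-1] != tb[-1]:              # surnames must agree
        return False
    for x, y in zip(ta[:-1], tb[:-1]):
        if x != y and x[0] != y[0]:   # allow initial vs. full given name
            return False
    return True

print(name_compatible("C.D. Manning", "Christopher Manning"))  # True
print(name_compatible("C. Manning", "Gerard Manning"))         # False
```

Note the asymmetry with databases: the function returns "compatible", not "identical"; turning that into a probability of coreference is exactly the contextual reasoning the slide calls for.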
29. (5) Pain but no gain
- A lot of the time people won't put in information according to standards for semantic/agent markup, even if they exist.
- Three reasons:
30. (5.1) Pain, no gain
- Laziness
- Only 0.3% of sites currently use the (simple) Dublin Core metadata standard (Lawrence and Giles 1999).
- Even fewer are likely to use something that is more work
- Why? They don't appear to perceive much value, I guess. What would change this?
31. Inconsistency: digital cameras
- Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
- Image Capture Device: Total Pixels Approx. 3.34 million; Effective Pixels Approx. 3.24 million
- Image sensor: Total Pixels Approx. 2.11 million-pixel
- Imaging sensor: Total Pixels Approx. 2.11 million; 1,688 (H) x 1,248 (V)
- CCD: Total Pixels Approx. 3,340,000 (2,140 H x 1,560 V)
- Effective Pixels: Approx. 3,240,000 (2,088 H x 1,550 V)
- Recording Pixels: Approx. 3,145,000 (2,048 H x 1,536 V)
- These all came off the same manufacturer's website!!
- And this is a very technical domain. Try sofa beds.
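To make these spec lines comparable, an agent has to normalize each to an approximate pixel count, trying the vendor's various phrasings in turn. A heuristic sketch (the fallback order and patterns are my own, not any manufacturer's schema):

```python
import re

def total_pixels(spec):
    """Heuristically normalize one vendor pixel-spec line to a pixel count.
    Tries 'N million' first, then an explicit 6+ digit figure, then H x V."""
    s = spec.replace(",", "")                       # "3,340,000" -> "3340000"
    m = re.search(r"([\d.]+)\s*million", s, re.I)   # "Approx. 3.34 million"
    if m:
        return int(round(float(m.group(1)) * 1_000_000))
    m = re.search(r"\b(\d{6,})\b", s)               # "3340000"
    if m:
        return int(m.group(1))
    m = re.search(r"(\d{3,})\s*\(?H?\)?\s*x\s*\(?(\d{3,})", s, re.I)
    if m:                                           # "2048 H x 1536 V"
        return int(m.group(1)) * int(m.group(2))
    return None

print(total_pixels("Total Pixels Approx. 3.34 million"))
print(total_pixels("2,048 H x 1,536 V"))
```

Even this toy shows the point: no markup standard was violated here, because none was used; the "schema" has to be reverse-engineered from prose.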
32. (5.2) Pain, no gain
- Sell the sizzle, not the steak
- The way businesses make money is by selling something at a profit (for more than necessary)
- The way you do this is by getting people to want it from you:
- advertising
- site stickiness ("while I'm here")
- trust
- Newspaper advertisements rarely contain spec sheets
33. (5.2) Pain, no gain
- Having an easily robot-crawlable site is a recipe for turning what you sell into a commodity
- This may open new markets
- But most would prefer not to be in this business
- Having all your goods turned into a commodity by a shopping bot isn't in your best interest:
- the profits are very low
34. (5.3) Gain, no pain
- The web is a nasty, free-wheeling place
- There are people out there who will abuse the intended use and semantics of any standard, provided they see opportunities to profit from doing so
- An agent cannot simply believe the semantics
- It will have to reason skeptically, based on all contextual and world knowledge available to it.
35. (6) Less structure to come
- "... the convergence of voice and data is creating the next key interface between people and their technology. By 2003, an estimated $450 billion worth of e-commerce transactions will be voice-commanded."
- Question: will these customers speak XML tags?
- Intel ad, NYT, 28 Sep 2000
- Data source: Forrester Research.
36. Summary so far
- With large-scale distributed information sources like the web, everyone suddenly needs to deal with highly heterogeneous data sources of uncertain correctness and value, where there are frequent semantic mismatches in which terms are used or in what they mean. Contextual information is often needed to determine the meaning or reference of terms. In other words, the problems look a lot like Natural Language Processing, regardless of whether the data is text as narrowly defined.
37. The connection to language
- Decker et al., IEEE Internet Computing (2000):
- "The Web is the first widely exploited many-to-many data-interchange medium, and it poses new requirements for any exchange format":
- Universal expressive power
- Syntactic interoperability
- Semantic interoperability
- But human languages have all these properties, and maintain superior expressivity and interoperability through their flexibility and context dependence
38. The direction to go
- Successful agents will need prior knowledge, and will use ontologies, etc., to help interpret web pages; they become a locus of semantics
- But they will also depend on contextual knowledge and reasoning in the face of uncertain information.
- They will use well-marked-up information, if available and trusted, but they will be able to extract their own metadata from information intended for humans, regardless of the form in which the information appears.
39. The scale of the problem
- The web is too big for it to be likely that humans will hand-enter metadata for most pages
- Hand-building ontologies and reasoning systems hasn't been very successful
- Agents must be able to extract propositions or relations from information intended for humans
- A useful observation in seeking this goal is that text statistics can often be used as a surrogate for world knowledge
40. Processing textual data
- Use language technology to add value to data by:
- interpretation
- transformation
- value filtering
- augmentation (providing metadata)
- Two motivations:
- The large amount of information in textual form
- Information integration needs NLP-style methods
41. Knowledge Extraction Vision
- Multi-dimensional meta-data extraction
42. Task: Text Categorization
- Take a document and assign it a label representing its content.
- Classic example: decide whether a newspaper article is about politics, business, or sports
- But there are many relevant web uses for the same technology:
- Is this page a laser printer product page?
- Does this company accept overseas orders?
- What kind of job does this job posting describe?
- What kind of position does this list of responsibilities describe?
- What position does this list of skills best fit?
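All of these web questions have the same shape as the newspaper example, so the same machinery applies. A self-contained multinomial Naive Bayes sketch (the four-document training set is invented purely to show the shape of the task):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes text categorizer with add-one smoothing."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)        # class frequencies
        self.counts = defaultdict(Counter)   # per-class word counts
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        n_docs, v = sum(self.priors.values()), len(self.vocab)
        best, best_score = None, -math.inf
        for label in self.priors:
            score = math.log(self.priors[label] / n_docs)
            total = sum(self.counts[label].values())
            for w in doc.lower().split():
                score += math.log((self.counts[label][w] + 1) / (total + v))
            if score > best_score:
                best, best_score = label, score
        return best

# Invented toy training set: product pages vs. job postings
clf = NaiveBayes().fit(
    ["laser printer 1200 dpi toner duplex",
     "inkjet printer photo paper cartridge",
     "senior java developer position salary benefits",
     "job posting software engineer full time"],
    ["product", "product", "job", "job"])

print(clf.predict("color laser printer with duplex"))  # product
print(clf.predict("engineer position with benefits"))  # job
```

In practice one would train on thousands of labeled pages, but even this skeleton illustrates why no hand-built ontology is needed to answer "is this a laser printer product page?".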
43. Task: Information Extraction / Wrapper Induction
- A lot of information that could be represented in a structured, semantically clear format isn't
- It may be costly, not desired, or not in one's control (screen scraping) to change this.
- Information extraction systems:
- Find and understand relevant parts of texts.
- Produce a structured representation of the relevant information: relations (in the DB sense)
- Goal: being able to answer semantic queries using unstructured natural language sources
44. Example: Classified Ads
- 2067206v1
- March 02, 1998
- MADDINGTON 89,000
- OPEN 1.00 - 1.45
- U 11 / 10 BERTRAM ST
- NEW TO MARKET Beautiful
- 3 brm freestanding
- villa, close to shops & bus
- Owner moved to Melbourne
- ideally suit 1st home buyer,
- investor & 55 and over.
- Brian Hazelden 0418 958 996
- R WHITE LEEMING 9332 3477
45. Real Estate Ads: Output
- Output is database tables
- But the general idea, in slot-filler format:
- SUBURB: MADDINGTON
- ADDRESS: (11, 10, BERTRAM, ST)
- INSPECTION: (1.00, 1.45, 11/Nov/98)
- BEDROOMS: 3
- TYPE: HOUSE
- AGENT: BRIAN HAZELDEN
- BUS PHONE: 9332 3477
- MOB PHONE: 0418 958 996
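As a sketch of the extraction step for a few of these slots: real systems learn their extractors (that is the "wrapper induction" part), but hand-written patterns show the slot-filler shape of the output. The ad text is trimmed from the previous slide:

```python
import re

# The ad from the previous slide, slightly trimmed
ad = """MADDINGTON 89,000
OPEN 1.00 - 1.45
U 11 / 10 BERTRAM ST
NEW TO MARKET Beautiful
3 brm freestanding
villa, close to shops & bus
Brian Hazelden 0418 958 996
R WHITE LEEMING 9332 3477"""

def extract(ad):
    """Fill a few of the slots with hand-written patterns (a sketch)."""
    slots = {}
    m = re.search(r"^([A-Z]{3,})\s+([\d,]+)", ad, re.M)   # suburb + price line
    if m:
        slots["SUBURB"] = m.group(1)
        slots["PRICE"] = int(m.group(2).replace(",", ""))
    m = re.search(r"(\d+)\s*brm", ad, re.I)               # "3 brm" -> bedrooms
    if m:
        slots["BEDROOMS"] = int(m.group(1))
    m = re.search(r"\d{4} \d{3} \d{3}", ad)               # mobile: 4-3-3 digits
    if m:
        slots["MOB_PHONE"] = m.group(0)
    m = re.search(r"\d{4} \d{4}", ad)                     # business: 4-4 digits
    if m:
        slots["BUS_PHONE"] = m.group(0)
    return slots

print(extract(ad))
```

The brittleness of these patterns (what if the price is "offers over 89,000"?) is precisely why the learned, statistical versions of this task matter.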
46. (No transcript)
47. Why doesn't text search (IR) work?
- What you search for in real estate advertisements:
- Suburbs. You might think easy, but:
- Real estate agents: Coldwell Banker, Mosman
- Phrases: "Only 45 minutes from Parramatta"
- Multiple property ads have different suburbs
- Money: you want a range, not a textual match
- Multiple amounts: "was $155K, now $145K"
- Variations: "offers in the high $700s", but not "rents for $270"
- Bedrooms: similar issues (br, bdr, beds, B/R)
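The money case shows why normalization, not string matching, is needed before range queries are possible. A heuristic sketch mapping assorted price mentions to dollar values (the offsets for "low/mid/high N00s" are invented midpoints, and a fuller version would return all mentions, not just the first):

```python
import re

def normalize_price(text):
    """Map a price mention onto a number of dollars (first match wins)."""
    t = text.lower().replace(",", "")
    m = re.search(r"\$?(\d+(?:\.\d+)?)\s*k\b", t)   # "155K"
    if m:
        return int(float(m.group(1)) * 1000)
    m = re.search(r"(low|mid|high)\s+(\d+)s", t)    # "high 700s"
    if m:
        offset = {"low": 20, "mid": 50, "high": 80}[m.group(1)]
        return (int(m.group(2)) + offset) * 1000
    m = re.search(r"\$?(\d{5,})", t)                # "$145,000"
    if m:
        return int(m.group(1))
    return None                                     # "rents for $270": not a sale price

print(normalize_price("offers in the high 700s"))
```

Once prices are numbers, "find me something under $150K" becomes a range query over extracted relations rather than a doomed keyword search.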
48. Task: Parsing (modern statistical parsers)
- A greatly increased ability to do accurate, robust, broad-coverage parsing
- Achieved by converting parsing into a classification task and using ML methods
- Statistical methods (fairly) accurately resolve structural and real-world ambiguities
- Quick: rather than cubic-time complete parse algorithms, find the best parse in linear time
- Provide probabilistic language models that can be integrated with speech recognition systems.
49. From structure to meaning
- Syntactic structures aren't meanings, but heads and dependents essentially give one relations
- orders(president, review(spectrum(wireless)))
- We don't do issues of noun phrase scope, but that's probably too hard for robust NLP
- Remaining problems: synonymy and polysemy
- Words have multiple meanings
- Several words can mean the same thing
- But there are statistical methods for these tasks
- So the goal of transforming a text into relations of facts is close
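The head-dependent relations can be rendered as the nested term on the slide. A toy recursive rendering, with a hypothetical dependency analysis of "The president orders a review of wireless spectrum" (a real parser would produce the analysis; only the rendering step is shown):

```python
def to_term(word, deps):
    """Render a head -> dependents analysis as a nested relational term."""
    args = deps.get(word, [])
    if not args:
        return word                      # leaves render as themselves
    return "%s(%s)" % (word, ", ".join(to_term(a, deps) for a in args))

# Hypothetical dependency analysis (head -> ordered dependents)
deps = {
    "orders": ["president", "review"],
    "review": ["spectrum"],
    "spectrum": ["wireless"],
}
print(to_term("orders", deps))  # orders(president, review(spectrum(wireless)))
```

The gap between this term and a meaning is exactly the slide's point: "review" and "spectrum" still need word-sense and synonymy resolution before two such terms from different pages can be matched.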
50. Precision: Semantic markup
- The story so far:
- We can get a fair way with text learning!
- In some places, moderate accuracy is okay
- But often business needs precision
- as Gio Wiederhold points out in his talks
- These methods may not offer sufficient accuracy
51. Precision: Semantic markup
- This is where semantic markup comes back in
- If a page has reliable semantic markup, such a program can use it to provide much higher accuracy levels
- Agents will need to check the provided markup
- But deciding whether provided semantic markup is trustworthy is a much easier (and hence more reliable) decision than working out the meaning from unstructured text
52. Data verification
- Humans are very good at checking whether data is reasonable:
- 5525 Beverly Place, Pittsburgh
- 361-5525
- They know if content is reasonable by content analysis
53. Data verification
- Most programs are dumb
- especially if they expect to just rely on semantic markup
- Again, one needs unstructured text classification and learning:
- one needs to check that field contents are reasonable
- Richly semantically marked-up data has a real use here, since it allows agents to continue to learn (especially as usage changes over time)
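A sketch of such a content-reasonableness check, using the address and phone examples from the previous slide. The patterns are illustrative, US-centric heuristics; a deployed verifier would be a learned classifier:

```python
import re

def plausible(field, value):
    """Cheap pattern checks that a field's content looks like its label."""
    checks = {
        # phone: only digits and separators, at least 7 characters
        "phone": lambda v: re.fullmatch(r"[\d\s()+.-]{7,}", v) is not None,
        # street address: leading house number, then a word
        "address": lambda v: re.match(r"\d+\s+[A-Za-z]", v) is not None,
    }
    return checks[field](value)

print(plausible("phone", "361-5525"))                         # True
print(plausible("phone", "5525 Beverly Place, Pittsburgh"))   # False
print(plausible("address", "5525 Beverly Place, Pittsburgh")) # True
```

This is the skeptical-agent idea in miniature: even when markup labels a field "phone", the agent checks that the contents could actually be a phone number before believing it.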
54. Conclusion
- Rich semantic markup has an important place: improving the precision of agent understanding
- But there will be no substitute for agents that can work with unstructured data
- part of that data is text: what I know about!
- but visual and other information is also incredibly important
- one really needs to use how a page looks
- All of it involves reasoning from uncertain, situated information, more in the style of NLP
55. Thank you!