Robust Local Textual Inference

1
Robust Local Textual Inference

Christopher Manning, Stanford University
Bill MacCartney, Marie-Catherine de Marneffe (U.
C. de Louvain), Teg Grenager, Daniel Cer (U.
Colorado), Rajat Raina, Christopher Cox,
Anna Rafferty, Roger Grosse, Josh Ainslie, Aria
Haghighi, Jenny Finkel, Jeff Michels, Kristina
Toutanova, and Andrew Y. Ng

2
The backdrop

There is a long, sometimes successful history of
writing, by hand, systems that understand more
deeply
Using limited vocabulary and syntax in a literal
way over limited domains. The TAUM-METEO system.
Recently, statistical/machine learning
computational linguistics has provided tools for
disambiguating natural language.
Parsers and annotators for any text from any
domain
E.g., Named Entity Recognition: Person? Company?
In August 2004, Charles Schwab came to Arizona
and opened a temporary location on 92nd Street

3
An external perspective on NLP

NLP has many successful tools with all sorts of
uses
Part of speech tagging, named entity recognition,
syntactic parsing, semantic role parsing,
coreference determination
but they concentrate on structure, not meaning
By and large, non-NLP people want systems for more
holistic semantic tasks
Text categorization
Information retrieval/web search
The state-of-the-art in these areas is (slightly
extended) bag-of-words models

4
The problem for NLP

The problem for NLP: Search engines actually work
pretty well for people
But people would like to get more from
text-processing applications
Information gathering is not just surface
expression
Answers to many questions are a bit below the
surface
This interpretation is the difference between
data and knowledge
Challenge: a tool that works robustly on any text
and understands a useful, greater amount of
sentence meaning

5
Talk Outline

The NLP Challenge: Beyond the bag of words
The Pascal task of robust local textual inference
Deep logical approaches to NLP
Answering GRE analytic section logic puzzles
(Lev, MacCartney, Levy, and Manning 2004)
Some first attempts
(Raina, Ng, and Manning 2005; Haghighi, Ng, and
Manning 2005)
A second attempt
(MacCartney, de Marneffe, Grenager, Cer, and
Manning 2006)

6
2. The PASCAL Textual Inference Task (Dagan,
Glickman, and Magnini 2005)

The task: Can systems correctly perform local
textual inferences (individual inference steps)?
On the assumption that some piece of text (T) is
true, does this imply the truth of some other
hypothesis text (H)?
Sydney was the host city of the 2000 Olympics ⇒
The Olympics have been held in Sydney TRUE
The format could be used for evaluating extended
inferential chains or knowledge
But, in practice, fairly direct, local stuff

7
The PASCAL Textual Inference Task

The task focuses on the variability of semantic
expression in language
The reverse task of disambiguation
The Dow Jones Industrial Average closed up 255
The Dow climbed 255 points today
The Dow Jones Industrial Average gained over 250
points
An abstraction from any particular application,
but directly applicable to applications

8
Natural Examples: Reading Comprehension

(CNN Student News) -- January 24, 2006
Answer the following questions about today's
featured news stories. Write your answers in the
space provided.
1. Where is the country of Somalia located? What
ocean borders this country?
2. Why did crew members from the USS Winston S.
Churchill recently stop a small vessel off the
coast of Somalia? What action did the crew of the
Churchill take?

9
Real Uses

Semantic search
Find documents about lobbyists attempting to
bribe U.S. senators
(lobbyist attempted to bribe U.S. senator)
Question answering
Who acquired Overture?
Use to score candidate answers based on passage
retrieval and named entity recognition
Customer email response
My Squeezebox regularly skips during music
playback
⇒ Sender can hear music through Squeezebox
Relation extraction (database building)
Document summarization

10
Verification of terms (Dan Roth)

Non-disclosure Agreement
WHEREAS Recipient is desirous of obtaining said
confidential information for purposes of
evaluation thereof and as a basis for further
discussions with Owner regarding assistance with
development of the confidential information for
the benefit of Owner or for the mutual benefit of
Owner and Recipient THEREFORE, Recipient
hereby agrees to receive the information in
confidence and to treat it as confidential for
all purposes. Recipient will not divulge or use
in any manner any of said confidential
information unless by written consent from Owner,
and Recipient will use at least the same efforts
it regularly employs for its own confidential
information to avoid disclosure to others.
Provided, however, that this obligation to
treat information confidentially will not apply
to any information already in Recipient's
possession or to any information that is
generally available to the public or becomes
generally available through no act or influence
of Recipient. Recipient will inform Owner of the
public nature or Recipient's possession of the
information without delay after Owner's
disclosure thereof or will be stopped from
asserting such as defense to remedy under this
agreement.
Each party acknowledges that all of the
disclosing party's Confidential Information is
owned solely by the disclosing party (or its
licensors and/or other vendors) and that the
unauthorized disclosure or use of such
Confidential Information would cause irreparable
harm and significant injury, the degree of which
may be difficult to ascertain. Accordingly, each
party agrees that the disclosing party will have
the right to obtain an immediate injunction
enjoining any breach of this Agreement, as well
as the right to pursue any and all other rights
and remedies available at law or in equity for
such a breach.
Recipient will exercise its best efforts to
conduct its evaluation within a reasonable time
after Owner's disclosure and will provide Owner
with its assessment thereof without delay.
Recipient will return all information, including
all copies thereof, to Owner upon request. This
agreement shall remain in effect for ten years
after the date of its execution, and it shall be
construed under the laws of the State of Texas.
Conditions I care about
All information discussed is freely shareable
unless other party indicates in advance that it
is confidential
TRUE? FALSE?

11
PASCAL RTE Examples
Should be easy

T: iTunes software has seen strong sales in
Europe.
H: Strong sales for iTunes in Europe. TRUE
T: The anti-terrorist court found two men guilty
of murdering Shapour Bakhtiar and his secretary
Sorush Katibeh, who were found with their throats
cut in August 1991.
H: Shapour Bakhtiar died in 1991. TRUE
T: Like the United States, U.N. officials are
also dismayed that Aristide killed a conference
called by Prime Minister Robert Malval in
Port-au-Prince in hopes of bringing all the
feuding parties together.
H: Aristide had Prime Minister Robert Malval
murdered in Port-au-Prince. FALSE

Note: not entailed!
They're allowed to try to trick you
12
Evaluation

The notion of inference is as would typically be
interpreted by people, assuming common human
understanding of language and common background
knowledge.
Not entailment according to some linguistic
theory
High agreement on this data: human accuracy is
about 95%
Accuracy: you correctly say whether the
hypothesis does or does not follow from the text
Confidence-weighted score (CWS), or average accuracy
Rank all n pairs by system-supplied confidence
Use ranking to define a weighted average
Tests whether you know what you know
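
The ranking-based score above can be sketched as follows. This assumes the usual RTE1 definition of the confidence-weighted score: sort the pairs by decreasing confidence and average, over every rank i, the fraction of correct judgments among the top i. The function names and the toy judgments are illustrative, not taken from the talk.

```python
def accuracy(judgments):
    """Fraction of (confidence, is_correct) judgments that are correct."""
    return sum(1 for _, ok in judgments if ok) / len(judgments)


def confidence_weighted_score(judgments):
    """Average, over ranks i, of (#correct among the top i) / i.

    judgments: list of (confidence, is_correct), one per T/H pair.
    Pairs are ranked by decreasing system-supplied confidence.
    """
    ranked = sorted(judgments, key=lambda j: j[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += int(is_correct)
        total += correct_so_far / rank
    return total / len(ranked)


# Two hypothetical systems with identical accuracy (0.5):
# one is confident where it is right, the other where it is wrong.
well_calibrated = [(0.9, True), (0.8, True), (0.3, False), (0.2, False)]
badly_calibrated = [(0.9, False), (0.8, False), (0.3, True), (0.2, True)]

print(accuracy(well_calibrated), confidence_weighted_score(well_calibrated))
print(accuracy(badly_calibrated), confidence_weighted_score(badly_calibrated))
```

Both toy systems have the same accuracy, but the first scores much higher on CWS because its errors sit at the bottom of the ranking, which is the sense in which the measure tests whether you know what you know.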

13
3. Logics: mapping from NL to Reasoning (GRE/LSAT
logic puzzles)

Six sculptures (C, D, E, F, G, and H) are to be
exhibited in rooms 1, 2, and 3 of an art gallery.
Sculptures C and E may not be exhibited in the
same room.
Sculptures D and G must be exhibited in the same
room.
If sculptures E and F are exhibited in the same
room, no other sculpture may be exhibited in that
room.
At least one sculpture must be exhibited in each
room, and no more than three sculptures may be
exhibited in any room.
4. If sculpture D is exhibited in room 1 and
sculptures E and F are exhibited in room 2, which
of the following must be true?
(A) Sculpture C must be exhibited in room 1.
(B) Sculpture H must be exhibited in room 3.
(C) Sculpture G must be exhibited in room 1.
(D) Sculpture H must be exhibited in room 2.
(E) Sculptures C and H must be exhibited in the
same room.
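
Once the puzzle is formalized, the reasoning itself is small enough to brute-force. The sketch below hand-encodes the four rules and question 4 as Python predicates and enumerates all room assignments. It is only an illustration of the target formalization, not the system described in this talk, and all names in it are made up for the example.

```python
from itertools import product

SCULPTURES = "CDEFGH"
ROOMS = (1, 2, 3)


def satisfies_rules(room_of):
    """Check the four puzzle rules for one assignment sculpture -> room."""
    occupants = {r: [s for s in SCULPTURES if room_of[s] == r] for r in ROOMS}
    if room_of["C"] == room_of["E"]:          # C and E not in the same room
        return False
    if room_of["D"] != room_of["G"]:          # D and G in the same room
        return False
    if room_of["E"] == room_of["F"] and len(occupants[room_of["E"]]) > 2:
        return False                          # E and F together: nobody else
    # At least one and no more than three sculptures per room.
    return all(1 <= len(occupants[r]) <= 3 for r in ROOMS)


def models(extra):
    """Yield every assignment that satisfies the rules plus a condition."""
    for rooms in product(ROOMS, repeat=len(SCULPTURES)):
        room_of = dict(zip(SCULPTURES, rooms))
        if satisfies_rules(room_of) and extra(room_of):
            yield room_of


def question4(m):
    return m["D"] == 1 and m["E"] == 2 and m["F"] == 2


CHOICES = {
    "A": lambda m: m["C"] == 1,
    "B": lambda m: m["H"] == 3,
    "C": lambda m: m["G"] == 1,
    "D": lambda m: m["H"] == 2,
    "E": lambda m: m["C"] == m["H"],
}


def must_be_true():
    """Labels of the choices that hold in every model of question 4."""
    consistent = list(models(question4))
    return [label for label, holds in CHOICES.items()
            if consistent and all(holds(m) for m in consistent)]


print(must_be_true())
```

Only choice (C) holds in every satisfying assignment: G must share room 1 with D, while C and H can still be placed in more than one way.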

14
The GRE logic puzzles domain

An English description of a constraint
satisfaction problem, followed by questions about
satisfying assignments
Answers cannot be found in the text by surface
question answering methods (e.g., TREC QA)
Formalization and logical inference are necessary
Obtaining proper formalization requires
Accurate syntactic parsing
Resolving semantic ambiguities (scope,
co-reference)
Discourse analysis
Easy to test (found test material)
If the formalization is right, the reasoning is
easy
No ambiguity or subjectivity about the correct
answer

15
Challenges

For most puzzles, the puzzle type, the variables,
and values for assignments are not obvious
Mrs. Green wishes to renovate her cottage.
She hires the services of a plumber, a carpenter,
a painter, an electrician, and an interior
decorator. The renovation is to be completed in a
period of one working week, i.e., Monday to
Friday. Every worker will be taking one complete
day to do his job. Mrs. Green will allow just one
person to work per day.
The painter will do his work only after the
plumber and the carpenter have completed their
jobs.
The interior decorator has to complete his job
before that of the electrician.
The type of this puzzle is a constrained linear
ordering of things (here, contractors)

16
Scope Needs to be Resolved!

At least one sculpture must be exhibited in each
room.
The same sculpture in each room?
No more than three sculptures may be exhibited in
any room.
Reading 1: For every room, there are no more
than three sculptures exhibited in it.
Reading 2: Only three or fewer sculptures are
exhibited (the rest are not shown).
Reading 3: Only a certain set of three or fewer
sculptures may be exhibited in any room (for the
other sculptures there are restrictions in
allowable rooms).
Some readings will be ruled out by being
uninformative or by contradicting other
statements
Otherwise we must be content with probability
distributions over scope-resolved semantic forms
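
As a worked illustration of the first ambiguity ("At least one sculpture must be exhibited in each room"), the two quantifier scopings can be written out in first-order notation. The predicate names are chosen here for illustration only.

```latex
% Reading A: each room has some sculpture of its own (the sensible reading)
\[
\forall r\, \bigl(\mathrm{room}(r) \rightarrow
  \exists s\, (\mathrm{sculpture}(s) \wedge \mathrm{exhibited}(s, r))\bigr)
\]

% Reading B: one and the same sculpture is exhibited in every room
\[
\exists s\, \bigl(\mathrm{sculpture}(s) \wedge
  \forall r\, (\mathrm{room}(r) \rightarrow \mathrm{exhibited}(s, r))\bigr)
\]
```

Reading B entails Reading A but not the other way round, and Reading B is the kind of reading that gets ruled out by contradicting other knowledge, since a sculpture cannot stand in three rooms at once.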

17
System overview (Lev, MacCartney, Levy, and
Manning 2004)
English text
parse trees
SL formulas
FOL formulas
URs
DL formulas
correct answer
18
Semantic logic (SL)

Our goal is a translation to First Order Logic
(FOL)
But FOL is ungainly, and far from NL
NL has events, plurals, modalities, complex
quantifiers
Intermediate representation: semantic logic (SL)
Event and group variables
Modal operators: □ (necessarily) and ◇ (possibly)
Generalized quantifiers: Q(type, var, restrictor,
body)
Our example becomes
□ Q(∀, x1, room(x1), Q(≥1, x2, sculpture(x2), ∃e
exhibit(e) ∧ patient(e, x2) ∧ in(e, x1)))
¬◇ Q(∃, y, room(y), Q(>3, g, sculpture(g), ∃e
exhibit(e) ∧ patient(e, g) ∧ in(e, y)))
More compact, more natural

19
Combinatorial semantics

Aim is to assign a semantic representation
(roughly, a lambda expression) to each semantic
unit
The hope is to use a small lexicon for
semantically potent words and to synthesize
semantics for open class words

every dog barks (S): ∀x.(dog(x) → bark(x))
every dog (NP): λQ.∀x.(dog(x) → Q@x)
barks (VP): λx.bark(x)
every (Det): λP.λQ.∀x.(P@x → Q@x)
dog (Noun): λx.dog(x)
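
The composition above can be mimicked directly with Python closures. This is a toy sketch: the lambda terms are evaluated against a small hypothetical model (the individuals and the sets DOGS, BARKERS, MEOWERS are invented for the example), rather than being built up as logical formulas.

```python
# Hypothetical toy model: a domain of individuals and predicate extensions.
DOMAIN = {"rex", "fido", "tom"}
DOGS = {"rex", "fido"}
BARKERS = {"rex", "fido"}
MEOWERS = {"tom"}

# Lexicon: each word denotes a function, mirroring the lambda terms.
dog = lambda x: x in DOGS                  # Noun: λx.dog(x)
barks = lambda x: x in BARKERS             # VP:   λx.bark(x)
meows = lambda x: x in MEOWERS             # VP:   λx.meow(x)

# Det: λP.λQ.∀x.(P@x → Q@x), with the implication written as (not P) or Q.
every = lambda P: lambda Q: all((not P(x)) or Q(x) for x in DOMAIN)

# Function application follows the parse tree: (every dog) barks.
every_dog = every(dog)                     # NP: λQ.∀x.(dog(x) → Q@x)
print(every_dog(barks))                    # S:  ∀x.(dog(x) → bark(x))
print(every_dog(meows))
```

Only the determiner needs a hand-written entry with real logical content; the open-class words just denote their extensions, which matches the hope of a small lexicon for semantically potent words.
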
20
FOL Reasoning module

Complementary reasoning engines
A theorem prover (TP) is used to show that a set
of formulas is inconsistent (proof by
contradiction)
A model builder (MB) is used to show that a set
of formulas is consistent (proof by example)
Idea: harness TP and MB in tandem
"Could" questions: examine each answer choice
MB says choice consistent → choice is correct
TP says choice inconsistent → choice is incorrect
"Must" questions: examine negation of each choice
MB says negation consistent → choice is incorrect
TP says negation inconsistent → choice is correct
Just a theorem prover is not enough
Can't handle "could be true" questions properly
Despite finite domain, some proofs too deep to
find
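
The could/must logic above can be sketched on the cottage-renovation puzzle from the Challenges slide. In this sketch exhaustive enumeration of orderings stands in for both engines: finding a model plays the model builder's role, and exhausting the space without finding one plays the theorem prover's role. This is an assumption made for illustration; real provers and model builders do not enumerate like this, and the hand-coded constraints are not output of the system.

```python
from itertools import permutations

WORKERS = ("plumber", "carpenter", "painter", "electrician", "decorator")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri")


def all_models():
    """Yield every schedule (worker -> day index) satisfying the puzzle."""
    for order in permutations(WORKERS):
        day = {worker: i for i, worker in enumerate(order)}
        if (day["painter"] > day["plumber"]
                and day["painter"] > day["carpenter"]
                and day["decorator"] < day["electrician"]):
            yield day


def consistent(claim):
    """True if some model of the puzzle also satisfies the claim."""
    return any(claim(m) for m in all_models())


def could_be_true(claim):
    # A choice could be true iff it is consistent with the puzzle.
    return consistent(claim)


def must_be_true(claim):
    # A choice must be true iff its negation is inconsistent with the puzzle.
    return not consistent(lambda m: not claim(m))


def on(worker, day_name):
    """Claim: the given worker works on the given day."""
    return lambda m: m[worker] == DAYS.index(day_name)


print(could_be_true(on("painter", "Wed")))   # a schedule exists
print(could_be_true(on("painter", "Mon")))   # no schedule exists
print(must_be_true(on("painter", "Fri")))    # negation is consistent
```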

21
How far did we get?

Worked to be able to handle the sculptures
example (set of 6 questions) completely
Worked to be able to do a second problem
What about new puzzle texts?
Statistical parse is correct (fully usable) in
about 60% of cases
Main problem is unhandled semantic phenomena,
e.g., "different", "except", "a complete list",
VP ellipsis, ...
Only 1 out of 21 questions actually doable start
to end!

22
Pascal RTE comparison

Bos and Markert (2005) used a similar theorem
prover/model builder combination as part of a
two-strategy entry in RTE1.
Indeed, our logic puzzles approach was strongly
influenced by Bos's work
Coverage/correctness of approach
Found proof/contradiction for 30
pairs (3.757.5)
Of these, 23 were correct (77%)
Example error
T: Microsoft was established in Italy in 1985.
H: Microsoft was established in 1985.

23
How real world textual inference differs from
logical semantics

Modals
Text: Researchers at the Harvard School of Public
Health say that people who drink coffee may be
doing a lot more than keeping themselves awake -
this kind of consumption apparently also can help
reduce the risk of diseases.
⇒ Hypothesis: Coffee drinking has health benefits.
(RTE1 ID 19)
"May" is a discourse hedge, not a possible-worlds
modal
Reported views/speech
Text: According to the Encyclopedia Britannica,
Indonesia is the largest archipelagic nation in
the world, consisting of 13,670 islands.
⇒ Hypothesis: 13,670 islands make up Indonesia.
(RTE1 ID 605)
The source is given for the information, not to
suggest that its truth is unknown

24
Speaker meaning

The Pascal RTE task can be taken as an applied
test of human notions of speaker meaning
It clearly goes beyond the literal meaning of the
text
Recanati (2004: 19) proposes regarding "what is
said" as what a normal interpreter would
understand as being said, in the context at
hand.
Pascal RTE could be viewed as operationalizing
such a criterion.

25
4. Tackling robust textual inference: Weighted
Abductive inference

Idea (Raina, Ng, and Manning 2005)
Represent text and hypothesis as logical
formulae.
A hypothesis can be inferred from the text if and
only if the hypothesis logical formula can be
proved from the text logical formula (at some
cost).
Toy example

T: Kidnappers released a Filipino hostage.
(∃ A, B, E) Kidnappers(A) ∧ released(E, A, B) ∧ Filipino(B) ∧ hostage(B)
H: A Filipino hostage was freed. TRUE
(∃ X, F) Filipino(X) ∧ hostage(X) ∧ freed(F, X)
Prove?
Weighted abduction: allow assumptions at various
costs, e.g. released(p, q, r) ⇒ freed(s, r) at
cost 2 (Hobbs et al., 1993)
26
Representation Example: Bill's mother walked to
the grocery store
Logical formula: mother(A) ∧ Bill(B) ∧ poss(B,
A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)
[Figure: the formula annotated with the labels VBD, PERSON, and ARGM-LOC]

Can and do make this representation richer
"walked" is a verb (VBD)
"Bill" is a PERSON (named entity).
Add semantic roles: "store" is the
location/destination of "walked" (ARGM-LOC).

27
Linguistic preprocessing

High performance Named Entity Recognizers
(Finkel et al. 2005)
Canonicalization: quantity, date, and money
expressions
Normalized dates and relational expressions of
amount (> 200)
T: Kessler's team conducted 60,643 face-to-face
interviews with adults in 14 countries.
H: Kessler's team interviewed more than 60,000
adults in 14 countries.
Statistical parser.
Update data a little for 2005: Al Qaeda →
[Aa]l[ -]Qa'?[ie]da
Collocations if appearing in WordNet (Bill
hung_up the phone)
Semantic role identification: PropBank roles
(Toutanova et al. 2005)
Coreference
T: Since its formation in 1948, Israel ... H:
Israel was established in 1948.
Heuristics to find event nouns (the murder of
police commander)
Hand-built acronyms, country and nationality,
factive verbs

28
How can we model abductive assumption costs in
proof?

Consider assumptions that unify pairs of terms.
Need to assign cost C(A) to all assumptions A of
the form

Possible considerations

Predicate match cost: Synonym cost? Are they similar? Same named entity type?
Argument match cost: Same semantic role? Coreference for constants?
29
Abductive assumptions

Compute features f(A) = (f1(A), f2(A), …, fD(A)) of
A.
Given feature weights w = (w1, w2, …, wD), define

Each such assumption provides a potential proof
step.
Can find a minimum cost complete proof by uniform
cost search.
Output TRUE iff this proof has cost < a threshold
wD+1.
Weak proof theory!!
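
A minimal sketch of the search just described, on the Filipino-hostage toy example. It assumes the cost of an assumption is the weighted sum of its features, which is consistent with the later remark that the final cost is linear in w; the defining formula itself did not survive in this transcript. The feature names, weights, threshold, and candidate assumptions are all hypothetical.

```python
import heapq

# Assumed feature order: [synonym_step, argument_mismatch, unsupported].
WEIGHTS = (2.0, 1.0, 10.0)   # hypothetical, not learned values
THRESHOLD = 3.0              # plays the role of the threshold weight


def assumption_cost(features, weights=WEIGHTS):
    """Assumed linear cost: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


# For each hypothesis literal: candidate (text literal, features) pairs.
CANDIDATES = {
    "Filipino(X)": [("Filipino(B)", (0, 0, 0))],
    "hostage(X)": [("hostage(B)", (0, 0, 0))],
    "freed(F, X)": [("released(E, A, B)", (1, 0, 0)),
                    ("<assumed outright>", (0, 0, 1))],
}


def min_cost_proof(goals, candidates):
    """Uniform cost search over partial proofs, cheapest expanded first.

    A state is (cost so far, number of goals proved, steps taken).
    """
    frontier = [(0.0, 0, ())]
    while frontier:
        total, proved, steps = heapq.heappop(frontier)
        if proved == len(goals):
            return total, list(steps)
        goal = goals[proved]
        for text_literal, features in candidates.get(goal, []):
            step = (goal, text_literal)
            heapq.heappush(frontier, (total + assumption_cost(features),
                                      proved + 1, steps + (step,)))
    return float("inf"), []


GOALS = ["Filipino(X)", "hostage(X)", "freed(F, X)"]
cost, proof = min_cost_proof(GOALS, CANDIDATES)
entailed = cost < THRESHOLD
print(cost, entailed)
for goal, source in proof:
    print(goal, "<-", source)
```

In this toy the goals are independent, so a greedy choice per literal would find the same proof; uniform cost search matters once assumptions interact, for example through shared variable bindings, which this sketch leaves out.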

30
Can we learn the assumption costs?

Intuition: Given a data set, find assumptions
that are used in the proofs for TRUE examples,
and lower their costs.
The minimum cost proof Pmin consists of a
sequence of assumptions A1, A2, …, AN.
Construct a feature vector for the proof Pmin
If Pmin is given, the final cost for an example
is linear in w.
However, the overall feature vector is computed
by abductive theorem proving, which uses w
internally!
Solve by an iterative procedure guaranteed to
converge to a local maximum of the (nonconvex)
likelihood function.

31
Example features

Zero cost to match same item with same arguments
Low cost to unify things listed in WordNet as
synonyms
Higher cost to match something with vague LSA
similarity
Higher cost if arguments of verb mismatch
Antonyms/Negation: High cost for unifying verbs,
if they are antonyms or one is negated and the
other not.
T: Stocks fell. H: Stocks rose. FALSE
T: Clinton's book was not a hit. H: Clinton's
book was a hit. FALSE
Non-factive verbs: High cost for unifying verbs,
if only one is modified by a non-factive verb.
T: John was charged for doing X. H: John
did X. FALSE

32
Results

Evaluate on PASCAL RTE1 dataset.
Development set of 567 examples.
Test set of 800 examples.
Divided into 7 tasks (motivating applications)
Balanced number of true/false examples.
Output TRUE/FALSE, confidence value.
Empirically found to be tough.
Baselines
TF, TFIDF
Standard information retrieval algorithms.
Ignore natural language syntax.

33
RTE1 Results (Raina et al. 2005)
                     Baselines (Acc/CWS)       General            ByTask
                     tf           tf.idf       Accuracy  CWS      Accuracy  CWS
DevSet 1                                       64.8      0.778    65.5      0.805
DevSet 2                                       52.1      0.578    55.7      0.661
DevSet 1 + DevSet 2                            58.5      0.679    60.8      0.743
Test Set             49.5/0.548   51.8/0.56    56.3      0.620    55.3      0.686

Difficult! Best other results: Accuracy = 58.6,
CWS = 0.617

34
Partial coverage accuracy results
Both know something, but task-specific
optimization is better!
[Plot: accuracy at partial coverage; legend: ByTask, General, Random]
35
5. New System Architecture (MacCartney, Grenager,
de Marneffe, Cer, Manning, HLT-NAACL 2006)
An inference
Linguistic Preprocessing
Aligner
Inferer
Answer R? yes, no
36
Why the old approach was broken!

P: DeLay bought Enron stock and Clinton sold
Enron stock
H: DeLay sold Enron stock

Yes
No
Probably, yes
37
Why we need sloppy matching

Passage: Today's best estimate of giant panda
numbers in the wild is about 1,100 individuals
living in up to 32 separate populations mostly in
China's Sichuan Province, but also in Shaanxi and
Gansu provinces.
Hypothesis 1: There are 32 pandas in the wild in
China. (FALSE)
Hypothesis 2: There are about 1,100 pandas in the
wild in China. (TRUE)
We'd like to get this right, but we just don't
have the technology to fully understand how to get
from "best estimate of giant panda numbers in the
wild is about 1,100" to "there are about 1,100
pandas in the wild"

38
A solution: Align, then evaluate

P: DeLay bought Enron stock and Clinton sold
Enron stock
H: DeLay sold Enron stock

39
Things we aim to fix (MacCartney, Grenager, de
Marneffe, Cer, Manning, HLT-NAACL 2006)

Confounding of alignment and entailment
Assumption on monotonicity
Matching/embedding methods assume upward
monotonicity
Sue saw Les Miserables in London ⇒
Sue saw Les Miserables
But
Fedex began business in Zimbabwe in 2003 ⇏
Fedex began business in 2003
Assumption/requirement of locality

40
Whether an alignment is good depends on non-local
factors
1. P: Some students came to school by car. Q: Did
any students come to school? A: Yes
2. P: No students came to school by car. Q: Did any
students come to school? A: Don't know
Context of monotonicity: whether it is okay to have
"by car" as extra material in the hypothesis
depends on the subject quantifier
3. P: It is not the case that Bin Laden was seen
in Tora Bora. Q: Was Bin Laden seen in Tora
Bora? A: No
It's difficult to see the non-factive context when
aligning seen → seen
41
Representation/alignment example

T: Mitsubishi Motors Corp.'s new vehicle sales in
the US fell 46 percent in June.
H: Mitsubishi sales rose 46 percent.
Answer: not entailed

Alignment from hypothesis to text
rose → fell
sales → sales
Mitsubishi → Mitsubishi_Motors_Corp.
percent → percent
46 → 46

Alignment score: 0.89
Features:
Aligned antonyms in pos/pos context
Structure: main predicate good match
Numeric: quantity match
Date: text date deleted in hypothesis
Alignment: good

Inference score: -5.42 → FALSE
42
Modal Inferer

Identify aligned roots
Determine modality of each root
Using linguistic features
e.g. can, perhaps, might → POSSIBLE
Six canonical modalities
POSSIBLE, NOT_POSSIBLE, ACTUAL, NOT_ACTUAL,
NECESSARY, NOT_NECESSARY
Look up judgment for modality pair
(POSSIBLE, ACTUAL) → dontknow
(NECESSARY, NOT_ACTUAL) → no
(ACTUAL, POSSIBLE) → yes
P: The Scud C has a range of 500 kilometers and
is manufactured in Syria with know-how from North
Korea.
H: A Scud C can fly 500 kilometers.
(ACTUAL, POSSIBLE) → yes
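
The lookup step above can be sketched as a small table. Only the three modality pairs and the three POSSIBLE cue words listed on this slide are encoded; treating an unmarked clause as ACTUAL, and working on raw tokens instead of aligned roots, are simplifying assumptions of this sketch.

```python
# Cue words taken from the slide; everything without a cue is treated as
# ACTUAL, which is an assumption of this sketch.
POSSIBLE_CUES = {"can", "perhaps", "might"}

# Only the three (text, hypothesis) pairs given on the slide.
JUDGMENT = {
    ("POSSIBLE", "ACTUAL"): "dontknow",
    ("NECESSARY", "NOT_ACTUAL"): "no",
    ("ACTUAL", "POSSIBLE"): "yes",
}


def modality(tokens):
    """Very rough modality detector based on surface cue words."""
    lowered = {t.lower() for t in tokens}
    return "POSSIBLE" if POSSIBLE_CUES & lowered else "ACTUAL"


def modal_judgment(text_tokens, hyp_tokens):
    """Look up the judgment; None if the pair is not in the partial table."""
    pair = (modality(text_tokens), modality(hyp_tokens))
    return JUDGMENT.get(pair)


text = "The Scud C has a range of 500 kilometers".split()
hyp = "A Scud C can fly 500 kilometers".split()
print(modal_judgment(text, hyp))   # (ACTUAL, POSSIBLE)
print(modal_judgment(hyp, text))   # (POSSIBLE, ACTUAL)
```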

43
Factives & other implicatives

T: Libya has tried, with limited success, to
develop its own indigenous missile, and to extend
the range of its aging SCUD force for many years
under the Al Fatah and other missile programs.
H: Libya has developed its own domestic missile
program.
Answer: not entailed. "Tried to X" does not
entail "X".
Evaluate governing verbs for implicativity
Unknown: say, tell, suspect, try, ...
Fact: know, wonderful, ...
True: manage to, ...
False: doubtful, misbelieve, ...
Need to check for negative context
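
A rough sketch of the governing-verb check, using only the words listed on this slide. How each class behaves under negation is not spelled out here, so the negation handling below follows the textbook behaviour of factives and two-way implicatives and should be read as an assumption; the "False" class under negation is deliberately left as "dontknow".

```python
# Classes and example words as listed on the slide.
IMPLICATIVITY = {
    "say": "UNKNOWN", "tell": "UNKNOWN", "suspect": "UNKNOWN", "try": "UNKNOWN",
    "know": "FACT", "wonderful": "FACT",
    "manage": "TRUE",
    "doubtful": "FALSE", "misbelieve": "FALSE",
}


def complement_status(governor, negated=False):
    """Judge whether the embedded clause holds: 'yes', 'no', or 'dontknow'.

    Returns None for governors that are not in the small lexicon.
    """
    cls = IMPLICATIVITY.get(governor)
    if cls is None:
        return None
    if cls == "UNKNOWN":
        return "dontknow"
    if cls == "FACT":
        # Assumed: factives presuppose the complement even when negated.
        return "yes"
    if cls == "TRUE":
        # Assumed: "managed to X" gives X; "didn't manage to X" gives not X.
        return "no" if negated else "yes"
    # cls == "FALSE"; the negated case is left open in this sketch.
    return "dontknow" if negated else "no"


print(complement_status("try"))                    # Libya tried to develop ...
print(complement_status("manage"))
print(complement_status("manage", negated=True))
```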

44
Numeric Mismatches

Check alignment of number, date, money nodes
T: The Pew Internet Life survey interviewed
people in 26 countries.
H: The Pew Internet Life study interviewed people
in more than 20 countries.
T: BioPort Corp. of Lansing, Michigan is the
sole U.S. manufacturer of an anthrax vaccine.
H: There are three U.S. manufacturers of anthrax
vaccine.
"three" is aligned to NO_WORD here; sole → 1?
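
A sketch of the numeric check for aligned quantity nodes: each phrase is canonicalized to a comparison operator plus a value, and an exact quantity in the text is tested against the hypothesis constraint. The list of comparison phrases and the function names are assumptions made for illustration.

```python
import operator
import re

OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

# Hypothetical list of comparison phrases and their operators.
PREFIXES = [("more than", ">"), ("over", ">"), ("at least", ">="),
            ("less than", "<"), ("fewer than", "<"), ("at most", "<=")]


def parse_quantity(phrase):
    """Canonicalize e.g. 'more than 60,000 adults' to ('>', 60000.0)."""
    phrase = phrase.lower().strip()
    op = "="
    for prefix, symbol in PREFIXES:
        if phrase.startswith(prefix):
            op, phrase = symbol, phrase[len(prefix):]
            break
    match = re.search(r"\d[\d,]*(?:\.\d+)?", phrase)
    if match is None:
        raise ValueError("no number found in phrase")
    return op, float(match.group().replace(",", ""))


def text_supports(text_phrase, hyp_phrase):
    """True/False if an exact text quantity satisfies/violates the
    hypothesis constraint; None if the text quantity is not exact."""
    t_op, t_val = parse_quantity(text_phrase)
    h_op, h_val = parse_quantity(hyp_phrase)
    if t_op != "=":
        return None
    return OPS[h_op](t_val, h_val)


print(text_supports("26 countries", "more than 20 countries"))
print(text_supports("60,643 face-to-face interviews", "more than 60,000 adults"))
print(text_supports("1 manufacturer", "3 manufacturers"))
```

The last call presupposes that "sole" and "three" have already been mapped to 1 and 3, which is exactly the step the slide flags as an open question.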

45
Restrictive adjuncts

We can check whether adding/dropping restrictive
adjuncts is licensed relative to upward and
downward entailing contexts
In all, Zerich bought $422 million worth of oil
from Iraq, according to the Volcker committee
⇏ Zerich bought oil from Iraq during the embargo
Zerich didn't buy any oil from Iraq, according to
the Volcker committee
⇒ Zerich didn't buy oil from Iraq during the
embargo
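
The licensing rule behind the Zerich pair fits in a few lines: dropping a restrictive adjunct is safe in an upward-entailing context, adding one is safe in a downward-entailing context. The trigger-word list and the token-counting polarity test below are crude stand-ins for a real analysis, assumed purely for illustration.

```python
# Hypothetical list of words that flip the entailment direction.
DOWNWARD_TRIGGERS = {"no", "not", "never", "few", "didn't", "doesn't", "don't"}


def context_of(tokens):
    """Guess monotonicity from the parity of downward triggers."""
    flips = sum(1 for t in tokens if t.lower() in DOWNWARD_TRIGGERS)
    return "downward" if flips % 2 else "upward"


def adjunct_edit_licensed(edit, context):
    """edit is 'drop' or 'add': what the hypothesis does to a restrictive
    adjunct relative to the text. context is 'upward' or 'downward'."""
    if edit not in ("drop", "add") or context not in ("upward", "downward"):
        raise ValueError("unexpected edit or context")
    return ((edit == "drop" and context == "upward")
            or (edit == "add" and context == "downward"))


positive = "In all, Zerich bought $422 million worth of oil from Iraq".split()
negative = "Zerich didn't buy any oil from Iraq".split()

# In both pairs the hypothesis adds "during the embargo".
print(adjunct_edit_licensed("add", context_of(positive)))
print(adjunct_edit_licensed("add", context_of(negative)))
```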

46
What do we have?

Not full, deep semantics
But it still isn't possible to do logical
inference for open domain robust textual
inference (with real data)
We do inference-pattern matching
On semantic dependency graphs, not surface
patterns
Calculate rich semantic features
Adjunct_deletion_licensed_relative_to_universal
Related to the notion of natural logic

47
Natural logic

A logic whose vehicle of inference is natural
language (syntactic structures)
No translation into conventional logical notation
Aristotle's syllogisms → Leibniz (who coined
term) → Lakoff → van Benthem → Sánchez Valencia
Natural logic lets us sidestep having to fully
translate sentences into an accurate semantic
representation
Exercise: accurately translate into FOL:
According to Ruiz, police may have been reluctant
to enter the building before they were convinced
that most of the weapons had been found.

The police found few weapons.
All guns are weapons.
The police found few guns.
?
48
Our RTE2 Results
Learned Accuracy
Dev Set 67.0
Test Set 60.5
49
Most useful features

Positive
Structural match
Good alignment score
Modal: yes
Polarity: text and hypothesis both negative
polarity
Negative
Date: inserted/mismatched
Structure: clear mismatch
Quantifier mismatch
Bad alignment score
Different polarity
Modal: no/don't know

50
Things that it's hard to do

Non-entailment is easier than entailment
Good at finding knock-out features
Hard to be certain that we've considered
everything
Dealing with dropping/adding modifiers vs.
upward/downward entailing contexts is hard
Need to know which are restrictive/not/discourse
items
Maurice was subsequently killed in Angola.
Multiword lexical semantics/world knowledge
We're pretty good at synonyms, hyponyms, antonyms
But we can't resolve a lot of multi-word
equivalences
T: David McCool took the money and decided to
start Muzzy Lane in 2002
H: David McCool is the founder of Muzzy Lane

51
Envoi

What I've shown
The beginnings of an ability to do robust textual
inference
Still lots of room for improving/fixing
everything!
Potential for applications like evidence
extraction
Find passages suggesting price fixing by Enron
Moderate precision is still useful for reranking
applications like semantic search/question
answering
Meta question: the path to semantics for NLP
Hand-built logical methods still don't scale
Is continuing to annotate data the only answer?
How much can we do with unsupervised learning?
With hand-built resources?