Title: Robust Local Textual Inference
1Robust Local Textual Inference
- Christopher Manning, Stanford University
- Bill MacCartney, Marie-Catherine de Marneffe (U.
C. de Louvain), Teg Grenager, Daniel Cer (U.
Colorado), Rajat Raina, Christopher Cox,
Anna Rafferty, Roger Grosse, Josh Ainslie, Aria
Haghighi, Jenny Finkel, Jeff Michels, Kristina
Toutanova, and Andrew Y. Ng
2The backdrop
- There is a long, sometimes successful history of
writing, by hand, systems that understand more deeply
- Using limited vocabulary and syntax in a literal
way over limited domains, e.g., the TAUM-METEO system
- Recently, statistical/machine learning
computational linguistics has provided tools for
disambiguating natural language
- Parsers and annotators for any text from any
domain
- E.g., Named Entity Recognition: Person? Company?
- In August 2004, Charles Schwab came to Arizona
and opened a temporary location on 92nd Street
3An external perspective on NLP
- NLP has many successful tools with all sorts of
uses
- Part of speech tagging, named entity recognition,
syntactic parsing, semantic role parsing,
coreference determination
- but they concentrate on structure, not meaning
- By and large, non-NLP people want systems for more
holistic semantic tasks
- Text categorization
- Information retrieval/web search
- The state of the art in these areas is (slightly
extended) bag-of-words models
4The problem for NLP
- The problem for NLP: search engines actually work
pretty well for people
- But people would like to get more from
text-processing applications
- Information gathering is not just surface
expression
- Answers to many questions are a bit below the
surface
- This interpretation is the difference between
data and knowledge
- Challenge: a tool that works robustly on any text
and understands a useful, greater amount of
sentence meaning
5Talk Outline
- The NLP Challenge: Beyond the bag of words
- The Pascal task of robust local textual inference
- Deep logical approaches to NLP
- Answering GRE analytic section logic puzzles
(Lev, MacCartney, Levy, and Manning 2004)
- Some first attempts
- (Raina, Ng, and Manning 2005), (Haghighi, Ng, and
Manning 2005)
- A second attempt
- (MacCartney, de Marneffe, Grenager, Cer, and
Manning 2006)
62. The PASCAL Textual Inference Task (Dagan,
Glickman, and Magnini 2005)
- The task: Can systems correctly perform local
textual inferences (individual inference steps)?
- On the assumption that some piece of text (T) is
true, does this imply the truth of some other
hypothesis text (H)?
- Sydney was the host city of the 2000 Olympics ⇒
- The Olympics have been held in Sydney TRUE
- The format could be used for evaluating extended
inferential chains or knowledge
- But, in practice, fairly direct, local stuff
7The PASCAL Textual Inference Task
- The task focuses on the variability of semantic
expression in language
- The reverse task of disambiguation
- The Dow Jones Industrial Average closed up 255
- The Dow climbed 255 points today
- The Dow Jones Industrial Average gained over 250
points
- An abstraction from any particular application,
but directly applicable to applications
8Natural Examples: Reading Comprehension
- (CNN Student News) -- January 24, 2006
- Answer the following questions about today's
featured news stories. Write your answers in the
space provided.
- 1. Where is the country of Somalia located? What
ocean borders this country?
- 2. Why did crew members from the USS Winston S.
Churchill recently stop a small vessel off the
coast of Somalia? What action did the crew of the
Churchill take?
9Real Uses
- Semantic search
- Find documents about lobbyists attempting to
bribe U.S. senators
- (lobbyist attempted to bribe U.S. senator)
- Question answering
- Who acquired Overture?
- Use to score candidate answers based on passage
retrieval and named entity recognition
- Customer email response
- My Squeezebox regularly skips during music
playback
- ⇒ Sender can hear music through Squeezebox
- Relation extraction (database building)
- Document summarization
10Verification of terms (Dan Roth)
- Non-disclosure Agreement
- WHEREAS Recipient is desirous of obtaining said
confidential information for purposes of
evaluation thereof and as a basis for further
discussions with Owner regarding assistance with
development of the confidential information for
the benefit of Owner or for the mutual benefit of
Owner and Recipient; THEREFORE, Recipient
hereby agrees to receive the information in
confidence and to treat it as confidential for
all purposes. Recipient will not divulge or use
in any manner any of said confidential
information unless by written consent from Owner,
and Recipient will use at least the same efforts
it regularly employs for its own confidential
information to avoid disclosure to others.
Provided, however, that this obligation to
treat information confidentially will not apply
to any information already in Recipient's
possession or to any information that is
generally available to the public or becomes
generally available through no act or influence
of Recipient. Recipient will inform Owner of the
public nature or Recipient's possession of the
information without delay after Owner's
disclosure thereof or will be stopped from
asserting such as defense to remedy under this
agreement.
- Each party acknowledges that all of the
disclosing party's Confidential Information is
owned solely by the disclosing party (or its
licensors and/or other vendors) and that the
unauthorized disclosure or use of such
Confidential Information would cause irreparable
harm and significant injury, the degree of which
may be difficult to ascertain. Accordingly, each
party agrees that the disclosing party will have
the right to obtain an immediate injunction
enjoining any breach of this Agreement, as well
as the right to pursue any and all other rights
and remedies available at law or in equity for
such a breach.
- Recipient will exercise its best efforts to
conduct its evaluation within a reasonable time
after Owner's disclosure and will provide Owner
with its assessment thereof without delay.
Recipient will return all information, including
all copies thereof, to Owner upon request. This
agreement shall remain in effect for ten years
after the date of its execution, and it shall be
construed under the laws of the State of Texas.
- Conditions I care about
- All information discussed is freely shareable
unless other party indicates in advance that it
is confidential
- TRUE? FALSE?
11PASCAL RTE Examples
Should be easy
- T: iTunes software has seen strong sales in
Europe.
- H: Strong sales for iTunes in Europe. TRUE
- T: The anti-terrorist court found two men guilty
of murdering Shapour Bakhtiar and his secretary
Sorush Katibeh, who were found with their throats
cut in August 1991.
- H: Shapour Bakhtiar died in 1991. TRUE
- T: Like the United States, U.N. officials are
also dismayed that Aristide killed a conference
called by Prime Minister Robert Malval in
Port-au-Prince in hopes of bringing all the
feuding parties together.
- H: Aristide had Prime Minister Robert Malval
murdered in Port-au-Prince. FALSE
Note: not entailed!
They're allowed to try to trick you
12Evaluation
- The notion of inference is as would typically be
interpreted by people, assuming common human
understanding of language and common background
knowledge.
- Not entailment according to some linguistic
theory
- High agreement on this data: human accuracy is
about 95%
- Accuracy: you correctly say whether the
hypothesis does or does not follow from the text
- Confidence weighted score or average accuracy
- Rank all n pairs by system-supplied confidence
- Use ranking to define a weighted average
- Tests whether you know what you know
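
The confidence weighted score can be sketched as follows, using the RTE1 definition: pairs are ranked by decreasing confidence, and the score is the mean, over ranks i, of the fraction of correct judgments among the top i pairs. The function names and the (confidence, is_correct) input format are illustrative assumptions.

```python
def confidence_weighted_score(judgments):
    """judgments: list of (confidence, is_correct) tuples, one per T/H pair."""
    # Rank all n pairs by system-supplied confidence, most confident first.
    ranked = sorted(judgments, key=lambda j: j[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += int(is_correct)
        total += correct_so_far / i  # accuracy among the top i pairs
    return total / len(ranked)


def accuracy(judgments):
    """Plain accuracy: fraction of pairs judged correctly."""
    return sum(int(ok) for _, ok in judgments) / len(judgments)


# Same accuracy, different calibration (hypothetical data).
well_calibrated = [(0.9, True), (0.8, True), (0.6, True), (0.5, False)]
poorly_calibrated = [(0.9, False), (0.8, True), (0.6, True), (0.5, True)]
print(accuracy(well_calibrated), confidence_weighted_score(well_calibrated))
print(accuracy(poorly_calibrated), confidence_weighted_score(poorly_calibrated))
```

Both toy systems get three of four pairs right, but the one whose single error carries the lowest confidence scores higher, which is the sense in which the measure tests whether you know what you know.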
133. Logics: mapping from NL to reasoning (GRE/LSAT
logic puzzles)
- Six sculptures (C, D, E, F, G, and H) are to be
exhibited in rooms 1, 2, and 3 of an art gallery.
- Sculptures C and E may not be exhibited in the
same room.
- Sculptures D and G must be exhibited in the same
room.
- If sculptures E and F are exhibited in the same
room, no other sculpture may be exhibited in that
room.
- At least one sculpture must be exhibited in each
room, and no more than three sculptures may be
exhibited in any room.
- 4. If sculpture D is exhibited in room 1 and
sculptures E and F are exhibited in room 2, which
of the following must be true?
- (A) Sculpture C must be exhibited in room 1.
- (B) Sculpture H must be exhibited in room 3.
- (C) Sculpture G must be exhibited in room 1.
- (D) Sculpture H must be exhibited in room 2.
- (E) Sculptures C and H must be exhibited in the
same room.
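
Once the puzzle is formalized, the reasoning itself is small. The sketch below hand-encodes the rules above as Python predicates and enumerates all room assignments; it is an illustration of the constraint satisfaction view, not the system described in the talk. All names are hypothetical.

```python
from itertools import product

SCULPTURES = "CDEFGH"
ROOMS = (1, 2, 3)


def satisfies_rules(room):
    """room maps each sculpture letter to its room number."""
    if room["C"] == room["E"]:  # C and E may not share a room
        return False
    if room["D"] != room["G"]:  # D and G must share a room
        return False
    if room["E"] == room["F"]:  # E and F together: nothing else in that room
        if sum(1 for s in SCULPTURES if room[s] == room["E"]) != 2:
            return False
    counts = [sum(1 for s in SCULPTURES if room[s] == r) for r in ROOMS]
    return all(1 <= c <= 3 for c in counts)  # 1 to 3 sculptures per room


def models(extra):
    """Yield every assignment that satisfies the rules and the extra condition."""
    for rooms in product(ROOMS, repeat=len(SCULPTURES)):
        room = dict(zip(SCULPTURES, rooms))
        if satisfies_rules(room) and extra(room):
            yield room


def question4(room):
    return room["D"] == 1 and room["E"] == 2 and room["F"] == 2


CHOICES = {
    "A": lambda room: room["C"] == 1,
    "B": lambda room: room["H"] == 3,
    "C": lambda room: room["G"] == 1,
    "D": lambda room: room["H"] == 2,
    "E": lambda room: room["C"] == room["H"],
}

consistent = list(models(question4))
must_be_true = [label for label, holds in CHOICES.items()
                if all(holds(room) for room in consistent)]
print(len(consistent), must_be_true)
```

Three assignments survive the condition in question 4, and only choice (C) holds in all of them.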
14The GRE logic puzzles domain
- An English description of a constraint
satisfaction problem, followed by questions about
satisfying assignments
- Answers cannot be found in the text by surface
question answering methods (e.g., TREC QA)
- Formalization and logical inference are necessary
- Obtaining proper formalization requires
- Accurate syntactic parsing
- Resolving semantic ambiguities (scope,
co-reference)
- Discourse analysis
- Easy to test (found test material)
- If the formalization is right, the reasoning is
easy
- No ambiguity or subjectivity about the correct
answer
15Challenges
- For most puzzles, the puzzle type, the variables,
and values for assignments are not obvious
- Mrs. Green wishes to renovate her cottage.
She hires the services of a plumber, a carpenter,
a painter, an electrician, and an interior
decorator. The renovation is to be completed in a
period of one working week, i.e., Monday to
Friday. Every worker will be taking one complete
day to do his job. Mrs. Green will allow just one
person to work per day.
- The painter will do his work only after the
plumber and the carpenter have completed their
jobs.
- The interior decorator has to complete his job
before that of the electrician.
- The type of this puzzle is a constrained linear
ordering of things (here, contractors)
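
Once the puzzle type has been recognized as a constrained linear ordering, the formalization is short. A minimal sketch, assuming the variables are the five workers and the values are day indices 0 to 4 (Monday to Friday); all names are illustrative.

```python
from itertools import permutations

WORKERS = ("plumber", "carpenter", "painter", "electrician", "decorator")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri")


def valid(day):
    """day maps each worker to a day index (0 = Monday)."""
    return (day["painter"] > day["plumber"]          # painter after plumber
            and day["painter"] > day["carpenter"]    # painter after carpenter
            and day["decorator"] < day["electrician"])  # decorator first


schedules = []
for order in permutations(range(len(DAYS))):  # one worker per day
    day = dict(zip(WORKERS, order))
    if valid(day):
        schedules.append(day)

earliest_painter = DAYS[min(s["painter"] for s in schedules)]
print(len(schedules), earliest_painter)
```

The hard part for an NLP system is everything before this point: deciding from the English text that days are the values, workers are the variables, and "only after" and "before" are ordering constraints.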
16Scope Needs to be Resolved!
- At least one sculpture must be exhibited in each
room.
- The same sculpture in each room?
- No more than three sculptures may be exhibited in
any room.
- Reading 1: For every room, there are no more
than three sculptures exhibited in it.
- Reading 2: Only three or fewer sculptures are
exhibited (the rest are not shown).
- Reading 3: Only a certain set of three or fewer
sculptures may be exhibited in any room (for the
other sculptures there are restrictions in
allowable rooms).
- Some readings will be ruled out by being
uninformative or by contradicting other
statements
- Otherwise we must be content with probability
distributions over scope-resolved semantic forms
17System overview (Lev, MacCartney, Levy, and
Manning 2004)
[Pipeline figure; box labels as extracted: English text, parse trees, SL formulas, FOL formulas, URs, DL formulas, correct answer]
18Semantic logic (SL)
- Our goal is a translation to First Order Logic
(FOL)
- But FOL is ungainly, and far from NL
- NL has events, plurals, modalities, complex
quantifiers
- Intermediate representation: semantic logic (SL)
- Event and group variables
- Modal operators: □ (necessarily) and ◇ (possibly)
- Generalized quantifiers Q(type, var, restrictor,
body)
- Our example becomes
- □ Q(∀, x1, room(x1), Q(≥1, x2, sculpture(x2), ∃e
exhibit(e) ∧ patient(e, x2) ∧ in(e, x1)))
- ¬◇ Q(∃, y, room(y), Q(>3, g, sculpture(g), ∃e
exhibit(e) ∧ patient(e, g) ∧ in(e, y)))
- More compact, more natural
19Combinatorial semantics
- Aim is to assign a semantic representation
(roughly, a lambda expression) to each semantic
unit
- The hope is to use a small lexicon for
semantically potent words and to synthesize
semantics for open class words
every dog barks (S): ∀x.(dog(x) → bark(x))
every dog (NP): λQ.∀x.(dog(x) → Q@x)
barks (VP): λx.bark(x)
every (Det): λP.λQ.∀x.(P@x → Q@x)
dog (Noun): λx.dog(x)
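
The composition above can be mimicked directly with Python lambdas evaluated over a small finite model. The domain and the extensions of "dog", "barks", and "sleeps" are invented for illustration; only the shape of the lambda terms follows the slide.

```python
# A tiny model: three individuals, two of them dogs (hypothetical data).
DOMAIN = {"rex", "fido", "tom"}
DOGS = {"rex", "fido"}
BARKERS = {"rex", "fido"}
SLEEPERS = {"rex"}

# Lexical entries as lambda terms.
dog = lambda x: x in DOGS          # dog (Noun): λx.dog(x)
barks = lambda x: x in BARKERS     # barks (VP): λx.bark(x)
sleeps = lambda x: x in SLEEPERS
# every (Det): λP.λQ.∀x.(P@x → Q@x), with → written as (not P) or Q
every = lambda P: lambda Q: all((not P(x)) or Q(x) for x in DOMAIN)

every_dog = every(dog)             # every dog (NP): λQ.∀x.(dog(x) → Q@x)
every_dog_barks = every_dog(barks)     # S: ∀x.(dog(x) → bark(x))
every_dog_sleeps = every_dog(sleeps)

print(every_dog_barks, every_dog_sleeps)
```

Function application plays the role of `@` here: the determiner is applied to the noun, and the resulting noun phrase is applied to the verb phrase.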
20FOL Reasoning module
- Complementary reasoning engines
- A theorem prover (TP) is used to show that a set
of formulas is inconsistent (proof by
contradiction)
- A model builder (MB) is used to show that a set
of formulas is consistent (proof by example)
- Idea: harness TP and MB in tandem
- "Could" questions: examine each answer choice
- MB says choice consistent ⇒ choice is correct
- TP says choice inconsistent ⇒ choice is incorrect
- "Must" questions: examine negation of each choice
- MB says negation consistent ⇒ choice is incorrect
- TP says negation inconsistent ⇒ choice is correct
- Just a theorem prover is not enough
- Can't handle "could be true" questions properly
- Despite finite domain, some proofs too deep to
find
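
The decision logic for "could" and "must" questions can be sketched over a toy propositional problem. As a loud assumption, exhaustive enumeration of truth assignments stands in for both engines here: finding a model plays the model builder, and exhausting the space without one plays the theorem prover. A real system would call separate FOL tools; all names below are hypothetical.

```python
from itertools import product


def find_model(variables, constraints):
    """Stand-in model builder: return a satisfying assignment, or None."""
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(constraint(world) for constraint in constraints):
            return world
    return None  # exhaustive failure stands in for a TP inconsistency proof


def could_be_true(variables, premises, choice):
    # "Could" question: the choice is correct iff premises + choice is consistent.
    return find_model(variables, premises + [choice]) is not None


def must_be_true(variables, premises, choice):
    # "Must" question: the choice is correct iff premises + not(choice)
    # is inconsistent.
    negation = lambda world: not choice(world)
    return find_model(variables, premises + [negation]) is None


VARIABLES = ["p", "q"]
PREMISES = [lambda w: w["p"] or w["q"], lambda w: not w["p"]]
p = lambda w: w["p"]
q = lambda w: w["q"]

print(could_be_true(VARIABLES, PREMISES, p), must_be_true(VARIABLES, PREMISES, q))
```

Enumeration only works because the toy domain is tiny; the point of pairing the two engines is that each can succeed quickly where the other would search without terminating.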
21How far did we get?
- Worked to be able to handle the sculptures
example (set of 6 questions) completely
- Worked to be able to do a second problem
- What about new puzzle texts?
- Statistical parse is correct (fully usable) in
about 60% of cases
- Main problem is unhandled semantic phenomena,
e.g., "different", "except", "a complete list",
VP ellipsis, ...
- Only 1 out of 21 questions actually doable start
to end!
22Pascal RTE comparison
- Bos and Markert (2005) used a similar theorem
prover/model builder combination as part of a
two-strategy entry in RTE1.
- Indeed, our logic puzzles approach was strongly
influenced by Bos's work
- Coverage/correctness of approach
- Found proof/contradiction for 30
pairs (3.757.5)
- Of these, 23 were correct (77%)
- Example error
- T: Microsoft was established in Italy in 1985. ⇒
- H: Microsoft was established in 1985.
23How real world textual inference differs from
logical semantics
- Modals
- Text: Researchers at the Harvard School of Public
Health say that people who drink coffee may be
doing a lot more than keeping themselves awake - this
kind of consumption apparently also can help
reduce the risk of diseases.
- ⇒ Hypothesis: Coffee drinking has health benefits.
(RTE1 ID 19)
- "May" is a discourse hedge, not a possible worlds
modal
- Reported views/speech
- Text: According to the Encyclopedia Britannica,
Indonesia is the largest archipelagic nation in
the world, consisting of 13,670 islands.
- ⇒ Hypothesis: 13,670 islands make up Indonesia.
(RTE1 ID 605)
- Source for information, not to suggest truth
unknown
24Speaker meaning
- The Pascal RTE task can be taken as an applied
test of human notions of speaker meaning
- It clearly goes beyond the literal meaning of the
text
- Recanati (2004: 19) proposes regarding "what is
said" as what a normal interpreter would
understand as being said, in the context at
hand.
- Pascal RTE could be viewed as operationalizing
such a criterion.
254. Tackling robust textual inference: Weighted
abductive inference
- Idea (Raina, Ng, and Manning 2005)
- Represent text and hypothesis as logical
formulae.
- A hypothesis can be inferred from the text if and
only if the hypothesis logical formula can be
proved from the text logical formula (at some
cost).
- Toy example
T: Kidnappers released a Filipino hostage.
H: A Filipino hostage was freed. TRUE
T: (∃ A, B, E) Kidnappers(A) ∧ released(E, A, B) ∧ Filipino(B) ∧ hostage(B)
H: (∃ X, F) Filipino(X) ∧ hostage(X) ∧ freed(F, X)
Prove?
Weighted abduction: allow assumptions at various
costs, e.g., released(p, q, r) ⇒ freed(s, r) at cost $2
(Hobbs et al., 1993)
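
The toy example can be worked through with a brute-force sketch: try every binding of the hypothesis variables to text constants, and charge each hypothesis literal the cheapest assumption that lets a text literal prove it. The data layout, the argument map for released ⇒ freed, and the function names are assumptions made for illustration; only the cost of 2 comes from the slide.

```python
from itertools import product

TEXT = [("Kidnappers", ("A",)), ("released", ("E", "A", "B")),
        ("Filipino", ("B",)), ("hostage", ("B",))]
HYPOTHESIS = [("Filipino", ("X",)), ("hostage", ("X",)), ("freed", ("F", "X"))]

# (text predicate, hypothesis predicate) -> (cost, text argument index that
# each hypothesis argument must match). Here freed(s, r) takes the event and
# the third argument of released(p, q, r).
ASSUMPTIONS = {("released", "freed"): (2.0, (0, 2))}


def literal_cost(hyp_literal, binding, text, assumptions):
    """Cheapest way to prove one hypothesis literal, or None if impossible."""
    hyp_pred, hyp_args = hyp_literal
    wanted = tuple(binding[v] for v in hyp_args)
    best = None
    for text_pred, text_args in text:
        if text_pred == hyp_pred:
            cost, arg_map = 0.0, tuple(range(len(text_args)))  # exact match
        elif (text_pred, hyp_pred) in assumptions:
            cost, arg_map = assumptions[(text_pred, hyp_pred)]
        else:
            continue
        if len(arg_map) != len(wanted):
            continue
        if tuple(text_args[i] for i in arg_map) == wanted:
            best = cost if best is None else min(best, cost)
    return best


def min_proof_cost(text, hypothesis, assumptions):
    variables = sorted({v for _, args in hypothesis for v in args})
    constants = sorted({c for _, args in text for c in args})
    best = None
    for values in product(constants, repeat=len(variables)):
        binding = dict(zip(variables, values))
        costs = [literal_cost(lit, binding, text, assumptions)
                 for lit in hypothesis]
        if None in costs:
            continue  # some literal cannot be proved under this binding
        total = sum(costs)
        best = total if best is None else min(best, total)
    return best


print(min_proof_cost(TEXT, HYPOTHESIS, ASSUMPTIONS))
```

With the binding X = B, F = E, the two exact matches cost nothing and the released ⇒ freed assumption contributes the whole proof cost; without that assumption no proof exists.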
26Representation Example: Bill's mother walked to
the grocery store
Logical formula: mother(A) ∧ Bill(B) ∧ poss(B, A) ∧
grocery(C) ∧ store(C) ∧ walked(E, A, C)
[Figure annotations: VBD, PERSON, ARGM-LOC]
- Can and do make this representation richer
- "walked" is a verb
- "Bill" is a PERSON (named entity).
- Add semantic roles: "store" is the location/destination
of "walked".
27Linguistic preprocessing
- High performance Named Entity Recognizers
(Finkel et al. 2005)
- Canonicalization: quantity, date, and money
expressions
- Normalized dates and relational expressions of
amount: > 200
- T: Kessler's team conducted 60,643 face-to-face
interviews with adults in 14 countries.
- H: Kessler's team interviewed more than 60,000
adults in 14 countries.
- Statistical parser.
- Update data a little for 2005: Al Qaeda,
[Aa]l[ -]Qa?[ie]da
- Collocations if appearing in WordNet (Bill
hung_up the phone)
- Semantic Role identification: Propbank roles
(Toutanova et al. 2005)
- Coreference
- T: Since its formation in 1948, Israel ...
- H: Israel was established in 1948.
- Heuristics to find event nouns (the murder of
police commander)
- Hand-built: acronyms, country and nationality,
factive verbs
28How can we model abductive assumption costs in
proof?
- Consider assumptions that unify pairs of terms.
- Need to assign cost C(A) to all assumptions A of
the form
- Predicate match cost: Synonym cost? Are they similar? Same named entity type?
- Argument match cost: Same semantic role? Coreference for constants?
29Abductive assumptions
- Compute features f(A) = (f1(A), f2(A), ..., fD(A)) of
A.
- Given feature weights w = (w1, w2, ..., wD), define
the cost C(A) = w · f(A) = w1 f1(A) + ... + wD fD(A)
- Each such assumption provides a potential proof
step.
- Can find a minimum cost complete proof by uniform
cost search.
- Output TRUE iff this proof has cost < a threshold
w_{D+1}.
- Weak proof theory!!
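
A minimal sketch of the search step, under stated assumptions: each assumption's cost is taken to be the weighted sum of its features, each hypothesis literal comes with a list of candidate assumptions, and a proof picks one candidate per literal. The feature names, weights, and candidate lists are invented for illustration; this is not the authors' prover.

```python
import heapq


def assumption_cost(w, f):
    """Cost of one assumption as a weighted sum of its features."""
    return sum(wi * fi for wi, fi in zip(w, f))


def min_cost_proof(candidates, w):
    """Uniform cost search over partial proofs.

    candidates[i] lists (name, feature_vector) options for hypothesis
    literal i. Returns (cost, chosen names), or None if no proof exists.
    Assumes all assumption costs are non-negative.
    """
    frontier = [(0.0, 0, ())]  # (cost so far, next literal index, choices)
    while frontier:
        cost, i, chosen = heapq.heappop(frontier)
        if i == len(candidates):
            return cost, chosen  # first complete proof popped is cheapest
        for name, f in candidates[i]:
            step = assumption_cost(w, f)
            heapq.heappush(frontier, (cost + step, i + 1, chosen + (name,)))
    return None


def entails(candidates, w, threshold):
    """Output TRUE iff the minimum cost proof is cheaper than the threshold."""
    proof = min_cost_proof(candidates, w)
    return proof is not None and proof[0] < threshold


# Hypothetical features: (exact match, lexical assumption, unrelated predicates)
W = (0.0, 2.0, 10.0)
CANDIDATES = [
    [("Filipino=Filipino", (1, 0, 0))],
    [("hostage=hostage", (1, 0, 0))],
    [("released=>freed", (0, 1, 0)), ("Kidnappers=>freed", (0, 0, 1))],
]
print(min_cost_proof(CANDIDATES, W), entails(CANDIDATES, W, threshold=5.0))
```

Because the proof cost is a sum of per-assumption weighted sums, it stays linear in the weights once the proof is fixed; the threshold plays the role of the extra parameter w_{D+1}.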
30Can we learn the assumption costs?
- Intuition: Given a data set, find assumptions
that are used in the proofs for TRUE examples,
and lower their costs.
- The minimum cost proof Pmin consists of a
sequence of assumptions A1, A2, ..., AN.
- Construct a feature vector for the proof Pmin
- If Pmin is given, the final cost for an example
is linear in w.
- However, the overall feature vector is computed
by abductive theorem proving, which uses w
internally!
- Solve by an iterative procedure guaranteed to
converge to a local maximum of the (nonconvex)
likelihood function.
31Example features
- Zero cost to match same item with same arguments
- Low cost to unify things listed in WordNet as
synonyms
- Higher cost to match something with vague LSA
similarity
- Higher cost if arguments of verb mismatch
- Antonyms/Negation: High cost for unifying verbs,
if they are antonyms or one is negated and the
other not.
- T: Stocks fell. H: Stocks rose. FALSE
- T: Clinton's book was not a hit. H: Clinton's
book was a hit. FALSE
- Non-factive verbs: High cost for unifying verbs,
if only one is modified by a non-factive verb.
- T: John was charged for doing X. H: John
did X. FALSE
32Results
- Evaluate on PASCAL RTE1 dataset.
- Development set of 567 examples.
- Test set of 800 examples.
- Divided into 7 tasks (motivating applications)
- Balanced number of true/false examples.
- Output TRUE/FALSE, confidence value.
- Empirically found to be tough.
- Baselines
- TF, TFIDF
- Standard information retrieval algorithms.
- Ignore natural language syntax.
33RTE1 Results (Raina et al. 2005)
                     Baselines (Accuracy/CWS)    General            ByTask
                     tf           tf.idf         Accuracy  CWS      Accuracy  CWS
DevSet 1             -            -              64.8      0.778    65.5      0.805
DevSet 2             -            -              52.1      0.578    55.7      0.661
DevSet 1 + DevSet 2  -            -              58.5      0.679    60.8      0.743
Test Set             49.5/0.548   51.8/0.56      56.3      0.620    55.3      0.686
- Difficult! Best other results: Accuracy = 58.6%,
CWS = 0.617
34Partial coverage accuracy results
Both know something, but task-specific
optimization is better!
[Plot; curve labels: ByTask, General, Random]
355. New System Architecture (MacCartney, Grenager,
de Marneffe, Cer, Manning, HLT-NAACL 2006)
[Pipeline figure: An inference → Linguistic Preprocessing → Aligner → Inferer → Answer: yes, no]
36Why the old approach was broken!
- P: DeLay bought Enron stock and Clinton sold
Enron stock
- H: DeLay sold Enron stock
[Figure labels: Yes; No; Probably, yes]
37Why we need sloppy matching
- Passage: Today's best estimate of giant panda
numbers in the wild is about 1,100 individuals
living in up to 32 separate populations mostly in
China's Sichuan Province, but also in Shaanxi and
Gansu provinces.
- Hypothesis 1: There are 32 pandas in the wild in
China. (FALSE)
- Hypothesis 2: There are about 1,100 pandas in the
wild in China. (TRUE)
- We'd like to get this right, but we just don't
have the technology to fully understand how to get from
"best estimate of giant panda numbers in the wild is
about 1,100" to "there are about 1,100 pandas in
the wild"
38A solution: Align, then evaluate
- P: DeLay bought Enron stock and Clinton sold
Enron stock
- H: DeLay sold Enron stock
39Things we aim to fix (MacCartney, Grenager, de
Marneffe, Cer, Manning, HLT-NAACL 2006)
- Confounding of alignment and entailment
- Assumption on monotonicity
- Matching/embedding methods assume upward
monotonicity
- Sue saw Les Miserables in London ⇒
- Sue saw Les Miserables
- But
- Fedex began business in Zimbabwe in 2003 ⇏
- Fedex began business in 2003
- Assumption/requirement of locality
40Whether an alignment is good depends on non-local
factors
1. P: Some students came to school by car. Q: Did
any students come to school? A: Yes
2. P: No students came to school by car. Q: Did any
students come to school? A: Don't know
Context of monotonicity: Whether it is okay to have
"by car" as extra material in the hypothesis
depends on subject quantifier
3. P: It is not the case that Bin Laden was seen
in Tora Bora. Q: Was Bin Laden seen in Tora
Bora? A: No
It's difficult to see non-factive
context when aligning seen → seen
41Representation/alignment example
- T: Mitsubishi Motors Corp.'s new vehicle sales in
the US fell 46 percent in June.
- H: Mitsubishi sales rose 46 percent.
- Answer: not entailed
Alignment from hypothesis to text:
rose → fell
sales → sales
Mitsubishi → Mitsubishi_Motors_Corp.
percent → percent
46 → 46
Alignment score: 0.89
Features:
Aligned antonyms in pos/pos context
Structure: main predicate good match
Numeric quantity match
Date: text date deleted in hypothesis
Alignment: good
Inference score: -5.42 ⇒ FALSE
42Modal Inferer
- Identify aligned roots
- Determine modality of each root
- Using linguistic features
- e.g. can, perhaps, might → POSSIBLE
- Six canonical modalities
- POSSIBLE, NOT_POSSIBLE, ACTUAL, NOT_ACTUAL,
NECESSARY, NOT_NECESSARY
- Look up judgment for modality pair
- (POSSIBLE, ACTUAL) → dontknow
- (NECESSARY, NOT_ACTUAL) → no
- (ACTUAL, POSSIBLE) → yes
- P: The Scud C has a range of 500 kilometers and
is manufactured in Syria with know-how from North
Korea.
- H: A Scud C can fly 500 kilometers.
- (ACTUAL, POSSIBLE) → yes
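
The lookup can be sketched as two small tables. Only the three marker words and the three modality pairs shown above are filled in; treating an unmarked clause as ACTUAL and returning None for unlisted pairs are assumptions of this sketch, as are all the names.

```python
# Marker words from the slide; a real system would use richer features.
MARKERS = {"can": "POSSIBLE", "perhaps": "POSSIBLE", "might": "POSSIBLE"}

# (text modality, hypothesis modality) -> judgment, as listed on the slide.
JUDGMENTS = {
    ("POSSIBLE", "ACTUAL"): "dontknow",
    ("NECESSARY", "NOT_ACTUAL"): "no",
    ("ACTUAL", "POSSIBLE"): "yes",
}


def modality(tokens):
    """Return the modality signalled by the first marker word found."""
    for token in tokens:
        if token.lower() in MARKERS:
            return MARKERS[token.lower()]
    return "ACTUAL"  # assumption: no marker means plain assertion


def modal_judgment(text_tokens, hyp_tokens):
    pair = (modality(text_tokens), modality(hyp_tokens))
    return JUDGMENTS.get(pair)  # None for pairs not listed on the slide


text = "The Scud C has a range of 500 kilometers".split()
hypothesis = "A Scud C can fly 500 kilometers".split()
print(modal_judgment(text, hypothesis))
```

Swapping the two sentences gives the (POSSIBLE, ACTUAL) pair and hence "dontknow", which shows why the lookup is keyed on the ordered pair.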
43Factives & other implicatives
- T: Libya has tried, with limited success, to
develop its own indigenous missile, and to extend
the range of its aging SCUD force for many years
under the Al Fatah and other missile programs.
- H: Libya has developed its own domestic missile
program.
- Answer: not entailed. "Tried to X" does not
entail X.
- Evaluate governing verbs for implicativity
- Unknown: say, tell, suspect, try, ...
- Fact: know, wonderful, ...
- True: manage to, ...
- False: doubtful, misbelieve, ...
- Need to check for negative context
44Numeric Mismatches
- Check alignment of number, date, money nodes
- T: The Pew Internet Life survey interviewed
people in 26 countries.
- H: The Pew Internet Life study interviewed people
in more than 20 countries.
- T: BioPort Corp. of Lansing, Michigan is the
sole U.S. manufacturer of an anthrax vaccine.
- H: There are three U.S. manufacturers of anthrax
vaccine.
- "three" is aligned to NO_WORD here; "sole" → 1?
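
A minimal sketch of the check on aligned number nodes: parse each phrase into a relation and a value, then test whether the exact quantity in the text satisfies the hypothesis. The set of relation words and the function names are assumptions, and number words such as "three" or "sole" are deliberately left unhandled.

```python
import re

# An optional relation phrase followed by a number with optional thousands commas.
PATTERN = re.compile(r"(?:(more than|over|less than|under)\s+)?(\d[\d,]*)")
RELATIONS = {"more than": ">", "over": ">", "less than": "<", "under": "<"}


def parse_quantity(phrase):
    """Return (relation, value), e.g. ('>', 20), or None if no number is found."""
    match = PATTERN.search(phrase)
    if match is None:
        return None
    relation, digits = match.groups()
    return RELATIONS.get(relation, "="), int(digits.replace(",", ""))


def text_supports(text_phrase, hyp_phrase):
    """True/False if the exact text quantity settles the hypothesis, else None."""
    text_q = parse_quantity(text_phrase)
    hyp_q = parse_quantity(hyp_phrase)
    if text_q is None or hyp_q is None or text_q[0] != "=":
        return None  # this sketch only handles an exact quantity in the text
    value = text_q[1]
    relation, bound = hyp_q
    return {"=": value == bound, ">": value > bound, "<": value < bound}[relation]


print(text_supports("26 countries", "more than 20 countries"))
print(text_supports("60,643 face-to-face interviews", "more than 60,000 adults"))
```

Returning None rather than False when a phrase cannot be parsed keeps "no evidence" separate from "numeric mismatch", which matters if the result feeds a knock-out feature.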
45Restrictive adjuncts
- We can check whether adding/dropping restrictive
adjuncts is licensed relative to upward and
downward entailing contexts
- In all, Zerich bought $422 million worth of oil
from Iraq, according to the Volcker committee
- ⇏ Zerich bought oil from Iraq during the embargo
- Zerich didn't buy any oil from Iraq, according to
the Volcker committee
- ⇒ Zerich didn't buy oil from Iraq during the
embargo
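
The licensing check can be stated in a few lines: dropping a restrictive adjunct is safe in an upward entailing context, and adding one is safe in a downward entailing context. The negation-counting polarity detector below is a crude stand-in invented for this sketch; a real system would read polarity off the parse.

```python
NEGATORS = {"not", "n't", "didn't", "no", "never"}  # illustrative list only


def context_polarity(tokens):
    """'downward' under an odd number of negators, otherwise 'upward'."""
    flips = sum(1 for token in tokens if token.lower() in NEGATORS)
    return "downward" if flips % 2 else "upward"


def adjunct_edit_licensed(edit, context):
    """edit is 'drop' or 'add', describing how the hypothesis differs from the text."""
    return (edit, context) in {("drop", "upward"), ("add", "downward")}


bought = "Zerich bought oil from Iraq".split()
did_not_buy = "Zerich didn't buy any oil from Iraq".split()

# Hypothesis adds "during the embargo" in both cases.
print(adjunct_edit_licensed("add", context_polarity(bought)))
print(adjunct_edit_licensed("add", context_polarity(did_not_buy)))
```

A rule this simple still treats every affirmative clause as upward entailing, so it would wrongly license dropping "in Zimbabwe" from "Fedex began business in Zimbabwe in 2003".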
46What do we have?
- Not full, deep semantics
- But it still isn't possible to do logical
inference for open domain robust textual
inference (with real data)
- We do inference-pattern matching
- On semantic dependency graphs, not surface
patterns
- Calculate rich semantic features
- Adjunct_deletion_licensed_relative_to_universal
- Related to the notion of natural logic
47Natural logic
- A logic whose vehicle of inference is natural
language (syntactic structures)
- No translation into conventional logical notation
- Aristotle's syllogisms → Leibniz (who coined
term) → Lakoff → van Benthem → Sánchez Valencia
- Natural logic lets us sidestep having to fully
translate sentences into an accurate semantic
representation
- Exercise: accurately translate into FOL
- According to Ruiz, police may have been reluctant
to enter the building before they were convinced
that most of the weapons had been found.
The police found few weapons.
All guns are weapons.
⇒ The police found few guns.
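
The "few guns" inference needs no logical translation, only the monotonicity of the determiner and one lexical fact. A minimal sketch, using the standard monotonicity profiles of four determiners and a one-entry hyponym table; the table layout and function name are assumptions.

```python
# Monotonicity of each determiner in its (restrictor, body) arguments.
MONOTONICITY = {
    "every": ("down", "up"),
    "some": ("up", "up"),
    "no": ("down", "down"),
    "few": ("down", "down"),
}

# (narrower term, broader term): "All guns are weapons."
HYPONYMS = {("gun", "weapon")}


def substitution_valid(determiner, position, old, new):
    """Is replacing `old` by `new` in the given argument truth-preserving?"""
    index = {"restrictor": 0, "body": 1}[position]
    direction = MONOTONICITY[determiner][index]
    if direction == "down":
        return (new, old) in HYPONYMS  # downward: may narrow the term
    return (old, new) in HYPONYMS      # upward: may broaden the term


# The police found few weapons  =>  The police found few guns
print(substitution_valid("few", "restrictor", "weapon", "gun"))
# The police found some weapons  =/=>  The police found some guns
print(substitution_valid("some", "restrictor", "weapon", "gun"))
```

The inference is carried out on the words of the sentence itself, which is the sense in which natural language is the vehicle of inference.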
48Our RTE2 Results
           Learned Accuracy
Dev Set    67.0%
Test Set   60.5%
49Most useful features
- Positive
- Structural match
- Good alignment score
- Modal: yes
- Polarity: text and hypothesis both negative
polarity
- Negative
- Date inserted/mismatched
- Structure: clear mismatch
- Quantifier mismatch
- Bad alignment score
- Different polarity
- Modal: no/don't know
50Things that it's hard to do
- Non-entailment is easier than entailment
- Good at finding knock-out features
- Hard to be certain that we've considered
everything
- Dealing with dropping/adding modifiers vs.
upward/downward entailing contexts is hard
- Need to know which are restrictive/not/discourse
items
- Maurice was subsequently killed in Angola.
- Multiword lexical semantics/world knowledge
- We're pretty good at synonyms, hyponyms, antonyms
- But we can't resolve a lot of multi-word
equivalences
- T: David McCool took the money and decided to
start Muzzy Lane in 2002
- H: David McCool is the founder of Muzzy Lane
51Envoi
- What I've shown
- The beginnings of an ability to do robust textual
inference
- Still lots of room for improving/fixing
everything!
- Potential for applications like evidence
extraction
- Find passages suggesting price fixing by Enron
- Moderate precision is still useful for reranking
applications like semantic search/question
answering
- Meta question: the path to semantics for NLP
- Hand-built logical methods still don't scale
- Is continuing to annotate data the only answer?
- How much can we do with unsupervised learning?
With hand-built resources?