Title: Recognising Textual Entailment and Computational Semantics
1. Recognising Textual Entailment and Computational Semantics
Johan Bos, Dipartimento di Informatica, University of Rome "La Sapienza"
2. State of the Art
- Computational semantics has now reached a state where we have at our disposal robust systems that are able to compute semantic representations for natural language texts, achieving wide coverage in open domains
3. Evaluation
- In order to understand what we are doing, we should be able to measure the performance of our systems
- Also important for funding and assessing possibilities for commercial development
- What would be a good method to evaluate systems that produce semantic representations?
4. A sem-beauty contest?
Sem World 2007
5-6. Recognising Textual Entailment
- Recently, a new shared task has been organised: RTE
- Is RTE a good method for semantic evaluation?
7. Outline of this talk
- Wide-coverage Semantics
- Boxer
- Possible Evaluation methods
- Recognising Textual Entailment
- Discussion
8-10. Wide-coverage semantics
- LinGO/LKB: Minimal Recursion Semantics (Copestake 2002)
- Shalmaneser: Frame Semantics (Erk & Pado 2006)
- Boxer: Discourse Representation Structures (Bos 2005)
11. Boxer
- Works on output of the C&C parser
- Input: CCG derivations
- Output: DRT boxes
- The C&C Parser
- Statistical, robust, wide-coverage
- Clark & Curran (ACL 2004)
- Grammar derived from CCGbank
- 409 different categories
- Hockenmaier & Steedman (ACL 2002)
12. Semantic construction in Boxer
- Work is done in the lexicon
- Lambda calculus as glue language
- Function application and beta-conversion
- Semantic formalism
- Discourse Representation Structures
- First-order logic formulas
- Output format
- Prolog terms
- XML
13. CCG→DRT lexical semantics
14-21. CCG→DRT derivation (for "A spokesman lied", built up step by step across these slides; DRS boxes shown in linear notation [referents | conditions], with ";" for DRS merge)

   a : NP/N                       spokesman : N           lied : S\NP
   λp.λq.([x | ] ; p@x ; q@x)     λy.[ | spokesman(y)]    λN.(N @ λz.[ | lie(z)])
   ----------------------------------------------- (FA)
   a spokesman : NP
   λq.([x | spokesman(x)] ; q@x)
   ------------------------------------------------------------------- (BA)
   a spokesman lied : S
   [x | spokesman(x), lie(x)]
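The finished derivation can be replayed in a few lines of code: lambda terms become Python closures, so ordinary function application performs the beta-conversion, and DRS merge is concatenation of referents and conditions. This is a simplified sketch (no event variable for "lied", fixed referent names), not Boxer's implementation:

```python
# DRSs as (discourse referents, conditions) pairs; merge is the DRT ";"
# operator. Simplified sketch: no events, referent names fixed by hand.
def drs(refs, conds):
    return (list(refs), list(conds))

def merge(d1, d2):
    # Union of discourse referents and of conditions
    return (d1[0] + d2[0], d1[1] + d2[1])

# Lexical entries as closures: application *is* beta-conversion here.
a = lambda p: lambda q: merge(merge(drs(["x"], []), p("x")), q("x"))
spokesman = lambda y: drs([], [f"spokesman({y})"])
lied = lambda np: np(lambda z: drs([], [f"lie({z})"]))

np = a(spokesman)   # forward application (FA): NP "a spokesman"
s = lied(np)        # backward application (BA): S "a spokesman lied"
print(s)            # (['x'], ['spokesman(x)', 'lie(x)'])
```

The resulting pair corresponds to the final box of the derivation: one referent x with the conditions spokesman(x) and lie(x).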
22. Example Output
- Example: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
- Semantic representation: DRT
- Complete Wall Street Journal
23. Outline of this talk
- Wide-coverage Semantics
- Boxer
- Possible Evaluation methods
- Recognising Textual Entailment
- Discussion
24. Evaluation: possible methods
- Annotated corpus of semantic representations
- Task-oriented evaluation
- Evaluation by inference
25. Annotated Corpus
- Basic idea
- Create a corpus of sentences annotated with semantic representations
- Key example: Penn Treebank
- Going in this direction: PropBank, FrameNet
- Problems
- We don't have such a corpus
- There are no resources to build one
- What constitutes a good semantic representation?
26. Annotation Problems 1/3
- How shallow, how deep?
- Davidsonian or neo-Davidsonian?
- Plural noun phrases?
- Tense and aspect?
- Superlatives and comparatives?
27. Annotation Problems 2/3
- Underspecified or resolved?
- Scope of quantifiers and operators
- Word senses from WordNet?
- Anaphora
- Presupposition
28. Annotation Problems 3/3
- Which semantic formalism?
- Minimal Recursion Semantics
- Discourse Representation Theory
- First-order logic
- OWL?
29. Task-oriented evaluation
- Employ semantics in existing NLP applications, such as QA and MT
- Measure performance with and without semantics, or comparative analysis
- Nice idea, but
- Interface problems
- Substitution problems
30. Message
- Evaluating semantic components is not kerfuffle-free
31. Outline of this talk
- Wide-coverage Semantics
- Boxer
- Possible Evaluation methods
- Recognising Textual Entailment
- Discussion
32-34. RTE
- Recognising Textual Entailment
35. Recognising Textual Entailment
- A task for NLP systems: recognise entailment between two (short) texts
- Introduced in 2004/2005 as part of the PASCAL Network of Excellence
- Proved to be a difficult, but popular task
- PASCAL provided a development and test set of several hundred examples
- Organised by Ido Dagan and others
36. RTE Example (entailment)
RTE 1977 (TRUE)
His family has steadfastly denied the charges.
-------------------------------------------------
The charges were denied by his family.
✓
37. RTE Example (no entailment)
RTE 2030 (FALSE)
Lyon is actually the gastronomical capital of France.
-------------------------------------------------
Lyon is the capital of France.
✗
38. RTE is hard, example 1
Example (TRUE)
The leaning tower is a building in Pisa. Pisa is a town in Italy.
-------------------------------------------------
The leaning tower is a building in Italy.
✓
39. RTE is hard, example 1
Example (FALSE)
The leaning tower is the highest building in Pisa. Pisa is a town in Italy.
-------------------------------------------------
The leaning tower is the highest building in Italy.
✗
40. RTE is hard, example 2
Example (TRUE)
Johan is walking around.
-------------------------------------------------
Johan is walking.
✓
41. RTE is hard, example 2
Example (FALSE)
Johan is farting around.
-------------------------------------------------
Johan is farting.
✗
42-43. History of RTE
- Old or new?
- Perhaps surprisingly, RTE is not really a new task
- Has been present implicitly in computational and formal semantics
- 1996: FraCaS test suite
- 2004/5: PASCAL challenges
- The first RTE examples are over two thousand years old!
44. Aristotle's Syllogisms
ARISTOTLE 1 (TRUE)
All men are mortal. Socrates is a man.
-------------------------------
Socrates is mortal.
✓
45. Aristotle's Syllogisms
ARISTOTLE 2 (FALSE)
All men are mortal. Socrates is not a man.
-------------------------------
Socrates is mortal.
✗
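Both syllogisms can be checked mechanically once they are translated into first-order logic. The sketch below is a toy entailment test, assuming a tiny two-element domain: it enumerates every interpretation of the predicates, and entailment holds iff no model satisfies the premises while falsifying the conclusion. Nothing like this scales, but it makes the TRUE/FALSE labels concrete.

```python
# Brute-force entailment check over a fixed finite domain.
from itertools import product

DOMAIN = ["socrates", "plato"]

def entails(premises, conclusion):
    # Entailment fails iff some model satisfies all premises but not
    # the conclusion (a counter-model).
    for man_vals in product([True, False], repeat=len(DOMAIN)):
        for mortal_vals in product([True, False], repeat=len(DOMAIN)):
            man = dict(zip(DOMAIN, man_vals))
            mortal = dict(zip(DOMAIN, mortal_vals))
            if all(p(man, mortal) for p in premises) and not conclusion(man, mortal):
                return False  # counter-model found
    return True

# All men are mortal: for every d, man(d) implies mortal(d)
all_men_mortal = lambda man, mortal: all(not man[d] or mortal[d] for d in DOMAIN)
socrates_is_a_man = lambda man, mortal: man["socrates"]
socrates_not_a_man = lambda man, mortal: not man["socrates"]
socrates_mortal = lambda man, mortal: mortal["socrates"]

print(entails([all_men_mortal, socrates_is_a_man], socrates_mortal))   # True
print(entails([all_men_mortal, socrates_not_a_man], socrates_mortal))  # False
```

The second call finds a counter-model (Socrates neither a man nor mortal), which is exactly why ARISTOTLE 2 is labelled FALSE.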
46. The CURT system
- Blackburn & Bos 2005 includes a system that checks for consistency and informativity of new utterances
- Not robust, but arguably one of the first systems that implemented textual inference
- First-order logic, theorem proving, and model building
47-50. CURT
- Testing a discourse for informativity
51. The FraCaS test suite
- European project on computational semantics, in the mid 1990s
- Test suite published in D16, but since then forgotten?
- Cooper et al. (1996): Using the Framework. FraCaS deliverable D16, section 3
- Aim of test suite: measure semantic competence of NLP systems
52. The FraCaS test suite
- Grouped on linguistic and semantic phenomena
- Generalised quantifiers
- Plurals
- Nominal Anaphora
- Ellipsis
- Adjectives
- Comparatives
- Temporal Reference
- Verbs
- Attitudes
53. FraCaS example pairs
- 3.209: Mickey is a small animal. Dumbo is a large animal. Is Mickey smaller than Dumbo? YES
- 3.205: Dumbo is a large animal. ... Is Dumbo a small animal? NO
- 3.206: Fido is not a small animal. ... Is Fido a large animal? DON'T KNOW
54. PASCAL RTE
- First organised evaluation campaign on natural language entailment
- RTE-1: UK, 2005
- RTE-2: Venice, 2006
- RTE-3: Prague, 2007
- Coordinated by Ido Dagan and others
- Now already a well-established shared task in computational linguistics
55. Approaches to RTE
- There are several methods
- We will look at five of them to see how difficult RTE actually is
- And whether computational semantics can play a role
56. Recognising Textual Entailment
- Method 1
- Flipping a coin
57. Flipping a coin
- Advantages
- Easy to implement
- Disadvantages
- Just 50% accuracy
58. Recognising Textual Entailment
- Method 2
- Calling a friend
59. Calling a friend
- Advantages
- High accuracy (95%)
- Disadvantages
- Lose friends
- High phone bill
60. Recognising Textual Entailment
- Method 3
- Ask the audience
61. Ask the audience
RTE 893 (????)
The first settlements on the site of Jakarta were established at the mouth of the Ciliwung, perhaps as early as the 5th century AD.
-------------------------------------------------
The first settlements on the site of Jakarta were established as early as the 5th century AD.
62. Human Upper Bound
RTE 893 (TRUE)
The first settlements on the site of Jakarta were established at the mouth of the Ciliwung, perhaps as early as the 5th century AD.
-------------------------------------------------
The first settlements on the site of Jakarta were established as early as the 5th century AD.
✓
63. Recognising Textual Entailment
- Method 4
- Word overlap
64. Word Overlap Approaches
- Popular approach
- Ranging in sophistication from simple bag of words to use of WordNet
- Accuracy rates ca. 55%
65. Word Overlap
- Advantages
- Relatively straightforward algorithm
- Disadvantages
- Hardly better than flipping a coin
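A minimal bag-of-words version of this approach can be sketched as follows. The 0.5 threshold is an illustrative assumption, not a tuned value, and the Lyon pair from slide 37 shows exactly why such systems barely beat a coin flip:

```python
# Bag-of-words overlap baseline: predict entailment when enough
# hypothesis words also occur in the text.
import re

def word_overlap(text, hypothesis, threshold=0.5):
    t = set(re.findall(r"\w+", text.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    # Fraction of hypothesis words covered by the text
    return len(h & t) / len(h) >= threshold

# Correct on RTE 1977 (a genuine entailment) ...
print(word_overlap("His family has steadfastly denied the charges.",
                   "The charges were denied by his family."))   # True
# ... but fooled by RTE 2030: every hypothesis word occurs in the
# text, yet the entailment does not hold (gold label: FALSE).
print(word_overlap("Lyon is actually the gastronomical capital of France.",
                   "Lyon is the capital of France."))           # True
```

The second example is a false positive no overlap threshold can repair, since the hypothesis is a strict subset of the text's words.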
66. RTE State-of-the-Art
- PASCAL RTE challenge
- Hard problem
- Requires semantics
67. Recognising Textual Entailment
- Method 5
- Computational Semantics
68. Nutcracker
- Components of Nutcracker
- The C&C parser for CCG
- Boxer
- Vampire, a FOL theorem prover
- Paradox and Mace, FOL model builders
- Background knowledge
- WordNet: hyponyms, synonyms
- NomLex: nominalisations
69. How Nutcracker works
- Given a textual entailment pair T/H, with text T and hypothesis H
- Produce DRSs for T and H
- Translate these DRSs into FOL
- Give the following input to the theorem prover Vampire:
- (BK ∧ T) → H, where BK is the background knowledge
- If Vampire finds a proof, then we predict that T entails H
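The decision procedure above can be sketched as follows. This is an illustrative reconstruction, not Nutcracker's own code: the `tptp_problem` helper, the file layout, and the call to a local `vampire` binary are all assumptions; only the problem-building step is exercised directly here.

```python
# Sketch of the entailment check: encode BK, T and H as a TPTP problem
# (BK and T as axioms, H as the conjecture) and ask a first-order
# prover for a proof. Helper names and the `vampire` binary are
# assumptions for illustration.
import subprocess
import tempfile

def tptp_problem(bk_axioms, t_formula, h_formula):
    # Proving (BK ∧ T) → H is the same as proving H from BK and T.
    lines = [f"fof(bk{i}, axiom, {ax})." for i, ax in enumerate(bk_axioms)]
    lines.append(f"fof(text, axiom, {t_formula}).")
    lines.append(f"fof(hyp, conjecture, {h_formula}).")
    return "\n".join(lines)

def entails(bk_axioms, t_formula, h_formula, timeout=30):
    # Assumes a `vampire` executable on the PATH; True iff a proof is
    # found within the time limit.
    with tempfile.NamedTemporaryFile("w", suffix=".p", delete=False) as f:
        f.write(tptp_problem(bk_axioms, t_formula, h_formula))
    result = subprocess.run(["vampire", f.name], capture_output=True,
                            text=True, timeout=timeout)
    return "Theorem" in result.stdout

# The WordNet axiom from the soar/rise example, in TPTP syntax:
print(tptp_problem(["![X]: (soar(X) => rise(X))"],
                   "soar(prices)", "rise(prices)"))
```

In the full system a model builder (Paradox or Mace) runs in parallel: a finite model of BK ∧ T ∧ ¬H is positive evidence that the entailment does not hold, rather than merely an absent proof.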
70. Example (Vampire proof)
RTE-2 112 (TRUE)
On Friday evening, a car bomb exploded outside a Shiite mosque in Iskandariyah, 30 miles south of the capital.
-------------------------------------------------
A bomb exploded outside a mosque.
✓
71. Example (Vampire proof)
RTE-2 489 (TRUE)
Initially, the Bundesbank opposed the introduction of the euro but was compelled to accept it in light of the political pressure of the capitalist politicians who supported its introduction.
-------------------------------------------------
The introduction of the euro has been opposed.
✓
72. WordNet at work
RTE 1952 (TRUE)
Crude oil prices soared to record levels.
-------------------------------------------------
Crude oil prices rise.
✓
- Background knowledge: ∀x(soar(x) → rise(x))
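Hyponym links of this kind can be compiled into background axioms mechanically. In the sketch below the (hyponym, hypernym) pairs are written by hand rather than read from WordNet, and the axiom syntax is illustrative:

```python
# Turn a WordNet-style hyponym/hypernym pair into a background axiom
# of the kind shown above. Pairs are supplied by hand here; a real
# system would extract them from WordNet's hierarchy.
def hyponym_axiom(hypo, hyper):
    # "soar is a kind of rise" becomes the axiom all x (soar(x) -> rise(x))
    return f"all x ({hypo}(x) -> {hyper}(x))"

# Pairs relevant to the RTE examples in this talk (car_bomb/bomb comes
# from the Iskandariyah example on slide 70):
pairs = [("soar", "rise"), ("car_bomb", "bomb")]
axioms = [hyponym_axiom(h, g) for h, g in pairs]
print(axioms[0])  # all x (soar(x) -> rise(x))
```

With the first axiom in the background knowledge, proving "Crude oil prices rise" from "Crude oil prices soared" reduces to a one-step inference for the theorem prover.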
73. Nutcracker results
- Nutcracker, combined with a shallow overlap
system, was one of the top systems at RTE-1
74. World Knowledge 1
RTE 1049 (TRUE)
Four Venezuelan firefighters who were traveling to a training course in Texas were killed when their sport utility vehicle drifted onto the shoulder of a highway and struck a parked truck.
-------------------------------------------------
Four firefighters were killed in a car accident.
✓
75. World Knowledge 2
RTE-2 235 (TRUE)
Indonesia says the oil blocks are within its borders, as does Malaysia, which has also sent warships to the area, claiming that its waters and airspace have been violated.
-------------------------------------------------
There is a territorial waters dispute.
✓
76. Outline of this talk
- Wide-coverage Semantics
- Boxer
- Possible Evaluation methods
- Recognising Textual Entailment
- Discussion
77. Discussion
- We now know what RTE is and how it relates to computational semantics
- From the perspective of computational semantics:
- What is good about RTE?
- What is bad about RTE?
78. Good about RTE
- Reasonably natural examples
- Simple evaluation measure
- Independent of semantic formalism
- A relatively large set of examples
79. Bad about RTE
- Unfocussed
- Examples can contain more than one phenomenon that you have to get right
- Difficult to use in system development
- No control over whether a system understands certain phenomena or not
- Unclear how much background knowledge is permitted
- Unclear how much pragmatics is assumed
80. An example of a focused test suite
- Text from real data:
- The Osaka World Trade Center is the highest building in Japan.
- Possible hypotheses:
- The Osaka World Trade Center is the third highest building in Japan. NO
- The Osaka World Trade Center is a building in Japan. YES
- The Osaka World Trade Center is one of the highest buildings in Japan. YES
- The Osaka World Trade Center is the highest building in Western Japan. MAYBE
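A focused suite like the one above can be written down directly as data and scored against its three-way gold labels. This is a sketch of the idea, not an existing resource; the accuracy function is the whole evaluation measure:

```python
# The focused test suite from the slide as plain data: one text, a list
# of (hypothesis, gold label) pairs with three-way labels.
text = "The Osaka World Trade Center is the highest building in Japan."
suite = [
    ("The Osaka World Trade Center is the third highest building in Japan.", "NO"),
    ("The Osaka World Trade Center is a building in Japan.", "YES"),
    ("The Osaka World Trade Center is one of the highest buildings in Japan.", "YES"),
    ("The Osaka World Trade Center is the highest building in Western Japan.", "MAYBE"),
]

def accuracy(system, text, suite):
    # system(text, hypothesis) must return "YES", "NO" or "MAYBE"
    correct = sum(system(text, h) == gold for h, gold in suite)
    return correct / len(suite)

# Even a trivial answer-YES-to-everything baseline can be scored:
print(accuracy(lambda t, h: "YES", text, suite))  # 0.5
```

Because every hypothesis varies a single phenomenon (superlatives, here), a system's score on such a suite says something diagnostic that an unfocused RTE accuracy figure cannot.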
81. The Future
- In order to make progress, we need more focused test suites
- The FraCaS collection is just a start
- These are not meant as a replacement for the PASCAL RTE examples
- Note: this is not really a new idea, but the time seems ripe to develop such test suites
82. Textual Inference Data Sets
- PASCAL data sets
- RTE-1 (1,376 pairs)
- RTE-2 (1,600 pairs)
- Other, less known data sets
- FraCaS (346 pairs): Cooper et al. 1996
- PARC (76 pairs): Zaenen, Karttunen & Crouch 2005
- Specific phenomena
- Adjectives (ca. 1,000 pairs): Amoia & Gardent 2006
83. Comparing FraCaS, CURT and PASCAL
- FraCaS test suite
- YES / NO / MAYBE
- CURT system
- Uninformative, Inconsistent, OK
- PASCAL RTE dataset
- TRUE / FALSE
84. Available for research
- What we need is systems to experiment with
- Boxer is freely available for research purposes
- Nutcracker will be made available at some point
85. Conclusion
- RTE can be seen as the Turing Test for computational semantics
- Therefore, computational semanticists should take the RTE task seriously
- The number of computational semanticists participating in the RTE campaigns has been surprisingly low
- Let's hope that changes in the future!
86. No Sem World 2007
Sem World 2007