Title: A Phrase-Based Model of Alignment for Natural Language Inference
1. A Phrase-Based Model of Alignment for Natural Language Inference
- Bill MacCartney, Michel Galley, and Christopher D. Manning
- Stanford University
- 26 October 2008
2. Natural language inference (NLI) (aka RTE)
- Does premise P justify an inference to hypothesis H?
- An informal notion of inference; variability of linguistic expression
P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
H: Gazprom will double Georgia's gas bill.  →  yes
- Like MT, NLI depends on a facility for alignment
- I.e., linking corresponding words/phrases in two related sentences
3. Alignment example
[Figure: example alignment grid linking tokens of P (premise) and H (hypothesis)]
4. Approaches to NLI alignment
- Alignment addressed variously by current NLI systems
- In some approaches to NLI, alignments are implicit:
  - NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  - NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
- Other NLI systems make the alignment step explicit:
  - Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
- What about using an MT aligner?
  - Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  - Can the tools & techniques of MT alignment transfer to NLI?
5. NLI alignment vs. MT alignment
- Doubtful! NLI alignment differs in several respects:
- Monolingual: can exploit resources like WordNet
- Asymmetric: P is often longer & has content unrelated to H
- Cannot assume semantic equivalence
  - An NLI aligner must accommodate frequent unaligned content
- Little training data available
  - MT aligners use unsupervised training on huge amounts of bitext
  - NLI aligners must rely on supervised training & much less data
6. Contributions of this paper
- In this paper, we:
- Undertake the first systematic study of alignment for NLI
  - Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
- Examine the relation between alignment in NLI and MT
  - How do existing MT aligners perform on the NLI alignment task?
- Propose a new model of alignment for NLI: MANLI
  - Outperforms existing MT & NLI aligners on the NLI alignment task
7. The MANLI aligner
- A model of alignment for NLI consisting of four components:
- Phrase-based representation
- Feature-based scoring function
- Decoding using simulated annealing
- Perceptron learning
8. Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS
EQ(Gazprom1, Gazprom1), INS(will2), DEL(today2), DEL(confirmed3), DEL(a4), SUB(two-fold5 increase6, double3), DEL(in7), DEL(its8), ...
- One-to-one at the phrase level (but many-to-many at the token level)
- Avoids arbitrary alignment choices; can use phrase-based resources (see the sketch below)
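To make the representation concrete, here is a minimal Python sketch of one way such a phrase-edit alignment might be encoded. The PhraseEdit class, field names, and 0-based token spans are our illustration, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A phrase is a contiguous span of token indices: (start, end), end-exclusive.
Span = Tuple[int, int]

@dataclass
class PhraseEdit:
    """One edit in an alignment: EQ/SUB link a premise span to a hypothesis
    span; DEL covers an unaligned premise span; INS an unaligned hypothesis span."""
    kind: str                      # 'EQ', 'SUB', 'DEL', or 'INS'
    p_span: Optional[Span] = None  # span in P (None for INS)
    h_span: Optional[Span] = None  # span in H (None for DEL)

# The start of the Gazprom example above, with 0-based token offsets:
alignment: List[PhraseEdit] = [
    PhraseEdit('EQ',  p_span=(0, 1), h_span=(0, 1)),  # Gazprom = Gazprom
    PhraseEdit('INS', h_span=(1, 2)),                 # will
    PhraseEdit('DEL', p_span=(1, 2)),                 # today
    PhraseEdit('DEL', p_span=(2, 3)),                 # confirmed
    PhraseEdit('DEL', p_span=(3, 4)),                 # a
    PhraseEdit('SUB', p_span=(4, 6), h_span=(2, 3)),  # two-fold increase ~ double
]
```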
9. A feature-based scoring function
- Score each edit as a linear combination of features, then sum over edits (see the sketch below)
- Edit type features: EQ, SUB, DEL, INS
- Phrase features: phrase sizes, non-constituents
- Lexical similarity feature: max over similarity scores
  - WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  - Distributional similarity à la Dekang Lin
  - Various measures of string/lemma similarity
- Contextual features: distortion, matching neighbors
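In symbols: an edit e scores w · Φ(e), and an alignment scores the sum over its edits. A minimal sketch, where featurize is a hypothetical extractor returning the features listed above as a {name: value} dict and weights is the learned weight vector:

```python
def score_edit(edit, weights, featurize):
    """w . Phi(e): dot product of the weight vector with the edit's features."""
    return sum(weights.get(f, 0.0) * v for f, v in featurize(edit).items())

def score_alignment(edits, weights, featurize):
    """Alignment score = sum of per-edit scores, as described above."""
    return sum(score_edit(e, weights, featurize) for e in edits)
```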
10. Decoding using simulated annealing
[Figure: animation of the simulated-annealing search loop, run 100 times]
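The slide's animation is lost here, but the loop admits a compact, heavily hedged sketch; this is not the authors' code. Here successors is a hypothetical generator of one-edit variants of the current alignment, score_alignment is the scorer sketched above, and the initial temperature and cooling factor are assumed values:

```python
import math
import random

def align(P, H, weights, featurize, successors,
          n_iters=100, temp=10.0, cooling=0.9):
    """Search for a high-scoring alignment by simulated annealing:
    sample a successor in proportion to exp(score / T), then cool T,
    so early iterations explore and later ones exploit greedily."""
    current, best = [], []          # start from the empty alignment
    for _ in range(n_iters):        # "100 times" per the slide
        succs = successors(current, P, H)
        scores = [score_alignment(s, weights, featurize) for s in succs]
        probs = [math.exp(sc / temp) for sc in scores]
        total = sum(probs)
        current = random.choices(succs, weights=[p / total for p in probs])[0]
        if (score_alignment(current, weights, featurize) >
                score_alignment(best, weights, featurize)):
            best = current
        temp *= cooling             # lower the temperature each iteration
    return best
```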
11. Perceptron learning of feature weights
- We use a variant of the averaged perceptron [Collins 2002]:

Initialize weight vector w ← 0, learning rate R_0 ← 1
For training epoch i = 1 to 50:
    For each problem ⟨P_j, H_j⟩ with gold alignment E_j:
        Set Ê_j ← ALIGN(P_j, H_j, w)
        Set w ← w + R_i · (Φ(E_j) − Φ(Ê_j))
    Set w ← w / ‖w‖_2 (L2 normalization)
    Set w_i ← w (store weight vector for this epoch)
    Set R_i ← 0.8 · R_{i−1} (reduce learning rate)
Throw away weight vectors from the first 20% of epochs
Return the average of the remaining weight vectors

Training runs require about 20 hours (on 800 RTE problems). A runnable sketch follows.
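A self-contained Python rendering of the pseudocode above. The helper names are ours: decode plays the role of ALIGN, and phi maps an alignment to a sparse {feature: count} dict:

```python
import math

def train(problems, decode, phi, n_epochs=50, rate=1.0, decay=0.8, burn_in=0.2):
    """Averaged perceptron: per-example updates, per-epoch L2 normalization
    and learning-rate decay, then average the post-burn-in weight vectors."""
    w, stored = {}, []
    for _ in range(n_epochs):
        for P, H, gold in problems:
            guess = decode(P, H, w)
            gold_f, guess_f = phi(gold), phi(guess)
            for f in set(gold_f) | set(guess_f):  # w += R_i (Phi(E) - Phi(E_hat))
                w[f] = w.get(f, 0.0) + rate * (gold_f.get(f, 0.0)
                                               - guess_f.get(f, 0.0))
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        w = {f: v / norm for f, v in w.items()}   # L2 normalization
        stored.append(dict(w))                    # store this epoch's weights
        rate *= decay                             # R_i = 0.8 * R_{i-1}
    kept = stored[int(burn_in * n_epochs):]       # drop first 20% of epochs
    feats = {f for wv in kept for f in wv}
    return {f: sum(wv.get(f, 0.0) for wv in kept) / len(kept) for f in feats}
```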
12. The MSR RTE2 alignment data
- Previously, little supervised data
- Now: MSR gold alignments for RTE2 [Brockett 2007]
  - dev & test sets, 800 problems each
- Token-based, but many-to-many
  - allows implicit alignment of phrases
- 3 independent annotators
  - 3 of 3 agreed on 70% of proposed links
  - 2 of 3 agreed on 99.7% of proposed links
  - merged using majority rule (see the sketch below)
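That merge is easy to state in code. A hypothetical sketch over sets of (p_index, h_index) token links:

```python
from collections import Counter

def merge_majority(annotations, quorum=2):
    """Keep a link iff at least `quorum` of the annotators proposed it."""
    counts = Counter(link for ann in annotations for link in ann)
    return {link for link, n in counts.items() if n >= quorum}

# merge_majority([{(0, 0), (1, 2)}, {(0, 0)}, {(0, 0), (1, 2), (3, 4)}])
# -> {(0, 0), (1, 2)}
```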
13. Evaluation on MSR data
- We evaluate several systems on the MSR data:
- A simple baseline aligner
- MT aligners: GIZA++ & Cross-EM
- NLI aligners: Stanford RTE, MANLI
- How well do they recover the gold-standard alignments?
- We report per-link precision, recall, and F1
- We also report the exact match rate for complete alignments (metrics sketched below)
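These metrics are standard over link sets; a minimal sketch, where links are (p_index, h_index) pairs:

```python
def link_prf(guess, gold):
    """Per-link precision, recall, and F1 for one problem."""
    tp = len(guess & gold)
    p = tp / len(guess) if guess else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def exact_match_rate(predictions, golds):
    """Fraction of problems whose predicted link set equals gold exactly."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```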
14. Baseline: bag-of-words aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3

(E% = exact match rate for complete alignments)
Match each H token to the most similar P token [cf. Glickman et al. 2005]
- Surprisingly good recall, despite its extreme simplicity
- But very mediocre precision, F1, and exact match rate
- Main problem: it aligns every token in H (see the sketch below)
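A sketch of this baseline, with sim standing in for any token-similarity function (the exact similarity measure is our placeholder):

```python
def bag_of_words_align(p_tokens, h_tokens, sim):
    """Link each hypothesis token to its single most similar premise token.
    Every H token gets a link, which explains the high recall and the
    mediocre precision noted above. Links are (p_index, h_index) pairs."""
    return {
        (max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h)), j)
        for j, h in enumerate(h_tokens)
    }
```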
15. MT aligners: GIZA++ & Cross-EM
- Can we show that MT aligners aren't suitable for NLI?
- Run GIZA++ via Moses, with default parameters
  - Train on the dev set, evaluate on the dev & test sets
  - Produce asymmetric alignments in both directions
  - Then symmetrize using the INTERSECTION heuristic (sketched below)
- Initial results are very poor: 56% F1
  - Doesn't even align equal words!
- Remedy: add a lexicon of equal words as extra training data
- Do similar experiments with the Berkeley Cross-EM aligner
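For reference, the INTERSECTION heuristic is simple to state in code; a sketch over asymmetric link sets:

```python
def symmetrize_intersection(p_to_h, h_to_p):
    """Keep only links proposed by both asymmetric alignments.
    p_to_h holds (p_index, h_index) pairs; h_to_p holds (h_index, p_index)."""
    return p_to_h & {(p, h) for (h, p) in h_to_p}
```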
16. Results: MT aligners
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
- Similar F1, but GIZA++ wins on precision, Cross-EM on recall
- Both do best with the lexicon & the INTERSECTION heuristic
- Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  - All achieve better recall, but much worse precision & F1
- Problem: too little data for unsupervised learning
- Need to compensate by exploiting external lexical resources
17. The Stanford RTE aligner
- Token-based alignments: a map from H tokens to P tokens
  - Phrase alignments are not directly representable
  - (But named entities & collocations are collapsed in pre-processing)
- Exploits external lexical resources
  - WordNet, LSA, distributional similarity, string similarity, ...
- Syntax-based features to promote aligning corresponding predicate-argument structures
- Decoding & learning similar to MANLI
18. Results: Stanford RTE aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
Stanford RTE*   81.1  75.8  78.4  0.5   82.7  75.8  79.1  0.3

* includes a (generous) correction for missed punctuation
- Better F1 than the MT aligners, but recall lags precision
- Stanford does a poor job aligning function words
  - 13% of links in the gold data are prepositions & articles
  - Stanford misses 67% of these (MANLI misses only 10%)
- Also, Stanford fails to align multi-word phrases
  - peace activists ~ protestors, hackers ~ non-authorized personnel
19. Results: MANLI aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
Stanford RTE    81.1  75.8  78.4  0.5   82.7  75.8  79.1  0.3
MANLI           83.4  85.5  84.4  21.7  85.4  85.3  85.3  21.3
- MANLI outperforms all the others on every measure
  - F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
- Good balance of precision & recall
- Matched >20% of alignments exactly
20. MANLI results: discussion
- Three factors contribute to its success:
  - Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  - Contextual features enable matching of function words
  - Phrases: death penalty ~ capital punishment, abdicate ~ give up
- But phrases help less than expected!
  - If we set the max phrase size to 1, we lose just 0.2% in F1
- Recall errors: room to improve
  - 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
- Precision errors: harder to reduce
  - equal function words (49%), forms of "be" (21%), punctuation (7%)
21. Can aligners predict RTE answers?
- We've been evaluating against gold-standard alignments
- But alignment is just one component of an NLI system
- Does a good alignment indicate a valid inference?
  - Not necessarily: negations, modals, non-factives & implicatives, ...
  - But the alignment score can be strongly predictive
  - And many NLI systems rely solely on alignment
- Using the alignment score to predict RTE answers (see the sketch below):
  - Predict YES if score > threshold
  - Tune the threshold on development data
  - Evaluate on test data
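A minimal sketch of that procedure; dev_scores and dev_labels are hypothetical per-problem alignment scores and gold YES/NO (boolean) answers:

```python
def tune_threshold(dev_scores, dev_labels):
    """Choose the threshold that maximizes accuracy on the dev data;
    predict YES iff alignment score > threshold."""
    def accuracy(t):
        return sum((s > t) == y
                   for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
    return max(sorted(set(dev_scores)), key=accuracy)

# threshold = tune_threshold(dev_scores, dev_labels)
# test_predictions = [s > threshold for s in test_scores]
```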
22. Results: predicting RTE answers
                          RTE2 dev        RTE2 test
System                    Acc%   AvgP%    Acc%   AvgP%
Bag-of-words              61.3   61.5     57.9   58.9
Stanford RTE              63.1   64.9     60.9   59.2
MANLI                     59.3   69.0     60.3   61.0
RTE2 entries (average)    --     --       58.5   59.1
LCC [Hickl et al. 2006]   --     --       75.4   80.8
- No NLI aligner rivals the best complete RTE system
  - (Most) complete systems do a lot more than just alignment!
- But Stanford & MANLI beat the average entry for RTE2
- Many NLI systems could benefit from better alignments!
23. Conclusion
- MT aligners are not directly applicable to NLI
  - They rely on unsupervised learning from massive amounts of bitext
  - They assume semantic equivalence of P & H
- MANLI succeeds by:
  - Exploiting (manually & automatically constructed) lexical resources
  - Accommodating frequent unaligned phrases
- The phrase-based representation shows potential
  - But it is not yet proven; we need better phrase-based lexical resources
24. Backup slides follow
25. Related work
- Lots of past work on phrase-based MT
  - But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
  - Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
- However, this work is of limited applicability to the NLI task
  - MANLI uses phrases only when words aren't appropriate
  - MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  - MT systems don't model word insertions & deletions