Title: A Phrase-Based Model of Alignment for Natural Language Inference
1. A Phrase-Based Model of Alignment for Natural Language Inference
- Bill MacCartney, Michel Galley, and Christopher D. Manning
- Stanford University
- 26 October 2008
2. Natural language inference (NLI) (aka RTE)
- Does premise P justify an inference to hypothesis H?
- An informal notion of inference; variability of linguistic expression
P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
H: Gazprom will double Georgia's gas bill.  →  yes
- Like MT, NLI depends on a facility for alignment
- I.e., linking corresponding words/phrases in two related sentences
3. Alignment example
[Figure: example alignment grid linking tokens of P (premise) and H (hypothesis)]
4. Approaches to NLI alignment
- Alignment addressed variously by current NLI systems
- In some approaches to NLI, alignments are implicit:
  - NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  - NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
- Other NLI systems make the alignment step explicit:
  - Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
- What about using an MT aligner?
  - Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  - Can the tools & techniques of MT alignment transfer to NLI?
5. NLI alignment vs. MT alignment
- Doubtful! NLI alignment differs in several respects:
- Monolingual: can exploit resources like WordNet
- Asymmetric: P is often longer & has content unrelated to H
- Cannot assume semantic equivalence
  - An NLI aligner must accommodate frequent unaligned content
- Little training data available
  - MT aligners use unsupervised training on huge amounts of bitext
  - NLI aligners must rely on supervised training & much less data
6. Contributions of this paper
- In this paper, we:
- Undertake the first systematic study of alignment for NLI
  - Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
- Examine the relation between alignment in NLI and MT
  - How do existing MT aligners perform on the NLI alignment task?
- Propose a new model of alignment for NLI: MANLI
  - Outperforms existing MT & NLI aligners on the NLI alignment task
7. The MANLI aligner
- A model of alignment for NLI consisting of four components:
- Phrase-based representation
- Feature-based scoring function
- Decoding using simulated annealing
- Perceptron learning
8. Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS
EQ(Gazprom1, Gazprom1), INS(will2), DEL(today2), DEL(confirmed3), DEL(a4), SUB(two-fold5 increase6, double3), DEL(in7), DEL(its8), ...
- One-to-one at the phrase level (but many-to-many at the token level)
- Avoids arbitrary alignment choices; can use phrase-based resources (see the sketch below)
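To make the representation concrete, here is a minimal Python sketch of one way such a phrase-edit alignment might be encoded. The PhraseEdit class, field names, and 0-based token spans are our illustration, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A phrase is a contiguous span of token indices: (start, end), end-exclusive.
Span = Tuple[int, int]

@dataclass
class PhraseEdit:
    """One edit in an alignment: EQ/SUB link a premise span to a hypothesis
    span; DEL covers an unaligned premise span; INS an unaligned hypothesis span."""
    kind: str                      # 'EQ', 'SUB', 'DEL', or 'INS'
    p_span: Optional[Span] = None  # span in P (None for INS)
    h_span: Optional[Span] = None  # span in H (None for DEL)

# The start of the Gazprom example above, with 0-based token offsets:
alignment: List[PhraseEdit] = [
    PhraseEdit('EQ',  p_span=(0, 1), h_span=(0, 1)),  # Gazprom = Gazprom
    PhraseEdit('INS', h_span=(1, 2)),                 # will
    PhraseEdit('DEL', p_span=(1, 2)),                 # today
    PhraseEdit('DEL', p_span=(2, 3)),                 # confirmed
    PhraseEdit('DEL', p_span=(3, 4)),                 # a
    PhraseEdit('SUB', p_span=(4, 6), h_span=(2, 3)),  # two-fold increase ~ double
]
```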
9. A feature-based scoring function
- Score each edit as a linear combination of features, then sum over edits (see the sketch below)
- Edit type features: EQ, SUB, DEL, INS
- Phrase features: phrase sizes, non-constituents
- Lexical similarity feature: max over similarity scores
  - WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  - Distributional similarity à la Dekang Lin
  - Various measures of string/lemma similarity
- Contextual features: distortion, matching neighbors
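In symbols: an edit e scores w · Φ(e), and an alignment scores the sum over its edits. A minimal sketch, where featurize is a hypothetical extractor returning the features listed above as a {name: value} dict and weights is the learned weight vector:

```python
def score_edit(edit, weights, featurize):
    """w . Phi(e): dot product of the weight vector with the edit's features."""
    return sum(weights.get(f, 0.0) * v for f, v in featurize(edit).items())

def score_alignment(edits, weights, featurize):
    """Alignment score = sum of per-edit scores, as described above."""
    return sum(score_edit(e, weights, featurize) for e in edits)
```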
10. Decoding using simulated annealing
[Figure: animation of the simulated-annealing search loop, run 100 times]
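The slide's animation is lost here, but the loop admits a compact, heavily hedged sketch; this is not the authors' code. Here successors is a hypothetical generator of one-edit variants of the current alignment, score_alignment is the scorer sketched above, and the initial temperature and cooling factor are assumed values:

```python
import math
import random

def align(P, H, weights, featurize, successors,
          n_iters=100, temp=10.0, cooling=0.9):
    """Search for a high-scoring alignment by simulated annealing:
    sample a successor in proportion to exp(score / T), then cool T,
    so early iterations explore and later ones exploit greedily."""
    current, best = [], []          # start from the empty alignment
    for _ in range(n_iters):        # "100 times" per the slide
        succs = successors(current, P, H)
        scores = [score_alignment(s, weights, featurize) for s in succs]
        probs = [math.exp(sc / temp) for sc in scores]
        total = sum(probs)
        current = random.choices(succs, weights=[p / total for p in probs])[0]
        if (score_alignment(current, weights, featurize) >
                score_alignment(best, weights, featurize)):
            best = current
        temp *= cooling             # lower the temperature each iteration
    return best
```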
11. Perceptron learning of feature weights
- We use a variant of the averaged perceptron [Collins 2002]:

Initialize weight vector w ← 0, learning rate R_0 ← 1
For training epoch i = 1 to 50:
    For each problem ⟨P_j, H_j⟩ with gold alignment E_j:
        Set Ê_j ← ALIGN(P_j, H_j, w)
        Set w ← w + R_i · (Φ(E_j) − Φ(Ê_j))
    Set w ← w / ‖w‖_2 (L2 normalization)
    Set w_i ← w (store weight vector for this epoch)
    Set R_i ← 0.8 · R_{i−1} (reduce learning rate)
Throw away weight vectors from the first 20% of epochs
Return the average of the remaining weight vectors

Training runs require about 20 hours (on 800 RTE problems). A runnable sketch follows.
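A self-contained Python rendering of the pseudocode above. The helper names are ours: decode plays the role of ALIGN, and phi maps an alignment to a sparse {feature: count} dict:

```python
import math

def train(problems, decode, phi, n_epochs=50, rate=1.0, decay=0.8, burn_in=0.2):
    """Averaged perceptron: per-example updates, per-epoch L2 normalization
    and learning-rate decay, then average the post-burn-in weight vectors."""
    w, stored = {}, []
    for _ in range(n_epochs):
        for P, H, gold in problems:
            guess = decode(P, H, w)
            gold_f, guess_f = phi(gold), phi(guess)
            for f in set(gold_f) | set(guess_f):  # w += R_i (Phi(E) - Phi(E_hat))
                w[f] = w.get(f, 0.0) + rate * (gold_f.get(f, 0.0)
                                               - guess_f.get(f, 0.0))
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        w = {f: v / norm for f, v in w.items()}   # L2 normalization
        stored.append(dict(w))                    # store this epoch's weights
        rate *= decay                             # R_i = 0.8 * R_{i-1}
    kept = stored[int(burn_in * n_epochs):]       # drop first 20% of epochs
    feats = {f for wv in kept for f in wv}
    return {f: sum(wv.get(f, 0.0) for wv in kept) / len(kept) for f in feats}
```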
12. The MSR RTE2 alignment data
- Previously, little supervised data
- Now: MSR gold alignments for RTE2 [Brockett 2007]
  - dev & test sets, 800 problems each
- Token-based, but many-to-many
  - allows implicit alignment of phrases
- 3 independent annotators
  - 3 of 3 agreed on 70% of proposed links
  - 2 of 3 agreed on 99.7% of proposed links
  - merged using majority rule (see the sketch below)
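That merge is easy to state in code. A hypothetical sketch over sets of (p_index, h_index) token links:

```python
from collections import Counter

def merge_majority(annotations, quorum=2):
    """Keep a link iff at least `quorum` of the annotators proposed it."""
    counts = Counter(link for ann in annotations for link in ann)
    return {link for link, n in counts.items() if n >= quorum}

# merge_majority([{(0, 0), (1, 2)}, {(0, 0)}, {(0, 0), (1, 2), (3, 4)}])
# -> {(0, 0), (1, 2)}
```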
13. Evaluation on MSR data
- We evaluate several systems on the MSR data:
- A simple baseline aligner
- MT aligners: GIZA++ & Cross-EM
- NLI aligners: Stanford RTE, MANLI
- How well do they recover the gold-standard alignments?
- We report per-link precision, recall, and F1
- We also report the exact match rate for complete alignments (metrics sketched below)
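These metrics are standard over link sets; a minimal sketch, where links are (p_index, h_index) pairs:

```python
def link_prf(guess, gold):
    """Per-link precision, recall, and F1 for one problem."""
    tp = len(guess & gold)
    p = tp / len(guess) if guess else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def exact_match_rate(predictions, golds):
    """Fraction of problems whose predicted link set equals gold exactly."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```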
14. Baseline: bag-of-words aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3

(E% = exact match rate for complete alignments)
Match each H token to the most similar P token [cf. Glickman et al. 2005]
- Surprisingly good recall, despite its extreme simplicity
- But very mediocre precision, F1, and exact match rate
- Main problem: it aligns every token in H (see the sketch below)
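A sketch of this baseline, with sim standing in for any token-similarity function (the exact similarity measure is our placeholder):

```python
def bag_of_words_align(p_tokens, h_tokens, sim):
    """Link each hypothesis token to its single most similar premise token.
    Every H token gets a link, which explains the high recall and the
    mediocre precision noted above. Links are (p_index, h_index) pairs."""
    return {
        (max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h)), j)
        for j, h in enumerate(h_tokens)
    }
```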
15. MT aligners: GIZA++ & Cross-EM
- Can we show that MT aligners aren't suitable for NLI?
- Run GIZA++ via Moses, with default parameters
  - Train on the dev set, evaluate on the dev & test sets
  - Produce asymmetric alignments in both directions
  - Then symmetrize using the INTERSECTION heuristic (sketched below)
- Initial results are very poor: 56% F1
  - Doesn't even align equal words!
- Remedy: add a lexicon of equal words as extra training data
- Do similar experiments with the Berkeley Cross-EM aligner
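For reference, the INTERSECTION heuristic is simple to state in code; a sketch over asymmetric link sets:

```python
def symmetrize_intersection(p_to_h, h_to_p):
    """Keep only links proposed by both asymmetric alignments.
    p_to_h holds (p_index, h_index) pairs; h_to_p holds (h_index, p_index)."""
    return p_to_h & {(p, h) for (h, p) in h_to_p}
```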
16. Results: MT aligners
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
- Similar F1, but GIZA++ wins on precision, Cross-EM on recall
- Both do best with the lexicon & the INTERSECTION heuristic
- Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  - All achieve better recall, but much worse precision & F1
- Problem: too little data for unsupervised learning
- Need to compensate by exploiting external lexical resources
17. The Stanford RTE aligner
- Token-based alignments: a map from H tokens to P tokens
  - Phrase alignments are not directly representable
  - (But named entities & collocations are collapsed in pre-processing)
- Exploits external lexical resources
  - WordNet, LSA, distributional similarity, string similarity, ...
- Syntax-based features to promote aligning corresponding predicate-argument structures
- Decoding & learning similar to MANLI
18. Results: Stanford RTE aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
Stanford RTE*   81.1  75.8  78.4  0.5   82.7  75.8  79.1  0.3

* includes a (generous) correction for missed punctuation
- Better F1 than the MT aligners, but recall lags precision
- Stanford does a poor job aligning function words
  - 13% of links in the gold data are prepositions & articles
  - Stanford misses 67% of these (MANLI misses only 10%)
- Also, Stanford fails to align multi-word phrases
  - peace activists ~ protestors, hackers ~ non-authorized personnel
19. Results: MANLI aligner
                RTE2 dev                RTE2 test
System          P     R     F1    E%    P     R     F1    E%
Bag-of-words    57.8  81.2  67.5  3.5   62.1  82.6  70.9  5.3
GIZA++          83.0  66.4  72.1  9.4   85.1  69.1  74.8  11.3
Cross-EM        67.6  80.1  72.1  1.3   70.3  81.0  74.1  0.8
Stanford RTE    81.1  75.8  78.4  0.5   82.7  75.8  79.1  0.3
MANLI           83.4  85.5  84.4  21.7  85.4  85.3  85.3  21.3
- MANLI outperforms all the others on every measure
  - F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
- Good balance of precision & recall
- Matched >20% of alignments exactly
20. MANLI results: discussion
- Three factors contribute to its success:
  - Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  - Contextual features enable matching of function words
  - Phrases: death penalty ~ capital punishment, abdicate ~ give up
- But phrases help less than expected!
  - If we set the max phrase size to 1, we lose just 0.2% in F1
- Recall errors: room to improve
  - 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
- Precision errors: harder to reduce
  - equal function words (49%), forms of "be" (21%), punctuation (7%)
21. Can aligners predict RTE answers?
- We've been evaluating against gold-standard alignments
- But alignment is just one component of an NLI system
- Does a good alignment indicate a valid inference?
  - Not necessarily: negations, modals, non-factives & implicatives, ...
  - But the alignment score can be strongly predictive
  - And many NLI systems rely solely on alignment
- Using the alignment score to predict RTE answers (see the sketch below):
  - Predict YES if score > threshold
  - Tune the threshold on development data
  - Evaluate on test data
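A minimal sketch of that procedure; dev_scores and dev_labels are hypothetical per-problem alignment scores and gold YES/NO (boolean) answers:

```python
def tune_threshold(dev_scores, dev_labels):
    """Choose the threshold that maximizes accuracy on the dev data;
    predict YES iff alignment score > threshold."""
    def accuracy(t):
        return sum((s > t) == y
                   for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
    return max(sorted(set(dev_scores)), key=accuracy)

# threshold = tune_threshold(dev_scores, dev_labels)
# test_predictions = [s > threshold for s in test_scores]
```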
22. Results: predicting RTE answers
                          RTE2 dev        RTE2 test
System                    Acc%   AvgP%    Acc%   AvgP%
Bag-of-words              61.3   61.5     57.9   58.9
Stanford RTE              63.1   64.9     60.9   59.2
MANLI                     59.3   69.0     60.3   61.0
RTE2 entries (average)    --     --       58.5   59.1
LCC [Hickl et al. 2006]   --     --       75.4   80.8
- No NLI aligner rivals the best complete RTE system
  - (Most) complete systems do a lot more than just alignment!
- But Stanford & MANLI beat the average entry for RTE2
- Many NLI systems could benefit from better alignments!
23. Conclusion
- MT aligners are not directly applicable to NLI
  - They rely on unsupervised learning from massive amounts of bitext
  - They assume semantic equivalence of P & H
- MANLI succeeds by:
  - Exploiting (manually & automatically constructed) lexical resources
  - Accommodating frequent unaligned phrases
- The phrase-based representation shows potential
  - But it is not yet proven; we need better phrase-based lexical resources
24. Backup slides follow
25. Related work
- Lots of past work on phrase-based MT
  - But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
  - Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
- However, this work is of limited applicability to the NLI task
  - MANLI uses phrases only when words aren't appropriate
  - MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  - MT systems don't model word insertions & deletions