Transcript and Presenter's Notes

Title: A Phrase-Based Model of Alignment for Natural Language Inference


1
A Phrase-Based Model of Alignment for Natural Language Inference
  • Bill MacCartney, Michel Galley,
  • and Christopher D. Manning
  • Stanford University
  • 26 October 2008

2
Natural language inference (NLI) (aka RTE)
  • Does premise P justify an inference to hypothesis
    H?
  • An informal notion of inference; must accommodate the
    variability of linguistic expression

P: Gazprom today confirmed a two-fold increase in its gas price
   for Georgia, beginning next Monday.
H: Gazprom will double Georgia's gas bill.          (answer: yes)
  • Like MT, NLI depends on a facility for alignment
  • I.e., linking corresponding words/phrases in two
    related sentences

3
Alignment example
[diagram: token-by-token alignment links between P (premise) and H (hypothesis)]
4
Approaches to NLI alignment
  • Alignment addressed variously by current NLI
    systems
  • In some approaches to NLI, alignments are
    implicit
  • NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
  • Other NLI systems make the alignment step explicit
  • Align first, then determine inferential validity
    [Marsi & Krahmer 05, MacCartney et al. 06]
  • What about using an MT aligner?
  • Alignment is familiar in MT, with an extensive literature
    [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02,
    DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • Can the tools & techniques of MT alignment transfer to NLI?

5
NLI alignment vs. MT alignment
  • Doubtful: NLI alignment differs in several respects
  • Monolingual: can exploit resources like WordNet
  • Asymmetric: P is often longer & has content unrelated to H
  • Cannot assume semantic equivalence
  • An NLI aligner must accommodate frequent unaligned content
  • Little training data available
  • MT aligners use unsupervised training on huge amounts of bitext
  • NLI aligners must rely on supervised training & much less data

6
Contributions of this paper
  • In this paper, we
  • Undertake the first systematic study of alignment
    for NLI
  • Existing NLI aligners use idiosyncratic methods,
    are poorly documented, use proprietary data
  • Examine the relation between alignment in NLI and
    MT
  • How do existing MT aligners perform on the NLI alignment task?
  • Propose a new model of alignment for NLI: MANLI
  • Outperforms existing MT & NLI aligners on the NLI alignment task

7
The MANLI aligner
  • A model of alignment for NLI consisting of four
    components
  1. Phrase-based representation
  2. Feature-based scoring function
  3. Decoding using simulated annealing
  4. Perceptron learning

8
Phrase-based alignment representation
Represent alignments by a sequence of phrase edits:
EQ, SUB, DEL, INS

EQ(Gazprom1, Gazprom1), INS(will2), DEL(today2), DEL(confirmed3),
DEL(a4), SUB(two-fold5 increase6, double3), DEL(in7), DEL(its8)
  • One-to-one at the phrase level (but many-to-many at the token
    level); see the sketch below
  • Avoids arbitrary alignment choices; can use phrase-based resources
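
A minimal sketch (not the authors' code) of what this phrase-based
representation might look like in Python; the class and field names are
my own illustration:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class PhraseEdit:
        """One edit linking a phrase of P to a phrase of H (either may be empty)."""
        kind: str                  # 'EQ', 'SUB', 'DEL' (P-only), or 'INS' (H-only)
        p_span: Tuple[int, int]    # premise token indices (start, end), end exclusive
        h_span: Tuple[int, int]    # hypothesis token indices (start, end), end exclusive

    Alignment = List[PhraseEdit]   # one-to-one at the phrase level

    # Fragment of the Gazprom example (token indices are illustrative):
    example: Alignment = [
        PhraseEdit('EQ',  (0, 1), (0, 1)),   # EQ(Gazprom, Gazprom)
        PhraseEdit('INS', (0, 0), (1, 2)),   # INS(will)
        PhraseEdit('DEL', (1, 2), (0, 0)),   # DEL(today)
        PhraseEdit('SUB', (4, 6), (2, 3)),   # SUB(two-fold increase, double)
    ]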

9
A feature-based scoring function
  • Score edits as a linear combination of features, then sum over
    edits (see the scoring sketch below)
  • Edit type features: EQ, SUB, DEL, INS
  • Phrase features: phrase sizes, non-constituents
  • Lexical similarity feature: max over similarity scores
  • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  • Distributional similarity à la Dekang Lin
  • Various measures of string/lemma similarity
  • Contextual features: distortion, matching neighbors
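
A minimal sketch (my own, under assumptions) of the feature-based
scoring: each edit e gets a feature vector phi(e), its score is the dot
product w · phi(e), and the alignment score is the sum of edit scores.
Here `phi` is a hypothetical feature extractor returning a dict of
feature names to values:

    from typing import Callable, Dict, Iterable

    def score_edit(edit, w: Dict[str, float],
                   phi: Callable[[object], Dict[str, float]]) -> float:
        """Score one edit as a linear combination of its features."""
        features = phi(edit)   # e.g. {'type=SUB': 1.0, 'lexical_sim': 0.8, 'size=2': 1.0}
        return sum(w.get(name, 0.0) * value for name, value in features.items())

    def score_alignment(alignment: Iterable, w: Dict[str, float], phi) -> float:
        """Alignment score = sum of the scores of its phrase edits."""
        return sum(score_edit(e, w, phi) for e in alignment)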

10
Decoding using simulated annealing
[flowchart of the simulated annealing decoding loop, repeated 100 times]
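
A minimal, generic simulated-annealing decoder sketch (my own
assumptions, not the exact MANLI procedure): start from an initial
alignment, repeatedly sample a neighboring alignment in proportion to
its temperature-scaled score, cool the temperature, and keep the best
alignment seen. The helpers `neighbors` and `score` are hypothetical:

    import math
    import random

    def anneal_decode(initial, neighbors, score, iterations: int = 100,
                      temp: float = 10.0, cooling: float = 0.9):
        """neighbors(a) yields alignments one edit away from a; score(a) -> float."""
        current = initial
        best, best_score = current, score(current)
        for _ in range(iterations):
            candidates = list(neighbors(current))
            if not candidates:
                break
            # Boltzmann-style weights: higher-scoring neighbors are favored,
            # more sharply as the temperature drops.
            weights = [math.exp(score(c) / temp) for c in candidates]
            current = random.choices(candidates, weights=weights, k=1)[0]
            if score(current) > best_score:
                best, best_score = current, score(current)
            temp *= cooling
        return best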
11
Perceptron learning of feature weights
  • We use a variant of the averaged perceptron [Collins 2002]

Initialize weight vector w = 0, learning rate R0 = 1
For training epoch i = 1 to 50:
  For each problem ⟨Pj, Hj⟩ with gold alignment Ej:
    Set Êj = ALIGN(Pj, Hj, w)
    Set w = w + Ri · (Φ(Ej) − Φ(Êj))
  Set w = w / ‖w‖2        (L2 normalization)
  Set wi = w              (store weight vector for this epoch)
  Set Ri = 0.8 · Ri−1     (reduce learning rate)
Throw away the weight vectors from the first 20% of epochs
Return the average of the remaining weight vectors
Training runs require about 20 hours (on 800 RTE
problems)
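
A minimal Python sketch of the training loop above, under my own
assumptions about the helper functions (`align`, `feature_vector`) and
with dense numpy vectors standing in for the feature map:

    import numpy as np

    def train_perceptron(problems, gold_alignments, align, feature_vector,
                         epochs: int = 50, rate: float = 1.0, decay: float = 0.8,
                         burn_in: float = 0.2, dim: int = 1000):
        """problems: list of (P, H); align(P, H, w) -> predicted alignment;
        feature_vector(alignment) -> np.ndarray of length dim."""
        w = np.zeros(dim)
        stored = []
        for _ in range(epochs):
            for (P, H), gold in zip(problems, gold_alignments):
                guess = align(P, H, w)
                # Move weights toward the gold alignment's features.
                w = w + rate * (feature_vector(gold) - feature_vector(guess))
            norm = np.linalg.norm(w)
            if norm > 0:
                w = w / norm            # L2 normalization at the end of each epoch
            stored.append(w.copy())     # store this epoch's weight vector
            rate *= decay               # reduce the learning rate
        keep = stored[int(burn_in * epochs):]   # discard the first 20% of epochs
        return np.mean(keep, axis=0)            # averaged weight vector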
12
The MSR RTE2 alignment data
  • Previously, little supervised data
  • Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
  • Token-based, but many-to-many
  • allows implicit alignment of phrases
  • 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • merged using majority rule

13
Evaluation on MSR data
  • We evaluate several systems on the MSR data
  • A simple baseline aligner
  • MT aligners: GIZA++ & Cross-EM
  • NLI aligners: Stanford RTE, MANLI
  • How well do they recover the gold-standard alignments?
  • We report per-link precision, recall, and F1 (see the metric
    sketch below)
  • We also report the exact match rate for complete alignments
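
A minimal sketch (my own) of the per-link evaluation: treat an
alignment as a set of (premise token, hypothesis token) links and
compare predicted links to gold links:

    from typing import List, Set, Tuple

    Link = Tuple[int, int]   # (premise token index, hypothesis token index)

    def link_prf(pred: Set[Link], gold: Set[Link]):
        """Per-link precision, recall, and F1 for one problem."""
        tp = len(pred & gold)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f1

    def exact_match_rate(preds: List[Set[Link]], golds: List[Set[Link]]) -> float:
        """Fraction of problems whose predicted alignment matches gold exactly."""
        return sum(pr == go for pr, go in zip(preds, golds)) / len(golds)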

14
Baseline bag-of-words aligner
                 RTE2 dev                  RTE2 test
System           P     R     F1    E       P     R     F1    E
Bag-of-words     57.8  81.2  67.5  3.5     62.1  82.6  70.9  5.3
(P, R, F1 per link; E = exact match rate; all in %)
Match each H token to the most similar P token [cf. Glickman et al.
2005]; a sketch appears below.
  • Surprisingly good recall, despite extreme simplicity
  • But very mediocre precision, F1, and exact match rate
  • Main problem: aligns every token in H
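
A minimal sketch (my own, with a stand-in similarity function) of this
bag-of-words baseline: link every hypothesis token to its most similar
premise token.

    from typing import Callable, List, Set, Tuple

    def bow_align(p_tokens: List[str], h_tokens: List[str],
                  sim: Callable[[str, str], float]) -> Set[Tuple[int, int]]:
        """Return a set of (p_index, h_index) links, one per hypothesis token."""
        links = set()
        for j, h in enumerate(h_tokens):
            # Every H token gets aligned, which is exactly the precision
            # problem noted above.
            i = max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h))
            links.add((i, j))
        return links

    # Example with a trivial similarity (exact match scores 1, otherwise 0):
    links = bow_align("Gazprom confirmed a two-fold increase".split(),
                      "Gazprom will double".split(),
                      lambda a, b: 1.0 if a.lower() == b.lower() else 0.0)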

15
MT aligners GIZA Cross-EM
  • Can we show that MT aligners aren't suitable for NLI?
  • Run GIZA++ via Moses, with default parameters
  • Train on the dev set, evaluate on the dev & test sets
  • Asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic (see the sketch
    below)
  • Initial results are very poor: 56% F1
  • Doesn't even align equal words
  • Remedy: add a lexicon of equal words as extra training data
  • Do similar experiments with the Berkeley Cross-EM aligner
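
A minimal sketch (my own) of the INTERSECTION symmetrization heuristic:
keep only the token links proposed by both directional alignments.

    from typing import Set, Tuple

    Link = Tuple[int, int]   # (premise token index, hypothesis token index)

    def intersect_symmetrize(p_to_h: Set[Link], h_to_p: Set[Link]) -> Set[Link]:
        """Both inputs are sets of (p, h) links from the two asymmetric runs."""
        return p_to_h & h_to_p

    # Example: only the link agreed on by both directions survives.
    sym = intersect_symmetrize({(0, 0), (2, 1)}, {(0, 0), (3, 1)})
    assert sym == {(0, 0)}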

16
Results MT aligners
                 RTE2 dev                  RTE2 test
System           P     R     F1    E       P     R     F1    E
Bag-of-words     57.8  81.2  67.5  3.5     62.1  82.6  70.9  5.3
GIZA++           83.0  66.4  72.1  9.4     85.1  69.1  74.8  11.3
Cross-EM         67.6  80.1  72.1  1.3     70.3  81.0  74.1  0.8
  • Similar F1, but GIZA++ wins on precision, Cross-EM on recall
  • Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL,
    GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
  • Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources

17
The Stanford RTE aligner
  • Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments are not directly representable
  • (But named entities & collocations are collapsed in pre-processing)
  • Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, ...
  • Syntax-based features to promote aligning corresponding
    predicate-argument structures
  • Decoding & learning similar to MANLI

18
Results Stanford RTE aligner
                 RTE2 dev                  RTE2 test
System           P     R     F1    E       P     R     F1    E
Bag-of-words     57.8  81.2  67.5  3.5     62.1  82.6  70.9  5.3
GIZA++           83.0  66.4  72.1  9.4     85.1  69.1  74.8  11.3
Cross-EM         67.6  80.1  72.1  1.3     70.3  81.0  74.1  0.8
Stanford RTE*    81.1  75.8  78.4  0.5     82.7  75.8  79.1  0.3

* includes a (generous) correction for missed punctuation
  • Better F1 than the MT aligners, but recall lags precision
  • Stanford does a poor job aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
  • Also, Stanford fails to align multi-word phrases
  • peace activists ↔ protestors, hackers ↔ non-authorized personnel

19
Results MANLI aligner
                 RTE2 dev                  RTE2 test
System           P     R     F1    E       P     R     F1    E
Bag-of-words     57.8  81.2  67.5  3.5     62.1  82.6  70.9  5.3
GIZA++           83.0  66.4  72.1  9.4     85.1  69.1  74.8  11.3
Cross-EM         67.6  80.1  72.1  1.3     70.3  81.0  74.1  0.8
Stanford RTE     81.1  75.8  78.4  0.5     82.7  75.8  79.1  0.3
MANLI            83.4  85.5  84.4  21.7    85.4  85.3  85.3  21.3
  • MANLI outperforms all others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
  • Good balance of precision & recall
  • Matched >20% of alignments exactly

20
MANLI results discussion
  • Three factors contribute to success
  • Lexical resources: jail ≈ prison, prevent ≈ stop, injured ≈ wounded
  • Contextual features enable matching function words
  • Phrases: death penalty ≈ capital punishment, abdicate ≈ give up
  • But phrases help less than expected!
  • If we set max phrase size = 1, we lose just 0.2% in F1
  • Recall errors: room to improve
  • 40% need better lexical resources: conservation ≈ protecting,
    organization ≈ agencies, bone fragility ≈ osteoporosis
  • Precision errors: harder to reduce
  • equal function words (49%), forms of be (21%), punctuation (7%)

21
Can aligners predict RTE answers?
  • We've been evaluating against gold-standard alignments
  • But alignment is just one component of an NLI system
  • Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, ...
  • But alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
  • Using alignment score to predict RTE answers (see the sketch below):
  • Predict YES if score > threshold
  • Tune threshold on development data
  • Evaluate on test data
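
A minimal sketch (my own) of this use of the alignment score as an RTE
classifier: pick the score threshold that maximizes accuracy on the dev
set, then apply it to the test set.

    from typing import List, Tuple

    def tune_threshold(dev: List[Tuple[float, bool]]) -> float:
        """dev: (alignment_score, gold_answer) pairs; returns the best cutoff."""
        candidates = sorted({score for score, _ in dev})
        def accuracy(t: float) -> float:
            return sum((score > t) == gold for score, gold in dev) / len(dev)
        return max(candidates, key=accuracy)

    def predict(scores: List[float], threshold: float) -> List[bool]:
        """Answer YES exactly when the alignment score exceeds the threshold."""
        return [s > threshold for s in scores]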

22
Results predicting RTE answers
                          RTE2 dev        RTE2 test
System                    Acc    AvgP     Acc    AvgP
Bag-of-words              61.3   61.5     57.9   58.9
Stanford RTE              63.1   64.9     60.9   59.2
MANLI                     59.3   69.0     60.3   61.0
RTE2 entries (average)                    58.5   59.1
LCC [Hickl et al. 2006]                   75.4   80.8
  • No NLI aligner rivals the best complete RTE system
  • (Most) complete systems do a lot more than just alignment!
  • But Stanford & MANLI beat the average entry for RTE2
  • Many NLI systems could benefit from better alignments!

23
Conclusion
  • MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
  • MANLI succeeds by
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
  • The phrase-based representation shows potential
  • But not yet proven: need better phrase-based lexical resources

24
Backup slides follow
  • END

25
Related work
  • Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data
  • Despite the assumption that many translations are non-compositional
  • Recent work jointly aligns & weights phrases [Marcu & Wong 02,
    DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies
    (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions