Title: Dependency-Based Automatic Evaluation for Machine Translation
1. Dependency-Based Automatic Evaluation for Machine Translation
- Karolina Owczarzak, Josef van Genabith, Andy Way
- National Centre for Language Technology
- School of Computing
- Dublin City University
2. Overview
- Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
- Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
- LFG in MT evaluation
- Assessing the level of parser noise: the adjunct attachment experiment
- Checking for bias: the Europarl experiment
- Correlation with human judgement: the MultiTrans experiment
- Future work
3. Automatic MT evaluation
- Automatic MT metrics: a fast and cheap way to evaluate your MT system
- Basic and most popular: BLEU, NIST
- Example (see the n-gram sketch below): "John resigned yesterday" vs. "Yesterday, John quit"
  - 1-grams: 2/3 (john, yesterday)
  - 2-grams: 0/2
  - 3-grams: 0/1
  - Total: 2/6 n-grams = 0.33
- String comparison is not sensitive to legitimate syntactic and lexical variation
- Need large test sets and/or multiple references
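- A minimal sketch of this n-gram matching (illustration only, not the full BLEU metric, which adds a brevity penalty and geometric averaging over n-gram orders):

    from collections import Counter

    def ngram_matches(hyp, ref, n):
        """Count clipped n-gram matches between hypothesis and reference."""
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        return matched, sum(hyp_ngrams.values())

    hyp = "yesterday john quit".split()      # translation, lowercased
    ref = "john resigned yesterday".split()  # reference, lowercased

    total_matched, total_hyp = 0, 0
    for n in (1, 2, 3):
        m, t = ngram_matches(hyp, ref, n)
        total_matched, total_hyp = total_matched + m, total_hyp + t
        print(f"{n}-grams: {m}/{t}")
    print(f"Total: {total_matched}/{total_hyp} = {total_matched / total_hyp:.2f}")  # 2/6 = 0.33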
4. Automatic MT evaluation
- Other attempts to include more variation in evaluation:
- General Text Matcher (GTM): precision and recall on translation-reference pairs, weighting contiguous matches more heavily
- Translation Error Rate (TER): edit distance for a translation-reference pair; number of insertions, deletions, substitutions and shifts (see the edit-distance sketch below)
- METEOR: sum of n-gram matches for exact string forms, stemmed words, and WordNet synonyms
- Kauchak and Barzilay (2006): using WordNet synonyms with BLEU
- Owczarzak et al. (2006): using paraphrases derived from the test set through word/phrase alignment, with BLEU and NIST
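- To make the TER idea concrete, a minimal word-level edit-distance sketch (insertions, deletions and substitutions only; the real TER also allows block shifts and normalises by reference length):

    def word_edit_distance(hyp, ref):
        """Word-level Levenshtein distance between two sentences."""
        h, r = hyp.split(), ref.split()
        # dp[i][j] = cost of editing the first i hyp words into the first j ref words
        dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            dp[i][0] = i
        for j in range(len(r) + 1):
            dp[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                sub = 0 if h[i - 1] == r[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution or match
        return dp[-1][-1]

    # TER-style rate = edits / reference length (shifts ignored in this sketch)
    hyp, ref = "yesterday john quit", "john resigned yesterday"
    print(word_edit_distance(hyp, ref) / len(ref.split()))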
5. Lexical-Functional Grammar
- Sentence structure representation:
  - c-structure (constituent): CFG trees, reflects surface word order and structural hierarchy
  - f-structure (functional): abstract grammatical (syntactic) relations
- "John resigned yesterday" vs. "Yesterday, John resigned"
- Triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- Triples, preds only (see the sketch below): SUBJ(resign, john), ADJ(resign, yesterday)
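- One convenient way to hold such an f-structure in code is as a set of (relation, head, dependent) tuples; the "preds only" view keeps the relations holding between predicates and drops atomic features such as PERS, NUM or TENSE. A minimal sketch (the exact label inventory below is assumed for illustration):

    # f-structure triples for "John resigned yesterday"
    triples = {
        ("SUBJ", "resign", "john"),
        ("PERS", "john", "3"),
        ("NUM", "john", "sg"),
        ("TENSE", "resign", "past"),
        ("ADJ", "resign", "yesterday"),
        ("PERS", "yesterday", "3"),
        ("NUM", "yesterday", "sg"),
    }

    # "preds only": keep relations between predicates, drop atomic features
    PRED_RELATIONS = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP", "ADJ"}  # assumed set
    preds_only = {t for t in triples if t[0] in PRED_RELATIONS}
    print(sorted(preds_only))  # [('ADJ', 'resign', 'yesterday'), ('SUBJ', 'resign', 'john')]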
6. LFG Parser
- Cahill et al. (2004): LFG parser based on the Penn-II Treebank (demo at http://lfg-demo.computing.dcu.ie/lfgparser.html)
- Automatic annotation of Charniak's/Bikel's output parse with attribute-value equations, resolving to f-structures
- Evaluation of parser quality: comparison of the dependencies produced by the parser with the set of dependencies in a human annotation of the same text, measured by precision and recall
- Our LFG parser reaches high precision and recall scores
7. LFG in MT evaluation
- Parse translation and reference into LFG f-structures, rendered as dependency triples
- Compare translation and reference text on the structural (dependency) level
- Calculate precision and recall on the translation and reference dependency sets (see the scoring sketch below)
- Comparison of two automatically produced outputs - how much noise does the parser introduce?
- "John resigned yesterday": SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- "Yesterday, John resigned": SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- Identical triple sets, so the dependency-based score is 100%
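- A minimal sketch of the scoring step, assuming each sentence has already been reduced to a set of dependency triples as above (the F-score combination shown is a common convenience, not necessarily the exact figure reported):

    def dependency_scores(translation_triples, reference_triples):
        """Precision, recall and F1 over two sets of dependency triples."""
        matches = len(translation_triples & reference_triples)
        precision = matches / len(translation_triples) if translation_triples else 0.0
        recall = matches / len(reference_triples) if reference_triples else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Both word orders of the example sentence yield the same f-structure,
    # so the dependency-based score is 100%.
    translation = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
    reference = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
    print(dependency_scores(translation, reference))  # (1.0, 1.0, 1.0)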
8. The adjunct attachment experiment
- 100 English Europarl sentences containing adjuncts or coordinated structures
- Hand-modified to change the placement of the adjunct or the order of the coordinated elements, with no change in meaning or grammaticality
- "Schengen, on the other hand, is not organic." <- original reference
- "On the other hand, Schengen is not organic." <- modified "translation"
- Change limited to c-structure, no change in f-structure
- A perfect parser should give both versions identical sets of dependencies
9. The adjunct attachment experiment - results
[Chart: parser results on the original vs. modified sentence pairs]
10. The Europarl experiment
- N-gram-based metrics (BLEU, NIST) favour n-gram-based translation (statistical MT)
- Owczarzak et al. (2006):
  - BLEU: Pharaoh > Logomedia (0.0349)
  - NIST: Pharaoh > Logomedia (0.6219)
  - Human: Pharaoh < Logomedia (0.19)
- 4,000 sentences from Spanish-English Europarl
- Two translations:
  - Logomedia
  - Pharaoh
- Evaluated with BLEU, NIST, GTM, TER, METEOR (±WordNet), and the dependency-based method (basic, predicate-only, ±WordNet, ±bitext-generated paraphrases)
- WordNet paraphrases used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (see the sketch below)
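- A rough sketch of the WordNet step, under the simplifying assumption that a reference word may be swapped for a WordNet synonym whenever that synonym occurs in the translation (illustration only; the actual method works over dependency triples rather than raw word strings):

    from nltk.corpus import wordnet as wn  # requires nltk and its "wordnet" data

    def synonyms(word):
        """All single-word WordNet lemma names for the given word."""
        return {lemma.name().lower()
                for synset in wn.synsets(word)
                for lemma in synset.lemmas()
                if "_" not in lemma.name()}

    def best_matching_reference(reference, translation):
        """Replace reference words by WordNet synonyms found in the translation."""
        trans_words = set(translation)
        adapted = []
        for word in reference:
            if word in trans_words:
                adapted.append(word)
            else:
                overlap = synonyms(word) & trans_words
                adapted.append(overlap.pop() if overlap else word)
        return adapted

    print(best_matching_reference("john resigned yesterday".split(),
                                  "yesterday john quit".split()))
    # expected: ['john', 'quit', 'yesterday']  (resign ~ quit in WordNet)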
11. The Europarl experiment - results
[Chart: Europarl 4,000 sentences, Logomedia vs. Pharaoh]
12. The MultiTrans experiment
- Correlation of the dependency-based method with human evaluation
- Comparison with the correlations of BLEU, NIST, GTM, METEOR, TER
- Linguistic Data Consortium Multiple Translation Chinese Parts 2 and 4:
  - multiple translations of Chinese newswire text
  - four human-produced references
  - segment-level human scores for a subset of the translations
  - total of 16,800 translation-reference-human score segments
- Pearson's correlation coefficient (see the sketch below):
  - -1: negative correlation
  - 0: no correlation
  - 1: positive correlation
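- For completeness, a minimal sketch of the Pearson correlation used to compare segment-level metric scores with human judgements (the sample scores are made up for illustration):

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length score lists."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)

    metric_scores = [0.42, 0.55, 0.31, 0.78, 0.64]  # hypothetical metric scores
    human_scores = [3, 4, 2, 5, 4]                  # hypothetical human judgements
    print(pearson(metric_scores, human_scores))     # close to 1.0 for this sample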
13. The MultiTrans experiment - results
[Chart: correlation with human judgement of translation quality]
- The dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation is a more fluent translation
- A different position of a word means a different local (and global) structure: the word appears in dependency triples that do not match the reference
14. Future work
- Use n-best parses to reduce parser noise and increase the number of matches
- Generate a paraphrase set through word alignment from a large bitext (Europarl) and use it instead of WordNet
- Create weights for the individual dependency scores that contribute to the segment-level score, trained to maximize correlation with human judgement
15. Conclusions
- A new automatic method for evaluation of MT output
- LFG dependency triples: a simple logical form
- Evaluation on the structural level, not the surface string form
- Allows legitimate syntactic variation
- Allows legitimate lexical variation when used with WordNet or paraphrases
- Correlates more highly than other metrics with human evaluation of fluency
16. References
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
- Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004: 320-327.
- George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002: 138-145.
- David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006: 455-462.
- Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003: 48-54.
- Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005: 79-86.
- Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Proceedings of the AMTA 2004 Workshop on Machine Translation: From Real Users to Research: 115-124.
- Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation: 86-93.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL 2002: 311-318.
- Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006: 223-231.
- Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393.