Title: Dependency-Based Automatic Evaluation for Machine Translation
1. Dependency-Based Automatic Evaluation for Machine Translation
- Karolina Owczarzak, Josef van Genabith, Andy Way
- National Centre for Language Technology
- School of Computing
- Dublin City University
2. Overview
- Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
- Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
- LFG in MT evaluation
- Assessing the level of parser noise: the adjunct attachment experiment
- Checking for bias: the Europarl experiment
- Correlation with human judgement: the MultiTrans experiment
- Future work
3. Automatic MT evaluation
- Automatic MT metrics: a fast and cheap way to evaluate your MT system
- Basic and most popular: BLEU, NIST
- Example (see the n-gram sketch below): "John resigned yesterday" vs. "Yesterday, John quit"
  - 1-grams: 2/3 (john, yesterday)
  - 2-grams: 0/2
  - 3-grams: 0/1
  - Total: 2/6 n-grams = 0.33
- String comparison is not sensitive to legitimate syntactic and lexical variation
- Need large test sets and/or multiple references
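- A minimal sketch of this n-gram matching (illustration only, not the full BLEU metric, which adds a brevity penalty and geometric averaging over n-gram orders):

    from collections import Counter

    def ngram_matches(hyp, ref, n):
        """Count clipped n-gram matches between hypothesis and reference."""
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        return matched, sum(hyp_ngrams.values())

    hyp = "yesterday john quit".split()      # translation, lowercased
    ref = "john resigned yesterday".split()  # reference, lowercased

    total_matched, total_hyp = 0, 0
    for n in (1, 2, 3):
        m, t = ngram_matches(hyp, ref, n)
        total_matched, total_hyp = total_matched + m, total_hyp + t
        print(f"{n}-grams: {m}/{t}")
    print(f"Total: {total_matched}/{total_hyp} = {total_matched / total_hyp:.2f}")  # 2/6 = 0.33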
4. Automatic MT evaluation
- Other attempts to include more variation in evaluation:
- General Text Matcher (GTM): precision and recall on translation-reference pairs, weighting contiguous matches more heavily
- Translation Error Rate (TER): edit distance for a translation-reference pair; number of insertions, deletions, substitutions and shifts (see the edit-distance sketch below)
- METEOR: sum of n-gram matches for exact string forms, stemmed words, and WordNet synonyms
- Kauchak and Barzilay (2006): using WordNet synonyms with BLEU
- Owczarzak et al. (2006): using paraphrases derived from the test set through word/phrase alignment, with BLEU and NIST
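- To make the TER idea concrete, a minimal word-level edit-distance sketch (insertions, deletions and substitutions only; the real TER also allows block shifts and normalises by reference length):

    def word_edit_distance(hyp, ref):
        """Word-level Levenshtein distance between two sentences."""
        h, r = hyp.split(), ref.split()
        # dp[i][j] = cost of editing the first i hyp words into the first j ref words
        dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            dp[i][0] = i
        for j in range(len(r) + 1):
            dp[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                sub = 0 if h[i - 1] == r[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution or match
        return dp[-1][-1]

    # TER-style rate = edits / reference length (shifts ignored in this sketch)
    hyp, ref = "yesterday john quit", "john resigned yesterday"
    print(word_edit_distance(hyp, ref) / len(ref.split()))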
5. Lexical-Functional Grammar
- Sentence structure representation:
  - c-structure (constituent): CFG trees, reflects surface word order and structural hierarchy
  - f-structure (functional): abstract grammatical (syntactic) relations
- "John resigned yesterday" vs. "Yesterday, John resigned"
- Triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- Triples, preds only (see the sketch below): SUBJ(resign, john), ADJ(resign, yesterday)
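- One convenient way to hold such an f-structure in code is as a set of (relation, head, dependent) tuples; the "preds only" view keeps the relations holding between predicates and drops atomic features such as PERS, NUM or TENSE. A minimal sketch (the exact label inventory below is assumed for illustration):

    # f-structure triples for "John resigned yesterday"
    triples = {
        ("SUBJ", "resign", "john"),
        ("PERS", "john", "3"),
        ("NUM", "john", "sg"),
        ("TENSE", "resign", "past"),
        ("ADJ", "resign", "yesterday"),
        ("PERS", "yesterday", "3"),
        ("NUM", "yesterday", "sg"),
    }

    # "preds only": keep relations between predicates, drop atomic features
    PRED_RELATIONS = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP", "ADJ"}  # assumed set
    preds_only = {t for t in triples if t[0] in PRED_RELATIONS}
    print(sorted(preds_only))  # [('ADJ', 'resign', 'yesterday'), ('SUBJ', 'resign', 'john')]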
6. LFG Parser
- Cahill et al. (2004): LFG parser based on the Penn-II Treebank (demo at http://lfg-demo.computing.dcu.ie/lfgparser.html)
- Automatic annotation of Charniak's/Bikel's output parse with attribute-value equations, resolving to f-structures
- Evaluation of parser quality: comparison of the dependencies produced by the parser with the set of dependencies in a human annotation of the same text, measured by precision and recall
- Our LFG parser reaches high precision and recall scores
7. LFG in MT evaluation
- Parse translation and reference into LFG f-structures, rendered as dependency triples
- Compare translation and reference text on the structural (dependency) level
- Calculate precision and recall on the translation and reference dependency sets (see the scoring sketch below)
- Comparison of two automatically produced outputs - how much noise does the parser introduce?
- "John resigned yesterday": SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- "Yesterday, John resigned": SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
- Identical triple sets, so the dependency-based score is 100%
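- A minimal sketch of the scoring step, assuming each sentence has already been reduced to a set of dependency triples as above (the F-score combination shown is a common convenience, not necessarily the exact figure reported):

    def dependency_scores(translation_triples, reference_triples):
        """Precision, recall and F1 over two sets of dependency triples."""
        matches = len(translation_triples & reference_triples)
        precision = matches / len(translation_triples) if translation_triples else 0.0
        recall = matches / len(reference_triples) if reference_triples else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Both word orders of the example sentence yield the same f-structure,
    # so the dependency-based score is 100%.
    translation = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
    reference = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
    print(dependency_scores(translation, reference))  # (1.0, 1.0, 1.0)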
8. The adjunct attachment experiment
- 100 English Europarl sentences containing adjuncts or coordinated structures
- Hand-modified to change the placement of the adjunct or the order of the coordinated elements, with no change in meaning or grammaticality
- "Schengen, on the other hand, is not organic." <- original reference
- "On the other hand, Schengen is not organic." <- modified "translation"
- Change limited to c-structure, no change in f-structure
- A perfect parser should give both versions identical sets of dependencies
9. The adjunct attachment experiment - results
[Chart: parser results on the original vs. modified sentence pairs]
10. The Europarl experiment
- N-gram-based metrics (BLEU, NIST) favour n-gram-based translation (statistical MT)
- Owczarzak et al. (2006):
  - BLEU: Pharaoh > Logomedia (0.0349)
  - NIST: Pharaoh > Logomedia (0.6219)
  - Human: Pharaoh < Logomedia (0.19)
- 4,000 sentences from Spanish-English Europarl
- Two translations:
  - Logomedia
  - Pharaoh
- Evaluated with BLEU, NIST, GTM, TER, METEOR (±WordNet), and the dependency-based method (basic, predicate-only, ±WordNet, ±bitext-generated paraphrases)
- WordNet paraphrases used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (see the sketch below)
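- A rough sketch of the WordNet step, under the simplifying assumption that a reference word may be swapped for a WordNet synonym whenever that synonym occurs in the translation (illustration only; the actual method works over dependency triples rather than raw word strings):

    from nltk.corpus import wordnet as wn  # requires nltk and its "wordnet" data

    def synonyms(word):
        """All single-word WordNet lemma names for the given word."""
        return {lemma.name().lower()
                for synset in wn.synsets(word)
                for lemma in synset.lemmas()
                if "_" not in lemma.name()}

    def best_matching_reference(reference, translation):
        """Replace reference words by WordNet synonyms found in the translation."""
        trans_words = set(translation)
        adapted = []
        for word in reference:
            if word in trans_words:
                adapted.append(word)
            else:
                overlap = synonyms(word) & trans_words
                adapted.append(overlap.pop() if overlap else word)
        return adapted

    print(best_matching_reference("john resigned yesterday".split(),
                                  "yesterday john quit".split()))
    # expected: ['john', 'quit', 'yesterday']  (resign ~ quit in WordNet)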
11. The Europarl experiment - results
[Chart: Europarl 4,000 sentences, Logomedia vs. Pharaoh]
12. The MultiTrans experiment
- Correlation of the dependency-based method with human evaluation
- Comparison with the correlations of BLEU, NIST, GTM, METEOR, TER
- Linguistic Data Consortium Multiple Translation Chinese Parts 2 and 4:
  - multiple translations of Chinese newswire text
  - four human-produced references
  - segment-level human scores for a subset of the translations
  - total of 16,800 translation-reference-human score segments
- Pearson's correlation coefficient (see the sketch below):
  - -1: negative correlation
  - 0: no correlation
  - 1: positive correlation
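- For completeness, a minimal sketch of the Pearson correlation used to compare segment-level metric scores with human judgements (the sample scores are made up for illustration):

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length score lists."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)

    metric_scores = [0.42, 0.55, 0.31, 0.78, 0.64]  # hypothetical metric scores
    human_scores = [3, 4, 2, 5, 4]                  # hypothetical human judgements
    print(pearson(metric_scores, human_scores))     # close to 1.0 for this sample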
13. The MultiTrans experiment - results
[Chart: correlation with human judgement of translation quality]
- The dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation is a more fluent translation
- A different position of a word means a different local (and global) structure: the word appears in dependency triples that do not match the reference
14. Future work
- Use n-best parses to reduce parser noise and increase the number of matches
- Generate a paraphrase set through word alignment from a large bitext (Europarl) and use it instead of WordNet
- Create weights for the individual dependency scores that contribute to the segment-level score, trained to maximize correlation with human judgement
15. Conclusions
- A new automatic method for evaluation of MT output
- LFG dependency triples: a simple logical form
- Evaluation on the structural level, not the surface string form
- Allows legitimate syntactic variation
- Allows legitimate lexical variation when used with WordNet or paraphrases
- Correlates more highly than other metrics with human evaluation of fluency
16. References
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
- Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004: 320-327.
- George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002: 138-145.
- David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006: 455-462.
- Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003: 48-54.
- Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005: 79-86.
- Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Proceedings of the AMTA 2004 Workshop on Machine Translation: From Real Users to Research: 115-124.
- Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation: 86-93.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL 2002: 311-318.
- Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006: 223-231.
- Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393.