Title: Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation

Slide 1: Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation
Karolina Owczarzak, Declan Groves, Josef van Genabith, Andy Way
National Centre for Language Technology, Dublin City University
HLT-NAACL, 9 June 2006
Slide 2: Overview
- Automatic MT evaluation and its limitations
- Generation of paraphrases from word alignments
- Using paraphrases in evaluation
- Correlation with human judgments
- Paraphrase quality
Slide 3: Automatic evaluation of MT quality
- Most popular metrics: BLEU and NIST
Slide 4: Automatic evaluation of MT quality
- Most popular metrics: BLEU and NIST
- Candidate: But we don't have an answer
- References:
  - But we have no answer to it
  - However we cannot react
  - However we have no reply to that
Slide 5: Automatic evaluation of MT quality
- Most popular metrics: BLEU and NIST
[Figure: word-by-word n-gram matching of the candidate against the three references]
- 7-grams: 0/1
- 6-grams: 0/2
- 5-grams: 0/3
- 4-grams: 0/4
- 3-grams: 1/5
- 2-grams: 3/6
- 1-grams: 6/7
- Score: 0.0000, or 0.4392 with smoothing
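The clipped multi-reference n-gram counting behind these figures can be sketched in a few lines. This is an illustrative reimplementation, not the official BLEU script; since the slide's exact tokenization is unknown, the counts it produces for these sentences differ slightly from the figures above, but it shows the same effect: no 4-gram match, so unsmoothed BLEU is 0.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(cand, refs, n):
    """Clipped n-gram matches of a candidate against multiple references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it occurs most often (BLEU's clipping).
    Returns (matched, total candidate n-grams).
    """
    cand_counts = Counter(ngrams(cand, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return matched, sum(cand_counts.values())

cand = "but we dont have an answer".split()
refs = [r.split() for r in [
    "but we have no answer to it",
    "however we cannot react",
    "however we have no reply to that",
]]
for n in range(1, 5):
    m, t = clipped_matches(cand, refs, n)
    print(f"{n}-grams: {m}/{t}")
```

With this tokenization the candidate shares no 4-gram with any reference, so the geometric mean of the precisions, and hence BLEU, is zero unless smoothing is applied.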
Slide 6: Automatic evaluation of MT quality
- Insensitive to admissible lexical differences
  - answer ↔ reply
- Insensitive to admissible syntactic differences
  - yesterday it was raining ↔ it was raining yesterday
  - we don't have ↔ we have no
Slide 7: Automatic evaluation of MT quality
- Attempts to come up with better metrics
  - Word order:
    - Translation Error Rate (Snover et al. 2005)
    - Maximum Matching String (Turian et al. 2003)
  - Lexical and word-order issues:
    - CDER (Leusch et al. 2006)
    - METEOR (Banerjee and Lavie 2005)
    - Linear regression model (Russo-Lassner et al. 2005)
- These need POS taggers, stemmers, thesauri, WordNet
Slide 8: Word and phrase alignment
- Statistical Machine Translation: word and phrase alignments between source-language text and target-language text
- Example alignments:
  - agréable ↔ nice, pleasant, good
  - nous n'avons pas ↔ we don't have, we have no
Slide 9: Generating paraphrases
- For each word/phrase e_i, find all words/phrases f_i1, ..., f_in that e_i aligns with; then for each f_i, find all words/phrases e_k≠i,1, ..., e_k≠i,n that f_i aligns with (Bannard and Callison-Burch 2005)
- Paraphrase probability: p(e2 | e1) = Σ_f p(e2 | f) p(f | e1)
- Worked example for "nice", pivoting through three aligned foreign words/phrases:
  - p(f1 | nice) = 0.5; f1 aligns with pleasant 0.75, agreeable 0.25
  - p(f2 | nice) = 0.25; f2 aligns with good 0.8, great 0.2
  - p(f3 | nice) = 0.25; f3 aligns with good 0.99
- Paraphrases of "nice": good (0.25 · 0.8 + 0.25 · 0.99 = 0.4475), pleasant (0.5 · 0.75 = 0.375), agreeable (0.5 · 0.25 = 0.125), great (0.25 · 0.2 = 0.05)
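The pivoting step can be sketched as follows. The pivot names f1, f2, f3 and the probability tables are the hypothetical values from the worked example, not real alignment output; a real system would read these tables from GIZA-style alignment probabilities.

```python
from collections import defaultdict

def pivot_paraphrases(e1, p_f_given_e, p_e_given_f):
    """Pivot-based paraphrase scores: p(e2|e1) = sum_f p(e2|f) * p(f|e1)."""
    scores = defaultdict(float)
    for f, pf in p_f_given_e[e1].items():       # foreign words/phrases e1 aligns with
        for e2, pe in p_e_given_f[f].items():   # English words/phrases f aligns with
            if e2 != e1:                        # a word is not its own paraphrase
                scores[e2] += pf * pe
    # Highest-probability paraphrases first
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Hypothetical alignment tables reproducing the example for "nice"
p_f_given_e = {"nice": {"f1": 0.5, "f2": 0.25, "f3": 0.25}}
p_e_given_f = {
    "f1": {"pleasant": 0.75, "agreeable": 0.25},
    "f2": {"good": 0.8, "great": 0.2},
    "f3": {"good": 0.99},
}
print(pivot_paraphrases("nice", p_f_given_e, p_e_given_f))
```

Because "good" is reachable through two pivots, its scores accumulate (0.2 + 0.2475 = 0.4475), which is why it outranks "pleasant" despite the lower individual alignment probabilities.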
Slide 10: Paraphrases in automatic MT evaluation
- Each reference word/phrase has an associated paraphrase list: e_a → p_a1, ..., p_an; ...; e_z → p_z1, ..., p_zn

Slide 11: Paraphrases in automatic MT evaluation
- For each segment: e_a → p_a1, ..., p_an; ...; e_z → p_z1, ..., p_zn
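A word-level version of this substitution, producing a best-matching reference, can be sketched as below. The real method also handles multi-word phrases and paraphrase probabilities, so this is only a minimal illustration; the single paraphrase entry for "issue" is hand-made for the demo, using the climate sentences from the "Examples of reference segments" slide.

```python
def best_match_reference(candidate, reference, paraphrases):
    """Replace a reference word with one of its paraphrases when that
    paraphrase occurs in the candidate but the original word does not."""
    cand_words = set(candidate.split())
    out = []
    for w in reference.split():
        if w not in cand_words:
            for p in paraphrases.get(w, []):
                if p in cand_words:
                    w = p  # substitution increases n-gram overlap
                    break
        out.append(w)
    return " ".join(out)

candidate = "the question of climates with is a good example"
reference = "the climate issue is a good example of this"
paraphrases = {"issue": ["question", "matter"]}  # illustrative entry
print(best_match_reference(candidate, reference, paraphrases))
# prints: the climate question is a good example of this
```

Only "issue" is rewritten: it is absent from the candidate while its paraphrase "question" is present, so the substitution adds matching n-grams without altering the rest of the human reference.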
Slide 12: Experiment 1
- Test set: 2,000 sentences, French-English Europarl
- Two translations:
  - Pharaoh: phrase-based SMT
  - Logomedia: rule-based MT
- Scored with BLEU and NIST against:
  - Original reference
  - Best-matching reference using paraphrases derived from the test set
- Paraphrase lists generated using GIZA++ and a refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
- Subset of 100 sentences from each translation scored by two human judges (accuracy, fluency)
Slide 13: Examples of paraphrases
- area → field, this area, sector, aspect, this sector
- above all → specifically, especially
- agreement → accordance
- believe that → believe, think that, feel that, think
- extensive → widespread, broad, wide
- make progress on → can move forward
- risk management → management of risks
Slide 14: Examples of reference segments
Example 1:
- Candidate translation: the question of climates with is a good example
- Original reference: the climate issue is a good example of this
- Best-match reference: the climate question is a good example of this

Example 2:
- Candidate translation: thank you very much mr commissioner
- Original reference: thank you commissioner
- Best-match reference: thank you very much commissioner
Slide 15: Results
[Table: scores for the translation by Pharaoh on Europarl (2,000 sentences)]
Slide 16: Pearson's correlation with human judgment
- Subset of 100 sentences from the Pharaoh translation (Europarl, 2,000-sentence test set)
Slide 17: Paraphrase quality
- 700,000 sentence pairs, French-English Europarl
- Paraphrase lists generated using GIZA++ and a refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
- Quality of paraphrases evaluated with respect to syntactic and semantic accuracy (Bannard and Callison-Burch, 2005)
Slide 18: Results
[Figure: syntactic accuracy and semantic accuracy of the extracted paraphrases]
Slide 19: Filtering paraphrases
- Some inaccuracy is still useful:
  - be supported → support, supporting
- Filters:
  - Exclude closed-class items: prepositions, personal pronouns, possessive pronouns, auxiliary verbs have and be
    - e.g. Fr. à aligns with Eng. to, in, at, which would otherwise make to ↔ in ↔ at paraphrases of each other
  - Prevent paraphrases of the form e_i ↔ e_i (w), where w ∈ {prepositions, pronouns, auxiliary verbs, modal verbs, negation, conjunctions}:
    - aspect → aspect is
    - hours → hours for
    - available → not available
  - POS taggers, parsers
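The two filters can be sketched as a single predicate over paraphrase pairs. The CLOSED_CLASS set here is a small illustrative stand-in for the full closed-class word lists the slide describes, not the deck's actual lexicon.

```python
# Illustrative stand-in for the full closed-class lists (prepositions,
# pronouns, auxiliaries, modals, negation, conjunctions).
CLOSED_CLASS = {"to", "in", "at", "of", "for", "is", "are", "be", "have",
                "has", "not", "no", "and", "or", "can", "may", "it", "his"}

def keep_paraphrase(src_phrase, tgt_phrase, closed_class=CLOSED_CLASS):
    """Return False for paraphrase pairs the slide's filters would discard."""
    src, tgt = src_phrase.split(), tgt_phrase.split()
    # Filter 1: either side consists entirely of closed-class items
    # (e.g. to <-> in, both reachable through Fr. "a").
    if all(w in closed_class for w in src) or all(w in closed_class for w in tgt):
        return False
    # Filter 2: pairs of the form e_i <-> e_i + w or e_i <-> w + e_i,
    # where w is a closed-class item (e.g. hours <-> hours for).
    for a, b in ((src, tgt), (tgt, src)):
        if len(b) == len(a) + 1:
            if b[:-1] == a and b[-1] in closed_class:
                return False
            if b[1:] == a and b[0] in closed_class:
                return False
    return True

print(keep_paraphrase("hours", "hours for"))      # False: trailing closed-class word
print(keep_paraphrase("be supported", "support")) # True: useful despite inaccuracy
```

Note that "be supported" → "support" survives both filters even though it contains the auxiliary "be", matching the slide's point that some inaccurate pairs remain useful.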
Slide 20: References
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 65-73.
- Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 597-604.
- Philipp Koehn, Franz Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), 48-54.
- Grazia Russo-Lassner, Jimmy Lin, and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, MD.
- Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, Linnea Micciulla and Ralph Weischedel. 2005. A Study of Translation Error Rate with Targeted Human Annotation. Technical Report LAMP-TR-126/CS-TR-4755/UMIACS-TR-2005-58, University of Maryland, College Park, MD.
- Jörg Tiedemann. 2004. Word to Word Alignment Strategies. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), 212-218.
- Gregor Leusch, Nicola Ueffing and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. To appear in Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
- Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.
- Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2003. Syntax for Statistical Machine Translation. Technical Report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD.