Title: Statistical modelling of MT output corpora for Information Extraction
1 - Statistical modelling of MT output corpora for
Information Extraction
2 - Overview
- Using MT output for IE
- Requirements and evaluation of usability
- S-score: measuring the degree of word
significance for a text by contrasting text
and corpus usages
- Experiment set-up and MT evaluation metrics
- Using differences in S-scores for MT evaluation
- Results of MT evaluation for IE
- Comparison of MT systems
- Correlations with human evaluation measures of MT
- Issues of MT architecture and evaluation scores
- Conclusions and future work
3 - Using MT for IE
- Requirements for human use and for automatic
processing are different
- fluency is less important than adequacy
- stylistic errors are less important than factual
errors, e.g. MT: Bill Fisher -> 'to send a bill to a
fisher'
- Frequency issues
- low-frequency words carry the most important
information (and require accurate disambiguation)
- Some IE tasks use statistical models (expected to
be different for MT)
4 - Frequency issues: disambiguation
5 - Frequency issues: statistical modelling for IE
- Research on adaptive IE: automatic template
acquisition via statistical means
- find sentences containing statistically
significant words
- build templates around such sentences
- Template element fillers (e.g., NEs) often appear
among statistically significant words
- Distribution of word frequencies is expected to
be different for MT: checking if this is the case
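The first step of the template-acquisition idea above (finding sentences that contain statistically significant words) can be sketched as follows. This is only an illustration: the function name `significant_sentences`, the precomputed `s_score` dictionary, and the default threshold of 1 (borrowed from the S-score > 1 cut-off used elsewhere in these slides) are assumptions, not part of any described system.

```python
def significant_sentences(sentences, s_score, threshold=1.0):
    """Select sentences that contain at least one significant word.

    sentences: list of token lists for one text.
    s_score:   hypothetical dict mapping word -> significance score for that text.
    Returns a list of (tokens, significant_words) pairs.
    """
    selected = []
    for tokens in sentences:
        # Words missing from s_score are treated as not significant.
        hits = sorted({w for w in tokens
                       if s_score.get(w, float("-inf")) > threshold})
        if hits:
            selected.append((tokens, hits))
    return selected
```

Templates would then be built around the selected sentences; that step is not sketched here.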
6 - Measuring statistical significance
- S_word[text] -- the score of statistical
significance for a particular word in a
particular text
- P_word[text] -- the relative frequency of the word
in the text
- P_word[rest-corp] -- the relative frequency of the
same word in the rest of the corpus, without this
text
- N_word[txt-not-found] -- the proportion of texts
in the corpus where this word is not found
(the number of texts where it is not found divided
by the number of texts in the corpus)
- P_word[all-corp] -- the relative frequency of the
word in the whole corpus, including this
particular text
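The formula that combines these quantities is not preserved in this transcript. A minimal sketch, assuming they are combined as S = ln((P_word[text] - P_word[rest-corp]) x N_word[txt-not-found] / P_word[all-corp]); this combination, the function name `s_scores`, and the representation of the corpus as a list of token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def s_scores(texts, index):
    """Return {word: S} for texts[index]; `texts` is a list of token lists."""
    text = texts[index]
    rest = [t for i, t in enumerate(texts) if i != index]
    text_counts = Counter(text)
    rest_counts = Counter(w for t in rest for w in t)
    n_text = len(text)
    n_rest = sum(len(t) for t in rest)
    vocab_per_text = [set(t) for t in texts]

    scores = {}
    for word, count in text_counts.items():
        p_text = count / n_text                                 # P_word[text]
        p_rest = rest_counts[word] / n_rest if n_rest else 0.0  # P_word[rest-corp]
        p_all = (count + rest_counts[word]) / (n_text + n_rest)  # P_word[all-corp]
        # N_word[txt-not-found]: share of texts that do not contain the word
        n_not_found = sum(word not in v for v in vocab_per_text) / len(texts)
        value = (p_text - p_rest) * n_not_found / p_all
        if value > 0:  # assumed form: the logarithm is undefined otherwise
            scores[word] = math.log(value)
    return scores
```

Under this assumed form, a word that occurs in every text of the corpus gets no score at all, because N_word[txt-not-found] is zero, while a word concentrated in a single text scores high.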
7 - Intuitive appeal of significance scores
- Selecting words potentially important for IE
- In the Marseille Facet of the Urba-Gracco Affair,
Messrs. Emmanuelli, Laignel, Pezet, and Sanmarco
Confronted by the Former Officials of the SP
Research Department
- On Wednesday, February 9, the presiding judge of
the Court of Criminal Appeals of Lyon, Henri
Blondet, charged with investigating the Marseille
facet of the Urba-Gracco affair, proceeded with
an extensive confrontation among several
Socialist deputies and former directors of
Urba-Gracco. Ten persons, including Henri
Emmanuelli and Andre Laignel, former treasurers
of the SP, Michel Pezet, and Philippe Sanmarco,
former deputies (SP) from the Bouches-du-Rhône,
took part in a hearing which lasted more than
seven hours
8 - ...Intuitive appeal of significance scores
9 - Metric for usability of MT for IE
- Suggestion: measuring differences in statistical
significance between a human translation and MT
allows estimating the amount of prospective
problems
- Question: do any human evaluation measures of MT
correlate with differences in S-scores for
different MT systems?
10 - Experiment setup
- Available: 100 texts developed for the DARPA 94 MT
evaluation exercise
- French originals
- 2 different human translations (reference and
expert)
- 5 translations by MT systems (French into
English)
- knowledge-based: Systran, Reverso, Metal,
Globalink
- IBM statistical approach to MT: Candide
- DARPA evaluation scores available for each system
and for the human expert translation
- Informativeness, Adequacy, Fluency
- Calculating distances of combined S-scores
between
- the human reference translation
- other translations (MT and the expert
translation)
11 - The distance scores
- Based on comparing sets of words with S-score > 1
- words significant in both texts with different
statistical significance scores
- words not present in the reference translation
(overgenerated in MT)
- words not present in MT, but present in the
reference translation (undergenerated in MT)
- Computing distance scores
- o-score for avoiding overgeneration (Precision)
- u-score for avoiding undergeneration (Recall)
- uo combined score (calculated as F-measure)
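The exact formulas for these scores are not preserved in this transcript. The sketch below is a simplified stand-in, not the authors' computation: it treats the o-score as an S-weighted precision, the u-score as an S-weighted recall, and the uo score as their F-measure. The function names and the use of `min` to credit words whose significance differs between the two texts are assumptions.

```python
def significant(scores, threshold=1.0):
    """Keep only words whose S-score exceeds the threshold."""
    return {w: s for w, s in scores.items() if s > threshold}


def distance_scores(ref_scores, mt_scores, threshold=1.0):
    """Return (o, u, uo) for one text, each in the range 0..1.

    ref_scores / mt_scores: dicts word -> S-score for the reference
    translation and for the translation being compared.
    """
    ref = significant(ref_scores, threshold)
    mt = significant(mt_scores, threshold)
    # Shared words count only up to the smaller of their two scores,
    # so a change in significance lowers both o and u.
    matched = sum(min(ref[w], mt[w]) for w in ref.keys() & mt.keys())
    o = matched / sum(mt.values()) if mt else 0.0    # penalises overgeneration
    u = matched / sum(ref.values()) if ref else 0.0  # penalises undergeneration
    uo = 2 * o * u / (o + u) if o + u else 0.0       # F-measure of o and u
    return o, u, uo
```

Dividing by the total significance mass of each text keeps the scores comparable across texts with different numbers of significant words.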
12 - Computing distance scores...
- Words that changed their significance
13 - Computing distance scores
- Scores for avoiding over- and under-generation
- Making scores compatible across texts
- (the number of significant words may be
different)
14 - The resulting distance scores
15 - DARPA Adequacy and scores
16 - o-score and DARPA 94 Adequacy
17 - DARPA Fluency and scores
18 - uo-score and DARPA 94 Fluency
19 - Results and correlation of scores
- Human expert translation scores higher than MT
- Statistical MT system Candide is
characteristically different
- Strong positive correlation found for
- o-score and DARPA adequacy
- Weak positive correlation found for
- uo-score and DARPA fluency
- No correlation was found between u-score (high
for statistical MT) and human MT evaluation
measures
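The slides do not say which correlation coefficient was used. As an illustration only, per-system distance scores could be compared with per-system human scores using Pearson's r; the choice of coefficient and the function name `pearson_r` are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one automatic score per system (for example the o-score) and `ys` the corresponding human score (for example DARPA adequacy), in the same system order.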
20 - Conclusions
- Word-significance measure S is useful in other
areas
- (e.g., distinguishing lexical and morphological
differences)
- Threshold S > 1 distinguishes content and
function words across different languages
- (checked for English, French and Russian)
- Statistical modelling showed substantial
differences between human translation and MT
output corpora
- Measures of contrastive frequencies for words in
a particular text and the rest of the corpus
correlate with human evaluation of MT (scores for
adequacy)
21 - Future work
- Statistical modelling of Example-based MT
- Investigating the actual performance of IE
systems on different tasks using MT of different
quality (with different "usability for IE"
scores) and its correlation with proposed MT
evaluation measures
- Establishing formal properties for intuitive
judgements about translation quality (translation
equivalence, adequacy, and fluency in human
translation and MT)