Hybrid Data-Driven Models of Machine Translation
  • Andy Way (and Declan Groves)
  • National Centre for Language Technology,
  • School of Computing,
  • Dublin City University, Dublin 9, Ireland

  • Motivations
  • Example-Based Machine Translation
  • Marker-Based EBMT
  • Statistical Machine Translation
  • Experiments
  • Language Pairs and Corpora Used
  • EBMT and PBSMT baseline systems
  • Hybrid System Experiments
  • Making use of merged data sets
  • Phrases, Chunks and Training-Test Corpora
  • Conclusions
  • Future Work

  • Most MT research carried out today is
  • Example-Based Machine Translation (EBMT)
  • Statistical Machine Translation (SMT)
  • Lack of comparative research
  • Relative unavailability of EBMT systems
  • Lack of participation of EBMT researchers in
    competitive evaluations
  • Dominance of the SMT approach

Example-Based Machine Translation
  • As with SMT, EBMT makes use of information
    extracted from sententially-aligned bilingual
    corpora. In general:
  • SMT only uses parameters, throws away data
  • EBMT makes use of linguistic units directly
  • During translation:
  • Source side of bitext is searched for close
    matches
  • Source-target subsentential links are determined
  • Relevant target fragments are retrieved and
    recombined to derive final translation.

EBMT: An Example
  • Assumes an aligned bilingual corpus of examples
    against which input text is matched
  • Best match is found using a similarity metric
    based on word co-occurrence, POS, generalized
    templates and bilingual dictionaries (exact and
    fuzzy matching)

  • Identify useful fragments
  • Recombination depends on nature of examples used

Fragments: John went to → Jean est allé à; the bakers → la boulangerie; on Monday
Marker-Based EBMT at DCU
  • Gaijin: Veale & Way, RANLP 97
  • Gough et al., AMTA 02
  • wEBMT: Way & Gough, Comp. Linguistics 03
  • Gough & Way, EAMT 04
  • Way & Gough, TMI 04
  • Gough, PhD Thesis 05
  • Way & Gough, Natural Language Engineering 05
  • Way & Gough, Machine Translation 05
  • Groves & Way, ACL w/shop on Data-Driven MT 05
  • Groves & Way, Machine Translation, EAMT 06
  • MaTrEx: Armstrong et al., TC-STAR OpenLab 06
  • Stroppa et al., NIST MT-Eval 06, AMTA 06,

System Development
Marker-Based EBMT
  • The Marker Hypothesis states that all natural
    languages have
  • a closed set of specific words or morphemes
  • which appear in a limited set of grammatical
    contexts
  • and which signal that context.
  • [Green, 1979]
  • Universal psycholinguistic constraint: languages
    are marked for syntactic structure at surface
    level by a closed set of lexemes or morphemes

The Dearborn, Mich., energy company stopped paying
a dividend in the third quarter of 1984 because
of troubles at its Midland nuclear plant.
  • Three NPs start with determiners, one with a
    possessive pronoun
  • Nominal element will appear soon to the right
  • Sets of determiners and possessive pronouns small
    and finite

  • Four prepositional phrases, with prepositional
  • NP object will appear soon to the right
  • Set of prepositions small and finite

Marker-Based EBMT: Chunking
  • Use a set of closed-class marker words to segment
    aligned source and target sentences during a
    pre-processing stage
  • <PUNC> now used as end-of-chunk marker
  • English Marker words extracted from CELEX

Marker-Based EBMT: Chunking (2)
  • Enables the use of basic syntactic markup for
    extraction of translation resources
  • Source-target sentence pairs are tagged with
    marker categories in pre-processing stage

EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
FR: <PRON> vous cliquez <PREP> sur appliquer <PREP> pour visualiser <DET> l'effet <PREP> de <DET> la sélection
  • Aligned source-target chunks created by
    segmenting sentences based on these marker tags
    along with cognate and word co-occurrence
    information
  • <PRON> you click apply
    <PRON> vous cliquez sur appliquer
  • <PREP> to view
    <PREP> pour visualiser
  • <DET> the effect
    <DET> l'effet
  • <PREP> of the selection
    <PREP> de la sélection
  • Chunks must contain at least one non-marker
    word; this ensures chunks contain useful
    contextual information
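
The segmentation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MARKERS` table is a hypothetical four-word lexicon (the slides take the English marker set from CELEX), and `marker_chunks` is a name invented here, not part of the DCU system. A new chunk is opened at a marker word only when the current chunk already holds a non-marker word, which is one simple way to enforce the "at least one non-marker word" constraint.

```python
# Hypothetical, tiny marker lexicon for illustration only.
MARKERS = {"you": "<PRON>", "to": "<PREP>", "of": "<PREP>", "the": "<DET>"}

def marker_chunks(sentence, markers=MARKERS):
    """Split a sentence into chunks that each begin at a marker word.

    A marker word opens a new chunk only if the current chunk already
    contains at least one non-marker word.
    """
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in markers and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(word)
        if word not in markers:
            has_content = True
    if current:
        if has_content or not chunks:
            chunks.append(current)
        else:
            chunks[-1].extend(current)  # trailing markers join the last chunk
    return [" ".join(c) for c in chunks]
```

Applied to the English side of the example sentence, this yields the four chunks listed on the slide; aligning them with the French chunks (using cognates and co-occurrence) is a separate step not covered here.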

Marker-Based EBMT: Lexicon and Template Extraction
  • Chunks containing only one non-marker word in
    both source and target languages can then be used
    to extract a word-level lexicon
  • <PREP> to : <PREP> pour
  • <LEX> view : <LEX> visualiser
  • <LEX> effect : <LEX> effet
  • <DET> the : <DET> l'
  • <PREP> of : <PREP> de
  • In a final pre-processing stage, we produce a set
    of generalized marker templates by replacing
    marker words with their tags
  • <PRON> click apply : <PRON> cliquez sur appliquer
  • <PREP> view : <PREP> visualiser
  • <DET> effect : <DET> effet
  • <PREP> the selection : <PREP> la sélection
  • Any marker word pair can now be inserted at the
    appropriate tag location.
  • More general examples add flexibility to the
    matching process and improve coverage (and
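
A minimal sketch of the two extraction steps described on this slide, assuming chunks are plain whitespace-separated strings. The `MARKER_TAGS` table and the function names `generalize` and `lexicon_entries` are hypothetical and chosen only for this example.

```python
# Hypothetical marker table covering just the words in the slide example.
MARKER_TAGS = {
    "you": "<PRON>", "to": "<PREP>", "of": "<PREP>", "the": "<DET>",
    "vous": "<PRON>", "pour": "<PREP>", "de": "<PREP>", "la": "<DET>",
}

def generalize(chunk, tags=MARKER_TAGS):
    """Build a generalized template by replacing the chunk-initial marker word with its tag."""
    words = chunk.split()
    if words and words[0] in tags:
        return " ".join([tags[words[0]]] + words[1:])
    return chunk

def lexicon_entries(src_chunk, tgt_chunk, tags=MARKER_TAGS):
    """Extract word-level pairs when both chunks hold exactly one non-marker word."""
    src, tgt = src_chunk.split(), tgt_chunk.split()
    src_lex = [w for w in src if w not in tags]
    tgt_lex = [w for w in tgt if w not in tags]
    if len(src_lex) != 1 or len(tgt_lex) != 1:
        return []
    entries = [("<LEX>", src_lex[0], tgt_lex[0])]
    src_mark = [w for w in src if w in tags]
    tgt_mark = [w for w in tgt if w in tags]
    # Pair the marker words too, but only if there is one per side with the same tag.
    if len(src_mark) == 1 and len(tgt_mark) == 1 and tags[src_mark[0]] == tags[tgt_mark[0]]:
        entries.insert(0, (tags[src_mark[0]], src_mark[0], tgt_mark[0]))
    return entries
```

For the chunk pair "to view" / "pour visualiser" this produces a <PREP> entry and a <LEX> entry, matching the first two lexicon lines above; longer chunks such as "you click apply" produce no lexicon entries.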

Marker-Based EBMT
  • During translation
  • Resources are searched from maximal (specific
    source-target sentence-pairs) to minimal context
    (word-for-word translation).
  • Retrieved example translation candidates are
    recombined, along with their weights, based on
    source sentence order
  • System outputs n-best list of translations
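
The maximal-to-minimal search order can be illustrated with a small back-off routine. This is a simplified sketch, not the system's decoder: it omits generalized templates, candidate weights and the n-best list, and every name in it (`translate`, `sentence_db`, `chunk_db`, `lexicon`, `chunker`) is an assumption made for the example.

```python
def translate(sentence, sentence_db, chunk_db, lexicon, chunker):
    """Look up the largest available context first, then back off.

    sentence_db, chunk_db and lexicon are plain dicts from source strings
    to target strings; chunker is any function that splits a sentence
    into a list of chunk strings.
    """
    if sentence in sentence_db:            # maximal context: exact sentence match
        return sentence_db[sentence]
    output = []
    for chunk in chunker(sentence):
        if chunk in chunk_db:              # chunk-level match
            output.append(chunk_db[chunk])
        else:                              # minimal context: word-for-word lookup
            output.append(" ".join(lexicon.get(w, w) for w in chunk.split()))
    return " ".join(output)                # recombine in source sentence order
```

Unknown words are passed through unchanged here; a real system would have to decide how to handle them and how to weight competing candidates.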

Phrase-Based SMT
  • SMT translation and language models now make use
    of phrase-translations in TM, along with word
    correspondences, to improve translation output.
  • Better modelling of syntax and local
  • Phrase extraction heuristics based on word
    alignments shown to be better than more
    syntactically motivated approaches Koehn et al.,
  • Perform word alignment in both source-target and
    target-source directions
  • Take intersection of unidirectional alignments
  • Extend the intersection iteratively into the
    union by adding adjacent alignments within the
    alignment space Och & Ney 2003, Koehn et al.,
  • Extract all possible phrases from sentence pairs
    which correspond to these alignments
  • Phrase probabilities can be calculated from
    relative frequencies
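
The alignment refinement and the relative-frequency estimate can be sketched as follows. This is a simplified illustration, assuming alignments are sets of (source index, target index) pairs; it grows the intersection with any union link that is horizontally or vertically adjacent to an accepted link, and leaves out the extra conditions that published heuristics apply. The function names are hypothetical.

```python
from collections import Counter

def symmetrize(src2tgt, tgt2src):
    """Start from the intersection and grow towards the union via adjacent links."""
    union = src2tgt | tgt2src
    alignment = set(src2tgt & tgt2src)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            if any((i + di, j + dj) in alignment for di, dj in neighbours):
                alignment.add((i, j))
                added = True
    return alignment

def phrase_probabilities(phrase_pairs):
    """Relative frequency estimate: p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
```

Links in the union that never become adjacent to an accepted link stay excluded, which is what keeps the refined alignment between the intersection and the union.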

  • [Way & Gough, 05] (cf. talk here in May 05): on a
    203K-sentence Sun TM (4.8M words) and a
    4K-sentence test set (ave. sentence length 13.1
    words EN, 15.2 words FR), EBMT > vanilla WB-SMT
    (Giza, CMU-Cambridge statistical toolkit, ISI
    ReWrite Decoder) for
  • Best BLEU scores
  • EN→FR: .453 EBMT, .338 WB-SMT
  • FR→EN: .461 EBMT, .446 WB-SMT

  • The Phrase-Based system using GIZA-Data
    outperforms the same system seeded with EBMT-Data
    on all metrics, bar Precision (0.6598 vs. 0.6661)
  • Marker-Based EBMT system beats both Phrase-Based
    SMT systems, particularly for BLEU (0.4409 vs.
    0.3758) and Recall (0.6877 vs. 0.5759).

  • Scores for all systems are better for FR→EN than
    for EN→FR
  • Again, the Phrase-Based system using GIZA data
    outperforms the same system seeded with EBMT
    data
  • As for EN→FR, the Marker-Based EBMT system
    significantly outperforms both Phrase-Based SMT
    systems for FR→EN.

Towards Hybridity
  • Decided to merge data sources
  • Combine parts of EBMT sub-sentential alignments
    with parts of the data induced using GIZA
  • Performed a number of experiments using:
  • EBMT Phrases + GIZA Words (SEMI-HYBRID)
  • Investigate if quality of EBMT phrases is better
    than GIZA phrases
  • All Data (HYBRID): GIZA Words + Phrases + EBMT
    Words + Phrases
  • EBMT phrases will be used instead of SMT n-grams
  • EBMT phrases should add extra probability to
    more useful SMT phrases, i.e. the probabilities
    of the phrases in the intersection of these two
    sets are boosted
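
One way to see why phrases in the intersection get boosted is to pool the counts from both sources before computing relative frequencies. The sketch below assumes exactly that pooling scheme, which the slides do not spell out; `merge_phrase_tables` is a hypothetical name.

```python
from collections import Counter

def merge_phrase_tables(ebmt_pairs, smt_pairs):
    """Pool phrase-pair counts from both sources, then re-estimate p(target | source).

    A pair found by both methods is counted twice, so it receives a larger
    share of the probability mass than it had in either table alone.
    """
    counts = Counter(smt_pairs) + Counter(ebmt_pairs)
    src_totals = Counter()
    for (src, _), c in counts.items():
        src_totals[src] += c
    return {pair: c / src_totals[pair[0]] for pair, c in counts.items()}
```

With two SMT candidates for one source phrase and a single EBMT chunk agreeing with one of them, the agreed pair moves from 1/2 to 2/3 while the other drops to 1/3.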

[Figure: EBMT Phrases and Giza Phrases]
Merging Data Sources: EN→FR Results
  • Using EBMT phrases + GIZA words improves
    significantly on using EBMT data alone
  • Merging all the EBMT and GIZA data improves on
    all metrics, most significantly for BLEU score
    (0.4259 vs. 0.3643 SEMI-HYBRID).
  • EBMT system still wins out for BLEU score, Recall
    and WER

Merging Data Sources: FR→EN Results
  • Using EBMT phrases + GIZA words shows
    improvements on PBSMT system seeded with EBMT
    data, but improves only on the GIZA-seeded
    system's BLEU score (0.4888 vs. 0.4198).
  • However, merging all data improves on both PBSMT
    systems on all metrics
  • EBMT system beats Hybrid system only on Recall
    and WER

Results: Discussion
  • Best PBSMT BLEU scores (with Giza data only):
    0.375 (E-F), 0.420 (F-E)
  • Seeding PBSMT with EBMT data gets good scores
    for BLEU: 0.364 (E-F), 0.395 (F-E); note
    differences in data size (1.73M vs. 403K)
  • PBSMT loses out to EBMT system
  • Semi-Hybrid System
  • Seeding Pharaoh with SMT words and EBMT phrases
    improves over baseline Giza-seeded system
  • Data size diminishes considerably (430K vs.
  • Worse results than for EBMT system.
  • Fully-Hybrid System
  • Better results than for semi-hybrid system: E-F
    0.426 (0.396), F-E 0.489 (0.427)
  • Data size increases to 2.04M phrase table entries
  • For F-E, Hybrid system beats EBMT on BLEU (0.4888
    vs. 0.4611) and Precision (0.6927 vs. 0.6782);
    EBMT ahead for Recall and WER.

EBMT PB-SMT (on Europarl)
  • [Groves & Way, 06a/b]
  • Added SMT-chunks to EBMT system → hybrid
    statistical EBMT system
  • New domain: Europarl (FR↔EN, 322K sentences)
    [Koehn, 05]
  • Extracted training data from designated training
    sets, filtering based on sentence length and
    relative sentence length (ratio of 1.5 used).
  • Allowed us to extract high-quality training sets
  • For testing, randomly extracted 5000 sentences
    from the Europarl common test set. Avg. sentence
    lengths: 20.5 words (French), 19.0 words (English)
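
The length-based filtering can be sketched as a simple predicate over sentence pairs. Only the ratio of 1.5 comes from the slide; the absolute cap `max_len=40` is a placeholder assumption, since the slide does not state the length limit, and `keep_pair` is a hypothetical name.

```python
def keep_pair(src, tgt, max_len=40, max_ratio=1.5):
    """Keep a sentence pair if both sides are short enough and similar in length.

    max_ratio=1.5 follows the slide; max_len is an assumed value.
    Lengths are counted in whitespace-separated tokens.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if n_src > max_len or n_tgt > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio
```

A pair of 3 and 2 tokens sits exactly on the 1.5 boundary and is kept; a pair of 4 and 2 tokens is rejected.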

  • Compared the performance of our Marker-Based EBMT
    system against that of a PB-SMT system built
    using:
  • Pharaoh Phrase-Based Decoder [Koehn, 04]
  • SRI LM toolkit [Stolcke, 02]
  • Refined alignment strategy [Och & Ney, 03]
  • Trained on incremental data sets, tested on 5000
    sentence test set
  • Effect of increasing training data on translation
  • Performed translation for FR↔EN
  • Evaluated translation quality automatically using
    BLEU [Papineni et al., 02], Precision and Recall
    (GTM toolkit [Turian et al., 03]) and Word Error
    Rate (WER)
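
Of the metrics listed, WER is the simplest to state precisely: the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a generic implementation for a single non-empty reference, not the evaluation script used for the reported scores, and it returns a fraction rather than the percentage-style figures quoted on the slides.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes the reference contains at least one word.
    """
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.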

EBMT vs. PBSMT: French-English
  • Doubling the amount of data improves performance
    across the board for both EBMT and PBSMT
  • PBSMT system clearly outperforms EBMT system, on
    average achieving 0.07 BLEU score higher
  • PBSMT achieves a significantly lower WER (e.g.
    68.55 vs. 82.43 for the 322K data set)
  • Increasing amount of training data results in:
  • 3-5% relative increase in BLEU for PBSMT
  • 6.2% to 10.3% relative BLEU score improvement
    for EBMT
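
Relative improvements like those quoted here are computed against the lower score, not as a difference in BLEU points. A one-line helper makes the distinction concrete; `relative_change` is a name chosen for this example.

```python
def relative_change(new, old):
    """Relative change of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old
```

For instance, the EN→FR Sun TM scores reported earlier in the deck (0.4409 for EBMT vs. 0.3758 for the best PBSMT system) differ by about 0.065 BLEU in absolute terms, which is a relative difference of roughly 17%.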

EBMT vs. PBSMT: English-French
  • PBSMT continues to outperform EBMT system by some
  • e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs.
    0.4578 Recall for 322K data set
  • Difference between systems is somewhat less for
    EN→FR than for FR→EN
  • EBMT system performance much more consistent for
    both directions
  • PBSMT system performs 2 BLEU points worse (10%
    relative) for EN→FR than for FR→EN
  • French-English is easier
  • Fewer agreement errors and problems with boundary
    friction, e.g. le → the (FR→EN),
    the → le, la, les, l' (EN→FR)
  • EBMT scores higher for EN→FR than for
    FR→EN in terms of BLEU score
  • Cf. [Callison-Burch et al., 06] on BLEU for
    evaluating non-n-gram-based systems

Hybrid System Experiments
  • Decided to merge elements of EBMT marker-based
    alignments with PBSMT phrases and words induced
    via GIZA
  • Number of Hybrid Systems:
  • LEX-EBMT: Replaced EBMT lexicon with higher
    quality PBSMT word-alignments, to lower WER
  • H-EBMT vs. H-PBSMT: Merged PBSMT words and
    phrases with EBMT data (words and phrases) and
    passed resulting data to baseline EBMT and
    baseline PBSMT systems
  • H-EBMT-LM: Reranked the output of H-EBMT systems
    using the PBSMT system's equivalent language model
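
The H-EBMT-LM idea, reranking an n-best list with a language model score, can be sketched with a toy bigram model. Everything here is an assumption for illustration: the experiments use a language model built with the SRI toolkit, whereas this sketch trains an add-one-smoothed bigram model in plain Python, and the log-linear combination with `lm_weight` is a guess at how the scores might be combined.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Return a function giving the add-one-smoothed bigram log-probability of a sentence."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # counts of bigram contexts
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(set(unigrams) | {"</s>"})

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Add-one smoothing keeps unseen bigrams from scoring minus infinity.
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(tokens, tokens[1:]))
    return logprob

def rerank(nbest, lm_logprob, lm_weight=1.0):
    """Sort (translation, system_log_score) pairs by system score plus weighted LM score."""
    return sorted(nbest,
                  key=lambda item: item[1] + lm_weight * lm_logprob(item[0]),
                  reverse=True)
```

When two candidates carry the same system score, the one whose word order the language model has seen before moves to the top, which is the word-order guidance described on the results slide.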

Hybrid Experiments: French-English
  • Use of the improved lexicon (LEX-EBMT) leads to
    only slight improvements (average relative
    increase of 2.9% BLEU)
  • Adding Hybrid data improves above baselines, for
    both EBMT (H-EBMT) and PBSMT (H-PBSMT)
  • H-PBSMT system achieves higher BLEU score trained
    on 78K and 156K compared with PBSMT system when
    trained on twice as much data.
  • The addition of the language model to the H-EBMT
    system helps guide word order after lexical
    selection and thus improves results further

Hybrid Experiments: English-French
  • We see similar results for EN→FR as for FR→EN
  • The more SMT-like the EBMT system becomes, the
    more the BLEU scores fall in line with other
    metrics, i.e. higher for FR→EN than for EN→FR
  • Using the hybrid data set we get a 15% average
    relative increase in BLEU score for the EBMT
    system, and 6.2% for the H-PBSMT system over its
    baseline
  • The H-PBSMT system performs almost as well as the
    baseline system trained on over 4 times the
    amount of data

SMT phrases vs. EBMT chunks
  • Many more SMT phrases are derived than EBMT
    chunks
  • Not reflected in scores
  • Doubling amount of data doubles amount of
    sub-sentential alignments for both systems
  • Indicates the heterogeneous nature of the
    Europarl corpus
  • Taking the 322K training set:
  • 93.0% of SMT chunks found only once, 99.4% occur
    < 10 times
  • 96.6% of EBMT chunks found only once, 99.8% occur
    < 10 times
  • Of the top 10 most frequent chunks in SMT-only
    set, 7 are made up solely of marker words
  • du → of the
  • de la → of the
  • union européenne → union
  • états membres → member states
  • de l' → of the
  • dans le → in the
  • n'est → is
  • parlement européen → parliament
  • que nous → that we
  • que la → that the
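
Statistics like the singleton percentages above can be computed directly from a list of extracted chunks. The sketch assumes the percentages are taken over distinct chunk types, which the slide does not state explicitly; `frequency_profile` is a hypothetical name.

```python
from collections import Counter

def frequency_profile(chunks, threshold=10):
    """Return (% of chunk types seen once, % of chunk types seen fewer than `threshold` times)."""
    counts = Counter(chunks)
    n_types = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    rare = sum(1 for c in counts.values() if c < threshold)
    return 100.0 * singletons / n_types, 100.0 * rare / n_types
```

A high singleton share, as reported for both systems on Europarl, means most extracted units never recur in training and are therefore unlikely to match unseen test input.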

Translation Examples
  • PBSMT: we have all accepted the lesson of the
    food crisis the 1990s
  • H-PBSMT: we have all accepted the lesson of
    the food crisis in the 1990s
  • REF: we have all learned our lesson from the
    food crisis of the 90s
  • --------------------------------------------------
  • PBSMT: indeed if the second-pillar example
    were less frequent there would be fewer poor
  • H-PBSMT: indeed if pensions for example were
    less frequent there would be fewer poor
  • REF: if indeed for example pensions were less
    inadequate there would be fewer poor people
  • --------------------------------------------------
  • PBSMT: in this regard the port controls there
    should be making the regulations still more
  • H-PBSMT: when it comes to port controls we must
    make the regulations still more stringent
  • REF: it is important to tighten up regulations
    regarding the control of harbours and ports even
  • --------------------------------------------------
  • PBSMT: it also requires that we continue to
    discussed the entry into force of fiscal
  • H-PBSMT: we also need to continue to ask
    ourselves questions about the implementation of
    fiscal harmonization
  • REF: we also still need to continue to question
    the implementing of fiscal harmonisation

  • [Groves & Way, 05] showed how an EBMT system
    outperforms a PBSMT system when trained on the
    Sun Microsystems data set
  • This time around, the baseline PBSMT system
    achieves higher quality than all variants of the
    EBMT system
  • Heterogeneous Europarl vs. Homogeneous Sun data
  • Chunk coverage is lower on Europarl data set: 6%
    of translations produced using chunks alone (Sun)
    vs. 1% on Europarl
  • EBMT system considered 13% of words on average
    for direct translation (vs. 7% for Sun data)
  • Significant improvements seen when using
    higher-quality lexicon
  • Improvements also seen when LM introduced
  • H-PBSMT system able to outperform baseline PBSMT
  • Further gains to be made from hybrid corpus-based
  • Small overlap on chunks extracted via EBMT and
    SMT methods

Hybrid Example-Based SMT: The MaTrEx system
  • [Armstrong et al., 06] OpenLab MT-EVAL (March
    06): adding EBMT chunks to vanilla Pharaoh
    PB-SMT system adds about 4 BLEU points for ES→EN
  • [Stroppa et al., 06]: adding EBMT chunks to
    vanilla Pharaoh PB-SMT system adds about 5 BLEU
    points for Basque→EN
  • Good performance in IWSLT-06

Phrases, Chunks and Training-Test Corpora
  • SMT phrases are contiguous sequences of n-grams
  • Typically, EBMT performance is comparable with
    PB-SMT with fewer sub-sentential alignments
  • As EBMT chunks are different from SMT phrases,
    use them if available in your PB-SMT systems (cf.
    OpenLab ES→EN and AMTA Basque→EN results). They:
  • Provide longer sequences of context → better
  • Reinforce probability of good but infrequent SMT
  • As SMT phrases are different from EBMT chunks,
    use them if available in your EBMT systems
  • SMT phrases typically shorter than EBMT chunks,
    so more useful where training/test material is
    more heterogeneous: where EBMT chunks are too
    long to cover the input data, SMT n-grams can
    fill in before we need to resort to W2W
    translation (always last resort)
  • cf. CMU findings in recent NIST MT-Eval

Phrases, Chunks and Training-Test Corpora
  • Looks like EBMT better on homogeneous training
    data
  • EBMT > PB-SMT on Sun TM (EN→FR)
  • EBMT > PB-SMT on EF TM (Basque→EN)
  • SMT better on (more) heterogeneous data
  • PB-SMT > EBMT on Europarl (EN→FR)
  • Predictors of Usefulness of Approach given Text:
  • Chunk coverage
  • Amount of W2W Translation

  • Combining SMT phrases and EBMT chunks in a
    hybrid statistical EBMT or example-based SMT
    system will improve your system output
  • Blind adherence to one approach will guarantee
    that your performance is less than it could
    otherwise be
  • John Hutchins: EBMT is Hybrid MT
  • Joe Olive: Need combination of rules and

Ongoing and Future Work
  • Automatic detection of Marker Words
  • Most common SMT phrases consist mainly of marker
    words
  • Plan to increase levels of hybridity
  • Code a simple EBMT decoder, factoring in
    Marker-Based recombination approach along with
  • Use exact sentence matching in PBSMT, as in EBMT
  • Integration of generalized templates into PBSMT
    system (and reintegrate them into EBMT system)
  • Integrate marker tag information into SMT
    language and translation models
  • Hybrid EBMT-EBMT System (with CMU)?!
  • What's the contribution of EBMT chunks if an SMT
    system is allowed as much training data as it

  • Thank you for your attention.