MT System Combination

Transcript:
1
MT System Combination
2
System Combination in MT
  • Methods of machine translation
  • Rule based
  • Example based
  • Statistical
  • Hierarchical
  • Syntax based
  • Their outputs differ
  • Make use of the individual strengths of the
    different systems to improve translation quality
  • Just select the best output on a sentence-by-sentence basis, or build a synthetic combination of the output from the original systems?

3
Parallel Combination
[Diagram: Source Language Text → several MT Systems in parallel → MT System Combination → Translation]
4
Serial Combination
[Diagram: Source Language Text → MT System → MT System → Translation]
5
Model Combination
[Diagram: Source Language Text → Decoder → Target Language Text; Phrase Tables S1/S2 → Phrase Table Combination, Lexica S1/S2 → Lexicon Combination, Reorderings S1/S2 → Reordering Combination, all feeding the Decoder]

6
MT System Combination approaches
  • Parallel Combination
  • Hypothesis Selection
  • Lattice based Combination
  • CN based Combination
  • Serial combination
  • RBMT → SMT
  • Cross combo: Hypothesis Selection + CN-based Combination
  • Model level combination
  • Combine lexica
  • Combine phrase tables
  • Combine input systems' reorderings

7
System Combination in MT
[Figure: example translations from an example-based, a phrase-based statistical, and a hierarchical statistical system]
8
System Combination in MT
hoffman was mesmerized by drug but woke up in a
timely manner to create career
9
Hypothesis Selection
  • How to decide which hypothesis to pick?
  • System bias
  • Boost hypothesis from each system according to
    its overall (BLEU) score on development data
  • MT System confidence score
  • System tells you how well it thinks it translated the sentence
  • Problematic, because systems' estimates are not comparable, between systems as well as between sentences within a system
  • Need to be normalized
  • Language model

10
Hypothesis Selection
  • Pick the best hypothesis from the different
    systems for each source sentence
  • Use n-best list re-ranking approach
  • Add n-best hypotheses from each system
  • Re-rank joint n-best list
  • Find good features: add n-best-list-based features
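The re-ranking step above can be sketched as a weighted feature combination over the joint n-best list. The feature names, values, and weights below are hypothetical; in the slides' setup the weights would be tuned with minimum error rate training on development data.

```python
def rerank(joint_nbest, weights):
    """Pick the best hypothesis from a joint n-best list by a weighted
    sum of feature scores (hypothetical feature names; the weights
    would be tuned with MERT on a development set)."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    # joint_nbest: list of (hypothesis, {feature: value}) pairs pooled
    # from all systems' n-best lists for one source sentence
    return max(joint_nbest, key=lambda entry: score(entry[1]))[0]
```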

11
N-best list re-ranking Features
  • Consistently calculated for the joint n-best list
  • Language model
  • Statistical word lexicon
  • Position dependent word agreement
  • Position independent n-gram agreement
  • N-best list n-gram probability
  • Sentence length features
  • Rank in the system's n-best list
  • System bias
  • Minimum error rate training to determine feature
    weights on a development test set

12
LM and Lexicon
  • Language model
  • Kneser-Ney smoothing
  • Sentence score is normalized by the sentence length
  • As large as possible, e.g. the complete LDC Gigaword corpus: 2.7 billion words of English data, 5-gram LM
  • SRI LM toolkit
  • LDC = Linguistic Data Consortium

13
LM and Lexicon
  • Statistical word lexica
  • Lexicon probability sum
  • Lexicon probability maximum
  • Both language directions
  • From Model 4 GIZA training
  • All bilingual data you can get, e.g. 260 million words of Chinese-English
  • Sentence score is normalized by the sentence length
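The lexicon probability features (sum and maximum variants, length-normalized) can be sketched as below; the dictionary layout `lex[(f, e)] = p(e|f)` and the small floor probability are assumptions, not the slides' exact formulation.

```python
import math

def lexicon_score(src, hyp, lex, use_max=False):
    """Model-1-style lexicon feature for one hypothesis.

    lex[(f, e)] holds a word translation probability p(e|f), e.g. from
    Model 4 GIZA training (hypothetical dictionary layout).
    Sum variant: average over source words; max variant: best-matching
    source word.  The score is normalized by hypothesis length."""
    if not hyp:
        return 0.0
    total = 0.0
    for e in hyp:
        probs = [lex.get((f, e), 1e-10) for f in src]  # floor for unseen pairs
        p = max(probs) if use_max else sum(probs) / len(src)
        total += math.log(p)
    return total / len(hyp)
```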

14
Position dependent word agreement
  • N-best list based feature
  • Relative frequency of n-best list entries containing word e in the same position
  • Very restrictive; to loosen the restriction:
  • Use a window around the original position
  • Window sizes t = 0, 1, 2 as three separate features
  • Sentence score = average word score
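A minimal sketch of the windowed word-agreement feature; the token-list representation and window handling are my assumptions.

```python
def word_agreement(hyp, nbest, t=0):
    """Position-dependent word agreement with window size t.

    For each word e at position i in the hypothesis, compute the
    fraction of n-best entries containing e within positions
    i-t .. i+t; the sentence score is the average word score."""
    if not hyp or not nbest:
        return 0.0
    total = 0.0
    for i, e in enumerate(hyp):
        hits = sum(1 for other in nbest
                   if any(0 <= j < len(other) and other[j] == e
                          for j in range(i - t, i + t + 1)))
        total += hits / len(nbest)
    return total / len(hyp)
```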

15
Position Dependent Word Agreement
[Figure: n-best list for one source sentence; word agreement windows t = 0 (agreement 30%), t = 1, t = 2]
  • Majority vote on words (ROVER style)

16
Position Independent N-gram Agreement
  • N-best list based feature
  • Relative frequency of n-best list entries
    containing the n-gram
  • Sentence score = average n-gram score
  • Unigram to 6-gram as 6 separate features

17
Position Independent N-gram Agreement
[Figure: n-best list for one source sentence; n = 1: word agreement 90%, n = 3: tri-gram agreement 60%, n = 5: 5-gram agreement 40%]
  • Agreement score for n = 1 to 6 as separate features

18
N-best List N-gram Probability
  • N-best list based feature
  • Standard language model probability, estimated on
    the n-best hypothesis for one source sentence
  • No smoothing, because counts are never zero
  • Sentence score = average word score
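A sketch of the n-best-list n-gram probability: an unsmoothed LM estimated only on the hypotheses for one source sentence, so every n-gram of a hypothesis that is itself a member of the list has a nonzero count. The padding convention is an assumption.

```python
import math
from collections import Counter

def nbest_ngram_logprob(hyp, nbest, n=3):
    """Average per-word log probability of hyp under an unsmoothed
    n-gram model counted from the n-best list for one source sentence."""
    ngram, hist = Counter(), Counter()
    for sent in nbest:
        padded = ["<s>"] * (n - 1) + sent
        for i in range(len(sent)):
            ngram[tuple(padded[i:i + n])] += 1       # full n-gram count
            hist[tuple(padded[i:i + n - 1])] += 1    # history count
    padded = ["<s>"] * (n - 1) + hyp
    logp = sum(math.log(ngram[tuple(padded[i:i + n])] /
                        hist[tuple(padded[i:i + n - 1])])
               for i in range(len(hyp)))
    return logp / max(len(hyp), 1)
```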

19
N-gram Agreement vs. N-gram Probability
[Figure: n-best list for one source sentence, bi-gram "San Francisco"; n = 2 bi-gram agreement 30%, P(Francisco | San) = 3/3, bi-gram probability 100%]
  • The LM n-gram probability gives information on word order.

20
Sentence Length Features
  • Deviation of the sentence length from the average hypothesis length of all hypotheses in the n-best list for this source sentence
  • Ratio between the source and the target sentence
    length
  • Deviation from the overall source-target token
    count ratio in the bilingual training corpus for
    this language pair
  • Deviation of the source/target ratio from the
    average source to target length ratio within its
    class
  • divide training corpus into several buckets by
    sentence length
  • calculate average source/target length ratio
    or average target length per bucket
  • source/target ratio might be quite different for
    short, medium and long sentences
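Three of these length features can be sketched as follows; the packaging into a dict and the feature names are hypothetical.

```python
def length_features(src, hyp, nbest, corpus_ratio):
    """Sentence length features for one hypothesis.

    - len_dev: deviation of the hypothesis length from the average
      hypothesis length in the n-best list for this source sentence
    - tgt_src_ratio: target/source sentence length ratio
    - ratio_dev: deviation of that ratio from the overall
      source-target token count ratio of the training corpus"""
    avg_len = sum(len(h) for h in nbest) / len(nbest)
    ratio = len(hyp) / len(src)
    return {
        "len_dev": abs(len(hyp) - avg_len),
        "tgt_src_ratio": ratio,
        "ratio_dev": abs(ratio - corpus_ratio),
    }
```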

21
System Bias
  • Boost each hypothesis according to its rank in the original system's n-best list
  • Boost each translation coming from the best system
  • Boost each hypothesis according to its system's performance on development data
  • Add system indicator feature to the feature set
    and optimize weights using MERT

22
System Combination in MT
Chinese-English MT06
hoffman was mesmerized by drug
fortunately awakening
in a timely manner to create career
in performing arts
23
Lattice-based Example
24
Lattice-based combination
hoffman was addicted to drugs, fortunately
awaking in a timely manner to begin an acting
career
[Figure: translation lattice over nodes 1-6 built from the system outputs, with paths such as "hoffman was addicted to drugs", "hoffman was obsessed", "were previously enamored drug", and "was mesmerized by drug"]
25
Lattice-based Combination
  • Build lattice from multiple system output
  • Needs phrase translation boundary and source
    alignment information
  • Can also combine complete system internal
    translation lattices (if available)
  • Systems internal scores are most likely not
    comparable, consistent scoring is difficult
  • Same translation proposed by several system is
    not preferred/boosted
  • LM score
  • Re-decode

26
Confusion Network based Combo
[Figure: confusion network over nodes 1-6 with vote counts, e.g. hoffman (5), hoffmann (1), was (3), were (1), has (1), obsessed (2), mesmerized (1), previously (1), fortunately (1), enamored (1), drug (6); "e" arcs mark empty words]
27
Lattice vs. CN
[Figure: the translation lattice (slide 24) shown side by side with the confusion network (slide 26)]
28
Word level confusion network decoding
  • Choose one translation hypothesis as skeleton
    (determines word order)
  • Align each hypothesis to the skeleton using TER, ITGs, or statistical word alignment
  • Build confusion network with consensus votes
  • Add LM score into network
  • Train system weights, add into network
  • Choose best path through the network (decode)
  • Output consensus translation
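A toy sketch of the pipeline above. For simplicity it aligns each hypothesis to the skeleton position by position (a stand-in for the TER/ITG/word alignment step) and decodes by majority vote, without the LM score and system weights the slides add.

```python
from collections import Counter

def build_cn(skeleton, others):
    """Toy confusion network: align each hypothesis to the skeleton
    position by position (stand-in for TER alignment), padding short
    hypotheses with the empty word "eps", and collect votes per slot."""
    slots = [Counter({w: 1}) for w in skeleton]  # skeleton votes once
    for hyp in others:
        for i, slot in enumerate(slots):
            slot[hyp[i] if i < len(hyp) else "eps"] += 1
    return slots

def decode_cn(slots):
    """Consensus translation: majority vote per slot, drop empty words."""
    out = [slot.most_common(1)[0][0] for slot in slots]
    return [w for w in out if w != "eps"]
```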

29
confusion network decoding
choose as skeleton
skeleton determines word order
30
confusion network decoding
  • Biggest challenge word alignment
  • Pairwise vs. incremental alignment
  • TER alignment
  • Use morphology, synonyms, POS tags
  • Go to phrases (without source-target phrase alignment available)

31
confusion network decoding
  • Comparison: pairwise vs. incremental alignment
  • The next hypothesis is aligned to the existing network, not to the skeleton
  • The order of adding hypotheses does make a difference, e.g. use increasing TER / decreasing BLEU of the systems
  • But choice of skeleton is not that crucial any
    more

32
pairwise vs. incremental alignment
pairwise
incremental
33
Serial Combination
[Diagram: Source Language Text → RBMT System → SMT System → Translation]
34
Serial Combination
  • RBMT and SMT are good on very different things
  • RBMT produces very good translations, if its
    rules cover the sentence well and fails utterly
    for e.g. long complicated sentences
  • SMT produces more or less erroneous output on everything
  • Serial combination:
  • Translate the entire training corpus with RBMT
  • Train SMT on the parallel corpus RBMT translation → English
  • SMT system as automatic post-editor for RBMT
  • Smooth out RBMT problems without losing its strengths
  • Give RBMT's strengths a better chance than in parallel combination, because the statistical models bias towards SMT there

35
Cross Combination
  • Hypothesis selection output serves as skeleton
    for CN generation
  • Smart choice of the skeleton for CN generation
    has impact on translation quality e.g. because
    it determines word order
  • Reverse order works as well:
  • Each input system serves as skeleton for n confusion networks
  • Hypothesis selection selects from the combined n-best lists from the CN decoding

[Diagram: Source Language Text → several MT Systems → Combination via Hypothesis Selection → Combination via Confusion Network Decoding]
36
Model Combination
[Diagram: Source Language Text → Decoder → Target Language Text; Phrase Tables S1/S2 → Phrase Table Combination, Lexica S1/S2 → Lexicon Combination, Reorderings S1/S2 → Reordering Constraint, all feeding the Decoder]

37
Lexicon Combination
  • Combine systems' lexica
  • Re-estimate joint probabilities
  • Only useful if the systems have different training data available
  • Train a test-set-specific lexicon
  • Treat the source and its various translations as special training data
  • Build a lexicon only with entries from the systems' translations
  • If available, use the systems' phrase alignment to constrain word alignment
  • Interpolate this sharp test-set-specific lexicon with a large lexicon trained from all training data
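The final interpolation step can be sketched as a simple linear mixture. Representing each lexicon as a dict of word-pair probabilities and the weight `lam` are my assumptions; `lam` would be a tuning parameter.

```python
def interpolate_lexica(sharp, large, lam=0.5):
    """Linearly interpolate a sharp test-set-specific lexicon with a
    large lexicon trained on all data.  Both are dicts mapping
    (source_word, target_word) pairs to probabilities."""
    keys = set(sharp) | set(large)
    return {k: lam * sharp.get(k, 0.0) + (1 - lam) * large.get(k, 0.0)
            for k in keys}
```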

38
Phrase Table combination
  • Source-target phrase pairs available for the test
    set from the input systems
  • Combine with the full baseline phrase table, with adjusted weight for the phrase counts from the systems' output
  • Rescore phrases using the scaled systems' total or confidence scores
  • Agreement boost for phrases coming from several systems:
  • Exact match
  • Same phrase, different distortion
  • Overlapping source interval with same target words
  • Overlapping target words
  • Prune phrase table:
  • From the full phrase table, only keep phrases covered by one of the systems' outputs
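A sketch of the combination with an exact-match agreement boost only. Representing each table as a dict mapping phrase pairs to a single score, merging by taking the maximum, and the boost factor are all simplifying assumptions; the slides also rescore with scaled system scores and handle inexact matches.

```python
from collections import Counter

def combine_phrase_tables(baseline, system_tables, boost=1.1):
    """Merge system phrase tables into the baseline table and boost
    phrase pairs proposed by several systems (exact match only;
    the boost factor is a hypothetical tuning knob)."""
    combined = dict(baseline)
    support = Counter()
    for table in system_tables:
        for pair, score in table.items():
            support[pair] += 1
            combined[pair] = max(combined.get(pair, 0.0), score)
    for pair, n in support.items():
        if n > 1:  # agreement boost for multi-system phrases
            combined[pair] *= boost ** (n - 1)
    return combined
```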

39
Phrase Table combination
  • Several rule-based systems as input, so no phrase pairs available
  • Train word alignment on a parallel corpus
  • Align the systems' output to the source with the statistically learned alignments
  • Extract a test-set-specific phrase table from the systems' translations (Moses phrase extraction)
  • Interpolate with baseline statistical phrase
    table
  • Build translation lattice
  • Re-decode

40
Reordering
  • While re-decoding
  • Restrict to reorderings used by one of the
    input systems
  • Boost word order chosen by one of the systems

41
MT System Combination approaches
  • Parallel Combination
  • Hypothesis Selection
  • Lattice based Combination
  • CN based Combination
  • Serial combination
  • RBMT → SMT
  • Cross combo: Hypothesis Selection + CN-based Combination
  • Model level combination
  • Combine lexica
  • Combine phrase tables
  • Combine input systems' reorderings

42
  • Thank you!

43
Hypothesis selection
  • 6 large-scale Chinese-English MT systems
  • 3 translation research groups
  • 4 MT decoders
  • Phrase-based, hierarchical, and example-based systems

[Table of results; scores in BLEU]
44
Questions to answer
  • N-best list size per system
  • Do n-best translations help at all?
  • If yes, how many?
  • How many systems to include
  • More systems always better?
  • Does a low quality system hurt combination?
  • Feature impact
  • Which features are the most useful?
  • How does this compare to MBR on n-best list?

45
N-best List Size
[Chart: BLEU on Chinese MT06 as a function of n-best list size per system (up to 50); baseline 31.45]
46
Combining all Systems
  • Adding systems one by one to the combination
  • Ordered by their BLEU score on the unseen test
    set

[Chart: BLEU on MT06 as systems are added; baseline 31.45, best combination gains 2.27 BLEU]
47
Feature Impact
  • Compare to two additional baselines:
  • LM re-ranking only
  • LM + statistical word lexicon
  • 23 features, 5 feature groups; can't run all combinations
  • Remove one feature group at a time
  • LM
  • Lexicon
  • Position dependent word agreement
  • Position independent n-gram agreement
  • N-best list LM probability

48
Feature Impact
[Table: results with 8 features vs. 23 features; scores in BLEU]
49
Contribution of the Systems to the Combination
[Chart: Chinese-English contributions per system]
50
Comparison to MBR
  • Chinese English
  • 623 sentences tuning set / 588 sentences blind
    test
  • Normalized system-specific cost
  • 200-best per system
  • Hypothesis selection without the normalized system cost as an additional feature

51
Cross Combination
  • Hypothesis selection output serves as skeleton
    for CN generation (JHU CN-based combo)
  • Smart choice of the skeleton for CN generation
    has impact on translation quality