Title: MT System Combination
System Combination in MT
- Methods of machine translation
  - Rule based
  - Example based
  - Statistical
  - Hierarchical
  - Syntax based
- Their output is different
- Make use of the individual strengths of the different systems to improve translation quality
- Just select the best output on a sentence-by-sentence basis, or build a synthetic combination of the output from the original systems?
Parallel Combination
[Diagram: several MT systems translate the source language text in parallel; a combination module merges their outputs into one translation.]
Serial Combination
[Diagram: the source language text passes through a chain of MT systems; the output of one system is the input of the next, yielding the final translation.]
Model Combination
[Diagram: the phrase tables, lexica, and reordering models of systems S1 and S2 are each merged (phrase table combination, lexicon combination, reordering combination) and fed into a single decoder that translates the source language text into the target language text.]
MT System Combination Approaches
- Parallel combination
  - Hypothesis selection
  - Lattice based combination
  - Confusion network (CN) based combination
- Serial combination
  - RBMT → SMT
- Cross combination: hypothesis selection + CN based combination
- Model level combination
  - Combine lexica
  - Combine phrase tables
  - Combine input systems' reorderings
System Combination in MT
[Example slide: outputs of an example based, a phrase based statistical, and a hierarchical statistical system for the same source sentence.]
System Combination in MT
Example output: "hoffman was mesmerized by drug but woke up in a timely manner to create career"
Hypothesis Selection
- How to decide which hypothesis to pick?
- System bias
  - Boost hypotheses from each system according to its overall (BLEU) score on development data
- MT system confidence score
  - The system reports how well it thinks it translated the sentence
  - Problematic, because these estimates are not comparable between systems, nor between sentences within one system
  - Need to be normalized
- Language model
Hypothesis Selection
- Pick the best hypothesis from the different systems for each source sentence
- Use an n-best list re-ranking approach
  - Add the n-best hypotheses from each system
  - Re-rank the joint n-best list
  - Find good features; add n-best list based features
N-best List Re-ranking Features
- Consistently calculated for the joint n-best list
  - Language model
  - Statistical word lexicon
  - Position dependent word agreement
  - Position independent n-gram agreement
  - N-best list n-gram probability
  - Sentence length features
  - Rank in the system's n-best list
  - System bias
- Minimum error rate training (MERT) to determine the feature weights on a development test set
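The re-ranking step can be sketched as a weighted linear model over the joint n-best list. The feature names, values, and weights below are purely illustrative; in the approach described here the weights would come from MERT on development data.

```python
def rerank(joint_nbest, weights):
    """joint_nbest: list of (hypothesis, feature_dict) pooled from all
    systems; returns hypotheses sorted by weighted feature sum, best first."""
    def score(entry):
        _, feats = entry
        return sum(weights.get(name, 0.0) * value
                   for name, value in feats.items())
    return sorted(joint_nbest, key=score, reverse=True)

# Toy joint n-best list for one source sentence (hypothetical features).
joint_nbest = [
    ("hoffman was mesmerized by drug",
     {"lm": -12.4, "word_agreement": 0.3, "system_A": 1.0}),
    ("hoffman was addicted to drugs",
     {"lm": -10.1, "word_agreement": 0.7, "system_B": 1.0}),
]
weights = {"lm": 1.0, "word_agreement": 5.0, "system_A": 0.2, "system_B": 0.4}
print(rerank(joint_nbest, weights)[0][0])  # hypothesis with the best score
```

The system indicator features (`system_A`, `system_B`) model the system bias; all other features are shared across systems so scores stay comparable.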
LM and Lexicon
- Language model
  - Kneser-Ney smoothing
  - Sentence score is normalized by the sentence length
  - As large as possible, e.g. the complete LDC Gigaword corpus: 2.7 billion words of English data, 5-gram LM
  - SRI LM toolkit
  - LDC = Linguistic Data Consortium
LM and Lexicon
- Statistical word lexica
  - Lexicon probability sum
  - Lexicon probability maximum
  - Both language directions
  - From IBM Model 4 GIZA++ training
  - On all bilingual data you can get, e.g. 260 million words Chinese-English
  - Sentence score is normalized by the sentence length
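A minimal sketch of the two lexicon features for one language direction, assuming a toy p(e|f) word lexicon (the entries and probabilities are invented; in practice they come from Model 4 training):

```python
import math

# Hypothetical p(target | source) word lexicon entries.
LEX = {
    ("霍夫曼", "hoffman"): 0.9,
    ("毒品", "drug"): 0.6,
    ("毒品", "drugs"): 0.3,
}

def lexicon_scores(source, target, lex, floor=1e-9):
    """Length-normalized log scores: IBM-1 style probability sum over all
    source words vs. the single best link (probability maximum)."""
    log_sum = log_max = 0.0
    for e in target:
        probs = [lex.get((f, e), 0.0) for f in source]
        log_sum += math.log(max(sum(probs) / len(source), floor))
        log_max += math.log(max(max(probs), floor))
    return log_sum / len(target), log_max / len(target)

s_sum, s_max = lexicon_scores(["霍夫曼", "毒品"], ["hoffman", "drugs"], LEX)
print(s_sum, s_max)
```

The maximum-based score is never below the averaged sum, since the best single link dominates the average over all source words.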
Position Dependent Word Agreement
- N-best list based feature
- Relative frequency of n-best list entries containing word e in the same position
- Very restrictive; to loosen the restriction:
  - Use a window around the original position
  - Window sizes t = 0, 1, 2 as three separate features
- Sentence score = average word score
Position Dependent Word Agreement
[Example: n-best list for one source sentence; agreement 30% at t = 0, loosened with windows t = 1 and t = 2.]
- Majority vote on words (ROVER style)
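A sketch of the feature on a toy n-best list (sentences are invented examples in the spirit of the slides):

```python
def word_agreement(hyp, nbest, t=0):
    """Average over hyp's words of the fraction of n-best entries that
    contain the word within +/- t of the same position."""
    total = 0.0
    for i, word in enumerate(hyp):
        votes = sum(word in other[max(0, i - t):i + t + 1] for other in nbest)
        total += votes / len(nbest)
    return total / len(hyp)

nbest = [
    "hoffman was addicted to drugs".split(),
    "hoffman was mesmerized by drug".split(),
    "hoffman were obsessed by drugs".split(),
]
# Exact-position agreement (t = 0) for the first hypothesis:
print(round(word_agreement(nbest[0], nbest, t=0), 2))
```

Widening the window (t = 1, 2) can only keep or add votes, so the loosened features are never smaller than the strict t = 0 score.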
Position Independent N-gram Agreement
- N-best list based feature
- Relative frequency of n-best list entries containing the n-gram
- Sentence score = average n-gram score
- Uni-gram up to 6-gram as 6 separate features
Position Independent N-gram Agreement
[Example: n-best list for one source sentence; n = 1: word agreement 90%; n = 3: tri-gram agreement 60%; n = 5: 5-gram agreement 40%.]
- Agreement scores for n = 1 to 6 as separate features
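The same toy n-best list illustrates the position independent variant; as in the example above, agreement drops as n grows because longer n-grams are shared by fewer entries:

```python
def ngram_agreement(hyp, nbest, n):
    """Average over hyp's n-grams of the fraction of n-best entries that
    contain the n-gram anywhere in the sentence."""
    def ngrams(words):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    hyp_ngrams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
    if not hyp_ngrams:
        return 0.0
    total = sum(sum(g in ngrams(other) for other in nbest) / len(nbest)
                for g in hyp_ngrams)
    return total / len(hyp_ngrams)

nbest = [
    "hoffman was addicted to drugs".split(),
    "hoffman was mesmerized by drug".split(),
    "hoffman were obsessed by drugs".split(),
]
print(round(ngram_agreement(nbest[0], nbest, 1), 2))  # unigram agreement
print(round(ngram_agreement(nbest[0], nbest, 3), 2))  # trigram agreement
```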
N-best List N-gram Probability
- N-best list based feature
- Standard language model probability, estimated on the n-best hypotheses for one source sentence
- No smoothing, because counts are never zero
- Sentence score = average word score
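A sketch of this feature as an unsmoothed n-gram model estimated on the n-best list itself (the list entries are invented); counts are never zero because the scored hypothesis is part of the list its n-grams were counted from:

```python
import math
from collections import Counter

def nbest_ngram_logprob(hyp, nbest, n=3):
    """Length-normalized log probability of hyp under an unsmoothed n-gram
    LM estimated on the n-best list for this source sentence."""
    counts, contexts = Counter(), Counter()
    for sent in nbest:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1:i + 1])] += 1
            contexts[tuple(padded[i - n + 1:i])] += 1
    padded = ["<s>"] * (n - 1) + hyp + ["</s>"]
    logp = sum(math.log(counts[tuple(padded[i - n + 1:i + 1])]
                        / contexts[tuple(padded[i - n + 1:i])])
               for i in range(n - 1, len(padded)))
    return logp / (len(hyp) + 1)  # normalize by length incl. </s>

nbest = [
    "hoffman was addicted to drugs".split(),
    "hoffman was mesmerized by drug".split(),
]
print(round(nbest_ngram_logprob(nbest[0], nbest, n=3), 3))
```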
N-gram Agreement vs. N-gram Probability
[Example "San Francisco", n-best list for one source sentence: n = 2 bi-gram agreement 30%, but P(Francisco | San) = 3/3, i.e. bi-gram probability 100%.]
- The LM n-gram probability gives information on word order.
Sentence Length Features
- Deviation of the sentence length from the average hypothesis length over all hypotheses in the n-best list for this source sentence
- Ratio between the source and the target sentence length
  - Deviation from the overall source-target token count ratio in the bilingual training corpus for this language pair
  - Deviation of the source/target ratio from the average source-to-target length ratio within its class
    - Divide the training corpus into several buckets by sentence length
    - Calculate the average source/target length ratio or average target length per bucket
    - The source/target ratio might be quite different for short, medium and long sentences
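The first three length features can be sketched directly (the corpus-level ratio here is a made-up stand-in for the value estimated from the bilingual training corpus):

```python
def length_features(src, hyp, nbest, corpus_ratio=1.2):
    """src, hyp: token lists; nbest: all hypotheses (token lists) for this
    source sentence. corpus_ratio is a hypothetical target/source token
    ratio from the bilingual training corpus."""
    avg_len = sum(len(h) for h in nbest) / len(nbest)
    ratio = len(hyp) / len(src)
    return {
        "len_dev": abs(len(hyp) - avg_len),      # vs. average n-best length
        "ratio": ratio,                          # target/source length ratio
        "ratio_dev": abs(ratio - corpus_ratio),  # vs. corpus-level ratio
    }

nbest = [h.split() for h in [
    "hoffman was addicted to drugs",
    "hoffman was mesmerized by drug",
    "hoffman were obsessed by drugs",
]]
feats = length_features("霍夫曼 沉迷 毒品".split(), nbest[0], nbest)
print(feats["len_dev"])  # 0.0: this hypothesis has the average length
```

The bucketed variant would simply look up `corpus_ratio` in a table keyed by the source-length bucket instead of using one global value.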
System Bias
- Boost each hypothesis according to its rank in the original system's n-best list
- Boost each translation coming from the best system
- Boost each hypothesis according to its system's performance on development data
- Add a system indicator feature to the feature set and optimize the weights using MERT
System Combination in MT
Example (Chinese-English MT06): "hoffman was mesmerized by drug fortunately awakening in a timely manner to create career in performing arts"
Lattice-based Example
[Figure: translation lattice for one source sentence.]
Lattice-based Combination
"hoffman was addicted to drugs, fortunately awaking in a timely manner to begin an acting career"
[Lattice figure over nodes 1-6: alternative edges include "hoffman" / "were", "was addicted to drugs", "was obsessed", "was mesmerized by drug", and "previously enamored drug".]
Lattice-based Combination
- Build a lattice from the multiple system outputs
  - Needs phrase translation boundary and source alignment information
- Can also combine complete system-internal translation lattices (if available)
  - The systems' internal scores are most likely not comparable; consistent scoring is difficult
  - The same translation proposed by several systems is not preferred/boosted
- Add LM score
- Re-decode
Confusion Network based Combination
[Confusion network figure over nodes 1-6; arcs carry words with vote counts, e.g. "hoffman (5)" / "hoffmann (1)", "was (3)" / "were (1)" / "has (1)", "obsessed (2)" / "mesmerized (1)" / "previously (1)", "enamored (1)" / "fortunately (1)", "drug (6)", plus empty-word arcs "e (1)", "e (2)", "e (3)".]
Lattice vs. CN
[Side-by-side figure: the translation lattice from the lattice-based combination example next to the confusion network built for the same source sentence.]
Word Level Confusion Network Decoding
- Choose one translation hypothesis as the skeleton (determines word order)
- Align each hypothesis to the skeleton using TER, ITGs, or statistical word alignment
- Build the confusion network with consensus votes
- Add the LM score into the network
- Train system weights, add them into the network
- Choose the best path through the network (decode)
- Output the consensus translation
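Once the network is built, decoding reduces to picking the best arc per slot. A minimal sketch with pure consensus votes (a real decoder would also fold in LM and system-weight scores before taking the best path; the vote counts below are loosely modeled on the figure):

```python
def decode_confusion_network(cn):
    """cn: list of slots, each a dict word -> vote count; the empty string
    represents the empty-word (epsilon) arc."""
    words = [max(slot, key=slot.get) for slot in cn]
    return " ".join(w for w in words if w)  # drop empty-word arcs

cn = [
    {"hoffman": 5, "hoffmann": 1},
    {"was": 3, "were": 1, "has": 1, "": 1},
    {"obsessed": 2, "mesmerized": 1, "previously": 1, "": 1},
    {"enamored": 1, "fortunately": 1, "": 3},
    {"drug": 6},
]
print(decode_confusion_network(cn))  # "hoffman was obsessed drug"
```

Note how the fourth slot's epsilon arc wins, so the consensus translation can be shorter than any single input hypothesis.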
Confusion Network Decoding
[Figure: one hypothesis is chosen as the skeleton; the skeleton determines the word order of the consensus translation.]
Confusion Network Decoding
- Biggest challenge: word alignment
- Pairwise vs. incremental alignment
- TER alignment
  - Use morphology, synonyms, POS tags
  - Go to phrases (without source-target phrase alignment available)
Confusion Network Decoding
- Comparison: pairwise vs. incremental alignment
- Incremental: the next hypothesis is aligned to the existing network, not to the skeleton
  - The order of adding hypotheses does make a difference, e.g. use increasing TER / decreasing BLEU of the systems
  - But the choice of the skeleton is not that crucial any more
Pairwise vs. Incremental Alignment
[Figure: example confusion networks built with pairwise vs. incremental alignment.]
Serial Combination
[Diagram: source language text → RBMT system → SMT system → translation.]
Serial Combination
- RBMT and SMT are good at very different things
  - RBMT produces very good translations if its rules cover the sentence well, and fails utterly e.g. for long, complicated sentences
  - SMT produces more or less erroneous output on everything
- Serial combination
  - Translate the entire training corpus with the RBMT system
  - Train the SMT system on the parallel corpus RBMT-translation → English
  - The SMT system acts as an automatic post-editor for the RBMT output
  - Smooths out RBMT problems without losing its strengths
  - Gives the RBMT strengths a better chance than in parallel combination, because there the statistical models bias towards SMT
Cross Combination
- The hypothesis selection output serves as the skeleton for CN generation
- A smart choice of the skeleton for CN generation has impact on translation quality, e.g. because it determines the word order
- The reverse order works as well
  - Each of the n input systems serves as the skeleton for one of n confusion networks
  - Hypothesis selection then selects from the combined n-best lists of the CN decoding
[Diagram: the outputs of several MT systems for the source language text are first combined via hypothesis selection; the result feeds into combination via confusion network decoding.]
Model Combination
[Diagram: the phrase tables, lexica, and reordering models of systems S1 and S2 are merged (phrase table combination, lexicon combination, reordering constraint) and fed into a single decoder that translates the source language text into the target language text.]
Lexicon Combination
- Combine the systems' lexica
  - Re-estimate joint probabilities
  - Only useful if the systems have different training data available
- Train a test-set-specific lexicon
  - Treat the source and its various translations as special training data
  - Build a lexicon only with entries from the systems' translations
  - If available, use the systems' phrase alignment to constrain the word alignment
  - Interpolate this sharp, test-set-specific lexicon with a large lexicon trained on all training data
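The final interpolation step can be sketched as a simple linear mixture; the interpolation weight and the toy entries are assumptions, not values from the slides:

```python
def interpolate_lexicons(sharp, large, lam=0.7):
    """Linear interpolation of a test-set-specific ('sharp') lexicon with a
    large general lexicon; both map (source, target) pairs to probabilities.
    lam (weight of the sharp lexicon) is illustrative; tune on dev data."""
    return {pair: lam * sharp.get(pair, 0.0) + (1 - lam) * large.get(pair, 0.0)
            for pair in set(sharp) | set(large)}

sharp = {("毒品", "drugs"): 0.8, ("毒品", "drug"): 0.2}
large = {("毒品", "drug"): 0.5, ("毒品", "medicine"): 0.5}
lex = interpolate_lexicons(sharp, large)
print(round(lex[("毒品", "drug")], 2))  # 0.7*0.2 + 0.3*0.5 = 0.29
```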
Phrase Table Combination
- Source-target phrase pairs are available for the test set from the input systems
- Combine with the full baseline phrase table, with adjusted weights for the phrase counts from the systems' output
- Rescore phrases using the scaled system total or confidence score
- Agreement boost for phrases coming from several systems
  - Exact match
  - Same phrase, different distortion
  - Overlapping source interval with the same target words
  - Overlapping target words
- Prune the phrase table
  - From the full phrase table, only keep phrases covered by one of the systems' outputs
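A rough sketch of the count combination with an exact-match agreement boost (the up-weighting and boost factors are invented placeholders, and only the exact-match case of the agreement boost is shown):

```python
from collections import defaultdict

def combine_phrase_tables(baseline, system_tables, sys_weight=2.0, boost=1.5):
    """baseline and each system table map (src, tgt) phrase pairs to counts.
    System counts are up-weighted, and pairs proposed by more than one
    system get an extra exact-match agreement boost."""
    counts = defaultdict(float, baseline)
    votes = defaultdict(int)
    for table in system_tables:
        for pair, c in table.items():
            counts[pair] += sys_weight * c
            votes[pair] += 1
    for pair, v in votes.items():
        if v > 1:  # exact match across several systems
            counts[pair] *= boost
    return dict(counts)

baseline = {("毒品", "drug"): 10.0}
sys_a = {("毒品", "drugs"): 1.0}
sys_b = {("毒品", "drugs"): 1.0, ("毒品", "narcotics"): 1.0}
table = combine_phrase_tables(baseline, [sys_a, sys_b])
print(table[("毒品", "drugs")])  # (2.0 + 2.0) * 1.5 = 6.0
```

The looser agreement cases (same phrase with different distortion, overlapping source intervals or target words) would need the phrases' source spans, which this sketch omits.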
Phrase Table Combination
- Several rule-based systems as input, so no phrase pairs are available
- Train word alignment on a parallel corpus
- Align the systems' output to the source with the statistically learned alignments
- Extract a test-set-specific phrase table from the systems' translations (Moses phrase extraction)
- Interpolate with the baseline statistical phrase table
- Build a translation lattice
- Re-decode
Reordering
- While re-decoding:
  - Restrict to reorderings used by one of the input systems
  - Boost the word order chosen by one of the systems
MT System Combination Approaches
- Parallel combination
  - Hypothesis selection
  - Lattice based combination
  - CN based combination
- Serial combination
  - RBMT → SMT
- Cross combination: hypothesis selection + CN based combination
- Model level combination
  - Combine lexica
  - Combine phrase tables
  - Combine input systems' reorderings
Hypothesis Selection
- 6 large-scale Chinese-English MT systems
  - 3 translation research groups
  - 4 MT decoders
  - Phrase-based, hierarchical, and example-based systems
[Table: scores in BLEU.]
Questions to Answer
- N-best list size per system
  - Do n-best translations help at all?
  - If yes, how many?
- How many systems to include
  - Are more systems always better?
  - Does a low-quality system hurt the combination?
- Feature impact
  - Which features are the most useful?
- How does this compare to MBR on the n-best list?
N-best List Size
[Plot: BLEU on Chinese MT06 as a function of the n-best list size (up to 50 per system); baseline 31.45.]
Combining All Systems
- Adding the systems one by one to the combination
- Ordered by their BLEU score on the unseen test set
[Plot: BLEU on MT06; baseline 31.45, best combination +2.27 BLEU.]
Feature Impact
- Compare to two additional baselines
  - LM re-ranking only
  - LM + statistical word lexicon
- 23 features, 5 feature groups; can't run all combinations
- Remove one feature group at a time
  - LM
  - Lexicon
  - Position dependent word agreement
  - Position independent n-gram agreement
  - N-best list LM probability
Feature Impact
[Table: scores in BLEU for the 8-feature vs. the full 23-feature setup.]
Contribution of the Systems to the Combination
[Chart: Chinese-English contributions per system.]
Comparison to MBR
- Chinese-English
- 623-sentence tuning set / 588-sentence blind test set
- Normalized system-specific cost
- 200-best per system
- Hypothesis selection without the normalized system cost as an additional feature
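For reference, MBR selection over a joint n-best list can be sketched as follows. For brevity a unigram-F1 similarity stands in for sentence-level BLEU and the posterior over hypotheses is uniform; a real setup would use BLEU and the normalized system costs.

```python
def similarity(a, b):
    """Unigram F1 between two hypotheses (stand-in for sentence BLEU)."""
    sa, sb = set(a.split()), set(b.split())
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def mbr_select(hyps):
    """Pick the hypothesis with maximum expected similarity to all
    hypotheses, i.e. minimum expected loss under a uniform posterior."""
    return max(hyps, key=lambda h: sum(similarity(h, o) for o in hyps))

hyps = [
    "hoffman was addicted to drugs",
    "hoffman was mesmerized by drug",
    "hoffman was obsessed by drugs",
]
print(mbr_select(hyps))  # the hypothesis closest to the overall consensus
```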
Cross Combination
- The hypothesis selection output serves as the skeleton for CN generation (JHU CN-based combination)
- A smart choice of the skeleton for CN generation has impact on translation quality