MT Parameter Estimation: Minimum Error Rate Training
1
MT Parameter Estimation: Minimum Error Rate Training
2
Overview
  • Parameter Estimation / Tuning / Minimum Error
    Rate Training
  • Tuning Set
  • Difficulties
    • Computationally expensive to calculate the
      objective function
    • Error surface makes the search non-trivial
  • N-Best Lists
  • Powell Search
  • Simplex Algorithm
  • N-Best List Re-scoring
  • Minimum Bayes Risk

3
System overview
4
System overview
[Diagram: system overview]
  • Pipeline: Source Language Text → Preprocessing →
    Decoder → Target Language Text
  • Models combined by the decoder:
  • Translation Model: Phrase Table e → f
  • Translation Model: Phrase Table f → e
  • Translation Model: Lexicon e → f
  • Translation Model: Lexicon f → e
  • Language Model
  • POS LM
  • Distortion Model
  • Word Count
  • Phrase Count
  • Cohesion Constraint
  • Parameter Estimation sets the weights of this
    combination

5
Parameter Estimation / Tuning
  • Need training data to optimize the weights (λ1, ...,
    λn)
  • Set of sentences with reference translations
  • usually around 1000 sentences
  • held out from the training data for the translation
    and language models
  • called the tuning set or development set
  • Tuning towards better translation
  • needs an automatic translation evaluation metric
    (e.g. BLEU, TER, METEOR; see the sketch below)
  • minimize the translation error rate (maximize the
    translation score)
  • Minimum Error Rate Training (MERT)
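
For instance, scoring the tuning set with BLEU might look like the following minimal sketch, using the sacrebleu package as one concrete option; the file names are hypothetical:

    # Score decoder output against the references with corpus-level BLEU.
    # Uses the sacrebleu package; the file names are hypothetical.
    import sacrebleu

    with open("tuning.hyp") as f:   # 1-best decoder output, one sentence per line
        hyps = [line.strip() for line in f]
    with open("tuning.ref") as f:   # reference translations, same order
        refs = [line.strip() for line in f]

    bleu = sacrebleu.corpus_bleu(hyps, [refs])  # second arg: list of reference streams
    print(f"BLEU = {bleu.score:.2f}")           # the score tuning tries to maximize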

6
Parameter Estimation / Tuning
  • Find (λ1, ..., λn) so that the translation error
    rate is minimal
  • To evaluate the objective function we need to
  • set the weights
  • run the decoder with these weights
  • evaluate the resulting translation
  • computationally expensive! (see the sketch after
    this list)
  • Error surface is not nice
  • not convex → many local minima
  • not differentiable → gradient-descent methods are
    not readily applicable
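
A minimal sketch of one evaluation of this objective function, assuming hypothetical stand-ins decode (a full decoder run) and bleu (the chosen metric):

    # One evaluation of the MERT objective, as described above.
    # `decode` and `bleu` are hypothetical stand-ins for the real
    # decoder and metric; every call triggers a full decoding run.
    def objective(weights, dev_source, dev_refs):
        hypotheses = decode(dev_source, weights)  # expensive: full decoder run
        return -bleu(hypotheses, dev_refs)        # negated: minimize error = maximize BLEU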

[Figure: error over a single weight λi, a jagged, non-convex curve with many local minima]
7
N-Best Lists
  • Optimize on an n-best list
  • output e.g. the 500 best translations with their
    feature scores
  • pre-calculate the error rate for each n-best list
    entry
  • optimize the weights so that the best translation
    (according to the error metric) gets the best
    total score
  • Powell search / simplex algorithm
  • re-run the decoder with the updated weights
  • add the new n-best list to the previous one (more
    stability)
  • run the optimizer over the larger n-best lists
  • repeat until there are no new translations, the
    improvement is < epsilon, or simply k times
    (typically 5-10 iterations); see the sketch below
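
The loop just described, as a minimal sketch; decode_nbest (a decoder run producing an n-best list with feature scores) and optimize_weights (the Powell/simplex optimizer over the accumulated lists) are hypothetical stand-ins:

    # Outer MERT loop as described above; the helpers are hypothetical.
    def mert(weights, dev_source, dev_refs, n=500, max_iter=10):
        pool = []                                         # accumulated n-best entries
        for _ in range(max_iter):
            nbest = decode_nbest(dev_source, weights, n)  # expensive decoder run
            new_entries = [h for h in nbest if h not in pool]
            if not new_entries:                           # no new translations: converged
                break
            pool.extend(new_entries)                      # grow the list for stability
            weights = optimize_weights(pool, dev_refs)    # cheap: rescores the pool only
        return weights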

8
N-Best Lists Example
  • N-best hypotheses

(sentence id 17, one hypothesis per entry)

17  since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.271806

17  since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons .  cost 12.488882

17  since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.599372

17  since october 9th , the dprk announced to conduct nuclear tests , discussion of japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should have nuclear weapons .  cost 12.612238

17  beginning october 9th , north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.970050

17  since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons .  cost 13.050649

17  from october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 13.192306
9
N-Best Lists Example
(sentence id 16, one hypothesis per entry: total cost, then the individual feature scores)

16  there will be no ( removed ) . "  cost 2.8804655652387
    features: 2.93105 15.7766 0.155555 1.18532 1.84075 2.38613 5.12687 2.43787 5 -9 4.44849

16  there will not be ( removed ) . "  cost 3.0528793365197
    features: 3.39828 18.0113 0.155555 0.941883 1.635 1.84705 4.29711 2.43787 5 -9 3.95439

16  there will be no ( the dismissal ) . "  cost 3.094871928852
    features: 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892

16  there will be no dismiss ( ) . "  cost 3.2658643839427
    features: 3.26587 17.0452 0.155555 2.36578 2.33918 1.48557 5.72912 2.23892 5 -9 3.95801

16  there will be a ( removed ) . "  cost 3.4441157853547
    features: 3.44412 15.6787 0.155555 2.06722 3.24385 5.58451 3.79602 2.43787 5 -9 6.31365

16  there will be a ( recall ) . "  cost 3.4704808758737
    features: 3.47048 15.65 0.155555 2.98438 3.68957 5.63635 3.5788 2.23326 5 -9 6.104

16  ( the recall ) . "  cost 3.49422747670162
    features: 3.49422 10.8817 0.733333 3.29764 4.00283 8.77178 1.37352 2.23326 5 -6 9.02326

16  there will be no ( the dismissal ) " .  cost 3.515502727333
    features: 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892
10
N-Best Lists
  • Optimize on an n-best list
  • output e.g. the 500 best translations with their
    feature scores
  • pre-calculate the error rate for each n-best list
    entry
  • optimize the weights so that the best translation
    (according to the error metric) gets the best
    total score
  • Powell search / simplex algorithm
  • re-run the decoder with the updated weights
  • add the new n-best list to the previous one (more
    stability)
  • run the optimizer over the larger n-best lists
  • repeat until there are no new translations, the
    improvement is < epsilon, or simply k times
    (typically 5-10 iterations)

11
Powell search
  • Powell search is a line search
  • consider one parameter at a time

12
Powell search
  • Change one parameter at a time to find the optimum
    in each dimension
  • Cannot go diagonally
  • Not guaranteed to find the global optimum
  • End point depends on where the search starts
  • What step size is good?
  • High dimensionality
  • Computationally expensive: many points to evaluate
  • Can we reduce the number of points that need to
    be evaluated?

13
Powell search
  • Linear combination of models
  • Model cost: cost(e) = Σ_i λ_i h_i(e, f)
  • Only look at one feature weight λ_k at a time:
    cost(e) = λ_k h_k(e, f) + Σ_{i≠k} λ_i h_i(e, f)
  • With only λ_k changing, the total cost of one
    hypothesis is a linear function of λ_k: slope
    h_k(e, f), constant offset Σ_{i≠k} λ_i h_i(e, f)

14
Powell search
  • Model score for one hypothesis
  • Changing one feature weight

[Figure: model score of hypothesis e12 (TER 5) as a function of λk: a straight line with slope hk]
15
Powell search
  • Depending on the scaling factor λk, different
    hypotheses are in the 1-best position
  • Set λk so that the metric-best hypothesis is also
    model-best

[Figure: model scores of e11 (TER 8), e12 (TER 5) and e13 (TER 4) as lines over λk; the topmost line determines the model-best hypothesis, giving ranges of λk in which e12, e13 or e11 is 1-best, each incurring its error of 5, 4 or 8]
16
Powell search
  • Select a minimum number of evaluation points
  • calculate the intersection points of the lines
  • keep an intersection point only if its hypotheses
    are model-best at that point
  • choose one evaluation point between consecutive
    intersection points (see the sketch below)

[Figure: the same lines for e11 (TER 8), e12 (TER 5) and e13 (TER 4) over λk, now with their intersection points marked; one evaluation point between consecutive intersections suffices per interval]
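
A minimal sketch of this exact line search for a single sentence and one weight λ_k, assuming each hypothesis is given as a (slope, offset, error) line as constructed on the previous slides; the numbers in the example call are invented:

    # Exact line search on one weight for one sentence: intersect all
    # hypothesis lines, evaluate once per interval, return the lambda_k
    # (and error) where the model-best hypothesis has the lowest error.
    # (For brevity this sketch skips the envelope filtering described above.)
    from itertools import combinations

    def line_search_1d(lines):
        # Candidate interval boundaries: pairwise intersection points.
        xs = sorted({(o2 - o1) / (s1 - s2)
                     for (s1, o1, _), (s2, o2, _) in combinations(lines, 2)
                     if s1 != s2})
        if not xs:                   # all lines parallel: lambda_k does not matter
            xs = [0.0]
        # One evaluation point inside each interval, plus the two outer rays.
        points = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
        best_x, best_err = None, float("inf")
        for x in points:
            top = max(lines, key=lambda l: l[0] * x + l[1])  # model-best at x
            if top[2] < best_err:                            # its error count
                best_x, best_err = x, top[2]
        return best_x, best_err

    # Invented three-hypothesis example in the spirit of e11/e12/e13:
    print(line_search_1d([(1.0, 0.0, 8), (0.2, 1.0, 5), (-0.5, 2.0, 4)]))

For several sentences, the interval boundaries of all sentences are pooled and the errors of each sentence's model-best hypothesis are summed per interval, as the next two slides illustrate.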
17
Powell search
  • A different source sentence
  • No matter which λk is chosen, e22 can never be
    1-best

[Figure: model scores of e21 (TER 2), e22 (TER 0) and e23 (TER 5) over λk; only e23 and e21 ever reach the 1-best position, with errors 5 and 2]
18
Powell search
  • Multiple sentences

[Figure: the lines of both sentences (e11/e12/e13 with TER 8/5/4 and e21/e22/e23 with TER 2/0/5) combined over λk; in each interval the errors of the model-best hypotheses of all sentences are summed, giving totals of 10, 7, 10 and 9, and λk is set inside the interval with the lowest total]
19
Simplex Algorithm
  • Downhill simplex is essentially like gradient
    descent
  • but the error function is not differentiable
  • Looking at two dimensions
  • evaluate three points to find the direction in
    which the error decreases
  • consider additional points to ensure convergence

20
Simplex Algorithm
  • Replace the worst point with a new point (tried in
    this order)
  • R: reflection
  • E: expansion
  • C: contraction
  • S: shrinking (replace the worst point with S and
    the "good" point with M)

21
Simplex Algorithm
  • For n dimensions
  • start with n+1 random weight vectors
  • evaluate the translation for each configuration →
    objective function
  • sort the points xk according to the objective
    function: f(x1) < f(x2) < ... < f(xn+1)
  • calculate x0 as the center of gravity of x1 ... xn
  • replace the worst point with a point reflected
    through the centroid: xr = x0 + r (x0 - xn+1)
  • or consider additional points (expansion,
    contraction, shrinking); see the sketch below
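
A minimal sketch of the reflection step just described; objective is the (hypothetical) decode-and-score function from slide 6:

    # One downhill-simplex iteration over n+1 weight vectors.
    import numpy as np

    def simplex_step(points, objective, r=1.0):
        points = sorted(points, key=objective)  # f(x1) <= ... <= f(x_{n+1})
        x0 = np.mean(points[:-1], axis=0)       # centroid of the n best points
        xr = x0 + r * (x0 - points[-1])         # reflect the worst point through x0
        if objective(xr) < objective(points[-1]):
            points[-1] = xr                     # accept the reflection
        # A full implementation also tries expansion, contraction and shrinking.
        return points

In practice one can also hand the same objective to an off-the-shelf implementation such as scipy.optimize.minimize(..., method='Nelder-Mead').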

22
Random Restarts
  • Comparison of simplex and Powell search
  • (Alok Parlikar, unpublished; Bing Zhao, unpublished)
  • Alok: simplex is jumpier than Powell
  • Bing: simplex better than MERT
  • Both: you need many restarts

23
Notes on Tuning
  • Optimization can get stuck in a local minimum
  • restart multiple times with random seeds
  • The models themselves are not improved, only their
    combination
  • Some parameters change the performance of the
    decoder but are not part of the linear combination
  • beam size
  • word reordering restrictions
  • Optimization towards different automatic metrics
  • Optimization using different development data

24
N-Best List Re-Ranking
  • N-best list re-ranking is a standard technique
  • use it for computationally expensive things
  • try new ideas with minimal implementation effort
  • apply additional models which are too large to
    be loaded
  • apply additional models which cannot be applied
    to a lattice
  • e.g. consensus over all translations in the
    n-best list

25
N-Best List Re-Ranking
  • On the n-best list
  • add additional feature scores
  • optimize and re-rank
  • output the new first-best translation from the list
  • Oracle score on the n-best list
  • pick the metric-best hypothesis for each source
    sentence (see the sketch below)
  • typically 8 to 12 BLEU points better than the
    decoder's first-best
  • this means our models are not strong enough to
    select the best hypothesis
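
A minimal sketch of the oracle computation, with ter standing in for any sentence-level error function:

    # Oracle: per source sentence, pick the hypothesis with the lowest
    # error against the reference. `ter` is a hypothetical stand-in.
    def oracle_translations(nbest_lists, references):
        return [min(hyps, key=lambda h: ter(h, ref))
                for hyps, ref in zip(nbest_lists, references)]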

26
Minimum Bayes Risk
  • Maximum a posteriori solution:
    ê = argmax_e P(e | f)
  • Minimum Bayes risk solution:
    ê = argmin_{e'} Σ_e L(e', e) P(e | f)

27
Minimum Bayes Risk
  • Loss function L(e, e′): are hypotheses that are
    similar close together in the hypothesis space?
  • Use the automatic evaluation metric we want to
    improve
  • e is the hypothesis, e′ is treated as the reference
  • calculate the translation error rate as the
    distance measure
  • Requires pairwise comparison of all hypotheses:
    O(n²)
  • only consider e.g. the 1000 best hypotheses
    (n-best list); see the sketch below
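
A minimal sketch of MBR selection on an n-best list; loss (e.g. sentence-level TER) is a stand-in, and the posteriors would normally be derived from the scaled model scores:

    # Minimum Bayes Risk over an n-best list: O(n^2) pairwise comparisons.
    def mbr_decode(hyps, posteriors, loss):
        best, best_risk = None, float("inf")
        for e in hyps:                           # candidate output
            risk = sum(p * loss(e, e2)           # e2 plays the reference role
                       for e2, p in zip(hyps, posteriors))
            if risk < best_risk:
                best, best_risk = e, risk
        return best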

28
Summary
  • Parameter Estimation / Tuning / Minimum Error
    Rate Training
  • Tuning Set
  • Difficulties
    • Computationally expensive to calculate the
      objective function
    • Error surface makes the search non-trivial
  • N-Best Lists
  • Powell Search
  • Simplex Algorithm
  • N-Best List Re-scoring
  • Minimum Bayes Risk