MT Parameter Estimation: Minimum Error Rate Training
1
MT Parameter Estimation: Minimum Error Rate Training
2
Overview
  • Parameter Estimation / Tuning / Minimum Error
    Rate Training
  • Tuning Set
  • Difficulties
    • Computationally expensive to calculate the
      objective function
    • Error surface makes the search non-trivial
  • N-Best Lists
  • Powell Search
  • Simplex Algorithm
  • N-Best List Re-scoring
  • Minimum Bayes Risk

3
System overview
4
System overview
[Diagram: system overview]
  • Pipeline: Source Language Text → Preprocessing →
    Decoder → Target Language Text
  • Models combined by the decoder:
  • Translation Model: Phrase Table e → f
  • Translation Model: Phrase Table f → e
  • Translation Model: Lexicon e → f
  • Translation Model: Lexicon f → e
  • Language Model
  • POS LM
  • Distortion Model
  • Word Count
  • Phrase Count
  • Cohesion Constraint
  • Parameter Estimation sets the weights of this
    combination

5
Parameter Estimation / Tuning
  • Need training data to optimize the weights (λ1, ...,
    λn)
  • Set of sentences with reference translations
  • usually around 1000 sentences
  • held out from the training data for the translation
    and language models
  • called the tuning set or development set
  • Tuning towards better translation
  • needs an automatic translation evaluation metric
    (e.g. BLEU, TER, METEOR; see the sketch below)
  • minimize the translation error rate (maximize the
    translation score)
  • Minimum Error Rate Training (MERT)
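
For instance, scoring the tuning set with BLEU might look like the following minimal sketch, using the sacrebleu package as one concrete option; the file names are hypothetical:

    # Score decoder output against the references with corpus-level BLEU.
    # Uses the sacrebleu package; the file names are hypothetical.
    import sacrebleu

    with open("tuning.hyp") as f:   # 1-best decoder output, one sentence per line
        hyps = [line.strip() for line in f]
    with open("tuning.ref") as f:   # reference translations, same order
        refs = [line.strip() for line in f]

    bleu = sacrebleu.corpus_bleu(hyps, [refs])  # second arg: list of reference streams
    print(f"BLEU = {bleu.score:.2f}")           # the score tuning tries to maximize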

6
Parameter Estimation / Tuning
  • Find (λ1, ..., λn) so that the translation error
    rate is minimal
  • To evaluate the objective function we need to
  • set the weights
  • run the decoder with these weights
  • evaluate the resulting translation
  • computationally expensive! (see the sketch after
    this list)
  • Error surface is not nice
  • not convex → many local minima
  • not differentiable → gradient-descent methods are
    not readily applicable
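
A minimal sketch of one evaluation of this objective function, assuming hypothetical stand-ins decode (a full decoder run) and bleu (the chosen metric):

    # One evaluation of the MERT objective, as described above.
    # `decode` and `bleu` are hypothetical stand-ins for the real
    # decoder and metric; every call triggers a full decoding run.
    def objective(weights, dev_source, dev_refs):
        hypotheses = decode(dev_source, weights)  # expensive: full decoder run
        return -bleu(hypotheses, dev_refs)        # negated: minimize error = maximize BLEU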

[Figure: error over a single weight λi, a jagged, non-convex curve with many local minima]
7
N-Best Lists
  • Optimize on an n-best list
  • output e.g. the 500 best translations with their
    feature scores
  • pre-calculate the error rate for each n-best list
    entry
  • optimize the weights so that the best translation
    (according to the error metric) gets the best
    total score
  • Powell search / simplex algorithm
  • re-run the decoder with the updated weights
  • add the new n-best list to the previous one (more
    stability)
  • run the optimizer over the larger n-best lists
  • repeat until there are no new translations, the
    improvement is < epsilon, or simply k times
    (typically 5-10 iterations); see the sketch below
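
The loop just described, as a minimal sketch; decode_nbest (a decoder run producing an n-best list with feature scores) and optimize_weights (the Powell/simplex optimizer over the accumulated lists) are hypothetical stand-ins:

    # Outer MERT loop as described above; the helpers are hypothetical.
    def mert(weights, dev_source, dev_refs, n=500, max_iter=10):
        pool = []                                         # accumulated n-best entries
        for _ in range(max_iter):
            nbest = decode_nbest(dev_source, weights, n)  # expensive decoder run
            new_entries = [h for h in nbest if h not in pool]
            if not new_entries:                           # no new translations: converged
                break
            pool.extend(new_entries)                      # grow the list for stability
            weights = optimize_weights(pool, dev_refs)    # cheap: rescores the pool only
        return weights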

8
N-Best Lists Example
  • N-best hypotheses

(sentence id 17, one hypothesis per entry)

17  since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.271806

17  since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons .  cost 12.488882

17  since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.599372

17  since october 9th , the dprk announced to conduct nuclear tests , discussion of japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should have nuclear weapons .  cost 12.612238

17  beginning october 9th , north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 12.970050

17  since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons .  cost 13.050649

17  from october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons .  cost 13.192306
9
N-Best Lists Example
(sentence id 16, one hypothesis per entry: total cost, then the individual feature scores)

16  there will be no ( removed ) . "  cost 2.8804655652387
    features: 2.93105 15.7766 0.155555 1.18532 1.84075 2.38613 5.12687 2.43787 5 -9 4.44849

16  there will not be ( removed ) . "  cost 3.0528793365197
    features: 3.39828 18.0113 0.155555 0.941883 1.635 1.84705 4.29711 2.43787 5 -9 3.95439

16  there will be no ( the dismissal ) . "  cost 3.094871928852
    features: 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892

16  there will be no dismiss ( ) . "  cost 3.2658643839427
    features: 3.26587 17.0452 0.155555 2.36578 2.33918 1.48557 5.72912 2.23892 5 -9 3.95801

16  there will be a ( removed ) . "  cost 3.4441157853547
    features: 3.44412 15.6787 0.155555 2.06722 3.24385 5.58451 3.79602 2.43787 5 -9 6.31365

16  there will be a ( recall ) . "  cost 3.4704808758737
    features: 3.47048 15.65 0.155555 2.98438 3.68957 5.63635 3.5788 2.23326 5 -9 6.104

16  ( the recall ) . "  cost 3.49422747670162
    features: 3.49422 10.8817 0.733333 3.29764 4.00283 8.77178 1.37352 2.23326 5 -6 9.02326

16  there will be no ( the dismissal ) " .  cost 3.515502727333
    features: 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892
10
N-Best Lists
  • Optimize on an n-best list
  • output e.g. the 500 best translations with their
    feature scores
  • pre-calculate the error rate for each n-best list
    entry
  • optimize the weights so that the best translation
    (according to the error metric) gets the best
    total score
  • Powell search / simplex algorithm
  • re-run the decoder with the updated weights
  • add the new n-best list to the previous one (more
    stability)
  • run the optimizer over the larger n-best lists
  • repeat until there are no new translations, the
    improvement is < epsilon, or simply k times
    (typically 5-10 iterations)

11
Powell search
  • Powell search is a line search
  • consider one parameter at a time

12
Powell search
  • Change one parameter at a time to find the optimum
    in each dimension
  • Cannot go diagonally
  • Not guaranteed to find the global optimum
  • End point depends on where the search starts
  • What step size is good?
  • High dimensionality
  • Computationally expensive: many points to evaluate
  • Can we reduce the number of points that need to
    be evaluated?

13
Powell search
  • Linear combination of models
  • Model cost: cost(e) = Σ_i λ_i h_i(e, f)
  • Only look at one feature weight λ_k at a time:
    cost(e) = λ_k h_k(e, f) + Σ_{i≠k} λ_i h_i(e, f)
  • With only λ_k changing, the total cost of one
    hypothesis is a linear function of λ_k: slope
    h_k(e, f), constant offset Σ_{i≠k} λ_i h_i(e, f)

14
Powell search
  • Model score for one hypothesis
  • Changing one feature weight

[Figure: model score of hypothesis e12 (TER 5) as a function of λk: a straight line with slope hk]
15
Powell search
  • Depending on the scaling factor λk, different
    hypotheses are in the 1-best position
  • Set λk so that the metric-best hypothesis is also
    model-best

[Figure: model scores of e11 (TER 8), e12 (TER 5) and e13 (TER 4) as lines over λk; the topmost line determines the model-best hypothesis, giving ranges of λk in which e12, e13 or e11 is 1-best, each incurring its error of 5, 4 or 8]
16
Powell search
  • Select a minimum number of evaluation points
  • calculate the intersection points of the lines
  • keep an intersection point only if its hypotheses
    are model-best at that point
  • choose one evaluation point between consecutive
    intersection points (see the sketch below)

[Figure: the same lines for e11 (TER 8), e12 (TER 5) and e13 (TER 4) over λk, now with their intersection points marked; one evaluation point between consecutive intersections suffices per interval]
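
A minimal sketch of this exact line search for a single sentence and one weight λ_k, assuming each hypothesis is given as a (slope, offset, error) line as constructed on the previous slides; the numbers in the example call are invented:

    # Exact line search on one weight for one sentence: intersect all
    # hypothesis lines, evaluate once per interval, return the lambda_k
    # (and error) where the model-best hypothesis has the lowest error.
    # (For brevity this sketch skips the envelope filtering described above.)
    from itertools import combinations

    def line_search_1d(lines):
        # Candidate interval boundaries: pairwise intersection points.
        xs = sorted({(o2 - o1) / (s1 - s2)
                     for (s1, o1, _), (s2, o2, _) in combinations(lines, 2)
                     if s1 != s2})
        if not xs:                   # all lines parallel: lambda_k does not matter
            xs = [0.0]
        # One evaluation point inside each interval, plus the two outer rays.
        points = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
        best_x, best_err = None, float("inf")
        for x in points:
            top = max(lines, key=lambda l: l[0] * x + l[1])  # model-best at x
            if top[2] < best_err:                            # its error count
                best_x, best_err = x, top[2]
        return best_x, best_err

    # Invented three-hypothesis example in the spirit of e11/e12/e13:
    print(line_search_1d([(1.0, 0.0, 8), (0.2, 1.0, 5), (-0.5, 2.0, 4)]))

For several sentences, the interval boundaries of all sentences are pooled and the errors of each sentence's model-best hypothesis are summed per interval, as the next two slides illustrate.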
17
Powell search
  • A different source sentence
  • No matter which λk is chosen, e22 can never be
    1-best

[Figure: model scores of e21 (TER 2), e22 (TER 0) and e23 (TER 5) over λk; only e23 and e21 ever reach the 1-best position, with errors 5 and 2]
18
Powell search
  • Multiple sentences

[Figure: the lines of both sentences (e11/e12/e13 with TER 8/5/4 and e21/e22/e23 with TER 2/0/5) combined over λk; in each interval the errors of the model-best hypotheses of all sentences are summed, giving totals of 10, 7, 10 and 9, and λk is set inside the interval with the lowest total]
19
Simplex Algorithm
  • Downhill simplex is essentially like gradient
    descent
  • but the error function is not differentiable
  • Looking at two dimensions
  • evaluate three points to find the direction in
    which the error decreases
  • consider additional points to ensure convergence

20
Simplex Algorithm
  • Replace the worst point with a new point (tried in
    this order)
  • R: reflection
  • E: expansion
  • C: contraction
  • S: shrinking (replace the worst point with S and
    the "good" point with M)

21
Simplex Algorithm
  • For n dimensions
  • start with n+1 random weight vectors
  • evaluate the translation for each configuration →
    objective function
  • sort the points xk according to the objective
    function: f(x1) < f(x2) < ... < f(xn+1)
  • calculate x0 as the center of gravity of x1 ... xn
  • replace the worst point with a point reflected
    through the centroid: xr = x0 + r (x0 - xn+1)
  • or consider additional points (expansion,
    contraction, shrinking); see the sketch below
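
A minimal sketch of the reflection step just described; objective is the (hypothetical) decode-and-score function from slide 6:

    # One downhill-simplex iteration over n+1 weight vectors.
    import numpy as np

    def simplex_step(points, objective, r=1.0):
        points = sorted(points, key=objective)  # f(x1) <= ... <= f(x_{n+1})
        x0 = np.mean(points[:-1], axis=0)       # centroid of the n best points
        xr = x0 + r * (x0 - points[-1])         # reflect the worst point through x0
        if objective(xr) < objective(points[-1]):
            points[-1] = xr                     # accept the reflection
        # A full implementation also tries expansion, contraction and shrinking.
        return points

In practice one can also hand the same objective to an off-the-shelf implementation such as scipy.optimize.minimize(..., method='Nelder-Mead').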

22
Random Restarts
  • Comparison of simplex and Powell search
  • (Alok Parlikar, unpublished; Bing Zhao, unpublished)
  • Alok: simplex is jumpier than Powell
  • Bing: simplex better than MERT
  • Both: you need many restarts

23
Notes on Tuning
  • Optimization can get stuck in a local minimum
  • restart multiple times with random seeds
  • The models themselves are not improved, only their
    combination
  • Some parameters change the performance of the
    decoder but are not part of the linear combination
  • beam size
  • word reordering restrictions
  • Optimization towards different automatic metrics
  • Optimization using different development data

24
N-Best List Re-Ranking
  • N-best list re-ranking is a standard technique
  • use it for computationally expensive things
  • try new ideas with minimal implementation effort
  • apply additional models which are too large to
    be loaded
  • apply additional models which cannot be applied
    to a lattice
  • e.g. consensus over all translations in the
    n-best list

25
N-Best List Re-Ranking
  • On the n-best list
  • add additional feature scores
  • optimize and re-rank
  • output the new first-best translation from the list
  • Oracle score on the n-best list
  • pick the metric-best hypothesis for each source
    sentence (see the sketch below)
  • typically 8 to 12 BLEU points better than the
    decoder's first-best
  • this means our models are not strong enough to
    select the best hypothesis
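
A minimal sketch of the oracle computation, with ter standing in for any sentence-level error function:

    # Oracle: per source sentence, pick the hypothesis with the lowest
    # error against the reference. `ter` is a hypothetical stand-in.
    def oracle_translations(nbest_lists, references):
        return [min(hyps, key=lambda h: ter(h, ref))
                for hyps, ref in zip(nbest_lists, references)]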

26
Minimum Bayes Risk
  • Maximum a posteriori solution:
    ê = argmax_e P(e | f)
  • Minimum Bayes risk solution:
    ê = argmin_{e'} Σ_e L(e', e) P(e | f)

27
Minimum Bayes Risk
  • Loss function L(e, e′): are hypotheses that are
    similar close together in the hypothesis space?
  • Use the automatic evaluation metric we want to
    improve
  • e is the hypothesis, e′ is treated as the reference
  • calculate the translation error rate as the
    distance measure
  • Requires pairwise comparison of all hypotheses:
    O(n²)
  • only consider e.g. the 1000 best hypotheses
    (n-best list); see the sketch below
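
A minimal sketch of MBR selection on an n-best list; loss (e.g. sentence-level TER) is a stand-in, and the posteriors would normally be derived from the scaled model scores:

    # Minimum Bayes Risk over an n-best list: O(n^2) pairwise comparisons.
    def mbr_decode(hyps, posteriors, loss):
        best, best_risk = None, float("inf")
        for e in hyps:                           # candidate output
            risk = sum(p * loss(e, e2)           # e2 plays the reference role
                       for e2, p in zip(hyps, posteriors))
            if risk < best_risk:
                best, best_risk = e, risk
        return best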

28
Summary
  • Parameter Estimation / Tuning / Minimum Error
    Rate Training
  • Tuning Set
  • Difficulties
    • Computationally expensive to calculate the
      objective function
    • Error surface makes the search non-trivial
  • N-Best Lists
  • Powell Search
  • Simplex Algorithm
  • N-Best List Re-scoring
  • Minimum Bayes Risk