1
Word-based SMT
  • Ling 580
  • Fei Xia
  • Week 1 1/3/06

2
Outline
  • General concepts
  • Source channel model
  • Notations
  • Word alignment
  • Model 1-2
  • Model 3-4
  • Model 5

3
IBM Model Basics
  • Classic paper: Brown et al. (1993)
  • Translation: F → E (or Fr → Eng)
  • Resource required: parallel data (a set of sentence pairs)
  • Main concepts: source channel model, hidden word alignment, EM training

4
Intuition
  • Sentence pairs: assume the word mapping is one-to-one.
  • (1) S: a b c d e
  •     T: l m n o p
  • (2) S: c a e
  •     T: p n m
  • (3) S: d a c
  •     T: n p l
  • ⇒ (b, o), (d, l), (e, m), and either
  • (a, p), (c, n), or
  • (a, n), (c, p)

5
Source channel model
  • Task: S → T
  • Source channel (a.k.a. noisy channel, noisy
    source channel): uses Bayes' rule.
  • Two types of parameters
  • P(T): language model
  • P(S | T): its meaning varies.
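
The formula itself is not in the transcript. As a reminder, the standard Bayes' rule decomposition behind the source channel model can be written as follows (the hat notation is an assumption, not taken from the slides):

```latex
\hat{T} = \arg\max_{T} P(T \mid S)
        = \arg\max_{T} \frac{P(T)\, P(S \mid T)}{P(S)}
        = \arg\max_{T} P(T)\, P(S \mid T)
```

P(S) can be dropped because it does not depend on T.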

6
Source channel model for MT
[Diagram: Tgt sent, generated with P(T) → noisy channel, P(S | T) → Src sent]
  • Two types of parameters
  • Language model: P(T)
  • Translation model: P(S | T)

7
Source channel model for MT
[Diagram: Eng sent, generated with P(E) → noisy channel, P(F | E) → Fr sent]
  • Two types of parameters
  • Language model: P(E)
  • Translation model: P(F | E)

8
Source channel for MT
  • People think in English.
  • English thoughts can be characterized by a
    plausibility filter P(E).
  • Sentences are corrupted into a different
    language by a translation model P(F | E).
  • Our goal is to find the original, uncorrupted
    English sentence e. To achieve this goal, we
    efficiently evaluate P(E) P(F | E) over many
    candidate Eng sentences.
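
A minimal sketch of this scoring step, assuming the candidate set is already given. The candidate strings and all probabilities below are made up for illustration and do not come from the slides.

```python
import math

def best_translation(candidates, lm, tm):
    """Return the candidate e that maximizes P(e) * P(f | e), scored in log space."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

# Hypothetical scores for a single French input sentence f.
lm = {"the house": 0.02, "house the": 0.0001}   # P(E): plausibility filter
tm = {"the house": 0.30, "house the": 0.35}     # P(F | E): translation model

print(best_translation(list(lm), lm, tm))
```

The second candidate has the higher translation score, but the language model penalizes it, so the product favors the plausible English sentence.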

9
Source channel vs. direct model
  • Source channel: demands plausible Eng and a strong
    correlation between e and f.
  • Direct model: demands only a strong correlation between
    e and f.
  • Question: are the two the same?
  • Formally, they are the same.
  • In practice, they are not, due to different
    approximations.

10
Word alignment
  • a(j) = i ⇔ aj = i
  • a = (a1, ..., am)
  • Ex:
  • F: f1 f2 f3 f4 f5
  • E: e1 e2 e3 e4
  • a4 = 3
  • a = (0, 1, 1, 3, 2)
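
The example alignment can be represented as a plain tuple. This is only an illustrative sketch: the helper name `links` and the string "NULL" for e0 are assumptions.

```python
def links(alignment, eng, fr):
    """List (Fr word, Eng word) pairs for an alignment vector a = (a1, ..., am).

    Index 0 on the English side stands for the NULL word e0.
    """
    eng_with_null = ["NULL"] + list(eng)
    return [(fr[j], eng_with_null[i]) for j, i in enumerate(alignment)]

eng = ["e1", "e2", "e3", "e4"]
fr = ["f1", "f2", "f3", "f4", "f5"]
a = (0, 1, 1, 3, 2)

for f_word, e_word in links(a, eng, fr):
    print(f_word, "<-", e_word)
```

Here f1 is generated by NULL, e1 generates both f2 and f3, and e4 generates nothing.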

11
The constraint on word alignment
  • The constraint: each Fr word is generated by
    exactly one Eng word (including e0). l is the Eng
    sent length, m is the Fr sent length.
  • Without the constraint: 2^(lm) possible alignments.
  • With the constraint: (l+1)^m possible alignments.
  • Why do the models use the constraint?
  • We want to use P(fj | ei) to estimate P(F | E).
  • How to handle the exceptional cases?
  • Various methods: target word grouping,
    phrase-based SMT, etc.
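
The two counts can be checked for small sentence lengths. This is a sketch: the function names are illustrative, and the enumeration is only feasible for tiny l and m.

```python
from itertools import product

def count_constrained(l, m):
    """Enumerate alignments in which each Fr position picks one of e0..el."""
    return sum(1 for _ in product(range(l + 1), repeat=m))

def count_unconstrained(l, m):
    """Each of the l*m (Eng, Fr) position pairs is either linked or not."""
    return 2 ** (l * m)

l, m = 4, 5
print(count_constrained(l, m), (l + 1) ** m)
print(count_unconstrained(l, m))
```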

12
Modeling P(F | E) with alignment
13
Notation
  • E: the Eng sentence, E = e1 ... el
  • ei: the i-th Eng word.
  • F: the Fr sentence, F = f1 ... fm
  • fj: the j-th Fr word.
  • e0: the Eng NULL word.
  • f0: the Fr NULL word.
  • aj: the position of the Eng word that generates
    fj.

14
Word alignment
  • An alignment, a, is a function from Fr word
    positions to Eng word positions: a(j) = i means that
    fj is generated by ei.
  • The constraint: each Fr word is generated by
    exactly one Eng word (including e0).

15
Notation (cont)
  • l: Eng sent length
  • m: Fr sent length
  • i: Eng word position
  • j: Fr word position
  • e: an Eng word
  • f: a Fr word

16
Outline
  • General concepts
  • Source channel model
  • Word alignment
  • Notations
  • Model 1-2
  • Model 3-4

17
Model 1 and 2
18
Model 1 and 2
  • Modeling
  • Generative process
  • Decomposition
  • Formula and types of parameters
  • Training
  • Finding the best alignment
  • Decoding

19
Generative process
  • To generate F from E:
  • Pick a length m for F, with prob P(m | l).
  • Choose an alignment a, with prob P(a | E, m).
  • Generate the Fr sent given the Eng sent and the
    alignment, with prob P(F | E, a, m).
  • Another way to look at it:
  • Pick a length m for F, with prob P(m | l).
  • For j = 1 to m:
  • Pick an Eng word index aj, with prob P(aj | j, m,
    l).
  • Pick a Fr word fj according to the Eng word ei,
    where aj = i, with prob P(fj | ei).

20
Decomposition
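
The slide's formula did not survive extraction. A decomposition that matches the three-step generative process (length, alignment, words) would be the following, written here as a reconstruction rather than the slide's exact notation:

```latex
P(F \mid E) = \sum_{a} P(F, a \mid E),
\qquad
P(F, a \mid E) = P(m \mid E)\; P(a \mid E, m)\; P(F \mid E, a, m)
```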
21
Approximation
  • Fr sent length depends only on Eng sent length
  • Fr word depends only on the Eng word that
    generates it

22
Approximation (cont)
  • Estimating P(a | E, m)
  • Model 1: all alignments are equally likely.
  • Model 2: alignments have different probs.
  • Model 1 can be seen as a special case of Model 2,
    where P(aj | j, m, l) = 1/(l+1) for every aj.

23
Decomposition for Model 1
24
The magic (for Model 1)
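
The formula is missing from the transcript. The "magic" presumably refers to the standard exchange of sum and product from Brown et al. (1993), which turns a sum over (l+1)^m alignments into a product of m small sums. With f_j, e_i standing for the slides' fj, ei:

```latex
\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \; \prod_{j=1}^{m} t(f_j \mid e_{a_j})
  \;=\; \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
\qquad\Longrightarrow\qquad
P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```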
25
Final formula and parameters for Model 1
  • Two types of parameters
  • Length prob: P(m | l)
  • Translation prob: P(fj | ei), or t(fj | ei)
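
As a sanity check, the sum over all alignments of the product of translation probs can be compared with the product of per-position sums, which is what makes Model 1 cheap to evaluate. This is a sketch with made-up t values; `eng` already includes the NULL word at index 0, and the factor P(m | l) / (l+1)^m is left out because it is identical on both sides.

```python
from itertools import product
from math import isclose, prod

def sum_over_alignments(fr, eng, t):
    """Brute force: enumerate all (l+1)^m alignments."""
    return sum(
        prod(t[(f, eng[i])] for f, i in zip(fr, a))
        for a in product(range(len(eng)), repeat=len(fr))
    )

def product_of_sums(fr, eng, t):
    """Efficient form: one small sum per Fr position."""
    return prod(sum(t[(f, e)] for e in eng) for f in fr)

eng = ["NULL", "b", "c"]
fr = ["x", "y"]
# Hypothetical translation table t[(f, e)] = t(f | e).
t = {("x", "NULL"): 0.5, ("y", "NULL"): 0.5,
     ("x", "b"): 0.3, ("y", "b"): 0.7,
     ("x", "c"): 0.6, ("y", "c"): 0.4}

print(sum_over_alignments(fr, eng, t), product_of_sums(fr, eng, t))
```

The brute-force version touches 9 alignments here; the efficient version needs only 2 sums of 3 terms.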

26
Decomposition for Model 2
  • Same as Model 1 except that Model 2 does not
    assume all alignments are equally likely.

27
The magic for Model 2
28
Final formula and parameters for Model 2
  • Three types of parameters
  • Length prob: P(m | l)
  • Translation prob: t(fj | ei)
  • Distortion prob: d(i | j, m, l)
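
The formula itself is not in the transcript. Using the three parameter types listed for Model 2, the standard form (a reconstruction, in the slides' d notation) would be:

```latex
P(F \mid E) = P(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} d(i \mid j, m, l)\; t(f_j \mid e_i)
```

Setting d(i | j, m, l) = 1/(l+1) recovers the Model 1 formula.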

29
Summary of Modeling
Model 1
Model 2
  • Parameters
  • Length prob: P(m | l)
  • Translation prob: t(fj | ei)
  • Distortion prob (for Model 2): d(i | j, m, l)

30
Model 1 and 2
  • Modeling
  • Generative process
  • Decomposition
  • Formula and types of parameters
  • Training
  • Finding the best alignment
  • Decoding

31
Training
  • Mathematically motivated
  • Has an objective function to optimize
  • Uses several clever tricks
  • The resulting formulae
  • are intuitively expected
  • can be calculated efficiently
  • EM algorithm
  • Hill climbing: each iteration is guaranteed not to
    decrease the objective function.
  • It is not guaranteed to reach the global optimum.

32
Length prob P(j | i)
  • Let Ct(j, i) be the number of sentence pairs
    where the Fr length is j and the Eng length is i.
  • Length prob: P(j | i) = Ct(j, i) / Σ_j' Ct(j', i)
  • No need for iterations
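
Because sentence lengths are observed, the length prob is a plain relative frequency. This is a minimal sketch, assuming the estimate is the count of (Fr length, Eng length) pairs divided by the count of the Eng length; the toy corpus is hypothetical.

```python
from collections import Counter

def length_probs(corpus):
    """Relative-frequency estimate of P(m | l) from (Eng, Fr) sentence pairs."""
    pair_counts = Counter((len(fr), len(eng)) for eng, fr in corpus)
    eng_totals = Counter(len(eng) for eng, _ in corpus)
    return {(m, l): c / eng_totals[l] for (m, l), c in pair_counts.items()}

# Hypothetical corpus of (Eng words, Fr words) pairs.
corpus = [
    (["b", "c"], ["x", "y"]),
    (["b"], ["y"]),
    (["b", "c"], ["x", "y", "y"]),
]
print(length_probs(corpus))
```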

33
Estimating t(f | e): a naïve approach
  • A naïve approach
  • Count the times that f appears in F and e appears
    in E.
  • Count the times that e appears in E.
  • Divide the 1st number by the 2nd number.
  • Problem
  • It cannot distinguish true translations from pure
    coincidence.
  • Ex: t(el | white) vs. t(blanco | white)
  • Solution: count the times that f aligns to e.

34
Estimating t(f | e) in Model 1
  • When each sent pair has a unique word alignment
  • When each sent pair has several word alignments
    with prob
  • When there are no word alignments

35
When there is a single word alignment
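  • We can simply count.
  • Training data
  • Sent 1: Eng b c, Fr x y (x aligned to c, y aligned to b)
  • Sent 2: Eng b, Fr y (y aligned to b)
  • Prob
  • ct(x, b) = 0, ct(y, b) = 2, ct(x, c) = 1, ct(y, c) = 0
  • t(x | b) = 0, t(y | b) = 1.0, t(x | c) = 1.0, t(y | c) = 0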

36
When there are several word alignments
  • If a sent pair has several word alignments, use
    fractional counts.
  • Training data
  • Sent 1: Eng b c, Fr x y, with four alignments;
    P(a | E, F) = 0.3, 0.2, 0.4, 0.1
  • Sent 2: Eng b, Fr y, with one alignment;
    P(a | E, F) = 1.0
  • Prob
  • Ct(x, b) = 0.7, Ct(y, b) = 1.5, Ct(x, c) = 0.3,
    Ct(y, c) = 0.5
  • P(x | b) = 7/22, P(y | b) = 15/22, P(x | c) = 3/8, P(y | c) = 5/8
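
The fractional counts on this slide can be reproduced with a few lines of code. The alignment drawings are lost, so the assignment of the four probabilities to specific alignments below is an assumption, chosen to be consistent with the slide's counts.

```python
from collections import defaultdict
from fractions import Fraction

# (Eng words, Fr words, {alignment: P(a | E, F)});
# alignment[j] is the index of the Eng word that generates Fr word j.
data = [
    (("b", "c"), ("x", "y"), {
        (0, 0): Fraction(3, 10),  # x-b, y-b (assumed assignment)
        (1, 0): Fraction(2, 10),  # x-c, y-b
        (0, 1): Fraction(4, 10),  # x-b, y-c
        (1, 1): Fraction(1, 10),  # x-c, y-c
    }),
    (("b",), ("y",), {(0,): Fraction(1)}),
]

def fractional_counts(data):
    """Ct(f, e): alignment probs summed over every link between f and e."""
    counts = defaultdict(Fraction)
    for eng, fr, posteriors in data:
        for alignment, prob in posteriors.items():
            for j, i in enumerate(alignment):
                counts[(fr[j], eng[i])] += prob
    return counts

def normalize(counts):
    """t(f | e) = Ct(f, e) / sum over f' of Ct(f', e)."""
    totals = defaultdict(Fraction)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}

counts = fractional_counts(data)
t = normalize(counts)
print(dict(counts))
print(t)
```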

37
Fractional counts
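  • Let Ct(f, e) be the fractional count of the (f, e)
    pair in the training data, given alignment prob
    P.
  • $Ct(f, e) = \sum_{(E,F)} \sum_{a} P(a \mid E, F) \cdot count(f, e; a, E, F)$
  • P(a | E, F): the alignment prob.
  • count(f, e; a, E, F): the actual count of times e and f are linked in
    (E, F) by alignment a.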
38
When there are no word alignments
  • We could list all the alignments, and estimate
    P(a | E, F).
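
The slide's formula is missing. For Model 1, a reconstruction consistent with the worked example later in the deck is to score each alignment by the product of its translation probs and normalize over all alignments (the length and uniform alignment factors cancel):

```latex
P(a \mid E, F) = \frac{P(F, a \mid E)}{\sum_{a'} P(F, a' \mid E)}
             = \frac{\prod_{j=1}^{m} t(f_j \mid e_{a_j})}
                    {\sum_{a'} \prod_{j=1}^{m} t(f_j \mid e_{a'_j})}
```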

39
Formulae so far
⇒ New estimate for t(f | e)
40
The algorithm
  • Step 1: Start with an initial estimate of t(f | e), e.g.,
    a uniform distribution.
  • Step 2: Calculate P(a | F, E).
  • Step 3: Calculate Ct(f, e), and normalize to get t(f | e).
  • Repeat Steps 2-3 until the improvement is too
    small.

41
So far, we estimate t(f | e) by enumerating all
possible alignments
  • This process is very expensive, as the number of
    all possible alignments is (l+1)^m.

[Formula: Ct(f, e) sums, over sentence pairs (E, F) and alignments a, the
previous iteration's estimate of the alignment prob P(a | E, F) times the
actual count of times e and f are linked in (E, F) by alignment a.]
42
No need to enumerate all word alignments
  • Luckily, for Model 1, there is a way to calculate
    Ct(f, e) efficiently.
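
The efficient formula is not in the transcript. The standard Model 1 result from Brown et al. (1993), which also matches the "new formulae" example in the additional slides, avoids alignments entirely:

```latex
Ct(f, e) = \sum_{(E,F)} \frac{t(f \mid e)}{\sum_{i=0}^{l} t(f \mid e_i)}
           \cdot \#(f \text{ in } F) \cdot \#(e \text{ in } E)
```

Each sentence pair costs on the order of l·m operations instead of (l+1)^m.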

43
The algorithm
  • Step 1: Start with an initial estimate of t(f | e), e.g.,
    a uniform distribution.
  • Step 2: Calculate P(a | F, E).
  • Step 3: Calculate Ct(f, e), and normalize to get t(f | e).
  • Repeat Steps 2-3 until the improvement is too
    small.

44
Estimating t(f | e) in Model 2
  • Ct(f, e) is slightly different from the one in
    Model 1

45
Estimating d(i | j, m, l) in Model 2
  • Let Ct(i, j, m, l) be the fractional count that
    Fr position j is linked to the Eng position i.
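
The normalization step is not shown in the transcript. By analogy with t(f | e), the distortion prob would presumably be obtained by normalizing the fractional counts over Eng positions:

```latex
d(i \mid j, m, l) = \frac{Ct(i, j, m, l)}{\sum_{i'=0}^{l} Ct(i', j, m, l)}
```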

46
The algorithm
  • Step 1: Start with an initial estimate of t(f | e), e.g.,
    a uniform distribution.
  • Step 2: Calculate P(a | F, E).
  • Step 3: Calculate Ct(f, e), and normalize to get t(f | e).
  • Repeat Steps 2-3 until the improvement is too
    small.

47
Training Summary
  • EM algorithm
  • Hill climbing: each iteration is guaranteed not to
    decrease the objective function.
  • It is not guaranteed to reach the global optimum.
  • The resulting formulae
  • are intuitively expected
  • can be calculated efficiently

48
Model 1 and 2
  • Modeling
  • Generative process
  • Decomposition
  • Formula and types of parameters
  • Training
  • Finding the best alignment

49
The best alignment in Model 1-5
Given E and F, we are looking for the best
alignment a
50
The best alignment in Model 1
51
The best alignment in Model 2
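
The formulas for slides 50 and 51 are missing. Because Model 1 and Model 2 factor over Fr positions, the best alignment can be chosen one position at a time. The sketch below assumes that standard result; the tables `t` and `d` hold made-up values, and `eng[0]` is the NULL word.

```python
def best_alignment_model1(fr, eng, t):
    """aj = argmax over i of t(fj | ei)."""
    return [max(range(len(eng)), key=lambda i: t[(f, eng[i])]) for f in fr]

def best_alignment_model2(fr, eng, t, d):
    """aj = argmax over i of d(i | j, m, l) * t(fj | ei), with 1-based j."""
    m, l = len(fr), len(eng) - 1
    return [
        max(range(l + 1), key=lambda i: d[(i, j, m, l)] * t[(f, eng[i])])
        for j, f in enumerate(fr, start=1)
    ]

eng = ["NULL", "b", "c"]
fr = ["x", "y"]
# Hypothetical parameters.
t = {("x", "NULL"): 0.1, ("y", "NULL"): 0.1,
     ("x", "b"): 0.2, ("y", "b"): 0.8,
     ("x", "c"): 0.7, ("y", "c"): 0.3}
d = {(0, 1, 2, 2): 0.1, (1, 1, 2, 2): 0.8, (2, 1, 2, 2): 0.1,
     (0, 2, 2, 2): 0.1, (1, 2, 2, 2): 0.1, (2, 2, 2, 2): 0.8}

print(best_alignment_model1(fr, eng, t))
print(best_alignment_model2(fr, eng, t, d))
```

With these numbers the strongly monotone distortion table overrides the translation probs, so the two models choose different alignments.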
52
Summary of Model 1 and 2
  • Modeling
  • Pick the length of F with prob P(m | l).
  • For each position j:
  • Pick an English word position aj, with prob P(aj |
    j, m, l).
  • Pick a Fr word fj according to the Eng word ei,
    with prob t(fj | ei), where i = aj.
  • The resulting formula can be calculated
    efficiently.
  • Training: EM algorithm. The update can be done
    efficiently.
  • Finding the best alignment: can be done easily.

53
Limitations of Model 1 and 2
  • There could be some relations among the Fr words
    generated by the same Eng word (w.r.t. positions
    and fertility).
  • The relations are not captured by Model 1 and 2.
  • They are captured by Model 3 and 4.

54
Outline
  • General concepts
  • Source channel model
  • Word alignment
  • Notations
  • Model 1-2
  • Model 3-4

55
Model 3 and 4
56
Model 3 and 4
  • Modeling
  • Generative process
  • Decomposition and final formula
  • Types of parameters
  • Training
  • Finding the best alignment
  • Decoding

57
Generative process
  • For each Eng word ei, choose a fertility
  • For each ei, generate Fr words
  • Choose the position of each Fr word.

58
An example
Eng: NULL the cheapest nonstop flights
59
An example
Eng: NULL the cheapest nonstop flights
Fr: vols sans escale le moins cher
60
Decomposition
61
Approximations and types of parameters
Where N is the number of empty slots.
62
Approximations and types of parameters (cont)
63
Modeling summary
  • For each Eng word ei, choose a fertility,
    which depends only on ei.
  • For each ei, generate Fr words, which depend only
    on ei.
  • Choose the position of each Fr word.
  • Model 3: the position depends only on the
    position of the Eng word generating it.
  • Model 4: the position depends on more.

64
Training
  • Use EM, just like Model 1 and 2
  • Translation and distortion probabilities can be
    calculated efficiently, fertility probabilities
    cannot.
  • No efficient algorithms to find the best
    alignment.

65
Model 3 and 4
  • Modeling
  • Generative process
  • Decomposition and final formula
  • Types of parameters
  • Training
  • Finding the best alignment
  • Decoding

66
Model 1-4 modeling
67
Model 1-4 training
  • Similarities
  • Same objective function
  • Same algorithm: EM
  • Differences
  • Summation over all alignments can be done
    efficiently for Model 1-2, but not for Model 3-4.
  • Best alignment can be found efficiently for Model
    1-2, but not for Model 3-4.

68
Summary
  • General concepts
  • Source channel model: P(E) and P(F | E)
  • Notations
  • Word alignment: each Fr word comes from exactly
    one Eng word (including e0).
  • Model 1-2
  • Model 3-4

69
Additional slides
70
An example of Model 1 training
  • Training data
  • Sent 1: Eng b c, Fr x y
  • Sent 2: Eng b, Fr y
  • To reduce the number of alignments, assume that
    each Eng word generates exactly one Fr word ⇒ two
    possible alignments for Sent 1, and one for Sent 2.
  • Step 1: Initial t(f | e): t(x | b) = t(y | b) = 1/2,
    t(x | c) = t(y | c) = 1/2

71
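Step 2: calculating P(a | F, E)
  • a1: b-x, c-y (Sent 1); a2: b-y, c-x (Sent 1); a3: b-y (Sent 2)
  • Before normalization (Z is the normalizer for each sentence pair)
  • P(a1 | E1, F1) · Z = 1/2 × 1/2 = 1/4
  • P(a2 | E1, F1) · Z = 1/2 × 1/2 = 1/4
  • P(a3 | E2, F2) · Z = 1/2
  • After normalization
  • P(a1 | E1, F1) = 1/4 / (1/4 + 1/4) = 1/2
  • P(a2 | E1, F1) = 1/4 / 1/2 = 1/2
  • P(a3 | E2, F2) = 1/2 / 1/2 = 1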

72
Step 3: calculating t(f | e)
  • a1: b-x, c-y (Sent 1); a2: b-y, c-x (Sent 1); a3: b-y (Sent 2)
  • Collecting counts
  • Ct(x, b) = 1/2
  • Ct(y, b) = 1/2 + 1 = 3/2
  • Ct(x, c) = 1/2
  • Ct(y, c) = 1/2
  • After normalization
  • t(x | b) = 1/2 / (1/2 + 3/2) = 1/4, t(y | b) = 3/4
  • t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2

73
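Repeating step 2: calculating P(a | F, E)
  • a1: b-x, c-y (Sent 1); a2: b-y, c-x (Sent 1); a3: b-y (Sent 2)
  • Before normalization (Z is the normalizer for each sentence pair)
  • P(a1 | E1, F1) · Z = 1/4 × 1/2 = 1/8
  • P(a2 | E1, F1) · Z = 3/4 × 1/2 = 3/8
  • P(a3 | E2, F2) · Z = 3/4
  • After normalization
  • P(a1 | E1, F1) = 1/8 / (1/8 + 3/8) = 1/4
  • P(a2 | E1, F1) = 3/8 / 4/8 = 3/4
  • P(a3 | E2, F2) = 3/4 / 3/4 = 1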

74
Repeating step 3: calculating t(f | e)
  • a1: b-x, c-y (Sent 1); a2: b-y, c-x (Sent 1); a3: b-y (Sent 2)
  • Collecting counts
  • Ct(x, b) = 1/4
  • Ct(y, b) = 3/4 + 1 = 7/4
  • Ct(x, c) = 3/4
  • Ct(y, c) = 1/4
  • After normalization
  • t(x | b) = 1/4 / (1/4 + 7/4) = 1/8, t(y | b) = 7/8
  • t(x | c) = 3/4 / (3/4 + 1/4) = 3/4, t(y | c) = 1/4
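
The two iterations above can be reproduced in code. This sketch follows the example's simplifying assumption that each Eng word generates exactly one Fr word, so the alignments of a sentence pair are permutations; the function name `em_step` is illustrative and exact fractions are used to match the slide's numbers.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

def em_step(corpus, t):
    """One EM iteration, enumerating one-to-one alignments explicitly."""
    counts = defaultdict(Fraction)
    for eng, fr in corpus:
        # perm[j] is the index of the Eng word that generates fr[j].
        aligns = list(permutations(range(len(eng))))
        scores = []
        for perm in aligns:
            score = Fraction(1)
            for j, i in enumerate(perm):
                score *= t[(fr[j], eng[i])]
            scores.append(score)
        z = sum(scores)
        for perm, score in zip(aligns, scores):
            posterior = score / z              # P(a | E, F)
            for j, i in enumerate(perm):
                counts[(fr[j], eng[i])] += posterior
    totals = defaultdict(Fraction)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): counts[(f, e)] / totals[e] for (f, e) in t}

corpus = [(("b", "c"), ("x", "y")), (("b",), ("y",))]
t0 = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}
t1 = em_step(corpus, t0)
t2 = em_step(corpus, t1)
print(t1)
print(t2)
```

Running further iterations pushes t(y | b) and t(x | c) toward 1, which is the trend the next slide asks about.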

75
See the trend?
76
Calculating t(f | e) with the new formulae
  • E1: b c; E2: b
  • F1: x y; F2: y
  • Collecting counts
  • Ct(x, b) = 1/2 / (1/2 + 1/2) = 1/2
  • Ct(y, b) = 1/2 / (1/2 + 1/2) + 1/1 = 3/2
  • Ct(x, c) = 1/2 / (1/2 + 1/2) = 1/2
  • Ct(y, c) = 1/2 / (1/2 + 1/2) = 1/2
  • After normalization
  • t(x | b) = 1/2 / (1/2 + 3/2) = 1/4, t(y | b) = 3/4
  • t(x | c) = 1/2 / 1 = 1/2, t(y | c) = 1/2
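
The same first-iteration numbers fall out of an update that never enumerates alignments. This is a sketch of the usual efficient Model 1 update; as in the slide's example, no NULL word is used, and the function name is an assumption.

```python
from collections import defaultdict
from fractions import Fraction

def model1_em_step(corpus, t):
    """One EM iteration for Model 1 without enumerating alignments."""
    counts = defaultdict(Fraction)
    for eng, fr in corpus:
        for f in fr:
            # Normalizer: total prob of generating f from any Eng word.
            denom = sum(t[(f, e)] for e in eng)
            for e in eng:
                counts[(f, e)] += t[(f, e)] / denom
    totals = defaultdict(Fraction)
    for (f, e), c in counts.items():
        totals[e] += c
    return {pair: counts[pair] / totals[pair[1]] for pair in t}

corpus = [(("b", "c"), ("x", "y")), (("b",), ("y",))]
t0 = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}
t1 = model1_em_step(corpus, t0)
print(t1)
```

Later iterations need not match the restricted-alignment example exactly, since this version also allows both Fr words to align to the same Eng word.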

77
EM algorithm
  • EM: expectation maximization
  • In a model with hidden states (e.g., word
    alignment), how can we estimate model parameters?
  • EM does the following:
  • E-step: Take an initial model parameterization
    and calculate the expected values of the hidden
    data.
  • M-step: Use the expected values to maximize the
    likelihood of the training data.

78
Objective function
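
The slide's formula is missing. Given that the M-step maximizes the likelihood of the training data, the objective is presumably the (log-)likelihood of the parallel corpus, with θ standing for all model parameters (θ is notation introduced here, not from the slides):

```latex
\hat{\theta} = \arg\max_{\theta} \prod_{(E,F)} P_{\theta}(F \mid E)
             = \arg\max_{\theta} \sum_{(E,F)} \log \sum_{a} P_{\theta}(F, a \mid E)
```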