Title: A Phrase-Based, Joint Probability Model for Statistical Machine Translation
1 A Phrase-Based, Joint Probability Model for Statistical Machine Translation
- Daniel Marcu and William Wong (2002)
- Presented by Ping Yu
- 01/17/2006
2 Statistical Machine Translation
- A refresher
3 The Noisy Channel
- Encoder: E → F; decoder: F → E
- ê = argmax_e P(e | f) = argmax_e P(e) · P(f | e)
- Source model = language model P(e); channel model = translation model P(f | e)
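As a minimal sketch (toy, hand-made probability tables, not the paper's models), noisy-channel decoding picks the English candidate that maximizes P(e) · P(f | e):

import math

# Hypothetical score tables for illustration only
language_model = {"the house": 0.02, "house the": 0.001}          # P(e)
translation_model = {("la maison", "the house"): 0.3,             # P(f | e)
                     ("la maison", "house the"): 0.3}

def decode(f, candidates):
    """Return argmax over e of P(e) * P(f | e), computed in log space."""
    def score(e):
        return math.log(language_model[e]) + math.log(translation_model[(f, e)])
    return max(candidates, key=score)

print(decode("la maison", ["the house", "house the"]))  # -> "the house"

Here the translation model scores both candidates equally, so the language model breaks the tie in favor of the fluent word order.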
4 Language Model
- Bag translation: sentence → bag of words
- N-gram language model
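A minimal sketch of an n-gram language model (here a bigram model with maximum-likelihood counts; an illustration, not the presenter's code):

from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])                 # history counts
        bigrams.update(zip(words[:-1], words[1:]))  # bigram counts
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

lm = train_bigram_lm(["mary did not slap the green witch"])
print(lm[("the", "green")])  # 1.0 in this one-sentence toy corpus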
5 Translation Model
- Alignment: P(f, a | e)
- Fertility: depends solely on the English word
- English: Mary did not slap the green witch
- Spanish: Mary no daba una bofetada a la verde bruja
- Fertilities: Mary 1, did 0, not 1, slap 3, the 2, green 1, witch 1
- (Example from Kevin Knight's tutorial)
6 Conditional Probability: Word-Based Statistical MT
- Fertility: from a one-to-one mapping from e to f to a one-to-many mapping from e to f
- Conditional probability: given e, what is the alignment probability with f? i.e., p(f, a | e)
- Word-based MT: IBM Models 1-5
7 How About Many-to-Many Mapping?
8 Out of sight, out of mind → Invisible idiot
- Output from Systran
- French: Hors de la vue, hors de l'esprit. Back to English: Out of the sight, of the spirit.
- German: Aus dem Anblick des Geistes heraus. Translated back to English: From the sight of the spirit out.
- Italian: Dalla vista dello spirito fuori. Translated back to English: From the sight of the spirit outside.
- Portuguese: Da vista do espírito fora. Translated back to English: Of the sight of the spirit it are.
- Spanish: De la vista del alcohol está. Translated back to English: Of the Vista of the alcohol it is.
- From http://www.discourse.net/archives/2005/06/of_the_vista_of_the_alcohol_it_is.html
9 Lost in Translation
10 Solution
- Many-to-many mapping
- How?
- Word-based → phrase-based
11 Alignment between Multiple Phrases
- "Phrases" are not really linguistic phrases
- Phrases are defined differently in different models
- Most extracted phrases are based on word-based alignments
- Och and Ney (1999): alignment template model
- Melamed (2001): non-compositional compounds model
13 Promising Features
- Looks for phrases and alignments simultaneously for both source and target sentences
- Directly models phrase-based probabilities
- Not dependent on word-based probabilities
14 Phrase and Concept
- Phrase: a sequence of consecutive words
- Concept: a pair of aligned phrases
- A set of concepts C can be linearized into a sentence pair (E, F) if E and F can be obtained by permuting the phrases e_i and f_i that characterize all concepts c_i ∈ C. This property is denoted by the predicate L(E, F, C).
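A minimal sketch (my own illustration, not the paper's code) of the linearization predicate L(E, F, C): a concept set C can be linearized into the sentence pair (E, F) if some ordering of the English phrases concatenates to E and some ordering of the foreign phrases concatenates to F.

from itertools import permutations

def can_linearize(phrases, sentence):
    """True if some ordering of the phrases concatenates to the sentence."""
    return any(sum(p, ()) == sentence for p in permutations(phrases))

def L(E, F, C):
    e_phrases = [e for e, _ in C]
    f_phrases = [f for _, f in C]
    return can_linearize(e_phrases, E) and can_linearize(f_phrases, F)

# Toy example with hypothetical concepts:
E = ("the", "green", "witch")
F = ("la", "bruja", "verde")
C = [(("the",), ("la",)), (("green", "witch"), ("bruja", "verde"))]
print(L(E, F, C))  # True: phrase order may differ between E and F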
15 Two Models
- Model 1
- Joint probability distribution
- Phrases are equivalent translations
16 Model 2
- A position-based distortion joint probability model
- Probability of the alignment between two phrases
17 Probability of Generating a Sentence Pair
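A hedged sketch of the general form of the two models (my LaTeX reconstruction, assuming t is the concept/phrase t-distribution table and d the position-based distortion of Model 2; see Marcu and Wong (2002) for the exact formulation):

Model 1: $p(E, F) = \sum_{C : L(E,F,C)} \prod_{c_i \in C} t(\bar{e}_i, \bar{f}_i)$

Model 2: $p(E, F) = \sum_{C : L(E,F,C)} \prod_{c_i \in C} \Big[ t(\bar{e}_i, \bar{f}_i) \prod_{k} d\big(\mathrm{pos}(f_i^k), \mathrm{poscm}(\bar{e}_i)\big) \Big]$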
18 How?
- Sentences
- Phrases
- Concepts
19 Four Steps
- Phrase and concept determination
- Initialize the joint probability of concepts, i.e., the t-distribution table
- EM training on Viterbi alignments
  - Calculate the t-distribution table
  - Full iteration, then approximation of EM
  - Viterbi alignment
  - Smoothing
- Generate the conditional probability from the joint probability, needed in the decoder
20 Step 1: Phrase Determination
- All unigrams
- N-grams with frequency > 5
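A minimal sketch of this step under the slide's thresholds (a hypothetical helper, not the authors' code): keep every unigram plus any n-gram, up to 6 words, whose corpus frequency exceeds 5.

from collections import Counter

def phrase_candidates(sentences, max_len=6, min_freq=5):
    counts = Counter()
    for s in sentences:
        words = s.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    # All unigrams are kept; longer n-grams must occur more than min_freq times
    return {p for p, c in counts.items() if len(p) == 1 or c > min_freq}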
21 Step 2: Initialize the t-distribution Table
- Given a sentence E of l words, there are S(l, k) ways in which the l words can be partitioned into k non-empty concepts
22
- Likewise, there are S(m, k) ways for a sentence F of m words to be partitioned into k non-empty concepts
- The number of concepts k is between 1 and min(l, m)
- Total number of concept alignments between two sentences
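A minimal sketch of the counting behind this step: S(n, k) is (in the paper) the Stirling number of the second kind, the number of ways to split n words into k non-empty concepts. The snippet only computes S; the full t-table initialization in the paper combines these counts over all admissible k.

from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k non-empty sets."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 0
    # Recurrence: S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

l, m = 7, 9  # hypothetical sentence lengths |E| = l, |F| = m
for k in range(1, min(l, m) + 1):
    print(k, stirling2(l, k), stirling2(m, k))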
23 Probability of Two Concepts
24 What about Word Order?
- The equation doesn't take word order into consideration, even though phrases must consist of consecutive words
- The formula overestimates the numerator and denominator equally, so the approximation works well in practice
25 Step 3: EM Training on Viterbi Alignments
- After the initial t-table is built, EM can be used to improve the parameters
- However, it is infeasible to calculate expectations over all possible alignments
- So for the initial alignment, only the concepts with high t-probabilities are aligned
26 Implementation
- Greedy alignment: greedily produce an initial alignment
- Hill-climbing: examine the probability of neighboring concept configurations to reach a local maximum by performing the following operations
27
- Swap concepts: <e1, f1>, <e2, f2> → <e1, f2>, <e2, f1>
- Merge concepts: <e1, f1>, <e2, f2> → <e1 e2, f1 f2>
- Break concepts: <e1 e2, f1 f2> → <e1, f1>, <e2, f2>
- Move words across concepts: <e1 e2, f1>, <e3, f2> → <e1, f1>, <e2 e3, f2>
- From www.iccs.informatics.ed.ac.uk/osborne/msc-projects/oconnor.pdf
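A minimal sketch (my own illustration, not the authors' implementation) of part of this neighborhood search: an alignment is a list of concepts (e-phrase, f-phrase), neighbors come from swapping or merging concepts (break and move-words would be analogous), and hill-climbing keeps any neighbor that improves the score.

from itertools import combinations

def swap_neighbors(concepts):
    # <e1,f1>, <e2,f2> -> <e1,f2>, <e2,f1>
    for i, j in combinations(range(len(concepts)), 2):
        c = list(concepts)
        (e1, f1), (e2, f2) = c[i], c[j]
        c[i], c[j] = (e1, f2), (e2, f1)
        yield c

def merge_neighbors(concepts):
    # <e1,f1>, <e2,f2> -> <e1 e2, f1 f2>
    for i, j in combinations(range(len(concepts)), 2):
        (e1, f1), (e2, f2) = concepts[i], concepts[j]
        rest = [c for k, c in enumerate(concepts) if k not in (i, j)]
        yield rest + [(e1 + e2, f1 + f2)]

def hillclimb(concepts, score):
    current, best = concepts, score(concepts)
    improved = True
    while improved:
        improved = False
        for n in list(swap_neighbors(current)) + list(merge_neighbors(current)):
            s = score(n)
            if s > best:
                current, best, improved = n, s, True
    return current

# Toy usage with a hypothetical scoring function that prefers fewer concepts
concepts = [(("green",), ("verde",)), (("witch",), ("bruja",))]
print(hillclimb(concepts, score=lambda c: -len(c)))

In the real model the score would be the alignment's probability under the current t-distribution (and distortion) table rather than this toy function.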
29 Training Iteration
- The first iteration uses Model 1
- The remaining iterations use Model 2
30 Step 4: Derivation of the Conditional Probability Model
- P(f | e) = p(e, f) / p(e)
- Used in the decoder
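A minimal sketch of Step 4: turning a learned joint phrase table p(e, f) into the conditional p(f | e) = p(e, f) / p(e), where p(e) is the marginal over all f paired with e (toy numbers, not real model output).

from collections import defaultdict

joint = {("green witch", "bruja verde"): 0.03,
         ("green witch", "verde bruja"): 0.01}

marginal_e = defaultdict(float)
for (e, f), p in joint.items():
    marginal_e[e] += p          # p(e) = sum over f' of p(e, f')

conditional = {(f, e): p / marginal_e[e] for (e, f), p in joint.items()}
print(conditional[("bruja verde", "green witch")])  # ≈ 0.75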
31 Decoder
- Given a foreign sentence F, maximize the probability p(E, F)
- Hill-climb by modifying E and the alignment between E and F to maximize p(E) · P(F | E)
- P(E) is a trigram language model at the word level, not the phrase level
32 Evaluation
- Data: French-English Hansard corpus
- Compared with GIZA (IBM Model 4)
- Training: 100,000 sentence pairs
- Testing: 500 unseen sentences, uniformly distributed across lengths 6, 8, 10, 15, and 20
33 Results
34 Comparison of the Model from Koehn et al. (2003)
35 Limitations of the Model: Complexity Problems
- Phrases limited to at most 6 words
- Size of the t-table
- Large number of possible alignments
- Memory management
- Expensive operations such as swap, break, and merge during Viterbi training
36 Limitations of the Model: Non-consecutive Phrases
- English: not
- French: ne ... pas
- is not → ne est pas
- is not here → ne est pas ici
- Longer alignments? Sparsity problem
37 Complexity vs. Performance
- Marcu and Wong: n-grams of length < 6
- Koehn et al. (2003):
  - Allowing phrase lengths > 3 words
  - Complexity increases greatly, but with no significant improvement