1
A Phrase-Based, Joint Probability Model for
Statistical Machine Translation
  • Daniel Marcu and William Wong (2002)
  • Presented by Ping Yu
  • 01/17/2006

2
  • Statistical Machine Translation
  • a refresher

3
The Noisy Channel
  • Translate from f to e
  • [Figure: noisy-channel pipeline: encoder E → F,
    decoder F → E]
  • ê = argmax_e P(e|f) = argmax_e P(e) P(f|e)
  • P(e): source model (language model)
  • P(f|e): channel model (translation model)
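The argmax equation comes from Bayes' rule; a one-line derivation (standard material, not specific to these slides):

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
        = \arg\max_{e} P(e)\,P(f \mid e)
```

since P(f) does not depend on e.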
4
Language Model
  • Bag translation: sentence → bag of words
  • N-gram language model
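As an illustration of the n-gram idea (my own sketch, not from the slides; `sentences` is assumed to be a list of tokenized sentences):

```python
from collections import defaultdict

def train_trigram_lm(sentences):
    """Count trigrams and their bigram histories from tokenized sentences."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + list(sent) + ["</s>"]
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            trigram_counts[history + (tokens[i],)] += 1
            bigram_counts[history] += 1
    return trigram_counts, bigram_counts

def trigram_prob(word, history, trigram_counts, bigram_counts):
    """Maximum-likelihood P(word | history); a real system would smooth this."""
    if bigram_counts[history] == 0:
        return 0.0
    return trigram_counts[history + (word,)] / bigram_counts[history]
```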

5
Translation Model
  • Alignment: P(f, a|e)
  • Fertility: depends solely on the English word
  • English: Mary did not slap the green witch
  • Spanish: Mary no daba una bofetada a la bruja verde
  • Fertility: Mary 1, did 0, slap 3, the 2, green 1,
    witch 1
  • (Example from Kevin Knight's tutorial)

6
Conditional Probability: Word-based Statistical MT
  • Conditional probability: given e, what is the
    alignment probability with f? i.e., P(f, a|e)
  • Word-based MT: IBM Models 1-5
  • Fertility extends the one-to-one mapping from e to f
    to a one-to-many mapping from e to f

7
How About Many-to-many Mapping?
  • e.g., a three-word phrase "a b c" aligned to a
    two-word phrase "x y"

8
"Out of sight, out of mind" → "Invisible idiot"
  • Output from Systran
  • French: Hors de la vue, hors de l'esprit. Back to
    English: Out of the sight, of the spirit.
  • German: Aus dem Anblick des Geistes heraus.
    Translated back to English: From the sight of the
    spirit out.
  • Italian: Dalla vista dello spirito fuori. Translated
    back to English: From the sight of the spirit outside.
  • Portuguese: Da vista do espírito fora. Translated
    back to English: Of the sight of the spirit it are.
  • Spanish: De la vista del alcohol está. Translated
    back to English: Of the Vista of the alcohol it is.
  • From http://www.discourse.net/archives/2005/06/of_the_vista_of_the_alcohol_it_is.html

9
Lost in Translation
10
Solution
  • many-to-many mapping
  • How?
  • Word-based → Phrase-based



11
Alignment between Multiple Phrases
  • "Phrases" here are not necessarily linguistic phrases
  • Phrases are defined differently in different models
  • Most extracted phrases are based on word-based
    alignment
  • Och and Ney (1999): alignment template model
  • Melamed (2001): non-compositional compounds model

12
  • Marcu and Wong (2002)

13
Promising Features
  • Looks for phrases and alignments simultaneously
    in both source and target sentences
  • Directly modeling phrase-based probabilities
  • Not dependent on word-based probabilities

14
Phrase Concept
  • phrase: a sequence of consecutive words
  • concept: a pair of aligned phrases

A set of concepts C can be linearized into a
sentence pair (E, F) if E and F can be obtained
by permuting the phrases e_i and f_i that
characterize all concepts c_i ∈ C. This property
is denoted by the predicate L(E, F, C).
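A toy example of a concept set and its linearization (my own example, not from the paper):

```latex
E = \text{``the blue house''}, \quad F = \text{``la maison bleue''}, \quad
C = \{\, \langle \text{the},\ \text{la} \rangle,\ \langle \text{blue house},\ \text{maison bleue} \rangle \,\}
```

L(E, F, C) holds because concatenating the English phrases of the concepts yields E, and permuting and concatenating the French phrases yields F.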
15
Two Models
  • Model 1
  • Joint probability distribution
  • phrases are equivalent translations

16
Model 2
  • A position-based distortion joint probability
    model
  • Probability of the alignment between two phrases

17
Probability to Generate a Sentence Pair
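The equation on this slide did not survive the transcript. As a rough sketch of the joint model from Marcu and Wong (2002), with notation approximated: Model 1 sums the product of concept translation probabilities over all concept sets that linearize to the sentence pair, and Model 2 multiplies in a position-based distortion term per concept:

```latex
p(E, F) \;=\; \sum_{C \,:\, L(E, F, C)} \ \prod_{c_i \in C} t(e_i, f_i)
\qquad \text{(Model 1)}
```

```latex
p(E, F) \;=\; \sum_{C \,:\, L(E, F, C)} \ \prod_{c_i \in C}
              t(e_i, f_i)\, d(\mathrm{pos}(f_i), \mathrm{pos}(e_i))
\qquad \text{(Model 2, schematic)}
```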
18
How?
  • Sentences
  • Phrases
  • Concepts

19
Four Steps
  • Step 1: Phrase (concept) determination
  • Step 2: Initialize the joint probability of
    concepts, i.e., the t-distribution table
  • Step 3: EM training on Viterbi alignments
      • Calculate the t-distribution table
      • Full iteration, then approximation of EM
      • Viterbi alignment
      • Smoothing
  • Step 4: Generate the conditional probability from
    the joint probability, needed by the decoder

20
Step 1: Phrase Determination
  • All unigrams
  • n-grams with frequency > 5
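A minimal sketch of this selection rule (the maximum phrase length of 6 and the representation of sentences as token lists are assumptions for illustration):

```python
from collections import Counter

def candidate_phrases(sentences, max_len=6, min_count=5):
    """All unigrams, plus n-grams (n >= 2) seen more than min_count times."""
    counts = Counter()
    for sent in sentences:                      # sent: list of word tokens
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return {p for p, c in counts.items() if len(p) == 1 or c > min_count}
```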

21
Step 2: Initialize the t-distribution Table
  • Given a sentence E of l words, there are S(l, k)
    ways in which the l words can be partitioned into
    k non-empty concepts

22
  • S(m, k) ways in which a sentence F of m words can be
    partitioned into k non-empty concepts
  • The number of concepts k is between 1 and min(l, m)
  • Total number of concept alignments between the two
    sentences:
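S(n, k) denotes the Stirling number of the second kind. As a sketch of the count the bullets describe (the k! factor, which pairs up the k English concepts with the k foreign concepts, is my reading of the construction):

```latex
S(n, k) = k\,S(n-1, k) + S(n-1, k-1), \qquad
\#\text{alignments}(E, F) \;=\; \sum_{k=1}^{\min(l, m)} k!\; S(l, k)\, S(m, k)
```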

23
Probability of Two Concepts
24
How About the Word Order?
  • The equation doesn't take word order into
    consideration; phrases must consist of consecutive
    words
  • The formula overestimates the numerator and
    denominator equally, so the approximation works well
    in practice

25
Step 3: EM Training on Viterbi Alignments
  • After the initial t-table is built, EM can be
    used to improve the parameters
  • However, it is computationally infeasible to compute
    expectations over all possible alignments
  • So for the initial alignment, only the concepts
    with high t-probabilities are aligned

26
Implementation
  • Greedy alignment: greedily produce an initial
    alignment
  • Hill climbing: examine the probability of
    neighboring alignments to reach a local maximum by
    performing the following operations

27
  • Swap concepts: <e1, f1>, <e2, f2> → <e1, f2>, <e2, f1>
  • Merge concepts: <e1, f1>, <e2, f2> → <e1 e2, f1 f2>
  • Break concepts: <e1 e2, f1 f2> → <e1, f1>, <e2, f2>
  • Move words across concepts:
    <e1 e2, f1>, <e3, f2> → <e1, f1>, <e2 e3, f2>
    (a toy code sketch of these operations follows)
  • From www.iccs.informatics.ed.ac.uk/osborne/msc-projects/oconnor.pdf
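A toy sketch of these neighborhood operations, assuming a concept is represented as a pair of word tuples (e_phrase, f_phrase); the names and the representation are illustrative, not the authors' implementation:

```python
def swap(c1, c2):
    """<e1, f1>, <e2, f2> -> <e1, f2>, <e2, f1>"""
    (e1, f1), (e2, f2) = c1, c2
    return (e1, f2), (e2, f1)

def merge(c1, c2):
    """<e1, f1>, <e2, f2> -> <e1 e2, f1 f2>"""
    (e1, f1), (e2, f2) = c1, c2
    return (e1 + e2, f1 + f2)

def break_concept(c, e_split, f_split):
    """<e1 e2, f1 f2> -> <e1, f1>, <e2, f2> (split points chosen by the search)"""
    e, f = c
    return (e[:e_split], f[:f_split]), (e[e_split:], f[f_split:])

def move_word(c1, c2):
    """<e1 e2, f1>, <e3, f2> -> <e1, f1>, <e2 e3, f2> (move last English word right)"""
    (e12, f1), (e3, f2) = c1, c2
    return (e12[:-1], f1), ((e12[-1],) + e3, f2)
```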

28
  • Viterbi Search
  • Smoothing

29
Training Iteration
  • First iteration uses Model 1
  • Remaining iterations use Model 2

30
Step 4: Derivation of the Conditional Probability
Model
  • P(f|e) = p(e, f) / p(e)
  • Used in the decoder model
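Spelled out (the marginalization over f is my addition for clarity):

```latex
P(f \mid e) \;=\; \frac{p(e, f)}{p(e)}, \qquad p(e) \;=\; \sum_{f'} p(e, f')
```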

31
Decoder
  • Given a foreign sentence F, maximize the
    probability p(E, F)
  • Hill-climb by modifying E and the alignment
    between E and F to maximize P(E) P(F|E)
  • P(E) is a trigram-based language model at the word
    level instead of the phrase level
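A minimal sketch of such a hill-climbing loop; `neighbors`, `lm_logprob`, and `tm_logprob` are hypothetical helpers standing in for the edit operations, the word-level trigram language model, and the phrase translation model:

```python
def decode(f_sentence, initial_e, neighbors, lm_logprob, tm_logprob):
    """Greedy hill climbing over English hypotheses E: keep the best-scoring
    neighbor until no change to E (or its alignment) improves
    log P(E) + log P(F|E)."""
    current = initial_e
    current_score = lm_logprob(current) + tm_logprob(f_sentence, current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current, f_sentence):
            score = lm_logprob(candidate) + tm_logprob(f_sentence, candidate)
            if score > current_score:
                current, current_score = candidate, score
                improved = True
    return current
```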

32
Evaluation
  • Data: French-English Hansard corpus
  • Compared with GIZA (IBM Model 4)
  • Training: 100,000 sentence pairs
  • Testing: 500 unseen sentences, uniformly
    distributed across lengths 6, 8, 10, 15, and 20

33
Results
34
Comparison with the Model from Koehn et al. (2003)
35
Limitations of the model: complexity problems
  • Phrases up to 6 words
  • Size of t-table
  • Large number of possible alignments
  • Memory management
  • Expensive operations such as swap, break, merge
    during Viterbi training

36
Limitations of the model: non-consecutive phrases
  • English: not
  • French: ne ... pas
  • is not → ne est pas
  • is not here → ne est pas ici
  • Longer alignments? Data sparseness problem

37
Complexity vs. Performance
  • Marcu and Wong: n-grams < 6
  • Koehn et al. (2003):
      • Allow phrase lengths > 3 words
      • Complexity increases greatly, but with no
        significant improvement

38
  • Questions?