Machine Translation: Discriminative Word Alignment

Provided by: Vog55
Learn more at: http://www.cs.cmu.edu

1
Machine Translation: Discriminative Word Alignment
Stephan Vogel Spring Semester 2011
2
Generative Alignment Models
  • Generative word alignment models: P(f, a | e)
  • Alignment a as hidden variable
  • Actual word alignment is not observed
  • Sum over all alignments
  • Well-known: IBM Models 1-5, HMM, ITG
  • Model lexical association, distortion, fertility
  • It is difficult to incorporate additional
    information
  • POS of words (used in distortion model, not as
    direct link features)
  • Manual dictionary
  • Syntax information

3
Discriminative Word Alignment
  • Model alignment directly: p(a | f, e)
  • Find alignment that maximizes p(a | f, e)
  • Well-suited framework: maximum entropy
  • Set of feature functions h_m(a, f, e), m = 1, ..., M
  • Set of model parameters (feature weights) c_m,
    m = 1, ..., M
  • Decision rule
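
The formula for the decision rule did not survive in this transcript. For a log-linear (maximum entropy) model over the feature functions h_m and weights c_m defined above, the standard form would be the following. This is a reconstruction from the definitions on this slide, not the original slide content; a' is introduced here only as the summation variable of the normalizer.

```latex
p(a \mid f, e) =
  \frac{\exp\left(\sum_{m=1}^{M} c_m \, h_m(a, f, e)\right)}
       {\sum_{a'} \exp\left(\sum_{m=1}^{M} c_m \, h_m(a', f, e)\right)}
\qquad
\hat{a} = \arg\max_{a} \sum_{m=1}^{M} c_m \, h_m(a, f, e)
```

The normalizer does not depend on a, so it can be dropped when searching for the best alignment.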

4
Integrating Additional Dependencies
  • Log-linear model allows integration of additional
    dependencies, which contain additional
    information
  • POS
  • Parse trees
  • Add additional variable to capture these
    dependencies
  • New decision rule
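
The new decision rule is also missing from the transcript. Assuming the additional variable is written v (a hypothetical symbol standing for, e.g., POS tags or parse trees) and is simply passed to every feature function, the rule would plausibly take this form:

```latex
\hat{a} = \arg\max_{a} \sum_{m=1}^{M} c_m \, h_m(a, f, e, v)
```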

5
Tasks
  • Modeling: design feature functions which capture
    cross-lingual divergences
  • Search: find alignment with highest probability
  • Training: find optimal feature weights
  • Minimize alignment errors given some
    gold-standard alignments (Notice: alignments are
    no longer hidden!)
  • Supervised training, i.e. we evaluate against
    gold standard
  • Notice: feature functions may result from some
    training procedure themselves
  • E.g. use statistical dictionary resulting from
    IBM-n alignment, trained on large corpus
  • Here: an additional training step, on a small
    (hand-aligned) corpus (similar to MERT for the
    decoder)

6
2005: Year of DWA
  • Yang Liu, Qun Liu, and Shouxun Lin. 2005.
    Log-linear Models for Word Alignment.
  • Abraham Ittycheriah and Salim Roukos. 2005.
    A Maximum Entropy Word Aligner for Arabic-English
    Machine Translation.
  • Ben Taskar, Simon Lacoste-Julien, and Dan Klein.
    2005. A Discriminative Matching Approach to Word
    Alignment.
  • Robert C. Moore. 2005. A Discriminative Framework
    for Bilingual Word Alignment.
  • Necip Fazil Ayan, Bonnie J. Dorr, and Christof
    Monz. 2005. NeurAlign: Combining Word Alignments
    Using Neural Networks.

7
Yang Liu et al. 2005
  • Start out with features used in generative
    alignment
  • Lexicons, e.g. IBM1
  • Use both directions, p(f_j | e_i) and p(e_i | f_j)
    => symmetrical alignment model
  • And/or symmetric model
  • Fertility model: p(φ_i | e_i)

8
More Features
  • Cross count: number of crossings in alignment
  • Neighbor count: count the number of links in the
    immediate neighborhood
  • Exact match: count number of src/tgt pairs,
    where src = tgt
  • Linked word count: total number of links (to
    influence density)
  • Link types: count how many 1-1, 1-m, m-1, n-m
    alignments
  • Sibling distance: if a word is aligned to multiple
    words, add the distance between these aligned
    words
  • Link co-occurrence count: given multiple
    alignments (e.g. Viterbi alignments from IBM
    models), count how often links co-occur
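
A few of these counts can be sketched in Python. This is a minimal illustration, assuming an alignment is a collection of (j, i) position pairs, that "immediate neighborhood" means the eight surrounding cells of the alignment matrix, and that exact match is counted over linked pairs only; the function names are hypothetical and not taken from Liu et al.

```python
def cross_count(links):
    """Number of link pairs that cross each other."""
    links = sorted(links)
    return sum(
        1
        for a in range(len(links))
        for b in range(a + 1, len(links))
        # Two links cross if their source and target orders disagree.
        if (links[a][0] - links[b][0]) * (links[a][1] - links[b][1]) < 0
    )


def neighbor_count(links):
    """Sum over links of the number of links in the 8 adjacent matrix cells."""
    link_set = set(links)
    total = 0
    for j, i in link_set:
        for dj in (-1, 0, 1):
            for di in (-1, 0, 1):
                if (dj, di) != (0, 0) and (j + dj, i + di) in link_set:
                    total += 1
    return total


def exact_match_count(links, src, tgt):
    """Number of linked pairs whose source and target words are identical."""
    return sum(1 for j, i in links if src[j] == tgt[i])


def linked_word_count(links):
    """Total number of links (influences alignment density)."""
    return len(set(links))


src = ["IBM", "kauft", "Lotus", "."]
tgt = ["IBM", "buys", "Lotus", "."]
links = [(0, 0), (1, 2), (2, 1), (3, 3)]  # deliberately contains one crossing
print(cross_count(links), neighbor_count(links),
      exact_match_count(links, src, tgt), linked_word_count(links))
```

Each function returns one feature value h_m(a, f, e) that can be weighted in the log-linear model.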

9
Search
  • Greedy search based on gain by adding a link
  • For each of the features the gain can be
    calculated
  • E.g. IBM1
  • Algorithm:
      Start with empty alignment
      Loop until no additional gain:
        Loop over all (j,i) not in set:
          if gain(j,i) > best_gain then store as (j,i)
        Set link(j,i)
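
The greedy procedure can be sketched as follows. This is a minimal version under stated assumptions: `score` is a hypothetical callable that returns the model score of a whole alignment, and the gain of a link is computed by rescoring rather than incrementally per feature as the slide suggests. The toy scoring function is made up for illustration.

```python
def greedy_align(src_len, tgt_len, score):
    """Repeatedly add the single link with the largest positive gain."""
    alignment = set()
    current = score(alignment)
    while True:
        best_gain, best_link = 0.0, None
        for j in range(src_len):
            for i in range(tgt_len):
                if (j, i) in alignment:
                    continue
                gain = score(alignment | {(j, i)}) - current
                if gain > best_gain:
                    best_gain, best_link = gain, (j, i)
        if best_link is None:  # no link improves the score any more
            break
        alignment.add(best_link)
        current += best_gain
    return alignment


# Toy score (assumption): sum of link weights minus a penalty of 1.0
# for every link beyond the first on any source or target word.
WEIGHTS = {(0, 0): 2.0, (1, 1): 1.5, (0, 1): 0.5, (1, 0): -1.0}


def toy_score(alignment):
    total = sum(WEIGHTS[link] for link in alignment)
    for axis in (0, 1):
        counts = {}
        for link in alignment:
            counts[link[axis]] = counts.get(link[axis], 0) + 1
        total -= sum(n - 1 for n in counts.values())
    return total


result = greedy_align(2, 2, toy_score)
print(sorted(result))
```

Because links are only ever added, the search can get stuck in a local optimum; that is the price of avoiding a full search over all alignments.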

10
Moore 2005
  • Log-Likelihood-based model
  • Measure word association strength
  • Values can get large
  • Conditional-Link-Probability-based
  • Estimated probability of two words being linked
  • Used simpler alignment model to establish links
  • Add simple smoothing
  • Additional features: one-to-one, one-to-many,
    non-monotonicity
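
The two association measures can be illustrated as below. This is a sketch, not Moore's implementation: the log-likelihood ratio is computed from a 2x2 co-occurrence table over sentence pairs, and the conditional link probability uses a simple absolute discount as the smoothing step, with the discount value 0.5 chosen here only as an example.

```python
import math


def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association score.

    c_fe: sentence pairs containing both f and e
    c_f, c_e: sentence pairs containing f (resp. e)
    n: total number of sentence pairs
    """
    cells = [
        (c_fe, c_f, c_e),                          # f present, e present
        (c_f - c_fe, c_f, n - c_e),                # f present, e absent
        (c_e - c_fe, n - c_f, c_e),                # f absent, e present
        (n - c_f - c_e + c_fe, n - c_f, n - c_e),  # both absent
    ]
    total = 0.0
    for joint, row, col in cells:
        if joint > 0:
            # joint * log( p(f? | e?) / p(f?) )
            total += joint * math.log(joint * n / (row * col))
    return total


def conditional_link_prob(links, cooc, discount=0.5):
    """Discounted estimate of the probability that f and e are linked,
    given how often they were linked by a simpler alignment model."""
    return (links - discount) / cooc


print(llr(25, 50, 50, 100))    # independent words: score is 0
print(llr(50, 50, 50, 100))    # always co-occurring: large score
print(conditional_link_prob(3, 4))
```

The second call shows why the slide notes that values can get large: the score grows with the corpus size, unlike a probability.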

11
Training
  • Finding the optimal alignment is non-trivial
  • Adding a link can affect the non-monotonicity
    and one-to-many features
  • Dynamic programming does not work
  • Beam search could be used
  • Requires pruning
  • Parameter optimization
  • Modified version of averaged perceptron learning
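
For orientation, here is a sketch of the plain averaged perceptron, not Moore's modified version, whose details are not given on the slide. The `features` and `decode` callables and the toy data are assumptions made for this example: each input is a dictionary from candidate alignment names to feature vectors, and decoding picks the highest-scoring candidate.

```python
def averaged_perceptron(examples, features, decode, n_feats, epochs=5, lr=1.0):
    """Return feature weights averaged over all training steps."""
    w = [0.0] * n_feats
    w_sum = [0.0] * n_feats
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            pred = decode(x, w)
            if pred != gold:
                # Move weights towards the gold features, away from the prediction.
                f_gold, f_pred = features(x, gold), features(x, pred)
                for k in range(n_feats):
                    w[k] += lr * (f_gold[k] - f_pred[k])
            for k in range(n_feats):
                w_sum[k] += w[k]
            steps += 1
    return [s / steps for s in w_sum]


def features(x, y):
    return x[y]


def decode(x, w):
    return max(x, key=lambda y: sum(wk * fk for wk, fk in zip(w, x[y])))


# One toy example with two candidate alignments "a" (gold) and "b".
examples = [({"b": [0.0, 1.0], "a": [1.0, 0.0]}, "a")]
weights = averaged_perceptron(examples, features, decode, n_feats=2)
print(weights, decode(examples[0][0], weights))
```

In a real aligner, `decode` would be the (beam) search over alignments, which is why training cost is dominated by search.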

12
Modeling Alignment with CRF
  • CRF is an undirected graphical model
  • Each vertex (node) represents a random variable
    whose distribution is to be inferred
  • Each edge represents a dependency between two
    random variables
  • The distribution of each discrete random variable
    Y in the graph is conditioned on an input
    sequence X.
  • Cliques: sets of fully connected nodes in the graph
  • In our case:
  • Features derived from source and target words are
    the input sequence X
  • Alignment links are the random variables Y
  • Different ways to model alignment:
  • Blunsom & Cohn (2006): many-to-one word
    alignments, where each source word is aligned
    with zero or one target words (-> asymmetric)
  • Niehues & Vogel (2008): model not a sequence, but
    the entire alignment matrix (-> symmetric)

13
Modeling Alignment Matrix
  • Random variables y_ji for all possible alignment
    links
  • 2 values, 0/1: word in position j is not
    linked / linked to word in position i
  • Represented as nodes in a graph

14
Modeling Alignment Matrix
  • Factored nodes x representing features
    (observables)
  • Linked to random variables
  • Define potential for each y_ji

15
Probability of Alignment
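
The formula on this slide is missing from the transcript. The general form of a conditional random field over cliques, which is what the preceding slides describe, is given below as a reconstruction. The symbols Z(x), Φ_c, θ_c and f_c are notation assumed here, not taken from the slide.

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{c} \Phi_c(y_c, x),
\qquad
\Phi_c(y_c, x) = \exp\left(\theta_c \cdot f_c(y_c, x)\right),
\qquad
Z(x) = \sum_{y'} \prod_{c} \Phi_c(y'_c, x)
```

Here y is the full assignment of all link variables y_ji, and each factored node c contributes one potential over the variables it is connected to.
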
16
Features
  • Local features, e.g. lexical, POS, ...
  • Fertility features
  • First-order features: capturing relation between
    links
  • Phrase features: interaction between word and
    phrase alignment

17
Local Features
  • Local information about link probability
  • Features derived from positions j and i only
  • Factored node connected to only one random
    variable
  • Features
  • Lexical probabilities, also normalized to (f,e)
  • Word identity (e.g. for numbers, names)
  • Word similarity (e.g. cognates)
  • Relative position distance
  • Link indicator feature: is (j,i) linked in the
    Viterbi alignment from a generative alignment model?
  • POS: indicator feature for every src/tgt POS pair
  • High-frequency words: indicator feature for
    every src/tgt word pair among the most frequent words
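
As an illustration, some of these local features for a single candidate link (j, i) could be computed as below. This is a sketch under assumptions: `lex_prob` is a hypothetical dictionary of lexical probabilities keyed by word pair, `viterbi_links` is a set of links from a generative aligner, and relative position distance is taken to be the absolute difference of the normalized positions.

```python
def local_features(j, i, src, tgt, lex_prob, viterbi_links):
    """Features that depend only on source position j and target position i."""
    return {
        "lex": lex_prob.get((src[j], tgt[i]), 0.0),
        "identity": 1.0 if src[j] == tgt[i] else 0.0,
        "rel_pos_dist": abs(j / len(src) - i / len(tgt)),
        "viterbi_link": 1.0 if (j, i) in viterbi_links else 0.0,
    }


src = ["la", "casa", "2011"]
tgt = ["the", "house", "2011"]
lex_prob = {("casa", "house"): 0.8}
viterbi_links = {(1, 1)}

print(local_features(1, 1, src, tgt, lex_prob, viterbi_links))
print(local_features(2, 2, src, tgt, lex_prob, viterbi_links))
```

Each feature dictionary belongs to one factored node that is connected to exactly one random variable y_ji.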

18
Fertility Features
  • Model word fertility, src and tgt side
  • Link to all nodes in row/column
  • Constraint: model fertility only up to a maximum
    fertility N
  • Indicator features: one for each fertility n < N,
    one for all fertilities n > N
  • Alternative: use fertility probabilities from
    IBM4 training
  • Now different for different words
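
The fertility indicator features can be sketched as follows. This is a minimal example which assumes that the alignment matrix is a list of 0/1 rows indexed as matrix[j][i], and that all fertilities at or above the maximum N share one indicator; the exact bucket boundary is an assumption, as are the feature names.

```python
from collections import Counter


def fertility_features(matrix, max_fert=3):
    """Count fertility indicator features for source rows and target columns."""
    feats = Counter()
    row_fert = [sum(row) for row in matrix]        # links per source word
    col_fert = [sum(col) for col in zip(*matrix)]  # links per target word
    for side, ferts in (("src", row_fert), ("tgt", col_fert)):
        for n in ferts:
            # One indicator per fertility below the cap, one shared bucket above.
            bucket = str(n) if n < max_fert else f"{max_fert}+"
            feats[f"{side}_fert={bucket}"] += 1
    return feats


matrix = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
feats = fertility_features(matrix, max_fert=2)
print(dict(feats))
```

Because a fertility feature looks at a whole row or column, its factored node is connected to all link variables in that row or column.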

19
First Order Features
  • Links depend on links of neighboring words
  • Always links 2 nodes
  • Different features for different directions
  • (1,1), (1,2), (2,1), (1,0), ...
  • Captures distortions, similar to HMM and IBM4
    alignment
  • Indicator features, active if both links are set
  • Also POS first-order feature: indicator feature for
    link(j,i) and (POS_j, POS_i) and link(j+k, i+l)

20
Inference Finding the Best Alignment
  • Word alignment corresponds to assignment of
    random variables
  • => Find the most probable variable assignment
  • Problem
  • Complex model structure: many loops
  • No exact inference possible
  • Solution
  • Belief propagation algorithm
  • Inference by message passing
  • Runtime exponential in number of connected nodes

21
Belief Propagation
  • Messages are sent from random variable nodes to
    factored nodes, and also in the opposite
    direction
  • Start with some initial values, e.g. uniform
  • In each iteration
  • Calculate messages sent from hidden node (j,i)
    to factored node c
  • Calculate messages sent from factored node c
    to hidden node (j,i)

22
Getting the Probability
  • After several iterations, the belief value is
    calculated from messages sent to hidden nodes
  • Belief value can be interpreted as posterior
    probability
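
The message passing and belief computation can be sketched with a small sum-product implementation for binary variables. This is a generic illustration, not the aligner's code: the factor representation (a scope tuple plus a table of potentials) and the function name are assumptions. On a graph without loops, as in the toy example, the beliefs are exact marginals; on loopy graphs they are approximations.

```python
import itertools


def belief_propagation(n_vars, factors, iterations=20):
    """Sum-product BP. factors: list of (scope, table); scope is a tuple of
    variable ids, table maps 0/1 assignment tuples to positive potentials."""
    pairs = [(c, v) for c, (scope, _) in enumerate(factors) for v in scope]
    v2f = {(v, c): [1.0, 1.0] for c, v in pairs}  # uniform initial messages
    f2v = {(c, v): [1.0, 1.0] for c, v in pairs}
    for _ in range(iterations):
        # Variable -> factor: product of messages from all other factors.
        for v, c in v2f:
            msg = [1.0, 1.0]
            for (c2, v2), m in f2v.items():
                if v2 == v and c2 != c:
                    msg = [msg[0] * m[0], msg[1] * m[1]]
            z = msg[0] + msg[1]
            v2f[(v, c)] = [msg[0] / z, msg[1] / z]
        # Factor -> variable: sum out all other variables of the factor.
        for c, v in f2v:
            scope, table = factors[c]
            msg = [0.0, 0.0]
            for assign in itertools.product((0, 1), repeat=len(scope)):
                p = table[assign]
                for u, val in zip(scope, assign):
                    if u != v:
                        p *= v2f[(u, c)][val]
                msg[assign[scope.index(v)]] += p
            z = msg[0] + msg[1]
            f2v[(c, v)] = [msg[0] / z, msg[1] / z]
    # Belief: normalized product of all messages sent to the variable.
    beliefs = []
    for v in range(n_vars):
        b = [1.0, 1.0]
        for (c, v2), m in f2v.items():
            if v2 == v:
                b = [b[0] * m[0], b[1] * m[1]]
        z = b[0] + b[1]
        beliefs.append([b[0] / z, b[1] / z])
    return beliefs


factors = [
    ((0,), {(0,): 1.0, (1,): 3.0}),                               # local factor
    ((1,), {(0,): 1.0, (1,): 1.0}),                               # local factor
    ((0, 1), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),  # pairwise
]
beliefs = belief_propagation(2, factors)
print(beliefs)
```

The inner loop enumerates every assignment of a factor's scope, which is where the runtime exponential in the number of connected nodes comes from.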

23
Training
  • Maximize log-likelihood of the correct alignment
  • Use gradient descent to find optimum
  • Train towards minimum alignment error
  • Need smoothed version of AER
  • Express AER in terms of link indicator functions
  • Use sigmoid of link probability
  • Can use 2-step approach
  • 1. Optimize towards ML
  • 2. Optimize towards AER
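
To make the smoothing idea concrete, here is a sketch of the standard alignment error rate next to a smoothed variant in which each hard link indicator is replaced by a sigmoid of the link probability. The steepness and threshold parameters are assumptions for this example; the slide does not give the exact form used.

```python
import math


def aer(links, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    a, s = set(links), set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def smoothed_aer(link_probs, sure, possible, steepness=10.0, threshold=0.5):
    """Differentiable AER: soft link indicators from link probabilities."""
    s = set(sure)
    p = set(possible) | s
    soft = {
        link: 1.0 / (1.0 + math.exp(-steepness * (prob - threshold)))
        for link, prob in link_probs.items()
    }
    a_size = sum(soft.values())
    a_and_s = sum(v for link, v in soft.items() if link in s)
    a_and_p = sum(v for link, v in soft.items() if link in p)
    return 1.0 - (a_and_s + a_and_p) / (a_size + len(s))


sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
print(aer({(0, 0), (1, 2)}, sure, possible))

probs = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.0, (1, 0): 0.0}
print(smoothed_aer(probs, sure, possible, steepness=50.0))
```

With a steep sigmoid and confident probabilities the smoothed value approaches the hard AER, while remaining differentiable in the link probabilities so that gradient-based training applies.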

24
Some Results: Spanish-English
  • Features
  • IBM1 and IBM4 lexicons
  • Fertilities
  • Link indicator feature
  • POS features
  • Phrase features
  • Impact on translation quality (BLEU scores)

              Dev     Eval
  Baseline    40.04   47.73
  DWA         41.62   48.13

25
Summary
  • In the last 5 years: new efforts in word alignment
  • Discriminative word alignment
  • Integrate many features
  • Need small amount of hand aligned data to tune
    (train) feature weights
  • Different variants
  • Log-linear modeling
  • Conditional random fields: sequence and alignment
    matrix
  • Significant improvements in word alignment error
    rate
  • Not always improvements in translation quality
  • Different density of alignment -> different
    phrase table size
  • Need to adjust phrase extraction algorithms?