1
Machine Translation: Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
2
Generative Alignment Models
  • Generative word alignment models P(f, a | e)
  • Alignment a as hidden variable
  • Actual word alignment is not observed
  • Sum over all alignments
  • Well-known: IBM 1-5 models, HMM, ITG
  • Model lexical association, distortion, fertility
  • It is difficult to incorporate additional
    information
  • POS of words (used in distortion model, not as
    direct link features)
  • Manual dictionary
  • Syntax information

3
Discriminative Word Alignment
  • Model the alignment directly: p(a | f, e)
  • Find the alignment that maximizes p(a | f, e)
  • Well-suited framework: maximum entropy
  • Set of feature functions h_m(a, f, e), m = 1, ..., M
  • Set of model parameters (feature weights) c_m,
    m = 1, ..., M
  • Decision rule (see below)
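  In its standard log-linear (maximum entropy) form, with the feature
  weights c_m from above, the model and decision rule are:

    p(a \mid f, e) = \frac{\exp\big(\sum_{m=1}^{M} c_m\, h_m(a, f, e)\big)}
                          {\sum_{a'} \exp\big(\sum_{m=1}^{M} c_m\, h_m(a', f, e)\big)}

    \hat{a} = \operatorname*{argmax}_{a} \sum_{m=1}^{M} c_m\, h_m(a, f, e)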

4
Integrating Additional Dependencies
  • The log-linear model allows the integration of
    additional dependencies that carry extra
    information
  • POS
  • Parse trees
  • Add additional variable to capture these
    dependencies
  • New decision rule
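  One plausible form of this extended rule, writing the additional
  dependency (POS tags, parse trees, ...) as a variable v inside the
  feature functions (whether v is maximized or summed over is an
  assumption here):

    \hat{a} = \operatorname*{argmax}_{a} \max_{v} \sum_{m=1}^{M} c_m\, h_m(a, v, f, e)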

5
Tasks
  • Modeling: design feature functions which capture
    cross-lingual divergences
  • Search: find the alignment with the highest
    probability
  • Training: find optimal feature weights
  • Minimize alignment errors given some
    gold-standard alignments (Notice: alignments are
    no longer hidden! The AER measure is given after
    this list.)
  • Supervised training, i.e. we evaluate against a
    gold standard
  • Notice: feature functions may themselves result
    from some training procedure
  • E.g. use a statistical dictionary resulting from
    IBM-n alignment, trained on a large corpus
  • Here: an additional training step, on a small
    (hand-aligned) corpus (similar to MERT for the
    decoder)
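  The standard measure against such gold-standard alignments is the
  alignment error rate (AER) of Och and Ney, with sure links S, possible
  links P and hypothesis links A:

    \mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}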

6
2005: Year of DWA
  • Yang Liu, Qun Liu, and Shouxun Lin. 2005.
    Log-linear Models for Word Alignment.
  • Abraham Ittycheriah and Salim Roukos. 2005.
    A Maximum Entropy Word Aligner for Arabic-English
    Machine Translation.
  • Ben Taskar, Simon Lacoste-Julien, and Dan Klein.
    2005. A Discriminative Matching Approach to Word
    Alignment.
  • Robert C. Moore. 2005. A Discriminative Framework
    for Bilingual Word Alignment.
  • Necip Fazil Ayan, Bonnie J. Dorr, and Christof
    Monz. 2005. NeurAlign: Combining Word Alignments
    Using Neural Networks.

7
Yang Liu et al. 2005
  • Start out with features used in generative
    alignment
  • Lexicons, e.g. IBM 1 (a code sketch follows this
    list)
  • Use both directions p(f_j | e_i) and p(e_i | f_j)
    -> symmetrical alignment model
  • And/or a symmetric model
  • Fertility model p(φ_i | e_i)
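  A minimal Python sketch of such a lexicon feature, assuming a
  one-to-many alignment a (one target position or None per source word)
  and pre-trained IBM 1 tables t_fe[f][e] ~ p(f | e) and
  t_ef[e][f] ~ p(e | f); all names here are illustrative:

    import math

    def ibm1_lexicon_feature(a, f_words, e_words, t_fe, t_ef, floor=1e-12):
        """Sum of log IBM-1 lexicon probabilities, in both directions, for alignment a."""
        score = 0.0
        for j, i in enumerate(a):
            if i is None:                 # unaligned source word
                continue
            f, e = f_words[j], e_words[i]
            score += math.log(max(t_fe.get(f, {}).get(e, 0.0), floor))  # p(f | e)
            score += math.log(max(t_ef.get(e, {}).get(f, 0.0), floor))  # p(e | f)
        return score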

8
More Features
  • Cross count: number of crossings in the alignment
  • Neighbor count: number of links in the
    immediate neighborhood
  • Exact match count: number of src/tgt pairs
    where src = tgt
  • Linked word count: total number of links (to
    influence density)
  • Link types: count how many 1-1, 1-m, m-1, n-m
    alignments
  • Sibling distance: if a word is aligned to multiple
    words, add the distance between these aligned
    words
  • Link co-occurrence count: given multiple
    alignments (e.g. Viterbi alignments from IBM
    models), count how often links co-occur
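  A small Python sketch of how some of these counting features could be
  computed, assuming the alignment is given as a set of 0-based link
  pairs (j, i); the helper names are illustrative, not from the original:

    def cross_count(links):
        """Pairs of links (j,i), (j',i') with j < j' but i > i' (crossings)."""
        pairs = sorted(links)
        return sum(1 for x in range(len(pairs)) for y in range(x + 1, len(pairs))
                   if pairs[x][1] > pairs[y][1])

    def neighbor_count(links):
        """For every link, how many of the 8 surrounding cells also hold a link."""
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
        return sum(1 for (j, i) in links for (dj, di) in offsets
                   if (j + dj, i + di) in links)

    def exact_match_count(links, f_words, e_words):
        """Linked pairs whose source and target words are identical (numbers, names)."""
        return sum(1 for (j, i) in links if f_words[j] == e_words[i])

    def linked_word_count(links):
        """Total number of links, used to influence alignment density."""
        return len(links)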

9
Search
  • Greedy search, based on the gain from adding a link
  • For each of the features the gain can be
    calculated
  • E.g. IBM1
  • Algorithm:
    Start with an empty alignment
    Loop until no additional gain:
      Loop over all (j,i) not in the link set:
        if gain(j,i) > best_gain then store (j,i) as best
      Set link (j,i) for the best pair
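  A runnable sketch of this greedy loop, assuming a scoring function
  gain(links, j, i) that evaluates the effect of adding link (j, i) under
  the current feature model (the names are illustrative):

    def greedy_align(J, I, gain):
        """Repeatedly add the link with the highest positive gain; stop when none is left."""
        links = set()
        while True:
            best, best_gain = None, 0.0
            for j in range(J):
                for i in range(I):
                    if (j, i) in links:
                        continue
                    g = gain(links, j, i)
                    if g > best_gain:
                        best, best_gain = (j, i), g
            if best is None:              # no additional gain
                return links
            links.add(best)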

10
Moore 2005
  • Log-likelihood-ratio-based model (formula below)
  • Measures word association strength
  • Values can get large
  • Conditional-link-probability-based model
  • Estimated probability of two words being linked
  • Uses a simpler alignment model to establish links
  • Adds simple smoothing
  • Additional features: one-to-one, one-to-many,
    non-monotonicity
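  The log-likelihood-ratio association score behind the first model can
  be written in the standard form over counts C from aligned sentence
  pairs (one common formulation; Moore's exact variant may differ in
  details):

    \mathrm{LLR}(f, e) = \sum_{f' \in \{f, \neg f\}} \sum_{e' \in \{e, \neg e\}}
        C(f', e') \, \log \frac{p(f' \mid e')}{p(f')}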

11
Training
  • Finding optimal alignment is non-trivial
  • Adding a link can affect the non-monotonicity and
    one-to-many features
  • Dynamic programming does not work
  • Beam search could be used
  • Requires pruning
  • Parameter optimization
  • Modified version of averaged perceptron learning

12
Modeling Alignment with CRF
  • CRF is an undirected graphical model
  • Each vertex (node) represents a random variable
    whose distribution is to be inferred
  • Each edge represents a dependency between two
    random variables
  • The distribution of each discrete random variable
    Y in the graph is conditioned on an input
    sequence X.
  • Cliques: sets of nodes in the graph that are fully
    connected
  • In our case
  • Features derived from source and target words are
    the input sequence X
  • Alignment links are the random variables Y
  • Different ways to model alignment
  • Blunsom & Cohn (2006): many-to-one word
    alignments, where each source word is aligned
    with zero or one target words (-> asymmetric)
  • Niehues & Vogel (2008): model not a sequence but
    the entire alignment matrix (-> symmetric)

13
Modeling Alignment Matrix
  • Random variables y_ji for all possible alignment
    links
  • 2 values (0/1): word in position j is not
    linked / linked to word in position i
  • Represented as nodes in a graph

14
Modeling Alignment Matrix
  • Factored nodes x representing features
    (observables)
  • Linked to random variables
  • Define a potential for each y_ji (see below)
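  A plausible log-linear form of the potential attached to a factored
  node c, with weights \lambda_k over its features h_k (the notation is
  assumed here, not taken from the slide):

    \varphi_c(y_c, x) = \exp\Big( \sum_k \lambda_k \, h_k(y_c, x) \Big)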

15
Probability of Alignment
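  In the usual factor-graph form, with Z(x) the partition function over
  all link assignments y, the alignment probability is:

    P(y \mid x) = \frac{1}{Z(x)} \prod_{c} \varphi_c(y_c, x),
    \qquad
    Z(x) = \sum_{y'} \prod_{c} \varphi_c(y'_c, x)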
16
Features
  • Local features, e.g. lexical, POS, ...
  • Fertility features
  • First-order features: capture the relation between
    links
  • Phrase features: interaction between word and
    phrase alignment

17
Local Features
  • Local information about link probability
  • Features derived from positions j and i only
  • Factored node connected to only one random
    variable
  • Features
  • Lexical probabilities, also normalized to (f,e)
  • Word identity (e.g. for numbers, names)
  • Word similarity (e.g. cognates)
  • Relative position distance
  • Link indicator feature: is (j,i) linked in the
    Viterbi alignment from a generative alignment model
  • POS: indicator feature for every src/tgt POS pair
  • High-frequency words: indicator feature for every
    src/tgt word pair among the most frequent words

18
Fertility Features
  • Model word fertility, src and tgt side
  • Link to all nodes in row/column
  • Constraint: model fertility only up to a maximum
    fertility N
  • Indicator features: one for each fertility n < N,
    one for all fertilities n ≥ N
  • Alternative: use fertility probabilities from
    IBM 4 training
  • Now different for different words

19
First Order Features
  • Links depend on links of neighboring words
  • Each such factored node connects 2 link nodes
  • Different features for different directions
  • (1,1), (1,2), (2,1), (1,0), ...
  • Captures distortions, similar to HMM and IBM 4
    alignment
  • Indicator features, active if both links are set
  • Also a POS first-order feature: indicator feature
    for link(j,i) and (POS_j, POS_i) and link(j+k, i+l)

20
Inference Finding the Best Alignment
  • Word alignment corresponds to assignment of
    random variables
  • -> Find the most probable variable assignment
  • Problem
  • Complex model structure: many loops
  • No exact inference possible
  • Solution
  • Belief propagation algorithm
  • Inference by message passing
  • Runtime exponential in number of connected nodes

21
Belief Propagation
  • Messages are sent from random variable nodes to
    factored nodes, and also in the opposite
    direction
  • Start with some initial values, e.g. uniform
  • In each iteration
  • Calculate messages from hidden node (j,i) and
    send them to factored node c
  • Calculate messages from factored node c and send
    them to hidden node (j,i)
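  These updates follow the standard sum-product scheme, with n denoting
  variable-to-factor and m factor-to-variable messages:

    n_{(j,i) \to c}(y_{ji}) \propto \prod_{c' \in N(j,i) \setminus \{c\}} m_{c' \to (j,i)}(y_{ji})

    m_{c \to (j,i)}(y_{ji}) \propto \sum_{y_c \setminus y_{ji}} \varphi_c(y_c, x)
        \prod_{(j',i') \in N(c) \setminus \{(j,i)\}} n_{(j',i') \to c}(y_{j'i'})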

22
Getting the Probability
  • After several iterations, the belief value is
    calculated from the messages sent to the hidden
    nodes
  • Belief value can be interpreted as posterior
    probability
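  The belief at link node (j,i) is the normalized product of its incoming
  factor messages and serves as the posterior link probability:

    b_{ji}(y_{ji}) \propto \prod_{c \in N(j,i)} m_{c \to (j,i)}(y_{ji})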

23
Training
  • Maximize the log-likelihood of the correct alignment
  • Use gradient descent to find the optimum
  • Train towards minimum alignment error
  • Need a smoothed version of AER (sketched below)
  • Express AER in terms of link indicator functions
  • Use sigmoid of link probability
  • Can use 2-step approach
  • 1. Optimize towards ML
  • 2. Optimize towards AER
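  A sketch of the smoothed objective implied by these bullets: express
  the set sizes in AER through link variables and replace each hard link
  indicator by a sigmoid of the model's link score s_ji (the exact
  formulation in the paper may differ):

    \widetilde{y}_{ji} = \sigma(s_{ji}), \qquad
    \widetilde{\mathrm{AER}} = 1 - \frac{\sum_{(j,i) \in S} \widetilde{y}_{ji}
        + \sum_{(j,i) \in P} \widetilde{y}_{ji}}{\sum_{(j,i)} \widetilde{y}_{ji} + |S|}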

24
Some Results Spanish-English
  • Features
  • IBM1 and IBM4 lexicons
  • Fertilities
  • Link indicator feature
  • POS features
  • Phrase features
  • Impact on translation quality (BLEU scores)

           Dev     Eval
Baseline   40.04   47.73
DWA        41.62   48.13
25
Summary
  • In the last 5 years: new efforts in word alignment
  • Discriminative word alignment
  • Integrate many features
  • Need a small amount of hand-aligned data to tune
    (train) the feature weights
  • Different variants
  • Log-linear modeling
  • Conditional random fields: sequence and alignment
    matrix
  • Significant improvements in word alignment error
    rate
  • Not always improvements in translation quality
  • Different density of alignment -> different
    phrase table size
  • Need to adjust phrase extraction algorithms?