Title: 11-731 Machine Translation: Syntax-Based Translation Models

1. 11-731 Machine Translation: Syntax-Based Translation Models
   Principles, Approaches, Acquisition
2. Outline

- Syntax-based Translation Models: Rationale and Motivation
- Resource Scenarios and Model Definitions
  - String-to-Tree, Tree-to-String and Tree-to-Tree
- Hierarchical Phrase-based Models (Chiang's Hiero)
- Syntax-Augmented Hierarchical Models (Venugopal and Zollmann)
- String-to-Tree Models
  - Phrase-Structure-based Model (Galley et al., 2004, 2006)
- Tree-to-Tree Models
  - Phrase-Structure-based Stat-XFER Model (Lavie et al., 2008)
  - DCU Tree-bank Alignment method (Zhechev, Tinsley et al.)
- Tree-to-String Models
  - Tree Transduction Models (Yamada and Knight; Gildea et al.)
3. Syntax-based Models: Rationale

- Phrase-based models model translation at very shallow levels:
  - Translation equivalence is modeled at the multi-word lexical level
  - Phrases capture some cross-language local reordering, but only for phrases that were seen in training; there is no effective generalization
  - Non-local cross-language reordering is modeled only by permuting the order of phrases during decoding
  - No explicit modeling of syntax, structural divergences, or syntax-to-semantics mapping differences
- Goal: improve translation quality using syntax-based models
  - Capture generalizations, reorderings and divergences at appropriate levels of abstraction
  - Models direct the search during decoding toward more accurate translations
- Still Statistical MT: acquire translation models automatically from (annotated) parallel data and model them statistically!
4. Syntax-based Statistical MT

- Building a syntax-based Statistical MT system is similar in concept to simpler phrase-based SMT methods:
  - Model acquisition from bilingual sentence-parallel corpora
  - Decoders that, given an input string, can find the best translation according to the models
- Our focus today will be on the models and their acquisition
- Next week Chris Dyer will cover decoding for hierarchical and syntax-based MT
5. Syntax-based Resources vs. Models

- Important distinction:
  - What structural information for the parallel data is available during model acquisition and training?
  - What type of translation models are we acquiring from the annotated parallel data?
- Structure available during acquisition: main distinctions
  - Is the syntactic/structural information for the parallel training data given by external components (parsers), or inferred from the data?
  - Is syntax/structure available for one language or for both?
  - Phrase-structure or dependency-structure?
- What do we extract from parallel sentences?
  - Sub-sentential units of translation equivalence annotated with structure
  - Rules/structures that determine how these units combine into full transductions
6. Syntax-based Translation Models

- String-to-Tree
  - Models explain how to transduce a string in the source language into a structural representation in the target language
  - During decoding:
    - No separate parsing on the source side
    - Decoding results in a set of possible translations, each annotated with syntactic structure
    - The best-scoring string+structure can be selected as the translation
  - Example:
    ne VB pas -> (VP (AUX does) (RB not) x2)
7. Syntax-based Translation Models

- Tree-to-String
  - Models explain how to transduce a structural representation of the source-language input into a string in the target language
  - During decoding:
    - Parse the source string to derive its structure
    - Decoding explores various ways of decomposing the parse tree into a sequence of composable models, each generating a translation string on the target side
    - The best-scoring string can be selected as the translation
  - Examples
8. Syntax-based Translation Models

- Tree-to-Tree
  - Models explain how to transduce a structural representation of the source-language input into a structural representation in the target language
  - During decoding:
    - The decoder synchronously explores alternative ways of parsing the source-language input string and transducing it into corresponding target-language structural output
    - The best-scoring structure+string can be selected as the translation
  - Example:
    NP::NP [VP ? CD ? ??] -> [one of the CD countries that VP]
    ( Alignments: (X1::Y7) (X3::Y4) )
9. Structure Available During Acquisition

- What information/annotations are available for the bilingual sentence-parallel training data?
  - (Symmetricized) Viterbi word alignments (i.e. from GIZA++)
  - (Non-syntactic) extracted phrases for each parallel sentence
  - Parse-trees/dependencies for the source language
  - Parse-trees/dependencies for the target language
- Some major potential issues and problems:
  - GIZA++ word alignments are not aware of syntax; word-alignment errors can have bad consequences for the extracted syntactic models
  - Using external monolingual parsers is also problematic:
    - Using the single-best parse for each sentence introduces parsing errors
    - Parsers were designed for monolingual parsing, not translation
    - Parser design decisions for each language may be very different:
      - Different notions of constituency and structure
      - Different sets of POS and constituent labels
10. Hierarchical Phrase-Based Models

- Proposed by David Chiang in 2005
- A natural hierarchical extension to phrase-based models
- Representation: rules in the form of synchronous CFGs (a minimal representation sketch follows below)
  - Formally syntactic, but with no direct association to linguistic syntax
  - Single non-terminal X
- Acquisition scenario: similar to standard phrase-based models
  - No independent syntactic parsing on either side of the parallel data
  - Uses symmetricized bi-directional Viterbi word alignments
  - Extracts phrases and rules (hierarchical phrases) from each parallel sentence
  - Models the extracted phrases statistically using MLE scores
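To make the representation concrete, a Hiero-style synchronous rule can be held in a structure like the following minimal Python sketch. The class and field names are ours, not Chiang's implementation; the sample rule is the well-known Chinese-English "X1 de X2" reordering pattern, shown here in pinyin:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class HieroRule:
        """A synchronous CFG rule over the single non-terminal X.

        Non-terminals are written X1, X2 and are co-indexed across the
        two sides; the indices capture the cross-language reordering.
        """
        source: tuple   # e.g. ("X1", "de", "X2")
        target: tuple   # e.g. ("the", "X2", "of", "X1")

    # Chiang's classic reordering rule: X -> <X1 de X2, the X2 of X1>
    rule = HieroRule(source=("X1", "de", "X2"),
                     target=("the", "X2", "of", "X1"))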
11. Hierarchical Phrase-Based Models

- Extraction process overview:
  - Start with standard phrase extraction from the symmetricized Viterbi word-aligned sentence pair
  - For each phrase-pair, find all embedded phrase-pairs, and create a hierarchical rule for each instance
  - Accumulate the collection of all such rules from the entire corpus, along with their counts
  - Model them statistically using maximum likelihood estimate (MLE) scores (see the sketch below):
    - P(target|source) = count(source, target) / count(source)
    - P(source|target) = count(source, target) / count(target)
- Filtering:
  - Rules of length < 5 (terminals and non-terminals)
  - At most two non-terminals X
  - Non-terminals must be separated by a terminal
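The MLE scoring step is plain relative-frequency estimation over the accumulated rule counts. A minimal sketch, assuming rules have already been extracted and counted (function and variable names are ours):

    from collections import Counter

    def mle_scores(rule_counts):
        """Compute the two Hiero relative-frequency scores from joint
        (source, target) rule counts: P(target|source) and P(source|target)."""
        source_totals, target_totals = Counter(), Counter()
        for (src, tgt), c in rule_counts.items():
            source_totals[src] += c
            target_totals[tgt] += c
        p_t_given_s = {(s, t): c / source_totals[s]
                       for (s, t), c in rule_counts.items()}
        p_s_given_t = {(s, t): c / target_totals[t]
                       for (s, t), c in rule_counts.items()}
        return p_t_given_s, p_s_given_t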
12. Hierarchical Phrase-Based Models

- Example: Chinese-to-English rules
13. Syntax-Augmented Hierarchical Model

- Proposed by CMU's Venugopal and Zollmann in 2006
- Representation: rules in the form of synchronous CFGs
- Main goal: add linguistic syntax to the hierarchical rules that are extracted by the Hiero method
  - Hiero's X labels are completely generic: they allow substituting any sub-phrase into an X-hole (if the context matches)
  - Linguistic structure has labeled constituents; the labels determine what sub-structures are allowed to combine together
- Idea: use labels that are derived from parse structures on one side of the parallel data to label the Xs in the extracted rules
  - Labels from one language (i.e. English) are projected to the other language (i.e. Chinese)
- Major issues/problems:
  - How to label X-holes that are not complete constituents?
  - What to do about rule fragmentation: rules that are the same other than the labels inside them?
14. Syntax-Augmented Hierarchical Model

- Extraction process overview:
  - Parse the "strong" side of the parallel data (i.e. English)
  - Run the Hiero extraction process on the parallel-sentence instance and find all phrase-pairs and all hierarchical rules for the parallel sentence
  - Labeling: for each X-hole that corresponds to a parse constituent C, label the X as C. For all other X-holes, assign combination labels (see the sketch after this list)
  - Accumulate the collection of all such rules from the entire corpus, along with their counts
  - Model the rules statistically: Venugopal and Zollmann use six different rule score features instead of just two MLE scores
  - Filtering: similar to Hiero rule filtering
- Advanced modeling: Preference Grammars
  - Avoid rule fragmentation: instead of explicitly labeling the X-holes in the rules with different labels, keep them as X, with distributions over the possible labels that could fill the X. These are used as features during decoding
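The labeling step can be sketched as follows. This is a simplified rendering of the Zollmann and Venugopal scheme: exact constituents keep their parse label, concatenations get "A+B" labels, partial constituents get CCG-style slash labels, and everything else falls back to the generic X. The function and the exact fallback order are our own simplification, not the system's actual code:

    def label_hole(span, constituents):
        """Assign a SAMT-style label to the span filling an X-hole.
        `constituents` maps half-open (i, j) spans to parse labels."""
        i, j = span
        if (i, j) in constituents:                  # exact constituent
            return constituents[(i, j)]
        for k in range(i + 1, j):                   # concatenation: A+B
            if (i, k) in constituents and (k, j) in constituents:
                return constituents[(i, k)] + "+" + constituents[(k, j)]
        for (a, b), label in constituents.items():
            if a == i and b > j and (j, b) in constituents:
                return label + "/" + constituents[(j, b)]    # A missing material on the right
            if b == j and a < i and (a, i) in constituents:
                return label + "\\" + constituents[(a, i)]   # A missing material on the left
        return "X"                                  # generic Hiero fallback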
15. Syntax-Augmented Hierarchical Model
16. Tree-to-Tree: Stat-XFER

- Developed by Lavie, Ambati and Parlikar in 2007
- Goal: extract linguistically-supported syntactic phrase-pairs and synchronous transfer rules automatically from parsed parallel corpora
- Representation: synchronous CFG rules with constituent labels, POS tags or lexical items on the RHS of rules. Syntax-labeled phrases are fully-lexicalized S-CFG rules.
- Acquisition scenario:
  - The parallel corpus is word-aligned using GIZA++ and symmetricized
  - Phrase-structure parses for the source and/or target language for each parallel sentence are obtained using monolingual parsers
17. Transfer Rule Formalism

SL: the old man    TL: ha-ish ha-zaqen

NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = 3-SING)
  ((X1 DEF) = DEF)
  ((X3 AGR) = 3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = DEF)
  ((Y3 DEF) = DEF)
  ((Y2 AGR) = 3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
)

- Type information
- Part-of-speech/constituent information
- Alignments
- x-side constraints
- y-side constraints
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
18. Translation Lexicon: French-to-English Examples

DET::DET "le" -> "the"
( (X1::Y1) )

Prep::Prep "dans" -> "in"
( (X1::Y1) )

N::N "principes" -> "principles"
( (X1::Y1) )

N::N "respect" -> "accordance"
( (X1::Y1) )

NP::NP "le respect" -> "accordance"
( )

PP::PP "dans le respect" -> "in accordance"
( )

PP::PP "des principes" -> "with the principles"
( )
19. French-English Transfer Grammar: Example Rules (Automatically Acquired)

{PP,24691}
SL: des principes    TL: with the principles
PP::PP ["des" N] -> ["with" "the" N]
( (X1::Y1) )

{PP,312}
SL: dans le respect des principes    TL: in accordance with the principles
PP::PP [Prep NP] -> [Prep NP]
( (X1::Y1) (X2::Y2) )
20. Syntax-driven Acquisition Process

- Overview of the extraction process:
  - Word-align the parallel corpus (GIZA++)
  - Parse the sentences independently for both languages
  - Tree-to-tree constituent alignment:
    - Run our Constituent Aligner over the parsed sentence pairs
    - Enhance the alignments with additional constituent projections
  - Extract all aligned constituents from the parallel trees
  - Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
  - Construct a database of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them maximum-likelihood (MLE) probabilities)
21. PFA Constituent Node Aligner

- Input: a bilingual pair of parsed and word-aligned sentences
- Goal: find all sub-sentential constituent alignments between the two trees which are translation equivalents of each other
- Equivalence constraint: a pair of constituents <S,T> are considered translation equivalents if:
  - All words in the yield of <S> are aligned only to words in the yield of <T> (and vice-versa)
  - If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice-versa)
- The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints (see the sketch below)
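The word-level half of the equivalence constraint is easy to state directly. A minimal sketch, which checks the published constraint itself rather than the arithmetic trick the actual aligner uses (see the next slides), and leaves out the sub-constituent condition:

    def yields_equivalent(s_yield, t_yield, alignment_links):
        """PFA word-level constraint: every alignment link that touches
        one constituent's yield must land inside the other's yield.

        s_yield, t_yield: sets of word positions under the two nodes.
        alignment_links:  set of (src_pos, tgt_pos) word alignments."""
        for s, t in alignment_links:
            if s in s_yield and t not in t_yield:
                return False
            if t in t_yield and s not in s_yield:
                return False
        return True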
22. PFA Node Alignment Algorithm: Example

- Words don't have to align one-to-one
- Constituent labels can be different in each language
- Tree structures can be highly divergent
23. PFA Node Alignment Algorithm: Example

- The aligner uses a clever arithmetic manipulation to enforce the equivalence constraints
- The resulting aligned nodes are highlighted in the figure
24. PFA Node Alignment Algorithm: Example

- Extraction of phrases:
  - Get the yields of the aligned nodes and add them to a phrase table, tagged with syntactic categories on both source and target sides (as sketched below)
- Example:
  NP::NP  ??  <->  Australia
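Reading the phrase table off the aligned nodes is then a matter of collecting yields. A sketch, where each node is assumed (our assumption, for illustration) to carry a .label and a half-open .span:

    def extract_syntactic_phrases(aligned_nodes, src_words, tgt_words):
        """Turn each aligned constituent-node pair into a syntax-labeled
        phrase pair, e.g. ("NP::NP", "??", "Australia")."""
        table = []
        for s_node, t_node in aligned_nodes:
            si, sj = s_node.span
            ti, tj = t_node.span
            table.append((s_node.label + "::" + t_node.label,
                          " ".join(src_words[si:sj]),
                          " ".join(tgt_words[ti:tj])))
        return table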
25. PFA Node Alignment Algorithm: Example

- All phrases from this tree pair:
  - IP::S  ?? ? ? ?? ? ?? ? ?? ?? ?? ?  <->  Australia is one of the few countries that have diplomatic relations with North Korea .
  - VP::VP  ? ? ?? ? ?? ? ?? ?? ??  <->  is one of the few countries that have diplomatic relations with North Korea
  - NP::NP  ? ?? ? ?? ? ?? ?? ??  <->  one of the few countries that have diplomatic relations with North Korea
  - VP::VP  ? ?? ? ??  <->  have diplomatic relations with North Korea
  - NP::NP  ??  <->  diplomatic relations
  - NP::NP  ??  <->  North Korea
  - NP::NP  ??  <->  Australia
26. Further Improvements

- The Tree-to-Tree (T2T) method is high precision, but suffers from low recall
- Alternative Tree-to-String (T2S) methods (i.e. Galley et al., 2006) use trees on ONE side and project the nodes based on the word alignments
  - High recall, but lower precision
- Recent work by Vamshi Ambati [Ambati and Lavie, 2008] combines both methods (T2T*) by seeding with the T2T correspondences and then adding in additional consistent projected nodes from the T2S method
  - Can be viewed as restructuring the target tree to be maximally isomorphic to the source tree
  - Produces richer and more accurate syntactic phrase tables that improve translation quality (versus T2T and T2S)
27. Extracted Syntactic Phrases

T2T:
  English                        French
  The principles                 Principes
  With the principles            des Principes
  Accordance with the..          Respect des principes
  Accordance                     Respect
  In accordance with the         Dans le respect des principes
  Is all in accordance with..    Tout ceci dans le respect
  This                           et

T2S:
  English                        French
  The principles                 Principes
  With the principles            Principes
  Accordance with the..          Respect des principes
  Accordance                     Respect
  In accordance with the         Dans le respect des principes
  Is all in accordance with..    Tout ceci dans le respect
  This                           et

T2T*:
  English                        French
  The principles                 Principes
  With the principles            des Principes
  Accordance                     Respect
28. Comparative Results: French-to-English

- MT experimental setup:
  - Dev set: 600 sentences, WMT 2006 data, 1 reference
  - Test set: 2000 sentences, WMT 2007 data, 1 reference
  - NO transfer rules; Stat-XFER monotonic decoder
  - SALM language model (4M words)
29. Transfer Rule Acquisition

- Input: constituent-aligned parallel trees
- Idea: aligned nodes act as possible decomposition points of the parallel trees
  - The sub-trees of any aligned pair of nodes can be broken apart at any lower-level aligned nodes, creating an inventory of tree-fragment correspondences
  - Synchronous tree-fragments can be converted into synchronous rules
- Algorithm:
  - Find all possible tree-fragment decompositions from the node-aligned trees
  - Flatten the tree-fragments into synchronous CFG rules (see the sketch below)
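The flattening step walks the frontier of each tree-fragment pair and emits the two RHS sequences plus their co-indexing. A rough sketch under our own representation assumptions (frontier items are either lexical tokens or decomposition-point nodes with a .label attribute, assumed hashable):

    def flatten_fragment_pair(src_frontier, tgt_frontier, aligned):
        """Flatten one synchronous tree-fragment pair into a flat rule.
        `aligned` maps each source frontier node to its aligned target
        node.  Alignment indices count all RHS positions, matching the
        (X1::Y7)-style notation on the following slides."""
        tgt_pos = {node: j for j, node in enumerate(tgt_frontier, start=1)
                   if not isinstance(node, str)}
        src_rhs, links = [], []
        for i, item in enumerate(src_frontier, start=1):
            if isinstance(item, str):
                src_rhs.append('"%s"' % item)        # lexical token stays quoted
            else:
                src_rhs.append(item.label)           # non-terminal keeps its label
                if item in aligned:
                    links.append((i, tgt_pos[aligned[item]]))
        tgt_rhs = ['"%s"' % x if isinstance(x, str) else x.label
                   for x in tgt_frontier]
        return src_rhs, tgt_rhs, links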
30. Rule Extraction Algorithm

- Sub-treelet extraction: extract sub-tree segments, including synchronous alignment information, in the target tree. All the sub-trees and the super-tree are extracted.
31. Rule Extraction Algorithm

- Flat rule creation: each of the treelet pairs is flattened to create a rule in the Stat-XFER formalism. There are four major parts to the rule:
  1. Type of the rule: source and target side type information
  2. Constituent sequence of the synchronous flat rule
  3. Alignment information for the constituents
  4. Constraints in the rule (currently not extracted)
32. Rule Extraction Algorithm

- Flat rule creation, sample rule:

  IP::S [NP VP .] -> [NP VP .]
  (
    Alignments: (X1::Y1) (X2::Y2)
    Constraints:
  )
33. Rule Extraction Algorithm

- Flat rule creation, sample rule:

  NP::NP [VP ? CD ? ??] -> [one of the CD countries that VP]
  (
    Alignments: (X1::Y7) (X3::Y4)
  )

- Note: any one-to-one aligned words are elevated to part-of-speech in the flat rule.
34. Rule Extraction Algorithm

- All rules extracted:

  VP::VP [VC NP] -> [VBZ NP]
  ( (score 0.5) Alignments: (X1::Y1) (X2::Y2) )

  VP::VP [VC NP] -> [VBZ NP]
  ( (score 0.5) Alignments: (X1::Y1) (X2::Y2) )

  NP::NP [NR] -> [NNP]
  ( (score 0.5) Alignments: (X1::Y1) )

  VP::VP [? NP VE NP] -> [VBP NP "with" NP]
  ( (score 0.5) Alignments: (X2::Y4) (X3::Y1) (X4::Y2) )

  NP::NP [VP ? CD ? ??] -> [one of the CD countries that VP]
  ( (score 0.5) Alignments: (X1::Y7) (X3::Y4) )

  IP::S [NP VP] -> [NP VP]
  ( (score 0.5) Alignments: (X1::Y1) (X2::Y2) )

  NP::NP [??] -> [North Korea]
  ( )    ; a many-to-one alignment becomes a phrase
35. Some Chinese XFER Rules

- SL(2,4): ? ? ??
  TL(3,5): trade to taiwan
  Score: 22
  {NP,1045537}
  NP::NP [PP NP] -> [NP PP]
  ( (score 0.916666666666667)
    (X2::Y1)
    (X1::Y2) )

- SL(2,7): ?? ?? ? ? ? ??
  TL(1,7): commercials that directly mention the name viagra
  Score: 5
  {NP,1017929}
  NP::NP [VP "?" NP] -> [NP "that" VP]
  ( (score 0.111111111111111)
    (X3::Y1)
    (X1::Y3) )

- SL(4,14): ? ? ? ? ? ? ? ?? ?? ? ??
36. DCU Tree-bank Alignment Method

- Proposed by Tinsley, Zhechev et al. in 2007
- Main idea:
  - Focus on the parallel treebank scenario: parallel sentences annotated with constituent parse-trees for both sides (obtained by parsing)
  - Same notion and idea as Lavie et al.: find sub-sentential constituent nodes across the two trees that are translation equivalents
  - Main difference: does not depend on the Viterbi word alignments
  - Instead, use the lexical probabilities (obtained from GIZA++) to score all possible node-to-node alignments and incrementally grow the set of aligned nodes
- Various types of rules can then be extracted (i.e. Stat-XFER rules, etc.)
- Overcomes some of the problems due to incorrect and sparse word alignments
- Produces surprisingly different collections of rules than the Stat-XFER method
37. String-to-Tree: Galley et al. (GHKM)

- Proposed by Galley et al. in 2004 and improved in 2006
- Idea: model full syntactic structure on the target side only, in order to produce translations that are more grammatical
- Representation: synchronous hierarchical strings on the source side and their corresponding tree fragments on the target side
- Example:
  ne VB pas -> (VP (AUX does) (RB not) x2)
38. String-to-Tree: Galley et al. (GHKM)

- Overview of the extraction process:
  - Obtain symmetricized Viterbi word alignments for the parallel sentences
  - Parse the "strong" side of the parallel data (i.e. English)
  - Find all constituent nodes in the parsed (target-language) tree that have consistent word alignments to strings in the source language (see the sketch below)
  - Treat these as decomposition points: extract tree fragments on the target side along with the corresponding gapped string on the source side
  - Labeling: for each gap that corresponds to a parse constituent C, label the gap as C
  - Accumulate the collection of all such rules from the entire corpus, along with their counts
  - Model the rules statistically: initially used standard P(tgt|src) MLE scores; also experimented with other scores, similar to SAMT
- Advanced modeling: extraction of composed rules, not just minimal rules
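The alignment-consistency test for a tree node is the familiar phrase-extraction condition applied to the node's span. A simplified sketch of that test (Galley et al.'s actual frontier-node definition is stated via complement spans; the names here are ours):

    def has_consistent_alignment(node_span, alignment_links):
        """Check whether a parsed-side node's span aligns to a contiguous,
        closed block on the string side.

        node_span:       half-open (i, j) span of the node's yield.
        alignment_links: set of (string_pos, tree_pos) word alignments."""
        i, j = node_span
        string_positions = {s for s, t in alignment_links if i <= t < j}
        if not string_positions:
            return False
        lo, hi = min(string_positions), max(string_positions)
        # No string-side word inside [lo, hi] may align outside the node's span.
        return all(i <= t < j for s, t in alignment_links if lo <= s <= hi)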
39. Tree Transduction Models

- Originally proposed by Yamada and Knight, 2001. Influenced later work by Gildea et al. on Tree-to-String models
- Conceptually simpler than most other models
- Learn finite-state transductions on source-language parse-trees in order to map them into well-ordered and well-formed target sentences, based on the Viterbi word alignments
- Representation: simple local transformations on the tree structure, given contextual structure in the tree (see the sketch below):
  - Transduce leaf words in the tree from the source to the target language
  - Delete a leaf-word or a sub-tree in a given context
  - Insert a leaf-word or a sub-tree in a given context
  - Transpose (invert the order of) two sub-trees in a given context
  - Advanced model by Gildea: duplicate and insert a sub-tree
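The operation types are simple enough to state as tree edits. A schematic sketch: the Node class and the operations' signatures are ours, and the actual model attaches learned probabilities to each operation, conditioned on the local tree context:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        label: str
        children: List["Node"] = field(default_factory=list)
        word: str = ""                     # non-empty only at leaves

    def translate_leaf(node, lexicon):
        """Transduce a leaf word from the source to the target language."""
        if node.word:
            node.word = lexicon.get(node.word, node.word)

    def delete_child(node, i):
        """Delete a leaf-word or whole sub-tree in this context."""
        del node.children[i]

    def insert_leaf(node, pos, word):
        """Insert a target function word at a given child position."""
        node.children.insert(pos, Node(label="LEAF", word=word))

    def transpose(node, i, j):
        """Invert the order of two sub-trees in this context."""
        node.children[i], node.children[j] = node.children[j], node.children[i]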
40. Tree Transduction Models

- Main issues/problems:
  - Some complex reorderings and correspondences cannot be modeled using these simple tree transductions
  - Highly sensitive to errors in the source-language parse-tree and to word-alignment errors
41. Summary

- A variety of structure- and syntax-based models: string-to-tree, tree-to-string, tree-to-tree
- Different models utilize different structural annotations on the training resources and depend on different independent components (parsers, word alignments)
- Different model acquisition processes from parallel data, but several recurring themes:
  - Finding sub-sentential translation equivalents and relating them via hierarchical and/or syntax-based structure
  - Statistical modeling of the massive collections of rules acquired from the parallel data
42. Major Challenges

- Sparse coverage: the acquired syntax-based models are often much sparser in coverage than non-syntactic phrases
  - Because they apply additional hard constraints beyond word alignment as evidence of translation equivalence
  - Because the models fragment the data, they are often observed far fewer times in the training data, which makes them more difficult to model statistically
  - Consequently, pure syntactic models often lag behind phrase-based models in translation performance; this has been observed and learned again and again by different groups (including our own)
  - This motivates approaches that integrate syntax-based models with phrase-based models
- Overcoming pipeline errors:
  - Adding independent components (parser output, Viterbi word alignments) introduces cumulative errors that are hard to overcome
  - Various approaches try to get around these problems
  - Also recent work on syntax-aware word alignment and bilingual-aware parsing
43. Major Challenges

- Optimizing for structure granularity and labels:
  - Syntactic structure in MT is heavily based on Penn TreeBank structures and labels (POS and constituents); are these needed and optimal for MT, even for MT into English?
  - Approaches range from a single abstract hierarchical X label to fully lexicalized constituent labels. What is optimal? How do we answer this question?
  - Alternative approaches (i.e. ITGs) aim to overcome this problem by unsupervised inference of the structure from the data
- Direct contrast and comparison of alternative approaches is extremely difficult:
  - Decoding with these syntactic models is highly complex and computationally intensive
  - Different groups/approaches develop their own decoders
  - Hard to compare anything beyond BLEU (or other metric) scores
- Different groups continue to pursue different approaches; this is at the forefront of current research in Statistical MT
44. References

- Vamshi Ambati and Alon Lavie (2008). "Improving syntax driven translation models by re-structuring divergent and non-isomorphic parse tree structures." AMTA-2008: Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas, Waikiki, Hawaii, 21-25 October 2008, pp. 235-244.
- David Chiang (2005). "A hierarchical phrase-based model for statistical machine translation." ACL-2005: 43rd Annual Meeting of the Association for Computational Linguistics, University of Michigan, Ann Arbor, 25-30 June 2005, pp. 263-270.
- Michel Galley, Mark Hopkins, Kevin Knight and Daniel Marcu (2004). "What's in a translation rule?" HLT-NAACL 2004: Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics Annual Meeting, May 2-7, 2004, Boston, USA, pp. 273-280.
- Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang and Ignacio Thayer (2006). "Scalable inference and training of context-rich syntactic translation models." Coling-ACL 2006: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, 17-21 July 2006, pp. 961-968.
- Alon Lavie, Alok Parlikar and Vamshi Ambati (2008). "Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora." Second ACL Workshop on Syntax and Structure in Statistical Translation (ACL-08 SSST-2), 20 June 2008, Columbus, Ohio, USA, pp. 87-95.
- John Tinsley, Ventsislav Zhechev, Mary Hearne and Andy Way (2007). "Robust language pair-independent sub-tree alignment." MT Summit XI, 10-14 September 2007, Copenhagen, Denmark, pp. 467-474.
- Ashish Venugopal and Andreas Zollmann (2007). "Hierarchical and syntax structured MT." First Machine Translation Marathon, Edinburgh, April 16-20, 2007, 52pp.
- Kenji Yamada and Kevin Knight (2001). "A syntax-based statistical translation model." ACL-EACL-2001: 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter of the ACL, July 9-11, 2001, Toulouse, France, pp. 523-530.
- Andreas Zollmann and Ashish Venugopal (2006). "Syntax augmented machine translation via chart parsing." HLT-NAACL 2006: Proceedings of the Workshop on Statistical Machine Translation, New York, NY, USA, June 2006, pp. 138-141.