Title: Transfer-based MT
1. Transfer-based MT
2. Syntactic Transfer-based Machine Translation
- Direct and example-based approaches are two ends of a spectrum
- Recombination of fragments gives better coverage
- What if the matching/transfer is done at the syntactic parse level? Three steps:
- Parse: a syntactic parse of the source-language sentence, i.e., a hierarchical representation of the sentence
- Transfer: rules transform the source parse tree into a target parse tree (e.g., Subject-Verb-Object → Subject-Object-Verb)
- Generation: regenerate the target-language sentence from the parse tree, handling target-language morphology
- Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
3. Examples of SynTran-MT
[Figure: aligned Spanish-English parse trees for "ajá, quiero usar mi tarjeta de crédito" ↔ "yeah, I wanna use my credit card".]
- Mostly parallel parse structures
- Might have to insert words: pronouns, morphological particles
4. Example of SynTran-MT (2)
- Pros
- Allows for structure transfer
- Re-orderings are typically restricted to parent-child nodes
- Cons
- Transfer rules must be written for each language pair (N² sets of rules)
- Hard to reuse rules when one of the languages is changed
5. Lexical-semantic Divergences
- Linguistic divergences: structural differences between languages
- Categorical divergence: translation of words in one language into words that have different parts of speech in the other language
- To be jealous
- Tener celos (to have jealousy)
6. Issues
- Linguistic divergences
- Conflational divergence: translation of two or more words in one language into one word in another language
- To kick
- Dar una patada (give a kick)
7. Issues
- Linguistic divergences
- Structural divergence: realization of verb arguments in different syntactic configurations in different languages
- To enter the house
- Entrar en la casa (enter in the house)
8. Issues
- Linguistic divergences
- Head-swapping divergence: inversion of a structural-dominance relation between two semantically equivalent words
- To run in
- Entrar corriendo (enter running)
9. Issues
- Linguistic divergences
- Thematic divergence: realization of verb arguments that reflect different thematic-to-syntactic mapping orders
- I like grapes
- Me gustan uvas (to-me please grapes)
10. Divergence counts from Bonnie Dorr
- 32% of sentences in a UN Spanish/English corpus (5K sentences)
11. Transfer rules
12. Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
13. Why syntax-based SMT?
- Weaknesses of phrase-based SMT:
- Long-distance reordering (reordering is only phrase-level)
- Discontinuous phrases
- Generalization
- Other methods use syntactic knowledge to:
- Integrate syntactic constraints into word alignment
- Pre-order source sentences
- Rerank the n-best output of translation models
14. SSMT based on formal structures
- Compared with phrase-based SMT:
- Translates hierarchically
- The target structures finally generated are not necessarily real linguistic structures, but they:
- Make long-distance reordering more feasible
- Introduce non-terminals/variables
- Handle discontinuous phrases, e.g., "put x on" ↔ "? x ?"
- Generalize
15. SCFG
- Formulation
- Two CFGs and their correspondences (a toy encoding follows below)
- Or, equivalently, a single grammar whose productions P have paired right-hand sides
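As a toy illustration of this formulation (my own encoding, not from the slides), each synchronous production can be stored as a left-hand side plus two right-hand sides whose nonterminals are paired by link indices; generating a string pair expands each linked nonterminal once and reuses the result on both sides:

import random

# Synchronous productions: (LHS, source RHS, target RHS).
# Nonterminals are (symbol, link index) pairs; terminals are plain strings.
# The English/Japanese-style vocabulary is invented for illustration.
rules = [
    ("S",   [("NP", 1), ("VP", 2)],  [("NP", 1), ("VP", 2)]),   # straight
    ("VP",  [("V", 1), ("OBJ", 2)],  [("OBJ", 2), ("V", 1)]),   # inverted: SVO -> SOV
    ("NP",  ["she"],                 ["kanojo-wa"]),
    ("OBJ", ["music"],               ["ongaku-o"]),
    ("V",   ["likes"],               ["sukidesu"]),
]

def generate(symbol):
    """Expand one nonterminal with a synchronous production; return (source, target)."""
    _, src_rhs, tgt_rhs = random.choice([r for r in rules if r[0] == symbol])
    grown, src, tgt = {}, [], []
    for item in src_rhs:
        if isinstance(item, tuple):              # linked nonterminal: expand once,
            grown[item[1]] = generate(item[0])   # reuse the expansion on both sides
            src += grown[item[1]][0]
        else:
            src.append(item)
    for item in tgt_rhs:
        tgt += grown[item[1]][1] if isinstance(item, tuple) else [item]
    return src, tgt

print(generate("S"))
# (['she', 'likes', 'music'], ['kanojo-wa', 'ongaku-o', 'sukidesu'])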
16. SCFG: an example
17. SCFG derivation
18. ITG
- Synchronous CFGs in which the links between nonterminals in a production are restricted to two possible configurations:
- Inverted
- Straight
- Any ITG can be converted into a synchronous CFG of rank two. (A sketch of the resulting reordering constraint follows.)
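Because every production is either straight or inverted, an ITG/BTG can realize exactly those permutations that decompose recursively into two contiguous blocks kept in order or swapped. A small sketch (my own illustration of this definition, not code from the slides) checks whether a permutation is reachable:

from functools import lru_cache

def itg_reachable(perm):
    """True iff perm (0-indexed) decomposes into straight/inverted binary splits."""
    n = len(perm)

    @lru_cache(maxsize=None)
    def ok(i, j):                        # source span [i, j)
        if j - i == 1:
            return True
        for k in range(i + 1, j):        # try every binary split point
            lo1, hi1 = min(perm[i:k]), max(perm[i:k])
            lo2, hi2 = min(perm[k:j]), max(perm[k:j])
            # each half must cover a contiguous target range ...
            if hi1 - lo1 + 1 != k - i or hi2 - lo2 + 1 != j - k:
                continue
            # ... and the two ranges must be adjacent: straight or inverted
            if (hi1 + 1 == lo2 or hi2 + 1 == lo1) and ok(i, k) and ok(k, j):
                return True
        return False

    return ok(0, n)

print(itg_reachable((2, 0, 3, 1)))   # False: 0-indexed (3,1,4,2), see slide 79
print(itg_reachable((1, 0, 3, 2)))   # True: two inverted swaps, combined straight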
19. BTG
20. ITG as reordering constraint
- Two kinds of reordering
- Inverted
- Straight
- Coverage
- Wu (1997): "been unable to find real examples of cases where alignments would fail under this constraint, at least in lightly inflected languages, such as English and Chinese."
- Wellington et al. (2006): "we found examples", in at least 5% of the Chinese/English sentence pairs.
- Weakness
- No strong mechanism for determining which order is better, inverted or straight.
21. Chiang (2005): Hierarchical Phrase-based Model (HPM)
- Rules: hierarchical rules with nonterminal gaps, plus the glue rules S → ⟨S X, S X⟩ and S → ⟨X, X⟩
- Model: log-linear
- Decoder: CKY
22. Chiang (2005): rule extraction
23. Chiang (2005): rule extraction restrictions
- Initial (base) rules: at most 15 words on the French side
- Final rules: at most 5 symbols on the French side
- At most two non-terminals on each side, and they must not be adjacent
- At least one aligned terminal pair (these filters are sketched in code below)
24. Chiang (2005): model
25. Chiang (2005): decoder
26. SSMT based on phrase structures
- Using grammars with linguistic knowledge
- The grammars are based on SCFG
- Two categories
- Tree-string
- Tree-to-string
- String-to-tree
- Tree-tree
27. Yamada & Knight 2001, 2003
28. Yamada's work vs. SCFG
- Insertion operation
- A → (w A1, A1)
- Reordering operation
- A → (A1 A2 A3, A1 A3 A2)
- Translating operation
- A → (x, y)
(All three operations are applied in the sketch below.)
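A minimal sketch of applying the three operations to a parse tree; the lookup tables are invented stand-ins for the model's learned distributions:

import random

def transform(node, reorder_table, insert_table, translate_table):
    """node is (label, children) for internal nodes, or a word string at a leaf."""
    if isinstance(node, str):                       # translating operation
        return random.choice(translate_table[node])
    label, children = node
    kids = [transform(c, reorder_table, insert_table, translate_table)
            for c in children]
    order = reorder_table.get((label, len(kids)), list(range(len(kids))))
    kids = [kids[i] for i in order]                 # reordering operation
    extra = insert_table.get(label)
    if extra:
        kids.append(extra)                          # insertion operation
    return (label, kids)

tree = ("VP", [("VB", ["adores"]), ("NP", ["music"])])
print(transform(tree,
                reorder_table={("VP", 2): [1, 0]},         # A -> (A1 A2, A2 A1)
                insert_table={"NP": "wo"},                 # insert a function word
                translate_table={"adores": ["daisuki desu"],
                                 "music": ["ongaku"]}))
# ('VP', [('NP', ['ongaku', 'wo']), ('VB', ['daisuki desu'])])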
29. Yamada: weaknesses
- Single-level mapping cannot capture multi-level reordering
- Yamada's fix: flatten the trees
- Word-based translation at the leaves
- Yamada's fix: phrasal leaves
30. Galley et al. 2004, 2006
- The translation model incorporates syntactic structure on the target-language side
- It is trained by learning translation rules from bilingual data
- The decoder uses a parser-like method to create syntactic trees as output hypotheses
31. Translation rules
- Translation rules
- Target side: multi-level subtrees
- Source side: continuous or discontinuous phrases
- Types of translation rules
- Translating source phrases into target chunks
- NPB(PRP/I) ↔ ??
- NP-C(NPB(DT/this NN/address)) ↔ ??? ??
(Here and below, "?" marks stand for Chinese characters lost in extraction.)
32. Types of translation rules
- Rules with variables
- NP-C(NPB(PRP/my x0:NN)) ↔ ?? ? x0
- PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ ? x0 ??
- Rules that combine previously translated results
- VP(x0:VBZ x1:NP-C) ↔ x1 x0
- This rule takes a noun phrase followed by a verb, switches their order, then combines them into a new verb phrase
33. Rule extraction
- Word-align a parallel corpus
- Parse the target side
- Extract translation rules (the frontier test at the heart of this step is sketched below)
- Minimal rules: cannot be decomposed
- Composed rules: built by composing minimal rules
- Estimate probabilities
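The heart of the extraction step is deciding which target-tree nodes may anchor a rule: a node qualifies when the closure of its source span avoids every source position aligned to words outside the node. A simplified sketch of that test (my own reconstruction of the GHKM-style span check; the data structures are invented):

def leaves(node):
    """Indices of the target words in the yield of node = (label, children)."""
    out = []
    for c in node[1]:
        out.extend([c] if isinstance(c, int) else leaves(c))
    return out

def is_frontier(node, align, all_leaves):
    """align maps a target word index to the set of source positions it links to."""
    inside = set(leaves(node))
    cov = set().union(*[align.get(i, set()) for i in inside])
    if not cov:
        return False
    closure = set(range(min(cov), max(cov) + 1))
    outside = all_leaves - inside
    comp = set().union(*[align.get(i, set()) for i in outside]) if outside else set()
    return not (closure & comp)     # frontier iff closed span avoids the complement

tree = ("S", [("NP", [0]), ("VP", [1, 2])])     # target tree over words 0..2
align = {0: {1}, 1: {0}, 2: {2}}                # crossing word alignment
print(is_frontier(tree[1][0], align, set(leaves(tree))))   # NP: True
print(is_frontier(tree[1][1], align, set(leaves(tree))))   # VP: False (span gap)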
34. Rule extraction
Minimal rule
35. Composed rules
36. The Format is Expressive
[Figure: example rules covering non-constituent phrases (hay x0 ↔ there are x0), phrasal translation (está cantando ↔ is singing), non-contiguous phrases (poner x0 ↔ put x0 on), multilevel re-ordering, lexicalized re-ordering, and context-sensitive word insertion.]
(Knight & Graehl, 2005)
37. Decoder
- Probabilistic CKY-style parsing algorithm with beams
- Results in an English syntax tree corresponding to the Chinese sentence
- Guarantees the output has some kind of globally coherent syntactic structure
38.-42. Decoding example (one derivation step per slide)
43. Marcu et al. 2006
- SPMT
- Integrates non-syntactifiable phrases
- Multiple features for each rule
- Decoding with multiple models
44. SSMT based on phrase structures
- Two categories
- Tree-string
- String-to-tree
- Tree-to-string
- Tree-tree
45. Tree-to-string
- Liu et al. 2006
- The tree-to-string alignment template (TAT) model
46. TAT
47. TAT extraction
- Constraints
- Source trees have to be subtrees
- Have to be consistent with the word alignment
- Restrictions on extraction
- Both the first and last symbols in the target string must be aligned to some source symbols
- The height of T(z) is limited to no greater than h
- The number of direct descendants of a node of T(z) is limited to no greater than c
(These filters are sketched in code below.)
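A small sketch of these restrictions as a predicate; the tree encoding and the defaults h=3, c=5 are assumptions for illustration, and the subtree and alignment-consistency checks are left to the caller:

def height(node):                      # node = (label, children) or a word string
    if isinstance(node, str):
        return 0
    return 1 + max((height(c) for c in node[1]), default=0)

def max_branching(node):
    if isinstance(node, str):
        return 0
    return max([len(node[1])] + [max_branching(c) for c in node[1]])

def admissible(tree, target, aligned_positions, h=3, c=5):
    """aligned_positions: target-string indices aligned to some source symbol."""
    if not target:
        return False
    if 0 not in aligned_positions or len(target) - 1 not in aligned_positions:
        return False                   # first and last target symbols must align
    return height(tree) <= h and max_branching(tree) <= c

tree = ("NP", [("DT", ["the"]), ("NN", ["card"])])
print(admissible(tree, ["la", "tarjeta"], {0, 1}))   # True
print(admissible(tree, ["la", "tarjeta"], {1}))      # False: first symbol unaligned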
48. TAT model
49. Decoding
50. Tree-to-string vs. string-to-tree
- Tree-to-string
- Integrates source structures into translation and reordering
- The output may not be grammatical
- String-to-tree
- Guarantees the output has some kind of globally coherent syntactic structure
- Cannot use any knowledge from source structures
51. SSMT based on phrase structures
- Two categories
- Tree-string
- String-to-tree
- Tree-to-string
- Tree-tree
52. Tree-tree
- Synchronous tree-adjoining grammar (STAG)
- Synchronous tree substitution grammar (STSG)
53. STAG
54. STAG derivation
55. STSG
56. STSG elementary trees
57. Dependency structures
[Figure: the same Chinese sentence shown as (a) a phrase-structure parse (IP, NP, VP, ADJP nodes over NR/NN/JJ/VV tags) and (b) a dependency structure; the Chinese characters were lost in extraction.]
58. For MT: dependency structures vs. phrase structures
- Advantages of dependency structures over phrase structures for machine translation:
- Inherent lexicalization
- Closer to meaning
- Better representation of divergences across languages
59. SSMT based on dependency structures
- Lin 2004: "A Path-based Transfer Model for Machine Translation"
- Quirk et al. 2005: "Dependency Treelet Translation: Syntactically Informed Phrasal SMT"
- Ding et al. 2005: "Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars"
60. Lin 2004
- Translation model trained by learning transfer rules from a bilingual corpus where the source-language sentences are parsed
- Decoding: finding the minimum path covering of the source-language dependency tree
61. Lin 2004: path
62. Lin 2004: transfer rule
63. Quirk et al. 2005
- Translation model trained by learning treelet pairs from a bilingual corpus where the source-language sentences are parsed
- Decoding: CKY-style
64. Treelet pairs
65. Quirk 2005: decoding
66. Ding 2005
67. Summary
68. State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata. Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering. Recently, several models based on context-free grammars have been investigated, borrowing from compiler theory the idea of synchronous rewriting.
Slides from G. Satta
69. Translation models based on synchronous rewriting:
- Inversion Transduction Grammars (Wu, 1997)
- Head Transducer Grammars (Alshawi et al., 2000)
- Tree-to-string models (Yamada & Knight, 2001; Galley et al., 2004)
- Loosely tree-based model (Gildea, 2003)
- Multi-Text Grammars (Melamed, 2003)
- Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all of these.
70. A synchronous context-free grammar (SCFG) is based on three components:
- A context-free grammar (CFG) for the source language
- A CFG for the target language
- A pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
71. Example (Yamada & Knight, 2001)
Source (English) grammar:
VB → PRP(1) VB1(2) VB2(3)
VB2 → VB(1) TO(2)
TO → TO(1) NN(2)
PRP → he
VB1 → adores
VB → listening
TO → to
NN → music

Target (Japanese) grammar:
VB → PRP(1) VB2(3) VB1(2)
VB2 → TO(2) VB(1) ga
TO → NN(2) TO(1)
PRP → kare ha
VB1 → daisuki desu
VB → kiku no
TO → wo
NN → ongaku
72. Example (cont'd)
73. A pair of CFG productions in an SCFG is called a synchronous production. An SCFG generates pairs of trees/strings, where each component is a translation of the other. An SCFG can be extended with probabilities: each pair of productions is assigned a probability, and the probability of a pair of trees is the product of the probabilities of the synchronous productions involved. (A toy computation follows.)
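A toy numeric illustration of that product rule; the synchronous productions and their probabilities are invented:

# Invented synchronous productions with invented probabilities.
rule_probs = {
    ("S -> NP VP",  "S -> NP VP"):      0.9,
    ("VP -> V NP",  "VP -> NP V"):      0.4,
    ("NP -> she",   "NP -> kanojo-wa"): 0.5,
    ("NP -> music", "NP -> ongaku-o"):  0.2,
    ("V -> likes",  "V -> sukidesu"):   0.3,
}

p = 1.0
for pair in rule_probs:            # this derivation uses each production once
    p *= rule_probs[pair]
print(p)                           # 0.9 * 0.4 * 0.5 * 0.2 * 0.3 = 0.0108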
74. The membership problem (Wu, 1997) for SCFGs is defined as follows. Input: an SCFG and a pair of strings w1, w2. Output: Yes/No, depending on whether w1 translates into w2 under the SCFG. Applications in segmentation, word alignment and bracketing of parallel corpora. The assumption that the SCFG is part of the input is made here to investigate the dependency of problem complexity on grammar size.
75. Result: the membership problem for SCFGs is NP-complete. The proof uses SCFG derivations to explore the space of consistent truth assignments that satisfy a source 3SAT instance. Remarks: the result transfers to (Yamada & Knight, 2001), (Gildea, 2003), and (Melamed, 2003), which are at least as powerful as SCFG.
76. Remarks (cont'd)
- The problem can be solved in polynomial time if the input grammar is fixed or the production length is bounded (Melamed, 2004), as in:
- Inversion Transduction Grammars (Wu, 1997)
- Head Transducer Grammars (Alshawi et al., 2000)
- For NLP applications, it is more realistic to assume a fixed grammar and a varying input string
77. Providing an exponential-time lower bound for the membership problem would amount to showing P ≠ NP. But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures that we use to solve the problem. Result: if chart-parsing techniques are used to solve the membership problem for SCFG, a number of partial analyses is obtained that grows exponentially with the production length of the input grammar.
78. Chart parsing for CFGs works by combining completed constituents with partial analyses:
A → B1 B2 B3 ... Bn
Three indices are used to process each combination, for a total of O(n³) possible combinations that must be checked, where n is the length of the input string, as the sketch below makes explicit.
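A minimal CKY recognizer (toy grammar and sentence, my own illustration) makes the three indices explicit; the synchronous case on the next slides breaks exactly the contiguity this relies on:

words = ["she", "likes", "music"]
lex = {"she": "NP", "likes": "V", "music": "NP"}
rules = [("S", "NP", "VP"), ("VP", "V", "NP")]   # binarized A -> B C rules

n = len(words)
chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
for i, w in enumerate(words):
    chart[i][i + 1].add(lex[w])

for span in range(2, n + 1):
    for i in range(n - span + 1):          # index 1: left edge of A
        j = i + span                       # index 2: right edge of A
        for k in range(i + 1, j):          # index 3: split point between B and C
            for A, B, C in rules:
                if B in chart[i][k] and C in chart[k][j]:
                    chart[i][j].add(A)

print("S" in chart[0][n])                  # True: the sentence is recognized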
79. Consider the synchronous production
⟨A → B(1) B(2) B(3) B(4), A → B(3) B(1) B(4) B(2)⟩
representing the permutation (3, 1, 4, 2).
80. When applying chart parsing, there is no way to keep partial analyses contiguous.
81. The proof of our result generalizes the previous observations. We show that, for some worst-case permutations of length q, any combination strategy we choose leads to a number of indices growing at least as fast as sqrt(q). Then, for SCFGs of size q, sqrt(q) gives an asymptotic lower bound (in the exponent) for the membership problem when chart-parsing algorithms are used.
82. A probabilistic SCFG provides the probability that tree t1 translates into tree t2: Pr(t1, t2). Accordingly, we can define the probability that string w1 translates into string w2,
Pr(w1, w2) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr(t1, t2),
and the probability that string w translates into tree t,
Pr(w, t) = Σ_{t1 ⇒ w} Pr(t1, t).
83. The string-to-tree translation problem for probabilistic SCFGs is defined as follows. Input: a probabilistic SCFG and a string w. Output: a tree t such that Pr(w, t) is maximized. Application in machine translation. Again, the assumption that the SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size.
84. Result: the string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard. The proof reduces from the consensus problem: strings generated by a probabilistic finite automaton or hidden Markov model have probabilities defined as a sum of probabilities over several paths, and maximizing such a summation is NP-hard (Casacuberta & de la Higuera, 2000; Lyngsø & Pedersen, 2002).
85. Remarks: the complexity of the problem comes from the fact that several source trees can be translated into the same target tree. The result persists if there is a constant bound on the length of synchronous productions. Open: can the problem be solved in polynomial time if the probabilistic SCFG is fixed?
86. Learning Non-Isomorphic Tree Mappings for Machine Translation
[Figure: a pair of aligned, non-isomorphic dependency trees relating "wrongly report events to-John" to "misinform him of the events".]
Slides from J. Eisner
87. Syntax-Based Machine Translation
- Previous work assumes essentially isomorphic trees
- Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
- But trees are not isomorphic!
- Discrepancies between the languages
- Free translation in the training data
88. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
89. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: aligned trees over "donnent (give), baiser (kiss), à (to), Sam, un (a), beaucoup (lots), d' (of), enfants (kids)" and "kiss, Sam, kids, quite, often".]
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
90. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange, along with a much worse alignment.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
91. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
92. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
93.-98. Grammar = Set of Elementary Trees (the inventory is built up over six slides)
99. Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
i.e., the product of the probabilities of the little trees that are used.
100. Form of model of big tree pairs
In synchronous TSG, an aligned big tree pair is generated by choosing a sequence of little tree pairs:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
Joint model P_θ(T1, T2). Wise to use the noisy-channel form P_θ(T1 | T2) · P_θ(T2), where the channel model must be trained on paired trees (hard to get) but P_θ(T2) could be trained on zillions of target-language trees. But any joint model will do.
101. Maxent model of little tree pairs
p(t1, t2, a | n) is modeled log-linearly with FEATURES such as:
- report+wrongly → misinform? (use dictionary)
- report → misinform? (at root)
- wrongly → misinform?
- verb incorporates adverb child?
- verb incorporates child 1 of 3?
- children 2, 3 switch positions?
- common tree sizes and shapes?
- ... etc. ...
(A feature-extraction sketch follows.)
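A sketch of how such binary features might be read off a little-tree pair; the feature names and toy dictionary are invented, and the real feature set is richer:

def features(t1_root, t2_root, t1_words, t2_words, dictionary):
    """Binary features over a little-tree pair, keyed by descriptive strings."""
    f = {}
    f["roots:%s->%s" % (t1_root, t2_root)] = 1.0        # root-to-root pairing
    for w1 in t1_words:
        for w2 in t2_words:
            if (w1, w2) in dictionary:                  # bilingual dictionary hit
                f["dict:%s->%s" % (w1, w2)] = 1.0
    f["sizes:%d->%d" % (len(t1_words), len(t2_words))] = 1.0   # tree sizes
    return f

feats = features("report", "misinform", ["report", "wrongly"], ["misinform"],
                 dictionary={("report", "misinform")})
print(sorted(feats))
# A log-linear model then sets p(t1, t2, a | n) proportional to exp(Σ θ_k · f_k).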
102. Inside Probabilities
[Figure: β(c1, c2) for a pair of tree nodes, illustrated on the "wrongly report events to-John" / "misinform him of the events" tree pair.]
103. Inside Probabilities
[Figure: the same tree pair; only O(n²) node pairs (c1, c2) need inside probabilities.]
104. P(T1, T2, A) = ∏ p(t1, t2, a | n)
- Alignment: find A to max P_θ(T1, T2, A)
- Decoding: find T2, A to max P_θ(T1, T2, A)
- Training: find θ to max Σ_A P_θ(T1, T2, A)
- Do everything on little trees instead!
- Only need to train and decode a model of p_θ(t1, t2, a)
- But we are not sure how to break up a big tree correctly
- So try all possible little trees and all ways of combining them, by dynamic programming
105. Alignment Pseudocode
for each node c1 of T1 (bottom-up)
  for each possible little tree t1 rooted at c1
    for each node c2 of T2 (bottom-up)
      for each possible little tree t2 rooted at c2
        for each matching a between frontier nodes of t1 and t2
          p = p(t1, t2, a)
          for each pair (d1, d2) of frontier nodes matched by a
            p = p · β(d1, d2)    // inside probability of the kids
          β(c1, c2) += p         // our inside probability
- Nonterminal states are used in practice but not shown here
- For EM training, also find outside probabilities
(A simplified runnable version follows.)
106. An MT Architecture
[Diagram: a dynamic-programming engine underlies both a Trainer, which scores all alignments of two big trees T1, T2, and a Decoder, which scores all alignments between a big tree T1 and a forest of big trees T2. Both sit on a probability model p_θ(t1, t2, a) of little trees that scores little-tree pairs, proposes translations t2 of a little tree t1, and updates the parameters θ.]
107. Related Work
- Synchronous grammars (Shieber & Schabes 1990)
- Statistical work has allowed only 1-to-1 (isomorphic trees)
- Stochastic inversion transduction grammars (Wu 1995)
- Head transducer grammars (Alshawi et al. 2000)
- Statistical tree translation
- Noisy-channel model (Yamada & Knight 2000)
- Infers trees; trains on (string, tree) pairs, not (tree, tree) pairs
- But again, allows only 1-to-1, plus 1-to-0 at the leaves
- Data-oriented translation (Poutsma 2000)
- Synchronous DOP model trained on already-aligned trees
- Statistical tree generation
- Similar to our decoding: construct a forest of appropriate trees, pick the highest-probability one
- Dynamic programming search in a packed forest (Langkilde 2000)
- Stack decoder (Ratnaparkhi 2000)
108. What Is New Here?
- Learning full elementary tree pairs, not rule pairs or subcat pairs
- Previous statistical formalisms have basically assumed isomorphic trees
- Maximum-entropy modeling of elementary tree pairs
- New, flexible formalization of synchronous Tree Substitution Grammar
- Allows either dependency trees or phrase-structure trees
- Empty trees permit insertion and deletion during translation
- Concrete enough for implementation (cf. informal previous descriptions)
- TSG is more powerful than CFG for modeling trees, but faster than TAG
- Observation that dynamic programming is surprisingly fast
- Finds all possible decompositions into aligned elementary tree pairs
- O(n²) if both input trees are fully known and elementary tree size is bounded