Title: Transfer-based MT
1. Transfer-based MT
2. Syntactic Transfer-based Machine Translation
- Direct and example-based approaches are two ends of a spectrum
- Recombination of fragments gives better coverage
- What if the matching/transfer is done at the syntactic parse level? Three steps:
- Parse: a syntactic parse of the source-language sentence, i.e., a hierarchical representation of the sentence
- Transfer: rules transform the source parse tree into a target parse tree (e.g., Subject-Verb-Object → Subject-Object-Verb)
- Generation: regenerate the target-language sentence from the parse tree, handling target-language morphology
- Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
3. Examples of SynTran-MT
[Figure: aligned Spanish-English parse trees for "ajá, quiero usar mi tarjeta de crédito" ↔ "yeah, I wanna use my credit card".]
- Mostly parallel parse structures
- Might have to insert words: pronouns, morphological particles
4. Example of SynTran-MT (2)
- Pros
- Allows for structure transfer
- Re-orderings are typically restricted to parent-child nodes
- Cons
- Transfer rules must be written for each language pair (N² sets of rules)
- Hard to reuse rules when one of the languages is changed
5. Lexical-semantic Divergences
- Linguistic divergences: structural differences between languages
- Categorical divergence: translation of words in one language into words that have different parts of speech in the other language
- To be jealous
- Tener celos (to have jealousy)
6. Issues
- Linguistic divergences
- Conflational divergence: translation of two or more words in one language into one word in another language
- To kick
- Dar una patada (give a kick)
7. Issues
- Linguistic divergences
- Structural divergence: realization of verb arguments in different syntactic configurations in different languages
- To enter the house
- Entrar en la casa (enter in the house)
8. Issues
- Linguistic divergences
- Head-swapping divergence: inversion of a structural-dominance relation between two semantically equivalent words
- To run in
- Entrar corriendo (enter running)
9. Issues
- Linguistic divergences
- Thematic divergence: realization of verb arguments that reflect different thematic-to-syntactic mapping orders
- I like grapes
- Me gustan uvas (to-me please grapes)
10. Divergence counts from Bonnie Dorr
- 32% of sentences in a UN Spanish/English corpus (5K sentences)
11. Transfer rules
12. Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
13. Why syntax-based SMT?
- Weaknesses of phrase-based SMT:
- Long-distance reordering (reordering is only phrase-level)
- Discontinuous phrases
- Generalization
- Other methods use syntactic knowledge to:
- Integrate syntactic constraints into word alignment
- Pre-order source sentences
- Rerank the n-best output of translation models
14. SSMT based on formal structures
- Compared with phrase-based SMT:
- Translates hierarchically
- The target structures finally generated are not necessarily real linguistic structures, but they:
- Make long-distance reordering more feasible
- Introduce non-terminals/variables
- Handle discontinuous phrases, e.g., "put x on" ↔ "? x ?"
- Generalize
15. SCFG
- Formulation
- Two CFGs and their correspondences (a toy encoding follows below)
- Or, equivalently, a single grammar whose productions P have paired right-hand sides
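As a toy illustration of this formulation (my own encoding, not from the slides), each synchronous production can be stored as a left-hand side plus two right-hand sides whose nonterminals are paired by link indices; generating a string pair expands each linked nonterminal once and reuses the result on both sides:

import random

# Synchronous productions: (LHS, source RHS, target RHS).
# Nonterminals are (symbol, link index) pairs; terminals are plain strings.
# The English/Japanese-style vocabulary is invented for illustration.
rules = [
    ("S",   [("NP", 1), ("VP", 2)],  [("NP", 1), ("VP", 2)]),   # straight
    ("VP",  [("V", 1), ("OBJ", 2)],  [("OBJ", 2), ("V", 1)]),   # inverted: SVO -> SOV
    ("NP",  ["she"],                 ["kanojo-wa"]),
    ("OBJ", ["music"],               ["ongaku-o"]),
    ("V",   ["likes"],               ["sukidesu"]),
]

def generate(symbol):
    """Expand one nonterminal with a synchronous production; return (source, target)."""
    _, src_rhs, tgt_rhs = random.choice([r for r in rules if r[0] == symbol])
    grown, src, tgt = {}, [], []
    for item in src_rhs:
        if isinstance(item, tuple):              # linked nonterminal: expand once,
            grown[item[1]] = generate(item[0])   # reuse the expansion on both sides
            src += grown[item[1]][0]
        else:
            src.append(item)
    for item in tgt_rhs:
        tgt += grown[item[1]][1] if isinstance(item, tuple) else [item]
    return src, tgt

print(generate("S"))
# (['she', 'likes', 'music'], ['kanojo-wa', 'ongaku-o', 'sukidesu'])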
16. SCFG: an example
17. SCFG derivation
18. ITG
- Synchronous CFGs in which the links between nonterminals in a production are restricted to two possible configurations:
- Inverted
- Straight
- Any ITG can be converted into a synchronous CFG of rank two. (A sketch of the resulting reordering constraint follows.)
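Because every production is either straight or inverted, an ITG/BTG can realize exactly those permutations that decompose recursively into two contiguous blocks kept in order or swapped. A small sketch (my own illustration of this definition, not code from the slides) checks whether a permutation is reachable:

from functools import lru_cache

def itg_reachable(perm):
    """True iff perm (0-indexed) decomposes into straight/inverted binary splits."""
    n = len(perm)

    @lru_cache(maxsize=None)
    def ok(i, j):                        # source span [i, j)
        if j - i == 1:
            return True
        for k in range(i + 1, j):        # try every binary split point
            lo1, hi1 = min(perm[i:k]), max(perm[i:k])
            lo2, hi2 = min(perm[k:j]), max(perm[k:j])
            # each half must cover a contiguous target range ...
            if hi1 - lo1 + 1 != k - i or hi2 - lo2 + 1 != j - k:
                continue
            # ... and the two ranges must be adjacent: straight or inverted
            if (hi1 + 1 == lo2 or hi2 + 1 == lo1) and ok(i, k) and ok(k, j):
                return True
        return False

    return ok(0, n)

print(itg_reachable((2, 0, 3, 1)))   # False: 0-indexed (3,1,4,2), see slide 79
print(itg_reachable((1, 0, 3, 2)))   # True: two inverted swaps, combined straight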
19. BTG
20. ITG as reordering constraint
- Two kinds of reordering
- Inverted
- Straight
- Coverage
- Wu (1997): "been unable to find real examples of cases where alignments would fail under this constraint, at least in lightly inflected languages, such as English and Chinese."
- Wellington et al. (2006): "we found examples", in at least 5% of the Chinese/English sentence pairs.
- Weakness
- No strong mechanism for determining which order is better, inverted or straight.
21. Chiang (2005): Hierarchical Phrase-based Model (HPM)
- Rules: hierarchical rules with nonterminal gaps, plus the glue rules S → ⟨S X, S X⟩ and S → ⟨X, X⟩
- Model: log-linear
- Decoder: CKY
22. Chiang (2005): rule extraction
23. Chiang (2005): rule extraction restrictions
- Initial (base) rules: at most 15 words on the French side
- Final rules: at most 5 symbols on the French side
- At most two non-terminals on each side, and they must not be adjacent
- At least one aligned terminal pair (these filters are sketched in code below)
24. Chiang (2005): model
25. Chiang (2005): decoder
26. SSMT based on phrase structures
- Using grammars with linguistic knowledge
- The grammars are based on SCFG
- Two categories
- Tree-string
- Tree-to-string
- String-to-tree
- Tree-tree
27. Yamada & Knight 2001, 2003
28. Yamada's work vs. SCFG
- Insertion operation
- A → (w A1, A1)
- Reordering operation
- A → (A1 A2 A3, A1 A3 A2)
- Translating operation
- A → (x, y)
(All three operations are applied in the sketch below.)
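A minimal sketch of applying the three operations to a parse tree; the lookup tables are invented stand-ins for the model's learned distributions:

import random

def transform(node, reorder_table, insert_table, translate_table):
    """node is (label, children) for internal nodes, or a word string at a leaf."""
    if isinstance(node, str):                       # translating operation
        return random.choice(translate_table[node])
    label, children = node
    kids = [transform(c, reorder_table, insert_table, translate_table)
            for c in children]
    order = reorder_table.get((label, len(kids)), list(range(len(kids))))
    kids = [kids[i] for i in order]                 # reordering operation
    extra = insert_table.get(label)
    if extra:
        kids.append(extra)                          # insertion operation
    return (label, kids)

tree = ("VP", [("VB", ["adores"]), ("NP", ["music"])])
print(transform(tree,
                reorder_table={("VP", 2): [1, 0]},         # A -> (A1 A2, A2 A1)
                insert_table={"NP": "wo"},                 # insert a function word
                translate_table={"adores": ["daisuki desu"],
                                 "music": ["ongaku"]}))
# ('VP', [('NP', ['ongaku', 'wo']), ('VB', ['daisuki desu'])])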
29. Yamada: weaknesses
- Single-level mapping cannot capture multi-level reordering
- Yamada's fix: flatten the trees
- Word-based translation at the leaves
- Yamada's fix: phrasal leaves
30. Galley et al. 2004, 2006
- The translation model incorporates syntactic structure on the target-language side
- It is trained by learning translation rules from bilingual data
- The decoder uses a parser-like method to create syntactic trees as output hypotheses
31. Translation rules
- Translation rules
- Target side: multi-level subtrees
- Source side: continuous or discontinuous phrases
- Types of translation rules
- Translating source phrases into target chunks
- NPB(PRP/I) ↔ ??
- NP-C(NPB(DT/this NN/address)) ↔ ??? ??
(Here and below, "?" marks stand for Chinese characters lost in extraction.)
32. Types of translation rules
- Rules with variables
- NP-C(NPB(PRP/my x0:NN)) ↔ ?? ? x0
- PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ ? x0 ??
- Rules that combine previously translated results
- VP(x0:VBZ x1:NP-C) ↔ x1 x0
- This rule takes a noun phrase followed by a verb, switches their order, then combines them into a new verb phrase
33. Rule extraction
- Word-align a parallel corpus
- Parse the target side
- Extract translation rules (the frontier test at the heart of this step is sketched below)
- Minimal rules: cannot be decomposed
- Composed rules: built by composing minimal rules
- Estimate probabilities
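The heart of the extraction step is deciding which target-tree nodes may anchor a rule: a node qualifies when the closure of its source span avoids every source position aligned to words outside the node. A simplified sketch of that test (my own reconstruction of the GHKM-style span check; the data structures are invented):

def leaves(node):
    """Indices of the target words in the yield of node = (label, children)."""
    out = []
    for c in node[1]:
        out.extend([c] if isinstance(c, int) else leaves(c))
    return out

def is_frontier(node, align, all_leaves):
    """align maps a target word index to the set of source positions it links to."""
    inside = set(leaves(node))
    cov = set().union(*[align.get(i, set()) for i in inside])
    if not cov:
        return False
    closure = set(range(min(cov), max(cov) + 1))
    outside = all_leaves - inside
    comp = set().union(*[align.get(i, set()) for i in outside]) if outside else set()
    return not (closure & comp)     # frontier iff closed span avoids the complement

tree = ("S", [("NP", [0]), ("VP", [1, 2])])     # target tree over words 0..2
align = {0: {1}, 1: {0}, 2: {2}}                # crossing word alignment
print(is_frontier(tree[1][0], align, set(leaves(tree))))   # NP: True
print(is_frontier(tree[1][1], align, set(leaves(tree))))   # VP: False (span gap)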
34. Rule extraction
Minimal rule
35. Composed rules
36. The Format is Expressive
[Figure: example rules covering non-constituent phrases (hay x0 ↔ there are x0), phrasal translation (está cantando ↔ is singing), non-contiguous phrases (poner x0 ↔ put x0 on), multilevel re-ordering, lexicalized re-ordering, and context-sensitive word insertion.]
(Knight & Graehl, 2005)
37. Decoder
- Probabilistic CKY-style parsing algorithm with beams
- Results in an English syntax tree corresponding to the Chinese sentence
- Guarantees the output has some kind of globally coherent syntactic structure
38.-42. Decoding example (one derivation step per slide)
43. Marcu et al. 2006
- SPMT
- Integrates non-syntactifiable phrases
- Multiple features for each rule
- Decoding with multiple models
44. SSMT based on phrase structures
- Two categories
- Tree-string
- String-to-tree
- Tree-to-string
- Tree-tree
45. Tree-to-string
- Liu et al. 2006
- The tree-to-string alignment template (TAT) model
46. TAT
47. TAT extraction
- Constraints
- Source trees have to be subtrees
- Have to be consistent with the word alignment
- Restrictions on extraction
- Both the first and last symbols in the target string must be aligned to some source symbols
- The height of T(z) is limited to no greater than h
- The number of direct descendants of a node of T(z) is limited to no greater than c
(These filters are sketched in code below.)
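A small sketch of these restrictions as a predicate; the tree encoding and the defaults h=3, c=5 are assumptions for illustration, and the subtree and alignment-consistency checks are left to the caller:

def height(node):                      # node = (label, children) or a word string
    if isinstance(node, str):
        return 0
    return 1 + max((height(c) for c in node[1]), default=0)

def max_branching(node):
    if isinstance(node, str):
        return 0
    return max([len(node[1])] + [max_branching(c) for c in node[1]])

def admissible(tree, target, aligned_positions, h=3, c=5):
    """aligned_positions: target-string indices aligned to some source symbol."""
    if not target:
        return False
    if 0 not in aligned_positions or len(target) - 1 not in aligned_positions:
        return False                   # first and last target symbols must align
    return height(tree) <= h and max_branching(tree) <= c

tree = ("NP", [("DT", ["the"]), ("NN", ["card"])])
print(admissible(tree, ["la", "tarjeta"], {0, 1}))   # True
print(admissible(tree, ["la", "tarjeta"], {1}))      # False: first symbol unaligned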
48. TAT model
49. Decoding
50. Tree-to-string vs. string-to-tree
- Tree-to-string
- Integrates source structures into translation and reordering
- The output may not be grammatical
- String-to-tree
- Guarantees the output has some kind of globally coherent syntactic structure
- Cannot use any knowledge from source structures
51. SSMT based on phrase structures
- Two categories
- Tree-string
- String-to-tree
- Tree-to-string
- Tree-tree
52. Tree-tree
- Synchronous tree-adjoining grammar (STAG)
- Synchronous tree substitution grammar (STSG)
53. STAG
54. STAG derivation
55. STSG
56. STSG elementary trees
57. Dependency structures
[Figure: the same Chinese sentence shown as (a) a phrase-structure parse (IP, NP, VP, ADJP nodes over NR/NN/JJ/VV tags) and (b) a dependency structure; the Chinese characters were lost in extraction.]
58. For MT: dependency structures vs. phrase structures
- Advantages of dependency structures over phrase structures for machine translation:
- Inherent lexicalization
- Closer to meaning
- Better representation of divergences across languages
59. SSMT based on dependency structures
- Lin 2004: "A Path-based Transfer Model for Machine Translation"
- Quirk et al. 2005: "Dependency Treelet Translation: Syntactically Informed Phrasal SMT"
- Ding et al. 2005: "Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars"
60. Lin 2004
- Translation model trained by learning transfer rules from a bilingual corpus where the source-language sentences are parsed
- Decoding: finding the minimum path covering of the source-language dependency tree
61. Lin 2004: path
62. Lin 2004: transfer rule
63. Quirk et al. 2005
- Translation model trained by learning treelet pairs from a bilingual corpus where the source-language sentences are parsed
- Decoding: CKY-style
64. Treelet pairs
65. Quirk 2005: decoding
66. Ding 2005
67. Summary
68. State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata. Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering. Recently, several models based on context-free grammars have been investigated, borrowing from compiler theory the idea of synchronous rewriting.
Slides from G. Satta
69. Translation models based on synchronous rewriting:
- Inversion Transduction Grammars (Wu, 1997)
- Head Transducer Grammars (Alshawi et al., 2000)
- Tree-to-string models (Yamada & Knight, 2001; Galley et al., 2004)
- Loosely tree-based model (Gildea, 2003)
- Multi-Text Grammars (Melamed, 2003)
- Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all of these.
70. A synchronous context-free grammar (SCFG) is based on three components:
- A context-free grammar (CFG) for the source language
- A CFG for the target language
- A pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
71. Example (Yamada & Knight, 2001)
Source (English) grammar:
VB → PRP(1) VB1(2) VB2(3)
VB2 → VB(1) TO(2)
TO → TO(1) NN(2)
PRP → he
VB1 → adores
VB → listening
TO → to
NN → music

Target (Japanese) grammar:
VB → PRP(1) VB2(3) VB1(2)
VB2 → TO(2) VB(1) ga
TO → NN(2) TO(1)
PRP → kare ha
VB1 → daisuki desu
VB → kiku no
TO → wo
NN → ongaku
72. Example (cont'd)
73. A pair of CFG productions in an SCFG is called a synchronous production. An SCFG generates pairs of trees/strings, where each component is a translation of the other. An SCFG can be extended with probabilities: each pair of productions is assigned a probability, and the probability of a pair of trees is the product of the probabilities of the synchronous productions involved. (A toy computation follows.)
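A toy numeric illustration of that product rule; the synchronous productions and their probabilities are invented:

# Invented synchronous productions with invented probabilities.
rule_probs = {
    ("S -> NP VP",  "S -> NP VP"):      0.9,
    ("VP -> V NP",  "VP -> NP V"):      0.4,
    ("NP -> she",   "NP -> kanojo-wa"): 0.5,
    ("NP -> music", "NP -> ongaku-o"):  0.2,
    ("V -> likes",  "V -> sukidesu"):   0.3,
}

p = 1.0
for pair in rule_probs:            # this derivation uses each production once
    p *= rule_probs[pair]
print(p)                           # 0.9 * 0.4 * 0.5 * 0.2 * 0.3 = 0.0108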
74. The membership problem (Wu, 1997) for SCFGs is defined as follows. Input: an SCFG and a pair of strings w1, w2. Output: Yes/No, depending on whether w1 translates into w2 under the SCFG. Applications in segmentation, word alignment and bracketing of parallel corpora. The assumption that the SCFG is part of the input is made here to investigate the dependency of problem complexity on grammar size.
75. Result: the membership problem for SCFGs is NP-complete. The proof uses SCFG derivations to explore the space of consistent truth assignments that satisfy a source 3SAT instance. Remarks: the result transfers to (Yamada & Knight, 2001), (Gildea, 2003), and (Melamed, 2003), which are at least as powerful as SCFG.
76. Remarks (cont'd)
- The problem can be solved in polynomial time if the input grammar is fixed or the production length is bounded (Melamed, 2004), as in:
- Inversion Transduction Grammars (Wu, 1997)
- Head Transducer Grammars (Alshawi et al., 2000)
- For NLP applications, it is more realistic to assume a fixed grammar and a varying input string
77. Providing an exponential-time lower bound for the membership problem would amount to showing P ≠ NP. But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures that we use to solve the problem. Result: if chart-parsing techniques are used to solve the membership problem for SCFG, a number of partial analyses is obtained that grows exponentially with the production length of the input grammar.
78. Chart parsing for CFGs works by combining completed constituents with partial analyses:
A → B1 B2 B3 ... Bn
Three indices are used to process each combination, for a total of O(n³) possible combinations that must be checked, where n is the length of the input string, as the sketch below makes explicit.
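A minimal CKY recognizer (toy grammar and sentence, my own illustration) makes the three indices explicit; the synchronous case on the next slides breaks exactly the contiguity this relies on:

words = ["she", "likes", "music"]
lex = {"she": "NP", "likes": "V", "music": "NP"}
rules = [("S", "NP", "VP"), ("VP", "V", "NP")]   # binarized A -> B C rules

n = len(words)
chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
for i, w in enumerate(words):
    chart[i][i + 1].add(lex[w])

for span in range(2, n + 1):
    for i in range(n - span + 1):          # index 1: left edge of A
        j = i + span                       # index 2: right edge of A
        for k in range(i + 1, j):          # index 3: split point between B and C
            for A, B, C in rules:
                if B in chart[i][k] and C in chart[k][j]:
                    chart[i][j].add(A)

print("S" in chart[0][n])                  # True: the sentence is recognized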
79. Consider the synchronous production
⟨A → B(1) B(2) B(3) B(4), A → B(3) B(1) B(4) B(2)⟩
representing the permutation (3, 1, 4, 2).
80. When applying chart parsing, there is no way to keep partial analyses contiguous.
81. The proof of our result generalizes the previous observations. We show that, for some worst-case permutations of length q, any combination strategy we choose leads to a number of indices growing at least as fast as sqrt(q). Then, for SCFGs of size q, sqrt(q) gives an asymptotic lower bound (in the exponent) for the membership problem when chart-parsing algorithms are used.
82. A probabilistic SCFG provides the probability that tree t1 translates into tree t2: Pr(t1, t2). Accordingly, we can define the probability that string w1 translates into string w2,
Pr(w1, w2) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr(t1, t2),
and the probability that string w translates into tree t,
Pr(w, t) = Σ_{t1 ⇒ w} Pr(t1, t).
83. The string-to-tree translation problem for probabilistic SCFGs is defined as follows. Input: a probabilistic SCFG and a string w. Output: a tree t such that Pr(w, t) is maximized. Application in machine translation. Again, the assumption that the SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size.
84. Result: the string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard. The proof reduces from the consensus problem: strings generated by a probabilistic finite automaton or hidden Markov model have probabilities defined as a sum of probabilities over several paths, and maximizing such a summation is NP-hard (Casacuberta & de la Higuera, 2000; Lyngsø & Pedersen, 2002).
85. Remarks: the complexity of the problem comes from the fact that several source trees can be translated into the same target tree. The result persists if there is a constant bound on the length of synchronous productions. Open: can the problem be solved in polynomial time if the probabilistic SCFG is fixed?
86. Learning Non-Isomorphic Tree Mappings for Machine Translation
[Figure: a pair of aligned, non-isomorphic dependency trees relating "wrongly report events to-John" to "misinform him of the events".]
Slides from J. Eisner
87. Syntax-Based Machine Translation
- Previous work assumes essentially isomorphic trees
- Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
- But trees are not isomorphic!
- Discrepancies between the languages
- Free translation in the training data
88. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
89. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: aligned trees over "donnent (give), baiser (kiss), à (to), Sam, un (a), beaucoup (lots), d' (of), enfants (kids)" and "kiss, Sam, kids, quite, often".]
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
90. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange, along with a much worse alignment.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
91. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
92. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees.
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
93.-98. Grammar = Set of Elementary Trees (the inventory is built up over six slides)
99. Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
i.e., the product of the probabilities of the little trees that are used.
100. Form of model of big tree pairs
In synchronous TSG, an aligned big tree pair is generated by choosing a sequence of little tree pairs:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
Joint model P_θ(T1, T2). Wise to use the noisy-channel form P_θ(T1 | T2) · P_θ(T2), where the channel model must be trained on paired trees (hard to get) but P_θ(T2) could be trained on zillions of target-language trees. But any joint model will do.
101. Maxent model of little tree pairs
p(t1, t2, a | n) is modeled log-linearly with FEATURES such as:
- report+wrongly → misinform? (use dictionary)
- report → misinform? (at root)
- wrongly → misinform?
- verb incorporates adverb child?
- verb incorporates child 1 of 3?
- children 2, 3 switch positions?
- common tree sizes and shapes?
- ... etc. ...
(A feature-extraction sketch follows.)
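A sketch of how such binary features might be read off a little-tree pair; the feature names and toy dictionary are invented, and the real feature set is richer:

def features(t1_root, t2_root, t1_words, t2_words, dictionary):
    """Binary features over a little-tree pair, keyed by descriptive strings."""
    f = {}
    f["roots:%s->%s" % (t1_root, t2_root)] = 1.0        # root-to-root pairing
    for w1 in t1_words:
        for w2 in t2_words:
            if (w1, w2) in dictionary:                  # bilingual dictionary hit
                f["dict:%s->%s" % (w1, w2)] = 1.0
    f["sizes:%d->%d" % (len(t1_words), len(t2_words))] = 1.0   # tree sizes
    return f

feats = features("report", "misinform", ["report", "wrongly"], ["misinform"],
                 dictionary={("report", "misinform")})
print(sorted(feats))
# A log-linear model then sets p(t1, t2, a | n) proportional to exp(Σ θ_k · f_k).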
102. Inside Probabilities
[Figure: β(c1, c2) for a pair of tree nodes, illustrated on the "wrongly report events to-John" / "misinform him of the events" tree pair.]
103. Inside Probabilities
[Figure: the same tree pair; only O(n²) node pairs (c1, c2) need inside probabilities.]
104. P(T1, T2, A) = ∏ p(t1, t2, a | n)
- Alignment: find A to max P_θ(T1, T2, A)
- Decoding: find T2, A to max P_θ(T1, T2, A)
- Training: find θ to max Σ_A P_θ(T1, T2, A)
- Do everything on little trees instead!
- Only need to train and decode a model of p_θ(t1, t2, a)
- But we are not sure how to break up a big tree correctly
- So try all possible little trees and all ways of combining them, by dynamic programming
105. Alignment Pseudocode
for each node c1 of T1 (bottom-up)
  for each possible little tree t1 rooted at c1
    for each node c2 of T2 (bottom-up)
      for each possible little tree t2 rooted at c2
        for each matching a between frontier nodes of t1 and t2
          p = p(t1, t2, a)
          for each pair (d1, d2) of frontier nodes matched by a
            p = p · β(d1, d2)    // inside probability of the kids
          β(c1, c2) += p         // our inside probability
- Nonterminal states are used in practice but not shown here
- For EM training, also find outside probabilities
(A simplified runnable version follows.)
106. An MT Architecture
[Diagram: a dynamic-programming engine underlies both a Trainer, which scores all alignments of two big trees T1, T2, and a Decoder, which scores all alignments between a big tree T1 and a forest of big trees T2. Both sit on a probability model p_θ(t1, t2, a) of little trees that scores little-tree pairs, proposes translations t2 of a little tree t1, and updates the parameters θ.]
107. Related Work
- Synchronous grammars (Shieber & Schabes 1990)
- Statistical work has allowed only 1-to-1 (isomorphic trees)
- Stochastic inversion transduction grammars (Wu 1995)
- Head transducer grammars (Alshawi et al. 2000)
- Statistical tree translation
- Noisy-channel model (Yamada & Knight 2000)
- Infers trees; trains on (string, tree) pairs, not (tree, tree) pairs
- But again, allows only 1-to-1, plus 1-to-0 at the leaves
- Data-oriented translation (Poutsma 2000)
- Synchronous DOP model trained on already-aligned trees
- Statistical tree generation
- Similar to our decoding: construct a forest of appropriate trees, pick the highest-probability one
- Dynamic programming search in a packed forest (Langkilde 2000)
- Stack decoder (Ratnaparkhi 2000)
108. What Is New Here?
- Learning full elementary tree pairs, not rule pairs or subcat pairs
- Previous statistical formalisms have basically assumed isomorphic trees
- Maximum-entropy modeling of elementary tree pairs
- New, flexible formalization of synchronous Tree Substitution Grammar
- Allows either dependency trees or phrase-structure trees
- Empty trees permit insertion and deletion during translation
- Concrete enough for implementation (cf. informal previous descriptions)
- TSG is more powerful than CFG for modeling trees, but faster than TAG
- Observation that dynamic programming is surprisingly fast
- Finds all possible decompositions into aligned elementary tree pairs
- O(n²) if both input trees are fully known and elementary tree size is bounded