Title: Inversion Transduction Grammar with Linguistic Constraints
1. Inversion Transduction Grammar with Linguistic Constraints
- Colin Cherry
- University of Alberta
2. Edmonton Weather (Tuesday)
3. Outline
- Bitext and Bitext Parsing
- Inversion Transduction Grammar (ITG)
- ITG with Linguistic Constraints
- Discriminative ITG with Linguistic Features
- Other Projects
4. Statistical Machine Translation
- Input: a source-language sentence E
- Goal: produce a well-formed target-language sentence F with the same meaning as E
- Process: decoding searches for an operation sequence O that transforms E into F
- Weights on individual operations are determined empirically from examples of translation
5. Bitext
- Text in English; the same text in French
- A valuable resource for training and testing statistical machine translation systems
- Large-scale examples of translation
- Needs analysis to determine the small-scale operations that generalize to unseen sentences
6. Word Alignment
- Given a sentence and its translation, find the word-to-word connections
[Figure: "the minister in charge of the Canadian Wheat Board" shown beside "le ministre chargé de la Commission Canadienne du blé"]
7. Word Alignment
- Given a sentence and its translation, find the word-to-word connections
- Link: a single word-to-word connection
[Figure: the same sentence pair, with one word-to-word link drawn]
8. Given a Word Alignment
- Extract bilingual phrase pairs for phrasal SMT (Koehn et al. 2003)
- Add in a parse tree and:
  - Extract treelet pairs for dependency translation (Quirk et al. 2005)
  - Extract rules for a tree transducer (Galley et al. 2004)
- Other fun things:
  - Train monolingual paraphrasers (Quirk et al. 2004, Callison-Burch et al. 2005)
9. Bitext Parsing
- Assume a context-free grammar generates two languages at once
- Like joint models, but the position of the words in both languages is controlled by the grammar
10. Monolingual Parsing
- Non-terminals: S, NP, VP, V, Det, Adj, N, Adv
- Example production: NP → Adj N
- Terminals: he, always, verbs, the, adjective, noun
11. Another View
- S → NP VP
- VP → V NP
[Figure: parse tree for "he always verbs the adjective noun"]
12–18. Bitext Parsing is in 2D
[Figure, built up over seven slides: a 2D chart with English on one axis and French on the other; S splits into NP and VP, then V, NP, Adv, Det, Adj, N, until "he always verbs the adjective noun" is parsed simultaneously with "il verbe toujours le nom adjectif"]
19. Why Bitext Parsing?
- Established polynomial algorithms
- Flexible framework; easy to add information:
  - Parse given an alignment
  - Align given a parse (this work)
- Discoveries can be ported to parser-based decoders (Zens et al. 2004, Melamed 2004)
- Advances in parsing can be ported to word alignment
20. Outline
- Bitext and Bitext Parsing ✓
- Inversion Transduction Grammar (ITG)
- ITG with Linguistic Constraints
- Discriminative ITG with Linguistic Features
- Other Projects
21. Inversion Transduction Grammar
- Introduced by Wu (1997)
- Transduction: N → noun / nom
- Inversion:
  - NP → Det NP (straight)
  - NP → ⟨Adj N⟩ (inverted)
[Figure: the straight and inverted expansions of N → noun / nom]
22. Binary Bracketing
- A → A A
- A → ⟨A A⟩
- A → e/f
- No linguistic meaning to A
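The three bracketing rules can be read as a tiny interpreter: straight productions keep target-side order, inverted ones flip it. A minimal sketch (the tuple encoding of derivations and the example tree are mine, not from the talk; the sentence pair comes from the earlier 2D parsing slides):

```python
# A minimal sketch of the binary-bracketing ITG above.
# A derivation node is one of:
#   ("term", e, f)            -- A -> e/f
#   ("straight", left, right) -- A -> A A   (same order in both languages)
#   ("inverted", left, right) -- A -> <A A> (target order reversed)

def itg_yield(node):
    """Return (source_words, target_words) produced by an ITG derivation."""
    kind = node[0]
    if kind == "term":
        _, e, f = node
        return [e], [f]
    _, left, right = node
    le, lf = itg_yield(left)
    re, rf = itg_yield(right)
    if kind == "straight":
        return le + re, lf + rf
    return le + re, rf + lf  # inverted: flip only the target side

tree = ("straight",
        ("term", "the", "le"),
        ("inverted",
         ("term", "adjective", "adjectif"),
         ("term", "noun", "nom")))

src, tgt = itg_yield(tree)
print(src)  # ['the', 'adjective', 'noun']
print(tgt)  # ['le', 'nom', 'adjectif']
```

The single inverted node is what lets "adjective noun" come out as "nom adjectif" while the rest of the sentence stays in order.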
23. Tree Visualization
24. Pros and Cons of Bracketing
- Pros:
  - Language independent
  - Straightforward and fast
  - Symbols are minimally restrictive
- Cons:
  - The grammar is meaningless
  - The ITG constraint
25. ITG Constraint
[Figure: the phrases "are acceptable", "to the commission", "Mr Burton", "fully or in part" shown under two reorderings; one is reachable with ITG, while the inside-out reordering is ruled out]
26. Outline
- Bitext and Bitext Parsing ✓
- Inversion Transduction Grammar (ITG) ✓
- ITG with Linguistic Constraints
- Discriminative ITG with Linguistic Features
- Other Projects
27. Some Questions
- Those ITG constraints are kind of scary. How bad are they? Do they ever help?
- Can we inject some linguistics into this otherwise purely syntactic process?
- A linguistic grammar would limit the trees that can be built, and therefore limit the alignments
28. Alignment Spaces
- The set of feasible alignments for a sentence pair
- Described by how links interact
- If links don't interact, the problem loses its structure
- Should encourage competition between links (guidance)
- Should not eliminate correct alignments (expressiveness)
29. ITG Space
- Rules out inside-out alignments
- Limits how concepts can be re-ordered during translation
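The "inside-out" alignments ruled out here correspond to permutations containing the 2413 or 3142 pattern; everything else can be built with straight and inverted binary rules. A small sketch of a reachability test, using a block-merging reduction (my formulation for illustration, not an algorithm from the talk):

```python
# Sketch: is a target-position permutation reachable under ITG?
# Repeatedly merge adjacent blocks whose positions form a contiguous range;
# the permutation is ITG-reachable iff everything collapses to one block.

def itg_reachable(perm):
    blocks = [(p, p) for p in perm]  # (min, max) position of each block
    changed = True
    while changed and len(blocks) > 1:
        changed = False
        for i in range(len(blocks) - 1):
            lo = min(blocks[i][0], blocks[i + 1][0])
            hi = max(blocks[i][1], blocks[i + 1][1])
            size = (blocks[i][1] - blocks[i][0] + 1) + (blocks[i + 1][1] - blocks[i + 1][0] + 1)
            if hi - lo + 1 == size:  # together they cover a contiguous range
                blocks[i:i + 2] = [(lo, hi)]
                changed = True
                break
    return len(blocks) == 1

print(itg_reachable([2, 0, 3, 1]))  # False: the classic inside-out (3-1-4-2) case
print(itg_reachable([1, 0, 3, 2]))  # True: two inverted pairs under a straight root
```

The failing case is exactly the four-phrase reordering shown on the ITG-constraint slide.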
30. Permutation Space
- One-to-one: each word is in at most one link
- Allows any permutation of concepts
- Reduces to weighted maximum matching if each link can be scored independently
[Figure: "the tax causes unrest" aligned with "l'impôt cause le malaise"]
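The "reduces to weighted maximum matching" claim can be illustrated on this slide's sentence pair: with independently scored links, the best one-to-one alignment simply maximizes the sum of chosen scores. The association scores below are invented, and brute force over permutations stands in for a real matching algorithm at this toy size:

```python
# Sketch: one-to-one alignment as weighted maximum matching.
from itertools import permutations

english = ["the", "tax", "causes", "unrest"]
french = ["l'", "impôt", "cause", "le", "malaise"]

# score[i][j]: made-up association score between english[i] and french[j]
score = [
    [0.9, 0.1, 0.0, 0.6, 0.0],  # "the"
    [0.1, 0.8, 0.1, 0.0, 0.0],  # "tax"
    [0.0, 0.1, 0.9, 0.0, 0.1],  # "causes"
    [0.0, 0.0, 0.0, 0.1, 0.7],  # "unrest"
]

def best_matching(score):
    """Brute-force the matching that maximizes summed link scores."""
    n, m = len(score), len(score[0])
    best, best_links = float("-inf"), None
    for cols in permutations(range(m), n):  # each English word picks one French word
        total = sum(score[i][j] for i, j in enumerate(cols))
        if total > best:
            best, best_links = total, list(enumerate(cols))
    return best_links

print(best_matching(score))
# [(0, 0), (1, 1), (2, 2), (3, 4)]: the-l', tax-impôt, causes-cause, unrest-malaise
```

A real system would use a polynomial matching algorithm (e.g. the Hungarian method) instead of enumeration.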
31–32. Linguistic Source: Dependencies
- Tree structure defines dependencies between words
- Subtrees define contiguous phrases
[Figure: dependency tree over "the minister in charge of ..."]
33. Phrasal Cohesion
- Syntactic phrases in a tree tend to stay together after translation (Fox 2002)
- We can use this idea to constrain an alignment, given an English dependency tree
- Shown to improve alignment quality (Lin and Cherry 2003)
34–35. Example
[Figure: dependency tree over "the tax causes unrest", aligned with "l'impôt cause le malaise"; a candidate link joins "the" to "le"]
- We can rule out the link, even with no one-to-one violation
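The cohesion check behind this example can be sketched as follows. The dependency encoding and the link sets are my rendering of the slide's sentence pair; the test is that every subtree's image in the translation forms a block with no intruding link from outside the subtree:

```python
# Sketch of the phrasal-cohesion check on a dependency tree.

def subtree(deps, head):
    """Collect head plus all descendants. deps[i] is the head of word i (-1 = root)."""
    nodes = {head}
    added = True
    while added:
        added = False
        for child, h in enumerate(deps):
            if h in nodes and child not in nodes:
                nodes.add(child)
                added = True
    return nodes

def cohesive(deps, links):
    """links: (source_index, target_index) pairs. Cohesive iff every subtree's
    target positions form a block that no outside link intrudes on."""
    for head in range(len(deps)):
        inside = subtree(deps, head)
        tgt = [t for s, t in links if s in inside]
        if not tgt:
            continue
        lo, hi = min(tgt), max(tgt)
        for s, t in links:
            if s not in inside and lo <= t <= hi:
                return False
    return True

# "the tax causes unrest" vs. "l' impôt cause le malaise"
deps = [1, 2, -1, 2]  # the->tax, tax->causes, causes=root, unrest->causes
good = [(0, 0), (1, 1), (2, 2), (3, 4)]
bad = [(0, 3), (1, 1), (2, 2), (3, 4)]  # "the"-"le" splits the [the tax] subtree

print(cohesive(deps, good))  # True
print(cohesive(deps, bad))   # False
```

Note that the bad alignment is still one-to-one; only the cohesion constraint rules it out, which is the point of the example.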
36. ITG vs. Dependency
- Both limit movement with phrasal cohesion
- ITG: cohesive in some binary tree
- Dep: cohesive in the provided dependency tree
- Neither is a subspace of the other:
  - "the big red dog": Dep ✓, ITG ✗
  - "the dog ate it": Dep ✗, ITG ✓
37. D-ITG Space
- Forces ITG to maintain phrasal cohesion with a provided dependency tree
- Intersects the ITG and Dependency spaces
- Adds a linguistic dependency tree to ITG parsing
38. Chart Modification Solution
- Eliminate structures that allow "tax" to invert away from "the"
[Figure: dependency tree over "the tax causes unrest"]
39–40. Effect on Parser
[Figure, two steps: ITG chart cells over "the tax causes unrest" × "l'impôt cause le malaise"; the analysis grouping "causes" with "tax" away from "the" is marked ✗, while the analyses keeping "the tax" together are marked ✓]
41. Continuum of Constraints
- From least to most constrained: Unconstrained → Permutation → ITG → D-ITG
42. Experimental Setup
- English-French parliamentary debates
- 500-sentence labeled test set (Och and Ney, 2003)
- Dependency parses from Minipar
43. Guidance Test
- Does the space stop incorrect alignments?
- Use a weighted link score built from:
  - Bilingual correlations between words
  - The relative position of tokens
- Maximize summed link scores in all spaces, and check alignment error rate
- AER: combined precision and recall; lower is better
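AER, as defined by Och and Ney (2003), combines sure links S and possible links P (with S ⊆ P) against the proposed alignment A. A quick sketch of the arithmetic, with made-up link sets:

```python
# Sketch of alignment error rate (AER); lower is better.
# Link sets are invented to show the computation, not taken from the talk.

def aer(A, S, P):
    """AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), with sure S ⊆ possible P."""
    A, S, P = set(A), set(S), set(P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

S = {(0, 0), (1, 1), (2, 2)}       # sure links
P = S | {(3, 4)}                    # possible links include all sure links
A = {(0, 0), (1, 1), (3, 4), (3, 3)}  # proposed alignment

print(round(aer(A, S, P), 3))  # 0.286
```

A perfect alignment that covers all sure links and proposes nothing outside P gets an AER of 0.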
44. Guidance Results
45. Expressiveness Test
- Given a strong model, does the space hold us back?
- Use a cooked link score from the gold standard: only correct links are given positive scores
- The best space is the unconstrained space
- Maximize summed link scores in all spaces, and check recall
46. Expressiveness Results
47. Contributions
- Algorithmic: a method to inject ITG with linguistic constraints
- Experimental:
  - ITG constraints provide guidance, with virtually no loss in expressiveness (French-English)
  - Dependency cohesion constraints provide greater guidance, at the cost of some expressiveness
48. Outline
- Bitext and Bitext Parsing ✓
- Inversion Transduction Grammar (ITG) ✓
- ITG with Linguistic Constraints ✓
- Discriminative ITG with Linguistic Features
- Other Projects
49. Remaining Problems
- Dependency cohesion stops correct links: parse errors, paraphrase, exceptions
- Would like a soft constraint
- I'm not doing much learning: χ² competitive linking with an ITG search
50. Soft Constraint
- Invalid spans need not be disallowed
- Instead, the parser could incur a penalty
- Easy to incorporate the penalty into the DP
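A minimal sketch of the penalty idea: an analysis that violates cohesion keeps its chart item but pays a fixed cost, so it can still win when its base score is high enough. The penalty value and scoring function here are invented; in the talk's setting the penalty weight would be learned, not hand-set:

```python
# Sketch: soft cohesion constraint as a score penalty instead of pruning.

COHESION_PENALTY = 2.5  # illustrative; would be a learned weight

def span_score(base_score, violates_cohesion, penalty=COHESION_PENALTY):
    """Score a chart item; cohesion-violating spans survive but pay a price."""
    return base_score - (penalty if violates_cohesion else 0.0)

# A violating analysis can still beat a cohesive one if its evidence is strong:
print(span_score(10.0, True) > span_score(6.0, False))  # True: 7.5 > 6.0
print(span_score(5.0, True) > span_score(6.0, False))   # False: 2.5 < 6.0
```

This is exactly what turns the hard D-ITG constraint into a feature the learner can weigh against the link scores.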
51. ITG Learning
- Zhang and Gildea 2004, 2005, 2006:
  - Expectation Maximization to parameterize a stochastic grammar, unsupervised
  - Driven by an expensive 2D inside-outside algorithm
  - Not doing much better than I am with χ²
- Meanwhile, EMNLP 2005 is happening: Moore 2005, Taskar et al. 2005
- Suddenly it's okay to use some training data
52. Discriminative Matching (Taskar et al. 2005)
- Example link "causes" ↔ "cause":
  - Features: χ² 0.767, DIST 0.050, LCSR 0.833, HMM 0.0
  - Learned weights: 6.0, −0.9, 0.2, 2.0
  - Link score (dot product): 4.72
- Max matching finds the alignment that maximizes the sum of link scores
- The entire alignment y can be given a feature vector Φ(y) according to the features of the links in y
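The link score is just a dot product between a link's feature vector and the learned weights. The feature values below are the ones on the slide; the weights are my reading of the slide's digit-garbled numbers, so treat them as illustrative:

```python
# Sketch of a discriminative link score: features · learned weights.

features = {"chi2": 0.767, "DIST": 0.050, "LCSR": 0.833, "HMM": 0.0}
weights = {"chi2": 6.0, "DIST": -0.9, "LCSR": 0.2, "HMM": 2.0}  # assumed values

def link_score(features, weights):
    """Dot product of a link's feature vector with the weight vector."""
    return sum(weights[name] * value for name, value in features.items())

print(round(link_score(features, weights), 2))  # 4.72
```

Summing these scores over all links in an alignment gives the whole-alignment score w · Φ(y) that the matching algorithm maximizes.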
53. Learning Objective
- Find weights w such that, for each example i and every incorrect output y:
  w · Φ(yᵢ) ≥ w · Φ(y) + Δ(yᵢ, y)
  where Δ is the structured distance, Φ the features, and w the learned weights
- Can formulate as a constrained optimization problem and do max-margin training
- Problem: an exponential number of wrong answers
54. SVM Struct (Tsochantaridis et al. 2004)
- Constrained optimization over the weights w
- Start with an empty constraint set
- Repeat: search for the most violated constraint, add it to the accumulated constraints, and re-solve
- The theory of constraint generation in constrained optimization guarantees convergence
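The constraint-generation loop can be sketched on a toy three-class problem. Everything here is invented for illustration: the features, the 0/1 loss, and the crude subgradient update that stands in for SVM Struct's real QP solver. What it does preserve is the shape of the loop: find the most violated margin constraint, accumulate it, re-fit, and stop when no constraint is violated:

```python
# Toy sketch of cutting-plane constraint generation for a structured SVM.

def feats(y):  # feature vector for candidate output y (made up)
    return {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}[y]

def loss(y_true, y):  # structured loss, 0/1 here for simplicity
    return 0.0 if y == y_true else 1.0

def score(w, y):
    return sum(wi * fi for wi, fi in zip(w, feats(y)))

def most_violated(w, y_true):
    # argmax over wrong answers of loss-augmented score
    return max((y for y in (0, 1, 2) if y != y_true),
               key=lambda y: loss(y_true, y) + score(w, y))

def fit(y_true, rounds=50, lr=0.1):
    w = [0.0, 0.0]
    constraints = []  # start with no constraints
    for _ in range(rounds):
        y_bad = most_violated(w, y_true)
        margin = score(w, y_true) - score(w, y_bad)
        if margin >= loss(y_true, y_bad):  # nothing violated: converged
            break
        constraints.append(y_bad)
        for y in constraints:  # crude update toward satisfying all constraints
            for i, (ft, fb) in enumerate(zip(feats(y_true), feats(y))):
                w[i] += lr * (ft - fb)
    return w

w = fit(y_true=0)
print(score(w, 0) > score(w, 1) and score(w, 0) > score(w, 2))  # True
```

In the real algorithm the inner step solves a quadratic program over the accumulated constraints, which is what yields the max-margin guarantee.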
55. Similarities to Averaged Perceptron
- An online method driven by comparisons of the current output to the correct answer
- But it:
  - Allows a notion of structural distance
  - Returns a max-margin solution (with slacks) at each step
  - Remembers all of its past mistakes
56. SVM-ITG
- Can learn ITG parameters discriminatively
- Link productions A → e/f are scored as in discriminative matching
- Non-terminal productions A → A A, ⟨A A⟩ are scored with two features:
  - Is it inverted?
  - Does it cover a span that would usually be illegal?
[Figure: the "causes" ↔ "cause" link and its features, rewritten as the production A → causes / cause]
57. Experimental Setup
- Identical to Taskar et al.: 100 training, 37 development, 347 test sentence pairs
- Same unsupervised text as before to derive features: 50k Hansards data
58. Results
[Chart: bipartite matching SVM (Permutation) versus SVM weights with a hard constraint (D-ITG)]
59. Results
[Chart: bipartite matching SVM, SVM weights with a hard constraint, and the ITG SVM with a soft cohesion feature]
60. Contributions
- Algorithmic: a discriminative learning method for ITGs
- Experimental:
  - The value of hard constraints is reduced in the presence of a strong link score
  - Integrating the constraint as a feature during training can recover the value of the constraints, improving AER and recall
61. Other Projects
- Applying techniques from SMT to new domains:
  - Unsupervised pronoun resolution
- Discriminative structured learning:
  - Discriminative parsing
62. Unsupervised Pronoun Resolution (Cherry and Bergsma, CoNLL 2005)
- "The president entered the arena with his family."
- Input: a pronoun in context, and a list of candidates
  - his: family, arena, president
- Output: the correct candidate (president)
- Big idea: formulate a generative model in which a candidate generates the pronoun and context, and run EM
- Similar to IBM Model 1: align pronouns to candidates
63. Pronoun Resolution Innovations
- Used linguistics to limit the candidate list: binding theory, known noun genders
- Used unambiguous cases to initialize EM
- Re-weighted the component models discriminatively with maximum entropy
- End result: within 5% of a supervised system, with the re-weighted model matching supervised performance
64. Discriminative Parsing (Wang, Cherry, Lizotte and Schuurmans, CoNLL 2006)
- Input: a segmented Chinese string
- Output: a dependency parse tree
- Big idea: score each link independently, with an SVM weighting features on links (McDonald 2005), but generalize without part-of-speech tags
- Learn a weight for every word pair seen in training
65. Parsing Innovations
- To promote generalization, altered the large-margin portion of the SVM objective so that semantically similar word pairs have similar weights
- Tried two constraint types:
  - Local: link scores constrained so that links present in the gold standard score higher than those absent
  - Global: SVM Struct-style constraint generation
66. Others in Brief
- Dependency treelet decoder (here)
- Sequence tagging:
  - Biomedical term recognition: highlight gene names and proteins in medical texts
  - Character-based syllabification: find syllable breaks in written words
67. Outline
- Bitext and Bitext Parsing ✓
- Inversion Transduction Grammar (ITG) ✓
- ITG with Linguistic Constraints ✓
- Discriminative ITG with Linguistic Features ✓
- Other Projects ✓
69. Connecting E and F
- One language generates the other: IBM models (Brown et al. 1993), HMM (Vogel et al. 1996), tree-to-string model (Yamada and Knight 2001)
- Both languages generated simultaneously: joint model (Melamed 2000), phrasal joint model (Marcu and Wong 2002)
- S and T generate an alignment: conditional model (Cherry and Lin 2003), discriminative models (Taskar et al. 2005, Moore 2005)
70. Phrases Agree, Not Trees
[Figure: dependency tree over "he ran here quickly"]
- The dependencies state that "ran" is modified by "here" and "quickly" separately
- We allow the ITG to state that "ran" is modified by "here quickly"
- Also tested these additional head constraints
71. Effect on Parser
[Figure: ITG chart cells over "the tax causes unrest" × "l'impôt cause le malaise"; the cell grouping "unrest causes tax" apart from "the" is marked ✗, while the cell for "the" is marked ✓]
72. Custom Grammar Solution
- What trees force "the" and "tax" to stay together?
- A custom recursive grammar
- Same alignment space, canonical tree
[Figure: "the" combines with "tax" first; the ITG then builds freely over "the tax", "causes", "unrest"]
73. Guidance Results
74. Expressiveness Results
75. Expressiveness Analysis
- HD-ITG has systematic violations
- Discontinuous constituents (Melamed, 2003)
- It maintains distance to the head, which is not always maintained in translation
[Figure: "Canadian Wheat Board" aligned with "Commission Canadienne du blé"]
76. Discriminative Alignment
- Alignment can be viewed as multi-class classification
[Figure: input sentence pair "the tax causes unrest" / "l'impôt cause le malaise"; one correct alignment shown against many wrong-answer alignments of the same pair]
77. Problem
- Exponential number of incorrect alignments
- One solution: take advantage of properties of the matching algorithm to factor the constraints
- Doing the same factorization on ITG could be a lot of work; we need something more modular
  - Averaged perceptron?
  - Structured SVM
78. Final Challenge
- We need gold-standard trees to train on, but only have gold-standard alignments
- The versatility of ITG makes this easy:
  - Search for the best parse given an alignment
  - Select the parse with the fewest cohesion violations and fewest inversions
79. Redundancy
- Using A → A A, ⟨A A⟩, e/f, several parses produce the same alignment
- Wu provides a canonical-form grammar that creates only one parse per alignment
- Useful for:
  - Counting methods like EM
  - Detecting arbitrary bracketing decisions
80. Results Table
81. Guidance Results
82. Expressiveness Results
83. SVM Objective
minimize over w, ξ:  ½‖w‖² + C Σᵢ ξᵢ   (ξᵢ = slack)
subject to:  w · Φ(yᵢ) − w · Φ(y) ≥ Δ(yᵢ, y) − ξᵢ  for all y ≠ yᵢ
(Δ = structured loss, Φ = feature representation)