Syntax-based Statistical Machine Translation Models

Slides: 64
Provided by: scie5
Learn more at: http://www.cs.cmu.edu
1
Syntax-based Statistical Machine Translation
Models
  • Amr Ahmed
  • March 26th 2008

2
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

3
Statistical Machine Translation
Problem
  • Given a sentence (f) in one language, produce
    its equivalent in another language (e)

I know how to do this
"One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an
article in Arabic, I say: 'This is really written
in English, but it has been coded in some strange
symbols. I will now proceed to decode.'"
(Warren Weaver, 1947)
4
Statistical Machine Translation
Problem
  • Given a sentence (f) in one language, produce
    its equivalent in another language (e)

Noisy Channel Model
[Diagram: e, drawn from P(e), passes through the noisy channel and comes out as f]
  • We know how to factor P(e)!
  • P(e) models good English; P(f | e) models good
    translation
  • Today: how to factor P(f | e)?
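
The factorization behind the noisy channel slide can be written out as a short derivation. This is the standard Bayes-rule argument; the hat notation for the decoder output is an assumption added here, not something shown on the slide.

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        = \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) drops out because it does not depend on e, which leaves the language model P(e) and the translation model P(f | e) as the two parts to estimate.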
5
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Word-based Models
  • Phrase-based Models
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

6
Word-Translation Models
[Figure: word alignment between the two sentences below]
German: Auf diese Frage habe ich leider keine Antwort bekommen
English: NULL I did not unfortunately receive an answer to this question
Blue word links aren't observed in data.
  • What is the generative story?
  • IBM Models 1-4
  • Roughly equivalent to an FST (modulo reordering)
  • Learning and decoding?

Slide credit: adapted from Smith et al.
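
To make "learning" for word-based models concrete, here is a minimal sketch of EM for IBM Model 1 only (lexical translation probabilities, no reordering model). The function name `train_ibm1`, the toy bitext, and the number of iterations are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """bitext: list of (foreign_words, english_words). Returns t[(f, e)] ~ P(f | e)."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked to anything
        for fs, es in bitext:
            es = ["NULL"] + list(es)  # NULL lets a foreign word align to nothing
            for f in fs:
                # E-step: distribute one unit of f over all candidate English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: count and normalize.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy data (hypothetical): three tiny German-English sentence pairs.
bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(bitext)
print(t[("das", "the")], t[("Haus", "the")])
```

The hidden word links are never observed; EM only sees which words co-occur in sentence pairs and gradually concentrates probability on consistent pairings such as das/the.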
7
Word-Based Translation Models
[Diagram: e → channel → f]
  • Stochastic operations
  • Associated with probabilities
  • Estimated using EM

In a Nutshell
  • Q: What are we learning? A: Word movement

Linguistic Hypothesis → Phrase-based Models
  • 1- Words move in blocks
  • 2- Context is important
8
Phrase-Based Translation Models
[Diagram: e → Segment → Translation → Re-ordering → f]
  • Stochastic operations
  • Associated with probabilities
  • Estimated using EM

In a Nutshell
  • Q: What are we learning? A: Word movement

Linguistic Hypothesis → Phrase-based Models
  • 1- Words move in blocks
  • 2- Context is important
  • Markovian dependency
9
Phrase-Based Translation Models
[Diagram: e → Segment → Translation → Re-ordering → f, with variables a1, a2, a3]
  • Stochastic operations
  • Associated with probabilities
  • Estimated using EM

In a Nutshell
  • Q: What are we learning? A: Word movement

Linguistic Hypothesis → Phrase-based Models
  • 1- Words move in blocks
  • 2- Context is important
  • Markovian dependency
10
Phrase-Based Models Example
  • Not necessarily syntactic phrases
  • Division into phrases is hidden
[Figure: phrase alignment between the two sentences below]
German: Auf diese Frage habe ich leider keine Antwort bekommen
English: I did not unfortunately receive an answer to this question
  • Score each phrase pair using several features
Slide credit: from Smith et al.
11
Phrase Table Estimation
Basically count and Normalize
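
"Count and normalize" can be sketched as relative-frequency estimation over extracted phrase pairs. The function name `estimate_phrase_table` and the toy phrase pairs are hypothetical; a real system would first extract the pairs from word-aligned data and typically adds further feature scores.

```python
from collections import Counter

def estimate_phrase_table(phrase_pairs):
    """phrase_pairs: iterable of (foreign_phrase, english_phrase).

    Returns phi[(f, e)] = count(f, e) / count(e), i.e. an estimate of P(f | e).
    """
    pair_counts = Counter(phrase_pairs)
    english_counts = Counter()
    for (_, e), c in pair_counts.items():
        english_counts[e] += c
    return {(f, e): c / english_counts[e] for (f, e), c in pair_counts.items()}

# Toy extracted phrase pairs (hypothetical).
pairs = [
    ("diese Frage", "this question"),
    ("diese Frage", "this question"),
    ("die Frage", "this question"),
    ("diese Frage", "that question"),
]
table = estimate_phrase_table(pairs)
print(table[("diese Frage", "this question")])
```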
12
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Word-based Models
  • Phrase-based Models
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

13
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

14
Why Syntax?
  • Reference: consequently proposals are submitted
    to parliament under the assent procedure, meaning
    that parliament can no longer table amendments,
    as directives in this area were adopted as single
    market legislation under the codecision procedure
    on the basis of art.100a tec.
  • Translation: consequently, the proposals
    parliament after the assent procedure, the tabled
    amendments for offers no possibility of community
    directives, because as part of the internal
    market legislation on the basis of article 100a
    of the treaty in the codecision procedure have
    been adopted.

Slide credit: example from Cowan et al.
15
Why Syntax?
Slide credit: adapted from Cowan et al.

Here syntax can help!
What Went Wrong?
  • Phrase-based systems are very good at predicting
    content words,
  • but are less accurate in producing function
    words, or producing output that correctly encodes
    grammatical relations between content words

16
Structure Does Help!
Does adding more structure help?
[Figure: a noisy channel maps the English side Se → x1 x2 x3 to the foreign side Sf → x2 x1 x3]
Word-based → Phrase-based → Syntax-based: better performance?
17
Syntax and the Translation Pipeline
[Diagram: Input → Translation system → Output, showing where syntax can enter]
  • Pre-reordering (on the input)
  • Syntax in the translation model
  • Post-processing (re-ranking of the output)
18
Early Exposition (Koehn et al. 2003)
  • Fix a phrase-based system and vary the way
    phrases are extracted
  • Frequency-based, generative, constituent
  • Adding syntax hurts performance
  • A phrase pair like "there is" → "es gibt" is not a
    constituent (this restriction eliminates 80% of phrase pairs)
  • Explanation
  • No hierarchical re-ordering
  • Syntax is not fully exploited here!
  • Parse trees contain errors

19
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

20
The Big Picture Translation Models
[Figure: one translation pyramid per model family, each with String, Syntax, and Inter-lingua levels on the source and target sides]
  • Word-based
  • Phrase-based
  • SCFG (Chiang 2005), ITG (Wu 97)
  • Tree-Tree Transducers
  • Tree-String Transducers
  • String-Tree Transducers
21
Learning Synchronous Grammar
  • No linguistic annotation
  • Model P(e,f) jointly
  • Trees are hidden variables
  • EM doesn't work well with large amounts of missing
    information
  • Structural restrictions
  • Binary rules (ITG, Wu 97)
  • Lexical restriction (Chiang 2005)

SCFG to represent hierarchical phrases
What is Synchronous Grammar?
22
Interlude Synchronous Grammar
  • Extension of monolingual theory to bitext
  • CFG → SCFG
  • TAG → STAG
  • etc.
  • Monolingual parsers are extended for bitext
    parsing

23
Synchronous Grammar: SCFG
[Figure: a CFG example next to the corresponding SCFG example]
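
Since the slide's figure did not survive, the contrast can be illustrated with a single hypothetical rule (chosen for illustration, not taken from the slide). A CFG rule has one right-hand side; an SCFG rule has a pair of right-hand sides whose non-terminals are linked by shared indices and rewritten together.

```latex
% CFG: a single right-hand side
\mathrm{VP} \rightarrow \mathrm{V}\ \mathrm{NP}

% SCFG: paired right-hand sides; equal indices mark linked non-terminals
\mathrm{VP} \rightarrow \langle\, \mathrm{V}_{[1]}\ \mathrm{NP}_{[2]},\ \ \mathrm{NP}_{[2]}\ \mathrm{V}_{[1]} \,\rangle
```

Here the second language places the object before the verb, so one rule application generates a verb-object phrase on one side and an object-verb phrase on the other.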
24
Learning Synchronous Grammar
  • No linguistic annotation
  • Model P(e,f) jointly
  • Trees are hidden variables
  • EM doesn't work well with large amounts of missing
    information
  • Structural restrictions
  • Binary rules (ITG, Wu 97)
  • Lexical restriction (Chiang 2005)

SCFG to represent hierarchical phrases
What is Synchronous Grammar?
How?
25
Hierarchical Phrase-based Model
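[Figure: hierarchical phrase-based models vs. phrase-based models]
  • Hierarchical phrase-based models: paired trees with linked
    nodes S1, S2, x1, x2, x3 that mix gaps with words
    (f1, f3, f4 on one side; e1, e3, e4, e5, e6 on the other)
  • Phrase-based models: flat reordering, Se → x1 x2 x3 and
    Sf → x2 x1 x3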
26
Example (Chiang 2005)
27
Hierarchical Phrase-based Model
Question 1: How to train the model?
What are the restrictions?
  • At most two recursive phrases
  • Restriction on length
Question 2: How to decode?
28
Training and Decoding
  • Collect initial grammar rules

29
Training and Decoding
  • Collect initial grammar rules
  • Tune rule weights: count and normalize!
  • Decoding
  • CYK (remember, rules have at most two
    non-terminals)
  • Parse the f part only.
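
Parsing only the f side while building the e side can be sketched as follows. This is a memoized span search in the spirit of CYK, not a full decoder: there is no language model, no glue rule, and no pruning. The first rule has the shape of the well-known example from Chiang (2005); the other two rules and all probabilities are made up for illustration. The sketch assumes every rule contains at least one source word, so each gap covers a strictly smaller span.

```python
from functools import lru_cache

GAPS = ("X1", "X2")  # at most two non-terminals per rule

# (source side, target side, probability); probabilities are made up.
RULES = [
    (("yu", "X1", "you", "X2"), ("have", "X2", "with", "X1"), 0.5),
    (("Bei", "Han"), ("North", "Korea"), 0.9),
    (("bangjiao",), ("diplomatic", "relations"), 0.8),
]

def match(pattern, i, j, words):
    """Yield {gap: (start, end)} bindings under which pattern covers words[i:j]."""
    if not pattern:
        if i == j:
            yield {}
        return
    head, rest = pattern[0], pattern[1:]
    if head in GAPS:
        for k in range(i + 1, j + 1):          # a gap covers a non-empty span
            for binding in match(rest, k, j, words):
                yield {head: (i, k), **binding}
    elif i < j and words[i] == head:           # a terminal must match the word
        yield from match(rest, i + 1, j, words)

def decode(words, rules=RULES):
    """Return (probability, target words) of the best derivation, or None."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def best(i, j):
        top = None
        for source, target, prob in rules:
            for binding in match(source, i, j, words):
                score, filled = prob, {}
                for gap, (a, b) in binding.items():
                    sub = best(a, b)
                    if sub is None:
                        break
                    score *= sub[0]
                    filled[gap] = sub[1]
                else:
                    output = []
                    for token in target:
                        output.extend(filled.get(token, (token,)))
                    if top is None or score > top[0]:
                        top = (score, tuple(output))
        return top

    return best(0, len(words))

result = decode("yu Bei Han you bangjiao".split())
print(" ".join(result[1]))  # have diplomatic relations with North Korea
```

The target side is never parsed: each source-side rule application carries its target side along, and the linked gaps reorder the sub-translations.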

30
Does it help?
  • Experimental Details
  • Mandarin-to-English (FBIS corpus)
  • 7.2M + 9.2M words
  • Dev set: NIST 2002 MT evaluation
  • Test set: NIST 2003 MT evaluation
  • 7.5% relative improvement over phrase-based
    models using BLEU score
  • 0.02 absolute improvement over baseline

31
Does it help?
  • 7.5% relative improvement over phrase-based models
  • Learnt rules are formally SCFG but not
    linguistically interpretable
  • The model learns re-ordering patterns guided by
    lexical function words
  • Capture long-range movements via recursion

32
Follow-Up study
  • Why not decorate the phrases with their
    grammatical constituents?
  • Zollmann et al. 2006, 2007
  • If possible, decorate the phrase with a
    constituent
  • Generalize phrases as in Chiang 2005
  • Parse using chart parsing
  • Moved from 31.85 → 32.15 over the CMU phrase-based
    system
  • Spanish-English corpus

33
The Big Picture Translation Models
[Figure: one translation pyramid per model family, each with String, Syntax, and Inter-lingua levels on the source and target sides]
  • Word-based
  • Phrase-based
  • SCFG (Chiang 2005), ITG (Wu 97)
  • Tree-Tree Transducers
  • Tree-String Transducers
  • String-Tree Transducers
34
Tree-String Transducers
  • Linguistic Tools
  • English parse trees
  • from a statistical parser
  • Alignment
  • from GIZA++
  • Conditional Model
  • P(f | Te)
  • Models differ on
  • How to factor P(f | Te)
  • Domain of locality
  • SCFG (Yamada & Knight 2001)
  • STSG (Galley et al. 2004)

Caveat
35
Tree-String (Yamada & Knight)
  • Back to the noisy channel model
  • Transduces Te into f
  • Stochastic channel operations (on trees)
  • Reorder children
  • Insert node
  • Lexical translation

36
Channel operations
  • Reordering: P(VB TO → TO VB)
  • Insertion: P(right | PRP) · Pi(ha)
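
The three channel operations can be sketched as a recursive pass over an English tree that reorders children, optionally inserts a word, and translates leaves, multiplying the probabilities along the way. The table layouts, the toy tree, and every number below are hypothetical. The sketch greedily picks the most likely option at each node, whereas the real model sums or searches over all choices.

```python
# Hypothetical channel tables (numbers are made up for illustration).
R_TABLE = {("PRP", "VB", "NN"): {(0, 2, 1): 0.8, (0, 1, 2): 0.2}}   # reorder children
N_TABLE = {                                                          # insert a word
    "PRP": {("right", "ha"): 0.6, ("none", None): 0.4},
    "NN": {("right", "ga"): 0.7, ("none", None): 0.3},
}
T_TABLE = {"he": {"kare": 0.9}, "adores": {"daisuki": 0.7}, "music": {"ongaku": 0.9}}

def most_likely(table):
    """Return the (option, probability) entry with the highest probability."""
    return max(table.items(), key=lambda kv: kv[1])

def transduce(node):
    """node is (label, word) for a leaf or (label, [children]) otherwise.

    Returns (foreign words, probability of the chosen operations).
    """
    label, body = node
    if isinstance(body, str):
        # Translation: replace the English leaf word.
        f_word, prob = most_likely(T_TABLE[body])
        words = [f_word]
    else:
        # Reordering: permute the children of this node.
        labels = tuple(child[0] for child in body)
        identity = tuple(range(len(body)))
        order, prob = most_likely(R_TABLE.get(labels, {identity: 1.0}))
        words = []
        for k in order:
            sub_words, sub_prob = transduce(body[k])
            words += sub_words
            prob *= sub_prob
    # Insertion: optionally add a function word to the left or right.
    (side, inserted), p_insert = most_likely(N_TABLE.get(label, {("none", None): 1.0}))
    prob *= p_insert
    if side == "left":
        words = [inserted] + words
    elif side == "right":
        words = words + [inserted]
    return words, prob

tree = ("S", [("PRP", "he"), ("VB", "adores"), ("NN", "music")])
words, prob = transduce(tree)
print(" ".join(words))  # kare ha ongaku ga daisuki
```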
37
Learning
  • Learn channel operation probabilities
  • Reordering
  • Insertion
  • Translation
  • Standard EM-Training
  • E-step: compute expected rule counts (dynamic programming)
  • M-step: count and normalize

38
Decoding As Parsing
  • In a nutshell, we learnt how to parse the foreign
    side
  • Add CFG rules from the English side
  • Channel rules
  • Reordering
  • If (VB2 → VB TO) is reordered as (VB2 → TO VB)
  • Add rule VB2 → TO VB with probability p
  • Insertion
  • V → X V with probability pl, V → V X with probability pr,
    and X → fi
  • Translation
  • ei → fi with probability pt
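
The rule construction listed above can be sketched as a function that turns channel tables into weighted CFG rules over the foreign side. The table layouts, the function name `build_decoding_grammar`, and the numbers are hypothetical; only the three rule shapes (reordering, insertion, translation) follow the slide.

```python
def build_decoding_grammar(reorder, insert_side, insert_word, translate):
    """Return weighted rules (lhs, rhs, prob) for parsing the foreign side.

    reorder:     {(lhs, english_rhs): {foreign_rhs: p}}
    insert_side: {label: (p_left, p_right)}
    insert_word: {foreign_word: p}
    translate:   {english_word: {foreign_word: p_t}}
    """
    rules = []
    # Reordering: one rule per possible child order of an English rule.
    for (lhs, _english_rhs), options in reorder.items():
        for foreign_rhs, p in options.items():
            rules.append((lhs, foreign_rhs, p))
    # Insertion: V -> X V (left), V -> V X (right), X -> inserted word.
    for label, (p_left, p_right) in insert_side.items():
        rules.append((label, ("X", label), p_left))
        rules.append((label, (label, "X"), p_right))
    for f_word, p in insert_word.items():
        rules.append(("X", (f_word,), p))
    # Translation: English word rewrites to a foreign word.
    for e_word, options in translate.items():
        for f_word, p_t in options.items():
            rules.append((e_word, (f_word,), p_t))
    return rules

# Toy tables (numbers are made up).
grammar = build_decoding_grammar(
    reorder={("VB2", ("VB", "TO")): {("TO", "VB"): 0.7, ("VB", "TO"): 0.3}},
    insert_side={"PRP": (0.1, 0.6)},
    insert_word={"ha": 0.2},
    translate={"he": {"kare": 0.9}},
)
for rule in grammar:
    print(rule)
```

Once the grammar is in this form, decoding reduces to finding the best parse of the foreign sentence and reading the English tree off the rules that were used.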

39
Decoding Example
40
Results and Expressiveness
  • English-Chinese task
  • Short sentences, < 20 words (3M word corpus)
  • Test set: 347 sentences with at most 14 words
  • Better BLEU score (0.102) than IBM Model 4 (0.072)

What it can represent
  • Depends on syntactic divergence between language
    pairs
  • Trees must be isomorphic up to child reordering
  • Channel rules have the following format

Q: What can't it model?
Child re-ordering
41
Limitations
  • Can't model syntactic movements that cross
    brackets
  • SVO to VSO
  • Modal movement between English and French
  • not → ne ... pas (from English to French)

[Figure: English VP "does not go" (Aux, Not, VB) aligned with French "ne va pas"]
  • The span of "not" can't intersect that of "go"
  • Can't interleave the green part with the other two
42
Limitations Possible solutions
  • Follow-up studies showed relative improvements
  • Gildea 2003 added cloning operations
  • AER went from 0.42 → 0.3 on a Korean-English corpus

43
Tree-String Transducers
  • Linguistic Tools
  • English parse trees
  • from a statistical parser
  • Alignment
  • from GIZA++
  • Conditional Model
  • P(f | Te)
  • Models differ on
  • How to factor P(f | Te)
  • Domain of locality
  • SCFG (Yamada & Knight 2001)
  • STSG (Galley et al. 2004)

Caveat
44
Learning Expressive Rules (Galley et al. 2004)
[Diagram: two pipelines applied to the foreign string f1, f2, ..., fn]
  • Yamada & Knight: channel operation tables → parsing
    rules for the F-side (CFG rules)
  • Galley et al. 2004: rule extraction → TSG rules
  • Condition on larger fragments of the trees

45
Rule format Decoding
[Figure: one derivation step, with panels Rule 1, Current State, and Derivation Step. Rule 1 pairs a VP tree fragment (Aux does, not, VB) with the foreign string "ne VB pas"; applying it turns the current state "NP VP" into "NP ne VB pas". A tree-fragment rule is contrasted with a CFG rule.]
  • Tree is built bottom-up
  • Foreign string at each derivation step may have
    non-terminals
  • Rules are extracted from the training corpus
  • English-side trees
  • Foreign-side strings
  • Alignment from GIZA++

46
Rule Extraction
[Figure: parse tree S(NP(PRP he), VP(Aux does, RB not, VB go)) aligned with the French string "il ne va pas"]
  • Upward projection: each node is labeled with the foreign
    words it covers (he → il, PRP → il, NP → il, go → va, VB → va)
  • Frontier nodes: nodes whose span is exclusive
  • Frontier set: S, NP, PRP, he, VP, VB, go
  • Frontier graph: extract rules as before
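
The frontier-node test can be sketched on the slide's own example ("he does not go" / "il ne va pas"). The tree encoding, the helper names, and the choice to skip nodes with an empty span (such as the unaligned "does") are assumptions made here; with them the sketch reproduces the frontier set listed on the slide.

```python
# Each node is (label, children); words are leaves with no children.
TREE = ("S", [
    ("NP", [("PRP", [("he", [])])]),
    ("VP", [
        ("Aux", [("does", [])]),
        ("RB", [("not", [])]),
        ("VB", [("go", [])]),
    ]),
])
# Foreign positions: il=0, ne=1, va=2, pas=3.
ALIGN = {"he": {0}, "does": set(), "not": {1, 3}, "go": {2}}

def frontier_nodes(tree, align):
    """Return labels of nodes whose foreign span is not shared with unrelated nodes."""
    nodes = []  # (label, span, path from the root)

    def visit(node, path):
        # Upward projection: a node's span is the union of its children's spans.
        label, children = node
        if children:
            span = set()
            for k, child in enumerate(children):
                span |= visit(child, path + (k,))
        else:
            span = set(align.get(label, ()))
        nodes.append((label, span, path))
        return span

    visit(tree, ())

    def related(p, q):
        # True if one node is the other, its ancestor, or its descendant.
        n = min(len(p), len(q))
        return p[:n] == q[:n]

    result = []
    for label, span, path in nodes:
        if not span:
            continue  # assumption: unaligned nodes are not frontier nodes
        low, high = min(span), max(span)
        complement = set()
        for _, other_span, other_path in nodes:
            if not related(path, other_path):
                complement |= other_span
        # Exclusive span: no unrelated node covers a position in [low, high].
        if not any(low <= k <= high for k in complement):
            result.append(label)
    return result

print(frontier_nodes(TREE, ALIGN))
```

RB and "not" are excluded because their span (ne ... pas) surrounds "va", which belongs to the unrelated node VB. This is why "not" has to be extracted together with a larger VP fragment.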
47
Illustrating Rule Extraction
48
Minimality of Extracted rules
  • Other rules can be composed from these minimal
    rules

[Figure: minimal rules for the VP fragment (Aux does, RB not, VB go → va) and the foreign string "ne va pas", combined into a larger composed rule]
49
Probability Estimation
  • Just EM
  • Modified inside-outside for the E-step
  • Decoding as parsing
  • Training can be done using off-the-shelf
    tree transducers (Knight et al. 2004)

50
Evaluation
  • Coverage
  • how well the learnt rules explain the corpus
  • 100% coverage on F-E and C-E corpora

Translation Results
Decoder was still work in progress
51
The Big Picture Translation Models
[Figure: one translation pyramid per model family, each with String, Syntax, and Inter-lingua levels on the source and target sides]
  • Word-based
  • Phrase-based
  • SCFG (Chiang 2005), ITG (Wu 97)
  • Tree-Tree Transducers
  • Tree-String Transducers
  • String-Tree Transducers
52
Tree-Tree Transducers
  • Linguistic Tools
  • E/F parse trees
  • from a statistical parser
  • Alignment
  • from GIZA++
  • Conditional Model
  • P(Tf | Te)
  • Models differ on
  • How to factor P(Tf | Te)
  • Really many, many, many
  • CFG vs. dependency trees
  • How to train
  • EM, most of them
  • Discriminative (Collins 2006)
  • Directly model P(Te | Tf)

Same caveat as before
53
Discriminative tree-tree models
  • Directly model P(Te | Tf)
  • Translate from German to English
  • Extended Projections (EP)
  • Just an elementary tree from TAG with
  • One verb
  • Lexical function words
  • NP, PP placeholders
  • Learns to map between tree fragments
  • German clause → EP
  • Modeled as structured learning

54
How to Decode?
  • Why no generative story?
  • Because this is a direct model!
  • Given a German String
  • Parse it
  • Break it into clauses
  • Predict an EP from each clause
  • Translate German NP, PP using Pharaoh
  • Map translated German NP and NN to holes in EP
  • Structure learning comes here
  • Stitch clauses to get English translation

55
How to train
  • Training data: aligned clauses
  • Extraction procedures
  • Parse English and German
  • Align (NP, PP) in them using GIZA++
  • Break parse trees into clauses
  • Order clauses based on verb position
  • Discard sentences with different numbers of
    clauses
  • Training set: (e1, g1), ..., (en, gn)

56
How to train (2)
57
How to train (3)
  • (X,Y) is a training pair
  • - Just our good old perceptron friend!
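
As a reminder of how the perceptron update works on (X, Y) pairs, here is a minimal sketch. The feature function, the explicit candidate list, and the toy clauses are hypothetical. The system described on the slides predicts structured extended projections, which requires a search over candidates rather than a short list.

```python
from collections import defaultdict

def features(x, y):
    """Hypothetical indicator features: each source token paired with the output."""
    return [(token, y) for token in x]

def predict(w, x, candidates):
    """Return the candidate output with the highest score under weights w."""
    return max(candidates, key=lambda y: sum(w[feat] for feat in features(x, y)))

def train_perceptron(data, candidates, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x, candidates)
            if y_hat != y:
                # Promote features of the correct output, demote the wrong guess.
                for feat in features(x, y):
                    w[feat] += 1.0
                for feat in features(x, y_hat):
                    w[feat] -= 1.0
    return w

# Toy training pairs (hypothetical): German clause tokens -> label of an English EP.
data = [
    (("ich", "habe", "bekommen"), "EP_receive"),
    (("ich", "habe", "gesehen"), "EP_see"),
]
candidates = ["EP_receive", "EP_see"]
w = train_perceptron(data, candidates)
print([predict(w, x, candidates) for x, _ in data])
```

Tokens shared by both clauses end up with weights that cancel, so the decision rests on the discriminating verbs.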

58
Results
  • German-English Europarl corpus
  • 750k training sentences → 441,000 training
    clauses, test on 2000 sentences
  • BLEU score
  • Baseline: 25.26
  • This system: 23.96
  • Human judgment
  • 62% equal
  • 16% better under this system
  • 22% better for baseline
  • Largely because many restrictions were imposed

59
Outline
  • The Translation Problem
  • The Noisy Channel Model
  • Syntax-light SMT
  • Why Syntax?
  • Syntax-based SMT Models
  • Summary

60
Summary
  • Syntax does help, but
  • What is the right representation?
  • Is it language-pair specific?
  • How to deal with parser errors?
  • Modeling the uncertainty of the parsing process
  • Large scale syntax-based models
  • Are they possible?
  • What are the trade-offs
  • Better parameter estimation!
  • Should we trust the GIZA alignment results?
  • Block-translation vs. word-word?

61
  • Thanks

62
Related Work
  • Fast parsers for synchronous grammars
  • Grammar binarization
  • Fast K-best list parses
  • Re-ranking
  • Syntax driven evaluation measures
  • Impact of parsing quality on overall system
    performance

63
SAMT