Title: Latest Developments in SMT
1Latest Developments in (S)MT
MT Wars II: The Empire (Linguistics) strikes back
- Harold Somers
- University of Manchester
2Overview
- The story so far
- EBMT
- SMT
- Latest developments in RBMT
- Is there convergence?
- Some attempts to classify MT
- (Carl's and Wu's MT model spaces)
- Has the empire struck back?
3The story so far: EBMT
- Early history well known
- Nagao (1981/3)
- Early development as part of RBMT
- Relationship with Translation Memories
- Focus (cf. Somers 1998) on
- Matching algorithms
- Selection and storage of examples
- Mainly sentence-based
- TL generation (Recombination) not much addressed
Somers, H. (1998) 'New paradigms in MT', 10th European Summer School in Logic, Language and Information, Workshop on MT, Saarbrücken; revised version in Machine Translation 14 (1999), and 2nd revised version in M. Carl & A. Way (2003) Recent Advances in EBMT (Kluwer).
4EBMT in a nutshell
- (In case you've been on Tatooine for the last 15 years)
- Database of paired examples
- Translation involves
- Finding the best example(s) (matching)
- Identifying which bits do(n't) match (alignment)
- Replacing the non-matching bits (if multiple examples, gluing them together) (recombination)
- All of the above at run-time (a toy sketch follows below)
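To make the run-time loop concrete, here is a deliberately minimal Python sketch (my own toy illustration, not any published system): the example base, the lexicon and the similarity measure are all stand-ins.

    import difflib

    # Toy example base and word-level lexicon (illustrative only).
    EXAMPLES = [("the old man is dead", "le vieil homme est mort")]
    LEXICON = {"man": "homme", "woman": "femme"}

    def translate(inp):
        """A naive run-time EBMT loop: match, align, recombine."""
        # 1. Matching: pick the stored example most similar to the input.
        src, tgt = max(EXAMPLES, key=lambda ex: difflib.SequenceMatcher(
            None, inp, ex[0]).ratio())
        # 2. Alignment: find which input words differ from the example's source side.
        src_toks, inp_toks = src.split(), inp.split()
        ops = difflib.SequenceMatcher(None, src_toks, inp_toks).get_opcodes()
        # 3. Recombination: substitute translations of the differing words.
        for op, i1, i2, j1, j2 in ops:
            if op == "replace" and (i2 - i1) == (j2 - j1) == 1:
                old, new = src_toks[i1], inp_toks[j1]
                if old in LEXICON and new in LEXICON:
                    tgt = tgt.replace(LEXICON[old], LEXICON[new])
        return tgt

    print(translate("the old woman is dead"))
    # -> "le vieil femme est mort": fluent-looking but ungrammatical,
    #    which is exactly the boundary-friction problem on the next slide.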
5EBMT in a nutshell (cont.)
- Main difficulty is boundary friction in two
senses
The old man is dead → Le vieil homme est mort
The old woman is dead → Le vieil femme est mort (ungrammatical; it should be La vieille femme est morte)

Input: The operation was interrupted because the file was hidden.
a. The operation was interrupted because the Ctrl-c key was pressed. → L'opération a été interrompue car la touche Ctrl-c a été enfoncée.
b. The specified method failed because the file is hidden. → La méthode spécifiée a échoué car le fichier est masqué.
6EBMT later developments
- Example generalisation (templates)
- Incorporation of linguistic resources and/or statistical measures
- Structured representation of examples
- Use of statistical techniques
7Example generalisation
- (Furuse & Iida, Kaji et al., Matsumoto et al., Carl, Cicekli & Güvenir, Brown, McTait, Way et al.)
- Similar examples can be combined to give a more general example
- Can be seen as a way of generating transfer rules (and lexicons)
- Process may be entirely automatic, based on string matching
- or seeded using linguistic information (POS tags) or resources (bilingual dictionary)
8Example generalisation (cont.)
The monkey ate a peach → saru wa momo o tabeta
The man ate a peach → hito wa momo o tabeta
monkey → saru; man → hito
The ___ ate a peach → ___ wa momo o tabeta

The dog ate a rabbit → inu wa usagi o tabeta
dog → inu; rabbit → usagi
The X ate a Y → X wa Y o tabeta
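A minimal sketch of the string-matching idea (my own toy code, handling only the simplest case of a single differing word on each side; published methods are far more general):

    def generalise(pair1, pair2):
        """Toy template induction: given two example pairs that differ in
        exactly one (source, target) word each, return the shared template
        plus the word correspondences."""
        (s1, t1), (s2, t2) = pair1, pair2
        s1, t1, s2, t2 = s1.split(), t1.split(), s2.split(), t2.split()
        s_diff = [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
        t_diff = [j for j, (a, b) in enumerate(zip(t1, t2)) if a != b]
        if len(s1) != len(s2) or len(t1) != len(t2) or len(s_diff) != 1 or len(t_diff) != 1:
            return None                      # only the simplest case handled
        i, j = s_diff[0], t_diff[0]
        template_src = s1[:i] + ["<X>"] + s1[i + 1:]
        template_tgt = t1[:j] + ["<X>"] + t1[j + 1:]
        lexicon = {s1[i]: t1[j], s2[i]: t2[j]}
        return " ".join(template_src), " ".join(template_tgt), lexicon

    print(generalise(("the monkey ate a peach", "saru wa momo o tabeta"),
                     ("the man ate a peach", "hito wa momo o tabeta")))
    # -> ('the <X> ate a peach', '<X> wa momo o tabeta',
    #     {'monkey': 'saru', 'man': 'hito'})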
9Example generalisation (cont.)
- That's too simple (e.g. because of boundary friction)
- Need to introduce constraints on the slots, e.g. using POS tags and morphological information (which implies some other processing)
- Can use clustering algorithms to infer substitution sets
10Incorporation of linguistic resources
- Actually, early EBMT used all sorts of linguistic resources
- Briefly there was a move towards more pure approaches
- Now we see much use of POS tags (sometimes only partial, e.g. marker words, Way et al.; a rough sketch follows below), morphological analysis (as just mentioned), bilingual lexicons
- Target-language grammars for the recombination/generation phase
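As a rough illustration of the marker-word idea (the Marker Hypothesis used in Way and colleagues' marker-based EBMT), the following toy Python chunker starts a new fragment at each closed-class marker word; the marker list here is a tiny made-up sample, not the one any actual system uses.

    # Tiny, made-up sample of closed-class marker words.
    MARKERS = {"the", "a", "an", "in", "on", "of", "to", "and", "that", "with"}

    def marker_chunks(sentence):
        """Start a new chunk whenever a marker word is seen."""
        chunks, current = [], []
        for tok in sentence.lower().split():
            if tok in MARKERS and current:
                chunks.append(current)
                current = []
            current.append(tok)
        if current:
            chunks.append(current)
        return [" ".join(c) for c in chunks]

    print(marker_chunks("The operation was interrupted because the file was hidden"))
    # -> ['the operation was interrupted because', 'the file was hidden']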
11Incorporation of statistical measures
- Example database preprocessed to assign weights (probabilities) to fragments and their translations (Aramaki et al.)
- Good way of handling ambiguities due to alternative translations
- Clustering words into equivalence classes for example generalization (Brown)
- Using statistical tools to extract translation knowledge from parallel corpora (Yamamoto & Matsumoto)
- Statistically induced grammars for translation or generation, as in ...
12Use of structured representations
- Again, a feature of early EBMT, now reappearing
- Translation grammars induced from the example set
- Examples stored as tree structures
(overwhelmingly dependency structures)
13Translation grammars
- Carl generates translation grammars from aligned, linguistically annotated texts
- Way: Data-Oriented Translation (based on Poutsma's DOP, using both PS and LFG models)
14Structured examples
- Use of tree comparison algorithms to extract translation patterns from parsed corpora/treebanks (Watanabe et al.)
- Translation pairings extracted from aligned parsed examples (Menezes & Richardson)
- Tree-to-string approach used by Langlais & Gotti and Liu et al. (+ statistical generation model)
15Typical use of structured examples
- Rule-based analysis and generation + example-based transfer
- Input is parsed into a representation using a traditional or statistics-based analyser
- TL representation constructed by combining translation mappings learned from the parallel corpus
- TL sentence generated using a hand-written or machine-learned generation grammar
- Is this still EBMT?
- Note that the only example-based part is the use of mappings, which are learned, not computed at run-time
16Pure EBMT (Lepage & Denoual)
- In contrast (but now something of an oddity): pure analogy-based EBMT
- Use of proportional analogies [A : B :: C : D]
- Terms in the analogies are translation pairs: A↔A′, B↔B′, C↔C′, D↔D′ (a toy analogy solver is sketched below)
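A toy sketch of solving A : B :: C : x, assuming B differs from A only at one end of the string (Lepage & Denoual's actual algorithm is considerably more general); on the target side, solving A′ : B′ :: C′ : x then yields a candidate translation.

    def common_prefix_len(s, t):
        """Length of the longest common prefix of two strings."""
        i = 0
        while i < min(len(s), len(t)) and s[i] == t[i]:
            i += 1
        return i

    def solve_analogy(a, b, c):
        """Solve a : b :: c : x when a and b differ only in a suffix or a prefix."""
        # Case 1: suffix change (a = stem + sa, b = stem + sb).
        p = common_prefix_len(a, b)
        sa, sb = a[p:], b[p:]
        if c.endswith(sa):
            return c[:len(c) - len(sa)] + sb
        # Case 2: prefix change (a = pa + stem, b = pb + stem).
        s = common_prefix_len(a[::-1], b[::-1])
        pa, pb = a[:len(a) - s], b[:len(b) - s]
        if c.startswith(pa):
            return pb + c[len(pa):]
        return None

    print(solve_analogy("walk", "walked", "talk"))      # -> "talked"
    print(solve_analogy("unhappy", "happy", "unkind"))  # -> "kind"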
17(No Transcript)
18Pure EBMT
- No explicit transfer
- No extraction of symbolic knowledge
- No use of templates
- Analogies do not always represent any sort of linguistic reality
- No training or preprocessing
- Solving the proportional analogies is done at
run-time
19The story so far (SMT)
- Early history well known
- IBM group inspired by improved results in speech recognition when a non-linguistic approach was taken
- Availability of Canadian Hansards inspired a purely statistical approach to MT (1988)
- Immediate partial success (60%) to the dismay of MT people
- Early observers (Wilks) predicted hybrid methods (stone soup) would evolve
- Later developments
- Phrase-based SMT
- Syntax-based SMT
20SMT in a nutshell
- (In case you've been on Kamino for the last 15 years)
- From the parallel corpus two sets of statistical data are extracted
- Translation model: probabilities that a given word e in the SL gives rise to a word f in the TL
- (Target) language model: the most probable word order for the words predicted by the translation model
- These two models are computed off-line
- Given an input sentence, a decoder applies the two models and juggles the probabilities to get the best score; various methods have been proposed (the standard formulation is sketched below)
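Stated compactly, this is the standard noisy-channel formulation (using the slide's convention that e is the SL sentence and f the TL sentence; P(f) is the language model, and the translation model is applied in its Bayes-inverted direction P(e|f)):

    \hat{f} = \arg\max_{f} \; P(f) \cdot P(e \mid f)

The decoder's job is to search for this argmax over candidate target sentences.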
21SMT in a nutshell (cont.)
- The translation model has to take into account the fact that
- for a given e there may be various different f's depending on context (grammatical variants as well as alternatives due to polysemy or homonymy)
- a given e may not necessarily correspond to a single f, or to any f at all: fertility
- (e.g. may have → aurait, implemented → mis en application)
22SMT in a nutshell (cont.)
- The language model has to take into account the fact that
- the TL words predicted by the translation model will not occur in the same order as the SL words: distortion
- TL word choices can depend on neighbouring words (which may be easy to model) or, especially because of distortion, on more distant words: long-distance dependencies, much harder to model (a toy bigram model is sketched below)
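As a toy illustration of what the language model contributes (my own sketch of a plain unsmoothed bigram model; real systems use higher-order smoothed n-grams), note that it can only see adjacent words, which is exactly why the long-distance effects mentioned above are hard:

    import math
    from collections import Counter

    def train_bigram_lm(corpus):
        """Count unigrams and bigrams over tokenised TL sentences (no smoothing)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            tokens = ["<s>"] + sent + ["</s>"]
            unigrams.update(tokens[:-1])
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        return unigrams, bigrams

    def log_prob(sentence, unigrams, bigrams):
        """Log-probability of one candidate word order under the bigram model."""
        tokens = ["<s>"] + sentence + ["</s>"]
        lp = 0.0
        for prev, curr in zip(tokens[:-1], tokens[1:]):
            if bigrams[(prev, curr)] == 0:
                return float("-inf")          # unseen bigram, no smoothing
            lp += math.log(bigrams[(prev, curr)] / unigrams[prev])
        return lp

    # Rank two orderings of the same words by LM score.
    uni, bi = train_bigram_lm([["the", "green", "witch"], ["the", "old", "man"]])
    print(log_prob(["the", "green", "witch"], uni, bi))   # finite score
    print(log_prob(["green", "the", "witch"], uni, bi))   # -inf: unseen order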
23SMT in a nutshell (cont.)
- Main difficulty: the combination of fertility and distortion
- Zeitmangel erschwert das Problem.
- Lack of time makes the problem more difficult.
- Eine Diskussion erübrigt sich demnach.
- Therefore there is no point in discussion.
- Das ist der Sache nicht angemessen.
- That is not appropriate for this matter.
- Den Vorschlag lehnt die Kommission ab.
- The Commission rejects the proposal.
24SMT later developments
- Phrase-based SMT
- Extend models beyond individual words to word sequences (phrases)
- Direct phrase alignment
- Word alignment induced phrase model
- Alignment templates
- Results better than word-based models, and show improvement proportional (log-linear) to corpus size
- Phrases do not correspond to constituents, and limiting them to do so hurts results
25Direct phrase alignment
- (Wang & Waibel 1998, Och et al. 1999, Marcu & Wong 2002)
- Enhance the word translation model by adding joint probabilities, i.e. probabilities for phrases
- Phrase probabilities compensate for missing lexical probabilities
- Easy to integrate probabilities from different sources/methods, allows for mutual compensation
26Word alignment induced model
- Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/pkoehn/publications/tutorial2003.pdf

Maria did not slap the green witch
Maria no daba una bofetada a la bruja verde

Start with all phrase pairs justified by the word alignment (a rough extraction sketch follows after the full list below)
27Word alignment induced model
- Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch)
28Word alignment induced model
- Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)
etc.
29Word alignment induced model
- Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Maria did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Maria did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Maria did not slap the green witch)
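The collection step shown on the previous slides can be written out roughly as follows (my own Python rendering of the standard Och/Koehn-style consistency check, not the authors' code; the alignment links are my reading of the example, and extension over unaligned boundary words is omitted):

    def extract_phrases(src, tgt, alignment):
        """All phrase pairs consistent with the word alignment: every link
        touching the source span must land inside the target span and
        vice versa (and the pair must contain at least one link)."""
        pairs = set()
        for i1 in range(len(src)):
            for i2 in range(i1, len(src)):
                # Target positions linked to the chosen source span.
                linked = [j for (i, j) in alignment if i1 <= i <= i2]
                if not linked:
                    continue
                j1, j2 = min(linked), max(linked)
                # Reject if some target word in [j1, j2] is aligned to a
                # source word outside [i1, i2].
                if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                    pairs.add((" ".join(src[i1:i2 + 1]),
                               " ".join(tgt[j1:j2 + 1])))
        return pairs

    src = "Maria no daba una bofetada a la bruja verde".split()
    tgt = "Maria did not slap the green witch".split()
    links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
    print(extract_phrases(src, tgt, links))
    # Includes (no, did not), (daba una bofetada, slap), (bruja verde, green witch), ...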
30Word alignment induced model
- Given the phrase pairs collected, estimate the
phrase translation probability distribution by
relative frequency (without smoothing)
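In symbols, this is the usual maximum-likelihood estimate over the collected pairs (with ē a source phrase and f̄ a target phrase):

    \phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}'} \mathrm{count}(\bar{e}, \bar{f}')}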
31Alignment templates
- (Och et al. 1999; further developed by Marcu & Wong 2002, Koehn & Knight 2003, Koehn et al. 2003)
- Problem of sparse data is worse for phrases
- So use word classes instead of words
- alignment templates instead of phrases
- more reliable statistics for the translation table
- smaller translation table
- more complex decoding
- Word classes are induced (by distributional statistics), so may not correspond to intuitive (linguistic) classes
- Takes context into account
32Problems with phrase-based models
- Still do not handle very well ...
- dependencies (especially long-distance)
- distortion
- discontinuities (e.g. bought → habe ... gekauft)
- More promising seems to be ...
33Syntax-based SMT
- Better able to handle
- Constituents
- Function words
- Grammatical context (e.g. case marking)
- Inversion Transduction Grammars
- Hierarchical transduction model
- Tree-to-string translation
- Tree-to-tree translation
34Inversion transduction grammars
- Wu and colleagues (1997 onwards)
- Grammar generates two trees in parallel and mappings between them
- Rules can specify order changes (illustrated below)
- Restriction to binary rules limits complexity
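As a rough illustration of the rule format (standard ITG notation, not an example from the talk): a straight rule keeps its daughters in the same order in both languages, an inverted rule flips them on the target side, and lexical rules pair terminals:

    VP → [ V NP ]      (straight: same order in both languages)
    VP → ⟨ V NP ⟩      (inverted: order reversed on the target side)
    V → eats / mange      NP → an apple / une pomme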
35Inversion transduction grammars
36Inversion transduction grammars
- Grammar is trained on a word-aligned bilingual corpus; note that all the rules are learned automatically
- Translation uses a decoder which effectively works like traditional RBMT
- Parser uses the source side of the transduction rules to build a parse tree
- Transduction rules are applied to transform the tree
- The target text is generated by linearizing the tree
37(No Transcript)
38(No Transcript)
39Almost all possible mappings can be handled. The missing ones (crossing constraints) are not found in Wu's corpus, but examples can be found, apparently.
40Hierarchical transduction model
- (Alshawi et al. 1998)
- Based on finite-state transducers, also uses binary notation
- Uses automatically induced dependency structure
- Initial head-word pair is chosen
- Sentence is then expanded by translating the
dependent structures
41Tree-to-string translation
- (Yamada & Knight 2001, Charniak 2003)
- Uses a (statistical) parser on the input side only
- Tree is then subject to reordering and insertion according to models learned from data
- Lexical translation is then done, again according to probability models
42(No Transcript)
43Tree-to-tree translation
- (Gildea 2003)
- Use parsers on both sides to capture structural differences
- Subtree cloning
- (Habash 2002, Cmejrek et al. 2003)
- Full morphological/syntactic/semantic parsing
- All based on stochastic grammars
44Latest developments in RBMT
- RBMT making a come-back (e.g. METIS)
- Perhaps it was always there, just wasn't represented in CL journals/conferences
- There is some activity, but around the periphery
- Open-source systems
- development for low-density languages
- Much use made of corpus-derived modules, e.g. tagging, chunking
- SMT is now RBMT, only the rules are learned rather than written by linguists
45Overview
- The story so far
- EBMT
- SMT
- Latest developments in RBMT
- Is there convergence?
- Some attempts to classify MT
- (Carl's and Wu's MT model spaces)
- Has the empire struck back?
46Classifications of MT
- Empirical vs. Rationalist
- data- vs theory-driven
- use (or not) of symbolic representation
- From MLIM chapter 4
- high vs. low coverage
- low vs. high quality/fluency
- shallow vs. deep representation
- Distinguish in the above
- design vs. consequence
- How true are they anyway?
47EBMT/SMT: Is there convergence?
- Lively debate on mtlist
- Articles by
- Somers, Turcato & Popowich in Carl & Way (2003)
- Hutchins, Carl, Wu (2006) in special issue of Machine Translation
- Slides marked need your input!
48Essential features of EBMT
- Use of bilingual corpus data as the main (only?) source of knowledge (Somers)
- Most early EBMT systems were hybrids
- We do not know a priori which parts of an example are relevant (Turcato & Popowich)
- Raw data is consulted at run-time, with (little or) no preprocessing
- Therefore template-based EBMT is already a hybrid (with RBMT)
- Act of matching the input against the examples, regardless of how they are stored (Hutchins)
49Pros (and cons) of analogy model
- Like CBR
- Library of cases used during task performance
- Analogous examples broken down, adapted, recombined
- In contrast with other machine learning methods
- Offline learning to compile an abstract performance model
- No loss of coverage due to incorrect generalization during training
- Guaranteed correct when input is exactly like an example in the training set (not true of SMT)
- But: lack of generalization leads to potential runtime inefficiency
(Wu, 2006)
50EBMT/SMT: Common features
- Easily agreed
- Use of bilingual corpus data as the main (only?) source of knowledge
- Translation relations are derived automatically from the data
- Underlying methods are independent of language-pair, and hence of language similarity
- More contentious
- Bilingual corpus data should be real (a practical issue for SMT, but some EBMT systems use hand-crafted examples)
- System can be easily extended just by adding more data
51EBMT/RBMT: common features
- Hybrid is easy to conceive
- Rule-based analysis/generation with example-based transfer
- Example-based processing only for awkward cases
52SMT/RBMT: common features
- Some versions of SMT exactly mirror classic RBMT
- parse-transfer-generate
- Same things are hard
- Long-distance dependency
- Discontinuous constituents
53Wu's 3D classification of all MT
- Example-based vs. schema-based
- abstraction or generalization performed at
run-time - Compositional vs. lexical
- Relates primarily to transfer (or equiv.)
- Statistical vs. logical
- Pictures also show historical development
54Classic (direct and transfer) MT models
- Early systems (Georgetown): lexical and compositional
- Treatment of idioms, collocations, phrasal translations in classical 2G transfer systems
- Modern RBMT systems starting to adopt statistical methods (according to Wu)
- Where do commercial systems sit?
55(No Transcript)
56EBMT systems
57SMT systems
58Example-based SMT systems
59Summary
60Model space for corpus-based MT (Carl 2000)
- Based on Dummett's theory of meaning
- Rich vs austere
- Complexity of representations
- Molecular vs holistic
- Descriptions based on a finite set of predefined features vs global distinctions
- Fine-grained vs coarse-grained
- Based on smaller or larger units
61Rich vs austere
- Translation memories are most austere, depending only on graphemic similarity
- TMs with annotated examples (e.g. Planas, Furuse) are richer
- Early EBMT systems, and recent systems where examples are generalized, are rich
- EBMT using light annotation (e.g. tags, markers) is moderately rich
- Pure EBMT (Lepage & Denoual) is austere
- Early SMT systems were austere, but the move towards syntax makes them richer
- Phrase-based SMT still austere
62[Figure: MT systems placed along the austere–rich axis; diagram labels only]
- METIS
- EBMT where examples are lightly annotated
- Phrase-based SMT
- Syntax-based SMT
- Pure EBMT (Lepage)
- Marker-based EBMT (Way)
- Template-based EBMT (McTait, Brown, Cicekli)
- Early SMT (Brown et al.)
- Annotated translation memories
- Translation memories
- Classic EBMT (Sato, Nagao)
63Molecular vs holistic
- Early SMT purely holistic, as is pure EBMT
- TMs molecular: distance measure based on a fixed set of symbols
- Translation templates are holistic, but molecular if they depend on some sort of analysis
- Phrase-based and syntax-based SMT highly molecular
64[Figure: MT systems placed along the molecular–holistic axis; diagram labels only]
- EBMT where examples are lightly annotated
- Early SMT (Brown et al.)
- METIS generation
- Template-based EBMT (McTait, Brown)
- Pure EBMT (Lepage)
- Annotated translation memories
- Phrase-based SMT
- Syntax-based SMT
- Marker-based EBMT (Way)
- Classic EBMT (Sato, Nagao)
- METIS analysis
- Translation memories
- Template-based EBMT (Cicekli)
65Coarse- vs. fine-grained
- Coarse-grained: translates with bigger units
- TM systems work only on sentences: coarse-grained
- Word-based systems are fine-grained (early SMT)
- Phrase-based SMT slightly more coarse-grained
- Template-based EBMT fine-grained
66[Figure: MT systems placed along the fine–coarse axis; diagram labels only]
- Template-based EBMT (McTait, Brown)
- Early SMT (Brown et al.)
- fine
- Phrase-based SMT
- Marker-based EBMT (Way)
- Translation memories
- coarse
67Overview
- The story so far
- EBMT
- SMT
- Latest developments in RBMT
- Is there convergence?
- Some attempts to classify MT
- (Carl's and Wu's MT model spaces)
- Has the empire struck back?
68Has the empire struck back?
- Is linguistics back in MT?
- Was MT ever of interest to linguists?
- Is SMT like RBMT?
69Vauquois triangle
To what extent can a given system be described in terms of the classic view of MT (G2)?
70Has the empire struck back?
- Is linguistics back in MT?
- Was MT ever of interest to linguists?
- Is SMT like RBMT?
- As predicted by Wilks (stone soup talk, 1992), the way forward is hybrid
- Negative experience (for me) of seeing SMT presenters rediscovering problems first described by Yngve, Vauquois ...
- ... without referencing the original papers!
71LINGUISTICS
72[Figure: 'Fill in the gaps' diagram relating EBMT, SMT and RBMT; diagram labels only]
- EBMT
- SMT
- Fill in the gaps
- Early SMT (Brown et al.)
- EBMT where examples are lightly annotated
- Pure EBMT (Lepage)
- Annotated translation memories
- Phrase-based SMT
- Template-based EBMT (McTait, Brown)
- Syntax-based SMT
- Classic EBMT (Sato, Nagao)
- Marker-based EBMT (Way)
- Template-based EBMT (Cicekli)
- Translation memories
- RBMT