Title: Statistical Machine Translation: Quo Vadis?
1. Statistical Machine Translation: Quo Vadis?
- Stephan Vogel
- Interactive Systems Lab
- Language Technologies Institute
- Carnegie Mellon University
2. N-Best List Generation
- Benefits
- Required for optimizing model scaling factors
- Rescoring
- For translation with a pivot language L1 -> L2 -> L3
- We have n-best translations at the sentence end
- But hypotheses are recombined -> many good translations don't reach the sentence end
- Recover those translations
3. Storing Multiple Backpointers
- When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don't expand them (see the sketch below)
[Figure: recombined hypotheses hr are kept as additional backpointers on the surviving best hypothesis hb]
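A minimal sketch of this idea, assuming a hypothetical Hypothesis class and recombination key (not the actual decoder data structures): losing hypotheses are attached to the survivor instead of being discarded, so they can later be recovered for n-best extraction.

```python
# Sketch: keep recombined hypotheses as extra backpointers on the survivor.
# The Hypothesis class and the recombination key are illustrative assumptions.

class Hypothesis:
    def __init__(self, score, lm_state, coverage, backpointer):
        self.score = score            # model score so far
        self.lm_state = lm_state      # last n-1 target words
        self.coverage = coverage      # covered source positions (frozenset)
        self.backpointer = backpointer
        self.recombined = []          # losing hypotheses: stored, never expanded

    def key(self):
        # Hypotheses with the same key are indistinguishable for future expansion.
        return (self.lm_state, self.coverage)


def recombine(stack):
    """Merge equivalent hypotheses; keep the losers on the winner."""
    best = {}
    for hyp in stack:
        k = hyp.key()
        if k not in best:
            best[k] = hyp
        elif hyp.score > best[k].score:
            hyp.recombined.extend([best[k]] + best[k].recombined)
            best[k].recombined = []
            best[k] = hyp
        else:
            best[k].recombined.append(hyp)
    return list(best.values())
```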
4. Tuning the SMT System
- We use different models in the SMT system
- Models have simplifications
- Trained on different amounts of data
- -> Different levels of reliability
- -> Give different weights to the different models: Q = c1*Q1 + c2*Q2 + ... + cn*Qn
- Find optimal scaling factors c1 ... cn
- Optimal means: highest score for the chosen evaluation metric (see the sketch below)
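A minimal sketch of the log-linear combination Q = c1*Q1 + ... + cn*Qn; the feature names and values are illustrative assumptions only.

```python
# Sketch: combine model scores with scaling factors, Q = c1*Q1 + ... + cn*Qn.
# Feature names and values are made up for illustration.

def combined_score(feature_scores, scaling_factors):
    """Both arguments are dicts keyed by model name."""
    return sum(scaling_factors[name] * q for name, q in feature_scores.items())

hyp_scores = {"tm": -12.3, "lm": -45.1, "length": 9.0, "distortion": -3.0}
weights    = {"tm": 1.0,  "lm": 0.8,  "length": 0.3, "distortion": 0.2}
print(combined_score(hyp_scores, weights))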
5. Brute Force Approach: Manual Tuning
- Decode with different scaling factors
- Get a feeling for the range of good values
- Get a feeling for the importance of the models
- LM is typically most important
- Sentence length to balance the shortening effect of the LM
- Word reordering is more or less effective depending on the language
- Narrow down the range in which the scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time
6. Automatic Tuning
- Many algorithms to find (near-)optimal solutions are available
- Simplex
- Maximum entropy
- Minimum error training
- Minimum Bayes risk training
- Genetic algorithms
- Note: the models are not improved, only their combination
- Large number of full translations required -> still problematic when decoding is slow
7. Automatic Tuning on N-best Lists
- Generate n-best lists, e.g. 1000 translations for each of 500 source sentences
- Loop:
- Changing the scaling factors results in re-ranking the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques (see the sketch below)
- Advantage: much faster
- Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
- For the Bleu or NIST metric with a global length penalty, do local hill climbing for each individual n-best list
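A minimal sketch of the tuning loop on fixed n-best lists, assuming a generic corpus_metric callback (in practice one would pre-compute per-hypothesis n-gram statistics so that this call is cheap); the simple random-coordinate hill climbing stands in for any of the optimizers listed above.

```python
import random

# Sketch: changing the scaling factors only re-ranks the fixed n-best lists,
# so no re-decoding is needed.  corpus_metric(list_of_1best) -> float is an
# assumed callback (e.g. Bleu over the selected 1-best translations).

def rerank(nbest, weights):
    """nbest: list of (feature_vector, hypothesis); return the best hypothesis."""
    return max(nbest, key=lambda fh: sum(w * f for w, f in zip(weights, fh[0])))[1]

def tune(nbest_lists, corpus_metric, weights, iters=50, step=0.1):
    best = corpus_metric([rerank(nb, weights) for nb in nbest_lists])
    for _ in range(iters):
        k = random.randrange(len(weights))          # pick one scaling factor
        for delta in (-step, step):                 # try moving it up and down
            trial = list(weights)
            trial[k] += delta
            score = corpus_metric([rerank(nb, trial) for nb in nbest_lists])
            if score > best:
                best, weights = score, trial
    return weights, best
```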
8. Minimum Error Training
- For each scaling factor we have Q = ck*Qk + QRest
- For different values of ck, different hypotheses have the best score
- Different hypotheses lead to different MT evaluation scores (see the sketch below)
[Figure: hypothesis scores as linear functions of ck for two n-best lists, with hypotheses h11 (WER 8), h12 (WER 5), h13 (WER 4) and h21 (WER 2), h22 (WER 0), h23 (WER 5); as ck varies, a different hypothesis becomes the best one, so the error of the 1-best output changes with ck]
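A minimal sketch of the idea behind the figure: for a single scaling factor ck, each hypothesis score is linear in ck, so one can see which hypothesis wins for which value. The grid sweep below is a simplification (the exact algorithm intersects the lines instead of sampling); the toy numbers are made up.

```python
# Sketch: Q(c_k) = c_k * Q_k + Q_rest is linear in c_k for every hypothesis.
# Sweeping c_k shows which hypothesis wins where; the errors of the winners
# then tell us which value of c_k minimizes the error.

def best_hyp_per_value(hyps, c_values):
    """hyps: list of (q_k, q_rest, error).  Returns [(c, winner_index)]."""
    winners = []
    for c in c_values:
        scores = [c * q_k + q_rest for q_k, q_rest, _ in hyps]
        winners.append((c, max(range(len(hyps)), key=scores.__getitem__)))
    return winners

# Toy n-best list for one sentence: (Q_k, Q_rest, word error count)
hyps = [(2.0, 1.0, 8), (1.0, 3.0, 2), (0.5, 4.0, 0)]
for c, idx in best_hyp_per_value(hyps, [0.0, 1.0, 2.0, 3.0, 4.0]):
    print(f"c_k={c}: best hyp {idx}, errors {hyps[idx][2]}")
```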
9. Exploring Lattice Input for Statistical Machine Translation
- Stephan Vogel, Kay Rottmann, Bing Zhao, Sanjika Hewavitharana
- InterACT, Carnegie Mellon University
- InterACT, University of Karlsruhe
10. Motivation
- Lattice translation is used for tighter coupling between speech recognition and translation
- Use alternative word sequences
- A lattice is more efficient than an n-best list
- Decouple processing steps, yet use the relevant information for end-to-end optimization
- Lattices can be used to encode alternatives arising from other knowledge sources
- Disfluency annotation: add edges to the lattice to skip over disfluencies
- Add synonyms and paraphrases
- Add morphological variants, esp. for unknown words
- Allow different word segmentations (e.g. for Chinese)
- Partial word reordering
11. Advantages of Using Lattices
- Decouple processing steps
- Different people can work on different parts of the system with less interference
- Use stronger models (additional tools) which would be difficult to integrate into the decoder
- Closer to "one solution fits all": if the decoder can translate lattices, it can do x, y, z
- No hard decisions
- Keep alternatives until additional information is available
- Assign probabilities to those alternatives
- Use them as additional features in minimum error rate training
- Essentially, many arguments for n-best list rescoring are good arguments for lattice translation
12. Outline
- Word reordering (some nice results)
- Multiple word segmentations for Chinese (initial results)
- Paraphrasing (Matthias Bracht)
13. Distortion Models for Word Reordering
- Distortion models are part of the IBM and HMM alignment models
- Absolute position (IBM2) or relative position (HMM, IBM4)
- Conditioned on sentence length, word class, word (-> lexicalized DM)
- In phrase-based systems
- Standard DMs are simple jump models, also lexicalized
- Block reordering models
- Some attempts have been made to do reordering as pre-/post-processing
- Reorder the source sentence to fit the word order of the target sentence
- Reorder the target sentence, insert reordering markers
- Reordering based on hand-written rules
- Reordering based on word alignment; need to learn a reordering model to apply to test sentences
- Problem with reordering as preprocessing: difficult to recover from errors, therefore keep alternatives
14. Word Reordering Based on POS Patterns
- Learning reordering patterns
- Use word alignment information: reorder the source words to make the alignment (locally) monotone
- Collect the reordering patterns: original word sequence -> reordered word sequence
- But use POS tags rather than words to get better generalization
- Extension: use context information
- Context: 1 POS tag left, 1 POS tag right
- Apply a reordering pattern only when the context also matches
- I.e. compare a POS sequence of length n+2, reorder the sequence of length n
- Pruning
- Keep only patterns which have been seen > 10 times
- Keep only patterns with relative frequency > threshold (typically 0.1)
15. Example for Learning RO-Patterns
- Spanish: en esto estamos todos de acuerdo .
- English: we all agree on that .
- POS: PRP DT VB IN DT .
- Alignment: NULL ( ) en ( 4 ) esto ( 5 ) estamos ( 1 ) todos ( 2 ) de ( ) acuerdo ( 3 ) . ( 6 )
- Extracted rules (see the sketch below):
- PRP DT VB IN DT -> 4 5 1 2 3
- PRP DT VB -> 2 3 1
- PRP DT VB IN -> 3 4 1 2
- When using embedded patterns, restrict length to 7
- When dropping embedded patterns, use up to 20 words
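A minimal sketch of pattern learning from one word-aligned sentence pair; the data structures (POS list, alignment dict) and the handling of unaligned words are illustrative assumptions, not the system's exact extraction procedure.

```python
# Sketch: learn a reordering pattern from one aligned sentence pair.
# 'alignment' maps each source position to its aligned target positions;
# unaligned words inherit the anchor of the next aligned word, so they
# travel with it when the source is reordered.

def monotone_order(alignment, n):
    """Return the source positions in the order that makes the alignment monotone."""
    anchors, nxt = [None] * n, float("inf")
    for i in reversed(range(n)):
        if alignment.get(i):
            nxt = min(alignment[i])
        anchors[i] = nxt
    return sorted(range(n), key=lambda i: (anchors[i], i))

def extract_pattern(src_pos_tags, alignment):
    order = monotone_order(alignment, len(src_pos_tags))
    if order == list(range(len(src_pos_tags))):
        return None                      # already monotone, nothing to learn
    # pattern: POS sequence -> permutation (1-based original positions)
    return tuple(src_pos_tags), tuple(i + 1 for i in order)

# Toy run on the Spanish sentence above (0-based target positions, made-up tags)
pos = "IN DT VB DT IN NN .".split()
ali = {0: [3], 1: [4], 2: [0], 3: [1], 5: [2], 6: [5]}
print(extract_pattern(pos, ali))
```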
16. Constructing the Reordering Lattice
- Tag the test sentence
- For each matching POS sequence, create a parallel reordered path (see the sketch below)
- A confusion lattice would over-generate too much
- E.g. NN ADJ and the reordered ADJ NN would also generate the undesirable NN NN and ADJ ADJ
[Figure: RO-lattice for the example sentence "A final agreement has not yet been reached"]
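A minimal sketch of lattice construction under the pattern representation from the previous sketch; the edge format (from_node, to_node, word, score) and the example pattern are assumptions for illustration only.

```python
# Sketch: build a reordering lattice as edges (from_node, to_node, word, score).
# Node k sits between word k-1 and word k of the original sentence; each
# matching pattern adds a parallel, reordered path instead of a full
# confusion network (which would over-generate).

def build_ro_lattice(words, pos_tags, patterns):
    edges = [(i, i + 1, w, 1.0) for i, w in enumerate(words)]   # monotone path
    new_node = len(words) + 1                                    # next unused node id
    for (pos_seq, perm), score in patterns.items():
        n = len(pos_seq)
        for start in range(len(words) - n + 1):
            if tuple(pos_tags[start:start + n]) != pos_seq:
                continue
            # parallel reordered path from node 'start' to node 'start + n'
            nodes = [start] + list(range(new_node, new_node + n - 1)) + [start + n]
            new_node += n - 1
            for k, orig in enumerate(perm):
                edges.append((nodes[k], nodes[k + 1], words[start + orig - 1], score))
    return edges

words = "a final agreement has not yet been reached".split()
tags  = "DT JJ NN VBZ RB RB VBN VBN".split()        # illustrative tags
patterns = {(("RB", "RB", "VBN"), (3, 1, 2)): 0.4}   # hypothetical pattern/score
for edge in build_ro_lattice(words, tags, patterns):
    print(edge)
```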
17. The Translation Process
- Given a phrase table
- Represented as a prefix tree in memory
- With emitting nodes pointing to multiple translations
- And a lattice
- Run over the lattice and the prefix tree in parallel (see the sketch below)
- Expand phrases one word at a time
- If a final state in the tree is reached, create a new lattice edge for each translation
- Translation lattices have one or multiple scores
- Propagate the RO pattern score to the translation edges
- First-best or n-best path search through the lattice
- Apply LM, word count, phrase count, ...
- Allow for local reordering within a given window (typically 4 words)
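A minimal sketch of the parallel walk over lattice and prefix tree; the trie and lattice formats are illustrative assumptions, not the decoder's actual data structures.

```python
# Sketch: walk the input lattice and a prefix tree of source phrases in
# parallel, extending a phrase one word per lattice edge and emitting a
# translation edge whenever a trie node carries translations.

def add_translation_edges(lattice_edges, phrase_trie):
    """lattice_edges: (from_node, to_node, word, ro_score) tuples.
    phrase_trie: nested dicts keyed by source word; a node may hold a
    '<trans>' key with a list of (target_phrase, tm_score) pairs."""
    outgoing = {}
    for edge in lattice_edges:
        outgoing.setdefault(edge[0], []).append(edge)

    translation_edges = []

    def expand(node, trie_node, ro_score, start):
        for _, to, word, score in outgoing.get(node, []):
            child = trie_node.get(word)
            if child is None:
                continue                          # no phrase continues with this word
            acc = ro_score * score                # propagate the RO pattern score
            for target, tm_score in child.get("<trans>", []):
                translation_edges.append((start, to, target, tm_score, acc))
            expand(to, child, acc, start)         # try to extend the phrase

    for start in outgoing:                        # phrases may start at any node
        expand(start, phrase_trie, 1.0, start)
    return translation_edges

# Toy trie and lattice (hypothetical entries)
trie = {"not": {"yet": {"<trans>": [("todavía no", -1.2)]}}}
edges = [(0, 1, "not", 1.0), (1, 2, "yet", 1.0), (0, 3, "yet", 0.4)]
print(add_translation_edges(edges, trie))
```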
18. Results English -> Spanish
- Training and test corpus from the TC-Star 2007 evaluation
- Systems
- Baseline uses an (unlexicalized) jump model
- Use RO patterns and monotone decoding
- Adding context information (1 POS left, 1 POS right)
- Allow for additional reordering in the decoder
- Note: scores reported here are higher than in the graphics on the two previous slides due to less pruning in the final MER optimization
19. Effect of Pruning on Translation Quality
- Pruning, i.e. removing low-probability reordering patterns, helps
- The effect is rather pronounced, and therefore worrying
20. Adding Context Information
- Pattern context: one POS left, one POS right
- Results
- 5-fold increase in the number of patterns
- More stable
- Small (but consistent) improvements
21. Arabic-English (Small System)
- Training only on the news corpus, dev set mt03
- Tagging with the Stanford parser
- Results
- We see improvements for Arabic, so far not as impressive as for Spanish
- So far the RO model is trained on only 150k sentences
22. Translation with Multiple Word Segmentations
- Chinese-to-English translation results depend on the word segmentation
- Large word lists for segmentation lead to a larger number of unknown words in the test sets
- Different segmentations result in unknown words at different positions in the test sentences
- Intuitively, the vocabulary of the segmented Chinese corpus should be close in size to the vocabulary of the English side (modulo morphology)
23. Initial Experiments
- Training
- IBM 4 alignment models in both directions on 200 million words
- Phrase pairs from the combined Viterbi paths
- Note: training with only one segmentation
- Test set
- 494 Chinese sentences from ASR transcriptions of broadcast news
- Two references
- Multiple segmentations for the test sentences
- Segmentations based on word lists of different sizes
- Also segmentation into individual characters
- Add source word features
- Unigram probability
- Length of the word in characters
24. Results
- A large vocabulary for segmentation hurts performance
- Using alternative segmentations alone did not help
- Adding probability information hurt performance (the training corpus has one segmentation only)
- Adding the length of the source words as a feature did result in a nice improvement
- Note: the lattice reduces the number of UNKs from 96 to 67
25. Summary
- Lattice translation is a nice way to decouple preprocessing and decoding without making hard decisions
- Lattices are an efficient structure to handle many alternatives
- Allow testing richer preprocessing in a very simple way
- One solution for many tasks
- Successful applications
- Word reordering
- Word segmentation
26. Future Work
- Reordering
- Arabic full system (underway)
- Currently some problems with tagging
- Chinese system
- Restricted to word segmentation for tagging
- Investigate learning the patterns on the word level
- Using additional features for learning and scoring reordering patterns
- Multiple word segmentations
- Train different segmentations
- Paraphrasing
- Get the system up and running
- Analyze where the (expected) benefits come from
- Additional applications
- Disfluency removal
- Morphology
27. Quo Vadis: What's Next?
- What are the major problems, and how can we overcome them?
- Quality
- Word order
- Lexical choice
- Agreement
- Language specific
- Partial alignment to compounds in German-English translations
- Word segmentation (Chinese, etc.) can significantly influence translation results
- Usability
- Sufficient quality for standard text translation applications
- Speech translation, e.g. parliamentary speeches
- Limited-domain speech translation on handheld devices
28. Translation Quality
- Oracle experiments show that better translations should be possible given the existing data
- Oracle score: for each source sentence, select the translation from the n-best list which best matches the human reference translations (see the sketch below)
- Of course, the score depends on the size of the n-best list
- For a 1000-best list the oracle score is typically 10 Bleu points higher than the 1-best result
- Why are they not selected as 1-best?
- Variability in translations
- Only weak correlation (on the per-sentence level) between model scores and human evaluation scores
- Our models do not assign the best score to the best translation
- Ultimately, our models are still too weak
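A minimal sketch of oracle selection from an n-best list, assuming a smoothed sentence-level BLEU against a single reference; the real oracle would use whatever metric the evaluation uses and possibly multiple references.

```python
from collections import Counter
import math

# Sketch: pick the n-best entry that best matches the reference.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    logs = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        logs.append(math.log((match + 1) / (total + 1)))    # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))        # brevity penalty
    return bp * math.exp(sum(logs) / max_n)

def oracle(nbest, reference):
    return max(nbest, key=lambda hyp: sentence_bleu(hyp, reference))
```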
29. Problems with Current Models
- Not the right models, i.e. we miss important aspects of languages and translation
- Models have simplifying assumptions, e.g. independence assumptions
- Models are trained separately but interact at decoding time
- Mismatch between the training criterion (maximum likelihood) and the decoding criterion (minimum error)
30. Solution: Learning Richer Models
- Adding new models, e.g. a word fertility model, a class-based LM, etc.: this is the usual way to improve the systems
- Problem: the number of parameters becomes too large, i.e. a better model but unreliable parameter estimation
- Discriminative training to retrain the given models, e.g. perceptron learning
- Results so far did not really give significant improvements
31. Solution: Soft Extension of Models
- Add discriminating features to generate/select the correct translation for the current sentence
- Dangerous: could degrade translations of other sentences
- Solution: only add discriminating features to the current model which correct the current error without destroying other translations
- Need to get generalization, i.e. apply more abstract features: not p( f | e, f' ) but p( f | e, 'current topic is X' ) or p( f | e, 'there is an auxiliary verb close by' )
- There is a very rich and constantly evolving set of machine learning techniques; we need to investigate them
32. Improving Quality through Data
- We have been improving quality constantly over the last couple of years just by using more data
- So continue to use more and more data
- Only for a small number of languages, but resources are growing at an immense rate
- Monolingual data: definitely
- Bilingual data: more problematic, currently only for major language pairs
- Creates engineering challenges
- Memory and speed requirements grow
- Distributed processing becomes necessary
- Opens scientific possibilities
- More data means more model parameters can be estimated
- Fancier models can be used
33. Going Large
- Currently large research systems for Chinese and Arabic
- 250 million word corpus, resp. 120 million words
- Vocabularies > 1 million full word forms
- Memory problems
- Phrase tables (1-5 words) too large to fit into memory
- Sampling techniques are used
- No one-for-all systems
- Translating a test set (typically < 1000 sentences) will take hours
- Sampling the phrase table
- Even retraining on sub-sampled training data
- But this could even improve performance - adaptation
- Time problems
- Training with GIZA for Chinese-English (one direction) takes 5 days on 1 CPU
- Many groups have started to parallelize training
- Parallelizing translation is trivial (if your models fit into the memory of each machine)
34. Big LMs, Really Big LMs
- We (at CMU) typically work with a 200 million word 3-gram LM
- Google reported an improvement of 5 Bleu points using a 200 billion word 5-gram LM
- 1.6 TB n-gram table
- 1000 CPUs to sample
- 40 hours for this sampling
- IBM reports 3 Bleu points for their Arabic system using the Gigaword corpus, i.e. a 3 billion word 5-gram LM
- Typically, a linear improvement in MT quality (as measured with standard MT metrics) requires exponential growth in corpus size
- Holds for bilingual and monolingual data
- The slope differs for different languages
- The slope differs for very in-domain data and not-so-in-domain data
- Out-of-domain data hurts small systems; unclear for very large systems
35. Large Corpus for N-Best List Rescoring
- For a hypothesis from the n-best list, calculate the collocation statistics of all n-gram pairs (no restriction on the length of the n-grams!)
- The prime minister calls on the people to work together for permanent peace.
- I(the, prime), I(the, prime minister), ..., I(the prime minister, the people), ..., I(calls on, to), ..., I(work, for), ...
- Collocation statistics for an n-gram pair (s, t)
- Different co-occurrence statistics explored
- Point-wise Mutual Information works best
- Accumulate the collocations for a sentence (see the sketch below)
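A minimal sketch of accumulating point-wise mutual information over the n-gram pairs of one hypothesis; count(x) and cooc_count(x, y) are assumed callbacks onto the corpus index described on the next slides.

```python
import math

# Sketch: PMI of an n-gram pair from sentence-level counts, summed over all
# pairs of n-grams in one hypothesis.

def pmi(count_xy, count_x, count_y, num_sentences):
    """Point-wise mutual information from sentence co-occurrence counts."""
    if not (count_xy and count_x and count_y):
        return 0.0
    p_xy = count_xy / num_sentences
    return math.log(p_xy / ((count_x / num_sentences) * (count_y / num_sentences)))

def collocation_score(hyp_ngrams, count, cooc_count, num_sentences):
    """Accumulate PMI over all n-gram pairs of one hypothesis.
    count(x) and cooc_count(x, y) are callbacks onto the corpus index."""
    total = 0.0
    for i, x in enumerate(hyp_ngrams):
        for y in hyp_ngrams[i + 1:]:
            total += pmi(cooc_count(x, y), count(x), count(y), num_sentences)
    return total
```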
36. Co-Occurrences of n-gram Pairs
- Index the corpus of N words using a suffix array (Manber & Myers, 1990)
- For a sentence with m words, locate all its embedded n-grams in the corpus within O(m log N) time (Zhang & Vogel, 2005) (see the sketch below)
- For each n-gram, locate the sentence IDs of all of its occurrences
- Calculate co-occurrences for all the n-gram pairs
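A minimal sketch of the suffix-array lookup: construction here is naive (production code would use a linear-time algorithm over the 100-million-word chunks), but locating an n-gram is the binary search over sorted suffixes that gives the O(m log N) behavior.

```python
# Sketch: suffix array over a tokenized corpus; locate() finds all corpus
# positions of an n-gram by binary search over the sorted suffixes.

class SuffixArray:
    def __init__(self, tokens):
        self.tokens = tokens
        # naive O(N log N) construction with full suffix comparisons
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def locate(self, ngram):
        """Corpus positions where 'ngram' (a tuple of words) occurs."""
        ngram = list(ngram)

        def prefix(i):
            return self.tokens[i:i + len(ngram)]

        lo, hi = 0, len(self.sa)
        while lo < hi:                      # lower bound
            mid = (lo + hi) // 2
            if prefix(self.sa[mid]) < ngram:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(self.sa)
        while lo < hi:                      # upper bound
            mid = (lo + hi) // 2
            if prefix(self.sa[mid]) <= ngram:
                lo = mid + 1
            else:
                hi = mid
        return [self.sa[i] for i in range(start, lo)]

corpus = "the prime minister calls on the people the prime example".split()
print(SuffixArray(corpus).locate(("the", "prime")))   # positions 0 and 7
```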
37. Distributed Computing for Large Data
- The corpus and suffix array for 100 million words need 900 MB RAM
- We use 2.9 billion words from the Gigaword corpus
- 100 corpus chunks distributed over 20-30 machines
[Diagram: a client sends each hypothesis (Hyp 1 ... Hyp N) to monolingual corpus information servers (NYT1999, NYT2000, ..., XINHUA2004) and adds up the co-occurrence information returned by the servers]
38. Experiment Results
- TIDES 03 Chinese-English test set
- Selection: for each sentence, select the corpus segment with the highest n-gram match
39. Corpus Selection
[Figure: translation quality (BLEU, NIST) when using different amounts of data]
- The useful information is often in a small subsection of the data
- Some portions hurt
- Add a selection mechanism - adaptation
40. Going Small
- Two issues
- Generating bilingual data from scratch
- Running the translation system on devices with limited resources
- If needed, translations can be made
- What should be translated?
- Select from a larger monolingual corpus, typically available for one of the two languages
- Selecting sentences to cover vocabulary and bigrams seems to be a good strategy (see the sketch below)
- Get those translated
41. Selection: N-grams / Sentence Length
- 20% of the corpus, well selected, gives nearly the same result as using the full corpus
- Using trigrams in the selection process does not make a difference
42. Going Small: Pruning the Phrase Table
- Sure, you remove translation alternatives with too low probabilities
- And you don't store very long phrases which might never be used
- But: can you then eliminate another 50% or 80% of the entries without hurting performance?
- Current studies: remove entries
- which can be generated from shorter phrases
- and which are close in probabilities (see the sketch below)
- The method guarantees that you don't lose coverage
- Initial results on BTEC data are successful: no degradation up to removing 80% of the entries
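A minimal sketch of the pruning idea under simplifying assumptions: only monotone two-way splits are checked and probabilities are simply multiplied; the threshold and composition rule are illustrative, not the exact criterion used in the study.

```python
import math

# Sketch: drop a long phrase pair if it can be rebuilt from two shorter
# entries whose combined probability is close to its own probability.
# phrase_table: dict (src_words, tgt_words) -> prob, phrases as word tuples.

def prunable(phrase_table, src, tgt, prob, max_log_diff=0.5):
    for i in range(1, len(src)):
        for j in range(1, len(tgt)):
            left  = phrase_table.get((src[:i], tgt[:j]))
            right = phrase_table.get((src[i:], tgt[j:]))
            if left and right and abs(math.log(left * right) - math.log(prob)) < max_log_diff:
                return True              # the long entry adds (almost) nothing
    return False

def prune(phrase_table, max_log_diff=0.5):
    keep = {}
    for (src, tgt), prob in phrase_table.items():
        if len(src) == 1 or not prunable(phrase_table, src, tgt, prob, max_log_diff):
            keep[(src, tgt)] = prob      # single-word entries preserve coverage
    return keep
```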
43. Going Small: SMT on Handheld Devices
- We do not always have the big memory: can we still build data-driven systems?
- Yes, we successfully built a 2-way speech translation system for a PDA
- Specification
- 1 GB on the PDA, i.e. not so small
- Models are used directly from the flash card
- 1m source phrases plus 1m target phrases plus 5m pairs
- Typical domain-specific systems (travel, medical) use 10-20 MB
- The language model actually uses more memory than the phrase table
44. Going Deep
- More structure
- More linguistic knowledge
- Parsing
- String-to-tree and tree-to-tree mapping
- More features, i.e. richer models
- More data, more parameters, richer models possible
- Context-dependent lexicon models p( f | e, f' ), with f' on the left, on the right, or somewhere in the sentence
- Distortion models p( jump | f1, f2, e1, e2 )
- Dependencies on word classes
45. Syntax-Based SMT Systems
- Many folks are working on this
- It's tough to beat the phrase-based systems
- Bet between Franz Josef Och and Daniel Marcu
- So far Franz is the winner
- But Daniel remains optimistic that syntax-based systems will catch up
- Also a new effort in our group
- Parse the English side
- Align the corpus and induce phrase pairs
- Use this information to generate (hierarchical) translation rules
- Use a chart-based decoder to translate
- Current situation: catching up with - not yet exceeding - the standard system
- But it relies on phrase alignment, i.e. improvements in phrase alignment will improve this system too
- Still problems with the integration of the LM
- Requires smarter pruning to bring down the run-time
46. Parallel (Bilingual) Processing
- Currently
- Word alignment models bridge the language gap
- But a load of preprocessing steps are done monolingually
- Examples
- Number tagging
- Word segmentation for Chinese, Japanese, etc.
- Using a morphology toolkit to fragment Arabic words
- Often this leads to an even greater disparity between the two languages
- Need integrated models
- Word segmentation integrated with word alignment
- Splitting/deleting of morphemes integrated with word alignment
47. Chinese Word Segmentation
- Different Chinese word segmenters: Stanford, IBM, CMU
- Which segmenter to use, i.e. which segmenter is best?
- Best = closest to a gold standard, e.g. the treebank
- Best = best translation (according to some automatic metric)
- Translation experiment
- Bilingual: 19 m words (English)
- LM from 200 m words
- Phrase pairs from sentences < 20 words
- Phrase probabilities only from lexical features
- Test set: mt03
- The Stanford segmenter is best when evaluated against the treebank
- But it is problematic for machine translation
48. Quo Vadis
- Continuing to improve best practices
- More data
- Incremental extensions to word and phrase alignment models
- Improvements in search strategies
- Going large
- Very large corpora: monolingual, comparable
- Distributed processing
- But this should not turn science into a pure engineering task
- Going small
- Limited-domain applications on handheld devices
- Selecting the right data
- Removing redundant information from the models
- Going deep
- Structurally rich models
- Exploring generously the arsenal of machine learning techniques
- Soft extensions of current models: a minimal number of additional parameters to minimize errors