Title: Statistical Machine Translation: Quo Vadis?
1. Statistical Machine Translation: Quo Vadis?
- Stephan Vogel
- Interactive Systems Lab
- Language Technologies Institute
- Carnegie Mellon University
2. N-Best List Generation
- Benefits
- Required for optimizing model scaling factors
- Rescoring
- For translation with a pivot language L1 -> L2 -> L3
- We have n-best translations at the sentence end
- But hypotheses are recombined -> many good translations don't reach the sentence end
- Recover those translations
3. Storing Multiple Backpointers
- When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don't expand them (see the sketch below)
[Figure: recombined hypotheses hr are kept as additional backpointers on the surviving best hypothesis hb]
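A minimal sketch of this idea, assuming a hypothetical Hypothesis class and recombination key (not the actual decoder data structures): losing hypotheses are attached to the survivor instead of being discarded, so they can later be recovered for n-best extraction.

```python
# Sketch: keep recombined hypotheses as extra backpointers on the survivor.
# The Hypothesis class and the recombination key are illustrative assumptions.

class Hypothesis:
    def __init__(self, score, lm_state, coverage, backpointer):
        self.score = score            # model score so far
        self.lm_state = lm_state      # last n-1 target words
        self.coverage = coverage      # covered source positions (frozenset)
        self.backpointer = backpointer
        self.recombined = []          # losing hypotheses: stored, never expanded

    def key(self):
        # Hypotheses with the same key are indistinguishable for future expansion.
        return (self.lm_state, self.coverage)


def recombine(stack):
    """Merge equivalent hypotheses; keep the losers on the winner."""
    best = {}
    for hyp in stack:
        k = hyp.key()
        if k not in best:
            best[k] = hyp
        elif hyp.score > best[k].score:
            hyp.recombined.extend([best[k]] + best[k].recombined)
            best[k].recombined = []
            best[k] = hyp
        else:
            best[k].recombined.append(hyp)
    return list(best.values())
```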
4. Tuning the SMT System
- We use different models in the SMT system
- Models have simplifications
- Trained on different amounts of data
- -> Different levels of reliability
- -> Give different weights to the different models: Q = c1*Q1 + c2*Q2 + ... + cn*Qn
- Find optimal scaling factors c1 ... cn
- Optimal means: highest score for the chosen evaluation metric (see the sketch below)
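A minimal sketch of the log-linear combination Q = c1*Q1 + ... + cn*Qn; the feature names and values are illustrative assumptions only.

```python
# Sketch: combine model scores with scaling factors, Q = c1*Q1 + ... + cn*Qn.
# Feature names and values are made up for illustration.

def combined_score(feature_scores, scaling_factors):
    """Both arguments are dicts keyed by model name."""
    return sum(scaling_factors[name] * q for name, q in feature_scores.items())

hyp_scores = {"tm": -12.3, "lm": -45.1, "length": 9.0, "distortion": -3.0}
weights    = {"tm": 1.0,  "lm": 0.8,  "length": 0.3, "distortion": 0.2}
print(combined_score(hyp_scores, weights))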
5. Brute Force Approach: Manual Tuning
- Decode with different scaling factors
- Get a feeling for the range of good values
- Get a feeling for the importance of the models
- LM is typically most important
- Sentence length to balance the shortening effect of the LM
- Word reordering is more or less effective depending on the language
- Narrow down the range in which the scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time
6. Automatic Tuning
- Many algorithms to find (near-)optimal solutions are available
- Simplex
- Maximum entropy
- Minimum error training
- Minimum Bayes risk training
- Genetic algorithms
- Note: the models are not improved, only their combination
- Large number of full translations required -> still problematic when decoding is slow
7. Automatic Tuning on N-best Lists
- Generate n-best lists, e.g. 1000 translations for each of 500 source sentences
- Loop:
- Changing the scaling factors results in re-ranking the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques (see the sketch below)
- Advantage: much faster
- Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
- For the Bleu or NIST metric with a global length penalty, do local hill climbing for each individual n-best list
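A minimal sketch of the tuning loop on fixed n-best lists, assuming a generic corpus_metric callback (in practice one would pre-compute per-hypothesis n-gram statistics so that this call is cheap); the simple random-coordinate hill climbing stands in for any of the optimizers listed above.

```python
import random

# Sketch: changing the scaling factors only re-ranks the fixed n-best lists,
# so no re-decoding is needed.  corpus_metric(list_of_1best) -> float is an
# assumed callback (e.g. Bleu over the selected 1-best translations).

def rerank(nbest, weights):
    """nbest: list of (feature_vector, hypothesis); return the best hypothesis."""
    return max(nbest, key=lambda fh: sum(w * f for w, f in zip(weights, fh[0])))[1]

def tune(nbest_lists, corpus_metric, weights, iters=50, step=0.1):
    best = corpus_metric([rerank(nb, weights) for nb in nbest_lists])
    for _ in range(iters):
        k = random.randrange(len(weights))          # pick one scaling factor
        for delta in (-step, step):                 # try moving it up and down
            trial = list(weights)
            trial[k] += delta
            score = corpus_metric([rerank(nb, trial) for nb in nbest_lists])
            if score > best:
                best, weights = score, trial
    return weights, best
```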
8. Minimum Error Training
- For each scaling factor we have Q = ck*Qk + QRest
- For different values of ck, different hypotheses have the best score
- Different hypotheses lead to different MT evaluation scores (see the sketch below)
[Figure: hypothesis scores as linear functions of ck for two n-best lists, with hypotheses h11 (WER 8), h12 (WER 5), h13 (WER 4) and h21 (WER 2), h22 (WER 0), h23 (WER 5); as ck varies, a different hypothesis becomes the best one, so the error of the 1-best output changes with ck]
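A minimal sketch of the idea behind the figure: for a single scaling factor ck, each hypothesis score is linear in ck, so one can see which hypothesis wins for which value. The grid sweep below is a simplification (the exact algorithm intersects the lines instead of sampling); the toy numbers are made up.

```python
# Sketch: Q(c_k) = c_k * Q_k + Q_rest is linear in c_k for every hypothesis.
# Sweeping c_k shows which hypothesis wins where; the errors of the winners
# then tell us which value of c_k minimizes the error.

def best_hyp_per_value(hyps, c_values):
    """hyps: list of (q_k, q_rest, error).  Returns [(c, winner_index)]."""
    winners = []
    for c in c_values:
        scores = [c * q_k + q_rest for q_k, q_rest, _ in hyps]
        winners.append((c, max(range(len(hyps)), key=scores.__getitem__)))
    return winners

# Toy n-best list for one sentence: (Q_k, Q_rest, word error count)
hyps = [(2.0, 1.0, 8), (1.0, 3.0, 2), (0.5, 4.0, 0)]
for c, idx in best_hyp_per_value(hyps, [0.0, 1.0, 2.0, 3.0, 4.0]):
    print(f"c_k={c}: best hyp {idx}, errors {hyps[idx][2]}")
```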
9. Exploring Lattice Input for Statistical Machine Translation
- Stephan Vogel, Kay Rottmann, Bing Zhao, Sanjika Hewavitharana
- InterACT, Carnegie Mellon University
- InterACT, University of Karlsruhe
10. Motivation
- Lattice translation is used for tighter coupling between speech recognition and translation
- Use alternative word sequences
- A lattice is more efficient than an n-best list
- Decouple processing steps, yet use the relevant information for end-to-end optimization
- Lattices can be used to encode alternatives arising from other knowledge sources
- Disfluency annotation: add edges to the lattice to skip over disfluencies
- Add synonyms and paraphrases
- Add morphological variants, esp. for unknown words
- Allow different word segmentations (e.g. for Chinese)
- Partial word reordering
11. Advantages of Using Lattices
- Decouple processing steps
- Different people can work on different parts of the system with less interference
- Use stronger models (additional tools) which would be difficult to integrate into the decoder
- Closer to "one solution fits all": if the decoder can translate lattices, it can do x, y, z
- No hard decisions
- Keep alternatives until additional information is available
- Assign probabilities to those alternatives
- Use them as additional features in minimum error rate training
- Essentially, many arguments for n-best list rescoring are good arguments for lattice translation
12. Outline
- Word reordering (some nice results)
- Multiple word segmentations for Chinese (initial results)
- Paraphrasing (Matthias Bracht)
13. Distortion Models for Word Reordering
- Distortion models are part of the IBM and HMM alignment models
- Absolute position (IBM2) or relative position (HMM, IBM4)
- Conditioned on sentence length, word class, word (-> lexicalized DM)
- In phrase-based systems
- Standard DMs are simple jump models, also lexicalized
- Block reordering models
- Some attempts have been made to do reordering as pre-/post-processing
- Reorder the source sentence to fit the word order of the target sentence
- Reorder the target sentence, insert reordering markers
- Reordering based on hand-written rules
- Reordering based on word alignment; need to learn a reordering model to apply to test sentences
- Problem with reordering as preprocessing: difficult to recover from errors, therefore keep alternatives
14. Word Reordering Based on POS Patterns
- Learning reordering patterns
- Use word alignment information: reorder the source words to make the alignment (locally) monotone
- Collect the reordering patterns: original word sequence -> reordered word sequence
- But use POS tags rather than words to get better generalization
- Extension: use context information
- Context: 1 POS tag left, 1 POS tag right
- Apply a reordering pattern only when the context also matches
- I.e. compare a POS sequence of length n+2, reorder the sequence of length n
- Pruning
- Keep only patterns which have been seen > 10 times
- Keep only patterns with relative frequency > threshold (typically 0.1)
15. Example for Learning RO-Patterns
- Spanish: en esto estamos todos de acuerdo .
- English: we all agree on that .
- POS: PRP DT VB IN DT .
- Alignment: NULL ( ) en ( 4 ) esto ( 5 ) estamos ( 1 ) todos ( 2 ) de ( ) acuerdo ( 3 ) . ( 6 )
- Extracted rules (see the sketch below):
- PRP DT VB IN DT -> 4 5 1 2 3
- PRP DT VB -> 2 3 1
- PRP DT VB IN -> 3 4 1 2
- When using embedded patterns, restrict length to 7
- When dropping embedded patterns, use up to 20 words
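A minimal sketch of pattern learning from one word-aligned sentence pair; the data structures (POS list, alignment dict) and the handling of unaligned words are illustrative assumptions, not the system's exact extraction procedure.

```python
# Sketch: learn a reordering pattern from one aligned sentence pair.
# 'alignment' maps each source position to its aligned target positions;
# unaligned words inherit the anchor of the next aligned word, so they
# travel with it when the source is reordered.

def monotone_order(alignment, n):
    """Return the source positions in the order that makes the alignment monotone."""
    anchors, nxt = [None] * n, float("inf")
    for i in reversed(range(n)):
        if alignment.get(i):
            nxt = min(alignment[i])
        anchors[i] = nxt
    return sorted(range(n), key=lambda i: (anchors[i], i))

def extract_pattern(src_pos_tags, alignment):
    order = monotone_order(alignment, len(src_pos_tags))
    if order == list(range(len(src_pos_tags))):
        return None                      # already monotone, nothing to learn
    # pattern: POS sequence -> permutation (1-based original positions)
    return tuple(src_pos_tags), tuple(i + 1 for i in order)

# Toy run on the Spanish sentence above (0-based target positions, made-up tags)
pos = "IN DT VB DT IN NN .".split()
ali = {0: [3], 1: [4], 2: [0], 3: [1], 5: [2], 6: [5]}
print(extract_pattern(pos, ali))
```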
16. Constructing the Reordering Lattice
- Tag the test sentence
- For each matching POS sequence, create a parallel reordered path (see the sketch below)
- A confusion lattice would over-generate too much
- E.g. NN ADJ and the reordered ADJ NN would also generate the undesirable NN NN and ADJ ADJ
[Figure: RO-lattice for the example sentence "A final agreement has not yet been reached"]
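A minimal sketch of lattice construction under the pattern representation from the previous sketch; the edge format (from_node, to_node, word, score) and the example pattern are assumptions for illustration only.

```python
# Sketch: build a reordering lattice as edges (from_node, to_node, word, score).
# Node k sits between word k-1 and word k of the original sentence; each
# matching pattern adds a parallel, reordered path instead of a full
# confusion network (which would over-generate).

def build_ro_lattice(words, pos_tags, patterns):
    edges = [(i, i + 1, w, 1.0) for i, w in enumerate(words)]   # monotone path
    new_node = len(words) + 1                                    # next unused node id
    for (pos_seq, perm), score in patterns.items():
        n = len(pos_seq)
        for start in range(len(words) - n + 1):
            if tuple(pos_tags[start:start + n]) != pos_seq:
                continue
            # parallel reordered path from node 'start' to node 'start + n'
            nodes = [start] + list(range(new_node, new_node + n - 1)) + [start + n]
            new_node += n - 1
            for k, orig in enumerate(perm):
                edges.append((nodes[k], nodes[k + 1], words[start + orig - 1], score))
    return edges

words = "a final agreement has not yet been reached".split()
tags  = "DT JJ NN VBZ RB RB VBN VBN".split()        # illustrative tags
patterns = {(("RB", "RB", "VBN"), (3, 1, 2)): 0.4}   # hypothetical pattern/score
for edge in build_ro_lattice(words, tags, patterns):
    print(edge)
```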
17. The Translation Process
- Given a phrase table
- Represented as a prefix tree in memory
- With emitting nodes pointing to multiple translations
- And a lattice
- Run over the lattice and the prefix tree in parallel (see the sketch below)
- Expand phrases one word at a time
- If a final state in the tree is reached, create a new lattice edge for each translation
- Translation lattices have one or multiple scores
- Propagate the RO pattern score to the translation edges
- First-best or n-best path search through the lattice
- Apply LM, word count, phrase count, ...
- Allow for local reordering within a given window (typically 4 words)
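A minimal sketch of the parallel walk over lattice and prefix tree; the trie and lattice formats are illustrative assumptions, not the decoder's actual data structures.

```python
# Sketch: walk the input lattice and a prefix tree of source phrases in
# parallel, extending a phrase one word per lattice edge and emitting a
# translation edge whenever a trie node carries translations.

def add_translation_edges(lattice_edges, phrase_trie):
    """lattice_edges: (from_node, to_node, word, ro_score) tuples.
    phrase_trie: nested dicts keyed by source word; a node may hold a
    '<trans>' key with a list of (target_phrase, tm_score) pairs."""
    outgoing = {}
    for edge in lattice_edges:
        outgoing.setdefault(edge[0], []).append(edge)

    translation_edges = []

    def expand(node, trie_node, ro_score, start):
        for _, to, word, score in outgoing.get(node, []):
            child = trie_node.get(word)
            if child is None:
                continue                          # no phrase continues with this word
            acc = ro_score * score                # propagate the RO pattern score
            for target, tm_score in child.get("<trans>", []):
                translation_edges.append((start, to, target, tm_score, acc))
            expand(to, child, acc, start)         # try to extend the phrase

    for start in outgoing:                        # phrases may start at any node
        expand(start, phrase_trie, 1.0, start)
    return translation_edges

# Toy trie and lattice (hypothetical entries)
trie = {"not": {"yet": {"<trans>": [("todavía no", -1.2)]}}}
edges = [(0, 1, "not", 1.0), (1, 2, "yet", 1.0), (0, 3, "yet", 0.4)]
print(add_translation_edges(edges, trie))
```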
18. Results English -> Spanish
- Training and test corpus from the TC-Star 2007 evaluation
- Systems
- Baseline uses an (unlexicalized) jump model
- Use RO patterns and monotone decoding
- Adding context information (1 POS left, 1 POS right)
- Allow for additional reordering in the decoder
- Note: scores reported here are higher than in the graphics on the two previous slides due to less pruning in the final MER optimization
19. Effect of Pruning on Translation Quality
- Pruning, i.e. removing low-probability reordering patterns, helps
- The effect is rather pronounced, and therefore worrying
20. Adding Context Information
- Pattern context: one POS left, one POS right
- Results
- 5-fold increase in the number of patterns
- More stable
- Small (but consistent) improvements
21. Arabic-English (Small System)
- Training only on the news corpus, dev set mt03
- Tagging with the Stanford parser
- Results
- We see improvements for Arabic, so far not as impressive as for Spanish
- So far the RO model is trained on only 150k sentences
22. Translation with Multiple Word Segmentations
- Chinese-to-English translation results depend on the word segmentation
- Large word lists for segmentation lead to a larger number of unknown words in the test sets
- Different segmentations result in unknown words at different positions in the test sentences
- Intuitively, the vocabulary of the segmented Chinese corpus should be close in size to the vocabulary of the English side (modulo morphology)
23. Initial Experiments
- Training
- IBM 4 alignment models in both directions on 200 million words
- Phrase pairs from the combined Viterbi paths
- Note: training with only one segmentation
- Test set
- 494 Chinese sentences from ASR transcriptions of broadcast news
- Two references
- Multiple segmentations for the test sentences
- Segmentations based on word lists of different sizes
- Also segmentation into individual characters
- Add source word features
- Unigram probability
- Length of the word in characters
24. Results
- A large vocabulary for segmentation hurts performance
- Using alternative segmentations alone did not help
- Adding probability information hurt performance (the training corpus has one segmentation only)
- Adding the length of the source words as a feature did result in a nice improvement
- Note: the lattice reduces the number of UNKs from 96 to 67
25. Summary
- Lattice translation is a nice way to decouple preprocessing and decoding without making hard decisions
- Lattices are an efficient structure to handle many alternatives
- Allow testing richer preprocessing in a very simple way
- One solution for many tasks
- Successful applications
- Word reordering
- Word segmentation
26. Future Work
- Reordering
- Arabic full system (underway)
- Currently some problems with tagging
- Chinese system
- Restricted to word segmentation for tagging
- Investigate learning the patterns on the word level
- Using additional features for learning and scoring reordering patterns
- Multiple word segmentations
- Train different segmentations
- Paraphrasing
- Get the system up and running
- Analyze where the (expected) benefits come from
- Additional applications
- Disfluency removal
- Morphology
27. Quo Vadis: What's Next?
- What are the major problems, and how can we overcome them?
- Quality
- Word order
- Lexical choice
- Agreement
- Language specific
- Partial alignment to compounds in German-English translations
- Word segmentation (Chinese, etc.) can significantly influence translation results
- Usability
- Sufficient quality for standard text translation applications
- Speech translation, e.g. parliamentary speeches
- Limited-domain speech translation on handheld devices
28. Translation Quality
- Oracle experiments show that better translations should be possible given the existing data
- Oracle score: for each source sentence, select the translation from the n-best list which best matches the human reference translations (see the sketch below)
- Of course, the score depends on the size of the n-best list
- For a 1000-best list the oracle score is typically 10 Bleu points higher than the 1-best result
- Why are they not selected as 1-best?
- Variability in translations
- Only weak correlation (on the per-sentence level) between model scores and human evaluation scores
- Our models do not assign the best score to the best translation
- Ultimately, our models are still too weak
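A minimal sketch of oracle selection from an n-best list, assuming a smoothed sentence-level BLEU against a single reference; the real oracle would use whatever metric the evaluation uses and possibly multiple references.

```python
from collections import Counter
import math

# Sketch: pick the n-best entry that best matches the reference.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    logs = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        logs.append(math.log((match + 1) / (total + 1)))    # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))        # brevity penalty
    return bp * math.exp(sum(logs) / max_n)

def oracle(nbest, reference):
    return max(nbest, key=lambda hyp: sentence_bleu(hyp, reference))
```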
29. Problems with Current Models
- Not the right models, i.e. we miss important aspects of languages and translation
- Models have simplifying assumptions, e.g. independence assumptions
- Models are trained separately but interact at decoding time
- Mismatch between the training criterion (maximum likelihood) and the decoding criterion (minimum error)
30. Solution: Learning Richer Models
- Adding new models, e.g. a word fertility model, a class-based LM, etc.: this is the usual way to improve the systems
- Problem: the number of parameters becomes too large, i.e. a better model but unreliable parameter estimation
- Discriminative training to retrain the given models, e.g. perceptron learning
- Results so far did not really give significant improvements
31. Solution: Soft Extension of Models
- Add discriminating features to generate/select the correct translation for the current sentence
- Dangerous: could degrade translations of other sentences
- Solution: only add discriminating features to the current model which correct the current error without destroying other translations
- Need to get generalization, i.e. apply more abstract features: not p( f | e, f' ) but p( f | e, 'current topic is X' ) or p( f | e, 'there is an auxiliary verb close by' )
- There is a very rich and constantly evolving set of machine learning techniques; we need to investigate them
32. Improving Quality through Data
- We have been improving quality constantly over the last couple of years just by using more data
- So continue to use more and more data
- Only for a small number of languages, but resources are growing at an immense rate
- Monolingual data: definitely
- Bilingual data: more problematic, currently only for major language pairs
- Creates engineering challenges
- Memory and speed requirements grow
- Distributed processing becomes necessary
- Opens scientific possibilities
- More data means more model parameters can be estimated
- Fancier models can be used
33. Going Large
- Currently large research systems for Chinese and Arabic
- 250 million word corpus, resp. 120 million words
- Vocabularies > 1 million full word forms
- Memory problems
- Phrase tables (1-5 words) too large to fit into memory
- Sampling techniques are used
- No one-for-all systems
- Translating a test set (typically < 1000 sentences) will take hours
- Sampling the phrase table
- Even retraining on sub-sampled training data
- But this could even improve performance - adaptation
- Time problems
- Training with GIZA for Chinese-English (one direction) takes 5 days on 1 CPU
- Many groups have started to parallelize training
- Parallelizing translation is trivial (if your models fit into the memory of each machine)
34. Big LMs, Really Big LMs
- We (at CMU) typically work with a 200 million word 3-gram LM
- Google reported an improvement of 5 Bleu points using a 200 billion word 5-gram LM
- 1.6 TB n-gram table
- 1000 CPUs to sample
- 40 hours for this sampling
- IBM reports 3 Bleu points for their Arabic system using the Gigaword corpus, i.e. a 3 billion word 5-gram LM
- Typically, a linear improvement in MT quality (as measured with standard MT metrics) requires exponential growth in corpus size
- Holds for bilingual and monolingual data
- The slope differs for different languages
- The slope differs for very in-domain data and not-so-in-domain data
- Out-of-domain data hurts small systems; unclear for very large systems
35. Large Corpus for N-Best List Rescoring
- For a hypothesis from the n-best list, calculate the collocation statistics of all n-gram pairs (no restriction on the length of the n-grams!)
- The prime minister calls on the people to work together for permanent peace.
- I(the, prime), I(the, prime minister), ..., I(the prime minister, the people), ..., I(calls on, to), ..., I(work, for), ...
- Collocation statistics for an n-gram pair (s, t)
- Different co-occurrence statistics explored
- Point-wise Mutual Information works best
- Accumulate the collocations for a sentence (see the sketch below)
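A minimal sketch of accumulating point-wise mutual information over the n-gram pairs of one hypothesis; count(x) and cooc_count(x, y) are assumed callbacks onto the corpus index described on the next slides.

```python
import math

# Sketch: PMI of an n-gram pair from sentence-level counts, summed over all
# pairs of n-grams in one hypothesis.

def pmi(count_xy, count_x, count_y, num_sentences):
    """Point-wise mutual information from sentence co-occurrence counts."""
    if not (count_xy and count_x and count_y):
        return 0.0
    p_xy = count_xy / num_sentences
    return math.log(p_xy / ((count_x / num_sentences) * (count_y / num_sentences)))

def collocation_score(hyp_ngrams, count, cooc_count, num_sentences):
    """Accumulate PMI over all n-gram pairs of one hypothesis.
    count(x) and cooc_count(x, y) are callbacks onto the corpus index."""
    total = 0.0
    for i, x in enumerate(hyp_ngrams):
        for y in hyp_ngrams[i + 1:]:
            total += pmi(cooc_count(x, y), count(x), count(y), num_sentences)
    return total
```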
36. Co-Occurrences of n-gram Pairs
- Index the corpus of N words using a suffix array (Manber & Myers, 1990)
- For a sentence with m words, locate all its embedded n-grams in the corpus within O(m log N) time (Zhang & Vogel, 2005) (see the sketch below)
- For each n-gram, locate the sentence IDs of all of its occurrences
- Calculate co-occurrences for all the n-gram pairs
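A minimal sketch of the suffix-array lookup: construction here is naive (production code would use a linear-time algorithm over the 100-million-word chunks), but locating an n-gram is the binary search over sorted suffixes that gives the O(m log N) behavior.

```python
# Sketch: suffix array over a tokenized corpus; locate() finds all corpus
# positions of an n-gram by binary search over the sorted suffixes.

class SuffixArray:
    def __init__(self, tokens):
        self.tokens = tokens
        # naive O(N log N) construction with full suffix comparisons
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def locate(self, ngram):
        """Corpus positions where 'ngram' (a tuple of words) occurs."""
        ngram = list(ngram)

        def prefix(i):
            return self.tokens[i:i + len(ngram)]

        lo, hi = 0, len(self.sa)
        while lo < hi:                      # lower bound
            mid = (lo + hi) // 2
            if prefix(self.sa[mid]) < ngram:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(self.sa)
        while lo < hi:                      # upper bound
            mid = (lo + hi) // 2
            if prefix(self.sa[mid]) <= ngram:
                lo = mid + 1
            else:
                hi = mid
        return [self.sa[i] for i in range(start, lo)]

corpus = "the prime minister calls on the people the prime example".split()
print(SuffixArray(corpus).locate(("the", "prime")))   # positions 0 and 7
```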
37. Distributed Computing for Large Data
- The corpus and suffix array for 100 million words need 900 MB RAM
- We use 2.9 billion words from the Gigaword corpus
- 100 corpus chunks distributed over 20-30 machines
[Diagram: a client sends each hypothesis (Hyp 1 ... Hyp N) to monolingual corpus information servers (NYT1999, NYT2000, ..., XINHUA2004) and adds up the co-occurrence information returned by the servers]
38. Experiment Results
- TIDES 03 Chinese-English test set
- Selection: for each sentence, select the corpus segment with the highest n-gram match
39. Corpus Selection
[Figure: translation quality (BLEU, NIST) when using different amounts of data]
- The useful information is often in a small subsection of the data
- Some portions hurt
- Add a selection mechanism - adaptation
40. Going Small
- Two issues
- Generating bilingual data from scratch
- Running the translation system on devices with limited resources
- If needed, translations can be made
- What should be translated?
- Select from a larger monolingual corpus, typically available for one of the two languages
- Selecting sentences to cover vocabulary and bigrams seems to be a good strategy (see the sketch below)
- Get those translated
41. Selection: N-grams / Sentence Length
- 20% of the corpus, well selected, gives nearly the same result as using the full corpus
- Using trigrams in the selection process does not make a difference
42. Going Small: Pruning the Phrase Table
- Sure, you remove translation alternatives with too low probabilities
- And you don't store very long phrases which might never be used
- But: can you then eliminate another 50% or 80% of the entries without hurting performance?
- Current studies: remove entries
- which can be generated from shorter phrases
- and which are close in probabilities (see the sketch below)
- The method guarantees that you don't lose coverage
- Initial results on BTEC data are successful: no degradation up to removing 80% of the entries
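A minimal sketch of the pruning idea under simplifying assumptions: only monotone two-way splits are checked and probabilities are simply multiplied; the threshold and composition rule are illustrative, not the exact criterion used in the study.

```python
import math

# Sketch: drop a long phrase pair if it can be rebuilt from two shorter
# entries whose combined probability is close to its own probability.
# phrase_table: dict (src_words, tgt_words) -> prob, phrases as word tuples.

def prunable(phrase_table, src, tgt, prob, max_log_diff=0.5):
    for i in range(1, len(src)):
        for j in range(1, len(tgt)):
            left  = phrase_table.get((src[:i], tgt[:j]))
            right = phrase_table.get((src[i:], tgt[j:]))
            if left and right and abs(math.log(left * right) - math.log(prob)) < max_log_diff:
                return True              # the long entry adds (almost) nothing
    return False

def prune(phrase_table, max_log_diff=0.5):
    keep = {}
    for (src, tgt), prob in phrase_table.items():
        if len(src) == 1 or not prunable(phrase_table, src, tgt, prob, max_log_diff):
            keep[(src, tgt)] = prob      # single-word entries preserve coverage
    return keep
```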
43. Going Small: SMT on Handheld Devices
- We do not always have the big memory: can we still build data-driven systems?
- Yes, we successfully built a 2-way speech translation system for a PDA
- Specification
- 1 GB on the PDA, i.e. not so small
- Models are used directly from the flash card
- 1m source phrases plus 1m target phrases plus 5m pairs
- Typical domain-specific systems (travel, medical) use 10-20 MB
- The language model actually uses more memory than the phrase table
44. Going Deep
- More structure
- More linguistic knowledge
- Parsing
- String-to-tree and tree-to-tree mapping
- More features, i.e. richer models
- More data, more parameters, richer models possible
- Context-dependent lexicon models p( f | e, f' ), with f' on the left, on the right, or somewhere in the sentence
- Distortion models p( jump | f1, f2, e1, e2 )
- Dependencies on word classes
45. Syntax-Based SMT Systems
- Many folks are working on this
- It's tough to beat the phrase-based systems
- Bet between Franz Josef Och and Daniel Marcu
- So far Franz is the winner
- But Daniel remains optimistic that syntax-based systems will catch up
- Also a new effort in our group
- Parse the English side
- Align the corpus and induce phrase pairs
- Use this information to generate (hierarchical) translation rules
- Use a chart-based decoder to translate
- Current situation: catching up with - not yet exceeding - the standard system
- But it relies on phrase alignment, i.e. improvements in phrase alignment will improve this system too
- Still problems with the integration of the LM
- Requires smarter pruning to bring down the run-time
46. Parallel (Bilingual) Processing
- Currently
- Word alignment models bridge the language gap
- But a load of preprocessing steps are done monolingually
- Examples
- Number tagging
- Word segmentation for Chinese, Japanese, etc.
- Using a morphology toolkit to fragment Arabic words
- Often this leads to an even greater disparity between the two languages
- Need integrated models
- Word segmentation integrated with word alignment
- Splitting/deleting of morphemes integrated with word alignment
47. Chinese Word Segmentation
- Different Chinese word segmenters: Stanford, IBM, CMU
- Which segmenter to use, i.e. which segmenter is best?
- Best = closest to a gold standard, e.g. the treebank
- Best = best translation (according to some automatic metric)
- Translation experiment
- Bilingual: 19 m words (English)
- LM from 200 m words
- Phrase pairs from sentences < 20 words
- Phrase probabilities only from lexical features
- Test set: mt03
- The Stanford segmenter is best when evaluated against the treebank
- But it is problematic for machine translation
48. Quo Vadis
- Continuing to improve best practices
- More data
- Incremental extensions to word and phrase alignment models
- Improvements in search strategies
- Going large
- Very large corpora: monolingual, comparable
- Distributed processing
- But this should not turn science into a pure engineering task
- Going small
- Limited-domain applications on handheld devices
- Selecting the right data
- Removing redundant information from the models
- Going deep
- Structurally rich models
- Exploring generously the arsenal of machine learning techniques
- Soft extensions of current models: a minimal number of additional parameters to minimize errors