1
Statistical Machine Translation: Quo Vadis
  • Stephan Vogel
  • Interactive Systems Lab
  • Language Technologies Institute
  • Carnegie Mellon University

2
N-Best List Generation
  • Benefit
  • Required for optimizing model scaling factors
  • Rescoring
  • For translation with a pivot language: L1 -> L2 -> L3
  • We have n-best translations at sentence end
  • But: hypotheses are recombined -> many good
    translations don't reach the sentence end
  • Recover those translations

3
Storing Multiple Backpointers
  • When recombining hypotheses, store them with the
    best (i.e. surviving) hypothesis, but don't
    expand them (a minimal sketch follows the figure below)

[Figure: hypothesis lattice in which recombined hypotheses (hr) are stored with the best surviving hypotheses (hb) instead of being discarded]
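A minimal Python sketch of this bookkeeping, under assumed names (Hypothesis and recombine are illustrative, not taken from an actual decoder): the worse hypothesis is kept on a list attached to the survivor instead of being expanded, so it can be recovered when n-best lists are extracted at the sentence end.

```python
# Illustrative hypothesis recombination with stored backpointers.
# Higher score = better is an assumption of this sketch.

class Hypothesis:
    def __init__(self, score, backpointer=None, phrase=""):
        self.score = score              # combined model score so far
        self.backpointer = backpointer  # predecessor hypothesis
        self.phrase = phrase            # target phrase added by the last expansion
        self.recombined = []            # worse hypotheses merged into this one

def recombine(surviving, other):
    """Keep the better hypothesis; store the worse one without expanding it."""
    if other.score > surviving.score:
        surviving, other = other, surviving
    surviving.recombined.append(other)
    return surviving
```

At the sentence end, following both the backpointers and the recombined lists enumerates the translations that would otherwise be lost.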
4
Tuning the SMT System
  • We use different models in SMT system
  • Models have simplifications
  • Trained on different amounts of data
  • => Different levels of reliability
  • => Give different weight to different models:
    Q = c1 Q1 + c2 Q2 + ... + cn Qn
  • Find optimal scaling factors c1 ... cn
  • Optimal means: highest score for the chosen
    evaluation metric (a small scoring sketch follows below)
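The weighted combination above is just a dot product of scaling factors and model scores; a minimal sketch (the numbers are made up for illustration):

```python
# Log-linear combination of model scores: Q = c1*Q1 + c2*Q2 + ... + cn*Qn.
# The scaling factors c_i are tuned on a development set, not trained.

def combined_score(model_scores, scaling_factors):
    return sum(c * q for c, q in zip(scaling_factors, model_scores))

# e.g. [LM, TM, word-count] scores of one hypothesis with illustrative weights
print(combined_score([-12.3, -8.7, 9.0], [1.0, 0.6, -0.2]))
```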

5
Brute Force Approach: Manual Tuning
  • Decode with different scaling factors
  • Get feeling for range of good values
  • Get feeling for importance of models
  • LM is typically most important
  • Sentence length to balance shortening effect of
    LM
  • Word reordering is more or less effective
    depending on language
  • Narrow down range in which scaling factors are
    tested
  • Essentially multi-linear optimization
  • Works well for a small number of models
  • Time consuming (CPU-wise) if decoding takes a
    long time

6
Automatic Tuning
  • Many algorithms to find (near) optimal solutions
    available
  • Simplex
  • Maximum entropy
  • Minimum error training
  • Minimum Bayes risk training
  • Genetic algorithm
  • Note: models are not improved, only their
    combination
  • Large number of full translations required =>
    still problematic when decoding is slow

7
Automatic Tuning on N-best List
  • Generate n-best lists, e.g. 1000 translations for
    each of 500 source sentences
  • Loop
  • Changing scaling factors results in re-ranking
    the n-best lists
  • Evaluate new 1-best translations
  • Apply any of the standard optimization techniques
  • Advantage: much faster (see the reranking sketch below)
  • Can pre-calculate the counts (e.g. n-gram
    matches) for each translation to speed up
    evaluation
  • For the Bleu or NIST metric with its global length
    penalty, do local hill climbing for each
    individual n-best list
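A hedged sketch of this tuning loop on fixed n-best lists: changing the scaling factors only re-ranks the stored hypotheses, so no re-decoding is needed. The data layout and names are assumptions of the sketch; `evaluate` stands for any corpus-level metric (e.g. BLEU computed from pre-calculated n-gram match counts).

```python
# nbest_lists: one list per source sentence, each entry a (feature_vector, hypothesis) pair.
# Higher combined score = better is assumed here.

def rerank_one_best(nbest_lists, weights):
    one_best = []
    for hyps in nbest_lists:
        best = max(hyps, key=lambda fv_hyp: sum(w * f for w, f in zip(weights, fv_hyp[0])))
        one_best.append(best[1])
    return one_best

def tune(nbest_lists, evaluate, candidate_weight_sets):
    # Any standard optimizer (simplex, grid search, MER line search) can drive this loop.
    return max(candidate_weight_sets,
               key=lambda w: evaluate(rerank_one_best(nbest_lists, w)))
```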

8
Minimum Error Training
  • For each scaling factor we have Q = ck Qk + QRest
  • For different values, different hypotheses have
    the lowest score
  • Different hypotheses lead to different MT
    evaluation scores (a 1-D sweep sketch follows the figure)

[Figure: hypothesis scores as linear functions of the scaling factor ck; which hypothesis (h11 WER 8, h12 WER 5, h13 WER 4, h21 WER 2, h22 WER 0, h23 WER 5) is 1-best changes with ck, and each choice yields a different error]
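A brute-force version of this one-dimensional search could look as follows; this is a sketch under the slide's convention that the lowest combined score wins, whereas an exact minimum error rate line sweep (Och 2003) enumerates the intersection points of the score lines instead of probing fixed candidates.

```python
# 1-D minimum error search over one scaling factor ck, with Q = ck*Qk + QRest.
# Each hypothesis is stored as (Qk, QRest, error); lower Q is treated as better.

def best_hyp_at(hyps, ck):
    return min(hyps, key=lambda h: ck * h[0] + h[1])

def sweep_ck(nbest_lists, candidate_cks):
    best_ck, best_err = None, float("inf")
    for ck in candidate_cks:
        err = sum(best_hyp_at(hyps, ck)[2] for hyps in nbest_lists)
        if err < best_err:
            best_ck, best_err = ck, err
    return best_ck, best_err
```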
9
Exploring Lattice Input for Statistical Machine
Translation
  • Stephan Vogel, Kay Rottmann, Bing Zhao,
    Sanjika Hewavitharana
  • InterACT Carnegie Mellon
  • InterACT University of Karlsruhe

10
Motivation
  • Lattice translation is used for tighter coupling
    between speech recognition and translation
  • Use alternative word sequences
  • Lattice is more efficient than n-best list
  • Decouple processing steps yet use relevant
    information for end-to-end optimization
  • Lattices can be used to encode alternatives
    arising from other knowledge sources
  • Disfluency annotation: add edges to the lattice to
    skip over disfluencies
  • Add synonyms and paraphrases
  • Add morphological variants, esp. for unknown
    words
  • Allow different word segmentations (e.g. for
    Chinese)
  • Partial word reordering

11
Advantage of using lattices
  • Decouple processing steps
  • Different people can work on different parts of
    the system with less interference
  • Use stronger models (additional tools) which
    would be difficult to integrate into the decoder
  • Closer to 'one solution fits all': if the decoder
    can translate lattices, then it can do x, y, z
  • No hard decisions
  • Keep alternatives till additional information is
    available
  • Assign probabilities to those alternatives
  • Use them as additional features in minimum error
    rate training
  • Essentially many arguments for n-best list
    rescoring are good arguments for lattice
    translation

12
Outline
  • Word reordering (some nice results)
  • Multiple word segmentations for Chinese (initial
    results)
  • Paraphrasing (Matthias Bracht)

13
Distortion Models for Word Reordering
  • Distortion models are part of the IBM and HMM
    alignment models
  • Absolute position (IBM2) or relative position
    (HMM, IBM4)
  • Conditioned on sentence length, word class, word
    (-> lexicalized DM)
  • In phrase-based systems
  • Standard DMs are simple jump models, also
    lexicalized
  • Block reordering models
  • Some attempts have been made to do reordering as
    pre/post processing
  • Reorder source sentence to fit word order of
    target sentences
  • Reorder target sentence, insert reordering
    markers
  • Reordering based on hand-written rules
  • Reordering based on word alignment, need to learn
    reordering model to apply to test sentences
  • Problem with reordering as preprocessing: it is
    difficult to recover from errors, therefore keep
    alternatives

14
Word Reordering based on POS Patterns
  • Learning reordering patterns
  • Use word alignment information: reorder source
    words to make the alignment (locally) monotone
  • Collect the reordering patterns: original word
    sequence -> reordered word sequence
  • But use POS tags rather than words to get better
    generalization
  • Extension: use context information
  • Context: 1 POS tag left, 1 POS tag right
  • Apply a reordering pattern only when the context
    also matches
  • I.e. compare a POS sequence of length n+2, reorder
    the sequence of length n.
  • Pruning
  • Keep only patterns which have been seen > 10
    times
  • Keep only patterns with relative frequency >
    threshold (typically 0.1); see the extraction
    sketch below
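One possible way to extract such a pattern from a single aligned sentence pair, sketched in Python under assumptions (1-based permutation over the span, unaligned words dropped from the ordering); this is an illustration, not the exact procedure of the system.

```python
# Extract a POS reordering pattern: reorder the source positions of one span so
# the alignment becomes monotone, and remember the POS sequence plus the
# resulting permutation. alignment[i] = target position of source word i (or None).

def extract_pattern(src_pos_tags, alignment, start, end):
    span = list(range(start, end))
    aligned = [i for i in span if alignment[i] is not None]
    order = sorted(aligned, key=lambda i: alignment[i])   # source positions in target order
    if order == aligned:
        return None                                       # already monotone, nothing to learn
    pattern = " ".join(src_pos_tags[i] for i in span)
    permutation = [i - start + 1 for i in order]          # one possible 1-based representation
    return pattern, permutation
```

Counting how often each (pattern, permutation) pair is extracted over the training corpus then gives the frequencies used for pruning.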

15
Example for Learning RO-Patterns
  • Spanish: en esto estamos todos de acuerdo .
  • English: we all agree on that .
  • POS: PRP DT VB IN DT .
  • Alignment: NULL ( ) en ( 4 ) esto ( 5 )
    estamos ( 1 ) todos ( 2 ) de
    ( ) acuerdo ( 3 ) . ( 6 )
  • Extracted Rules
  • PRP DT VB IN DT -> 4 5 1 2 3
  • PRP DT VB -> 2 3 1
  • PRP DT VB IN -> 3 4 1 2
  • When using embedded patterns, restrict length to
    7
  • When dropping embedded patterns, use up to 20
    words

16
Constructing Reordering Lattice
  • Tag test sentence
  • For each matching POS sequence create a parallel
    reordered path
  • Confusion lattice would over-generate too much
  • E.g. NN ADJ and reordered ADJ NN would also
    generate undesirable NN NN and ADJ ADJ
    (a lattice-construction sketch follows the example)

[Figure: reordering lattice for the example sentence "A final agreement has not yet been reached"]
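A toy construction along these lines, assuming the patterns are stored as {POS-sequence tuple: (1-based permutation, score)}; node and edge layout are illustrative only.

```python
# Build a reordering lattice: start from the monotone word sequence and add one
# parallel path per matching POS pattern, rejoining the lattice after the span.
# This avoids the over-generation of a full confusion network.

def build_reordering_lattice(words, pos_tags, patterns):
    edges = [(i, i + 1, w, 1.0) for i, w in enumerate(words)]   # monotone path, nodes 0..len(words)
    next_node = len(words) + 1                                  # fresh internal node ids
    for start in range(len(words)):
        for length in range(2, len(words) - start + 1):
            key = tuple(pos_tags[start:start + length])
            if key not in patterns:
                continue
            perm, score = patterns[key]
            path = [words[start + p - 1] for p in perm]         # reordered words of the span
            prev = start
            for k, w in enumerate(path):
                if k + 1 < len(path):
                    edges.append((prev, next_node, w, score))
                    prev, next_node = next_node, next_node + 1
                else:
                    edges.append((prev, start + length, w, score))  # rejoin the monotone path
    return edges
```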
17
The Translation Process
  • Given phrase table
  • Represented as prefix tree in memory
  • With emitting nodes pointing to multiple
    translations
  • And a lattice
  • Run over lattice and prefix tree in parallel
  • Expand phrases one word at a time
  • If a final state in the tree is reached, create a
    new lattice edge for each translation
  • Translation lattices have one or multiple scores
  • Propagate RO pattern score to translation edges
  • First or n-best path search through lattice
  • Apply LM, word count, phrase count, ...
  • Allow for local reordering within a given window
    (typically 4 words); a matching sketch follows below
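A rough sketch of the parallel walk over the input lattice and the phrase-table prefix tree; the data structures (nested dicts with "children" and "translations") are assumptions of this illustration, not the real decoder's representation.

```python
# Walk the input lattice and the phrase-table prefix tree in parallel; whenever
# the tree node has translations (an "emitting" node), add one translation edge
# per stored translation, spanning the lattice nodes covered so far.

def match_phrases(lattice, node, tree_node, start_node, translation_edges):
    # lattice: {node: [(next_node, word, score), ...]}
    # tree_node: {"children": {word: subtree}, "translations": [(target, score), ...]}
    for target, t_score in tree_node.get("translations", []):
        translation_edges.append((start_node, node, target, t_score))
    for next_node, word, _ in lattice.get(node, []):
        child = tree_node.get("children", {}).get(word)
        if child is not None:                        # extend the phrase by one word
            match_phrases(lattice, next_node, child, start_node, translation_edges)

def build_translation_lattice(lattice, phrase_tree):
    edges = []
    for start in lattice:                            # try to start a phrase at every node
        match_phrases(lattice, start, phrase_tree, start, edges)
    return edges
```

The resulting translation lattice is then searched with the LM and the other features listed above.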

18
Results English -> Spanish
  • Training and test corpus from TC-Star evaluation
    2007
  • Systems
  • Baseline uses (unlexicalized) jump model
  • Use RO patterns and monotone decoding
  • Adding context information (1 POS left, 1 POS
    right)
  • Allow for additional reordering in the decoder
  • Note: scores reported here are higher than in the
    graphics on the 2 previous slides due to less
    pruning in the final MER optimization.

19
Effect of Pruning on Translation Quality
  • Pruning, i.e. removing low-probability reordering
    patterns, helps
  • The effect is rather pronounced, and therefore
    worrying

20
Adding Context Information
  • Pattern context: one POS left, one POS right
  • Result
  • 5-fold increase in number of patterns
  • More stable
  • Small (but consistent) improvements

21
Arabic-English (Small System)
  • Training only on news corpus, dev-set mt03
  • Tagging with Stanford parser
  • Results
  • We see improvements for Arabic, so far not as
    impressive as for Spanish
  • So far the RO model is trained on only 150k sentences

22
Translation with Multiple Word Segmentations
  • Chinese to English translation results depend on
    word segmentation
  • Large word lists for segmentation lead to larger
    number of unknown words in test sets
  • Different segmentation will result in unknown
    words at different positions in the test
    sentences
  • Intuitively the vocabulary of the segmented
    Chinese corpus should be close in size to the
    vocabulary of the English side (modulo
    morphology)

23
Initial Experiments
  • Training
  • IBM 4 alignment models in both directions on
    200 million words
  • Phrase pairs from combined Viterbi paths
  • Note: training with only one segmentation
  • Test set
  • 494 Chinese sentences from ASR transcriptions of
    broadcast news
  • Two references
  • Multiple segmentation for test sentences
  • Segmentation based on word lists of different
    sizes
  • Also segmentation into individual characters
  • Add source word features (see the lattice sketch
    after this list)
  • Unigram probability
  • Length of word in characters
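A hedged sketch of how several segmentations of one sentence could be merged into a single lattice with these per-word features; character offsets serve as nodes so all segmentations share one graph, and the -20.0 floor for unseen words is an assumption of the sketch.

```python
# Combine multiple segmentations of one Chinese sentence into a lattice.
# Each edge carries the word plus features: length in characters and,
# optionally, a unigram log-probability.

def segmentation_lattice(char_seq, segmentations, unigram_logprob=None):
    # segmentations: list of word lists, each concatenating back to char_seq
    edges = set()
    for seg in segmentations + [[c for c in char_seq]]:   # always include the character path
        pos = 0
        for word in seg:
            start, end = pos, pos + len(word)
            feats = {"length": len(word)}
            if unigram_logprob is not None:
                feats["logprob"] = unigram_logprob.get(word, -20.0)  # assumed floor
            edges.add((start, end, word, tuple(sorted(feats.items()))))
            pos = end
    return sorted(edges)
```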

24
Results
  • Large vocabulary for segmentation hurts
    performance
  • Using alternative segmentations alone did not
    help
  • Adding probability information hurt performance
    (the training corpus has one segmentation only)
  • Adding the length of the source words as a feature
    did result in a nice improvement
  • Note: the lattice reduces the number of UNKs from
    96 to 67

25
Summary
  • Lattice translation is a nice way to decouple
    preprocessing and decoding without making hard
    decisions
  • Lattices are an efficient structure to handle
    many alternatives
  • Allow testing richer preprocessing in a very
    simple way
  • One solution for many tasks
  • Successful applications
  • Word reordering
  • Word segmentation

26
Future Work
  • Reordering
  • Arabic full system (underway)
  • Currently some problems with tagging
  • Chinese system
  • Restricted to word segmentation for tagging
  • Investigate learning the pattern on word level
  • Using additional features for learning and
    scoring reordering patterns
  • Multi-word segmentation
  • Train different segmentations
  • Paraphrasing
  • Get the system up and running
  • Analyze where the (expected) benefits come from
  • Additional applications
  • Disfluency removal
  • Morphology

27
Quo Vadis: What's Next?
  • What are major problems, how to overcome them?
  • Quality
  • Word order
  • Lexical choice
  • Agreement
  • Language specific
  • Partial alignment to compounds in German-English
    translations
  • Word segmentation (Chinese, etc.) can
    significantly influence translation results
  • Usability
  • Sufficient quality for standard text translation
    applications
  • Speech translation, e.g. parliamentary speeches
  • Limited domain speech translation on handheld
    devices

28
Translation Quality
  • Oracle experiments show that better translations
    should be possible given the existing data
  • Oracle score: for each source sentence, select the
    translation from the n-best list which best matches
    the human reference translations (a selection
    sketch follows this list)
  • Of course, the score depends on the size of the
    n-best list
  • For a 1000-best list the oracle score is typically
    10 Bleu points higher than the 1-best result
  • Why are they not selected as 1-best?
  • Variability in translations
  • Only weak correlation (on the per-sentence level)
    between model scores and human evaluation scores
  • Our models do not assign the best score to the
    best translation
  • Ultimately our models are still too weak
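The oracle selection itself is straightforward; a sketch using a simple sentence-level n-gram overlap as a stand-in for the actual metric (smoothed BLEU or similar):

```python
# Oracle experiment: for each source sentence, pick from its n-best list the
# hypothesis with the best match against the reference translations.

def ngram_overlap(hyp, refs, n=4):
    hyp_tokens = hyp.split()
    score = 0
    for order in range(1, n + 1):
        hyp_ngrams = [tuple(hyp_tokens[i:i + order])
                      for i in range(len(hyp_tokens) - order + 1)]
        ref_ngrams = set()
        for ref in refs:
            toks = ref.split()
            ref_ngrams.update(tuple(toks[i:i + order])
                              for i in range(len(toks) - order + 1))
        score += sum(1 for g in hyp_ngrams if g in ref_ngrams)
    return score

def oracle_selection(nbest_lists, references):
    # nbest_lists[i]: list of hypothesis strings; references[i]: list of references
    return [max(hyps, key=lambda h: ngram_overlap(h, refs))
            for hyps, refs in zip(nbest_lists, references)]
```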

29
Problems with Current Models
  • Not the right models, i.e. we miss important
    aspects of languages and translation
  • Models have simplifying assumptions, e.g.
    independence assumptions
  • Models are trained separately but interact at
    decoding time
  • Mismatch between training criterion (maximum
    likelihood) and decoding criterion (minimum
    error)

30
Solution Learning Richer Models
  • Adding new models, e.g. a word fertility model,
    class-based LM, etc.; this is the usual way to
    improve the systems
  • Problem: the number of parameters becomes too
    large, i.e. a better model but unreliable
    parameter estimation
  • Discriminative training to retrain given models,
    e.g. perceptron learning
  • Results so far did not really give significant
    improvements

31
Solution Soft Extension of Models
  • Add discriminating features to generate/select
    correct translation for current sentence
  • Dangerous: could degrade translations of other
    sentences
  • Solution: only add discriminating features to the
    current model which correct the current error
    without destroying other translations
  • Need to get generalization, i.e. apply more
    abstract features: not p( f | e, f' ) but p( f |
    e, 'current topic is X' ) or p( f | e, 'there
    is an auxiliary verb close by' )
  • There is a very rich and constantly evolving set
    of different machine learning techniques; we
    need to investigate them

32
Improving Quality through Data
  • We have been improving quality constantly over
    the last couple of years just by using more data
  • So continue to use more and more data
  • Only for a small number of languages, but
    resources are growing at an immense rate
  • Monolingual data, definitely
  • Bilingual data is more problematic, currently
    available mainly for major language pairs
  • Creates engineering challenges
  • Memory and speed requirements grow
  • Distributed processing necessary
  • Opens scientific possibilities
  • More data means more model parameters can be
    estimated
  • Fancier models can be used

33
Going Large
  • Currently large research systems for Chinese and
    Arabic
  • 250 million word corpus, resp. 120 million words
  • vocabularies > 1 million full word forms
  • Memory problems
  • Phrase tables (15 words) too large to fit into
    memory
  • Sampling techniques are used
  • No one-for-all systems
  • Translating a test set (typically < 1000
    sentences) will take hours
  • Sampling the phrase table
  • Even retraining on sub-sampled training data
  • But could even improve performance - adaptation
  • Time problems
  • Training with GIZA Chinese-English (1 direction)
    takes 5 days on 1 CPU
  • Many groups started to parallelize training
  • Parallelizing translation is trivial (if your
    models fit into memory of each machine)

34
Big LMs, really Big LMs
  • We (at CMU) typically work with a 200 million
    word 3-gram LM
  • Google reported improvement of 5 Bleu points
    using a 200 billion word 5-gram LM
  • 1.6 TB ngram table
  • 1000 CPUs to sample
  • 40 hours for this sampling
  • IBM reports 3 Bleu points for Arabic system using
    Gigaword corpus, i.e. 3 billion words 5-gram LM
  • Typically a linear improvement in MT quality (as
    measured with standard MT metrics) requires
    exponential growth in corpus size
  • Holds for bilingual and monolingual data
  • Slope different for different languages
  • Slope different for very in-domain data and not
    so in-domain
  • Out-of-domain data hurts small systems; unclear
    for very large systems

35
Large Corpus for NBest List Rescoring
  • For a hypothesis from the n-best list, calculate
    the collocation statistics of any n-gram pairs.
    (No restriction to length of n-grams!)
  • The prime minister calls on the people to work
    together for permanent peace.
  • I(the, prime), I(the, prime minister), ..., I(the
    prime minister, the people), ..., I(calls on,
    to), ..., I(work, for), ...
  • Collocation statistics for an n-gram pair (g1, g2)
  • Different co-occurrence statistics explored
  • Point-wise mutual information works best (see the
    sketch below)
  • Accumulated collocations for a sentence
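A minimal sketch of the pointwise mutual information variant, assuming sentence-level (document) frequencies are used; the names and the floor value for unseen pairs are illustrative.

```python
import math

# PMI of an n-gram pair from sentence-level counts:
#   I(x, y) = log( P(x, y) / (P(x) * P(y)) )
# df_x, df_y: number of corpus sentences containing x resp. y; df_xy: containing both.

def pmi(df_x, df_y, df_xy, num_sentences, floor=-20.0):
    if df_xy == 0 or df_x == 0 or df_y == 0:
        return floor                       # assumed floor for unseen pairs
    p_x, p_y, p_xy = df_x / num_sentences, df_y / num_sentences, df_xy / num_sentences
    return math.log(p_xy / (p_x * p_y))

def sentence_collocation_score(pair_counts, num_sentences):
    # pair_counts: {(ngram1, ngram2): (df_x, df_y, df_xy)} for one hypothesis
    return sum(pmi(dx, dy, dxy, num_sentences)
               for dx, dy, dxy in pair_counts.values())
```

The accumulated score per hypothesis can then be added as one more feature in n-best list rescoring.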

36
Co-Occurrences of n-gram Pairs
  • Index the corpus of N words using a suffix array
    (Manber & Myers 1990)
  • For a sentence with m words, locate all its
    embedded n-grams in the corpus within O(m log N)
    time (Zhang & Vogel 2005); a toy lookup sketch
    follows below
  • For each n-gram, locate all the sentence IDs for
    each of its occurrences
  • Calculate co-occurrences for all the n-gram
    pairs
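A toy version of the suffix-array lookup (deliberately simplified; a real implementation builds the array once in near-linear time and does not materialize the key list per query):

```python
import bisect

# Index a tokenized corpus with a (toy) suffix array and locate, for a given
# n-gram, the IDs of all sentences that contain it.

def build_index(sentences):
    tokens, sent_id = [], []
    for sid, sent in enumerate(sentences):
        for tok in sent.split():
            tokens.append(tok)
            sent_id.append(sid)
    suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])  # toy O(N^2 log N) build
    return tokens, sent_id, suffixes

def sentences_containing(ngram, tokens, sent_id, suffixes):
    ngram = tuple(ngram)
    keys = [tuple(tokens[s:s + len(ngram)]) for s in suffixes]       # toy: built per query
    lo = bisect.bisect_left(keys, ngram)
    hi = bisect.bisect_right(keys, ngram)
    return {sent_id[suffixes[i]] for i in range(lo, hi)}
```

Intersecting the sentence-ID sets of two n-grams then gives their co-occurrence count.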

37
Distributed Computing for Large Data
  • Corpus and suffix array for 100 million words
    need 900 MB RAM
  • We use 2.9 billion words from the Gigaword corpus
  • 100 corpus chunks distributed over 20-30 machines

[Diagram: monolingual corpus information servers (NYT1999, NYT2000, ..., XINHUA2004) each hold a corpus chunk; the client sends the hypotheses (Hyp 1 ... Hyp N) and adds up the co-occurrence information returned by the servers]
38
Experiment Results
  • TIDES 03 Chinese-English test set
  • Selection: for each sentence, select the corpus
    segment with the highest n-gram match

39
Corpus Selection
[Figure: translation quality (BLEU, NIST) when using different amounts of data]
  • The useful information is often in a small
    subsection of the data
  • Some portions hurt
  • Add selection mechanism - adaptation

40
Going small
  • Two issues
  • Generating bilingual data from scratch
  • Running translation system on devices with
    limited resources
  • If needed, translations can be made.
  • What should be translated?
  • Select from larger monolingual corpus, typically
    available for one of the two languages
  • Selecting sentences to cover vocabulary and
    bigrams seems a good strategy (a greedy sketch
    follows this list)
  • Get those translated
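One possible greedy selection, sketched under assumptions (unigram plus bigram coverage, a fixed sentence budget; the function names are illustrative):

```python
# Greedily pick sentences from a large monolingual pool so that they cover as
# many new unigrams and bigrams as possible, then have only those translated.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def select_for_translation(sentences, budget):
    covered, selected = set(), []
    remaining = [(s, ngrams(s.split(), 1) | ngrams(s.split(), 2)) for s in sentences]
    while remaining and len(selected) < budget:
        idx, (sent, grams) = max(enumerate(remaining),
                                 key=lambda x: len(x[1][1] - covered))
        if not grams - covered:
            break                        # nothing new left to cover
        covered |= grams
        selected.append(sent)
        remaining.pop(idx)
    return selected
```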

41
Selection: N-grams / Sentence Length
  • 20% of the corpus, well selected, gives nearly the
    same result as using the full corpus
  • Using trigrams in the selection process does not
    make a difference

42
Going Small: Prune Phrase Table
  • Sure, you remove translation alternatives with too
    low probabilities
  • And you don't store very long phrases which might
    never be used
  • But: can you then eliminate another 50% or 80% of
    the entries without hurting performance?
  • Current studies: remove entries
  • which can be generated from shorter phrases
  • and which are close in probabilities (a pruning
    sketch follows this list)
  • Method guarantees that you do not lose coverage
  • Initial results on BTEC data are successful: no
    degradation up to removing 80% of the entries
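A sketch of this pruning criterion under simplifying assumptions: only a monotone two-part split is checked, and the tolerance is a made-up placeholder.

```python
import math

# Drop a phrase pair if it can be composed from two shorter entries whose
# combined probability is close to its own probability.

def prunable(src, tgt, prob, table, log_tolerance=0.1):
    src_w, tgt_w = src.split(), tgt.split()
    for i in range(1, len(src_w)):
        for j in range(1, len(tgt_w)):
            left = table.get((" ".join(src_w[:i]), " ".join(tgt_w[:j])))
            right = table.get((" ".join(src_w[i:]), " ".join(tgt_w[j:])))
            if left is None or right is None:
                continue
            if abs(math.log(left * right) - math.log(prob)) < log_tolerance:
                return True
    return False

def prune_phrase_table(table):
    # table: {(source_phrase, target_phrase): probability}
    return {k: v for k, v in table.items() if not prunable(k[0], k[1], v, table)}
```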

43
Going Small SMT on Handheld Devices
  • We do not always have the big memory: can we
    still build data-driven systems?
  • Yes, we successfully built a 2-way speech
    translation system for a PDA
  • Specification
  • 1 GB on PDA, i.e. not so small
  • Models used directly from flash card
  • 1m source phrases plus 1m target phrases plus 5m
    pairs
  • Typical domain specific systems (travel, medical)
    use 10-20 MB
  • Language model is actually using more memory than
    the phrase-table

44
Going Deep
  • More structure
  • More linguistic knowledge
  • Parsing
  • String-to-tree and tree-to-tree mapping
  • More features, i.e. richer models
  • More data, more parameters, richer models
    possible
  • Context-dependent lexicon models p( f | e, f' ),
    with f' on the left, on the right, or somewhere in
    the sentence
  • Distortion models p( jump | f1, f2, e1, e2 )
  • Dependencies on word classes

45
Syntax based SMT System
  • Many folks working on this
  • It's tough to beat the phrase-based systems
  • Bet between Franz Josef Och and Daniel Marcu
  • So far Franz is the winner
  • But Daniel remains optimistic that syntax-based
    systems will catch up
  • Also new effort in our group
  • Parse English side
  • Align corpus and induce phrase pairs
  • Use this information to generate (hierarchical)
    translation rules
  • Use chart-based decoder to translate
  • Current situation: catching up with, not yet
    exceeding, the standard system
  • But relies on phrase alignment, i.e. improvements
    in phrase alignment will improve this system too
  • Still problems with integration of LM
  • Requires smarter pruning to bring down run-time

46
Parallel (bilingual) Processing
  • Currently
  • word alignment models bridge the language gap
  • But a load of preprocessing steps are done
    monolingually
  • Examples
  • Number tagging
  • Word segmentation for Chinese, Japanese, etc.
  • Using morphology toolkit to fragment Arabic words
  • Often, this leads to even greater disparity
    between the two languages
  • Need integrated models
  • Word segmentation integrated with word alignment
  • Splitting/deleting of morphemes integrated with
    word alignment

47
Chinese Word Segmentation
  • Different Chinese word segmenters: Stanford,
    IBM, CMU
  • Which segmenter to use, i.e. which segmenter is
    best?
  • Best = closest to gold standard, e.g. treebank
  • Best = best translation (according to some
    automatic metric)
  • Translation Experiment
  • Bilingual: 19 m words (English)
  • LM from 200 m words
  • Phrase pairs from sentences < 20 words
  • Phrase probabilities only from lexical features
  • Testset mt03
  • Stanford segmenter is best when evaluated
    against the treebank
  • But problematic for machine translation

48
Quo Vadis
  • Continuing to improve best practices
  • More data
  • Incremental extensions to word and phrase
    alignment models
  • Improvements in search strategies
  • Going large
  • Very large corpora: monolingual, comparable
  • Distributed processing
  • But should not turn science into pure engineering
    tasks
  • Going small
  • Limited Domain Applications on hand-held devices
  • Selecting the right data
  • Removing redundant information from models
  • Going deep
  • Structurally rich models
  • Exploring generously the arsenal of machine
    learning techniques
  • Soft extensions of current models: a minimal
    number of additional parameters to minimize errors