Title: Introduction to Statistical Machine Translation
1Introduction to Statistical Machine Translation
ShihHsiang
2Reference
- Brown, Cocke et al., 1990. A statistical approach to machine translation. Computational Linguistics, 16:79-85, 1990.
- Papineni, Roukos et al., 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. Technical Report, IBM Research Division.
- Chou and Juang. Pattern Recognition in Speech and Language Processing, Chapter 11. CRC Press.
- Some slides are directly borrowed from
- Dr. Kevin Knight, University of Southern California
- Dr. Philipp Koehn, University of Edinburgh
- Dr. Franz Josef Och, Google
3The Rosetta Stone (196 BC)
Egyptian hieroglyphs (used from 3300 BC to 400 AD)
Egyptian Demotic (a late cursive script)
Greek (the language of Ptolemy V, ruler of Egypt)
1799: a stone with Egyptian text and its translation into Greek was found -> humans could learn how to translate Egyptian
4Warren Weaver (1947)
When I look at an article in Russian, I say to myself: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
5Interest in MT
- Commercial interest
- U.S. has invested in MT for intelligence purposes
- MT is popular on the web: it is the most used of Google's special features
- EU spends more than 1 billion on translation costs each year
- (Semi-)automated translation could lead to huge savings
- Academic interest
- One of the most challenging problems in NLP research
- Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling
- Being able to establish links between two languages allows for transferring resources from one language to another
6Why It's Challenging
7Competitions
- Progress driven by MT competitions
- NIST/DARPA: yearly campaigns for Arabic-English, Chinese-English, news texts, since 2001
- IWSLT: yearly competitions for Asian languages and Arabic into English, speech travel domain, since 2003
- WPT/WMT: yearly competitions for European languages, European Parliament proceedings, since 2005
- Increasing number of statistical MT groups participate
- Competitions won by statistical systems
8Major Speech Translation Systems
9AT&T How May I Help You
- Spanish-to-English
- MT transnizer
- A transnizer is a stochastic finite-state transducer that integrates the language model of a speech recognizer and the translation model into one single finite-state transducer
- Directly maps source language phones into target language word sequences
- One step instead of two
10MIT Lincoln Lab
11NEC
Stand-alone version ISOTANI03
C/S version as in Yamabana ACL03
12Levels of Transfer
13Methodologies
- Word-for-word translation
- Syntactic transfer
- Interlingual approaches
- Example-based
- Statistical
14Word-for-word translation
- Use a machine-readable bilingual dictionary to translate each word in a text
- Advantages
- Easy to implement, results give a rough idea of what the text is about
- Disadvantages
- Problems with word order mean that this results in low-quality translation
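A minimal sketch of word-for-word translation, assuming a tiny hypothetical Spanish-English dictionary (`TOY_DICT`) and whitespace tokenization; the function name `word_for_word` is illustrative, not from any library. Unknown words are passed through unchanged.

```python
# Toy bilingual dictionary (hypothetical entries, for illustration only).
TOY_DICT = {"que": "what", "hambre": "hunger", "tengo": "have", "yo": "I"}


def word_for_word(sentence, dictionary=TOY_DICT):
    """Translate each word independently, keeping the source word order."""
    out = []
    for word in sentence.split():
        # Look the word up case-insensitively; keep it as-is if unknown.
        out.append(dictionary.get(word.lower(), word))
    return " ".join(out)


print(word_for_word("Que hambre tengo yo"))  # what hunger have I
```

The output keeps the Spanish word order, which is exactly the word-order problem listed under the disadvantages.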
15Syntactic transfer
- It includes three steps
- Parse the sentence -> rearrange constituents -> translate the words
- Advantages
- Deals with the word-order problem
- Disadvantages
- Must construct transfer rules for each language pair that you deal with
- Sometimes there is a syntactic mismatch
English word order is subject - verb - object; Japanese order is subject - object - verb
16Interlingua
- Assign a logical form to sentences
- John must not go
- OBLIGATORY(NOT(GO(JOHN)))
- John may not go
- NOT(PERMITTED(GO(JOHN)))
- Use logical form to generate a sentence in another language
- Advantages
- Single logical form means that we can translate between all languages and only write a parser/generator for each language once
- Disadvantages
- Difficult to define a single logical form. English words in all capital letters probably won't cut it.
17Example-based MT
- Fundamental idea
- People do not translate by doing deep linguistic analysis of a sentence
- They translate by decomposing a sentence into fragments, translating each of those, and then composing those properly
- Translate
- He buys a book on international politics
- With these examples
- (He buys) a notebook.
- (Kare ha) nouto (wo kau).
- I read (a book on international politics).
- Watashi ha (kokusaiseiji nitsuite kakareta hon) wo yomu.
-> (Kare ha) (kokusaiseiji nitsuite kakareta hon) (wo kau).
18Example-based MT
- Challenges
- Locating similar sentences
- Aligning sub-sentential fragments
- Combining multiple fragments of example translations into a single sentence
- Determining when it is appropriate to substitute one fragment for another
- Selecting the best translation out of many candidates
- Advantages
- Uses fragments of human translations, which can result in higher quality
- Disadvantages
- May have limited coverage depending on the size of the example database and the flexibility of the matching heuristics
19Statistical MT
- Find the most probable target sentence given a source (foreign language) sentence
- Automatically align words and phrases within sentence pairs in a parallel corpus
- Probabilities are determined automatically by training a statistical model using the parallel corpus
20Statistical MT
- Advantages
- Has a way of dealing with lexical ambiguity
- Can deal with idioms that occur in the training data
- Requires minimal human effort
- Can be created for any language pair that has enough training data
- No need for a staff of linguists or language experts
- Disadvantages
- Does not explicitly deal with syntax
21Example-based MT vs. Statistical MT
- Both are empirical approaches
- As opposed to rule-based machine translation
- EBMT emphasizes learning from examples
- Often heuristic scoring/learning methods
- SMT emphasizes making optimal decisions
- SMT and EBMT are astonishingly separate research communities
- SMT researchers often use methods and terminology from speech recognition research
- Different language used in both communities
22Parallel Corpora
- Collections of texts and their translation into different languages
- Alignment across languages at various levels
- Document
- Section
- Paragraph
- Sentence (not necessarily one-to-one)
- Phrase
- Word
- Examples of Parallel Corpora
- European Parliament Proceedings Parallel Corpus
- The Bible
23Statistical MT Systems
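[Figure: Spanish/English bilingual text and English text each go through statistical analysis. Spanish is first translated into broken English, which is then turned into English. Example: "Que hambre tengo yo" gives the candidates "What hunger have I", "Hungry I am so", "I am so hungry", "Have I that hunger"; the output is "I am so hungry".]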
24Statistical MT Systems
[Figure: statistical analysis of Spanish/English bilingual text gives the translation model P(f|e); statistical analysis of English text gives the language model P(e). Example: "Que hambre tengo yo" -> "I am so hungry".]
Decoding algorithm: argmax_e P(e) P(f|e)
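The decoding rule follows from Bayes' rule. A short derivation, where $\hat{e}$ (a symbol introduced here for illustration) denotes the chosen translation:

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)} \\
        &= \arg\max_{e} P(e)\, P(f \mid e)
\end{align*}
```

The denominator $P(f)$ can be dropped because it does not depend on $e$.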
25Statistical MT Systems
26Three Problems for Statistical MT
- Language model
- Assigns a higher probability to fluent / grammatical sentences
- Estimated using monolingual corpora
- good English string -> high P(e)
- random word sequence -> low P(e)
- Translation model
- Assigns higher probability to sentences that have corresponding meaning
- Estimated using bilingual corpora
- <f,e> look like translations -> high P(f|e)
- <f,e> don't look like translations -> low P(f|e)
- Decoding algorithm
- Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) P(f|e)
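A minimal sketch of the decoding step as an exhaustive search over a small candidate list. The probabilities in `lm` and `tm` are made-up toy values and `decode` is a hypothetical name; a real decoder searches a very large space rather than enumerating candidates.

```python
import math

# Toy model scores (invented for illustration, not estimated from data).
# lm[e] stands in for P(e); tm[(f, e)] stands in for P(f|e).
f = "Que hambre tengo yo"
lm = {
    "What hunger have I": 1e-9,
    "Hungry I am so": 1e-8,
    "I am so hungry": 1e-5,
    "Have I that hunger": 1e-8,
}
tm = {
    (f, "What hunger have I"): 0.4,
    (f, "Hungry I am so"): 0.2,
    (f, "I am so hungry"): 0.2,
    (f, "Have I that hunger"): 0.2,
}


def decode(source, candidates, lm, tm):
    """Return the candidate e maximizing log P(e) + log P(f|e)."""
    def score(e):
        # Work in log space to avoid underflow with tiny probabilities.
        return math.log(lm[e]) + math.log(tm[(source, e)])
    return max(candidates, key=score)


best = decode(f, list(lm), lm, tm)
print(best)  # I am so hungry
```

In this toy setting the most literal candidate has the highest translation-model score, but the language model outweighs it and the fluent sentence wins.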
27Translation Model Alignment
- Source language string
- Target language string
- Alignment Mapping
28Translation Model Alignment
- Decomposition without loss of generality
Length Model
Alignment Model
Lexicon Model
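One common way to write the decomposition named above. The notation is an assumption introduced here: $f_1^J = f_1 \dots f_J$ is the source string, $e_1^I = e_1 \dots e_I$ is the target string, and $a_1^J$ is the alignment, where $a_j \in \{0, \dots, I\}$ gives the target position aligned to source word $f_j$ (0 for the empty word).

```latex
\begin{align*}
\Pr(f_1^J \mid e_1^I) &= \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) \\
\Pr(f_1^J, a_1^J \mid e_1^I)
  &= \underbrace{\Pr(J \mid e_1^I)}_{\text{length model}}
     \prod_{j=1}^{J}
     \underbrace{\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e_1^I)}_{\text{alignment model}}
     \,\underbrace{\Pr(f_j \mid a_1^{j}, f_1^{j-1}, J, e_1^I)}_{\text{lexicon model}}
\end{align*}
```

The second line is just the chain rule, which is why no generality is lost; the individual models then add simplifying independence assumptions.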
29IBM Model 1
- Generative model: break up translation process into smaller steps
- Length Model
- Alignment Model
- Lexicon Model
[Figure: word alignment example, "la casa blu" aligned with "the blue house" (la - the, casa - house, blu - blue)]
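Under the usual statement of IBM Model 1, the three component models take a particularly simple form. The notation is an assumption introduced here: $J$ and $I$ are the source and target lengths, $a_j$ is the target position aligned to source word $f_j$ (with $e_0$ the empty word), $p(J \mid I)$ is the length model, and $p(f_j \mid e_{a_j})$ is the lexicon model.

```latex
\Pr(f_1^J, a_1^J \mid e_1^I)
  = p(J \mid I) \prod_{j=1}^{J} \frac{1}{I+1}\; p(f_j \mid e_{a_j})
```

The alignment model is uniform, $1/(I+1)$ for every position, so the lexicon probabilities carry all the word-level information. Taking "la casa blu" as the source and "the blue house" as the target, with la - the, casa - house, blu - blue, the product contains $p(\text{la} \mid \text{the})\, p(\text{casa} \mid \text{house})\, p(\text{blu} \mid \text{blue})$.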
30How to estimate Lexicon Model?
- Observation
- Co-occurring words: potential translations
- Frequently co-occurring words: likely translations
- Rarely co-occurring words: unlikely translations
- Idea
- Estimate translation probabilities using co-occurrence counts
- Problem
- Co-occurrences are very noisy
31Lexicon model estimation with known alignments
- Haus - house: 2 occurrences
- P(Haus|house) = 1.0
- blau - blue: 1
- blaue - blue: 1
- P(blau|blue) = 1/2 = 0.5
- P(blaue|blue) = 1/2 = 0.5
- P(f|e) = N(f,e)/N(e)
Given alignment information: simple relative frequency
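The relative-frequency estimate can be sketched in a few lines, using the aligned word pairs from this slide as input. The function name `relative_frequency` is illustrative.

```python
from collections import Counter

# Word-aligned pairs (f, e) from the example: Haus-house twice,
# blau-blue once, blaue-blue once.
aligned = [("Haus", "house"), ("Haus", "house"),
           ("blau", "blue"), ("blaue", "blue")]


def relative_frequency(aligned_pairs):
    """Estimate P(f|e) = N(f,e) / N(e) from known word alignments."""
    n_fe = Counter(aligned_pairs)                # N(f, e)
    n_e = Counter(e for _, e in aligned_pairs)   # N(e)
    return {(f, e): n / n_e[e] for (f, e), n in n_fe.items()}


p = relative_frequency(aligned)
print(p[("Haus", "house")])  # 1.0
print(p[("blau", "blue")])   # 0.5
```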
32Lexicon model estimation with uncertain alignments
- Haus - house: 1.8 times
- blaue - house: 0.2 times
- P(Haus|house) = 1.8/(1.8+0.2)
- P(blaue|house) = 0.2/(1.8+0.2)
- blaue - blue: 0.8
- das - blue: 0.2
- blau - blue: 1.0
- P(blaue|blue) = 0.8/2.0 = 0.4
- P(das|blue) = 0.2/2.0 = 0.1
- P(blau|blue) = 1.0/2.0 = 0.5
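The same normalization works when the counts are fractional. A small sketch that reproduces the numbers on this slide; `normalize` is an illustrative name.

```python
from collections import defaultdict

# Fractional counts for (f, e) pairs, taken from the example on this slide.
fractional = {
    ("Haus", "house"): 1.8,
    ("blaue", "house"): 0.2,
    ("blaue", "blue"): 0.8,
    ("das", "blue"): 0.2,
    ("blau", "blue"): 1.0,
}


def normalize(counts):
    """Turn fractional counts for (f, e) into probabilities P(f|e)."""
    totals = defaultdict(float)
    for (_, e), c in counts.items():
        totals[e] += c                      # total count mass for each e
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


p = normalize(fractional)
print(round(p[("Haus", "house")], 2))  # 0.9
print(round(p[("blaue", "blue")], 2))  # 0.4
```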
33Lexicon model estimation with uncertain alignments
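- N(f,e | a,f,e): count of alignments between words (f,e) in sentence pair (f,e) with alignment a
- c(f|e): fractional counts, i.e. counts weighted with alignment probability
Chicken-and-egg problem: the counts need the alignment probabilities, and the alignment probabilities need the lexicon model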
34Lexicon model estimation with uncertain alignments
- Solution: EM algorithm
- Iteratively re-estimate parameters given previous setting
- Starting uniformly
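A minimal sketch of EM training for an IBM Model 1 style lexicon model, under simplifying assumptions that are not from the slides: no empty (NULL) word, a fixed number of iterations, and a three-sentence toy corpus. The name `train_model1` is hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """pairs: list of (foreign_words, english_words).

    Returns t with t[(f, e)] approximating P(f|e).
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start uniformly: every foreign word equally likely given any e.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # fractional counts c(f|e)
        total = defaultdict(float)  # sum over f of c(f|e)
        for fs, es in pairs:
            for f in fs:
                # E-step: posterior probability that f aligns to each e.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate P(f|e) from the fractional counts.
        t = defaultdict(float,
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy parallel corpus (assumed for illustration).
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_model1(corpus)
for pair in [("Haus", "house"), ("das", "house"), ("Buch", "book")]:
    print(pair, round(t[pair], 3))
```

Although "das" and "Haus" co-occur with "house" equally often, the iterations shift probability toward "Haus" because "das" is better explained by "the" in the second sentence.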
35More sophisticated models
- IBM Model 2
- Adds dependence on absolute word positions
- Can learn, for example, that words at the beginning of a sentence are often also translated at the beginning
- HMM
- Adds dependence on relative word positions
- Can learn, for example, that alignments are often monotone
36More sophisticated models
- IBM Model 3 (4, 5)
- Adds new probability distribution p(n|e) for the fertility of words
- Fertility of e: number of foreign words that e aligns to
- Adds soft coverage constraint for English words
- Context-dependent lexicon model
- Takes into account word context
37Phrase-Based Statistical MT
[Figure: Morgen | fliege | ich | nach Kanada | zur Konferenz -> Tomorrow | I | will fly | to the conference | in Canada]
- Foreign input is segmented into phrases
- A phrase is any sequence of words
- Each phrase is probabilistically translated into English
- P(to the conference | zur Konferenz)
- P(into the meeting | zur Konferenz)
- Phrases are probabilistically re-ordered
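A minimal sketch of the phrase translation step only, assuming the input is already segmented and using a toy phrase table whose probabilities are invented for illustration. Reordering is deliberately left out, and `translate_monotone` is a hypothetical name.

```python
# Toy phrase table: source phrase -> {English phrase: probability}.
# All probabilities are invented for illustration.
PHRASE_TABLE = {
    "Morgen": {"Tomorrow": 0.9, "Morning": 0.1},
    "fliege": {"will fly": 0.6, "fly": 0.4},
    "ich": {"I": 1.0},
    "nach Kanada": {"in Canada": 0.6, "to Canada": 0.4},
    "zur Konferenz": {"to the conference": 0.7, "into the meeting": 0.3},
}


def translate_monotone(phrases, table=PHRASE_TABLE):
    """Pick the most probable translation of each phrase; no reordering."""
    english, prob = [], 1.0
    for phrase in phrases:
        options = table[phrase]
        best = max(options, key=options.get)
        english.append(best)
        prob *= options[best]   # product of phrase translation probabilities
    return " ".join(english), prob


segmented = ["Morgen", "fliege", "ich", "nach Kanada", "zur Konferenz"]
text, prob = translate_monotone(segmented)
print(text)  # Tomorrow will fly I in Canada to the conference
```

The output keeps the German phrase order. The probabilistic reordering step, omitted here, is what moves the phrases into "Tomorrow I will fly to the conference in Canada".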
38Advantages of Phrase-Based
- Phrases capture local reordering
- Single-word based: needs to be stored in alignment model
- Local context useful for disambiguation
- Single-word based: only target language model does disambiguation
- Phrases are reordered as a whole
- Works well for non-compositional phrases
- With a lot of data, sometimes whole sentences can be covered
39Evaluation of MT
- Ideal criterion: user satisfaction
- Problems
- Expensive, slow, inconsistent, subjective
- Problematic to use in system development
- Goal: automatic objective evaluation of machine translation quality
- Idea: compute similarity of MT output with good human translations (reference translations)
- Hope
- If MT output is good: similar to good human translations
- If MT output is bad: very different from human translations
- Question: which similarity metric?
40Evaluation of MT
- Use a set of bilingual test sentences so that, for each source sentence, an associated target sentence is given
- WER (word error rate)
- SER (sentence error rate)
- PER (position-independent word error rate)
- without taking the word order into account
- BLEU (Bilingual Evaluation Understudy)
- an MT metric based on n-gram precision
- ROUGE
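A minimal sketch of the first two metrics in the list, assuming one reference per sentence and whitespace-tokenized input. WER is computed here as word-level edit distance divided by reference length; the function names are illustrative.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # extra word in hyp
                           cur[j - 1] + 1,           # word missing from hyp
                           prev[j - 1] + (h != r)))  # match or substitution
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)


def ser(hyps, refs):
    """Sentence error rate: fraction of outputs not identical to the reference."""
    return sum(h != r for h, r in zip(hyps, refs)) / len(refs)


ref = "I am so hungry".split()
print(wer("I am hungry".split(), ref))         # 0.25
print(wer("What hunger have I".split(), ref))  # 1.0
```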
41BLEU (Bilingual Evaluation Understudy)
- Modified n-gram precision
- N-gram precision: fraction of n-grams occurring in references
- Modified n-gram precision: same part of reference cannot be used twice
- Brevity penalty
- Penalize too short translations
- BP = exp(min(1 - r/c, 0))
- c: length of MT output, r: length of reference translation
- BLEU (n=4) score
- BLEU = BP * exp((1/4) * sum over n=1..4 of log p_n), where p_n is the modified n-gram precision
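A minimal sentence-level sketch of the pieces listed above: clipped (modified) n-gram precision, the brevity penalty, and their combination with uniform weights up to 4-grams. Two choices here are assumptions rather than something the slides specify: `r` is taken as the reference length closest to the output length, and the score is 0 when any precision is 0 (no smoothing).

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def modified_precision(hyp, refs, n):
    """Clipped n-gram precision: a reference n-gram cannot be matched twice."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def brevity_penalty(c, r):
    """BP = exp(min(1 - r/c, 0)); c = output length, r = reference length."""
    return math.exp(min(1 - r / c, 0))


def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # assumption: no smoothing
    # Assumption: use the reference length closest to the output length.
    r = min((len(ref) for ref in refs), key=lambda n: (abs(n - len(hyp)), n))
    bp = brevity_penalty(len(hyp), r)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))
```

The last line illustrates clipping: the output contains "the" seven times, but no reference contains it more than twice, so only two of the seven unigrams count as matches.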
42Typical BLEU scores (2005 NIST evaluation data)
- Arabic-English news translation, 4 references
- Best statistical (research) system: 51 BLEU score
- (some) commercial systems: 10 - 34 BLEU score
- Estimated human BLEU score: 63 BLEU score
- Chinese-English news translation, 4 references
- Best statistical (research) system: 35 BLEU score
- (some) commercial system: 15 BLEU score
- Estimated human BLEU score: 55 BLEU score
- Approach used to estimate human BLEU score (given 4 references)
- Round robin: score one reference against the other 3 references
43SMT for Spoken Language
- Spoken language translation is not merely translation of written text containing ASR errors
44SMT for Spoken Language: Traditional Approach
- 1-best ASR-hypothesis passed to SMT
- Other ASR hypotheses not considered
- ASR / SMT systems developed independently
- Trained using different data
- Performance optimized for different criteria (WER/BLEU)
Hope: end-to-end system performance is good
45Tighter Coupling for SLT
46Outlook: Progress from
- Better models and training
- Generalized phrase models (e.g. hierarchical)
- Long-distance dependencies
- Topic adaptation
- Discriminative training with many more features
- Much More Data
- Monolingual data > 1 trillion words
- Bilingual data > 1 billion words
- Better automatic machine translation evaluation (BLEU)
- Better engineering / infrastructure / tools