Title: The State of the Art in Phrase-Based Statistical Machine Translation (SMT)
1. The State of the Art in Phrase-Based Statistical Machine Translation (SMT)
Roland Kuhn, George Foster, Nicola Ueffing
February 2007
2. Tutorial Plan
- A. Overview
- B. Details & research topics
- NOTE: the best overall reference for SMT hasn't been published yet: Philipp Koehn's Statistical Machine Translation (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.
3. Tutorial Plan
- Overview
- The MT Task & Approaches to it
- Examples of SMT output
- SMT Research Culture, Evaluations, Metrics
- SMT History & IBM Models
- Phrase-based SMT
- Phrase-Based Search
- Loglinear Model Combination
- Target Language Model P(T)
- Flaws of Phrase-based, Loglinear Systems
- PORTAGE: a Typical SMT System
4. The MT Task & Approaches to it
- Core MT task: translate a sentence from a source language S to a target language T
- Conventional expert-system approach: hire experts to write rules for translating S to T
- Statistical approach: using a bilingual text corpus (lots of S sentences & their translations into T), train a statistical translation model that will map each new S sentence into a T sentence
5. The MT Task & Approaches to it
[Diagram: an expert system (rules written by experts) and a statistical system (machine learning over a bilingual parallel corpus) each map a source sentence S to a target-language output T.]
6. The MT Task & Approaches to it
- Expert vs. Statistical systems
- Expert systems incorporate deep linguistic knowledge
- They still yield top performance for well-studied language pairs in non-specialized domains
- Computationally cheap (compared to statistical MT)
- BUT:
- Brittle
- Expensive to maintain (messy software engineering)
- Expensive to port to new semantic domains or new language pairs
- Typically yield only one T sentence for each S sentence
7. The MT Task & Approaches to it
- Expert vs. Statistical systems
- More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems
- Statistical approach has beaten expert systems in related areas, e.g., automatic speech recognition
- SMT is robust (does well on frequent phenomena)
- Easy to maintain
- Easily ported to new semantic domains or new language pairs IF training corpora are available
- For each S sentence, yields many T sentences (each with a probabilistic score); useful for semi-supervised translation
8. The MT Task & Approaches to it
Structure of a Typical SMT System
[Diagram: offline training over parallel corpora (plus optional extra target-language LM training corpora) produces a phrase translation model and a target language model. Decoding the source "mais où sont les neiges d'antan ?" yields initial N-best hypotheses, e.g.:
T1: however where are the snows d antan (P=0.22)
T2: but where are the snows d antan (P=0.21)
T3: but where did the d antan snow go (P=0.13)
which are reordered/rescored using other knowledge sources into final N-best hypotheses, e.g.:
T1: But where are the snows of yesteryear? (P=0.41)
T2: However, where are yesterday's snows? (P=0.33)]
9. The MT Task & Approaches to it
- Commercial Systems
- Systran, the biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring the possibility of adding a statistical component to its system.
- Important exception: LanguageWeaver, a new company based on SMT (closely linked to researchers at ISI, U. Southern California)
- Google has a superb SMT research team, but online they still mainly use Systran (probably because of the computational cost of online SMT). They seem to be gradually swapping in SMT systems for language pairs with lower traffic.
10. Examples of SMT output
Chinese → English output

REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.

REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department officials.

REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher co-chaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, co-chaired by the two countries, the joint commission met for the first time.

REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.
11. Examples of SMT output
And a silly English → German example from Google (Jan. 25, 2007): "the hotel has a squash court" → "das Hotel hat ein Kürbisgericht" (think "zucchini tribunal"). But this kind of error (perfect syntax, never-seen word combination) isn't typical of a statistical system, so this was probably a rule-based system.
12. SMT Research Culture, Evaluations, Metrics
- Culture
- SMT research is very engineering-oriented, driven by performance in NIST & other evaluations (see later slides)
- → if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn't, expect the former to get much more attention than the latter
- Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations)
- Disadvantages of SMT culture: closed-minded to ideas not tested in a working system → if you have a brilliant theory that doesn't show a BLEU score improvement in a reasonable baseline system, don't expect SMT researchers to read your paper!
13. SMT Research Culture, Evaluations, Metrics
The NIST MT Evaluations
- Since 2001, the US National Institute of Standards & Technology (NIST) has been evaluating MT systems
- Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, & others
- ... and NRC: NRC's system is called PORTAGE (in the NIST evaluations of 2005 & 2006)
- Main NIST language pairs: Chinese→English, Arabic→English
- Semantic domains: news stories & multigenre
- Training corpora released each fall, test corpus each spring; participants have 1 working week to submit target sentences
- NIST evaluates systems comparatively
- In 2005: http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html
- 2006: http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html
- statistical systems beat expert systems according to the BLEU metric
14. SMT Research Culture, Evaluations, Metrics
- Other MT Evaluations
- WPT/WMT: usually organized each spring by Philipp Koehn & Christoph Monz; smaller training corpora than NIST, European language pairs. In 2006, evaluated on French↔English, German↔English, Spanish↔English. http://www.statmt.org/wmt06/proceedings/
- TC-STAR: evaluation for spoken language translation. In 2006, evaluated on Chinese→English (one direction only) and Spanish↔English. http://www.elda.org/tcstar-workshop/2006eval.htm
- IWSLT: evaluation for spoken language translation. In 2006, evaluated on Arabic→English, Chinese→English, Italian→English, Japanese→English. http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html
15. SMT Research Culture, Evaluations, Metrics
- GALE Project
- Huge DARPA-sponsored project: $50 million per year for 5 years. Three consortia: BBN-led "Agile", IBM-led "Rosetta", SRI-led "Nightingale".
- NRC team is in the MT working group of Nightingale. GALE covers automatic speech recognition (ASR), machine translation (MT), and distillation.
16. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- Human evaluation of automatic translation quality is hard & expensive. The BLEU metric (invented at IBM) compares MT output with human-generated reference translations via N-gram matches.
- N-gram precision = (# N-grams in MT output seen in ref.) / (# N-grams in MT output)
- Example (from P. Koehn):
- REF: Israeli officials are responsible for airport security
- Sys A: Israeli officials responsibility of airport safety
- Sys B: airport security Israeli officials are responsible
17. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- REF: Israeli officials are responsible for airport security
- Sys A: Israeli officials responsibility of airport safety
- Sys B: airport security Israeli officials are responsible
- Sys A: 1-gram precision = 3/6 (Israeli, officials, airport); 2-gram precision = 1/5 (Israeli officials); 3-gram precision = 0/4; 4-gram precision = 0/3.
- Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5; 3-gram precision = 2/4; 4-gram precision = 1/3.
- BLEU-N multiplies together the N N-gram precisions; the higher the value, the better the translation. But one could cheat by having very few words in the MT output - so, a brevity penalty.
18. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- BLEU-N = (brevity-penalty) · Π_{i=1..N} (precision_i)^{λ_i}, where brevity-penalty = min(1, output-length/ref-length).
- Usually, we set N=4 and all λ_i = 1, so we have BLEU-4 = min(1, output-length/ref-length) · Π_{i=1..4} precision_i.
- If an MT output has no N-grams matching the ref. for some N = 1, ..., 4, BLEU-4 is zero. So, one normally computes BLEU over a whole test set of at least a hundred or so sentences.
- Multiple references: if an N-gram has K occurrences in the output, look for a single ref. that has K or more copies of that N-gram. If such a single ref. is found, that N-gram has matched K times. If not, take the ref. with the highest number of copies (L) of that N-gram & use L in the precision calculation. Ref-length = length of the reference closest in length to the output.
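The computation above can be sketched in code. This is a toy implementation of the slide's simplified formula (all λ_i = 1; standard BLEU instead uses uniform weights 1/N in a geometric mean), applied to the Koehn example from the previous slide:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU as defined on the slide: product of clipped N-gram
    precisions, times brevity penalty min(1, out-len/ref-len)."""
    prod = 1.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        # Clip each N-gram count by its max count in any single reference.
        clip = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                clip[g] = max(clip[g], c)
        matched = sum(min(c, clip[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or matched == 0:
            return 0.0  # any zero N-gram precision makes BLEU-4 zero
        prod *= matched / total
    # Ref-length = length of the reference closest to the output length.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: abs(rl - len(candidate)))
    return min(1.0, len(candidate) / ref_len) * prod

ref = "Israeli officials are responsible for airport security".split()
sys_a = "Israeli officials responsibility of airport safety".split()
sys_b = "airport security Israeli officials are responsible".split()
```

On this pair, Sys A scores 0 (no 3- or 4-gram matches), while Sys B scores (6/7)·(6/6)·(4/5)·(2/4)·(1/3) ≈ 0.114, matching the per-N precisions on the slide.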
19. SMT Research Culture, Evaluations, Metrics
- Does BLEU correlate with human judgment?
[Graph: BLEU score vs. human quality score (0 = terrible, 3 = excellent), by translator identity.]
BLEU kind of correlates with human judgment; it works best with multiple references.
20. SMT Research Culture, Evaluations, Metrics
- Why BLEU Is Controversial
- If a system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score.
- Proponents of the expert-system approach argue that BLEU is biased against this approach & favours SMT
- Partial confirmation: 1. in the NIST 2006 Arabic-to-English evaluation, the AppTek hybrid system (rule-based + SMT) did best according to human evaluators, but not according to BLEU. 2. in the 2006 WMT evaluation, Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).
21. SMT Research Culture, Evaluations, Metrics
- Other Automatic Metrics
- SMT systems need an automatic metric for tuning (must try out thousands of variants). Automatic metrics compare MT output with human-generated reference translations.
- Rivals of BLEU: translation edit rate (TER): how many edit ops to match the references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
- METEOR compares MT output with references in a way that's less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR is only defined for translation into English. http://www.cs.cmu.edu/~alavie/METEOR/
22. SMT Research Culture, Evaluations, Metrics
- Manual Metrics
- Human evaluation of SMT is preferable to automatic evaluation, but much slower & more expensive. Can't be used for system tuning.
- Ask humans to rank systems by adequacy and fluency. Adequacy: does the MT output convey the same meaning as the source? Fluency: does the MT output look like normal target-language text? (Good syntax & idiom.)
- Metrics based on human postediting of MT output, e.g., HTER.
- Metrics based on human understanding of MT output. Related to adequacy, but less subjective. E.g., Lincoln Labs metric: give English output of an Arabic MT system to a unilingual English analyst, then test him with the standard Defense Language Proficiency Test (see Jones05).
23. SMT Research Culture, Evaluations, Metrics
- Who Uses Which Metric When?
- Many groups use BLEU for automatic system tuning
- NIST, WPT/WMT, TC-STAR, & other evaluations often have BLEU as the official metric, with some human reality checks. Koehn & Monz + WPT/WMT participants do human fluency/adequacy evaluations - nice analyses!
- Many expert/rule-based MT researchers hate BLEU (it can become an excuse not to evaluate a system competitively)
- In theory, manual metrics should be related to the MT task: e.g., adequacy for browsing/gisting, the Lincoln Labs metric for the intelligence community, HTER if MT output will be post-edited. So why is HTER GALE's official metric? HTER = Human Translation Edit Rate: MT output is hand-edited by humans & the measure is the # of operations performed.
24. SMT History & IBM Models
- In the late 1980s, members of IBM's speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard, a bilingual transcription of parliamentary proceedings.
- These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies; they are now very rich!
- Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from the late 1990s onward. The newer "phrase-based" approach still partially relies on the IBM models.
- The IBM approach used Bayes's Theorem to define the Fundamental Equation of MT (Brown et al. 1993):
25. SMT History & IBM Models
Fundamental Equation of MT
- The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is
  T* = argmax_T P(T) · P(S|T)
Job of the language model: ensure well-formed target-language T. Job of the translation model: ensure T could have generated S. Search task: find the T maximizing the product P(T)·P(S|T).
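As a minimal sketch of this decision rule (the probability tables and their values here are purely illustrative, not from any real system):

```python
def best_translation(source, candidates, lm, tm):
    """Noisy-channel decision rule: pick T maximizing P(T) * P(S|T).
    `lm` maps T -> P(T); `tm` maps (S, T) -> P(S|T). Toy lookup tables
    stand in for real models; a real system searches a huge hypothesis
    space instead of a fixed candidate list."""
    return max(candidates,
               key=lambda t: lm.get(t, 0.0) * tm.get((source, t), 0.0))

# Hypothetical numbers: the LM strongly prefers the fluent word order,
# and the TM is indifferent between the two candidates.
lm = {"the house is small": 0.04, "small the is house": 1e-7}
tm = {("la maison est petite", "the house is small"): 0.6,
      ("la maison est petite", "small the is house"): 0.6}
best = best_translation("la maison est petite", list(lm), lm, tm)
# best == "the house is small"
```

The split of labour is visible in the toy numbers: the translation model alone cannot distinguish the candidates, so the language model's preference decides.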
26. SMT History & IBM Models
- The IBM researchers defined five statistical translation models (numbered in order of complexity)
- Each defines a mechanism for generation of text in one language (e.g., French or "foreign", F) from another (e.g., English, E)
- The most general many-to-many case is not covered by the IBM models; in this forbidden case, a group of E words generates a group of F words, e.g.:
  The poor don't have any money
  Les pauvres sont démunis
27. SMT History & IBM Models
- The IBM models only allow one-to-many generation, e.g.:
[Diagram: an English sentence generating "Le programme a été mis en application", one E word to one or more F words; an E word may also generate nothing (Ø).]
- IBM models 1 & 2: all lengths for the F sentence are equally likely
- Model 1 is a "bag of words" - word order in F & E doesn't matter
- In model 2, the chance that an E word generates a given F word depends on position
- IBM models 3, 4, 5 are fertility-based
28. SMT History & IBM Models
IBM model 2: position-dependent "bag of words"
[Diagram: given an E sentence e1 ... eL, an F length M is chosen with probability P(L→M). Model 1 then fills each F position by drawing an E word with uniform probability; model 2 draws with position-dependent probabilities P(i|j) over e1 ... eL.]
29. SMT History & IBM Models
- Parameters: φ(e_i) = fertility of e_i: prob. that it will produce 0, 1, 2, ... words in F; t(f|e_i) = probability that e_i can generate f; d(j|i, k) = distortion prob.: prob. that the k-th word generated by e_i ends up in position j of F
[Diagram: IBM models 3 & 4. Each E word e1 ... eL first chooses a fertility φ(e_i) (possibly 0, i.e., Ø), then generates that many F words via t(f|e_i), which are placed into F positions f1 ... fM. NOTE: in model 4, phrases can be broken up, but with lower prob. than in model 3.]
IBM model 5 is a cleaned-up version of model 4 (e.g., two F words can't be given the same position).
30. Phrase-based SMT
- Four key ideas
- phrase-based models (Och04, Koehn03, Marcu02)
- dynamic programming search algorithms (Koehn04)
- loglinear model combination (Och02)
- error-driven learning (Och03)
31. Phrase-based SMT
The phrase-based approach was introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many-words-to-many-words (an improvement on IBM one-to-many).

Example: "cul de sac". Word-based translation: "ass of bag" (N. Am.), "arse of bag" (British). Phrase-based translation: "dead end" (N. Am.), "blind alley" (British). This knowledge is stored in a phrase table: a collection of conditional probabilities of the form P(S|T) ("backward" phrase table) or P(T|S) ("forward" phrase table). Recall Bayes: T* = argmax_T P(T)·P(S|T) → the backward table is essential; the forward table is used for heuristics.

Tables for French→English:
forward P(T|S): p(bag|sac) = 0.5, p(hand bag|sac) = 0.2, p(ass|cul) = 0.5, p(dead end|cul de sac) = 0.85
backward P(S|T): p(sac|bag) = 0.9, p(sacoche|bag) = 0.1, p(cul de sac|dead end) = 0.7, p(impasse|dead end) = 0.3
32. Phrase-based SMT
- Overall Phrase Pair Extraction Algorithm
- 1. Run a sentence aligner on a parallel bilingual corpus (won't go over this)
- 2. Run a word aligner (e.g., one based on the IBM models) on each aligned sentence pair - see next slide.
- 3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.
33. Phrase-based SMT
- Symmetrized Word Alignment using IBM Models
- Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections.
- To improve quality, use a symmetrization heuristic (Och00):
- 1. Perform two separate alignments, one in each translation direction.
- 2. Take the intersection of links as a starting point.
- 3. Add neighbouring links from the union until all words are covered.
[Diagram: the two directed alignments for S = "I want to go home" / T = "Je veux aller chez moi" and their symmetrization.]
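The three steps above can be sketched as follows. This is a simplified version of the heuristic (Och00 also constrains which neighbouring links may be added); alignments are sets of (source index, target index) pairs:

```python
def symmetrize(e2f, f2e):
    """Symmetrization sketch: start from the intersection of the two
    directed alignments, then repeatedly add union links that are
    adjacent to an existing link and cover a still-uncovered word."""
    links = e2f & f2e
    union = e2f | f2e
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - links):
            covered_i = any(i == a for a, _ in links)
            covered_j = any(j == b for _, b in links)
            adjacent = any((i + di, j + dj) in links
                           for di, dj in neighbours)
            if adjacent and (not covered_i or not covered_j):
                links.add((i, j))
                added = True
    return links
```

For example, if one direction links source word 2 to target word 1 but the other direction misses it, the link survives symmetrization as long as it neighbours an intersection link and covers a new word.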
34. Phrase-based SMT
"Diag-And" phrase extraction
- Input: aligned sentence pair "Je l ai vu à la télévision" / "I saw him on television". Output: set of consistent phrases.
- Extract all phrase pairs with no external links, for example:
- Good pairs: (Je, I), (Je l ai vu, I saw him), (ai vu, saw), (l ai vu à la, saw him on)
- Bad pairs: (Je l ai vu, I saw), (l ai vu à, saw him on), (la télévision, television)
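The extraction rule (keep a span pair iff it contains at least one alignment link and no link crosses its boundary) can be sketched as follows, using the sentence pair above with an assumed word alignment chosen for illustration:

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=4):
    """Consistent phrase-pair extraction sketch. A (src span, tgt span)
    pair is kept iff at least one link lies inside it and no link has
    exactly one endpoint inside it. Spans are half-open [start, end)."""
    pairs = set()
    for s1 in range(src_len):
        for s2 in range(s1 + 1, min(s1 + max_len, src_len) + 1):
            for t1 in range(tgt_len):
                for t2 in range(t1 + 1, min(t1 + max_len, tgt_len) + 1):
                    inside = [(i, j) for (i, j) in alignment
                              if s1 <= i < s2 and t1 <= j < t2]
                    # External link: crosses the span boundary.
                    external = any((s1 <= i < s2) != (t1 <= j < t2)
                                   for (i, j) in alignment)
                    if inside and not external:
                        pairs.add(((s1, s2), (t1, t2)))
    return pairs

# "Je l ai vu à la télévision" (0-6) / "I saw him on television" (0-4)
# Assumed alignment for illustration:
align = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}
pairs = extract_phrases(align, 7, 5, max_len=4)
```

Under this alignment, (Je, I) and (ai vu, saw) come out as good pairs, while (la télévision, television) is rejected because the link from "la" points outside the target span.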
35. Phrase-Based Search
- Generative process:
- 1. Split the source sentence into phrases (N-grams).
- 2. Translate each source phrase (one-to-one).
- 3. Permute the target phrases to get the final translation.
- much simpler and more intuitive than the IBM process,
- but the price of this is no provision for gaps, e.g., "ne VERB pas"
[Diagram: "Je l ai vu à la télévision" is split into three phrases; each is translated and the target phrases are permuted to give "I saw him on television".]
NOTE: XRCE's Matrax does handle gaps
36. Phrase-Based Search
Order: target hypotheses grow left→right, from source segments consumed in any order.
[Diagram: for a source s1 s2 s3 s4 s5 s6 s7 s8 s9, the backward phrase table P(S|T) (e.g., p(s2 s3|t8), p(s2 s3|t5 t3), p(s3 s4|t4 t9)) both (1) suggests possible segmentations and (2) supplies phrase translation scores. Picking s2 s3 first yields target hypotheses "t8" or "t5 t3"; picking s3 s4 first yields "t4 t9"; consuming further segments (e.g., s5 s6 s7) extends hypotheses such as "t8 t6 t2". The language model P(T) scores the growing target hypotheses left→right.]
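A toy exhaustive version of this search (hypothetical phrase table and bigram LM; a real decoder uses beam search over coverage vectors rather than full enumeration):

```python
import math

def decode(source, phrase_table, lm, max_phrase=3):
    """Toy phrase-based search matching the slide: consume source
    segments in any order, grow the target hypothesis left-to-right.
    Score = sum of log phrase probs + log bigram LM probs.
    phrase_table maps a source-word tuple to [(target tuple, prob)];
    lm maps (prev, word) bigrams to probs (assumed values)."""
    n = len(source)
    best = (float("-inf"), None)

    def lm_score(tokens):
        toks = ("<s>",) + tokens
        return sum(math.log(lm.get((a, b), 1e-6))
                   for a, b in zip(toks, toks[1:]))

    def expand(covered, target, score):
        nonlocal best
        if all(covered):
            total = score + lm_score(target)
            if total > best[0]:
                best = (total, target)
            return
        for i in range(n):                       # any uncovered segment
            for j in range(i + 1, min(i + max_phrase, n) + 1):
                if any(covered[i:j]):
                    continue
                src = tuple(source[i:j])
                for tgt, p in phrase_table.get(src, []):
                    new_cov = covered[:i] + [True] * (j - i) + covered[j:]
                    expand(new_cov, target + tgt, score + math.log(p))

    expand([False] * n, (), 0.0)
    return best[1]
```

Because segments may be consumed in any order, the same target string can arise from several segmentations; the decoder keeps whichever scoring derivation is best.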
37. Loglinear Model Combination
Previous slides show a basic system that ranks hypotheses by P(S|T)·P(T). Now let's introduce an alignment/reordering variable A (aligns T & S phrases). We want
T* = argmax_T P(T|S) ≈ argmax_{T,A} P(T, A|S)
   ≈ argmax_{T,A} f1(T,A,S)^λ1 · f2(T,A,S)^λ2 · ... · fM(T,A,S)^λM
   = argmax_{T,A} exp(Σ_i λ_i · log f_i(T,A,S)).
The f_i now typically include not only functions related to P(S|T) and the language model P(T), but also to A (distortion), P(T|S), length(T), etc. The λ_i serve as reliability weights. This change in score computation doesn't fundamentally change the search algorithm.
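In code, the loglinear score Σ_i λ_i · log f_i is just a weighted sum of log feature values. The features and numbers below are illustrative, not from any real system:

```python
import math

def loglinear_score(feature_values, lambdas):
    """Score of one (T, A) hypothesis: sum_i lambda_i * log f_i(T,A,S)."""
    return sum(lam * math.log(f)
               for f, lam in zip(feature_values, lambdas))

def best_hypothesis(hypotheses, lambdas):
    """Pick the (label, feature values) hypothesis with highest score."""
    return max(hypotheses,
               key=lambda h: loglinear_score(h[1], lambdas))[0]

# Hypothetical feature values: (translation model, language model,
# distortion), one triple per hypothesis.
hyps = [("T1", (0.4, 0.02, 0.9)),
        ("T2", (0.9, 0.001, 0.9))]
```

The λ_i act as reliability weights: with equal weights, T1's much better LM score wins, but downweighting the LM (say λ_LM = 0.1, treating it as unreliable) flips the choice to T2, whose translation-model score is better.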
38. Loglinear Model Combination
- Advantages
- Very flexible! Anyone can devise dozens of features.
- E.g., if there are lots of mismatched brackets in the output, include a feature function that outputs 1 if there are no mismatched brackets, -1 if there are.
- So lots of new features are being tried in a somewhat haphazard way.
- But systems are steadily improving: outputs from NIST 2006 look much better than those from NIST 2002. SMT is not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English - very high scores!
39. Loglinear Model Combination
- Typical Loglinear Components for SMT Decoding
- Joint counts C(S,T) from phrase extraction yield estimates P(S|T), stored in the backward phrase table, and estimates P(T|S), stored in the forward phrase table. These are typically relative frequency estimates (but we've looked at smoothed variants).
- Distortion model D(T,A,S) assigns a score to the amount of phrase reordering incurred in going from S to hypothesis T. Can be based purely on displacement, or be lexicalized (the identity of words in S & T matters).
- Length model L(T,S) scores the probability that a hypothesis of length |T| was generated from a source of length |S|.
- Language model P(T) gives the probability of word sequence T in the target language - see next few slides.
- NOTE: these are just for decoding; you can use lots more components for N-best/lattice reordering!
40. Target Language Model P(T)
- The Stupidest Thing Noam Chomsky Ever Said
- "It must be recognized that the notion of a probability of a sentence is an entirely useless one, under any interpretation of this term." - Chomsky, 1969.
41. Target Language Model P(T)
- The language model helps generate fluent output by:
- 1. assigning higher probability to correct word order, e.g., P_LM(the house is small) >> P_LM(small the is house)
- 2. assigning higher probability to correct word choices, e.g., P_LM(I am going home) >> P_LM(I am going house)
- Almost everyone in both the SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with
P(W) = P(w1)·P(w2|w1)·P(w3|w1,w2)·...·P(wi|w1,...,wi-1)·...·P(wm|w1,...,wm-1),
then limit the window to N words. E.g., for N=3, a trigram LM:
P(W) = P(w1)·P(w2|w1)·P(w3|w1,w2)·...·P(wi|wi-2,wi-1)·...·P(wm|wm-2,wm-1).
42. Target Language Model P(T)
- Estimation is done by relative frequency on a large corpus: P(wi|wi-2,wi-1) ≈ f(wi|wi-2,wi-1) = C(wi-2,wi-1,wi) / Σ_w C(wi-2,wi-1,w).
- E.g., in the Europarl corpus, we see 225 trigrams starting "the red": C(the red cross)=123, C(the red tape)=31, C(the red army)=9, C(the red card)=7, C(the red ,)=5 (and 50 other trigrams). So estimate P(cross | the red) = 123/225 = 0.547.
- But we need to reserve probability mass for unseen events - maybe we never saw "the red planet" in Europarl, but we don't want the estimate P(planet | the red) = 0. Also, we want estimates whose variance isn't too high. Smoothing techniques are used to solve both problems. E.g., one could linearly smooth trigrams with bigrams & unigrams:
P(wi|wi-2,wi-1) ≈ λ·f(wi|wi-2,wi-1) + μ·f(wi|wi-1) + (1-λ-μ)·f(wi), with 0 < λ, μ < 1.
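The relative-frequency estimates and the linear smoothing above can be sketched as follows (the λ, μ values are illustrative, as is the tiny corpus):

```python
from collections import Counter

def train_trigram_lm(corpus, lam=0.6, mu=0.3):
    """Linearly smoothed trigram LM as on the slide:
    P(w|u,v) ~= lam*f(w|u,v) + mu*f(w|v) + (1-lam-mu)*f(w)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    hist1, hist2 = Counter(), Counter()   # bigram / trigram histories
    total = 0
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent      # pad so every word has history
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i-1], toks[i])] += 1
            tri[(toks[i-2], toks[i-1], toks[i])] += 1
            hist1[toks[i-1]] += 1
            hist2[(toks[i-2], toks[i-1])] += 1
            total += 1

    def prob(w, u, v):
        f3 = tri[(u, v, w)] / hist2[(u, v)] if hist2[(u, v)] else 0.0
        f2 = bi[(v, w)] / hist1[v] if hist1[v] else 0.0
        f1 = uni[w] / total
        return lam * f3 + mu * f2 + (1 - lam - mu) * f1

    return prob
```

Because each conditional distribution (trigram, bigram, unigram) sums to 1 over the vocabulary, the interpolated estimate does too, as long as λ + μ ≤ 1.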
43. Target Language Model P(T)
Measuring Language Model Quality
- Perplexity: a metric that measures the predictive power of an LM on new data as an average branching factor. E.g., a model that says any digit 0, ..., 9 has equal probability of occurrence will yield a perplexity of 10.0 on a digit sequence generated randomly from these 10 digits.
- Perplexity of an LM measured on corpus W = (w1 ... wN) is
Perp_LM(W) = (Π_i P(wi|LM))^(-1/N) = 1/(geometric average per-word prob.)
- The better the LM is as a model for W, the less "surprised" it is by the words of W → higher estimated prob. → lower perplexity.
- Typical perplexities for well-trained English trigram LMs with lexica of about 25K words for various dictation domains: Perp(radiology) ≈ 20, Perp(emergency medicine) ≈ 60, Perp(journalism) ≈ 105, Perp(general English) ≈ 247.
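The perplexity formula in code (computed in log space to avoid floating-point underflow on long texts):

```python
import math

def perplexity(word_probs):
    """Perp = (prod_i p_i)^(-1/N): the inverse geometric mean of the
    per-word probabilities the LM assigns to the test text."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

The digit example from the slide checks out: a model assigning probability 0.1 to every word of a random digit sequence has perplexity 10.0, regardless of the sequence length.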
44. Target Language Model P(T)
- "A Bit of Progress in Language Modeling" (Goodman01) is a good summary of the state of the art in N-gram language modeling.
- Consistently superior method: Kneser-Ney.
- Intuition: if "Francisco" & "eggplant" are each seen 10^3 times in our corpus of 10^6 words, and neither "eggplant Francisco" nor "eggplant stew" has been seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)?
- Interpolation answer: P(wi|wi-1) ≈ λ·f(wi|wi-1) + (1-λ)·f(wi). So P(Francisco|eggplant) = λ·0 + (1-λ)·10^-3 = P(stew|eggplant).
- Kneser-Ney answer: no - "Francisco" only occurs after "San", but the 1,000 occurrences of "stew" are preceded by 100 different words. So when (wi-1 wi) has never been seen before, wi = "stew" is more probable than wi = "Francisco" → P(stew|eggplant) >> P(Francisco|eggplant).
45. Target Language Model P(T)
- Kneser-Ney formula (for bigrams; easily extended to N-grams):
P_KN(wi|wi-1) = max{C(wi-1 wi) - D, 0} / C(wi-1) + α(wi-1) · |{v : C(v wi) > 0}| / Σ_w |{v : C(v w) > 0}|,
where D is a discount factor < 1, α(wi-1) is a normalization constant, |{v : C(v wi) > 0}| is the number of different words that precede wi in the training corpus, and Σ_w |{v : C(v w) > 0}| is the number of different bigrams in the training corpus.
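A sketch of the bigram formula (D = 0.75 is a common illustrative discount; α(v) is set so that each conditional distribution sums to 1):

```python
from collections import Counter

def kneser_ney_bigram(corpus, D=0.75):
    """Bigram Kneser-Ney as on the slide: discounted bigram estimate
    plus alpha(v) times the 'continuation' probability
    |{u : C(u w) > 0}| / (# distinct bigram types)."""
    big = Counter()
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            big[(a, b)] += 1
    hist = Counter()        # C(v): count of v as a bigram history
    succ = Counter()        # number of distinct successors of v
    preceders = Counter()   # |{u : C(u w) > 0}|
    for (a, b), c in big.items():
        hist[a] += c
        succ[a] += 1
        preceders[b] += 1
    n_types = len(big)      # number of distinct bigram types

    def prob(w, v):
        if hist[v]:
            disc = max(big[(v, w)] - D, 0) / hist[v]
            alpha = D * succ[v] / hist[v]  # mass freed by discounting
        else:
            disc, alpha = 0.0, 1.0         # unseen history: back off fully
        return disc + alpha * preceders[w] / n_types

    return prob
```

The Francisco/stew intuition falls out directly: a word seen after many different predecessors gets a large continuation probability, so after an unseen history it beats a word that only ever follows one predecessor.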
46. Flaws of Phrase-based, Loglinear Systems
- Loglinear feature function combination is too flexible! It makes it easy not to think about the theoretical properties of models.
- The IBM models were true models: given an arbitrary source sentence S and target sentence T, they could estimate a non-zero P(T|S). Phrase-based models are not models: in general, for a T which is a good translation of S, they give P(T|S) = 0. They don't guarantee the existence of an alignment between T and S. Thus, the only translations T to which a phrase-based system is guaranteed to assign P(T|S) > 0 are T output by the same system.
- This has practical consequences: in general, a phrase-based MT system can't be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).
47. PORTAGE: A Typical SMT System
- 1. Sentence-align a big bilingual corpus
- 2. On each sentence pair, use IBM models to align words
- 3. Build phrase tables from the word alignments via diag-and or a similar heuristic (Koehn03). The backward phrase table gives P(S|T) (& an implicit segmentation model).
- 4. Build a language model (LM) for the target language: estimates P(T), based on N-grams in T
- 5. P(S|T) and P(T) are sufficient for decoding, but one often adds other loglinear feature functions such as a distortion penalty
- 6. Use the (Och03) method to find good weights λ_i for the loglinear features
- 7. Optionally, include a reordering step: i.e., the decoder outputs many hypotheses (via N-best list or lattice), which are rescored by a larger set of feature functions
48. PORTAGE: A Typical SMT System
Core Engine
[Diagram: the Canoe decoder translates the source sentence ("mais où sont les neiges d'antan ?") using a small weighted set of information sources - at least one phrase translation model, at least one language model, at least one distortion model, and a number-of-words model (weights w_LM, w_TM, w_NM, ...) - to produce N-best hypotheses. The Rescorer then reranks these using a large weighted set of information sources: the small set plus additional feature functions A1, A2, A3, ... (used by the rescorer only; weights k_LM, k_TM, k_A3, ...), producing the rescored N-best.]
49. Training Core Components of PORTAGE
[Diagram: a raw parallel corpus (source-language & target-language text) is preprocessed and used to train the phrase translation model (PT); additional monolingual target-language corpora feed the language model (LM). PT, LM, and the other small-set models (model3, ..., modelK) supply the decoder; extra models (modelK+1, ..., modelM) join them in the large set used for rescoring. Separate weight vectors are trained for the small set (w1, ..., wK) and the large set (w1, ..., wM).]
50. Canoe Optimization of Weights (COW)
Purpose: find weights w1, ..., wS on the small set of information sources (N around 100).
[Diagram: the first call to Canoe decodes with K random weight vectors W1 ... WK (Wk = w1k, ..., wSk); the resulting lists of D N-best hypotheses are accumulated (union over iterations) and passed to rescore-train, which returns new weights w1r, ..., wSr for the second & subsequent calls to Canoe; iterate.]
51. Rescoring
Finding weights on the large information set for the Rescorer (N around 1000).
[Diagram: with the weights on the small set I1, ..., IS fixed by the previous COW step, the decoder produces N-best lists; rescore-train starts from K random weight vectors W1 ... WK over the large set of feature functions I1, ..., IS, IS+1, ..., IL and returns the final large-set weights w1f, ..., wLf.]
52. Tutorial Plan
- B. Details & research topics
- Named entities
- Large-scale discriminative training (George Foster)
- Decoding for SMT (prepared by Nicola Ueffing)
- Hierarchical models (George Foster)
- System combination
53. Named entity recognition & transliteration
Chinese Example: "Secretary-General Wong appeared with Larry Ellison, Chief Executive Officer of Oracle Corporation, at a press conference to announce Oracle's investment of 100 million dollars in a new research centre in Szechuan Province."
Personal names: Wong, Larry Ellison. Titles: Secretary-General, Chief Executive Officer. Organization name: Oracle Corporation. Place name: Szechuan Province.
Recognition problem: detect these entities in a continuous stream of ideograms.
Transliteration problem: when ideograms are used phonetically (esp. for non-Chinese names like "Larry Ellison"), become aware of that & map them onto Latin characters.
54. Named entity recognition & transliteration
- Made-up Chinese Transliteration Example
- How to translate ???????????
- ? táng (surname) - Tang Dynasty; ? nà: receive, accept, enjoy, pay, sew; ? dé: virtue; ? la: pull, drag, haul; ? mu: nurse; ? si: thus (now used mostly for sound); ? fei: humble; ? er: (archaic) you; ? dé: virtue
- "After receiving virtue from the Tang Dynasty, you thus pulled the humble nurse away from virtue"? No: tang na de la mu si fei er de = DONALD RUMSFELD.
- Actual Chinese→English example generated by PORTAGE:
- "Outgoing president Iliescu has also congratulated Basescu." →
- "Outgoing president of Iraq, has also been made to the road to the public."
55. Named entity recognition & transliteration
Other Examples:
Arabic→English: Muammar Ghadafy / Moammar Khaddafi / Muamar Qadafy; Azeddine / Elzedine / Alsuddin / Ahzudin (depending on region, pronounced differently & thus transliterated into the Latin alphabet differently).
English→French (Google Translate, Jan. 24, 2007): "The Englishman John Snow thought cholera was transmitted by small, living organisms." → "Le choléra de pensée de neige de John d'Anglais a été transmis par la petite, organique matière." (roughly: "The cholera of thought of snow of John of English was transmitted by the small, organic matter.")
56. System Combination
- Introduction
- Different systems make different errors: why not combine information? This worked well for ASR.
- But, because of reordering, synonyms, etc., system combination is not as easy for MT!
- RWTH (Aachen) is an SMT powerhouse & has recently been working on parallel system combination (Evgeny Matusov).
- NRC has been working on serial system combination.
- Both teams are now getting good results.
57. System Combination
- Parallel System Combination (RWTH Aachen)
- Hypotheses from different systems are aligned; some word reordering & use of synonyms allowed
- Generate a confusion network → choices at each position scored with system weights and word confidence scores
- N-best consensus translations are generated from the confusion network & rescored with various information sources
- A year ago, results were unimpressive. Since then, they have added new information sources (e.g., LMs trained on N-best lists from the contributing systems) that encourage preservation of original phrases. Nice preliminary Arabic results: improvement of 2-3 BLEU points over the best individual system in the combination.
58. System Combination
Example of RWTH Parallel Combination:
Ref: Chinese president directs unprecedented criticism at leaders of Hong Kong.
Best System: Chinese president slams unprecedented leaders to Hong Kong.
System Comb.: Chinese president sends unprecedented criticism of the leaders of Hong Kong.
59System Combination
- Serial System Combination (NRC)
- Use SMT to correct mistakes made by another method (e.g., a rule-based one)
- Training Procedure
- Use MT1 to produce an initial target translation of the source half of a parallel human-translated corpus. This gives a corpus of MT1 target output in parallel with good target versions of the same sentences; use this parallel corpus of (MT1 target, human target) sentences to train the SMT system.
- Even better: if humans can be found to post-edit the MT1 output, we have MT1 target output in parallel with corrected target text as the SMT training corpus.
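The training-corpus construction above can be sketched as follows. This is an assumed illustration of the data-preparation step only: `mt1_translate` is a hypothetical stand-in for whatever first-stage system (e.g., rule-based MT) is being corrected.

```python
def build_serial_training_corpus(source_sents, human_targets, mt1_translate):
    """Pair first-stage MT output with human references: this is the
    'parallel corpus' on which the correcting SMT system is trained.
    `mt1_translate` is a hypothetical callable for the first-stage system."""
    assert len(source_sents) == len(human_targets)
    # MT1 translates the source half; the human half is kept as-is.
    return [(mt1_translate(src), ref)
            for src, ref in zip(source_sents, human_targets)]
```

With post-edited data, the same pairing applies: the MT1 output goes on the source side and the corrected text on the target side.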
60System Combination
Serial System Combination (NRC)
61System Combination
Serial System Combination (NRC)
62System Combination
- Discussion and Future Work
- Parallel combination is probably best for similar systems of good quality; serial combination for systems that are very different.
- Future work for serial combination: allow the SMT system both direct and indirect (via MT1) access to the source text. This could be done using, e.g.:
- Rescoring
- Parallel phrasetables
- Parallel LMs
- Parallel decoding (etc.)
63References (1)
Best overall reference: Philipp Koehn, Statistical Machine Translation, University of Edinburgh (textbook to appear 2007 or 2008, Cambridge University Press). Papers (NOTE: a short summary of key papers is available from Kuhn/Foster):
Brown93: Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-312, June 1993.
Chomsky69: Noam Chomsky. Quine's Empirical Assumptions. In Words and Objections: Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht, Netherlands, 1969.
Foster06: George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
Germann01: Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.
64References (2)
Goodman01: Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/joshuago/publications.htm
Jones05: Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology Applications and Challenge of Speech Processing).
Knight99: Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
Koehn04: Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag.
KoehnDec03: Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003.
65References (3)
KoehnMay03: Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003.
Marcu02: Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002.
OchJHU04: Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004.
Och04: Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
Och03: Franz Josef Och. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July 2003.
66References (4)
Och02: Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
Och01: Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine Translation Workshop, July 2001.
Och00: Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING), Saarbrucken, Germany, August 2000.
Papineni01: Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of Machine Translation. Technical Report RC22176, IBM, September 2001.
Ueffing02: Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word Graphs in Statistical Machine Translation. Empirical Methods in Natural Language Processing, July 2002.