Title: The State of the Art in Phrase-Based Statistical Machine Translation (SMT)
1. The State of the Art in Phrase-Based Statistical Machine Translation (SMT)
Roland Kuhn, George Foster, Nicola Ueffing
February 2007
2. Tutorial Plan
- A. Overview
- B. Details & research topics
- NOTE: the best overall reference for SMT hasn't been published yet: Philipp Koehn's Statistical Machine Translation (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.
3. Tutorial Plan
- Overview
- The MT Task & Approaches to it
- Examples of SMT output
- SMT Research Culture, Evaluations, Metrics
- SMT History & IBM Models
- Phrase-based SMT
- Phrase-Based Search
- Loglinear Model Combination
- Target Language Model P(T)
- Flaws of Phrase-based, Loglinear Systems
- PORTAGE: a Typical SMT System
4. The MT Task & Approaches to it
- Core MT task: translate a sentence from a source language S to a target language T
- Conventional expert-system approach: hire experts to write rules for translating S to T
- Statistical approach: using a bilingual text corpus (lots of S sentences & their translations into T), train a statistical translation model that will map each new S sentence into a T sentence
5. The MT Task & Approaches to it
[Diagram: an expert system (rules written by experts) and a statistical system (machine learning over a bilingual parallel corpus) each map a source sentence S to a target-language output T.]
6. The MT Task & Approaches to it
- Expert vs. Statistical systems
- Expert systems incorporate deep linguistic knowledge
- They still yield top performance for well-studied language pairs in non-specialized domains
- Computationally cheap (compared to statistical MT)
- BUT:
- Brittle
- Expensive to maintain (messy software engineering)
- Expensive to port to new semantic domains or new language pairs
- Typically yield only one T sentence for each S sentence
7. The MT Task & Approaches to it
- Expert vs. Statistical systems
- More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems
- Statistical approach has beaten expert systems in related areas, e.g., automatic speech recognition
- SMT is robust (does well on frequent phenomena)
- Easy to maintain
- Easily ported to new semantic domains or new language pairs IF training corpora are available
- For each S sentence, yields many T sentences (each with a probabilistic score); useful for semi-supervised translation
8. The MT Task & Approaches to it
Structure of a Typical SMT System
[Diagram: offline training over parallel corpora (plus optional extra target-language LM training corpora) produces a phrase translation model and a target language model. Decoding the source "mais où sont les neiges d'antan ?" yields initial N-best hypotheses, e.g.:
T1: however where are the snows d antan (P=0.22)
T2: but where are the snows d antan (P=0.21)
T3: but where did the d antan snow go (P=0.13)
which are reordered/rescored using other knowledge sources into final N-best hypotheses, e.g.:
T1: But where are the snows of yesteryear? (P=0.41)
T2: However, where are yesterday's snows? (P=0.33)]
9. The MT Task & Approaches to it
- Commercial Systems
- Systran, the biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring the possibility of adding a statistical component to its system.
- Important exception: LanguageWeaver, a new company based on SMT (closely linked to researchers at ISI, U. Southern California)
- Google has a superb SMT research team, but online they still mainly use Systran (probably because of the computational cost of online SMT). They seem to be gradually swapping in SMT systems for language pairs with lower traffic.
10. Examples of SMT output
Chinese → English output

REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.

REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department officials.

REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher co-chaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, co-chaired by the two countries, the joint commission met for the first time.

REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.
11. Examples of SMT output
And a silly English → German example from Google (Jan. 25, 2007): "the hotel has a squash court" → "das Hotel hat ein Kürbisgericht" (think "zucchini tribunal"). But this kind of error (perfect syntax, never-seen word combination) isn't typical of a statistical system, so this was probably a rule-based system.
12. SMT Research Culture, Evaluations, Metrics
- Culture
- SMT research is very engineering-oriented, driven by performance in NIST & other evaluations (see later slides)
- → if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn't, expect the former to get much more attention than the latter
- Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations)
- Disadvantages of SMT culture: closed-minded to ideas not tested in a working system → if you have a brilliant theory that doesn't show a BLEU score improvement in a reasonable baseline system, don't expect SMT researchers to read your paper!
13. SMT Research Culture, Evaluations, Metrics
The NIST MT Evaluations
- Since 2001, the US National Institute of Standards & Technology (NIST) has been evaluating MT systems
- Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, & others
- ... and NRC: NRC's system is called PORTAGE (in the NIST evaluations of 2005 & 2006)
- Main NIST language pairs: Chinese→English, Arabic→English
- Semantic domains: news stories & multigenre
- Training corpora released each fall, test corpus each spring; participants have 1 working week to submit target sentences
- NIST evaluates systems comparatively
- In 2005: http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html
- 2006: http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html
- statistical systems beat expert systems according to the BLEU metric
14. SMT Research Culture, Evaluations, Metrics
- Other MT Evaluations
- WPT/WMT: usually organized each spring by Philipp Koehn & Christoph Monz; smaller training corpora than NIST, European language pairs. In 2006, evaluated on French↔English, German↔English, Spanish↔English. http://www.statmt.org/wmt06/proceedings/
- TC-STAR: evaluation for spoken language translation. In 2006, evaluated on Chinese→English (one direction only) and Spanish↔English. http://www.elda.org/tcstar-workshop/2006eval.htm
- IWSLT: evaluation for spoken language translation. In 2006, evaluated on Arabic→English, Chinese→English, Italian→English, Japanese→English. http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html
15. SMT Research Culture, Evaluations, Metrics
- GALE Project
- Huge DARPA-sponsored project: $50 million per year for 5 years. Three consortia: BBN-led "Agile", IBM-led "Rosetta", SRI-led "Nightingale".
- NRC team is in the MT working group of Nightingale. GALE covers automatic speech recognition (ASR), machine translation (MT), and distillation.
16. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- Human evaluation of automatic translation quality is hard & expensive. The BLEU metric (invented at IBM) compares MT output with human-generated reference translations via N-gram matches.
- N-gram precision = (# N-grams in MT output seen in ref.) / (# N-grams in MT output)
- Example (from P. Koehn):
- REF: Israeli officials are responsible for airport security
- Sys A: Israeli officials responsibility of airport safety
- Sys B: airport security Israeli officials are responsible
17. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- REF: Israeli officials are responsible for airport security
- Sys A: Israeli officials responsibility of airport safety
- Sys B: airport security Israeli officials are responsible
- Sys A: 1-gram precision = 3/6 (Israeli, officials, airport); 2-gram precision = 1/5 (Israeli officials); 3-gram precision = 0/4; 4-gram precision = 0/3.
- Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5; 3-gram precision = 2/4; 4-gram precision = 1/3.
- BLEU-N multiplies together the N N-gram precisions; the higher the value, the better the translation. But one could cheat by having very few words in the MT output - so, a brevity penalty.
18. SMT Research Culture, Evaluations, Metrics
- What is BLEU?
- BLEU-N = (brevity-penalty) · Π_{i=1..N} (precision_i)^{λ_i}, where brevity-penalty = min(1, output-length/ref-length).
- Usually, we set N=4 and all λ_i = 1, so we have BLEU-4 = min(1, output-length/ref-length) · Π_{i=1..4} precision_i.
- If an MT output has no N-grams matching the ref. for some N = 1, ..., 4, BLEU-4 is zero. So, one normally computes BLEU over a whole test set of at least a hundred or so sentences.
- Multiple references: if an N-gram has K occurrences in the output, look for a single ref. that has K or more copies of that N-gram. If such a single ref. is found, that N-gram has matched K times. If not, take the ref. with the highest number of copies (L) of that N-gram & use L in the precision calculation. Ref-length = length of the reference closest in length to the output.
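The computation above can be sketched in code. This is a toy implementation of the slide's simplified formula (all λ_i = 1; standard BLEU instead uses uniform weights 1/N in a geometric mean), applied to the Koehn example from the previous slide:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU as defined on the slide: product of clipped N-gram
    precisions, times brevity penalty min(1, out-len/ref-len)."""
    prod = 1.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        # Clip each N-gram count by its max count in any single reference.
        clip = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                clip[g] = max(clip[g], c)
        matched = sum(min(c, clip[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or matched == 0:
            return 0.0  # any zero N-gram precision makes BLEU-4 zero
        prod *= matched / total
    # Ref-length = length of the reference closest to the output length.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: abs(rl - len(candidate)))
    return min(1.0, len(candidate) / ref_len) * prod

ref = "Israeli officials are responsible for airport security".split()
sys_a = "Israeli officials responsibility of airport safety".split()
sys_b = "airport security Israeli officials are responsible".split()
```

On this pair, Sys A scores 0 (no 3- or 4-gram matches), while Sys B scores (6/7)·(6/6)·(4/5)·(2/4)·(1/3) ≈ 0.114, matching the per-N precisions on the slide.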
19. SMT Research Culture, Evaluations, Metrics
- Does BLEU correlate with human judgment?
[Graph: BLEU score vs. human quality score (0 = terrible, 3 = excellent), by translator identity.]
BLEU kind of correlates with human judgment; it works best with multiple references.
20. SMT Research Culture, Evaluations, Metrics
- Why BLEU Is Controversial
- If a system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score.
- Proponents of the expert-system approach argue that BLEU is biased against this approach & favours SMT
- Partial confirmation: 1. in the NIST 2006 Arabic-to-English evaluation, the AppTek hybrid system (rule-based + SMT) did best according to human evaluators, but not according to BLEU. 2. in the 2006 WMT evaluation, Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).
21. SMT Research Culture, Evaluations, Metrics
- Other Automatic Metrics
- SMT systems need an automatic metric for tuning (must try out thousands of variants). Automatic metrics compare MT output with human-generated reference translations.
- Rivals of BLEU: translation edit rate (TER): how many edit ops to match the references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
- METEOR compares MT output with references in a way that's less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR is only defined for translation into English. http://www.cs.cmu.edu/~alavie/METEOR/
22. SMT Research Culture, Evaluations, Metrics
- Manual Metrics
- Human evaluation of SMT is preferable to automatic evaluation, but much slower & more expensive. Can't be used for system tuning.
- Ask humans to rank systems by adequacy and fluency. Adequacy: does the MT output convey the same meaning as the source? Fluency: does the MT output look like normal target-language text? (Good syntax & idiom.)
- Metrics based on human postediting of MT output, e.g., HTER.
- Metrics based on human understanding of MT output. Related to adequacy, but less subjective. E.g., Lincoln Labs metric: give English output of an Arabic MT system to a unilingual English analyst, then test him with the standard Defense Language Proficiency Test (see Jones05).
23. SMT Research Culture, Evaluations, Metrics
- Who Uses Which Metric When?
- Many groups use BLEU for automatic system tuning
- NIST, WPT/WMT, TC-STAR, & other evaluations often have BLEU as the official metric, with some human reality checks. Koehn & Monz + WPT/WMT participants do human fluency/adequacy evaluations - nice analyses!
- Many expert/rule-based MT researchers hate BLEU (it can become an excuse not to evaluate a system competitively)
- In theory, manual metrics should be related to the MT task: e.g., adequacy for browsing/gisting, the Lincoln Labs metric for the intelligence community, HTER if MT output will be post-edited. So why is HTER GALE's official metric? HTER = Human Translation Edit Rate: MT output is hand-edited by humans & the measure is the # of operations performed.
24. SMT History & IBM Models
- In the late 1980s, members of IBM's speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard, a bilingual transcription of parliamentary proceedings.
- These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies; they are now very rich!
- Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from the late 1990s onward. The newer "phrase-based" approach still partially relies on the IBM models.
- The IBM approach used Bayes's Theorem to define the Fundamental Equation of MT (Brown et al. 1993):
25. SMT History & IBM Models
Fundamental Equation of MT
- The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is
  T* = argmax_T P(T) · P(S|T)
Job of the language model: ensure well-formed target-language T. Job of the translation model: ensure T could have generated S. Search task: find the T maximizing the product P(T)·P(S|T).
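As a minimal sketch of this decision rule (the probability tables and their values here are purely illustrative, not from any real system):

```python
def best_translation(source, candidates, lm, tm):
    """Noisy-channel decision rule: pick T maximizing P(T) * P(S|T).
    `lm` maps T -> P(T); `tm` maps (S, T) -> P(S|T). Toy lookup tables
    stand in for real models; a real system searches a huge hypothesis
    space instead of a fixed candidate list."""
    return max(candidates,
               key=lambda t: lm.get(t, 0.0) * tm.get((source, t), 0.0))

# Hypothetical numbers: the LM strongly prefers the fluent word order,
# and the TM is indifferent between the two candidates.
lm = {"the house is small": 0.04, "small the is house": 1e-7}
tm = {("la maison est petite", "the house is small"): 0.6,
      ("la maison est petite", "small the is house"): 0.6}
best = best_translation("la maison est petite", list(lm), lm, tm)
# best == "the house is small"
```

The split of labour is visible in the toy numbers: the translation model alone cannot distinguish the candidates, so the language model's preference decides.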
26. SMT History & IBM Models
- The IBM researchers defined five statistical translation models (numbered in order of complexity)
- Each defines a mechanism for generation of text in one language (e.g., French or "foreign", F) from another (e.g., English, E)
- The most general many-to-many case is not covered by the IBM models; in this forbidden case, a group of E words generates a group of F words, e.g.:
  The poor don't have any money
  Les pauvres sont démunis
27. SMT History & IBM Models
- The IBM models only allow one-to-many generation, e.g.:
[Diagram: an English sentence generating "Le programme a été mis en application", one E word to one or more F words; an E word may also generate nothing (Ø).]
- IBM models 1 & 2: all lengths for the F sentence are equally likely
- Model 1 is a "bag of words" - word order in F & E doesn't matter
- In model 2, the chance that an E word generates a given F word depends on position
- IBM models 3, 4, 5 are fertility-based
28. SMT History & IBM Models
IBM model 2: position-dependent "bag of words"
[Diagram: given an E sentence e1 ... eL, an F length M is chosen with probability P(L→M). Model 1 then fills each F position by drawing an E word with uniform probability; model 2 draws with position-dependent probabilities P(i|j) over e1 ... eL.]
29. SMT History & IBM Models
- Parameters: φ(e_i) = fertility of e_i: prob. that it will produce 0, 1, 2, ... words in F; t(f|e_i) = probability that e_i can generate f; d(j|i, k) = distortion prob.: prob. that the k-th word generated by e_i ends up in position j of F
[Diagram: IBM models 3 & 4. Each E word e1 ... eL first chooses a fertility φ(e_i) (possibly 0, i.e., Ø), then generates that many F words via t(f|e_i), which are placed into F positions f1 ... fM. NOTE: in model 4, phrases can be broken up, but with lower prob. than in model 3.]
IBM model 5 is a cleaned-up version of model 4 (e.g., two F words can't be given the same position).
30. Phrase-based SMT
- Four key ideas
- phrase-based models (Och04, Koehn03, Marcu02)
- dynamic programming search algorithms (Koehn04)
- loglinear model combination (Och02)
- error-driven learning (Och03)
31. Phrase-based SMT
The phrase-based approach was introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many-words-to-many-words (an improvement on IBM one-to-many).

Example: "cul de sac". Word-based translation: "ass of bag" (N. Am.), "arse of bag" (British). Phrase-based translation: "dead end" (N. Am.), "blind alley" (British). This knowledge is stored in a phrase table: a collection of conditional probabilities of the form P(S|T) ("backward" phrase table) or P(T|S) ("forward" phrase table). Recall Bayes: T* = argmax_T P(T)·P(S|T) → the backward table is essential; the forward table is used for heuristics.

Tables for French→English:
forward P(T|S): p(bag|sac) = 0.5, p(hand bag|sac) = 0.2, p(ass|cul) = 0.5, p(dead end|cul de sac) = 0.85
backward P(S|T): p(sac|bag) = 0.9, p(sacoche|bag) = 0.1, p(cul de sac|dead end) = 0.7, p(impasse|dead end) = 0.3
32. Phrase-based SMT
- Overall Phrase Pair Extraction Algorithm
- 1. Run a sentence aligner on a parallel bilingual corpus (won't go over this)
- 2. Run a word aligner (e.g., one based on the IBM models) on each aligned sentence pair - see next slide.
- 3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.
33. Phrase-based SMT
- Symmetrized Word Alignment using IBM Models
- Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections.
- To improve quality, use a symmetrization heuristic (Och00):
- 1. Perform two separate alignments, one in each translation direction.
- 2. Take the intersection of links as a starting point.
- 3. Add neighbouring links from the union until all words are covered.
[Diagram: the two directed alignments for S = "I want to go home" / T = "Je veux aller chez moi" and their symmetrization.]
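The three steps above can be sketched as follows. This is a simplified version of the heuristic (Och00 also constrains which neighbouring links may be added); alignments are sets of (source index, target index) pairs:

```python
def symmetrize(e2f, f2e):
    """Symmetrization sketch: start from the intersection of the two
    directed alignments, then repeatedly add union links that are
    adjacent to an existing link and cover a still-uncovered word."""
    links = e2f & f2e
    union = e2f | f2e
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - links):
            covered_i = any(i == a for a, _ in links)
            covered_j = any(j == b for _, b in links)
            adjacent = any((i + di, j + dj) in links
                           for di, dj in neighbours)
            if adjacent and (not covered_i or not covered_j):
                links.add((i, j))
                added = True
    return links
```

For example, if one direction links source word 2 to target word 1 but the other direction misses it, the link survives symmetrization as long as it neighbours an intersection link and covers a new word.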
34. Phrase-based SMT
"Diag-And" phrase extraction
- Input: aligned sentence pair "Je l ai vu à la télévision" / "I saw him on television". Output: set of consistent phrases.
- Extract all phrase pairs with no external links, for example:
- Good pairs: (Je, I), (Je l ai vu, I saw him), (ai vu, saw), (l ai vu à la, saw him on)
- Bad pairs: (Je l ai vu, I saw), (l ai vu à, saw him on), (la télévision, television)
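The extraction rule (keep a span pair iff it contains at least one alignment link and no link crosses its boundary) can be sketched as follows, using the sentence pair above with an assumed word alignment chosen for illustration:

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=4):
    """Consistent phrase-pair extraction sketch. A (src span, tgt span)
    pair is kept iff at least one link lies inside it and no link has
    exactly one endpoint inside it. Spans are half-open [start, end)."""
    pairs = set()
    for s1 in range(src_len):
        for s2 in range(s1 + 1, min(s1 + max_len, src_len) + 1):
            for t1 in range(tgt_len):
                for t2 in range(t1 + 1, min(t1 + max_len, tgt_len) + 1):
                    inside = [(i, j) for (i, j) in alignment
                              if s1 <= i < s2 and t1 <= j < t2]
                    # External link: crosses the span boundary.
                    external = any((s1 <= i < s2) != (t1 <= j < t2)
                                   for (i, j) in alignment)
                    if inside and not external:
                        pairs.add(((s1, s2), (t1, t2)))
    return pairs

# "Je l ai vu à la télévision" (0-6) / "I saw him on television" (0-4)
# Assumed alignment for illustration:
align = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}
pairs = extract_phrases(align, 7, 5, max_len=4)
```

Under this alignment, (Je, I) and (ai vu, saw) come out as good pairs, while (la télévision, television) is rejected because the link from "la" points outside the target span.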
35. Phrase-Based Search
- Generative process:
- 1. Split the source sentence into phrases (N-grams).
- 2. Translate each source phrase (one-to-one).
- 3. Permute the target phrases to get the final translation.
- much simpler and more intuitive than the IBM process,
- but the price of this is no provision for gaps, e.g., "ne VERB pas"
[Diagram: "Je l ai vu à la télévision" is split into three phrases; each is translated and the target phrases are permuted to give "I saw him on television".]
NOTE: XRCE's Matrax does handle gaps
36. Phrase-Based Search
Order: target hypotheses grow left→right, from source segments consumed in any order.
[Diagram: for a source s1 s2 s3 s4 s5 s6 s7 s8 s9, the backward phrase table P(S|T) (e.g., p(s2 s3|t8), p(s2 s3|t5 t3), p(s3 s4|t4 t9)) both (1) suggests possible segmentations and (2) supplies phrase translation scores. Picking s2 s3 first yields target hypotheses "t8" or "t5 t3"; picking s3 s4 first yields "t4 t9"; consuming further segments (e.g., s5 s6 s7) extends hypotheses such as "t8 t6 t2". The language model P(T) scores the growing target hypotheses left→right.]
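A toy exhaustive version of this search (hypothetical phrase table and bigram LM; a real decoder uses beam search over coverage vectors rather than full enumeration):

```python
import math

def decode(source, phrase_table, lm, max_phrase=3):
    """Toy phrase-based search matching the slide: consume source
    segments in any order, grow the target hypothesis left-to-right.
    Score = sum of log phrase probs + log bigram LM probs.
    phrase_table maps a source-word tuple to [(target tuple, prob)];
    lm maps (prev, word) bigrams to probs (assumed values)."""
    n = len(source)
    best = (float("-inf"), None)

    def lm_score(tokens):
        toks = ("<s>",) + tokens
        return sum(math.log(lm.get((a, b), 1e-6))
                   for a, b in zip(toks, toks[1:]))

    def expand(covered, target, score):
        nonlocal best
        if all(covered):
            total = score + lm_score(target)
            if total > best[0]:
                best = (total, target)
            return
        for i in range(n):                       # any uncovered segment
            for j in range(i + 1, min(i + max_phrase, n) + 1):
                if any(covered[i:j]):
                    continue
                src = tuple(source[i:j])
                for tgt, p in phrase_table.get(src, []):
                    new_cov = covered[:i] + [True] * (j - i) + covered[j:]
                    expand(new_cov, target + tgt, score + math.log(p))

    expand([False] * n, (), 0.0)
    return best[1]
```

Because segments may be consumed in any order, the same target string can arise from several segmentations; the decoder keeps whichever scoring derivation is best.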
37. Loglinear Model Combination
Previous slides show a basic system that ranks hypotheses by P(S|T)·P(T). Now let's introduce an alignment/reordering variable A (aligns T & S phrases). We want
T* = argmax_T P(T|S) ≈ argmax_{T,A} P(T, A|S)
   ≈ argmax_{T,A} f1(T,A,S)^λ1 · f2(T,A,S)^λ2 · ... · fM(T,A,S)^λM
   = argmax_{T,A} exp(Σ_i λ_i · log f_i(T,A,S)).
The f_i now typically include not only functions related to P(S|T) and the language model P(T), but also to A (distortion), P(T|S), length(T), etc. The λ_i serve as reliability weights. This change in score computation doesn't fundamentally change the search algorithm.
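In code, the loglinear score Σ_i λ_i · log f_i is just a weighted sum of log feature values. The features and numbers below are illustrative, not from any real system:

```python
import math

def loglinear_score(feature_values, lambdas):
    """Score of one (T, A) hypothesis: sum_i lambda_i * log f_i(T,A,S)."""
    return sum(lam * math.log(f)
               for f, lam in zip(feature_values, lambdas))

def best_hypothesis(hypotheses, lambdas):
    """Pick the (label, feature values) hypothesis with highest score."""
    return max(hypotheses,
               key=lambda h: loglinear_score(h[1], lambdas))[0]

# Hypothetical feature values: (translation model, language model,
# distortion), one triple per hypothesis.
hyps = [("T1", (0.4, 0.02, 0.9)),
        ("T2", (0.9, 0.001, 0.9))]
```

The λ_i act as reliability weights: with equal weights, T1's much better LM score wins, but downweighting the LM (say λ_LM = 0.1, treating it as unreliable) flips the choice to T2, whose translation-model score is better.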
38. Loglinear Model Combination
- Advantages
- Very flexible! Anyone can devise dozens of features.
- E.g., if there are lots of mismatched brackets in the output, include a feature function that outputs 1 if there are no mismatched brackets, -1 if there are.
- So lots of new features are being tried in a somewhat haphazard way.
- But systems are steadily improving: outputs from NIST 2006 look much better than those from NIST 2002. SMT is not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English - very high scores!
39. Loglinear Model Combination
- Typical Loglinear Components for SMT Decoding
- Joint counts C(S,T) from phrase extraction yield estimates P(S|T), stored in the backward phrase table, and estimates P(T|S), stored in the forward phrase table. These are typically relative frequency estimates (but we've looked at smoothed variants).
- Distortion model D(T,A,S) assigns a score to the amount of phrase reordering incurred in going from S to hypothesis T. Can be based purely on displacement, or be lexicalized (the identity of words in S & T matters).
- Length model L(T,S) scores the probability that a hypothesis of length |T| was generated from a source of length |S|.
- Language model P(T) gives the probability of word sequence T in the target language - see next few slides.
- NOTE: these are just for decoding; you can use lots more components for N-best/lattice reordering!
40. Target Language Model P(T)
- The Stupidest Thing Noam Chomsky Ever Said
- "It must be recognized that the notion of a probability of a sentence is an entirely useless one, under any interpretation of this term." - Chomsky, 1969.
41. Target Language Model P(T)
- The language model helps generate fluent output by:
- 1. assigning higher probability to correct word order, e.g., P_LM(the house is small) >> P_LM(small the is house)
- 2. assigning higher probability to correct word choices, e.g., P_LM(I am going home) >> P_LM(I am going house)
- Almost everyone in both the SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with
P(W) = P(w1)·P(w2|w1)·P(w3|w1,w2)·...·P(wi|w1,...,wi-1)·...·P(wm|w1,...,wm-1),
then limit the window to N words. E.g., for N=3, a trigram LM:
P(W) = P(w1)·P(w2|w1)·P(w3|w1,w2)·...·P(wi|wi-2,wi-1)·...·P(wm|wm-2,wm-1).
42. Target Language Model P(T)
- Estimation is done by relative frequency on a large corpus: P(wi|wi-2,wi-1) ≈ f(wi|wi-2,wi-1) = C(wi-2,wi-1,wi) / Σ_w C(wi-2,wi-1,w).
- E.g., in the Europarl corpus, we see 225 trigrams starting "the red": C(the red cross)=123, C(the red tape)=31, C(the red army)=9, C(the red card)=7, C(the red ,)=5 (and 50 other trigrams). So estimate P(cross | the red) = 123/225 = 0.547.
- But we need to reserve probability mass for unseen events - maybe we never saw "the red planet" in Europarl, but we don't want the estimate P(planet | the red) = 0. Also, we want estimates whose variance isn't too high. Smoothing techniques are used to solve both problems. E.g., one could linearly smooth trigrams with bigrams & unigrams:
P(wi|wi-2,wi-1) ≈ λ·f(wi|wi-2,wi-1) + μ·f(wi|wi-1) + (1-λ-μ)·f(wi), with 0 < λ, μ < 1.
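The relative-frequency estimates and the linear smoothing above can be sketched as follows (the λ, μ values are illustrative, as is the tiny corpus):

```python
from collections import Counter

def train_trigram_lm(corpus, lam=0.6, mu=0.3):
    """Linearly smoothed trigram LM as on the slide:
    P(w|u,v) ~= lam*f(w|u,v) + mu*f(w|v) + (1-lam-mu)*f(w)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    hist1, hist2 = Counter(), Counter()   # bigram / trigram histories
    total = 0
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent      # pad so every word has history
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i-1], toks[i])] += 1
            tri[(toks[i-2], toks[i-1], toks[i])] += 1
            hist1[toks[i-1]] += 1
            hist2[(toks[i-2], toks[i-1])] += 1
            total += 1

    def prob(w, u, v):
        f3 = tri[(u, v, w)] / hist2[(u, v)] if hist2[(u, v)] else 0.0
        f2 = bi[(v, w)] / hist1[v] if hist1[v] else 0.0
        f1 = uni[w] / total
        return lam * f3 + mu * f2 + (1 - lam - mu) * f1

    return prob
```

Because each conditional distribution (trigram, bigram, unigram) sums to 1 over the vocabulary, the interpolated estimate does too, as long as λ + μ ≤ 1.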
43. Target Language Model P(T)
Measuring Language Model Quality
- Perplexity: a metric that measures the predictive power of an LM on new data as an average branching factor. E.g., a model that says any digit 0, ..., 9 has equal probability of occurrence will yield a perplexity of 10.0 on a digit sequence generated randomly from these 10 digits.
- Perplexity of an LM measured on corpus W = (w1 ... wN) is
Perp_LM(W) = (Π_i P(wi|LM))^(-1/N) = 1/(geometric average per-word prob.)
- The better the LM is as a model for W, the less "surprised" it is by the words of W → higher estimated prob. → lower perplexity.
- Typical perplexities for well-trained English trigram LMs with lexica of about 25K words for various dictation domains: Perp(radiology) ≈ 20, Perp(emergency medicine) ≈ 60, Perp(journalism) ≈ 105, Perp(general English) ≈ 247.
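The perplexity formula in code (computed in log space to avoid floating-point underflow on long texts):

```python
import math

def perplexity(word_probs):
    """Perp = (prod_i p_i)^(-1/N): the inverse geometric mean of the
    per-word probabilities the LM assigns to the test text."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

The digit example from the slide checks out: a model assigning probability 0.1 to every word of a random digit sequence has perplexity 10.0, regardless of the sequence length.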
44. Target Language Model P(T)
- "A Bit of Progress in Language Modeling" (Goodman01) is a good summary of the state of the art in N-gram language modeling.
- Consistently superior method: Kneser-Ney.
- Intuition: if "Francisco" & "eggplant" are each seen 10^3 times in our corpus of 10^6 words, and neither "eggplant Francisco" nor "eggplant stew" has been seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)?
- Interpolation answer: P(wi|wi-1) ≈ λ·f(wi|wi-1) + (1-λ)·f(wi). So P(Francisco|eggplant) = λ·0 + (1-λ)·10^-3 = P(stew|eggplant).
- Kneser-Ney answer: no - "Francisco" only occurs after "San", but the 1,000 occurrences of "stew" are preceded by 100 different words. So when (wi-1 wi) has never been seen before, wi = "stew" is more probable than wi = "Francisco" → P(stew|eggplant) >> P(Francisco|eggplant).
45. Target Language Model P(T)
- Kneser-Ney formula (for bigrams; easily extended to N-grams):
P_KN(wi|wi-1) = max{C(wi-1 wi) - D, 0} / C(wi-1) + α(wi-1) · |{v : C(v wi) > 0}| / Σ_w |{v : C(v w) > 0}|,
where D is a discount factor < 1, α(wi-1) is a normalization constant, |{v : C(v wi) > 0}| is the number of different words that precede wi in the training corpus, and Σ_w |{v : C(v w) > 0}| is the number of different bigrams in the training corpus.
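A sketch of the bigram formula (D = 0.75 is a common illustrative discount; α(v) is set so that each conditional distribution sums to 1):

```python
from collections import Counter

def kneser_ney_bigram(corpus, D=0.75):
    """Bigram Kneser-Ney as on the slide: discounted bigram estimate
    plus alpha(v) times the 'continuation' probability
    |{u : C(u w) > 0}| / (# distinct bigram types)."""
    big = Counter()
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            big[(a, b)] += 1
    hist = Counter()        # C(v): count of v as a bigram history
    succ = Counter()        # number of distinct successors of v
    preceders = Counter()   # |{u : C(u w) > 0}|
    for (a, b), c in big.items():
        hist[a] += c
        succ[a] += 1
        preceders[b] += 1
    n_types = len(big)      # number of distinct bigram types

    def prob(w, v):
        if hist[v]:
            disc = max(big[(v, w)] - D, 0) / hist[v]
            alpha = D * succ[v] / hist[v]  # mass freed by discounting
        else:
            disc, alpha = 0.0, 1.0         # unseen history: back off fully
        return disc + alpha * preceders[w] / n_types

    return prob
```

The Francisco/stew intuition falls out directly: a word seen after many different predecessors gets a large continuation probability, so after an unseen history it beats a word that only ever follows one predecessor.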
46. Flaws of Phrase-based, Loglinear Systems
- Loglinear feature function combination is too flexible! It makes it easy not to think about the theoretical properties of models.
- The IBM models were true models: given an arbitrary source sentence S and target sentence T, they could estimate a non-zero P(T|S). Phrase-based models are not models: in general, for a T which is a good translation of S, they give P(T|S) = 0. They don't guarantee the existence of an alignment between T and S. Thus, the only translations T to which a phrase-based system is guaranteed to assign P(T|S) > 0 are T output by the same system.
- This has practical consequences: in general, a phrase-based MT system can't be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).
47. PORTAGE: A Typical SMT System
- 1. Sentence-align a big bilingual corpus
- 2. On each sentence pair, use IBM models to align words
- 3. Build phrase tables from the word alignments via diag-and or a similar heuristic (Koehn03). The backward phrase table gives P(S|T) (& an implicit segmentation model).
- 4. Build a language model (LM) for the target language: estimates P(T), based on N-grams in T
- 5. P(S|T) and P(T) are sufficient for decoding, but one often adds other loglinear feature functions such as a distortion penalty
- 6. Use the (Och03) method to find good weights λ_i for the loglinear features
- 7. Optionally, include a reordering step: i.e., the decoder outputs many hypotheses (via N-best list or lattice), which are rescored by a larger set of feature functions
48. PORTAGE: A Typical SMT System
Core Engine
[Diagram: the Canoe decoder translates the source sentence ("mais où sont les neiges d'antan ?") using a small weighted set of information sources - at least one phrase translation model, at least one language model, at least one distortion model, and a number-of-words model (weights w_LM, w_TM, w_NM, ...) - to produce N-best hypotheses. The Rescorer then reranks these using a large weighted set of information sources: the small set plus additional feature functions A1, A2, A3, ... (used by the rescorer only; weights k_LM, k_TM, k_A3, ...), producing the rescored N-best.]
49. Training Core Components of PORTAGE
[Diagram: a raw parallel corpus (source-language & target-language text) is preprocessed and used to train the phrase translation model (PT); additional monolingual target-language corpora feed the language model (LM). PT, LM, and the other small-set models (model3, ..., modelK) supply the decoder; extra models (modelK+1, ..., modelM) join them in the large set used for rescoring. Separate weight vectors are trained for the small set (w1, ..., wK) and the large set (w1, ..., wM).]
50. Canoe Optimization of Weights (COW)
Purpose: find weights w1, ..., wS on the small set of information sources (N around 100).
[Diagram: the first call to Canoe decodes with K random weight vectors W1 ... WK (Wk = w1k, ..., wSk); the resulting lists of D N-best hypotheses are accumulated (union over iterations) and passed to rescore-train, which returns new weights w1r, ..., wSr for the second & subsequent calls to Canoe; iterate.]
51. Rescoring
Finding weights on the large information set for the Rescorer (N around 1000).
[Diagram: with the weights on the small set I1, ..., IS fixed by the previous COW step, the decoder produces N-best lists; rescore-train starts from K random weight vectors W1 ... WK over the large set of feature functions I1, ..., IS, IS+1, ..., IL and returns the final large-set weights w1f, ..., wLf.]
52. Tutorial Plan
- B. Details & research topics
- Named entities
- Large-scale discriminative training (George Foster)
- Decoding for SMT (prepared by Nicola Ueffing)
- Hierarchical models (George Foster)
- System combination
53. Named entity recognition & transliteration
Chinese Example: "Secretary-General Wong appeared with Larry Ellison, Chief Executive Officer of Oracle Corporation, at a press conference to announce Oracle's investment of 100 million dollars in a new research centre in Szechuan Province."
Personal names: Wong, Larry Ellison. Titles: Secretary-General, Chief Executive Officer. Organization name: Oracle Corporation. Place name: Szechuan Province.
Recognition problem: detect these entities in a continuous stream of ideograms.
Transliteration problem: when ideograms are used phonetically (esp. for non-Chinese names like "Larry Ellison"), become aware of that & map them onto Latin characters.
54. Named entity recognition & transliteration
- Made-up Chinese Transliteration Example
- How to translate ???????????
- ? táng (surname) - Tang Dynasty; ? nà: receive, accept, enjoy, pay, sew; ? dé: virtue; ? la: pull, drag, haul; ? mu: nurse; ? si: thus (now used mostly for sound); ? fei: humble; ? er: (archaic) you; ? dé: virtue
- "After receiving virtue from the Tang Dynasty, you thus pulled the humble nurse away from virtue"? No: tang na de la mu si fei er de = DONALD RUMSFELD.
- Actual Chinese→English example generated by PORTAGE:
- "Outgoing president Iliescu has also congratulated Basescu." →
- "Outgoing president of Iraq, has also been made to the road to the public."
55. Named entity recognition & transliteration
Other Examples:
Arabic→English: Muammar Ghadafy / Moammar Khaddafi / Muamar Qadafy; Azeddine / Elzedine / Alsuddin / Ahzudin (depending on region, pronounced differently & thus transliterated into the Latin alphabet differently).
English→French (Google Translate, Jan. 24, 2007): "The Englishman John Snow thought cholera was transmitted by small, living organisms." → "Le choléra de pensée de neige de John d'Anglais a été transmis par la petite, organique matière." (roughly: "The cholera of thought of snow of John of English was transmitted by the small, organic matter.")
56. System Combination
- Introduction
- Different systems make different errors: why not combine information? This worked well for ASR.
- But, because of reordering, synonyms, etc., system combination is not as easy for MT!
- RWTH (Aachen) is an SMT powerhouse & has recently been working on parallel system combination (Evgeny Matusov).
- NRC has been working on serial system combination.
- Both teams are now getting good results.
57. System Combination
- Parallel System Combination (RWTH Aachen)
- Hypotheses from different systems are aligned; some word reordering & use of synonyms allowed
- Generate a confusion network → choices at each position scored with system weights and word confidence scores
- N-best consensus translations are generated from the confusion network & rescored with various information sources
- A year ago, results were unimpressive. Since then, they have added new information sources (e.g., LMs trained on N-best lists from the contributing systems) that encourage preservation of original phrases. Nice preliminary Arabic results: improvement of 2-3 BLEU points over the best individual system in the combination.
58. System Combination
Example of RWTH Parallel Combination:
Ref: Chinese president directs unprecedented criticism at leaders of Hong Kong.
Best System: Chinese president slams unprecedented leaders to Hong Kong.
System Comb.: Chinese president sends unprecedented criticism of the leaders of Hong Kong.
59System Combination
- Serial System Combination (NRC)
- Use SMT to correct mistakes made by another method (e.g., a rule-based one)
- Training Procedure
- Use MT1 to produce an initial target translation of the source half of a parallel human-translated corpus. This gives a corpus of MT1 target output in parallel with good target versions of the same sentences; use this parallel corpus of (MT1 target, human target) sentences to train the SMT system.
- Even better: if humans can be found to post-edit the MT1 output, we have MT1 target output in parallel with corrected target text as the SMT training corpus.
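The training-corpus construction above can be sketched as follows. This is an assumed illustration of the data-preparation step only: `mt1_translate` is a hypothetical stand-in for whatever first-stage system (e.g., rule-based MT) is being corrected.

```python
def build_serial_training_corpus(source_sents, human_targets, mt1_translate):
    """Pair first-stage MT output with human references: this is the
    'parallel corpus' on which the correcting SMT system is trained.
    `mt1_translate` is a hypothetical callable for the first-stage system."""
    assert len(source_sents) == len(human_targets)
    # MT1 translates the source half; the human half is kept as-is.
    return [(mt1_translate(src), ref)
            for src, ref in zip(source_sents, human_targets)]
```

With post-edited data, the same pairing applies: the MT1 output goes on the source side and the corrected text on the target side.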
60System Combination
Serial System Combination (NRC)
61System Combination
Serial System Combination (NRC)
62System Combination
- Discussion and Future Work
- Parallel combination is probably best for similar systems of good quality; serial combination for systems that are very different.
- Future work for serial combination: allow the SMT system both direct and indirect (via MT1) access to the source text. This could be done using, e.g.:
- Rescoring
- Parallel phrasetables
- Parallel LMs
- Parallel decoding (etc.)
63References (1)
Best overall reference: Philipp Koehn, Statistical Machine Translation, University of Edinburgh (textbook to appear 2007 or 2008, Cambridge University Press). Papers (NOTE: a short summary of key papers is available from Kuhn/Foster):
Brown93: Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-312, June 1993.
Chomsky69: Noam Chomsky. Quine's Empirical Assumptions. In Words and Objections: Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht, Netherlands, 1969.
Foster06: George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
Germann01: Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.
64References (2)
Goodman01: Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/joshuago/publications.htm
Jones05: Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology Applications and Challenge of Speech Processing).
Knight99: Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
Koehn04: Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag.
KoehnDec03: Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003.
65References (3)
KoehnMay03: Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003.
Marcu02: Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002.
OchJHU04: Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004.
Och04: Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
Och03: Franz Josef Och. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July 2003.
66References (4)
Och02: Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
Och01: Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine Translation Workshop, July 2001.
Och00: Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING), Saarbrucken, Germany, August 2000.
Papineni01: Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of Machine Translation. Technical Report RC22176, IBM, September 2001.
Ueffing02: Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word Graphs in Statistical Machine Translation. Empirical Methods in Natural Language Processing, July 2002.