The State of the Art in Phrase-Based Statistical Machine Translation (SMT)


Title: The State of the Art in Phrase-Based Statistical Machine Translation (SMT)


1

The State of the Art in Phrase-Based Statistical
Machine Translation (SMT)
Roland Kuhn, George Foster, Nicola Ueffing
February 2007
2
Tutorial Plan
  • A. Overview
  • B. Details & research topics
  • NOTE: the best overall reference for SMT hasn't been
    published yet: Philipp Koehn's "Statistical
    Machine Translation" (to be published by
    Cambridge University Press). Some of the material
    presented here is from a draft of that book.

3
Tutorial Plan
  • Overview
  • The MT Task & Approaches to it
  • Examples of SMT output
  • SMT Research Culture, Evaluations, Metrics
  • SMT History & IBM Models
  • Phrase-based SMT
  • Phrase-Based Search
  • Loglinear Model Combination
  • Target Language Model P(T)
  • Flaws of Phrase-based, Loglinear Systems
  • PORTAGE: a Typical SMT System

4
The MT Task & Approaches to it
  • Core MT task: translate a sentence from a source
    language S to target language T
  • Conventional expert system approach: hire experts
    to write rules for translating S to T
  • Statistical approach: using a bilingual text
    corpus (lots of S sentences & their translations
    into T), train a statistical translation model
    that will map each new S sentence into a T
    sentence

5
The MT Task & Approaches to it
[Diagram labels: Expert System; Experts; Statistical System; Bilingual parallel corpus (S, T); Machine Learning; Statistical system output]
6
The MT Task & Approaches to it
  • Expert vs. Statistical systems
  • Expert systems incorporate deep linguistic
    knowledge
  • They still yield top performance for well-studied
    language pairs in non-specialized domains
  • Computationally cheap (compared to statistical
    MT)
  • BUT -
  • Brittle
  • Expensive to maintain (messy software
    engineering)
  • Expensive to port to new semantic domains or new
    language pairs
  • Typically yield only one T sentence for each S
    sentence

7
The MT Task & Approaches to it
  • Expert vs. Statistical systems
  • More E-text, better algorithms, stronger machines
    → quality of SMT output approaching that of
    expert systems
  • Statistical approach has beaten expert systems in
    related areas - e.g., automatic speech
    recognition
  • SMT is robust (does well on frequent phenomena)
  • Easy to maintain
  • Easily ported to new semantic domain or new
    language pairs IF training corpora available
  • For each S sentence, yields many T sentences
    (each with a probabilistic score): useful for
    semi-supervised translation

8
The MT Task & Approaches to it
Structure of Typical SMT System
[Diagram labels: offline training; Phrase Translation Model;
Target Language Model (optional extra LM training corpora);
Extra Target Corpora; Other Knowledge Sources; Reordering]
Source: mais où sont les neiges d'antan ?
Initial N-best hypotheses:
  T1: however where are the snows d'antan (P = 0.22)
  T2: but where are the snows d'antan (P = 0.21)
  T3: but where did the d'antan snow go (P = 0.13)
Final N-best hypotheses:
  T1: But where are the snows of yesteryear? (P = 0.41)
  T2: However, where are yesterday's snows? (P = 0.33)
9
The MT Task & Approaches to it
  • Commercial Systems
  • Systran, biggest MT company, uses expert systems;
    so do most MT companies. However, Systran has
    recently begun exploring possibility of adding a
    statistical component to their system.
  • Important exception: LanguageWeaver, new company
    based on SMT (closely linked to researchers at
    ISI, U. Southern California)
  • Google has superb SMT research team, but online,
    they still mainly use Systran (probably because
    of computational cost of online SMT). Seem to be
    gradually swapping in SMT systems for language
    pairs with lower traffic.

10
Examples of SMT output
Chinese → English output

REF: Hong Kong citizens jumped for joy when they knew Beijing's
bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the
Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid
for the 2008 Olympic Games, very happy.

REF: The U.S. delegation includes a China expert from Stanford
University, two Senate foreign policy aides and a former State
Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising
members from the Stanford University, one of the Chinese experts,
two of the Senate foreign policy as well as assistant who was
responsible for dealing with Pyongyang authorities of the former
State Department officials.
PORTAGE Nov. 2006: The US delegation included members from
Stanford University and an expert on China, two Senate foreign
policy, and one who is responsible for dealing with Pyongyang
authorities, a former State Department officials.

REF: Kuwait foreign minister Mohammad Al Sabah and visiting
Jordan foreign minister Muasher jointly presided the first
meeting of the joint higher committee of the two countries on
that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day
and visiting Jordan Foreign Secretary maasher co-chaired the
section about the two countries mixed Committee at the inaugural
meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and
visiting Jordanian Foreign Minister of Malaysia, co-chaired by
the two countries, the joint commission met for the first time.

REF: The Beagle 2 was scheduled to land on Mars on Christmas
Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for
Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on
Mars, but it signals is still unpredictable.
11
Examples of SMT output
And a silly English → German example from Google
(Jan. 25, 2007): "the hotel has a squash court"
→ "das Hotel hat ein Kürbisgericht" (think:
zucchini tribunal). But this kind of error
(perfect syntax, never-seen word combination)
isn't typical of a statistical system, so this
was probably a rule-based system.
12
SMT Research Culture, Evaluations, Metrics
  • Culture
  • SMT research is very engineering-oriented: driven
    by performance in NIST & other evaluations (see
    later slides)
  • → if a heuristic yields a big improvement in
    BLEU scores & a wonderful new theoretical
    approach doesn't, expect the former to get much
    more attention than the latter
  • Advantages of SMT culture: open-minded to new
    ideas that can be tested quickly; researchers who
    count have working systems with reasonably
    well-written software (so they can participate in
    evaluations)
  • Disadvantages of SMT culture: closed-minded to
    ideas not tested in a working system → if you
    have a brilliant theory that doesn't show a BLEU
    score improvement in a reasonable baseline
    system, don't expect SMT researchers to read your
    paper!

13
 
SMT Research Culture, Evaluations, Metrics
The NIST MT Evaluations
  • Since 2001, US National Institute of Standards &
    Technology (NIST) has been evaluating MT systems
  • Participants include MIT, IBM, CMU, RWTH,
    Hong Kong UST, ATR, IRST, others, and NRC.
    NRC's system is called PORTAGE (in NIST
    evaluations 2005 & 2006).
  • Main NIST language pairs: Chinese→English,
    Arabic→English
  • Semantic domains: news stories & multigenre
  • Training corpora released each fall, test corpus
    each spring; participants have 1 working week to
    submit target sentences
  • NIST evaluates systems comparatively
  • In 2005: http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html
  • In 2006: http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html
  • Statistical systems beat expert systems
    according to BLEU metric

14
SMT Research Culture, Evaluations, Metrics
  • Other MT Evaluations
  • WPT/WMT: usually organized each spring by Philipp
    Koehn & Christoph Monz; smaller training corpora
    than NIST, European language pairs. In 2006,
    evaluated on French↔English, German↔English,
    Spanish↔English.
    http://www.statmt.org/wmt06/proceedings/
  • TC-STAR: evaluation for spoken language
    translation. In 2006, evaluated on
    Chinese→English (one direction only) and
    Spanish↔English.
    http://www.elda.org/tcstar-workshop/2006eval.htm
  • IWSLT: evaluation for spoken language translation.
    In 2006, evaluated on Arabic→English,
    Chinese→English, Italian→English,
    Japanese→English.
    http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html

15
SMT Research Culture, Evaluations, Metrics
  • GALE Project
  • Huge DARPA-sponsored project: $50 million per
    year for 5 years. Three consortia: BBN-led
    "Agile", IBM-led "Rosetta", SRI-led
    "Nightingale".
  • NRC team is in MT working group of Nightingale.

[Diagram labels: Automatic speech recognition (ASR); Machine translation (MT); Distillation]
16
SMT Research Culture, Evaluations, Metrics
  • What is BLEU?
  • Human evaluation of automatic translation quality
    is hard & expensive. BLEU metric (invented at IBM)
    compares MT output with human-generated reference
    translations via N-gram matches.
  • N-gram precision = #(N-grams in MT output seen
    in ref.) / #(N-grams in MT output)
  • Example (from P. Koehn):
  • REF: Israeli officials are responsible for
    airport security
  • Sys A: Israeli officials responsibility of
    airport safety
  • Sys B: airport security Israeli officials
    are responsible

[Figure annotations: 1-gram match, 2-gram matches, 4-gram match]
17
SMT Research Culture, Evaluations, Metrics
  • What is BLEU?
  • REF: Israeli officials are responsible for
    airport security
  • Sys A: Israeli officials responsibility of
    airport safety
  • Sys B: airport security Israeli officials
    are responsible
  • Sys A: 1-gram precision = 3/6 (Israeli,
    officials, airport); 2-gram precision = 1/5
    (Israeli officials); 3-gram precision = 0/4;
    4-gram precision = 0/3.
  • Sys B: 1-gram precision = 6/6; 2-gram
    precision = 4/5; 3-gram precision = 2/4;
    4-gram precision = 1/3.
  • BLEU-N multiplies together the N N-gram
    precisions: the higher the value, the better the
    translation. But, could cheat by having very few
    words in MT output; so, brevity penalty.
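
The N-gram precisions for the Koehn example can be recomputed mechanically. The following is a minimal sketch, assuming whitespace tokenization and a single reference; the function names `ngrams` and `ngram_precision` are illustrative and not taken from any SMT toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(output, reference, n):
    """Return (matches, total): output n-grams seen in the reference.

    Counts are clipped, so an n-gram repeated in the output cannot match
    more often than it occurs in the reference.
    """
    out_counts = Counter(ngrams(output.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matches = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return matches, sum(out_counts.values())


REF = "Israeli officials are responsible for airport security"
SYS_A = "Israeli officials responsibility of airport safety"
SYS_B = "airport security Israeli officials are responsible"

for name, hyp in (("Sys A", SYS_A), ("Sys B", SYS_B)):
    print(name, [ngram_precision(hyp, REF, n) for n in range(1, 5)])
```

Each printed pair is (matching N-grams, N-grams in the output) for N = 1 to 4.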

18
SMT Research Culture, Evaluations, Metrics
  • What is BLEU?
  • $\text{BLEU-}N = (\text{brevity-penalty}) \prod_{i=1}^{N} (\text{precision}_i)^{\lambda_i}$,
    where
  • $\text{brevity-penalty} = \min(1, \text{output-length}/\text{ref-length})$.
  • Usually, we set $N = 4$ and all $\lambda_i = 1$, so we have
  • $\text{BLEU-4} = \min(1, \text{output-length}/\text{ref-length}) \prod_{i=1}^{4} \text{precision}_i$.
  • If any MT output has no N-grams matching ref.,
    for some N = 1, …, 4, BLEU-4 is zero. So, normally
    compute BLEU over whole test set of at least a
    hundred or so sentences.
  • Multiple references: if an N-gram has K
    occurrences in output, look for single ref. that
    has K or more copies of that N-gram. If find such
    a single ref., that N-gram has matched K times.
    If not, look for a ref. that has the highest #
    of copies (L) of that N-gram; use L in precision
    calculation. Ref-length = closest length.
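
As a worked illustration of the formula on this slide, the sketch below multiplies the N-gram precisions and applies the min(1, output-length/ref-length) brevity penalty. It follows the simplified definition given here, with all exponent weights defaulting to 1; the function name `bleu` is an assumption, and published BLEU implementations differ in details.

```python
from math import prod


def bleu(precisions, output_length, ref_length, weights=None):
    """BLEU-N as defined on this slide: brevity penalty times the
    product of the N-gram precisions, each raised to its weight."""
    if weights is None:
        weights = [1.0] * len(precisions)
    brevity_penalty = min(1.0, output_length / ref_length)
    return brevity_penalty * prod(p ** w for p, w in zip(precisions, weights))


# Precisions for the two example systems (6 output words, 7 reference words).
bleu_sys_a = bleu([3 / 6, 1 / 5, 0 / 4, 0 / 3], output_length=6, ref_length=7)
bleu_sys_b = bleu([6 / 6, 4 / 5, 2 / 4, 1 / 3], output_length=6, ref_length=7)
print(bleu_sys_a, bleu_sys_b)
```

Sys A scores zero because it has no matching 3-grams or 4-grams, which is the reason given above for computing BLEU over a whole test set rather than per sentence.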

19
SMT Research Culture, Evaluations, Metrics
  • Does BLEU correlate with human judgment?

[Chart: quality score (0 = terrible, 3 = excellent) by translator identity]
BLEU kind of correlates with human judgment;
works best with multiple references.
20
SMT Research Culture, Evaluations, Metrics
  • Why BLEU Is Controversial
  • If system produces a brilliant translation that
    uses many N-grams not found in the references, it
    will receive a low score.
  • Proponents of the expert system approach argue
    that BLEU is biased against this approach &
    favours SMT
  • Partial confirmation: 1. in NIST 2006
    Arabic-to-English evaluation, AppTek hybrid
    system (rule-based + SMT system) did best
    according to human evaluators, but not according
    to BLEU. 2. in 2006 WMT evaluation, Systran was
    scored comparably to other systems for some
    European language pairs (e.g., French-English) by
    human evaluators, but had much lower in-domain
    BLEU scores (see graphs in
    http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).

21
SMT Research Culture, Evaluations, Metrics
  • Other Automatic Metrics
  • SMT systems need an automatic metric for tuning
    (must try out thousands of variants). Automatic
    metrics compare MT output with human-generated
    reference translations.
  • Rivals of BLEU: translation edit rate (TER):
    how many edit ops to match references?
    http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
  • METEOR: compares MT output with references
    in a way that's less dependent on word choice
    (via stemming, WordNet, etc.). Gaining
    credibility: correlates better than BLEU with
    human scores. However, METEOR is only defined for
    translation into English.
    http://www.cs.cmu.edu/~alavie/METEOR/

22
SMT Research Culture, Evaluations, Metrics
  • Manual Metrics
  • Human evaluation of SMT preferable to automatic
    evaluation, but much slower & more expensive.
    Can't use for system tuning.
  • Ask humans to rank systems by adequacy and
    fluency. Adequacy: does MT output convey same
    meaning as source? Fluency: does MT output look
    like normal target-language text? (Good syntax &
    idiom.)
  • Metrics based on human postediting of MT output.
    E.g., HTER.
  • Metrics based on human understanding of MT
    output. Related to adequacy, but less subjective.
    E.g., Lincoln Labs metric: give English output of
    Arabic MT system to unilingual English analyst,
    then test him with standard "Defense Language
    Proficiency Test" (see Jones05).

23
SMT Research Culture, Evaluations, Metrics
  • Who Uses Which Metric When?
  • Many groups use BLEU for automatic system tuning
  • NIST, WPT/WMT, TC-STAR, & other evaluations often
    have BLEU as official metric, with some human
    reality checks. Koehn & Monz: WPT/WMT
    participants do human fluency/adequacy
    evaluations - nice analyses!
  • Many "expert/rule-based MT" researchers hate
    BLEU (can become excuse not to evaluate system
    competitively)
  • In theory, manual metrics should be related to MT
    task: e.g., adequacy for browsing/gisting,
    Lincoln Labs metric for intelligence community,
    HTER if MT output will be post-edited. So why is
    HTER GALE's official metric? HTER = Human
    Translation Edit Rate: MT output hand-edited by
    humans; measure # of operations performed.

24
SMT History & IBM Models
  • In the late 1980s, members of IBM's speech
    recognition group applied statistical learning
    techniques to bilingual corpora. These American
    researchers worked mainly with the Canadian
    Hansard: bilingual transcription of
    parliamentary proceedings.
  • These researchers quit IBM around 1991 for a
    hedge fund, Renaissance Technologies; they are
    now very rich!
  • Renewed interest in their work sparked the
    revival of research into statistical learning for
    MT that occurred from late 1990s onward. Newer
    "phrase-based" approach still partially relies
    on IBM models.
  • The IBM approach used Bayes's Theorem to define
    the "Fundamental Equation" of MT (Brown et al.
    1993)

25
SMT History & IBM Models
Fundamental Equation of MT
  • The best-fit translation of a source-language
    (French) sentence S into a target-language
    (English) sentence T is

$\hat{T} = \arg\max_T P(T \mid S) = \arg\max_T P(T)\, P(S \mid T)$

Job of language model P(T): ensure well-formed
target-language T. Job of translation model P(S|T):
ensure T could have generated S. Search task:
find T maximizing product P(T) P(S|T).
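
To make the search task concrete, here is a minimal sketch that picks the candidate T maximizing P(T) times P(S|T). The candidate sentences and all probability values are invented for illustration; a real system searches a far larger hypothesis space.

```python
# Hypothetical scores for three candidate translations of one source sentence.
# "lm" stands for P(T); "tm" stands for P(S|T). All numbers are made up.
CANDIDATES = {
    "the poor have no money": {"lm": 0.02, "tm": 0.10},
    "the poor are devoid": {"lm": 0.001, "tm": 0.30},
    "poor the no money have": {"lm": 0.00001, "tm": 0.10},
}


def best_translation(candidates):
    """Return the candidate T with the highest product P(T) * P(S|T)."""
    return max(candidates, key=lambda t: candidates[t]["lm"] * candidates[t]["tm"])


print(best_translation(CANDIDATES))
```

The second candidate has the highest translation-model score, but the language model prefers the first one strongly enough that the product favours it.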
26
SMT History & IBM Models
  • The IBM researchers defined five statistical
    translation models (numbered in order of
    complexity)
  • Each defines a mechanism for generation of text
    in one language (e.g., French or foreign = F)
    from another (e.g., English = E)
  • Most general many-to-many case is not covered by
    IBM models: in this forbidden case, a group of E
    words generates a group of F words, e.g.

The poor don't have any money
Les pauvres sont démunis
27
SMT History & IBM Models
  • The IBM models only allow one-to-many generation,
    e.g.

[Figure: alignment example with F sentence "Le programme a été mis en application" and the empty word Ø]
  • IBM models 1 & 2: all lengths for F sentence
    equally likely
  • Model 1 is "bag of words" - word order in F &
    E doesn't matter
  • In model 2, chance that an E word generates
    given F word(s) depends on position
  • IBM models 3, 4, 5 are fertility-based

28
SMT History & IBM Models
  • IBM model 1: "bag of words" (draw with uniform
    probability)
  • IBM model 2: "position-dependent bag of words"
    (draw with position-dep. prob.)

[Figure: words e1 e2 … eL drawn with position-dependent probabilities P(1→1), …, P(1→M); P(2→1), …, P(2→M); …; P(L→1), …, P(L→M)]
29
SMT History & IBM Models
  • Parameters: φ(e_i) = fertility of e_i = prob.
    that e_i will produce 0, 1, 2, … words in F;
    t(f|e_i) = probability that e_i can generate f;
    d(j | i, k) = distortion prob. = prob. that kth
    word generated by e_i ends up in pos. j of F

[Figure: IBM model 3 vs. IBM model 4. Each word e1, e2, …, eL is assigned a fertility φ(e1), φ(e2), …, φ(eL) and generates F words (up to fM) via t; Ø is the empty word.]
NOTE (IBM model 4): phrases can be broken up, but with
lower prob. than in model 3
IBM model 5: cleaned-up version of model 4 (e.g.,
two F words can't be given same position)
30
Phrase-based SMT
  • Four key ideas
  • phrase-based models (Och04, Koehn03, Marcu02)
  • dynamic programming search algorithms (Koehn04)
  • loglinear model combination (Och02)
  • error-driven learning (Och03)

31
Phrase-based SMT
Phrase-based approach introduced around 1998 by
Franz Josef Och & others (Ney, Wong, Marcu):
many-words-to-many-words (improvement on IBM
one-to-many).
Example: "cul de sac"
  word-based translation = "ass of bag" (N. Am.),
  "arse of bag" (British)
  phrase-based translation = "dead end" (N. Am.),
  "blind alley" (British)
This knowledge is stored in a phrase table:
collection of conditional probabilities of form
P(S|T) = backward phrase table or P(T|S) =
forward phrase table. Recall Bayes:
$\hat{T} = \arg\max_T P(T)\, P(S \mid T)$ → backward table
essential, forward table used for heuristics.
Tables for French→English:

forward, P(T|S): p(bag|sac) = 0.5, p(hand bag|sac) = 0.2,
p(cul|ass) = 0.5, p(dead end|cul de sac) = 0.85
backward, P(S|T): p(sac|bag) = 0.9, p(sacoche|bag) = 0.1,
p(cul de sac|dead end) = 0.7, p(impasse|dead end) = 0.3
32
Phrase-based SMT
  • Overall Phrase Pair Extraction Algorithm
  • 1. Run a sentence aligner on a parallel bilingual
    corpus (won't go over this)
  • 2. Run word aligner (e.g., one based on IBM
    models) on each aligned sentence pair; see next
    slide.
  • 3. From each aligned sentence pair, extract all
    phrase pairs with no external links - see two
    slides ahead.

33
Phrase-based SMT
  • Symmetrized Word Alignment using IBM Models
  • Alignments produced by IBM models are
    asymmetrical: source words have at most one
    connection, but target words may have many
    connections.
  • To improve quality, use symmetrization heuristic
    (Och00):
  • 1. Perform two separate alignments, one in each
    translation direction.
  • 2. Take intersection of links as starting point.
  • 3. Add neighbouring links from union until all
    words are covered.

[Figure: alignment with S = "I want to go home", T = "Je veux aller chez moi"; alignment with S = "Je veux aller chez moi", T = "I want to go home"; and the symmetrized alignment between "I want to go home" and "Je veux aller chez moi"]
34
Phrase-based SMT
"Diag-And" phrase extraction
  • Je l' ai vu à la télévision
  • I saw him on television

Input: aligned sentence pair. Output: set of
consistent phrases.
Extract all phrase pairs with no external links,
for example:
Good pairs: (Je, I), (Je l' ai vu, I saw him),
(ai vu, saw), (l' ai vu à la, saw him on)
Bad pairs: (Je l' ai vu, I saw), (l' ai vu à,
saw him on), (la télévision, television)
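
The "no external links" test can be sketched in a few lines. The word alignment below is an assumption inferred from the good and bad pairs listed on this slide (in particular, both "à" and "la" are taken to link to "on"); the function name `extract_phrase_pairs` is illustrative, and this minimal version does not extend phrases over unaligned words as fuller implementations do.

```python
def extract_phrase_pairs(src, tgt, links, max_len=7):
    """Return all phrase pairs consistent with the word alignment `links`.

    `links` is a set of (source_index, target_index) pairs. A phrase pair
    is kept only if no word inside it is linked to a word outside it.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word of the source span.
            linked = [j for (i, j) in links if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the span if a target word inside [j1, j2] is linked
            # to a source word outside [i1, i2] (an external link).
            if any((j1 <= j <= j2) and not (i1 <= i <= i2) for (i, j) in links):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


SRC = ["Je", "l'", "ai", "vu", "à", "la", "télévision"]
TGT = ["I", "saw", "him", "on", "television"]
# Assumed alignment: Je-I, l'-him, ai-saw, vu-saw, à-on, la-on,
# télévision-television.
LINKS = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}

PAIRS = extract_phrase_pairs(SRC, TGT, LINKS)
for pair in sorted(PAIRS):
    print(pair)
```

Under this assumed alignment, the four good pairs from the slide are extracted and the three bad pairs are not.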
35
Phrase-Based Search
  • Generative process:
  • 1. Split source sentence into phrases
    (N-grams).
  • 2. Translate each source phrase (one-to-one).
  • 3. Permute target phrases to get final
    translation.
  • Much simpler and more intuitive than the
    IBM process,
  • but the price of this is no provision for
    gaps, e.g., ne VERB pas

[Figure: steps 1-3 applied to "Je l' ai vu à la télévision": source phrases are translated as "I", "him", "saw", "on television" and permuted into "I saw him on television"]
NOTE: XRCE's Matrax does handle gaps
36
Phrase-Based Search
Order: target hypotheses grow left→right, from
source segments consumed in any order.
[Figure: search over source s1 s2 s3 s4 s5 s6 s7 s8 s9.
Backward table P(S|T): p(s2 s3 | t8), p(s2 s3 | t5 t3),
p(s3 s4 | t4 t9).
Pick s2 s3 first: phrase translation gives Tgt hyp "t8" or
Tgt hyp "t5 t3". Pick s3 s4 first: phrase translation gives
Tgt hyp "t4 t9". Then pick s5 s6 s7: phrase translation
gives Tgt hyp "t8 t6 t2".]
Phrase table: 1. suggests possible segments;
2. supplies phrase translation scores.
Language model P(T): scores growing target
hypotheses left→right.

37
Loglinear Model Combination
Previous slides show basic system that ranks
hypotheses by P(S|T) P(T). Now let's introduce
an alignment/reordering variable A (aligns T & S
phrases). We want

$\hat{T} = \arg\max_T P(T \mid S) \approx \arg\max_{T,A} P(T, A \mid S)$
$= \arg\max_{T,A} f_1(T,A,S)^{\lambda_1} \times f_2(T,A,S)^{\lambda_2} \times \dots \times f_M(T,A,S)^{\lambda_M}$
$= \arg\max_{T,A} \exp\left(\sum_i \lambda_i \log f_i(T,A,S)\right)$.

The $f_i$ now typically include not only functions
related to P(S|T) and language model P(T), but also
to A ("distortion"), P(T|S), length(T), etc. The
$\lambda_i$ serve as reliability weights. This change
in score computation doesn't fundamentally change
the search algorithm.
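
The loglinear score is just a weighted sum of log feature values, as the sketch below shows. The feature names, feature values, and weights are all hypothetical; in a real system the weights would be tuned rather than set by hand.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of log feature values for one hypothesis (T, A)."""
    return sum(weights[name] * math.log(value) for name, value in features.items())


# Hypothetical weights and feature values f_i(T, A, S) for two hypotheses.
WEIGHTS = {"tm_backward": 1.0, "lm": 1.0, "distortion": 0.5}
HYPOTHESES = {
    "hyp1": {"tm_backward": 0.2, "lm": 0.01, "distortion": 0.5},
    "hyp2": {"tm_backward": 0.1, "lm": 0.05, "distortion": 0.25},
}

BEST = max(HYPOTHESES, key=lambda h: loglinear_score(HYPOTHESES[h], WEIGHTS))
print(BEST, loglinear_score(HYPOTHESES[BEST], WEIGHTS))
```

Because exp is monotonic, ranking by the weighted log sum gives the same winner as ranking by the product of the weighted features, which is why the change in scoring leaves the search algorithm intact.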

38
Loglinear Model Combination
  • Advantages
  • Very flexible! Anyone can devise dozens of
    features.
  • E.g., if lots of mismatched brackets in output,
    include feature function that outputs 1 if no
    mismatched brackets, -1 if have mismatched
    brackets.
  • So lots of new features being tried in somewhat
    haphazard way.
  • But systems steadily improving: outputs from
    NIST 2006 look much better than those from NIST
    2002. SMT not good enough to replace human
    translators, but good enough for, e.g., most Web
    browsing. Using 1000 machines and massive
    quantities of data, Google got 45.4 BLEU for
    Arabic to English, 35.0 for Chinese to English:
    very high scores!

39
Loglinear Model Combination
  • Typical Loglinear Components for SMT Decoding
  • Joint counts C(S,T) from phrase extraction yield
    estimates P(S|T) stored in backward phrase
    table and estimates P(T|S) stored in forward
    phrase table. These are typically relative
    frequency estimates (but we've looked at smoothed
    variants).
  • Distortion model D(T,A,S) assigns score to amount
    of phrase reordering incurred in going from S to
    hypothesis T. Can be based purely on
    displacement, or be lexicalized (identity of
    words in S & T is important).
  • Length model L(T,S) scores probability that
    hypothesis of length |T| generated from source of
    length |S|.
  • Language model P(T) gives probability of word
    sequence T in target language; see next few
    slides.
  • NOTE: these are just for decoding; you can use
    lots more components for N-best/lattice
    reordering!

40
Target Language Model P(T)
  • The Stupidest Thing Noam Chomsky Ever Said
  • "It must be recognized that the notion of a
    probability of a sentence is an entirely
    useless one, under any interpretation of this
    term."
  • Chomsky, 1969.

41
Target Language Model P(T)
  • Language model helps generate fluent output by
  • 1. assigning higher probability to correct
    word order, e.g., $P_{LM}$(the house is small)
    >> $P_{LM}$(small the is house)
  • 2. assigning higher probability to correct
    word choices, e.g.,
    $P_{LM}$(I am going home) >> $P_{LM}$(I am going
    house)
  • Almost everyone in both SMT and ASR (automatic
    speech recognition) communities uses N-gram
    language models. Start with
  • $P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_1, \dots, w_{i-1}) \cdots P(w_m \mid w_1, \dots, w_{m-1})$,
  • then limit window to N words. E.g., for N = 3,
    trigram LM:
  • $P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_{i-2}, w_{i-1}) \cdots P(w_m \mid w_{m-2}, w_{m-1})$.

42
Target Language Model P(T)
  • Estimation is done by relative frequency on large
    corpus:
    $P(w_i \mid w_{i-2}, w_{i-1}) \approx f(w_i \mid w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / \sum_w C(w_{i-2}, w_{i-1}, w)$.
  • E.g., in Europarl corpus, see 225 trigrams
    starting "the red …":
  • C(the red cross) = 123, C(the red tape) = 31,
    C(the red army) = 9, C(the red card) = 7, C(the red
    ,) = 5 (and 50 other trigrams). So estimate
    P(cross | the red) = 123/225 = 0.547.
  • But need to reserve probability mass for unseen
    events - maybe never saw "the red planet" in
    Europarl, but don't want to have estimate
    P(planet | the red) = 0. Also, want estimates
    whose variance isn't too high. Smoothing
    techniques are used to solve both problems. E.g.,
    could linearly smooth trigrams with bigrams &
    unigrams:
    $P(w_i \mid w_{i-2}, w_{i-1}) = \lambda f(w_i \mid w_{i-2}, w_{i-1}) + \mu f(w_i \mid w_{i-1}) + (1 - \lambda - \mu) f(w_i)$,
    $0 < \lambda, \mu < 1$.
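
The sketch below reproduces the relative-frequency estimate from the Europarl counts quoted above and then applies linear interpolation to an unseen trigram. The bigram and unigram frequencies for "planet" and the two interpolation weights are invented for illustration; the check that the weights sum to less than 1 is an added assumption so that the unigram weight stays positive.

```python
# Counts quoted on the slide for trigrams starting "the red".
TRIGRAM_COUNTS = {"cross": 123, "tape": 31, "army": 9, "card": 7, ",": 5}
CONTEXT_TOTAL = 225  # all trigrams starting "the red", including 50 others


def rel_freq(word):
    """Relative-frequency estimate f(word | the red)."""
    return TRIGRAM_COUNTS.get(word, 0) / CONTEXT_TOTAL


def interpolate(f_tri, f_bi, f_uni, lam, mu):
    """Linear interpolation of trigram, bigram and unigram frequencies."""
    if not (0 < lam < 1 and 0 < mu < 1 and lam + mu < 1):
        raise ValueError("weights must be in (0, 1) and sum to less than 1")
    return lam * f_tri + mu * f_bi + (1 - lam - mu) * f_uni


print(round(rel_freq("cross"), 3))

# "the red planet" was never seen, so its trigram frequency is zero, but
# hypothetical bigram and unigram frequencies keep the estimate positive.
P_PLANET = interpolate(rel_freq("planet"), f_bi=0.01, f_uni=0.0001, lam=0.6, mu=0.3)
print(P_PLANET)
```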

43
Target Language Model P(T)
Measuring Language Model Quality
  • Perplexity: metric that measures predictive power
    of an LM on new data as an average branching
    factor. E.g., model that says "any digit 0, …, 9
    has equal probability of occurrence" will yield
    perplexity of 10.0 on digit sequence generated
    randomly from these 10 digits.
  • Perplexity of LM measured on corpus
    W = (w_1 … w_N) is
  • $\text{Perp}_{LM}(W) = \left(\prod_{w_i} P(w_i \mid LM)\right)^{-1/N}$
    = 1/(geometric average per-word prob.)
  • The better the LM is as a model for W, the
    less "surprised" it is by words of W → higher
    estimated prob. → lower entropy.
  • Typical perplexities for well-trained
    English trigram LMs with lexica of about 25K
    words for various dictation domains:
  • Perp(radiology) = 20, Perp(emergency
    medicine) = 60, Perp(journalism) = 105,
    Perp(general English) = 247.
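
The digit example can be checked numerically. This minimal sketch computes perplexity as the inverse geometric mean of the per-word probabilities, working in log space to avoid underflow on long sequences; the function name `perplexity` and the sequence length of 1000 are illustrative choices.

```python
import math
import random


def perplexity(word_probs):
    """Perplexity of a sequence given the LM probability of each word:
    the inverse geometric mean of the per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


random.seed(0)
digits = [random.randrange(10) for _ in range(1000)]

# A model that gives every digit probability 1/10, whatever the context.
uniform_probs = [0.1 for _ in digits]
print(perplexity(uniform_probs))
```

For the uniform model the result does not depend on which digits were drawn, which is why the perplexity equals the branching factor of 10.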

44
Target Language Model P(T)
  • "A Bit of Progress in Language Modeling"
    (Goodman01) is good summary of state of the art
    in N-gram language modeling.
  • Consistently superior method: Kneser-Ney.
  • Intuition: if "Francisco" & "stew" each
    seen 10^3 times in our corpus of 10^6 words, and
    neither "eggplant Francisco" nor "eggplant stew"
    seen, which should be higher,
    P(Francisco | eggplant) or P(stew | eggplant)?
  • Interpolation answer:
    $P(w_i \mid w_{i-1}) = \lambda f(w_i \mid w_{i-1}) + (1 - \lambda) f(w_i)$.
  • So P(Francisco | eggplant) = λ·0 + (1-λ)·10^-3
    = P(stew | eggplant).
  • Kneser-Ney answer: no, "Francisco" only
    occurs after "San", but 1,000 occurrences of
    "stew" preceded by 100 different words. So when
    (w_{i-1} w_i) has never been seen before, w_i = stew
    more probable than w_i = Francisco →
    P(stew | eggplant) >> P(Francisco | eggplant).

45
Target Language Model P(T)
  • Kneser-Ney formula (for bigrams; easily extended
    to N-grams):
  • $P_{KN}(w_i \mid w_{i-1}) = \dfrac{\max\{C(w_{i-1} w_i) - D,\, 0\}}{C(w_{i-1})} + \alpha(w_{i-1}) \dfrac{|\{v : C(v\, w_i) > 0\}|}{\sum_w |\{v : C(v\, w) > 0\}|}$,
  • where D is a discount factor < 1, α(w_{i-1}) is
    a normalization constant, |{v : C(v w_i) > 0}| is
    the number of different words that precede w_i in
    the training corpus, and Σ_w |{v : C(v w) > 0}| is
    the number of different bigrams in the training
    corpus.
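
A toy bigram version of the Kneser-Ney estimate is sketched below. The corpus is invented to mimic the eggplant example at small scale ("Francisco" follows only "San", "stew" follows four different words), the discount of 0.75 is an arbitrary choice, and the normalization constant is computed as D times the number of distinct followers of the context divided by the context count, which is one standard way to make the distribution sum to 1. The name `train_kn_bigram` is illustrative.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return a function prob(w, v) giving a Kneser-Ney estimate of P(w | v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()      # C(v) as a history: sum over w of C(v w)
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), count in bigrams.items():
        context_count[v] += count
        followers[v].add(w)
        preceders[w].add(v)
    n_bigram_types = len(bigrams)

    def prob(w, v):
        # Continuation term: share of bigram types that end in w.
        continuation = len(preceders.get(w, ())) / n_bigram_types
        if context_count[v] == 0:
            return continuation
        discounted = max(bigrams[(v, w)] - discount, 0) / context_count[v]
        # Normalization constant: the mass removed by discounting.
        alpha = discount * len(followers[v]) / context_count[v]
        return discounted + alpha * continuation

    return prob


TOKENS = []
for _ in range(4):
    TOKENS += ["San", "Francisco", "is", "nice", "."]
for meat in ["beef", "fish", "bean", "lamb"]:
    TOKENS += ["eat", meat, "stew", "."]
TOKENS += ["eat", "eggplant", "now", "."]

P_KN = train_kn_bigram(TOKENS)
print(P_KN("stew", "eggplant"), P_KN("Francisco", "eggplant"))
```

Both words occur four times and neither has been seen after "eggplant", yet "stew" receives the higher probability because it follows more distinct words.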

46
Flaws of Phrase-based, Loglinear Systems
  • Loglinear feature function combination is too
    flexible! Makes it easy not to think about
    theoretical properties of models.
  • The IBM models were true models: given arbitrary
    source sentence S and target sentence T, could
    estimate non-zero P(T|S). Phrase-based models
    are not models: in general, for T which is a good
    translation of S, they give P(T|S) = 0. They
    don't guarantee existence of an alignment between
    T and S. Thus, the only translations T to which
    a phrase-based system is guaranteed to assign
    P(T|S) > 0 are T output by same system.
  • This has practical consequences: in general, a
    phrase-based MT system can't be used for
    analyzing pre-existing translations. This rules
    out many useful forms of assistance to human
    translators - e.g., spotting potential errors in
    translations based on regions of low P(T|S).

47
PORTAGE: A Typical SMT System
  • 1. Sentence-align a big bilingual corpus
  • 2. On each sentence pair, use IBM models to align
    words
  • 3. Build phrase tables from word alignments via
    "diag-and" or similar heuristic (Koehn03).
    Backward phrase table gives P(S|T) (there is an
    implicit segmentation model).
  • 4. Build language model (LM) for target language:
    estimates P(T), based on n-grams in T
  • 5. P(S|T) and P(T) are sufficient for
    decoding, but one often adds other loglinear
    feature functions such as a distortion penalty
  • 6. Use (Och03) method to find good weights λ_i
    for loglinear features
  • 7. Optionally, include reordering step, i.e.,
    decoder outputs many hypotheses (via N-best list
    or lattice) which are rescored by larger set of
    feature functions

48
PORTAGE: A Typical SMT System
Core Engine
[Diagram. Source sentence (e.g., "mais où sont les neiges d'antan ?") → Canoe decoder → N-best hypotheses → Rescorer → Rescored N-best.
"Small" set of information sources for Canoe decoder (feature functions): at least one language model, at least 1 phrase translation model, at least one distortion model, number-of-words model. Weights for "small" set: wLM·LM, wTM·TM, wNM·NM, giving the weighted "small" info.
"Large" set of information sources for Rescorer: the "small" set plus any # of additional info. sources A1, A2, A3 (for rescorer only). Weights for "large" set: kLM·LM, kTM·TM, …, kA3·A3, giving the weighted "large" info.]
49
Training Core Components of PORTAGE
[Diagram. Preprocessing is applied to the raw parallel corpus (src-lang text, tgt-lang text) and to additional monolingual corpora (tgt-lang text). The results train the phrase translation model (PT) and the language model (LM). With other "small" set models (model3, …, modelK) these give the "small" set info, with small set wts w1, …, wK. Extra models for the "large" set (modelK+1, …, modelM) complete the "large" set info, with large set wts w1, …, wM.]
50
Canoe Optimization of Weights (COW)
Purpose: find weights w1, …, wS on "small" set of
information sources (N around 100)
[Diagram. The "small" set of information sources I1, I2, …, IS and a weight vector W feed Canoe (first call to Canoe; 2nd & subsequent calls to Canoe use the new weights w1r, w2r, …, wSr from "rescore-train"). Canoe outputs a list of D N-best hyp., which is passed to Rescore_train (first call to rescore-train; 2nd & subsequent calls to rescore-train use the union). Rescore_train also takes K random wt. vectors W1 … WK, where W1 = (w11, w21, …, wS1), …, WK = (w1K, w2K, …, wSK).]
51
Rescoring: Finding Weights on "Large" Info.
Set for Rescorer (N around 1000)
[Diagram. "Large" set of feature functions I1, I2, …, IS, IS+1, …, IL, of which I1, …, IS form the "small" set. Weights for "small" fixed by previous COW step give the weighted "small" info w1·I1, …, wS·IS. Rescore_train starts from initial weights (w1i, w2i, …, wLi) and K random wt. vectors W1 … WK, where W1 = (w11, w21, …, wL1), …, WK = (w1K, w2K, …, wLK), and outputs the final "large" wts (w1f, w2f, …, wLf).]
52
Tutorial Plan
  • B. Details & research topics
  • Named entities
  • Large-scale discriminative training (George
    Foster)
  • Decoding for SMT (prepared by Nicola Ueffing)
  • Hierarchical models (George Foster)
  • System combination

53
Named entity recognition & transliteration
Chinese example: "Secretary-General Wong
appeared with Larry Ellison, Chief Executive
Officer of Oracle Corporation, at a press
conference to announce Oracle's investment of
100 million dollars in a new research centre in
Szechuan Province".
Personal names: Wong, Larry Ellison.
Titles: Secretary-General, Chief Executive Officer.
Organization name: Oracle Corporation.
Place name: Szechuan Province.
Recognition problem: detect these entities in a
continuous stream of ideograms.
Transliteration problem: when ideograms are used
phonetically (esp. for non-Chinese names like
Larry Ellison), become aware of that & map them
onto Latin characters.
54
Named entity recognition & transliteration
  • Made-up Chinese Transliteration Example
  • How to translate 唐纳德拉姆斯菲尔德?
  • 唐 táng (surname) - Tang Dynasty; 纳 (F 納) nà
    receive, accept, enjoy, pay, sew; 德 dé virtue
  • 拉 lā pull, drag, haul; 姆 mǔ nurse; 斯 sī
    (thus; now used mostly for sound); 菲 fēi
    humble; 尔 (F 爾) ěr (archaic) you; 德 dé virtue
  • "After receiving virtue from the Tang Dynasty,
    you thus pulled the humble nurse away from
    virtue" (????). No:
  • "tang na de la mu si fei de" = DONALD
    RUMSFELD.
  • Actual Chinese→English example generated by
    PORTAGE:
  • "Outgoing president Iliescu has also
    congratulated Basescu." →
  • "Outgoing president of Iraq, has also been made
    to the road to the public."

55
Named entity recognition & transliteration
  • Other examples, Arabic→English:
  • Muammar Ghadafy / Moammar Khaddafi / Muamar Qadafy
  • Azeddine / Elzedine / Alsuddin / Ahzudin (depending on
    region, pronounced differently, thus transliterated
    into the Latin alphabet differently)
  • English→French (Google Translate, Jan. 24, 2007):
  • "The Englishman John Snow thought cholera was
    transmitted by small, living organisms." →
  • "Le choléra de pensée de neige de John d'Anglais
    a été transmis par la petite, organique matière."
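
One generic way to recognize that variant spellings refer to the same name is fuzzy string matching. The sketch below uses plain edit distance, which is an illustrative stand-in and not a method described in this tutorial; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

variants = ["Ghadafy", "Khaddafi", "Qadafy"]
for name in variants[1:]:
    print(name, round(similarity(variants[0], name), 2))
```

A threshold on such a score would only be a rough heuristic; pronunciation-based matching is likely to work better for names that differ by region.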
56
System Combination
  • Introduction
  • Different systems make different errors, so why not
    combine information? This worked well for ASR.
  • But, because of reordering, synonyms, etc.,
    system combination is not as easy for MT!
  • RWTH (Aachen), an SMT powerhouse, has recently
    been working on parallel system combination
    (Evgeny Matusov).
  • NRC has been working on serial system
    combination.
  • Both teams are now getting good results.

57
System Combination
  • Parallel System Combination (RWTH Aachen)
  • Hypotheses from different systems are aligned (some
    word reordering allowed; use of synonyms)
  • Generate a confusion network → choices at each
    position are scored with system weights and word
    confidence scores
  • N-best consensus translations are generated from the
    confusion network and rescored with various
    information sources
  • A year ago, results were unimpressive. Since then, RWTH
    has added new information sources (e.g., LMs trained
    on N-best lists from contributing systems) that
    encourage preservation of original phrases. Nice
    preliminary Arabic results: an improvement of 2-3
    BLEU points over the best individual system in the
    combination.
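
The voting step over a confusion network can be sketched as follows. This is a minimal illustration, not RWTH's implementation: it assumes the hypotheses have already been aligned into columns of equal length (with an empty string standing for "no word"), uses only per-system weights, and leaves out word confidence scores and N-best rescoring. The toy hypotheses and weights are made up.

```python
from collections import defaultdict

EPS = ""  # empty word: a system contributes nothing at this position

def consensus(aligned, weights):
    """Pick, at each position, the word with the largest total system weight."""
    length = len(aligned[0])
    assert all(len(h) == length for h in aligned), "hypotheses must be pre-aligned"
    output = []
    for pos in range(length):
        votes = defaultdict(float)
        for hyp, w in zip(aligned, weights):
            votes[hyp[pos]] += w
        best = max(votes, key=votes.get)
        if best != EPS:
            output.append(best)
    return " ".join(output)

# Hypothetical pre-aligned outputs of three systems (toy data).
aligned = [
    ["Chinese", "president", "slams", EPS,         "leaders", "of", "Hong", "Kong"],
    ["Chinese", "president", "sends", "criticism", "leaders", "of", "Hong", "Kong"],
    ["China",   "president", "sends", "criticism", "leaders", "to", "Hong", "Kong"],
]
weights = [0.4, 0.35, 0.25]
result = consensus(aligned, weights)
print(result)
```

Here the highest-weighted system is outvoted wherever the other two agree, which is the basic effect the combination relies on.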

58
System Combination
Example of RWTH Parallel Combination
  • Ref: Chinese president directs unprecedented
    criticism at leaders of Hong Kong.
  • Best system: Chinese president slams unprecedented
    leaders to Hong Kong.
  • System comb.: Chinese president sends unprecedented
    criticism of the leaders of Hong Kong.
59
System Combination
  • Serial System Combination (NRC)
  • Use SMT to correct mistakes made by another
    method (e.g., a rule-based one)
  • Training Procedure
  • Use MT1 to produce an initial target translation of the
    source half of a parallel human-translated
    corpus, thus giving a corpus of MT1 target output
    in parallel with good target versions of the same
    sentences; use this parallel corpus of (MT1 target,
    human target) sentences to train SMT.
  • Even better, if humans can be found to post-edit MT1
    output, use the MT1 target in parallel with the
    corrected target as the SMT training corpus.
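
The corpus-building step of the training procedure can be sketched as below. This is a minimal illustration under stated assumptions: `build_postedit_corpus` and `toy_mt1` are hypothetical names, and the word-for-word dictionary merely stands in for a real first-stage system such as a rule-based one.

```python
from typing import Callable, Iterable, List, Tuple

def build_postedit_corpus(
    parallel: Iterable[Tuple[str, str]],
    mt1: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (source, human target) pairs into (MT1 output, human target) pairs."""
    return [(mt1(source), human_target) for source, human_target in parallel]

# Hypothetical stand-in for MT1: a word-for-word dictionary lookup.
TOY_DICT = {"le": "the", "chat": "cat", "noir": "black", "dort": "sleeps"}

def toy_mt1(source: str) -> str:
    return " ".join(TOY_DICT.get(w, w) for w in source.lower().split())

parallel = [("Le chat noir dort", "The black cat sleeps")]
corpus = build_postedit_corpus(parallel, toy_mt1)
print(corpus)
```

The resulting pairs are same-language on both sides, so the second-stage SMT system learns to map MT1's typical errors (here, the word order) onto corrected target text.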

60
System Combination
Serial System Combination (NRC)
61
System Combination
Serial System Combination (NRC)
62
System Combination
  • Discussion & Future Work
  • Parallel combination is probably best for similar
    systems of good quality, serial combination for
    systems that are very different
  • Future work for serial combination: allow SMT
    both direct and indirect (via MT1) access to the
    source text. Could do this using, e.g.:
  • Rescoring
  • Parallel phrasetables
  • Parallel LMs
  • Parallel decoding (etc.)

63
References (1)
Best overall reference: Philipp Koehn, "Statistical Machine Translation", University of Edinburgh (textbook to appear 2007 or 2008, Cambridge University Press).
Papers (NOTE: short summary of key papers available from Kuhn/Foster)
[Brown93] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of Machine Translation: Parameter estimation. Computational Linguistics, 19(2):263-312, June 1993.
[Chomsky69] Noam Chomsky. Quine's Empirical Assertions. In Words and Objections: Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht, Netherlands, 1969.
[Foster06] George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
[Germann01] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.
64
References (2)
[Goodman01] Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/joshuago/publications.htm
[Jones05] Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology: Applications and Challenge of Speech Processing).
[Knight99] Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
[Koehn04] Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag.
[KoehnDec03] Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003.
65
References (3)
[KoehnMay03] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003.
[Marcu02] Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002.
[OchJHU04] Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004.
[Och04] Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
[Och03] Franz Josef Och. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July 2003.
66
References (4)
[Och02] Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
[Och01] Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine Translation Workshop, July 2001.
[Och00] Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING), Saarbrücken, Germany, August 2000.
[Papineni01] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of Machine Translation. Technical Report RC22176, IBM, September 2001.
[Ueffing02] Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word Graphs in Statistical Machine Translation. Empirical Methods in Natural Language Processing, July 2002.