What's New in Statistical Machine Translation. Kevin Knight, USC/Information Sciences Institute, USC/Computer Science Department.

1
What's New in Statistical Machine Translation
  • Kevin Knight

USC/Information Sciences Institute, USC/Computer Science Department
2
Machine Translation
[Chinese source text; characters lost in extraction]
The U.S. island of Guam is maintaining a high
state of alert after the Guam airport and its
offices both received an e-mail from someone
calling himself the Saudi Arabian Osama bin Laden
and threatening a biological/chemical attack
against public places such as the airport.
3
Thousands of Languages Are Spoken
Language          Speakers
MANDARIN          885,000,000
SPANISH           332,000,000
ENGLISH           322,000,000
BENGALI           189,000,000
HINDI             182,000,000
PORTUGUESE        170,000,000
RUSSIAN           170,000,000
JAPANESE          125,000,000
GERMAN             98,000,000
WU (China)         77,175,000
JAVANESE           75,500,800
KOREAN             75,000,000
FRENCH             72,000,000
VIETNAMESE         67,662,000
TELUGU             66,350,000
YUE (China)        66,000,000
MARATHI            64,783,000
TAMIL              63,075,000
TURKISH            59,000,000
URDU               58,000,000
MIN NAN (China)    49,000,000
JINYU (China)      45,000,000
GUJARATI           44,000,000
POLISH             44,000,000
ARABIC             42,500,000
UKRAINIAN          41,000,000
ITALIAN            37,000,000
XIANG (China)      36,015,000
MALAYALAM          34,022,000
HAKKA (China)      34,000,000
KANNADA            33,663,000
ORIYA              31,000,000
PANJABI            30,000,000
SUNDA              27,000,000
Source: Ethnologue
4
Recent Progress in Statistical MT
2002
  • insistent Wednesday may recurred her trips to
    Libya tomorrow for flying
  • Cairo 6-4 ( AFP ) - An official announced
    today in the Egyptian lines company for flying
    Tuesday is a company "insistent for flying" may
    resumed a consideration of a day Wednesday
    tomorrow her trips to Libya of Security Council
    decision trace international the imposed ban
    comment.
2003
  • Egyptair Has Tomorrow to Resume Its Flights to
    Libya
  • Cairo 4-6 (AFP) - Said an official at the
    Egyptian Aviation Company today that the company
    egyptair may resume as of tomorrow, Wednesday its
    flights to Libya after the International Security
    Council resolution to the suspension of the
    embargo imposed on Libya.

5
2005
news broadcast
foreign language speech recognition
English translation
searchable archive
6
Warren Weaver (1947)
ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv
rqcffnw cw owgcnwf kowazoanv ...
7
Warren Weaver (1947)
e e e e ingcmpnqsnwf cv fpn
owoktvcv e e e hu
ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv
...
8
Warren Weaver (1947)
e e e the ingcmpnqsnwf cv fpn
owoktvcv e e e hu
ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv
...
9
Warren Weaver (1947)
e he e the ingcmpnqsnwf cv fpn
owoktvcv e e e t hu
ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv
...
10
Warren Weaver (1947)
e he e of the ingcmpnqsnwf cv fpn
owoktvcv e e e t hu
ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv
...
11
Warren Weaver (1947)
e he e of the fof ingcmpnqsnwf cv fpn
owoktvcv e f o e o oe t hu
ihgzsnwfv rqcffnw cw owgcnwf ef kowazoanv
...
12
Warren Weaver (1947)
e he e of the ingcmpnqsnwf cv fpn owoktvcv
e e e t hu ihgzsnwfv
rqcffnw cw owgcnwf e kowazoanv ...
13
Warren Weaver (1947)
e he e is the sis ingcmpnqsnwf cv fpn
owoktvcv e s i e i ie t hu
ihgzsnwfv rqcffnw cw owgcnwf es kowazoanv
...
14
Warren Weaver (1947)
decipherment is the analysis ingcmpnqsnwf cv fpn
owoktvcv of documents written in ancient hu
ihgzsnwfv rqcffnw cw owgcnwf languages
... kowazoanv ...
15
Warren Weaver (1947)
"The non-Turkish guy next to me is even
deciphering Turkish! All he needs is a
statistical table of letter-pair frequencies in
Turkish."
(The table is collected mechanically from a Turkish
body of text, or corpus.)
Can this be computerized?
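Collecting such a letter-pair table mechanically is straightforward. The sketch below is one possible way to do it, assuming plain text input; the function name `letter_pair_frequencies` and the sample string are illustrative and do not come from the talk.

```python
from collections import Counter


def letter_pair_frequencies(text):
    """Return relative frequencies of adjacent letter pairs within words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not create spurious pairs.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(zip(letters, letters[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Hypothetical stand-in for a corpus; a real table would need far more text.
table = letter_pair_frequencies("banana band")
```

Pairs are counted inside words only, which is a design choice made here for simplicity; a real decipherment table might also track word boundaries.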
16
When I look at an article in Russian, I say
this is really written in English, but it has
been coded in some strange symbols. I will now
proceed to decode. - Warren Weaver, March 1947
17
When I look at an article in Russian, I say
this is really written in English, but it has
been coded in some strange symbols. I will now
proceed to decode. - Warren Weaver, March
1947 ... as to the problem of mechanical
translation, I frankly am afraid that the
semantic boundaries of words in different
languages are too vague ... to make any
quasi-mechanical translation scheme very
hopeful. - Norbert Wiener, April 1947
18
Spanish/English corpus
 
19
Spanish/English corpus
Translate Clients do not sell pharmaceuticals
in Europe.
 
20
Centauri/Arcturan Knight 97
Your assignment, translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
23
Centauri/Arcturan Knight 97
Your assignment, translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
???
29
Centauri/Arcturan Knight 97
Your assignment, translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
process of elimination
30
Centauri/Arcturan Knight 97
Your assignment, translate this to Arcturan
farok crrrok hihok yorok clok kantok ok-yurp
cognate?
31
Centauri/Arcturan Knight 97
Your assignment, put these words in order
jjat, arrat, mat, bat, oloat, at-yurp
zero fertility
32
When I look at an article in Russian, I say
this is really written in English, but it has
been coded in some strange symbols. I will now
proceed to decode. - Warren Weaver, March 1947
The required statistical tables have millions of
entries → too much for the computers of Weaver's
day → not enough RAM!
33
IBM Candide Project (1988-1994)
  • How to get quantities of human translation in
    computer readable form?
  • parallel corpus

Canadian bureaucrat
IBM's John Cocke, inventor of CKY parsing and
RISC processors
35
IBM Candide Project (1988-1994)
  • How to get quantities of human translation in
    computer readable form?
  • parallel corpus

IBM's John Cocke, inventor of CKY parsing and
RISC processors
36
IBM Candide Project [Brown et al 93]
French/English Bilingual Text → Statistical Analysis
English Text → Statistical Analysis
French → Broken English → English
J'ai si faim → What hunger have I, Hungry I am so, I am so
hungry, Have me that hunger → I am so hungry
37
Mathematical Formulation
Given source sentence f:
\[
\hat{e} = \arg\max_e P(e \mid f)
        = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} \quad \text{(by Bayes' rule)}
        = \arg\max_e P(f \mid e)\, P(e) \quad \text{(} P(f) \text{ is the same for all } e \text{)}
\]
French → Broken English → English
Translation Model: P(f | e)
Language Model: P(e)
Decoding algorithm: argmax_e P(e) P(f | e)
J'ai si faim → I am so hungry
38
Language Modeling
  • Goal of a language model for MT
  • He is on the soccer field
  • He is in the soccer field
  • Is table the on cup the
  • The cup is on the table
  • American shrine
  • American company

Need to make these decisions, because translation
model may not have a lot of context information!
39
The Classic Language Model: Word Bigrams
  • Process model of English
  • Generate each word based only on the previous
    word.
  • P(I saw water on the table) =
  • P(I | START) ×
  • P(saw | I) ×
  • P(water | saw) ×
  • P(on | water) ×
  • P(the | on) ×
  • P(table | the) ×
  • P(END | table)

Probabilities can be tabulated from an online
English corpus, just like Weaver's Turkish case.
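The bigram factorization on this slide can be sketched in a few lines. This is a minimal illustration, assuming whitespace-tokenized text and plain relative-frequency estimates with no smoothing; the three-sentence corpus and the function names are hypothetical.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(w2 | w1) by relative frequency, with START/END markers."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["START"] + sentence.split() + ["END"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            pair_counts[(w1, w2)] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


def sentence_probability(model, sentence):
    """Multiply bigram probabilities; unseen bigrams get 0 (no smoothing)."""
    tokens = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= model.get((w1, w2), 0.0)
    return prob


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    "I saw water on the table",
    "I saw the table",
    "the cup is on the table",
]
model = train_bigram_model(corpus)
good = sentence_probability(model, "I saw water on the table")
bad = sentence_probability(model, "table the on cup the is")
```

With so little data the scrambled sentence scores exactly zero; a practical model would smooth the counts so that unseen bigrams keep a small probability.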
40
Trigram Language Model
to the said royal purchase plan trustco part
operations of its its is international expand
banking
Soricut Marcu, 05
41
Trigram Language Model
the banking trustco is said to expand its
purchase part of its royal international plan
operations
to the said royal purchase plan trustco part
operations of its its is international expand
banking
Soricut Marcu, 05
42
Trigram Language Model
the banking trustco is said to expand its
purchase part of its royal international plan
operations
to the said royal purchase plan trustco part
operations of its its is international expand
banking
royal trustco said the purchase is part of its
plan to expand its international banking
operations
N-grams have a lot of semantics in them!
Soricut Marcu, 05
43
Trigram Language Model
the banking trustco is said to expand its
purchase part of its royal international plan
operations
to the said royal purchase plan trustco part
operations of its its is international expand
banking
royal trustco said the purchase is part of its
plan to expand its international banking
operations
with the stressed relationship part own
longstanding its its for chinese boeing , ,
for its part, stressed the longstanding
relationship with its own, chinese
boeing boeing, for its part, stressed its own
longstanding relationship with the chinese
Soricut Marcu, 05
44
Translation Model?
Process model of translation
Mary did not slap the green witch
Source-language morphological analysis Source
parse tree Semantic representation Generate
target structure
Maria no dió una bofetada a la bruja verde
45
Translation Model?
Process model of translation
Mary did not slap the green witch
Source-language morphological analysis Source
parse tree Semantic representation Generate
target structure
What are all the possible moves and what
probability tables control those moves?
Maria no dió una bofetada a la bruja verde
46
The Classic Translation Model: Word
Substitution/Permutation [Brown et al., 1993]
Process model of translation
Mary did not slap the green witch
  n(3 | slap): 50k entries
Mary not slap slap slap the green witch
  P-Null: 1 entry
Mary not slap slap slap NULL the green witch
  t(la | the): 25m entries
Maria no dió una bofetada a la verde bruja
  d(j | i): 2500 entries
Maria no dió una bofetada a la bruja verde
Trainable
47
The Classic Translation Model: Word
Substitution/Permutation [Brown et al., 1993]
Process model of translation
Mary did not slap the green witch
  n(3 | slap): 50k entries
  P-Null: 1 entry
  ?
  t(la | the): 25m entries
  d(j | i): 2500 entries
Maria no dió una bofetada a la bruja verde
Still trainable!
48
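Classic Formula for P(f | e)
\[
P(f \mid e) = \sum_{a}
  \binom{m - \phi_0}{\phi_0}\,
  P_{\text{Null}}^{\,m - 2\phi_0}\,
  (1 - P_{\text{Null}})^{\phi_0}\,
  \prod_{i=0}^{l} \phi_i! \cdot \frac{1}{\phi_0!}\,
  \prod_{i=1}^{l} n(\phi_i \mid e_i)\,
  \prod_{j=1}^{m} t(f_j \mid e_{a_j})
  \prod_{j:\, a_j \neq 0} d(j \mid a_j, l, m)
\]
Terms: sum over alignment possibilities a; NULL stuff
(the binomial and P-Null factors); fertility n;
word translation t; re-ordering d.
Set parameter values so the formula assigns the
highest possible probability to observed human
translations. This is a 25m-dimensional search
space.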
49
Unsupervised EM Training
la maison la maison bleue la fleur
the house the blue house the flower
All P(french-word | english-word) equally likely
50
Unsupervised EM Training
la maison la maison bleue la fleur
the house the blue house the flower
"la" and "the" are observed to co-occur
frequently, so P(la | the) is increased.
51
Unsupervised EM Training
la maison la maison bleue la fleur
the house the blue house the flower
"maison" co-occurs with both "the" and "house",
but P(maison | house) can be raised without
limit, to 1.0, while P(maison | the) is limited
because of "la" (pigeonhole principle)
52
Unsupervised EM Training
la maison la maison bleue la fleur
the house the blue house the flower
settling down after another iteration
53
Unsupervised EM Training
la maison la maison bleue la fleur
the house the blue house the flower
  • Inherent hidden structure revealed by EM
    training!
  • "A Statistical MT Tutorial Workbook" (Knight,
    1999). Promises free beer.
  • "The Mathematics of Statistical Machine
    Translation" (Brown et al, 1993)
  • Software: GIZA++
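The EM walk-through on the preceding slides can be reproduced on the same three sentence pairs. The sketch below is a simplified word-translation-only trainer in the style of IBM Model 1 (no fertility, NULL word, or distortion, unlike the full classic model); the function name `train_model1` and the iteration count are assumptions made for illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=20):
    """EM for word-translation probabilities t(f | e), IBM Model 1 style."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start with all t(f | e) equally likely.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links involving e
        for fs, es in pairs:
            for f in fs:
                # E-step: share one unit of credit for f among the e words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate t(f | e) from the expected counts.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_model1(corpus)
```

After training, `t[("la", "the")]` dominates the other candidates for "the", and `t[("maison", "house")]` exceeds `t[("maison", "the")]`, which mirrors the behaviour described on the slides.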

54
Sample Translation Probabilities
Translation Model [Brown et al 93]
e          f              P(f | e)
national   nationale      0.47
national   national       0.42
national   nationaux      0.05
national   nationales     0.03
the        le             0.50
the        la             0.21
the        les            0.16
the        l'             0.09
the        ce             0.02
the        cette          0.01
farmers    agriculteurs   0.44
farmers    les            0.42
farmers    cultivateurs   0.05
farmers    producteurs    0.02
57
Language Model
w1     w2        P(w2 | w1)
of     the       0.13
of     a         0.09
of     another   0.01
of     some      0.01
hong   kong      0.98
hong   said      0.01
hong   stated    0.01
Translation Model
e          f              P(f | e)
national   nationale      0.47
national   national       0.42
national   nationaux      0.05
national   nationales     0.03
the        le             0.50
the        la             0.21
the        les            0.16
the        l'             0.09
the        ce             0.02
the        cette          0.01
farmers    agriculteurs   0.44
farmers    les            0.42
farmers    cultivateurs   0.05
farmers    producteurs    0.02
new French sentence f, potential translation e
Translation model gives P(f | e); language model gives P(e)
P(f | e) × P(e) → score for e
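To make the scoring step concrete, here is a small sketch that ranks two candidate translations by P(e) × P(f | e). The translation probabilities are copied from the table on this slide; the bigram values in `LM` are hypothetical, and the one-to-one word alignment is a simplifying assumption.

```python
import math

# Translation probabilities P(f | e), copied from the slide's table.
TM = {
    ("les", "the"): 0.16,
    ("les", "farmers"): 0.42,
    ("agriculteurs", "farmers"): 0.44,
}

# Bigram probabilities P(w2 | w1): hypothetical values, for illustration only.
LM = {
    ("START", "the"): 0.2,
    ("the", "farmers"): 0.01,
    ("START", "farmers"): 0.001,
    ("farmers", "farmers"): 0.0001,
}


def log_p_e(e_words):
    """log P(e) under the bigram table (END marker omitted for brevity)."""
    tokens = ["START"] + e_words
    return sum(math.log(LM[(w1, w2)]) for w1, w2 in zip(tokens, tokens[1:]))


def log_p_f_given_e(f_words, e_words):
    """log P(f | e), assuming word i of f is generated by word i of e."""
    return sum(math.log(TM[(f, e)]) for f, e in zip(f_words, e_words))


def score(f_words, e_words):
    """Log of P(e) x P(f | e)."""
    return log_p_e(e_words) + log_p_f_given_e(f_words, e_words)


f = ["les", "agriculteurs"]
candidates = [["the", "farmers"], ["farmers", "farmers"]]
best = max(candidates, key=lambda e: score(f, e))
```

On its own the translation table slightly prefers "farmers farmers", because P(les | farmers) is 0.42; under these assumed bigram values the language model term is what tips the choice to "the farmers".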
58
Search for Best Translation
voulez vous vous taire !
59
Search for Best Translation
voulez vous vous taire !
you you you quiet !
60
Search for Best Translation
voulez vous vous taire !
you you quiet !
61
Search for Best Translation
voulez vous vous taire !
quiet you you you !
62
Search for Best Translation
voulez vous vous taire !
shut you you you !
63
Search for Best Translation
voulez vous vous taire !
you shut !
64
Search for Best Translation
voulez vous vous taire !
you shut up !
65
Classic Decoding Algorithm
  • Given f, find the English string e that maximizes
    P(e) × P(f | e)
  • NP-complete [Knight 99].
  • [Brown et al 93]:
  • "In this paper, we focus on the translation
    modeling problem. We hope to deal with the
    decoding problem in a later paper."

67
Beam Search Decoding [Brown et al US Patent
5,477,451]
start → 1st English word → 2nd English word →
3rd English word → 4th English word → end
(all source words covered)
best predecessor link
Each partial translation hypothesis contains:
- Last English word chosen, and the source words
  covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source
  sentence
- Language model and translation model
  scores (so far)
[Jelinek 69; Och, Ueffing, and Ney, 01]
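The hypothesis bookkeeping described above can be illustrated with a toy stack decoder. This is a heavily simplified sketch, not the patented algorithm: it assumes every source word produces exactly one English word, skips hypothesis recombination, and uses hypothetical probability tables and a made-up floor value for unseen bigrams.

```python
import heapq
import math

FLOOR = 1e-6  # probability assigned to unseen bigrams (an assumption)


def beam_search_decode(src_words, tm, lm, beam_size=5):
    """Toy decoder: each source word yields exactly one English word.

    tm maps a source word to {english_word: P(f | e)};
    lm maps (w1, w2) to P(w2 | w1).
    """
    n = len(src_words)
    # A hypothesis is (log score, English words so far, covered positions).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, ("<s>",), frozenset()))
    for k in range(n):
        # Prune: expand only the best few hypotheses covering k source words.
        beam = heapq.nlargest(beam_size, stacks[k], key=lambda h: h[0])
        for score, words, covered in beam:
            for j, f in enumerate(src_words):
                if j in covered:
                    continue
                for e, p_tm in tm.get(f, {}).items():
                    p_lm = lm.get((words[-1], e), FLOOR)
                    new_score = score + math.log(p_tm) + math.log(p_lm)
                    stacks[k + 1].append((new_score, words + (e,), covered | {j}))
    # Complete hypotheses cover all source words; add the end-of-sentence bigram.
    finals = [
        (score + math.log(lm.get((words[-1], "</s>"), FLOOR)), words[1:])
        for score, words, _ in stacks[n]
    ]
    return list(max(finals)[1]) if finals else []


# Hypothetical tables for a three-word example.
tm = {"la": {"the": 0.9}, "maison": {"house": 0.8}, "bleue": {"blue": 0.8}}
lm = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.3, ("blue", "house"): 0.4,
    ("house", "</s>"): 0.5, ("the", "house"): 0.3, ("house", "blue"): 0.01,
    ("blue", "</s>"): 0.1,
}
result = beam_search_decode("la maison bleue".split(), tm, lm)
```

Because source words may be covered in any order, the language model is free to pick the English word order "the blue house" even though the French order is "la maison bleue".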
68
Classic Results
  • nous avons signé le protocole . (Foreign
    Original)
  • we did sign the memorandum of agreement . (Human
    Translation)
  • we have signed the protocol . (MT)
  • où était le plan solide ? (Foreign Original)
  • but where was the solid plan ? (Human
    Translation)
  • where was the economic base ? (MT)

the Ministry of Foreign Trade and Economic
Cooperation, including foreign direct investment
40 billion US dollars today provide data
include that year to November china actually
using foreign 46.959 billion US dollars and
very slow one page per day
69
Okay!
  • I know, so far, this talk should be called
  • "What's Old in Statistical
  • Machine Translation"!!

70
Further Developments
  • Follow-on projects
  • Hong Kong
  • Aachen
  • Behavior Design Corporation
  • JHU Summer Workshop 1999
  • Build & distribute statistical MT tools
  • Create standard training & testing data
  • Disseminate tutorial material
  • MT in a Day
  • Ask new questions

71
How Much Data Do We Need?
[Plot: quality of automatically trained machine
translation system (y-axis) vs. amount of
bilingual training data (x-axis)]
72
Advances in Statistical MT: 2000-2004
73
Ready-to-Use Online Bilingual Data
Millions of words (English side)
(Data stripped of formatting, in sentence-pair
format, available from the Linguistic Data
Consortium at UPenn).
74
Ready-to-Use Online Bilingual Data
Millions of words (English side)
(Data stripped of formatting, in sentence-pair
format, available from the Linguistic Data
Consortium at UPenn).
European parliament data Koehn 05
75
BLEU Evaluation Metric (Papineni et al 02)
Reference (human) translation: The U.S. island
of Guam is maintaining a high state of alert
after the Guam airport and its offices both
received an e-mail from someone calling himself
the Saudi Arabian Osama bin Laden and threatening
a biological/chemical attack against public
places such as the airport .
  • N-gram precision (score is between 0 and 1)
  • What percentage of machine n-grams can be found
    in the reference translation?
  • Gross measure over 1000 test sentences.
  • Not allowed to use the same portion of the reference
    translation twice (can't cheat by typing out "the
    the the the the")
  • Brevity penalty: can't just type out the single word
    "the" (and get precision 1.0)

Machine translation: The American ?
international airport and its the office all
receives one calls self the sand Arab rich
business ? and so on electronic mail , which
sends out The threat will be able after public
place and so on the airport to start the
biochemistry attack , ? highly alerts after the
maintenance.
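The two safeguards named on this slide, clipped n-gram counts and the brevity penalty, can be sketched as follows. This is a simplified single-sentence, single-reference version written for illustration; the metric as published aggregates counts over a whole test set and supports multiple references, and the function names here are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(cand_len, ref_len):
    """Penalize candidates shorter than the reference."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)


def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the gunman was shot to death by the police .".split()
cheat = ["the"] * 5
```

Typing "the" five times earns a unigram precision of only 2/5, because the reference contains "the" twice, and typing it once is crushed by the brevity penalty.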
77
BLEU in Action
???????? (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
 1. the gunman was police kill .
 2. wounded police jaya of
 3. the gunman was shot dead by the police .
 4. the gunman arrested by police kill .
 5. the gunmen were killed .
 6. the gunman was shot to death by the police .
 7. gunmen were killed by police ?SUB>0 ?SUB>0
 8. al by the police .
 9. the ringer is killed by the police .
10. police killed the gunman .
green = 4-gram match (good!); red = word not
matched (bad!)
78
BLEU Tends to Predict Human Judgments
slide from G. Doddington (NIST)
79
Experiment-Driven Progress
Evaluate new MT research ideas every day! (and be
alerted about bugs)
[Plot: BLEU score (axis 15 to 35) by date, Mar 1 to
May 1, 2005. ISI Syntax-Based MT, Chinese/English,
NIST 2002 Test Set]
80
Draw Learning Curves
[Plot: BLEU score (y-axis) vs. number of sentence
pairs used in training (x-axis), for Swedish/English,
French/English, German/English, Finnish/English]
Experiments by Philipp Koehn
81
Flaws of Word-Based MT
  • Can't translate multiple English words to one
    French word
  • Can't translate phrases
  • "real estate", "note that", "interest in"
  • Isn't sensitive to syntax
  • Adjectives/nouns should swap order
  • Verb comes at the beginning in Arabic
  • Doesn't understand the meaning (?)

82
The MT Triangle
interlingua
logical form
logical form
syntax
syntax
words
words
SOURCE
TARGET
83
The MT Swimming Pool
interlingua
logical form
logical form
syntax
syntax
words
words
84
Commercial Rule-Based Systems
interlingua
logical form
logical form
syntax
syntax
words
words
SOURCE
TARGET
85
Knight et al 95 - meaning-based translation
- composition rules
interlingua
logical form
logical form
syntax
syntax
Language Model
words
words
SOURCE
TARGET
86
Wu 97, Alshawi 98 - inducing syntactic
structure as a by-product of aligning
words in bilingual text
interlingua
logical form
logical form
syntax
syntax
Language Model
words
words
SOURCE
TARGET
87
Yamada/Knight (01,02) - tree/string model
- used existing target language parser
interlingua
logical form
logical form
syntax
syntax
Language Model
words
words
SOURCE
TARGET
88
  • Well, these all seem like good ideas.
  • Which one had the most dramatic effect on MT
    quality?
  • None of them!

89
Phrases
How do you translate real estate into French?
real estate real number dance number
dance card memory card memory stick

interlingua
logical form
logical form
syntax
syntax
phrases
phrases
words
words
SOURCE
TARGET
90
Phrase-Based Statistical MT
Morgen | fliege | ich | nach Kanada | zur Konferenz
Tomorrow | I | will fly | to the conference | in Canada
  • Foreign input segmented into phrases
  • "phrase" just means "word sequence"
  • Each phrase is probabilistically translated into
    English
  • P(to the conference | zur Konferenz)
  • P(into the meeting | zur Konferenz)
  • Phrases are probabilistically re-ordered
  • See [Koehn et al, 2003] for an overview.

91
How to Learn the Phrase Translation Table?
  • One method: "alignment templates" [Och et al 99]
  • Start with word alignment
  • Collect all phrase pairs that are consistent with
    the word alignment

92
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
93
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una
bofetada) (la, the) (bruja, witch) (verde, green)
94
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una
bofetada) (la, the) (bruja, witch) (verde,
green) (a la, the) (dió una bofetada a, slap)
(bruja verde, green witch)
95
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una
bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja
verde, green witch) (Maria no, Mary did not) (no
dió una bofetada, did not slap), (dió una
bofetada a la, slap the)
96
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una
bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (Maria
no, Mary did not) (no dió una bofetada, did not
slap), (dió una bofetada a la, slap the) (bruja
verde, green witch) (Maria no dió una bofetada,
Mary did not slap) (a la bruja verde, the green
witch)
97
Word Alignment Induced Phrases
Maria no dió una bofetada a
la bruja verde







Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una
bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (Maria
no, Mary did not) (no dió una bofetada, did not
slap), (dió una bofetada a la, slap the) (bruja
verde, green witch) (Maria no dió una bofetada,
Mary did not slap) (a la bruja verde, the green
witch) (Maria no dió una bofetada a la bruja
verde, Mary did not slap the green witch)
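The extraction rule (collect every phrase pair consistent with the word alignment) can be sketched by brute force over all span pairs. The alignment grid itself did not survive in this transcript, so the `links` set below is an assumption chosen to agree with the phrase pairs listed on these slides: "no" links to both "did" and "not", and "a" is unaligned.

```python
def extract_phrase_pairs(src, tgt, links):
    """Collect all phrase pairs consistent with a word alignment.

    links is a set of (source_index, target_index) pairs. A pair of spans is
    consistent if it contains at least one link and no link connects a word
    inside one span to a word outside the other.
    """
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            for t1 in range(len(tgt)):
                for t2 in range(t1, len(tgt)):
                    inside = [
                        (s, t) for s, t in links
                        if s1 <= s <= s2 and t1 <= t <= t2
                    ]
                    crossing = [
                        (s, t) for s, t in links
                        if (s1 <= s <= s2) != (t1 <= t <= t2)
                    ]
                    if inside and not crossing:
                        pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# Assumed alignment (source index, target index); "a" (index 5) is unaligned.
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}
pairs = extract_phrase_pairs(src, tgt, links)
```

The exhaustive loop is quartic in sentence length, which is fine for a single short sentence; it also returns consistent pairs beyond those listed on the slides, such as ("la bruja verde", "the green witch").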
98
Phrase Pair Probabilities
  • A certain phrase pair (f-f-f, e-e-e) may appear
    many times across the bilingual corpus.
  • No EM training
  • Just relative frequency:

\[
P(\text{f-f-f} \mid \text{e-e-e}) = \frac{\mathrm{count}(\text{f-f-f},\ \text{e-e-e})}{\mathrm{count}(\text{e-e-e})}
\]
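The relative-frequency estimate amounts to two counters. In the sketch below, the list of observed phrase pairs is hypothetical and the function name is an assumption.

```python
from collections import Counter


def phrase_probabilities(observed_pairs):
    """P(f-phrase | e-phrase) = count(f-phrase, e-phrase) / count(e-phrase)."""
    pair_counts = Counter(observed_pairs)
    e_counts = Counter(e for _, e in observed_pairs)
    return {(f, e): n / e_counts[e] for (f, e), n in pair_counts.items()}


# Hypothetical extracted phrase pairs, listed once per occurrence in a corpus.
observed = (
    [("zur Konferenz", "to the conference")] * 3
    + [("zu der Konferenz", "to the conference")]
    + [("zur Konferenz", "into the meeting")]
)
probs = phrase_probabilities(observed)
```

Here `probs[("zur Konferenz", "to the conference")]` is 3/4, since "to the conference" was seen four times and three of those were paired with "zur Konferenz".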

99
Phrase-Based MT
  • This is currently the best way to do Statistical
    MT!
  • What took so long to move from words to phrases?
  • Missing RAM
  • 25m parameters → billions of parameters
  • Trick idea: build a test-corpus-specific phrase
    table (takes 5 hours!)
  • Now solved in commercial deployments
  • Missing computing power
  • Many competing ideas to shake out
  • [Koehn 03] summarizes several variations
  • Empirical effectiveness even better than
    intuition would predict
  • This is not "building a ladder to the moon"!
  • If you can't translate "real estate" into French,
    you are sunk

100
Advanced Training Methods
\[
\arg\max_e P(e \mid f)
  = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)}
  = \arg\max_e P(e) \times P(f \mid e)
\]

101
Advanced Training Methods
\[
\arg\max_e P(e \mid f)
  = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)}
\]
\[
\arg\max_e P(e)^{2.4} \times P(f \mid e) \quad \text{works better!}
\]

102
Advanced Training Methods
\[
\arg\max_e P(e \mid f)
  = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)}
\]
\[
\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1}
\]

The length term rewards longer hypotheses, since these are
unfairly punished by P(e).
103
Advanced Training Methods
\[
\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1} \times \mathrm{FEAT}^{3.7}
\]

Lots of features vote on every potential
translation. Exponential model.
Problem: how to set the exponent weights?
IDEA 1: maximize probability of the data.
IDEA 2: maximize BLEU score of MT system.
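In log space the weighted product becomes a weighted sum, which is how such a score is usually computed. The sketch below uses the exponents 2.4, 1.0 and 1.1 from the slides; the feature values and the function name are hypothetical.

```python
import math


def loglinear_score(features, weights):
    """Sum of weight * log(feature value): the log of the weighted product."""
    return sum(weights[name] * math.log(value) for name, value in features.items())


# Exponents taken from the slides; the feature values are hypothetical.
weights = {"lm": 2.4, "tm": 1.0, "length": 1.1}
hypothesis = {"lm": 1e-4, "tm": 1e-3, "length": 3}
score = loglinear_score(hypothesis, weights)
```

Adding another voting feature means adding one more entry to both dictionaries, which is what makes this form convenient once the weights are tuned automatically.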
104
20.64 BLEU
17.96 BLEU
WTM fixed at 1.0
plot by Emil Ettelaie
105
Maximum BLEU Training
  • Novel algorithm developed by Och 03
  • Opened gates to feature hacking
  • Word-based feature to smooth phrase pair counts
    (Model1 Inverse)
  • Phrase-specific propensities to re-order
  • Currently limited to 25 features

106
Advances in Statistical MT: 2005
107
Google's Language Model
  • Previously, largest language model was trained on
    1b words of English
  • 20b words of news → significant impact on news
    translation
  • 200b words of web → helpful

108
Maryland's Hiero system [Chiang 05]
  • Previously:
  • ne mange pas → does not eat
  • New: phrase pairs with variables and reordering
  • ne X pas → does not X
  • le X1 du X2 → X2 's X1
  • Nesting
  • "does not X" itself becomes an X
  • CKY decoder

John Cocke
109
ISI's Syntax-Based MT System
  • First strong showing for an SMT system that knows
    what nouns and verbs are!
  • Why syntax?
  • Frequent high-tech exports are bright spots for
    foreign trade growth of Guangdong has made
    important contributions.
  • Need much more grammatical output
  • Need accurate control over re-ordering
  • Need accurate insertion of function words

110
String Output
  • The gunman killed by police .

?
.
??
??
??
111
Tree Output
  • The gunman killed by police .
  • DT NN VBD IN NN
  • NPB PP
  • NP-C VP
  • S

?
.
??
??
??
112
Tree Output
  • Gunman by police shot .
  • NN IN NN VBD
  • NPB PP
  • NP-C VP
  • S

?
.
??
??
??
113
Tree Output
  • The gunman was killed by police .
  • DT NN AUX VBN IN NN
  • NPB PP
  • NP-C VP
  • S

?
.
??
??
??
114
Sample Rules Learned from Data
  • ? "?" "," x0 0.57
  • ? "?" x0 0.09
  • ? "?" "?" "," x0 0.02
  • ? "??" "," x0 0.02
  • ? x0 0.02

VP
VB
SBAR
IN
x0S
said
that
NP
? x1 x0 0.27 ? "??" x1 x0 0.15 ? x1 "?" x0
0.06 ? "?" x1 x0 0.06 ? "??" x1 "?" x0 0.06
x0NP
PP
IN
x1NP
from
115
Sample Rules Learned from Data
S
(Chinese/ English)
  • ? x0 x1 x2 0.82
  • ? x0 x1 "," x2 0.02
  • x0 x1 x2 0.54
  • x1 x0 x2 0.44

x0NP
VP
x1VB
x2NP
S
(Arabic/ English)
x0NP
VP
x1VB
x2NP
subject-verb inversion
116
Format is Expressive
Non-constituent Phrases
Phrasal Translation
Non-contiguous Phrases
S
VP
VP
poner, x0
hay, x0
está, cantando
PRO
VP
VB
x0NP
PRT
VBZ
VBG
VB
x0NP
there
on
is
singing
put
are
Multilevel Re-Ordering
Lexicalized Re-Ordering
Context-Sensitive Word Insertion
NP
S
NPB
x0
x0NP
PP
x1, , x0
x1, x0, x2
x0NP
VP
DT
x0NNS
P
x1NP
x1VB
x2NP2
the
of
Knight Graehl, 2005
117
Story Gets More Interesting
MT
Applications
Automata Theory
Tree Transducers (Rounds 70)
118
Story Gets More Interesting
Transformational Grammar (Chomsky 57)
MT
Applications
Linguistic Theory
Automata Theory
Tree Transducers (Rounds 70)
119
Story Gets More Interesting
Transformational Grammar (Chomsky 57)
MT (05)
Compression (01)
QA (03)
Applications
Linguistic Theory
Generation (00)
Automata Theory
Tree Transducers (Rounds 70)
120
Story Gets More Interesting
Transformational Grammar (Chomsky 57)
MT (05)
Compression (01)
QA (03)
Applications
Linguistic Theory
Generation (00)
Algorithms
Automata Theory
Efficient Transducer Algorithms
Tree Transducers (Rounds 70)
Generic Tree Toolkits
121
Summary
  • Making good progress
  • Algorithms Data Evaluation Computers
  • Interdisciplinary work
  • Natural language processing
  • Machine learning
  • Linguistics
  • Automata theory
  • Hope that more people will join!

122
  • Thank you

123
Syntax-Based vs Phrase-Based
[Plot: BLEU score (axis 15 to 35) by date, Mar 1 to
May 1, 2005, with the phrase-based system marked for
comparison. Chinese/English, NIST 2002 Test Set]
124
Future PhD Theses?
  • Syntax-based Language Models for Improving
    Statistical MT
  • Discriminative Training of Millions of Features
    for MT
  • Semantic Representations Induced from
    Multilingual EU and UN Data
  • What Makes One Language Pair More Difficult to
    Translate Than Another
  • A State-of-the-Art MT System Based on Syntactic
    Transformations
  • New Training Methods for High-Quality Word
    Alignment
  • many unpredictable ones

125
Summary
  • Phrase-based models are state-of-the-art
  • Word alignments
  • Phrase pair extraction probabilities
  • N-gram language models
  • Beam search decoding
  • Feature functions learning weights
  • But the output is not English
  • Fluency must be improved
  • Better translation of person names,
    organizations, locations
  • More automatic acquisition of parallel data,
    exploitation of monolingual data across a variety
    of domains/languages
  • Need good accuracy across a variety of
    domains/languages

126
Available Resources
  • Bilingual corpora
  • 100m words of Chinese/English and
    Arabic/English, LDC (www.ldc.upenn.edu)
  • Lots of French/English, Spanish/French/English,
    LDC
  • European Parliament (sentence-aligned), 11
    languages, Philipp Koehn, ISI
  • (www.isi.edu/koehn/publications/europarl)
  • 20m words (sentence-aligned) of English/French,
    Ulrich Germann, ISI
  • (www.isi.edu/natural-language/download/hansard/)
  • Sentence alignment
  • Dan Melamed, NYU
    (www.cs.nyu.edu/melamed/GMA/docs/README.htm)
  • Xiaoyi Ma, LDC (Champollion)
  • Word alignment
  • GIZA, JHU Workshop 99
    (www.clsp.jhu.edu/ws99/projects/mt/)
  • GIZA++, RWTH Aachen
    (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA.html)
  • Manually word-aligned test corpus (500
    French/English sentence pairs), RWTH Aachen
  • Shared task, NAACL-HLT03 workshop
  • Decoding
  • ISI ReWrite Model 4 decoder
    (www.isi.edu/licensed-sw/rewrite-decoder/)
  • ISI Pharaoh phrase-based decoder
  • Statistical MT Tutorial Workbook, ISI
    (www.isi.edu/knight/)

127
Some Papers Referenced on Slides
  • ACL: Och, Tillmann & Ney, 1999; Och & Ney, 2000;
    Germann et al, 2001; Yamada & Knight, 2001, 2002;
    Papineni et al, 2002; Alshawi et al, 1998;
    Collins, 1997; Koehn & Knight, 2003;
    Al-Onaizan & Knight, 2002; Och & Ney, 2002;
    Och, 2003; Koehn et al, 2003
  • EMNLP: Marcu & Wong, 2002; Fox, 2002;
    Munteanu & Marcu, 2002
  • AI Magazine: Knight, 1997
  • AMTA: Soricut et al, 2002; Al-Onaizan & Knight, 1998
  • EACL: Cmejrek et al, 2003
  • Computational Linguistics: Brown et al, 1993;
    Knight, 1999; Wu, 1997
  • AAAI: Koehn & Knight, 2000
  • IWNLG: Habash, 2002
  • MT Summit: Charniak, Knight & Yamada, 2003
  • NAACL: Koehn, Marcu & Och, 2003; Germann, 2003;
    Graehl & Knight, 2004

128
Ready-to-Use Online Bilingual Data
Millions of words (English side)
(Data stripped of formatting, in sentence-pair
format, available from the Linguistic Data
Consortium at UPenn).
129
Ready-to-Use Online Bilingual Data
Millions of words (English side)
1m-20m words for many language pairs
(Data stripped of formatting, in sentence-pair
format, available from the Linguistic Data
Consortium at UPenn).
130
Ready-to-Use Online Bilingual Data
???
Millions of words (English side)
? One Billion?
131
From No Data to Sentence Pairs
  • Easy way: Linguistic Data Consortium (LDC)
  • Really hard way: pay
  • Suppose one billion words of parallel data were
    sufficient
  • At 20 cents/word, that's $200 million
  • Pretty hard way: find it, and then earn it!
  • De-formatting
  • Remove strange characters
  • Character code conversion
  • Document alignment
  • Sentence alignment
  • Tokenization (also called Segmentation)

132
Sentence Alignment
  • The old man is happy. He has fished many times.
    His wife talks to him. The fish are jumping.
    The sharks await.

El viejo está feliz porque ha pescado muchos
veces. Su mujer habla con él. Los tiburones
esperan.
133
Sentence Alignment
  1. The old man is happy.
  2. He has fished many times.
  3. His wife talks to him.
  4. The fish are jumping.
  5. The sharks await.
  1. El viejo está feliz porque ha pescado muchos
    veces.
  2. Su mujer habla con él.
  3. Los tiburones esperan.

134
Sentence Alignment
  1. The old man is happy.
  2. He has fished many times.
  3. His wife talks to him.
  4. The fish are jumping.
  5. The sharks await.
  1. El viejo está feliz porque ha pescado muchos
    veces.
  2. Su mujer habla con él.
  3. Los tiburones esperan.

135
Sentence Alignment
  1. The old man is happy. He has fished many times.
  2. His wife talks to him.
  3. The sharks await.
  1. El viejo está feliz porque ha pescado muchos
    veces.
  2. Su mujer habla con él.
  3. Los tiburones esperan.

Note that unaligned sentences are thrown out,
and sentences are merged in n-to-m alignments
(n, m > 0).
136
Tokenization (or Segmentation)
  • English
  • Input (some byte stream)
  • "There," said Bob.
  • Output (7 tokens or words)
  • " There , " said Bob .
  • Chinese
  • Input (byte stream)
  • Output

??????????????????????????????????????
?? ??? ?? ? ?? ?? ???? ?? ?? ?? ?? ??
??? ?? ? ? ?????
137
Lower-Casing
  • English
  • Input (7 words)
  • " There , " said Bob .
  • Output (7 words)
  • " there , " said bob .

Idea of tokenizing and lower-casing: surface
variants (The, the, ...) collapse to the single
token "the".
Smaller vocabulary size. More robust counting and
learning.
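The English tokenization and lower-casing steps from these two slides can be approximated with a regular expression. This is a minimal sketch for English-like text only; Chinese segmentation needs a different method, and the function names are assumptions.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lowercase(tokens):
    """Lower-case every token, shrinking the vocabulary."""
    return [tok.lower() for tok in tokens]


tokens = tokenize('"There," said Bob.')
lowered = lowercase(tokens)
```

On the slide's example this yields the same 7 tokens, and `" ".join(lowered)` gives `" there , " said bob .`.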
138
Recent Progress in Statistical MT
  • Why is that?
  • Better algorithms that learn patterns from data
  • More data
  • Faster, cheaper computers with more RAM
  • Community-wide test sets
  • Novel automated evaluation methods
  • Shared software tools

139
Three Problems for Statistical MT
  • Translation model
  • Given a pair of strings <f,e>, assigns P(f | e)
    by formula
  • <f,e> look like translations → high P(f | e)
  • <f,e> don't look like translations → low P(f | e)
  • Language model
  • Given an English string e, assigns P(e) by
    formula
  • good English string → high P(e)
  • random word sequence → low P(e)
  • Decoding algorithm
  • Given a language model, a translation model, and
    a new sentence f, find translation e maximizing
    P(e) × P(f | e)

140
Web Language Models
She has a lot
of nerve. 20 French input
It has a lot of nerve. 3
Soricut, Knight, Marcu, 02
Used by Google in 2005 to increase performance of
their research MT system!