Corpora and Translation - PowerPoint PPT Presentation

Slides: 39

Provided by: somers6

Transcript and Presenter's Notes

Title: Corpora and Translation

1
Corpora and Translation

Parallel corpora
Statistical MT
(not to mention Corpus of translated text, for
translation studies)

2
Parallel corpora

Corpora of texts and their translations
Basic idea that such parallel corpora implicitly
contain lots of information about translation
equivalence
Nowadays many such bitexts are available
bilingual countries have laws, parliamentary
proceedings, and other documents
large multinational organizations (UN, EU: Europarl corpus, etc.)
multinational commercial organizations produce
multilingual texts

3
Bilingual concordance
Source: TransSearch, Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal, http://www-rali.iro.umontreal.ca
4
Parallel corpora

Usually not corpora in the strict sense (planned,
annotated, etc.)
Usefulness may depend on
the quality of translation
the closeness of translation
whether we have a text and its translation, or a
multilingually authored text
the language pair
Parallel corpus needs to be aligned

5
Alignment

Means annotating the bilingual corpus to show
explicitly the correspondences
at sentence level
at word and phrase level
Main difficulty for sentence alignment is that
translations do not always keep sentence
boundaries, or even sentence order
In addition, translation may be localized and
therefore not especially faithful

6
Sentence-level alignment

If parallel corpus is quite a literal
translation, this can be done using quite
low-level information
sentence length
looking for anchors
proper names, dates, figures
e.g. in a parliamentary debate, speakers' names
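
As a rough illustration of alignment by sentence length alone, the sketch below uses dynamic programming over 1-1, 2-1, 1-2, 1-0 and 0-1 groupings. The cost function (absolute difference in character length plus a fixed penalty for anything other than 1-1) and the penalty values are assumptions made for this example. It is not the probabilistic cost used by published length-based aligners, and the function name `align_by_length` is hypothetical.

```python
def align_by_length(src, tgt):
    """Align two lists of sentences using character lengths only."""
    # (source sentences consumed, target sentences consumed) -> assumed penalty
    beads = {(1, 1): 0, (2, 1): 10, (1, 2): 10, (1, 0): 30, (0, 1): 30}
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the cheapest path.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return result[::-1]


english = ["I arrived late.", "The train was delayed.", "Nobody was surprised."]
french = ["Je suis arrivé en retard parce que le train était retardé.",
          "Personne n'était surpris."]
for group in align_by_length(english, french):
    print(group)
```

Here the translator merged two English sentences into one French sentence, and the length-based cost alone is enough to group them. Anchors such as names, dates and figures could be added as a bonus in the cost, which this sketch leaves out.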

7
Alignment tools
8
Corpus-based MT

Translation memory (tool for translators)
database of previous translations
find close matching examples to current
translation unit
translator decides what to do with it
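
A minimal sketch of the lookup step in a translation memory, assuming the memory is just a list of (source, target) pairs and that similarity is measured on word tokens with Python's standard `difflib.SequenceMatcher`. Commercial tools use their own matching measures; the function name `best_match` and the example sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def best_match(segment, memory):
    """Return (stored source, stored target, similarity) for the closest entry."""
    query = segment.lower().split()

    def similarity(entry):
        stored_source = entry[0].lower().split()
        return SequenceMatcher(None, query, stored_source).ratio()

    source, target = max(memory, key=similarity)
    return source, target, similarity((source, target))


memory = [
    ("Press the red button to start the engine.",
     "Appuyez sur le bouton rouge pour démarrer le moteur."),
    ("Close the door before leaving.",
     "Fermez la porte avant de partir."),
]
print(best_match("Press the green button to start the engine.", memory))
```

The tool only proposes the stored pair and a score. Deciding which part of the French target must change (here, the colour word) is left to the translator.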

9
Note that translator has to know/decide what bits
of the target sentence to change
10
Corpus-based MT

Example-based translation
similar idea, but computer program tries to
manipulate example(s)
may involve learning general rules from
multiple examples

11
Statistical MT

Pioneered by IBM in the early 1990s
Spurred on by the greater success, in speech recognition, of statistical approaches over linguistic rule-based ones
Idea that translation can be modelled as a
statistical process
Seems to work best in limited domain where given
data is a good model of future translations

12
Translation as a probabilistic problem

For a given SL sentence $S_i$, there is an unbounded number of possible translations $T$ of varying probability
Task is to find for $S_i$ the sentence $T_j$ for which the probability $P(T_j \mid S_i)$ is the highest

13
Two models

$P(T_j \mid S_i)$ is a function of two models
The probabilities of the individual words that make up $T_j$ given the individual words in $S_i$: the translation model
The probability that the individual words that make up $T_j$ are in the appropriate order: the language model

14
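Expressed in mathematical terms

$$\hat{T} = \arg\max_{T_j} P(T_j \mid S_i) = \arg\max_{T_j} \frac{P(S_i \mid T_j)\, P(T_j)}{P(S_i)}$$

Since S is a given, and constant, this can be simplified as

$$\hat{T} = \arg\max_{T_j} P(S_i \mid T_j)\, P(T_j)$$

Translation model: $P(S_i \mid T_j)$
Language model: $P(T_j)$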
15
So how do we translate?

For a given input sentence Si we have to have a
practical way to find the Tj that maximizes the
formula
We have to start somewhere, so we start with the translation model: which words look most likely to help us?
In a systematic way we can keep trying different
combinations together with the language model
until we stop getting improvements
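
The search described above can be sketched as greedy hill-climbing. This is one simple reading of the slide, not the algorithm of any particular decoder. Every probability in the toy tables `TM` and `LM`, the `FLOOR` value for unseen bigrams, and the function names are invented for this example.

```python
import math

# Toy translation-model scores, indexed by source word (all numbers invented).
TM = {
    "la": {"the": 0.9, "it": 0.1},
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 1.0},
}
# Toy bigram language model P(word | previous word); "<s>" marks sentence start.
LM = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.2, ("blue", "house"): 0.3,
    ("the", "house"): 0.3, ("house", "blue"): 0.01,
}
FLOOR = 1e-4  # assumed probability for any bigram missing from LM


def score(hyp):
    """Log of (translation model x language model) for a list of (src, tgt) pairs."""
    total = sum(math.log(TM[s][t]) for s, t in hyp)
    prev = "<s>"
    for _, t in hyp:
        total += math.log(LM.get((prev, t), FLOOR))
        prev = t
    return total


def neighbours(hyp):
    """Hypotheses one step away: swap two adjacent words or change one word choice."""
    for i in range(len(hyp) - 1):
        yield hyp[:i] + [hyp[i + 1], hyp[i]] + hyp[i + 2:]
    for i, (s, t) in enumerate(hyp):
        for alt in TM[s]:
            if alt != t:
                yield hyp[:i] + [(s, alt)] + hyp[i + 1:]


def decode(source):
    # Start from the most likely word for each source word, in source order.
    hyp = [(s, max(TM[s], key=TM[s].get)) for s in source]
    best = score(hyp)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(hyp):
            cand_score = score(cand)
            if cand_score > best:
                hyp, best, improved = cand, cand_score, True
                break
    return [t for _, t in hyp]


print(decode(["la", "maison", "bleue"]))
```

With these toy numbers the search starts from "the house blue" and the language model pushes it to "the blue house", after which no single swap or substitution improves the score. Hill-climbing of this kind can stop at a local optimum.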

16
Seek improvement by trying other combinations
17
Where do the models come from?

All the statistical parameters are pre-computed
(learned), based on a parallel corpus
Language model is probabilities of word sequences
(n-grams)
Translation model is derived from aligned
parallel corpus
This approach is attractive to some as an example
of machine learning
The computer learns to translate (just) from
seeing previous examples of translation

18
The translation model

Take sentence-aligned parallel corpus
Extract entire vocabulary for both languages
For every word-pair, calculate probability that
they correspond e.g. by comparing distributions
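
One well-known way to turn co-occurrence in aligned sentences into word-pair probabilities is expectation-maximization in the style of IBM Model 1. The slide does not say which method is meant, so the sketch below is an assumption. The three-sentence corpus and the function name `train_model1` are invented for illustration.

```python
from collections import defaultdict


def train_model1(corpus, iterations=30):
    """corpus: list of (source_tokens, target_tokens).

    Returns a table t[(s, w)] approximating P(source word s | target word w).
    """
    src_vocab = {s for src, _ in corpus for s in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in corpus:
            for s in src:
                # Share one unit of "credit" for s among the target words
                # of the same sentence pair, in proportion to current t.
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm
                    count[(s, w)] += frac
                    total[w] += frac
        t = defaultdict(float, {(s, w): c / total[w] for (s, w), c in count.items()})
    return t


corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une fleur".split(), "a flower".split()),
]
t = train_model1(corpus)
for pair in [("la", "the"), ("maison", "house"), ("la", "house")]:
    print(pair, round(t[pair], 3))
```

Because "la" co-occurs with "the" in two sentence pairs, it gradually claims most of the probability for "the", which in turn frees "maison" to pair with "house".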

19
Problem: fertility

Fertility: not all word correspondences are 1:1
Some words have multiple possible translations, e.g. the → le, la, l', les
Some words have no translation, e.g. in il se rase 'he shaves', se → ∅
Some words are translated by several words, e.g. cheap → peu cher
Not always obvious how to align

20
Problem: distortion

Notice that corresponding words do not appear in
the same order.
The translation model includes probabilities for
distortion
e.g. $P(5 \mid 2)$: the P that $w_s$ in position 2 will produce a $w_t$ in position 5
can be more complex: $P(5 \mid 2, 4, 6)$: the P that $w_s$ in position 2 will produce a $w_t$ in position 5 when S has 4 words and T has 6.
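
A sketch of how such distortion probabilities could be estimated by relative frequency from word-aligned sentence pairs. The input format (a list of position links plus the two sentence lengths), the function name `distortion_table` and the sample data are assumptions for illustration.

```python
from collections import Counter


def distortion_table(aligned):
    """aligned: list of (links, src_len, tgt_len), links = [(src_pos, tgt_pos)].

    Positions are 1-based. Returns a dict mapping
    (tgt_pos, src_pos, src_len, tgt_len) to P(tgt_pos | src_pos, src_len, tgt_len).
    """
    joint = Counter()
    context = Counter()
    for links, src_len, tgt_len in aligned:
        for i, j in links:
            joint[(j, i, src_len, tgt_len)] += 1
            context[(i, src_len, tgt_len)] += 1
    return {key: n / context[key[1:]] for key, n in joint.items()}


# Three invented 3-word sentence pairs; in two of them words 2 and 3 swap places.
aligned = [
    ([(1, 1), (2, 3), (3, 2)], 3, 3),
    ([(1, 1), (2, 2), (3, 3)], 3, 3),
    ([(1, 1), (2, 3), (3, 2)], 3, 3),
]
table = distortion_table(aligned)
print(table[(3, 2, 3, 3)])  # P(target position 3 | source position 2, lengths 3 and 3)
```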

21
The language model

Impractical to calculate probability of every
word sequence
Many will be very improbable
Because they are ungrammatical
Or because they happen not to occur in the data
Probabilities of sequences of n words (n-grams)
more practical
Bigram model
$$P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
where $P(w_i \mid w_{i-1}) \approx f(w_{i-1}, w_i) / f(w_{i-1})$
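
A minimal sketch of estimating bigram probabilities by relative frequency, i.e. the count of the word pair divided by the count of the first word. The `<s>` and `</s>` boundary markers, the function name `bigram_model` and the tiny corpus are assumptions for illustration.

```python
from collections import Counter


def bigram_model(sentences):
    """Return a function prob(prev, word) estimating P(word | prev)."""
    history = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        history.update(tokens[:-1])            # f(w_{i-1})
        bigrams.update(zip(tokens, tokens[1:]))  # f(w_{i-1}, w_i)

    def prob(prev, word):
        if history[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / history[prev]

    return prob


prob = bigram_model([
    "the witch is green",
    "the house is green",
    "the witch slaps",
])
print(prob("the", "witch"))    # "the witch" seen twice, "the" three times
print(prob("green", "witch"))  # never seen, so the estimate is zero
```

The zero for an unseen pair is exactly the sparse-data problem: any sentence containing that pair would get probability zero overall.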

22
Sparse data

Relying on n-grams with a large n risks
0-probabilities
Bigrams are less risky but sometimes not
discriminatory enough
e.g. I hire men who is good pilots
3- or 4-grams allow a nice compromise, and if a
3-gram is previously unseen, we can give it a
score based on the component bigrams
(smoothing)
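
A sketch of scoring an unseen 3-gram from a component bigram. The slide does not name a smoothing method, so this example swaps in a fixed back-off weight in the style of "stupid backoff". The weight 0.4, the function names and the corpus are assumptions, and the results are scores rather than properly normalized probabilities.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count all n-grams (as tuples of words) in whitespace-tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        counts.update(zip(*(tokens[k:] for k in range(n))))
    return counts


def make_scorer(sentences, backoff=0.4):
    uni, bi, tri = (count_ngrams(sentences, n) for n in (1, 2, 3))

    def score(w1, w2, w3):
        if tri[(w1, w2, w3)]:
            # Seen 3-gram: plain relative frequency.
            return tri[(w1, w2, w3)] / bi[(w1, w2)]
        if bi[(w2, w3)]:
            # Unseen 3-gram: fall back to the bigram (w2, w3), discounted.
            return backoff * bi[(w2, w3)] / uni[(w2,)]
        return 0.0

    return score


score = make_scorer([
    "the green witch slept",
    "the old witch slept",
    "a green house",
])
print(score("the", "green", "witch"))  # seen 3-gram
print(score("a", "green", "witch"))    # unseen 3-gram, scored via "green witch"
```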

23
Put it all together and ...

To build a statistical MT system we need
Aligned bilingual corpus
Training programs which will extract from the
corpora all the statistical data for the models
A decoder which takes a given input and searches, using a heuristic search algorithm, for the output that maximizes the magic argmax formula
Software for this purpose is freely available: http://www.statmt.org/moses/, http://www.isi.edu/licensed-sw/pharaoh/
Claim is that an MT system for a new language
pair can be built in a matter of hours

24
SMT latest developments

Nevertheless, quality is limited
SMT researchers quickly learned that this crude approach can get them only so far (quite far actually), but that to go the extra distance you need linguistic knowledge (e.g. morphology, phrases, constituents)
Latest developments aim to incorporate this
Big difference is that it too can be LEARNED
(automatically) from corpora
So SMT still contrasts with traditional RBMT
where rules are hand coded by linguists

25
Direct phrase alignment

(Wang & Waibel 1998; Och et al. 1999; Marcu & Wong 2002)
Enhance word translation model by adding joint
probabilities, i.e. probabilities for phrases
Phrase probabilities compensate for missing
lexical probabilities
Easy to integrate probabilities from different
sources/methods, allows for mutual compensation

26
Word alignment induced model

Koehn et al. 2003; example stolen from Knight & Koehn, http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf

Maria did not slap the green witch
Maria no daba una bofetada a la bruja verde
Start with all phrase pairs justified by the word alignment
27
Word alignment induced model

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch)
28
Word alignment induced model

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)
etc.
29
Word alignment induced model

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Maria did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Maria did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Maria did not slap the green witch)
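
The extraction of phrase pairs "justified by the word alignment" can be sketched as follows: a Spanish span and the English span it links to form a pair only if no word inside either span is linked to a word outside the other. The word links below are an assumption reconstructed from the minimal pairs in the example, and the function name `extract_phrases` is hypothetical. The sketch also omits the usual extension over unaligned words, which this fully aligned example does not need.

```python
def extract_phrases(src, tgt, links):
    """Return all (source phrase, target phrase) pairs consistent with links.

    links is a list of (source index, target index) pairs, 0-based.
    """
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            tgt_positions = [t for s, t in links if s1 <= s <= s2]
            if not tgt_positions:
                continue
            t1, t2 = min(tgt_positions), max(tgt_positions)
            # Reject the span if a target word inside [t1, t2] is linked
            # to a source word outside [s1, s2].
            if all(s1 <= s <= s2 for s, t in links if t1 <= t <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


spanish = "Maria no daba una bofetada a la bruja verde".split()
english = "Maria did not slap the green witch".split()
# Assumed word links (Spanish index, English index).
links = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
         (5, 4), (6, 4), (7, 6), (8, 5)]

pairs = extract_phrases(spanish, english, links)
print(len(pairs))
for pair in pairs:
    print(pair)
```

Under these assumed links the sketch finds the same 17 pairs as the list above. A span such as "a la bruja" is rejected because its English side would have to include "green", which is linked to "verde" outside the span.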
30
Alignment templates

(Och et al. 1999; further developed by Marcu and Wong 2002, Koehn and Knight 2003, Koehn et al. 2003)
Problem of sparse data worse for phrases
So use word classes instead of words
alignment templates instead of phrases
more reliable statistics for translation table
smaller translation table
more complex decoding
Word classes are induced (by distributional
statistics), so may not correspond to intuitive
(linguistic) classes
Takes context into account

31
Problems with phrase-based models

Still do not handle very well ...
dependencies (especially long-distance)
distortion
discontinuities (e.g. bought → habe ... gekauft)
More promising seems to be ...

32
Syntax-based SMT

Better able to handle
Constituents
Function words
Grammatical context (e.g. case marking)
Inversion Transduction Grammars
Hierarchical transduction model
Tree-to-string translation
Tree-to-tree translation

33
Inversion transduction grammars

Wu and colleagues (1997 onwards)
Grammar generates two trees in parallel and
mappings between them
Rules can specify order changes
Restriction to binary rules limits complexity
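
A small sketch of how one binary derivation tree can yield two word orders. Each internal node is marked either "straight" (children in the same order in both languages) or "inverted" (children swapped on the target side). The tuple encoding of the tree and the function name `yield_pair` are assumptions for illustration, not notation from Wu's papers.

```python
def yield_pair(node):
    """Return (source words, target words) generated by a derivation tree.

    A leaf is a 2-tuple (source word, target word); either may be "" to
    model a word with no counterpart. An internal node is a 3-tuple
    (orientation, left child, right child).
    """
    if len(node) == 2:
        s, t = node
        return ([s] if s else []), ([t] if t else [])
    orientation, left, right = node
    left_src, left_tgt = yield_pair(left)
    right_src, right_tgt = yield_pair(right)
    if orientation == "straight":
        return left_src + right_src, left_tgt + right_tgt
    # Inverted: source order is kept, target order is reversed.
    return left_src + right_src, right_tgt + left_tgt


tree = ("straight",
        ("la", "the"),
        ("inverted", ("bruja", "witch"), ("verde", "green")))
source, target = yield_pair(tree)
print(" ".join(source))
print(" ".join(target))
```

Because every rule has exactly two children, the only reordering choice at each node is straight or inverted, which is what keeps the search space manageable.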

34
Inversion transduction grammars
35
Inversion transduction grammars

Grammar is trained on word-aligned bilingual corpus. Note that all the rules are learned automatically
Translation uses a decoder which effectively
works like traditional RBMT
Parser uses source side of transduction rules to
build a parse tree
Transduction rules are applied to transform the
tree
The target text is generated by linearizing the
tree

38
Other approaches

Other approaches use more and more linguistic
information
In each case automatically learned, especially
from treebanks
Traditional (rule-based) MT used (hand-written)
grammars and lexicons
State-of-the-art MT is moving back in this
direction, except that linguistic rules are
machine learned
