Stream Decoding for Simultaneous Translation

Provided by: muntsi

1
Preprocessing in Statistical Machine Translation
May 19, 2009
2
Outline

Bilingual Corpora in SMT
properties and types of corpora
sentence alignment
Preprocessing training corpus and translation input
basic preprocessing steps
tokenization
casing
abbreviations, numbers, dates, ...
named entities
transliteration
preprocessing for advanced SMT
POS tagging
compound splitting
morphological analysis

3
SMT Architecture
[Diagram of the SMT architecture; recoverable labels: Bilingual Corpus, Monolingual Corpus, Preprocessing]
4
SMT Architecture
[Diagram of the SMT architecture; recoverable labels: preprocessing of the source text, preprocessing of text corpora (Bilingual Corpus, Monolingual Corpus)]
5
Text Corpora in SMT

Idea of SMT
learn how to translate by analyzing huge amounts of sample translations
core of an SMT system: training corpus of translated texts
Bilingual corpora
collection of bilingual data: documents, texts, transcriptions of speech
different types
different domains / topics (politics, economics, literature, ...)
modality
written (grammatical)
spoken (less grammatical, incomplete sentences, filler words, stuttering)
styles
formal (books, papers, law texts, business letters, lectures)
informal (e-mail, chat, text messages)
→ the type of training corpus defines the type of data the SMT system is able to translate (best)

6
Domains in MT systems

Most commercial MT systems
developed for or adapted to a particular domain or task
using customer data (translated manuals, business letters, ...)
MT research
focus on open-domain systems or on covering large domains, e.g. news
special topics: domain detection, automatic domain adaptation

7
Parallel Data vs. Comparable Data

Parallel Data
texts are human translations
typically sentence-by-sentence
amount of available data restricted
human translations expensive
proprietary, copyright restricted → translations of books owned by publishing companies
confidential → business letters, ...
Comparable Data
bilingual texts tell the same story
e.g. newspaper reports about the same event in different newspapers, different languages
text elements not necessarily in the same order, missing parts

8
Parallel Data for SMT

Available Data for SMT
Official data
EU documents are translated into all official EU languages
European Parliament speeches → Europarl corpus
European laws → Acquis Communautaire
UN data available in multiple languages
Research associations collect and provide data
Linguistic Data Consortium, European Language Resources Association
sentence aligned, word aligned, syntactically/semantically annotated
size of available parallel corpora
several hundred million words, e.g. for Chinese, Arabic, French, English
Europarl corpus: 30-40 million words, depending on language

9
Acquiring Data for SMT

Web crawling
BBC publishes news in 32 languages
bilingual web sites in bilingual countries
Wikipedia
typically comparable corpora that need extensive cleaning
need to identify corresponding texts

10
Aligning Bilingual Texts

depending on the origin of the bilingual corpus, certain preprocessing steps are necessary
Extract text from HTML or PDF documents
Document alignment
Sentence segmentation
Sentence alignment
Document alignment
identify matching documents
e.g. corresponding HTML pages in different languages
identify matching paragraphs
e.g. corresponding news stories on bilingual websites

11
Sentence Segmentation

where to set sentence boundaries
trigger sentence segmentation at punctuation marks
full stop (.), exclamation mark (!) and question mark (?)
possibly at semicolon (;) and colon (:)
prevent segmentation after abbreviations
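
The segmentation rule above can be sketched in a few lines. This is a minimal illustration, not a production segmenter: the function name `split_sentences` and the small `ABBREVIATIONS` set are assumptions made for the example, and a real system would load a language-specific abbreviation list.

```python
import re

# Hypothetical abbreviation list; a real system would load one per language.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr.", "mrs.", "prof.", "etc."}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, except after abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s)", text):
        end = match.end()
        # Last whitespace-delimited token before the candidate boundary.
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # do not segment after an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # text after the last boundary
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! Why?"))
```

The abbreviation check is what keeps "Dr." from ending a sentence; semicolons and colons are deliberately not treated as boundaries here.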

12
Sentence Alignment

text rarely translated word by word
sometimes not even sentence by sentence
long sentences might be split up, short ones merged
not straightforward to identify corresponding sentences in a parallel corpus

13
Sentence Alignment Task

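given sentences $f_1, \dots, f_{n_f}$ in the foreign language and sentences $e_1, \dots, e_{n_e}$ in English
sentence alignment $S$: list of sentence pairs $s_1, \dots, s_n$
each sentence pair $s_i$ is a pair of sets
$s_i = (\{f_{\mathrm{start}_f(i)}, \dots, f_{\mathrm{end}_f(i)}\}, \{e_{\mathrm{start}_e(i)}, \dots, e_{\mathrm{end}_e(i)}\})$
restrictions
sentences translated in sequence
$\mathrm{start}_f(i) = \mathrm{end}_f(i-1) + 1$
$\mathrm{start}_e(i) = \mathrm{end}_e(i-1) + 1$
$\mathrm{start}_f(1) = 1$
$\mathrm{start}_e(1) = 1$
$\mathrm{end}_f(n) = n_f$
$\mathrm{end}_e(n) = n_e$
$\mathrm{start}_f(i) \le \mathrm{end}_f(i)$
$\mathrm{start}_e(i) \le \mathrm{end}_e(i)$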

14
Alignment Strategy

different alignment types
number of sentences in each set within a sentence pair
1-1 (substitution),
1-0 (deletion), 0-1 (insertion),
2-1 (contraction), 1-2 (expansion),
2-2 (merger)
Requirements for a full sentence alignment of a corpus
all sentences need to be covered
each sentence can only be part of one sentence pair

15
Alignment Strategy

Search for a sentence alignment $S = s_1, \dots, s_n$
fulfilling the requirements of coverage and uniqueness
optimize the applied measure of matching quality of all its sentence pairs
→ search the space of possible sentence alignments for the highest-scoring one
dynamic programming
pruning
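
The search can be illustrated with a small dynamic program over the alignment types 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2. This is a sketch under stated assumptions: `toy_cost` is a hypothetical stand-in for the match function (lower cost corresponds to a higher score), the penalty of 10 is arbitrary, and pruning is not shown.

```python
# Alignment types as (number of source sentences, number of target sentences).
ALIGNMENT_TYPES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]

def toy_cost(src_group, tgt_group):
    """Hypothetical cost: character-length difference plus a penalty for non 1-1 pairs."""
    src_len = sum(len(s) for s in src_group)
    tgt_len = sum(len(t) for t in tgt_group)
    penalty = 0 if (len(src_group), len(tgt_group)) == (1, 1) else 10
    return abs(src_len - tgt_len) + penalty

def align(src, tgt, cost=toy_cost):
    """Return the lowest-cost list of (source group, target group) pairs."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    # best[i][j]: cost of the best alignment of src[:i] with tgt[:j].
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in ALIGNMENT_TYPES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end; every sentence is covered exactly once.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs

english = ["Hello world.", "How are you today? Fine."]
german = ["Hallo Welt.", "Wie geht es dir heute?", "Gut."]
for pair in align(english, german):
    print(pair)
```

Because each step consumes sentences from the front of both texts, coverage and uniqueness hold by construction, and the table has only (n+1) x (m+1) cells.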

17
Popular Sentence Alignment Algorithm

Gale and Church (1993)
two components of the match function:
probability distribution for alignment types
distance measure considering the number of letters in each of the sentences
$d = (l_2 - l_1 c) / \sqrt{l_1 s^2}$
where $c$ is the expected number of characters in L2 per character in L1 and $s^2$ is the variance
→ match function estimated by $\mathrm{Prob}(\mathrm{match} \mid d) \propto \mathrm{Prob}(d \mid \mathrm{match}) \, \mathrm{Prob}(\mathrm{match})$
$\mathrm{Prob}(d \mid \mathrm{match}) = 2 (1 - \mathrm{Prob}(|d|))$, where $\mathrm{Prob}(|d|)$ is computed by integrating a standard normal distribution
Other algorithms consider cognates or other lexical clues, e.g. using a bilingual dictionary
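
The length-based match function can be written out directly. This is a sketch rather than a reference implementation: the defaults `c=1.0` and `s2=6.8` are values commonly quoted for this method and should be treated as assumptions to re-estimate per language pair, the prior of 0.89 in the usage line is illustrative, and `l1` is assumed to be greater than zero.

```python
import math

def length_delta(l1, l2, c=1.0, s2=6.8):
    """d = (l2 - l1 * c) / sqrt(l1 * s2), with l1 and l2 measured in characters."""
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

def standard_normal_cdf(x):
    """Integral of the standard normal density from minus infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_delta_given_match(d):
    """Prob(d | match) = 2 * (1 - Prob(|d|))."""
    return 2.0 * (1.0 - standard_normal_cdf(abs(d)))

def match_score(l1, l2, prior, c=1.0, s2=6.8):
    """Unnormalized Prob(match | d) = Prob(d | match) * Prob(match)."""
    return prob_delta_given_match(length_delta(l1, l2, c, s2)) * prior

# Illustrative prior for a 1-1 alignment (assumption).
print(match_score(100, 104, prior=0.89))
print(match_score(100, 150, prior=0.89))
```

Sentence pairs whose lengths are close to the expected ratio score near the prior, while strongly mismatched lengths drive the score towards zero.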

18
Sentence Alignment Output

sentence aligned parallel corpus
1 line = 1 sentence

English
line 1: The skyward zoom in food prices is the dominant force behind the speed up in eurozone inflation
line 2: Official forecasts predicted just 3 percent, Bloomberg said.
line x: His performance is delightfully tongue-in-cheek. Essentially, his war reporter Simon Hunt is who Gere could have ended up as, had fate and the film industry not been so kind to him A man whose heyday is long past, but who has preserved considerable shreds of his former charm even as a ruin-like monument.
German
line 1: Hauptgrund für den in der Eurozone gemessenen Anstieg der Inflation seien die rasant steigenden Lebensmittelpreise.
line 2: Offizielle Prognosen sind von nur 3 Prozent ausgegangen, meldete Bloomberg.
line x: Er liefert eine wunderbar augenzwinkernde Darstellung Im Grunde genommen ist sein Kriegsreporter Simon Hunt das, was aus Gere hätte werden können, wenn das Schicksal und die Filmbranche nicht so gnädig wären Ein Mann, der seine allerbesten Zeiten lange hinter sich hat, der aber selbst als ruinengleiches Denkmal seines Niedergang noch beträchtliche Reste des einstigen Charmes bewahrt hat.
19
Preprocessing

Text format
plain text
XML-annotated text
encoding
Filter sentences
remove sentence pairs containing empty or overly long sentences
remove sentence pairs with an unreasonable ratio of words
Cleaning the corpus
normalize quotes, clean spaces
change special characters
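
A possible form of the filtering step is sketched below. The thresholds `max_len=100` and `max_ratio=3.0` and the function names are assumptions chosen for illustration; suitable values depend on the corpus and language pair.

```python
def keep_pair(src, tgt, max_len=100, max_ratio=3.0):
    """Return True if the sentence pair passes the length and ratio filters."""
    src_words, tgt_words = src.split(), tgt.split()
    if not src_words or not tgt_words:
        return False  # empty sentence on either side
    if len(src_words) > max_len or len(tgt_words) > max_len:
        return False  # overly long sentence
    longer = max(len(src_words), len(tgt_words))
    shorter = min(len(src_words), len(tgt_words))
    return longer / shorter <= max_ratio  # unreasonable word ratio otherwise

def filter_corpus(pairs, **thresholds):
    """Keep only the sentence pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **thresholds)]

corpus = [
    ("the house is small", "das Haus ist klein"),
    ("", "leer"),
    ("yes", "ja , das ist wirklich sehr richtig"),
]
print(filter_corpus(corpus))
```

Pairs are always removed as a unit, so the two sides of the corpus stay line-aligned after filtering.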

20
Preprocessing

Tokenization
segment text (sequence of characters) into words
especially challenging for Asian languages → no spaces to indicate word boundaries
separate punctuation from words
The Social Democratic Party's fraction at the Bürgerschaft, Hamburg's parliament, accused the senate of having wasted precious time.
→ The Social Democratic Party 's fraction at the Bürgerschaft , Hamburg 's parliament , accused the senate of having wasted precious time .
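
For space-delimited languages, separating punctuation from words can be approximated with a single regular expression. This is a minimal sketch, not the tokenizer used for the slides: the function name `tokenize` and the special handling of the English clitic 's are assumptions, and abbreviations, numbers and URLs would need extra rules.

```python
import re

# Match the clitic 's, a run of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sentence = "The Social Democratic Party's fraction at the Bürgerschaft, Hamburg's parliament, accused the senate of having wasted precious time."
print(" ".join(tokenize(sentence)))
```

In Python 3, `\w` matches Unicode letters, so words such as "Bürgerschaft" stay intact.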

21
Preprocessing

Dealing with lower and upper case in SMT
normal text obeys the rules of capitalization
train translation model on normalized text
benefit from generalization of the vocabulary
but also introduces ambiguities: May (month) → may
convert training corpus into lowercase / smartcase
train recasing model on original and lowercased / smartcased text
→ recreate true case for translated text in postprocessing step
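
A very simple recasing model can be sketched as a lookup of the most frequent surface form of each lowercased word. This is an assumption-laden simplification (the names `train_recaser` and `recase` are hypothetical, and the model ignores context), so it cannot resolve the May / may ambiguity mentioned above.

```python
from collections import Counter, defaultdict

def train_recaser(sentences):
    """Map each lowercased token to its most frequent original casing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the sentence-initial token: its capital letter is positional.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

def recase(lowercased_sentence, model):
    """Restore casing token by token and capitalize the first token."""
    tokens = [model.get(t, t) for t in lowercased_sentence.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)

model = train_recaser([
    "We met in May .",
    "He said we may go .",
    "It rained in May .",
])
print(recase("they arrived in may .", model))
```

Since "May" is more frequent than "may" in this toy corpus, every occurrence is restored as "May"; a context-aware recaser would be needed to decide per sentence.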

22
Preprocessing

Preprocess patterns
expand contractions and abbreviations
he'll → he will, it's → it is, z.B. → zum Beispiel, ...
normalize dates and numbers
4.12.01, 04.12.2001, 4. Dez. 2001, 4. Dez. 01 → 4. Dezember 2001
→ normalization leads to generalization and therefore better coverage
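
The date normalization can be sketched with one regular expression. The pattern, the function name `normalize_dates` and the two-digit-year rule (below 50 means 20xx, otherwise 19xx) are assumptions made for this example, and month abbreviations are taken to be the first three letters of the German month name.

```python
import re

MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"]
# Three-letter abbreviations, e.g. "dez" -> 12 (assumed abbreviation scheme).
ABBREVIATIONS = {name[:3].lower(): i + 1 for i, name in enumerate(MONTHS)}

DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\.\s?(\d{1,2}|[A-Za-zÄä]{3})\.\s?(\d{4}|\d{2})\b"
)

def normalize_dates(text):
    """Rewrite German short dates as 'D. Monthname YYYY'."""
    def replace(match):
        day, month, year = match.groups()
        month_no = int(month) if month.isdigit() else ABBREVIATIONS.get(month.lower())
        if month_no is None or not 1 <= month_no <= 12:
            return match.group(0)  # not a recognizable date, leave untouched
        if len(year) == 2:
            # Assumed pivot for two-digit years.
            year = ("20" if int(year) < 50 else "19") + year
        return f"{int(day)}. {MONTHS[month_no - 1]} {year}"
    return DATE_PATTERN.sub(replace, text)

for raw in ["4.12.01", "04.12.2001", "4. Dez. 2001", "4. Dez. 01"]:
    print(raw, "->", normalize_dates(raw))
```

All four variants from the slide map to the same string, which is the generalization effect described above.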

23
Named Entities (NE)

specific instance of an object class which is referred to by its name
Types
personal names, organizations, locations, temporal phrases, monetary expressions
William Bell, Spice Girls, United Nations, München <-> Munich, the year 2001, ...
Problems
most NEs are out-of-vocabulary words
should not be translated even if (partly) possible
unlimited amount → new named entities each day
→ Named Entities need to be recognized as such and treated separately

24
Named Entity Recognition (NER)

identify named entities in a running text
Indications
capitalization
patterns: am 3. Januar, 12.5.2009
Techniques
grammar based
define language-dependent rules, patterns
statistical
train a model from NE-annotated corpora

25
Named Entities in Translation

How to translate Named Entities?
leave
George Bush, not George Strauch (the literal German translation of "Bush")
translate partially / change format
e.g. rule-based pattern substitution of time and money expressions
am 3. Januar <-> on January 3rd
$300 <-> 300 Dollar
12.5.2009 <-> 5/12/2009
4pm <-> 16.00 Uhr
fixed equivalence
München = Munich
United Nations = Vereinte Nationen
transcribe (between letter-based scripts)
Russian: Антон Чехов, German: Anton Tschechow, English: Anton Chekhov, scientific transcription: Anton Čechov
transliterate (between letter-based and character-based scripts)
Microsoft
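
The rule-based pattern substitution of date and time expressions can be sketched with two regular-expression rules. The function names and patterns are assumptions for this illustration; they cover only the two formats shown on the slide (German D.M.YYYY dates and English am/pm times on the full hour).

```python
import re

def german_date_to_us(text):
    """12.5.2009 -> 5/12/2009 (day.month.year becomes month/day/year)."""
    return re.sub(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b", r"\2/\1/\3", text)

def english_time_to_german(text):
    """4pm -> 16.00 Uhr (12-hour clock becomes 24-hour clock)."""
    def replace(match):
        hour = int(match.group(1)) % 12  # 12am -> 0, 12pm -> 12 after the shift
        if match.group(2).lower() == "pm":
            hour += 12
        return f"{hour}.00 Uhr"
    return re.sub(r"\b(\d{1,2})\s?(am|pm)\b", replace, text, flags=re.IGNORECASE)

print(german_date_to_us("Treffen am 12.5.2009"))
print(english_time_to_german("meeting at 4pm"))
```

Only the entity itself is rewritten; the surrounding words are left for the translation system.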

26
Transliteration

Hindi
translation: hello
transliteration: namaste
method
sounding out
transform a word in the source language script into a phonetically identical word in the target language script
mapping of source script characters to target script characters that are pronounced similarly → no 1:1 mapping due to different phoneme spaces of different languages

source of images and Hindi text: Kellner, 2007
27
Transliteration

Forward Transliteration
word does not exist in target language
alternative phonetically identical transliterations possible and usually ok
e.g. spelling of Indian name: Mina, Minaa, Meena, Meenaa, ...
Backward Transliteration
word exists in target language
only one form correct
e.g. city names: London, not Lundan or Lundon

28
Machine Transliteration Techniques

3 basic strategies
Manually compiled rules / tables
language dependent
must be exhaustive
Phonetic-based Model
source word → source segments → source phonemes → target phonemes → target segments → target word
methods based on hand-crafted rules or machine learning
requires language-specific linguistic knowledge or data about segmentation and pronunciation
Direct Orthographic Model
direct mapping: source segments → target segments
New approach: Direct Orthographic Modeling with automated segments
consider all segmentation possibilities → choose the best one based on probabilities
determine segments and probabilities during iterative training based on word-transliteration pairs
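
Choosing the best segmentation under a direct orthographic model can be sketched as a dynamic program over the source word. Everything in `SEGMENT_TABLE` is hypothetical: the segments and probabilities are invented for illustration (German to English spelling of "Tschechow"), each source segment has a single target segment, and the iterative training that would estimate such a table is not shown.

```python
import math

# Hypothetical table: source segment -> (target segment, probability).
SEGMENT_TABLE = {
    "tsch": ("ch", 0.9), "ch": ("kh", 0.6), "t": ("t", 0.9), "s": ("s", 0.9),
    "c": ("k", 0.5), "h": ("h", 0.8), "e": ("e", 0.9), "o": ("o", 0.9),
    "w": ("v", 0.7),
}

def transliterate(word, table=SEGMENT_TABLE, max_segment_len=4):
    """Return the output of the most probable segmentation, or None if none exists."""
    # best[i] = (log probability, output) for the best segmentation of word[:i].
    best = [None] * (len(word) + 1)
    best[0] = (0.0, "")
    for end in range(1, len(word) + 1):
        for start in range(max(0, end - max_segment_len), end):
            entry = table.get(word[start:end])
            if best[start] is None or entry is None:
                continue
            target, prob = entry
            score = best[start][0] + math.log(prob)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + target)
    return best[-1][1] if best[-1] is not None else None

print(transliterate("tschechow"))
```

With this table the segmentation tsch|e|ch|o|w outscores t|s|ch|e|ch|o|w, because one probable long segment beats several shorter ones multiplied together.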

29
Preprocessing for Advanced SMT Methods

Morphological preprocessing
compound splitting
splitting compound words into their constituent words facilitates translation
universitätsgebäude → universität gebäude → university building
stemming / lemmatization (of morphologically rich languages)
car, cars, car's, cars' → car
strip functional morphemes at word endings → stem
geh-st → geh
use dictionary and morphological inflection rules to derive base form → lemma
geh-st → gehen
use lemmatized word form for training
→ less variability, more generalization, increased lexical coverage
recreate morphological information after translation
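
Compound splitting can be sketched as a vocabulary lookup that tries every split point and allows a linking "s" between the parts. The function name `split_compound`, the minimum part length and the two-part restriction are assumptions for this example; it returns the first split found rather than scoring alternatives.

```python
def split_compound(word, vocabulary, min_part_len=3, linkers=("", "s")):
    """Split a compound into two known words, optionally joined by a linking element."""
    word = word.lower()
    if word in vocabulary:
        return [word]  # known word, no split needed
    for i in range(min_part_len, len(word) - min_part_len + 1):
        head, tail = word[:i], word[i:]
        if tail not in vocabulary:
            continue
        for linker in linkers:
            if not head.endswith(linker):
                continue
            # Remove the linking element (if any) before the lookup.
            stem = head[:len(head) - len(linker)]
            if stem in vocabulary:
                return [stem, tail]
    return [word]  # no split found

vocabulary = {"universität", "gebäude", "haus", "tür"}
print(split_compound("universitätsgebäude", vocabulary))
print(split_compound("haustür", vocabulary))
```

The vocabulary would typically come from the training corpus itself, so that each part is a word the translation model has already seen.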

30
Preprocessing for Advanced SMT Methods

POS tagging
assign a POS tag to each word
for instance , the government introduced statewide comparative tests .
IN NN , DT NN VVD JJ JJ NNS SENT
make use of probability of certain POS sequences, e.g.
apply a POS language model
use POS tags as factors in factored translation models
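
A POS language model can be illustrated with a bigram model over tag sequences. This is a toy sketch: the function names, the add-one smoothing and the two training sequences are assumptions for the example, and a real model would be trained on a tagged corpus with higher-order n-grams.

```python
import math
from collections import Counter

def train_pos_bigram(tag_sequences):
    """Count tag bigrams, with sentence start and end markers."""
    bigrams, contexts, tags = Counter(), Counter(), set()
    for sequence in tag_sequences:
        padded = ["<s>"] + list(sequence) + ["</s>"]
        tags.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(tags)

def log_prob(sequence, model):
    """Log probability of a tag sequence with add-one smoothing."""
    bigrams, contexts, tagset_size = model
    padded = ["<s>"] + list(sequence) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + tagset_size))
        for prev, cur in zip(padded, padded[1:])
    )

model = train_pos_bigram([
    ["DT", "NN", "VVD", "JJ", "NNS", "SENT"],
    ["DT", "JJ", "NN", "VVD", "DT", "NN", "SENT"],
])
print(log_prob(["DT", "NN", "VVD", "SENT"], model))
print(log_prob(["NN", "DT", "SENT", "VVD"], model))
```

The plausible tag order receives the higher log probability, which is the signal a translation system can use to prefer well-formed output.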
