Title: Stream Decoding for Simultaneous Translation
1 Preprocessing in Statistical Machine Translation
May 19, 2009
2 Outline
- Bilingual Corpora in SMT
  - properties and types of corpora
  - sentence alignment
- Preprocessing training corpus and translation input
  - basic preprocessing steps
    - tokenization
    - casing
    - abbreviations, numbers, dates, ...
    - named entities
    - transliteration
  - preprocessing for advanced SMT
    - POS tagging
    - compound splitting
    - morphological analysis
3 SMT Architecture
[Figure: SMT architecture diagram; the bilingual corpus and the monolingual corpus each pass through a preprocessing step.]
4 SMT Architecture
[Figure: SMT architecture diagram, highlighting preprocessing of the source text and preprocessing of text corpora (bilingual and monolingual).]
5 Text Corpora in SMT
- Idea of SMT
  - learn how to translate by analyzing huge amounts of sample translations
  - core of an SMT system: a training corpus of translated texts
- Bilingual corpora
  - collections of bilingual data: documents, texts, transcriptions of speech
- different types
  - different domains / topics (politics, economics, literature, ...)
  - modality
    - written (grammatical)
    - spoken (less grammatical, incomplete sentences, filler words, stuttering)
  - styles
    - formal (books, papers, law texts, business letters, lectures)
    - informal (e-mail, chat, text messages)
- → the type of training corpus defines the type of data the SMT system is able to translate (best)
6 Domains in MT systems
- Most commercial MT systems
  - developed for or adapted to a particular domain or task
  - using customer data (translated manuals, business letters, ...)
- MT research
  - focus on open-domain systems or on covering large domains, e.g. news
  - special topics: domain detection, automatic domain adaptation
7 Parallel Data vs. Comparable Data
- Parallel Data
  - texts are human translations
  - typically sentence-by-sentence
  - amount of available data restricted
    - human translations expensive
    - proprietary, copyright restricted → translations of books owned by publishing companies
    - confidential → business letters, ...
- Comparable Data
  - bilingual texts tell the same story
  - e.g. newspaper reports about the same event in different newspapers, different languages
  - text elements not necessarily in the same order, missing parts
8 Parallel Data for SMT
- Available Data for SMT
  - Official data
    - EU documents are translated into all European languages
    - European Parliamentary Speeches → Europarl corpus
    - European Laws → Acquis Communautaire
    - UN data available in multiple languages
  - Research associations collect and provide data
    - Linguistic Data Consortium, European Language Resources Association
    - sentence aligned, word aligned, syntactically/semantically annotated
- size of available parallel corpora
  - several hundred million words, e.g. for Chinese, Arabic, French, English
  - Europarl corpus: 30-40 million words, depending on language
9 Acquiring Data for SMT
- Web crawling
  - BBC publishes news in 32 languages
  - bilingual web sites in bilingual countries
  - Wikipedia
- typically comparable corpora that need extensive cleaning
- need to identify corresponding texts
10 Aligning Bilingual Texts
- depending on the origin of the bilingual corpus, certain preprocessing steps are necessary
  - extract text from HTML or PDF documents
  - document alignment
  - sentence segmentation
  - sentence alignment
- Document alignment
  - identify matching documents
    - e.g. corresponding HTML pages in different languages
  - identify matching paragraphs
    - e.g. corresponding news stories on bilingual websites
11 Sentence Segmentation
- where to set sentence boundaries?
- trigger sentence segmentation at punctuation marks
  - full stop (.), exclamation mark (!) and question mark (?)
  - possibly at semicolon (;) and colon (:)?
- prevent segmentation after abbreviations
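The segmentation rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: the `ABBREVIATIONS` set and the function name `split_sentences` are assumptions made for the example, and the text is assumed to be whitespace-separated.

```python
import re

# Hypothetical abbreviation list; a real system would use a much larger,
# language-specific list.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr.", "mrs.", "prof.", "z.b."}

def split_sentences(text):
    """Split at tokens ending in '.', '!' or '?' unless they are abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = re.search(r"[.!?]$", token) is not None
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. Was he late? No!"))
```

Splitting at semicolons or colons as well would only require extending the character class in the regular expression.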
12 Sentence Alignment
- text is rarely translated word by word
- sometimes not even sentence by sentence
  - long sentences might be split up, short ones merged
- not straightforward to identify corresponding sentences in a parallel corpus
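13 Sentence Alignment Task
- given sentences $f_1, \dots, f_{n_f}$ in the foreign language and sentences $e_1, \dots, e_{n_e}$ in English
- sentence alignment $S$ = list of sentence pairs $s_1, \dots, s_n$
- each sentence pair $s_i$ is a pair of sets
  - $s_i = (\{f_{\mathrm{start}_f(i)}, \dots, f_{\mathrm{end}_f(i)}\}, \{e_{\mathrm{start}_e(i)}, \dots, e_{\mathrm{end}_e(i)}\})$
- restrictions
  - sentences translated in sequence
  - $\mathrm{start}_f(i) = \mathrm{end}_f(i-1) + 1$
  - $\mathrm{start}_e(i) = \mathrm{end}_e(i-1) + 1$
  - $\mathrm{start}_f(1) = 1$
  - $\mathrm{start}_e(1) = 1$
  - $\mathrm{end}_f(n) = n_f$
  - $\mathrm{end}_e(n) = n_e$
  - $\mathrm{start}_f(i) \le \mathrm{end}_f(i)$
  - $\mathrm{start}_e(i) \le \mathrm{end}_e(i)$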
14 Alignment Strategy
- different alignment types
  - number of sentences in each set within a sentence pair
  - 1-1 (substitution),
  - 1-0 (deletion), 0-1 (insertion),
  - 2-1 (contraction), 1-2 (expansion),
  - 2-2 (merger)
- Requirements for a full sentence alignment of a corpus
  - all sentences need to be covered
  - each sentence can only be part of one sentence pair
15 Alignment Strategy
- Search for a sentence alignment $S = s_1, \dots, s_n$
  - fulfilling the requirements of coverage and uniqueness
  - optimizing the applied measure of matching quality over all its sentence pairs
- Search the space of possible sentence alignments for the highest-scoring one
  - dynamic programming
  - pruning
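The dynamic-programming search can be sketched as follows. This is a simplified illustration under stated assumptions: the prior values in `TYPE_PRIOR` are placeholders rather than estimates from data, the `cost` function is a toy length-mismatch measure standing in for a real match function, and no pruning is applied.

```python
import math

# Alignment types (source sentences, target sentences) with assumed priors.
# The values are illustrative placeholders, not estimated from a corpus.
TYPE_PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
              (2, 1): 0.04, (1, 2): 0.04, (2, 2): 0.01}

def cost(src, tgt, kind):
    """Toy cost: negative log prior plus relative character-length mismatch."""
    l1, l2 = sum(map(len, src)), sum(map(len, tgt))
    mismatch = abs(l1 - l2) / max(l1, l2, 1)
    return -math.log(TYPE_PRIOR[kind]) + mismatch

def align(src_sents, tgt_sents):
    """Return the lowest-cost list of (src_group, tgt_group) sentence pairs."""
    n, m = len(src_sents), len(tgt_sents)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            # Extend the partial alignment by one sentence pair of each type.
            for di, dj in TYPE_PRIOR:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(src_sents[i:ni], tgt_sents[j:nj], (di, dj))
                if c < best[ni][nj]:
                    best[ni][nj], back[ni][nj] = c, (i, j)
    # Follow back-pointers from the end to recover the sentence pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src_sents[pi:i], tgt_sents[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

Because every step consumes a contiguous block of sentences on each side and the search ends at the last sentence of both texts, the result covers all sentences and uses each sentence exactly once. Pruning would restrict the inner loops to cells near the diagonal.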
17 Popular Sentence Alignment Algorithm
- Gale and Church (1993)
  - 2 components for the match function
    - probability distribution for alignment types
    - distance measure considering the number of letters in each of the sentences
      - $d = (l_2 - l_1 c) / \sqrt{l_1 s^2}$
      - where $l_1$ and $l_2$ are the sentence lengths in characters, $c$ is the expected number of characters in L2 per character in L1, and $s^2$ the variance
  - → match function estimated by $\mathrm{Prob}(\mathit{match} \mid d) \propto \mathrm{Prob}(d \mid \mathit{match}) \, \mathrm{Prob}(\mathit{match})$
    - $\mathrm{Prob}(d \mid \mathit{match}) = 2 \, (1 - \mathrm{Prob}(|d|))$, where $\mathrm{Prob}(|d|)$ is computed by integrating a standard normal distribution
- Other algorithms consider cognates or other lexical clues, e.g. using a bilingual dictionary
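The length-based match function can be written out as a short Python sketch. The function names are chosen for this example. The defaults `c=1.0` and `s2=6.8` are the values commonly cited from Gale and Church (1993); treat them as assumptions that should be re-estimated for a given language pair. The sketch also assumes `l1 > 0`, so insertions (0-1 alignments) need separate handling.

```python
import math

def length_distance(l1, l2, c=1.0, s2=6.8):
    """d = (l2 - l1 * c) / sqrt(l1 * s2) for character lengths l1 > 0 and l2."""
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

def prob_d_given_match(d):
    """2 * (1 - Phi(|d|)), where Phi is the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(d) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

def match_cost(l1, l2, prior):
    """-log(Prob(d | match) * Prob(match)); a lower cost is a better match."""
    p = prob_d_given_match(length_distance(l1, l2)) * prior
    return -math.log(max(p, 1e-300))  # guard against log(0)

# Equal lengths give d = 0, so only the alignment-type prior contributes.
print(match_cost(100, 100, prior=0.89))
print(match_cost(100, 150, prior=0.89))
```

Working with negative log probabilities turns the product over sentence pairs into a sum, which is the form a dynamic-programming search minimizes.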
18 Sentence Alignment Output
- sentence aligned parallel corpus
- 1 line = 1 sentence
- English side
  - line 1: The skyward zoom in food prices is the dominant force behind the speed up in eurozone inflation. \n
  - line 2: Official forecasts predicted just 3 percent, Bloomberg said. \n
  - line x: His performance is delightfully tongue-in-cheek. Essentially, his war reporter Simon Hunt is who Gere could have ended up as, had fate and the film industry not been so kind to him: A man whose heyday is long past, but who has preserved considerable shreds of his former charm even as a ruin-like monument. \n
- German side
  - line 1: Hauptgrund für den in der Eurozone gemessenen Anstieg der Inflation seien die rasant steigenden Lebensmittelpreise. \n
  - line 2: Offizielle Prognosen sind von nur 3 Prozent ausgegangen, meldete Bloomberg. \n
  - line x: Er liefert eine wunderbar augenzwinkernde Darstellung: Im Grunde genommen ist sein Kriegsreporter Simon Hunt das, was aus Gere hätte werden können, wenn das Schicksal und die Filmbranche nicht so gnädig wären: Ein Mann, der seine allerbesten Zeiten lange hinter sich hat, der aber selbst als ruinengleiches Denkmal seines Niedergang noch beträchtliche Reste des einstigen Charmes bewahrt hat. \n
19 Preprocessing
- Text format
  - plain text
  - XML annotated text
  - encoding
- Filter sentences
  - remove sentence pairs containing empty or long sentences
  - remove sentence pairs with unreasonable ratio of words
- Cleaning the corpus
  - normalize quotes, clean spaces
  - change special characters
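The filtering and cleaning steps might look like the following sketch. The thresholds `max_len=100` and `max_ratio=3.0` and the function names are assumptions chosen for illustration; suitable values depend on the corpus and language pair.

```python
def keep_pair(src, tgt, max_len=100, max_ratio=3.0):
    """Return True if the sentence pair passes the length and ratio filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:            # empty sentence
        return False
    if n_src > max_len or n_tgt > max_len:  # overly long sentence
        return False
    ratio = max(n_src, n_tgt) / min(n_src, n_tgt)
    return ratio <= max_ratio

def clean(line):
    """Normalize typographic quotes and collapse runs of whitespace."""
    for fancy, plain in (("“", '"'), ("”", '"'), ("„", '"'),
                         ("‘", "'"), ("’", "'")):
        line = line.replace(fancy, plain)
    return " ".join(line.split())

print(keep_pair("a short sentence", "ein kurzer Satz"))
print(clean("  “Hello”   world "))
```

Both sides of a pair have to be dropped together, otherwise the line-by-line correspondence of the parallel corpus is lost.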
20 Preprocessing
- Tokenization
  - segment text (sequence of characters) into words
  - especially challenging for some Asian languages (e.g. Chinese, Japanese) → no spaces to indicate word boundaries
  - separate punctuation from words
    - The Social Democratic Party's fraction at the Bürgerschaft, Hamburg's parliament, accused the senate of having wasted precious time.
    - → The Social Democratic Party 's fraction at the Bürgerschaft , Hamburg 's parliament , accused the senate of having wasted precious time .
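A tokenizer for space-delimited languages can be approximated with a single regular expression. This is a rough sketch: the pattern `TOKEN_RE` is an assumption for this example and handles only the English clitic 's, hyphenated words, and single punctuation marks.

```python
import re

# Order matters: try the clitic 's first, then words (optionally hyphenated),
# then any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"'s\b|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)

print(" ".join(tokenize("Hamburg's parliament, accused the senate.")))
```

Languages without spaces between words need a dedicated word segmenter instead of a pattern like this.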
21 Preprocessing
- Dealing with lower and upper case in SMT
  - normal text obeys the rules of capitalization
  - train translation model on normalized text
    - benefit from generalization of the vocabulary
    - but also introduces ambiguities: May (month) ↔ may
  - convert training corpus into lowercase / smartcase
  - train recasing model on original and lowercased / smartcased text
  - → recreate true case for translated text in postprocessing step
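As a rough illustration of the recasing step, the sketch below learns the most frequent surface form of each lowercased word and applies it to lowercased output. This unigram model is a deliberate simplification chosen for the example, and the function names are hypothetical; practical recasers also use context, which is what resolves cases such as May vs. may.

```python
from collections import Counter, defaultdict

def train_recaser(sentences):
    """Map each lowercased word to its most frequent surface form.

    The first token of each sentence is skipped because its capitalization
    is positional rather than lexical.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word in sentence.split()[1:]:
            counts[word.lower()][word] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

def recase(lowercased, model):
    """Restore the most likely case and capitalize the sentence start."""
    words = [model.get(w, w) for w in lowercased.split()]
    if words:
        words[0] = words[0][:1].upper() + words[0][1:]
    return " ".join(words)

model = train_recaser(["We met in Berlin in May .", "They may visit Berlin ."])
print(recase("she lives in berlin .", model))
```

In this toy training set "May" and "may" occur equally often, so a unigram model cannot choose between them reliably.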
22 Preprocessing
- Preprocess patterns
  - expand contractions and abbreviations
    - he'll → he will, it's → it is, z.B. → zum Beispiel (German: "for example"), ...
  - normalize dates and numbers
    - 4.12.01, 04.12.2001, 4. Dez. 2001, 4. Dez. 01 → 4. Dezember 2001
- → normalization leads to generalization and therefore better coverage
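The date example can be reproduced with two regular expressions. This is a sketch under explicit assumptions: the abbreviation table `MONTH_ABBR` is deliberately incomplete, two-digit years are assumed to mean 20xx, and the input is assumed to be an isolated date string.

```python
import re

MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"]
# Hypothetical abbreviation table; only a few entries are listed.
MONTH_ABBR = {"Jan.": 1, "Feb.": 2, "Okt.": 10, "Nov.": 11, "Dez.": 12}

NUMERIC = re.compile(r"^(\d{1,2})\.(\d{1,2})\.(\d{2}|\d{4})$")
ABBREVIATED = re.compile(r"^(\d{1,2})\. (\w+\.) (\d{2}|\d{4})$")

def full_year(year):
    # Assumption: two-digit years refer to 20xx.
    return int(year) + 2000 if len(year) == 2 else int(year)

def normalize_date(text):
    """Rewrite a German date as 'D. Monthname YYYY' if it is recognized."""
    m = NUMERIC.match(text)
    if m:
        day, month, year = m.groups()
        if not 1 <= int(month) <= 12:
            return text
        return f"{int(day)}. {MONTHS[int(month) - 1]} {full_year(year)}"
    m = ABBREVIATED.match(text)
    if m and m.group(2) in MONTH_ABBR:
        day, abbr, year = m.groups()
        return f"{int(day)}. {MONTHS[MONTH_ABBR[abbr] - 1]} {full_year(year)}"
    return text  # leave unrecognized input unchanged

for date in ["4.12.01", "04.12.2001", "4. Dez. 2001", "4. Dez. 01"]:
    print(normalize_date(date))
```

All four surface forms collapse to one string, so the translation model sees a single vocabulary item instead of four.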
23 Named Entities (NE)
- specific instance of an object class which is referred to by its name
- Types
  - personal names, organizations, locations, temporal phrases, monetary expressions
  - William Bell, Spice Girls, United Nations, München <-> Munich, the year 2001, ...
- Problems
  - most NEs are out-of-vocabulary words
  - should not be translated even if (partly) possible
  - unlimited amount → new named entities each day
- → Named Entities need to be recognized as such and treated separately
24 Named Entity Recognition (NER)
- identify named entities in a running text
- Indications
  - capitalization
  - patterns: am 3. Januar, 12.5.2009
- Techniques
  - grammar based
    - define language-dependent rules, patterns
  - statistical
    - train a model from NE-annotated corpora
25 Named Entities in Translation
- How to translate Named Entities?
  - leave untranslated
    - George Bush ≠ George Strauch
  - translate partially / change format
    - e.g. rule-based pattern substitution of time and money expressions
    - am 3. Januar <-> on January 3rd
    - 300 Dollar <-> $300
    - 12.5.2009 <-> 5/12/2009
    - 4pm <-> 16.00 Uhr
  - fixed equivalence
    - München = Munich
    - United Nations = Vereinte Nationen
  - transcribe (between letter-based scripts)
    - Russian: Антон Чехов, German: Anton Tschechow, English: Anton Chekhov, scientific transcription: Anton Čechov
  - transliterate (between letter-based and character-based scripts)
    - e.g. Microsoft
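Rule-based pattern substitution for the German-to-English direction can be sketched with a small ordered rule list. The rules and the helper `to_12h` are assumptions written for this example and cover only the date, time, and money formats shown above.

```python
import re

def to_12h(match):
    """Convert a matched 24-hour time such as '16.00 Uhr' to '4pm'."""
    hour, minute = int(match.group(1)), match.group(2)
    suffix = "am" if hour < 12 else "pm"
    hour12 = hour % 12 or 12
    return f"{hour12}{suffix}" if minute == "00" else f"{hour12}:{minute}{suffix}"

# (pattern, replacement) pairs, applied in order.
RULES = [
    # day.month.year -> month/day/year
    (re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b"), r"\2/\1/\3"),
    # 24-hour time followed by 'Uhr' -> 12-hour time
    (re.compile(r"\b(\d{1,2})\.(\d{2}) Uhr\b"), to_12h),
    # amount followed by 'Dollar' -> dollar sign before the amount
    (re.compile(r"\b(\d+) Dollar\b"), r"$\1"),
]

def rewrite_entities(text):
    """Apply every substitution rule to the text, in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(rewrite_entities("Treffen am 12.5.2009 um 16.00 Uhr , Preis 300 Dollar"))
```

The date rule runs before the time rule so that the dotted date is already rewritten with slashes when the time pattern is applied.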
26 Transliteration
- Hindi
  - translation: hello
  - transliteration: namaste
- method
  - sounding out
  - transform a word in the source language script into a phonetically identical word in the target language script
  - mapping of source script characters to target script characters that are pronounced similarly → no 1:1 mapping due to different phoneme spaces of different languages
Source of images and Hindi text: Kellner, 2007
27 Transliteration
- Forward Transliteration
  - word does not exist in target language
  - alternative phonetically identical transliterations possible and usually ok
  - e.g. spelling of Indian name: Mina, Minaa, Meena, Meenaa, ...
- Backward Transliteration
  - word exists in target language
  - only one form correct
  - e.g. city names: London (correct), Lundan, Lundon (incorrect)
28 Machine Transliteration Techniques
- 3 basic strategies
  - Manually compiled rules / tables
    - language dependent
    - must be exhaustive
  - Phonetic-based Model
    - source word → source segments → source phonemes → target phonemes → target segments → target word
    - methods based on hand-crafted rules or machine learning
    - requires language-specific linguistic knowledge or data about segmentation and pronunciation
  - Direct Orthographic Model
    - direct mapping: source segments → target segments
- New approach: Direct Orthographic Modeling with automated segments
  - consider all segmentation possibilities → choose best one based on probabilities
  - determine segments and probabilities during iterative training based on word-transliteration pairs
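The idea of considering all segmentations and choosing the most probable one can be illustrated with a small Viterbi-style search. Everything in `SEGMENT_TABLE` is invented for this example; in the approach described above, the segments and their probabilities would come out of iterative training on word-transliteration pairs, which this sketch does not implement.

```python
import math

# Hypothetical table P(target segment | source segment), made up for illustration.
SEGMENT_TABLE = {
    "m": {"m": 1.0},
    "ee": {"i": 0.9, "ee": 0.1},
    "e": {"e": 0.6, "i": 0.4},
    "n": {"n": 1.0},
    "a": {"a": 1.0},
}

def transliterate(word, table=SEGMENT_TABLE, max_seg_len=2):
    """Search all segmentations of word; return (output, probability)."""
    # best[i] holds (log probability, output) for the best analysis of word[:i].
    best = [(-math.inf, "")] * (len(word) + 1)
    best[0] = (0.0, "")
    for end in range(1, len(word) + 1):
        for start in range(max(0, end - max_seg_len), end):
            segment = word[start:end]
            if segment not in table or best[start][0] == -math.inf:
                continue
            # Segments are scored independently, so the best target per
            # source segment can be picked locally.
            target, p = max(table[segment].items(), key=lambda kv: kv[1])
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + target)
    logp, output = best[-1]
    return (output, math.exp(logp)) if logp > -math.inf else (None, 0.0)

print(transliterate("meena"))
```

With this table the segmentation m|ee|n|a (probability 0.9) beats m|e|e|n|a (probability 0.36), so the two-character segment is preferred.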
29 Preprocessing for Advanced SMT Methods
- Morphological preprocessing
  - compound splitting
    - splitting compound words into their constituent words facilitates translation
    - universitätsgebäude → universität gebäude → university building
  - stemming / lemmatization (of morphologically rich languages)
    - car, cars, car's, cars' → car
    - strip functional morphemes at word endings → stem
      - geh-st → geh
    - use dictionary and morphological inflection rules to derive base form → lemma
      - geh-st → gehen
  - use lemmatized word form for training
    - → less variability, more generalization, increased lexical coverage
  - recreate morphological information after translation
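Compound splitting can be sketched with a frequency-based heuristic: a word is split into two known parts when the geometric mean of their corpus frequencies exceeds the frequency of the unsplit word. The vocabulary counts in `VOCAB`, the linker list, and the restriction to two parts are all assumptions made for this example.

```python
import math

# Hypothetical vocabulary with corpus frequencies.
VOCAB = {"universität": 120, "gebäude": 80, "universitätsgebäude": 1}
LINKERS = ("", "s", "es")  # common German linking elements

def split_compound(word, vocab=VOCAB, min_part=3):
    """Return [head, tail] if a two-way split scores better, else [word]."""
    word = word.lower()
    best, best_score = [word], vocab.get(word, 0)
    for i in range(min_part, len(word) - min_part + 1):
        tail = word[i:]
        for linker in LINKERS:
            if not word[:i].endswith(linker):
                continue
            head = word[:i - len(linker)]  # drop the linking element
            if head in vocab and tail in vocab:
                score = math.sqrt(vocab[head] * vocab[tail])  # geometric mean
                if score > best_score:
                    best, best_score = [head, tail], score
    return best

print(split_compound("Universitätsgebäude"))
```

Words that are frequent on their own stay unsplit, which protects lexicalized compounds from being broken apart.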
30 Preprocessing for Advanced SMT Methods
- POS tagging
  - assign POS tags to each word
    - for instance , the government introduced statewide comparative tests .
    - IN NN , DT NN VVD JJ JJ NNS SENT
  - make use of probability of certain POS sequences, e.g.
    - apply a POS language model
    - use POS as factors in factored translation models
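A POS language model can be as simple as a smoothed bigram model over tag sequences. The sketch below is illustrative only: the two training sequences are made up, add-one smoothing is an assumption chosen for brevity, and `train_pos_bigram` is a hypothetical name.

```python
import math
from collections import Counter

def train_pos_bigram(tag_sequences):
    """Estimate add-one smoothed bigram probabilities over POS tags."""
    bigrams, unigrams, tags = Counter(), Counter(), set()
    for seq in tag_sequences:
        padded = ["<s>"] + seq + ["</s>"]
        tags.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    vocab_size = len(tags)

    def log_prob(seq):
        """Log probability of a tag sequence under the bigram model."""
        padded = ["<s>"] + seq + ["</s>"]
        return sum(
            math.log((bigrams[(p, c)] + 1) / (unigrams[p] + vocab_size))
            for p, c in zip(padded, padded[1:])
        )

    return log_prob

log_prob = train_pos_bigram([["DT", "NN", "VVD"], ["DT", "JJ", "NN", "VVD"]])
print(log_prob(["DT", "NN", "VVD"]) > log_prob(["NN", "DT", "VVD"]))
```

Because the tag vocabulary is small, such a model can use longer n-grams than a word-based language model and still be estimated reliably.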