Stream Decoding for Simultaneous Translation - PowerPoint PPT Presentation

Provided by: muntsi
1
Preprocessing in Statistical Machine Translation
May 19, 2009
2
Outline
  • Bilingual Corpora in SMT
  • properties and types of corpora
  • sentence alignment
  • Preprocessing training corpus and translation
    input
  • basic preprocessing steps
  • tokenization
  • casing
  • abbreviations, numbers, dates, ...
  • named entities
  • transliteration
  • preprocessing for advanced SMT
  • POS tagging
  • compound splitting
  • morphological analysis

3
SMT Architecture
[Diagram labels: Bilingual Corpus, Preprocessing; Monolingual Corpus, Preprocessing]
4
SMT Architecture
[Diagram labels: Preprocessing of the source text; Preprocessing of text corpora; Bilingual Corpus; Monolingual Corpus]
5
Text Corpora in SMT
  • Idea of SMT
  • learn how to translate by analyzing huge
    amounts of sample translations
  • core of SMT system: training corpus of translated
    texts
  • Bilingual corpora
  • collection of bilingual data: documents, texts,
    transcriptions of speech
  • different types
  • different domains / topics (politics, economics,
    literature, ...)
  • modality
  • written (grammatical)
  • spoken (less grammatical, incomplete sentences,
    filler words, stuttering)
  • styles
  • formal (books, papers, law texts, business
    letters, lectures)
  • informal (e-mail, chat, text messages)
  • → type of training corpus defines the type of data
    the SMT system is able to translate (best)

6
Domains in MT systems
  • Most commercial MT systems
  • developed for or adapted to a particular domain
    or task
  • using customer data (translated manuals, business
    letters, ...)
  • MT Research
  • focus on open-domain systems or covering large
    domains, e.g. news
  • special topics: domain detection, automatic
    domain adaptation

7
Parallel Data vs. Comparable Data
  • Parallel Data
  • texts are human translations
  • typically sentence-by-sentence
  • amount of available data restricted
  • human translations expensive
  • proprietary, copyright restricted → translations
    of books owned by publishing companies
  • confidential → business letters, ...
  • Comparable Data
  • bilingual texts tell the same story
  • e.g. newspaper reports about the same event in
    different newspapers, different languages
  • text elements not necessarily in the same order,
    missing parts

8
Parallel Data for SMT
  • Available Data for SMT
  • Official data
  • EU documents are translated into all official EU
    languages
  • European Parliamentary Speeches → Europarl corpus
  • European Laws → Acquis Communautaire
  • UN data available in multiple languages
  • Research Associations collect and provide data
  • Linguistic Data Consortium, European Language
    Resources Association
  • sentence aligned, word aligned,
    syntactically/semantically annotated
  • size of available parallel corpora
  • several hundred million words, e.g. for Chinese,
    Arabic, French, English
  • Europarl corpus: 30-40 million words, depending on
    language

9
Acquiring Data for SMT
  • Web crawling
  • BBC publishes news in 32 languages
  • bilingual web sites in bilingual countries
  • Wikipedia
  • typically comparable corpora that need extensive
    cleaning
  • need to identify corresponding texts

10
Aligning Bilingual Texts
  • depending on the origin of the bilingual corpus
    certain preprocessing steps are necessary
  • Extract text from html or pdf documents
  • Document alignment
  • Sentence segmentation
  • Sentence alignment
  • Document alignment
  • identify matching documents
  • e.g. corresponding html pages in different
    languages
  • identify matching paragraphs
  • e.g. corresponding news stories on bilingual
    websites

11
Sentence Segmentation
  • where to set sentence boundaries
  • trigger sentence segmentation at punctuation
    marks
  • full stop (.), exclamation (!) and question mark
    (?)
  • possibly at semicolon (;) and colon (:)?
  • prevent segmentation after abbreviations
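
The segmentation rule on this slide can be sketched as follows. This is a minimal illustration, not a production segmenter: the abbreviation list and the function name `split_sentences` are assumptions made for the example, and only full stop, exclamation mark and question mark trigger a boundary.

```python
# Hypothetical abbreviation list; a real system would use a larger,
# language-specific one.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr.", "mrs.", "z.b.", "etc."}

def split_sentences(text):
    """Split after tokens ending in . ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! Why?"))
```

Sentence-final abbreviations (a sentence that ends in "etc.") are not handled by this simple lookup.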

12
Sentence Alignment
  • text rarely translated word by word
  • sometimes not even sentence by sentence
  • long sentences might be split up, short ones
    merged
  • not straightforward to identify corresponding
    sentences in a parallel corpus

13
Sentence Alignment Task
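  • given sentences $f_1, \dots, f_{n_f}$ in the foreign language
    and sentences $e_1, \dots, e_{n_e}$ in English
  • sentence alignment $S$: list of sentence pairs $s_1, \dots, s_n$
  • each sentence pair $s_i$ is a pair of sets
  • $s_i = (\{f_{\mathrm{start}_f(i)}, \dots, f_{\mathrm{end}_f(i)}\}, \{e_{\mathrm{start}_e(i)}, \dots, e_{\mathrm{end}_e(i)}\})$
  • restrictions
  • sentences translated in sequence
  • $\mathrm{start}_f(i) = \mathrm{end}_f(i-1) + 1$
  • $\mathrm{start}_e(i) = \mathrm{end}_e(i-1) + 1$
  • $\mathrm{start}_f(1) = 1$
  • $\mathrm{start}_e(1) = 1$
  • $\mathrm{end}_f(n) = n_f$
  • $\mathrm{end}_e(n) = n_e$
  • $\mathrm{start}_f(i) < \mathrm{end}_f(i)$
  • $\mathrm{start}_e(i) < \mathrm{end}_e(i)$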

14
Alignment Strategy
  • different alignment types
  • number of sentences in each set within a sentence
    pair
  • 1-1 (substitution),
  • 1-0 (deletion), 0-1 (insertion),
  • 2-1 (contraction), 1-2 (expansion),
  • 2-2 (merger)
  • Requirements for a full sentence alignment of a
    corpus
  • all sentences need to be covered
  • each sentence can only be part of one sentence
    pair

15
Alignment Strategy
  • Search for sentence alignment $S = s_1, \dots, s_n$
  • fulfilling the requirements of coverage and
    uniqueness
  • optimizing the applied measure of matching quality
    of all its sentence pairs
  • → search the space of possible sentence alignments
    for the highest-scoring one
  • dynamic programming
  • pruning
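
The dynamic-programming search can be sketched as below. The cost function is a stand-in chosen for the example (character-length difference plus a per-type penalty), and the penalty values, `pair_cost` and `align` are hypothetical names and numbers, not taken from the slides. Pruning is omitted.

```python
# Alignment types as (source sentences, target sentences) with a
# hypothetical penalty each (lower is better); values are illustrative only.
TYPE_PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0,
                (2, 1): 2.0, (1, 2): 2.0, (2, 2): 3.0}

def pair_cost(src_group, tgt_group, penalty):
    """Toy cost: difference in total character length plus the type penalty."""
    src_len = sum(len(s) for s in src_group)
    tgt_len = sum(len(t) for t in tgt_group)
    return abs(src_len - tgt_len) / 10.0 + penalty

def align(src, tgt):
    """Return the lowest-cost list of (src_group, tgt_group) pairs covering both texts in order."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in TYPE_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + pair_cost(src[i:ni], tgt[j:nj], penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Trace back from the end to recover the sentence pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

Because every sentence index is consumed exactly once along the traced path, the result satisfies both coverage and uniqueness by construction.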

17
Popular Sentence Alignment Algorithm
  • Gale and Church (1993)
  • 2 components for the match function
  • probability distribution for alignment types
  • distance measure considering the number of
    letters in each of the sentences
  • $\delta = (l_2 - l_1 c) / \sqrt{l_1 s^2}$
  • where $l_1$ and $l_2$ are the sentence lengths in characters,
    $c$ is the number of expected characters in
    L2 per character in L1 and $s^2$ the variance
  • → match function estimated by $\mathrm{Prob}(\mathrm{match} \mid \delta) \propto \mathrm{Prob}(\delta \mid \mathrm{match}) \, \mathrm{Prob}(\mathrm{match})$
  • $\mathrm{Prob}(\delta \mid \mathrm{match}) = 2 \, (1 - \mathrm{Prob}(|\delta|))$, where $\mathrm{Prob}(|\delta|)$ is
    computed by integrating a standard normal
    distribution
  • Other algorithms consider cognates or other
    lexical clues, e.g. using bilingual dictionaries
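
A small sketch of the length-based score, using the statistic as defined by Gale and Church (1993), $\delta = (l_2 - l_1 c)/\sqrt{l_1 s^2}$. The defaults `c = 1.0` and `s2 = 6.8` and the prior 0.89 for 1-1 pairs are values commonly quoted from that paper; treat them as assumptions here. The function names are hypothetical, and `l1` must be greater than zero.

```python
import math

def length_delta(l1, l2, c=1.0, s2=6.8):
    """delta = (l2 - l1 * c) / sqrt(l1 * s2), with l1, l2 in characters (l1 > 0)."""
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

def prob_delta_given_match(delta):
    """2 * (1 - Phi(|delta|)), where Phi is the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

def match_score(l1, l2, prior, c=1.0, s2=6.8):
    """Unnormalized Prob(match | delta) = Prob(delta | match) * Prob(match)."""
    return prob_delta_given_match(length_delta(l1, l2, c, s2)) * prior

# A 1-1 candidate with similar lengths scores higher than one with very different lengths.
print(match_score(100, 104, prior=0.89))
print(match_score(100, 160, prior=0.89))
```

In a full aligner the negative logarithm of this score would typically serve as the cost of a sentence pair inside the dynamic-programming search.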

18
Sentence Alignment Output
  • sentence aligned parallel corpus
  • 1 line = 1 sentence

English side:
line 1: The skyward zoom in food prices is the dominant force behind the speed up in eurozone inflation
line 2: Official forecasts predicted just 3 percent, Bloomberg said.
line x: His performance is delightfully tongue-in-cheek. Essentially, his war reporter Simon Hunt is who Gere could have ended up as, had fate and the film industry not been so kind to him A man whose heyday is long past, but who has preserved considerable shreds of his former charm even as a ruin-like monument.

German side:
line 1: Hauptgrund für den in der Eurozone gemessenen Anstieg der Inflation seien die rasant steigenden Lebensmittelpreise.
line 2: Offizielle Prognosen sind von nur 3 Prozent ausgegangen, meldete Bloomberg.
line x: Er liefert eine wunderbar augenzwinkernde Darstellung Im Grunde genommen ist sein Kriegsreporter Simon Hunt das, was aus Gere hätte werden können, wenn das Schicksal und die Filmbranche nicht so gnädig wären Ein Mann, der seine allerbesten Zeiten lange hinter sich hat, der aber selbst als ruinengleiches Denkmal seines Niedergang noch beträchtliche Reste des einstigen Charmes bewahrt hat.
19
Preprocessing
  • Text format
  • plain text
  • xml annotated text
  • encoding
  • Filter sentences
  • remove sentence pairs containing empty or long
    sentences
  • remove sentence pairs with unreasonable ratio of
    words
  • Cleaning the corpus
  • normalize quotes, clean spaces
  • change special characters
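
The filtering and cleaning steps can be sketched as below. The thresholds (`max_len=80` words, `max_ratio=3.0`) and all function names are assumptions for illustration; suitable values depend on the language pair and corpus.

```python
import re

# Map common typographic quotes to plain ASCII quotes.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u201e": '"',
                        "\u2018": "'", "\u2019": "'"})

def clean(sentence):
    """Normalize quotes and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", sentence.translate(QUOTES)).strip()

def keep_pair(src, tgt, max_len=80, max_ratio=3.0):
    """Reject pairs with an empty side, an overlong side, or an extreme word-count ratio."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if n_src > max_len or n_tgt > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

def filter_corpus(pairs):
    """Clean every sentence pair, then drop the pairs that fail the filters."""
    cleaned = ((clean(s), clean(t)) for s, t in pairs)
    return [(s, t) for s, t in cleaned if keep_pair(s, t)]
```

Pairs are dropped as a whole so that the two sides of the corpus stay line-aligned.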

20
Preprocessing
  • Tokenization
  • segment text (sequence of characters) into words
  • especially challenging for Asian languages → no
    spaces to indicate word boundaries
  • separate punctuation from words
  • The Social Democratic Party's fraction at the
    Bürgerschaft, Hamburg's parliament, accused the
    senate of having wasted precious time.
  • → The Social Democratic Party 's fraction at the
    Bürgerschaft , Hamburg 's parliament , accused
    the senate of having wasted precious time .
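
A minimal regex-based tokenizer that separates punctuation and the English possessive clitic from words. It is a sketch for whitespace-delimited languages only; the pattern and the name `tokenize` are assumptions for the example, and real tokenizers handle many more cases (abbreviations, numbers, URLs).

```python
import re

# Order matters: the clitic 's first, then words (including non-ASCII
# letters such as ü), then any single punctuation character.
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

sentence = ("The Social Democratic Party's fraction at the Bürgerschaft, "
            "Hamburg's parliament, accused the senate of having wasted precious time.")
print(" ".join(tokenize(sentence)))
```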

21
Preprocessing
  • Dealing with lower and upper case in SMT
  • normal text obeys the rules of capitalization
  • train translation model on normalized text
  • benefit from generalization of the vocabulary
  • but also introduces ambiguities: May (month) vs.
    may
  • convert training corpus into lowercase /
    smartcase
  • train recasing model on original and lowercased /
    smartcased text
  • → recreate true case for translated text in
    postprocessing step
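
One very simple way to recreate case is a most-frequent-form lookup, sketched below. This is an assumption made for illustration, not the recasing model the slides refer to; the names `train_recaser` and `recase` are hypothetical. Sentence-initial tokens are skipped during training because their capitalization is positional.

```python
from collections import Counter, defaultdict

def train_recaser(sentences):
    """Map each lowercased word to its most frequent surface form."""
    forms = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:  # skip the sentence-initial token
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}

def recase(lowercased, model):
    """Restore the most frequent casing and capitalize the first token."""
    tokens = [model.get(tok, tok) for tok in lowercased.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)

model = train_recaser(["He lives in Berlin .", "She visited Berlin in spring ."])
print(recase("they moved to berlin .", model))
```

A lookup of this kind cannot resolve the May / may ambiguity, since it ignores context; that is the reason for training a proper recasing model.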

22
Preprocessing
  • Preprocess patterns
  • expand contractions and abbreviations
  • he'll → he will, it's → it is, z.B. → zum
    Beispiel, ...
  • normalize dates and numbers
  • 4.12.01, 04.12.2001, 4. Dez. 2001, 4. Dez. 01 →
    4. Dezember 2001
  • → normalization leads to generalization and
    therefore better coverage
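
The date example can be reproduced with a single regular expression, as in the sketch below. Two assumptions are built in and marked in the code: month abbreviations are the first three letters of the month name, and two-digit years are read as 20xx. The name `normalize_dates` is hypothetical.

```python
import re

MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni",
          "Juli", "August", "September", "Oktober", "November", "Dezember"]
# Assumption: abbreviations are the first three letters of the month name.
ABBREV = {name[:3].lower(): i + 1 for i, name in enumerate(MONTHS)}

# day . month (number or 3-letter abbreviation) . year (4 or 2 digits)
DATE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2}|[A-Za-zä]{3})\.\s?(\d{4}|\d{2})\b")

def normalize_dates(text):
    """Rewrite German date variants as 'D. Monat YYYY'."""
    def expand(match):
        day, month, year = match.groups()
        month_no = int(month) if month.isdigit() else ABBREV.get(month.lower())
        if month_no is None or not 1 <= month_no <= 12:
            return match.group(0)  # leave unrecognized patterns untouched
        if len(year) == 2:
            year = "20" + year  # assumption: two-digit years belong to the 2000s
        return f"{int(day)}. {MONTHS[month_no - 1]} {year}"
    return DATE.sub(expand, text)

for variant in ["4.12.01", "04.12.2001", "4. Dez. 2001", "4. Dez. 01"]:
    print(normalize_dates(variant))
```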

23
Named Entities (NE)
  • specific instance of an object class which is
    referred to by its name
  • Types
  • personal names, organizations, locations,
    temporal phrases, monetary expressions
  • William Bell, Spice Girls, United Nations,
    München <-> Munich, the year 2001, ...
  • Problems
  • most NEs are out-of-vocabulary words
  • should not be translated even if (partly)
    possible
  • unlimited amount → new named entities each day
  • → Named Entities need to be recognized as such
    and treated separately

24
Named Entity Recognition (NER)
  • identify named entities in a running text
  • Indications
  • capitalization
  • patterns: am 3. Januar, 12.5.2009
  • Techniques
  • grammar based
  • define language-dependent rules, patterns
  • statistical
  • train a model from NE-annotated corpora

25
Named Entities in Translation
  • How to translate Named Entities?
  • leave untranslated
  • George Bush, not George Strauch ("Strauch" is German for "bush")
  • translate partially / change format
  • e.g. rule-based pattern substitution of time and
    money expressions
  • am 3. Januar <-> on January 3rd
  • $300 <-> 300 Dollar
  • 12.5.2009 <-> 5/12/2009
  • 4pm <-> 16.00 Uhr
  • fixed equivalence
  • München <-> Munich
  • United Nations <-> Vereinte Nationen
  • transcribe (between letter-based scripts)
  • Russian Антон Чехов: German Anton Tschechow, English
    Anton Chekhov, scientific transcription Anton Čechov
  • transliterate (between letter-based and
    character-based scripts)
  • Microsoft
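
The rule-based pattern substitution for date and time expressions could look like the sketch below, shown for the German-to-English direction. Only two rules are covered (numeric dates and full hours), and the function name `de_to_en_patterns` is hypothetical.

```python
import re

def de_to_en_patterns(text):
    """Rewrite German numeric dates and full-hour times in US English format."""
    # 12.5.2009 -> 5/12/2009 (day.month.year becomes month/day/year)
    text = re.sub(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b", r"\2/\1/\3", text)

    # 16.00 Uhr -> 4pm (full hours only in this sketch)
    def to_12h(match):
        hour = int(match.group(1))
        suffix = "am" if hour < 12 else "pm"
        return f"{hour % 12 or 12}{suffix}"

    return re.sub(r"\b(\d{1,2})\.00 Uhr\b", to_12h, text)

print(de_to_en_patterns("Treffen am 12.5.2009 um 16.00 Uhr"))
```

In a translation pipeline such expressions would usually be replaced by placeholders before decoding and substituted back afterwards, so that the decoder does not alter them.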

26
Transliteration
  • Hindi
  • translation: hello
  • transliteration: namaste
  • method
  • sounding out
  • transform a word in the source language script
    into a phonetically identical word in the target
    language script
  • mapping of source script characters to target
    script characters that are pronounced similarly →
    no 1:1 mapping due to different phoneme spaces of
    different languages

Source of images and Hindi text: Kellner, 2007
27
Transliteration
  • Forward Transliteration
  • word does not exist in target language
  • alternative phonetically identical
    transliterations possible and usually ok
  • e.g. spelling of Indian name: Mina, Minaa,
    Meena, Meenaa, ...
  • Backward Transliteration
  • word exists in target language
  • only one form correct
  • e.g. city names: London, not Lundan or Lundon

28
Machine Transliteration Techniques
  • 3 basic strategies
  • Manually compiled rules / tables
  • language dependent
  • must be exhaustive
  • Phonetic-based Model
  • Source word → source segments → source phonemes →
    target phonemes → target segments → target word
  • methods based on hand-crafted rules or machine
    learning
  • requires language-specific linguistic knowledge
    or data about segmentation and pronunciation
  • Direct Orthographic Model
  • direct mapping: source segments → target segments
  • New approach: Direct Orthographic Modeling with
    automated segments
  • consider all segmentation possibilities → choose
    best one based on probabilities
  • determine segments and probabilities during
    iterative training based on word-transliteration
    pairs
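
Choosing the best segmentation by probability can be sketched as a Viterbi-style search over all ways to cut the source word. The segment table below is a toy assumption (Latin to Latin, with invented probabilities) to keep the example readable; in the approach described on the slide, segments and probabilities would be learned from word-transliteration pairs. Training is not shown, and `transliterate` is a hypothetical name.

```python
import math

# Hypothetical segment table: source segment -> (target segment, probability).
SEGMENTS = {
    "m": ("m", 0.9), "ee": ("i", 0.6), "e": ("e", 0.5),
    "n": ("n", 0.9), "a": ("a", 0.8), "na": ("na", 0.3),
}

def transliterate(word, table=SEGMENTS, max_seg_len=3):
    """Return (target string, probability) of the best segmentation, or None."""
    # best[i] = (log probability, target string) for the best analysis of word[:i]
    best = [None] * (len(word) + 1)
    best[0] = (0.0, "")
    for end in range(1, len(word) + 1):
        for start in range(max(0, end - max_seg_len), end):
            seg = word[start:end]
            if best[start] is None or seg not in table:
                continue
            target, prob = table[seg]
            candidate = (best[start][0] + math.log(prob), best[start][1] + target)
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    if best[-1] is None:
        return None  # no segmentation covers the whole word
    return best[-1][1], math.exp(best[-1][0])

print(transliterate("meena"))
```

With this table the segmentation m|ee|n|a beats m|e|e|n|a and m|ee|na, so the product of segment probabilities decides between competing cuts.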

29
Preprocessing for Advanced SMT Methods
  • Morphological preprocessing
  • compound splitting
  • splitting compound words into their constituent
    words facilitates translation
  • universitätsgebäude → universität gebäude →
    university building
  • stemming / lemmatization (of morphologically rich
    languages)
  • car, cars, car's, cars' → car
  • strip functional morphemes at word endings → stem
  • geh-st → geh
  • use dictionary and morphological inflection rules
    to derive base form → lemma
  • geh-st → gehen
  • use lemmatized word form for training
  • → less variability, more generalization,
    increased lexical coverage
  • recreate morphological information after
    translation
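
A dictionary-based compound splitter for the example above can be sketched as follows. It only tries two-part splits, only handles the linking element "s", and relies on a known-word vocabulary; these simplifications and the name `split_compound` are assumptions for the example.

```python
def split_compound(word, vocabulary, min_part_len=4):
    """Split into two known parts, dropping a linking 's' if needed; return [word] if no split is found."""
    word = word.lower()
    for i in range(min_part_len, len(word) - min_part_len + 1):
        head, tail = word[:i], word[i:]
        if tail not in vocabulary:
            continue
        if head in vocabulary:
            return [head, tail]
        # German compounds often insert a linking 's' between the parts.
        if head.endswith("s") and head[:-1] in vocabulary:
            return [head[:-1], tail]
    return [word]

vocabulary = {"universität", "gebäude"}
print(split_compound("universitätsgebäude", vocabulary))
```

Frequency-based variants choose among several possible splits by comparing corpus counts of the parts; that refinement is not shown here.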

30
Preprocessing for Advanced SMT Methods
  • POS tagging
  • assign POS tags to each word
  • for instance , the government introduced
    statewide comparative tests .
  • IN NN , DT NN VVD JJ JJ NNS SENT
  • make use of probability of certain POS sequences,
    e.g.
  • apply a POS language model
  • use POS as factors in factored translation