Segmentation for English-to-Arabic Statistical Machine Translation

About This Presentation
Title:

Segmentation for English-to-Arabic Statistical Machine Translation

Description:

Title: Building and Optimizing a Broad-Coverage English-Arabic Phrase-Based Statistical Machine Translation System. Author: New1.

Provided by: new1
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes



1
Segmentation for English-to-Arabic Statistical
Machine Translation
  • Ibrahim Badr, Rabih Zbib, James Glass

2
Introduction
  • Experiment on English-to-Arabic SMT.
  • Two domains: news text and spoken travel
    conversations.
  • Explore the effect of Arabic segmentation on
    translation quality.
  • Propose various schemes for recombining the
    segmented Arabic (not trivial!).
  • Apply (basic) factored translation models.

3
Arabic Morphology
  • Arabic is a morphologically rich language.
  • Nouns and adjectives inflect for gender (m, f),
    number (sg, du, pl) and case (Nom, Acc, Gen); all
    combinations are possible.
  • ???? (a player, M), ????? (a player, F),
    ?????? (two players, M),
  • ??????? (two players, F), ?????? (players,
    M,P,Nom), ?????? (players, M,P,Acc or gen)

  • In addition to gender and number, verbs inflect
    for tense, voice, and person.
  • ????? (play, past, plM3P), ?????? (play ,
    present, plM3P), ??????? (played, plM3P)
  • Additional prefixes: conjunction ?, determiner
    ??, prepositions ? (with, in), ? (to, for) ??..
    ?????????
  • Additional suffixes:
  • - possessive pronouns (attach to nouns): ??
    (their), ??? (your, pl, M), ??? (your, pl, F)
  • - object and subject pronouns (attach to
    verbs): ?? (me), ??? (them), ? (they)
  • ??????????
  • Many surface forms sharing the same lemma!

4
Arabic segmentation
  • Use MADA for morphological decomposition of
    Arabic text.
  • (Typical) normalization: ? ? ?, ??? ? ?
  • Two proposed segmentation schemes:
  • S1: Split all clitics mentioned in the previous
    slide, except plural and subject pronoun
    morphemes.
  • S2: Same as S1, but the split clitics are
    glued into one prefix and one suffix:
  • word = prefix + stem + suffix
  • Example
  • ???????? (and for his kids)
  • s1 ? ????? ? ?
  • s2 ?? ????? ?
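
The difference between S1 and S2 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the morphological analysis is assumed to be already available (for instance from a tool such as MADA), the romanized strings `w`, `l`, `AwlAd`, `h` are hypothetical placeholders for the Arabic example above, and the `+` clitic markers and the function names `segment_s1` and `segment_s2` are assumptions made for this sketch.

```python
from typing import List


def segment_s1(prefixes: List[str], stem: str, suffixes: List[str]) -> str:
    """S1-style output: every split clitic becomes its own token.

    The '+' marker (an assumption of this sketch) records on which side
    the clitic attaches to the stem.
    """
    tokens = [p + "+" for p in prefixes] + [stem] + ["+" + s for s in suffixes]
    return " ".join(tokens)


def segment_s2(prefixes: List[str], stem: str, suffixes: List[str]) -> str:
    """S2-style output: clitics glued into at most one prefix and one suffix."""
    tokens = []
    if prefixes:
        tokens.append("".join(prefixes) + "+")
    tokens.append(stem)
    if suffixes:
        tokens.append("+" + "".join(suffixes))
    return " ".join(tokens)


# Hypothetical romanized analysis of "and for his kids":
# conjunction + preposition + stem + possessive pronoun.
analysis = (["w", "l"], "AwlAd", ["h"])
print(segment_s1(*analysis))  # w+ l+ AwlAd +h
print(segment_s2(*analysis))  # wl+ AwlAd +h
```

S2 yields at most three tokens per word, which keeps segmented sentences shorter than under S1.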

5
Arabic Recombination
  • Segmented output needs recombination!
  • Why is it not trivial?
  • a) Letter ambiguity: we normalized ? ?
    ?
  • ? ??? ?????
  • ? ?? ? ???
  • b) Word ambiguity: some words can be
    grammatically recombined in more than
    one way
  • ???? ? 1 ???? 2 ?????
  • Propose two recombination schemes:
  • 1. R: recombination rules defined manually.
  • Resolve (a): pick the most frequent stem form
    in the non-normalized data.
  • Resolve (b): pick the most frequent
    grammatical form.
  • 2. T: build a table derived from the
    training set (surface, decomposed word).
  • If more than one surface form matches, choose
    randomly.
  • Can help in recombining words that were
    segmented incorrectly.
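
A minimal sketch of the table-based scheme T, under stated assumptions: the training set is given as (surface, decomposed word) pairs, the romanized strings are hypothetical placeholders, and the names `build_table` and `recombine_with_table` are invented for this illustration rather than taken from the authors' system.

```python
import random
from collections import defaultdict


def build_table(pairs):
    """Map each decomposed (segmented) form to the set of surface forms
    it was observed with in the training data."""
    table = defaultdict(set)
    for surface, decomposed in pairs:
        table[decomposed].add(surface)
    return table


def recombine_with_table(decomposed, table, rng=random):
    """Return a surface form for a decomposed word, or None if unseen.

    When several surface forms were observed, one is chosen at random,
    as stated on the slide.
    """
    candidates = table.get(decomposed)
    if not candidates:
        return None
    # Sort first so the choice depends only on the random generator,
    # not on set iteration order.
    return rng.choice(sorted(candidates))


# Hypothetical romanized training pairs (surface, decomposed word).
pairs = [
    ("wlAwlAdh", "w+ l+ AwlAd +h"),
    ("ktAbhm", "ktAb +hm"),
]
table = build_table(pairs)
print(recombine_with_table("w+ l+ AwlAd +h", table))  # wlAwlAdh
print(recombine_with_table("Al+ byt", table))         # None (unseen)
```

Because the table stores whatever the segmenter produced on the training data, a word that was segmented incorrectly but consistently still maps back to its original surface form.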

6
Factored Model and Data
  • Factors
  • - Factors on the English side: surface
    form, POS.
  • - Factors on the Arabic side: surface form,
    POS+clitics.
  • - Build a 3-gram LM on the surface form and a
    7-gram LM on the POS+clitics.
  • - Generation model: surface, POS+clitics
    -> surface.
  • Data: newswire and spoken dialogue (travel)
  • - Training data
  • Newswire: LDC; 3M, 1.6M and
    600K words (avg. sentence length: 33 En, 25 Ar, 36 SegAr)
  • Spoken dialogue: IWSLT (2007),
    200K words (avg. sentence length: 9 En, 8 Ar, 10 SegAr)
  • - LM
  • Newswire: 3M-word Ar side plus 30M words from
    Arabic Gigaword
  • Spoken dialogue: 200K-word Ar
    side.
  • - Tuning and test sets (1 En ref)
  • Newswire: 2000 tune, 2000 test
    (chosen randomly, same source as the training data)
  • Spoken dialogue: 500 tune, 500
    test

7
Setup and Recombination
  • Setup
  • Use GIZA++ for alignment (both unseg Ar and seg Ar);
    use MAXPHR 15 for SegAr!
  • Decode using Moses.
  • SRI LM
  • - Newswire: 4-gram (unseg Ar), 6-gram
    (SegAr).
  • - Spoken: 3-gram (unseg Ar), 4-gram
    (SegAr).
  • MERT for tuning, optimizing for BLEU.
  • Define two tuning schemes for SegAr:
  • - T1: use SegAr as the reference.
  • - T2: use UnsegAr as the reference; recombine
    before scoring the n-best list.
  • Recombination Results
  • - Test on the newswire training and test sets
    (sentence error!).
  • - T was trained on the training set.
  • - Baseline: glue prefixes and suffixes.
  • - TR: if the word was seen, use T; else use R.
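
The TR back-off and a sentence-level error measure can be sketched as follows. This is an illustration under assumptions, not the authors' code: clitic tokens are assumed to carry `+` markers, the table is assumed to hold a single surface form per entry, the manual rules R are not specified on the slides so the plain-gluing baseline stands in for them, and "sentence error" is read here as the fraction of sentences whose recombined form differs from the original. All function names and romanized strings are hypothetical.

```python
def group_words(tokens):
    """Group segmented tokens into words: prefixes ('x+'), a stem, suffixes ('+x')."""
    words, current, stem_seen = [], [], False
    for tok in tokens:
        if tok.startswith("+"):          # suffix: attaches to the current word
            current.append(tok)
            continue
        if current and stem_seen:        # a prefix or stem opens a new word
            words.append(current)
            current, stem_seen = [], False
        current.append(tok)
        if not tok.endswith("+"):        # not a prefix, so this is the stem
            stem_seen = True
    if current:
        words.append(current)
    return words


def glue(segmented_word):
    """Baseline: drop the '+' markers and concatenate the pieces."""
    return "".join(tok.strip("+") for tok in segmented_word.split())


def recombine_tr(sentence, table, rules=glue):
    """TR: use the table T for seen words, otherwise fall back to the rules R.

    `rules` defaults to plain gluing only as a stand-in for the manual rules.
    """
    out = []
    for group in group_words(sentence.split()):
        key = " ".join(group)
        out.append(table[key] if key in table else rules(key))
    return " ".join(out)


def sentence_error_rate(originals, recombined):
    """Fraction of sentences not recovered exactly."""
    wrong = sum(o != r for o, r in zip(originals, recombined))
    return wrong / len(originals)


table = {"w+ l+ AwlAd +h": "wlAwlAdh"}   # hypothetical single-entry table
print(recombine_tr("w+ l+ AwlAd +h fy Al+ byt", table))  # wlAwlAdh fy Albyt
```

Under tuning scheme T2, a function like `recombine_tr` would be applied to each n-best hypothesis before it is scored against the unsegmented reference.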

8
Translation Results: News
  • Results for Newswire (BLEU)
  • Segmentation helps, but the gain diminishes as
    the training data size increases (less sparse
    model).
  • Segmentation S2 is slightly better than S1.
  • Tuning scheme T2 performs better than T1.
  • Factored models perform best for the
    largest system (at higher cost!).

9
Translation Results: Spoken Dialogue
  • Results for Spoken dialogue (BLEU)
  • S2 performs slightly better than S1
  • T1 is better than T2
  • Conclusions
  • - Recombination based on both the training data
    and rules performs best.
  • - Segmentation helps, but the gain diminishes
    as the training data size increases .
  • - Recombining the segmented output during
    tuning helps.
  • - Factored models perform best for the large
    system.
  • - What next: explore the effect of syntactic
    reordering on En-to-Ar MT.
  • Syntactic Phrase Reordering for
    English-to-Arabic Statistical Machine
    Translation, Badr et al., EACL 2009.