Hebrew-to-English XFER MT Project Update

Slides: 19
Provided by: chadtl

1
Hebrew-to-English XFER MT Project - Update
  • Alon Lavie
  • June 2, 2004

2
The Team
  • Alon Lavie
  • Shuly Wintner (Faculty at Haifa Univ.)
  • Yaniv Eytani (MS student at Haifa Univ.)
  • Erik Peterson and Kathrin Probst

4
Main Tasks in Month-1
  • Hebrew Encoding Issues
  • Hebrew Language Resources
  • H-to-E Translation Lexicon
  • Morphological Analyzer
  • Putting together a front-end to the XFER engine:
    morphology, format conversions
  • Elicitation for Hebrew (two versions of EC)
  • Installing system on local server in Haifa

5
Main Tasks in Month-2
  • Improving Hebrew Language Resources
  • H-to-E Translation Lexicon: full spelling,
    reverse dict, compounds, enhanced English side
  • Morphological Analyzer: all analyses, lattice
    representation
  • Manual Transfer Grammar
  • Collecting development and testing data (and
    their reference translations)
  • Development based on small dev-set
  • Evaluation on test data

6
Hebrew Encoding Issues
  • Input texts are (mostly) in standard Windows
    encoding for Hebrew
  • Morphology analyzer and other resources already
    set to work in an ASCII-like representation
  • → A converter script converts the input into the
    ASCII representation
  • All further processing is done in the ASCII
    representation
  • Lexicon and grammar rules are also in ASCII
    representation
  • Elicitation is done in UTF-8 Hebrew, output is
    converted to ASCII representation
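
The conversion step described on this slide can be sketched as follows. This is only an illustration, not the project's converter script: the letter table is an assumed one-to-one transliteration, chosen so that it agrees with the "KTIB XSR" spelling that appears later in the deck, and the symbols used for tet, ayin and shin ("@", "&", "$") are guesses. The function name to_ascii_repr is hypothetical. The sketch assumes that "standard Windows encoding for Hebrew" means code page 1255.

```python
# Assumed transliteration table: one ASCII symbol per Hebrew letter,
# listed in Unicode order from alef (U+05D0) to tav (U+05EA).
# Final forms (final kaf, mem, nun, pe, tsadi) share the symbol of
# their regular form. The symbols "@", "&" and "$" are assumptions.
ASCII_SYMBOLS = "ABGDHWZX@IKKLMMNNS&PPCCQR$T"
HEBREW_TO_ASCII = {chr(0x05D0 + i): sym for i, sym in enumerate(ASCII_SYMBOLS)}


def to_ascii_repr(raw: bytes, encoding: str = "cp1255") -> str:
    """Decode Hebrew bytes and map each letter to its ASCII symbol.

    Characters outside the table (spaces, digits, punctuation) pass
    through unchanged.
    """
    text = raw.decode(encoding)
    return "".join(HEBREW_TO_ASCII.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The Hebrew phrase for "deficient spelling", written with escapes
    # to avoid right-to-left rendering issues in the source file.
    sample = "\u05db\u05ea\u05d9\u05d1 \u05d7\u05e1\u05e8".encode("cp1255")
    print(to_ascii_repr(sample))  # prints "KTIB XSR"
```

UTF-8 elicitation output could go through the same function by passing encoding="utf-8".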

7
Translation Lexicon
  • Dahan H-to-E and E-to-H dictionary available to
    us
  • Excel spreadsheet format (from prev project)
  • Coverage is not great but not bad
  • H-to-E is about 15K translation pairs
  • E-to-H is about 7K translation pairs
  • POS information on both sides
  • No proper names or named entities
  • Issue with spelling convention: KTIB XSR
    (deficient spelling)

8
Translation Lexicon
  • Yaniv wrote scripts that:
    • Extract the relevant fields from the Excel file
    • Extract words in deficient spelling and
      transform into full spelling
    • Extract and specially treat compound nouns
    • Merge with added lexicons (i.e. names)
    • Sort and remove duplicate entries
    • Convert to the XFER lexicon format
  • Kathrin adapted a script that enhances the lexicon
    for English generation (plurals of nouns, tensed
    verb forms)
  • Show portion of full lexicon
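
The merge, sort and de-duplication steps listed on this slide might look roughly like the sketch below. It is not Yaniv's script: the (hebrew, pos, english) tuple layout, the function names build_lexicon and format_entry, and the tab-separated output line are all assumptions made for illustration, since the slides do not show the actual XFER lexicon format.

```python
def build_lexicon(*sources):
    """Merge several entry lists, drop duplicates, and sort.

    Each source is an iterable of (hebrew, pos, english) tuples, with
    the Hebrew side already in the ASCII representation.
    """
    merged = set()
    for entries in sources:
        for hebrew, pos, english in entries:
            # Normalise whitespace and POS case so that trivially
            # different copies of an entry collapse into one.
            merged.add((hebrew.strip(), pos.strip().upper(), english.strip()))
    return sorted(merged)


def format_entry(entry):
    """Render one entry as a line of text.

    Hypothetical layout only; the real XFER lexicon format is not
    given in the slides.
    """
    hebrew, pos, english = entry
    return f"{pos}\t{hebrew}\t{english}"


if __name__ == "__main__":
    dictionary = [("SPR", "N", "book"), ("ILD", "N", "child"), ("SPR", "n", "book ")]
    names = [("XIPH", "PROPN", "Haifa")]
    for entry in build_lexicon(dictionary, names):
        print(format_entry(entry))
```

Keeping the merge as a set of normalised tuples makes the "remove duplicate entries" step fall out of the data structure instead of needing a separate pass.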

9
Morphological Analyzer
  • Morphology is a big deal for Hebrew
  • Not just inflections and derivations, but also:
    • Different words due to omission of vowels from
      the script
    • Attached prefixes for conj, det, prepositions,
      and some attached possessive suffixes
  • Analyzer program from MS student at Technion
    already available, works on Windows and with
    minimal adaptation on Linux
  • Coverage is reasonable
  • Produces all analyses or a disambiguated analysis
    for each word
  • Entire sentence passed as input to morpher (not
    word-by-word)

10
Morphological Processing
  • Split attached prefixes and suffixes into
    separate words for translation
  • Produce f-structures as output
  • Convert feature-value codes to our conventions
  • Install morpher as a server running on our Linux
    machines
  • Yaniv wrote Java scripts to handle input-output
    from the morpher
  • Erik integrated a wrapper for running morpher as
    a server on our Linux machines
  • All-analyses mode: all possible analyses for
    each input word are returned, represented in the
    form of an input lattice

11
Morphology Example
  • Input word: BWRH
  • Lattice positions: 0 1 2 3 4
  • Path 1: BWRH
  • Path 2: B, WR, H
  • Path 3: B, H, WRH

12
Morphology Example
  • Y0 ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
  • Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
  • Y2 ((SPANSTART 1) (SPANEND 3) (LEX WR) (POS N)
    (GEN M) (NUM S) (STATUS ABSOLUTE))
  • Y3 ((SPANSTART 3) (SPANEND 4) (LEX LH) (POS POSS))
  • Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
  • Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
  • Y6 ((SPANSTART 2) (SPANEND 4) (LEX WRH) (POS N)
    (GEN F) (NUM S))
  • Y7 ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS LEX))
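
To make the lattice on this slide concrete, the sketch below stores each analysis as an arc between span positions and enumerates every arc sequence that covers the word from position 0 to position 4. The span boundaries, LEX and POS values are copied from the slide; the tuple layout and the function name enumerate_paths are assumptions for illustration, not part of the XFER engine.

```python
from collections import defaultdict

# (start, end, lex, pos) for arcs Y0..Y7, copied from the slide.
ARCS = [
    (0, 4, "BWRH", "N"),
    (0, 2, "B", "PREP"),
    (1, 3, "WR", "N"),
    (3, 4, "LH", "POSS"),
    (0, 1, "B", "PREP"),
    (1, 2, "H", "DET"),
    (2, 4, "WRH", "N"),
    (0, 4, "BWRH", "LEX"),
]


def enumerate_paths(arcs, start, goal):
    """Return every sequence of arcs leading from start to goal.

    Each arc must begin where the previous one ended, so a path is a
    complete segmentation of the input word.
    """
    by_start = defaultdict(list)
    for arc in arcs:
        by_start[arc[0]].append(arc)

    paths = []

    def extend(node, prefix):
        if node == goal:
            paths.append(prefix)
            return
        for arc in by_start[node]:
            extend(arc[1], prefix + [arc])

    extend(start, [])
    return paths


if __name__ == "__main__":
    for path in enumerate_paths(ARCS, 0, 4):
        print(" + ".join(f"{lex}/{pos}" for _, _, lex, pos in path))
```

Even for this single four-position word the lattice yields several competing segmentations, which is why the number of entries grows quickly over full sentences.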

13
Manual Transfer Grammar
  • Written by Alon in a couple of days
  • Current grammar has 36 rules:
    • 21 NP rules
    • one PP rule
    • 6 verb-complex and VP rules
    • 8 higher-phrase and sentence-level rules
  • Captures the most common (mostly local)
    structural differences between Hebrew and English
  • show portion of grammar

14
Elicitation for Hebrew
  • Erik made sure Elicitation Tool works for Hebrew
  • Various versions of EC used:
    • Two reduced versions of full EC
    • Two versions of Structural EC
  • Shuly and Yaniv translated and aligned a
    substantial portion of both
  • Kathrin trained an initial learned grammar

15
Decoding
  • Strong Decoder for H-to-E
  • Kathrin and Alon adapted a script for running
    Stephan's decoder
  • No real amounts of parallel text, so no
    translation model scores for the edges
  • Kathrin constructed a new English LM for decoding
    the Hebrew-to-English system:
    • 160 million words
    • Includes the English side of our translation lexicon
  • show portion of lattice
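
With no translation-model scores on the edges, the English language model carries the choice between candidate outputs. The toy sketch below illustrates that idea with an add-one smoothed bigram model trained on two made-up sentences. It is not Stephan's decoder and not the 160-million-word LM: the function name train_bigram_lm, the training sentences and the candidates are all invented for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Train an add-one smoothed bigram model; return a scoring function."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])           # counts of history words
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(unigrams) | {"</s>"})

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return logprob


if __name__ == "__main__":
    score = train_bigram_lm([
        "he came from ghana to israel",
        "she worked in hotels in eilat",
    ])
    candidates = ["he came from ghana", "ghana from came he"]
    # With LM scores only, the more fluent word order wins.
    print(max(candidates, key=score))
```

A real decoder would score partial hypotheses while searching the lattice instead of ranking complete strings, but the ranking criterion is the same kind of LM score.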

16
Sample Output (dev-data)
  • maxwell anurpung comes from ghana for israel four
    years ago and since worked in cleaning in hotels
    in eilat
  • a few weeks ago announced if management club
    hotel that for him to leave israel according to
    the government instructions and immigration
    police
  • in a letter in broken english which spread among
    the foreign workers thanks to them hotel for
    their hard work and announced that will purchase
    for hm flight tickets for their countries from
    their money

17
Evaluation Results
  • Test set of 62 sentences from Haaretz newspaper,
    2 reference translations

18
Further Issues
  • Transfer: the XFER engine can no longer handle the
    construction of full lattices (too many
    entries) → we need a pruning mechanism
  • Further improvements in the translation lexicon
    and morphological analyzer
  • Decoding
  • Adding a source-language LM
  • Can we train a translation model?
  • Manual Grammar development
  • Improved grammar learning
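
One simple form the pruning mechanism mentioned above could take is a per-span beam: keep only the few best-scoring entries for each (start, end) span. The sketch below is an assumption about how that might look, not a description of what was built. The arc layout (start, end, text, score), the function name prune_lattice and the scores themselves are hypothetical.

```python
import heapq
from collections import defaultdict


def prune_lattice(arcs, beam=2):
    """Keep at most `beam` highest-scoring arcs for every span.

    Each arc is a (start, end, text, score) tuple; higher scores are
    better. Spans are returned in sorted order.
    """
    by_span = defaultdict(list)
    for arc in arcs:
        start, end, _text, _score = arc
        by_span[(start, end)].append(arc)

    kept = []
    for span in sorted(by_span):
        kept.extend(heapq.nlargest(beam, by_span[span], key=lambda a: a[3]))
    return kept


if __name__ == "__main__":
    arcs = [
        (0, 2, "in the line", -1.0),
        (0, 2, "in line", -2.5),
        (0, 2, "at row", -4.0),
        (2, 3, "her", -0.5),
    ]
    for arc in prune_lattice(arcs, beam=2):
        print(arc)
```

The beam width trades lattice size against the risk of discarding an entry that a later stage would have preferred.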