AMTEXT: Extraction-based MT for Arabic

Slides: 20

Provided by: AlonL

Category:

Tags: amtext | arabic | extractionbased | mt | pattern

more less

Transcript and Presenter's Notes

Title: AMTEXT: Extractionbased MT for Arabic

1
AMTEXT: Extraction-based MT for Arabic

Faculty
Alon Lavie, Jaime Carbonell
Students and Staff
Laura Kieras, Peter Jansen
Informant
Loubna El Abadi

2
Background and Objectives

Full MT of text is problematic:
Requires large amounts of resources, long
development time
Quality of output varies
Analysts are often looking for limited concrete
information within the text → full MT may not be
necessary
Alternative: rather than full MT followed by
extraction, first extract and then translate only
the extracted information
Text Extraction technology has made much progress
in the past decade (TIPSTER, TREC, EELD)
Research Question: Can Extraction-based MT result
in improved accuracy and utility of information
for analysts?

3
Extraction-based MT

Traditional Approach
Develop information extraction capability for the
source language
Runtime: Extractor produces a template of
extracted feature-value information
If desired, English Generator can render the
information in the form of text
Drawback: Adapting extraction technology to a new
foreign language is difficult
Requires significant expertise in the foreign
language
Significant amounts of human development time
Not clear that it is an attractive solution

4
AMTEXT Approach

Attempt to leverage our work on automatic
learning of MT transfer rules
Develop an elicitation corpus specifically
designed for targeted extraction patterns
Learn generalized transfer rules for targeted
extraction patterns from elicitation corpus
Acquire a high-accuracy Named-Entity translation
lexicon and a limited translation lexicon for
targeted vocabulary
Runtime: use partial parser and transfer rules to
translate only the matched portions of SL text

5
AMTEXT Extraction-based MT

[System diagram; components shown: Word-aligned elicited data, Learning Module, Transfer Rules, Source Text, Run Time Transfer System, Partial Parser, Transfer Engine, NE Translation Lexicon, Word Translation Lexicon, Extracted Target Text]
Example transfer rule:
S::S : [NE-P pagash et NE-P TE] -> [NE-P met with
NE-P TE] ((X1::Y1) (X4::Y4) (X5::Y5))

6-9
Elicitation Example (four slides; slide content is image-only)
10
Learning Transfer Rules

Different notion of rule generalization than in
our full XFER approach
Generalize from examples to NEs that play
specific roles in target extraction pattern
Verbs and function words may not be generalized
Example:

English: Sharon will meet with Bush today
Source: sharon yipagesh im bush hayom
Goal Rule:
S::S : [NE-P yipagesh im NE-P TE] -> [NE-P will
meet with NE-P TE] ((X1::Y1) (X4::Y5) (X5::Y6))
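
A minimal sketch of how a rule like the goal rule could be applied at run time, assuming the input has already been tagged with NE-P and TE labels. The `TransferRule` class, `apply_rule` function, and the toy lexicon are hypothetical names introduced for illustration; they are not part of the AMTEXT system. The alignments mirror the rule's constraints (source position 1 to target position 1, 4 to 5, 5 to 6).

```python
from dataclasses import dataclass

# Slot labels taken from the slides; everything else in a pattern is a literal word.
SLOTS = {"NE-P", "TE"}


@dataclass(frozen=True)
class TransferRule:
    """Hypothetical container for one extraction-oriented transfer rule."""
    source: tuple       # source-side pattern
    target: tuple       # target-side pattern
    alignments: tuple   # (source_pos, target_pos) pairs, 1-based as in the slides


GOAL_RULE = TransferRule(
    source=("NE-P", "yipagesh", "im", "NE-P", "TE"),
    target=("NE-P", "will", "meet", "with", "NE-P", "TE"),
    alignments=((1, 1), (4, 5), (5, 6)),
)


def apply_rule(rule, tagged_tokens, translate):
    """Return the target string if tagged_tokens matches rule.source, else None.

    tagged_tokens is a list of (word, tag) pairs; tag is "NE-P", "TE", or None.
    translate maps one source word to its target-language form.
    """
    if len(tagged_tokens) != len(rule.source):
        return None
    for (word, tag), element in zip(tagged_tokens, rule.source):
        if element in SLOTS:
            if tag != element:
                return None      # slot must be filled by a token with that tag
        elif word != element:
            return None          # verbs and function words are not generalized
    output = list(rule.target)
    for src_pos, tgt_pos in rule.alignments:
        word, _tag = tagged_tokens[src_pos - 1]
        output[tgt_pos - 1] = translate(word)
    return " ".join(output)


# Toy lexicon, assumed for this example only.
TOY_LEXICON = {"sharon": "Sharon", "bush": "Bush", "hayom": "today"}

tagged = [("sharon", "NE-P"), ("yipagesh", None), ("im", None),
          ("bush", "NE-P"), ("hayom", "TE")]
result = apply_rule(GOAL_RULE, tagged, lambda w: TOY_LEXICON.get(w, w))
print(result)
```

Keeping the verb and function words as literals reflects the point made on this slide: only the NE and TE positions are generalized.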
11
Acquisition of Named Entity Translation Lexicon

Utilize Fei Huang's work on building Named Entity
Translation Lexicons based on transliteration
models
NE Lexicon will be split into meaningful
sub-categories: PNs, Organizations, Locations,
etc.
NE translation lexicon augmented with NEs from
elicited data
Goal: High coverage and high accuracy
identification of NEs that play a part in the
transfer rules

12
Named Entity Translation Lexicon

English-Arabic lexicon from Fei
Trained on TIDES Newswire Data
7522 entries sorted by transliteration score
Example:

4.51948528108464  XXX  Israel       AsrAAyl
4.05498190544419  XXX  Kabul        kAbwl
3.66368346525326  XXX  Paris        bArys
3.65527347080481  XXX  Afghanistan  AfgAnstAn
3.47030997281853  XXX  Pakistan     bAkstAn
3.23199522148251  XXX  Moscow       mwskw
3.20392400497002  XXX  Arafat       ErfAt
3.13060360328543  XXX  Beirut       byrwt
3.06872591580516  XXX  Russia       rwsyA
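
A rough sketch of loading entries like these into a lookup table. It assumes each entry consists of four whitespace-separated fields in the order score, the `XXX` placeholder, English name, transliterated Arabic form; that field order is a reading of the example, not a documented file format. `NEEntry`, `parse_lexicon`, and `build_lookup` are hypothetical names, and multi-word names would need a different delimiter.

```python
from typing import NamedTuple


class NEEntry(NamedTuple):
    score: float     # transliteration score
    english: str     # English name
    translit: str    # romanized Arabic form


SAMPLE = """\
4.51948528108464 XXX Israel AsrAAyl
4.05498190544419 XXX Kabul kAbwl
3.66368346525326 XXX Paris bArys
"""


def parse_lexicon(text):
    """Parse 'score XXX English translit' lines, highest score first."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        score, _placeholder, english, translit = fields
        entries.append(NEEntry(float(score), english, translit))
    entries.sort(key=lambda e: e.score, reverse=True)
    return entries


def build_lookup(entries, min_score=0.0):
    """Map transliterated form to English, keeping the best-scoring entry."""
    lookup = {}
    for entry in entries:  # entries are already sorted by descending score
        if entry.score >= min_score:
            lookup.setdefault(entry.translit, entry.english)
    return lookup


entries = parse_lexicon(SAMPLE)
lookup = build_lookup(entries)
high_confidence = build_lookup(entries, min_score=4.0)
```

A score threshold such as `min_score` is one simple way to trade coverage for accuracy; the value used here is arbitrary.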
13
Named Entity Identification

NE IdentiFinder for English
Available from BBN
Will be used for identifying English NEs within
elicited data → Arabic NEs from word alignments
NE IdentiFinder for Arabic
Requested from BBN, so far no response
Will use if available, can manage without it
(naïve identification based on NE translation
lexicon)
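
The naive, lexicon-based fallback mentioned above could look roughly like the following sketch: a greedy longest-match lookup of token sequences in the NE translation lexicon. The function name `identify_nes`, the `max_len` parameter, and the filler tokens `w1` and `w2` are assumptions for illustration; the two lexicon entries are taken from the example lexicon.

```python
def identify_nes(tokens, ne_lexicon, max_len=3):
    """Greedy longest-match NE lookup.

    Returns a list of (start, end, english) spans, where tokens[start:end]
    was found in ne_lexicon (keys are space-joined source tokens).
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])
            if key in ne_lexicon:
                match = (i, i + n, ne_lexicon[key])
                break
        if match:
            spans.append(match)
            i = match[1]   # continue after the matched span
        else:
            i += 1
    return spans


ne_lexicon = {"ErfAt": "Arafat", "byrwt": "Beirut"}
tokens = ["ErfAt", "w1", "w2", "byrwt"]   # w1, w2 stand in for non-NE words
spans = identify_nes(tokens, ne_lexicon)
```

This approach cannot find NEs that are missing from the lexicon, which is why lexicon coverage matters for it.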

14
Acquisition of Limited Word Translation Lexicon

Vocabulary of interest is limited based on
specific actions and objects that are of interest
→ scopeable on the English side
Elicitation corpus serves as a high-quality
initial source for extracting this translation
lexicon
Statistical word-to-word translation dictionary
from SMT or EBMT can be used as a source for
expanding coverage on the foreign language side
If time/resources permit, experiment with
incorporating expanded vocabulary into transfer
rules
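
One way to picture extracting the limited lexicon from word-aligned elicited data is sketched below: count aligned word pairs whose English side is in the vocabulary of interest and keep the most frequent English word for each source word. The function name, the data layout, and the 0-based alignment links are assumptions made for this example, not a description of the actual elicitation data format.

```python
from collections import Counter, defaultdict


def extract_lexicon(aligned_pairs, english_vocab):
    """Build a source-to-English word lexicon from word-aligned sentence pairs.

    aligned_pairs: iterable of (source_tokens, english_tokens, links), where
    links is a list of (source_index, english_index) pairs (0-based).
    english_vocab: set of English words that are in scope.
    """
    counts = defaultdict(Counter)
    for src_tokens, eng_tokens, links in aligned_pairs:
        for s, t in links:
            if eng_tokens[t] in english_vocab:
                counts[src_tokens[s]][eng_tokens[t]] += 1
    # Keep the most frequently aligned English word for each source word.
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


# Hypothetical word alignment for the example sentence from the slides.
pairs = [(
    ["sharon", "yipagesh", "im", "bush", "hayom"],
    ["Sharon", "will", "meet", "with", "Bush", "today"],
    [(0, 0), (1, 2), (2, 3), (3, 4), (4, 5)],
)]
lexicon = extract_lexicon(pairs, english_vocab={"meet", "today"})
```

Filtering on the English side is what keeps the lexicon scoped: only source words aligned to in-scope English words are retained.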

15
Partial Parsing

Input: Full text in the foreign language
Output: Translation of extracted/matched text
Goal: Extract by effectively matching transfer
rules with the full text
Identify/parse NEs and words in restricted
vocabulary
Identify transfer-rule (source-side) patterns
Handle expected high levels of ambiguity

Example:
Source: Sharon, meluve b-sar ha-xuc shalom, yipagesh im
bush hayom
Identified elements: NE-P, NE-P, NE-P, TE
Output: Sharon will meet with Bush today
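
The matching step can be illustrated with a small sketch that searches tagged text for a rule's source-side pattern while allowing a bounded number of skipped tokens between pattern elements. This is an assumed simplification, not the project's partial parser: the tokenization (lowercased, punctuation dropped), the word-level tags, the `max_gap` parameter, and the function name `find_matches` are all hypothetical. It returns every candidate match, which makes the ambiguity visible rather than resolving it.

```python
SLOTS = {"NE-P", "TE"}   # slot labels from the slides; other elements are literals


def find_matches(pattern, tagged, max_gap=4):
    """Yield index tuples where pattern matches tagged as an in-order subsequence.

    At most max_gap tokens may be skipped between consecutive pattern elements.
    tagged is a list of (word, tag) pairs; tag is "NE-P", "TE", or None.
    """
    def fits(element, token):
        word, tag = token
        return tag == element if element in SLOTS else word == element

    def extend(p_idx, start, chosen):
        if p_idx == len(pattern):
            yield tuple(chosen)
            return
        # The first element may start anywhere; later ones must be close by.
        stop = len(tagged) if p_idx == 0 else min(len(tagged), start + max_gap + 1)
        for i in range(start, stop):
            if fits(pattern[p_idx], tagged[i]):
                yield from extend(p_idx + 1, i + 1, chosen + [i])

    yield from extend(0, 0, [])


pattern = ("NE-P", "yipagesh", "im", "NE-P", "TE")
tagged = [("sharon", "NE-P"), ("meluve", None), ("b-sar", None), ("ha-xuc", None),
          ("shalom", "NE-P"), ("yipagesh", None), ("im", None),
          ("bush", "NE-P"), ("hayom", "TE")]
matches = list(find_matches(pattern, tagged))
```

On this input the sketch finds two candidates, one with "sharon" and one with "shalom" in the subject slot. Choosing between them (the slide's output uses Sharon) would need additional logic that is not shown here.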
16
Scope of Pilot System

Arabic-to-English
Newswire text (available from TIDES)
Limited set of actions: (X meet Y) (X attend Y)
(X hold Y) (X kill Y) (X announce Y)
Limited translation patterns:
<subj-NE> <verb> <obj> <LOC> <TE>
Limited vocabulary

17
Evaluation Plan

Compare AMTEXT approach to full-text
Arabic-to-English SMT, on a limited task of
translation of relations within the scope of
coverage
Establish a test set for evaluation
Define an appropriate metric: Precision/Recall/F1
of relations and entities
Compare performance
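
As an illustration of the proposed metric, the sketch below scores a set of extracted relations against a gold set using exact-match Precision, Recall, and F1. Representing a relation as a `(subject, action, object)` tuple and requiring exact matches are assumptions for this example; the slides do not specify the matching criteria, and the sample relations are made up.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical relations, written as (subject, action, object).
gold = {("Sharon", "meet", "Bush"), ("Arafat", "attend", "summit")}
predicted = {("Sharon", "meet", "Bush"), ("Shalom", "meet", "Bush")}

p, r, f1 = precision_recall_f1(predicted, gold)
```

The same function can be run separately on entities and on relations, and on the output of both systems being compared.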

18
Current Status