Title: AMTEXT: Extraction-based MT for Arabic
1 AMTEXT: Extraction-based MT for Arabic
- Faculty
- Alon Lavie, Jaime Carbonell
- Students and Staff
- Laura Kieras, Peter Jansen
- Informant
- Loubna El Abadi
2 Background and Objectives
- Full MT of text is problematic
- Requires large amounts of resources and long development time
- Quality of output varies
- Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
- Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
- Text extraction technology has made much progress in the past decade (TIPSTER, TREC, EELD)
- Research question: can extraction-based MT result in improved accuracy and utility of information for analysts?
3 Extraction-based MT
- Traditional approach
- Develop information extraction capability for the source language
- Runtime: extractor produces a template of extracted feature-value information
- If desired, an English generator can render the information in the form of text
- Drawback: adapting extraction technology to a new foreign language is difficult
- Requires significant expertise in the foreign language
- Requires significant amounts of human development time
- Not clear that it is an attractive solution
4 AMTEXT Approach
- Attempt to leverage our work on automatic learning of MT transfer rules
- Develop an elicitation corpus specifically designed for targeted extraction patterns
- Learn generalized transfer rules for targeted extraction patterns from the elicitation corpus
- Acquire a high-accuracy Named-Entity translation lexicon and a limited translation lexicon for targeted vocabulary
- Runtime: use partial parser and transfer rules to translate only the matched portions of SL text
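5 AMTEXT Extraction-based MT
[Diagram: system architecture]
- Learning: Word-aligned elicited data -> Learning Module -> Transfer Rules
- Run time: Source Text -> Run Time Transfer System (Partial Parser, Transfer Engine) -> Extracted Target Text
- Resources used at run time: Transfer Rules, NE Translation Lexicon, Word Translation Lexicon
- Example transfer rule:
S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE] ((X1::Y1) (X4::Y4) (X5::Y5))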
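A minimal sketch of how a transfer engine might apply a rule of this kind: the source side is matched slot by slot, and only the aligned slots (the NEs and the temporal expression) are translated through a lexicon. The class `TransferRule`, the function `apply_rule`, the lexicon contents, and the token `etmol` are illustrative assumptions, not part of the AMTEXT system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    source: tuple      # source-side pattern (category labels and literal words)
    target: tuple      # target-side pattern
    alignments: tuple  # 1-based (source_slot, target_slot) pairs, e.g. X4::Y4 -> (4, 4)


# The example rule from the slide, written out as data.
RULE = TransferRule(
    source=("NE-P", "pagash", "et", "NE-P", "TE"),
    target=("NE-P", "met", "with", "NE-P", "TE"),
    alignments=((1, 1), (4, 4), (5, 5)),
)


def apply_rule(rule, constituents, lexicon):
    """Apply a rule to a list of (label, source_text) pairs.

    Literal words carry their own text as label. Returns the target
    string, or None if the source side of the rule does not match.
    """
    if len(constituents) != len(rule.source):
        return None
    for (label, _), expected in zip(constituents, rule.source):
        if label != expected:
            return None
    output = list(rule.target)
    for src_slot, tgt_slot in rule.alignments:
        _, text = constituents[src_slot - 1]
        # Fall back to the source text when the lexicon has no entry.
        output[tgt_slot - 1] = lexicon.get(text, text)
    return " ".join(output)


# Hypothetical input and lexicon entries, for illustration only.
constituents = [("NE-P", "sharon"), ("pagash", "pagash"), ("et", "et"),
                ("NE-P", "bush"), ("TE", "etmol")]
lexicon = {"sharon": "Sharon", "bush": "Bush", "etmol": "yesterday"}
print(apply_rule(RULE, constituents, lexicon))
```

Unaligned target words ("met", "with") come straight from the rule, which is why verbs and function words need no lexicon entry in this sketch.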
6-9 Elicitation Example
10 Learning Transfer Rules
- Different notion of rule generalization than in our full XFER approach
- Generalize from examples to NEs that play specific roles in the target extraction pattern
- Verbs and function words may not be generalized
- Example:
Sharon will meet with Bush today
sharon yipagesh im bush hayom
- Goal rule:
S::S [NE-P yipagesh im NE-P TE] -> [NE-P will meet with NE-P TE] ((X1::Y1) (X4::Y5) (X5::Y6))
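The generalization step can be sketched as follows, assuming the elicited pair comes with a word alignment and with NE/TE labels on the English side. Labeled tokens are replaced by their category on both sides and kept as alignment constraints; everything else stays lexicalized. The function names, the alignment shown, and the bracketed output notation are assumptions made for illustration, not the project's learning module.

```python
def learn_rule(src_tokens, tgt_tokens, word_alignment, tgt_labels):
    """Generalize one aligned example into a rule.

    word_alignment: 1-based (src_index, tgt_index) pairs.
    tgt_labels: {tgt_index: category} for the target tokens to generalize.
    Assumes each labeled target token is aligned to exactly one source token.
    """
    src_pattern = list(src_tokens)
    tgt_pattern = list(tgt_tokens)
    constraints = []
    for s, t in sorted(word_alignment):
        label = tgt_labels.get(t)
        if label is None:
            continue  # verbs and function words are left as literal words
        src_pattern[s - 1] = label
        tgt_pattern[t - 1] = label
        constraints.append((s, t))
    return src_pattern, tgt_pattern, constraints


def format_rule(src_pattern, tgt_pattern, constraints):
    """Render the rule in a bracketed notation (an assumed format)."""
    aligned = " ".join(f"(X{s}::Y{t})" for s, t in constraints)
    return "S::S [{}] -> [{}] ({})".format(
        " ".join(src_pattern), " ".join(tgt_pattern), aligned)


src = "sharon yipagesh im bush hayom".split()
tgt = "Sharon will meet with Bush today".split()
# Hypothetical word alignment; "yipagesh" covers both "will" and "meet".
alignment = [(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
labels = {1: "NE-P", 5: "NE-P", 6: "TE"}

rule = learn_rule(src, tgt, alignment, labels)
print(format_rule(*rule))
```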
11 Acquisition of Named Entity Translation Lexicon
- Utilize Fei Huang's work on building Named Entity translation lexicons based on transliteration models
- NE lexicon will be split into meaningful sub-categories: PNs, Organizations, Locations, etc.
- NE translation lexicon augmented with NEs from elicited data
- Goal: high coverage and high accuracy identification of NEs that play a part in the transfer rules
12 Named Entity Translation Lexicon
- English-Arabic lexicon from Fei
- Trained on TIDES Newswire Data
- 7522 entries sorted by transliteration score
- Example
4.51948528108464 XXX Israel AsrAAyl
4.05498190544419 XXX Kabul kAbwl
3.66368346525326 XXX Paris bArys
3.65527347080481 XXX Afghanistan AfgAnstAn
3.47030997281853 XXX Pakistan bAkstAn
3.23199522148251 XXX Moscow mwskw
3.20392400497002 XXX Arafat ErfAt
3.13060360328543 XXX Beirut byrwt
3.06872591580516 XXX Russia rwsyA
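A lexicon in this shape could be loaded along the following lines. The sketch assumes one entry per line with four whitespace-separated fields (score, source-script form, English name, transliteration); that layout, the `NEEntry` type, and the score threshold are assumptions, and multi-word English names would need a different delimiter.

```python
from typing import NamedTuple


class NEEntry(NamedTuple):
    score: float
    source_form: str   # shown as XXX in the example above
    english: str
    translit: str


def parse_lexicon(lines):
    """Parse lexicon lines and return entries sorted by descending score."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not fit the assumed layout
        score, source_form, english, translit = fields
        entries.append(NEEntry(float(score), source_form, english, translit))
    entries.sort(key=lambda e: e.score, reverse=True)
    return entries


def build_lookup(entries, min_score=0.0):
    """Map transliteration -> English, keeping the best-scoring entry."""
    lookup = {}
    for entry in entries:  # already in descending score order
        if entry.score >= min_score:
            lookup.setdefault(entry.translit, entry.english)
    return lookup


sample = [
    "4.05498190544419 XXX Kabul kAbwl",
    "4.51948528108464 XXX Israel AsrAAyl",
    "3.66368346525326 XXX Paris bArys",
]
entries = parse_lexicon(sample)
high_confidence = build_lookup(entries, min_score=4.0)
print(high_confidence)
```

Raising `min_score` trades coverage for accuracy, which is the tension the slide's goal of high coverage and high accuracy points at.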
13 Named Entity Identification
- NE Identifinder for English
- Available from BBN
- Will be used for identifying English NEs within elicited data; Arabic NEs then follow from word alignments
- NE Identifinder for Arabic
- Requested from BBN, so far no response
- Will use if available, can manage without it (naïve identification based on NE translation lexicon)
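The naïve fallback could look roughly like this: a greedy longest-match scan of the source tokens against the NE translation lexicon. The function name, the lexicon layout (token tuples mapped to an English form and a category), and the placeholder tokens `w1` and `w2` are hypothetical.

```python
def tag_named_entities(tokens, ne_lexicon, max_len=3):
    """Greedy longest-match NE tagging against a translation lexicon.

    ne_lexicon maps tuples of source tokens to (english, category).
    Returns (start, end, english, category) spans, end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in ne_lexicon:
                match = (i, i + n, *ne_lexicon[key])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Transliterations are taken from the lexicon example; categories are assumed.
ne_lexicon = {
    ("ErfAt",): ("Arafat", "PN"),
    ("bArys",): ("Paris", "LOC"),
}
tokens = ["w1", "ErfAt", "w2", "bArys"]  # w1, w2 stand for non-NE words
print(tag_named_entities(tokens, ne_lexicon))
```

A lookup of this kind misses any name absent from the lexicon and cannot disambiguate names that are also common words, which is where a trained identifier would help.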
14 Acquisition of Limited Word Translation Lexicon
- Vocabulary of interest is limited, based on specific actions and objects that are of interest, so it can be scoped on the English side
- Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
- Statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign language side
- Experiment, if time/resources permit, with incorporating expanded vocabulary into transfer rules
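One way to read the first two bullets: fix the English vocabulary of interest, then collect the source words aligned to it in the elicitation corpus. The sketch below counts aligned pairs and keeps the most frequent English word per source word. The function name, the corpus layout, and the alignment are illustrative assumptions.

```python
from collections import Counter, defaultdict


def extract_lexicon(aligned_corpus, english_vocab):
    """Build a source -> English lexicon restricted to english_vocab.

    aligned_corpus: iterable of (src_tokens, tgt_tokens, alignment),
    where alignment holds 1-based (src_index, tgt_index) pairs.
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, alignment in aligned_corpus:
        for s, t in alignment:
            english = tgt_tokens[t - 1].lower()
            if english in english_vocab:
                counts[src_tokens[s - 1]][english] += 1
    # Keep the most frequently aligned English word for each source word.
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


corpus = [(
    "sharon yipagesh im bush hayom".split(),
    "Sharon will meet with Bush today".split(),
    [(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 6)],  # hypothetical alignment
)]
vocab = {"meet", "with", "today"}
print(extract_lexicon(corpus, vocab))
```

Entries from a statistical dictionary could be merged into the same mapping to expand coverage, with the elicited entries taking priority.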
15 Partial Parsing
- Input: full text in the foreign language
- Output: translation of extracted/matched text
- Goal: extract by effectively matching transfer rules with the full text
- Identify/parse NEs and words in restricted vocabulary
- Identify transfer-rule (source-side) patterns
- Handle expected high levels of ambiguity
- Example:
Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
[Diagram labels marking spans of the source sentence: NE-P, NE-P, NE-P, TE]
Sharon will meet with Bush today
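A rough sketch of the matching step, under the assumption that the input has already been tagged with NE-P and TE labels and that a rule's source side may skip a bounded amount of intervening material. It enumerates every candidate match rather than choosing one, which makes the ambiguity visible. The function name, the gap limit, and the tagging of the example sentence are assumptions.

```python
def find_matches(pattern, tagged, max_gap=6):
    """Find all matches of a source-side pattern in tagged text.

    tagged: list of (label, text); literal words use their text as label.
    Up to max_gap items may be skipped between consecutive pattern slots.
    Returns a list of position lists, one per candidate match.
    """
    results = []

    def extend(slot, start, chosen):
        if slot == len(pattern):
            results.append(list(chosen))
            return
        # The first slot may start anywhere; later slots are gap-limited.
        stop = len(tagged) if slot == 0 else min(len(tagged), start + max_gap + 1)
        for pos in range(start, stop):
            if tagged[pos][0] == pattern[slot]:
                chosen.append(pos)
                extend(slot + 1, pos + 1, chosen)
                chosen.pop()

    extend(0, 0, [])
    return results


# Assumed tagging of the example sentence (punctuation dropped).
tagged = [
    ("NE-P", "Sharon"), ("O", "meluve"), ("O", "b-sar"), ("O", "ha-xuc"),
    ("NE-P", "shalom"), ("yipagesh", "yipagesh"), ("im", "im"),
    ("NE-P", "bush"), ("TE", "hayom"),
]
pattern = ("NE-P", "yipagesh", "im", "NE-P", "TE")

for positions in find_matches(pattern, tagged):
    print([tagged[p][1] for p in positions])
```

On this input the sketch returns two candidates, one with Sharon and one with shalom as the first NE-P. Picking the intended reading would need extra information, such as recognizing the intervening phrase as a parenthetical.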
16 Scope of Pilot System
- Arabic-to-English
- Newswire text (available from TIDES)
- Limited set of actions: (X meet Y), (X attend Y), (X hold Y), (X kill Y), (X announce Y)
- Limited translation patterns
- <subj-NE> <verb> <obj> <LOC> <TE>
- Limited vocabulary
17 Evaluation Plan
- Compare AMTEXT approach to full-text Arabic-to-English SMT, on a limited task of translation of relations within the scope of coverage
- Establish a test set for evaluation
- Define an appropriate metric: Precision/Recall/F1 of relations and entities
- Compare performance
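As one possible concrete form of the metric, relations can be treated as tuples and scored by exact match against a gold set. The tuple layout (action, X, Y) and the sample relations are hypothetical; the actual metric was still to be defined at this stage.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and system output, as (action, X, Y) tuples.
gold = {("meet", "Sharon", "Bush"), ("attend", "Arafat", "summit")}
predicted = {("meet", "Sharon", "Bush"), ("meet", "Shalom", "Bush")}
print(precision_recall_f1(predicted, gold))
```

The same function applies to entities by passing sets of entity strings instead of relation tuples, so both systems under comparison can be scored on identical terms.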
18 Current Status
- Initial small elicitation corpus translated and aligned
- Extraction of elicitation phrases from Penn-TB in advanced stages
- Identifying scope of coverage: relations, actions, translation patterns
- Preliminary NE translation lexicon available
19 Work Plan
- Creation of full elicitation corpus: Nov-03
- Translation/alignment of elicitation corpus: Nov/Dec-03
- Install and integrate BBN English Identifinder: Dec-03
- Acquire initial NE translation lexicon: Dec-03
- Acquire initial word translation lexicon: Dec-03
- Develop and integrate partial parser: Dec-03/Feb-04
- Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
- Integration of preliminary complete system: Feb-04
- Design of evaluation: Feb-04
- System testing and modifications: Feb/Apr-04
- Test-set evaluation: Apr-04