Named Entity Transliteration with Comparable Corpora presentation

1
Named Entity Transliteration with Comparable
Corpora

2
Named Entity Transliteration with Comparable
Corpora

Given comparable corpora (C_1, ..., C_n), where the C_i
are corpora in different languages
Given a name X in C_i, the task is to find
transliterations of X in C_1, ..., C_{i-1}, C_{i+1}, ..., C_n
Possible applications
Cross-lingual entity retrieval
Cross-lingual information integration
Cross-lingual trend analysis

3
Previous Work

Transliteration: Knight & Graehl, 1998; Meng et
al., 2001; Gao et al., 2004; inter alia.
Comparable corpora: Fung, 1995; Rapp, 1995; Tanaka
and Iwasaki, 1996; Franz et al., 1998; Ballesteros
and Croft, 1998; Masuichi et al., 2000; Sadat et
al., 2003; Tao and Zhai, 2005.
Mining transliterations from multilingual web
pages: Zhang & Vines, 2004

4
Method

We assume that we have comparable corpora,
consisting of newspaper articles in English and
Chinese from the same day, or almost the same
day.
In our experiments we use English and Chinese
stories from the Xinhua News Agency covering
about 6 months of 2001.

5
Method

Identify English NEs (names) using the method of Li et
al. 2001 (based on SNoW; Carlson et al. 1999)
Identify Chinese NEs (names) using a list of
characters that are frequently used for
transliterating foreign names.
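
The character-list idea for Chinese names can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the character set TRANSLIT_CHARS is a tiny hypothetical sample (the actual list is not given in this transcript), and taking maximal runs of listed characters is an assumption about how candidates are cut out of the text.

```python
# Hypothetical sample of characters often used to transliterate foreign
# names; the list used in the presentation is not reproduced here.
TRANSLIT_CHARS = set("克林顿布什斯尔巴拉德里亚")


def candidate_names(text, chars=TRANSLIT_CHARS, min_len=2):
    """Return maximal runs of listed characters with length >= min_len."""
    candidates = []
    run = []
    for ch in text:
        if ch in chars:
            run.append(ch)
        else:
            # A non-listed character ends the current run.
            if len(run) >= min_len:
                candidates.append("".join(run))
            run = []
    # Flush a run that reaches the end of the text.
    if len(run) >= min_len:
        candidates.append("".join(run))
    return candidates
```

For instance, candidate_names("布什和克林顿") splits at the unlisted character and yields two candidates. A real system would probably also consider sub-sequences (n-grams) of each run.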

6
Method

The general three-step approach:
Step 1: Given an English name, identify candidate Chinese
character n-grams as possible transliterations.
Step 2: Score each candidate based on how likely the
candidate is to be a transliteration of the
English name.
Two initial scoring methods:
Phonetic scoring
Frequency profile of the candidate pair over
time
Step 3: Propagate scores of all the candidate
transliteration pairs globally based on their
co-occurrences in document pairs in the comparable
corpora.

7
Overview
8
Method 1: Phonetic Transliteration

Much work uses the source-channel approach
Cast as a problem where you have a clean
source (e.g. a Chinese name) and a noisy
channel that corrupts the source into the
observed form (e.g. an English name)
P(E | C) P(C)
Seek to estimate P(E | C)
E.g. P(f_{i,E} f_{i+1,E} f_{i+2,E} ... f_{i+n,E} | s_C)
Chinese characters represent syllables (s); we
match these to sequences of English phonemes (f)
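
For reference, the source-channel formulation is the standard Bayes' rule decomposition. With E the observed English name and C a candidate Chinese source, the term P(E) is constant across candidates and drops out of the maximization:

```latex
\hat{C} = \arg\max_{C} P(C \mid E)
        = \arg\max_{C} \frac{P(E \mid C)\, P(C)}{P(E)}
        = \arg\max_{C} P(E \mid C)\, P(C)
```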

9
Phonetic Transliteration Estimation
10
Phonetic Transliteration: General Approach

Train a transliteration model from a dictionary
of known transliterations (720 entries)
Identify names in English news text for a given
day using an existing named entity recognizer
Process same day of Chinese text looking for
sequences of characters used in foreign names
Do an all-pairs match using the transliteration
model to find possible transliteration pairs
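
The all-pairs step might look like the sketch below. The function name all_pairs_match and the score callable are hypothetical stand-ins: score(english, chinese) is assumed to return a higher-is-better value from a trained transliteration model, which is not reproduced here.

```python
def all_pairs_match(english_names, chinese_candidates, score, top_k=3):
    """Score every (English, Chinese) pairing and keep the top_k per name."""
    ranked = {}
    for english in english_names:
        scored = [(score(english, chinese), chinese)
                  for chinese in chinese_candidates]
        # Highest score first; the sort is stable, so ties keep input order.
        scored.sort(key=lambda item: item[0], reverse=True)
        ranked[english] = [chinese for _, chinese in scored[:top_k]]
    return ranked
```

The cost is the product of the two list sizes, which is manageable for one day of news text.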

11
Phonetic Transliteration: Some Automatically
Found Pairs

12
Method 2: Frequency Correlation

We pool all documents from a single day to form a
large pseudo-document.
Compute each transliteration's frequency in each
of those pseudo-documents to obtain a raw
frequency vector.
Normalize the raw frequency vector so that it
becomes a frequency distribution over all the
time points (days).
Compare the two distributions of a candidate pair
using the Pearson correlation coefficient.
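
The frequency-correlation steps can be illustrated with a short sketch. The helper names normalize and pearson are ours, and returning 0.0 for a constant vector is an assumption made to keep the example total; the presentation does not say how that case is handled.

```python
import math


def normalize(counts):
    """Turn raw per-day counts into a distribution over days."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a flat profile carries no correlation evidence.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Two names whose daily counts rise and fall together, such as [0, 5, 1, 0] and [0, 10, 2, 0], get a correlation close to 1 after normalization, even though their absolute frequencies differ.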

13
Frequency Correlation

14
Method 2: Frequency Correlation
16
Method 3: Combining Phonetic and Time Correlation
Methods

The two methods exploit complementary resources, so
combining them might help further.
Phonetic filter: use the phonetic model to filter
out clearly impossible candidates, then use
the frequency correlation method to rank the
remaining candidates.
Score combination: take the mean of the normalized scores
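
Both combination strategies can be sketched in a few lines. Several details here are assumptions, since the transcript does not specify them: scores are treated as higher-is-better, normalization is min-max scaling to [0, 1], and the filter uses a fixed threshold. All function names are hypothetical.

```python
def min_max(scores):
    """Rescale a {candidate: score} dict to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {k: 0.0 for k in scores}
    return {k: (v - low) / (high - low) for k, v in scores.items()}


def combine_mean(phonetic, temporal):
    """Score combination: mean of the two normalized scores."""
    p, t = min_max(phonetic), min_max(temporal)
    return {k: (p[k] + t[k]) / 2 for k in p if k in t}


def phonetic_filter_then_rank(phonetic, temporal, threshold):
    """Phonetic filter: drop weak phonetic matches, rank the rest by time correlation."""
    kept = [k for k, v in phonetic.items() if v >= threshold and k in temporal]
    return sorted(kept, key=lambda k: temporal[k], reverse=True)
```

The filter variant lets the phonetic model veto candidates while the time correlation decides the final order. The mean variant lets a strong score from either method compensate for a weak one from the other.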

17
Method 4: Score Propagation

The methods so far score each transliteration
pair independently
But knowing that two transliteration pairs
co-occur in the same cross-lingual document pair
should increase our confidence in both
transliteration pairs
Similarly, document pairs that contain lots of
plausible transliteration pairs are likely
comparable content-wise
Thus, document/transliteration pairs reinforce
each other
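
One way to realize this mutual reinforcement is an iterative update in which each pair keeps a share alpha of its own score and receives the remainder from pairs it co-occurs with. The sketch below is an assumption, not the update rule from the slides (which did not survive in this transcript). Here P(j|i) is estimated from co-occurrence counts over document pairs, and alpha and the iteration count are hypothetical parameters.

```python
from collections import defaultdict


def neighbor_probs(doc_pairs):
    """Estimate P(j|i) from co-occurrence counts.

    doc_pairs: iterable of sets, each holding the ids of the transliteration
    pairs that occur together in one cross-lingual document pair.
    """
    cooc = defaultdict(lambda: defaultdict(int))
    for pairs in doc_pairs:
        for i in pairs:
            for j in pairs:
                if i != j:
                    cooc[i][j] += 1
    probs = {}
    for i, row in cooc.items():
        total = sum(row.values())
        probs[i] = {j: count / total for j, count in row.items()}
    return probs


def propagate(scores, probs, alpha=0.9, iterations=10):
    """Assumed update: w_i <- alpha * w_i + (1 - alpha) * sum_j P(j|i) * w_j."""
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for i, w in current.items():
            neighbor_sum = sum(p * current.get(j, 0.0)
                               for j, p in probs.get(i, {}).items())
            updated[i] = alpha * w + (1 - alpha) * neighbor_sum
        current = updated
    return current
```

Under this rule a pair with no co-occurring neighbors only decays, so two candidates with equal initial scores separate once one of them co-occurs with a well-scored pair.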

18
Score Propagation
19
Score Propagation
20
Score Propagation
21
Estimate of P(j|i)

22
Evaluation

We take one day's worth of comparable news
articles (234 Chinese stories and 322 English
stories) from the Chinese and English Gigaword corpora
(LDC)
Generate about 600 English names with the entity
recognizer (Li et al., 2004)
Find potential Chinese names by looking for
strings of characters that are commonly used in
transliteration
This generates 627 Chinese candidates
In principle any of the 600 x 627 pairings could
be correct
Use phonetic and time-correlation methods to rank
the candidate pairings.
Evaluate using Mean Reciprocal Rank (MRR)
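
As a reminder of how the metric works, here is a minimal sketch of Mean Reciprocal Rank. The data layout (a ranked candidate list per English name and one gold transliteration per name) is an assumption made for illustration; a name whose correct answer never appears in its list contributes 0.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct answer over all names in gold.

    rankings: {english_name: [candidates, best first]}
    gold:     {english_name: correct_transliteration}
    """
    if not gold:
        return 0.0
    total = 0.0
    for name, correct in gold.items():
        ranked = rankings.get(name, [])
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(gold)
```

With three names whose correct answers sit at rank 1, rank 3, and nowhere in the list, the score is (1 + 1/3 + 0) / 3.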

23
Evaluation: Further Details

A small number of English names do not seem to have
any standard transliteration according to the
resources that we consulted.
Removing these, we have a list of 490 out of the
original 600 English names
Furthermore, about 20 of the answers are not in the
Chinese candidate list
Either they are really not there,
or our candidate selection process missed them.
This motivates two scores:
AllMRR, using the original list of 600 English
names
CoreMRR, for just those English names whose
transliterations are also in our Chinese candidate list

24
Evaluation

Phonetic correspondence yielded an AllMRR score
of 0.3 and a CoreMRR score of 0.89
Time-correlation scores yielded results as
follows, with different correlation measures

25
Upper Bound Analysis

Results when we manually added the correct
transliterations to the Chinese candidate list

26
Score Propagation (Core)
27
Score Propagation
28
Score Combination
29
Score Propagation (All)
30
Summary

We propose several complementary methods for
transliteration that rely on relatively light
linguistic resources
One of these is a novel score propagation method
We show that different methods can be combined to
further improve performance
It is feasible to perform transliteration over
comparable corpora without much manual effort
