Named Entity Transliteration with Comparable Corpora


Title: Named Entity Transliteration with Comparable Corpora


1
Named Entity Transliteration with Comparable
Corpora
  • Richard Sproat, Tao Tao, ChengXiang Zhai
  • University of Illinois at Urbana-Champaign
  • Presented by Ye Shani
  • 2006-12-8

2
Named Entity Transliteration with Comparable
Corpora
  • Given comparable corpora (C1, ..., Cn), where the Ci
    are corpora in different languages
  • Given a name X in Ci, the task is to find
    transliterations of X in C1, ..., Ci-1, Ci+1, ..., Cn
  • Possible applications
  • Cross-lingual entity retrieval
  • Cross-lingual information integration
  • Cross-lingual trend analysis

3
Previous Work
  • Transliteration: Knight & Graehl, 1998; Meng et
    al., 2001; Gao et al., 2004; inter alia.
  • Comparable corpora: Fung, 1995; Rapp, 1995; Tanaka
    and Iwasaki, 1996; Franz et al., 1998; Ballesteros
    and Croft, 1998; Masuichi et al., 2000; Sadat et
    al., 2003; Tao and Zhai, 2005.
  • Mining transliterations from multilingual web
    pages: Zhang & Vines, 2004

4
Method
  • We assume that we have comparable corpora,
    consisting of newspaper articles in English and
    Chinese from the same day, or almost the same
    day.
  • In our experiments we use data from the English
    and Chinese stories from the Xinhua News agency
    for about 6 months of 2001.

5
Method
  • Identify NEs (English names) using the method of Li et
    al. 2001 (based on SNoW; Carlson et al. 1999)
  • Identify NEs (Chinese names) using a list of
    characters that are frequently used for
    transliterating foreign names.
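
As an illustration of the character-list idea for Chinese names, here is a minimal sketch. The character set below is a small hypothetical sample, not the authors' list, and taking maximal runs of such characters (rather than all n-grams) is an assumption made to keep the example short.

```python
import re

# Hypothetical sample of characters often used to transliterate foreign
# names; the actual list used in the work is not given in the slides.
TRANSLIT_CHARS = "阿巴布达德尔夫克拉莱利林罗马姆尼诺帕萨斯特顿维亚伊"


def candidate_names(text, min_len=2):
    """Return maximal runs of transliteration characters of length >= min_len."""
    pattern = "[" + TRANSLIT_CHARS + "]{" + str(min_len) + ",}"
    return re.findall(pattern, text)


# Example sentence: "The US president Clinton met Blair today."
print(candidate_names("美国总统克林顿今天会见了布莱尔。"))
```

Each returned string is only a candidate; whether it really is a name is left to the scoring steps.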

6
Method
  • The general three-step approach:
  • Step 1: Given an English name, identify candidate Chinese
    character n-grams as possible transliterations.
  • Step 2: Score each candidate based on how likely the
    candidate is to be a transliteration of the
    English name. Two initial scoring methods are used:
    phonetic scoring, and the frequency profile of the
    candidate pair over time.
  • Step 3: Propagate scores of all the candidate
    transliteration pairs globally based on their
    co-occurrences in document pairs in the comparable
    corpora.

7
Overview
8
Method 1 Phonetic Transliteration
  • Much work using the source-channel approach
  • Cast as a problem where you have a clean
    source (e.g. a Chinese name) and a noisy
    channel that corrupts the source into the
    observed form (e.g. an English name)
  • P(E|C) P(C)
  • Seek to estimate P(E|C)
  • E.g. P(f_{i,E} f_{i+1,E} f_{i+2,E} ... f_{i+n,E} | s_C)
  • Chinese characters represent syllables (s); we
    match these to sequences of English phonemes (f)
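
For reference, the product P(E|C) P(C) comes from the standard Bayes decomposition used in source-channel models. The derivation below is the textbook form, not something shown on the slide; the symbol \hat{C} (the best Chinese source for an observed English name E) is introduced here for illustration.

```latex
\hat{C} \;=\; \arg\max_{C} P(C \mid E)
        \;=\; \arg\max_{C} \frac{P(E \mid C)\,P(C)}{P(E)}
        \;=\; \arg\max_{C} P(E \mid C)\,P(C)
```

P(E) can be dropped because it does not depend on C.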

9
Phonetic Transliteration Estimation
10
Phonetic Transliteration General Approach
  • Train a transliteration model from a dictionary
    of known transliterations (720 entries)
  • Identify names in English news text for a given
    day using an existing named entity recognizer
  • Process same day of Chinese text looking for
    sequences of characters used in foreign names
  • Do an all-pairs match using the transliteration
    model to find possible transliteration pairs

11
Phonetic Transliteration Some Automatically
Found Pairs
  • Pairs found in same day of newswire text

12
Method 2 Frequency Correlation
  • We pool all documents in a single day to form a
    large pseudo-document.
  • Compute each transliteration candidate's frequency in each
    of those pseudo-documents and obtain a raw
    frequency vector.
  • Normalize the raw frequency vector so that it
    becomes a frequency distribution over all the
    time points (days).
  • Compare the two distributions using the Pearson
    correlation coefficient.
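
The steps above can be sketched as follows. The daily counts are made up for illustration, and the function names are hypothetical; only the normalization and the Pearson coefficient come from the slide.

```python
from math import sqrt


def normalize(counts):
    """Turn raw per-day counts into a frequency distribution over days."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)


# Made-up daily counts for an English name and a Chinese candidate.
english_counts = [1, 7, 0, 2, 9]
chinese_counts = [12, 8, 1, 3, 10]
score = pearson(normalize(english_counts), normalize(chinese_counts))
print(round(score, 3))
```

A candidate whose mentions rise and fall on the same days as the English name gets a score near 1.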

13
Frequency Correlation
  • Example
  • <e1, c1>: <1, 12>, <7, 8>, ...
  • Normalize: <1/sum-e, 12/sum-c>, ...
  • The Pearson correlation coefficient

14
Method 2 Frequency Correlation
16
Method 3 Combining Phonetic and Time Correlation
Methods
  • The two methods exploit complementary resources, so
    combining them may help further.
  • Phonetic filter: use the phonetic model to filter
    out (clearly impossible) candidates and then use
    the frequency correlation method to rank the
    candidates.
  • Score combination: take the mean of the normalized scores
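
Both combination strategies can be sketched in a few lines. The slide does not say how scores are normalized or where the filter cuts off, so the min-max normalization and the `threshold` parameter below are assumptions, as are the function names and the toy scores.

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def combine_by_mean(phonetic, correlation):
    """Score combination: mean of the two normalized scores."""
    p, c = min_max(phonetic), min_max(correlation)
    return {k: (p[k] + c[k]) / 2 for k in phonetic.keys() & correlation.keys()}


def filter_then_rank(phonetic, correlation, threshold):
    """Phonetic filter: drop low phonetic scores, rank the rest by correlation."""
    kept = [k for k, s in phonetic.items() if s >= threshold and k in correlation]
    return sorted(kept, key=lambda k: correlation[k], reverse=True)


# Toy scores for three candidates of one English name.
phonetic = {"a": 0.9, "b": 0.1, "c": 0.5}
correlation = {"a": 0.2, "b": 0.8, "c": 0.6}
print(filter_then_rank(phonetic, correlation, threshold=0.3))
print(combine_by_mean(phonetic, correlation))
```

In the toy data, candidate "b" has the best correlation score but is removed by the phonetic filter, which is the situation the filter is meant to handle.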

17
Method 4 Score Propagation
  • The methods so far score each transliteration
    pair independently
  • But knowing that two transliteration pairs
    co-occur in the same cross-lingual document pair
    should increase our confidence in both
    transliteration pairs
  • Similarly, document pairs that contain lots of
    plausible transliteration pairs are likely
    comparable content-wise
  • Thus, document/transliteration pairs reinforce
    each other

18
Score Propagation
19
Score Propagation
20
Score Propagation
21
Estimate of P(j|i)
  • Two ways:
  • Co-occurrences (CO) in the whole collection
  • Mutual Information (MI)
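
The propagation formula itself is not in this transcript, so the following is only a sketch under stated assumptions: P(j|i) is estimated from co-occurrence counts (the CO option), and each pair's score is updated by a linear interpolation between its own score and the P(j|i)-weighted scores of its neighbours. The weight `alpha`, the iteration count, and all names are hypothetical. The MI variant is not sketched.

```python
def transition_probs(cooc):
    """Estimate P(j|i) from counts.

    cooc[i][j] is the number of document pairs in which transliteration
    pairs i and j co-occur (assumed input format).
    """
    probs = {}
    for i, row in cooc.items():
        total = sum(n for j, n in row.items() if j != i)
        probs[i] = (
            {j: n / total for j, n in row.items() if j != i} if total else {}
        )
    return probs


def propagate(scores, probs, alpha=0.9, iterations=5):
    """Assumed update: w_i <- alpha * w_i + (1 - alpha) * sum_j P(j|i) * w_j."""
    w = dict(scores)
    for _ in range(iterations):
        # The comprehension reads the previous w; the new dict replaces it after.
        w = {
            i: alpha * w[i]
            + (1 - alpha) * sum(p * w[j] for j, p in probs.get(i, {}).items())
            for i in w
        }
    return w


# Toy example: p1 and p2 co-occur in 3 document pairs, p3 is isolated.
scores = {"p1": 1.0, "p2": 0.0, "p3": 0.0}
cooc = {"p1": {"p2": 3}, "p2": {"p1": 3}, "p3": {}}
print(propagate(scores, transition_probs(cooc), alpha=0.5, iterations=1))
```

After one step, "p2" gains score from its strong neighbour "p1", while the isolated "p3" stays at zero.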

22
Evaluation
  • We take one day's worth of comparable news
    articles (234 Chinese stories and 322 English
    stories) from the Chinese and English Gigaword
    corpora (LDC)
  • Generate about 600 English names with the entity
    recognizer (Li et al., 2004)
  • Find potential Chinese names by looking for
    strings of characters that are commonly used in
    transliteration
  • This generates 627 Chinese candidates
  • In principle any of the 600 x 627 pairings could
    be correct
  • Use phonetic and time-correlation methods to rank
    the candidate pairings.
  • Evaluate using Mean Reciprocal Rank (MRR)
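
Mean Reciprocal Rank averages 1/rank of the correct answer over all queries. A minimal sketch follows; the function name, the input format, and the toy rankings are illustrative, and counting a missing answer as 0 is the usual convention rather than something stated on the slide.

```python
def mean_reciprocal_rank(rankings, answers):
    """rankings: English name -> ranked list of Chinese candidates.
    answers: English name -> correct transliteration.
    A name whose answer is not in its ranked list contributes 0."""
    if not rankings:
        return 0.0
    total = 0.0
    for name, ranked in rankings.items():
        gold = answers.get(name)
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(rankings)


rankings = {
    "Clinton": ["克林顿", "布莱尔"],  # correct answer at rank 1
    "Blair": ["克林顿", "布莱尔"],    # correct answer at rank 2
}
answers = {"Clinton": "克林顿", "Blair": "布莱尔"}
print(mean_reciprocal_rank(rankings, answers))  # (1 + 1/2) / 2 = 0.75
```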

23
Evaluation Further Details
  • A small number of English names do not seem to have
    any standard transliteration according to the
    resources that we consulted.
  • Removing these, we have a list of 490 out of the
    original 600 English names
  • Furthermore, about 20 of the answers are not in the
    Chinese candidate list:
  • Either they are really not there
  • Or our candidate selection process missed them.
  • This motivates two scores:
  • AllMRR: using the original list of 600 English
    names
  • CoreMRR: using just those English names whose
    transliterations are also in our Chinese candidate list
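
The difference between the two scores is only which names enter the average. The sketch below assumes that, for AllMRR, a name whose answer is missing from the candidate list contributes a reciprocal rank of 0; the function name and the toy ranks are hypothetical.

```python
def all_and_core_mrr(ranks):
    """ranks: English name -> 1-based rank of its correct transliteration,
    or None when the answer is absent from the Chinese candidate list.
    Returns (AllMRR, CoreMRR)."""
    reciprocal = {name: (1.0 / r if r else 0.0) for name, r in ranks.items()}
    core = [v for name, v in reciprocal.items() if ranks[name] is not None]
    all_mrr = sum(reciprocal.values()) / len(reciprocal) if reciprocal else 0.0
    core_mrr = sum(core) / len(core) if core else 0.0
    return all_mrr, core_mrr


# Toy data: two names are answerable, two have no answer among the candidates.
ranks = {"A": 1, "B": 2, "C": None, "D": None}
print(all_and_core_mrr(ranks))  # (0.375, 0.75)
```

Names with no reachable answer pull AllMRR down regardless of ranking quality, which is why CoreMRR is reported alongside it.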

24
Evaluation
  • Phonetic correspondence yielded an AllMRR score
    of 0.3 and a CoreMRR score of 0.89
  • Time-correlation scores yielded results as
    follows, with different correlation measures

25
Upper Bound Analysis
  • Results when we manually added the correct
    transliterations to the Chinese candidate list

26
Score Propagation (Core)
27
Score Propagation
28
Score Combination
29
Score Propagation (All)
30
Summary
  • We propose several complementary methods for
    transliteration that rely on relatively light
    linguistic resources
  • One of these is a novel score propagation method
  • We show that different methods can be combined to
    further improve performance
  • It is feasible to perform transliteration over
    comparable corpora without much manual effort