Title: Named Entity Recognition and Transliteration for 50 Languages

1. Named Entity Recognition and Transliteration for 50 Languages
- Richard Sproat, Dan Roth, ChengXiang Zhai, Elabbas Benmamoun, Andrew Fister, Nadia Karlinsky, Alex Klementiev, Chongwon Park, Vasin Punyakanok, Tao Tao, Su-youn Yoon
- University of Illinois at Urbana-Champaign
- http://compling.ai.uiuc.edu/reflex
- The Second Midwest Computational Linguistics Colloquium (MCLC-2005)
- May 14-15
- The Ohio State University
2. General Goals
- Develop multilingual named entity recognition technology, focusing on persons, places, and organizations
- Produce seed rules and (small) corpora for several LCTLs (Less Commonly Taught Languages)
- Develop methods for automatic named entity transliteration
- Develop methods for tracking names in comparable corpora
3. Languages
- Languages for seed rules: Chinese, English, Spanish, Arabic, Hindi, Portuguese, Russian, Japanese, German, Marathi, French, Korean, Urdu, Italian, Turkish, Thai, Polish, Farsi, Hausa, Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto, Amharic, Indonesian, Tagalog, Hungarian, Greek, Czech, Swahili, Somali, Zulu, Bulgarian, Quechua, Berber, Lingala, Catalan, Mongolian, Danish, Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan, Twi, Basque.
- Languages for (small) corpora: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua.
4. Milestones
- Resources for various languages
  - NER seed rules for Armenian, Persian, Swahili, Zulu, Hindi, Russian, Thai
  - Tagged corpora for Chinese, Arabic, Korean
  - Small tagged corpora for Armenian, Persian, Russian (10-20K words)
- Named entity recognition technology
  - Ported NER technology from English to Chinese, Arabic, Russian, and German
- Name transliteration: Chinese-English, Arabic-English, Korean-English
5. Linguistic/Orthographic Issues
- Capitalization
- Word boundaries
- Phonetic vs. orthographic issues in transliteration
6. Named Entity Recognition
7. Multi-lingual Text Annotator
Annotate any word in a sentence by selecting the word and an available category. It is also possible to create new categories.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php

8. Multi-lingual Text Annotator
View text in other encodings. New language encodings are easily added via a simple text-file mapping.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php
9. Motivation for Seed Rules
- "The only supervision is in the form of 7 seed rules (namely, that New York, California and U.S. are locations; that any name containing Mr. is a person; that any name containing Incorporated is an organization; and that I.B.M. and Microsoft are organizations)." - Collins and Singer, 1999
10. Seed Rules: Thai
(Non-Latin script was lost in conversion throughout this deck; each run of ? marks stands for a string in the original script.)
- Something including and to the right of ??? is likely to be a person
- Something including and to the right of ??? is likely to be a person
- Something including and to the right of ?????? is likely to be a person
- Something including and to the right of ?.?. is likely to be a person
- Something including and to the right of ??? is likely to be a person
- Something including and to the right of ???????? is likely to be a person
- Something including and to the right of ?.?. is likely to be a person
- Something including and to the right of ?.?.?. is likely to be a person
- Something including and to the right of ??.?.?. is likely to be a person
- Something including and to the right of ??.?.?. is likely to be a person
- Something including and to the right of ??.?.?. is likely to be a person
- Something including and to the right of ?.?. is likely to be a person
- ?????? ??????? is a person
- ?????? is likely a person
- ??? ??????? is a person
- ?????? ???????? is a person
11. Seed Rules: Thai
- Something including and in between ?????? and ????? is likely to be an organization
- Something including and to the right of ???. is likely to be an organization
- Something including and in between ?????? and ????? (?????) is likely to be an organization
- Something including and in between ???. and (?????) is likely to be an organization
- Something including and to the right of ????????????????? is likely to be an organization
- Something including and to the right of ???. is likely to be an organization
- ????????????????? is an organization
- ??????? is an organization
- ??????? is an organization
- ??????? ?????? is an organization
- ???????????????? is an organization
- ??????????? is an organization
- Something including and to the right of ??????? is likely to be a location
- Something including and to the right of ?. is likely to be a location
- Something including and to the right of ????? is likely to be a location
- Something including and to the right of ???? is likely to be a location
- ????????????? is a location
- ????????? is a location
- ???????? is a location
- ??????? is a location
12. Seed Rules: Armenian
- CityName → CapWord ????? ??????????
- StateName → CapWord ??????
- CountryName1 → CapWord ?????
- PersonName1 → TITLE? FirstName? LastName
- LastName → ?-?.???
- FirstName → FirstName1 | FirstName2
- FirstName1 → ?-?\.
- FirstName2 → ?-?.
- PersonNameForeign → TITLE FirstName? CapWord? CapWord
- PersonAny → PersonName1 | PersonNameForeign
13. Armenian Lexicon
- Lexicon GEODESC
  - ?????????
  - ?????????
- Lexicon PLACEDESC
  - ??????
  - ?????
- Lexicon ORGDESC
  - ?????????
  - ?????
- Lexicon COMPDESC
  - ???????????????
  - ????????????
- Lexicon TITLE
  - ?????
  - ???
14. Seed Rules: Persian
- Lexicon TITLE: ?????????????????????????
- Lexicon OrgDesc: ????????????????????????????? ?????
- Lexicon POSITION: ???? ????????? ????????????????????
- Descriptors for named entities:
  - Lexicon PerDesc: ?????????
  - Lexicon CityDesc: ???????????? ?
  - Lexicon CountryDesc: ????
15. Seed Rules: Swahili
- People Rules (a rough pattern encoding follows below)
  - Something including and to the right of Bw. is likely to be a person.
  - Something including and to the right of Bi. is likely to be a person.
  - A capitalized word to the right of bwana, together with the word bwana, is likely to be a person.
  - A capitalized word to the right of bibi, together with the word bibi, is likely to designate a person.
- Place Rules
  - A capitalized word to the right of a word ending with -jini is likely to be a place.
  - A capitalized word starting with the letter U is likely to be a place.
  - A word ending in -ni is likely to be a place.
  - A sequence of words including and following the capitalized word Uwanja is likely to be a place.
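Since the Swahili people rules are stated in plain English, they translate naturally into patterns. A hedged sketch, assuming a simple regex encoding (the patterns approximate the prose rules and are not the project's actual rule format):

```python
import re

# The four person rules above as illustrative regular expressions.
PERSON_RULES = [
    re.compile(r"\bBw\.\s+\w+"),        # honorific Bw. + following word
    re.compile(r"\bBi\.\s+\w+"),        # honorific Bi. + following word
    re.compile(r"\bbwana\s+[A-Z]\w*"),  # bwana + capitalized word
    re.compile(r"\bbibi\s+[A-Z]\w*"),   # bibi + capitalized word
]

def find_person_candidates(text):
    """Return spans matched by any person seed rule."""
    return [m.group(0) for rule in PERSON_RULES for m in rule.finditer(text)]

print(find_person_candidates("Alikuja bwana Juma na Bi. Amina jana."))
# ['Bi. Amina', 'bwana Juma']
```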
16. Named Entity Recognition
- Identify entities of specific types in text (e.g. people, locations, dates, organizations, etc.)
- After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].
17. Named Entity Recognition
- Not an easy problem, since entities:
  - Are inherently ambiguous (e.g. JFK can be both a location and a person, depending on the context)
  - Can appear in various forms (e.g. abbreviations)
  - Can be nested, etc.
  - Are too numerous and constantly evolving (cf. Baayen, H. 2000. Word Frequency Distributions. Kluwer, Dordrecht.)
18. Named Entity Recognition
- Two tasks (sometimes done simultaneously):
  - Identify the named entity phrase boundaries (segmentation)
    - May need to respect constraints:
      - Phrases do not overlap
      - Phrase order
      - Phrase length
  - Classify the phrases (classification); see the BIO sketch below
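One common way to realize the segmentation-plus-classification view (not necessarily this system's internal format) is a BIO encoding. A minimal sketch, with tags chosen for the slide-16 example:

```python
# BIO tagging: B- opens a phrase, I- continues it, O is outside.
tokens = ["Richard", "F.", "America", "joined", "Harvard", "Business", "School"]
tags   = ["B-PER",   "I-PER", "I-PER", "O",     "B-ORG",  "I-ORG",   "I-ORG"]

def decode(tokens, tags):
    """Recover (phrase, label) pairs from a BIO-tagged sequence."""
    phrases, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                phrases.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                phrases.append((" ".join(current), label))
            current, label = [], None
    if current:
        phrases.append((" ".join(current), label))
    return phrases

print(decode(tokens, tags))
# [('Richard F. America', 'PER'), ('Harvard Business School', 'ORG')]
```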
19. Identifying phrase properties with sequential constraints
- View as an inference-with-classifiers problem. Three models [Punyakanok & Roth, NIPS'01]: http://l2r.cs.uiuc.edu/danr/Papers/iwclong.pdf
  - HMMs (most common)
    - HMM with classifiers
  - Conditional models
    - Projection-based Markov model
  - Constraint satisfaction models
    - Constraint satisfaction with classifiers
- Other models proposed:
  - CRF
  - Structured perceptron
- A model comparison in the context of the SRL problem: [Punyakanok et al., IJCAI'05]
20. Adaptation
- Most approaches to NER are targeted toward a specific setting: language, subject, set of tags, etc.
- Labeled data may be hard to acquire for each particular setting
- Trained classifiers tend to be brittle when moved even just to a related subject
- We consider the problem of exploiting the hypothesis learned in one setting to improve learning in another
- Kinds of adaptation that can be considered:
  - Across corpora within a domain
  - Across domains
  - Across annotation methodologies
  - Across languages
21. Adaptation Example
Starting with the Reuters classifier is better than starting from scratch.
- Train on:
  - Reuters plus increasing amounts of NYT
  - No Reuters, just increasing amounts of NYT
- Test on NYT
- Performance on NYT increases quickly as the classifier is trained on examples from NYT
- Starting with an existing classifier trained on a related corpus is better than starting from scratch
[Chart: learning curves for "Trained on Reuters + 1/3 NYT, tested on NYT" vs. "Trained on Reuters, tested on NYT"]
22. Current Architecture - Training
- Input: annotated corpus
- Pre-process annotated corpus
- Extract features
- Train classifier
[Diagram: honorifics, features script, and gazetteers feed the pipeline; italics in the original marked setting-specific, optional resources]
23. Current Architecture - Tagging
- Input: corpus
- Pre-process corpus
- Extract features
- Run NER
- Output: annotated corpus
[Diagram: honorifics, features script, gazetteers, and the trained network file feed the pipeline]
24. Extending Current Architecture to Multiple Settings
[Diagram: one corpus per setting, e.g. Chinese newswire, German biological, English news]
- Choose a setting
- Pre-process, extract features (with FEX), and run NER (SNoW-based)
25. Extending Current Architecture to Multiple Settings: Issues
- For each setting, we need:
  - Honorifics and gazetteers
  - Tuned sentence and word splitters
  - Types of features
  - Tagged training corpus
    - Work is being done to move tags across parallel corpora (if available)
26. Extending Current Architecture to Multiple Settings: Issues
- If parallel corpora are available and one is annotated, we may be able to use Stochastic Inversion Transduction Grammars to move tags across corpora [Wu, Computational Linguistics '97]
  - Parse the bilingual (annotated and unannotated) parallel corpora
  - Use ITGs as a filter to deem sentence/phrase pairs "parallel enough"
  - For those that are, simply move the label from the annotated phrase to the unannotated phrase in the same parse-tree node (see the sketch below)
  - Use the now-tagged examples as a training corpus
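A hedged sketch of the projection step only: once the bilingual parse has paired phrase nodes across the two sentences, labels simply copy across. The `aligned_nodes` pairing and the node fields are hypothetical stand-ins for whatever the ITG parser produces:

```python
# Hypothetical input: pairs of (annotated node, unannotated node), where
# each node is a dict with a token "span" and an optional NE "label".
def project_labels(aligned_nodes):
    """Copy NE labels from annotated parse nodes to their aligned twins."""
    projected = []
    for src, tgt in aligned_nodes:
        if src.get("label"):  # node carries an NE tag in the annotated tree
            projected.append({"span": tgt["span"], "label": src["label"]})
    return projected

pairs = [({"span": (0, 2), "label": "PER"}, {"span": (1, 3)}),
         ({"span": (3, 5), "label": None},  {"span": (4, 6)})]
print(project_labels(pairs))  # [{'span': (1, 3), 'label': 'PER'}]
```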
27. Extending Current Architecture to Multiple Settings
- Baseline experiments with Arabic, German, and Russian
- E.g., for Russian, with no honorifics or gazetteers, features tuned for English, and an imperfect sentence splitter, we still get about 77% precision and 36% recall
- NB: used a small hand-constructed corpus of approx. 15K words, 1,300 NEs (80/20 split)
28. Summary
- Seed rules and corpora for a subset of the 50 languages
- Adapted the English NER system to other languages
- Demonstrated adaptation of the NER system to other settings
- Experimenting with ITGs as a basis for annotation transplantation
29. Methods of Transliteration
30. Comparable Corpora
[Chinese newswire text of the same story; characters lost in conversion]
In the day's other matches, second seed Zhou Mi overwhelmed Ling Wan Ting of Hong Kong, China 11-4, 11-4, Zhang Ning defeated Judith Meulendijks of Netherlands 11-2, 11-9 and third seed Gong Ruina took 21 minutes to eliminate Tine Rasmussen of Denmark 11-1, 11-1, enabling China to claim five quarterfinal places in the women's singles.
31. Transliteration in Comparable Corpora
- Take the newspapers for a given day in any set of languages: many of them will have names in common
- Given a name in one language, find its transliteration in a similar text in another language
- How can we make use of:
  - Linguistic factors, such as similar pronunciations
  - Distributional factors
- Right now we use partly supervised methods (e.g. we assume small training dictionaries)
- We are aiming for largely unsupervised methods (in particular, no training dictionary)
32. Some Comparable Corpora
- We have comparable text corpora (from the LDC) for:
  - English (19M words)
  - Chinese (22M characters)
  - Arabic (8M words)
- Many more such corpora can, in principle, be collected from the web
33. How Chinese Transliteration Works
- About 500 characters tend to be used for foreign words
- These attempt to mimic the pronunciation
- But there are many alternative ways of doing it
34. Transliteration Problem
- Many applications of transliteration have been in machine translation [Knight & Graehl, 1998; Al-Onaizan & Knight, 2002; Gao, 2004]
  - "What's the best translation of this Chinese name?"
- Our problem is slightly different:
  - "Are these two names the same?"
  - Want to be able to reject correspondences
  - Assign zero probability to some cases unseen in the training data
35. Approaches to Transliteration
- Much work uses the source-channel approach (a toy scoring sketch follows below)
  - Cast as a problem where you have a "clean" source (e.g. a Chinese name) and a noisy channel that corrupts the source into the observed form (e.g. an English name)
  - P(E|C) · P(C)
  - E.g. P(f_{i1,E}, f_{i2,E}, ..., f_{in,E} | s_{i,C})
  - Chinese characters represent syllables (s); we match these to sequences of English phonemes (f)
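A minimal sketch of source-channel scoring under these definitions, in log space; the probability tables are invented toy values, not trained estimates:

```python
import math

# Toy source-channel score: log P(E | C) + log P(C).
# The tables below stand in for a trained channel model and source model.
log_p_channel = {("smith", "shi-mi-si"): math.log(0.02)}  # P(E | C)
log_p_source  = {"shi-mi-si": math.log(0.001)}            # P(C)

def score(english, chinese):
    """Return log P(E | C) + log P(C); -inf rejects an unseen pair."""
    lp_e = log_p_channel.get((english, chinese), float("-inf"))
    lp_c = log_p_source.get(chinese, float("-inf"))
    return lp_e + lp_c

print(score("smith", "shi-mi-si"))  # finite: a plausible correspondence
print(score("smith", "bei-jing"))   # -inf: rejected
```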
36. Resources
- Small dictionary of 721 (mostly English) names and their Chinese transliterations
- Large dictionary of about 1.6 million names from the LDC
37. General Approach
- Train a tight transliteration model from a dictionary of known transliterations
- Identify names in English news text for a given day using an existing named entity recognizer
- Process the same day of Chinese text, looking for sequences of characters used in foreign names
- Do an all-pairs match using the transliteration model to find possible transliteration pairs (sketched below)
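A hedged sketch of the final all-pairs step; `translit_score` stands in for the trained transliteration model, and the threshold and stub scorer are invented:

```python
# Score every (English, Chinese) name pair for one day of news;
# keep only pairs the model considers plausible.
def all_pairs(english_names, chinese_names, translit_score, threshold=-10.0):
    pairs = []
    for e in english_names:
        for c in chinese_names:
            s = translit_score(e, c)
            if s > threshold:  # the model can reject with -inf
                pairs.append((e, c, s))
    return sorted(pairs, key=lambda p: -p[2])

# Toy usage with a stub scorer that prefers similar-length strings.
stub = lambda e, c: -abs(len(e) - len(c))
print(all_pairs(["edmonton"], ["aidemengdun", "beijing"], stub))
```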
38. Model Estimation
- Seek to estimate P(e|c), where e is a sequence of words in Roman script and c is a sequence of Chinese characters
- We actually estimate P(e'|c'), where e' is the pronunciation of e and c' is the pronunciation of c
- We decompose the estimate of P(e'|c') as shown below
- Chinese transliteration matches syllables to similar-sounding spans of foreign phones, so the c'_i are syllables and the e'_i are subsequences of the English phone string
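The decomposition itself appeared as a formula that did not survive conversion; a plausible reconstruction from the surrounding definitions (pronunciations split into syllables c'_i matched to phone spans e'_i) is:

```latex
P(e' \mid c') \;\approx\; \prod_{i=1}^{I} P(e'_i \mid c'_i)
```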
39. Model Estimation
- Align phone strings using a modified Sankoff/Kruskal algorithm
- For each Chinese syllable s, allow an English phone string f to correspond just in case the initial of s corresponds to the initial of f some minimum number of times in training
- Smooth probabilities using Good-Turing
- Distribute unseen probability mass over unseen cases non-uniformly, according to a weighting scheme
40. Model Estimation
- We estimate the probability for a given unseen case as shown below, where:
  - P(n_0) is the probability of unseen cases according to the Good-Turing smoothing
  - P(len(e') = m | len(c') = n) is the probability of a Chinese syllable of length n corresponding to an English phone sequence of length m
  - count(len(e') = m) is the type count of phone sequences of length m (estimated from 194,000 pronunciations produced by the Festival TTS system on the XTag dictionary)
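The formula itself was lost in conversion; given the three quantities defined above, a plausible reconstruction of the unseen-case estimate (unseen mass, weighted by the length correspondence, spread over the types of that length) is:

```latex
P(e'_i \mid c'_i) \;=\; P(n_0)\,\cdot\,
\frac{P\bigl(\mathrm{len}(e'_i)=m \mid \mathrm{len}(c'_i)=n\bigr)}
     {\mathrm{count}\bigl(\mathrm{len}(e'_i)=m\bigr)}
```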
41. Some Automatically Found Pairs
[Table omitted: pairs found in the same day of newswire text]

42. Further Pairs
[Table omitted]
43. Time Correlations
- When some major event happens (e.g. the tsunami disaster), it is very likely covered by news articles in multiple languages
- Each event/topic tends to have its own associated vocabulary (e.g. names such as Sri Lanka and India may occur in recent news articles)
- We thus will likely see the frequency of a name such as Sri Lanka peak relative to other time periods, and the pattern is likely the same across languages (a correlation sketch follows below)
- cf. Kay and Röscheisen, CL, 1993; Kupiec, ACL, 1993; Rapp, ACL, 1995; Fung, WVLC, 1995
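A minimal sketch of the frequency-correlation idea: compare a name's daily relative-frequency profile in the two languages with Pearson correlation. The frequency vectors below are invented:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length frequency profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

eng_freq = [0.1, 0.0, 0.8, 0.2, 0.0]  # e.g. "Sri Lanka" in English news, by day
chi_freq = [0.2, 0.1, 0.9, 0.1, 0.0]  # a candidate Chinese name, same days
print(round(pearson(eng_freq, chi_freq), 3))  # high correlation
```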
44. Construct Term Distributions over Time
[Figure omitted: daily frequency profiles for a term]

45. Measure Correlations of English and Chinese Word Pairs
[Figure omitted: a bad correlation (corr = 0.0324) vs. a good correlation (corr = 0.885)]
46. Chinese Transliteration
[Flowchart: English term "Edmonton" → Chinese documents → candidate Chinese names (????, ???, ???, ???, ???, ????) → rank candidates (e.g. ???? 0.96, ??? 0.91, ??? 0.88, ??? 0.75)]
- Methods:
  - Phonetic approach
  - Frequency correlation
  - Combination
47. Evaluation
- English term: Edmonton
- MRR: Mean Reciprocal Rank
- AllMRR: evaluation over all English names
- CoreMRR: evaluation over just the names with a found Chinese correspondence
[Results table omitted]
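A minimal implementation of the metric as defined above; the ranks in the usage example are invented:

```python
# Mean Reciprocal Rank: average of 1/rank of the correct answer per query.
def mrr(ranks):
    """ranks: 1-based rank of the correct answer per query (None = not found,
    contributing 0 to the sum)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

all_ranks = [1, 2, None, 1]            # one query had no correct candidate
print(mrr(all_ranks))                  # AllMRR: over all four queries
core = [r for r in all_ranks if r]
print(mrr(core))                       # CoreMRR: over the three found
```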
48. Summary and Future Work
- So far:
  - Phonetic transliteration models
  - Time correlation between name distributions
- Work in progress:
  - Linguistic models
    - Develop a graphical-model approach to transliteration
    - Semantic aspects of transliteration: in Chinese, female names ending in -ia are transliterated with ? ya rather than ? ya
  - Resource-poor transliteration for any pair of languages
  - Document alignment
    - Coordinated mixture models for document/word-level alignment
49. Graphical Models [Bilmes & Zweig, 2002]
50. Semantic Aspects of Transliteration
- A phonological model doesn't capture semantic/orthographic features of transliteration:
  - Saint, San, Sao use ? sheng "holy"
  - Female names ending in -ia are transliterated with ? ya rather than ? ya
- Such information boosts evidence that two strings are transliterations of each other
- Consider gender. For each character c:
  - Compute the log-likelihood ratio |log(P(f|c)/P(m|c))|
  - Build a decision list ranked by decreasing LLR (a sketch follows below)
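A hedged sketch of the decision-list construction from the LLR formula above; the character inventory and counts are invented (pinyin stand-ins replace the characters lost in conversion):

```python
import math

# Toy counts of how often each character appears in female vs. male names.
counts = {"na": {"f": 95, "m": 5}, "ke": {"f": 10, "m": 90}}

def llr(c):
    """Absolute log-likelihood ratio |log(P(f|c) / P(m|c))|."""
    total = counts[c]["f"] + counts[c]["m"]
    pf, pm = counts[c]["f"] / total, counts[c]["m"] / total
    return abs(math.log(pf / pm))

# Decision list: characters ranked by decreasing LLR.
for c in sorted(counts, key=llr, reverse=True):
    gender = "female" if counts[c]["f"] > counts[c]["m"] else "male"
    print(f"{llr(c):.3f}  {c}  {gender}")
```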
51. Decision List for Gender Classification
41.5833898566 ? male
40.8357601821 ? female
39.064753687 ? female
35.8207097407 ? male
34.7980589928 ? female
34.4008926875 ? female
33.9871287766 ? female
33.9225902555 ? male
26.945842105 ? male
52. Document Alignment
- Basic idea: sum up the correlations of all e-c pairs; use these to find documents paired by relevance
[Diagram: English document E with words e1 ... e|E| linked pairwise to Chinese document C with words c1 ... c|C|]
- Method 1: Expected correlation (ExpCorr)
- Method 2: IDF-weighted correlation (IDFCorr); sketched below
  - Matching two rare words is more surprising, so it should count more
  - IDF (Inverse Document Frequency) penalizes common words
- Method 3: BM25 weighting (TF-IDF)
  - BM25 is a typical retrieval weighting function
  - Repeated occurrences of a word contribute less than the first few occurrences
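A hedged sketch of Method 2 (IDFCorr): an IDF-weighted sum of the word-pair correlations from the time-correlation step. The `corr` function, document frequencies, and corpus size in the toy call are invented:

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: rare words get large weights."""
    return math.log(n_docs / df)

def idf_corr(eng_words, chi_words, corr, df_e, df_c, n_docs):
    """Relevance of an (English doc, Chinese doc) pair: IDF-weighted sum
    of the correlations of all e-c word pairs."""
    return sum(idf(df_e[e], n_docs) * idf(df_c[c], n_docs) * corr(e, c)
               for e in eng_words for c in chi_words)

score = idf_corr(["edmonton"], ["aidemengdun"], corr=lambda e, c: 0.9,
                 df_e={"edmonton": 3}, df_c={"aidemengdun": 5}, n_docs=900)
print(round(score, 2))
```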
53. Document Alignment Evaluation
- Randomly pick 6 English documents
- Retrieve 50 Chinese documents (out of approx. 900) for each English document
- Rank the 300 E-C pairs by each of the 3 methods
- Evaluate relevance by the standard precision metric
[Figure omitted: precision curves for Method 1 (ExpCorr), Method 2 (IDF), and Method 3 (TF-IDF)]
- About 80 of the top 100 pairs of documents are correct
54. Summary and Some Ongoing Work
- Some seed rules and corpora; more in progress
- NER techniques being adapted to other languages
  - Investigating ITGs for annotation transplantation
  - What features to use for various languages?
- Combined phonetic and temporal information in transliteration
  - Semantic/orthographic aspects of transliteration
  - Resource-poor transliteration
- Document alignment
  - Coordinated mixture models
55. Acknowledgments
- National Security Agency contract NBCHC040176, REFLEX (Research on English and Foreign Language Exploitation)
- Language experts thus far:
  - Karine Megerdoomian (Persian, Armenian)
  - Alla Rozovskaya (Russian, Hebrew)
  - Archna Bhatia (Hindi)
  - Brent Henderson (Swahili)
  - Tholani Hlongwa (Zulu)
- Karen Livescu, for much help with GMTK
56. Model Estimation
- Training data was small (721 names), so smoothing is essential
- Align using a small hand-derived set of rules, plus the alignment algorithm of Sankoff & Kruskal
[Table of sample rules omitted]
57. Model Estimation
- For an English phone span to correspond to a Chinese syllable, the initial phone of the English span must have been seen in the training data as corresponding to the initial of the Chinese syllable some minimum number of times
  - For consonant-initial syllables we set the minimum to 4
  - For vowel-initial syllables (since these tend to be more variable in their correspondences) we set the minimum to 1
- We then compute P(e'_i | c'_i) for the seen cases, and smooth for unseen correspondences using Sampson's Simple Good-Turing
58. High Correlation Word-Character Pairs
[Table omitted: English word / Chinese character pairs with their correlation scores]
59. Top-Ranked Chinese Characters (for "swimming" and "Afghan")
60. Next Steps
- Coordinated mixture models for word/document alignment
61. A Coordinated Mixture Model
[Diagram: generating a Chinese document at time t. Coordinated theme pairs (θ1,E, θ1,C), (θ2,E, θ2,C), ..., (θk,E, θk,C) recur across Day 1, Day 2, ..., alongside background distributions θE and θC, generating the ENGLISH and CHINESE document streams]
- Applications of the model
62. Graphical Models [Bilmes & Zweig, 2002]
63. Coordinated Mixture Model
- k coordinated themes, each with:
  - An English word distribution (e.g. swimming 0.04, medal 0.02, men 0.01)
  - A Chinese word distribution (e.g. ? 0.05, ? 0.03, ? 0.01)
  - A date (or alignment) distribution over Day 1, Day 2, ..., with alignments A1, A2, ..., AM
- 2 non-coordinated themes:
  - An English word distribution capturing unaligned English topics
  - An English date distribution capturing uneven topic coverage over time
  - A Chinese word distribution capturing unaligned Chinese topics
  - A Chinese date distribution capturing uneven topic coverage over time
64. Details of the Mixture Model
[Formula omitted: the coordinated mixture model likelihood, with terms for lexical translation and document alignment]