Mining Wiki Resources for Multilingual Named Entity Recognition - PowerPoint PPT Presentation

Provided by: chiayi8

Title: Mining Wiki Resources for Multilingual Named Entity Recognition


1
Mining Wiki Resources for Multilingual Named
Entity Recognition
  • Alexander E. Richman, Patrick Schone

Department of Defense, ACL 2008
Reporter: Chia-Ying Lee; Advisor: Prof. Hsin-Hsi Chen
2
Introduction
  • Using the multilingual Wikipedia to automatically
    create an annotated corpus of text in any given
    language.
  • Languages: French, Ukrainian, Spanish, Polish,
    Russian, and Portuguese.
  • Does not use any non-English linguistic
    resources outside of the Wikimedia domain, nor any
    semantic resources such as WordNet or POS taggers.
  • Use an internally modified variant of BBN's
    IdentiFinder (Bikel et al., 1999), specifically
    modified to emphasize fast text processing,
    called PhoenixIDF.

3
Related Work
  • Toral and Muñoz (2006) used Wikipedia to create
    lists of named entities.
  • Relies on WordNet and needs a manual supervision
    step
  • Kazama and Torisawa (2007) used Wikipedia to
    build entity dictionaries.
  • Relies on a POS tagger
  • Cucerzan (2007) used Wikipedia primarily for
    Named Entity Disambiguation, following the path
    of Bunescu and Pasca (2006)
  • Uses Category information, but is specific to English

4
Wikipedia
  • Multilingual, collaborative encyclopedia on the
    Web which is freely available
  • As of October 2007, there were over 2 million
    articles in English, and 30 languages with at
    least 50,000 articles and another 40 with at
    least 10,000 articles.

5
Wikipedia - feature
  • Article links, links from one article to another
    of the same language.
  • Category links, links from an article to special
    Category pages.
  • Interwiki links, links from an article to a
    presumably equivalent article in another
    language.
  • Redirect pages, short pages which often provide
    equivalent names for an entity
  • Disambiguation pages, a page with little content
    that links to multiple similarly named articles.
  • Example: http://en.wikipedia.org/wiki/FBI
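
The link types above can be told apart directly in raw wikitext. The following is a minimal sketch, assuming older-style wikitext in which interlanguage links appear inline as [[fr:Title]]. The language prefix set and the sample text are illustrative assumptions, not data from the presentation.

```python
import re

# Matches [[target]] or [[target|display text]] in raw wikitext.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Hypothetical, deliberately tiny set of language prefixes for the demo.
LANGUAGE_PREFIXES = {"en", "fr", "es", "uk", "pl", "ru", "pt"}


def classify_links(wikitext):
    """Split wikitext links into article, category, and interwiki targets."""
    links = {"article": [], "category": [], "interwiki": []}
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, rest = target.partition(":")
        if sep and prefix.lower() == "category":
            links["category"].append(rest.strip())
        elif sep and prefix.lower() in LANGUAGE_PREFIXES:
            links["interwiki"].append((prefix.lower(), rest.strip()))
        else:
            links["article"].append(target)
    return links


# Illustrative sample text (not taken from a real article dump).
sample = (
    "The [[Federal Bureau of Investigation|FBI]] is based in "
    "[[Washington, D.C.]] [[Category:Federal law enforcement agencies]] "
    "[[fr:Federal Bureau of Investigation]]"
)
result = classify_links(sample)
```

Redirect and disambiguation pages are not handled here; they would need page-level metadata rather than link parsing.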

6
Training Data Generation
  1. Initial Set-up
  2. English Language Categorization
  3. Multilingual Categorization
  4. The Full System

7
Initial Set-up
  • ACE Named Entity types:
  • PERSON, GPE (Geo-Political Entities),
    ORGANIZATION, VEHICLE, WEAPON, LOCATION,
    FACILITY, DATE, TIME, MONEY, and PERCENT.
  • MUC tags like <ENAMEX TYPE="GPE">Place
    Name</ENAMEX>
  • Process
  • Identifies words and phrases that might represent
    entities.
  • Uses category links and/or interwiki links to
    associate that phrase with an English language
    phrase or set of Categories.
  • Determines the appropriate type of the English
    language data and assumes that the original
    phrase is of the same type.
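
The three process steps can be sketched as a type projection: look up the English equivalent of a phrase, read off its type, and tag the original phrase with it. This is only an illustration under stated assumptions. Both lookup tables are hypothetical stand-ins for data that would come from interwiki links and the English categorization step, and only ENAMEX-style tags are produced.

```python
# Hypothetical lookup tables standing in for real Wikipedia-derived data.
INTERWIKI_TO_ENGLISH = {"Londres": "London", "Allemagne": "Germany"}
ENGLISH_TYPE = {"London": "GPE", "Germany": "GPE"}


def project_type(phrase):
    """Return the entity type of the English equivalent, or None if unknown."""
    english_title = INTERWIKI_TO_ENGLISH.get(phrase)
    return ENGLISH_TYPE.get(english_title)


def tag_phrase(phrase):
    """Wrap a phrase in an ENAMEX tag when a type can be projected onto it."""
    entity_type = project_type(phrase)
    if entity_type is None:
        return phrase  # leave untyped phrases untouched
    return f'<ENAMEX TYPE="{entity_type}">{phrase}</ENAMEX>'


tagged = tag_phrase("Londres")
untagged = tag_phrase("bonjour")
```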

8
English Language Categorization (1)
  • Wiki Useful Category > Key Category Phrase
  • > Disambiguation Pages? > Wiktionary
  • Useful Category
  • Category:Living People -> PERSON
  • Category:Cities in Norway -> GPE
  • Useless Category
  • Category:1912 Establishments, which includes
    articles on Fenway Park (a facility), the
    Republic of China (a GPE), and the Better
    Business Bureau (an organization).
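
A rough sketch of the useful versus useless category distinction follows. The key phrases and the majority vote are assumptions made for illustration; the slides do not list the actual phrases or say how conflicting categories are resolved, and the later fallbacks (disambiguation pages, Wiktionary) are omitted.

```python
from collections import Counter

# Hypothetical key phrases; the slides do not list the actual ones.
KEY_PHRASES = [
    ("living people", "PERSON"),
    ("cities in", "GPE"),
    ("companies", "ORGANIZATION"),
]


def type_from_category(category):
    """Map one category name to an entity type, or None if it is not useful."""
    name = category.lower()
    for phrase, entity_type in KEY_PHRASES:
        if phrase in name:
            return entity_type
    return None


def classify_article(categories):
    """Pick the most frequent type among an article's useful categories."""
    votes = Counter(
        t for t in map(type_from_category, categories) if t is not None
    )
    if not votes:
        return "UNKNOWN"  # only useless categories, e.g. 1912 Establishments
    return votes.most_common(1)[0][0]


person = classify_article(["Category:Living People", "Category:1912 Establishments"])
unknown = classify_article(["Category:1912 Establishments"])
```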

9
English Language Categorization (2)
10
Multilingual Categorization
  • Not all articles have an English equivalent, but
    many of the most useful categories have English
    equivalents.
  • French: Catégorie:Commune des Côtes-d'Armor,
    Catégorie:Ville portuaire de France,
    Catégorie:Port de plaisance, and
    Catégorie:Station balnéaire.
  • English: Category:Communes of Côtes-d'Armor,
    UNKNOWN, Category:Marinas, and
    Category:Seaside resorts
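
The French example above can be reproduced with a simple lookup, where a category without an English interwiki link maps to UNKNOWN. The table below is a hypothetical stand-in holding only the slide's own pairs; a real system would read these links from the Wikipedia dumps.

```python
# Hypothetical interwiki table for category pages, mirroring the slide's example.
CATEGORY_INTERWIKI = {
    "Catégorie:Commune des Côtes-d'Armor": "Category:Communes of Côtes-d'Armor",
    "Catégorie:Port de plaisance": "Category:Marinas",
    "Catégorie:Station balnéaire": "Category:Seaside resorts",
}


def to_english_categories(foreign_categories):
    """Translate categories via interwiki links; unmatched ones become UNKNOWN."""
    return [CATEGORY_INTERWIKI.get(c, "UNKNOWN") for c in foreign_categories]


english = to_english_categories([
    "Catégorie:Commune des Côtes-d'Armor",
    "Catégorie:Ville portuaire de France",
    "Catégorie:Port de plaisance",
    "Catégorie:Station balnéaire",
])
```

The resulting English categories can then be typed with the same rules used for English articles.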

11
The Full System
  • The first pass uses the explicit article links
    within the text.
  • We then search an associated English language
    article, if available, for additional
    information.
  • A second pass checks for multi-word phrases that
    exist as titles of Wikipedia articles.
  • We look for certain types of person and
    organization instances.
  • We perform additional processing for alphabetic
    or space-separated languages, including a third
    pass looking for single word Wikipedia titles.
  • We use regular expressions to locate additional
    entities such as numeric dates.
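
As an illustration of the last step, numeric dates can be located with a single regular expression. The pattern below is an assumption (day/month/year with a four-digit year, or ISO-style year-month-day); the slides do not give the expressions the authors used.

```python
import re

# Assumed formats: 12/10/2007, 12.10.2007, 12-10-2007, or 2007-10-12.
NUMERIC_DATE_RE = re.compile(
    r"\b(?:\d{1,2}[./-]\d{1,2}[./-]\d{4}|\d{4}-\d{2}-\d{2})\b"
)


def find_numeric_dates(text):
    """Return every substring of text that looks like a numeric date."""
    return NUMERIC_DATE_RE.findall(text)


dates = find_numeric_dates("Signed on 12/10/2007 and ratified 2008-01-15.")
```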

12
Evaluation All
  • Wiki test set
  • Three human-annotated newswire test sets:
    Spanish, French, and Ukrainian.

F-score  Spanish  French  Ukrainian  Polish  Portuguese  Russian
ALL      .846     .844    .807       .859    .804        .802
DATE     .925     .910    .848       .891    .861        .822
GPE      .877     .868    .887       .916    .826        .867
ORG      .701     .718    .657       .785    .706        .712
PERSON   .821     .823    .690       .836    .802        .751
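
For reference, an F-score of this kind is usually the harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant, which the slides do not state explicitly, and the counts in the usage line are made up for illustration.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F1 from entity-level counts (assumed variant, see lead-in)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct entities, 20 spurious, 20 missed.
example = f_score(80, 20, 20)
```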
13
Evaluation Spanish (1)
  • Spanish is a substantial, well-developed
    Wikipedia, consisting of more than 290,000
    articles as of October 2007.
  • Newswire: 25,000 words from the ACE 2007 test
    set, manually modified to extended MUC-style
    standards.
  • Wiki test set: 335,000 words.

14
Evaluation Spanish (2)
  • Either Wikipedia is relatively poor in
    organizations, or PhoenixIDF underperforms
    when identifying organizations relative to other
    categories, or both.
  • Traditional Training: PhoenixIDF trained on ACE
    2007 data converted to MUC-style tags.

15
Evaluation French
  • French is one of the largest Wikipedias,
    containing more than 570,000 articles as of October
    2007.
  • Newswire: 25,000 words from Agence France Presse.
  • Wiki test set: 920,000 words.
  • Results are similar to Spanish.

16
Evaluation Ukrainian (1)
  • Ukrainian is a medium-sized Wikipedia with 74,000
    articles as of October 2007.
  • The typical article is shorter and less
    well-linked to other articles than in the French
    or Spanish versions.
  • Newswire: approximately 25,000 words from various
    online news sites covering primarily political
    topics.
  • Wiki test set: 395,000 words.
  • Traditional Training: PhoenixIDF trained on newswire
    data.

17
Evaluation Ukrainian (2)
  • The Ukrainian newswire contained a much higher
    proportion of organizations than the French or
    Spanish versions.
  • The Ukrainian language Wikipedia contains very
    few articles on organizations relative to other
    types

18
Conclusion
  • Wikipedia can be used to create an NER system with
    performance comparable to one developed from
    human-annotated newswire, without requiring any
    linguistic expertise.
  • This level of performance can likely be obtained
    currently in 20-40 languages.
  • A Wikipedia-derived system could be used as a
    supplement to other systems for many more
    languages.
  • An automatically generated entity dictionary is
    embedded in the system.

19
Future Work
  • Automatically generate the list of key words and
    phrases for useful English language categories.
  • The authors also believe performance could be
    improved by using higher order non-English
    categories and better disambiguation.
  • Lists of organizations might be particularly
    useful, and "List of" pages are common in many
    languages.

20
Thank you!