Title: Named Entity Disambiguation on an Ontology Enriched by Wikipedia
International IEEE Conference - RIVF08
- Hien Thanh Nguyen (1), Tru Hoang Cao (2)
- (1) Ton Duc Thang University, Vietnam
- (2) Ho Chi Minh City University of Technology, Vietnam
2. Outline
- Introduction
- Background
- Approach
- Evaluation
- Conclusion
3. Introduction
- Most Web pages present no explicit semantic information about their data and objects.
- The Semantic Web aims to solve this problem by making semantic metadata available in web page content.
- Example: the entity John McCarthy pointing to the homepage of the inventor of the Lisp programming language.
- Entity disambiguation
4. Introduction - Entity disambiguation
- Entity disambiguation is the process of identifying when different references correspond to the same real-world entity (Jorge Cardoso and Amit Sheth).
- Our work aims at detecting named entities in a text and linking them to a given ontology.
5. Introduction - What are Named Entities?
- Named Entities (NEs) are people, organizations, locations, dates, times, monetary amounts, measures, percentages, etc.
- Example:
Ms. Washington's candidacy is being championed
by several powerful lawmakers including her boss,
Chairman John Dingell (D., Mich.) of the House
Energy and Commerce Committee.
6. Introduction - Basic problem in NE
- Many NEs share the same name
- Ambiguity of NE types
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- etc.
- Ambiguity of referent (e.g. Paris may be the capital of France, or a small town in Texas)
7. Introduction - Our contributions are two-fold
- Utilizing ontological concepts, and properties of instances in a specific KB, to automatically generate a corpus of labeled training data
- Exploiting Wikipedia to enrich the training data with new and informative features
- Exploring a range of features extracted from texts, a KB, and Wikipedia
8. Background - Ontology
- The ontology schema defines a taxonomy of classes and properties (relations and attributes)
- The knowledge base contains semantic descriptions, including attributes and relations, of named entities in the real world
9. Background - Wikipedia
- Each article defines an entity or a concept
- Four sources of information
- Title
- Redirect titles
- Categories
- Hyperlinks
- Outlinks vs. Inlinks
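The four information sources above can be pictured as fields of a small record. The sketch below is only an illustration: the `Article` class, the helper names, and the sample data are assumptions made here, not the authors' data model. It shows that outlinks are stored with each article, while inlinks have to be derived by inverting the outlinks of all other articles.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical in-memory view of one Wikipedia article."""
    title: str
    redirects: list = field(default_factory=list)   # redirect titles
    categories: list = field(default_factory=list)
    outlinks: list = field(default_factory=list)    # titles this page links to


def build_inlinks(articles):
    """Invert outlinks: map each title to the titles of pages linking to it."""
    inlinks = defaultdict(set)
    for art in articles:
        for target in art.outlinks:
            inlinks[target].add(art.title)
    return inlinks


def wiki_features(article, inlinks):
    """Collect title, redirects, categories, outlinks and inlinks as features."""
    feats = [article.title]
    feats += article.redirects
    feats += article.categories
    feats += article.outlinks
    feats += sorted(inlinks.get(article.title, ()))
    return feats


# Illustrative sample data only, not taken from a Wikipedia dump.
articles = [
    Article("John McCarthy (computer scientist)",
            redirects=["J. McCarthy"],
            categories=["American computer scientists"],
            outlinks=["Lisp (programming language)"]),
    Article("Lisp (programming language)",
            categories=["Programming languages"],
            outlinks=["John McCarthy (computer scientist)"]),
]
inlinks = build_inlinks(articles)
features = wiki_features(articles[0], inlinks)
```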
10. Background - Wikipedia
11. Approach
- Exploiting terms (i.e. base noun phrases) and named entities co-occurring with the ambiguous name for disambiguation
- Casting the problem as a ranking problem
- Using TF-IDF to calculate similarity and choosing the candidate with the highest score
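A minimal sketch of ranking by TF-IDF similarity follows. The slides do not say which TF-IDF weighting or similarity measure was used, so the smoothed IDF formula, the cosine similarity, and the choice to include the context in the document-frequency counts are all assumptions made here; the candidate ids and token lists are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF weight dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_candidates(context, candidates):
    """Rank candidate ids by similarity between the context and each snippet."""
    ids = list(candidates)
    vecs = tfidf_vectors([context] + [candidates[i] for i in ids])
    scores = {i: cosine(vecs[0], v) for i, v in zip(ids, vecs[1:])}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up context terms and candidate snippets.
context = ["lisp", "artificial", "intelligence", "stanford"]
candidates = {
    "john_mccarthy_1": ["computer", "scientist", "lisp", "artificial", "intelligence"],
    "john_mccarthy_2": ["referee", "football", "league"],
}
ranking = rank_candidates(context, candidates)
best = ranking[0][0]
```

The candidate whose snippet shares the most distinctive terms with the context comes first; a candidate sharing no terms scores zero.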
12. Approach
- Constructing the corpus
- Utilizing classes and properties to generate a snippet for each instance in an ontology
- Feature generation for enriching the representation of those instances
- Analyzing a text for disambiguation and identification of the NEs occurring therein
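Snippet generation from classes and properties could look roughly like the sketch below. The taxonomy, the property names (`locatedIn`, `country`), and the record layout are hypothetical and do not reflect the actual ontology schema; the point is only that an instance's aliases, its class chain, and its property values are concatenated into one piece of text.

```python
def class_chain(cls, parents):
    """Walk up the taxonomy from cls to the root, returning all class names."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = parents.get(cls)
    return chain


def make_snippet(instance, parents):
    """Build a text snippet from an instance's aliases, classes and property values."""
    parts = list(instance["aliases"])
    parts += class_chain(instance["class"], parents)
    for value in instance["properties"].values():
        # A property may hold a single value or a list of values.
        parts += value if isinstance(value, list) else [value]
    return " ".join(parts)


# Hypothetical taxonomy and instance, for illustration only.
parents = {"City": "Location", "Location": "Entity", "Entity": None}
columbia_city = {
    "aliases": ["Columbia"],
    "class": "City",
    "properties": {"locatedIn": "South Carolina", "country": "United States"},
}
snippet = make_snippet(columbia_city, parents)
```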
13. Approach - Construct corpus
14. Approach - Construct corpus
15. Approach - Disambiguation process
- For each ambiguous name
- Looking up candidates
- Extracting base noun phrases in the same sentence and in the headline
- Extracting named entities in the whole text
- Using TF-IDF to rank and choose the candidate with the highest score
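The context-gathering steps above differ mainly in scope: terms come from the sentence and the headline, while named entities come from the whole text. The sketch below illustrates that difference under strong simplifications that are assumptions made here: plain lowercase tokens stand in for base noun phrases (no chunker is used), a fixed name list stands in for NE recognition, and the alias index and news text are made up.

```python
import re


def lookup_candidates(name, alias_index):
    """Return the ontology instance ids registered under this name."""
    return alias_index.get(name.lower(), [])


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def local_context(name, headline, body):
    """Tokens from the headline plus every sentence that mentions the name."""
    sentences = [s for s in split_sentences(body) if name in s]
    words = re.findall(r"[A-Za-z]+", " ".join([headline] + sentences))
    name_tokens = set(name.lower().split())
    return [w.lower() for w in words if w.lower() not in name_tokens]


def global_entities(body, known_names, exclude):
    """Known entity names occurring anywhere in the text (simple string match)."""
    return [n for n in known_names if n != exclude and n in body]


# Made-up input for illustration.
alias_index = {"georgia": ["Georgia_US_state", "Georgia_country"]}
known_names = ["Atlanta", "Savannah", "Tbilisi", "Georgia"]
headline = "Storm closes roads in Georgia"
body = ("Heavy snow hit Atlanta on Monday. "
        "Officials in Georgia closed several highways. "
        "Flights from Savannah were delayed.")

candidates = lookup_candidates("Georgia", alias_index)
local = local_context("Georgia", headline, body)
entities = global_entities(body, known_names, exclude="Georgia")
```

Here `local` only sees the headline and the second sentence, whereas `entities` picks up Atlanta and Savannah from sentences that never mention the ambiguous name.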
16. Approach - An example
17. Evaluation
- Using the KIM Ontology
- 140 news articles from several news agencies
- Focusing on four names: John McCarthy, John Williams, Georgia, and Columbia
- Accuracy is measured as the number of correct assignments of NEs (in text) to ontology instances, divided by the total number of assignments
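The accuracy measure can be written down in a few lines. This is a sketch under the assumption that each assignment maps one mention, identified here by a hypothetical (document id, name) pair, to one ontology instance id; the sample assignments are made up and are not the evaluation data.

```python
def accuracy(predicted, gold):
    """Share of assignments whose predicted instance equals the gold instance."""
    if not predicted:
        return 0.0
    correct = sum(1 for mention, instance in predicted.items()
                  if gold.get(mention) == instance)
    return correct / len(predicted)


# Made-up assignments: (document id, name) -> ontology instance id.
gold = {
    ("doc1", "Georgia"): "Georgia_US_state",
    ("doc2", "Georgia"): "Georgia_country",
    ("doc3", "Columbia"): "Columbia_city",
    ("doc4", "Columbia"): "Columbia_university",
}
predicted = dict(gold)
predicted[("doc4", "Columbia")] = "Columbia_city"  # one wrong assignment

score = accuracy(predicted, gold)  # 3 correct out of 4 assignments
```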
18. Evaluation
19. Conclusion
- Our approach is quite natural and similar to the way humans do it, relying on co-occurring NEs and terms to resolve other ambiguous entities in a given context.
- Wikipedia editions are currently available for approximately 200 languages, so our method can be used to build NE disambiguation systems for a large number of languages.
- The features from Wikipedia and the NEs in the whole text are meaningful evidence for disambiguation.
- Future work: detecting NEs outside the ontology, and investigating other similarity metrics.
20. Thanks for your attention!