Title: Using Encyclopedic Knowledge for Named Entity Disambiguation
1. Using Encyclopedic Knowledge for Named Entity Disambiguation

Razvan Bunescu
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin
razvan_at_cs.utexas.edu

Marius Pasca
Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA
mars_at_google.com
2. Introduction: Disambiguation

- Some names denote multiple entities
  - John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.
    John Williams → John Williams (composer)
  - John Williams lost a Taipei death match against his brother, Axl Rotten.
    John Williams → John Williams (wrestler)
  - John Williams won a Victoria Cross for his actions at the battle of Rorke's Drift.
    John Williams → John Williams (VC)
3. Introduction: Normalization

- Some entities have multiple names
  - John Williams (composer) → John Williams
  - John Williams (composer) → John Towner Williams
  - John Williams (wrestler) → John Williams
  - John Williams (wrestler) → Ian Rotten
  - Venus (planet) → Venus
  - Venus (planet) → Morning Star
  - Venus (planet) → Evening Star
4. Introduction: Motivation

- Web searches
  - Queries about Named Entities (NEs) constitute a significant portion of popular web queries.
  - Ideally, search results are clustered such that
    - in each cluster, the queried name denotes the same entity;
    - each cluster is enriched by querying the web with alternative names of the corresponding entity.
- Web-based Information Extraction (IE)
  - Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).
  - Named entity disambiguation is essential for performing a meaningful aggregation.
5. Introduction: Approach

- Build a dictionary D of named entities
  - Use information from a large-coverage encyclopedia: Wikipedia.
  - Each name d ∈ D is mapped to d.E, the set of entities that d can refer to in Wikipedia.
- Design a method that takes as input a proper name in its document context, and can be trained to
  - detect when a proper name refers to an entity from D (Detection);
  - find the named entity referred to in that context (Disambiguation).
6. Introduction: Example

Dictionary:
  John Williams
  John Towner Williams
  Ian Rotten
  John Williams (composer)
  John Williams (VC)
  John Williams (wrestler)
  John Williams (other)

Document:
  "... this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood ..."
7. Outline
- Introduction
- Wikipedia Structures
- Named Entity Dictionary
- Disambiguation Dataset
- Disambiguation & Detection
- Experimental Evaluation
- Future Work
- Conclusions
8. Wikipedia: A Wiki Encyclopedia

- Wikipedia is a free online encyclopedia written collaboratively by volunteers, using wiki software.
- 200 language editions, with varying levels of coverage.
- A very dynamic and quickly growing resource:
  - May 2005: 577,860 articles
  - Sep. 2005: 751,666 articles
9. Wikipedia: Articles and Titles

- Each article describes a specific entity or concept.
- An article is uniquely identified by its title.
- Usually, the title is the most common name used to denote the entity described in the article.
- If the title name is ambiguous, it may be qualified with an expression between parentheses.
  - Example: John Williams (composer)
- Notation
  - E = the set of all named entities from Wikipedia.
  - e ∈ E = an arbitrary named entity.
  - e.title = the title name.
  - e.T = the text of the article.
10. Wikipedia Structures

- In general, there is a many-to-many relationship between names and entities, captured in Wikipedia through:
  - Redirect articles.
  - Disambiguation articles.
- Hyperlinks: an article may contain links to other articles in Wikipedia.
- Categories: each article belongs to at least one Wikipedia category.
11. Redirect Articles

- A redirect article exists for each alternative name used to refer to an entity in Wikipedia.
  - Example: the article titled John Towner Williams consists of a pointer to the article John Williams (composer).
- Notation
  - e.R = the set of all names that redirect to e.
- Example
  - e.title = United States.
  - e.R = {USA, US, Estados Unidos, Untied States, Yankee Land, ...}
12. Disambiguation Articles

- A disambiguation article lists all Wikipedia entities (articles) that may be denoted by an ambiguous name.
  - Example: the article titled John Williams (disambiguation) lists 22 entities (articles).
- Notation
  - e.D = the set of names whose disambiguation pages contain a link to e.
- Example
  - e.title = Venus (planet).
  - e.D = {Venus, Morning Star, Evening Star}.
13. Named Entity Dictionary

- Named Entities = entities with a proper name title.
- All Wikipedia titles begin with a capital letter → 3 heuristics for detecting proper name titles:
  - If e.title is a multiword title, then e is a named entity only if all content words are capitalized (e.g. The Witches of Eastwick).
  - If e.title is a one-word title that contains at least two capital letters, then e is a named entity (e.g. NATO).
  - If at least 75% of the title occurrences inside the article are capitalized, then e is a named entity.
- Notation
  - d ∈ D is a proper name entry in the dictionary D (≈500K entries).
  - d.E is the set of entities that may be denoted by d in Wikipedia:
    e ∈ d.E ⇔ d = e.name ∨ d ∈ e.R ∨ d ∈ e.D
    (e.name = e.title without the expression between parentheses)
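The three heuristics above can be sketched in Python. This is a minimal sketch: the tokenization, the stopword list, and the handling of parenthesized qualifiers are assumptions, not the authors' implementation.

```python
import re

def is_proper_name_title(title, article_text):
    """Heuristic proper-name detection for a Wikipedia title.

    A sketch of the three capitalization rules above; the stopword
    list and rule ordering are assumptions."""
    # Ignore a qualifier such as "(composer)" when inspecting the title.
    name = re.sub(r"\s*\(.*\)$", "", title)
    words = name.split()
    stopwords = {"a", "an", "the", "of", "in", "on", "and", "or", "for"}
    if len(words) > 1:
        # Rule 1: multiword title -> all content words must be capitalized.
        content = [w for w in words if w.lower() not in stopwords]
        return all(w[0].isupper() for w in content)
    if sum(c.isupper() for c in name) >= 2:
        # Rule 2: one-word title with at least two capitals (e.g. NATO).
        return True
    # Rule 3: at least 75% of the title's occurrences inside the
    # article text are capitalized.
    occurrences = re.findall(re.escape(name), article_text, re.IGNORECASE)
    if not occurrences:
        return False
    capitalized = sum(1 for o in occurrences if o[0].isupper())
    return capitalized / len(occurrences) >= 0.75
```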
14. Hyperlinks

- Mentions of entities in Wikipedia articles are often linked to their corresponding article, using links or piped links.

Wiki source:
  The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
  (piped link: [[Vatican City|Vatican]]; plain link: [[Rome]])

Display string:
  The Vatican is now an enclave surrounded by Rome.
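Extracting links and piped links from wiki source can be sketched with a regular expression. This is a simplified sketch that ignores nested links, namespace prefixes, and other wikitext complications.

```python
import re

# Matches [[Target]] and [[Target|Display]] wiki links
# (simplified: no nested links, no namespace handling).
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wiki_source):
    """Return (title, display string) pairs for each link."""
    links = []
    for match in WIKI_LINK.finditer(wiki_source):
        title = match.group(1).strip()
        # A plain link displays its own title.
        display = (match.group(2) or title).strip()
        links.append((title, display))
    return links

print(extract_links(
    "The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]]."))
# [('Vatican City', 'Vatican'), ('Rome', 'Rome')]
```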
15. Disambiguation Dataset

- Hyperlinks in Wikipedia provide disambiguated named entity queries q.

  The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
  q1: display name = Vatican, title = Vatican City (display name ≠ title)
  q2: display name = title = Rome

- Notation
  - q.E = the set of entities that are associated in the dictionary D with the display name from the link.
  - q.e ∈ q.E = the true entity associated with the query, given by the title included in the link.
  - q.T = the text contained in a window of 55 words [Gooi & Allan, 2004] centered on the link.
16. Disambiguation Dataset

- Every entity ek ∈ q.E contributes a disambiguation example, labeled 1 if and only if ek = q.e.

  q: "... this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood ..."

  Label | Query Text (q.T)                        | Entity | Title (ek.title)
  ------+-----------------------------------------+--------+--------------------------
    1   | Boston Pops conducted concert Star Wars | e1     | John Williams (composer)
    0   | Boston Pops conducted concert Star Wars | e2     | John Williams (wrestler)
    0   | Boston Pops conducted concert Star Wars | e3     | John Williams (VC)

- 1,783,868 queries in total.
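Turning one disambiguated link into labeled examples, as in the table above, might look like the following sketch. The dictionary layout and query field names are assumptions for illustration, not the authors' code.

```python
def make_examples(query, dictionary):
    """Expand one disambiguated link (display name, true title, context)
    into one labeled example per candidate entity."""
    candidates = dictionary.get(query["display_name"], set())
    return [
        {"text": query["context"], "entity": title,
         "label": 1 if title == query["true_title"] else 0}
        for title in sorted(candidates)
    ]

dictionary = {
    "John Williams": {
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    }
}
query = {
    "display_name": "John Williams",
    "true_title": "John Williams (composer)",
    "context": "Boston Pops conducted a summer Star Wars concert",
}
examples = make_examples(query, dictionary)
print(sum(ex["label"] for ex in examples))  # exactly one positive example
```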
17. Categories

- Each article in Wikipedia is required to be associated with at least one category.
- Categories form a directed acyclic graph, which allows multiple categorization schemes to co-exist.
- 59,759 categories in the Wikipedia taxonomy.
- Notation
  - e.C = the set of categories to which e belongs (ancestors included).
- Example
  - e.title = Venus (planet).
  - e.C = {Venus, Planets of the Solar System, Planets, Solar System}.
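Collecting e.C with ancestors included amounts to an upward walk of the category DAG. A minimal sketch, where `category_parents` is a hypothetical map from a category to its parent categories:

```python
def ancestors(category_parents, categories):
    """All categories of an entity, ancestors included, found by
    walking the category DAG upward from the directly assigned ones."""
    seen, stack = set(), list(categories)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(category_parents.get(c, ()))
    return seen

# Hypothetical fragment of the taxonomy from the example above.
parents = {
    "Film score composers": ["Composers"],
    "Composers": ["Musicians"],
    "Musicians": ["People by occupation"],
}
print(sorted(ancestors(parents, ["Film score composers"])))
# ['Composers', 'Film score composers', 'Musicians', 'People by occupation']
```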
18. Outline
- Introduction
- Wikipedia Structures
- Named Entity Dictionary
- Disambiguation Dataset
- Disambiguation & Detection
- Experimental Evaluation
- Future Work
- Conclusions
19. NE Disambiguation: Two Approaches

- Classification
  - Train a classifier for each proper name in the dictionary D.
  - Not feasible: 500K proper names → would need 500K classifiers!
- Ranking
  - Design a scoring function score(q, ek) that computes the compatibility between the context of the proper name occurring in a query q and any of the entities ek ∈ q.E that may be referred to by that proper name.
  - For a given named entity query q, select the highest-ranking entity.
20. Context-Article Similarity

- NE disambiguation → ranking problem.
- Use the cosine similarity between the query context and the article text, based on the tf·idf formulation:

  score(q, ek) = cos(q.T, ek.T)
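A minimal sketch of the tf·idf cosine similarity; the exact idf smoothing and term weighting used in the paper are not specified on this slide, so those choices here are assumptions.

```python
import math
from collections import Counter

def tfidf_cosine(query_words, article_words, df, n_docs):
    """Cosine similarity between tf-idf vectors of the query context
    and the article text. df maps a word to its document frequency;
    the +1 smoothing is an assumed convention."""
    def vector(words):
        tf = Counter(words)
        return {w: tf[w] * math.log(n_docs / (1 + df.get(w, 0)))
                for w in tf}
    q, a = vector(query_words), vector(article_words)
    dot = sum(q[w] * a.get(w, 0.0) for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values())) *
            math.sqrt(sum(v * v for v in a.values())))
    return dot / norm if norm else 0.0
```

Identical word bags score 1.0 and disjoint ones 0.0, so the function behaves as a similarity for ranking candidate articles.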
21. Word-Category Correlations

- Problem: in many cases, given a query q, the true entity q.e fails to rank first because cue words from the query context do not occur in q.e's article.
  - The article may be too short, or incomplete.
  - Relevant concepts from the query context are captured in the article through synonymous words or phrases.
- Approach: use correlations between words in the query context (w ∈ q.T) and categories to which the named entity belongs (c ∈ e.C).
22. Word-Category Correlations

[Figure: category paths connecting the cue word "conducted" to the two candidate entities]
  People by occupation → Musicians → Composers → Film score composers → John Williams (composer)
  People known in connection with sports and hobbies → Wrestlers → Professional wrestlers → John Williams (wrestler)

  Query: John Williams and the Boston Pops *conducted* a summer Star Wars concert at Tanglewood.
23. Ranking Formulation

- Redefine q.E = the set of named entities from D that may be denoted by the display name in the query, plus an out-of-Wikipedia entity eout.
- Use a linear ranking function:

  score(q, ek) = w · Φ(q, ek)

  where the feature vector Φ(q, ek) contains the cosine similarity cos(q.T, ek.T), one indicator feature for each pair (w, c) with w ∈ q.T and c ∈ ek.C, and one indicator feature for ek = eout.
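The linear ranking function can be sketched as follows. This is a toy illustration: the Entity structure, feature names, and hand-set weights are assumptions; in the paper the weights are learned, not set by hand.

```python
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "categories"])

def features(query_words, entity, cosine_score):
    """Phi(q, ek): the cosine score plus one indicator per
    (context word, entity category) pair; an entity with no
    categories stands in for the out-of-Wikipedia entity eout."""
    phi = {"cos": cosine_score}
    if not entity.categories:
        phi["out"] = 1.0
    for w in set(query_words):
        for c in entity.categories:
            phi[(w, c)] = 1.0
    return phi

def disambiguate(weights, query_words, candidates, cosine_scores):
    """Return the candidate maximizing w . Phi(q, ek)."""
    def score(e):
        phi = features(query_words, e, cosine_scores[e.name])
        return sum(weights.get(f, 0.0) * v for f, v in phi.items())
    return max(candidates, key=score)

# Toy example with hand-set weights (learned in the real system).
composer = Entity("John Williams (composer)",
                  {"Composers", "Film score composers"})
wrestler = Entity("John Williams (wrestler)", {"Professional wrestlers"})
weights = {"cos": 1.0, ("conducted", "Composers"): 2.0}
winner = disambiguate(weights, ["conducted", "concert"],
                      [composer, wrestler],
                      {composer.name: 0.1, wrestler.name: 0.1})
print(winner.name)  # John Williams (composer)
```

The (word, category) weight lets the cue word "conducted" vote for any entity in the Composers category even when the word is absent from that entity's article, which is exactly the gap the word-category correlations are meant to fill.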
24. Ranking Formulation: Example

  q.T = {past, weekend, Boston, Pops, conducted, summer, Star, Wars, concert, Tanglewood, ...}
  e1.C = {Film score composers, Composers, Musicians, People by occupation, ...}
  eout.C = ∅
25. NE Disambiguation: Overview (1)

[Diagram: Wikipedia redirect pages, disambiguation pages, and hyperlinks feed two data structures: the NE dictionary and the disambiguation dataset.]
26. NE Disambiguation: Overview (2)
27. Outline
- Introduction
- Wikipedia Structures
- Named Entity Dictionary
- Disambiguation & Detection
- Experimental Evaluation
- Future Work
- Conclusions
28. Experimental Evaluation

- The normalized ranking kernel is trained and evaluated against cosine similarity in 4 scenarios:
  - S1: disambiguation between entities with different categories in the set of 110 top-level categories under People by Occupation.
  - S2: disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
  - S3: disambiguation between entities with different categories in the set of 2847 most popular (size > 20) categories under People by Occupation.
  - S4: detection & disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
- Use SVMlight with the max-margin ranking approach from [Joachims, 2002].
29. Experimental Evaluation: S2

- The set of Wikipedia categories is restricted to:
  - C2 = the 540 categories under People by Occupation that have at least 200 articles.
- Train & test only on ambiguous queries ⟨q, ek⟩ such that:
  - ek.C ∩ C2 ≠ ∅ (i.e. matching entities have categories in C2)
  - ek.C ∩ C2 ≠ q.e.C ∩ C2 (i.e. the true entity does not have exactly the same categories as other matching entities)
- Statistics & Results

  Cat | Training: Queries | Training: Pairs | Training: Constr. | Test: Queries | Test: Pairs | Accuracy: Kernel | Accuracy: Cosine
  540 |            17,970 |          55,452 |            37,482 |        70,468 |     235,290 |            68.4% |            55.8%
30. Experimental Evaluation: S4

- The set of Wikipedia categories is restricted to:
  - C4 = the 540 categories under People by Occupation that have at least 200 articles.
- Train & test:
  - Consider as out-of-Wikipedia all entities that are not under People by Occupation.
  - Randomly select queries such that 10% have their true answer out-of-Wikipedia.
- Statistics & Results

  Cat | Training: Queries | Training: Pairs | Training: Constr. | Test: Queries | Test: Pairs | Accuracy: Kernel | Accuracy: Cosine
  540 |            38,726 |         102,553 |            63,827 |        80,386 |     191,227 |            84.8% |            82.3%
31. Future Work

- Use the weight vector w explicitly: reduce its dimensionality by considering only features occurring frequently in the training data.
- Augment the article text with context from the hyperlinks that point to it.
- Use correlations between categories and traditional WSD features, such as (syntactic) bigrams and trigrams centered on the ambiguous proper name.
32. Conclusion

- A novel approach to Named Entity Disambiguation based on knowledge encoded in Wikipedia.
- Learned correlations between Wikipedia categories and context words substantially improve disambiguation accuracy.
- Potential applications:
  - Clustering the results of web searches for popular named entities.
  - NE disambiguation is essential for aggregating corpus-level results from Information Extraction.
33. Questions?
34. Ranking Kernel

- The corresponding kernel is the dot product of the feature vectors:

  K(⟨q, e⟩, ⟨q′, e′⟩) = Φ(q, e) · Φ(q′, e′)
                      = cos(q.T, e.T) · cos(q′.T, e′.T)
                        + |{(w, c) : w ∈ q.T ∩ q′.T, c ∈ e.C ∩ e′.C}|
                        + [e = eout] · [e′ = eout]
35. Experimental Evaluation: S1

- The set of Wikipedia categories is restricted to:
  - C1 = the 110 top-level categories under People by Occupation.
- Train & test only on ambiguous queries ⟨q, ek⟩ such that:
  - ek.C ∩ C1 ≠ ∅ (i.e. matching entities have categories in C1)
  - ek.C ∩ C1 ≠ q.e.C ∩ C1 (i.e. the true entity does not have exactly the same categories as other matching entities)
- Statistics & Results

  Cat | Training: Queries | Training: Pairs | Training: Constr. | Test: Queries | Test: Pairs | Accuracy: Kernel | Accuracy: Cosine
  110 |            12,288 |          39,880 |            27,592 |        48,661 |     147,165 |            77.2% |            61.5%
36. Experimental Evaluation: S3

- The set of Wikipedia categories is restricted to:
  - C3 = the 2847 categories under People by Occupation that have at least 20 articles.
- Train & test only on ambiguous queries ⟨q, ek⟩ such that:
  - ek.C ∩ C3 ≠ ∅ (i.e. matching entities have categories in C3)
  - ek.C ∩ C3 ≠ q.e.C ∩ C3 (i.e. the true entity does not have exactly the same categories as other matching entities)
- Statistics & Results

  Cat  | Training: Queries | Training: Pairs | Training: Constr. | Test: Queries | Test: Pairs | Accuracy: Kernel | Accuracy: Cosine
  2847 |            21,185 |          64,560 |            43,375 |        75,190 |     261,723 |            68.0% |            55.4%