Title: Wikitology: A Wikipedia Derived Knowledge Base
1Wikitology A Wikipedia Derived Knowledge Base
- Zareen Syed
- Advisor Dr. Tim Finin
- February 6th, 2009
2Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Timeline
- Work Progress
- Conclusion
3Introduction
- Wikipedia
- Encyclopedia
- Developed Collaboratively
- Freely available online
- Millions of articles
- English Wikipedia (2,723,767 articles)
- Multiple Languages (More than 260)
- Structured and un-structured content
4Introduction
- Wikipedia Content and Organization
- Article Text
- Categories and Category Hierarchy
- Inter-article Links
- Info-boxes
- Disambiguation Pages
- Redirection Pages
- Talk Pages
- History Pages
- Meta-data
5Motivation
- Challenges
- Human Understandable Content (not machine readable)
- How to make it more structured and organized to improve machine readability
- How to automatically exploit the knowledge in Wikipedia to solve some real-world problems
6Thesis Statement
- We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as
- Concept Prediction
- Information Retrieval
- Information Extraction
7Proposed Contributions
- Developing a Novel Hybrid Knowledge Base composed of structured, semi-structured and un-structured information extracted from Wikipedia and other related sources
- Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base
- Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction
8Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Timeline
- Work Progress
- Conclusion
9Related Work
- Information Extraction
- Relation extraction [35]
- Co-reference resolution [25]
- Named Entity Classification [52]
- Natural Language Processing
- Automatic word sense disambiguation [27]
- Searching synonyms [28]
10Related Work
- Information Retrieval
- Text categorization [24]
- Computing semantic relatedness [30, 31, 32]
- Predicting document topics [26]
- Search Engine [69]
- Semantic Web
- DBpedia [46]
- Semantic MediaWiki [46]
- Linked Open Data Project [23]
- Freebase [22]
11Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Timeline
- Work Progress
12Proposed Work
- Refining, Enriching and Exploiting Structured Content in Wikipedia
- Integrating other related knowledge sources
- Developing application specific algorithms
- Developing a dynamic and scalable architecture
13Issues
- Single document in too many categories
- George W. Bush is included in about 30 categories
- Links between articles belonging to very different categories
- John F. Kennedy has a link to coincidence theory, which belongs to Mathematical Analysis/Topology/Fixed Points
- Number of articles within a category
- Some categories are under-represented whereas others have many articles
- Administrative Categories
- For example:
- Clean up from Sep 2006
- Articles with unsourced statements
- Links to words in an article
- For example, if the word United States appears in the document then that word might be linked to the page on United States
14Issues
- Category Hierarchy
- Multiple Parents (Thesaurus)
- Noisy
- Animals category defined in the sub-tree rooted at People
- Loose subsumption
- Geography -> Geography by place -> Regions -> Regions of Asia -> Middle East -> Dances of Middle East
- Events -> Events by year -> Lists of leaders by year
15Refining, Enriching and Exploiting Structured
Content in Wikipedia
- Category Hierarchy
- Filtering out Administrative Categories
- Algorithms for Selecting and Ranking Categories
- Inferring and Labeling Semantic Relations between Categories
- Refining Subsumption (Taxonomy)
- Instance-of Relation
- Using Information in Wikipedia Lists_of_Topics
- Using Specific Administrative Categories
Done
16Refining, Enriching and Exploiting Structured
Content in Wikipedia
- Inter-Article Links
- Problem: Links don't imply semantic relatedness
- Links to locations, term definitions, dates, entities
- Possible solutions
- Classifying Link Types
- Introducing Link Weights
Done
17Refining, Enriching and Exploiting Structured
Content in Wikipedia
18Proposed Work
- Exploring
- Other structured content
- Talk pages, user pages, history pages and meta-data
- Other structured resources
- Integrating structured information from other sources like DBpedia and Freebase in Wikitology
- How and when to employ reasoning over the RDF triples
19Proposed Work
- Developing Novel Application Specific Algorithms on top of the hybrid Wikitology Knowledge Base for applications such as
- Concept Prediction
- Information Retrieval
- Information Extraction
20Proposed Work
- Evaluation
- Main Approaches to Evaluating Ontologies
- Gold Standard Evaluation (Comparison to an existing Ontology)
- Criteria based Evaluation (By humans)
- Task based Evaluation (Application based)
- Comparison with Source of Data (Data driven)
- Using a Reasoning Engine
- Our Approach to Evaluation
- Task based Evaluation (Application based)
21Wikitology Overview
[Architecture diagram. Components shown: Articles, IR Index, Category Links Hierarchical Graph, Page Links Graph, RDF Reasoner, Relational Database, Triple Store, Wikitology Code, Application Specific Algorithms]
22Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Time Line
- Work Progress
- Conclusion
23Time Line
No. | Milestone | Expected Completion Date
1 | Enriching Wikitology by extracting additional information from Wikipedia | May, 2009
2 | Studying other related knowledge sources in detail such as Freebase, DBpedia, YAGO etc. | May, 2009
3 | Incorporating additional knowledge sources to enrich Wikitology | May, 2009
4 | Working on techniques to improve applications in Information Retrieval and Information Extraction using additional features generated from Wikitology | Dec, 2009
5 | Evaluating the Wikitology knowledge base | May, 2010
6 | Thesis write up | Aug, 2010
24Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Time Line
- Work Progress
- Conclusion
25Work Done
- Case Study 1
- Concept Prediction
- Case Study 2
- Document Expansion for Information Retrieval
- Case Study 3
- Named Entity Classification
- Case Study 4
- Co-reference Resolution
- Case Study 5
- Concept Based Features for Information Retrieval
In Progress
26Case Study 1: Concept Prediction [2]
- Problem: Predict the individual document topics as well as concepts common to a set of documents
- Approach
- Hybrid Knowledge base Wikitology 1.0
- Algorithms for selecting and aggregating terms
27Wikitology 1.0
- Wikipedia as an Ontology
- Each article is a concept in the ontology
- Terms linked via Wikipedia's category system and inter-article links
- It's a consensus ontology created, kept current and maintained by a diverse community
- Overall content quality is high
- Terms have unique IDs (URLs) and are self-describing for people
28Wikitology 1.0
- Structured Data
- Specialized Concepts (article titles)
- Generalized Concepts (category titles)
- Inter-category and Inter-article links as relations between concepts
- Article-Category links as relations between specialized and generalized concepts
- Un-Structured Data
- Article Text (a way to map ontology terms to free text)
- Algorithms
- Algorithms to select, rank and aggregate concepts using the hybrid knowledge base
29Method 1
Using Wikipedia Article Text and Categories to Predict Concepts
[Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity; example scores shown are 0.8, 0.2, 0.1, 0.2, 0.3]
30Method 1
Using Wikipedia Article Text and Categories to Predict Concepts
[Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3); the similar articles link into the Wikipedia Category Graph]
31Method 1
Using Wikipedia Article Text and Categories to Predict Concepts
[Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3); the similar articles link into the Wikipedia Category Graph, on which the values 0.9 and 3 appear]
Output
- Rank Categories
- Links
- Cosine similarity
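A minimal sketch of the idea behind Method 1, assuming plain term-frequency vectors and a tiny hand-made article and category mapping. The names `cosine`, `rank_categories`, `ARTICLES` and `ARTICLE_CATEGORIES` are hypothetical and are not taken from the Wikitology code. It retrieves the most similar articles by cosine similarity, then ranks their categories in two ways: by the number of linking articles and by the summed cosine similarity.

```python
import math
from collections import Counter, defaultdict

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_categories(query_text, articles, article_categories, top_n=3):
    """Return (by_links, by_score) for the categories of the top_n similar articles."""
    query = Counter(query_text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(text.lower().split())), title)
         for title, text in articles.items()),
        reverse=True,
    )[:top_n]
    links = Counter()                 # number of similar articles linking to a category
    score_sum = defaultdict(float)    # summed cosine similarity of those articles
    for score, title in scored:
        for cat in article_categories.get(title, []):
            links[cat] += 1
            score_sum[cat] += score
    by_links = sorted(links, key=lambda c: (-links[c], c))
    by_score = sorted(score_sum, key=lambda c: (-score_sum[c], c))
    return by_links, by_score

# Hypothetical toy data, for illustration only.
ARTICLES = {
    "Turing_machine": "abstract machine that models computation on a tape",
    "Halting_problem": "decision problem about whether a machine computation halts",
    "Bombe": "device used to break enigma codes during the war",
}
ARTICLE_CATEGORIES = {
    "Turing_machine": ["Theory_of_computation", "Alan_Turing"],
    "Halting_problem": ["Theory_of_computation"],
    "Bombe": ["Cryptography", "Alan_Turing"],
}

by_links, by_score = rank_categories(
    "machine computation halts", ARTICLES, ARTICLE_CATEGORIES, top_n=2
)
print(by_links, by_score)
```

A real system would use an IR index with TF-IDF weighting rather than raw term counts; the raw counts here only keep the sketch self-contained.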
32Method 2
Using Spreading Activation on Category Links Graph to get Aggregated Concepts
[Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3); spreading activation over the Wikipedia Category Graph applies an input function and an output function at each node]
Output: Ranked Concepts based on Final Activation Score
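A rough sketch of spreading activation over a category links graph. The slide does not give the input and output functions, so both are assumptions here: the input function sums incoming outputs, and the output function divides a node's received activation by its number of parents times the pulse number. The function name `spread_categories` and the toy graph are hypothetical.

```python
from collections import defaultdict

def spread_categories(initial, parents, pulses=2):
    """Spread activation upward through a category graph.

    initial: {category: starting activation}, e.g. summed cosine scores
    parents: {category: [parent categories]}
    Returns (category, activation) pairs sorted by final activation.
    """
    activation = defaultdict(float, initial)
    frontier = dict(initial)  # activation received in the previous pulse
    for pulse in range(1, pulses + 1):
        incoming = defaultdict(float)
        for node, act in frontier.items():
            outs = parents.get(node, [])
            if not outs:
                continue
            out = act / (len(outs) * pulse)   # output function (assumption)
            for parent in outs:
                incoming[parent] += out       # input function (assumption)
        for node, value in incoming.items():
            activation[node] += value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy graph, for illustration only.
ranked = spread_categories(
    {"Computability": 0.8, "Cryptography": 0.2},
    {
        "Computability": ["Theory_of_computation"],
        "Cryptography": ["Applied_mathematics"],
        "Theory_of_computation": ["Computer_science"],
    },
)
print(ranked)
```

Dividing by the pulse number is one simple way to make activation decay with distance, so that very general categories near the root do not dominate the ranking.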
33Method 3
Using Spreading Activation on Article Links Graph
[Diagram: the input query doc(s) are matched to similar articles in the Wikipedia Article Links Graph; spreading activation applies a node input function and a node output function]
Threshold: Ignore spreading activation to articles with less than 0.4 cosine similarity score
Edge Weights: Cosine similarity between linked articles
Output: Ranked Concepts based on Final Activation Score
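The threshold and edge weights above can be sketched as follows. This is an illustrative reading, not the original implementation: it assumes the 0.4 threshold is applied to the edge weight (the cosine similarity between the two linked articles), that a node passes on its activation multiplied by the edge weight, and that a node's input is the sum of what it receives. `spread_articles` and the toy links are hypothetical.

```python
def spread_articles(initial, links, threshold=0.4, pulses=1):
    """Spread activation over a weighted article links graph.

    initial: {article: starting activation}
    links:   {article: {linked article: cosine similarity of the two articles}}
    Edges whose weight is below `threshold` are ignored.
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(pulses):
        incoming = {}
        for node, act in frontier.items():
            for target, weight in links.get(node, {}).items():
                if weight < threshold:
                    continue  # skip weakly related articles
                incoming[target] = incoming.get(target, 0.0) + act * weight
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy links, for illustration only.
ranked_articles = spread_articles(
    {"Alan_Turing": 1.0},
    {"Alan_Turing": {"Turing_machine": 0.5, "London": 0.1}},
)
print(ranked_articles)
```

The threshold addresses the issue that inter-article links often point to locations, dates or term definitions that are not semantically related to the source article.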
34Wikitology 1.0
- The system was evaluated by predicting the categories and article links of existing Wikipedia articles and comparing with the ground truth
- It was observed that the Wikitology 1.0 system was able to predict the document topics and common concepts with high accuracy when the article concepts were well represented within Wikipedia
35Case Study 2: Document Expansion with Wikipedia Derived Ontology Terms [21]
Preliminary work with TREC documents
Doc FT921-4598 (3/9/92): ... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ...
Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence
Run | MAP | P_at_10
base | 0.2076 | 0.4207
base rf | 0.2470 | 0.4480
concepts rf | 0.2400 | 0.4553
IR Effectiveness Using Wikipedia Concepts
In Collaboration with Paul McNamee, Johns Hopkins University Applied Physics Laboratory
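For readers unfamiliar with the MAP and P_at_10 columns reported on this slide, the sketch below shows the standard definitions of precision at k and mean average precision. The function names and the toy ranking are hypothetical; this is not the evaluation code used for the TREC experiments.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical toy ranking: d1 and d3 are the relevant documents.
example_ranked = ["d1", "d2", "d3", "d4"]
example_relevant = {"d1", "d3"}
print(precision_at_k(example_ranked, example_relevant, k=2))
print(average_precision(example_ranked, example_relevant))
```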
36Case Study 3: Named Entity Classification
- Semi-automated generation of Training data
- Persons, Locations and Events
- Experimenting with different feature sets
- Inter-article link labeling
Results showing accuracy obtained using
different feature sets
37Case Study 4: Cross Document Entity Co-reference Resolution [21]
- Problem
- To determine whether various named people,
organizations or relations from different
documents refer to the same object in the world.
- For example, does the Condoleezza Rice
mentioned in one document refer to the same
person as the Secretary Rice from another?
In Collaboration with Johns Hopkins University Human Language Technology Center of Excellence
38Entity Document (EDOC)
39Wikitology 2.0
- Enhancements
- Structured Data
- Specialized Concepts (article titles)
- Generalized Concepts (category titles)
- Inter-category and Inter-article links as relations between concepts
- Article-Category links as relations between specialized and generalized concepts
- YAGO types (to identify entity type)
- Table with Disambiguation set (to identify highly confused entities)
- Aliases using Redirect pages
- Un-Structured Data
- Article Text
- Redirect titles (added to article text)
40Wikitology 2.0
- Data Structures
- Lucene Index
- Concept Title + Redirected Titles (field)
- Article Text + Redirected Titles (field)
- RDF field with Entity Type (YAGO type)
- Graphs
- Category links graph
- Article links graph
- Article-Category links
- Tables
- Disambiguation Set derived from disambiguation
pages
41Wikitology 2.0
- Custom Query Front end
- The EDOC's name mention strings
- Wikitology's title field
- Slightly higher weight to the longest mention, i.e., Webb Hubbell
- The EDOC type
- RDF Field YAGO Type
- Name mention strings + Contextual text
- Text (Wikitology Article Contents)
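One way to picture the query front end is as a builder for a Lucene classic query parser string, where `field:"phrase"^boost` raises the weight of a clause. The sketch below is only an assumption about how such a query could be assembled: the field names `title`, `type` and `text`, the boost value, and the function `build_query` are all hypothetical, and the sketch does not escape Lucene special characters.

```python
import re

def build_query(mentions, entity_type, context, longest_boost=2.0):
    """Build a Lucene-style query string from an entity document (EDOC).

    mentions:    name mention strings of the entity
    entity_type: entity type, matched against the (assumed) type field
    context:     contextual text surrounding the mentions
    """
    longest = max(mentions, key=len)
    # Name mentions go against the title field; the longest one is boosted.
    title_clauses = [
        f'title:"{m}"^{longest_boost}' if m == longest else f'title:"{m}"'
        for m in mentions
    ]
    # Mentions plus context go against the article text field.
    # Lowercasing keeps words such as AND/OR from acting as query operators.
    words = re.findall(r"\w+", (" ".join(mentions) + " " + context).lower())
    text_clause = "text:(" + " ".join(words) + ")"
    return " ".join(title_clauses + [f"type:{entity_type}", text_clause])

query = build_query(["Webb Hubbell", "Hubbell"], "PER", "Rose Law Firm")
print(query)
```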
42Wikitology Features
43Features Derived from Wikitology 2.0
Name | Range | Type | Description
APL20WAS | 0,1 | sim | 1 if the top article tags for the two entities are identical, 0 otherwise
APL21WCS | 0,1 | sim | 1 if the top category tags for the two entities are identical, 0 otherwise
APL22WAM | 0..1 | sim | The cosine similarity of the medium length article vectors (N=5) for the two entities
APL23WcM | 0..1 | sim | The cosine similarity of the medium length category vectors (N=4) for the two entities
APL24WAL | 0..1 | sim | The cosine similarity of the long length article vectors (N=8) for the two entities
APL31WAS2 | 0..1 | sim | Match of entities' top Wikitology article tag, weighted by avg(score1, score2)
APL32WCS2 | 0..1 | sim | Match of entities' top Wikitology category tag, weighted by avg(score1, score2)
APL26WDP | 0,1 | dissim | 1 if both entities are of type PER and their top article tags are different, 0 otherwise
APL27WDD | 0,1 | dissim | 1 if the two top article tags are members of the same disambiguation set, 0 otherwise
APL28WDO | 0,1 | dissim | 1 if both entities are of type ORG and their top article tags are different, 0 otherwise
APL29WDP2 | 0..1 | dissim | Match if both entities are of type PER and their top article tags are different, weighted by 1-abs(score1-score2), 0 otherwise
APL30WDP2 | 0..1 | dissim | Match if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise
Twelve features were computed for each pair of entities using Wikitology, seven aimed at measuring their similarity and five for measuring their dissimilarity.
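To make a few of these feature definitions concrete, here is a small sketch that computes APL20WAS, APL22WAM, APL31WAS2, APL26WDP and APL29WDP2 for one pair of entities. The entity representation (a dict with a `type` and a best-first list of `(article tag, score)` pairs), the assumption that scores lie in 0..1, and the function name `pair_features` are all hypothetical.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def pair_features(e1, e2, n_medium=5):
    """e1, e2: {'type': str, 'articles': [(tag, score), ...]} with best tag first."""
    (tag1, s1), (tag2, s2) = e1["articles"][0], e2["articles"][0]
    same_top = tag1 == tag2
    both_per = e1["type"] == e2["type"] == "PER"
    return {
        # 1 if the top article tags are identical
        "APL20WAS": 1 if same_top else 0,
        # cosine similarity of the medium length (N=5) article vectors
        "APL22WAM": cosine(dict(e1["articles"][:n_medium]), dict(e2["articles"][:n_medium])),
        # top article tag match, weighted by the average score
        "APL31WAS2": (s1 + s2) / 2 if same_top else 0.0,
        # both PER with different top article tags
        "APL26WDP": 1 if both_per and not same_top else 0,
        # same condition, weighted by 1 - abs(score1 - score2)
        "APL29WDP2": 1 - abs(s1 - s2) if both_per and not same_top else 0.0,
    }

# Hypothetical entities, for illustration only.
entity_a = {"type": "PER", "articles": [("Condoleezza_Rice", 0.9), ("George_W._Bush", 0.3)]}
entity_b = {"type": "PER", "articles": [("Condoleezza_Rice", 0.7), ("United_States_Secretary_of_State", 0.2)]}
features = pair_features(entity_a, entity_b)
print(features)
```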
44Evaluation
Evaluation results for cross-document entity co-reference task using Wikitology features
match | TP rate | FP rate | Precision | Recall | F-Measure
yes | .722 | .001 | .966 | .722 | .826
no | .999 | .278 | .99 | .999 | .994
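As a quick check on how the columns relate, the F-Measure values in this table are consistent with the harmonic mean of precision and recall (the usual F1 definition). The helper name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f_yes = f_measure(0.966, 0.722)  # "yes" row
f_no = f_measure(0.99, 0.999)    # "no" row
print(round(f_yes, 3), round(f_no, 3))  # prints 0.826 0.994
```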
45Case Study 5: Feature Generation to Improve Information Retrieval Performance
- Incorporating Generalized Concept Features in MORAG [69] search engine
Work being done during internship at RiverGlass Company
46Incorporating Wikitology based features in MORAG search engine
- MORAG Search Engine
- Concept features generated using Wikipedia (ESA)
- Feature Selection using pseudo-relevance feedback
- Merged Ranking of Concept scores and BOW scores
47Outline
- Introduction and Motivation
- Related Work
- Proposed Work
- Timeline
- Work Progress
- Conclusion
48Thesis Statement
- We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as
- Concept Prediction
- Information Retrieval
- Information Extraction
49Proposed Contributions
- Developing a Novel Hybrid Knowledge base composed of structured and un-structured information extracted from Wikipedia and other related sources
- Wikitology 1.0
- Wikitology 2.0
50Proposed Contributions
- Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base
- Methods for Concept Prediction
- Ranking methods and Spreading Activation
- Co-reference Resolution
- Novel Entity representation and Hybrid Querying
- Information Retrieval
- Document Expansion, Generalized Concept Features
augmentation
51Proposed Contributions
- Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction
- Metrics
- Precision and Recall
52The End