Wikitology: A Wikipedia Derived Knowledge Base

About This Presentation

Title:

Wikitology: A Wikipedia Derived Knowledge Base

Description:

We can exploit Wikipedia and other related knowledge sources to automatically ... Refining, Enriching and Exploiting Structured Content in Wikipedia ...

Slides: 53

Provided by: wenj

Learn more at: https://ebiquity.umbc.edu

Transcript and Presenter's Notes

1
Wikitology: A Wikipedia Derived Knowledge Base

Zareen Syed
Advisor: Dr. Tim Finin
February 6th, 2009

2
Outline

Introduction and Motivation
Related Work
Proposed Work
Timeline
Work Progress
Conclusion

3
Introduction

Wikipedia
Encyclopedia
Developed Collaboratively
Freely available online
Millions of articles
English Wikipedia (2,723,767 articles)
Multiple Languages (More than 260)
Structured and un-structured content

4
Introduction

Wikipedia Content and Organization
Article Text
Categories and Category Hierarchy
Inter-article Links
Info-boxes
Disambiguation Pages
Redirection Pages
Talk Pages
History Pages
Meta-data

5
Motivation

Challenges
Human Understandable Content (not machine
readable)
How to make it more structured and organized to
improve machine readability
How to automatically exploit the knowledge in
Wikipedia to solve some real world problems

6
Thesis Statement

We can exploit Wikipedia and other related
knowledge sources to automatically create
knowledge about the world supporting a set of
common use cases such as
Concept Prediction
Information Retrieval
Information Extraction

7
Proposed Contributions

Developing a Novel Hybrid Knowledge Base composed
of structured, semi-structured and un-structured
information extracted from Wikipedia and other
related sources
Developing Novel Application Specific Algorithms
for exploiting the hybrid knowledge base
Task Based Evaluation of the system on common
use-cases such as Concept Prediction, Information
Retrieval and Information Extraction

8
Outline

Introduction and Motivation
Related Work
Proposed Work
Timeline
Work Progress
Conclusion

9
Related Work

Information Extraction
Relation extraction [35]
Co-reference resolution [25]
Named Entity Classification [52]
Natural Language Processing
Automatic word sense disambiguation [27]
Searching synonyms [28]

10
Related Work

Information Retrieval
Text categorization [24]
Computing semantic relatedness [30, 31, 32]
Predicting document topics [26]
Search Engine [69]
Semantic Web
DBPedia [46]
Semantic MediaWiki [46]
Linked Open Data Project [23]
Freebase [22]

11
Outline

Introduction and Motivation
Related Work
Proposed Work
Timeline
Work Progress

12
Proposed Work

Refining, Enriching and Exploiting Structured
Content in Wikipedia
Integrating other related knowledge sources
Developing application specific algorithms
Developing a dynamic and scalable architecture

13
Issues

Single document in too many categories
George W. Bush is included in about 30
categories
Links between articles belonging to very
different categories
John F. Kennedy links to coincidence theory,
which belongs to Mathematical Analysis /
Topology / Fixed Points
Number of articles within a category
Some categories are under-represented whereas
others have many articles
Administrative categories
For example:
Clean up from Sep 2006
Articles with unsourced statements
Links to words in an article
For example, if the phrase United States appears
in an article, it might be linked to the
page on United States

14
Issues

Category Hierarchy
Multiple Parents (Thesaurus)
Noisy
Animals category defined in the sub-tree rooted
at People
Loose subsumption
Geography -> Geography by place -> Regions ->
Regions of Asia -> Middle East -> Dances of Middle
East
Events -> Events by year -> Lists of leaders by year

15
Refining, Enriching and Exploiting Structured
Content in Wikipedia

Category Hierarchy
Filtering out Administrative Categories
Algorithms for Selecting and Ranking Categories
Inferring and Labeling Semantic Relations between
Categories
Refining Subsumption (Taxonomy)
Instance-of Relation
Using Information in Wikipedia Lists_of_Topics
Using Specific Administrative Categories

Done
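
A minimal sketch of the "Filtering out Administrative Categories" step, assuming a simple pattern-based filter. The slides do not say how the filtering was done; the two patterns below are hypothetical and are modeled only on the two example categories named earlier in the deck.

```python
import re

# Hypothetical patterns, derived only from the two examples in the slides
# ("Clean up from Sep 2006", "Articles with unsourced statements").
ADMIN_PATTERNS = [
    re.compile(r"^clean ?up\b", re.IGNORECASE),
    re.compile(r"^articles with\b", re.IGNORECASE),
]


def is_administrative(category):
    """Return True if the category title looks like a maintenance category."""
    return any(p.search(category) for p in ADMIN_PATTERNS)


def filter_categories(categories):
    """Keep only the categories that are not flagged as administrative."""
    return [c for c in categories if not is_administrative(c)]


cats = [
    "Clean up from Sep 2006",
    "Articles with unsourced statements",
    "Presidents of the United States",
]
kept = filter_categories(cats)
print(kept)
```

A real filter would need a much longer pattern list, or a curated list of maintenance categories, than this illustration carries.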
16
Refining, Enriching and Exploiting Structured
Content in Wikipedia

Inter-Article Links
Problem: Links do not necessarily imply semantic relatedness
Links to locations, term definitions, dates,
entities
Possible solutions
Classifying Link Types
Introducing Link Weights

Done
17
Refining, Enriching and Exploiting Structured
Content in Wikipedia

Redirection Pages

Disambiguation pages

18
Proposed Work

Exploring
Other structured content
Talk pages, user pages, history pages and
meta-data
Other structured resources
Integrating structured information from other
sources like DBpedia and Freebase in Wikitology
How and When to employ reasoning over the RDF
triples

19
Proposed Work

Developing Novel Application Specific Algorithms
on top of the hybrid Wikitology Knowledge Base
for applications such as
Concept Prediction
Information Retrieval
Information Extraction

20
Proposed Work

Evaluation
Main Approaches to Evaluating Ontologies
Gold Standard Evaluation (Comparison to an
existing Ontology)
Criteria based Evaluation (By humans)
Task based Evaluation (Application based)
Comparison with Source of Data (Data driven)
Using a Reasoning Engine
Our Approach to Evaluation
Task based Evaluation (Application based)

21
Wikitology Overview
[Architecture diagram. Labels shown: Articles, IR Index, Category Links Hierarchical Graph, Page Links Graph, RDF Reasoner, Relational Database, Triple Store, Wikitology Code, Application Specific Algorithms.]
22
Outline

Introduction and Motivation
Related Work
Proposed Work
Time Line
Work Progress
Conclusion

23
Time Line
No. Milestone (expected completion date)
1. Enriching Wikitology by extracting additional information from Wikipedia (May 2009)
2. Studying other related knowledge sources in detail, such as Freebase, DBPedia, YAGO etc. (May 2009)
3. Incorporating additional knowledge sources to enrich Wikitology (May 2009)
4. Working on techniques to improve applications in Information Retrieval and Information Extraction using additional features generated from Wikitology (Dec 2009)
5. Evaluating the Wikitology knowledge base (May 2010)
6. Thesis write up (Aug 2010)
24
Outline

Introduction and Motivation
Related Work
Proposed Work
Time Line
Work Progress
Conclusion

25
Work Done

Case Study 1
Concept Prediction
Case Study 2
Document Expansion for Information Retrieval
Case Study 3
Named Entity Classification
Case Study 4
Co-reference Resolution
Case Study 5
Concept Based Features for Information Retrieval

In Progress
26
Case Study 1: Concept Prediction [2]

Problem: Predict the individual document topics
as well as concepts common to a set of documents
Approach
Hybrid knowledge base: Wikitology 1.0
Algorithms for selecting and aggregating terms

27
Wikitology 1.0

Wikipedia as an Ontology
Each article is a concept in the ontology
Terms linked via Wikipedia's category system and
inter-article links
It's a consensus ontology created, kept current
and maintained by a diverse community
Overall content quality is high
Terms have unique IDs (URLs) and are self
describing for people

28
Wikitology 1.0

Structured Data
Specialized Concepts (article titles)
Generalized Concepts (category titles)
Inter-category and Inter-article links as
relations between concepts
Article-Category links as relations between
specialized and generalized concepts
Un-Structured Data
Article Text (a way to map ontology terms to
free text)
Algorithms
Algorithms to select, rank and aggregate concepts
using the hybrid knowledge base

29
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
[Figure: the input query document(s) are matched to similar Wikipedia articles by cosine similarity; example scores shown: 0.8, 0.2, 0.1, 0.2, 0.3.]
30
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
[Figure: the input query document(s), the similar Wikipedia articles found by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3), and their links into the Wikipedia category graph.]
31
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
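[Figure: the input query document(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3), which link into the Wikipedia category graph. Output: ranked categories, scored by links and by cosine similarity (example values shown: 0.9, 3).]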
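
The flow of Method 1 can be sketched as follows. This is an illustration under stated assumptions, not the system's code: it uses raw term counts rather than an IR index, the toy articles and category assignments are invented, and ordering by link count first and summed similarity second is only one way to combine the two scores the slide names.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict_categories(query, articles, categories, top_n=2):
    """Rank the categories of the top_n articles most similar to the query.

    Returns (category, link_count, summed_similarity) tuples, ordered by
    link count and then by summed similarity (an assumed combination).
    """
    q = Counter(query.lower().split())
    scored = sorted(
        ((cosine(q, Counter(text.lower().split())), title)
         for title, text in articles.items()),
        reverse=True,
    )[:top_n]
    links = defaultdict(int)
    sims = defaultdict(float)
    for sim, title in scored:
        for cat in categories.get(title, []):
            links[cat] += 1    # "Links" score: number of similar articles in the category
            sims[cat] += sim   # "Cosine similarity" score: sum over those articles
    return sorted(((c, links[c], sims[c]) for c in links),
                  key=lambda row: (-row[1], -row[2]))


# Toy data, invented for illustration only.
articles = {
    "Turing machine": "abstract machine model of computation tape",
    "Halting problem": "decision problem computation program halts",
    "Enigma machine": "cipher machine used in war",
}
categories = {
    "Turing machine": ["Theory of computation", "Alan Turing"],
    "Halting problem": ["Theory of computation"],
    "Enigma machine": ["Cryptography"],
}
ranked = predict_categories("model of computation and the halting problem",
                            articles, categories)
print(ranked)
```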
32
Method 2
Using Spreading Activation on Category Links
Graph to get Aggregated Concepts
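[Figure: the input query document(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3); spreading activation then runs over the Wikipedia category graph, with an input function and an output function at each node. Output: ranked concepts based on final activation score.]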
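
A minimal sketch of spreading activation over a category links graph, to make the input and output functions concrete. The slide does not define either function, so both are assumptions here: the input function sums incoming activation, and the output function splits a decayed share evenly across out-links. The `decay` and `pulses` parameters and the toy graph are hypothetical.

```python
from collections import defaultdict


def spread_activation(graph, seeds, pulses=2, decay=0.5):
    """Propagate activation from seed articles through category links.

    graph: node -> list of nodes it links to (article -> categories,
           category -> parent categories).
    seeds: article -> initial activation (e.g. cosine similarity to the query).
    Returns non-seed nodes ranked by final activation score.
    """
    activation = defaultdict(float)
    frontier = dict(seeds)
    for _ in range(pulses):
        incoming = defaultdict(float)
        for node, value in frontier.items():
            targets = graph.get(node, [])
            for target in targets:
                # Output function (assumed): decayed value split over out-links.
                incoming[target] += decay * value / len(targets)
        for node, value in incoming.items():
            # Input function (assumed): sum of incoming activation.
            activation[node] += value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: -kv[1])


# Toy graph: two articles, two categories, one parent category.
graph = {"A1": ["C1", "C2"], "A2": ["C1"], "C1": ["P"], "C2": ["P"]}
ranked_concepts = spread_activation(graph, {"A1": 0.8, "A2": 0.2})
print(ranked_concepts)
```

With more pulses the activation climbs further up the hierarchy, which trades specific concepts for more general ones.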
33
Method 3
Using Spreading Activation on Article Links Graph
Input: query document(s), matched to similar articles in the Wikipedia article links graph
Threshold: ignore spreading activation to articles with less than 0.4 cosine similarity score
Edge weights: cosine similarity between linked articles
Spreading activation: node input function and node output function
Output: ranked concepts based on final activation score
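
The threshold and edge weights of Method 3 can be illustrated with a short sketch. It takes one reading of the slide: an edge is followed only if the cosine similarity between the two linked articles is at least 0.4, and the activation passed along an edge is scaled by that similarity. The node functions (multiply on output, sum on input) and the toy link weights are assumptions.

```python
def spread_over_article_links(links, seeds, threshold=0.4, pulses=2):
    """Spreading activation on a weighted article links graph.

    links: article -> {linked article: cosine similarity between the two}.
    seeds: article -> initial activation (similarity to the query document).
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(pulses):
        nxt = {}
        for node, value in frontier.items():
            for target, weight in links.get(node, {}).items():
                if weight < threshold:
                    continue  # do not spread along weakly related links
                # Node output function (assumed): activation times edge weight.
                nxt[target] = nxt.get(target, 0.0) + value * weight
        for node, value in nxt.items():
            # Node input function (assumed): sum of incoming activation.
            activation[node] = activation.get(node, 0.0) + value
        frontier = nxt
    return sorted(activation.items(), key=lambda kv: -kv[1])


# Toy link weights, invented for illustration.
links = {
    "Alan_Turing": {"Turing_machine": 0.7, "United_Kingdom": 0.1},
    "Turing_machine": {"Halting_problem": 0.6},
}
ranked_articles = spread_over_article_links(links, {"Alan_Turing": 0.8})
print(ranked_articles)
```

The threshold is what keeps links to locations, dates and similar loosely related pages from attracting activation.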
34
Wikitology 1.0

The system was evaluated by predicting the
categories and article links of existing
Wikipedia articles and comparing with the ground
truth
It was observed that Wikitology 1.0 system was
able to predict the document topics and common
concepts with high accuracy when the article
concepts were well represented within Wikipedia

35
Case Study 2: Document Expansion with Wikipedia
Derived Ontology Terms [21]
Preliminary work with TREC documents
Doc FT921-4598 (3/9/92) ... Alan Turing,
described as a brilliant mathematician and a key
figure in the breaking of the Nazis' Enigma
codes. Prof IJ Good says it is as well that
British security was unaware of Turing's
homosexuality, otherwise he might have been fired
'and we might have lost the war'. In 1950 Turing
wrote the seminal paper 'Computing Machinery And
Intelligence', but in 1954 killed himself
... Turing_machine, Turing_test,
Church_Turing_thesis, Halting_problem,
Computable_number, Bombe, Alan_Turing,
Recursion_theory, Formal_methods,
Computational_models, Theory_of_computation,
Theoretical_computer_science, Artificial_Intelligence
IR Effectiveness Using Wikipedia Concepts
               MAP     P_at_10
Base           0.2076  0.4207
Base + rf      0.2470  0.4480
Concepts + rf  0.2400  0.4553
In collaboration with Paul McNamee, Johns
Hopkins University Applied Physics Laboratory
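
For readers unfamiliar with the two metrics reported for this case study, here is a short sketch of how MAP and precision at 10 are conventionally computed. These are the standard definitions, not code from the experiment; the document ids in the example are made up.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up ranking for a single query.
ranked_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
print(precision_at_k(ranked_docs, relevant_docs, k=2))
print(average_precision(ranked_docs, relevant_docs))
```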
36
Case Study 3: Named Entity Classification

Semi-automated generation of Training data
Persons, Locations and Events
Experimenting with different feature sets
Inter-article link labeling

Results showing accuracy obtained using
different feature sets
37
Case Study 4: Cross Document Entity Co-reference
Resolution [21]

Problem
To determine whether various named people,
organizations or relations from different
documents refer to the same object in the world.
For example, does the "Condoleezza Rice"
mentioned in one document refer to the same
person as the "Secretary Rice" from another?

In collaboration with Johns Hopkins University
Human Language Technology Center of Excellence
38
Entity Document (EDOC)
39
Wikitology 2.0

Enhancements
Structured Data
Specialized Concepts (article titles)
Generalized Concepts (category titles)
Inter-category and Inter-article links as
relations between concepts
Article-Category links as relations between
specialized and generalized concepts
YAGO types (to identify entity type)
Table with Disambiguation set (to identify highly
confused entities)
Aliases using Redirect pages
Un-Structured Data
Article Text
Redirect titles (added to article text)

40
Wikitology 2.0

Data Structures
Lucene Index
Concept Title + Redirected Titles (field)
Article Text + Redirected Titles (field)
RDF field with Entity Type (YAGO type)
Graphs
Category links graph
Article links graph
Article-Category links
Tables
Disambiguation Set derived from disambiguation
pages

41
Wikitology 2.0

Custom Query Front End
The EDOC's name mention strings -> Wikitology's title field
(slightly higher weight to the longest mention, e.g., Webb Hubbell)
The EDOC type -> RDF field (YAGO type)
Name mention strings + contextual text -> Text (Wikitology article contents)
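
The mapping from EDOC fields to index fields can be sketched as a query-string builder. The output uses Lucene classic query parser syntax (`field:"phrase"`, `^boost`, `field:(terms)`); the field names `title`, `type` and `text`, the boost value 2.0, and the entity type value are all hypothetical, since the slides do not give the actual index schema or weights.

```python
def build_query(mentions, entity_type, context, longest_boost=2.0):
    """Build a Lucene-style query string from an entity document (EDOC).

    mentions:    name mention strings from the EDOC.
    entity_type: the EDOC type, matched against the (assumed) type field.
    context:     contextual text, matched against article contents.
    Special characters in the inputs are not escaped in this sketch.
    """
    longest = max(mentions, key=len)
    clauses = []
    for mention in mentions:
        clause = f'title:"{mention}"'
        if mention == longest:
            # Slightly higher weight for the longest mention.
            clause += f"^{longest_boost}"
        clauses.append(clause)
    clauses.append(f'type:"{entity_type}"')
    # Name mentions plus contextual text are matched against article text.
    clauses.append("text:(" + " ".join(mentions + [context]) + ")")
    return " OR ".join(clauses)


query = build_query(["Webb Hubbell", "Hubbell"], "PER", "associate attorney general")
print(query)
```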

42
Wikitology Features
43
Features Derived from Wikitology 2.0
Name       Range  Type    Description
APL20WAS   0,1    sim     1 if the top article tags for the two entities are identical, 0 otherwise
APL21WCS   0,1    sim     1 if the top category tags for the two entities are identical, 0 otherwise
APL22WAM   0..1   sim     The cosine similarity of the medium length article vectors (N=5) for the two entities
APL23WcM   0..1   sim     The cosine similarity of the medium length category vectors (N=4) for the two entities
APL24WAL   0..1   sim     The cosine similarity of the long length article vectors (N=8) for the two entities
APL31WAS2  0..1   sim     Match of the entities' top Wikitology article tag, weighted by avg(score1,score2)
APL32WCS2  0..1   sim     Match of the entities' top Wikitology category tag, weighted by avg(score1,score2)
APL26WDP   0,1    dissim  1 if both entities are of type PER and their top article tags are different, 0 otherwise
APL27WDD   0,1    dissim  1 if the two top article tags are members of the same disambiguation set, 0 otherwise
APL28WDO   0,1    dissim  1 if both entities are of type ORG and their top article tags are different, 0 otherwise
APL29WDP2  0..1   dissim  Match if both entities are of type PER and their top article tags are different, weighted by 1-abs(score1-score2), 0 otherwise
APL30WDP2  0..1   dissim  Match if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise

Twelve features were computed for each pair of
entities using Wikitology, seven aimed at
measuring their similarity and five for
measuring their dissimilarity.
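
A few of the pairwise features can be sketched directly from their descriptions. The entity representation below (a type plus a best-first list of article tags with scores) is an assumption made for illustration, as are the example scores; only the feature names and definitions come from the table.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def pair_features(e1, e2, n_medium=5):
    """Compute a subset of the Wikitology features for a pair of entities.

    Each entity is a dict with 'type' (e.g. 'PER') and 'articles', a list of
    (article tag, score) pairs sorted best first (assumed representation).
    """
    (tag1, s1), (tag2, s2) = e1["articles"][0], e2["articles"][0]
    same_top = tag1 == tag2
    both_per = e1["type"] == e2["type"] == "PER"
    return {
        "APL20WAS": 1 if same_top else 0,
        "APL22WAM": cosine(dict(e1["articles"][:n_medium]),
                           dict(e2["articles"][:n_medium])),
        "APL31WAS2": (s1 + s2) / 2 if same_top else 0.0,
        "APL26WDP": 1 if both_per and not same_top else 0,
        "APL29WDP2": 1 - abs(s1 - s2) if both_per and not same_top else 0.0,
    }


# Example entities with invented scores.
e1 = {"type": "PER", "articles": [("Condoleezza_Rice", 0.9),
                                  ("United_States_Secretary_of_State", 0.4)]}
e2 = {"type": "PER", "articles": [("Condoleezza_Rice", 0.7),
                                  ("Stanford_University", 0.2)]}
features = pair_features(e1, e2)
print(features)
```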
44
Evaluation
Evaluation results for cross-document entity
co-reference task using Wikitology features
Match  TP rate  FP rate  Precision  Recall  F-Measure
yes    .722     .001     .966       .722    .826
no     .999     .278     .99        .999    .994
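
As a quick consistency check on the table, the F-Measure column can be recomputed from the precision and recall columns, assuming the usual balanced F-measure (the harmonic mean of the two). The formula is the standard one; the slides do not state which variant was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {"yes": (0.966, 0.722), "no": (0.99, 0.999)}
for label, (p, r) in reported.items():
    print(label, round(f_measure(p, r), 3))
```

Both rows reproduce the reported values (.826 and .994) to three decimal places.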
45
Case Study 5: Feature Generation to Improve
Information Retrieval Performance

Incorporating Generalized Concept Features in
MORAG [69] search engine

Work being done during internship at RiverGlass
Company
46

MORAG Search Engine
Concept features generated using Wikipedia (ESA)
Feature Selection using pseudo-relevance feedback
Merged Ranking of Concept scores and BOW scores

Incorporating Wikitology based features in MORAG
search engine
47
Outline

Introduction and Motivation
Related Work
Proposed Work
Timeline
Work Progress
Conclusion

48
Thesis Statement

We can exploit Wikipedia and other related
knowledge sources to automatically create
knowledge about the world supporting a set of
common use cases such as
Concept Prediction
Information Retrieval
Information Extraction

49
Proposed Contributions

Developing a Novel Hybrid Knowledge base composed
of structured and un-structured information
extracted from Wikipedia and other related
sources
Wikitology 1.0
Wikitology 2.0

50
Proposed Contributions

Developing Novel Application Specific Algorithms
for exploiting the hybrid knowledge base
Methods for Concept Prediction
Ranking methods and Spreading Activation
Co-reference Resolution
Novel Entity representation and Hybrid Querying
Information Retrieval
Document Expansion, Generalized Concept Features
augmentation

51
Proposed Contributions

Task Based Evaluation of the system on common
use-cases such as Concept Prediction, Information
Retrieval and Information Extraction
Metrics
Precision and Recall

52
The End