Wikitology: A Wikipedia Derived Knowledge Base - PowerPoint PPT Presentation

About This Presentation
Title:

Wikitology: A Wikipedia Derived Knowledge Base

Description:

We can exploit Wikipedia and other related knowledge sources to automatically ... Refining, Enriching and Exploiting Structured Content in Wikipedia ... – PowerPoint PPT presentation

Slides: 53
Provided by: wenj

Transcript and Presenter's Notes

Title: Wikitology: A Wikipedia Derived Knowledge Base


1
Wikitology: A Wikipedia Derived Knowledge Base
  • Zareen Syed
  • Advisor: Dr. Tim Finin
  • February 6th, 2009

2
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Timeline
  • Work Progress
  • Conclusion

3
Introduction
  • Wikipedia
  • Encyclopedia
  • Developed Collaboratively
  • Freely available online
  • Millions of articles
  • English Wikipedia (2,723,767 articles)
  • Multiple Languages (More than 260)
  • Structured and un-structured content

4
Introduction
  • Wikipedia Content and Organization
  • Article Text
  • Categories and Category Hierarchy
  • Inter-article Links
  • Info-boxes
  • Disambiguation Pages
  • Redirection Pages
  • Talk Pages
  • History Pages
  • Meta-data

5
Motivation
  • Challenges
  • Human Understandable Content (not machine
    readable)
  • How to make it more structured and organized to
    improve machine readability
  • How to automatically exploit the knowledge in
    Wikipedia to solve some real world problems

6
Thesis Statement
  • We can exploit Wikipedia and other related
    knowledge sources to automatically create
    knowledge about the world supporting a set of
    common use cases such as
  • Concept Prediction
  • Information Retrieval
  • Information Extraction

7
Proposed Contributions
  • Developing a Novel Hybrid Knowledge Base composed
    of structured, semi-structured and un-structured
    information extracted from Wikipedia and other
    related sources
  • Developing Novel Application Specific Algorithms
    for exploiting the hybrid knowledge base
  • Task Based Evaluation of the system on common
    use-cases such as Concept Prediction, Information
    Retrieval and Information Extraction

8
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Timeline
  • Work Progress
  • Conclusion

9
Related Work
  • Information Extraction
  • Relation extraction [35]
  • Co-reference resolution [25]
  • Named Entity Classification [52]
  • Natural Language Processing
  • Automatic word sense disambiguation [27]
  • Searching synonyms [28]

10
Related Work
  • Information Retrieval
  • Text categorization [24]
  • Computing semantic relatedness [30,31,32]
  • Predicting document topics [26]
  • Search Engine [69]
  • Semantic Web
  • DBPedia [46]
  • Semantic MediaWiki [46]
  • Linked Open Data Project [23]
  • Freebase [22]

11
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Timeline
  • Work Progress

12
Proposed Work
  • Refining, Enriching and Exploiting Structured
    Content in Wikipedia
  • Integrating other related knowledge sources
  • Developing application specific algorithms
  • Developing a dynamic and scalable architecture

13
Issues
  • Single document in too many categories
  • George W. Bush is included in about 30
    categories
  • Links between articles belonging to very
    different categories
  • John F. Kennedy has a link for coincidence
    theory which belongs to the Mathematical
    Analysis/ Topology/Fixed Points.
  • Number of articles within a category
  • Some categories are under-represented whereas
    others have many articles
  • Administrative Categories
  • For example
  • Clean up from Sep 2006
  • Articles with unsourced statements
  • Links to words in an article
  • For example, if the word United States appears in
    the document then that word might be linked to
    the page on United States
14
Issues
  • Category Hierarchy
  • Multiple Parents (Thesaurus)
  • Noisy
  • Animals category defined in the sub-tree rooted
    at People
  • loose subsumption
  • Geography -> Geography by place -> Regions ->
    Regions of Asia -> Middle East -> Dances of Middle
    East
  • Events -> Events by year -> Lists of leaders by year

15
Refining, Enriching and Exploiting Structured
Content in Wikipedia
  • Category Hierarchy
  • Filtering out Administrative Categories
  • Algorithms for Selecting and Ranking Categories
  • Inferring and Labeling Semantic Relations between
    Categories
  • Refining Subsumption (Taxonomy)
  • Instance-of Relation
  • Using Information in Wikipedia Lists_of_Topics
  • Using Specific Administrative Categories

Done
16
Refining, Enriching and Exploiting Structured
Content in Wikipedia
  • Inter-Article Links
  • Problem: Links don't always imply semantic relatedness
  • Links to locations, term definitions, dates,
    entities
  • Possible solutions
  • Classifying Link Types
  • Introducing Link Weights

Done
17
Refining, Enriching and Exploiting Structured
Content in Wikipedia
  • Redirection Pages
  • Disambiguation pages

18
Proposed Work
  • Exploring
  • Other structured content
  • Talk pages, user pages, history pages and
    meta-data
  • Other structured resources
  • Integrating structured information from other
    sources like DBpedia and Freebase in Wikitology
  • How and When to employ reasoning over the RDF
    triples

19
Proposed Work
  • Developing Novel Application Specific Algorithms
    on top of the hybrid Wikitology Knowledge Base
    for applications such as
  • Concept Prediction
  • Information Retrieval
  • Information Extraction

20
Proposed Work
  • Evaluation
  • Main Approaches to Evaluating Ontologies
  • Gold Standard Evaluation (Comparison to an
    existing Ontology)
  • Criteria based Evaluation (By humans)
  • Task based Evaluation (Application based)
  • Comparison with Source of Data (Data driven)
  • Using a Reasoning Engine
  • Our Approach to Evaluation
  • Task based Evaluation (Application based)

21
Wikitology Overview
[Architecture diagram. Components shown: Articles, IR Index, Category Links Hierarchical Graph, Page Links Graph, Relational Database, RDF Reasoner, Triple Store, Wikitology Code, and Application Specific Algorithms (shown three times).]
22
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Time Line
  • Work Progress
  • Conclusion

23
Time Line
No. | Milestone | Expected Completion Date
1 | Enriching Wikitology by extracting additional information from Wikipedia | May 2009
2 | Studying other related knowledge sources in detail, such as Freebase, DBPedia, YAGO etc. | May 2009
3 | Incorporating additional knowledge sources to enrich Wikitology | May 2009
4 | Working on techniques to improve applications in Information Retrieval and Information Extraction using additional features generated from Wikitology | Dec 2009
5 | Evaluating the Wikitology knowledge base | May 2010
6 | Thesis write up | Aug 2010
24
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Time Line
  • Work Progress
  • Conclusion

25
Work Done
  • Case Study 1
  • Concept Prediction
  • Case Study 2
  • Document Expansion for Information Retrieval
  • Case Study 3
  • Named Entity Classification
  • Case Study 4
  • Co-reference Resolution
  • Case Study 5
  • Concept Based Features for Information Retrieval

In Progress
26
Case Study 1: Concept Prediction [2]
  • Problem: Predict the individual document topics
    as well as concepts common to a set of documents
  • Approach
  • Hybrid Knowledge base Wikitology 1.0
  • Algorithms for selecting and aggregating terms

27
Wikitology 1.0
  • Wikipedia as an Ontology
  • Each article is a concept in the ontology
  • Terms linked via Wikipedia's category system and
    inter-article links
  • It's a consensus ontology created, kept current
    and maintained by a diverse community
  • Overall content quality is high
  • Terms have unique IDs (URLs) and are self
    describing for people

28
Wikitology 1.0
  • Structured Data
  • Specialized Concepts (article titles)
  • Generalized Concepts (category titles)
  • Inter-category and Inter-article links as
    relations between concepts
  • Article-Category links as relations between
    specialized and generalized concepts
  • Un-Structured Data
  • Article Text ( A way to map ontology terms to
    free text)
  • Algorithms
  • Algorithms to select, rank and aggregate concepts
    using the hybrid knowledge base

29
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity; example edge scores 0.8, 0.2, 0.1, 0.2, 0.3]
30
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example edge scores 0.8, 0.2, 0.1, 0.2, 0.3); the similar articles connect to the Wikipedia Category Graph]
31
Method 1
Using Wikipedia Article Text and Categories to
Predict Concepts
Output
  • Rank Categories
  • Links
  • Cosine similarity

[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example edge scores 0.8, 0.2, 0.1, 0.2, 0.3); the similar articles connect to the Wikipedia Category Graph, where example values 0.9 and 3 are shown]
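The ranking step in Method 1 could be sketched roughly as follows in Python. This is a minimal illustration, not the Wikitology implementation: the whitespace tokenizer, the raw term-frequency vectors, the `top_n` cutoff, and the names `cosine` and `rank_categories` are all assumptions made for the example. Categories of the most similar articles are ranked two ways, by number of linked similar articles and by summed cosine similarity.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(query_text, articles, top_n=3):
    """articles maps title -> (text, [category names]) (hypothetical layout).

    Returns two rankings of categories: by link count and by summed score.
    """
    query_vec = Counter(query_text.lower().split())
    scored = sorted(
        ((cosine(query_vec, Counter(text.lower().split())), title)
         for title, (text, _) in articles.items()),
        reverse=True,
    )[:top_n]

    by_links = Counter()   # how many similar articles point at the category
    by_score = Counter()   # sum of cosine scores of those articles
    for score, title in scored:
        if score <= 0:
            continue  # skip articles sharing no terms with the query
        for category in articles[title][1]:
            by_links[category] += 1
            by_score[category] += score
    return by_links.most_common(), by_score.most_common()
```

Both rankings are returned so they can be compared; which one works better is a question the slides leave to the evaluation.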
32
Method 2
Using Spreading Activation on Category Links
Graph to get Aggregated Concepts
Output
  • Ranked Concepts based on Final Activation Score

[Diagram: input query doc(s) are matched to similar articles by cosine similarity (example scores 0.8, 0.2, 0.1, 0.2, 0.3); these scores feed spreading activation over the Wikipedia Category Graph, with an Input Function and an Output Function at each node]
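A rough Python sketch of spreading activation over a category graph is given below. The slides name an input function and an output function but do not define them, so the choices here are assumptions: the output function divides a node's activation (scaled by a hypothetical `decay`) equally among its parent categories, and the input function sums what arrives. The function name, the `pulses` parameter, and the graph layout are also illustrative.

```python
from collections import defaultdict


def spread_activation(graph, initial, pulses=2, decay=0.5):
    """graph maps node -> list of parent categories (assumed layout).
    initial maps node -> starting activation, e.g. a cosine similarity score.
    Returns (node, activation) pairs ranked by final activation score.
    """
    activation = defaultdict(float, initial)
    frontier = dict(initial)
    for _ in range(pulses):
        incoming = defaultdict(float)
        for node, act in frontier.items():
            targets = graph.get(node, [])
            if not targets:
                continue
            # Assumed output function: decayed activation split over parents.
            out = decay * act / len(targets)
            for target in targets:
                # Assumed input function: plain sum of incoming activation.
                incoming[target] += out
        for node, value in incoming.items():
            activation[node] += value
        frontier = incoming  # only newly activated nodes fire next pulse
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
```

With two starting categories that share a parent, the parent accumulates activation from both, which is the aggregation effect the method relies on.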
33
Method 3
Using Spreading Activation on Article Links Graph
  • Threshold: Ignore spreading activation to
    articles with less than 0.4 cosine similarity
    score
  • Edge Weights: Cosine similarity between linked
    articles
Output
  • Ranked Concepts based on Final Activation Score

[Diagram: input query doc(s) are matched to similar articles; spreading activation runs over the Wikipedia Article Links Graph, with a Node Input Function and a Node Output Function]
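The thresholded, edge-weighted variant in Method 3 might look like the Python sketch below. It assumes the 0.4 threshold is applied to the edge weight (the cosine similarity between the two linked articles) and that a node passes on its activation multiplied by that weight; the slides do not spell out either detail, and the names `spread_over_links`, `links`, and `pulses` are hypothetical.

```python
def spread_over_links(links, initial, threshold=0.4, pulses=1):
    """links maps article -> {linked article: cosine similarity} (assumed).
    initial maps article -> starting activation.
    Returns (article, activation) pairs ranked by final activation score.
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(pulses):
        incoming = {}
        for node, act in frontier.items():
            for neighbor, weight in links.get(node, {}).items():
                if weight < threshold:
                    continue  # weakly related link: do not spread along it
                # Assumed node output: activation scaled by the edge weight.
                incoming[neighbor] = incoming.get(neighbor, 0.0) + act * weight
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
```

The threshold is what keeps links to locations, dates, and similar loosely related pages from receiving activation.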
34
Wikitology 1.0
  • The system was evaluated by predicting the
    categories and article links of existing
    Wikipedia articles and comparing with the ground
    truth
  • It was observed that Wikitology 1.0 system was
    able to predict the document topics and common
    concepts with high accuracy when the article
    concepts were well represented within Wikipedia

35
Case Study 2: Document Expansion with Wikipedia
Derived Ontology Terms [21]
Preliminary work with TREC documents
Doc FT921-4598 (3/9/92) ... Alan Turing,
described as a brilliant mathematician and a key
figure in the breaking of the Nazis' Enigma
codes. Prof IJ Good says it is as well that
British security was unaware of Turing's
homosexuality, otherwise he might have been fired
'and we might have lost the war'. In 1950 Turing
wrote the seminal paper 'Computing Machinery And
Intelligence', but in 1954 killed himself
... Turing_machine, Turing_test,
Church_Turing_thesis, Halting_problem,
Computable_number, Bombe, Alan_Turing,
Recursion_theory, Formal_methods,
Computational_models, Theory_of_computation,
Theoretical_computer_science,
Artificial_Intelligence
IR Effectiveness Using Wikipedia Concepts
             MAP     P_at_10
base         0.2076  0.4207
Base rf      0.2470  0.4480
Concepts rf  0.2400  0.4553
In Collaboration with Paul McNamee, Johns
Hopkins University Applied Physics Laboratory
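For readers unfamiliar with the two metrics in the table, the Python sketch below shows the standard definitions of precision at k and (mean) average precision. It is a generic illustration with hypothetical function names and toy inputs; it is not the evaluation code used for the TREC experiments.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```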
36
Case Study 3: Named Entity Classification
  • Semi-automated generation of Training data
  • Persons, Locations and Events
  • Experimenting with different feature sets
  • Inter-article link labeling

Results showing accuracy obtained using
different feature sets
37
Case Study 4: Cross Document Entity Co-reference
Resolution [21]
  • Problem
  • To determine whether various named people,
    organizations or relations from different
    documents refer to the same object in the world.
  • For example, does the "Condoleezza Rice"
    mentioned in one document refer to the same
    person as the "Secretary Rice" from another?

In Collaboration with Johns Hopkins University
Human Language Technology Center of Excellence
38
Entity Document (EDOC)
39
Wikitology 2.0
  • Enhancements
  • Structured Data
  • Specialized Concepts (article titles)
  • Generalized Concepts (category titles)
  • Inter-category and Inter-article links as
    relations between concepts
  • Article-Category links as relations between
    specialized and generalized concepts
  • YAGO types (to identify entity type)
  • Table with Disambiguation set (to identify highly
    confused entities)
  • Aliases using Redirect pages
  • Un-Structured Data
  • Article Text
  • Redirect titles (added to article text)

40
Wikitology 2.0
  • Data Structures
  • Lucene Index
  • Concept Title Redirected Titles (field)
  • Article Text Redirected Titles (field)
  • RDF field with Entity Type (YAGO type)
  • Graphs
  • Category links graph
  • Article links graph
  • Article-Category links
  • Tables
  • Disambiguation Set derived from disambiguation
    pages

41
Wikitology 2.0
  • Custom Query Front end
  • The EDOC's name mention strings
  • Wikitology's title field
  • slightly higher weight to the longest mention,
    i.e., Webb Hubbell
  • The EDOC type
  • RDF Field YAGO Type
  • Name mention strings Contextual text
  • Text (Wikitology Article Contents)
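As a rough illustration of how such a query could be assembled, the Python sketch below builds a query string in Lucene's classic query-parser syntax (`field:"phrase"`, `^boost`, and `field:(terms)` grouping). The field names `title`, `type`, and `text`, the boost value 1.5, and the function name `build_query` are assumptions for the example; the slides do not give the actual field names or weights.

```python
import re


def build_query(mentions, entity_type, context, longest_boost=1.5):
    """Map EDOC parts onto index fields (hypothetical field names).

    mentions     -> title field, longest mention boosted slightly
    entity_type  -> type field
    mentions + context -> text field
    """
    longest = max(mentions, key=len)
    clauses = []
    for mention in mentions:
        boost = f"^{longest_boost}" if mention == longest else ""
        clauses.append(f'title:"{mention}"{boost}')
    clauses.append(f'type:"{entity_type}"')
    # Keep only word characters and lowercase them so that words such as
    # AND/OR/NOT in the context are not parsed as query operators.
    words = re.findall(r"\w+", (" ".join(mentions) + " " + context).lower())
    clauses.append("text:(" + " ".join(words) + ")")
    return " ".join(clauses)
```

For the mentions "Hubbell" and "Webb Hubbell", only the longer one carries the boost, mirroring the "slightly higher weight to the longest mention" bullet.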

42
Wikitology Features
43
Features Derived from Wikitology 2.0
Name | Range | Type | Description
APL20WAS | 0,1 | sim | 1 if the top article tags for the two entities are identical, 0 otherwise
APL21WCS | 0,1 | sim | 1 if the top category tags for the two entities are identical, 0 otherwise
APL22WAM | 0..1 | sim | The cosine similarity of the medium length article vectors (N=5) for the two entities
APL23WcM | 0..1 | sim | The cosine similarity of the medium length category vectors (N=4) for the two entities
APL24WAL | 0..1 | sim | The cosine similarity of the long length article vectors (N=8) for the two entities
APL31WAS2 | 0..1 | sim | Match of entities' top Wikitology article tag, weighted by avg(score1,score2)
APL32WCS2 | 0..1 | sim | Match of entities' top Wikitology category tag, weighted by avg(score1,score2)
APL26WDP | 0,1 | dissim | 1 if both entities are of type PER and their top article tags are different, 0 otherwise
APL27WDD | 0,1 | dissim | 1 if the two top article tags are members of the same disambiguation set, 0 otherwise
APL28WDO | 0,1 | dissim | 1 if both entities are of type ORG and their top article tags are different, 0 otherwise
APL29WDP2 | 0..1 | dissim | Match if both entities are of type PER and their top article tags are different, weighted by 1-abs(score1-score2), 0 otherwise
APL30WDP2 | 0..1 | dissim | Match if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise
Twelve features were computed for each pair of
entities using Wikitology, seven aimed at
measuring their similarity and five for measuring
their dissimilarity.
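A few of these pairwise features can be sketched in Python as below. The sketch assumes each entity carries a ranked list of (tag, score) pairs with scores in 0..1, and that the "length N" vectors are simply the top N such pairs; both are assumptions, as are the function names. It follows the descriptions above and is not the code used in the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse tag -> score vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_tag_match(a, b):
    """Binary sim feature: 1 if the top tags are identical, 0 otherwise."""
    return 1 if a[0][0] == b[0][0] else 0


def weighted_top_match(a, b):
    """Sim feature: top-tag match weighted by avg(score1, score2)."""
    return (a[0][1] + b[0][1]) / 2 if a[0][0] == b[0][0] else 0.0


def vector_similarity(a, b, n):
    """Sim feature: cosine similarity of the top-n tag vectors."""
    return cosine(dict(a[:n]), dict(b[:n]))


def weighted_dissim(a, b, type_a, type_b, wanted="PER"):
    """Dissim feature: same entity type but different top tags,
    weighted by 1 - abs(score1 - score2); 0 otherwise."""
    if type_a == type_b == wanted and a[0][0] != b[0][0]:
        return 1 - abs(a[0][1] - b[0][1])
    return 0.0
```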
44
Evaluation
Evaluation results for cross-document entity
co-reference task using Wikitology features
match  TP rate  FP rate  Precision  Recall  F-Measure
yes    .722     .001     .966       .722    .826
no     .999     .278     .99        .999    .994
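The F-Measure column can be checked against the reported precision and recall using the usual harmonic-mean definition, F = 2PR / (P + R). The short Python sketch below does that arithmetic; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported (precision, recall) per class in the evaluation results.
reported = {"yes": (0.966, 0.722), "no": (0.99, 0.999)}
for label, (p, r) in reported.items():
    print(label, round(f_measure(p, r), 3))
```

Rounded to three decimals, both values agree with the table. Note also that the recall for each class equals its TP rate, as expected.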
45
Case Study 5: Feature Generation to Improve
Information Retrieval Performance
  • Incorporating Generalized Concept Features in
    MORAG [69] search engine

Work being done during internship at RiverGlass
Company
46
  • MORAG Search Engine
  • Concept features generated using Wikipedia (ESA)
  • Feature Selection using pseudo-relevance feedback
  • Merged Ranking of Concept scores and BOW scores

Incorporating Wikitology based features in MORAG
search engine
47
Outline
  • Introduction and Motivation
  • Related Work
  • Proposed Work
  • Timeline
  • Work Progress
  • Conclusion

48
Thesis Statement
  • We can exploit Wikipedia and other related
    knowledge sources to automatically create
    knowledge about the world supporting a set of
    common use cases such as
  • Concept Prediction
  • Information Retrieval
  • Information Extraction

49
Proposed Contributions
  • Developing a Novel Hybrid Knowledge base composed
    of structured and un-structured information
    extracted from Wikipedia and other related
    sources
  • Wikitology 1.0
  • Wikitology 2.0

50
Proposed Contributions
  • Developing Novel Application Specific Algorithms
    for exploiting the hybrid knowledge base
  • Methods for Concept Prediction
  • Ranking methods and Spreading Activation
  • Co-reference Resolution
  • Novel Entity representation and Hybrid Querying
  • Information Retrieval
  • Document Expansion, Generalized Concept Features
    augmentation

51
Proposed Contributions
  • Task Based Evaluation of the system on common
    use-cases such as Concept Prediction, Information
    Retrieval and Information Extraction
  • Metrics
  • Precision and Recall

52
The End
  • Thank you
  • Questions?