Corpus Exploitation from Wikipedia for Ontology Construction presentation

About This Presentation

Slides: 23

Provided by: hkpu9

Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes

1
Corpus Exploitation from Wikipedia for Ontology Construction

Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen
The Department of Computing
The Hong Kong Polytechnic University

2
Outline

Introduction
Related Works
Algorithm design
Classification Tree Traversal
Ranking nodes in the classification tree
Experiments and Evaluations
Conclusion and Future Works

3
Background

Ontology Construction
Manual construction
Corpus is not necessary
Small scale
Automatic or semiautomatic construction
Domain specific corpus
Good domain knowledge coverage

4
Related Works

Corpus Selection
Corpus by linguists
British National Corpus (BNC) (Collin F. Baker et al., 1998)
Corpus from publications
Reuters News Corpus (Latifur Khan and Feng Luo, 2002)
Corpus from the Internet
Search results from the Web as corpus (P. Cimiano et al., 2004)

5
Use of Wikipedia as a Resource

Statistical and analysis work
A. Lih, 2004; Jakob Voss, 2005
Link structure and cultural bias analysis of Wiki
M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller and R. Studer, 2006; F. Bellomi and R. Bonato, 2005
Add semantic links
Add semantic links between concepts in Wiki pages
M. Völkel, 2006; Michael Strube and Simone Paolo Ponzetto, 2006
Corpus for XML retrieval
L. Denoyer and P. Gallinari, 2006

6
Problems

Manually Selected Corpus
Domain experts needed
Time and labor intensive
Corpus Collection from Publications
Limitation in time and region
Internet Exploitation
Difficulty in domain specific data identification

7
Wikipedia Overview

Established in 2001
500,000 articles in 2005
1 million articles in Nov. 2006
More than 2 million articles to date
Different types of data
Abundance of domain specific data
Availability of category information
Too many reachable nodes

8
Algorithm Design

Basic Idea
Make use of the classification tree to keep only certain qualified reachable nodes
Classification Tree Traversal
Given a Root node Pr (category node)
Breadth-First-Search Algorithm
Initialization
Wr = 1 for root node Pr
Wi = 0 if Pi is not on the current traversal path
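
The traversal and initialization steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the adjacency dictionary `children`, the function names, and the choice to record each node's depth are all assumptions made for the example. The slide only states that the search is breadth-first from a root category Pr, with Wr = 1 and other weights starting at 0.

```python
from collections import deque

def bfs_classification_tree(children, root):
    """Breadth-first traversal over category links starting at `root`.

    `children` is a hypothetical adjacency dict mapping a node to the
    nodes it links down to. Returns the visit order and the depth at
    which each node was first reached.
    """
    depth = {root: 0}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if child not in depth:  # visit each node once, so cycles are harmless
                depth[child] = depth[node] + 1
                queue.append(child)
    return order, depth

def initial_weights(all_nodes, root):
    # Wr = 1 for the root; every other node starts at 0 (assumed reading of the slide).
    return {node: (1.0 if node == root else 0.0) for node in all_nodes}

# Toy category graph (made-up node names, including a back-link to the root).
children = {"IT": ["Software", "Hardware"], "Software": ["OS", "IT"], "Hardware": ["OS"]}
order, depth = bfs_classification_tree(children, "IT")
weights = initial_weights(order, "IT")
```

Tracking visited nodes matters here because Wikipedia category links are not a strict tree; without the check, a back-link to an ancestor would loop forever.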

9
Tree traversal and weights

Wiki Graph
Classification Tree
In-edge
Out-edge
Nin(P)
Nout(P)

10
Ranking Schemes (1)

S1
Considers the sum of the scores of Pc's out-edges pointing to the classification tree against the total number of Pc's out-edges
The 1 in the denominator prevents it from being 0
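
The formula for S1 did not survive in this transcript, so the following is only one possible reading of the description above: the summed weights of the out-link targets that lie in the classification tree, divided by the total number of out-links plus 1. The names `s1_score`, `out_links`, and `tree_weights` are hypothetical, and the actual definition in the paper may differ.

```python
def s1_score(node, out_links, tree_weights):
    """One reading of S1 (an assumption, not the paper's exact formula).

    out_links:    dict mapping a node to the list of nodes it links to
    tree_weights: dict mapping nodes already in the classification tree
                  to their current weights
    """
    targets = out_links.get(node, [])
    # Only out-edges that point into the classification tree contribute.
    in_tree_sum = sum(tree_weights[t] for t in targets if t in tree_weights)
    # The +1 keeps the denominator non-zero for nodes with no out-edges.
    return in_tree_sum / (len(targets) + 1)

# Toy data: node "A" has three out-links, two of which point into the tree.
out_links = {"A": ["B", "C", "D"]}
tree_weights = {"B": 1.0, "C": 0.5}
score = s1_score("A", out_links, tree_weights)  # (1.0 + 0.5) / (3 + 1)
```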

11
Ranking Schemes (2)

S2
Considers the sum of Pc's in-edges in the classification tree against the total number of in-edges of the Pi's, which are Pc's upper-level nodes

12
Ranking Schemes (3)

S3
Considers the sum of the out-edge nodes in the classification tree divided by both Pc's out-edge scores and the in-edge scores of its upper-level nodes Pi

13
Data

Wikipedia Resource
English version in XML
1,100,000 articles
Cut off date Nov. 30, 2006
Domain Connected Branches
549,486 nodes for IT
549,433 nodes for biology

14
Evaluation on Scheme Selection

Evaluation by sampling
For Top 20,000 nodes
10 nodes in every 1,000 nodes
For Remaining nodes
10 nodes in every 10,000 nodes
Corpus size
Top 20,000
98M for IT
101M for Biology
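
The sampling plan on this slide (10 nodes from every 1,000 among the top 20,000 ranked nodes, then 10 from every 10,000 among the rest) can be sketched as below. The slide does not say how the 10 nodes in each block were chosen, so uniform random sampling without replacement is an assumption here, as are the function and parameter names.

```python
import random

def sample_ranked_nodes(ranked, top_n=20_000, top_block=1_000,
                        rest_block=10_000, per_block=10, seed=0):
    """Draw `per_block` nodes from each consecutive block of a ranked list.

    Blocks of `top_block` nodes cover the first `top_n` entries; blocks of
    `rest_block` nodes cover the remainder. Sampling is uniform within a
    block (an assumption; the slide does not specify the method).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    samples = []

    def sample_blocks(start, stop, block):
        for lo in range(start, stop, block):
            chunk = ranked[lo:min(lo + block, stop)]
            samples.extend(rng.sample(chunk, min(per_block, len(chunk))))

    top_end = min(top_n, len(ranked))
    sample_blocks(0, top_end, top_block)
    sample_blocks(top_end, len(ranked), rest_block)
    return samples

# Toy usage: 50,000 ranked node ids give 20 top blocks and 3 remaining blocks.
picked = sample_ranked_nodes(list(range(50_000)))
```

Sampling the top of the ranking more densely than the tail puts most of the manual checking effort where the corpus is actually drawn from.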

15
Sampling Results of Different Schemes
Table 1 Evaluation Result of Different Schemes in the IT Domain
Table 2 Evaluation Result of Different Schemes in the Biology Domain

16
Overall Precision on sampled data

17
Root Node Identification

Different root nodes lead to different classification structures
E.g. Category Electronics
For electronics
For IT
Compared with the Library of American Congress Classification (LACC)
Widely used library classification in most research and academic areas

18
Comparisons with LACC
Comparisons of Classification Trees with Root Nodes from Respective Domains

19
Comparisons with LACC (2)
Comparisons of Classification Tree Structures with LACC with Root Node Electronics

20
Conclusion