Corpus Exploitation from Wikipedia for Ontology Construction

Slides: 23
Learn more at: http://www.lrec-conf.org
Transcript and Presenter's Notes


1
Corpus Exploitation from Wikipedia for Ontology
Construction
  • Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen
  • The Department of Computing
  • The Hong Kong Polytechnic University

2
Outline
  • Introduction
  • Related Works
  • Algorithm design
  • Classification Tree Traversal
  • Ranking nodes in the classification tree
  • Experiments and Evaluations
  • Conclusion and Future Works

3
Background
  • Ontology Construction
  • Manual construction
  • Corpus is not necessary
  • Small scale
  • Automatic or semiautomatic construction
  • Domain specific corpus
  • Good domain knowledge coverage

4
Related Works
  • Corpus Selection
  • Corpus by linguists
  • British National Corpus (BNC) [Collin F. Baker
    et al., 1998]
  • Corpus from Publications
  • Reuters News Corpus [Latifur Khan, Feng Luo,
    2002]
  • Corpus from Internet
  • Web search results as corpus [P. Cimiano
    et al., 2004]

5
Use of Wikipedia as a Resource
  • Statistical and analysis work
  • [A. Lih, 2004], [Jakob Voss, 2005]
  • Link structure and cultural bias analysis of Wiki
  • [M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller and
    R. Studer, 2006], [F. Bellomi and R. Bonato,
    2005]
  • Add semantic links
  • Add semantic links between concepts in Wiki pages
  • [M. Völkel, 2006], [Michael Strube, Simone Paolo
    Ponzetto, 2006]
  • Corpus for XML retrieval
  • [L. Denoyer, P. Gallinari, 2006]

6
Problems
  • Manually Selected Corpus
  • Domain experts needed
  • Time and labor intensive
  • Corpus Collection from Publications
  • Limitation in time and region
  • Internet Exploitation
  • Difficulty in domain specific data identification

7
Wikipedia Overview
  • Established in 2001
  • 500,000 articles in 2005
  • 1 million articles in Nov. 2006
  • More than 2 million articles to date
  • Different types of data
  • Abundance of domain specific data
  • Availability of category information
  • Too many reachable nodes

8
Algorithm Design
  • Basic Idea
  • Use the classification tree to keep only
    certain qualified reachable nodes
  • Classification Tree Traversal
  • Given a root node P_r (category node)
  • Breadth-First-Search Algorithm
  • Initialization
  • W_r = 1 for root node P_r
  • W_i = 0 if P_i is not on the current traversal path
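
The traversal and initialization on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `subcategories` dictionary, the function names, and the toy category names are all hypothetical, and the slide does not say how weights change after initialization, so only the starting values are shown.

```python
from collections import deque

def traverse_classification_tree(subcategories, root):
    """Breadth-first traversal from a root category node.

    `subcategories` maps a node name to the list of its child nodes
    (a hypothetical in-memory structure, not the Wikipedia dump format).
    Returns the nodes in visiting order and their depth below the root.
    """
    depth = {root: 0}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in subcategories.get(node, []):
            if child not in depth:  # category links can form cycles
                depth[child] = depth[node] + 1
                queue.append(child)
    return order, depth

def initial_weights(all_nodes, root):
    """W_r = 1 for the root, W_i = 0 for every other node at the start."""
    return {node: (1.0 if node == root else 0.0) for node in all_nodes}

# Toy category graph (made-up data for illustration only).
toy = {
    "Computing": ["Software", "Hardware"],
    "Software": ["Operating systems", "Computing"],  # back-link to the root
    "Hardware": ["Operating systems"],
}
order, depth = traverse_classification_tree(toy, "Computing")
weights = initial_weights(order, "Computing")
```

The visited-node check matters because category links in a wiki are not guaranteed to form a strict tree, so a plain recursive descent could loop.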

9
Tree traversal and weights
  • Wiki Graph
  • Classification Tree
  • In-edge
  • Out-edge
  • N_in(P)
  • N_out(P)

10
Ranking Schemes (1)
  • S1
  • Considers the sum of the scores of P_c's out-edges
    pointing to the classification tree against the
    total number of P_c's out-edges
  • The 1 in the denominator prevents division by 0
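
The formula itself did not survive in this transcript, so the sketch below is only one reading of the bullets above: the scores of the out-edge targets that lie in the classification tree are summed and divided by the total number of out-edges plus 1. The names `s1_score`, `out_links` and `tree_scores`, and the toy data, are assumptions made for illustration.

```python
def s1_score(node, out_links, tree_scores):
    """One possible reading of S1 (assumed, not taken from the slides' formula).

    out_links:   node -> list of nodes it links to (all out-edges)
    tree_scores: node -> score, defined only for classification tree nodes
    """
    targets = out_links.get(node, [])
    # Out-edges that leave the classification tree contribute nothing.
    in_tree_total = sum(tree_scores.get(t, 0.0) for t in targets)
    # The +1 keeps the denominator non-zero for pages without out-edges.
    return in_tree_total / (len(targets) + 1)

# Toy data (made up for illustration only).
out_links = {
    "Linux": ["Operating systems", "Software", "Finland"],
    "Stub page": [],
}
tree_scores = {"Operating systems": 1.0, "Software": 0.5}

linux_score = s1_score("Linux", out_links, tree_scores)     # 1.5 / (3 + 1)
stub_score = s1_score("Stub page", out_links, tree_scores)  # 0 / (0 + 1)
```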

11
Ranking Schemes (2)
  • S2
  • Considers the summation of P_c's in-edges in the
    classification tree against the total number of
    in-edges of the P_i's, which are P_c's upper-level
    nodes

12
Ranking Schemes (3)
  • S3
  • Considers the summation of the out-edge nodes
    in the classification tree divided by both P_c's
    out-edge scores and its upper-level nodes P_i's
    in-edge scores

13
Data
  • Wikipedia Resource
  • English version in XML
  • 1,100,000 articles
  • Cut-off date: Nov. 30, 2006
  • Domain Connected Branches
  • 549,486 nodes for IT
  • 549,433 nodes for biology

14
Evaluation on Scheme Selection
  • Evaluation by sampling
  • For Top 20,000 nodes
  • 10 nodes in every 1,000 nodes
  • For Remaining nodes
  • 10 nodes in every 10,000 nodes
  • Corpus size
  • Top 20,000
  • 98M for IT
  • 101M for Biology
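
The sampling plan on this slide can be sketched as below. The slide gives only the rates (10 nodes per 1,000 in the top 20,000, 10 per 10,000 afterwards); drawing uniformly at random within each block, the fixed seed, and the function and parameter names are assumptions made for illustration.

```python
import random

def sample_for_evaluation(ranked_nodes, top_n=20_000, per_block=10,
                          top_block=1_000, rest_block=10_000, seed=0):
    """Pick evaluation samples from a list of nodes sorted by rank.

    Top `top_n` nodes:  `per_block` nodes from every `top_block` nodes.
    Remaining nodes:    `per_block` nodes from every `rest_block` nodes.
    """
    rng = random.Random(seed)
    samples = []

    def draw(nodes, block_size):
        for start in range(0, len(nodes), block_size):
            block = nodes[start:start + block_size]
            # A short final block may hold fewer than per_block nodes.
            samples.extend(rng.sample(block, min(per_block, len(block))))

    draw(ranked_nodes[:top_n], top_block)
    draw(ranked_nodes[top_n:], rest_block)
    return samples

# Toy ranked list (made-up node names, far smaller than the real data).
ranked = [f"node_{i}" for i in range(45_000)]
picked = sample_for_evaluation(ranked)
```

With 45,000 toy nodes this yields 20 blocks in the top range and 3 blocks in the remainder. Precision on the sample would then be the share of picked nodes that a human judge marks as belonging to the domain.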

15
Sampling Results of Different Schemes
Table 1: Evaluation results of different schemes in
the IT domain
Table 2: Evaluation results of different schemes in
the Biology domain
16
Overall Precision on sampled data
17
Root Node Identification
  • Different root nodes lead to different
    classification structures
  • E.g., Category: Electronics
  • For electronics
  • For IT
  • Compare with the Library of Congress
    Classification (referred to here as LACC)
  • Widely used library classification in most
    research and academic areas

18
Comparisons with LACC
Comparisons of Classification Trees with Root
Nodes from Respective Domains
19
Comparisons with LACC (2)
Comparisons of Classification Tree Structures
with LACC with Root Node Electronics
20
Conclusion
  • Acquire leaf nodes through qualified
    classification tree branches in Wiki
  • The best performance comes from taking both
    in-edges and out-edges into consideration
  • Selection of proper nodes does affect the results
  • Pick the most common term as the root node

21
Future Works
  • Improve Ranking Functions
  • Using page contents
  • Using hyperlinks in contexts of pages
  • Set different weight parameters for different
    domains

22
Thanks!
Q & A