Title: Corpus Exploitation from Wikipedia for Ontology Construction
1. Corpus Exploitation from Wikipedia for Ontology Construction
- Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen
- The Department of Computing
- The Hong Kong Polytechnic University
2. Outline
- Introduction
- Related Works
- Algorithm design
- Classification Tree Traversal
- Ranking nodes in the classification tree
- Experiments and Evaluations
- Conclusion and Future Works
3. Background
- Ontology Construction
- Manual construction
- Corpus is not necessary
- Small scale
- Automatic or semiautomatic construction
- Domain specific corpus
- Good domain knowledge coverage
4. Related Works
- Corpus Selection
- Corpus by linguists
- British National Corpus (BNC) (Collin F. Baker et al., 1998)
- Corpus from publications
- Reuters News Corpus (Latifur Khan, Feng Luo, 2002)
- Corpus from the Internet
- Search results from the Web as corpus (P. Cimiano et al., 2004)
5. Use of Wikipedia as a Resource
- Statistical and analysis work
- A. Lih, 2004; Jakob Voss, 2005
- Link structure and cultural bias analysis of Wiki
- M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller and R. Studer, 2006; F. Bellomi and R. Bonato, 2005
- Add semantic links
- Add semantic links between concepts in Wiki pages
- M. Völkel, 2006; Michael Strube, Simone Paolo Ponzetto, 2006
- Corpus for XML retrieval
- L. Denoyer, P. Gallinari, 2006
6. Problems
- Manually Selected Corpus
- Domain experts needed
- Time and labor intensive
- Corpus Collection from Publications
- Limitation in time and region
- Internet Exploitation
- Difficulty in domain specific data identification
7. Wikipedia Overview
- Established in 2001
- 500,000 articles in 2005
- 1 million articles in Nov. 2006
- More than 2 million articles to date
- Different types of data
- Abundance of domain specific data
- Availability of category information
- Too many reachable nodes
8. Algorithm Design
- Basic Idea
- Make use of the classification tree to select only certain qualified reachable nodes
- Classification Tree Traversal
- Given a root node Pr (category node)
- Breadth-First-Search Algorithm
- Initialization
- Wr = 1 for root node Pr
- Wi = 0 if Pi is not on the current traversal path
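The traversal and initialization above can be sketched as follows. This is a minimal illustration only: the `subcategories` dictionary is a hypothetical in-memory stand-in for Wikipedia's category links, and since the slides do not state how weights are updated during traversal, the sketch only sets the initial weights and records the breadth-first visit order.

```python
from collections import deque


def traverse_classification_tree(subcategories, root):
    """Breadth-first traversal from a root category node.

    `subcategories` (hypothetical structure) maps a node name to the list
    of nodes it links down to. Returns the visit order and the initial
    weight of every known node.
    """
    # Initialization as on the slide: weight 1 for the root, 0 elsewhere.
    weights = {node: 0.0 for node in subcategories}
    for children in subcategories.values():
        for child in children:
            weights.setdefault(child, 0.0)
    weights[root] = 1.0

    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in subcategories.get(node, []):
            if child not in seen:  # category links can form cycles
                seen.add(child)
                queue.append(child)
    return order, weights


graph = {
    "Computing": ["Software", "Hardware"],
    "Software": ["Operating systems", "Computing"],  # back-link to the root
    "Hardware": ["Operating systems"],
    "Biology": ["Genetics"],  # not reachable from the chosen root
}
order, weights = traverse_classification_tree(graph, "Computing")
print(order)
```

Nodes that are never reached from the root (here the biology branch) keep weight 0, which matches the initialization rule for nodes not on the traversal path.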
9. Tree Traversal and Weights
- Wiki Graph
- Classification Tree
- In-edge
- Out-edge
- Nin(P)
- Nout(P)
10. Ranking Schemes (1)
- S1
- Considers the sum of the scores of Pc's out-edges pointing into the classification tree against the total number of Pc's out-edges
- The 1 in the denominator prevents it from being 0
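One possible reading of S1 is sketched below. The formula itself is not reproduced in these slides, so this is an interpretation of the description rather than the authors' definition: it assumes the "score" of an out-edge is the weight of its target node in the classification tree, and the function and parameter names are hypothetical.

```python
def s1_score(out_edges, tree_weights):
    """Assumed form of S1 for a candidate node Pc.

    out_edges: targets of all of Pc's out-edges.
    tree_weights: weight of each node already in the classification tree.
    """
    # Numerator: scores of the out-edges that point into the tree.
    in_tree = sum(tree_weights[t] for t in out_edges if t in tree_weights)
    # Denominator: total number of out-edges, plus 1 so it is never 0.
    return in_tree / (len(out_edges) + 1)


# Two of three out-edges point into the tree: (1.0 + 0.5) / (3 + 1)
print(s1_score(["A", "B", "X"], {"A": 1.0, "B": 0.5}))
```

A node with no out-edges at all scores 0 under this reading instead of raising a division error, which is the purpose of the added 1.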
11. Ranking Schemes (2)
- S2
- Considers the sum of Pc's in-edges in the classification tree against the total number of in-edges of the nodes Pi, which are Pc's upper-level nodes
12. Ranking Schemes (3)
- S3
- Considers the sum over the out-edge nodes in the classification tree, divided by both Pc's out-edge scores and the in-edge scores of its upper-level nodes Pi
13. Data
- Wikipedia Resource
- English version in XML
- 1,100,000 articles
- Cut-off date: Nov. 30, 2006
- Domain Connected Branches
- 549,486 nodes for IT
- 549,433 nodes for biology
14. Evaluation on Scheme Selection
- Evaluation by sampling
- For Top 20,000 nodes
- 10 nodes in every 1,000 nodes
- For Remaining nodes
- 10 nodes in every 10,000 nodes
- Corpus size
- Top 20,000
- 98M for IT
- 101M for Biology
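The sampling scheme on this slide can be sketched as follows. The slides do not say how the 10 nodes inside each block are chosen, so uniform random selection with a fixed seed is an assumption here, and `sample_for_evaluation` is a hypothetical helper name.

```python
import random


def sample_for_evaluation(ranked_nodes, top_n=20_000, seed=0):
    """Pick 10 nodes per 1,000 in the top `top_n`, then 10 per 10,000 after."""
    rng = random.Random(seed)
    sample = []

    def take(block):
        # Assumption: nodes are drawn uniformly at random within a block.
        sample.extend(rng.sample(block, min(10, len(block))))

    top, rest = ranked_nodes[:top_n], ranked_nodes[top_n:]
    for start in range(0, len(top), 1_000):
        take(top[start:start + 1_000])
    for start in range(0, len(rest), 10_000):
        take(rest[start:start + 10_000])
    return sample


ranked = list(range(50_000))  # stand-in for node ids sorted by ranking score
picked = sample_for_evaluation(ranked)
print(len(picked))  # 200 from the top 20,000 plus 30 from the remaining 30,000
```

Sampling the top of the ranking more densely concentrates the manual checking effort on the nodes most likely to end up in the corpus.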
15. Sampling Results of Different Schemes
Table 1: Evaluation Results of Different Schemes in the IT Domain
Table 2: Evaluation Results of Different Schemes in the Biology Domain
16. Overall Precision on Sampled Data
17. Root Node Identification
- Different root nodes lead to different classification structures
- E.g. Category Electronics
- For electronics
- For IT
- Compare with the Library of American Congress Classification (LACC)
- Widely used library classification in most research and academic areas
18. Comparisons with LACC
Comparisons of Classification Trees with Root Nodes from Respective Domains
19. Comparisons with LACC (2)
Comparisons of Classification Tree Structures with LACC with Root Node Electronics
20. Conclusion
- Acquire leaf nodes through qualified classification tree branches in Wiki
- Best performance requires taking both in-edges and out-edges into consideration
- Selection of proper nodes does affect the results
- Pick the most common term as the root node
21. Future Works
- Improve Ranking Functions
- Using page contents
- Using hyperlinks in contexts of pages
- Set different weight parameters for different domains
22. Thanks!
Q & A