Evaluation of Clustering Techniques on DMOZ Data

Title: ALternative Energy Sources Author: Musa Last modified by: Can Created Date: 5/14/2004 12:43:08 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 37
Provided by: musa


1
Evaluation of Clustering Techniques on DMOZ Data
  • Alper Rifat Uluçinar
  • Rifat Özcan
  • Mustafa Canim

2
Outline
  • What is DMOZ and why do we use it?
  • What is our aim?
  • Evaluation of partitioning clustering algorithms
  • Evaluation of hierarchical clustering algorithms
  • Conclusion

3
What is DMOZ and why do we use it?
  • www.dmoz.org
  • Another name for ODP, Open Directory Project
  • The largest human-edited directory on the
    Internet
  • 5,300,000 sites
  • 72,000 editors
  • 590,000 categories

6
What is our aim?
  • Evaluating clustering algorithms is not easy
  • We will use DMOZ as the reference point (ideal
    cluster structure)
  • Run our own clustering algorithms on the same data
  • Finally, compare the results.

7
All DMOZ documents (websites)
  • Applying clustering algorithms such as C3M, K-Means,
    etc. yields computed clusters (?)
  • Human evaluation yields the DMOZ clusters
  • How do the two cluster structures compare (??)
8
A) Evaluation of Partitioning Clustering Algorithms
  • 20,000 documents from DMOZ
  • Flat partitioned data (214 folders)
  • We applied HTML parsing, stemming, and stop-word
    elimination
  • We will apply two clustering algorithms: C3M and
    K-Means
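
The preprocessing steps named on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (for example Porter's), which the slides do not name.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def naive_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """HTML parsing, tokenizing, stop-word elimination, then stemming."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


page = ("<html><head><style>p {color: red}</style></head>"
        "<body><p>Clustering the documents of DMOZ</p></body></html>")
print(preprocess(page))  # ['cluster', 'document', 'dmoz']
```

The resulting term lists are what a clustering algorithm would consume, typically after conversion to term-frequency vectors.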

9
Before applying HTML parsing, stemming, and stop-word
elimination
10
After applying HTML parsing, stemming, and stop-word
elimination
11
20,000 DMOZ documents
  • Applying C3M: 642 clusters
  • Human evaluation (DMOZ folders): 214 clusters
12
  • C3M clusters: 642 clusters
  • DMOZ clusters: 214 clusters
  • How to compare the DMOZ clusters and the C3M clusters?
  • Answer: Corrected Rand
13
Validation of Partitioning Clustering
  • Comparison of two clustering structures
  • N documents
  • Clustering structure 1: R clusters
  • Clustering structure 2: C clusters
  • Metrics [1]: Rand Index, Jaccard Coefficient,
    Corrected Rand Coefficient

14
Validation of Partitioning Clustering
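  • Each pair of documents (d1, d2) is classified by
    whether the two documents share a cluster in each
    clustering structure
  • Type I, frequency a: d1 and d2 are in the same
    cluster in both structures
  • The other pair types have frequencies b (same
    cluster in structure 1 only), c (same cluster in
    structure 2 only), and d (different clusters in both)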
15
Validation of Partitioning Clustering
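  • Rand Index: $R = (a + d) / (a + b + c + d)$
  • Jaccard Coefficient: $J = a / (a + b + c)$
  • Corrected Rand Coefficient
  • Accounts for randomness
  • Normalizes the Rand index so that it is 0 when the
    partitions are selected by chance and 1 when a
    perfect match is achieved.
  • $CR = (R - E(R)) / (1 - E(R))$, where $R$ is the Rand
    index and $E(R)$ is its expected value under chance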
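
The slide does not spell out how E(R) is computed. One widely used form is the Hubert and Arabie adjusted Rand index, written in terms of the contingency table entries. The symbols here are assumptions introduced for illustration: $n_{ij}$ is the number of documents shared by cluster i of structure 1 and cluster j of structure 2, $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, and N is the number of documents. Under the fixed-marginals chance model this is algebraically the same as correcting the Rand index by its expectation.

```latex
\mathrm{CR} =
\frac{\displaystyle \sum_{i,j}\binom{n_{ij}}{2}
      - \frac{\sum_i \binom{n_{i\cdot}}{2}\,\sum_j \binom{n_{\cdot j}}{2}}{\binom{N}{2}}}
     {\displaystyle \frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right]
      - \frac{\sum_i \binom{n_{i\cdot}}{2}\,\sum_j \binom{n_{\cdot j}}{2}}{\binom{N}{2}}}
```

In pair-count terms, $\sum_{i,j}\binom{n_{ij}}{2} = a$, $\sum_i \binom{n_{i\cdot}}{2} = a + b$, and $\sum_j \binom{n_{\cdot j}}{2} = a + c$.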

16
Validation of Partitioning Clustering
  • Example
  • Docs d1 , d2 , d3 , d4 , d5 , d6
  • Clustering Structure 1
  • C1 d1 , d2 , d3
  • C2 d4 , d5 , d6
  • Clustering Structure 2
  • D1 d1 , d2
  • D2 d3 , d4
  • D3 d5 , d6

17
Validation of Partitioning Clustering
  • Contingency Table

         D1   D2   D3   Total
  C1      2    1    0     3
  C2      0    1    2     3
  Total   2    2    2     6

  • a = 2: (d1, d2), (d5, d6)
  • b = 4: (d1, d3), (d2, d3), (d4, d5), (d4, d6)
  • c = 1: (d3, d4)
  • d = 8: the remaining pairs (15 - 7)
  • Rand Index = (2 + 8) / 15 = 0.67
  • Jaccard Coeff. = 2 / (2 + 4 + 1) = 0.29
  • Corrected Rand = 0.24
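
The numbers in this example can be checked with a short script. It is a minimal sketch: the function names are made up for illustration, and the corrected Rand is computed with the Hubert and Arabie chance correction, which is an assumption about the variant the authors used (it does reproduce the 0.24 on the slide).

```python
from itertools import combinations

# The two clustering structures from the example.
structure_1 = {"C1": ["d1", "d2", "d3"], "C2": ["d4", "d5", "d6"]}
structure_2 = {"D1": ["d1", "d2"], "D2": ["d3", "d4"], "D3": ["d5", "d6"]}


def labels_of(structure):
    """Map each document to the name of the cluster that contains it."""
    return {doc: name for name, docs in structure.items() for doc in docs}


def pair_counts(s1, s2):
    """Count document pairs by type: a (together in both), b (together only
    in s1), c (together only in s2), d (separated in both)."""
    l1, l2 = labels_of(s1), labels_of(s2)
    a = b = c = d = 0
    for x, y in combinations(sorted(l1), 2):
        same1, same2 = l1[x] == l1[y], l2[x] == l2[y]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def corrected_rand(s1, s2):
    """Chance-corrected agreement (assumes the partitions are not trivial,
    otherwise the denominator is zero)."""
    a, b, c, d = pair_counts(s1, s2)
    total = a + b + c + d                    # number of document pairs
    expected_a = (a + b) * (a + c) / total   # expected value of a under chance
    max_a = ((a + b) + (a + c)) / 2          # value of a for a perfect match
    return (a - expected_a) / (max_a - expected_a)


a, b, c, d = pair_counts(structure_1, structure_2)
rand_index = (a + d) / (a + b + c + d)
jaccard = a / (a + b + c)
print(a, b, c, d)                            # 2 4 1 8
print(round(rand_index, 2))                  # 0.67
print(round(jaccard, 2))                     # 0.29
print(round(corrected_rand(structure_1, structure_2), 2))  # 0.24
```

Enumerating all pairs is quadratic in the number of documents, which is fine for six documents; for collections of 20,000 documents the contingency-table form is the more practical route.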
18
Results
  • Low corrected Rand and Jaccard values: 0.01
  • Rand index: 0.77 (inflated by the many document
    pairs that are separated in both structures)
  • Possible reasons
  • Noise in the data
  • Ex.: 300 "Document Not Found" pages.
  • The problem is difficult
  • Ex.: the Homepages category.

19
B) Evaluation of Hierarchical Clustering Algorithms
  • Obtain a partitioning of DMOZ
  • Determine a depth (by experiment?)
  • Collect the documents of higher (or equal) depth at
    that level
  • Documents of lower depth? Ignore them
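
One way to read the steps above is as a truncation of each document's category path. The sketch below assumes that every document carries a slash-separated category path; the paths and the function name `partition_at_depth` are hypothetical and only illustrate the idea.

```python
def partition_at_depth(doc_paths, depth):
    """Group documents by their category path truncated to `depth` levels.

    Documents at `depth` or deeper are collected into their ancestor at that
    level; documents whose category is shallower than `depth` are ignored.
    """
    partition = {}
    for doc, path in doc_paths.items():
        levels = path.strip("/").split("/")
        if len(levels) < depth:
            continue  # document of lower depth: ignore it
        key = "/".join(levels[:depth])
        partition.setdefault(key, []).append(doc)
    return partition


# Hypothetical documents and category paths.
docs = {
    "doc1": "Arts/Music/Jazz",
    "doc2": "Arts/Music",
    "doc3": "Arts/Movies/Reviews/2004",
    "doc4": "Arts",
}
print(partition_at_depth(docs, 2))
# {'Arts/Music': ['doc1', 'doc2'], 'Arts/Movies': ['doc3']}
```

Choosing a larger depth gives more and smaller reference clusters but discards more documents, which is presumably why the slide suggests settling the depth by experiment.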

20
Hierarchical Clustering Steps
  • Obtain the hierarchical clusters using single
    linkage, average linkage, and complete linkage
  • Obtain a partitioning from the hierarchical cluster
    structure
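
The two steps above can be sketched in plain Python. This is a naive illustration under assumptions, not the implementation used in the study: it uses Euclidean distance on small point tuples, and it numbers clusters so that the original points are 0..n-1 and the k-th merge creates cluster n+k.

```python
from math import dist  # Python 3.8+


def agglomerative(points, linkage="single"):
    """Naive agglomerative clustering.

    Returns the merges in order as (cluster_id_1, cluster_id_2, height).
    """
    combine = {
        "single": min,                            # closest pair of members
        "complete": max,                          # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
    }[linkage]
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, p in enumerate(ids):
            for q in ids[i + 1:]:
                ds = [dist(points[x], points[y])
                      for x in clusters[p] for y in clusters[q]]
                h = combine(ds)
                if best is None or h < best[2]:
                    best = (p, q, h)
        p, q, h = best
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        merges.append((p, q, h))
        next_id += 1
    return merges


def cut(merges, n_points, height):
    """Partition obtained by applying only the merges at or below `height`."""
    clusters = {i: [i] for i in range(n_points)}
    for k, (p, q, h) in enumerate(merges):
        if h > height:
            break  # heights never decrease for these three linkages
        clusters[n_points + k] = clusters.pop(p) + clusters.pop(q)
    return sorted(sorted(members) for members in clusters.values())


points = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 0)]
single = agglomerative(points, "single")
print([h for _, _, h in single])      # [1.0, 1.0, 4.0, 6.0]
print(cut(single, len(points), 5.0))  # [[0, 1, 2, 3], [4]]
```

The search for the closest pair of clusters is recomputed from scratch at every step, so this only suits small examples; it is meant to make the difference between the linkage rules visible.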

21
Hierarchical Clustering Steps
  • One way: treat the DMOZ clusters as queries
  • For each selected DMOZ cluster
  • Find the number of target clusters in the
    computerized partitioning
  • Take the average
  • See if Nt < Ntr
  • If not, either the choice of partitioning or the
    hierarchical clustering did not perform well
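
A possible reading of this test, sketched below, takes Nt as the average number of computed clusters over which the documents of a DMOZ cluster are spread, and Ntr as the same average when documents are assigned to the computed clusters at random with the cluster sizes kept fixed. The slide does not define either symbol, so both readings, the function names, and the toy data are assumptions.

```python
import random


def average_target_clusters(reference, assignment):
    """Average, over reference clusters, of the number of distinct computed
    clusters that contain at least one document of the reference cluster."""
    counts = [len({assignment[doc] for doc in docs})
              for docs in reference.values()]
    return sum(counts) / len(counts)


def random_baseline(reference, assignment, trials=1000, seed=0):
    """Estimate the same average under random assignment of documents to the
    computed clusters, keeping the computed cluster sizes fixed."""
    rng = random.Random(seed)
    docs = sorted(assignment)
    labels = [assignment[doc] for doc in docs]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(labels)
        total += average_target_clusters(reference, dict(zip(docs, labels)))
    return total / trials


# Toy data: three reference clusters and a computed partition with two clusters.
reference = {"D1": ["d1", "d2"], "D2": ["d3", "d4"], "D3": ["d5", "d6"]}
assignment = {"d1": "C1", "d2": "C1", "d3": "C1",
              "d4": "C2", "d5": "C2", "d6": "C2"}

nt = average_target_clusters(reference, assignment)
ntr = random_baseline(reference, assignment)
print(nt < ntr)  # True when the computed clusters beat random assignment
```

A smaller Nt means that the documents of each reference cluster are concentrated in fewer computed clusters, which is the behavior a good clustering should show relative to the random baseline.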

22
Hierarchical Clustering Steps
  • Another way
  • Compare the two partitions using an index, e.g.
    the corrected Rand (C-RAND)

23
Choice of Partition: Outline
  • Obtain the dendrogram
  • Single linkage
  • Complete linkage
  • Group average linkage
  • Ward's method

24
Choice of Partition: Outline
  • How to convert a hierarchical cluster structure
    into a partition?
  • Visually inspect the dendrogram?
  • Use tools from statistics?

25
Choice of Partition: Inconsistency Coefficient
  • At each fusion level
  • Calculate the inconsistency coefficient
  • Utilize statistics from the previous fusion
    levels
  • Choose the fusion level for which the inconsistency
    coefficient is at its maximum.

26
Choice of Partition: Inconsistency Coefficient
  • Inconsistency coefficient (I.C.) at fusion level
    i
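
The formula on this slide did not survive in the transcript. A common definition, assumed here, compares the height of a fusion with the heights of the fusions directly below it (depth 2): the coefficient is the height minus the mean of those heights, divided by their sample standard deviation, and 0 when there is nothing below to compare against. Under this definition a fusion with exactly one fusion below it always scores about 0.7071, which is consistent with the values reported on the later result slides. The merge list format (original points 0..n-1, the k-th merge creates cluster n+k) is also an assumption.

```python
from statistics import mean, stdev


def inconsistency(merges, n_points):
    """Inconsistency coefficient of each merge, computed from the merge itself
    and the merges directly below it (depth 2)."""
    coefficients = []
    for left, right, height in merges:
        heights = [height]
        for child in (left, right):
            if child >= n_points:  # the child is itself a merge, not a point
                heights.append(merges[child - n_points][2])
        if len(heights) > 1 and stdev(heights) > 0:
            coefficients.append((height - mean(heights)) / stdev(heights))
        else:
            coefficients.append(0.0)  # nothing below to compare against
    return coefficients


# Merges of five points as (cluster_id_1, cluster_id_2, height).
merges = [(0, 1, 1.0), (2, 3, 1.0), (5, 6, 4.0), (4, 7, 6.0)]
coeffs = inconsistency(merges, n_points=5)
print([round(c, 4) for c in coeffs])  # [0.0, 0.0, 1.1547, 0.7071]

# The most inconsistent fusion is the one to undo: cut just below its height.
worst = max(range(len(coeffs)), key=coeffs.__getitem__)
print(worst + 1, merges[worst - 1][2], merges[worst][2])  # 3 1.0 4.0
```

Here the maximum occurs at fusion level 3, so the dendrogram would be cut at a height between levels 2 and 3, that is, between 1.0 and 4.0.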

27
Choice of Partition: I.C. Hands-on, Objects
  • Plot of the objects
  • Distance measure: Euclidean distance

28
Choice of Partition: I.C. Hands-on, Single Linkage
29
Choice of Partition: I.C. Single Linkage Results
  • Level 1: 0
  • Level 2: 0
  • Level 3: 0
  • Level 4: 0
  • Level 5: 0
  • Level 6: 1.1323
  • Level 7: 0.6434
  • => Cut the dendrogram at a height between levels 5
    and 6
30
Choice of Partition: I.C. Single Linkage Results
31
Choice of Partition: I.C. Hands-on, Average Linkage
32
Choice of Partition: I.C. Average Linkage Results
  • Level 1: 0
  • Level 2: 0
  • Level 3: 0.7071
  • Level 4: 0
  • Level 5: 0.7071
  • Level 6: 1.0819
  • Level 7: 0.9467
  • => Cut the dendrogram at a height between levels 5
    and 6
33
Choice of Partition: I.C. Hands-on, Complete Linkage
34
Choice of Partition: I.C. Complete Linkage Results
  • Level 1: 0
  • Level 2: 0
  • Level 3: 0.7071
  • Level 4: 0
  • Level 5: 0.7071
  • Level 6: 1.0340
  • Level 7: 1.0116
  • => Cut the dendrogram at a height between levels 5
    and 6
35
Conclusion
  • Our aim is to evaluate clustering techniques on
    DMOZ data.
  • Analysis of partitioning and hierarchical
    clustering algorithms.
  • If the experiments are successful, we will apply the
    same experiments to a larger DMOZ data set after we
    download it.
  • Otherwise, we will try other methodologies to
    improve our experiment results.

36
References
  • [1] A. K. Jain and R. C. Dubes. Algorithms for
    Clustering Data. Prentice Hall, 1988.
  • [2] T. Korenius, J. Laurikkala, M. Juhola, and
    K. Jarvelin. Hierarchical clustering of a Finnish
    newspaper article collection with graded
    relevance assessments. Information Retrieval,
    9(1). Kluwer Academic Publishers, 2006.
  • www.dmoz.org