Title: Evaluation of Clustering Techniques on DMOZ Data
1. Evaluation of Clustering Techniques on DMOZ Data
- Alper Rifat Uluçinar
- Rifat Özcan
- Mustafa Canim
2. Outline
- What is DMOZ and why do we use it?
- What is our aim?
- Evaluation of partitioning clustering algorithms
- Evaluation of hierarchical clustering algorithms
- Conclusion
3. What is DMOZ and why do we use it?
- www.dmoz.org
- Another name for ODP, the Open Directory Project
- The largest human-edited directory on the Internet
- 5,300,000 sites
- 72,000 editors
- 590,000 categories
6. What is our aim?
- Evaluating clustering algorithms is not easy
- We will use DMOZ as a reference point (an ideal cluster structure)
- Run our own clustering algorithms on the same data
- Finally, compare the results
7. [Diagram: all DMOZ documents (websites) follow two paths: applying clustering algorithms such as C3M and K-Means on one side, and human evaluation (the DMOZ clusters) on the other; the question is how the two resulting cluster structures compare]
8. A) Evaluation of Partitioning Clustering Algorithms
- 20,000 documents from DMOZ
- Flat partitioned data (214 folders)
- We applied HTML parsing, stemming, and stop-word elimination (see the sketch after this list)
- We will apply two clustering algorithms
- C3M
- K-Means
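As a concrete illustration of this preprocessing pipeline, here is a minimal Python sketch (hypothetical code, not the authors' implementation): it strips HTML tags with a crude regex, drops words from a small illustrative stop-word subset, and stems with NLTK's PorterStemmer.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word subset (a real run would use a full list).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)          # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<html><body>The clustering of documents</body></html>"))
# -> ['cluster', 'document']
```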
9. Before applying HTML parsing, stemming, and stop-word elimination
10. After applying HTML parsing, stemming, and stop-word elimination
11. [Diagram: 20,000 DMOZ documents; human evaluation gives 214 DMOZ clusters, while applying C3M gives 642 clusters]
12. C3M Clusters vs. DMOZ Clusters
- DMOZ clusters: 214; C3M clusters: 642
- How do we compare the DMOZ clusters with the C3M clusters?
- Answer: the Corrected Rand index
13. Validation of Partitioning Clustering
- Comparison of two clustering structures
- N documents
- Clustering structure 1
- R clusters
- Clustering structure 2
- C clusters
- Metrics [1]
- Rand Index
- Jaccard Coefficient
- Corrected Rand Coefficient
14. Validation of Partitioning Clustering
For each pair of documents (e.g., d1 and d2), one of four cases holds:
- Type I, frequency a: same cluster in both structures
- Type II, frequency b: same cluster in structure 1, different clusters in structure 2
- Type III, frequency c: different clusters in structure 1, same cluster in structure 2
- Type IV, frequency d: different clusters in both structures
15. Validation of Partitioning Clustering
- Rand Index: R = (a + d) / (a + b + c + d)
- Jaccard Coefficient: J = a / (a + b + c)
- Corrected Rand Coefficient
- Accounts for randomness
- Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved
- CR = (R - E(R)) / (1 - E(R))
16. Validation of Partitioning Clustering
- Example
- Docs: d1, d2, d3, d4, d5, d6
- Clustering Structure 1
- C1 = {d1, d2, d3}
- C2 = {d4, d5, d6}
- Clustering Structure 2
- D1 = {d1, d2}
- D2 = {d3, d4}
- D3 = {d5, d6}
17. Validation of Partitioning Clustering
- a = 2: (d1, d2), (d5, d6)
- b = 4: (d1, d3), (d2, d3), (d4, d5), (d4, d6)
- c = 1: (d3, d4)
- d = 8: the remaining pairs (15 - 7 = 8)
- Rand Index = (2 + 8) / 15 = 0.67
- Jaccard Coeff. = 2 / (2 + 4 + 1) = 0.29
- Corrected Rand = 0.24

Contingency table (rows: structure 1, columns: structure 2):

     D1  D2  D3 | Sum
C1    2   1   0 |   3
C2    0   1   2 |   3
Sum   2   2   2 |   6
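This worked example can be reproduced with a short pair-counting sketch (illustrative code, not the authors' implementation); the Corrected Rand value is computed with the Hubert-Arabie formula expressed in terms of the pair counts a, b, c, d.

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    # Count the four pair types over all document pairs.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # Type I: together in both
        elif same1:
            b += 1          # Type II: together in 1, apart in 2
        elif same2:
            c += 1          # Type III: apart in 1, together in 2
        else:
            d += 1          # Type IV: apart in both
    return a, b, c, d

# Structure 1: C1 = {d1,d2,d3}, C2 = {d4,d5,d6}
# Structure 2: D1 = {d1,d2},  D2 = {d3,d4},  D3 = {d5,d6}
a, b, c, d = pair_counts([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 3, 3])
n = a + b + c + d                                  # 15 pairs
rand = (a + d) / n                                 # 10/15 = 0.67
jaccard = a / (a + b + c)                          # 2/7  = 0.29

# Corrected (Adjusted) Rand via the Hubert-Arabie form on pair counts;
# scikit-learn's adjusted_rand_score gives the same value.
expected = (a + b) * (a + c) / n
cr = (a - expected) / (((a + b) + (a + c)) / 2 - expected)   # 0.24
print(rand, jaccard, cr)
```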
18. Results
- Low Corrected Rand and Jaccard values: about 0.01
- Rand Index: 0.77
- Possible reasons
- Noise in the data, e.g., 300 "Document Not Found" pages
- The problem is difficult, e.g., the Homepages category
19. B) Evaluation of Hierarchical Clustering Algorithms
- Obtain a partitioning of DMOZ
- Determine a depth (by experiment?)
- Collect the documents at that depth or deeper (see the sketch after this list)
- Documents at shallower depths?
- Ignore them
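A hypothetical sketch of this depth cut, assuming DMOZ categories are represented as slash-separated path strings (the data layout is an assumption, not taken from the slides):

```python
from collections import defaultdict

CUT_DEPTH = 3   # hypothetical cut depth; the slides leave it to experiment

def partition_by_depth(doc_categories, cut_depth=CUT_DEPTH):
    # doc_categories: {doc_id: "Top/Arts/Music/..."} (assumed format)
    clusters = defaultdict(list)
    for doc, category in doc_categories.items():
        parts = category.split("/")
        if len(parts) < cut_depth:
            continue                          # shallower than the cut: ignore
        clusters["/".join(parts[:cut_depth])].append(doc)
    return clusters

docs = {"d1": "Top/Arts/Music/Bands",
        "d2": "Top/Arts/Music",
        "d3": "Top/Arts"}                     # d3 is ignored (depth < 3)
print(dict(partition_by_depth(docs)))         # {'Top/Arts/Music': ['d1', 'd2']}
```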
20. Hierarchical Clustering Steps
- Obtain the hierarchical clusters using
- Single Linkage
- Average Linkage
- Complete Linkage
- Obtain a partitioning from the hierarchical clustering (see the sketch below)
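A minimal sketch of this step with SciPy (illustrative points, not the slides' data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated groups of 2-D points as toy input.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])

# The three linkage strategies named on this slide.
Z_single   = linkage(points, method="single",   metric="euclidean")
Z_average  = linkage(points, method="average",  metric="euclidean")
Z_complete = linkage(points, method="complete", metric="euclidean")

# Each row of a linkage matrix is one fusion:
# (cluster index i, cluster index j, fusion height, new cluster size).
print(Z_single)
```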
21. Hierarchical Clustering Steps
- One way: treat the DMOZ clusters as queries (see the sketch after this list)
- For each selected DMOZ cluster
- Find the number of target clusters in the computerized partitioning
- Take the average
- See if Nt < Ntr
- If not, either the choice of partitioning or the hierarchical clustering did not perform well
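A hypothetical sketch of the "DMOZ clusters as queries" count (the data layout is assumed): for each DMOZ cluster, count how many machine-generated clusters its documents fall into, then average over all DMOZ clusters.

```python
def avg_target_clusters(dmoz_clusters, machine_label):
    # dmoz_clusters: {dmoz_cluster_id: [doc_id, ...]} (assumed layout)
    # machine_label: {doc_id: machine_cluster_id}
    counts = [len({machine_label[d] for d in docs})
              for docs in dmoz_clusters.values()]
    return sum(counts) / len(counts)

dmoz = {"Arts": ["d1", "d2", "d3"], "Science": ["d4", "d5"]}
machine = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2}
print(avg_target_clusters(dmoz, machine))   # (2 + 1) / 2 = 1.5
```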
22. Hierarchical Clustering Steps
- Another way
- Compare the two partitions using an index, e.g., C-RAND
23. Choice of Partition: Outline
- Obtain the dendrogram
- Single linkage
- Complete linkage
- Group average linkage
- Ward's method
24. Choice of Partition: Outline
- How to convert a hierarchical cluster structure into a partition?
- Visually inspect the dendrogram?
- Use tools from statistics?
25. Choice of Partition: Inconsistency Coefficient
- At each fusion level
- Calculate the inconsistency coefficient
- Utilize statistics from the previous fusion levels
- Choose the fusion level at which the inconsistency coefficient is at its maximum
26. Choice of Partition: Inconsistency Coefficient
- Inconsistency coefficient (I.C.) at fusion level i (the formula on the original slide was an image; a standard reconstruction follows)
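The slide's formula did not survive extraction. A standard definition, sketched below, matches the 0.7071 values on the following slides (it is the statistic computed by MATLAB's inconsistent function): h_i is the fusion height of link i, and the mean and standard deviation are taken over the heights of the links within a fixed depth below (and including) link i.

```latex
% Inconsistency coefficient at fusion level i (reconstructed, standard form):
%   \mu_i, \sigma_i = mean and standard deviation of the fusion heights of
%   the links within a fixed depth below (and including) link i.
% By convention, I(i) = 0 when \sigma_i = 0 (e.g., when only leaves merge).
I(i) = \frac{h_i - \mu_i}{\sigma_i}
```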
27. Choice of Partition: I.C. Hands-on, Objects
- [Figure: plot of the objects]
- Distance measure: Euclidean distance
28. Choice of Partition: I.C. Hands-on, Single Linkage
29. Choice of Partition: I.C. Single Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0
- Level 4 → 0
- Level 5 → 0
- Level 6 → 1.1323
- Level 7 → 0.6434
=> Cut the dendrogram at a height between levels 5 and 6
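These per-level coefficients can be computed with SciPy (illustrative points, not the slides' objects); fcluster then cuts the dendrogram where the coefficient exceeds a threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent, fcluster

# Two tight, well-separated groups as toy input.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

Z = linkage(points, method="single", metric="euclidean")
R = inconsistent(Z, d=2)   # columns: mean, std, count, inconsistency coeff.
print(R[:, 3])             # I.C. at each fusion level

# Cut where the inconsistency coefficient exceeds t = 1.0: only the final,
# "inconsistent" fusion is broken, leaving the two natural groups.
labels = fcluster(Z, t=1.0, criterion="inconsistent", depth=2)
print(labels)              # e.g., [1 1 1 2 2 2]
```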
30. Choice of Partition: I.C. Single Linkage Results
31. Choice of Partition: I.C. Hands-on, Average Linkage
32. Choice of Partition: I.C. Average Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0.7071
- Level 4 → 0
- Level 5 → 0.7071
- Level 6 → 1.0819
- Level 7 → 0.9467
=> Cut the dendrogram at a height between levels 5 and 6
33. Choice of Partition: I.C. Hands-on, Complete Linkage
34. Choice of Partition: I.C. Complete Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0.7071
- Level 4 → 0
- Level 5 → 0.7071
- Level 6 → 1.0340
- Level 7 → 1.0116
=> Cut the dendrogram at a height between levels 5 and 6
35. Conclusion
- Our aim is to evaluate clustering techniques on DMOZ data
- Analysis of partitioning and hierarchical clustering algorithms
- If the experiments are successful, we will apply the same experiments to a larger DMOZ dataset after we download it
- Else, we will try other methodologies to improve our experimental results
36. References
- [1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- [2] T. Korenius, J. Laurikkala, M. Juhola, and K. Järvelin. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.
- www.dmoz.org