Title: Evaluation of Clustering Techniques on DMOZ Data
1. Evaluation of Clustering Techniques on DMOZ Data
- Alper Rifat Uluçinar
- Rifat Özcan
- Mustafa Canim
2. Outline
- What is DMOZ and why do we use it?
- What is our aim?
- Evaluation of partitioning clustering algorithms
- Evaluation of hierarchical clustering algorithms
- Conclusion
3. What is DMOZ and why do we use it?
- www.dmoz.org
- Another name for ODP, the Open Directory Project
- The largest human-edited directory on the Internet
- 5,300,000 sites
- 72,000 editors
- 590,000 categories
6. What is our aim?
- Evaluating clustering algorithms is not easy
- We will use DMOZ as a reference point (an ideal cluster structure)
- Run our own clustering algorithms on the same data
- Finally, compare the results
7. [Diagram: all DMOZ documents (websites) follow two paths: applying clustering algorithms such as C3M and K-Means on one side, and human evaluation (the DMOZ clusters) on the other; the question is how the two resulting cluster structures compare]
8. A) Evaluation of Partitioning Clustering Algorithms
- 20,000 documents from DMOZ
- Flat partitioned data (214 folders)
- We applied HTML parsing, stemming, and stop-word elimination (see the sketch after this list)
- We will apply two clustering algorithms
- C3M
- K-Means
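As a concrete illustration of this preprocessing pipeline, here is a minimal Python sketch (hypothetical code, not the authors' implementation): it strips HTML tags with a crude regex, drops words from a small illustrative stop-word subset, and stems with NLTK's PorterStemmer.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word subset (a real run would use a full list).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)          # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<html><body>The clustering of documents</body></html>"))
# -> ['cluster', 'document']
```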
9. Before applying HTML parsing, stemming, and stop-word elimination
10. After applying HTML parsing, stemming, and stop-word elimination
11. [Diagram: 20,000 DMOZ documents; human evaluation gives 214 DMOZ clusters, while applying C3M gives 642 clusters]
12. C3M Clusters vs. DMOZ Clusters
- DMOZ clusters: 214; C3M clusters: 642
- How do we compare the DMOZ clusters with the C3M clusters?
- Answer: the Corrected Rand index
13. Validation of Partitioning Clustering
- Comparison of two clustering structures
- N documents
- Clustering structure 1
- R clusters
- Clustering structure 2
- C clusters
- Metrics [1]
- Rand Index
- Jaccard Coefficient
- Corrected Rand Coefficient
14. Validation of Partitioning Clustering
For each pair of documents (e.g., d1 and d2), one of four cases holds:
- Type I, frequency a: same cluster in both structures
- Type II, frequency b: same cluster in structure 1, different clusters in structure 2
- Type III, frequency c: different clusters in structure 1, same cluster in structure 2
- Type IV, frequency d: different clusters in both structures
15. Validation of Partitioning Clustering
- Rand Index: R = (a + d) / (a + b + c + d)
- Jaccard Coefficient: J = a / (a + b + c)
- Corrected Rand Coefficient
- Accounts for randomness
- Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved
- CR = (R - E(R)) / (1 - E(R))
16. Validation of Partitioning Clustering
- Example
- Docs: d1, d2, d3, d4, d5, d6
- Clustering Structure 1
- C1 = {d1, d2, d3}
- C2 = {d4, d5, d6}
- Clustering Structure 2
- D1 = {d1, d2}
- D2 = {d3, d4}
- D3 = {d5, d6}
17. Validation of Partitioning Clustering
- a = 2: (d1, d2), (d5, d6)
- b = 4: (d1, d3), (d2, d3), (d4, d5), (d4, d6)
- c = 1: (d3, d4)
- d = 8: the remaining pairs (15 - 7 = 8)
- Rand Index = (2 + 8) / 15 = 0.67
- Jaccard Coeff. = 2 / (2 + 4 + 1) = 0.29
- Corrected Rand = 0.24

Contingency table (rows: structure 1, columns: structure 2):

     D1  D2  D3 | Sum
C1    2   1   0 |   3
C2    0   1   2 |   3
Sum   2   2   2 |   6
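This worked example can be reproduced with a short pair-counting sketch (illustrative code, not the authors' implementation); the Corrected Rand value is computed with the Hubert-Arabie formula expressed in terms of the pair counts a, b, c, d.

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    # Count the four pair types over all document pairs.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # Type I: together in both
        elif same1:
            b += 1          # Type II: together in 1, apart in 2
        elif same2:
            c += 1          # Type III: apart in 1, together in 2
        else:
            d += 1          # Type IV: apart in both
    return a, b, c, d

# Structure 1: C1 = {d1,d2,d3}, C2 = {d4,d5,d6}
# Structure 2: D1 = {d1,d2},  D2 = {d3,d4},  D3 = {d5,d6}
a, b, c, d = pair_counts([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 3, 3])
n = a + b + c + d                                  # 15 pairs
rand = (a + d) / n                                 # 10/15 = 0.67
jaccard = a / (a + b + c)                          # 2/7  = 0.29

# Corrected (Adjusted) Rand via the Hubert-Arabie form on pair counts;
# scikit-learn's adjusted_rand_score gives the same value.
expected = (a + b) * (a + c) / n
cr = (a - expected) / (((a + b) + (a + c)) / 2 - expected)   # 0.24
print(rand, jaccard, cr)
```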
18. Results
- Low Corrected Rand and Jaccard values: about 0.01
- Rand Index: 0.77
- Possible reasons
- Noise in the data, e.g., 300 "Document Not Found" pages
- The problem is difficult, e.g., the Homepages category
19. B) Evaluation of Hierarchical Clustering Algorithms
- Obtain a partitioning of DMOZ
- Determine a depth (by experiment?)
- Collect the documents at that depth or deeper (see the sketch after this list)
- Documents at shallower depths?
- Ignore them
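A hypothetical sketch of this depth cut, assuming DMOZ categories are represented as slash-separated path strings (the data layout is an assumption, not taken from the slides):

```python
from collections import defaultdict

CUT_DEPTH = 3   # hypothetical cut depth; the slides leave it to experiment

def partition_by_depth(doc_categories, cut_depth=CUT_DEPTH):
    # doc_categories: {doc_id: "Top/Arts/Music/..."} (assumed format)
    clusters = defaultdict(list)
    for doc, category in doc_categories.items():
        parts = category.split("/")
        if len(parts) < cut_depth:
            continue                          # shallower than the cut: ignore
        clusters["/".join(parts[:cut_depth])].append(doc)
    return clusters

docs = {"d1": "Top/Arts/Music/Bands",
        "d2": "Top/Arts/Music",
        "d3": "Top/Arts"}                     # d3 is ignored (depth < 3)
print(dict(partition_by_depth(docs)))         # {'Top/Arts/Music': ['d1', 'd2']}
```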
20. Hierarchical Clustering Steps
- Obtain the hierarchical clusters using
- Single Linkage
- Average Linkage
- Complete Linkage
- Obtain a partitioning from the hierarchical clustering (see the sketch below)
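A minimal sketch of this step with SciPy (illustrative points, not the slides' data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated groups of 2-D points as toy input.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])

# The three linkage strategies named on this slide.
Z_single   = linkage(points, method="single",   metric="euclidean")
Z_average  = linkage(points, method="average",  metric="euclidean")
Z_complete = linkage(points, method="complete", metric="euclidean")

# Each row of a linkage matrix is one fusion:
# (cluster index i, cluster index j, fusion height, new cluster size).
print(Z_single)
```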
21. Hierarchical Clustering Steps
- One way: treat the DMOZ clusters as queries (see the sketch after this list)
- For each selected DMOZ cluster
- Find the number of target clusters in the computerized partitioning
- Take the average
- See if Nt < Ntr
- If not, either the choice of partitioning or the hierarchical clustering did not perform well
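A hypothetical sketch of the "DMOZ clusters as queries" count (the data layout is assumed): for each DMOZ cluster, count how many machine-generated clusters its documents fall into, then average over all DMOZ clusters.

```python
def avg_target_clusters(dmoz_clusters, machine_label):
    # dmoz_clusters: {dmoz_cluster_id: [doc_id, ...]} (assumed layout)
    # machine_label: {doc_id: machine_cluster_id}
    counts = [len({machine_label[d] for d in docs})
              for docs in dmoz_clusters.values()]
    return sum(counts) / len(counts)

dmoz = {"Arts": ["d1", "d2", "d3"], "Science": ["d4", "d5"]}
machine = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2}
print(avg_target_clusters(dmoz, machine))   # (2 + 1) / 2 = 1.5
```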
22. Hierarchical Clustering Steps
- Another way
- Compare the two partitions using an index, e.g., C-RAND
23. Choice of Partition: Outline
- Obtain the dendrogram
- Single linkage
- Complete linkage
- Group average linkage
- Ward's method
24. Choice of Partition: Outline
- How to convert a hierarchical cluster structure into a partition?
- Visually inspect the dendrogram?
- Use tools from statistics?
25. Choice of Partition: Inconsistency Coefficient
- At each fusion level
- Calculate the inconsistency coefficient
- Utilize statistics from the previous fusion levels
- Choose the fusion level at which the inconsistency coefficient is at its maximum
26. Choice of Partition: Inconsistency Coefficient
- Inconsistency coefficient (I.C.) at fusion level i (the formula on the original slide was an image; a standard reconstruction follows)
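The slide's formula did not survive extraction. A standard definition, sketched below, matches the 0.7071 values on the following slides (it is the statistic computed by MATLAB's inconsistent function): h_i is the fusion height of link i, and the mean and standard deviation are taken over the heights of the links within a fixed depth below (and including) link i.

```latex
% Inconsistency coefficient at fusion level i (reconstructed, standard form):
%   \mu_i, \sigma_i = mean and standard deviation of the fusion heights of
%   the links within a fixed depth below (and including) link i.
% By convention, I(i) = 0 when \sigma_i = 0 (e.g., when only leaves merge).
I(i) = \frac{h_i - \mu_i}{\sigma_i}
```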
27. Choice of Partition: I.C. Hands-on, Objects
- [Figure: plot of the objects]
- Distance measure: Euclidean distance
28. Choice of Partition: I.C. Hands-on, Single Linkage
29. Choice of Partition: I.C. Single Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0
- Level 4 → 0
- Level 5 → 0
- Level 6 → 1.1323
- Level 7 → 0.6434
=> Cut the dendrogram at a height between levels 5 and 6
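These per-level coefficients can be computed with SciPy (illustrative points, not the slides' objects); fcluster then cuts the dendrogram where the coefficient exceeds a threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent, fcluster

# Two tight, well-separated groups as toy input.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

Z = linkage(points, method="single", metric="euclidean")
R = inconsistent(Z, d=2)   # columns: mean, std, count, inconsistency coeff.
print(R[:, 3])             # I.C. at each fusion level

# Cut where the inconsistency coefficient exceeds t = 1.0: only the final,
# "inconsistent" fusion is broken, leaving the two natural groups.
labels = fcluster(Z, t=1.0, criterion="inconsistent", depth=2)
print(labels)              # e.g., [1 1 1 2 2 2]
```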
30. Choice of Partition: I.C. Single Linkage Results
31. Choice of Partition: I.C. Hands-on, Average Linkage
32. Choice of Partition: I.C. Average Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0.7071
- Level 4 → 0
- Level 5 → 0.7071
- Level 6 → 1.0819
- Level 7 → 0.9467
=> Cut the dendrogram at a height between levels 5 and 6
33. Choice of Partition: I.C. Hands-on, Complete Linkage
34. Choice of Partition: I.C. Complete Linkage Results
- Level 1 → 0
- Level 2 → 0
- Level 3 → 0.7071
- Level 4 → 0
- Level 5 → 0.7071
- Level 6 → 1.0340
- Level 7 → 1.0116
=> Cut the dendrogram at a height between levels 5 and 6
35. Conclusion
- Our aim is to evaluate clustering techniques on DMOZ data
- Analysis of partitioning and hierarchical clustering algorithms
- If the experiments are successful, we will apply the same experiments to a larger DMOZ dataset after we download it
- Else, we will try other methodologies to improve our experimental results
36. References
- [1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- [2] T. Korenius, J. Laurikkala, M. Juhola, and K. Järvelin. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.
- www.dmoz.org