Title: Clustering for web documents
1Clustering for web documents
2Contents
- Cluto
- Criterion Functions for Document Clustering: Experiments and Analysis (2002)
- by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
- Feature selection for web documents (2004)
3Cluto
- CLUTO: Clustering Toolkit, version 2.1.1
- Department of Computer Science, University of Minnesota, Minneapolis
- http://www-users.cs.umn.edu/~karypis/
- platform
- Linux 2.4.18
- Sun OS 5.7
- Win32
- programs
- CLUTO's user callable library
- vcluster
- scluster
4Cluto
- What is Cluto? (1/2)
- Clustering algorithms
- partitional clustering
- agglomerative clustering
- graph-partitioning clustering
- Clustering criterion functions
- provides seven different criterion functions that can drive both partitional and agglomerative clustering algorithms
- also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering
5Cluto
- What is Cluto? (2/2)
- Analyze discovered clusters
- relations between the objects assigned to each cluster
- relations between the different clusters
- identify the features that best describe and/or discriminate each cluster
- relationships between the clusters, objects, and features
- Operates on very large datasets, in terms of both
- the number of objects
- the number of dimensions
6Cluto
- Programs
- vcluster
- operates in the objects' feature space
- scluster
- operates in the objects' similarity space
- Interface
- vcluster [optional parameters] MatrixFile Ncluster
- MatrixFile: an n × m matrix; rows correspond to objects, columns to the dimensions of the feature space
- Ncluster: the number of clusters
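The MatrixFile argument can be prepared programmatically. The sketch below formats a small objects-by-features matrix as text. It assumes CLUTO's dense matrix layout (a header line with the number of rows and columns, then one line per object), which should be checked against the CLUTO manual; the helper name `format_dense_matrix` is hypothetical.

```python
def format_dense_matrix(rows):
    """Return the text of a dense n x m matrix: an "n m" header, then one object per line.

    Assumption: this mirrors CLUTO's dense matrix file layout; verify with the manual.
    """
    n, m = len(rows), len(rows[0])
    if any(len(r) != m for r in rows):
        raise ValueError("all rows must have the same number of features")
    lines = [f"{n} {m}"]
    lines += [" ".join(str(x) for x in row) for row in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three objects (rows) described by four features (columns).
    matrix = [[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]]
    print(format_dense_matrix(matrix), end="")
```

Document collections are usually sparse, so a real run would more likely use a sparse layout; the dense form is used here only because it is the easiest to read.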
7Cluto
- Parameters of Algorithms
- rb, rbr
- the k clusters are computed by k-1 repeated bisections (rbr then further optimizes the criterion function)
- direct
- computed by simultaneously finding all k clusters
- agglo
- the agglomerative paradigm
- graph
- using a nearest-neighbor graph
- bagglo
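The idea of k-1 repeated bisections can be sketched in a few lines. This is only an illustration under simplifying assumptions, not CLUTO's implementation: each bisection is a plain 2-way spherical k-means with deterministic seeds, and the largest cluster is always the one split next. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bisect(docs, idx, iters=10):
    """Split the documents listed in idx into two groups by cosine similarity."""
    # Deterministic seeds: the first document and the one least similar to it.
    first = idx[0]
    far = min(idx, key=lambda i: dot(docs[i], docs[first]))
    seeds = [docs[first], docs[far]]
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for i in idx:
            side = left if dot(docs[i], seeds[0]) >= dot(docs[i], seeds[1]) else right
            side.append(i)
        if not left or not right:
            break
        # New seeds: normalized sum of the document vectors on each side.
        seeds = [normalize([sum(col) for col in zip(*(docs[i] for i in side))])
                 for side in (left, right)]
    return left, right


def repeated_bisection(docs, k):
    """Produce k clusters with k-1 bisections, always splitting the largest cluster."""
    docs = [normalize(d) for d in docs]
    clusters = [list(range(len(docs)))]
    for _ in range(k - 1):
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = bisect(docs, largest)
        if not left or not right:  # cannot split further
            clusters.append(largest)
            break
        clusters += [left, right]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1], [0.1, 0, 0.9]]
    print(repeated_bisection(docs, 3))
```

The direct method differs in that all k clusters are sought simultaneously instead of through a sequence of 2-way splits.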
8Cluto
- Parameters of the similarity function
- cos: the cosine function (default)
- corr: the correlation coefficient
- dist: the Euclidean distance
- applicable when -clmethod=graph
- jacc: the extended Jaccard coefficient
- applicable when -clmethod=graph
9Cluto
- Parameters of the criterion function
- i1, i2, e1, g1, g1p, h1, h2
10Cluto
- Parameters of the criterion function
- slink: single link
- wslink: weighted single link
- clink: complete link
- wclink: weighted complete link
- upgma: UPGMA
- cstype
- fulltree
- rowmodel, colmodel
- showfeatures
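The parameter values listed on the preceding slides can be combined into a vcluster call. The following sketch only assembles the argument list and does not run anything. The option names `-sim` and `-crfun` are taken from the CLUTO manual as remembered and should be verified there; the helper `vcluster_command`, its defaults, and the file name are hypothetical.

```python
CLMETHODS = {"rb", "rbr", "direct", "agglo", "graph", "bagglo"}
SIMS = {"cos", "corr", "dist", "jacc"}
CRFUNS = {"i1", "i2", "e1", "g1", "g1p", "h1", "h2",
          "slink", "wslink", "clink", "wclink", "upgma"}


def vcluster_command(matrix_file, n_clusters, clmethod="rb", sim="cos", crfun="i2"):
    """Build the argument list for a vcluster run (assumed option spelling: -name=value)."""
    if clmethod not in CLMETHODS:
        raise ValueError(f"unknown clustering method: {clmethod}")
    if sim not in SIMS:
        raise ValueError(f"unknown similarity function: {sim}")
    if crfun not in CRFUNS:
        raise ValueError(f"unknown criterion function: {crfun}")
    # Per the slides, dist and jacc apply when the graph method is used.
    if sim in {"dist", "jacc"} and clmethod != "graph":
        raise ValueError(f"-sim={sim} requires -clmethod=graph")
    return ["vcluster", f"-clmethod={clmethod}", f"-sim={sim}",
            f"-crfun={crfun}", matrix_file, str(n_clusters)]


if __name__ == "__main__":
    # "docs.mat" is a hypothetical matrix file; 20 is the requested number of clusters.
    print(" ".join(vcluster_command("docs.mat", 20, clmethod="rbr", crfun="h2")))
```

The resulting list could be passed to `subprocess.run` on a machine where the vcluster binary is installed.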
12Criterion Functions for Document Clustering: Experiments and Analysis (2002)
- by Ying Zhao and George Karypis
- Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
13Data Clustering
A.K. JAIN, Michigan State University
M.N. MURTY, Indian Institute of Science
P.J. FLYNN, The Ohio State University
ACM Computing Surveys
14Introduction(1/2)
- Clustering algorithms
- Agglomerative algorithms
- UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
- Partitional algorithms
- K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based
- well suited to large datasets because they are fast
- Seven criterion functions
- measure intra-cluster similarity, inter-cluster similarity, or a combination of the two: i1, i2, e1, g1, g1p, h1, h2
15Introduction(2/2)
- Datasets
- 15 different data sets
16Preliminaries(1/3)
- Document Representation
- use the vector space model for each document
- d: document; tf: term frequency; tfi: frequency of the i-th term in the doc
- use idf or tf-idf weighting
- N: total number of documents
- Similarity Measures
- the similarity between two docs di, dj
- Cosine function
- the length of each doc vector d is normalized
- 1 if identical, 0 if nothing in common
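The representation and the cosine measure can be illustrated with a short sketch. The slide formulas were lost in extraction, so the code assumes the common tf-idf form, where term i in a document is weighted by tf_i * log(N / df_i) and df_i (a symbol not on the slide) is the number of documents containing term i. The function names are hypothetical.

```python
import math


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length tf-idf vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)  # N: total number of documents
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        v = [d.count(t) * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


if __name__ == "__main__":
    docs = [["web", "cluster", "cluster"], ["web", "cluster"], ["web", "graph"]]
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    print(cosine(vecs[0], vecs[1]))  # documents that share their only weighted term
    print(cosine(vecs[0], vecs[2]))  # documents sharing only "web", whose idf is 0
```

Because "web" occurs in every document, its idf weight is log(3/3) = 0 and it contributes nothing to the similarity.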
17Preliminaries(2/3)
- Euclidean function
- if the distance is 0, the docs are identical; if it is √2 (unit-length vectors), they have nothing in common
- Definitions
- S: the set of documents
- S1, S2, ..., Sk: the sets of documents in each of the k clusters
- k: number of clusters
- n1, n2, ..., nk: sizes (numbers of docs) of the corresponding clusters
- A: a set of docs
- composite vector DA: the sum of all doc vectors in A
- centroid vector CA: the average of the term weights of the docs in A, i.e. DA / |A|
18Preliminaries(3/3)
- Vector Properties
- Si, Sj: two sets of docs containing ni, nj documents
- Di, Dj: the composite vectors; Ci, Cj: the centroid vectors
- The sum of the pairwise similarities between the docs in Si and the docs in Sj is Di^t Dj
- The sum of the pairwise similarities between the docs in Si is ||Di||^2
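Both vector properties can be checked numerically. The sketch below assumes unit-length document vectors, so that a dot product equals the cosine similarity; the sample vectors and helper names are made up for illustration.

```python
import math


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Composite vector: the sum of all document vectors in the set."""
    return [sum(col) for col in zip(*docs)]


# Two small sets of unit-length document vectors.
Si = [unit([1, 1, 0]), unit([1, 0, 0])]
Sj = [unit([0, 1, 1]), unit([0, 1, 0]), unit([1, 0, 1])]
Di, Dj = composite(Si), composite(Sj)

# Sum of pairwise similarities between Si and Sj versus Di^t Dj.
between = sum(dot(a, b) for a in Si for b in Sj)
# Sum of pairwise similarities inside Si (self-pairs included) versus ||Di||^2.
within = sum(dot(a, b) for a in Si for b in Si)

if __name__ == "__main__":
    print(between, dot(Di, Dj))
    print(within, dot(Di, Di))
```

These identities are what make the criterion functions cheap to evaluate: only one composite vector per cluster is needed, not all pairwise similarities.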
19Criterion Functions(1/5)
- Internal Criterion Functions
- I1: uses the cosine function; maximizes the sum of the average pairwise similarities between the docs assigned to each cluster
- I1 is similar to the criterion of hierarchical agglomerative clustering that uses the group-average heuristic to decide which clusters to merge
- I2: uses the cosine function; used by the vector-space variant of the K-means algorithm
- Cr: centroid vector of cluster r
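Since the slide formulas did not survive, the sketch below uses the forms given by Zhao and Karypis, expressed through composite vectors: I1 = sum over clusters of ||Dr||^2 / nr and I2 = sum over clusters of ||Dr||. It assumes unit-length document vectors, and the function names are hypothetical.

```python
import math


def composite(docs):
    """Composite vector Dr: the sum of the document vectors in a cluster."""
    return [sum(col) for col in zip(*docs)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def i1(clusters):
    """I1: size-weighted average pairwise similarity per cluster (to be maximized)."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    """I2: summed similarity of each doc to its cluster centroid (to be maximized)."""
    return sum(norm(composite(c)) for c in clusters)


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]
    good = [[a, a], [b]]  # identical docs grouped together
    bad = [[a, b], [a]]   # unrelated docs mixed in one cluster
    print(i1(good), i1(bad))
    print(i2(good), i2(bad))
```

Both functions score the clustering that groups the identical documents higher than the one that mixes unrelated documents.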
20Criterion Functions (2/5)
- External Criterion Functions: E1, E2
- optimize a function based on how the clusters differ from each other
- the external function is derived by requiring the centroid vectors of the different clusters to be as orthogonal as possible
- C: the centroid vector of the entire collection
- D: the composite vector of the entire collection; 1/||D|| is constant
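As an illustration of E1, the sketch below uses the form from Zhao and Karypis, E1 = sum over clusters of nr * (Dr^t D) / ||Dr||, which is minimized. The constant factor 1/||D|| is dropped, which is why only ||Dr|| appears in the denominator. Unit-length document vectors are assumed and the function names are hypothetical.

```python
import math


def composite(docs):
    """Sum of the document vectors in a set."""
    return [sum(col) for col in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def e1(clusters):
    """E1: size-weighted cosine between each cluster and the whole collection (to be minimized)."""
    all_docs = [d for c in clusters for d in c]
    d_all = composite(all_docs)  # D: composite vector of the entire collection
    total = 0.0
    for c in clusters:
        d_r = composite(c)
        total += len(c) * dot(d_r, d_all) / math.sqrt(dot(d_r, d_r))
    return total


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]
    good = [[a, a], [b]]
    bad = [[a, b], [a]]
    print(e1(good), e1(bad))
```

Unlike I1 and I2, a lower value is better here: the clustering that separates the two topics yields the smaller E1.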
21Criterion Functions (3/5)
- defined with the Euclidean distance function
- Hybrid Criterion Functions: H1, H2
- maximize the similarity of the docs in each cluster while minimizing the similarity between each cluster's docs and the entire collection
- H1: combines criterion functions I1 and E1
22Criterion Functions (4/5)
- H2: combines criterion functions I2 and E1
- Graph Based Criterion Functions
- an alternative way to view the relations between docs is to use graphs
- G1: based on the pairwise similarities between the docs
- G2: based on the pairwise similarities between the docs and the terms
- S: the given collection of n docs
- Gs: the similarity graph
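The hybrid functions can be sketched as ratios of an internal function to the external one, H1 = I1 / E1 and H2 = I2 / E1, both to be maximized. This follows the forms given by Zhao and Karypis, since the slide formulas were lost; unit-length document vectors are assumed and the helper `criteria` is hypothetical.

```python
import math


def composite(docs):
    return [sum(col) for col in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def criteria(clusters):
    """Return I1, I2, E1 and the hybrid ratios H1 = I1/E1, H2 = I2/E1 for a clustering."""
    d_all = composite([d for c in clusters for d in c])
    i1 = i2 = e1 = 0.0
    for c in clusters:
        d_r = composite(c)
        norm_r = math.sqrt(dot(d_r, d_r))
        i1 += norm_r ** 2 / len(c)                 # internal, maximize
        i2 += norm_r                               # internal, maximize
        e1 += len(c) * dot(d_r, d_all) / norm_r    # external, minimize
    return {"i1": i1, "i2": i2, "e1": e1, "h1": i1 / e1, "h2": i2 / e1}


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]
    print(criteria([[a, a], [b]]))  # topics separated
    print(criteria([[a, b], [a]]))  # topics mixed
```

Dividing by E1 rewards solutions that are both internally coherent and well separated from the collection as a whole.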
23Criterion Functions (5/5)
26Experimental Results
29Data Sets
- the Natural Science category in the Naver directory (http://dir.naver.com)
- 6 subcategories in the corpus
- 1,215 docs, 17,223 terms, 20 clusters
- 5 features per doc, idf weighting
Sub Category No. of Docs. Sub Category No. of Docs.
Physics 102 Earth science 149
Biology 426 Astrology 323
Mathematics 102 Chemistry 113
Total 1,215
30Experimental parameters
- Algorithms
- rb, rbr
- the k clusters are computed by k-1 repeated bisections (rbr then further optimizes the criterion function)
- direct
- computed by simultaneously finding all k clusters
- agglo
- the agglomerative paradigm
- graph
- using a nearest-neighbor graph
31Experimental parameters
- Criterion Functions
- i1, i2, e1, g1, g1p, h1, h2, clink, slink
- Similarity Functions
- cosine measure
32Experimental results
        rb     rbr    direct  agglo  graph
I1      .464   .452   .490    .642   .417
I2      .379   .375   .374    .564
E1      .388   .398   .416    .540
G1      .389   .418   .398    .895
G1p     .326   .366   .391    .562
H1      .386   .392   .386    .541
H2      .348   .352   .367    .559
Clink                         .761
slink                         .895
33Entropy
34Experimental results
        rb     rbr    direct  agglo  graph
I1      .686   .690   .683    .548   .749
I2      .772   .762   .761    .629
E1      .741   .737   .723    .647
G1      .768   .739   .752    .367
G1p     .780   .758   .758    .647
H1      .753   .744   .758    .634
H2      .780   .782   .751    .650
Clink                         .458   Cut functions
slink                         .368   Cut functions
35Purity
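Entropy and purity compare a clustering with known class labels. The sketch below follows the usual definitions, which are assumed here because the slides do not spell them out: per-cluster entropy is normalized by log q (q is the number of classes), per-cluster purity is the share of the dominant class, and both are averaged with weights nr / n. The function name is hypothetical.

```python
import math
from collections import Counter


def entropy_and_purity(cluster_labels, class_labels):
    """Return (entropy, purity) of a clustering against reference classes."""
    n = len(class_labels)
    q = len(set(class_labels))
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, []).append(cls)
    entropy = purity = 0.0
    for members in by_cluster.values():
        n_r = len(members)
        counts = Counter(members)
        e_r = -sum((c / n_r) * math.log(c / n_r) for c in counts.values())
        if q > 1:
            e_r /= math.log(q)  # scale to the range [0, 1]
        entropy += (n_r / n) * e_r
        purity += max(counts.values()) / n  # equals (n_r / n) * (max count / n_r)
    return entropy, purity


if __name__ == "__main__":
    classes = ["physics", "physics", "biology", "biology"]
    print(entropy_and_purity([0, 0, 1, 1], classes))  # clusters match the classes
    print(entropy_and_purity([0, 0, 0, 0], classes))  # everything in one cluster
```

Lower entropy and higher purity both indicate a better match, which is how the two result tables should be read.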
36Best results
rb            rbr           direct        agglo         graph
entr   puri   entr   puri   entr   puri   entr   puri   entr   puri
g1p    g1p    h2     h2     h1     h1     h1     h1     cut    cut
0.326  0.780  0.352  0.782  0.386  0.758  0.541  0.634  0.417  0.749