Title: K-medoid-style Clustering Algorithms for Supervised Summary Generation
- Nidal Zeidat, Christoph F. Eick
- Dept. of Computer Science 
- University of Houston
Talk Outline
- What is Supervised Clustering? 
- Representative-based Clustering Algorithms 
- Benefits of Supervised Clustering 
- Algorithms for Supervised Clustering 
- Empirical Results 
- Conclusion and Areas of Future Work 
1. (Traditional) Clustering
- Partition a set of objects into groups of similar
 objects. Each group is called a cluster.
- Clustering is used to detect classes in a data
 set (unsupervised learning).
- Clustering is based on a fitness function that
 relies on a distance measure and usually tries to
 minimize the distance between objects within a
 cluster.
(Traditional) Clustering (continued)
[Figure: scatter plot over Attribute1 and Attribute2 with labels A, B, and C]
Supervised Clustering
- Assumes that clustering is applied to classified
 examples.
- The goal of supervised clustering is to identify
 class-uniform clusters that have a high
 probability density. It therefore prefers clusters whose
 members belong to a single class (low impurity).
- We would also like to keep the number of
 clusters low.
Supervised Clustering (continued)
[Figure: two panels, "Traditional Clustering" and "Supervised Clustering", each plotted over Attribute 1 and Attribute 2]
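A Fitness Function for Supervised Clustering
- $q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k)$
- $\mathrm{Impurity}(X) = \dfrac{\text{number of minority examples}}{n}$, and $\mathrm{Penalty}(k) = \sqrt{\dfrac{k - c}{n}}$ if $k \ge c$, $0$ if $k < c$
- $k$: number of clusters used; $n$: number of examples in the dataset; $c$: number of classes in the dataset
- $\beta$: weight for $\mathrm{Penalty}(k)$, $0 < \beta \le 2.0$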
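The fitness function can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that Impurity(X) is the fraction of examples that do not belong to the majority class of their cluster, and that Penalty(k) is sqrt((k - c) / n) for k >= c and 0 otherwise. The function names and the representation of a clustering as a list of class-label lists are hypothetical choices made for the example.

```python
import math
from collections import Counter


def impurity(clusters):
    """Fraction of examples outside the majority class of their cluster.

    `clusters` is a list of clusters, each given as a list of class labels
    (an assumed representation, chosen only for this sketch).
    """
    n = sum(len(labels) for labels in clusters)
    minority = sum(len(labels) - max(Counter(labels).values())
                   for labels in clusters if labels)
    return minority / n


def penalty(k, c, n):
    """Assumed penalty term: sqrt((k - c) / n) if k >= c, else 0."""
    return math.sqrt((k - c) / n) if k >= c else 0.0


def q(clusters, c, beta):
    """q(X) = Impurity(X) + beta * Penalty(k); lower values are better."""
    n = sum(len(labels) for labels in clusters)
    return impurity(clusters) + beta * penalty(len(clusters), c, n)
```

With k equal to the number of classes the penalty vanishes and q(X) reduces to the impurity, so a purity of 0.907 with k = c = 3 corresponds to q(X) of about 0.093.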
2. Representative-Based Supervised Clustering (RSC)
- Aims at finding a set of objects among all
 objects in the data set (called representatives)
 that best represent the objects in the data set.
 Each representative corresponds to a cluster.
- The remaining objects in the data set are then
 clustered around these representatives by
 assigning each object to the cluster of the closest
 representative.
- Remark: The popular k-medoid algorithm, also
 called PAM, is a representative-based clustering
 algorithm.
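The assignment step described above can be sketched as follows. The function name is hypothetical and Euclidean distance is an assumption; any dissimilarity measure could take its place, which is what makes the approach usable with nominal attributes.

```python
import math


def assign_to_representatives(points, rep_indices):
    """For each point, return the index (into `points`) of its closest representative.

    Euclidean distance is assumed here purely for illustration.
    """
    assignment = []
    for p in points:
        closest = min(rep_indices, key=lambda r: math.dist(p, points[r]))
        assignment.append(closest)
    return assignment
```

For example, with points `[(0, 0), (0, 1), (5, 5), (6, 5)]` and representatives at indices 0 and 2, the first two points fall into the cluster of point 0 and the last two into the cluster of point 2.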
Representative-Based Supervised Clustering (continued)
[Figure: objects plotted over Attribute1 and Attribute2, with four points labeled 1 to 4]
Objective of RSC: Find a subset $O_R$ of $O$ such that
the clustering $X$ obtained by using the objects
in $O_R$ as representatives minimizes $q(X)$.
Why do we use Representative-Based Clustering Algorithms?
- Representatives themselves are useful:
  - can be used for summarization
  - can be used for dataset compression
- Smaller search space compared with algorithms
 such as k-means.
- Less sensitive to outliers.
- Can be applied to datasets that contain nominal
 attributes (for which it is not feasible to compute means).
3. Applications of Supervised Clustering
- Enhance classification algorithms:
  - Use SC for dataset editing to enhance
   NN-classifiers [ICDM04]
  - Improve simple classifiers [ICDM03]
- Learn sub-classes / summary generation
- Distance function learning
- Dataset compression/reduction
- Measuring the difficulty of a classification
 task
Representative-Based Supervised Clustering: Dataset Editing
[Figure: two panels plotted over Attribute1 and Attribute2, with labels A to F]
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.
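As a rough illustration of dataset editing, the sketch below keeps only the cluster representatives and classifies new examples with a 1-nearest-neighbor rule over that reduced set. The function names are hypothetical, Euclidean distance is an assumption, and the representatives are taken as given (for instance, produced by a supervised clustering run).

```python
import math


def edit_dataset(points, labels, rep_indices):
    """Reduce the dataset to its cluster representatives and their class labels."""
    return [(points[r], labels[r]) for r in rep_indices]


def nn_classify(edited, query):
    """1-NN classification: return the label of the closest retained example."""
    _, label = min(edited, key=lambda pair: math.dist(pair[0], query))
    return label
```

The classifier then compares a query against one example per cluster instead of against the whole training set.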
Representative-Based Supervised Clustering: Enhance Simple Classifiers
[Figure: data set plotted over Attribute1 and Attribute2]
Representative-Based Supervised Clustering: Learning Sub-classes
[Figure: data set plotted over Attribute1 and Attribute2, with labels Ford, GMC, Ford Trucks, Ford Vans, GMC Trucks, and GMC Van]
4. Clustering Algorithms Currently Investigated
- Partitioning Around Medoids (PAM) (traditional clustering)
- Supervised Partitioning Around Medoids (SPAM)
- Single Representative Insertion/Deletion Steepest
 Descent Hill Climbing with Randomized Restart
 (SRIDHCR)
- Top Down Splitting Algorithm (TDS)
- Supervised Clustering using Evolutionary
 Computing (SCEC)
Algorithm SRIDHCR

Trials in first part (add a non-medoid)               Trials in second part (drop a medoid)
Set of medoids after adding one non-medoid   q(X)     Set of medoids after removing a medoid   q(X)
8 42 62 148 (initial solution)               0.086    42 62 148                                0.086
8 42 62 148 1                                0.091    8 62 148                                 0.073
8 42 62 148 2                                0.091    8 42 148                                 0.313
...                                          ...      8 42 62                                  0.333
8 42 62 148 52                               0.065    42 62 148                                0.086
...                                          ...
8 42 62 148 150                              0.0715

Run   Set of medoids producing lowest q(X) in the run   q(X)    Purity
0     8 42 62 148 (initial solution)                    0.086   0.947
1     8 42 62 148 52                                    0.065   0.947
2     8 42 62 148 52 122                                0.041   0.973
3     42 62 148 52 122 117                              0.030   0.987
4     8 62 148 52 122 117                               0.021   0.993
5     8 62 148 52 122 117 87                            0.016   1.000
6     8 62 52 122 117 87                                0.014   1.000
7     8 62 122 117 87                                   0.012   1.000
Algorithm SPAM
Differences between SPAM and SRIDHCR
- SPAM tries to improve the current solution by
 replacing a representative with a
 non-representative, whereas SRIDHCR improves the
 current solution by removing a representative or by
 inserting a non-representative.
- SPAM is run with the number of clusters k
 fixed, whereas SRIDHCR searches for a good
 value of k and therefore explores a larger solution
 space. However, for SRIDHCR the range of good
 choices for k is somewhat restricted by
 the selection of the parameter β.
- SRIDHCR is run r times, each time starting from a random
 initial solution, whereas SPAM is run only once.
5. Performance Measures for the Experimental Evaluation
- The investigated algorithms were evaluated based
 on the following performance measures:
  - Cluster purity (majority %).
  - Value of the fitness function q(X).
  - Average dissimilarity between all objects and
   their representatives (cluster tightness).
  - Wall-clock time (WCT): actual time, in seconds,
   that the algorithm took to finish the clustering
   task.
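Purity and tightness can be computed directly from a cluster assignment. The sketch below assumes that `assignment[i]` holds the index of the representative of object `i` and that dissimilarity is Euclidean distance; both are illustrative choices, and the function names are hypothetical.

```python
import math
from collections import Counter


def purity(labels, assignment):
    """Fraction of objects that belong to the majority class of their cluster."""
    by_cluster = {}
    for y, r in zip(labels, assignment):
        by_cluster.setdefault(r, []).append(y)
    majority = sum(max(Counter(ys).values()) for ys in by_cluster.values())
    return majority / len(labels)


def tightness(points, assignment):
    """Average dissimilarity between each object and its representative."""
    total = sum(math.dist(p, points[r]) for p, r in zip(points, assignment))
    return total / len(points)
```

Under the assumed definition of impurity as the fraction of minority examples, purity is simply one minus the impurity.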
Algorithm   Purity   q(X)     Tightness(X)
Iris-Plants data set, # clusters = 3
PAM         0.907    0.0933   0.081
SRIDHCR     0.981    0.0200   0.093
SPAM        0.973    0.0267   0.133
Vehicle data set, # clusters = 65
PAM         0.701    0.326    0.044
SRIDHCR     0.835    0.192    0.072
SPAM        0.764    0.263    0.097
Image-Segment data set, # clusters = 53
PAM         0.880    0.135    0.027
SRIDHCR     0.980    0.035    0.050
SPAM        0.944    0.071    0.061
Pima-Indian Diabetes data set, # clusters = 45
PAM         0.763    0.237    0.056
SRIDHCR     0.859    0.164    0.093
SPAM        0.822    0.202    0.086
Table 4: Traditional vs. Supervised Clustering (β = 0.1)
Algorithm   q(X)     Purity   Tightness(X)   WCT (sec.)
IRIS-Flowers Dataset, # clusters = 3
PAM         0.0933   0.907    0.081          0.06
SRIDHCR     0.0200   0.980    0.093          11.00
SPAM        0.0267   0.973    0.133          0.32
Vehicle Dataset, # clusters = 65
PAM         0.326    0.701    0.044          372.00
SRIDHCR     0.192    0.835    0.072          1715.00
SPAM        0.263    0.764    0.097          1090.00
Segmentation Dataset, # clusters = 53
PAM         0.135    0.880    0.027          4073.00
SRIDHCR     0.035    0.980    0.050          11250.00
SPAM        0.071    0.944    0.061          1422.00
Pima-Indians-Diabetes, # clusters = 45
PAM         0.237    0.763    0.056          186.00
SRIDHCR     0.164    0.859    0.093          660.00
SPAM        0.202    0.822    0.086          58.00
Table 5: Comparative Performance of the Different Algorithms (β = 0.1)
Algorithm   Avg. Purity   Tightness(X)   Avg. WCT (sec.)
IRIS-Flowers Dataset, # clusters = 3
PAM         0.907         0.081          0.06
SRIDHCR     0.959         0.104          0.18
SPAM        0.973         0.133          0.33
Vehicle Dataset, # clusters = 56
PAM         0.681         0.046          505.00
SRIDHCR     0.762         0.081          22.58
SPAM        0.754         0.100          681.00
Segmentation Dataset, # clusters = 32
PAM         0.875         0.032          1529.00
SRIDHCR     0.946         0.054          169.39
SPAM        0.940         0.065          1053.00
Pima-Indians-Diabetes, # clusters = 2
PAM         0.656         0.104          0.97
SRIDHCR     0.795         0.109          5.08
SPAM        0.772         0.125          2.70
Table 6: Average Comparative Performance of the Different Algorithms (β = 0.4)
Why is SRIDHCR performing so much better than SPAM?
- SPAM is relatively slow compared with a single
 run of SRIDHCR, allowing for 5-30 restarts of
 SRIDHCR using the same resources. This enables
 SRIDHCR to conduct a more balanced exploration of
 the search space.
- The fitness landscape induced by q(X) contains many
 plateau-like structures (q(X1) = q(X2)) and many
 local minima, and SPAM seems to get stuck more
 easily.
- The fact that SPAM uses a fixed k-value does not
 seem beneficial for finding good solutions;
 e.g., SRIDHCR might explore {u1,u2,u3,u4} → ... →
 {u1,u2,u3,u4,v1,v2} → ... → {u3,u4,v1,v2}, whereas SPAM
 might terminate with the sub-optimal solution
 {u1,u2,u3,u4} if neither the replacement of u1
 by v1 nor the replacement of u2 by v2
 enhances q(X).
Dataset        k    β         Ties % using q(X)   Ties % using Tightness(X)
Iris-Plants    10   0.00001   5.8                 0.0004
Iris-Plants    10   0.4       5.7                 0.0004
Iris-Plants    50   0.00001   20.5                0.0019
Iris-Plants    50   0.4       20.9                0.0018
Vehicle        10   0.00001   1.04                0.000001
Vehicle        10   0.4       1.06                0.000001
Vehicle        50   0.00001   1.78                0.000001
Vehicle        50   0.4       1.84                0.000001
Segmentation   10   0.00001   0.220               0.000000
Segmentation   10   0.4       0.225               0.000001
Segmentation   50   0.00001   0.626               0.000001
Segmentation   50   0.4       0.638               0.000000
Diabetes       10   0.00001   2.06                0.0
Diabetes       10   0.4       2.05                0.0
Diabetes       50   0.00001   3.43                0.0002
Diabetes       50   0.4       3.45                0.0002
Table 7: Ties distribution
Figure 2: How Purity and k Change as β Increases
6. Conclusions
- As expected, supervised clustering algorithms
 produced significantly better cluster purity than
 traditional clustering. Improvements range
 between 7% and 19% for different data sets.
- Algorithms that explore the search space too greedily,
 such as SPAM, do not seem to be very
 suitable for supervised clustering. In general,
 algorithms that explore the search space more
 randomly seem to be more suitable for supervised
 clustering.
- Supervised clustering can be used to enhance
 classifiers, summarize datasets, and generate
 better distance functions.
Future Work
- Continue work on supervised clustering algorithms:
  - Find better solutions
  - Run faster
  - Explain some observations
- Use supervised clustering for summary
 generation/learning subclasses
- Use supervised clustering to find compressed
 nearest neighbor classifiers
- Use supervised clustering to enhance simple
 classifiers
- Distance function learning
K-Means Algorithm
[Figure: objects plotted over Attribute1 and Attribute2, with four points labeled 1 to 4]