Title: Using Supervised Clustering to Enhance Classifiers
- Christoph F. Eick and Nidal Zeidat
- Department of Computer Science
- University of Houston
Organization of the Talk
- Supervised Clustering
- Representative-based Supervised Clustering Algorithms
- Applications: Using Supervised Clustering for
  - Dataset Editing
  - Class Decomposition
  - Region Discovery in Spatial Datasets
- Summary and Future Work
List of Persons that Contributed to the Work Presented in Today's Talk
- Tae-Wan Ryu (former PhD student, now faculty member at Cal State Fullerton)
- Ricardo Vilalta (colleague at UH since 2002, Co-Director of UH's Data Mining and Knowledge Discovery Group)
- Murali Achari (former Master's student)
- Alain Rouhana (former Master's student)
- Abraham Bagherjeiran (current PhD student)
- Chunshen Chen (current Master's student)
- Nidal Zeidat (current PhD student)
- Sujing Wang (current PhD student)
- Kim Wee (current MS student)
- Zhenghong Zhao (former Master's student)
1. Introduction

Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
Motivation: Finding Subclasses Using SC

[Figure: a two-dimensional dataset (Attribute1 vs. Attribute2) in which the classes Ford and GMC break into subclasses: Ford Trucks, Ford Vans, Ford SUVs, GMC Trucks, GMC Vans, and GMC SUVs.]
Related Work: Supervised Clustering
- Sinkkonen's discriminative clustering [SKN02] and Tishby's information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering algorithms.
- There has been a lot of work in the area of semi-supervised clustering, which centers on clustering with background information. Although the focus of that work is traditional clustering, there is still a lot of similarity between the techniques and algorithms it investigates and the techniques and algorithms we investigate.
2. Representative-Based Supervised Clustering

- Aims at finding a set of objects (called representatives) among all objects in the dataset that best represent the objects in the dataset. Each representative corresponds to a cluster.
- The remaining objects in the dataset are then clustered around these representatives by assigning each object to the cluster of the closest representative (see the sketch below).
- Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
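A minimal sketch of the assignment step, assuming Euclidean distance; the function name and array layout are illustrative, not the talk's exact implementation.

```python
import numpy as np

def assign_to_representatives(X, rep_indices):
    """Assign every object to the cluster of its closest representative.

    X           : (n, d) numpy array of objects
    rep_indices : indices into X serving as cluster representatives
    returns     : (n,) array of cluster ids (positions in rep_indices)
    """
    reps = X[rep_indices]                               # (k, d) representatives
    # Euclidean distance from every object to every representative
    dists = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    return dists.argmin(axis=1)                         # closest representative wins
```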
Representative-Based Supervised Clustering (continued)

[Figure: a two-dimensional dataset (Attribute1 vs. Attribute2) partitioned into four clusters around representatives 1-4.]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
SC Algorithms Currently Investigated
- Supervised Partitioning Around Medoids (SPAM)
- Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
- Top Down Splitting Algorithm (TDS)
- Supervised Clustering using Evolutionary Computing (SCEC)
- Agglomerative Hierarchical Supervised Clustering (AHSC)
- Grid-Based Supervised Clustering (GRIDSC)
A Fitness Function for Supervised Clustering

q(X) = Impurity(X) + β·Penalty(k)

where Impurity(X) is the fraction of minority examples (examples whose class differs from the majority class of their cluster), and

Penalty(k) = sqrt((k − c)/n) if k ≥ c, and 0 otherwise,

with k = number of clusters used, n = number of examples in the dataset, c = number of classes in the dataset, and β = weight for Penalty(k), 0 < β ≤ 2.0.

Penalty(k) increases sub-linearly because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above.
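A minimal sketch of q(X) under the formulas above; integer-coded labels and the helper name fitness_q are assumptions for illustration.

```python
import math
import numpy as np

def fitness_q(labels, cluster_ids, n_classes, beta=0.4):
    """q(X) = Impurity(X) + beta * Penalty(k).

    labels      : (n,) integer-coded true class of each example
    cluster_ids : (n,) integer cluster assigned to each example
    n_classes   : number of classes c in the dataset
    beta        : penalty weight, 0 < beta <= 2.0
    """
    n = len(labels)
    clusters = np.unique(cluster_ids)
    k = len(clusters)
    # Impurity(X): fraction of examples outside their cluster's majority class
    minority = 0
    for c_id in clusters:
        cls = labels[cluster_ids == c_id]
        minority += len(cls) - np.bincount(cls).max()
    impurity = minority / n
    # Penalty(k): sub-linear growth in the number of clusters beyond c
    penalty = math.sqrt((k - n_classes) / n) if k >= n_classes else 0.0
    return impurity + beta * penalty
```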
Algorithm SRIDHCR (Greedy Hill Climbing)

Highlights:
- k is not an input parameter; SRIDHCR searches for the best k within the range that is induced by β.
- Reports the best clustering found in r runs (a sketch of the search loop follows below).
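A minimal sketch of one reading of SRIDHCR's insertion/deletion hill climbing, reusing assign_to_representatives and fitness_q from the earlier sketches; the initial solution size and restart handling are assumptions, not the algorithm's published details.

```python
import random

def sridhcr(X, labels, n_classes, beta=0.4, runs=10):
    """Greedy steepest-descent search over sets of representatives."""
    n = len(X)
    best_q, best_reps = float("inf"), None
    for _ in range(runs):                               # randomized restarts
        reps = set(random.sample(range(n), n_classes))  # start from c random representatives
        q_cur = fitness_q(labels, assign_to_representatives(X, sorted(reps)),
                          n_classes, beta)
        while True:
            # Neighbors: insert one non-representative, or delete one representative
            neighbors = [reps | {i} for i in range(n) if i not in reps]
            neighbors += [reps - {i} for i in reps if len(reps) > 1]
            scored = [(fitness_q(labels, assign_to_representatives(X, sorted(s)),
                                 n_classes, beta), s) for s in neighbors]
            q_new, reps_new = min(scored, key=lambda t: t[0])
            if q_new >= q_cur:                          # steepest descent: stop at a local minimum
                break
            q_cur, reps = q_new, reps_new
        if q_cur < best_q:
            best_q, best_reps = q_cur, sorted(reps)
    return best_reps, best_q                            # best clustering found in r runs
```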
Supervised Clustering using Evolutionary Computing (SCEC)

[Diagram: SCEC evolves a population of clusterings; each next generation is created from the current one by mutation, crossover, and copying, and the best solution of the final generation is returned as the result.]
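A compact evolutionary loop in the spirit of the diagram, reusing fitness_q and assign_to_representatives; the population size, selection scheme, and mutation/crossover operators are illustrative assumptions rather than SCEC's exact design.

```python
import random

def scec(X, labels, n_classes, beta=0.4, pop_size=20, generations=50):
    n = len(X)
    def fitness(reps):
        return fitness_q(labels, assign_to_representatives(X, sorted(reps)),
                         n_classes, beta)
    # Initial generation: random representative sets
    population = [set(random.sample(range(n), n_classes)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        next_gen = [set(s) for s in ranked[:pop_size // 4]]   # copy the best quarter
        while len(next_gen) < pop_size:
            if random.random() < 0.5:                         # mutation: swap one representative
                child = set(random.choice(ranked[:pop_size // 2]))
                child.discard(random.choice(sorted(child)))
                child.add(random.randrange(n))
            else:                                             # crossover: sample from two parents' union
                a, b = random.sample(ranked[:pop_size // 2], 2)
                union = sorted(a | b)
                child = set(random.sample(union, min(len(union), (len(a) + len(b)) // 2)))
            next_gen.append(child)
        population = next_gen                                 # next generation
    best = min(population, key=fitness)                       # best solution of final generation
    return sorted(best), fitness(best)
```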
Supervised Clustering --- Algorithms and Applications

Organization of the Talk
- Supervised Clustering
- Representative-based Supervised Clustering Algorithms
- Applications: Using Supervised Clustering
  - for Dataset Editing
  - for Class Decomposition
  - for Region Discovery in Spatial Datasets
- Conclusion and Future Work
Nearest Neighbour Rule

Consider a two-class problem where each sample consists of two measurements (x, y).

- k = 1: for a given query point q, assign the class of the nearest neighbour.
- k = 3: compute the k nearest neighbours and assign the class by majority vote.

Problem: requires a good distance function.
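A minimal sketch of the k-NN rule just described; Euclidean distance is an illustrative stand-in for the "good distance function" the slide calls for.

```python
from collections import Counter
import numpy as np

def knn_classify(X_train, y_train, q, k=3):
    """Assign query point q the majority class among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - q, axis=1)   # distance from q to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]
```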
3a. Dataset Reduction: Editing

- Training data may contain noise and overlapping classes.
- Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
- Main goal of editing: enhance the accuracy of the classifier (% of unseen examples classified correctly).
- Secondary goal of editing: enhance the speed of a k-NN classifier.
Wilson Editing

- Wilson [1972]
- Remove points that do not agree with the majority of their k nearest neighbours (a sketch follows below).

[Figure: original data vs. the result of Wilson editing with k=7, shown for the earlier example and for overlapping classes.]
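A minimal sketch of Wilson editing as described above, reusing knn_classify; leave-one-out neighbourhoods are used so a point never votes for itself.

```python
import numpy as np

def wilson_edit(X, y, k=7):
    """Remove points that disagree with the majority of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        X_rest = np.delete(X, i, axis=0)          # leave the point itself out
        y_rest = np.delete(y, i)
        if knn_classify(X_rest, y_rest, X[i], k) == y[i]:
            keep.append(i)                        # point agrees with its neighbourhood
    return X[keep], y[keep]
```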
RSC → Dataset Editing

[Figure: (a) the dataset clustered using supervised clustering, with representatives A-F; (b) the dataset edited using the cluster representatives.]
Supervised Clustering vs. Clustering the Examples of Each Class Separately

Approaches to discover subclasses of a given class:
- Cluster the examples of each class separately
- Use supervised clustering

Figure 4. Supervised clustering editing vs. clustering each class (x and o) separately.

Remark: A traditional clustering algorithm, such as k-medoids, would pick o as the cluster representative, because it is blind to how the examples of other classes are distributed, whereas supervised clustering would pick o' as the representative. Obviously, o is not a good choice for editing, because it attracts points of the class x, which leads to misclassifications.
Experimental Evaluation

- We compared a traditional 1-NN classifier and Supervised Clustering Editing (SCE).
- A benchmark consisting of 8 UCI datasets was used for this purpose.
- Accuracies were computed using 10-fold cross-validation.
- SRIDHCR was used for supervised clustering.
- SCE was tested using different compression rates by associating different penalties with the number of clusters found (by setting parameter β to 0.4 and 1.0).
- Compression rates of SCE and Wilson editing were computed as 1 − (k/n), with n being the size of the original dataset and k being the size of the edited dataset; e.g., editing a dataset of n = 1000 examples down to k = 30 representatives gives a compression rate of 97%.
Experimental Results (Table 4)
Summary: SCE vs. 1-NN Classifier

- SCE achieved very high compression rates without loss in accuracy for 5 of the 8 datasets tested.
- SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
- Surprisingly, many UCI datasets can be compressed by just using a single representative per class without a significant loss in accuracy.
- SCE, in contrast to other editing techniques, removes examples that are classified correctly as well as examples that are classified incorrectly from the dataset. This explains its much higher compression rates compared to other techniques.
- SCE frequently picks representatives that are in the center of a region dominated by a single class; however, for regions with more complex shapes, the representatives sometimes need to be lined up across from each other to avoid attracting points that belong in neighboring clusters.
Complex9 Dataset

Supervised Clustering Result for Complex9

Diamonds9 Dataset Clustered Using SC Algorithm SRIDHCR
Future Direction of this Research

[Diagram: the dataset is fed directly to the inductive learning algorithm IDLA to obtain classifier C, and, alternatively, is first transformed by a preprocessing step p and then fed to IDLA to obtain classifier C'.]

Goal: Find p such that C' is more accurate than C, or such that C and C' have approximately the same accuracy, but C' can be learnt more quickly and/or C' classifies new examples more quickly.
3.b Class Decomposition

[Figure: three panels (Attribute 1 vs. Attribute 2) illustrating how decomposing classes into subclasses helps a simple classifier.]

Simple classifiers:
- Encompass a small class of approximating functions.
- Have limited flexibility in their decision boundaries.
Naïve Bayes vs. Naïve Bayes with Class Decomposition
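A minimal sketch of the class-decomposition pattern, with scikit-learn's KMeans standing in for the clustering step (the talk advocates supervised clustering here) and Gaussian Naïve Bayes as the simple classifier; all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

def fit_with_decomposition(X, y, subclusters=2):
    """Split each class into subclasses, then train a simple classifier on them."""
    sub_labels = np.empty(len(y), dtype=int)
    parent_of, next_id = {}, 0
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        km = KMeans(n_clusters=subclusters, n_init=10).fit(X[idx])
        sub_labels[idx] = km.labels_ + next_id     # relabel examples with subclass ids
        for sub in range(subclusters):
            parent_of[next_id + sub] = cls         # remember each subclass's parent class
        next_id += subclusters
    return GaussianNB().fit(X, sub_labels), parent_of

def predict_with_decomposition(model, parent_of, X):
    return np.array([parent_of[s] for s in model.predict(X)])  # map subclasses back
```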
3.c Discovery of Interesting Regions for Spatial Data Mining

- Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function. Examples of region discovery include:
  - Discover regions that show significant deviations from the prior probability of a class, e.g. regions in the state of Wyoming where people are very poor or not poor at all.
  - Discover regions that show significant variation in income (fitness is defined based on the variance with respect to income in a region).
  - Discover congested regions for traffic control.
- Our approach: We use (supervised) clustering to discover such regions, with a fitness function representing a particular measure of interestingness; regions are implicitly defined by the set of points that belong to a cluster.
Wyoming Map

Household Income in 1999: Wyoming Park County
Clusters → Regions

Example: two clusters, shown in red and blue, are given; regions, shown in grey and white, are defined by using a Voronoi diagram based on an NN classifier with k=7.
An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C

Let prior(C) = |C|/n, and let p(c,C) be the percentage of examples in cluster c that belong to class C. Reward(c) is computed based on p(c,C), prior(C), and the parameters γ1, γ2, R+, R− (with γ1 ≤ 1 ≤ γ2 and R+, R− ≥ 0), relying on the interpolation function t depicted below (e.g. γ1 = 0.8, γ2 = 1.2, R+ = 1, R− = 1):

q_C(X) = Σ_{c∈X} ( t(p(c,C), prior(C), γ1, γ2, R+, R−) · |c|^β ) / n

with β > 1 (typically 1.0001 < β < 2); the idea is that increases in cluster size are rewarded nonlinearly, favoring clusters with more points as long as t(·) increases.
[Figure: the interpolation function t; the x-axis shows p(c,C), with marks at prior(C)·γ1, prior(C)·γ2, and 1; the y-axis shows Reward(c), which ranges between R− (for p(c,C) far below the prior) and R+ (for p(c,C) far above it), and is 0 between prior(C)·γ1 and prior(C)·γ2.]
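A minimal sketch of the reward scheme above; the piecewise-linear shape of t follows the interpolation figure, and its exact form is an assumption where the slide's rendering was lost (it assumes prior(C)·γ2 < 1).

```python
def t(p, prior, g1=0.8, g2=1.2, r_plus=1.0, r_minus=1.0):
    """Reward the deviation of p = p(c, C) from the class prior."""
    lo, hi = prior * g1, prior * g2
    if p < lo:                                    # below the prior: reward rises toward R-
        return r_minus * (lo - p) / lo
    if p > hi:                                    # above the prior: reward rises toward R+
        return r_plus * (p - hi) / (1.0 - hi)
    return 0.0                                    # close to the prior: no reward

def q_C(clusters, prior, n, beta=1.01, **kw):
    """clusters: list of (p(c,C), |c|) pairs, one per cluster c in X."""
    return sum(t(p, prior, **kw) * size ** beta for p, size in clusters) / n
```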
Example: Discovery of Interesting Regions in Wyoming Census 2000 Datasets
Supervised Clustering --- Algorithms and Applications

Organization of the Talk
- Supervised Clustering
- Representative-based Supervised Clustering Algorithms
- Applications: Using Supervised Clustering
  - for Dataset Editing
  - for Class Decomposition
  - for Region Discovery in Spatial Datasets
- Summary and Future Work
4. Summary and Future Work

- A novel data mining technique, which we term supervised clustering, was introduced.
- The benefits of using supervised clustering as a preprocessing step to enhance classification algorithms, such as NN classifiers and naïve Bayesian classifiers, were demonstrated.
- In our current research, we investigate the use of supervised clustering for spatial data mining, distance function learning, and discovering subclasses.
- Moreover, we investigate how to make supervised clustering adaptive with respect to user feedback.
An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications

[Diagram: a clustering algorithm produces a clustering and a summary; an evaluation system judges the clustering using predefined fitness functions q(X), ... together with feedback from a domain expert; an adaptation system, drawing on past experience and the resulting quality, changes the inputs of the clustering algorithm.]

Idea: Development of a generic clustering/feedback/adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or an externally given reward function.
Links to 5 Related Papers

[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf

[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf

[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, to appear in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf

[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf

[ZSE05] N. Zeidat, S. Wang, and C. Eick, Data Set Editing Techniques: A Comparative Study, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/ZSE04.pdf