4. Clustering Methods - PowerPoint PPT Presentation

Slides: 28

Provided by: publi8

Learn more at: https://www.public.asu.edu

Transcript and Presenter's Notes

1
4. Clustering Methods

Concepts
Partitional (k-Means, k-Medoids)
Hierarchical (Agglomerative & Divisive, COBWEB)
Density-based (DBSCAN, CLIQUE)
Large size data (STING, BIRCH, CURE)

2
Concepts of Clustering

Clusters
Different ways of representing clusters
Division with boundaries
Venn diagram or spheres
Probabilistic
Dendrograms
Trees
Rules

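[Figure: probabilistic representation, a table of instances I1, I2, ..., In against clusters 1, 2, 3 with membership probabilities such as 0.5, 0.2, 0.3]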
3

About clusters
Inter-cluster distance → maximization
Intra-cluster distance → minimization
Clustering vs. classification
Which one is more difficult? Why?
There are various possible ways of clustering; which way is the best?

4
Major Categories

Partitioning: Divide into k partitions (k fixed), then repartition to get a better clustering.
Hierarchical: Divide into different numbers of partitions in layers, by merging (bottom-up) or dividing (top-down).
Density-based: Continue to grow a cluster as long as the density of the cluster exceeds a threshold.
Grid-based: First divide the space into grids, then perform clustering on the grids.

5
k-Means

Algorithm
1. Given k
2. Randomly pick k instances as the initial centers
3. Assign the remaining instances to the closest of the k clusters
4. Recalculate the mean of each cluster
5. Repeat steps 3 and 4 until the means don't change
How good the clusters are
Initial and final clusters
Within-cluster variation: $\sum \mathrm{diff}(x, \mathrm{mean})^2$
Why don't we consider inter-cluster distance?

6
Example

For simplicity, 1-dimensional objects and k = 2.
Objects: 1, 2, 5, 6, 7
k-Means
Randomly select 5 and 6 as initial centroids
=> Two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
=> {1, 2}, {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
=> no change.
Aggregate dissimilarity: $0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5$
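
The iteration in this example can be replayed with a few lines of Python. This is a minimal sketch, not the presenter's code: the function names `k_means_1d` and `within_cluster_variation` are made up for illustration, and every point (including the initial centers) takes part in the assignment step.

```python
def k_means_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D numbers, starting from the given initial centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: recompute each mean (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # the means did not change: converged
            break
        centers = new_centers
    return clusters, centers


def within_cluster_variation(clusters, centers):
    """Sum of squared differences between each point and its cluster mean."""
    return sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)


clusters, centers = k_means_1d([1, 2, 5, 6, 7], centers=[5, 6])
print(clusters, centers, within_cluster_variation(clusters, centers))
# prints: [[1, 2], [5, 6, 7]] [1.5, 6.0] 2.5
```

Starting from centers 5 and 6, the run passes through {1, 2, 5} / {6, 7} before settling on {1, 2} / {5, 6, 7}, matching the steps listed on the slide.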

7
Discussions

Limitations
Means cannot be defined for categorical
attributes
Choice of k
Sensitive to outliers
Crisp clustering
Variants of k-means exist
Using modes to deal with categorical attributes
How about distance measures?
Is it similar to or different from k-NN?
With and without learning

8
k-Medoids

k-Means algorithm is sensitive to outliers
Is this true? How to prove it?
Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
In contrast, a centroid is not necessarily inside
a cluster.
An example

[Figure: initial medoids]
9
Partition Around Medoids

PAM
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each instance to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all instances to their nearest medoids
5. Randomly select an instance y
6. Swap x with y if the swap reduces the objective function
7. Repeat steps 3-6 until no change
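
The randomized swap loop can be sketched as follows. This is an illustrative sketch under stated assumptions: points are 1-D numbers, dissimilarity is absolute difference, and a fixed number of trial swaps (`n_trials`, a hypothetical parameter) stands in for the "until no change" stopping rule. The names `total_cost` and `pam` are not from the slides.

```python
import random


def total_cost(points, medoids):
    """Objective: sum of distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def pam(points, k, n_trials=200, seed=0):
    """Randomized-swap PAM sketch for 1-D points."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # random initial medoids
    cost = total_cost(points, medoids)       # assign instances and score
    for _ in range(n_trials):
        x_index = rng.randrange(k)           # medoid x to try replacing
        y = rng.choice(points)               # random candidate instance y
        if y in medoids:
            continue
        candidate = medoids[:x_index] + [y] + medoids[x_index + 1:]
        candidate_cost = total_cost(points, candidate)
        if candidate_cost < cost:            # keep only swaps that reduce the objective
            medoids, cost = candidate, candidate_cost
    return sorted(medoids), cost


points = [1, 2, 5, 6, 7, 100]
medoids, cost = pam(points, k=2)
print(medoids, cost)
```

Because only improving swaps are accepted, the objective never increases; the final medoids are always actual instances, unlike k-means centers.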

10
k-Means and k-Medoids

The key difference lies in how they update means
or medoids
Both require distance calculation and
reassignment of instances
Time complexity
Which one is more costly?
Dealing with outliers

[Figure: an outlier 100 units away]
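
A small numeric check makes the outlier effect concrete. The data below are hypothetical (five points plus one outlier 100 units beyond the largest); `mean` and `medoid` are illustrative helpers, not part of any library.

```python
def mean(points):
    return sum(points) / len(points)


def medoid(points):
    """The member point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [105]   # hypothetical outlier, 100 units past the largest point

print(mean(cluster), medoid(cluster))            # prints: 3.0 3
print(mean(with_outlier), medoid(with_outlier))  # prints: 20.0 3
```

The mean is dragged from 3 to 20, a value far from every original point, while the medoid stays inside the original group.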
11
EM (Expectation and Maximization)

Moving away from crisp clusters as in k-Means by
allowing an instance to belong to several
clusters
Finite mixtures: a statistical clustering model
A mixture is a set of k probability distributions, representing k clusters
The simplest finite mixture: one feature with a Gaussian distribution per cluster
When k = 2, we need to estimate 5 parameters: two pairs of μ and σ, and p_A, where p_B = 1 - p_A
EM
Estimate the parameters using the instances
Maximize the overall likelihood that the data came from this mixture
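
For the k = 2, one-feature case, the five parameters can be estimated with a short EM loop. This is a minimal sketch assuming two Gaussian clusters A and B, hand-picked starting means, and a small floor on σ to avoid a degenerate component; the function names and the sample data are illustrative only.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_two_gaussians(data, mu_a, mu_b, sigma_a=1.0, sigma_b=1.0, p_a=0.5, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture; returns the 5 parameters."""
    for _ in range(n_iter):
        # E-step: probability that each instance belongs to cluster A.
        w = []
        for x in data:
            a = p_a * gaussian_pdf(x, mu_a, sigma_a)
            b = (1 - p_a) * gaussian_pdf(x, mu_b, sigma_b)
            w.append(a / (a + b))
        # M-step: re-estimate the parameters from the weighted instances.
        n_a = sum(w)
        n_b = len(data) - n_a
        mu_a = sum(wi * x for wi, x in zip(w, data)) / n_a
        mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / n_b
        var_a = sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / n_a
        var_b = sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / n_b
        # Floor sigma so a component cannot collapse onto a single point.
        sigma_a = max(math.sqrt(var_a), 1e-3)
        sigma_b = max(math.sqrt(var_b), 1e-3)
        p_a = n_a / len(data)
    return mu_a, sigma_a, mu_b, sigma_b, p_a


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu_a, sigma_a, mu_b, sigma_b, p_a = em_two_gaussians(data, mu_a=0.0, mu_b=6.0)
print(mu_a, sigma_a, mu_b, sigma_b, p_a)
```

Each instance keeps a membership weight between 0 and 1 for cluster A instead of a hard assignment, which is the move away from crisp clusters described above.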

12
Agglomerative

Each object is viewed as a cluster (bottom up).
Repeat until the number of clusters is small enough:
Choose the closest pair of clusters
Merge the two into one
Defining "closest": centroid (mean of cluster) distance, (average) sum of pairwise distances, ...
Refer to the Evaluation part
A dendrogram is a tree that shows the clustering process.

13
Dendrogram

Cluster 1, 2, 4, 5, 6, 7 into two clusters (centroid distance)
[Figure: dendrogram with leaves 1, 2, 4, 5, 6, 7]
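
The merge sequence for these six objects can be reproduced with a short bottom-up loop. This is a sketch under assumptions: centroid distance on 1-D values, ties broken by taking the first closest pair found, and an illustrative function name `agglomerative`.

```python
def centroid(cluster):
    return sum(cluster) / len(cluster)


def agglomerative(points, n_clusters):
    """Bottom-up clustering of 1-D points using centroid distance."""
    clusters = [[p] for p in points]   # every object starts as its own cluster
    merges = []                        # merge history, i.e. the dendrogram
    while len(clusters) > n_clusters:
        # Choose the closest pair of clusters (the first pair wins ties).
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: abs(centroid(clusters[ab[0]])
                                             - centroid(clusters[ab[1]])))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerative([1, 2, 4, 5, 6, 7], n_clusters=2)
print(clusters)   # prints: [[1, 2], [4, 5, 6, 7]]
for left, right in merges:
    print(left, "+", right)
```

The printed merge history is the dendrogram read from the leaves upward; cutting it before the last merge leaves the two clusters {1, 2} and {4, 5, 6, 7}.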

14
An example to show different Links

Single link
Merge the nearest clusters, measured by the shortest edge between the two
(((A B) (C D)) E)
Complete link
Merge the nearest clusters, measured by the longest edge between the two
(((A B) E) (C D))
Average link
Merge the nearest clusters, measured by the average edge length between the two
(((A B) (C D)) E)

[Figure: layout of the points A, B, C, D, E]
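
The three link definitions differ only in how the edges between two clusters are summarized. The sketch below uses hypothetical 2-D coordinates, since the figure with the positions of A to E did not survive; the function names are illustrative.

```python
from itertools import product
from math import dist


def single_link(c1, c2):
    """Shortest edge between the two clusters."""
    return min(dist(p, q) for p, q in product(c1, c2))


def complete_link(c1, c2):
    """Longest edge between the two clusters."""
    return max(dist(p, q) for p, q in product(c1, c2))


def average_link(c1, c2):
    """Average edge length over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))


# Hypothetical clusters, not the A-E points from the slide.
left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]
print(single_link(left, right), average_link(left, right), complete_link(left, right))
```

For any pair of clusters the single-link distance is at most the average-link distance, which is at most the complete-link distance, so the three criteria can rank candidate merges differently.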
15
Divisive

All instances belong to one cluster (top-down)
Finding an optimal division at each layer (especially the top one) is computationally prohibitive.
One heuristic method is based on the Minimum Spanning Tree (MST) algorithm
Connect all instances with an MST ($O(N^2)$)
Repeatedly cut out the longest edges at each iteration until some stopping criterion is met or until one instance remains in each cluster.
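
The MST heuristic can be sketched in two steps: build the tree with Prim's algorithm, then drop the longest edges. This is a sketch under assumptions: 1-D points, absolute difference as the edge length, and a target number of clusters as the stopping criterion. `mst_edges` and `divisive_mst` are illustrative names.

```python
def mst_edges(points):
    """Prim's algorithm on the complete graph of 1-D points, O(N^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n     # cheapest known edge from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            w = abs(points[u] - points[v])
            if not in_tree[v] and w < best[v]:
                best[v], parent[v] = w, u
    return edges


def divisive_mst(points, n_clusters):
    """Cut the (n_clusters - 1) longest MST edges and return the clusters."""
    edges = sorted(mst_edges(points))             # ascending by edge length
    kept = edges[: len(points) - n_clusters]      # drop the longest edges
    root = list(range(len(points)))               # union-find over the kept edges

    def find(i):
        while root[i] != i:
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(sorted(g) for g in groups.values())


print(divisive_mst([1, 2, 4, 5, 6, 7], n_clusters=2))   # prints: [[1, 2], [4, 5, 6, 7]]
```

Cutting every edge (`n_clusters` equal to the number of points) ends with one instance per cluster, the other stopping condition mentioned above.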

16
COBWEB

Building a conceptual hierarchy incrementally
Category Utility
$\sum_k \sum_i \sum_j P(f_i = v_{ij})\, P(f_i = v_{ij} \mid c_k)\, P(c_k \mid f_i = v_{ij})$
over all categories $c_k$, all features $f_i$, and all feature values $v_{ij}$
It attempts to maximize both the probability that
two objects in the same category have values in
common and the probability that objects in
different categories will have different property
values
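
The triple sum can be evaluated directly from value counts. The sketch below is illustrative only: it assumes nominal features, estimates every probability by relative frequency, and uses a made-up four-instance data set; `category_utility` is not a library function.

```python
from collections import Counter


def category_utility(categories):
    """Sum over k, i, j of P(f_i=v_ij) * P(f_i=v_ij | c_k) * P(c_k | f_i=v_ij).

    `categories` is a list of categories; each category is a list of
    instances; each instance is a tuple of nominal feature values.
    """
    everything = [inst for cat in categories for inst in cat]
    n = len(everything)
    n_features = len(everything[0])
    total = 0.0
    for i in range(n_features):
        overall = Counter(inst[i] for inst in everything)    # counts of f_i = v_ij
        for cat in categories:
            within = Counter(inst[i] for inst in cat)        # counts inside c_k
            for value, count in within.items():
                p_value = overall[value] / n                 # P(f_i = v_ij)
                p_value_given_cat = count / len(cat)         # P(f_i = v_ij | c_k)
                p_cat_given_value = count / overall[value]   # P(c_k | f_i = v_ij)
                total += p_value * p_value_given_cat * p_cat_given_value
    return total


pure = [[("red", "small"), ("red", "small")], [("blue", "large"), ("blue", "large")]]
mixed = [[("red", "small"), ("blue", "large")], [("red", "small"), ("blue", "large")]]
print(category_utility(pure), category_utility(mixed))   # prints: 2.0 1.0
```

The partition that keeps identical objects together scores higher than the one that mixes them, which is the behavior the measure is meant to reward.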

Processing one instance at a time by evaluating:
Placing the instance in the best existing
category
Adding a new category containing only the
instance
Merging of two existing categories into a new one
and adding the instance to that category
Splitting of an existing category into two and
placing the instance in the best new resulting
category

[Figure: the Merge and Split operations on a hierarchy with nodes Grandparent, Parent, Child 1, Child 2]
18
Density-based

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
It grows regions with sufficiently high density
into clusters and can discover clusters of
arbitrary shape in spatial databases with noise.
Many existing clustering algorithms find
spherical shapes of clusters
DBSCAN defines a cluster as a maximal set of density-connected points.

Defining density and connection
ε-neighborhood of an object x (core object) (M, P, O)
MinPts: the minimum number of objects within the ε-neighborhood (say, 3)
Directly density-reachable (Q from M, M from P)
Density-reachable (Q from P, P not from Q): asymmetric
Density-connected (O, R, S): symmetric for border points
What is the relationship between density-reachable (DR) and density-connected (DC)?

Clustering with DBSCAN
Search for clusters by checking the ε-neighborhood of each instance x
If the ε-neighborhood of x contains at least MinPts objects, create a new cluster with x as a core object
Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters
Terminate when no new point can be added to any cluster
DBSCAN is sensitive to the density thresholds, but it is many times faster than CLARANS
Time complexity: $O(N \log N)$ if a spatial index is used, $O(N^2)$ otherwise
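
The search procedure can be sketched without a spatial index, which gives the quadratic variant. Assumptions in this sketch: 1-D points, a point counts as a core object when its ε-neighborhood (itself included) holds at least `min_pts` points, and noise is labeled -1. The function name and sample data are illustrative.

```python
def dbscan(points, eps, min_pts):
    """Label each 1-D point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisionally noise; may become a border point
            continue
        cluster_id += 1                 # i is a core object: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core object: border point
            if labels[j] is not None:
                continue                # already assigned; do not expand again
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core object: keep growing
                queue.extend(j_neighbors)
    return labels


points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 20.0]
print(dbscan(points, eps=1.0, min_pts=3))   # prints: [0, 0, 0, 1, 1, 1, -1]
```

Each call to `neighbors` scans all points, so the whole run is quadratic; replacing it with a spatial-index lookup is what brings the cost down.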

21
Dealing with Large Data

Key ideas
Reducing the number of instances while maintaining the distribution
Identifying relevant subspaces where clusters possibly exist
Using summarized information to avoid repeated data access
Sampling
CLARA (Clustering LARge Applications): working on samples instead of the whole data
CLARANS (Clustering Large Applications based on RANdomized Search)

Grid: STING (STatistical INformation Grid)
Statistical parameters of higher-level cells can easily be computed from those of lower-level cells
Attribute-independent: count
Attribute-dependent: mean, standard deviation, min, max
Type of distribution: normal, uniform, exponential, or unknown
Irrelevant cells can be removed
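
How a higher-level cell derives its parameters from lower-level cells can be illustrated with a small aggregation routine. This is a sketch, assuming one numeric attribute, non-empty cells, and the population standard deviation; `CellStats`, `cell_from_values`, and `merge_cells` are illustrative names, and the distribution type is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    std: float        # population standard deviation
    minimum: float
    maximum: float


def cell_from_values(values):
    """Statistics of a bottom-level cell computed directly from its values."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return CellStats(n, m, math.sqrt(var), min(values), max(values))


def merge_cells(children):
    """Statistics of a higher-level cell from its children's statistics only."""
    n = sum(c.count for c in children)
    m = sum(c.count * c.mean for c in children) / n
    # E[x^2] of the parent is the count-weighted average of the children's E[x^2].
    mean_sq = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    var = max(mean_sq - m ** 2, 0.0)   # guard against tiny negative rounding
    return CellStats(n, m, math.sqrt(var),
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))


a = cell_from_values([1.0, 2.0, 3.0])
b = cell_from_values([10.0, 20.0])
parent = merge_cells([a, b])
direct = cell_from_values([1.0, 2.0, 3.0, 10.0, 20.0])
print(parent)
print(direct)
```

The parent computed from the two child summaries agrees with the one computed from the raw values, so the raw data never has to be read again above the bottom layer.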

23
Representatives

BIRCH: using Clustering Feature (CF) and CF tree
A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS)
N: the number of instances; LS: linear sum; SS: square sum
Two parameters: the branching factor (the max number of children per non-leaf node) and the threshold (the max diameter of the sub-clusters stored at the leaf nodes)
Two phases
Build an initial in-memory CF tree
Apply a clustering algorithm to cluster the leaf nodes of the CF tree
CURE (Clustering Using REpresentatives) is another example
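
The (N, LS, SS) triplet is enough to merge sub-clusters and to recover their centroid and radius without revisiting the instances. The sketch below assumes 1-D instances for brevity; the class `CF` and its methods are illustrative and do not build the CF tree itself.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int        # number of instances
    ls: float     # linear sum of the instances
    ss: float     # sum of squares of the instances

    @classmethod
    def of(cls, x):
        """CF of a single instance."""
        return cls(1, x, x * x)

    def __add__(self, other):
        # CFs are additive: merging two sub-clusters adds the triplets.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the members from the centroid.
        return math.sqrt(max(self.ss / self.n - self.centroid() ** 2, 0.0))


cf = CF.of(1.0) + CF.of(2.0) + CF.of(3.0)
print(cf, cf.centroid(), cf.radius())
```

Inserting an instance into a leaf entry is the same addition, which is why the tree can be built in a single scan of the data.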

Taking advantage of the property of density
If a region is dense in a higher-dimensional subspace, it should also be dense in some lower-dimensional subspaces
CLIQUE (CLustering In QUEst)
With high-dimensional data, there are many void subspaces
Using the property identified, we can start with dense lower-dimensional subspaces
CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
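
The bottom-up use of the density property can be sketched for the step from 1-D to 2-D units. Assumptions in this sketch: an equal-width grid, a unit is dense when it holds at least `threshold` points, and only 2-D candidates are generated; the function names and the sample points are illustrative, not CLIQUE's full procedure.

```python
from collections import Counter
from itertools import combinations


def cell(point, dims, width):
    """Grid unit of `point` in the subspace spanned by `dims`."""
    return tuple(int(point[d] // width) for d in dims)


def dense_units(points, dims, width, threshold):
    counts = Counter(cell(p, dims, width) for p in points)
    return {unit for unit, c in counts.items() if c >= threshold}


def dense_2d_units(points, n_dims, width, threshold):
    """Find dense 2-D units, counting only candidates built from dense 1-D units."""
    dense_1d = {d: dense_units(points, (d,), width, threshold) for d in range(n_dims)}
    result = {}
    for d1, d2 in combinations(range(n_dims), 2):
        # A 2-D unit can only be dense if both of its 1-D projections are dense.
        candidates = {(u[0], v[0]) for u in dense_1d[d1] for v in dense_1d[d2]}
        counts = Counter(c for c in (cell(p, (d1, d2), width) for p in points)
                         if c in candidates)
        found = {unit for unit, c in counts.items() if c >= threshold}
        if found:
            result[(d1, d2)] = found
    return result


points = [(1.1, 1.2, 0.5), (1.3, 1.4, 3.5), (1.5, 1.6, 6.5),
          (1.7, 1.8, 9.5), (5.5, 8.5, 2.5)]
print(dense_2d_units(points, n_dims=3, width=1.0, threshold=3))
# prints: {(0, 1): {(1, 1)}}
```

Dimension 2 has no dense 1-D unit, so no subspace containing it is ever counted; that pruning is what keeps the search manageable when most subspaces are void.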

25
Chameleon

A hierarchical Clustering Algorithm Using Dynamic
Modeling
Observations on the weakness of CURE and ROCK

26
Summary

There are many clustering algorithms
Good clustering algorithms maximize inter-cluster
dissimilarity and intra-cluster similarity
Without prior knowledge, it is difficult to
choose the best clustering algorithm.
Clustering is an important tool for outlier
analysis.

27
Bibliography

I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2000. Morgan Kaufmann.
M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. 2003. IEEE.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2001. Morgan Kaufmann.
M. H. Dunham. Data Mining: Introductory and Advanced Topics.
