Clustering - PowerPoint PPT Presentation

1
Clustering
  • Basic concepts with simple examples
  • Categories of clustering methods
  • Challenges

2
What is clustering?
  • The process of grouping a set of physical or
    abstract objects into classes of similar objects.
  • It is also called unsupervised learning.
  • It is a common and important task that finds many
    applications
  • Examples where we need clustering?

3
Clusters and representations
  • Examples of clusters
  • Different ways of representing clusters
  • Division with boundaries
  • Venn diagrams or spheres
  • Probabilistic
  • Dendrograms
  • Trees
  • Rules

(Figure: a probabilistic representation, e.g., instances I1, I2, ..., In assigned to clusters 1, 2, 3 with membership probabilities such as 0.5, 0.2, 0.3.)
4
Differences from Classification
  • How different?
  • Which one is more difficult as a learning
    problem?
  • Do we perform clustering in daily activities?
  • How do we cluster?
  • How to measure the results of clustering?
  • With/without class labels
  • Between classification and clustering
  • Semi-supervised clustering

5
Major clustering methods
  • Partitioning methods
  • k-Means (and EM), k-Medoids
  • Hierarchical methods
  • agglomerative, divisive, BIRCH
  • Similarity and dissimilarity of points in the
    same cluster and from different clusters
  • Distance measures between clusters
  • minimum, maximum
  • Means of clusters
  • Average between clusters

6
How to evaluate
  • Without labeled data, how can one tell whether a
    clustering result is good?
  • Basic or intuitive idea of clustering for
    clustered data points
  • Within a cluster: points should be similar (cohesion)
  • Between clusters: points should be dissimilar (separation)
  • The relationship between the two?
  • Evaluation methods
  • Labeled data: an additional assumption that
    instances in the same cluster are of the same class
  • Is it reasonable to use class labels in
    evaluation?
  • Unlabeled data: see below
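The silhouette coefficient is one common internal measure that combines exactly these two ideas (cohesion within a cluster, separation between clusters). A minimal sketch, not from the original slides, assuming scikit-learn is available:

    # Internal evaluation: the silhouette coefficient combines cohesion
    # (distances within a cluster) and separation (distances between clusters).
    # The 1-D data below is purely illustrative.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))   # close to 1 = tight, well-separated clusters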

7
Clustering -- Example 1
  • For simplicity, 1-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • k-means
  • Randomly select 5 and 6 as centroids
  • → Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • → {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • → no change.
  • Aggregate (squared) dissimilarity: 0.5² + 0.5² + 1² + 1² = 2.5
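The same run can be reproduced with a short script; this is a sketch of standard (Lloyd-style) k-means, not code from the slides:

    # k-means on the 1-D example: objects 1, 2, 5, 6, 7 with k = 2,
    # starting from the centroids 5 and 6 chosen above.
    import numpy as np

    points = np.array([1.0, 2.0, 5.0, 6.0, 7.0])
    centroids = np.array([5.0, 6.0])

    while True:
        # assign each point to its nearest centroid
        assign = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
        # recompute each centroid as the mean of its cluster
        updated = np.array([points[assign == j].mean() for j in range(len(centroids))])
        if np.allclose(updated, centroids):   # no change -> stop
            break
        centroids = updated

    sse = np.sum((points - centroids[assign]) ** 2)
    print(centroids, sse)   # centroids [1.5, 6.0], aggregate dissimilarity 2.5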

8
Issues with k-means
  • A heuristic method
  • Sensitive to outliers
  • How to prove it?
  • Determining k
  • Trial and error
  • X-means, PCA-based
  • Crisp clustering
  • EM, Fuzzy c-means
  • Should not be confused with k-NN
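One way to make the trial-and-error concrete is the "elbow" heuristic: run k-means for several values of k and watch where the total within-cluster error stops dropping sharply. A small sketch assuming scikit-learn (not part of the original slides):

    # Elbow heuristic: inertia_ is the sum of squared distances to the nearest centroid.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative data
    for k in range(1, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)   # pick the k where the drop in inertia levels off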

9
k-Medoids
  • Medoid: the most centrally located point in a
    cluster, as a representative point of the
    cluster.
  • In contrast, a centroid is not necessarily inside
    a cluster.
  • An example

(Figure: example data with the initial medoids marked.)
10
Partition Around Medoids
  • PAM
  1. Given k
  2. Randomly pick k instances as the initial medoids
  3. Assign each instance to the nearest medoid x
  4. Calculate the objective function: the sum of
     dissimilarities of all instances to their nearest
     medoids
  5. Randomly select an instance y
  6. Swap x with y if the swap reduces the objective
     function
  7. Repeat (3-6) until no change
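A compact sketch of this loop in Python (illustrative only; the function name pam and its defaults are made up here, and the code favors clarity over speed):

    # PAM sketch for 1-D data: swap a medoid x with a non-medoid y whenever the
    # swap reduces the total dissimilarity of instances to their nearest medoid.
    import numpy as np

    def pam(points, k, seed=0, max_iter=100):
        rng = np.random.default_rng(seed)
        dist = np.abs(points[:, None] - points[None, :])    # pairwise distances (1-D case)
        medoids = list(rng.choice(len(points), size=k, replace=False))  # random initial medoids

        def cost(meds):
            # objective function: sum of dissimilarities to the nearest medoid
            return dist[:, meds].min(axis=1).sum()

        for _ in range(max_iter):
            improved = False
            for x in list(medoids):                 # a current medoid x
                for y in range(len(points)):        # a candidate non-medoid y
                    if y in medoids:
                        continue
                    candidate = [y if m == x else m for m in medoids]
                    if cost(candidate) < cost(medoids):   # swap only if it reduces the objective
                        medoids, improved = candidate, True
            if not improved:                        # repeat until no change
                break
        labels = dist[:, medoids].argmin(axis=1)
        return [points[m] for m in medoids], labels

    meds, labels = pam(np.array([1.0, 2.0, 5.0, 6.0, 7.0]), k=2)
    print(meds, labels)   # typically medoids such as 2 and 6, assignment [0 0 1 1 1]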

11
k-Means and k-Medoids
  • The key difference lies in how they update means
    or medoids
  • k-medoids: pairwise comparisons between the k
    medoids and the (N-k) remaining instances
  • Both require distance calculation and
    reassignment of instances
  • Time complexity
  • Which one is more costly?
  • Dealing with outliers

(Figure: data with an outlier 100 units away.)
12
Agglomerative
  • Each object is viewed as a cluster (bottom up).
  • Repeat until the number of clusters is small
    enough
  • Choose a closest pair of clusters
  • Merge the two into one
  • Defining "closest": centroid (mean of cluster)
    distance, (average) sum of pairwise distances, ...
  • Refer to the Evaluation part
  • A dendrogram is a tree that shows the clustering
    process.

13
Clustering -- Example 2
  • For simplicity, we still use 1-dimensional
    objects.
  • Objects: 1, 2, 5, 6, 7
  • Agglomerative clustering: a very frequently used
    algorithm
  • How to cluster:
  • Find the two closest clusters and merge them
  • → merge 1 and 2, so we now have {1, 2}, {5}, {6}, {7}
    (centroids 1.5, 5, 6, 7)
  • → merge 5 and 6: {1, 2}, {5, 6}, {7} (centroids 1.5,
    5.5, 7)
  • → merge {5, 6} and 7: {1, 2}, {5, 6, 7}
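A plain-Python sketch of the same bottom-up process, merging the pair of clusters with the closest centroids at each step (illustrative, not code from the slides):

    # Agglomerative clustering of 1, 2, 5, 6, 7, stopping at two clusters.
    clusters = [[1.0], [2.0], [5.0], [6.0], [7.0]]   # start: every object is its own cluster

    def centroid(c):
        return sum(c) / len(c)

    while len(clusters) > 2:
        # choose the closest pair of clusters by centroid distance
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda p: abs(centroid(clusters[p[0]]) - centroid(clusters[p[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        print(clusters)   # {1,2} merges first, then {5,6}, then {5,6,7}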

14
Issues with dendrograms
  • How to find proper clusters
  • An alternative: divisive algorithms
  • Top down
  • Compared with bottom-up, which is more efficient?
  • What's the time complexity?
  • How to efficiently divide the data?
  • A heuristic: Minimum Spanning Tree (MST)
  • http://en.wikipedia.org/wiki/Minimum_spanning_tree
  • Time complexity: the fastest MST algorithms run in
    roughly O(e) time, where e is the number of edges
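A sketch of the MST heuristic, assuming SciPy is available: build the minimum spanning tree, cut its heaviest edges, and read the surviving connected components off as clusters (the data and k below are illustrative):

    # Divisive heuristic: cut the (k-1) heaviest edges of the minimum spanning tree;
    # the remaining connected components are the clusters.
    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative data
    k = 2

    dist = squareform(pdist(X))                      # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()      # MST as a weighted adjacency matrix

    edges = np.argwhere(mst > 0)
    order = np.argsort(mst[edges[:, 0], edges[:, 1]])
    for i, j in edges[order[-(k - 1):]]:             # delete the heaviest (k-1) edges
        mst[i, j] = 0

    n_clusters, labels = connected_components(mst, directed=False)
    print(n_clusters, labels)                        # 2 clusters: [0 0 1 1 1]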

15
Distance measures
  • Single link
  • Measured by the shortest edge between the two
    clusters
  • Complete link
  • Measured by the longest edge
  • Average link
  • Measured by the average edge length
  • An example is shown next.

16
An example to show different Links
  • Single link
  • Merge the nearest clusters measured by the
    shortest edge between the two
  • (((A B) (C D)) E)
  • Complete link
  • Merge the nearest clusters measured by the
    longest edge between the two
  • (((A B) E) (C D))
  • Average link
  • Merge the nearest clusters measured by the
    average edge length between the two
  • (((A B) (C D)) E)

Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
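The three results can be checked mechanically with SciPy's hierarchical clustering routines (a sketch; SciPy is an assumption, the slides do not name a library):

    # Single, complete, and average link on the distance matrix for A-E above.
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    D = np.array([[0, 1, 2, 2, 3],
                  [1, 0, 2, 4, 3],
                  [2, 2, 0, 1, 5],
                  [2, 4, 1, 0, 3],
                  [3, 3, 5, 3, 0]], dtype=float)     # order: A, B, C, D, E

    condensed = squareform(D)                        # condensed form expected by linkage()
    for method in ("single", "complete", "average"):
        Z = linkage(condensed, method=method)
        print(method)
        print(Z)   # each row: the two clusters merged and their merge distance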
17
Other Methods
  • Density-based methods
  • DBSCAN: a cluster is a maximal set of
    density-connected points
  • Core points are defined using an ε-neighborhood
    and MinPts
  • Directly density-reachable points (e.g., P and Q,
    Q and M), density-reachable points (P and M,
    assuming so are P and N), and density-connected
    points (any density-reachable points: P, Q, M, N)
    form clusters
  • Grid-based methods
  • STING: the lowest level is the original data
  • statistical parameters of higher-level cells are
    computed from the parameters of the lower-level
    cells (count, mean, standard deviation, min, max,
    distribution)
  • Model-based methods
  • Conceptual clustering: COBWEB
  • Category utility
  • Intraclass similarity
  • Interclass dissimilarity

18
Density-based
  • DBSCAN: Density-Based Spatial Clustering of
    Applications with Noise
  • It grows regions with sufficiently high density
    into clusters and can discover clusters of
    arbitrary shape in spatial databases with noise.
  • Many existing clustering algorithms only find
    spherical clusters
  • DBSCAN defines a cluster as a maximal set of
    density-connected points.
  • Density is defined by a neighborhood area and a
    number (#) of points

19
  • Defining density and connection
  • ε-neighborhood of an object x (core object) (M,
    P, O)
  • MinPts: the minimum number of objects within the
    ε-neighborhood (say, 3)
  • directly density-reachable (Q from M, M from P)
  • Only core objects are mutually density-reachable
  • density-reachable (Q from P, but P not from Q):
    asymmetric
  • density-connected (O, R, S): symmetric; applies to
    border points
  • What is the relationship between density-reachable
    and density-connected?

20
  • Clustering with DBSCAN
  • Search for clusters by checking the
    ε-neighborhood of each instance x
  • If the ε-neighborhood of x contains at least
    MinPts points, create a new cluster with x as a
    core object
  • Iteratively collect directly density-reachable
    objects from these core objects and merge
    density-reachable clusters
  • Terminate when no new point can be added to any
    cluster
  • DBSCAN is sensitive to the density thresholds
    (ε and MinPts), but it is fast
  • Time complexity: O(N log N) if a spatial index is
    used, O(N²) otherwise
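A minimal DBSCAN sketch using scikit-learn (the library is an assumption; the slides name only the algorithm). Here eps plays the role of ε, min_samples plays the role of MinPts, and points labeled -1 are noise:

    # DBSCAN on a tiny illustrative data set: two dense groups and one outlier.
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 0], [1.2, 0], [1.4, 0], [1.6, 0], [1.8, 0],   # dense group 1
                  [8.0, 0], [8.2, 0], [8.4, 0], [8.6, 0], [8.8, 0],   # dense group 2
                  [50.0, 0]])                                          # an outlier
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
    print(labels)   # [0 0 0 0 0 1 1 1 1 1 -1]: two clusters plus noise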

21
  • Grid: STING (STatistical INformation Grid)
  • Statistical parameters of higher-level cells can
    easily be computed from those of lower-level
    cells
  • Attribute-independent: count
  • Attribute-dependent: mean, standard deviation,
    min, max
  • Type of distribution: normal, uniform,
    exponential, or unknown
  • Irrelevant cells can be removed
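The key point is that a parent cell's statistics are computed from its children's statistics rather than from the raw points; a small illustrative sketch with made-up values:

    # Aggregate (count, mean, std) of child cells into their parent cell.
    import math

    children = [(10, 2.0, 1.0),   # (count, mean, std) of each child cell
                (20, 3.0, 0.5),
                (5,  8.0, 2.0),
                (15, 4.0, 1.5)]

    n = sum(c for c, _, _ in children)                    # count: attribute-independent
    mean = sum(c * m for c, m, _ in children) / n         # weighted mean of child means
    ex2 = sum(c * (s * s + m * m) for c, m, s in children) / n   # E[x^2] from std and mean
    std = math.sqrt(ex2 - mean * mean)
    print(n, round(mean, 3), round(std, 3))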

22
  • BIRCH: using Clustering Features (CF) and a CF tree
  • A clustering feature is a triple summarizing a
    sub-cluster of instances: (N, LS, SS)
  • N: the number of instances; LS: the linear sum;
    SS: the square sum
  • Two parameters: the branching factor (max number
    of children per non-leaf node) and a threshold on
    the size of the sub-clusters stored in leaf nodes
  • Two phases
  • Build an initial in-memory CF tree
  • Apply a clustering algorithm to cluster the leaf
    nodes of the CF tree
  • CURE (Clustering Using REpresentatives) is
    another example, allowing multiple representative
    points per cluster
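A sketch of why the CF triple is convenient: two sub-clusters can be merged by adding their CFs component-wise, and the centroid and radius can be derived without revisiting the raw instances (illustrative code, not BIRCH itself):

    # Clustering feature (N, LS, SS) of a sub-cluster, plus merge and derived statistics.
    import numpy as np

    def cf(points):
        pts = np.asarray(points, dtype=float)
        return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)   # N, linear sum, square sum

    def merge(a, b):
        return a[0] + b[0], a[1] + b[1], a[2] + b[2]               # CFs add component-wise

    def centroid(f):
        n, ls, _ = f
        return ls / n

    def radius(f):
        n, ls, ss = f
        # root-mean-square distance of the member points from the centroid
        return float(np.sqrt((ss / n - (ls / n) ** 2).sum()))

    a = cf([[1.0, 1.0], [2.0, 2.0]])
    b = cf([[3.0, 3.0]])
    m = merge(a, b)
    print(centroid(m), radius(m))   # centroid [2. 2.], radius ~1.155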

23
  • Taking advantage of the property of density
  • If the data is dense in a higher-dimensional
    subspace, it should also be dense in some
    lower-dimensional subspaces
  • How to use this property?
  • CLIQUE (CLustering In QUEst)
  • With high dimensional data, there are many void
    subspaces
  • Using this property, we can start the search from
    dense lower-dimensional subspaces
  • CLIQUE is a density-based method that can
    automatically find subspaces of the highest
    dimensionality such that high-density clusters
    exist in those subspaces

24
Chameleon
  • A hierarchical clustering algorithm using dynamic
    modeling
  • Observations on the weaknesses of CURE and ROCK
  • CURE: clustering using representatives
  • ROCK: clustering categorical attributes
  • Based on k-NN graphs and dynamic modeling

25
Graph-based clustering
  • Sparsification techniques keep the connections to
    the most similar (nearest) neighbors of a point
    while breaking the connections to less similar
    points.
  • The nearest neighbors of a point tend to belong
    to the same class as the point itself.
  • This reduces the impact of noise and outliers and
    sharpens the distinction between clusters.
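A one-call sketch of this kind of sparsification using scikit-learn's k-nearest-neighbor graph (the library choice is an assumption; the slides do not name one):

    # Keep only each point's 2 nearest neighbors in the similarity graph.
    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative 1-D data
    A = kneighbors_graph(X, n_neighbors=2, mode="connectivity", include_self=False)
    print(A.toarray())   # sparse adjacency: weaker connections have been dropped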

26
  • Neural networks
  • Self-organizing feature maps (SOMs)
  • Subspace clustering
  • CLIQUE: if a k-dimensional unit is dense, then so
    are its (k-1)-dimensional subspaces
  • More will be discussed later
  • Semi-supervised clustering
  • http://www.cs.utexas.edu/ml/publication/unsupervised.html
  • http://www.cs.utexas.edu/users/ml/risc/

27
Challenges
  • Scalability
  • Dealing with different types of attributes
  • Clusters with arbitrary shapes
  • Automatically determining input parameters
  • Dealing with noise (outliers)
  • Insensitivity to the order in which instances are
    presented to the learner
  • High dimensionality
  • Interpretability and usability