Title: Clustering
1Clustering
2Outline
- Introduction
- K-means clustering
- Hierarchical clustering and COBWEB
3Classification vs. Clustering
Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances
4Clustering
Unsupervised learning: finds natural groupings of instances in unlabeled data
5Clustering Methods
- Many different methods and algorithms
- For numeric and/or symbolic data
- Deterministic vs. probabilistic
- Exclusive vs. overlapping
- Hierarchical vs. flat
- Top-down vs. bottom-up
6Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping clusters
Venn diagram: overlapping clusters
7Clustering Evaluation
- Manual inspection
- Benchmarking on existing labels
- Cluster quality measures
- distance measures
- high similarity within a cluster, low similarity across clusters (see the sketch below)
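One common quality measure of this kind is the silhouette score, which rewards exactly this "high similarity within, low across" pattern. A minimal sketch, assuming scikit-learn is available; the two-blob toy data and k=2 are illustrative assumptions, not part of the slides:

# Sketch: measuring cluster quality with the silhouette score
# (high similarity within a cluster, low across clusters).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])  # another tight group

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1.0: compact, well-separated clusters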
8The distance function
- Simplest case: one numeric attribute A
- Distance(X,Y) = |A(X) - A(Y)|
- Several numeric attributes
- Distance(X,Y) = Euclidean distance between X and Y
- Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
- Are all attributes equally important?
- Weighting the attributes might be necessary (see the sketch below)
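A minimal sketch of such a distance function for mixed numeric and nominal attributes with optional per-attribute weights; the attribute names, values, and weights are made-up illustrations:

# Sketch: distance over numeric and nominal attributes, with optional weights.
import math

def distance(x, y, numeric, nominal, weights=None):
    """x, y: dicts mapping attribute name -> value."""
    weights = weights or {}
    d2 = 0.0
    for a in numeric:                      # squared numeric differences
        d2 += weights.get(a, 1.0) * (x[a] - y[a]) ** 2
    for a in nominal:                      # 0 if values are equal, 1 if different
        d2 += weights.get(a, 1.0) * (0.0 if x[a] == y[a] else 1.0)
    return math.sqrt(d2)

x = {"temperature": 20.0, "humidity": 65.0, "outlook": "sunny"}
y = {"temperature": 25.0, "humidity": 70.0, "outlook": "rainy"}
print(distance(x, y, numeric=["temperature", "humidity"], nominal=["outlook"]))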
9Simple Clustering: K-means
- Works with numeric data only
- Pick a number (K) of cluster centers at random
- Assign every item to its nearest cluster center (e.g., using Euclidean distance)
- Move each cluster center to the mean of its assigned items
- Repeat steps 2-3 until convergence (change in cluster assignments below a threshold); see the sketch below
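A sketch of this loop in NumPy, under the assumption that convergence is declared when the assignments stop changing (a common variant of the threshold test above); the toy data is arbitrary:

# Sketch of the K-means loop described on this slide (numeric data only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: pick K centers at random
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                        # step 2: assign items to nearest center
        centers = np.array([X[new_labels == j].mean(axis=0)     # step 3: move centers to cluster means
                            if np.any(new_labels == j) else centers[j]
                            for j in range(k)])
        if np.array_equal(new_labels, labels):                   # step 4: stop when assignments no longer change
            break
        labels = new_labels
    return labels, centers

X = np.array([[1, 1], [1, 2], [0, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2))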
10K-means example, step 1
Pick 3 initial cluster centers (randomly)
11K-means example, step 2
Assign each point to the closest cluster center
12K-means example, step 3
Move each cluster center to the mean of its cluster
13K-means example, step 4
Reassign points that are now closer to a different cluster center. Q: Which points are reassigned?
14K-means example, step 4
A: three points (shown with animation)
15K-means example, step 4b
re-compute cluster means
16K-means example, step 5
move cluster centers to cluster means
17Discussion
- Results can vary significantly depending on the initial choice of seeds
- Can get trapped in a local minimum
- Example
- To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below)
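A sketch of the restart idea using scikit-learn's KMeans: run the algorithm from several random seeds and keep the run with the lowest within-cluster sum of squares. The three-blob toy data and the ten seeds are arbitrary choices; scikit-learn's own n_init parameter performs this kind of restart internally.

# Sketch: restart K-means with different random seeds, keep the best result.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

best = min((KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X) for seed in range(10)),
           key=lambda m: m.inertia_)   # inertia_ = within-cluster sum of squared distances
print(best.inertia_)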
18K-means clustering summary
- Advantages
- Simple, understandable
- items automatically assigned to clusters
- Disadvantages
- Must pick the number of clusters beforehand
- All items forced into a cluster
- Too sensitive to outliers
19K-means variations
- K-medoids: instead of the mean, use the median of each cluster
- Mean of 1, 3, 5, 7, 9 is 5
- Mean of 1, 3, 5, 7, 1009 is 205
- Median of 1, 3, 5, 7, 1009 is 5
- Median advantage: not affected by extreme values (see the sketch below)
- For large databases, use sampling
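The arithmetic from the slide, checked with Python's statistics module: one extreme value drags the mean but leaves the median untouched.

# Sketch: mean vs. median on the slide's example values.
from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))       # 5
print(mean([1, 3, 5, 7, 1009]))    # 205
print(median([1, 3, 5, 7, 1009]))  # 5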
20Hierarchical clustering
- Bottom up
- Start with single-instance clusters
- At each step, join the two closest clusters
- Design decision: distance between clusters
- E.g., two closest instances in the clusters vs. distance between their means
- Top down
- Start with one universal cluster
- Find two clusters
- Proceed recursively on each subset
- Can be very fast
- Both methods produce a dendrogram
21Incremental clustering
- Heuristic approach (COBWEB/CLASSIT)
- Form a hierarchy of clusters incrementally
- Start
- tree consists of empty root node
- Then
- add instances one by one
- update tree appropriately at each stage
- to update, find the right leaf for an instance
- May involve restructuring the tree
- Base update decisions on category utility
22Clustering weather data
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
23Clustering weather data
(The weather data table above is repeated on this slide.)
Merge best host and runner-up
Consider splitting the best host if merging doesn't help
24Final hierarchy
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
Oops! A and B are actually very similar
25Example: the iris data (subset)
26Clustering with cutoff
27Category utility
- Category utility: a quadratic loss function defined on conditional probabilities:
  CU(C1, C2, ..., Ck) = (1/k) * Σ_l Pr[Cl] * Σ_i Σ_j ( Pr[ai = vij | Cl]² - Pr[ai = vij]² )
- Every instance in a different category ⇒ the numerator becomes maximal (see next slide)
28Overfitting-avoidance heuristic
- If every instance gets put into a different category, the numerator becomes maximal:
  n - Σ_i Σ_j Pr[ai = vij]²  (the maximum value of the numerator), where n is the number of attributes
- So without k in the denominator of the CU formula, every cluster would consist of one instance! (see the sketch below)
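A small sketch that computes category utility with the formula above for a partition of nominal instances; the two-cluster partition of the four weather instances A, B, E, F from the earlier table is an arbitrary illustration.

# Sketch: category utility
#   CU = (1/k) * sum_l P(C_l) * sum_i sum_j ( P(a_i=v_ij | C_l)^2 - P(a_i=v_ij)^2 )
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of attribute-value dicts."""
    instances = [x for c in clusters for x in c]
    n, k = len(instances), len(clusters)
    attrs = instances[0].keys()
    # unconditional term: sum_i sum_j P(a_i = v_ij)^2
    base = sum((cnt / n) ** 2
               for a in attrs
               for cnt in Counter(x[a] for x in instances).values())
    cu = 0.0
    for c in clusters:
        cond = sum((cnt / len(c)) ** 2
                   for a in attrs
                   for cnt in Counter(x[a] for x in c).values())
        cu += (len(c) / n) * (cond - base)
    return cu / k

A = {"outlook": "sunny", "temp": "hot",  "humidity": "high",   "windy": "false"}
B = {"outlook": "sunny", "temp": "hot",  "humidity": "high",   "windy": "true"}
E = {"outlook": "rainy", "temp": "cool", "humidity": "normal", "windy": "false"}
F = {"outlook": "rainy", "temp": "cool", "humidity": "normal", "windy": "true"}
print(category_utility([[A, B], [E, F]]))   # grouping similar instances gives a higher CU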
29Levels of Clustering
30Hierarchical Clustering
- Clusters are created in levels, actually creating sets of clusters at each level
- Agglomerative
- Initially, each item is in its own cluster
- Iteratively, clusters are merged together
- Bottom Up
- Divisive
- Initially, all items are in one cluster
- Large clusters are successively divided
- Top Down
31Dendrogram
- Dendrogram: a tree data structure which illustrates hierarchical clustering techniques
- Each level shows the clusters for that level
- Leaf: individual clusters
- Root: one cluster
- A cluster at level i is the union of its child clusters at level i+1
32Agglomerative Example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
(Figures: the points A-E plotted in 2-D, and the resulting dendrogram with merges at thresholds 1 through 5.)
33Distance Between Clusters
- Single link: smallest distance between points
- Complete link: largest distance between points
- Average link: average distance between points
- Centroid: distance between centroids
(The first three are compared in the sketch below.)
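A sketch comparing single, complete, and average link on the A-E distance matrix from the agglomerative example, using SciPy; the cut at distance 2 is an arbitrary choice, and centroid linkage is omitted because SciPy's implementation expects raw observations rather than a precomputed distance matrix.

# Sketch: hierarchical clustering of the A-E distance matrix with different linkages.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
condensed = squareform(D)                      # SciPy expects the condensed (upper-triangle) form

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)      # rows: (cluster_i, cluster_j, distance, size)
    print(method, fcluster(Z, t=2, criterion="distance"))  # cut the dendrogram at distance 2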
34Single Link Clustering
35Other Clustering Approaches
- EM: probability-based clustering
- Bayesian clustering
- SOM: self-organizing maps
- ...
36Self-Organizing Map
37Self Organizing Map
- Unsupervised learning
- Competitive learning
(Figure: a competitive network with an n-dimensional input layer, an output layer, and a winner neuron.)
38Self Organizing Map
- Determine the winner: the neuron whose weight vector has the smallest distance to the input vector
- Move the weight vector w of the winning neuron towards the input i (e.g., w ← w + η (i - w))
39Self Organizing Map
- Impose a topological order onto the competitive neurons (e.g., a rectangular map)
- Let neighbors of the winner share the prize (the "postcode lottery" principle)
- After learning, neurons with similar weights tend to cluster on the map (see the sketch below)
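A minimal sketch of one SOM training step plus a decaying schedule: find the winner, then pull the winner and its neighbors toward the input. The 10x10 grid, Gaussian neighborhood, learning-rate schedule, and uniformly random 2-D inputs are arbitrary choices, not taken from the slides.

# Sketch: self-organizing map training on a rectangular grid of neurons.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 2
weights = rng.random((grid_h, grid_w, dim))             # one weight vector per map neuron
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train_step(x, lr, sigma):
    dists = np.linalg.norm(weights - x, axis=-1)         # distance of every neuron's weights to x
    winner = np.unravel_index(dists.argmin(), dists.shape)
    grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))     # neighbors of the winner share the prize
    weights[...] += lr * h[..., None] * (x - weights)    # move weights toward the input

for t in range(1000):                                    # learning rate and neighborhood shrink over time
    train_step(rng.random(dim), lr=0.5 * (1 - t / 1000), sigma=2.0 * (1 - t / 1000) + 0.1)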
40Self Organizing Map
41Self Organizing Map
42Self Organizing Map
- Input: uniformly randomly distributed points
- Output: map of 202 neurons
- Training
- Start with a large learning rate and neighborhood size; both are gradually decreased to facilitate convergence
43-48Self Organizing Map (figure slides, no text)
49Discussion
- Can interpret clusters by using supervised learning
- learn a classifier based on the clusters (see the sketch below)
- Decrease dependence between attributes?
- pre-processing step
- E.g., use principal component analysis
- Can be used to fill in missing values
- Key advantage of probabilistic clustering
- Can estimate the likelihood of the data
- Use it to compare different models objectively
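A sketch combining two of these points: PCA as a pre-processing step before K-means, and a decision tree trained on the cluster labels to interpret the clusters. The iris data (mentioned earlier) and all parameter values are illustrative choices.

# Sketch: PCA pre-processing, then cluster, then explain clusters with a classifier.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_reduced = PCA(n_components=2).fit_transform(iris.data)          # decorrelate / reduce attributes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, labels)  # classifier that describes the clusters
print(export_text(tree, feature_names=list(iris.feature_names)))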
50Examples of Clustering Applications
- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
- Genomics: find groups of genes with similar expression
51Clustering Summary
- unsupervised
- many approaches
- K-means: simple, sometimes useful
- K-medoids is less sensitive to outliers
- Hierarchical clustering works for symbolic attributes
- Evaluation is a problem