Title: Clustering
1 Clustering
- k-Means
- Hierarchical clustering
- Self-Organizing Maps
2 Outline
- k-means clustering
- Hierarchical clustering
- Self-Organizing Maps
3 Classification vs. Clustering
- Classification: supervised learning
4 Classification vs. Clustering
- Clustering: unsupervised learning, labels unknown
- Find the natural grouping of instances
5 Many Clustering Applications
- Basically, everywhere labels are unknown, uncertain, or too expensive
- Marketing: find groups of similar customers
- Astronomy: find groups of similar stars, galaxies
- Earthquake studies: cluster earthquake epicenters along continental faults
- Genomics: find groups of genes with similar expression
6 Clustering Methods: Terminology
- Non-overlapping
- Overlapping
7 Clustering Methods: Terminology
- Bottom-up (agglomerative)
- Top-down
8 Clustering Methods: Terminology
- Hierarchical (vs. flat)
9 Clustering Methods: Terminology
- Deterministic
- Probabilistic
10 k-Means Clustering
11 k-Means clustering (k = 3)
- Pick k random points as initial cluster centers
12 k-Means clustering (k = 3)
- Assign each point to the nearest cluster center
13 k-Means clustering (k = 3)
- Move each cluster center to the mean of its cluster
14 k-Means clustering (k = 3)
- Reassign points to the nearest cluster center
15 k-Means clustering (k = 3)
- Repeat steps 3-4 until the cluster centers converge (don't/hardly move)
16 k-Means
- Works with numeric data only
- Pick k random points as initial cluster centers
- Assign every item to its nearest cluster center (e.g. using Euclidean distance)
- Move each cluster center to the mean of its assigned items
- Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold); a minimal code sketch follows below
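As a rough illustration of the algorithm above, here is a minimal NumPy sketch (the function name, defaults, and stopping rule are illustrative choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain k-means: numeric data only, Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every item to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned items
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Repeat until the centers (hardly) move
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```

Because the result depends on the random initialisation, restarting with different seeds (as discussed a few slides below) is a cheap way to avoid poor local minima.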
17 k-Means clustering: another example
http://www.youtube.com/watch?feature=player_embedded&v=BVFG7fd1H30
18 Discussion
- The result can vary significantly depending on the initial choice of centers
- Can get trapped in a local minimum
- Example:
- To increase the chance of finding the global optimum: restart with different random seeds
19 Discussion: circular data
- Arbitrary results
- Prototypes not on data
20 k-Means clustering summary
- Advantages
- Simple, understandable
- Instances automatically assigned to clusters
- Fast
- Disadvantages
- Must pick number of clusters beforehand
- All instances forced into a single cluster
- Sensitive to outliers
- Random algorithm, random results
- Not always intuitive, especially in higher dimensions
21 k-Means variations
- k-medoids: instead of the mean, use the median of each cluster
- Mean of 1, 3, 5, 7, 1009 is 205
- Median of 1, 3, 5, 7, 1009 is 5 (see the quick check below)
- For large databases, use sampling
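A quick check of the mean/median example above, showing why the median is robust to the outlier 1009:

```python
from statistics import mean, median

data = [1, 3, 5, 7, 1009]
print(mean(data))    # 205 -- dragged towards the outlier
print(median(data))  # 5   -- stays with the bulk of the data
```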
22 How to choose k?
- One important parameter: k. But how to choose it?
- Domain dependent: we simply want k clusters
- Alternative: repeat for several values of k and choose the best (a small sketch follows below)
- Example:
- cluster mammals by their properties
- each value of k leads to a different clustering
- use an MDL-based encoding for the data in the clusters
- each additional cluster introduces a penalty
- optimal for k = 6
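A hedged sketch of the "try several k" idea using scikit-learn. The MDL-based score from the slides is not reproduced here; the within-cluster sum of squares (inertia) serves as a simple stand-in, with the usual "elbow" reading playing the role of the per-cluster penalty:

```python
from sklearn.cluster import KMeans

def inertia_per_k(X, candidates=range(1, 11)):
    # Fit k-means for each candidate k and record the within-cluster sum of squares.
    # Look for the "elbow": the k after which adding another cluster stops paying off.
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in candidates}
```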
23 Clustering Evaluation
- Manual inspection
- Benchmarking on existing labels
- Classification through clustering
- Is this fair?
- Cluster quality measures (a sketch follows below)
- distance measures
- high similarity within a cluster, low across clusters
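A brief sketch of two of the options above, using scikit-learn and the iris data purely as a stand-in example: benchmarking against existing labels, and an internal distance-based quality measure:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Benchmarking on existing labels (the iris species act as ground truth)
print("agreement with labels:", adjusted_rand_score(y, labels))

# Internal quality: high similarity within a cluster, low across clusters
print("silhouette:", silhouette_score(X, labels))
```

Whether scoring a clustering against class labels is "fair" is exactly the question raised above: the classes need not coincide with the natural grouping.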
24 Hierarchical Clustering
25 Hierarchical clustering
- Hierarchical clustering is represented in a dendrogram
- A tree structure containing hierarchical clusters
- Individual clusters in the leaves, unions of child clusters in the nodes
26 Bottom-up vs. top-down clustering
- Bottom-up / agglomerative:
- Start with single-instance clusters
- At each step, join the two closest clusters
- Top-down:
- Start with one universal cluster
- Split it into two clusters
- Proceed recursively on each subset
27 Distance Between Clusters
- Centroid: distance between centroids
- Sometimes hard to compute (e.g. the mean of molecules?)
- Single link: smallest distance between points
- Complete link: largest distance between points
- Average link: average distance between points (a SciPy sketch follows below)
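A small sketch of agglomerative clustering with SciPy; the linkage "method" argument selects the between-cluster distance listed above (the data here is random, just to make the snippet self-contained):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

# Bottom-up clustering; method can be 'single', 'complete', 'average' or 'centroid'
Z = linkage(X, method='average')

dendrogram(Z)                                     # tree of merges, leaves are single instances
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 flat clusters
```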
28 Clustering dendrogram
29 How many clusters?
30 Probability-based Clustering
- Given k clusters, each instance belongs to all clusters (instead of a single one), with a certain probability
- Mixture model: a set of k distributions (one per cluster)
- Also, each cluster has a prior likelihood
- If the correct clustering is known, we know the parameters and P(Ci) for each cluster, and can calculate P(Ci | x) using Bayes' rule
- How to estimate the unknown parameters? (a sketch follows below)
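For concreteness, a sketch with scikit-learn's Gaussian mixture model, which estimates the unknown component parameters and priors from the data (internally via the EM algorithm) and returns the per-cluster membership probabilities P(Ci | x):

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

# Mixture of k = 3 Gaussians, one distribution per cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

probs = gmm.predict_proba(X)   # P(Ci | x): every instance belongs to every cluster with some probability
print(gmm.weights_)            # prior likelihood of each cluster
```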
31 Self-Organising Maps
32 Self-Organizing Map
- Group similar data together
- Dimensionality reduction
- Data visualization technique
- Similar to neural networks
- Neurons try to mimic the input vectors
- The winning neuron (and its neighborhood) wins
- Topology preserving, using a neighborhood function
33 Self-Organizing Map
- Input: high-dimensional input space
- Output: low-dimensional (typically 2 or 3) network topology
- Training:
- Starting with a large learning rate and neighborhood size, both are gradually decreased to facilitate convergence
- After learning, neurons with similar weights tend to cluster on the map
34 Learning the SOM
- Determine the winner (the neuron whose weight vector has the smallest distance to the input vector)
- Move the weight vector w of the winning neuron towards the input i
35 SOM Learning Algorithm
- Initialise the SOM (randomly, or such that dissimilar inputs are mapped far apart)
- For t from 0 to N:
- Randomly select a training instance
- Get the best matching neuron
- calculate the distance, e.g.
- Scale the neighbors
- Which? Decrease over time: hexagons, squares, Gaussian, ...
- Update the neighbors towards the training instance (a NumPy sketch follows below)
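A minimal NumPy sketch of this training loop. The Gaussian neighborhood and the linear decay of the learning rate and neighborhood radius are illustrative choices; the slides only state that both decrease over time:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_steps=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Tiny SOM: a grid of neurons whose weight vectors learn to mimic the inputs."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))         # random initialisation
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for t in range(n_steps):
        x = X[rng.integers(len(X))]                        # randomly select a training instance
        d = np.linalg.norm(W - x, axis=2)                  # distance of every weight vector to the input
        winner = np.unravel_index(d.argmin(), d.shape)     # best matching neuron
        frac = t / n_steps
        lr = lr0 * (1 - frac)                              # learning rate decreases over time
        sigma = sigma0 * (1 - frac) + 0.5                  # neighborhood radius shrinks over time
        grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
        h = np.exp(-grid_dist**2 / (2 * sigma**2))         # Gaussian neighborhood on the map
        W += lr * h[..., None] * (x - W)                   # move winner and neighbors towards the input
    return W
```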
36 Self-Organizing Map
- Neighborhood function to preserve the topological properties of the input space
- Neighbors share the prize (postcode lottery principle)
37 SOM of hand-written numerals
38 SOM of countries (poverty)
39 Clustering Summary
- Unsupervised
- Many approaches
- k-means: simple, sometimes useful
- k-medoids is less sensitive to outliers
- Hierarchical clustering works for symbolic attributes
- Self-Organizing Maps
- Evaluation is a problem