Title: Cluster Analysis
1Cluster Analysis
- Craig A. Struble
- Department of Mathematics, Statistics, and Computer Science, Marquette University
2Clustering Outline
- Problem Overview
- Techniques
- Partitional Algorithms
- Hierarchical Algorithms
- Probability Based Algorithms
- Other Approaches
- Interpretations
- Applications
3Goals
- Explore different clustering techniques
- Understand complexity issues
- Learn to interpret clustering results
- Explore applications of clustering
4Clustering Examples
- Segment customer database based on similar buying patterns
- Group houses in a town into neighborhoods based on similar features
- Identify new plant species
- Identify similar Web usage patterns
5Clustering Example
6Clustering Houses
7Clustering Problem
- Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f : D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
- A cluster, Kj, contains precisely those tuples mapped to it.
- Unlike the classification problem, clusters are not known a priori.
8Clustering vs. Classification
- No prior knowledge
- Number of clusters
- Meaning of clusters
- Unsupervised learning
- Data has no class labels
9Clustering Approaches
[Diagram: taxonomy of clustering approaches; the surviving labels are "Clustering", "Sampling", and "Compression"]
10Clustering Issues
- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
11Impact of Outliers on Clustering
12Visualizations
13Cluster Parameters
14Resources
- Classic text is Finding Groups in Data by Kaufman and Rousseeuw, 1990
- Overwhelming number of algorithms
- Several implementations
- R and Weka
15Partitional Clustering
- Simultaneous clustering
- All elements are in some cluster during each iteration
- May be shifted from one cluster to another
- Some metric is used to determine goodness of clustering
- Average distance between clusters
- Squared error metric
- Combinatorial problem
- 11,259,666,000 ways to cluster 19 items into 4 clusters
16K-Means
Algorithm KMeans
Input:  k    - number of clusters
        t    - number of iterations
        data - the data
Output: C    - a set of k clusters
  arbitrarily select k objects as initial centers cent[1..k]
  for i = 1 to t do
    for each d in data do
      assign label x to d such that dist(d, cent[x]) is minimized
    for x = 1 to k do
      cent[x] = mean value of all data with label x
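- As a sketch only, the loop above can be written directly in R; the function name simple_kmeans, the random initial centers, and the assumption that data is a numeric matrix are illustrative, and base R's kmeans() is the standard implementation.

# Illustrative translation of the pseudocode above (not optimized)
simple_kmeans <- function(data, k, t = 10) {
  cent <- data[sample(nrow(data), k), , drop = FALSE]   # arbitrary initial centers
  labels <- rep(1, nrow(data))
  for (iter in 1:t) {
    # assignment step: give each point the label of its nearest center
    for (d in 1:nrow(data)) {
      dists <- apply(cent, 1, function(ce) sqrt(sum((data[d, ] - ce)^2)))
      labels[d] <- which.min(dists)
    }
    # update step: each center becomes the mean of the points labeled with it
    for (x in 1:k) {
      if (any(labels == x))
        cent[x, ] <- colMeans(data[labels == x, , drop = FALSE])
    }
  }
  list(centers = cent, cluster = labels)
}
# Base R equivalent: kmeans(data, centers = k, iter.max = t)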
17Example Data
18K-Means Example
- Use Euclidean distance, dist(x, y) = sqrt((x1 - y1)² + ... + (xl - yl)²), where l is the number of dimensions
- Select (10, 4) and (5, 4) as initial points
19K-Means Clustering
[Plots: K-Means results shown with 3 clusters and with 2 clusters]
20Cluster Centers
21K-Means Summary
- Very simple algorithm
- Only works on data for which means can be calculated
- Continuous data
- O(knt) time complexity
- k - number of clusters,
- n - number of instances,
- t - number of iterations
- Circular cluster shape only
- Outliers can have very negative impact
22Outliers
23Partitioning Methods
- K-Means (already done)
- MST
- K-Medoids (PAM)
- Fuzzy Clustering
24Dissimilarity Matrices
- Many clustering algorithms use a dissimilarity matrix as input
- Instance x Instance
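- A short R sketch of building one; the iris data and the Euclidean metric are illustrative, and daisy() from the cluster package also handles mixed attribute types.

library(cluster)

x  <- iris[, 1:4]                      # illustrative numeric data
d1 <- dist(x)                          # Euclidean dissimilarities (instance x instance)
d2 <- daisy(x, metric = "euclidean")   # cluster package equivalent
as.matrix(d1)[1:3, 1:3]                # inspect a corner of the matrix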
25Graph Perspective
[Figure: the dissimilarity matrix drawn as a weighted graph on instances 1-5; visible edge weights include 4.472, 5.38, and 5.83]
26MST Algorithm
- Compute the minimal spanning tree of the graph
- Set of edges with minimal total weight so that each node can be reached
- Remove edges that are inconsistent
- E.g., one whose weight is much larger than the average weight of neighboring edges
- Connected graph components form clusters (see the R sketch below)
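- A minimal R sketch, relying on the standard fact that cutting a single-linkage hierarchy at k clusters removes the k-1 heaviest MST edges; the iris data and k = 3 are illustrative.

x <- iris[, 1:4]
h <- hclust(dist(x), method = "single")  # single linkage performs the same merges as the MST
clusters <- cutree(h, k = 3)             # equivalent to deleting the k-1 heaviest MST edges
table(clusters)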
27MST Example
[Figure: weighted graph on nodes A, B, C, D, E with edge weights 1, 2, 3, and 1]
28MST Algorithm
29MST Example
Let k be 3, and let an inconsistent edge be defined as an edge with maximum weight.
[Figure: the graph after removing inconsistent edges; only edges of weight 1 remain among nodes A-E, and the resulting connected components give the 3 clusters]
30MST Summary
- Cost is dominated by MST procedure
- Time and space O(n²)
- Number of clusters not necessarily needed
- Implicitly defined by the inconsistent edges removed
31K-Medoids
- K-Means is restricted to numeric attributes
- Have to calculate the average object
- A medoid is a representative object in a cluster.
- The algorithm searches for K medoids in the set
of objects.
32K-Medoids
33PAM Algorithm
- PAM consists of two phases
- BUILD - constructs an initial clustering
- SWAP - refines the clustering
- Goal: minimize the sum of dissimilarities to the k representative objects
- Mathematically equivalent to minimizing average dissimilarity
34PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
// Select k representative objects that appear to minimize dissimilarities
selected = {}                        // empty set
for x = 1 to k do
  maxgain = 0
  for each i in data - selected do
    gain = 0
    for each j in data - selected do
      // See if j is closer to i than to some previously selected object
      let Dj = min(diss(j, s) for each s in selected)   // taken as +infinity when selected is empty
      let Cji = max(Dj - diss(j, i), 0)
      gain = gain + Cji              // total up improvements from choosing i
    if gain > maxgain then
      maxgain = gain
      best = i
  selected = selected + {best}       // best representative object chosen
35PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
// Improve the partitioning; selected comes from the BUILD phase
besti = besth = first element in selected
repeat
  selected = selected - {besti} + {besth}     // swap i and h
  minswap = 0
  for each i in selected do
    for each h in data - selected do
      swap = 0
      for each j in data - selected do
        let Dj = min(diss(j, s) for each s in selected)
        if Dj < diss(j, i) and Dj < diss(j, h) then
          swap = swap + 0                     // closer to something else: do nothing
        else if diss(j, i) == Dj then         // closest to i
          let Ej = min(diss(j, s) for each s in selected - {i})
          if diss(j, h) < Ej then
            swap = swap + diss(j, h) - Dj     // j follows the new object h
          else
            swap = swap + Ej - Dj             // j moves to its second-best representative
        else                                  // j not closest to i, and h at least as close as Dj
          swap = swap + diss(j, h) - Dj
      if swap < minswap then
        minswap = swap
        besti = i
        besth = h
until minswap >= 0                            // stop when no swap improves the clustering
36PAM Output (from R)
37PAM Output (from R)
38Silhouettes
- These plots give an intuitive sense of how good the clustering is
- Let diss(i, C) be the average dissimilarity between i and each element in cluster C
- Let A be the cluster instance i is in
- a(i) = diss(i, A)
- Let B != A be the cluster such that diss(i, B) is minimized
- b(i) = diss(i, B)
- The silhouette number s(i) is s(i) = (b(i) - a(i)) / max(a(i), b(i))
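- A brief R sketch of computing s(i); the PAM clustering of the iris data is an illustrative choice.

library(cluster)

x   <- iris[, 1:4]
fit <- pam(x, k = 3)
sil <- silhouette(fit)    # one s(i) value per instance
summary(sil)              # average silhouette width per cluster and overall
plot(sil)                 # silhouette plot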
39Silhouette Example
40Silhouettes
- Let s(k) be the average silhouette width for k clusters
- The silhouette coefficient of a data set is SC = max over k of s(k)
- The k that maximizes this value is an indication of the number of clusters
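- One way to estimate it in R, looping over candidate values of k; the range 2:6, PAM, and the iris data are illustrative.

library(cluster)

x <- iris[, 1:4]
avg_width <- sapply(2:6, function(k) mean(silhouette(pam(x, k))[, "sil_width"]))
names(avg_width) <- 2:6
avg_width                    # the silhouette coefficient is the largest of these
names(which.max(avg_width))  # suggested number of clusters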
41Fuzzy Clustering (FANNY)
- Previous partitioning methods are hard
- Data can be in one and only one cluster
- Instead of saying in or out, give a percent of membership
- This is the basis for fuzzy logic
42Fuzzy Clustering (FANNY)
- Algorithm is a bit too complex to cover
- Idea: minimize the objective function
  sum over clusters v of [ sum over all pairs i, j of u(i,v)² u(j,v)² diss(i, j) ] / [ 2 * sum over j of u(j,v)² ]
- where u(i,v) is the unknown membership of object i to cluster v
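- A sketch of fuzzy clustering with the cluster package; the iris data and k = 2 are illustrative.

library(cluster)

x   <- iris[, 1:4]
fit <- fanny(x, k = 2)
head(fit$membership)      # degree of membership of each object in each cluster
fit$clustering            # nearest crisp clustering
plot(fit)                 # clusplot and silhouette plot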
43FANNY Results
44FANNY Results
45Hierarchical Methods
- Top-down vs. bottom-up
- Agglomerative Nesting (AGNES)
- Divisive Analysis (DIANA)
- BIRCH
46Top-Down vs. Bottom-Up
- Top-down or divisive approaches split the whole data set into smaller pieces
- Bottom-up or agglomerative approaches combine individual elements
47Agglomerative Nesting
- Combine clusters until one cluster is obtained
- Initially each cluster contains one object
- At each step, merge the two most similar clusters (e.g., by average linkage)
48Cluster Dissimilarities
[Figure: two clusters R and Q, with diss(i, j) measured between an object i in R and an object j in Q]
49Cluster Dissimilarities
- The dissimilarity between clusters can be defined differently
- Maximum dissimilarity between two objects
- Complete linkage
- Minimum dissimilarity between two objects
- Single linkage
- Centroid method
- Interval-scaled attributes
- Ward's method
- Interval-scaled attributes
- Error sum of squares of a cluster
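- As a sketch, agnes() in the cluster package supports several of these linkages; the iris data, average linkage, and k = 3 are illustrative choices.

library(cluster)

x  <- iris[, 1:4]
ag <- agnes(x, method = "average")   # also "single", "complete", or "ward"
plot(ag)                             # banner plot and dendrogram
cutree(as.hclust(ag), k = 3)         # extract 3 clusters from the hierarchy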
50Example
51AGNES Results
52AGNES Results
53Divisive Analysis (DIANA)
- Calculate the diameter of each cluster Q
- Select the cluster Q with the largest diameter
- Split into A and B
- Select the object i in A that maximizes its average dissimilarity to the rest of A minus its average dissimilarity to B
- Move i from A to B if the max value is > 0
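- A brief sketch with the cluster package; the iris data and k = 3 are illustrative.

library(cluster)

x  <- iris[, 1:4]
di <- diana(x)
di$dc                        # divisive coefficient
plot(di)                     # banner plot and dendrogram
cutree(as.hclust(di), k = 3) # extract 3 clusters from the hierarchy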
54DIANA Results
55DIANA Results
56BIRCH
- Balanced Iterative Reducing and Clustering Using Hierarchies
- Mixes hierarchical clustering with other techniques
- Useful for large data sets, because the entire data set is not kept in memory
- Identifies and removes outliers from the clustering
- Due to differing distribution of data
- Presentation assumes continuous data
57Two central concepts
- A cluster feature (CF) is a triple CF = (N, LS, SS) summarizing information about a cluster
- where N is the number of points in the cluster, LS is the linear sum of the data points, and SS is the square sum of the data points
58Two central concepts
- Contains enough information to calculate a variety of distance measures
- Addition of CFs accurately represents the CF of merged clusters (see the sketch below)
- Memory and time efficient to maintain
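- A tiny R sketch of the CF triple and its additivity; the helper names cf and cf_merge and the scalar form of SS are illustrative assumptions.

# Cluster feature of a set of points (rows of a numeric matrix)
cf <- function(points) {
  list(N = nrow(points), LS = colSums(points), SS = sum(points^2))
}
# Merging two clusters is component-wise addition of their CFs
cf_merge <- function(a, b) {
  list(N = a$N + b$N, LS = a$LS + b$LS, SS = a$SS + b$SS)
}
x1 <- matrix(rnorm(20), ncol = 2)
x2 <- matrix(rnorm(10), ncol = 2)
all.equal(cf_merge(cf(x1), cf(x2)), cf(rbind(x1, x2)))  # TRUE: CFs add exactly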
59Two Central Concepts
- A CF tree is a height-balanced tree with two parameters, the branching factor B and the diameter threshold T.
[Figure: a CF tree with root entries CF1, CF2, CF3, ..., CFn, level-1 entries CF11, CF12, CF13, ..., CF1n, and clusters at the leaves]
60Phase 1 Build CF tree
- The CF tree is contained in memory and created dynamically
- Identify appropriate leaf: recursively descend the tree, following the closest child node
- Modify leaf: when the leaf is reached, add the new data item x to it; if the leaf contains more than L entries, split the leaf (it must also satisfy T)
61Phase 1 Build CF tree
- Steps continued
- Modify path to leaf: update the CF of the parent. If the leaf split, add a new entry to the parent. If the parent violates B, split the parent node. Update parent nodes recursively.
- Merging refinement: find the non-leaf node Nj at which the split propagation stopped. Find the two closest entries in Nj. If they are not the ones due to the split, merge the two entries.
62Phase 1 Comments
- The parameters B and L are a function of the page size P
- Splits are caused by P, not by data distribution
- Hence, refinement step
- Increasing T makes a smaller tree, but can hide outliers
- Change T if memory runs out (Phase 2)
63Phase 3 Global Clustering
- Apply some global clustering technique to the leaf clusters in the CF tree
- Fast, because everything is in memory
- Accurate, because outliers are removed and the data is represented at a level allowable by memory
- Less order dependent, because leaves have data locality
64Phase 4 Cluster Refinement
- Use centroids of the clusters found in Phase 3
- Identify centroid C closest to data point
- Place data point in cluster represented by C
65Probabilistic Methods
- COBWEB
- Hierarchical description with probabilities associated with attributes
- Mixture Models
- Define probability distributions for each cluster in the data
66COBWEB
- Fisher, 1987
- Incremental approach to clustering
- Creates a classification tree, in which each node describes a concept together with a probabilistic description of that concept
- Prior probability of the concept
- Conditional probabilities for the attributes given that concept
67Classification Tree
68Algorithm
- Add each data item to the hierarchy one at a time
- Try placing the data item in each existing node (going level by level); select a good node by maximizing the average category utility (see the formula below)
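- For reference, the standard definition of average category utility (Fisher, 1987), where C_1, ..., C_k are the candidate child concepts, A_i the attributes, and V_ij their values:

CU(C_1, \ldots, C_k) = \frac{1}{k} \sum_{l=1}^{k} P(C_l)
  \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_l)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right]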
69Algorithm
- Incorporating a new instance might cause the two best nodes to merge
- Calculate CU for the merged nodes
- Alternatively, incorporating a new instance might cause a split
- Calculate CU for splitting the best node
70Probability-Based Clustering
- Consider clustering data into k clusters
- Model each cluster with a probability distribution
- This set of k distributions is called a mixture, and the overall model is a finite mixture model
- Each probability distribution gives the probability of an instance being in a given cluster
71Mixture Model Clustering
- Simplest case: a single numeric attribute and two clusters A and B, each represented by a normal distribution
- Parameters for A: μA (mean), σA (standard deviation)
- Parameters for B: μB (mean), σB (standard deviation)
- And P(A), P(B) = 1 - P(A), the prior probabilities of being in clusters A and B respectively
72Probability-Based Clustering
μA = 50, σA = 5, P(A) = 0.6; μB = 65, σB = 2, P(B) = 0.4
73Probability-Based Clustering
- The question is, how do we know the parameters for the mixture?
- μA, σA, μB, σB, P(A)
- If the data is labeled, easy
- But clustering is more often used for unlabeled data
- Use an iterative approach similar in spirit to the k-means algorithm
74Expectation Maximization
- Start with initial guesses for the parameters
- Calculate cluster probabilities for each instance
- Expectation
- Reestimate the parameters from probabilities
- Maximization
- Repeat
75Maximization
Probability xi is in A: wi = P(A | xi) = f(xi; μA, σA) P(A) / (f(xi; μA, σA) P(A) + f(xi; μB, σB) P(B)), where f is the normal density
Estimated mean of A: μA = (w1 x1 + ... + wn xn) / (w1 + ... + wn)
Estimated variance of A (maximum likelihood estimate): σA² = (w1 (x1 - μA)² + ... + wn (xn - μA)²) / (w1 + ... + wn)
Prior probability of being in A: P(A) = (w1 + ... + wn) / n
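- A compact runnable R sketch of these updates for the two-cluster, single-attribute case; the generated data, the initial guesses, and the fixed 50 iterations are illustrative.

set.seed(1)
x <- c(rnorm(60, mean = 50, sd = 5), rnorm(40, mean = 65, sd = 2))  # illustrative data
muA <- 45; sdA <- 10; muB <- 60; sdB <- 10; pA <- 0.5               # initial guesses

for (iter in 1:50) {
  # Expectation: probability wi that each xi belongs to cluster A
  dA <- pA * dnorm(x, muA, sdA)
  dB <- (1 - pA) * dnorm(x, muB, sdB)
  w  <- dA / (dA + dB)
  # Maximization: re-estimate the parameters from the weights
  muA <- sum(w * x) / sum(w)
  muB <- sum((1 - w) * x) / sum(1 - w)
  sdA <- sqrt(sum(w * (x - muA)^2) / sum(w))
  sdB <- sqrt(sum((1 - w) * (x - muB)^2) / sum(1 - w))
  pA  <- mean(w)
}
c(muA = muA, sdA = sdA, muB = muB, sdB = sdB, pA = pA)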
76Termination
- The EM algorithm converges to a maximum, but never gets there
- Continue until overall likelihood growth is negligible
- Maximum could be local, so repeat several times with different initial values
77Extending the Model
- Extending to multiple clusters is straightforward: just use k normal distributions
- For multiple attributes, assume independence and multiply attribute probabilities as in Naïve Bayes
- For nominal attributes, we can't use a normal distribution. We have to create probability distributions over the values, one per cluster. This gives k·v parameters to estimate, where v is the number of values for the nominal attribute.
- Can use different distributions depending on the data, e.g., a log-normal distribution for attributes with a minimum value
78Other Clustering Approaches
- Genetic Algorithms
- Global search for solutions
- Neural Networks
- Competitive Learning
- Kohonen Network
79Kohonen Data
80Applications of Clustering
- Gene function identification
- Document clustering
- Modeling economic data
81Gene Function Identification
- Genome is the blueprint defining an organism (DNA)
- Genes are inherited portions of the genome related to biological functions
- Proteins, non-coding RNA
- Given the collection of biological information, try to predict or identify the function of genes with unknown function
82Gene Expression
- A gene is expressed if it is actively being transcribed (copied into RNA)
- Rate of expression is related to the rate of transcription
- Microarray experiments
83Gene Expression Data
[Figure: gene expression data matrix, with clones along one axis and experiments along the other]
84Clustering Gene Expression Data
- Identify genes with similar expression profiles
- Use clustering
- Identify function of known genes in a cluster
- Assign that function to genes of unknown function in the same cluster
85Clustering Gene Expression Data
Yeast Genome
86Document Clustering
- Represent documents as vectors in a vector space
- Cluster documents in this representation
- Describe/summarize/evaluate the clusters
- Label clusters with meaningful descriptions
87Document Transformation
- Convert document into table form
- Attributes are important words
- Value is number of times word appears
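- A toy R sketch of this transformation; the two example documents are illustrative.

docs  <- c("clustering groups similar documents",
           "documents about clustering and classification")
words <- strsplit(tolower(docs), "[^a-z]+")
vocab <- sort(unique(unlist(words)))
tf <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
tf    # one row per document, one column per word, value = number of occurrences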
88Document Classification
- Could select a word as classification label
- Identify after clustering
- Look at the medoid or centroid and examine its characteristics
- Look at the number of times certain words appear
- Look at which words appear together
- Look at words that don't appear at all
89Document Classification
- Once clusters are identified, label each document with a cluster label
- Use a classification technique to identify cluster relationships (see the sketch below)
- Decision trees, for example
- Other kinds of rule induction
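- One possible sketch in R: train a decision tree (rpart) on the cluster labels; the iris data stands in for a document-term table, and PAM with k = 3 is an illustrative clustering.

library(cluster)
library(rpart)

x   <- iris[, 1:4]                       # stand-in for a document-term table
lab <- factor(pam(x, k = 3)$clustering)  # cluster label for each document
fit <- rpart(lab ~ ., data = data.frame(x, lab), method = "class")
fit                                      # tree rules characterizing the clusters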
90Document Clustering
- MeSH
- 21975 Terms
- RGD
- 2713 Papers
- Dissimilarity matrix
- Multidimensional scaling
- FANNY
- Red - Sequence related
- Black - Physiological
91Economic Modelling
- Nettleton et al. (2000)
- Objective is to identify how to make the Port of Barcelona the principal port of entry for merchandise
- Statistics, clustering, outlier analysis, categorization of continuous values
92Data
- Vessel specific
- Date, type, origin, destination, metric tons loaded/unloaded, amount invoiced, quality measure, etc.
- Economic indicators
- Consumer price index, date (monthly), industrial production index, etc.
93Data Transformation
- Data aggregated into 4 relational tables
- Total monthly visits, etc.
- Joined based on date
- Separated into training (1997-1998) and test (1999) sets
94Data Exploration
- Used clustering with Condorcet criterion
- IBM Intelligent Miner
- Identified relevant features
- Production import volume > 400,000 MT for product 1
- Area import volume > 250,000 MT for area 10
- Then used rule induction to characterize clusters
95Cluster Analysis Summary
- Unsupervised technique
- Similarity-based
- Can modify definition of similarity to meet needs
- Many techniques
- Partitional, hierarchical, probability-based, NN, GA, etc.
- Combined with some other descriptive technique
- Decision trees, rule induction, etc.
96Cluster Analysis Summary
- Issues
- Number of clusters
- Quality of clustering
- Meaning of clusters
- Clustering large data sets
- Applications