Title: Cluster Analysis
1. Cluster Analysis
- Craig A. Struble
- Department of Mathematics, Statistics, and Computer Science
- Marquette University
2. Overview
- Background
- Example: K-Means
- Dissimilarity
- Tangent: K-Nearest Neighbor
- Partitioning Methods
- Hierarchical Methods
- Probability-Based Methods
- Interpreting Clusters
3. Goals
- Explore different clustering techniques
- Understand distance measures
- Define new distance measures
- K-nearest neighbor classification and data imputation
- Interpret clustering results
- Applications of clustering
4. Background
- Clustering is the process of identifying natural groupings in the data
- Unsupervised learning technique
- No predefined class labels
- Classic text is Finding Groups in Data by Kaufman and Rousseeuw, 1990
- Several implementations
- R and Weka
5. Clustering
6. Visualizations
7. Example: K-Means
Algorithm KMeans
  Input:  k, the number of clusters
          t, the number of iterations
          data, the data
  Output: C, a set of k clusters
  cent = arbitrarily select k objects as initial centers
  for i = 1 to t do
    for each d in data do
      assign label x to d such that dist(d, cent_x) is minimized
    for x = 1 to k do
      cent_x = mean value of all data with label x
8. K-Means Example
- Use Euclidean distance, dist(x, y) = sqrt(sum over f = 1..l of (x_f - y_f)^2), where l is the number of dimensions
- Select (10, 4) and (5, 4) as the initial points
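A minimal R sketch of this kind of run, using the built-in kmeans() and passing the slide's two initial centers explicitly; the data points here are hypothetical, since the slide's data set is not reproduced:

  # hypothetical 2-D data; the slide's actual data set is not shown here
  x <- rbind(c(10, 4), c(11, 5), c(9, 3), c(5, 4), c(4, 5), c(6, 3))
  # K-Means started from the two initial centers on the slide
  fit <- kmeans(x, centers = rbind(c(10, 4), c(5, 4)), iter.max = 10)
  fit$cluster   # cluster label assigned to each point
  fit$centers   # final cluster centers (means)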
9. K-Means Clustering
(Plots of K-Means results with 3 clusters and with 2 clusters)
10. Cluster Centers
11. K-Means Summary
- Very simple algorithm
- Only works on data for which means can be calculated
- Continuous data
- O(knt) time complexity
- Circular (spherical) cluster shapes only
- Outliers can have a very negative impact
12. Outliers
13. Calculating Dissimilarity
- Many clustering algorithms use a dissimilarity matrix as input
- Instance x Instance
14. Continuous (Interval-scaled)
- Euclidean distance
- Manhattan (or city block) distance
- Minkowski distance
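For l attributes, the standard forms are: Euclidean d(i, j) = sqrt(sum_f (x_if - x_jf)^2), Manhattan d(i, j) = sum_f |x_if - x_jf|, and Minkowski d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q). A quick check in base R with dist(), on two hypothetical 2-D instances:

  x <- rbind(a = c(1, 2), b = c(4, 6))      # two hypothetical instances
  dist(x, method = "euclidean")             # sqrt((1-4)^2 + (2-6)^2) = 5
  dist(x, method = "manhattan")             # |1-4| + |2-6| = 7
  dist(x, method = "minkowski", p = 3)      # (|1-4|^3 + |2-6|^3)^(1/3)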
15. Boolean Attributes
- Contingency table of agreements between object i and object j:

                 Object j
                 true   false
  Object i true    q      r
           false   s      t

- Symmetric (trues and falses carry equal information): simple matching, d(i, j) = (r + s) / (q + r + s + t)
16. Boolean Attributes
- Asymmetric: true and false differ in importance
- Test for HIV positive
- Number of matches on false, t, is not important
- Jaccard coefficient: d(i, j) = (r + s) / (q + r + s)
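A small R sketch on two hypothetical boolean instances (not the data of the next slide's example): dist(..., method = "binary") gives the asymmetric, Jaccard-style dissimilarity, while cluster::daisy() lets you mark binary variables as symmetric or asymmetric:

  library(cluster)
  x <- rbind(i = c(1, 1, 0, 0, 0), j = c(1, 0, 1, 0, 0))   # hypothetical boolean instances
  dist(x, method = "binary")                   # asymmetric: (r + s) / (q + r + s) = 2/3
  d <- data.frame(f1 = c(1, 1), f2 = c(1, 0), f3 = c(0, 1), f4 = c(0, 0), f5 = c(0, 0))
  daisy(d, type = list(symm = 1:5))            # symmetric: simple matching = 2/5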
17. Boolean Example
18. Nominal Attributes
- Count the number of matches m
- The total number of attributes is p
- Dissimilarity: d(i, j) = (p - m) / p
19. Nominal Example
20. Ordinal Attributes
- Use the ordering to rank the attribute values
- e.g., none/1, some/2, full/3
- Replace rank r with a standardized value in [0, 1]: z = (r - 1) / (M - 1), where M is the number of ranks
- Calculate dissimilarity using the interval-scaled measures
21. Ratio Attributes
- Recall: meaningful zero point
- Generally, only positive values on a non-linear scale
- Often follow exponential laws in time
- Decay of radiation intensity: roughly A*e^(Bt) for some constants A and B
22. Options for Ratio Attributes
- Treat them as interval values
- Perform a log transformation
- Treat as ordinal values, transforming the values into rankings
23. Example
- Suppose we had calculated the radiation intensities of two samples after 10 seconds
- x1 = 8.0, x2 = 10.0
- Option 1 (Euclidean distance)
- diss(x1, x2) = 2.0
- Option 2 (log2 transformation)
- y1 = log2(8.0) = 3, y2 = log2(10.0) = 3.32
- diss(x1, x2) = 0.32
24. Example (cont.)
- Option 3 (Ordinal transformation)
- Assume that 8 has rank 4, 10 has rank 8, and there are 10 values in total
- y1 = (4 - 1) / (10 - 1) = 0.33
- y2 = (8 - 1) / (10 - 1) = 0.78
- diss(x1, x2) = 0.45
25. Combining Multiple Types
- Let xi be the ith instance
- Let xif be the fth attribute of xi
- Let δij(f) = 0 if xif or xjf is missing, 1 otherwise
- Let dij(f) be the distance measure for feature f of xi and xj
26. Combining Multiple Types
- If f is binary or nominal
- dij(f) = 0 if xif = xjf, 1 otherwise
- If f is interval-based
- dij(f) = |xif - xjf| / (max_h xhf - min_h xhf), where h runs over instances not missing f
- If f is ordinal or ratio-scaled, calculate ranks and z values, then treat as interval
- Overall: d(i, j) = sum over f of δij(f) dij(f) divided by sum over f of δij(f)
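A short R sketch of this combined, Gower-style dissimilarity on a small hypothetical mixed-type data frame; cluster::daisy() with metric = "gower" performs this kind of weighted combination and skips missing values via the δ weights:

  library(cluster)
  # hypothetical mixed-type data: interval, nominal, ordinal, asymmetric binary
  d <- data.frame(
    age   = c(25, 40, NA, 31),
    color = factor(c("red", "blue", "red", "green")),
    size  = ordered(c("none", "some", "full", "some"), levels = c("none", "some", "full")),
    hiv   = c(1, 0, 0, 1)
  )
  daisy(d, metric = "gower", type = list(asymm = "hiv"))   # instance x instance dissimilarity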
27. Example
28. Standardization
- When working with interval-scaled variables only, it is often necessary to standardize the values
- Avoid bias towards one attribute
- Calculate the mean mf of attribute f
- Calculate the mean absolute deviation: sf = (1/n) sum over i of |xif - mf|
- Calculate the z score for the attribute for each instance: zif = (xif - mf) / sf
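A minimal R sketch of this standardization on a hypothetical numeric matrix, using the mean absolute deviation (as on the slide) rather than the standard deviation:

  x <- cbind(age = c(25, 40, 35, 31), income = c(30000, 90000, 52000, 47000))  # hypothetical data
  m <- colMeans(x)                               # mean mf of each attribute
  s <- colMeans(abs(sweep(x, 2, m)))             # mean absolute deviation sf of each attribute
  z <- sweep(sweep(x, 2, m), 2, s, FUN = "/")    # z scores zif
  z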
29. Tangent: K-Nearest Neighbor (KNN)
- Simple classification technique
- Let y be a new instance to classify
- Find the K nearest labeled instances to y
- Classify y with the label appearing most frequently among them (see the sketch below)
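A brief R illustration on hypothetical data, using knn() from the class package (a tooling choice assumed here, not part of the slides):

  library(class)
  train <- rbind(c(1, 1), c(1, 2), c(8, 8), c(9, 7))   # hypothetical labeled instances
  labels <- factor(c("a", "a", "b", "b"))
  y <- rbind(c(2, 1), c(8, 9))                          # new instances to classify
  knn(train, y, cl = labels, k = 3)                     # majority label among the 3 nearest neighbors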
30. Data Imputation with KNN
- Use the KNN technique with the instance missing a value, as described previously
- Take the mean (median, mode, etc.) of the K nearest neighbors
- Use that value to fill in the missing value (a sketch follows)
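A minimal, hand-rolled R sketch of the steps above on hypothetical data (dedicated imputation packages exist; this just mirrors the slide):

  x <- rbind(c(1.0, 2.0), c(1.2, 2.1), c(5.0, 6.0), c(1.1, NA))   # last instance missing attribute 2
  target <- 4                                      # row with the missing value
  k <- 2
  complete <- which(!is.na(x[, 2]))                # instances with the attribute observed
  d <- abs(x[complete, 1] - x[target, 1])          # distances using only the observed attribute
  nn <- complete[order(d)][1:k]                    # K nearest complete neighbors
  x[target, 2] <- mean(x[nn, 2])                   # fill in with their mean
  x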
31. Example (Data Imputation)
32. Assessing Data Imputation
- Use the data imputation technique of choice
- Calculate the mean squared error
- Use a statistical test to identify a confidence range for the error
- Need to identify the distribution of the error (e.g., normal)
33. Partitioning Methods
- K-Means (already done)
- K-Medoids (PAM)
- CLARA
- Fuzzy Clustering
34. Clustering
35. K-Medoids
- K-Means is restricted to numeric attributes
- Have to calculate the average object
- A medoid is a representative object in a cluster
- The algorithm searches for K medoids in the set of objects
36. K-Medoids
37. PAM Algorithm
- PAM (Partitioning Around Medoids) consists of two phases
- BUILD: constructs an initial clustering
- SWAP: refines the clustering
- Goal: minimize the sum of dissimilarities to the K representative objects
- Mathematically equivalent to minimizing the average dissimilarity
38. PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
  // Select k representative objects that appear to minimize
  // dissimilarities
  selected = {}                              // empty set
  for x = 1 to k do
    maxgain = 0
    for each i in data - selected do
      gain = 0
      for each j in data - selected do
        // See if j is closer to i than to some previously selected object
        let Dj = min(diss(j, s) for each s in selected)
                 // take Dj to be a very large value when selected is empty,
                 // so the first object chosen minimizes its total dissimilarity
        let Cji = max(Dj - diss(j, i), 0)
        gain = gain + Cji                    // total up improvements from selecting i
      if gain > maxgain then
        maxgain = gain
        best = i
    selected = selected + {best}             // best representative object chosen
39. PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
  // Improve the partitioning; selected comes from the BUILD phase
  repeat
    minswap = 0
    for each i in selected do
      for each h in data - selected do
        swap = 0
        for each j in data - selected, j != h do
          let Dj = min(diss(j, s) for each s in selected)
          if Dj < diss(j, i) and Dj < diss(j, h) then
            swap = swap + 0                    // j is closer to some other medoid: no change
          else if diss(j, i) = Dj then         // j is currently closest to i
            let Ej = min(diss(j, s) for each s in selected - {i})
            if diss(j, h) < Ej then
              swap = swap + diss(j, h) - Dj    // j would move to h
            else
              swap = swap + Ej - Dj            // j would move to its second-closest medoid
          else                                 // j is closer to h than to its current medoid
            swap = swap + diss(j, h) - Dj
        if swap < minswap then
          minswap = swap
          besti = i
          besth = h
    if minswap < 0 then
      selected = selected - {besti} + {besth}  // perform the best swap
  until minswap >= 0
40. PAM Output (from R)
41. PAM Output (from R)
42. Silhouettes
- These plots give an intuitive sense of how good the clustering is
- Let diss(i, C) be the average dissimilarity between i and each element in cluster C
- Let A be the cluster instance i is in
- a(i) = diss(i, A)
- Let B != A be the cluster such that diss(i, B) is minimized
- b(i) = diss(i, B)
- The silhouette number is s(i) = (b(i) - a(i)) / max(a(i), b(i))
43. Silhouettes
- Let s(k) be the average silhouette width for k clusters
- The silhouette coefficient of a data set is the maximum of s(k) over k
- The k that maximizes this value is generally the best number of clusters to use
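A small R sketch of using the average silhouette width to pick k, reusing the hypothetical data from the pam() example above:

  library(cluster)
  x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(9, 8), c(8, 9))    # hypothetical data
  avg.width <- sapply(2:4, function(k) pam(x, k)$silinfo$avg.width)   # s(k) for k = 2..4
  best.k <- (2:4)[which.max(avg.width)]   # k with the largest average silhouette width
  best.k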
44. CLARA
- Clustering LARge Applications
- Modification of PAM
- Draw a sample from the large data set
- Size is typically 40 + 2k
- Use PAM to find the medoids of the sample
- Use the medoids to assign clusters to the entire data set
- Can take multiple samples and choose the best resulting clustering (see the sketch below)
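In R, clara() in the cluster package implements this scheme; a minimal sketch on hypothetical data:

  library(cluster)
  x <- matrix(rnorm(2000), ncol = 2)                         # hypothetical larger data set
  fit <- clara(x, k = 3, samples = 5, sampsize = 40 + 2*3)   # 5 samples of size 40 + 2k
  fit$medoids
  table(fit$clustering)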
45. Fuzzy Clustering (FANNY)
- The previous partitioning methods are hard
- Data can be in one and only one cluster
- Instead of saying in or out, give a percentage of membership
- This is the basis of fuzzy logic
46. Fuzzy Clustering (FANNY)
- The algorithm is a bit too complex to cover
- Idea: minimize the objective function
  sum over clusters v of ( sum over i, j of uiv^2 ujv^2 diss(i, j) ) / ( 2 sum over j of ujv^2 )
- where uiv is the unknown membership of object i to cluster v
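A minimal R sketch with fanny() from the cluster package, on hypothetical data; the membership matrix holds the uiv values:

  library(cluster)
  x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(9, 8), c(8, 9))   # hypothetical data
  fit <- fanny(x, k = 2)
  round(fit$membership, 2)   # degree of membership of each object in each cluster
  fit$clustering             # nearest "hard" cluster for each object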
47. FANNY Results
48. FANNY Results
49. Huh?
50. Whoa!
51. Hierarchical Methods
- Top-down vs. bottom-up
- Agglomerative Nesting (AGNES)
- Divisive Analysis (DIANA)
- BIRCH
52. Top-Down vs. Bottom-Up
- Top-down or divisive approaches split the whole data set into smaller pieces
- Bottom-up or agglomerative approaches combine individual elements
53. Agglomerative Nesting (AGNES)
- Combine clusters until one cluster is obtained
- Initially, each cluster contains one object
- At each step, select and merge the two most similar clusters
54. Cluster Dissimilarities
(Figure: the dissimilarity between clusters Q and R is based on the pairwise diss(i, j) values between objects i in Q and j in R; AGNES uses their average by default)
55. AGNES Results
56. AGNES Results
57. AGNES Modifications
- The dissimilarity between clusters can be defined differently (see the sketch below)
- Maximum dissimilarity between two objects
- Complete linkage
- Minimum dissimilarity between two objects
- Single linkage
- Centroid method
- Interval-scaled attributes
- Ward's method
- Interval-scaled attributes
- Error sum of squares of a cluster
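In R, agnes() from the cluster package exposes these choices through its method argument; a brief sketch on hypothetical data:

  library(cluster)
  x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(9, 8), c(8, 9))   # hypothetical data
  ag <- agnes(x, method = "average")     # default group-average linkage
  agnes(x, method = "complete")          # maximum dissimilarity (complete linkage)
  agnes(x, method = "single")            # minimum dissimilarity (single linkage)
  agnes(x, method = "ward")              # Ward's method
  plot(ag)                               # banner plot and dendrogram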
58. Divisive Analysis (DIANA)
- Calculate the diameter of each cluster Q
- Select the cluster Q with the largest diameter
- Split it into A and B
- Repeatedly select the object i in A that maximizes its average dissimilarity to the rest of A minus its average dissimilarity to B
- Move i from A to B if that maximum value is > 0
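The corresponding R call is diana() in the cluster package; a short sketch on the same hypothetical data:

  library(cluster)
  x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(9, 8), c(8, 9))   # hypothetical data
  di <- diana(x)
  di$dc                           # divisive coefficient
  cutree(as.hclust(di), k = 2)    # cut the divisive hierarchy into 2 clusters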
59. DIANA Results
60. DIANA Results
61. BIRCH
- Balanced Iterative Reducing and Clustering Using Hierarchies
- Mixes hierarchical clustering with other techniques
- Useful for large data sets, because the entire data set is not kept in memory
- Identifies and removes outliers from clustering
- Due to differing distribution of data
- Presentation assumes continuous data
62. Two Central Concepts
- A cluster feature (CF) is a triple CF = (N, LS, SS) summarizing information about a cluster
- where N is the number of points in the cluster, LS is the linear sum of the data points, and SS is the square sum of the data points
63. Two Central Concepts
- CFs contain enough information to calculate a variety of distance measures
- Addition of CFs accurately represents the CF of the merged cluster (see the sketch below)
- Memory- and time-efficient to maintain
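A tiny R sketch of the CF additivity property on hypothetical 2-D points: merging two clusters just adds their (N, LS, SS) triples:

  cf <- function(pts) list(N = nrow(pts), LS = colSums(pts), SS = sum(pts^2))   # cluster feature
  merge_cf <- function(a, b) list(N = a$N + b$N, LS = a$LS + b$LS, SS = a$SS + b$SS)
  p1 <- rbind(c(1, 2), c(2, 3))        # hypothetical cluster 1
  p2 <- rbind(c(10, 10), c(11, 9))     # hypothetical cluster 2
  identical(merge_cf(cf(p1), cf(p2)), cf(rbind(p1, p2)))   # TRUE: CF addition = CF of merged cluster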
64. Two Central Concepts
- A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T
(Figure: a CF tree with root entries CF1 ... CFn, level-1 entries CF11 ... CF1n beneath them, and the leaf-level clusters at the bottom)
65. Phase 1: Build CF Tree
- The CF tree is contained in memory and created dynamically
- Identify appropriate leaf: recursively descend the tree, following the closest child node
- Modify leaf: when the leaf is reached, add the new data item x to it. If the leaf then contains more than L entries, split the leaf. (The threshold T must also be satisfied.)
66. Phase 1: Build CF Tree
- Steps continued
- Modify path to leaf: update the CF of the parent. If the leaf split, add a new entry to the parent. If the parent violates B, split the parent node. Update parent nodes recursively.
- Merging refinement: find the non-leaf node Nj at which the splitting stopped. Find the two closest entries in Nj. If they are not the pair resulting from the split, merge them.
67. Phase 1: Comments
- The parameters B and L are a function of the page size P
- Splits are caused by P, not by the data distribution
- Hence, the refinement step
- Increasing T makes a smaller tree, but can hide outliers
- Change T if memory runs out (Phase 2)
68. Phase 3: Global Clustering
- Apply some global clustering technique to the leaf clusters in the CF tree
- Fast, because everything is in memory
- Accurate, because outliers are removed and the data is represented at a level of detail allowed by memory
- Less order-dependent, because leaves have data locality
69. Phase 4: Cluster Refinement
- Use the centroids of the clusters found in Phase 3
- Identify the centroid C closest to each data point
- Place the data point in the cluster represented by C
70. Probabilistic Methods
- These will be delayed until we cover probabilistic methods in general
- Mixture models
- COBWEB
71. Applications of Clustering
- Gene function identification
- Document clustering
- Modeling economic data
72. Gene Function Identification
- The genome is the blueprint defining an organism (DNA)
- Genes are inherited portions of the genome related to biological functions
- Proteins, non-coding RNA
- Given the collection of biological information, try to predict or identify the function of genes with unknown function
73. Gene Expression
- A gene is expressed if it is actively being transcribed (copied into RNA)
- The rate of expression is related to the rate of transcription
- Microarray experiments
74. Gene Expression Data
(Figure: expression matrix with clones along one axis and experiments along the other)
75. Clustering Gene Expression Data
- Identify genes with similar expression profiles
- Use clustering
- Identify the function of known genes in a cluster
- Assign that function to genes of unknown function in the same cluster
76. Clustering Gene Expression Data
(Figure: clustering of yeast genome expression data)
77. Document Classification
- Represent documents as vectors in a vector space
- Latent Semantic Indexing
- Cluster documents in this representation
- Describe/summarize/evaluate the clusters
- Label clusters with meaningful descriptions
78. Document Transformation
- Convert each document into table form (see the sketch below)
- Attributes are important words
- Values are the number of times each word appears
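A tiny R sketch of this transformation on two hypothetical documents, building a document-by-word count table:

  docs <- c(d1 = "clustering groups similar documents",
            d2 = "documents about clustering and classification")   # hypothetical documents
  words <- strsplit(tolower(docs), "[^a-z]+")
  vocab <- sort(unique(unlist(words)))
  tf <- t(sapply(words, function(w) table(factor(w, levels = vocab))))   # document x word counts
  tf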
79. Document Classification
- Could select a word as the classification label
- Identify it after clustering
- Look at the medoid or centroid and examine its characteristics
- Look at the number of times certain words appear
- Look at which words appear together
- Look at words that don't appear at all
80. Document Classification
- Once clusters are identified, label each document with a cluster label
- Use a classification technique to identify cluster relationships
- Decision trees, for example
- Other kinds of rule induction
81. Economic Modelling
- Nettleton et al. (2000)
- The objective is to identify how to make the Port of Barcelona the principal port of entry for merchandise
- Statistics, clustering, outlier analysis, categorization of continuous values
82. Data
- Vessel-specific
- Date, type, origin, destination, metric tons loaded/unloaded, amount invoiced, quality measure, etc.
- Economic indicators
- Consumer price index, date (monthly), industrial production index, etc.
83. Data Transformation
- Data aggregated into 4 relational tables
- Total monthly visits, etc.
- Joined based on date
- Separated into training (1997-1998) and test (1999) sets
84. Data Exploration
- Used clustering with the Condorcet criterion
- IBM Intelligent Miner
- Identified relevant features
- Production import volume > 400,000 MT for product 1
- Area import volume > 250,000 MT for area 10
- Then used rule induction to characterize clusters
85. Cluster Analysis Summarization
- Unsupervised technique
- Similarity-based
- Can modify the definition of similarity to meet needs
- Usually combined with some other descriptive technique
- Decision trees, rule induction, etc.