Title: ICS 278: Data Mining Lecture 9,10: Clustering Algorithms
1. ICS 278 Data Mining, Lectures 9 & 10: Clustering Algorithms
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Project Progress Report
- Written Progress Report
  - Due Tuesday May 18th in class
  - Expect at least 3 pages (should be typed, not handwritten)
  - Hand in the written document in class on Tuesday May 18th
- 1 PowerPoint slide
  - 1 slide that describes your project
  - Should contain:
    - Your name (top right corner)
    - Clear description of the main task
    - Some visual graphic of data relevant to your task
    - 1 or 2 bullets on what methods you plan to use
    - Preliminary results or results of exploratory data analysis
    - Make it graphical (use text sparingly)
  - Submit by 12 noon Monday May 17th
3. List of Sections for your Progress Report
- Clear description of the task (reuse original proposal if needed)
  - Basic task, plus extended/bonus tasks (if time allows)
- Discussion of relevant literature
  - Discuss prior published/related work (if it exists)
- Preliminary data evaluation
  - Exploratory data analysis relevant to your task
  - Include as many plots/graphs as you think are useful/relevant
- Preliminary algorithm work
  - Summary of your progress on algorithm implementation so far
  - If you are not at this point yet, say so
  - Relevant information about other code/algorithms you have downloaded, done some preliminary testing on, etc.
  - Difficulties encountered so far
- Plans for the remainder of the quarter
  - Algorithm implementation
  - Experimental methods
4. Clustering
- Automated detection of group structure in data
  - Typically partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
  - Descriptive technique (contrast with predictive)
  - For real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
5. Clustering
6. Why is Clustering useful?
- Discovery of new knowledge from data
  - Contrast with supervised classification (where labels are known)
  - Long history in the sciences of categories, taxonomies, etc.
- Can be very useful for summarizing large data sets
  - For large n and/or high dimensionality
- Applications of clustering
  - Discovery of new types of galaxies in astronomical data
  - Clustering of genes with similar expression profiles
  - Clustering pixels in an image into regions of similar intensity
  - Segmentation of customers for an e-commerce store
  - Clustering of documents produced by a search engine
  - ... many more
7. General Issues in Clustering
- Representation
  - What types of clusters are we looking for?
- Score
  - The criterion used to compare one clustering to another
- Optimization
  - Generally, finding the optimal clustering is NP-hard
  - Greedy algorithms to optimize the score are widely used
- Other issues
  - The distance function D(x(i), x(j)) is a critical aspect of clustering, both for
    - distances between pairs of objects
    - distances of objects from clusters
  - How is K selected?
  - Different types of data
    - Real-valued versus categorical
    - Attribute-valued vectors vs. an n x n distance matrix
8. General Families of Clustering Algorithms
- Partition-based clustering
  - e.g., K-means
- Probabilistic model-based clustering
  - e.g., mixture models
- (Both of the above work with measurement data, e.g., feature vectors)
- Hierarchical clustering
  - e.g., hierarchical agglomerative clustering
- Graph-based clustering
  - e.g., min-cut algorithms
- (Both of the above work with distance data, e.g., a distance matrix)
9. Partition-Based Clustering
- Given: n data points X = {x(1), ..., x(n)}
- Output: k partitions C = {C1, ..., CK} such that
  - each x(i) is assigned to a unique Cj (hard assignment)
  - C implicitly represents a mapping from X to C
- Optimization algorithm
  - require that score[C, X] is maximized
    - e.g., sum-of-squares of within-cluster distances
  - exhaustive search is intractable
    - combinatorial optimization to assign n objects to k classes
    - large search space: k^n possible assignment choices
  - so, use a greedy iterative method
    - will be subject to local maxima
10. Score Function for Partition-Based Clustering
- want compact clusters
  - minimize within-cluster distances wc(C)
- want different clusters far apart
  - maximize between-cluster distances bc(C)
- given a cluster partitioning C, find centers c1, ..., ck
  - e.g., for vectors, use the centroids of the points in cluster Ck:
    c_k = (1/n_k) Σ_{x ∈ Ck} x
- wc(C) = sum-of-squares within-cluster distance
  - wc(C) = Σ_{i=1..k} wc(Ci) = Σ_{i=1..k} Σ_{x ∈ Ci} d(x, c_i)^2
- bc(C) = distance between clusters
  - bc(C) = Σ_{i,j=1..k} d(c_i, c_j)^2
- Score[C, X] = f( wc(C), bc(C) )  (a small computational sketch follows below)
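A minimal sketch of these two score components for real-valued data, assuming NumPy arrays, squared Euclidean distance, and integer cluster labels 0..K-1; the function name cluster_scores is an illustrative choice, not part of the lecture.

```python
import numpy as np

def cluster_scores(X, labels):
    """Compute wc(C) and bc(C) for a hard partition of the rows of X.

    X      : (n, p) array of real-valued vectors
    labels : (n,)   integer array of cluster indices 0..K-1
    """
    K = labels.max() + 1
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # wc(C): sum of squared distances of points to their own centroid
    wc = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))

    # bc(C): sum of squared distances between pairs of centroids
    bc = sum(((centers[i] - centers[j]) ** 2).sum()
             for i in range(K) for j in range(i + 1, K))
    return wc, bc
```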
11. K-means Clustering
- Basic idea (sketched in code below)
  - Score = wc(C) = sum-of-squares within-cluster distance
  - start with randomly chosen cluster centers c1, ..., ck
  - repeat until no cluster memberships change:
    - assign each point x to the cluster with the nearest center
      - i.e., find the smallest d(x, c_i) over all c1, ..., ck
    - recompute the cluster centers over the data assigned to them
      - c_i = (1/n_i) Σ_{x ∈ Ci} x
- Algorithm terminates (finite number of steps)
  - decreases Score(X, C) on each iteration in which memberships change
- Converges to a local optimum of Score(X, C)
  - not necessarily the global optimum
  - different initial centers (seeds) can lead to different local optima
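A minimal NumPy sketch of this loop, assuming real-valued vectors and squared Euclidean distance; the function name kmeans and the handling of empty clusters are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random seed centers
    for _ in range(n_iter):
        # assign each point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the centroid of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # no change: converged to a local optimum
        centers = new_centers
    return labels, centers
```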
12. K-means Complexity
- time complexity O(I e n k), much smaller than exhaustive search's k^n
  - I = number of iterations (steps)
  - e = cost of a distance computation (e = p for Euclidean distance)
- speed-up tricks (especially useful in early iterations)
  - use nearest x(i)'s as cluster centers instead of the mean
    - allows reuse of cached distances from the n x n distance matrix D (lowers effective e)
    - k-medoids: use one of the x(i)'s as the center when a mean is not defined
  - recompute centers as points are reassigned
    - useful for large n (like online neural nets); more cache efficient
  - PCA: reduce the effective e and/or fit more of X in RAM
  - condense: reduce n by replacing a group of points with a prototype
  - even more clever data structures (see work by Andrew Moore, CMU)
13. K-means example (courtesy of Andrew Moore, CMU)
14. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
15. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
16. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
17. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns
18. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns
- New Centers => new boundaries
- Repeat until no change!
19. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns ... and jumps there
- Repeat until terminated!
20. Accelerated Computations
Example generated by Pelleg and Moore's accelerated k-means: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available at www.autonlab.org/pap.html)
21-28. K-means continues (successive iterations of the same example)
29. K-means terminates
30. Image Clusters on color
K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments (courtesy of David Forsyth, UC Berkeley)
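A sketch of this kind of color segmentation, assuming scikit-learn's KMeans and an (h, w, 3) RGB image array; it illustrates the general idea rather than Forsyth's exact setup, and the helper name segment_by_color is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_by_color(image, k=11, seed=0):
    """Cluster RGB pixel intensities and repaint each pixel with its cluster mean."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)        # n x 3 matrix of (R, G, B)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    segmented = km.cluster_centers_[km.labels_]         # replace each pixel by its center color
    return segmented.reshape(h, w, 3).astype(image.dtype)
```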
31. Issues in K-means clustering
- Simple, but useful
  - tends to select compact "isotropic" cluster shapes
  - can be useful for initializing more complex methods
  - many algorithmic variations on the basic theme
- Choice of distance measure
  - Euclidean distance
  - Weighted Euclidean distance
  - Many others possible
- Selection of K
  - "scree diagram": plot SSE versus K, look for a "knee" (sketched below)
  - Limitation: there may not be any clear K value
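A small sketch of the scree/knee diagnostic, assuming scikit-learn and matplotlib; a fitted KMeans exposes the within-cluster SSE as inertia_, and the helper name scree_plot is an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def scree_plot(X, k_max=10):
    """Plot within-cluster SSE (inertia) against K and look for a 'knee'."""
    ks = range(1, k_max + 1)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), sse, marker='o')
    plt.xlabel('K (number of clusters)')
    plt.ylabel('within-cluster SSE')
    plt.show()
```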
32. Probabilistic Clustering: Mixture Models
- assume a probabilistic model for each component cluster
- mixture model: f(x) = Σ_{k=1..K} w_k f_k(x; θ_k)
  - where the w_k are the K mixing weights, with 0 ≤ w_k ≤ 1 and Σ_{k=1..K} w_k = 1
  - where the K components f_k(x; θ_k) can be
    - Gaussian
    - Poisson
    - exponential
    - ...
- Note
  - Assumes a model for the data (advantages and disadvantages)
  - Results in probabilistic memberships: p(cluster k | x)
33. Gaussian Mixture Models (GMM)
- the model for the k-th component is normal, N(μ_k, Σ_k)
  - often assume a diagonal covariance: Σ_jj = σ_j^2, Σ_ij = 0 for i ≠ j
  - or sometimes even simpler: Σ_jj = σ^2, Σ_ij = 0 for i ≠ j
- f(x) = Σ_{k=1..K} w_k f_k(x; θ_k), with θ_k = <μ_k, Σ_k> or <μ_k, σ_k>
- generative model (sketched below)
  - randomly choose a component, selected with probability w_k
  - generate x ~ N(μ_k, Σ_k)
  - note: μ_k and σ_k are both d-dimensional vectors
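A short sketch of this generative view: pick a component with probability w_k, then draw from that component's Gaussian. Assumes NumPy; weights, means, and covs are hypothetical caller-supplied parameters.

```python
import numpy as np

def sample_gmm(weights, means, covs, n, seed=0):
    """Draw n points from a GMM: choose component k with prob w_k, then x ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)   # component label for each draw
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return X, ks
```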
34. Learning Mixture Models from Data
- Score function: log-likelihood L(θ)
  - L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  - H = hidden variables (cluster memberships of each x)
  - L(θ) cannot be optimized directly
- EM Procedure (a minimal sketch follows below)
  - General technique for maximizing the log-likelihood with missing data
  - For mixtures:
    - E-step: compute memberships p(k | x) = w_k f_k(x; θ_k) / f(x)
    - M-step: pick a new θ to maximize the expected data log-likelihood
  - Iterate: guaranteed to climb to a (local) maximum of L(θ)
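A minimal EM sketch for a Gaussian mixture, assuming NumPy and SciPy, full covariances, and no numerical safeguards beyond a small ridge on each covariance; it is meant to show the E-step and M-step structure, not a production implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture with full covariances."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                              # mixing weights
    mu = X[rng.choice(n, K, replace=False)]              # random initial means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: membership probabilities p(k | x_i) ∝ w_k f_k(x_i; theta_k)
        dens = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                                for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the memberships
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(p)
    return w, mu, cov, resp
```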
35. The E (Expectation) Step
Given the current K clusters and parameters, and n data points:
E step: compute p(data point i is in group k)
36. The M (Maximization) Step
Given the n data points and their memberships:
M step: compute new parameters θ for the K clusters
37. Complexity of EM for mixtures
With K models and n data points, the complexity per iteration scales as O(n K f(p))
38. Comments on Mixtures and EM Learning
- Complexity of each EM iteration
  - Depends on the probabilistic model being used
  - e.g., for Gaussians, the E-step is O(nK), the M-step is O(Knp^2)
  - Sometimes the E-step or M-step is not closed form
    => can require numerical methods at each iteration
- K-means interpretation
  - Gaussian mixtures with isotropic (diagonal, equi-variance) Σ_k's
  - Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
- Generalizations
  - Mixtures of multinomials for text data
  - Mixtures of Markov chains for Web sequences
  - etc.
39-46. (No transcript: figure-only slides)
47. Selecting K in mixture models
- cannot just choose the K that maximizes the likelihood
  - the likelihood L(θ) is ALWAYS larger for larger K
- Model selection alternatives
  - 1) penalize complexity (sketched below)
    - e.g., BIC = L(θ) - (d/2) log n (Bayesian Information Criterion)
  - 2) Bayesian: compute posteriors p(k | data)
    - Can be tricky to compute for mixture models
  - 3) (cross-) validation: popular and practical
    - Score different models by log p(Xtest | θ)
    - split the data into train and validation sets
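A sketch of the penalized-complexity option using scikit-learn's GaussianMixture, whose bic() method follows the "lower is better" convention (a scaled negative of the penalized log-likelihood form above); k_max and the helper name select_k_by_bic are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10):
    """Fit mixtures for K = 1..k_max and keep the K with the lowest BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        bics.append(gmm.bic(X))        # sklearn convention: lower BIC is better
    best_k = int(np.argmin(bics)) + 1
    return best_k, bics
```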
48. Example of BIC Score for Red-Blood Cell Data
49. (No transcript: figure-only slide)
50. Hierarchical Clustering
- Representation: a tree of nested clusters
- Works from a distance matrix
  - advantage: the x's can be any type of object
  - disadvantage: computation
- two basic approaches
  - merge points (agglomerative)
  - divide superclusters (divisive)
- visualize both via dendrograms
  - shows the nesting structure
  - merges or splits = tree nodes
- Applications
  - e.g., clustering of gene expression data
  - Useful for seeing hierarchical structure, for relatively small data sets
51. (No transcript: figure-only slide)
52. Agglomerative Methods: Bottom-Up
- algorithm, based on the distance between clusters (a minimal sketch follows below):
  - for i = 1 to n, let Ci = {x(i)}  -- i.e., start with n singletons
  - while more than one cluster is left:
    - let Ci and Cj be the cluster pair with minimum distance dist[Ci, Cj]
    - merge them, via Ci = Ci ∪ Cj, and remove Cj
- time complexity O(n^2) to O(n^3)
  - n iterations (start with n clusters, end with 1 cluster)
  - 1st iteration O(n log n) to O(n^2) to find the nearest singleton pair
- space complexity O(n log n) to O(n^2)
  - accesses all/most distances between the x(i)'s during the build
- interpreting a large-n dendrogram is difficult anyway (like DTs)
  - large-n idea: use partition-based clusters at the leaves
53. Distances Between Clusters
- single link / nearest neighbor measure
  - D(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
  - can be outlier/noise sensitive
- complete link / furthest neighbor measure
  - D(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
- intermediates between those extremes
  - average link: D(Ci, Cj) = avg { d(x, y) : x ∈ Ci, y ∈ Cj }
  - centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the centroids
  - Ward's SSE measure (for vector data): the increase in within-cluster sum-of-squared-distances when Ci and Cj are merged (SSE of the merged cluster minus the SSE for Ci and for Cj)
- DM theme: try several, see which is most interesting (a comparison sketch follows below)
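These linkage choices are all available in SciPy; a small sketch comparing their dendrograms on the same data, assuming scipy and matplotlib (the helper name compare_linkages is illustrative).

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def compare_linkages(X, methods=('single', 'complete', 'average', 'centroid', 'ward')):
    """Build dendrograms for several between-cluster distance measures on the same data."""
    fig, axes = plt.subplots(1, len(methods), figsize=(4 * len(methods), 4))
    for ax, method in zip(axes, methods):
        Z = linkage(X, method=method)      # (n-1) x 4 merge table
        dendrogram(Z, ax=ax, no_labels=True)
        ax.set_title(method)
        ax.set_ylabel('merge distance')
    plt.tight_layout()
    plt.show()
```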
54Dendrogram Using Single-Link Method
notice that y scale ? x scale !
Old Faithful Eruption Duration vs Wait Data
Notice how single-link tends to chain.
dendrogram y-axis crossbars distance score
55. Dendrogram Using Ward's SSE Distance
Old Faithful Eruption Duration vs. Wait Data
More balanced than single-link.
56. Divisive Methods: Top-Down
- algorithm
  - begin with a single cluster containing all the data
  - split into components; repeat until the clusters are single points
- two major types
  - monothetic
    - split by one variable at a time -- restricts the choices/search space
    - analogous to DTs
  - polythetic
    - split by all variables at once -- the many choices make this difficult
- less commonly used than agglomerative methods
  - generally more computationally intensive
    - more choices in the search space
57. Spectral/Graph-based Clustering
58. Clustering non-vector objects
- E.g., sequences, images, documents, etc.
  - Can be of varying lengths and sizes
- Distance matrix approach (sketched below)
  - E.g., compute edit distances/transformations for pairs of sequences
  - Apply clustering (e.g., hierarchical) based on the distance matrix
  - However... does not scale well
- "Vectorization"
  - Represent each object as a vector
  - Cluster the resulting vectors using a vector-space algorithm
  - However... can lose (e.g., sequence) information by going to vector space
- Probabilistic model-based clustering
  - Treat as a mixture of (e.g.) stochastic finite state machines
  - Can naturally handle variable lengths
- Will discuss the application to Web session clustering later in the quarter
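A small sketch of the distance-matrix approach for variable-length sequences: compute pairwise edit distances, then hand the matrix to hierarchical clustering. Assumes NumPy/SciPy; the Levenshtein implementation and the choice of average link are illustrative, and the O(n^2) pairwise distance computation is why this does not scale well, as noted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Standard Levenshtein edit distance between two sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

def cluster_sequences(seqs, k):
    """Pairwise edit distances -> distance matrix -> average-link hierarchical clustering."""
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = edit_distance(seqs[i], seqs[j])
    Z = linkage(squareform(D), method='average')   # works directly from the distance matrix
    return fcluster(Z, t=k, criterion='maxclust')  # cut the tree into k clusters
```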
59. K-Means Clustering
- Task: Clustering
- Representation: Partition based on K centers
- Score Function: Within-cluster sum of squared errors
- Search/Optimization: Iterative greedy search
- Data Management: None specified
- Models, Parameters: K centers
60. Probabilistic Model-Based Clustering
- Task: Clustering
- Representation: Mixture of probability components
- Score Function: Log-likelihood
- Search/Optimization: EM (iterative)
- Data Management: None specified
- Models, Parameters: Probability model
61. Single-Link Hierarchical Clustering
- Task: Clustering
- Representation: Tree of nested groupings
- Score Function: No global score
- Search/Optimization: Iterative merging of nearest neighbors
- Data Management: None specified
- Models, Parameters: Dendrogram
62. Summary
- General comments
  - Many different approaches and algorithms
  - What type of cluster structure are you looking for?
  - Computational complexity may be an issue for large n
  - Dimensionality is also an issue
  - Validation is difficult, but the payoff can be large
- Chapter 9
  - Covers all of the clustering methods discussed here (except graph/spectral clustering)