Title: ICS 278: Data Mining Lecture 9,10: Clustering Algorithms
1. ICS 278 Data Mining, Lectures 9 & 10: Clustering Algorithms
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Project Progress Report
- Written Progress Report
  - Due Tuesday May 18th in class
  - Expect at least 3 pages (should be typed, not handwritten)
  - Hand in the written document in class on Tuesday May 18th
- 1 PowerPoint slide
  - 1 slide that describes your project
  - Should contain:
    - Your name (top right corner)
    - Clear description of the main task
    - Some visual graphic of data relevant to your task
    - 1 or 2 bullets on what methods you plan to use
    - Preliminary results or results of exploratory data analysis
    - Make it graphical (use text sparingly)
  - Submit by 12 noon Monday May 17th
3. List of Sections for your Progress Report
- Clear description of the task (reuse original proposal if needed)
  - Basic task, plus extended/bonus tasks (if time allows)
- Discussion of relevant literature
  - Discuss prior published/related work (if it exists)
- Preliminary data evaluation
  - Exploratory data analysis relevant to your task
  - Include as many plots/graphs as you think are useful/relevant
- Preliminary algorithm work
  - Summary of your progress on algorithm implementation so far
  - If you are not at this point yet, say so
  - Relevant information about other code/algorithms you have downloaded, done some preliminary testing on, etc.
  - Difficulties encountered so far
- Plans for the remainder of the quarter
  - Algorithm implementation
  - Experimental methods
4. Clustering
- Automated detection of group structure in data
  - Typically partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
  - Descriptive technique (contrast with predictive)
  - For real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
5. Clustering
6. Why is Clustering useful?
- Discovery of new knowledge from data
  - Contrast with supervised classification (where labels are known)
  - Long history in the sciences of categories, taxonomies, etc.
- Can be very useful for summarizing large data sets
  - For large n and/or high dimensionality
- Applications of clustering
  - Discovery of new types of galaxies in astronomical data
  - Clustering of genes with similar expression profiles
  - Clustering pixels in an image into regions of similar intensity
  - Segmentation of customers for an e-commerce store
  - Clustering of documents produced by a search engine
  - ... many more
7. General Issues in Clustering
- Representation
  - What types of clusters are we looking for?
- Score
  - The criterion used to compare one clustering to another
- Optimization
  - Generally, finding the optimal clustering is NP-hard
  - Greedy algorithms to optimize the score are widely used
- Other issues
  - The distance function D(x(i), x(j)) is a critical aspect of clustering, both for
    - distances between pairs of objects
    - distances of objects from clusters
  - How is K selected?
  - Different types of data
    - Real-valued versus categorical
    - Attribute-valued vectors vs. an n x n distance matrix
8. General Families of Clustering Algorithms
- Partition-based clustering
  - e.g., K-means
- Probabilistic model-based clustering
  - e.g., mixture models
- (Both of the above work with measurement data, e.g., feature vectors)
- Hierarchical clustering
  - e.g., hierarchical agglomerative clustering
- Graph-based clustering
  - e.g., min-cut algorithms
- (Both of the above work with distance data, e.g., a distance matrix)
9. Partition-Based Clustering
- Given: n data points X = {x(1), ..., x(n)}
- Output: k partitions C = {C1, ..., CK} such that
  - each x(i) is assigned to a unique Cj (hard assignment)
  - C implicitly represents a mapping from X to C
- Optimization algorithm
  - require that score[C, X] is maximized
    - e.g., sum-of-squares of within-cluster distances
  - exhaustive search is intractable
    - combinatorial optimization to assign n objects to k classes
    - large search space: k^n possible assignment choices
  - so, use a greedy iterative method
    - will be subject to local maxima
10. Score Function for Partition-Based Clustering
- want compact clusters
  - minimize within-cluster distances wc(C)
- want different clusters far apart
  - maximize between-cluster distances bc(C)
- given a cluster partitioning C, find centers c1, ..., ck
  - e.g., for vectors, use the centroids of the points in cluster Ck:
    c_k = (1/n_k) Σ_{x ∈ Ck} x
- wc(C) = sum-of-squares within-cluster distance
  - wc(C) = Σ_{i=1..k} wc(Ci) = Σ_{i=1..k} Σ_{x ∈ Ci} d(x, c_i)^2
- bc(C) = distance between clusters
  - bc(C) = Σ_{i,j=1..k} d(c_i, c_j)^2
- Score[C, X] = f( wc(C), bc(C) )  (a small computational sketch follows below)
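A minimal sketch of these two score components for real-valued data, assuming NumPy arrays, squared Euclidean distance, and integer cluster labels 0..K-1; the function name cluster_scores is an illustrative choice, not part of the lecture.

```python
import numpy as np

def cluster_scores(X, labels):
    """Compute wc(C) and bc(C) for a hard partition of the rows of X.

    X      : (n, p) array of real-valued vectors
    labels : (n,)   integer array of cluster indices 0..K-1
    """
    K = labels.max() + 1
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # wc(C): sum of squared distances of points to their own centroid
    wc = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))

    # bc(C): sum of squared distances between pairs of centroids
    bc = sum(((centers[i] - centers[j]) ** 2).sum()
             for i in range(K) for j in range(i + 1, K))
    return wc, bc
```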
11. K-means Clustering
- Basic idea (sketched in code below)
  - Score = wc(C) = sum-of-squares within-cluster distance
  - start with randomly chosen cluster centers c1, ..., ck
  - repeat until no cluster memberships change:
    - assign each point x to the cluster with the nearest center
      - i.e., find the smallest d(x, c_i) over all c1, ..., ck
    - recompute the cluster centers over the data assigned to them
      - c_i = (1/n_i) Σ_{x ∈ Ci} x
- Algorithm terminates (finite number of steps)
  - decreases Score(X, C) on each iteration in which memberships change
- Converges to a local optimum of Score(X, C)
  - not necessarily the global optimum
  - different initial centers (seeds) can lead to different local optima
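A minimal NumPy sketch of this loop, assuming real-valued vectors and squared Euclidean distance; the function name kmeans and the handling of empty clusters are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random seed centers
    for _ in range(n_iter):
        # assign each point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the centroid of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # no change: converged to a local optimum
        centers = new_centers
    return labels, centers
```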
12. K-means Complexity
- time complexity O(I e n k), much smaller than exhaustive search's k^n
  - I = number of iterations (steps)
  - e = cost of a distance computation (e = p for Euclidean distance)
- speed-up tricks (especially useful in early iterations)
  - use nearest x(i)'s as cluster centers instead of the mean
    - allows reuse of cached distances from the n x n distance matrix D (lowers effective e)
    - k-medoids: use one of the x(i)'s as the center when a mean is not defined
  - recompute centers as points are reassigned
    - useful for large n (like online neural nets); more cache efficient
  - PCA: reduce the effective e and/or fit more of X in RAM
  - condense: reduce n by replacing a group of points with a prototype
  - even more clever data structures (see work by Andrew Moore, CMU)
13. K-means example (courtesy of Andrew Moore, CMU)
14. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
15. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
16. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
17. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns
18. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns
- New Centers => new boundaries
- Repeat until no change!
19. K-means
- Ask user how many clusters they'd like (e.g., K = 5)
- Randomly guess K cluster Center locations
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns ... and jumps there
- Repeat until terminated!
20. Accelerated Computations
Example generated by Pelleg and Moore's accelerated k-means: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available at www.autonlab.org/pap.html)
21-28. K-means continues (successive iterations of the same example)
29. K-means terminates
30. Image Clusters on color
K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments (courtesy of David Forsyth, UC Berkeley)
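A sketch of this kind of color segmentation, assuming scikit-learn's KMeans and an (h, w, 3) RGB image array; it illustrates the general idea rather than Forsyth's exact setup, and the helper name segment_by_color is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_by_color(image, k=11, seed=0):
    """Cluster RGB pixel intensities and repaint each pixel with its cluster mean."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)        # n x 3 matrix of (R, G, B)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    segmented = km.cluster_centers_[km.labels_]         # replace each pixel by its center color
    return segmented.reshape(h, w, 3).astype(image.dtype)
```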
31. Issues in K-means clustering
- Simple, but useful
  - tends to select compact "isotropic" cluster shapes
  - can be useful for initializing more complex methods
  - many algorithmic variations on the basic theme
- Choice of distance measure
  - Euclidean distance
  - Weighted Euclidean distance
  - Many others possible
- Selection of K
  - "scree diagram": plot SSE versus K, look for a "knee" (sketched below)
  - Limitation: there may not be any clear K value
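A small sketch of the scree/knee diagnostic, assuming scikit-learn and matplotlib; a fitted KMeans exposes the within-cluster SSE as inertia_, and the helper name scree_plot is an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def scree_plot(X, k_max=10):
    """Plot within-cluster SSE (inertia) against K and look for a 'knee'."""
    ks = range(1, k_max + 1)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), sse, marker='o')
    plt.xlabel('K (number of clusters)')
    plt.ylabel('within-cluster SSE')
    plt.show()
```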
32. Probabilistic Clustering: Mixture Models
- assume a probabilistic model for each component cluster
- mixture model: f(x) = Σ_{k=1..K} w_k f_k(x; θ_k)
  - where the w_k are the K mixing weights, with 0 ≤ w_k ≤ 1 and Σ_{k=1..K} w_k = 1
  - where the K components f_k(x; θ_k) can be
    - Gaussian
    - Poisson
    - exponential
    - ...
- Note
  - Assumes a model for the data (advantages and disadvantages)
  - Results in probabilistic memberships: p(cluster k | x)
33. Gaussian Mixture Models (GMM)
- the model for the k-th component is normal, N(μ_k, Σ_k)
  - often assume a diagonal covariance: Σ_jj = σ_j^2, Σ_ij = 0 for i ≠ j
  - or sometimes even simpler: Σ_jj = σ^2, Σ_ij = 0 for i ≠ j
- f(x) = Σ_{k=1..K} w_k f_k(x; θ_k), with θ_k = <μ_k, Σ_k> or <μ_k, σ_k>
- generative model (sketched below)
  - randomly choose a component, selected with probability w_k
  - generate x ~ N(μ_k, Σ_k)
  - note: μ_k and σ_k are both d-dimensional vectors
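A short sketch of this generative view: pick a component with probability w_k, then draw from that component's Gaussian. Assumes NumPy; weights, means, and covs are hypothetical caller-supplied parameters.

```python
import numpy as np

def sample_gmm(weights, means, covs, n, seed=0):
    """Draw n points from a GMM: choose component k with prob w_k, then x ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)   # component label for each draw
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return X, ks
```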
34. Learning Mixture Models from Data
- Score function: log-likelihood L(θ)
  - L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  - H = hidden variables (cluster memberships of each x)
  - L(θ) cannot be optimized directly
- EM Procedure (a minimal sketch follows below)
  - General technique for maximizing the log-likelihood with missing data
  - For mixtures:
    - E-step: compute memberships p(k | x) = w_k f_k(x; θ_k) / f(x)
    - M-step: pick a new θ to maximize the expected data log-likelihood
  - Iterate: guaranteed to climb to a (local) maximum of L(θ)
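A minimal EM sketch for a Gaussian mixture, assuming NumPy and SciPy, full covariances, and no numerical safeguards beyond a small ridge on each covariance; it is meant to show the E-step and M-step structure, not a production implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture with full covariances."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                              # mixing weights
    mu = X[rng.choice(n, K, replace=False)]              # random initial means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: membership probabilities p(k | x_i) ∝ w_k f_k(x_i; theta_k)
        dens = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                                for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the memberships
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(p)
    return w, mu, cov, resp
```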
35. The E (Expectation) Step
Given the current K clusters and parameters, and n data points:
E step: compute p(data point i is in group k)
36. The M (Maximization) Step
Given the n data points and their memberships:
M step: compute new parameters θ for the K clusters
37. Complexity of EM for mixtures
With K models and n data points, the complexity per iteration scales as O(n K f(p))
38. Comments on Mixtures and EM Learning
- Complexity of each EM iteration
  - Depends on the probabilistic model being used
  - e.g., for Gaussians, the E-step is O(nK), the M-step is O(Knp^2)
  - Sometimes the E-step or M-step is not closed form
    => can require numerical methods at each iteration
- K-means interpretation
  - Gaussian mixtures with isotropic (diagonal, equi-variance) Σ_k's
  - Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
- Generalizations
  - Mixtures of multinomials for text data
  - Mixtures of Markov chains for Web sequences
  - etc.
39-46. (No transcript: figure-only slides)
47. Selecting K in mixture models
- cannot just choose the K that maximizes the likelihood
  - the likelihood L(θ) is ALWAYS larger for larger K
- Model selection alternatives
  - 1) penalize complexity (sketched below)
    - e.g., BIC = L(θ) - (d/2) log n (Bayesian Information Criterion)
  - 2) Bayesian: compute posteriors p(k | data)
    - Can be tricky to compute for mixture models
  - 3) (cross-) validation: popular and practical
    - Score different models by log p(Xtest | θ)
    - split the data into train and validation sets
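A sketch of the penalized-complexity option using scikit-learn's GaussianMixture, whose bic() method follows the "lower is better" convention (a scaled negative of the penalized log-likelihood form above); k_max and the helper name select_k_by_bic are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10):
    """Fit mixtures for K = 1..k_max and keep the K with the lowest BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        bics.append(gmm.bic(X))        # sklearn convention: lower BIC is better
    best_k = int(np.argmin(bics)) + 1
    return best_k, bics
```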
48. Example of BIC Score for Red-Blood Cell Data
49. (No transcript: figure-only slide)
50. Hierarchical Clustering
- Representation: a tree of nested clusters
- Works from a distance matrix
  - advantage: the x's can be any type of object
  - disadvantage: computation
- two basic approaches
  - merge points (agglomerative)
  - divide superclusters (divisive)
- visualize both via dendrograms
  - shows the nesting structure
  - merges or splits = tree nodes
- Applications
  - e.g., clustering of gene expression data
  - Useful for seeing hierarchical structure, for relatively small data sets
51. (No transcript: figure-only slide)
52. Agglomerative Methods: Bottom-Up
- algorithm, based on the distance between clusters (a minimal sketch follows below):
  - for i = 1 to n, let Ci = {x(i)}  -- i.e., start with n singletons
  - while more than one cluster is left:
    - let Ci and Cj be the cluster pair with minimum distance dist[Ci, Cj]
    - merge them, via Ci = Ci ∪ Cj, and remove Cj
- time complexity O(n^2) to O(n^3)
  - n iterations (start with n clusters, end with 1 cluster)
  - 1st iteration O(n log n) to O(n^2) to find the nearest singleton pair
- space complexity O(n log n) to O(n^2)
  - accesses all/most distances between the x(i)'s during the build
- interpreting a large-n dendrogram is difficult anyway (like DTs)
  - large-n idea: use partition-based clusters at the leaves
53. Distances Between Clusters
- single link / nearest neighbor measure
  - D(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
  - can be outlier/noise sensitive
- complete link / furthest neighbor measure
  - D(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
- intermediates between those extremes
  - average link: D(Ci, Cj) = avg { d(x, y) : x ∈ Ci, y ∈ Cj }
  - centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the centroids
  - Ward's SSE measure (for vector data): the increase in within-cluster sum-of-squared-distances when Ci and Cj are merged (SSE of the merged cluster minus the SSE for Ci and for Cj)
- DM theme: try several, see which is most interesting (a comparison sketch follows below)
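These linkage choices are all available in SciPy; a small sketch comparing their dendrograms on the same data, assuming scipy and matplotlib (the helper name compare_linkages is illustrative).

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def compare_linkages(X, methods=('single', 'complete', 'average', 'centroid', 'ward')):
    """Build dendrograms for several between-cluster distance measures on the same data."""
    fig, axes = plt.subplots(1, len(methods), figsize=(4 * len(methods), 4))
    for ax, method in zip(axes, methods):
        Z = linkage(X, method=method)      # (n-1) x 4 merge table
        dendrogram(Z, ax=ax, no_labels=True)
        ax.set_title(method)
        ax.set_ylabel('merge distance')
    plt.tight_layout()
    plt.show()
```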
54Dendrogram Using Single-Link Method
notice that y scale ? x scale !
Old Faithful Eruption Duration vs Wait Data
Notice how single-link tends to chain.
dendrogram y-axis crossbars distance score
55. Dendrogram Using Ward's SSE Distance
Old Faithful Eruption Duration vs. Wait Data
More balanced than single-link.
56. Divisive Methods: Top-Down
- algorithm
  - begin with a single cluster containing all the data
  - split into components; repeat until the clusters are single points
- two major types
  - monothetic
    - split by one variable at a time -- restricts the choices/search space
    - analogous to DTs
  - polythetic
    - split by all variables at once -- the many choices make this difficult
- less commonly used than agglomerative methods
  - generally more computationally intensive
    - more choices in the search space
57. Spectral/Graph-based Clustering
58. Clustering non-vector objects
- E.g., sequences, images, documents, etc.
  - Can be of varying lengths and sizes
- Distance matrix approach (sketched below)
  - E.g., compute edit distances/transformations for pairs of sequences
  - Apply clustering (e.g., hierarchical) based on the distance matrix
  - However... does not scale well
- "Vectorization"
  - Represent each object as a vector
  - Cluster the resulting vectors using a vector-space algorithm
  - However... can lose (e.g., sequence) information by going to vector space
- Probabilistic model-based clustering
  - Treat as a mixture of (e.g.) stochastic finite state machines
  - Can naturally handle variable lengths
- Will discuss the application to Web session clustering later in the quarter
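A small sketch of the distance-matrix approach for variable-length sequences: compute pairwise edit distances, then hand the matrix to hierarchical clustering. Assumes NumPy/SciPy; the Levenshtein implementation and the choice of average link are illustrative, and the O(n^2) pairwise distance computation is why this does not scale well, as noted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Standard Levenshtein edit distance between two sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

def cluster_sequences(seqs, k):
    """Pairwise edit distances -> distance matrix -> average-link hierarchical clustering."""
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = edit_distance(seqs[i], seqs[j])
    Z = linkage(squareform(D), method='average')   # works directly from the distance matrix
    return fcluster(Z, t=k, criterion='maxclust')  # cut the tree into k clusters
```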
59. K-Means Clustering
- Task: Clustering
- Representation: Partition based on K centers
- Score Function: Within-cluster sum of squared errors
- Search/Optimization: Iterative greedy search
- Data Management: None specified
- Models, Parameters: K centers
60. Probabilistic Model-Based Clustering
- Task: Clustering
- Representation: Mixture of probability components
- Score Function: Log-likelihood
- Search/Optimization: EM (iterative)
- Data Management: None specified
- Models, Parameters: Probability model
61. Single-Link Hierarchical Clustering
- Task: Clustering
- Representation: Tree of nested groupings
- Score Function: No global score
- Search/Optimization: Iterative merging of nearest neighbors
- Data Management: None specified
- Models, Parameters: Dendrogram
62. Summary
- General comments
  - Many different approaches and algorithms
  - What type of cluster structure are you looking for?
  - Computational complexity may be an issue for large n
  - Dimensionality is also an issue
  - Validation is difficult, but the payoff can be large
- Chapter 9
  - Covers all of the clustering methods discussed here (except graph/spectral clustering)