ICS 278: Data Mining, Lecture 9,10: Clustering Algorithms

Transcript and Presenter's Notes



1
ICS 278: Data Mining
Lecture 9,10: Clustering Algorithms
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Project Progress Report
  • Written Progress Report
  • Due Tuesday May 18th in class
  • Expect at least 3 pages (typed, not handwritten)
  • Hand in written document in class on Tuesday May
    18th
  • 1 Powerpoint slide
  • 1 slide that describes your project
  • Should contain
  • Your name (top right corner)
  • Clear description of the main task
  • Some visual graphic of data relevant to your task
  • 1 bullet or 2 on what methods you plan to use
  • Preliminary results or results of exploratory
    data analysis
  • Make it graphical (use text sparingly)
  • Submit by 12 noon Monday May 17th

3
List of Sections for your Progress Report
  • Clear description of task (reuse original
    proposal if needed)
  • Basic task + extended/bonus tasks (if time allows)
  • Discussion of relevant literature
  • Discuss prior published/related work (if it
    exists)
  • Preliminary data evaluation
  • Exploratory data analysis relevant to your task
  • Include as many plots/graphs as you think are
    useful/relevant
  • Preliminary algorithm work
  • Summary of your progress on algorithm
    implementation so far
  • If you are not at this point yet, say so
  • Relevant information about other code/algorithms
    you have downloaded, done some preliminary testing
    on, etc.
  • Difficulties encountered so far
  • Plans for the remainder of the quarter
  • Algorithm implementation
  • Experimental methods

4
Clustering
  • automated detection of group structure in data
  • Typically partition N data points into K groups
    (clusters) such that the points in each group are
    more similar to each other than to points in
    other groups
  • descriptive technique (contrast with predictive)
  • for real-valued vectors, clusters can be thought
    of as clouds of points in p-dimensional space

5
Clustering
6
Why is Clustering useful?
  • Discovery of new knowledge from data
  • Contrast with supervised classification (where
    labels are known)
  • Long history in the sciences of categories,
    taxonomies, etc
  • Can be very useful for summarizing large data
    sets
  • For large n and/or high dimensionality
  • Applications of clustering
  • Discovery of new types of galaxies in
    astronomical data
  • Clustering of genes with similar expression
    profiles
  • Cluster pixels in an image into regions of
    similar intensity
  • Segmentation of customers for an e-commerce store
  • Clustering of documents produced by a search
    engine
  • ... many more

7
General Issues in Clustering
  • Representation
  • What types of clusters are we looking for?
  • Score
  • The criterion to compare one clustering to
    another
  • Optimization
  • Generally, finding the optimal clustering is
    NP-hard
  • Greedy algorithms to optimize score are widely
    used
  • Other issues
  • Distance function D(x(i), x(j)) is a critical aspect
    of clustering, for both
  • the distance between pairs of objects
  • the distance of objects from clusters
  • How is K selected?
  • Different types of data
  • Real-valued versus categorical
  • Attribute-valued vectors vs. an n x n distance matrix

8
General Families of Clustering Algorithms
  • partition-based clustering
  • e.g. K-means
  • probabilistic model-based clustering
  • e.g. mixture models
  • both of the above work with measurement data,
    e.g., feature vectors
  • hierarchical clustering
  • e.g. hierarchical agglomerative clustering
  • graph-based clustering
  • e.g., min-cut algorithms
  • both of the above work with distance data, e.g., a
    distance matrix

9
Partition-Based Clustering
  • given n data points X = {x(1), ..., x(n)}
  • output K partitions C = {C1, ..., CK} such that
  • each x(i) is assigned to unique Cj
    (hard-assignment)
  • C implicitly represents a mapping from X to C
  • Optimization algorithm
  • require that score[C, X] is maximized
  • e.g., sum-of-squares of within cluster distances
  • exhaustive search intractable
  • combinatorial optimization to assign n objects to
    k classes
  • large search space: possible assignment choices ~ K^n
  • so, use a greedy iterative method
  • will be subject to local maxima

10
Score Function for Partition-Based Clustering
  • want compact clusters
  • minimize within cluster distances wc(C)
  • want different clusters far apart
  • maximize between cluster distances bc(C)
  • given a cluster partitioning C, find centers c1, ..., cK
  • e.g., for vectors, use centroids of points in
    cluster Ci
  • ck = (1/nk) Σ_{x ∈ Ck} x
  • wc(C) = sum-of-squares within-cluster distance
  • wc(C) = Σ_{i=1}^{K} wc(Ci) = Σ_{i=1}^{K} Σ_{x ∈ Ci} d(x, ci)²
  • bc(C) = distance between clusters
  • bc(C) = Σ_{i,j=1}^{K} d(ci, cj)²
  • Score[C, X] = f(wc(C), bc(C))
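As a concrete illustration of these two score components, here is a minimal numpy sketch (my own variable names and layout, not code from the lecture): it computes wc(C) and bc(C) for a hard partition given by integer cluster labels.

```python
import numpy as np

def cluster_scores(X, labels):
    """wc(C) and bc(C) for a hard partition.

    X: (n, p) array of points; labels: (n,) integer array in 0..K-1.
    """
    K = labels.max() + 1
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # wc(C): sum over clusters of squared distances of points to their centroid
    wc = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
    # bc(C): sum over cluster pairs of squared distances between centroids
    bc = sum(((centers[i] - centers[j]) ** 2).sum()
             for i in range(K) for j in range(i + 1, K))
    return wc, bc
```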

11
K-means Clustering
  • basic idea
  • Score = wc(C) = sum-of-squares within-cluster distance
  • start with randomly chosen cluster centers c1, ..., cK
  • repeat until no cluster memberships change
  • assign each point x to the cluster with the nearest
    center
  • i.e., find the smallest d(x, ci) over all c1, ..., cK
  • recompute cluster centers over the data assigned to
    them
  • ci = (1/ni) Σ_{x ∈ Ci} x
  • algorithm terminates (finite number of steps)
  • decreases Score(X, C) on each iteration in which
    memberships change
  • converges to a local optimum of Score(X, C)
  • not necessarily the global optimum
  • different initial centers (seeds) can lead to
    different local optima
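A compact sketch of this loop in numpy (a toy implementation, not the course's code; it initializes with K random data points and leaves empty clusters unchanged):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start with K randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assign each point to the cluster with the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no cluster memberships changed
        labels = new_labels
        # recompute each center as the centroid of the points assigned to it
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```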

12
K-means Complexity
  • time complexity O(I·e·n·K) << exhaustive search's K^n
  • I = number of iterations (steps)
  • e = cost of one distance computation (e = p for
    Euclidean distance)
  • speed-up tricks (especially useful in early
    iterations)
  • use the nearest x(i)'s as cluster centers instead of
    the mean
  • reuse cached distances from the size-n² distance
    matrix D (lowers effective e)
  • k-medoids: use one of the x(i)'s as the center when
    the mean is not defined
  • recompute centers as points are reassigned
  • useful for large n (like online neural nets);
    more cache-efficient
  • PCA: reduce effective e and/or fit more of X in
    RAM
  • condense: reduce n by replacing a group with a
    prototype
  • even more clever data structures (see work by
    Andrew Moore, CMU)

13
K-means example (courtesy of Andrew Moore, CMU)
14
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)

15
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)
  • Randomly guess K cluster Center locations

16
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)
  • Randomly guess K cluster Center locations
  • Each datapoint finds out which Center it's
    closest to. (Thus each Center owns a set of
    datapoints)

17
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)
  • Randomly guess K cluster Center locations
  • Each datapoint finds out which Center it's
    closest to.
  • Each Center finds the centroid of the points it
    owns

18
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)
  • Randomly guess K cluster Center locations
  • Each datapoint finds out which Center it's
    closest to.
  • Each Center finds the centroid of the points it
    owns
  • New Centers => new boundaries
  • Repeat until no change!

19
K-means
  • Ask user how many clusters they'd like (e.g., K = 5)
  • Randomly guess K cluster Center locations
  • Each datapoint finds out which Center it's
    closest to.
  • Each Center finds the centroid of the points it
    owns
  • and jumps there
  • Repeat until terminated!

20
Accelerated Computations
Example generated by Pelleg and Moore's accelerated
k-means: Dan Pelleg and Andrew Moore, "Accelerating
Exact k-means Algorithms with Geometric Reasoning",
Proc. Conference on Knowledge Discovery in Databases,
1999 (KDD '99), available at www.autonlab.org/pap.html
21
K-means continues
22
K-means continues
23
K-means continues
24
K-means continues
25
K-means continues
26
K-means continues
27
K-means continues
28
K-means continues
29
K-means terminates
30
Image
Clusters on color
K-means clustering of RGB (3-value) pixel color
intensities, K = 11 segments (courtesy of David
Forsyth, UC Berkeley)
31
Issues in K-means clustering
  • Simple, but useful
  • tends to select compact isotropic cluster
    shapes
  • can be useful for initializing more complex
    methods
  • many algorithmic variations on the basic theme
  • Choice of distance measure
  • Euclidean distance
  • Weighted Euclidean distance
  • Many others possible
  • Selection of K
  • scree diagram: plot SSE versus K, look for a
    "knee"
  • Limitation: there may not be any clear K value
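A short sketch of such a scree/elbow plot using scikit-learn (the data here is a random stand-in; KMeans.inertia_ is its within-cluster SSE):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))        # stand-in data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]                                        # SSE for each K

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("within-cluster SSE")
plt.title("Scree plot: look for a knee")
plt.show()
```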

32
Probabilistic Clustering Mixture Models
  • assume a probabilistic model for each component
    cluster
  • mixture model: f(x) = Σ_{k=1}^{K} wk fk(x; θk)
  • where the wk are K mixing weights
  • with 0 ≤ wk ≤ 1 and Σ_{k=1}^{K} wk = 1
  • where the K components fk(x; θk) can be
  • Gaussian
  • Poisson
  • exponential
  • ...
  • Note
  • Assumes a model for the data (advantages and
    disadvantages)
  • Results in probabilistic membership p(cluster k | x)
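As a small worked example of the mixture density and the resulting membership probabilities (toy parameters of my choosing, two 1-D Gaussian components):

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.3, 0.7])           # mixing weights, sum to 1
mu = np.array([-2.0, 1.0])         # component means
sigma = np.array([0.5, 1.5])       # component standard deviations

def mixture_pdf(x):
    # f(x) = sum_k w_k f_k(x; theta_k)
    return sum(w[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(len(w)))

# probabilistic membership p(cluster k | x) at a single point x0
x0 = 0.0
post = np.array([w[k] * norm.pdf(x0, mu[k], sigma[k]) for k in range(len(w))])
post /= post.sum()
```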

33
Gaussian Mixture Models (GMM)
  • model for the k-th component is normal N(μk, Σk)
  • often assume diagonal covariance: Σjj = σj², Σij = 0
    for i ≠ j
  • or sometimes even simpler: Σjj = σ², Σij = 0
    for i ≠ j
  • f(x) = Σ_{k=1}^{K} wk fk(x; θk) with θk = <μk, Σk> or
    <μk, σk>
  • generative model
  • randomly choose a component
  • selected with probability wk
  • generate x ~ N(μk, Σk)
  • note: μk, σk are both d-dimensional vectors
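A minimal sketch of this generative model (isotropic components in 2-D; the parameters below are arbitrary toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])                          # mixing weights
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])   # component means
sigma = np.array([1.0, 0.5, 1.5])                      # isotropic std deviations

def sample_gmm(n):
    # choose a component with probability w_k, then draw x ~ N(mu_k, sigma_k^2 I)
    z = rng.choice(len(w), size=n, p=w)
    x = mu[z] + sigma[z, None] * rng.normal(size=(n, mu.shape[1]))
    return x, z

X, z = sample_gmm(500)
```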

34
Learning Mixture Models from Data
  • Score function = log-likelihood L(θ)
  • L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  • H = hidden variables (cluster memberships of each
    x)
  • L(θ) cannot be optimized directly
  • EM Procedure
  • General technique for maximizing log-likelihood
    with missing data
  • For mixtures
  • E-step: compute memberships p(k | x) = wk fk(x; θk) / f(x)
  • M-step: pick a new θ to maximize the expected data
    log-likelihood
  • Iterate: guaranteed to climb to a (local) maximum
    of L(θ)
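A compact sketch of these E and M steps for a 1-D Gaussian mixture (a toy implementation, not the course's code; it omits convergence checks and numerical safeguards):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                       # mixing weights
    mu = rng.choice(x, size=K, replace=False)     # initialize means at data points
    sd = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: memberships p(k | x_i) proportional to w_k N(x_i; mu_k, sd_k)
        r = np.array([w[k] * norm.pdf(x, mu[k], sd[k]) for k in range(K)]).T
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, std devs from the memberships
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd
```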

35
The E (Expectation) Step
Current K clusters and parameters
n data points
E step: Compute p(data point i is in group k)
36
The M (Maximization) Step
New parameters for the K clusters
n data points
M step: Compute θ, given the n data points and the
memberships
37
Complexity of EM for mixtures
K models
n data points
Complexity per iteration scales as O( n K f(p) )
38
Comments on Mixtures and EM Learning
  • Complexity of each EM iteration
  • Depends on the probabilistic model being used
  • e.g., for Gaussians, the E-step is O(nK), the M-step
    is O(Knp²)
  • Sometimes the E- or M-step has no closed form
  • => can require numerical methods at each
    iteration
  • K-means interpretation
  • Gaussian mixtures with isotropic (diagonal,
    equi-variance) Σk's
  • Approximate the E-step by choosing most likely
    cluster (instead of using membership
    probabilities)
  • Generalizations
  • Mixtures of multinomials for text data
  • Mixtures of Markov chains for Web sequences
  • etc

39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
Selecting K in mixture models
  • cannot just choose K that maximizes likelihood
  • Likelihood L(θ) is ALWAYS larger for larger K
  • Model selection alternatives
  • 1) penalize complexity
  • e.g., BIC = L(θ) - (d/2) log n (Bayesian
    information criterion)
  • 2) Bayesian: compute posteriors p(K | data)
  • Can be tricky to compute for mixture models
  • 3) (cross) validation popular and practical
  • Score different models by log p(X_test | θ)
  • split data into train and validate sets
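As an example of option 1, scikit-learn's GaussianMixture exposes a BIC score directly; note that its bic() follows the "-2 log-likelihood plus penalty" convention, so lower is better (the data below is a stand-in):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # stand-in data

bic = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bic[k] = gmm.bic(X)            # lower BIC = better complexity trade-off

best_k = min(bic, key=bic.get)
```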

48
Example of BIC Score for Red-Blood Cell Data
49
(No Transcript)
50
Hierarchical Clustering
  • Representation tree of nested clusters
  • Works from a distance matrix
  • advantage: the x's can be any type of object
  • disadvantage: computation
  • two basic approaches
  • merge points (agglomerative)
  • divide superclusters (divisive)
  • visualize both via dendrograms
  • shows nesting structure
  • merges or splits tree nodes
  • Applications
  • e.g., clustering of gene expression data
  • Useful for seeing hierarchical structure, for
  • relatively small data sets

51
(No Transcript)
52
Agglomerative Methods Bottom-Up
  • algorithm based on distance between clusters
  • for i = 1 to n, let Ci = {x(i)} -- i.e., start
    with n singletons
  • while more than one cluster is left
  • let Ci and Cj be the cluster pair with minimum
    distance dist[Ci, Cj]
  • merge them, via Ci = Ci ∪ Cj, and remove Cj
  • time complexity O(n²) to O(n³)
  • n iterations (start with n clusters, end with 1
    cluster)
  • 1st iteration O(n log n) to O(n²) to find the
    nearest singleton pair
  • space complexity O(n log n) to O(n²)
  • accesses all/most distances between the x(i)'s
    during the build
  • interpreting a large-n dendrogram is difficult
    anyway (like DTs)
  • large-n idea: partition-based clusters at the leaves
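In practice the merging loop above is usually delegated to a library; a minimal sketch with scipy (stand-in data, single link chosen arbitrarily):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 2))   # stand-in data

D = pdist(X)                          # condensed pairwise distance matrix
Z = linkage(D, method="single")       # sequence of agglomerative merges
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
# dendrogram(Z) draws the nested merge structure
```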

53
Distances Between Clusters
  • single link / nearest neighbor measure
  • D(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
  • can be outlier/noise sensitive
  • complete link / furthest-neighbor measure
  • D(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
  • intermediates between those extremes
  • average link: D(Ci, Cj) = avg { d(x, y) : x ∈ Ci, y ∈ Cj }
  • centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the
    centroids
  • Ward's SSE measure (for vector data)
  • within-cluster sum-of-squared distances for the
    merged cluster minus those for Ci and for Cj
  • DM theme: try several, see which is most
    interesting
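A small sketch of these between-cluster distance options, given two clusters as arrays of points (the function and argument names are mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(Ci, Cj, method="single"):
    d = cdist(Ci, Cj)                 # all pairwise point distances
    if method == "single":
        return d.min()                # nearest-neighbor pair
    if method == "complete":
        return d.max()                # furthest-neighbor pair
    if method == "average":
        return d.mean()               # average over all pairs
    if method == "centroid":
        return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
    raise ValueError(method)
```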

54
Dendrogram Using Single-Link Method
notice that the y scale ≠ the x scale!
Old Faithful Eruption Duration vs Wait Data
Notice how single-link tends to chain.
dendrogram: the y-axis height of the crossbars gives the distance score at each merge
55
Dendrogram Using Ward's SSE Distance
More balanced than single-link.
Old Faithful Eruption Duration vs Wait Data
56
Divisive Methods Top-Down
  • algorithm
  • begin with single cluster containing all data
  • split into components, repeat until clusters are
    single points
  • two major types
  • monothetic
  • split by one variable at a time -- restricts the
    search space of choices
  • analogous to DTs
  • polythetic
  • splits by all variables at once -- the many choices
    make this difficult
  • less commonly used than agglomerative methods
  • generally more computationally intensive
  • more choices in search space

57
Spectral/Graph-based Clustering
58
Clustering non-vector objects
  • E.g., sequences, images, documents, etc
  • Can be of varying lengths, sizes
  • Distance matrix approach
  • E.g., compute edit distance/transformations for
    pairs of sequences
  • Apply clustering (e.g., hierarchical) based on
    distance matrix
  • However... does not scale well
  • Vectorization
  • Represent each object as a vector
  • Cluster resulting vectors using vector-space
    algorithm
  • However... can lose (e.g., sequence) information
    by going to vector space
  • Probabilistic model-based clustering
  • Treat as mixture of (e.g.) stochastic finite
    state machines
  • Can naturally handle variable lengths
  • Will discuss application to Web session
    clustering later in the quarter
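A toy sketch of the distance-matrix approach for sequences (edit distance computed by standard dynamic programming, then hierarchical clustering on the resulting matrix; the sequences below are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    # standard dynamic-programming (Levenshtein) edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cur[j] = min(prev[j] + 1,               # deletion
                         cur[j - 1] + 1,            # insertion
                         prev[j - 1] + (ca != cb))  # substitution / match
        prev = cur
    return prev[-1]

seqs = ["ACGTT", "ACGTA", "TTGCA", "TTGCC", "ACGT"]    # made-up sequences
n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = edit_distance(seqs[i], seqs[j])

Z = linkage(squareform(D), method="average")           # cluster from distances
labels = fcluster(Z, t=2, criterion="maxclust")
```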

59
K-Means Clustering
Task: Clustering
Representation: Partition based on K centers
Score Function: Within-cluster sum of squared errors
Search/Optimization: Iterative greedy search
Data Management: None specified
Models, Parameters: K centers
60
Probabilistic Model-Based Clustering
Task: Clustering
Representation: Mixture of probability components
Score Function: Log-likelihood
Search/Optimization: EM (iterative)
Data Management: None specified
Models, Parameters: Probability model
61
Single-Link Hierarchical Clustering
Task: Clustering
Representation: Tree of nested groupings
Score Function: No global score
Search/Optimization: Iterative merging of nearest neighbors
Data Management: None specified
Models, Parameters: Dendrogram
62
Summary
  • General comments
  • Many different approaches and algorithms
  • What type of cluster structure are you looking
    for?
  • Computational complexity may be an issue for
    large n
  • Dimensionality is also an issue
  • Validation is difficult but the payoff can be
    large.
  • Chapter 9
  • Covers all of the clustering methods discussed
    here (except graph/spectral clustering)