Clustering - PowerPoint PPT Presentation

1
Clustering
  • Basic concepts with simple examples
  • Categories of clustering methods
  • Challenges

2
What is clustering?
  • The process of grouping a set of physical or
    abstract objects into classes of similar objects.
  • It is also called unsupervised learning.
  • It is a common and important task that finds many
    applications
  • Examples where we need clustering?

3
Clusters and representations
  • Examples of clusters
  • Different ways of representing clusters
  • Division with boundaries
  • Venn diagrams or spheres
  • Probabilistic
  • Dendrograms
  • Trees
  • Rules

(Figure: a probabilistic representation, e.g., instances I1, I2, ..., In assigned to clusters 1, 2, 3 with membership probabilities such as 0.5, 0.2, 0.3.)
4
Differences from Classification
  • How different?
  • Which one is more difficult as a learning
    problem?
  • Do we perform clustering in daily activities?
  • How do we cluster?
  • How to measure the results of clustering?
  • With/without class labels
  • Between classification and clustering
  • Semi-supervised clustering

5
Major clustering methods
  • Partitioning methods
  • k-Means (and EM), k-Medoids
  • Hierarchical methods
  • agglomerative, divisive, BIRCH
  • Similarity and dissimilarity of points in the
    same cluster and from different clusters
  • Distance measures between clusters
  • minimum, maximum
  • Means of clusters
  • Average between clusters

6
How to evaluate
  • Without labeled data, how can one tell whether a
    clustering result is good?
  • Basic or intuitive idea of clustering for
    clustered data points
  • Within a cluster: points should be similar (cohesion)
  • Between clusters: points should be dissimilar (separation)
  • The relationship between the two?
  • Evaluation methods
  • Labeled data: an additional assumption that
    instances in the same cluster are of the same class
  • Is it reasonable to use class labels in
    evaluation?
  • Unlabeled data: see below
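The silhouette coefficient is one common internal measure that combines exactly these two ideas (cohesion within a cluster, separation between clusters). A minimal sketch, not from the original slides, assuming scikit-learn is available:

    # Internal evaluation: the silhouette coefficient combines cohesion
    # (distances within a cluster) and separation (distances between clusters).
    # The 1-D data below is purely illustrative.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))   # close to 1 = tight, well-separated clusters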

7
Clustering -- Example 1
  • For simplicity, 1-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • k-means
  • Randomly select 5 and 6 as centroids
  • → Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • → {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • → no change.
  • Aggregate (squared) dissimilarity: 0.5² + 0.5² + 1² + 1² = 2.5
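The same run can be reproduced with a short script; this is a sketch of standard (Lloyd-style) k-means, not code from the slides:

    # k-means on the 1-D example: objects 1, 2, 5, 6, 7 with k = 2,
    # starting from the centroids 5 and 6 chosen above.
    import numpy as np

    points = np.array([1.0, 2.0, 5.0, 6.0, 7.0])
    centroids = np.array([5.0, 6.0])

    while True:
        # assign each point to its nearest centroid
        assign = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
        # recompute each centroid as the mean of its cluster
        updated = np.array([points[assign == j].mean() for j in range(len(centroids))])
        if np.allclose(updated, centroids):   # no change -> stop
            break
        centroids = updated

    sse = np.sum((points - centroids[assign]) ** 2)
    print(centroids, sse)   # centroids [1.5, 6.0], aggregate dissimilarity 2.5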

8
Issues with k-means
  • A heuristic method
  • Sensitive to outliers
  • How to prove it?
  • Determining k
  • Trial and error
  • X-means, PCA-based
  • Crisp clustering
  • EM, Fuzzy c-means
  • Should not be confused with k-NN
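One way to make the trial-and-error concrete is the "elbow" heuristic: run k-means for several values of k and watch where the total within-cluster error stops dropping sharply. A small sketch assuming scikit-learn (not part of the original slides):

    # Elbow heuristic: inertia_ is the sum of squared distances to the nearest centroid.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative data
    for k in range(1, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)   # pick the k where the drop in inertia levels off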

9
k-Medoids
  • Medoid: the most centrally located point in a
    cluster, as a representative point of the
    cluster.
  • In contrast, a centroid is not necessarily inside
    a cluster.
  • An example

(Figure: example data with the initial medoids marked.)
10
Partition Around Medoids
  • PAM
  1. Given k
  2. Randomly pick k instances as the initial medoids
  3. Assign each instance to the nearest medoid x
  4. Calculate the objective function: the sum of
     dissimilarities of all instances to their nearest
     medoids
  5. Randomly select an instance y
  6. Swap x with y if the swap reduces the objective
     function
  7. Repeat (3-6) until no change
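A compact sketch of this loop in Python (illustrative only; the function name pam and its defaults are made up here, and the code favors clarity over speed):

    # PAM sketch for 1-D data: swap a medoid x with a non-medoid y whenever the
    # swap reduces the total dissimilarity of instances to their nearest medoid.
    import numpy as np

    def pam(points, k, seed=0, max_iter=100):
        rng = np.random.default_rng(seed)
        dist = np.abs(points[:, None] - points[None, :])    # pairwise distances (1-D case)
        medoids = list(rng.choice(len(points), size=k, replace=False))  # random initial medoids

        def cost(meds):
            # objective function: sum of dissimilarities to the nearest medoid
            return dist[:, meds].min(axis=1).sum()

        for _ in range(max_iter):
            improved = False
            for x in list(medoids):                 # a current medoid x
                for y in range(len(points)):        # a candidate non-medoid y
                    if y in medoids:
                        continue
                    candidate = [y if m == x else m for m in medoids]
                    if cost(candidate) < cost(medoids):   # swap only if it reduces the objective
                        medoids, improved = candidate, True
            if not improved:                        # repeat until no change
                break
        labels = dist[:, medoids].argmin(axis=1)
        return [points[m] for m in medoids], labels

    meds, labels = pam(np.array([1.0, 2.0, 5.0, 6.0, 7.0]), k=2)
    print(meds, labels)   # typically medoids such as 2 and 6, assignment [0 0 1 1 1]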

11
k-Means and k-Medoids
  • The key difference lies in how they update means
    or medoids
  • k-medoids: pairwise comparisons between the k
    medoids and the (N-k) remaining instances
  • Both require distance calculation and
    reassignment of instances
  • Time complexity
  • Which one is more costly?
  • Dealing with outliers

(Figure: data with an outlier 100 units away.)
12
Agglomerative
  • Each object is viewed as a cluster (bottom up).
  • Repeat until the number of clusters is small
    enough
  • Choose a closest pair of clusters
  • Merge the two into one
  • Defining "closest": centroid (mean of cluster)
    distance, (average) sum of pairwise distances, ...
  • Refer to the Evaluation part
  • A dendrogram is a tree that shows the clustering
    process.

13
Clustering -- Example 2
  • For simplicity, we still use 1-dimensional
    objects.
  • Objects: 1, 2, 5, 6, 7
  • Agglomerative clustering: a very frequently used
    algorithm
  • How to cluster:
  • Find the two closest clusters and merge them
  • → merge 1 and 2, so we now have {1, 2}, {5}, {6}, {7}
    (centroids 1.5, 5, 6, 7)
  • → merge 5 and 6: {1, 2}, {5, 6}, {7} (centroids 1.5,
    5.5, 7)
  • → merge {5, 6} and 7: {1, 2}, {5, 6, 7}
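A plain-Python sketch of the same bottom-up process, merging the pair of clusters with the closest centroids at each step (illustrative, not code from the slides):

    # Agglomerative clustering of 1, 2, 5, 6, 7, stopping at two clusters.
    clusters = [[1.0], [2.0], [5.0], [6.0], [7.0]]   # start: every object is its own cluster

    def centroid(c):
        return sum(c) / len(c)

    while len(clusters) > 2:
        # choose the closest pair of clusters by centroid distance
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda p: abs(centroid(clusters[p[0]]) - centroid(clusters[p[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        print(clusters)   # {1,2} merges first, then {5,6}, then {5,6,7}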

14
Issues with dendrograms
  • How to find proper clusters
  • An alternative: divisive algorithms
  • Top down
  • Compared with bottom-up, which is more efficient?
  • What's the time complexity?
  • How to efficiently divide the data?
  • A heuristic: Minimum Spanning Tree (MST)
  • http://en.wikipedia.org/wiki/Minimum_spanning_tree
  • Time complexity: the fastest MST algorithms run in
    roughly O(e) time, where e is the number of edges
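A sketch of the MST heuristic, assuming SciPy is available: build the minimum spanning tree, cut its heaviest edges, and read the surviving connected components off as clusters (the data and k below are illustrative):

    # Divisive heuristic: cut the (k-1) heaviest edges of the minimum spanning tree;
    # the remaining connected components are the clusters.
    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative data
    k = 2

    dist = squareform(pdist(X))                      # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()      # MST as a weighted adjacency matrix

    edges = np.argwhere(mst > 0)
    order = np.argsort(mst[edges[:, 0], edges[:, 1]])
    for i, j in edges[order[-(k - 1):]]:             # delete the heaviest (k-1) edges
        mst[i, j] = 0

    n_clusters, labels = connected_components(mst, directed=False)
    print(n_clusters, labels)                        # 2 clusters: [0 0 1 1 1]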

15
Distance measures
  • Single link
  • Measured by the shortest edge between the two
    clusters
  • Complete link
  • Measured by the longest edge
  • Average link
  • Measured by the average edge length
  • An example is shown next.

16
An example to show different Links
  • Single link
  • Merge the nearest clusters measured by the
    shortest edge between the two
  • (((A B) (C D)) E)
  • Complete link
  • Merge the nearest clusters measured by the
    longest edge between the two
  • (((A B) E) (C D))
  • Average link
  • Merge the nearest clusters measured by the
    average edge length between the two
  • (((A B) (C D)) E)

Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
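The three results can be checked mechanically with SciPy's hierarchical clustering routines (a sketch; SciPy is an assumption, the slides do not name a library):

    # Single, complete, and average link on the distance matrix for A-E above.
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    D = np.array([[0, 1, 2, 2, 3],
                  [1, 0, 2, 4, 3],
                  [2, 2, 0, 1, 5],
                  [2, 4, 1, 0, 3],
                  [3, 3, 5, 3, 0]], dtype=float)     # order: A, B, C, D, E

    condensed = squareform(D)                        # condensed form expected by linkage()
    for method in ("single", "complete", "average"):
        Z = linkage(condensed, method=method)
        print(method)
        print(Z)   # each row: the two clusters merged and their merge distance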
17
Other Methods
  • Density-based methods
  • DBSCAN: a cluster is a maximal set of
    density-connected points
  • Core points are defined using an ε-neighborhood
    and MinPts
  • Directly density-reachable points (e.g., P and Q,
    Q and M), density-reachable points (P and M,
    assuming so are P and N), and density-connected
    points (any density-reachable points: P, Q, M, N)
    form clusters
  • Grid-based methods
  • STING: the lowest level is the original data
  • statistical parameters of higher-level cells are
    computed from the parameters of the lower-level
    cells (count, mean, standard deviation, min, max,
    distribution)
  • Model-based methods
  • Conceptual clustering: COBWEB
  • Category utility
  • Intraclass similarity
  • Interclass dissimilarity

18
Density-based
  • DBSCAN: Density-Based Spatial Clustering of
    Applications with Noise
  • It grows regions with sufficiently high density
    into clusters and can discover clusters of
    arbitrary shape in spatial databases with noise.
  • Many existing clustering algorithms only find
    spherical clusters
  • DBSCAN defines a cluster as a maximal set of
    density-connected points.
  • Density is defined by a neighborhood area and a
    number (#) of points

19
  • Defining density and connection
  • ε-neighborhood of an object x (core object) (M,
    P, O)
  • MinPts: the minimum number of objects within the
    ε-neighborhood (say, 3)
  • directly density-reachable (Q from M, M from P)
  • Only core objects are mutually density-reachable
  • density-reachable (Q from P, but P not from Q):
    asymmetric
  • density-connected (O, R, S): symmetric; applies to
    border points
  • What is the relationship between density-reachable
    and density-connected?

20
  • Clustering with DBSCAN
  • Search for clusters by checking the
    ε-neighborhood of each instance x
  • If the ε-neighborhood of x contains at least
    MinPts points, create a new cluster with x as a
    core object
  • Iteratively collect directly density-reachable
    objects from these core objects and merge
    density-reachable clusters
  • Terminate when no new point can be added to any
    cluster
  • DBSCAN is sensitive to the density thresholds
    (ε and MinPts), but it is fast
  • Time complexity: O(N log N) if a spatial index is
    used, O(N²) otherwise
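A minimal DBSCAN sketch using scikit-learn (the library is an assumption; the slides name only the algorithm). Here eps plays the role of ε, min_samples plays the role of MinPts, and points labeled -1 are noise:

    # DBSCAN on a tiny illustrative data set: two dense groups and one outlier.
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 0], [1.2, 0], [1.4, 0], [1.6, 0], [1.8, 0],   # dense group 1
                  [8.0, 0], [8.2, 0], [8.4, 0], [8.6, 0], [8.8, 0],   # dense group 2
                  [50.0, 0]])                                          # an outlier
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
    print(labels)   # [0 0 0 0 0 1 1 1 1 1 -1]: two clusters plus noise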

21
  • Grid: STING (STatistical INformation Grid)
  • Statistical parameters of higher-level cells can
    easily be computed from those of lower-level
    cells
  • Attribute-independent: count
  • Attribute-dependent: mean, standard deviation,
    min, max
  • Type of distribution: normal, uniform,
    exponential, or unknown
  • Irrelevant cells can be removed
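The key point is that a parent cell's statistics are computed from its children's statistics rather than from the raw points; a small illustrative sketch with made-up values:

    # Aggregate (count, mean, std) of child cells into their parent cell.
    import math

    children = [(10, 2.0, 1.0),   # (count, mean, std) of each child cell
                (20, 3.0, 0.5),
                (5,  8.0, 2.0),
                (15, 4.0, 1.5)]

    n = sum(c for c, _, _ in children)                    # count: attribute-independent
    mean = sum(c * m for c, m, _ in children) / n         # weighted mean of child means
    ex2 = sum(c * (s * s + m * m) for c, m, s in children) / n   # E[x^2] from std and mean
    std = math.sqrt(ex2 - mean * mean)
    print(n, round(mean, 3), round(std, 3))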

22
  • BIRCH: using Clustering Features (CF) and a CF tree
  • A clustering feature is a triple summarizing a
    sub-cluster of instances: (N, LS, SS)
  • N: the number of instances; LS: the linear sum;
    SS: the square sum
  • Two parameters: the branching factor (max number
    of children per non-leaf node) and a threshold on
    the size of the sub-clusters stored in leaf nodes
  • Two phases
  • Build an initial in-memory CF tree
  • Apply a clustering algorithm to cluster the leaf
    nodes of the CF tree
  • CURE (Clustering Using REpresentatives) is
    another example, allowing multiple representative
    points per cluster
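A sketch of why the CF triple is convenient: two sub-clusters can be merged by adding their CFs component-wise, and the centroid and radius can be derived without revisiting the raw instances (illustrative code, not BIRCH itself):

    # Clustering feature (N, LS, SS) of a sub-cluster, plus merge and derived statistics.
    import numpy as np

    def cf(points):
        pts = np.asarray(points, dtype=float)
        return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)   # N, linear sum, square sum

    def merge(a, b):
        return a[0] + b[0], a[1] + b[1], a[2] + b[2]               # CFs add component-wise

    def centroid(f):
        n, ls, _ = f
        return ls / n

    def radius(f):
        n, ls, ss = f
        # root-mean-square distance of the member points from the centroid
        return float(np.sqrt((ss / n - (ls / n) ** 2).sum()))

    a = cf([[1.0, 1.0], [2.0, 2.0]])
    b = cf([[3.0, 3.0]])
    m = merge(a, b)
    print(centroid(m), radius(m))   # centroid [2. 2.], radius ~1.155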

23
  • Taking advantage of the property of density
  • If the data is dense in a higher-dimensional
    subspace, it should also be dense in some
    lower-dimensional subspaces
  • How to use this property?
  • CLIQUE (CLustering In QUEst)
  • With high dimensional data, there are many void
    subspaces
  • Using this property, we can start the search from
    dense lower-dimensional subspaces
  • CLIQUE is a density-based method that can
    automatically find subspaces of the highest
    dimensionality such that high-density clusters
    exist in those subspaces

24
Chameleon
  • A hierarchical clustering algorithm using dynamic
    modeling
  • Observations on the weaknesses of CURE and ROCK
  • CURE: clustering using representatives
  • ROCK: clustering categorical attributes
  • Based on k-NN graphs and dynamic modeling

25
Graph-based clustering
  • Sparsification techniques keep the connections to
    the most similar (nearest) neighbors of a point
    while breaking the connections to less similar
    points.
  • The nearest neighbors of a point tend to belong
    to the same class as the point itself.
  • This reduces the impact of noise and outliers and
    sharpens the distinction between clusters.
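A one-call sketch of this kind of sparsification using scikit-learn's k-nearest-neighbor graph (the library choice is an assumption; the slides do not name one):

    # Keep only each point's 2 nearest neighbors in the similarity graph.
    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # illustrative 1-D data
    A = kneighbors_graph(X, n_neighbors=2, mode="connectivity", include_self=False)
    print(A.toarray())   # sparse adjacency: weaker connections have been dropped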

26
  • Neural networks
  • Self-organizing feature maps (SOMs)
  • Subspace clustering
  • CLIQUE: if a k-dimensional unit is dense, then so
    are its (k-1)-dimensional subspaces
  • More will be discussed later
  • Semi-supervised clustering
  • http://www.cs.utexas.edu/ml/publication/unsupervised.html
  • http://www.cs.utexas.edu/users/ml/risc/

27
Challenges
  • Scalability
  • Dealing with different types of attributes
  • Clusters with arbitrary shapes
  • Automatically determining input parameters
  • Dealing with noise (outliers)
  • Insensitivity to the order in which instances are
    presented to the learner
  • High dimensionality
  • Interpretability and usability