Transcript and Presenter's Notes

Title: Data Mining


1
4. Clustering Methods
  • Concepts
  • Partitional (k-Means, k-Medoids)
  • Hierarchical (Agglomerative, Divisive, COBWEB)
  • Density-based (DBSCAN, CLIQUE)
  • Large size data (STING, BIRCH, CURE)

2
The Clustering Problem
  • The clustering problem is about grouping a set of
    data tuples into a number of clusters. Data in
    the same cluster are highly similar to each other
    and data in different clusters are highly
    different from each other.
  • About clusters
  • Inter-cluster distance → maximization
  • Intra-cluster distance → minimization
  • Clustering vs. classification
  • Which one is more difficult? Why?
  • There are various possible ways of clustering;
    which one is the best?

3
Different ways of representing clusters
  • Division with boundaries
  • Venn diagram or spheres
  • Probabilistic
  • Dendrograms
  • Trees
  • Rules

[Figure: a probabilistic representation - e.g. instance I1
belongs to clusters 1, 2, 3 with probabilities 0.5, 0.2, 0.3]
4
Major Categories of Algorithms
  • Partitioning: divide into k partitions (k fixed),
    then regroup to get a better clustering.
  • Hierarchical: divide into different numbers of
    partitions in layers - merge (bottom-up) or
    divide (top-down).
  • Density-based: continue to grow a cluster as long
    as the density of the cluster exceeds a threshold.
  • Grid-based: first divide the space into grids, then
    perform clustering on the grids.

5
k-Means
  • Algorithm (see the sketch below)
  • Given k
  • Randomly pick k instances as the initial centers
  • Assign each remaining instance to the closest of
    the k clusters
  • Recalculate the mean of each cluster
  • Repeat steps 3-4 until the means don't change
  • How good are the clusters?
  • Initial and final clusters
  • Within-cluster variation: Σ diff(x, mean)²
  • Why don't we consider inter-cluster distance?
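A minimal NumPy sketch of the algorithm above (not part of the original slides); the function name kmeans and its arguments are illustrative, and it assumes no cluster becomes empty during reassignment.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch: data is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k instances as the initial centers
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each instance to the closest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the mean of each cluster (assumes no cluster is empty)
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: repeat until the means don't change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Within-cluster variation: sum of squared differences to the cluster mean
    wcv = sum(((data[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, wcv
```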

6
Example
  • For simplicity, one-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • k-Means (verified in the snippet below)
  • Randomly select 5 and 6 as the initial centroids
  • => Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity = 0.5² + 0.5² + 1² + 1² = 2.5
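The arithmetic above can be checked mechanically; the snippet below is only a sanity check, not part of the original example, and it assumes the kmeans sketch from the previous slide is in scope.

```python
import numpy as np

objects = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
labels, centers, wcv = kmeans(objects, k=2, seed=1)
print(sorted(centers.ravel()))  # expected cluster means: 1.5 and 6.0
print(wcv)                      # expected aggregate dissimilarity: 2.5
```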

7
Discussions
  • Limitations
  • Means cannot be defined for categorical
    attributes
  • Choice of k
  • Sensitive to outliers
  • Crisp clustering
  • Variants of k-means exist
  • Using modes to deal with categorical attributes
  • How about distance measures?
  • Is it similar to or different from k-NN?
  • With and without learning

8
k-Medoids
  • k-Means algorithm is sensitive to outliers
  • Is this true? How to prove it?
  • Medoid: the most centrally located point in a
    cluster, used as a representative point of the
    cluster.
  • In contrast, a centroid is not necessarily an
    actual point in the cluster.
  • An example

[Figure: initial medoids]
9
Partition Around Medoids
  • PAM
  • Given k
  • Randomly pick k instances as initial medoids
  • Assign each instance to the nearest medoid
  • Calculate the objective function
  • the sum of dissimilarities of all instances to
    their nearest medoids
  • Randomly select an instance y
  • Swap a medoid x with y if the swap reduces the
    objective function
  • Repeat steps 3-6 until no change (see the sketch
    below)
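A minimal sketch of PAM along the lines of the steps above (not from the slides); the function name pam and the swap budget max_swaps are illustrative, and a full implementation would loop until no swap improves the objective.

```python
import numpy as np

def pam(data, k, max_swaps=200, seed=0):
    """PAM sketch: data is an (n, d) array; returns medoid indices, labels and cost."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(len(data), size=k, replace=False))        # step 2
    cost = dist[:, medoids].min(axis=1).sum()                           # steps 3-4: objective
    for _ in range(max_swaps):                                          # step 7: repeat
        y = int(rng.integers(len(data)))                                # step 5: random candidate
        if y in medoids:
            continue
        for i in range(k):                                              # step 6: try swapping medoid i with y
            trial = medoids[:i] + [y] + medoids[i + 1:]
            trial_cost = dist[:, trial].min(axis=1).sum()
            if trial_cost < cost:                                       # keep the swap if it reduces the objective
                medoids, cost = trial, trial_cost
                break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost
```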

10
k-Means and k-Medoids
  • The key difference lies in how they update means
    or medoids
  • Both require distance calculation and
    reassignment of instances
  • Time complexity
  • Which one is more costly?
  • Dealing with outliers

[Figure: an outlier 100 units away]
11
EM (Expectation Maximization)
  • Moving away from crisp clusters as in k-Means by
    allowing an instance to belong to several
    clusters
  • Finite mixtures: a statistical clustering model
  • A mixture is a set of k probability
    distributions, representing k clusters
  • The simplest finite mixture: one feature with a
    Gaussian distribution per cluster
  • When k = 2, we need to estimate 5 parameters: two
    means (µA, µB), two standard deviations (σA, σB),
    and pA, where pB = 1 - pA
  • EM (see the sketch below)
  • Estimate the parameters from the instances
  • Maximize the overall likelihood that the data came
    from this mixture model
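A sketch of EM for the simplest finite mixture described above (one feature, two Gaussians); this is not from the slides, and it ignores numerical edge cases such as a collapsing variance.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100, seed=0):
    """Estimate the five parameters mu_A, sigma_A, mu_B, sigma_B, p_A (p_B = 1 - p_A)
    of a 1-D mixture of two Gaussians, using EM."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False).astype(float)   # initial means
    sigma = np.array([x.std(), x.std()])
    p = np.array([0.5, 0.5])                                  # mixing weights p_A, p_B
    for _ in range(n_iter):
        # E-step: soft assignment -- probability of each instance under each cluster
        dens = (p / (sigma * np.sqrt(2 * np.pi))) * \
               np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters to maximize the overall likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        p = nk / len(x)
    return mu, sigma, p
```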

12
Agglomerative
  • Each object is viewed as a cluster (bottom up).
  • Repeat until the number of clusters is small
    enough
  • Choose a closest pair of clusters
  • Merge the two into one
  • Defining "closest": centroid (mean of cluster)
    distance, (average) sum of pairwise distances, etc.
    (see the sketch below)
  • Refer to the Evaluation part
  • A dendrogram is a tree that shows the clustering
    process.
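A small sketch of the bottom-up loop (not from the slides); it uses centroid distance as one possible definition of "closest", and the brute-force pair search is written for clarity rather than efficiency.

```python
import numpy as np

def centroid_distance(A, B):
    """One way to define 'closest': distance between the cluster centroids."""
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def agglomerative(data, n_clusters, cluster_dist=centroid_distance):
    """Bottom-up clustering: start with one cluster per object and repeatedly
    merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]        # each object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):                # choose a closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(data[clusters[a]], data[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the two into one
        del clusters[b]
    return clusters
```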

13
Dendrogram
  • Cluster 1, 2, 4, 5, 6, 7 into two clusters
    (centroid distance); see the SciPy sketch below
  • [Figure: dendrogram over the points 1, 2, 4, 5, 6, 7]
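A sketch of this example using SciPy's hierarchical clustering (an assumption about tooling, not something the slides use); linkage with method='centroid' builds the merge history behind the dendrogram, and fcluster cuts it into two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0], [2.0], [4.0], [5.0], [6.0], [7.0]])  # the six 1-D points

Z = linkage(points, method='centroid')            # merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters
print(labels)  # expected grouping: {1, 2} in one cluster, {4, 5, 6, 7} in the other
```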

14
An example to show different Links
  • Single link
  • Merge the nearest clusters measured by the
    shortest edge between the two
  • (((A B) (C D)) E)
  • Complete link
  • Merge the nearest clusters measured by the
    longest edge between the two
  • (((A B) E) (C D))
  • Average link
  • Merge the nearest clusters measured by the
    average edge length between the two
  • (((A B) (C D)) E)

A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
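The three groupings above can be reproduced from the distance matrix with SciPy (an illustrative sketch, not part of the slides); squareform converts the matrix to the condensed form that linkage expects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix for the points A, B, C, D, E (rows/columns in that order)
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

condensed = squareform(D)  # condensed (upper-triangle) form expected by linkage
for method in ('single', 'complete', 'average'):
    Z = linkage(condensed, method=method)
    # Each row of Z records one merge; indices 0-4 are A-E, indices >= 5 are merged clusters
    print(method, Z[:, :2].astype(int).tolist())
```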
15
Divisive
  • All instances belong to one cluster (top-down)
  • To find an optimal division at each layer
    (especially the top one) is computationally
    prohibitive.
  • One heuristic method is based on the Minimum
    Spanning Tree (MST) algorithm (see the sketch below)
  • Connect all instances with an MST (O(N²))
  • Repeatedly cut the longest edges, one per
    iteration, until some stopping criterion is met or
    until one instance remains in each cluster.
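A sketch of the MST heuristic (not from the slides), using SciPy's graph routines; it assumes distinct instances, so all pairwise distances used as edge weights are nonzero.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_divisive(data, n_clusters):
    """Divisive clustering sketch: build an MST over all instances, then
    repeatedly cut the longest remaining edge until n_clusters components remain."""
    dist = squareform(pdist(data))                 # full pairwise distance matrix, O(N^2)
    mst = minimum_spanning_tree(dist).toarray()    # MST edges as a (directed) weight matrix
    for _ in range(n_clusters - 1):                # cut the longest edges one at a time
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0                            # removing the edge splits one component
    _, labels = connected_components(mst, directed=False)
    return labels
```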

16
COBWEB
  • Building a conceptual hierarchy incrementally
  • Each cluster has a probabilistic description
  • Category Utility (see the sketch below):
    Σk Σi Σj P(fi = vij) · P(fi = vij | ck) · P(ck | fi = vij)
  • summed over all categories ck, all features fi, and
    all feature values vij
  • It attempts to maximize both the probability that
    two objects in the same category have values in
    common and the probability that objects in
    different categories will have different property
    values
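A small sketch (not from the slides) that evaluates the sum above for nominal data, given a fixed assignment of instances to categories; COBWEB itself applies this kind of score incrementally while building the tree and normalizes it by the number of categories.

```python
from collections import Counter

def category_utility_sum(instances, assignment):
    """Evaluate sum_k sum_i sum_j P(fi=vij) * P(fi=vij|ck) * P(ck|fi=vij)
    for nominal instances (tuples of feature values) and a cluster assignment."""
    n = len(instances)
    total = 0.0
    for k in set(assignment):
        members = [x for x, c in zip(instances, assignment) if c == k]
        p_ck = len(members) / n                                   # P(ck)
        for i in range(len(instances[0])):                        # all features fi
            overall = Counter(x[i] for x in instances)            # counts of vij overall
            within = Counter(x[i] for x in members)               # counts of vij within ck
            for v, cnt in overall.items():                        # all feature values vij
                p_fv = cnt / n                                    # P(fi = vij)
                p_fv_ck = within.get(v, 0) / len(members)         # P(fi = vij | ck)
                p_ck_fv = p_fv_ck * p_ck / p_fv                   # Bayes: P(ck | fi = vij)
                total += p_fv * p_fv_ck * p_ck_fv
    return total
```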

17
A tree of clusters produced by COBWEB
18
  • Processing one instance at a time by choosing
    best among
  • Placing the instance in the best existing
    category
  • Adding a new category containing only the
    instance
  • Merging of two existing categories into a new one
    and adding the instance to that category
  • Splitting of an existing category into two and
    placing the instance in the best new resulting
    category

[Figure: the merge and split operators - merging two child
nodes under their parent, and splitting a parent node into
its children under the grandparent]
19
Cobweb demo: http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaae317b2bebb
20
Density-based
  • DBSCAN: Density-Based Spatial Clustering of
    Applications with Noise
  • It grows regions with sufficiently high density
    into clusters and can discover clusters of
    arbitrary shape in spatial databases with noise.
  • Many existing clustering algorithms find
    spherical shapes of clusters
  • DBSCAN defines a cluster as a maximal set of
    density-connected points.

21
  • Defining density and connection
  • ε-neighborhood of an object x (core object) (M,
    P, Q)
  • MinPts: the minimum number of objects required
    within the ε-neighborhood (say, 3)
  • directly density-reachable (Q from M, M from P)
  • density-reachable (Q from P, P not from Q) -
    asymmetric
  • density-connected (O, R, S) - symmetric (for
    border points)
  • What is the relationship between DR and DC?

22
  • Clustering with DBSCAN (see the sketch below)
  • Search for clusters by checking the
    ε-neighborhood of each instance x
  • If the ε-neighborhood of x contains at least
    MinPts points, create a new cluster with x as a
    core object
  • Iteratively collect directly density-reachable
    objects from these core objects and merge
    density-reachable clusters
  • Terminate when no new point can be added to any
    cluster
  • DBSCAN is sensitive to the density thresholds
    (ε, MinPts), but it is many times faster than
    CLARANS
  • Time complexity: O(N log N) if a spatial index is
    used, O(N²) otherwise
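A compact sketch of the procedure above (not the reference implementation); it precomputes all ε-neighborhoods, which costs O(N²) time and memory, i.e. it does not use a spatial index.

```python
import numpy as np

def dbscan(data, eps, min_pts):
    """DBSCAN sketch: grow clusters from core objects whose eps-neighborhood
    contains at least min_pts points; points left at -1 are noise."""
    n = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # eps-neighborhoods
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                               # already clustered, or not a core object
        labels[i] = cluster_id                     # create a new cluster with core object i
        queue = list(neighbors[i])
        while queue:                               # collect density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if len(neighbors[j]) >= min_pts:   # j is itself a core object: keep expanding
                    queue.extend(neighbors[j])
        cluster_id += 1
    return labels
```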

23
Dealing with Large Data
  • Key ideas
  • Reducing the number of instances to be
    maintained while still preserving the data
    distribution
  • Identifying relevant subspaces where clusters
    possibly exist
  • Using summarized information to avoid repeated
    data access
  • Sampling
  • CLARA (Clustering LARge Applications): works on
    samples instead of the whole data set
  • CLARANS (Clustering Large Applications based on
    RANdomized Search)

24
  • Grid: STING (STatistical INformation Grid)
  • Statistical parameters of higher-level cells can
    easily be computed from those of lower-level
    cells (see the sketch below)
  • Attribute-independent: count
  • Attribute-dependent: mean, standard deviation,
    min, max
  • Type of distribution: normal, uniform,
    exponential, or unknown
  • Irrelevant cells can be removed
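A sketch of how a parent cell's parameters could be derived from its children's (not STING's actual code); it assumes the stored standard deviations are population standard deviations and that the children are non-empty.

```python
import math

def merge_cells(children):
    """Combine statistical parameters of lower-level cells into their parent.
    Each child is a dict with keys: count, mean, std, min, max."""
    n = sum(c['count'] for c in children)                              # counts just add up
    mean = sum(c['count'] * c['mean'] for c in children) / n           # weighted mean
    # Recover each child's sum of squares from (count, mean, std), then recombine
    sum_sq = sum(c['count'] * (c['std'] ** 2 + c['mean'] ** 2) for c in children)
    std = math.sqrt(sum_sq / n - mean ** 2)
    return {'count': n, 'mean': mean, 'std': std,
            'min': min(c['min'] for c in children),
            'max': max(c['max'] for c in children)}
```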

25
Representatives
  • BIRCH: using Clustering Features (CF) and a CF tree
  • A clustering feature is a triplet summarizing a
    sub-cluster of instances: (N, LS, SS)
    (see the CF sketch below)
  • N: the number of instances; LS: linear sum; SS:
    square sum
  • Two parameters: branching factor (the max number
    of children per non-leaf node) and a diameter
    threshold
  • Two phases
  • Build an initial in-memory CF tree
  • Apply a clustering algorithm to cluster the leaf
    nodes of the CF tree
  • CURE (Clustering Using REpresentatives) is
    another example
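A small sketch of a clustering feature (not BIRCH's actual data structures); here SS is kept as a single scalar (the sum of squared norms), which is enough to derive a centroid and radius, and CFs can be merged by simple addition.

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) summarizing a sub-cluster of d-dimensional points."""
    def __init__(self, points):
        points = np.atleast_2d(np.asarray(points, dtype=float))
        self.N = len(points)                  # number of instances
        self.LS = points.sum(axis=0)          # linear sum (a d-dimensional vector)
        self.SS = float((points ** 2).sum())  # square sum (a scalar here)

    def merge(self, other):
        """CF additivity: merging two sub-clusters just adds the two triplets."""
        merged = CF(np.empty((0, len(self.LS))))
        merged.N = self.N + other.N
        merged.LS = self.LS + other.LS
        merged.SS = self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # Average distance of members to the centroid, derived from (N, LS, SS) alone
        return float(np.sqrt(max(self.SS / self.N - (self.centroid() ** 2).sum(), 0.0)))
```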

26
CF Tree
B = branching factor; L = threshold (max diameter of
sub-clusters at leaf nodes)
[Figure: a CF tree with B = 7 and L = 6. The root and
non-leaf nodes hold CF entries (CF1, CF2, CF3, CF5, ...)
with child pointers; leaf nodes hold CF entries (CF1,
CF2, ..., CF6) and are chained with prev/next pointers.]
27
  • Taking advantage of a property of density
  • If a region is dense in a higher-dimensional
    subspace, then its projections onto lower-dimensional
    subspaces are also dense
  • CLIQUE (CLustering In QUEst)
  • With high dimensional data, there are many void
    subspaces
  • Using the property identified, we can start with
    dense lower dimensional data
  • CLIQUE is a density-based method that can
    automatically find subspaces of the highest
    dimensionality such that high-density clusters
    exist in those subspaces

28
Drawbacks of Distance-Based Method
  • Drawbacks of square-error based clustering method
  • They consider only one point as the representative
    of a cluster
  • They are good only for convex-shaped clusters of
    similar size and density, and only if k can be
    reasonably estimated

29
Chameleon
  • A hierarchical Clustering Algorithm Using Dynamic
    Modeling
  • Observations on the weaknesses of pure
    distance-based methods
  • Basic steps
  • Build a k-nearest-neighbor graph
  • Partition the graph
  • Merge partitions that are strongly connected, based
    on the strength of the connections between
    partitions

30
Summary
  • There are many clustering algorithms
  • Good clustering algorithms maximize inter-cluster
    dissimilarity and intra-cluster similarity
  • Without prior knowledge, it is difficult to
    choose the best clustering algorithm.
  • Clustering is an important tool for outlier
    analysis.

31
Bibliography
  • I.H. Witten and E. Frank. Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations. Morgan Kaufmann, 2000.
  • M. Kantardzic. Data Mining: Concepts, Models,
    Methods, and Algorithms. IEEE, 2003.
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2001.
  • M. H. Dunham. Data Mining: Introductory and
    Advanced Topics.