Transcript and Presenter's Notes

Title: Data Mining


1
4. Clustering Methods
  • Concepts
  • Partitional (k-Means, k-Medoids)
  • Hierarchical (Agglomerative, Divisive, COBWEB)
  • Density-based (DBSCAN, CLIQUE)
  • Large size data (STING, BIRCH, CURE)

2
The Clustering Problem
  • The clustering problem is about grouping a set of
    data tuples into a number of clusters. Data in
    the same cluster are highly similar to each other
    and data in different clusters are highly
    different from each other.
  • About clusters
  • Inter-cluster distance → maximization
  • Intra-cluster distance → minimization
  • Clustering vs. classification
  • Which one is more difficult? Why?
  • There are various possible ways of clustering;
    which one is the best?

3
Different ways of representing clusters
  • Division with boundaries
  • Venn diagram or spheres
  • Probabilistic
  • Dendrograms
  • Trees
  • Rules

[Figure: a probabilistic representation - e.g. instance I1
belongs to clusters 1, 2, 3 with probabilities 0.5, 0.2, 0.3]
4
Major Categories of Algorithms
  • Partitioning: divide into k partitions (k fixed),
    then regroup to get a better clustering.
  • Hierarchical: divide into different numbers of
    partitions in layers - merge (bottom-up) or
    divide (top-down).
  • Density-based: continue to grow a cluster as long
    as the density of the cluster exceeds a threshold.
  • Grid-based: first divide the space into grids, then
    perform clustering on the grids.

5
k-Means
  • Algorithm (see the sketch below)
  • Given k
  • Randomly pick k instances as the initial centers
  • Assign each remaining instance to the closest of
    the k clusters
  • Recalculate the mean of each cluster
  • Repeat steps 3-4 until the means don't change
  • How good are the clusters?
  • Initial and final clusters
  • Within-cluster variation: Σ diff(x, mean)²
  • Why don't we consider inter-cluster distance?
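A minimal NumPy sketch of the algorithm above (not part of the original slides); the function name kmeans and its arguments are illustrative, and it assumes no cluster becomes empty during reassignment.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch: data is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k instances as the initial centers
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each instance to the closest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the mean of each cluster (assumes no cluster is empty)
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: repeat until the means don't change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Within-cluster variation: sum of squared differences to the cluster mean
    wcv = sum(((data[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, wcv
```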

6
Example
  • For simplicity, one-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • k-Means (verified in the snippet below)
  • Randomly select 5 and 6 as the initial centroids
  • => Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity = 0.5² + 0.5² + 1² + 1² = 2.5
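The arithmetic above can be checked mechanically; the snippet below is only a sanity check, not part of the original example, and it assumes the kmeans sketch from the previous slide is in scope.

```python
import numpy as np

objects = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
labels, centers, wcv = kmeans(objects, k=2, seed=1)
print(sorted(centers.ravel()))  # expected cluster means: 1.5 and 6.0
print(wcv)                      # expected aggregate dissimilarity: 2.5
```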

7
Discussions
  • Limitations
  • Means cannot be defined for categorical
    attributes
  • Choice of k
  • Sensitive to outliers
  • Crisp clustering
  • Variants of k-means exist
  • Using modes to deal with categorical attributes
  • How about distance measures?
  • Is it similar to or different from k-NN?
  • With and without learning

8
k-Medoids
  • k-Means algorithm is sensitive to outliers
  • Is this true? How to prove it?
  • Medoid: the most centrally located point in a
    cluster, used as a representative point of the
    cluster.
  • In contrast, a centroid is not necessarily an
    actual point in the cluster.
  • An example

[Figure: initial medoids]
9
Partition Around Medoids
  • PAM
  • Given k
  • Randomly pick k instances as initial medoids
  • Assign each instance to the nearest medoid
  • Calculate the objective function
  • the sum of dissimilarities of all instances to
    their nearest medoids
  • Randomly select an instance y
  • Swap a medoid x with y if the swap reduces the
    objective function
  • Repeat steps 3-6 until no change (see the sketch
    below)
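A minimal sketch of PAM along the lines of the steps above (not from the slides); the function name pam and the swap budget max_swaps are illustrative, and a full implementation would loop until no swap improves the objective.

```python
import numpy as np

def pam(data, k, max_swaps=200, seed=0):
    """PAM sketch: data is an (n, d) array; returns medoid indices, labels and cost."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(len(data), size=k, replace=False))        # step 2
    cost = dist[:, medoids].min(axis=1).sum()                           # steps 3-4: objective
    for _ in range(max_swaps):                                          # step 7: repeat
        y = int(rng.integers(len(data)))                                # step 5: random candidate
        if y in medoids:
            continue
        for i in range(k):                                              # step 6: try swapping medoid i with y
            trial = medoids[:i] + [y] + medoids[i + 1:]
            trial_cost = dist[:, trial].min(axis=1).sum()
            if trial_cost < cost:                                       # keep the swap if it reduces the objective
                medoids, cost = trial, trial_cost
                break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost
```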

10
k-Means and k-Medoids
  • The key difference lies in how they update means
    or medoids
  • Both require distance calculation and
    reassignment of instances
  • Time complexity
  • Which one is more costly?
  • Dealing with outliers

[Figure: an outlier 100 units away]
11
EM (Expectation Maximization)
  • Moving away from crisp clusters as in k-Means by
    allowing an instance to belong to several
    clusters
  • Finite mixtures: a statistical clustering model
  • A mixture is a set of k probability
    distributions, representing k clusters
  • The simplest finite mixture: one feature with a
    Gaussian distribution per cluster
  • When k = 2, we need to estimate 5 parameters: two
    means (µA, µB), two standard deviations (σA, σB),
    and pA, where pB = 1 - pA
  • EM (see the sketch below)
  • Estimate the parameters from the instances
  • Maximize the overall likelihood that the data came
    from this mixture model
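A sketch of EM for the simplest finite mixture described above (one feature, two Gaussians); this is not from the slides, and it ignores numerical edge cases such as a collapsing variance.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100, seed=0):
    """Estimate the five parameters mu_A, sigma_A, mu_B, sigma_B, p_A (p_B = 1 - p_A)
    of a 1-D mixture of two Gaussians, using EM."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False).astype(float)   # initial means
    sigma = np.array([x.std(), x.std()])
    p = np.array([0.5, 0.5])                                  # mixing weights p_A, p_B
    for _ in range(n_iter):
        # E-step: soft assignment -- probability of each instance under each cluster
        dens = (p / (sigma * np.sqrt(2 * np.pi))) * \
               np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters to maximize the overall likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        p = nk / len(x)
    return mu, sigma, p
```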

12
Agglomerative
  • Each object is viewed as a cluster (bottom up).
  • Repeat until the number of clusters is small
    enough
  • Choose a closest pair of clusters
  • Merge the two into one
  • Defining "closest": centroid (mean of cluster)
    distance, (average) sum of pairwise distances, etc.
    (see the sketch below)
  • Refer to the Evaluation part
  • A dendrogram is a tree that shows the clustering
    process.
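A small sketch of the bottom-up loop (not from the slides); it uses centroid distance as one possible definition of "closest", and the brute-force pair search is written for clarity rather than efficiency.

```python
import numpy as np

def centroid_distance(A, B):
    """One way to define 'closest': distance between the cluster centroids."""
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def agglomerative(data, n_clusters, cluster_dist=centroid_distance):
    """Bottom-up clustering: start with one cluster per object and repeatedly
    merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]        # each object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):                # choose a closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(data[clusters[a]], data[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the two into one
        del clusters[b]
    return clusters
```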

13
Dendrogram
  • Cluster 1, 2, 4, 5, 6, 7 into two clusters
    (centroid distance); see the SciPy sketch below
  • [Figure: dendrogram over the points 1, 2, 4, 5, 6, 7]
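A sketch of this example using SciPy's hierarchical clustering (an assumption about tooling, not something the slides use); linkage with method='centroid' builds the merge history behind the dendrogram, and fcluster cuts it into two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0], [2.0], [4.0], [5.0], [6.0], [7.0]])  # the six 1-D points

Z = linkage(points, method='centroid')            # merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters
print(labels)  # expected grouping: {1, 2} in one cluster, {4, 5, 6, 7} in the other
```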

14
An example to show different Links
  • Single link
  • Merge the nearest clusters measured by the
    shortest edge between the two
  • (((A B) (C D)) E)
  • Complete link
  • Merge the nearest clusters measured by the
    longest edge between the two
  • (((A B) E) (C D))
  • Average link
  • Merge the nearest clusters measured by the
    average edge length between the two
  • (((A B) (C D)) E)

A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
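The three groupings above can be reproduced from the distance matrix with SciPy (an illustrative sketch, not part of the slides); squareform converts the matrix to the condensed form that linkage expects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix for the points A, B, C, D, E (rows/columns in that order)
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

condensed = squareform(D)  # condensed (upper-triangle) form expected by linkage
for method in ('single', 'complete', 'average'):
    Z = linkage(condensed, method=method)
    # Each row of Z records one merge; indices 0-4 are A-E, indices >= 5 are merged clusters
    print(method, Z[:, :2].astype(int).tolist())
```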
15
Divisive
  • All instances belong to one cluster (top-down)
  • To find an optimal division at each layer
    (especially the top one) is computationally
    prohibitive.
  • One heuristic method is based on the Minimum
    Spanning Tree (MST) algorithm (see the sketch below)
  • Connect all instances with an MST (O(N²))
  • Repeatedly cut the longest edges, one per
    iteration, until some stopping criterion is met or
    until one instance remains in each cluster.
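A sketch of the MST heuristic (not from the slides), using SciPy's graph routines; it assumes distinct instances, so all pairwise distances used as edge weights are nonzero.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_divisive(data, n_clusters):
    """Divisive clustering sketch: build an MST over all instances, then
    repeatedly cut the longest remaining edge until n_clusters components remain."""
    dist = squareform(pdist(data))                 # full pairwise distance matrix, O(N^2)
    mst = minimum_spanning_tree(dist).toarray()    # MST edges as a (directed) weight matrix
    for _ in range(n_clusters - 1):                # cut the longest edges one at a time
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0                            # removing the edge splits one component
    _, labels = connected_components(mst, directed=False)
    return labels
```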

16
COBWEB
  • Building a conceptual hierarchy incrementally
  • Each cluster has a probabilistic description
  • Category Utility (see the sketch below):
    Σk Σi Σj P(fi = vij) · P(fi = vij | ck) · P(ck | fi = vij)
  • summed over all categories ck, all features fi, and
    all feature values vij
  • It attempts to maximize both the probability that
    two objects in the same category have values in
    common and the probability that objects in
    different categories will have different property
    values
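A small sketch (not from the slides) that evaluates the sum above for nominal data, given a fixed assignment of instances to categories; COBWEB itself applies this kind of score incrementally while building the tree and normalizes it by the number of categories.

```python
from collections import Counter

def category_utility_sum(instances, assignment):
    """Evaluate sum_k sum_i sum_j P(fi=vij) * P(fi=vij|ck) * P(ck|fi=vij)
    for nominal instances (tuples of feature values) and a cluster assignment."""
    n = len(instances)
    total = 0.0
    for k in set(assignment):
        members = [x for x, c in zip(instances, assignment) if c == k]
        p_ck = len(members) / n                                   # P(ck)
        for i in range(len(instances[0])):                        # all features fi
            overall = Counter(x[i] for x in instances)            # counts of vij overall
            within = Counter(x[i] for x in members)               # counts of vij within ck
            for v, cnt in overall.items():                        # all feature values vij
                p_fv = cnt / n                                    # P(fi = vij)
                p_fv_ck = within.get(v, 0) / len(members)         # P(fi = vij | ck)
                p_ck_fv = p_fv_ck * p_ck / p_fv                   # Bayes: P(ck | fi = vij)
                total += p_fv * p_fv_ck * p_ck_fv
    return total
```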

17
A tree of clusters produced by COBWEB
18
  • Processing one instance at a time by choosing
    best among
  • Placing the instance in the best existing
    category
  • Adding a new category containing only the
    instance
  • Merging of two existing categories into a new one
    and adding the instance to that category
  • Splitting of an existing category into two and
    placing the instance in the best new resulting
    category

[Figure: the merge and split operators - merging two child
nodes under their parent, and splitting a parent node into
its children under the grandparent]
19
Cobweb demo: http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaae317b2bebb
20
Density-based
  • DBSCAN: Density-Based Spatial Clustering of
    Applications with Noise
  • It grows regions with sufficiently high density
    into clusters and can discover clusters of
    arbitrary shape in spatial databases with noise.
  • Many existing clustering algorithms find
    spherical shapes of clusters
  • DBSCAN defines a cluster as a maximal set of
    density-connected points.

21
  • Defining density and connection
  • ε-neighborhood of an object x (core object) (M,
    P, Q)
  • MinPts: the minimum number of objects required
    within the ε-neighborhood (say, 3)
  • directly density-reachable (Q from M, M from P)
  • density-reachable (Q from P, P not from Q) -
    asymmetric
  • density-connected (O, R, S) - symmetric (for
    border points)
  • What is the relationship between DR and DC?

22
  • Clustering with DBSCAN (see the sketch below)
  • Search for clusters by checking the
    ε-neighborhood of each instance x
  • If the ε-neighborhood of x contains at least
    MinPts points, create a new cluster with x as a
    core object
  • Iteratively collect directly density-reachable
    objects from these core objects and merge
    density-reachable clusters
  • Terminate when no new point can be added to any
    cluster
  • DBSCAN is sensitive to the density thresholds
    (ε, MinPts), but it is many times faster than
    CLARANS
  • Time complexity: O(N log N) if a spatial index is
    used, O(N²) otherwise
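A compact sketch of the procedure above (not the reference implementation); it precomputes all ε-neighborhoods, which costs O(N²) time and memory, i.e. it does not use a spatial index.

```python
import numpy as np

def dbscan(data, eps, min_pts):
    """DBSCAN sketch: grow clusters from core objects whose eps-neighborhood
    contains at least min_pts points; points left at -1 are noise."""
    n = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # eps-neighborhoods
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                               # already clustered, or not a core object
        labels[i] = cluster_id                     # create a new cluster with core object i
        queue = list(neighbors[i])
        while queue:                               # collect density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if len(neighbors[j]) >= min_pts:   # j is itself a core object: keep expanding
                    queue.extend(neighbors[j])
        cluster_id += 1
    return labels
```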

23
Dealing with Large Data
  • Key ideas
  • Reducing the number of instances to be
    maintained while still preserving the data
    distribution
  • Identifying relevant subspaces where clusters
    possibly exist
  • Using summarized information to avoid repeated
    data access
  • Sampling
  • CLARA (Clustering LARge Applications): works on
    samples instead of the whole data set
  • CLARANS (Clustering Large Applications based on
    RANdomized Search)

24
  • Grid: STING (STatistical INformation Grid)
  • Statistical parameters of higher-level cells can
    easily be computed from those of lower-level
    cells (see the sketch below)
  • Attribute-independent: count
  • Attribute-dependent: mean, standard deviation,
    min, max
  • Type of distribution: normal, uniform,
    exponential, or unknown
  • Irrelevant cells can be removed
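A sketch of how a parent cell's parameters could be derived from its children's (not STING's actual code); it assumes the stored standard deviations are population standard deviations and that the children are non-empty.

```python
import math

def merge_cells(children):
    """Combine statistical parameters of lower-level cells into their parent.
    Each child is a dict with keys: count, mean, std, min, max."""
    n = sum(c['count'] for c in children)                              # counts just add up
    mean = sum(c['count'] * c['mean'] for c in children) / n           # weighted mean
    # Recover each child's sum of squares from (count, mean, std), then recombine
    sum_sq = sum(c['count'] * (c['std'] ** 2 + c['mean'] ** 2) for c in children)
    std = math.sqrt(sum_sq / n - mean ** 2)
    return {'count': n, 'mean': mean, 'std': std,
            'min': min(c['min'] for c in children),
            'max': max(c['max'] for c in children)}
```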

25
Representatives
  • BIRCH: using Clustering Features (CF) and a CF tree
  • A clustering feature is a triplet summarizing a
    sub-cluster of instances: (N, LS, SS)
    (see the CF sketch below)
  • N: the number of instances; LS: linear sum; SS:
    square sum
  • Two parameters: branching factor (the max number
    of children per non-leaf node) and a diameter
    threshold
  • Two phases
  • Build an initial in-memory CF tree
  • Apply a clustering algorithm to cluster the leaf
    nodes of the CF tree
  • CURE (Clustering Using REpresentatives) is
    another example
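A small sketch of a clustering feature (not BIRCH's actual data structures); here SS is kept as a single scalar (the sum of squared norms), which is enough to derive a centroid and radius, and CFs can be merged by simple addition.

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) summarizing a sub-cluster of d-dimensional points."""
    def __init__(self, points):
        points = np.atleast_2d(np.asarray(points, dtype=float))
        self.N = len(points)                  # number of instances
        self.LS = points.sum(axis=0)          # linear sum (a d-dimensional vector)
        self.SS = float((points ** 2).sum())  # square sum (a scalar here)

    def merge(self, other):
        """CF additivity: merging two sub-clusters just adds the two triplets."""
        merged = CF(np.empty((0, len(self.LS))))
        merged.N = self.N + other.N
        merged.LS = self.LS + other.LS
        merged.SS = self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # Average distance of members to the centroid, derived from (N, LS, SS) alone
        return float(np.sqrt(max(self.SS / self.N - (self.centroid() ** 2).sum(), 0.0)))
```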

26
CF Tree
B = branching factor; L = threshold (max diameter of
sub-clusters at leaf nodes)
[Figure: a CF tree with B = 7 and L = 6. The root and
non-leaf nodes hold CF entries (CF1, CF2, CF3, CF5, ...)
with child pointers; leaf nodes hold CF entries (CF1,
CF2, ..., CF6) and are chained with prev/next pointers.]
27
  • Taking advantage of a property of density
  • If a region is dense in a higher-dimensional
    subspace, then its projections onto lower-dimensional
    subspaces are also dense
  • CLIQUE (CLustering In QUEst)
  • With high dimensional data, there are many void
    subspaces
  • Using the property identified, we can start with
    dense lower dimensional data
  • CLIQUE is a density-based method that can
    automatically find subspaces of the highest
    dimensionality such that high-density clusters
    exist in those subspaces

28
Drawbacks of Distance-Based Method
  • Drawbacks of square-error based clustering method
  • They consider only one point as the representative
    of a cluster
  • They are good only for convex-shaped clusters of
    similar size and density, and only if k can be
    reasonably estimated

29
Chameleon
  • A hierarchical Clustering Algorithm Using Dynamic
    Modeling
  • Observations on the weaknesses of pure
    distance-based methods
  • Basic steps
  • Build a k-nearest-neighbor graph
  • Partition the graph
  • Merge partitions that are strongly connected, based
    on the strength of the connections between
    partitions

30
Summary
  • There are many clustering algorithms
  • Good clustering algorithms maximize inter-cluster
    dissimilarity and intra-cluster similarity
  • Without prior knowledge, it is difficult to
    choose the best clustering algorithm.
  • Clustering is an important tool for outlier
    analysis.

31
Bibliography
  • I.H. Witten and E. Frank. Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations. Morgan Kaufmann, 2000.
  • M. Kantardzic. Data Mining: Concepts, Models,
    Methods, and Algorithms. IEEE, 2003.
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2001.
  • M. H. Dunham. Data Mining: Introductory and
    Advanced Topics.