Cluster Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Cluster Analysis


1
Cluster Analysis
Chapter 7
2
- The Course
[Course overview diagram showing data sources (DS) feeding the staging database (DP) and the data warehouse (DW), OLAP with a star schema, and the data mining (DM) topics: association, classification, and clustering.]
Legend: DS = data source, DW = data warehouse, DM = data mining, DP = staging database
3
Chapter Outline
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning
  • Hierarchical
  • Density-Based
  • Grid-Based (skip)
  • Model-Based (skip)
  • Outlier Analysis

4
What is Clustering?
  • Also called unsupervised learning; sometimes
    called classification by statisticians, sorting
    by psychologists, and segmentation by people in
    marketing
  • Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity
  • Finding the class labels and the number of
    classes directly from the data (in contrast to
    classification).
  • More informally, finding natural groupings among
    objects

5
What is Cluster Analysis?
  • Cluster: a collection of data objects (records)
  • (Intraclass similarity) - Objects are similar to
    objects in the same cluster
  • (Interclass dissimilarity) - Objects are
    dissimilar to objects in other clusters
  • Cluster analysis
  • Statistical method for grouping a set of data
    objects into clusters
  • A good clustering method produces high quality
    clusters with high intraclass similarity and low
    interclass similarity
  • Clustering is unsupervised classification
  • More informally, finding natural groupings among
    objects

6
What is Cluster Analysis?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

[Figure: data objects plotted by salary, age, and job title, grouped into clusters.]
7
Typical Clustering Applications
  • As a stand-alone tool to
  • get insight into data distribution
  • find the characteristics of each cluster
  • assign new examples to clusters
  • As a preprocessing step for other algorithms
  • e.g. numerosity reduction using cluster centers
    to represent data in clusters. (See the example
    in the next slide.)
  • It is a building block for many data mining
    solutions

8
Clustering Example: Fitting the Troops
  • Fitting the troops: re-design of uniforms for
    soldiers
  • Goal: reduce the number of uniform sizes to be
    kept in inventory while still providing good fit
  • Researchers from Cornell University used
    clustering and designed a new set of sizes
  • Traditional clothing size system: an ordered set
    of graduated sizes where all dimensions increase
    together
  • The new system: sizes that fit body types
  • E.g. one size for short-legged, small-waisted
    women with wide and long torsos, average arms,
    broad shoulders, and skinny necks

9
Other Examples of Clustering Applications
  • Marketing
  • help discover distinct groups of customers, and
    then use this knowledge to develop targeted
    marketing programs
  • Biology
  • derive plant and animal taxonomies
  • find genes with similar function
  • Land use
  • identify areas of similar land use in an earth
    observation database
  • Insurance
  • identify groups of motor insurance policy holders
    with a high average claim cost
  • City-planning
  • identify groups of houses according to their
    house type, value, and geographical location

10
Requirements of Clustering
  • Scalability
  • Ability to deal with various types of attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge
  • Can deal with noise and outliers
  • Insensitive to the order of input records
  • Can handle high dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

11
What is Cluster Analysis?
  • What is a good clustering?
  • High within-class similarity and low
    between-class similarity
  • The ability to discover some or all of the
    hidden patterns

[Figure: sample clusters with one point marked as an outlier.]
12
How do we measure similarity or dissimilarity?
[Figure: the names "Peter" and "Piotr" with three candidate distance values, 0.23, 3, and 342.7, illustrating that the numeric dissimilarity depends on the measure chosen.]
13
Data Representation
  • Data matrix
  • N objects with p attributes
  • Distances are normally used to measure the
    similarity or dissimilarity between two data
    objects
  • Dissimilarity matrix
  • d(i,j) = dissimilarity between i and j
    (the matrix forms are written out below)
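
The matrices themselves appear only as images in the original slides; in the usual textbook notation (a reconstruction, not taken from the transcript) they are:

\text{Data matrix ($n$ objects, $p$ attributes):}\quad
\begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}

\text{Dissimilarity matrix ($n \times n$, lower triangular):}\quad
\begin{pmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{pmatrix}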

14
Data Representation
  • Properties
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)

15
Types of Data in Cluster Analysis
  • Interval-Scaled Attributes: continuous
    measurements on a roughly linear scale. Examples:
    weight, temperature, income, etc.
  • Binary Attributes
  • Nominal Attributes: categorical values where
    order has no meaning. Examples: color, gender
  • Ordinal Attributes: categorical values where
    order has meaning. Example: rank
  • Ratio-Scaled Attributes: continuous measurements
    on a nonlinear scale, approximately exponential
  • Mixed Attributes: a combination of the above
    data types

16
Dissimilarity of Interval-Scaled Values
  • Step 1: Standardize the data (see the sketch
    below)
  • To ensure the attributes all have equal weight
  • To map different scales onto a uniform,
    single scale
  • Not always needed! Sometimes we require unequal
    weights for an attribute
  • Step 2: Compute dissimilarity between records
  • Use Euclidean, Manhattan or Minkowski distance
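
A minimal sketch of the standardization step in Python (an illustration, not from the slides), using the mean absolute deviation, the variant usually preferred for clustering because it dampens the effect of outliers:

import numpy as np

def standardize(X):
    """Standardize each column (attribute) of X using the mean
    absolute deviation instead of the standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # per-attribute mean
    s = np.abs(X - m).mean(axis=0)      # mean absolute deviation
    s[s == 0] = 1.0                     # avoid division by zero for constant columns
    return (X - m) / s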

17
Distance Metrics
  • Minkowski
  • Manhattan
  • Euclidean
  • Weighted (these forms are written out below)
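
The formulas behind these names are only images in the original slides; in standard notation (a reconstruction) they are, for objects i and j with p attributes:

d_{\text{Minkowski}}(i,j) = \Big( \sum_{f=1}^{p} |x_{if} - x_{jf}|^{q} \Big)^{1/q}

d_{\text{Manhattan}}(i,j) = \sum_{f=1}^{p} |x_{if} - x_{jf}| \qquad (q = 1)

d_{\text{Euclidean}}(i,j) = \sqrt{ \sum_{f=1}^{p} (x_{if} - x_{jf})^{2} } \qquad (q = 2)

d_{\text{Weighted}}(i,j) = \sqrt{ \sum_{f=1}^{p} w_{f}\,(x_{if} - x_{jf})^{2} }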

18
Dissimilarity between Binary Variables
  • Method: build a contingency table for binary data
  • If the binary variable is symmetric
  • If the binary variable is asymmetric
    (both coefficients are written out below)

[2x2 contingency table of object i against object j]
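
With q = number of attributes equal to 1 for both objects, r = 1 for i and 0 for j, s = 0 for i and 1 for j, and t = 0 for both, the usual coefficients (reconstructed, since only the table header survives in the transcript) are:

d_{\text{symmetric}}(i,j) = \frac{r + s}{q + r + s + t}
\qquad
d_{\text{asymmetric}}(i,j) = \frac{r + s}{q + r + s}

The asymmetric form ignores the negative matches t, i.e. it is a Jaccard-style dissimilarity.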
19
Dissimilarity between Binary Variables Example
  • gender is a symmetric binary attribute
  • the remaining attributes are asymmetric binary
  • let the values Y and P be set to 1, and the value
    N be set to 0 (a small computational sketch
    follows below)
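
The data table for this example is not in the transcript, so the vectors below are hypothetical, but they show how the asymmetric (Jaccard-style) dissimilarity from the previous slide would be computed:

import numpy as np

def asymmetric_binary_dissimilarity(a, b):
    """Jaccard-style dissimilarity: negative matches (0/0) are ignored."""
    a, b = np.asarray(a), np.asarray(b)
    mismatches = np.sum(a != b)                 # r + s
    pos_matches = np.sum((a == 1) & (b == 1))   # q
    return mismatches / (pos_matches + mismatches)

# Hypothetical records with Y/P encoded as 1 and N as 0
patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary_dissimilarity(patient_1, patient_2))  # 0.333...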

20
Dissimilarity Between Nominal Attributes
  • A generalization of the binary attribute in that
    it can take more than 2 states, e.g., red,
    yellow, blue, green
  • Method 1: simple matching (the formula is written
    out below)
  • m = # of attributes that match for both
    records, p = total # of attributes
  • Method 2: rewrite the database and create a new
    binary attribute for each of the m states
  • For an object with color yellow, the yellow
    attribute is set to 1, while the remaining
    attributes are set to 0.
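
Written out (a reconstruction of the formula that appears only as an image on the slide), simple matching gives

d(i,j) = \frac{p - m}{p}

where m is the number of matching attributes and p the total number of attributes.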

21
Dissimilarity Between Ordinal Attributes
  • An ordinal attribute can be discrete or
    continuous
  • Order is important (e.g. rank)
  • Can be treated like interval-scaled attributes
  • replace x_if by its rank r_if
  • map the range of each attribute onto [0, 1] by
    replacing the i-th object in the f-th attribute
    with the normalized rank given below
  • compute the dissimilarity using methods for
    interval-scaled attributes
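
The normalization referred to above (reconstructed from the standard treatment) maps the rank r_{if} ∈ {1, ..., M_f} of object i on attribute f to

z_{if} = \frac{r_{if} - 1}{M_{f} - 1}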

22
Dissimilarity Between Ratio-Scaled Attributes
  • Ratio-scaled attribute: a positive measurement on
    a nonlinear scale, approximately exponential,
    such as Ae^(Bt) or Ae^(-Bt)
  • Methods
  • treat them like interval-scaled attributes - not
    a good choice because the scale may be distorted
  • apply a logarithmic transformation,
    y_if = log(x_if)
  • treat them as continuous ordinal data and treat
    their ranks as interval-scaled

23
Dissimilarity Between Attributes of Mixed Types
  • A database may contain all the six types of
    attributes
  • symmetric binary, asymmetric binary, nominal,
    ordinal, interval and ratio.
  • Use a weighted formula to combine their effects.
  • f is binary or nominal: d_ij(f) = 0 if x_if = x_jf,
    otherwise d_ij(f) = 1
  • f is interval-based: use the normalized distance
  • f is ordinal or ratio-scaled
  • compute the rank r_if and
    z_if = (r_if - 1) / (M_f - 1)
  • and treat z_if as interval-scaled (the combined
    weighted formula is written out below)
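
The weighted combination itself is only an image on the slide; the standard form (a reconstruction) is

d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

where the indicator \delta_{ij}^{(f)} is 0 if x_{if} or x_{jf} is missing (or if f is asymmetric binary and x_{if} = x_{jf} = 0), and 1 otherwise.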

24
Major Clustering Approaches
  • Partitioning approach
  • Partitions the objects and then evaluates the
    partitions by some criterion, e.g. minimizing
    the sum of squared errors. Examples: k-means,
    k-medoids
  • Hierarchical approach
  • Creates a hierarchical decomposition of the set
    of data (or objects) using some criterion.
    Examples: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
  • Density-based approach
  • Clusters objects based on connectivity and
    density functions. Examples: DBSCAN, OPTICS
  • Grid-Based (skip)
  • Model-Based (skip)

25
Partitioning Algorithms Basic Concept
  • Partitioning method: construct a partition of a
    database D of n objects into a set of k clusters
    that minimizes the sum of squared distances
  • Given k, find a partition of k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods
  • k-means: each cluster is represented by the
    center of the cluster
  • k-medoids or PAM (Partitioning Around Medoids):
    each cluster is represented by one of the objects
    in the cluster

26
K-Means
27
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    four steps:
  • Partition the objects into k nonempty subsets
  • Compute the seed points as the centroids (mean
    points) of the current clusters
  • Assign each object to the cluster with the
    nearest seed point
  • Go back to step 2; stop when the assignment no
    longer changes

28
Stopping/convergence criterion
  • no (or minimum) re-assignments of data points to
    different clusters,
  • no (or minimum) change of centroids, or
  • minimum decrease in the sum of squared error
    (SSE),

Equation (1): SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²,
where C_j is the j-th cluster, m_j is the centroid of
cluster C_j (the mean vector of all the data points in
C_j), and dist(x, m_j) is the distance between data
point x and centroid m_j.
(A runnable sketch of the algorithm follows below.)
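
A minimal, self-contained k-means sketch in Python (an illustration of the four steps and the SSE criterion above, not the textbook's own code; the function and parameter names are mine):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then
    recompute centroids, until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 4: stop when no centroid moves (one of the criteria above).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = (dists[np.arange(len(X)), labels] ** 2).sum()  # Equation (1)
    return labels, centroids, sse

# e.g. the 1-D example on the later "2-means Example" slide:
# kmeans(np.array([[1.], [2.], [5.], [6.], [7.]]), k=2)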
29
3-means Example Step 1
30
3-means Example Step 2
31
3-means Example Step 3
32
3-means Example Step 4
33
3-means Example Step 5
34
3-means Example 2, Step 6
35
2-means Example
  • For simplicity, 1-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • K-means:
  • Randomly select 5 and 6 as initial centroids
  • => Two clusters {1,2,5} and {6,7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1,2} and {5,6,7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity (SSE) = 0.5² + 0.5² + 1²
    + 1² = 2.5

36
Comments on the K-Means Method
  • Strengths of k-means
  • Relatively efficient: O(tkn), where n is the #
    of objects, k is the # of clusters, and t is the
    # of iterations. Normally, k, t << n.
  • Often terminates at a local optimum.
  • Weaknesses of k-means
  • Applicable only when the mean is defined - what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance.
  • Unable to handle noisy data and outliers.
  • Not suitable for discovering clusters with
    non-convex shapes.



37
Variations of the K-Means Method
  • A few variants of k-means differ in
  • Selection of the initial k means.
  • Dissimilarity calculations.
  • Strategies to calculate cluster means.
  • Handling categorical data: k-modes (Huang, 1998)
  • Replacing means of clusters with modes.
  • Using new dissimilarity measures to deal with
    categorical objects.
  • Using a frequency-based method to update modes of
    clusters.
  • For a mixture of categorical and numerical data:
    the k-prototype method.

38
K-Medoid
39
Medoid - definition
  • A medoid is an actual point in the dataset that
    is centrally located and is therefore
    representative of the cluster.

40
k-medoid methods
  • The three best-known k-medoid methods are
  • PAM (Partitioning Around Medoids)
  • CLARA (Clustering LARge Applications)
  • CLARANS

41
PAM
  • Arbitrarily choose k objects as the initial
    medoids
  • Until no change, do
  • (Re)assign each object to the cluster of its
    nearest medoid
  • Randomly select a non-medoid object o', and
    compute the total cost E of swapping a medoid o
    with o'
  • If E < 0, swap o with o' to form the new set
    of k medoids

42
Swapping Cost
  • Measures whether o' is better than o as a medoid
  • Use the squared-error criterion
  • Compute E(o') - E(o)
  • A negative value means the swap brings benefit

43
PAM Example
Arbitrarily choose k objects as initial medoids
Assign each remaining object to the nearest medoid
K = 2
Randomly select a non-medoid object, O_random
Do loop until no change
Compute the total cost of swapping
Swap O and O_random if the quality is improved
(a compact PAM sketch follows below)
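
A compact PAM-style sketch in Python (an illustration of the loop above, with my own names; the real PAM examines all medoid/non-medoid swaps rather than one random swap per medoid, and here the cost E is the plain sum of distances to the nearest medoid):

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """k-medoids: keep a swap (medoid o, non-medoid o') whenever it
    lowers the total cost E; stop when no tried swap helps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()  # E: each object's distance to its nearest medoid

    medoids = list(rng.choice(n, size=k, replace=False))
    E = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            # randomly pick a non-medoid o' and try swapping it with medoid i
            o = rng.choice([p for p in range(n) if p not in medoids])
            candidate = medoids[:i] + [o] + medoids[i + 1:]
            E_new = total_cost(candidate)
            if E_new - E < 0:            # negative swapping cost: accept the swap
                medoids, E, improved = candidate, E_new, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, E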
44
Pros and Cons of PAM
  • PAM is more robust than k-means in the presence
    of noise and outliers
  • Medoids are less influenced by outliers
  • PAM is efficient for small data sets but does
    not scale well to large data sets
  • O(k(n-k)²) per iteration
  • Sampling-based method: CLARA

45
CLARA (Clustering LARge Applications)
  • Draw multiple samples of the data set, apply PAM
    on each sample, and return the best clustering
  • Performs better than PAM on larger data sets
  • Efficiency depends on the sample size
  • A good clustering of the samples may not be a
    good clustering of the whole data set

46
CLARANS (Clustering Large Applications based upon
RANdomized Search)
  • The problem space: a graph over clusterings
  • A vertex is a choice of k medoids from the n
    objects, so there are C(n, k) vertices in total
  • PAM searches the whole graph
  • CLARA searches some random sub-graphs
  • CLARANS climbs mountains (hill climbing over the
    graph)
  • Randomly sample a set and select k medoids
  • Consider neighbors of the medoids as candidates
    for new medoids
  • Use the sample set to verify
  • Repeat multiple times to avoid bad samples

47
  • Algorithm CLARANS
  • 1. Input parameters numlocal and maxneighbor.
    Initialize i to 1, and mincost to a large number.
  • 2. Set current to an arbitrary node in G_{n,k}.
  • 3. Set j to 1.
  • 4. Consider a random neighbor S of current, and
    based on Equation (5) calculate the cost
    differential of the two nodes.
  • 5. If S has a lower cost, set current to S, and go
    to Step (3).
  • 6. Otherwise, increment j by 1. If j <= maxneighbor,
    go to Step (4).
  • 7. Otherwise, compare the cost of current with
    mincost; if it is lower, set mincost to that cost
    and record current as the best node.
  • 8. Increment i by 1. If i > numlocal, output the
    best node and halt; otherwise go to Step (2).

48
  • Requires only the objects and a distance
    function
  • Medoid m_C: a representative object in a cluster C
  • Measure for the compactness of a cluster C:
    TD(C) = Σ_{p ∈ C} dist(p, m_C)
  • Measure for the compactness of a clustering:
    TD = Σ_{i=1}^{k} TD(C_i)

49
  • CLARA (Kaufmann and Rousseeuw, 1990)
  • Additional parameter: numlocal
  • Draws numlocal samples of the data set
  • Applies PAM on each sample
  • Returns the best of these sets of medoids as
    output
  • CLARANS (Ng and Han, 1994)
  • Two additional parameters: maxneighbor and
    numlocal
  • At most maxneighbor pairs (medoid M,
    non-medoid N) are evaluated in the algorithm
  • The first pair (M, N) for which TD_{N<->M} is
    smaller than TD_current is swapped (instead of
    the pair with the minimal value of TD_{N<->M})
  • Finding the local minimum with this procedure is
    repeated numlocal times
  • Efficiency: runtime(CLARANS) < runtime(CLARA) <
    runtime(PAM)

50
  CLARANS(objects DB, Integer k, Real dist,
          Integer numlocal, Integer maxneighbor)
    TD_best := infinity
    for r from 1 to numlocal do
      randomly select k objects as medoids; compute TD; i := 0
      while i < maxneighbor do
        randomly select a pair (medoid M, non-medoid N)
        compute changeOfTD := TD_{N<->M} - TD
        if changeOfTD < 0 then
          substitute M by N; TD := TD_{N<->M}; i := 0
        else i := i + 1
      if TD < TD_best then
        TD_best := TD; store the current medoids
    return the stored medoids

51
Hierarchical Methods
52
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

53
AGNES (Agglomerative Nesting)
  • Agglomerative, Bottom-up approach
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

54
DIANA (Divisive Analysis)
  • Top-down approach
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

55
A Dendrogram
  • Shows how the clusters are merged hierarchically
  • Decomposes data objects into several levels of
    nested partitioning (a tree of clusters), called a
    dendrogram
  • A clustering of the data objects is obtained by
    cutting the dendrogram at the desired level;
    each connected component then forms a cluster
    (see the sketch below)
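
A small agglomerative (AGNES-style) example using SciPy; the library calls and the synthetic data are illustrative assumptions, not part of the slides:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),    # two loose groups of 2-D points
               rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="average")             # merge the least-dissimilar clusters first
dendrogram(Z)                                # the tree of nested merges
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the dendrogram into 2 clusters
print(labels)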

56
Recent Hierarchical Clustering Methods
  • Major weaknesses of agglomerative clustering
    methods
  • do not scale well: time complexity of at least
    O(n²), where n is the number of total objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH: uses a CF-tree and incrementally adjusts
    the quality of sub-clusters
  • ROCK: clusters categorical data by neighbor and
    link analysis
  • CHAMELEON: hierarchical clustering using dynamic
    modeling

57
BIRCH
  • BIRCH: Balanced Iterative Reducing and Clustering
    using Hierarchies
  • Incrementally constructs a CF (Clustering Feature)
    tree, a hierarchical data structure for
    multiphase clustering
  • Phase 1: scan the DB to build an initial in-memory
    CF-tree (a multi-level compression of the data
    that tries to preserve the inherent clustering
    structure of the data)
  • Phase 2: use an arbitrary clustering algorithm to
    cluster the leaf nodes of the CF-tree

58
Clustering Feature Vector in BIRCH
CF = (N, LS, SS) = (5, (16,30), (54,190))
for the points (3,4), (2,6), (4,5), (4,7), (3,8),
where N is the number of points, LS the linear sum,
and SS the per-dimension square sum
(verified in the sketch below)
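
A short check of the CF on this slide (my own sketch): N is the number of points, LS the per-dimension linear sum, and SS the per-dimension sum of squares.

import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
N = len(points)
LS = points.sum(axis=0)            # (16, 30)
SS = (points ** 2).sum(axis=0)     # (54, 190)
print(N, tuple(LS), tuple(SS))     # 5 (16.0, 30.0) (54.0, 190.0)

# CF vectors are additive: merging two sub-clusters just adds their
# (N, LS, SS) entries, which is what lets BIRCH maintain the CF-tree
# incrementally with a single scan of the data.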
59
CF-Tree in BIRCH
  • Clustering feature
  • a summary of the statistics for a given
    subcluster: the 0th, 1st and 2nd moments of the
    subcluster from a statistical point of view
  • registers the crucial measurements for computing
    clusters and utilizes storage efficiently
  • A CF-tree is a height-balanced tree that stores
    the clustering features for a hierarchical
    clustering
  • A nonleaf node in the tree has descendants or
    children
  • The nonleaf nodes store sums of the CFs of their
    children
  • A CF-tree has two parameters
  • Branching factor: specifies the maximum number of
    children
  • Threshold: the maximum diameter of sub-clusters
    stored at the leaf nodes

60
The CF Tree Structure
[CF-tree structure diagram: a root with branching
factor B = 7 and leaf capacity L = 6; non-leaf nodes
hold CF entries (CF1, CF2, CF3, ..., CF5), each with a
child pointer (child1, child2, child3, ..., child5);
leaf nodes hold CF entries (e.g. CF1, CF2, ..., CF6)
and are chained by prev/next pointers.]
61
BIRCH
  • Strengths
  • Scales linearly: finds a good clustering with a
    single scan
  • Improves the quality with a few additional scans
  • Weaknesses
  • handles only numeric data
  • the clustering may not be natural, due to the
    specified branching factor
  • clusters are assumed to be of spherical shape

62
Density Based Clustering
63
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN
  • OPTICS (If there is time)

64
DBSCAN
  • DBSCAN (Density-Based Spatial Clustering of
    Applications with Noise) is a density-based
    algorithm.
  • Relies on a density-based notion of cluster: a
    cluster is defined as a maximal set of
    density-connected points
  • A point is a core point if it has more than a
    specified number of points (MinPts) within a
    specified radius (Eps)
  • These are points in the interior of a
    cluster
  • A border point has fewer than MinPts within Eps,
    but is in the neighborhood of a core point
  • A noise point is any point that is not a core
    point or a border point
    (a usage sketch follows below)
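
A minimal usage sketch with scikit-learn's DBSCAN (the library call and the synthetic data are assumptions for illustration; the slides describe the algorithm itself):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),     # two dense blobs ...
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (10, 2))])    # ... plus scattered noise

db = DBSCAN(eps=0.5, min_samples=4).fit(X)      # Eps and MinPts
print(db.labels_)                    # cluster id per point; -1 marks noise
print(db.core_sample_indices_[:10])  # indices of some core points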

65
DBSCAN Core, Border, and Noise Points
66
DBSCAN The Algorithm
  • Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p
    w.r.t. Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p, and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed.
  • Note: a point p is density-reachable from a point
    q w.r.t. Eps and MinPts if there is a chain of
    points p1, ..., pn with p1 = q and pn = p such
    that each p_(i+1) is within Eps of p_i and
    p1, ..., p_(n-1) are all core points.

67
DBSCAN Core, Border and Noise Points
Original Points
Point types: core, border, and noise
Eps = 10, MinPts = 4
68
When DBSCAN Works Well
Original Points
  • Resistant to Noise
  • Can handle clusters of different shapes and sizes

69
When DBSCAN Does NOT Work Well
MinPts = 4, Eps = 9.75
Original Points
MinPts = 4, Eps = 9.92
  • Varying densities
  • High-dimensional data

70
End