Title: Chapter 6 Cluster Analysis
1 Chapter 6 Cluster Analysis
- By Jinn-Yi Yeh Ph.D.
- 4/21/2009
2 Outline
- Chapter Objective
- 6.1 Clustering Concepts
- 6.2 Similarity Measures
- 6.3 Agglomerative Hierarchical Clustering
- 6.4 Partitional Clustering
- 6.5 Incremental Clustering
3 Chapter Objectives
- Distinguish between different representations of clusters and different measures of similarity.
- Compare basic characteristics of agglomerative- and partitional-clustering algorithms.
- Implement agglomerative algorithms using single-link or complete-link measures of similarity.
4 Chapter Objectives (cont.)
- Derive the K-means method for partitional clustering and analyze its complexity.
- Explain the implementation of incremental-clustering algorithms and their advantages and disadvantages.
5 What is Cluster Analysis?
- Cluster analysis is a set of methodologies for automatic classification of samples into a number of groups using a measure of association.
- Input
  - A set of samples
  - A measure of similarity (or dissimilarity) between two samples
- Output
  - A number of groups (clusters)
  - A structure of the partition
  - A generalized description of every cluster
6 6.1 Clustering Concepts
- Samples for clustering are represented as a vector of measurements, or more formally, as a point in a multidimensional space.
- Samples within a valid cluster are more similar to each other than they are to a sample belonging to a different cluster.
7 Clustering Concepts (cont.)
- Clustering methodology is particularly appropriate for the exploration of interrelationships among samples to make a preliminary assessment of the sample structure.
- It is very difficult for humans to intuitively interpret data embedded in a high-dimensional space.
8 Table 6.1
9 Unsupervised Classification
- The samples in these data sets have only input dimensions, and the learning process is classified as unsupervised.
- The objective is to construct decision boundaries (classification surfaces).
10 Problem of Clustering
- Data can reveal clusters with different shapes and sizes in an n-dimensional data space.
- Resolution (fine vs. coarse)
- Euclidean 2D space
11 Objective Criterion
- Input to a cluster analysis
  - (X, s) or (X, d)
  - X is a set of sample descriptions, s measures the similarity between samples, and d measures the dissimilarity (distance) between samples.
12 Objective Criterion (cont.)
- Output of a cluster analysis
  - a partition Λ = {G1, G2, ..., GN}, where each Gk, k = 1, ..., N, is a crisp subset of X such that G1 ∪ G2 ∪ ... ∪ GN = X and Gi ∩ Gj = ∅ for i ≠ j
- The members G1, G2, ..., GN of Λ are called clusters.
13 Formal Description of Discovered Clusters
- Represent a cluster of points in an n-dimensional space (samples) by its centroid or by a set of distant (border) points in the cluster.
- Represent a cluster graphically using nodes in a clustering tree.
- Represent clusters by using logical expressions on sample attributes.
14 Selection of Clustering Technique
- There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets.
- The user's understanding of the problem and of the corresponding data types will be the best criteria for selecting the appropriate method.
15 Selection of Clustering Technique (cont.)
- Most clustering algorithms are based on the following two popular approaches
  - Hierarchical clustering
    - organizes data in a nested sequence of groups, which can be displayed in the form of a dendrogram or a tree structure.
  - Iterative square-error partitional clustering
    - attempts to obtain the partition that minimizes the within-cluster scatter or maximizes the between-cluster scatter.
16 Selection of Clustering Technique (cont.)
- To guarantee that an optimum solution has been obtained, one has to examine all possible partitions of N samples of n dimensions into K clusters (for a given K).
- Notice that the number of all possible partitions of a set of N objects into K clusters is given by
  - (1/K!) Σi=0..K (-1)^i C(K, i) (K - i)^N
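To get a feel for how quickly this number grows, here is a minimal sketch that evaluates the formula above (the function name num_partitions is my own choice):

```python
from math import comb, factorial

def num_partitions(N, K):
    # Number of partitions of N objects into K non-empty clusters
    # (the formula above, i.e., a Stirling number of the second kind).
    total = sum((-1) ** i * comb(K, i) * (K - i) ** N for i in range(K + 1))
    return total // factorial(K)   # the sum is always divisible by K!

print(num_partitions(5, 2))    # 15
print(num_partitions(25, 5))   # about 2.4e15 -- far too many to enumerate
```

Even for 25 samples and 5 clusters, examining every partition is out of the question, which is why practical clustering algorithms rely on heuristics rather than exhaustive search.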
17 6.2 Similarity Measures
- Every sample xi ∈ X, i = 1, ..., n, is represented by a vector xi = {xi1, xi2, ..., xim}.
  - m is the number of dimensions (features) of the samples
  - n is the total number of samples
- The data set can therefore be viewed as an n x m matrix whose rows are samples and whose columns are features.
18 Describe Features
- These features can be either quantitative or qualitative descriptions of the object.
- Quantitative features can be subdivided as
  - continuous values: real numbers, where Pj ⊆ R
  - discrete values: binary numbers Pj = {0, 1}, or integers Pj ⊆ Z
  - interval values: e.g., Pj = {xij ≤ 20, 20 < xij < 40, xij ≥ 40}
19 Describe Features (cont.)
- Qualitative features can be
  - nominal or unordered: e.g., color is "blue" or "red"
  - ordinal: e.g., military rank with values "general", "colonel", etc.
20 Similarity
- The word "similarity" in clustering means that the value of s(x′, x″) is large when x′ and x″ are two similar samples, and the value of s(x′, x″) is small when x′ and x″ are not similar.
- A similarity measure s is symmetric
  - s(x′, x″) = s(x″, x′), ∀ x′, x″ ∈ X
- A similarity measure is normalized
  - 0 ≤ s(x′, x″) ≤ 1, ∀ x′, x″ ∈ X
21 Dissimilarity
- A dissimilarity measure is denoted by d(x′, x″), ∀ x′, x″ ∈ X, and it is frequently called a distance
  - d(x′, x″) ≥ 0, ∀ x′, x″ ∈ X
  - d(x′, x″) = d(x″, x′), ∀ x′, x″ ∈ X
- If it is accepted as a metric distance measure, then the triangle inequality is also required
  - d(x′, x″) ≤ d(x′, x‴) + d(x‴, x″), ∀ x′, x″, x‴ ∈ X (triangle inequality)
22 Metric Distance Measure
- Euclidean distance in m-dimensional feature space
  - d2(xi, xj) = (Σk=1..m (xik - xjk)^2)^(1/2)
- L1 metric or city block distance
  - d1(xi, xj) = Σk=1..m |xik - xjk|
23 Metric Distance Measure
- Minkowski metric (includes the Euclidean distance and the city block distance as special cases)
  - dp(xi, xj) = (Σk=1..m |xik - xjk|^p)^(1/p)
  - p = 1: d coincides with the L1 distance
  - p = 2: d is identical to the Euclidean metric
24 Example
- For the 4-dimensional vectors x1 = {1, 0, 1, 0} and x2 = {2, 1, -3, -1}, these distance measures are
  - d1 = 1 + 1 + 4 + 1 = 7
  - d2 = (1 + 1 + 16 + 1)^(1/2) = 4.36
  - d3 = (1 + 1 + 64 + 1)^(1/3) = 4.06
25 Measures of Similarity
- Cosine-correlation
  - scos(xi, xj) = (Σk=1..m xik · xjk) / (||xi|| · ||xj||)
- It is easy to see that this measure is symmetric and that its values lie in the interval [-1, 1].
- Example (for x1 and x2 above)
  - scos(x1, x2) = (2 + 0 - 3 + 0) / (2^(1/2) · 15^(1/2)) = -0.18
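The numbers in the last two examples can be checked with a few lines of Python (a minimal sketch using NumPy; the helper name minkowski is my own):

```python
import numpy as np

x1 = np.array([1, 0, 1, 0])
x2 = np.array([2, 1, -3, -1])

def minkowski(a, b, p):
    # Minkowski distance; p = 1 is the city block metric, p = 2 the Euclidean metric
    return float(np.sum(np.abs(a - b) ** p) ** (1 / p))

print(minkowski(x1, x2, 1))   # 7.0
print(minkowski(x1, x2, 2))   # 4.358...
print(minkowski(x1, x2, 3))   # 4.061...

# cosine-correlation similarity of the same two vectors
s_cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(s_cos)                  # -0.1825...
```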
26 Contingency Table
- a: the number of binary attributes of samples xi and xj such that xik = xjk = 1
- b: the number of binary attributes of samples xi and xj such that xik = 1 and xjk = 0
- c: the number of binary attributes of samples xi and xj such that xik = 0 and xjk = 1
- d: the number of binary attributes of samples xi and xj such that xik = xjk = 0
27 Example
- If xi and xj are 8-dimensional vectors with binary feature values
  - xi = {0, 0, 1, 1, 0, 1, 0, 1}
  - xj = {0, 1, 1, 0, 0, 1, 0, 0}
- then the values of the parameters introduced are
  - a = 2, b = 2, c = 1, d = 3
28 Similarity Measures with Binary Data
- Simple Matching Coefficient (SMC)
  - Ssmc(xi, xj) = (a + d) / (a + b + c + d)
- Jaccard Coefficient
  - Sjc(xi, xj) = a / (a + b + c)
- Rao's Coefficient
  - Src(xi, xj) = a / (a + b + c + d)
- Example
  - Ssmc(xi, xj) = 5/8, Sjc(xi, xj) = 2/5, and Src(xi, xj) = 2/8
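A quick way to verify the counts and the three coefficients for the example vectors (a minimal sketch; the variable names are mine):

```python
import numpy as np

xi = np.array([0, 0, 1, 1, 0, 1, 0, 1])
xj = np.array([0, 1, 1, 0, 0, 1, 0, 0])

a = int(np.sum((xi == 1) & (xj == 1)))   # both attributes equal to 1
b = int(np.sum((xi == 1) & (xj == 0)))   # xik = 1, xjk = 0
c = int(np.sum((xi == 0) & (xj == 1)))   # xik = 0, xjk = 1
d = int(np.sum((xi == 0) & (xj == 0)))   # both attributes equal to 0

print(a, b, c, d)                        # 2 2 1 3
print((a + d) / (a + b + c + d))         # SMC     = 5/8 = 0.625
print(a / (a + b + c))                   # Jaccard = 2/5 = 0.4
print(a / (a + b + c + d))               # Rao     = 2/8 = 0.25
```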
29 Mutual Neighbor Distance
- MND(xi, xj) = NN(xi, xj) + NN(xj, xi)
  - NN(xi, xj) is the neighbor number of xj with respect to xi
  - If xj is the closest point to xi, then NN(xi, xj) is equal to 1; if it is the second closest point to xi, then NN(xi, xj) is equal to 2, and so on.
30 Example
- NN(A, B) = NN(B, A) = 1 ⇒ MND(A, B) = 2
- NN(B, C) = 1, NN(C, B) = 2 ⇒ MND(B, C) = 3
- A and B are more similar than B and C using the MND measure
- Figure 6.3 A and B are more similar than B and C using the MND measure
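The neighbor ranks used by MND are easy to compute. The sketch below uses hypothetical coordinates (not the exact points of Figure 6.3), chosen so that they reproduce the same neighbor numbers as the example above:

```python
import numpy as np

# Hypothetical 2-D points: A, B, C plus one context point D
points = np.array([[0.0, 0.0],   # A
                   [1.0, 0.0],   # B
                   [3.0, 0.0],   # C
                   [6.5, 0.0]])  # D

def nn_rank(data, i, j):
    # neighbor number of point j with respect to point i (1 = closest)
    dist = np.linalg.norm(data - data[i], axis=1)
    order = np.argsort(dist)
    order = order[order != i]               # a point is not its own neighbor
    return int(np.where(order == j)[0][0]) + 1

def mnd(data, i, j):
    return nn_rank(data, i, j) + nn_rank(data, j, i)

print(mnd(points, 0, 1))   # MND(A, B) = 2
print(mnd(points, 1, 2))   # MND(B, C) = 3
```

Moving or adding context points changes the neighbor ranks, which is exactly the context-dependence illustrated in Figure 6.4.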
31 Example
- NN(A, B) = 1, NN(B, A) = 4 ⇒ MND(A, B) = 5
- NN(B, C) = 1, NN(C, B) = 2 ⇒ MND(B, C) = 3
- After changes in the context, B and C are more similar than A and B using the MND measure
- Figure 6.4 After changes in the context, B and C are more similar than A and B using the MND measure
32 Distance Measures Between Clusters
- These measures are an essential part of estimating the quality of a clustering process, and therefore they are part of clustering algorithms
  - 1) Dmin(Ci, Cj) = min |pi - pj|, where pi ∈ Ci and pj ∈ Cj
  - 2) Dmean(Ci, Cj) = |mi - mj|, where mi and mj are the centroids of Ci and Cj
  - 3) Davg(Ci, Cj) = 1/(ni · nj) Σ Σ |pi - pj|, where pi ∈ Ci and pj ∈ Cj, and ni and nj are the numbers of samples in clusters Ci and Cj
  - 4) Dmax(Ci, Cj) = max |pi - pj|, where pi ∈ Ci and pj ∈ Cj
33 6.3 Agglomerative Hierarchical Clustering
- Most procedures for hierarchical clustering are not based on the concept of optimization; the goal is to find some approximate, suboptimal solution, using iterations to improve partitions until convergence.
- Algorithms of hierarchical cluster analysis are divided into two categories
  - divisible algorithms
  - agglomerative algorithms
34 Divisible vs. Agglomerative
- Divisible algorithms
  - Top-down process: the entire set of samples X → subsets → smaller subsets → ...
- Agglomerative algorithms
  - Bottom-up process
  - Regard each object as an initial cluster → clusters are merged into coarser partitions → one large cluster
- Agglomerative algorithms are more frequently used in real-world applications than divisible methods
35 Agglomerative Hierarchical Clustering Algorithms
- single-link and complete-link
- These two basic algorithms differ only in the way they characterize the similarity between a pair of clusters.
36 Steps
- 1. Place each sample in its own cluster. Construct the list of inter-cluster distances for all distinct unordered pairs of samples, and sort this list in ascending order.
- 2. Step through the sorted list of distances, forming for each distinct threshold value dk a graph of the samples where pairs of samples closer than dk are connected into a new cluster by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat this step.
- 3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level, forming a partition (clusters) identified by simple connected components in the corresponding subgraph.
37 Example
- x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
- Figure 6.6 Five two-dimensional samples for clustering
38 Example
- The distances between these points using the Euclidean measure
  - d(x1, x2) = 2, d(x1, x3) = 2.5, d(x1, x4) = 5.39, d(x1, x5) = 5
  - d(x2, x3) = 1.5, d(x2, x4) = 5, d(x2, x5) = 5.39
  - d(x3, x4) = 3.5, d(x3, x5) = 4.03
  - d(x4, x5) = 2
39 Example - Single Link
- Final result: {x1, x2, x3} and {x4, x5}
- Figure 6.7 Dendrogram by single-link method for the data set in Figure 6.6
40 Example - Complete Link
- Final result: {x1}, {x2, x3} and {x4, x5}
- Figure 6.8 Dendrogram by complete-link method for the data set in Figure 6.6
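The single-link and complete-link results of Figures 6.7 and 6.8 can be reproduced with SciPy's hierarchical-clustering routines; the cut level of 2.2 below is my own choice, picked so that the resulting partitions match the final results quoted above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# the five samples of Figure 6.6
X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], dtype=float)

for method in ("single", "complete"):
    Z = linkage(X, method=method)                       # agglomerative merge tree
    labels = fcluster(Z, t=2.2, criterion="distance")   # cut the dendrogram at 2.2
    print(method, labels)

# single-link   groups {x1, x2, x3} and {x4, x5}
# complete-link groups {x1}, {x2, x3} and {x4, x5}
```

Calling scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding tree.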
41 Chameleon Clustering Algorithm
- Unlike traditional agglomerative methods, Chameleon is a clustering algorithm that tries to improve the clustering quality by using a more elaborate criterion when merging two clusters.
- Two clusters will be merged if the interconnectivity and closeness of the merged cluster are very similar to the interconnectivity and closeness of the two individual clusters before merging.
42 Chameleon Clustering Algorithm - Steps
- Step 1: create a graph G = (V, E)
  - each v ∈ V represents a data sample
  - a weighted edge e(vi, vj) carries the similarity between samples vi and vj
  - graph G is then split by min-cut operations into a large number of small subgraphs (elementary clusters)
- Step 2: Chameleon determines the similarity between each pair of elementary clusters Ci and Cj according to their relative interconnectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj).
43 Chameleon Clustering Algorithm - Steps (cont.)
- Interconnectivity: the total weight of the edges that are removed when a min-cut is performed
- Relative interconnectivity RI(Ci, Cj): the ratio of the interconnectivity of the merged cluster of Ci and Cj to the average interconnectivity of Ci and Cj
- Closeness: the average weight of the edges that are removed when a min-cut is performed on the cluster
- Relative closeness RC(Ci, Cj): the ratio of the closeness of the merged cluster of Ci and Cj to the average internal closeness of Ci and Cj
44 Chameleon Clustering Algorithm - Steps (cont.)
- Step 3: compute the similarity function
  - similarity(Ci, Cj) = RI(Ci, Cj) · RC(Ci, Cj)^α
  - α is a parameter between 0 and 1
  - α = 1 gives equal weight to both measures; α < 1 places more emphasis on RI(Ci, Cj)
- Chameleon can automatically adapt to the internal characteristics of the clusters, and it is effective in discovering arbitrarily shaped clusters of varying density. However, the algorithm is ineffective for high-dimensional data because its time complexity for n samples is O(n^2).
45 6.4 Partitional Clustering
- Advantage in applications involving large data sets, for which the construction of a dendrogram is computationally very complex.
- Criterion function
  - local: a subset of the samples
    - e.g., minimal MND (Mutual Neighbor Distance)
  - global: all of the samples
    - e.g., Euclidean square-error measure
- Therefore, identifying high-density regions in the data space is a basic criterion for forming clusters.
46 Partitional Clustering (cont.)
- The most commonly used partitional-clustering strategy is based on the square-error criterion.
- Objective: obtain the partition that, for a fixed number of clusters, minimizes the total square-error.
47 Partitional Clustering (cont.)
- Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, ..., CK}.
- Each Ck has nk samples and each sample is in exactly one cluster, so that Σ nk = N, where k = 1, ..., K.
- The mean vector Mk of cluster Ck is defined as the centroid of the cluster, or
  - Mk = (1/nk) Σi=1..nk xik, where xik is the i-th sample belonging to cluster Ck
48 Partitional Clustering (cont.)
- Within-cluster variation (the square-error for cluster Ck)
  - ek^2 = Σi=1..nk ||xik - Mk||^2
- The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations
  - Ek^2 = Σk=1..K ek^2
49 K-means Partitional-Clustering Algorithm
- Employs a square-error criterion
- Steps
  - 1. Select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters.
  - 2. Generate a new partition by assigning each sample to the closest cluster center.
  - 3. Compute new cluster centers as the centroids of the clusters.
  - 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes).
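A minimal NumPy sketch of these steps; the initial partition is passed in explicitly so that the worked example on the following slides can be reproduced (the function name and the stopping test on stable membership are my own choices, and the sketch assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, labels, max_iter=100):
    # Plain K-means: alternate centroid computation (step 3) and reassignment (step 2)
    for _ in range(max_iter):
        K = labels.max() + 1
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)       # assign each sample to the closest center
        if np.array_equal(new_labels, labels):  # membership has stabilized
            break
        labels = new_labels
    return labels, centroids

# five samples of Figure 6.6 with the random initial partition of the example:
# C1 = {x1, x2, x4}, C2 = {x3, x5}
X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], dtype=float)
init = np.array([0, 0, 1, 0, 1])
labels, centroids = kmeans(X, init)
print(labels)      # [0 0 0 1 1]  ->  C1 = {x1, x2, x3}, C2 = {x4, x5}
print(centroids)   # [[0.5, 0.667], [5.0, 1.0]]
```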
50 Example
- x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
- Random initial distribution
  - C1 = {x1, x2, x4} and C2 = {x3, x5}
- The centroids of these two clusters are
  - M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
  - M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}
- Figure 6.6 Five two-dimensional samples for clustering
51 Example (cont.)
- Within-cluster variations
  - e1^2 = (0 - 1.66)^2 + (2 - 0.66)^2 + (0 - 1.66)^2 + (0 - 0.66)^2 + (5 - 1.66)^2 + (0 - 0.66)^2 = 19.36
  - e2^2 = (1.5 - 3.25)^2 + (0 - 1)^2 + (5 - 3.25)^2 + (2 - 1)^2 = 8.12
- Total square-error
  - E^2 = e1^2 + e2^2 = 19.36 + 8.12 = 27.48
52 Example (cont.)
- Reassign all samples
  - d(M1, x1) = 2.14 and d(M2, x1) = 3.40 ⇒ x1 ∈ C1
  - d(M1, x2) = 1.79 and d(M2, x2) = 3.40 ⇒ x2 ∈ C1
  - d(M1, x3) = 0.68 and d(M2, x3) = 2.01 ⇒ x3 ∈ C1
  - d(M1, x4) = 3.41 and d(M2, x4) = 2.01 ⇒ x4 ∈ C2
  - d(M1, x5) = 3.60 and d(M2, x5) = 2.01 ⇒ x5 ∈ C2
- New clusters C1 = {x1, x2, x3} and C2 = {x4, x5}
- New centroids M1 = {0.5, 0.67} and M2 = {5.0, 1.0}
- Errors
  - e1^2 = 4.17 and e2^2 = 2.00
  - E^2 = 6.17
53 Why is K-means so popular?
- Its time complexity is O(nkl); the algorithm has linear time complexity in the size of the data set
  - n is the number of samples
  - k is the number of clusters
  - l is the number of iterations taken by the algorithm to converge
  - k and l are fixed in advance
- Its space complexity is O(k + n).
- It is an order-independent algorithm.
54 Disadvantages of the K-means Algorithm
- A big frustration in using iterative partitional-clustering programs is the lack of guidelines available for choosing K, the number of clusters.
- The K-means algorithm is very sensitive to noise and outlier data points.
  - K-medoids uses the most centrally located object (the medoid) in a cluster as the cluster representative.
55 6.5 Incremental Clustering
- There are more and more applications where it is necessary to cluster a large collection of data.
- "Large"
  - 1960s: several thousand samples for clustering
  - Now: millions of samples of high dimensionality
- Problem: there are applications where the entire data set cannot be stored in main memory because of its size.
56 Possible Approaches
- Divide-and-conquer approach
  - The data set is stored in secondary memory and subsets of this data are clustered independently, followed by a merging step to yield a clustering of the entire set.
- Incremental-clustering algorithm
  - Data are stored in secondary memory and data items are transferred to main memory one at a time for clustering. Only the cluster representations are stored permanently in main memory to alleviate space limitations.
- A parallel implementation of a clustering algorithm
  - The advantages of parallel computers increase the efficiency of the divide-and-conquer approach.
57 Incremental-Clustering Algorithm - Steps
- 1. Assign the first data item to the first cluster.
- 2. Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, e.g., the distance between the new item and the existing cluster centroids. After every addition of a new item to an existing cluster, recompute a new value for the centroid.
- 3. Repeat step 2 till all the data samples are clustered.
58 Features of the Incremental-Clustering Algorithm
- Advantages
  - The space requirements of the incremental algorithm are very small: only the cluster centroids need to be kept in main memory.
  - The time requirements are also small: the algorithm is noniterative.
- Disadvantages
  - It is not order-independent.
59 Example - Figure 6.6
- x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
- Input order: x1 → x2 → x3 → x4 → x5
- The threshold level of similarity between clusters is d = 3.
- Steps
  - The first sample x1 becomes the first cluster C1 = {x1}. The coordinates of x1 become the coordinates of the centroid M1 = {0, 2}.
60 Example (cont.)
- Start the analysis of the other samples.
  - The second sample x2 is compared with M1: d(x2, M1) = (0^2 + 2^2)^(1/2) = 2.0 < 3. Therefore, x2 ∈ C1 and the new centroid is M1 = {0, 1}.
  - The third sample x3 is compared with the centroid M1 (still the only centroid!): d(x3, M1) = (1.5^2 + 1^2)^(1/2) = 1.8 < 3, so x3 ∈ C1 ⇒ C1 = {x1, x2, x3} ⇒ M1 = {0.5, 0.66}.
  - The fourth sample x4 is compared with the centroid M1: d(x4, M1) = (4.5^2 + 0.66^2)^(1/2) = 4.55 > 3, so a new cluster C2 = {x4} is formed, with the new centroid M2 = {5, 0}.
61 Example (cont.)
- The fifth sample x5 is compared with both cluster centroids: d(x5, M1) = (4.5^2 + 1.34^2)^(1/2) = 4.70 > 3 and d(x5, M2) = (0^2 + 2^2)^(1/2) = 2 < 3, so C2 = {x4, x5} ⇒ M2 = {5, 1}.
- All samples are analyzed and the final clustering solution is C1 = {x1, x2, x3} and C2 = {x4, x5}.
- The reader may check that the result of the incremental-clustering process will not be the same if the order of the samples is different.
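A minimal sketch of this incremental procedure, assuming Euclidean distance to the cluster centroids and the threshold d = 3 used above (the function name and the running-mean centroid update are my own choices):

```python
import numpy as np

def incremental_clustering(samples, threshold):
    # Process the samples in arrival order; keep only centroids
    # (plus member indices, here purely for reporting) in memory.
    clusters, centroids = [], []
    for idx, x in enumerate(np.asarray(samples, dtype=float)):
        dists = [np.linalg.norm(x - m) for m in centroids]
        if dists and min(dists) < threshold:
            k = int(np.argmin(dists))
            clusters[k].append(idx)
            n = len(clusters[k])
            centroids[k] = centroids[k] + (x - centroids[k]) / n   # running mean
        else:
            clusters.append([idx])          # open a new cluster
            centroids.append(x.copy())
    return clusters, centroids

X = [[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]]
clusters, centroids = incremental_clustering(X, threshold=3.0)
print(clusters)    # [[0, 1, 2], [3, 4]]  ->  C1 = {x1, x2, x3}, C2 = {x4, x5}
print(centroids)   # [array([0.5, 0.667]), array([5., 1.])]
```

Feeding the samples in a different order can produce a different final partition, which is the order-dependence mentioned above.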
62 Cluster Feature Vector
- Components of the clustering feature (CF)
  - the number of points (samples) in the cluster
  - the centroid of the cluster
  - the radius of the cluster: the square root of the average mean-squared distance from the centroid to the points in the cluster
- It is very important that we do not need the set of points in the cluster to compute a new CF.
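One common way to realize this idea (used in BIRCH implementations) is to store the triple (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. The centroid and radius can then be derived, and a new point is absorbed by simple addition, without ever revisiting the points already summarized. A minimal sketch, with names of my own choosing:

```python
import numpy as np

class CF:
    # Clustering feature kept as (N, LS, SS); centroid and radius are derived from it.
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1
        self.ls = p.copy()          # linear sum of the points
        self.ss = float(p @ p)      # sum of squared norms of the points

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    @property
    def centroid(self):
        return self.ls / self.n

    @property
    def radius(self):
        # square root of the average squared distance from the centroid to the points
        mean_sq = self.ss / self.n - float(self.centroid @ self.centroid)
        return float(np.sqrt(max(mean_sq, 0.0)))

cf = CF([0, 2])
for p in ([0, 0], [1.5, 0]):
    cf.add(p)
print(cf.n, cf.centroid, cf.radius)   # 3 [0.5, 0.667] ~1.18
```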
63 BIRCH Clustering Algorithm
- We have to mention that this technique is very efficient for two reasons
  - CFs occupy less space than any other representation of clusters.
  - CFs are sufficient for calculating all the values involved in making clustering decisions.
64 K-Nearest Neighbor Algorithm
- If the samples contain categorical data, then we do not have a method to calculate centroids as representatives of the clusters.
- The K-nearest neighbor approach may be used to estimate distances (or similarities) between samples and existing clusters.
65 K-Nearest Neighbor Algorithm - Steps
- 1. The distances between the new sample and all previous samples, already classified into clusters, are computed.
- 2. The distances are sorted in increasing order, and the K samples with the smallest distance values are selected.
- 3. The voting principle is applied: the new sample is added (classified) to the cluster that contains the largest number of the K selected samples.
66 Example
- Given six 6-dimensional categorical samples
  - X1 = {A, B, A, B, C, B}
  - X2 = {A, A, A, B, A, B}
  - X3 = {B, B, A, B, A, B}
  - X4 = {B, C, A, B, B, A}
  - X5 = {B, A, B, A, C, A}
  - X6 = {A, C, B, A, B, B}
- clustered into two clusters C1 = {X1, X2, X3} and C2 = {X4, X5, X6}
- Classify the new sample Y = {A, C, A, B, C, A}
67 Example (cont.)
- Using the SMC measure, the similarities with the elements in C1 are
  - SMC(Y, X1) = 4/6 = 0.66, SMC(Y, X2) = 3/6 = 0.50, SMC(Y, X3) = 2/6 = 0.33
- and with the elements in C2
  - SMC(Y, X4) = 4/6 = 0.66, SMC(Y, X5) = 2/6 = 0.33, SMC(Y, X6) = 2/6 = 0.33
- Sorted similarities: 0.66 ≥ 0.66 ≥ 0.50 ≥ 0.33 ≥ 0.33 ≥ 0.33
- Using the 1-nearest neighbor rule (K = 1), the new sample cannot be classified because there are two samples (X1 and X4) with the same, highest similarity (smallest distance), and one of them is in class C1 while the other is in class C2.
68 Example (cont.)
- Using the 3-nearest neighbor rule (K = 3) and selecting the three largest similarities in the set, we can see that two of these samples (X1 and X2) belong to class C1, and only one sample (X4) belongs to class C2.
- Using a simple voting system: Y ∈ class C1.
69 How to Evaluate a Clustering Algorithm?
- The first step in evaluation is actually an assessment of the data domain rather than the clustering algorithm itself.
- Cluster validity is the second step, when we expect to have our data clusters.
- It is subjective.
70 Validation Studies for Clustering Algorithms
- External assessment
  - Compares the discovered structure to an a priori structure.
- Internal examination
  - Tries to determine if the discovered structure is intrinsically appropriate for the data.
- Both assessments are subjective and domain-dependent.
- Relative test
  - Compares two structures obtained either from different cluster methodologies or by using the same methodology but with different clustering parameters, such as the order of input samples.
  - We still need to resolve the question of how to select the structures for comparison.
71 Keep in Mind
- Every clustering algorithm will find clusters in a given data set whether they exist or not; the data should, therefore, be subjected to tests for clustering tendency before applying a clustering algorithm, followed by a validation of the clusters generated by the algorithm.
- There is no best clustering algorithm; therefore, a user is advised to try several algorithms on a given data set.