Title: Cluster Validation
1. Cluster Validation
- Cluster validation
  - Assess the quality and reliability of clustering results.
- Why validation?
  - To avoid finding clusters formed by chance
  - To compare clustering algorithms
  - To choose clustering parameters
    - e.g., the number of clusters in the K-means algorithm
2. Clusters Found in Random Data
(Figure: random points.)
3. Aspects of Cluster Validation
- Comparing the clustering results to ground truth (externally known results).
  - External index
- Evaluating the quality of clusters without reference to external information.
  - Uses only the data
  - Internal index
- Determining the reliability of clusters.
  - At what confidence level are the clusters not formed by chance?
  - Statistical framework
4. Comparing to Ground Truth
- Notation
  - N: number of objects in the data set
  - P = {P1, ..., Pm}: the set of ground truth clusters
  - C = {C1, ..., Cn}: the set of clusters reported by a clustering algorithm
- The incidence matrix
  - N x N (both rows and columns correspond to objects)
  - Pij = 1 if Oi and Oj belong to the same ground truth cluster in P; Pij = 0 otherwise
  - Cij = 1 if Oi and Oj belong to the same cluster in C; Cij = 0 otherwise
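In code, an incidence matrix can be built directly from a list of cluster labels. A minimal sketch (the function name is illustrative; any hashable labels work):

```python
def incidence_matrix(labels):
    """N x N incidence matrix: entry (i, j) is 1 if objects i and j
    carry the same cluster label, 0 otherwise."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]
```

The same function yields P from the ground-truth labels and C from the labels reported by a clustering algorithm.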
5. External Index
- A pair of data objects (Oi, Oj) falls into one of the following categories:
  - SS: Cij = 1 and Pij = 1 (agree)
  - DD: Cij = 0 and Pij = 0 (agree)
  - SD: Cij = 1 and Pij = 0 (disagree)
  - DS: Cij = 0 and Pij = 1 (disagree)
- Rand index = (SS + DD) / (SS + SD + DS + DD)
  - May be dominated by DD
- Jaccard coefficient = SS / (SS + SD + DS)
  - Excludes DD, so it is not dominated by the many "both different" pairs
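Both indices follow directly from the four pair counts. A minimal sketch working from label lists rather than explicit incidence matrices (function names are illustrative):

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Count SS, SD, DS, DD over all object pairs, given ground-truth
    labels (P) and predicted cluster labels (C)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_p = truth[i] == truth[j]
        same_c = pred[i] == pred[j]
        if same_c and same_p:
            ss += 1
        elif same_c:
            sd += 1
        elif same_p:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(truth, pred):
    ss, sd, ds, dd = pair_counts(truth, pred)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(truth, pred):
    ss, sd, ds, _ = pair_counts(truth, pred)  # DD deliberately excluded
    return ss / (ss + sd + ds)
```

For identical clusterings both indices equal 1; on the same imperfect clustering the Jaccard coefficient is lower than the Rand index because it does not get credit for the DD pairs.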
6. Internal Index
- Ground truth may be unavailable
- Use only the data to measure cluster quality
- Measure the homogeneity and separation of clusters
  - SSE: sum of squared errors
  - Calculate the correlation between clustering results and the distance matrix
7. Sum of Squared Error
- Homogeneity is measured by the within-cluster sum of squares:
  WSS = sum_i sum_{x in Ci} (x - mi)^2
  - Exactly the objective function of K-means
- Separation is measured by the between-cluster sum of squares:
  BSS = sum_i |Ci| (m - mi)^2
  - where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the centroid of the whole data set
- BSS + WSS = constant (the total sum of squares of the data set)
- A larger number of clusters tends to result in a smaller WSS
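The two sums and their constant total can be checked numerically. A minimal sketch (function and helper names are illustrative; points are coordinate tuples):

```python
def wss_bss(points, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares.

    Returns (WSS, BSS); their sum equals the total sum of squares
    around the global centroid, so it is constant for the data set.
    """
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    m = centroid(points)  # centroid of the whole data set
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    wss = bss = 0.0
    for cl in clusters.values():
        mi = centroid(cl)                      # cluster centroid
        wss += sum(sq_dist(p, mi) for p in cl)  # homogeneity
        bss += len(cl) * sq_dist(mi, m)         # separation
    return wss, bss
```

Putting every point in its own cluster drives WSS to 0 while BSS absorbs the whole total, which is why WSS alone keeps improving as the number of clusters grows.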
8. Sum of Squared Error
(Figure: clusterings for K = 1, K = 2, and K = 4.)
9. Sum of Squared Error
- Can also be used to estimate the number of clusters.
10. Internal Measures: SSE
- SSE curve for a more complicated data set
(Figure: SSE of clusters found using K-means.)
11. Correlation with Distance Matrix
- Distance matrix
  - Dij is the distance between objects Oi and Oj.
- Incidence matrix
  - Cij = 1 if Oi and Oj belong to the same cluster, Cij = 0 otherwise.
- Compute the correlation between the two matrices
  - Since both matrices are symmetric, only n(n-1)/2 entries need to be calculated.
- A strong correlation indicates good clustering. (With a distance matrix the correlation is negative: pairs in the same cluster should have small distances.)
12. Correlation with Distance Matrix
- Given the distance matrix D = {d11, d12, ..., dnn} and the incidence matrix C = {c11, c12, ..., cnn}:
- The correlation r between D and C is given by

  r = sum_{i<j} (dij - mean(d)) (cij - mean(c)) / sqrt( sum_{i<j} (dij - mean(d))^2 * sum_{i<j} (cij - mean(c))^2 )

  where the sums run over the n(n-1)/2 pairs with i < j.
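The formula translates directly into code over the n(n-1)/2 upper-triangular entries. A sketch assuming Euclidean distance (function names are illustrative):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def distance_incidence_corr(points, labels):
    """Correlation between distance-matrix and incidence-matrix
    entries, using only the n(n-1)/2 pairs with i < j."""
    d, c = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d.append(sqrt(sum((a - b) ** 2
                             for a, b in zip(points[i], points[j]))))
            c.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(d, c)
```

For tight, well-separated clusters the value approaches -1; for labels that cut across the true structure it moves toward 0 or becomes positive.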
13. Measuring Cluster Validity via Correlation
- Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
(Figure: two data sets; Corr = -0.9235 and Corr = -0.5810.)
14. Clusters Found in Random Data
(Figure: random points.)
15. Using Similarity Matrix for Cluster Validation
- Order the similarity matrix with respect to cluster labels and inspect visually.
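The reordering step can be sketched as follows; the similarity 1/(1 + distance) is an assumed choice, and any similarity measure works:

```python
from math import sqrt

def ordered_similarity_matrix(points, labels):
    """Similarity matrix with rows and columns reordered so objects
    sharing a cluster label sit next to each other; good clusterings
    then show high-similarity diagonal blocks."""
    order = sorted(range(len(points)), key=lambda i: labels[i])

    def sim(i, j):
        d = sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
        return 1.0 / (1.0 + d)

    return [[sim(i, j) for j in order] for i in order]
```

Plotting the result as a heat map makes the inspection visual: crisp clusters give bright blocks on the diagonal, while random data gives a uniform-looking matrix.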
16. Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
(Figure: reordered similarity matrix, K-means.)
17. Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
(Figure: reordered similarity matrix, Complete Link.)
18. Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
(Figure: reordered similarity matrix, DBSCAN.)
19. Reliability of Clusters
- We need a framework to interpret any measure.
  - For example, if our evaluation measure has the value 10, is that good, fair, or poor?
- Statistics provide a framework for cluster validity.
  - The more atypical a clustering result is, the more likely it represents valid structure in the data.
20. Statistical Framework for SSE
- Example
  - Compare an SSE of 0.005 against three clusters found in random data.
  - Histogram of the SSE of 500 sets of 100 random data points distributed over the range 0.2-0.8 for the x and y values.
(Figure: SSE histogram; observed SSE = 0.005.)
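The reference distribution on this slide can be reproduced by Monte Carlo simulation. A sketch with a deliberately minimal K-means; the parameters match the slide, but the implementation details (random initialization, fixed iteration count) are assumptions:

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]

def kmeans_sse(points, k, iters=10, rng=None):
    """SSE after a few iterations of plain K-means with random init."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k),
                         key=lambda c: sq_dist(p, centers[c]))].append(p)
        centers = [centroid(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def sse_null_distribution(n_sets=500, n_points=100, k=3, seed=1):
    """SSE of K-means on n_sets sets of random 2-D points uniform on
    [0.2, 0.8] x [0.2, 0.8] -- the histogram from the slide."""
    rng = random.Random(seed)
    sses = []
    for _ in range(n_sets):
        pts = [(rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8))
               for _ in range(n_points)]
        sses.append(kmeans_sse(pts, k, rng=rng))
    return sses
```

An observed SSE of 0.005 falls far below every simulated value, so its empirical p-value is essentially 0 and the three clusters are very unlikely to be a chance artifact.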
21. Statistical Framework for Correlation
- Correlation of incidence and distance matrices for the K-means clusterings of the following two data sets.
(Figure: correlation histogram of random data; Corr = -0.9235 and Corr = -0.5810.)
22. Hypergeometric Distribution
- Given that M of the N genes in the data set are associated with term T: if we randomly draw n genes from the data set, what is the probability that m of the selected n genes will be associated with T?
- P(X = m) = C(M, m) C(N - M, n - m) / C(N, n), where C(a, b) denotes the binomial coefficient "a choose b".
23. P-Value
- Based on the hypergeometric distribution, the probability of having m or more of the n drawn genes associated with T can be calculated by summing the probabilities of the draw containing m, m + 1, ..., min(n, M) genes associated with T.
- This tail probability, P(X >= m), is the p-value of over-representation: the chance that a random gene list would contain at least as many genes annotated with T as observed.
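The tail sum can be computed exactly with binomial coefficients. A minimal sketch; the argument names follow the notation on the slides:

```python
from math import comb

def hypergeom_pmf(N, M, n, m):
    """P(X = m): exactly m of n genes drawn without replacement from N
    carry term T, when M of the N genes carry T."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

def over_representation_pvalue(N, M, n, m):
    """p-value of over-representation: P(X >= m)."""
    return sum(hypergeom_pmf(N, M, n, k) for k in range(m, min(n, M) + 1))
```

For example, drawing n = 5 genes from N = 10 of which M = 5 carry T, observing all m = 5 annotated gives p = 1/C(10, 5) = 1/252, a strongly over-represented term.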