DB Seminar Series: Validation and Presentation of Clustering Results
1
DB Seminar Series: Validation and Presentation of
Clustering Results
  • Presented by
  • Kevin Yip
  • 26 March 2003

2
Introduction
  • Focus of most research work on data clustering:
  • Accuracy: tailoring for different data
    characteristics (cluster shapes, presence of
    noise attributes and outliers, small number of
    objects, etc.)
  • Speed: performance (statistical summaries, data
    structures, sampling, etc.)
  • Any other issues to consider?

3
Introduction
  • Question 1: given a dataset with unknown data
    characteristics, if different clustering
    algorithms give different results, which one is
    more reliable?
  • Reliable:
  • Object similarity: objects of the same cluster
    are similar; objects of different clusters are
    dissimilar.
  • Robustness: stability of results when different
    algorithm parameters are used.
  • ⇒ The validation problem (a confusing term!)

4
Introduction
  • Question 2: given a set of clustering results,
    how should they be presented so that users can
    gain the most insight from them?
  • ⇒ The presentation problem.

5
The validation problem
  • Two types of validation
  • External validation (based on some gold
    standards).
  • Internal validation (based on some statistics of
    the results).

6
The validation problem
  • External validation:
  • Confusion matrix
  • Evaluation function
  • Precision (Cluster A: 10/10, Cluster A: 8/12)
  • Recall (Cluster A: 10/10, Cluster A: 8/10)
  • Many others (precision and recall are sketched below)
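
A minimal Python sketch (not from the original slides) of these two measures, assuming a confusion matrix whose rows are clusters and whose columns are gold-standard classes, with each cluster scored against its majority class:

    import numpy as np

    # Hypothetical confusion matrix: rows = clusters, columns = gold classes.
    confusion = np.array([
        [10, 2],   # cluster 1: 10 objects of class A, 2 of class B
        [4,  8],   # cluster 2: 4 objects of class A, 8 of class B
    ])

    for i, row in enumerate(confusion):
        majority = row.argmax()                # the class this cluster represents
        precision = row[majority] / row.sum()  # correct fraction of the cluster
        recall = row[majority] / confusion[:, majority].sum()  # fraction of that class recovered
        print(f"cluster {i + 1}: precision = {precision:.2f}, recall = {recall:.2f}")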

7
The validation problem
  • Problems with external validation:
  • Gold standards are usually not available in real
    situations (otherwise we could simply do
    classification instead).
  • There can be multiple ways to assign class labels.

8
The validation problem
  • Internal validation
  • Method 1: criterion functions (e.g. average
    within-cluster distance to centroid, C-index,
    U-statistics; the first is sketched below);
    validates clusters or sets of clusters.
  • The function values are easy to compute, and the
    clusters produced by different algorithms can be
    easily compared.
  • However, the functions can be biased against
    certain kinds of data, algorithms, outlier
    handling methods, etc. (e.g. spherical vs.
    non-spherical clusters).
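
As an illustration (my own sketch, not the slides' exact definition), the first criterion named above in plain numpy:

    import numpy as np

    def avg_dist_to_centroid(X, labels):
        """Average Euclidean distance of each object to the centroid of
        its own cluster; lower values mean tighter clusters."""
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            centroid = members.mean(axis=0)
            total += np.linalg.norm(members - centroid, axis=1).sum()
        return total / len(X)

    # Hypothetical usage with random data and arbitrary labels:
    X = np.random.rand(100, 5)
    labels = np.random.randint(0, 3, size=100)
    print(avg_dist_to_centroid(X, labels))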

9
The validation problem
  • Karypis, Han and Kumar, 1999.

10
The validation problem
  • Method 1b: criterion function with a null
    hypothesis (a sketch follows).
  • If each record forms its own cluster, the average
    within-cluster distance to centroid is 0, but
    that does not imply a good clustering.
  • For each cluster, compare its value of the
    criterion function to that of a cluster with the
    same characteristics (e.g. no. of records) from
    random data.
  • This usually requires heavy computation, or is
    even computationally infeasible on large datasets.
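
One way to realize the comparison (an assumed procedure on my part, not necessarily the slides'): score random clusterings obtained by permuting the labels, which keeps every cluster's size but destroys any real structure, and compare the real criterion value against that null distribution:

    import numpy as np

    def null_scores(criterion, X, labels, trials=100, seed=0):
        """Monte Carlo null reference for a clustering criterion.
        Permuting labels preserves cluster sizes only."""
        rng = np.random.default_rng(seed)
        return np.array([criterion(X, rng.permutation(labels))
                         for _ in range(trials)])

    # For a "smaller is better" criterion (e.g. avg_dist_to_centroid above),
    # a real score far below the null mean suggests genuine structure.

The trials loop re-evaluates the criterion many times, which is exactly the heavy computation the slide warns about.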

11
The validation problem
  • Method 2: strong cluster determination.
  • Repeat the clustering process with different
    parameter values (e.g. no. of target clusters)
    and identify the records that are always found in
    the same cluster; validates the clustering
    algorithm (in terms of robustness). A sketch
    follows this slide.
  • This method identifies groups of records that are
    similar under different conditions.
  • The absolute quality of the clusters is not
    determined.
  • Sometimes it is hard to find records that are
    always clustered together; there is no guarantee
    on how many records end up in the strong clusters.
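
A sketch of one concrete realization (my construction): collect the label vectors from runs with different parameters, build a co-clustering matrix, and look for pairs that are together in every run:

    import numpy as np

    def coclustering(label_runs):
        """Fraction of runs in which each pair of records falls in the
        same cluster; entries equal to 1.0 mark candidate strong pairs."""
        runs = np.asarray(label_runs)        # shape: (n_runs, n_records)
        n = runs.shape[1]
        same = np.zeros((n, n))
        for labels in runs:
            same += labels[:, None] == labels[None, :]
        return same / len(runs)

    # Hypothetical label vectors from three runs (e.g. k = 2, 3, 4):
    runs = [[0, 0, 0, 1, 1, 1],
            [0, 0, 1, 2, 2, 2],
            [0, 0, 1, 2, 3, 3]]
    strong_pairs = coclustering(runs) == 1.0   # pairs (0,1) and (4,5) here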

12
The validation problem
  • Method 3: agreement between different attributes
    (unsupervised LOOCV); validates data and
    algorithm.
  • Take out an attribute A and perform clustering.
    Calculate the average similarity of the values of
    A within each cluster. Repeat for all attributes
    and obtain an aggregate index.

13
The validation problem
  • Assumption: objects of the same class have
    similar attribute value patterns in all
    attributes.
  • It evaluates the quality of clusters by a
    semi-external criterion: the left-out attribute.
  • An example index is sketched after this slide.
  • In practical use, the index values are plotted
    against different parameter values (e.g. no. of
    target clusters) and the curves of different
    algorithms are compared.
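
A rough sketch of the procedure (k-means stands in for the unspecified clustering algorithm, and within-cluster variance of the left-out attribute stands in for the slide's index; both choices are my assumptions):

    import numpy as np
    from sklearn.cluster import KMeans   # stand-in clustering algorithm

    def loocv_index(X, k):
        """Leave one attribute out, cluster on the rest, then measure how
        homogeneous the left-out attribute is within each cluster (mean
        within-cluster variance; lower = better), averaged over attributes."""
        scores = []
        for a in range(X.shape[1]):
            reduced = np.delete(X, a, axis=1)
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
            scores.append(np.mean([X[labels == c, a].var()
                                   for c in np.unique(labels)]))
        return float(np.mean(scores))

Plotting this index against k for several algorithms gives the comparison curves described above.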

14
The validation problem
  • Yeung, Haynor and Ruzzo 2001 (Rat CNS data 9
    time points, 112 genes)

15
The validation problem
  • Method 4: method 2 + method 3.
  • Similar to method 3, clusters are produced with
    each attribute taken out in turn. But instead of
    using the left-out attribute to evaluate the
    clustering result, the results are compared to
    the result obtained with all attributes.
  • This is similar to method 2 in that the results
    are similar if the clusters are strong. But
    instead of finding strong clusters, the goal of
    this method is to evaluate how stable the
    clustering algorithms are.
  • An example similarity measure for the cluster
    sets is sketched below.
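
The slide's own measure is not reproduced in this transcript; one common choice is pairwise agreement (the Rand index), sketched here as an illustration:

    import numpy as np
    from itertools import combinations

    def rand_index(a, b):
        """Fraction of object pairs on which two clusterings agree:
        together in both results, or separated in both."""
        a, b = np.asarray(a), np.asarray(b)
        agree = sum((a[i] == a[j]) == (b[i] == b[j])
                    for i, j in combinations(range(len(a)), 2))
        n_pairs = len(a) * (len(a) - 1) // 2
        return agree / n_pairs

    # Method 4: compare the all-attribute result with each
    # leave-one-attribute-out result and aggregate the agreements.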

16
The validation problem
  • This method may suggest rejecting some algorithms
    whose results are not stable.
  • However, it may not be able to suggest which
    algorithm is good, as the quality of the clusters
    is not considered.
  • For instance, if an algorithm always puts most of
    the objects into a single large cluster, it will
    have very high stability, yet the clusters are
    not meaningful.

17
The validation problem
  • The above methods may not be able to validate
    projected clusters:
  • Method 1: some criterion functions increase or
    decrease monotonically with the number of
    selected attributes. Normalizing the function
    value by this number may not work well either.

18
The validation problem
  • Method 1b: when generating reference results on
    random data, it is very hard to obtain clusters
    with the desired numbers of records and selected
    attributes.
  • Method 2: projected clustering algorithms are
    usually parameter-sensitive, so it is not easy to
    find strong clusters.
  • Methods 3 and 4: the basic assumption of these
    methods (similarity in all attributes) contradicts
    the basic assumption of projected clustering.
  • A new validation problem for projected clusters:
    validating the selected attributes.

19
The validation problem
  • Summary
  • Different clustering algorithms work well on data
    with different data characteristics.
  • If the data characteristics of a dataset are
    unknown, and external validation criteria are not
    available, some internal validation methods may
    be helpful.
  • Each kind of method provides a different kind of
    validation with different assumptions.
  • Just as no single clustering algorithm is the
    best in all situations, no single validation
    method can provide accurate validation in all
    cases.

20
The presentation problem
  • Question: how much to present?
  • Extreme 1:
  • Cluster 1: records 1, 2, 7, 10, 16. Cluster 2:
    records 3, 4, 9, 14, 20.
  • Easy to understand; suitable for initial
    validation by domain experts.
  • Can omit a lot of important information.

21
The presentation problem
  • Extreme 2:

    Rebuilding score lists with 500 clusters. Totally 12 possible merges. 500 clusters remained.
    Best merge: cluster with first record 1 and cluster with first record 229. Score = 0.95.
    No. of selected attributes = 19, average relevance of the selected attributes = 1.0, mutual disagreement = 0.0.
    Cluster with first record 1: no. of rec = 1, no. of selected attr = 20, avg rel of the selected attr = 1.0.
    Cluster with first record 229: no. of rec = 1, no. of selected attr = 20, avg rel of the selected attr = 1.0.
    Summary of the merged cluster: average relevance of the selected attributes = 1.0; 19 selected attributes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19; 2 records: 1, 229.
    499 clusters remained.

  • Too detailed; difficult to read or interpret.
  • Not for presentation, but such detailed logs are
    good to store for further investigation.

22
The presentation problem
  • In between the extremes, some useful summaries:
  • Dimension reduction (PCA, ICA, MDS, FastMap,
    etc.) followed by a 2-D plot (a PCA sketch
    follows this slide)
  • Yeung and Ruzzo, 2001.
  • It is not always possible to find a good 2-D
    space where the points of different clusters are
    well separated.
  • But even if some clusters overlap, the clearly
    separated parts can already be very useful.
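
A minimal numpy/matplotlib sketch of the PCA variant (ICA, MDS and FastMap would each need their own code; the data and labels here are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt

    def pca_2d(X):
        """Project data onto its first two principal components via SVD
        of the mean-centered matrix."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:2].T

    X = np.random.rand(200, 10)                 # hypothetical dataset
    labels = np.random.randint(0, 3, size=200)  # hypothetical cluster labels
    P = pca_2d(X)
    plt.scatter(P[:, 0], P[:, 1], c=labels)     # color points by cluster
    plt.show()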

23
The presentation problem
  • Reachability plots
  • Ankerst et al., 1999.
  • Allow the identification of sub-clusters.

24
The presentation problem
  • Ng, Sander and Sleumer, 2001.

25
The presentation problem
  • Dendrograms (from hierarchical algorithms)
  • Alizadeh et al., 2000.
  • Corresponding confusion matrix

26
The presentation problem
  • Finding leaf ordering
  • There are 2^(n-1) possible orderings.
  • Greedy: each time a new cluster is formed, the
    locally optimal ordering is determined.

27
The presentation problem
  • Finding leaf ordering
  • There are 2^(n-1) possible orderings.
  • Greedy: each time a new cluster is formed, the
    locally optimal ordering is determined. Greedy is
    not globally optimal, but efficient (O(n) time).
    A sketch follows.
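
A sketch of the greedy choice (my construction of the idea described above): at each merge, try both orientations of each child's ordered leaf list and keep the combination whose junction leaves are most similar:

    def greedy_merge(left, right, sim):
        """Choose the orientation of two ordered leaf lists that maximizes
        the similarity of the two leaves meeting at the junction; constant
        extra work per merge, hence linear over a whole dendrogram."""
        best, best_score = None, float("-inf")
        for L in (left, left[::-1]):
            for R in (right, right[::-1]):
                score = sim(L[-1], R[0])
                if score > best_score:
                    best, best_score = L + R, score
        return best

    # Hypothetical usage: 1-D items, negative distance as similarity.
    order = greedy_merge([3, 1], [7, 5], sim=lambda a, b: -abs(a - b))
    print(order)   # [1, 3, 5, 7]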

28
The presentation problem
  • Finding leaf ordering
  • Globally optimal: maximizing the sum of the
    similarities of adjacent elements in the ordering.

Orderings better than those produced by heuristics
(as validated by domain knowledge) have been
observed. An O(n^4)-time dynamic programming
algorithm is available.
29
The presentation problem
  • Bar-Joseph, Gifford and Jaakkola, 2001.

30
The presentation problem
  • Pros and cons of dendrograms
  • Users can find clusters in any subtrees, not
    necessarily rooted at the last-formed nodes.
  • Sub-clusters can be identified.
  • The relationship between different clusters is
    shown.
  • Hard to interpret cluster boundaries when there
    are many objects.

31
Conclusions
  • When clustering is used in data analysis, it is
    rare to obtain excellent results from a single
    run of a single algorithm.
  • Usually, before any interesting findings emerge,
    many results are produced by different algorithms
    with different data preprocessing, similarity
    functions, parameter values, etc.
  • Internal validation methods provide hints for
    evaluating the results and choosing suitable
    algorithms.

32
Conclusions
  • However, not all internal validation methods are
    appropriate in every situation. An inappropriate
    method can falsely suggest that a clustering
    result is good.
  • Similarly, good clustering results can be ruined
    by a bad presentation method. The way the results
    are presented should always take into account how
    they will be used in further investigation.

33
References
  • Internal validation
  • Marie-Odile Delorme and Alain Henaut, Merging of
    Distance Matrices and Classification by Dynamic
    Clustering, CABIOS vol. 4, no. 4, 1988.
  • George Karypis, Eui-Hong (Sam) Han and Vipin
    Kumar, CHAMELEON: A Hierarchical Clustering
    Algorithm Using Dynamic Modeling, IEEE Computer
    vol. 32, no. 8, pp. 68-75, 1999.
  • Zhexue Huang, David W. Cheung and Michael K. Ng,
    An Empirical Study on the Visual Cluster
    Validation Method with Fastmap, DASFAA 2001.

34
References
  • Internal validation
  • Ka Yee Yeung, David R. Haynor and Walter L.
    Ruzzo, Validating Clustering for Gene Expression
    Data, Bioinformatics vol. 17, no. 4, 2001.
  • Susmita Datta and Somnath Datta, Comparisons and
    Validation of Statistical Clustering Techniques
    for Microarray Gene Expression Data,
    Bioinformatics vol. 19, no. 4, 2003.

35
References
  • Result presentation
  • Michael B. Eisen et al., Cluster Analysis and
    Display of Genome-wide Expression Patterns, Proc.
    Natl. Acad. Sci. USA vol. 95, 1998.
  • Mihael Ankerst et al., OPTICS: Ordering Points
    To Identify the Clustering Structure, SIGMOD
    1999.
  • Ash A. Alizadeh et al., Distinct Types of Diffuse
    Large B-cell Lymphoma Identified by Gene
    Expression Profiling, Nature vol. 403, Feb 2000.

36
References
  • Result presentation
  • K. Y. Yeung and W. L. Ruzzo, An Empirical Study
    of Principal Component Analysis for Clustering
    Gene Expression Data, Bioinformatics, vol. 17,
    no. 9, 2001.
  • Ziv Bar-Joseph, David K. Gifford and Tommi S.
    Jaakkola, Fast Optimal Leaf Ordering for
    Hierarchical Clustering, Bioinformatics, vol. 17,
    suppl. 1, 2001.
  • Raymond T. Ng, Jörg Sander and Monica C. Sleumer,
    Hierarchical Cluster Analysis of SAGE Data for
    Cancer Profiling, BIOKDD 2001.

37
  • Thank You!