DB Seminar Series: Validation and Presentation of Clustering Results
1
DB Seminar Series: Validation and Presentation of
Clustering Results
  • Presented by
  • Kevin Yip
  • 26 March 2003

2
Introduction
  • Focus of most research work on data clustering:
  • Accuracy: tailoring for different data
    characteristics (cluster shapes, presence of
    noise attributes and outliers, small number of
    objects, etc.)
  • Speed: performance (statistical summaries, data
    structures, sampling, etc.)
  • Any other issues to consider?

3
Introduction
  • Question 1: given a dataset with unknown data
    characteristics, if different clustering
    algorithms give different results, which one is
    more reliable?
  • Reliable:
  • Object similarity: objects of the same cluster
    are similar; objects of different clusters are
    dissimilar.
  • Robustness: stability of results when different
    algorithm parameters are used.
  • ⇒ The validation problem (a confusing term!)

4
Introduction
  • Question 2: given a set of clustering results,
    how should they be presented so that users can
    gain the most insight from them?
  • ⇒ The presentation problem.

5
The validation problem
  • Two types of validation
  • External validation (based on some gold
    standards).
  • Internal validation (based on some statistics of
    the results).

6
The validation problem
  • External validation:
  • Confusion matrix
  • Evaluation function
  • Precision (Cluster A: 10/10, Cluster A: 8/12)
  • Recall (Cluster A: 10/10, Cluster A: 8/10)
  • Many others (precision and recall are sketched below)
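
A minimal Python sketch (not from the original slides) of these two measures, assuming a confusion matrix whose rows are clusters and whose columns are gold-standard classes, with each cluster scored against its majority class:

    import numpy as np

    # Hypothetical confusion matrix: rows = clusters, columns = gold classes.
    confusion = np.array([
        [10, 2],   # cluster 1: 10 objects of class A, 2 of class B
        [4,  8],   # cluster 2: 4 objects of class A, 8 of class B
    ])

    for i, row in enumerate(confusion):
        majority = row.argmax()                # the class this cluster represents
        precision = row[majority] / row.sum()  # correct fraction of the cluster
        recall = row[majority] / confusion[:, majority].sum()  # fraction of that class recovered
        print(f"cluster {i + 1}: precision = {precision:.2f}, recall = {recall:.2f}")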

7
The validation problem
  • Problems with external validation:
  • Gold standards are usually not available in real
    situations (otherwise we could simply do
    classification instead).
  • There can be multiple ways to assign class labels.

8
The validation problem
  • Internal validation
  • Method 1: criterion functions (e.g. average
    within-cluster distance to centroid, C-index,
    U-statistics; the first is sketched below);
    validates clusters or sets of clusters.
  • The function values are easy to compute, and the
    clusters produced by different algorithms can be
    easily compared.
  • However, the functions can be biased against
    certain kinds of data, algorithms, outlier
    handling methods, etc. (e.g. spherical vs.
    non-spherical clusters).
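
As an illustration (my own sketch, not the slides' exact definition), the first criterion named above in plain numpy:

    import numpy as np

    def avg_dist_to_centroid(X, labels):
        """Average Euclidean distance of each object to the centroid of
        its own cluster; lower values mean tighter clusters."""
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            centroid = members.mean(axis=0)
            total += np.linalg.norm(members - centroid, axis=1).sum()
        return total / len(X)

    # Hypothetical usage with random data and arbitrary labels:
    X = np.random.rand(100, 5)
    labels = np.random.randint(0, 3, size=100)
    print(avg_dist_to_centroid(X, labels))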

9
The validation problem
  • Karypis, Han and Kumar, 1999.

10
The validation problem
  • Method 1b: criterion function with a null
    hypothesis (a sketch follows).
  • If each record forms its own cluster, the average
    within-cluster distance to centroid is 0, but
    that does not imply a good clustering.
  • For each cluster, compare its value of the
    criterion function to that of a cluster with the
    same characteristics (e.g. no. of records) from
    random data.
  • This usually requires heavy computation, or is
    even computationally infeasible on large datasets.
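
One way to realize the comparison (an assumed procedure on my part, not necessarily the slides'): score random clusterings obtained by permuting the labels, which keeps every cluster's size but destroys any real structure, and compare the real criterion value against that null distribution:

    import numpy as np

    def null_scores(criterion, X, labels, trials=100, seed=0):
        """Monte Carlo null reference for a clustering criterion.
        Permuting labels preserves cluster sizes only."""
        rng = np.random.default_rng(seed)
        return np.array([criterion(X, rng.permutation(labels))
                         for _ in range(trials)])

    # For a "smaller is better" criterion (e.g. avg_dist_to_centroid above),
    # a real score far below the null mean suggests genuine structure.

The trials loop re-evaluates the criterion many times, which is exactly the heavy computation the slide warns about.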

11
The validation problem
  • Method 2: strong cluster determination.
  • Repeat the clustering process with different
    parameter values (e.g. no. of target clusters)
    and identify the records that are always found in
    the same cluster; validates the clustering
    algorithm (in terms of robustness). A sketch
    follows this slide.
  • This method identifies groups of records that are
    similar under different conditions.
  • The absolute quality of the clusters is not
    determined.
  • Sometimes it is hard to find records that are
    always clustered together; there is no guarantee
    on how many records end up in the strong clusters.
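
A sketch of one concrete realization (my construction): collect the label vectors from runs with different parameters, build a co-clustering matrix, and look for pairs that are together in every run:

    import numpy as np

    def coclustering(label_runs):
        """Fraction of runs in which each pair of records falls in the
        same cluster; entries equal to 1.0 mark candidate strong pairs."""
        runs = np.asarray(label_runs)        # shape: (n_runs, n_records)
        n = runs.shape[1]
        same = np.zeros((n, n))
        for labels in runs:
            same += labels[:, None] == labels[None, :]
        return same / len(runs)

    # Hypothetical label vectors from three runs (e.g. k = 2, 3, 4):
    runs = [[0, 0, 0, 1, 1, 1],
            [0, 0, 1, 2, 2, 2],
            [0, 0, 1, 2, 3, 3]]
    strong_pairs = coclustering(runs) == 1.0   # pairs (0,1) and (4,5) here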

12
The validation problem
  • Method 3: agreement between different attributes
    (unsupervised LOOCV); validates data and
    algorithm.
  • Take out an attribute A and perform clustering.
    Calculate the average similarity of the values of
    A within each cluster. Repeat for all attributes
    and obtain an aggregate index.

13
The validation problem
  • Assumption: objects of the same class have
    similar attribute value patterns in all
    attributes.
  • It evaluates the quality of clusters by a
    semi-external criterion: the left-out attribute.
  • An example index is sketched after this slide.
  • In practical use, the index values are plotted
    against different parameter values (e.g. no. of
    target clusters) and the curves of different
    algorithms are compared.
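
A rough sketch of the procedure (k-means stands in for the unspecified clustering algorithm, and within-cluster variance of the left-out attribute stands in for the slide's index; both choices are my assumptions):

    import numpy as np
    from sklearn.cluster import KMeans   # stand-in clustering algorithm

    def loocv_index(X, k):
        """Leave one attribute out, cluster on the rest, then measure how
        homogeneous the left-out attribute is within each cluster (mean
        within-cluster variance; lower = better), averaged over attributes."""
        scores = []
        for a in range(X.shape[1]):
            reduced = np.delete(X, a, axis=1)
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
            scores.append(np.mean([X[labels == c, a].var()
                                   for c in np.unique(labels)]))
        return float(np.mean(scores))

Plotting this index against k for several algorithms gives the comparison curves described above.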

14
The validation problem
  • Yeung, Haynor and Ruzzo 2001 (Rat CNS data 9
    time points, 112 genes)

15
The validation problem
  • Method 4: method 2 + method 3.
  • Similar to method 3, clusters are produced with
    each attribute taken out in turn. But instead of
    using the left-out attribute to evaluate the
    clustering result, the results are compared to
    the result obtained with all attributes.
  • This is similar to method 2 in that the results
    are similar if the clusters are strong. But
    instead of finding strong clusters, the goal of
    this method is to evaluate how stable the
    clustering algorithms are.
  • An example similarity measure for the cluster
    sets is sketched below.
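
The slide's own measure is not reproduced in this transcript; one common choice is pairwise agreement (the Rand index), sketched here as an illustration:

    import numpy as np
    from itertools import combinations

    def rand_index(a, b):
        """Fraction of object pairs on which two clusterings agree:
        together in both results, or separated in both."""
        a, b = np.asarray(a), np.asarray(b)
        agree = sum((a[i] == a[j]) == (b[i] == b[j])
                    for i, j in combinations(range(len(a)), 2))
        n_pairs = len(a) * (len(a) - 1) // 2
        return agree / n_pairs

    # Method 4: compare the all-attribute result with each
    # leave-one-attribute-out result and aggregate the agreements.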

16
The validation problem
  • This method may suggest rejecting some algorithms
    whose results are not stable.
  • However, it may not be able to suggest which
    algorithm is good, as the quality of the clusters
    is not considered.
  • For instance, if an algorithm always puts most of
    the objects into a single large cluster, it will
    have very high stability, yet the clusters are
    not meaningful.

17
The validation problem
  • The above methods may not be able to validate
    projected clusters:
  • Method 1: some criterion functions increase or
    decrease monotonically with the number of
    selected attributes. Normalizing the function
    value by this number may not work well either.

18
The validation problem
  • Method 1b: when generating reference results on
    random data, it is very hard to obtain clusters
    with the desired numbers of records and selected
    attributes.
  • Method 2: projected clustering algorithms are
    usually parameter-sensitive, so it is not easy to
    find strong clusters.
  • Methods 3 and 4: the basic assumption of these
    methods (similarity in all attributes) contradicts
    the basic assumption of projected clustering.
  • A new validation problem for projected clusters:
    validating the selected attributes.

19
The validation problem
  • Summary
  • Different clustering algorithms work well on data
    with different data characteristics.
  • If the data characteristics of a dataset are
    unknown, and external validation criteria are not
    available, some internal validation methods may
    be helpful.
  • Each kind of method provides a different kind of
    validation with different assumptions.
  • Just as no single clustering algorithm is the
    best in all situations, no single validation
    method can provide accurate validation in all
    cases.

20
The presentation problem
  • Question: how much to present?
  • Extreme 1:
  • Cluster 1: records 1, 2, 7, 10, 16. Cluster 2:
    records 3, 4, 9, 14, 20.
  • Easy to understand; suitable for initial
    validation by domain experts.
  • Can omit a lot of important information.

21
The presentation problem
  • Extreme 2:

    Rebuilding score lists with 500 clusters. Totally 12 possible merges. 500 clusters remained.
    Best merge: cluster with first record 1 and cluster with first record 229. Score = 0.95.
    No. of selected attributes = 19, average relevance of the selected attributes = 1.0, mutual disagreement = 0.0.
    Cluster with first record 1: no. of rec = 1, no. of selected attr = 20, avg rel of the selected attr = 1.0.
    Cluster with first record 229: no. of rec = 1, no. of selected attr = 20, avg rel of the selected attr = 1.0.
    Summary of the merged cluster: average relevance of the selected attributes = 1.0; 19 selected attributes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19; 2 records: 1, 229.
    499 clusters remained.

  • Too detailed; difficult to read or interpret.
  • Not for presentation, but such detailed logs are
    good to store for further investigation.

22
The presentation problem
  • In between the extremes, some useful summaries:
  • Dimension reduction (PCA, ICA, MDS, FastMap,
    etc.) followed by a 2-D plot (a PCA sketch
    follows this slide)
  • Yeung and Ruzzo, 2001.
  • It is not always possible to find a good 2-D
    space where the points of different clusters are
    well separated.
  • But even if some clusters overlap, the clearly
    separated parts can already be very useful.
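
A minimal numpy/matplotlib sketch of the PCA variant (ICA, MDS and FastMap would each need their own code; the data and labels here are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt

    def pca_2d(X):
        """Project data onto its first two principal components via SVD
        of the mean-centered matrix."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:2].T

    X = np.random.rand(200, 10)                 # hypothetical dataset
    labels = np.random.randint(0, 3, size=200)  # hypothetical cluster labels
    P = pca_2d(X)
    plt.scatter(P[:, 0], P[:, 1], c=labels)     # color points by cluster
    plt.show()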

23
The presentation problem
  • Reachability plots
  • Ankerst et al., 1999.
  • Allow the identification of sub-clusters.

24
The presentation problem
  • Ng, Sander and Sleumer, 2001.

25
The presentation problem
  • Dendrograms (from hierarchical algorithms)
  • Alizadeh et al., 2000.
  • Corresponding confusion matrix

26
The presentation problem
  • Finding leaf ordering
  • There are 2^(n-1) possible orderings.
  • Greedy: each time a new cluster is formed, the
    locally optimal ordering is determined.

27
The presentation problem
  • Finding leaf ordering
  • There are 2^(n-1) possible orderings.
  • Greedy: each time a new cluster is formed, the
    locally optimal ordering is determined. Greedy is
    not globally optimal, but efficient (O(n) time).
    A sketch follows.
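
A sketch of the greedy choice (my construction of the idea described above): at each merge, try both orientations of each child's ordered leaf list and keep the combination whose junction leaves are most similar:

    def greedy_merge(left, right, sim):
        """Choose the orientation of two ordered leaf lists that maximizes
        the similarity of the two leaves meeting at the junction; constant
        extra work per merge, hence linear over a whole dendrogram."""
        best, best_score = None, float("-inf")
        for L in (left, left[::-1]):
            for R in (right, right[::-1]):
                score = sim(L[-1], R[0])
                if score > best_score:
                    best, best_score = L + R, score
        return best

    # Hypothetical usage: 1-D items, negative distance as similarity.
    order = greedy_merge([3, 1], [7, 5], sim=lambda a, b: -abs(a - b))
    print(order)   # [1, 3, 5, 7]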

28
The presentation problem
  • Finding leaf ordering
  • Globally optimal: maximizing the sum of the
    similarities of adjacent elements in the ordering.

Orderings better than those produced by heuristics
(as validated by domain knowledge) have been
observed. An O(n^4)-time dynamic programming
algorithm is available.
29
The presentation problem
  • Bar-Joseph, Gifford and Jaakkola, 2001.

30
The presentation problem
  • Pros and cons of dendrograms
  • Users can find clusters in any subtrees, not
    necessarily rooted at the last-formed nodes.
  • Sub-clusters can be identified.
  • The relationship between different clusters is
    shown.
  • Hard to interpret cluster boundaries when there
    are many objects.

31
Conclusions
  • When clustering is used in data analysis, it is
    rare to obtain excellent results from a single
    run of a single algorithm.
  • Usually, before any interesting findings emerge,
    many results are produced by different algorithms
    with different data preprocessing, similarity
    functions, parameter values, etc.
  • Internal validation methods provide hints for
    evaluating the results and choosing suitable
    algorithms.

32
Conclusions
  • However, not all internal validation methods are
    appropriate in every situation. An inappropriate
    method can falsely suggest that a clustering
    result is good.
  • Similarly, good clustering results can be ruined
    by a bad presentation method. The way the results
    are presented should always take into account how
    they will be used in further investigation.

33
References
  • Internal validation
  • Marie-Odile Delorme and Alain Henaut, Merging of
    Distance Matrices and Classification by Dynamic
    Clustering, CABIOS vol. 4, no. 4, 1988.
  • George Karypis, Eui-Hong (Sam) Han and Vipin
    Kumar, CHAMELEON: A Hierarchical Clustering
    Algorithm Using Dynamic Modeling, IEEE Computer
    vol. 32, no. 8, pp. 68-75, 1999.
  • Zhexue Huang, David W. Cheung and Michael K. Ng,
    An Empirical Study on the Visual Cluster
    Validation Method with Fastmap, DASFAA 2001.

34
References
  • Internal validation
  • Ka Yee Yeung, David R. Haynor and Walter L.
    Ruzzo, Validating Clustering for Gene Expression
    Data, Bioinformatics vol. 17, no. 4, 2001.
  • Susmita Datta and Somnath Datta, Comparisons and
    Validation of Statistical Clustering Techniques
    for Microarray Gene Expression Data,
    Bioinformatics vol. 19, no. 4, 2003.

35
References
  • Result presentation
  • Michael B. Eisen et al., Cluster Analysis and
    Display of Genome-wide Expression Patterns, Proc.
    Natl. Acad. Sci. USA vol. 95, 1998.
  • Mihael Ankerst et al., OPTICS: Ordering Points
    To Identify the Clustering Structure, SIGMOD
    1999.
  • Ash A. Alizadeh et al., Distinct Types of Diffuse
    Large B-cell Lymphoma Identified by Gene
    Expression Profiling, Nature vol. 403, Feb 2000.

36
References
  • Result presentation
  • K. Y. Yeung and W. L. Ruzzo, An Empirical Study
    of Principal Component Analysis for Clustering
    Gene Expression Data, Bioinformatics, vol. 17,
    no. 9, 2001.
  • Ziv Bar-Joseph, David K. Gifford and Tommi S.
    Jaakkola, Fast Optimal Leaf Ordering for
    Hierarchical Clustering, Bioinformatics, vol. 17,
    suppl. 1, 2001.
  • Raymond T. Ng, Jörg Sander and Monica C. Sleumer,
    Hierarchical Cluster Analysis of SAGE Data for
    Cancer Profiling, BIOKDD 2001.

37
  • Thank You!