Learning the threshold in Hierarchical Agglomerative Clustering

1
Learning the threshold in Hierarchical
Agglomerative Clustering
  • Kristine Daniels
  • Christophe Giraud-Carrier

Speaker: Ngai Wang Kay
2
Hierarchical clustering
[Figure: a dendrogram over objects d1, d2, d3, with a horizontal line marking the distance threshold at which it is cut into flat clusters]
3
Distance metric
  • Single-link distance metric: the minimum of the
    simple distances (e.g. Euclidean distances)
    between the objects in the two clusters.

4
Distance metric
  • Complete-link distance metric: the maximum of
    the simple distances between the objects in the
    two clusters.

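To make the two linkage definitions above concrete, here is a minimal sketch in Python (the function names and the toy clusters are mine, not from the presentation):

```python
import math

def euclidean(a, b):
    """Simple (Euclidean) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Single-link: minimum simple distance over all cross-cluster pairs."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Complete-link: maximum simple distance over all cross-cluster pairs."""
    return max(euclidean(a, b) for a in c1 for b in c2)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(c1, c2))    # 2.0 — closest pair: (1,0) and (3,0)
print(complete_link(c1, c2))  # 5.0 — farthest pair: (0,0) and (5,0)
```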
5
Threshold determination
  • Some applications may just want a set of clusters
    for a particular threshold instead of a
    dendrogram.
  • A more efficient clustering algorithm may be
    developed for such a case.
  • There are many possible thresholds.
  • So, it is hard to determine the threshold that
    gives an accurate clustering result (based on a
    measure against the correct clusters).

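As a sketch of extracting "a set of clusters for a particular threshold": SciPy (my choice of library; the presentation names none) can build a complete-link dendrogram and cut it at a given distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 1-D points: three near 0, two near 5.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
Z = linkage(X, method='complete')                   # complete-link dendrogram
labels = fcluster(Z, t=1.0, criterion='distance')   # flat clusters at threshold 1.0
print(labels)  # the first three points share one label, the last two another
```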
6
Threshold determination
  • Suppose C1, …, Cn are the correct clusters and
    H1, …, Hm are the computed clusters.
  • An F-measure is used to determine the accuracy of
    the computed clusters as follows
  • F(Ci, Hj) = 2 P(Ci, Hj) R(Ci, Hj) / (P(Ci, Hj) + R(Ci, Hj)),
    with precision P(Ci, Hj) = |Ci ∩ Hj| / |Hj| and
    recall R(Ci, Hj) = |Ci ∩ Hj| / |Ci|

7
Threshold determination
  • F = Σi (|Ci| / N) maxj F(Ci, Hj),
    where N is the dataset size
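The F-measure above can be sketched in Python, representing each cluster as a set of object ids (the function name and example clusters are mine):

```python
def f_measure(correct, computed, N):
    """Overall F-measure of computed clusters H1..Hm against correct
    clusters C1..Cn; `correct` and `computed` are lists of sets of
    object ids, N is the dataset size."""
    total = 0.0
    for C in correct:
        best = 0.0
        for H in computed:
            overlap = len(C & H)
            if overlap:
                precision = overlap / len(H)
                recall = overlap / len(C)
                best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each correct cluster's best match by its relative size.
        total += (len(C) / N) * best
    return total

correct = [{0, 1, 2}, {3, 4}]
computed = [{0, 1}, {2}, {3, 4}]
print(f_measure(correct, computed, 5))  # ≈ 0.88
```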
8
Semi-supervised algorithm
  • Select a random subset S of the dataset.
  • Label the correct clusters of the data in S.
  • Cluster S using the previous algorithm.
  • Compute the F-measure value for each threshold in
    the dendrogram.
  • Find the threshold with the highest F-measure
    value.
  • Cluster the dataset using this threshold.

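The steps above can be sketched end-to-end: cluster the labelled sample, evaluate the F-measure at every merge distance in its dendrogram, and keep the best threshold. This sketch assumes SciPy's complete-link implementation; all function names are mine:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def f_measure(true_labels, pred_labels):
    """Clustering F-measure: size-weighted best-match F over the correct clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    N = len(true_labels)
    total = 0.0
    for c in np.unique(true_labels):
        C = true_labels == c
        best = 0.0
        for h in np.unique(pred_labels):
            H = pred_labels == h
            inter = np.sum(C & H)
            if inter:
                p, r = inter / H.sum(), inter / C.sum()
                best = max(best, 2 * p * r / (p + r))
        total += (C.sum() / N) * best
    return total

def learn_threshold(S, S_labels):
    """Cluster the labelled sample S and return the merge distance whose
    flat clustering maximises the F-measure against the given labels."""
    Z = linkage(S, method='complete')
    best_t, best_f = 0.0, -1.0
    for t in Z[:, 2]:  # each merge distance is a candidate threshold
        f = f_measure(S_labels, fcluster(Z, t=t, criterion='distance'))
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy labelled sample: two well-separated groups.
S = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
t, f = learn_threshold(S, [0, 0, 0, 1, 1])
# The learned threshold can then be applied to cluster the full dataset.
labels = fcluster(linkage(S, method='complete'), t=t, criterion='distance')
print(t, f, labels)
```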
9
Sample set
  • Preliminary experiments show that a sample set of
    size 50 gives reasonable clustering results.
  • The time complexity of hierarchical clustering is
    usually O(N²) or higher in simple distance
    computations and numerical comparisons.
  • So, learning the threshold may be a very small
    cost in comparison to that of clustering the
    dataset.

10
Experimental results
  • Experiments are conducted by complete-link
    clustering on various real datasets from the UCI
    repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
  • They were originally collected for classification
    tasks.
  • The class labels of the data are used as the
    cluster labels in these experiments.

11
Experimental results
12
Experimental results
13
Experimental results
  • Because of the nature of the data, there may be
    many good threshold values.
  • So, large differences between the target and
    learned thresholds need not yield large
    differences between the corresponding F-measure
    values.

14
Experimental results
15
Experimental results
  • The Vehicle dataset shows a huge difference in
    the number of clusters but a moderate difference
    in the F-measure.
  • The Car dataset suffers from a serious loss of
    the F-measure, but the difference in the number
    of clusters is small.
  • These anomalies may be explained, in part, by the
    sparseness of the data, the skewness of the
    underlying class distributions, and the fact that
    the cluster labels are derived from the
    classification labels.

16
Experimental results
  • The Diabetes dataset achieves an F-measure value
    close to optimal with fewer clusters when using
    the learned threshold.
  • In summary, the learned threshold achieves
    clustering results close to the optimal ones at a
    fraction of the computational cost of clustering
    the whole dataset.

17
Conclusion
  • Hierarchical clustering does not produce a single
    clustering result but a dendrogram, a series of
    nested clusters based on distance thresholds.
  • This leads to the open problem of choosing the
    preferred threshold.
  • An efficient semi-supervised algorithm is
    proposed to obtain such a threshold.
  • Experimental results show the clustering results
    obtained using the learned threshold are close to
    optimal.