Learning the threshold in Hierarchical Agglomerative Clustering

1
Learning the threshold in Hierarchical
Agglomerative Clustering
  • Kristine Daniels
  • Christophe Giraud-Carrier

Speaker: Ngai Wang Kay
2
Hierarchical clustering
[Figure: a dendrogram over objects d1, d2, d3, with a horizontal line marking the distance threshold at which it is cut into flat clusters]
3
Distance metric
  • Single-link distance metric: the minimum of the
    simple distances (e.g. Euclidean distances)
    between the objects in the two clusters.

4
Distance metric
  • Complete-link distance metric: the maximum of
    the simple distances between the objects in the
    two clusters.

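To make the two linkage definitions above concrete, here is a minimal sketch in Python (the function names and the toy clusters are mine, not from the presentation):

```python
import math

def euclidean(a, b):
    """Simple (Euclidean) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Single-link: minimum simple distance over all cross-cluster pairs."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Complete-link: maximum simple distance over all cross-cluster pairs."""
    return max(euclidean(a, b) for a in c1 for b in c2)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(c1, c2))    # 2.0 — closest pair: (1,0) and (3,0)
print(complete_link(c1, c2))  # 5.0 — farthest pair: (0,0) and (5,0)
```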
5
Threshold determination
  • Some applications may just want a set of clusters
    for a particular threshold instead of a
    dendrogram.
  • A more efficient clustering algorithm may be
    developed for such a case.
  • There are many possible thresholds.
  • So, it is hard to determine the threshold that
    gives an accurate clustering result (based on a
    measure against the correct clusters).

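As a sketch of extracting "a set of clusters for a particular threshold": SciPy (my choice of library; the presentation names none) can build a complete-link dendrogram and cut it at a given distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 1-D points: three near 0, two near 5.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
Z = linkage(X, method='complete')                   # complete-link dendrogram
labels = fcluster(Z, t=1.0, criterion='distance')   # flat clusters at threshold 1.0
print(labels)  # the first three points share one label, the last two another
```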
6
Threshold determination
  • Suppose C1, …, Cn are the correct clusters and
    H1, …, Hm are the computed clusters.
  • An F-measure is used to determine the accuracy of
    the computed clusters as follows
  • F(Ci, Hj) = 2 P(Ci, Hj) R(Ci, Hj) / (P(Ci, Hj) + R(Ci, Hj)),
    with precision P(Ci, Hj) = |Ci ∩ Hj| / |Hj| and
    recall R(Ci, Hj) = |Ci ∩ Hj| / |Ci|

7
Threshold determination
  • F = Σi (|Ci| / N) maxj F(Ci, Hj),
    where N is the dataset size
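The F-measure above can be sketched in Python, representing each cluster as a set of object ids (the function name and example clusters are mine):

```python
def f_measure(correct, computed, N):
    """Overall F-measure of computed clusters H1..Hm against correct
    clusters C1..Cn; `correct` and `computed` are lists of sets of
    object ids, N is the dataset size."""
    total = 0.0
    for C in correct:
        best = 0.0
        for H in computed:
            overlap = len(C & H)
            if overlap:
                precision = overlap / len(H)
                recall = overlap / len(C)
                best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each correct cluster's best match by its relative size.
        total += (len(C) / N) * best
    return total

correct = [{0, 1, 2}, {3, 4}]
computed = [{0, 1}, {2}, {3, 4}]
print(f_measure(correct, computed, 5))  # ≈ 0.88
```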
8
Semi-supervised algorithm
  • Select a random subset S of the dataset.
  • Label the correct clusters of the data in S.
  • Cluster S using the previous algorithm.
  • Compute the F-measure value for each threshold in
    the dendrogram.
  • Find the threshold with the highest F-measure
    value.
  • Cluster the dataset using this threshold.

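The steps above can be sketched end-to-end: cluster the labelled sample, evaluate the F-measure at every merge distance in its dendrogram, and keep the best threshold. This sketch assumes SciPy's complete-link implementation; all function names are mine:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def f_measure(true_labels, pred_labels):
    """Clustering F-measure: size-weighted best-match F over the correct clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    N = len(true_labels)
    total = 0.0
    for c in np.unique(true_labels):
        C = true_labels == c
        best = 0.0
        for h in np.unique(pred_labels):
            H = pred_labels == h
            inter = np.sum(C & H)
            if inter:
                p, r = inter / H.sum(), inter / C.sum()
                best = max(best, 2 * p * r / (p + r))
        total += (C.sum() / N) * best
    return total

def learn_threshold(S, S_labels):
    """Cluster the labelled sample S and return the merge distance whose
    flat clustering maximises the F-measure against the given labels."""
    Z = linkage(S, method='complete')
    best_t, best_f = 0.0, -1.0
    for t in Z[:, 2]:  # each merge distance is a candidate threshold
        f = f_measure(S_labels, fcluster(Z, t=t, criterion='distance'))
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy labelled sample: two well-separated groups.
S = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
t, f = learn_threshold(S, [0, 0, 0, 1, 1])
# The learned threshold can then be applied to cluster the full dataset.
labels = fcluster(linkage(S, method='complete'), t=t, criterion='distance')
print(t, f, labels)
```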
9
Sample set
  • Preliminary experiments show that a sample set of
    size 50 gives reasonable clustering results.
  • The time complexity of hierarchical clustering is
    usually O(N²) or higher in simple distance
    computations and numerical comparisons.
  • So, learning the threshold may be a very small
    cost in comparison to that of clustering the
    dataset.

10
Experimental results
  • Experiments are conducted by complete-link
    clustering on various real datasets from the UCI
    repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
  • They were originally collected for classification
    tasks.
  • The class labels of the data are used as the
    cluster labels in these experiments.

11
Experimental results
12
Experimental results
13
Experimental results
  • Because of the nature of the data, there may be
    many good threshold values.
  • So, large differences between the target and
    learned thresholds need not yield large
    differences between the corresponding F-measure
    values.

14
Experimental results
15
Experimental results
  • The Vehicle dataset shows a huge difference in
    the number of clusters but a moderate difference
    in the F-measure.
  • The Car dataset suffers from a serious loss of
    the F-measure, but the difference in the number
    of clusters is small.
  • These anomalies may be explained, in part, by the
    sparseness of the data, the skewness of the
    underlying class distributions, and the fact that
    the cluster labels are derived from the
    classification labels.

16
Experimental results
  • The Diabetes dataset achieves an F-measure value
    close to optimal with fewer clusters when using
    the learned threshold.
  • In summary, the learned threshold achieves
    clustering results close to the optimal ones at a
    fraction of the computational cost of clustering
    the whole dataset.

17
Conclusion
  • Hierarchical clustering does not produce a single
    clustering result but a dendrogram, a series of
    nested clusters based on distance thresholds.
  • This leads to the open problem of choosing the
    preferred threshold.
  • An efficient semi-supervised algorithm is
    proposed to obtain such a threshold.
  • Experimental results show the clustering results
    obtained using the learned threshold are close to
    optimal.