Title: Jeremy Tantrum and Werner Stuetzle
1Hierarchical Clustering Revisited
- Jeremy Tantrum and Werner Stuetzle
- (also Alejandro Murua)
- Department of Statistics,
- University of Washington
This work has been supported by NSA grant 62-1942
2Model Based Hybrid Clustering of Large Data Sets
- Introduction
- Hierarchical Model Based Clustering
- Hybrid Clustering
3Introduction
Introduction
Goal: Detect that there are 5 or 6 groups. Assign
observations to groups.
4Non-Parametric Clustering
Introduction
- Premise
- Observations are sampled from a density p(x)
- Groups correspond to modes of p(x)
5Non-Parametric Clustering
Introduction
Method: Estimate p(x) non-parametrically and
find significant modes of the estimate
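A minimal sketch of the idea in one dimension, assuming a Gaussian kernel estimate and a simple grid search for local maxima. The names `gaussian_kde`, `find_modes`, the bandwidth `h`, and the sample data are hypothetical, and the sketch does not assess whether a mode is significant.

```python
import math

def gaussian_kde(data, h):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return density

def find_modes(data, h, grid_size=400):
    """Locate local maxima of the estimate on an evenly spaced grid."""
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    f = gaussian_kde(data, h)
    ys = [f(x) for x in xs]
    return [xs[i] for i in range(1, grid_size - 1)
            if ys[i - 1] < ys[i] >= ys[i + 1]]

# Illustrative data: two well separated groups near 0 and near 5.
sample = [0.0, 0.1, -0.2, 0.3, -0.1, 5.0, 5.2, 4.9, 5.1, 4.8]
modes = find_modes(sample, h=0.5)
```

Each returned grid point is a candidate group centre; the number found depends on the bandwidth chosen.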
6Model Based Clustering
Introduction
- Premise
- Observations are sampled from a mixture density
- $p(x) = \sum_g \pi_g \, p_g(x)$
- Groups correspond to mixture components
7Model Based Clustering
Introduction
Method: Estimate $\pi_g$ and the parameters of $p_g(x)$
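One common way to carry out this estimation is the EM algorithm; the slides do not say which estimator was used, so treat the following as an assumption. It is a minimal one-dimensional sketch with Gaussian components, and the function name `em_gaussian_mixture`, the starting means, and the data are all hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(data, mu_init, n_iter=50):
    """EM for a 1-D Gaussian mixture; mu_init holds one starting mean per component."""
    g, n = len(mu_init), len(data)
    weights = [1.0 / g] * g
    mus = list(mu_init)
    overall_mean = sum(data) / n
    start_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sigmas = [start_sd] * g
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(g)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing proportions, means, and standard deviations.
        for k in range(g):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)  # guard against collapse
    return weights, mus, sigmas

sample = [0.0, 0.1, -0.2, 0.3, -0.1, 5.0, 5.2, 4.9, 5.1, 4.8]
weights, mus, sigmas = em_gaussian_mixture(sample, mu_init=[-1.0, 6.0])
```

Here `weights` plays the role of the mixing proportions and `mus`, `sigmas` the component parameters.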
8Model Based vs Non-Parametric
Introduction
- Model Based
- Pro: Can estimate the number of groups
- Con: Misleading results if model assumptions
are not met
- Non-Parametric
- Pro: Not dependent on model assumptions
- Con: No way of automatically choosing the
number of clusters
9Hierarchical Model Based Clustering
Model Based Clustering
- Start with every point being its own cluster
- Repeatedly merge the two closest clusters
- Closest measured by the decrease in likelihood
when two clusters are merged
[Figure: two panels, each showing the densities p(x), p1(x), and p2(x)]
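The merging loop can be sketched under one specific assumption: spherical Gaussian clusters with a common variance. In that case the decrease in (classification) likelihood from a merge is monotone in the increase in within-cluster sum of squares, so the procedure coincides with Ward's method. Other covariance models give different merge costs. The names `merge_cost` and `agglomerate` and the data are hypothetical.

```python
def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    na, nb, dim = len(a), len(b), len(a[0])
    mean_a = [sum(p[j] for p in a) / na for j in range(dim)]
    mean_b = [sum(p[j] for p in b) / nb for j in range(dim)]
    dist2 = sum((u - v) ** 2 for u, v in zip(mean_a, mean_b))
    return na * nb / (na + nb) * dist2

def agglomerate(points, n_clusters):
    """Start with singletons and repeatedly merge the cheapest pair."""
    clusters = [[p] for p in points]
    costs = []
    while len(clusters) > n_clusters:
        cost, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        costs.append(cost)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, costs

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, costs = agglomerate(points, n_clusters=2)
```

The recorded `costs` trace how much fit is lost at each merge, which is the information a hierarchical tree carries.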
10Choosing the Number of Clusters
Model Based Clustering
- Choose number of clusters by maximizing the
Bayesian Information Criterion (BIC)
- r: number of parameters
- n: number of observations
- Log likelihood penalized for complexity
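The slide names r and n but does not write the criterion out. A form that is common in the model-based clustering literature, and consistent with "maximizing" a penalized log likelihood, is shown below; that this is the exact form used in the talk is an assumption. Here $L(\hat{\theta})$ denotes the maximized likelihood of the fitted mixture.

```latex
\mathrm{BIC} = 2 \log L(\hat{\theta}) - r \log n
```

Under this convention a larger BIC is better: adding clusters raises the log likelihood but also raises r, and the $r \log n$ term charges for the extra parameters.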
11Problems with MBC
- Problem: Results of MBC can be misleading when
model assumptions are not satisfied!
12Problems with MBC
- Problem: Results of MBC can be misleading when
model assumptions are not satisfied!
13Hybrid Clustering
Hybrid Clustering
- Finds collections of mixture components which
model the same group
- Use unimodality tests to prune the hierarchical
clustering tree
- Project data onto Linear Discriminant direction
- Use Hartigan's DIP test of unimodality
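The projection step can be illustrated with the standard Fisher discriminant direction, w = Sw^{-1}(m_a - m_b), where Sw is the pooled within-group scatter matrix. The sketch below handles two groups of 2-D points only; the function names and data are hypothetical, and whether the talk used exactly this estimator is an assumption.

```python
def fisher_direction(group_a, group_b):
    """Fisher discriminant direction for two groups of 2-D points."""
    def mean(g):
        return (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))

    def scatter(g, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in g)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in g)
        syy = sum((p[1] - m[1]) ** 2 for p in g)
        return sxx, sxy, syy

    ma, mb = mean(group_a), mean(group_b)
    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (u + v for u, v in zip(scatter(group_a, ma), scatter(group_b, mb)))
    det = sxx * syy - sxy ** 2
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Apply the inverse of the 2x2 scatter matrix to the mean difference.
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

def project(points, w):
    """One-dimensional projection of each point onto direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

group_a = [(0, 0.0), (1, 0.2), (2, -0.1), (3, 0.1)]
group_b = [(0, 1.0), (1, 1.2), (2, 0.9), (3, 1.1)]
w = fisher_direction(group_a, group_b)
proj_a, proj_b = project(group_a, w), project(group_b, w)
```

The pooled values `proj_a + proj_b` are the one-dimensional sample that a unimodality test would then examine.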
14Projection Plots
Hybrid Clustering
- Invented by Gnanadesikan, Kettenring, and
Landwehr
- Project onto Linear Discriminant coordinate
direction which minimizes ratio
- Problem
- For small numbers of observations with high
dimensionality, there will exist a direction
which separates any two groups.
15Projection Plots
Hybrid Clustering
Perfect separation of 3 Points in 2 dimensions
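The effect can be checked directly: for 3 non-collinear points in 2 dimensions, the system w.x + b = label has an exact solution for any labels, so every split into two groups is perfectly separated along w. A minimal sketch using Cramer's rule follows; the function names and the points are hypothetical.

```python
from itertools import product

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def separating_direction(points, labels):
    """Solve w.x + b = label exactly for 3 non-collinear 2-D points."""
    a = [[x, y, 1.0] for x, y in points]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("points are collinear")
    solution = []
    for col in range(3):
        a_col = [row[:] for row in a]
        for r in range(3):
            a_col[r][col] = labels[r]
        solution.append(det3(a_col) / d)
    w1, w2, b = solution
    return (w1, w2), b

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
projections = {}
for labels in product([-1.0, 1.0], repeat=3):
    if len(set(labels)) < 2:
        continue  # skip labelings that put all points in one group
    w, b = separating_direction(points, labels)
    projections[labels] = [w[0] * x + w[1] * y for x, y in points]
```

Every entry of `projections` places the two groups at two distinct values on the line, whatever the grouping, which is why separation alone is not evidence of real structure here.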
16Projection Plots
Hybrid Clustering
- Solution: Project data onto principal components
before projecting onto LDA direction.
- PCA doesn't take into account group assignment
- Project data onto principal components
- Project onto linear discriminant direction
- Do test of unimodality
17Hartigan's DIP Test of Unimodality
Hybrid Clustering
- For 1-dimensional data
- Find nonparametric MLE of the closest unimodal
distribution function
- Calculate DIP statistic
- For p-values: Take repeated samples from single
Gaussian fitted to data
- Cluster into 2 clusters
- Project onto 1 dimension
- Calculate DIP test
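The p-value step can be sketched as a Monte Carlo loop: fit a single Gaussian to the data, draw repeated samples of the same size, and report how often the simulated statistic is at least as large as the observed one. The dip statistic itself is not implemented here; `statistic` is a caller-supplied function, and `largest_gap_ratio` is only a placeholder that stands in for it. All names and data are hypothetical.

```python
import random
import statistics

def monte_carlo_p_value(data, statistic, n_sim=200, seed=0):
    """Share of Gaussian reference samples with statistic >= the observed value."""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    observed = statistic(data)
    exceed = sum(
        1 for _ in range(n_sim)
        if statistic([rng.gauss(mu, sigma) for _ in data]) >= observed
    )
    return exceed / n_sim

def largest_gap_ratio(values):
    """Placeholder statistic (NOT the dip): largest neighbour gap over the range."""
    s = sorted(values)
    return max(b - a for a, b in zip(s, s[1:])) / (s[-1] - s[0])

# Illustrative projected data with two clearly separated groups.
bimodal = [0.05 * i for i in range(20)] + [10 + 0.05 * i for i in range(20)]
p_bimodal = monte_carlo_p_value(bimodal, largest_gap_ratio)
```

A small p-value suggests the projected data are not unimodal, so the two clusters would be kept separate; a large one suggests they can be merged.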
18Hartigan's DIP Test of Unimodality
Hybrid Clustering
19Hartigan's DIP Test of Unimodality
Hybrid Clustering
20Hartigan's DIP Test of Unimodality
Hybrid Clustering
21Hartigan's DIP Test of Unimodality
Hybrid Clustering
22GKL-Silverman Plot
- Project data onto 1 dimension and then plot the
density
- Use a Gaussian smoother to estimate density
- Parameterized by width of smoother: as width
increases, number of modes decreases
- Closest unimodal
- Unimodal density estimate with smallest width of
smoother
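Because the number of modes of a Gaussian kernel estimate does not increase as the width grows, the smallest width that gives a unimodal estimate can be found by bisection. The sketch below assumes a grid-based mode count and an upper bracket wide enough to give one mode; `count_modes`, `critical_bandwidth`, and the data are hypothetical names and values.

```python
import math

def count_modes(data, h, grid_size=400):
    """Number of local maxima of a Gaussian kernel estimate on a grid."""
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    ys = []
    for i in range(grid_size):
        x = lo + i * step
        ys.append(sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data))
    return sum(1 for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1])

def critical_bandwidth(data, lo=1e-2, hi=None, tol=1e-3):
    """Smallest width (up to tol) for which the estimate has at most one mode."""
    if hi is None:
        hi = max(data) - min(data)  # assumed wide enough to be unimodal
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid) <= 1:
            hi = mid  # unimodal: try a narrower smoother
        else:
            lo = mid  # still multimodal: widen
    return hi

sample = [0.0, 0.1, -0.2, 0.3, -0.1, 5.0, 5.2, 4.9, 5.1, 4.8]
h_crit = critical_bandwidth(sample)
```

The estimate at width `h_crit` is the "closest unimodal" density in the sense of the slide: the least smoothed estimate that still has a single mode.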
23Hybrid Clustering
p-value 0.43
24Hybrid Clustering
p-value 0.76
25Hybrid Clustering
p-value 0.00
26Italian Olive Oils
Olive Oil Example
- 572 olive oils from 9 areas of Italy
- Areas: 4 from southern Italy / 2 from Sardinia /
3 from northern Italy
- Data: Percentage composition of 8 fatty acids
- MBC finds 20 mixture components
- Fowlkes-Mallows Index: 0.39
- Pruning reduces to 7 clusters
- Fowlkes-Mallows Index: 0.55
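For reference, the Fowlkes-Mallows index compares two partitions by counting pairs of observations: it is TP / sqrt((TP + FP)(TP + FN)), where TP counts pairs grouped together in both partitions. A minimal sketch with small made-up label vectors (not the olive oil data):

```python
import math
from itertools import combinations

def fowlkes_mallows(labels_true, labels_pred):
    """Pair-counting agreement between two partitions; 1.0 means identical."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both partitions
        elif same_pred:
            fp += 1  # together only in the clustering
        elif same_true:
            fn += 1  # together only in the reference grouping
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))

truth = [0, 0, 0, 1, 1, 1]
oversplit = [0, 0, 1, 1, 2, 2]  # hypothetical clustering with too many groups
fm_oversplit = fowlkes_mallows(truth, oversplit)
fm_perfect = fowlkes_mallows(truth, truth)
```

Splitting a true group across several clusters inflates the false-negative count and lowers the index, which is in line with the index rising once components are merged by pruning.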
27(Pruned) Hierarchical Clustering Tree
Olive Oil Example
28Cluster Assignments
Olive Oil Example
Before Pruning
After Pruning
29Contributions
- Assessment
- Posterior probability plots
- Misclassification probabilities
- Hybrid Clustering
- Merges ideas of model based and non-parametric
clustering
- Clusters can consist of several mixture
components
- Less dependency on choice of covariance model
30Extensions
- Apply pruning ideas to non-parametric
hierarchical clustering
- Ideas aren't restricted to model based clustering
- Produces a purely non-parametric clustering
method with an objective way of selecting clusters.