Jeremy Tantrum and Werner Stuetzle - PowerPoint PPT Presentation

1
Hierarchical Clustering Revisited
  • Jeremy Tantrum and Werner Stuetzle
  • (also Alejandro Murua)
  • Department of Statistics,
  • University of Washington

This work has been supported by NSA grant 62-1942
2
Model Based Hybrid Clustering of Large Data Sets
  • Introduction
  • Hierarchical Model Based Clustering
  • Hybrid Clustering

3
Introduction
Goal: Detect that there are 5 or 6 groups, and assign
observations to groups.
4
Non-Parametric Clustering
Introduction
  • Premise
  • Observations are sampled from a density p(x)
  • Groups correspond to modes of p(x)

5
Non-Parametric Clustering
Introduction
Method: Estimate p(x) non-parametrically and
find the significant modes of the estimate.
6
Model Based Clustering
Introduction
  • Premise
  • Observations are sampled from a mixture density
  • $p(x) = \sum_g \pi_g \, p_g(x)$
  • Groups correspond to mixture components
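To make the premise concrete, here is a minimal sketch of a one-dimensional mixture density and the posterior probability that an observation belongs to each component. It assumes Gaussian components; the function names and the example parameters are illustrative and do not come from the slides.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, sds):
    """p(x) = sum over g of weight_g * p_g(x)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

def posterior_probabilities(x, weights, means, sds):
    """Probability that x came from each component, by Bayes' rule."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical two-component example (values chosen for illustration only).
weights, means, sds = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
post = posterior_probabilities(0.5, weights, means, sds)
assigned_group = post.index(max(post))  # assign to the most probable component
```

Assigning each observation to the component with the largest posterior probability is the usual way a fitted mixture is turned into a clustering.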

7
Model Based Clustering
Introduction
Method: Estimate the mixing proportions $\pi_g$ and the parameters of each component density $p_g(x)$.
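The slides do not say which estimator is used. One standard choice is the EM algorithm, sketched below for a two-component Gaussian mixture in one dimension. The starting values, the variance floor, and the example data are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, n_iter=100):
    """EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    n = len(data)
    overall_mean = sum(data) / n
    sd0 = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Crude starting values (an assumption): extremes as means, pooled spread.
    weights, means, sds = [0.5, 0.5], [min(data), max(data)], [sd0, sd0]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            parts = [weights[g] * normal_pdf(x, means[g], sds[g]) for g in range(2)]
            total = sum(parts)
            resp.append([p / total for p in parts])
        # M-step: weighted updates of proportions, means, and spreads.
        for g in range(2):
            n_g = sum(r[g] for r in resp)
            weights[g] = n_g / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            sds[g] = math.sqrt(max(var, 1e-6))  # floor avoids a collapsing variance
    return weights, means, sds

# Hypothetical data: two well-separated clumps.
data = [-0.4, -0.2, 0.0, 0.2, 0.4, 5.6, 5.8, 6.0, 6.2, 6.4]
weights, means, sds = em_two_gaussians(data)
```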
8
Model Based vs Non-Parametric
Introduction
  • Model Based
  • Pro: Can estimate the number of groups
  • Con: Misleading results if model assumptions
    are not met
  • Non-Parametric
  • Pro: Not dependent on model assumptions
  • Con: No way of automatically choosing the
    number of clusters

9
Hierarchical Model Based Clustering
Model Based Clustering
  • Start with every point being its own cluster
  • Repeatedly merge the two closest clusters
  • "Closest" is measured by the decrease in likelihood
    when two clusters are merged
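The merging loop can be sketched as follows. The slides do not state the covariance model, so this sketch assumes the simplest one: spherical Gaussian clusters with a common variance. Under that assumption the merge with the smallest decrease in classification likelihood is the merge with the smallest increase in within-cluster sum of squares (Ward's criterion), which is what `merge_cost` computes. The data are one-dimensional and made up for illustration.

```python
def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    return na * nb / (na + nb) * (mean_a - mean_b) ** 2

def agglomerate(points, n_clusters):
    """Start with singletons and repeatedly merge the two closest clusters."""
    clusters = [[x] for x in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical 1-D data with two obvious groups.
groups = agglomerate([0.0, 0.2, 0.1, 5.0, 5.3, 5.1], n_clusters=2)
```

The exhaustive pairwise search is quadratic per merge, which is fine for a sketch but not for large data sets.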

[Figure: densities p(x), p1(x), and p2(x)]
10
Choosing the Number of Clusters
Model Based Clustering
  • Choose number of clusters by maximizing the
    Bayesian Information Criterion
  • r = number of parameters
  • n = number of observations
  • Log likelihood penalized for complexity
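The formula itself did not survive in the transcript. A common convention in the model-based clustering literature, consistent with the bullets above (larger is better, penalty growing with r and n), is the following, where $L(\hat{\theta})$ denotes the maximized likelihood. Treat the exact form as an assumption rather than the slide's own equation.

```latex
\mathrm{BIC} = 2 \log L(\hat{\theta}) - r \log n
```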

11
Problems with MBC
  • Problem: Results of MBC can be misleading when
    model assumptions are not satisfied!

13
Hybrid Clustering
  • Finds collections of mixture components which
    model the same group
  • Use unimodality tests to prune the hierarchical
    clustering tree
  • Project data onto the linear discriminant direction
  • Use Hartigan's DIP test of unimodality

14
Projection Plots
Hybrid Clustering
  • Invented by Gnanadesikan, Kettenring, and
    Landwehr
  • Project onto Linear Discriminant coordinate
    direction which minimizes ratio
  • Problem
  • For small numbers of observations with high
    dimensionality, there will exist a direction
    which separates any two groups.
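As an illustration of the projection step, the sketch below computes Fisher's linear discriminant direction for two groups in two dimensions and projects the points onto it. It uses the textbook form w proportional to S_w^{-1}(m_a - m_b), where S_w is the pooled within-group scatter matrix; the ratio shown on the slide is not recoverable from the transcript, so this form is an assumption. The data are invented.

```python
import math

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fisher_direction(group_a, group_b):
    """Unit vector w proportional to S_w^{-1} (mean_a - mean_b) in 2-D."""
    ma, mb = mean2(group_a), mean2(group_b)
    sxx = sxy = syy = 0.0
    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    for pts, m in ((group_a, ma), (group_b, mb)):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("within-group scatter matrix is singular")
    dmx, dmy = ma[0] - mb[0], ma[1] - mb[1]
    # Apply the closed-form inverse of a symmetric 2x2 matrix.
    wx = (syy * dmx - sxy * dmy) / det
    wy = (-sxy * dmx + sxx * dmy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

def project(points, w):
    return [x * w[0] + y * w[1] for x, y in points]

# Hypothetical groups separated mainly along the first coordinate.
group_a = [(0.0, 0.0), (1.0, 0.2), (0.5, 1.0), (0.2, 0.6)]
group_b = [(4.0, 0.1), (5.0, 0.9), (4.5, 0.3), (4.2, 1.1)]
w = fisher_direction(group_a, group_b)
proj_a, proj_b = project(group_a, w), project(group_b, w)
```

The singular-scatter check is where the problem described above shows up: with few observations relative to the dimension, the within-group scatter matrix degenerates and a perfectly separating direction always exists.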

15
Projection Plots
Hybrid Clustering
Perfect separation of 3 Points in 2 dimensions
16
Projection Plots
Hybrid Clustering
  • Solution Project data onto principal components
    before projecting onto LDA direction.
  • PCA doesn't take into account group assignment
  • Project data onto principal components
  • Project onto linear discriminant direction
  • Do test of unimodality

17
Hartigan's DIP Test of Unimodality
Hybrid Clustering
  • For 1-dimensional data
  • Find nonparametric MLE of the closest unimodal
    distribution function
  • Calculate DIP statistic
  • For p-values: take repeated samples from a single
    Gaussian fitted to the data
  • Cluster into 2 clusters
  • Project onto 1 dimension
  • Calculate DIP test
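The p-value recipe (repeated samples from a single Gaussian fitted to the data) can be sketched independently of the test statistic. The code below does not implement the dip statistic, which requires Hartigan's algorithm; `largest_gap` is a simple stand-in chosen only so the example runs, and the simulation count and seed are arbitrary.

```python
import math
import random

def largest_gap(sample):
    """Stand-in statistic (NOT the dip): largest gap between sorted points, in SD units."""
    xs = sorted(sample)
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return max(b - a for a, b in zip(xs, xs[1:])) / sd

def gaussian_reference_p_value(data, statistic, n_sim=200, seed=0):
    """Monte Carlo p-value against a single Gaussian fitted to the data."""
    rng = random.Random(seed)
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    observed = statistic(data)
    exceed = 0
    for _ in range(n_sim):
        simulated = [rng.gauss(mean, sd) for _ in range(n)]
        if statistic(simulated) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)

# Hypothetical 1-D projection with two clearly separated groups.
bimodal = [i / 49 for i in range(50)] + [10 + i / 49 for i in range(50)]
p_value = gaussian_reference_p_value(bimodal, largest_gap)
```

To follow the slides, the dip statistic would be passed in place of `largest_gap`; the simulation loop stays the same.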

18
Hartigan's DIP Test of Unimodality
Hybrid Clustering
22
GKL-Silverman Plot
  • Project data onto 1 dimension and then plot the
    density
  • Use a Gaussian smoother to estimate density
  • Parameterized by the width of the smoother: as the
    width increases, the number of modes decreases
  • Closest unimodal: the unimodal density estimate with
    the smallest width of smoother
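A minimal sketch of the idea: smooth one-dimensional data with a Gaussian kernel, count the modes on a grid, and search for the smallest width that still gives a unimodal estimate. The grid size, the tolerance, and the data are assumptions, and the mode count is a grid approximation, so the result is only accurate to roughly the grid spacing.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with smoother width h."""
    total = sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)
    return total / (len(data) * h * math.sqrt(2.0 * math.pi))

def count_modes(data, h, grid_size=400):
    """Count local maxima of the estimate on an evenly spaced grid."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    step = (hi - lo) / (grid_size - 1)
    dens = [kde(lo + i * step, data, h) for i in range(grid_size)]
    # '>' on the left and '>=' on the right counts a flat-topped peak once.
    return sum(1 for i in range(1, grid_size - 1)
               if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1])

def critical_bandwidth(data, tol=1e-3):
    """Smallest width (to within tol) at which the estimate is unimodal."""
    lo, hi = 0.0, max(data) - min(data)
    if hi == 0.0:
        return 0.0
    while count_modes(data, hi) > 1:  # widen until the estimate is unimodal
        hi *= 2.0
    # Bisection relies on the mode count not increasing with the width,
    # which holds for a Gaussian smoother.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid) > 1:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical projected data with two clumps.
data = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
h_crit = critical_bandwidth(data)
```

Evaluating `kde` at `h_crit` gives the "closest unimodal" estimate described above, which can be plotted against estimates at smaller widths.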

23
Hybrid Clustering
p-value 0.43
24
Hybrid Clustering
p-value 0.76
25
Hybrid Clustering
p-value 0.00
26
Italian Olive Oils
Olive Oil Example
  • 572 olive oils from 9 areas of Italy
  • Areas: 4 from southern Italy, 2 from Sardinia,
    3 from northern Italy
  • Data: percentage composition of 8 fatty acids
  • MBC finds 20 mixture components
  • Fowlkes-Mallows index: 0.39
  • Pruning reduces to 7 clusters
  • Fowlkes-Mallows index: 0.55
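For reference, the Fowlkes-Mallows index compares two partitions by counting pairs of observations: it is the number of pairs placed together in both partitions, divided by the geometric mean of the numbers of pairs placed together in each. A small sketch follows; the labels in the example are made up and are not the olive oil data.

```python
import math
from itertools import combinations

def fowlkes_mallows(true_labels, pred_labels):
    """Fowlkes-Mallows index between two labelings of the same observations."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    both = same_true = same_pred = 0
    for i, j in combinations(range(len(true_labels)), 2):
        together_true = true_labels[i] == true_labels[j]
        together_pred = pred_labels[i] == pred_labels[j]
        same_true += together_true
        same_pred += together_pred
        both += together_true and together_pred
    if same_true == 0 or same_pred == 0:
        return 0.0  # convention chosen here for the degenerate case
    return both / math.sqrt(same_true * same_pred)

# Hypothetical labels: one observation is assigned to the wrong group.
score = fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 1])
```

The index is 1 when the two partitions agree exactly, regardless of how the groups are labeled, and moves toward 0 as agreement drops.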

27
(Pruned) Hierarchical Clustering Tree
Olive Oil Example
28
Cluster Assignments
Olive Oil Example
Before Pruning
After Pruning
29
Contributions
  • Assessment
  • Posterior probability plots
  • Misclassification probabilities
  • Hybrid Clustering
  • Merges ideas of model based and non-parametric
    clustering
  • Clusters can consist of several mixture
    components
  • Less dependency on choice of covariance model

30
Extensions
  • Apply pruning ideas to non-parametric
    hierarchical clustering
  • Ideas aren't restricted to model based clustering
  • Produces a purely non-parametric clustering
    method with objective way of selecting clusters.

31
(No Transcript)