Jeremy Tantrum and Werner Stuetzle



Slides: 32

Provided by: jeremyt7


Transcript and Presenter's Notes


1
Hierarchical Clustering Revisited

Jeremy Tantrum and Werner Stuetzle
(also Alejandro Murua)
Department of Statistics,
University of Washington

This work has been supported by NSA grant 62-1942
2
Model Based Hybrid Clustering of Large Data Sets

Introduction
Hierarchical Model Based Clustering
Hybrid Clustering

3
Introduction
Goal:
Detect that there are 5 or 6 groups
Assign observations to groups
4
Non-Parametric Clustering
Introduction

Premise
Observations are sampled from a density p(x)
Groups correspond to modes of p(x)

5
Non-Parametric Clustering
Introduction
Method: Estimate p(x) non-parametrically and
find significant modes of the estimate
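
A minimal sketch of the mode-finding idea for 1-dimensional data, assuming a Gaussian kernel and a fixed smoothing width. The function names, the toy sample, and the grid search are illustrative assumptions, and the sketch only locates modes; it does not assess whether they are significant.

```python
import math

def gaussian_kde(data, width):
    """Return a function x -> Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(data) * width * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / width) ** 2) for d in data)

    return density

def find_modes(data, width, grid_size=400):
    """Locate local maxima of the density estimate on an evenly spaced grid."""
    lo, hi = min(data) - 3 * width, max(data) + 3 * width
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    density = gaussian_kde(data, width)
    ys = [density(x) for x in xs]
    return [xs[i] for i in range(1, grid_size - 1)
            if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]]

# Hypothetical toy sample with two well-separated groups.
sample = [0.0, 0.2, -0.1, 0.1, -0.3, 5.0, 5.2, 4.9, 5.1, 4.7]
modes = find_modes(sample, width=0.5)
```

Each entry of `modes` is a grid location where the estimate peaks; observations could then be assigned to the nearest mode.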
6
Model Based Clustering
Introduction

Premise
Observations are sampled from a mixture density
p(x) = \sum_g \pi_g p_g(x)
Groups correspond to mixture components

7
Model Based Clustering
Introduction
Method: Estimate the mixing proportions \pi_g and the parameters of each p_g(x)
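
One common way to carry out this estimation is the EM algorithm. The sketch below assumes 1-dimensional data and Gaussian components; the function names, starting values, and toy sample are illustrative assumptions, not taken from the slides.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(data, means, n_iter=50):
    """EM for a 1-D Gaussian mixture; `means` holds the starting means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            joint = [weights[g] * normal_pdf(x, means[g], variances[g])
                     for g in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate mixing proportions, means, and variances.
        for g in range(k):
            n_g = sum(r[g] for r in resp)
            weights[g] = n_g / len(data)
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var, 1e-6)  # floor guards against collapse
    # Assign each observation to its most probable component.
    labels = [max(range(k), key=lambda g: r[g]) for r in resp]
    return weights, means, variances, labels

# Hypothetical toy sample with two groups.
sample = [0.0, 0.2, -0.1, 0.1, -0.3, 5.0, 5.2, 4.9, 5.1, 4.7]
weights, means, variances, labels = em_gaussian_mixture(sample, means=[1.0, 4.0])
```

The returned `labels` give the group assignment implied by the fitted mixture components.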
8
Model Based vs Non-Parametric
Introduction

Model Based
Pro: Can estimate the number of groups
Con: Misleading results if model assumptions
are not met
Non-Parametric
Pro: Not dependent on model assumptions
Con: No way of automatically choosing the
number of clusters

9
Hierarchical Model Based Clustering
Model Based Clustering

Start with every point being its own cluster
Repeatedly merge the two closest clusters
"Closest" is measured by the decrease in likelihood
when two clusters are merged

[Figure: two panels, each showing densities p(x), p1(x), and p2(x)]
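
The merge rule above can be sketched under one specific assumption: spherical Gaussian clusters with a common variance. In that case the decrease in (classification) likelihood from a merge is proportional to the increase in within-cluster sum of squares, i.e. Ward's criterion. The slides do not say which covariance model was used, so this choice, the function names, and the toy points are assumptions.

```python
def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    dim = len(a[0])
    mean_a = [sum(p[j] for p in a) / na for j in range(dim)]
    mean_b = [sum(p[j] for p in b) / nb for j in range(dim)]
    gap = sum((u - v) ** 2 for u, v in zip(mean_a, mean_b))
    return na * nb / (na + nb) * gap

def agglomerate(points, n_clusters):
    """Start with singleton clusters and repeatedly merge the cheapest pair."""
    clusters = [[p] for p in points]
    merges = []  # cost of each merge, in the order performed
    while len(clusters) > n_clusters:
        cost, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append(cost)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical toy data: two tight groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (4.0, 4.0), (4.1, 4.0), (4.0, 4.3)]
clusters, merges = agglomerate(points, 2)
```

This brute-force search is quadratic in the number of clusters at every step, which is fine for illustration but not for large data sets.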
10
Choosing the Number of Clusters
Model Based Clustering

Choose the number of clusters by maximizing the
Bayesian Information Criterion (BIC)
r: number of parameters
n: number of observations
BIC is the log likelihood penalized for complexity
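
The slide's formula is not part of this transcript. A form consistent with the definitions of r and n above, assuming the sign convention in which larger values are better, is shown below; the symbol L(\hat{\theta}) for the maximized likelihood is notation introduced here.

```latex
\mathrm{BIC} = 2 \log L(\hat{\theta}) - r \log n
```

The first term rewards fit, and the second term grows with the number of parameters, so adding mixture components only pays off if the likelihood improves enough.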

11
Problems with MBC

Problem: Results of MBC can be misleading when
model assumptions are not satisfied!

13
Hybrid Clustering

Finds collections of mixture components which
model the same group
Use unimodality tests to prune the hierarchical
clustering tree
Project data onto the linear discriminant direction
Use Hartigan's DIP test of unimodality

14
Projection Plots
Hybrid Clustering

Invented by Gnanadesikan, Kettenring, and
Landwehr
Project onto Linear Discriminant coordinate
direction which minimizes ratio
Problem:
For small numbers of observations with high
dimensionality, there will exist a direction
which separates any two groups.

15
Projection Plots
Hybrid Clustering
Perfect separation of 3 Points in 2 dimensions
16
Projection Plots
Hybrid Clustering

Solution: Project data onto principal components
before projecting onto LDA direction.
PCA doesn't take into account group assignment
Project data onto principal components
Project onto linear discriminant direction
Do test of unimodality
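
A minimal sketch of the linear discriminant projection step for two groups, assuming the data have already been reduced to 2 dimensions (the principal component step is omitted here). It uses Fisher's direction, proportional to the inverse pooled within-group scatter times the difference in group means; the function names and toy data are illustrative assumptions.

```python
import math

def fisher_direction(group_a, group_b):
    """Unit vector proportional to Sw^{-1} (mean_a - mean_b) for 2-D data."""
    def mean(group):
        n = len(group)
        return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)

    ma, mb = mean(group_a), mean(group_b)
    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for group, m in ((group_a, ma), (group_b, mb)):
        for x, y in group:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy  # zero if the scatter matrix is singular
    d0, d1 = ma[0] - mb[0], ma[1] - mb[1]
    # Apply the explicit 2x2 inverse of the scatter matrix to the mean difference.
    w0 = (syy * d0 - sxy * d1) / det
    w1 = (-sxy * d0 + sxx * d1) / det
    norm = math.hypot(w0, w1)
    return (w0 / norm, w1 / norm)

def project(points, w):
    """Project 2-D points onto direction w, giving 1-D values."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical groups: elongated along the first axis, separated along the second.
group_a = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0)]
group_b = [(0.0, 1.0), (1.0, 1.1), (2.0, 0.9), (3.0, 1.0)]
w = fisher_direction(group_a, group_b)
proj_a, proj_b = project(group_a, w), project(group_b, w)
```

The pooled 1-dimensional values `proj_a + proj_b` are what a unimodality test would then be applied to.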

17
Hartigan's DIP Test of Unimodality
Hybrid Clustering

For 1-dimensional data
Find nonparametric MLE of the closest unimodal
distribution function
Calculate DIP statistic
For p-values: Take repeated samples from a single
Gaussian fitted to the data, and for each sample
Cluster into 2 clusters
Project onto 1 dimension
Calculate DIP statistic

18
Hartigan's DIP Test of Unimodality
Hybrid Clustering
19
Hartigan's DIP Test of Unimodality
Hybrid Clustering
20
Hartigan's DIP Test of Unimodality
Hybrid Clustering
21
Hartigan's DIP Test of Unimodality
Hybrid Clustering
22
GKL-Silverman Plot

Project data onto 1 dimension and then plot the
density
Use a Gaussian smoother to estimate density
Parameterized by width of smoother: as width
increases, number of modes decreases
Closest unimodal:
Unimodal density estimate with smallest width of
smoother
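
The smallest smoother width that gives a unimodal estimate can be found by bisection, since the number of modes of a Gaussian kernel estimate does not increase as the width grows. The sketch below counts modes on a grid, so the result is approximate; the function names, grid size, tolerance, and toy sample are illustrative assumptions.

```python
import math

def count_modes(data, width, grid_size=512):
    """Count local maxima of an (unnormalized) Gaussian kernel estimate on a grid."""
    lo = min(data) - 3 * width
    hi = max(data) + 3 * width
    step = (hi - lo) / (grid_size - 1)
    ys = [sum(math.exp(-0.5 * ((lo + i * step - d) / width) ** 2) for d in data)
          for i in range(grid_size)]
    return sum(1 for i in range(1, grid_size - 1)
               if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1])

def critical_width(data, tol=1e-3):
    """Approximate the smallest width for which the estimate has one mode."""
    lo = tol
    hi = max(data) - min(data)
    # Widen the upper bracket until the estimate is unimodal.
    while count_modes(data, hi) > 1:
        hi *= 2.0
    # Bisection: hi always yields one mode, lo is assumed to yield several.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid) > 1:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical projected sample with two groups.
sample = [0.0, 0.2, -0.1, 0.1, -0.3, 5.0, 5.2, 4.9, 5.1, 4.7]
h_crit = critical_width(sample)
```

Plotting the kernel estimate at width `h_crit` alongside an estimate at a smaller width shows how far the data are from the closest unimodal fit.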

23
Hybrid Clustering
p-value 0.43
24
Hybrid Clustering
p-value 0.76
25
Hybrid Clustering
p-value 0.00
26
Italian Olive Oils
Olive Oil Example

572 olive oils from 9 areas of Italy
Areas: 4 from southern Italy, 2 from Sardinia,
3 from northern Italy
Data: Percentage composition of 8 fatty acids
MBC finds 20 mixture components
Fowlkes-Mallows Index: 0.39
Pruning reduces to 7 clusters
Fowlkes-Mallows Index: 0.55
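
For reference, the Fowlkes-Mallows index compares two partitions by counting pairs of observations: the number of pairs placed together in both partitions, divided by the geometric mean of the numbers of pairs placed together in each. A minimal sketch, with hypothetical label vectors rather than the olive oil data:

```python
import math
from collections import Counter

def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows index between two labelings of the same observations."""
    # Pairs that share a group in both labelings.
    both = sum(pairs(c) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a group in each labeling separately.
    same_true = sum(pairs(c) for c in Counter(labels_true).values())
    same_pred = sum(pairs(c) for c in Counter(labels_pred).values())
    if same_true == 0 or same_pred == 0:
        return 0.0
    return both / math.sqrt(same_true * same_pred)

# Hypothetical example: a true two-group structure split into three clusters.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
fm = fowlkes_mallows(truth, found)
```

The index is 1 when the two partitions agree exactly, regardless of how the groups are labeled, and it drops when true groups are split across several clusters.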

27
(Pruned) Hierarchical Clustering Tree
Olive Oil Example
28
Cluster Assignments
Olive Oil Example
[Figure: two panels, Before Pruning and After Pruning]
29
Contributions

Assessment
Posterior probability plots
Misclassification probabilities
Hybrid Clustering
Merges ideas of model based and non-parametric
clustering
Clusters can consist of several mixture
components
Less dependency on choice of covariance model

30
Extensions

Apply pruning ideas to non-parametric
hierarchical clustering
Ideas aren't restricted to model based clustering
Produces a purely non-parametric clustering
method with objective way of selecting clusters.
