Robust Methods for Locating Multiple Dense Regions in Complex Datasets


1
Robust Methods for Locating Multiple Dense
Regions in Complex Datasets
  • Gunjan Gupta
  • Department of Electrical & Computer Engineering
  • The University of Texas at Austin
  • October 30, 2006

2
Why cluster only a part of the data into dense
clusters?
3
Why cluster only a part of the data into dense
clusters?
4
Why cluster only a part of the data into dense
clusters?
Exhaustive clustering (K-Means) result
5
Goal: cluster only a (small) part of the data
into multiple dense clusters.
6
Why cluster only a part of the data into dense
clusters?
  • Little or no labeled data available.
  • Only a part of the data clusters well.
  • Or only a fraction of the data is relevant.

7
Application Scenarios
  • Bioinformatics
  • Gene microarray data: 100s of strongly correlated
    genes out of 1000s.
  • Phylogenetics data
  • Document Retrieval
  • Users are interested in a few highly relevant
    documents.
  • Market Basket Data
  • Only some customers have highly correlated
    behaviors.
  • Feature selection

8
Practical Issues
  • How many dense clusters are there, and where are
    they located?
  • What fraction of the data should be clustered?
  • What notion of density should be used?
  • All clusters are not necessarily equally dense.
  • Choice of model and distance measure.

9
In this thesis we introduce two new approaches
for finding k dense clusters:
  • A very general parametric approach:
  • Bregman Bubble Clustering
  • A non-parametric approach that provides a
    significant extension over existing methods:
  • Automated Hierarchical Density Shaving

10
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

11
Finding a single dense cluster: One
Class-IB [Crammer et al., 2004]
  • Perhaps the first parametric density-based
    approach.
  • Uses the notion of a Bregmanian ball.
  • Distance measure: Bregman divergence.
  • Pros
  • Faster than non-parametric methods.
  • Generalizes to all Bregman divergences, a large
    class of measures.
  • Cons
  • Can only find one dense region.

Bregmanian ball cost = average Bregman divergence
from the center.
12
Bregman Divergences
  • A large class of divergences:
  • Applicable to a wide variety of data types.
  • Includes a number of useful distance measures,
    e.g. KL-divergence, Itakura-Saito distance,
    squared Euclidean distance, Mahalanobis distance,
    etc.
  • Exploited in Bregman Clustering [Banerjee et al.,
    2004]:
  • Exhaustive partitioning into k segments.
  • Generalization of K-Means to all Bregman
    divergences.
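To make the definition concrete, here is a minimal sketch (my own illustration, not code from the thesis) of how a convex function phi induces the Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, recovering two of the examples above:

    import numpy as np

    def bregman_divergence(phi, grad_phi, x, y):
        # Generic Bregman divergence: d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    # phi(x) = ||x||^2 yields the squared Euclidean distance.
    sq_norm, grad_sq_norm = (lambda x: np.dot(x, x)), (lambda x: 2 * x)

    # phi(p) = sum_i p_i log p_i (negative entropy) yields the KL divergence
    # between probability vectors.
    neg_entropy = lambda p: np.sum(p * np.log(p))
    grad_neg_entropy = lambda p: np.log(p) + 1

    x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
    print(bregman_divergence(sq_norm, grad_sq_norm, x, y))          # ||x - y||^2 = 2.0
    p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
    print(bregman_divergence(neg_entropy, grad_neg_entropy, p, q))  # KL(p || q)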

13
Problem Definition: Bregman Bubble Clustering
  • Find k clusters consisting of a total of s points
    that have the lowest total cost.

Cost measure: Q(S) = Σ_{j=1..k} Σ_{x ∈ C_j} d_φ(x, c_j),
where d_φ is a Bregman divergence, c_1, ..., c_k are the
k centroids, and S = C_1 ∪ ... ∪ C_k is the set of k
clusters covering s points in total.
14
Bregman Bubbles are also applicable to two
important Bregman projections:
  • 1 - Cosine similarity for document clustering:
  • Sq. Euclidean distance between points projected
    onto a sphere.
  • Pearson distance for biological data:
  • Sq. Euclidean distance between z-scored points
    (i.e. between points projected onto a sphere after
    subtracting the mean across dimensions).
  • Equal to 1 - Pearson correlation.
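A small illustration of the Pearson-distance projection described above (my code, with the scaling chosen so the identity holds exactly): center each point across its dimensions, project onto a sphere, and the squared Euclidean distance between projected points equals 1 - Pearson correlation.

    import numpy as np

    def pearson_projection(x):
        # Center across dimensions, project onto a sphere, and scale so that
        # ||proj(x) - proj(y)||^2 = 1 - corr(x, y).
        z = x - x.mean()
        z = z / np.linalg.norm(z)
        return z / np.sqrt(2.0)

    x = np.array([1.0, 2.0, 3.0, 5.0])
    y = np.array([2.0, 1.0, 4.0, 6.0])
    d = np.sum((pearson_projection(x) - pearson_projection(y)) ** 2)
    print(d, 1 - np.corrcoef(x, y)[0, 1])   # the two numbers agree

Dropping the centering step and keeping only the sphere projection gives the analogous construction for 1 - cosine similarity.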

15
Bregmanian Ball vs. Bregman Bubbles
Bregmanian Ball, k = 1
Bregman Bubbles, k > 1
  • Can show that for fixed centers, the optimal
    assignment of points forms Bregman Bubbles.

16
Finding k Bregman Bubbles
  • Finding the optimal solution is too slow.
  • A simple iterative relocation algorithm, Bregman
    Bubble Clustering (BBC), is possible (see the
    sketch below):
  • Guaranteed to converge to a local minimum.
  • Alternately updates the assigned points and the
    centers of the k bubbles.
  • The mean is the best center at each step, because
    of the Bregman divergence property shown by
    Banerjee et al., 2004.
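A minimal sketch of that alternation, assuming squared Euclidean distance and a fixed s (the function and variable names are mine, not from the thesis):

    import numpy as np

    def bregman_bubble_clustering(X, centers, s, n_iter=100):
        # Alternate between (1) keeping only the s points closest to their
        # nearest center and (2) moving each center to the mean of its points.
        centers = np.array(centers, dtype=float)
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # n x k distances
            nearest = d.argmin(axis=1)                     # nearest bubble for every point
            cost = d[np.arange(len(X)), nearest]
            kept = np.argsort(cost)[:s]                    # only the s cheapest points stay
            new_centers = centers.copy()
            for j in range(len(centers)):
                members = kept[nearest[kept] == j]
                if len(members) > 0:                       # the mean is the optimal center
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels = np.full(len(X), -1)                       # -1 marks "don't care" points
        labels[kept] = nearest[kept]
        return centers, labels

With s = n this reduces to K-Means-style updates, and with k = 1 it searches for a single dense bubble, consistent with the unification discussed later.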

17
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 1.77
(Figure: the two bubbles around centroids c1 and c2,
with point counts 4 and 6.)
18
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.85
(Figure: point counts 5 and 5.)
19
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.56
(Figure: point counts 6 and 4.)
20
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.45
(Figure: point counts 6 and 4.)
21
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.37
(Figure: point counts 5 and 5.)
22
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

23
Bregman Bubble Soft Clustering
  • A probabilistic model: a mixture of k exponential
    family distributions and one uniform distribution.
  • An EM-based algorithm that alternately updates the
    soft assignments and the mixing weights and
    component parameters.
  • Can be shown to converge to a local minimum.
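A compact way to write the generative model described above (my notation; the thesis may use different symbols): each point is drawn either from one of the k exponential-family components or from a background uniform density p_0,

    p(x) = \alpha_0 \, p_0(x) + \sum_{j=1}^{k} \alpha_j \, p_{(\psi,\theta_j)}(x),
    \qquad \alpha_0 + \sum_{j=1}^{k} \alpha_j = 1 .

The EM iterations then alternate between soft assignments of points to the k + 1 components (E-step) and re-estimation of the mixing weights and component parameters (M-step).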

24
A 2-D dataset: 5 Gaussians + uniform background.
Can you spot the 5 Gaussians?
25
A 2-D dataset: 5 Gaussians + uniform background.
26
The 2-D dataset after Bregman Soft
Clustering (scalar variances updated).
27
Bregman Bubble Soft Clustering
  • Exploits a bijection between Bregman divergences
    and exponential family distributions [Banerjee et
    al., 2004]: every convex function defines a
    Bregman divergence, which corresponds to an
    exponential family distribution.

Examples:
1. Squared Euclidean distance corresponds to the
   fixed-variance Gaussian.
2. KL-divergence corresponds to the multinomial.
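The bijection of Banerjee et al. can be written, up to notation, as follows: for a regular exponential family with cumulant function \psi and its convex conjugate \phi,

    p_{(\psi,\theta)}(x) = \exp\!\bigl(-\, d_{\phi}(x, \mu)\bigr) \, b_{\phi}(x),

where \mu = \nabla\psi(\theta) is the expectation parameter and b_{\phi}(x) does not depend on \theta. This is why squared Euclidean distance pairs with the fixed-variance Gaussian and KL-divergence with the multinomial in the two examples above.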
28
Unification
  • Can show that Bregman Bubble Clustering (BBC) is a
    special case of Bregman Bubble Soft Clustering
    under a limiting choice of the model parameters.
  • Demonstrates that the BBC algorithm is not ad-hoc
    but arises out of a mixture of exponentials and a
    uniform distribution.
  • Unifications with previous algorithms:
  • Bregman Bubble (k = 1) = One Class (same
    cost as OC-IB).
  • Bregman Bubble (cluster all data) = Bregman
    Clustering.
  • Soft Bubble (cluster all data) = Bregman Soft
    Clustering.

29
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

30
Bregman Bubble Soft Clustering with Pressurization
  • Problem:
  • BBC is very sensitive to initialization,
  • especially when small clusters are desired, because
    of limited mobility during the local search.
  • Solution: Pressurization
  • Demo

31
Pressurization demo, iteration 1
watch this area
32
Pressurization demo, iteration 2
33
Pressurization demo, iteration 3
34
Pressurization demo, iteration 9
35
Pressurization demo, iteration 10
36
Pressurization demo, iteration 20
37
Seeding Bregman Bubble Clustering
  • Goals:
  • Find k centers to start the BBC local search.
  • Overcome the local minima problem in BBC.
  • Automatically estimate k.
  • Features:
  • Deterministic.
  • Guaranteed within a constant factor of the optimal
    cost for the One Class case.

38
Seeding Bregman Bubble Clustering for k = 1
  • Input: n × n distance matrix, number of points to
    cluster s.
  • Restrict c (the center) to be a data point.
  • Output: the best cluster centroid.
  • Algorithm: sort each row, take the cumulative sum,
    normalize, pick the best (see the sketch below).

Best cluster of size s
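A sketch of that seeding step under my reading of the slide, assuming D is the n x n matrix of pairwise Bregman divergences (with each point at distance 0 from itself):

    import numpy as np

    def one_class_seed(D, s):
        # Cost of a candidate center = average distance to its s closest points
        # (sort each row, cumulative sum, normalize); the cheapest candidate wins.
        sorted_rows = np.sort(D, axis=1)
        costs = sorted_rows[:, :s].cumsum(axis=1)[:, -1] / s
        center = int(np.argmin(costs))         # best centroid, restricted to the data points
        members = np.argsort(D[center])[:s]    # the best cluster of size s
        return center, members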
39
Seeding Bregman Bubble Clustering for k > 1
  • Goal:
  • Identify the k dense regions in the data and the
    corresponding k centroids.
  • Main idea behind the solution:
  • If we ran One Class Bregman Bubble (k = 1) n times,
    starting from each of the n data points as the seed
    location, the n convergence locations would
    correspond to only k distinct densest regions in
    the data.
  • These k dense locations can be thought of as the
    centers of the k densest regions in the data.
  • A faster approximation of the above is possible by
    restricting the k-centroid search to the n data
    points (see the sketch after this list).
  • Demo of DGRADE (Density Gradient Enumeration)
    next.
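A compact sketch of DGRADE as described above and in the demo that follows (my code; tie-breaking details are glossed over):

    import numpy as np

    def dgrade(D, s_one):
        n = len(D)
        nbrs = np.argsort(D, axis=1)[:, :s_one]              # s_one closest neighbors (incl. self)
        cost = np.take_along_axis(D, nbrs, axis=1).mean(1)   # Step 1: One Class cost per point
        order = np.argsort(cost)                             # densest (lowest cost) first
        labels = np.full(n, -1)
        seeds = []
        for p in order:                                      # Step 2
            densest_nbr = min(nbrs[p], key=lambda q: cost[q])
            if densest_nbr == p:                             # p is the densest in its neighborhood
                labels[p] = len(seeds)                       # start a new cluster
                seeds.append(p)                              # its densest point becomes a seed
            else:
                labels[p] = labels[densest_nbr]              # join the densest neighbor's cluster
        return np.array(seeds), labels

The number of seeds returned gives the estimate of k, and the seeds can then initialize the BBC local search.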

40
Density Gradient Enumeration demo
41
Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point as
the average divergence to its s_one (say 5) closest
neighbors (including the point itself).
(Figure: one point labeled with its cost, 0.25.)
42
Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point
(which can also be seen as inversely proportional to
the data density at that point).
(Figure: every point labeled with its cost; legend
bins: <0.14, 0.15-0.19, 0.20-0.24, 0.25-0.34, >0.34.)
43
Density Gradient Enumeration demo
Step 2: Visit points in order of decreasing density;
connect the next densest point to the densest among
its 5 closest neighbors and relabel it. Start a new
cluster if the point itself is the densest.
Iteration 1: Cluster 1 started.
(Figure: points labeled with their One Class costs;
legend bins: <0.14, 0.15-0.19, 0.20-0.24, 0.25-0.34, >0.34.)
44
Density Gradient Enumeration demo
Step 2, Iteration 2: Cluster 2 started. (Figure as above.)
45
Density Gradient Enumeration demo
Step 2, Iteration 3. (Figure as above.)
46
Density Gradient Enumeration demo
Step 2, Iteration 4. (Figure as above.)
47
Density Gradient Enumeration demo
Step 2, Iteration 5. (Figure as above.)
48
Density Gradient Enumeration demo
Step 2, Iteration 6. (Figure as above.)
49
Density Gradient Enumeration demo
Step 2, Iteration 7. (Figure as above.)
50
Density Gradient Enumeration demo
Step 2, Iteration 8. (Figure as above.)
51
Density Gradient Enumeration demo
Step 2, Iteration 9. (Figure as above.)
52
Density Gradient Enumeration demo
Step 2, Iteration 10. (Figure as above.)
53
Density Gradient Enumeration demo
Step 2, Iteration 11. (Figure as above.)
54
Density Gradient Enumeration demo
Step 2, Iteration 12. (Figure as above.)
55
Density Gradient Enumeration demo
Step 2, Iteration 13. (Figure as above.)
56
Density Gradient Enumeration demo
Step 2, Iteration 14. (Figure as above.)
57
Density Gradient Enumeration demo
Step 2, Iteration 15. (Figure as above.)
58
Density Gradient Enumeration demo
Step 2, Iteration 16. (Figure as above.)
59
Density Gradient Enumeration demo
Step 2, Iteration 24. (Figure as above.)
60
Density Gradient Enumeration demo
Step 2, Iteration 24: return the densest point in each
cluster as the seeds. (Figure as above.)
61
Density Gradient Enumeration on 2-D Gaussian data
62
Results
Tested on many datasets, using three different
distance measures
63
Results on Lee (gene expression, 1-Pearson)
Overlap Lift
64
Results on 40-d Synthetic (Sq. Euclidean)
Adjusted Rand Index
65
Results on 20-Newsgroup (1-Cosine)
Adjusted Rand Index
66
Seeded BBC, Gasch Array (1-Pearson)
Pressurization + Seeding
Pressurization only
Adjusted Rand Index
67
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

68
Density-Based Clustering Algorithms
  • HMA (Wishart, 1968)
  • DBSCAN (Ester et al., 1996)
  • Density Shaving (DS) and the Auto-HDS framework.
  • Can show there is a strong connection between the
    above algorithms.
  • All 3 use a uniform density kernel (density ∝ no.
    of points within some radius).
69
Density Shaving (DS)
70
Density Shaving (DS)
71
Density Shaving (DS)
Two inputs: 1. f_shave, the fraction to shave
(0.38); 2. n_eps, the min. no. of neighbors (3).
Uses a trick to automatically compute the correct
ball radius from f_shave and n_eps (see the sketch
below).
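A sketch of one way that trick could work (my reading, not necessarily the thesis implementation): to keep the densest (1 - f_shave) fraction of points, set the ball radius to the nc-th smallest of the per-point distances to the n_eps-th nearest neighbor.

    import numpy as np

    def ds_radius(D, f_shave, n_eps):
        # D is a precomputed n x n distance matrix; each row includes the point
        # itself at distance 0, so counting 'self' as a neighbor is an assumption.
        n = len(D)
        nc = int(round((1.0 - f_shave) * n))          # number of points to keep dense
        knn_dist = np.sort(D, axis=1)[:, n_eps - 1]   # distance to the n_eps-th closest point
        r_eps = np.sort(knn_dist)[nc - 1]             # smallest radius leaving nc dense points
        dense = knn_dist <= r_eps                     # dense points; the rest are "don't care"
        return r_eps, dense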
72
Density Shaving (DS)
Performs graph traversal on dense points to
identify clusters.
73
Density Shaving (DS)
don't-care points
74
Properties of DS
  • Increasing n_eps has a smoothing effect on the
    clustering.

(Figure: n_eps = 5 vs. n_eps = 50; dots = dense points,
x = don't-care points.)
75
Properties of DS
  • For a fixed n_eps, successive runs of DS with
    increasing data shaving (f_shave) result in a
    hierarchy of clusters.

2-D Gaussian example: 1298 pts, 5 Gaussians + uniform
background. (Figure: 15% shaving vs. 38% shaving,
n_eps = 25.)
76
Properties of DS
  • With a fixed n_eps, successive runs of DS with
    increasing shaving (f_shave) result in a
    hierarchy of clusters.

(Figure: 15%, 38%, 62%, and 85% shaving.)
  • Clusters can split or vanish.
  • Points in separate clusters never merge into one.

2-D Gaussian example: 1298 pts, 5 Gaussians + uniform
background.
77
Hierarchical Density Shaving (HDS)
  • Uses geometric (exponential) shaving to create the
    hierarchy from DS.
  • Starting from all the data, a fixed fraction
    r_shave of the remaining data is shaved at each
    iteration.
  • Clusters that lose points without splitting keep
    the same id. Example:

(Figure: clusters A and B at 38% and 55% shaving.)
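One way to write the geometric shaving schedule (my formulation): if a fraction r_shave of the surviving points is shaved at each iteration, then after i iterations the number of surviving points is

    n_i = \lceil n \, (1 - r_{\mathrm{shave}})^{i} \rceil ,

so the number of levels in the hierarchy is roughly \log(n_{\mathrm{stop}} / n) / \log(1 - r_{\mathrm{shave}}), i.e. logarithmic in n, where n_{\mathrm{stop}} is the (assumed) point count at which shaving stops.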
78
An important trick: Dictionary Row Sort on the HDS
Label Matrix
79
Visualization using the sorted Label Matrix
  • The sorted matrix is plotted: each row is a point
    (sorted row index), each column a shaving iteration.
  • Each cluster is plotted in a unique color.
  • Don't-care points are plotted in the background
    color.
  • Shows the compact, 8-node hierarchy.

(Figure: clusters A, B, and C appear at 38%, 62%, and
85% shaving.)
80
Cluster Stability
  • Shaving iteration = level (22 iterations in this
    example).
  • Stability = difference between the first and last
    level of a cluster, i.e. the number of shaving
    iterations for which the cluster exists.

(Figure: spatially relevant projection, with cluster
levels marked.)
81
Cluster Selection
  • We can show that relative stability is independent
    of the shaving rate r_shave.
  • All clusters can be ranked by stability, even
    parents and children.
  • One way of selecting clusters:
  • - Highest-stability clusters are picked first.
  • - Parents and children of picked clusters are
    eliminated.

82
HDS + Visualization + Model selection = Auto-HDS
Auto-HDS (546 pts)
  • Auto-HDS
  • - Finds all modes / clusters.
  • - Finds clusters of varying density
    simultaneously.

DS (546 pts, f_shave = 0.58)
83
Results: Gasch Dataset
84
Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Reference pool, not stressed
Heat Shock
YPD
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
85
Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Heat Shock
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
86
Results: Lee Dataset
87
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

88
BBC vs. Auto-HDS: a qualitative comparison
89
BBC vs. Auto-HDS: Sim-2
Auto-HDS
BBC
90
BBC vs. Auto-HDS: Sim-2
91
BBC vs. Auto-HDS: Lee
92
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

93
Gene DIVER
  • Gene Density Interactive Visual ExplorER.
  • A scalable implementation of Auto-HDS that streams
    data from disk instead of relying on main memory.
  • Special features for browsing clusters.
  • Special features for biological data mining.
  • Available for download at
  • http://www.ideal.ece.utexas.edu/gunjan/genediver

Let's see the Gene DIVER demo now.
94
Main Contributions
  • Simultaneously finding dense clusters and pruning
    the rest is useful in many domains.
  • The parametric method BBC generalizes density-based
    clustering to a large class of problems:
  • Very scalable to large, high-dimensional data.
  • Robust with pressurization and seeding.
  • Auto-HDS improves upon non-parametric
    density-based clustering in many ways:
  • Well-suited for very high-dimensional datasets.
  • A powerful visualization.
  • Interactive clustering, compact hierarchy.
  • Gene DIVER: a powerful tool for the data mining
    community, and especially for bioinformatics
    practitioners.

95
Future Work
  • BBC
  • Bregman Bubble Coclustering.
  • Online Bregman Bubble for capturing localized
    concept drifts.
  • Auto-HDS
  • Variable resolution ISOMAP.
  • Deterministic coclustering.
  • Extensions to Gene DIVER.

96
Relevant Papers
  • G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A
    Robust, Scalable Framework for Locating Multiple,
    Dense Regions in Data," ICDM 2006, 12 pages.
  • G. Gupta, A. Liu and J. Ghosh, "Hierarchical Density
    Shaving: A clustering and visualization framework
    for large biological datasets," ICDM-DMB 2006, 5
    pages.
  • G. Gupta, A. Liu and J. Ghosh, "Clustering and
    Visualization of High-Dimensional Biological
    Datasets using a fast HMA Approximation," ANNIE
    2006, 6 pages.
  • G. Gupta and J. Ghosh, "Robust One-Class
    Clustering Using Hybrid Global and Local Search,"
    ICML 2005, pp. 273-280.
  • G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A
    Robust Framework for Mining Dense Clusters," under
    review, JMLR.
  • G. Gupta, A. Liu and J. Ghosh, "Automated
    Hierarchical Density Shaving: A robust, automated
    clustering and visualization framework for large
    biological datasets," under review, IEEE Trans.
    Comp. Bio. & Bioinform.

97
?
98
Backup Slides from Here
99
Other papers
  • G. Gupta and J. Ghosh, "Detecting Seasonal Trends
    and Cluster Motion Visualization for very High
    Dimensional Transactional Data," SDM 2001.
  • G. Gupta and J. Ghosh, "Value Balanced
    Agglomerative Connectivity Clustering," Proc. SPIE
    Conf. on Data Mining and Knowledge Discovery,
    SPIE 2001.
  • G. Gupta, A. Strehl and J. Ghosh, "Distance Based
    Clustering of Association Rules," ANNIE 1999.

100
Properties of Auto-HDS
  • Fast: O(n · n_eps · log n) using a heap-based
    implementation.
  • Gene DIVER: a memory-efficient heap-based
    implementation.
  • Extremely compact hierarchy of clusters.
  • Visualization:
  • Creates a spatially relevant 2-D projection of
    points and clusters.
  • Spatially relevant 2-D projection of the compact
    hierarchy.
  • Model selection:
  • Can define a notion of stability (analogous to
    cluster height).
  • Based on stability, can select the most stable
    clusters automatically.

101
Finding relevant subsets: related work
  • Density-based clustering, e.g. DBSCAN [Ester et
    al., 1996]:
  • Pros: good for low-dimensional spatial data.
  • Cons: not suitable for high-dimensional or
    non-metric scenarios.
  • Gene Shaving [Hastie et al., 2000]:
  • Pros: well-suited for gene expression datasets.
  • Cons: greedily finds clusters, slow, implicit sq.
    Euclidean assumptions.
  • PLAID [Lazzeroni et al., 2002]:
  • Pros: clusters rows and columns simultaneously;
    good for high-dimensional data.
  • Cons: greedy extraction of clusters as plaids;
    assumes additive layers, which is not true for many
    datasets.

102
DGRADE: Selecting the s_one parameter
  • s_one acts like a smoothing parameter for DGRADE.
  • As s_one increases, k declines.
  • Three scenarios for determining s_one:
  • If k is given, find the smallest s_one that results
    in k clusters.
  • If not, find the k that occurs for the longest
    interval (maximum stability) of s_one values.
  • Or, find the largest k that occurs at least a
    certain no. of times (minimum stability).

103
DGRADE: Selecting the s_one parameter
  • Example on 2-D data.

(Figure, left: k = 5 given as input, s_one found to be
57. Right: automatic k (= 4) and s_one (= 62).)
104
Seeded BBC on 40-d Synthetic (Sq. Euclidean)
Pressurization + Seeding
Using Pressurization only
Adjusted Rand Index