Title: Robust Methods for Locating Multiple Dense Regions in Complex Datasets
1. Robust Methods for Locating Multiple Dense Regions in Complex Datasets
- Gunjan Gupta
- Department of Electrical and Computer Engineering
- The University of Texas at Austin
- October 30, 2006
2. Why cluster only a part of the data into dense clusters?
3. Why cluster only a part of the data into dense clusters?
4. Why cluster only a part of the data into dense clusters?
Exhaustive clustering (K-Means) result
5. Goal: cluster only a (small) part of the data into multiple dense clusters.
6. Why cluster only a part of the data into dense clusters?
- Little or no labeled data available.
- Only a part of the data clusters well.
- Or only a fraction of the data is relevant.
7. Application Scenarios
- Bioinformatics
  - Gene microarray data: 100s of strongly correlated genes out of 1000s.
  - Phylogenetics data.
- Document retrieval
  - User interested in a few highly relevant documents.
- Market basket data
  - Only some customers have highly correlated behaviors.
- Feature selection
8. Practical Issues
- How many dense clusters are there, and where are they located?
- What fraction of the data should be clustered?
- What notion of density?
- All clusters are not necessarily equally dense.
- Choice of model and distance measure.
9. In this thesis we introduce two new approaches for finding k dense clusters
- A very general parametric approach: Bregman Bubble Clustering.
- A non-parametric approach that significantly extends existing methods: Automated Hierarchical Density Shaving.
10. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Pressurization
  - Seeding
  - Results
- Non-parametric approach
  - Automated Hierarchical Density Shaving
  - Results
- Comparison
- Demo
11. Finding a single dense cluster: One Class-IB (Crammer et al., 2004)
- Perhaps the first parametric density-based approach.
- Uses the notion of a Bregmanian ball.
- Distance measure: Bregman divergence.
- Pros
  - Faster than non-parametric methods.
  - Generalizes to all Bregman divergences, a large class of measures.
- Cons
  - Can only find one dense region.
Bregmanian ball cost = average Bregman divergence from the center.
12. Bregman Divergences
- A large class of divergences (definition below).
- Applicable to a wide variety of data types.
- Includes a number of useful distance measures, e.g. KL-divergence, Itakura-Saito distance, squared Euclidean distance, Mahalanobis distance, etc.
- Exploited in Bregman Clustering (Banerjee et al., 2004):
  - Exhaustive partitioning into k segments.
  - Generalization of K-Means to all Bregman divergences.
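For reference, the standard definition (Banerjee et al., 2004): for a strictly convex, differentiable function \phi,

    d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla \phi(y) \rangle .

For example, \phi(x) = \|x\|^2 yields d_\phi(x, y) = \|x - y\|^2 (squared Euclidean), and \phi(x) = \sum_i x_i \log x_i on the simplex yields the KL-divergence.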
13. Problem Definition: Bregman Bubble Clustering
- Find k clusters containing a total of s points that have the lowest total cost:

    Q(S, \{c_j\}_{j=1}^{k}) = \frac{1}{s} \sum_{j=1}^{k} \sum_{x \in S_j} d_\phi(x, c_j),

where d_\phi is a Bregman divergence (the cost measure), c_1, ..., c_k are the k centroids, and S = \cup_{j=1}^{k} S_j is the set of k clusters, with |S| = s.
14. Bregman Bubbles will also be applicable to two important Bregman projections
- 1-Cosine for document clustering: squared Euclidean distance between points projected onto a sphere.
- Pearson distance for biological data: squared Euclidean distance between z-scored points (i.e. between points projected onto a sphere after subtracting the mean across dimensions); equal to 1 - Pearson correlation (see the check below).
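A quick numerical check of that identity (not the thesis code); the factor of 1/2 below reflects unit-norm projection, which the deck's convention presumably absorbs into the scaling:

    import numpy as np

    def z_project(x):
        """Center x across dimensions, then scale to unit norm (sphere projection)."""
        x = x - x.mean()
        return x / np.linalg.norm(x)

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=20), rng.normal(size=20)
    u, v = z_project(x), z_project(y)
    pearson_dist = 0.5 * np.sum((u - v) ** 2)   # sq. Euclidean on the sphere (scaled)
    assert np.isclose(pearson_dist, 1 - np.corrcoef(x, y)[0, 1])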
15. Bregmanian Ball vs. Bregman Bubbles
- Bregmanian ball: k = 1.
- Bregman bubbles: k > 1.
- Can show that, for fixed centers, the optimal assignment of points forms Bregman bubbles.
16. Finding k Bregman Bubbles
- The optimal solution is too slow to compute.
- A simple iterative relocation algorithm, Bregman Bubble Clustering (BBC), is possible (sketch below):
  - Guaranteed to converge to a local minimum.
  - Alternately updates the assigned points and the centers of the k bubbles.
  - The mean is the best center at each step, because of the Bregman divergence property shown by Banerjee et al. (2004).
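A minimal sketch of the relocation loop just described, using squared Euclidean distance as the divergence; the function and variable names (and the optional warm-start parameter) are mine, not the thesis pseudocode:

    import numpy as np

    def bbc(X, k, s, centers=None, n_iters=100, seed=0):
        """Cluster the s lowest-cost points of X into k bubbles; rest get label -1."""
        rng = np.random.default_rng(seed)
        if centers is None:                       # random seeding unless warm-started
            centers = X[rng.choice(len(X), size=k, replace=False)]
        assign = None
        for _ in range(n_iters):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # n x k costs
            nearest = d.argmin(1)                 # best bubble for each point
            cost = d[np.arange(len(X)), nearest]
            keep = np.argsort(cost)[:s]           # only the s cheapest points cluster
            new_assign = np.full(len(X), -1)      # -1 marks don't-care points
            new_assign[keep] = nearest[keep]
            if assign is not None and np.array_equal(new_assign, assign):
                break                             # assignments stable: local minimum
            assign = new_assign
            for j in range(k):                    # mean is the optimal Bregman center
                if np.any(assign == j):
                    centers[j] = X[assign == j].mean(axis=0)
        return assign, centers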
17-21. Bregman Bubble Clustering demo for k = 2, s = 10
[Scatter plots of five successive iterations with centroids c1 and c2; the average squared Euclidean distance from the assigned centroid drops from 1.77 to 0.85, 0.56, 0.45, and finally 0.37 as points are reassigned between the two bubbles.]
22. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Pressurization
  - Seeding
  - Results
- Non-parametric approach
  - Automated Hierarchical Density Shaving
  - Results
- Comparison
- Demo
23. Bregman Bubble Soft Clustering
- A probabilistic model: a mixture of k exponential-family distributions and one uniform background distribution (written out below).
- An EM-based algorithm that alternately updates the point memberships and the mixture parameters.
- Can be shown to converge to a local minimum.
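One plausible way to write the model down (notation mine, using the bijection of slide 27):

    p(x) = \alpha_0 \, p_0(x) + \sum_{j=1}^{k} \alpha_j \, \exp(-d_\phi(x, \mu_j)) \, f_\phi(x),

where p_0 is the uniform background density, the \alpha's are mixing weights summing to one, and the \mu_j are the bubble means. EM alternates posterior membership probabilities (E-step) with updates of the \alpha's and \mu_j's (M-step).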
24. A 2-D dataset: 5 Gaussians + uniform background
Can you spot the 5 Gaussians?
25. A 2-D dataset: 5 Gaussians + uniform background
26. The 2-D dataset after Bregman Soft Clustering (scalar variances updated)
27. Bregman Bubble Soft Clustering
- Exploits a bijection between Bregman divergences and exponential-family distributions (Banerjee et al., 2004): given a convex function \phi with Bregman divergence d_\phi, the corresponding exponential-family density is

    p(x) = \exp(-d_\phi(x, \mu)) \, f_\phi(x).

Examples:
1. Squared Euclidean distance corresponds to the fixed-variance Gaussian.
2. KL-divergence corresponds to the multinomial.
28. Unification
- Can show that Bregman Bubble Clustering (BBC) is a special case of Bregman Bubble Soft Clustering in a limiting case.
- Demonstrates that the BBC algorithm is not ad hoc, but arises out of a mixture of exponentials and a uniform distribution.
- Unifications with previous algorithms:
  - Bregman Bubble (k = 1) = One Class (same cost as OC-IB).
  - Bregman Bubble (cluster all data) = Bregman Clustering.
  - Soft Bubble (cluster all data) = Bregman Soft Clustering.
29. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Pressurization
  - Seeding
  - Results
- Non-parametric approach
  - Automated Hierarchical Density Shaving
  - Results
- Comparison
- Demo
30. Bregman Bubble Soft Clustering with Pressurization
- Problem
  - BBC is very sensitive to initialization.
  - Especially when small clusters are desired: limited mobility during the local search.
- Solution: Pressurization (sketched below; demo follows).
- Demo
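The deck does not spell out the pressurization mechanism; the sketch below assumes a geometrically shrinking clustered fraction (start by clustering all n points, then decay toward the target s so the bubbles have mobility early on), which is my reading of the approach. It reuses the bbc() sketch from the slide-16 section; gamma is a hypothetical parameter name.

    def bbc_press(X, k, s, gamma=0.8):
        """BBC with pressurization: shrink the clustered fraction toward s."""
        n, j, centers = len(X), 0, None
        while True:
            s_j = s + int((n - s) * gamma ** j)   # near n at first, decays to s
            assign, centers = bbc(X, k, s_j, centers=centers)  # warm start each stage
            if s_j == s:
                return assign, centers
            j += 1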
31-36. Pressurization demo
[Scatter plots at iterations 1, 2, 3, 9, 10, and 20; "watch this area" highlights the region where the dense clusters form.]
37. Seeding Bregman Bubble Clustering
- Goals
  - Find k centers to start the BBC local search.
  - Overcome the local minima problem in BBC.
  - Automatically estimate k.
- Features
  - Deterministic.
  - Guaranteed cost within a constant factor of optimal for One Class.
38. Seeding Bregman Bubble Clustering for k = 1
- Input: n x n distance matrix; the number of points to cluster, s.
- Restrict c (the center) to a data point.
- Output: the best cluster centroid.
- Algorithm: sort each row, take the cumulative sum, normalize, and pick the best (sketch below).
[Figure: best cluster of size s.]
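A minimal sketch of this step (variable names mine): the normalized cumulative sum of a sorted row, evaluated at s, is the mean divergence from that candidate center to its s nearest points, and the best candidate minimizes it.

    import numpy as np

    def seed_one_class(D, s):
        """D: n x n pairwise divergence matrix. Returns the best center's index."""
        sorted_rows = np.sort(D, axis=1)             # sort each row
        costs = sorted_rows[:, :s].sum(axis=1) / s   # cumulative sum at s, normalized
        return int(costs.argmin())                   # pick the best candidate center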
39. Seeding Bregman Bubble Clustering for k > 1
- Goal
  - Identify the k dense regions in the data, and the corresponding k centroids.
- Main idea behind the solution
  - If we run One Class Bregman Bubble (k = 1) n times, starting once from each of the n data points as the seed location, the n convergence locations correspond to only k distinct densest regions in the data.
  - These k locations can be thought of as the centers of the k densest regions in the data.
- A faster approximation of the above is possible by restricting the k-centroid search to the n data points.
- Demo of DGRADE (Density Gradient Enumeration) next.
40. Density Gradient Enumeration demo
41. Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point as the average divergence to, say, its 5 (= s_one) closest neighbors (including the point itself).
[Example point with cost 0.25.]
42. Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point (which can also be seen as inversely proportional to the data density at that point).
[Scatter plot with per-point costs ranging from 0.09 to 0.7, color-coded by the legend: <0.14, 0.15-0.19, 0.20-0.24, 0.25-0.34, >0.34.]
43-60. Density Gradient Enumeration demo
Step 2: Visit points in order of decreasing density; connect the next densest point to the densest of its 5 closest neighbors and relabel it. Start a new cluster if the point is itself the densest.
[Scatter plots for iterations 1 through 24, with the same cost color-coding as Step 1: Cluster 1 is started at iteration 1, Cluster 2 at iteration 2, and later iterations grow both clusters. At the end, return the densest point in each cluster as a seed.]
A sketch of the full procedure follows.
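A minimal sketch of DGRADE as the demo describes it; the function and variable names are mine, and the thesis pseudocode may differ in tie-breaking details:

    import numpy as np

    def dgrade(D, s_one):
        """D: n x n divergence matrix. Returns cluster labels and seed indices."""
        n = len(D)
        nbrs = np.argsort(D, axis=1)[:, :s_one]        # s_one nearest (incl. self)
        cost = np.take_along_axis(D, nbrs, 1).mean(1)  # Step 1: One Class cost
        labels = np.full(n, -1)
        for i in np.argsort(cost):                     # Step 2: densest first
            denser = [j for j in nbrs[i] if cost[j] < cost[i]]
            if not denser:                             # locally densest point:
                labels[i] = labels.max() + 1           # start a new cluster
            else:                                      # else join the cluster of the
                best = min(denser, key=lambda j: cost[j])  # densest nearby neighbor
                labels[i] = labels[best]
        seeds = [int(np.where(labels == c)[0][np.argmin(cost[labels == c])])
                 for c in range(labels.max() + 1)]     # densest point per cluster
        return labels, seeds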
61. Density Gradient Enumeration: 2-D Gaussian data
62. Results
Tested on many datasets, using three different distance measures.
63. Results on Lee (gene expression, 1-Pearson)
[Plot: Overlap Lift]
64. Results on 40-d Synthetic (Sq. Euclidean)
[Plot: Adjusted Rand Index]
65. Results on 20-Newsgroup (1-Cosine)
[Plot: Adjusted Rand Index]
66. Seeded BBC, Gasch Array (1-Pearson)
[Plot: Adjusted Rand Index, Pressurization + Seeding vs. Pressurization only]
67. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Seeding
- Non-parametric approach
  - Automated Hierarchical Density Shaving
- Comparison
- Demo
68. Density-Based Clustering Algorithms
- HMA (Wishart, 1968)
- DBSCAN (Ester et al., 1996)
- The Density Shaving (DS) and Auto-HDS framework.
- Can show a strong connection between the above algorithms.
- All three use a uniform density kernel (density ∝ number of points within some radius).
69. Density Shaving (DS)
70. Density Shaving (DS)
71. Density Shaving (DS)
Two inputs: 1. f_shave, the fraction of the data to shave (0.38); 2. n_eps, the minimum number of neighbors (3).
Uses a trick to automatically compute the correct ball radius from f_shave and n_eps (one possible construction is sketched below).
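The deck does not show the trick itself; the sketch below is my reading of a natural construction: take each point's distance to its n_eps-th nearest neighbor, and choose the radius so that exactly the densest (1 - f_shave) fraction of points qualifies as dense.

    import numpy as np

    def ds_radius(D, f_shave, n_eps):
        """D: n x n distance matrix. Returns the ball radius and the dense mask."""
        n = len(D)
        s = int(round((1 - f_shave) * n))            # number of points to keep dense
        knn_dist = np.sort(D, axis=1)[:, n_eps - 1]  # dist to n_eps-th nbr (incl. self)
        radius = np.sort(knn_dist)[s - 1]            # s-th smallest such distance
        dense = knn_dist <= radius                   # >= n_eps neighbors within radius
        return radius, dense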
72. Density Shaving (DS)
Performs a graph traversal on the dense points to identify clusters.
73. Density Shaving (DS)
[Plot: the remaining points are "don't care" points.]
74. Properties of DS
- Increasing n_eps has a smoothing effect on the clustering.
[Plots for n_eps = 5 and n_eps = 50; x = dense points, . = don't-care points.]
75. Properties of DS
- For a fixed n_eps, successive runs of DS with increasing data shaving (f_shave) result in a hierarchy of clusters.
[Plots at 15% and 38% shaving, n_eps = 25. 2-D Gaussian example: 1298 points, 5 Gaussians + uniform background.]
76. Properties of DS
- With a fixed n_eps, successive runs of DS with increasing shaving (f_shave) result in a hierarchy of clusters:
  - clusters can split or vanish;
  - points in separate clusters never merge into one.
[Plots at 15%, 38%, 62%, and 85% shaving. 2-D Gaussian example: 1298 points, 5 Gaussians + uniform background.]
77. Hierarchical Density Shaving (HDS)
- Uses geometric/exponential shaving to create the hierarchy from DS (schedule stated below).
- Starting from all the data, a fixed fraction r_shave of the remaining data is shaved at each iteration.
- Clusters that lose points without splitting keep the same id.
[Example: clusters A and B retain their ids between 38% and 55% shaving.]
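Stated as a formula (notation mine): if n points remain at the start, the number still unshaved after j iterations is

    n_j = \lceil (1 - r_{\text{shave}})^j \, n \rceil ,

so the shaving levels form a geometric sequence.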
78. An important trick: Dictionary Row Sort on the HDS Label Matrix
79. Visualization using the sorted Label Matrix
- The sorted matrix is plotted: x-axis = shaving iteration, y-axis = sorted row index.
- Each cluster is plotted in a unique color.
- Don't-care points are plotted in the background color.
- Shows the compact, 8-node hierarchy.
[Example: clusters A (38% shaving), B (62%), and C (85%).]
80. Cluster Stability
- Stability = difference between the first and last level of a cluster, i.e. ∝ the number of shaving iterations for which the cluster exists.
[Plots: the label matrix (shaving iteration = level; the example cluster survives 22 iterations) and a spatially relevant projection.]
81. Cluster Selection
- We can show that relative stability is independent of the shaving rate r_shave (see the note after this list).
- All clusters can be ranked by stability, even parents against children.
- One way of selecting clusters:
  - Highest-stability clusters are picked first.
  - Parents and children of picked clusters are eliminated.
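A short justification of the independence claim, in my notation: under the geometric schedule, a cluster that first appears with n_first points remaining and vanishes with n_last remaining survives for about

    \frac{\ln(n_{\text{first}} / n_{\text{last}})}{-\ln(1 - r_{\text{shave}})}

iterations, so the ratio of any two clusters' stabilities does not depend on r_shave.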
82. HDS + Visualization + Model selection = Auto-HDS
- Auto-HDS:
  - finds all modes/clusters;
  - finds clusters of varying density simultaneously.
[Plots: Auto-HDS (546 pts) vs. DS (546 pts, f_shave = 0.58).]
83. Results: Gasch Dataset
84. Results: Gasch Dataset
[Plot with cluster annotations: H2O2, Menadione, Diauxic Shift, three Heat Shock clusters, Reference pool (not stressed), YPD, Nitrogen Depletion, Stationary Phase, Sorbitol osmotic shock.]
85. Results: Gasch Dataset
[Plot with cluster annotations: H2O2, Menadione, Diauxic Shift, three Heat Shock clusters, Nitrogen Depletion, Stationary Phase, Sorbitol osmotic shock.]
86. Results: Lee Dataset
87. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Seeding
- Non-parametric approach
  - Automated Hierarchical Density Shaving
- Comparison
- Demo
88. BBC vs. Auto-HDS: A qualitative comparison
89. BBC vs. Auto-HDS: Sim-2
[Plots: Auto-HDS vs. BBC]
90. BBC vs. Auto-HDS: Sim-2
91. BBC vs. Auto-HDS: Lee
92. Outline
- Parametric approach
  - Bregman Bubble Clustering
  - Bregman Bubble Soft Clustering
  - Seeding
- Non-parametric approach
  - Automated Hierarchical Density Shaving
- Comparison
- Demo
93. Gene DIVER
- Gene Density Interactive Visual ExplorER.
- A scalable implementation of Auto-HDS that streams data from disk instead of holding it in main memory.
- Special features for browsing clusters.
- Special features for biological data mining.
- Available for download at http://www.ideal.ece.utexas.edu/gunjan/genediver
Let's see the Gene DIVER demo now.
94. Main Contributions
- Simultaneously finding dense clusters and pruning the rest is useful in many domains.
- The parametric method BBC generalizes density-based clustering to a large class of problems:
  - very scalable to large, high-dimensional data;
  - robust with pressurization and seeding.
- Auto-HDS improves upon non-parametric density-based clustering in many ways:
  - well-suited for very high-dimensional datasets;
  - a powerful visualization;
  - interactive clustering, compact hierarchy.
- Gene DIVER: a powerful tool for the data mining community, and especially for Bioinformatics practitioners.
95. Future Work
- BBC
  - Bregman Bubble Coclustering.
  - Online Bregman Bubble for capturing localized concept drifts.
- Auto-HDS
  - Variable-resolution ISOMAP.
  - Deterministic coclustering.
  - Extensions to Gene DIVER.
96. Relevant Papers
- G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A Robust, Scalable Framework for Locating Multiple, Dense Regions in Data," ICDM 2006, 12 pages.
- G. Gupta, A. Liu, and J. Ghosh, "Hierarchical Density Shaving: A clustering and visualization framework for large biological datasets," ICDM-DMB 2006, 5 pages.
- G. Gupta, A. Liu, and J. Ghosh, "Clustering and Visualization of High-Dimensional Biological Datasets using a fast HMA Approximation," ANNIE 2006, 6 pages.
- G. Gupta and J. Ghosh, "Robust One-Class Clustering Using Hybrid Global and Local Search," ICML 2005, pp. 273-280.
- G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters," under review, JMLR.
- G. Gupta, A. Liu, and J. Ghosh, "Automated Hierarchical Density Shaving: A robust, automated clustering and visualization framework for large biological datasets," under review, IEEE Trans. Comp. Bio. Bioinform.
97. ?
98. Backup Slides from Here
99. Other papers
- G. Gupta and J. Ghosh, "Detecting Seasonal Trends and Cluster Motion Visualization for very High Dimensional Transactional Data," SDM 2001.
- G. Gupta and J. Ghosh, "Value Balanced Agglomerative Connectivity Clustering," Proc. SPIE Conf. on Data Mining and Knowledge Discovery, SPIE 2001.
- G. Gupta, A. Strehl, and J. Ghosh, "Distance Based Clustering of Association Rules," ANNIE 1999.
100. Properties of Auto-HDS
- Fast: O(n * n_eps * log n) using a heap-based implementation.
- Gene DIVER: a memory-efficient heap-based implementation.
- Extremely compact hierarchy of clusters.
- Visualization
  - Creates a spatially relevant 2-D projection of points and clusters.
  - Spatially relevant 2-D projection of the compact hierarchy.
- Model selection
  - Can define a notion of stability (analogous to cluster height).
  - Based on stability, can select the most stable clusters automatically.
101. Finding relevant subsets: related work
- Density-based clustering, e.g. DBSCAN (Ester et al.)
  - Pros: good for low-d spatial data.
  - Cons: not suitable for high-d or non-metric scenarios.
- Gene Shaving (Hastie et al., 2000)
  - Pros: well-suited for gene expression datasets.
  - Cons: finds clusters greedily; slow; implicit squared Euclidean assumptions.
- PLAID (Lazzeroni et al., 2002)
  - Pros: clusters rows and columns simultaneously; good for high-d.
  - Cons: greedy extraction of clusters as plaids; assumes additive layers, which is not true for many datasets.
102. DGRADE: Selecting the s_one parameter
- s_one acts like a smoothing parameter for DGRADE.
- As s_one increases, k declines.
- Three scenarios for determining s_one (see the sketch after this list):
  - If k is given, find the smallest s_one that results in k clusters.
  - If not, find the k that occurs for the longest interval (maximum stability) of s_one values.
  - Or, find the largest k that occurs at least a certain number of times (minimum stability).
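A minimal sketch of the three rules. Names are mine: k_for is a hypothetical helper that runs DGRADE for a given s_one (e.g. via the dgrade() sketch above) and returns the resulting number of clusters; scanning a grid of s_one values approximates interval length, and the smallest matching s_one is returned for concreteness.

    from collections import Counter

    def choose_s_one(k_for, s_one_values, k=None, min_stability=None):
        """Pick s_one per the three scenarios described on the slide."""
        ks = {s: k_for(s) for s in s_one_values}    # cluster count per s_one
        if k is not None:                           # scenario 1: k is given
            return min(s for s, kk in ks.items() if kk == k)
        counts = Counter(ks.values())               # how long each k persists
        if min_stability is None:                   # scenario 2: most stable k
            target = counts.most_common(1)[0][0]
        else:                                       # scenario 3: largest k with
            target = max(kk for kk, c in counts.items()  # enough stability
                         if c >= min_stability)
        return min(s for s, kk in ks.items() if kk == target)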
103. DGRADE: Selecting the s_one parameter
[Plots: with k = 5 given as input, s_one was found to be 57; fully automatic selection gives k = 4 and s_one = 62.]
104. Seeded BBC on 40-d Synthetic (Sq. Euclidean)
[Plot: Adjusted Rand Index, Pressurization + Seeding vs. Pressurization only]