Robust Methods for Locating Multiple Dense Regions in Complex Datasets


1
Robust Methods for Locating Multiple Dense
Regions in Complex Datasets
  • Gunjan Gupta
  • Department of Electrical & Computer Engineering
  • The University of Texas at Austin
  • October 30, 2006

2
Why cluster only a part of the data into dense
clusters?
3
Why cluster only a part of the data into dense
clusters?
4
Why cluster only a part of the data into dense
clusters?
Exhaustive clustering (K-Means) result
5
Goal: cluster only a (small) part of the data
into multiple dense clusters.
6
Why cluster only a part of the data into dense
clusters?
  • Little or no labeled data available.
  • Only a part of the data clusters well.
  • Or only a fraction of the data is relevant.

7
Application Scenarios
  • Bioinformatics
  • Gene microarray data: 100s of strongly correlated
    genes out of 1000s.
  • Phylogenetics data
  • Document Retrieval
  • Users are interested in a few highly relevant
    documents.
  • Market Basket Data
  • Only some customers have highly correlated
    behaviors.
  • Feature selection

8
Practical Issues
  • How many dense clusters are there, and where are
    they located?
  • What fraction of the data should be clustered?
  • What notion of density should be used?
  • All clusters are not necessarily equally dense.
  • Choice of model and distance measure.

9
In this thesis we introduce two new approaches
for finding k dense clusters:
  • A very general parametric approach:
  • Bregman Bubble Clustering
  • A non-parametric approach that provides a
    significant extension over existing methods:
  • Automated Hierarchical Density Shaving

10
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

11
Finding a single dense cluster: One
Class-IB [Crammer et al., 2004]
  • Perhaps the first parametric density-based
    approach.
  • Uses the notion of a Bregmanian ball.
  • Distance measure: Bregman divergence.
  • Pros
  • Faster than non-parametric methods.
  • Generalizes to all Bregman divergences, a large
    class of measures.
  • Cons
  • Can only find one dense region.

Bregmanian ball cost = average Bregman divergence
from the center.
12
Bregman Divergences
  • A large class of divergences:
  • Applicable to a wide variety of data types.
  • Includes a number of useful distance measures,
    e.g. KL-divergence, Itakura-Saito distance,
    squared Euclidean distance, Mahalanobis distance,
    etc.
  • Exploited in Bregman Clustering [Banerjee et al.,
    2004]:
  • Exhaustive partitioning into k segments.
  • Generalization of K-Means to all Bregman
    divergences.
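To make the definition concrete, here is a minimal sketch (my own illustration, not code from the thesis) of how a convex function phi induces the Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, recovering two of the examples above:

    import numpy as np

    def bregman_divergence(phi, grad_phi, x, y):
        # Generic Bregman divergence: d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    # phi(x) = ||x||^2 yields the squared Euclidean distance.
    sq_norm, grad_sq_norm = (lambda x: np.dot(x, x)), (lambda x: 2 * x)

    # phi(p) = sum_i p_i log p_i (negative entropy) yields the KL divergence
    # between probability vectors.
    neg_entropy = lambda p: np.sum(p * np.log(p))
    grad_neg_entropy = lambda p: np.log(p) + 1

    x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
    print(bregman_divergence(sq_norm, grad_sq_norm, x, y))          # ||x - y||^2 = 2.0
    p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
    print(bregman_divergence(neg_entropy, grad_neg_entropy, p, q))  # KL(p || q)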

13
Problem Definition: Bregman Bubble Clustering
  • Find k clusters consisting of a total of s points
    that have the lowest total cost.

Cost measure: Q(S) = Σ_{j=1..k} Σ_{x ∈ C_j} d_φ(x, c_j),
where d_φ is a Bregman divergence, c_1, ..., c_k are the
k centroids, and S = C_1 ∪ ... ∪ C_k is the set of k
clusters covering s points in total.
14
Bregman Bubbles are also applicable to two
important Bregman projections:
  • 1 - Cosine similarity for document clustering:
  • Sq. Euclidean distance between points projected
    onto a sphere.
  • Pearson distance for biological data:
  • Sq. Euclidean distance between z-scored points
    (i.e. between points projected onto a sphere after
    subtracting the mean across dimensions).
  • Equal to 1 - Pearson correlation.
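A small illustration of the Pearson-distance projection described above (my code, with the scaling chosen so the identity holds exactly): center each point across its dimensions, project onto a sphere, and the squared Euclidean distance between projected points equals 1 - Pearson correlation.

    import numpy as np

    def pearson_projection(x):
        # Center across dimensions, project onto a sphere, and scale so that
        # ||proj(x) - proj(y)||^2 = 1 - corr(x, y).
        z = x - x.mean()
        z = z / np.linalg.norm(z)
        return z / np.sqrt(2.0)

    x = np.array([1.0, 2.0, 3.0, 5.0])
    y = np.array([2.0, 1.0, 4.0, 6.0])
    d = np.sum((pearson_projection(x) - pearson_projection(y)) ** 2)
    print(d, 1 - np.corrcoef(x, y)[0, 1])   # the two numbers agree

Dropping the centering step and keeping only the sphere projection gives the analogous construction for 1 - cosine similarity.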

15
Bregmanian Ball vs. Bregman Bubbles
Bregmanian Ball, k = 1
Bregman Bubbles, k > 1
  • Can show that for fixed centers, the optimal
    assignment of points forms Bregman Bubbles.

16
Finding k Bregman Bubbles
  • Finding the optimal solution is too slow.
  • A simple iterative relocation algorithm, Bregman
    Bubble Clustering (BBC), is possible (see the
    sketch below):
  • Guaranteed to converge to a local minimum.
  • Alternately updates the assigned points and the
    centers of the k bubbles.
  • The mean is the best center at each step, because
    of the Bregman divergence property shown by
    Banerjee et al., 2004.
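A minimal sketch of that alternation, assuming squared Euclidean distance and a fixed s (the function and variable names are mine, not from the thesis):

    import numpy as np

    def bregman_bubble_clustering(X, centers, s, n_iter=100):
        # Alternate between (1) keeping only the s points closest to their
        # nearest center and (2) moving each center to the mean of its points.
        centers = np.array(centers, dtype=float)
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # n x k distances
            nearest = d.argmin(axis=1)                     # nearest bubble for every point
            cost = d[np.arange(len(X)), nearest]
            kept = np.argsort(cost)[:s]                    # only the s cheapest points stay
            new_centers = centers.copy()
            for j in range(len(centers)):
                members = kept[nearest[kept] == j]
                if len(members) > 0:                       # the mean is the optimal center
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels = np.full(len(X), -1)                       # -1 marks "don't care" points
        labels[kept] = nearest[kept]
        return centers, labels

With s = n this reduces to K-Means-style updates, and with k = 1 it searches for a single dense bubble, consistent with the unification discussed later.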

17
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 1.77
(Figure: the two bubbles around centroids c1 and c2,
with point counts 4 and 6.)
18
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.85
(Figure: point counts 5 and 5.)
19
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.56
(Figure: point counts 6 and 4.)
20
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.45
(Figure: point counts 6 and 4.)
21
Bregman Bubble Clustering demo for k = 2, s = 10:
average sq. Euclidean distance from assigned
centroid = 0.37
(Figure: point counts 5 and 5.)
22
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

23
Bregman Bubble Soft Clustering
  • A probabilistic model: a mixture of k exponential
    family distributions and one uniform distribution.
  • An EM-based algorithm that alternately updates the
    soft assignments and the mixing weights and
    component parameters.
  • Can be shown to converge to a local minimum.
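A compact way to write the generative model described above (my notation; the thesis may use different symbols): each point is drawn either from one of the k exponential-family components or from a background uniform density p_0,

    p(x) = \alpha_0 \, p_0(x) + \sum_{j=1}^{k} \alpha_j \, p_{(\psi,\theta_j)}(x),
    \qquad \alpha_0 + \sum_{j=1}^{k} \alpha_j = 1 .

The EM iterations then alternate between soft assignments of points to the k + 1 components (E-step) and re-estimation of the mixing weights and component parameters (M-step).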

24
A 2-D dataset: 5 Gaussians + uniform background.
Can you spot the 5 Gaussians?
25
A 2-D dataset: 5 Gaussians + uniform background.
26
The 2-D dataset after Bregman Soft
Clustering (scalar variances updated).
27
Bregman Bubble Soft Clustering
  • Exploits a bijection between Bregman divergences
    and exponential family distributions [Banerjee et
    al., 2004]: every convex function defines a
    Bregman divergence, which corresponds to an
    exponential family distribution.

Examples:
1. Squared Euclidean distance corresponds to the
   fixed-variance Gaussian.
2. KL-divergence corresponds to the multinomial.
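The bijection of Banerjee et al. can be written, up to notation, as follows: for a regular exponential family with cumulant function \psi and its convex conjugate \phi,

    p_{(\psi,\theta)}(x) = \exp\!\bigl(-\, d_{\phi}(x, \mu)\bigr) \, b_{\phi}(x),

where \mu = \nabla\psi(\theta) is the expectation parameter and b_{\phi}(x) does not depend on \theta. This is why squared Euclidean distance pairs with the fixed-variance Gaussian and KL-divergence with the multinomial in the two examples above.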
28
Unification
  • Can show that Bregman Bubble Clustering (BBC) is a
    special case of Bregman Bubble Soft Clustering
    under a limiting choice of the model parameters.
  • Demonstrates that the BBC algorithm is not ad-hoc
    but arises out of a mixture of exponentials and a
    uniform distribution.
  • Unifications with previous algorithms:
  • Bregman Bubble (k = 1) = One Class (same
    cost as OC-IB).
  • Bregman Bubble (cluster all data) = Bregman
    Clustering.
  • Soft Bubble (cluster all data) = Bregman Soft
    Clustering.

29
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Pressurization & Seeding
  • Results
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Results
  • Comparison
  • Demo

30
Bregman Bubble Soft Clustering with Pressurization
  • Problem:
  • BBC is very sensitive to initialization,
  • especially when small clusters are desired, because
    of limited mobility during the local search.
  • Solution: Pressurization
  • Demo

31
Pressurization demo, iteration 1
watch this area
32
Pressurization demo, iteration 2
33
Pressurization demo, iteration 3
34
Pressurization demo, iteration 9
35
Pressurization demo, iteration 10
36
Pressurization demo, iteration 20
37
Seeding Bregman Bubble Clustering
  • Goals:
  • Find k centers to start the BBC local search.
  • Overcome the local minima problem in BBC.
  • Automatically estimate k.
  • Features:
  • Deterministic.
  • Guaranteed within a constant factor of the optimal
    cost for the One Class case.

38
Seeding Bregman Bubble Clustering for k = 1
  • Input: n × n distance matrix, number of points to
    cluster s.
  • Restrict c (the center) to be a data point.
  • Output: the best cluster centroid.
  • Algorithm: sort each row, take the cumulative sum,
    normalize, pick the best (see the sketch below).

Best cluster of size s
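A sketch of that seeding step under my reading of the slide, assuming D is the n x n matrix of pairwise Bregman divergences (with each point at distance 0 from itself):

    import numpy as np

    def one_class_seed(D, s):
        # Cost of a candidate center = average distance to its s closest points
        # (sort each row, cumulative sum, normalize); the cheapest candidate wins.
        sorted_rows = np.sort(D, axis=1)
        costs = sorted_rows[:, :s].cumsum(axis=1)[:, -1] / s
        center = int(np.argmin(costs))         # best centroid, restricted to the data points
        members = np.argsort(D[center])[:s]    # the best cluster of size s
        return center, members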
39
Seeding Bregman Bubble Clustering for k > 1
  • Goal:
  • Identify the k dense regions in the data and the
    corresponding k centroids.
  • Main idea behind the solution:
  • If we ran One Class Bregman Bubble (k = 1) n times,
    starting from each of the n data points as the seed
    location, the n convergence locations would
    correspond to only k distinct densest regions in
    the data.
  • These k dense locations can be thought of as the
    centers of the k densest regions in the data.
  • A faster approximation of the above is possible by
    restricting the k-centroid search to the n data
    points (see the sketch after this list).
  • Demo of DGRADE (Density Gradient Enumeration)
    next.
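A compact sketch of DGRADE as described above and in the demo that follows (my code; tie-breaking details are glossed over):

    import numpy as np

    def dgrade(D, s_one):
        n = len(D)
        nbrs = np.argsort(D, axis=1)[:, :s_one]              # s_one closest neighbors (incl. self)
        cost = np.take_along_axis(D, nbrs, axis=1).mean(1)   # Step 1: One Class cost per point
        order = np.argsort(cost)                             # densest (lowest cost) first
        labels = np.full(n, -1)
        seeds = []
        for p in order:                                      # Step 2
            densest_nbr = min(nbrs[p], key=lambda q: cost[q])
            if densest_nbr == p:                             # p is the densest in its neighborhood
                labels[p] = len(seeds)                       # start a new cluster
                seeds.append(p)                              # its densest point becomes a seed
            else:
                labels[p] = labels[densest_nbr]              # join the densest neighbor's cluster
        return np.array(seeds), labels

The number of seeds returned gives the estimate of k, and the seeds can then initialize the BBC local search.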

40
Density Gradient Enumeration demo
41
Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point as
the average divergence to its s_one (say 5) closest
neighbors (including the point itself).
(Figure: one point labeled with its cost, 0.25.)
42
Density Gradient Enumeration demo
Step 1: Measure the One Class cost at each point
(which can also be seen as inversely proportional to
the data density at that point).
(Figure: every point labeled with its cost; legend
bins: <0.14, 0.15-0.19, 0.20-0.24, 0.25-0.34, >0.34.)
43
Density Gradient Enumeration demo
Step 2: Visit points in order of decreasing density;
connect the next densest point to the densest among
its 5 closest neighbors and relabel it. Start a new
cluster if the point itself is the densest.
Iteration 1: Cluster 1 started.
(Figure: points labeled with their One Class costs;
legend bins: <0.14, 0.15-0.19, 0.20-0.24, 0.25-0.34, >0.34.)
44
Density Gradient Enumeration demo
Step 2, Iteration 2: Cluster 2 started. (Figure as above.)
45
Density Gradient Enumeration demo
Step 2, Iteration 3. (Figure as above.)
46
Density Gradient Enumeration demo
Step 2, Iteration 4. (Figure as above.)
47
Density Gradient Enumeration demo
Step 2, Iteration 5. (Figure as above.)
48
Density Gradient Enumeration demo
Step 2, Iteration 6. (Figure as above.)
49
Density Gradient Enumeration demo
Step 2, Iteration 7. (Figure as above.)
50
Density Gradient Enumeration demo
Step 2, Iteration 8. (Figure as above.)
51
Density Gradient Enumeration demo
Step 2, Iteration 9. (Figure as above.)
52
Density Gradient Enumeration demo
Step 2, Iteration 10. (Figure as above.)
53
Density Gradient Enumeration demo
Step 2, Iteration 11. (Figure as above.)
54
Density Gradient Enumeration demo
Step 2, Iteration 12. (Figure as above.)
55
Density Gradient Enumeration demo
Step 2, Iteration 13. (Figure as above.)
56
Density Gradient Enumeration demo
Step 2, Iteration 14. (Figure as above.)
57
Density Gradient Enumeration demo
Step 2, Iteration 15. (Figure as above.)
58
Density Gradient Enumeration demo
Step 2, Iteration 16. (Figure as above.)
59
Density Gradient Enumeration demo
Step 2, Iteration 24. (Figure as above.)
60
Density Gradient Enumeration demo
Step 2, Iteration 24: return the densest point in each
cluster as the seeds. (Figure as above.)
61
Density Gradient Enumeration on 2-D Gaussian data
62
Results
Tested on many datasets, using three different
distance measures
63
Results on Lee (gene expression, 1-Pearson)
Overlap Lift
64
Results on 40-d Synthetic (Sq. Euclidean)
Adjusted Rand Index
65
Results on 20-Newsgroup (1-Cosine)
Adjusted Rand Index
66
Seeded BBC, Gasch Array (1-Pearson)
Pressurization + Seeding
Pressurization only
Adjusted Rand Index
67
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

68
Density-Based Clustering Algorithms
  • HMA (Wishart, 1968)
  • DBSCAN (Ester et al., 1996)
  • Density Shaving (DS) and the Auto-HDS framework.
  • Can show there is a strong connection between the
    above algorithms.
  • All 3 use a uniform density kernel (density ∝ no.
    of points within some radius).
69
Density Shaving (DS)
70
Density Shaving (DS)
71
Density Shaving (DS)
Two inputs: 1. f_shave, the fraction to shave
(0.38); 2. n_eps, the min. no. of neighbors (3).
Uses a trick to automatically compute the correct
ball radius from f_shave and n_eps (see the sketch
below).
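A sketch of one way that trick could work (my reading, not necessarily the thesis implementation): to keep the densest (1 - f_shave) fraction of points, set the ball radius to the nc-th smallest of the per-point distances to the n_eps-th nearest neighbor.

    import numpy as np

    def ds_radius(D, f_shave, n_eps):
        # D is a precomputed n x n distance matrix; each row includes the point
        # itself at distance 0, so counting 'self' as a neighbor is an assumption.
        n = len(D)
        nc = int(round((1.0 - f_shave) * n))          # number of points to keep dense
        knn_dist = np.sort(D, axis=1)[:, n_eps - 1]   # distance to the n_eps-th closest point
        r_eps = np.sort(knn_dist)[nc - 1]             # smallest radius leaving nc dense points
        dense = knn_dist <= r_eps                     # dense points; the rest are "don't care"
        return r_eps, dense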
72
Density Shaving (DS)
Performs graph traversal on dense points to
identify clusters.
73
Density Shaving (DS)
don't-care points
74
Properties of DS
  • Increasing n_eps has a smoothing effect on the
    clustering.

(Figure: n_eps = 5 vs. n_eps = 50; dots = dense points,
x = don't-care points.)
75
Properties of DS
  • For a fixed n_eps, successive runs of DS with
    increasing data shaving (f_shave) result in a
    hierarchy of clusters.

2-D Gaussian example: 1298 pts, 5 Gaussians + uniform
background. (Figure: 15% shaving vs. 38% shaving,
n_eps = 25.)
76
Properties of DS
  • With a fixed n_eps, successive runs of DS with
    increasing shaving (f_shave) result in a
    hierarchy of clusters.

(Figure: 15%, 38%, 62%, and 85% shaving.)
  • Clusters can split or vanish.
  • Points in separate clusters never merge into one.

2-D Gaussian example: 1298 pts, 5 Gaussians + uniform
background.
77
Hierarchical Density Shaving (HDS)
  • Uses geometric (exponential) shaving to create the
    hierarchy from DS.
  • Starting from all the data, a fixed fraction
    r_shave of the remaining data is shaved at each
    iteration.
  • Clusters that lose points without splitting keep
    the same id. Example:

(Figure: clusters A and B at 38% and 55% shaving.)
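One way to write the geometric shaving schedule (my formulation): if a fraction r_shave of the surviving points is shaved at each iteration, then after i iterations the number of surviving points is

    n_i = \lceil n \, (1 - r_{\mathrm{shave}})^{i} \rceil ,

so the number of levels in the hierarchy is roughly \log(n_{\mathrm{stop}} / n) / \log(1 - r_{\mathrm{shave}}), i.e. logarithmic in n, where n_{\mathrm{stop}} is the (assumed) point count at which shaving stops.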
78
An important trick: Dictionary Row Sort on the HDS
Label Matrix
79
Visualization using the sorted Label Matrix
  • The sorted matrix is plotted: each row is a point
    (sorted row index), each column a shaving iteration.
  • Each cluster is plotted in a unique color.
  • Don't-care points are plotted in the background
    color.
  • Shows the compact, 8-node hierarchy.

(Figure: clusters A, B, and C appear at 38%, 62%, and
85% shaving.)
80
Cluster Stability
  • Shaving iteration = level (22 iterations in this
    example).
  • Stability = difference between the first and last
    level of a cluster, i.e. the number of shaving
    iterations for which the cluster exists.

(Figure: spatially relevant projection, with cluster
levels marked.)
81
Cluster Selection
  • We can show that relative stability is independent
    of the shaving rate r_shave.
  • All clusters can be ranked by stability, even
    parents and children.
  • One way of selecting clusters:
  • - Highest-stability clusters are picked first.
  • - Parents and children of picked clusters are
    eliminated.

82
HDS + Visualization + Model selection = Auto-HDS
Auto-HDS (546 pts)
  • Auto-HDS
  • - Finds all modes / clusters.
  • - Finds clusters of varying density
    simultaneously.

DS (546 pts, f_shave = 0.58)
83
Results: Gasch Dataset
84
Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Reference pool, not stressed
Heat Shock
YPD
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
85
Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Heat Shock
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
86
Results: Lee Dataset
87
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

88
BBC vs. Auto-HDS: a qualitative comparison
89
BBC vs. Auto-HDS: Sim-2
Auto-HDS
BBC
90
BBC vs. Auto-HDS: Sim-2
91
BBC vs. Auto-HDS: Lee
92
Outline
  • Parametric approach
  • Bregman Bubble Clustering
  • Bregman Bubble Soft Clustering
  • Seeding
  • Non Parametric approach
  • Automated Hierarchical Density Shaving
  • Comparison
  • Demo

93
Gene DIVER
  • Gene Density Interactive Visual ExplorER.
  • A scalable implementation of Auto-HDS that streams
    data from disk instead of relying on main memory.
  • Special features for browsing clusters.
  • Special features for biological data mining.
  • Available for download at
  • http://www.ideal.ece.utexas.edu/gunjan/genediver

Let's see the Gene DIVER demo now.
94
Main Contributions
  • Simultaneously finding dense clusters and pruning
    the rest is useful in many domains.
  • The parametric method BBC generalizes density-based
    clustering to a large class of problems:
  • Very scalable to large, high-dimensional data.
  • Robust with pressurization and seeding.
  • Auto-HDS improves upon non-parametric
    density-based clustering in many ways:
  • Well-suited for very high-dimensional datasets.
  • A powerful visualization.
  • Interactive clustering, compact hierarchy.
  • Gene DIVER: a powerful tool for the data mining
    community, and especially for bioinformatics
    practitioners.

95
Future Work
  • BBC
  • Bregman Bubble Coclustering.
  • Online Bregman Bubble for capturing localized
    concept drifts.
  • Auto-HDS
  • Variable resolution ISOMAP.
  • Deterministic coclustering.
  • Extensions to Gene DIVER.

96
Relevant Papers
  • G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A
    Robust, Scalable Framework for Locating Multiple,
    Dense Regions in Data," ICDM 2006, 12 pages.
  • G. Gupta, A. Liu and J. Ghosh, "Hierarchical Density
    Shaving: A clustering and visualization framework
    for large biological datasets," ICDM-DMB 2006, 5
    pages.
  • G. Gupta, A. Liu and J. Ghosh, "Clustering and
    Visualization of High-Dimensional Biological
    Datasets using a fast HMA Approximation," ANNIE
    2006, 6 pages.
  • G. Gupta and J. Ghosh, "Robust One-Class
    Clustering Using Hybrid Global and Local Search,"
    ICML 2005, pp. 273-280.
  • G. Gupta and J. Ghosh, "Bregman Bubble Clustering: A
    Robust Framework for Mining Dense Clusters," under
    review, JMLR.
  • G. Gupta, A. Liu and J. Ghosh, "Automated
    Hierarchical Density Shaving: A robust, automated
    clustering and visualization framework for large
    biological datasets," under review, IEEE Trans.
    Comp. Bio. & Bioinform.

97
?
98
Backup Slides from Here
99
Other papers
  • G. Gupta and J. Ghosh, "Detecting Seasonal Trends
    and Cluster Motion Visualization for very High
    Dimensional Transactional Data," SDM 2001.
  • G. Gupta and J. Ghosh, "Value Balanced
    Agglomerative Connectivity Clustering," Proc. SPIE
    Conf. on Data Mining and Knowledge Discovery,
    SPIE 2001.
  • G. Gupta, A. Strehl and J. Ghosh, "Distance Based
    Clustering of Association Rules," ANNIE 1999.

100
Properties of Auto-HDS
  • Fast: O(n · n_eps · log n) using a heap-based
    implementation.
  • Gene DIVER: a memory-efficient heap-based
    implementation.
  • Extremely compact hierarchy of clusters.
  • Visualization:
  • Creates a spatially relevant 2-D projection of
    points and clusters.
  • Spatially relevant 2-D projection of the compact
    hierarchy.
  • Model selection:
  • Can define a notion of stability (analogous to
    cluster height).
  • Based on stability, can select the most stable
    clusters automatically.

101
Finding relevant subsets: related work
  • Density-based clustering, e.g. DBSCAN [Ester et
    al., 1996]:
  • Pros: good for low-dimensional spatial data.
  • Cons: not suitable for high-dimensional or
    non-metric scenarios.
  • Gene Shaving [Hastie et al., 2000]:
  • Pros: well-suited for gene expression datasets.
  • Cons: greedily finds clusters, slow, implicit sq.
    Euclidean assumptions.
  • PLAID [Lazzeroni et al., 2002]:
  • Pros: clusters rows and columns simultaneously;
    good for high-dimensional data.
  • Cons: greedy extraction of clusters as plaids;
    assumes additive layers, which is not true for many
    datasets.

102
DGRADE: Selecting the s_one parameter
  • s_one acts like a smoothing parameter for DGRADE.
  • As s_one increases, k declines.
  • Three scenarios for determining s_one:
  • If k is given, find the smallest s_one that results
    in k clusters.
  • If not, find the k that occurs for the longest
    interval (maximum stability) of s_one values.
  • Or, find the largest k that occurs at least a
    certain no. of times (minimum stability).

103
DGRADE: Selecting the s_one parameter
  • Example on 2-D data.

(Figure, left: k = 5 given as input, s_one found to be
57. Right: automatic k (= 4) and s_one (= 62).)
104
Seeded BBC on 40-d Synthetic (Sq. Euclidean)
Pressurization + Seeding
Using Pressurization only
Adjusted Rand Index