Efficient Algorithms for Non-parametric Clustering With Clutter

Slides: 72

Provided by: me7788

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Efficient Algorithms for Non-parametric
Clustering With Clutter

Weng-Keen Wong
Andrew Moore
(In partial fulfillment of the speaking
requirement)

2
Problems From the Physical Sciences
Minefield detection (Dasgupta and Raftery 1998)
Earthquake faults (Byers and Raftery 1998)
3
Problems From the Physical Sciences
(Pereira 2002)
(Sloan Digital Sky Survey 2000)
4
A Simplified Example
5
Clustering with Single Linkage Clustering
Clusters
Single Linkage Clustering MST
6
Clustering with Mixture Models
Resulting Clusters
Mixture of Gaussians with a Uniform Background
Component
7
Clustering with CFF
Cuevas-Febrero-Fraiman
Original Dataset
8
Related Work

(Dasgupta and Raftery 98)
Mixture model approach: mixture of Gaussians for
features, Poisson process for clutter
(Byers and Raftery 98)
K-nearest neighbour distances for all points
modeled as a mixture of two gamma distributions,
one for clutter and one for the features
Classify each data point based on which component
it was most likely generated from

9
Outline

1. Introduction: Clustering and Clutter
2. The Cuevas-Febrero-Fraiman Algorithm
3. Optimizing Step One of CFF
4. Optimizing Step Two of CFF
5. Results

10
The CFF Algorithm: Step One

Find the high
density datapoints

11
The CFF Algorithm: Step Two

Cluster the high density points using Single
Linkage Clustering
Stop when link length > ε
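
As a rough illustration of Step Two, the sketch below groups points by joining every pair whose distance does not exceed the link-length threshold, which gives the same clusters as single linkage stopped at that threshold. It is a brute-force O(N^2) version, not the optimized method developed later in the talk; the names `epsilon_clusters` and `eps` are assumptions made for this example.

```python
from itertools import combinations
from math import dist

def epsilon_clusters(points, eps):
    """Return clusters (lists of point indices) linked by distances <= eps."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute force: examine every pair and merge the two components
    # whenever the link between them is short enough.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

For example, `epsilon_clusters([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)], 1.5)` separates the first three points from the last two.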

12
The CFF Algorithm

Originally intended to estimate the number of
clusters
Can also be used to find clusters against a noisy
background

13
Step One: Density Estimators

Finding high density points requires a density
estimator
Want to make as few assumptions about underlying
density as possible
Use a non-parametric density estimator

14
A Simple Non-Parametric Density Estimator

A datapoint is a high density datapoint if
the number of datapoints within a hypersphere
of radius h is > threshold c
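
A minimal brute-force sketch of this estimator follows. It assumes that a point does not count itself as a neighbour and that the hypersphere is closed (distance <= h); the slide does not say either way. The function name `high_density_points` is hypothetical, and this is the naive O(N^2) version rather than the dual-tree speedup mentioned on the next slide.

```python
from math import dist

def high_density_points(points, h, c):
    """Return the points with more than c other points within radius h."""
    dense = []
    for i, p in enumerate(points):
        # Count neighbours inside the hypersphere of radius h centred on p.
        # Assumption: p itself is excluded from its own count.
        count = sum(
            1 for j, q in enumerate(points) if j != i and dist(p, q) <= h
        )
        if count > c:
            dense.append(p)
    return dense
```

With `h = 0.5` and `c = 1`, the points `(0, 0)`, `(0.1, 0)` and `(0, 0.1)` each have two neighbours and are kept, while an isolated point such as `(5, 5)` is treated as clutter.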

15
Speeding up the Non-Parametric Density Estimator

Addressed in a separate paper (Gray and Moore
2001)
Two basic ideas
1. Use a dual tree algorithm (Gray and Moore
2000)
2. Cut search off early without computing exact
densities (Moore 2000)

16
Step Two: Euclidean Minimum Spanning Trees (EMSTs)

Traditional MST algorithms assume you are given
all the distances
Implies O(N^2) memory usage
Want to use a Euclidean Minimum Spanning Tree
algorithm

17
Optimizing Clustering Step

Exploit recent results in computational geometry
for efficient EMSTs
Involves a modification to the GeoMST2 algorithm
(Narasimhan et al. 2000)
GeoMST2 is based on Well-Separated Pairwise
Decompositions (WSPDs) (Callahan 1995)
Our optimizations gain an order of magnitude
speedup, especially in higher dimensions

18
Outline for Optimizing Step Two

1. High level overview of GeoMST2
2. Properties of a WSPD
3. How to create a WSPD
4. More detailed description of GeoMST2
5. Our optimizations

19
Intuition behind GeoMST2
20
Intuition behind GeoMST2
21
High Level Overview of GeoMST2
Well-Separated Pairwise Decomposition

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

22
High Level Overview of GeoMST2
Well-Separated Pairwise Decomposition
Each Pair (Ai,Bi) represents a possible edge in
the MST

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

23
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise
Decomposition

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

2. Take the pair (Ai,Bi) that corresponds to the
shortest edge
3. If the vertices of that edge are not in the
same connected component, add the edge to the
MST. Repeat Step 2.
24
A Well-Separated Pair (Callahan 1995)

Let A and B be point sets in ℝ^d
Let RA and RB be their respective bounding
hyper-rectangles
Define MargDistance(A,B) to be the minimum
distance between RA and RB

25
A Well-Separated Pair (Cont)

The point sets A and B are considered to be
well-separated if
MargDistance(A,B) ≥ max{Diam(RA), Diam(RB)}
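
The well-separated test can be sketched as below. Two assumptions are made that the slides do not state: the diameter of a hyper-rectangle is taken to be the length of its diagonal, and the comparison is non-strict. The helper names (`bounding_box`, `diameter`, `marg_distance`, `is_well_separated`) are chosen for this example only.

```python
from math import sqrt

def bounding_box(points):
    """Return (lower corner, upper corner) of the axis-aligned bounding box."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi

def diameter(box):
    """Diagonal length of the hyper-rectangle (assumed meaning of Diam)."""
    lo, hi = box
    return sqrt(sum((h - l) ** 2 for l, h in zip(lo, hi)))

def marg_distance(box_a, box_b):
    """Minimum distance between two axis-aligned hyper-rectangles."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # Per-dimension gap; zero when the projections overlap.
    gaps = (
        max(lb - ha, la - hb, 0.0)
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    )
    return sqrt(sum(g * g for g in gaps))

def is_well_separated(a, b):
    box_a, box_b = bounding_box(a), bounding_box(b)
    return marg_distance(box_a, box_b) >= max(diameter(box_a), diameter(box_b))
```

Under these assumptions any two distinct single points are well-separated, since both bounding boxes have diameter zero.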

26
Interaction Product

The interaction product between two point sets A
and B is defined as
A ⊗ B = { {p, p'} | p ∈ A, p' ∈ B, p ≠ p' }

27
Interaction Product

The interaction product between two point sets A
and B is defined as
A ⊗ B = { {p, p'} | p ∈ A, p' ∈ B, p ≠ p' }

This is the set of all distinct pairs with one
element in the pair from A and the other element
from B
28
Interaction Product Definition

The interaction product between two point sets A
and B is defined as
A ⊗ B = { {p, p'} | p ∈ A, p' ∈ B, p ≠ p' }

For example: A = {1,2,3}, B = {4,5}
A ⊗ B = { {1,4}, {1,5}, {2,4}, {2,5}, {3,4}, {3,5} }
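
The interaction product is easy to express with sets. The sketch below represents each unordered pair as a `frozenset`, so a pair and its reverse are the same element; the function name `interaction_product` is chosen for this example.

```python
def interaction_product(a, b):
    """All unordered pairs {p, q} with p in a, q in b and p != q."""
    return {frozenset((p, q)) for p in a for q in b if p != q}

# The example from the slide: A = {1,2,3}, B = {4,5}.
example = interaction_product({1, 2, 3}, {4, 5})

# Taking the product of a set with itself gives every edge of the
# complete undirected graph on that set.
all_edges = interaction_product(range(5), range(5))
```

Here `example` holds six pairs and `all_edges` holds the ten edges on five vertices.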
29
Interaction Product
Now let A and B be the same point set, i.e. A =
{0,1,2,3,4}, B = {0,1,2,3,4}

A ⊗ B = { {0,1}, {0,2}, {0,3}, {0,4},
{1,2}, {1,3}, {1,4},
{2,3}, {2,4},
{3,4} }

30
Interaction Product
Now let A and B be the same point set, i.e. A =
{0,1,2,3,4}, B = {0,1,2,3,4}

A ⊗ B = { {0,1}, {0,2}, {0,3}, {0,4},
{1,2}, {1,3}, {1,4},
{2,3}, {2,4},
{3,4} }

Think of this as all possible edges in a
complete, undirected graph with {0,1,2,3,4} as
the vertices
31
A Well-Separated Pairwise Decomposition
Pair 1: ({0}, {1})
Pair 2: ({0,1}, {2})
Pair 3: ({0,1,2}, {3,4})
Pair 4: ({3}, {4})
Claim: The set of pairs {({0},{1}), ({0,1},{2}),
({0,1,2},{3,4}), ({3},{4})} forms a
Well-Separated Pairwise Decomposition.
32
Interaction Product Properties

If P is a point set in ℝ^d then a WSPD of P is a
set of pairs {(A1,B1),…,(Ak,Bk)} with the following
properties
1. Ai ⊆ P and Bi ⊆ P for all i = 1,…,k
2. Ai ∩ Bi = ∅ for all i = 1,…,k

A = {0,1,2,3,4}, B = {0,1,2,3,4}
{({0},{1}), ({0,1},{2}), ({0,1,2},{3,4}), ({3},{4})}
clearly satisfies Properties 1 and 2
33
Interaction Product Property 3

3. (Ai ⊗ Bi) ∩ (Aj ⊗ Bj) = ∅ for all i, j such
that i ≠ j

From {({0},{1}), ({0,1},{2}), ({0,1,2},{3,4}),
({3},{4})} we get the following interaction
products:
A1 ⊗ B1 = { {0,1} }
A2 ⊗ B2 = { {0,2}, {1,2} }
A3 ⊗ B3 = { {0,3}, {1,3}, {2,3}, {0,4}, {1,4}, {2,4} }
A4 ⊗ B4 = { {3,4} }
These interaction products are all disjoint
34
Interaction Product Property 4

P ⊗ P = { {0,1}, {0,2}, {0,3}, {0,4}, {1,2},
{1,3}, {1,4}, {2,3}, {2,4}, {3,4} }
A1 ⊗ B1 = { {0,1} }
A2 ⊗ B2 = { {0,2}, {1,2} }
A3 ⊗ B3 = { {0,3}, {1,3}, {2,3}, {0,4}, {1,4}, {2,4} }
A4 ⊗ B4 = { {3,4} }
The union of the above interaction products
gives back P ⊗ P
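
Properties 1 through 4 are purely combinatorial, so they can be checked mechanically on the running example. The sketch below does that; it does not check well-separatedness, which needs coordinates. The function names are assumptions for this illustration.

```python
from itertools import combinations

def interaction_product(a, b):
    """All unordered pairs {p, q} with p in a, q in b and p != q."""
    return {frozenset((p, q)) for p in a for q in b if p != q}

def check_decomposition(p, pairs):
    """Check properties 1 to 4 of a candidate decomposition of point set p."""
    p = set(p)
    products = [interaction_product(a, b) for a, b in pairs]
    # Property 1: every Ai and Bi is a subset of P.
    subsets = all(set(a) <= p and set(b) <= p for a, b in pairs)
    # Property 2: Ai and Bi share no points.
    disjoint_sets = all(not (set(a) & set(b)) for a, b in pairs)
    # Property 3: interaction products of different pairs do not overlap.
    disjoint_products = all(not (x & y) for x, y in combinations(products, 2))
    # Property 4: together the interaction products give back P x P.
    covers = set().union(*products) == interaction_product(p, p)
    return subsets and disjoint_sets and disjoint_products and covers

pairs = [({0}, {1}), ({0, 1}, {2}), ({0, 1, 2}, {3, 4}), ({3}, {4})]
ok = check_decomposition(range(5), pairs)
```

Dropping any pair from `pairs` breaks property 4, because some edge of the complete graph is then left uncovered.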
35
Interaction Product Property 5

5. Ai and Bi are well-separated for all i = 1,…,k

36
Two Points to Note about WSPDs

Two distinct points are considered to be
well-separated
For any data set of size n, there is a trivial
WSPD of size (n choose 2)

37
A Well-Separated Pairwise Decomposition
(Continued)
If there are n points in P, a WSPD of P can be
constructed in O(n log n) time with O(n) elements
using a fair split tree (Callahan 1995)
38
A Fair Split Tree
39
Creating a WSPD
Are the nodes outlined in yellow well-separated?
No.
40
Creating a WSPD
Recurse on children of node with widest dimension
41
Creating a WSPD
Recurse on children of node with widest dimension
42
Creating a WSPD
Recurse on children of node with widest dimension
43
Creating a WSPD
And so on
44
Base Case
Eventually you will find a well-separated pair of
nodes. Add this pair to the WSPD.
45
Another Example of the Base Case
46
Creating a WSPD

FindWSPD(W, NodeA, NodeB)
  if (IsWellSeparated(NodeA, NodeB))
    AddPair(W, NodeA, NodeB)
  else
    if (MaxHrectDimLength(NodeA) < MaxHrectDimLength(NodeB))
      Swap(NodeA, NodeB)
    FindWSPD(W, NodeA->Left, NodeB)
    FindWSPD(W, NodeA->Right, NodeB)
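
A runnable sketch of the recursion above is given below. Several things here are assumptions rather than the talk's implementation: the tree simply splits the widest dimension at its midpoint (a simplified stand-in for a fair split tree), the input points are distinct tuples, Diam is the box diagonal, and `build_wspd` starts the recursion on the two children of every internal node.

```python
from math import sqrt

class Node:
    """Split-tree node holding its points and their bounding hyper-rectangle."""
    def __init__(self, points):
        self.points = points
        self.lo = tuple(min(c) for c in zip(*points))
        self.hi = tuple(max(c) for c in zip(*points))
        self.left = self.right = None
        if len(points) > 1:
            # Split the widest dimension at its midpoint.
            d = max(range(len(self.lo)), key=lambda k: self.hi[k] - self.lo[k])
            mid = (self.lo[d] + self.hi[d]) / 2
            self.left = Node([p for p in points if p[d] <= mid])
            self.right = Node([p for p in points if p[d] > mid])

def diam(n):
    return sqrt(sum((h - l) ** 2 for l, h in zip(n.lo, n.hi)))

def marg_distance(a, b):
    return sqrt(sum(max(lb - ha, la - hb, 0.0) ** 2
                    for la, ha, lb, hb in zip(a.lo, a.hi, b.lo, b.hi)))

def is_well_separated(a, b):
    return marg_distance(a, b) >= max(diam(a), diam(b))

def max_hrect_dim_length(n):
    return max(h - l for l, h in zip(n.lo, n.hi))

def find_wspd(w, a, b):
    if is_well_separated(a, b):
        w.append((a.points, b.points))
    else:
        # Always split the node with the longer bounding-box side.
        if max_hrect_dim_length(a) < max_hrect_dim_length(b):
            a, b = b, a
        find_wspd(w, a.left, b)
        find_wspd(w, a.right, b)

def build_wspd(root):
    """Run find_wspd on the children of every internal node of the tree."""
    w, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.left is not None:
            find_wspd(w, node.left, node.right)
            stack += [node.left, node.right]
    return w
```

Calling `build_wspd(Node(points))` returns a list of `(A, B)` point-list pairs. Every pair of input points meets at exactly one lowest common ancestor in the tree, which is why each edge of the complete graph ends up covered exactly once.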

47
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise
Decomposition

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

2. Take the pair (Ai,Bi) that corresponds to the
shortest edge
3. If the vertices of that edge are not in the
same connected component, add the edge to the
MST. Repeat Step 2
48
Bichromatic Closest Pair Distance

Given two sets (Ai,Bi), the Bichromatic
Closest Pair Distance is the closest distance
from a point in Ai to a point in Bi
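
For small sets the bichromatic closest pair can be found by brute force, as in this sketch. The function name `bcp` and the choice to return the two endpoints along with the distance are assumptions made for illustration; GeoMST2 itself computes this more carefully.

```python
from math import dist

def bcp(a, b):
    """Return (distance, p, q) for the closest pair with p in a and q in b."""
    # Compare every point of a against every point of b: O(|a| * |b|).
    return min((dist(p, q), p, q) for p in a for q in b)
```

For instance, `bcp([(0, 0), (1, 0)], [(3, 0), (5, 0)])` picks the pair `(1, 0)` and `(3, 0)`.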

49
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise
Decomposition

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

2. Take the pair (Ai,Bi) with the shortest BCP
distance
3. If Ai and Bi are not already connected, add
the edge to the MST. Repeat Step 2.
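
The three steps can be sketched with a priority queue and a union-find structure. This is a simplified illustration, not GeoMST2 itself: BCP distances are computed by brute force up front, connectivity is tested on the two BCP endpoints, and the demonstration uses the trivial WSPD of all singleton pairs. The names `emst_from_pairs` and `trivial_wspd` are hypothetical, and points are assumed to be distinct tuples.

```python
import heapq
from itertools import combinations
from math import dist

def emst_from_pairs(points, pairs):
    """Build MST edges from a WSPD given as a list of (A, B) point lists."""
    parent = {p: p for p in points}  # union-find over the points

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Step 1 is assumed done: `pairs` is a WSPD of `points`.
    # Key each pair by its bichromatic closest pair (brute force here).
    heap = [min((dist(p, q), p, q) for p in a for q in b) for a, b in pairs]
    heapq.heapify(heap)

    edges = []
    while heap and len(edges) < len(points) - 1:
        d, p, q = heapq.heappop(heap)   # Step 2: shortest BCP distance
        rp, rq = find(p), find(q)
        if rp != rq:                    # Step 3: not yet connected
            parent[rp] = rq
            edges.append((p, q, d))
    return edges

def trivial_wspd(points):
    """The trivial decomposition: one pair of singletons per point pair."""
    return [([p], [q]) for p, q in combinations(points, 2)]
```

With the trivial decomposition this reduces to Kruskal's algorithm on the complete graph; the benefit of a real WSPD is that far fewer pairs enter the queue.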
50
GeoMST2 Example Start
Current MST
51
GeoMST2 Example Iteration 1
Current MST
52
GeoMST2 Example Iteration 2
Current MST
53
GeoMST2 Example Iteration 3
Current MST
54
GeoMST2 Example Iteration 4
Current MST
55
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise
Decomposition
Modification for CFF: If BCP distance > ε,
terminate

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

2. Take the pair (Ai,Bi) with the shortest BCP
distance
3. If Ai and Bi are not already connected, add
the edge to the MST. Repeat Step 2.
56
Optimizations

We don't need the EMST
We just need to cluster all points that are
within distance ε or less of each other
Allows two optimizations to GeoMST2 code

57
High Level Overview of GeoMST2
Optimizations take place in Step 1
1. Create the Well-Separated Pairwise
Decomposition

(A1,B1)
(A2,B2)
.
.
.
(Am,Bm)

2. Take the pair (Ai,Bi) with the shortest BCP
distance
3. If Ai and Bi are not already connected, add
the edge to the MST. Repeat Step 2.
58
Recall How to Create the WSPD
59
Optimization 1 Illustration
60
Optimization 1

Ignore all links that are > ε
Every pair (Ai, Bi) in the WSPD becomes an edge
unless it joins two already connected components
If MargDistance(Ai,Bi) > ε, then an edge of
length ≤ ε cannot exist between a point in Ai and
a point in Bi
Don't include such a pair in the WSPD
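
The pruning test in Optimization 1 can be sketched as a filter over bounding boxes. Boxes are represented as `(lower corner, upper corner)` tuples and the threshold is passed as `eps`; these representations and the name `prune_far_pairs` are assumptions for this example. In the talk the test is applied while the WSPD is being built, not as a separate pass.

```python
from math import sqrt

def marg_distance(box_a, box_b):
    """Minimum distance between two axis-aligned hyper-rectangles."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = (
        max(lb - ha, la - hb, 0.0)
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    )
    return sqrt(sum(g * g for g in gaps))

def prune_far_pairs(boxed_pairs, eps):
    """Keep only pairs whose boxes could contain a link of length <= eps."""
    # If even the closest corners are more than eps apart, no point in one
    # box can be within eps of a point in the other, so the pair is dropped.
    return [(a, b) for a, b in boxed_pairs if marg_distance(a, b) <= eps]
```

A pair of boxes separated by a gap of 0.5 survives with `eps = 1.0`, while a pair whose nearest corners are far apart is discarded.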

61
Optimization 2 Illustration
62
Optimization 2

Join all elements that are within ε distance of
each other
If the max distance separating the bounding
hyper-rectangles of Ai and Bi is ≤ ε, then join
all the points in Ai and Bi if they are not
already connected
Do not add such a pair (Ai,Bi) to the WSPD
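
The test in Optimization 2 needs the largest possible distance between a point in one hyper-rectangle and a point in the other. The sketch below reads "max distance" that way, which is an interpretation of the slide; boxes are `(lower corner, upper corner)` tuples and the names `max_box_distance` and `can_join_all` are hypothetical.

```python
from math import sqrt

def max_box_distance(box_a, box_b):
    """Largest distance between a point in box_a and a point in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # In each dimension the farthest separation is between opposite ends
    # of the two intervals.
    spans = (
        max(hb - la, ha - lb)
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    )
    return sqrt(sum(s * s for s in spans))

def can_join_all(box_a, box_b, eps):
    """True when every cross pair of points is guaranteed within eps."""
    return max_box_distance(box_a, box_b) <= eps
```

When `can_join_all` holds, all points in the two sets can be merged into one component without examining individual distances.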

63
Implications of the optimizations

Reduce the amount of time spent in creating the
WSPD
Reduce the number of pairs in the WSPD, thereby
speeding up the GeoMST2 algorithm by reducing the
size of the priority queue

64
Results

Ran step two algorithms on subsets of the Sloan
Digital Sky Survey
7 attributes: 4 colors, 2 sky coordinates, 1
redshift value
Compared Kruskal, GeoMST2, and
ε-clustering

65
Results (GeoMST2 vs ε-Clustering vs Kruskal in
4D)
66
Results (GeoMST2 vs ε-Clustering in 3D)
67
Results (GeoMST2 vs ε-Clustering in 4D)
68
Results (Change in Time as ε changes for 4D data)
69
Results (Increasing Dimensions vs Time)
70
Future Work

More accurate, faster non-parametric density
estimator
Use ball trees instead of fair split tree
Optimize algorithm if we keep h constant but vary
c and ε

71
Conclusions

ε-clustering outperforms GeoMST2 by nearly an
order of magnitude in higher dimensions
Combining the optimizations in both steps will
yield an efficient algorithm for clustering
against clutter on massive data sets
