Efficient Density-Based Clustering of Complex Objects presentation

Title: Efficient Density-Based Clustering of Complex Objects

1
Efficient Density-Based Clustering of Complex Objects

Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
University of Munich
Institute for Computer Science
Brighton, UK
November 01-04, 2004

2
Outline

Density-Based Clustering
Clustering of Complex Objects
Experimental Evaluation

3
Outline

Density-Based Clustering
Core Object, Density-Reachability
DBSCAN, OPTICS
Clustering of Complex Objects
Experimental Evaluation

4
Data Mining

Larger and larger amounts of data are collected automatically
Too large for humans to analyze manually
Tools to assist analysis are necessary → KDD / Data Mining

Hubble Space Telescope
Telecommunication Data
Market-Basket Data
5
Clustering

Clustering
Efficiently grouping the database into sub-groups (clusters) such that
similarity within clusters is maximized
similarity between clusters is minimized

Flat Clustering: one level of clusters
e.g. density-based clustering algorithm DBSCAN (KDD 96)

Hierarchical Clustering: nested clusters
e.g. density-based clustering algorithm OPTICS (SIGMOD 99)
6
Density-Based Clustering I

Parameters
range ε and minimal weight MinPts
Definition: core object
q is a core object if |rangeQuery(q, ε)| ≥ MinPts
Definition: directly density-reachable
p is directly density-reachable from q if
q is a core object and p ∈ rangeQuery(q, ε)
Definition: density-reachable
density-reachable is the transitive closure of
directly density-reachable
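
The three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: objects are 2D points stored as tuples, the distance is Euclidean, the weight of a neighborhood is simply its cardinality, and q counts as a member of its own range query. The function names are hypothetical and not taken from the slides.

```python
from math import dist

def range_query(db, q, eps):
    """All objects of db within distance eps of q (q itself included)."""
    return [o for o in db if dist(o, q) <= eps]

def is_core_object(db, q, eps, min_pts):
    """q is a core object if its eps-range holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_density_reachable(db, p, q, eps, min_pts):
    """p is directly density-reachable from q (note: not symmetric)."""
    return is_core_object(db, q, eps, min_pts) and p in range_query(db, q, eps)

def density_reachable(db, p, q, eps, min_pts):
    """Transitive closure: follow directly-density-reachable steps from q."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if not is_core_object(db, cur, eps, min_pts):
            continue  # only core objects can reach further objects
        for o in range_query(db, cur, eps):
            if o == p:
                return True
            if o not in seen:
                seen.add(o)
                frontier.append(o)
    return False
```

With the points (0,0), (1,0), (2,0), (3,0) and ε = 1, MinPts = 3, the border point (3,0) is density-reachable from (1,0) via the core object (2,0), but not the other way round, since (3,0) is not a core object.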

7
Density-Based Clustering II

Core Idea of Hierarchical Cluster Ordering
Order the objects linearly such that
objects of a cluster are adjacent in the
ordering.

8
Density-Based Clustering II

Definition: core-distance
[Figure: object o with its ε-range and core-distance(o), MinPts = 5]
9
Density-Based Clustering II

Definition: reachability-distance
[Figure: objects o and p with the ε-range and core-distance(o), MinPts = 5]
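
The slide shows both definitions only as a figure. The sketch below spells them out in the way they are usually stated for OPTICS, which is an assumption about what the figure depicts: the core-distance of o is the distance to its MinPts-th nearest neighbor inside the ε-range (undefined, here infinity, if o is not a core object), and the reachability-distance of p with respect to o is the larger of core-distance(o) and the distance between o and p. Objects are assumed to be 2D tuples with Euclidean distance, and o counts as its own neighbor.

```python
from math import dist, inf

def core_distance(db, o, eps, min_pts):
    """Distance from o to its MinPts-th nearest neighbor within eps
    (o counts as its own neighbor); inf if o is not a core object."""
    d = sorted(dist(o, x) for x in db if dist(o, x) <= eps)
    return d[min_pts - 1] if len(d) >= min_pts else inf

def reachability_distance(db, p, o, eps, min_pts):
    """max(core-distance(o), dist(o, p)); inf if o is not a core object."""
    return max(core_distance(db, o, eps, min_pts), dist(o, p))
```

Objects closer to o than its core-distance all receive the same reachability value, which is what smooths the reachability plot inside dense regions.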
10
OPTICS Algorithm
[Figure: reachability plot (axis: reach, tick 44) for the example data set]
seedlist: (empty)
11
OPTICS Algorithm
[Figure: example data set with labels A, B, I, ε and core-distance; reachability plot (axis: reach, tick 44)]
seedlist: (B, 40) (I, 40)
12
OPTICS Algorithm
[Figure: example data set with labels A, B, C, I; reachability plot (axis: reach, tick 44)]
seedlist: (I, 40) (C, 40)
13
OPTICS Algorithm
[Figure: example data set with 16 labeled objects; reachability plot (axis: reach, tick 44; plot axis label: I)]
seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
14
OPTICS Algorithm
[Figure: example data set with 16 labeled objects; reachability plot (axis: reach, tick 44; plot axis labels: I J)]
seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
15
OPTICS Algorithm
[Figure: example data set with 16 labeled objects; reachability plot (axis: reach, tick 44; plot axis labels: I J L)]
seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
16
OPTICS Algorithm
[Figure: example data set with 16 labeled objects; completed reachability plot (axis: reach, tick 44)]
cluster ordering: A B I J L M K N R P C D F G E H
seedlist: -
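
The walkthrough above repeatedly takes the seedlist entry with the smallest reachability, appends that object to the cluster ordering, and updates the seedlist from the object's ε-range. A compact sketch of that loop follows. It assumes 2D tuples with Euclidean distance and uses a binary heap with lazy deletion as the seedlist, which is an implementation choice made here and not something the slides specify.

```python
import heapq
from math import dist, inf

def optics(db, eps, min_pts):
    """Return the cluster ordering as a list of (object, reachability)."""
    processed, ordering = set(), []
    for start in db:
        if start in processed:
            continue
        seeds = [(inf, start)]          # seedlist: min-heap on reachability
        best = {start: inf}             # smallest reachability seen per object
        while seeds:
            reach, o = heapq.heappop(seeds)
            if o in processed or reach > best[o]:
                continue                # stale heap entry
            processed.add(o)
            ordering.append((o, reach))
            neigh = sorted((dist(o, x), x) for x in db if dist(o, x) <= eps)
            if len(neigh) < min_pts:
                continue                # o is not a core object
            core = neigh[min_pts - 1][0]
            for d, x in neigh:
                if x in processed:
                    continue
                r = max(core, d)        # reachability-distance of x w.r.t. o
                if r < best.get(x, inf):
                    best[x] = r
                    heapq.heappush(seeds, (r, x))
    return ordering
```

Plotting the reachability values in the returned order gives the reachability plot: valleys correspond to clusters, and an infinite value marks the start of a new component.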
18
Outline

Foundations of Density-Based Clustering
Core Object, Density-Reachability
DBSCAN, OPTICS
Clustering of Complex Objects
Direct Integration of the Multi-Step Query
Processing Paradigm
Experimental Evaluation

19
Complex Objects
complex objects
complex models
complex distance measure
20
Single-Step Clustering Approach

Density-based clustering algorithms, like DBSCAN and OPTICS

Performance Problems
For each database object q, we perform one range query.
An expensive exact distance computation do(o,q) is performed for each object o of the database, independent of the ε-range.

[Figure: steps 1 and 2, labeled Query Q(q,ε) and Result R(q,ε)]
21
Multi-Step Query Processing

Multi-Step Similarity Search
Range Queries (Faloutsos et al. 94)
k-Nearest Neighbor Queries (Korn et al. 96)
Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
No False Drops

Lower-Bounding Property: filter distance df(o,q) ≤ object distance do(o,q)
22
Traditional Multi-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS

Performance Problems
For each database object q, we perform one range query (1).
The range query is first performed on the filter information (2,3).
One expensive exact distance computation do(o,q) is performed for each object o of the candidate set C(q,ε) (4). This refinement step is very expensive for non-selective filters or high ε-values.
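
The traditional filter-and-refine range query described above can be sketched as follows. The filter used here (difference in the first coordinate only) and the `CountingDistance` wrapper are hypothetical stand-ins chosen for illustration. The one property that matters is that the filter distance never exceeds the exact distance, so no true result is dropped in the filter step.

```python
from math import dist

class CountingDistance:
    """Wraps a distance function and counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn, self.calls = fn, 0

    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)

def filter_distance(a, b):
    # Hypothetical cheap filter: difference in the first coordinate only.
    # It never exceeds the Euclidean distance, so it is lower-bounding.
    return abs(a[0] - b[0])

def multi_step_range_query(db, q, eps, d_filter, d_exact):
    # Filter step: cheap distance on every object yields the candidate set.
    candidates = [o for o in db if d_filter(o, q) <= eps]
    # Refinement step: one exact distance computation per candidate.
    return [o for o in candidates if d_exact(o, q) <= eps]

db = [(0, 0), (1, 5), (2, 0), (9, 9)]
exact = CountingDistance(dist)
result = multi_step_range_query(db, (0, 0), 3.0, filter_distance, exact)
```

In this toy run three of the four objects survive the filter, so three exact distances are computed for a result of two objects. The weaker the filter or the larger ε, the closer the refinement cost gets to a full scan.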

23
Integrated Multi-Step Clustering Approach
Extended Density-based Clustering algorithms,
like DBSCAN and OPTICS

Proposed Solution
For each database object q, we perform one range
query on the filter information (1,2).
Only those exact distances do(o,q) are computed
which are necessary to determine the
core-properties of q (3).
A beneficial heuristic for determining the
reachability-properties is applied which saves on
exact distance computations (4).

Direct integration of the multi-step query
processing paradigm into the clustering
algorithm
postponing expensive exact distance
computations as long as possible

[Figure: (1) Query Q(q,ε) using df, (2) Candidates C(q,ε), (3) computation of do(o,q) for core-properties of q, (4) postponed computations of do(o,q) for reachability-properties of o]
24
Integrated Multi-Step Clustering Approach
Determination of Core-Properties
MinPts = 3, ε = 75
Sorted Distance List (Filter Information)
df(K,Q) = 10
do(K,Q) = 53
df(Z,Q) = 12
do(Z,Q) = 69
df(R,Q) = 18
do(R,Q) = 49
do(R,Q) = 53
df(M,Q) = 55
df(A,Q) = 58
df(I,Q) = 65

First, we carry out a range query on the filter for each query object Q.
Second, we order the resulting candidate set in ascending order according to the filter distance.
Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
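
The three steps above can be sketched as a single loop. The stopping rule relies on the lower-bounding property: once the MinPts-th smallest exact distance found so far is no larger than the next filter distance, no remaining candidate can be closer. The function name and return shape are hypothetical. The sketch also assumes that Q itself is part of the candidate set with distance 0, which is an interpretation of the slide's example and not something it states.

```python
import bisect
from math import inf

def core_properties(candidates, d_exact, eps, min_pts):
    """candidates: (filter distance, object) pairs from the filter range query.
    d_exact(o) returns the exact distance of o to the query object.
    Returns (core-distance, refined entries, still unrefined candidates)."""
    cand = sorted(candidates)           # ascending filter distance
    exact = []                          # sorted (exact distance, object)
    for i, (df, o) in enumerate(cand):
        # Safe to stop: every remaining exact distance is >= df.
        if len(exact) >= min_pts and exact[min_pts - 1][0] <= df:
            break
        bisect.insort(exact, (d_exact(o), o))
    else:
        i = len(cand)                   # all candidates were refined
    core = exact[min_pts - 1][0] if len(exact) >= min_pts else inf
    if core > eps:
        core = inf                      # fewer than min_pts objects within eps
    return core, exact, cand[i:]

# Values from the slide (MinPts = 3, eps = 75), using do(R,Q) = 49.
filter_list = [(0, "Q"), (10, "K"), (12, "Z"), (18, "R"),
               (55, "M"), (58, "A"), (65, "I")]
exact_dist = {"Q": 0, "K": 53, "Z": 69, "R": 49}   # M, A, I are never needed
core, refined, pending = core_properties(filter_list, exact_dist.__getitem__, 75, 3)
```

Under these assumptions the core-distance of Q comes out as 53, and the candidates M, A and I keep their filter distances, which matches the unrefined entries df(M,Q), df(A,Q) and df(I,Q) on the slide.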
25
Integrated Multi-Step Clustering Approach
Extended Seedlist

Data Structure: List of Lists
Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.

The first elements of the lists are ordered ascendingly; each list of predecessor objects is ordered ascendingly.

R: df(R,B) = 18, df(R,D) = 34
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, df(K,C) = 55
M: do(M,C) = 65
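
One way to picture the list of lists is a mapping from each seedlist object to its ascendingly sorted predecessor entries. The class below is a hypothetical sketch, not the authors' implementation. In particular, the pruning rule (entries above an exact value are dropped because they can no longer provide the minimum) is inferred from the example values on the slides and should be read as an assumption.

```python
import bisect

class ExtendedSeedlist:
    """Object -> ascending list of (value, is_exact, predecessor)."""

    def __init__(self):
        self.lists = {}

    def insert(self, obj, value, is_exact, pred):
        entries = self.lists.setdefault(obj, [])
        # An exact entry that is already smaller makes the new one useless.
        if any(e[1] and e[0] <= value for e in entries):
            return
        bisect.insort(entries, (value, is_exact, pred))
        if is_exact:
            # Assumed pruning: larger entries can never yield the minimum.
            entries[:] = [e for e in entries if e[0] <= value]

    def ordered(self):
        """Lists sorted ascendingly by their first elements."""
        return sorted(self.lists.items(), key=lambda kv: kv[1][0])

s = ExtendedSeedlist()
# State before the current query object Q is handled.
for obj, val, ex, pred in [("R", 18, False, "B"), ("K", 20, False, "B"),
                           ("M", 65, True, "C"), ("R", 34, False, "D"),
                           ("K", 30, False, "L"), ("K", 43, False, "G"),
                           ("K", 55, False, "C")]:
    s.insert(obj, val, ex, pred)
# Result list of Q: exact values for K, Z, R; filter values for M, A, I.
for obj, val, ex in [("K", 53, True), ("Z", 69, True), ("R", 53, True),
                     ("M", 55, False), ("A", 58, False), ("I", 65, False)]:
    s.insert(obj, val, ex, "Q")
```

After these insertions the lists are ordered R, K, M, A, I, Z by their first elements, and the filter entry of K with value 55 has been displaced by the exact value 53.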
26
Integrated Multi-Step Clustering Approach
Extended Seedlist

Result list of the current query object Q, which has to be inserted into the extended seedlist:
do(K,Q) = 53, do(Z,Q) = 69, do(R,Q) = 53, df(M,Q) = 55, df(A,Q) = 58, df(I,Q) = 65

Extended seedlist after inserting do(K,Q) = 53:
R: df(R,B) = 18, df(R,D) = 34
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53, df(K,C) = 55
M: do(M,C) = 65
27
Integrated Multi-Step Clustering Approach
Extended Seedlist

Extended seedlist after inserting do(Z,Q) = 69 (df(K,C) = 55 is no longer listed):
R: df(R,B) = 18, df(R,D) = 34
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: do(M,C) = 65
Z: do(Z,Q) = 69
28
Integrated Multi-Step Clustering Approach
Extended Seedlist

Extended seedlist after inserting do(R,Q) = 53:
R: df(R,B) = 18, df(R,D) = 34, do(R,Q) = 53
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: do(M,C) = 65
Z: do(Z,Q) = 69
29
Integrated Multi-Step Clustering Approach
Extended Seedlist

Extended seedlist after inserting df(M,Q) = 55:
R: df(R,B) = 18, df(R,D) = 34, do(R,Q) = 53
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: df(M,Q) = 55, do(M,C) = 65
Z: do(Z,Q) = 69
30
Integrated Multi-Step Clustering Approach
Extended Seedlist

Extended seedlist after inserting df(A,Q) = 58:
R: df(R,B) = 18, df(R,D) = 34, do(R,Q) = 53
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: df(M,Q) = 55, do(M,C) = 65
A: df(A,Q) = 58
Z: do(Z,Q) = 69
31
Integrated Multi-Step Clustering Approach
Extended Seedlist

Extended seedlist after inserting df(I,Q) = 65:
R: df(R,B) = 18, df(R,D) = 34, do(R,Q) = 53
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: df(M,Q) = 55, do(M,C) = 65
A: df(A,Q) = 58
I: df(I,Q) = 65
Z: do(Z,Q) = 69
32
Integrated Multi-Step Clustering Approach
Determination of Next Query Object

Entries shown in the extended seedlist, with the exact distance do(R,B) = 44 next to the filter distance df(R,B) = 18:
R: df(R,B) = 18, do(R,B) = 44, df(R,D) = 34, do(R,Q) = 53
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: df(M,Q) = 55, do(M,C) = 65
A: df(A,Q) = 58
I: df(I,Q) = 65
Z: do(Z,Q) = 69
33
Integrated Multi-Step Clustering Approach
Determination of Next Query Object

Entries shown in the extended seedlist:
M: df(M,Q) = 55, do(M,C) = 65
A: df(A,Q) = 58
I: df(I,Q) = 65
Z: do(Z,Q) = 69
34
Integrated Multi-Step Clustering Approach
Determination of Next Query Object

Entries shown in the extended seedlist, with the exact distance do(K,B) = 25 next to the filter distance df(K,B) = 20:
K: df(K,B) = 20, do(K,B) = 25, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
M: df(M,Q) = 55, do(M,C) = 65
A: df(A,Q) = 58
I: df(I,Q) = 65
Z: do(Z,Q) = 69
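
A plausible reading of these slides is that the next query object is found by refining only the entry currently at the top of the extended seedlist until that top entry is exact. The sketch below follows that reading. It is an assumption about the heuristic and not a confirmed description of the authors' algorithm: the function name, the `exact_value` callback and the pruning of entries above an exact value are all hypothetical. For simplicity the stored values are treated directly as reachability bounds.

```python
import bisect

def next_query_object(lists, exact_value):
    """lists: object -> ascending list of (value, is_exact, predecessor).
    exact_value(obj, pred) computes the expensive exact value on demand.
    Returns (object, reachability, predecessor), or None if lists is empty."""
    while lists:
        # List whose first element is smallest among all lists.
        obj = min(lists, key=lambda o: lists[o][0])
        value, is_exact, pred = lists[obj][0]
        if is_exact:
            # All other first elements are lower bounds >= value: safe to pick.
            del lists[obj]
            return obj, value, pred
        # Top entry is only a filter value: refine it now and re-sort.
        exact = exact_value(obj, pred)
        entries = lists[obj][1:]
        bisect.insort(entries, (exact, True, pred))
        lists[obj] = [e for e in entries if e[0] <= exact]
    return None

# Values taken from the slides: do(R,B) = 44 and do(K,B) = 25.
lists = {
    "R": [(18, False, "B"), (34, False, "D"), (53, True, "Q")],
    "K": [(20, False, "B"), (30, False, "L"), (43, False, "G"), (53, True, "Q")],
    "M": [(55, False, "Q"), (65, True, "C")],
}
exact_values = {("R", "B"): 44, ("K", "B"): 25}
chosen = next_query_object(lists, lambda o, p: exact_values[(o, p)])
```

In this run only two exact distances are computed. First df(R,B) = 18 is refined to 44, which pushes R behind K. Then df(K,B) = 20 is refined to 25, which leaves an exact value at the top, so K becomes the next query object. All other filter entries stay unrefined.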
35
Outline

Foundations of Density-Based Clustering
Core Object, Density-Reachability
DBSCAN, OPTICS
Clustering of Complex Objects
Direct Integration of the Multi-Step Query
Processing Paradigm
Experimental Evaluation

36
Experimental Evaluation
Test Data Sets

Graphs representing images (DaWaK 03)
Expensive exact distance function
Selective filter used

High-dimensional feature vectors representing CAD objects (DASFAA 03)
Not very selective filter used (Euclidean norm)

37
Experimental Evaluation
DBSCAN
[Figure: two plots, Feature vectors and Graphs; runtime in sec. over no. of objects]
Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when the new integrated multi-step query processing approach is used.
The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε-values are used.
When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

38
Experimental Evaluation
OPTICS
[Figure: plots for Feature vectors and Graphs; runtime in sec. over no. of objects]

When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
For high ε-values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.

39
Conclusions

Summary: Efficient Density-Based Clustering of Complex Objects
direct integration of the multi-step query processing paradigm into the clustering algorithm
MinPts-nearest neighbor queries on the exact information
postponing expensive exact distance computations as long as possible
Future Work
integration of the multi-step query processing paradigm into other data mining algorithms
