Title: Efficient Density-Based Clustering of Complex Objects
1Efficient Density-Based Clustering of Complex
Objects
- Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
- University of Munich, Institute for Computer Science
- Brighton, UK
- November 01-04, 2004
2Outline
- Density-Based Clustering
- Clustering of Complex Objects
- Experimental Evaluation
3Outline
- Density-Based Clustering
- Core Object & Density-Reachability
- DBSCAN & OPTICS
- Clustering of Complex Objects
- Experimental Evaluation
4Data Mining
- Larger and larger amounts of data are collected automatically
- Too large for humans to analyze manually
- Tools to assist analysis are necessary → KDD / Data Mining
Hubble Space Telescope
Telecommunication Data
Market-Basket Data
5Clustering
- Clustering
- Efficiently grouping the database into sub-groups (clusters) such that
- similarity within clusters is maximized
- similarity between clusters is minimized
- Flat Clustering
- one level of clusters
- e.g. density-based clustering algorithm DBSCAN [KDD 96]
- Hierarchical Clustering
- nested clusters
- e.g. density-based clustering algorithm OPTICS [SIGMOD 99]
6Density-Based Clustering I
- Parameters
- range ε and minimal weight MinPts
- Definition: core object
- q is a core object if |rangeQuery(q, ε)| ≥ MinPts
- Definition: directly density-reachable
- p is directly density-reachable from q if q is a core object and p ∈ rangeQuery(q, ε)
- Definition: density-reachable
- density-reachable is the transitive closure of directly density-reachable
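The three definitions above can be illustrated with a minimal Python sketch. It assumes Euclidean points and a linear-scan `range_query` that includes the query object itself; the function names and the sample data are hypothetical and not taken from the slides.

```python
import math
from collections import deque

def range_query(db, q, eps):
    """Indices of all objects within distance eps of db[q] (q itself included)."""
    return [i for i, o in enumerate(db) if math.dist(db[q], o) <= eps]

def is_core(db, q, eps, min_pts):
    """q is a core object if its eps-neighborhood holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_reachable(db, p, q, eps, min_pts):
    """p is directly density-reachable from q."""
    return is_core(db, q, eps, min_pts) and p in range_query(db, q, eps)

def density_reachable(db, q, eps, min_pts):
    """All objects density-reachable from q (transitive closure, computed by BFS)."""
    reached, frontier = set(), deque([q])
    while frontier:
        c = frontier.popleft()
        if not is_core(db, c, eps, min_pts):
            continue  # only core objects pass reachability on
        for p in range_query(db, c, eps):
            if p not in reached:
                reached.add(p)
                frontier.append(p)
    return reached

# Hypothetical sample data: a dense square, one border point, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
```

With ε = 1.5 and MinPts = 4, the border point (index 4) is directly density-reachable from the core object at index 3, but not the other way round, which shows that the relation is not symmetric.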
7Density-Based Clustering II
- Core Idea of Hierarchical Cluster Ordering
- Order the objects linearly such that
objects of a cluster are adjacent in the
ordering.
8Density-Based Clustering II
- Core Idea of Hierarchical Cluster Ordering
- Order the objects linearly such that objects of a cluster are adjacent in the ordering.
- Definition: core-distance
[Figure: core-distance(o) of an object o for MinPts = 5 within the range ε]
9Density-Based Clustering II
- Core Idea of Hierarchical Cluster Ordering
- Order the objects linearly such that objects of a cluster are adjacent in the ordering.
- Definition: core-distance
- Definition: reachability-distance
[Figure: core-distance(o) of an object o and a second object p, for MinPts = 5 within the range ε]
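The slide only names the two definitions, so the following sketch spells them out as they are commonly stated for OPTICS: the core-distance of o is the distance to its MinPts-th nearest neighbor if that distance is at most ε (undefined otherwise), and the reachability-distance of p with respect to o is the maximum of core-distance(o) and the distance between o and p. Counting o as its own first neighbor is an assumption made here; the sample data are hypothetical.

```python
import math

def core_distance(db, o, eps, min_pts):
    """Distance from db[o] to its min_pts-th nearest neighbor, or None (undefined)
    if fewer than min_pts objects lie within eps. db[o] counts as its own neighbor."""
    dists = sorted(math.dist(db[o], x) for x in db)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(db, p, o, eps, min_pts):
    """Reachability-distance of db[p] with respect to db[o]; None if o is no core object."""
    cd = core_distance(db, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(db[o], db[p]))

# Hypothetical sample data on a line.
db = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
```

With ε = 4 and MinPts = 3, the core-distance of the first point is 2, so its direct neighbor at distance 1 still gets the reachability-distance 2, while the point at distance 3 gets 3.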
10OPTICS Algorithm
[Figure: data set and reachability plot (axis "reach", marks at ∞ and 44)]
seedlist: (no entries shown)
11OPTICS Algorithm
[Figure: data points A, B, I with core-distance and ε annotations; reachability plot (axis "reach", marks at ∞ and 44)]
seedlist: (B, 40) (I, 40)
12OPTICS Algorithm
[Figure: data points A, B, C, I; reachability plot (axis "reach", marks at ∞ and 44)]
seedlist: (I, 40) (C, 40)
13OPTICS Algorithm
[Figure: data set with points A to N, P and R; reachability plot (axis "reach", marks at ∞ and 44), labelled object: I]
seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
14OPTICS Algorithm
[Figure: data set with points A to N, P and R; reachability plot (axis "reach", marks at ∞ and 44), labelled objects: I, J]
seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
15OPTICS Algorithm
[Figure: data set with points A to N, P and R; reachability plot (axis "reach", marks at ∞ and 44), labelled objects: I, J, L]
seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
16OPTICS Algorithm
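[Figure: data set with points A to N, P and R; reachability plot (axis "reach", marks at ∞ and 44) with the complete cluster ordering A, B, I, J, L, M, K, N, R, P, C, D, F, G, E, H]
seedlist: empty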
17OPTICS Algorithm
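[Figure: data set with points A to N, P and R; reachability plot (axis "reach", marks at ∞ and 44) with the complete cluster ordering A, B, I, J, L, M, K, N, R, P, C, D, F, G, E, H]
seedlist: empty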
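The walk-through on the preceding slides can be condensed into a minimal Python sketch of the OPTICS main loop. It is a simplified illustration under stated assumptions, not the authors' implementation: the seedlist is a plain dictionary scanned for its minimum, neighborhoods are found by a linear scan, an object counts as its own neighbor, and the function name and sample data are hypothetical.

```python
import math

def optics(db, eps, min_pts, dist=math.dist):
    """Return the cluster ordering as a list of (index, reachability) pairs."""
    n = len(db)
    processed = [False] * n
    ordering = []

    def neighbors(q):
        # (distance, index) pairs within eps, nearest first; q itself is included.
        pairs = [(dist(db[q], db[i]), i) for i in range(n)]
        return sorted(pair for pair in pairs if pair[0] <= eps)

    for start in range(n):
        if processed[start]:
            continue
        seeds = {start: math.inf}  # seedlist: object -> smallest reachability so far
        while seeds:
            q = min(seeds, key=lambda i: (seeds[i], i))  # smallest reachability first
            reach = seeds.pop(q)
            processed[q] = True
            ordering.append((q, reach))
            nbrs = neighbors(q)
            if len(nbrs) < min_pts:
                continue  # q is not a core object, so it contributes no seeds
            core_dist = nbrs[min_pts - 1][0]
            for d, p in nbrs:
                if not processed[p]:
                    seeds[p] = min(seeds.get(p, math.inf), max(core_dist, d))
    return ordering

# Hypothetical sample data: two groups of three points each.
db = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
ordering = optics(db, eps=3, min_pts=2)
```

In the resulting ordering the objects of each group are adjacent, and each group starts with an infinite reachability value, which corresponds to the ∞ mark in the reachability plots.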
18Outline
- Foundations of Density-Based Clustering
- Core Object & Density-Reachability
- DBSCAN & OPTICS
- Clustering of Complex Objects
- Direct Integration of the Multi-Step Query Processing Paradigm
- Experimental Evaluation
19Complex Objects
complex objects
complex models
complex distance measure
20Single-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS
- Performance Problems
- For each database object q, we perform one range query.
- An expensive exact distance computation do(o,q) is performed for each object o of the database, independent of the ε-range.
[Figure: steps 1 and 2: query Q(q,ε), result R(q,ε)]
21Multi-Step Query Processing
- Multi-Step Similarity Search
- Range Queries (Faloutsos et al. 94)
- k-Nearest Neighbor Queries (Korn et al. 96)
- Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
- No false drops, provided the lower-bounding property holds: filter distance ≤ object distance
22Traditional Multi-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS
- Performance Problems
- For each database object q, we perform one range query (1).
- The range query is first performed on the filter information (2,3).
- One expensive exact distance computation do(o,q) is performed for each object o of the candidate set C(q,ε) (4). This refinement step is very expensive for non-selective filters or high ε-values.
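The traditional filter-refinement range query described above can be sketched in a few lines of Python. The sketch assumes a filter distance that lower-bounds the exact distance; the two distance functions are hypothetical stand-ins (Euclidean distance as the "expensive" object distance, the difference of the first coordinate as the filter), chosen only so that the example runs.

```python
import math

def multi_step_range_query(db, q, eps, d_filter, d_exact):
    """Filter-refinement range query; returns (result, number of exact computations)."""
    # Filter step: cheap lower-bounding distance. It never discards a true
    # result as long as d_filter(o, q) <= d_exact(o, q) holds.
    candidates = [o for o in db if d_filter(o, q) <= eps]
    # Refinement step: one expensive exact computation per candidate.
    result = [o for o in candidates if d_exact(o, q) <= eps]
    return result, len(candidates)

def d_exact(o, q):
    """Hypothetical stand-in for the expensive object distance do."""
    return math.dist(o, q)

def d_filter(o, q):
    """Hypothetical lower-bounding filter distance df."""
    return abs(o[0] - q[0])

db = [(0, 0), (1, 5), (2, 1), (9, 0)]
result, n_exact = multi_step_range_query(db, (0, 0), 3, d_filter, d_exact)
```

Here the filter removes one of four objects, so three exact distances are computed. With a non-selective filter or a large ε almost every object survives the filter step, which is the performance problem the slide points out.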
23Integrated Multi-Step Clustering Approach
Extended density-based clustering algorithms, like DBSCAN and OPTICS
- Proposed Solution
- For each database object q, we perform one range query on the filter information (1,2).
- Only those exact distances do(o,q) are computed which are necessary to determine the core-properties of q (3).
- A heuristic for determining the reachability-properties is applied, which saves exact distance computations (4).
- Direct integration of the multi-step query processing paradigm into the clustering algorithm
- postponing expensive exact distance computations as long as possible
[Figure: (1) query Q(q,ε) using df, (2) candidates C(q,ε), (3) computation of do(o,q) for the core-properties of q, (4) postponed computations of do(o,q) for the reachability-properties of o]
24Integrated Multi-Step Clustering Approach
Determination of Core-Properties
- First, we carry out a range query on the filter for each query object Q.
- Second, we order the resulting candidate set in ascending order according to the filter distance.
- Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
[Figure: example with MinPts = 3, ε = 75; sorted distance list of the candidates of Q with filter information: df(K,Q) = 10 (do(K,Q) = 53), df(Z,Q) = 12 (do(Z,Q) = 69), df(R,Q) = 18 (do(R,Q) = 49, also shown as do(R,Q) = 53), df(M,Q) = 55, df(A,Q) = 58, df(I,Q) = 65]
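The three steps can be sketched in Python using the numbers from the figure. Two points are assumptions made here, not statements from the slides: the query object Q counts as its own nearest neighbor, and the exact distances of M, A and I (never needed in this run) are hypothetical values chosen to respect the lower-bounding property. The function name is hypothetical as well.

```python
def core_distance_multistep(candidates, d_exact, eps, min_pts):
    """candidates: {object: filter distance} from the range query on the filter.
    Returns (core-distance or None, {object: exact distance actually computed})."""
    refined = {}
    nearest = [0.0]  # assumption: Q counts as its own nearest neighbor
    for obj, df in sorted(candidates.items(), key=lambda c: c[1]):
        # df lower-bounds do, so once df reaches the current MinPts-th nearest
        # exact distance, no remaining candidate can be closer.
        if len(nearest) >= min_pts and df >= nearest[min_pts - 1]:
            break
        refined[obj] = d_exact(obj)
        nearest = sorted(nearest + [refined[obj]])
    if len(nearest) >= min_pts and nearest[min_pts - 1] <= eps:
        return nearest[min_pts - 1], refined
    return None, refined

filter_dist = {"K": 10, "Z": 12, "R": 18, "M": 55, "A": 58, "I": 65}
# Exact distances of K, Z, R are from the slide; those of M, A, I are hypothetical.
exact_dist = {"K": 53, "Z": 69, "R": 49, "M": 60, "A": 70, "I": 80}

core, refined = core_distance_multistep(filter_dist, exact_dist.get, eps=75, min_pts=3)
# Reachability of the refined candidates with respect to Q.
reach = {o: max(core, d) for o, d in refined.items()}
```

Under these assumptions only three of the six candidates are refined, and the reachability values come out as 53 for K and R and 69 for Z, which appears to be consistent with the do(·,Q) entries shown in the figure.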
25Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
- The first elements of the lists are ascendingly ordered.
- Each list of predecessor objects is ascendingly ordered.
[Figure: extended seedlist with the lists
R: df(R,B) = 18, df(R,D) = 34
K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, df(K,C) = 55
M: do(M,C) = 65]
26Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: result list of the current query object Q, which has to be inserted into the extended seedlist: do(K,Q) = 53, do(Z,Q) = 69, do(R,Q) = 53, df(M,Q) = 55, df(A,Q) = 58, df(I,Q) = 65. Build step: do(K,Q) = 53 is added to the seedlist.]
27Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: insertion of the result list of the current query object Q into the extended seedlist. Build step: do(Z,Q) = 69 is added to the seedlist; df(K,C) = 55 is no longer shown.]
28Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: insertion of the result list of the current query object Q into the extended seedlist. Build step: do(R,Q) = 53 is added to the seedlist.]
29Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: insertion of the result list of the current query object Q into the extended seedlist. Build step: df(M,Q) = 55 is added to the seedlist.]
30Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: insertion of the result list of the current query object Q into the extended seedlist. Build step: df(A,Q) = 58 is added to the seedlist.]
31Integrated Multi-Step Clustering Approach
Extended Seedlist
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: insertion of the result list of the current query object Q into the extended seedlist. Build step: df(I,Q) = 65 is added to the seedlist, which now contains the complete result list of Q.]
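A minimal Python sketch of the list-of-lists structure may help to follow the insertion steps. The class and method names are hypothetical, and one rule is an assumption read off the figures rather than a quoted definition: entries behind the first exact distance in a list are dropped, because their distances can only be larger.

```python
import bisect

class ExtendedSeedlist:
    """Sketch of a list of lists: one ascending list of possible predecessors per object."""

    def __init__(self):
        # object -> ascending list of (distance, is_exact, predecessor)
        self.lists = {}

    def insert(self, obj, distance, is_exact, predecessor):
        entries = self.lists.setdefault(obj, [])
        bisect.insort(entries, (distance, is_exact, predecessor))
        # Assumption: entries behind the first exact distance cannot yield a
        # smaller reachability, so they are dropped.
        for i, entry in enumerate(entries):
            if entry[1]:
                del entries[i + 1:]
                break

    def rows(self):
        """All lists, ordered ascendingly by their first element."""
        return sorted(self.lists.items(), key=lambda kv: kv[1][0])

seedlist = ExtendedSeedlist()
# State before the insertion, as on the first "Extended Seedlist" slide.
for obj, d, exact, pred in [("R", 18, False, "B"), ("R", 34, False, "D"),
                            ("K", 20, False, "B"), ("K", 30, False, "L"),
                            ("K", 43, False, "G"), ("K", 55, False, "C"),
                            ("M", 65, True, "C")]:
    seedlist.insert(obj, d, exact, pred)
# Result list of the current query object Q.
for obj, d, exact in [("K", 53, True), ("Z", 69, True), ("R", 53, True),
                      ("M", 55, False), ("A", 58, False), ("I", 65, False)]:
    seedlist.insert(obj, d, exact, "Q")
```

After the insertion the list of K ends with do(K,Q) = 53 and no longer contains df(K,C) = 55, and the lists are headed, in ascending order, by R, K, M, A, I and Z.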
32Integrated Multi-Step Clustering Approach
Determination of Next Query Object
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: extended seedlist. Build step: the filter distance df(R,B) = 18 at the head of the seedlist is refined to the exact distance do(R,B) = 44.]
33Integrated Multi-Step Clustering Approach
Determination of Next Query Object
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: extended seedlist, intermediate build step showing the entries df(A,Q) = 58, do(Z,Q) = 69, df(M,Q) = 55, df(I,Q) = 65 and do(M,C) = 65]
34Integrated Multi-Step Clustering Approach
Determination of Next Query Object
- Data Structure: List of Lists
- Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
[Figure: extended seedlist. Build step: the filter distance df(K,B) = 20 is refined to the exact distance do(K,B) = 25.]
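The determination of the next query object can be sketched as a loop that refines the smallest entry until an exact distance is at the head of the seedlist. The sketch below is an interpretation of the figures, not the authors' code: the function name, the dictionary layout and the callback `exact_reach` are hypothetical, and only the two exact values shown on the slides (44 for R via B, 25 for K via B) are used.

```python
import bisect

def next_query_object(seedlist, exact_reach):
    """seedlist: {object: ascending list of (distance, is_exact, predecessor)}.
    Removes and returns (object, reachability), or None if the seedlist is empty."""
    while seedlist:
        obj = min(seedlist, key=lambda o: seedlist[o][0])  # list with smallest head
        dist, is_exact, pred = seedlist[obj][0]
        if is_exact:
            del seedlist[obj]
            return obj, dist
        # The head is only a lower bound: refine it and re-insert the exact value.
        entries = seedlist[obj]
        entries.pop(0)
        bisect.insort(entries, (exact_reach(obj, pred), True, pred))
        # Entries behind the first exact distance are no longer needed.
        for i, entry in enumerate(entries):
            if entry[1]:
                del entries[i + 1:]
                break
    return None

# Seedlist state after the result list of Q has been inserted.
seedlist = {
    "R": [(18, False, "B"), (34, False, "D"), (53, True, "Q")],
    "K": [(20, False, "B"), (30, False, "L"), (43, False, "G"), (53, True, "Q")],
    "M": [(55, False, "Q"), (65, True, "C")],
    "A": [(58, False, "Q")],
    "I": [(65, False, "Q")],
    "Z": [(69, True, "Q")],
}
exact_values = {("R", "B"): 44, ("K", "B"): 25}  # values shown on the slides
calls = []

def exact_reach(obj, pred):
    calls.append((obj, pred))
    return exact_values[(obj, pred)]

result = next_query_object(seedlist, exact_reach)
```

In this run only two exact distances are computed before K is selected with the value 25; all other filter entries stay unrefined, which is the postponement the slides describe.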
35Outline
- Foundations of Density-Based Clustering
- Core Object & Density-Reachability
- DBSCAN & OPTICS
- Clustering of Complex Objects
- Direct Integration of the Multi-Step Query Processing Paradigm
- Experimental Evaluation
36Experimental Evaluation
Test Data Sets
- Graphs representing images [DaWaK 03]
- expensive exact distance function
- selective filter used
- High-dimensional feature vectors representing CAD objects [DASFAA 03]
- not very selective filter used (Euclidean norm)
37Experimental Evaluation
DBSCAN
[Figure: two plots, "Feature vectors" and "Graphs"; x-axis: no. of objects, y-axis: runtime (sec.)]
- Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when the new integrated multi-step query processing approach is used.
- The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε-values are used.
- When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.
38Experimental Evaluation
OPTICS
[Figure: plots "Feature vectors" and "Graphs"; x-axis: no. of objects, y-axis: runtime (sec.)]
- When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
- For high ε-values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.
39Conclusions
- Summary: Efficient Density-Based Clustering of Complex Objects
- direct integration of the multi-step query processing paradigm into the clustering algorithm
- MinPts-nearest neighbor queries on the exact information
- postponing expensive exact distance computations as long as possible
- Future Work
- integration of the multi-step query processing paradigm into other data mining algorithms