Efficient Density-Based Clustering of Complex Objects

Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle (University of Munich, Institute for Computer Science)

1
Efficient Density-Based Clustering of Complex Objects
  • Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
  • University of Munich
  • Institute for Computer Science
  • Brighton, UK
  • November 1-4, 2004

2
Outline
  • Density-Based Clustering
  • Clustering of Complex Objects
  • Experimental Evaluation

3
Outline
  • Density-Based Clustering
      • Core Object, Density-Reachability
      • DBSCAN, OPTICS
  • Clustering of Complex Objects
  • Experimental Evaluation

4
Data Mining
  • Larger and larger amounts of data collected
    automatically
  • Too large for humans to analyze manually
  • Tools to assist analysis are necessary → KDD / Data Mining

[Figures: Hubble Space Telescope, telecommunication data, market-basket data]

5
Clustering
  • Clustering
      • Efficiently grouping the database into sub-groups (clusters) such that
          • similarity within clusters is maximized
          • similarity between clusters is minimized
  • Flat Clustering
      • one level of clusters
      • e.g. density-based clustering algorithm DBSCAN [KDD 96]
  • Hierarchical Clustering
      • nested clusters
      • e.g. density-based clustering algorithm OPTICS [SIGMOD 99]

6
Density-Based Clustering I
  • Parameters
      • range ε and minimal weight MinPts
  • Definition: core object
      • q is a core object if |rangeQuery(q, ε)| ≥ MinPts
  • Definition: directly density-reachable
      • p is directly density-reachable from q if q is a core object and p ∈ rangeQuery(q, ε)
  • Definition: density-reachable
      • density-reachable is the transitive closure of directly density-reachable
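
The three definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: objects are 2-D points, the distance is Euclidean, the range query is a linear scan, and the function names are made up for this example.

```python
from collections import deque
from math import dist

def range_query(db, q, eps):
    """Indices of all objects within distance eps of db[q] (q itself included)."""
    return [i for i, p in enumerate(db) if dist(db[q], p) <= eps]

def is_core(db, q, eps, min_pts):
    """q is a core object if its eps-range holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_reachable(db, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return is_core(db, q, eps, min_pts) and p in range_query(db, q, eps)

def density_reachable(db, q, eps, min_pts):
    """All objects density-reachable from q (transitive closure, computed by BFS)."""
    reached, queue = set(), deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(db, cur, eps, min_pts):
            continue  # only core objects pass reachability on
        for n in range_query(db, cur, eps):
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

# Toy data (assumption): four points on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(density_reachable(points, 1, 1.0, 3))
print(density_reachable(points, 0, 1.0, 3))
```

Note that the relation is not symmetric: the border point 0 is reachable from the core object 1, but nothing is reachable from point 0 because it is not a core object.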

9
Density-Based Clustering II
  • Core Idea of Hierarchical Cluster Ordering
      • Order the objects linearly such that objects of a cluster are adjacent in the ordering.
  • Definition: core-distance
  • Definition: reachability-distance

[Figure: object o with its ε-range and core-distance(o) for MinPts = 5, and a second object p]
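
The slide only names the two definitions. As a hedged illustration, the sketch below uses the standard OPTICS formulation: the core-distance of o is the distance to its MinPts-th nearest neighbor if that lies within ε, and is undefined otherwise; the reachability-distance of p with respect to o is the maximum of core-distance(o) and the distance between o and p. Euclidean 2-D points, counting o as its own neighbor, and representing "undefined" as infinity are assumptions of this example.

```python
from math import dist, inf

def core_distance(db, o, eps, min_pts):
    """Distance from db[o] to its min_pts-th nearest neighbor, or inf if undefined."""
    d = sorted(dist(db[o], p) for p in db)  # includes o itself at distance 0
    k = d[min_pts - 1] if len(d) >= min_pts else inf
    return k if k <= eps else inf

def reachability_distance(db, p, o, eps, min_pts):
    """Reachability-distance of db[p] with respect to db[o]."""
    return max(core_distance(db, o, eps, min_pts), dist(db[o], db[p]))

# Toy data (assumption): points on a line.
points = [(0, 0), (1, 0), (2, 0), (4, 0), (10, 0)]
print(core_distance(points, 0, 5.0, 3))
print(reachability_distance(points, 1, 0, 5.0, 3))
print(reachability_distance(points, 3, 0, 5.0, 3))
```

Objects closer to o than its core-distance all receive the same reachability value, which is what smooths the reachability plot inside dense regions.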
10
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44)]
seedlist:
11
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44); data set with labels A, B, I, ε and core-distance]
seedlist: (B, 40) (I, 40)
12
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44); data set with labels A, B, C, I]
seedlist: (I, 40) (C, 40)
13
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44) with I labeled; data set with objects A to N, P and R]
seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
14
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44) with I and J labeled; data set with objects A to N, P and R]
seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
15
OPTICS Algorithm
[Figure: reachability plot (axis: reach; marks: ∞, 44) with I, J and L labeled; data set with objects A to N, P and R]
seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
16
OPTICS Algorithm
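[Figure: complete reachability plot (axis: reach; marks: ∞, 44); data set with objects A to N, P and R]
Cluster ordering shown in the plot: A, B, I, J, L, M, K, N, R, P, C, D, F, G, E, H
seedlist: -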
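
The animation above follows the seedlist step by step. A minimal sketch of that loop is given below, assuming Euclidean points, linear-scan range queries and a binary heap as seedlist; it is a compact textbook-style OPTICS written for illustration, not the authors' implementation, and the toy data are invented.

```python
import heapq
from math import dist, inf

def optics(db, eps, min_pts):
    """Return the cluster ordering as a list of (index, reachability)."""
    n = len(db)
    reach = [inf] * n
    done = [False] * n
    ordering = []

    def neighbors(q):
        pairs = ((dist(db[q], db[i]), i) for i in range(n))
        return [(d, i) for d, i in pairs if d <= eps]

    def core_dist(nbrs):
        ds = sorted(d for d, _ in nbrs)  # q itself is included at distance 0
        return ds[min_pts - 1] if len(ds) >= min_pts else inf

    for start in range(n):
        if done[start]:
            continue
        seeds = [(inf, start)]  # seedlist: min-heap ordered by reachability
        while seeds:
            _, q = heapq.heappop(seeds)
            if done[q]:
                continue  # stale entry, q was already written to the ordering
            done[q] = True
            ordering.append((q, reach[q]))
            nbrs = neighbors(q)
            cd = core_dist(nbrs)
            if cd == inf:
                continue  # q is not a core object, nothing to expand
            for d, o in nbrs:
                new_reach = max(cd, d)
                if not done[o] and new_reach < reach[o]:
                    reach[o] = new_reach
                    heapq.heappush(seeds, (new_reach, o))
    return ordering

# Toy data (assumption): two interleaved groups of three points each.
db = [(0, 0), (10, 0), (1, 0), (11, 0), (2, 0), (12, 0)]
ordering = optics(db, eps=3.0, min_pts=2)
print(ordering)
```

In the resulting ordering the objects of each group are adjacent, and the first object of each group carries an infinite reachability value, which shows up as a peak in the reachability plot.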
18
Outline
  • Foundations of Density-Based Clustering
      • Core Object, Density-Reachability
      • DBSCAN, OPTICS
  • Clustering of Complex Objects
      • Direct Integration of the Multi-Step Query Processing Paradigm
  • Experimental Evaluation

19
Complex Objects
[Figure: complex objects, complex models, complex distance measure]
20
Single-Step Clustering Approach

Density-based clustering algorithms, like DBSCAN and OPTICS
  • Performance Problems
      • For each database object q, we perform one range query.
      • An expensive exact distance computation d_o(o, q) is needed for each object o of the database, independent of the ε-range.

[Figure: query Q(q, ε) and result R(q, ε), marked as steps 1 and 2]
21
Multi-Step Query Processing
  • Multi-Step Similarity Search
      • Range Queries (Faloutsos et al. 94)
      • k-Nearest Neighbor Queries (Korn et al. 96)
      • Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
  • No false drops, guaranteed by the lower-bounding property
      • filter distance ≤ object distance, i.e. d_f(o, q) ≤ d_o(o, q)
22
Traditional Multi-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS
  • Performance Problems
      • For each database object q, we perform one range query (1).
      • The range query is first performed on the filter information (2, 3).
      • One expensive exact distance computation d_o(o, q) is performed for each object o of the candidate set C(q, ε) (4). This refinement step is very expensive for non-selective filters or high ε-values.
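
The traditional filter and refinement steps can be sketched as follows. The setting is an assumption made for this example: the exact distance is Euclidean over all coordinates, and the filter distance looks only at the first coordinate, which is a valid lower bound. The function name and the toy data are hypothetical.

```python
from math import dist

def multi_step_range_query(db, q, eps, d_filter, d_exact):
    """Filter step on d_filter, refinement step on d_exact.

    Produces no false drops only if d_filter(a, b) <= d_exact(a, b) for all a, b.
    Returns (result indices, number of exact distance computations).
    """
    candidates = [i for i, o in enumerate(db) if d_filter(o, q) <= eps]
    result = [i for i in candidates if d_exact(db[i], q) <= eps]
    return result, len(candidates)

def d_filter(a, b):
    """Cheap lower bound of the Euclidean distance: first coordinate only."""
    return abs(a[0] - b[0])

db = [(0, 0), (1, 0), (1, 5), (2, 2), (9, 9)]
result, n_exact = multi_step_range_query(db, (0, 0), 3.0, d_filter, dist)
print(result, n_exact)
```

Every candidate is refined, so the number of exact distance computations equals the size of the candidate set. With a non-selective filter or a large ε, that set approaches the whole database, which is the problem the slide describes.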

23
Integrated Multi-Step Clustering Approach
Extended density-based clustering algorithms, like DBSCAN and OPTICS
  • Proposed Solution
      • For each database object q, we perform one range query on the filter information (1, 2).
      • Only those exact distances d_o(o, q) are computed which are necessary to determine the core-properties of q (3).
      • A beneficial heuristic for determining the reachability-properties is applied, which saves on exact distance computations (4).
  • Direct integration of the multi-step query processing paradigm into the clustering algorithm
      • postponing expensive exact distance computations as long as possible

[Figure: (1) query Q(q, ε) using d_f, (2) candidates C(q, ε), (3) computation of d_o(o, q) for the core-properties of q, (4) postponed computations of d_o(o, q) for the reachability-properties of o]
24
Integrated Multi-Step Clustering Approach
Determination of Core-Properties
  • First, we carry out a range query on the filter for each query object Q.
  • Second, we order the resulting candidate set in ascending order according to the filter distance.
  • Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.

[Figure: sorted distance list (filter information) for Q, with MinPts = 3 and ε = 75]
  d_f(K,Q) = 10, d_o(K,Q) = 53
  d_f(Z,Q) = 12, d_o(Z,Q) = 69
  d_f(R,Q) = 18, d_o(R,Q) = 53 (the figure also shows the value 49)
  d_f(M,Q) = 55
  d_f(A,Q) = 58
  d_f(I,Q) = 65
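
The three steps can be sketched as below, using the values from the figure (with d_o(R,Q) = 53). The stopping rule is an interpretation based on the lower-bounding property: once the MinPts-th smallest exact distance found so far is no larger than the next filter distance, no remaining candidate can be closer. Counting the query object itself as a neighbor at distance 0 is an assumption of this example, and the function name is hypothetical.

```python
from bisect import insort
from math import inf

def core_properties(candidates, exact, eps, min_pts):
    """candidates: (filter_distance, obj) pairs returned by the filter range query.
    exact: callable obj -> exact distance to the query object (the expensive part).
    Returns (core distance or inf, dict of the exact distances actually computed).
    """
    known = {}    # obj -> exact distance computed so far
    best = [0.0]  # sorted exact distances; the query object itself at distance 0
    for d_f, obj in sorted(candidates):
        # d_f <= d_o for every remaining candidate, so nothing closer can follow.
        if len(best) >= min_pts and best[min_pts - 1] <= d_f:
            break
        known[obj] = exact(obj)
        insort(best, known[obj])
    core = best[min_pts - 1] if len(best) >= min_pts else inf
    return (core if core <= eps else inf), known

candidates = [(10, "K"), (12, "Z"), (18, "R"), (55, "M"), (58, "A"), (65, "I")]
exact_table = {"K": 53, "Z": 69, "R": 53}  # M, A and I are never requested
core, known = core_properties(candidates, exact_table.__getitem__, eps=75, min_pts=3)
print(core, known)
```

Only K, Z and R are refined, and M, A and I keep their filter distances. Their exact distances are postponed, which matches the figure where these three entries show no d_o value.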
25
Integrated Multi-Step Clustering Approach
Extended Seedlist
  • Data structure: list of lists
      • Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
      • The first elements of the lists are in ascending order.
      • Each list of predecessor objects is in ascending order.

[Figure: extended seedlist]
  R: d_f(R,B) = 18, d_f(R,D) = 34
  K: d_f(K,B) = 20, d_f(K,L) = 30, d_f(K,G) = 43, d_f(K,C) = 55
  M: d_o(M,C) = 65
31
Integrated Multi-Step Clustering Approach
Extended Seedlist
  • Data structure: list of lists
      • Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
  • The result list of the current query object Q has to be inserted into the extended seedlist.

[Figure: result list of Q]
  d_o(K,Q) = 53, d_o(Z,Q) = 69, d_o(R,Q) = 53, d_f(M,Q) = 55, d_f(A,Q) = 58, d_f(I,Q) = 65
[Figure: extended seedlist after the insertion]
  R: d_f(R,B) = 18, d_f(R,D) = 34, d_o(R,Q) = 53
  K: d_f(K,B) = 20, d_f(K,L) = 30, d_f(K,G) = 43, d_o(K,Q) = 53
  M: d_f(M,Q) = 55, d_o(M,C) = 65
  A: d_f(A,Q) = 58
  I: d_f(I,Q) = 65
  Z: d_o(Z,Q) = 69
32
Integrated Multi-Step Clustering Approach
Determination of Next Query Object
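  • Data structure: list of lists
      • Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.

[Figure: extended seedlist; for the first entry d_f(R,B) = 18, the exact distance d_o(R,B) = 44 is shown]
  R: d_f(R,B) = 18 → d_o(R,B) = 44, d_f(R,D) = 34, d_o(R,Q) = 53
  K: d_f(K,B) = 20, d_f(K,L) = 30, d_f(K,G) = 43, d_o(K,Q) = 53
  M: d_f(M,Q) = 55, d_o(M,C) = 65
  A: d_f(A,Q) = 58
  I: d_f(I,Q) = 65
  Z: d_o(Z,Q) = 69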
33
Integrated Multi-Step Clustering Approach
Determination of Next Query Object
  • Data structure: list of lists
      • Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.

[Figure: extended seedlist, entries shown on this slide]
  M: d_f(M,Q) = 55, d_o(M,C) = 65
  A: d_f(A,Q) = 58
  I: d_f(I,Q) = 65
  Z: d_o(Z,Q) = 69
34
Integrated Multi-Step Clustering Approach
Determination of Next Query Object
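  • Data structure: list of lists
      • Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.

[Figure: extended seedlist; for the entry d_f(K,B) = 20, the exact distance d_o(K,B) = 25 is shown]
  K: d_f(K,B) = 20 → d_o(K,B) = 25, d_f(K,L) = 30, d_f(K,G) = 43, d_o(K,Q) = 53
  M: d_f(M,Q) = 55, d_o(M,C) = 65
  A: d_f(A,Q) = 58
  I: d_f(I,Q) = 65
  Z: d_o(Z,Q) = 69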
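
One way to read the list-of-lists structure and the choice of the next query object is sketched below. It is a simplified interpretation, not the authors' code: the class name `ExtendedSeedlist` is hypothetical, entries are plain (distance, is_exact, predecessor) tuples, the core-distance of the predecessor is ignored, and entries that can no longer win are not pruned. The rule assumed here is that a filter distance at the head of the smallest list is refined and re-inserted, and an object is returned only once an exact distance is at the head.

```python
from bisect import insort

class ExtendedSeedlist:
    """Sketch: obj -> ascending list of (distance, is_exact, predecessor)."""

    def __init__(self, exact):
        self.lists = {}
        self.exact = exact  # callable (obj, predecessor) -> exact distance
        self.n_exact = 0    # number of exact distance computations triggered

    def insert(self, obj, pred, distance, is_exact):
        insort(self.lists.setdefault(obj, []), (distance, is_exact, pred))

    def pop_next(self):
        """Return (obj, exact distance, predecessor) of the next query object."""
        while self.lists:
            obj = min(self.lists, key=lambda o: self.lists[o][0])
            distance, is_exact, pred = self.lists[obj][0]
            if is_exact:
                del self.lists[obj]
                return obj, distance, pred
            # The head is only a filter distance: refine it and re-sort the list.
            self.lists[obj].pop(0)
            self.n_exact += 1
            self.insert(obj, pred, self.exact(obj, pred), True)
        return None

# Seedlist state and exact distances as read from the figures.
exact_table = {("R", "B"): 44, ("K", "B"): 25}
seeds = ExtendedSeedlist(lambda o, p: exact_table[(o, p)])
for obj, pred, d, is_exact in [
    ("R", "B", 18, False), ("R", "D", 34, False), ("R", "Q", 53, True),
    ("K", "B", 20, False), ("K", "L", 30, False), ("K", "G", 43, False),
    ("K", "Q", 53, True), ("M", "Q", 55, False), ("M", "C", 65, True),
    ("A", "Q", 58, False), ("I", "Q", 65, False), ("Z", "Q", 69, True),
]:
    seeds.insert(obj, pred, d, is_exact)
next_obj = seeds.pop_next()
print(next_obj, seeds.n_exact)
```

Because filter distances are lower bounds, an exact distance at the head of the smallest list cannot be undercut by any remaining entry. In this run only d_o(R,B) and d_o(K,B) are computed before K is selected.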
35
Outline
  • Foundations of Density-Based Clustering
      • Core Object, Density-Reachability
      • DBSCAN, OPTICS
  • Clustering of Complex Objects
      • Direct Integration of the Multi-Step Query Processing Paradigm
  • Experimental Evaluation

36
Experimental Evaluation
Test Data Sets
  • Graphs representing images [DAWAK 03]
      • expensive exact distance function
      • selective filter used
  • High-dimensional feature vectors representing CAD objects [DASFAA 03]
      • not very selective filter used (Euclidean norm)

37
Experimental Evaluation
DBSCAN
[Figure: DBSCAN runtime in seconds vs. no. of objects; panels: feature vectors, graphs]
  • Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when using the new integrated multi-step query processing approach.
  • The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε-values are used.
  • When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

38
Experimental Evaluation
OPTICS
[Figure: OPTICS runtime in seconds vs. no. of objects; panels: feature vectors, graphs]
  • When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
  • For high ε-values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.

39
Conclusions
  • Summary: Efficient Density-Based Clustering of Complex Objects
      • direct integration of the multi-step query processing paradigm into the clustering algorithm
      • MinPts-nearest neighbor queries on the exact information
      • postponing expensive exact distance computations as long as possible
  • Future Work
      • integration of the multi-step query processing paradigm into other data mining algorithms