Title: Clustering Objects in Large Spatial Network
1. Clustering Objects in Large Spatial Network
- DB seminar
- Speaker: Ken Yiu
- Date: 31/10/2003
2. Outline
- Motivation
- Problem definition and related work
- Disk-based representation of the model
- K-medoid and density-based algorithms
- Applications in a wider domain
- Experimental results
- Conclusions
3. Motivation
- The Euclidean distance between two points does not represent the actual distance
- The shortest path between two points is constrained by the transportation network
- Real examples: road networks, river networks, plane networks, rail networks, ...
- Conventional clustering algorithms cannot be applied without a distance matrix
  - Computing all-pairs distances is expensive
4. Problem definition
- Network: a weighted graph G(V, E, W)
  - V: set of nodes
  - E: set of edges
  - W: E → ℝ⁺ associates each edge with an edge weight
- Point: a position on an edge, written as a triplet (ni, nj, pos), where pos ∈ [0, W(e)] and e = (ni, nj)
  - The point is pos units away from node ni along the edge (ni, nj)
- How to ensure a unique representation?
  - Require in the triplet that ni < nj
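To make the triplet representation concrete, here is a minimal Python sketch; the names NetworkPoint and make_point are mine, not from the paper:

```python
from dataclasses import dataclass

# Hypothetical sketch of the (ni, nj, pos) point representation.
# Names (NetworkPoint, make_point) are illustrative only.

@dataclass(frozen=True)
class NetworkPoint:
    ni: int     # smaller endpoint node ID (ni < nj enforced below)
    nj: int     # larger endpoint node ID
    pos: float  # distance from ni along edge (ni, nj), 0 <= pos <= W(e)

def make_point(na: int, nb: int, pos: float, weight: float) -> NetworkPoint:
    """Normalize to the unique triplet with ni < nj.

    If the endpoints arrive in the wrong order, flip them and
    measure pos from the other end of the edge.
    """
    assert 0.0 <= pos <= weight
    if na < nb:
        return NetworkPoint(na, nb, pos)
    return NetworkPoint(nb, na, weight - pos)

# The same physical point has one canonical form:
p1 = make_point(3, 7, 2.0, 10.0)   # 2 units from node 3
p2 = make_point(7, 3, 8.0, 10.0)   # 8 units from node 7 = 2 units from node 3
assert p1 == p2
```

Normalizing at construction time is what makes the ni < nj requirement cheap to maintain: equality of points reduces to tuple equality.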
5. Example model
6. Problem definition
- Direct distance dL(x, y)
  - ∞ if x and y are on different edges
  - |x.pos − y.pos| (distance along the edge), otherwise
- Network distance: shortest distance
  - Between nodes: can be computed by Dijkstra's algorithm
  - Between points x on (na, nb) and y on (nc, nd):
    - Different edges: minimum over the four routes through the endpoint nodes
    - Same edge: minimum of the previous expression and dL(x, y)
- Example using the previous slide
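These definitions can be sketched in Python; the edge-weight map `w` and the all-pairs node distance table `d_node` (e.g. precomputed with Dijkstra's algorithm) are assumed inputs, and all names here are mine:

```python
import math

# Sketch of the distance definitions. Points are triplets (ni, nj, pos)
# with ni < nj; w maps an edge (ni, nj) to its weight.

def direct_distance(x, y):
    """d_L(x, y): distance along a shared edge, infinity otherwise."""
    if (x[0], x[1]) != (y[0], y[1]):
        return math.inf
    return abs(x[2] - y[2])

def network_distance(x, y, w, d_node):
    """Shortest network distance: best of the four routes through the
    endpoint nodes, and the direct distance when an edge is shared."""
    (na, nb, px), (nc, nd, py) = x, y
    x_ends = {na: px, nb: w[(na, nb)] - px}   # distance from x to its endpoints
    y_ends = {nc: py, nd: w[(nc, nd)] - py}
    via = min(dx + d_node[u][v] + dy
              for u, dx in x_ends.items()
              for v, dy in y_ends.items())
    return min(via, direct_distance(x, y))
```

Note the same-edge case still consults the routes through the endpoints: going around via other edges can occasionally beat walking along the shared edge.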
7. Dijkstra's algorithm
- Shortest-path algorithm
- One source, all destinations
- Many variations (by using different ADTs)
- We choose to use a priority queue / heap

  set dist[n] to ∞ for all nodes n
  Q ← {(ns, 0)}                      // (node, dist) pair for the source node
  while Q ≠ ∅:
      B ← Dequeue(Q)                 // entry with minimum dist
      dist[B.node] ← B.dist
      for each non-visited adjacent node nv of B.node:
          B′ ← new entry
          B′.node ← nv
          B′.dist ← B.dist + W(e(B.node, nv))
          Enqueue(Q, B′)
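For reference, a runnable Python version of the pseudocode above, using the standard lazy-deletion heap variant; the adjacency-dict graph format is my own choice for the sketch:

```python
import heapq

def dijkstra(adj, src):
    """adj: {node: [(neighbor, edge_weight), ...]}; returns {node: dist}."""
    dist = {}
    heap = [(0, src)]                  # (dist, node) pair for the source
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:                  # already finalized: skip stale entry
            continue
        dist[u] = d
        for v, w in adj[u]:
            if v not in dist:
                heapq.heappush(heap, (d + w, v))
    return dist
```

Because `heapq` has no decrease-key operation, a node may be enqueued several times; the `u in dist` check discards the stale entries on dequeue.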
8. Dijkstra's algorithm
- Network expansion from node 1 to all nodes
9. Properties
- Network distance is a metric
  - Satisfies the triangular inequality: d(x, z) ≤ d(x, y) + d(y, z) (minimum distance)
- With a distance matrix for all pairs of nodes, the distance between points can be found in O(1) time
- Only need to compute all-pairs distances for the nodes incident to edges that contain points (let V′ be this set of nodes)
  - Time complexity discussed later
  - Space for the distance matrix: O(|V′|²) instead of O(|V|²)
  - Possibly |V′| << |V|
10. Properties
- Planar graphs
  - Graphs drawn in the plane with no edge crossings
  - Most real-life road networks are planar
  - |E| ≤ 3|V| − 6 (by Euler's formula)
- Time complexity
  - Dijkstra's algorithm: O(|E| log|E|), i.e. O(|V| log|V|) for planar graphs
  - Node distance matrix from the previous slide (for planar graphs): apply Dijkstra's algorithm on each node in V′
    - O(|V′| |V| log|V|) (at most O(|V|² log|V|))
11. Related work
- K-medoid algorithms: PAM, CLARA, CLARANS
- Density-based algorithm: DBSCAN
- C2P
  - Finding the NN is efficient in spatial DBs, but not in our model
  - e.g. expensive to find the NN for an outlier
- Classical graph partitioning/clustering algorithms
  - Single link, complete link: too expensive
  - Minimum spanning tree, then remove long edges: sensitive to outliers
12. Related work
- Networking field (NTC, CCAM)
  - Partition nodes to facilitate efficient query processing in terms of I/O cost
  - Do not consider points/objects lying on edges
- Spatial Network Databases (SNDB)
  - Only discuss simple query processing such as point queries, range queries, ...
  - Do not discuss clustering
13. Transformation for graph algorithms
- Our interest is in the points, not the nodes
- A transformation is needed before graph clustering algorithms can be applied
- The transformed graph may not be planar
- More complex model → more expensive clustering
14. Conventional clustering algorithms
- Many clustering algorithms can be applied
- Disadvantage: they require an N² distance matrix between every pair of objects
  - Expensive to compute
  - Too large to fit in memory, or even on disk, for moderate N
- Better method: our approach clusters the points and computes the distances at the same time
  - By integrating Dijkstra's algorithm with the clustering algorithm(s)
  - Should have complexity similar to Dijkstra's algorithm
15. Disk-based representation
- Point group
  - Avoids storing the edge redundantly
  - Group of points on the same edge
  - Fixed fields: edge, # of points
  - Variable fields: ref. position of each point
- Adjacency list
  - Fixed fields: size of the adjacency list
  - Variable fields: adjacent node ID, edge weight, pointer to its point group
- B-tree index
  - Efficient search of adjacency lists and points
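As an illustration only (the field widths and byte order are my assumptions, not the paper's actual layout), a point group record could be serialized like this:

```python
import struct

# Hypothetical on-disk layout for a point group record, following the slide:
# fixed fields (edge endpoints, # of points) + variable fields (positions).

def pack_point_group(ni, nj, positions):
    header = struct.pack("<iii", ni, nj, len(positions))    # fixed fields
    body = struct.pack(f"<{len(positions)}f", *positions)   # variable fields
    return header + body

def unpack_point_group(buf):
    ni, nj, n = struct.unpack_from("<iii", buf, 0)
    positions = list(struct.unpack_from(f"<{n}f", buf, 12))
    return ni, nj, positions
```

Storing the edge once per group, rather than once per point, is what "avoid storing the edge redundantly" buys: the variable part holds only the positions.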
16. Disk-based representation
17. K-medoid
- Frequent operation: finding the nearest medoid for each point p
  - p on edge (ni, nj); let Mu, Mv be the medoids nearest to ni, nj respectively
  - With a node distance matrix, the above comparison can be done in O(2k) time (average O(1) time?)
- The node distance matrix is still expensive (to compute and store) for large networks
- More efficient methods are needed:
  - Concurrent expansion from the medoids
  - Incremental replacement of node distances
  - Medoid pool
18. K-med (concurrent expansion)
- Network expansion (Dijkstra's algorithm) for each medoid separately is too expensive
- Apply it concurrently for all k medoids
  - By sharing the same heap/priority queue
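A minimal sketch of the shared-heap idea, assuming medoids sit on nodes and ignoring the assignment of points along edges; names are my own:

```python
import heapq

# Concurrent expansion: all k medoids share one priority queue, so every
# node is settled exactly once, by whichever medoid reaches it first
# (i.e. its nearest medoid).

def concurrent_expansion(adj, medoids):
    """adj: {node: [(neighbor, weight), ...]}; medoids: list of medoid nodes.
    Returns {node: (dist_to_nearest_medoid, medoid_index)}."""
    heap = [(0, i, m) for i, m in enumerate(medoids)]  # seed all k sources
    heapq.heapify(heap)
    label = {}
    while heap:
        d, i, u = heapq.heappop(heap)
        if u in label:                 # already claimed by a closer medoid
            continue
        label[u] = (d, i)
        for v, w in adj[u]:
            if v not in label:
                heapq.heappush(heap, (d + w, i, v))
    return label
```

Compared with running Dijkstra k times, each node is popped and labeled once, which is how the per-iteration cost drops from O(k|V| log|V|) to O(|V| log|V|).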
19. K-med (concurrent expansion)
20. K-med (incremental replacement)
- Only a few medoids (e.g. 1 medoid) change in the next (k-medoid) iteration
- Reuse the result of the previous iteration
  - Drill hole(s) around the replaced medoids
  - Expand from the border of the hole(s) and from all medoids
21. K-med (incremental replacement)
- ChgC: set of medoid indices to be changed (cluster IDs in [1, k])
- Generalizes to changing an arbitrary number of medoids
- Common case: |ChgC| = 1
22. K-med (medoid pool)
- Choose A·k points as the medoid pool
  - A is a trade-off between quality and speed
- Medoids can only be chosen from the pool
- Apply network expansion for each medoid in the pool
  - Store the results in an A·k × |V| distance array (in memory/disk)
- Find the nearest medoid for each node
- More benefit with more iterations, and vice versa
23. Density-based algorithms
- Disadvantages of k-medoid
  - Convergence depends on the choice of medoids
  - May take many iterations
  - Cannot identify outliers easily
- Density-based clustering methods
  - Easily identify outliers
  - Fast: obtain the optimal solution (by their density definition) and do not need to improve the result over iterations
- ε-Link (new), DBSCAN
24. ε-Link
- One parameter: ε
- Points within distance ε belong to the same cluster
- Method (similar to network expansion)
  - Only one heap is used
  - Store NNdist of each node: its distance to the nearest clustered point (in the current cluster)
  - Update NNdist of the nodes
  - Whenever NNdist[nz] drops from > ε to ≤ ε, expand from that node, i.e. enqueue the entry (nz, NNdist[nz])
- The result is a special case of DBSCAN
  - No need to perform expensive range searches
  - Cheap clustering cost
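A heavily simplified sketch of the ε-Link idea, assuming points sit on nodes (edge positions are ignored) and using the NNdist bookkeeping described above; it is not the paper's exact algorithm:

```python
import heapq

# Grow one cluster at a time by network expansion bounded by eps; whenever
# expansion reaches an unclustered point, that point joins the cluster and
# its NNdist resets to 0, so expansion continues from it.

def eps_link(adj, point_nodes, eps):
    """adj: {node: [(nbr, w), ...]}; point_nodes: set of nodes holding points.
    Returns a list of clusters (sets of point-bearing nodes)."""
    unclustered = set(point_nodes)
    clusters = []
    while unclustered:
        seed = next(iter(unclustered))
        cluster = {seed}
        heap = [(0.0, seed)]
        nndist = {seed: 0.0}   # dist from node to nearest clustered point
        while heap:
            d, u = heapq.heappop(heap)
            if d > nndist.get(u, float("inf")) or d > eps:
                continue       # stale entry, or beyond the eps horizon
            if u in unclustered and u != seed:
                cluster.add(u)
                d = 0.0        # u is now a clustered point: expand afresh
                nndist[u] = 0.0
            for v, w in adj[u]:
                nd = d + w
                if nd <= eps and nd < nndist.get(v, float("inf")):
                    nndist[v] = nd
                    heapq.heappush(heap, (nd, v))
        unclustered -= cluster
        clusters.append(cluster)
    return clusters
```

The single heap plays the role described on the slide: expansion restarts only at nodes whose NNdist has fallen to ≤ ε, so no separate range search is ever issued.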
25. DBSCAN
- The DBSCAN algorithm is adapted to our model
- Two parameters: MinPts and ε
- Only need to discuss the range search within distance ε
  - A straightforward range search may introduce duplicates, which affect the count of ε-neighbors
  - A simple check can avoid (not eliminate) these duplicates, but we do not discuss it here
26. Comparison of the techniques
- K-medoid (common: point assignment, heap space)
- Dijkstra's algorithm for each medoid
  - Time: O(k|V| log|V|) per iteration
  - Space: O(k|V|) for storing node distances (labels) to the medoids
- Concurrent expansion
  - Time: O(|V| log|V|) per iteration
  - Space: O(|V|) for storing node distances to the medoids
- Incremental medoid replacement
  - Better average case; worst case same as concurrent expansion
  - Space: O(|V|) for storing node distances to the medoids
- Medoid pool
  - Time: O(A·k·|V| log|V|) overhead, O(|V|) per iteration
  - Space: O(A·k·|V|) for pool distances, O(|V|) for node distances
- Note: complexities are for planar (or sparse) graphs only!
27. Comparison of the techniques
- Density-based methods
- ε-Link
  - Time: O(|V| log|V|) for traversing the network
  - Better average case: only traverses the part of the network containing points
  - Space: O(|V|) for storing node distances
- DBSCAN
  - Time: O(N · range-search(ε)) in total
  - Space: only the space for an ε range search, much smaller than O(|V|)
28. Applications in a wider domain
- Edge weight interpretation
  - Different users can interpret edge weights in different ways
  - Set the edge weight to the edge's traveling time → obtain clusters based on traveling time
  - Set the edge weight to the edge's traveling cost → obtain clusters based on traveling cost
  - Set the edge weight to ... → obtain clusters based on ...
29. Applications in a wider domain
- Combination of networks
  - Useful for discovering clusters across different networks
- Modeling methods
  - Transition node: a node with (at least) one edge to another network
  - Transition edge weights: assigned as the cost of the transition
- The combined network may not be planar, but it is still a sparse graph with low average degree
  - Clustering can still be performed efficiently
30. Preliminary results
- Real network
  - Main roads in North America
  - 175,813 nodes and 179,179 edges
- Synthetic clusters generated in the network using expansion
  - N: number of points
  - k: number of clusters
  - |V|: number of nodes
  - Defaults: N = 100K, k = 10, |V| = 100K
31. Preliminary results
- Figure: number of heap pops as a function of k (|V| = 100K)
32. Preliminary results
- Figure: cost of k-med(CE) as a function of N (|V| fixed at 100K)
- Figure: cost of k-med(CE) as a function of |V| (N fixed at 1K)
- Figure: cost of different algorithms as a function of |V| (N fixed at 100K)
- Figure: cost of density-based methods as a function of ε (N fixed at 100K)
33. Future Work
- Avoid scanning the points many times
  - Cluster weighted nodes instead of points
  - Scan the points once and store summary statistics (point count, segment distance) in each node
  - Space O(|V|): acceptable
- For each edge e = (ni, nj):
  - Count the # of points with pos ∈ [0, ½W(e)) and increment the point count of node ni
  - Count the # of points with pos ∈ [½W(e), W(e)] and increment the point count of node nj
  - Increment the segment distance of both ni and nj by W(e)/2
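The per-edge bookkeeping above can be sketched as a single pass; the data layout (dicts for edges and counts) is my own choice:

```python
from collections import defaultdict

# One scan over the points: each point charges the nearer endpoint of its
# edge, and each edge contributes half its weight to the segment distance
# of both endpoints.

def summarize(edges, points):
    """edges: {(ni, nj): weight}; points: list of (ni, nj, pos) triplets.
    Returns ({node: point_count}, {node: segment_distance})."""
    count = defaultdict(int)
    seg = defaultdict(float)
    for (ni, nj), w in edges.items():
        seg[ni] += w / 2            # each endpoint owns half the edge
        seg[nj] += w / 2
    for ni, nj, pos in points:
        w = edges[(ni, nj)]
        if pos < w / 2:             # pos in [0, W(e)/2): nearer to ni
            count[ni] += 1
        else:                       # pos in [W(e)/2, W(e)]: nearer to nj
            count[nj] += 1
    return dict(count), dict(seg)
```

After this pass the points never need to be rescanned: clustering can operate on the O(|V|) per-node summaries alone.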
34. Future Work
- k-medoid
  - CE can still be applied
  - Only put nodes with non-empty point counts into clusters
- Density-based methods
  - The density of the region near node ni is defined as: (point count of ni) / (segment distance of ni)
  - Adapt the definitions to facilitate the discovery of these clusters
35. Conclusion
- A new clustering problem is formulated
- A disk-based representation for large datasets is proposed
- The effort of finding shortest distances is reduced or amortized
- Three clustering algorithms are proposed
  - Efficient and scalable
- Applicable to some other related problems
36. References
- [1] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.
- [2] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- [3] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In ACM SIGKDD, 1996.
- [4] A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2P: Clustering based on closest pairs. In VLDB, 2001.
- [5] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In VLDB, 2003.
37. References
- [6] S. Shekhar and D.-R. Liu. CCAM: A connectivity-clustered access method for networks and network computations. IEEE Transactions on Knowledge and Data Engineering, 9(1):102-119, 1997.
- [7] S. H. Woo and S. B. Yang. An improved network clustering method for I/O-efficient query processing. In ACM GIS, 2000.