Title: Clustering Objects in Large Spatial Network
1. Clustering Objects in Large Spatial Network
- DB seminar
- Speaker: Ken Yiu
- Date: 31/10/2003
2. Outline
- Motivation
- Problem definition and related work
- Disk-based representation of the model
- K-medoid and density-based algorithms
- Applications in a wider domain
- Experimental results
- Conclusions
3. Motivation
- The Euclidean distance between two points does not represent the actual distance
- The shortest path between two points is constrained by the transportation network
- Real examples: road networks, river networks, plane networks, rail networks, ...
- Conventional clustering algorithms cannot be applied without a distance matrix
  - Computing all-pairs distances is expensive
4. Problem definition
- Network: a weighted graph G(V, E, W)
  - V: set of nodes
  - E: set of edges
  - W: E → ℝ⁺ associates each edge with an edge weight
- Point: a position on an edge, written as a triplet (ni, nj, pos), where pos ∈ [0, W(e)] and e = (ni, nj)
  - The point is pos units away from node ni along the edge (ni, nj)
- How to ensure a unique representation?
  - Require in the triplet that ni < nj
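To make the triplet representation concrete, here is a minimal Python sketch; the names NetworkPoint and make_point are mine, not from the paper:

```python
from dataclasses import dataclass

# Hypothetical sketch of the (ni, nj, pos) point representation.
# Names (NetworkPoint, make_point) are illustrative only.

@dataclass(frozen=True)
class NetworkPoint:
    ni: int     # smaller endpoint node ID (ni < nj enforced below)
    nj: int     # larger endpoint node ID
    pos: float  # distance from ni along edge (ni, nj), 0 <= pos <= W(e)

def make_point(na: int, nb: int, pos: float, weight: float) -> NetworkPoint:
    """Normalize to the unique triplet with ni < nj.

    If the endpoints arrive in the wrong order, flip them and
    measure pos from the other end of the edge.
    """
    assert 0.0 <= pos <= weight
    if na < nb:
        return NetworkPoint(na, nb, pos)
    return NetworkPoint(nb, na, weight - pos)

# The same physical point has one canonical form:
p1 = make_point(3, 7, 2.0, 10.0)   # 2 units from node 3
p2 = make_point(7, 3, 8.0, 10.0)   # 8 units from node 7 = 2 units from node 3
assert p1 == p2
```

Normalizing at construction time is what makes the ni < nj requirement cheap to maintain: equality of points reduces to tuple equality.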
5. Example model
6. Problem definition
- Direct distance dL(x, y)
  - ∞ if x and y are on different edges
  - |x.pos − y.pos| (distance along the edge), otherwise
- Network distance: shortest distance
  - Between nodes: can be computed by Dijkstra's algorithm
  - Between points x on (na, nb) and y on (nc, nd):
    - Different edges: minimum over the four routes through the endpoint nodes
    - Same edge: minimum of the previous expression and dL(x, y)
- Example using the previous slide
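These definitions can be sketched in Python; the edge-weight map `w` and the all-pairs node distance table `d_node` (e.g. precomputed with Dijkstra's algorithm) are assumed inputs, and all names here are mine:

```python
import math

# Sketch of the distance definitions. Points are triplets (ni, nj, pos)
# with ni < nj; w maps an edge (ni, nj) to its weight.

def direct_distance(x, y):
    """d_L(x, y): distance along a shared edge, infinity otherwise."""
    if (x[0], x[1]) != (y[0], y[1]):
        return math.inf
    return abs(x[2] - y[2])

def network_distance(x, y, w, d_node):
    """Shortest network distance: best of the four routes through the
    endpoint nodes, and the direct distance when an edge is shared."""
    (na, nb, px), (nc, nd, py) = x, y
    x_ends = {na: px, nb: w[(na, nb)] - px}   # distance from x to its endpoints
    y_ends = {nc: py, nd: w[(nc, nd)] - py}
    via = min(dx + d_node[u][v] + dy
              for u, dx in x_ends.items()
              for v, dy in y_ends.items())
    return min(via, direct_distance(x, y))
```

Note the same-edge case still consults the routes through the endpoints: going around via other edges can occasionally beat walking along the shared edge.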
7. Dijkstra's algorithm
- Shortest-path algorithm
- One source, all destinations
- Many variations (by using different ADTs)
- We choose to use a priority queue / heap

  set dist[n] to ∞ for all nodes n
  Q ← {(ns, 0)}                      // (node, dist) pair for the source node
  while Q ≠ ∅:
      B ← Dequeue(Q)                 // entry with minimum dist
      dist[B.node] ← B.dist
      for each non-visited adjacent node nv of B.node:
          B′ ← new entry
          B′.node ← nv
          B′.dist ← B.dist + W(e(B.node, nv))
          Enqueue(Q, B′)
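For reference, a runnable Python version of the pseudocode above, using the standard lazy-deletion heap variant; the adjacency-dict graph format is my own choice for the sketch:

```python
import heapq

def dijkstra(adj, src):
    """adj: {node: [(neighbor, edge_weight), ...]}; returns {node: dist}."""
    dist = {}
    heap = [(0, src)]                  # (dist, node) pair for the source
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:                  # already finalized: skip stale entry
            continue
        dist[u] = d
        for v, w in adj[u]:
            if v not in dist:
                heapq.heappush(heap, (d + w, v))
    return dist
```

Because `heapq` has no decrease-key operation, a node may be enqueued several times; the `u in dist` check discards the stale entries on dequeue.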
8. Dijkstra's algorithm
- Network expansion from node 1 to all nodes
9. Properties
- Network distance is a metric
  - Satisfies the triangular inequality: d(x, z) ≤ d(x, y) + d(y, z) (minimum distance)
- With a distance matrix for all pairs of nodes, the distance between points can be found in O(1) time
- Only need to compute all-pairs distances for the nodes incident to edges that contain points (let V′ be this set of nodes)
  - Time complexity discussed later
  - Space for the distance matrix: O(|V′|²) instead of O(|V|²)
  - Possibly |V′| << |V|
10. Properties
- Planar graphs
  - Graphs drawn in the plane with no edge crossings
  - Most real-life road networks are planar
  - |E| ≤ 3|V| − 6 (by Euler's formula)
- Time complexity
  - Dijkstra's algorithm: O(|E| log|E|), i.e. O(|V| log|V|) for planar graphs
  - Node distance matrix from the previous slide (for planar graphs): apply Dijkstra's algorithm on each node in V′
    - O(|V′| |V| log|V|) (at most O(|V|² log|V|))
11. Related work
- K-medoid algorithms: PAM, CLARA, CLARANS
- Density-based algorithm: DBSCAN
- C2P
  - Finding the NN is efficient in spatial DBs, but not in our model
  - e.g. expensive to find the NN for an outlier
- Classical graph partitioning/clustering algorithms
  - Single link, complete link: too expensive
  - Minimum spanning tree, then remove long edges: sensitive to outliers
12. Related work
- Networking field (NTC, CCAM)
  - Partition nodes to facilitate efficient query processing in terms of I/O cost
  - Do not consider points/objects lying on edges
- Spatial Network Databases (SNDB)
  - Only discuss simple query processing such as point queries, range queries, ...
  - Do not discuss clustering
13. Transformation for graph algorithms
- Our interest is in the points, not the nodes
- A transformation is needed before graph clustering algorithms can be applied
- The transformed graph may not be planar
- More complex model → more expensive clustering
14. Conventional clustering algorithms
- Many clustering algorithms can be applied
- Disadvantage: they require an N² distance matrix between every pair of objects
  - Expensive to compute
  - Too large to fit in memory, or even on disk, for moderate N
- Better method: our approach clusters the points and computes the distances at the same time
  - By integrating Dijkstra's algorithm with the clustering algorithm(s)
  - Should have complexity similar to Dijkstra's algorithm
15. Disk-based representation
- Point group
  - Avoids storing the edge redundantly
  - Group of points on the same edge
  - Fixed fields: edge, # of points
  - Variable fields: ref. position of each point
- Adjacency list
  - Fixed fields: size of the adjacency list
  - Variable fields: adjacent node ID, edge weight, pointer to its point group
- B-tree index
  - Efficient search of adjacency lists and points
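As an illustration only (the field widths and byte order are my assumptions, not the paper's actual layout), a point group record could be serialized like this:

```python
import struct

# Hypothetical on-disk layout for a point group record, following the slide:
# fixed fields (edge endpoints, # of points) + variable fields (positions).

def pack_point_group(ni, nj, positions):
    header = struct.pack("<iii", ni, nj, len(positions))    # fixed fields
    body = struct.pack(f"<{len(positions)}f", *positions)   # variable fields
    return header + body

def unpack_point_group(buf):
    ni, nj, n = struct.unpack_from("<iii", buf, 0)
    positions = list(struct.unpack_from(f"<{n}f", buf, 12))
    return ni, nj, positions
```

Storing the edge once per group, rather than once per point, is what "avoid storing the edge redundantly" buys: the variable part holds only the positions.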
16. Disk-based representation
17. K-medoid
- Frequent operation: finding the nearest medoid for each point p
  - p on edge (ni, nj); let Mu, Mv be the medoids nearest to ni, nj respectively
  - With a node distance matrix, the above comparison can be done in O(2k) time (average O(1) time?)
- The node distance matrix is still expensive (to compute and store) for large networks
- More efficient methods are needed:
  - Concurrent expansion from the medoids
  - Incremental replacement of node distances
  - Medoid pool
18. K-med (concurrent expansion)
- Network expansion (Dijkstra's algorithm) for each medoid separately is too expensive
- Apply it concurrently for all k medoids
  - By sharing the same heap/priority queue
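A minimal sketch of the shared-heap idea, assuming medoids sit on nodes and ignoring the assignment of points along edges; names are my own:

```python
import heapq

# Concurrent expansion: all k medoids share one priority queue, so every
# node is settled exactly once, by whichever medoid reaches it first
# (i.e. its nearest medoid).

def concurrent_expansion(adj, medoids):
    """adj: {node: [(neighbor, weight), ...]}; medoids: list of medoid nodes.
    Returns {node: (dist_to_nearest_medoid, medoid_index)}."""
    heap = [(0, i, m) for i, m in enumerate(medoids)]  # seed all k sources
    heapq.heapify(heap)
    label = {}
    while heap:
        d, i, u = heapq.heappop(heap)
        if u in label:                 # already claimed by a closer medoid
            continue
        label[u] = (d, i)
        for v, w in adj[u]:
            if v not in label:
                heapq.heappush(heap, (d + w, i, v))
    return label
```

Compared with running Dijkstra k times, each node is popped and labeled once, which is how the per-iteration cost drops from O(k|V| log|V|) to O(|V| log|V|).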
19. K-med (concurrent expansion)
20. K-med (incremental replacement)
- Only a few medoids (e.g. 1 medoid) change in the next (k-medoid) iteration
- Reuse the result of the previous iteration
  - Drill hole(s) around the replaced medoids
  - Expand from the border of the hole(s) and from all medoids
21. K-med (incremental replacement)
- ChgC: set of medoid indices to be changed (cluster IDs in [1, k])
- Generalizes to changing an arbitrary number of medoids
- Common case: |ChgC| = 1
22. K-med (medoid pool)
- Choose A·k points as the medoid pool
  - A is a trade-off between quality and speed
- Medoids can only be chosen from the pool
- Apply network expansion for each medoid in the pool
  - Store the results in an A·k × |V| distance array (in memory/disk)
- Find the nearest medoid for each node
- More benefit with more iterations, and vice versa
23. Density-based algorithms
- Disadvantages of k-medoid
  - Convergence depends on the choice of medoids
  - May take many iterations
  - Cannot identify outliers easily
- Density-based clustering methods
  - Easily identify outliers
  - Fast: obtain the optimal solution (by their density definition) and do not need to improve the result over iterations
- ε-Link (new), DBSCAN
24. ε-Link
- One parameter: ε
- Points within distance ε belong to the same cluster
- Method (similar to network expansion)
  - Only one heap is used
  - Store NNdist of each node: its distance to the nearest clustered point (in the current cluster)
  - Update NNdist of the nodes
  - Whenever NNdist[nz] drops from > ε to ≤ ε, expand from that node, i.e. enqueue the entry (nz, NNdist[nz])
- The result is a special case of DBSCAN
  - No need to perform expensive range searches
  - Cheap clustering cost
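A heavily simplified sketch of the ε-Link idea, assuming points sit on nodes (edge positions are ignored) and using the NNdist bookkeeping described above; it is not the paper's exact algorithm:

```python
import heapq

# Grow one cluster at a time by network expansion bounded by eps; whenever
# expansion reaches an unclustered point, that point joins the cluster and
# its NNdist resets to 0, so expansion continues from it.

def eps_link(adj, point_nodes, eps):
    """adj: {node: [(nbr, w), ...]}; point_nodes: set of nodes holding points.
    Returns a list of clusters (sets of point-bearing nodes)."""
    unclustered = set(point_nodes)
    clusters = []
    while unclustered:
        seed = next(iter(unclustered))
        cluster = {seed}
        heap = [(0.0, seed)]
        nndist = {seed: 0.0}   # dist from node to nearest clustered point
        while heap:
            d, u = heapq.heappop(heap)
            if d > nndist.get(u, float("inf")) or d > eps:
                continue       # stale entry, or beyond the eps horizon
            if u in unclustered and u != seed:
                cluster.add(u)
                d = 0.0        # u is now a clustered point: expand afresh
                nndist[u] = 0.0
            for v, w in adj[u]:
                nd = d + w
                if nd <= eps and nd < nndist.get(v, float("inf")):
                    nndist[v] = nd
                    heapq.heappush(heap, (nd, v))
        unclustered -= cluster
        clusters.append(cluster)
    return clusters
```

The single heap plays the role described on the slide: expansion restarts only at nodes whose NNdist has fallen to ≤ ε, so no separate range search is ever issued.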
25. DBSCAN
- The DBSCAN algorithm is adapted to our model
- Two parameters: MinPts and ε
- Only need to discuss the range search within distance ε
  - A straightforward range search may introduce duplicates, which affect the count of ε-neighbors
  - A simple check can avoid (not eliminate) these duplicates, but we do not discuss it here
26. Comparison of the techniques
- K-medoid (common: point assignment, heap space)
- Dijkstra's algorithm for each medoid
  - Time: O(k|V| log|V|) per iteration
  - Space: O(k|V|) for storing node distances (labels) to the medoids
- Concurrent expansion
  - Time: O(|V| log|V|) per iteration
  - Space: O(|V|) for storing node distances to the medoids
- Incremental medoid replacement
  - Better average case; worst case same as concurrent expansion
  - Space: O(|V|) for storing node distances to the medoids
- Medoid pool
  - Time: O(A·k·|V| log|V|) overhead, O(|V|) per iteration
  - Space: O(A·k·|V|) for pool distances, O(|V|) for node distances
- Note: complexities are for planar (or sparse) graphs only!
27. Comparison of the techniques
- Density-based methods
- ε-Link
  - Time: O(|V| log|V|) for traversing the network
  - Better average case: only traverses the part of the network containing points
  - Space: O(|V|) for storing node distances
- DBSCAN
  - Time: O(N · range-search(ε)) in total
  - Space: only the space for an ε range search, much smaller than O(|V|)
28. Applications in a wider domain
- Edge weight interpretation
  - Different users can interpret edge weights in different ways
  - Set the edge weight to the edge's traveling time → obtain clusters based on traveling time
  - Set the edge weight to the edge's traveling cost → obtain clusters based on traveling cost
  - Set the edge weight to ... → obtain clusters based on ...
29. Applications in a wider domain
- Combination of networks
  - Useful for discovering clusters across different networks
- Modeling methods
  - Transition node: a node with (at least) one edge to another network
  - Transition edge weights: assigned as the cost of the transition
- The combined network may not be planar, but it is still a sparse graph with low average degree
  - Clustering can still be performed efficiently
30. Preliminary results
- Real network
  - Main roads in North America
  - 175,813 nodes and 179,179 edges
- Synthetic clusters generated in the network using expansion
  - N: number of points
  - k: number of clusters
  - |V|: number of nodes
  - Defaults: N = 100K, k = 10, |V| = 100K
31. Preliminary results
- Figure: number of heap pops as a function of k (|V| = 100K)
32. Preliminary results
- Figure: cost of k-med(CE) as a function of N (|V| fixed at 100K)
- Figure: cost of k-med(CE) as a function of |V| (N fixed at 1K)
- Figure: cost of different algorithms as a function of |V| (N fixed at 100K)
- Figure: cost of density-based methods as a function of ε (N fixed at 100K)
33. Future Work
- Avoid scanning the points many times
  - Cluster weighted nodes instead of points
  - Scan the points once and store summary statistics (point count, segment distance) in each node
  - Space O(|V|): acceptable
- For each edge e = (ni, nj):
  - Count the # of points with pos ∈ [0, ½W(e)) and increment the point count of node ni
  - Count the # of points with pos ∈ [½W(e), W(e)] and increment the point count of node nj
  - Increment the segment distance of both ni and nj by W(e)/2
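The per-edge bookkeeping above can be sketched as a single pass; the data layout (dicts for edges and counts) is my own choice:

```python
from collections import defaultdict

# One scan over the points: each point charges the nearer endpoint of its
# edge, and each edge contributes half its weight to the segment distance
# of both endpoints.

def summarize(edges, points):
    """edges: {(ni, nj): weight}; points: list of (ni, nj, pos) triplets.
    Returns ({node: point_count}, {node: segment_distance})."""
    count = defaultdict(int)
    seg = defaultdict(float)
    for (ni, nj), w in edges.items():
        seg[ni] += w / 2            # each endpoint owns half the edge
        seg[nj] += w / 2
    for ni, nj, pos in points:
        w = edges[(ni, nj)]
        if pos < w / 2:             # pos in [0, W(e)/2): nearer to ni
            count[ni] += 1
        else:                       # pos in [W(e)/2, W(e)]: nearer to nj
            count[nj] += 1
    return dict(count), dict(seg)
```

After this pass the points never need to be rescanned: clustering can operate on the O(|V|) per-node summaries alone.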
34. Future Work
- k-medoid
  - CE can still be applied
  - Only put nodes with non-empty point counts into clusters
- Density-based methods
  - The density of the region near node ni is defined as: (point count of ni) / (segment distance of ni)
  - Adapt the definitions to facilitate the discovery of these clusters
35. Conclusion
- A new clustering problem is formulated
- A disk-based representation for large datasets is proposed
- The effort of finding shortest distances is reduced or amortized
- Three clustering algorithms are proposed
  - Efficient and scalable
- Applicable to some other related problems
36. References
- [1] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.
- [2] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- [3] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In ACM SIGKDD, 1996.
- [4] A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2P: Clustering based on closest pairs. In VLDB, 2001.
- [5] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In VLDB, 2003.
37. References
- [6] S. Shekhar and D.-R. Liu. CCAM: A connectivity-clustered access method for networks and network computations. IEEE Transactions on Knowledge and Data Engineering, 9(1):102-119, 1997.
- [7] S. H. Woo and S. B. Yang. An improved network clustering method for I/O-efficient query processing. In ACM GIS, 2000.