Clustering Objects in Large Spatial Network

1
Clustering Objects in Large Spatial Network
  • DB seminar
  • Speaker: Ken Yiu
  • Date: 31/10/2003

2
Outline
  • Motivation
  • Problem definition and related work
  • Disk based representation of the model
  • K-medoid and Density based algorithms
  • Applications in wider domain
  • Experimental results
  • Conclusions

3
Motivation
  • Euclidean distance between two points does not
    represent the actual travel distance
  • Shortest path between two points is constrained by
    the transportation network
  • Real examples: road network, river network, plane
    network, rail network, ...
  • Conventional clustering algorithms cannot be
    applied without a distance matrix
  • This requires computing expensive all-pairs distances

4
Problem definition
  • Network: weighted graph G(V, E, W)
  • V: set of nodes
  • E: set of edges
  • W: E → ℝ⁺ associates each edge with an edge weight
  • Point: a position on an edge, (ni, nj, pos), where
    pos ∈ [0, W(e)] and e = (ni, nj)
  • The point is pos units away from the node ni
    along the edge (ni, nj)
  • How to ensure a unique representation?
  • Require in the triplet that ni < nj
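
A minimal Python sketch of the point triplet and the uniqueness rule above. The names `NetPoint` and `make_point` are illustrative assumptions, not taken from the slides; the only rule encoded is that the smaller node ID comes first and pos is measured from that node.

```python
from typing import NamedTuple


class NetPoint(NamedTuple):
    """A point on edge (ni, nj), located pos units away from ni."""
    ni: int
    nj: int
    pos: float


def make_point(na: int, nb: int, pos: float, weight: float) -> NetPoint:
    """Return the canonical triplet for a point pos units from na on edge (na, nb).

    weight is W(e) for that edge. If na > nb the end nodes are swapped and
    the offset is re-measured from the other end, so that ni < nj holds.
    """
    if na == nb:
        raise ValueError("an edge needs two distinct end nodes")
    if not 0 <= pos <= weight:
        raise ValueError("pos must lie in [0, W(e)]")
    if na < nb:
        return NetPoint(na, nb, pos)
    return NetPoint(nb, na, weight - pos)


# The same physical location described from either end of the edge
# maps to one triplet.
p = make_point(5, 2, 1.0, 4.0)
q = make_point(2, 5, 3.0, 4.0)
```

With this normalization, two points lie on the same edge exactly when their (ni, nj) pairs are equal.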

5
Example model
6
Problem definition
  • Direct distance dL(x, y):
  • ∞ if x and y are on different edges
  • |x.pos - y.pos| (distance along the edge), otherwise
  • Network distance: shortest distance
  • Between nodes
  • Can be computed by Dijkstra's algorithm
  • Between points x on (na, nb) and y on (nc, nd)
  • Different edges: minimum over a ∈ {na, nb}, c ∈ {nc, nd}
    of dL(x, a) + d(a, c) + dL(c, y)
  • Same edge: minimum of the previous expression and
    dL(x, y)
  • Example using previous slide
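
The two distances can be sketched in Python as follows. This assumes the different-edge case is the minimum over the four combinations of end nodes, and that node-to-node distances are already available in a lookup table; `weight`, `node_dist` and the small triangle network are made-up illustration data, not the example from the slides.

```python
import math


def direct_distance(x, y):
    """dL(x, y) for points given as (ni, nj, pos) triplets with ni < nj."""
    if (x[0], x[1]) != (y[0], y[1]):
        return math.inf          # different edges
    return abs(x[2] - y[2])      # distance along the shared edge


def network_distance(x, y, weight, node_dist):
    """Shortest network distance between two points.

    weight maps an edge (ni, nj) to W(e); node_dist[a][c] is the shortest
    distance between nodes a and c (assumed precomputed).
    """
    wx, wy = weight[(x[0], x[1])], weight[(y[0], y[1])]
    # Each point can leave its edge through either end node.
    x_ends = [(x[0], x[2]), (x[1], wx - x[2])]
    y_ends = [(y[0], y[2]), (y[1], wy - y[2])]
    via_nodes = min(dx + node_dist[a][c] + dy
                    for a, dx in x_ends for c, dy in y_ends)
    # On the same edge the direct route may be shorter; otherwise it is inf.
    return min(via_nodes, direct_distance(x, y))


# Hypothetical triangle network: the long edge (1, 3) is bypassed via node 2.
weight = {(1, 2): 4.0, (2, 3): 3.0, (1, 3): 10.0}
node_dist = {1: {1: 0.0, 2: 4.0, 3: 7.0},
             2: {1: 4.0, 2: 0.0, 3: 3.0},
             3: {1: 7.0, 2: 3.0, 3: 0.0}}

d_diff = network_distance((1, 2, 1.0), (2, 3, 1.0), weight, node_dist)
d_same = network_distance((1, 3, 0.5), (1, 3, 9.5), weight, node_dist)
```

The second call shows why the same-edge case still needs the minimum: both points lie on edge (1, 3), yet the route through node 2 is shorter than the distance along the edge.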

7
Dijkstra's algorithm
  • Shortest path algorithm
  • One source, all destination(s)
  • Many variations (by using different ADT)
  • We choose to use priority queue/heap

Set dist for all nodes to ∞
Q = <(ns, 0)>    // (node, dist) pair for the source node
while (Q ≠ ∅)
    Entry B = Dequeue(Q)
    if B.node is already visited, skip this entry
    mark B.node as visited; dist[B.node] = B.dist
    for each non-visited adjacent node nv of B.node
        Create new entry B'
        B'.node = nv; B'.dist = B.dist + W(e(B.node, nv))
        Enqueue(Q, B')
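
As a rough runnable counterpart to the pseudocode, here is a heap-based version in Python. The adjacency-list layout (`adj` maps a node to a list of `(neighbor, weight)` pairs) is an assumption made for the sketch, and the three-node network is made up.

```python
import heapq
import math


def dijkstra(adj, source):
    """Single-source shortest distances using a binary heap."""
    dist = {node: math.inf for node in adj}
    visited = set()
    heap = [(0.0, source)]                  # (dist, node) entries
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue                        # stale entry, a shorter one was popped earlier
        visited.add(node)
        dist[node] = d
        for nbr, w in adj[node]:
            if nbr not in visited:
                heapq.heappush(heap, (d + w, nbr))
    return dist


# Hypothetical three-node network.
adj = {1: [(2, 4.0), (3, 10.0)],
       2: [(1, 4.0), (3, 3.0)],
       3: [(1, 10.0), (2, 3.0)]}
dist_from_1 = dijkstra(adj, 1)
```

Entries are never updated in place; duplicates are pushed and discarded when popped, which keeps the code short at the cost of a somewhat larger heap.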
8
Dijkstra's algorithm
Network expansion from node 1 to all nodes
9
Properties
  • Network distance is a metric
  • Satisfies the triangle inequality
  • d(x, z) ≤ d(x, y) + d(y, z) (min. dist)
  • With a dist. matrix for all pairs of nodes, dist.
    between points is found in O(1) time
  • Only need to calculate all-pairs dist. of nodes
    which are incident to some edge containing points
    (let V' be this set of nodes)
  • Time complexity discussed later
  • Space for dist. matrix: O(|V'|²) instead of
    O(|V|²)
  • Possible that |V'| << |V|

10
Properties
  • Planar graphs
  • Graph drawn in plane with no edge crossings
  • Most real-life road networks are planar
  • |E| ≤ 3|V| - 6 (by Euler's formula)
  • Time complexity
  • Dijkstra's algorithm
  • O(|E| log|E|) (i.e. O(|V| log|V|) for planar
    graphs)
  • Prev. node dist. matrix (for planar graph)
  • Apply Dijkstra's algorithm on each node in V'
  • O(|V'||V| log|V|) (at most O(|V|² log|V|))

11
Related work
  • K-medoid algorithms: PAM, CLARA, CLARANS
  • Density-based algorithm: DBSCAN
  • C2P
  • Finding NN is efficient in spatial DB but not in
    our model
  • e.g. expensive to find the NN for an outlier
  • Classical graph partitioning/clustering
    algorithms
  • Single link, Complete link
  • Too expensive
  • Minimum spanning tree, then remove long edges
  • Sensitive to outliers

12
Related work
  • Networking field (NTC,CCAM)
  • Partition nodes to facilitate efficient query
    processing in terms of I/O cost
  • Do not consider points/objects lying on edges
  • Spatial Network Databases (SNDB)
  • Only discuss simple query processing such as
    point query, range query, ...
  • Do not discuss clustering

13
Transformation for graph alg.
  • Our interest is in the points but not the nodes
  • Need a transformation to apply graph clustering alg.
  • Transformed graph may not be planar
  • More complex model ⇒ more expensive clustering

14
Conventional clustering alg.
  • Many clustering alg. can be applied
  • Disadvantage: they require an N × N distance matrix
    between every pair of objects
  • Expensive to compute
  • Too large to fit in memory, or even on disk,
    for moderate N
  • Better methods
  • Our approach: cluster the points and compute the
    distances at the same time
  • By integrating Dijkstra's algorithm with the
    clustering algorithm(s)
  • Should have similar complexity to Dijkstra's
    algorithm

15
Disk-based representation
  • Point group
  • Avoid storing the edge redundantly
  • Group of points on the same edge
  • Fixed fields: edge, # of pts
  • Variable fields: ref. pos. of each pt.
  • Adjacency list
  • Fixed fields: size of adj. list
  • Variable fields: adjacent node ID, edge weight,
    pointer to its point group
  • Btree index
  • Efficient search of adjacency lists and points

16
Disk-based representation
17
K-medoid
  • Frequent operation: finding the nearest medoid
    for each point p
  • p on edge (ni, nj); Mu, Mv are the medoids nearest
    to ni, nj respectively
  • With node distance matrix, above comparison can
    be done in O(2k) time (Avg O(1) time ?)
  • Node dist. matrix still expensive (to compute and
    store) for large networks
  • Need to use more efficient method
  • Concurrent expansion from the medoids
  • Incremental replacement of node distances
  • Medoid pool

18
K-med (concurrent expansion)
  • Network expansion (Dijkstra's algorithm) for each
    medoid is too expensive
  • Apply it concurrently for all k medoids
  • By sharing the same heap/priority queue

19
K-med (concurrent expansion)
20
K-med (incremental replacement)
  • Only a few medoids (e.g. 1 medoid) are changed in
    the next (k-medoid) iteration
  • Reuse result of previous iteration
  • Drill hole(s)
  • Expand from border of the hole(s) and all medoids

21
K-med (incremental replacement)
ChgC: set of medoid indices to be changed (cluster
IDs in [1, k]). Generalizes to changing an
arbitrary number of medoids. Common case: |ChgC| = 1
22
K-med (Medoid Pool)
  • Choose A·k points as the medoid pool
  • A trade-off between quality and speed
  • Medoids can only be chosen from the pool
  • Apply network expansion for each pool candidate
  • Store in the A·k × |V| dist. array (in memory/disk)
  • Find the nearest medoid for each node
  • More benefit for more iterations, and vice versa

23
Density based algorithms
  • Disadvantages of k-medoid
  • Convergence of k-medoid depends on the choice of
    medoid
  • May take many iterations
  • Cannot identify outliers easily
  • Density based clustering methods
  • Easily identifies outliers
  • Fast
  • obtain the optimal solution (by its density
    definition)
  • do not need to improve result by iterations
  • ε-Link (new), DBSCAN

24
ε-Link
  • One parameter: ε
  • Points within distance ε are in the same cluster
  • Method (similar to network expansion)
  • Only one heap is used
  • Store NNdist of each node to a clustered point
    (in current cluster)
  • Update NNdist of the nodes
  • Whenever NNdist(nz) is updated and NNdist(nz) ≤ ε,
    expand from that node, i.e. enqueue the entry
    (nz, NNdist(nz))
  • Result is a special case of DBSCAN
  • No need to perform expensive range search
  • Cheap clustering cost
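
The expansion can be sketched in Python under a simplifying assumption: the points have already been inserted into the graph as ordinary nodes (splitting the edges they lie on), so `adj` covers both nodes and points. This is not the slides' disk-based layout, and the function name `eps_link` and the tiny network are illustrative only.

```python
import heapq
import math


def eps_link(adj, points, eps):
    """Group points so that points within network distance eps share a cluster.

    adj maps a vertex to a list of (neighbor, weight) pairs; points lists the
    vertices that are data points. Returns {point: cluster id}.
    """
    point_set = set(points)
    cluster_of = {}
    next_id = 0
    for seed in points:
        if seed in cluster_of:
            continue
        cluster_of[seed] = next_id
        nn_dist = {seed: 0.0}        # distance to nearest point of this cluster
        heap = [(0.0, seed)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > nn_dist.get(v, math.inf):
                continue             # stale entry
            if v in point_set and v not in cluster_of:
                cluster_of[v] = next_id
                d = 0.0              # a newly clustered point restarts the budget
                nn_dist[v] = 0.0
            for nbr, w in adj[v]:
                nd = d + w
                if nd <= eps and nd < nn_dist.get(nbr, math.inf):
                    nn_dist[nbr] = nd
                    heapq.heappush(heap, (nd, nbr))
        next_id += 1
    return cluster_of


def undirected(edges):
    """Build a symmetric adjacency list from (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


# Hypothetical chain: p1 -1- p2 -3- n2 -3- p3 -1.5- p4
adj = undirected([("p1", "p2", 1.0), ("p2", "n2", 3.0),
                  ("n2", "p3", 3.0), ("p3", "p4", 1.5)])
tight = eps_link(adj, ["p1", "p2", "p3", "p4"], eps=2.0)
loose = eps_link(adj, ["p1", "p2", "p3", "p4"], eps=6.0)
```

No range search is issued per point; the expansion simply stops wherever the distance to the nearest clustered point would exceed ε.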

25
DBSCAN
  • The DBSCAN algorithm is adapted to our model
  • Two parameters: MinPts and ε
  • Only need to discuss the range search with
    distance ε
  • A straightforward range search may introduce
    duplicates, which affect the count of ε-neighbors
  • A simple check can avoid (not eliminate) these
    duplicates, but we do not discuss it here

26
Comparison of the techniques
  • K-medoid (common point assignment, heap space)
  • Dijkstra's algorithm for each medoid
  • Time: O(k|V| log|V|) per iteration
  • Space: O(k|V|) for storing node dist. (label) to
    medoids
  • Concurrent Expansion
  • Time: O(|V| log|V|) per iteration
  • Space: O(|V|) for storing node dist. to medoids
  • Incremental Medoid Replacement
  • Better average case, worst case same as
    concurrent expansion
  • Space: O(|V|) for storing node dist. to medoids
  • Medoid Pool
  • Time: O(A·k·|V| log|V|) for overhead, O(|V|) per
    iteration
  • Space: O(A·k·|V|) for pool dist., O(|V|) for node
    dist.

Note: complexities are for planar (or sparse)
graphs only!
27
Comparison of the techniques
  • Density based methods
  • ε-Link
  • Time: O(|V| log|V|) for traversing the network
  • Better average case, only traverses the part of
    the network containing points
  • Space: O(|V|) for storing node dist. to medoids
  • DBSCAN
  • Time: total O(N · range-search(ε))
  • Space: only space for the ε range search, much
    smaller than O(|V|)

28
Applications in wider domain
  • Edge weight interpretation
  • Different users can interpret edge weights in
    different ways
  • Set edge weight as the edge's traveling time
  • Obtain clusters based on traveling time (t)
  • Set edge weight as the edge's traveling cost
  • Obtain clusters based on traveling cost ($)
  • Set edge weight as ...
  • Obtain clusters based on ...

29
Applications in wider domain
  • Combination of networks
  • Useful for discovering clusters across different
    networks
  • Modeling methods
  • Transition node
  • Node with (at least) an edge to another network
  • Transition edge weights
  • Assigned as the cost of transition
  • The combined network may not be planar but it is
    still a sparse graph with low average degree
  • Still efficient for performing clustering
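
A brief Python sketch of the modeling step, assuming each network is given as a dictionary of weighted edges and node IDs are made unique by pairing them with the network name. The network names, node IDs and costs below are hypothetical.

```python
def combine_networks(networks, transitions):
    """Merge several networks into one adjacency list.

    networks maps a network name to {(u, v): weight}; transitions is a list
    of ((net_a, u), (net_b, v), cost) triples linking transition nodes.
    Combined node IDs are (network name, original node ID) pairs.
    """
    adj = {}

    def add_edge(a, b, w):
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))

    for name, edges in networks.items():
        for (u, v), w in edges.items():
            add_edge((name, u), (name, v), w)
    for a, b, cost in transitions:
        add_edge(a, b, cost)         # transition edge weighted by its cost
    return adj


# Hypothetical road and rail networks joined at one station.
networks = {"road": {(1, 2): 5.0}, "rail": {(1, 2): 3.0}}
transitions = [(("road", 2), ("rail", 1), 1.5)]
combined = combine_networks(networks, transitions)
```

The result is an ordinary weighted graph, so the same expansion-based algorithms can run on it unchanged.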

30
Preliminary results
  • Real network
  • Main roads in North America
  • 175,813 nodes and 179,179 edges
  • Generate synthetic clusters in the network using
    expansion
  • N: number of points
  • k: number of clusters
  • |V|: number of nodes
  • Default: N = 100K, k = 10, |V| = 100K

31
Preliminary results
[Figure: elements popped from the heap as a function of k, |V| = 100K; y-axis: heap pop size, x-axis: k]
32
Preliminary results
[Figure: cost of k-med (CE) as a function of N (fixed |V| = 100)]
[Figure: cost of k-med (CE) as a function of |V| (fixed N = 1K)]
[Figure: cost of different algorithms as a function of |V| (fixed N = 100K)]
[Figure: cost of density-based methods as a function of ε (fixed N = 100K)]
33
Future Work
  • Avoid scanning points many times
  • Cluster the weighted nodes instead of points
  • Scan points once and store summary statistics
    (pt. count, seg. dist.) in each node
  • Space: O(|V|), acceptable
  • For each edge e = (ni, nj)
  • Count the # of points with pos ∈ [0, ½W(e))
  • Increment the point count of the node ni
  • Count the # of points with pos ∈ [½W(e), W(e)]
  • Increment the point count of the node nj
  • Increment the segment distance of both nodes ni
    and nj by W(e)/2
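
The single scan described above could look like this in Python. The input layout (`edges` as `{(ni, nj): W(e)}` and points as `(ni, nj, pos)` triplets) is an assumption, and the numbers are made up.

```python
from collections import defaultdict


def summarize(edges, points):
    """One pass over edges and points, producing per-node summary statistics.

    Returns (point_count, seg_dist): each point is credited to the nearer
    end node of its edge, and each end node receives half the edge length.
    """
    point_count = defaultdict(int)
    seg_dist = defaultdict(float)
    for (ni, nj), w in edges.items():
        seg_dist[ni] += w / 2
        seg_dist[nj] += w / 2
    for ni, nj, pos in points:
        if pos < edges[(ni, nj)] / 2:    # pos in [0, W(e)/2)
            point_count[ni] += 1
        else:                            # pos in [W(e)/2, W(e)]
            point_count[nj] += 1
    return dict(point_count), dict(seg_dist)


# Hypothetical two-edge network with four points.
edges = {(1, 2): 4.0, (2, 3): 2.0}
points = [(1, 2, 1.0), (1, 2, 2.0), (1, 2, 3.5), (2, 3, 0.5)]
counts, segs = summarize(edges, points)
```

After this pass the points themselves are no longer needed; the two tables take O(|V|) space.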

34
Future Work
  • k-medoid
  • CE can still be applied
  • Only put nodes with non-empty point counts in
    clusters
  • Density based methods
  • The density for the region near node ni is
    defined as
  • Point count of ni / Segment distance of ni
  • Adapt the definitions for facilitating the
    discovery of these clusters

35
Conclusion
  • A new clustering problem formulated
  • Propose a disk based representation for large
    datasets
  • Reduce or amortize the effort of finding shortest
    distance
  • Propose three clustering algorithms
  • Efficient and scalable
  • Applicable for some other related problems

36
References
  • 1) E. W. Dijkstra. A note on two problems in
    connection with graphs. Numerische Mathematik,
    1:269-271, 1959.
  • 2) A. K. Jain and R. C. Dubes. Algorithms for
    Clustering Data. Prentice Hall, 1988.
  • 3) M. Ester, H. P. Kriegel, J. Sander, and X.
    Xu. A density-based algorithm for discovering
    clusters in large spatial databases with noise.
    In ACM SIGKDD, 1996.
  • 4) A. Nanopoulos, Y. Theodoridis, and Y.
    Manolopoulos. C2P: Clustering based on closest
    pairs. In VLDB, 2001.
  • 5) D. Papadias, J. Zhang, N. Mamoulis, and Y.
    Tao. Query processing in spatial network
    databases. In VLDB, 2003.

37
References
  • 6) S. Shekhar and D. Liu. CCAM: A connectivity
    clustered access method for networks and network
    computations. IEEE Transactions on Knowledge and
    Data Engineering, 9(1):102-119, 1997.
  • 7) S. H. Woo and S. B. Yang. An improved network
    clustering method for I/O-efficient query
    processing. In ACM GIS, 2000.