Efficient and Effective Clustering Methods for Spatial Data Mining

1
Efficient and Effective Clustering Methods for Spatial Data Mining
  • Raymond T. Ng, Jiawei Han
  • Pavan Podila
  • COSC 6341, Fall 04

2
Overview
  • Spatial Data Mining
  • Clustering techniques
  • CLARANS
  • Spatial and Non-Spatial dominant CLARANS
  • Observations
  • Summary

4
Spatial Data Mining
  • Identifying interesting relationships and
    characteristics that may exist implicitly in
    spatial databases
  • Spatial databases differ from relational databases:
  • Spatial objects store both spatial and
    non-spatial attributes
  • Queries are spatial (e.g., all Walmart stores within
    10 miles of UH)
  • Spatial joins work on spatial indexes (e.g., R-trees)
  • Huge sizes (terabytes)
  • GIS is a classic example

6
Partitioning Methods
  • Given K, the number of partitions to create, a
    partitioning method constructs initial
    partitions. It then iteratively refines the quality
    of these clusters so as to maximize intra-cluster
    similarity and inter-cluster dissimilarity.
  • Quality of clustering: average dissimilarity of
    objects from their cluster centers (medoids)
  • Selected algorithms: K-medoids, PAM, CLARA, CLARANS
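
The quality measure above can be sketched in a few lines. This is only an illustration: it assumes Euclidean distance as the dissimilarity, and the points and the function name `average_dissimilarity` are hypothetical.

```python
import math

def average_dissimilarity(points, medoids):
    """Average distance from each point to its nearest medoid (lower is better)."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = average_dissimilarity(points, [(0, 0), (10, 10)])  # one medoid per group
bad = average_dissimilarity(points, [(0, 0), (0, 1)])     # both medoids in one group
print(good < bad)
```

A partitioning method would search for the medoid set that makes this value as small as possible.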

7
K-Medoids
  • Partition-based clustering (K partitions)
  • Why is it effective?
  • Resistant to outliers
  • Does not depend on the order in which data points
    are examined
  • The cluster center (medoid) is an actual object in the
    data set, unlike k-means, where the cluster center is
    the mean (center of gravity) of the cluster
  • Experiments show that large data sets are handled
    efficiently
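
To illustrate the outlier point, the sketch below compares a k-means style center (the mean) with a medoid on a single cluster containing one extreme value. Euclidean distance and the sample coordinates are assumptions made for the example.

```python
import math

def mean_center(points):
    """Center of gravity of the points (the kind of center k-means uses)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster: four nearby points plus one outlier.
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

print(mean_center(cluster))  # pulled far away from the four nearby points
print(medoid(cluster))       # stays on one of the nearby points
```

The mean is not a member of the data set and is dragged toward the outlier, while the medoid remains an actual object near the bulk of the cluster.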

[Figure: cluster centers under K-medoids vs. K-means]

8
PAM (Partitioning Around Medoids)
  • Goal: find K representative objects of the data
    set. Each of the K objects is called a medoid,
    the most centrally located object within a
    cluster.

9
PAM (2)
  • Start with K data points designated as medoids.
    Form a cluster around each medoid by assigning every
    data point to its closest medoid: $O_j$ belongs to
    the cluster of $O_i$ if
    $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$,
    where the minimum is taken over all medoids $O_e$
  • Iteratively replace a medoid $O_i$ with a non-medoid
    $O_h$ if the quality of clustering improves.
  • Swapping cost $C_{ijh}$: the cost associated with
    replacing a selected object $O_i$ with a non-selected
    object $O_h$
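
The assignment rule and the swap step can be sketched as follows. This is a simplified illustration, not the bookkeeping from the paper: instead of accumulating per-object swap costs, it recomputes the total cost before and after each candidate swap. Euclidean distance, the function names, and the sample points are assumptions.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """Put every point into the cluster of its closest medoid."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: math.dist(p, m))].append(p)
    return clusters

def swap_cost(points, medoids, o_i, o_h):
    """Change in total cost if medoid o_i is replaced by non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

def pam(points, medoids):
    """Repeat the best improving swap until no swap lowers the cost."""
    medoids = list(medoids)
    while True:
        candidates = [(swap_cost(points, medoids, i, h), i, h)
                      for i in medoids for h in points if h not in medoids]
        best, o_i, o_h = min(candidates)
        if best >= 0:  # no swap improves the clustering
            return medoids
        medoids[medoids.index(o_i)] = o_h

# Hypothetical data; the initial medoids are deliberately poor.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = pam(points, [(0, 0), (0, 1)])
print(result, assign(points, result))
```

Each pass evaluates every (medoid, non-medoid) pair, which is where the per-iteration cost of PAM comes from.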

10
PAM (3)
  • $O(k(n-k)^2)$ for each iteration
  • Good for small data sets (e.g., n = 100, k = 5)
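
As a rough feel for how this bound grows, the snippet below evaluates k(n-k)^2 for a small and a larger data set. It ignores constant factors, and the helper name is hypothetical.

```python
def pam_iteration_work(n, k):
    """k(n-k) candidate swaps, each examining about (n-k) non-medoid objects."""
    return k * (n - k) ** 2

small = pam_iteration_work(100, 5)    # 45,125
large = pam_iteration_work(1000, 10)  # 9,801,000
print(small, large, large / small)
```

Going from 100 to 1000 objects multiplies the per-iteration work by more than two hundred, which is why PAM is practical only for small data sets.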

11
CLARA (Clustering LARge Applications)
  • Improvement over PAM
  • Finds medoids in a sample from the dataset
  • Idea: if the samples are sufficiently random,
    the medoids of the sample approximate the medoids
    of the dataset
  • Heuristic: 5 samples of size 40 + 2k give
    satisfactory results
  • Works well for larger datasets (e.g., n = 1000, k = 10)
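
A minimal sketch of the sampling idea, under several assumptions: Euclidean distance, a simplified PAM-style swap search that accepts the first improving swap, a fixed random seed, and a hypothetical grid data set. Only the "5 samples of size 40 + 2k" heuristic comes from the slide.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam_like(points, k):
    """Simplified swap search: accept any swap that lowers the cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)
    best_cost, best_medoids = math.inf, None
    for _ in range(num_samples):
        sample = rng.sample(points, size)
        medoids = pam_like(sample, k)        # cluster the sample only
        cost = total_cost(points, medoids)   # but score on the full data set
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids

# Hypothetical data: two 10x10 grids of points far apart from each other.
data = [(x, y) for x in range(10) for y in range(10)]
data += [(x + 100, y + 100) for x in range(10) for y in range(10)]
print(clara(data, 2))
```

The expensive swap search only ever sees 40 + 2k objects, while each resulting medoid set is still evaluated against the whole data set.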

13
CLARANS (Clustering Large Applications based on RANdomized Search)
  • A graph abstraction, $G_{n,k}$
  • Each node is a set of k medoids
  • Two nodes $S_1$ and $S_2$ are neighbors if
    $|S_1 \cap S_2| = k - 1$
  • Each node has k(n-k) neighbors
  • Cost of each node is the total dissimilarity of
    objects to their medoids
  • PAM searches the whole graph
  • CLARA searches a subgraph
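
The neighbor relation can be checked directly on a tiny instance. In this sketch the objects are just the integers 0..5 and the function name is hypothetical; a node is represented as a frozenset of k objects.

```python
def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    return [(node - {out}) | {inn}
            for out in node
            for inn in objects if inn not in node]

objects = range(6)   # n = 6
node = {0, 1}        # k = 2
nbrs = neighbors(node, objects)
print(len(nbrs))     # k(n-k) = 2 * 4 = 8
```

Every neighbor shares exactly k - 1 medoids with the original node, matching the definition above.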

14
CLARANS (2)
  • Experimental values:
  • numLocal = 2
  • maxNeighbors = max(1.25% of k(n-k), 250)
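
A minimal sketch of the randomized search, assuming Euclidean distance, distinct points, and k < n. The parameter names `num_local` and `max_neighbors` mirror the slide; everything else (function names, seed, sample data) is hypothetical, and the total cost is recomputed for each neighbor rather than updated incrementally.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    rng = random.Random(seed)
    n = len(points)
    if max_neighbors is None:
        # Slide heuristic: the larger of 1.25% of k(n-k) and 250.
        max_neighbors = max(int(0.0125 * k * (n - k)), 250)
    best, best_cost = None, math.inf
    for _ in range(num_local):                 # independent local searches
        current = rng.sample(points, k)        # arbitrary starting node
        current_cost = total_cost(points, current)
        examined = 0
        while examined < max_neighbors:
            # Random neighbor: swap one medoid for one non-medoid.
            i = rng.randrange(k)
            h = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [h] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:            # move and restart the count
                current, current_cost, examined = neighbor, cost, 0
            else:
                examined += 1
        if current_cost < best_cost:           # keep the best local minimum
            best, best_cost = current, current_cost
    return best, best_cost

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = clarans(points, 2)
print(sorted(medoids), cost)
```

Unlike PAM, only a random subset of neighbors is examined at each node, and unlike CLARA the search is not confined to a fixed sample of the data.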

15
CLARANS (3)
  • Outperforms PAM and CLARA in terms of running
    time and quality of clustering
  • $O(n^2)$ for each iteration

[Figures: CLARANS vs. CLARA; CLARANS vs. PAM]

17
Generalization
  • Useful for mining non-spatial attributes
  • Process of merging tuples based on a concept
    hierarchy
  • DBLearn: takes an SQL query, generalization
    hierarchies, and a threshold as input
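
The merging idea can be sketched for a single attribute. This is only loosely in the spirit of DBLearn and is not its actual interface: the concept hierarchy, the attribute values, and the function name are hypothetical, and all values are climbed one level at a time until the number of distinct values is within the threshold.

```python
from collections import Counter

# Hypothetical concept hierarchy: child value -> parent concept.
HIERARCHY = {
    "mansion": "single-family",
    "bungalow": "single-family",
    "condo": "multi-unit",
    "apartment": "multi-unit",
    "single-family": "residential",
    "multi-unit": "residential",
}

def generalize(values, hierarchy, threshold):
    """Climb the hierarchy until at most `threshold` distinct values remain,
    then merge identical values and keep their counts."""
    values = list(values)
    while len(set(values)) > threshold:
        climbed = [hierarchy.get(v, v) for v in values]
        if climbed == values:  # top of the hierarchy reached
            break
        values = climbed
    return Counter(values)

houses = ["mansion", "bungalow", "condo", "apartment", "condo"]
print(generalize(houses, HIERARCHY, threshold=2))
print(generalize(houses, HIERARCHY, threshold=1))
```

A smaller threshold forces more climbing and therefore coarser, more heavily merged tuples.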

18
Silhouette
  • Silhouette of object $O_j$: determines how much $O_j$
    belongs to its cluster. It lies between -1 and 1;
    1 indicates a high degree of membership
  • Silhouette width of a cluster: average silhouette
    of all objects in the cluster
  • Silhouette coefficient: average of the silhouette
    widths of the k clusters
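
The three quantities can be sketched as below. The slide does not spell out the formula for an object's silhouette, so this uses the standard definition s = (b - a) / max(a, b), where a is the average distance to the other objects in the same cluster and b is the smallest average distance to any other cluster. Euclidean distance, distinct points, and the sample clusters are assumptions.

```python
import math

def avg_dist(obj, group):
    return sum(math.dist(obj, p) for p in group) / len(group)

def silhouette(obj, own, others):
    """Silhouette of one object; 0 by convention for a singleton cluster."""
    rest = [p for p in own if p != obj]
    if not rest:
        return 0.0
    a = avg_dist(obj, rest)
    b = min(avg_dist(obj, other) for other in others)
    return (b - a) / max(a, b)

def silhouette_widths(clusters):
    """Average silhouette of the objects in each cluster."""
    widths = []
    for idx, cluster in enumerate(clusters):
        others = clusters[:idx] + clusters[idx + 1:]
        widths.append(sum(silhouette(p, cluster, others) for p in cluster)
                      / len(cluster))
    return widths

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of the clusters."""
    widths = silhouette_widths(clusters)
    return sum(widths) / len(widths)

# Hypothetical clustering: two compact, well-separated groups.
clusters = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
print(silhouette_coefficient(clusters))
```

For compact, well-separated groups like these the coefficient comes out close to 1.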

Silhouette width    Interpretation
0.71–1.0            Strong cluster
0.51–0.7            Reasonable cluster
0.26–0.5            Weak or artificial cluster
≤ 0.25              No cluster found

19
SD and NSD approach
  • SD: Spatial Dominant
  • NSD: Non-Spatial Dominant
  • Clustering is used for spatial attributes,
    generalization for non-spatial attributes
  • Dominance is decided by which step is carried out
    first (clustering or generalization)
  • The second phase works on the tuples produced by the
    first phase

20
SD(CLARANS)
  • Finds non-spatial generalizations from spatial
    clustering
  • The value of $K_{nat}$ (the most natural number of
    clusters) is determined heuristically using the
    silhouette coefficients
  • The clustering phase can be treated as finding a
    spatial generalization hierarchy

21
NSD(CLARANS)
  • Finds spatial clusters from non-spatial
    generalizations
  • Clusters may overlap

23
Observations
  • In all previous methods, quality of mining
    depends on the SQL query
  • CLARANS assumes that the entire dataset is in
    memory, which is not always the case for large data sets.
  • Quality of results cannot be guaranteed when n is
    very large, because of the randomized search

24
Observations (2)
  • Other clustering algorithms proposed for Spatial
    Data Mining
  • Hierarchical: BIRCH
  • Density-based: DBSCAN, GDBSCAN, DBRS
  • Grid-based: STING

25
Summary
  • A seminal paper on use of clustering for spatial
    data mining
  • CLARANS is an effective clustering technique for
    large datasets
  • SD(CLARANS)/NSD(CLARANS) are effective spatial
    data mining algorithms

26
References
  • Primary:
  • Efficient and Effective Clustering Methods for
    Spatial Data Mining (1994) - Raymond T. Ng,
    Jiawei Han
  • Secondary:
  • CLARANS: A Method for Clustering Objects for
    Spatial Data Mining - Raymond T. Ng, Jiawei Han
  • Clustering for Mining in Large Spatial Databases
    - Martin Ester, Hans-Peter Kriegel, Jörg Sander,
    Xiaowei Xu
  • An Introduction to Spatial Database Systems -
    Ralf Hartmut Güting