Efficient and Effective Clustering Methods for Spatial Data Mining - PowerPoint PPT Presentation

Learn more at: https://www2.cs.uh.edu

1
Efficient and Effective Clustering Methods for
Spatial Data Mining

Raymond T. Ng, Jiawei Han
Pavan Podila
COSC 6341, Fall 04

2
Overview

Spatial Data Mining
Clustering techniques
CLARANS
Spatial and Non-Spatial dominant CLARANS
Observations
Summary

4
Spatial Data Mining

Identifying interesting relationships and
characteristics that may exist implicitly in
spatial databases
Different from relational databases
Spatial objects store both spatial and
non-spatial attributes
Queries (e.g., all Walmart stores within 10 miles of
UH)
Spatial joins work on spatial indexes (e.g., R-trees)
Huge sizes (terabytes)
GIS is a classic example

6
Partitioning Methods

Given K, the number of partitions to create, a
partitioning method constructs initial
partitions. It then iteratively refines the quality
of these clusters so as to maximize intra-cluster
similarity and inter-cluster dissimilarity.
Quality of clustering: average dissimilarity of
objects from their cluster centers (medoids)
Selected algorithms:
K-medoids
PAM
CLARA
CLARANS
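
The quality measure named on this slide can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance as the dissimilarity and 2-D tuples as objects; the function name `average_dissimilarity` and the sample points are made up for the example.

```python
import math

def average_dissimilarity(points, medoids):
    """Mean distance from each point to its nearest medoid (lower is better)."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)

# Two obvious groups: one around x = 0, one around x = 10.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

good = average_dissimilarity(points, [(0, 0), (10, 0)])  # one medoid per group
bad = average_dissimilarity(points, [(0, 0), (0, 2)])    # both medoids in one group
```

Placing one medoid in each group gives the lower average dissimilarity, which is what a partitioning method tries to reach by refinement.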

7
K-Medoids

Partition based clustering (K partitions)
Effective, why?
Resistant to outliers
Does not depend on the order in which data points are
examined
Cluster center is part of the dataset, unlike k-means,
where the cluster center is the center of gravity (mean)
Experiments show that large data sets are handled
efficiently

[Figure panels: K-medoids, K-means]

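
The outlier argument can be illustrated with a small sketch. It assumes Euclidean distance and a single hand-made cluster with one outlier; `centroid` and `medoid` are illustrative helper names, not taken from the paper.

```python
import math

def centroid(points):
    """Center of gravity (what k-means uses); usually not a data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Four nearby points plus one far outlier.
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

center = centroid(cluster)  # dragged toward the outlier
rep = medoid(cluster)       # stays inside the dense group
```

The mean is pulled far away from the dense group by the single outlier, while the medoid remains one of the four nearby points.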
8
PAM (Partitioning Around Medoids)

Goal: find K representative objects of the data
set. Each of the K objects is called a medoid,
the most centrally located object within a
cluster.

9
PAM (2)

Start with K data points designated as medoids.
Create a cluster around each medoid by assigning every data
point to its closest medoid: Oj belongs to the cluster of Oi
if d(Oj, Oi) = min_{Oe} d(Oj, Oe), where Oe ranges over the medoids
Iteratively replace Oi with Oh if the quality of
clustering improves.
A swapping cost, Cijh, is associated with replacing a
selected object Oi with a non-selected object Oh
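
The steps on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes Euclidean distance, starts from the first K points, and recomputes the full clustering cost for every candidate swap instead of accumulating the per-object costs Cijh incrementally. The names `total_cost`, `assign` and `pam` are chosen for the example.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """Oj joins the cluster of the medoid Oi that minimizes d(Oj, Oi)."""
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return clusters

def pam(points, k):
    medoids = list(points[:k])  # arbitrary initial medoids (assumption)
    while True:
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        for i in range(k):                      # selected object Oi
            for o_h in points:                  # non-selected object Oh
                if o_h in medoids:
                    continue
                candidate = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_cost(points, candidate) - current  # total swap cost
                if delta < best_delta:
                    best_delta, best_swap = delta, candidate
        if best_swap is None:                   # no swap improves the clustering
            return medoids
        medoids = best_swap

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
medoids = pam(points, 2)
clusters = assign(points, medoids)
```

Each pass applies the single best improving swap, so the total cost strictly decreases until no swap helps.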

10
PAM (3)

O(k(n-k)^2) for each iteration
Good for small data sets (n = 100, k = 5)
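
To get a feel for the per-iteration bound, the expression k(n-k)^2 can simply be evaluated. The snippet below is only arithmetic on that formula; the second setting (n = 1000, k = 10) is an illustrative larger case, and the function name is made up.

```python
def pam_iteration_cost(n, k):
    """Evaluate k * (n - k)^2, the per-iteration bound quoted for PAM."""
    return k * (n - k) ** 2

small = pam_iteration_cost(100, 5)    # the small setting from the slide
large = pam_iteration_cost(1000, 10)  # an illustrative larger setting

growth = large / small                # how much more work the larger case needs
```

Going from 100 to 1000 objects multiplies the work per iteration by more than 200, which is why PAM is recommended only for small data sets.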

11
CLARA (Clustering LARge Applications)

Improvement over PAM
Finds medoids in a sample from the dataset
Idea: if the samples are sufficiently random,
the medoids of the sample approximate the medoids
of the dataset
Heuristics: 5 samples of size 40 + 2k give
satisfactory results
Works well for large datasets (n = 1000, k = 10)
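
A rough sketch of the sampling idea follows. It assumes Euclidean distance and uses a simplified first-improvement variant of PAM on each sample; the function names, the seed and the synthetic grid data are assumptions made for the example. The medoids found on each sample are scored against the whole data set, and the best set is kept.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Simplified PAM: accept any improving swap until none is left."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                candidate = medoids[:i] + [o_h] + medoids[i + 1:]
                if cost(points, candidate) < cost(points, medoids):
                    medoids, improved = candidate, True
    return medoids

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)   # sample size heuristic from the slide
    best, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k)          # cluster the sample only
        c = cost(points, medoids)         # but score on the full data set
        if c < best_cost:
            best, best_cost = medoids, c
    return best, best_cost

# Two well-separated 10 x 10 grids of points (200 objects in total).
points = [(x, y) for x in range(10) for y in range(10)]
points += [(x + 100, y) for x in range(10) for y in range(10)]
best, best_cost = clara(points, 2)
```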

13
CLARANS (Clustering Large Applications based
on RANdomized Search)

A graph abstraction, G_{n,k}
Each vertex is a collection of k medoids
Two nodes S1 and S2 are neighbors if |S1 ∩ S2| = k - 1
Each node has k(n-k) neighbors
Cost of each node is the total dissimilarity of
objects to their medoids
PAM searches the whole graph
CLARA searches a subgraph
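
The graph abstraction can be checked on a toy instance. The sketch below enumerates all nodes for n = 6 objects and k = 2 medoids and builds each node's neighbors by replacing exactly one medoid; the integer object ids and the function name `neighbors` are assumptions for the example.

```python
from itertools import combinations

def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    return [(node - {m}) | {o} for m in node for o in outside]

n, k = 6, 2
objects = list(range(n))  # object ids stand in for data points
nodes = [frozenset(c) for c in combinations(objects, k)]

# Number of neighbors of every node; the slide states this is k(n-k).
degree = {s: len(neighbors(s, objects)) for s in nodes}
```

Even this tiny graph has 15 nodes with 8 neighbors each, which hints at why an exhaustive search over the whole graph does not scale.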

14
CLARANS (2)

Experimental values
numLocal = 2
maxNeighbors =
max(1.25% of k(n-k), 250)
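
A minimal sketch of the randomized search, using the two parameters from this slide, might look like the following. It assumes Euclidean distance, recomputes the full cost for each examined neighbor, and uses synthetic grid data; `num_local` and `max_neighbors` mirror numLocal and maxNeighbors, and the remaining names and the seed are illustrative.

```python
import math
import random

def cost(points, medoids):
    """Total dissimilarity of all objects to their nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    n = len(points)
    if n <= k:
        raise ValueError("need more objects than medoids")
    if max_neighbors is None:
        max_neighbors = max(int(0.0125 * k * (n - k)), 250)
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(num_local):                 # independent restarts
        current = rng.sample(points, k)        # arbitrary starting node
        current_cost = cost(points, current)
        examined = 0
        while examined < max_neighbors:
            i = rng.randrange(k)               # medoid to replace
            o_h = rng.choice(points)           # candidate replacement
            if o_h in current:
                continue                       # not a neighbor, draw again
            neighbor = current[:i] + [o_h] + current[i + 1:]
            neighbor_cost = cost(points, neighbor)
            if neighbor_cost < current_cost:   # move and reset the counter
                current, current_cost = neighbor, neighbor_cost
                examined = 0
            else:
                examined += 1
        if current_cost < best_cost:           # keep the best local minimum
            best, best_cost = current, current_cost
    return best, best_cost

# Two well-separated 5 x 5 grids of points (50 objects in total).
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x + 100, y) for x in range(5) for y in range(5)]
best, best_cost = clarans(points, 2)
```

Unlike CLARA, the search is never confined to a fixed sample: each step draws a random neighbor from the full graph, and a node is declared a local minimum only after `max_neighbors` consecutive draws fail to improve it.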

15
CLARANS (3)

Outperforms PAM and CLARA in terms of running
time and quality of clustering
O(n^2) for each iteration

[Figure panels: CLARANS vs CLARA, CLARANS vs PAM]

17
Generalization

Useful to mine non-spatial attributes
Process of merging tuples based on a concept
hierarchy
DBLearn: takes an SQL query, a generalization hierarchy and a threshold
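
Merging tuples along a concept hierarchy can be illustrated with a toy sketch. This is not DBLearn itself: it handles a single attribute, and the hierarchy, the attribute values and the function name `generalize` are hypothetical. Values are replaced by their parent concept, and identical results are merged with counts, until the number of distinct values is at most the threshold.

```python
from collections import Counter

# Hypothetical concept hierarchy for a "housing type" attribute: child -> parent.
HIERARCHY = {
    "condo": "multi-family",
    "apartment": "multi-family",
    "bungalow": "single-family",
    "mansion": "single-family",
    "multi-family": "residential",
    "single-family": "residential",
}

def generalize(values, hierarchy, threshold):
    """Climb the hierarchy level by level until few enough distinct values remain."""
    counts = Counter(values)
    while len(counts) > threshold:
        climbed = Counter()
        for value, count in counts.items():
            climbed[hierarchy.get(value, value)] += count  # merge under the parent
        if climbed == counts:   # already at the top of the hierarchy
            break
        counts = climbed
    return dict(counts)

values = ["condo", "apartment", "bungalow", "mansion", "condo"]
two_groups = generalize(values, HIERARCHY, threshold=2)
one_group = generalize(values, HIERARCHY, threshold=1)
```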

18
Silhouette

Silhouette of object Oj
determines how much Oj belongs to its cluster
Between -1 and 1
1 indicates high degree of membership
Silhouette width of cluster
Average silhouette of all objects in cluster
Silhouette coefficient
Average silhouette widths of k clusters

Silhouette width   Interpretation
0.71 - 1.00        Strong cluster
0.51 - 0.70        Reasonable cluster
0.26 - 0.50        Weak or artificial cluster
≤ 0.25             No cluster found

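
The silhouette quantities can be computed as sketched below. The slide does not spell out the formula, so this assumes the usual definition s = (b - a) / max(a, b), where a is the mean distance from the object to the other members of its cluster and b is the smallest mean distance to any other cluster. It also assumes Euclidean distance, distinct points, and a silhouette of 0 for singleton clusters; all names are illustrative.

```python
import math

def mean_dist(p, group):
    return sum(math.dist(p, q) for q in group) / len(group)

def silhouette(obj, own, others):
    """Silhouette of one object, a value between -1 and 1."""
    rest = [p for p in own if p != obj]
    if not rest:
        return 0.0                                 # singleton cluster (assumption)
    a = mean_dist(obj, rest)                       # cohesion within own cluster
    b = min(mean_dist(obj, c) for c in others)     # nearest other cluster
    return (b - a) / max(a, b)

def silhouette_width(cluster, others):
    """Average silhouette of all objects in one cluster."""
    return sum(silhouette(o, cluster, others) for o in cluster) / len(cluster)

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of the k clusters."""
    widths = [
        silhouette_width(c, clusters[:i] + clusters[i + 1:])
        for i, c in enumerate(clusters)
    ]
    return sum(widths) / len(widths)

good = silhouette_coefficient([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])
bad = silhouette_coefficient([[(0, 0), (10, 0)], [(0, 1), (10, 1)]])
```

The natural grouping lands in the strong-cluster band of the table, while the grouping that pairs distant points produces a negative coefficient.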
19
SD and NSD approach

SD: Spatial Dominant
NSD: Non-Spatial Dominant
Clustering for spatial attributes,
generalization for non-spatial attributes
Dominance is decided by which is carried out first
(clustering or generalization)
The second phase works on the tuples produced by the first phase

20
SD(CLARANS)

Finds non-spatial generalizations from spatial
clustering
The value of Knat, the most natural number of clusters,
is determined heuristically using the silhouette coefficients
Clustering phase can be treated as finding
spatial generalization hierarchy
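
One simple way to read the Knat heuristic is to cluster for several values of k and keep the k with the highest silhouette coefficient. The sketch below does only that and is a simplification of whatever refinements the paper applies. To stay short it assumes Euclidean distance, distinct points, and an exhaustive medoid search as a stand-in for CLARANS on tiny data; all names are illustrative.

```python
import math
from itertools import combinations

def best_medoids(points, k):
    """Exhaustive search over medoid sets; a stand-in for CLARANS on tiny data."""
    def cost(ms):
        return sum(min(math.dist(p, m) for m in ms) for p in points)
    return min(combinations(points, k), key=cost)

def assign(points, medoids):
    clusters = [[] for _ in medoids]
    for p in points:
        idx = min(range(len(medoids)), key=lambda i: math.dist(p, medoids[i]))
        clusters[idx].append(p)
    return clusters

def mean_dist(p, group):
    return sum(math.dist(p, q) for q in group) / len(group)

def silhouette_coefficient(clusters):
    widths = []
    for i, cluster in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        scores = []
        for p in cluster:
            rest = [q for q in cluster if q != p]
            if not rest:
                scores.append(0.0)             # singleton cluster (assumption)
                continue
            a = mean_dist(p, rest)
            b = min(mean_dist(p, o) for o in others)
            scores.append((b - a) / max(a, b))
        widths.append(sum(scores) / len(scores))
    return sum(widths) / len(widths)

def natural_k(points, k_values):
    """Return the k with the highest silhouette coefficient, plus all scores."""
    scored = {
        k: silhouette_coefficient(assign(points, best_medoids(points, k)))
        for k in k_values
    }
    return max(scored, key=scored.get), scored

# Three small groups of points, so k = 3 is the expected answer.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k_nat, scores = natural_k(points, [2, 3, 4])
```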

21
NSD(CLARANS)

Finds spatial clusters from non-spatial
generalizations
Clusters may overlap

23
Observations

In all previous methods, quality of mining
depends on the SQL query
CLARANS assumes that the entire dataset is in
memory. Not always the case for large data sets.
Quality of results cannot be guaranteed when N is
very large, because the search is randomized

24
Observations (2)

Other clustering algorithms proposed for Spatial
Data Mining
Hierarchical: BIRCH
Density based: DBSCAN, GDBSCAN, DBRS
Grid based: STING

25
Summary

A seminal paper on use of clustering for spatial
data mining
CLARANS is an effective clustering technique for
large datasets
SD(CLARANS)/NSD(CLARANS) are effective spatial
data mining algorithms

26
References

Primary
Efficient and Effective Clustering Methods for
Spatial Data Mining (1994) - Raymond T. Ng,
Jiawei Han
Secondary
CLARANS: A Method for Clustering Objects for
Spatial Data Mining - Raymond T. Ng, Jiawei Han
Clustering for Mining in Large Spatial Databases -
Martin Ester, Hans-Peter Kriegel, Jörg Sander,
Xiaowei Xu
An Introduction to Spatial Database Systems -
Ralf Hartmut Güting