Discovering Interesting Regions in Spatial Data Sets - PowerPoint PPT Presentation

About This Presentation

Title:

Discovering Interesting Regions in Spatial Data Sets

Description:

Representative-based Clustering Using Randomized Hill Climbing (CLEVER) ... Representative-based Clustering with Gabriel Graph Based Post-processing (MOSAIC) ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 26

Provided by: lindaj156

Learn more at: https://www2.cs.uh.edu

Category:

more less

Transcript and Presenter's Notes

Title: Discovering Interesting Regions in Spatial Data Sets

1
Discovering Interesting Regions inSpatial Data
Sets

Christoph F. Eick for Data Mining Class
Motivation Examples of Region Discovery
Region Discovery Framework
A Fitness For Hotspot Discovery
Other Fitness Functions
A Family of Clustering Algorithms for Region
Discovery
Summary

2
Discovering Interesting Regions inSpatial Data
Sets

Christoph F. Eick for Data Mining Class
Motivation Examples of Region Discovery
Region Discovery Framework
A Fitness For Hotspot Discovery
Other Fitness Functions
A Family of Clustering Algorithms for Region
Discovery
Summary

3
Next 2-3 Classes

Region Discovery Framework
DBSCAN
Hierarchical Clustering
Clustering Algorithms for Region Discovery
Clever,
Critical Issues with Respect to Clustering
Programming Project-specific Discussion
Similarity Assessment

4
1. Motivation Examples of Region Discovery
Application 1 Hot-spot Discovery this
presentation, EVJW07 Application 2 Regional
Association Rule Mining and Scoping DEWY06,
DEYWN07 Application 3 Find Interesting Regions
with respect to a Continuous Variable Application
4 Regional Co-location Mining EPWSN07 Applicati
on 5 Find representative regions (Sampling)
b1.01
RD-Algorithm
b1.04
Wells in Texas Green safe well with respect to
arsenic Red unsafe well
5
2. Region Discovery Framework

We assume we have spatial or spatio-temporal
datasets that have the following structure
(x,y,z,tltnon-spatial attributesgt)
e.g. (longitude, lattitude, class_variable)
or (longitude, lattitude, continous_variable)
Clustering occurs in the (x,y,z,t)-space
regions are found in this space.
The non-spatial attributes are used by the
fitness function but neither in distance
computations nor by the clustering algorithm
itself.
For the remainder of the talk, we view region
discovery as a clustering task and assume that
regions and clusters are the same

6
Region Discovery Framework Continued

The algorithms we currently investigate solve the
following problem
Given
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates clustering
Xc1,,ck as follows
q(X) ?c?X reward(c)?c?X interestingness(c)size(
c)? with bgt1
Objective
Find c1,,ck ? O such that
ci?cj? if i?j
Xc1,,ck maximizes q(X)
All cluster ci?X are contiguous (each pair of
objects belonging to ci has to be
delaunay-connected with respect to ci and to d)
c1?,,?ck ? O
c1,,ck are usually ranked based on the reward
each cluster receives, and low reward clusters
are frequently not reported

7
Challenges for Region Discovery

Recall and precision with respect to the
discovered regions should be high
Definition of measures of interestingness and of
corresponding parameterized reward-based fitness
functions that capture what domain experts find
interesting in spatial datasets
Detection of regions at different levels of
granularities (from very local to almost global
patterns)
Detection of regions of arbitrary shapes
Necessity to cope with very large datasets
Regions should be properly ranked by relevance
(reward) in many application only the top-k
regions are of interest
Design and implementation of clustering
algorithms that are suitable to address
challenges 1, 3, 4, 5 and 6.

8
3. Fitness Function for Hot Spot Discovery

Class of Interest Unsafe_Well
Prior Probability 20
?1 0.5, ?2 1.5
R 1, R- 1
ß 1.1, ?1.

10
30
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
c 50 200 200 350 200
P(c, Unsafe) 20/50 40 40/200 20 10/200 5 30/350 8.6 100/20050
Reward
9
4. Fitness Functions for Other Region Discovery
Tasks
4.1 Creating Contour Maps for Water Temperature
(Temp)
Fig. 1 Sea Surface Temperature on July 7 2002
Var2.2 Reward 48,5 Rank 3

Mean11.2
A single region and its summary

Examples in the data set WT have the form
(x,y,temp) var(c,temp) denotes the variance of
variable temp in region c
interestingness(c)
IF
var(c,temp)gtvar(WT,temp)
THEN 0
ELSE
min(1, log20(var(WT,temp)/var(c,temp)))?
with ? being a parameter (with default
1)
Basically, regions receive rewards if their
variance is lower than the variance of the
variable temperature for the whole data set, and
regions whose variance is at least 20 times less
receive the maximum reward of 1.

10
4.2 Finding Regions with High Water Temperature
Differences

Examples in the data set WT have the form
(x,y,Temp)
Fitness function Let c be a cluster to be
evaluated
interestingness(c)
IF var(c,temp)ltvar(WT,temp)
THEN 0
ELSE min(1, log20(var(c,temp)/var(WT,tem
p)))? )
with ? being a parameter (with default
1)

11
4.3 Programming Project Fitness Functions Purity
r1
r2
(6, 2, 2)
r3
(0, 0, 5)
(2,2,1)
We assume we have 3 classes in r1 we have 6
objects of class1, 3 objects of class 2, and 2
objects of class1
We assume th0.5 and ?2 i(r1)
(0.6-0.5)20.01 i(r2)(1-0.5)20.25 i(r3)0 q
(X)q(r1,r2,r3) 0.0110b 0.255b
12
Programming Project Fitness Functions Variance
r3 Var(r3)1100
r1 var(r1)80
O Var(O)100
r2 Var(r2)200
r4 Var(r4)20
We assume ?1 and b10 i(r1) 0 i(r2)log10(2)0.
3010 i(r3)1 i(r4)0
13
Programming Project Function MSE
r1
r2
(2,2) (4,4)
(-1,-1) (-7,-7) (-4,-4)
MSE(r1)(1212121212)/22 MSE(r2)(3
23232321200)/312
14
4.4 Regional Co-location Mining
R1
R2
Regional Co-location
R3
R4
Task Find Co-location patterns for the following
data-set.
Global Co-location and are
co-located in the whole dataset
15
A Reward Function for Binary Co-location

Task Find regions in which the density of 2 or
more classes is elevated. In general, multipliers
lC are computed for every region r, indicating
how much the density of instances of class C is
elevated in region r compared to Cs density in
the whole space, and the interestness of a region
with respect to two classes C1 and C2 is assessed
proportional to the product lC1lC2
Example Binary Co-Location Reward Framework
lC(r)p(C,r)/prior(C)
?C1,C2 1/((prior(C1)prior(C2)) maximum
multiplier
kC1,C2(r) IF lC1(r)lt1 or lC2(r )lt1 THEN 0
ELSE sqrt((lC1(r)1)(lC2(r)1))/(
?C1,C2 1)
interestingness(r) maxC1,C2C1?C2 (kC1,C2(c))

16
The Ultimate Vision of the Presented Research
DomainExpert
Spatial Databases
Measure ofInterestingness Acquisition Tool
Database Integration Tool
Fitness Function
Data Set
Family of Clustering Algorithms
Region DiscoveryDisplay
Ranked Set of Interesting Regions and their
Properties
Visualization Tools
Architecture Region Discovery Engine
17
How to Apply the Suggested Methodology

With the assistance of domain experts determine
structure of dataset to be used.
Acquire measure of interestingness for the
problem of hand (this was purity, variance, MSE,
probability elevation of two or more classes in
the examples discussed before)
Convert measure of interestingness into a
reward-based fitness function. The designed
fitness function should assign a reward of 0 to
boring regions. It is also a good idea to
normalize rewards by limiting the maximum reward
to 1.
After the region discovery algorithm has been
run, rank and visualize the top k regions with
respect to rewards obtained (interestingness(c)si
ze(c)?), and their properties which are usually
task specific.

18
5. A Family of Clustering Algorithms for Region
Discovery

Supervised Partitioning Around Medoids (SPAM).
Representative-based Clustering Using Randomized
Hill Climbing (CLEVER)
Supervised Clustering using Evolutionary
Computing (SCEC)
Agglomerative Hierarchical Supervised Clustering
(SCAH)
Hierarchical Grid-based Supervised Clustering
(SCHG)
Supervised Clustering using Multi-Resolution
Grids (SCMRG)
Representative-based Clustering with Gabriel
Graph Based Post-processing (MOSAIC)
Supervised Clustering using Density Estimation
Techniques (SCDE)

Remark For a more details about SCEC, SPAM,
SRIDHCR see EZZ04, ZEZ06 the PKDD06 paper
briefly discusses SCAH, SCHG, SCMRG
19
SCAH (Agglomerative Hierarchical)
Inputs A dataset Oo1,...,on A distance Matrix
D d(oi,oj) oi,oj ? O , Output Clustering
Xc1,,ck Algorithm 1) Initialize
Create single object clusters ci oi, 1 i
n Compute merge candidates based on nearest
clusters 2) DO FOREVER a) Find the pair
(ci, cj) of merge candidates that improves q(X)
the most b) If no such pair exist terminate,
returning Xc1,,ck c) Delete the two
clusters ci and cj from X and add the cluster ci
? cj to X d) Update inter-cluster
distances incrementally e) Update merge
candidates based on inter-cluster distances
20
SCHG (Hierarchical Grid-based)
Remark Same as SCAH, but uses grid cells as
initial clusters Inputs A dataset
Oo1,...,on A grid structure G Output Clusterin
g Xc1,,ck Algorithm 1) Initialize
Create clusters making each single non-empty grid
cell a cluster Compute merge candidates (all
pairs of neighboring grid cells) 2) DO FOREVER
a) Find the pair (ci, cj) of merge candidates
that improves q(X) the most b) If no such
pair exist terminate, returning Xc1,,ck
c) Delete the two clusters ci and cj from X and
add the cluster cci ? cj to X d)
Update merge candidates ?c?X (MC(c,c) ? MC(c,
ci) ? MC(c, cj ))
21
Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy 1. If a cell receives
a reward that is larger than the sum of its
rewards its ancestors return that cell.
2. If a cell and its ancestor do not receive
any reward prune 3. Otherwise, process the
children of the cell (drill down)
22
Code SCMRG
23
Problems with SCAH
Too restrictive definition of merge candidates
XXX OOO OOO XXX
No look ahead
Non-contiguous clusters
24
6. Summary

A framework for region discovery that relies on
additive, reward-based fitness functions and
views region discovery as a clustering problem
has been introduced.
Evidence concerning the usefulness of the
framework for hot spot discovery problems has
been presented.
As a by-product some known and not so well known
flaws of hierarchical clustering algorithms have
been identified.
The ultimate vision of this research is the
development of region discovery engines that
assist earth scientists in finding interesting
regions in spatial datasets.

25
Why should people use Region Discovery Engines
(RDE)?

RDE finds sub-regions with special
characteristics in large spatial datasets and
presents findings in an understandable form. This
is important for
Focused summarization
Find interesting subsets in spatial datasets for
further studies
Identify regions with unexpected patterns
because they are unexpected they deviate from
global patterns therefore, their regional
characteristics are frequently important for
domain experts
Without powerful region discovery algorithms,
finding regional patters tends to be haphazard,
and only leads to discoveries if ad-hoc region
boundaries have enough resemblance with the true
decision boundary
Exploratory data analysis for a mostly unknown
dataset
Co-location statistics frequently blurred when
arbitrary region definitions are used, hiding the
true relationship of two co-occurring phenomena
that become invisible by taking averages over
regions in which a strong relationship is watered
down, by including objects that do not contribute
to the relationship (example High crime-rates
along the major rivers in Texas)
Data set reduction focused sampling