
1

A Unified Approach to Spatial Outliers Detection
Chang-Tien Lu
Spatial Database Lab
Department of Computer Science
University of Minnesota
ctlu@cs.umn.edu
http://www.cs.umn.edu/research/shashi-group

2
Outline

Introduction
Motivation
General Definition of Spatial Outlier
Related Work
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Conclusions

3
Spatial Data Mining

Spatial databases are too large to analyze manually
NASA Earth Observation System (EOS)
National Institute of Justice: crime mapping
Census Bureau, Dept. of Commerce: census data
Spatial data mining
Discover frequent and interesting spatial patterns for post-processing (knowledge discovery)
Pattern examples: outliers, crime hot spots, land-use classification
Historical example
London, 1854: cholera cases traced to a water pump

4
Spatial Outlier

Definition: a data point that is extreme relative to its neighbors

5
Application Domain: Traffic Data

6
An Example of Spatial Outlier

Spatial outlier: S; global outliers: G, L

7
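Spatial Outlier Detection: $Z_{S(x)}$ Approach
Function: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
If $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
Declare x as a spatial outlier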
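
The Z-statistic test on this slide can be sketched in Python. This is only an illustration under stated assumptions: S(x) is taken to be f(x) minus the plain average of f over the neighbors of x, and the mean and standard deviation of S(x) are computed over all nodes as population statistics. The function name `spatial_statistic_outliers`, the nine-station chain, and the threshold `theta=2.0` are hypothetical and do not come from the presentation.

```python
from statistics import mean, pstdev

def spatial_statistic_outliers(f, neighbors, theta=2.0):
    """Return nodes whose standardized neighborhood difference exceeds theta."""
    # S(x) = f(x) minus the average attribute value over the neighbors of x
    s = {x: f[x] - mean([f[y] for y in neighbors[x]]) for x in f}
    values = list(s.values())
    mu_s = mean(values)
    sigma_s = pstdev(values)  # assumption: population standard deviation
    if sigma_s == 0:
        return []  # every node matches its neighborhood exactly
    return [x for x in f if abs((s[x] - mu_s) / sigma_s) > theta]

# Hypothetical chain of nine stations; station 4 reports an unusual volume.
f = {i: 10.0 for i in range(9)}
f[4] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
outliers = spatial_statistic_outliers(f, neighbors, theta=2.0)
```

On this toy chain only station 4 exceeds the threshold; its two neighbors have a large negative S(x) but stay below it.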
8
Evaluation of Statistical Assumption

Distribution of the traffic station attribute f(x) looks normal
Distribution of the neighborhood difference S(x) looks normal too!

9
Outlier Detection Tests
10
Outlier Detection Tests

Related work
1-dimension: ignores geographic location
Homogeneous multi-dimension: mixes location with attributes
Spatial outlier
2 classes of dimensions: location, attribute
Neighborhood is based on the location dimensions
Difference compares the attribute dimensions
Comparison of outlier detection methods

11
Issues

Numerous tests
Each has a custom algorithm
Adds complexity to implementing a Spatial Database Management System
Desirable
Unified test
A general algorithm to perform different tests
High performance

12
Our Contribution

A general definition of spatial outlier: S-outlier
Show that existing definitions are special cases of S-outlier
Design efficient algorithms
Analyze the computation structures
Develop I/O cost models
Evaluate alternate disk page clustering methods

13
General Definition of Spatial Outlier

Given
A spatial framework SF consisting of locations $s_1, s_2, \ldots, s_n$
An attribute function $f: s_i \to R$ (R: set of real numbers)
A neighborhood relationship $N \subseteq SF \times SF$
A neighborhood aggregation function $f^N_{aggr}: R^N \to R$
A difference function $F_{diff}: R \times R \to R$
Statistic test function $ST: R \to \{True, False\}$
Test is based on $F_{diff}(f, f^N_{aggr}(f, N))$
General definition: S-outlier
An object $O \in SF$ is an S-outlier$(f, f^N_{aggr}, F_{diff}, ST)$ if $ST(F_{diff}(f, f^N_{aggr}(f, N))) = \text{TRUE}$
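
The general definition maps naturally onto a function that takes the aggregation, difference, and test functions as parameters. The sketch below is one possible reading, not the authors' implementation: the name `s_outliers`, the dictionary-based representation of f and N, and the example values are all hypothetical. The test here is a fixed threshold for brevity, whereas the tests in the presentation use statistics computed over the whole data set.

```python
from typing import Callable, Dict, Hashable, List, Sequence

def s_outliers(
    f: Dict[Hashable, float],
    neighbors: Dict[Hashable, Sequence[Hashable]],
    aggregate: Callable[[List[float]], float],
    diff: Callable[[float, float], float],
    test: Callable[[float], bool],
) -> List[Hashable]:
    """Return every location x where test(diff(f(x), aggregate(neighbor values))) holds."""
    flagged = []
    for x, value in f.items():
        neighbor_values = [f[y] for y in neighbors[x]]
        if test(diff(value, aggregate(neighbor_values))):
            flagged.append(x)
    return flagged

# Hypothetical instantiation: mean aggregate, plain difference, fixed threshold.
volumes = {"a": 100.0, "b": 104.0, "c": 98.0, "d": 160.0}
adjacency = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
flagged = s_outliers(
    volumes,
    adjacency,
    aggregate=lambda values: sum(values) / len(values),
    diff=lambda own, agg: own - agg,
    test=lambda d: abs(d) > 30.0,
)
```

Swapping in a different `aggregate`, `diff`, or `test` changes the outlier definition without touching the traversal, which is the point of a unified definition.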

14
Related Work: Spatial Outlier Tests

Different spatial outlier tests
Spatial statistic approach
Scatter plot approach (Luc Anselin 94)
Moran scatter plot approach (Luc Anselin 95)
Variogram cloud approach (graphic)
All these are special cases of S-outlier
Show this for one case: scatter plot

15
Scatter Plot Approach

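Lemma
Scatter plot is a special case of S-outlier
Given
An attribute function f(x)
A neighborhood relationship N(x)
An aggregation function $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$, where k is the number of neighbors of x
A difference function $F_{diff}$: $\epsilon = E(x) - (m \times f(x) + b)$, where m and b are the slope and intercept of the regression line
Detect spatial outlier by
Statistic test function ST: $\left| \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right| > \theta$

[Figure: scatter plot of E(x) against f(x)]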
16
Outline

Introduction
Proposed Approach and Algorithm
Problem formulation
Our approach
Efficient algorithm
Cost model
Evaluation of Proposed Approach
Conclusions

17
Problem Formulation

General definition: S-outlier
An object O is an S-outlier$(f, f^N_{aggr}, F_{diff}, ST)$ if $ST(F_{diff}(f, f^N_{aggr}(f, N))) = \text{TRUE}$
Design
An efficient algorithm to detect S-outliers, i.e., $O = \{ s_i \mid s_i \in SF, s_i \text{ is a spatial outlier} \}$
Objective
Efficiency: minimize the computation time
Constraints
$F_{diff}$ and ST are algebraic aggregate functions of the values of f and $f^N_{aggr}$
The size of the data set $\gg$ the main memory size
Computation time is determined by I/O time

18
Aggregate Function

Distributive aggregate function F
The global F value can be computed by applying a function G to the value of F in each partition of the data set; F = G in most cases
Algebraic aggregate function F
The global F value can be computed using a fixed number of sub-aggregates from each partition of the data set
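
A small sketch may make the distinction concrete. Count, sum, and sum of squares are distributive: each partition's value can be merged by addition. Mean and standard deviation are algebraic: they follow from a fixed number (here three) of those sub-aggregates. The function names and the sample partitions below are illustrative assumptions, not part of the presentation.

```python
from math import sqrt

def partial_sums(values):
    """Distributive sub-aggregates for one partition: count, sum, sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(parts):
    """Merge per-partition sub-aggregates by component-wise addition."""
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    return n, total, total_sq

def mean_and_std(parts):
    """Algebraic aggregates derived from the three merged sub-aggregates."""
    n, total, total_sq = combine(parts)
    mu = total / n
    variance = total_sq / n - mu * mu  # population variance
    return mu, sqrt(max(variance, 0.0))

# Three hypothetical partitions of one data set.
partitions = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
mu, sigma = mean_and_std([partial_sums(p) for p in partitions])
```

No partition needs to see another partition's raw values, so the global result is available after a single pass over the data.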

19
Our Approach

Separate two phases
Model building
Testing (a node or a set of nodes)
Computation structure of model building
Key insights
Spatial self-join using the N(x) relationship
Algebraic aggregate functions can be computed in one disk scan of the spatial join
Computation structure of testing
Single node: spatial range query
Get-All-Neighbors(x) operation

20
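An Example of Our Approach: Scatter Plot

Model building
An attribute function f(x)
Neighborhood aggregate function $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
Distributive aggregate functions: $\sum f(x)$, $\sum E(x)$, $\sum f(x) E(x)$, $\sum f^2(x)$, $\sum E^2(x)$
Algebraic aggregate functions: the regression slope m, the intercept b, and the residual standard deviation $\sigma_\epsilon$, all computed from the sums above
Testing
Difference function $\epsilon = E(x) - (m \times f(x) + b)$
Statistic test function $\left| \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right| > \theta$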
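
The two phases of the scatter plot example can be sketched as follows. The sketch assumes an ordinary least-squares fit of E(x) on f(x), a plain average as the neighborhood aggregate, and a population standard deviation for the residuals (with an intercept in the fit, the mean residual is zero). The function names, the toy chain, and the threshold are hypothetical.

```python
from math import sqrt

def neighborhood_average(x, f, neighbors):
    """E(x): average attribute value over the neighbors of x."""
    return sum(f[y] for y in neighbors[x]) / len(neighbors[x])

def build_scatter_model(f, neighbors):
    """Model building: one pass accumulates sums, then slope, intercept, spread follow."""
    n = len(f)
    sum_f = sum_e = sum_fe = sum_ff = sum_ee = 0.0
    for x, fx in f.items():
        ex = neighborhood_average(x, f, neighbors)
        sum_f += fx
        sum_e += ex
        sum_fe += fx * ex
        sum_ff += fx * fx
        sum_ee += ex * ex
    denom = n * sum_ff - sum_f ** 2
    if denom == 0:
        raise ValueError("f(x) is constant, so the regression slope is undefined")
    m = (n * sum_fe - sum_f * sum_e) / denom
    b = (sum_e - m * sum_f) / n
    s_ff = sum_ff - sum_f ** 2 / n
    s_ee = sum_ee - sum_e ** 2 / n
    sigma = sqrt(max(s_ee - m * m * s_ff, 0.0) / n)  # residual spread
    return m, b, sigma

def is_scatter_outlier(x, f, neighbors, model, theta):
    """Testing: standardized residual of x against the fitted line."""
    m, b, sigma = model
    eps = neighborhood_average(x, f, neighbors) - (m * f[x] + b)
    return sigma > 0 and abs(eps / sigma) > theta

# Hypothetical chain of nine stations; station 4 reports an unusual volume.
f = {i: 10.0 for i in range(9)}
f[4] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
model = build_scatter_model(f, neighbors)
flagged = [x for x in f if is_scatter_outlier(x, f, neighbors, model, theta=1.5)]
```

On this toy chain the fit flags stations 3 and 5, whose neighborhood averages are pulled up by station 4, rather than station 4 itself. Different tests can disagree on the same data, which is one reason a framework that can run several tests is useful.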

21
Model Building Algorithm

Steps
For each node x
Retrieve data record of x (f(x), list of neighbors N(x))
Get-All-Neighbors(x): retrieve data records of the neighbors of x
If neighbor y is not in the memory buffer, request another I/O operation
Compute neighborhood aggregate function
Accumulate distributive aggregate functions
Compute algebraic aggregate functions
I/O cost
Dominant operation: Get-All-Neighbors(x)
I/O cost of Get-All-Neighbors(x) is determined by the clustering efficiency
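
The effect of page clustering on Get-All-Neighbors(x) can be simulated with a small page buffer. The sketch below is an assumption-laden toy: it uses an LRU replacement policy (the presentation does not name one), counts one I/O per buffer miss, and compares two hypothetical page layouts for a six-node chain. `PageBuffer`, `count_model_building_reads`, and the layouts are illustrative names and data.

```python
from collections import OrderedDict

class PageBuffer:
    """LRU buffer of disk pages that counts page reads (buffer misses)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.reads = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # buffer hit: no I/O
            return
        self.reads += 1  # buffer miss: one page read
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used page

def count_model_building_reads(neighbors, page_of, buffer_pages):
    """Scan all nodes, fetching each node and then its neighbors through the buffer."""
    buf = PageBuffer(buffer_pages)
    for x in neighbors:
        buf.access(page_of[x])  # retrieve the record of x
        for y in neighbors[x]:  # Get-All-Neighbors(x)
            buf.access(page_of[y])
    return buf.reads

# Hypothetical six-node chain stored two nodes per page.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
clustered = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}  # neighbors share pages
scattered = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}  # neighbors split across pages
clustered_reads = count_model_building_reads(chain, clustered, buffer_pages=1)
scattered_reads = count_model_building_reads(chain, scattered, buffer_pages=1)
```

With a one-page buffer, the layout that keeps neighbors on the same page needs fewer page reads than the layout that splits every edge.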

22
Attribute Table
Node x    Attribute volume f(x)    List of neighbors N(x)
V1        125                      V3, V5, V10
V2        130                      V4, V6
V3        140                      V1, V7
...       ...                      ...
V10       120                      V1, V7, V12
23
Computation structure

Lemma
In model building, algebraic aggregate functions can be computed in one disk scan of the spatial self-join
Proof
A fixed number k of distributive aggregate functions are used to store the aggregate values
Algebraic aggregate functions are then computed using these aggregate values
In all cases, one needs a very small number of distributive aggregate functions ($k \le 10$)

24
Testing Algorithm

Steps
For each node x
Retrieve data record of x (f(x), list of neighbors N(x))
Get-All-Neighbors(x): retrieve data records of the neighbors of x
If neighbor y is not in the memory buffer, request another I/O operation
Compute difference function $F_{diff}$
If test function ST = True
Declare x as a spatial outlier

25
I/O Cost Model

Definition
CE: clustering efficiency
N: total number of nodes
Bfr: blocking factor (number of nodes in a disk page)
K: average number of neighbors for each node
L: number of nodes in a route
Cost model of A1
The cost to retrieve all nodes
The cost to retrieve neighbors of all nodes
Cost model of A2

26
Clustering Efficiency

CE definition
Probability that a node $v_i$ and a neighbor of $v_i$ are stored in the same disk page
CE = (total number of unsplit edges) / (total number of edges)
Computation cost (I/O cost) is determined by clustering efficiency (CE)
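
The ratio form of the definition is straightforward to compute once every node has been assigned to a page. The sketch below assumes an edge list and a node-to-page mapping; the function name and the two layouts for a six-node chain are hypothetical.

```python
def clustering_efficiency(edges, page_of):
    """Fraction of edges whose two endpoints are stored in the same disk page."""
    edges = list(edges)
    if not edges:
        raise ValueError("need at least one edge")
    unsplit = sum(1 for u, v in edges if page_of[u] == page_of[v])
    return unsplit / len(edges)

# Hypothetical six-node chain stored two nodes per page.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
clustered = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
scattered = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
ce_clustered = clustering_efficiency(chain_edges, clustered)
ce_scattered = clustering_efficiency(chain_edges, scattered)
```

Here three of the five edges stay within a page under the first layout and none do under the second.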

27
Clustering Efficiency

An example
CE depends on
Disk page size
Node record size, edge distribution over nodes
Clustering method

28
I/O Cost Model

Definition
CE: clustering efficiency
N: total number of nodes
Bfr: blocking factor (number of nodes in a disk page)
K: average number of neighbors for each node
L: number of nodes in a route
Cost model of model building: $\frac{N}{Bfr} + N \cdot K \cdot (1 - CE)$
The cost to retrieve all nodes: $\frac{N}{Bfr}$
The cost to retrieve neighbors of all nodes: $N \cdot K \cdot (1 - CE)$
Cost model of testing: $L \cdot (1 - CE) + L \cdot K \cdot (1 - CE)$

29
Outline

Introduction
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Candidates (Clustering Methods)
Experiment Design
Results
Conclusions

30
Experimental Evaluation (Summary)

Hypothesis
I/O cost of the algorithm is determined by the clustering efficiency
Physical data page clustering methods
CCAM
Cell Tree
Z-order
Benchmark data
Minneapolis-St. Paul traffic data (loop detector)
Benchmark tasks
Model building
Testing
Metrics: clustering efficiency (CE), I/O cost

31
Clustering Method: CCAM

Connectivity-Clustered Access Method
Cluster the nodes via min-cut graph partitioning
Use a B+ tree with Z-order as the secondary index

32
Clustering Method: CCAM
33
Clustering Method: Cell Tree

Binary Space Partitioning (BSP)
Decompose universe into disjoint convex subspaces
Cannot exploit edge information; purely geometric

34
Clustering Method: Cell Tree
35
Clustering Method: Z-order

Impose a total order on the nodes
Z-order: interleave (bits of X, bits of Y)
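
Bit interleaving can be sketched in a few lines. The choice of which coordinate occupies the even bit positions is a convention; this sketch puts the bits of X in the even positions and the bits of Y in the odd positions, which is an assumption rather than something the slide specifies. The function name and the 16-bit default are also illustrative.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return key

# Sorting nodes by key gives the total order used to pack them into pages.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_order(p[0], p[1]))
```

Nodes that are close in the plane tend to receive nearby keys, but the ordering uses geometry only and ignores which nodes are connected by edges.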

36
Clustering Method: Z-order
37
Experiment Design

Questions/Hypotheses
What is the ranking of candidate clustering methods?
Is CE a predictor of the relative performance of clustering methods?
Does the cost model predict the observed ranking?
What are the effects of parameters?
Disk page size
Memory buffer size

38
Experiment Design
39
Model Building: Effect of Page Size

Configuration
Fixed parameters: buffer size
Variable parameters: page size, clustering strategy

Trends
CCAM has the best performance and the highest CE value
High CE => low I/O cost
Cost model: $\frac{N}{Bfr} + N \cdot K \cdot (1 - CE)$
Increase page size => reduce number of page accesses
40
Model Building: Effect of Buffer Size

Fixed parameters
Page size: 2 Kbytes
Clustering efficiency: CCAM = 0.81, Cell Tree = 0.69, Z-order = 0.51
Variable parameters
Number of buffers
Clustering strategies
Trends
Increase buffer size => reduce number of page accesses
CCAM has the best performance

41
Testing: Effect of Page Size

Configuration
Fixed parameters: buffer size
Variable parameters: page size, clustering strategy

Trends
CCAM has the best performance and the highest CE value
High CE => low I/O cost
Cost model: $L \cdot (1 - CE) + L \cdot K \cdot (1 - CE)$
Increase page size => reduce number of page accesses; the performance gap narrows

42
Summary of Experimental Results

Model building and testing
CCAM achieves higher clustering efficiency
CCAM has lower I/O cost than Cell Tree and Z-order
Higher CE leads to lower I/O cost
CE is a good predictor of relative I/O performance
Increasing the page size improves the clustering efficiency of all methods
Reduces the performance gap between methods

43
Outlier Station Detected
44
Minneapolis-St. Paul Traffic Data

45
Outlier Station Detected
46
Conclusion

A general definition of spatial outlier
S-outlier models existing spatial outlier definitions
Scatter plot, Moran scatter plot, spatial statistic, ...
Efficient algorithm to detect spatial outliers
Model building, testing
Recognize the computation structure of algorithms
Algebraic aggregate functions on the spatial self-join
Get-All-Neighbors(x) dominates I/O cost
Develop algebraic cost models
Evaluate alternate disk page clustering methods
CCAM, Cell Tree, Z-order

47
Future Directions

Extend spatial outlier detection tests
Multiple attributes
Combination of traffic volume, speed, occupancy
Location attribute includes time
Temporal and spatio-temporal outliers
Explore other spatial patterns beyond spatial outliers
Land-use classification
Co-locations
Fire ignition source feature
Needle vegetation type feature
Drought feature
Iterative Spatial Outlier Detection

48
Application Domain: Traffic Volume Matrix
49
Spatial Outlier Detection
50
Iterative Spatial Outlier Detection
51
Related Publications

Detecting Graph-based Spatial Outliers: Algorithms and Applications, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, September 2001
Detecting Graph-based Spatial Outliers, International Journal of Intelligent Data Analysis (IDA), Vol. 6, No. 3, 2002
A Unified Approach to Spatial Outliers Detection, IEEE Transactions on Knowledge and Data Engineering (under review)

52
http://www.cs.umn.edu/ctlu
Thank you!
