Title: A Unified Approach to Spatial Outliers Detection
1- A Unified Approach to Spatial Outliers Detection
- Chang-Tien Lu
- Spatial Database Lab
- Department of Computer Science
- University of Minnesota
- ctlu_at_cs.umn.edu
- http://www.cs.umn.edu/research/shashi-group
2Outline
- Introduction
- Motivation
- General Definition of Spatial Outlier
- Related Work
- Proposed Approach and Algorithm
- Evaluation of Proposed Approach
- Conclusions
3Spatial Data Mining
- Spatial databases are too large to analyze manually
- NASA Earth Observation System (EOS)
- National Institute of Justice: crime mapping
- Census Bureau, Dept. of Commerce: census data
- Spatial Data Mining
- Discover frequent and interesting spatial patterns for post-processing (knowledge discovery)
- Pattern examples: outliers, crime hot spots, land-use classification
- Historical example
- London, 1854
- Cholera: water pump
4Spatial Outlier
- Definition: a data point that is extreme relative to its neighbors
5Application Domain Traffic Data
6An Example of Spatial Outlier
- Spatial outlier: S; global outliers: G, L
7Spatial Outlier Detection Z s(x) approach
Function
If
Declare x as a spatial outlier
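A minimal sketch of a test in the spirit of this slide. It assumes that the function is S(x) = f(x) minus the mean of f over the neighbors of x, and that x is declared a spatial outlier when the standardized value |(S(x) - mean of S) / std of S| exceeds a threshold. The names `z_outliers`, `neighbors`, and `theta`, the default threshold, and the toy data are illustrative assumptions, not taken from the talk.

```python
import statistics

def z_outliers(f, neighbors, theta=2.0):
    """Return nodes whose standardized neighborhood difference exceeds theta."""
    # S(x): attribute value minus the average attribute value of the neighbors
    s = {x: f[x] - statistics.mean(f[y] for y in neighbors[x]) for x in f}
    mu = statistics.mean(s.values())
    sigma = statistics.pstdev(s.values())
    if sigma == 0:
        return []  # all differences identical: nothing stands out
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Toy data (hypothetical): ten stations on a line, one with an extreme reading
f = {i: 10.0 for i in range(10)}
f[5] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

flagged = z_outliers(f, neighbors, theta=2.0)
```

The threshold trades false alarms against missed outliers; a value near 2 or 3 is a common starting point when the differences look roughly normal.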
8Evaluation of Statistical Assumption
- Distribution of traffic station attribute f(x) looks normal
- Distribution of S(x) looks normal too!
9Outlier Detection Tests
10Outlier Detection Tests
- Related work
- 1-dimension: ignores geographic location
- Homogeneous multi-dimension: mixes location with attributes
- Spatial outlier
- 2 classes of dimensions: location, attribute
- Neighborhood: based on location dimension
- Difference: compares attribute dimension
- Comparison of outlier detection methods
11Issues
- Numerous tests
- Each has a custom algorithm
- Adds complexity to implementing a Spatial Database Management System
- Desirable
- Unified test
- A general algorithm to perform different tests
- High performance
12Our Contribution
- A general definition of spatial outlier
- S-outlier
- Show that existing definitions are special cases of S-outlier
- Design efficient algorithms
- Analyze the computation structures
- Develop I/O cost models
- Evaluate alternate disk page clustering methods
13General Definition of Spatial Outlier
- Given
- A spatial framework SF consisting of locations $s_1, s_2, \ldots, s_n$
- An attribute function $f: s_i \rightarrow R$ (R: set of real numbers)
- A neighborhood relationship $N \subseteq SF \times SF$
- A neighborhood aggregation function $f^N_{aggr}: R^N \rightarrow R$
- A difference function $F_{diff}: R \times R \rightarrow R$
- Statistic test function $ST: R \rightarrow \{True, False\}$
- Test is based on $F_{diff}(f, f^N_{aggr}(f, N))$
- General definition: S-outlier
- An object $O \in SF$ is an S-outlier $(f, f^N_{aggr}, F_{diff}, ST)$ if $ST(F_{diff}(f, f^N_{aggr}(f, N)))$ evaluated at O is TRUE
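The general definition can be read as a higher-order test: plug in an aggregation function, a difference function, and a test function, and the same code covers different outlier tests. The sketch below is one hedged way to express that; `is_s_outlier`, the dictionary-based data layout, and the fixed cutoff used as the test function are assumptions made for illustration.

```python
from statistics import mean

def is_s_outlier(x, f, neighbors, aggr, fdiff, st):
    """Generic test: ST(Fdiff(f(x), aggregate of f over the neighbors of x))."""
    neighborhood_value = aggr([f[y] for y in neighbors[x]])
    return st(fdiff(f[x], neighborhood_value))

# Hypothetical instantiation: mean as aggregation, subtraction as difference,
# and a fixed cutoff standing in for a statistical test.
f = {"s1": 10.0, "s2": 12.0, "s3": 11.0, "s4": 40.0}
neighbors = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s3"],
    "s3": ["s1", "s2", "s4"],
    "s4": ["s3"],
}

def difference(a, b):
    return a - b

def cutoff_test(d):
    return abs(d) > 20.0

outliers = [x for x in f
            if is_s_outlier(x, f, neighbors, mean, difference, cutoff_test)]
```

Swapping `mean`, `difference`, or `cutoff_test` for other callables changes the test without touching the detection loop.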
14Related Work- Spatial Outlier Tests
- Different spatial outlier tests
- Spatial statistic approach
- Scatter plot approach (Luc Anselin 94)
- Moran Scatter plot approach (Luc Anselin 95)
- Variogram cloud approach (Graphic)
- All these are special cases of S-outlier
- Show this for one case: scatter plot
15Scatter Plot Approach
- Lemma
- Scatter plot is a special case of S-outlier
- Given
- An attribute function f(x)
- A neighborhood relationship N(x)
- An aggregation function E(x)
- A difference function
- $F_{diff}$: $\epsilon = E(x) - (m \times f(x) + b)$
- Detect spatial outlier by
- Statistic test function ST
- [Figure: scatter plot with axes E(x) and f(x)]
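One way to write out the scatter-plot instantiation in full is sketched below. It assumes that E(x) is the average of f over the k neighbors of x, that m and b are the slope and intercept of the least-squares line of E(x) on f(x), and that the test compares a standardized residual with a threshold. The symbols $k$, $\mu_\epsilon$, $\sigma_\epsilon$, and $\theta$ are assumptions introduced here.

```latex
E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y), \qquad
\epsilon = E(x) - (m \times f(x) + b), \qquad
ST:\ \left| \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right| > \theta
```

Under these assumptions the tuple $(f, E, \epsilon, ST)$ matches the four components required by the general definition.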
16Outline
- Introduction
- Proposed Approach and Algorithm
- Problem formulation
- Our approach
- Efficient algorithm
- Cost model
- Evaluation of Proposed Approach
- Conclusions
17Problem Formulation
- General definition: S-outlier
- An object O is an S-outlier $(f, f^N_{aggr}, F_{diff}, ST)$, where $f^N_{aggr}$ is the neighborhood aggregation function, if ST is TRUE
- Design
- An efficient algorithm to detect S-outliers
- i.e., $O = \{ s_i \mid s_i \in SF,\ s_i \text{ is a spatial outlier} \}$
- Objective
- Efficiency: minimize the computation time
- Constraints
- $F_{diff}$ and ST are algebraic aggregate functions of the values of f and $f^N_{aggr}$
- The size of the data set >> the main memory size
- Computation time is determined by I/O time
18Aggregate Function
- Distributive aggregate function F
- Global F value can be computed by applying the G function to the value of F in each partition of the data set; F = G for most cases
- Algebraic aggregate function F
- Global F value can be computed using a fixed number of sub-aggregates from each partition of the data set
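The distinction can be illustrated with a small sketch: count, sum, and sum of squares are distributive (partition results are combined by addition), while mean and variance are algebraic because they follow from a fixed number of those sub-aggregates. The function names and the toy partitions are illustrative assumptions.

```python
def partial_aggregates(partition):
    """Distributive sub-aggregates of one partition: count, sum, sum of squares."""
    return (len(partition), sum(partition), sum(v * v for v in partition))

def combine(a, b):
    """G function: element-wise addition of two sub-aggregate tuples."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(agg):
    """Algebraic aggregates derived from the three sub-aggregates."""
    n, s, ss = agg
    mean = s / n
    return mean, ss / n - mean * mean  # population variance

# Toy data (hypothetical): one data set split into three partitions
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
total = (0, 0.0, 0.0)
for p in partitions:
    total = combine(total, partial_aggregates(p))
mean, variance = mean_and_variance(total)
```

Each partition is read once and only three numbers per partition are kept, which is what makes a single-scan computation possible.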
19Our Approach
- Separate two phases
- Model building
- Testing (a node or a set of nodes)
- Computation structure of model building
- Key insights
- Spatial self-join using N(x) relationship
- Algebraic aggregate functions can be computed in one disk scan of the spatial join
- Computation structure of testing
- Single node: spatial range query
- Get-All-Neighbors(x) operation
20An Example of Our Approach Scatter plot
- Model building
- An attribute function f(x)
- Neighborhood aggregate function E(x)
- Distributive aggregate functions
- Algebraic aggregate functions
- Testing
- Difference function $\epsilon = E(x) - (m \times f(x) + b)$
- Statistic test function ST
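A hedged sketch of the two phases for the scatter-plot case. Model building accumulates a handful of running sums in one pass and derives the slope m, intercept b, and residual spread from them; testing then evaluates a single node. It assumes E(x) is the neighbor mean, a least-squares fit of E(x) on f(x), and a population standard deviation of the residuals. All function names, the threshold, and the toy data are illustrative.

```python
import math

def neighborhood_mean(x, f, neighbors):
    """E(x): assumed here to be the mean of f over the neighbors of x."""
    return sum(f[y] for y in neighbors[x]) / len(neighbors[x])

def build_model(f, neighbors):
    """One pass over the nodes, accumulating the count and five sums."""
    n = sx = sy = sxy = sxx = syy = 0.0
    for x, fx in f.items():
        ex = neighborhood_mean(x, f, neighbors)
        n += 1
        sx += fx
        sy += ex
        sxy += fx * ex
        sxx += fx * fx
        syy += ex * ex
    # Algebraic aggregates: least-squares slope and intercept of E(x) on f(x)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    # Sum of squared residuals from the same sums; residual mean is zero
    sse = syy - b * sy - m * sxy
    sigma = math.sqrt(max(sse, 0.0) / n)
    return m, b, sigma

def is_outlier(x, f, neighbors, model, theta=2.0):
    """Testing phase for a single node x."""
    m, b, sigma = model
    eps = neighborhood_mean(x, f, neighbors) - (m * f[x] + b)
    return sigma > 0 and abs(eps / sigma) > theta

# Toy data (hypothetical): ten stations on a line
values = [10, 12, 11, 13, 12, 60, 11, 12, 10, 13]
f = {i: float(v) for i, v in enumerate(values)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

model = build_model(f, neighbors)
flagged = [x for x in f if is_outlier(x, f, neighbors, model)]
```

Because the model is a few numbers, it can be built once and reused to test any node or set of nodes later.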
21Model Building Algorithm
- Steps
- For each node x
- Retrieve data record of x (f(x), list of neighbors(x))
- Get-All-Neighbors(x): retrieve data records of neighbor(x)
- If neighbor y is not in the memory buffer, request another I/O operation
- Compute neighborhood aggregate function
- Accumulate distributive aggregate functions
- Compute algebraic aggregate functions
- I/O cost
- Dominant operation: Get-All-Neighbors(x)
- I/O cost of Get-All-Neighbors(x) is determined by the clustering efficiency
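To see why the page layout drives the I/O cost of Get-All-Neighbors(x), the loop above can be simulated with a small page buffer. This is a sketch under stated assumptions: an LRU replacement policy, a scan of the nodes in page order, and a node-to-page mapping `page_of`; none of these details come from the talk.

```python
from collections import OrderedDict

def count_page_reads(neighbors, page_of, buffer_pages):
    """Count simulated page reads for one pass over all nodes and their neighbors."""
    buffer = OrderedDict()  # pages currently in memory, least recently used first
    reads = 0

    def touch(page):
        nonlocal reads
        if page in buffer:
            buffer.move_to_end(page)        # buffer hit
        else:
            reads += 1                      # buffer miss: one more I/O
            buffer[page] = True
            if len(buffer) > buffer_pages:
                buffer.popitem(last=False)  # evict least recently used page

    for x in sorted(neighbors, key=lambda v: page_of[v]):  # scan in page order
        touch(page_of[x])
        for y in neighbors[x]:
            touch(page_of[y])
    return reads

# Toy data (hypothetical): 8 nodes on a line, 2 nodes per page, 2 buffer pages
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
clustered = {i: i // 2 for i in range(8)}  # neighbors mostly share a page
scattered = {i: i % 4 for i in range(8)}   # neighbors never share a page

clustered_reads = count_page_reads(neighbors, clustered, buffer_pages=2)
scattered_reads = count_page_reads(neighbors, scattered, buffer_pages=2)
```

With the clustered layout every page is read once, while the scattered layout keeps evicting pages that are needed again shortly afterwards.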
22Attribute Table
Node x   Attribute volume f(x)   List of neighbors N(x)
V1       125                     V3, V5, V10
V2       130                     V4, V6
V3       140                     V1, V7
...      ...                     ...
V10      120                     V1, V7, V12
23Computation structure
- Lemma
- In model building, algebraic aggregate functions can be computed in one disk scan of the spatial self-join
- Proof
- A fixed number k of distributive aggregate functions are used to store the aggregate values
- Algebraic aggregate functions are then computed using these aggregate values
- In all cases, one needs a very small number of distributive aggregate functions ($k \leq 10$)
24Testing Algorithm
- Steps
- For each node x
- Retrieve data record of x (f(x), list of neighbors(x))
- Get-All-Neighbors(x): retrieve data records of neighbor(x)
- If neighbor y is not in the memory buffer
- Request another I/O operation
- Compute difference function $F_{diff}$
- If test function ST is True
- Declare x as a spatial outlier
25I/O Cost Model
- Definition
- CE: clustering efficiency
- N: total number of nodes
- Bfr: blocking factor (number of nodes in a disk page)
- K: average number of neighbors for each node
- L: number of nodes in a route
- Cost model of A1
- $N/Bfr + N \times K \times (1 - CE)$
- $N/Bfr$: the cost to retrieve all nodes
- $N \times K \times (1 - CE)$: the cost to retrieve neighbors of all nodes
- Cost model of A2
- $L \times (1 - CE) + L \times K \times (1 - CE)$
26Clustering Efficiency
- CE definition
- Probability that vi and a neighbor of vi are stored in the same disk page
- CE = (Total number of unsplit edges)/(Total number of edges)
- Computation cost (I/O cost) is determined by clustering efficiency (CE)
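The ratio of unsplit edges to total edges can be computed directly from an edge list and a node-to-page assignment. A minimal sketch, where `clustering_efficiency`, `page_of`, and the toy layouts are illustrative assumptions:

```python
def clustering_efficiency(edges, page_of):
    """CE = unsplit edges / total edges.

    An edge is unsplit when both of its endpoints are stored on the same page.
    """
    unsplit = sum(1 for u, v in edges if page_of[u] == page_of[v])
    return unsplit / len(edges)

# Toy data (hypothetical): 8 nodes on a line, 2 nodes per disk page
edges = [(i, i + 1) for i in range(7)]
clustered = {i: i // 2 for i in range(8)}  # consecutive nodes share a page
scattered = {i: i % 4 for i in range(8)}   # consecutive nodes never share a page

ce_clustered = clustering_efficiency(edges, clustered)
ce_scattered = clustering_efficiency(edges, scattered)
```

Here four of the seven edges stay within a page under the clustered layout, and none do under the scattered one.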
27Clustering Efficiency
- An example
-
- CE depends on
- Disk page size
- Node record size
- Edge distribution over nodes
- Clustering method
28I/O Cost Model
- Definition
- CE: clustering efficiency
- N: total number of nodes
- Bfr: blocking factor (number of nodes in a disk page)
- K: average number of neighbors for each node
- L: number of nodes in a route
- Cost model of Model Building
- $N/Bfr + N \times K \times (1 - CE)$
- $N/Bfr$: the cost to retrieve all nodes
- $N \times K \times (1 - CE)$: the cost to retrieve neighbors of all nodes
- Cost model of Testing
- $L \times (1 - CE) + L \times K \times (1 - CE)$
29Outline
- Introduction
- Proposed Approach and Algorithm
- Evaluation of Proposed Approach
- Candidates (Clustering Methods)
- Experiment Design
- Results
- Conclusions
30Experimental Evaluation (Summary)
- Hypothesis
- I/O cost of the algorithm is determined by the clustering efficiency
- Physical data page clustering methods
- CCAM
- Cell Tree
- Z-order
- Benchmark data
- Minneapolis-St. Paul traffic data (loop-detector)
- Benchmark tasks
- Model building
- Testing
- Metrics: clustering efficiency (CE), I/O cost
31Clustering Method CCAM
- Connectivity Clustered Access Method
- Cluster the nodes via min-cut graph partitioning
- Use B+ tree with Z-order as the secondary index
32Clustering Method CCAM
33Clustering Methods Cell Tree
- Binary Space Partitioning (BSP)
- Decompose universe into disjoint convex subspaces
- Purely geometric; cannot exploit edge information
34Clustering Method Cell Tree
35Clustering Method Z-order
- Impose a total order on the nodes
- Z-order = interleave(bits of X, bits of Y)
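Bit interleaving can be sketched in a few lines. Which coordinate occupies the even bit positions and how many bits are used are conventions; the choices below (x in the even positions, 16 bits per coordinate) and the function name `z_order` are assumptions for illustration.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y; x occupies the even positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return z

# Sorting grid cells by Z-value imposes a total order on them
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_order(*p))
```

Nodes that are adjacent in this order can then be packed into the same disk page, which keeps many spatially close nodes together but ignores the edges between them.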
36Clustering Method Z-order
37Experiment Design
- Questions/Hypotheses
- What is the ranking of candidate clustering methods?
- Is CE a predictor of relative performance of clustering methods?
- Does the cost model predict the observed ranking?
- What are the effects of parameters?
- Disk page size
- Memory buffer size
38Experiment Design
39Model Building Effect of Page Size
- Configuration
- Fixed parameters: buffer size
- Variable parameters: page size, clustering strategy
- Trends
- CCAM has the best performance and the highest CE value
- High CE => low I/O cost
- Cost model: $N/Bfr + N \times K \times (1 - CE)$
- Increase page size => reduce number of page accesses
40Model Building Effect of Buffer Size
- Fixed parameters
- Page size = 2 Kbytes
- Clustering efficiency: CCAM = 0.81, Cell Tree = 0.69, Z-order = 0.51
- Variable parameters
- Number of buffers
- Clustering strategies
- Trends
- Increase buffer size => reduce number of page accesses
- CCAM has the best performance
41Testing Effect of Page Size
- Configuration
- Fixed parameters: buffer size
- Variable parameters: page size, clustering strategy
- Trends
- CCAM has the best performance and the highest CE value
- High CE => low I/O cost
- Cost model: $L \times (1 - CE) + L \times K \times (1 - CE)$
- Increase page size => reduce number of page accesses; the performance gap between methods narrows
42Summary of Experimental Results
- Model building and testing
- CCAM achieves higher clustering efficiency
- CCAM has lower I/O than Cell-Tree and Z-order
- Higher CE leads to lower I/O cost
- CE is a good predictor of relative I/O performance
- Larger page size improves clustering efficiency of all methods
- Reduces performance gap between methods
43Outlier Station Detected
44Minneapolis-St. Paul Traffic Data
45Outlier Station Detected
46Conclusion
- A general definition of spatial outlier
- S-outlier models existing spatial outlier definitions
- Scatter plot, Moran scatter plot, spatial statistic, ...
- Efficient algorithm to detect spatial outliers
- Model building + testing
- Recognize the computation structure of algorithms
- Algebraic aggregate functions on the spatial self-join
- Get-All-Neighbors(x) dominates I/O cost
- Develop algebraic cost models
- Evaluate alternate disk page clustering methods
- CCAM, Cell-Tree, Z-order
47Future Directions
- Extend Spatial Outlier Detection Test
- Multi-attributes
- Combination of traffic volume, speed, occupancy
- Location attribute includes time
- Temporal and Spatial-Temporal Outliers
- Explore other spatial patterns beyond spatial outliers
- Land-use classification
- Co-locations
- Fire ignition source feature
- Needle vegetation type feature
- Drought feature
- Iterative Spatial Outlier Detection
48Application Domain Traffic Volume Matrix
49Spatial Outlier Detection
50Iterative Spatial Outlier Detection
51Related Publications
- Detecting Graph-based Spatial Outliers: Algorithms and Applications, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, September 2001
- Detecting Graph-based Spatial Outliers, International Journal of Intelligent Data Analysis (IDA), Vol. 6, No. 3, 2002
- A Unified Approach to Spatial Outliers Detection, IEEE Transactions on Knowledge and Data Engineering (under review)
52http://www.cs.umn.edu/ctlu
Thank you !!!