A Unified Approach to Spatial Outliers Detection - PowerPoint PPT Presentation

1
  • A Unified Approach to Spatial Outliers Detection
  • Chang-Tien Lu
  • Spatial Database Lab
  • Department of Computer Science
  • University of Minnesota
  • ctlu_at_cs.umn.edu
  • http://www.cs.umn.edu/research/shashi-group

2
Outline
  • Introduction
  • Motivation
  • General Definition of Spatial Outlier
  • Related Work
  • Proposed Approach and Algorithm
  • Evaluation of Proposed Approach
  • Conclusions

3
Spatial Data Mining
  • Spatial Databases are too large to analyze
    manually
  • NASA Earth Observation System (EOS)
  • National Institute of Justice Crime mapping
  • Census Bureau, Dept. of Commerce - Census Data
  • Spatial Data Mining
  • Discover frequent and interesting spatial
    patterns for post processing (knowledge
    discovery)
  • Pattern examples: outliers, crime hot spots,
    land-use classification
  • Historical Example
  • London, 1854
  • Cholera water pump

4
Spatial Outlier
  • Definition: A data point that is extreme
    relative to its neighbors

5
Application Domain: Traffic Data

6
An Example of Spatial Outlier
  • Spatial outlier: S; global outliers: G, L

7
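Spatial Outlier Detection: $Z_{S(x)}$ approach
Function: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
If $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$,
declare x as a spatial outlier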
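A minimal Python sketch of this test, assuming the difference function is the attribute value minus the average over the neighbors, and that the statistics of that difference are taken over all nodes. The function name `z_outliers`, the threshold `theta=2.0`, and the ten-station chain are hypothetical illustrations, not data from the presentation.

```python
from statistics import mean, pstdev

def z_outliers(f, neighbors, theta=2.0):
    """Return the nodes whose standardized neighborhood difference exceeds theta."""
    # S(x) = f(x) - average attribute value over the neighbors of x
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    mu, sigma = mean(s.values()), pstdev(s.values())
    if sigma == 0:
        return []  # every node matches its neighborhood exactly
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Hypothetical chain of 10 stations; station 5 reports an abnormal volume.
f = {i: 100.0 for i in range(10)}
f[5] = 200.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
outliers = z_outliers(f, neighbors)
```

On this toy chain only station 5 is returned; its two neighbors also deviate from their own neighborhoods, but by less than the threshold.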
8
Evaluation of Statistical Assumption
  • Distribution of traffic station attribute f(x)
    looks normal
  • Distribution of $S(x) = f(x) - E_{y \in N(x)}(f(y))$
    looks normal too!

9
Outlier Detection Tests
10
Outlier Detection Tests
  • Related work
  • 1-dimension: ignores geographic location
  • Homogeneous multi-dimension: mixes location with
    attributes
  • Spatial outlier
  • 2 classes of dimensions: location, attribute
  • Neighborhood: based on location dimension
  • Difference: compares attribute dimension
  • Comparison of outlier detection methods

11
Issues
  • Numerous tests
  • Each has custom algorithm
  • Adds complexity to implementing a Spatial Database
    Management System
  • Desirable
  • Unified test
  • A general algorithm to perform different tests
  • High performance

12
Our Contribution
  • A general definition of spatial outlier
  • S-outlier
  • Show that existing definitions are special cases
    of S-outlier
  • Design efficient algorithms
  • Analyze the computation structures
  • Develop I/O cost models
  • Evaluate alternate disk page clustering methods

13
General Definition of Spatial Outlier
  • Given
  • A spatial framework SF consisting of locations
    $s_1, s_2, \ldots, s_n$
  • An attribute function $f: s_i \to R$
  • (R: set of real numbers)
  • A neighborhood relationship $N \subseteq SF \times SF$
  • A neighborhood aggregation function $f^N_{aggr}: R^N \to R$
  • A difference function $F_{diff}: R \times R \to R$
  • Statistic test function $ST: R \to \{True, False\}$
  • Test is based on $F_{diff}(f, f^N_{aggr}(f, N))$
  • General definition: S-outlier
  • An object $O \in SF$ is an S-outlier $(f, f^N_{aggr}, F_{diff}, ST)$
  • if $ST\{F_{diff}[f(x), f^N_{aggr}(f(x), N(x))]\}$ = TRUE
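One way to read the general definition is as a function with three pluggable parts: a neighborhood aggregation, a difference function, and a statistic test. The sketch below assumes that reading; the name `is_s_outlier`, the averaging aggregate, the fixed threshold of 20, and the four-node data set are all hypothetical.

```python
from statistics import mean

def is_s_outlier(x, f, neighbors, aggregate, f_diff, st):
    """True if ST(Fdiff(f(x), aggregate of f over N(x))) holds for node x."""
    agg = aggregate([f[y] for y in neighbors[x]])
    return st(f_diff(f[x], agg))

# Hypothetical instantiation: average aggregation, plain difference,
# and a fixed absolute threshold as the statistic test.
f = {"a": 10.0, "b": 12.0, "c": 50.0, "d": 11.0}
neighbors = {x: [y for y in f if y != x] for x in f}  # every node neighbors every other
outliers = [
    x for x in f
    if is_s_outlier(x, f, neighbors, mean, lambda v, a: v - a, lambda d: abs(d) > 20)
]
```

Swapping in a different `aggregate`, `f_diff`, or `st` changes the test without touching the surrounding loop, which is the point of a unified definition.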

14
Related Work- Spatial Outlier Tests
  • Different spatial outlier tests
  • Spatial statistic approach
  • Scatter plot approach (Luc Anselin 94)
  • Moran Scatter plot approach (Luc Anselin 95)
  • Variogram cloud approach (Graphic)
  • All these are special cases of S-outlier
  • Show this for one case: scatter plot

15
Scatter Plot Approach
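  • Lemma
  • Scatter plot is a special case of S-outlier
  • Given
  • An attribute function f(x)
  • A neighborhood relationship N(x)
  • An aggregation function $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
  • A difference function
  • $F_{diff}$: $\varepsilon = E(x) - (m \cdot f(x) + b)$
  • Detect spatial outlier by
  • Statistic test function
  • $ST$: $\left| \frac{\varepsilon - \mu_\varepsilon}{\sigma_\varepsilon} \right| > \theta$

[Figure: scatter plot of E(x) against f(x)]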
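A hedged Python sketch of the scatter plot test, assuming E(x) is the average of f over the neighbors of x, that the line E(x) ≈ m·f(x) + b is fitted by ordinary least squares, and that the residual is standardized with its population standard deviation. The function name, the threshold, and the 30-station chain with a linear trend are hypothetical.

```python
from statistics import mean, pstdev

def scatter_plot_outliers(f, neighbors, theta):
    """Flag nodes whose standardized regression residual exceeds theta."""
    nodes = list(f)
    # E(x): average attribute value over the neighbors of x
    e = {x: mean(f[y] for y in neighbors[x]) for x in nodes}
    f_bar = mean(f[x] for x in nodes)
    e_bar = mean(e[x] for x in nodes)
    sxx = sum((f[x] - f_bar) ** 2 for x in nodes)
    if sxx == 0:
        return []  # no spread in f(x): the regression line is undefined
    sxy = sum((f[x] - f_bar) * (e[x] - e_bar) for x in nodes)
    m = sxy / sxx              # slope of the least-squares line
    b = e_bar - m * f_bar      # intercept
    eps = {x: e[x] - (m * f[x] + b) for x in nodes}
    mu, sigma = mean(eps.values()), pstdev(eps.values())
    if sigma == 0:
        return []
    return [x for x in nodes if abs((eps[x] - mu) / sigma) > theta]

# Hypothetical chain with a linear trend; station 15 is spiked.
f = {i: 10.0 * i for i in range(30)}
f[15] = 250.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 30] for i in range(30)}
flagged = scatter_plot_outliers(f, neighbors, theta=3.0)
```

With this threshold only the spiked station is flagged. At lower thresholds its two neighbors can be flagged as well, because the spike inflates their E(x).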
16
Outline
  • Introduction
  • Proposed Approach and Algorithm
  • Problem formulation
  • Our approach
  • Efficient algorithm
  • Cost model
  • Evaluation of Proposed Approach
  • Conclusions

17
Problem Formulation
  • General definition: S-outlier
  • An object O is an S-outlier $(f, f^N_{aggr}, F_{diff}, ST)$
  • if $ST\{F_{diff}[f(x), f^N_{aggr}(f(x), N(x))]\}$ = TRUE
  • Design
  • An efficient algorithm to detect S-outliers,
  • i.e., $\{s_i \mid s_i \in SF, s_i \text{ is a spatial outlier}\}$
  • Objective
  • Efficiency: to minimize the computation time
  • Constraints
  • $F_{diff}$ and ST are algebraic aggregate functions of
    values of f and $f^N_{aggr}$
  • The size of the data set >> the main memory size
  • Computation time is determined by I/O time

18
Aggregate Function
  • Distributive aggregate function F
  • The global F value can be computed by applying a
    function G to the values of F in each partition of
    the data set; F = G in most cases
  • Algebraic aggregate function F
  • Global F value can be computed using a fixed
    number of sub-aggregates from each partition of
    the data set
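The distinction can be illustrated with mean and standard deviation, which are algebraic: both follow from three distributive sub-aggregates (count, sum, sum of squares) kept per partition. This is a sketch under that standard reading; the function names and the sample partitions are hypothetical.

```python
from math import sqrt

def partial(partition):
    """Distributive sub-aggregates of one partition: count, sum, sum of squares."""
    return (len(partition), sum(partition), sum(v * v for v in partition))

def combine(parts):
    """Merge per-partition sub-aggregates; G is plain addition for all three."""
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    return n, s, ss

def mean_and_std(n, s, ss):
    """Algebraic aggregates derived from a fixed number (3) of sub-aggregates."""
    mu = s / n
    return mu, sqrt(ss / n - mu * mu)

partitions = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
mu, sigma = mean_and_std(*combine([partial(p) for p in partitions]))
```

No partition has to be revisited once its three numbers are recorded, which is what makes a single scan sufficient.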

19
Our Approach
  • Separate into two phases
  • Model building
  • Testing (a node or a set of nodes)
  • Computation structure of model building
  • Key insights
  • Spatial self join using N(x) relationship
  • Algebraic aggregate functions can be computed in
    one disk scan of spatial join
  • Computation structure of testing
  • Single node: spatial range query
  • Get-All-Neighbors(x) operation

20
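An Example of Our Approach: Scatter plot
  • Model building
  • An attribute function f(x)
  • Neighborhood aggregate function $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
  • Distributive aggregate functions: $\sum f(x)$, $\sum E(x)$,
    $\sum f(x)E(x)$, $\sum f^2(x)$, $\sum E^2(x)$
  • Algebraic aggregate functions:
    $m = \frac{N \sum f(x)E(x) - \sum f(x) \sum E(x)}{N \sum f^2(x) - (\sum f(x))^2}$,
    $b = \frac{\sum E(x) - m \sum f(x)}{N}$,
    $\sigma_\varepsilon = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N - 2}}$
  • where $S_{xx} = \sum f^2(x) - \frac{(\sum f(x))^2}{N}$,
    $S_{yy} = \sum E^2(x) - \frac{(\sum E(x))^2}{N}$
  • Testing
  • Difference function $\varepsilon = E(x) - (m \cdot f(x) + b)$
  • where $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
  • Statistic test function $\left| \frac{\varepsilon - \mu_\varepsilon}{\sigma_\varepsilon} \right| > \theta$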
21
Model Building Algorithm
  • Steps
  • For each node x
  • Retrieve data record of x (f(x), list of
    neighbors(x))
  • Get-All-Neighbors(x): retrieve data records of
    neighbors(x)
  • If neighbor y is not in the memory buffer,
    request another I/O operation
  • Compute the neighborhood aggregate function
  • Accumulate the distributive aggregate functions
  • Compute the algebraic aggregate functions
  • I/O cost
  • Dominant operation: Get-All-Neighbors(x)
  • I/O cost of Get-All-Neighbors(x) is determined by
    the clustering efficiency
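The steps above can be sketched as a single pass over the node records, here for a scatter plot style model (a least-squares line of the neighborhood average against f). This is an assumption-laden illustration: an in-memory dict stands in for disk pages, the five running sums are the usual ones for a least-squares fit, and `build_model` and the sample records are hypothetical names and data.

```python
from math import sqrt

def build_model(records):
    """One scan over records, where records[x] = (f(x), list of neighbors of x)."""
    n = sf = se = sfe = sff = see = 0.0
    for fx, nbrs in records.values():
        # Get-All-Neighbors(x): a dict lookup stands in for page reads here.
        ex = sum(records[y][0] for y in nbrs) / len(nbrs)
        # Accumulate the distributive aggregates.
        n += 1
        sf += fx
        se += ex
        sfe += fx * ex
        sff += fx * fx
        see += ex * ex
    # Algebraic aggregates derived from the accumulated sums.
    sxx = sff - sf * sf / n
    syy = see - se * se / n
    m = (n * sfe - sf * se) / (n * sff - sf * sf)
    b = (se - m * sf) / n
    sigma = sqrt(max(syy - m * m * sxx, 0.0) / (n - 2))
    return m, b, sigma

# Hypothetical data: two separate road segments, each internally uniform,
# so the neighborhood average equals f(x) at every node.
records = {
    "a1": (1.0, ["a2"]), "a2": (1.0, ["a1"]),
    "c1": (3.0, ["c2"]), "c2": (3.0, ["c1"]),
}
m, b, sigma = build_model(records)
```

Because every node here matches its neighborhood, the fitted line is the identity with zero residual spread; real data would give a nonzero spread to test against.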

22
Attribute Table
Node x | Attribute volume f(x) | List of Neighbors N(x)
V1     | 125                   | V3, V5, V10
V2     | 130                   | V4, V6
V3     | 140                   | V1, V7
...    | ...                   | ...
V10    | 120                   | V1, V7, V12
23
Computation structure
  • Lemma
  • In model building, algebraic aggregate functions
    can be computed in one disk scan of the spatial
    self-join
  • Proof
  • A fixed number k of distributive aggregate
    functions are used to store the
    aggregate values
  • Algebraic aggregate functions are then computed
    using these aggregate values
  • In all cases, one needs a very small number of
    distributive aggregate functions (k ≤ 10)

24
Testing Algorithm
  • Steps
  • For each node x
  • Retrieve data record of x (f(x), list of
    neighbors(x))
  • Get-All-Neighbors(x): retrieve data records of
    neighbors(x)
  • If neighbor y is not in the memory buffer
  • Request another I/O operation
  • Compute difference function Fdiff
  • If test function ST returns True
  • Declare x as a spatial outlier
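A small simulation of these steps, counting one I/O operation each time a needed page is not in the memory buffer. The slide does not name a buffer replacement policy, so LRU is an assumption here, as are the names `PageBuffer` and `detect_on_route`, the page layout, and the fixed threshold of 30.

```python
from collections import OrderedDict

class PageBuffer:
    """LRU buffer of disk pages; counts page reads (I/O operations)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.reads = 0

    def touch(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)        # hit: mark as most recently used
        else:
            self.reads += 1                     # miss: one more I/O operation
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used page

def detect_on_route(route, records, page_of, buf, f_diff, st):
    """Test every node on the route; records[x] = (f(x), list of neighbors)."""
    outliers = []
    for x in route:
        buf.touch(page_of[x])                   # retrieve data record of x
        fx, nbrs = records[x]
        values = []
        for y in nbrs:                          # Get-All-Neighbors(x)
            buf.touch(page_of[y])
            values.append(records[y][0])
        if st(f_diff(fx, sum(values) / len(values))):
            outliers.append(x)
    return outliers

# Hypothetical route of 6 stations stored two per page; station 3 is abnormal.
volumes = [10.0, 10.0, 10.0, 60.0, 10.0, 10.0]
records = {i: (volumes[i], [j for j in (i - 1, i + 1) if 0 <= j < 6]) for i in range(6)}
page_of = {i: i // 2 for i in range(6)}
buf = PageBuffer(capacity=1)
outliers = detect_on_route(range(6), records, page_of, buf,
                           lambda v, a: v - a, lambda d: abs(d) > 30)
```

With a one-page buffer this run needs 7 page reads for 3 distinct pages; every extra read comes from a neighbor stored on a different page than the node being tested.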

25
I/O Cost Model
  • Definition
  • CE: Clustering efficiency
  • N: Total number of nodes
  • Bfr: Blocking factor (number of nodes in a disk
    page)
  • K: Avg. number of neighbors for each node
  • L: Number of nodes in a route
  • Cost model of A1: $N/Bfr + N \cdot K \cdot (1 - CE)$
  • The cost to retrieve all nodes: $N/Bfr$
  • The cost to retrieve neighbors of all nodes: $N \cdot K \cdot (1 - CE)$
  • Cost model of A2: $L \cdot (1 - CE) + L \cdot K \cdot (1 - CE)$

26
Clustering Efficiency
  • CE definition
  • Probability that a node $v_i$ and a neighbor of $v_i$
    are stored in the same disk page
  • CE = (Total number of unsplit edges)/(Total number of
    edges)
  • Computation cost (I/O cost) is determined by
    Clustering Efficiency (CE)
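The ratio definition translates directly into code. In this sketch an edge counts as unsplit when both endpoints are assigned to the same page; the function name, the six-node ring, and the two page assignments are hypothetical.

```python
def clustering_efficiency(edges, page_of):
    """Fraction of edges whose two endpoints are stored on the same disk page."""
    unsplit = sum(1 for u, v in edges if page_of[u] == page_of[v])
    return unsplit / len(edges)

# Hypothetical ring of six nodes, three nodes per page.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
contiguous = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}    # neighbors kept together
alternating = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1}   # neighbors always separated
ce_contiguous = clustering_efficiency(edges, contiguous)
ce_alternating = clustering_efficiency(edges, alternating)
```

The same graph and page size give a CE of 4/6 for the contiguous layout and 0 for the alternating one, so the clustering method alone can move CE across its whole range.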

27
Clustering Efficiency
  • An example
  • CE depends on
  • Disk page size
  • Node record size, edge distribution over nodes
  • Clustering method

28
I/O Cost Model
  • Definition
  • CE: Clustering efficiency
  • N: Total number of nodes
  • Bfr: Blocking factor (number of nodes in a disk
    page)
  • K: Avg. number of neighbors for each node
  • L: Number of nodes in a route
  • Cost model of Model Building: $N/Bfr + N \cdot K \cdot (1 - CE)$
  • The cost to retrieve all nodes: $N/Bfr$
  • The cost to retrieve neighbors of all nodes: $N \cdot K \cdot (1 - CE)$
  • Cost model of Testing: $L \cdot (1 - CE) + L \cdot K \cdot (1 - CE)$

29
Outline
  • Introduction
  • Proposed Approach and Algorithm
  • Evaluation of Proposed Approach
  • Candidates (Clustering Methods)
  • Experiment Design
  • Results
  • Conclusions

30
Experimental Evaluation (Summary)
  • Hypothesis
  • I/O cost of the algorithm is determined by the
    clustering efficiency
  • Physical Data Page Clustering Method
  • CCAM
  • Cell Tree
  • Z-order
  • Benchmark data
  • Minneapolis-St. Paul traffic data
    (loop-detector)
  • Benchmark tasks
  • Model building
  • Testing
  • Metrics: clustering efficiency (CE), I/O cost

31
Clustering Method: CCAM
  • Connectivity Clustered Access Method
  • Cluster the nodes via min-cut graph partitioning
  • Use B+ tree with Z-order as the secondary index

32
Clustering Method: CCAM
33
Clustering Method: Cell Tree
  • Binary Space Partitioning (BSP)
  • Decompose universe into disjoint convex subspaces
  • Cannot exploit edge information, pure geometric

34
Clustering Method: Cell Tree
35
Clustering Method: Z-order
  • Impose a total order on the nodes
  • Z-order: interleave (bits of X, bits of Y)
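Bit interleaving can be sketched in a few lines of Python. Which coordinate takes the even bit positions is a convention; placing X there is an assumption of this sketch, as are the function name `z_order` and the 16-bit default width.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bit i of x goes to an even position
        key |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y goes to an odd position
    return key

# Sorting grid points by key yields the Z-shaped visiting order.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_order(*p))
```

Nodes that are close in this total order tend to be close in space, which is why the order is usable for placing nodes on disk pages, although it knows nothing about the edges between them.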

36
Clustering Method: Z-order
37
Experiment Design
  • Questions/Hypotheses
  • What is the ranking of candidate clustering
    methods?
  • Is CE a predictor of relative performance of
    clustering methods?
  • Does the cost model predict the observed ranking?
  • What are the effects of parameters?
  • Disk page size
  • Memory buffer size

38
Experiment Design
39
Model Building: Effect of Page Size
Configuration
  • Fixed parameters: buffer size
  • Variable parameters: page size, clustering
    strategy
Trends
  • CCAM has the best performance and the highest CE
    value
  • High CE => low I/O cost
  • Cost model: $N/Bfr + N \cdot K \cdot (1 - CE)$
  • Increase page size => reduce number of page
    accesses

40
Model Building: Effect of Buffer Size
  • Fixed parameters
  • Page size: 2 Kbytes
  • Clustering efficiency: CCAM = 0.81, Cell Tree = 0.69,
    Z-order = 0.51
  • Variable parameters
  • Number of buffers
  • Clustering strategies
  • Trends
  • Increase buffer size => reduce number of page
    accesses
  • CCAM has the best performance

41
Testing: Effect of Page Size
Configuration
  • Fixed parameters: buffer size
  • Variable parameters: page size, clustering
    strategy
Trends
  • CCAM has the best performance and the highest CE
    value
  • High CE => low I/O cost
  • Cost model: $L \cdot (1 - CE) + L \cdot K \cdot (1 - CE)$
  • Increase page size
  • Reduces number of page accesses; the performance gap
    narrows

42
Summary of Experimental Results
  • Model building and testing
  • CCAM achieves higher clustering efficiency
  • CCAM has lower I/O than Cell-Tree and Z-order
  • Higher CE leads to lower I/O cost
  • CE is a good predictor of relative I/O
    performance
  • Larger page size improves clustering efficiency of all
    methods
  • Reduces performance gap between methods

43
Outlier Station Detected
44
Minneapolis-St. Paul Traffic Data

45
Outlier Station Detected
46
Conclusion
  • A general definition of spatial outlier
  • S-outlier models existing spatial outlier
    definitions
  • Scatter plot, Moran scatter plot, spatial
    statistic, ...
  • Efficient algorithm to detect spatial outliers
  • Model building, testing
  • Recognize the computation structure of algorithms
  • Algebraic aggregate functions on the spatial self-join
  • Get-All-Neighbors(x) dominates I/O cost
  • Develop algebraic cost models
  • Evaluate alternate disk page clustering methods
  • CCAM, Cell-Tree, Z-order

47
Future Directions
  • Extend Spatial Outlier Detection Test
  • Multi-attributes
  • Combination of traffic volume, speed, occupancy
  • Location attribute includes time
  • Temporal and Spatial-Temporal Outliers
  • Explore other spatial patterns beyond spatial
    outlier
  • Land-use classification
  • Co-locations
  • Fire ignition source feature
  • Needle vegetation type feature
  • Drought feature
  • Iterative Spatial Outlier Detection

48
Application Domain: Traffic Volume Matrix
49
Spatial Outlier Detection
50
Iterative Spatial Outlier Detection
51
Related Publications
  • Detecting Graph-based Spatial Outliers:
    Algorithms and Applications, ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining, September, 2001
  • Detecting Graph-based Spatial Outliers, the
    International Journal of Intelligent Data
    Analysis (IDA), Vol. 6, No 3. 2002
  • A Unified Approach to Spatial Outliers Detection,
    IEEE Transactions on Knowledge and Data
    Engineering. (under review)

52
http://www.cs.umn.edu/ctlu
Thank you !!!