Title: Spatial Data Mining: Three Case Studies
1 Spatial Data Mining: Three Case Studies
For additional details: www.cs.umn.edu/shekhar/problems.html
Shashi Shekhar, University of Minnesota
Presented to UCGIS Summer Assembly 2001
2 Background
- NSF workshop on GIS and DM (3/99)
- Spatial data [1, 8]: traffic, bird habitats, global climate, logistics, ...
- For spatial patterns: outliers, location prediction, associations, sequential associations, trends, ...
3 Framework
- Problem statement: capture special needs
- Data exploration: maps, new methods
- Try reusing classical methods from data mining, spatial statistics
- If reuse is not possible, invent new methods
- Validation, performance tuning
4 Case 1: Spatial Outliers
- Problem: stations different from neighbors [SIGKDD 2001]
- Data: space-time plot, distribution of f(x), S(x)
- Distribution of base attribute:
- spatially smooth
- frequency distribution over value domain: normal
- Classical test: Pr.[item in population] is low
- Q? Distribution of diff.[f(x), neighborhood agg{f(x)}]
- Insight: this statistic is distributed normally!
- Test: (z-score on the statistic) > 2
- Performance: spatial join, clustering methods
5 Spatial outlier detection [4]
- Spatial outlier
- A data point that is extreme relative to its neighbors
- Given
- A spatial graph $G = \{V, E\}$
- A neighbor relationship (K neighbors)
- An attribute function $f: V \to R$
- An aggregation function $f_{aggr}: R^k \to R$
- Confidence level threshold $\theta$
- Find
- $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$
- Objective
- Correctness: the attribute value of $v_i$ is extreme compared with its neighbors
- Computational efficiency
- Constraints
- Attribute value is normally distributed
- Computation cost dominated by I/O operations
6 Spatial outlier detection
- Spatial Outlier Detection Test
- 1. Choice of Spatial Statistic
- $S(x) = f(x) - E_{y \in N(x)}(f(y))$
- Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
- 2. Test for Outlier Detection
- $(S(x) - \mu_s) / \sigma_s > \theta$
- Hypothesis
- I/O cost determined by clustering efficiency
[Figure: spatial outlier and its neighbors; plots of f(x) and S(x)]
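The test above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [4]: the function name `spatial_outliers`, the sample readings, and the chain-shaped neighbor graph are all hypothetical, the aggregation function is assumed to be the plain average, and the absolute value of the z-score is used (an assumption) so that unusually low stations are flagged as well as unusually high ones.

```python
from statistics import mean, pstdev

def spatial_outliers(values, neighbors, theta=2.0):
    """Return nodes whose neighborhood difference S(x) has |z-score| > theta.

    values:    dict node -> attribute value f(x)
    neighbors: dict node -> list of neighboring nodes N(x)
    """
    # S(x) = f(x) - average of f over the neighbors of x
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())
    if sigma_s == 0:
        return []  # every node matches its neighborhood equally well
    return [x for x in values if abs((s[x] - mu_s) / sigma_s) > theta]

# Hypothetical data: ten stations along a road; station 5 reads far too high.
readings = [10, 11, 10, 12, 11, 40, 11, 10, 12, 11]
values = dict(enumerate(readings))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(readings)]
             for i in values}

print(spatial_outliers(values, neighbors))
```

Stations 4 and 6 also get a large negative S(x) because their neighborhood average includes station 5, but only station 5 exceeds the threshold of 2 in this sample.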
7 Spatial outlier detection
- Results
- 1. CCAM achieves higher clustering efficiency (CE)
- 2. CCAM has lower I/O cost
- 3. Higher CE leads to lower I/O cost
- 4. Page size improves CE for all methods
[Figure: CE value and I/O cost for Cell-Tree, Z-order, and CCAM]
8 Case 2: Location Prediction
- Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000
- Problem: predict nesting sites in marshes, given vegetation, water depth, distance to edge, etc.
- Data: maps of nests and attributes
- spatially clustered nests, spatially smooth attributes
- Classical methods: logistic regression, decision trees, Bayesian classifier
- but the independence assumption is violated! Misses auto-correlation!
- Spatial auto-regression (SAR), Markov random field Bayesian classifier
- Open issue: spatial accuracy vs. classification accuracy
- Open issue: performance; SAR learning is slow!
9 Location Prediction [6, 7, 8]
- Given
- 1. Spatial Framework
- 2. Explanatory functions
- 3. A dependent function
- 4. A family of function mappings
- Find: A function
- Objective: maximize classification_accuracy
- Constraints
- Spatial autocorrelation exists
[Figure panels: nest locations, distance to open water, water depth, vegetation durability]
10 Evaluation: Changing Model
- Linear Regression
- Spatial Regression
- Spatial model is better
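For orientation, the two models being compared are commonly written as follows. The notation is the standard textbook form and is an assumption here, not text taken from the slide: $y$ is the dependent variable, $X$ the explanatory variables, $\beta$ the coefficients, $W$ a neighborhood (contiguity) matrix, $\rho$ the strength of spatial dependence, and $\epsilon$ the error term.

```latex
\begin{align}
\text{Linear regression:} \quad & y = X\beta + \epsilon \\
\text{Spatial auto-regression (SAR):} \quad & y = \rho W y + X\beta + \epsilon
\end{align}
```

Setting $\rho = 0$ reduces the spatial model to ordinary linear regression, so the extra term $\rho W y$ is what lets a location's value depend on its neighbors.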
11 Evaluation: Changing measure
New measure
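A small sketch can show why classification accuracy alone may be a poor measure for location prediction. The measure used here, the average distance from each actual nest to the nearest predicted nest, is an assumption chosen for illustration; the slide's own measure is not recoverable from the text. The grid, the nest positions, and both function names are hypothetical.

```python
from math import dist

def classification_accuracy(actual, predicted, cells):
    """Fraction of grid cells whose nest / no-nest label is predicted correctly."""
    return sum((c in actual) == (c in predicted) for c in cells) / len(cells)

def avg_dist_to_nearest_prediction(actual, predicted):
    """Average distance from each actual nest to its nearest predicted nest."""
    return sum(min(dist(a, p) for p in predicted) for a in actual) / len(actual)

# Hypothetical 5 x 5 marsh grid with two actual nests in one corner.
cells = [(r, c) for r in range(5) for c in range(5)]
actual = {(0, 0), (0, 1)}
near_miss = {(1, 0), (1, 1)}   # predictions one cell away from the nests
far_miss = {(4, 3), (4, 4)}    # predictions in the opposite corner

for name, pred in [("near", near_miss), ("far", far_miss)]:
    print(name,
          classification_accuracy(actual, pred, cells),
          avg_dist_to_nearest_prediction(actual, pred))
```

Both predictions mislabel the same number of cells, so their classification accuracy is identical, yet the distance-based measure separates the near miss from the far miss.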
12 Case 3: Spatial Association Rules
- Citation: Symp. on Spatial Databases 2001
- Problem: given a set of boolean spatial features, find subsets of co-located features, e.g. (fire, drought, vegetation)
- Data: continuous space, partition not natural, no reference feature
- Classical data mining approach: association rules
- But, Look Ma! No Transactions!!! No support measure!
- Approach: work with continuous data without transactionizing it!
- confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
- support: cardinality of spatial join of instances of fire, drought, dry veg.
- participation: min. fraction of instances of a feature in join result
- new algorithm using spatial joins and apriori_gen filters
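The apriori_gen filter is borrowed from classical association rule mining: candidate feature sets of size k+1 are built only from prevalent sets of size k, and a candidate is dropped if any of its size-k subsets is not prevalent. The sketch below shows only this candidate-generation step under that standard definition; the spatial join that would then check each candidate is omitted, and the feature names are hypothetical.

```python
from itertools import combinations

def apriori_gen(prevalent_k):
    """Generate size-(k+1) candidates from prevalent size-k feature sets.

    Assumes every set in prevalent_k has the same size k.
    """
    prevalent = {frozenset(s) for s in prevalent_k}
    candidates = set()
    for a in prevalent:
        for b in prevalent:
            union = a | b
            # Join step: a and b must differ in exactly one feature.
            if len(union) == len(a) + 1:
                # Prune step: every size-k subset must itself be prevalent.
                if all(frozenset(sub) in prevalent
                       for sub in combinations(union, len(a))):
                    candidates.add(union)
    return candidates

# Hypothetical prevalent pairs of boolean spatial features.
pairs = [{"fire", "drought"}, {"fire", "dry_veg"}, {"drought", "dry_veg"},
         {"fire", "rain"}]
print(apriori_gen(pairs))
```

Only (fire, drought, dry_veg) survives; any triple containing rain is pruned because (drought, rain) and (dry_veg, rain) are not prevalent pairs.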
13 Co-location Patterns [2, 3]
Can you find co-location patterns from the following sample dataset?
Answers: [shown as figures on the original slide]
14 Co-location Patterns
Can you find co-location patterns from the following sample dataset?
15 Co-location Patterns
- Spatial Co-location
- A set of features frequently co-located
- Given
- A set $T$ of $K$ boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
- A set $P$ of $N$ locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework $S$; each $p_i \in P$ is of some spatial feature in $T$
- A neighbor relation $R$ over locations in $S$
- Find
- $T_c$: subsets of $T$ that are frequently co-located
- Objective
- Correctness
- Completeness
- Efficiency
- Constraints
- $R$ is symmetric and reflexive
- Monotonic prevalence measure
[Figure panels: Reference Feature Centric, Window Centric, Event Centric]
16 Co-location Patterns
Comparison with association rules
Aspect                           Association rules       Co-location rules
underlying space                 discrete sets           continuous space
item-types                       item-types              events / Boolean spatial features
collections                      transactions            neighborhoods
prevalence measure               support                 participation index
conditional probability measure  Pr.[A in T | B in T]    Pr.[A in N(L) | B at L]
- Participation index
- 1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
- 2. Participation index $= \min_i \{pr(f_i, c)\}$
- Algorithm
- Hybrid Co-location Miner
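The participation index can be computed with a brute-force spatial join, as sketched below. This is an illustration of the definition only, not the Hybrid Co-location Miner: the neighbor relation is assumed to be "within a fixed distance `radius`", "nearby" is read as all points of a row being pairwise neighbors, and the function name and sample points are hypothetical. Points of the same feature are assumed to be distinct.

```python
from itertools import combinations, product
from math import dist

def participation_index(instances, colocation, radius):
    """Participation index of a co-location under a distance-based neighbor relation.

    instances:  dict feature -> list of distinct (x, y) points
    colocation: list of feature names, e.g. ["fire", "drought"]
    radius:     two points are neighbors if their distance is <= radius
    """
    # Brute-force spatial join: one point per feature, all pairwise neighbors.
    rows = [row for row in product(*(instances[f] for f in colocation))
            if all(dist(p, q) <= radius for p, q in combinations(row, 2))]
    ratios = []
    for i, feature in enumerate(colocation):
        participating = {row[i] for row in rows}   # instances in the join result
        ratios.append(len(participating) / len(instances[feature]))
    return min(ratios)                             # participation index

# Hypothetical instances: two of three fires have a drought point nearby.
instances = {"fire": [(0, 0), (5, 5), (9, 0)],
             "drought": [(0, 1), (5, 6)]}
print(participation_index(instances, ["fire", "drought"], radius=1.5))
```

Here the participation ratio is 2/3 for fire and 1 for drought, so the index is the smaller of the two. The join enumerates every combination of points, which is fine for a toy dataset but is exactly the cost a real miner tries to avoid.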
17 Conclusions and Future Directions
- Spatial domains may not satisfy assumptions of classical methods
- data: auto-correlation, continuous geographic space
- patterns: global vs. local, e.g. spatial outliers vs. outliers
- data exploration: maps and albums
- Open Issues
- patterns: hot-spots, blobology (shape), spatial trends, ...
- metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
- spatio-temporal datasets
- scale and resolution: sensitivity of patterns
- geo-statistical confidence measures for mined patterns
18 References
- [1] S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, "Spatial Databases: Accomplishments and Research Needs," IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
- [2] S. Shekhar and Y. Huang, "Discovering Spatial Co-location Patterns: a Summary of Results," in Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
- [3] S. Shekhar, Y. Huang, and H. Xiong, "Performance Evaluation of Co-location Miner," the IEEE International Conference on Data Mining (ICDM01), Nov. 2001. (Submitted)
- [4] S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications," the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
- [5] S. Shekhar, S. Chawla, the book "Spatial Database Concepts, Implementation and Trends." (To be published in 2001)
- [6] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations," Proc. Int. Conf. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
- [7] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Modeling Spatial Dependencies for Mining Geospatial Data," First SIAM International Conference on Data Mining, 2001.
- [8] S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla, "Spatial Contextual Classification and Prediction Models for Mining Geospatial Data," IEEE Transactions on Multimedia, 2001. (Submitted)
Some papers are available on the Web site: http://www.cs.umn.edu/research/shashi-group/