Title: Mining for Spatial Patterns
1 Mining for Spatial Patterns
- Shashi Shekhar
- Department of Computer Science
- University of Minnesota
- http://www.cs.umn.edu/shekhar
- Collaborators
- U. of Minnesota: V. Kumar, G. Karypis, C.T. Lu, W. Wu, Y. Huang, V. Raju, P. Zhang, P. Tan, M. Steinbach
- NASA Ames Research Center: C. Potter
- California State University, Monterey Bay: S. Klooster
- This work was partially funded by NASA and Army High Performance Computing Center
2 Background
- NSF workshop on GIS and DM (3/99)
- Spatial data: traffic, bird habitats, global climate, logistics, ...
- For spatial patterns: outliers, location prediction, associations, sequential associations, clustering, trends, ...
3 Framework
- Problem statement: capture special needs
- Data exploration: maps, new methods
- Try reusing classical methods
- from data mining, spatial statistics
- If reuse is not possible, invent new methods
- Validation, performance tuning
4 Research Goals
- Research Goals
- modeling of ecological data
- event modeling
- zone modeling.
- finding spatio-temporal patterns
- associations
- predictive models.
A key interest is finding connections between the
ocean and the land.
5 Sources of Earth Science Data
- Before 1950, very sparse, unreliable data.
- Since 1950, reliable global data.
- Ocean temperature and pressure are based on data from ships.
- Most land data (solar, precipitation, temperature and pressure) comes from weather stations.
- Since 1981, data has been available from Earth orbiting satellites.
- FPAR, a measure related to plant
- Since 1999 TERRA, the flagship of the NASA Earth Observing System, is providing much more detailed data.
6 Example Pattern: Teleconnections
- Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth.
- For example, El Nino is the anomalous warming of the eastern tropical region of the Pacific, and has been linked to various climate phenomena.
- Droughts in Australia and Southern Africa
- Heavy rainfall along the western coast of South America
- Milder winters in the Midwest
7 Net Primary Production (NPP)
- Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants.
- NPP is driven by solar radiation and can be constrained by precipitation and temperature.
- NPP is a key variable for understanding the global carbon cycle and ecological dynamics of the Earth.
- Keeping track of NPP is important because it includes the food source of humans and all other organisms.
- Sudden changes in the NPP of a region can have a direct impact on the regional ecology.
- An ecosystem model for predicting NPP, CASA (the Carnegie Ames Stanford Approach), provides a detailed view of terrestrial productivity.
8 Benefits of Data Mining
- Data mining provides earth scientists with tools that allow them to spend more time choosing and exploring interesting families of hypotheses.
- However, statistics is needed to provide methods for determining the statistical significance of results.
- By applying the proposed data mining techniques, some of the steps of hypothesis generation and evaluation will be automated, facilitated and improved.
- Association rules provide a new framework for detecting relationships between events.
9 Approaches
10 Clustering
- Interested in relationships between regions, not points.
- For land, clustering based on NPP or other variables, e.g., precipitation, temperature.
- For ocean, clustering based on SST (Sea Surface Temperature).
- When raw NPP and SST are used, clustering can find seasonal patterns.
- Anomalous regions have plant growth patterns that are reversed from those typically observed in the hemisphere in which they reside, and are easy to spot.
11 Clustering
El Nino Regions
SNN clusters of SST that are highly correlated
with El Nino indices.
12 Spatial Association Rule
- Citation: Symp. on Spatial Databases 2001
- Problem: Given a set of boolean spatial features
- find subsets of co-located features, e.g. (fire, drought, vegetation)
- Data: continuous space, partition not natural, no reference feature
- Classical data mining approach: association rules
- But, Look Ma! No Transactions!!! No support measure!
- Approach: Work with continuous data without transactionizing it!
- confidence = Pr[fire at s | drought in N(s) and vegetation in N(s)]
- support: cardinality of spatial join of instances of fire, drought, dry veg.
- participation: min. fraction of instances of a feature in join result
- new algorithm using spatial joins and apriori_gen filters
13 Event Definition
- Convert the time series into sequence of events
at each spatial location.
14 Interesting Association Patterns
- Use domain knowledge to eliminate uninteresting patterns.
- A pattern is less interesting if it occurs at random locations.
- Approach
- Partition the land area into distinct groups (e.g., based on land-cover type).
- For each pattern, find the regions for which the pattern can be applied.
- If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.
- If the pattern occurs frequently in all groups of land areas, then it is less interesting.
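The grouping heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy location-to-group mapping, and the 0.7 concentration threshold are hypothetical and do not come from the slides, and the sketch ignores differences in group size.

```python
from collections import Counter

def group_shares(pattern_locations, group_of):
    """Fraction of a pattern's occurrences that falls in each land-cover group."""
    counts = Counter(group_of[loc] for loc in pattern_locations)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

def is_potentially_interesting(pattern_locations, group_of, min_share=0.7):
    """Assumed rule: flag a pattern if one group holds at least min_share of its occurrences."""
    shares = group_shares(pattern_locations, group_of)
    return bool(shares) and max(shares.values()) >= min_share

# Hypothetical toy data: location id -> land-cover group.
group_of = {1: "grassland", 2: "grassland", 3: "grassland",
            4: "forest", 5: "forest", 6: "desert"}
concentrated = [1, 2, 3, 4]  # 3 of 4 occurrences fall in grassland
spread_out = [1, 4, 6]       # one occurrence in each group

print(is_potentially_interesting(concentrated, group_of))
print(is_potentially_interesting(spread_out, group_of))
```

A pattern concentrated in one land-cover group is flagged, while one spread evenly over all groups is not. A real analysis would likely also normalize by the area of each group.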
15 Association Rules
- Intra-zone non-sequential patterns
- Region corresponds to semi-arid grasslands, a type of vegetation that is able to take advantage of high precipitation more quickly than forests.
- Hypothesis: FPAR-Hi events could be related to unusual precipitation conditions.
16 Co-location
Can you find co-location patterns from the
following sample dataset?
18 Co-location
Spatial Co-location: a set of features frequently co-located
Given:
- A set $T$ of $K$ boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
- A set $P$ of $N$ locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework $S$; each $p_i \in P$ is of some spatial feature in $T$
- A neighbor relation $R$ over locations in $S$
Find: $T_c$, the subsets of $T$ that are frequently co-located
Objective:
- Correctness
- Completeness
- Efficiency
Constraints:
- $R$ is symmetric and reflexive
- Monotonic prevalence measure
Figure panels: Reference Feature Centric, Window Centric, Event Centric
19 Co-location
Comparison with association rules

 | Association rules | Co-location rules
underlying space | discrete sets | continuous space
item-types | item-types | events / Boolean spatial features
collections | transactions | neighborhoods
prevalence measure | support | participation index
conditional probability measure | Pr[A in T | B in T] | Pr[A in N(L) | B at L]

Participation index
1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
2. Participation index $= \min_i \{pr(f_i, c)\}$
Algorithm: Hybrid Co-location Miner
20 Spatial Co-location Patterns
- Spatial features A, B, C and their instances
- Possible associations are (A, B), (B, C), etc.
- Neighbor relationship includes the following pairs:
- A1, B1
- A2, B1
- A2, B2
- B1, C1
- B2, C2
21 Spatial Co-location Patterns
- Partition approach [Yasuhiko, KDD 2001]
- Support not well defined, i.e. not independent of execution trace
- Has a fast heuristic which is hard to analyze for correctness/completeness
Spatial features A, B, C and their instances
Support(A,B) = 1, Support(B,C) = 2
Support(A,B) = 2, Support(B,C) = 2
22 Spatial Co-location Patterns
- Reference feature approach [Han, SSD 95]
- C as reference feature to get transactions
- Transactions: (B1), (B2)
- Support (A,B) ? from Apriori algorithm
- Note: Neighbor relationship includes the following pairs:
- A1, B1
- A2, B1
- A2, B2
- B1, C1
- B2, C2
Spatial features A, B, C and their instances
23 Spatial Co-location Patterns
- Our approach (Event Centric)
- Neighborhood instead of transactions
- Spatial join on neighbor relationship
- Support -> Prevalence
- Participation index = min. p_ratio
- P_ratio(A, (A,B)) = fraction of instances of A participating in join(A, B, neighbor)
- Examples
- Support(A,B) = min(2/2, 3/3) = 1
- Support(B,C) = min(2/2, 2/2) = 1
Spatial features A, B, C and their instances
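The participation index for a pair of features can be sketched in Python as follows. This is a minimal illustration, not the authors' miner: it uses the neighbor pairs listed on the slides, but the instance sets (two instances per feature) are an assumption because the figure is not reproduced, so the intermediate ratios need not match those on the slide. The function name is hypothetical.

```python
def participation_index(f, g, instances, neighbor_pairs):
    """Participation index of the size-2 co-location (f, g), with f != g.

    instances: feature -> set of instance ids
    neighbor_pairs: iterable of (instance, instance) pairs in the neighbor relation
    """
    feature_of = {inst: feat for feat, insts in instances.items() for inst in insts}
    participating = {f: set(), g: set()}
    # Spatial join on the neighbor relation, restricted to one instance of each feature.
    for a, b in neighbor_pairs:
        fa, fb = feature_of[a], feature_of[b]
        if {fa, fb} == {f, g}:
            participating[fa].add(a)
            participating[fb].add(b)
    # Participation ratio per feature, then the minimum over both features.
    ratios = [len(participating[x]) / len(instances[x]) for x in (f, g)]
    return min(ratios)

# Assumed instance sets; neighbor pairs are taken from the slides.
instances = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1", "C2"}}
neighbor_pairs = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"),
                  ("B1", "C1"), ("B2", "C2")]

print(participation_index("A", "B", instances, neighbor_pairs))
print(participation_index("B", "C", instances, neighbor_pairs))
print(participation_index("A", "C", instances, neighbor_pairs))
```

Because every instance is counted at most once per feature, the result does not depend on the order in which neighbor pairs are processed, which is the property the partition approach lacks.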
24 Spatial Co-location Patterns
Spatial features A, B, C and their instances
Support(A,B) = min(2/2, 3/3) = 1
Support(B,C) = min(2/2, 2/2) = 1
Support(A,B) = 2, Support(B,C) = 2
- Reference feature approach: C as reference feature; Transactions: (B1), (B2); Support (A,B) ?
Support(A,B) = 1, Support(B,C) = 2
25 Spatial Outliers
- Spatial Outlier: A data point that is extreme relative to its neighbors
- Case Study: traffic stations different from neighbors [SIGKDD 2001]
- Data: space-time plot, distr. of f(x), S(x)
- Distribution of base attribute:
- spatially smooth
- frequency distribution over value domain: normal
- Classical test: Pr[item in population] is low
- Q? distribution of diff. [f(x), neighborhood agg{f(x)}]
- Insight: this statistic is distributed normally!
- Test: (z-score on the statistic) > 2
- Performance: spatial join, clustering methods
26 Spatial Outlier Detection
Given:
- A spatial graph $G = \{V, E\}$
- A neighbor relationship (K neighbors)
- An attribute function $f : V \to \mathbb{R}$
- An aggregation function: $\mathbb{R}^k \to \mathbb{R}$
- A comparison function
- Confidence level threshold $\theta$
- Statistic test function $ST : \mathbb{R} \to \{T, F\}$
Find: $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$
Objective:
- Correctness: the attribute values of $v_i$ are extreme compared with its neighbors
- Computational efficiency
Constraints:
- ... and ST are algebraic aggregate functions of ... and ...
- Computation cost dominated by I/O op.
27 Spatial Outlier Detection
Spatial Outlier Detection Test
1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}[f(y)]$
   Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
2. Test for outlier detection: $(S(x) - \mu_s) / \sigma_s > \theta$
Hypothesis: I/O cost determined by clustering efficiency
Figure panels: f(x), S(x)
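The test on this slide can be sketched in Python as follows. It is a minimal sketch under stated assumptions: the function name and the toy path graph are hypothetical, the neighborhood expectation is taken as a plain mean over neighbors, and the absolute value of the z-score is used so that unusually low values are flagged as well, which is an assumption beyond the one-sided form shown on the slide.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=2.0):
    """Flag nodes x whose statistic S(x) = f(x) - mean(f(y) for y in N(x)) is extreme.

    f: node -> attribute value
    neighbors: node -> list of neighboring nodes (assumed non-empty)
    theta: z-score threshold (the slides use 2)
    """
    s = {x: f[x] - mean([f[y] for y in neighbors[x]]) for x in f}
    values = list(s.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all nodes agree with their neighborhoods
    # Two-sided z-test (assumption): flag large deviations in either direction.
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Hypothetical toy data: ten stations on a line, one with an extreme reading.
f = {i: 10 for i in range(10)}
f[5] = 50
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

print(spatial_outliers(f, neighbors))
```

Only the station whose value differs sharply from both neighbors exceeds the threshold. Its neighbors also have nonzero S(x) but stay below it.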
28 Spatial Outlier Detection
Results:
1. CCAM achieves higher clustering efficiency (CE)
2. CCAM has lower I/O cost
3. High CE => low I/O cost
4. Big page => high CE
Figure: I/O cost and CE value for Z-order, CCAM, Cell-Tree
29 A Unified Approach: Spatial Outliers
- Tests: quantitative, graphical
- Results
- Computation: spatial self-join
- Tests: algebraic functions of join
- Join predicate: neighbor relations
- I/O cost: f(clustering efficiency)
- Our algorithm is I/O-efficient for
- Algebraic tests
Figure panels: Scatter Plot, Original Data, Our Approach
30 Graphical Spatial Tests
Figure panels: Moran Scatter Plot, Original Data, Variogram Cloud
31 Location Prediction
- Citations: IEEE Tran. on Multimedia 2002, SIAM DM Conf. 2001, SIGKDD DMKD 2000
- Problem: predict nesting site in marshes
- given vegetation, water depth, distance to edge, etc.
- Data: maps of nests and attributes
- spatially clustered nests, spatially smooth attributes
- Classical methods: logistic regression, decision trees, Bayesian classifier
- but the independence assumption is violated! Misses auto-correlation!
- Spatial auto-regression (SAR), Markov random field Bayesian classifier
- Open issue: spatial accuracy vs. classification accuracy
- Open issue: performance, SAR learning is slow!
32 Location Prediction
Given:
1. Spatial framework
2. Explanatory functions
3. A dependent class
4. A family of function mappings
Find: classification model
Objective: maximize classification_accuracy
Constraints: spatial autocorrelation exists
Figure panels: Nest locations, Distance to open water, Water depth, Vegetation durability
33 Motivation and Framework
34 Solution Procedures
- Spatial Autoregression Model (SAR)
- $y = \rho W y + X\beta + \epsilon$
- $W$ models neighborhood relationships
- $\rho$ models strength of spatial dependencies
- $\epsilon$: error vector
- Solutions
- $\rho$ and $\beta$ can be estimated using ML or Bayesian stat.
- e.g., spatial econometrics package uses Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) method.
- Likelihood-based estimation requires $O(n^3)$ ops.
- Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
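To make the SAR equation concrete, the following Python sketch evaluates the model in the forward direction: given W, X, beta, rho and the error vector, it solves y = rho W y + X beta + eps for y by fixed-point iteration. It is not the estimation procedure discussed above (ML or Bayesian fitting of rho and beta). The function name, the toy three-site neighborhood matrix and the parameter values are hypothetical, and convergence assumes a row-normalized W with |rho| < 1.

```python
def sar_solve(W, X, beta, eps, rho, iters=200):
    """Solve y = rho*W*y + X*beta + eps for y by fixed-point iteration.

    Assumes W is row-normalized and |rho| < 1, so the iteration converges.
    """
    n = len(W)
    # Non-spatial part of the model: X*beta + eps.
    base = [sum(X[i][k] * beta[k] for k in range(len(beta))) + eps[i]
            for i in range(n)]
    y = base[:]
    for _ in range(iters):
        # Add the spatially lagged term rho * (W y) to the non-spatial part.
        y = [rho * sum(W[i][j] * y[j] for j in range(n)) + base[i]
             for i in range(n)]
    return y

# Hypothetical toy example: three sites on a line, row-normalized neighbor weights.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
X = [[1.0], [1.0], [1.0]]
beta = [2.0]
eps = [0.0, 0.0, 0.0]

y_spatial = sar_solve(W, X, beta, eps, rho=0.5)
y_plain = sar_solve(W, X, beta, eps, rho=0.0)
print(y_spatial, y_plain)
```

With rho = 0 the spatial term vanishes and the model reduces to ordinary linear regression. With rho = 0.5 each site's value is pulled up by its neighbors.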
35 Evaluation
- Linear Regression
- Spatial Regression
- Spatial model is better
36 Solution Procedures
- Markov Random Field based Bayesian Classifiers
- $\Pr(l_i \mid X, L_i) = \Pr(X \mid l_i, L_i)\,\Pr(l_i \mid L_i) / \Pr(X)$
- $\Pr(l_i \mid L_i)$ can be estimated from training data
- $L_i$ denotes the set of labels in the neighborhood of $s_i$, excluding the label at $s_i$
- $\Pr(X \mid l_i, L_i)$ can be estimated using kernel functions
- Solutions
- Stochastic relaxation [Geman]
- Iterated conditional modes [Besag]
- Graph cut [Boykov]
37 Comparison
- SAR can be rewritten as $y = (QX)\beta + Q\epsilon$
- where $Q = (I - \rho W)^{-1}$, which can be viewed as a spatial smoothing operation.
- This transformation shows that SAR is similar to the linear logistic model and thus suffers from the same limitations, i.e., the SAR model assumes linear separability of classes in the transformed feature space.
- The SAR model also makes more restrictive assumptions about the distribution of features and class shapes than MRF.
- The relationship between SAR and MRF is analogous to the relationship between logistic regression and Bayesian classifiers.
- Our experimental results show that the MRF model yields better spatial and classification accuracies than SAR predictions.
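The rewritten form of SAR follows from the model equation in a few steps. In the derivation below, $y$ is the dependent variable, $W$ the neighborhood matrix, $\rho$ the spatial dependence strength, $\beta$ the regression coefficients and $\epsilon$ the error vector. It assumes that $I - \rho W$ is invertible, which the slides do not state explicitly.

```latex
\begin{align*}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}\epsilon \\
  &= (QX)\beta + Q\epsilon, \qquad Q = (I - \rho W)^{-1}
\end{align*}
```

Setting $\rho = 0$ gives $Q = I$ and recovers the classical regression model $y = X\beta + \epsilon$.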
38 MRF vs. SAR
Figure panels: Confusion Matrix, Spatial Confusion Matrix
39 Experiment Design
40 Prediction Maps (Learning)
Figure panels: Actual Nest Sites (Real Learning), MRF-P Prediction (ADNP = 3.36), MRF-GMM Prediction (ADNP = 5.88), SAR Prediction (ADNP = 9.80); NZ labels in extraction order: 85, 138, 140, 130
41 Prediction Maps (Testing)
Figure panels: Actual Nest Sites (Real Testing), Actual Nest Sites (Real Learning), MRF-P Prediction (ADNP = 2.84), MRF-GMM Prediction (ADNP = 3.35), SAR Prediction (ADNP = 8.63); NZ labels in extraction order: 30, 80, 76, 80
42 Conclusion and Future Directions
- Spatial domains may not satisfy assumptions of classical methods
- data: auto-correlation, continuous geographic space
- patterns: global vs. local, e.g. spatial outliers vs. outliers
- data exploration: maps and albums
- Open Issues
- patterns: hot-spots, blobology (shape), spatial trends, ...
- metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
- spatio-temporal datasets
- scale and resolution sensitivity of patterns
- geo-statistical confidence measure for mined patterns
43 Reference
- S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, "Spatial Databases: Accomplishments and Research Needs", IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
- S. Shekhar and Y. Huang, "Discovering Spatial Co-location Patterns: a Summary of Results", In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
- S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications", the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
- S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outlier", Intelligent Data Analysis, to appear in Vol. 6(3), 2002.
- S. Shekhar, S. Chawla, the book "Spatial Database: Concepts, Implementation and Trends", Prentice Hall, 2002.
- S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations", Proc. Int. Conf. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
- S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Modeling Spatial Dependencies for Mining Geospatial Data", First SIAM International Conference on Data Mining, 2001.
- S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, "Spatial Contextual Classification and Prediction Models for Mining Geospatial Data", to appear in IEEE Transactions on Multimedia, 2002.
- S. Shekhar, V. Kumar, P. Tan, M. Steinbach, Y. Huang, P. Zhang, C. Potter, S. Klooster, "Mining Patterns in Earth Science Data", IEEE Computing in Science and Engineering (Submitted).
44 Reference
- S. Shekhar, C.T. Lu, P. Zhang, "A Unified Approach to Spatial Outliers Detection", IEEE Transactions on Knowledge and Data Engineering (Submitted).
- S. Shekhar, C.T. Lu, X. Tan, S. Chawla, "Map Cube: A Visualization Tool for Spatial Data Warehouses", as chapter of Geographic Data Mining and Knowledge Discovery, Harvey J. Miller and Jiawei Han (eds.), Taylor and Francis, 2001, ISBN 0-415-23369-0.
- S. Shekhar, Y. Huang, W. Wu, C.T. Lu, "What's Spatial about Spatial Data Mining: Three Case Studies", as chapter of the book Data Mining for Scientific and Engineering Applications, V. Kumar, R. Grossman, C. Kamath, R. Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2.
- S. Shekhar and Y. Huang, "Multi-resolution Co-location Miner: a New Algorithm to Find Co-location Patterns in Spatial Datasets", Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference), April 2002.