Spatial Temporal Data Mining presentation

1
Spatial Temporal Data Mining

Wei Wang
Data Mining Lab, UCLA
January 21, 1999

2
Outline

Introduction
Statistical Clustering
User-defined Trigger
Spatial Index Structure for High-Dimensional Point Data
Temporal Spatial Pattern Detection (ongoing research)

3
Introduction

Spatial data mining has been an active research area in recent years.
For some well-known problems, e.g., clustering, many existing algorithms are not efficient enough, so there is still room for improvement.
Many interesting problems remain uninvestigated.
We classify a subset of these problems and try to solve them efficiently.

4
Outline

STING: a statistical information grid approach to spatial data mining
STING+: an approach to active spatial data mining
PK-tree: a spatial index structure for high-dimensional point data
Temporal spatial pattern detection

5
STING

Spatial databases are usually huge, so the efficiency of the data mining algorithm is crucial.
Example: each person is an object.
Query: find high-income areas within California.
High income: salary > 50,000
Area: > 4 square miles
Traditional method:
Step 1: Select all persons whose salary is high.
Step 2: Run a clustering analysis on the selected persons.
Step 3: Form the region that each cluster occupies.
Step 4: Return the regions larger than 4 square miles.
If a high-income area is instead defined as one where 80% of persons have salary > 50,000, the traditional method cannot answer the query at all.
STING was proposed to solve such problems efficiently.

6
STING

Region query example:
Select the maximal regions that have at least 100 houses per unit area, in which at least 70% of the house prices are above 400K, with total area at least 100 units, at 90% confidence.

SELECT REGION FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9

7
STING

Objects are represented by points, each of which has spatial attributes (its location) and non-spatial (numerical) attributes.
Space is recursively divided into smaller rectangular cells until a certain level is reached, giving a hierarchical structure.
The average number of objects in a leaf cell ranges from several dozen to several thousand.
The data is preprocessed to capture the statistical information of each cell.

8
STING
[Figure: the hierarchical structure, with layers labeled 1st layer (root), (i-1)th layer, ith layer, (i+1)th layer, and leaf layer]
9
STING

For each cell, we have:
Attribute-independent parameter:
n: number of objects
Attribute-dependent parameters (for each numerical attribute):
mean: mean value of the attribute
std: standard deviation of the attribute value
min: the minimum value of the attribute
max: the maximum value of the attribute
distribution: the type of distribution that best fits the attribute value (can be NONE)
The parameters are generated bottom-up when the data is loaded into the database.
Compilation time is linear.
This only has to be done once, not for each query.
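
The bottom-up generation can be sketched as follows, assuming std is the population standard deviation and that a parent's parameters are derived from its children's parameters alone. The names CellStats, leaf_stats, and merge_stats are hypothetical, and the distribution parameter is left out of this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int           # number of objects in the cell
    mean: float      # mean value of the attribute
    std: float       # population standard deviation of the attribute
    minimum: float   # minimum value of the attribute
    maximum: float   # maximum value of the attribute


def leaf_stats(values):
    """Compute a leaf cell's parameters directly from its (non-empty) values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(variance), min(values), max(values))


def merge_stats(children):
    """Derive a parent cell's parameters from its non-empty children only."""
    n = sum(c.n for c in children)
    mean = sum(c.mean * c.n for c in children) / n
    # The parent's E[x^2] is the object-weighted average of the children's E[x^2].
    second_moment = sum((c.std ** 2 + c.mean ** 2) * c.n for c in children) / n
    std = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(
        n,
        mean,
        std,
        min(c.minimum for c in children),
        max(c.maximum for c in children),
    )
```

Because each cell is visited once and each merge touches only a fixed number of children, a pass of this kind stays linear in the number of cells.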

10
STING

Take advantage of the statistical information captured.
Only go through the relevant cells at each level.
The root is relevant.
For each relevant cell, we examine its children at the next level with a statistical test and label them as relevant or not relevant.
Form regions from the relevant leaf-level cells.
There is no need to access the full database, which makes the approach very efficient.
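
A minimal sketch of this top-down pass is shown below. The statistical test itself is not given on the slide, so the sketch swaps in a much simpler stand-in: a cell is treated as relevant when it is non-empty and its stored maximum exceeds a threshold. The names Cell, relevant, and relevant_leaves are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int                 # number of objects in the cell
    max_value: float       # stored maximum of the queried attribute
    children: list = field(default_factory=list)


def relevant(cell, threshold):
    # Stand-in for the statistical test (an assumption of this sketch):
    # a cell whose maximum is not above the threshold cannot contain
    # a qualifying object, so it is safe to skip it.
    return cell.n > 0 and cell.max_value > threshold


def relevant_leaves(cell, threshold):
    """Return the relevant leaf cells below `cell`, which is taken as relevant."""
    if not cell.children:
        return [cell]
    leaves = []
    for child in cell.children:
        if relevant(child, threshold):
            leaves.extend(relevant_leaves(child, threshold))
    return leaves
```

Calling relevant_leaves on the root visits only the children of relevant cells, so whole subtrees are skipped without touching the underlying objects.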

11
STING

The computational complexity of STING is linearly
proportional to the number of leaf cells.
We used the SEQUOIA 2000 benchmark as the data
set to compare the performance of STING with
other approaches.

12
STING

STING is a query-independent approach: the statistical information exists independently of queries.
STING has a much smaller response time than other approaches.
The computational complexity is linearly proportional to the number of leaf cells.
I/O cost is low.
STING can support different resolutions of the query result.
The regions returned by STING approach those returned by DBSCAN as the granularity approaches zero.
Parameters in the hierarchical structure can be maintained efficiently by incremental update.
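
One way such an incremental update could look for an insertion is sketched below, assuming each cell stores n, mean, std (population), min, and max in a dictionary. The functions insert_value and insert_along_path are hypothetical names, not part of STING itself.

```python
import math


def insert_value(stats, v):
    """Return a cell's updated parameters after inserting one value v."""
    n = stats["n"]
    # E[x^2] before the insert, recovered from std and mean.
    second_moment = stats["std"] ** 2 + stats["mean"] ** 2
    new_n = n + 1
    new_mean = (n * stats["mean"] + v) / new_n
    new_second = (n * second_moment + v * v) / new_n
    return {
        "n": new_n,
        "mean": new_mean,
        "std": math.sqrt(max(new_second - new_mean ** 2, 0.0)),
        "min": min(stats["min"], v),
        "max": max(stats["max"], v),
    }


def insert_along_path(path, v):
    """Apply the insert to a leaf cell and each of its ancestors (leaf first)."""
    return [insert_value(cell, v) for cell in path]
```

Only the cells on the path from the affected leaf to the root change, so the cost of an insert grows with the height of the hierarchy rather than with the size of the database. Deletions are harder for min and max, since those cannot be recomputed from the stored parameters alone.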

13
Outline

STING: a statistical information grid approach to spatial data mining
STING+: an approach to active spatial data mining
PK-tree: a spatial index structure for high-dimensional point data
Temporal spatial pattern detection

14
STING+

Moreover, since objects evolve, interesting patterns may emerge or disappear over time.
Example trigger: do bandwidth reallocation when the average call length is greater than 10 minutes within a region where at least 10 cellular phones are in use per square mile.
Traditional database triggers cannot support this efficiently, because the class membership of an object is determined not only by its own non-spatial attributes but also by the attributes of objects in its neighborhood.
Naïve approach: re-evaluate the condition periodically. This is not efficient.

15
STING+

STING+ is an extension of STING that supports user-defined triggers.
In spatial databases, object insertion, deletion, and update are primitive events.
Observation: usually, only the cumulative effect of a set of primitive events can cause the trigger condition to become true.
We refer to such a set of primitive events as a composite event.

16
STING+

Condition-action paradigm
In general, it is difficult or even impossible for the user to specify all possible composite events that may cause the trigger condition to become true.
Evaluating a user-defined trigger T usually involves two aspects:
Find the set of composite events E(s) that may cause the trigger condition CT to become true.
Each time some composite event in E(s) occurs, check the status (true or false) of CT, given that CT was previously false.

17
STING+

Observation: as a side effect of the occurrence of some composite event, the set of composite events E(s) that could cause CT to transition from false to true might itself evolve over time.
There are two sets of composite events to consider:
The set of composite events E(s) that can cause CT to become true: when one occurs, CT needs to be re-evaluated.
The set of composite events F(s) that can cause a change to E(s): when one occurs, E(s) needs to be updated.
18
STING+

Observation: in spatial databases, the effect of an event is usually local to its neighborhood.
STING+ decomposes the user-defined trigger into a set of sub-triggers associated with individual cells in the hierarchical structure.
These sub-triggers monitor the composite events in E(s) and F(s), and change accordingly when E(s) and F(s) evolve.

[Figure: cells of the hierarchical structure at Level 3 and Level 4]
19
STING+

Updates are suspended at some level in the
hierarchy until such time that the cumulative
effect of these updates might cause the trigger
condition to become satisfied.
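
The idea of suspending updates can be illustrated with a deliberately simplified sketch. It assumes a single density condition and a leaf cell that absorbs insertions silently, escalating only when its count first reaches the minimum count the condition requires. LeafCell, min_count, and on_escalate are hypothetical names, and this is not the actual STING+ sub-trigger mechanism.

```python
class LeafCell:
    """A leaf cell that absorbs insertions until they could matter."""

    def __init__(self, name, count, min_count, on_escalate):
        self.name = name
        self.count = count              # objects currently in the cell
        self.min_count = min_count      # count needed to satisfy the density condition
        self.on_escalate = on_escalate  # callback that notifies the next level up

    def insert(self):
        was_dense = self.count >= self.min_count
        self.count += 1
        # Escalate only on the transition from "not dense" to "dense";
        # every other insertion is absorbed at this level.
        if not was_dense and self.count >= self.min_count:
            self.on_escalate(self.name)
```

With min_count set to 3 and an initially empty cell, the first two insertions cause no work above the leaf, and only the third one is passed up for re-evaluation.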

[Figure: cells of the hierarchical structure at Levels 1, 2, and 3]
20
STING+

Example: trigger bandwidth reallocation when the total area of those regions in California where at least 10 cellular phones are in use per square mile, the average length of phone calls is at least 15 minutes, and the region area is at least 50 square miles increases by at least 10 square miles.

DEFINE TRIGGER example ON cellular-phone
WHEN SELECT SIZE(REGION) INCREASE RANGE (10, ∞)
WHERE DENSITY IN RANGE (10, ∞)
AND AVERAGE(length) IN RANGE (15, ∞)
AND AREA IN RANGE (50, ∞)
LOCATION California
DO bandwidth-reallocation

21
STING+

Observation: the trigger condition CT is a conjunction of predicates P1 ∧ P2 ∧ … ∧ Pn and cannot be true if any one predicate is false.
The predicates can be evaluated in a fixed order: the ith predicate is tested only when all previous i-1 predicates are true.
The evaluation order should be chosen so that the total cost is minimal.
STING+ evaluates CT in the order location, density condition, attribute condition, each of which is evaluated in a different phase.
Location only needs to be evaluated once, so its cost can be regarded as constant in the trigger evaluation process.
Once the location is fixed, unnecessary sub-triggers on cells outside the location can be avoided, which saves evaluation cost for the other predicates.
Sub-triggers set during an earlier phase exist longer than those set in a later phase, so it is better to first evaluate the predicate that takes less time to handle.
cost(density) < cost(attribute)
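
The ordering argument can be sketched as a short-circuit evaluation of the conjunction, cheapest predicate first. The cost numbers below are placeholders chosen only to respect the order location, density, attribute; they are not measured values, and evaluate_conjunction is a hypothetical helper.

```python
def evaluate_conjunction(predicates, cell):
    """Evaluate (name, cost, test) predicates in order of increasing cost.

    Returns the truth value of the conjunction and the names of the
    predicates that were actually evaluated before stopping.
    """
    evaluated = []
    for name, _cost, test in sorted(predicates, key=lambda p: p[1]):
        evaluated.append(name)
        if not test(cell):
            return False, evaluated   # later, costlier predicates are skipped
    return True, evaluated


# Placeholder costs (assumptions): location < density < attribute.
predicates = [
    ("attribute", 2, lambda c: c["avg_length"] >= 15),
    ("location", 0, lambda c: c["in_location"]),
    ("density", 1, lambda c: c["density"] >= 10),
]
```

For a cell inside the location whose density is below 10, the attribute predicate is never evaluated, which is where the saving comes from.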

22
STING+

[Table: average CPU cycles for handling each type of sub-trigger]

23
STING+

24
Outline

STING: a statistical information grid approach to spatial data mining
STING+: an approach to active spatial data mining
PK-tree: a spatial index structure for high-dimensional point data
Temporal spatial pattern detection

25
PK-tree

As both the number of objects and the number of
attributes are very large, it is essential to
organize the set of objects by some dynamic
indexing structure.
Point index methods

26
PK-tree
Spatial decomposition: space is recursively divided until a level LD at which each cell contains at most one point.
27
PK-tree
16 intermediate nodes, height 3
28
PK-tree
5 intermediate nodes, height 2
29
PK-tree

The PK-tree uses the concept of a K-instantiable cell to eliminate unnecessary nodes.
Point cell: a non-empty cell at level LD.
A cell C is K-instantiable iff
C is a point cell, or
there do not exist K-1 or fewer K-instantiable sub-cells that cover all non-empty space in C.
Only K-instantiable cells serve as nodes in the PK-tree (except the root).
The parent-child relationship follows naturally from the cell-subcell relationship.
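
The definition can be turned into a small recursive construction. The sketch below assumes two-dimensional points in the unit square, a quadtree-style split of each cell into four sub-cells, and a max_level (playing the role of LD) deep enough that every cell at that level holds at most one point. Node, cover, and build_pk_tree are hypothetical names. The sketch relies on the fact that the smallest set of K-instantiable sub-cells covering a cell is the set of maximal ones, so a cell is K-instantiable when that set has at least K members.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    x0: float      # lower-left corner of the cell
    y0: float
    size: float    # side length of the cell
    children: list = field(default_factory=list)


def cover(points, k, max_level, level=0, x0=0.0, y0=0.0, size=1.0):
    """Return the maximal K-instantiable cells covering `points` in this cell.

    The cell itself is returned (as a single node) if it is K-instantiable.
    """
    if level == max_level:
        # Point cell: non-empty cell at the deepest level (assumed to hold one point).
        return [Node(level, x0, y0, size)]
    half = size / 2
    quadrants = {}
    for (x, y) in points:
        key = (int(x >= x0 + half), int(y >= y0 + half))
        quadrants.setdefault(key, []).append((x, y))
    sub_cover = []
    for (qx, qy), pts in quadrants.items():
        sub_cover += cover(pts, k, max_level, level + 1,
                           x0 + qx * half, y0 + qy * half, half)
    # The root is always a node; any other cell needs at least K covering sub-cells.
    if len(sub_cover) >= k or level == 0:
        return [Node(level, x0, y0, size, sub_cover)]
    return sub_cover


def build_pk_tree(points, k, max_level):
    """Build the tree and return its root node."""
    return cover(points, k, max_level)[0]
```

Cells that are not K-instantiable never become nodes: their covering sub-cells are simply handed up to the nearest instantiated ancestor, which is how the intermediate nodes disappear.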

30
PK-tree

Properties
Bounds on node outdegree: allows allocating one node to a page.
Bounded storage space.
Existence and uniqueness: makes the behavior of a PK-tree easier to analyze.
Expected height: log(N) under some general conditions, which guarantees efficient retrieval and update.
No overlap among sibling nodes: efficient retrieval.
Empirical studies have shown that the PK-tree outperforms the SR-tree and X-tree by a wide margin.

31
PK-tree
Height of generated trees on 100,000 points
Size of index in MB
32
PK-tree
KNN query on clustered data distribution
33
PK-tree

Real data set: NASA Sky Telescope Data
200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)

34
Outline

STING: a statistical information grid approach to spatial data mining
STING+: an approach to active spatial data mining
PK-tree: a spatial index structure for high-dimensional point data
Temporal spatial pattern detection

35
Temporal Spatial Pattern Detection

When the number of attributes is large and/or the values of attributes evolve frequently, both the complexity of patterns and the number of potential patterns increase.
It is not desirable, or even feasible, to ask the user to specify the interesting patterns.
E.g., the user wants to know any possible patterns involving certain attributes such as salary, rent, cellular phone usage, etc.
Existing association rule algorithms cannot be applied:
Continuous attribute domains
Temporal evolution
Prior knowledge about relationships among attributes and objects

36
Temporal Spatial Pattern Detection

Object: represented by a point
Primitive attributes:
spatial attributes, i.e., the coordinates of its position
non-spatial attributes, e.g., name, weight, height, salary, rent
Derived attributes: derived from primitive attribute(s)
environment attributes, e.g., distance to a hospital, average income in the neighborhood area
Consider a sequence of snapshots S1, S2, …, Sn.
Temporal spatial pattern: describes a possible relationship among the evolution of attributes.
E.g., if the user wants to know patterns involving salary and distance to a big city, one interesting pattern would be: people receiving a raise tended to move further away from the big city from 1987 to 1993.
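
As a rough illustration of how one candidate pattern of this kind could be scored against two snapshots, the sketch below computes, among the objects whose first attribute increased, the fraction whose second attribute also increased. It assumes each snapshot is a dictionary from object id to attribute values. The function co_evolution_support and the attribute names are hypothetical, and this is only a scoring helper, not a pattern detection algorithm.

```python
def co_evolution_support(before, after, attr_a, attr_b):
    """Among objects whose attr_a increased, return the fraction whose attr_b increased.

    Returns None when no object present in both snapshots increased attr_a.
    """
    increased_a = [
        oid for oid in before
        if oid in after and after[oid][attr_a] > before[oid][attr_a]
    ]
    if not increased_a:
        return None
    increased_both = [
        oid for oid in increased_a
        if after[oid][attr_b] > before[oid][attr_b]
    ]
    return len(increased_both) / len(increased_a)


# Two toy snapshots (made-up values for illustration only).
s1 = {
    "p1": {"salary": 40, "distance": 5.0},
    "p2": {"salary": 50, "distance": 8.0},
    "p3": {"salary": 60, "distance": 3.0},
}
s2 = {
    "p1": {"salary": 55, "distance": 12.0},
    "p2": {"salary": 65, "distance": 8.0},
    "p3": {"salary": 60, "distance": 9.0},
}
```

Here two people received a raise and one of them moved further away, so the score for the pair (salary, distance) is 0.5.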

37
Temporal Spatial Pattern Detection

More complicated patterns
Patterns on clustering evolution
Patterns of high order
Patterns whose cause and consequence do not happen together: there is a delay before the consequence shows up.
Patterns involving relationships among objects, e.g., people who live far away from any doctor tend to move to a place closer to some doctor.
Environment variables evolve independently over
time.
