STING - PowerPoint PPT Presentation

About This Presentation

Title:

STING

Slides: 35

Provided by: kan76

Transcript and Presenter's Notes

Title: STING

1
STING

Kanav Kahol
CSE-591-Spring 2002
kanav_at_asu.edu
www.public.asu.edu/kkahol

2
What or who is STING?

The former lead singer of the band The Police,
who then took up a solo career and won many
Grammys.
The sting of a scorpion.
A Statistical Information Grid Approach to
Spatial Data Mining.
All of the above.

3
What is Spatial Data?

Many definitions according to specific areas
According to GIS
Spatial data may be thought of as features
located on or referenced to the Earth's surface,
such as roads, streams, political boundaries,
schools, land use classifications, property
ownership parcels, drinking water intakes,
pollution discharge sites - in short, anything
that can be mapped.
Geographic features are stored as a series of
coordinate values. Each point along a road or
other feature is defined by positional coordinate
value, such as longitude and latitude.
The GIS stores and manages the data not as a map
but as a series of layers or, as they are
sometimes called, themes

When viewed in a GIS, these layers visually
appear as one graphic, but are actually still
independent of each other. This allows changes to
specific themes without affecting the others.
Discussion Question 1: Can you define spatial
data generically?
4
What are Spatial Databases?

Spatial database systems aim at storing,
retrieving, manipulating, querying, and
analyzing geometric data.
Special data types are necessary to model
geometry and to suitably represent
geometric data in database systems. These data
types are usually called spatial
data types, such as point, line, and region but
also include more complex types like
partitions and graphs (networks).
Data Type understanding is a prerequisite for an
effective construction of important
components of a spatial database system (like
spatial index structures, optimizers
for spatial data, spatial query languages,
storage management, and graphical user
interfaces) and for a cooperation with
extensible DBMS providing spatial type
extension packages (like spatial data blades and
cartridges).
An excellent tutorial on spatial data and data types
is available at
http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html

5
Spatial Data Resources
Pennsylvania Spatial Data Access: http://www.pasda.psu.edu/
The Missouri Spatial Data Information Service: http://msdis.missouri.edu/
National Spatial Data Infrastructure: http://www.fgdc.gov/nsdi/nsdi.html
Michigan Department of Natural Resources Online: www.dnr.state.mi.us/spatialdatalibrary/
Georgia Spatial Data Infrastructure Home Page: www.gis.state.ga.us/
Free GIS Data - GIS Data Depot: www.gisdatadepot.com
6
Spatial Data Mining

Discovery of interesting characteristics and
patterns that may implicitly exist in spatial
databases.
The data are huge in volume and specialized in nature.
Clustering and region-oriented queries are common
problems in this domain.
We generally deal with high-dimensional data.
Applications: GIS, medical imaging, etc.

7
Problems????????

Huge Amount of Data Specialized in Nature

Complexity
Defining of geometric patterns and region
oriented queries
Conceptual nature of problem!
Spatial Data Accessing

8
STING-An Introduction

STING is a grid-based method to efficiently process many
common region-oriented queries on a set of points.
What defines a region? You tell me! Essentially, it is a set of
points satisfying some criterion.
It is a hierarchical method. The idea is to capture statistical
information associated with spatial cells in such a manner that
whole classes of queries can be answered without referring
to the individual objects.
Query complexity is hence even less than O(n). In fact, what do
you think it will be?
Link to paper: http://citeseer.nj.nec.com/wang97sting.html

9
Related Work
A great comparison of clustering algorithms:
http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf
10
Generalization Based Approaches

Two types: spatial-data-dominant and
non-spatial-data-dominant.
Both require that a generalization hierarchy is
given explicitly by experts or is somehow
generated automatically.
The quality of the mined data depends on the
structure of the hierarchy.
Computational complexity: O(n log n).
So the onus shifted to developing algorithms
that discover characteristics directly from the
data. This was the motivation to move to
clustering algorithms.

11
Clustering Based Approaches

BIRCH: already covered. Remember it? Complexity?
The problem with BIRCH is that it does not work
well with clusters that are not spherical.
DBSCAN: already covered. Remember it?
Complexity?
Determining the global parameter Eps in DBSCAN
requires human participation.
When the point set to be clustered is the
response set of objects with some qualifications,
Eps must be determined each time, so the cost is
higher.

12
Clustering Based Approaches

CLARANS: Clustering Large Applications based upon
RANdomized Search.
Although it has been claimed to be linear, it is
essentially quadratic.
The computational complexity is Ω(KN²), where N
is the number of data points and K is the number
of clusters.
The quality of the results cannot be guaranteed
when N is large, because randomized search is used.
See: Optimization with Randomized Search
Heuristics: The (A)NFL Theorem, Realistic
Scenarios, and Difficult Functions.

13
Related Work

All the approaches described in the previous
slides are query-dependent.
The structure of the queries influences the
structure of the algorithm, so the approach
cannot be generalized to all queries.
Because they scan all the data points, the
complexity is at least O(N).

14
STING THE OVERVIEW

The spatial area is divided into rectangular cells.
There are different levels of cells corresponding
to different resolutions, and these cells form a
hierarchical structure.
Each cell at a higher level is partitioned into a
number of cells at the next lower level.
Statistical information for each cell is
calculated and stored beforehand and is used to
answer queries.

15
GRID CELL HIERARCHY
Each cell at level i-1 has 4 children at level i
(this number can be changed).
The size of a leaf cell depends on the density of
objects. Generally, a leaf cell should hold from
several dozen to several thousand objects.
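With four children per cell, the cell containing a point at any level can be found by simple arithmetic. The following is a minimal sketch, assuming a rectangular area described by a hypothetical origin (x0, y0) and size (width, height), and assuming the root is numbered level 0. The function names are illustrative and do not come from the slides.

```python
def cell_index(x, y, level, x0=0.0, y0=0.0, width=1.0, height=1.0):
    """Return (row, col) of the cell that contains (x, y) at the given level.

    Level 0 is the single root cell; each level doubles the number of
    cells per side, so every cell has exactly 4 children.
    """
    k = 2 ** level  # cells per side at this level
    # Points on the upper boundary are clamped into the last cell.
    col = min(int((x - x0) / width * k), k - 1)
    row = min(int((y - y0) / height * k), k - 1)
    return row, col


def parent_index(row, col):
    """Index of the parent cell one level up."""
    return row // 2, col // 2


leaf = cell_index(0.6, 0.3, level=2)
print(leaf, parent_index(*leaf))
```

Because a child's index is obtained by halving, moving between layers needs no lookup table.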
16
GRID CELL HIERARCHY

For each cell we have attribute-dependent and
attribute-independent parameters.
The attribute-independent parameter is n, the
number of objects in the cell.
For the attribute-dependent parameters, it is
assumed that each object's attributes have
numerical values.
For each numerical attribute we have the
following five parameters:

17
GRID CELL HIERARCHY

m: the mean of all values in this cell
s: the standard deviation of all values in this cell
min: the minimum value of the attribute in this
cell
max: the maximum value of the attribute in this
cell
distribution: the type of distribution that the
attribute values in this cell follow (an
enumeration type)
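As an illustration of the parameters above, the following sketch computes them for one leaf cell and one numerical attribute. `CellStats` and `leaf_stats` are hypothetical names. The sketch assumes the population standard deviation, and `dist` is simply passed in by the caller rather than fitted.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int        # attribute-independent: number of objects in the cell
    m: float      # mean of the attribute values
    s: float      # standard deviation of the attribute values
    vmin: float   # minimum attribute value
    vmax: float   # maximum attribute value
    dist: str     # distribution type, e.g. "NORMAL" or "NONE"


def leaf_stats(values, dist="NONE"):
    """Compute the stored parameters of a leaf cell from its raw values."""
    n = len(values)
    if n == 0:
        # Convention chosen for this sketch: an empty cell stores zeros.
        return CellStats(0, 0.0, 0.0, 0.0, 0.0, "NONE")
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / n  # population variance
    return CellStats(n, m, math.sqrt(variance), min(values), max(values), dist)


stats = leaf_stats([1.0, 2.0, 3.0, 4.0], dist="NORMAL")
print(stats)
```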

18
Parameter Generation

The dist parameter of a parent cell is determined as follows.
First, dist is set to the distribution followed by
most points in the cell.
Then the number of conflicting points, confl, is estimated
from each child cell i (with parameters n_i, m_i, s_i, dist_i)
according to the following rules:
1) If dist_i ≠ dist, m_i ≈ m and s_i ≈ s, then confl is
increased by n_i.
2) If dist_i ≠ dist, and m_i ≈ m and s_i ≈ s do not both
hold, then confl is set to n.
3) If dist_i = dist, m_i ≈ m and s_i ≈ s, then confl is
not changed.
4) If dist_i = dist, and m_i ≈ m and s_i ≈ s do not both
hold, then confl is set to n.
Finally, if confl/n is greater than a threshold
(say 0.05), then dist is set to NONE; otherwise the
original dist is retained.
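The rules above can be sketched in code. The slide does not show how n, m, s, min and max are combined, so this sketch uses the standard pooled mean and variance identities. The "approximately equal" test is also not defined on the slide, so a relative tolerance `tol` is assumed here. The child values are made up. Only the counts (220 points, 10 of them not NORMAL) mirror the example on the next slide.

```python
import math


def merge_children(children, tol=0.1, threshold=0.05):
    """Derive a parent cell's parameters from its children's parameters.

    children: list of dicts with keys n, m, s, min, max, dist.
    tol: assumed relative tolerance for the "approximately equal" test.
    """
    children = [c for c in children if c["n"] > 0]  # ignore empty cells
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    # Pooled second moment minus squared mean; clamp rounding noise at 0.
    second = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n
    s = math.sqrt(max(0.0, second - m * m))

    # dist starts as the distribution followed by most points.
    votes = {}
    for c in children:
        votes[c["dist"]] = votes.get(c["dist"], 0) + c["n"]
    dist = max(votes, key=votes.get)

    confl = 0
    for c in children:
        close = (abs(c["m"] - m) <= tol * abs(m)
                 and abs(c["s"] - s) <= tol * s)
        if not close:
            confl = n          # rules 2 and 4
            break
        if c["dist"] != dist:
            confl += c["n"]    # rule 1 (rule 3 adds nothing)
    if confl / n > threshold:
        dist = "NONE"
    return {"n": n, "m": m, "s": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children),
            "dist": dist}


children = [
    {"n": 100, "m": 20.0, "s": 2.0, "min": 4.5, "max": 36.0, "dist": "NORMAL"},
    {"n": 50, "m": 20.5, "s": 2.0, "min": 5.5, "max": 34.0, "dist": "NORMAL"},
    {"n": 60, "m": 19.5, "s": 2.0, "min": 3.8, "max": 37.0, "dist": "NORMAL"},
    {"n": 10, "m": 20.0, "s": 2.0, "min": 7.0, "max": 40.0, "dist": "NONE"},
]
parent = merge_children(children)
print(parent["n"], parent["dist"])
```

Here confl ends at 10, and 10/220 is below the 0.05 threshold, so the parent keeps the NORMAL type.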

19
Parameter Generation
The parameters of the current cell are
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL.
This is so because there are 220 data points, of which
10 are not NORMAL, so confl/n = 10/220 = 0.045 < 0.05
and dist is still NORMAL.
The parameters are calculated only once, so the overall
computation time is O(N). Querying requires much less
time, as we only scan the grid cells: O(K), where K is
the number of grid cells.
20
Query Types

If the hierarchical structure cannot answer a query,
we can fall back to the underlying database.
An SQL-like language is used to describe queries.
Two types of queries are common: one finds regions
satisfying certain constraints; the other takes a
region and returns some attribute of that region.

21
Query Type Examples
22
Algorithm

Querying is top-down. Examine the cells at a higher
level and determine whether each cell is relevant to
the query at some confidence level. The likelihood of
relevance can be defined as the proportion of objects
in the cell that satisfy the query conditions. After
obtaining the confidence interval, we label the cell
as relevant or not relevant at the specified
confidence level.
After doing so for the present layer, the process is
repeated for the children of the RELEVANT cells in
the present layer only!
The procedure continues down to the bottom-most layer.
Find the regions formed by relevant cells and return
them.
If the result is not satisfactory, retrieve the data
that fall into the relevant cells from the database
and do some further processing.
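The layer-by-layer descent can be sketched as follows. `Cell`, `top_down_query` and the `is_relevant` predicate are hypothetical. A real relevance test would use the confidence interval described in the slides, whereas the example here just thresholds the object count. All leaves are assumed to sit at the same depth.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int                                   # number of objects in the cell
    children: list = field(default_factory=list)


def top_down_query(start_layer, is_relevant):
    """Return the bottom-layer cells reached through relevant cells only."""
    layer = list(start_layer)
    while layer:
        relevant = [c for c in layer if is_relevant(c)]
        # Only children of relevant cells are examined on the next layer.
        next_layer = [child for c in relevant for child in c.children]
        if not next_layer:
            return relevant
        layer = next_layer
    return []


c0 = Cell("c0", 50, [Cell("c0.0", 30), Cell("c0.1", 2),
                     Cell("c0.2", 15), Cell("c0.3", 3)])
c1 = Cell("c1", 5, [Cell("c1.0", 2), Cell("c1.1", 1),
                    Cell("c1.2", 1), Cell("c1.3", 1)])
root = Cell("root", 55, [c0, c1])

hits = top_down_query([root], lambda c: c.n >= 10)
print([c.name for c in hits])
```

The children of `c1` are never visited, which is where the saving over a full scan comes from.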

23
Algorithm

After all cells are labeled as relevant or not
relevant, we can easily find all regions that
satisfy the specified density by breadth-first
search.
For a relevant cell, we examine the cells within a
certain distance d from the center of the current
cell to see whether the average density within this
small area is greater than the specified density.
If yes, the cells are put into a queue.
Steps 2 and 3 are repeated for all the cells in
the queue, except that cells previously examined
are omitted.
When the queue is empty, we get one region.
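A minimal sketch of this breadth-first region growing on bottom-layer cells, under several assumptions that are not in the slides: cells are addressed by (row, col), the distance d is replaced by a hypothetical `reach` measured in whole cells, and the average density is taken over the relevant cells in the neighborhood only.

```python
from collections import deque


def find_regions(relevant, counts, cell_area, c, reach=1):
    """Group relevant cells into regions by breadth-first search.

    relevant: set of (row, col) cells labeled relevant.
    counts: dict mapping each relevant cell to its number of objects.
    c: required density (objects per unit area).
    """
    seen = set()
    regions = []
    for start in sorted(relevant):
        if start in seen:
            continue
        seen.add(start)
        region = []
        queue = deque([start])
        while queue:
            r, q = queue.popleft()
            region.append((r, q))
            # Relevant cells within `reach` cells, including the cell itself.
            near = [(r + dr, q + dq)
                    for dr in range(-reach, reach + 1)
                    for dq in range(-reach, reach + 1)
                    if (r + dr, q + dq) in relevant]
            density = sum(counts[p] for p in near) / (len(near) * cell_area)
            if density >= c:
                for p in near:
                    if p not in seen:   # skip cells already examined
                        seen.add(p)
                        queue.append(p)
        regions.append(region)  # the queue is empty: one region is complete
    return regions


relevant = {(0, 0), (0, 1), (3, 3)}
counts = {(0, 0): 10, (0, 1): 10, (3, 3): 10}
regions = find_regions(relevant, counts, cell_area=1.0, c=5.0)
print(regions)
```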

24
Algorithm

The distance d = max(l, √(f/(cπ))),
where l, c and f are the side length of a bottom-layer
cell, the specified density, and a small constant
set by STING (it does not vary from one query to
another).
l is usually the dominant term, so we generally only
need to examine the neighboring cells. Only when the
granularity is very small do we need to examine every
cell within that distance rather than just the
neighbors.
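Reading the garbled formula on this slide as d = max(l, sqrt(f / (c * pi))), the second term is the radius of a circle that would hold f objects at density c. A small sketch under that reading; the function name is hypothetical:

```python
import math


def neighborhood_distance(l, c, f):
    """Distance d within which neighboring cells are examined.

    l: side length of a bottom-layer cell
    c: density specified in the query
    f: small constant set by the system
    """
    return max(l, math.sqrt(f / (c * math.pi)))


# With coarse cells, l dominates; with very fine cells, the radius term does.
print(neighborhood_distance(l=1.0, c=10.0, f=4.0))
print(neighborhood_distance(l=0.1, c=1.0, f=4.0 * math.pi))
```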

25
Example
Given data: houses, one of whose attributes is price.
Query: find those regions with area at least A where
the number of houses per unit area is at least c and
at least a fraction β of the houses have a price
between a and b, with (1 − α) confidence, where
a < b. Here a could be −∞ and b could be +∞.
This query can be written in the SQL-like query language.
We begin from the top level, working our way down.
Assume the dist type is NORMAL. First we calculate
the proportion of houses whose price lies in [a, b].
The probability that the price lies between a and b is
p = Φ((b − m)/s) − Φ((a − m)/s),
where m and s are the mean and standard deviation of all
prices and Φ is the standard normal cumulative
distribution function.
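Under the NORMAL assumption, the probability that a price falls in [a, b] is a difference of two normal CDF values. A minimal sketch using Python's standard library; `prob_in_range` is a hypothetical helper name, and s > 0 is assumed.

```python
from statistics import NormalDist


def prob_in_range(m, s, a=float("-inf"), b=float("inf")):
    """P(a <= price <= b) when price is normal with mean m and std dev s."""
    dist = NormalDist(mu=m, sigma=s)
    return dist.cdf(b) - dist.cdf(a)


# Prices within one standard deviation of the mean.
p = prob_in_range(20.0, 2.0, 18.0, 22.0)
print(round(p, 4))
```

Leaving `a` or `b` at its default covers the open-ended ranges mentioned in the query.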
26
Example

Since we assume the prices are independent of each other
given m and s, the number of houses with price in [a, b]
has a binomial distribution with parameters n and p,
where n is the number of houses in the cell. We consider
the following cases according to n, np and n(1 − p):

a) When n ≤ 30, the binomial distribution is used directly to
determine the confidence interval of the number of houses
whose prices fall into [a, b]; dividing it by n gives the
confidence interval for the proportion.
b) When n > 30, np ≥ 5 and n(1 − p) ≥ 5, the proportion of
houses whose price falls in [a, b] is approximately normal
with mean p and standard deviation √(p(1 − p)/n).
The 100(1 − α)% confidence interval of the proportion is then
p ± z_{α/2} √(p(1 − p)/n).
c) When n > 30 but np < 5, the Poisson distribution with
parameter λ = np is used for approximation.
d) When n > 30 but n(1 − p) < 5, we can calculate the
proportion X of houses whose price is not in [a, b] using
the Poisson distribution with parameter n(1 − p);
1 − X is then the proportion of houses whose price
is in [a, b].
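The four cases can be sketched as one function. This is an illustration under assumptions, not the authors' implementation: the interval for the binomial and Poisson cases is taken as the central range leaving at most α/2 probability in each tail, the small-sample case is taken as n ≤ 30, and all helper names are hypothetical.

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def central_range(pmf, kmax, alpha):
    """Counts [lo, hi] leaving at most alpha/2 probability in each tail."""
    lo, cum = 0, 0.0
    for k in range(kmax + 1):
        if cum + pmf(k) > alpha / 2:
            lo = k
            break
        cum += pmf(k)
    hi, cum = kmax, 0.0
    for k in range(kmax, -1, -1):
        if cum + pmf(k) > alpha / 2:
            hi = k
            break
        cum += pmf(k)
    return lo, hi


def poisson_range(lam, n, alpha):
    # Truncate the negligible far tail to keep lam ** k within float range.
    kmax = min(n, int(lam + 10 * math.sqrt(lam)) + 20)
    return central_range(lambda k: poisson_pmf(k, lam), kmax, alpha)


def proportion_interval(n, p, alpha=0.05):
    """Estimated range [p1, p2] for the proportion of objects in [a, b]."""
    if n == 0:
        return 0.0, 0.0
    if n <= 30:                                   # case a: exact binomial
        lo, hi = central_range(lambda k: binom_pmf(k, n, p), n, alpha)
        return lo / n, hi / n
    if n * p >= 5 and n * (1 - p) >= 5:           # case b: normal approx.
        z = NormalDist().inv_cdf(1 - alpha / 2)
        half = z * math.sqrt(p * (1 - p) / n)
        return max(0.0, p - half), min(1.0, p + half)
    if n * p < 5:                                 # case c: Poisson on hits
        lo, hi = poisson_range(n * p, n, alpha)
        return lo / n, hi / n
    lo, hi = poisson_range(n * (1 - p), n, alpha)  # case d: Poisson on misses
    return 1 - hi / n, 1 - lo / n


print(proportion_interval(100, 0.5))
```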

27
Example

Once we have the confidence interval, or the
estimated range [p1, p2], we can label this cell
as relevant or not relevant.
Let S be the area of a cell at the bottom layer. If
p1 × n < S × c × β, we label the cell as not relevant;
otherwise we label it as relevant.

28
Analysis of STING

Step 1 takes constant time.
Steps 2 and 3 take total time proportional to the
total number of cells in the hierarchy.
The total number of cells is about 1.33K, where K
is the number of cells at the bottom layer.
In all cases the query time is found, or claimed,
to be O(K).
Discussion question: what is the complexity if we
need to go to step 7 of the algorithm?
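The 1.33K figure follows from a geometric series, assuming every cell has 4 children, so that each layer has a quarter as many cells as the layer below it. With h layers in total:

```latex
\[
\text{total cells}
  \;=\; K \sum_{i=0}^{h-1} \left(\frac{1}{4}\right)^{i}
  \;<\; K \sum_{i=0}^{\infty} \left(\frac{1}{4}\right)^{i}
  \;=\; \frac{K}{1 - \tfrac{1}{4}}
  \;=\; \frac{4}{3}K \;\approx\; 1.33K .
\]
```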

29
Quality

Under the following sufficient condition, STING
guarantees that if a region satisfies the
specification of the query, then it is returned.
Let F be a region. The width of F is defined as
the side length of the largest square that can
fit in F.

30
Limiting Behavior of STING

The regions returned by STING are an
approximation of the result of DBSCAN. As the
granularity approaches zero, the regions returned
by STING approach the result of DBSCAN.
So the worst-case complexity is O(n log n)!

31
Performance measure
Case A (Normal distribution): the example query is
answered in 0.2 sec; structure generation takes
9.8 sec.
Case B (distribution NONE): the example query is
answered in 0.22 sec; structure generation takes
9.7 sec.
32
Performance measure

A benchmark called SEQUOIA 2000 was used to compare
STING, DBSCAN and CLARANS.
All the previous algorithms have three phases in
query answering:
Find the query response
Build an auxiliary structure
Do the clustering
STING does all of this in one step, so it is
inherently better.

33
Discussion Question

STING is trivially parallelizable. Comment on why,
and on the importance of this statement.

34
References

STING: A Statistical Information Grid Approach to
Spatial Data Mining. Wei Wang et al.
Optimization with Randomized Search Heuristics:
The (A)NFL Theorem, Realistic Scenarios, and
Difficult Functions. Stefan Droste et al.
Efficient and Effective Clustering Methods for
Spatial Data Mining. R. Ng et al.
BIRCH: An Efficient Data Clustering Method for
Very Large Databases. T. Zhang et al.
Tutorial on spatial data types:
http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. A. Hinneburg et
al.
Comparison of clustering algorithms:
http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf