Title: STING
1. STING
- Kanav Kahol
- CSE-591-Spring 2002
- kanav_at_asu.edu
- www.public.asu.edu/kkahol
2. What or who is STING?
- A singer who fronted the band The Police, then took up a solo career and won many Grammys.
- The sting of a scorpion.
- A STatistical INformation Grid approach to spatial data mining.
- All of the above.
3. What is Spatial Data?
- There are many definitions, depending on the application area.
- According to GIS:
- Spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, and pollution discharge sites; in short, anything that can be mapped.
- Geographic features are stored as a series of coordinate values. Each point along a road or other feature is defined by a positional coordinate value, such as longitude and latitude.
- The GIS stores and manages the data not as a map but as a series of layers or, as they are sometimes called, themes.
- When viewed in a GIS, these layers visually appear as one graphic, but are actually still independent of each other. This allows changes to specific themes without affecting the others.
Discussion Question 1: So can you define spatial data generically?
4. What are Spatial Databases?
- Spatial database systems aim at storing, retrieving, manipulating, querying, and analyzing geometric data.
- Special data types are necessary to model geometry and to suitably represent geometric data in database systems. These data types are usually called spatial data types, such as point, line, and region, but also include more complex types like partitions and graphs (networks).
- Understanding these data types is a prerequisite for effectively constructing important components of a spatial database system (like spatial index structures, optimizers for spatial data, spatial query languages, storage management, and graphical user interfaces) and for cooperating with extensible DBMSs providing spatial type extension packages (like spatial data blades and cartridges).
- An excellent tutorial on spatial data and data types is available at http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
5. Spatial Data Resources
- Pennsylvania Spatial Data Access: http://www.pasda.psu.edu/
- The Missouri Spatial Data Information Service: http://msdis.missouri.edu/
- National Spatial Data Infrastructure: http://www.fgdc.gov/nsdi/nsdi.html
- Michigan Department of Natural Resources Online: www.dnr.state.mi.us/spatialdatalibrary/
- Georgia Spatial Data Infrastructure Home Page: www.gis.state.ga.us/
- Free GIS Data - GIS Data Depot: www.gisdatadepot.com
6. Spatial Data Mining
- Discovery of interesting characteristics and patterns that may implicitly exist in spatial databases.
- The data is huge in volume and specialized in nature.
- Clustering and region-oriented queries are common problems in this domain.
- We generally deal with high-dimensional data.
- Applications: GIS, medical imaging, etc.
7. Problems?
- Huge amount of data, specialized in nature
- Complexity
- Defining geometric patterns and region-oriented queries
- Conceptual nature of the problem
- Spatial data access
8. STING: An Introduction
- STING is a grid-based method to efficiently process many common region-oriented queries on a set of points.
- What defines a region? You tell me! Essentially, it is a set of points satisfying some criterion.
- It is a hierarchical method. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries can be answered without referring to the individual objects.
- Query complexity is hence even less than O(n). In fact, what do you think it will be?
- Link to paper: http://citeseer.nj.nec.com/wang97sting.html
9. Related Work
A great comparison of clustering algorithms: http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf
10. Generalization-Based Approaches
- Two types: spatial-data-dominant and non-spatial-data-dominant.
- Both require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically.
- The quality of the mined data depends on the structure of the hierarchy.
- Computational complexity: O(n log n).
- So the onus shifted to developing algorithms that discover characteristics directly from data. This was the motivation to move to clustering algorithms.
11. Clustering-Based Approaches
- BIRCH: already covered. Remember it? Complexity?
- The problem with BIRCH is that it does not work well with clusters that are not spherical.
- DBSCAN: already covered. Remember it? Complexity?
- Determining the global parameter Eps in DBSCAN requires human participation.
- When the point set to be clustered is the response set of objects with some qualifications, Eps must be determined each time, and the cost is hence higher.
12. Clustering-Based Approaches
- CLARANS: Clustering Large Applications based upon RANdomized Search.
- Although claims have been made that it is linear, it is essentially quadratic.
- The computational complexity is at least Ω(KN²), where N is the number of data points and K is the number of clusters.
- The quality of results cannot be guaranteed when N is large, because randomized search is used.
- See: Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions.
13. Related Work
- All the approaches described in the previous slides are query-dependent.
- The structure of the queries influences the structure of the algorithm, which cannot be generalized to all queries.
- As they scan all the data points, the complexity will be at least O(N).
14. STING: THE OVERVIEW
- The spatial area is divided into rectangular cells.
- There are different levels of cells corresponding to different resolutions, and these cells form a hierarchical structure.
- Each cell at a higher level is partitioned into a number of cells at the next lower level.
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
15. GRID CELL HIERARCHY
- Each cell at the (i-1)th level has 4 children at the ith level (this number can be changed).
- The size of a leaf cell depends on the density of objects. Generally, the average number of objects per leaf cell should be from several dozen to several thousand.
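The parent/child relationship can be sketched as simple index arithmetic. This is only an illustration, assuming a square area, exactly 4 children per cell, and level 0 as the single root cell; the function names `cell_index`, `parent` and `children` are hypothetical and do not come from the STING paper.

```python
def cell_index(x, y, level, width=1.0):
    """Return (row, col) of the cell containing (x, y) at the given level.

    Level 0 is the single root cell; level i has 2**i cells per side.
    Assumes 0 <= x, y < width.
    """
    cells_per_side = 2 ** level
    size = width / cells_per_side
    return int(y // size), int(x // size)


def parent(row, col):
    """Index of the parent cell one level up."""
    return row // 2, col // 2


def children(row, col):
    """Indices of the 4 child cells one level down."""
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]
```

With this layout, the parent of a point's cell at level i is the same point's cell at level i-1, so no explicit pointers are needed.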
16. GRID CELL HIERARCHY
- For each cell we have attribute-dependent and attribute-independent parameters.
- The attribute-independent parameter is n, the number of objects in the cell.
- For the attribute-dependent parameters, it is assumed that the attributes of each object have numerical values.
- For each numerical attribute we have the following five parameters:
17. GRID CELL HIERARCHY
- m: mean of all values in this cell
- s: standard deviation of all values in this cell
- min: the minimum value of the attribute in this cell
- max: the maximum value of the attribute in this cell
- distribution: the type of distribution that the attribute values in this cell follow (an enumeration type)
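As a rough sketch, the per-cell parameters for one attribute could be held in a small record and computed directly from the raw values of a bottom-level cell. The class name `CellStats`, the field names `vmin`/`vmax`, and the choice of the population standard deviation are assumptions made for this illustration; the distribution type is simply passed in, since how it is determined is not covered here.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass
class CellStats:
    n: int        # number of objects in the cell (attribute-independent)
    m: float      # mean of the attribute values
    s: float      # standard deviation of the attribute values
    vmin: float   # minimum attribute value
    vmax: float   # maximum attribute value
    dist: str     # enumeration value, e.g. "NORMAL" or "NONE"


def leaf_stats(values, dist="NONE"):
    """Compute the parameters of a bottom-level cell from its raw values.

    Assumes `values` is a non-empty sequence of numbers.
    """
    return CellStats(
        n=len(values),
        m=fmean(values),
        s=pstdev(values),  # population standard deviation (an assumption)
        vmin=min(values),
        vmax=max(values),
        dist=dist,
    )
```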
18. Parameter Generation
- The dist parameter of a parent cell is determined from its child cells as follows. Here dist_i, m_i, s_i and n_i are the parameters of child cell i, and dist, m, s and n are those of the parent.
- First, dist is set to the distribution type followed by most points.
- Then an estimate is made of the number of conflicting points, confl, according to the following rules:
- 1) If dist_i ≠ dist, m_i ≈ m and s_i ≈ s, then confl is increased by n_i.
- 2) If dist_i ≠ dist, but m_i ≈ m and s_i ≈ s do not both hold, then confl is set to n.
- 3) If dist_i = dist, m_i ≈ m and s_i ≈ s, then confl is not changed.
- 4) If dist_i = dist, but m_i ≈ m and s_i ≈ s do not both hold, then confl is set to n.
- Finally, if confl/n is greater than a threshold (say 0.05), then dist is set to NONE; otherwise the original dist is retained.
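The four rules can be sketched in a few lines. This is an illustration under stated assumptions: each child is a tuple `(n_i, m_i, s_i, dist_i)`, the parent's n, m and s are already known, and "approximately equal" is modelled by a hypothetical relative tolerance `rel_tol`, which the slides do not specify.

```python
def parent_dist(children, n, m, s, rel_tol=0.1, threshold=0.05):
    """Determine the parent's dist from child (n_i, m_i, s_i, dist_i) tuples."""
    # Step 1: pick the distribution type followed by most points.
    votes = {}
    for n_i, _, _, dist_i in children:
        votes[dist_i] = votes.get(dist_i, 0) + n_i
    dist = max(votes, key=votes.get)

    def close(a, b):
        # Hypothetical reading of "approximately equal".
        return abs(a - b) <= rel_tol * abs(b)

    # Step 2: estimate the number of conflicting points.
    confl = 0
    for n_i, m_i, s_i, dist_i in children:
        if not (close(m_i, m) and close(s_i, s)):
            confl = n        # rules 2 and 4: forces NONE for threshold < 1
            break
        if dist_i != dist:
            confl += n_i     # rule 1; rule 3 leaves confl unchanged
    # Step 3: compare the conflicting fraction with the threshold.
    return "NONE" if confl / n > threshold else dist
```

With 10 conflicting points out of 220 the fraction is about 0.045, so the majority type is kept; with 20 out of 220 it is about 0.091 and the result is NONE.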
19. Parameter Generation
The parameters of the current cell are: n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL.
This is so because there are 220 data points, of which 10 are not NORMAL. So confl/n = 10/220 = 0.045 < 0.05, hence dist is still NORMAL.
The parameters are calculated only once, so the overall computation time is O(N). Querying requires much less time, as we only scan the grid cells: O(K), where K is the number of grid cells.
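The slides do not show how n, m, s, min and max of a parent cell are obtained from its children. One plausible sketch uses the standard pooling identities for counts, means and population standard deviations; the function name `combine` and the tuple layout `(n_i, m_i, s_i, min_i, max_i)` are assumptions made for this illustration.

```python
from math import sqrt


def combine(children):
    """Combine child (n_i, m_i, s_i, min_i, max_i) tuples into parent values."""
    n = sum(c[0] for c in children)
    m = sum(c[0] * c[1] for c in children) / n
    # The parent's mean of squares is the weighted mean of (s_i^2 + m_i^2).
    mean_sq = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n
    s = sqrt(max(mean_sq - m * m, 0.0))  # clamp tiny negative rounding error
    lo = min(c[3] for c in children)
    hi = max(c[4] for c in children)
    return n, m, s, lo, hi
```

Because only the children's stored parameters are read, each parent is computed in constant time per child, which is consistent with the O(N) construction cost stated above.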
20. Query Types
- If the hierarchical structure cannot answer a query, we can fall back to the underlying database.
- An SQL-like language is used to describe queries.
- Two types of queries are common: one finds regions satisfying certain constraints, and the other takes a region and returns some attribute of that region.
21. Query Type Examples
22. Algorithm
- Top-down querying: examine cells at a higher level and determine whether each cell is relevant to the query at some confidence level. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After obtaining the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
- After doing so for the present layer, the process is repeated for the children of the RELEVANT cells in the present layer only!
- The procedure continues down to the bottom-most layer.
- Find the regions formed by the relevant cells and return them.
- If the result is not satisfactory, retrieve the data that fall into the relevant cells from the database and do some further processing.
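The layer-by-layer descent can be sketched as follows. This is a simplified illustration: the `Cell` class is hypothetical, all leaves are assumed to sit at the same depth, and the relevance test is supplied by the caller rather than derived from a confidence interval.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int                                   # number of objects in the cell
    children: list = field(default_factory=list)


def relevant_leaves(root, is_relevant):
    """Return bottom-level cells reached through relevant ancestors only."""
    layer = [root]
    while True:
        # Label the cells of the present layer and keep the relevant ones.
        layer = [c for c in layer if is_relevant(c)]
        if not layer or not layer[0].children:
            return layer                     # nothing left, or bottom layer
        # Descend only into the children of relevant cells.
        layer = [child for c in layer for child in c.children]
```

Subtrees under a cell labeled not relevant are never visited, which is where the savings over a full scan come from.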
23. Algorithm
- After all cells are labeled as relevant or not relevant, we can easily find all regions that satisfy the specified density using breadth-first search.
- For a relevant cell, we examine cells within a certain distance d from the center of the current cell to see whether the average density within this small area is greater than the specified density.
- If yes, the cells are put into a queue.
- Steps 2 and 3 are repeated for all the cells in the queue, except that cells previously examined are omitted.
- When the queue is empty, we have one region.
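The queue-based region growing can be sketched as a breadth-first search over the relevant bottom-level cells. This is a deliberate simplification and not the full procedure: cells are given as hypothetical `(row, col)` indices, only cells sharing an edge or corner are examined, and the average-density check within distance d is omitted.

```python
from collections import deque


def regions(relevant):
    """Group relevant bottom-level cells, given as (row, col), into regions."""
    unvisited = set(relevant)
    result = []
    while unvisited:
        start = unvisited.pop()
        queue = deque([start])
        region = {start}
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = (r + dr, c + dc)
                    if nb in unvisited:      # cells already examined are skipped
                        unvisited.remove(nb)
                        region.add(nb)
                        queue.append(nb)
        result.append(region)                # queue is empty: one region found
    return result
```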
24. Algorithm
- The distance d = max(l, √(f/(cπ))).
- l, c, and f are the side length of a bottom-layer cell, the specified density, and a small constant set by STING (it does not vary from one query to another).
- l is usually the dominant term, so we generally only have to examine the neighboring cells. Only if the granularity is very small do we need to examine every cell within that distance rather than just the neighbors.
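A small numeric sketch of d = max(l, √(f/(cπ))) shows when each term dominates. The function name `examine_distance` and the sample values of l, c and f are made up for illustration only.

```python
from math import pi, sqrt


def examine_distance(l, c, f):
    """Distance d = max(l, sqrt(f / (c * pi))).

    l: side length of a bottom-layer cell
    c: density specified by the query
    f: small constant set by STING
    """
    return max(l, sqrt(f / (c * pi)))


# Coarse grid: l dominates, so only neighboring cells are examined.
d_coarse = examine_distance(l=1.0, c=100, f=4)
# Very fine grid: the square-root term dominates.
d_fine = examine_distance(l=0.05, c=100, f=4)
```

The square-root term can be read as the radius of a circle that would hold f objects at density c, which is why it only matters when cells are smaller than that radius.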
25. Example
Given data: houses, one of whose attributes is price.
Query: Find those regions with area at least A where the number of houses per unit area is at least c and at least β × 100% of the houses have a price between a and b, with (1 - α) confidence, where a < b. Here, a could be -∞ and b could be +∞. This query can be written as
[query expression shown as a figure on the original slide]
We begin from the top level and work our way down. Assume the dist type is NORMAL. First we calculate the proportion of houses whose price lies in [a, b]. The probability that a price lies between a and b is
p = P(a ≤ price ≤ b) = Φ((b - m)/s) - Φ((a - m)/s),
where m and s are the mean and standard deviation of all prices in the cell and Φ is the standard normal cumulative distribution function.
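For the NORMAL case, the probability that a price lies in [a, b] is the difference of two normal CDF values, Φ((b - m)/s) - Φ((a - m)/s). A minimal sketch using Python's standard library follows; the function name `prob_in_range` and the sample numbers in the usage note are illustrative assumptions.

```python
from statistics import NormalDist


def prob_in_range(m, s, a=float("-inf"), b=float("inf")):
    """P(a <= price <= b) for price ~ Normal(mean=m, std=s), with s > 0."""
    dist = NormalDist(mu=m, sigma=s)
    return dist.cdf(b) - dist.cdf(a)
```

For example, `prob_in_range(20, 2, 18, 22)` is the probability of falling within one standard deviation of the mean, roughly 0.68. Only the stored m and s of the cell are needed, not the individual prices.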
26. Example
- Since we assume prices to be independent given m and s, the number of houses with a price in [a, b] has a binomial distribution with parameters n and p, where n is the number of houses. We consider the following cases according to n, np and n(1 - p):
- a) When n ≤ 30, the binomial distribution is used directly to determine the confidence interval of the number of houses whose prices fall into [a, b], which is divided by n to get the confidence interval for the proportion.
- b) When n > 30, np ≥ 5, and n(1 - p) ≥ 5, the proportion of prices falling in [a, b] approximately has a normal distribution N(p, √(p(1 - p)/n)). Then the 100(1 - α)% confidence interval of the proportion is p ± z_{α/2} √(p(1 - p)/n) = [p1, p2].
- c) When n > 30 but np < 5, the Poisson distribution with parameter λ = np is used for approximation.
- d) When n > 30 but n(1 - p) < 5, we can calculate the proportion of houses (X) whose price is not in [a, b] using the Poisson distribution with parameter λ = n(1 - p); 1 - X is then the proportion of houses whose price is in [a, b].
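The case analysis can be sketched as a small selector, together with the interval for the normal-approximation case only. The function names, the returned labels, the treatment of n = 30 as the binomial case, and the clipping of the interval to [0, 1] are assumptions of this sketch; the binomial and Poisson intervals are not implemented here.

```python
from math import sqrt
from statistics import NormalDist


def approximation_case(n, p):
    """Return which approximation applies for n houses and proportion p."""
    if n <= 30:
        return "binomial"            # case a
    if n * p < 5:
        return "poisson"             # case c
    if n * (1 - p) < 5:
        return "poisson-complement"  # case d
    return "normal"                  # case b


def normal_interval(n, p, alpha=0.05):
    """Case b: p +/- z_{alpha/2} * sqrt(p(1-p)/n), clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```

For n > 30 the two Poisson cases cannot both apply, since np and n(1 - p) sum to n.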
27. Example
- Once we have the confidence interval or the estimated range [p1, p2], we can label this cell as relevant or not relevant.
- Let S be the area of the cells at the bottom layer. If p1 × n < S × c × β, we label the cell as not relevant; otherwise we label it as relevant.
28. Analysis of STING
- Step 1 takes constant time.
- For steps 2 and 3, the total time is proportional to the total number of cells in the hierarchy.
- The total number of cells is about 1.33K, where K is the number of cells at the bottom layer.
- In all cases the query time is found, or claimed, to be O(K).
- Discussion question: what is the complexity if we need to go to step 7 of the algorithm?
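The 1.33K figure follows from a geometric series: with 4 children per cell, the layers hold K, K/4, K/16, ... cells, and the sum approaches 4K/3. A quick check, assuming a full hierarchy with a single root cell (the function name `total_cells` is illustrative):

```python
def total_cells(levels, fanout=4):
    """Number of cells in a full hierarchy with `levels` layers (root = 1 cell)."""
    return sum(fanout ** i for i in range(levels))


# With 6 layers the bottom layer has K = 4**5 = 1024 cells.
K = 4 ** 5
ratio = total_cells(6) / K   # approaches 4/3 as the number of layers grows
```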
29. Quality
- Under the following sufficient condition, STING guarantees that if a region satisfies the specification of the query, then it is returned.
- Let F be a region. The width of F is defined as the side length of the largest square that can fit in F.
30. Limiting Behavior of STING
- The regions returned by STING are an approximation of the result of DBSCAN. As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN.
- So the worst-case complexity is O(n log n)!
31. Performance measure
- Case A, Normal distribution: the example query was answered in 0.2 seconds; structure generation took 9.8 seconds.
- Case B, None: the example query was answered in 0.22 seconds; structure generation took 9.7 seconds.
32. Performance measure
- A benchmark called SEQUOIA 2000 was used to compare STING, DBSCAN and CLARANS.
- All the previous algorithms have three phases in query answering:
- Find the query response
- Build an auxiliary structure
- Do the clustering
- STING does all of this in one step, so it is inherently better.
33. Discussion Question
- STING is trivially parallelizable. Comment on why, and on the importance of this statement.
34. References
- STING: A Statistical Information Grid Approach to Spatial Data Mining. Wei Wang et al.
- Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. Stefan Droste et al.
- Efficient and Effective Clustering Method for Spatial Data Mining. R. Ng et al.
- BIRCH: An Efficient Data Clustering Method for Very Large Databases. T. Zhang et al.
- Tutorial on spatial data types: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
- An Efficient Approach to Clustering in Large Multimedia Databases with Noise. A. Hinneburg et al.
- Comparison of clustering algorithms: http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf