STING - PowerPoint PPT Presentation


1
STING
  • Kanav Kahol
  • CSE-591-Spring 2002
  • kanav_at_asu.edu
  • www.public.asu.edu/kkahol

2
What or who is STING?
  • A singer who was the lead singer of the band
    The Police, then took up a solo career and won
    many Grammys.
  • The bite of a scorpion.
  • A Statistical Information Grid Approach to
    Spatial Data Mining.
  • All of the above.

3
What is Spatial Data?
  • Many definitions according to specific areas
  • According to GIS
  • Spatial data may be thought of as features
    located on or referenced to the Earth's surface,
    such as roads, streams, political boundaries,
    schools, land use classifications, property
    ownership parcels, drinking water intakes,
    pollution discharge sites - in short, anything
    that can be mapped.
  • Geographic features are stored as a series of
    coordinate values. Each point along a road or
    other feature is defined by positional coordinate
    value, such as longitude and latitude.
  • The GIS stores and manages the data not as a map
    but as a series of layers or, as they are
    sometimes called, themes

When viewed in a GIS, these layers visually
appear as one graphic, but are actually still
independent of each other. This allows changes to
specific themes without affecting the others.
Discussion Question 1: So can you define spatial
data generically?
4
What are Spatial Databases?
  • Spatial database systems aim at storing,
    retrieving, manipulating, querying, and
    analyzing geometric data.
  • Special data types are necessary to model
    geometry and to suitably represent geometric
    data in database systems. These data types are
    usually called spatial data types, such as
    point, line, and region, but also include more
    complex types like partitions and graphs
    (networks).
  • Understanding these data types is a prerequisite
    for the effective construction of important
    components of a spatial database system (like
    spatial index structures, optimizers for spatial
    data, spatial query languages, storage
    management, and graphical user interfaces) and
    for cooperation with extensible DBMSs providing
    spatial type extension packages (like spatial
    data blades and cartridges).
  • Excellent tutorial on spatial data and data types
    available at
    http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html

5
Spatial Data Resources
  • Pennsylvania Spatial Data Access
    http://www.pasda.psu.edu/
  • The Missouri Spatial Data Information Service
    http://msdis.missouri.edu/
  • National Spatial Data Infrastructure
    http://www.fgdc.gov/nsdi/nsdi.html
  • Michigan Department of Natural Resources Online
    www.dnr.state.mi.us/spatialdatalibrary/
  • Georgia Spatial Data Infrastructure Home Page
    www.gis.state.ga.us/
  • Free GIS Data - GIS Data Depot
    www.gisdatadepot.com
6
Spatial Data Mining
  • Discovery of interesting characteristics and
    patterns that may implicitly exist in spatial
    databases.
  • Huge amount of data specialized in nature.
  • Clustering and region oriented queries are common
    problems in this domain.
  • We generally deal with high-dimensional data.
  • Applications: GIS, medical imaging, etc.

7
Problems????????
  • Huge Amount of Data Specialized in Nature
  • Complexity
  • Defining geometric patterns and region-oriented
    queries
  • Conceptual nature of problem!
  • Spatial Data Accessing

8
STING-An Introduction
  • STING is a grid-based method to efficiently
    process many common region-oriented queries on a
    set of points.
  • What defines a region? You tell me! Essentially
    it is a set of points satisfying some criterion.
  • It is a hierarchical method. The idea is to
    capture statistical information associated with
    spatial cells in such a manner that whole
    classes of queries can be answered without
    referring to the individual objects.
  • Complexity is hence even less than O(n). In fact,
    what do you think it will be?
  • Link to paper:
    http://citeseer.nj.nec.com/wang97sting.html

9
Related Work
Great comparison of clustering algorithms:
http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf
10
Generalization Based Approaches
  • Two types: spatial-data-dominant and
    non-spatial-data-dominant.
  • Both require that a generalization hierarchy is
    given explicitly by experts or is somehow
    generated automatically.
  • The quality of the mined data depends on the
    structure of the hierarchy.
  • Computational complexity: O(n log n)
  • So the onus shifted to developing algorithms
    that discover characteristics directly from the
    data. This was the motivation to move to
    clustering algorithms.

11
Clustering Based Approaches
  • BIRCH: already covered. Remember it? Complexity?
  • The problem with BIRCH is that it does not work
    well with clusters that are not spherical.
  • DBSCAN: already covered. Remember it?
    Complexity?
  • Determining the global parameter Eps in DBSCAN
    requires human participation.
  • When the point set to be clustered is the
    response set of objects with some qualifications,
    Eps must be determined each time, and the cost is
    hence higher.

12
Clustering Based Approaches
  • CLARANS: Clustering Large Applications based upon
    RANdomized Search.
  • Although claims have been made that it is linear,
    it is essentially quadratic.
  • The computational complexity is at least
    $\Omega(KN^2)$, where N is the number of data
    points and K is the number of clusters.
  • The quality of results cannot be guaranteed when
    N is large, because it uses randomized search.
  • Reference: Optimization with Randomized Search
    Heuristics: The (A)NFL Theorem, Realistic
    Scenarios, and Difficult Functions

13
Related Work
  • The approaches described in the previous slides
    are all query-dependent approaches.
  • The structure of the queries influences the
    structure of the algorithm, which cannot be
    generalized to all queries.
  • As they scan all the data points, the complexity
    will be at least O(N).

14
STING THE OVERVIEW
  • The spatial area is divided into rectangular
    cells.
  • There are different levels of cells corresponding
    to different resolutions, and these cells form a
    hierarchical structure.
  • Each cell at a higher level is partitioned into
    a number of cells at the next lower level.
  • Statistical information for each cell is
    calculated and stored beforehand and is used to
    answer queries.

15
GRID CELL HIERARCHY
Each cell at the (i-1)th level has 4 children at
the ith level (this number can be changed). The
size of a leaf cell depends on the density of
objects. Generally, the average number of objects
per leaf cell should range from several dozen to
several thousand.
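A minimal sketch of the cell hierarchy, assuming a square area of side `extent`, a fan-out of 4, and the root numbered as level 0 (the function names and the numbering are illustrative assumptions, not taken from the slides):

```python
def cell_of(x, y, level, extent=1.0):
    """Return (row, col) of the cell that contains point (x, y) at `level`.

    Assumption: level 0 is the single root cell and level i has 2**i cells
    per side, so every cell at level i-1 has exactly 4 children at level i.
    """
    side = 2 ** level
    # Clamp so that points lying exactly on the far border stay in the grid.
    col = min(int(x / extent * side), side - 1)
    row = min(int(y / extent * side), side - 1)
    return row, col


def children(row, col):
    """Return the four child cells (one level down) of cell (row, col)."""
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]


def parent(row, col):
    """Return the parent cell (one level up) of cell (row, col)."""
    return row // 2, col // 2
```

With this indexing, a point's cell at any level follows directly from its coordinates, so no search is needed to place an object in the hierarchy.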
16
GRID CELL HIERARCHY
  • For each cell we have attribute-dependent and
    attribute-independent parameters
  • The attribute-independent parameter is n, the
    number of objects in the cell.
  • For attribute-dependent parameters, it is assumed
    that each object's attributes have numerical
    values.
  • For each numerical attribute we have the
    following five parameters:

17
GRID CELL HIERARCHY
  • m: mean of all values in this cell
  • s: standard deviation of all values in this cell
  • min: the minimum value of the attribute in this
    cell
  • max: the maximum value of the attribute in this
    cell
  • distribution (dist): the type of distribution the
    attribute values in this cell follow (this is of
    enumeration type)
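One way these parameters could be computed is sketched below: leaf cells are summarized from their raw values, and a parent cell is summarized from its children's statistics alone, without revisiting the objects. The class and function names are hypothetical, s is taken to be the population standard deviation, and the dist parameter is left out here.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int          # number of objects in the cell
    m: float        # mean of the attribute values
    s: float        # (population) standard deviation
    minimum: float  # smallest attribute value
    maximum: float  # largest attribute value


def leaf_stats(values):
    """Summarize the attribute values of a non-empty leaf cell."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return CellStats(n, m, s, min(values), max(values))


def merge_stats(children):
    """Summarize a parent cell from the statistics of its non-empty children."""
    n = sum(c.n for c in children)
    m = sum(c.m * c.n for c in children) / n
    # E[x^2] of the parent, rebuilt from each child's variance and mean.
    second_moment = sum((c.s ** 2 + c.m ** 2) * c.n for c in children) / n
    s = math.sqrt(max(second_moment - m * m, 0.0))  # guard against rounding
    return CellStats(n, m, s,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))
```

Merging child statistics gives the same n, m, s, min, and max as summarizing all objects at once, which is why the hierarchy can be filled bottom-up in a single pass over the data.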

18
Parameter Generation
  • The dist parameter of a cell is determined from
    its child cells as follows ($dist_i$, $n_i$,
    $m_i$, $s_i$ are the parameters of child i).
  • First, dist is set to the distribution followed
    by most points.
  • Then an estimate is made of the number of
    conflicting points, confl, according to the
    following rules:
  • 1) If $dist_i \ne dist$, $m_i \approx m$ and
    $s_i \approx s$, then confl is increased by
    $n_i$.
  • 2) If $dist_i \ne dist$, and $m_i \approx m$ and
    $s_i \approx s$ do not both hold, then confl is
    set to n.
  • 3) If $dist_i = dist$, $m_i \approx m$ and
    $s_i \approx s$, then confl is not changed.
  • 4) If $dist_i = dist$, and $m_i \approx m$ and
    $s_i \approx s$ do not both hold, then confl is
    set to n.
  • Finally, if confl/n is greater than a threshold
    (say 0.05), then dist is set to NONE; otherwise
    the original dist is retained.
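The four rules can be sketched as follows. The slides do not say how "approximately equal" is decided, so the `close` helper and its 10% relative tolerance are assumptions made only for illustration, as are the function names and the tuple layout.

```python
def close(x, y, tol=0.1):
    """Hypothetical reading of 'approximately equal': relative gap <= tol."""
    return abs(x - y) <= tol * max(abs(x), abs(y), 1e-12)


def parent_dist(children, m, s, threshold=0.05):
    """Pick the dist of a parent cell with mean m and standard deviation s.

    children: list of (dist_i, n_i, m_i, s_i) tuples, one per child cell.
    """
    n = sum(n_i for _, n_i, _, _ in children)

    # Start from the distribution followed by most points.
    points_per_dist = {}
    for dist_i, n_i, _, _ in children:
        points_per_dist[dist_i] = points_per_dist.get(dist_i, 0) + n_i
    dist = max(points_per_dist, key=points_per_dist.get)

    confl = 0
    for dist_i, n_i, m_i, s_i in children:
        if not (close(m_i, m) and close(s_i, s)):
            confl = n          # rules 2 and 4: forces NONE below
            break
        if dist_i != dist:
            confl += n_i       # rule 1
        # rule 3: same dist and similar m, s leaves confl unchanged

    return "NONE" if confl / n > threshold else dist
```

For instance, with 220 points of which 10 sit in a child whose dist differs, confl/n is 10/220, which is below 0.05, so the majority distribution is kept.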

19
Parameter Generation
The parameters of the current cell are
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40,
dist = NORMAL
This is so because there are 220 data points, of
which 10 are not NORMAL. So confl/n = 10/220 =
0.045 < 0.05, hence dist is still NORMAL.
The parameters are calculated only once, so the
overall compilation time is O(N). Querying
requires much less time, as we only scan the grid
cells: O(K), where K is the number of grid cells.
20
Query Types
  • If the hierarchical structure cannot answer a
    query, we can go to the underlying database.
  • An SQL-like language is used to describe queries.
  • Two types of queries are common: one finds
    regions satisfying certain constraints, and the
    other takes a region and returns some attribute
    of the region.

21
Query Type Examples
22
Algorithm
  • Top-down querying: examine the cells at a higher
    level and determine whether each cell is relevant
    to the query at some confidence level. The
    likelihood can be defined as the proportion of
    objects in the cell that satisfy the query
    conditions. After obtaining the confidence
    interval, we label the cell as relevant or not
    relevant at the specified confidence level.
  • After doing so for the present layer, the process
    is repeated for the children of the RELEVANT
    cells in the present layer only!
  • The procedure continues down to the bottommost
    layer.
  • Find the regions formed by the relevant cells and
    return them.
  • If the result is not satisfactory, retrieve the
    data that fall into the relevant cells from the
    database and do some further processing.
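The top-down pass can be sketched as a layer-by-layer loop. Here `children_of` and `is_relevant` are hypothetical callbacks standing in for the stored hierarchy and for the statistical relevance test; neither name comes from the slides.

```python
def top_down_query(top_cells, children_of, is_relevant):
    """Return the bottom-layer cells labelled relevant.

    top_cells:          cells of the layer where the search starts
    children_of(cell):  child cells, or an empty list at the bottom layer
    is_relevant(cell):  True if the cell's stored statistics make it
                        relevant to the query at the chosen confidence
    """
    current = list(top_cells)
    while True:
        relevant = [cell for cell in current if is_relevant(cell)]
        # Only the children of relevant cells are examined in the next layer.
        next_layer = [ch for cell in relevant for ch in children_of(cell)]
        if not next_layer:
            return relevant
        current = next_layer
```

Because whole subtrees under non-relevant cells are skipped, the number of cells examined is bounded by the size of the hierarchy and is often far smaller.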

23
Algorithm
  • After all cells are labeled as relevant or not
    relevant, we can easily find all regions that
    satisfy the specified density by breadth-first
    search.
  • For a relevant cell, we examine cells within a
    certain distance d from the center of the current
    cell to see if the average density within this
    small area is greater than the density specified.
  • If yes, the cells are put into a queue.
  • Steps 2 and 3 are repeated for all the cells in
    the queue, except that cells previously examined
    are omitted.
  • When the queue is empty, we get one region.
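A simplified sketch of the breadth-first region growing is given below. It assumes the common case in which d reaches only the adjacent cells (those sharing an edge or a corner), and it treats the relevance label as already reflecting the density requirement, so the average-density check of step 2 is not repeated; both are simplifications for illustration.

```python
from collections import deque


def find_regions(relevant_cells):
    """Group relevant bottom-layer cells (row, col) into connected regions."""
    relevant = set(relevant_cells)
    seen = set()
    regions = []
    for start in sorted(relevant):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        region = []
        while queue:
            row, col = queue.popleft()
            region.append((row, col))
            # Examine the cells that share an edge or a corner.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (row + dr, col + dc)
                    if neighbor in relevant and neighbor not in seen:
                        seen.add(neighbor)   # previously examined cells are skipped
                        queue.append(neighbor)
        regions.append(sorted(region))       # queue empty: one region is complete
    return regions
```

Each cell enters the queue at most once, so the pass is linear in the number of relevant cells.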

24
Algorithm
  • The distance $d = \max\left(l, \sqrt{f/(c\pi)}\right)$
  • l, c, and f are the side length of a bottom-layer
    cell, the specified density, and a small constant
    number set by STING (it does not vary from one
    query to another).
  • l is usually the dominant term, so we generally
    only have to examine the neighboring cells. Only
    if the granularity is very small do we need to
    examine every cell within that distance rather
    than just the neighbors.
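Reading the distance as d = max(l, sqrt(f / (c * pi))), a short sketch shows when the search stays within the adjacent cells. The function names and sample numbers are made up for illustration.

```python
import math


def search_distance(l, c, f):
    """d = max(l, sqrt(f / (c * pi))).

    l: side length of a bottom-layer cell
    c: density specified in the query
    f: small constant set by the system
    """
    return max(l, math.sqrt(f / (c * math.pi)))


def reach_in_cells(l, c, f):
    """How many cell widths the distance d spans (1 means neighbors only)."""
    return math.ceil(search_distance(l, c, f) / l)
```

With l = 1.0, c = 10 and f = 1 the square-root term is about 0.18, so l dominates and only adjacent cells are examined; with l = 0.01 the same query spans 18 cell widths.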

25
Example
Given data: houses, one of whose attributes is
price.
Query: Find those regions with area at least A
where the number of houses per unit area is at
least c and at least a fraction $\beta$ of the
houses have price between a and b with
$(1 - \alpha)$ confidence, where a < b. Here, a
could be $-\infty$ and b could be $+\infty$.
This query can be written in the SQL-like query
language.
We begin from the top level, working our way down.
Assume the dist type is NORMAL. First we calculate
the proportion of houses whose price lies in
[a, b]. The probability that a price lies between
a and b is

$p = P(a \le \text{price} \le b) = \Phi\left(\frac{b - m}{s}\right) - \Phi\left(\frac{a - m}{s}\right)$

where m and s are the mean and standard deviation
of all prices in the cell and $\Phi$ is the
standard normal cumulative distribution function.
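Under the NORMAL assumption, this probability is a difference of two standard normal CDF values evaluated at (b - m)/s and (a - m)/s. A minimal sketch using only the standard library follows; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_range(a, b, m, s):
    """P(a <= price <= b) for a normal distribution with mean m and std s.

    a may be float('-inf') and b may be float('inf').
    """
    return normal_cdf((b - m) / s) - normal_cdf((a - m) / s)
```

For a cell with m = 20.27 and s = 2.37, the range from m - s to m + s gives p of roughly 0.68, the familiar one-sigma mass.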
26
Example
  • Since we assume the prices to be independent
    given m and s, the number of houses with price in
    [a, b] has a binomial distribution with
    parameters n and p, where n is the number of
    houses. We consider the following cases according
    to n, np, and n(1 - p):
  • a) When $n \le 30$, the binomial distribution is
    used to determine the confidence interval of the
    number of houses whose prices fall into [a, b],
    which is divided by n to get the confidence
    interval for the proportion.
  • b) When $n > 30$, $np \ge 5$, and
    $n(1 - p) \ge 5$, the proportion of prices that
    fall in [a, b] has approximately a normal
    distribution with mean $p$ and variance
    $p(1 - p)/n$. Then the $100(1 - \alpha)\%$
    confidence interval of the proportion is
    $p \pm z_{\alpha/2}\sqrt{p(1 - p)/n}$.
  • c) When $n > 30$ but $np < 5$, the Poisson
    distribution with parameter $\lambda = np$ is
    used for approximation.
  • d) When $n > 30$ but $n(1 - p) < 5$, we can
    calculate the proportion of houses (X) whose
    price is not in [a, b] using the Poisson
    distribution with parameter $n(1 - p)$; 1 - X is
    then the proportion of houses whose price is in
    [a, b].
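The case analysis can be sketched as a small dispatcher plus the interval for the normal-approximation case (b). The function names are hypothetical, n = 30 is treated as belonging to the small-sample case, and the interval is the standard normal-approximation interval for a proportion, clipped to [0, 1].

```python
import math
from statistics import NormalDist


def regime(n, p):
    """Pick which distribution is used for the confidence interval."""
    if n <= 30:
        return "binomial"            # case a
    if n * p < 5:
        return "poisson"             # case c
    if n * (1 - p) < 5:
        return "poisson-complement"  # case d
    return "normal"                  # case b


def normal_interval(n, p, alpha):
    """100(1 - alpha)% interval [p1, p2] for the proportion in case b."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For n = 100, p = 0.5 and alpha = 0.05 this yields an interval of roughly [0.40, 0.60].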

27
Example
  • Once we have the confidence interval or the
    estimated range $[p_1, p_2]$, we can label this
    cell as relevant or not relevant.
  • Let S be the area of a cell at the bottom layer.
    If $p_1 \times n < S \times c \times \beta$, we
    label the cell as not relevant; otherwise as
    relevant.

28
Analysis of STING
  • Step 1 takes constant time.
  • Steps 2 and 3: total time is proportional to the
    total number of cells in the hierarchy.
  • The total number of cells is about 1.33K, where K
    is the number of cells at the bottom layer.
  • In all cases it is found or claimed to be O(K).
  • Discussion Question: what is the complexity if we
    need to go to step 7 in the algorithm?
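The 1.33K figure follows from the fan-out of 4: each layer has a quarter as many cells as the one below, so the total is K(1 + 1/4 + 1/16 + ...), which stays below 4K/3. A quick check, assuming K is a power of the fan-out (the function name is illustrative):

```python
def total_cells(k, fanout=4):
    """Count the cells in all layers, given k cells at the bottom layer.

    Assumes k is a power of `fanout`, so every layer divides evenly.
    """
    total = 0
    layer = k
    while layer >= 1:
        total += layer
        layer //= fanout
    return total
```

For K = 1024 the layers hold 1024, 256, 64, 16, 4, and 1 cells, 1365 in total, or about 1.33K.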

29
Quality
  • Under the following sufficient condition, STING
    guarantees that if a region satisfies the
    specification of the query, then it is returned.
  • Let F be a region. The width of F is defined as
    the side length of the maximum square that can
    fit in F.

30
Limiting Behavior of STING
  • The regions returned by STING are an
    approximation of the result of DBSCAN. As the
    granularity approaches zero, the regions returned
    by STING approach the result of DBSCAN.
  • So the worst-case complexity is O(n log n)!

31
Performance measure
Case A, Normal distribution: the query in the
example is answered in 0.2 sec; structure
generation takes 9.8 sec.
Case A, None: the query in the example is answered
in 0.22 sec; structure generation takes 9.7 sec.
32
Performance measure
  • Used a benchmark called SEQUOIA 2000 to compare
    STING, DBSCAN, and CLARANS.
  • All the previous algorithms have three phases in
    query answering:
  • 1) Find the query response
  • 2) Build the auxiliary structure
  • 3) Do the clustering
  • STING does all of this in one step, so it is
    inherently better.

33
Discussion Question
  • STING is trivially parallelizable. Comment why
    and what is the importance of this statement?

34
References
  • STING: A Statistical Information Grid Approach to
    Spatial Data Mining. Wei Wang et al.
  • Optimization with Randomized Search Heuristics:
    The (A)NFL Theorem, Realistic Scenarios, and
    Difficult Functions. Stefan Droste et al.
  • Efficient and Effective Clustering Method for
    Spatial Data Mining. R. Ng et al.
  • BIRCH: An Efficient Data Clustering Method for
    Very Large Databases. T. Zhang et al.
  • Tutorial on spatial data types:
    http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
  • An Efficient Approach to Clustering in Large
    Multimedia Databases with Noise. A. Hinneburg et
    al.
  • Comparison of clustering algorithms:
    http://www.cs.ualberta.ca/joerg/papers/KI-Journal.pdf