Spatial Temporal Data Mining - PowerPoint PPT Presentation

Title: Spatial Temporal Data Mining


1
Spatial Temporal Data Mining
  • Wei Wang
  • Data Mining Lab, UCLA
  • January 21, 1999

2
Outline
  • Introduction
  • Statistical Clustering
  • User-defined Trigger
  • Spatial Index Structure for High Dimensional
    Point Data
  • Temporal Spatial Pattern Detection
  • Ongoing research

3
Introduction
  • Spatial data mining has been an active research
    area in recent years.
  • For some well-known problems, e.g., clustering,
    many existing algorithms are not efficient
    enough.
  • There is still room for improvement.
  • Many interesting problems remain
    uninvestigated.
  • We classify a subset of these problems and try to
    solve them efficiently.

4
Outline
  • STING: a statistical information grid approach to
    spatial data mining
  • STING+: an approach to active spatial data mining
  • PK-tree: a spatial index structure for high
    dimensional point data
  • Temporal spatial pattern detection

5
STING
  • Spatial databases are usually huge.
  • Efficiency of the data mining algorithm is
    crucial.
  • Example: each person is an object.
  • Query: find high income areas within California.
  • High income: salary > 50,000
  • Area > 4 square miles
  • Traditional method:
  • Step 1: Select all persons whose salary is
    high.
  • Step 2: Do clustering analysis on the selected
    persons.
  • Step 3: Form the region that each cluster
    occupies.
  • Step 4: Return those regions larger than 4 square
    miles.
  • If a high income area is defined as one where 80%
    of the persons have salary > 50,000,
    then the previous method cannot even answer the
    query, because Step 1 discards the persons needed
    to compute the percentage.
  • STING was proposed to solve such problems
    efficiently.

6
STING
  • Region query example:
  • Select the maximal regions that have at least 100
    houses per unit area, in which at least 70% of the
    house prices are above 400K, with total area
    at least 100 units, with 90% confidence.

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
  AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
  AND AREA (100, ∞)
  AND WITH CONFIDENCE 0.9

7
STING
  • Objects are represented by points, each of which
    has spatial attributes (its location)
    and non-spatial (numerical) attributes.
  • Space is recursively divided into smaller
    rectangular cells until a certain level is
    reached.
  • A hierarchical structure is employed.
  • The average number of objects in a leaf cell
    ranges from several dozen to several
    thousand.
  • The data is preprocessed to capture statistical
    information about each cell.

8
STING
[Figure: the hierarchical structure, from the 1st layer (root) through the (i-1)th and ith layers to the (i+1)th layer (leaf layer)]

9
STING
  • For each cell, we have:
  • An attribute-independent parameter:
  • n: number of objects
  • Attribute-dependent parameters (for each
    numerical attribute):
  • mean: mean value of the attribute
  • std: standard deviation of the attribute value
  • min: the minimum value of the attribute
  • max: the maximum value of the attribute
  • distribution: the type of distribution that best
    fits the attribute value (can be NONE)
  • Parameters are generated bottom-up when the data
    is loaded into the database.
  • Generation time is linear in the number of
    objects.
  • Only has to be done once, not for each query.
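
The bottom-up generation can be sketched as follows. This is a minimal illustration, assuming one numerical attribute and a population standard deviation; the names CellStats, leaf_stats, and merge_stats are hypothetical, and the distribution parameter is omitted. A parent's parameters are derived from its children's parameters alone, without revisiting the raw data.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int        # number of objects in the cell
    mean: float   # mean of the attribute
    std: float    # population standard deviation of the attribute
    min: float
    max: float


def leaf_stats(values):
    """Compute the parameters of a leaf cell directly from its attribute values."""
    n = len(values)
    if n == 0:
        return CellStats(0, 0.0, 0.0, math.inf, -math.inf)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))


def merge_stats(children):
    """Derive a parent cell's parameters from its children's parameters only."""
    n = sum(c.n for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, math.inf, -math.inf)
    mean = sum(c.mean * c.n for c in children) / n
    # E[x^2] of the parent is the n-weighted average of (std^2 + mean^2) over children.
    second_moment = sum((c.std ** 2 + c.mean ** 2) * c.n for c in children) / n
    std = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.min for c in children),
                     max(c.max for c in children))
```

Because each cell is computed once from a constant number of children, the whole hierarchy can be filled in with a single pass over the data plus a pass over the cells.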

10
STING
  • Take advantage of the statistical information
    captured.
  • Only go through relevant cells at each level.
  • The root is relevant.
  • For each relevant cell, we examine its children at
    the next level with a statistical test and label
    them as relevant or not relevant.
  • Form regions from relevant leaf-level cells.
  • There is no need to access the full database.
  • This makes query processing very efficient.
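
The top-down traversal can be sketched as follows for a density-only query. This is a simplified illustration, not the statistical test used by STING: the relevance check here is a plain counting bound (a child holding fewer objects than one qualifying leaf would need cannot contain a qualifying leaf). The Cell class, relevant_leaves, and the assumption that all leaves share the same area are hypothetical, and the final step of forming regions from adjacent leaves is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int                  # number of objects in the cell
    area: float
    children: list = field(default_factory=list)


def relevant_leaves(root, min_density, leaf_area):
    """Return names of leaf cells with density >= min_density,
    visiting only cells labeled relevant at each level."""
    result = []
    frontier = [root]       # the root is always treated as relevant
    while frontier:
        next_level = []
        for cell in frontier:
            if not cell.children:
                # Leaf level: apply the actual density condition.
                if cell.n / cell.area >= min_density:
                    result.append(cell.name)
                continue
            for child in cell.children:
                # Simplified relevance test (assumption): prune children that
                # cannot contain even one leaf dense enough to qualify.
                if child.n >= min_density * leaf_area:
                    next_level.append(child)
        frontier = next_level
    return result
```

Only the per-cell parameters are read, so the raw objects are never touched during the query.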

11
STING
  • The computational complexity of STING is linearly
    proportional to the number of leaf cells.
  • We used the SEQUOIA 2000 benchmark as the data
    set to compare the performance of STING with
    other approaches.

12
STING
  • STING is a query-independent approach.
  • The statistical information exists independently
    of queries.
  • STING has a much smaller response time than
    other approaches.
  • The computational complexity is linearly
    proportional to the number of leaf cells.
  • I/O cost is low.
  • STING can support different resolutions of query
    results.
  • Regions returned by STING approach those returned
    by DBSCAN as the granularity approaches zero.
  • Parameters in the hierarchical structure can be
    maintained efficiently by incremental update.
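
Incremental maintenance for an insertion might look like the sketch below. It assumes each cell stores (n, mean, std, min, max) for one attribute, with std as a population standard deviation; the function names are hypothetical. Deletions are not covered, since min and max cannot be restored from the stored parameters alone.

```python
import math


def insert_value(stats, x):
    """Update (n, mean, std, min, max) of one cell for one inserted attribute value."""
    n, mean, std, lo, hi = stats
    # Recover the running sum of squares from the stored parameters, then add x^2.
    sum_sq = (std ** 2 + mean ** 2) * n + x ** 2
    n += 1
    mean = mean + (x - mean) / n
    std = math.sqrt(max(sum_sq / n - mean ** 2, 0.0))
    return (n, mean, std, min(lo, x), max(hi, x))


def insert_along_path(path, x):
    """Apply the update to every cell on the leaf-to-root path (a list of stats tuples)."""
    return [insert_value(stats, x) for stats in path]
```

Each insertion touches one cell per level, so the cost is proportional to the height of the hierarchy.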

13
Outline
  • STING: a statistical information grid approach to
    spatial data mining
  • STING+: an approach to active spatial data mining
  • PK-tree: a spatial index structure for high
    dimensional point data
  • Temporal spatial pattern detection

14
STING+
  • Moreover, since objects evolve, interesting
    patterns may emerge or disappear over time.
  • Example:
  • Trigger: do bandwidth reallocation when the
    average call length is greater than 10 minutes
    within the region where at least 10 cellular
    phones are in use per square mile.
  • This cannot be supported efficiently by
    traditional database triggers, because the class
    membership of an object is determined not only by
    its non-spatial attributes but also by the
    attributes of objects in its neighborhood.
  • Naïve approach: re-evaluate the condition
    periodically.
  • Not efficient.

15
STING+
  • STING+ is an extension of STING that supports
    user-defined triggers.
  • In spatial databases, object insertion, deletion,
    and update are primitive events.
  • Observation: usually, only the cumulative effect
    of a set of primitive events may cause the
    trigger condition to become true.
  • We refer to such a set of primitive events as a
    composite event.

16
STING+
  • Condition-Action paradigm
  • In general, it is difficult or even impossible
    for the user to specify all possible composite
    events that may cause the trigger condition to
    become true.
  • Evaluating a user-defined trigger T
    usually involves two aspects:
  • Find a set of composite events E(s) that may
    cause the trigger condition CT to become true.
  • Each time some composite event in E(s) occurs,
    check the status (false or true) of CT (given
    that CT was false previously).

17
STING+
  • Observation: as a side effect of the occurrence
    of some composite event, the set of composite
    events E(s) that could cause CT to transition
    from false to true might also evolve over time.
  • There are two sets of composite events we need to
    consider:
  • The set of composite events E(s) that can cause
    CT to become true
  • When one occurs, we need to re-evaluate CT.
  • The set of composite events F(s) that can cause a
    change to E(s)
  • When one occurs, we need to update E(s).

18
STING+
  • Observation: in spatial databases, the effect of
    an event is usually local to its neighborhood.
  • STING+ decomposes the user-defined trigger into a
    set of sub-triggers associated with individual
    cells in the hierarchical structure.
  • These sub-triggers are used to monitor composite
    events in E(s) and F(s), and they change
    accordingly when E(s) and F(s) evolve.

19
STING+
  • Updates are suspended at some level in the
    hierarchy until such time that the cumulative
    effect of these updates might cause the trigger
    condition to become satisfied.
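
The idea of suspending updates at a cell can be illustrated with the sketch below. It is a deliberately reduced model, assuming a density-only condition and insertions only; the SubTrigger class and its on_fire callback (standing in for "propagate to the next level of the hierarchy") are hypothetical and do not reproduce the full handling of E(s) and F(s).

```python
class SubTrigger:
    """Sub-trigger on one cell: absorbs insertions locally until their
    cumulative effect could satisfy the density condition."""

    def __init__(self, count, area, min_density, on_fire):
        self.count = count                   # objects currently in the cell
        self.needed = min_density * area     # count at which the cell becomes dense enough
        self.on_fire = on_fire               # called once when the threshold is reached
        self.fired = count >= self.needed

    def insert(self):
        """Record one inserted object; notify upward only when the threshold is crossed."""
        self.count += 1
        if not self.fired and self.count >= self.needed:
            self.fired = True
            self.on_fire(self)
```

Until the threshold is crossed, insertions cost one counter increment and nothing above the cell is re-evaluated.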

20
STING+
  • Example: trigger bandwidth reallocation when the
    total area of the qualifying regions in
    California increases by at least 10 square miles.
    A region qualifies if at least 10 cellular phones
    are in use per square mile, the average length of
    phone calls is at least 15 minutes, and its total
    area is at least 50 square miles.

DEFINE TRIGGER example ON cellular-phone
WHEN SELECT SIZE(REGION) INCREASE RANGE (10, ∞)
     WHERE DENSITY IN RANGE (10, ∞)
       AND AVERAGE(length) IN RANGE (15, ∞)
       AND AREA IN RANGE (50, ∞)
     LOCATION California
DO bandwidth-reallocation

21
STING+
  • Observation: the trigger condition CT is a
    conjunction of predicates P1 ∧ P2 ∧ … ∧ Pn and
    cannot be true if any one predicate is false.
  • The predicates can be evaluated in a certain
    order: the ith predicate is tested only when all
    previous i-1 predicates are true.
  • The evaluation order should be chosen so that
    the total cost is minimized.
  • STING+ evaluates CT in the order location,
    density condition, attribute condition, each of
    which is evaluated in a different phase.
  • Location only needs to be evaluated once, and its
    cost can be regarded as constant in the trigger
    evaluation process.
  • If the location is fixed, unnecessary
    sub-triggers on cells outside the location
    can be avoided, which saves the evaluation cost
    of the other predicates.
  • Sub-triggers set during an earlier phase
    exist longer than those set in a later phase.
  • It is therefore better to first evaluate the
    predicate that takes less time to handle.
  • cost(density) < cost(attribute)
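
The ordered evaluation of a conjunction can be sketched as follows. The function name, the dictionary layout of a cell, and the cost numbers in the usage below are made up for illustration; only the ordering (location, then density, then attribute) comes from the slide.

```python
def evaluate_conjunction(predicates, cell):
    """Evaluate predicates in the given order and stop at the first false one.

    `predicates` is a list of (name, cost, test) tuples.
    Returns (result, total_cost, names_evaluated).
    """
    total_cost = 0
    evaluated = []
    for name, cost, test in predicates:
        total_cost += cost
        evaluated.append(name)
        if not test(cell):
            # One false predicate makes the whole conjunction false,
            # so the remaining (more expensive) predicates are skipped.
            return False, total_cost, evaluated
    return True, total_cost, evaluated


# Hypothetical cell and costs, ordered location -> density -> attribute.
cell = {"in_california": True, "density": 4, "avg_length": 20}
predicates = [
    ("location", 1, lambda c: c["in_california"]),
    ("density", 2, lambda c: c["density"] >= 10),
    ("attribute", 5, lambda c: c["avg_length"] >= 15),
]
result, cost, evaluated = evaluate_conjunction(predicates, cell)
```

Here the density predicate fails, so the costlier attribute predicate is never evaluated for this cell.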

22
STING+
  • Average CPU cycles for handling each type of
    sub-trigger

23
STING+

24
Outline
  • STING: a statistical information grid approach to
    spatial data mining
  • STING+: an approach to active spatial data mining
  • PK-tree: a spatial index structure for high
    dimensional point data
  • Temporal spatial pattern detection

25
PK-tree
  • When both the number of objects and the number of
    attributes are very large, it is essential to
    organize the set of objects with a dynamic
    indexing structure.
  • Point index methods

26
PK-tree
Spatial decomposition: space is recursively
divided down to a level LD at which each cell
contains at most one point.

27
PK-tree
16 intermediate nodes, height 3
28
PK-tree
5 intermediate nodes, height 2
29
PK-tree
  • The PK-tree employs the concept of a
    K-instantiable cell to eliminate unnecessary
    nodes.
  • Point cell: a non-empty cell at level LD
  • A cell C is K-instantiable iff
  • C is a point cell, or
  • no set of (K-1) or fewer K-instantiable
    sub-cells covers all non-empty space in C.
  • Only K-instantiable cells serve as nodes in the
    PK-tree (except the root).
  • The parent-child relationship follows naturally
    from the cell-subcell relationship.
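
The K-instantiable definition can be turned into a small static construction, sketched below. It assumes two-dimensional points in the square [0, size) x [0, size), a quadtree-style decomposition into four sub-cells per level, and a max_depth playing the role of level LD; the Node class and function names are hypothetical. The PK-tree itself is a dynamic structure with insertion and deletion, which this sketch does not cover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    x0: float
    y0: float
    size: float
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)   # filled only for point cells


def _cover_quadrants(points, x0, y0, size, depth, max_depth, k):
    """Maximal K-instantiable cells among the proper sub-cells of a cell."""
    half = size / 2
    cover = []
    for cx in (x0, x0 + half):
        for cy in (y0, y0 + half):
            inside = [p for p in points
                      if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            cover.extend(_cover(inside, cx, cy, half, depth + 1, max_depth, k))
    return cover


def _cover(points, x0, y0, size, depth, max_depth, k):
    """Return the cell itself if it is K-instantiable, otherwise the
    K-instantiable sub-cells that cover its non-empty space."""
    if not points:
        return []
    if depth == max_depth:
        return [Node(x0, y0, size, points=list(points))]    # point cell
    subs = _cover_quadrants(points, x0, y0, size, depth, max_depth, k)
    if len(subs) >= k:
        return [Node(x0, y0, size, children=subs)]          # needs K or more cells to cover
    return subs                                             # skipped: not K-instantiable


def build_pk_tree(points, size, max_depth, k):
    """The root is always a node, whether or not it is K-instantiable."""
    return Node(0.0, 0.0, size,
                children=_cover_quadrants(points, 0.0, 0.0, size, 0, max_depth, k))
```

Cells that fewer than K instantiable sub-cells can cover are skipped, and their sub-cells are attached directly to the nearest instantiable ancestor, which is how intermediate nodes are eliminated.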

30
PK-tree
  • Properties
  • Bounds on node outdegree
  • allows allocating one node to a page
  • Bounded storage space
  • Existence and uniqueness
  • makes the behavior of a PK-tree easier to
    analyze
  • Expected height
  • O(log N) under some general conditions
  • guarantees efficiency of retrieval and update
  • No overlap among sibling nodes
  • efficient retrieval
  • Empirical studies show that the PK-tree
    outperforms the SR-tree and X-tree by a wide
    margin.

31
PK-tree
Height of generated trees on 100,000 points
Size of index in MB
32
PK-tree
KNN query on clustered data distribution
33
PK-tree
  • Real data set: NASA Sky Telescope Data
  • 200,000 two-dimensional points (the
    coordinates of crater locations on the surface of
    Mars)

34
Outline
  • STING: a statistical information grid approach to
    spatial data mining
  • STING+: an approach to active spatial data mining
  • PK-tree: a spatial index structure for high
    dimensional point data
  • Temporal spatial pattern detection

35
Temporal Spatial Pattern Detection
  • When the number of attributes is large and/or the
    values of attributes evolve frequently, the
    complexity of patterns and the number of
    potential patterns increase.
  • It is not desirable, or even feasible, to ask the
    user to specify interesting patterns.
  • E.g., the user wants to know any possible
    patterns involving certain attributes such as
    salary, rent, cellular phone usage, etc.
  • Existing association rule algorithms cannot be
    applied.
  • Continuous attribute domains
  • Temporal evolution
  • Prior knowledge about relationships among
    attributes and objects

36
Temporal Spatial Pattern Detection
  • Object: represented by a point
  • Primitive attributes
  • spatial attributes, i.e., the coordinates of its
    position
  • non-spatial attributes, e.g., name, weight,
    height, salary, rent
  • Derived attributes: derived from primitive
    attribute(s)
  • environment attributes, e.g., distance to a
    hospital, average income in the neighborhood area
  • Consider a sequence of snapshots S1, S2, …, Sn
  • Temporal spatial pattern
  • describes a possible relationship among the
    evolution of attributes
  • E.g., if the user wants to know patterns involving
    salary and distance to a big city, then one
    interesting pattern would be: people receiving a
    raise tend to move further away from the big
    city from 1987 to 1993.
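
Checking one candidate pattern of this kind against two snapshots can be sketched as below. This only measures support for a pattern that is already given; it is not a pattern detection algorithm. The snapshot layout (person id mapped to x, y, salary), the function names, and the toy values in any usage are assumptions made for illustration. The distance to the city is the derived environment attribute.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def raise_and_move_away(first, last, city):
    """Fraction of people whose salary rose between two snapshots and who
    also ended up further from `city`.

    Each snapshot maps a person id to a tuple (x, y, salary).
    """
    raised = 0
    moved_away = 0
    for pid, (x0, y0, s0) in first.items():
        if pid not in last:
            continue                      # person absent from the later snapshot
        x1, y1, s1 = last[pid]
        if s1 > s0:
            raised += 1
            if distance((x1, y1), city) > distance((x0, y0), city):
                moved_away += 1
    return moved_away / raised if raised else 0.0
```

A fraction close to 1 over many objects would suggest that the pattern holds between the two snapshots; a sequence S1, …, Sn could be handled by comparing consecutive or endpoint snapshots.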

37
Temporal Spatial Pattern Detection
  • More complicated patterns
  • Patterns on clustering evolution
  • Patterns of high order
  • Patterns whose cause and consequence do not
    happen together
  • There is a delay for the consequence to show up.
  • Patterns involving relationships among objects
  • e.g., people who live far away from any doctor
    tend to move to a place closer to some doctor.
  • Environment variables evolve independently over
    time.