What we have covered? - PowerPoint PPT Presentation

About This Presentation

Title:

What we have covered?

Description:

Objects have spatial extent with location and boundary. DB typically uses geometric approximations constructed ... Must support inserts and deletes gracefully. ... – PowerPoint PPT presentation

Slides: 57

Provided by: jeff466

Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes

Title: What we have covered?

1
What we have covered?

Indexing and Hashing
Data warehouse and OLAP
Data Mining
Information Retrieval and Web Mining
XML and XQuery
Spatial Databases
Transaction Management

2
Lecture 6: Spatial Data Management
3
Types of Spatial Data

Point Data
Points in a multidimensional space
E.g., Raster data such as satellite imagery,
where each pixel stores a measured value
E.g., Feature vectors extracted from text
Region Data
Objects have spatial extent with location and
boundary
DB typically uses geometric approximations
constructed using line segments, polygons, etc.,
called vector data.

4
Applications of Spatial Data

Geographic Information Systems (GIS)
E.g., ESRI's ArcInfo, OpenGIS Consortium
Geospatial information
All classes of spatial queries and data are
common
Computer-Aided Design/Manufacturing
Store spatial objects such as surface of airplane
fuselage
Range queries and spatial join queries are common
Multimedia Databases
Images, video, text, etc. stored and retrieved by
content
First converted to feature vector form, which has high
dimensionality
Nearest-neighbor queries are the most common

5
Types of Spatial Queries

Spatial Range Queries
Find all cities within 50 miles of Madison
Query has associated region (location, boundary)
Answer includes overlapping or contained data
regions
Nearest-Neighbor Queries
Find the 10 cities nearest to Madison
Results must be ordered by proximity
Spatial Join Queries
Find all cities near a lake
Expensive, join condition involves regions and
proximity
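The three query types can be sketched with brute-force scans, which is what a spatial index tries to avoid. This is a minimal illustration only: the coordinates are made up (miles on a flat plane, with Madison at the origin), and the function names are hypothetical.

```python
import math

# Hypothetical coordinates in miles relative to Madison at (0, 0).
MADISON = (0.0, 0.0)
cities = {
    "Verona": (-8.0, -6.0),
    "Stoughton": (12.0, -16.0),
    "Janesville": (20.0, -35.0),
    "Milwaukee": (75.0, 5.0),
}
lakes = {"Mendota": (2.0, 4.0), "Michigan": (80.0, 5.0)}


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def range_query(points, center, radius):
    """Spatial range query: names of all points within radius of center."""
    return sorted(n for n, p in points.items() if dist(p, center) <= radius)


def nearest_neighbors(points, center, k):
    """Nearest-neighbor query: the k closest names, ordered by proximity."""
    return sorted(points, key=lambda n: dist(points[n], center))[:k]


def distance_join(left, right, max_dist):
    """Spatial join: all (left, right) pairs within max_dist of each other."""
    return [(a, b) for a, pa in left.items()
            for b, pb in right.items() if dist(pa, pb) <= max_dist]


print(range_query(cities, MADISON, 50))      # cities within 50 miles
print(nearest_neighbors(cities, MADISON, 2)) # 2 nearest cities, in order
print(distance_join(cities, lakes, 10))      # cities near a lake
```

The join compares every city with every lake, which is why the slide calls spatial joins expensive.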

6
Spatial Indexing

Point Access Methods (PAMs) vs. Spatial Access
Methods (SAMs)
PAMs index only point data
Hierarchical (tree-based) structures
Multidimensional hashing
Space-filling curves
SAMs index both points and regions
Transformations
Overlapping regions
Clipping methods (non-overlapping)
Data partitioning vs. space partitioning

7
Single-Dimensional Indexes

B+ trees are fundamentally single-dimensional
indexes.
When we create a composite search key B+ tree,
e.g., an index on <age, sal>,
we effectively linearize the 2-dimensional
space, since we sort entries first by age and then
by sal.

Consider entries <11, 80>, <12, 10>, <12, 20>,
<13, 75>.
[Figure: the entries plotted with AGE (11 to 13) on the horizontal axis and SAL (10 to 80) on the vertical axis, connected in B+ tree order.]
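The linearization can be seen by sorting the four example entries the way a composite <age, sal> key would. This is a small sketch; the helper name scan_span is hypothetical. Two entries that are close in sal end up at opposite ends of the sorted order, so a range condition on sal alone has to scan across everything in between.

```python
# The four entries from the slide, as (age, sal) pairs.
entries = [(12, 20), (13, 75), (11, 80), (12, 10)]

# A composite-key index orders entries by age first, then by sal.
# Python tuples compare the same way (lexicographically).
index_order = sorted(entries)


def scan_span(ordered, predicate):
    """Return the first and last positions of matching entries in the order."""
    hits = [i for i, e in enumerate(ordered) if predicate(e)]
    return hits[0], hits[-1]


print(index_order)
# Entries with 70 <= sal <= 85 are neighbors in the sal dimension,
# yet they sit at the two ends of the linear order.
print(scan_span(index_order, lambda e: 70 <= e[1] <= 85))
```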
8
Multidimensional Indexes

A multidimensional index clusters entries so as
to exploit nearness in multidimensional space.
Keeping track of entries and maintaining a
balanced index structure presents a challenge!

Consider entries <11, 80>, <12, 10>, <12, 20>,
<13, 75>.
9
Motivation for Multidimensional Indexes

Spatial queries (GIS, CAD).
Find all hotels within a radius of 5 miles from
the conference venue.
Find the city with population 500,000 or more
that is nearest to Kalamazoo, MI.
Find all cities that lie on the Nile in Egypt.
Find all parts that touch the fuselage (in a
plane design).
Similarity queries (content-based retrieval).
Given a face, find the five most similar faces.
Multidimensional range queries.
50 < age < 55 AND 80K < sal < 90K

10
What's the difficulty?

An index based on spatial location needed.
One-dimensional indexes don't support
multidimensional searching efficiently. (Why?)
Hash indexes only support point queries; we want to
support range queries as well.
Must support inserts and deletes gracefully.
Ideally, want to support non-point data as well
(e.g., lines, shapes).

11
PAMs

Point Access Methods
Hierarchical methods: kd-tree based
Space-filling curves: Z-ordering
Multidimensional hashing: Grid File
Exponential growth of the directory

12
The problem

Given a point set and a rectangular query, find
the points enclosed in the query
We allow insertions/deletions online

[Figure: a point set with a rectangular query region.]
13
Tree-based PAMs

Most tree-based PAMs are based on the kd-tree
The kd-tree is a main-memory binary tree for indexing
k-dimensional points
Needs to be adapted for the disk model
Levels rotate among the dimensions, partitioning
the space based on a value for that dimension
The kd-tree is not necessarily balanced

14
kd-tree

At each level we use a different dimension

[Figure: kd-tree over points A to E. The root splits on x = 5 (x < 5 vs. x > 5); lower levels split on y = 6, y = 3, and x = 6.]
15
Kd-tree properties

Height of the tree: O(log_2 n)
Search time for exact match: O(log_2 n)
Search time for range query: O(n^(1/2) + k), where k is the number of points reported
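A main-memory kd-tree with a rectangular range search can be sketched as follows. This is an illustration under stated assumptions, not the structure from any particular paper: the tree is built once by median splits (which keeps it balanced, unlike a tree grown by arbitrary insertions), and the names Node, build, and range_search are hypothetical.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    point: tuple             # the point stored at this node
    axis: int                # dimension this level splits on
    left: Optional["Node"]   # points with coordinate <= split value
    right: Optional["Node"]  # points with coordinate >= split value


def build(points, depth=0):
    """Build a kd-tree by median split, rotating the split dimension per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def range_search(node, lo, hi, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    # Descend only into the sides of the split that the query can reach.
    if lo[a] <= node.point[a]:
        range_search(node.left, lo, hi, out)
    if node.point[a] <= hi[a]:
        range_search(node.right, lo, hi, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(sorted(range_search(tree, (3, 1), (8, 5))))
```

The pruning test in range_search is what separates it from a full scan: a subtree is skipped whenever the query rectangle lies entirely on the other side of the split value.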

16
kd-tree example
[Figure: kd-tree example showing a 2-D partition and the corresponding tree, with splits at x = 3, 5, 7, 8 and y = 2, 5, 6.]
17
External memory kd-trees

Similar to the B-tree: tree nodes split many ways
instead of two ways.
Insertion becomes quite complex and expensive.
No storage utilization guarantee, since when a
higher-level node splits, the split has to be
propagated all the way to the leaf level, resulting in
many empty blocks.
Pack many interior nodes (forming a subtree) into
a block.
It may not be feasible to group nodes at lower
levels into a block productively.
Many interesting papers on how to optimally pack
nodes into blocks have recently been published.

18
PAMs

Point Access Methods
Hierarchical methods: kd-tree based
Space-filling curves: Z-ordering
Multidimensional hashing: Grid File
Exponential growth of the directory

19
Single-Dimensional Indexes

B+ trees are fundamentally single-dimensional
indexes.
When we create a composite search key B+ tree,
e.g., an index on <age, sal>,
we effectively linearize the 2-dimensional
space, since we sort entries first by age and then
by sal.

Consider entries <11, 80>, <12, 10>, <12, 20>,
<13, 75>.
[Figure: the entries plotted with AGE (11 to 13) on the horizontal axis and SAL (10 to 80) on the vertical axis, connected in B+ tree order.]
20
Z-Curve

What is a Z-curve?
A space-filling curve
Generated by interleaving the bits of the
x and y coordinates
See Fig. 4.6
Alternative generation method
See Fig. 4.5
Connecting points by z-order
See Fig. 4.4
Looks like N's or Z's
Implementing file operations

Fig 4.6
Fig 4.4
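Bit interleaving can be sketched in a few lines. One assumption to note: this version places each x bit in the more significant position of its pair; the textbook figure may use the opposite convention, which mirrors the curve but does not change the idea. The function name z_value is hypothetical.

```python
def z_value(x, y, bits):
    """Interleave the low `bits` bits of x and y into a single Z-order value.

    Assumed convention: for each bit position i, the x bit is placed above
    the y bit, giving the pattern x[n-1] y[n-1] ... x[0] y[0].
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


# x = 2 (binary 10) and y = 3 (binary 11) interleave to binary 1101.
print(z_value(2, 3, bits=2))

# Cells (1, 1) and (2, 1) are adjacent in space, but their Z-values
# are far apart because the curve jumps between quadrants.
print(z_value(1, 1, bits=2), z_value(2, 1, bits=2))
```

Once every point has a Z-value, an ordinary one-dimensional index (for example a B+ tree) can store the points in Z-order.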
21
Example of Z-values

Figure 4.7
Left part shows a map with spatial objects A, B,
and C
Right part and bottom-left part: Z-values within
A, B, and C
Note: C gets z-values of 2 and 8, which are not
close
Exercise: Compute z-values for B.

Fig 4.7
22
Hilbert Curve

A space filling curve
Example: Fig. 4.5
More complex to generate
due to rotations
Illustration on next slide!
Implementing file operations

Fig 4.5
23
Calculating Hilbert Values (Optional Topic)
Fig 4.8
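One common way to compute Hilbert values is an iterative procedure that processes one bit level at a time and rotates or reflects the quadrant as it descends. The sketch below follows that widely used formulation; it is not necessarily the method shown in Fig. 4.8, and the orientation of the curve it produces is an assumption. The function name hilbert_index is hypothetical.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along a Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Quadrant order along the curve: (0,0), (0,1), (1,1), (1,0).
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or reflect so the sub-quadrant is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Print the 4x4 grid of Hilbert values, top row first.
for y in reversed(range(4)):
    print([hilbert_index(4, x, y) for x in range(4)])
```

Unlike the Z-curve, consecutive Hilbert values always belong to cells that share an edge, which is the payoff for the extra rotation logic.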
24
PAMs

Point Access Methods
Hierarchical methods: kd-tree based
Space-filling curves: Z-ordering
Multidimensional hashing: Grid File
Exponential growth of the directory

25
Grid File

Hashing methods for multidimensional points
(extension of extensible hashing)
Idea: Use a grid to partition the space; each
cell is associated with one page
Two-disk-access principle (exact match)

26
Grid File

Start with one bucket for the whole space.
Select dividers along each dimension. Partition
space into cells
Dividers cut all the way.
Each cell corresponds to 1 disk page.
Many cells can point to the same page.
Cell directory potentially exponential in the
number of dimensions

27
Grid File Implementation

Dynamic structure using a grid directory
Grid array: a 2-dimensional array with pointers
to buckets (this array can be large, disk
resident): G(0, ..., nx-1; 0, ..., ny-1)
Linear scales: two 1-dimensional arrays that are used
to access the grid array (main memory): X(0, ...,
nx-1), Y(0, ..., ny-1)

28
Example
[Figure: grid file example showing linear scale X, linear scale Y, the grid directory, and the buckets/disk blocks.]
29
Grid File Search

Exact Match Search: at most 2 I/Os, assuming
linear scales fit in memory.
First use linear scales to determine the index
into the cell directory.
Access the cell directory to retrieve the bucket
address (may cause 1 I/O if cell directory does
not fit in memory).
Access the appropriate bucket (1 I/O).
Range Queries
Use linear scales to determine the index into the
cell directory.
Access the cell directory to retrieve the bucket
addresses of buckets to visit.
Access the buckets.
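The lookup path can be sketched with in-memory stand-ins for the three components. Everything here is a toy assumption: the divider values, the directory contents, and the bucket contents are made up, and a dict plays the role of disk pages. The names x_scale, y_scale, directory, and buckets are hypothetical.

```python
from bisect import bisect_right

# Linear scales (kept in main memory): divider positions per dimension.
x_scale = [10, 20]   # three columns: x < 10, 10 <= x < 20, x >= 20
y_scale = [50]       # two rows: y < 50, y >= 50

# Grid directory: directory[i][j] is a bucket id. Several cells may share
# one bucket, e.g. cells (0, 0) and (1, 0) both point to bucket 0.
directory = [[0, 1],
             [0, 2],
             [3, 3]]

# Buckets stand in for disk pages.
buckets = {
    0: [(5, 20), (15, 30)],
    1: [(3, 70)],
    2: [(12, 90)],
    3: [(25, 10), (28, 60)],
}


def cell_of(x, y):
    """Use the linear scales to find the directory cell containing (x, y)."""
    return bisect_right(x_scale, x), bisect_right(y_scale, y)


def exact_match(x, y):
    i, j = cell_of(x, y)            # no I/O: scales are in memory
    bucket_id = directory[i][j]     # at most 1 I/O for the directory
    return (x, y) in buckets[bucket_id]   # 1 I/O for the bucket


def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell_of(x_lo, y_lo)
    i_hi, j_hi = cell_of(x_hi, y_hi)
    # Collect distinct buckets so that a shared bucket is read only once.
    ids = {directory[i][j]
           for i in range(i_lo, i_hi + 1)
           for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in ids for p in buckets[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)


print(exact_match(15, 30))
print(range_query(0, 16, 0, 100))
```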

30
Grid File Insertions

Determine the bucket into which insertion must
occur.
If there is space in the bucket, insert.
Else, split the bucket.
How to choose a good dimension to split?
If the bucket split causes the cell directory to split,
do so and adjust the linear scales.
Insertion of these new entries potentially
requires a complete reorganization of the cell
directory: expensive!

31
Grid File Deletions

Deletions may decrease the space utilization.
Merge buckets
We need to decide which cells to merge and a
merging threshold
Buddy system and neighbor system
Buddy system: a bucket can merge with only one buddy in each
dimension
Neighbor system: merge adjacent regions if the result is a
rectangle

32
Grid File Example
(N = 6)
[Figure: grid with points 1 to 6.]
33
Grid File Example
(N = 6)
[Figure: grid with points 8 to 12.]
34
Grid File Example
(N = 6)
[Figure: grid with points 14 and 15.]
35
Grid File Example
(N = 6)
36
Grid File Example
(N = 6)
37
The R-Tree

The R-tree is a tree-structured index that
remains balanced on inserts and deletes.
Each key stored in a leaf entry is intuitively a
box, or collection of intervals, with one
interval per dimension.
Example in 2-D

38
R-Tree Properties

Leaf entry = <n-dimensional box, rid>
Key value is a box.
Box is the tightest bounding box for a data
object.
Non-leaf entry = <n-dim box, ptr to child node>
Box covers all boxes in child node (in fact,
subtree).
All leaves at same distance from root.
Nodes can be kept 50% full (except root).
Can choose a parameter m that is <= 50%, and
ensure that every node is at least m% full.

39
Example of an R-Tree
[Figure: example R-tree in 2-D showing boxes R1 to R19, with labels for a leaf entry, an index entry, and a spatial object approximated by bounding box R8.]
40
Example R-Tree (Contd.)
[Figure: the same R-tree drawn as a tree. Root entries R1 and R2; next level R3 to R7; leaf entries R8 to R19.]
41
Search for Objects Overlapping Box Q
Start at root.
1. If current node is non-leaf, for each entry
<E, ptr>: if box E overlaps Q, search the subtree
identified by ptr.
2. If current node is leaf, for each entry
<E, rid>: if E overlaps Q, rid identifies an
object that might overlap Q.
Note: May have to search several subtrees at
each node! (In contrast, a B-tree equality search
goes to just one leaf.)
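The two-case search can be written as a short recursive function. This is a minimal sketch under assumed representations: boxes are (xmin, ymin, xmax, ymax) tuples, nodes are plain dicts, and the tiny tree below is hand-built for illustration rather than produced by an insertion algorithm.

```python
def overlaps(a, b):
    """True if boxes a and b, each (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, q, out=None):
    """Return rids of all leaf entries whose box overlaps query box q."""
    if out is None:
        out = []
    for box, target in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                out.append(target)       # target is a rid
            else:
                search(target, q, out)   # target is a child node
    return out


# Hand-built two-level tree (hypothetical data).
leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "a"), ((1, 1, 3, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((5, 5, 6, 6), "c"), ((7, 4, 9, 6), "d")]}
root = {"leaf": False,
        "entries": [((0, 0, 3, 3), leaf1), ((5, 4, 9, 6), leaf2)]}

# This query box overlaps both root entries, so both subtrees are searched.
print(search(root, (2.5, 2.5, 5.5, 5.5)))
```

The returned rids are only candidates: a bounding box can overlap Q even when the object inside it does not, so the objects themselves still need an exact test.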
42
Improving Search Using Constraints

It is convenient to store boxes in the R-tree as
approximations of arbitrary regions, because
boxes can be represented compactly.
But why not use convex polygons to approximate
query regions more accurately?
Will reduce overlap with nodes in tree, and
reduce the number of nodes fetched by avoiding
some branches altogether.
Cost of overlap test is higher than bounding box
intersection, but it is a main-memory cost, and
can actually be done quite efficiently.
Generally a win.

43
Insert Entry <B, ptr>

Start at root and go down to best-fit leaf L.
Go to the child whose box needs least enlargement to
cover B; resolve ties by going to the smallest-area
child.
If best-fit leaf L has space, insert entry and
stop. Otherwise, split L into L1 and L2.
Adjust entry for L in its parent so that the box
now covers (only) L1.
Add an entry (in the parent node of L) for L2.
(This could cause the parent node to recursively
split.)
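The descent rule (least enlargement, ties broken by smallest area) can be sketched as a single selection function. Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the names area, union, enlargement, and choose_child are hypothetical.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def enlargement(box, new_box):
    """Extra area needed for box to also cover new_box."""
    return area(union(box, new_box)) - area(box)


def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to smaller area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))


# The second child needs far less enlargement to cover the new box.
print(choose_child([(0, 0, 4, 4), (5, 5, 7, 7)], (6, 6, 8, 8)))

# Both children already cover the new box (enlargement 0),
# so the tie is resolved in favor of the smaller child.
print(choose_child([(0, 0, 4, 4), (1, 1, 3, 3)], (2, 2, 2.5, 2.5)))
```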

44
Splitting a Node During Insertion

The entries in node L plus the newly inserted
entry must be distributed between L1 and L2.
Goal is to reduce likelihood of both L1 and L2
being searched on subsequent queries.
Idea: Redistribute so as to minimize area of L1
plus area of L2.

[Figure: a good split compared with a bad split.]
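The "minimize area of L1 plus area of L2" goal can be illustrated by trying every way to divide a small set of boxes into two groups. This exhaustive search is exponential in the number of entries and is shown only to make the objective concrete; practical R-trees use cheaper heuristics such as Guttman's linear and quadratic split algorithms. The function names and the min_fill parameter are hypothetical.

```python
from itertools import combinations


def mbr(boxes):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax)."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def split_cost(boxes, group):
    """Total bounding-box area if `group` (indices) goes to L1, the rest to L2."""
    g1 = [boxes[i] for i in group]
    g2 = [b for i, b in enumerate(boxes) if i not in group]
    return area(mbr(g1)) + area(mbr(g2))


def best_split(boxes, min_fill=1):
    """Try every two-way split and keep the one with the smallest total area."""
    n = len(boxes)
    best = None
    for size in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), size):
            if 0 not in group:       # skip mirror images of splits already seen
                continue
            cost = split_cost(boxes, group)
            if best is None or cost < best[0]:
                rest = tuple(i for i in range(n) if i not in group)
                best = (cost, group, rest)
    return best


boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
print(best_split(boxes))            # groups the two clusters separately
print(split_cost(boxes, (0, 2)))    # a bad split mixes the clusters
```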
45
Spatial Data Warehousing

Spatial data warehouse: integrated,
subject-oriented, time-variant, and nonvolatile
spatial data repository for data analysis and
decision making
Spatial data integration: a big issue
Structure-specific formats (raster- vs.
vector-based, OO vs. relational models, different
storage and indexing, etc.)
Vendor-specific formats (ESRI, MapInfo,
Intergraph, etc.)
Spatial data cube: multidimensional spatial
database
Both dimensions and measures may contain spatial
components

46
Dimensions and Measures in Spatial Data Warehouse

Measures
numerical
distributive (e.g. count, sum)
algebraic (e.g. average)
holistic (e.g. median, rank)
spatial
collection of spatial pointers (e.g. pointers to
all regions with 25-30 degrees in July)
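The difference between the three kinds of numerical measures shows up when data is split across partitions (for example, cells of a cube). The sketch below uses made-up temperature readings: count and sum combine directly from per-partition results, average is derived from those two, and median cannot be recovered from per-partition medians.

```python
import statistics

# Hypothetical readings, stored in three partitions.
partitions = [[20, 21, 40], [22, 28], [26, 26, 29, 31]]

# Distributive: combine the per-partition results directly.
total_count = sum(len(p) for p in partitions)
total_sum = sum(sum(p) for p in partitions)

# Algebraic: computed from a fixed number of distributive results.
average = total_sum / total_count

# Holistic: needs the full data; combining partition medians is not enough.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(v for p in partitions for v in p)

print(total_count, total_sum, average)
print(median_of_medians, true_median)
```

This is why holistic measures are the hard case for precomputation in a data cube: there is no small summary per cell that can be merged later.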

Dimension modeling
nonspatial
e.g. temperature 25-30 degrees generalizes to
hot
spatial-to-nonspatial
e.g. region B.C. generalizes to description
western provinces
spatial-to-spatial
e.g. region Burnaby generalizes to region
Lower Mainland

47
Example: BC weather pattern analysis

Input
A map with about 3,000 weather probes scattered
in B.C.
Daily data for temperature, precipitation, wind
velocity, etc.
Concept hierarchies for all attributes
Output
A map that reveals patterns: merged (similar)
regions
Goals
Interactive analysis (drill-down, slice, dice,
pivot, roll-up)
Fast response time
Minimizing storage space used
Challenge
A merged region may contain hundreds of
primitive regions (polygons)

48
Star Schema of the BC Weather Warehouse

Spatial data warehouse
Dimensions
region_name
time
temperature
precipitation
Measurements
region_map
area
count

[Figure: star schema with a fact table and dimension tables.]
49
Spatial Merge

Precomputing all: too much storage space
On-line merge: very expensive

50
Methods for Computation of Spatial Data Cube

On-line aggregation: collect and store pointers
to spatial objects in a spatial data cube
Expensive and slow; needs efficient aggregation
techniques
Precompute and store all the possible
combinations
Huge space overhead
Precompute and store rough approximations in a
spatial data cube
Accuracy trade-off
Selective computation: only materialize those
which will be accessed frequently
A reasonable choice

51
Spatial Association Analysis

Spatial association rule: A => B [s%, c%]
A and B are sets of spatial or nonspatial
predicates
Topological relations: intersects, overlaps,
disjoint, etc.
Spatial orientations: left_of, west_of, under,
etc.
Distance information: close_to, within_distance,
etc.
s% is the support and c% is the confidence of the
rule
Examples
is_a(x, large_town) ^ intersect(x, highway) =>
adjacent_to(x, water) [7%, 85%]
is_a(x, large_town) ^ adjacent_to(x,
georgia_strait) => close_to(x, u.s.a.) [1%, 78%]
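Support and confidence can be computed directly once each object is described by the set of predicates that hold for it. The five objects below are made-up data, so the resulting percentages are unrelated to the figures quoted in the examples; the function name support_confidence is hypothetical.

```python
def support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    rows: list of sets, each holding the predicates true for one object.
    support    = fraction of all objects satisfying antecedent and consequent
    confidence = fraction of antecedent-satisfying objects that also
                 satisfy the consequent
    """
    with_a = [r for r in rows if antecedent <= r]
    with_ab = [r for r in with_a if consequent <= r]
    support = len(with_ab) / len(rows)
    confidence = len(with_ab) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical objects described by the predicates that hold for them.
rows = [
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway"},
    {"large_town", "adjacent_to_water"},
    {"adjacent_to_water"},
]

s, c = support_confidence(rows,
                          {"large_town", "intersects_highway"},
                          {"adjacent_to_water"})
print(f"support={s:.0%}, confidence={c:.0%}")
```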

52
Progressive Refinement Mining of Spatial
Association Rules

Hierarchy of spatial relationships
g_close_to: near_by, touch, intersect, contain,
etc.
First search for a rough relationship and then
refine it
Two-step mining of spatial association
Step 1: Rough spatial computation (as a filter)
Using MBR or R-tree for rough estimation
Step 2: Detailed spatial algorithm (as refinement)
Apply only to those objects which have passed
the rough spatial association test (no less than
min_support)
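The filter-and-refine idea can be sketched for a close_to predicate. Simplifying assumptions, all hypothetical: each object is a small set of points, the "detailed" distance is the minimum point-to-point distance, and the filter uses the distance between minimum bounding rectangles (MBRs). Because the MBR distance never exceeds the detailed distance, the filter cannot discard a true match.

```python
import math


def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point set."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_distance(a, b):
    """Distance between two rectangles (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Detailed test: minimum distance between any two points of p and q."""
    return min(math.hypot(u[0] - v[0], u[1] - v[1]) for u in p for v in q)


def close_pairs(towns, lakes, d):
    # Step 1 (filter): cheap MBR test produces candidate pairs.
    candidates = [(t, l) for t in towns for l in lakes
                  if mbr_distance(mbr(towns[t]), mbr(lakes[l])) <= d]
    # Step 2 (refine): detailed test, applied only to the candidates.
    refined = [(t, l) for t, l in candidates
               if exact_distance(towns[t], lakes[l]) <= d]
    return candidates, refined


towns = {"T1": [(0, 0), (1, 0), (1, 1)], "T2": [(10, 10), (11, 10), (11, 11)]}
lakes = {"L1": [(2, 1), (3, 1), (3, 2)], "L2": [(0, 5), (5, 0)]}

candidates, refined = close_pairs(towns, lakes, d=2)
print(candidates)  # pairs that passed the rough test
print(refined)     # pairs that also passed the detailed test
```

In this data the pair (T1, L2) passes the filter because the MBRs overlap, but fails refinement because the objects themselves are farther apart than d.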

53
Spatial Classification and Spatial Trend Analysis

Spatial classification
Analyze spatial objects to derive classification
schemes, such as decision trees in relevance to
certain spatial properties (district, highway,
river, etc.)
Example: Classify regions in a province into rich
vs. poor according to the average family income
Spatial trend analysis
Detect changes and trends along a spatial
dimension
Study the trend of nonspatial or spatial data
changing with space
Example: Observe the trend of changes of the
climate or vegetation with the increasing
distance from an ocean

54
LSD-tree

Local Split Decision tree
Use a kd-tree to partition the space. Each
partition contains up to B points. The kd-tree is
stored in main memory.
If the kd-tree (directory) is large, we store a
sub-tree on disk
Goal: the structure must remain balanced
(external balancing property)

55
Example LSD-tree
56
LSD-tree main points