Title: What we have covered?
1What we have covered?
- Indexing and Hashing
- Data warehouse and OLAP
- Data Mining
- Information Retrieval and Web Mining
- XML and XQuery
- Spatial Databases
- Transaction Management
2Lecture 6 Spatial Data Management
3Types of Spatial Data
- Point Data
- Points in a multidimensional space
- E.g., raster data such as satellite imagery, where each pixel stores a measured value
- E.g., feature vectors extracted from text
- Region Data
- Objects have spatial extent with location and boundary
- DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.
4Applications of Spatial Data
- Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium
- Geospatial information
- All classes of spatial queries and data are common
- Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage
- Range queries and spatial join queries are common
- Multimedia Databases
- Images, video, text, etc. stored and retrieved by content
- First converted to feature vector form; high dimensionality
- Nearest-neighbor queries are the most common
5Types of Spatial Queries
- Spatial Range Queries
- Find all cities within 50 miles of Madison
- Query has associated region (location, boundary)
- Answer includes overlapping or contained data regions
- Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison
- Results must be ordered by proximity
- Spatial Join Queries
- Find all cities near a lake
- Expensive: the join condition involves regions and proximity
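The range and nearest-neighbor queries above can be sketched without any index, as a brute-force baseline that the later index structures try to beat. This is only an illustration: the city names and coordinates are made up, and distances are plain Euclidean rather than miles on a map.

```python
from math import hypot

# Hypothetical point data: name -> (x, y). Not real coordinates.
cities = {"A": (0, 0), "B": (3, 4), "C": (6, 8), "D": (1, 1)}

def distance(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def range_query(center, radius):
    """Spatial range query: every city within `radius` of `center`."""
    return sorted(n for n, p in cities.items() if distance(p, center) <= radius)

def nearest(center, k):
    """Nearest-neighbor query: the k closest cities, ordered by proximity."""
    return sorted(cities, key=lambda n: distance(cities[n], center))[:k]
```

Both functions scan every point, so each query costs time linear in the number of cities.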
6Spatial Indexing
- Point Access Methods (PAMs) vs Spatial Access Methods (SAMs)
- PAMs index only point data
- Hierarchical (tree-based) structures
- Multidimensional Hashing
- Space filling curve
- SAMs index both points and regions
- Transformations
- Overlapping regions
- Clipping methods (non-overlapping)
- Data partitioning vs Space partitioning
7Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.
- Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
[Figure: the entries plotted with AGE (11 to 13) on the horizontal axis and SAL (10 to 80) on the vertical axis, connected in B+ tree order.]
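A small sketch of what linearization means for the four entries on this slide. Python's tuple ordering stands in for the composite-key order; the helper name `sal_range_scan` is hypothetical.

```python
# Entries from the slide, as (age, sal) pairs.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# A composite-key index orders entries lexicographically:
# first by age, then by sal within equal ages.
composite_order = sorted(entries)

def sal_range_scan(ordered, lo, hi):
    """Answer lo <= sal <= hi over the age-major order.

    sal is not the leading key, so matching entries are scattered
    through the order and every entry has to be examined.
    """
    return [e for e in ordered if lo <= e[1] <= hi]
```

Entries with sal between 70 and 90 end up first and last in the age-major order, which is why such an index does not cluster them.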
8Multidimensional Indexes
- A multidimensional index clusters entries so as to exploit nearness in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge!
- Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
9Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD).
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval).
- Given a face, find the five most similar faces.
- Multidimensional range queries.
- 50 < age < 55 AND 80K < sal < 90K
10What's the difficulty?
- An index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently. (Why?)
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
- Ideally, want to support non-point data as well (e.g., lines, shapes).
11PAMs
- Point Access Methods
- Hierarchical methods: kd-tree based
- Space Filling Curves: Z-ordering
- Multidimensional Hashing: Grid File
- Exponential growth of the directory
12The problem
- Given a point set and a rectangular query, find the points enclosed in the query
- We allow insertions/deletions online
[Figure: a point set with a rectangular region labeled "Query".]
13Tree-based PAMs
- Most tree-based PAMs are based on the kd-tree
- The kd-tree is a main-memory binary tree for indexing k-dimensional points
- Needs to be adapted for the disk model
- Levels rotate among the dimensions, partitioning the space based on a value for that dimension
- The kd-tree is not necessarily balanced
14kd-tree
- At each level we use a different dimension
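[Figure: kd-tree over points A to E. The root splits on x = 5 (branches x < 5 and x > 5); lower levels split on y = 6, y = 3, and x = 6.]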
15Kd-tree properties
- Height of the tree: O(log_2 n) when the tree is balanced
- Search time for exact match: O(log_2 n)
- Search time for range query: O(n^(1/2) + k) in two dimensions, where k is the number of points reported
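A minimal main-memory sketch of a kd-tree with a rectangular range search. Building by median split is an assumption made here to keep the tree balanced; the slides do not prescribe a construction method, and the names `Node`, `build`, and `range_search` are illustrative.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    point: tuple
    axis: int                  # dimension this node splits on
    left: Optional["Node"]     # points with coordinate <= point[axis]
    right: Optional["Node"]    # points with coordinate >= point[axis]

def build(points, depth=0, k=2):
    """Build a kd-tree by median split; the split dimension rotates by level."""
    if not points:
        return None
    axis = depth % k
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1, k),
                build(pts[mid + 1:], depth + 1, k))

def range_search(node, lo, hi, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] <= node.point[a]:      # query rectangle reaches the left side
        range_search(node.left, lo, hi, out)
    if hi[a] >= node.point[a]:      # query rectangle reaches the right side
        range_search(node.right, lo, hi, out)
    return out
```

A subtree is skipped only when the query rectangle lies entirely on the other side of the split value, which is where the savings over a full scan come from.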
16kd-tree example
[Figure: kd-tree example with split lines at x = 3, x = 5, x = 7, x = 8 and y = 2, y = 5, y = 6.]
17External memory kd-trees
- Similar to a B-tree, tree nodes split many ways instead of two ways
- Insertion becomes quite complex and expensive.
- No storage utilization guarantee, since when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
- Pack many interior nodes (forming a subtree) into a block.
- It may not be feasible to group nodes at the lower levels into a block productively.
- Many interesting papers on how to optimally pack nodes into blocks have recently been published.
18PAMs
- Point Access Methods
- Hierarchical methods: kd-tree based
- Space Filling Curves: Z-ordering
- Multidimensional Hashing: Grid File
- Exponential growth of the directory
20Z-Curve
- What is a Z-curve?
- A space-filling curve
- Generated by interleaving the bits of the x and y coordinates
- See Fig. 4.6
- Alternative generation method
- See Fig. 4.5
- Connecting points by z-order
- See Fig. 4.4
- Looks like N's or Z's
- Implementing file operations
[Figures: Fig. 4.6, Fig. 4.4]
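Bit interleaving can be sketched in a few lines. Which coordinate supplies the higher bit of each pair is a convention; this sketch assumes x leads, which may differ from the figures referenced above. The function name `z_value` is illustrative.

```python
def z_value(x, y, bits):
    """Interleave the low `bits` bits of x and y into one z-value.

    For each bit position i, the bit of x lands at position 2i + 1 and
    the bit of y at position 2i (assumed convention: x leads).
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z
```

Sorting cells by z-value gives a one-dimensional key that an ordinary B+ tree can index, at the cost of occasional jumps between cells that are close in space.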
21Example of Z-values
- Figure 4.7
- Left part shows a map with spatial objects A, B, C
- Right part and left bottom part: z-values within A, B and C
- Note C gets z-values of 2 and 8, which are not close
- Exercise: compute z-values for B.
[Fig. 4.7]
22Hilbert Curve
- A space-filling curve
- Example: Fig. 4.5
- More complex to generate
- due to rotations
- Illustration on next slide!
- Implementing file operations
[Fig. 4.5]
23Calculating Hilbert Values (Optional Topic)
Fig 4.8
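One widely used iterative way to compute a Hilbert value is sketched below. It is not necessarily the procedure shown in Fig. 4.8. It assumes an n x n grid with n a power of two, and the orientation of the curve (which corner it starts in) is a convention of this sketch. The name `hilbert_value` is illustrative.

```python
def hilbert_value(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two. At each level the quadrant contributes a
    multiple of s*s, then the coordinates are rotated/reflected so the
    sub-curve in that quadrant has the standard orientation.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:              # reflect
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x              # rotate
        s //= 2
    return d
```

Unlike z-order, consecutive Hilbert values always belong to cells that share an edge, which is the payoff for the extra rotation step.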
24PAMs
- Point Access Methods
- Hierarchical methods: kd-tree based
- Space Filling Curves: Z-ordering
- Multidimensional Hashing: Grid File
- Exponential growth of the directory
25Grid File
- Hashing method for multidimensional points (extension of extensible hashing)
- Idea: use a grid to partition the space; each cell is associated with one page
- Two-disk-access principle (exact match)
26Grid File
- Start with one bucket for the whole space.
- Select dividers along each dimension. Partition space into cells
- Dividers cut all the way.
- Each cell corresponds to 1 disk page.
- Many cells can point to the same page.
- Cell directory potentially exponential in the number of dimensions
27Grid File Implementation
- Dynamic structure using a grid directory
- Grid array: a 2-dimensional array with pointers to buckets (this array can be large, disk resident): G(0, ..., nx-1, 0, ..., ny-1)
- Linear scales: two 1-dimensional arrays that are used to access the grid array (main memory): X(0, ..., nx-1), Y(0, ..., ny-1)
28Example
[Figure: a grid file, showing linear scale X, linear scale Y, the grid directory, and the buckets/disk blocks.]
29Grid File Search
- Exact match search: at most 2 I/Os, assuming linear scales fit in memory.
- First use linear scales to determine the index into the cell directory
- Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory)
- Access the appropriate bucket (1 I/O)
- Range queries
- Use linear scales to determine the index into the cell directory.
- Access the cell directory to retrieve the bucket addresses of buckets to visit.
- Access the buckets.
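The two search procedures can be sketched with in-memory stand-ins for the linear scales, the grid directory, and the buckets. All values here are made up for illustration, and dictionary lookups stand in for disk reads.

```python
from bisect import bisect_right

# Hypothetical toy grid file.
x_scale = [10, 20]   # x dividers: cells (-inf, 10), [10, 20), [20, inf)
y_scale = [50]       # y dividers: cells (-inf, 50), [50, inf)

# directory[i][j] is the bucket id for x-cell i and y-cell j.
# Several cells may share one bucket, as long as the region stays rectangular.
directory = [["A", "B"],
             ["A", "C"],
             ["D", "D"]]

buckets = {
    "A": [(3, 7), (12, 40)],
    "B": [(5, 60)],
    "C": [(15, 90)],
    "D": [(25, 10), (30, 70)],
}

def cell_of(x, y):
    """Use the linear scales (in memory) to find the directory index."""
    return bisect_right(x_scale, x), bisect_right(y_scale, y)

def exact_match(x, y):
    i, j = cell_of(x, y)
    bucket_id = directory[i][j]            # directory access (at most 1 I/O)
    return (x, y) in buckets[bucket_id]    # bucket access (1 I/O)

def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell_of(x_lo, y_lo)
    i_hi, j_hi = cell_of(x_hi, y_hi)
    # Collect distinct buckets so a shared bucket is read only once.
    ids = {directory[i][j]
           for i in range(i_lo, i_hi + 1)
           for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in ids for p in buckets[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
```

The final filter in `range_query` is needed because a bucket's region can extend beyond the query rectangle.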
30Grid File Insertions
- Determine the bucket into which insertion must occur.
- If space in bucket, insert.
- Else, split bucket
- How to choose a good dimension to split?
- If the bucket split causes the cell directory to split, do so and adjust linear scales.
- Insertion of these new entries potentially requires a complete reorganization of the cell directory: expensive!
31Grid File Deletions
- Deletions may decrease the space utilization. Merge buckets
- We need to decide which cells to merge and a merging threshold
- Buddy system and neighbor system
- Buddy system: a bucket can merge with only one buddy in each dimension
- Neighbor system: merge adjacent regions if the result is a rectangle
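32Grid File Example
(N = 6)
[Figure: grid file with labeled points 1, 2, 3, 4, 5, 6.]
33Grid File Example
(N = 6)
[Figure: grid file with labeled points 8, 9, 10, 11, 12.]
34Grid File Example
(N = 6)
[Figure: grid file with labeled points 14, 15.]
35Grid File Example
(N = 6)
36Grid File Example
(N = 6)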
37The R-Tree
- The R-tree is a tree-structured index that remains balanced on inserts and deletes.
- Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.
- Example in 2-D
38R-Tree Properties
- Leaf entry: <n-dimensional box, rid>
- The key value is a box.
- Box is the tightest bounding box for a data object.
- Non-leaf entry: <n-dim box, ptr to child node>
- Box covers all boxes in child node (in fact, subtree).
- All leaves at same distance from root.
- Nodes can be kept 50% full (except root).
- Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.
39Example of an R-Tree
[Figure: example R-tree in 2-D with boxes R1 to R19. Leaf entries and index entries are marked; one spatial object is shown approximated by bounding box R8.]
40Example R-Tree (Contd.)
[Figure: the same R-tree drawn as a tree, with entries R1 and R2 at the root, R3 to R7 at the next level, and R8 to R19 at the leaves.]
41Search for Objects Overlapping Box Q
Start at root.
1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr.
2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
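The two search steps translate almost directly into a recursive function. The node layout (a dict with a `leaf` flag and a list of `(box, ref)` entries) and the 2-D box format `(xmin, ymin, xmax, ymax)` are assumptions of this sketch, not a prescribed representation.

```python
def overlaps(a, b):
    """Boxes are (xmin, ymin, xmax, ymax); touching boxes count as overlapping."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q):
    """Return the rids of leaf entries whose boxes overlap the query box q.

    In a non-leaf node, ref is a child node; in a leaf, ref is a rid.
    """
    hits = []
    for box, ref in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                hits.append(ref)              # candidate: object might overlap q
            else:
                hits.extend(search(ref, q))   # may descend into several children
    return hits

# Toy two-level tree with made-up boxes.
leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((3, 1, 5, 4), "r2")]}
leaf2 = {"leaf": True, "entries": [((6, 6, 8, 9), "r3"), ((4, 5, 7, 7), "r4")]}
root = {"leaf": False, "entries": [((0, 0, 5, 4), leaf1), ((4, 5, 8, 9), leaf2)]}
```

For the query box (4, 3, 6, 6) both root entries overlap, so both leaves are visited; this is the multi-subtree behavior the note above warns about.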
42Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
- Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.
43Insert Entry <B, ptr>
- Start at root and go down to best-fit leaf L.
- Go to child whose box needs least enlargement to cover B; resolve ties by going to smallest area child.
- If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.
- Adjust entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
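The descent rule (least enlargement, ties broken by smallest area) can be sketched as a single selection step. Boxes are assumed to be 2-D tuples `(xmin, ymin, xmax, ymax)`; the helper names are illustrative.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def choose_child(entries, b):
    """Pick the (box, child) entry whose box needs least enlargement to cover b.

    Ties are resolved in favor of the entry with the smallest area.
    """
    def cost(entry):
        box = entry[0]
        return (area(union(box, b)) - area(box), area(box))
    return min(entries, key=cost)
```

Applying `choose_child` at each level from the root down yields the best-fit leaf L; splitting and parent adjustment are not covered by this sketch.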
44Splitting a Node During Insertion
- The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
- Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize area of L1 plus area of L2.
[Figure: two alternative splits of the same entries, one labeled "GOOD SPLIT!" and one labeled "BAD!".]
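To make the "minimize area of L1 plus area of L2" idea concrete, the sketch below tries every distribution of a small set of boxes and keeps the cheapest. Exhaustive search is exponential in the number of entries, so practical R-trees use cheaper heuristics instead; this is only an illustration, and `min_fill` is an assumed parameter for the minimum number of entries per node.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def mbr(boxes):
    """Minimum bounding rectangle of a non-empty list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def best_split(boxes, min_fill):
    """Return (total_area, group1, group2) minimizing area(L1) + area(L2)."""
    best = None
    idx = range(len(boxes))
    for size in range(min_fill, len(boxes) - min_fill + 1):
        for group in combinations(idx, size):
            g1 = [boxes[i] for i in group]
            g2 = [boxes[i] for i in idx if i not in group]
            cost = area(mbr(g1)) + area(mbr(g2))
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best
```

With two clusters of boxes far apart, the good split keeps each cluster together; mixing them produces two large, heavily overlapping rectangles.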
45Spatial Data Warehousing
- Spatial data warehouse: integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making
- Spatial data integration: a big issue
- Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
- Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.)
- Spatial data cube: multidimensional spatial database
- Both dimensions and measures may contain spatial components
46Dimensions and Measures in Spatial Data Warehouse
- Measures
- numerical
- distributive (e.g. count, sum)
- algebraic (e.g. average)
- holistic (e.g. median, rank)
- spatial
- collection of spatial pointers (e.g. pointers to all regions with 25-30 degrees in July)
- Dimension modeling
- nonspatial
- e.g. temperature 25-30 degrees generalizes to "hot"
- spatial-to-nonspatial
- e.g. region "B.C." generalizes to description "western provinces"
- spatial-to-spatial
- e.g. region "Burnaby" generalizes to region "Lower Mainland"
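The three kinds of numerical measure differ in what must be kept per partition in order to combine results. The sketch below uses made-up temperature readings split across three hypothetical partitions.

```python
import statistics

# Hypothetical readings, already partitioned (e.g. by region).
partitions = [[20, 21, 35], [22, 23], [24, 25, 26, 27]]

# Distributive: combine per-partition results directly.
total_count = sum(len(p) for p in partitions)
total_sum = sum(sum(p) for p in partitions)

# Algebraic: computable from a fixed number of distributive results.
average = total_sum / total_count

# Holistic: no bounded summary suffices; the raw values are needed.
true_median = statistics.median(x for p in partitions for x in p)

# Combining per-partition medians does not, in general, give the true median.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
```

Here the median of the partition medians is 22.5 while the true median is 24, which is why holistic measures are expensive to maintain in a data cube.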
47Example: BC weather pattern analysis
- Input
- A map with about 3,000 weather probes scattered in B.C.
- Daily data for temperature, precipitation, wind velocity, etc.
- Concept hierarchies for all attributes
- Output
- A map that reveals patterns: merged (similar) regions
- Goals
- Interactive analysis (drill-down, slice, dice, pivot, roll-up)
- Fast response time
- Minimizing storage space used
- Challenge
- A merged region may contain hundreds of primitive regions (polygons)
48Star Schema of the BC Weather Warehouse
- Spatial data warehouse
- Dimensions
- region_name
- time
- temperature
- precipitation
- Measurements
- region_map
- area
- count
[Figure: star schema showing the fact table and the dimension tables.]
49Spatial Merge
- Precomputing all: too much storage space
- On-line merge: very expensive
50Methods for Computation of Spatial Data Cube
- On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
- expensive and slow, need efficient aggregation techniques
- Precompute and store all the possible combinations
- huge space overhead
- Precompute and store rough approximations in a spatial data cube
- accuracy trade-off
- Selective computation: only materialize those which will be accessed frequently
- a reasonable choice
51Spatial Association Analysis
- Spatial association rule: A => B [s%, c%]
- A and B are sets of spatial or nonspatial predicates
- Topological relations: intersects, overlaps, disjoint, etc.
- Spatial orientations: left_of, west_of, under, etc.
- Distance information: close_to, within_distance, etc.
- s% is the support and c% is the confidence of the rule
- Examples
- is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water) [7%, 85%]
- is_a(x, large_town) ^ adjacent_to(x, georgia_strait) => close_to(x, u.s.a.) [1%, 78%]
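Support and confidence can be computed by counting, once the spatial predicates have been evaluated for each object. In this sketch each object is represented by the set of predicate names that hold for it; the five objects and the short predicate names are made up.

```python
# Hypothetical objects, each described by the predicates that hold for it.
objects = [
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"adjacent_to_water"},
    {"large_town"},
]

def support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all rows satisfying antecedent and consequent
    confidence = fraction of rows satisfying the antecedent that also
                 satisfy the consequent
    """
    n_a = sum(antecedent <= r for r in rows)
    n_ab = sum((antecedent | consequent) <= r for r in rows)
    support = n_ab / len(rows)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

s, c = support_confidence(objects,
                          {"large_town", "intersects_highway"},
                          {"adjacent_to_water"})
```

For this toy data the rule holds for 2 of 5 objects (support 40%) and for 2 of the 3 objects matching the antecedent (confidence about 67%).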
52Progressive Refinement Mining of Spatial
Association Rules
- Hierarchy of spatial relationship
- g_close_to: near_by, touch, intersect, contain, etc.
- First search for rough relationship and then refine it
- Two-step mining of spatial association
- Step 1: Rough spatial computation (as a filter)
- Using MBR or R-tree for rough estimation
- Step 2: Detailed spatial algorithm (as refinement)
- Apply only to those objects which have passed the rough spatial association test (no less than min_support)
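The filter-and-refine pattern behind the two steps can be sketched for an "intersect" test. Circles are used as the exact geometry purely for illustration (an assumption; real spatial objects would be polygons), and the MBR comparison is done pairwise rather than through an R-tree.

```python
from math import hypot

# Hypothetical objects: name -> circle (center x, center y, radius).
objs = {"a": (0, 0, 1), "b": (1.5, 1.5, 1), "c": (1.5, 0, 1), "d": (10, 10, 1)}

def mbr(c):
    x, y, r = c
    return (x - r, y - r, x + r, y + r)

def mbr_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(c1, c2):
    return hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]

def intersecting_pairs(objects):
    names = sorted(objects)
    # Step 1 (filter): cheap MBR test; may keep false positives, never drops a true pair.
    candidates = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
                  if mbr_overlap(mbr(objects[p]), mbr(objects[q]))]
    # Step 2 (refine): exact geometric test, applied to the candidates only.
    exact = [(p, q) for p, q in candidates
             if circles_intersect(objects[p], objects[q])]
    return candidates, exact
```

The pair (a, b) survives the filter because the bounding boxes overlap, but the refinement step rejects it because the circles themselves do not touch.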
53Spatial Classification and Spatial Trend Analysis
- Spatial classification
- Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to certain spatial properties (district, highway, river, etc.)
- Example: classify regions in a province into rich vs. poor according to the average family income
- Spatial trend analysis
- Detect changes and trends along a spatial dimension
- Study the trend of nonspatial or spatial data changing with space
- Example: observe the trend of changes of the climate or vegetation with the increasing distance from an ocean
54LSD-tree
- Local Split Decision tree
- Use kd-tree to partition the space. Each partition contains up to B points. The kd-tree is stored in main memory.
- If the kd-tree (directory) is large, we store a sub-tree on disk
- Goal: the structure must remain balanced (external balancing property)
55Example LSD-tree
56LSD-tree main points
- Split strategies
- Data dependent
- Distribution dependent
- Paging algorithm
- Two types of splits: bucket splits and internal node splits
57Handling Regions with Z-curve
Fig 4.9