What we have covered? - PowerPoint PPT Presentation

Slides: 57
Provided by: jeff466
Learn more at: https://www.cs.kent.edu

1
What we have covered?
  • Indexing and Hashing
  • Data warehouse and OLAP
  • Data Mining
  • Information Retrieval and Web Mining
  • XML and XQuery
  • Spatial Databases
  • Transaction Management

2
Lecture 6: Spatial Data Management
3
Types of Spatial Data
  • Point Data
  • Points in a multidimensional space
  • E.g., Raster data such as satellite imagery,
    where each pixel stores a measured value
  • E.g., Feature vectors extracted from text
  • Region Data
  • Objects have spatial extent with location and
    boundary
  • DB typically uses geometric approximations
    constructed using line segments, polygons, etc.,
    called vector data.

4
Applications of Spatial Data
  • Geographic Information Systems (GIS)
  • E.g., ESRI's ArcInfo, OpenGIS Consortium
  • Geospatial information
  • All classes of spatial queries and data are
    common
  • Computer-Aided Design/Manufacturing
  • Store spatial objects such as surface of airplane
    fuselage
  • Range queries and spatial join queries are common
  • Multimedia Databases
  • Images, video, text, etc. stored and retrieved by
    content
  • First converted to feature vector form: high
    dimensionality
  • Nearest-neighbor queries are the most common

5
Types of Spatial Queries
  • Spatial Range Queries
  • Find all cities within 50 miles of Madison
  • Query has associated region (location, boundary)
  • Answer includes overlapping or contained data
    regions
  • Nearest-Neighbor Queries
  • Find the 10 cities nearest to Madison
  • Results must be ordered by proximity
  • Spatial Join Queries
  • Find all cities near a lake
  • Expensive, join condition involves regions and
    proximity
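
The three query types can be stated precisely with a brute-force baseline, which is what a spatial index tries to beat. This is a minimal sketch: the coordinates, the lake data, and all function names are hypothetical and only illustrate the semantics.

```python
import math

# Hypothetical sample data: name -> (x, y) position in miles.
cities = {"Madison": (0.0, 0.0), "A": (30.0, 30.0), "B": (60.0, 0.0), "C": (10.0, 10.0)}
lakes = {"L1": (12.0, 9.0), "L2": (200.0, 200.0)}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(points, center, radius):
    """Spatial range query: all names within `radius` of `center`."""
    return sorted(n for n, p in points.items() if dist(p, center) <= radius)

def nearest_neighbors(points, center, k):
    """Nearest-neighbor query: the k closest names, ordered by proximity."""
    return sorted(points, key=lambda n: dist(points[n], center))[:k]

def spatial_join(left, right, max_dist):
    """Spatial join: all (left, right) pairs within `max_dist` of each other."""
    return sorted((a, b) for a, pa in left.items()
                  for b, pb in right.items() if dist(pa, pb) <= max_dist)
```

Each function scans every object (the join scans every pair), which is why the join is the expensive one.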

6
Spatial Indexing
  • Point Access Methods (PAMs) vs Spatial Access
    Methods (SAMs)
  • PAMs index only point data
  • Hierarchical (tree-based) structures
  • Multidimensional hashing
  • Space-filling curves
  • SAMs index both points and regions
  • Transformations
  • Overlapping regions
  • Clipping methods (non-overlapping)
  • Data partitioning vs Space partitioning

7
Single-Dimensional Indexes
  • B+ trees are fundamentally single-dimensional
    indexes.
  • When we create a composite search key B+ tree,
    e.g., an index on <age, sal>,
  • we effectively linearize the 2-dimensional
    space since we sort entries first by age and then
    by sal.

Consider entries <11, 80>, <12, 10>, <12, 20>,
<13, 75>
[Figure: the entries plotted with AGE (11 to 13) on the x-axis and SAL (10 to 80) on the y-axis, connected in B+ tree order.]
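
The linearization can be reproduced by sorting the four example entries the way a composite-key index would. This is a small sketch, and the variable names are illustrative rather than taken from the slides.

```python
# Entries <age, sal> from the slide, in arbitrary insertion order.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# A composite-key index orders entries by age first, then by sal.
# Python tuples compare lexicographically, so sorted() yields the same order.
index_order = sorted(entries)

# A range condition on sal alone does not map to a contiguous run:
# record where the entries with 70 <= sal <= 80 sit in the linear order.
positions = [i for i, (age, sal) in enumerate(index_order) if 70 <= sal <= 80]
```

The two qualifying entries end up at opposite ends of the order, so a scan for them has to pass over everything in between.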
8
Multidimensional Indexes
  • A multidimensional index clusters entries so as
    to exploit nearness in multidimensional space.
  • Keeping track of entries and maintaining a
    balanced index structure presents a challenge!

Consider entries <11, 80>, <12, 10>, <12, 20>,
<13, 75>
9
Motivation for Multidimensional Indexes
  • Spatial queries (GIS, CAD).
  • Find all hotels within a radius of 5 miles from
    the conference venue.
  • Find the city with population 500,000 or more
    that is nearest to Kalamazoo, MI.
  • Find all cities that lie on the Nile in Egypt.
  • Find all parts that touch the fuselage (in a
    plane design).
  • Similarity queries (content-based retrieval).
  • Given a face, find the five most similar faces.
  • Multidimensional range queries.
  • 50 < age < 55 AND 80K < sal < 90K

10
What's the difficulty?
  • An index based on spatial location needed.
  • One-dimensional indexes don't support
    multidimensional searching efficiently. (Why?)
  • Hash indexes only support point queries; want to
    support range queries as well.
  • Must support inserts and deletes gracefully.
  • Ideally, want to support non-point data as well
    (e.g., lines, shapes).

11
PAMs
  • Point Access Methods
  • Hierarchical methods: kd-tree based
  • Space-filling curves: Z-ordering
  • Multidimensional hashing: Grid File
  • Exponential growth of the directory

12
The problem
  • Given a point set and a rectangular query, find
    the points enclosed in the query
  • We allow insertions/deletions online

[Figure: a point set with a rectangular query region labeled Query]
13
Tree-based PAMs
  • Most tree-based PAMs are based on the kd-tree
  • kd-tree is a main memory binary tree for indexing
    k-dimensional points
  • Needs to be adapted for the disk model
  • Levels rotate among the dimensions, partitioning
    the space based on a value for that dimension
  • kd-tree is not necessarily balanced
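
A minimal main-memory kd-tree sketch, assuming points are stored in every node and inserted one at a time (so the tree can become unbalanced, as noted above). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    point: Tuple[float, ...]
    axis: int                      # dimension this node splits on
    left: Optional["Node"] = None  # points with coordinate <  split value
    right: Optional["Node"] = None # points with coordinate >= split value

def insert(node, point, depth=0):
    """Insert a point; the split dimension rotates with the depth."""
    if node is None:
        return Node(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def range_search(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] < node.point[a]:    # query reaches into the "<" side
        range_search(node.left, lo, hi, out)
    if hi[a] >= node.point[a]:   # query reaches into the ">=" side
        range_search(node.right, lo, hi, out)
    return out

root = None
for p in [(5, 4), (2, 6), (8, 1), (3, 3), (7, 7), (6, 2)]:
    root = insert(root, p)
```

A call such as `range_search(root, (2, 2), (6, 6), [])` prunes any subtree whose side of the split cannot intersect the query rectangle.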

14
kd-tree
  • At each level we use a different dimension

[Figure: kd-tree with split lines x=5, y=6, y=3, and x=6 (branches labeled x<5 and x>5) and points A, B, C, D, E.]
15
Kd-tree properties
  • Height of the tree: O(log_2 n), if the tree is
    balanced
  • Search time for exact match: O(log_2 n)
  • Search time for range query: O(n^(1/2) + k),
    for 2-D points with k results

16
kd-tree example
[Figure: kd-tree example with split values x=3, x=5, x=7, x=8 and y=2, y=5, y=6.]
17
External memory kd-trees
  • Similar to B-tree, tree nodes split many ways
    instead of two ways
  • Insertion becomes quite complex and expensive.
  • No storage utilization guarantee: when a
    higher-level node splits, the split has to be
    propagated all the way to the leaf level, resulting
    in many empty blocks.
  • Pack many interior nodes (forming a subtree) into
    a block.
  • It may not be feasible to group nodes at lower
    levels into a block productively.
  • Many interesting papers on how to optimally pack
    nodes into blocks have recently been published.

20
Z-Curve
  • What is a Z-curve?
  • A space-filling curve
  • Generated by interleaving the bits of the
    x and y coordinates
  • See Fig. 4.6
  • Alternative generation method
  • See Fig. 4.5
  • Connecting points by z-order
  • See Fig. 4.4
  • Looks like N's or Z's
  • Implementing file operations

Fig 4.6
Fig 4.4
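
Bit interleaving can be sketched in a few lines. The function name is hypothetical, and taking the x bit before the y bit is an assumption: the figures may use the opposite convention, which mirrors the curve but does not change the idea.

```python
def z_value(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y (x bit first, most significant first)."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# z-values for every cell of a 4 x 4 grid (2 bits per coordinate).
grid = {(x, y): z_value(x, y, 2) for x in range(4) for y in range(4)}
```

Sorting the cells by `grid[(x, y)]` gives the order in which the curve visits them, so the z-value can serve as the key of an ordinary one-dimensional index.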
21
Example of Z-values
  • Figure 4.7
  • Left part shows a map with spatial objects A, B,
    and C
  • Right part and bottom-left part: z-values within
    A, B, and C
  • Note: C gets z-values of 2 and 8, which are not
    close
  • Exercise: Compute z-values for B.

Fig 4.7
22
Hilbert Curve
  • A space filling curve
  • Example: Fig. 4.5
  • More complex to generate
  • due to rotations
  • Illustration on next slide!
  • Implementing file operations

Fig 4.5
23
Calculating Hilbert Values (Optional Topic)
Fig 4.8
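
One widely used way to compute Hilbert values is the iterative quadrant-rotation scheme sketched below. It is not necessarily the procedure shown in Fig. 4.8, and the orientation of the curve (which corner it starts and ends in) is a convention that may differ from the figure. The function name is hypothetical.

```python
def hilbert_value(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two; 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        # Rotate/flip so the sub-quadrant is seen in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Unlike the Z-curve, consecutive Hilbert values always belong to cells that share an edge, which is the reason for the extra rotation work.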
25
Grid File
  • Hashing methods for multidimensional points
    (extension of Extensible hashing)
  • Idea: Use a grid to partition the space; each
    cell is associated with one page
  • Two-disk-access principle (exact match)

26
Grid File
  • Start with one bucket for the whole space.
  • Select dividers along each dimension. Partition
    space into cells
  • Dividers cut all the way.
  • Each cell corresponds to 1 disk page.
  • Many cells can point to the same page.
  • Cell directory potentially exponential in the
    number of dimensions

27
Grid File Implementation
  • Dynamic structure using a grid directory
  • Grid array: a 2-dimensional array with pointers
    to buckets (this array can be large and disk
    resident): G(0, ..., nx-1, 0, ..., ny-1)
  • Linear scales: two 1-dimensional arrays that are
    used to access the grid array (main memory):
    X(0, ..., nx-1), Y(0, ..., ny-1)

28
Example
[Figure: a grid directory addressed through linear scale X and linear scale Y, with cells pointing to buckets (disk blocks).]
29
Grid File Search
  • Exact-match search: at most 2 I/Os, assuming
    linear scales fit in memory.
  • First use linear scales to determine the index
    into the cell directory
  • Access the cell directory to retrieve the bucket
    address (may cause 1 I/O if cell directory does
    not fit in memory)
  • Access the appropriate bucket (1 I/O)
  • Range queries
  • Use linear scales to determine the index into the
    cell directory.
  • Access the cell directory to retrieve the bucket
    addresses of buckets to visit.
  • Access the buckets.
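
The search steps above can be sketched with in-memory stand-ins for the disk structures. Everything here is hypothetical (divider positions, bucket contents, names); in a real grid file the directory and buckets live on disk, and each lookup marked below would be an I/O.

```python
from bisect import bisect_right

# Linear scales: divider positions along each dimension (kept in memory).
x_scale = [10, 20]   # x cells: x < 10, 10 <= x < 20, x >= 20
y_scale = [50]       # y cells: y < 50, y >= 50

# Grid directory: cell (i, j) -> bucket id. Several cells may share a bucket.
directory = [[0, 1],
             [2, 1],
             [3, 3]]
buckets = {0: [(5, 20)], 1: [(8, 70), (15, 90)], 2: [(12, 30)], 3: [(25, 10), (30, 80)]}

def cell_of(x, y):
    """Use the linear scales to find the directory cell for a point."""
    return bisect_right(x_scale, x), bisect_right(y_scale, y)

def exact_match(x, y):
    i, j = cell_of(x, y)
    bucket_id = directory[i][j]          # up to 1 I/O (directory on disk)
    return (x, y) in buckets[bucket_id]  # 1 I/O (bucket page)

def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell_of(x_lo, y_lo)
    i_hi, j_hi = cell_of(x_hi, y_hi)
    # Distinct buckets behind all cells the query rectangle touches.
    ids = {directory[i][j] for i in range(i_lo, i_hi + 1)
                           for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in ids for p in buckets[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
```

Collecting bucket ids in a set matters because many cells can point to the same page, and that page should be read only once.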

30
Grid File Insertions
  • Determine the bucket into which insertion must
    occur.
  • If space in bucket, insert.
  • Else, split bucket
  • how to choose a good dimension to split?
  • If bucket split causes a cell directory to split
    do so and adjust linear scales.
  • Insertion of these new entries potentially
    requires a complete reorganization of the cell
    directory: expensive!

31
Grid File Deletions
  • Deletions may decrease the space utilization:
    merge buckets
  • We need to decide which cells to merge and a
    merging threshold
  • Buddy system and neighbor system
  • A bucket can merge with only one buddy in each
    dimension
  • Merge adjacent regions if the result is a
    rectangle

32-36
Grid File Example
[Figures on five consecutive slides, each labeled (N6). Item labels 1 to 6 appear on slide 32, 8 to 12 on slide 33, and 14 and 15 on slide 34.]
37
The R-Tree
  • The R-tree is a tree-structured index that
    remains balanced on inserts and deletes.
  • Each key stored in a leaf entry is intuitively a
    box, or collection of intervals, with one
    interval per dimension.
  • Example in 2-D

38
R-Tree Properties
  • Leaf entry = <n-dimensional box, rid>
  • Key value is a box.
  • Box is the tightest bounding box for a data
    object.
  • Non-leaf entry = <n-dim box, ptr to child node>
  • Box covers all boxes in child node (in fact,
    subtree).
  • All leaves at same distance from root.
  • Nodes can be kept 50% full (except root).
  • Can choose a parameter m that is <= 50%, and
    ensure that every node is at least m% full.

39
Example of an R-Tree
[Figure: bounding boxes R1 to R19 drawn in the plane; labels mark a leaf entry, an index entry, and a spatial object approximated by bounding box R8.]
40
Example R-Tree (Contd.)
[Figure: the tree structure over entries R1 to R19.]
41
Search for Objects Overlapping Box Q
Start at root.
1. If current node is non-leaf, for each entry
   <E, ptr>, if box E overlaps Q, search subtree
   identified by ptr.
2. If current node is leaf, for each entry
   <E, rid>, if E overlaps Q, rid identifies an
   object that might overlap Q.
Note: May have to search several subtrees at
each node! (In contrast, a B-tree equality search
goes to just one leaf.)
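
The two search steps translate almost directly into a recursive function. This is a minimal 2-D sketch with hypothetical names (`RNode`, `overlaps`, `search`); boxes are `(xmin, ymin, xmax, ymax)` tuples and nodes are held in memory rather than on disk pages.

```python
from dataclasses import dataclass, field

def overlaps(a, b):
    """True if boxes a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class RNode:
    is_leaf: bool
    # Leaf: (box, rid) pairs. Non-leaf: (box, child RNode) pairs.
    entries: list = field(default_factory=list)

def search(node, q, out=None):
    """Return the rids of all leaf entries whose box overlaps q."""
    out = [] if out is None else out
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                out.append(ref)       # candidate: object might overlap q
            else:
                search(ref, q, out)   # may descend into several subtrees
    return out

leaf1 = RNode(True, [((0, 0, 2, 2), "a"), ((3, 3, 4, 4), "b")])
leaf2 = RNode(True, [((10, 10, 12, 12), "c")])
root = RNode(False, [((0, 0, 4, 4), leaf1), ((10, 10, 12, 12), leaf2)])
```

The returned rids are only candidates, because the stored box is an approximation of the object; an exact geometric test on the object itself is still needed.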
42
Improving Search Using Constraints
  • It is convenient to store boxes in the R-tree as
    approximations of arbitrary regions, because
    boxes can be represented compactly.
  • But why not use convex polygons to approximate
    query regions more accurately?
  • Will reduce overlap with nodes in tree, and
    reduce the number of nodes fetched by avoiding
    some branches altogether.
  • Cost of overlap test is higher than bounding box
    intersection, but it is a main-memory cost, and
    can actually be done quite efficiently.
    Generally a win.

43
Insert Entry <B, ptr>
  • Start at root and go down to best-fit leaf L.
  • Go to child whose box needs least enlargement to
    cover B; resolve ties by going to smallest area
    child.
  • If best-fit leaf L has space, insert entry and
    stop. Otherwise, split L into L1 and L2.
  • Adjust entry for L in its parent so that the box
    now covers (only) L1.
  • Add an entry (in the parent node of L) for L2.
    (This could cause the parent node to recursively
    split.)
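
The rule for descending to the best-fit leaf can be sketched as follows, for 2-D boxes given as `(xmin, ymin, xmax, ymax)`. The helper names are hypothetical, and node splitting is left out.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def enlarge(b, c):
    """Smallest box covering both b and c."""
    return (min(b[0], c[0]), min(b[1], c[1]), max(b[2], c[2]), max(b[3], c[3]))

def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to the smallest area."""
    def cost(i):
        b = child_boxes[i]
        return (area(enlarge(b, new_box)) - area(b), area(b))
    return min(range(len(child_boxes)), key=cost)
```

Applying `choose_child` at every non-leaf node on the way down yields the leaf L; after the insert, the boxes on that path are replaced by their enlarged versions.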

44
Splitting a Node During Insertion
  • The entries in node L plus the newly inserted
    entry must be distributed between L1 and L2.
  • Goal is to reduce likelihood of both L1 and L2
    being searched on subsequent queries.
  • Idea: Redistribute so as to minimize area of L1
    plus area of L2.

[Figure: two candidate splits of the same entries, one labeled GOOD SPLIT! and the other BAD!]
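
To make the area criterion concrete, the sketch below tries every two-way split of a small node and keeps the one with the least total area. Exhaustive search is exponential in the number of entries, so this is only an illustration; practical R-trees use cheaper heuristics. The names and the `min_fill` parameter are hypothetical.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def mbr(boxes):
    """Minimum bounding rectangle of a non-empty list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def best_split(boxes, min_fill=1):
    """Return (total_area, group1, group2) minimizing area(L1) + area(L2)."""
    best = None
    n = len(boxes)
    for k in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), k):
            if 0 not in group:   # pin box 0 to group 1 to skip mirrored splits
                continue
            g1 = [boxes[i] for i in group]
            g2 = [boxes[i] for i in range(n) if i not in group]
            total = area(mbr(g1)) + area(mbr(g2))
            if best is None or total < best[0]:
                best = (total, g1, g2)
    return best

boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (10, 10, 11, 11), (11, 10, 12, 11)]
total, l1, l2 = best_split(boxes)
```

For these four boxes the split that keeps the two nearby pairs together wins, because any split that mixes the pairs produces two large, mostly empty bounding boxes.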
45
Spatial Data Warehousing
  • Spatial data warehouse: integrated,
    subject-oriented, time-variant, and nonvolatile
    spatial data repository for data analysis and
    decision making
  • Spatial data integration: a big issue
  • Structure-specific formats (raster- vs.
    vector-based, OO vs. relational models, different
    storage and indexing, etc.)
  • Vendor-specific formats (ESRI, MapInfo,
    Intergraph, etc.)
  • Spatial data cube: multidimensional spatial
    database
  • Both dimensions and measures may contain spatial
    components

46
Dimensions and Measures in Spatial Data Warehouse
  • Measures
  • numerical
  • distributive (e.g. count, sum)
  • algebraic (e.g. average)
  • holistic (e.g. median, rank)
  • spatial
  • collection of spatial pointers (e.g. pointers to
    all regions with 25-30 degrees in July)
  • Dimension modeling
  • nonspatial
  • e.g. temperature 25-30 degrees generalizes to
    "hot"
  • spatial-to-nonspatial
  • e.g. region "B.C." generalizes to description
    "western provinces"
  • spatial-to-spatial
  • e.g. region "Burnaby" generalizes to region
    "Lower Mainland"
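
The difference between the three kinds of numerical measures shows up when an aggregate has to be assembled from partial results. This is a small sketch with made-up temperature readings split across two partitions.

```python
from statistics import median

# Hypothetical readings, stored in two separate partitions.
part1 = [25, 30, 28]
part2 = [10, 31]

# Distributive: per-partition results combine with the same function.
total = sum([sum(part1), sum(part2)])
count = sum([len(part1), len(part2)])

# Algebraic: derived from a fixed number of distributive results.
average = total / count

# Holistic: no constant-size summary per partition is enough. The median
# of the partition medians is generally not the true median.
true_median = median(part1 + part2)
median_of_medians = median([median(part1), median(part2)])
```

This is why distributive and algebraic measures are cheap to roll up in a data cube, while holistic ones usually need the underlying data or an approximation.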

47
Example: BC weather pattern analysis
  • Input
  • A map with about 3,000 weather probes scattered
    in B.C.
  • Daily data for temperature, precipitation, wind
    velocity, etc.
  • Concept hierarchies for all attributes
  • Output
  • A map that reveals patterns: merged (similar)
    regions
  • Goals
  • Interactive analysis (drill-down, slice, dice,
    pivot, roll-up)
  • Fast response time
  • Minimizing storage space used
  • Challenge
  • A merged region may contain hundreds of
    primitive regions (polygons)

48
Star Schema of the BC Weather Warehouse
  • Spatial data warehouse
  • Dimensions
  • region_name
  • time
  • temperature
  • precipitation
  • Measurements
  • region_map
  • area
  • count

[Figure: star schema diagram with a fact table and dimension tables.]
49
Spatial Merge
  • Precomputing all: too much storage space
  • On-line merge: very expensive

50
Methods for Computation of Spatial Data Cube
  • On-line aggregation: collect and store pointers
    to spatial objects in a spatial data cube
  • expensive and slow, need efficient aggregation
    techniques
  • Precompute and store all the possible
    combinations
  • huge space overhead
  • Precompute and store rough approximations in a
    spatial data cube
  • accuracy trade-off
  • Selective computation: only materialize those
    which will be accessed frequently
  • a reasonable choice

51
Spatial Association Analysis
  • Spatial association rule: A → B [s%, c%]
  • A and B are sets of spatial or nonspatial
    predicates
  • Topological relations: intersects, overlaps,
    disjoint, etc.
  • Spatial orientations: left_of, west_of, under,
    etc.
  • Distance information: close_to, within_distance,
    etc.
  • s% is the support and c% is the confidence of the
    rule
  • Examples
  • is_a(x, large_town) ∧ intersect(x, highway) →
    adjacent_to(x, water) [7%, 85%]
  • is_a(x, large_town) ∧ adjacent_to(x,
    georgia_strait) → close_to(x, u.s.a.)
    [1%, 78%]
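
Support and confidence can be computed by counting once the spatial predicates have been evaluated for each object. The sketch below assumes that step is already done and uses a tiny made-up data set; the predicate labels and the function name are hypothetical.

```python
# Each object is described by the set of predicates it satisfies.
objects = [
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway"},
    {"large_town", "adjacent_to_water"},
    {"intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
]

def support_confidence(data, antecedent, consequent):
    """Support = fraction satisfying A and B; confidence = that count over count(A)."""
    both = sum(1 for o in data if (antecedent | consequent) <= o)
    ante = sum(1 for o in data if antecedent <= o)
    support = both / len(data)
    confidence = both / ante if ante else 0.0
    return support, confidence

s, c = support_confidence(objects,
                          {"large_town", "intersects_highway"},
                          {"adjacent_to_water"})
```

Here two of the five objects satisfy both sides and three satisfy the antecedent, so the rule has support 2/5 and confidence 2/3.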

52
Progressive Refinement Mining of Spatial
Association Rules
  • Hierarchy of spatial relationship
  • g_close_to: near_by, touch, intersect, contain,
    etc.
  • First search for rough relationship and then
    refine it
  • Two-step mining of spatial association
  • Step 1: Rough spatial computation (as a filter)
  • Using MBR or R-tree for rough estimation
  • Step 2: Detailed spatial algorithm (as refinement)
  • Apply only to those objects which have passed
    the rough spatial association test (no less than
    min_support)
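
The filter-and-refine pattern behind the two steps can be sketched with circles standing in for arbitrary spatial objects. This is an illustrative assumption: a circle is `(cx, cy, r)`, its MBR is the enclosing square, and the exact test is cheap here, whereas for real polygons it is the expensive part. The min_support check is not shown.

```python
import math

def mbr(c):
    """Minimum bounding rectangle of a circle (cx, cy, r)."""
    cx, cy, r = c
    return (cx - r, cy - r, cx + r, cy + r)

def mbr_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(a, b):
    """Exact geometric test on the objects themselves."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]

def filter_and_refine(objects, target):
    # Step 1: rough computation on MBRs; may let through false positives.
    candidates = [o for o in objects if mbr_overlap(mbr(o), mbr(target))]
    # Step 2: detailed test, applied only to the surviving candidates.
    results = [o for o in candidates if circles_intersect(o, target)]
    return candidates, results

target = (0, 0, 1)
objects = [(1.5, 0, 1), (1.9, 1.9, 1), (5, 5, 1)]
candidates, results = filter_and_refine(objects, target)
```

The second circle passes the MBR filter but fails the exact test, and the third never reaches the exact test at all, which is where the savings come from.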

53
Spatial Classification and Spatial Trend Analysis
  • Spatial classification
  • Analyze spatial objects to derive classification
    schemes, such as decision trees in relevance to
    certain spatial properties (district, highway,
    river, etc.)
  • Example: Classify regions in a province into rich
    vs. poor according to the average family income
  • Spatial trend analysis
  • Detect changes and trends along a spatial
    dimension
  • Study the trend of nonspatial or spatial data
    changing with space
  • Example: Observe the trend of changes of the
    climate or vegetation with the increasing
    distance from an ocean

54
LSD-tree
  • Local Split Decision tree
  • Use kd-tree to partition the space. Each
    partition contains up to B points. The kd-tree is
    stored in main-memory.
  • If the kd-tree (directory) is large, we store a
    sub-tree on disk
  • Goal: the structure must remain balanced
    (external balancing property)

55
Example: LSD-tree
56
LSD-tree main points
  • Split strategies
  • Data dependent
  • Distribution dependent
  • Paging algorithm
  • Two types of splits: bucket splits and internal
    node splits

57
Handling Regions with Z-curve
Fig 4.9