Title: The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree
1. The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree
- Lars Arge (1), Mark de Berg (2), Herman Haverkort (3), and Ke Yi (1)
- (1) Department of Computer Science, Duke University
- (2) Department of Computer Science, TU Eindhoven
- (3) Institute of Information and Computing Sciences, Utrecht University
2. Problem Definition
- Input
  - N rectangles in the plane
  - Window query Q
- Output
  - All rectangles intersecting Q
- Applications
  - Spatial databases
  - GIS
  - CAD
  - Computer vision
  - Robotics
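As a point of reference for the problem statement above, here is a minimal brute-force sketch of a window query. The names `Rect`, `intersects`, and `window_query` are illustrative and do not come from the slides; touching rectangles are assumed to count as intersecting.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-parallel rectangle (illustrative type, not from the slides)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(r: Rect, q: Rect) -> bool:
    """True if the closed rectangles r and q share at least one point."""
    return (r.xmin <= q.xmax and r.xmax >= q.xmin and
            r.ymin <= q.ymax and r.ymax >= q.ymin)


def window_query(rects: List[Rect], q: Rect) -> List[Rect]:
    """Baseline: scan all N rectangles and keep those intersecting q."""
    return [r for r in rects if intersects(r, q)]
```

This scan touches every rectangle regardless of the output size, which is the cost an R-tree is meant to avoid.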
3. R-Tree
- Definition [Guttman 84]
- Advantages
  - Little redundancy
  - Multi-purpose
  - Easy to update
Fanout: $\Theta(B)$, where B is the disk block size
[Figure: example R-tree on rectangles A to I]
4. How to Build an R-Tree
- Repeated insertions
  - [Guttman 84]
  - R+-tree [Sellis et al. 87]
  - R*-tree [Beckmann et al. 90]
- Bulkloading
  - Hilbert R-Tree [Kamel and Faloutsos 94]
  - Top-down Greedy Split [Garcia et al. 98]
- Advantages of bulkloading
  - Much faster than repeated insertions
  - Better space utilization
  - Usually produces R-trees of higher quality
5. R-Tree Variant: Hilbert R-Tree
[Figure: Hilbert curve]
- To build a Hilbert R-Tree (cost $O((N/B)\log_{M/B} N)$ I/Os)
  - Sort the rectangles by the Hilbert values of their centers
  - Build a B-tree on top
- 4D Hilbert R-tree
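The two build steps above can be sketched in memory as follows. This is only an illustration under stated assumptions: rectangles have integer coordinates whose centers fall on an n x n grid (n a power of two), and the "B-tree on top" is simplified to packing B consecutive entries per node. The function names are hypothetical, and the external-memory sorting that gives the stated I/O bound is not modeled.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) on an integer grid


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip so the sub-curve is oriented consistently
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def bbox(rects: List[Rect]) -> Rect:
    """Minimal bounding box of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def hilbert_bulk_load(rects: List[Rect], B: int, n: int):
    """Return the tree as a list of levels (leaves first).

    Each node is a pair (bounding_box, entries); leaf entries are
    (rectangle, None) pairs, internal entries are child nodes.
    """
    if not rects or B < 2:
        raise ValueError("need at least one rectangle and B >= 2")

    def key(r: Rect) -> int:
        return hilbert_index(n, (r[0] + r[2]) // 2, (r[1] + r[3]) // 2)

    entries = [(r, None) for r in sorted(rects, key=key)]
    levels = []
    while True:
        # Pack B consecutive entries (in Hilbert order) into one node.
        nodes = [(bbox([e[0] for e in entries[i:i + B]]), entries[i:i + B])
                 for i in range(0, len(entries), B)]
        levels.append(nodes)
        if len(nodes) == 1:
            return levels
        entries = nodes
```

Because the curve finishes one quadrant before entering the next, consecutive rectangles in this order tend to be close in the plane, which is what keeps the leaf bounding boxes small on well-behaved data.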
6. R-Tree Variant: TGS R-Tree (Top-down Greedy Split)
- To build a TGS R-tree
  - Start from the root and build the tree top-down
  - To build one node, use binary cuts until the desired fan-out is reached
  - To make a binary cut, consider 4 orderings of the rectangles: xmin, ymin, xmax, ymax
  - In each ordering, consider the B cutting positions
  - Choose the one that minimizes the sum of the areas of the two resulting bounding boxes
- Typical bulk-load cost: $O((N/B)\log_2 N)$ I/Os
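A single greedy binary cut, as described above, might look like the following in-memory sketch. The function name and the `step` parameter (the spacing between candidate cut positions) are assumptions made for illustration; the slides do not specify how the cut positions are chosen, and a real implementation would avoid recomputing bounding boxes from scratch for every candidate.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox(rects: List[Rect]) -> Rect:
    """Minimal bounding box of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(b: Rect) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def best_binary_cut(rects: List[Rect], step: int
                    ) -> Optional[Tuple[float, List[Rect], List[Rect]]]:
    """Try all 4 orderings and every cut at a multiple of `step`.

    Returns (cost, left, right) minimizing area(bbox(left)) + area(bbox(right)),
    or None if there are too few rectangles to cut.
    """
    best = None
    for dim in range(4):  # orderings by xmin, ymin, xmax, ymax
        ordered = sorted(rects, key=lambda r: r[dim])
        for i in range(step, len(ordered), step):
            left, right = ordered[:i], ordered[i:]
            cost = area(bbox(left)) + area(bbox(right))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best
```

Applying such cuts repeatedly to the larger part until a node has the desired number of children, and then recursing into each child, gives the top-down construction outlined on the slide.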
7. Our Results
- None of the existing R-tree variants has a worst-case query performance guarantee!
  - In the worst case, a query can visit all nodes in the tree even when the output size is zero
- Priority R-Tree
  - The first R-tree variant that answers a query by visiting $O(\sqrt{N/B} + T/B)$ nodes in the worst case
  - T: output size
  - It is optimal!
  - There exists a dataset such that for any R-tree, there is an empty query that visits $\Omega(\sqrt{N/B})$ nodes [Kanth and Singh 99, Agarwal et al. 02]
8. Roadmap
- Pseudo-PR-Tree
  - Has the desired $O(\sqrt{N/B} + T/B)$ worst-case guarantee
  - Not a real R-tree
- Transform a pseudo-PR-Tree into a PR-tree
  - A real R-tree
  - Maintains the worst-case guarantee
- Experiments
  - PR-tree
  - Hilbert R-tree (2D and 4D)
  - TGS R-tree
9. Building a Pseudo-PR-Tree
[Figure: the root and its priority leaves]
Step 1: Take out the B most extreme rectangles in each direction and put them into priority leaves.
10. Building a Pseudo-PR-Tree
Step 2: Divide the remaining rectangles by their xmin coordinates and build subtrees recursively. Division is performed using xmin, ymin, xmax, ymax in a round-robin fashion, like a 4D kd-tree.
[Figure: pseudo-PR-tree root]
Analysis sketch:
- Nodes with at least one priority leaf completely reported: $O(T/B)$
- Nodes with no priority leaf completely reported: $O(\sqrt{N/B})$
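Steps 1 and 2 can be put together in a small in-memory sketch. Several details here are assumptions rather than statements from the slides: "extreme" is taken to mean smallest xmin, smallest ymin, largest xmax, and largest ymax (in that order), the remaining rectangles are split at the median, and the names `Node`, `build`, and `query` are illustrative.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox(rects: List[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


class Node:
    def __init__(self, box: Rect, rects: Optional[List[Rect]] = None,
                 children: Optional[List["Node"]] = None):
        self.box = box
        self.rects = rects              # set only for leaves
        self.children = children or []  # set only for internal nodes


def build(rects: List[Rect], B: int, depth: int = 0) -> Node:
    """Build a pseudo-PR-tree on a non-empty list of rectangles."""
    if len(rects) <= B:
        return Node(bbox(rects), rects=list(rects))
    rest = list(rects)
    children = []
    # Step 1: four priority leaves holding the B most extreme rectangles.
    for dim in range(4):
        if not rest:
            break
        # Smallest xmin / ymin first; largest xmax / ymax first.
        rest.sort(key=lambda r: r[dim], reverse=(dim >= 2))
        leaf, rest = rest[:B], rest[B:]
        children.append(Node(bbox(leaf), rects=leaf))
    # Step 2: split the rest at the median, cycling through the 4 coordinates.
    if rest:
        rest.sort(key=lambda r: r[depth % 4])
        mid = len(rest) // 2
        for half in (rest[:mid], rest[mid:]):
            if half:
                children.append(build(half, B, depth + 1))
    return Node(bbox(rects), children=children)


def query(node: Node, q: Rect) -> List[Rect]:
    """Report all stored rectangles intersecting q."""
    if not intersects(node.box, q):
        return []
    if node.rects is not None:
        return [r for r in node.rects if intersects(r, q)]
    out: List[Rect] = []
    for child in node.children:
        out.extend(query(child, q))
    return out
```

Each internal node has at most six children (four priority leaves and two subtrees), which is why the structure is not yet a real R-tree with fanout $\Theta(B)$.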
11. Pseudo-PR-Tree to a Real R-tree
12. Query Complexity Remains Unchanged
[Figure: nodes visited on the leaf level and on the next level]
13. PR-Tree: Bulkload and Updates
- Bulkload
  - $O((N/B)\log_2 N)$ I/Os → $O((N/B)\log_{M/B} N)$ I/Os, using the grid method [Agarwal et al. 01]
  - The same as the Hilbert R-tree, but with a larger constant
- Updates
  - Can use any previous heuristic to update in $O(\log_B N)$ I/Os
    - Without worst-case query guarantee
  - Use the logarithmic method
    - Insert: $O(\log_B N + \frac{1}{B}\log_{M/B} N \log_2(N/M))$ I/Os
    - Delete: $O(\log_B N)$ I/Os
- Extending to d dimensions
  - Query bound $O((N/B)^{1-1/d} + T/B)$, still optimal
  - Bulkload and update bounds remain the same
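The logarithmic method mentioned above can be illustrated with a generic in-memory sketch: keep static structures of sizes 1, 2, 4, ... and rebuild on insertion, much like incrementing a binary counter. This is the textbook binary version, not the external-memory variant behind the stated I/O bounds, and deletions are not covered. The class name and the `build_static` / `query_static` callbacks are hypothetical; in the setting of the slides the static structure would be a bulk-loaded PR-tree.

```python
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


class LogMethodIndex:
    """Insert-only index built from static structures of sizes 1, 2, 4, ..."""

    def __init__(self, build_static: Callable, query_static: Callable):
        self.build_static = build_static
        self.query_static = query_static
        # levels[i] is None or (items, structure) with len(items) == 2**i
        self.levels: List = []

    def insert(self, rect: Rect) -> None:
        carry = [rect]
        i = 0
        # Merge all full levels below the first empty one.
        while i < len(self.levels) and self.levels[i] is not None:
            carry.extend(self.levels[i][0])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        # Rebuild one static structure on the 2**i collected items.
        self.levels[i] = (carry, self.build_static(carry))

    def query(self, q: Rect) -> List[Rect]:
        out: List[Rect] = []
        for level in self.levels:
            if level is not None:
                out.extend(self.query_static(level[1], q))
        return out
```

For example, `LogMethodIndex(list, lambda s, q: [r for r in s if intersects(r, q)])` uses a plain list as a stand-in for the static structure. A query has to visit every non-empty level, which is where the extra logarithmic factor in such schemes comes from.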
14. Experiments
- Implemented with TPIE
  - Priority R-tree
  - Hilbert R-tree
  - 4D Hilbert R-tree
  - TGS R-tree
- Real-life data
  - TIGER datasets
  - 16 million rectangles
- Synthetic data
  - Varying from normal to extreme data
  - 10 million rectangles
15. Experiments with Real-Life Data
- Query performance on the TIGER datasets
- Shown: (I/Os spent in answering a query) / (T/B)
16. Experiments with Synthetic Data: SIZE
Each side of a rectangle is uniformly distributed in [0, max_side].
Queries are squares with area 1
17. Experiments with Synthetic Data: ASPECT
Fix the area, vary the aspect ratio.
18. Experiments with Synthetic Data: SKEWED
Randomly place points, then map $y \to y^c$ on the y-coordinates.
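The SIZE, ASPECT, and SKEWED descriptions can be turned into small generators, sketched below. Everything the slides leave open is an assumption here and is marked as such in the comments: centers are drawn uniformly from the unit square, ASPECT rectangles are oriented horizontally or vertically at random, and points are represented as degenerate rectangles. The function names are illustrative.

```python
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def _centered(cx: float, cy: float, w: float, h: float) -> Rect:
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def size_dataset(n: int, max_side: float, rng: random.Random) -> List[Rect]:
    """SIZE: each side uniform in [0, max_side].
    Assumption: centers are uniform in the unit square."""
    return [_centered(rng.random(), rng.random(),
                      rng.uniform(0, max_side), rng.uniform(0, max_side))
            for _ in range(n)]


def aspect_dataset(n: int, area: float, aspect: float,
                   rng: random.Random) -> List[Rect]:
    """ASPECT: fixed area, long side / short side == aspect.
    Assumptions: uniform centers, random horizontal/vertical orientation."""
    long_side = (area * aspect) ** 0.5
    short_side = (area / aspect) ** 0.5
    out = []
    for _ in range(n):
        if rng.random() < 0.5:
            w, h = long_side, short_side
        else:
            w, h = short_side, long_side
        out.append(_centered(rng.random(), rng.random(), w, h))
    return out


def skewed_dataset(n: int, c: float, rng: random.Random) -> List[Rect]:
    """SKEWED: uniform points in the unit square, then y -> y**c.
    Points are stored as degenerate rectangles."""
    out = []
    for _ in range(n):
        x, y = rng.random(), rng.random() ** c
        out.append((x, y, x, y))
    return out
```

Larger values of `max_side`, `aspect`, or `c` push the data further from the uniformly distributed case.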
19. Experiments with Synthetic Data: CLUSTER
20. Conclusions
- In theory
  - The PR-tree is the first R-tree variant that answers a window query in $O(\sqrt{N/B} + T/B)$ I/Os in the worst case, which is optimal
- In practice
  - Roughly the same as previous best R-trees on real-life and relatively nicely distributed data
  - Outperforms them significantly on more extreme data
- Future work
  - How previous heuristics may affect the performance of the PR-tree in the dynamic case
21. Lower Bound Construction
- Each bounding box intersects at least
queries
- N/B bounding boxes
- queries
- There exists a query that intersects at least $\Omega(\sqrt{N/B})$ bounding boxes
22. Pseudo-PR-Tree: Query Complexity
- Nodes v visited where all rectangles in at least one of the priority leaves of v's parent are reported: $O(T/B)$
- Let v be a node that is visited although none of the priority leaves at its parent is reported completely; consider v's parent u
[Figure: the query Q in 2D and in 4D, with the hyperplanes ymin = ymax(Q) and xmax = xmin(Q)]
23. Pseudo-PR-Tree: Query Complexity
- The cell in the 4D kd-tree of u is intersected by two different 3-dimensional hyperplanes
- The intersection of each pair of such 3-dimensional hyperplanes is a 2-dimensional hyperplane
- Lemma: the number of cells in a d-dimensional kd-tree that intersect an axis-parallel f-dimensional hyperplane is $O((N/B)^{f/d})$
- So, $O(\sqrt{N/B})$ such cells in a 4D kd-tree
- Total nodes visited: $O(\sqrt{N/B} + T/B)$
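For readers who want the arithmetic spelled out, the last steps of the argument amount to plugging f = 2 and d = 4 into the kd-tree lemma and adding the output-sensitive term. The following is only a restatement of the bullets above in LaTeX, not an additional result.

```latex
\[
  O\!\left((N/B)^{f/d}\right)\Big|_{f=2,\;d=4}
  = O\!\left((N/B)^{2/4}\right)
  = O\!\left(\sqrt{N/B}\right)
\]
\[
  \text{total nodes visited}
  = \underbrace{O(T/B)}_{\text{a priority leaf fully reported}}
  + \underbrace{O\!\left(\sqrt{N/B}\right)}_{\text{no priority leaf fully reported}}
  = O\!\left(\sqrt{N/B} + T/B\right)
\]
```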
24. Experiments with Real-Life Data
- Datasets: TIGER/Line data
- Bulk-loading