Title: Finding Data Broadness Via Generalized Nearest Neighbors
1Finding Data Broadness Via Generalized Nearest Neighbors
- Jayendra Venkateswaran (1), Tamer Kahveci (1)
- Orhan Camoglu (2)
(1) University of Florida-Gainesville, (2) University of California-Santa Barbara
www.cise.ufl.edu/jgvenkat
EDBT 2006
2Data Broadness
[Figure: houses h1-h5 and restaurants r1-r5, with the 2-NN circle of a house. Legend: Houses, Mexican Restaurants, Other Restaurants, Mexican Restaurants in the 2-NN of at least 3 Houses.]
3Problem Definition
- Given two datasets R and S, a Generalized Nearest Neighbor (GNN) query is given by
- GNN(R, S, S', k, t), where S' ⊆ S and k, t > 0.
- The query returns tuples (s, Rs), where s ∈ S', Rs ⊆ R is the set of points that have s in their k-NN, and |Rs| ≥ t.
- R and S are larger than the available memory.
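As a reference point for the definition above, here is a minimal brute-force sketch in Python. It assumes points are coordinate tuples, Euclidean distance, and that everything fits in memory, which is exactly the setting the slides do not assume. The function name `gnn_bruteforce` is hypothetical and is not the authors' method.

```python
from math import dist


def gnn_bruteforce(R, S, S_query, k, t):
    """Return {s: R_s} for each s in S_query with |R_s| >= t.

    R_s is the list of points r in R that have s among their k nearest
    neighbors in S. Points are tuples of coordinates (an assumption).
    """
    query = set(S_query)
    result = {s: [] for s in S_query}
    for r in R:
        # k nearest neighbors of r among all of S (ties broken by sort order).
        knn = sorted(S, key=lambda s: dist(r, s))[:k]
        for s in knn:
            if s in query:
                result[s].append(r)
    # Keep only the query points that are "broad" enough.
    return {s: rs for s, rs in result.items() if len(rs) >= t}


restaurants = [(0, 0), (10, 0), (5, 5)]
houses = [(1, 0), (0, 1), (9, 0), (5, 4)]
# Restaurants that are the nearest restaurant of at least 2 houses.
broad = gnn_bruteforce(houses, restaurants, restaurants, k=1, t=2)
```

This costs one sort of S per point of R, so it is only useful for checking the output of a faster method on small inputs.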
4Observations
- GNN has the following special cases:
- K-Nearest Neighbor (K-NN): GNN({r}, S, S, k, 1)
- All Nearest Neighbor (ANN): GNN(R, S, S, 1, 1)
- Reverse Nearest Neighbor (RNN): GNN(R, S, {s}, 1, 1)
- Our goal is to solve a broader problem with
reasonable I/O and CPU costs using available
memory.
5Proposed Solution
- Index data using a well-known index structure such as the R-Tree and packing methods such as the Sort-Tile-Recursive method.
- Predict solution space.
- Search strategies based on the buffer constraints.
- Static: Fetch All and Fetch One.
- Dynamic: Fetch Dynamic.
- Optimizations: Column Filter, Row Filter, Adaptive Filter and Partitioning.
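The packing step mentioned above can be illustrated with a small sketch of Sort-Tile-Recursive (STR) leaf packing. It assumes 2-D points and a fixed page capacity; the names `str_pack` and `mbr` are hypothetical, and the slides do not say how the authors configured the packing.

```python
from math import ceil, sqrt


def str_pack(points, capacity):
    """Group 2-D points into leaf pages of at most `capacity` points (STR)."""
    if not points:
        return []
    n_leaves = ceil(len(points) / capacity)
    n_slices = ceil(sqrt(n_leaves))
    slice_size = n_slices * capacity
    by_x = sorted(points, key=lambda p: p[0])
    leaves = []
    for i in range(0, len(by_x), slice_size):
        # Each vertical slice is sorted by y and cut into pages.
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[j:j + capacity])
    return leaves


def mbr(page):
    """Minimum bounding rectangle of a page as (lows, highs)."""
    xs, ys = zip(*page)
    return (min(xs), min(ys)), (max(xs), max(ys))


grid = [(x, y) for x in range(4) for y in range(4)]
leaves = str_pack(grid, capacity=4)
```

On the 4 x 4 grid this yields four pages whose MBRs tile the grid without overlap, which is the property that makes the MBR-level distance bounds tight.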
6R-Tree Example
R-Tree Structure
7Predicting the Solution: Priority Table
Query: GNN(R, S, {s1,s3,s4,s5,s7}, k, t)
[Figure: MBR r1 and MBRs s1-s8, with MINDIST and MAXDIST marked; Priority Table (PT) with columns s1-s8 and rows r1-r8. Labels: MINDIST, MAXDIST, Priority Table (PT), First Row in PT.]
8Pruning PT: Column Filter and Row Filter
Query: GNN(R, S, {s1,s3,s4,s5,s7}, k, t)
[Figure: Priority Table with columns s1-s8 and rows r1-r8; annotation: "If r4 < t".]
9Fetch All (FA)
Find the maximal cluster of pages that fits into the buffer, starting from the first row of R.
[Figure: Priority Table with columns s1-s8 and rows r1-r8.]
10Fetch All (Contd.)
- Reorder clusters to maximize overlap: improves disk I/O time.
- Greedy approach: add the pair of clusters having maximum overlap.
Before Reordering (cluster: pages, disk reads)
C1: r1,r2,s1,s3,s4,s7 (6)
C2: r3,r4,s2,s5,s6,s8 (6)
C3: r5,s1,s2,s4,s7 (4)
C4: r6,r7,r8,s1,s2,s6 (5)
Disk Reads: 21
After Reordering (cluster: pages, disk reads)
C1: r1,r2,s1,s3,s4,s7 (6)
C3: r5,s1,s2,s4,s7 (3)
C4: r6,r7,r8,s1,s2,s6 (4)
C2: r3,r4,s2,s5,s6,s8 (4)
Disk Reads: 17
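The greedy reordering can be sketched as follows. The slide only says to add the pair of clusters with maximum overlap; seeding with the best pair and then always extending the tail of the order is an assumption made here for simplicity, and `greedy_order` is a hypothetical name. The sketch computes only the order and does not model the disk-read counts shown on the slide.

```python
from itertools import combinations


def greedy_order(clusters):
    """Order clusters so that consecutive clusters share many pages.

    `clusters` maps a cluster name to the set of pages it needs.
    """
    names = list(clusters)
    if len(names) < 2:
        return names

    def overlap(a, b):
        return len(clusters[a] & clusters[b])

    # Seed the order with the pair of clusters sharing the most pages.
    first, second = max(combinations(names, 2), key=lambda p: overlap(*p))
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        # Append the cluster that overlaps most with the current tail.
        nxt = max(remaining, key=lambda n: overlap(order[-1], n))
        order.append(nxt)
        remaining.remove(nxt)
    return order


clusters = {
    "C1": {"r1", "r2", "s1", "s3", "s4", "s7"},
    "C2": {"r3", "r4", "s2", "s5", "s6", "s8"},
    "C3": {"r5", "s1", "s2", "s4", "s7"},
    "C4": {"r6", "r7", "r8", "s1", "s2", "s6"},
}
order = greedy_order(clusters)
```

For the four clusters of the slide, this produces the same order as the "After Reordering" column: C1, C3, C4, C2.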
11Fetch One (FO)
- FA can read irrelevant pages.
- Solution: read one candidate per row.
[Figure: Priority Table with columns s1-s8 and rows r1-r8.]
12Fetch Dynamic (FD)
- Random seek costs in FO can be significant.
- Can read f, f > 1, candidates per row.
- f = median of the number of candidates read for the rows processed.
[Figure: Priority Table with columns s1-s8 and rows r1-r8.]
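The update rule for f can be sketched as a running median. The class name `FetchDynamic`, the starting value of f, and the use of the high median (so that f is always an integer that was actually observed) are assumptions; the slides state only that f is the median of the number of candidates read for the rows processed.

```python
from statistics import median_high


class FetchDynamic:
    """Tracks f, the number of candidates to fetch per row."""

    def __init__(self, initial_f=1):
        self.history = []
        self.f = initial_f  # assumed starting value before any row is processed

    def record(self, candidates_read):
        """Record how many candidates one processed row actually needed."""
        self.history.append(candidates_read)
        # f follows the median over the rows processed so far, never below 1.
        self.f = max(1, median_high(self.history))
        return self.f


fd = FetchDynamic()
f_values = [fd.record(n) for n in (1, 3, 4, 2)]
```

A median reacts slowly to a single row that needed many candidates, which keeps one outlier from inflating the number of pages fetched for every later row.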
13Process Buffer
Read clusters in order and process their pages.
[Figure: MBR r with its k-NN Box and k-NN Circle, distance dmax, and MBRs s1-s4.]
14Further Improvements: Adaptive Filter
- Costs with k-NN Box
- 3 Disk Reads for r, s1, s2.
- 8 x 4 = 32 Object Comparisons.
- Costs with Adaptive k-NN Box
- 2 Disk Reads for r, s1.
- 4 comparisons for a pass on s1
- 8 object comparisons, 4 + 8 = 12 comparisons in total
[Figure: MBR r with its k-NN Circle, k-NN Box and Adaptive k-NN Box, and MBRs s1 and s2.]
15Further Improvements: Partitioning
- Costs with Partitioned k-NN Boxes
- 2 Disk Reads for r, s1.
- 8 comparisons for two passes on s1
- 2 object comparisons, 8 + 2 = 10 comparisons in total
[Figure: MBR r with its k-NN Circle and Partitioned k-NN Box, and MBR s1.]
16Experimental Setup
- Datasets
- Image datasets (2-16 Dimensions).
- Protein Dataset (3 Dimensions).
- Experiments
- Evaluation of Optimizations
- Our Methods
- FA
- FO
- FD
- Existing Methods
- Sequential Scan (SS).
- R-Tree based method (RT).
- Mux-Join (kNN-join)
- GORDER (kNN-join)
- RkNN
17Evaluation of Optimizations
Image Dataset
Buffer: 10% of database size
18Comparison of Our Methods
Image Dataset
19Comparison with Other Methods
Protein Data
20Comparison with Other Methods: Scalability
Image Data
21Conclusion
- Considered the problem of Data Broadness.
- Generalized Nearest Neighbor (GNN) to express data broadness.
- Our method is faster than R-Tree, GORDER, Mux-Index and RkNN methods.
22Thank You
23Comparison with Other Methods: Scalability
Image Data
24Comparison with Other Methods
RkNN
GORDER
Image Data
25Comparison of Our Methods
26Comparison with Other Methods
Protein Data
27Predict Solution Space
- Construct a priority list for each MBR A in R:
- Sort MBRs B in S based on MAXDIST(A,B).
- [formula not recoverable], 1 < m < |S| (|S| = number of MBRs in S)
- Find Bi ∈ S, where MINDIST(A,Bi) < MAXDIST(A,Bm).
- Assign priorities to Bi in increasing order of MINDIST(A,Bi).
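The steps above can be sketched in Python using the standard MINDIST and MAXDIST bounds between two axis-aligned rectangles. MBRs are assumed to be `(lows, highs)` tuples and the distance Euclidean. Because the rule that selects m is not legible in the slide, m is left as an input here; the name `priority_list` is hypothetical.

```python
from math import sqrt


def mindist(a, b):
    """Smallest possible distance between a point in MBR a and one in MBR b."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = (max(0.0, bl - ah, al - bh)
            for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return sqrt(sum(g * g for g in gaps))


def maxdist(a, b):
    """Largest possible distance between a point in MBR a and one in MBR b."""
    (alo, ahi), (blo, bhi) = a, b
    spans = (max(abs(bh - al), abs(ah - bl))
             for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return sqrt(sum(s * s for s in spans))


def priority_list(a, s_mbrs, m):
    """Names of candidate MBRs for a, ordered by increasing MINDIST.

    `s_mbrs` maps a name to an MBR; `m` is the 1-based rank, in MAXDIST
    order, of the MBR Bm that sets the pruning bound (taken as an input).
    """
    by_max = sorted(s_mbrs, key=lambda n: maxdist(a, s_mbrs[n]))
    bound = maxdist(a, s_mbrs[by_max[m - 1]])
    candidates = [n for n in s_mbrs if mindist(a, s_mbrs[n]) < bound]
    return sorted(candidates, key=lambda n: mindist(a, s_mbrs[n]))


A = ((0, 0), (1, 1))
S_mbrs = {
    "B1": ((2, 0), (3, 1)),
    "B2": ((5, 0), (6, 1)),
    "B3": ((0, 3), (1, 4)),
    "B4": ((10, 10), (11, 11)),
}
row = priority_list(A, S_mbrs, m=2)
```

In this toy input B4 is pruned because even its closest corner is farther from A than the farthest corner of the second-ranked MBR.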
28Further Improvements: Adaptive Filter
- Comparing two MBRs (r ∈ R and s ∈ S) takes O(t²) time, where t is the number of objects in each of r and s.
- Adaptive Filter
- Obtain the k-NN box of r by expanding each dimension by a different amount.
- A single pass on s eliminates unpromising points from S.
- Find points in the adaptive k-NN box of r.
- Adaptive filter reduces the t² comparisons.
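The filtering idea can be sketched as a box test followed by distance computations on the survivors only. The slides do not say how the per-dimension expansion amounts are chosen, so `margins` is an input in this sketch, and the function names are hypothetical. The result is exact only if the margins are large enough to cover the true k-NN distance in every dimension.

```python
from math import dist


def box_filter(r_mbr, margins, s_points):
    """Single pass over s: keep points inside r's MBR grown by `margins`.

    `margins[d]` is the expansion applied to dimension d (an input here).
    """
    lows, highs = r_mbr
    return [p for p in s_points
            if all(lo - m <= x <= hi + m
                   for lo, hi, m, x in zip(lows, highs, margins, p))]


def knn_after_filter(r_points, r_mbr, margins, s_points, k):
    """k nearest surviving s points for each r point (brute force on survivors)."""
    survivors = box_filter(r_mbr, margins, s_points)
    return {r: sorted(survivors, key=lambda p: dist(r, p))[:k] for r in r_points}


r_mbr = ((0, 0), (2, 2))
s_points = [(2.5, 1), (1, 2.4), (1, 2.6), (4, 1), (-1, 0)]
kept = box_filter(r_mbr, (1, 0.5), s_points)
nearest = knn_after_filter([(2, 1)], r_mbr, (1, 0.5), s_points, k=1)
```

The pass over s costs one box test per point, and the quadratic comparison then runs only against the points that survive it.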