Title: Finding Data Broadness Via Generalized Nearest Neighbors

1
Finding Data Broadness Via Generalized Nearest Neighbors

Jayendra Venkateswaran (1), Tamer Kahveci (1), Orhan Camoglu (2)

(1) University of Florida, Gainesville
(2) University of California, Santa Barbara
www.cise.ufl.edu/jgvenkat
EDBT 2006
2
Data Broadness
[Figure: map of houses (h1-h5) and restaurants (r1-r5), split into Mexican Restaurants and Other Restaurants, with 2-NN circles drawn. Highlighted: Mexican restaurants in the 2-NN of at least 3 houses.]
3
Problem Definition

Given two datasets R and S, a Generalized Nearest
Neighbor (GNN) query is given by
GNN(R, S, S', k, t), where S' ⊆ S and k, t > 0.
The query returns tuples (s, Rs), where s ∈ S', Rs ⊆ R is
the set of points that have s in their k-NN, and |Rs| ≥ t.
|R| and |S| > available memory.
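
As a reference point for the definition above, here is a minimal brute-force sketch in Python. It assumes Euclidean distance and that everything fits in memory, which is exactly the case the slides do not target; the function names `knn` and `gnn` and the sample coordinates are illustrative only.

```python
import math

def knn(r, S, k):
    """Return the k points of S closest to r (Euclidean distance)."""
    return sorted(S, key=lambda s: math.dist(r, s))[:k]

def gnn(R, S, S_prime, k, t):
    """Brute-force GNN(R, S, S', k, t): list of (s, R_s) with |R_s| >= t."""
    # For each s in S', collect the points of R that have s among their k-NN in S.
    hits = {s: [] for s in S_prime}
    for r in R:
        for s in knn(r, S, k):
            if s in hits:
                hits[s].append(r)
    return [(s, rs) for s, rs in hits.items() if len(rs) >= t]

# Toy version of the houses/restaurants example (made-up coordinates).
houses = [(0, 0), (1, 0), (0, 1), (10, 10)]
restaurants = [(0.5, 0.5), (2, 2), (9, 9), (20, 20)]
result = gnn(houses, restaurants, restaurants, k=2, t=3)
```

Here `result` keeps only the restaurants that appear in the 2-NN of at least 3 houses.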

4
Observations

GNN has the following special cases:
k-Nearest Neighbor (k-NN): GNN({r}, S, S, k, 1)
All Nearest Neighbor (ANN): GNN(R, S, S, 1, 1)
Reverse Nearest Neighbor (RNN): GNN(R, S, {s}, 1, 1)
Our goal is to solve a broader problem with
reasonable I/O and CPU costs using available
memory.
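
The special cases can be checked on a toy input with a compact brute-force GNN. This is a sketch under the assumptions of Euclidean distance and in-memory data; `gnn` and the sample points are hypothetical and are not the authors' method.

```python
import math

def gnn(R, S, S_prime, k, t):
    """Brute-force GNN: map each s in S' to the r in R that have s in their k-NN."""
    out = {}
    for s in S_prime:
        rs = [r for r in R
              if s in sorted(S, key=lambda p: math.dist(r, p))[:k]]
        if len(rs) >= t:
            out[s] = rs
    return out

R = [(0, 0), (4, 0), (10, 0)]
S = [(1, 0), (5, 0), (6, 0)]

knn_of_r = gnn([R[0]], S, S, k=2, t=1)   # k-NN of a single point r
ann = gnn(R, S, S, k=1, t=1)             # all nearest neighbors
rnn_of_s = gnn(R, S, [S[1]], k=1, t=1)   # reverse nearest neighbors of one s
```

Only the arguments change between the three queries, which is the sense in which GNN generalizes them.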

5
Proposed Solution

Index the data using a well-known index structure such
as the R-Tree and a packing method such as
Sort-Tile-Recursive (STR).
Predict the solution space.
Choose a search strategy based on the buffer
constraints.
Static: Fetch All and Fetch One.
Dynamic: Fetch Dynamic.
Optimizations: Column Filter, Row Filter,
Adaptive Filter, and Partitioning.
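
To make the packing step concrete, the following is a minimal sketch of the leaf level of Sort-Tile-Recursive for 2-D points. It is a simplified illustration, not the index code used in the talk; `str_pack`, `mbr`, and the page capacity are assumptions.

```python
import math

def str_pack(points, capacity):
    """Group 2-D points into leaf pages with Sort-Tile-Recursive (leaf level only)."""
    n_pages = math.ceil(len(points) / capacity)
    n_slices = math.ceil(math.sqrt(n_pages))
    slice_size = n_slices * capacity
    by_x = sorted(points)  # sort by x (ties broken by y)
    pages = []
    for i in range(0, len(by_x), slice_size):
        # Each vertical slice is sorted by y and cut into pages.
        tile = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(tile), capacity):
            pages.append(tile[j:j + capacity])
    return pages

def mbr(page):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a page."""
    xs, ys = zip(*page)
    return (min(xs), min(ys), max(xs), max(ys))

points = [(x, y) for x in range(4) for y in range(4)]
pages = str_pack(points, capacity=4)
mbrs = [mbr(p) for p in pages]
```

On this 4 x 4 grid the four pages tile the space without overlap, which is the property that makes packed R-Trees attractive for the MBR-based pruning that follows.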

6
R-Tree Example
R-Tree Structure
7
Predicting the Solution: Priority Table
[Figure: MBR r1 of R surrounded by MBRs s1-s8 of S, with MINDIST and MAXDIST marked; Priority Table (PT) with rows r1-r8 and columns s1-s8, showing the first row in PT for the query GNN(R, S, {s1, s3, s4, s5, s7}, k, t).]
8
Pruning PT: Column Filter and Row Filter
GNN(R, S, {s1, s3, s4, s5, s7}, k, t)
[Figure: Priority Table with rows r1-r8 and columns s1-s8 after pruning; annotation: "If r4 < t".]
9
Fetch All (FA)
Find the maximal cluster of pages that fits into the
buffer, starting from the first row of R.
[Figure: Priority Table with rows r1-r8 and columns s1-s8.]
Buffer size: b = 6
10
Fetch All (Contd.)

Reorder clusters to maximize overlap between consecutive
clusters; this reduces disk I/O time.
Greedy approach: add the pair of clusters having
maximum overlap.

Before Reordering                    After Reordering
Reads  Cluster                       Reads  Cluster
6      C1: r1,r2,s1,s3,s4,s7         6      C1: r1,r2,s1,s3,s4,s7
6      C2: r3,r4,s2,s5,s6,s8         3      C3: r5,s1,s2,s4,s7
4      C3: r5,s1,s2,s4,s7            4      C4: r6,r7,r8,s1,s2,s6
5      C4: r6,r7,r8,s1,s2,s6         4      C2: r3,r4,s2,s5,s6,s8
Disk reads: 21                       Disk reads: 17
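
The reordering idea can be sketched with a simple greedy chain: start from the first cluster and always append the unused cluster that shares the most pages with the one read last. This is one possible reading of the greedy approach, and the cost model (only pages shared with the immediately preceding cluster are reused) is a simplifying assumption, so the totals it produces need not match the slide's own counts of 21 and 17. The names `greedy_order` and `disk_reads` are illustrative.

```python
def greedy_order(clusters):
    """Chain clusters so that consecutive clusters share as many pages as possible."""
    names = list(clusters)
    order = [names[0]]
    remaining = names[1:]
    while remaining:
        last = clusters[order[-1]]
        nxt = max(remaining, key=lambda c: len(clusters[c] & last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def disk_reads(order, clusters):
    """Pages read when pages shared with the previous cluster stay in the buffer."""
    reads, prev = 0, set()
    for name in order:
        reads += len(clusters[name] - prev)
        prev = clusters[name]
    return reads

clusters = {
    "C1": {"r1", "r2", "s1", "s3", "s4", "s7"},
    "C2": {"r3", "r4", "s2", "s5", "s6", "s8"},
    "C3": {"r5", "s1", "s2", "s4", "s7"},
    "C4": {"r6", "r7", "r8", "s1", "s2", "s6"},
}
before = list(clusters)
after = greedy_order(clusters)
saved = disk_reads(before, clusters) - disk_reads(after, clusters)
```

On the slide's clusters this yields the order C1, C3, C4, C2 and saves 4 page reads under the simplified model, the same saving as the difference between the slide's two totals.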
11
Fetch One (FO)

FA can read irrelevant pages.
Solution: read one candidate per row.
[Figure: Priority Table with rows r1-r8 and columns s1-s8.]
Buffer size: b = 6
12
Fetch Dynamic (FD)

Random seek costs in FO can be significant.
Instead, read f candidates per row, with f > 1.
f = median of the number of candidates read for the rows
processed so far.
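
A minimal sketch of how f could be updated from the history of processed rows. The function name `next_fetch_size`, the fallback for an empty history, and the rounding of a fractional median are assumptions; the slide only states that f is the median and that f > 1.

```python
from statistics import median

def next_fetch_size(candidates_read_so_far, minimum=2):
    """Pick f for the next row as the median of candidates read per processed row."""
    if not candidates_read_so_far:
        return minimum  # assumption: start with the smallest allowed f
    # Truncate a fractional median and never go below the minimum (f > 1).
    return max(minimum, int(median(candidates_read_so_far)))

f_after_three_rows = next_fetch_size([3, 5, 4])
```

Using the median rather than the mean keeps a single row with many candidates from inflating f for all later rows.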

[Figure: Priority Table with rows r1-r8 and columns s1-s8.]
Buffer size: b = 6
13
Process Buffer
Read the clusters in order and process their pages.
[Figure: MBR r with its k-NN Circle (radius dmax) and the k-NN Box of r; MBRs s1-s4 around it.]
14
Further Improvements: Adaptive Filter
[Figure: MBR r with its k-NN Circle, the k-NN Box of r, and the Adaptive k-NN Box of r; MBRs s1 and s2.]

Costs with k-NN Box
3 disk reads, for r, s1, s2.
8 x 4 = 32 object comparisons.

Costs with Adaptive k-NN Box
2 disk reads, for r, s1.
4 comparisons for a pass on s1
+ 8 object comparisons = 12 comparisons.
15
Further Improvements: Partitioning
[Figure: MBR r with its k-NN Circle and Partitioned k-NN Box; MBR s1.]

Costs with Partitioned k-NN Boxes
2 disk reads, for r, s1.
8 comparisons for two passes on s1
+ 2 object comparisons = 10 comparisons.

16
Experimental Setup

Datasets
  Image datasets (2-16 dimensions).
  Protein dataset (3 dimensions).
Experiments
  Evaluation of optimizations
Our Methods
  FA
  FO
  FD
Existing Methods
  Sequential Scan (SS)
  R-Tree based method (RT)
  Mux-Join (kNN-join)
  GORDER (kNN-join)
  RkNN

17
Evaluation of Optimizations
Image Dataset
Buffer: 10% of database size
18
Comparison of Our Methods
Image Dataset
19
Comparison with Other Methods
Protein Data
20
Comparison with Other Methods Scalability
Image Data
21
Conclusion

Considered the problem of data broadness.
Introduced the Generalized Nearest Neighbor (GNN) query to
express data broadness.
Our methods are faster than the R-Tree, GORDER,
Mux-Index, and RkNN methods.

22
Thank You

Questions?

23
Comparison with Other Methods Scalability
Image Data
24
Comparison with Other Methods
RkNN
GORDER
Image Data
25
Comparison of Our Methods
26
Comparison with Other Methods
Protein Data
27
Predict Solution Space

Construct a priority list for each MBR A in R:
Sort the MBRs B in S based on MAXDIST(A, B).
..., 1 < m < |S|
(MBRs in S)
Find Bi ∈ S, where MINDIST(A, Bi) <
MAXDIST(A, Bm).
Assign priorities to the Bi in increasing order of
MINDIST(A, Bi).
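
The steps above can be sketched as follows. The condition that selects m did not survive in the transcript, so this sketch ASSUMES that m is the shortest MAXDIST-sorted prefix of S's MBRs that together hold at least k points. An MBR is represented as a list of (low, high) pairs per dimension, and all names (`mindist`, `maxdist`, `priority_list`, `s_counts`) are illustrative.

```python
import math

def mindist(a, b):
    """Smallest possible distance between a point in box a and one in box b."""
    gaps = [max(0.0, bl - ah, al - bh) for (al, ah), (bl, bh) in zip(a, b)]
    return math.hypot(*gaps)

def maxdist(a, b):
    """Largest possible distance between a point in box a and one in box b."""
    spans = [max(abs(bh - al), abs(ah - bl)) for (al, ah), (bl, bh) in zip(a, b)]
    return math.hypot(*spans)

def priority_list(a, s_boxes, s_counts, k):
    """Candidate MBRs of S for MBR a, ordered by increasing MINDIST."""
    by_max = sorted(s_boxes, key=lambda name: maxdist(a, s_boxes[name]))
    # ASSUMPTION: m is the shortest prefix of by_max holding at least k points.
    covered, bound = 0, math.inf
    for name in by_max:
        covered += s_counts[name]
        if covered >= k:
            bound = maxdist(a, s_boxes[name])
            break
    # The slide writes "<"; "<=" is used here so zero-extent MBRs are not dropped.
    candidates = [n for n in by_max if mindist(a, s_boxes[n]) <= bound]
    return sorted(candidates, key=lambda n: mindist(a, s_boxes[n]))

A = [(0, 1), (0, 1)]
S_boxes = {
    "s1": [(1, 2), (0, 1)],
    "s2": [(3, 4), (0, 1)],
    "s3": [(6, 7), (0, 1)],
    "s4": [(0, 1), (2, 3)],
}
S_counts = {"s1": 2, "s2": 2, "s3": 2, "s4": 2}
row = priority_list(A, S_boxes, S_counts, k=3)
```

In this example s3 is pruned because its MINDIST to A exceeds the MAXDIST bound obtained from the first two MBRs in MAXDIST order.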

28
Further Improvements: Adaptive Filter

Comparing two MBRs (r ∈ R and s ∈ S) takes
O(t²) time, where t is the number of objects in each
of r and s.
Adaptive Filter
Obtain the k-NN box of r by expanding each
dimension by a different amount.
A single pass on s eliminates unpromising points
from S.
Find the points in the adaptive k-NN box of r.
The adaptive filter reduces the t² comparisons.
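
A minimal sketch of the single-pass filtering step. How the per-dimension expansion amounts are derived is not stated on the slide, so they are passed in as hypothetical inputs; `expand_box`, `filter_points`, and the sample values are illustrative.

```python
def expand_box(box, amounts):
    """Grow each dimension (low, high) of a box by its own amount."""
    return [(lo - d, hi + d) for (lo, hi), d in zip(box, amounts)]

def filter_points(points, box):
    """Single pass over the points of s: keep only those inside the box."""
    return [p for p in points
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, box))]

r_box = [(0.0, 2.0), (0.0, 2.0)]
s_points = [(2.5, 1.0), (4.0, 1.0), (1.0, 2.4), (1.0, 5.0)]

# Same expansion in every dimension versus per-dimension amounts (made-up values).
uniform = filter_points(s_points, expand_box(r_box, [3.0, 3.0]))
adaptive = filter_points(s_points, expand_box(r_box, [1.0, 0.5]))
```

Only the points that survive the pass need to be compared against the objects of r, which is where the saving over the full t² comparisons comes from.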