Transcript and Presenter's Notes

Title: Finding Data Broadness Via Generalized Nearest Neighbors


1
Finding Data Broadness Via Generalized Nearest Neighbors
  • Jayendra Venkateswaran¹, Tamer Kahveci¹
  • Orhan Camoglu²

¹University of Florida, Gainesville   ²University of California, Santa Barbara
www.cise.ufl.edu/jgvenkat
EDBT 2006
2
Data Broadness
[Figure: houses h1-h5, Mexican restaurants, and other restaurants r1-r5
on a map, with a 2-NN circle around each house; the Mexican restaurants
that lie in the 2-NN of at least 3 houses are highlighted.]
3
Problem Definition
  • Given two datasets R and S, a Generalized Nearest
    Neighbor (GNN) query is written as
  • GNN(R, S, S', k, t), where S' ⊆ S and k, t > 0
  • The query returns tuples (s, Rs) where s ∈ S' and
    Rs ⊆ R is the set of objects that have s in their
    k-NN, with |Rs| ≥ t
  • Both R and S exceed the available memory.
    (A minimal sketch of these semantics follows.)
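As a point of reference, here is a minimal in-memory Python sketch of the
query semantics. It is illustrative only: the function name gnn and the
brute-force k-NN scan are our assumptions, while the paper's methods target
disk-resident data.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def gnn(R, S, S_prime, k, t):
    """Brute-force GNN(R, S, S', k, t): return the pairs (s, Rs) with
    s in S', Rs = {r in R : s is among the k-NN of r in S}, and
    |Rs| >= t. Points are coordinate tuples."""
    Rs = {s: [] for s in S_prime}
    for r in R:
        for s in sorted(S, key=lambda p: dist(r, p))[:k]:  # k-NN of r in S
            if s in Rs:
                Rs[s].append(r)
    return [(s, members) for s, members in Rs.items() if len(members) >= t]
```

On the slide-2 example, gnn(houses, restaurants, mexican, 2, 3) would return
exactly the Mexican restaurants that lie in the 2-NN of at least 3 houses.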

4
Observations
  • GNN has the following special cases:
  • k-Nearest Neighbor (k-NN): GNN(r, S, S, k, 1)
  • All Nearest Neighbors (ANN): GNN(R, S, S, 1, 1)
  • Reverse Nearest Neighbor (RNN): GNN(R, S, {s}, 1, 1)
  • Our goal is to solve this broader problem with
    reasonable I/O and CPU costs using the available
    memory. (The special cases are expressed below in
    terms of the gnn() sketch.)
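Written against the hypothetical gnn() helper above, the special cases read as:

```python
# Special cases of GNN, phrased with the gnn() sketch above:
def knn(r, S, k):    # k-NN of a single query object r
    return gnn([r], S, S, k, 1)

def ann(R, S):       # All Nearest Neighbors
    return gnn(R, S, S, 1, 1)

def rnn(R, S, s):    # Reverse Nearest Neighbors of s
    return gnn(R, S, [s], 1, 1)
```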

5
Proposed Solution
  • Index the data using a well-known index structure
    such as the R-tree, built with a packing method such
    as Sort-Tile-Recursive.
  • Predict the solution space.
  • Search strategies based on the buffer constraints:
  • Static: Fetch All and Fetch One.
  • Dynamic: Fetch Dynamic.
  • Optimizations: Column Filter, Row Filter,
    Adaptive Filter, and Partitioning.

6
R-Tree Example
[Figure: sample MBRs and the corresponding R-tree structure.]
7
Predicting the Solution: Priority Table
GNN(R, S, {s1, s3, s4, s5, s7}, k, t)
[Figure: MBRs r1-r8 of R and s1-s8 of S, with MINDIST and MAXDIST shown
from r1 to the MBRs of S. The priority table (PT) has one row per MBR
of R and one column per MBR of S; the first row of the PT lists the
candidate columns for r1.]
8
Pruning the PT: Column Filter and Row Filter
GNN(R, S, {s1, s3, s4, s5, s7}, k, t)
[Figure: priority table with rows r1-r8 and columns s1-s8. The column
filter prunes columns that cannot appear in the answer, and the row
filter prunes a row such as r4 when its candidate count falls below t.]
9
Fetch All (FA)
Find the maximal cluster of pages that fits into the
buffer, starting from the first row of R. (A sketch of
the clustering follows.)
[Figure: priority table with rows r1-r8 and columns s1-s8, with pages
grouped into clusters.]
  • Buffer size b = 6
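A rough sketch of this step, under our assumption that a cluster packs each
R page together with its candidate S pages from the priority table and may
not outgrow the buffer capacity b (the paper's exact grouping rule may differ):

```python
def fetch_all_clusters(rows, candidates, b):
    """Scan the rows of the priority table in order; grow the current
    cluster with each row's page and its candidate S pages, starting a
    new cluster whenever adding a row would exceed the buffer size b."""
    clusters, current = [], set()
    for r in rows:
        pages = {r} | set(candidates[r])   # pages needed to process row r
        if current and len(current | pages) > b:
            clusters.append(current)       # buffer full: start a new cluster
            current = set()
        current |= pages
    if current:
        clusters.append(current)
    return clusters
```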
10
Fetch All (Contd.)
  • Reorder the clusters to maximize the overlap between
    consecutive clusters; this improves disk I/O time.
  • Greedy approach: add the pair of clusters having the
    maximum overlap. (A sketch of the reordering follows
    the example below.)

Before reordering (21 disk reads):
  C1 = {r1, r2, s1, s3, s4, s7}
  C2 = {r3, r4, s2, s5, s6, s8}
  C3 = {r5, s1, s2, s4, s7}
  C4 = {r6, r7, r8, s1, s2, s6}

After reordering (17 disk reads): C1, C3, C4, C2
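A minimal sketch of the greedy reordering, under a simplified cost model in
which pages already in the buffer from the previous cluster are not re-read
(tie-breaking and the paper's exact disk-read accounting may differ):

```python
def disk_reads(order):
    """Pages fetched when clusters are processed in this order, assuming
    pages shared with the previous cluster stay buffered."""
    reads, buffered = 0, set()
    for cluster in order:
        reads += len(cluster - buffered)
        buffered = set(cluster)
    return reads

def greedy_reorder(clusters):
    """Start from the first cluster, then repeatedly append the remaining
    cluster with maximum overlap with the last one chosen."""
    order, remaining = [clusters[0]], list(clusters[1:])
    while remaining:
        best = max(remaining, key=lambda c: len(c & order[-1]))
        order.append(best)
        remaining.remove(best)
    return order
```

On the slide's four clusters this greedy pass picks the order C1, C3, C4, C2
shown above.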
11
Fetch One (FO)
  • FA can read irrelevant pages.
  • Solution: read one candidate per row.

[Figure: priority table with rows r1-r8 and columns s1-s8; a single
candidate page is fetched per row.]
  • Buffer size b = 6
12
Fetch Dynamic (FD)
  • Random seek costs in FO can be significant.
  • FD reads f, f > 1, candidates per row.
  • f = median of the number of candidates read for the
    rows processed so far, as sketched below.

[Figure: priority table with rows r1-r8 and columns s1-s8.]
  • Buffer size b = 6
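A small sketch of the fetch-count choice (the floor of 1 before any row has
been processed is our assumption):

```python
from statistics import median

def next_fetch_count(history):
    """f = median of the number of candidates read for the rows
    processed so far; default to 1 while the history is empty."""
    return max(1, round(median(history))) if history else 1
```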
13
Process Buffer
Read the clusters in order and process their pages.
[Figure: object r with its k-NN circle of radius dmax, the enclosing
k-NN box of r, and nearby points s1-s4.]
14
Further Improvements: Adaptive Filter
[Figure: object r with its k-NN circle, the enclosing k-NN box, the
tighter adaptive k-NN box, and nearby pages s1, s2.]
  • Costs with the k-NN box:
  • 3 disk reads, for r, s1, s2.
  • 8 × 4 = 32 object comparisons.
  • Costs with the adaptive k-NN box:
  • 2 disk reads, for r and s1.
  • 4 comparisons for a pass on s1
  • + 8 object comparisons = 12 comparisons.
15
Further Improvements: Partitioning
[Figure: object r with its k-NN circle covered by partitioned k-NN
boxes, and nearby page s1.]
  • Costs with partitioned k-NN boxes:
  • 2 disk reads, for r and s1.
  • 8 comparisons for two passes on s1
  • + 2 object comparisons = 10 comparisons.
16
Experimental Setup
  • Datasets:
  • Image datasets (2-16 dimensions).
  • Protein dataset (3 dimensions).
  • Experiments:
  • Evaluation of the optimizations.
  • Our methods:
  • FA
  • FO
  • FD
  • Existing methods:
  • Sequential Scan (SS).
  • R-tree based method (RT).
  • MuX-Join (kNN-join)
  • GORDER (kNN-join)
  • RkNN

17
Evaluation of Optimizations
Image Dataset
Buffer size: 10% of the database size
18
Comparison of Our Methods
Image Dataset
19
Comparison with Other Methods
Protein Data
20
Comparison with Other Methods: Scalability
Image Data
21
Conclusion
  • Considered the problem of data broadness.
  • Proposed the Generalized Nearest Neighbor (GNN)
    query to express data broadness.
  • Our method is faster than the R-tree, GORDER,
    MuX-Join, and RkNN methods.

22
Thank You
  • Questions ?

23
Comparison with Other Methods: Scalability
Image Data
24
Comparison with Other Methods
Image data
[Chart: comparison with RkNN and GORDER.]
25
Comparison of Our Methods
26
Comparison with Other Methods
Protein Data
27
Predict Solution Space
  • Construct a priority list for each MBR A in R:
  • Sort the MBRs B in S by MAXDIST(A, B).
  • Let Bm be the MBR with the m-th smallest MAXDIST,
    1 ≤ m ≤ |S| (number of MBRs in S).
  • Find all Bi ∈ S with MINDIST(A, Bi) < MAXDIST(A, Bm).
  • Assign priorities to the Bi in increasing order of
    MINDIST(A, Bi). (A sketch follows.)
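A sketch of this construction, with each MBR given as a (low, high) pair of
corner tuples. The mindist/maxdist helpers are the standard L2 bounds between
rectangles, and m is passed in rather than derived from k and t:

```python
def mindist(a, b):
    """Smallest possible L2 distance between MBRs a and b."""
    (alo, ahi), (blo, bhi) = a, b
    return sum(max(0.0, max(l1, l2) - min(h1, h2)) ** 2
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi)) ** 0.5

def maxdist(a, b):
    """Largest possible L2 distance between MBRs a and b."""
    (alo, ahi), (blo, bhi) = a, b
    return sum(max(h1 - l2, h2 - l1) ** 2
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi)) ** 0.5

def priority_list(A, S_mbrs, m):
    """Candidates of A: every B whose MINDIST to A is below the m-th
    smallest MAXDIST, ranked by increasing MINDIST."""
    cutoff = sorted(maxdist(A, B) for B in S_mbrs)[m - 1]
    cand = [B for B in S_mbrs if mindist(A, B) < cutoff]
    return sorted(cand, key=lambda B: mindist(A, B))
```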

28
Further Improvements: Adaptive Filter
  • Comparing two MBRs (r ∈ R and s ∈ S) takes O(t²)
    object comparisons, where t is the number of objects
    in each of r and s.
  • Adaptive Filter:
  • Obtain the k-NN box of r by expanding each
    dimension by a different amount.
  • A single pass on s eliminates unpromising points
    from s.
  • Find the points in the adaptive k-NN box of r.
  • The adaptive filter thus reduces the t² comparisons,
    as sketched below.
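A sketch of the single filtering pass, with the adaptive box passed in as a
(low, high) pair of corner tuples (deriving the per-dimension expansion
amounts is outside this snippet):

```python
def box_filter(s_points, box):
    """One pass over the objects of page s: keep only the points that
    fall inside the adaptive k-NN box of r; pruned points never reach
    the per-object distance stage."""
    lo, hi = box
    return [p for p in s_points
            if all(l <= x <= h for x, l, h in zip(p, lo, hi))]
```

Only the survivors enter the pairwise distance comparisons, which is how the
slide's 32 comparisons drop to 12.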