Joining Massive High-Dimensional Datasets - PowerPoint PPT Presentation

About This Presentation

Title:

Joining Massive High-Dimensional Datasets

Description:

... Random Seek Cost: Cost Clustering. Dataset 2. Dataset 1. The ... Cluster order = (C4,C2,C1,C3,C5) 5 2 3 3 2=15 page reads. Scenario 1 = 19. C1. C2. C3 ... – PowerPoint PPT presentation

Number of Views:30

Avg rating:3.0/5.0

Slides: 38

Provided by: tubayavu

Learn more at: https://www.cise.ufl.edu

Category:

more less

Transcript and Presenter's Notes

Title: Joining Massive High-Dimensional Datasets

1
Joining Massive High-Dimensional Datasets

Tamer Kahveci
Christian A. Lang
Ambuj K. Singh
Department of Computer Science
University of California at Santa Barbara
http//www.cs.ucsb.edu/tamer

2
Motivation Sample Queries

Join is fundamental database primitive
Spatial Join Find all hotels in California that
are within three miles of a recreation area.
Sequence Join Find all pairs of companies from
New York Exchange and Tokyo Exchange that have
similar closing prices for one month

3
Motivation

We assume limited buffer space.
Joining two datasets is expensive
I/O cost
CPU cost
O(mn)

4
The Naive Solution NLJ
Dataset 1 (m pages)
Buffer B 4
minm,nmn/(B-1) page reads
mn page comparisons
Dataset 2 (n pages)
We do not need to compare all page pairs!
5
Outline

Reducing search space Prediction Matrix
Minimizing I/O cost by clustering
Square Cluster
Cost Cluster
Maximizing buffer reuse
Experimental results

6
PM-NLJ

Predict the candidate page pairs using plane
sweep method on an index structure.

Dataset 1
Dataset 2
7
Prediction of Join
8
Prediction of Join
9
PM-NLJ

Predict the candidate page pairs using plane
sweep method on an index structure.

Dataset 1

The final estimate is called Prediction Matrix
(PM).
Restrict NLJ to marked entries of PM.
We call this method PM-NLJ.

Dataset 2
10
PM-NLJ

The number of marked entries e.
Performance improvement rate mn/e.

Dataset 1
Dataset 2
Is there a better read schedule?
11
Outline

Reducing search space Prediction Matrix
Minimizing I/O cost by clustering
Square Cluster
Cost Cluster
Maximizing buffer reuse
Experimental results

12
Minimizing Number of I/OSquare Clustering

PM-NLJ reads minm,ne 9 pages.

Let B6.

Dataset 1

mn 6 page reads suffices.

Savings
e-maxm,n.
Maximize e
Minimize maxm,n
mn B
mnB/2.

Dataset 2
13
Minimizing Number of I/OSquare Clustering
Dataset 1
O(e) space time complexity
Can we reduce total I/O cost by reducing the
amount of random seeks?
Dataset 2
14
Minimizing Random Seek Cost Cost Clustering
Dataset 1

The location of
the pages is
important as
well as their
number!

Dataset 2
15
Minimizing Random Seek Cost Cost Clustering
Dataset 1

O(e) space
complexity
O(e3/2) time
complexity

Dataset 2
16
Outline

Reducing search space Prediction Matrix
Minimizing I/O cost by clustering
Square Cluster
Cost Cluster
Maximizing buffer reuse
Experimental results

17
Maximizing Cache Reuse
Dataset 1
B 5 pages
C1
C3

Scenario 1
Cluster order
(C4,C1,C3,C5,C2)
5432519 page reads.

C2
Dataset 2
C4
C5
18
Maximizing Cache Reuse
Dataset 1
Scenario 1 19
C1

Scenario 2
Cluster order
(C4,C2,C1,C3,C5)
5233215 page reads.

C3
C2
Dataset 2
C4
What is the best schedule?
C5
19
Sharing Graph (SG)
Dataset 1
C1
C3
C2
Dataset 2
C4
C5
20
Finding Best Schedule

Each schedule is a path on SG.
Cache reuse sum of weights of the edges of the
corresponding path on SG.
Equivalent to TSP.
NP-Complete.
Use greedy heuristic to find optimal path.

C2
2
C1
1
1
1
C3
3
C5
C4
21
Outline

Reducing search space Prediction Matrix
Minimizing I/O cost by clustering
Square Cluster
Cost Cluster
Maximizing buffer reuse
Experimental results

22
Experimental Setup Datasets

Low dimensional data
2-D road intersections of Long Beach (LBeach)
Montgomery County (MGcounty).
53K 39K vectors
High dimensional data
60-D feature vectors for satellite image database
(landsat).
275K vectors
Sequence data
Human chromosome 18 (HChr18) mouse chromosome
18 (MChr18)
4.2 M 2.3 M nucleotides

23
Experimental Setup Compared Techniques

NLJ
Epsilon Grid Order (EGO) BBKK01
BFRJ HJR97
PM-NLJ
Random-SC
SC
CC

24
Experimental Setup

Three optimizations tested
OPT 1 reducing space by using the PM.
OPT 2 clustering.
OPT 3 cluster scheduling.

25
Itemized Cost Analysis
Join on MGCounty LBeach
26
Total Cost Analysis of Various Optimizations
Self-join on HChr18
Buffer Size (num pages)
27
Comparison of SC CC
28
Total Cost Analysis
Join on landsat data
Buffer Size (num pages)
29
Scalability Analysis
Join on landsat data
Database Size (num vectors per database)
30
Discussion

We proposed three optimizations for join
operator.
Prediction matrix
Clustering
Buffer recycling
SC is 2 to 86 times faster than competing
techniques for spatial databases, and 13 to 133
times faster than competing techniques for
sequence databases
SC is very close to the optimal technique (CC).

31
Future Directions

The solution can be generalized to multi-way
joins.
Similar optimizations can be applied to NN
queries.
Can be applied to biological data.

32
Related Work

Join without index
Arge et al 1998
Blasgen et al 1977
Bohm et al 2001
Chan et al 1997
Graefe 1994
Koudas et al 1997
Koudas et al 2000
Orenstein 1986
Patel et al 1996
Shim et al 2002
Xiao et al 2001

Join with index
Bercken et al 2000
Bohm et al 2001
Brinkhoff et al 1993
Gurret et al 2000
Hjaltson et al 1998
Huang et al 1997
Lo et al 1994
Lo et al 1996

THANK YOU
33
Using Sharing Graph to Determine Cache Reuse
Scenario 1
Scenario 2
C2
C2
2
2
C1
C1
1
1
1
1
1
1
0
C3
C3
3
0
0
3
C5
C5
C4
C4
Reuse 11 2
Reuse 321 6
34
Spatial Join Example
Recreation areas
Hotels
35
Spatial Join Example
36
The Naive Solution NLJ
Dataset 1 (m pages)
Buffer B 4
minm,nmn/(B-1) page reads
mn page comparisons
Dataset 2 (n pages)
We do not need to compare all page pairs!
37
Reading Pages in a Better Order
1 seek 4 page transfers
3 seeks 3 page transfers

Write a Comment

User Comments (0)