Title: Joining Massive High-Dimensional Datasets
1Joining Massive High-Dimensional Datasets
- Tamer Kahveci
- Christian A. Lang
- Ambuj K. Singh
- Department of Computer Science
- University of California at Santa Barbara
- http//www.cs.ucsb.edu/tamer
2Motivation Sample Queries
- Join is fundamental database primitive
- Spatial Join Find all hotels in California that
are within three miles of a recreation area. - Sequence Join Find all pairs of companies from
New York Exchange and Tokyo Exchange that have
similar closing prices for one month
3Motivation
- We assume limited buffer space.
- Joining two datasets is expensive
- I/O cost
- CPU cost
- O(mn)
4The Naive Solution NLJ
Dataset 1 (m pages)
Buffer B 4
minm,nmn/(B-1) page reads
mn page comparisons
Dataset 2 (n pages)
We do not need to compare all page pairs!
5Outline
- Reducing search space Prediction Matrix
- Minimizing I/O cost by clustering
- Square Cluster
- Cost Cluster
- Maximizing buffer reuse
- Experimental results
6PM-NLJ
- Predict the candidate page pairs using plane
sweep method on an index structure.
Dataset 1
Dataset 2
7Prediction of Join
8Prediction of Join
9PM-NLJ
- Predict the candidate page pairs using plane
sweep method on an index structure.
Dataset 1
- The final estimate is called Prediction Matrix
(PM). - Restrict NLJ to marked entries of PM.
- We call this method PM-NLJ.
Dataset 2
10PM-NLJ
- The number of marked entries e.
- Performance improvement rate mn/e.
Dataset 1
Dataset 2
Is there a better read schedule?
11Outline
- Reducing search space Prediction Matrix
- Minimizing I/O cost by clustering
- Square Cluster
- Cost Cluster
- Maximizing buffer reuse
- Experimental results
12Minimizing Number of I/OSquare Clustering
- PM-NLJ reads minm,ne 9 pages.
Dataset 1
- mn 6 page reads suffices.
- Savings
- e-maxm,n.
- Maximize e
- Minimize maxm,n
- mn B
- mnB/2.
Dataset 2
13Minimizing Number of I/OSquare Clustering
Dataset 1
O(e) space time complexity
Can we reduce total I/O cost by reducing the
amount of random seeks?
Dataset 2
14Minimizing Random Seek Cost Cost Clustering
Dataset 1
- The location of
- the pages is
- important as
- well as their
- number!
Dataset 2
15Minimizing Random Seek Cost Cost Clustering
Dataset 1
- O(e) space
- complexity
- O(e3/2) time
- complexity
Dataset 2
16Outline
- Reducing search space Prediction Matrix
- Minimizing I/O cost by clustering
- Square Cluster
- Cost Cluster
- Maximizing buffer reuse
- Experimental results
17Maximizing Cache Reuse
Dataset 1
B 5 pages
C1
C3
- Scenario 1
- Cluster order
- (C4,C1,C3,C5,C2)
- 5432519 page reads.
C2
Dataset 2
C4
C5
18Maximizing Cache Reuse
Dataset 1
Scenario 1 19
C1
- Scenario 2
- Cluster order
- (C4,C2,C1,C3,C5)
- 5233215 page reads.
C3
C2
Dataset 2
C4
What is the best schedule?
C5
19Sharing Graph (SG)
Dataset 1
C1
C3
C2
Dataset 2
C4
C5
20Finding Best Schedule
- Each schedule is a path on SG.
- Cache reuse sum of weights of the edges of the
corresponding path on SG. - Equivalent to TSP.
- NP-Complete.
- Use greedy heuristic to find optimal path.
C2
2
C1
1
1
1
C3
3
C5
C4
21Outline
- Reducing search space Prediction Matrix
- Minimizing I/O cost by clustering
- Square Cluster
- Cost Cluster
- Maximizing buffer reuse
- Experimental results
22Experimental Setup Datasets
- Low dimensional data
- 2-D road intersections of Long Beach (LBeach)
Montgomery County (MGcounty). - 53K 39K vectors
- High dimensional data
- 60-D feature vectors for satellite image database
(landsat). - 275K vectors
- Sequence data
- Human chromosome 18 (HChr18) mouse chromosome
18 (MChr18) - 4.2 M 2.3 M nucleotides
23Experimental Setup Compared Techniques
- NLJ
- Epsilon Grid Order (EGO) BBKK01
- BFRJ HJR97
- PM-NLJ
- Random-SC
- SC
- CC
24Experimental Setup
- Three optimizations tested
- OPT 1 reducing space by using the PM.
- OPT 2 clustering.
- OPT 3 cluster scheduling.
25Itemized Cost Analysis
Join on MGCounty LBeach
26Total Cost Analysis of Various Optimizations
Self-join on HChr18
Buffer Size (num pages)
27Comparison of SC CC
28Total Cost Analysis
Join on landsat data
Buffer Size (num pages)
29Scalability Analysis
Join on landsat data
Database Size (num vectors per database)
30Discussion
- We proposed three optimizations for join
operator. - Prediction matrix
- Clustering
- Buffer recycling
- SC is 2 to 86 times faster than competing
techniques for spatial databases, and 13 to 133
times faster than competing techniques for
sequence databases - SC is very close to the optimal technique (CC).
31Future Directions
- The solution can be generalized to multi-way
joins. - Similar optimizations can be applied to NN
queries. - Can be applied to biological data.
32Related Work
- Join without index
- Arge et al 1998
- Blasgen et al 1977
- Bohm et al 2001
- Chan et al 1997
- Graefe 1994
- Koudas et al 1997
- Koudas et al 2000
- Orenstein 1986
- Patel et al 1996
- Shim et al 2002
- Xiao et al 2001
- Join with index
- Bercken et al 2000
- Bohm et al 2001
- Brinkhoff et al 1993
- Gurret et al 2000
- Hjaltson et al 1998
- Huang et al 1997
- Lo et al 1994
- Lo et al 1996
THANK YOU
33Using Sharing Graph to Determine Cache Reuse
Scenario 1
Scenario 2
C2
C2
2
2
C1
C1
1
1
1
1
1
1
0
C3
C3
3
0
0
3
C5
C5
C4
C4
Reuse 11 2
Reuse 321 6
34Spatial Join Example
Recreation areas
Hotels
35Spatial Join Example
36The Naive Solution NLJ
Dataset 1 (m pages)
Buffer B 4
minm,nmn/(B-1) page reads
mn page comparisons
Dataset 2 (n pages)
We do not need to compare all page pairs!
37Reading Pages in a Better Order
1 seek 4 page transfers
3 seeks 3 page transfers