Improving Hash Join Performance Through Prefetching - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Improving Hash Join Performance Through Prefetching


1
Improving Hash Join Performance Through
Prefetching
Shimin Chen
Anastassia Ailamaki
Todd C. Mowry

Phillip B. Gibbons
2
Hash Join
  • Simple hash join
  • Build hash table on smaller (build) relation
  • Probe hash table using larger (probe) relation
  • Random access patterns inherent in hashing
  • Excessive random I/Os if the build relation and
    hash table cannot fit in memory
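As a concrete reference point, here is a minimal C sketch of the simple hash join described above, assuming an illustrative ~100-byte tuple schema, a chained hash table, and a hypothetical bucket count; it is not the authors' implementation.

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 1024                    /* illustrative bucket count */

    typedef struct tuple { uint32_t key; char payload[96]; } tuple_t;  /* ~100B tuples */
    typedef struct node  { tuple_t *t; struct node *next; } node_t;

    static node_t *buckets[NBUCKETS];

    static unsigned hash(uint32_t key) { return key % NBUCKETS; }

    /* Build phase: insert every tuple of the smaller (build) relation. */
    static void build(tuple_t *rel, size_t n) {
        for (size_t i = 0; i < n; i++) {
            node_t *cell = malloc(sizeof *cell);
            cell->t = &rel[i];
            unsigned b = hash(rel[i].key);   /* random position in the table */
            cell->next = buckets[b];
            buckets[b] = cell;
        }
    }

    /* Probe phase: look up every tuple of the larger (probe) relation. */
    static size_t probe(tuple_t *rel, size_t n) {
        size_t matches = 0;
        for (size_t i = 0; i < n; i++)
            for (node_t *c = buckets[hash(rel[i].key)]; c != NULL; c = c->next)
                if (c->t->key == rel[i].key)
                    matches++;               /* a real join would emit the pair here */
        return matches;
    }

Both loops chase pointers into the table at positions determined by the hash of the key, which is the random access pattern the slide refers to; once the build relation and table no longer fit in memory, those accesses turn into random I/Os.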

3
I/O Partitioning
  • Avoid excessive random disk accesses
  • Join pairs of build and probe partitions
    separately
  • Sequential I/O patterns for relations and
    partitions
  • Hash join is CPU-bound with reasonable I/O
    bandwidth
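A short sketch of the I/O partitioning step under the same illustrative schema: each relation is scanned once and every tuple is appended to the partition chosen by a hash of its join key, so matching build and probe tuples land in the same partition pair. The partition count, hash constant, and file handling are assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define NPART 800   /* e.g. 800 partitions for a 1GB relation, as on the next slide */

    /* Route each tuple to a partition file by hashing the join key.
     * Writes stay sequential within each partition, avoiding random I/O. */
    static void partition(FILE *in, FILE *out[NPART]) {
        struct { uint32_t key; char payload[96]; } t;
        while (fread(&t, sizeof t, 1, in) == 1) {
            unsigned p = (t.key * 2654435761u) % NPART;   /* illustrative hash */
            fwrite(&t, sizeof t, 1, out[p]);
        }
    }

Build partition i is then joined only with probe partition i, and both reads and writes remain sequential.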

[Figure: build and probe relations divided into corresponding partition pairs]
4
Hash Join Cache Performance
  • Partitioning divides a 1GB relation into 800
    partitions
  • Join a 50MB build partition with a 100MB probe
    partition
  • Detailed simulations based on a Compaq ES40 system
  • Most of the execution time is wasted on data cache
    misses
  • 82% for partition, 73% for join
  • Because of random access patterns in memory

5
Employing Partitioning for Cache?
  • Cache partitioning: generating cache-sized
    partitions
  • Effective in main-memory databases [Shatdal et
    al., 94], [Boncz et al., 99], [Manegold et
    al., 00]
  • Two limitations when used in commercial databases
  • 1) Usually needs an additional in-memory
    partitioning pass
  • Cache is much smaller than main memory
  • 50% worse than our techniques
  • 2) Sensitive to cache sharing by multiple
    activities

6
Our Approach: Cache Prefetching
  • Modern processors allow multiple cache misses to
    be serviced simultaneously
  • Prefetch assembly instructions for exploiting
    this parallelism
  • Overlap cache miss latency with computation
  • Successfully applied to
  • Array-based programs [Mowry et al., 92]
  • Pointer-based programs [Luk & Mowry, 96]
  • Database B-trees [Chen et al., 01]
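For concreteness, a hedged example of how such prefetches are issued from C: GCC/Clang's __builtin_prefetch lowers to the processor's prefetch instruction, and several prefetches can be outstanding while the loop keeps computing. The array example and the distance of 16 are illustrative and unrelated to the hash join code.

    /* Prefetch the element a fixed distance ahead so its miss latency
     * overlaps with the computation on the current element. */
    void scale(double *a, long n, double f) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);   /* distance of 16 is illustrative */
            a[i] *= f;
        }
    }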

7
Challenges for Cache Prefetching
  • Difficult to obtain memory addresses early
  • Randomness of hashing prohibits address
    prediction
  • Data dependencies within the processing of a
    tuple
  • Naïve approach does not work
  • Complexity of hash join code
  • Ambiguous pointer references
  • Multiple code paths
  • Cannot apply compiler prefetching techniques

8
Our Solution
  • Dependencies are rare across subsequent tuples
  • Exploit inter-tuple parallelism
  • Overlap cache misses of one tuple with
    computation and cache misses of other tuples
  • We propose two prefetching techniques
  • Group prefetching
  • Software-pipelined prefetching

9
Outline
  • Overview
  • Our Proposed Techniques
  • Simplified Probing Algorithm
  • Naïve Prefetching
  • Group Prefetching
  • Software-Pipelined Prefetching
  • Dealing with Complexities
  • Experimental Results
  • Conclusions

10
Simplified Probing Algorithm
  • foreach probe tuple
  • (0) compute bucket number
  • (1) visit header
  • (2) visit cell array
  • (3) visit matching build tuple
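Read as C, the four numbered steps form a loop like the sketch below; the hash-table layout (bucket headers pointing to cell arrays pointing to build tuples) follows the talk, but the type and helper names (tuple_t, hashfn, emit) are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint32_t key; char payload[96]; } tuple_t;    /* assumed schema  */
    typedef struct { uint32_t key; void *build_tuple; } cell_t;    /* one hash cell   */
    typedef struct { int ncells; cell_t *cells; } bucket_t;        /* bucket header   */

    extern unsigned hashfn(uint32_t key);                 /* assumed hash function    */
    extern void emit(void *build_tuple, tuple_t *probe);  /* assumed output callback  */

    void probe_simple(bucket_t *buckets, unsigned nbuckets,
                      tuple_t *probe, size_t n) {
        for (size_t i = 0; i < n; i++) {
            unsigned b = hashfn(probe[i].key) % nbuckets;    /* (0) compute bucket number */
            bucket_t hdr = buckets[b];                       /* (1) visit header          */
            for (int j = 0; j < hdr.ncells; j++)             /* (2) visit cell array      */
                if (hdr.cells[j].key == probe[i].key)
                    emit(hdr.cells[j].build_tuple, &probe[i]); /* (3) visit build tuple   */
        }
    }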

11
Naïve Prefetching
  • foreach probe tuple
  • (0) compute bucket number
  • prefetch header
  • (1) visit header
  • prefetch cell array
  • (2) visit cell array
  • prefetch matching build tuple
  • (3) visit matching build tuple

Data dependencies make it difficult to obtain
addresses early
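The same loop with the naïve prefetches inserted, reusing the declarations from the previous sketch. Each address becomes known only one statement before its use, so there is essentially no work to overlap with the prefetch, which is the dependence problem the note above points out.

    void probe_naive(bucket_t *buckets, unsigned nbuckets,
                     tuple_t *probe, size_t n) {
        for (size_t i = 0; i < n; i++) {
            unsigned b = hashfn(probe[i].key) % nbuckets;    /* (0) compute bucket number */
            __builtin_prefetch(&buckets[b]);                 /* prefetch header           */
            bucket_t hdr = buckets[b];                       /* (1) visit header          */
            __builtin_prefetch(hdr.cells);                   /* prefetch cell array       */
            for (int j = 0; j < hdr.ncells; j++) {           /* (2) visit cell array      */
                if (hdr.cells[j].key == probe[i].key) {
                    __builtin_prefetch(hdr.cells[j].build_tuple); /* prefetch build tuple */
                    emit(hdr.cells[j].build_tuple, &probe[i]);    /* (3) visit it         */
                }
            }
        }
    }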
12
Group Prefetching
  • foreach group of probe tuples
  • foreach tuple in group
  • (0) compute bucket number
  • prefetch header
  • foreach tuple in group
  • (1) visit header
  • prefetch cell array
  • foreach tuple in group
  • (2) visit cell array
  • prefetch build tuple
  • foreach tuple in group
  • (3) visit matching build tuple
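A C sketch of group prefetching under the same assumed declarations. Each stage runs over the whole group and issues the prefetches needed by the next stage, so one tuple's miss latency is overlapped with the other tuples' computation and misses. The group size of 25 is an illustrative value, and only the first match per tuple is kept to keep the sketch short.

    #define G 25   /* group size; illustrative, tuned in the paper */

    void probe_group(bucket_t *buckets, unsigned nbuckets,
                     tuple_t *probe, size_t n) {
        unsigned bno[G];          /* per-tuple state kept between the stages */
        bucket_t hdr[G];
        cell_t  *match[G];

        for (size_t base = 0; base + G <= n; base += G) {
            for (int k = 0; k < G; k++) {                        /* stage 0 */
                bno[k] = hashfn(probe[base + k].key) % nbuckets; /* compute bucket number */
                __builtin_prefetch(&buckets[bno[k]]);            /* prefetch header       */
            }
            for (int k = 0; k < G; k++) {                        /* stage 1 */
                hdr[k] = buckets[bno[k]];                        /* visit header          */
                __builtin_prefetch(hdr[k].cells);                /* prefetch cell array   */
            }
            for (int k = 0; k < G; k++) {                        /* stage 2 */
                match[k] = NULL;                                 /* visit cell array      */
                for (int j = 0; j < hdr[k].ncells; j++)
                    if (hdr[k].cells[j].key == probe[base + k].key)
                        match[k] = &hdr[k].cells[j];
                if (match[k])
                    __builtin_prefetch(match[k]->build_tuple);   /* prefetch build tuple  */
            }
            for (int k = 0; k < G; k++)                          /* stage 3 */
                if (match[k])
                    emit(match[k]->build_tuple, &probe[base + k]); /* visit build tuple   */
        }
        /* the last n % G tuples would be handled by the simple probing loop */
    }

The per-tuple arrays bno, hdr, and match are exactly the kind of state information the later slides describe.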

13
Software-Pipelined Prefetching
  • Prologue
  • for j = 0 to N-4 do
  • tuple j+3
  • (0) compute bucket number
  • prefetch header
  • tuple j+2
  • (1) visit header
  • prefetch cell array
  • tuple j+1
  • (2) visit cell array
  • prefetch build tuple
  • tuple j
  • (3) visit matching build tuple
  • Epilogue
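The corresponding software-pipelined version in C, again reusing the assumed declarations, with a prefetching distance of one iteration as on the slide. Instead of writing the prologue and epilogue out separately, this sketch folds them into bounds checks; in steady state, iteration j runs stage 0 for tuple j+3, stage 1 for tuple j+2, stage 2 for tuple j+1, and stage 3 for tuple j.

    void probe_pipelined(bucket_t *buckets, unsigned nbuckets,
                         tuple_t *probe, long n) {
        unsigned bno[4];     /* circular per-tuple state, indexed by tuple number mod 4 */
        bucket_t hdr[4];
        cell_t  *match[4];

        /* j = -3..-1 act as the prologue and the final iterations as the epilogue:
         * the bounds checks skip the stages that do not apply yet or anymore. */
        for (long j = -3; j < n; j++) {
            long t3 = j + 3, t2 = j + 2, t1 = j + 1;

            if (t3 < n) {                                         /* stage 0, tuple j+3 */
                bno[t3 % 4] = hashfn(probe[t3].key) % nbuckets;   /* compute bucket no. */
                __builtin_prefetch(&buckets[bno[t3 % 4]]);        /* prefetch header    */
            }
            if (t2 >= 0 && t2 < n) {                              /* stage 1, tuple j+2 */
                hdr[t2 % 4] = buckets[bno[t2 % 4]];               /* visit header       */
                __builtin_prefetch(hdr[t2 % 4].cells);            /* prefetch cells     */
            }
            if (t1 >= 0 && t1 < n) {                              /* stage 2, tuple j+1 */
                match[t1 % 4] = NULL;                             /* visit cell array   */
                for (int c = 0; c < hdr[t1 % 4].ncells; c++)
                    if (hdr[t1 % 4].cells[c].key == probe[t1].key)
                        match[t1 % 4] = &hdr[t1 % 4].cells[c];
                if (match[t1 % 4])
                    __builtin_prefetch(match[t1 % 4]->build_tuple); /* prefetch tuple   */
            }
            if (j >= 0 && match[j % 4])                           /* stage 3, tuple j   */
                emit(match[j % 4]->build_tuple, &probe[j]);       /* visit build tuple  */
        }
    }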

14
Dealing with Multiple Code Paths
  • Multiple code paths
  • There could be 0 or many matches
  • Hash buckets could be empty or full
  • Keep state information for tuples being processed

[Figure: example branching code paths; states B, C, F and code blocks D, G referenced below]
  • Test state to decide
  • Do nothing, if state B
  • Execute D, if state C
  • Execute G, if state F
  • Previous compiler techniques cannot handle this
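One way to keep that state, reusing the earlier declarations, is a small per-tuple enum that each stage records and the next stage tests, so empty buckets and zero-or-many matches simply leave a tuple in a different state instead of forcing a separate code version per path. The names below are illustrative, not taken from the paper.

    /* Per-tuple state recorded at the end of each stage; the next stage tests
     * it and picks the matching code path (or does nothing for this tuple). */
    typedef enum {
        ST_EMPTY_BUCKET,     /* header showed no cells: nothing more to do    */
        ST_SCANNING_CELLS,   /* cell array fetched, keys still to be compared */
        ST_HAS_MATCH,        /* matching cell found, build tuple prefetched   */
        ST_DONE
    } tuple_state_t;

    typedef struct {
        tuple_state_t state;
        unsigned      bucket;
        bucket_t      hdr;
        cell_t       *match;
    } probe_state_t;         /* one entry per in-flight tuple of the group */

    /* Example of "test state to decide" for the final stage. */
    static void stage3(probe_state_t *s, tuple_t *probe_tuple) {
        switch (s->state) {
        case ST_HAS_MATCH:
            emit(s->match->build_tuple, probe_tuple);   /* take the match path */
            s->state = ST_DONE;
            break;
        default:
            break;           /* empty bucket or no match: do nothing */
        }
    }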

15
Dealing with Read-write Conflicts
  • In hash table building
  • Use a busy flag in the bucket header to detect
    conflicts
  • Postpone hashing the 2nd tuple until the 1st has
    finished
  • Compiler cannot perform this transformation
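A sketch of the busy-flag idea, reusing the earlier declarations; append_cell is an assumed helper. The claim and the insertion happen in different stages of the group, so a second in-flight tuple that hashes to the same bucket sees the flag set and is postponed until the first tuple's late stage releases the bucket.

    typedef struct {
        int     busy;        /* set while an earlier in-flight tuple updates this bucket */
        int     ncells;
        cell_t *cells;
    } build_bucket_t;

    extern void append_cell(build_bucket_t *b, tuple_t *t);   /* assumed helper */

    /* Early stage: claim the bucket, or report a conflict so the caller
     * postpones this tuple until the earlier one has finished. */
    static int build_stage_claim(build_bucket_t *b) {
        if (b->busy)
            return 0;        /* read-write conflict within the group: postpone */
        b->busy = 1;
        return 1;
    }

    /* Late stage: the prefetched bucket has arrived; perform the insertion. */
    static void build_stage_insert(build_bucket_t *b, tuple_t *t) {
        append_cell(b, t);
        b->busy = 0;         /* release so any postponed tuple can proceed */
    }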

16
More Details In Paper
  • General group prefetching algorithm
  • General software-pipelined prefetching algorithm
  • Analytical models
  • Discussion of important parameters
  • group size, prefetching distance
  • Implementation details

17
Outline
  • Overview
  • Our Proposed Techniques
  • Experimental Results
  • Setup
  • Performance of Our Techniques
  • Comparison with Cache Partitioning
  • Conclusions

18
Experimental Setup
  • Relation schema: 4-byte join attribute + fixed-length payload
  • No selection, no projection
  • 50MB memory available for the join phase
  • Detailed cycle-by-cycle simulations
  • 1GHz superscalar processor
  • Memory hierarchy is based on Compaq ES40

19
Joining a Pair of Build and Probe Partitions
  • A 50MB build partition joins a 100MB probe
    partition
  • 1:2 matching
  • Number of tuples decreases as tuple size increases
  • Our techniques achieve 2.1-2.9X speedups over
    original hash join

20
Varying Memory Latency
  • A 50MB build partition joins a 100MB probe
    partition
  • 1:2 matching
  • 100-byte tuples
  • 150 cycles: the default memory latency parameter
  • 1000 cycles: memory latency in the future
  • Our techniques achieve 9X speedups over baseline
    at 1000 cycles
  • Absolute performances of our techniques are very
    close

21
Comparison with Cache Partitioning
  • A 200MB build relation joins a 400MB probe
    relation
  • 1:2 matching
  • Partitioning + join
  • Cache partitioning: generating cache-sized
    partitions [Shatdal et al., 94], [Boncz et al.,
    99], [Manegold et al., 00]
  • Additional in-memory partition step after I/O
    partitioning
  • At least 50% worse than our prefetching schemes

22
Robustness Impact of Cache Interference
  • Cache partitioning relies on exclusive use of the
    cache
  • Periodically flush the cache: worst-case
    interference
  • Each scheme is normalized to its own execution
    time with no cache flushes
  • Cache partitioning degrades 8-38%
  • Our prefetching schemes are very robust

23
Conclusions
  • Exploited inter-tuple parallelism
  • Proposed group prefetching and software-pipelined
    prefetching
  • Prior prefetching techniques cannot handle code
    complexity
  • Our techniques achieve dramatically better
    performance
  • 2.1-2.9X speedups for join phase
  • 1.4-2.6X speedups for partition phase
  • 9X speedups at 1000-cycle memory latency in the
    future
  • Absolute performances are close to those at 150
    cycles
  • Robust against cache interference
  • Unlike cache partitioning
  • Our prefetching techniques are effective for hash
    joins

24
Thank you !
25
Back Up Slides
26
Is Hash Join CPU-bound ?
  • 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP
    SCSI disks (max transfer rate 68MByte/sec),
    Linux 2.4.18
  • 100-byte tuples, 4-byte keys, 1:2 matching
  • Striping unit: 256KB
  • 10 measurements, std < 10% of mean or std < 1s
  • Quad-processor Pentium III, four disks
  • A 1.5GB build relation, a 3GB probe relation
  • Main thread: GRACE hash join
  • Background I/O thread per disk: I/O prefetching
    and writing
  • Hash join is CPU-bound with reasonable I/O
    bandwidth
  • Still large room for CPU performance improvement

27
Hiding Latency within A Group
  • Hide cache miss latency across multiple tuples
    within a group
  • Group size can be increased to hide most cache
    miss latency for hash joins
  • Generic algorithm and analytical model (please
    see paper)
  • There are gaps between groups

28
Prefetching Distance
[Figure: software pipeline with prefetching distance D = 1]
  • Prefetching distance (D): the number of
    iterations between two subsequent code stages for
    a single tuple
  • Increase prefetching distance to hide all cache
    miss latency
  • Generic algorithm and analytical model (please
    see paper)

29
Group Pref. + Multiple Code Paths
  • We keep state information for tuples in a group
  • One of the states decides which code path to take

30
Prefetching Distance D
  • Prologue
  • for j = 0 to N-3D-1 do
  • tuple j+3D
  • compute hash bucket number
  • prefetch the target bucket header
  • tuple j+2D
  • visit the hash bucket header
  • prefetch the hash cell array
  • tuple j+D
  • visit the hash cell array
  • prefetch the matching build tuple
  • tuple j
  • visit the matching build tuple to
  • compare keys and produce output tuple
  • Epilogue

31
Experiment Setup
  • We have implemented our own hash join engine
  • Relations are stored in files with a slotted page
    structure
  • A simple XOR- and shift-based hash function is
    used (see the sketch after this list)
  • GRACE hash join (baseline), Simple prefetching,
    Group prefetching, Software-pipelined
    prefetching
  • Experiment design
  • Same schema for build and probe relations
  • 4-byte key + fixed-length payload
  • No selection and projection
  • 50MB memory available for the joins
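For illustration, an XOR-and-shift style hash function for a 4-byte key might look like the sketch below; the slides do not give the actual shifts or constants, so these are assumptions.

    #include <stdint.h>

    /* XOR/shift mixing of a 4-byte join key into a bucket number.
     * The shifts are illustrative, not the constants used in the paper. */
    static inline uint32_t hash_key(uint32_t key, uint32_t nbuckets) {
        uint32_t h = key;
        h ^= h << 13;
        h ^= h >> 17;
        h ^= h << 5;
        return h % nbuckets;
    }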

32
Simulation Platform
  • Detailed cycle-by-cycle simulations
  • Out-of-order processor pipeline
  • Integer multiply and divide latencies are based
    on the Pentium 4
  • Memory hierarchy is based on the Compaq ES40
  • Memory system parameters projected for the near
    future
  • Better prefetching support
  • Supports TLB prefetching

33
Simulation Parameters
34
Simulator vs. Real Machine
  • Better prefetching support
  • Never ignore prefetching
  • Our prefetches are not hints!
  • TLB prefetching
  • When a prefetch incurs a TLB miss, perform TLB
    loading
  • More miss handlers
  • 32 for data

35
Varying Group Size and Prefetching Distance
  • Varying parameters for the 20-byte-tuple case in
    the previous figure
  • Hash table probing
  • Too small: latencies are not fully hidden
  • Too large: many prefetched cache lines are
    replaced by other memory references
  • Similar performance even when latency increases
    to 1000 cycles!

36
Breakdowns of Cache Misses to Understand the
Tuning Curves
37
Cache Performance Breakdowns
  • Our schemes indeed hide most of the data cache
    miss latencies
  • Overheads lead to larger portions of busy times

38
Partition Phase Performance
  • When the number of partitions is small, use
    simple prefetching
  • When the number of partitions is large, use group
    or software-pipelined prefetching
  • Combined 1.9-2.6X speedups over the baseline

39
Cache Partitioning Schemes
  • Two ways to employ cache partitioning in GRACE
    join
  • Direct cache: generate the cache partitions in
    the I/O partition phase
  • Two-step cache: generate the cache partitions in
    the join phase as a preprocessing step
  • Requires generating a larger number of smaller
    I/O partitions
  • Bounded by available memory
  • Bounded by requirements of underlying storage
    managers
  • So cache partitioning may not be usable when the
    joining relations are very large
  • Not robust with multiple activities going on
  • Requires exclusive use of (part of) the cache
  • Performance penalty due to cache conflicts

40
Comparison with Cache Partitioning
  • Direct cache suffers from the larger number of
    partitions generated in the I/O partition phase
  • Two-step cache suffers from the additional
    partition step
  • Our schemes are the best (slightly better than
    direct cache)

41
Robustness Impact of Cache Interference
100% corresponds to the join phase execution
time when there is no cache flush
  • Performance degradation when the cache is
    periodically flushed
  • The worst-case cache interference
  • Direct cache and 2-step cache degrade 15-67% and
    8-38%, respectively
  • Our prefetching schemes are very robust

42
Group Pref vs. Software-pipelined Pref
  • Hiding latency
  • Software-pipelined pref is always able to hide
    all latencies (according to our analytical model)
  • Book-keeping overhead
  • Software-pipelined pref has more overhead
  • Code complexity
  • Group prefetching is easier to implement
  • A natural group boundary provides a place to do
    the processing left over (e.g., for read-write
    conflicts)
  • A natural place to send outputs to the parent
    operator if a pipelined operator is needed

43
Challenges in Applying Prefetching
  • Try to hide the latency within the processing of
    a single tuple
  • Example: hash table probing
  • Does not work
  • Dependencies essentially form a critical path
  • Randomness makes prediction almost impossible

[Figure: hash bucket headers, hash cell array, and build partition]
44
Naïve Prefetching
  • foreach probe tuple
  • compute bucket number
  • prefetch header
  • visit header
  • prefetch cell array
  • visit cell array
  • prefetch matching build tuple
  • visit matching build tuple

Data dependencies make it difficult to obtain
addresses early
45
Group Prefetching
  • foreach group of probe tuples
  • foreach tuple in group
  • compute bucket number
  • prefetch header
  • foreach tuple in group
  • visit header
  • prefetch cell array
  • foreach tuple in group
  • visit cell array
  • prefetch matching build tuple
  • foreach tuple in group
  • visit matching build tuple

46
Software-Pipelined Prefetching
  • Prologue
  • for j = 0 to N-4 do
  • tuple j+3
  • compute bucket number
  • prefetch header
  • tuple j+2
  • visit header
  • prefetch cell array
  • tuple j+1
  • visit cell array
  • prefetch matching build tuple
  • tuple j
  • visit matching build tuple
  • Epilogue