Title: Improving Hash Join Performance Through Prefetching
1. Improving Hash Join Performance Through Prefetching
Shimin Chen
Anastassia Ailamaki
Todd C. Mowry
Phillip B. Gibbons
2. Hash Join
- Simple hash join
- Build hash table on smaller (build) relation
- Probe hash table using larger (probe) relation
- Random access patterns inherent in hashing
- Excessive random I/Os
- If build relation and hash table cannot fit in memory
3. I/O Partitioning
- Avoid excessive random disk accesses
- Join pairs of build and probe partitions separately
- Sequential I/O patterns for relations and partitions
- Hash join is CPU-bound with reasonable I/O bandwidth
[Figure: build and probe relations divided into corresponding partition pairs]
4. Hash Join Cache Performance
- Partition divides a 1GB relation into 800 partitions
- Join a 50MB build partition with a 100MB probe partition
- Detailed simulations based on a Compaq ES40 system
- Most of the execution time is wasted on data cache misses
  - 82% for partition, 73% for join
- Because of random access patterns in memory
5. Employing Partitioning for Cache?
- Cache partitioning: generating cache-sized partitions
  - Effective in main-memory databases [Shatdal et al., 94; Boncz et al., 99; Manegold et al., 00]
- Two limitations when used in commercial databases
  - 1) Usually needs an additional in-memory partitioning pass
    - Cache is much smaller than main memory
    - 50% worse than our techniques
  - 2) Sensitive to cache sharing by multiple activities
6. Our Approach: Cache Prefetching
- Modern processors support
  - Multiple cache misses to be serviced simultaneously
  - Prefetch assembly instructions for exploiting the parallelism
- Overlap cache miss latency with computation
- Successfully applied to
  - Array-based programs [Mowry et al., 92]
  - Pointer-based programs [Luk & Mowry, 96]
  - Database B-trees [Chen et al., 01]
7. Challenges for Cache Prefetching
- Difficult to obtain memory addresses early
  - Randomness of hashing prohibits address prediction
  - Data dependencies within the processing of a tuple
  - Naïve approach does not work
- Complexity of hash join code
- Complexity of hash join code
- Ambiguous pointer references
- Multiple code paths
- Cannot apply compiler prefetching techniques
8. Our Solution
- Dependencies are rare across subsequent tuples
- Exploit inter-tuple parallelism
  - Overlap cache misses of one tuple with computation and cache misses of other tuples
- We propose two prefetching techniques
- Group prefetching
- Software-pipelined prefetching
9. Outline
- Overview
- Our Proposed Techniques
- Simplified Probing Algorithm
- Naïve Prefetching
- Group Prefetching
- Software-Pipelined Prefetching
- Dealing with Complexities
- Experimental Results
- Conclusions
10. Simplified Probing Algorithm
- foreach probe tuple
  - (0) compute bucket number
  - (1) visit header
  - (2) visit cell array
  - (3) visit matching build tuple
(a C sketch of this baseline loop follows below)
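Below is a minimal C sketch of this baseline probe loop. The data layout (bucket headers pointing to a cell array of hash-code / build-tuple-pointer pairs) and the helpers hash_key, keys_equal, and emit_join_result are illustrative assumptions, not the paper's actual engine.

/* Assumed layout (illustrative): a bucket header points to an array of cells;
 * each cell caches a hash code and a pointer to a build tuple. */
typedef struct { unsigned hash; const char *build_tuple; } cell_t;
typedef struct { int ncells; cell_t *cells; } bucket_t;

/* Hypothetical helpers assumed to exist elsewhere in the join engine. */
unsigned hash_key(const char *tuple);
int      keys_equal(const char *probe_tuple, const char *build_tuple);
void     emit_join_result(const char *probe_tuple, const char *build_tuple);

/* Baseline probing: stages (0)-(3) of one tuple form a dependence chain, so
 * each dereference can stall on a cache miss before the next stage can start. */
void probe_baseline(bucket_t *buckets, unsigned nbuckets,
                    const char **probe_tuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++) {
        unsigned h = hash_key(probe_tuples[i]);               /* (0) compute bucket number */
        bucket_t *b = &buckets[h % nbuckets];
        int n = b->ncells;                                    /* (1) visit header */
        for (int c = 0; c < n; c++) {
            if (b->cells[c].hash != h)                        /* (2) visit cell array */
                continue;
            if (keys_equal(probe_tuples[i], b->cells[c].build_tuple))   /* (3) visit build tuple */
                emit_join_result(probe_tuples[i], b->cells[c].build_tuple);
        }
    }
}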
11. Naïve Prefetching
- foreach probe tuple
  - (0) compute bucket number
  - prefetch header
  - (1) visit header
  - prefetch cell array
  - (2) visit cell array
  - prefetch matching build tuple
  - (3) visit matching build tuple
Data dependencies make it difficult to obtain addresses early (see the sketch below)
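A sketch of the same loop with naïve prefetches inserted, reusing the assumed types and helpers above; __builtin_prefetch is the GCC/Clang builtin. Each prefetch can only be issued once the previous stage has produced its address, so it lands just before the dependent access and hides little latency.

/* Naïve prefetching: prefetch each address as soon as it is known. Because
 * stage k computes the address used by stage k+1, the prefetch is issued only
 * an instruction or two ahead of the access it is meant to cover. */
void probe_naive(bucket_t *buckets, unsigned nbuckets,
                 const char **probe_tuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++) {
        unsigned h = hash_key(probe_tuples[i]);               /* (0) */
        bucket_t *b = &buckets[h % nbuckets];
        __builtin_prefetch(b);                                /* prefetch header */
        cell_t *cells = b->cells;                             /* (1) visit header */
        int n = b->ncells;
        __builtin_prefetch(cells);                            /* prefetch cell array */
        for (int c = 0; c < n; c++) {
            if (cells[c].hash != h)                           /* (2) visit cell array */
                continue;
            __builtin_prefetch(cells[c].build_tuple);         /* prefetch build tuple */
            if (keys_equal(probe_tuples[i], cells[c].build_tuple))    /* (3) */
                emit_join_result(probe_tuples[i], cells[c].build_tuple);
        }
    }
}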
12. Group Prefetching
- foreach group of probe tuples
  - foreach tuple in group
    - (0) compute bucket number
    - prefetch header
  - foreach tuple in group
    - (1) visit header
    - prefetch cell array
  - foreach tuple in group
    - (2) visit cell array
    - prefetch build tuple
  - foreach tuple in group
    - (3) visit matching build tuple
(a C sketch follows below)
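A C sketch of group prefetching, again reusing the assumed types and helpers. To keep it short, it assumes every probe hashes to a non-empty bucket and examines only the first cell; the real algorithm keeps per-tuple state (see slide 14) to handle empty buckets and multiple cells or matches.

#define GROUP 16   /* illustrative group size; the paper tunes this parameter */

/* Group prefetching: run each stage for all tuples of a group before moving
 * to the next stage, so one tuple's miss overlaps with the computation (and
 * misses) of the other tuples in the group. */
void probe_group(bucket_t *buckets, unsigned nbuckets,
                 const char **probe_tuples, int ntuples)
{
    unsigned  h[GROUP];
    bucket_t *b[GROUP];
    cell_t   *cell[GROUP];

    for (int g = 0; g + GROUP <= ntuples; g += GROUP) {
        for (int k = 0; k < GROUP; k++) {                 /* stage (0) for the whole group */
            h[k] = hash_key(probe_tuples[g + k]);
            b[k] = &buckets[h[k] % nbuckets];
            __builtin_prefetch(b[k]);                     /* header miss overlaps later tuples */
        }
        for (int k = 0; k < GROUP; k++) {                 /* stage (1) */
            cell[k] = b[k]->cells;                        /* header should be cached by now */
            __builtin_prefetch(cell[k]);
        }
        for (int k = 0; k < GROUP; k++)                   /* stage (2) */
            __builtin_prefetch(cell[k]->build_tuple);     /* simplification: first cell only */
        for (int k = 0; k < GROUP; k++) {                 /* stage (3) */
            if (cell[k]->hash == h[k] &&
                keys_equal(probe_tuples[g + k], cell[k]->build_tuple))
                emit_join_result(probe_tuples[g + k], cell[k]->build_tuple);
        }
    }
    /* tuples in the final partial group would be handled by a cleanup loop */
}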
13. Software-Pipelined Prefetching
- Prologue
- for j = 0 to N-4 do
  - tuple j+3:
    - (0) compute bucket number
    - prefetch header
  - tuple j+2:
    - (1) visit header
    - prefetch cell array
  - tuple j+1:
    - (2) visit cell array
    - prefetch build tuple
  - tuple j:
    - (3) visit matching build tuple
- Epilogue
(a C sketch follows below)
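A C sketch of software-pipelined prefetching with prefetching distance 1, under the same simplifying assumptions as the group sketch. The prologue and epilogue are folded into the range checks: in iteration j, tuple j+3 is at stage (0), j+2 at stage (1), j+1 at stage (2), and j at stage (3); per-tuple state lives in a 4-slot circular buffer.

void probe_swp(bucket_t *buckets, unsigned nbuckets,
               const char **probe_tuples, int ntuples)
{
    unsigned  h[4];
    bucket_t *b[4];
    cell_t   *cell[4];

    for (int j = -3; j < ntuples; j++) {
        if (j + 3 < ntuples) {                            /* stage (0): tuple j+3 */
            int s = (j + 3) & 3;
            h[s] = hash_key(probe_tuples[j + 3]);
            b[s] = &buckets[h[s] % nbuckets];
            __builtin_prefetch(b[s]);
        }
        if (j + 2 >= 0 && j + 2 < ntuples) {              /* stage (1): tuple j+2 */
            int s = (j + 2) & 3;
            cell[s] = b[s]->cells;
            __builtin_prefetch(cell[s]);
        }
        if (j + 1 >= 0 && j + 1 < ntuples) {              /* stage (2): tuple j+1 */
            int s = (j + 1) & 3;
            __builtin_prefetch(cell[s]->build_tuple);     /* simplification: first cell only */
        }
        if (j >= 0) {                                     /* stage (3): tuple j */
            int s = j & 3;
            if (cell[s]->hash == h[s] &&
                keys_equal(probe_tuples[j], cell[s]->build_tuple))
                emit_join_result(probe_tuples[j], cell[s]->build_tuple);
        }
    }
}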
14. Dealing with Multiple Code Paths
- Multiple code paths
  - There could be 0 or many matches
  - Hash buckets could be empty or full
- Keep state information for tuples being processed
  - Record the state after each code stage
  - Test the state to decide (B, C, D, F, G label paths in the slide's code-path diagram)
    - Do nothing, if state B
    - Execute D, if state C
    - Execute G, if state F
- Previous compiler techniques cannot handle this
(a small state-machine sketch follows below)
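A hedged sketch of how per-tuple state might look in C; the state names and fields are hypothetical, not the paper's actual code. Each stage records a state for its tuple; the next stage tests that state and either does nothing or executes the corresponding path.

/* Hypothetical per-tuple context for one group. */
typedef enum { PROBE_DONE, PROBE_SCAN_CELLS } probe_state_t;

typedef struct {
    probe_state_t state;      /* recorded by one stage, tested by the next */
    unsigned      hash;
    bucket_t     *bucket;
    cell_t       *cell;       /* next cell to examine when state == PROBE_SCAN_CELLS */
    int           remaining;  /* cells left to examine in this bucket */
} probe_ctx_t;

/* Stage (1) in the group loop: visit the header and record which path the
 * tuple takes next. Later stages check ctx->state and simply skip tuples
 * whose buckets turned out to be empty (the "do nothing" case). */
static void stage_visit_header(probe_ctx_t *ctx)
{
    ctx->remaining = ctx->bucket->ncells;
    if (ctx->remaining == 0) {
        ctx->state = PROBE_DONE;             /* empty bucket: nothing more to do */
    } else {
        ctx->state = PROBE_SCAN_CELLS;
        ctx->cell  = ctx->bucket->cells;
        __builtin_prefetch(ctx->cell);       /* prefetch for the next stage */
    }
}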
15. Dealing with Read-Write Conflicts
- Use a busy flag in the bucket header to detect conflicts
- Postpone hashing the 2nd tuple until the 1st has finished processing
- The compiler cannot perform this transformation
(a busy-flag sketch follows below)
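One way the busy flag might look in C during the build phase; this is an illustrative guess at the mechanism, not the paper's implementation. If two tuples in the same group hash to the same bucket, the second fails to claim the header and is set aside until the first has finished.

typedef struct {
    int     busy;      /* set while a tuple of the current group owns this bucket */
    int     ncells;
    cell_t *cells;
} build_bucket_t;

/* Returns 1 if the caller may update this bucket now; 0 means a read-write
 * conflict with an earlier tuple in the group, so the caller defers this
 * tuple (e.g., to the end of the group) and clears busy when its insertion
 * completes. */
static int try_claim_bucket(build_bucket_t *b)
{
    if (b->busy)
        return 0;
    b->busy = 1;
    return 1;
}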
16. More Details in the Paper
- General group prefetching algorithm
- General software-pipelined prefetching algorithm
- Analytical models
- Discussion of important parameters
- group size, prefetching distance
- Implementation details
17. Outline
- Overview
- Our Proposed Techniques
- Experimental Results
- Setup
- Performance of Our Techniques
- Comparison with Cache Partitioning
- Conclusions
18. Experimental Setup
- Relation schema: 4-byte join attribute + fixed-length payload
- No selection, no projection
- 50MB memory available for the join phase
- Detailed cycle-by-cycle simulations
- 1GHz superscalar processor
- Memory hierarchy is based on Compaq ES40
19. Joining a Pair of Build and Probe Partitions
- A 50MB build partition joins a 100MB probe partition
  - 1:2 matching
- Number of tuples decreases as tuple size increases
- Our techniques achieve 2.1-2.9X speedups over the original hash join
20. Varying Memory Latency
- A 50MB build partition joins a 100MB probe partition
  - 1:2 matching, 100B tuples
- 150 cycles: default parameter
- 1000 cycles: memory latency in the future
- Our techniques achieve 9X speedups over the baseline at 1000 cycles
- Absolute performance of our techniques at 1000 cycles stays very close to that at 150 cycles
21. Comparison with Cache Partitioning
- A 200MB build relation joins a 400MB probe relation
  - 1:2 matching
- Partitioning + join
- Cache partitioning: generating cache-sized partitions [Shatdal et al., 94; Boncz et al., 99; Manegold et al., 00]
  - Additional in-memory partition step after I/O partitioning
  - At least 50% worse than our prefetching schemes
22. Robustness: Impact of Cache Interference
- Cache partitioning relies on exclusive use of the cache
- Periodically flush the cache: worst-case interference
- Execution time self-normalized to the no-flush case
- Cache partitioning degrades by 8-38%
- Our prefetching schemes are very robust
23. Conclusions
- Exploited inter-tuple parallelism
- Proposed group prefetching and software-pipelined prefetching
  - Prior prefetching techniques cannot handle the code complexity
- Our techniques achieve dramatically better performance
  - 2.1-2.9X speedups for the join phase
  - 1.4-2.6X speedups for the partition phase
- 9X speedups at 1000-cycle memory latency in the future
  - Absolute performance is close to that at 150 cycles
- Robust against cache interference
  - Unlike cache partitioning
- Our prefetching techniques are effective for hash joins
24. Thank you!
25. Backup Slides
26. Is Hash Join CPU-Bound?
- 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP SCSI disks (max transfer rate 68MByte/sec), Linux 2.4.18
- 100B tuples, 4B keys, 1:2 matching
- Striping unit: 256KB
- 10 measurements, std < 10% of mean or std < 1s
- Quad-processor Pentium III, four disks
- A 1.5GB build relation, a 3GB probe relation
- Main thread: GRACE hash join
- Background I/O thread per disk: I/O prefetching and writing
- Hash join is CPU-bound with reasonable I/O bandwidth
- Still large room for CPU performance improvement
27. Hiding Latency within a Group
- Hide cache miss latency across multiple tuples within a group
- Group size can be increased to hide most cache miss latency for hash joins
- Generic algorithm and analytical model (please see paper)
- There are gaps between groups
(a back-of-the-envelope condition is sketched below)
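As a rough, back-of-the-envelope condition (an intuition aid, not the paper's exact analytical model): with group size $G$, computation $W$ per tuple per code stage, and full cache miss latency $L$,

$$ (G - 1)\,W \;\ge\; L $$

i.e., the work done on the other $G-1$ tuples at the same stage covers one tuple's miss. Larger groups hide more latency, at the cost of more in-flight prefetches and more book-keeping state.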
28. Prefetching Distance
[Figure: software pipeline with prefetching distance D = 1]
- Prefetching distance (D): the number of iterations between two subsequent code stages for a single tuple
- Increase the prefetching distance to hide all cache miss latency
- Generic algorithm and analytical model (please see paper)
(a back-of-the-envelope condition is sketched below)
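Under the same hedged, back-of-the-envelope reasoning: if one pipeline iteration (one stage of each in-flight tuple) performs computation $W_{\mathrm{iter}}$ and the full miss latency is $L$, a prefetch issued $D$ iterations ahead of its use has roughly completed when

$$ D\,W_{\mathrm{iter}} \;\ge\; L $$

which is why increasing the prefetching distance can hide even very long latencies.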
29. Group Prefetching: Multiple Code Paths
- We keep state information for tuples in a group
- One of the states decides which code path to take
30. Prefetching Distance D
- Prologue
- for j = 0 to N-3D-1 do
  - tuple j+3D:
    - compute hash bucket number
    - prefetch the target bucket header
  - tuple j+2D:
    - visit the hash bucket header
    - prefetch the hash cell array
  - tuple j+D:
    - visit the hash cell array
    - prefetch the matching build tuple
  - tuple j:
    - visit the matching build tuple to compare keys and produce output tuple
- Epilogue
31. Experiment Setup
- We have implemented our own hash join engine
  - Relations are stored in files with a slotted-page structure
  - A simple XOR- and shift-based hash function is used
  - GRACE hash join (baseline), simple prefetching, group prefetching, software-pipelined prefetching
- Experiment design
  - Same schema for build and probe relations
  - 4-byte key + fixed-length payload
  - No selection and projection
  - 50MB memory available for the joins
32. Simulation Platform
- Detailed cycle-by-cycle simulations
  - Out-of-order processor pipeline
  - Integer multiply and divide latencies are based on the Pentium 4
  - Memory hierarchy is based on the Compaq ES40
  - Memory system parameters in the near future
- Better prefetching support
  - Supports TLB prefetching
33. Simulation Parameters
34. Simulator vs. Real Machine
- Better prefetching support
  - Never ignore prefetches
  - Our prefetches are not hints!
- TLB prefetching
  - When a prefetch incurs a TLB miss, perform the TLB loading
- More miss handlers
  - 32 for data
35. Varying Group Size and Prefetching Distance
- Varying the parameters for the 20B case in the previous figure
- Hash table probing
  - Too small: latencies are not fully hidden
  - Too large: many prefetched cache lines are replaced by other memory references
- Similar performance even when latency increases to 1000 cycles!
36. Breakdowns of Cache Misses to Understand the Tuning Curves
37. Cache Performance Breakdowns
- Our schemes indeed hide most of the data cache miss latency
- Overheads lead to larger portions of busy time
38. Partition Phase Performance
- When the number of partitions is small, use simple prefetching
- When the number of partitions is large, use group or software-pipelined prefetching
- Combined: 1.9-2.6X speedups over the baseline
39. Cache Partitioning Schemes
- Two ways to employ cache partitioning in a GRACE join
  - Direct cache: generate the cache partitions in the I/O partition phase
  - Two-step cache: generate the cache partitions in the join phase as a preprocessing step
- Direct cache requires generating a larger number of smaller I/O partitions
  - Bounded by available memory
  - Bounded by requirements of underlying storage managers
  - So cache partitioning may not be usable when the joining relations are very large
- Not robust with multiple activities going on
  - Requires exclusive use of (part of) the cache
  - Performance penalty due to cache conflicts
40. Comparison with Cache Partitioning
- Direct cache suffers from the larger number of partitions generated in the I/O partition phase
- Two-step cache suffers from the additional partition step
- Our schemes are the best (slightly better than direct cache)
41. Robustness: Impact of Cache Interference
100% corresponds to the join phase execution time when there is no cache flush
- Performance degradation when the cache is periodically flushed
  - The worst cache interference
- Direct cache and 2-step cache degrade by 15-67% and 8-38%
- Our prefetching schemes are very robust
42. Group Prefetching vs. Software-Pipelined Prefetching
- Hiding latency
  - Software-pipelined prefetching is always able to hide all latencies (according to our analytical model)
- Book-keeping overhead
  - Software-pipelined prefetching has more overhead
- Code complexity
  - Group prefetching is easier to implement
  - A natural group boundary provides a place to finish any leftover processing (e.g., for read-write conflicts)
  - A natural place to send outputs to the parent operator if a pipelined operator is needed
43. Challenges in Applying Prefetching
- Try to hide the latency within the processing of a single tuple
  - Example: hash table probing
- Does not work
  - Dependencies essentially form a critical path
  - Randomness makes prediction almost impossible
[Figure: hash bucket headers, hash cell arrays, and the build partition accessed during probing]
44. Naïve Prefetching
- foreach probe tuple
  - compute bucket number
  - prefetch header
  - visit header
  - prefetch cell array
  - visit cell array
  - prefetch matching build tuple
  - visit matching build tuple
Data dependencies make it difficult to obtain addresses early
45. Group Prefetching
- foreach group of probe tuples
  - foreach tuple in group
    - compute bucket number
    - prefetch header
  - foreach tuple in group
    - visit header
    - prefetch cell array
  - foreach tuple in group
    - visit cell array
    - prefetch matching build tuple
  - foreach tuple in group
    - visit matching build tuple
46. Software-Pipelined Prefetching
- Prologue
- for j = 0 to N-4 do
  - tuple j+3:
    - compute bucket number
    - prefetch header
  - tuple j+2:
    - visit header
    - prefetch cell array
  - tuple j+1:
    - visit cell array
    - prefetch matching build tuple
  - tuple j:
    - visit matching build tuple
- Epilogue