Improving Hash Join Performance Through Prefetching - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Improving Hash Join Performance Through Prefetching


1
Improving Hash Join Performance Through
Prefetching
Shimin Chen
Anastassia Ailamaki
Todd C. Mowry

Phillip B. Gibbons
2
Hash Join
  • Simple hash join
  • Build hash table on smaller (build) relation
  • Probe hash table using larger (probe) relation
  • Random access patterns inherent in hashing
  • Excessive random I/Os if the build relation and
    hash table cannot fit in memory
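As a concrete reference point, here is a minimal C sketch of the simple hash join described above, assuming an illustrative ~100-byte tuple schema, a chained hash table, and a hypothetical bucket count; it is not the authors' implementation.

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 1024                    /* illustrative bucket count */

    typedef struct tuple { uint32_t key; char payload[96]; } tuple_t;  /* ~100B tuples */
    typedef struct node  { tuple_t *t; struct node *next; } node_t;

    static node_t *buckets[NBUCKETS];

    static unsigned hash(uint32_t key) { return key % NBUCKETS; }

    /* Build phase: insert every tuple of the smaller (build) relation. */
    static void build(tuple_t *rel, size_t n) {
        for (size_t i = 0; i < n; i++) {
            node_t *cell = malloc(sizeof *cell);
            cell->t = &rel[i];
            unsigned b = hash(rel[i].key);   /* random position in the table */
            cell->next = buckets[b];
            buckets[b] = cell;
        }
    }

    /* Probe phase: look up every tuple of the larger (probe) relation. */
    static size_t probe(tuple_t *rel, size_t n) {
        size_t matches = 0;
        for (size_t i = 0; i < n; i++)
            for (node_t *c = buckets[hash(rel[i].key)]; c != NULL; c = c->next)
                if (c->t->key == rel[i].key)
                    matches++;               /* a real join would emit the pair here */
        return matches;
    }

Both loops chase pointers into the table at positions determined by the hash of the key, which is the random access pattern the slide refers to; once the build relation and table no longer fit in memory, those accesses turn into random I/Os.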

3
I/O Partitioning
  • Avoid excessive random disk accesses
  • Join pairs of build and probe partitions
    separately
  • Sequential I/O patterns for relations and
    partitions
  • Hash join is CPU-bound with reasonable I/O
    bandwidth
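A short sketch of the I/O partitioning step under the same illustrative schema: each relation is scanned once and every tuple is appended to the partition chosen by a hash of its join key, so matching build and probe tuples land in the same partition pair. The partition count, hash constant, and file handling are assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define NPART 800   /* e.g. 800 partitions for a 1GB relation, as on the next slide */

    /* Route each tuple to a partition file by hashing the join key.
     * Writes stay sequential within each partition, avoiding random I/O. */
    static void partition(FILE *in, FILE *out[NPART]) {
        struct { uint32_t key; char payload[96]; } t;
        while (fread(&t, sizeof t, 1, in) == 1) {
            unsigned p = (t.key * 2654435761u) % NPART;   /* illustrative hash */
            fwrite(&t, sizeof t, 1, out[p]);
        }
    }

Build partition i is then joined only with probe partition i, and both reads and writes remain sequential.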

[Figure: build and probe relations divided into corresponding partition pairs]
4
Hash Join Cache Performance
  • Partitioning divides a 1GB relation into 800
    partitions
  • Join a 50MB build partition with a 100MB probe
    partition
  • Detailed simulations based on a Compaq ES40 system
  • Most of the execution time is wasted on data cache
    misses
  • 82% for partition, 73% for join
  • Because of random access patterns in memory

5
Employing Partitioning for Cache?
  • Cache partitioning: generating cache-sized
    partitions
  • Effective in main-memory databases [Shatdal et
    al., 94], [Boncz et al., 99], [Manegold et
    al., 00]
  • Two limitations when used in commercial databases
  • 1) Usually needs an additional in-memory
    partitioning pass
  • Cache is much smaller than main memory
  • 50% worse than our techniques
  • 2) Sensitive to cache sharing by multiple
    activities

6
Our Approach: Cache Prefetching
  • Modern processors allow multiple cache misses to
    be serviced simultaneously
  • Prefetch assembly instructions for exploiting
    this parallelism
  • Overlap cache miss latency with computation
  • Successfully applied to
  • Array-based programs [Mowry et al., 92]
  • Pointer-based programs [Luk & Mowry, 96]
  • Database B-trees [Chen et al., 01]
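For concreteness, a hedged example of how such prefetches are issued from C: GCC/Clang's __builtin_prefetch lowers to the processor's prefetch instruction, and several prefetches can be outstanding while the loop keeps computing. The array example and the distance of 16 are illustrative and unrelated to the hash join code.

    /* Prefetch the element a fixed distance ahead so its miss latency
     * overlaps with the computation on the current element. */
    void scale(double *a, long n, double f) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);   /* distance of 16 is illustrative */
            a[i] *= f;
        }
    }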

7
Challenges for Cache Prefetching
  • Difficult to obtain memory addresses early
  • Randomness of hashing prohibits address
    prediction
  • Data dependencies within the processing of a
    tuple
  • Naïve approach does not work
  • Complexity of hash join code
  • Ambiguous pointer references
  • Multiple code paths
  • Cannot apply compiler prefetching techniques

8
Our Solution
  • Dependencies are rare across subsequent tuples
  • Exploit inter-tuple parallelism
  • Overlap cache misses of one tuple with
    computation and cache misses of other tuples
  • We propose two prefetching techniques
  • Group prefetching
  • Software-pipelined prefetching

9
Outline
  • Overview
  • Our Proposed Techniques
  • Simplified Probing Algorithm
  • Naïve Prefetching
  • Group Prefetching
  • Software-Pipelined Prefetching
  • Dealing with Complexities
  • Experimental Results
  • Conclusions

10
Simplified Probing Algorithm
  • foreach probe tuple
  • (0) compute bucket number
  • (1) visit header
  • (2) visit cell array
  • (3) visit matching build tuple
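Read as C, the four numbered steps form a loop like the sketch below; the hash-table layout (bucket headers pointing to cell arrays pointing to build tuples) follows the talk, but the type and helper names (tuple_t, hashfn, emit) are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint32_t key; char payload[96]; } tuple_t;    /* assumed schema  */
    typedef struct { uint32_t key; void *build_tuple; } cell_t;    /* one hash cell   */
    typedef struct { int ncells; cell_t *cells; } bucket_t;        /* bucket header   */

    extern unsigned hashfn(uint32_t key);                 /* assumed hash function    */
    extern void emit(void *build_tuple, tuple_t *probe);  /* assumed output callback  */

    void probe_simple(bucket_t *buckets, unsigned nbuckets,
                      tuple_t *probe, size_t n) {
        for (size_t i = 0; i < n; i++) {
            unsigned b = hashfn(probe[i].key) % nbuckets;    /* (0) compute bucket number */
            bucket_t hdr = buckets[b];                       /* (1) visit header          */
            for (int j = 0; j < hdr.ncells; j++)             /* (2) visit cell array      */
                if (hdr.cells[j].key == probe[i].key)
                    emit(hdr.cells[j].build_tuple, &probe[i]); /* (3) visit build tuple   */
        }
    }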

11
Naïve Prefetching
  • foreach probe tuple
  • (0) compute bucket number
  • prefetch header
  • (1) visit header
  • prefetch cell array
  • (2) visit cell array
  • prefetch matching build tuple
  • (3) visit matching build tuple

Data dependencies make it difficult to obtain
addresses early
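The same loop with the naïve prefetches inserted, reusing the declarations from the previous sketch. Each address becomes known only one statement before its use, so there is essentially no work to overlap with the prefetch, which is the dependence problem the note above points out.

    void probe_naive(bucket_t *buckets, unsigned nbuckets,
                     tuple_t *probe, size_t n) {
        for (size_t i = 0; i < n; i++) {
            unsigned b = hashfn(probe[i].key) % nbuckets;    /* (0) compute bucket number */
            __builtin_prefetch(&buckets[b]);                 /* prefetch header           */
            bucket_t hdr = buckets[b];                       /* (1) visit header          */
            __builtin_prefetch(hdr.cells);                   /* prefetch cell array       */
            for (int j = 0; j < hdr.ncells; j++) {           /* (2) visit cell array      */
                if (hdr.cells[j].key == probe[i].key) {
                    __builtin_prefetch(hdr.cells[j].build_tuple); /* prefetch build tuple */
                    emit(hdr.cells[j].build_tuple, &probe[i]);    /* (3) visit it         */
                }
            }
        }
    }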
12
Group Prefetching
  • foreach group of probe tuples
  • foreach tuple in group
  • (0) compute bucket number
  • prefetch header
  • foreach tuple in group
  • (1) visit header
  • prefetch cell array
  • foreach tuple in group
  • (2) visit cell array
  • prefetch build tuple
  • foreach tuple in group
  • (3) visit matching build tuple
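A C sketch of group prefetching under the same assumed declarations. Each stage runs over the whole group and issues the prefetches needed by the next stage, so one tuple's miss latency is overlapped with the other tuples' computation and misses. The group size of 25 is an illustrative value, and only the first match per tuple is kept to keep the sketch short.

    #define G 25   /* group size; illustrative, tuned in the paper */

    void probe_group(bucket_t *buckets, unsigned nbuckets,
                     tuple_t *probe, size_t n) {
        unsigned bno[G];          /* per-tuple state kept between the stages */
        bucket_t hdr[G];
        cell_t  *match[G];

        for (size_t base = 0; base + G <= n; base += G) {
            for (int k = 0; k < G; k++) {                        /* stage 0 */
                bno[k] = hashfn(probe[base + k].key) % nbuckets; /* compute bucket number */
                __builtin_prefetch(&buckets[bno[k]]);            /* prefetch header       */
            }
            for (int k = 0; k < G; k++) {                        /* stage 1 */
                hdr[k] = buckets[bno[k]];                        /* visit header          */
                __builtin_prefetch(hdr[k].cells);                /* prefetch cell array   */
            }
            for (int k = 0; k < G; k++) {                        /* stage 2 */
                match[k] = NULL;                                 /* visit cell array      */
                for (int j = 0; j < hdr[k].ncells; j++)
                    if (hdr[k].cells[j].key == probe[base + k].key)
                        match[k] = &hdr[k].cells[j];
                if (match[k])
                    __builtin_prefetch(match[k]->build_tuple);   /* prefetch build tuple  */
            }
            for (int k = 0; k < G; k++)                          /* stage 3 */
                if (match[k])
                    emit(match[k]->build_tuple, &probe[base + k]); /* visit build tuple   */
        }
        /* the last n % G tuples would be handled by the simple probing loop */
    }

The per-tuple arrays bno, hdr, and match are exactly the kind of state information the later slides describe.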

13
Software-Pipelined Prefetching
  • Prologue
  • for j = 0 to N-4 do
  • tuple j+3
  • (0) compute bucket number
  • prefetch header
  • tuple j+2
  • (1) visit header
  • prefetch cell array
  • tuple j+1
  • (2) visit cell array
  • prefetch build tuple
  • tuple j
  • (3) visit matching build tuple
  • Epilogue
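The corresponding software-pipelined version in C, again reusing the assumed declarations, with a prefetching distance of one iteration as on the slide. Instead of writing the prologue and epilogue out separately, this sketch folds them into bounds checks; in steady state, iteration j runs stage 0 for tuple j+3, stage 1 for tuple j+2, stage 2 for tuple j+1, and stage 3 for tuple j.

    void probe_pipelined(bucket_t *buckets, unsigned nbuckets,
                         tuple_t *probe, long n) {
        unsigned bno[4];     /* circular per-tuple state, indexed by tuple number mod 4 */
        bucket_t hdr[4];
        cell_t  *match[4];

        /* j = -3..-1 act as the prologue and the final iterations as the epilogue:
         * the bounds checks skip the stages that do not apply yet or anymore. */
        for (long j = -3; j < n; j++) {
            long t3 = j + 3, t2 = j + 2, t1 = j + 1;

            if (t3 < n) {                                         /* stage 0, tuple j+3 */
                bno[t3 % 4] = hashfn(probe[t3].key) % nbuckets;   /* compute bucket no. */
                __builtin_prefetch(&buckets[bno[t3 % 4]]);        /* prefetch header    */
            }
            if (t2 >= 0 && t2 < n) {                              /* stage 1, tuple j+2 */
                hdr[t2 % 4] = buckets[bno[t2 % 4]];               /* visit header       */
                __builtin_prefetch(hdr[t2 % 4].cells);            /* prefetch cells     */
            }
            if (t1 >= 0 && t1 < n) {                              /* stage 2, tuple j+1 */
                match[t1 % 4] = NULL;                             /* visit cell array   */
                for (int c = 0; c < hdr[t1 % 4].ncells; c++)
                    if (hdr[t1 % 4].cells[c].key == probe[t1].key)
                        match[t1 % 4] = &hdr[t1 % 4].cells[c];
                if (match[t1 % 4])
                    __builtin_prefetch(match[t1 % 4]->build_tuple); /* prefetch tuple   */
            }
            if (j >= 0 && match[j % 4])                           /* stage 3, tuple j   */
                emit(match[j % 4]->build_tuple, &probe[j]);       /* visit build tuple  */
        }
    }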

14
Dealing with Multiple Code Paths
  • Multiple code paths
  • There could be 0 or many matches
  • Hash buckets could be empty or full
  • Keep state information for tuples being processed

[Figure: example branching code paths; states B, C, F and code blocks D, G referenced below]
  • Test state to decide
  • Do nothing, if state B
  • Execute D, if state C
  • Execute G, if state F
  • Previous compiler techniques cannot handle this
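One way to keep that state, reusing the earlier declarations, is a small per-tuple enum that each stage records and the next stage tests, so empty buckets and zero-or-many matches simply leave a tuple in a different state instead of forcing a separate code version per path. The names below are illustrative, not taken from the paper.

    /* Per-tuple state recorded at the end of each stage; the next stage tests
     * it and picks the matching code path (or does nothing for this tuple). */
    typedef enum {
        ST_EMPTY_BUCKET,     /* header showed no cells: nothing more to do    */
        ST_SCANNING_CELLS,   /* cell array fetched, keys still to be compared */
        ST_HAS_MATCH,        /* matching cell found, build tuple prefetched   */
        ST_DONE
    } tuple_state_t;

    typedef struct {
        tuple_state_t state;
        unsigned      bucket;
        bucket_t      hdr;
        cell_t       *match;
    } probe_state_t;         /* one entry per in-flight tuple of the group */

    /* Example of "test state to decide" for the final stage. */
    static void stage3(probe_state_t *s, tuple_t *probe_tuple) {
        switch (s->state) {
        case ST_HAS_MATCH:
            emit(s->match->build_tuple, probe_tuple);   /* take the match path */
            s->state = ST_DONE;
            break;
        default:
            break;           /* empty bucket or no match: do nothing */
        }
    }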

15
Dealing with Read-write Conflicts
  • In hash table building
  • Use a busy flag in the bucket header to detect
    conflicts
  • Postpone hashing the 2nd tuple until the 1st has
    finished
  • Compiler cannot perform this transformation
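A sketch of the busy-flag idea, reusing the earlier declarations; append_cell is an assumed helper. The claim and the insertion happen in different stages of the group, so a second in-flight tuple that hashes to the same bucket sees the flag set and is postponed until the first tuple's late stage releases the bucket.

    typedef struct {
        int     busy;        /* set while an earlier in-flight tuple updates this bucket */
        int     ncells;
        cell_t *cells;
    } build_bucket_t;

    extern void append_cell(build_bucket_t *b, tuple_t *t);   /* assumed helper */

    /* Early stage: claim the bucket, or report a conflict so the caller
     * postpones this tuple until the earlier one has finished. */
    static int build_stage_claim(build_bucket_t *b) {
        if (b->busy)
            return 0;        /* read-write conflict within the group: postpone */
        b->busy = 1;
        return 1;
    }

    /* Late stage: the prefetched bucket has arrived; perform the insertion. */
    static void build_stage_insert(build_bucket_t *b, tuple_t *t) {
        append_cell(b, t);
        b->busy = 0;         /* release so any postponed tuple can proceed */
    }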

16
More Details In Paper
  • General group prefetching algorithm
  • General software-pipelined prefetching algorithm
  • Analytical models
  • Discussion of important parameters
  • group size, prefetching distance
  • Implementation details

17
Outline
  • Overview
  • Our Proposed Techniques
  • Experimental Results
  • Setup
  • Performance of Our Techniques
  • Comparison with Cache Partitioning
  • Conclusions

18
Experimental Setup
  • Relation schema: 4-byte join attribute + fixed-length payload
  • No selection, no projection
  • 50MB memory available for the join phase
  • Detailed cycle-by-cycle simulations
  • 1GHz superscalar processor
  • Memory hierarchy is based on Compaq ES40

19
Joining a Pair of Build and Probe Partitions
  • A 50MB build partition joins a 100MB probe
    partition
  • 1:2 matching
  • Number of tuples decreases as tuple size increases
  • Our techniques achieve 2.1-2.9X speedups over
    original hash join

20
Varying Memory Latency
  • A 50MB build partition joins a 100MB probe
    partition
  • 1:2 matching
  • 100-byte tuples
  • 150 cycles: the default memory latency parameter
  • 1000 cycles: memory latency in the future
  • Our techniques achieve 9X speedups over baseline
    at 1000 cycles
  • Absolute performances of our techniques are very
    close

21
Comparison with Cache Partitioning
  • A 200MB build relation joins a 400MB probe
    relation
  • 1:2 matching
  • Partitioning + join
  • Cache partitioning: generating cache-sized
    partitions [Shatdal et al., 94], [Boncz et al.,
    99], [Manegold et al., 00]
  • Additional in-memory partition step after I/O
    partitioning
  • At least 50% worse than our prefetching schemes

22
Robustness Impact of Cache Interference
  • Cache partitioning relies on exclusive use of the
    cache
  • Periodically flush the cache: worst-case
    interference
  • Each scheme is normalized to its own execution
    time with no cache flushes
  • Cache partitioning degrades 8-38%
  • Our prefetching schemes are very robust

23
Conclusions
  • Exploited inter-tuple parallelism
  • Proposed group prefetching and software-pipelined
    prefetching
  • Prior prefetching techniques cannot handle code
    complexity
  • Our techniques achieve dramatically better
    performance
  • 2.1-2.9X speedups for join phase
  • 1.4-2.6X speedups for partition phase
  • 9X speedups at 1000-cycle memory latency in the
    future
  • Absolute performances are close to those at 150
    cycles
  • Robust against cache interference
  • Unlike cache partitioning
  • Our prefetching techniques are effective for hash
    joins

24
Thank you !
25
Back Up Slides
26
Is Hash Join CPU-bound ?
  • 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP
    SCSI disks (max transfer rate 68MByte/sec),
    Linux 2.4.18
  • 100-byte tuples, 4-byte keys, 1:2 matching
  • Striping unit: 256KB
  • 10 measurements, std < 10% of mean or std < 1s
  • Quad-processor Pentium III, four disks
  • A 1.5GB build relation, a 3GB probe relation
  • Main thread: GRACE hash join
  • Background I/O thread per disk: I/O prefetching
    and writing
  • Hash join is CPU-bound with reasonable I/O
    bandwidth
  • Still large room for CPU performance improvement

27
Hiding Latency within A Group
  • Hide cache miss latency across multiple tuples
    within a group
  • Group size can be increased to hide most cache
    miss latency for hash joins
  • Generic algorithm and analytical model (please
    see paper)
  • There are gaps between groups

28
Prefetching Distance
[Figure: software pipeline with prefetching distance D = 1]
  • Prefetching distance (D): the number of
    iterations between two subsequent code stages for
    a single tuple
  • Increase prefetching distance to hide all cache
    miss latency
  • Generic algorithm and analytical model (please
    see paper)

29
Group Pref. + Multiple Code Paths
  • We keep state information for tuples in a group
  • One of the states decides which code path to take

30
Prefetching Distance D
  • Prologue
  • for j = 0 to N-3D-1 do
  • tuple j+3D
  • compute hash bucket number
  • prefetch the target bucket header
  • tuple j+2D
  • visit the hash bucket header
  • prefetch the hash cell array
  • tuple j+D
  • visit the hash cell array
  • prefetch the matching build tuple
  • tuple j
  • visit the matching build tuple to
  • compare keys and produce output tuple
  • Epilogue

31
Experiment Setup
  • We have implemented our own hash join engine
  • Relations are stored in files with a slotted page
    structure
  • A simple XOR- and shift-based hash function is
    used (see the sketch after this list)
  • GRACE hash join (baseline), Simple prefetching,
    Group prefetching, Software-pipelined
    prefetching
  • Experiment design
  • Same schema for build and probe relations
  • 4-byte key + fixed-length payload
  • No selection and projection
  • 50MB memory available for the joins
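For illustration, an XOR-and-shift style hash function for a 4-byte key might look like the sketch below; the slides do not give the actual shifts or constants, so these are assumptions.

    #include <stdint.h>

    /* XOR/shift mixing of a 4-byte join key into a bucket number.
     * The shifts are illustrative, not the constants used in the paper. */
    static inline uint32_t hash_key(uint32_t key, uint32_t nbuckets) {
        uint32_t h = key;
        h ^= h << 13;
        h ^= h >> 17;
        h ^= h << 5;
        return h % nbuckets;
    }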

32
Simulation Platform
  • Detailed cycle-by-cycle simulations
  • Out-of-order processor pipeline
  • Integer multiply and divide latencies are based
    on the Pentium 4
  • Memory hierarchy is based on the Compaq ES40
  • Memory system parameters projected for the near
    future
  • Better prefetching support
  • Supports TLB prefetching

33
Simulation Parameters
34
Simulator vs. Real Machine
  • Better prefetching support
  • Never ignore prefetching
  • Our prefetches are not hints!
  • TLB prefetching
  • When a prefetch incurs a TLB miss, perform TLB
    loading
  • More miss handlers
  • 32 for data

35
Varying Group Size and Prefetching Distance
  • Varying parameters for the 20-byte-tuple case in
    the previous figure
  • Hash table probing
  • Too small: latencies are not fully hidden
  • Too large: many prefetched cache lines are
    replaced by other memory references
  • Similar performance even when latency increases
    to 1000 cycles!

36
Breakdowns of Cache Misses to Understand the
Tuning Curves
37
Cache Performance Breakdowns
  • Our schemes indeed hide most of the data cache
    miss latencies
  • Overheads lead to larger portions of busy times

38
Partition Phase Performance
  • When the number of partitions is small, use
    simple prefetching
  • When the number of partitions is large, use group
    or software-pipelined prefetching
  • Combined 1.9-2.6X speedups over the baseline

39
Cache Partitioning Schemes
  • Two ways to employ cache partitioning in GRACE
    join
  • Direct cache: generate the cache partitions in
    the I/O partition phase
  • Two-step cache: generate the cache partitions in
    the join phase as a preprocessing step
  • Requires generating a larger number of smaller
    I/O partitions
  • Bounded by available memory
  • Bounded by requirements of underlying storage
    managers
  • So cache partitioning may not be usable when the
    joining relations are very large
  • Not robust with multiple activities going on
  • Requires exclusive use of (part of) the cache
  • Performance penalty due to cache conflicts

40
Comparison with Cache Partitioning
  • Direct cache suffers from the larger number of
    partitions generated in the I/O partition phase
  • Two-step cache suffers from the additional
    partition step
  • Our schemes are the best (slightly better than
    direct cache)

41
Robustness Impact of Cache Interference
100% corresponds to the join phase execution
time when there is no cache flush
  • Performance degradation when the cache is
    periodically flushed
  • The worst-case cache interference
  • Direct cache and 2-step cache degrade 15-67% and
    8-38%, respectively
  • Our prefetching schemes are very robust

42
Group Pref vs. Software-pipelined Pref
  • Hiding latency
  • Software-pipelined pref is always able to hide
    all latencies (according to our analytical model)
  • Book-keeping overhead
  • Software-pipelined pref has more overhead
  • Code complexity
  • Group prefetching is easier to implement
  • A natural group boundary provides a place to do
    the processing left over (e.g., for read-write
    conflicts)
  • A natural place to send outputs to the parent
    operator if a pipelined operator is needed

43
Challenges in Applying Prefetching
  • Try to hide the latency within the processing of
    a single tuple
  • Example: hash table probing
  • Does not work
  • Dependencies essentially form a critical path
  • Randomness makes prediction almost impossible

[Figure: hash bucket headers, hash cell array, and build partition]
44
Naïve Prefetching
  • foreach probe tuple
  • compute bucket number
  • prefetch header
  • visit header
  • prefetch cell array
  • visit cell array
  • prefetch matching build tuple
  • visit matching build tuple

Data dependencies make it difficult to obtain
addresses early
45
Group Prefetching
  • foreach group of probe tuples
  • foreach tuple in group
  • compute bucket number
  • prefetch header
  • foreach tuple in group
  • visit header
  • prefetch cell array
  • foreach tuple in group
  • visit cell array
  • prefetch matching build tuple
  • foreach tuple in group
  • visit matching build tuple

46
Software-Pipelined Prefetching
  • Prologue
  • for j = 0 to N-4 do
  • tuple j+3
  • compute bucket number
  • prefetch header
  • tuple j+2
  • visit header
  • prefetch cell array
  • tuple j+1
  • visit cell array
  • prefetch matching build tuple
  • tuple j
  • visit matching build tuple
  • Epilogue