Title: Improving Database Performance on Simultaneous Multithreading Processors
Slide 1: Improving Database Performance on Simultaneous Multithreading Processors
- Jingren Zhou, Microsoft Research, jrzhou@microsoft.com
- John Cieslewicz, Columbia University, johnc@cs.columbia.edu
- Kenneth A. Ross, Columbia University, kar@cs.columbia.edu
- Mihir Shah, Columbia University, ms2604@columbia.edu
Slide 2: Simultaneous Multithreading (SMT)
- Available on modern CPUs
  - Hyperthreading on Pentium 4 and Xeon
  - IBM POWER5
  - Sun UltraSparc IV
- Challenge: design software to efficiently utilize SMT.
- This talk: database software
[Image: Intel Pentium 4 with Hyperthreading]
Slide 3: Superscalar Processor (no SMT)
[Diagram: one instruction stream issued over time on a superscalar pipeline (up to 2 instructions/cycle); CPI = 3/4]
- Improved instruction-level parallelism
Slide 4: SMT Processor
[Diagram: two interleaved instruction streams sharing the pipeline over time; CPI = 5/8?]
- Improved thread-level parallelism
- More opportunities to keep the processor busy
- But sometimes SMT does not work so well...
Slide 5: Stalls
[Diagram: two instruction streams over time; instruction stream 1 stalls while instruction stream 2 continues; CPI = 3/4? Progress despite the stalled thread.]
- Stalls are due to cache misses (200-300 cycles for an L2 cache miss), branch mispredictions (20-30 cycles), etc.
Slide 6: Memory Consistency
[Diagram: two instruction streams; a conflicting access to a common cache line is detected, forcing a pipeline flush and a cache sync with RAM]
- Memory Order Machine Clear (MOMC) event on the Pentium 4 (300-350 cycles)
Slide 7: SMT Processor
- Exposes multiple logical CPUs (one per instruction stream)
- One physical CPU (5% extra silicon to duplicate thread state information)
- Better than single threading
  - Increased thread-level parallelism
  - Improved processor utilization when one thread blocks
- Not as good as two physical CPUs
  - CPU resources are shared, not replicated
Slide 8: SMT Challenges
- Resource competition
  - Shared execution units
  - Shared cache
- Thread coordination
  - Locking, etc. has high overhead
- False sharing
  - MOMC events
Slide 9: Approaches to Using SMT
- Ignore it, and write single-threaded code.
- Naïve parallelism
  - Pretend the logical CPUs are physical CPUs
- SMT-aware parallelism
  - Parallel threads designed to avoid SMT-related interference
- Use one thread for the algorithm, and another to manage resources
  - E.g., to avoid stalls for cache misses
Slide 10: Naïve Parallelism
- Treat the SMT processor as if it were multi-core
  - Databases are already designed to utilize multiple processors - no code modification
- Uses shared processor resources inefficiently
  - Cache pollution / interference
  - Competition for execution units
Slide 11: SMT-Aware Parallelism
- Exploit intra-operator parallelism
  - Divide the input and use a separate thread to process each part
  - E.g., one thread for even tuples, one for odd tuples. No explicit partitioning step is required.
- Sharing input involves multiple readers
  - No MOMC events, because two reads don't conflict
Slide 12: SMT-Aware Parallelism (cont.)
- Sharing output is challenging
  - Thread coordination for output
  - Read/write and write/write conflicts on common cache lines (MOMC events)
- Solution: partition the output
  - Each thread writes to a separate memory buffer to avoid memory conflicts
  - Needs an extra merge step in the consumer of the output stream
  - Difficult to maintain input order in the output
- A minimal sketch of the even/odd input split with thread-private output buffers follows.
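The sketch below illustrates the scheme of slides 11-12, assuming a C++ implementation with std::thread; the Tuple and HashTable types and the probe loop are illustrative, not the paper's code. Each thread strides over the shared input by two (even vs. odd tuples) and appends matches to its own output buffer, so the two logical CPUs never write to a common cache line.

```cpp
// Sketch of the even/odd split (illustrative names only).
#include <cstddef>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { int key; int payload; };

// Build-side hash table; read-only during the probe phase, so both threads
// can share it without MOMC events.
using HashTable = std::unordered_map<int, Tuple>;

void probe_partition(const HashTable& ht, const std::vector<Tuple>& input,
                     std::size_t start, std::vector<Tuple>& local_out) {
    // Stride-2 access: thread 0 probes even tuples, thread 1 odd tuples,
    // so no explicit partitioning step is needed.
    for (std::size_t i = start; i < input.size(); i += 2) {
        auto it = ht.find(input[i].key);
        if (it != ht.end()) local_out.push_back(input[i]);   // thread-private buffer
    }
}

void smt_aware_probe(const HashTable& ht, const std::vector<Tuple>& input) {
    std::vector<Tuple> out_even, out_odd;   // separate buffers: no write/write conflicts
    std::thread t0(probe_partition, std::cref(ht), std::cref(input), 0, std::ref(out_even));
    std::thread t1(probe_partition, std::cref(ht), std::cref(input), 1, std::ref(out_odd));
    t0.join();
    t1.join();
    // The consumer must merge out_even and out_odd (the extra merge step above).
}
```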
Slide 13: Managing Resources for SMT
- Cache misses are a well-known performance bottleneck for modern database systems
  - Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. '98]
- Goal: use a helper thread to avoid cache misses in the main thread
  - Load future memory references into the cache
  - Explicit load, not a prefetch
Slide 14: Data Dependency
- Memory references that depend upon a previous memory access exhibit a data dependency
  - E.g., a hash table lookup (a minimal sketch follows)
[Diagram: a tuple probes a hash table; each access depends on the previous one]
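A minimal sketch of the dependency, assuming a chained hash table in C++; the Cell layout and lookup() function are hypothetical, not from the paper. Each load cannot be issued until the previous one completes, which is what limits instruction-level parallelism.

```cpp
// Illustrative chained hash table lookup showing the dependent loads.
#include <cstddef>

struct Cell { int key; int payload; Cell* next; };

int lookup(Cell* const* buckets, std::size_t nbuckets, int key) {
    // Each load depends on the previous one: the bucket head cannot be read
    // until the hash is computed, and each overflow cell cannot be read until
    // the previous cell's `next` pointer has arrived from memory.
    for (Cell* c = buckets[static_cast<std::size_t>(key) % nbuckets]; c != nullptr; c = c->next)
        if (c->key == key) return c->payload;
    return -1;   // not found
}
```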
Slide 15: Data Dependency (cont.)
- Data dependencies make instruction-level parallelism harder
- Modern architectures provide prefetch instructions
  - Request that data be brought into the cache
  - Non-blocking
- Pitfalls (see the sketch after this list)
  - Prefetch instructions are frequently dropped
  - Difficult to tune
  - Too much prefetching can pollute the cache
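A hedged sketch of software prefetching on the probe loop, using the x86 _mm_prefetch intrinsic; PREFETCH_DIST, the Tuple/Cell layout, and the loop structure are illustrative assumptions. It shows the pitfalls directly: the prefetch is only a hint (it may be dropped), the distance must be tuned, and only the bucket directory entry can be prefetched because the overflow cells are dependent loads.

```cpp
// Software prefetching a fixed distance ahead of the current probe.
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>
#include <vector>

struct Cell  { int key; int payload; Cell* next; };
struct Tuple { int key; int payload; };

constexpr std::size_t PREFETCH_DIST = 8;   // hypothetical; must be tuned

void probe_with_prefetch(Cell* const* buckets, std::size_t nbuckets,
                         const std::vector<Tuple>& input, std::vector<int>& out) {
    for (std::size_t i = 0; i < input.size(); ++i) {
        // Hint only: the hardware may drop it. Only the bucket directory entry
        // can be prefetched here; the overflow cells are dependent loads.
        if (i + PREFETCH_DIST < input.size()) {
            std::size_t b = static_cast<std::size_t>(input[i + PREFETCH_DIST].key) % nbuckets;
            _mm_prefetch(reinterpret_cast<const char*>(&buckets[b]), _MM_HINT_T0);
        }
        for (Cell* c = buckets[static_cast<std::size_t>(input[i].key) % nbuckets];
             c != nullptr; c = c->next)
            if (c->key == input[i].key) out.push_back(c->payload);
    }
}
```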
Slide 16: Staging Computation
- Preload A. (other work) Process A.
- Preload B. (other work) Process B.
- Preload C. (other work) Process C.
- Preload Tuple. (other work) Process Tuple.
[Diagram: hash buckets, overflow cells, and a probing tuple. Assumes each element is a cache line.]
Slide 17: Staging Computation (cont.)
- By overlapping memory latency with other work, some cache miss latency can be hidden.
- Many probes in flight at the same time.
- Algorithms need to be rewritten.
  - E.g., Chen et al. 2004; Harizopoulos et al. 2004.
- A minimal staged-probe sketch follows.
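A minimal sketch of staging a batch of hash probes, assuming the same illustrative chained hash table; GROUP and the two-stage split are assumptions, and the cited algorithms stage each dependent dereference rather than just the bucket head. The point is that issuing all the preloads before doing any dependent work keeps many probes in flight at once.

```cpp
// Staged computation over a group of probes: issue all bucket preloads first,
// then do the dependent chain walks.
#include <xmmintrin.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct Cell  { int key; int payload; Cell* next; };
struct Tuple { int key; int payload; };

constexpr std::size_t GROUP = 16;   // hypothetical group size

void staged_probe(Cell* const* buckets, std::size_t nbuckets,
                  const std::vector<Tuple>& input, std::vector<int>& out) {
    std::size_t bucket_of[GROUP];
    for (std::size_t base = 0; base < input.size(); base += GROUP) {
        std::size_t n = std::min(GROUP, input.size() - base);
        // Stage 1: compute bucket numbers and preload the bucket heads
        // ("Preload A, B, C, ..." with other work in between).
        for (std::size_t j = 0; j < n; ++j) {
            bucket_of[j] = static_cast<std::size_t>(input[base + j].key) % nbuckets;
            _mm_prefetch(reinterpret_cast<const char*>(&buckets[bucket_of[j]]), _MM_HINT_T0);
        }
        // Stage 2: by the time we return to tuple j, its bucket is (hopefully)
        // cache resident, so many probes are effectively in flight at once.
        for (std::size_t j = 0; j < n; ++j)
            for (Cell* c = buckets[bucket_of[j]]; c != nullptr; c = c->next)
                if (c->key == input[base + j].key) out.push_back(c->payload);
    }
}
```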
Slide 18: Work-Ahead Set (Main Thread)
- Writes a memory address and its computation state to the work-ahead set
- Retrieves a previous address and state
- Hope: the helper thread can preload the data before retrieval by the main thread
- Correct whether or not the helper thread succeeds at preloading data
  - The helper thread is read-only
Slides 19-20: Work-Ahead Set Data Structure
[Diagram: the main thread posts (state, address) entries into the work-ahead set and later retrieves them.]
Slide 21: Work-Ahead Set (Helper Thread)
- Reads memory addresses from the work-ahead set, and loads their contents
  - Data becomes cache resident
- Tries to preload data before the main thread cycles around
  - If successful, the main thread experiences cache hits
Slide 22: Work-Ahead Set Data Structure
[Diagram: the helper thread walks the (state, address) slots, loading each posted address into a temporary (temp = slot[i]) so the data becomes cache resident.]
Slide 23: Iterate Backwards!
[Diagram: the helper thread visits slots in the opposite direction to the main thread, i = (i - 1) mod size.]
- Why? See the paper.
- A combined sketch of the main-thread and helper-thread loops follows.
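A hedged sketch of the work-ahead set of slides 18-23, assuming a C++ ring buffer of (state, address) slots; the Slot layout, the int state, the ResumeFn callback, and the relaxed atomics are illustrative assumptions (the paper's structure needs no locking, since the helper thread is read-only and the main thread is correct whether or not the preload succeeded). The main thread advances forward around the ring; the helper thread iterates backwards.

```cpp
// Ring buffer of (state, address) slots shared by the main and helper threads.
#include <atomic>
#include <cstddef>

struct Slot {
    int state = 0;                                 // where the computation left off
    std::atomic<const char*> address{nullptr};     // data the main thread will need
};

constexpr std::size_t WORK_AHEAD_SIZE = 128;       // size used in the experiments (slide 25)
Slot work_ahead[WORK_AHEAD_SIZE];

// Main thread: post the next (state, address) pair into slot i and resume the
// entry previously stored there; i advances forward around the ring.
template <typename ResumeFn>
void main_thread_step(std::size_t& i, int new_state, const char* new_addr,
                      ResumeFn resume) {
    int old_state = work_ahead[i].state;
    const char* old_addr = work_ahead[i].address.load(std::memory_order_relaxed);
    work_ahead[i].state = new_state;
    work_ahead[i].address.store(new_addr, std::memory_order_relaxed);
    if (old_addr != nullptr)
        resume(old_state, old_addr);               // hopefully a cache hit by now
    i = (i + 1) % WORK_AHEAD_SIZE;
}

// Helper thread: read each posted address and touch it so the data becomes
// cache resident -- an explicit load, not a prefetch. It iterates backwards
// (i = (i - 1) mod size), opposite to the main thread, as on slide 23.
void helper_thread(const std::atomic<bool>& done) {
    std::size_t i = 0;
    volatile char sink = 0;                        // keep the loads from being optimized away
    while (!done.load(std::memory_order_relaxed)) {
        const char* p = work_ahead[i].address.load(std::memory_order_relaxed);
        if (p != nullptr) sink = *p;
        i = (i + WORK_AHEAD_SIZE - 1) % WORK_AHEAD_SIZE;
    }
}
```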
Slide 24: Helper Thread Speed
- If the helper thread is faster than the main thread
  - More computation than memory latency
  - The helper thread should not preload twice (wasted CPU cycles)
  - See the paper for how to stop redundant loads
- If the helper thread is slower
  - No special tuning necessary
  - The main thread will absorb some cache misses
Slide 25: Work-Ahead Set Size
- Too large: cache pollution
  - Preloaded data evicts other preloaded data before it can be used
- Too small: thread contention
  - Many MOMC events because the work-ahead set spans few cache lines
- Just right: experimentally determined
  - But use the smallest size within the acceptable range (performance plateaus), so that cache space is available for other purposes (for us, 128 entries)
  - The data structure itself is much smaller than the L2 cache
Slide 26: Experimental Workload
- Two operators
  - Probe phase of hash join
  - CSB+-tree index join
- Operators run in isolation and in parallel
- Intel VTune used to measure hardware events
Slide 27: Experimental Outline
- Hash join
- Index lookup
- Mixed hash join and index lookup
Slide 28: Hash Join - Comparative Performance [chart]
Slide 29: Hash Join - L2 Cache Misses Per Tuple [chart]
Slide 30: CSB+-Tree Index Join - Comparative Performance [chart]
Slide 31: CSB+-Tree Index Join - L2 Cache Misses Per Tuple [chart]
Slide 32: Parallel Operator Performance [chart]
Slide 33: Parallel Operator Performance (cont.) [chart]
Slide 34: Conclusion