Transcript and Presenter's Notes

Title: Improving Database Performance on Simultaneous Multithreading Processors
1
Improving Database Performance on Simultaneous Multithreading Processors
  • Jingren Zhou, Microsoft Research (jrzhou@microsoft.com)
  • John Cieslewicz, Columbia University (johnc@cs.columbia.edu)
  • Kenneth A. Ross, Columbia University (kar@cs.columbia.edu)
  • Mihir Shah, Columbia University (ms2604@columbia.edu)
2
Simultaneous Multithreading (SMT)
  • Available on modern CPUs
  • Hyperthreading on Pentium 4 and Xeon.
  • IBM POWER5
  • Sun UltraSparc IV
  • Challenge: Design software to efficiently utilize
    SMT.
  • This talk: Database software

[Image: Intel Pentium 4 with Hyperthreading]
3
Superscalar Processor (no SMT)
[Diagram: one instruction stream issuing over time to a superscalar pipeline (up to 2 instructions/cycle); CPI ≈ 3/4]
  • Improved instruction level parallelism

4
SMT Processor
[Diagram: two interleaved instruction streams sharing the pipeline over time; CPI ≈ 5/8]
  • Improved thread level parallelism
  • More opportunities to keep the processor busy
  • But sometimes SMT does not work so well

5
Stalls
[Diagram: instruction stream 2 stalls while instruction stream 1 continues to issue; CPI ≈ 3/4, progress despite the stalled thread]
Stalls are caused by cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.
6
Memory Consistency
[Diagram: the two instruction streams access a common cache line; when the conflicting access is detected, the pipeline is flushed and the cache is synchronized with RAM]
MOMC (Memory Order Machine Clear) event on the Pentium 4: 300-350 cycles.
7
SMT Processor
  • Exposes multiple logical CPUs (one per
    instruction stream)
  • One physical CPU (about 5% extra silicon to
    duplicate thread state information)
  • Better than single threading
  • Increased thread-level parallelism
  • Improved processor utilization when one thread
    blocks
  • Not as good as two physical CPUs
  • CPU resources are shared, not replicated

8
SMT Challenges
  • Resource Competition
  • Shared Execution Units
  • Shared Cache
  • Thread Coordination
  • Locking, etc. has high overhead
  • False Sharing
  • MOMC Events

9
Approaches to using SMT
  • Ignore it, and write single threaded code.
  • Naïve parallelism
  • Pretend the logical CPUs are physical CPUs
  • SMT-aware parallelism
  • Parallel threads designed to avoid SMT-related
    interference
  • Use one thread for the algorithm, and another to
    manage resources
  • E.g., to avoid stalls for cache misses

10
Naïve Parallelism
  • Treat SMT processor as if it is multi-core
  • Databases are already designed to utilize multiple
    processors, so no code modification is needed
  • Uses shared processor resources inefficiently
  • Cache Pollution / Interference
  • Competition for execution units

11
SMT-Aware Parallelism
  • Exploit intra-operator parallelism
  • Divide input and use a separate thread to process
    each part
  • E.g., one thread for even tuples, one for odd
    tuples.
  • Explicit partitioning step not required.
  • Sharing input involves multiple readers
  • No MOMC events, because two reads don't conflict

12
SMT-Aware Parallelism (cont.)
  • Sharing output is challenging
  • Thread coordination for output
  • read/write and write/write conflicts on common
    cache lines (MOMC Events)
  • Solution: Partition the output
  • Each thread writes to separate memory buffer to
    avoid memory conflicts
  • Need an extra merge step in the consumer of the
    output stream
  • Difficult to maintain input order in the output

13
Managing Resources for SMT
  • Cache misses are a well-known performance
    bottleneck for modern database systems
  • Mainly L2 data cache misses, but also L1
    instruction cache misses [Ailamaki et al. 98]
  • Goal: Use a helper thread to avoid cache misses
    in the main thread
  • load future memory references into the cache
  • explicit load, not a prefetch

14
Data Dependency
  • Memory references that depend upon a previous
    memory access exhibit a data dependency
  • E.g., a hash table lookup

[Diagram: a tuple probing a hash table; each pointer dereference depends on the previous load]
15
Data Dependency (cont.)
  • Data dependencies make instruction level
    parallelism harder
  • Modern architectures provide prefetch
    instructions.
  • Request that data be brought into the cache
  • Non-blocking
  • Pitfalls
  • Prefetch instructions are frequently dropped
  • Difficult to tune
  • Too much prefetching can pollute the cache

16
Staging Computation
  • Preload A.
  • (other work)
  • Process A.
  • Preload B.
  • (other work)
  • Process B.
  • Preload C.
  • (other work)
  • Process C.
  • Preload Tuple.
  • (other work)
  • Process Tuple.

[Diagram: a tuple probes the hash buckets and then the overflow cells; assumes each element is a cache line]
17
Staging Computation (cont.)
  • By overlapping memory latency with other work,
    some cache miss latency can be hidden.
  • Many probes in flight at the same time.
  • Algorithms need to be rewritten.
  • E.g., [Chen et al. 2004], [Harizopoulos et al.
    2004].

18
Work-Ahead Set: Main Thread
  • Writes a (memory address, computation state) pair
    to the work-ahead set
  • Retrieves a previously stored pair and resumes
    that computation
  • Hope that helper thread can preload data before
    retrieval by the main thread
  • Correct whether or not helper thread succeeds at
    preloading data
  • helper thread is read-only

19
Work-ahead Set Data Structure
[Diagram: an array of (state, address) slots; the main thread writes a new entry into a slot and takes out the entry previously stored there]
20
Work-ahead Set Data Structure
[Diagram: the main thread continues, filling further slots with (state, address) entries]
21
Work-Ahead Set: Helper Thread
  • Reads memory addresses from the work-ahead set,
    and loads their contents
  • Data becomes cache resident
  • Tries to preload data before main thread cycles
    around
  • If successful, main thread experiences cache hits

22
Work-ahead Set Data Structure
[Diagram: the work-ahead set holds (state, address) entries; the helper thread copies the entry in slot i into a temporary (temp ← slot[i]) and loads the data at that address]
23
Iterate Backwards!
[Diagram: the helper thread visits the slots in reverse order: i ← (i - 1) mod size]
Why? See the paper.
24
Helper Thread Speed
  • If the helper thread is faster than the main thread
  • More computation than memory latency
  • Helper thread should not preload twice (wasted
    CPU cycles)
  • See paper for how to stop redundant loads
  • If the helper thread is slower
  • No special tuning necessary
  • Main thread will absorb some cache misses

25
Work-Ahead Set Size
  • Too large: Cache pollution
  • Preloaded data evicts other preloaded data before
    it can be used
  • Too small: Thread contention
  • Many MOMC events because the work-ahead set spans
    few cache lines
  • Just right: Experimentally determined
  • But use the smallest size within the acceptable
    range (performance plateaus), so that cache space
    is available for other purposes (for us, 128
    entries)
  • Data structure itself much smaller than L2 cache

26
Experimental Workload
  • Two Operators
  • Probe phase of Hash Join
  • CSB Tree Index Join
  • Operators run in isolation and in parallel
  • Intel VTune used to measure hardware events

27
Experimental Outline
  • Hash join
  • Index lookup
  • Mixed Hash join and index lookup

28
Hash Join: Comparative Performance
29
Hash Join: L2 Cache Misses Per Tuple
30
CSB Tree Index Join: Comparative Performance
31
CSB Tree Index Join: L2 Cache Misses Per Tuple
32
Parallel Operator Performance
33
Parallel Operator Performance
34
Conclusion