Title: Dynamic Region Selection for Thread Level Speculation
1 Dynamic Region Selection for Thread Level Speculation
- Presented by
- Jeff Da Silva
- Stanley Fung
- Martin Labrecque
Feb 6, 2004
- Builds on research done by
- Chris Colohan from CMU
- Greg Steffan
2 Multithreading on a Chip is here TODAY!
Threads of Execution
Supercomputers
3 Improving Performance with a Chip Multiprocessor
With a bunch of independent applications
[Figure labels: Applications, Execution Time, Processor, Caches]
=> improves throughput (total work per second)
4 Improving Performance with a Chip Multiprocessor
With a single application
[Figure label: Exec. Time]
=> need parallel threads to reduce execution time
5 Thread-Level Speculation: the Basic Idea
6 Support for TLS: What Do We Need?
- Break programs into speculative threads
  - to maximize thread-level parallelism
- Track data dependences
  - to determine whether speculation was safe
- Recover from failed speculation
  - to ensure correct execution
=> three key elements of every TLS system
7 Support for TLS: What Do We Need?
- Lots of research has been done on TLS hardware
  - Tracking data dependences
  - Recovering from violations
- We focus on how to select regions to run in parallel
  - A region is any segment of code that you want to speculatively parallelize
  - For this work, region = loop, iterations = speculative threads
8 Why is static region selection hard?
- Extensive profiling information
- Regions can be nested
    for ( i = 1 to N )              < 2x faster in parallel
      ...
      for ( j = 1 to N )            < 3x faster in parallel
        ...
        for ( k = 1 to N )          < 4x faster in parallel
          ...
  Which loop should we parallelize?
- Dynamic behaviour
=> Dynamic Region Selection is a potential solution
9 Dynamic Region Selection
- Compiler transforms all candidate regions into parallel and sequential versions
- Through dynamic profiling, we decide which regions are to be run in parallel
- Key Questions
  - Is there any dynamic behaviour between region instances?
  - What is a good algorithm for selecting regions?
  - Are there performance trade-offs for doing dynamic profiling?
  - Is there any dynamic behaviour within region instances? (not the focus of this research)
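The two-version scheme described above can be sketched in C as follows. This is only an illustration under stated assumptions: the names `region_t`, `run_region`, `sum_seq`, and `sum_par` are hypothetical, and `sum_par` is an ordinary function standing in for the speculatively parallelized loop, since real TLS execution needs hardware or simulator support that plain C cannot express.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature shared by both versions of a region (a loop). */
typedef long (*region_fn)(const long *data, size_t n);

/* Hypothetical descriptor: the compiler is assumed to emit both versions. */
typedef struct {
    region_fn sequential;  /* original loop */
    region_fn parallel;    /* stand-in for the TLS-parallelized loop */
    int use_parallel;      /* decision produced by dynamic profiling */
} region_t;

/* Sequential version of an example region: sum an array. */
static long sum_seq(const long *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Placeholder for the parallel version: it computes the same result in two
 * halves, but runs sequentially here (no real speculation takes place). */
static long sum_par(const long *data, size_t n) {
    size_t half = n / 2;
    return sum_seq(data, half) + sum_seq(data + half, n - half);
}

/* Runtime dispatch: run whichever version the profiler selected, falling
 * back to the sequential version when no parallel version exists. */
static long run_region(const region_t *r, const long *data, size_t n) {
    assert(r != NULL && r->sequential != NULL);
    if (r->use_parallel && r->parallel != NULL) {
        return r->parallel(data, n);
    }
    return r->sequential(data, n);
}
```

Keeping the decision in a per-region flag means the selection policy can change its mind between region instances without touching the region code itself.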
10 Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work
11 Current Compilation for TLS
LoopA
  LoopB
  EndB
  LoopC
    LoopD
    EndD
  EndC
EndA
LoopE
  LoopF
  EndF
EndE
LoopG
  LoopH
  EndH
EndG
12 DRS Compilation
LoopA LoopB EndB LoopC LoopD EndD EndC EndA LoopE LoopF EndF EndE LoopG LoopH EndH EndG
LoopA LoopB EndB LoopC LoopD EndD EndC EndA LoopE LoopF EndF EndE LoopG LoopH EndH EndG
13 DRS Compilation
14 DRS Compilation
15 DRS Compilation
16 DRS Compilation
17 DRS Compilation
=> DRS Compilation by Colohan
18 Characterizing TLS Region Behaviour
19 Characterizing TLS Region Behaviour
20 DRS Algorithms
- Sample Twice
- Continuous Monitoring
- Continuous Resample
- Path Sensitive Sampling
21 Sample Twice Algorithm
- Effective if behaviour is constant.
- When a region is encountered
  - 1st time: run sequential version and record execution time t1
  - 2nd time: run parallel version (if possible) and record execution time tp
  - Subsequent instances
    - if tp < t1 then run parallel version
    - else run sequential version
- Note that by using execution time as a metric, it is assumed that the amount of work done from instance to instance remains relatively constant. Using throughput (IPC) as a metric eliminates the need for this assumption but adds additional complexity.
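The sample-twice policy above can be written as a small state machine. The sketch below is a minimal illustration, not the authors' implementation: `sample_twice_t`, `st_choose`, and `st_record` are hypothetical names, times are assumed to arrive as plain `double` values from some timer, and the "(if possible)" case where no parallel version exists is not modelled.

```c
#include <assert.h>

typedef enum { RUN_SEQUENTIAL, RUN_PARALLEL } version_t;

/* Hypothetical per-region profiling state for the sample-twice policy. */
typedef struct {
    int instances;  /* how many times the region has been encountered */
    double t1;      /* execution time of the sequential sample */
    double tp;      /* execution time of the parallel sample */
} sample_twice_t;

/* Choose which version to run for the next instance of the region. */
static version_t st_choose(const sample_twice_t *s) {
    if (s->instances == 0) return RUN_SEQUENTIAL;  /* 1st time: sample t1 */
    if (s->instances == 1) return RUN_PARALLEL;    /* 2nd time: sample tp */
    return (s->tp < s->t1) ? RUN_PARALLEL : RUN_SEQUENTIAL;
}

/* Record the measured time of the instance that just finished. */
static void st_record(sample_twice_t *s, version_t v, double elapsed) {
    assert(elapsed >= 0.0);
    if (s->instances == 0 && v == RUN_SEQUENTIAL) {
        s->t1 = elapsed;
    } else if (s->instances == 1 && v == RUN_PARALLEL) {
        s->tp = elapsed;
    }
    /* Later instances are not measured: the decision is fixed after two. */
    s->instances++;
}
```

Because only the first two instances are timed, a later slowdown of the chosen version never changes the decision, which is why the slide calls this effective only when behaviour is constant.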
22 Sample Twice Example
23 Continuous Monitoring
- Effective if behaviour is continuously degrading.
- Extension to sample twice method. Continuously monitor all regions and reevaluate your decision if speedup changes.
- Not doing much more besides monitoring continuously -> the overhead is free.
- When a region is encountered
  - 1st time: run sequential version and record execution time t1
  - 2nd time: run parallel version (if possible) and record execution time tp
  - Subsequent instances
    - if tp < t1 then run parallel version and update tp
    - else run sequential version and update t1
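Continuous monitoring differs from sample twice only in that every instance updates the time of the version it ran. A minimal sketch, assuming times are supplied as `double` values; `monitor_t`, `cm_choose`, and `cm_record` are hypothetical names introduced for illustration.

```c
#include <assert.h>

typedef enum { RUN_SEQUENTIAL, RUN_PARALLEL } version_t;

/* Hypothetical per-region state for the continuous-monitoring policy. */
typedef struct {
    int instances;  /* how many times the region has been encountered */
    double t1;      /* most recent sequential execution time */
    double tp;      /* most recent parallel execution time */
} monitor_t;

/* Same decision rule as sample twice: sample each version once, then
 * compare the latest recorded times. */
static version_t cm_choose(const monitor_t *m) {
    if (m->instances == 0) return RUN_SEQUENTIAL;
    if (m->instances == 1) return RUN_PARALLEL;
    return (m->tp < m->t1) ? RUN_PARALLEL : RUN_SEQUENTIAL;
}

/* Unlike sample twice, every instance refreshes the time of the version
 * that actually ran. */
static void cm_record(monitor_t *m, version_t v, double elapsed) {
    assert(elapsed >= 0.0);
    if (v == RUN_PARALLEL) {
        m->tp = elapsed;
    } else {
        m->t1 = elapsed;
    }
    m->instances++;
}
```

In this sketch, once the parallel version degrades past t1 the policy switches to sequential and tp is never refreshed again, so it cannot switch back unless the sequential time itself rises above the stale tp. That matches the slide's claim that the method suits continuously degrading behaviour.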
24 Continuous Monitoring Example
25 Continuous Resample
- Effective if behaviour is continuously changing.
- Continuously resample by flushing values t1 and tp periodically.
- Adds new overhead.
- This algorithm has not yet been explored.
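Since the slides say this algorithm has not been explored, the following is only one possible reading of "flushing t1 and tp periodically": after a fixed number of instances the samples are discarded and the region is sampled again from scratch. The `period` parameter and the names `resample_t`, `cr_choose`, and `cr_record` are assumptions made for this sketch.

```c
#include <assert.h>

typedef enum { RUN_SEQUENTIAL, RUN_PARALLEL } version_t;

/* Hypothetical per-region state for a continuous-resample policy. */
typedef struct {
    int since_flush;  /* instances seen since the last flush */
    int period;       /* assumed parameter: instances between flushes (>= 2) */
    double t1;        /* sequential sample for the current period */
    double tp;        /* parallel sample for the current period */
} resample_t;

/* Sample sequential, then parallel, then stick with the faster one until
 * the next flush. */
static version_t cr_choose(const resample_t *r) {
    if (r->since_flush == 0) return RUN_SEQUENTIAL;
    if (r->since_flush == 1) return RUN_PARALLEL;
    return (r->tp < r->t1) ? RUN_PARALLEL : RUN_SEQUENTIAL;
}

/* Record the time of the instance that just ran (the caller is assumed to
 * have run the version returned by cr_choose), flushing both samples once
 * the period has elapsed. */
static void cr_record(resample_t *r, double elapsed) {
    assert(r->period >= 2 && elapsed >= 0.0);
    if (r->since_flush == 0) {
        r->t1 = elapsed;
    } else if (r->since_flush == 1) {
        r->tp = elapsed;
    }
    r->since_flush++;
    if (r->since_flush >= r->period) {
        r->since_flush = 0;  /* flush: the next instance resamples t1 */
        r->t1 = 0.0;
        r->tp = 0.0;
    }
}
```

The new overhead mentioned on the slide shows up here as one forced sequential run and one forced parallel run per period, even when the previous decision was still the right one.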
26 Path Sensitive Sampling
- If the behaviour is periodic, a means of filtering is required.
- One intuitive solution is to sample when the invocation path or region nesting path changes.
27 Path Sensitive Sampling
- Sample when region nesting path changes
- Makes the assumption that state stays the same if the invocation path does not change
    void foo() { while (cond) moo(); }
    void bar() { while (cond) moo(); }
    void moo() { while (cond) moo(); }
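One way to detect a change of invocation or nesting path is to keep a stack of region identifiers and hash it. The sketch below assumes small integer ids per function or region and a fixed-size table of seen paths. All names (`path_t`, `seen_t`, `path_enter`, `path_leave`, `path_hash`, `path_is_new`) and the table sizes are hypothetical, and two different paths that happen to hash to the same value would be treated as one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 16   /* assumed maximum nesting depth */
#define TABLE_SIZE 64  /* assumed capacity of the seen-path table */

/* Current invocation / region nesting path as a stack of ids. */
typedef struct {
    uint32_t ids[MAX_DEPTH];
    size_t depth;
} path_t;

/* Hash values of paths that have already been sampled. */
typedef struct {
    uint32_t key[TABLE_SIZE];
    int used[TABLE_SIZE];
} seen_t;

static void path_enter(path_t *p, uint32_t id) {
    assert(p->depth < MAX_DEPTH);
    p->ids[p->depth++] = id;
}

static void path_leave(path_t *p) {
    assert(p->depth > 0);
    p->depth--;
}

/* FNV-1a-style hash, mixing one 32-bit id per step. */
static uint32_t path_hash(const path_t *p) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < p->depth; i++) {
        h ^= p->ids[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the current path has not been seen before (the region
 * should be sampled again) and records it; returns 0 otherwise.
 * Uses linear probing over the fixed-size table. */
static int path_is_new(seen_t *s, const path_t *p) {
    uint32_t h = path_hash(p);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        size_t slot = (h + probe) % TABLE_SIZE;
        if (!s->used[slot]) {
            s->used[slot] = 1;
            s->key[slot] = h;
            return 1;
        }
        if (s->key[slot] == h) {
            return 0;
        }
    }
    return 1;  /* table full: conservatively treat the path as new */
}
```

With ids for `foo`, `bar`, and `moo`, the loop in `moo` reached through `foo` and the same loop reached through `bar` produce different hashes, so each path gets its own sampling decision.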
28 Results: Static analysis
[Figure: average number of per-path instances for all regions]
29 Interesting Region in IJPEG
[Figure: number of speculative threads per region instance, plotted over program execution]
30 Interesting Region in Perl
[Figure: number of instructions per region instance, plotted over program execution]
31 Experimental Framework
- SPEC benchmarks
- TLS compiler
- MIPS architecture
- TLS profiler and simulator
32 Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work
33
=> Is there any dynamic behavior between region instances?
34 Results: Dynamic behavior
=> Regions with high coverage have low instruction variance between instances
35 Results: Dynamic behavior
=> Regions with high coverage have low violation variance between instances
36 Results: Dynamic behavior
=> Regions with high coverage have low speculative thread count variance between instances
37
=> What is a good algorithm for selecting regions?
38
[Figure axis labels: slower, faster]
=> Continuous monitoring is 1% better on average than sample twice
=> About 10% worse than static optimal selection
39
=> How often did we agree with the optimal selection?
40
=> Sample twice agrees 57% of the time, on average
=> Continuous monitoring agrees 43% of the time, on average
=> Levels of agreement are close => no dynamic behavior?
41
=> Agreeing with static optimal gives better performance?
=> Another sign of no dynamic behaviour?
42
=> Sample twice often leaves regions undecided
=> Overall, undecided regions represent low coverage
43 Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work
44 Conclusions
- This is an unexplored research topic (as far as we know)
- Is there any dynamic behavior between region instances?
  - We have good indications that there isn't tons of it
- What is the best algorithm for selecting regions?
  - Continuous monitoring does 1% better than sample twice
  - Within 10% of the static optimal without any sampling done!
- Any performance trade-offs for doing dynamic profiling?
  - The code size is increased by at most 30%
  - The runtime performance overhead is believed to be negligible
- Is there any dynamic behavior within a region instance?
  - We don't know yet
45 Open Questions
- The dynamic optimal is the theoretical optimal
  - How close are we to the dynamic optimal?
  - How close is the static optimal to the dynamic optimal?
- How do the other proposed algorithms perform?
- What should be implemented in hardware/software?
46
47 AUXILIARY SLIDES
48 Results: Potential Study
Execution time versus invocation (IJPEG)
49 Results: Potential Study
Execution time versus invocation (CRAFTY)
50 Results: Potential Study
Execution time versus invocation (LI)
51 Results: Potential Study
Execution time versus invocation (PERL)
52 Results: Static analysis
53 Results: Dynamic behavior
54 Results: Dynamic behavior
55 Results: Dynamic behavior