Dynamic Region Selection for Thread Level Speculation

1
Dynamic Region Selection for Thread Level
Speculation
  • Presented by
    • Jeff Da Silva
    • Stanley Fung
    • Martin Labrecque

Feb 6, 2004
  • Builds on research done by
    • Chris Colohan from CMU
    • Greg Steffan

2
Multithreading on a Chip is here TODAY!
Threads of Execution
Supercomputers
3
Improving Performance with a Chip Multiprocessor
With a bunch of independent applications
(Figure: applications, execution time, processor, caches)
→ improves throughput (total work per second)
4
Improving Performance with a Chip Multiprocessor
With a single application
?
Exec. Time
→ need parallel threads to reduce execution time
5
Thread-Level Speculation: the Basic Idea
6
Support for TLS: What Do We Need?
  • Break programs into speculative threads
    • to maximize thread-level parallelism
  • Track data dependences
    • to determine whether speculation was safe
  • Recover from failed speculation
    • to ensure correct execution

→ three key elements of every TLS system
7
Support for TLS: What Do We Need?
  • Lots of research has been done on TLS hardware
    • Tracking data dependences
    • Recovering from violations
  • We focus on how to select regions to run in
    parallel
  • A region is any segment of code that you want to
    speculatively parallelize
  • For this work, region = loop, iterations =
    speculative threads

8
Why is static region selection hard?
  • Extensive profiling information
  • Regions can be nested
    for ( i = 1 to N )          < 2x faster in parallel
      .
      for ( j = 1 to N )        < 3x faster in parallel
        .
        for ( k = 1 to N )      < 4x faster in parallel
          .

  • Which loop should we parallelize?
  • Dynamic behaviour

→ Dynamic Region Selection is a potential solution
9
Dynamic Region Selection
  • Compiler transforms all candidate regions into
    parallel and sequential versions
  • Through dynamic profiling, we decide which
    regions are to be run in parallel
  • Key Questions
  • Is there any dynamic behaviour between region
    instances?
  • What is a good algorithm for selecting regions?
  • Are there performance trade-offs for doing
    dynamic profiling?
  • Is there any dynamic behaviour within region
    instances? (not the focus of this research)
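
The two-version idea above can be sketched as a small dispatch table. This is only an illustration under assumptions: the names `drs_region_t`, `drs_invoke`, `sum_sequential` and `sum_parallel` are hypothetical, and the "parallel" body here is an ordinary function standing in for real TLS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Both versions of one candidate region share a signature (hypothetical). */
typedef long (*region_fn)(int n);

typedef struct {
    region_fn sequential;    /* plain loop */
    region_fn parallel;      /* TLS version; NULL if none was generated */
    bool      use_parallel;  /* decision made by the dynamic profiler */
} drs_region_t;

/* Stand-in sequential body: computes 1 + 2 + ... + n with a loop. */
long sum_sequential(int n)
{
    long s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Stand-in for the parallel body. A real TLS version would run loop
   iterations as speculative threads; this one just returns the same sum. */
long sum_parallel(int n)
{
    return (long)n * (n + 1) / 2;
}

/* Run whichever version the profiler currently prefers. */
long drs_invoke(const drs_region_t *r, int n)
{
    assert(r != NULL && r->sequential != NULL);
    if (r->use_parallel && r->parallel != NULL)
        return r->parallel(n);
    return r->sequential(n);
}
```

The selection algorithms discussed later in the talk would be responsible for setting `use_parallel`; the sequential version is always available as a fallback.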

10
Outline
  • The role of the TLS compiler
  • Characterizing dynamic behaviour
  • Dynamic Region Selection (DRS) algorithms
  • Results
  • Conclusions
  • Open questions and future work

11
Current Compilation for TLS
  • LoopA
  • LoopB
  • EndB
  • LoopC
  • LoopD
  • EndD
  • EndC
  • EndA
  • LoopE
  • LoopF
  • EndF
  • EndE
  • LoopG
  • LoopH
  • EndH
  • EndG

12
DRS Compilation
LoopA LoopB EndB LoopC LoopD EndD EndC EndA LoopE LoopF EndF EndE LoopG LoopH EndH EndG
LoopA LoopB EndB LoopC LoopD EndD EndC EndA LoopE LoopF EndF EndE LoopG LoopH EndH EndG
13
DRS Compilation
14
DRS Compilation
15
DRS Compilation
16
DRS Compilation
17
DRS Compilation
→ DRS Compilation by Colohan
18
Characterizing TLS Region Behaviour
19
Characterizing TLS Region Behaviour
20
DRS Algorithms
  • Sample Twice
  • Continuous Monitoring
  • Continuous Resample
  • Path Sensitive Sampling

21
Sample Twice Algorithm
  • Effective if behaviour is constant.
  • When a region is encountered
    • 1st time: run sequential version and record
      execution time t1
    • 2nd time: run parallel version (if possible) and
      record execution time tp
  • Subsequent instances
    • if tp < t1 then run parallel version
    • else run sequential version
  • Note that by using execution time as a metric, it
    is assumed that the amount of work done from
    instance to instance remains relatively constant.
    Using throughput (IPC) as a metric eliminates the
    need for this assumption but adds additional
    complexity.
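
The sample twice steps can be sketched in a few lines of C. This is a minimal illustration, not the presenters' implementation: `region_t`, `st_run_parallel` and `st_record` are hypothetical names, execution time is passed in by the caller, and the "(if possible)" case where no parallel version exists is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-region bookkeeping. */
typedef struct {
    int    samples;  /* instances timed so far: 0, 1 or 2 */
    double t1;       /* execution time of the sequential sample */
    double tp;       /* execution time of the parallel sample */
} region_t;

/* Which version should the next instance run? true = parallel. */
bool st_run_parallel(const region_t *r)
{
    assert(r != NULL);
    if (r->samples == 0) return false;  /* 1st time: sequential sample */
    if (r->samples == 1) return true;   /* 2nd time: parallel sample */
    return r->tp < r->t1;               /* afterwards: decision is fixed */
}

/* Record the measured time of the instance that just finished. */
void st_record(region_t *r, double elapsed)
{
    assert(r != NULL);
    if (r->samples == 0) {
        r->t1 = elapsed;
        r->samples = 1;
    } else if (r->samples == 1) {
        r->tp = elapsed;
        r->samples = 2;
    }
    /* later instances are not sampled in this scheme */
}
```

A region starts as `region_t r = {0, 0.0, 0.0};`. The caller asks `st_run_parallel(&r)` before each instance and calls `st_record(&r, elapsed)` after it. Because only two samples are ever taken, the decision never changes once made.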

22
Sample Twice Example
23
Continuous Monitoring
  • Effective if behaviour is continuously degrading.
  • Extension to sample twice method. Continuously
    monitor all regions and reevaluate your decision
    if speedup changes.
  • Does little beyond monitoring continuously -> the
    overhead is essentially free.
  • When a region is encountered
    • 1st time: run sequential version and record
      execution time t1
    • 2nd time: run parallel version (if possible) and
      record execution time tp
  • Subsequent instances
    • if tp < t1 then run parallel version and update
      tp
    • else run sequential version and update t1
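
A minimal sketch of continuous monitoring, under the same assumptions as before (hypothetical names `cm_region_t`, `cm_run_parallel`, `cm_record`; timing supplied by the caller). The only change from sample twice is that every instance refreshes the time of whichever version it ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-region bookkeeping. */
typedef struct {
    int    samples;  /* 0 or 1 while sampling, 2 once both times exist */
    double t1;       /* most recent sequential execution time */
    double tp;       /* most recent parallel execution time */
} cm_region_t;

/* Which version should the next instance run? true = parallel. */
bool cm_run_parallel(const cm_region_t *r)
{
    assert(r != NULL);
    if (r->samples == 0) return false;  /* 1st time: sequential */
    if (r->samples == 1) return true;   /* 2nd time: parallel */
    return r->tp < r->t1;               /* re-evaluated on every instance */
}

/* Update the time of the version that just ran. */
void cm_record(cm_region_t *r, bool ran_parallel, double elapsed)
{
    assert(r != NULL);
    if (ran_parallel)
        r->tp = elapsed;
    else
        r->t1 = elapsed;
    if (r->samples < 2)
        r->samples++;
}
```

Note one consequence of this sketch: once a region falls back to the sequential version, tp is no longer refreshed, so the region only returns to parallel execution if t1 itself grows past the stale tp. That matches the slide's remark that the scheme suits behaviour that is continuously degrading.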

24
Continuous Monitoring Example
25
Continuous Resample
  • Effective if behaviour is continuously changing.
  • Continuously resample by flushing values t1 and
    tp periodically.
  • Adds new overhead.
  • This algorithm has not yet been explored.
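
Since the slide says this algorithm has not been explored, the following is only one possible reading of "flushing t1 and tp periodically". The flush period, counted in region instances, and the names `cr_region_t`, `cr_run_parallel` and `cr_record` are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-region bookkeeping. */
typedef struct {
    int      samples;      /* 0 or 1 while sampling, 2 once decided */
    double   t1;           /* sequential execution time */
    double   tp;           /* parallel execution time */
    unsigned since_flush;  /* instances seen since the last flush */
    unsigned period;       /* flush after this many instances (must be > 2) */
} cr_region_t;

/* Which version should the next instance run? true = parallel. */
bool cr_run_parallel(const cr_region_t *r)
{
    assert(r != NULL);
    if (r->samples == 0) return false;  /* sequential sample */
    if (r->samples == 1) return true;   /* parallel sample */
    return r->tp < r->t1;
}

/* Record a finished instance and flush the samples when the period ends. */
void cr_record(cr_region_t *r, bool ran_parallel, double elapsed)
{
    assert(r != NULL && r->period > 2);
    if (ran_parallel)
        r->tp = elapsed;
    else
        r->t1 = elapsed;
    if (r->samples < 2)
        r->samples++;
    if (++r->since_flush >= r->period) {
        r->samples = 0;      /* forget t1 and tp: the next two instances resample */
        r->since_flush = 0;
    }
}
```

The new overhead mentioned on the slide shows up here as the forced sequential run after every flush, even for regions that are clearly better in parallel.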

26
Path Sensitive Sampling
  • If the behaviour is periodic, a means of
    filtering is required.
  • One intuitive solution is to sample when the
    invocation path or region nesting path changes.

27
Path Sensitive Sampling
  • Sample when region nesting path changes
  • Makes the assumption that state stays the same if
    the invocation path does not change

void foo() { while (cond) moo(); }
void bar() { while (cond) moo(); }
void moo() { while (cond) moo(); }
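
One way to detect that the region nesting path has changed, as in the foo/bar/moo example, is to hash the stack of enclosing region ids and compare it with the hash seen at the last sample. The sketch below assumes integer region ids and uses an FNV-1a style hash; `path_hash`, `ps_region_t` and `ps_enter` are hypothetical names, and a hash collision would make two different paths look identical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a style hash over the enclosing region ids, outermost first. */
uint32_t path_hash(const int *path, size_t depth)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < depth; i++) {
        h ^= (uint32_t)path[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical per-region state. */
typedef struct {
    bool     seen;       /* has this region been entered before? */
    uint32_t last_path;  /* hash of the path at the last (re)sample */
    int      samples;    /* sampling progress; reset when the path changes */
} ps_region_t;

/* Call on region entry. Returns true if sampling must restart. */
bool ps_enter(ps_region_t *r, const int *path, size_t depth)
{
    assert(r != NULL);
    uint32_t h = path_hash(path, depth);
    bool changed = !r->seen || h != r->last_path;
    if (changed) {
        r->seen = true;
        r->last_path = h;
        r->samples = 0;  /* earlier t1/tp measurements are discarded */
    }
    return changed;
}
```

With this, the loop in moo() would be resampled when it is reached through bar() after having been sampled through foo(), but not on repeated calls along the same path.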
28
Results: Static analysis
Average number of per-path instances for all
regions
29
Interesting Region in IJPEG
(Plot: number of speculative threads per region instance, over program execution)
30
Interesting Region in Perl
(Plot: number of instructions per region instance, over program execution)
31
Experimental Framework
  • SPEC benchmarks
  • TLS compiler
  • MIPS architecture
  • TLS profiler and simulator

32
Outline
  • The role of the TLS compiler
  • Characterizing dynamic behaviour
  • Dynamic Region Selection (DRS) algorithms
  • Results
  • Conclusions
  • Open questions and future work

33
  • Is there any dynamic behavior between region
    instances?

34
Results: Dynamic behavior
→ Regions with high coverage have low instruction
variance between instances
35
Results: Dynamic behavior
→ Regions with high coverage have low violation
variance between instances
36
Results: Dynamic behavior
→ Regions with high coverage have low speculative
thread count variance between instances
37
  • What is a good algorithm for selecting regions?

38
slower
faster
→ Continuous monitoring is 1% better on average than
sample twice
→ About 10% worse than static optimal selection
39
  • How often did we agree with the optimal
    selection?

40
→ Sample twice agrees 57% of the time, on average
→ Continuous monitoring agrees 43% of the time, on
average
→ Levels of agreement are close → no dynamic
behavior?
41
→ Agreeing with static optimal gives better
performance?
→ Another sign of no dynamic behaviour?
42
→ Sample twice often leaves regions undecided
→ Overall, undecided regions represent low coverage
43
Outline
  • The role of the TLS compiler
  • Characterizing dynamic behaviour
  • Dynamic Region Selection (DRS) algorithms
  • Results
  • Conclusions
  • Open questions and future work

44
Conclusions
  • This is an unexplored research topic (as far as
    we know)
  • Is there any dynamic behavior between region
    instances?
    • We have good indications that there isn't much
      of it
  • What is the best algorithm for selecting
    regions?
    • Continuous monitoring does 1% better than sample
      twice
    • Within 10% of the static optimal without any
      sampling done!
  • Any performance trade-offs for doing dynamic
    profiling?
    • The code size is increased by at most 30%
    • The runtime performance overhead is believed to
      be negligible
  • Is there any dynamic behavior within a region
    instance?
    • We don't know yet

45
Open Questions
  • The dynamic optimal is the theoretical optimal
  • How close are we to the dynamic optimal?
  • How close is the static optimal to the dynamic
    optimal?
  • How do the other proposed algorithms perform?
  • What should be implemented in hardware/software?

46
  • Questions?

47
AUXILIARY SLIDES
48
Results: Potential Study
Execution time versus invocation (IJPEG)
49
Results: Potential Study
Execution time versus invocation (CRAFTY)
50
Results: Potential Study
Execution time versus invocation (LI)
51
Results: Potential Study
Execution time versus invocation (PERL)
52
Results: Static analysis
53
Results: Dynamic behavior
54
Results: Dynamic behavior
55
Results: Dynamic behavior