Title: Program Demultiplexing: Data-flow based Speculative Parallelization
1 Program Demultiplexing: Data-flow based Speculative Parallelization
- Saisanthosh Balakrishnan, Guri Sohi
- University of Wisconsin-Madison
2 Speculative Parallelization
- Construct threads from sequential program
- Loops, methods,
- Execute threads speculatively
- Hardware support to enforce program order
- Application domain
- Irregularly parallel
- Importance now
- Single-core performance gains are incremental
3 Speculative Parallelization Execution
- Execution model
- Fork threads in program order for execution
- Commit tasks in that order
Control-flow Speculative Parallelization
- Limitation
- Reaching distant parallelism
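The in-order commit rule of this execution model can be sketched in a few lines of C. This is only an illustration, not the hardware discussed in the talk: the `done` array and the `advance_commit` name are hypothetical, and each array element stands for one speculative task in fork (program) order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tasks are indexed in fork order, which is also program order.
 * done[i] is true once speculative task i has finished executing.
 * Returns the new commit head: every task before it has committed. */
size_t advance_commit(const bool *done, size_t n_tasks, size_t head)
{
    assert(head <= n_tasks);
    /* A finished task may commit only after all earlier tasks have. */
    while (head < n_tasks && done[head])
        head++;
    return head;
}
```

In this sketch, tasks that finish early simply wait behind an unfinished predecessor, which is one way to picture why ordered forking and committing makes distant parallelism hard to reach.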
4 Outline
- Program Demultiplexing Overview
- Program Demultiplexing Execution Model
- Hardware Support
- Evaluation
5 Program Demultiplexing Framework
- Trigger
- Begins execution of Handler
- Handler
- Setup execution, parameters
- Demultiplexed execution
- Speculative
- Stored in Execution Buffer
- At call site
- Search EB for execution
- Dependence violations
- Invalidate executions
M()
PD Execution
6 Program Demultiplexing Highlights
- Method granularity
- Well defined
- Parameters
- Stack for local communication
- Trigger forks execution
- Means for reaching distant method
- Different from call site
- Independent speculative executions
- No control dependence with other executions
- Triggers lead to unordered execution
- Not according to program order
7 Outline
- Program Demultiplexing Overview
- Program Demultiplexing Execution Model
- Hardware Support
- Evaluation
8 Example: 175.vpr, update_bb()
    ..
    x_from = block[b_from].x;
    y_from = block[b_from].y;
    find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
    ..
    ..
    for (k = 0; k < num_nets_affected; k++) {
        inet = nets_to_update[k];
        if (net_block_moved[k] == FROM_AND_TO)
            continue;
        ..
        if (net[inet].num_pins < SMALL_NET)
            get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
        else if (net_block_moved[k] == FROM)
            update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
                       x_from, y_from, x_to, y_to);
        else
            update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
                       x_to, y_to, x_from, y_from);
        ..
        ..
        bb_index++;
    }
Call Site 2
Call Site 1
9 Handlers
- Provides parameters to execution
- Achieves separation of call site and execution
- Handler code
- Slice of dependent instructions from call site
- Many variants possible
update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
           x_from, y_from, x_to, y_to)
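One way to picture a handler is as a small straight-line function that recomputes the arguments of the call from program state, so the method can start away from its call site. The sketch below is an assumption-laden illustration, not the handler code generated by the authors' tools: the `update_bb_args` struct, the `x_old`/`x_new` field names, and both function names are hypothetical, and `bb_index` stands in for the pair `&bb_coord_new[bb_index]`, `&bb_edge_new[bb_index]`.

```c
#include <assert.h>

/* Hypothetical record of the arguments one update_bb() call needs. */
struct update_bb_args {
    int inet;
    int bb_index;       /* selects the bb_coord_new / bb_edge_new slots */
    int x_old, y_old;   /* assumed meaning of the 4th and 5th arguments */
    int x_new, y_new;   /* assumed meaning of the 6th and 7th arguments */
};

/* Handler for the call taken when net_block_moved[k] == FROM. */
struct update_bb_args handler_moved_from(const int *nets_to_update, int k,
                                         int bb_index,
                                         int x_from, int y_from,
                                         int x_to, int y_to)
{
    assert(nets_to_update != 0 && k >= 0);
    struct update_bb_args a;
    a.inet = nets_to_update[k];   /* inet = nets_to_update[k] */
    a.bb_index = bb_index;
    a.x_old = x_from; a.y_old = y_from;
    a.x_new = x_to;   a.y_new = y_to;
    return a;
}

/* Handler for the other call: same slice, coordinates swapped. */
struct update_bb_args handler_moved_to(const int *nets_to_update, int k,
                                       int bb_index,
                                       int x_from, int y_from,
                                       int x_to, int y_to)
{
    assert(nets_to_update != 0 && k >= 0);
    struct update_bb_args a;
    a.inet = nets_to_update[k];
    a.bb_index = bb_index;
    a.x_old = x_to;   a.y_old = y_to;
    a.x_new = x_from; a.y_new = y_from;
    return a;
}
```

Keeping one handler per call site avoids putting the call-site branches into the handler; other variants are possible, as the slide notes.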
10 Handlers Example
[Code excerpt from slide 8 (the update_bb() call sites in 175.vpr), repeated.]
11 Triggers
- Fork demultiplexed execution
- Usually when method and handler are ready
- i.e., when data dependencies are satisfied
- Begins execution of the handler
12 Identifying Triggers
- Generate memory profile
- Identify trigger point
- Collect for many executions
- Good coverage
- Represent trigger points
- Use instruction attributes
- PCs, Memory write address
Sequential Exec.
Program state for H+M
Handler
M()
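A rough way to read the profiling step: once the last store that produces state read by the handler and method has committed, their inputs are ready, so that store is a candidate trigger point. The C sketch below encodes this reading as an assumption; the slides do not spell out the algorithm. `store_event`, `find_trigger`, and the flat address arrays are hypothetical names for one profiled run.

```c
#include <assert.h>
#include <stddef.h>

/* One committed store seen in the sequential run before the call site. */
struct store_event {
    unsigned long pc;     /* PC of the store instruction */
    unsigned long addr;   /* memory address it wrote */
};

/* Scan the trace backwards and return the index of the last store whose
 * address belongs to the read set of handler + method, or -1 if no store
 * in the trace touches that read set. */
long find_trigger(const struct store_event *trace, size_t n_stores,
                  const unsigned long *read_set, size_t n_reads)
{
    assert(trace != NULL || n_stores == 0);
    assert(read_set != NULL || n_reads == 0);
    for (size_t i = n_stores; i-- > 0; ) {
        for (size_t j = 0; j < n_reads; j++) {
            if (trace[i].addr == read_set[j])
                return (long)i;
        }
    }
    return -1;
}
```

The PC and address of the returned store are the instruction attributes that would represent the trigger; repeating this over many executions is what gives coverage.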
13 Triggers Example
[Code excerpt from slide 8 (the update_bb() call sites in 175.vpr), repeated.]
T1
T2
Minimum of 400 cycles
90 cycles per execution
14 Handlers Example (2)
H2
H1
[Code excerpt from slide 8 (the update_bb() call sites in 175.vpr), repeated.]
T1
T2
15 Outline
- Program Demultiplexing Overview
- Program Demultiplexing Execution Model
- Hardware Support
- Evaluation
16 Hardware Support Outline
- Support for triggers
- Demultiplexed execution
- Maintaining executions
- Storage
- Invalidation
- Committing
17 Support for Triggers
- Triggers are registered with hardware
- ISA extensions
- Similar to debug watchpoints
- Evaluation of triggers
- Only by committed instructions
- PC, address
- Fast lookup with filters
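The watchpoint-like behaviour can be modelled in software as a small table of registered triggers plus a bit filter that rejects most committed instructions cheaply. This is a sketch under stated assumptions, not the proposed ISA extension: the table size, the hash, and the names `trigger_table`, `register_trigger`, and `check_commit` are hypothetical, and a trigger is assumed to fire only when both PC and store address match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRIGGERS 64
#define FILTER_BITS  256

struct trigger { uint64_t pc; uint64_t addr; int handler_id; };

struct trigger_table {
    struct trigger entries[MAX_TRIGGERS];
    size_t count;
    uint8_t filter[FILTER_BITS / 8];   /* one bit per hash slot */
};

/* Map (pc, addr) to a filter slot in the range 0..255. */
static unsigned filter_slot(uint64_t pc, uint64_t addr)
{
    uint64_t h = (pc * 0x9E3779B97F4A7C15ULL) ^ (addr * 0xC2B2AE3D27D4EB4FULL);
    return (unsigned)(h >> 56);
}

/* Register a trigger; the table must start zero-initialized.
 * Returns 0 on success, -1 if the table is full. */
int register_trigger(struct trigger_table *tt, uint64_t pc, uint64_t addr,
                     int handler_id)
{
    assert(tt != NULL);
    if (tt->count == MAX_TRIGGERS)
        return -1;
    tt->entries[tt->count++] = (struct trigger){ pc, addr, handler_id };
    unsigned s = filter_slot(pc, addr);
    tt->filter[s / 8] |= (uint8_t)(1u << (s % 8));
    return 0;
}

/* Evaluate one committed store. Returns the handler to start, or -1. */
int check_commit(const struct trigger_table *tt, uint64_t pc, uint64_t addr)
{
    assert(tt != NULL);
    unsigned s = filter_slot(pc, addr);
    if (!(tt->filter[s / 8] & (1u << (s % 8))))
        return -1;                      /* common case: filter miss */
    for (size_t i = 0; i < tt->count; i++) {
        if (tt->entries[i].pc == pc && tt->entries[i].addr == addr)
            return tt->entries[i].handler_id;
    }
    return -1;                          /* filter false positive */
}
```

Checking only committed instructions means a squashed, wrong-path store can never start a handler in this model.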
18 Demultiplexed Execution
- Hardware: typical MP system
- Private cache for speculative data
- Extend cache line with access bit
- Misses serviced by Main processor
- No communication with other executions
- On completion
- Collect read set (R)
- Accessed lines
- Collect write set (W)
- Dirty lines
- Invalidate write set in cache
[Figure: four processors, P0 to P3, each with a private cache (C)]
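The completion step can be sketched as a single pass over the private cache: accessed lines form the read set, dirty lines form the write set, and dirty lines are then invalidated. The C below is a simplified software model under assumptions: `cache_line`, `rw_sets`, and `collect_sets` are hypothetical names, and only line addresses are collected, whereas a real write set would also carry the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cache_line {
    uint64_t line_addr;
    bool valid;
    bool accessed;   /* the access bit added to each line */
    bool dirty;
};

struct rw_sets { size_t n_read; size_t n_write; };

/* read_set and write_set must each have room for n_lines entries. */
struct rw_sets collect_sets(struct cache_line *cache, size_t n_lines,
                            uint64_t *read_set, uint64_t *write_set)
{
    assert(cache != NULL && read_set != NULL && write_set != NULL);
    struct rw_sets out = { 0, 0 };
    for (size_t i = 0; i < n_lines; i++) {
        struct cache_line *l = &cache[i];
        if (!l->valid)
            continue;
        if (l->accessed)
            read_set[out.n_read++] = l->line_addr;     /* R: accessed lines */
        if (l->dirty) {
            write_set[out.n_write++] = l->line_addr;   /* W: dirty lines */
            l->valid = false;   /* invalidate the write set in the cache */
            l->dirty = false;
        }
    }
    return out;
}
```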
19 Execution buffer pool
- Holds speculative executions
- Execution entry contains
- Read and write set
- Parameters and return value
- Alternatives
- Use cache
- May be more efficient
- Similar to other proposals
- Not the focus in this paper
20 Invalidating Executions
- For a committed store address
- Search Read and Write sets
- Invalidate matching executions
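The invalidation rule above can be sketched as a search over the buffered executions for every committed store. The layout is an assumption made for illustration: `eb_entry`, the fixed-size sets, and the 64-byte line size (`LINE_SHIFT`) are hypothetical, and the sets hold cache-line numbers rather than byte addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX    32
#define LINE_SHIFT 6   /* assumed 64-byte cache lines */

struct eb_entry {
    bool valid;
    uint64_t read_set[SET_MAX];
    size_t n_read;
    uint64_t write_set[SET_MAX];
    size_t n_write;
};

static bool set_contains(const uint64_t *set, size_t n, uint64_t line)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == line)
            return true;
    return false;
}

/* Invalidate every buffered execution that read or wrote the line hit by
 * a committed store. Returns the number of executions invalidated. */
size_t invalidate_on_store(struct eb_entry *pool, size_t n_entries,
                           uint64_t store_addr)
{
    assert(pool != NULL || n_entries == 0);
    uint64_t line = store_addr >> LINE_SHIFT;
    size_t killed = 0;
    for (size_t i = 0; i < n_entries; i++) {
        struct eb_entry *e = &pool[i];
        if (!e->valid)
            continue;
        if (set_contains(e->read_set, e->n_read, line) ||
            set_contains(e->write_set, e->n_write, line)) {
            e->valid = false;
            killed++;
        }
    }
    return killed;
}
```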
21 Using Executions
- For a given call site
- Search method name, parameters
- Get write and read set
- Commit
- If accessed by program
- Use
- If accessed by another method
- Nested methods
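The call-site lookup and the commit path can be sketched as follows. This covers only the case where the program itself reaches the call site; use by another (nested) demultiplexed execution is left out. All names here (`exec_entry`, `mem_write`, `eb_lookup`, `eb_commit`) and the representation of the write set as address/value pairs are assumptions, with an integer `method_id` standing in for the method name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARAMS 8
#define MAX_WRITES 16

struct mem_write { long *addr; long value; };

struct exec_entry {
    bool valid;
    int method_id;                  /* stands in for the method name */
    long params[MAX_PARAMS];
    size_t n_params;
    long return_value;
    struct mem_write writes[MAX_WRITES];
    size_t n_writes;
};

/* At a call site: find a valid execution of this method whose recorded
 * parameters match the actual ones. Returns NULL when there is none. */
struct exec_entry *eb_lookup(struct exec_entry *pool, size_t n_entries,
                             int method_id, const long *params,
                             size_t n_params)
{
    assert(pool != NULL || n_entries == 0);
    for (size_t i = 0; i < n_entries; i++) {
        struct exec_entry *e = &pool[i];
        if (!e->valid || e->method_id != method_id || e->n_params != n_params)
            continue;
        size_t j = 0;
        while (j < n_params && e->params[j] == params[j])
            j++;
        if (j == n_params)
            return e;
    }
    return NULL;
}

/* Commit: apply the buffered writes to memory, release the entry, and
 * hand back the return value in place of running the method. */
long eb_commit(struct exec_entry *e)
{
    assert(e != NULL && e->valid);
    for (size_t i = 0; i < e->n_writes; i++)
        *e->writes[i].addr = e->writes[i].value;
    e->valid = false;
    return e->return_value;
}
```

When `eb_lookup` returns NULL, the program would simply execute the method at the call site as usual.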
22 Outline
- Program Demultiplexing Overview
- Program Demultiplexing Execution Model
- Hardware Support
- Evaluation
23 Reaching distant parallelism
[Figure: Fork, execution of M(), and Call Site on a timeline, with intervals A and B]
24 Performance evaluation
- Performance benefits limited by
- Methods in program
- Handler implementation
25 Summary of other results (refer to paper)
- Method sizes
- 10s to 1000s of instructions; usually in the low 100s
- Demultiplexed execution overheads
- Common case: 1.1x to 2.0x
- Trigger points
- 1 to 3 per call site; outliers exist (macro usage)
- Handler length
- 10 to 50 instructions on average
- Cache lines
- Read: 20s; written: 10s
- Demultiplexed execution
- Held for an average of 100s of cycles
26 Conclusions
- Method granularity
- Exploit modularity in program
- Trigger and handler to allow earliest execution
- Data-flow based
- Unordered execution
- Reach distant parallelism
- Orthogonal to other speculative parallelization
- Use them to further speed up demultiplexed execution
27 Backup
28 Average trigger points in call site
- Small set of trigger points for a given call site
- Defines reachability from trigger to the call site
29 Evaluation
- Full-system execution-based simulator
- Intel x86 ISA and Virtutech Simics
- 4-wide out-of-order processors
- 64K Level 1 caches (2 cycle), 1 MB Level 2 (12 cycle)
- MSI coherence
- Software toolchain
- Modified gcc-compiler and lancet tool
- Debugging information, CFG, program dependence graph
- Simulator based memory profile
- Generates triggers and handlers
- No mis-speculations occur
30 Reaching distant parallelism
A: Cycles between Fork and Call Site
A
M()
31 Execution Buffer Entries
Avg. cycles held: 900 590 70 520 413 244 160 308
- Storage requirements
- Max case: 284 KB
- Minimize entries by better scheduling
32 Read and write set
Cache lines written
Cache lines read
33 Demultiplexed execution overheads
Execution Time Overhead
- Overheads due to
- Handler
- Cache misses due to demultiplexed execution
- Common case
- Between 1.1x and 2.0x
- Small methods → high overheads
34 Length of handlers
Handler instruction count overhead: 14 10 9 100 16 4 40 4
35 Method sizes
36 Methods
Benchmark       crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
Methods             24   16     9    8      12     10      11   11
Exec. time (%)      85   90    51   30      55     92      88   99
Call sites (seven values in the source): 59 27 9 84 26 106 20
- Runtime includes frequently called methods
37 Loop-level Parallelization
- Unit: loop iterations
- Live-ins from
- P-slice
- Similar to handler
- Fork instruction
- Restricted
- Same basic block level, method
- Program order dependent
- Ordered forking
Mitosis
fork loop
endl
38 Method-level parallelization
- Unit: method continuations
- Program after the method returns
- Orthogonal to PD
Method-level
call
M() ret
39 Reaching distant parallelism
[Figure: timeline with M1(), M2(), and intervals A and B]
Benchmark       crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
B / A > 1 (%)       60   72    30   80      70     40      63   47
40 Reaching distant parallelism
B: Call time to earliest execution time (1 outstanding)
M1()
C
B
C / B R1 CNo params/C R2
M2()
A
41 Issues with Stack
- Stack pointer is position dependent
- Handler has to insert parameters at right position
- Same stack addresses denote different variables
- Affects triggers
- Different stack pointers in program and execution
- Stack may be discarded
- Committing requires relocation of stack results
- Example: parameters passed by reference
42 Benchmarks
- SPECint2000 benchmarks
- C programs
- Did not evaluate gcc, perl, bzip2, and eon
- No intention of creating concurrency
- No specific or clean programming style
- Many methods perform several tasks
- May have fewer opportunities
43 Hardware System
- Intel x86 simulation
- Virtutech Simics based full-system, Bochs decoder
- 4 processors at 3 GHz
- Simple memory system
- Micro-architecture model
- 4-wide out-of-order without cracking into micro-ops
- Branch predictors
- 32K L1 (2-cycle), 1 MB L2 (12-cycle)
- MSI coherence, 15-cycle cache-to-cache communication
- Infinite Execution buffer pool
44 Software
- Modified gcc-compiler tool chain and lancet tool
- Extract from compiled binary
- Debugging information
- CFG, Program Dependence Graph
- Software
- Dynamic information from simulator
- Generates handler and trigger for each call site as encountered
- Control flow in handler not included (ongoing work)
- Perfect control transfer from trigger to method
- Handler doesn't execute if a branch leads to not calling the method
45 Generating Handlers
- Cannot easily identify and demarcate code
- Heuristic to demarcate
- Terminate when load address is from heap
- Handler has
- Loads and stores to stack
- No stores to heap
- Limitation
- Heuristic; doesn't always work
46 Generating Handlers
- 1. Specify parameters to method
- Pushed onto stack by program
- Introduces dependency
- Prevents separation
- 2. Computing parameters
- Program performs it near call site
- Need to identify the code
- Deal with
- Use of stack
- Control-flow
- Inter-method dependence
    G = F(N)
    if ()
        X = G <op> 2
    else
        X = G <op> 2
    M(X)
47 Control-flow in Handlers
- Depends on call site's control flow
- Handler for D
- Call site in C(), BB 3
- Include Loop
- BB 4 to BB 1
- Include Branch
- Branch in BB 1
- Inclusion depends on trigger
- Multiple iterations, different triggers
- Ongoing work
CFG (C), Call Graph
D
48 Other dependencies in Handlers
- C calls D, A or B calls C
- Dependence (X) extends
- May need multiple handlers
- If multiple call sites
Call Graph
A(X)
C (X)
D(X)
B(X)
49 Buffering Handler Writes
- General case
- Writes in handler to be buffered
- Provided to execution
- Discarded after execution
- Current implementation
- Only stack writes
[Figure: processors P1 to P3, each with a cache (C), and the execution buffer (EB)]
50 Methods for Speculative Execution
- Well encapsulated
- Defined by parameters and return value
- Stack for local computation
- Heap for global state
- Often performs specific tasks
- Access limited global state
- Limits side-effects