Saisanthosh Balakrishnan Guri Sohi - PowerPoint PPT Presentation

About This Presentation

Title:

Saisanthosh Balakrishnan Guri Sohi

Slides: 51

Provided by: SaiBalak4

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

1
Program Demultiplexing: Data-flow based
Speculative Parallelization

Saisanthosh Balakrishnan, Guri Sohi
University of Wisconsin-Madison

2
Speculative Parallelization

Construct threads from sequential program
Loops, methods, ...
Execute threads speculatively
Hardware support to enforce program order
Application domain
Irregularly parallel
Importance now
Single-core performance gains are incremental

3
Speculative Parallelization Execution

Execution model
Fork threads in program order for execution
Commit tasks in that order

Control-flow Speculative Parallelization

Limitation
Reaching distant parallelism

4
Outline

Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

5
Program Demultiplexing Framework

Trigger
Begins execution of Handler
Handler
Setup execution, parameters
Demultiplexed execution
Speculative
Stored in Execution Buffer
At call site
Search EB for execution
Dependence violations
Invalidate executions

[Figure: PD execution of M()]

6
Program Demultiplexing Highlights

Method granularity
Well defined
Parameters
Stack for local communication
Trigger forks execution
Means for reaching distant method
Different from call site
Independent speculative executions
No control dependence with other executions
Triggers lead to unordered execution
Not according to program order

7
Outline

Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

8
Example: 175.vpr, update_bb ()

    ..
    x_from = block[b_from].x;
    y_from = block[b_from].y;
    find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
    ..
    ..
    for (k = 0; k < num_nets_affected; k++) {
        inet = nets_to_update[k];
        if (net_block_moved[k] == FROM_AND_TO)
            continue;
        ..
        if (net[inet].num_pins <= SMALL_NET)
            get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
        else if (net_block_moved[k] == FROM)
            update_bb (inet, &bb_coord_new[bb_index],
                       &bb_edge_new[bb_index],
                       x_from, y_from, x_to, y_to);    /* Call Site 1 */
        else
            update_bb (inet, &bb_coord_new[bb_index],
                       &bb_edge_new[bb_index],
                       x_to, y_to, x_from, y_from);    /* Call Site 2 */
        ..
        ..
        bb_index++;
    }

9
Handlers

Provides parameters to execution
Achieves separation of call site and execution
Handler code
Slice of dependent instructions from call site
Many variants possible

update_bb (inet, &bb_coord_new[bb_index],
           &bb_edge_new[bb_index], x_from,
           y_from, x_to, y_to)

10
Handlers: Example

(Same 175.vpr code fragment as on slide 8.)

11
Triggers

Fork demultiplexed execution
Usually when method and handler are ready
i.e. when data dependencies satisfied
Begins execution of the handler

12
Identifying Triggers

Generate memory profile
Identify trigger point
Collect for many executions
Good coverage
Represent trigger points
Use instruction attributes
PCs, Memory write address

[Figure: Sequential Exec.; Program state for H + M; Handler; M()]

13
Triggers: Example

(Same 175.vpr code fragment as on slide 8.)

T1
T2
Minimum of 400 cycles
90 cycles per execution
14
Handlers Example (2)
H2
H1
(Same 175.vpr code fragment as on slide 8.)

T1
T2
15
Outline

Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

16
Hardware Support Outline

Support for triggers
Demultiplexed execution
Maintaining executions
Storage
Invalidation
Committing

17
Support for Triggers

Triggers are registered with hardware
ISA extensions
Similar to debug watchpoints
Evaluation of triggers
Only by Committed instructions
PC, address
Fast lookup with filters
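
The trigger check described on this slide can be sketched in software as a table lookup done for each committed instruction. This is only an illustration under stated assumptions: the `Trigger` record, its fields, and `match_trigger` are hypothetical names, and the linear scan stands in for the fast filtered lookup the hardware would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trigger record (not from the slides). */
typedef struct {
    uint64_t pc;         /* PC of the instruction that fires the trigger */
    int      has_addr;   /* nonzero if the memory write address must match too */
    uint64_t addr;       /* memory write address to match when has_addr is set */
    int      handler_id; /* handler to begin executing when the trigger fires */
} Trigger;

/* Called for committed instructions only. Returns the handler id of the
 * first registered trigger that matches, or -1 if none matches. */
int match_trigger(const Trigger *table, size_t n,
                  uint64_t pc, uint64_t store_addr)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].pc != pc)
            continue;
        if (!table[i].has_addr || table[i].addr == store_addr)
            return table[i].handler_id;
    }
    return -1;
}
```

A caller would invoke `match_trigger` at commit time and fork the handler when the result is not -1.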

18
Demultiplexed Execution

Hardware: Typical MP system
Private cache for speculative data
Extend cache line with access bit
Misses serviced by Main processor
No communication with other executions
On completion
Collect read set (R)
Accessed lines
Collect write set (W)
Dirty lines
Invalidate write set in cache
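
The completion steps above can be sketched as a single pass over the private cache. This is a rough model, not the hardware design: `CacheLine`, `LineSet`, `SET_MAX`, and `collect_sets` are hypothetical names, and the sketch assumes the access bit is set by any load or store to the line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX 64  /* assumed capacity of a read or write set */

/* Hypothetical model of one private-cache line. */
typedef struct {
    uint64_t tag;      /* line address */
    int      valid;
    int      accessed; /* the extra access bit from the slide */
    int      dirty;
} CacheLine;

typedef struct {
    uint64_t lines[SET_MAX];
    size_t   n;
} LineSet;

/* On completion: accessed lines form the read set R, dirty lines form
 * the write set W, and the write set is invalidated in the cache. */
void collect_sets(CacheLine *cache, size_t n, LineSet *r, LineSet *w)
{
    assert(r != NULL && w != NULL);
    r->n = 0;
    w->n = 0;
    for (size_t i = 0; i < n; i++) {
        if (!cache[i].valid)
            continue;
        if (cache[i].accessed && r->n < SET_MAX)
            r->lines[r->n++] = cache[i].tag;
        if (cache[i].dirty) {
            if (w->n < SET_MAX)
                w->lines[w->n++] = cache[i].tag;
            cache[i].valid = 0;   /* speculative data must not stay visible */
        }
        cache[i].accessed = 0;    /* reset for the next execution */
    }
}
```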

[Figure: processors P0 to P3, each with a private cache C]

19
Execution buffer pool

Holds speculative executions
Execution entry contains
Read and write set
Parameters and return value
Alternatives
Use cache
May be more efficient
Similar to other proposals
Not the focus in this paper

20
Invalidating Executions

For a committed store address
Search Read and Write sets
Invalidate matching executions
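
As a sketch of the invalidation rule above, the function below checks a committed store against the read and write sets of every buffered execution. The `Execution` layout, `SET_MAX`, and the 64-byte line size (`LINE_SHIFT`) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX    32 /* assumed capacity per set */
#define LINE_SHIFT 6  /* assumed 64-byte cache lines */

/* Hypothetical execution buffer entry holding line addresses. */
typedef struct {
    uint64_t read_set[SET_MAX];
    size_t   n_read;
    uint64_t write_set[SET_MAX];
    size_t   n_write;
    int      valid;
} Execution;

static int contains(const uint64_t *set, size_t n, uint64_t line)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == line)
            return 1;
    return 0;
}

/* Invalidates every execution whose read or write set holds the line
 * touched by a committed store. Returns the number invalidated. */
size_t invalidate_on_store(Execution *eb, size_t n, uint64_t store_addr)
{
    assert(eb != NULL || n == 0);
    uint64_t line = store_addr >> LINE_SHIFT;
    size_t killed = 0;
    for (size_t i = 0; i < n; i++) {
        if (!eb[i].valid)
            continue;
        if (contains(eb[i].read_set, eb[i].n_read, line) ||
            contains(eb[i].write_set, eb[i].n_write, line)) {
            eb[i].valid = 0;
            killed++;
        }
    }
    return killed;
}
```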

21
Using Executions

For a given call site
Search method name, parameters
Get write and read set
Commit
If accessed by program
Use
If accessed by another method
Nested methods
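
The call-site search can be sketched as a lookup keyed by method name and parameter values. The `Entry` layout, `MAX_PARAMS`, and `find_execution` are hypothetical; the sketch also assumes that a miss simply means the method runs normally at the call site, which the slides do not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARAMS 4  /* assumed upper bound on recorded parameters */

/* Hypothetical execution buffer entry as seen from the call site. */
typedef struct {
    const char *method;            /* method name */
    long        params[MAX_PARAMS];/* parameter values used speculatively */
    size_t      n_params;
    long        ret;               /* return value of the execution */
    int         valid;             /* cleared when the entry is invalidated */
} Entry;

/* Returns the valid entry whose method and parameters match the call,
 * or NULL when no usable execution is buffered. */
const Entry *find_execution(const Entry *eb, size_t n, const char *method,
                            const long *params, size_t n_params)
{
    assert(method != NULL && n_params <= MAX_PARAMS);
    for (size_t i = 0; i < n; i++) {
        if (!eb[i].valid || eb[i].n_params != n_params)
            continue;
        if (strcmp(eb[i].method, method) != 0)
            continue;
        if (memcmp(eb[i].params, params, n_params * sizeof(long)) == 0)
            return &eb[i];
    }
    return NULL;
}
```

On a hit, the caller would commit the entry's write set and use `ret` in place of executing the method.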

22
Outline

Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

23
Reaching distant parallelism
[Figure: timeline with Fork, Call Site, and M(); intervals A and B]

24
Performance evaluation

Performance benefits limited by
Methods in program
Handler implementation

25
Summary of other results (refer to paper)

Method sizes
10s to 1000s of instructions. Lower 100s usually
Demultiplexed execution overheads
Common case: 1.1x to 2.0x
Trigger points
1 to 3. Outliers exist (macro usage)
Handler length
10 to 50 instructions average
Cache lines
Read 20s, Written 10s
Demultiplexed execution
Held for an average of 100s of cycles

26
Conclusions

Method granularity
Exploit modularity in program
Trigger and handler to allow earliest execution
Data-flow based
Unordered execution
Reach distant parallelism
Orthogonal to other speculative parallelization
Use to further speed up demultiplexed execution

27
Backup
28
Average trigger points in call site

Small set of trigger points for a given call site
Defines reachability from trigger to the call site

29
Evaluation

Full-system execution-based simulator
Intel x86 ISA and Virtutech Simics
4-wide out-of-order processors
64 KB Level 1 caches (2 cycles), 1 MB Level 2 (12 cycles)
MSI coherence
Software toolchain
Modified gcc compiler and lancet tool
Debugging information, CFG, program dependence graph
Simulator based memory profile
Generates triggers and handlers
No mis-speculations occur

30
Reaching distant parallelism
A: Cycles between Fork and Call Site
[Figure: interval A, M()]

31
Execution Buffer Entries
900 590 70 520 413 244 160 308
Avg. Cycles Held

Storage requirements
Max case 284 KB
Minimize entries by better scheduling

32
Read and write set
Cache lines written
Cache lines read
33
Demultiplexed execution overheads
Execution Time Overhead

Overheads due to
Handler
Cache misses due to demultiplexed execution
Common case
Between 1.1x and 2.0x
Small methods → High overheads

34
Length of handlers
14 10 9 100 16 4 40 4
Handler Instruction Count Overhead
35
Method sizes
36
Methods

Benchmark:       crafty gap gzip mcf parser twolf vortex vpr
Methods:         24 16 9 8 12 10 11 11
Call sites:      59 27 9 84 26 106 20 [one value missing]
Exec. time (%):  85 90 51 30 55 92 88 99

Runtime includes frequently called methods

37
Loop-level Parallelization

Unit: Loop iterations
Live-ins from
P-slice
Similar to handler
Fork instruction
Restricted
Same basic block level, method
Program order dependent
Ordered forking

Mitosis
fork loop

endl
38
Method-level parallelization

Unit: Method continuations
Program after the method returns
Orthogonal to PD

Method-level
call
M() ret

39
Reaching distant parallelism
M1()
B A
B
M2()
A
crafty gap gzip mcf parser twolf vortex vpr
60 72 30 80 70 40 63 47
B A > 1 (%)
40
Reaching distant parallelism
B: Call time to earliest execution time (1 outstanding)
M1()
C
B
C / B R1 CNo params/C R2
M2()
A
41
Issues with Stack