Saisanthosh Balakrishnan Guri Sohi - PowerPoint PPT Presentation

Slides: 51
Provided by: SaiBalak4

Transcript and Presenter's Notes

Title: Saisanthosh Balakrishnan Guri Sohi


1
Program Demultiplexing: Data-flow based Speculative Parallelization
  • Saisanthosh Balakrishnan, Guri Sohi
  • University of Wisconsin-Madison

2
Speculative Parallelization
  • Construct threads from sequential program
  • Loops, methods,
  • Execute threads speculatively
  • Hardware support to enforce program order
  • Application domain
  • Irregularly parallel
  • Importance now
  • Single-core performance gains are incremental

3
Speculative Parallelization Execution
  • Execution model
  • Fork threads in program order for execution
  • Commit tasks in that order

Control-flow Speculative Parallelization
  • Limitation
  • Reaching distant parallelism

4
Outline
  • Program Demultiplexing Overview
  • Program Demultiplexing Execution Model
  • Hardware Support
  • Evaluation

5
Program Demultiplexing Framework
  • Trigger
  • Begins execution of Handler
  • Handler
  • Setup execution, parameters
  • Demultiplexed execution
  • Speculative
  • Stored in Execution Buffer
  • At call site
  • Search EB for execution
  • Dependence violations
  • Invalidate executions

[Figure: PD execution of M()]
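
The framework on this slide can be read as a small protocol: a trigger starts the handler and method early, the result waits in the execution buffer (EB), the call site searches the EB, and a dependence violation invalidates entries. The C sketch below is a software analogy only, assuming a method with a single integer parameter. The names `pd_trigger`, `pd_call`, `pd_invalidate`, `EB_SIZE`, and `square` are hypothetical and do not come from the presentation; real executions would also carry read and write sets.

```c
#include <stdbool.h>
#include <stddef.h>

#define EB_SIZE 8   /* hypothetical execution buffer capacity */

/* One buffered speculative execution of a one-parameter method. */
typedef struct {
    bool valid;
    int  method_id;
    int  param;
    int  result;
} eb_entry;

static eb_entry eb[EB_SIZE];

/* Stand-in for the demultiplexed method M(). */
static int square(int x) { return x * x; }

/* Trigger: run the method early with the parameter the handler set up
   and store the result in a free EB entry. Returns false if the EB is full. */
bool pd_trigger(int method_id, int param)
{
    for (size_t i = 0; i < EB_SIZE; i++) {
        if (!eb[i].valid) {
            eb[i].valid = true;
            eb[i].method_id = method_id;
            eb[i].param = param;
            eb[i].result = square(param);
            return true;
        }
    }
    return false;
}

/* Call site: search the EB. On a hit, consume the buffered result;
   otherwise execute the method normally. *hit reports which path ran. */
int pd_call(int method_id, int param, bool *hit)
{
    for (size_t i = 0; i < EB_SIZE; i++) {
        if (eb[i].valid && eb[i].method_id == method_id &&
            eb[i].param == param) {
            eb[i].valid = false;
            *hit = true;
            return eb[i].result;
        }
    }
    *hit = false;
    return square(param);
}

/* Dependence violation: discard every buffered execution of the method. */
void pd_invalidate(int method_id)
{
    for (size_t i = 0; i < EB_SIZE; i++)
        if (eb[i].method_id == method_id)
            eb[i].valid = false;
}
```

A call that finds no matching entry simply falls back to normal execution, so a missed or invalidated speculation costs performance but not correctness.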
6
Program Demultiplexing Highlights
  • Method granularity
  • Well defined
  • Parameters
  • Stack for local communication
  • Trigger forks execution
  • Means for reaching distant method
  • Different from call site
  • Independent speculative executions
  • No control dependence with other executions
  • Triggers lead to unordered execution
  • Not according to program order

7
Outline
  • Program Demultiplexing Overview
  • Program Demultiplexing Execution Model
  • Hardware Support
  • Evaluation

8
Example: 175.vpr, update_bb ()

    ..
    x_from = block[b_from].x;
    y_from = block[b_from].y;
    find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
    ..
    ..
    for (k = 0; k < num_nets_affected; k++) {
        inet = nets_to_update[k];
        if (net_block_moved[k] == FROM_AND_TO)
            continue;
        ..
        if (net[inet].num_pins <= SMALL_NET)
            get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
        else if (net_block_moved[k] == FROM)
            update_bb (inet, &bb_coord_new[bb_index],
                       &bb_edge_new[bb_index], x_from, y_from, x_to, y_to);
        else
            update_bb (inet, &bb_coord_new[bb_index],
                       &bb_edge_new[bb_index], x_to, y_to, x_from, y_from);
        ..
        ..
        bb_index++;
    }

Call sites marked: Call Site 1, Call Site 2 (the two update_bb calls)
9
Handlers
  • Provides parameters to execution
  • Achieves separation of call site and execution
  • Handler code
  • Slice of dependent instructions from call site
  • Many variants possible

    update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
               x_from, y_from, x_to, y_to)
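
As a rough illustration of a handler, the sketch below packages the slice of instructions that produce the arguments of the `update_bb` call shown on this slide, so that the slice can run away from the call site. The types and the names `block_pos`, `pd_args`, and `handler_update_bb_from` are hypothetical simplifications (the real 175.vpr data structures differ), and `x_to`, `y_to`, and `bb_index` are assumed to be available already when the handler runs.

```c
#include <stddef.h>

/* Simplified stand-in for a vpr block: only the fields the slice reads. */
typedef struct { int x, y; } block_pos;

/* Arguments the handler hands to the demultiplexed update_bb execution. */
typedef struct {
    int    inet;
    size_t bb_index;   /* selects the bb_coord_new / bb_edge_new entry */
    int    x_from, y_from;
    int    x_to, y_to;
} pd_args;

/* Hypothetical handler: the backward slice from the call site that
   computes each argument. It only reads program state and writes its
   own result, so it can run ahead of the call site. */
pd_args handler_update_bb_from(const block_pos *block, int b_from,
                               const int *nets_to_update, int k,
                               size_t bb_index, int x_to, int y_to)
{
    pd_args a;
    a.inet     = nets_to_update[k];   /* inet = nets_to_update[k] */
    a.bb_index = bb_index;
    a.x_from   = block[b_from].x;     /* x_from = block[b_from].x */
    a.y_from   = block[b_from].y;     /* y_from = block[b_from].y */
    a.x_to     = x_to;
    a.y_to     = y_to;
    return a;
}
```

The second call site passes the from and to coordinates in the opposite order, so under this sketch it would need its own handler variant.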
10
Handlers Example
[Code listing repeated from slide 8: the 175.vpr loop containing the two update_bb call sites]
11
Triggers
  • Fork demultiplexed execution
  • Usually when method and handler are ready
  • i.e., when data dependencies are satisfied
  • Begins execution of the handler

12
Identifying Triggers
  • Generate memory profile
  • Identify trigger point
  • Collect for many executions
  • Good coverage
  • Represent trigger points
  • Use instruction attributes
  • PCs, Memory write address

[Figure: sequential execution timeline showing the program state for H+M, the handler, and M()]
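
One way to picture trigger identification from a memory profile: record the committed stores of a sequential run, then find the last store before the call that writes state read by the handler and method. That store's PC and address then describe the trigger point. The sketch below assumes this reading of the slide; `store_rec`, `find_trigger`, and the flat arrays are hypothetical, and a real profile would be merged over many executions to get good coverage.

```c
#include <stddef.h>

/* One committed store from the memory profile, in program order. */
typedef struct {
    unsigned long pc;     /* PC of the store instruction */
    unsigned long addr;   /* address it wrote */
} store_rec;

/* Returns the index of the last store in trace[0..n) whose address is in
   read_set[0..m), or -1 if no store touches the read set. Once that store
   has committed, the state needed by handler + method is ready. */
long find_trigger(const store_rec *trace, size_t n,
                  const unsigned long *read_set, size_t m)
{
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = 0; j < m; j++) {
            if (trace[i].addr == read_set[j])
                return (long)i;
        }
    }
    return -1;
}
```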
13
Triggers Example
[Code listing repeated from slide 8, annotated with the trigger points below]
T1
T2
Minimum of 400 cycles
90 cycles per execution
14
Handlers Example (2)
[Code listing repeated from slide 8, annotated with handlers H1, H2 and triggers T1, T2]
15
Outline
  • Program Demultiplexing Overview
  • Program Demultiplexing Execution Model
  • Hardware Support
  • Evaluation

16
Hardware Support Outline
  • Support for triggers
  • Demultiplexed execution
  • Maintaining executions
  • Storage
  • Invalidation
  • Committing

17
Support for Triggers
  • Triggers are registered with hardware
  • ISA extensions
  • Similar to debug watchpoints
  • Evaluation of triggers
  • Only by Committed instructions
  • PC, address
  • Fast lookup with filters
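
A software analogy for the trigger hardware, in the spirit of debug watchpoints: triggers are registered as (PC, address) pairs, every committed store is checked against them, and a small bit-vector filter rejects most stores before the table is searched. This is a sketch under assumptions; `register_trigger`, `on_commit`, the 64-bucket filter, and the table size are hypothetical, not the ISA extension from the presentation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRIGGERS 16   /* hypothetical table size */

typedef struct {
    uint64_t pc;        /* PC of the triggering store */
    uint64_t addr;      /* address it must write */
    int      handler;   /* id of the handler to start */
} trigger;

static trigger  table[MAX_TRIGGERS];
static size_t   count;
static uint64_t filter;   /* one bit per PC hash bucket */

static unsigned bucket(uint64_t pc) { return (unsigned)((pc >> 2) & 63u); }

/* Register a trigger with the "hardware". Returns false if the table is full. */
bool register_trigger(uint64_t pc, uint64_t addr, int handler)
{
    if (count == MAX_TRIGGERS)
        return false;
    table[count++] = (trigger){ pc, addr, handler };
    filter |= UINT64_C(1) << bucket(pc);
    return true;
}

/* Evaluate one committed store. Returns the handler id to fork, or -1. */
int on_commit(uint64_t pc, uint64_t addr)
{
    if (!(filter & (UINT64_C(1) << bucket(pc))))
        return -1;                       /* fast path: filter miss */
    for (size_t i = 0; i < count; i++)
        if (table[i].pc == pc && table[i].addr == addr)
            return table[i].handler;
    return -1;
}
```

Checking only committed instructions means a trigger never fires on a wrong-path store, at the cost of firing slightly later than it could.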

18
Demultiplexed Execution
  • Hardware: Typical MP system
  • Private cache for speculative data
  • Extend cache line with access bit
  • Misses serviced by Main processor
  • No communication with other executions
  • On completion
  • Collect read set (R)
  • Accessed lines
  • Collect write set (W)
  • Dirty lines
  • Invalidate write set in cache

[Figure: four processors P0 to P3, each with a private cache C]
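
The completion step can be sketched as a single pass over the private cache: lines with the access bit set form the read set, dirty lines form the write set, and dirty lines are then invalidated. The layout of `cache_line` and the function `collect_sets` are hypothetical, and clearing the access bit at the end is an added assumption about preparing the cache for the next execution.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool          valid;
    bool          accessed;   /* access bit added for speculative data */
    bool          dirty;
    unsigned long line_addr;
} cache_line;

/* On completion of a demultiplexed execution: copy accessed lines into
   rset and dirty lines into wset, then invalidate the dirty lines so the
   speculative data does not stay in the cache. rset and wset must each
   have room for n entries. */
void collect_sets(cache_line *cache, size_t n,
                  unsigned long *rset, size_t *nr,
                  unsigned long *wset, size_t *nw)
{
    *nr = 0;
    *nw = 0;
    for (size_t i = 0; i < n; i++) {
        if (!cache[i].valid)
            continue;
        if (cache[i].accessed)
            rset[(*nr)++] = cache[i].line_addr;
        if (cache[i].dirty) {
            wset[(*nw)++] = cache[i].line_addr;
            cache[i].valid = false;
        }
        cache[i].accessed = false;   /* assumption: reset for the next execution */
    }
}
```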
19
Execution buffer pool
  • Holds speculative executions
  • Execution entry contains
  • Read and write set
  • Parameters and return value
  • Alternatives
  • Use cache
  • May be more efficient
  • Similar to other proposals
  • Not the focus in this paper

20
Invalidating Executions
  • For a committed store address
  • Search Read and Write sets
  • Invalidate matching executions
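
The invalidation rule can be sketched as a search of every buffered execution's read and write sets for the line touched by a committed store. The sketch assumes the sets hold cache-line addresses and a 64-byte line; `execution`, `invalidate_on_store`, `LINE_BYTES`, and `MAX_LINES` are hypothetical names and sizes.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINE_BYTES 64u   /* hypothetical cache line size */
#define MAX_LINES  8     /* hypothetical per-set capacity */

typedef struct {
    bool          valid;
    unsigned long rset[MAX_LINES];   /* read set, as line addresses */
    size_t        nr;
    unsigned long wset[MAX_LINES];   /* write set, as line addresses */
    size_t        nw;
} execution;

static bool contains(const unsigned long *set, size_t n, unsigned long line)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == line)
            return true;
    return false;
}

/* For a committed store, invalidate every buffered execution whose read
   or write set holds the store's cache line. Returns how many entries
   were invalidated. */
size_t invalidate_on_store(execution *eb, size_t n, unsigned long store_addr)
{
    unsigned long line = store_addr / LINE_BYTES * LINE_BYTES;
    size_t killed = 0;
    for (size_t i = 0; i < n; i++) {
        if (eb[i].valid &&
            (contains(eb[i].rset, eb[i].nr, line) ||
             contains(eb[i].wset, eb[i].nw, line))) {
            eb[i].valid = false;
            killed++;
        }
    }
    return killed;
}
```

Matching at line granularity is conservative: a store to a different word of the same line also invalidates the execution.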

21
Using Executions
  • For a given call site
  • Search method name, parameters
  • Get write and read set
  • Commit
  • If accessed by program
  • Use
  • If accessed by another method
  • Nested methods
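
Committing a matched execution amounts to making its write set visible to the program. The sketch below assumes each write-set entry stores the line's data alongside its position, and uses a tiny line size to keep the example small; `written_line`, `commit_write_set`, and `LINE_BYTES` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 8   /* tiny hypothetical line size for illustration */

/* One line of a buffered execution's write set: where it goes and its data. */
typedef struct {
    size_t        line_index;   /* line number within program memory */
    unsigned char data[LINE_BYTES];
} written_line;

/* Commit: copy each written line into program memory, making the
   speculative results visible to the program. The caller must ensure
   every line_index lies within the memory array. */
void commit_write_set(unsigned char *memory,
                      const written_line *wset, size_t nw)
{
    for (size_t i = 0; i < nw; i++)
        memcpy(memory + wset[i].line_index * LINE_BYTES,
               wset[i].data, LINE_BYTES);
}
```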

22
Outline
  • Program Demultiplexing Overview
  • Program Demultiplexing Execution Model
  • Hardware Support
  • Evaluation

23
Reaching distant parallelism
[Figure: execution timeline marking the Fork, the Call Site, and M(), with intervals A and B]
24
Performance evaluation
  • Performance benefits limited by
  • Methods in program
  • Handler implementation

25
Summary of other results (refer to the paper)
  • Method sizes
  • 10s to 1000s of instructions. Lower 100s usually
  • Demultiplexed execution overheads
  • Common case 1.1x to 2.0x
  • Trigger points
  • 1 to 3. Outliers exist (macro usage)
  • Handler length
  • 10 to 50 instructions average
  • Cache lines
  • Read 20s, Written 10s
  • Demultiplexed execution
  • Held average of 100s of cycles

26
Conclusions
  • Method granularity
  • Exploit modularity in program
  • Trigger and handler to allow earliest execution
  • Data-flow based
  • Unordered execution
  • Reach distant parallelism
  • Orthogonal to other speculative parallelization
  • Use to further speed up demultiplexed execution

27
Backup
28
Average trigger points in call site
  • Small set of trigger points for a given call site
  • Defines reachability from trigger to the call site

29
Evaluation
  • Full-system execution-based simulator
  • Intel x86 ISA and Virtutech Simics
  • 4-wide out-of-order processors
  • 64K Level 1 caches (2 cycle), 1 MB Level 2 (12 cycle)
  • MSI coherence
  • Software toolchain
  • Modified gcc-compiler and lancet tool
  • Debugging information, CFG, program dependence graph
  • Simulator based memory profile
  • Generates triggers and handlers
  • No mis-speculations occur

30
Reaching distant parallelism
A: Cycles between Fork and Call Site
[Figure: timeline marking interval A and M()]
31
[Chart: Execution Buffer Entries]
Avg. Cycles Held: 900, 590, 70, 520, 413, 244, 160, 308
  • Storage requirements
  • Max case 284 KB
  • Minimize entries by better scheduling

32
Read and write set
Cache lines written
Cache lines read
33
Demultiplexed execution overheads
Execution Time Overhead
  • Overheads due to
  • Handler
  • Cache misses due to demultiplexed execution
  • Common case
  • between 1.1x and 2.0x
  • Small methods → High overheads

34
Length of handlers
[Chart: Handler Instruction Count Overhead; data labels: 14, 10, 9, 100, 16, 4, 40, 4]
35
Method sizes
36
Methods
  • Benchmarks: crafty, gap, gzip, mcf, parser, twolf, vortex, vpr
  • Methods: 24, 16, 9, 8, 12, 10, 11, 11
  • Call Sites: 59, 27, 9, 84, 26, 106, 20
  • Exec. time (%): 85, 90, 51, 30, 55, 92, 88, 99
  • Runtime includes frequently called methods

37
Loop-level Parallelization
  • Unit: Loop iterations
  • Live-ins from
  • P-slice
  • Similar to handler
  • Fork instruction
  • Restricted
  • Same basic block level, method
  • Program order dependent
  • Ordered forking

[Figure labels: Mitosis; fork; loop; endl]
38
Method-level parallelization
  • Unit: Method continuations
  • Program after the method returns
  • Orthogonal to PD

[Figure labels: Method-level; call; M(); ret]

39
Reaching distant parallelism
[Figure: timelines for M1() and M2() with intervals A and B]
  • Benchmarks: crafty, gap, gzip, mcf, parser, twolf, vortex, vpr
  • B A > 1 (%): 60, 72, 30, 80, 70, 40, 63, 47
40
Reaching distant parallelism
B: Call time to earliest execution time (1 outstanding)
[Figure: timelines for M1() and M2() with intervals A, B, and C; ratio labels: C / B R1 CNo params/C R2]
41
Issues with Stack
  • Stack pointer is position dependent
  • Handler has to insert parameters at the right position
  • Same stack addresses denote different variables
  • Affects triggers
  • Different stack pointers in program and execution
  • Stack may be discarded
  • To commit requires relocation of stack results
  • Example: parameters passed by reference
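
The relocation issue can be illustrated with a simple address translation: a demultiplexed execution runs on its own stack region, so any stack result it produced has to be moved to the matching offset in the program's stack before commit. This is a minimal sketch assuming one contiguous region per stack; `relocate` and its parameters are hypothetical.

```c
#include <stdint.h>

/* Translate an address written by a demultiplexed execution into the
   address the program expects. exec_base and prog_base are the lowest
   addresses of the stack region used by the execution and of the
   corresponding region in the program; size is the region length in
   bytes. Addresses outside the execution's stack region (for example,
   heap addresses) are returned unchanged. */
uintptr_t relocate(uintptr_t addr, uintptr_t exec_base,
                   uintptr_t prog_base, uintptr_t size)
{
    if (addr >= exec_base && addr - exec_base < size)
        return prog_base + (addr - exec_base);
    return addr;
}
```

A parameter passed by reference to a stack variable is the case where this matters: the callee's write lands in the execution's stack and must be relocated to the caller's real variable.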

42
Benchmarks
  • SPECint2000 benchmarks
  • C programs
  • Did not evaluate gcc, perl, bzip2, and eon
  • No intention of creating concurrency
  • No specific/clean programming style
  • Many methods perform several tasks
  • May have fewer opportunities

43
Hardware System
  • Intel x86 simulation
  • Virtutech Simics based full-system, Bochs decoder
  • 4 processors at 3 GHz
  • Simple memory system
  • Micro-architecture model
  • 4-wide out of order without cracking into micro-ops
  • Branch predictors
  • 32K L1 (2-cycle), 1 MB L2 (12-cycle)
  • MSI, 15-cycle cache-to-cache communication
  • Infinite Execution buffer pool

44
Software
  • Modified gcc-compiler tool chain and lancet tool
  • Extract from compiled binary
  • Debugging information
  • CFG, Program Dependence Graph
  • Software
  • Dynamic information from simulator
  • Generates handler, trigger for call site as encountered
  • Control-flow in handler not included (ongoing work)
  • Perfect control transfer from trigger to method
  • Handler doesn't execute if a branch leads to not calling the method

45
Generating Handlers
  • Cannot easily identify and demarcate code
  • Heuristic to demarcate
  • Terminate when load address is from heap
  • Handler has
  • Loads and stores to stack
  • No stores to heap
  • Limitation
  • Heuristic. Doesn't always work

46
Generating Handlers
  • 1. Specify parameters to method
  • Pushed into stack by program
  • Introduces dependency
  • Prevents separation
  • 2. Computing parameters
  • Program performs it near call site
  • Need to identify the code
  • Deal with
  • Use of stack
  • Control-flow
  • Inter-method dependence

    G = F (N)
    if ()
        X = G <op> 2
    else
        X = G <op> 2
    M (X)
47
Control-flow in Handlers
  • Depends on call site's CF
  • Handler for D
  • Call site in C (), BB 3
  • Include Loop
  • BB 4 to BB 1
  • Include Branch
  • Branch in BB 1
  • Inclusion depends on trigger
  • Multiple iterations, diff. triggers
  • Ongoing work

[Figure: CFG of C () and call graph, with D as the called method]
48
Other dependencies in Handlers
  • C calls D, A or B calls C
  • Dependence (X) extends
  • May need multiple handlers
  • If multiple call sites

[Figure: call graph in which A(X) and B(X) call C(X), which calls D(X)]
49
Buffering Handler Writes
  • General case
  • Writes in handler to be buffered
  • Provided to execution
  • Discarded after execution
  • Current implementation
  • Only stack writes

[Figure: processors P1 to P3 with private caches C and the execution buffer EB]
50
Methods for Speculative Execution
  • Well encapsulated
  • Defined by parameters and return value
  • Stack for local computation
  • Heap for global state
  • Often performs specific tasks
  • Access limited global state
  • Limits side-effects