Title: EECS 583 Lecture 6 Hyperblocks, Control CPR
1 EECS 583 Lecture 6: Hyperblocks, Control CPR
- University of Michigan
- January 27, 2003
2 Homeworks
- HW 1 due today at 11:59pm
- No 4th testcase; I didn't get around to it
- scp tar file to lloth.eecs.umich.edu
- Please don't email it to me
- user eecs583
- password is same
- Create tar file, uniquename.tgz
- put in /y/eecs583/hw1
- scp mahlke.tgz eecs583@lloth.eecs.umich.edu:/y/eecs583/hw1/.
- HW 2 is available; due in 2 weeks
3 Class Problem from Last Time
if (a > 0)
    r = t + s
    if (b > 0 || c > 0)
        u = v + 1
    else if (d > 0)
        x = y + 1
    else
        z = z + 1
- Draw the CFG
- Compute CD
- If-convert the code
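One way to check an if-converted answer is to model each predicated operation as a guarded assignment and compare it with the branching code on every input. The sketch below is only an illustration: it assumes the two inner conditions are joined by a logical OR and that the assignments are additions (the operators are not legible in the slide text), and the helper `guarded` and the predicate names p1 to p5 are hypothetical.

```python
from itertools import product

def guarded(pred, new, old):
    """Model a predicated op: commit `new` only when `pred` is true."""
    return new if pred else old

def branching(a, b, c, d, t, s, v, y, z):
    """Branching version (assumes '||' and '+' for the illegible operators)."""
    r = u = x = None
    if a > 0:
        r = t + s
        if b > 0 or c > 0:
            u = v + 1
        elif d > 0:
            x = y + 1
        else:
            z = z + 1
    return r, u, x, z

def if_converted(a, b, c, d, t, s, v, y, z):
    """Straight-line version: no branches, every op guarded by a predicate."""
    r = u = x = None
    p1 = a > 0                        # guards r = t + s
    r = guarded(p1, t + s, r)
    p2 = p1 and (b > 0 or c > 0)      # guards u = v + 1
    p3 = p1 and not (b > 0 or c > 0)  # inner else side
    u = guarded(p2, v + 1, u)
    p4 = p3 and d > 0                 # guards x = y + 1
    p5 = p3 and not d > 0             # guards z = z + 1
    x = guarded(p4, y + 1, x)
    z = guarded(p5, z + 1, z)
    return r, u, x, z

def all_inputs_agree():
    """Compare both versions over every sign combination of a, b, c, d."""
    return all(
        branching(a, b, c, d, 10, 20, 30, 40, 50)
        == if_converted(a, b, c, d, 10, 20, 30, 40, 50)
        for a, b, c, d in product((-1, 1), repeat=4)
    )
```

Each predicate is the conjunction of the branch conditions the operation is control dependent on, which is why computing CD comes before if-converting.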
4 Region Formation If-conversion
- Control flow representation
- branches
- predicated operations
- If-conversion is not an all-or-nothing deal
- Often bad to apply in blanket mode
- Selectively apply
- Regions
- Extend a superblock to contain if-converted code
- Convert off-trace transitions to on-trace
- A hyperblock is born
- Superblock is a special case of a HB where all guarding predicates are true
[Figure: CFG of BB1 to BB6 with profile edge weights; BB4 and BB6 each appear twice, with fractional weights on the duplicated copies]
5 When to Apply If-conversion
- Positives
- Remove branch
- No disruption to sequential fetch
- No prediction or mispredict
- No use of branch resource
- Increase potential for operation overlap
- Enable more aggressive compiler xforms
- Software pipelining
- Height reduction
- Negatives
- Max or sum function applied when paths overlap
- Resource usage
- Dependence height
- Hazard presence
- Executing useless operations
[Figure: example CFG of BB1 to BB6 with profile edge weights]
6 Negative 1: Resource Usage
Resource usage is additive for all BBs that are if-converted
Case 1: Each BB requires 3 resources. Assume processor has 2 resources
- No IC: 1 x 3 + .6 x 3 + .4 x 3 + 1 x 3 = 9; 9 / 2 = 4.5 -> 5 cycles
- IC: 1 x (3 + 3 + 3 + 3) = 12; 12 / 2 = 6 cycles
Case 2: Each BB requires 3 resources. Assume processor has 6 resources
- No IC: 1 x 3 + .6 x 3 + .4 x 3 + 1 x 3 = 9; 9 / 6 = 1.5 -> 2 cycles
- IC: 1 x (3 + 3 + 3 + 3) = 12; 12 / 6 = 2 cycles
[Figure: diamond CFG (BB1 -> BB2 with weight 60, BB1 -> BB3 with weight 40, both -> BB4; entry weight 100) beside its if-converted form (BB1; BB2 if p1; BB3 if p2; BB4)]
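The two cases can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming each block is described by an (execution probability, resources needed) pair and that cycle counts round up; the function names and the `BLOCKS` list are illustrative, not from the lecture.

```python
import math

def cycles_without_ic(blocks, num_resources):
    """Expected cycles when each block only runs with its own probability.

    blocks: list of (probability, resources_needed) pairs.
    """
    expected = sum(prob * need for prob, need in blocks)
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(expected / num_resources, 9))

def cycles_with_ic(blocks, num_resources):
    """After if-conversion every block's resources are always consumed."""
    total = sum(need for _, need in blocks)
    return math.ceil(total / num_resources)

# BB1, BB2 (60% path), BB3 (40% path), BB4; each needs 3 resources
BLOCKS = [(1.0, 3), (0.6, 3), (0.4, 3), (1.0, 3)]
```

With 2 resources if-conversion costs a cycle (5 versus 6); with 6 resources the additive cost is absorbed and both versions take 2 cycles.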
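7 Negative 2: Dependence Height
Dependence height is the max over all BBs that are if-converted (dep height = schedule length with infinite resources)
Case 1: height(bb1) = 1, height(bb2) = 3, height(bb3) = 9, height(bb4) = 2
- No IC: 1 x 1 + .6 x 3 + .4 x 9 + 1 x 2 = 8.4
- IC: 1 x 1 + 1 x MAX(3,9) + 1 x 2 = 12
Case 2: height(bb1) = 1, height(bb2) = 3, height(bb3) = 3, height(bb4) = 2
- No IC: 1 x 1 + .6 x 3 + .4 x 3 + 1 x 2 = 6
- IC: 1 x 1 + 1 x MAX(3,3) + 1 x 2 = 6
[Figure: diamond CFG (BB1 -> BB2 with weight 60, BB1 -> BB3 with weight 40, both -> BB4; entry weight 100) beside its if-converted form (BB1; BB2 if p1; BB3 if p2; BB4)]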
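The height comparison follows the same pattern, with a max in place of a sum for the if-converted blocks. A small sketch, assuming the 60/40 branch split of the figure; the function names and the default probabilities `p2` and `p3` are illustrative.

```python
def height_without_ic(h1, h2, h3, h4, p2=0.6, p3=0.4):
    """Expected height: each side of the branch weighted by its probability."""
    return h1 + p2 * h2 + p3 * h3 + h4

def height_with_ic(h1, h2, h3, h4):
    """If-converted: both sides overlap, so the taller side sets the height."""
    return h1 + max(h2, h3) + h4
```

With heights (1, 3, 9, 2) the infrequent but tall BB3 dominates after if-conversion; with matched heights (1, 3, 3, 2) nothing is lost.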
8 Negative 3: Hazard Presence
Hazard = operation that forces the compiler to be conservative, so limited reordering or optimization, e.g., subroutine call, pointer store, ...
Case 1: Hazard in BB3
- No IC: SB out of BB1, 2, 4; operations in BB4 free to overlap with those in BB1 and BB2
- IC: operations in BB4 cannot overlap with those in BB1 (BB2 ok)
[Figure: diamond CFG (BB1 -> BB2 with weight 60, BB1 -> BB3 with weight 40, both -> BB4; entry weight 100) beside its if-converted form (BB1; BB2 if p1; BB3 if p2; BB4)]
9 When To If-convert
- Resources
- Small resource usage ideal for less important paths
- Dependence height
- Matched heights are ideal
- Close to same heights is ok
- Remember everything is relative for resources and dependence height!
- Hazards
- Avoid hazards unless on most important path
- Estimate of benefit
- Branches/Mispredicts removed
- Fudge factor
[Figure: diamond CFG (BB1 -> BB2 with weight 60, BB1 -> BB3 with weight 40, both -> BB4; entry weight 100) beside its if-converted form (BB1; BB2 if p1; BB3 if p2; BB4)]
10 The Hyperblock
- Hyperblock - Collection of basic blocks in which control flow may only enter at the first BB. All internal control flow is eliminated via if-conversion
- Likely control flow paths
- Acyclic (outer backedge ok)
- Multiple intersecting traces with no side entrances
- Side exits still exist
- Hyperblock formation
- 1. Block selection
- 2. Tail duplication
- 3. If-conversion
[Figure: example CFG of BB1 to BB6 with profile edge weights]
11 Block Selection
- Block selection
- Select subset of BBs for inclusion in HB
- Difficult problem
- Weighted cost/benefit function
- Height overhead
- Resource overhead
- Hazard overhead
- Branch elimination benefit
- Weighted by frequency
[Figure: example CFG of BB1 to BB6 with profile edge weights]
12 Block Selection
- Create a trace -> "main path"
- Use a heuristic function to select other blocks that are compatible with the main path
- Consider each BB by itself for simplicity
- Compute priority for other BBs
- Normalize against main path.
- BSV_i = K x (weight_bb_i / size_bb_i) x (size_main_path / weight_main_path) x bb_char_i
- weight = execution frequency
- size = number of operations
- bb_char = characteristic value of each BB
- Max value = 1, hazardous instructions reduce this to 0.5, 0.25, ...
- K = constant to represent processor issue rate
- Include BB when BSV_i > Threshold
13 Example - Step 1 - Block Selection
- main path = 1, 2, 4, 6
- num_ops = 5 + 8 + 3 + 2 = 18
- weight = 80
- Calculate the BSVs for BB3, BB5 assuming no hazards, K = 4
- BSV_3 = 4 x (20 / 2) x (18 / 80) = 9
- BSV_5 = 4 x (10 / 5) x (18 / 80) = 1.8
- If Threshold = 2.0, select BB3 along with main path
[Figure: example CFG with profile edge weights and op counts BB1 = 5, BB2 = 8, BB3 = 2, BB4 = 3, BB5 = 5, BB6 = 2]
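The block selection value is easy to compute mechanically. The sketch below reproduces the example's numbers (K = 4, main path of 18 ops with weight 80, no hazards); the function names and the dictionary layout of `CANDIDATES` are assumptions made for illustration.

```python
def bsv(weight_bb, size_bb, size_main, weight_main, k, bb_char=1.0):
    """Block selection value of one candidate BB, normalized to the main path."""
    return k * (weight_bb / size_bb) * (size_main / weight_main) * bb_char

def select_blocks(candidates, size_main, weight_main, k, threshold):
    """Return the candidate BBs whose BSV exceeds the threshold.

    candidates: {name: (weight, size, bb_char)}
    """
    return [
        name
        for name, (weight, size, bb_char) in candidates.items()
        if bsv(weight, size, size_main, weight_main, k, bb_char) > threshold
    ]

# Off-main-path blocks from the example: (weight, num_ops, bb_char)
CANDIDATES = {"BB3": (20, 2, 1.0), "BB5": (10, 5, 1.0)}
```

Calling `select_blocks(CANDIDATES, 18, 80, 4, 2.0)` keeps BB3 and rejects BB5. A hazardous block would pass a smaller `bb_char` (0.5, 0.25, ...) and be correspondingly harder to select.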
14 Example - Step 2 - Tail Duplication
Tail duplication: same as with superblock formation
[Figure: example CFG before and after tail duplication; BB6 is duplicated so that the selected blocks have no side entrances]
15 Example - Step 3 - If-conversion
If-convert intra-HB branches only!!
[Figure: example CFG before and after if-conversion; in the hyperblock, BB1 computes p1, p2 with a CMPP, BB2 executes if p1, BB3 executes if p2, followed by BB4 and BB6; BB5 and the duplicated BB6 stay outside]
16 Hyperblock Performance Evaluation (1)
- O: BB code
- IP: Structural if-conversion
- All innermost loops, acyclic SEME regions
- PP: Selective if-conversion
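17 Class Problem
Form the HB for this subgraph. Assume K = 4, BSV Threshold = 2
[Figure: CFG of BB1 to BB9; op counts BB1 = 3, BB2 = 8, BB3 = 2, BB4 = 2, BB5 = 3, BB6 = 2, BB7 = 1, BB8 = 2, BB9 = 1; profile weights shown, top to bottom: 100, 20, 80, 80, 20, 45, 55, 10, 35, 55, 35, 10]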
18 Block Selection Try 2
- Problems with BSV formula
- Ignores dependence height
- Blocks considered independently (control flow ignored)
- Enumerate all paths of execution through region of interest
- Consider a path = execution from entry to some exit
- Give priority to path as a whole
- Path priority
- dep_ratio_i = 1.0 - (dep_height_i / max dep_height)
- op_ratio_i = 1.0 - (num_ops_i / max num_ops)
- priority_i = (probability_i x hazard_i) x (dep_ratio_i + op_ratio_i + K)
- Hazard multiplier was 0.25 for paths containing subroutine call or unresolvable memory store
- K = base contribution for a path (0.1 used)
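The path priority can be written as a small function. This is a sketch under stated assumptions: each ratio is taken as 1.0 minus the normalized height or op count, and the two ratios and K are summed (the operators are not legible in the slide text). The parameter names are illustrative.

```python
def path_priority(probability, dep_height, num_ops,
                  max_dep_height, max_num_ops, hazard=1.0, k=0.1):
    """Priority of one entry-to-exit path through the region.

    hazard: 1.0 for a hazard-free path, 0.25 if the path contains a
            subroutine call or an unresolvable memory store.
    k:      base contribution every path receives.
    """
    dep_ratio = 1.0 - dep_height / max_dep_height  # short paths score higher
    op_ratio = 1.0 - num_ops / max_num_ops         # small paths score higher
    return probability * hazard * (dep_ratio + op_ratio + k)
```

For a hypothetical path with probability 0.5, half the maximum height and half the maximum op count, this gives 0.5 x (0.5 + 0.5 + 0.1) = 0.55; the same path with a hazard drops to a quarter of that.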
19 Block Selection Try 2 (continued)
- Path selection
- Rank paths from highest to lowest priority
- Include paths until either
- Estimated available resources full
- Priority drops too low
- Exclude any paths with excessive resource util or dep height
- Use union of selected paths to form Hyperblock
- Causes some lower priority paths to be included
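The selection loop can be sketched as a greedy pass over the ranked paths. The resource model here is an assumption made for illustration: available resources are approximated by a budget on the total op count of the selected blocks, and `op_budget`, `min_priority`, and the data layout are hypothetical. The per-path exclusion for excessive resource use or dependence height is omitted for brevity.

```python
def select_paths(paths, block_ops, op_budget, min_priority):
    """Greedily pick paths by priority; return the union of their blocks.

    paths:     list of (priority, tuple_of_block_names)
    block_ops: {block_name: number_of_operations}
    """
    selected = set()
    for priority, blocks in sorted(paths, key=lambda p: p[0], reverse=True):
        if priority < min_priority:
            break                      # priority dropped too low
        candidate = selected | set(blocks)
        if sum(block_ops[b] for b in candidate) > op_budget:
            break                      # estimated resources are full
        selected = candidate
    return selected
```

Because the hyperblock is the union of the chosen blocks, any path made up entirely of selected blocks is included even if it was never picked on its own priority.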
20 Block Selection - Try 2 - Example
Enumerate all paths, rank by priority
1. A-B-D-E-F-H-N
2. A-B-D-E-F-H-K-N
3. A-B-D-E-G-J-M-N
4. A-B-D-E-G-J-L-M-N
5. A-B-D-E-G-I-M-N
6. A-B-D-E-G-J-L-N
7. A-B-D
8. A-C-D-E-F-H-N
9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-B-C-E-F-G-I-M-N
20. A-B-C-E-F-G-J-M-N
21. A-B-C-E-F-G-J-L-M-N
22. A-B-C-E-F-G-J-L-N
21 Block Selection - Try 2 - Example continued
22 Hyperblock Performance Using Paths
[Charts: results for 4-issue and 8-issue processors]
23 Control CPR: A Branch Height Reduction Optimization for EPIC Architectures (PLDI 1999)
- Mike Schlansker
- Scott Mahlke
- Hewlett-Packard Laboratories
- Richard Johnson
- Transmeta Corporation
24 Introduction and Problem Statement
- Dependences limit performance
- Data
- Control
- Long dependence chains
- Sequential code
- Problem worse for next generation processors
- High degree hardware parallelism
- Low degree of program parallelism
- Resources idle most of the time
- Height reduction optimizations
- Traditional compilers focus on reducing operation count
- Future compilers need to focus on increasing program parallelism
25 Height Reduction Optimization
- Goals
- Break dependences
- Reduce latency of edges
- Reorganize computation
- Common approach
- Trade off redundant work for reduced height
- Inverse of CSE
- Data height reduction
- Use of the associative property
- Induction variable back substitution
- Control height reduction
- Control dependences
- Reduce height through branch network
- Focus of our work
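As a concrete instance of data height reduction via the associative property, a chain of additions can be reassociated into a balanced tree. The sketch below is illustrative only (the function names are hypothetical); it counts dependence height in add operations and uses integers, since reassociating floating-point sums can change rounding.

```python
def chain_sum(values):
    """Left-to-right sum; returns (result, dependence height in adds)."""
    total, height = values[0], 0
    for v in values[1:]:
        total, height = total + v, height + 1  # each add waits on the last
    return total, height

def tree_sum(values):
    """Pairwise (balanced tree) sum; same result, logarithmic height."""
    level, height = list(values), 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element passes through unchanged
        level, height = paired, height + 1
    return level[0], height
```

For eight operands both versions perform seven adds, but the chain has height 7 while the tree has height 3, provided the machine can issue the independent adds in parallel.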
26 Our Approach to Control Height Reduction
- Goals
- Reduce dependence height through a network of branches
- Reduce number of executed branches
- Applicable to a large fraction of the program
- Fit into our existing compiler infrastructure
- Difficulty
- Reducing height while
- Not increasing operation count
- Irredundant Consecutive Branch Method (ICBM)
- Use branch profile information
- Optimize the likely, important control flow paths
- Possibly penalize less important paths
27 Definitions
- Superblock
- single-entry linear sequence of operations containing 1 or more branches
- Our basic compilation unit
- Non-speculative operations
- Exit branch
- branch to allow early transfer out of the superblock
- compare condition (a_i < b_i)
- On-trace
- preferred execution path (E4)
- identified by profiling
- Off-trace
- non-preferred paths (E1, E2, E3)
- taking an exit branch
28 ICBM for a Simple RISC Processor - Step 1
Input superblock
Insert bypass branch
29 ICBM for a Simple RISC Processor - Step 2
Superblock with bypass branch
Move code down through bypass branch
30 ICBM for a Simple RISC Processor - Step 3
Code after downward motion
Simplify resultant code
31 ICBM for a Simple RISC Processor - Step 4
Code after simplification
Sequential boolean expression
Height reduced expression
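The effect of the four steps can be illustrated with a toy model of a superblock that contains only exit branches. This is a simplified sketch, not the paper's algorithm: it ignores the operations interleaved with the branches, and the function names and exit labels are hypothetical.

```python
def original_superblock(conds):
    """Exit branches tested one after another: branch height grows with len(conds)."""
    for i, taken in enumerate(conds):
        if taken:
            return f"E{i + 1}"         # off-trace exit
    return "on-trace"

def icbm_superblock(conds):
    """One bypass branch guards the on-trace path; exits are resolved off-trace."""
    bypass = any(conds)                # OR of all exit conditions
    if not bypass:
        return "on-trace"              # a single branch on the likely path
    return original_superblock(conds)  # compensation code re-tests the branches
```

The OR of the exit conditions is a boolean expression that can be evaluated as a balanced tree, so the on-trace path executes one branch instead of a sequential chain. The off-trace path does extra work, which is acceptable when profiling says it is rarely taken.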
32 Is the ICBM Transformation Always Correct?
- Answer is no
- Problem with downward motion
- S1 ops to compute c0, c1, c2
- S2 ops dependent on branches
- S1 ops must remain on-trace
- S2 ops must move downward
- No dependences permitted between S1 and S2
- Separability violation
- Experiments - 6 branches failed
- Memory dependences
33 Blocking
- Transforming an entire superblock
- May not be possible
- May not be profitable
- Solution - CPR blocks
- Block into smaller subregions
- Linear sequences of basic blocks
- Apply CPR to each subregion
- Grow CPR block incrementally
- Terminate CPR block when
- Correctness violation
- Performance heuristic
34 ICBM for an EPIC Processor (HPL-PlayDoh)
- Predicated execution
- Boolean guard for all operations
- a = b + c if p
- Increases complexity of ICBM
- Generalize the schema
- Analyze and transform complex predicated code
- Suitability pattern match
- Proof of correct code generation
- Increases efficiency of ICBM
- Wired-AND/wired-OR compares
- Accumulate disjunction of conditions into a predicate
- See PlayDoh technical report
- Compare network reduced to 1 level
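A wired-OR compare can be modeled as an operation that may set a predicate but never clear it. The sketch below is a software analogy under that assumption, not PlayDoh code, and the function name is hypothetical.

```python
def wired_or(conditions):
    """Accumulate the disjunction of conditions into one predicate."""
    p = False                # predicate is initialized to false
    for cond in conditions:
        if cond:
            p = True         # the only value any compare ever writes
    return p
```

Since every compare that writes the predicate writes the same value, the result does not depend on the order in which the compares execute. That is what lets them issue together and collapses the compare network to a single level.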
35 Experiment Evaluation
- ICBM implemented in Elcor research compiler
- More information available at www.trimaran.org
- Comparison
- Baseline - optimized superblock code produced by Impact
- Height-reduced - baseline code with ICBM transformation
- Benchmarks - SPECINT95, SPECINT92, Unix utilities (24 total)
- Processor models - PlayDoh instruction set
- sequential - single issue RISC
- narrow - (2,1,1,1) (I,F,M,B)
- medium - (4,2,2,1)
- wide - (8,4,4,2)
- infinite - (75,25,25,25)
- Cache stalls and branch mispredictions not measured
36 Taste of the Results
37 Performance Insights
- When is ICBM most effective?
- Sequences of biased branches (includes unrolled loops)
- long sequences are good
- but do not have to be overly long because it's better to block anyway
- Control dependence limited
- Branch resource limited
- Best benchmark - cmp
- long trace (162 ops), 25 branches, all branches heavily biased
- When is ICBM less effective?
- Few branches
- Dominated by unbiased branches
- Control dependences are not limiting - data/memory are limiting factors
- Important branches that we cannot treat (table jumps)
- Worst benchmark - 099.go
- unbiased branches, data dependences
38 Summary and Final Thoughts
- ICBM is an effective strategy for control height reduction
- Relatively simple
- No height versus redundancy tradeoff
- Use profile information
- Reduce dependence height and operation count on important paths
- Penalize less important paths
- Strong performance gains across range of processors
- 13% for a sequential processor
- 18% for a medium VLIW (4,2,2,1)
- 33% for a wide VLIW (8,4,4,2)
- Importance of height reduction optimizations in future compilers
- Parallelism limit studies are only valid on a fixed code base
- Compiler can manufacture ILP
- Current research only scratches the surface of height reduction