Title: Diverge-Merge Processor (DMP)
1Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu, Yale N. Patt
HPS Research Group, University of Texas at Austin
Microsoft Research
2Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion
3Predicated Execution
(predicated code)
p1 = (cond)
(!p1) mov b, 1
(p1) mov b, 0
[Figure: control-flow graph with blocks A, B, and C]
- Convert control flow dependence to data dependence
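The conversion can be illustrated with a minimal sketch in Python. The function names are illustrative and do not come from the slides, and Python conditionals stand in for guarded instructions. In the predicated version both guarded moves are issued and the predicate decides which write takes effect, so the result depends on p1 as data rather than on a branch outcome.

```python
def branch_version(cond: bool) -> int:
    """Control dependence: which mov executes depends on the branch outcome."""
    if cond:
        b = 0
    else:
        b = 1
    return b


def predicated_version(cond: bool) -> int:
    """Data dependence: both movs are issued; the predicate guards each write."""
    p1 = cond                 # p1 = (cond)
    b = None                  # value of b before the hammock (unknown here)
    b = 1 if not p1 else b    # (!p1) mov b, 1
    b = 0 if p1 else b        # (p1)  mov b, 0
    return b
```

Both versions return the same value for either outcome of cond; only the kind of dependence differs.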
4Benefit of Predicated Execution
- Predicated Execution can be high performance and
energy-efficient.
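[Figure: pipeline diagrams (Fetch, Decode, Rename, Schedule, RegisterRead, Execute) for two cases. Predicated Execution: blocks A, B, and C flow through the pipeline and instructions on the false-predicate path become nops. Branch Prediction: blocks A, B, D, E, and F are fetched, and a misprediction causes a pipeline flush.]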
5Limitations/Problems of Predication
- ISA: Predicate registers and predicated instructions
  - Dynamic-Hammock Predication [Klauser98] can solve this problem, but it is only applicable to simple hammocks.
- Adaptivity: Static predication is not adaptive to run-time branch behavior.
  - Branch behavior changes based on input set, phase, control-flow path.
  - Wish Branches [Kim05]
- Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  - Function calls, loops, many instructions inside a region, and complex CFGs
  - Hyperblock [Mahlke92] cannot adapt to frequently-executed paths dynamically.
6Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion
7Diverge-Merge Processor (DMP)
- DMP can dynamically predicate complex branches (in addition to simple hammocks).
- The compiler identifies
  - Diverge branches
  - Control-flow merge (CFM) points
- The microarchitecture decides when and what to predicate dynamically.
8Dynamic Predication
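- Klauser et al. [PACT98]: Dynamic-hammock predication
[Figure: hammock with blocks A, B, C and join block H, entered through a low-confidence branch. The two paths execute mov R1, 1 (PR10 ← 1) and mov R1, 0 (PR11 ← 0). The JOIN block H executes add R5, R1, 1.]
select-µops (φ-nodes in SSA): PR12 = (cond) ? PR11 : PR10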
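A minimal sketch of the select-µop in this example, assuming plain Python integers stand in for physical registers. The function names are illustrative. Both paths write their own physical register, and the select-µop picks the value that the join block reads as R1.

```python
def select_uop(pred: bool, if_true: int, if_false: int) -> int:
    """dest = pred ? if_true : if_false"""
    return if_true if pred else if_false


def join_block(cond: bool) -> int:
    pr10 = 1                               # mov R1, 1 on one path
    pr11 = 0                               # mov R1, 0 on the other path
    pr12 = select_uop(cond, pr11, pr10)    # PR12 = (cond) ? PR11 : PR10
    r5 = pr12 + 1                          # add R5, R1, 1 (R1 is read through PR12)
    return r5
```

The add in the join block can be renamed before the branch resolves, because its source is always PR12 regardless of which path turns out to be correct.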
9Diverge-Merge Processor
[Figure: control-flow graph with blocks A to H. The diverge branch is in A and the CFM point is H, where select-µops are inserted. Legend: frequently executed path, not frequently executed path.]
10Diverge-Merge Processor
[Figure: control-flow graph with blocks A to H. Legend: frequently executed path, not frequently executed path, diverge branch, executed block, CFM point.]
11Control-Flow Graphs
[Figure: control-flow graph types compared across DMP, Dynamic-hammock, SW pred., Wish br., and Dual-path]
12Dual-path Execution vs. DMP
[Figure: blocks fetched along path 1 and path 2 after block A, under Dual-path and under DMP. The paths go through blocks B and C; the CFM point is marked before blocks D, E, and F.]
13Control-Flow Graphs
[Figure: control-flow graph types compared across DMP, Dynamic-hammock, SW pred., Wish br., and Dual-path; two entries are marked "sometimes"]
14Distribution of Mispredicted Branches
- 66% of mispredicted branches can be dynamically predicated in DMP.
16Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion
17Fetch Mechanism
[Figure: control-flow graph with blocks A to H. The diverge branch in A has low confidence; both paths are fetched round-robin, following the predicted path on each side, until the CFM point H.]
18Dynamic Predication
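[Figure: two register alias tables, RAT1 and RAT2, each with columns Arch., Phy., and M. Both map R2 to PR12 and R3 to PR13. R1 initially maps to PR11; it is remapped to PR21 in RAT1 and to PR31 in RAT2, with M = 1 in each. After the select-µop, R1 maps to PR41. Blocks shown: A, B, C, E, H.]
select-µop pr41 = p1 ? pr21 : pr31
- Forks RAT, RAS, and GHR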
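The RAT fork and the select-µop pr41 = p1 ? pr21 : pr31 can be reproduced with a small sketch. Everything here is an assumption made for illustration: the Entry class, the helper names, the free list as a Python iterator, and the convention that rat1 holds the path on which p1 is true. The M bit is modeled as "written since the fork".

```python
from dataclasses import dataclass


@dataclass
class Entry:
    phys: str          # physical register name
    m: bool = False    # M bit: set when the register is written after the fork


def fork(rat):
    """Copy the RAT for one path, clearing the M bits."""
    return {arch: Entry(e.phys) for arch, e in rat.items()}


def write(rat, arch, free):
    """Rename a destination register on one path and set its M bit."""
    rat[arch] = Entry(next(free), True)


def merge(rat1, rat2, pred, free):
    """At the CFM point, emit one select-uop per register modified on either path."""
    merged, uops = {}, []
    for arch in rat1:
        e1, e2 = rat1[arch], rat2[arch]
        if e1.m or e2.m:
            dest = next(free)
            uops.append(f"select-uop {dest} = {pred} ? {e1.phys} : {e2.phys}")
            merged[arch] = Entry(dest)
        else:
            merged[arch] = Entry(e1.phys)
    return merged, uops


def demo():
    free = iter(["pr21", "pr31", "pr41"])    # hypothetical free list
    base = {"R1": Entry("pr11"), "R2": Entry("pr12"), "R3": Entry("pr13")}
    rat1, rat2 = fork(base), fork(base)
    write(rat1, "R1", free)                  # one path writes R1 -> pr21
    write(rat2, "R1", free)                  # the other path writes R1 -> pr31
    return merge(rat1, rat2, "p1", free)
```

Registers that neither path writes (R2 and R3 here) keep their mapping and need no select-µop, which keeps the merge overhead proportional to the registers actually modified.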
19DMP Support
- ISA Support
  - Mark diverge branches/CFM points.
- Compiler Support [CGO07]
  - The compiler identifies diverge branches and the corresponding CFM points.
- Hardware Support
  - Confidence estimator
  - Fetch mechanisms
  - Load/store processing
  - Instruction retirement
  - Dynamic predication
20Hardware Complexity Analysis
[Table: hardware support needed by each mechanism. Columns: SW pred., Dual-path, Wish br., Multipath, Dyn.-ham., DMP. Rows: Front-End, Confidence Estimator, Rename Support, Predicate Registers, Select-µop Gen., ST-LD Forwarding, Check Flush/no Flush. The per-cell marks are not recoverable.]
21Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion
22Simulation Methodology
- 12 SPEC 2000 INT, 5 SPEC 95 INT
  - Different input sets for profiling and evaluation
- Alpha ISA execution-driven simulator
- Baseline processor configuration
  - 64KB perceptron predictor/O-GEHL (paper)
  - Minimum 30-cycle branch misprediction penalty
  - 8-wide, 512-entry instruction window
  - 2 KB 12-bit history enhanced JRS confidence estimator
- Less aggressive processor (paper)
- Power model using Wattch
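As a rough illustration of what a JRS-style confidence estimator does, the sketch below keeps a table of resetting counters indexed by the branch PC XORed with global history. The slides give only the size (2 KB) and history length (12 bits); the 4-bit counters, the 4096-entry table (which happens to total 2 KB), the threshold, and the exact index function are assumptions, and the "enhanced" variant used in the evaluation may differ.

```python
class JRSConfidenceEstimator:
    """Resetting-counter confidence estimator (a sketch, not the evaluated design)."""

    def __init__(self, history_bits=12, counter_bits=4, threshold=15):
        self.mask = (1 << history_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold               # assumed confidence threshold
        self.table = [0] * (1 << history_bits)   # one counter per index
        self.history = 0                         # global branch history

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def is_high_confidence(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, taken, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # Reset on a misprediction.
            self.table[i] = 0
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A diverge branch would enter dynamic predication mode only when is_high_confidence returns False for it.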
23Different CFG types
24Performance Improvement
25Energy Consumption
26Outline
- Predicated Execution
- Diverge-Merge Processor (DMP)
- Implementation of DMP
- Experimental Evaluation
- Conclusion
27Conclusion
- DMP introduces the concept of frequently-hammocks and dynamically predicates complex CFGs.
- DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.
- DMP reduces branch mispredictions energy-efficiently
  - 19% performance improvement, 9% less energy
- DMP divides the work between the compiler and the microarchitecture
  - The compiler analyzes the control-flow graphs.
  - The microarchitecture decides when and what to predicate dynamically.
28Thank You!!
29Questions?
30Handling Mispredictions
[Figure: three copies of the control-flow graph (blocks A to H, diverge branch in A, CFM point H) showing the predicted path, a misprediction, and a flush.]
µops shown in the figure:
add pr34 ← pr31, pr13
add pr24 ← pr41, pr13
add pr44 ← pr34, -1 (!p1)
select-µop pr41 = p1 ? pr21 : pr31
31Loop Branches
- Exit Condition
  - The loop branch is predicted to exit the loop.
- Benefit
  - Reduced pipeline flushes when the predicated loop is iterated more times than it should be.
  - Instructions in the extra iterations of the loop become NOPs. Instructions after loop-exit can still be executed.
- Negative Effects
  - Increased execution delay of loop-carried dependencies
  - The overhead of select-µops
32Loop Branches
- Predicate each loop iteration separately
select-µop pr22 = p1 ? pr21 : pr11
select-µop pr23 = p1 ? pr20 : pr10
select-µop pr32 = p2 ? pr31 : pr22
select-µop pr33 = p2 ? pr30 : pr23
Loop br. is predicted to exit the loop
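Per-iteration predication can be sketched as follows, under simplifying assumptions: a single loop-carried value, a side-effect-free body, and a loop condition checked before each iteration. The function and parameter names are illustrative. Each fetched iteration gets its own predicate, and a select keeps the old value once a predicate is false, so extra iterations behave like NOPs.

```python
def reference_loop(value, body, continue_cond):
    """The loop as written in the program."""
    while continue_cond(value):
        value = body(value)
    return value


def run_predicated_loop(value, body, continue_cond, fetched_iterations):
    """Execute a fixed number of fetched iterations, each guarded by its own predicate."""
    alive = True                                # all earlier iterations were real
    for _ in range(fetched_iterations):
        p = alive and continue_cond(value)      # predicate of this iteration (p1, p2, ...)
        new_value = body(value)                 # computed whether or not p holds
        value = new_value if p else value       # select-uop: p ? new : old
        alive = p
    return value
```

If the hardware fetches more iterations than the loop really executes, the result still matches reference_loop. The cost is the chain of selects on the loop-carried value, which is the added delay listed under negative effects.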
33Enhanced Mechanisms
- Multiple CFM points
  - The hardware chooses one CFM point for each instance of dynamic predication.
- Exit Optimizations
  - Counter Policy: What if one path does not reach the CFM point?
    - Number of fetched instructions > Threshold
  - Yield Policy: What if another low-confidence diverge branch is encountered in dynamic predication mode?
    - Later low-confidence branch is more likely mispredicted.
[Figure: control-flow graph with blocks A to H]
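The two exit optimizations amount to a per-cycle decision made while in dynamic predication mode. The sketch below is only one way to arrange that decision: the function name, the returned labels, and the priority order among the checks are assumptions, not details from the slides.

```python
def dpred_mode_action(fetched_insts, threshold,
                      both_paths_at_cfm, saw_low_conf_diverge_branch):
    """Decide what to do next while in dynamic predication mode."""
    if both_paths_at_cfm:
        return "exit: insert select-uops"            # normal exit at the CFM point
    if fetched_insts > threshold:
        return "exit: counter policy"                # a path is not reaching the CFM point
    if saw_low_conf_diverge_branch:
        return "yield: re-enter from younger branch" # yield policy
    return "continue"
```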
34Detailed DMP Support
- 32 Predicate register ids
- Fetch mechanism
- High performance I-Cache
- Fetch two cache lines
- Predict 3 branches
- Fetch stops at the first taken branch
35Diverge and Merge?
36Useful Dynamic Predication Mode
37Perfect Branch Prediction
38Maximum Power
39Branch Predictor Effects
40Confidence Estimator Effects
41Results in Less Aggressive Processors
42DMP vs. Perfect Conditional BP
43Enhanced DMP Mechanisms
44DMP vs. Other Mechanisms
45Comparisons with Predication/Wish Branches
non-predicated
46Reduction in Pipeline Flushes
- Average overhead
  - Dynamic-hammock: 4 instructions/entry
  - Dual-path: 150 instructions/entry
  - Multipath: 200 instructions/entry
  - DMP: 20 instructions/entry
47Handling Nested Diverge Branches
- Basic DMP
  - Ignore other low-confidence diverge branches
- Enhanced DMP
  - Exit dynamic predication mode and re-enter from the younger low-confidence branch on the predicted path (Yield policy)
[Figure: control-flow graph with blocks A to H; the diverge branch is in A and the CFM point is H]
48Compiler Support CGO07
- Compiler analyzes the control flow and the profile data
  - Step 1: Identify diverge branch candidates and CFM points.
  - Step 2: Select diverge branches based on
    - (1) the number of instructions between a branch and the CFM point
    - (2) the probability of merging at the CFM point
    - Heuristics or a cost-benefit model
  - Step 3: Mark the selected branches/CFM points.
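Step 2 can be pictured as a filter over profiled candidates. The sketch below uses the heuristic form with two thresholds; the Candidate fields, the function name, and the threshold values are hypothetical and are not taken from the slides or from the CGO07 work.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    branch_pc: int        # address of the candidate diverge branch
    cfm_pc: int           # address of the candidate CFM point
    insts_to_cfm: int     # instructions between the branch and the CFM point
    merge_prob: float     # profiled probability that both paths merge at cfm_pc


def select_diverge_branches(candidates, max_insts=50, min_merge_prob=0.9):
    """Keep candidates that are short enough and merge often enough (assumed thresholds)."""
    return [c for c in candidates
            if c.insts_to_cfm <= max_insts and c.merge_prob >= min_merge_prob]
```

A cost-benefit model would replace the two fixed thresholds with an estimate of cycles saved versus predication overhead for each candidate.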
49Future Research
- Hardware Support
  - Better confidence estimators
  - Efficient hardware mechanism to detect diverge branches and CFM points
    - Increase hardware complexity but eliminate the need for ISA/compiler support
- Compiler Support
  - Better compiler algorithms [CGO07]
50Power Measurement Configurations
- 100 nm Technology
- Baseline processor
  - 4 GHz
- Less aggressive processor
  - 1.5 GHz
- CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power
- DMP: one more RAT/RAS/GHR, select-µop generation module, additional fields in BTB, predicate registers, CFM registers, load-store forwarding, instruction retirement
51Fetched wrong-path instructions per entry into
dynamic-predication/dual-path mode
52Fetched/Executed Instructions
53ISA Support
- Example of Diverge Br and CFM markers
[Figure: branch instruction format with OPCODE, TARGET, and CFM rel address fields, plus a 2-bit branch-type marker]
00: normal branch
10: diverge forward branch
11: diverge loop branch
CFM = CFM rel address + PC
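Decoding these markers could look like the sketch below. It assumes the 2-bit marker and the relative CFM address have already been extracted from the instruction word, since the slides do not give field positions or widths; the enum and function names are illustrative. The encoding 01 is not listed on the slide, so it is rejected here.

```python
from enum import Enum


class BranchType(Enum):
    NORMAL = 0b00
    DIVERGE_FORWARD = 0b10
    DIVERGE_LOOP = 0b11


def decode_branch(type_bits: int, pc: int, cfm_rel: int):
    """Return the branch type and, for diverge branches, the absolute CFM address."""
    kind = BranchType(type_bits)          # raises ValueError for the unlisted 0b01
    if kind is BranchType.NORMAL:
        return kind, None
    return kind, pc + cfm_rel             # CFM = CFM rel address + PC
```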
54Entering Dynamic Predication Mode
- Entry condition
  - When a diverge branch has low confidence.
- The Front-end
  - Stores the address of the CFM point to the CFM register.
  - Forks the RAS, GHR, and RAT.
  - Allocates a predicate register.
- Fetch Mechanisms
  - Round-robin fetch from two paths
  - The processor follows the branch predictor until it reaches the corresponding CFM point.
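The fetch behavior can be sketched at basic-block granularity. This is a simplification: real hardware would alternate per fetch cycle rather than per block. The function name and the representation of each path as a list of block names (in the order the branch predictor would follow them, ending at the CFM point) are assumptions made for illustration.

```python
def round_robin_fetch(path1, path2, cfm):
    """Alternate between two paths until both have reached the CFM point."""
    paths = [list(path1), list(path2)]
    for p in paths:
        if cfm not in p:
            # A path that never merges would be handled by an exit policy instead.
            raise ValueError("path does not reach the CFM point")
    order = []
    pos = [0, 0]
    done = [False, False]
    turn = 0
    while not all(done):
        if not done[turn]:
            block = paths[turn][pos[turn]]
            if block == cfm:
                done[turn] = True      # this path waits at the CFM point
            else:
                order.append(block)    # fetch one block from this path
                pos[turn] += 1
        turn ^= 1                      # switch to the other path
    order.append(cfm)                  # the CFM block is fetched only once
    return order
```

For example, round_robin_fetch(["B", "H"], ["C", "E", "H"], "H") yields B, C, E, H: once the shorter path reaches H, fetch continues on the other path alone, and H itself is fetched a single time.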
55Exiting Dynamic Predication Mode
- Exit condition
  - Both paths of a diverge branch have reached the corresponding CFM point.
  - A diverge branch is resolved.
- Select-µop mechanism
  - Similar to φ-node in SSA
  - Merges register values from two paths.
56Multipath Execution
[Figure: multipath execution after low-confidence branches, fetching four paths (path 1 to path 4) through blocks A to I; blocks H and I appear on every path.]
Instructions after the control-flow merge point are fetched multiple times. Waste of resources and energy.
57Modeling Software Predication
- Mark using a binary instrumentation tool
- All simple and nested hammocks can be predicated.
- All instructions between a branch and the control-flow merge point are fetched.
- All nested branches are predicated.