Diverge-Merge Processor (DMP) - PowerPoint PPT Presentation

Slides: 58
Provided by: hyes
Learn more at: http://users.ece.cmu.edu

Transcript and Presenter's Notes

1
Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu, Yale N. Patt
HPS Research Group, University of Texas at Austin; Microsoft Research

2
Outline
  • Predicated Execution
  • Diverge-Merge Processor (DMP)
  • Implementation of DMP
  • Experimental Evaluation
  • Conclusion

3
Predicated Execution
(predicated code)
p1 = (cond)
(!p1) mov b, 1
(p1) mov b, 0
[Figure: control-flow graph with blocks A, B, and C]
  • Convert control flow dependence to data dependence
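
As a rough illustration of the conversion on this slide, the Python sketch below models the branch version and the predicated version of the same assignment. The function names and the use of Python conditionals to stand in for predicated moves are assumptions made for illustration; they are not from the presentation.

```python
def branchy(cond: bool) -> int:
    """Control-flow version: which mov executes depends on the branch."""
    if cond:
        b = 0  # taken path: mov b, 0
    else:
        b = 1  # fall-through path: mov b, 1
    return b


def predicated(cond: bool) -> int:
    """If-converted version: both movs are fetched, the predicate guards each."""
    p1 = cond
    b = None
    b = 1 if not p1 else b  # (!p1) mov b, 1
    b = 0 if p1 else b      # (p1)  mov b, 0
    return b
```

Both functions return the same value for either input. In the second one there is no branch to mispredict, because the result depends on `p1` as data.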

4
Benefit of Predicated Execution
  • Predicated Execution can be high performance and
    energy-efficient.

[Figure: pipeline stages (Fetch, Decode, Rename, Schedule, RegisterRead, Execute) under predicated execution, where blocks A, B, and C are fetched and some instructions become nops, and under branch prediction, where a misprediction ends in "Pipeline flush!!"]

5
Limitations/Problems of Predication
  • ISA: Predicate registers and predicated
    instructions
      • Dynamic-Hammock Predication [Klauser98] can solve
        this problem, but it is only applicable to simple
        hammocks.
  • Adaptivity: Static predication is not adaptive to
    run-time branch behavior.
      • Branch behavior changes based on input set,
        phase, and control-flow path.
      • Wish Branches [Kim05]
  • Complex CFG: A large subset of control-flow
    graphs is not converted to predicated code.
      • Function calls, loops, many instructions inside a
        region, and complex CFGs
      • Hyperblock [Mahlke92] cannot adapt to
        frequently-executed paths dynamically.

7
Diverge-Merge Processor (DMP)
  • DMP can dynamically predicate complex branches
    (in addition to simple hammocks).
  • The compiler identifies
      • Diverge branches
      • Control-flow merge (CFM) points
  • The microarchitecture decides when and what to
    predicate dynamically.

8
Dynamic Predication
[Figure: control-flow graph with a low-confidence branch in block A, blocks B and C, and merge block H]
(mov R1, 1)  PR10 = 1
(mov R1, 0)  PR11 = 0
select-µops (φ-nodes in SSA): PR12 = (cond) ? PR11 : PR10
JOIN: add R5, R1, 1
Klauser et al. [PACT98]: Dynamic-hammock predication

9
Diverge-Merge Processor
[Figure: control-flow graph (blocks A to H) showing the diverge branch, the CFM point at H where select-µops are inserted, and a legend: frequently executed path / not frequently executed path]

10
Diverge-Merge Processor
[Figure: the same control-flow graph (blocks A to H); legend: frequently executed path, not frequently executed path, diverge branch, executed block, CFM point]

11
Control-Flow Graphs
[Figure: control-flow graph types handled by DMP, dynamic hammock, SW pred., wish br., and dual-path]

12
Dual-path Execution vs. DMP
[Figure: dual-path execution and DMP each fetch path 1 and path 2 after a low-confidence branch in block A; blocks B, C, D, E, and F are shown, with the CFM point marked for DMP]

13
Control-Flow Graphs
[Figure: control-flow graph types handled by DMP, dynamic-hammock, SW pred., wish br., and dual-path; two entries are marked "sometimes"]

14
Distribution of Mispredicted Branches
  • 66% of mispredicted branches can be dynamically
    predicated in DMP.

17
Fetch Mechanism
[Figure: control-flow graph (blocks A to H) with a low-confidence diverge branch at A and the CFM point at H; the two paths are fetched round-robin, each following the predicted path]

18
Dynamic Predication
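[Figure: two register alias tables, RAT1 and RAT2, with columns Arch., Phy., and M. Both start from R1 → PR11, R2 → PR12, R3 → PR13. In RAT1, R1 is renamed to PR21 with M = 1; in RAT2, R1 is renamed to PR31 with M = 1. After the select-µop, R1 maps to PR41.]
select-µop: pr41 = p1 ? pr21 : pr31
Forks RAT, RAS, and GHR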
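
The renaming step on this slide can be sketched in Python as follows. This is a minimal model under stated assumptions: the `RAT` class, the `make_select_uops` helper, the `fresh` callback for allocating a physical register, and the tuple layout of a select-µop are all hypothetical names chosen for illustration, not the hardware design from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class RAT:
    """Register alias table: architectural register -> physical register."""
    mapping: dict
    modified: set = field(default_factory=set)  # registers with the M bit set

    def fork(self) -> "RAT":
        # The forked copy starts with the same mappings and all M bits clear.
        return RAT(dict(self.mapping), set())

    def rename(self, arch: str, phys: str) -> None:
        self.mapping[arch] = phys
        self.modified.add(arch)


def make_select_uops(rat1, rat2, predicate, fresh):
    """At the CFM point, emit one select-uop per register renamed on either path.

    Each uop is (dest, predicate, value_from_rat1, value_from_rat2), read as
    dest = predicate ? value_from_rat1 : value_from_rat2.
    """
    uops = []
    merged = dict(rat1.mapping)
    for arch in sorted(rat1.modified | rat2.modified):
        dest = fresh(arch)  # hypothetical allocator for a new physical register
        uops.append((dest, predicate, rat1.mapping[arch], rat2.mapping[arch]))
        merged[arch] = dest
    return uops, merged
```

With the register names from the slide, forking a table that holds R1 → PR11, renaming R1 to PR21 on one path and to PR31 on the other, and allocating PR41 at the merge yields the single select-µop `("PR41", "p1", "PR21", "PR31")`. Registers that neither path renamed need no select-µop.
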
19
DMP Support
  • ISA Support
      • Mark diverge branches/CFM points.
  • Compiler Support [CGO07]
      • The compiler identifies diverge branches and the
        corresponding CFM points.
  • Hardware Support
      • Confidence estimator
      • Fetch mechanisms
      • Load/store processing
      • Instruction retirement
      • Dynamic predication

20
Hardware Complexity Analysis
[Table: hardware components compared across mechanisms. Columns: SW pred., Dual-path, Wish br., Multipath, Dyn. ham., DMP. Rows: Front-End, Confidence Estimator, Rename Support, Predicate Registers, Select-µop Gen., ST-LD Forwarding, Check Flush/no Flush. Cell entries are not legible in this transcript.]

22
Simulation Methodology
  • 12 SPEC 2000 INT, 5 SPEC 95 INT
  • Different input sets for profiling and evaluation
  • Alpha ISA execution driven simulator
  • Baseline processor configuration
      • 64KB perceptron predictor/O-GEHL (paper)
      • Minimum 30-cycle branch misprediction penalty
      • 8-wide, 512-entry instruction window
      • 2 KB 12-bit history enhanced JRS confidence
        estimator
  • Less aggressive processor (paper)
  • Power model using Wattch
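
For readers unfamiliar with JRS-style confidence estimation, the Python sketch below shows the general idea: a table of resetting counters, indexed by the branch address XORed with global branch history, where a counter at or above a threshold means high confidence. The 4-bit counter width, the threshold of 15, and the exact indexing are assumptions (4096 four-bit counters happen to total 2 KB, which matches the size on the slide); the slide does not give these details.

```python
class JRSConfidenceEstimator:
    """Resetting-counter confidence estimator (illustrative parameters)."""

    def __init__(self, history_bits=12, counter_bits=4, threshold=15):
        self.mask = (1 << history_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold           # assumed value
        self.table = [0] * (1 << history_bits)
        self.history = 0                     # global branch history register

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def is_high_confidence(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool, prediction_correct: bool) -> None:
        i = self._index(pc)
        if prediction_correct:
            # Count consecutive correct predictions, saturating at the maximum.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # A misprediction resets the counter to zero.
            self.table[i] = 0
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

In DMP terms, a diverge branch whose counter is below the threshold would be treated as low-confidence and become a candidate for dynamic predication.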

23
Different CFG types
24
Performance Improvement
25
Energy Consumption
27
Conclusion
  • DMP introduces the concept of frequently-hammocks
    and dynamically predicates complex CFGs.
  • DMP can overcome the three major limitations of
    software predication: ISA support, adaptivity,
    and complex CFGs.
  • DMP reduces branch mispredictions energy-efficiently.
      • 19% performance improvement, 9% less energy
  • DMP divides the work between the compiler and the
    microarchitecture.
      • The compiler analyzes the control-flow graphs.
      • The microarchitecture decides when and what to
        predicate dynamically.

28
Thank You!!
29
Questions?
30
Handling Mispredictions
[Figure: three copies of the control-flow graph (blocks A to H) showing the diverge branch, the predicted path, the CFM point, a misprediction, and a flush; paths are annotated with (0) and (1)]
select-µop: pr41 = p1 ? pr21 : pr31
add pr34 ← pr31, pr13
add pr44 ← pr34, -1  (!p1)
add pr24 ← pr41, pr13

31
Loop Branches
  • Exit Condition
      • The loop branch is predicted to exit the loop.
  • Benefit
      • Reduced pipeline flushes when the predicated
        loop is iterated more times than it should be.
      • Instructions in the extra iterations of the loop
        become NOPs. Instructions after loop-exit can
        still be executed.
  • Negative Effects
      • Increased execution delay of loop-carried
        dependencies
      • The overhead of select-µops

32
Loop Branches
  • Predicate each loop iteration separately

select-uop: pr22 = p1 ? pr21 : pr11
select-uop: pr23 = p1 ? pr20 : pr10
select-uop: pr32 = p2 ? pr31 : pr22
select-uop: pr33 = p2 ? pr30 : pr23
Loop br. is predicted to exit the loop
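
The chain of select-uops above can be read as a recurrence: each fetched iteration computes a new value, and its predicate decides whether that value or the previous one is carried forward. The Python sketch below models this; the function name and the representation of the loop body as a callable are assumptions made for illustration.

```python
def run_predicated_loop(initial, body, predicates):
    """Model per-iteration loop predication.

    For iteration i: value_i = p_i ? body(value_{i-1}) : value_{i-1}
    """
    value = initial
    for p in predicates:
        new_value = body(value)            # the iteration is fetched and executed
        value = new_value if p else value  # select-uop keeps or discards it
    return value
```

For example, `run_predicated_loop(0, lambda v: v + 1, [True, True, False, False])` returns 2: the two extra iterations act as NOPs because their predicates are false. The serial dependence through `value` is also visible here, which corresponds to the increased delay of loop-carried dependencies noted on the previous slide.
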
33
Enhanced Mechanisms
  • Multiple CFM points
      • The hardware chooses one CFM point for each
        instance of dynamic predication.
  • Exit Optimizations
      • Counter Policy: What if one path does not reach
        the CFM point?
          • Number of fetched instructions > Threshold
      • Yield Policy: What if another low-confidence
        diverge branch is encountered in dynamic
        predication mode?
          • A later low-confidence branch is more likely
            to be mispredicted.

[Figure: control-flow graph with blocks A to H]
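
The two exit optimizations above can be sketched as a small decision function. This is only an illustration: the `Action` names, the parameter names, and the choice to check the yield condition before the counter condition are assumptions, and the slide does not give a threshold value.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "stay in dynamic predication mode"
    EXIT = "exit dynamic predication mode"
    YIELD = "exit and re-enter from the younger low-confidence branch"


def exit_policy(fetched: int, threshold: int,
                reached_cfm: bool, sees_low_conf_diverge: bool) -> Action:
    """Decide what to do for the path currently being fetched."""
    if sees_low_conf_diverge:
        # Yield policy: the later low-confidence branch is the likelier culprit.
        return Action.YIELD
    if not reached_cfm and fetched > threshold:
        # Counter policy: too many instructions fetched without merging.
        return Action.EXIT
    return Action.CONTINUE
```
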
34
Detailed DMP Support
  • 32 Predicate register ids
  • Fetch mechanism
  • High performance I-Cache
  • Fetch two cache lines
  • Predict 3 branches
  • Fetch stops at the first taken branch

35
Diverge and Merge?
36
Useful Dynamic Predication Mode
37
Perfect Branch Prediction
38
Maximum Power
39
Branch Predictor Effects
40
Confidence Estimator Effects
41
Results in Less Aggressive Processors
42
DMP vs. Perfect Conditional BP
43
Enhanced DMP Mechanisms
44
DMP vs. Other Mechanisms
45
Comparisons with Predication/Wish Branches
non-predicated
46
Reduction in Pipeline Flushes
  • Average overhead
      • Dynamic-hammock: 4 instructions/entry
      • Dual-path: 150 instructions/entry
      • Multipath: 200 instructions/entry
      • DMP: 20 instructions/entry

47
Handling Nested Diverge Branches
  • Basic DMP
      • Ignore other low-confidence diverge branches
  • Enhanced DMP
      • Exit dynamic predication mode and re-enter from
        the younger low-confidence branch on the predicted
        path (Yield policy)

[Figure: control-flow graph (blocks A to H) with the diverge branch at A and the CFM point at H]

48
Compiler Support [CGO07]
  • The compiler analyzes the control flow and the
    profile data.
      • Step 1: Identify diverge branch candidates and CFM
        points.
      • Step 2: Select diverge branches based on
          • (1) the number of instructions between a branch
            and the CFM point
          • (2) the probability of merging at the CFM point
          • Heuristics or a cost-benefit model
      • Step 3: Mark the selected branches/CFM points.
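
Step 2 can be pictured as a filter over profiled candidates. The Python sketch below uses a simple threshold heuristic; the `Candidate` fields, the function name, and both threshold values are hypothetical, and the actual compiler algorithm (including the cost-benefit model) is described in the cited CGO07 work rather than here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    branch: str          # label of the candidate diverge branch
    cfm_point: str       # label of its candidate CFM point
    insts_to_cfm: int    # instructions between the branch and the CFM point
    merge_prob: float    # profiled probability of merging at the CFM point


def select_diverge_branches(candidates, max_insts=50, min_merge_prob=0.9):
    """Keep candidates that are short enough and merge often enough.

    Both thresholds are placeholder values, not taken from the presentation.
    """
    return [c for c in candidates
            if c.insts_to_cfm <= max_insts and c.merge_prob >= min_merge_prob]
```

A candidate with a long region or a low merge probability is dropped, since dynamic predication would fetch many instructions that are unlikely to be useful.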

49
Future Research
  • Hardware Support
      • Better confidence estimators
      • Efficient hardware mechanism to detect diverge
        branches and CFM points
          • Increases hardware complexity but eliminates the
            need for ISA/compiler support
  • Compiler Support
      • Better compiler algorithms [CGO07]

50
Power Measurement Configurations
  • 100 nm technology
  • Baseline processor
      • 4 GHz
  • Less aggressive processor
      • 1.5 GHz
  • CC3 clock-gating model in Wattch: unused units
    dissipate only 10% of their maximum power
  • DMP: one more RAT/RAS/GHR, select-uop generation
    module, additional fields in BTB, predicate
    registers, CFM registers, load-store forwarding,
    instruction retirement
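
As a small worked example of the clock-gating rule above, the Python sketch below charges an idle unit 10% of its maximum power. The linear scaling for partially used units and the function and parameter names are assumptions for illustration; the slide only states the 10% figure for unused units.

```python
def cc3_power(max_power: float, active_fraction: float) -> float:
    """Per-cycle power of one unit under a CC3-style clock-gating model.

    active_fraction is the share of the unit (for example, ports) in use.
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    if active_fraction == 0.0:
        return 0.10 * max_power          # unused unit: 10% of maximum power
    return max_power * active_fraction   # assumed linear scaling with usage
```

For a unit with a 10 W maximum, an idle cycle costs 1 W under this model and a half-used cycle costs 5 W.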

51
Fetched wrong-path instructions per entry into
dynamic-predication/dual-path mode
52
Fetched/Executed Instructions
53
ISA Support
  • Example of diverge branch and CFM markers

[Figure: branch instruction format with OPCODE, TARGET, and CFM rel address fields]
00: normal branch
10: diverge forward branch
11: diverge loop branch
CFM = CFM rel address + PC
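
A decoder for the marker bits listed above might look like the Python sketch below. The enum and function names are hypothetical, and the encoding 01 is left undefined because the slide does not list it.

```python
from enum import Enum


class BranchKind(Enum):
    NORMAL = 0b00
    DIVERGE_FORWARD = 0b10
    DIVERGE_LOOP = 0b11


def decode_marker(bits: int) -> BranchKind:
    """Map the two marker bits to a branch kind.

    Raises ValueError for 0b01, which the slide does not define.
    """
    return BranchKind(bits)


def cfm_address(pc: int, cfm_rel: int) -> int:
    """The CFM point is encoded relative to the branch's PC."""
    return pc + cfm_rel
```

For instance, a diverge forward branch at PC 0x1000 with a relative CFM field of 0x40 would have its CFM point at 0x1040.
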
54
Entering Dynamic Predication Mode
  • Entry condition
      • When a diverge branch has low confidence.
  • The front-end
      • Stores the address of the CFM point to the CFM
        register.
      • Forks the RAS, GHR, and RAT.
      • Allocates a predicate register.
  • Fetch mechanisms
      • Round-robin fetch from two paths
      • The processor follows the branch predictor until
        it reaches the corresponding CFM point.

55
Exiting Dynamic Predication Mode
  • Exit condition
      • Both paths of a diverge branch have reached the
        corresponding CFM point.
      • A diverge branch is resolved.
  • Select-µop mechanism
      • Similar to φ-node in SSA
      • Merges register values from two paths.
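
The fetch behavior between entering and exiting dynamic predication mode can be sketched as a toy simulation in Python. Alternating one basic block at a time, the function name, and the `(path, block)` output format are assumptions for illustration; the slides say only that fetch is round-robin and that each path follows the branch predictor until its CFM point. The early exit when the diverge branch resolves is not modeled.

```python
def round_robin_fetch(path1, path2, cfm):
    """Return the fetch order as (path number, block) pairs.

    path1 and path2 list the blocks each predicted path would fetch, ending
    at the CFM point. Path number 0 marks the single merged fetch of the CFM.
    """
    paths = [iter(path1), iter(path2)]
    reached = [False, False]
    order = []
    turn = 0
    while not all(reached):
        if not reached[turn]:
            block = next(paths[turn], None)
            if block is None:
                raise ValueError(f"path {turn + 1} never reaches the CFM point")
            if block == cfm:
                reached[turn] = True   # this path waits for the other one
            else:
                order.append((turn + 1, block))
        turn = 1 - turn                # alternate between the two paths
    order.append((0, cfm))             # exit: continue on one path from the CFM
    return order
```

With hypothetical paths `["B", "H"]` and `["C", "D", "H"]` and CFM point `"H"`, the order is path 1 fetching B, path 2 fetching C and then D, and finally H fetched once.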

56
Multipath Execution
[Figure: multipath execution forks at each low-confidence branch, giving path 1 through path 4 over blocks A to I; blocks H and I are repeated across the paths]
Instructions after the control-flow merge point
are fetched multiple times. Waste of resources
and energy.

57
Modeling Software Predication
  • Mark using a binary instrumentation tool
  • All simple and nested hammocks can be predicated.
  • All instructions between a branch and the
    control-flow merge point are fetched.
  • All nested branches are predicated.