Title: PEEP: Exploiting Predictability of Memory Dependences in SMT Processors
1. PEEP: Exploiting Predictability of Memory Dependences in SMT Processors
- Samantika Subramaniam, Milos Prvulovic, Gabriel H. Loh
2. Simplified view of SMT execution
[Diagram: front-end with Icache feeding reservation stations and execution units]
- Stores per-thread state
- Enough work from all threads put together yields high throughput
3. Something bad happens
[Diagram: a producer instruction stalls in the execution units, backing up the reservation stations]
- A low-ILP thread eventually uses up the CPU resources
- Other, independent high-ILP threads are forced to stall
- This defeats the purpose of SMT
- Tackle the problem at the source: the FETCH UNIT
4. Previously proposed solution
ICOUNT (Instruction Count), Tullsen et al., ISCA 1996
- Count the number of instructions in the pipeline per thread
- Fetch policy: give less priority to the thread with more instructions in flight
[Diagram: clogged resources ("OOPS!") - the front-end reacts only after the reservation stations have already filled up]
REACTIVE EXCLUSION!
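As a concrete illustration, the ICOUNT priority rule above can be sketched in a few lines of Python; the thread names and in-flight counts here are made up for the example, not from the paper:

```python
# Minimal sketch of the ICOUNT fetch policy (Tullsen et al., ISCA 1996):
# each cycle, fetch from the thread with the fewest instructions in flight.

def icount_pick(in_flight: dict[str, int]) -> str:
    """Return the thread id with the lowest in-pipeline instruction count."""
    return min(in_flight, key=in_flight.get)

# Example: thread T1 has clogged the pipeline with 48 stalled instructions,
# so ICOUNT deprioritizes it -- but only *after* the clog has formed.
counts = {"T0": 12, "T1": 48, "T2": 9, "T3": 15}
print(icount_pick(counts))  # -> "T2"
```

The rule is purely reactive: T1 is punished only once its instruction count is already high, which is exactly the weakness the slide calls out.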
5. So can we do better?
[Diagram: an oracle in front of the fetch unit steers fetch away from threads that are about to stall]
PROACTIVE EXCLUSION!
6. Proactive Exclusion (PE) Strategies
- Load misses: Moursy et al., ISCA 2003
  - predicted load miss → GATE the thread
- MLP: Eyerman et al., HPCA 2007
  - all available MLP exposed → GATE the thread
7. A Brief Overview of Memory Dependences
[Diagram: LSQ holding ST 1 (addr 0xF023), LD 1 (addr 0xF380), ST 2 (addr unresolved), and LD 2 (addr 0xF060); a PC-indexed Memory Dependence Predictor marks LD 2 (PRED = 1) as dependent on the unresolved store]
- Memory dependences are predictable
- The predictor can indicate future stalls
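A minimal sketch of such a predictor, assuming a load-wait-table-style design (a PC-indexed table of wait bits; the table size, hashing, and training interface are illustrative assumptions):

```python
# Hedged sketch of a PC-indexed memory dependence predictor, in the spirit
# of a load wait table: a load's PC maps to a bit saying "this load has
# conflicted with an earlier store before, so it will likely stall on an
# unresolved store address again."

TABLE_SIZE = 1024  # illustrative size

class MemDepPredictor:
    def __init__(self):
        self.wait_bit = [False] * TABLE_SIZE

    def _index(self, pc: int) -> int:
        return pc % TABLE_SIZE

    def predict(self, load_pc: int) -> bool:
        """True = this load is predicted to depend on an in-flight store."""
        return self.wait_bit[self._index(load_pc)]

    def train(self, load_pc: int, conflicted: bool) -> None:
        """Set the wait bit when the load was stalled/squashed by a store."""
        if conflicted:
            self.wait_bit[self._index(load_pc)] = True

mdp = MemDepPredictor()
mdp.train(0xF060, conflicted=True)  # the load at 0xF060 once conflicted
print(mdp.predict(0xF060))          # -> True
print(mdp.predict(0xF380))          # -> False
```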
8. Proactive Exclusion using Memory Dependences
[Diagram: four threads T0-T3 at the fetch unit; ST and LD instructions flow per thread, and the thread with a predicted ST→LD dependence is gated]
- Learn ST-LD relationships: once "ST A; LD A" is observed, a later "ST ?; LD A" (store address still unresolved) predicts a stall
9. Starvation Problem with Proactive Exclusion
[Diagram: for threads T0-T3, the gated thread's instructions enter the reservation stations only well after its stall resolves, leaving the RS underutilized]
- Exclusion (under any strategy) can cause temporary STARVATION
- Especially bad for short-duration stalls!
10. Short Duration Stall
[Diagram: original schedule "ST A; LD A; ADD; SUB" vs. PE, where the Memory Dependence Predictor sees "ST ?; LD A", gates the thread, and the independent ADD/SUB are needlessly delayed]
11. Can we avoid starvation?
- With PE based on memory dependences, we can
[Diagram: the Memory Dependence Predictor entry for the load at 0xF060 is augmented with an observed stall delay of 20 cycles]
12. Delay Predictor Details
[Diagram: Memory Dependence Predictor entry indexed by PC, with PRED = 1 and DELAY = 20]
- Conservative: maximum observed delay
- Aggressive: last observed delay
- Adaptive: average of the last n observed delays
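The three strategies above can be sketched as follows; the history length n and the observe/predict interface are assumptions for illustration:

```python
# Sketch of the three delay-prediction strategies named on the slide:
# conservative (max observed), aggressive (last observed), and
# adaptive (average of the last n observed delays).

from collections import deque

class DelayPredictor:
    def __init__(self, strategy: str, n: int = 4):
        self.strategy = strategy
        self.history = deque(maxlen=n)  # recent observed stall delays
        self.max_delay = 0

    def observe(self, delay: int) -> None:
        self.history.append(delay)
        self.max_delay = max(self.max_delay, delay)

    def predict(self) -> int:
        if not self.history:
            return 0
        if self.strategy == "conservative":  # maximum observed delay
            return self.max_delay
        if self.strategy == "aggressive":    # last observed delay
            return self.history[-1]
        # adaptive: average of the last n observed delays
        return sum(self.history) // len(self.history)

p = DelayPredictor("adaptive", n=4)
for d in (10, 20, 30, 20):
    p.observe(d)
print(p.predict())  # -> 20
```

Conservative never under-predicts a stall it has seen before, at the cost of over-gating; aggressive tracks the most recent behavior; adaptive smooths over recent history.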
13. How does this help us?
[Diagram: original schedule "ST A; LD A; ADD; SUB" vs. PE with the Memory Dependence Predictor; with delay information, gating is applied only until the store address resolves, so the independent ADD/SUB are not needlessly delayed]
- Choose an appropriate delay threshold
14. Performance Impact of Delay Information: Phase 1
[Diagram: the MDP (P = prediction bit, D = delay) predicts that LD 1 (to 0xF060) depends on ST 1, whose address is still unresolved, with an observed delay of 20 cycles; the dependent instructions wait in the reservation stations ahead of the execution units]
15. Phase 2
[Diagram: the predicted delay (20 cycles) exceeds the delay threshold (front-end depth = 5), so the front-end gates fetch from this thread]
16. PE without delay information: Phase 3
[Diagram: fetch restarts only when the stall resolves; with a front-end depth of 5, instructions enter the reservation stations 5 cycles after the 20-cycle stall, 25 cycles in total]
- Instructions enter the RS only after the stall resolves
17. PE with delay information: Phase 3
[Diagram: fetch restarts at cycle 15, 5 cycles before the predicted 20-cycle stall resolves, so instructions enter the reservation stations at cycle 20]
- Instructions enter the RS right in time as the stall resolves
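The timing arithmetic of the two cases above can be sketched directly, using the front-end depth of 5 and the 20-cycle stall from the slides:

```python
# Early-parole timing from the two Phase 3 slides: with a front-end depth
# of 5 and a predicted 20-cycle stall, restarting fetch 5 cycles before
# the stall resolves lets instructions reach the reservation stations
# exactly as the stall ends.

FE_DEPTH = 5
PREDICTED_DELAY = 20

def rs_arrival_without_delay_info(stall: int, depth: int) -> int:
    # Fetch restarts only when the stall resolves;
    # instructions arrive `depth` cycles later.
    return stall + depth

def rs_arrival_with_early_parole(stall: int, depth: int) -> int:
    # Fetch restarts (stall - depth) cycles into the stall;
    # instructions arrive `depth` cycles after that.
    restart = max(stall - depth, 0)
    return restart + depth

print(rs_arrival_without_delay_info(PREDICTED_DELAY, FE_DEPTH))  # -> 25
print(rs_arrival_with_early_parole(PREDICTED_DELAY, FE_DEPTH))   # -> 20
```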
18. What does this give us? PEEP
- Proactive Exclusion
  - When a memory dependence stall is predicted
- Avoid starvation
  - Ignore short stalls
- Give the thread a head start
  - Restart fetch of the gated thread a few cycles before the stall resolves
  - Early Parole!
PROACTIVE EXCLUSION AND EARLY PAROLE
19. PEEP In Our Context
[Diagram: the Memory Dependence and Delay Predictor predicts a 20-cycle stall; the gated thread's fetch restarts after (predicted delay - FE pipeline depth) = 15 cycles]
20. Simulation Parameters
- Aggressive four-way SMT processor
- MDP modeled on a Load Wait Table
- SPEC2000, MediaBench and other benchmarks
- 32 four-thread application mixes evaluated
- Application classification:
  - S: sensitive to memory dependences
  - N: non-sensitive to memory dependences
  - L: low ILP; M: medium ILP; H: high ILP
21. Proactive Exclusion Strategies
(S: sensitive, N: non-sensitive, L: low ILP, M: medium ILP, H: high ILP)
- PE using memory dependences shows a 13% speedup
- Maximum benefit comes from mixes of both sensitive (S) and non-sensitive (N) threads
- With all-sensitive threads, all PE strategies perform comparably
22. PEEP
17% speedup
- PEEP using delay prediction outperforms both the MLP-based and the memory-dependence-only PE strategies
- With all-sensitive threads, PEEP does better since it can predict stall durations accurately
- PEEP with an oracle-based MDP shows a performance speedup of 19%
23. 2-threaded Workloads
12% speedup
- Fewer threads → fewer opportunities to fetch from non-stalled threads
- A 12% speedup over 25 application mixes shows there is potential benefit even in a 2-way SMT
- An Intel simulator shows an 8% speedup over 150 application mixes
24. Relationship with OOO Load Scheduling
- Hypothesis: the performance benefit is purely due to a more efficient fetch policy based on a highly predictable attribute
- Experiment: run PEEP on a processor without OOO memory scheduling, so the prediction is used only for controlling the fetch policy
- Result: avg. speedup over ICOUNT is 17% (same as PEEP!)
- Conclusion: memory dependences are a very good indicator of future stalls; even a machine without load reordering benefits from predicting these stalls
25. Why does it work so well?
[Diagram: LMP vs. PEEP on the same stream LD 1, ST 1, LD 2, LD 3, LD 4, showing how differently each policy fills the reservation stations]
26. Why does it work so well? (continued)
[Diagram: LMP vs. PEEP vs. MLP gating decisions on the stream LD 1, ST 1, LD 2, ADD, SUB, and the resulting reservation station occupancy]
- PEEP can expose more ILP
27. Key Points
- We need a mechanism for efficient resource management in SMT: improve the fetch unit
- Memory dependences and their associated latencies are predictable
- Proactively Exclude bad threads, but give them Early Parole to avoid temporary starvation
- Performance improvements on both 4-way and 2-way SMT machines
28. Thank You (www.cc.gatech.edu/samantik)
[Cartoon: a queue of gated LD instructions asking "When will I get paroled?"]
29. B1: Sensitivity Analysis
30. [Charts: sensitivity to predictor size and delay threshold]
31. B2: PEEP
17.3% speedup
- Memory dependences are a very good indicator of future stalls
- Performance shows that PEEP works because it leverages knowledge of future stalls to improve instruction fetch
32. B3: Fairness
19% speedup
- Speedup is computed as the harmonic mean of weighted IPCs
- Since all PE strategies run on top of ICOUNT, they inherit its fairness
- SDS (standard deviation of speedup) is 0.17 for PEEP and 0.11 for ICOUNT
33. B4: OOO memory scheduling on an SMT machine
34. B5: Accuracy of the MDP
35. B6: Delays associated with PEEP
36. B7: Delay Predictors
- Conservative: maximum observed delay
- Aggressive: last observed delay
- Adaptive: average of the last n observed delays
37. B8: Simulator Configuration
38. 4-threaded mixes
39. 2-threaded mixes