Title: Statistical Simulation of Superscalar Architectures using Commercial Workloads
1Statistical Simulation of Superscalar
Architectures using Commercial Workloads
- Lieven Eeckhout and Koen De Bosschere
- Dept. of Electronics and Information Systems (ELIS), Ghent University, Belgium
- CAECW01, January 21, 2001
2Outline
- Introduction
- Statistical Simulation
- Statistical profiling
- Synthetic trace generation
- Methodology
- Evaluation
- Conclusion
3Introduction
- Architectural simulation
- trace-driven or execution-driven
- accurate
- long simulation times
- long traces to be stored
- Need for fast simulation techniques
- take part of a full trace
- analytical modeling
- trace sampling
- statistical simulation
4Goal
- Previous work used SPEC benchmarks to evaluate statistical simulation
- In this talk we use both commercial and scientific workloads
- SPECint, SPECfp, system traces, multimedia, X graphics, database
5Statistical Simulation
- Three steps
- extract a statistical profile from a program execution
- generate a synthetic trace from it
- simulate the synthetic trace on a trace-driven simulator
- Two major advantages
- statistical profile is more compact than the full trace
- fast simulation due to statistical nature
- design space exploration in limited time
6Statistical Simulation
[Figure: flow diagram. A real trace (e.g. SPEC benchmark) goes through branch profiling, cache profiling, and instruction profiling, which produce branch statistics, cache statistics, and instruction statistics.]
7Statistical Profiling
- Microarchitecture-independent statistics
- instruction statistics
- Microarchitecture-dependent statistics
- branch statistics
- cache statistics
- Result: statistical simulation can only be used to explore design options of the processor core (cache and branch predictor are fixed)
8Statistical Profiling: Instruction Statistics
- Instruction mix (13 classes)
- Number of register operands
- Age of register operands
- probability that a register operand was produced n instructions before it in the trace (only RAW)
- Memory dependencies
- probability that a load is memory-dependent on the n-th store before it in the trace (only RAW)
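The instruction statistics above can be gathered in one pass over a trace. The following is a minimal sketch, not the authors' profiler: the trace format (tuples of instruction class, destination register, source registers) and the names `profile_instructions` and `normalize` are assumptions made for illustration, and only register RAW dependencies are tracked.

```python
from collections import Counter


def profile_instructions(trace):
    """Collect instruction mix, operand counts, and register operand ages.

    `trace` is a hypothetical format: a list of
    (instruction_class, dest_register_or_None, source_registers) tuples.
    """
    mix = Counter()          # instruction class -> count
    n_operands = Counter()   # number of register source operands -> count
    age = Counter()          # distance to producing instruction -> count
    last_writer = {}         # register -> index of most recent writer

    for i, (iclass, dest, srcs) in enumerate(trace):
        mix[iclass] += 1
        n_operands[len(srcs)] += 1
        for reg in srcs:
            if reg in last_writer:
                # RAW dependency: operand produced (i - writer) instructions ago
                age[i - last_writer[reg]] += 1
        if dest is not None:
            last_writer[dest] = i
    return mix, n_operands, age


def normalize(counter):
    """Turn raw counts into probabilities."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


# Tiny made-up trace, for illustration only.
trace = [
    ("load", "r1", ()),
    ("alu", "r2", ("r1",)),
    ("alu", "r3", ("r1", "r2")),
    ("store", None, ("r3",)),
]
mix, n_operands, age = profile_instructions(trace)
age_distribution = normalize(age)
```

Memory dependencies between loads and earlier stores could be profiled the same way by tracking the last store to each address instead of the last writer of each register.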
9Statistical Profiling: Branch Statistics
- Six branch types
- conditional branch, unconditional branch, call with offset, indirect jump, indirect call, return
- Distinction
- branch prediction accuracy: refill pipeline on branch misprediction
- branch target prediction accuracy: single-cycle bubble in pipeline on correct branch prediction but target misprediction
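The distinction between the two accuracies can be sketched as a small sampling routine. This is an illustrative assumption about how the two statistics might be combined, not the authors' code; the function name and the outcome labels are hypothetical.

```python
import random


def sample_branch_outcome(prediction_accuracy, target_accuracy, rng):
    """Draw the pipeline effect of one synthetic branch.

    Returns "refill" when the branch is mispredicted (pipeline refill),
    "bubble" when the prediction is correct but the target is mispredicted
    (single-cycle bubble), and "none" otherwise.
    """
    if rng.random() >= prediction_accuracy:
        return "refill"
    if rng.random() >= target_accuracy:
        return "bubble"
    return "none"


# Made-up accuracies, for illustration only.
rng = random.Random(42)
outcomes = [sample_branch_outcome(0.95, 0.98, rng) for _ in range(10_000)]
refill_fraction = outcomes.count("refill") / len(outcomes)
```

In a fuller model, each of the six branch types would carry its own pair of accuracies.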
10Statistical Profiling: Cache Statistics
- D-cache statistics
- L1 D-cache miss rate
- L2 D-cache miss rate
- I-cache statistics
- L1 I-cache miss rate
- L2 I-cache miss rate
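A miss rate such as the ones listed above is obtained by running the address stream through a cache model. The sketch below measures the miss rate of a direct-mapped cache; it is only an illustration, and the 32-byte line size, the function name, and the example addresses are assumptions that do not come from the talk.

```python
def direct_mapped_miss_rate(addresses, cache_bytes=8 * 1024, line_bytes=32):
    """Return the miss rate of a direct-mapped cache over an address stream.

    The 32-byte line size is an assumed value for this sketch.
    """
    if not addresses:
        return 0.0
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one tag per cache line, None = empty
    misses = 0
    for addr in addresses:
        block = addr // line_bytes   # memory block containing the address
        index = block % n_lines      # cache line the block maps to
        tag = block // n_lines       # identifies which block occupies the line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag        # fill the line on a miss
    return misses / len(addresses)


# Addresses 0 and 8192 map to the same line of an 8KB cache and evict each other.
miss_rate = direct_mapped_miss_rate([0, 4, 8192, 0])
```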
11Synthetic Trace Generation
- Instruction-by-instruction
- through random number generation
- Determine
- instruction type
- number of operands
- age of register operands
- memory dependency
- branch behavior
- D-cache behavior
- I-cache behavior
[Figure: synthetic trace with instructions labeled "I-cache miss", "D-cache miss", and "mispredicted".]
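Instruction-by-instruction generation through random numbers can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' generator: the profile layout, the four instruction classes (the talk uses 13), the fixed operand count per class, and all numeric values are hypothetical.

```python
import random


def generate_synthetic_trace(profile, n_instructions, seed=0):
    """Generate a synthetic trace by sampling each property from the profile."""
    rng = random.Random(seed)
    classes = list(profile["mix"])
    class_weights = [profile["mix"][c] for c in classes]
    ages = list(profile["age"])
    age_weights = [profile["age"][a] for a in ages]

    trace = []
    for _ in range(n_instructions):
        iclass = rng.choices(classes, weights=class_weights)[0]
        n_ops = profile["operands"][iclass]   # assumed fixed per class
        inst = {
            "class": iclass,
            # one dependency distance per register operand
            "dep_ages": rng.choices(ages, weights=age_weights, k=n_ops),
            "icache_miss": rng.random() < profile["l1i_miss_rate"],
            "dcache_miss": False,
            "mispredicted": False,
        }
        if iclass == "load":
            inst["dcache_miss"] = rng.random() < profile["l1d_miss_rate"]
        elif iclass == "branch":
            inst["mispredicted"] = rng.random() >= profile["branch_accuracy"]
        trace.append(inst)
    return trace


# Made-up profile, for illustration only.
example_profile = {
    "mix": {"alu": 0.5, "load": 0.25, "store": 0.1, "branch": 0.15},
    "operands": {"alu": 2, "load": 1, "store": 2, "branch": 1},
    "age": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    "l1i_miss_rate": 0.02,
    "l1d_miss_rate": 0.05,
    "branch_accuracy": 0.93,
}
synthetic = generate_synthetic_trace(example_profile, 1000, seed=1)
```

Fixing the seed makes a generated trace reproducible; different seeds give different traces drawn from the same profile.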
12Methodology: microarchitecture
- Out-of-order processor
- 8 and 16 issue
- windows of 64 and 128 instructions
- McFarling branch predictor
- small cache configuration
- 8KB DM L1 I-cache, 8KB DM L1 D-cache, 64KB 2WSA unified L2 cache
- large cache configuration
- 32KB DM L1 I-cache, 64KB 2WSA L1 D-cache, 512KB 4WSA unified L2 cache
- Access time
- L1 I-cache (1 cycle), L1 D-cache (2 cycles), L2 cache (10 cycles), main memory (80 cycles)
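To show how the access times combine with cache miss flags, here is a minimal sketch of a load latency model. It assumes the listed cycle counts are the total latency seen by a load served at that level (the talk does not say whether they are cumulative) and that the L2 miss rate is local, i.e. relative to L1 misses; the function names are hypothetical.

```python
# Access times in cycles, as listed for the D-cache hierarchy.
L1D_CYCLES = 2
L2_CYCLES = 10
MEMORY_CYCLES = 80


def load_latency(l1_miss, l2_miss):
    """Latency of one load, assuming each figure is the total latency."""
    if not l1_miss:
        return L1D_CYCLES
    if not l2_miss:
        return L2_CYCLES
    return MEMORY_CYCLES


def expected_load_latency(l1_miss_rate, l2_local_miss_rate):
    """Average load latency for the given miss rates.

    `l2_local_miss_rate` is the fraction of L1 misses that also miss in L2.
    """
    l1_hit = 1.0 - l1_miss_rate
    l2_hit = l1_miss_rate * (1.0 - l2_local_miss_rate)
    memory = l1_miss_rate * l2_local_miss_rate
    return l1_hit * L1D_CYCLES + l2_hit * L2_CYCLES + memory * MEMORY_CYCLES
```

For example, `expected_load_latency(0.5, 0.0)` averages a 2-cycle hit and a 10-cycle L2 access in equal proportion.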
13Methodology: benchmarks
- 8 SPECint95 benchmarks
- 5 SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5)
- 8 IBS system traces (mpeg, jpeg, gs, verilog, gcc, sdet, nroff, groff)
- 4 MediaBench applications (g721, gs, gsm, mpeg2)
- 4 X graphics benchmarks (DooM, POVRay, Xanim, Quake)
- 2 TPC-D queries running on Postgres 6.3
- 200 million instructions / trace
14Evaluation
- IPC prediction error = (IPC real trace - IPC synthetic trace) / IPC real trace
- IPC real trace: IPC when running the real trace on the trace-driven simulator
- IPC synthetic trace: IPC when running the synthetic trace generated from the statistical profile of the real trace
- Simulation speed: sIPC/xIPC less than 1 after simulating 1 million instructions
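The error metric is a relative difference between the two IPC values. A minimal sketch, with a hypothetical function name and made-up IPC values used purely as a worked example:

```python
def ipc_prediction_error(ipc_real, ipc_synthetic):
    """Relative IPC prediction error: (real - synthetic) / real."""
    return (ipc_real - ipc_synthetic) / ipc_real


# Made-up values: the synthetic trace underestimates the real IPC.
error = ipc_prediction_error(ipc_real=2.0, ipc_synthetic=1.8)
error_percent = 100.0 * error
```

With this sign convention a positive error means the synthetic trace yields a lower IPC than the real trace, and a negative error means it yields a higher one.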
15IPC prediction error (1)
[Figure: bar chart of IPC prediction error per benchmark (y-axis from -30 to 40), grouped by suite: SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D. Two off-scale bars, at 157 and 135, are annotated "high D-cache miss rate". Configuration: 16-issue, 128-entry window, small cache configuration.]
16IPC prediction error (2)
[Figure: bar chart of IPC prediction error per benchmark (y-axis from -30 to 30), grouped by suite: SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D. Configuration: 16-issue, 128-entry window, large cache configuration.]
17IPC prediction error vs. static instruction count
[Figure: scatter plot of IPC prediction error (y-axis, -40 to 160) versus static instruction count, i.e. the number of instructions executed at least once (x-axis, 0 to 160000). Four series: window 64 / issue 8 with 'small' cache, window 128 / issue 16 with 'small' cache, window 64 / issue 8 with 'large' cache, window 128 / issue 16 with 'large' cache. Labeled points: nroff, jpeg (IBS), verilog, sdet, mpeg (IBS), groff, gcc, DooM, Quake, gs (IBS), gcc (IBS), vortex, go, TPC-D.]
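The static instruction count used on the x-axis is the number of distinct instructions executed at least once. A minimal sketch, assuming the trace is available as a sequence of instruction addresses (the function name and the example addresses are hypothetical):

```python
def static_instruction_count(instruction_addresses):
    """Number of distinct instructions executed at least once."""
    return len(set(instruction_addresses))


# A three-instruction loop executed 1000 times: large dynamic count,
# tiny static count.
loop_trace = [0x400, 0x404, 0x408] * 1000
dynamic_count = len(loop_trace)
static_count = static_instruction_count(loop_trace)
```

The loop example shows why the two counts differ: a small kernel that iterates many times has a long trace but a small instruction footprint.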
18Conclusion (1)
- Higher IPC prediction errors for applications with smaller static instruction count
- MediaBench applications
- SPECfp95 benchmarks
- 2 X graphics benchmarks (POVRay and Xanim)
- 5 SPECint95 benchmarks
19Conclusion (2)
- Smaller IPC prediction errors for applications with larger instruction footprint
- IBS system traces
- TPC-D traces
- 2 X graphics benchmarks (DooM and Quake)
- 3 SPECint95 benchmarks (go, gcc, vortex)
- IPC prediction error between -1% and 25%
20Conclusion (3)
- Statistical simulation is a useful fast simulation technique for commercial workloads
- commercial workloads have a larger instruction footprint
- and therefore higher variability in instructions
- which makes a statistical technique more powerful