Title: PINTOS: An Execution Phase Based Optimization and Simulation Tool
1 PINTOS: An Execution Phase Based Optimization and Simulation Tool
- Wei Hsu, Jinpyo Kim, Sreekumar Kodak
- Computer Science Department
- University of Minnesota
- October 9, 2004
- PIN Tutorial at ASPLOS04
2 Outline
- What is Pintos?
- What can Pintos do?
- Phase detection for optimization and simulation
- Optimization (instruction prefetching)
- Fast Simulation
- Summary
3 What is Pintos?
- PINTOS is a PIN-based Tool for Optimization and Simulation
- A research framework that supports adaptive object code optimization
- Supports deep analysis of run-time program behavior for object code optimization (e.g. instruction and data prefetching)
- Integrates HPM performance monitoring (pfmon) with dynamic instrumentation (PIN)
- Also supports fast performance simulation
- Identifies program phases (at coarse and fine granularity)
- Generates simulation strings that capture representative program behaviors
4 Pintos Framework
[Figure: Pintos framework diagram with two paths. Optimization path labels: program, pfmon profile, profile analysis, Opt targets, PIN-based Analysis (control flow, Cache Sim), Filtered Opt Targets, Optimization. Simulation path labels: program, pfmon profile, profile analysis, phase targets, PIN-based Phase Detection, Phase Info, Simulation String Gen, Simulation Strings, Simulation.]
5 Our Background
- ADORE dynamic optimization system
[Figure: ADORE architecture diagram. Labels: Main Thread, Code Cache, Dynamic Optimization Thread (Phase Detection, Trace Selection, Optimization, Deployment), Kernel / Pfmon, Hardware Performance Monitoring Unit.]
6 ADORE Performance Speedup of ORC2.1 O2 Compiled SPEC2000 Benchmarks
7 ADORE Performance at Different Sampling Rates
8 Future Enhancements to ADORE
- I-cache prefetching
- Helper thread based optimizations
- Value prediction based optimizations
- Dynamically undo aggressive optimizations (e.g. control/data speculation, indirect array prefetches)
- Software branch prediction
9 What can Pintos do for us?
- Pintos uses pfmon to identify high-level performance problems (e.g. I-cache misses) and to locate target code (phases) for optimization
- Pintos then uses a PIN-based analysis tool to focus on the target code (phases) and conduct deep analysis
- Pintos provides a framework that supports deep analysis of program behavior, so that we can experiment with new object code optimization techniques and feed them to ADORE
- Simulation strings can be generated by Pintos and used for more efficient micro-architecture simulations
10 Phase based Optimization and Simulation
- A phase is a sequence of code that consistently exhibits certain performance behaviors in Pintos, for example:
- Gzip shows consistent and repeated data cache miss patterns
- Crafty exhibits consistent I-cache misses
- A repeating phase can serve as a unit for dynamic and adaptive optimization, or for fast performance simulation
- An optimization unit can be a basic block, trace, procedure, or region (loops and loop nests, including complex control transfers)
- A simulation unit can be an extended code sequence
11 Phase Detection
- One phase detection method does not fit all needs
- Dynamic data cache prefetching requires coarse-grain phases (e.g. loops), while dynamic I-cache prefetching requires fine-grain phases (e.g. frequent calling paths)
- A phase tuple is used to determine the current point of execution in PIN instrumentation
- Phase tuple: (phase ID, IP address, number of retired instructions)
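The phase tuple above can be sketched as a small record plus a tracker that decides when execution has reached a phase start. This is an illustrative sketch only, not Pintos code and not the PIN API: the names `PhaseTuple` and `PhaseTracker` are hypothetical, and the matching rule (first time the IP is seen once the retired-instruction count has reached the recorded value) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseTuple:
    phase_id: int       # identifier assigned by phase detection
    ip_addr: int        # instruction pointer at which the phase starts
    retired_insts: int  # retired-instruction count recorded at that point


class PhaseTracker:
    """Tracks which phase is active, given (ip, retired) observations.

    The IP alone is ambiguous because the same address executes many
    times; the retired-instruction count picks out one dynamic instance.
    """

    def __init__(self, tuples):
        # Process phase starts in the order they occur during execution.
        self._pending = sorted(tuples, key=lambda t: t.retired_insts)
        self._next = 0
        self.current_phase = None

    def on_instruction(self, ip, retired):
        """Return True if this observation starts the next phase."""
        if self._next < len(self._pending):
            t = self._pending[self._next]
            if ip == t.ip_addr and retired >= t.retired_insts:
                self.current_phase = t.phase_id
                self._next += 1
                return True
        return False
```

In an instrumentation tool, a callback of this kind would be invoked from the analysis routine attached to the instructions or basic blocks of interest.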
12 Pintos for Optimization (I-Prefetch)
- Many applications still suffer from significant I-cache misses (e.g. database apps, some SPEC CPU2000 benchmarks, etc.)
- Complex control flows cause a high miss rate from streaming prefetches
- A predictable call sequence results in a relatively low miss rate
13 I-Cache Miss Analysis (pfmon)
- Miss address based info: Crafty (2125/4760000)
- 25%: 30 (1.41%)
- 50%: 91 (4.28%)
- 75%: 228 (10.73%)
- 90%: 442 (20.80%)
- Each top miss PC was caused by 10-40 different paths
- Path based info: Crafty (8016/4760000)
- 25%: 28 (0.34%)
- 50%: 126 (1.57%)
- 75%: 436 (5.43%)
- 90%: 1118 (13.94%)
- Each top path leading to an I-cache miss has 1-2 possible prefetch targets
- The data show that we can reduce the points of interest for instruction prefetching
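The percentile rows above can be derived from a sampled miss profile roughly as follows. This sketch assumes that each row reports how many of the hottest miss PCs (or paths) are needed to cover 25/50/75/90% of the sampled misses; the function names and the input format are hypothetical and do not correspond to pfmon output.

```python
from collections import Counter


def count_misses(sampled_keys):
    """Build a miss histogram from sampled miss PCs (or path IDs)."""
    return Counter(sampled_keys)


def coverage_points(miss_counts, thresholds=(0.25, 0.50, 0.75, 0.90)):
    """For each threshold, return how many of the hottest keys are
    needed to cover at least that fraction of all sampled misses."""
    total = sum(miss_counts.values())
    ranked = sorted(miss_counts.values(), reverse=True)
    pending = iter(sorted(thresholds))
    target = next(pending, None)
    result = {}
    covered = 0
    for rank, count in enumerate(ranked, start=1):
        covered += count
        # One hot key may satisfy several thresholds at once.
        while target is not None and covered >= target * total:
            result[target] = rank
            target = next(pending, None)
    return result
```

Dividing each returned rank by the number of distinct keys gives the percentage shown in parentheses on the slide (for example 30 / 2125 is about 1.41%).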
14 Exploring prospective points of instruction prefetching (PIN)
- Pintos generates prospective paths leading to frequent I-cache misses by analyzing the pfmon profile
- The PIN instrumentation routine constructs the control flow graph and simulates the instruction cache during execution
- It inserts I-cache prefetching instructions for the prospective paths based on control flow edge weight and estimated cache replacement
[Figure: paths frequently causing I-cache misses, shown on a control flow graph of basic blocks B1-B8 together with an instruction cache simulator.]
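The edge-weight part of this step can be sketched as follows. The code works on a pre-recorded list of executed basic block names rather than on live instrumentation, so it does not use the PIN API; `build_cfg` and `hottest_path_to` are hypothetical names, and walking backwards along the heaviest incoming edge is one simple way to pick a prospective path, not necessarily the method used in Pintos.

```python
from collections import defaultdict


def build_cfg(block_trace):
    """Count how often each control flow edge (src, dst) is taken
    in a trace of executed basic blocks."""
    edges = defaultdict(int)
    for src, dst in zip(block_trace, block_trace[1:]):
        edges[(src, dst)] += 1
    return edges


def hottest_path_to(edges, target, length):
    """Walk backwards from `target` along the heaviest incoming edge,
    for at most `length` edges. Returns the path ending at `target`."""
    path = [target]
    while len(path) < length + 1:
        # Skip blocks already on the path so loops cannot cycle forever.
        incoming = [(weight, src) for (src, dst), weight in edges.items()
                    if dst == path[0] and src not in path]
        if not incoming:
            break
        _, best_src = max(incoming)
        path.insert(0, best_src)
    return path
```

Blocks early on the returned path are candidate points for inserting a prefetch of the cache lines of the block that misses.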
15 Exploring prospective points of instruction prefetching (PIN)
- Key observation
- Most I-cache misses happen in the cache lines that follow the entry to or the return from a function call
- L1I cache misses are mostly capacity misses, so we need to estimate how a prefetch affects the incoming instruction stream
- Key idea
- Run ahead by exploring the CFG with the I-cache simulator
- Evaluate the prospective paths given by Pintos
[Figure: paths frequently causing I-cache misses, shown on a control flow graph of basic blocks B1-B8 together with an instruction cache simulator.]
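The run-ahead idea can be illustrated with a toy cache model. The sketch below makes several simplifying assumptions that are not from the slides: the cache is fully associative with LRU replacement, prefetches complete instantly, and each basic block is given as a list of cache line numbers. `LRUICache` and `run_ahead_misses` are hypothetical names.

```python
from collections import OrderedDict


class LRUICache:
    """Fully associative LRU cache of instruction cache lines."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # oldest entry first

    def touch(self, line):
        """Fetch or prefetch a line. Returns True on a hit."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = None
        return hit


def run_ahead_misses(path, block_lines, num_lines, prefetch=None):
    """Count demand misses along `path`. `prefetch` maps a block to the
    lines that are prefetched when that block is entered."""
    cache = LRUICache(num_lines)
    prefetch = prefetch or {}
    misses = 0
    for block in path:
        for line in prefetch.get(block, ()):
            cache.touch(line)
        for line in block_lines[block]:
            if not cache.touch(line):
                misses += 1
    return misses
```

With a 4-line cache, `block_lines = {"B1": [0, 1], "B2": [2], "B7": [10, 11]}` and the path B1, B2, B7, prefetching B7's lines at B1 leaves the miss count unchanged because the prefetched lines are evicted before use, while prefetching them at B2 removes both B7 misses. This is the capacity effect that makes placement of the prefetch matter.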
16 Pintos for Fast Simulation
- Execution-driven micro-architectural simulation is commonly used for evaluating new micro-architecture features and the corresponding code optimizations
- Simulation time is often too long for a complete simulation. New methods for fast simulation, such as SimPoint and SMARTS, have been proposed
- PASS (Phase Aware Stratified Sampling) is a different way to generate representative and customized traces for targeted simulations
17 Fast Simulation Techniques
- Truncated Execution: Run Z, FastForward-W-R
- Sampling: SMARTS, SimPoint, Stratified Sampling
- Reduced Input Sets: MinneSPEC
18 Problems with Previous Work
- Truncated execution gives very inaccurate results
- Reduced input sets do not always behave the same as the reference inputs, so performance estimates based on reduced input sets may be misleading
19 Mechanism of SMARTS
- W: warm-up time (fixed to 2000 instructions for SPEC2000)
- U: detailed simulation (fixed to 1000 instructions for SPEC2000)
- (K-1)U: functional simulation with functional warming (the tool gives the value of K for which the IPC will be within 3% of the actual value with 99.7% confidence)
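The relationship between the confidence target and K can be sketched with the standard sample-size formula for estimating a mean. This is a simplified illustration rather than the SMARTS tool itself: it assumes the per-unit metric is CPI, that its coefficient of variation `cv` has been estimated from an earlier run, and that z = 3 corresponds to 99.7% confidence under a normal approximation. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def required_units(cv, rel_error=0.03, z=3.0):
    """Number of measured units n with n >= (z * cv / rel_error)^2."""
    return math.ceil((z * cv / rel_error) ** 2)


def sampling_interval(total_insts, unit_size, n_units):
    """K such that measuring one unit out of every K yields about
    n_units measured units over the whole run."""
    total_units = total_insts // unit_size
    return max(1, total_units // n_units)


def systematic_sample(per_unit_cpi, k, offset=0):
    """Pick every k-th unit, starting at `offset`."""
    return per_unit_cpi[offset::k]


def relative_half_width(samples, z=3.0):
    """Half-width of the confidence interval relative to the mean.
    Needs at least two samples and a non-zero mean."""
    return z * stdev(samples) / (math.sqrt(len(samples)) * mean(samples))
```

For example, with U = 1000 instructions, a run of 10 billion instructions and a need for about 10,000 measured units, `sampling_interval` gives K = 1000, so roughly one unit in a thousand is simulated in detail.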
20 Issues in Previous Work
- SMARTS
- The values of U and W are fixed for the SPEC2000 suite and have to be identified again for every new benchmark suite (very time consuming)
- Oversampling in steady phases: it does not effectively exploit the existence of phases in programs
- SimPoint
- The user chooses the length of the simulation point (100 million, 10 million, or 1 million instructions)
- Provides simulation points based on clustering of basic block profiles, which are generated using sim-fast or ATOM
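The clustering step behind SimPoint can be illustrated with a plain k-means over per-interval basic block execution counts. This is a deliberately reduced sketch and not the SimPoint implementation: it uses deterministic initial centroids, no dimensionality reduction, and no automatic choice of k, and every function name is hypothetical.

```python
def normalize(vec):
    """Scale a basic block count vector so that its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with evenly spaced intervals as initial centroids."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of intervals")
    step = len(vectors) // k
    centroids = [list(vectors[i * step]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def simulation_points(bbvs, k):
    """Return {interval index: weight}: for each cluster, the interval
    closest to the centroid and the fraction of intervals it represents."""
    vectors = [normalize(v) for v in bbvs]
    labels, centroids = kmeans(vectors, k)
    points = {}
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: dist2(vectors[i], centroids[c]))
            points[best] = len(members) / len(vectors)
    return points
```

Each selected interval is then simulated in detail and its result is weighted by the size of its cluster.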
21 Phase Aware Stratified Sampling (PASS)
- Deploy a hierarchical method to detect coarse and fine grain program phases
- (1) Tracking the calling stack (stable bottom, coarse grain phase) -> inter-procedure
- (2) Detecting loops within the procedure -> intra-procedure
- (3) Tracking data access patterns such as strides within loops (fine grain phases)
- Select stratified samples from each phase until high statistical confidence is reached
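The last bullet can be sketched as stratified sampling with one stratum per detected phase. The code below is an assumption-laden illustration, not the PASS implementation: the per-interval CPI values stand in for detailed simulation results, the stopping rule reuses a 3% error bound at z = 3, and `stratum_error` and `stratified_estimate` are hypothetical names.

```python
import math
import random
from statistics import mean, stdev


def stratum_error(samples, z=3.0):
    """Relative half-width of the confidence interval for one phase.
    Assumes a positive mean, which holds for CPI values."""
    if len(samples) < 2:
        return math.inf
    return z * stdev(samples) / (math.sqrt(len(samples)) * mean(samples))


def stratified_estimate(cpi_by_phase, rel_error=0.03, z=3.0, seed=0):
    """cpi_by_phase maps a phase ID to the CPI of each equal-length
    interval in that phase. Intervals are drawn at random from each
    phase until the phase's error bound is met or the phase is used up.
    Returns (weighted CPI estimate, number of intervals sampled)."""
    rng = random.Random(seed)
    total = sum(len(values) for values in cpi_by_phase.values())
    estimate = 0.0
    used = 0
    for values in cpi_by_phase.values():
        order = list(values)
        rng.shuffle(order)
        samples = []
        for value in order:
            samples.append(value)
            if stratum_error(samples, z) <= rel_error:
                break
        # Weight each phase by its share of all intervals.
        estimate += mean(samples) * len(values) / total
        used += len(samples)
    return estimate, used
```

A steady phase with little CPI variation needs only a couple of samples, which is how phase awareness avoids the oversampling of steady regions.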
22 IPC vs SimPoint (cc1-166, 1 million insts)
23 IPC vs Phase Classification on PASS (cc1-166, 1 million insts)
24 IPC vs SimPoint (cc1-166, 250 million insts)
25 IPC vs SimPoint (gzip-source, 1 million insts)
26 IPC vs Phase Classification on PASS (gzip-source, 1 million insts)
27 IPC vs SimPoint (gzip-source, 250 million insts)
28 IPC vs SimPoint (mcf-ref, 1 million insts)
29 IPC vs Phase Classification on PASS (mcf-ref)
30 IPC vs SimPoint (mcf-ref, 250 million insts)
31 IPC vs Phase Classification on PASS (gap-ref, 1 million insts)
32 IPC vs SimPoint (gap-ref, 250 million insts)
33 Summary
- We show the combination of HPM sampling (pfmon) and dynamic instrumentation (PIN) in our research framework (Pintos) for adaptive object code optimization and micro-architectural simulation
- PASS (Phase Aware Stratified Sampling) may lead to a more efficient way of simulating the interaction between compiler optimizations and new micro-architectural features