PINTOS: An Execution Phase Based Optimization and Simulation Tool

1
PINTOS: An Execution Phase Based Optimization and Simulation Tool
  • Wei Hsu, Jinpyo Kim, Sreekumar Kodak
  • Computer Science Department
  • University of Minnesota
  • October 9, 2004
  • PIN Tutorial at ASPLOS04

2
Outline
  • What is Pintos?
  • What can Pintos do?
  • Phase detection for optimization and simulation
  • Optimization (instruction prefetching)
  • Fast Simulation
  • Summary

3
What is Pintos?
  • PINTOS is a PIN-based Tool for Optimization and
    Simulation
  • A research framework that supports adaptive object
    code optimization
  • Supports deep analysis of run-time program
    behavior for object code optimization (e.g.
    instruction and data prefetching)
  • Integrates hardware performance monitoring (HPM,
    via Pfmon) with dynamic instrumentation (PIN)
  • Also supports fast performance simulation:
  • Identifies program phases (at coarse and fine
    granularity)
  • Generates simulation strings that capture
    representative program behaviors

4
Pintos Framework
[Diagram: two pipelines.
Optimization: program, pfmon, profile, profile analysis, Opt targets, PIN-based Analysis (control flow, Cache Sim), Filtered Opt Targets, Optimization.
Simulation: program, pfmon, profile, profile analysis, phase targets, PIN-based Phase Detection, Phase Info, Simulation String Gen, Simulation Strings, Simulation.]

5
Our Background
  • ADORE dynamic optimization system

[Diagram: ADORE components: Main Thread, Dynamic Optimization Thread, Phase Detection, Trace Selection, Optimization, Deployment, Code Cache, Kernel / Pfmon, Hardware Performance Monitoring Unit]

6
ADORE Performance Speedup of ORC2.1 O2 Compiled SPEC2000 Benchmarks

7
ADORE Performance at Different Sampling Rates

8
Future Enhancements to ADORE
  • I-cache prefetching
  • Helper thread based optimizations
  • Value prediction based optimizations
  • Dynamically undo aggressive optimizations (e.g.
    control/data speculations, indirect array
    prefetches)
  • Software Branch Predictions

9
What can Pintos do for us?
  • Pintos uses pfmon to identify high-level
    performance problems (e.g. I-cache misses) and
    locate target code (phases) for optimization
  • Pintos then uses a PIN-based analysis tool to focus
    on the target code (phases) and conduct deep analysis
  • Pintos provides a framework to support deep
    analysis of program behavior so that we may
    experiment with new object code optimization
    techniques and feed them to ADORE
  • Simulation strings can be generated by Pintos and
    used for more efficient micro-architecture
    simulations

10
Phase based Optimization and Simulation
  • In Pintos, a phase is a sequence of code that
    consistently exhibits certain performance
    behaviors, for example:
  • Gzip shows consistent and repeated data cache
    miss patterns
  • Crafty exhibits consistent I-cache misses
  • A repeating phase can serve as a unit for
    dynamic and adaptive optimization, or for fast
    performance simulations
  • The optimization unit can be a basic block, trace,
    procedure or region (loops and loop nests,
    including complex control transfers)
  • The simulation unit can be an extended code sequence

11
Phase Detection
  • One phase detection method doesn't fit all needs.
  • Dynamic data cache prefetching requires
    coarse-grain phases (e.g. loops), while dynamic
    I-cache prefetching requires fine-grain phases
    (e.g. frequent calling paths).
  • A phase tuple is used to determine the current
    point of execution in PIN instrumentation
  • Phase tuple: (phase ID, IP address, number of
    retired instructions)
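
The phase tuple can be pictured as a small record. The following is a minimal sketch, assuming only the three fields named on the slide; the class name `PhaseTuple`, the field names and the helper `reached` are hypothetical and are not taken from Pintos or the Pin API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseTuple:
    phase_id: int       # identifier assigned by the phase detector
    ip_addr: int        # instruction pointer that marks the point of interest
    retired_insts: int  # number of instructions retired when it was recorded


def reached(target: PhaseTuple, ip_addr: int, retired_insts: int) -> bool:
    """True once execution is at the target address and has retired
    at least as many instructions as the tuple records."""
    return ip_addr == target.ip_addr and retired_insts >= target.retired_insts


target = PhaseTuple(phase_id=3, ip_addr=0x4000A0, retired_insts=1_000_000)
print(reached(target, 0x4000A0, 1_200_000))
```

Comparing on both the address and the retired-instruction count is one way to tell apart repeated visits to the same address; whether Pintos matches tuples this way is an assumption.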

12
Pintos for Optimization (I-Prefetch)
  • Many applications still suffer from significant
    I-cache misses (e.g. database apps, some SPEC
    CPU2000 benchmarks, etc.)
  • Complex control flows cause a high miss rate from
    streaming prefetches
  • Predictable call sequences result in a relatively
    low miss rate

13
I-Cache Miss Analysis (pfmon)
  • Miss address based info: Crafty (2125/4760000)
  • 25%: 30 (1.41%)
  • 50%: 91 (4.28%)
  • 75%: 228 (10.73%)
  • 90%: 442 (20.80%)
  • Each top miss PC was caused by 10-40 different
    paths.
  • Path based info: Crafty (8016/4760000)
  • 25%: 28 (0.34%)
  • 50%: 126 (1.57%)
  • 75%: 436 (5.43%)
  • 90%: 1118 (13.94%)
  • Each top path leading to an I-cache miss has 1-2
    possible prefetch targets.
  • Data show we can reduce the points of interest for
    instruction prefetching.
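
Figures of this kind (how many of the hottest miss PCs or paths account for a given share of misses) can be derived from a per-item miss count. The sketch below assumes that reading of the table; the function name `items_for_coverage` and the sample counts are made up for illustration and are not pfmon output.

```python
def items_for_coverage(miss_counts, percent):
    """Return how many of the hottest items are needed to cover
    `percent` percent of all misses (integer arithmetic only)."""
    total = sum(miss_counts)
    covered = 0
    for n, count in enumerate(sorted(miss_counts, reverse=True), start=1):
        covered += count
        if covered * 100 >= percent * total:
            return n
    return len(miss_counts)


# Hypothetical miss counts per PC (or per path), hottest first or not.
counts = [50, 30, 10, 5, 5]
for pct in (25, 50, 75, 90):
    print(pct, items_for_coverage(counts, pct))
```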

14
Exploring prospective points of instruction
prefetching (PIN)
  • Pintos generates prospective paths leading to
    frequent I-cache misses by analyzing the pfmon
    profile
  • The PIN instrumentation routine constructs the
    control flow graph and simulates the instruction
    cache during execution
  • It inserts I-cache prefetching instructions for
    the prospective paths based on control flow edge
    weight and estimated cache replacement
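
The pairing of a control flow graph with an instruction cache simulator can be sketched in a few lines. This is a simplified illustration, not the Pintos tool: it assumes a fully associative LRU cache, one cache line per basic block, and a recorded trace of executed blocks; `ICache`, `replay`, `LINE_SIZE` and the block addresses are hypothetical.

```python
from collections import Counter, OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed value)


class ICache:
    """Fully associative LRU cache of instruction lines (simplified)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line number -> True, oldest first
        self.misses = 0

    def access(self, addr):
        """Touch the line holding `addr`; return True on a hit."""
        line = addr // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        return False


def replay(block_trace, block_addr, cache):
    """Replay executed basic blocks; return CFG edge weights and
    the number of I-cache misses charged to each block."""
    edges, block_misses = Counter(), Counter()
    prev = None
    for block in block_trace:
        if prev is not None:
            edges[(prev, block)] += 1
        if not cache.access(block_addr[block]):
            block_misses[block] += 1
        prev = block
    return edges, block_misses


addrs = {"B1": 0, "B2": 64, "B3": 128}
cache = ICache(num_lines=2)
edges, misses = replay(["B1", "B2", "B1", "B3", "B1"], addrs, cache)
print(dict(edges), dict(misses), cache.misses)
```

The edge weights and per-block misses are the two inputs the slide names for choosing prefetch points; a real tool would model set associativity and multi-line blocks.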

[Diagram: paths frequently causing I-cache misses, a control flow graph of basic blocks B1-B8, and the Instruction Cache Simulator]

15
Exploring prospective points of instruction
prefetching (PIN)
  • Key observation:
  • Most I-cache misses happen in the cache lines
    that follow the entry to or the return from a
    function call.
  • L1I cache misses are mostly capacity misses. We
    need to estimate how a prefetch affects the
    incoming instruction stream.
  • Key idea:
  • Run ahead by exploring the CFG with the I-cache
    simulator
  • Evaluate the prospective paths given by Pintos
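
One way to picture the run-ahead idea is to follow the heaviest CFG edges from the current block and collect the lines that the simulated cache does not hold. The sketch below is an assumption about how such a walk could look; `run_ahead`, the edge weights and the block-to-line mapping are hypothetical and do not come from the slides.

```python
def run_ahead(cfg, block_line, resident_lines, start, depth):
    """Follow the heaviest outgoing edge from `start` for up to `depth`
    blocks and return the cache lines on that path that are not resident."""
    candidates, block, seen = [], start, {start}
    for _ in range(depth):
        successors = cfg.get(block)
        if not successors:
            break  # no known successor: stop exploring
        block = max(successors, key=successors.get)
        if block in seen:
            break  # stop at a back edge instead of looping forever
        seen.add(block)
        line = block_line[block]
        if line not in resident_lines and line not in candidates:
            candidates.append(line)
    return candidates


# Hypothetical CFG: block -> {successor: edge weight}.
cfg = {
    "B1": {"B2": 9, "B6": 1},
    "B2": {"B3": 5, "B4": 4},
    "B3": {"B5": 5},
    "B5": {},
}
block_line = {"B1": 0, "B2": 1, "B3": 2, "B4": 3, "B5": 4, "B6": 5}
print(run_ahead(cfg, block_line, resident_lines={0, 1}, start="B1", depth=3))
```

Because the slide notes that most L1I misses are capacity misses, a fuller version would also check which resident lines each candidate prefetch would evict before accepting it.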

[Diagram: paths frequently causing I-cache misses, a control flow graph of basic blocks B1-B8, and the Instruction Cache Simulator]

16
Pintos for Fast Simulation
  • Execution-driven micro-architectural simulation is
    commonly used for evaluating new
    micro-architecture features and the corresponding
    code optimizations.
  • A complete simulation often takes too long. New
    methods for fast simulation, such as SimPoint and
    SMARTS, have been proposed.
  • PASS (Phase Aware Stratified Sampling) is a
    different way to generate representative and
    customized traces for targeted simulations

17
Fast Simulation Techniques
  • Truncated Execution: Run Z, FastForward-W-R
  • Sampling: SMARTS, SimPoint, Stratified Sampling
  • Reduced Input Sets: MinneSPEC

18
Problems of Previous Works
  • Truncated Execution gives very inaccurate results
  • Reduced Input sets do not always behave the same
    as reference inputs so the performance estimation
    based on reduced input sets may be misleading.

19
Mechanism of SMARTS
  • W: warm-up time (fixed to 2000 instructions for
    SPEC2000)
  • U: detailed simulation (fixed to 1000
    instructions for SPEC2000)
  • (K-1)U: functional simulation with functional
    warming (the tool gives the value of K for which
    the IPC will be within 3% of the actual value
    with 99.7% confidence)
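
The W, U and K parameters define a periodic schedule. The sketch below lays out one possible reading, assuming each period of K*U instructions ends with W instructions of detailed warm-up followed by U measured instructions, with the rest simulated functionally; the function name `smarts_schedule`, the segment labels and the placement of the warm-up inside the (K-1)U stretch are assumptions rather than statements from the slide.

```python
def smarts_schedule(total_insts, U=1000, W=2000, K=10):
    """Return (mode, start, length) segments for periodic sampling.

    Each period covers K*U instructions: functional simulation,
    then W instructions of detailed warm-up, then U measured instructions.
    """
    if (K - 1) * U < W:
        raise ValueError("period too short to hold the warm-up")
    segments, pos = [], 0
    period = K * U
    while pos + period <= total_insts:
        functional = (K - 1) * U - W
        for mode, length in (("functional", functional),
                             ("warmup", W),
                             ("detailed", U)):
            if length:
                segments.append((mode, pos, length))
            pos += length
    return segments


for segment in smarts_schedule(10_000, K=5):
    print(segment)
```

Under this layout only (W + U) out of every K*U instructions run in the detailed simulator, which is where the speedup comes from.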

20
Issues in Previous Work
  • SMARTS:
  • The values of U and W are fixed for the SPEC2000
    suite and have to be identified again for every
    new benchmark suite (very time consuming)
  • Oversamples in steady phases; does not
    effectively exploit the existence of phases in
    programs
  • SimPoint:
  • The user chooses the length of a simulation point
    (100 million, 10 million, or 1 million
    instructions)
  • Provides simulation points based on clustering of
    basic block profiles, which are generated using
    sim-fast or ATOM

21
Phase Aware Stratified Sampling (PASS)
  • Deploy a hierarchical method to detect coarse and
    fine grain program phases
  • (1) Track the calling stack (a stable stack bottom
    marks a coarse grain phase) → inter-procedure
  • (2) Detect loops within the procedure →
    intra-procedure
  • (3) Track data access patterns such as strides
    within loops (fine grain phases)
  • Select stratified samples from each phase until
    high statistical confidence is reached
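
The last step, sampling each phase until the estimate is statistically tight, can be sketched as follows. This is a generic stratified-sampling illustration, not the PASS implementation: it assumes each phase is a non-empty list of per-interval IPC values, that phase weights are known, and that sampling stops when the confidence half-width falls below a relative error; `stratified_ipc`, the default thresholds (3% at z = 3, roughly 99.7%) and the sample data are hypothetical.

```python
import math
import random
import statistics


def stratified_ipc(phases, weights, rel_err=0.03, z=3.0, min_samples=5, seed=0):
    """Estimate overall IPC from per-phase samples.

    phases:  {phase_id: non-empty list of per-interval IPC values}
    weights: {phase_id: fraction of execution spent in that phase}
    Returns (estimate, {phase_id: number of samples used}).
    """
    rng = random.Random(seed)
    estimate, used = 0.0, {}
    for pid, values in phases.items():
        pool = values[:]
        rng.shuffle(pool)  # draw intervals in random order
        n = min(max(min_samples, 2), len(pool))
        while n < len(pool):
            sample = pool[:n]
            mean = statistics.fmean(sample)
            half_width = z * statistics.stdev(sample) / math.sqrt(n)
            if half_width <= rel_err * mean:
                break  # this phase is estimated tightly enough
            n += 1
        estimate += weights[pid] * statistics.fmean(pool[:n])
        used[pid] = n
    return estimate, used


phases = {"steady": [1.0] * 100, "noisy": [0.5, 1.5] * 50}
weights = {"steady": 0.5, "noisy": 0.5}
estimate, used = stratified_ipc(phases, weights)
print(estimate, used)
```

A steady phase stops at the minimum sample count while a noisy one keeps sampling, which is the behavior that avoids oversampling steady phases.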

22
IPC vs SimPoint (cc1-166, 1 million insts)

23
IPC vs Phase Classification on PASS (cc1-166, 1 million insts)

24
IPC vs SimPoint (cc1-166, 250 million insts)

25
IPC vs SimPoint (gzip-source, 1 million insts)

26
IPC vs Phase Classification on PASS (gzip-source, 1 million insts)

27
IPC vs SimPoint (gzip-source, 250 million insts)

28
IPC vs SimPoint (mcf-ref, 1 million insts)

29
IPC vs Phase Classification on PASS (mcf-ref)

30
IPC vs SimPoint (mcf-ref, 250 million insts)

31
IPC vs Phase Classification on PASS (gap-ref, 1 million insts)

32
IPC vs SimPoint (gap-ref, 250 million insts)

33
Summary
  • We show how HPM sampling (Pfmon) and dynamic
    instrumentation (Pin) are combined in our research
    framework (Pintos) for adaptive object code
    optimization and micro-architectural simulation.
  • PASS (Phase Aware Stratified Sampling) may offer
    a more efficient way to simulate the
    interaction between compiler optimizations and
    new micro-architectural features.