If the CPU is so fast, why are the programs running so slowly? - PowerPoint PPT Presentation

Slides: 44

Provided by: jonathanaa

Learn more at: http://www.cs.cornell.edu

Transcript and Presenter's Notes

1
If the CPU is so fast, why are the programs
running so slowly?

CS 614 Lecture, Fall 2007, Thursday, September
20, 2007
By Jonathan Winter

2
Introduction

Both papers discuss online profiling and
optimization.
Main Goals
Gather data about the user's actual experience
with the system and software
Improve application behavior without user
involvement
Identify performance bottlenecks in the real
world
Direct program optimization to alleviate these
slowdowns
Challenges
Continuously running profiler must have low
overhead
Difficult to extract detailed information at
runtime
Lack of application-specific information in an
online setting

3
Outline

Application Performance Basics
Studying Performance
Online Profiling
Program Optimization
Related Work and Background
The Digital Continuous Profiling Infrastructure
(DCPI)
The Morph System
Comparison
Comments and Critique
Conclusions

4
Application Performance Basics

CPU Time = Instruction Count x CPI x Clock
Cycle Time
Instruction Count: number of instructions in
program
Reduced through compilation techniques or ISA
changes
CPI: Cycles Per Instruction
Improved through micro-architectural changes
System level factors such as I/O and memory
accesses
Clock Cycle Time
Frequency dependent on micro-architecture
Circuit design and electron device technology
driven
CPI is primary focus of online profiling and
optimization
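
The CPU time relation above can be made concrete with a small calculation. This is a minimal sketch; the instruction count, CPI values, and clock rate are hypothetical numbers chosen only to show why lowering CPI shortens run time when the other two factors stay fixed.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = instruction count x CPI x clock cycle time."""
    cycle_time = 1.0 / clock_hz  # seconds per cycle
    return instruction_count * cpi * cycle_time

# Hypothetical workload: one billion instructions on a 333 MHz clock.
base = cpu_time_seconds(1_000_000_000, cpi=2.0, clock_hz=333e6)
tuned = cpu_time_seconds(1_000_000_000, cpi=1.5, clock_hz=333e6)

# With instruction count and clock fixed, speedup equals the CPI ratio.
speedup = base / tuned
print(f"base={base:.2f}s tuned={tuned:.2f}s speedup={speedup:.2f}x")
```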

5
Architectural View of Performance

Key tasks: get instructions, get data, and
provide resources
Improve performance by
Avoiding control, data, and structural hazards
Control: branch prediction, prefetching,
instruction caches, trace caches
Data: prefetching, data caches, load value
prediction, load-store forwarding
Structural: more resources, result value
forwarding

Increasing parallelism:
instruction, thread, and
memory level
Reducing cycle time:
pipelining, shorter stage
length

6
Analyzing Performance: When?

Analysis can be done at different stages of
development
Trade-off between ability to adapt and accuracy
Trade-off between application-specific and
runtime knowledge

7
Analyzing Performance: How?

A number of mechanisms can be used.
Static program analysis
Simulation - full system or CPU cycle accurate
Binary instrumentation
Performance counters
Operating system involvement
Major factors are
Accuracy vs. Speed vs. Coverage
Overhead and behavior perturbation
Ease of implementation

8
Online Profiling

Requires hardware and software support
Processor must monitor and track hardware events
Performance counters have become the dominant method
Operating system or application must access
counters
Use special purpose registers/memory space
Typically microprocessor vendors provide special
libraries
Challenges
Poor portability across hardware platforms and OS
Continuous profiling requires low overhead
Gathering, moving, and processing data can have
high cost
Source code and application information not
available
Makes analyzing performance bottlenecks
difficult.
Must remain transparent to system users

9
Performance Optimization

Range of options
Compiler level
Binary rewriting
Binary instrumentation
Online optimization
Hardware techniques
Benefits of Online Optimization
Customize program to specific hardware, OS, and
system
Adapts to user usage patterns and dynamic
variation
Optimize for common case
Does not require user or application developer
involvement

10
Related Work

DCPI and Morph claim to be the first online
low-overhead profiling and optimizing tools
Most prior tools were not online and had high
overhead.
E.g., Pixie, jprof, gprof, ATOM, MTOOL, SimOS,
quartz
Relied on intrusive techniques:
recompilation, binary instrumentation, simulation
Required significant user intervention
Some used performance counters but lacked detail
E.g., VTune sampler, iprobe, and Speedshop
Memory demands prevented use for continuous
profiling
Some used statistical sampling, e.g., prof and
Speedshop

11
Profiling Systems Summary
12
Hardware Performance Counters

Most common counters track basic information:
cycle count, instructions executed, and program
counter
More detailed counters track occurrence of the 3
hazards
E.g., branch mispredictions, cache misses, ALU
contention
DEC Alpha 21164 has numerous hazard counters
Can also track information about instruction
types
Pipeline stalls, instructions issued,
multiprocessor events
Major problem with counters: they are
microarchitecture specific
Two research efforts provide cross-platform support:
Performance Counter Library (PCL)
Performance Application Programming Interface
(PAPI)

13
Digital Continuous Profiling Infrastructure

Objectives
Achieve lower overhead than previous systems
Deliver a very high sampling rate
Provide more detailed and accurate cycle level
analysis
Three key tools included
dcpiprof: identifies distribution of cycles among
procedures
dcpicalc: instruction execution details and
stall causes
dcpistats: analyzes variation in profile data
Key contributions
Novel data structures for gathering counter
information
Innovative analysis of counters to determine
cause of stalls

14
Procedure-Level Bottlenecks

Identify dominant procedures to focus on for
optimization
Obtain low level details, such as instruction
cache miss rates

15
Instruction-Level Bottlenecks

Static analysis can identify structural hazards.
This provides best-case
DCPI identifies all possible stall causes
(conservatively)
Different executions of code may suffer from
different stalls

16
Analysis of Variance Across Executions

Variance analysis is useful to characterize
system effects
Important to evaluate applicability of
optimizations

17
DCPI System Overview
[Figure: DCPI system overview. Hardware (cpu 1 ... cpu n) feeds the kernel device driver, which keeps per-CPU data (hash table and overflow buffer). Buffered samples and load map info pass to the user-space daemon, which writes profiles. Analysis tools (system-, load-file-, procedure-, and instruction-level) read the profiles, load files, and optional source code.]
18
DCPI Hardware Support

Performance counters generate interrupts on overflow
Interrupt passes PID, program counter, and event
type
DCPI monitors CYCLES and IMISS events by default
Intelligent analysis obtains all desired
execution details
Other events can be monitored but must be
multiplexed
Sampling period is configurable (between 4K and
64K)
Period is randomized to minimize systematic
correlations
Six cycle latency between event overflow and PC
Does not affect sampling accuracy for CYCLES and
IMISS
Blind spots exist during execution of PALcode and
highest level interrupts
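
The reason for randomizing the period can be illustrated with a toy simulation. This is a sketch only: the 4K to 64K bounds are borrowed from the range quoted above purely for illustration, the loop length is hypothetical, and the helper names are not part of DCPI. A fixed period that is a multiple of a loop's length always lands on the same instruction, while a randomized period spreads samples across the loop.

```python
import random

def next_sampling_period(rng, low=4 * 1024, high=64 * 1024):
    """Pick how many cycles elapse before the next counter overflow.
    The bounds are illustrative, not DCPI's actual configuration."""
    return rng.randint(low, high)

def sample_times(rng, total_cycles):
    """Cycle numbers at which samples are taken over a run."""
    times, now = [], 0
    while True:
        now += next_sampling_period(rng)
        if now > total_cycles:
            return times
        times.append(now)

LOOP_CYCLES = 1000  # hypothetical loop body length in cycles

# Fixed period of 4000 cycles: every sample hits the same loop offset.
fixed_offsets = {(k * 4000) % LOOP_CYCLES for k in range(1, 200)}

# Randomized period: samples land on many different loop offsets.
rng = random.Random(0)
random_offsets = {t % LOOP_CYCLES for t in sample_times(rng, 5_000_000)}

print(len(fixed_offsets), len(random_offsets))
```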

19
DCPI Kernel Device Driver

DCPI has a high interrupt rate: 5200 per second at
333MHz
Fast interrupt handler is critical.
Taking 1000 cycles would consume 1.5% of CPU
Tagged TLB avoids most TLB flushes
Need to reduce cache misses to memory (100
cycles)
Transfer of data from kernel to user space is
bottleneck
Smart data structures reduce overhead
Hash table reduces accessed cache lines
Entry data (PID, PC, and event) packed into 16
bytes
Counter events are aggregated in driver memory
Overflow buffers handle evictions and data
transfer
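
The aggregation idea above can be sketched in a few lines. This is not the DCPI driver: the class name, bucket count, associativity, and eviction choice are all assumptions made for illustration. It only shows the general pattern of counting repeated (PID, PC, event) samples in a small hash table and pushing evicted entries to an overflow buffer that is drained later.

```python
from collections import namedtuple

Key = namedtuple("Key", "pid pc event")

class SampleAggregator:
    """Toy model of a driver-side sample table (hypothetical design)."""

    def __init__(self, buckets=4, ways=2):
        self.table = [dict() for _ in range(buckets)]  # bucket: key -> count
        self.ways = ways          # max entries per bucket
        self.overflow = []        # evicted (key, count) pairs

    def record(self, pid, pc, event):
        key = Key(pid, pc, event)
        bucket = self.table[hash(key) % len(self.table)]
        if key in bucket:
            bucket[key] += 1      # common case: bump an existing counter
            return
        if len(bucket) >= self.ways:
            victim = next(iter(bucket))  # evict the oldest entry in the bucket
            self.overflow.append((victim, bucket.pop(victim)))
        bucket[key] = 1

    def flush(self):
        """Drain everything, as a user-space reader eventually would."""
        for bucket in self.table:
            self.overflow.extend(bucket.items())
            bucket.clear()
        drained, self.overflow = self.overflow, []
        return drained

agg = SampleAggregator(buckets=2, ways=1)
for pc in [0x100, 0x100, 0x104, 0x100, 0x108, 0x104]:
    agg.record(pid=7, pc=pc, event="CYCLES")

totals = {}
for key, count in agg.flush():
    totals[key.pc] = totals.get(key.pc, 0) + count
```

A key may appear more than once in the drained list if it was evicted and later re-inserted, so the consumer sums counts per key, which mirrors the merge step described for the user-mode daemon.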

20
DCPI User-Mode Daemon

Upon full overflow buffer, data is moved to user
space
PID and PC identify the program, and event data is
merged with accumulated profile information
Program image data obtained from:
Modified loader
Recognizer routines invoked by kernel exec
Mach-based system calls
User space data merged with disk database
periodically
Disk usage minimized by compact format
Small fraction of program image is actually
executed

21
DCPI Uniprocessor Workloads
22
DCPI Multiprocessor Workloads
23
DCPI Workload Slowdowns
24
DCPI Time Overhead Breakdown

Interrupt handler setup and teardown took an
additional 214 cycles

25
DCPI Space Overhead Breakdown

Device driver has two 8K entry overflow buffers
and a 16K entry hash table, totaling 512KB of
kernel memory.
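
The time and space figures quoted for the device driver can be cross-checked with simple arithmetic. This sketch uses only numbers stated on the slides (5200 interrupts per second, 1000 cycles per interrupt, a 333 MHz clock, 16-byte entries, two 8K-entry overflow buffers, and a 16K-entry hash table); the function names are illustrative.

```python
def interrupt_cpu_fraction(interrupts_per_sec, cycles_per_interrupt, clock_hz):
    """Fraction of CPU cycles spent in the interrupt handler."""
    return interrupts_per_sec * cycles_per_interrupt / clock_hz

def driver_memory_bytes(overflow_buffers, overflow_entries, hash_entries, entry_bytes):
    """Kernel memory used by the overflow buffers plus the hash table."""
    return (overflow_buffers * overflow_entries + hash_entries) * entry_bytes

time_fraction = interrupt_cpu_fraction(5200, 1000, 333e6)
space_bytes = driver_memory_bytes(2, 8 * 1024, 16 * 1024, 16)

print(f"handler share of CPU: {time_fraction:.2%}")
print(f"driver memory: {space_bytes // 1024} KB")
```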

26
DCPI Analyzing Profile Data

CYCLES profile data indicates approximate time
each instruction spent at the head of the issue
queue
High values could indicate
Instruction executed frequently
Instruction spent much time stalling
Objective: determine
Execution frequency and CPI (phase 1)
Set of culprits causing stalls (phase 2)

27
Phase 1: Estimating Frequency and CPI

Frequency and CPI must be determined only from
sample counts and static procedure control flow
analysis
Sample Count is proportional to Frequency x CPI
Procedure
Build control flow graph from basic block
analysis
Group basic blocks and edges into equivalence
classes
Statically determine minimum time at head of
queue
Assume lowest sample counts indicate minimum CPI
Propagate frequency estimates around CFG
Derive confidence estimates using heuristics

28
Evaluation of Phase 1 Analysis
[Figure panels: Instruction Frequency, Edge Frequency]

Evaluation used base SPECfp and peak
SPECint workloads
dcpix, a profiling tool, is used to gather
execution counts
73% of instructions within 5% of count, 58% of
edges within 10%

29
Phase 2: Identifying Stall Culprits

Analysis uses only binary executable and sample
counts
Static stalls determined by accurate processor
modeling
Dynamic culprits isolated by process of
elimination
Technique specific to each stall cause
Less than 10% of stalls remain unexplained
Example: instruction cache misses
Rule out a miss when the instruction is in the
same cache line as the instruction before it
Determine when this occurs by basic block
analysis
Accuracy can be determined by comparing against
event sampling of stall causes
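
The instruction cache example above lends itself to a short sketch of the process of elimination. The 32-byte line size and the function names are assumptions for illustration; the 4-byte instruction stride reflects the Alpha's fixed-width instructions. Within a basic block each instruction always follows its predecessor, so an instruction in the same line as the one before it cannot be the first to touch that line.

```python
CACHE_LINE_BYTES = 32  # assumed line size, for illustration only

def can_rule_out_icache_miss(pc, prev_pc, line_bytes=CACHE_LINE_BYTES):
    """True if an I-cache miss cannot explain a stall at pc.
    prev_pc is the instruction that always executes right before pc,
    or None when the predecessor is not known statically."""
    if prev_pc is None:
        return False  # stay conservative at block entry
    return pc // line_bytes == prev_pc // line_bytes

def icache_candidates(block_pcs, line_bytes=CACHE_LINE_BYTES):
    """Addresses in a basic block where an I-cache miss remains
    a possible stall culprit."""
    candidates, prev = [], None
    for pc in block_pcs:
        if not can_rule_out_icache_miss(pc, prev, line_bytes):
            candidates.append(pc)
        prev = pc
    return candidates

# Hypothetical block of seven 4-byte instructions starting at 0x1010.
block = list(range(0x1010, 0x102C, 4))
suspects = icache_candidates(block)
```

Only the block entry and the first instruction on the next cache line stay on the suspect list; every other instruction has that culprit eliminated.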

30
Evaluation of Phase 2 Analysis
31
The Morph System

Objectives
Provide user and machine specific optimization
capability
Optimizations should not require source code
Profiling and optimization process should be
transparent
Key Components
Morph Monitor: online gathering of counter
information
Morph Manager: processes and prepares data for
optimization
Morph Editor: conducts optimizations on
intermediate form
Contributions
Develops full system with code layout
optimizations as case study

32
Morph System Overview

Two other components
Morph Back-end provides executable with
intermediate form annotations to support online
optimization
PostMorph can infer annotations from static and
dynamic analysis to improve legacy applications

33
The Morph Monitor

Program activity gauged by low-cost statistical
sampling
Modified clock interrupt routine collects samples
Interrupt rate of 1024 Hz producing 8-byte
samples
Claim that synchronization with clock is not
detrimental
Monitor requires 256KB of kernel memory
Transfer of data to Morph Manager occurs every 30
seconds
Small modifications to OS required
exec() and mmap() changed to provide address
space data
exit() modified to log process termination events
Context switch information must also be logged
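
The Monitor's sampling rate, sample size, transfer interval, and buffer size can be related with a quick calculation. This sketch counts PC samples only and ignores the logged exec, mmap, exit, and context-switch events, whose volume the slides do not give; the function name is illustrative.

```python
def monitor_bytes(sample_hz, sample_bytes, seconds):
    """Bytes of PC samples produced over an interval."""
    return sample_hz * sample_bytes * seconds

per_second = monitor_bytes(1024, 8, 1)     # sample data per second
per_transfer = monitor_bytes(1024, 8, 30)  # between 30-second transfers

kernel_buffer = 256 * 1024                 # Monitor's kernel memory
fits = per_transfer <= kernel_buffer

print(per_second // 1024, "KB/s;", per_transfer // 1024, "KB per transfer")
```

Under these assumptions, 30 seconds of samples occupies 240KB, which fits within the 256KB of kernel memory quoted above.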

34
The Morph Manager

Manager must compile sample data from multiple
sample sets and execution modules
During program updates, sample data must be
ignored
Program counter samples must be interpreted
Intermediate representation contains CFG
information
PC samples are scaled for basic block size
Aggregate basic block execution profile is
created
Morph does not compensate for CPI
Authors argue that time-based approach is not
detrimental
Profiles from multiple inputs must be combined
Morph combines information weighted by execution
length
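
Two of the Manager's steps, scaling PC samples by basic block size and combining profiles from multiple inputs, can be sketched as follows. This is one plausible reading of the slide rather than Morph's actual code: the data layouts and function names are hypothetical, and each run's profile is normalized before being weighted by its execution length.

```python
INSTRUCTION_BYTES = 4  # Alpha instructions are fixed-width

def block_profile(pc_samples, blocks):
    """pc_samples: {pc: sample count}.
    blocks: {name: (start_pc, number_of_instructions)}.
    Dividing by block size keeps long blocks from looking hot
    merely because they contain more instructions."""
    profile = {}
    for name, (start, n) in blocks.items():
        total = sum(pc_samples.get(start + INSTRUCTION_BYTES * i, 0)
                    for i in range(n))
        profile[name] = total / n
    return profile

def combine_profiles(runs):
    """runs: list of (profile, execution_length) pairs.
    Returns the length-weighted average of normalized profiles."""
    total_length = sum(length for _, length in runs)
    combined = {}
    for profile, length in runs:
        norm = sum(profile.values()) or 1.0
        for name, value in profile.items():
            share = (value / norm) * (length / total_length)
            combined[name] = combined.get(name, 0.0) + share
    return combined

blocks = {"A": (0x100, 2), "B": (0x108, 4)}
pc_samples = {0x100: 10, 0x104: 10, 0x108: 5, 0x10C: 5, 0x110: 5, 0x114: 5}
profile = block_profile(pc_samples, blocks)

merged = combine_profiles([({"A": 1.0, "B": 1.0}, 1),
                           ({"A": 3.0, "B": 1.0}, 3)])
```

Both blocks received 20 samples in total, but after scaling, block A appears twice as hot per instruction as block B.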

35
The Morph Editor

Implemented as a composition of SUIF compiler
passes
Intermediate representation is modified low-level
SUIF
Three code layout optimizations performed
Branch alignment
Fluff removal
Procedure layout
Optimizations require basic block execution
counts and CFG edge frequencies (calculated by
Morph Editor)
Profile information used to optimize for common
case
Optimizations reduce control hazards such as
branch mispredictions and misfetches, and improve
cache locality
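
How edge frequencies can drive a layout decision is sketched below with a generic greedy chaining heuristic in the style of classic profile-guided code positioning. This is a swapped-in technique for illustration and is not claimed to be the Morph Editor's algorithm; the function name and the example CFG are hypothetical. The hottest edges become fall-throughs, and cold blocks drift to the end.

```python
def chain_blocks(entry, edge_freq):
    """edge_freq: {(src, dst): count}. Returns a block order in which
    the most frequent edges are laid out as fall-throughs."""
    next_of, prev_of = {}, {}

    def chain_head(block):
        while block in prev_of:
            block = prev_of[block]
        return block

    # Visit edges from hottest to coldest and link blocks greedily.
    for (src, dst), _ in sorted(edge_freq.items(), key=lambda kv: -kv[1]):
        if src in next_of or dst in prev_of or dst == entry:
            continue  # src already has a fall-through, or dst is taken
        if chain_head(src) == dst:
            continue  # linking would close a cycle
        next_of[src] = dst
        prev_of[dst] = src

    # Emit chains, starting with the one that contains the entry block.
    others = sorted({b for edge in edge_freq for b in edge} - {entry})
    order, placed = [], set()
    for block in [entry] + others:
        current = chain_head(block)
        while current is not None and current not in placed:
            order.append(current)
            placed.add(current)
            current = next_of.get(current)
    return order

# Hypothetical diamond-shaped CFG: A branches to B (hot) or C (cold).
edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
layout = chain_blocks("A", edges)
```

Here the hot path A, B, D becomes straight-line code and the rarely executed block C is placed after it.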

36
Morph Workload Descriptions and Inputs
I am not clear on the necessity or desirability
of the two-stage experiment with test and
train workload inputs for this study
37
Morph Overhead in Online Monitor

Non-determinism of bin-hopping policy for
virtual to physical page mapping caused problems
DU is the baseline Digital Unix using page
coloring for mapping
Larger benchmarks have higher overhead due to
cache conflicts
Strawman tests conducted to quantify the
relationship between working set and profiling
overhead
Monitor adds 72 instructions to clock interrupt

38
Morph Overhead in Offline Manager

At 1024 Hz, 8KB of data per second is generated by
Monitor
Adding logged events, Manager must copy 110KB to
disk every 10 sec
Profiles made 640KB per minute
Manager can process 60 MB per minute (up to 900
MB per day)
Data typically much less
Long term storage augments intermediate
representation and is very compact

39
Morph Optimization Results

Profiled samples are captured from train input
sets.
Execution time improvement is measured on test
input sets
Results compared to conventional optimization
techniques utilizing complete profile information
instead of sampling

40
DCPI and Morph Comparison - Similarities

Both target DEC Alpha processors
Same available hardware and OS support (Digital
Unix)
First two works proposing low overhead online
profiling
Both employ statistical sampling of processor
activity
Program counter samples provide bulk of insight
Common infrastructure design and division of
labor
Light-weight kernel process for counter
collection
Acts like device driver for performance counters
Slower user-mode daemon for processing data
Comparable performance overhead
1-3% for DCPI (5x faster sampling) and 0.3% for
Morph

41
DCPI and Morph Comparison - Differences

Significant focus of Morph on optimization side
Optimization tool tightly integrated
DCPI leaves optimization task to others
Authors' goal was to develop a tool for broad
use
Morph developed more for proof-of-concept
Develops more integrated profiling and
optimization suite
DCPI has heavier instruction-level analysis focus
Stall culprit analysis allows for more extensive
optimizations
Morph's profile data limits optimization to code
layout
DCPI provides multiprocessor support
Morph targets single user workstations

42
Comments and Critique

Proposed methodology lacks portability
Profiling infrastructure tied to DEC Alpha and
Digital Unix
Common infrastructure (PCL, PAPI) seems more
promising
Ability to infer stall causes from PC counts
limited to in-order processors
Out-of-order execution poses serious problem
Papers focus on processor core and memory
hierarchy
Interconnect performance and I/O critical in
multi-core
Would have liked to see more detail on
optimization side
How is the profile and optimization cycle
automated?

43
Conclusions

Systems research must be reconciled with
performance profiling
Low-level architectural events are responsible
for significant performance losses
Critical to consider low-level impact of
OS/system design
OS level changes could affect pipeline stalls
Perceived gains or losses could be accidental
side-effect
Are high level performance measurements of
virtualization or µKernel overhead meaningful?
Performance results must be taken with grain of
salt
Lots of salt, of many different origins