Statistical Profiling: Hardware, OS, and Analysis Tools
Transcript of a DCPI (DIGITAL Continuous Profiling Infrastructure) profiling tutorial presentation, 10/4/98
1
Statistical Profiling: Hardware, OS, and Analysis Tools
2
Joint Work
  • DIGITAL Continuous Profiling Infrastructure (DCPI) Project Members
    • Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
    • Western Research Lab: Jennifer Anderson, Jeffrey Dean
  • Other Collaborators
    • Cambridge Research Lab: Jamey Hicks
    • Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White

3
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

4
Statistical Profiling
  • Based on periodic sampling
    • Hardware generates periodic interrupts
    • OS handles the interrupts and stores data: the Program Counter (PC) and any extra info
    • Analysis tools convert the data for users and for compilers
  • Examples: DCPI, Morph, SGI SpeedShop, Unix's prof(), VTune

5
Sampling vs. Instrumentation
  • Much lower overhead than instrumentation
    • DCPI: programs run 1-3% slower
    • Pixie: programs run 2-3 times slower
  • Applicable to large workloads
    • 100,000 TPS on Alpha
    • AltaVista
  • Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
    • Instrumenting kernels is very tricky
  • No source code needed

6
Information from Profiles
  • DCPI estimates
    • Where CPU cycles went, broken down by image, procedure, and instruction
    • How often code was executed: basic blocks and CFG edges
    • Where peak performance was lost and why

7
Example: Getting the Big Picture

Total samples for event type cycles = 6095201

 cycles      %   cum%  load file
2257103  37.03  37.03  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21  64.24  /vmunix
 928318  15.23  79.47  /usr/shlib/X11/libmi.so
 650299  10.67  90.14  /usr/shlib/X11/libos.so

 cycles      %   cum%  procedure              load file
2064143  33.87  33.87  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49  42.35  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01  47.36  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45  51.81  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03  55.84  bcopy                  /vmunix
 209835   3.44  59.28  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06  62.34  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80  65.14  in_checksum            /vmunix
 161326   2.65  67.78  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19  69.98  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
8
Example: Using the Microscope
Where peak performance is lost and why
9
Example: Summarizing Stalls

I-cache (not ITB)      0.0 to  0.3
ITB/I-cache miss       0.0 to  0.0
D-cache miss          27.9 to 27.9
DTB miss               9.2 to 18.3
Write buffer           0.0 to  6.3
Synchronization        0.0 to  0.0
Branch mispredict      0.0 to  2.6
IMUL busy              0.0 to  0.0
FDIV busy              0.0 to  0.0
Other                  0.0 to  0.0
Unexplained stall      2.3 to  2.3
Unexplained gain      -4.3 to -4.3
-----------------------------------
Subtotal dynamic              44.1

Slotting                       1.8
Ra dependency                  2.0
Rb dependency                  1.0
Rc dependency                  0.0
FU dependency                  0.0
-----------------------------------
Subtotal static                4.8
-----------------------------------
Total stall                   48.9
Execution                     51.2
Net sampling error            -0.1
-----------------------------------
Total tallied                100.0
(35171 samples, 93.1% of all samples)
10
Example: Sorting Stalls

   %  cum%  cycles   cnt   cpi  blame   PC    file:line
10.0  10.0  109885  4998  22.0  dcache  957c  comp.c:484
 9.9  19.8  108776  5513  19.7  dcache  9530  comp.c:477
 7.8  27.6   85668  3836  22.3  dcache  959c  comp.c:488
11
Instruction-level Information Matters
  • DCPI anecdotes
    • TPC-D: 10% speedup
    • Duplicate filtering for AltaVista: part of a 19x speedup
    • Compress program: 22%
    • Compiler improvements: 20% in several Spec benchmarks

12
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

13
Typical Hardware Support
  • Timers
    • Clock interrupt after N units of time
  • Performance counters
    • Interrupt after N events: cycles, issues, loads, L1 D-cache misses, branch mispredicts, uops retired, ...
    • Alpha 21064, 21164; PPro, PII
  • Easy to measure total cycles, issues, CPI, etc.
  • Only extra information is the restart PC

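The timer-based scheme above can be illustrated with a minimal user-space sketch (not DCPI's kernel driver): a profiling timer periodically interrupts the program, and the handler records where it was. Python's signal module stands in for the hardware interrupt, the interrupted frame's function and line number play the role of the restart PC, and busy() is a hypothetical workload.

```python
import collections
import signal

samples = collections.Counter()

def handler(signum, frame):
    # The interrupted frame is the "restart PC": record function and line.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

def busy(n):
    # Hypothetical workload to be profiled.
    total = 0
    for i in range(n):
        total += i * i
    return total

signal.signal(signal.SIGPROF, handler)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # interrupt every ~1 ms of CPU time
busy(3_000_000)
signal.setitimer(signal.ITIMER_PROF, 0, 0)          # stop sampling

for (name, line), count in samples.most_common(3):
    print(name, line, count)
```

Nearly all samples land in busy(), mirroring how cycle samples pile up on the code that consumes the CPU; a real profiler records hardware PCs and aggregates them in the kernel.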
14
Problem: Inaccurate Attribution
  • Experiment
    • Count data loads
    • Loop: a single load plus hundreds of nops
  • In-order processor (Alpha 21164)
    • Skew: one large peak, displaced from the load
  • Out-of-order processor (Intel Pentium Pro)
    • Skew and smear: samples spread over instructions near the load

[Figure: histograms of load-count samples around the load instruction]
15
Ramification of Misattribution
  • No skew or smear
    • Instruction-level analysis is easy!
  • Skew by a constant number of cycles
    • Instruction-level analysis is possible
    • Adjust the sampling period by the amount of skew
    • Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
  • Smear
    • Instruction-level analysis seems hopeless
    • Examples: PII, StrongARM

16
Desired Hardware Support
  • Sample fetched instructions
  • Save PC of sampled instruction
  • E.g., interrupt handler reads Internal Processor
    Register
  • Makes skew and smear irrelevant
  • Gather more information

17
ProfileMe: Instruction-Centric Profiling

[Pipeline diagram: fetch → map → issue → exec → retire. When the fetch counter overflows, a randomly selected fetched instruction is given a ProfileMe tag. As the tagged instruction flows past the I-cache, branch predictor, arithmetic units, and D-cache, internal processor registers capture its PC, effective address, retired status, I-cache and D-cache miss flags, branch mispredict flag, branch history, and stage latencies; when the instruction is done, an interrupt is raised.]
18
Instruction-Level Statistics
  • PC + retire status → execution frequency
  • PC + cache miss flag → cache miss rates
  • PC + branch mispredict → mispredict rates
  • PC + event flag → event rates
  • PC + branch direction → edge frequencies
  • PC + branch history → path execution rates
  • PC + latency → instruction stalls
    • 100-cycle D-cache miss vs. D-cache miss

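A sketch of how such per-instruction statistics fall out of the samples, using hypothetical (pc, retired, dcache_miss) sample records: relative execution frequency follows from PC plus retire status, and miss rates from PC plus the miss flag.

```python
from collections import defaultdict

# Hypothetical ProfileMe sample records: (pc, retired, dcache_miss).
samples = [
    (0x9530, True, True),
    (0x9530, True, False),
    (0x9530, True, True),
    (0x957c, False, False),   # aborted instruction: sampled but not retired
]

total = defaultdict(int)    # samples per PC -> relative execution frequency
retired = defaultdict(int)  # PC + retire status
misses = defaultdict(int)   # PC + miss flag

for pc, did_retire, missed in samples:
    total[pc] += 1
    retired[pc] += did_retire
    misses[pc] += missed

for pc in sorted(total):
    print(hex(pc), "retire rate:", retired[pc] / total[pc],
          "miss rate:", misses[pc] / total[pc])
```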
19
Kernel Device Driver
  • Challenge 1 of 64K is only 655 cycles/sample
  • Aggregate samples in hash table
  • (PID, PC, event) ? count
  • Minimize cache misses
  • 100 cycles to memory
  • Pack data structures into cache lines
  • Eliminate expensive synchronization operations
  • Interprocessor interrupts for synchronization
    with daemon
  • Replicate main data structures on each processor

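The aggregation scheme above can be sketched as follows, under simplifying assumptions: one hash table per processor keyed by (PID, PC, event), with the replicated tables merged later so interrupt handlers never need to synchronize.

```python
from collections import Counter

# One hash table per processor: (PID, PC, event) -> count.
per_cpu = [Counter(), Counter()]

def record_sample(cpu, pid, pc, event):
    per_cpu[cpu][(pid, pc, event)] += 1   # lock-free: each CPU owns its table

# Hypothetical interrupts on two processors.
record_sample(0, 1234, 0x957c, "cycles")
record_sample(0, 1234, 0x957c, "cycles")
record_sample(1, 1234, 0x957c, "cycles")
record_sample(1, 5678, 0x9530, "cycles")

# The user-space daemon merges the replicated tables.
merged = sum(per_cpu, Counter())
print(merged[(1234, 0x957c, "cycles")])  # -> 3
```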
20
Moving Samples to Disk
  • User-space daemon
    • Extracts raw samples from the driver
    • Associates samples with compiled code
    • Updates disk-based profiles for the compiled code
  • Mapping <PID, PC> samples to compiled code
    • Dynamic loader hook for dynamically loaded code
    • Exec hook for statically linked code
    • Other hooks for initializing the mapping at daemon start-up
  • Profiles
    • Text header + compact binary samples

21
Performance of Data Collection (DCPI)
  • Time
    • 1-3% total overhead for most workloads
    • Often less than the variation from run to run
  • Space
    • 512 KB kernel memory per processor
    • 2-10 MB resident for the daemon
    • 10 MB disk after one month of profiling on a heavily used timeshared 4-processor machine
  • Non-intrusive enough to be run for many hours on production systems

22
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

23
Data Analysis
  • Cycle samples are proportional to total time at
    head of issue queue (at least on in-order Alphas)
  • Frequency indicates frequent paths
  • CPI indicates stalls

24
Estimating Frequency from Samples
  • Problem
    • Given cycle samples, compute frequency and CPI
  • Approach
    • Let F = Frequency / Sampling Period
    • E(Cycle Samples) = F × CPI
    • So F = E(Cycle Samples) / CPI

25
Estimating Frequency (cont.)
  • F = E(Cycle Samples) / CPI
  • Idea
    • If an instruction has no dynamic stall, its CPI is known, so F can be estimated
    • So assume some instructions have no dynamic stalls
    • Consider a group of instructions with the same frequency (e.g., a basic block)
    • Identify the instructions without dynamic stalls, then average their sample counts for better accuracy
  • Key insight
    • Instructions without stalls have smaller sample counts

26
Estimating Frequency (Example)
  • Compute MinCPI from the code
  • Compute Samples/MinCPI
  • Select the data to average
  • Does badly when
    • There are few issue points
    • All issue points stall

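The three steps above can be sketched with hypothetical per-instruction data for one basic block: observed cycle-sample counts and the minimum CPI implied by the static schedule.

```python
# Hypothetical per-instruction data for one basic block.
samples = [600, 610, 1800, 590, 2400]   # observed cycle samples per instruction
min_cpi = [1, 1, 1, 1, 2]               # MinCPI from the static schedule

# Samples/MinCPI estimates F at each issue point; dynamically stalled
# instructions have inflated counts, so the small estimates are the
# trustworthy ones.
estimates = sorted(s / c for s, c in zip(samples, min_cpi))

# Average the smallest half of the estimates to reduce sampling noise
# while excluding issue points that stalled dynamically.
k = max(1, len(estimates) // 2)
freq = sum(estimates[:k]) / k
print(round(freq))  # -> 595
```

The choice of "smallest half" is an assumption for illustration; the actual selection rule depends on how many issue points are believed stall-free.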
27
Frequency Estimate Accuracy
  • Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
  • Edge frequencies are a bit less accurate

28
Explaining Stalls
  • Static stalls
  • Schedule instructions in each basic block
    optimistically using a detailed pipeline model
    for the processor
  • Dynamic stalls
  • Start with all possible explanations
  • I-cache miss, D-cache miss, DTB miss, branch
    mispredict, ...
  • Rule out unlikely explanations
  • List the remaining possibilities

29
Ruling Out D-cache Misses
  • Is the previous occurrence of an operand register
    the destination of a load instruction?
  • Search backward across basic block boundaries
  • Prune by block and edge execution frequencies
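The register-search rule above can be sketched as follows, with instructions modeled as hypothetical (dest, srcs, is_load) tuples in program order; this sketch stays within one block and omits the frequency-based pruning.

```python
# Can a stalled instruction's delay be blamed on a D-cache miss?  Walk
# backward from it and check whether the most recent writer of one of its
# operand registers was a load.
def stall_may_be_dcache_miss(instrs, stall_idx):
    operands = set(instrs[stall_idx][1])        # source registers of the stalled instr
    for dest, _srcs, is_load in reversed(instrs[:stall_idx]):
        if dest in operands:
            return is_load                      # writer found: load => plausible miss
    return True                                 # writer outside the window: can't rule it out

block = [
    ("r1", (), True),          # load r1, ...
    ("r2", ("r1",), False),    # add  r2, r1, ...   <- stalled instruction
]
print(stall_may_be_dcache_miss(block, 1))  # -> True: r1 was last written by a load
```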

30
Out-of-Order Processors
  • In-order processors
    • A periodic interrupt lands on a well-defined current instruction, e.g., the next instruction to issue
    • Peak performance = no wasted issue slots
    • Any stall implies a loss in performance
  • Out-of-order processors
    • Many instructions in flight; no single current instruction
    • Some stalls are masked by concurrent execution: instructions issue around the stalled instruction
  • Example: does this stall matter?
    • load r1, ...
    • add ..., r1, ...   (average latency 15.0 cycles)
    • ... other instructions ...

31
Issue: Need to Measure Concurrency
  • Interesting concurrency metrics
    • Retired instructions per cycle
    • Issue slots wasted while an instruction is in flight
    • Pipeline stage utilization
  • How to measure concurrency?
    • Special-purpose hardware
    • Some metrics are difficult to measure, e.g., those needing retire/abort status
    • Sample potentially-concurrent instructions
    • Aggregate info from pairs of samples
    • Statistically estimate the metrics

32
Paired Sampling
  • Sample two instructions
    • They may be in flight simultaneously
    • Replicate the ProfileMe hardware; add the intra-pair distance
  • Nested sampling
    • Sample a window around the first profiled instruction
    • Randomly select the second profiled instruction
    • Statistically estimate the frequency of F(first, second)

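Nested sampling can be sketched with assumed parameters PERIOD and WINDOW: the first instruction is selected every PERIOD fetches, and the second is drawn at random from a small window after it, so the pair may be in flight together.

```python
import random

PERIOD = 64 * 1024   # assumed sampling period, in fetched instructions
WINDOW = 8           # assumed window size for the second instruction

def pick_pair(fetch_counter):
    # Select the first instruction every PERIOD fetches, then a random
    # second instruction within WINDOW fetches; the difference is the
    # intra-pair distance the hardware would record.
    if fetch_counter % PERIOD != 0:
        return None
    first = fetch_counter
    second = first + random.randint(1, WINDOW)
    return first, second

pairs = []
for n in range(0, 5 * PERIOD + 1):   # simulate a stream of fetch counts
    pair = pick_pair(n)
    if pair:
        pairs.append(pair)

print(len(pairs))  # -> 6: one pair per sampling period
```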
33
Explaining Lost Performance
  • An open question
  • Some in-order analysis is applicable
    • E.g., the D-cache miss and branch mispredict analyses
  • Pipe-stage latencies from counters would help a lot

34
Summary / Conclusion
  • Statistical profiling can be
    • Inexpensive
    • Effective
  • Instruction-level analysis matters
  • Performance counters
    • Implementation details make a big difference
    • Out-of-order processors require better counters