Statistical Profiling: Hardware, OS, and Analysis Tools
Transcript of a DCPI (DIGITAL Continuous Profiling Infrastructure) profiling tutorial presentation, 10/4/98
1
Statistical Profiling: Hardware, OS, and Analysis Tools
2
Joint Work
  • DIGITAL Continuous Profiling Infrastructure (DCPI) Project Members
    • Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
    • Western Research Lab: Jennifer Anderson, Jeffrey Dean
  • Other Collaborators
    • Cambridge Research Lab: Jamey Hicks
    • Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White

3
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

4
Statistical Profiling
  • Based on periodic sampling
    • Hardware generates periodic interrupts
    • OS handles the interrupts and stores data: the Program Counter (PC) and any extra info
    • Analysis tools convert the data for users and for compilers
  • Examples: DCPI, Morph, SGI SpeedShop, Unix's prof(), VTune

5
Sampling vs. Instrumentation
  • Much lower overhead than instrumentation
    • DCPI: programs run 1-3% slower
    • Pixie: programs run 2-3 times slower
  • Applicable to large workloads
    • 100,000 TPS on Alpha
    • AltaVista
  • Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
    • Instrumenting kernels is very tricky
  • No source code needed

6
Information from Profiles
  • DCPI estimates
    • Where CPU cycles went, broken down by image, procedure, and instruction
    • How often code was executed: basic blocks and CFG edges
    • Where peak performance was lost and why

7
Example: Getting the Big Picture

Total samples for event type cycles = 6095201

 cycles      %   cum%  load file
2257103  37.03  37.03  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21  64.24  /vmunix
 928318  15.23  79.47  /usr/shlib/X11/libmi.so
 650299  10.67  90.14  /usr/shlib/X11/libos.so

 cycles      %   cum%  procedure              load file
2064143  33.87  33.87  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49  42.35  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01  47.36  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45  51.81  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03  55.84  bcopy                  /vmunix
 209835   3.44  59.28  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06  62.34  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80  65.14  in_checksum            /vmunix
 161326   2.65  67.78  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19  69.98  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
8
Example: Using the Microscope
Where peak performance is lost and why
9
Example: Summarizing Stalls

I-cache (not ITB)      0.0 to  0.3
ITB/I-cache miss       0.0 to  0.0
D-cache miss          27.9 to 27.9
DTB miss               9.2 to 18.3
Write buffer           0.0 to  6.3
Synchronization        0.0 to  0.0
Branch mispredict      0.0 to  2.6
IMUL busy              0.0 to  0.0
FDIV busy              0.0 to  0.0
Other                  0.0 to  0.0
Unexplained stall      2.3 to  2.3
Unexplained gain      -4.3 to -4.3
-----------------------------------
Subtotal dynamic              44.1

Slotting                       1.8
Ra dependency                  2.0
Rb dependency                  1.0
Rc dependency                  0.0
FU dependency                  0.0
-----------------------------------
Subtotal static                4.8
-----------------------------------
Total stall                   48.9
Execution                     51.2
Net sampling error            -0.1
-----------------------------------
Total tallied                100.0
(35171 samples, 93.1% of all samples)
10
Example: Sorting Stalls

   %  cum%  cycles   cnt   cpi  blame   PC    file:line
10.0  10.0  109885  4998  22.0  dcache  957c  comp.c:484
 9.9  19.8  108776  5513  19.7  dcache  9530  comp.c:477
 7.8  27.6   85668  3836  22.3  dcache  959c  comp.c:488
11
Instruction-level Information Matters
  • DCPI anecdotes
    • TPC-D: 10% speedup
    • Duplicate filtering for AltaVista: part of a 19x speedup
    • Compress program: 22%
    • Compiler improvements: 20% in several Spec benchmarks

12
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

13
Typical Hardware Support
  • Timers
    • Clock interrupt after N units of time
  • Performance counters
    • Interrupt after N events: cycles, issues, loads, L1 D-cache misses, branch mispredicts, uops retired, ...
    • Alpha 21064, 21164; PPro, PII
  • Easy to measure total cycles, issues, CPI, etc.
  • Only extra information is the restart PC

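The timer-based scheme above can be illustrated with a minimal user-space sketch (not DCPI's kernel driver): a profiling timer periodically interrupts the program, and the handler records where it was. Python's signal module stands in for the hardware interrupt, the interrupted frame's function and line number play the role of the restart PC, and busy() is a hypothetical workload.

```python
import collections
import signal

samples = collections.Counter()

def handler(signum, frame):
    # The interrupted frame is the "restart PC": record function and line.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

def busy(n):
    # Hypothetical workload to be profiled.
    total = 0
    for i in range(n):
        total += i * i
    return total

signal.signal(signal.SIGPROF, handler)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # interrupt every ~1 ms of CPU time
busy(3_000_000)
signal.setitimer(signal.ITIMER_PROF, 0, 0)          # stop sampling

for (name, line), count in samples.most_common(3):
    print(name, line, count)
```

Nearly all samples land in busy(), mirroring how cycle samples pile up on the code that consumes the CPU; a real profiler records hardware PCs and aggregates them in the kernel.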
14
Problem: Inaccurate Attribution
  • Experiment
    • Count data loads
    • Loop: a single load plus hundreds of nops
  • In-order processor (Alpha 21164)
    • Skew: one large peak, displaced from the load
  • Out-of-order processor (Intel Pentium Pro)
    • Skew and smear: samples spread over instructions near the load

[Figure: histograms of load-count samples around the load instruction]
15
Ramification of Misattribution
  • No skew or smear
    • Instruction-level analysis is easy!
  • Skew by a constant number of cycles
    • Instruction-level analysis is possible
    • Adjust the sampling period by the amount of skew
    • Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
  • Smear
    • Instruction-level analysis seems hopeless
    • Examples: PII, StrongARM

16
Desired Hardware Support
  • Sample fetched instructions
  • Save PC of sampled instruction
  • E.g., interrupt handler reads Internal Processor
    Register
  • Makes skew and smear irrelevant
  • Gather more information

17
ProfileMe: Instruction-Centric Profiling

[Pipeline diagram: fetch → map → issue → exec → retire. When the fetch counter overflows, a randomly selected fetched instruction is given a ProfileMe tag. As the tagged instruction flows past the I-cache, branch predictor, arithmetic units, and D-cache, internal processor registers capture its PC, effective address, retired status, I-cache and D-cache miss flags, branch mispredict flag, branch history, and stage latencies; when the instruction is done, an interrupt is raised.]
18
Instruction-Level Statistics
  • PC + retire status → execution frequency
  • PC + cache miss flag → cache miss rates
  • PC + branch mispredict → mispredict rates
  • PC + event flag → event rates
  • PC + branch direction → edge frequencies
  • PC + branch history → path execution rates
  • PC + latency → instruction stalls
    • 100-cycle D-cache miss vs. D-cache miss

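A sketch of how such per-instruction statistics fall out of the samples, using hypothetical (pc, retired, dcache_miss) sample records: relative execution frequency follows from PC plus retire status, and miss rates from PC plus the miss flag.

```python
from collections import defaultdict

# Hypothetical ProfileMe sample records: (pc, retired, dcache_miss).
samples = [
    (0x9530, True, True),
    (0x9530, True, False),
    (0x9530, True, True),
    (0x957c, False, False),   # aborted instruction: sampled but not retired
]

total = defaultdict(int)    # samples per PC -> relative execution frequency
retired = defaultdict(int)  # PC + retire status
misses = defaultdict(int)   # PC + miss flag

for pc, did_retire, missed in samples:
    total[pc] += 1
    retired[pc] += did_retire
    misses[pc] += missed

for pc in sorted(total):
    print(hex(pc), "retire rate:", retired[pc] / total[pc],
          "miss rate:", misses[pc] / total[pc])
```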
19
Kernel Device Driver
  • Challenge 1 of 64K is only 655 cycles/sample
  • Aggregate samples in hash table
  • (PID, PC, event) ? count
  • Minimize cache misses
  • 100 cycles to memory
  • Pack data structures into cache lines
  • Eliminate expensive synchronization operations
  • Interprocessor interrupts for synchronization
    with daemon
  • Replicate main data structures on each processor

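The aggregation scheme above can be sketched as follows, under simplifying assumptions: one hash table per processor keyed by (PID, PC, event), with the replicated tables merged later so interrupt handlers never need to synchronize.

```python
from collections import Counter

# One hash table per processor: (PID, PC, event) -> count.
per_cpu = [Counter(), Counter()]

def record_sample(cpu, pid, pc, event):
    per_cpu[cpu][(pid, pc, event)] += 1   # lock-free: each CPU owns its table

# Hypothetical interrupts on two processors.
record_sample(0, 1234, 0x957c, "cycles")
record_sample(0, 1234, 0x957c, "cycles")
record_sample(1, 1234, 0x957c, "cycles")
record_sample(1, 5678, 0x9530, "cycles")

# The user-space daemon merges the replicated tables.
merged = sum(per_cpu, Counter())
print(merged[(1234, 0x957c, "cycles")])  # -> 3
```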
20
Moving Samples to Disk
  • User-space daemon
    • Extracts raw samples from the driver
    • Associates samples with compiled code
    • Updates disk-based profiles for the compiled code
  • Mapping <PID, PC> samples to compiled code
    • Dynamic loader hook for dynamically loaded code
    • Exec hook for statically linked code
    • Other hooks for initializing the mapping at daemon start-up
  • Profiles
    • Text header + compact binary samples

21
Performance of Data Collection (DCPI)
  • Time
    • 1-3% total overhead for most workloads
    • Often less than the variation from run to run
  • Space
    • 512 KB kernel memory per processor
    • 2-10 MB resident for the daemon
    • 10 MB disk after one month of profiling on a heavily used timeshared 4-processor machine
  • Non-intrusive enough to be run for many hours on production systems

22
Outline
  • Statistical sampling
  • What is it?
  • Why use it?
  • Data collection
  • Hardware issues
  • OS issues
  • Data analysis
  • In-order processors
  • Out-of-order processors

23
Data Analysis
  • Cycle samples are proportional to total time at
    head of issue queue (at least on in-order Alphas)
  • Frequency indicates frequent paths
  • CPI indicates stalls

24
Estimating Frequency from Samples
  • Problem
    • Given cycle samples, compute frequency and CPI
  • Approach
    • Let F = Frequency / Sampling Period
    • E(Cycle Samples) = F × CPI
    • So F = E(Cycle Samples) / CPI

25
Estimating Frequency (cont.)
  • F = E(Cycle Samples) / CPI
  • Idea
    • If an instruction has no dynamic stall, its CPI is known, so F can be estimated
    • So assume some instructions have no dynamic stalls
    • Consider a group of instructions with the same frequency (e.g., a basic block)
    • Identify the instructions without dynamic stalls, then average their sample counts for better accuracy
  • Key insight
    • Instructions without stalls have smaller sample counts

26
Estimating Frequency (Example)
  • Compute MinCPI from the code
  • Compute Samples/MinCPI
  • Select the data to average
  • Does badly when
    • There are few issue points
    • All issue points stall

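The three steps above can be sketched with hypothetical per-instruction data for one basic block: observed cycle-sample counts and the minimum CPI implied by the static schedule.

```python
# Hypothetical per-instruction data for one basic block.
samples = [600, 610, 1800, 590, 2400]   # observed cycle samples per instruction
min_cpi = [1, 1, 1, 1, 2]               # MinCPI from the static schedule

# Samples/MinCPI estimates F at each issue point; dynamically stalled
# instructions have inflated counts, so the small estimates are the
# trustworthy ones.
estimates = sorted(s / c for s, c in zip(samples, min_cpi))

# Average the smallest half of the estimates to reduce sampling noise
# while excluding issue points that stalled dynamically.
k = max(1, len(estimates) // 2)
freq = sum(estimates[:k]) / k
print(round(freq))  # -> 595
```

The choice of "smallest half" is an assumption for illustration; the actual selection rule depends on how many issue points are believed stall-free.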
27
Frequency Estimate Accuracy
  • Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
  • Edge frequencies are a bit less accurate

28
Explaining Stalls
  • Static stalls
  • Schedule instructions in each basic block
    optimistically using a detailed pipeline model
    for the processor
  • Dynamic stalls
  • Start with all possible explanations
  • I-cache miss, D-cache miss, DTB miss, branch
    mispredict, ...
  • Rule out unlikely explanations
  • List the remaining possibilities

29
Ruling Out D-cache Misses
  • Is the previous occurrence of an operand register
    the destination of a load instruction?
  • Search backward across basic block boundaries
  • Prune by block and edge execution frequencies
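The register-search rule above can be sketched as follows, with instructions modeled as hypothetical (dest, srcs, is_load) tuples in program order; this sketch stays within one block and omits the frequency-based pruning.

```python
# Can a stalled instruction's delay be blamed on a D-cache miss?  Walk
# backward from it and check whether the most recent writer of one of its
# operand registers was a load.
def stall_may_be_dcache_miss(instrs, stall_idx):
    operands = set(instrs[stall_idx][1])        # source registers of the stalled instr
    for dest, _srcs, is_load in reversed(instrs[:stall_idx]):
        if dest in operands:
            return is_load                      # writer found: load => plausible miss
    return True                                 # writer outside the window: can't rule it out

block = [
    ("r1", (), True),          # load r1, ...
    ("r2", ("r1",), False),    # add  r2, r1, ...   <- stalled instruction
]
print(stall_may_be_dcache_miss(block, 1))  # -> True: r1 was last written by a load
```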

30
Out-of-Order Processors
  • In-order processors
    • A periodic interrupt lands on a well-defined current instruction, e.g., the next instruction to issue
    • Peak performance = no wasted issue slots
    • Any stall implies a loss in performance
  • Out-of-order processors
    • Many instructions in flight; no single current instruction
    • Some stalls are masked by concurrent execution: instructions issue around the stalled instruction
  • Example: does this stall matter?
    • load r1, ...
    • add ..., r1, ...   (average latency 15.0 cycles)
    • ... other instructions ...

31
Issue: Need to Measure Concurrency
  • Interesting concurrency metrics
    • Retired instructions per cycle
    • Issue slots wasted while an instruction is in flight
    • Pipeline stage utilization
  • How to measure concurrency?
    • Special-purpose hardware
    • Some metrics are difficult to measure, e.g., those needing retire/abort status
    • Sample potentially-concurrent instructions
    • Aggregate info from pairs of samples
    • Statistically estimate the metrics

32
Paired Sampling
  • Sample two instructions
    • They may be in flight simultaneously
    • Replicate the ProfileMe hardware; add the intra-pair distance
  • Nested sampling
    • Sample a window around the first profiled instruction
    • Randomly select the second profiled instruction
    • Statistically estimate the frequency of F(first, second)

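Nested sampling can be sketched with assumed parameters PERIOD and WINDOW: the first instruction is selected every PERIOD fetches, and the second is drawn at random from a small window after it, so the pair may be in flight together.

```python
import random

PERIOD = 64 * 1024   # assumed sampling period, in fetched instructions
WINDOW = 8           # assumed window size for the second instruction

def pick_pair(fetch_counter):
    # Select the first instruction every PERIOD fetches, then a random
    # second instruction within WINDOW fetches; the difference is the
    # intra-pair distance the hardware would record.
    if fetch_counter % PERIOD != 0:
        return None
    first = fetch_counter
    second = first + random.randint(1, WINDOW)
    return first, second

pairs = []
for n in range(0, 5 * PERIOD + 1):   # simulate a stream of fetch counts
    pair = pick_pair(n)
    if pair:
        pairs.append(pair)

print(len(pairs))  # -> 6: one pair per sampling period
```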
33
Explaining Lost Performance
  • An open question
  • Some in-order analysis is applicable
    • E.g., the D-cache miss and branch mispredict analyses
  • Pipe-stage latencies from counters would help a lot

34
Summary / Conclusion
  • Statistical profiling can be
    • Inexpensive
    • Effective
  • Instruction-level analysis matters
  • Performance counters
    • Implementation details make a big difference
    • Out-of-order processors require better counters