Title: Profiling, Instrumentation, and Profile Based Optimization
1Profiling, Instrumentation, and Profile Based Optimization
- Robert Cohn Robert.Cohn_at_compaq.com
- Mark T. Vandevoorde
2Introduction
- Understanding the dynamic interaction between programs and processors
- What do programs do?
- How do processors perform?
- How can we make it faster?
3What to do?
- Build tools!
- Profiling
- Instrumentation
- Profile based optimization
4The Big Picture
Instrumentation
Sampling
Profiling
Profile Based Optimization
Analysis
Modeling
5Instrumentation
- User level view
- Executable editing
6Code Instrumentation
Trojan Horse
- Application appears unchanged
- Data collected as a side effect of execution
7Instrumentation Example
if (b > c) { bb[0]++; t = 1; }
else       { bb[1]++; b = 3; }
Instrumentation: the bb[0]++ and bb[1]++ counter increments
8Instrumentation Uses
- Profiles
- Model new hardware
- What will this new branch predictor do?
- What is the miss rate of this new cache?
- Optimization opportunities
- find unnecessary loads and stores
- find divides by 1
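As an illustration of the "what will this new branch predictor do?" use, the following is a minimal sketch of an analysis-side model. It assumes a table of 2-bit saturating counters indexed by the branch PC; the table size and the names `BranchEvent` and `MispredictRate` are hypothetical, not part of any tool described here. An instrumentation tool would call `BranchEvent` before each conditional branch with the branch outcome.

```c
#include <stdint.h>

/* Hypothetical predictor model: 2-bit saturating counters indexed by PC.
   PRED_ENTRIES is an arbitrary assumption, not a real hardware parameter. */
#define PRED_ENTRIES 1024

static uint8_t  pred_table[PRED_ENTRIES]; /* 0,1 = predict not taken; 2,3 = predict taken */
static uint64_t branches, mispredicts;

/* Intended to be called once per executed conditional branch. */
void BranchEvent(uint64_t pc, int taken)
{
    /* Instructions are assumed to be 4 bytes, so drop the low 2 bits. */
    uint8_t *ctr = &pred_table[(pc >> 2) % PRED_ENTRIES];
    int predicted_taken = (*ctr >= 2);

    branches++;
    if (predicted_taken != (taken != 0))
        mispredicts++;

    /* Move the counter toward the actual outcome, saturating at 0 and 3. */
    if (taken) {
        if (*ctr < 3) (*ctr)++;
    } else {
        if (*ctr > 0) (*ctr)--;
    }
}

/* Fraction of branches the modeled predictor got wrong. */
double MispredictRate(void)
{
    return branches ? (double)mispredicts / (double)branches : 0.0;
}
```

Because the model lives entirely in the analysis routine, trying a different predictor means changing only this file and rerunning the instrumented application.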
9What Tool Does Instrumentation?
- Compiler
- Compiler inserts extra operations
- Requires recompile, access to source code
- Executable editor
- Post-link tool inserts instrumentation code
- No rebuild, source code not required
- More difficult to relate back to source
10Instrumentation Tools for Alpha
- All executable based
- General instrumentation
- Atom on Digital Unix
- Distributed with Digital Unix
- Ntatom on Windows NT
- New! Download from web
- Specialized tools based on above
- hiprof, pixie, 3rd, ...
11ATOM
- Tool for customized instrumentation
- User writes program that describes how to instrument application
- Instrumentation program applied to application, generates instrumented application
- Instrumented application is run
- Data is collected
12User Supplies
- Instrumentation routines: user-written program that inserts instrumentation (calls to analysis routines)
- Analysis routines: do the instrumentation work at runtime (e.g. count a basic block)
13Atom Programming Model
[Figure: the application as a hierarchy. Objects: spice, libc.so, libm.so. Procedures: main(), Compute(), _exit(). Basic blocks: block1 to block5. Instructions: ldq r1, 8(sp); addq r1, 0x1, r2; stq r2, 8(sp); bne r1, 0x1ffc40]
14ATOM Instrumentation API Navigation
- Objects (binary, shared library)
- GetFirstObj, GetNextObj
- Procedures
- GetFirstProc, GetNextProc
- Basic blocks
- GetFirstBlock, GetNextBlock
- Instructions
- GetFirstInst, GetNextInst
15ATOM Instrumentation API Interrogation
- GetObjInfo, GetProcInfo, GetBlockInfo, GetInstInfo
- IsBranchTarget
- GetInstRegUsage
- InstPC
- InstLineNo
- ...
16ATOM Instrumentation API Definition
- AddCallProto
- tells atom the types of the arguments for calls
to analysis routines
17ATOM Instrumentation API Instrumentation
- AddCallProgram, AddCallObj, AddCallProc, AddCallBlock, AddCallInst, ReplaceProcedure
- Insert before or after
18Arguments to analysis routines
- Constants
- variables in instrumentation program, but constant at instrumentation point
- e.g. uninstrumented PC, function name
- VALUE computed at runtime
- effective address, branch taken predicate
- Register
- r3, arguments, return value
19Sample 1 Cache Simulator
- Write a tool that computes the miss rate of the application running in a 64KB, direct mapped data cache with 32 byte lines.
- > atom spice cache.inst.o cache.anal.o -o spice.cache
- > spice.cache < ref.in > ref.out
- > more cache.out
- 5,387,822,402 620,855,884 11.523
20Cache Tool Implementation
[Figure: application code next to the inserted instrumentation. Application: main: clr t0; loop: ldl t2,0(a0); addl t0,4,t0; addl t2,0x10,t2; stl t2,0(a0); bne t3,loop; ret. Instrumentation labels: VALUE, PrintResults()]
21Cache Analysis File
    #include <stdio.h>
    #define CACHE_SIZE 65536
    #define BLOCK_SHIFT 5
    long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;
    /* Called before every load and store with the effective address. */
    void Reference(long address) {
      int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
      long tag = address >> BLOCK_SHIFT;
      if (cache[index] != tag) { misses++; cache[index] = tag; }
      refs++;
    }
    /* Called at program exit: writes references, misses, and miss rate in percent. */
    void Print() {
      FILE *file = fopen("cache.out","w");
      fprintf(file,"%ld %ld %.2f\n",refs, misses, 100.0 * misses / refs);
      fclose(file);
    }
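The analysis routine can be exercised without ATOM by feeding it addresses directly. The sketch below is a standalone variant under stated assumptions: it keeps the 64KB cache and 32-byte lines from the slide, adds a valid bit (so an empty line never matches by accident, which the slide version does not guard against), and uses a hypothetical `SweepArray` helper as a stand-in for the loads a real application would issue.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE  65536                      /* 64KB, as on the slide */
#define BLOCK_SHIFT 5                          /* 32-byte lines, as on the slide */
#define NUM_LINES   (CACHE_SIZE >> BLOCK_SHIFT)

static uint64_t tags[NUM_LINES];
static uint8_t  valid[NUM_LINES];              /* added here; not in the slide version */
static uint64_t refs, misses;

/* Empty the cache and zero the counters. */
void CacheReset(void)
{
    memset(tags, 0, sizeof tags);
    memset(valid, 0, sizeof valid);
    refs = misses = 0;
}

/* Direct-mapped lookup: one candidate line per address. */
void Reference(uint64_t address)
{
    uint64_t index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    uint64_t tag   = address >> BLOCK_SHIFT;

    if (!valid[index] || tags[index] != tag) {
        misses++;
        valid[index] = 1;
        tags[index]  = tag;
    }
    refs++;
}

/* Hypothetical driver: touch n consecutive 4-byte elements starting at base. */
void SweepArray(uint64_t base, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++)
        Reference(base + 4 * i);
}
```

Sweeping 1024 four-byte elements from a line-aligned base touches 128 distinct lines, so the first sweep misses 128 times and a second sweep over the same range hits every time; two addresses 64KB apart map to the same line and evict each other.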
22Cache Instrumentation File
    #include <stdio.h>
    #include <cmplrs/atom.inst.h>
    unsigned Instrument(int argc, char **argv, Obj *o) {
      Inst *i; Block *b; Proc *p;
      AddCallProto("Reference(VALUE)");
      AddCallProto("Print()");
      AddCallProgram(ProgramAfter,"Print");
      /* Visit every instruction; call Reference before each load and store. */
      for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
          for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
            if (IsInstType(i, InstTypeLoad) || IsInstType(i,InstTypeStore))
              AddCallInst(i, InstBefore, "Reference", EffAddrValue);
      return 0;
    }
23Sample 2 Profiler
- Write a tool that outputs the address of each basic block and the number of times it is executed.
- vssad-27> atom a.out prof.inst.c prof.anal.c
- vssad-28> a.out.atom
- Hello world
- vssad-29> head prof.out
- 120001030 1
- 120001038 1
- 12000103c 1
- 120001058 33
- 120001064 1
24Profiler Tool Implementation
[Figure: application code next to the inserted instrumentation. Application: main: clr t0; loop: ldl t2,0(a0); addl t0,4,t0; addl t2,0x10,t2; stl t2,0(a0); bne t3,loop; ret. Instrumentation labels: Constant, PrintResults(addresses,3)]
25Profiler prof.anal.c
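    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    long *counts;
    /* Called once before the program starts: allocate one counter per block. */
    void Init(int nblocks) {
      counts = (long *)malloc(nblocks * sizeof(long));
      memset(counts,0,nblocks * sizeof(long));
    }
    /* Called at the start of every basic block. */
    void Count(int index) { counts[index]++; }
    /* Called at program exit: write address and count for each block. */
    void Print(long *blocks,int nblocks) {
      int i; FILE *file = fopen("prof.out","w");
      for (i = 0; i < nblocks; i++)
        fprintf(file,"%lx %ld\n",blocks[i],counts[i]);
      fclose(file);
    }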
26Profiler prof.inst.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <cmplrs/atom.inst.h>
    void CallInitPrint();
    void Instrument(int argc, char **argv, Obj *o) {
      Block *b; Proc *p; int index = 0;
      int nblocks = GetObjInfo(o,ObjNumberBlocks);
      long *addresses = (long *)malloc(nblocks * sizeof(long));
      CallInitPrint(addresses,nblocks);
      /* Record each block's address and count it under its own index. */
      for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
          addresses[index] = InstPC(GetFirstInst(b));
          AddCallInst(GetFirstInst(b), InstBefore, "Count",index++);
        }
    }
27Profiler prof.inst.c
    void CallInitPrint(long *addresses, int nblocks)
    {
      char buffer[100];
      AddCallProto("Count(int)");
      AddCallProto("Init(int)");
      AddCallProgram(ProgramBefore,"Init",nblocks);
      /* The array length is part of the prototype string. */
      sprintf(buffer,"Print(const stable int[%d],int)",nblocks);
      AddCallProto(buffer);
      AddCallProgram(ProgramAfter,"Print",addresses,nblocks);
    }
28Executable editors
- Input executable, output executable
- Instrument, optimize, translate
- Executable = image = binary = shared library = shared object = dynamically linked library (DLL)
- Executable editor = executable optimizer = binary rewriter = binary translator = post link optimizer
29Executable Editing
- Insert/delete/reorder instructions and data
- Obstacle to modification
- Addresses are bound
- Registers are bound
30Obstacles
- if (a) a = b
- beq r1,2
- ldl r1,0x1000
- Is a0 free?
- Adjust branch offsets
- Adjust literal addresses
31Phases
- 1. Decompose
- 2. Build IR
- 3. Insert instrumentation
- 4. Convert IR to executable
32Phase 1: Decompose Executable
[Figure: executable layout. Header; program code & data: Text (code), Data, Rdata; meta data: Exception Info, Relocations, Debug]
33Decompose
- Break executable into units
- unit minimum data that must be kept together
- code unit is instruction
- data unit is data section
- alternative unit is data item
34Phase 2: Build Internal Representation
35Intermediate Representation
- Similar to compiler
- except unstructured, untyped data
- 1 to 1 mapping for IR and machine instructions
- Base representation should be compact
- fit in physical memory
- initial/final phases do multiple passes
- Representations built/thrown away for procedures
36Bound addresses
- Code: br 4; ldah r0,0x1234; lda r0,0x5678(r0)
- Metadata: Begin 0x12345678, End 0x12345680
37Adjusting addresses
- No translation
- Dynamic translation
- Static translation
38No translation
- Leave code and data at same address
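- Original: beq r1,L2; ldl r1,0x1234; L2: ...
- Instrumented: beq r1,L2; br L1; L2: ... ; L1: lda a0,0x1234; bsr Reference; ldl r1,0x1234; br L2
- The instrumented instruction is replaced by a branch to out-of-line code, so nothing else moves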
39Dynamic translation
- Address computation is unchanged
- Image has map of old->new address
- Code inserted to map old->new address at runtime for load/store/branch
- Better
- Do PC relative branches statically
- Keep data section at original address
- Still indirect calls and jumps (not returns)
40Static translation
- Address computation is altered for new layout
- Find addresses
- Determine what they point to
- unit, offset
- Insert instrumentation
- Adjust literals or offsets to compute new address
of unit
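For PC-relative branches, "adjust offsets" can be made concrete with a small calculation. The sketch below assumes a simplified model that is not taken from any real tool: code is an array of fixed-size instructions, `inserted_before[i]` says how many instrumentation instructions go in front of original instruction `i`, and a branch displacement is counted in instructions relative to the instruction after the branch. The names `ComputeLayout` and `AdjustBranchDisp` are hypothetical.

```c
#include <stddef.h>

/* For each original instruction i, compute:
     new_start[i]: new index of the first instruction emitted for i
                   (its instrumentation, if any, otherwise i itself)
     new_self[i]:  new index of original instruction i itself            */
void ComputeLayout(const int *inserted_before, size_t n,
                   long *new_start, long *new_self)
{
    long pos = 0;
    for (size_t i = 0; i < n; i++) {
        new_start[i] = pos;
        pos += inserted_before[i];
        new_self[i] = pos;
        pos += 1;
    }
}

/* Recompute a branch displacement for the new layout. The branch is
   retargeted to new_start[target] so the target's instrumentation runs.
   Assumes the old target index lies inside the instruction array. */
long AdjustBranchDisp(size_t branch, long old_disp,
                      const long *new_start, const long *new_self)
{
    size_t target = (size_t)((long)branch + 1 + old_disp);
    return new_start[target] - (new_self[branch] + 1);
}
```

With two instructions inserted before the second of four original instructions, a forward branch that used to skip one instruction must now skip three, and a backward branch from the last instruction to the first grows from -4 to -6.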
41Other tools that change addresses
- Linker
- combine separately compiled objects
- adjust addresses based on assigned load address
- unit is section of object (data, text)
- Loader
- Load address != link address for DLL
- unit is entire image
- Use relocations
42Relocations
- br 4: no relocation required
- ldl r1,10(gp): may require relocation
- ldah r0,0x1234; lda r0,0x5678(r0): requires relocation
- Relocation example: address 0x200; type ldah literal; object 0x12345670; external
43How to recognize addresses?
- Metadata
- example procedure begin, procedure end
- implicit in structure of data
- Absolute addresses
- example literal address in data section
- use relocations
- Relative addresses: address + offset
- example: pc relative branch, offset for base pointer
- may not need adjustment, usually no relocation
44Relative Addresses
- Address computed as offset of another address
- Address and Address + Offset point to same unit: OK, unit moved as a unit
- Example
- a->field1 and ar[4]
- ldl r0,field1(a) and ldl r0,16(ar)
45Relative Addresses
- Offset spans multiple units
- example
- Jump table: ad = base + i; jmp ad; base: br l1; br l2; br l3; br l4
- PC relative branch: br 4
- Must be 1 unit
46Map address to unit and offset
- Reference -> address
- in code: interpret instructions
- br 4
- ldah r0,0x1234
- lda r0,0x5678(r0)
- in data: data is address
- .data
- 0x12345678
47Map address to unit and offset
- (relocation,address) -> (unit,offset)
- to code: pointer to instruction
- to data: data section and offset
- alternative: data item and offset
- offset = address - unit address
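The (unit, offset) mapping can be sketched as a lookup in a table of units sorted by original address. This is only an illustration under assumptions: the `Unit` record, its fields, and the functions `FindUnit` and `TranslateAddress` are hypothetical, and units are assumed not to overlap.

```c
#include <stdint.h>

/* Hypothetical record for one unit (an instruction, a data section, ...). */
typedef struct {
    uint64_t old_addr;   /* start address in the original image */
    uint64_t size;       /* size in bytes */
    uint64_t new_addr;   /* start address in the rewritten image */
} Unit;

/* Binary search over units sorted by old_addr.
   Returns the index of the unit containing addr, or -1 if none does. */
int FindUnit(const Unit *units, int n, uint64_t addr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (addr < units[mid].old_addr)
            hi = mid - 1;
        else if (addr - units[mid].old_addr >= units[mid].size)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}

/* Translate an old address: offset = address - unit address, and the
   new address is the unit's new start plus the same offset.
   Returns 1 on success, 0 if the address is in no known unit. */
int TranslateAddress(const Unit *units, int n, uint64_t addr, uint64_t *out)
{
    int u = FindUnit(units, n, addr);
    if (u < 0)
        return 0;
    *out = units[u].new_addr + (addr - units[u].old_addr);
    return 1;
}
```

Keeping the offset unchanged is only valid because a unit is, by definition, data that moves together.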
48Phase 3: Insert Instrumentation
[Figure: IR during insertion. Instruction list: add, load, load, beq. Data sections: Data, Sdata, Ndata. MetaData: Exception Info, Relocations]
49Adding instrumentation code
- Instrumentation requires free registers
- wrapper routine saves and restores registers
- At the instrumentation point: beq r1,2; save registers; lda a0,0x1000; bsr ra,wrapper; restore registers; ldl r1,0(r2)
- wrapper: save registers on stack; bsr ra,Reference; restore registers; return
- Reference: the analysis routine
- Local/global/interprocedural analysis finds free registers
50Phase 4: Convert IR to Executable
[Figure: output executable layout. Header; program code & data: Text, Data, Rdata, Ndata; meta data: Exception Info, Relocations, Debug]
51Profile Based Optimization
52Profile based optimization
- Collect profile information
- example how often basic blocks are executed
- Use profile to guide optimization
- example inlining
53Profile based Optimization
- Available on Alpha, MIPS, PA, PPC, Sparc, x86
- Used in compilers and executable optimizers
- Used for SPEC results and in products, too
54Speedup from code layout
55Register allocation and inlining
56User level view
- Compiler
- Compile
- Instrument
- Run scenario1
- Run scenario2
- Merge profiles
- Recompile
- Executable optimizer
- Instrument
- Run scenario1
- Run scenario2
- Merge profiles
- Optimize
57Optimization sensitivity to training data
- Experience with varying training
- compiler, spreadsheet, CAD, Spec95
- Some training sets are better than others
- Can find one or a combination that gives best results in all scenarios
- Sometimes requires tuning of optimizations
58Types of optimizations
- Enhance conventional optimization with weights based on profile
- Transformations driven by profile info
- Examples
- Register allocation
- Code layout
- Inlining
59Register allocation
    while (a) { if (a > 3) b++; else c++; a--; }

    top:  cmpgt a,3,t0
          brfalse t0,then
          addl b,1,b
          br join
    then: ldl t0,c
          addl t0,1,t0
          stl t0,c
    join: subl a,1,a
          brtrue a,top
- a, b, and c live for entire loop
- Should b or c get the last register?
- Information: block counts
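One way to use block counts for the "b or c" question is to weight each static reference by the execution count of its block and give the register to the variable with the larger total. This is a sketch of that idea only, not the heuristic of any particular compiler; `BlockInfo`, `WeightedRefs`, and `BestCandidate` are hypothetical names, and variables are identified by small integer indices.

```c
#include <stddef.h>

#define NVARS 3   /* in the usage below: 0 = a, 1 = b, 2 = c */

typedef struct {
    long count;         /* profiled execution count of the block */
    int  refs[NVARS];   /* static references to each variable in the block */
} BlockInfo;

/* weights[v] = sum over blocks of (block count * references to v). */
void WeightedRefs(const BlockInfo *blocks, size_t nblocks, long weights[NVARS])
{
    for (int v = 0; v < NVARS; v++)
        weights[v] = 0;
    for (size_t i = 0; i < nblocks; i++)
        for (int v = 0; v < NVARS; v++)
            weights[v] += blocks[i].count * blocks[i].refs[v];
}

/* Among the candidate variables (n >= 1), pick the one with the
   largest weighted reference count. */
int BestCandidate(const long weights[NVARS], const int *candidates, size_t n)
{
    int best = candidates[0];
    for (size_t k = 1; k < n; k++)
        if (weights[candidates[k]] > weights[best])
            best = candidates[k];
    return best;
}
```

If the loop runs with a counting down from 10, the b++ block executes 7 times and the c++ block 3 times, so b has the higher weight and would get the last register under this scheme.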
60Code layout: Reduce the number of taken branches
- Greedy algorithm, lay out common paths sequentially
- Information: flow edge counts
[Figure: flow graph annotated with edge counts 60, 40, 60, 40, 45, 55, 55, 45]
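A very simple version of the greedy idea: start at the entry block, and after each block place its hottest not-yet-placed successor so the common path falls through. The sketch below is a deliberate simplification with assumed names (`Edge`, `GreedyLayout`, `MAX_BLOCKS`); it assumes block 0 is the entry and at most `MAX_BLOCKS` blocks, and production tools may use more elaborate chain-merging schemes.

```c
#include <stddef.h>

#define MAX_BLOCKS 64   /* arbitrary limit for this sketch */

typedef struct { int from, to; long count; } Edge;

/* Writes a block ordering into order[0..nblocks-1].
   Requires 1 <= nblocks <= MAX_BLOCKS; block 0 is the entry. */
void GreedyLayout(const Edge *edges, size_t nedges, int nblocks, int *order)
{
    int placed[MAX_BLOCKS] = {0};
    int cur = 0, n = 0;

    while (n < nblocks) {
        int next = -1;
        long best = -1;

        placed[cur] = 1;
        order[n++] = cur;

        /* Follow the hottest edge from cur to a block not yet placed. */
        for (size_t e = 0; e < nedges; e++) {
            if (edges[e].from == cur && !placed[edges[e].to] &&
                edges[e].count > best) {
                best = edges[e].count;
                next = edges[e].to;
            }
        }
        /* Dead end: restart from the lowest-numbered unplaced block. */
        if (next < 0) {
            for (int b = 0; b < nblocks; b++) {
                if (!placed[b]) { next = b; break; }
            }
        }
        cur = next;
    }
}
```

For a diamond where the entry goes to block 2 sixty times and to block 1 forty times, the result is 0, 2, 3, 1: the hot path 0 to 2 to 3 is sequential and only the cold path needs taken branches.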
61Inlining
[Figure: call graph of RtnA, RtnB, RtnC, and RtnD; call edge counts shown: 1000, 2, 0]
- Probably no advantage to inline RtnD into RtnA
- RtnB is almost always called from RtnA
- thus no cache penalty for inlining
- Information: call edge counts
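The reasoning on this slide can be written as a small predicate over call edge counts: inline only when the call site is hot and accounts for nearly all calls to the callee. The function name `ShouldInline` and both thresholds are hypothetical illustrations, not values from any real compiler.

```c
/* Decide whether to inline one call site, given:
     edge_count:    profiled count of this caller-to-callee edge
     callee_total:  profiled count of all calls to the callee
     min_count:     ignore call sites colder than this (assumed threshold)
     min_fraction:  required share of the callee's calls (assumed threshold) */
int ShouldInline(long edge_count, long callee_total,
                 long min_count, double min_fraction)
{
    if (callee_total <= 0 || edge_count < min_count)
        return 0;
    return (double)edge_count / (double)callee_total >= min_fraction;
}
```

A site that makes 1000 of a callee's 1002 calls passes with thresholds of 100 calls and 90 percent; a site that makes 2 of them, or one that is never executed, does not.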
62Information to drive optimization
- Basic
- basic block counts
- flow edge counts
- call edge counts
- More advanced
- path profiles
- cache misses
- branch mispredicts
63Computing basic block counts
- Instrumentation
- Use atom tool
- Use 64 bit integers
- Sampling
64Computing call edges
- PC relative call: call edge count is same as basic block count
    rtna: move 1,a0
          move 2,a1
          bsr rtnb
- Indirect call: keep hash table of targets and counts
    rtna: move 1,a0
          move 2,a1
          ldl r0,20(t0)
          jsr r0
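For indirect calls, the "hash table of targets and counts" might look like the following analysis-side sketch. It assumes a fixed-size open-addressing table keyed by (call site, target); the table size, the hash function, and the names `RecordCall` and `CallCount` are all assumptions made for illustration.

```c
#include <stdint.h>

#define TABLE_SIZE 256   /* power of two; arbitrary for this sketch */

typedef struct {
    uint64_t site, target, count;
    int used;
} CallEdge;

static CallEdge table[TABLE_SIZE];

static unsigned Hash(uint64_t site, uint64_t target)
{
    uint64_t h = (site * 0x9E3779B97F4A7C15ULL) ^ (target >> 2);
    return (unsigned)(h >> 32) & (TABLE_SIZE - 1);
}

/* Count one execution of the call edge site -> target.
   Returns 0 on success, -1 if the table is full. */
int RecordCall(uint64_t site, uint64_t target)
{
    unsigned h = Hash(site, target);
    for (unsigned probe = 0; probe < TABLE_SIZE; probe++) {
        CallEdge *e = &table[(h + probe) & (TABLE_SIZE - 1)];
        if (!e->used) {
            e->used = 1;
            e->site = site;
            e->target = target;
            e->count = 1;
            return 0;
        }
        if (e->site == site && e->target == target) {
            e->count++;
            return 0;
        }
    }
    return -1;
}

/* Look up the count for an edge; 0 if it was never recorded. */
uint64_t CallCount(uint64_t site, uint64_t target)
{
    unsigned h = Hash(site, target);
    for (unsigned probe = 0; probe < TABLE_SIZE; probe++) {
        const CallEdge *e = &table[(h + probe) & (TABLE_SIZE - 1)];
        if (!e->used)
            return 0;
        if (e->site == site && e->target == target)
            return e->count;
    }
    return 0;
}
```

The instrumentation would pass the call site's PC as a constant and the target register's value at runtime, so one call site can accumulate counts for several targets.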
65Computing flow edge counts from basic block counts
- Basic block count = Σ incoming edges = Σ outgoing edges
- Exceptions, longjmp/setjmp are implicit edges
- Tolerate inconsistencies
[Figure: flow graph fragment with counts 10, 20, 10]
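The conservation rule can be applied mechanically: whenever a block has exactly one edge of unknown count on its incoming (or outgoing) side, that edge equals the block count minus the known edges on that side. The sketch below repeats this until nothing changes. The names `FlowEdge`, `UNKNOWN`, and `SolveEdges` are hypothetical, and the code does not try to detect or repair inconsistent profiles.

```c
#define UNKNOWN (-1L)

typedef struct { int from, to; long count; } FlowEdge;

/* Fill in edge counts from block counts using
   block count = sum of incoming edges = sum of outgoing edges.
   Returns the number of edges that remain UNKNOWN. */
int SolveEdges(const long *block_count, int nblocks,
               FlowEdge *edges, int nedges)
{
    int progress = 1, remaining = 0;

    while (progress) {
        progress = 0;
        for (int b = 0; b < nblocks; b++) {
            for (int dir = 0; dir < 2; dir++) {   /* 0: incoming, 1: outgoing */
                long known = 0;
                int unknown = -1, n_unknown = 0;
                for (int e = 0; e < nedges; e++) {
                    int end = dir ? edges[e].from : edges[e].to;
                    if (end != b)
                        continue;
                    if (edges[e].count == UNKNOWN) {
                        unknown = e;
                        n_unknown++;
                    } else {
                        known += edges[e].count;
                    }
                }
                if (n_unknown == 1) {
                    edges[unknown].count = block_count[b] - known;
                    progress = 1;
                }
            }
        }
    }
    for (int e = 0; e < nedges; e++)
        if (edges[e].count == UNKNOWN)
            remaining++;
    return remaining;
}
```

For an if-then shape with blocks A (20), B (10), and C (20) and edges A to B, A to C, and B to C, every edge resolves to 10. A nonzero return value means the block counts alone do not determine the remaining edges.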
66Computing flow edge counts from basic block counts
- Some graphs have multiple solutions
- Guess!
- Instrument edges
- Instrument minimum number of blocks and edges
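    while (a) a--;

          bzero a,skip
    top:  subl a,1,a
          bnzero a,top
    skip:
[Figure: two flow graphs of this loop with the same basic block counts but different edge counts. Caption: Two solutions for same bb count. Values shown: 10, 10, 1, 9, 9, 19, 1, 11, 20, 20, 1, 9, 10, 10]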
67Computing flow edge counts from basic block counts
- Spanning tree algorithm
- given flow graph, costs, finds lowest cost set of instrumentation points
- costs derived from static analysis or earlier runs
- Read Ball and Larus for details
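The core of the spanning tree idea can be sketched with Kruskal's algorithm: treat the flow graph as undirected, build a maximum-weight spanning tree from the estimated edge costs, and instrument only the edges left out of the tree; counts on tree edges can then be derived by flow conservation. This is a simplified illustration, not the published algorithm in full: the names `PEdge` and `ChooseInstrumentedEdges` are hypothetical, and the caller is assumed to have added a virtual exit-to-entry edge.

```c
#include <stdlib.h>

typedef struct {
    int  from, to;
    long weight;       /* estimated execution frequency */
    int  instrument;   /* output: 1 if this edge needs a counter */
} PEdge;

/* Union-find root lookup with path halving. */
static int Find(int *parent, int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heaviest edges first. */
static int ByWeightDesc(const void *a, const void *b)
{
    long wa = ((const PEdge *)a)->weight;
    long wb = ((const PEdge *)b)->weight;
    return (wa < wb) - (wa > wb);
}

/* Sorts edges in place and marks the ones to instrument.
   Returns the number of instrumented edges, or -1 on allocation failure. */
int ChooseInstrumentedEdges(PEdge *edges, int nedges, int nblocks)
{
    int ninst = 0;
    int *parent = malloc((size_t)nblocks * sizeof *parent);
    if (!parent)
        return -1;
    for (int i = 0; i < nblocks; i++)
        parent[i] = i;

    qsort(edges, (size_t)nedges, sizeof *edges, ByWeightDesc);
    for (int e = 0; e < nedges; e++) {
        int a = Find(parent, edges[e].from);
        int b = Find(parent, edges[e].to);
        if (a != b) {              /* joins two components: tree edge */
            parent[a] = b;
            edges[e].instrument = 0;
        } else {                   /* would close a cycle: needs a counter */
            edges[e].instrument = 1;
            ninst++;
        }
    }
    free(parent);
    return ninst;
}
```

On a diamond with edge weights 60, 60, 40, 40 and a virtual exit-to-entry edge of weight 100, two edges end up instrumented (one of weight 60 and one of weight 40) and the heaviest edge carries no counter.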
68Instrumenting flow edges
- ATOM: branch taken value can be passed to analysis routine
- branch not taken: insert call to count after conditional branch
- taken branch, indirect jump: insert new basic block between branch and target
69Merging multiple profiles
- Multiple runs generate multiple profiles, how do we combine them?
- Add them together
- Should the profiles be weighted equally?
- User defined
- Scale so that sums are equal
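The "scale so that sums are equal" option can be sketched as follows: each profile is scaled so its counts sum to a common target, and the scaled profiles are then added, so a long run does not drown out a short one. The function name `MergeProfiles` and the choice of `target_sum` are assumptions for illustration.

```c
#include <stddef.h>

/* merged[i] = sum over profiles p of profiles[p][i] * (target_sum / sum_p),
   where sum_p is the total of all counts in profile p.
   Profiles whose counts are all zero contribute nothing. */
void MergeProfiles(const long *const *profiles, size_t nprofiles,
                   size_t nblocks, double target_sum, double *merged)
{
    for (size_t i = 0; i < nblocks; i++)
        merged[i] = 0.0;

    for (size_t p = 0; p < nprofiles; p++) {
        double sum = 0.0;
        for (size_t i = 0; i < nblocks; i++)
            sum += (double)profiles[p][i];
        if (sum == 0.0)
            continue;

        double scale = target_sum / sum;
        for (size_t i = 0; i < nblocks; i++)
            merged[i] += (double)profiles[p][i] * scale;
    }
}
```

With one run totalling 100 and another totalling 1000, scaling both to 1000 gives the two runs equal say in the merged result; user-defined weighting would replace the uniform `target_sum` with a per-profile factor.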
70Using profiles
- Edge, block counts are in database
- For each procedure, compiler locates counts in database and copies them to IR
- Every flow edge, call edge, block labeled with execution count
- Optimizations that modify flow graph must update profile information
71IR/Profiled program mismatch
- Does the flow graph of the program you profiled match the flow graph in the compiler IR?
- Optimization
- Code generation
- Usually ok if you disable optimization
- Not a problem for executable optimizers
72Persistence
- If the program is modified, can you use an old profile?
- Generating a profile can be difficult and time consuming
- Don't hold up build process generating a new profile every time
73Usability
- Make it easy or no one will use it
- Limited changes to build process
- Limited opportunities for user to mess up
74Profile based optimization nirvana
- Profile any build
- tolerate IR/profiled program mismatches
- No instrumentation step
- Low cost profiling, < 5%
- No restructuring of makefile
- Big speedup!
DCPI
75Tools for profile based optimization
- Unix
- cc, f77
- om: executable optimizer called from cc
- cord: user specified procedure ordering
- NT
- scc: calls Visual C++
- spike: executable optimizer
- link /order: user specified procedure ordering
- wstune: generates ordering