Profiling, Instrumentation, and Profile Based Optimization

1
Profiling, Instrumentation, and Profile Based
Optimization

Robert Cohn Robert.Cohn_at_compaq.com
Mark T. Vandevoorde

2
Introduction

Understanding the dynamic interaction between
programs and processors
What do programs do?
How do processors perform?
How can we make it faster?

3
What to do?

Build tools!
Profiling
Instrumentation
Profile based optimization

4
The Big Picture

[Diagram relating: Instrumentation, Sampling, Profiling, Profile Based Optimization, Analysis, Modeling]

5
Instrumentation

User level view
Executable editing

6
Code Instrumentation
Trojan Horse

Application appears unchanged
Data collected as a side effect of execution

7
Instrumentation Example

Add extra code

Original:
    if (b > c)
        t = 1;
    else
        b = 3;

Instrumented:
    if (b > c) { bb[0]++; t = 1; }
    else       { bb[1]++; b = 3; }

8
Instrumentation Uses

Profiles
Model new hardware
What will this new branch predictor do?
What is the miss rate of this new cache?
Optimization opportunities
find unnecessary loads and stores
find divides by 1
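
As an illustration of modeling new hardware from instrumentation data, here is a minimal sketch of a branch predictor model. It assumes an analysis routine that receives each conditional branch's PC and its taken/not-taken outcome. The names BranchModel and MispredictRate, the table size, and the choice of 2-bit saturating counters are assumptions made for this sketch and do not come from the slides.

```c
#include <stdint.h>

#define TABLE_BITS 10                      /* assumed predictor size: 1024 entries */
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t counters[TABLE_SIZE];       /* 2-bit counters, all start at 0 (strongly not taken) */
static long branches, mispredicts;

/* Hypothetical analysis routine: called once per executed conditional
   branch with the branch PC and whether the branch was taken. */
void BranchModel(uint64_t pc, int taken)
{
    uint8_t *c = &counters[(pc >> 2) & (TABLE_SIZE - 1)];
    int predicted_taken = (*c >= 2);       /* counter values 2 and 3 predict taken */

    branches++;
    if (predicted_taken != (taken != 0))
        mispredicts++;

    /* Train the counter toward the actual outcome, saturating at 0 and 3. */
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Misprediction rate in percent; 0 if no branches were seen. */
double MispredictRate(void)
{
    return branches ? 100.0 * mispredicts / branches : 0.0;
}
```

Swapping in a different predictor only changes the body of the routine; the instrumented application stays the same.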

9
What Tool Does Instrumentation?

Compiler
Compiler inserts extra operations
Requires recompile, access to source code
Executable editor
Post-link tool inserts instrumentation code
No rebuild, source code not required
More difficult to relate back to source

10
Instrumentation Tools for Alpha

All executable based
General instrumentation
Atom on Digital Unix
Distributed with Digital Unix
Ntatom on Windows NT
New! Download from web
Specialized tools based on above
hiprof, pixie, 3rd, ...

11
ATOM

Tool for customized instrumentation
User writes program that describes how to
instrument application
Instrumentation program applied to application,
generates instrumented application
Instrumented application is run
Data is collected

12
User Supplies

Instrumentation routines: user-written program that inserts calls to analysis routines
Analysis routines: do the instrumentation work at runtime (e.g. count a basic block)

13
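Atom Programming Model

[Diagram: the application is viewed as a hierarchy. Objects (spice, libc.so, libm.so) contain procedures (main(), Compute(), _exit()), procedures contain basic blocks (block1 through block5), and blocks contain instructions, for example:]
    ldq r1, 8(sp)
    addq r1, 0x1, r2
    stq r2, 8(sp)
    bne r1, 0x1ffc40

14
ATOM Instrumentation API: Navigation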

Objects (binary, shared library)
GetFirstObj, GetNextObj
Procedures
GetFirstProc, GetNextProc
Basic blocks
GetFirstBlock, GetNextBlock
Instructions
GetFirstInst, GetNextInst

15
ATOM Instrumentation API: Interrogation

GetObjInfo, GetProcInfo, GetBlockInfo,
GetInstInfo
IsBranchTarget
GetInstRegUsage
InstPC
InstLineNo
...

16
ATOM Instrumentation API: Definition

AddCallProto
tells atom the types of the arguments for calls
to analysis routines

17
ATOM Instrumentation API: Instrumentation

AddCallProgram, AddCallObj, AddCallProc,
AddCallBlock, AddCallInst, ReplaceProcedure
Insert before or after

18
Arguments to analysis routines

Constants
variables in instrumentation program, but constant at instrumentation point
e.g. uninstrumented PC, function name
VALUE: computed at runtime
e.g. effective address, branch taken predicate
Register
e.g. r3, arguments, return value

19
Sample 1: Cache Simulator

Write a tool that computes the miss rate of the application running in a 64KB, direct mapped data cache with 32 byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523

20
Cache Tool Implementation

[Diagram: the application code below alongside its instrumentation. Figure labels: Application, Instrumentation, VALUE, PrintResults()]
    main: clr  t0
    loop: ldl  t2,0(a0)
          addl t0,4,t0
          addl t2,0x10,t2
          stl  t2,0(a0)
          bne  t3,loop
          ret

21
Cache Analysis File

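#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5
/* one tag per 32-byte line; refs and misses are running totals */
long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

/* called before every load and store with its effective address */
void Reference(long address) {
    int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

/* called at program exit: writes references, misses, miss rate in percent */
void Print() {
    FILE *file = fopen("cache.out","w");
    fprintf(file,"%ld %ld %.2f\n",refs, misses, 100.0 * misses / refs);
    fclose(file);
}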

22
Cache Instrumentation File

#include <stdio.h>
#include <cmplrs/atom.inst.h>
unsigned Instrument(int argc, char **argv, Obj *o) {
    Inst *i; Block *b; Proc *p;
    /* declare the analysis routines and their argument types */
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    /* print the results when the program exits */
    AddCallProgram(ProgramAfter,"Print");
    /* walk every instruction; pass each load/store address to Reference */
    for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
            for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                if (IsInstType(i, InstTypeLoad) || IsInstType(i,InstTypeStore))
                    AddCallInst(i, InstBefore, "Reference", EffAddrValue);
}

23
Sample 2: Profiler

Write a tool that outputs the address of each basic block and the number of times it is executed.
vssad-27> atom a.out prof.inst.c prof.anal.c
vssad-28> a.out.atom
Hello world
vssad-29> head prof.out
120001030 1
120001038 1
12000103c 1
120001058 33
120001064 1

24
Profiler Tool Implementation

[Diagram: the application code below alongside its instrumentation. Figure labels: Application, Instrumentation, Constant, PrintResults(addresses,3)]
    main: clr  t0
    loop: ldl  t2,0(a0)
          addl t0,4,t0
          addl t2,0x10,t2
          stl  t2,0(a0)
          bne  t3,loop
          ret

25
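Profiler: prof.anal.c

#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <string.h>   /* memset */
long *counts;         /* one execution counter per basic block */
void Init(int nblocks) {
    counts = (long *)malloc(nblocks * sizeof(long));
    memset(counts,0,nblocks * sizeof(long));
}
void Count(int index) { counts[index]++; }
/* writes "address count" for every block, address in hex */
void Print(long *blocks,int nblocks) {
    int i; FILE *file = fopen("prof.out","w");
    for (i = 0; i < nblocks; i++)
        fprintf(file,"%lx %ld\n",blocks[i],counts[i]);
    fclose(file);
}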

26
Profiler: prof.inst.c

#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <cmplrs/atom.inst.h>
void CallInitPrint();
void Instrument(int argc, char **argv,Obj *o) {
    Block *b; Proc *p; int index=0;
    int nblocks = GetObjInfo(o,ObjNumberBlocks);
    long *addresses = (long *)malloc(nblocks * sizeof(long));
    CallInitPrint(addresses,nblocks);
    for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
            /* record the block's address, then count it under the same index */
            addresses[index] = InstPC(GetFirstInst(b));
            AddCallInst(GetFirstInst(b), InstBefore, "Count",index++);
        }
}

27
Profiler: prof.inst.c (continued)

void CallInitPrint(long *addresses, int nblocks) {
    char buffer[100];
    AddCallProto("Count(int)");
    AddCallProto("Init(int)");
    AddCallProgram(ProgramBefore,"Init",nblocks);
    /* the array length is part of the prototype, so build it at instrumentation time */
    sprintf(buffer,"Print(const stable int[%d],int)",nblocks);
    AddCallProto(buffer);
    AddCallProgram(ProgramAfter,"Print",addresses,nblocks);
}

28
Executable editors

Input executable, output executable
Instrument, optimize, translate
Executable image: binary, shared library, shared object, dynamically linked library (DLL)
Executable editor, executable optimizer, binary
rewriter, binary translator, post link optimizer

29
Executable Editing

Insert/delete/reorder instructions and data
Obstacle to modification
Addresses are bound
Registers are bound

30
Obstacles

if (a) a = b;
beq r1,2
ldl r1,0x1000

Is a0 free?
Adjust branch offsets
Adjust literal addresses

31
Phases

1. Decompose
2. Build IR
3. Insert instrumentation
4. Convert IR to executable

32
1. Decompose Executable

[Diagram of the executable's layout. Labels: Header, Text (code), Program code & data, Data, Rdata, Exception Info, Meta data, Relocations, Debug]

33
Decompose

Break executable into units
unit: minimum data that must be kept together
code: unit is instruction
data: unit is data section
alternative: unit is data item

34
2. Build Internal Representation

35
Intermediate Representation

Similar to compiler
except unstructured, untyped data
1 to 1 mapping for IR and machine instructions
Base representation should be compact
fit in physical memory
initial/final phases do multiple passes
Representations built/thrown away for procedures

36
Bound addresses

Data: 1, 2, 0x12345678, 3
Code:
    br 4
    ldah r0,0x1234
    lda r0,0x5678(r0)
Metadata: Begin 0x12345678, End 0x12345680

37
Adjusting addresses

No translation
Dynamic translation
Static translation

38
No translation

Leave code and data at same address

Original:
        beq r1,L2
        ldl r1,0x1234
    L2:

Instrumented:
        beq r1,L2
        br  L1
    L2: ...
        ...
    L1: lda a0,0x1234
        bsr Reference
        ldl r1,0x1234
        br  L2

39
Dynamic translation

Address computation is unchanged
Image has map of old->new address
Code inserted to map old->new address at runtime for load/store/branch
Better
Do PC relative branches statically
Keep data section at original address
Still indirect calls and jumps (not returns)

40
Static translation

Address computation is altered for new layout
Find addresses
Determine what they point to
unit, offset
Insert instrumentation
Adjust literals or offsets to compute new address
of unit
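
To make "adjust literals" concrete for the ldah/lda pair shown earlier in the deck, here is a minimal sketch. Both instructions sign-extend their 16-bit immediate, so when the low half is 0x8000 or more the high half must be bumped by one. The type and function names (LiteralPair, SplitAddress, JoinAddress) are hypothetical helpers for this sketch, not part of any tool described here.

```c
#include <stdint.h>

/* The two immediates of an ldah/lda pair, as signed values. */
typedef struct { int32_t hi, lo; } LiteralPair;

/* Split addr so that hi * 65536 + lo == addr, with lo in [-32768, 32767]. */
LiteralPair SplitAddress(int64_t addr)
{
    LiteralPair p;
    int64_t lo = addr & 0xffff;           /* low 16 bits, 0..65535 */
    if (lo >= 0x8000)
        lo -= 0x10000;                    /* lda sign-extends its displacement */
    p.lo = (int32_t)lo;
    p.hi = (int32_t)((addr - lo) / 65536);/* exact: addr - lo is a multiple of 65536 */
    return p;
}

/* The address the pair computes at runtime. */
int64_t JoinAddress(LiteralPair p)
{
    return (int64_t)p.hi * 65536 + p.lo;
}

/* ldah holds only 16 bits, so the high half must fit in a signed 16-bit field. */
int FitsInPair(LiteralPair p)
{
    return p.hi >= -32768 && p.hi <= 32767;
}
```

For example, 0x12345678 splits into 0x1234 and 0x5678. If the target unit moves up by 0x5000 to 0x1234A678, both immediates change: the pair becomes 0x1235 and -0x5988.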

41
Other tools that change addresses

Linker
combine separately compiled objects
adjust addresses based on assigned load address
unit is section of object (data, text)
Loader
Load address != link address for DLL
unit is entire image
Use relocations

42
Relocations

[Diagram: data and code items marked "No relocation required", "May require relocation", or "Requires relocation"]
Data: 1, 2, 0x12345678, 3
Code:
    br 4
    ldl r1,10(gp)
    ldah r0,0x1234
    lda r0,0x5678(r0)
Relocation example: address 0x200, type ldah literal, object 0x12345670, external

43
How to recognize addresses?

Metadata
example: procedure begin, procedure end
implicit in structure of data
Absolute addresses
example: literal address in data section
use relocations
Relative addresses: address + offset
example: pc relative branch, offset for base pointer
may not need adjustment, usually no relocation

44
Relative Addresses

Address computed as offset of another address
Address and Address + Offset point to same unit
ok, unit moved as a unit
Example
a->field1            ar[4]
ldl r0,field1(a)     ldl r0,16(ar)

45
Relative Addresses

Offset spans multiple units
example

Jump table:
          ad = base + i
          jmp ad
    base: br l1
          br l2
          br l3
          br l4
PC relative branch: br 4
Must be 1 unit

46
Map address to unit and offset

Reference -> address
in code: interpret instructions
br 4
ldah r0,0x1234
lda r0,0x5678(r0)
in data: data is address
.data
0x12345678

47
Map address to unit and offset

(relocation,address) -> (unit,offset)
to code: pointer to instruction
to data: data section and offset
alternative: data item and offset
offset = address - unit address
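
The mapping above can be sketched as a lookup over a sorted table of units. This is a minimal illustration, assuming units are non-overlapping and sorted by original address. The Unit struct and the FindUnit and TranslateAddress names are hypothetical and only serve this sketch.

```c
#include <stdint.h>

/* One relocatable unit: where it was, how big it is, where it lands. */
typedef struct {
    uint64_t old_addr;   /* start address in the original image */
    uint64_t size;       /* size in bytes */
    uint64_t new_addr;   /* start address in the new layout */
} Unit;

/* Binary search for the unit containing addr; returns its index or -1. */
int FindUnit(const Unit *units, int n, uint64_t addr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (addr < units[mid].old_addr)
            hi = mid - 1;
        else if (addr >= units[mid].old_addr + units[mid].size)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}

/* Translate an old address to the new layout.
   Returns 1 and stores the result in *out, or 0 if no unit contains addr. */
int TranslateAddress(const Unit *units, int n, uint64_t addr, uint64_t *out)
{
    int u = FindUnit(units, n, addr);
    if (u < 0)
        return 0;
    uint64_t offset = addr - units[u].old_addr;   /* offset = address - unit address */
    *out = units[u].new_addr + offset;            /* same offset within the moved unit */
    return 1;
}
```

Because a unit moves as a whole, the offset within it is preserved and only the unit's base changes.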

48
3. Insert Instrumentation

[Diagram of the IR being edited. Instruction list: add, load, load, beq. Data sections: Data, Sdata, Ndata. MetaData: Exception Info, Relocations]

49
Adding instrumentation code

Instrumentation requires free registers
wrapper routine saves and restores registers

Instrumented code:
    beq r1,2
    save registers
    lda a0,0x1000
    bsr ra,wrapper
    restore registers
    ldl r1,0(r2)
wrapper:
    Save registers on stack
    bsr ra,Reference
    Restore registers
    return
Reference: the analysis routine

Local/global/interprocedural analysis finds free registers

50
4. Convert IR to Executable

[Diagram of the output executable's layout. Labels: Header, Text, Program code & data, Data, Rdata, Ndata, Exception Info, Meta data, Relocations, Debug]

51
Profile Based Optimization

52
Profile based optimization

Collect profile information
example: how often basic blocks are executed
Use profile to guide optimization
example: inlining

53
Profile based Optimization

Available on Alpha, MIPS, PA, PPC, Sparc, x86
Used in compilers and executable optimizers
Spec, products, too.

54
Speedup from code layout

55
Register allocation and inlining

56
User level view

Compiler
Compile
Instrument
Run scenario1
Run scenario2
Merge profiles
Recompile

Executable optimizer
Instrument
Run scenario1
Run scenario2
Merge profiles
Optimize

57
Optimization sensitivity to training data

Experience with varying training
compiler, spreadsheet, CAD, Spec95
Some training sets are better than others
Can find one or a combination that gives best
results in all scenarios
Sometimes requires tuning of optimizations

58
Types of optimizations

Enhance conventional optimization with weights
based on profile
Transformations driven by profile info
Examples
Register allocation
Code layout
Inlining

59
Register allocation
while (a) {
    if (a > 3) b++;
    else c++;
    a--;
}

top:  cmpgt a,3,t0
      brfalse t0,then
      addl b,1,b
      br join
then: ldl t0,c
      addl t0,1,t0
      stl t0,c
join: subl a,1,a
      brtrue a,top

a, b, and c live for entire loop
Should b or c get the last register?
Information: block counts
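
One way block counts can answer the question is to weight each variable's references by how often their block runs. The sketch below is an assumed cost model, not the allocator of any particular compiler. The Use struct, the SpillCost name, and the counts in the usage note are all hypothetical.

```c
/* A variable's references in one basic block. */
typedef struct {
    int block;   /* index of the basic block */
    int refs;    /* memory references needed there if the variable is spilled */
} Use;

/* Estimated dynamic cost of keeping a variable in memory:
   the sum over its blocks of (references in block) * (block execution count). */
long SpillCost(const Use *uses, int nuses, const long *block_counts)
{
    long cost = 0;
    for (int i = 0; i < nuses; i++)
        cost += (long)uses[i].refs * block_counts[uses[i].block];
    return cost;
}
```

With hypothetical counts of 60 for the block that updates b and 40 for the block that updates c, and a load plus a store per update, spilling b costs 120 and spilling c costs 80, so b would get the last register.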

60
Code layout: Reduce the number of taken branches

Greedy algorithm, lay out common paths
sequentially
Information
flow edge counts
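
A minimal sketch of one possible greedy layout follows. Starting at the entry block, it keeps appending the not-yet-placed successor with the highest edge count, and starts a new chain at the first unplaced block when it gets stuck. This is a simplification chosen for illustration, not necessarily the algorithm the slides refer to. The Edge struct, GreedyLayout, and MAX_BLOCKS are assumed names.

```c
#define MAX_BLOCKS 64   /* assumed upper bound for this sketch */

typedef struct { int from, to; long count; } Edge;

/* Fills order[] with a block ordering and returns the number of blocks placed
   (nblocks on success, 0 if the inputs are out of range). */
int GreedyLayout(int nblocks, const Edge *edges, int nedges, int entry, int *order)
{
    int placed[MAX_BLOCKS] = {0};
    int n = 0, cur = entry;

    if (nblocks <= 0 || nblocks > MAX_BLOCKS || entry < 0 || entry >= nblocks)
        return 0;

    while (n < nblocks) {
        order[n++] = cur;
        placed[cur] = 1;

        /* Follow the hottest edge from cur to a block not yet placed,
           so the common path falls through instead of branching. */
        int next = -1;
        long best = -1;
        for (int e = 0; e < nedges; e++) {
            if (edges[e].from == cur && !placed[edges[e].to] && edges[e].count > best) {
                best = edges[e].count;
                next = edges[e].to;
            }
        }

        /* Chain ended: start a new one at the first unplaced block. */
        for (int b = 0; b < nblocks && next < 0; b++)
            if (!placed[b])
                next = b;

        cur = next;
    }
    return n;
}
```

On a diamond with edges 0->1 (60), 0->2 (40), 1->3 (60), 2->3 (40), this produces the order 0, 1, 3, 2, so the 60-count path runs without a taken branch.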

[Figure: flow graphs annotated with edge counts (60, 40, 60, 40 and 45, 55, 55, 45)]

61
Inlining
[Figure: call graph over RtnA, RtnB, RtnC, and RtnD with call edge counts 1000, 2, and 0]

Probably no advantage to inline RtnD into RtnA
RtnB is almost always called from RtnA
thus no cache penalty for inlining
Information: Call edge counts

62
Information to drive optimization

Basic
basic block counts
flow edge counts
call edge counts
More advanced
path profiles
cache misses
branch mispredicts

63
Computing basic block counts

Instrumentation
Use atom tool
Use 64 bit integers
Sampling

64
Computing call edges
PC relative call: call edge count is same as basic block count
    rtna: move 1,a0
          move 2,a1
          bsr rtnb

Indirect call: keep hash table of targets and counts
    rtna: move 1,a0
          move 2,a1
          ldl r0,20(t0)
          jsr r0
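
The hash table for indirect calls could look like the sketch below: an open-addressing table keyed by (call site, target) that an analysis routine updates before each indirect call. The table size, hash function, and the names CountIndirectCall and LookupCallCount are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* assumed capacity; must be a power of two */

typedef struct {
    uint64_t site;     /* address of the call instruction */
    uint64_t target;   /* address actually called */
    long count;
    int used;
} CallEdge;

static CallEdge table[TABLE_SIZE];

/* Linear probing: returns the slot holding (site, target), or the empty slot
   where it would go, or NULL if the table is full. */
static CallEdge *FindSlot(uint64_t site, uint64_t target)
{
    uint64_t h = (site * 31u + target) * 0x9E3779B97F4A7C15ull;
    unsigned i = (unsigned)(h >> 32) & (TABLE_SIZE - 1);
    for (unsigned probes = 0; probes < TABLE_SIZE; probes++) {
        CallEdge *e = &table[i];
        if (!e->used || (e->site == site && e->target == target))
            return e;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}

/* Hypothetical analysis routine, called before each indirect call.
   Returns 1 if the call was counted, 0 if the table is full. */
int CountIndirectCall(uint64_t site, uint64_t target)
{
    CallEdge *e = FindSlot(site, target);
    if (e == NULL)
        return 0;
    if (!e->used) {
        e->used = 1;
        e->site = site;
        e->target = target;
        e->count = 0;
    }
    e->count++;
    return 1;
}

/* Number of times site called target so far (0 if never seen). */
long LookupCallCount(uint64_t site, uint64_t target)
{
    CallEdge *e = FindSlot(site, target);
    return (e != NULL && e->used) ? e->count : 0;
}
```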

65
Computing flow edge counts from basic block counts

Basic block count = Σ incoming edges = Σ outgoing edges
Exceptions, longjmp/setjmp are implicit edges
Tolerate inconsistencies
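
The conservation rule can be applied mechanically: whenever all but one edge on one side of a block are known, the remaining edge is the block count minus the known sum. The sketch below repeats that step until nothing changes. It assumes the entry block has no incoming edges and the exit block no outgoing edges in the graph it is given, and it ignores implicit edges. FlowEdge and SolveEdges are hypothetical names.

```c
typedef struct {
    int from, to;
    long count;
    int known;    /* 1 if count is already determined */
} FlowEdge;

/* Derives edge counts from block counts using
   count(b) = sum of incoming edges = sum of outgoing edges.
   Returns the number of edges that remain undetermined. */
int SolveEdges(const long *block_count, int nblocks, FlowEdge *edges, int nedges)
{
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int b = 0; b < nblocks; b++) {
            for (int side = 0; side < 2; side++) {   /* 0: incoming, 1: outgoing */
                long sum = 0;
                int unknown = -1, nunknown = 0;
                for (int e = 0; e < nedges; e++) {
                    int touches = side ? (edges[e].from == b) : (edges[e].to == b);
                    if (!touches)
                        continue;
                    if (edges[e].known)
                        sum += edges[e].count;
                    else {
                        unknown = e;
                        nunknown++;
                    }
                }
                if (nunknown == 1) {   /* exactly one unknown: solve for it */
                    edges[unknown].count = block_count[b] - sum;
                    edges[unknown].known = 1;
                    progress = 1;
                }
            }
        }
    }
    int remaining = 0;
    for (int e = 0; e < nedges; e++)
        if (!edges[e].known)
            remaining++;
    return remaining;
}
```

A diamond is fully determined by its block counts. A loop with a bypass edge is not: with block counts 10, 20, 10 the routine returns 4 unknown edges until at least one of them is measured directly.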

[Figure: flow graph annotated with counts 10, 20, 10]

66
Computing flow edge counts from basic block counts

Some graphs have multiple solutions
Guess!
Instrument edges
Instrument minimum number of blocks and edges

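while (a) a--;

          bzero a,skip
    top:  subl a,1,a
          bnzero a,top
    skip:

[Figure: two flow graphs for this loop with the same basic block counts (10, 20, 10) but different edge counts (1, 9, 19, 1 versus 9, 1, 11, 9)]
Two solutions for same bb count

67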
Computing flow edge counts from basic block counts

Spanning tree algorithm
given flow graph, costs, finds lowest cost set of
instrumentation points
costs derived from static analysis or earlier
runs
Read Ball and Larus for details
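
As a rough sketch of the idea: build a maximum-weight spanning tree over the flow graph treated as undirected, and instrument only the edges left out of the tree. The counts of tree edges then follow from flow conservation. The code below uses Kruskal's algorithm with union-find. It simplifies the published algorithm, and the names PEdge and ChooseInstrumentation and the 256-block limit are assumptions for this sketch.

```c
#include <stdlib.h>

typedef struct {
    int from, to;
    long weight;      /* estimated frequency: hot edges should stay uninstrumented */
    int instrument;   /* output: 1 if this edge needs a counter */
} PEdge;

static int parent[256];   /* union-find forest; assumed limit of 256 blocks */

static int Find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heavier edges first. */
static int ByWeightDesc(const void *a, const void *b)
{
    long wa = ((const PEdge *)a)->weight, wb = ((const PEdge *)b)->weight;
    return (wa < wb) - (wa > wb);
}

/* Marks the edges to instrument and returns how many there are (-1 on bad input).
   Note: reorders edges[] by descending weight. */
int ChooseInstrumentation(int nblocks, PEdge *edges, int nedges)
{
    int ninstr = 0;
    if (nblocks <= 0 || nblocks > 256)
        return -1;
    for (int i = 0; i < nblocks; i++)
        parent[i] = i;
    qsort(edges, (size_t)nedges, sizeof *edges, ByWeightDesc);
    for (int e = 0; e < nedges; e++) {
        int a = Find(edges[e].from), b = Find(edges[e].to);
        if (a != b) {              /* joins two components: tree edge, no counter */
            parent[a] = b;
            edges[e].instrument = 0;
        } else {                   /* would close a cycle: must be counted */
            edges[e].instrument = 1;
            ninstr++;
        }
    }
    return ninstr;
}
```

A virtual exit-to-entry edge with a very large weight can be added so that it always lands in the tree, since it cannot be instrumented. Self-loops always close a cycle and are therefore always instrumented.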

68
Instrumenting flow edges

ATOM: branch taken value can be passed to analysis routine
branch not taken: insert call to count after conditional branch
taken branch, indirect jump: insert new basic block between branch and target

69
Merging multiple profiles

Multiple runs generate multiple profiles, how do
we combine them?
Add them together
Should the profiles be weighted equally?
User defined
Scale so that sums are equal
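
Both merging policies reduce to a weighted sum of the runs. The sketch below stores the runs as a row-major array and offers a helper that picks weights so every run contributes the same total. The function names and the choice to normalize each run to a total of 1.0 are assumptions for illustration.

```c
/* counts holds nruns rows of ncounts values each, row-major.
   merged[i] = sum over runs r of weight[r] * counts[r][i]. */
void MergeProfiles(const long *counts, int nruns, int ncounts,
                   const double *weight, double *merged)
{
    for (int i = 0; i < ncounts; i++)
        merged[i] = 0.0;
    for (int r = 0; r < nruns; r++)
        for (int i = 0; i < ncounts; i++)
            merged[i] += weight[r] * (double)counts[r * ncounts + i];
}

/* Weights that scale every run to the same total (1.0), so that a long
   run does not drown out a short one. Empty runs get weight 0. */
void EqualSumWeights(const long *counts, int nruns, int ncounts, double *weight)
{
    for (int r = 0; r < nruns; r++) {
        double sum = 0.0;
        for (int i = 0; i < ncounts; i++)
            sum += (double)counts[r * ncounts + i];
        weight[r] = (sum > 0.0) ? 1.0 / sum : 0.0;
    }
}
```

Passing weights of 1.0 for every run gives plain addition. User-defined weights go in the same array.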

70
Using profiles

Edge, block counts are in database
For each procedure, compiler locates counts in
database and copies them to IR
Every flow edge, call edge, block labeled with
execution count
Optimizations that modify flow graph must update
profile information

71
IR/Profiled program mismatch

Does the flow graph of the program you profiled
match the flow graph in the compiler IR?
Optimization
Code generation
Usually ok if you disable optimization
Not a problem for executable optimizers

72
Persistence