Profiling, Instrumentation, and Profile Based Optimization

1
Profiling, Instrumentation, and Profile Based
Optimization
  • Robert Cohn Robert.Cohn_at_compaq.com
  • Mark T. Vandevoorde

2
Introduction
  • Understanding the dynamic interaction between
    programs and processors
  • What do programs do?
  • How do processors perform?
  • How can we make it faster?

3
What to do?
  • Build tools!
  • Profiling
  • Instrumentation
  • Profile based optimization

4
The Big Picture
[Diagram relating Instrumentation, Sampling, Profiling, Profile Based Optimization, Analysis, and Modeling]

5
Instrumentation
  • User level view
  • Executable editing

6
Code Instrumentation
Trojan Horse
  • Application appears unchanged
  • Data collected as a side effect of execution

7
Instrumentation Example
  • Add extra code

  • Original code

        if (b > c)
            t = 1;
        else
            b = 3;

  • After instrumentation

        if (b > c) { bb[0]++; t = 1; }
        else       { bb[1]++; b = 3; }

8
Instrumentation Uses
  • Profiles
  • Model new hardware
  • What will this new branch predictor do?
  • What is the miss rate of this new cache?
  • Optimization opportunities
  • find unnecessary loads and stores
  • find divides by 1

9
What Tool Does Instrumentation?
  • Compiler
  • Compiler inserts extra operations
  • Requires recompile, access to source code
  • Executable editor
  • Post-link tool inserts instrumentation code
  • No rebuild, source code not required
  • More difficult to relate back to source

10
Instrumentation Tools for Alpha
  • All executable based
  • General instrumentation
  • Atom on Digital Unix
  • Distributed with Digital Unix
  • Ntatom on Windows NT
  • New! Download from web
  • Specialized tools based on above
  • hiprof, pixie, 3rd, ...

11
ATOM
  • Tool for customized instrumentation
  • User writes program that describes how to
    instrument application
  • Instrumentation program applied to application,
    generates instrumented application
  • Instrumented application is run
  • Data is collected

12
User Supplies
  • Instrumentation routines: user-written program
    that inserts instrumentation
  • calls to analysis routines
  • Analysis routines: do the instrumentation work at
    runtime (e.g. count a basic block)

13
Atom Programming Model
[Diagram: objects (spice, libc.so, libm.so) contain procedures (main(), Compute(), _exit()), which contain basic blocks (block1 to block5), which contain instructions:]

    ldq  r1, 8(sp)
    addq r1, 0x1, r2
    stq  r2, 8(sp)
    bne  r1, 0x1ffc40

14
ATOM Instrumentation API: Navigation
  • Objects (binary, shared library)
  • GetFirstObj, GetNextObj
  • Procedures
  • GetFirstProc, GetNextProc
  • Basic blocks
  • GetFirstBlock, GetNextBlock
  • Instructions
  • GetFirstInst, GetNextInst

15
ATOM Instrumentation API: Interrogation
  • GetObjInfo, GetProcInfo, GetBlockInfo,
    GetInstInfo
  • IsBranchTarget
  • GetInstRegUsage
  • InstPC
  • InstLineNo
  • ...

16
ATOM Instrumentation API: Definition
  • AddCallProto
  • tells atom the types of the arguments for calls
    to analysis routines

17
ATOM Instrumentation API: Instrumentation
  • AddCallProgram, AddCallObj, AddCallProc,
    AddCallBlock, AddCallInst, ReplaceProcedure
  • Insert before or after

18
Arguments to analysis routines
  • Constants
  • variables in instrumentation program, but
    constant at instrumentation point
  • e.g. uninstrumented PC, function name
  • VALUE computed at runtime
  • effective address, branch taken predicate
  • Register
  • r3, arguments, return value

19
Sample 1: Cache Simulator
  • Write a tool that computes the miss rate of the
    application running in a 64KB, direct mapped data
    cache with 32 byte lines.
  • > atom spice cache.inst.o cache.anal.o -o
    spice.cache
  • > spice.cache < ref.in > ref.out
  • > more cache.out
  • 5,387,822,402 620,855,884 11.523

20
Cache Tool Implementation
[Diagram: application code (below) with inserted instrumentation calls; labels: VALUE, PrintResults()]

    main:  clr   t0
    loop:  ldl   t2,0(a0)
           addl  t0,4,t0
           addl  t2,0x10,t2
           stl   t2,0(a0)
           bne   t3,loop
           ret

21
Cache Analysis File
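    #include <stdio.h>

    #define CACHE_SIZE  65536   /* 64KB cache */
    #define BLOCK_SHIFT 5       /* 32 byte lines */

    /* One tag per cache line, plus reference and miss counters */
    long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

    /* Called before every load and store with its effective address */
    void Reference(long address) {
        int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
        long tag = address >> BLOCK_SHIFT;
        if (cache[index] != tag) {
            misses++;
            cache[index] = tag;
        }
        refs++;
    }

    /* Called at program exit: writes refs, misses, and miss rate (%) */
    void Print() {
        FILE *file = fopen("cache.out","w");
        fprintf(file,"%ld %ld %.2f\n",refs, misses,
                100.0 * misses / refs);
        fclose(file);
    }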
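
The cache model above can be tried without ATOM by feeding it addresses directly. The following is a minimal sketch under that assumption: CacheSim, cache_init, cache_reference, and cache_miss_rate are illustrative names that do not appear in the talk, and a valid bit is added so that an empty line is never mistaken for a hit.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE  65536                 /* 64KB direct mapped cache */
#define BLOCK_SHIFT 5                     /* 32 byte lines */
#define NUM_LINES   (CACHE_SIZE >> BLOCK_SHIFT)

/* Hypothetical container for the simulator state (not an ATOM type). */
typedef struct {
    long tags[NUM_LINES];
    char valid[NUM_LINES];
    long refs, misses;
} CacheSim;

void cache_init(CacheSim *c) {
    memset(c, 0, sizeof *c);
}

/* Same indexing and tag check as Reference(), plus a valid bit. */
void cache_reference(CacheSim *c, long address) {
    int index = (int)((address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT);
    long tag = address >> BLOCK_SHIFT;
    if (!c->valid[index] || c->tags[index] != tag) {
        c->misses++;
        c->valid[index] = 1;
        c->tags[index] = tag;
    }
    c->refs++;
}

/* Miss rate in percent; requires at least one reference. */
double cache_miss_rate(const CacheSim *c) {
    assert(c->refs > 0);
    return 100.0 * c->misses / c->refs;
}
```

Sixteen 4-byte accesses to addresses 0 through 60 touch two 32-byte lines, so they count as 2 misses in 16 references. An access to address 65536 then maps to the same line as address 0 and evicts it.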

22
Cache Instrumentation File
    #include <stdio.h>
    #include <cmplrs/atom.inst.h>

    unsigned Instrument(int argc, char **argv, Obj *o) {
        Inst *i; Block *b; Proc *p;

        /* Declare the analysis routines and their argument types */
        AddCallProto("Reference(VALUE)");
        AddCallProto("Print()");
        /* Print the results when the program exits */
        AddCallProgram(ProgramAfter,"Print");

        /* Walk every instruction; pass the effective address of
           each load and store to Reference */
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
          for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
            for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
              if (IsInstType(i, InstTypeLoad) ||
                  IsInstType(i, InstTypeStore))
                AddCallInst(i, InstBefore, "Reference",
                            EffAddrValue);
    }

23
Sample 2: Profiler
  • Write a tool that outputs the address of each
    basic block and the number of times it is
    executed.
  • vssad-27> atom a.out prof.inst.c prof.anal.c
  • vssad-28> a.out.atom
  • Hello world
  • vssad-29> head prof.out
  • 120001030 1
  • 120001038 1
  • 12000103c 1
  • 120001058 33
  • 120001064 1

24
Profiler Tool Implementation
[Diagram: application code (below) with inserted instrumentation calls; labels: Constant, PrintResults(addresses,3)]

    main:  clr   t0
    loop:  ldl   t2,0(a0)
           addl  t0,4,t0
           addl  t2,0x10,t2
           stl   t2,0(a0)
           bne   t3,loop
           ret

25
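Profiler: prof.anal.c

    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <string.h>   /* memset */

    long *counts;         /* one counter per basic block */

    /* Called once before the program starts */
    void Init(int nblocks) {
        counts = (long *)malloc(nblocks * sizeof(long));
        memset(counts,0,nblocks * sizeof(long));
    }

    /* Called at the start of each basic block */
    void Count(int index) { counts[index]++; }

    /* Called at program exit: one "address count" line per block */
    void Print(long *blocks,int nblocks) {
        int i; FILE *file = fopen("prof.out","w");
        for (i = 0; i < nblocks; i++)
            fprintf(file,"%lx %ld\n",blocks[i],counts[i]);
        fclose(file);
    }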

26
Profiler: prof.inst.c

    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <cmplrs/atom.inst.h>

    void CallInitPrint();

    void Instrument(int argc, char **argv, Obj *o) {
        Block *b; Proc *p; int index = 0;
        int nblocks = GetObjInfo(o,ObjNumberBlocks);
        long *addresses = (long *)malloc(nblocks *
                                         sizeof(long));
        CallInitPrint(addresses,nblocks);
        /* Record each block's address and count its executions */
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
          for (b = GetFirstBlock(p); b != NULL;
               b = GetNextBlock(b)) {
            addresses[index] = InstPC(GetFirstInst(b));
            AddCallInst(GetFirstInst(b), InstBefore,
                        "Count",index++);
          }
    }

27
Profiler: prof.inst.c (continued)

    void CallInitPrint(long *addresses, int nblocks) {
        char buffer[100];
        AddCallProto("Count(int)");
        AddCallProto("Init(int)");
        AddCallProgram(ProgramBefore,"Init",nblocks);
        /* The array length is part of the prototype string */
        sprintf(buffer,"Print(const stable int[%d],int)",
                nblocks);
        AddCallProto(buffer);
        AddCallProgram(ProgramAfter,"Print",addresses,nblocks);
    }

28
Executable editors
  • Input executable, output executable
  • Instrument, optimize, translate
  • Executable: image, binary, shared library,
    shared object, dynamically linked library (DLL)
  • Executable editor, executable optimizer, binary
    rewriter, binary translator, post link optimizer

29
Executable Editing
  • Insert/delete/reorder instructions and data
  • Obstacle to modification
  • Addresses are bound
  • Registers are bound

30
Obstacles
  • if (a) a = b
  • beq r1,2
  • ldl r1,0x1000
  • Is a0 free?
  • Adjust branch offsets
  • Adjust literal addresses

31
Phases
  • 1. Decompose
  • 2. Build IR
  • 3. Insert instrumentation
  • 4. Convert IR to executable

32
1. Decompose Executable
[Diagram: layout of an executable]
  • Header
  • Program code & data: Text (code), Data, Rdata
  • Meta data: Exception Info, Relocations, Debug

33
Decompose
  • Break executable into units
  • unit: minimum data that must be kept together
  • code unit is instruction
  • data unit is data section
  • alternative unit is data item

34
2. Build Internal Representation
35
Intermediate Representation
  • Similar to compiler
  • except unstructured, untyped data
  • 1 to 1 mapping for IR and machine instructions
  • Base representation should be compact
  • fit in physical memory
  • initial/final phases do multiple passes
  • Representations built/thrown away for procedures

36
Bound addresses
  • Data
  • 1
  • 2
  • 0x12345678
  • 3

Code:
    br 4
    ldah r0,0x1234
    lda r0,0x5678(r0)
Metadata: Begin 0x12345678, End 0x12345680

37
Adjusting addresses
  • No translation
  • Dynamic translation
  • Static translation

38
No translation
  • Leave code and data at same address

Original:
        beq r1,L2
        ldl r1,0x1234
    L2:

Instrumented:
        beq r1,L2
        br  L1
    L2: ...
        ...
    L1: lda a0,0x1234
        bsr Reference
        ldl r1,0x1234
        br  L2

39
Dynamic translation
  • Address computation is unchanged
  • Image has map of old->new address
  • Code inserted to map old->new address at runtime
    for load/store/branch
  • Better
  • Do PC relative branches statically
  • Keep data section at original address
  • Still indirect calls and jumps (not returns)

40
Static translation
  • Address computation is altered for new layout
  • Find addresses
  • Determine what they point to
  • unit, offset
  • Insert instrumentation
  • Adjust literals or offsets to compute new address
    of unit
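
Adjusting a literal for a new layout can be illustrated with the ldah/lda pair used in this talk (ldah r0,0x1234 followed by lda r0,0x5678(r0) builds 0x12345678). The sketch below shows how a new address could be split back into the two 16-bit literals. It relies on the Alpha rule that lda adds a sign-extended literal and ldah adds the literal times 65536; split_address and join_address are illustrative names, not part of any tool described here.

```c
#include <assert.h>
#include <stdint.h>

/* Split an address into the literals of an ldah/lda pair.
   Because lda sign-extends its literal, the high part is one larger
   whenever the low 16 bits have their top bit set. */
void split_address(uint32_t addr, int16_t *high, int16_t *low) {
    int32_t l = (int32_t)(addr & 0xffff);
    if (l >= 0x8000)
        l -= 0x10000;                     /* value lda will actually add */
    assert(addr < 0x7fff8000u);           /* keeps high within 16 signed bits */
    *low = (int16_t)l;
    *high = (int16_t)(((int64_t)addr - l) / 65536);
}

/* What the instruction pair computes from the two literals. */
uint32_t join_address(int16_t high, int16_t low) {
    return (uint32_t)((int64_t)high * 65536 + low);
}
```

For 0x12345678 this gives the literals 0x1234 and 0x5678 from the slides. If the target moved to 0x1234D678, the high literal would become 0x1235 and the low literal would be negative.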

41
Other tools that change addresses
  • Linker
  • combine separately compiled objects
  • adjust addresses based on assigned load address
  • unit is section of object (data, text)
  • Loader
  • Load address != link address for DLL
  • unit is entire image
  • Use relocations

42
Relocations
  • Data
  • 1
  • 2
  • 0x12345678
  • 3

Code:
    br 4
    ldl r1,10(gp)
    ldah r0,0x1234
    lda r0,0x5678(r0)
Figure labels: No relocation required; May require relocation; Requires relocation
Relocation example: address 0x200, type ldah literal, object 0x12345670, external

43
How to recognize addresses?
  • Metadata
  • example: procedure begin, procedure end
  • implicit in structure of data
  • Absolute addresses
  • example: literal address in data section
  • use relocations
  • Relative addresses: address + offset
  • example: pc relative branch, offset for base
    pointer
  • may not need adjustment, usually no relocation

44
Relative Addresses
  • Address computed as offset of another address
  • Address and Address + Offset point to same unit:
    ok, unit moved as a unit
  • Example
  • a->field1, ar[4]
  • ldl r0,field1(a); ldl r0,16(ar)

45
Relative Addresses
  • Offset spans multiple units
  • example

Jump table:
          ad = base + i
          jmp ad
    base: br l1
          br l2
          br l3
          br l4
PC relative branch:
          br 4
Must be 1 unit

46
Map address to unit and offset
  • Reference -> address
  • in code: interpret instructions
  • br 4
  • ldah r0,0x1234
  • lda r0,0x5678(r0)
  • in data: data is address
  • .data
  • 0x12345678

47
Map address to unit and offset
  • (relocation,address) -> (unit,offset)
  • to code: pointer to instruction
  • to data: data section and offset
  • alternative: data item and offset
  • offset = address - unit address
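
The mapping from an address to (unit, offset), and back to an address in the new layout, can be sketched as follows. This is only an illustration under simple assumptions: the Unit struct and the function names are hypothetical, and a real tool would likely use a sorted table rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>

/* A unit is moved as a whole (hypothetical fields). */
typedef struct {
    unsigned long old_addr;   /* address in the input executable */
    unsigned long size;       /* size in bytes */
    unsigned long new_addr;   /* address assigned in the output executable */
} Unit;

/* Find the unit containing address. Returns its index, or -1 if none.
   On success, *offset = address - unit address. */
int address_to_unit(const Unit *units, size_t n, unsigned long address,
                    unsigned long *offset) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (address >= units[i].old_addr &&
            address - units[i].old_addr < units[i].size) {
            *offset = address - units[i].old_addr;
            return (int)i;
        }
    }
    return -1;
}

/* New value for a reference once the units have been laid out again. */
unsigned long relocate(const Unit *units, size_t n, unsigned long address) {
    unsigned long offset = 0;
    int u = address_to_unit(units, n, address, &offset);
    assert(u >= 0);   /* a relocated address must point into some unit */
    return units[u].new_addr + offset;
}
```

With a data unit at 0x12345670 that moves to 0x12345690, the reference 0x12345678 maps to (that unit, offset 8) and is rewritten as 0x12345698.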

48
3. Insert Instrumentation
[Diagram: the IR that instrumentation is inserted into]
  • Instruction list: add, load, load, beq
  • Data sections: Data, Sdata, Ndata
  • MetaData: Exception Info, Relocations

49
Adding instrumentation code
  • Instrumentation requires free registers
  • wrapper routine saves and restores registers

Instrumented code:
    beq r1,2
    save registers
    lda a0,0x1000
    bsr ra,wrapper
    restore registers
    ldl r1,0(r2)

wrapper:
    Save registers on stack
    bsr ra,Reference
    Restore registers
    return

  • Local/global/interprocedural analysis finds free
    registers

50
4. Convert IR to Executable
[Diagram: layout of the output executable]
  • Header
  • Program code & data: Text, Data, Rdata, Ndata
  • Meta data: Exception Info, Relocations, Debug

51
Profile Based Optimization
52
Profile based optimization
  • Collect profile information
  • example: how often basic blocks are executed
  • Use profile to guide optimization
  • example: inlining

53
Profile based Optimization
  • Available on Alpha, MIPS, PA, PPC, Sparc, x86
  • Used in compilers and executable optimizers
  • Spec, products, too.

54
Speedup from code layout
55
Register allocation and inlining
56
User level view
  • Compiler
  • Compile
  • Instrument
  • Run scenario1
  • Run scenario2
  • Merge profiles
  • Recompile
  • Executable optimizer
  • Instrument
  • Run scenario1
  • Run scenario2
  • Merge profiles
  • Optimize

57
Sensitivity of optimizations to training data
  • Experience with varying training
  • compiler, spreadsheet, CAD, Spec95
  • Some training sets are better than others
  • Can find one or a combination that gives best
    results in all scenarios
  • Sometimes requires tuning of optimizations

58
Types of optimizations
  • Enhance conventional optimization with weights
    based on profile
  • Transformations driven by profile info
  • Examples
  • Register allocation
  • Code layout
  • Inlining

59
Register allocation
    while (a) {
        if (a > 3) b++;
        else c++;
        a--;
    }

    top:  cmpgt   a,3,t0
          brfalse t0,then
          addl    b,1,b
          br      join
    then: ldl     t0,c
          addl    t0,1,t0
          stl     t0,c
    join: subl    a,1,a
          brtrue  a,top

  • a, b, and c live for entire loop
  • Should b or c get the last register?
  • Information: block counts

60
Code layout: Reduce the number of taken branches
  • Greedy algorithm, lay out common paths
    sequentially
  • Information
  • flow edge counts

[Diagram: flow graphs with edge counts 60, 40, 60, 40 and 45, 55, 55, 45]

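
One way to lay out common paths sequentially is to chain blocks along the heaviest flow edges first. The sketch below is one simple greedy scheme under that assumption and is not claimed to be the algorithm used by the tools in this talk. FlowEdge, greedy_layout, and MAX_BLOCKS are illustrative names, and block 0 is assumed to be the entry.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_BLOCKS 64

typedef struct { int src, dst; long count; } FlowEdge;

/* qsort comparator: heaviest edge first. */
static int by_count_desc(const void *a, const void *b) {
    long ca = ((const FlowEdge *)a)->count;
    long cb = ((const FlowEdge *)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Writes a block order into layout[0..nblocks-1]. Sorts edges in place. */
void greedy_layout(FlowEdge *edges, int nedges, int nblocks, int *layout) {
    int next[MAX_BLOCKS];   /* successor within a chain, -1 at the tail */
    int head[MAX_BLOCKS];   /* first block of the chain a block belongs to */
    int i, b, n = 0;

    assert(nblocks <= MAX_BLOCKS);
    for (b = 0; b < nblocks; b++) { next[b] = -1; head[b] = b; }

    qsort(edges, (size_t)nedges, sizeof *edges, by_count_desc);
    for (i = 0; i < nedges; i++) {
        int s = edges[i].src, d = edges[i].dst;
        /* Join only if s ends a chain, d starts a different chain,
           and d is not the entry block. */
        if (next[s] != -1 || head[d] != d || head[s] == d || d == 0)
            continue;
        next[s] = d;                       /* d now falls through from s */
        for (b = d; b != -1; b = next[b])
            head[b] = head[s];
    }

    /* Emit the entry chain first, then the remaining chains. */
    for (b = 0; b < nblocks; b++) {
        int c;
        if (head[b] != b)
            continue;
        for (c = b; c != -1; c = next[c])
            layout[n++] = c;
    }
    assert(n == nblocks);
}
```

For a diamond-shaped graph with edges 0→1 (60), 0→2 (40), 1→3 (60), and 2→3 (40), the resulting order is 0, 1, 3, 2: the common path falls through and only the less frequent path takes branches.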
61
Inlining
[Diagram: call graph of RtnA, RtnB, RtnC, and RtnD with call edge counts 1000, 2, and 0]
  • Probably no advantage to inline RtnD into RtnA
  • RtnB is almost always called from RtnA
  • thus no cache penalty for inlining
  • Information: call edge counts

62
Information to drive optimization
  • Basic
  • basic block counts
  • flow edge counts
  • call edge counts
  • More advanced
  • path profiles
  • cache misses
  • branch mispredicts

63
Computing basic block counts
  • Instrumentation
  • Use atom tool
  • Use 64 bit integers
  • Sampling

64
Computing call edges
  • PC relative call: call edge count is same as
    basic block count

    rtna: move 1,a0
          move 2,a1
          bsr  rtnb

  • Indirect call: keep hash table of targets and
    counts

    rtna: move 1,a0
          move 2,a1
          ldl  r0,20(t0)
          jsr  r0

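
A hash table of targets and counts for indirect calls might look like the following analysis-side sketch. It assumes the instrumentation passes the call site PC and the runtime target to a routine before each indirect call. CountCall, GetCallCount, and the fixed table size are assumptions for illustration, not ATOM APIs.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 1024   /* power of two, assumed larger than the number of edges */

typedef struct {
    unsigned long site;     /* PC of the call instruction */
    unsigned long target;   /* address that was called */
    long count;             /* 0 marks an empty slot */
} CallEdge;

static CallEdge edge_table[TABLE_SIZE];

static size_t edge_hash(unsigned long site, unsigned long target) {
    return (size_t)((site * 31u + target) >> 2) & (TABLE_SIZE - 1);
}

/* Analysis routine: record one execution of the edge site -> target. */
void CountCall(unsigned long site, unsigned long target) {
    size_t i = edge_hash(site, target);
    size_t probes = 0;
    while (edge_table[i].count != 0 &&
           (edge_table[i].site != site || edge_table[i].target != target)) {
        i = (i + 1) & (TABLE_SIZE - 1);    /* linear probing */
        probes++;
        assert(probes < TABLE_SIZE);       /* table is full */
    }
    edge_table[i].site = site;
    edge_table[i].target = target;
    edge_table[i].count++;
}

/* Look up the count for an edge; 0 if it was never executed. */
long GetCallCount(unsigned long site, unsigned long target) {
    size_t i = edge_hash(site, target);
    size_t probes;
    for (probes = 0; probes < TABLE_SIZE && edge_table[i].count != 0; probes++) {
        if (edge_table[i].site == site && edge_table[i].target == target)
            return edge_table[i].count;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

Keying on the (site, target) pair keeps separate counts when one call instruction reaches several procedures, which is the information a call edge profile needs.
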
65
Computing flow edge counts from basic block counts
  • Basic block count = Σ incoming edges = Σ outgoing edges
  • Exceptions, longjmp/setjmp are implicit edges
  • Tolerate inconsistencies

[Diagram: flow graph with counts 10, 20, 10]

66
Computing flow edge counts from basic block counts
  • Some graphs have multiple solutions
  • Guess!
  • Instrument edges
  • Instrument minimum number of blocks and edges

    while (a) a--;

          bzero  a,skip
    top:  subl   a,1,a
          bnzero a,top
    skip:

[Diagram: the flow graph drawn twice with different edge counts; counts shown: 10, 10, 1, 9, 9, 19, 1, 11, 20, 20, 1, 9, 10, 10]
Two solutions for same bb count

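
The rule that a block count equals the sum of its incoming edges and the sum of its outgoing edges can be applied repeatedly: whenever a block has exactly one unknown edge on one side, that edge is the block count minus the known edges. The sketch below does this and reports how many edges remain undetermined. Edge, UNKNOWN, and solve_edges are illustrative names, and the routine assumes consistent counts (no implicit edges).

```c
#include <assert.h>

#define UNKNOWN (-1L)

typedef struct { int src, dst; long count; } Edge;   /* count is UNKNOWN until solved */

/* If exactly one edge entering (incoming != 0) or leaving block b is
   unknown, derive it from the block count. Returns 1 if an edge was solved. */
static int solve_side(Edge *e, int nedges, int b, long block_count, int incoming) {
    long known = 0;
    int i, unknown = -1, nunknown = 0;
    for (i = 0; i < nedges; i++) {
        if ((incoming ? e[i].dst : e[i].src) != b)
            continue;
        if (e[i].count == UNKNOWN) { unknown = i; nunknown++; }
        else known += e[i].count;
    }
    if (nunknown != 1)
        return 0;
    e[unknown].count = block_count - known;
    return 1;
}

/* Propagate until nothing changes. Returns the number of edges still unknown. */
int solve_edges(Edge *e, int nedges, const long *block_counts, int nblocks) {
    int changed = 1, b, i, left = 0;
    assert(nedges >= 0 && nblocks >= 0);
    while (changed) {
        changed = 0;
        for (b = 0; b < nblocks; b++) {
            changed |= solve_side(e, nedges, b, block_counts[b], 1);
            changed |= solve_side(e, nedges, b, block_counts[b], 0);
        }
    }
    for (i = 0; i < nedges; i++)
        left += (e[i].count == UNKNOWN);
    return left;
}
```

For a simple if/else diamond, block counts alone determine every edge. For a loop like the one above, every block has two unknown edges on each side, so nothing can be derived until at least one edge is instrumented; after that, the remaining edges follow from the block counts.
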
67
Computing flow edge counts from basic block counts
  • Spanning tree algorithm
  • given flow graph, costs, finds lowest cost set of
    instrumentation points
  • costs derived from static analysis or earlier
    runs
  • Read Ball and Larus for details

68
Instrumenting flow edges
  • ATOM: branch taken value can be passed to
    analysis routine
  • branch not taken: insert call to count after
    conditional branch
  • taken branch, indirect jump: insert new basic
    block between branch and target

69
Merging multiple profiles
  • Multiple runs generate multiple profiles, how do
    we combine them?
  • Add them together
  • Should the profiles be weighted equally?
  • User defined
  • Scale so that sums are equal
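
The two merging policies above, adding the profiles together and scaling so that sums are equal, can be sketched as follows. The flat array layout (one row of n counts per run) and the function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Add the profiles together: merged[i] = sum over runs of counts[run][i].
   counts holds nruns rows of n counts each, stored row after row. */
void merge_add(const long *counts, size_t nruns, size_t n, long *merged) {
    size_t r, i;
    for (i = 0; i < n; i++)
        merged[i] = 0;
    for (r = 0; r < nruns; r++)
        for (i = 0; i < n; i++)
            merged[i] += counts[r * n + i];
}

/* Scale each run so its counts sum to target, then add.
   Every non-empty run contributes the same total weight. */
void merge_scaled(const long *counts, size_t nruns, size_t n,
                  double target, double *merged) {
    size_t r, i;
    assert(target > 0.0);
    for (i = 0; i < n; i++)
        merged[i] = 0.0;
    for (r = 0; r < nruns; r++) {
        double total = 0.0, scale;
        for (i = 0; i < n; i++)
            total += (double)counts[r * n + i];
        if (total == 0.0)
            continue;                      /* an empty run contributes nothing */
        scale = target / total;
        for (i = 0; i < n; i++)
            merged[i] += scale * (double)counts[r * n + i];
    }
}
```

With one long run {90, 10} and one short run {1, 9}, plain addition gives {91, 19} and the long run dominates. Scaling both runs to a total of 100 gives {100, 100}, so the short scenario carries equal weight.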

70
Using profiles
  • Edge, block counts are in database
  • For each procedure, compiler locates counts in
    database and copies them to IR
  • Every flow edge, call edge, block labeled with
    execution count
  • Optimizations that modify flow graph must update
    profile information

71
IR/Profiled program mismatch
  • Does the flow graph of the program you profiled
    match the flow graph in the compiler IR?
  • Optimization
  • Code generation
  • Usually ok if you disable optimization
  • Not a problem for executable optimizers

72
Persistence
  • If the program is modified, can you use an old
    profile?
  • Generating a profile can be difficult and time
    consuming
  • Don't hold up build process generating a new
    profile every time

73
Usability
  • Make it easy or no one will use it
  • Limited changes to build process
  • Limited opportunities for user to mess up

74
Profile based optimization nirvana
  • Profile any build
  • tolerate IR/profiled program mismatches
  • No instrumentation step
  • Low cost profiling, < 5%
  • No restructuring of makefile
  • Big speedup!

DCPI
75
Tools for profile based optimization
  • Unix
  • cc, f77
  • om: executable optimizer called from cc
  • cord: user specified procedure ordering
  • NT
  • scc: calls Visual C
  • spike: executable optimizer
  • link /order: user specified procedure ordering
  • wstune: generates ordering