Title: Detecting Performance Bottlenecks Using Binary Rewriting
1 Detecting Performance Bottlenecks Using Binary Rewriting
Jaydeep Marathe and Frank Mueller
North Carolina State University, Department of Computer Science
2 Why are Memory Performance Bottlenecks a Problem?
[Diagram: Processor, L1 Cache, L2 Cache, Main Memory]
- Processor speeds are growing much faster than memory access speeds.
- Application memory performance has an increasingly significant impact on overall performance.
3 Locality Of Reference
Increase Locality => Decrease Misses!
4 How To Gauge Memory Performance?
One Way ..
- Drawbacks
  - Tradeoff between accuracy and sampling overhead.
  - Fairly coarse statistics: overall hits, misses, etc.
Another Way ..
- Drawbacks
  - High execution overhead due to logging.
  - Complete trace is huge! Hundreds of MBs in size.
Need Accurate Metrics with Minimum Time and Space Overheads!
5 Detecting Bottlenecks Using Binary Rewriting
[Diagram: Controller Process, Instrument, Target Binary, Memory trace, Online Compression, Compressed trace, Trace file, Cache Simulator, Detailed Cache Statistics]
- Binary rewriting to instrument the application binary.
- Insert online compression routines to compress the generated trace.
- Drive an incremental cache simulator with the compressed trace.
- Simulator generates detailed cache metrics for user feedback.
6 Advantages ..
[Diagram: Controller Process, Instrument, Target Binary, Memory trace, Online Compression, Compressed trace, Trace file, Cache Simulator, Detailed Cache Statistics]
- Selective instrumentation of parts of the target binary.
- Partial data traces instead of complete traces.
- Online compression reduces trace storage requirements.
- Statistics correlated to source data structures.
7 Instrumenting Target Binary
[Diagram: Mutator (controller), DynInst, Target Binary, CFG, Machine code]
- Extended a portable binary manipulation framework (DynInst, U. Maryland).
8 Compressing Generated Trace
- Generated trace potentially contains millions of accesses!
- Solution: detect regular patterns in the trace for effective compression.
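The idea of detecting regular patterns can be sketched in C as follows. This is only a minimal illustration, assuming a descriptor of the form (start address, stride, length) for runs of constant-stride accesses. The names `rsd_t` and `compress_strided` are hypothetical, and the sketch does not handle interleaved patterns, so it should not be read as the tool's actual algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a run of constant-stride accesses. */
typedef struct {
    uint64_t start;   /* first address of the run */
    int64_t  stride;  /* address difference between consecutive accesses */
    size_t   length;  /* number of accesses in the run */
} rsd_t;

/* Greedily fold the trace into constant-stride runs.
 * Writes at most max_out descriptors and returns how many were written. */
size_t compress_strided(const uint64_t *trace, size_t n,
                        rsd_t *out, size_t max_out)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n && count < max_out) {
        rsd_t d = { trace[i], 0, 1 };
        if (i + 1 < n) {
            d.stride = (int64_t)(trace[i + 1] - trace[i]);
            d.length = 2;
            /* Extend the run while the stride stays the same. */
            while (i + d.length < n &&
                   (int64_t)(trace[i + d.length] - trace[i + d.length - 1])
                       == d.stride)
                d.length++;
        }
        out[count++] = d;
        i += d.length;
    }
    return count;
}
```

For a trace such as 100, 108, 116, 124, 500, 504, 508 this yields two descriptors, (100, 8, 4) and (500, 4, 3), instead of seven stored addresses.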
9 An RSD Example
10 Power Regular Section Descriptors (PRSDs)
- RSDs are not powerful enough to compress the address stream efficiently.
- Solution: nest RSDs to create a Power Regular Section Descriptor (PRSD).
- PRSD: < base_addr, base_addr_shift, base_seq, base_seq_shift, length, child >
  - base_addr: first address generated by the PRSD.
  - base_addr_shift: stride of base_addr between PRSD iterations.
  - base_seq: starting position of this pattern in the trace.
  - base_seq_shift: interleave distance between PRSD iterations.
  - length: PRSD length.
  - child: nested PRSD/RSD.
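The PRSD tuple can be read as a compact generator of trace entries. The C sketch below uses the field names listed above; everything else is an assumption made for illustration: the leaf layout `rsd_leaf_t` (address stride, sequence stride, length), the restriction to a single nesting level, and the helper `prsd_expand`.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed leaf layout: its start address and start position are supplied
 * by the enclosing PRSD. */
typedef struct {
    int64_t  addr_stride;  /* address stride between accesses */
    uint64_t seq_stride;   /* distance in the trace between accesses */
    size_t   length;       /* number of accesses per iteration */
} rsd_leaf_t;

/* PRSD fields as named on the slide; child limited to one leaf here. */
typedef struct {
    uint64_t base_addr;        /* first address generated by the PRSD */
    int64_t  base_addr_shift;  /* stride of base_addr between iterations */
    uint64_t base_seq;         /* starting position of the pattern */
    uint64_t base_seq_shift;   /* interleave distance between iterations */
    size_t   length;           /* number of PRSD iterations */
    const rsd_leaf_t *child;   /* nested descriptor */
} prsd_t;

typedef struct { uint64_t seq; uint64_t addr; } access_t;

/* Regenerate the (position, address) pairs a PRSD stands for.
 * Returns the number of entries written (at most max_out). */
size_t prsd_expand(const prsd_t *p, access_t *out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i < p->length; i++) {
        /* Unsigned arithmetic wraps, so negative shifts also work. */
        uint64_t addr0 = p->base_addr
                       + (uint64_t)((int64_t)i * p->base_addr_shift);
        uint64_t seq0  = p->base_seq + i * p->base_seq_shift;
        for (size_t j = 0; j < p->child->length; j++) {
            if (n == max_out)
                return n;
            out[n].addr = addr0
                        + (uint64_t)((int64_t)j * p->child->addr_stride);
            out[n].seq  = seq0 + j * p->child->seq_stride;
            n++;
        }
    }
    return n;
}
```

With a child of stride 8 and length 3, and a PRSD of length 2 with base_addr 1000 and base_addr_shift 32, the descriptor expands to six accesses (1000, 1008, 1016, 1032, 1040, 1048) while its own size stays constant.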
11 Incremental Cache Simulation
[Diagram: Address Trace, Compressed Trace, Cache Simulator, Report File (detailed cache statistics), Scopes File (scope structure of target), Variables File (base addresses of variables in target)]
- Incremental cache simulation (modified MHSim, Rice U.).
- Correlate:
  - Trace addresses <---> variable names
  - Access points (LD / ST instructions) <---> line numbers in source
- Metrics per access point, also aggregated by scope structure of target.
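How a simulator might derive such statistics from an address stream can be sketched with a small direct-mapped cache in C. This is not MHSim; the cache geometry, the names `cache_t`, `cache_access` and `cache_spatial_use`, and the hit classification (a hit on a word already referenced since the block was loaded counts as temporal, a hit on a not-yet-referenced word of a resident block counts as spatial) are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative geometry: 4-byte words, 8 words per block, 4 lines. */
enum { WORD_BYTES = 4, BLOCK_WORDS = 8, NUM_LINES = 4 };

typedef struct {
    int      valid;
    uint64_t tag;      /* block number currently held by this line */
    uint32_t touched;  /* bit w is set once word w has been referenced */
} line_t;

typedef struct {
    line_t   lines[NUM_LINES];
    uint64_t misses, temporal_hits, spatial_hits;
    uint64_t evictions;
    uint64_t words_used_at_evict;  /* touched words summed over evictions */
} cache_t;

void cache_init(cache_t *c) { memset(c, 0, sizeof *c); }

static unsigned count_bits(uint32_t v)
{
    unsigned n = 0;
    for (; v; v &= v - 1)  /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Simulate one access and update the counters. */
void cache_access(cache_t *c, uint64_t addr)
{
    uint64_t word  = addr / WORD_BYTES;
    uint64_t block = word / BLOCK_WORDS;
    uint32_t bit   = 1u << (word % BLOCK_WORDS);
    line_t  *l     = &c->lines[block % NUM_LINES];

    if (l->valid && l->tag == block) {
        if (l->touched & bit)
            c->temporal_hits++;   /* same word referenced before */
        else
            c->spatial_hits++;    /* neighbour word in a resident block */
        l->touched |= bit;
        return;
    }
    c->misses++;
    if (l->valid) {               /* the resident block is evicted */
        c->evictions++;
        c->words_used_at_evict += count_bits(l->touched);
    }
    l->valid = 1;
    l->tag = block;
    l->touched = bit;
}

/* Average fraction of a block referenced before it was evicted. */
double cache_spatial_use(const cache_t *c)
{
    if (c->evictions == 0)
        return 0.0;               /* corresponds to "no evicts" */
    return (double)c->words_used_at_evict
         / ((double)c->evictions * BLOCK_WORDS);
}
```

Feeding the addresses 0, 4, 0, 128 into this sketch gives one miss, one spatial hit, one temporal hit, and a second miss that evicts a block of which only 2 of 8 words were used (spatial use 0.25).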
12 Cache Metrics Per Access Point
Report File: metrics and what they tell us
- Miss ratio: coarse indicator of performance.
- Temporal ratio: relative degree of temporal locality.
- Spatial ratio: relative degree of spatial locality.
- Spatial use: cache block fraction used before eviction (access efficiency).
- Evictor references: list of evictors, i.e. conflicting variables (useful!).
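The ratio metrics can be computed from raw counters as sketched below. The formulas are inferred from the figures reported for the matrix multiplication kernel (261189 / 1000000 = 0.261, 703930 / 738811 = 0.95, 34881 / 738811 = 0.04721) rather than taken from a stated definition, and the names `ratios_t` and `compute_ratios` are hypothetical.

```c
/* Ratios derived from hit/miss counters; -1.0 marks "no hits". */
typedef struct {
    double miss_ratio;      /* misses / (hits + misses) */
    double temporal_ratio;  /* temporal hits / hits */
    double spatial_ratio;   /* spatial hits / hits */
} ratios_t;

ratios_t compute_ratios(unsigned long long temporal_hits,
                        unsigned long long spatial_hits,
                        unsigned long long misses)
{
    ratios_t r = { 0.0, -1.0, -1.0 };
    unsigned long long hits = temporal_hits + spatial_hits;
    unsigned long long accesses = hits + misses;

    if (accesses > 0)
        r.miss_ratio = (double)misses / (double)accesses;
    if (hits > 0) {
        r.temporal_ratio = (double)temporal_hits / (double)hits;
        r.spatial_ratio  = (double)spatial_hits / (double)hits;
    }
    return r;
}
```

Because hits are split into temporal and spatial hits, the temporal and spatial ratios sum to 1 whenever there is at least one hit.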
13 Test Kernel: Matrix Multiplication
Overall Performance
    reads           750000    writes         250000
    hits            738811    misses         261189
    miss ratio      0.261
    temporal hits   703930    spatial hits   34881
    temporal ratio  0.95      spatial ratio  0.04721
    spatial use     0.169
C Source code (MAT_DIM = 800, total samples registered: 1000000)
    60 for (i = 0; i < MAT_DIM; i++)
    61   for (j = 0; j < MAT_DIM; j++)
    62     for (k = 0; k < MAT_DIM; k++)
    63       x[i][j] = y[i][k] * z[k][j] + x[i][j];
Per Reference Information
    Line  Name       Hits      Miss Ratio  Temporal Ratio  Spatial Use  Evictors
    66    z_Read_1   0.00e+00  1.0         no hits         0.171        Z,Y,X
    66    y_Read_0   2.39e+05  0.044       0.854           0.129        Z
    66    x_Read_2   2.50e+05  0.0006      1.00            0.5          Z
    66    x_Write_3  2.50e+05  0.0         1.00            no evicts
- High miss ratio: more than 25% of accesses were misses.
- Low spatial use: references evicted before cache block fully referenced.
- z_Read_1 dominating => 100% misses
- Cause: iteration space layout
- z_Read_1 sole evictor (evicts itself 95% of the time, per the evictor table)
- Evictions => low spatial use for x, y and z loads
- Locality for z: interchange j, k loops.
- Temporal reuse for y and x: blocking (tiling).
14 Optimized Matrix Multiply
    81 for (jj = 0; jj < MAT_DIM; jj += ts)
    82   for (kk = 0; kk < MAT_DIM; kk += ts)
    83     for (i = 0; i < MAT_DIM; i++)
    84       for (k = kk; k < min(kk + ts, MAT_DIM); k++)
    85         for (j = jj; j < min(jj + ts, MAT_DIM); j++)
    86           x[i][j] = y[i][k] * z[k][j] + x[i][j];
tile size ts = 16
Overall performance (New / Old)
    hits            982128  / 738811
    misses          17872   / 261189
    miss ratio      0.017   / 0.261
    temporal hits   947173  / 703930
    spatial hits    34955   / 34881
    temporal ratio  0.96441 / 0.95
    spatial ratio   0.03559 / 0.04721
    spatial use     0.7039  / 0.169
Per Reference Information
    Name       Version  Hits      Misses    Miss Ratio  Temporal Ratio  Spatial Use
    z_Read_1   Old      0         2.50e+05  1.0         no hits         0.171
    z_Read_1   New      2.41e+05  8.79e+03  0.035       0.972           0.673
    y_Read_0   Old      2.39e+05  1.10e+04  0.044       0.854           0.129
    y_Read_0   New      2.41e+05  8.79e+03  0.035       0.896           0.732
    x_Read_2   Old      2.50e+05  1.57e+02  0.0006      1.00            0.5
    x_Read_2   New      2.50e+05  2.88e+02  0.001       0.99            0.861
    x_Write_3  Old      2.50e+05  0.00e+00  0.0         1.00            no evicts
    x_Write_3  New      2.50e+05  0         0.0         0.89            no evicts
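The tiled loop nest computes the same product as the original i-j-k nest, which can be checked with a small self-contained C sketch. The dimension `DIM` (deliberately not a multiple of the tile size), the helper `min_sz`, and the function names are assumptions for illustration; only the loop structure and the tile size of 16 come from the slides.

```c
#include <stddef.h>

#define DIM 37  /* small test size, not a multiple of TS */
#define TS  16  /* tile size, as on the slide */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Original loop order: i, j, k. */
void matmul_naive(double x[DIM][DIM], double y[DIM][DIM],
                  double z[DIM][DIM])
{
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            for (size_t k = 0; k < DIM; k++)
                x[i][j] = y[i][k] * z[k][j] + x[i][j];
}

/* Tiled version: j and k are blocked, and j is innermost so that
 * z[k][j] and x[i][j] are walked along rows. */
void matmul_tiled(double x[DIM][DIM], double y[DIM][DIM],
                  double z[DIM][DIM])
{
    for (size_t jj = 0; jj < DIM; jj += TS)
        for (size_t kk = 0; kk < DIM; kk += TS)
            for (size_t i = 0; i < DIM; i++)
                for (size_t k = kk; k < min_sz(kk + TS, DIM); k++)
                    for (size_t j = jj; j < min_sz(jj + TS, DIM); j++)
                        x[i][j] = y[i][k] * z[k][j] + x[i][j];
}
```

For every element x[i][j], both versions add the products in ascending k order, so the results agree exactly; only the order in which memory is touched changes.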
15 Another Example: ADI Integration
Overall Performance
    reads           800000    writes         200000
    hits            499499    misses         500501
    miss ratio      0.5
    temporal hits   351731    spatial hits   147768
    temporal ratio  0.704     spatial ratio  0.29583
    spatial use     0.2018
    16 for (k = 1; k < N; k++)
    17   for (i = 2; i < N; i++)
    18     x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
    22   for (i = 2; i < N; i++)
    23     b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
N = 800, accesses logged: 1000000
Per Reference Metrics
    Line  Name       Source_Ref  Hits      Misses    Miss Ratio  Temporal Ratio  Spatial Use
    18    x_Read_3   x[i][k]     0         1.00e+05  1.00        no hits         0.13
    18    a_Read_1   a[i][k]     0         1.00e+05  1.00        no hits         0.25
    18    b_Read_2   b[i-1][k]   0         1.00e+05  1.00        no hits         0.13
    23    b_Read_8   b[i][k]     0         9.98e+04  1.00        no hits         0.24
    23    a_Read_5   a[i][k]     0         9.98e+04  1.00        no hits         0.24
    18    x_Read_0   x[i-1][k]   1.00e+05  1.26e+02  0           1.0             0.25
    23    b_Read_7   b[i-1][k]   9.96e+04  1.25e+02  0           1.0             0.25
    18    x_Write_4  x[i][k]     1.00e+05  0.00e+00  0           0.50            no evicts
    23    b_Write_9  b[i][k]     9.98e+04  0.00e+00  0           0.27            no evicts
    23    a_Read_6   a[i][k]     9.98e+04  0.00e+00  0           0.74            no evicts
- High overall miss rate => poor locality
- Low overall spatial use
- First 5 references have 0 hits
- Pattern: references iterate over rows
- Spatial use values low => evictions
- Increase locality: top 5 references
- Increase spatial locality => interchange loops
16 Optimized ADI Integration
Overall performance (New / Old)
    hits            874600  / 499499
    misses          125400  / 500501
    miss ratio      0.125   / 0.5
    temporal hits   454867  / 351731
    spatial hits    419733  / 147768
    temporal ratio  0.52009 / 0.704
    spatial ratio   0.4799  / 0.29583
    spatial use     0.9628  / 0.2018
    14 for (i = 2; i < N; i++)
    15   for (k = 1; k < N; k++)
    16     x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
    17   for (k = 1; k < N; k++)
    18     b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
N = 800, accesses logged: 1000000
Per Reference Information (Old)
    Name      Hits  Misses    Miss Ratio  Temporal Ratio  Spatial Use
    x_Read_3  0     1.00e+05  1.0         no hits         0.13
    a_Read_1  0     1.00e+05  1.0         no hits         0.25
    b_Read_2  0     1.00e+05  1.0         no hits         0.13
    b_Read_8  0     9.98e+04  1.0         no hits         0.24
    a_Read_5  0     9.98e+04  1.0         no hits         0.24
- Significantly more hits
- Fewer evictions => higher spatial use
17 Summing Up ..
Process
- Use binary rewriting to instrument the target executable.
- Compress the generated trace online.
- Use the compressed trace for cache simulation.
Highlights
- Compiler-independent support.
- Useful for mixed-language applications.
- Partial data traces: targeted instrumentation.
- Efficient online trace compression.
- Enhanced user feedback: source-correlated statistics.
18 Thank You!
19 Future Work
Automatic Optimization
[Diagram: Controller, Executing binary, Text Section, CFG; actions: Attach, Extract, Inject Optimizations]
- Identify natural loops in the CFG.
- Attempt to identify data dependencies from the binary.
- Reconfigure the binary with optimizations, without violating data dependencies.
- Optimizations could include prefetching, tiling, loop fusion, loop interchange, etc.
20 Related Work
- SIGMA [Supercomputing '02]
  - Simulator Infrastructure to Guide Memory Analysis
  - Captures full address trace
  - No evictor information, weaker compression algorithm
- MTOOL [TPDS '93], CPROF [Computer '94]
  - Correlation to source line numbers only.
- PAPI, HPM
  - APIs to access hardware performance counters
21 The Compression Algorithm
- Targeted at regular array accesses in tightly nested loops.
    Won't work well: For( ) A BI 2.0
    Works well:      For( ) For( ..) For(.) A => constant-size compressed trace!
- Algorithm has growth rate O(n x w), where n = number of accesses and w = pool size (maximum number of accesses residing in memory for pattern matching).
- Tool structure is modular: it is possible to use some other algorithm better suited to the application domain.
22 Challenges
- Reverse-mapping of accesses to variable expressions in source
  - Currently limited to local and global variables only.
  - Difficult to support dynamically allocated objects, since the program counter might have passed the object allocation stage (malloc) by the time we attach to the application.
  - Reverse-engineering of access point --> source expression is difficult (e.g. AIj2Q1PR 2.0 maps to lots of machine instructions).
- Symbol table information must be present for effective user feedback (getting variable names, line numbers, etc.).
23 Memory Performance Metrics
[Diagram: Processor, L1 Cache, L2 Cache, Main Memory; relative access cycles shown: 2, 30, 5]
- A hit occurs when a layer contains the accessed element.
- A miss occurs when the requested element is absent from the layer.
- Misses are bad! They force the processor to stall until the data is fetched from the next layer.
- Fewer misses => faster performance.