Detecting Performance Bottlenecks Using Binary Rewriting

Description:

Temporal locality: the same cache block element is accessed repeatedly before the block is evicted. ... spatial hits = 34881, temporal ratio = 0.95, spatial ratio = 0. ...

Slides: 24

Provided by: yifa2

Learn more at: https://arcb.csc.ncsu.edu

Transcript and Presenter's Notes

1
Detecting Performance Bottlenecks Using Binary Rewriting
Jaydeep Marathe and Frank Mueller
North Carolina State University, Department of Computer Science
2
Why Are Memory Performance Bottlenecks a Problem?
[Figure: memory hierarchy with Processor, L1 Cache, L2 Cache, Main Memory]

Processor speeds are growing much faster than memory access speeds.
Application memory performance has an increasingly significant impact on overall performance.

3

Locality of Reference
Increase locality -> decrease misses!
4
How to Gauge Memory Performance?
One way ...

Drawbacks:
Tradeoff between accuracy and sampling overhead.
Fairly coarse statistics: overall hits, misses, etc.

Another way ...

Drawbacks:
High execution overhead due to logging.
The complete trace is huge: hundreds of MBs in size.

Need accurate metrics with minimum time and space overheads!
5
Detecting Bottlenecks Using Binary Rewriting
[Figure: framework overview. Labels: Controller Process, Instrument, Target Binary, Memory trace, Online Compression, Compressed trace, Trace file, Cache Simulator, Detailed Cache Statistics]

Binary rewriting to instrument the application binary.
Insert online compression routines to compress the generated trace.
Drive an incremental cache simulator with the compressed trace.
The simulator generates detailed cache metrics for user feedback.
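
As a rough illustration of the per-access bookkeeping behind such cache metrics, the following C sketch models a direct-mapped cache that classifies each access as a miss, a temporal hit, or a spatial hit, and tracks how much of each block was used before eviction. It is not the simulator used in this work. The block size, word size, set count, and all names (Cache, cache_access, spatial_use) are assumptions made for illustration, as is the classification rule: a hit on a word already touched since the block was loaded counts as temporal, and a hit on an untouched word of a resident block counts as spatial.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32u                      /* bytes per cache block (assumed) */
#define WORD_SIZE   8u                      /* element granularity (assumed) */
#define NUM_SETS   64u                      /* direct-mapped: one block per set (assumed) */
#define WORDS_PER_BLOCK (BLOCK_SIZE / WORD_SIZE)

typedef struct {
    int      valid[NUM_SETS];
    uint64_t block[NUM_SETS];               /* block number currently resident in the set */
    unsigned touched[NUM_SETS];             /* bitmask of words referenced since load */
    unsigned long misses, temporal_hits, spatial_hits;
    unsigned long evictions, words_used_at_eviction;
} Cache;

void cache_init(Cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Count the set bits in a touched-word mask. */
static unsigned count_bits(unsigned m)
{
    unsigned n = 0;
    while (m) { n += m & 1u; m >>= 1; }
    return n;
}

/* Simulate one memory access and update the counters. */
void cache_access(Cache *c, uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;
    unsigned set   = (unsigned)(block % NUM_SETS);
    unsigned word  = 1u << (unsigned)((addr % BLOCK_SIZE) / WORD_SIZE);

    if (c->valid[set] && c->block[set] == block) {
        if (c->touched[set] & word)
            c->temporal_hits++;             /* same element referenced again */
        else
            c->spatial_hits++;              /* different element, same block */
        c->touched[set] |= word;
        return;
    }
    c->misses++;
    if (c->valid[set]) {                    /* a resident block is being evicted */
        c->evictions++;
        c->words_used_at_eviction += count_bits(c->touched[set]);
    }
    c->valid[set]   = 1;
    c->block[set]   = block;
    c->touched[set] = word;
}

/* Average fraction of a block that was referenced before it was evicted. */
double spatial_use(const Cache *c)
{
    if (c->evictions == 0)
        return 0.0;
    return (double)c->words_used_at_eviction /
           ((double)c->evictions * WORDS_PER_BLOCK);
}
```

Feeding the address sequence 0, 8, 0, 2048, 0 through cache_access yields three misses, one spatial hit, and one temporal hit under these assumptions, because address 2048 maps to the same set as address 0 and evicts it.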

6
Advantages
[Figure: same framework overview as on slide 5]

Selective instrumentation of parts of the target binary.
Partial data traces instead of complete traces.
Online compression reduces trace storage requirements.
Statistics are correlated to source data structures.

7
Instrumenting Target Binary
[Figure labels: CFG, Machine code, Mutator (controller), DynInst, Target Binary]

Extended a portable binary manipulation framework (DynInst, U. Maryland).

8
Compressing Generated Trace

The generated trace potentially contains millions of accesses!
Solution: detect regular patterns in the trace for effective compression.
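
One simple kind of regular pattern is a run of accesses with a constant address stride. The C sketch below greedily splits an address trace into such runs. It only illustrates the idea and is not the compression routine used in this work; the StrideRun type and the compress_trace function are hypothetical names, and the sketch ignores interleaved patterns and access types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one constant-stride run of addresses. */
typedef struct {
    uint64_t start;   /* first address of the run */
    int64_t  stride;  /* address difference between consecutive accesses */
    size_t   length;  /* number of accesses covered by the run */
} StrideRun;

/* Greedily split trace[0..n) into constant-stride runs.
 * Stores at most max_runs descriptors in out and returns the number of
 * runs the whole trace needs (which may exceed max_runs). */
size_t compress_trace(const uint64_t *trace, size_t n,
                      StrideRun *out, size_t max_runs)
{
    size_t runs = 0;
    size_t i = 0;

    assert(out != NULL || max_runs == 0);
    while (i < n) {
        StrideRun r = { trace[i], 0, 1 };
        if (i + 1 < n) {
            r.stride = (int64_t)(trace[i + 1] - trace[i]);
            r.length = 2;
            /* Extend the run while the stride stays constant. */
            while (i + r.length < n &&
                   (int64_t)(trace[i + r.length] - trace[i + r.length - 1]) == r.stride)
                r.length++;
        }
        if (runs < max_runs)
            out[runs] = r;
        runs++;
        i += r.length;
    }
    return runs;
}
```

For the seven-access trace 100, 104, 108, 112, 500, 508, 516 this produces two descriptors: (start 100, stride 4, length 4) and (start 500, stride 8, length 3).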

9
An RSD Example
10
Power Regular Section Descriptors (PRSDs)

RSDs are not powerful enough to compress the address stream efficiently.
Solution: nest RSDs to create a Power Regular Section Descriptor (PRSD).

PRSD: <
  base_addr:        first address generated by the PRSD.
  base_addr_shift:  stride of base_addr between PRSD iterations.
  base_seq:         starting position of this pattern in the trace.
  base_seq_shift:   interleave distance between PRSD iterations.
  length:           PRSD length.
  child PRSD/RSD:   nested PRSD/RSD.
>
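
To make the nesting concrete, here is a minimal C sketch that expands a descriptor tree back into an address trace. The field names follow the PRSD tuple above; everything else is an assumption for illustration: a leaf (child == NULL) stands in for an RSD that emits one access per iteration, a nested child is positioned by its parent (its own base_addr and base_seq are ignored), and the Desc type and expand function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Desc Desc;
struct Desc {
    uint64_t    base_addr;        /* first address generated */
    int64_t     base_addr_shift;  /* address stride between iterations */
    uint64_t    base_seq;         /* starting position in the trace */
    uint64_t    base_seq_shift;   /* trace-position distance between iterations */
    size_t      length;           /* number of iterations */
    const Desc *child;            /* nested descriptor, or NULL for a leaf */
};

/* Expand descriptor d as if it started at (addr, seq). */
static void expand_at(const Desc *d, uint64_t addr, uint64_t seq,
                      uint64_t *trace, size_t trace_len)
{
    for (size_t i = 0; i < d->length; i++) {
        uint64_t a = addr + (uint64_t)((int64_t)i * d->base_addr_shift);
        uint64_t s = seq + i * d->base_seq_shift;
        if (d->child)
            expand_at(d->child, a, s, trace, trace_len);  /* relocate the child */
        else if (s < trace_len)
            trace[s] = a;                                 /* leaf: one access */
    }
}

/* Write the addresses described by d into trace[0..trace_len). */
void expand(const Desc *d, uint64_t *trace, size_t trace_len)
{
    assert(d != NULL && trace != NULL);
    expand_at(d, d->base_addr, d->base_seq, trace, trace_len);
}
```

For example, a column-wise walk over a 3 x 4 array of 8-byte elements at address 0x1000 can be described by a leaf with address stride 32 and length 3, nested in a descriptor with base_addr 0x1000, base_addr_shift 8, base_seq_shift 3 and length 4. Two small descriptors then stand for all 12 accesses.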

11
Incremental Cache Simulation
[Figure labels: Address Trace, Compressed Trace, Cache Simulator, Report File (Detailed Cache Statistics), Scopes File (Scope Structure of Target), Variables File (Base Addresses of Variables in Target)]

Incremental cache simulation (modified MHSim, Rice U.).
Correlate:
  Trace addresses <---> variable names
  Access points (LD / ST instructions) <---> line numbers in source
Metrics are reported per access point and also aggregated by the scope structure of the target.
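
The address-to-variable correlation can be pictured as a range lookup over the variables' base addresses. The C sketch below assumes a table of (name, base address, size) entries; the slides only mention base addresses, so the size field, the VarEntry type, and the lookup_variable function are hypothetical and serve only to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of a variables table. */
typedef struct {
    const char *name;  /* source-level variable name */
    uint64_t    base;  /* base address of the variable in the target */
    uint64_t    size;  /* size in bytes (assumed to be known) */
} VarEntry;

/* Return the entry whose [base, base + size) range contains addr,
 * or NULL if the address belongs to no known variable. */
const VarEntry *lookup_variable(const VarEntry *vars, size_t n, uint64_t addr)
{
    assert(vars != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= vars[i].base && addr - vars[i].base < vars[i].size)
            return &vars[i];
    }
    return NULL;
}
```

A linear scan is enough for a handful of arrays; with many variables, a table sorted by base address and a binary search would be the more natural choice.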

12
Cache Metrics Per Access Point (Report File)

Metric              Definition / what it tells us
Miss ratio          Coarse indicator of performance
Temporal ratio      Relative degree of temporal locality
Spatial ratio       Relative degree of spatial locality
Spatial use         Cache block fraction used before eviction (access efficiency)
Evictor references  List of evictors: conflicting variables (useful!)
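
The slides do not spell out formulas for the ratios. The C sketch below uses definitions inferred from the overall counts reported for the matrix multiplication kernel on the next slide: the miss ratio is misses over all accesses, and the temporal and spatial ratios are the shares of temporal and spatial hits among all hits (for instance, 34881 / 738811 is about 0.04721). Treat these definitions, and the CacheCounts, CacheRatios and compute_ratios names, as assumptions.

```c
#include <assert.h>

typedef struct {
    unsigned long hits, misses, temporal_hits, spatial_hits;
} CacheCounts;

typedef struct {
    double miss_ratio, temporal_ratio, spatial_ratio;
} CacheRatios;

/* Derive the ratios from raw counts; a ratio stays 0 if its denominator is 0. */
CacheRatios compute_ratios(CacheCounts c)
{
    CacheRatios r = { 0.0, 0.0, 0.0 };
    unsigned long accesses = c.hits + c.misses;

    assert(c.temporal_hits + c.spatial_hits <= c.hits);
    if (accesses > 0)
        r.miss_ratio = (double)c.misses / (double)accesses;
    if (c.hits > 0) {
        r.temporal_ratio = (double)c.temporal_hits / (double)c.hits;
        r.spatial_ratio  = (double)c.spatial_hits / (double)c.hits;
    }
    return r;
}
```

With hits = 738811, misses = 261189, temporal hits = 703930 and spatial hits = 34881, this gives a miss ratio of about 0.261, a temporal ratio of about 0.953 and a spatial ratio of about 0.0472, in line with the reported values.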
13
Test Kernel: Matrix Multiplication

Overall Performance
reads            750000
writes           250000
hits             738811
misses           261189
miss ratio       0.261
temporal hits    703930
spatial hits     34881
temporal ratio   0.95
spatial ratio    0.04721
spatial use      0.169

C source code (MAT_DIM = 800; total samples registered: 1000000)
60 for (i = 0; i < MAT_DIM; i++)
61   for (j = 0; j < MAT_DIM; j++)
62     for (k = 0; k < MAT_DIM; k++)
63       x[i][j] = y[i][k] * z[k][j] + x[i][j];

Per Reference Information

Line  Name       Hits      Miss Ratio  Temporal Ratio  Spatial Use  Evictors
66    z_Read_1   0.00e+00  1.0         no hits         0.171        Z,Y,X
66    y_Read_0   2.39e+05  0.044       0.854           0.129        Z
66    x_Read_2   2.50e+05  0.0006      1.00            0.5          Z
66    x_Write_3  2.50e+05  0.0         1.00            no evicts

High miss ratio: more than 25% of accesses were misses.
Low spatial use: references are evicted before the cache block is fully referenced.
z_Read_1 dominates: 100% misses. Cause: iteration space layout.
z_Read_1 is the sole evictor (it evicts itself 95% of the time; see the evictor table).
Evictions -> low spatial use for the x, y and z loads.

Locality for z: interchange the j and k loops.
Temporal reuse for y and x: blocking (tiling).
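
The loop interchange suggested for z can be sketched in isolation. The C code below shows the kernel in its original i-j-k order and in the interchanged i-k-j order, where the innermost loop walks z[k][j] along a row instead of down a column. DIM is a small hypothetical size chosen for a quick check (the slides use MAT_DIM = 800), and the function names are illustrative. Both versions add the products for each x[i][j] in the same order of k, so they compute identical results.

```c
#include <assert.h>

#define DIM 8  /* small hypothetical size; the slides use MAT_DIM = 800 */

/* Original loop order: the innermost k loop strides down a column of z. */
void matmul_ijk(double x[DIM][DIM], double y[DIM][DIM], double z[DIM][DIM])
{
    assert(x != y && x != z);  /* the output must not alias an input */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            for (int k = 0; k < DIM; k++)
                x[i][j] = y[i][k] * z[k][j] + x[i][j];
}

/* Interchanged j and k loops: the innermost j loop walks a row of z,
 * so consecutive accesses to z fall into the same cache block. */
void matmul_ikj(double x[DIM][DIM], double y[DIM][DIM], double z[DIM][DIM])
{
    assert(x != y && x != z);
    for (int i = 0; i < DIM; i++)
        for (int k = 0; k < DIM; k++)
            for (int j = 0; j < DIM; j++)
                x[i][j] = y[i][k] * z[k][j] + x[i][j];
}
```

Interchange alone addresses the spatial locality of z; the temporal reuse of y and x is what blocking (tiling) targets.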

14
Optimized Matrix Multiply

C source code (tile size ts = 16)
81 for (jj = 0; jj < MAT_DIM; jj += ts)
82   for (kk = 0; kk < MAT_DIM; kk += ts)
83     for (i = 0; i < MAT_DIM; i++)
84       for (k = kk; k < min(kk+ts, MAT_DIM); k++)
85         for (j = jj; j < min(jj+ts, MAT_DIM); j++)
86           x[i][j] = y[i][k] * z[k][j] + x[i][j];

Overall performance (New / Old)
hits             982128 / 738811
misses           17872 / 261189
miss ratio       0.017 / 0.261
temporal hits    947173 / 703930
spatial hits     34955 / 34881
temporal ratio   0.96441 / 0.95
spatial ratio    0.03559 / 0.04721
spatial use      0.7039 / 0.169
Per Reference Information (Old / New)

Name       Hits                 Misses               Miss Ratio      Temporal Ratio   Spatial Use
z_Read_1   0 / 2.41e+05         2.50e+05 / 8.79e+03  1.0 / 0.035     no hits / 0.972  0.171 / 0.673
y_Read_0   2.39e+05 / 2.41e+05  1.10e+04 / 8.79e+03  0.044 / 0.035   0.854 / 0.896    0.129 / 0.732
x_Read_2   2.50e+05 / 2.50e+05  1.57e+02 / 2.88e+02  0.0006 / 0.001  1.00 / 0.99      0.5 / 0.861
x_Write_3  2.50e+05 / 2.50e+05  0.00e+00 / 0         0.0 / 0.0       1.00 / 0.89      no evicts / no evicts
15
Another Example: ADI Integration

Overall Performance
reads            800000
writes           200000
hits             499499
misses           500501
miss ratio       0.5
temporal hits    351731
spatial hits     147768
temporal ratio   0.704
spatial ratio    0.29583
spatial use      0.2018

C source code (N = 800; accesses logged: 1000000)
16 for (k = 1; k < N; k++)
17   for (i = 2; i < N; i++)
18     x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
22   for (i = 2; i < N; i++)
23     b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];

Per Reference Metrics

Line  Name       Source_Ref  Hits      Misses    Miss Ratio  Temporal Ratio  Spatial Use
18    x_Read_3   x[i][k]     0         1.00e+05  1.00        no hits         0.13
18    a_Read_1   a[i][k]     0         1.00e+05  1.00        no hits         0.25
18    b_Read_2   b[i-1][k]   0         1.00e+05  1.00        no hits         0.13
23    b_Read_8   b[i][k]     0         9.98e+04  1.00        no hits         0.24
23    a_Read_5   a[i][k]     0         9.98e+04  1.00        no hits         0.24
18    x_Read_0   x[i-1][k]   1.00e+05  1.26e+02  0           1.0             0.25
23    b_Read_7   b[i-1][k]   9.96e+04  1.25e+02  0           1.0             0.25
18    x_Write_4  x[i][k]     1.00e+05  0.00e+00  0           0.50            no evicts
23    b_Write_9  b[i][k]     9.98e+04  0.00e+00  0           0.27            no evicts
23    a_Read_6   a[i][k]     9.98e+04  0.00e+00  0           0.74            no evicts

High overall miss rate -> poor locality.
Low overall spatial use.
The first 5 references have 0 hits.
Pattern: the references iterate over rows.
Spatial use values are low -> evictions.

Increase locality for the top 5 references.
Increase spatial locality -> interchange loops.

16
Optimized ADI Integration

Overall performance (New / Old)
hits             874600 / 499499
misses           125400 / 500501
miss ratio       0.125 / 0.5
temporal hits    454867 / 351731
spatial hits     419733 / 147768
temporal ratio   0.52009 / 0.704
spatial ratio    0.4799 / 0.29583
spatial use      0.9628 / 0.2018

C source code (N = 800; accesses logged: 1000000)
14 for (i = 2; i < N; i++)
15   for (k = 1; k < N; k++)
16     x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
17   for (k = 1; k < N; k++)
18     b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
Per Reference Information (Old)

Name      Hits  Misses    Miss Ratio  Temporal Ratio  Spatial Use
x_Read_3  0     1.00e+05  1.0         No hits         0.13
a_Read_1  0     1.00e+05  1.0         No hits         0.25
b_Read_2  0     1.00e+05  1.0         No hits         0.13
b_Read_8  0     9.98e+04  1.0         No hits         0.24
a_Read_5  0     9.98e+04  1.0         No hits         0.24