Automatic Measurement of Instruction Cache Capacity in X-Ray

Department of Computer Science, Cornell University. QEST'05. 10/9/09.

Slides: 26

Provided by: kamen2

Learn more at: https://www.ece.lsu.edu

Transcript and Presenter's Notes

1
Automatic Measurement of Instruction Cache
Capacity in X-Ray

Kamen Yotov
kyotov_at_us.ibm.com
IBM T. J. Watson Research Center
Joint work with
Tyler Steele, Sandra Jackson,
Keshav Pingali, Paul Stodghill
Department of Computer Science
Cornell University

2
Motivation: self-optimizing software

Goal: portable performance
Self-optimizing software
Generates code with parameters whose optimal
values depend on the platform (hardware / OS /
compiler)
Determines optimal parameter values
experimentally
Uses native C compiler to produce library
Examples: ATLAS, FFTW, SPIRAL, ...

3
Example: Register Blocking for MMM

Hardware parameters
Number of FP registers (NR)
I-Cache Capacity (ICC)
A simple model for the register tile size for
MMM
[Yotov et al., IEEE '05]
MU x NU + MU + NU + Temp <= NR
KU (unroll of K loop)
does not depend on NR
depends on ICC
Need to know NR and ICC!
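
To illustrate how NR feeds into the tile choice, here is a minimal sketch that searches for a register tile. It assumes the model reads MU x NU + MU + NU + Temp <= NR and that the goal is to maximize MU x NU; the maximization rule, the `RegTile` type, and the function name are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>

/* Hypothetical result type; not part of X-Ray or ATLAS. */
typedef struct {
    int mu;
    int nu;
} RegTile;

/*
 * Exhaustively search for the tile (MU, NU) that maximizes MU * NU
 * subject to the assumed constraint
 *     MU * NU + MU + NU + temp <= nr.
 * Ties keep the first tile found. Returns {0, 0} if no tile fits.
 */
RegTile choose_register_tile(int nr, int temp)
{
    RegTile best = {0, 0};
    int mu, nu;

    assert(nr >= 0 && temp >= 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            int used = mu * nu + mu + nu + temp;
            if (used <= nr && mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

With `nr = 16` and `temp = 1` this search settles on a 3 x 3 tile, which uses 9 + 3 + 3 + 1 = 16 registers.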

4
Why not consult the manuals?

Self-optimizing systems
Require online manuals
Actual hardware values vs. values available for
optimization
For software optimization, hardware values may
not be relevant
E.g., the number of hardware registers may not
equal the number of registers available for
holding program values (register 0 on SPARC)
Incomplete
Parameters like capacity and line size of
off-chip caches vary from model to model
Even the same model of computer may be shipped
with different cache organizations
Not usually documented in processor manuals
Moving Target

5
Automatic Measurement Tools

lmbench
OS benchmark, some CPU / Memory benchmarks
Larry McVoy, BitMover, Inc.
Carl Staelin, HP
Calibrator
Memory hierarchy benchmark
Stefan Manegold
Centrum voor Wiskunde en Informatica
MOB
Memory hierarchy benchmark
Josep Blanquer, Robert Chalmers
University of California Santa Barbara

6
X-Ray

Set of micro-benchmarks in ANSI C89
Download and compile on any architecture
(portable)
Deduce hardware parameter values from timing
results
Some amount of O/S specific code
High-resolution timing routines
Super-page allocation
Currently supports Linux
Windows, Solaris, IRIX, and AIX in the works
Paradox
Compiler optimizations may contaminate timing
results
Cannot afford to turn off all optimizations

7
Example: Latency of Integer ADD (Step by Step)

t = gettime();
r1 += r2;
return gettime() - t;

Problem: hard to measure small time intervals
accurately
8
Step by Step (cont.)

t = gettime();
while (--R) // R is number of repetitions
    r1 += r2;
return gettime() - t;

Problem: loop overhead
9
Step by Step (cont.)

t = gettime();
i = R / U;
while (--i) { // loop unrolled U times
    r1 += r2;
    r1 += r2;
    ........
    r1 += r2;
}
return gettime() - t;

Problem: compiler optimizations
10
Step by Step (cont.)

t = gettime();
i = R / U;
switch (v) {
case 0: loop:
case 1: r1 += r2;
case 2: r1 += r2;
.................
case U: r1 += r2;
    if (--i)
        goto loop;
}
if (!v) return gettime() - t; else use(r1, r2);

Solution: volatile int v = 0
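
The step-by-step construction can be put together as a compilable sketch. This is not the code X-Ray generates: the unroll factor of 8, the names `add_latency_ns`, `gettime_ns`, `vr`, and `sink`, and the use of the standard `clock()` in place of an OS-specific high-resolution timer are all assumptions made for illustration. A sufficiently aggressive compiler may still transform the loop, so the result should be treated as indicative only.

```c
#include <assert.h>
#include <time.h>

#define U 8 /* unroll factor; must match the number of ADD lines below */

static volatile int v = 0;       /* always 0, but the compiler cannot assume so */
static volatile unsigned vr = 1; /* opaque initial value for r1 and r2 */
static volatile unsigned sink;   /* keeps the result live, like use(r1, r2) */

/* CPU time in nanoseconds; coarse, but portable standard C. */
static double gettime_ns(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC * 1e9;
}

/* Estimated nanoseconds per dependent integer ADD, over roughly `reps` ADDs. */
double add_latency_ns(long reps)
{
    unsigned r1 = vr, r2 = vr; /* unsigned: wraparound is well defined */
    long i = reps / U;
    double t;

    assert(reps >= U);
    t = gettime_ns();
    switch (v) { /* every ADD is a possible entry point, as on the slide */
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i)
            goto loop;
    }
    t = gettime_ns() - t;
    sink = r1 + r2; /* consume the registers so the ADDs are not dead code */
    return t / (double)((reps / U) * U);
}
```

Calling `add_latency_ns(80000000L)` and multiplying by the clock frequency in GHz gives an estimate of the ADD latency in cycles, subject to the timer resolution.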
11
Latency of integer ADD: nano-benchmark C code

Want to measure
r1 += r2
Generate C code from specification
<r1 += r2, <r1, r2: int>>

volatile int v = 0;
volatile int vr = 0;
register int r1 = vr;
register int r2 = vr;
t = gettime();
i = R / U;
switch (v) {
case 0: loop:
case 1: r1 += r2;
case 2: r1 += r2;
.................
case U: r1 += r2;
    if (--i)
        goto loop;
}
if (!v)
    return gettime() - t;
else

12
X-Ray architecture
13
Instruction Throughput

Specification

Control Engine

N = 3, B = 1
14
Micro-benchmarks in X-Ray

CPU
Frequency
Instruction Latency
Instruction Throughput
Instruction Existence
FPU on embedded processors
FMA on general purpose processors
SMP and SMT
Memory Hierarchy
Number of Registers of various types (int, float,
SSE, ...)
Multilevel Caches, TLB
Associativity
Block Size
Capacity
Latency
Instruction Cache Capacity

15
Previous Approaches for Memory Hierarchy
Parameters

Saavedra Benchmark (Hennessy-Patterson)
Accesses elements of an array a constant stride
apart
Measures average memory access time
Deficiencies
Considers all levels simultaneously
Works only for capacities that are powers-of-2
Suffers from a number of implementation level
deficiencies
Constant stride accesses
Loop overhead problems
Overlapping memory operations
Prone to compiler optimizations
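
For reference, a stride benchmark in this style can be sketched in a few lines. The version below is a simplified illustration, not the Saavedra code itself; the function name, the byte-sized elements, and the use of `clock()` are assumptions. It deliberately shares the deficiencies listed above: the loads are independent and can overlap, loop overhead is included in the timing, and the compiler is free to transform the loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

static volatile unsigned char stride_sink; /* keeps the loads from being dropped */

/*
 * Average CPU time in nanoseconds per access when touching every
 * `stride`-th byte of a `size`-byte array, repeated `passes` times.
 * Returns -1.0 if the array cannot be allocated.
 */
double strided_access_ns(size_t size, size_t stride, int passes)
{
    unsigned char *a;
    unsigned char sum = 0;
    size_t i, accesses = 0;
    int p;
    clock_t t;

    assert(size > 0 && stride > 0 && passes > 0);
    a = malloc(size);
    if (a == NULL)
        return -1.0;
    for (i = 0; i < size; i++) /* touch every byte once before timing */
        a[i] = (unsigned char)i;

    t = clock();
    for (p = 0; p < passes; p++) {
        for (i = 0; i < size; i += stride) {
            sum += a[i];
            accesses++;
        }
    }
    t = clock() - t;

    stride_sink = sum;
    free(a);
    return (double)t / (double)CLOCKS_PER_SEC * 1e9 / (double)accesses;
}
```

Plotting the result for a range of `size` and `stride` values gives the familiar staircase curves, in which all cache levels show up at once.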

16
Example: Isolation of lower cache levels

Idea for Ln measurements
Use sequences as for L1 measurements
Make L1..Ln-1 transparent to measurements
Unique in isolating the behavior of Ln so that
all higher levels miss
Approach
Use sequences of sequences
Convolution of sequences

17
Measuring I-Cache Capacity

Approach for Data Cache does not work
Array of pointers -> Code sequence with branches
Such branches are very predictable
Nearly impossible to get precise timing
Measure time to execute special code sequence of
size N statements
Find the biggest N for which there is no
significant increase in time per statement
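
The last step can be sketched as a scan over already-collected measurements. The function name and the relative tolerance `tol` (what counts as a "significant" increase) are assumptions for illustration; the slides do not give the criterion at this point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * n[]  : code lengths in statements, in ascending order
 * t[]  : measured time per statement for each length
 * tol  : hypothetical relative tolerance, e.g. 0.10 for 10%
 *
 * Returns the largest n[i] reached before the first measurement whose
 * time per statement exceeds the first measurement by more than tol.
 */
size_t largest_flat_n(const size_t *n, const double *t, size_t count, double tol)
{
    size_t i, best;
    double limit;

    assert(count > 0 && tol >= 0.0);
    limit = t[0] * (1.0 + tol);
    best = n[0];
    for (i = 1; i < count; i++) {
        if (t[i] > limit)
            break; /* first significant increase: stop here */
        best = n[i];
    }
    return best;
}
```

A fixed tolerance like this is fragile when the measurements oscillate, so it is best read as a baseline rather than a robust detector.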

18
Nano-benchmark

Similar to Instruction Throughput
Parameters (1, 4)
Grow length N
Code size computed
(char *)finish - (char *)start

19
Sensitivity

Graph for Pentium M
9 more in the paper
Performance oscillates
Even after averaging out noise
Cannot wait for jump
Need more robust measurement

20
Control Engine Script

Start with N = 256
Compute
Mean
Standard deviation
For
Binary-search
Detect jump when time is more than
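
A search of this shape could look like the sketch below. The transcript has lost the actual threshold and loop bounds, so everything specific here is an assumption: the baseline is taken from `SAMPLES` measurements near the starting length, a jump is declared when a measurement lies more than `k` standard deviations above the baseline mean, and N is doubled until a jump appears before binary search narrows it down. `fake_measure` is a synthetic stand-in for a real nano-benchmark run, with a pretend capacity of 3000 statements.

```c
#include <assert.h>
#include <stddef.h>

#define SAMPLES 8 /* hypothetical number of baseline measurements */

/* Hypothetical callback: time per statement for a code sequence of n statements. */
typedef double (*measure_fn)(size_t n);

/* Synthetic model: flat cost with small ripple up to 3000 statements, then a jump. */
double fake_measure(size_t n)
{
    return n <= 3000 ? 1.0 + 0.01 * (double)(n % 3) : 2.0;
}

/* True if t is more than k standard deviations above the mean. */
static int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

/* Largest N in [start, max_n] with no detected jump, or 0 if no jump is seen. */
size_t find_capacity(measure_fn measure, size_t start, size_t max_n, double k)
{
    double s[SAMPLES], mean = 0.0, var = 0.0;
    size_t i, lo, hi;

    assert(start > 0 && start < max_n);
    for (i = 0; i < SAMPLES; i++) { /* baseline mean */
        s[i] = measure(start + i);
        mean += s[i];
    }
    mean /= SAMPLES;
    for (i = 0; i < SAMPLES; i++) /* baseline variance */
        var += (s[i] - mean) * (s[i] - mean);
    var /= SAMPLES;

    lo = start;
    hi = start * 2;
    while (hi <= max_n && !is_jump(measure(hi), mean, var, k)) {
        lo = hi; /* still flat: keep doubling */
        hi *= 2;
    }
    if (hi > max_n)
        return 0;

    while (hi - lo > 1) { /* invariant: lo is flat, hi shows the jump */
        size_t mid = lo + (hi - lo) / 2;
        if (is_jump(measure(mid), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

On the synthetic model, `find_capacity(fake_measure, 256, 1u << 20, 3.0)` recovers the pretend capacity of 3000 statements. The result is a statement count; converting it to bytes still requires the measured code size.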

21
Experimental Results
22
Pentium 4

Does not cache ISA instructions, but uops
Trace cache
Measure the number of instructions
Smoothing in the nano-benchmark: minimum of time
in

23
Conclusions

X-Ray: a framework and tool
First to measure instruction cache capacity
Algorithms for precise measurements of some
important hardware parameters
Experimental results on many modern architectures
Other X-Ray resources
Memory hierarchy parameter measurement appeared
at SIGMETRICS'05
CPU parameter measurement appeared at QEST'05
Improving X-Ray is work in progress

24
Current and Future Work

2-address vs. 3-address code
Out-of-Order execution
Number of physical registers
Number / type of functional units
Cache
bandwidth
write mode
sharedness
replacement policy

25
Thank you!

My E-Mail
kamen_at_yotov.org
kyotov_at_us.ibm.com
Cornell Group homepage
http://iss.cs.cornell.edu
This work emerged from a joint project with David
Padua's group at UIUC
http://polaris.cs.uiuc.edu/newframework.html
Download X-Ray!
http://iss.cs.cornell.edu/software/x-ray.aspx
