Title: Principles of High Performance Computing ICS 632
1. Principles of High Performance Computing (ICS 632)
Performance of Sequential Programs
2. Performance
- We will mostly talk about how to make code go fast, hence the "high performance"
- Performance conflicts with other concerns:
  - Correctness: you will see that when trying to make code go fast, one often breaks it
  - Readability: fast code typically requires more lines!
  - Modularity: modularity can hurt performance (e.g., virtual classes)
  - Portability: code that is fast on machine A can be slow on machine B; at the extreme, highly optimized code is not portable at all, and is in fact done in hardware!
3. Why Performance?
- To do a time-consuming operation in less time
  - I am an aircraft engineer
  - I need to run a simulation to test the stability of the wings at high aircraft velocity
  - I'd rather have the result in 5 minutes than in 5 hours, so that I can complete the final aircraft design sooner
- To do an operation before a tight deadline
  - I am a weather prediction agency
  - I am getting input from weather stations/sensors
  - I'd like to make the forecast for tomorrow before tomorrow
- To do a high number of operations per second
  - I am the CTO of Amazon.com
  - My Web server gets 1,000 hits per second
  - I'd like my Web server and my databases to handle 1,000 transactions per second to reduce customer delay
  - Amazon does process several GBytes of data per second
4. How to Improve Performance?
- Option 1: Buy faster hardware
  - Only gets you so far for so long
  - Sometimes the amount of hardware to buy would be staggering, and one can't just wait for technology improvements and price drops
  - Better to achieve the same effect by modifying the code a little bit
5. How to Improve Performance?
- Option 2: Modify the algorithm
- Example: searching for an element in a sorted array
  - First implementation: a linear search
    - Easy to write at first
    - Does the job
  - When performance becomes an issue, replace the linear search by a binary search (sketched below)
    - More complex code
    - Goes much faster for large arrays
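To make the contrast concrete, here is a minimal C sketch of both searches over a sorted array of ints (the function names are illustrative, not from the slides):

    /* Linear search: easy to write, O(n) comparisons */
    int linear_search(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key) return i;
        return -1;
    }

    /* Binary search: more complex, O(log n) comparisons on a sorted array */
    int binary_search(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   /* avoids overflow of (lo + hi) / 2 */
            if (a[mid] == key) return mid;
            else if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;   /* not found */
    }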
6. How to Improve Performance?
- Option 3: Modify the data structures
- Example: a linked list
  - The list.length() method computes the length by going through the list and incrementing a counter
  - If users call the method often and/or the list is long, this can cause significant overhead
  - Instead, add a length attribute to the list class, and increment/decrement it on each insertion and removal
  - The new list.length() method just returns the length attribute
  - This vastly speeds up list.length(), minimally slows down list.insert() and list.remove(), and minimally increases memory consumption (by 4 bytes); see the sketch below
- Example: replace a list by a heap
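A minimal C sketch of the cached-length idea (the struct and function names are illustrative):

    #include <stdlib.h>

    typedef struct node { int value; struct node *next; } node_t;
    typedef struct list {
        node_t *head;
        int length;                /* cached length, kept up to date */
    } list_t;

    void list_insert_front(list_t *l, int v) {
        node_t *n = malloc(sizeof *n);
        n->value = v;
        n->next = l->head;
        l->head = n;
        l->length++;               /* O(1) bookkeeping on insertion */
    }

    int list_length(const list_t *l) {
        return l->length;          /* O(1), instead of an O(n) traversal */
    }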
7. How to Improve Performance?
- Option 4: Modify the implementation
- Do not change the spirit of the algorithm, but...
  - Shuffle lines of code around
    - to do instructions in a different order
    - to remove optimization blockers
  - Modify code organization
    - e.g., remove classes
    - e.g., modify data structures
  - etc.
8. How to Improve Performance?
- Option 5: Use concurrency
  - Multi-threaded code on a single-CPU machine, to utilize hardware resources more effectively
  - Multi-threaded code on a multi-CPU/multi-core machine
9. Performance as Time
- Time between the start and the end of an operation
  - Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ...
- Most straightforward measure: "my program takes 12.5s on a Pentium 3.5GHz"
- Can be normalized to some reference time
- Must be measured on a dedicated machine
10. Performance as Rate
- Rates are used often so that performance can be independent of the size of the application
  - e.g., compressing a 1MB file takes 1 minute and compressing a 2MB file takes 2 minutes: the performance is the same
- Millions of instructions per second (MIPS)
  - MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
  - But Instruction Set Architectures are not equivalent
    - 1 CISC instruction = many RISC instructions
    - Programs use different instruction mixes
  - May be OK for comparing runs of the same program on the same architecture
11. Performance as Rate
- Millions of floating point operations per second (MFlops)
  - Very popular, but often misleading
  - e.g., a high MFlops rate in a stupid algorithm can go with poor application performance
- Application-specific rates
  - Millions of frames rendered per second
  - Millions of amino-acids compared per second
  - Millions of HTTP requests served per second
  - Application-specific metrics are often preferable, while generic ones may be misleading
- MFlops can be application-specific too, though
  - For instance: I want to add two n-element vectors
  - This requires n floating point operations
  - Therefore MFlops is a good measure here
12. Measuring Performance Rates
- How do we measure performance rates?
  - Time a section of code
  - Count how many items are done in that section of the code
  - Compute the rate as the number of items divided by the measured time
- Example (see the sketch below):

    start_stopwatch();
    for (i = 0; i < 1000000; i++)
      x = y * z + a;
    stop_stopwatch();

  - Number of Flops: 2,000,000 (1,000,000 additions, 1,000,000 multiplications)
  - MFlops rate: 2 / time (with time measured in seconds)
13. Peak Performance?
- Resource vendors always talk about the peak performance rate
  - Computed based on the specifications of the machine
- For instance:
  - I build a machine with 2 floating point units
  - Each unit can do an operation in 2 cycles
  - My CPU runs at 1GHz
  - Therefore I have a 1GHz × 2 units / 2 cycles = 1 GFlops machine
- Problem:
  - In real code you will never be able to use the two floating point units constantly
  - Data needs to come from memory and causes the floating point units to be idle
- Typically, real code achieves only an (often small) fraction of the peak performance
14. Benchmarks
- Since many performance metrics turn out to be misleading, people have designed benchmarks
- Example: the SPEC benchmarks
  - Integer benchmark
  - Floating point benchmark
- These benchmarks are typically a collection of several codes that come from real-world software
- The question "what is a good benchmark?" is difficult
  - If the benchmarks do not correspond to what you'll do with the computer, then the benchmark results are not relevant to you
15. How About GHz?
- This is often the way in which people say that one computer is better than another
  - More instructions per second for a higher clock rate
- Faces the same problems as MIPS
- But usable within a specific architecture
16. Program Performance
- In this class we're not really concerned with determining the performance of a compute platform (however it is defined)
- Instead we're concerned with improving a program's performance
- For a given platform, take a given program
  - Run it and measure its wall-clock time
  - Enhance it, run it again, and quantify the performance improvement, i.e., the reduction in wall-clock time
  - For each version compute its performance
    - preferably as a relevant performance rate
    - so that you can say "the best implementation we have so far goes this fast" (perhaps as a % of the peak performance)
17. The UNIX time Command
- You can put time in front of any UNIX command you invoke
- When the invoked command completes, time prints out timing (and other) information
- time ls /home/casanova/ -la -R
  - 0.520u 1.570s 0:20.56 10.1% 0+0k 570+105io 0pf+0w
  - 0.520u: 0.52 seconds of user time
  - 1.570s: 1.57 seconds of system time
  - 0:20.56: 20.56 seconds of wall-clock time
  - 10.1%: 10.1% of the CPU was used
  - 0+0k: memory used (text + data)
  - 570+105io: 570 input, 105 output (file system I/O)
  - 0pf+0w: 0 page faults and 0 swaps
18. User, System, Wall-Clock?
- User time: time the code spends executing user code (i.e., not system calls)
- System time: time the code spends executing system calls
- Wall-clock time: time from start to end
- Wall-clock ≥ user + system
  - In our example: 20.56 > 0.52 + 1.57
  - Why?
    - because the process can be suspended by the O/S due to contention for the CPU by other processes
    - because the process can be blocked waiting for I/O
19. Using time
- It's interesting to know what the user time and the system time are
  - For instance, if the system time is really high, it may be that the code makes too many calls to, say, malloc()
  - But one would really need more information to fix the code (it is not always clear which system calls are responsible for the high system time)
- Wall-clock − system − user = I/O + time suspended
  - If the system is dedicated, suspended ≈ 0
  - Therefore one can estimate the cost of I/O (in our example: 20.56 − 0.52 − 1.57 = 18.47 seconds)
  - If I/O is really high, one may want to look at reducing I/O or doing I/O better
- Therefore, time can give us insight into bottlenecks, and it gives us wall-clock time
20. Drawbacks of UNIX time
- The time command has poor resolution
  - Only milliseconds
  - Sometimes we want higher precision, especially if our performance improvements are in the 1-2% range
- time times the whole code
  - Sometimes we're only interested in timing some part of the code, for instance the part that we are trying to optimize
  - Sometimes we want to compare the execution times of different sections of the code
21. Timing with gettimeofday
- gettimeofday() from the standard C library
- Measures the time elapsed since midnight, Jan 1st, 1970, expressed in seconds and microseconds:

    #include <sys/time.h>
    struct timeval start;
    ...
    gettimeofday(&start, NULL);
    printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
    ...

- Can be used to time sections of code
  - Call gettimeofday at the beginning of the section
  - Call gettimeofday at the end of the section
  - Compute the elapsed time in seconds, e.g.:
    (end.tv_sec*1000000.0 + end.tv_usec - start.tv_sec*1000000.0 - start.tv_usec) / 1000000.0
22. Other Ways to Time Code
- ntp_gettime() (Internet RFC 1589)
  - Sort of like gettimeofday, but reports the estimated error on the time measurement
  - Not available on all systems
  - Part of the GNU C Library
- Java: System.currentTimeMillis()
  - Known to have resolution problems, with a granularity often coarser than 1 millisecond!
  - Solution: use a native interface to a better timer
- Java: System.nanoTime()
  - Added in J2SE 5.0
  - Probably not accurate at the nanosecond level
- Tons of high-precision Java timing code can be found on the Web
23. Dedicated Systems
- Measuring the performance of a code must be done on a dedicated system
  - No other user can start a process
  - The user measuring the performance runs only the minimum number of processes (basically, a shell)
  - Single-user mode is typically considered overkill
- Nevertheless, one should always present measurement results as averages over several experiments
  - Because the (small) load imposed by the O/S is not deterministic
- In your assignments, always show averages over 10 experiments, or more if asked to do so explicitly (see the sketch below)
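A minimal sketch of this methodology in C; run_experiment() is a hypothetical stand-in for whatever code is being timed:

    #include <stdio.h>
    #include <sys/time.h>

    void run_experiment(void) {
        /* the code being measured goes here */
    }

    int main(void) {
        const int trials = 10;
        double total = 0.0;
        for (int t = 0; t < trials; t++) {
            struct timeval s, e;
            gettimeofday(&s, NULL);
            run_experiment();
            gettimeofday(&e, NULL);
            total += (e.tv_sec - s.tv_sec) + (e.tv_usec - s.tv_usec) / 1e6;
        }
        printf("average over %d runs: %g seconds\n", trials, total / trials);
        return 0;
    }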
24. How do I speed up my code?
- One option to make code faster is basically to monkey around with the code
- Let's look at some examples of what one can do by hand
  - These techniques were very popular before compilers were any good
  - Of course, we'll talk about what the compiler can do nowadays
25. Optimization Techniques
- Technique 1: identify loop constants

    for (k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];

  becomes:

    sum = 0;
    for (k = 0; k < N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
26. Optimization Techniques
- Technique 2: replace array accesses by pointer dereferences

    for (j = 0; j < N; j++)
      a[i][j] = 2;              // 2N adds, N multiplies (index arithmetic)

  becomes:

    double *ptr = &(a[i][0]);   // 2 adds, 1 multiply
    for (j = 0; j < N; j++) {
      *ptr = 2;
      ptr++;                    // N integer additions
    }
27. Optimization Techniques
- Technique 3: loop unrolling

    for (i = 0; i < 100; i++)
      a[i] = i;

  becomes:

    i = 0;
    do {
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
    } while (i < 100);   // fewer comparisons
28. Optimization Techniques
- Technique 4: code motion

    sum = 0;
    for (i = 0; i < fact(n); i++)
      sum += i;

  becomes:

    sum = 0;
    f = fact(n);
    for (i = 0; i < f; i++)
      sum += i;
29. Optimization Techniques
- Technique 5: inlining

    for (i = 0; i < N; i++) sum += cube(i);
    ...
    int cube(int i) { return i*i*i; }

  becomes:

    for (i = 0; i < N; i++) sum += i*i*i;
30. Other Techniques
- Common sub-expression elimination

    x = a + b - c;
    y = a + d + e + b;

  becomes:

    tmp = a + b;
    x = tmp - c;
    y = tmp + d + e;
31. Other Techniques
- Dead code elimination

    x = 12;
    ...
    x = a + c;

  becomes:

    ...
    x = a + c;

- Seems obvious, but the dead store may be hidden:

    int x = 0;
    ...
    #ifdef FOO
      x = f(3);
    #else
      ...
    #endif
32. Other Techniques
- Strength reduction

    a = i * 3;    becomes    a = i + i + i;

- Constant propagation

    int speedup = 3;
    efficiency = 100 * speedup / numprocs;
    x = efficiency * 2;

  becomes:

    x = 600 / numprocs;
33. Now what?
- There are many other techniques
- We could apply them all to the code, but this would result in completely unreadable/undebuggable code
- Fortunately, the compiler should come to the rescue
  - To some extent, at least
- Good compilers can do a lot for you
  - Typically the compilers provided by a vendor can do pretty tricky optimizations
34. What do compilers do?
- All modern compilers perform some automatic optimization when generating code
  - In fact, you implement some of those in a graduate-level compiler class, and sometimes at the undergraduate level
- Most compilers provide several levels of optimization
  - -O0: no optimization (in fact, some is always done)
  - -O1, -O2, ..., -OX
- The higher the optimization level, the higher the probability that a debugger will have trouble dealing with the code
  - Always debug with -O0 (some compilers enforce that -g means -O0)
- Some compilers will flat out tell you that higher levels of optimization may break some code! Example gcc invocations are shown below.
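For instance, with gcc (these are standard gcc flags):

    gcc -O0 -g prog.c -o prog   # debugging build: no optimization, debug symbols
    gcc -O2 prog.c -o prog      # typical optimized build
    gcc -O3 prog.c -o prog      # more aggressive optimization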
35. Compiler optimizations
- gcc is a pretty good, free compiler
  - -Os: optimize for size (some optimizations increase code size tremendously)
  - Do a "man gcc" and look at the many optimization options
    - one can pick and choose,
    - or just use the standard sets via -O1, -O2, etc.
- The fanciest compilers are typically the ones provided by vendors
  - You can't sell a good machine if it has a bad compiler
- Compiler technology used to be really poor
  - Also, languages used to be designed without thinking of compilers (FORTRAN, Ada)
  - This is no longer true: every language designer today has an in-depth understanding of compiler technology
36. What can compilers do?
- Many, many things
  - Inlining
  - Assignment of variables to registers (a difficult problem)
  - Dead code elimination
  - Algebraic simplification
  - Moving invariant code out of loops
  - Constant propagation
  - Control flow simplification
  - Instruction scheduling and reordering
  - Strength reduction
    - e.g., adding to pointers rather than doing array index computations
  - Loop unrolling and software pipelining
  - Dead store elimination
  - and many others...
37. Instruction scheduling
- Modern computers have multiple functional units that could be used in parallel
  - Or at least ones that are pipelined: if fed operands at each cycle, they can produce a result at each cycle, even though a given computation may require 20 cycles
- Instruction scheduling:
  - Reorder the instructions of a program, e.g., at the assembly code level
  - Preserve correctness
  - Make it possible to use the functional units optimally
38. Instruction Scheduling
- One cannot just shuffle all instructions around
- Preserving correctness means that data dependences are unchanged
- Three types of data dependences (illustrated in C below):
  - True dependence:
      a = ...
      ... = a
  - Output dependence:
      a = ...
      a = ...
  - Anti dependence:
      ... = a
      a = ...
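The same three cases written out in C (the variable names are mine, for illustration):

    /* True dependence: the read of x must follow the write */
    x = a + b;
    y = x * 2;

    /* Output dependence: the two writes to x must keep their order */
    x = a + b;
    x = c + d;

    /* Anti dependence: the read of a must happen before a is overwritten */
    y = a + 1;
    a = c + d;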
39. Instruction Scheduling Example

    Before:               After:
    ADD  R1,R2,R4         ADD  R1,R2,R4
    ADD  R2,R2,1          LOAD R4,@2
    ADD  R3,R6,R2         ADD  R2,R2,1
    LOAD R4,@2            ADD  R3,R6,R2

- Since loading from memory can take many cycles, one may as well start the load as early as possible
- The LOAD can't be moved before the first ADD because of the anti-dependence on R4
40. Software Pipelining
- A fancy name for instruction scheduling applied to loops
- Can be done by a good compiler
  - First unroll the loop
  - Then make sure that instructions can happen in parallel, i.e., schedule them on the functional units
- Let's see a simple example
41. Example
- Source code: for (i = 0; i < n; i++) sum += a[i];
- Loop body in assembly (r0 holds the address of a[i], r2 holds sum):

    r1 = L r0         ; load a[i]
    --- stall ---
    r2 = Add r2,r1    ; sum += a[i]
    r0 = Add r0,4     ; advance to the next element

- Unroll the loop and allocate registers (this may be very difficult):

    r1  = L r0   ; --- stall --- ; r2 = Add r2,r1  ; r0 = Add r0,12
    r4  = L r3   ; --- stall --- ; r2 = Add r2,r4  ; r3 = Add r3,12
    r7  = L r6   ; --- stall --- ; r2 = Add r2,r7  ; r6 = Add r6,12
    r10 = L r9   ; --- stall --- ; r2 = Add r2,r10 ; r9 = Add r9,12
42. Example (cont.)
- Schedule the unrolled instructions, exploiting instruction-level parallelism where possible:

    r1 = L r0          r4 = L r3
    r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
    r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
    r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
    r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
    r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
    r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
    ...

- Identify the repeating pattern (the kernel)
43. Example (cont.)
- The scheduled code decomposes into a prologue (filling the pipeline), a repeating kernel, and an epilogue (draining it):

    prologue:
      r1 = L r0          r4 = L r3
      r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
      r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    kernel (repeats):
      r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
      r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
      r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
      r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    epilogue:
      r2 = Add r2,r7     r6 = Add r6,12
      r2 = Add r2,r10    r9 = Add r9,12
44. Software Pipelining
- The kernel may require many registers, and it's good to know how to use as few as possible
  - Otherwise one may have to go to cache more often, which may negate the benefits of software pipelining
- Dependency constraints must be respected
  - May be very difficult to analyze for complex nested loops
- Software pipelining with registers is a well-known NP-hard problem
45. Limits to Compiler Optimization
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  - e.g., data ranges may be more limited than variable types suggest
  - e.g., using an int in C for what could be an enumerated type
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
- Most analysis is based only on static information
  - The compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative
  - It cannot perform an optimization if it changes program behavior under any realizable circumstance, even if the circumstances seem quite bizarre and unlikely
46. So where are we now?
- We have seen techniques to optimize code
  - reducing the number of instructions
  - instruction scheduling
  - memory access management
- But compilers do a lot of things
- So, does this mean that we, as software developers, have nothing to worry about?
- Sadly, no
47. Good practice
- Writing code for high performance means working hand-in-hand with the compiler
- Principle 1: optimize things that we know the compiler cannot deal with
  - We'll see a few such examples in the next set of slides
- Principle 2: write code so that the compiler can do its optimizations
  - Remove optimization blockers
48. Optimization blocker: aliasing
- Aliasing: two pointers point to the same location
- If a compiler can't tell what a pointer points at, it must assume it can point at almost anything
- Example:

    void foo(int *q, int *p) {
      *q = 3;
      (*p)++;
      *q *= 4;
    }

  cannot be safely optimized to:

      (*p)++;
      *q = 12;

  because perhaps p == q
- Some compilers have pretty fancy aliasing analysis capabilities
49. Blocker: False Dependencies
- A special case of aliasing:

    a[i]   = b[i]   + c;
    a[i+1] = b[i+1] + d;

- The compiler cannot know that &(b[i+1]) is different from &(a[i])
  - Therefore it can't do efficient instruction scheduling
- Instead, one should write the code as:

    float f1 = b[i];
    float f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 + d;

- We used local variables to expose the independent operations
- Some compilers allow users to give them hints
  - e.g., declare arrays a and b unaliased via some keyword (see the sketch below)
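In C99 that keyword is restrict; a minimal sketch, assuming the two arrays genuinely never overlap (the function name is illustrative):

    /* The restrict qualifiers promise the compiler that a and b do not
       alias, so it is free to reorder these loads and stores. */
    void update(float *restrict a, const float *restrict b,
                float c, float d, int i) {
        a[i]   = b[i]   + c;
        a[i+1] = b[i+1] + d;
    }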
50. Blocker: Function Call

    sum = 0;
    for (i = 0; i < fact(n); i++)
      sum += i;

- A compiler cannot optimize this because:
  - the function fact may have side-effects (e.g., it may modify global variables)
  - fact may not return the same value for the same arguments (it may depend on other parts of the global state, which may be modified in the loop)
- Why doesn't the compiler look at the code for fact?
  - The linker may overload it with a different version (unless it is declared static)
  - Interprocedural optimization is not used extensively due to its cost
  - Inlining can achieve the same effect for small procedures
- Again:
  - The compiler treats a procedure call as a black box
  - This weakens optimizations in and around such calls
51. Other Techniques

    while ( ... ) {
      *res = filter[0]*signal[0] + filter[1]*signal[1] + filter[2]*signal[2];
      signal++;
    }

- The following helps some compilers:

    register float f0 = filter[0];
    register float f1 = filter[1];
    register float f2 = filter[2];
    while ( ... ) {
      *res = f0*signal[0] + f1*signal[1] + f2*signal[2];
      signal++;
    }
52. Other Techniques
- Replace pointer updates for strided memory addressing with constant array offsets
- Option 1:

    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

- Option 2:

    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

- Some compilers are better at figuring this out than others
- Some systems may go faster with option 1, others with option 2!
53. Bottom line
- Know your compilers
  - Some are great
  - Some are not so great
  - Some will not do things that you think they should do, often because you forget about things like aliasing
- There is no golden rule, because there are system-dependent behaviors
  - Although the general principles typically hold
- Doing all optimization by hand is a bad idea in general
  - But we're doing it in this class for some of the programming assignments, to truly understand code, hardware, and performance
54. By-hand Optimization of Matrix Multiplication
- Original version:

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

- Hand-optimized version:

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

- Turned array accesses into pointer dereferences
- Assign to each element of c just once
55. Results (Courtesy of CMU)
[Performance plot comparing the two versions, not reproduced]
56. Why is Simple Sometimes Better?
- Easier for humans and the compiler to understand
  - The more the compiler knows, the more it can do
- Pointers are hard to analyze; arrays are easier
- You never know how fast code will run until you time it on a dedicated system
- The transformations we did by hand, good optimizers will often do for us
  - And they will often do a better job than we can, but not always
  - Pointers may cause aliases and data dependences where the array code had none
57. Bottom Line
- How should I write my programs, given that I have a good, optimizing compiler?
- Don't: smash code into oblivion
  - Hard to read, maintain, and ensure correctness
- Do:
  - Select the best algorithm
  - Write code that's readable and maintainable
    - Procedures, recursion, without built-in constant limits
    - Even though these factors can slow down code
  - Eliminate optimization blockers
    - This allows the compiler to do its job
58. Good Performance?
- You have a code that was given to you or that you wrote
- You compile it with your favorite optimizing compiler, and you have removed the obvious optimization blockers
- And then, performance is poor
  - The performance is not sufficient for the code to be used, or to meet deadlines
  - The code could still be usable but lead to long waits, and you can tell that the performance is way below the peak performance
- What do you do?
59. Why is Performance Poor?
- Performance is poor because the code suffers from a performance bottleneck
- Definition:
  - An application runs on a platform that has many components (CPU, memory, operating system, network, hard drive, video card, etc.)
  - Pick a component and make it faster
  - If the application performance increases, that component was the bottleneck!
60. Identifying a Bottleneck
- It can be difficult
  - You're not going to change the memory bus just to see what happens to the application
  - But you can run the code on a different machine and see what happens
- Typical approach:
  - Know/discover the characteristics of the machine
  - Know/discover the characteristics of the application
  - Observe the application's execution on the machine
  - Reason about what the bottleneck is
- Luckily, there are well-known bottlenecks that are likely candidates when performance is poor
61. Removing a Bottleneck
- Brute force: hardware upgrade
  - Sometimes necessary, but it can only get you so far and may be very costly (e.g., memory technology)
- Instead, modify the code
  - The bottleneck is there because the code uses a resource heavily or in a non-intelligent manner
- This is, unfortunately, something we often have to do after the fact
  - You wrote a beautifully structured/modular code
  - It's slow, and you have to decrease readability and modularity to increase performance
62. The Memory Bottleneck
- The memory is a very common bottleneck that beginning programmers often don't think about
- When you look at code, you often pay more attention to computation
  - a[i] = b[j] + c[k];
  - The accesses to the three arrays take more time than doing the addition
- For the code above, the memory is the bottleneck for many machines!
63. Why the Memory Bottleneck?
- In the '70s, everything was balanced
  - The memory kept pace with the CPU: n cycles to execute an instruction, n cycles to bring in a word from memory
- This is no longer true
  - CPUs have gotten 1,000x faster
  - Memories have gotten only 10x faster (and 1,000,000x larger)
- Flops are free, bandwidth is expensive, and processors are STARVED for data
64. Current Memory Technology
[Table of current memory technologies, not reproduced]
Source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html
65. Memory Bottleneck Example
- Fragment of code: a[i] = b[j] + c[k];
  - Three memory references: 2 reads, 1 write
  - One addition, which can be done in one cycle
- If the memory bandwidth is 12.8 GB/sec, then the rate at which the processor can access integers (4 bytes each) is 12.8 × 1024×1024×1024 / 4 ≈ 3.4 GHz
- The above code needs to access 3 integers per addition
  - Therefore the rate at which the code gets its data is ≈ 1.1 GHz
  - But the CPU could perform additions at 4 GHz!
  - Therefore: the memory is the bottleneck
- And we assumed the memory worked at its peak!
  - We ignored other possible overheads on the bus
  - In practice the gap can be around a factor of 15 or higher
66. Reducing the Memory Bottleneck
- The way in which computer architects have dealt with the memory bottleneck is via the memory hierarchy
[Diagram: CPU/registers at the top, then caches, memory (DRAM), and disk; levels get larger, slower, and cheaper as one moves away from the CPU]
  - register reference: sub-ns
  - L1-cache (SRAM) reference: 1-2 cycles
  - L2-cache (SRAM) reference: 10 cycles
  - L3-cache (DRAM) reference: 20 cycles
  - memory (DRAM) reference: hundreds of cycles
  - disk reference: tens of thousands of cycles
67. Locality
- The memory hierarchy is useful because of locality
  - Temporal locality: a memory location that was referenced in the past is likely to be referenced again
  - Spatial locality: a memory location next to one that was referenced in the past is likely to be referenced in the near future
- This is great, but when we write code for performance we want the code to have the maximum amount of locality
- The compiler can do some of the work for us regarding locality
  - But unfortunately not everything
68. Programming for Locality
- Essentially, a programmer should keep a mental picture of the memory layout of the application and reason about locality
- When writing concurrent code on a multi-core architecture, one must also think about which caches are shared/private
- This can be extremely complex, but there are a few well-known techniques
- The typical example is with 2-D arrays
69. Example: 2-D Array Initialization
- Two alternatives:

    int a[200][200];               int a[200][200];
    for (i = 0; i < 200; i++)      for (j = 0; j < 200; j++)
      for (j = 0; j < 200; j++)      for (i = 0; i < 200; i++)
        a[i][j] = 2;                   a[i][j] = 2;

- Which alternative is best?
  - i,j?
  - j,i?
- To answer this, one must understand the memory layout of a 2-D array
70. 2-D Arrays in Memory
- A static 2-D array is one declared as
  - <type> <name>[<size>][<size>]
  - int myarray[10][30];
- The elements of a 2-D array are stored in contiguous memory cells
- The problem is that:
  - The array is 2-D, conceptually
  - Computer memory is 1-D
- 1-D computer memory: a memory location is described by a single number, its address
  - Just like a single axis
- Therefore, there must be a mapping from 2-D to 1-D
  - From a 2-D abstraction to a 1-D implementation (see the sketch below)
71. Mapping from 2-D to 1-D?
[Diagram: the n×n cells of a 2-D array mapped onto 1-D computer memory]
72. Row-Major, Column-Major
- Luckily, only 2 of the (n²)! possible mappings are ever implemented in a language
- Row-Major:
  - Rows are stored contiguously: 1st row, 2nd row, 3rd row, 4th row, ...
- Column-Major:
  - Columns are stored contiguously: 1st column, 2nd column, 3rd column, 4th column, ...
73. Row-Major
[Diagram: the rows of the array laid out in memory by increasing address, spanning consecutive memory/cache lines]
- Matrix elements are stored in contiguous memory lines
74. Row-Major
- C uses row-major order
- First option:

    int a[200][200];
    for (i = 0; i < 200; i++)
      for (j = 0; j < 200; j++)
        a[i][j] = 2;

- Second option:

    int a[200][200];
    for (j = 0; j < 200; j++)
      for (i = 0; i < 200; i++)
        a[i][j] = 2;
75. Counting cache misses
- n×n 2-D array, element size e bytes, cache line size b bytes
- Row-by-row traversal (each cache line holds b/e consecutive elements):
  - One cache miss for every cache line: n² × e / b misses
  - Total number of memory accesses: n²
  - Miss rate: e/b
  - Example: miss rate = 4 bytes / 64 bytes = 6.25%
  - Unless the array is very small
- Column-by-column traversal (each access touches a different cache line):
  - One cache miss for every access
  - Example: miss rate = 100%
  - Unless the array is very small
76. Array Initialization in C
- First option (good locality):

    int a[200][200];
    for (i = 0; i < 200; i++)
      for (j = 0; j < 200; j++)
        a[i][j] = 2;

- Second option (poor locality):

    int a[200][200];
    for (j = 0; j < 200; j++)
      for (i = 0; i < 200; i++)
        a[i][j] = 2;
77. Performance Measurements
- Option 1:

    int a[X][X];
    for (i = 0; i < X; i++)
      for (j = 0; j < X; j++)
        a[i][j] = 2;

- Option 2:

    int a[X][X];
    for (j = 0; j < X; j++)
      for (i = 0; i < X; i++)
        a[i][j] = 2;

[Plot: timings of both options for various array sizes X, measured on the instructor's laptop, not reproduced]
- Note that other languages use column-major order, e.g., FORTRAN
78. Matrix Multiplication
- A classic example for locality-aware programming is matrix multiplication:

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

- There are 6 possible orders for the three loops
  - i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i
- Each order corresponds to a different access pattern for the matrices
- Let's focus on the inner loop, as it is the one that's executed most often
79. Inner Loop Memory Accesses
- Each matrix element can be accessed in one of three modes in the inner loop:
  - Constant: doesn't depend on the inner loop's index
  - Sequential: contiguous addresses
  - Strided: non-contiguous addresses (N elements apart)

             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant
80. Loop order and Performance
- Constant access is better than sequential access
  - It's always good to have constants in loops because they can be put in registers (as we've seen in our very first optimization)
- Sequential access is better than strided access
  - Sequential access utilizes the cache better
- Let's go back to the previous slide's table
81. Best Loop Ordering?

             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant

- k-i-j and i-k-j should have the best performance
- i-j-k and j-i-k should be worse
- j-k-i and k-j-i should be the worst
- You will measure this in a Programming Assignment
82. How good is the best ordering?
- Let us assume that i-k-j is best
- How many cache misses?

    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++) {
        x = a[i][k];
        for (j = 0; j < N; j++)
          c[i][j] += x * b[k][j];
      }

- Clearly this is not easy to compute
  - e.g., if the matrix is twice the size of the cache, there is a lot of loading/evicting, and obtaining a formula would be complicated
- Let L be the cache line size in number of matrix elements
- How about a very coarse approximation, assuming that the matrix is much larger than the cache?
  - Determine which matrix pieces are loaded/written
  - Figure out the expected number of cache misses
83. Slow Memory Operations

    for (i = 0; i < N; i++) {
      // (1) read row i of a into cache
      // (2) write row i of c back to memory
      for (k = 0; k < N; k++) {
        // (3) read column j of b into cache
        for (j = 0; j < N; j++)
          c[i][j] += a[i][k] * b[k][j];
      }
    }

- L = cache line size (in elements)
  - (1): N × (N/L) cache misses
  - (2): N × (N/L) cache misses
  - (3): N × N × N cache misses
    - Although the access to b is sequential, it is sequential along a column, and the matrix is stored in row-major fashion!
- Total: 2N²/L + N³ ≈ N³ (for large N)
84. Bad News
- N³ slow memory operations for 2N³ arithmetic operations
  - Ratio ops/mem = 2
- This is bad news: we know that computer architectures are NOT balanced, and memory operations are orders of magnitude slower than arithmetic operations
- Therefore the memory is still the bottleneck for this implementation of matrix multiplication (the ratio should be much higher)
- BUT: we have only N² matrix elements; how come we perform N³ slow memory accesses?
  - Because we access matrix b very inefficiently, trying to load entire columns one after the other
- Lesson: counting the number of operations and comparing it with the size of the data is not sufficient to ascertain that an algorithm will not suffer from the memory bottleneck
85. Better cache reuse?
- Since we really need only N² elements, perhaps there is a better way to reorganize the operations of the matrix multiplication for a higher number of cache hits
  - Possible because + and × are associative and commutative
- Researchers have spent a lot of time trying to find the best ordering
  - There are even theorems!
- Let q = the ratio of operations to slow memory accesses
  - q must be as high as possible to remove the memory bottleneck
- Hong & Kung, 1981: any reorganization of the algorithm is limited to q = O(√M), where M is the size of the cache (in number of elements)
  - Obtained with a lot of unrealistic assumptions about the cache
  - Still shows that q won't scale with N, unlike what one may think when dividing 2N³ by N²
86. Blocked Matrix Multiplication
- One problem with our implementation is that we try to access entire columns of matrix b
- What about accessing only a subset of a column, or of multiple columns, at a time?
87. Blocked Matrix Multiplication
[Diagram: matrices A, B, and C; element (i,j) of C is computed from row i of A and column j of B, with the cache lines of B highlighted]
- Key idea: reuse the other elements in each cache line as much as possible
88. Blocked Matrix Multiplication
[Diagram: same picture, with b-element pieces of columns j and j+1 of B highlighted]
- May as well compute c[i][j+1], since one loads the cache lines of column j+1 of B anyway. But the operations must be reordered as follows:
  - compute the first b terms of c[i][j], compute the first b terms of c[i][j+1]
  - compute the next b terms of c[i][j], compute the next b terms of c[i][j+1]
  - .....
89. Blocked Matrix Multiplication
[Diagram: same picture, extended to a whole subrow of C]
- May as well compute a whole subrow of C, with the same reordering of the operations
- But by computing a whole row of C, one has to load all the columns of B, which one has to do again for computing the next row of C
- Idea: reuse the blocks of B that we have just loaded
90. Blocked Matrix Multiplication
[Diagram: same picture, with a block of C computed from a block row of A and a block column of B]
- Order of the operations:
  - Compute the first b terms of all c[i][j] values in the C block
  - Compute the next b terms of all c[i][j] values in the C block
  - . . .
  - Compute the last b terms of all c[i][j] values in the C block
91. Blocked Matrix Multiplication
- N = 4b; each matrix is partitioned into a 4×4 grid of b×b blocks:

    C11 C12 C13 C14      A11 A12 A13 A14      B11 B12 B13 B14
    C21 C22 C23 C24      A21 A22 A23 A24      B21 B22 B23 B24
    C31 C32 C33 C34      A31 A32 A33 A34      B31 B32 B33 B34
    C41 C42 C43 C44      A41 A42 A43 A44      B41 B42 B43 B44

- C22 = A21×B12 + A22×B22 + A23×B32 + A24×B42
  - 4 matrix multiplications
  - 4 matrix additions
- Main point: each multiplication operates on small block matrices, whose size may be chosen so that they fit in the cache
92. Blocked Algorithm
- The blocked version of the i-j-k algorithm is written simply as:

    for (i = 0; i < N/b; i++)
      for (j = 0; j < N/b; j++)
        for (k = 0; k < N/b; k++)
          C[i][j] += A[i][k] * B[k][j];

  - where b is the block size (which we assume divides N)
  - where X[i][j] is the block of matrix X on block row i and block column j
  - where += means matrix addition
  - where * means matrix multiplication
  - (a full C sketch is given below)
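For concreteness, a minimal C sketch of the blocked algorithm on plain 2-D arrays; the six-loop structure is standard, and N, b, and the array names are illustrative (with b dividing N):

    #define N 512
    #define b 64              /* block size; must divide N */

    double A[N][N], B[N][N], C[N][N];

    void blocked_matmul(void) {
        /* Outer three loops iterate over blocks... */
        for (int bi = 0; bi < N; bi += b)
            for (int bj = 0; bj < N; bj += b)
                for (int bk = 0; bk < N; bk += b)
                    /* ...inner three loops multiply one pair of b-by-b
                       blocks, using the i-k-j order from earlier slides */
                    for (int i = bi; i < bi + b; i++)
                        for (int k = bk; k < bk + b; k++) {
                            double x = A[i][k];
                            for (int j = bj; j < bj + b; j++)
                                C[i][j] += x * B[k][j];
                        }
    }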
93. Cache Misses?

    for (i = 0; i < N/b; i++)
      for (j = 0; j < N/b; j++) {
        // (1) write block C[i][j] to memory
        for (k = 0; k < N/b; k++) {
          // (2) load block A[i][k] from memory
          // (3) load block B[k][j] from memory
          C[i][j] += A[i][k] * B[k][j];
        }
      }

- (1): (N/b) × (N/b) × b×b misses
- (2): (N/b) × (N/b) × (N/b) × b×b misses
- (3): (N/b) × (N/b) × (N/b) × b×b misses
- Total: N² + 2N³/b ≈ 2N³/b
94. Performance?
- Slow memory accesses: 2N³/b
- Number of operations: 2N³
- Therefore, ratio ops/mem = b
- This ratio should be as high as possible
  - (compare to the value of 2 that we obtained with the non-blocked implementation)
- This implies that one should make the block size as large as possible
- But if we take this result to the extreme, then the block size should be equal to N!!
  - That clearly doesn't make sense, because then we're back to the non-blocked implementation
95. Maximum Block Size
- The blocking optimization only works if the blocks fit in the cache
  - That is, 3 blocks of size b×b must fit in the cache (for A, B, and C)
- Let M be the cache size (in elements)
- We must have 3b² ≤ M, i.e., b ≤ √(M/3)
- Therefore, in the best case, the ratio of operations to slow memory accesses is √(M/3); a small worked example follows
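As a quick sanity check of this formula, with an assumed (purely illustrative) cache size:

    /* A 32 KB cache holding 8-byte doubles: M = 32768 / 8 = 4096 elements.
       Then b <= sqrt(4096/3) ~= 36, so blocks of up to ~36x36 doubles fit,
       and the best ops/mem ratio is ~36, versus 2 for the non-blocked code. */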
96. Optimizing Further?
- At this point we know that blocking is a good idea
- It turns out that the best block size isn't that easy to determine
- There are many other things we could do to the code:
  - loop unrolling
  - instruction reordering
  - ...
- There are many things the compiler can do to the code, and there are many compiler flags we could use
- In the end, how do we determine the best implementation for a given architecture?
97. Automatic Program Generation
- It is difficult to optimize code because:
  - There are many possible options for tuning/modifying the code
  - These options interact in complex ways with the compiler and the hardware
- This is really an optimization problem:
  - The objective function is the code's performance
  - The feasible solutions are all the possible ways to implement the software
    - Typically a finite number of implementation decisions are to be made
    - Each decision can take a range of values
      - e.g., the 7th loop in the 3rd function can be unrolled 1, 2, ..., 20 times
      - e.g., the block size could be 2x2, 4x4, ..., 400x400
      - e.g., a function could be recursive or iterative
- And one needs to do it again and again for different platforms
98. Automatic Program Generation
- What is good at solving hard optimization problems? Computers!
- Therefore, a computer program could generate the computer program with the best performance
  - Could use a brute-force approach: try all possible solutions
    - but there is an exponential number of them
  - Could use genetic algorithms
  - Could use some ad-hoc optimization technique
99. Matrix Multiplication
- We have seen that for matrix multiplication there are several possible ways to optimize the code:
  - block size
  - optimization flags passed to the compiler
  - order of the loops
  - ...
- It is difficult to find the best combination
- People have written automatic matrix multiplication program generators!
100. The ATLAS Project
- ATLAS is a piece of software that you can download and run on most platforms
- It runs for a while (perhaps a couple of hours) and generates a .c file that implements matrix multiplication!
- ATLAS optimizes for:
  - Instruction cache reuse
  - Floating point instruction ordering
    - pipelined functional units
  - Reducing loop overhead
  - Exposing parallelism
    - multiple functional units
  - Cache reuse
101. ATLAS (500x500 matrices)
[Plot: MFlops achieved by various BLAS implementations, not reproduced. Source: Jack Dongarra]
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor
102. Improving an Application
- So, we have seen ways in which to improve pieces of code
- The problem is that one typically doesn't have an application that just performs an array initialization or a matrix multiplication
- In fact, there are many parts of an application that one could think of optimizing for memory, etc.
103. Profiling
- Question: how do we know which part of the code is the most expensive?
  - If you've not written the code, you may not know
  - If you've written the code, you may have some idea (although experience shows that many programmers don't)
  - The most expensive part may be in some library function you haven't written
- You could put gettimeofday() calls everywhere, but that gets really cumbersome for large projects
- The standard way: use a profiler
104. What is a Profiler?
- A profiler is a tool that monitors the execution of a program and reports the amount of time spent in different functions
- Useful for identifying the expensive functions
- Profiling cycle:
  - Compile the code with the profiler
  - Run the code
  - Identify the most expensive function
  - Optimize that function: call it less often if possible, or make it faster
  - Repeat until you can't think of any way to further optimize the most expensive function
- UNIX has a good, free profiler called gprof
105. Using gprof
- Compile your code using gcc with the -pg option
- Run your code until completion
- Then run gprof with your program's name as its single command-line argument
- Example:

    gcc -pg prog.c -o prog
    ./prog
    gprof prog > profile_file

- The output file contains all the profiling information
106. Profiling output
- The content of the file is explained in detail in the file itself
- At the beginning of the file is a summary of which fraction of the time is spent in which function
- In the middle section is a detailed entry for each function
- At the end of the file is a function index, in which each function is assigned a number in brackets, e.g., [3]
107. Profiling Output
- Flat profile summary ("self seconds": time spent in the function itself; "cumulative seconds": time spent in the function and its children):

    %time  cumulative  self
           seconds     seconds  name
    30.9   0.77        0.77     ___multadd_D2A [1]
    16.9   1.19        0.42     _scheduler <cycle 1> [3]
    15.3   1.57        0.38     _scandir [5]
     9.2   1.80        0.23     _NSLookupAndBindSymbolHint [6]
     6.4   1.96        0.16     _job <cycle 1> [8]
     4.4   2.07        0.11     _NSIsSymbolNameDefinedHint [9]
     1.6   2.11        0.04     _hash_nkey [10]
     1.6   2.15        0.04     _pthread_key_create [11]
     1.2   2.18        0.03     ___quorem_D2A [12]
     1.2   2.21        0.03     __mh_dylib_header [13]
     1.2   2.24        0.03     _probe_submitter [14]
     1.2   2.27        0.03     _request_submitter [15]
108. Profiling output
- The middle section of the file provides detailed information for each function
- Entry format:

    index  %time  self  children  called    name
                  1.21  3.10      80/132      f1 [111]
                  0.69  1.13      52/132      f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231     c [39]

- The exact format can vary depending on the version of gprof
- You should really read the explanations in the file to be sure
109. Profiling output
- The entry above describes the function func [1]: the line starting with [1] is the function itself
110. Profiling output
- The lines above it are its parents: f1 [111] and f2 [123]
111. Profiling output
- The line below it lists its children: c [39]
112. Profiling output
- Together these lines form the call graph: f1 and f2 call func, and func calls c
113. Profiling output
- The "called" column gives the call counts: f1 calls func 80 times and f2 calls it 52 times, out of 132 calls to func in total; func makes 32 of the 5231 total calls to c
114. Profiling output
- Parents: f1 [111], f2 [123]

    index  %time  self  children  called    name
                  1.21  3.10      80/132      f1 [111]
                  0.69  1.13      52/132      f2 [123]
    [1]    23.1   2.12  4.23      132