Title: Principles of High Performance Computing ICS 632
1. Principles of High Performance Computing (ICS 632)
Performance of Sequential Programs
2. Performance
- We will mostly talk about how to make code go fast, hence the "high performance"
- Performance conflicts with other concerns:
  - Correctness: you will see that when trying to make code go fast, one often breaks it
  - Readability: fast code typically requires more lines!
  - Modularity: modularity can hurt performance (e.g., virtual classes)
  - Portability: code that is fast on machine A can be slow on machine B; at the extreme, highly optimized code is not portable at all, and is in fact done in hardware!
3. Why Performance?
- To do a time-consuming operation in less time
  - I am an aircraft engineer
  - I need to run a simulation to test the stability of the wings at high aircraft velocity
  - I'd rather have the result in 5 minutes than in 5 hours, so that I can complete the final aircraft design sooner
- To do an operation before a tight deadline
  - I am a weather prediction agency
  - I am getting input from weather stations/sensors
  - I'd like to make the forecast for tomorrow before tomorrow
- To do a high number of operations per second
  - I am the CTO of Amazon.com
  - My Web server gets 1,000 hits per second
  - I'd like my Web server and my databases to handle 1,000 transactions per second to reduce customer delay
  - Amazon does process several GBytes of data per second
4. How to Improve Performance?
- Option 1: Buy faster hardware
  - Only gets you so far for so long
  - Sometimes the amount of hardware to buy would be staggering, and one can't just wait for technology improvements and price drops
  - Better to achieve the same effect by modifying the code a little bit
5. How to Improve Performance?
- Option 2: Modify the algorithm
- Example: searching for an element in a sorted array
  - First implementation: a linear search
    - Easy to write at first
    - Does the job
  - When performance becomes an issue, replace the linear search by a binary search (sketched below)
    - More complex code
    - Goes much faster for large arrays
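To make the contrast concrete, here is a minimal C sketch of both searches over a sorted array of ints (the function names are illustrative, not from the slides):

    /* Linear search: easy to write, O(n) comparisons */
    int linear_search(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key) return i;
        return -1;
    }

    /* Binary search: more complex, O(log n) comparisons on a sorted array */
    int binary_search(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   /* avoids overflow of (lo + hi) / 2 */
            if (a[mid] == key) return mid;
            else if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;   /* not found */
    }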
6. How to Improve Performance?
- Option 3: Modify the data structures
- Example: a linked list
  - The list.length() method computes the length by going through the list and incrementing a counter
  - If users call the method often and/or the list is long, this can cause significant overhead
  - Instead, add a length attribute to the list class, and increment/decrement it on each insertion and removal
  - The new list.length() method just returns the length attribute
  - This vastly speeds up list.length(), minimally slows down list.insert() and list.remove(), and minimally increases memory consumption (by 4 bytes); see the sketch below
- Example: replace a list by a heap
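A minimal C sketch of the cached-length idea (the struct and function names are illustrative):

    #include <stdlib.h>

    typedef struct node { int value; struct node *next; } node_t;
    typedef struct list {
        node_t *head;
        int length;                /* cached length, kept up to date */
    } list_t;

    void list_insert_front(list_t *l, int v) {
        node_t *n = malloc(sizeof *n);
        n->value = v;
        n->next = l->head;
        l->head = n;
        l->length++;               /* O(1) bookkeeping on insertion */
    }

    int list_length(const list_t *l) {
        return l->length;          /* O(1), instead of an O(n) traversal */
    }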
7. How to Improve Performance?
- Option 4: Modify the implementation
- Do not change the spirit of the algorithm, but...
  - Shuffle lines of code around
    - to do instructions in a different order
    - to remove optimization blockers
  - Modify code organization
    - e.g., remove classes
    - e.g., modify data structures
  - etc.
8. How to Improve Performance?
- Option 5: Use concurrency
  - Multi-threaded code on a single-CPU machine, to utilize hardware resources more effectively
  - Multi-threaded code on a multi-CPU/multi-core machine
9. Performance as Time
- Time between the start and the end of an operation
  - Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ...
- Most straightforward measure: "my program takes 12.5s on a Pentium 3.5GHz"
- Can be normalized to some reference time
- Must be measured on a dedicated machine
10. Performance as Rate
- Rates are used often so that performance can be independent of the size of the application
  - e.g., compressing a 1MB file takes 1 minute and compressing a 2MB file takes 2 minutes: the performance is the same
- Millions of instructions per second (MIPS)
  - MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
  - But Instruction Set Architectures are not equivalent
    - 1 CISC instruction = many RISC instructions
    - Programs use different instruction mixes
  - May be OK for comparing runs of the same program on the same architecture
11. Performance as Rate
- Millions of floating point operations per second (MFlops)
  - Very popular, but often misleading
  - e.g., a high MFlops rate in a stupid algorithm can go with poor application performance
- Application-specific rates
  - Millions of frames rendered per second
  - Millions of amino-acids compared per second
  - Millions of HTTP requests served per second
  - Application-specific metrics are often preferable, while generic ones may be misleading
- MFlops can be application-specific too, though
  - For instance: I want to add two n-element vectors
  - This requires n floating point operations
  - Therefore MFlops is a good measure here
12. Measuring Performance Rates
- How do we measure performance rates?
  - Time a section of code
  - Count how many items are done in that section of the code
  - Compute the rate as the number of items divided by the measured time
- Example (see the sketch below):

    start_stopwatch();
    for (i = 0; i < 1000000; i++)
      x = y * z + a;
    stop_stopwatch();

  - Number of Flops: 2,000,000 (1,000,000 additions, 1,000,000 multiplications)
  - MFlops rate: 2 / time (with time measured in seconds)
13. Peak Performance?
- Resource vendors always talk about the peak performance rate
  - Computed based on the specifications of the machine
- For instance:
  - I build a machine with 2 floating point units
  - Each unit can do an operation in 2 cycles
  - My CPU runs at 1GHz
  - Therefore I have a 1GHz × 2 units / 2 cycles = 1 GFlops machine
- Problem:
  - In real code you will never be able to use the two floating point units constantly
  - Data needs to come from memory and causes the floating point units to be idle
- Typically, real code achieves only an (often small) fraction of the peak performance
14. Benchmarks
- Since many performance metrics turn out to be misleading, people have designed benchmarks
- Example: the SPEC benchmarks
  - Integer benchmark
  - Floating point benchmark
- These benchmarks are typically a collection of several codes that come from real-world software
- The question "what is a good benchmark?" is difficult
  - If the benchmarks do not correspond to what you'll do with the computer, then the benchmark results are not relevant to you
15. How About GHz?
- This is often the way in which people say that one computer is better than another
  - More instructions per second for a higher clock rate
- Faces the same problems as MIPS
- But usable within a specific architecture
16. Program Performance
- In this class we're not really concerned with determining the performance of a compute platform (however it is defined)
- Instead we're concerned with improving a program's performance
- For a given platform, take a given program
  - Run it and measure its wall-clock time
  - Enhance it, run it again, and quantify the performance improvement, i.e., the reduction in wall-clock time
  - For each version compute its performance
    - preferably as a relevant performance rate
    - so that you can say "the best implementation we have so far goes this fast" (perhaps as a % of the peak performance)
17. The UNIX time Command
- You can put time in front of any UNIX command you invoke
- When the invoked command completes, time prints out timing (and other) information
- time ls /home/casanova/ -la -R
  - 0.520u 1.570s 0:20.56 10.1% 0+0k 570+105io 0pf+0w
  - 0.520u: 0.52 seconds of user time
  - 1.570s: 1.57 seconds of system time
  - 0:20.56: 20.56 seconds of wall-clock time
  - 10.1%: 10.1% of the CPU was used
  - 0+0k: memory used (text + data)
  - 570+105io: 570 input, 105 output (file system I/O)
  - 0pf+0w: 0 page faults and 0 swaps
18. User, System, Wall-Clock?
- User time: time the code spends executing user code (i.e., not system calls)
- System time: time the code spends executing system calls
- Wall-clock time: time from start to end
- Wall-clock ≥ user + system
  - In our example: 20.56 > 0.52 + 1.57
  - Why?
    - because the process can be suspended by the O/S due to contention for the CPU by other processes
    - because the process can be blocked waiting for I/O
19. Using time
- It's interesting to know what the user time and the system time are
  - For instance, if the system time is really high, it may be that the code makes too many calls to, say, malloc()
  - But one would really need more information to fix the code (it is not always clear which system calls are responsible for the high system time)
- Wall-clock − system − user = I/O + time suspended
  - If the system is dedicated, suspended ≈ 0
  - Therefore one can estimate the cost of I/O (in our example: 20.56 − 0.52 − 1.57 = 18.47 seconds)
  - If I/O is really high, one may want to look at reducing I/O or doing I/O better
- Therefore, time can give us insight into bottlenecks, and it gives us wall-clock time
20. Drawbacks of UNIX time
- The time command has poor resolution
  - Only milliseconds
  - Sometimes we want higher precision, especially if our performance improvements are in the 1-2% range
- time times the whole code
  - Sometimes we're only interested in timing some part of the code, for instance the part that we are trying to optimize
  - Sometimes we want to compare the execution times of different sections of the code
21. Timing with gettimeofday
- gettimeofday() from the standard C library
- Measures the time elapsed since midnight, Jan 1st, 1970, expressed in seconds and microseconds:

    #include <sys/time.h>
    struct timeval start;
    ...
    gettimeofday(&start, NULL);
    printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
    ...

- Can be used to time sections of code
  - Call gettimeofday at the beginning of the section
  - Call gettimeofday at the end of the section
  - Compute the elapsed time in seconds, e.g.:
    (end.tv_sec*1000000.0 + end.tv_usec - start.tv_sec*1000000.0 - start.tv_usec) / 1000000.0
22. Other Ways to Time Code
- ntp_gettime() (Internet RFC 1589)
  - Sort of like gettimeofday, but reports the estimated error on the time measurement
  - Not available on all systems
  - Part of the GNU C Library
- Java: System.currentTimeMillis()
  - Known to have resolution problems, with a granularity often coarser than 1 millisecond!
  - Solution: use a native interface to a better timer
- Java: System.nanoTime()
  - Added in J2SE 5.0
  - Probably not accurate at the nanosecond level
- Tons of high-precision Java timing code can be found on the Web
23. Dedicated Systems
- Measuring the performance of a code must be done on a dedicated system
  - No other user can start a process
  - The user measuring the performance runs only the minimum number of processes (basically, a shell)
  - Single-user mode is typically considered overkill
- Nevertheless, one should always present measurement results as averages over several experiments
  - Because the (small) load imposed by the O/S is not deterministic
- In your assignments, always show averages over 10 experiments, or more if asked to do so explicitly (see the sketch below)
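A minimal sketch of this methodology in C; run_experiment() is a hypothetical stand-in for whatever code is being timed:

    #include <stdio.h>
    #include <sys/time.h>

    void run_experiment(void) {
        /* the code being measured goes here */
    }

    int main(void) {
        const int trials = 10;
        double total = 0.0;
        for (int t = 0; t < trials; t++) {
            struct timeval s, e;
            gettimeofday(&s, NULL);
            run_experiment();
            gettimeofday(&e, NULL);
            total += (e.tv_sec - s.tv_sec) + (e.tv_usec - s.tv_usec) / 1e6;
        }
        printf("average over %d runs: %g seconds\n", trials, total / trials);
        return 0;
    }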
24. How do I speed up my code?
- One option to make code faster is basically to monkey around with the code
- Let's look at some examples of what one can do by hand
  - These techniques were very popular before compilers were any good
  - Of course, we'll talk about what the compiler can do nowadays
25. Optimization Techniques
- Technique 1: identify loop constants

    for (k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];

  becomes:

    sum = 0;
    for (k = 0; k < N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
26. Optimization Techniques
- Technique 2: replace array accesses by pointer dereferences

    for (j = 0; j < N; j++)
      a[i][j] = 2;              // 2N adds, N multiplies (index arithmetic)

  becomes:

    double *ptr = &(a[i][0]);   // 2 adds, 1 multiply
    for (j = 0; j < N; j++) {
      *ptr = 2;
      ptr++;                    // N integer additions
    }
27. Optimization Techniques
- Technique 3: loop unrolling

    for (i = 0; i < 100; i++)
      a[i] = i;

  becomes:

    i = 0;
    do {
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
    } while (i < 100);   // fewer comparisons
28. Optimization Techniques
- Technique 4: code motion

    sum = 0;
    for (i = 0; i < fact(n); i++)
      sum += i;

  becomes:

    sum = 0;
    f = fact(n);
    for (i = 0; i < f; i++)
      sum += i;
29. Optimization Techniques
- Technique 5: inlining

    for (i = 0; i < N; i++) sum += cube(i);
    ...
    int cube(int i) { return i*i*i; }

  becomes:

    for (i = 0; i < N; i++) sum += i*i*i;
30. Other Techniques
- Common sub-expression elimination

    x = a + b - c;
    y = a + d + e + b;

  becomes:

    tmp = a + b;
    x = tmp - c;
    y = tmp + d + e;
31. Other Techniques
- Dead code elimination

    x = 12;
    ...
    x = a + c;

  becomes:

    ...
    x = a + c;

- Seems obvious, but the dead store may be hidden:

    int x = 0;
    ...
    #ifdef FOO
      x = f(3);
    #else
      ...
    #endif
32. Other Techniques
- Strength reduction

    a = i * 3;    becomes    a = i + i + i;

- Constant propagation

    int speedup = 3;
    efficiency = 100 * speedup / numprocs;
    x = efficiency * 2;

  becomes:

    x = 600 / numprocs;
33. Now what?
- There are many other techniques
- We could apply them all to the code, but this would result in completely unreadable/undebuggable code
- Fortunately, the compiler should come to the rescue
  - To some extent, at least
- Good compilers can do a lot for you
  - Typically the compilers provided by a vendor can do pretty tricky optimizations
34. What do compilers do?
- All modern compilers perform some automatic optimization when generating code
  - In fact, you implement some of those in a graduate-level compiler class, and sometimes at the undergraduate level
- Most compilers provide several levels of optimization
  - -O0: no optimization (in fact, some is always done)
  - -O1, -O2, ..., -OX
- The higher the optimization level, the higher the probability that a debugger will have trouble dealing with the code
  - Always debug with -O0 (some compilers enforce that -g means -O0)
- Some compilers will flat out tell you that higher levels of optimization may break some code! Example gcc invocations are shown below.
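For instance, with gcc (these are standard gcc flags):

    gcc -O0 -g prog.c -o prog   # debugging build: no optimization, debug symbols
    gcc -O2 prog.c -o prog      # typical optimized build
    gcc -O3 prog.c -o prog      # more aggressive optimization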
35. Compiler optimizations
- gcc is a pretty good, free compiler
  - -Os: optimize for size (some optimizations increase code size tremendously)
  - Do a "man gcc" and look at the many optimization options
    - one can pick and choose,
    - or just use the standard sets via -O1, -O2, etc.
- The fanciest compilers are typically the ones provided by vendors
  - You can't sell a good machine if it has a bad compiler
- Compiler technology used to be really poor
  - Also, languages used to be designed without thinking of compilers (FORTRAN, Ada)
  - This is no longer true: every language designer today has an in-depth understanding of compiler technology
36. What can compilers do?
- Many, many things
  - Inlining
  - Assignment of variables to registers (a difficult problem)
  - Dead code elimination
  - Algebraic simplification
  - Moving invariant code out of loops
  - Constant propagation
  - Control flow simplification
  - Instruction scheduling and reordering
  - Strength reduction
    - e.g., adding to pointers rather than doing array index computations
  - Loop unrolling and software pipelining
  - Dead store elimination
  - and many others...
37. Instruction scheduling
- Modern computers have multiple functional units that could be used in parallel
  - Or at least ones that are pipelined: if fed operands at each cycle, they can produce a result at each cycle, even though a given computation may require 20 cycles
- Instruction scheduling:
  - Reorder the instructions of a program, e.g., at the assembly code level
  - Preserve correctness
  - Make it possible to use the functional units optimally
38. Instruction Scheduling
- One cannot just shuffle all instructions around
- Preserving correctness means that data dependences are unchanged
- Three types of data dependences (illustrated in C below):
  - True dependence:
      a = ...
      ... = a
  - Output dependence:
      a = ...
      a = ...
  - Anti dependence:
      ... = a
      a = ...
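The same three cases written out in C (the variable names are mine, for illustration):

    /* True dependence: the read of x must follow the write */
    x = a + b;
    y = x * 2;

    /* Output dependence: the two writes to x must keep their order */
    x = a + b;
    x = c + d;

    /* Anti dependence: the read of a must happen before a is overwritten */
    y = a + 1;
    a = c + d;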
39. Instruction Scheduling Example

    Before:               After:
    ADD  R1,R2,R4         ADD  R1,R2,R4
    ADD  R2,R2,1          LOAD R4,@2
    ADD  R3,R6,R2         ADD  R2,R2,1
    LOAD R4,@2            ADD  R3,R6,R2

- Since loading from memory can take many cycles, one may as well start the load as early as possible
- The LOAD can't be moved before the first ADD because of the anti-dependence on R4
40. Software Pipelining
- A fancy name for instruction scheduling applied to loops
- Can be done by a good compiler
  - First unroll the loop
  - Then make sure that instructions can happen in parallel, i.e., schedule them on the functional units
- Let's see a simple example
41. Example
- Source code: for (i = 0; i < n; i++) sum += a[i];
- Loop body in assembly (r0 holds the address of a[i], r2 holds sum):

    r1 = L r0         ; load a[i]
    --- stall ---
    r2 = Add r2,r1    ; sum += a[i]
    r0 = Add r0,4     ; advance to the next element

- Unroll the loop and allocate registers (this may be very difficult):

    r1  = L r0   ; --- stall --- ; r2 = Add r2,r1  ; r0 = Add r0,12
    r4  = L r3   ; --- stall --- ; r2 = Add r2,r4  ; r3 = Add r3,12
    r7  = L r6   ; --- stall --- ; r2 = Add r2,r7  ; r6 = Add r6,12
    r10 = L r9   ; --- stall --- ; r2 = Add r2,r10 ; r9 = Add r9,12
42. Example (cont.)
- Schedule the unrolled instructions, exploiting instruction-level parallelism where possible:

    r1 = L r0          r4 = L r3
    r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
    r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
    r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
    r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
    r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
    r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
    ...

- Identify the repeating pattern (the kernel)
43. Example (cont.)
- The scheduled code decomposes into a prologue (filling the pipeline), a repeating kernel, and an epilogue (draining it):

    prologue:
      r1 = L r0          r4 = L r3
      r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
      r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    kernel (repeats):
      r2 = Add r2,r7     r1 = L r0       r6 = Add r6,12
      r2 = Add r2,r10    r4 = L r3       r9 = Add r9,12
      r2 = Add r2,r1     r7 = L r6       r0 = Add r0,12
      r2 = Add r2,r4     r10 = L r9      r3 = Add r3,12
    epilogue:
      r2 = Add r2,r7     r6 = Add r6,12
      r2 = Add r2,r10    r9 = Add r9,12
44. Software Pipelining
- The kernel may require many registers, and it's good to know how to use as few as possible
  - Otherwise one may have to go to cache more often, which may negate the benefits of software pipelining
- Dependency constraints must be respected
  - May be very difficult to analyze for complex nested loops
- Software pipelining with registers is a well-known NP-hard problem
45. Limits to Compiler Optimization
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  - e.g., data ranges may be more limited than variable types suggest
  - e.g., using an int in C for what could be an enumerated type
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
- Most analysis is based only on static information
  - The compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative
  - It cannot perform an optimization if it changes program behavior under any realizable circumstance, even if the circumstances seem quite bizarre and unlikely
46. So where are we now?
- We have seen techniques to optimize code
  - reducing the number of instructions
  - instruction scheduling
  - memory access management
- But compilers do a lot of things
- So, does this mean that we, as software developers, have nothing to worry about?
- Sadly, no
47. Good practice
- Writing code for high performance means working hand-in-hand with the compiler
- Principle 1: optimize things that we know the compiler cannot deal with
  - We'll see a few such examples in the next set of slides
- Principle 2: write code so that the compiler can do its optimizations
  - Remove optimization blockers
48. Optimization blocker: aliasing
- Aliasing: two pointers point to the same location
- If a compiler can't tell what a pointer points at, it must assume it can point at almost anything
- Example:

    void foo(int *q, int *p) {
      *q = 3;
      (*p)++;
      *q *= 4;
    }

  cannot be safely optimized to:

      (*p)++;
      *q = 12;

  because perhaps p == q
- Some compilers have pretty fancy aliasing analysis capabilities
49. Blocker: False Dependencies
- A special case of aliasing:

    a[i]   = b[i]   + c;
    a[i+1] = b[i+1] + d;

- The compiler cannot know that &(b[i+1]) is different from &(a[i])
  - Therefore it can't do efficient instruction scheduling
- Instead, one should write the code as:

    float f1 = b[i];
    float f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 + d;

- We used local variables to expose the independent operations
- Some compilers allow users to give them hints
  - e.g., declare arrays a and b unaliased via some keyword (see the sketch below)
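In C99 that keyword is restrict; a minimal sketch, assuming the two arrays genuinely never overlap (the function name is illustrative):

    /* The restrict qualifiers promise the compiler that a and b do not
       alias, so it is free to reorder these loads and stores. */
    void update(float *restrict a, const float *restrict b,
                float c, float d, int i) {
        a[i]   = b[i]   + c;
        a[i+1] = b[i+1] + d;
    }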
50. Blocker: Function Call

    sum = 0;
    for (i = 0; i < fact(n); i++)
      sum += i;

- A compiler cannot optimize this because:
  - the function fact may have side-effects (e.g., it may modify global variables)
  - fact may not return the same value for the same arguments (it may depend on other parts of the global state, which may be modified in the loop)
- Why doesn't the compiler look at the code for fact?
  - The linker may overload it with a different version (unless it is declared static)
  - Interprocedural optimization is not used extensively due to its cost
  - Inlining can achieve the same effect for small procedures
- Again:
  - The compiler treats a procedure call as a black box
  - This weakens optimizations in and around such calls
51. Other Techniques

    while ( ... ) {
      *res = filter[0]*signal[0] + filter[1]*signal[1] + filter[2]*signal[2];
      signal++;
    }

- The following helps some compilers:

    register float f0 = filter[0];
    register float f1 = filter[1];
    register float f2 = filter[2];
    while ( ... ) {
      *res = f0*signal[0] + f1*signal[1] + f2*signal[2];
      signal++;
    }
52. Other Techniques
- Replace pointer updates for strided memory addressing with constant array offsets
- Option 1:

    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

- Option 2:

    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

- Some compilers are better at figuring this out than others
- Some systems may go faster with option 1, others with option 2!
53. Bottom line
- Know your compilers
  - Some are great
  - Some are not so great
  - Some will not do things that you think they should do, often because you forget about things like aliasing
- There is no golden rule, because there are system-dependent behaviors
  - Although the general principles typically hold
- Doing all optimization by hand is a bad idea in general
  - But we're doing it in this class for some of the programming assignments, to truly understand code, hardware, and performance
54. By-hand Optimization of Matrix Multiplication
- Original version:

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

- Hand-optimized version:

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

- Turned array accesses into pointer dereferences
- Assign to each element of c just once
55. Results (Courtesy of CMU)
[Performance plot comparing the two versions, not reproduced]
56. Why is Simple Sometimes Better?
- Easier for humans and the compiler to understand
  - The more the compiler knows, the more it can do
- Pointers are hard to analyze; arrays are easier
- You never know how fast code will run until you time it on a dedicated system
- The transformations we did by hand, good optimizers will often do for us
  - And they will often do a better job than we can, but not always
  - Pointers may cause aliases and data dependences where the array code had none
57. Bottom Line
- How should I write my programs, given that I have a good, optimizing compiler?
- Don't: smash code into oblivion
  - Hard to read, maintain, and ensure correctness
- Do:
  - Select the best algorithm
  - Write code that's readable and maintainable
    - Procedures, recursion, without built-in constant limits
    - Even though these factors can slow down code
  - Eliminate optimization blockers
    - This allows the compiler to do its job
58. Good Performance?
- You have a code that was given to you or that you wrote
- You compile it with your favorite optimizing compiler, and you have removed the obvious optimization blockers
- And then, performance is poor
  - The performance is not sufficient for the code to be used, or to meet deadlines
  - The code could still be usable but lead to long waits, and you can tell that the performance is way below the peak performance
- What do you do?
59. Why is Performance Poor?
- Performance is poor because the code suffers from a performance bottleneck
- Definition:
  - An application runs on a platform that has many components (CPU, memory, operating system, network, hard drive, video card, etc.)
  - Pick a component and make it faster
  - If the application performance increases, that component was the bottleneck!
60. Identifying a Bottleneck
- It can be difficult
  - You're not going to change the memory bus just to see what happens to the application
  - But you can run the code on a different machine and see what happens
- Typical approach:
  - Know/discover the characteristics of the machine
  - Know/discover the characteristics of the application
  - Observe the application's execution on the machine
  - Reason about what the bottleneck is
- Luckily, there are well-known bottlenecks that are likely candidates when performance is poor
61. Removing a Bottleneck
- Brute force: hardware upgrade
  - Sometimes necessary, but it can only get you so far and may be very costly (e.g., memory technology)
- Instead, modify the code
  - The bottleneck is there because the code uses a resource heavily or in a non-intelligent manner
- This is, unfortunately, something we often have to do after the fact
  - You wrote a beautifully structured/modular code
  - It's slow, and you have to decrease readability and modularity to increase performance
62. The Memory Bottleneck
- The memory is a very common bottleneck that beginning programmers often don't think about
- When you look at code, you often pay more attention to computation
  - a[i] = b[j] + c[k];
  - The accesses to the three arrays take more time than doing the addition
- For the code above, the memory is the bottleneck for many machines!
63. Why the Memory Bottleneck?
- In the '70s, everything was balanced
  - The memory kept pace with the CPU: n cycles to execute an instruction, n cycles to bring in a word from memory
- This is no longer true
  - CPUs have gotten 1,000x faster
  - Memories have gotten only 10x faster (and 1,000,000x larger)
- Flops are free, bandwidth is expensive, and processors are STARVED for data
64. Current Memory Technology
[Table of current memory technologies, not reproduced]
Source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html
65. Memory Bottleneck Example
- Fragment of code: a[i] = b[j] + c[k];
  - Three memory references: 2 reads, 1 write
  - One addition, which can be done in one cycle
- If the memory bandwidth is 12.8 GB/sec, then the rate at which the processor can access integers (4 bytes each) is 12.8 × 1024×1024×1024 / 4 ≈ 3.4 GHz
- The above code needs to access 3 integers per addition
  - Therefore the rate at which the code gets its data is ≈ 1.1 GHz
  - But the CPU could perform additions at 4 GHz!
  - Therefore: the memory is the bottleneck
- And we assumed the memory worked at its peak!
  - We ignored other possible overheads on the bus
  - In practice the gap can be around a factor of 15 or higher
66. Reducing the Memory Bottleneck
- The way in which computer architects have dealt with the memory bottleneck is via the memory hierarchy
[Diagram: CPU/registers at the top, then caches, memory (DRAM), and disk; levels get larger, slower, and cheaper as one moves away from the CPU]
  - register reference: sub-ns
  - L1-cache (SRAM) reference: 1-2 cycles
  - L2-cache (SRAM) reference: 10 cycles
  - L3-cache (DRAM) reference: 20 cycles
  - memory (DRAM) reference: hundreds of cycles
  - disk reference: tens of thousands of cycles
67. Locality
- The memory hierarchy is useful because of locality
  - Temporal locality: a memory location that was referenced in the past is likely to be referenced again
  - Spatial locality: a memory location next to one that was referenced in the past is likely to be referenced in the near future
- This is great, but when we write code for performance we want the code to have the maximum amount of locality
- The compiler can do some of the work for us regarding locality
  - But unfortunately not everything
68. Programming for Locality
- Essentially, a programmer should keep a mental picture of the memory layout of the application and reason about locality
- When writing concurrent code on a multi-core architecture, one must also think about which caches are shared/private
- This can be extremely complex, but there are a few well-known techniques
- The typical example is with 2-D arrays
69. Example: 2-D Array Initialization
- Two alternatives:

    int a[200][200];               int a[200][200];
    for (i = 0; i < 200; i++)      for (j = 0; j < 200; j++)
      for (j = 0; j < 200; j++)      for (i = 0; i < 200; i++)
        a[i][j] = 2;                   a[i][j] = 2;

- Which alternative is best?
  - i,j?
  - j,i?
- To answer this, one must understand the memory layout of a 2-D array
70. 2-D Arrays in Memory
- A static 2-D array is one declared as
  - <type> <name>[<size>][<size>]
  - int myarray[10][30];
- The elements of a 2-D array are stored in contiguous memory cells
- The problem is that:
  - The array is 2-D, conceptually
  - Computer memory is 1-D
- 1-D computer memory: a memory location is described by a single number, its address
  - Just like a single axis
- Therefore, there must be a mapping from 2-D to 1-D
  - From a 2-D abstraction to a 1-D implementation (see the sketch below)
71. Mapping from 2-D to 1-D?
[Diagram: the n×n cells of a 2-D array mapped onto 1-D computer memory]
72. Row-Major, Column-Major
- Luckily, only 2 of the (n²)! possible mappings are ever implemented in a language
- Row-Major:
  - Rows are stored contiguously: 1st row, 2nd row, 3rd row, 4th row, ...
- Column-Major:
  - Columns are stored contiguously: 1st column, 2nd column, 3rd column, 4th column, ...
73. Row-Major
[Diagram: the rows of the array laid out in memory by increasing address, spanning consecutive memory/cache lines]
- Matrix elements are stored in contiguous memory lines
74. Row-Major
- C uses row-major order
- First option:

    int a[200][200];
    for (i = 0; i < 200; i++)
      for (j = 0; j < 200; j++)
        a[i][j] = 2;

- Second option:

    int a[200][200];
    for (j = 0; j < 200; j++)
      for (i = 0; i < 200; i++)
        a[i][j] = 2;
75. Counting cache misses
- n×n 2-D array, element size e bytes, cache line size b bytes
- Row-by-row traversal (each cache line holds b/e consecutive elements):
  - One cache miss for every cache line: n² × e / b misses
  - Total number of memory accesses: n²
  - Miss rate: e/b
  - Example: miss rate = 4 bytes / 64 bytes = 6.25%
  - Unless the array is very small
- Column-by-column traversal (each access touches a different cache line):
  - One cache miss for every access
  - Example: miss rate = 100%
  - Unless the array is very small
76. Array Initialization in C
- First option (good locality):

    int a[200][200];
    for (i = 0; i < 200; i++)
      for (j = 0; j < 200; j++)
        a[i][j] = 2;

- Second option (poor locality):

    int a[200][200];
    for (j = 0; j < 200; j++)
      for (i = 0; i < 200; i++)
        a[i][j] = 2;
77. Performance Measurements
- Option 1:

    int a[X][X];
    for (i = 0; i < X; i++)
      for (j = 0; j < X; j++)
        a[i][j] = 2;

- Option 2:

    int a[X][X];
    for (j = 0; j < X; j++)
      for (i = 0; i < X; i++)
        a[i][j] = 2;

[Plot: timings of both options for various array sizes X, measured on the instructor's laptop, not reproduced]
- Note that other languages use column-major order, e.g., FORTRAN
78. Matrix Multiplication
- A classic example for locality-aware programming is matrix multiplication:

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

- There are 6 possible orders for the three loops
  - i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i
- Each order corresponds to a different access pattern for the matrices
- Let's focus on the inner loop, as it is the one that's executed most often
79. Inner Loop Memory Accesses
- Each matrix element can be accessed in one of three modes in the inner loop:
  - Constant: doesn't depend on the inner loop's index
  - Sequential: contiguous addresses
  - Strided: non-contiguous addresses (N elements apart)

             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant
80. Loop order and Performance
- Constant access is better than sequential access
  - It's always good to have constants in loops because they can be put in registers (as we've seen in our very first optimization)
- Sequential access is better than strided access
  - Sequential access utilizes the cache better
- Let's go back to the previous slide's table
81. Best Loop Ordering?

             c[i][j]      a[i][k]      b[k][j]
    i-j-k    Constant     Sequential   Strided
    i-k-j    Sequential   Constant     Sequential
    j-i-k    Constant     Sequential   Strided
    j-k-i    Strided      Strided      Constant
    k-i-j    Sequential   Constant     Sequential
    k-j-i    Strided      Strided      Constant

- k-i-j and i-k-j should have the best performance
- i-j-k and j-i-k should be worse
- j-k-i and k-j-i should be the worst
- You will measure this in a Programming Assignment
82. How good is the best ordering?
- Let us assume that i-k-j is best
- How many cache misses?

    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++) {
        x = a[i][k];
        for (j = 0; j < N; j++)
          c[i][j] += x * b[k][j];
      }

- Clearly this is not easy to compute
  - e.g., if the matrix is twice the size of the cache, there is a lot of loading/evicting, and obtaining a formula would be complicated
- Let L be the cache line size in number of matrix elements
- How about a very coarse approximation, assuming that the matrix is much larger than the cache?
  - Determine which matrix pieces are loaded/written
  - Figure out the expected number of cache misses
83. Slow Memory Operations

    for (i = 0; i < N; i++) {
      // (1) read row i of a into cache
      // (2) write row i of c back to memory
      for (k = 0; k < N; k++) {
        // (3) read column j of b into cache
        for (j = 0; j < N; j++)
          c[i][j] += a[i][k] * b[k][j];
      }
    }

- L = cache line size (in elements)
  - (1): N × (N/L) cache misses
  - (2): N × (N/L) cache misses
  - (3): N × N × N cache misses
    - Although the access to b is sequential, it is sequential along a column, and the matrix is stored in row-major fashion!
- Total: 2N²/L + N³ ≈ N³ (for large N)
84. Bad News
- N³ slow memory operations for 2N³ arithmetic operations
  - Ratio ops/mem = 2
- This is bad news: we know that computer architectures are NOT balanced, and memory operations are orders of magnitude slower than arithmetic operations
- Therefore the memory is still the bottleneck for this implementation of matrix multiplication (the ratio should be much higher)
- BUT: we have only N² matrix elements; how come we perform N³ slow memory accesses?
  - Because we access matrix b very inefficiently, trying to load entire columns one after the other
- Lesson: counting the number of operations and comparing it with the size of the data is not sufficient to ascertain that an algorithm will not suffer from the memory bottleneck
85. Better cache reuse?
- Since we really need only N² elements, perhaps there is a better way to reorganize the operations of the matrix multiplication for a higher number of cache hits
  - Possible because + and × are associative and commutative
- Researchers have spent a lot of time trying to find the best ordering
  - There are even theorems!
- Let q = the ratio of operations to slow memory accesses
  - q must be as high as possible to remove the memory bottleneck
- Hong & Kung, 1981: any reorganization of the algorithm is limited to q = O(√M), where M is the size of the cache (in number of elements)
  - Obtained with a lot of unrealistic assumptions about the cache
  - Still shows that q won't scale with N, unlike what one may think when dividing 2N³ by N²
86. Blocked Matrix Multiplication
- One problem with our implementation is that we try to access entire columns of matrix b
- What about accessing only a subset of a column, or of multiple columns, at a time?
87. Blocked Matrix Multiplication
[Diagram: matrices A, B, and C; element (i,j) of C is computed from row i of A and column j of B, with the cache lines of B highlighted]
- Key idea: reuse the other elements in each cache line as much as possible
88. Blocked Matrix Multiplication
[Diagram: same picture, with b-element pieces of columns j and j+1 of B highlighted]
- May as well compute c[i][j+1], since one loads the cache lines of column j+1 of B anyway. But the operations must be reordered as follows:
  - compute the first b terms of c[i][j], compute the first b terms of c[i][j+1]
  - compute the next b terms of c[i][j], compute the next b terms of c[i][j+1]
  - .....
89. Blocked Matrix Multiplication
[Diagram: same picture, extended to a whole subrow of C]
- May as well compute a whole subrow of C, with the same reordering of the operations
- But by computing a whole row of C, one has to load all the columns of B, which one has to do again for computing the next row of C
- Idea: reuse the blocks of B that we have just loaded
90. Blocked Matrix Multiplication
[Diagram: same picture, with a block of C computed from a block row of A and a block column of B]
- Order of the operations:
  - Compute the first b terms of all c[i][j] values in the C block
  - Compute the next b terms of all c[i][j] values in the C block
  - . . .
  - Compute the last b terms of all c[i][j] values in the C block
91. Blocked Matrix Multiplication
- N = 4b; each matrix is partitioned into a 4×4 grid of b×b blocks:

    C11 C12 C13 C14      A11 A12 A13 A14      B11 B12 B13 B14
    C21 C22 C23 C24      A21 A22 A23 A24      B21 B22 B23 B24
    C31 C32 C33 C34      A31 A32 A33 A34      B31 B32 B33 B34
    C41 C42 C43 C44      A41 A42 A43 A44      B41 B42 B43 B44

- C22 = A21×B12 + A22×B22 + A23×B32 + A24×B42
  - 4 matrix multiplications
  - 4 matrix additions
- Main point: each multiplication operates on small block matrices, whose size may be chosen so that they fit in the cache
92. Blocked Algorithm
- The blocked version of the i-j-k algorithm is written simply as:

    for (i = 0; i < N/b; i++)
      for (j = 0; j < N/b; j++)
        for (k = 0; k < N/b; k++)
          C[i][j] += A[i][k] * B[k][j];

  - where b is the block size (which we assume divides N)
  - where X[i][j] is the block of matrix X on block row i and block column j
  - where += means matrix addition
  - where * means matrix multiplication
  - (a full C sketch is given below)
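For concreteness, a minimal C sketch of the blocked algorithm on plain 2-D arrays; the six-loop structure is standard, and N, b, and the array names are illustrative (with b dividing N):

    #define N 512
    #define b 64              /* block size; must divide N */

    double A[N][N], B[N][N], C[N][N];

    void blocked_matmul(void) {
        /* Outer three loops iterate over blocks... */
        for (int bi = 0; bi < N; bi += b)
            for (int bj = 0; bj < N; bj += b)
                for (int bk = 0; bk < N; bk += b)
                    /* ...inner three loops multiply one pair of b-by-b
                       blocks, using the i-k-j order from earlier slides */
                    for (int i = bi; i < bi + b; i++)
                        for (int k = bk; k < bk + b; k++) {
                            double x = A[i][k];
                            for (int j = bj; j < bj + b; j++)
                                C[i][j] += x * B[k][j];
                        }
    }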
93. Cache Misses?

    for (i = 0; i < N/b; i++)
      for (j = 0; j < N/b; j++) {
        // (1) write block C[i][j] to memory
        for (k = 0; k < N/b; k++) {
          // (2) load block A[i][k] from memory
          // (3) load block B[k][j] from memory
          C[i][j] += A[i][k] * B[k][j];
        }
      }

- (1): (N/b) × (N/b) × b×b misses
- (2): (N/b) × (N/b) × (N/b) × b×b misses
- (3): (N/b) × (N/b) × (N/b) × b×b misses
- Total: N² + 2N³/b ≈ 2N³/b
94. Performance?
- Slow memory accesses: 2N³/b
- Number of operations: 2N³
- Therefore, ratio ops/mem = b
- This ratio should be as high as possible
  - (compare to the value of 2 that we obtained with the non-blocked implementation)
- This implies that one should make the block size as large as possible
- But if we take this result to the extreme, then the block size should be equal to N!!
  - That clearly doesn't make sense, because then we're back to the non-blocked implementation
95. Maximum Block Size
- The blocking optimization only works if the blocks fit in the cache
  - That is, 3 blocks of size b×b must fit in the cache (for A, B, and C)
- Let M be the cache size (in elements)
- We must have 3b² ≤ M, i.e., b ≤ √(M/3)
- Therefore, in the best case, the ratio of operations to slow memory accesses is √(M/3); a small worked example follows
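As a quick sanity check of this formula, with an assumed (purely illustrative) cache size:

    /* A 32 KB cache holding 8-byte doubles: M = 32768 / 8 = 4096 elements.
       Then b <= sqrt(4096/3) ~= 36, so blocks of up to ~36x36 doubles fit,
       and the best ops/mem ratio is ~36, versus 2 for the non-blocked code. */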
96. Optimizing Further?
- At this point we know that blocking is a good idea
- It turns out that the best block size isn't that easy to determine
- There are many other things we could do to the code:
  - loop unrolling
  - instruction reordering
  - ...
- There are many things the compiler can do to the code, and there are many compiler flags we could use
- In the end, how do we determine the best implementation for a given architecture?
97. Automatic Program Generation
- It is difficult to optimize code because:
  - There are many possible options for tuning/modifying the code
  - These options interact in complex ways with the compiler and the hardware
- This is really an optimization problem:
  - The objective function is the code's performance
  - The feasible solutions are all the possible ways to implement the software
    - Typically a finite number of implementation decisions are to be made
    - Each decision can take a range of values
      - e.g., the 7th loop in the 3rd function can be unrolled 1, 2, ..., 20 times
      - e.g., the block size could be 2x2, 4x4, ..., 400x400
      - e.g., a function could be recursive or iterative
- And one needs to do it again and again for different platforms
98. Automatic Program Generation
- What is good at solving hard optimization problems? Computers!
- Therefore, a computer program could generate the computer program with the best performance
  - Could use a brute-force approach: try all possible solutions
    - but there is an exponential number of them
  - Could use genetic algorithms
  - Could use some ad-hoc optimization technique
99. Matrix Multiplication
- We have seen that for matrix multiplication there are several possible ways to optimize the code:
  - block size
  - optimization flags passed to the compiler
  - order of the loops
  - ...
- It is difficult to find the best combination
- People have written automatic matrix multiplication program generators!
100. The ATLAS Project
- ATLAS is a piece of software that you can download and run on most platforms
- It runs for a while (perhaps a couple of hours) and generates a .c file that implements matrix multiplication!
- ATLAS optimizes for:
  - Instruction cache reuse
  - Floating point instruction ordering
    - pipelined functional units
  - Reducing loop overhead
  - Exposing parallelism
    - multiple functional units
  - Cache reuse
101. ATLAS (500x500 matrices)
[Plot: MFlops achieved by various BLAS implementations, not reproduced. Source: Jack Dongarra]
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor
102. Improving an Application
- So, we have seen ways in which to improve pieces of code
- The problem is that one typically doesn't have an application that just performs an array initialization or a matrix multiplication
- In fact, there are many parts of an application that one could think of optimizing for memory, etc.
103. Profiling
- Question: how do we know which part of the code is the most expensive?
  - If you've not written the code, you may not know
  - If you've written the code, you may have some idea (although experience shows that many programmers don't)
  - The most expensive part may be in some library function you haven't written
- You could put gettimeofday() calls everywhere, but that gets really cumbersome for large projects
- The standard way: use a profiler
104. What is a Profiler?
- A profiler is a tool that monitors the execution of a program and reports the amount of time spent in different functions
- Useful for identifying the expensive functions
- Profiling cycle:
  - Compile the code with the profiler
  - Run the code
  - Identify the most expensive function
  - Optimize that function: call it less often if possible, or make it faster
  - Repeat until you can't think of any way to further optimize the most expensive function
- UNIX has a good, free profiler called gprof
105. Using gprof
- Compile your code using gcc with the -pg option
- Run your code until completion
- Then run gprof with your program's name as its single command-line argument
- Example:

    gcc -pg prog.c -o prog
    ./prog
    gprof prog > profile_file

- The output file contains all the profiling information
106. Profiling output
- The content of the file is explained in detail in the file itself
- At the beginning of the file is a summary of which fraction of the time is spent in which function
- In the middle section is a detailed entry for each function
- At the end of the file is a function index, in which each function is assigned a number in brackets, e.g., [3]
107. Profiling Output
- Flat profile summary ("self seconds": time spent in the function itself; "cumulative seconds": time spent in the function and its children):

    %time  cumulative  self
           seconds     seconds  name
    30.9   0.77        0.77     ___multadd_D2A [1]
    16.9   1.19        0.42     _scheduler <cycle 1> [3]
    15.3   1.57        0.38     _scandir [5]
     9.2   1.80        0.23     _NSLookupAndBindSymbolHint [6]
     6.4   1.96        0.16     _job <cycle 1> [8]
     4.4   2.07        0.11     _NSIsSymbolNameDefinedHint [9]
     1.6   2.11        0.04     _hash_nkey [10]
     1.6   2.15        0.04     _pthread_key_create [11]
     1.2   2.18        0.03     ___quorem_D2A [12]
     1.2   2.21        0.03     __mh_dylib_header [13]
     1.2   2.24        0.03     _probe_submitter [14]
     1.2   2.27        0.03     _request_submitter [15]
108. Profiling output
- The middle section of the file provides detailed information for each function
- Entry format:

    index  %time  self  children  called    name
                  1.21  3.10      80/132      f1 [111]
                  0.69  1.13      52/132      f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231     c [39]

- The exact format can vary depending on the version of gprof
- You should really read the explanations in the file to be sure
109. Profiling output
- The entry above describes the function func [1]: the line starting with [1] is the function itself
110. Profiling output
- The lines above it are its parents: f1 [111] and f2 [123]
111. Profiling output
- The line below it lists its children: c [39]
112. Profiling output
- Together these lines form the call graph: f1 and f2 call func, and func calls c
113. Profiling output
- The "called" column gives the call counts: f1 calls func 80 times and f2 calls it 52 times, out of 132 calls to func in total; func makes 32 of the 5231 total calls to c
114. Profiling output
- Parents: f1 [111], f2 [123]

    index  %time  self  children  called    name
                  1.21  3.10      80/132      f1 [111]
                  0.69  1.13      52/132      f2 [123]
    [1]    23.1   2.12  4.23      132