Title: CME212 Introduction to Large-Scale Computing in Engineering
1. CME212 Lecture 16
- Cache Blocking, Unrolling, and Assignment 3 Roundup
2. Writing Cache Friendly Code
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Examples (cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
3. Locality Example
- Question: Can you permute the loops so that the function scans the 3-D array a with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
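One possible answer (a sketch; the slide leaves the permutation as a question): since j is the last, contiguous index of a[k][i][j], any ordering with j innermost gives stride-1 access, for example with k outermost:

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    /* j varies fastest, matching the contiguous last index of a[k][i][j] */
    for (k = 0; k < N; k++)
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}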
4. Matrix Multiplication Example
- Major cache effects to consider
  - Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
  - Block size: exploit spatial locality
- Description
  - Multiply N x N matrices
  - O(N^3) total operations
  - Accesses: N reads per source element; N values summed per destination (but the variable sum is held in a register)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
5. Miss Rate Analysis for Matrix Multiply
- Assume
  - Line size = 32 B (big enough for four 64-bit words)
  - Matrix dimension N is very large: approximate 1/N as 0.0
  - Cache is not even big enough to hold multiple rows
- Analysis method
  - Look at the access pattern of the inner loop
6. Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A is scanned row-wise along row i (i,*), B is scanned down column j (*,j), C touches only the fixed element (i,j).

Misses per inner loop iteration:
    A      B      C
    0.25   1.00   0.00
7. Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A scanned along row i (i,*), B scanned down column j (*,j), C fixed at (i,j).

Misses per inner loop iteration:
    A      B      C
    0.25   1.00   0.00
8. Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned along row k (k,*), C scanned along row i (i,*).

Misses per inner loop iteration:
    A      B      C
    0.00   0.25   0.25
9. Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned along row k (k,*), C scanned along row i (i,*).

Misses per inner loop iteration:
    A      B      C
    0.00   0.25   0.25
10. Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned down column k (*,k), B fixed at (k,j), C scanned down column j (*,j).

Misses per inner loop iteration:
    A      B      C
    1.00   0.00   1.00
11. Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++)
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned down column k (*,k), B fixed at (k,j), C scanned down column j (*,j).

Misses per inner loop iteration:
    A      B      C
    1.00   0.00   1.00
12. Summary of Matrix Multiplication
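The summary figure itself is not reproduced here. Summing the per-array miss rates from the preceding slides (same cache assumptions) gives the totals per inner loop iteration:

    ijk and jik:  0.25 + 1.00 + 0.00 = 1.25 misses (2 loads, 0 stores)
    kij and ikj:  0.00 + 0.25 + 0.25 = 0.50 misses (2 loads, 1 store)
    jki and kji:  1.00 + 0.00 + 1.00 = 2.00 misses (2 loads, 1 store)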
13. Pentium Matrix Multiply Performance
- Miss rates are helpful but not perfect predictors.
- Code scheduling matters, too.
14. Improving Temporal Locality by Blocking
- Example: blocked matrix multiplication
  - "Block" (in this context) does not mean cache block
  - Instead, it means a sub-block within the matrix
  - Example: N = 8, sub-block size = 4
- Key idea: sub-blocks (i.e., the Axy) can be treated just like scalars

    [C11 C12]   [A11 A12]   [B11 B12]
    [C21 C22] = [A21 A22] x [B21 B22]

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
15. Blocking

/* Unoptimized ARRAY: X = Y * Z */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

(Figure: access patterns: X at (i,j), row i of Y swept over k, column j of Z swept over k.)
16. Blocking

/* Optimized ARRAY: X = Y * Z, blocked with block size B */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

(Figure: the first jj/kk block and the partial solution accumulated into X.)
17. Blocked Matrix Multiply (bijk)

for (jj = 0; jj < n; jj += bsize) {
    for (i = 0; i < n; i++)
        for (j = jj; j < min(jj+bsize, n); j++)
            c[i][j] = 0.0;
    for (kk = 0; kk < n; kk += bsize)
        for (i = 0; i < n; i++)
            for (j = jj; j < min(jj+bsize, n); j++) {
                sum = 0.0;
                for (k = kk; k < min(kk+bsize, n); k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] += sum;
            }
}
18. Blocked Matrix Multiply Analysis
- Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
- Loop over i steps through n row slivers of A and C, using the same block of B
19. Blocking, Analysis
20. Pentium Blocked Matrix Multiply Performance
- Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
- Relatively insensitive to array size
21. Sparse Matrices
- Many applications result in very sparse matrices
  - Discretizations of PDEs
- We can store only the nonzero elements
  - Lots of different formats
- Need to look up indices and values
  - Much slower than full matrices
- Sparsity enables us to solve much larger problems
22. Example Matrix
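The slide's example matrix appears only as a figure. As a stand-in illustration of the compressed sparse row (CSR) layout used on the next slide, a small hypothetical 4 x 4 matrix could be stored as follows (0-based indices; this matrix is not the one from the lecture):

    A = | 1 0 2 0 |        val     = [1 2 3 4 5 6 7 8]
        | 0 3 0 4 |        col_ind = [0 2 1 3 0 2 1 3]
        | 5 0 6 0 |        row_ptr = [0 2 4 6 8]
        | 0 7 0 8 |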
23. Matrix-Vector Product, Compressed Sparse Row (CSR)
- Look up indices (row_ptr holds pointers into the array of column indices)
- Loop length determined by the number of nonzeros per row
- Indirect addressing into x through col_ind

for i = 1, n
    y(i) = 0
    for j = row_ptr(i), row_ptr(i+1) - 1
        y(i) = y(i) + val(j) * x(col_ind(j))
    end
end
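A minimal C sketch of the same CSR matrix-vector product, using 0-based indexing; the array and function names are illustrative, not from a specific library:

/* y = A*x for an n-by-n CSR matrix (0-based indexing).
   row_ptr has n+1 entries; row_ptr[i] .. row_ptr[i+1]-1 locate the
   nonzeros of row i in val[] and their columns in col_ind[]. */
void csr_matvec(int n, const int *row_ptr, const int *col_ind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
            sum += val[j] * x[col_ind[j]];   /* indirect access into x */
        y[i] = sum;
    }
}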
24. Software Prefetching
- Prefetching: guess what data is needed and put it in the cache
- "Run-ahead scouting"
- You can hide some of the latency by inserting prefetch instructions
- Increases bandwidth
- On Sun:
  - -xprefetch=no
  - -xprefetch_level=n, n = 1, 2, or 3
  - void sparc_prefetch_read_once(address)
  - void sparc_prefetch_read_many(address)
  - void sparc_prefetch_write_once(address)
  - void sparc_prefetch_write_many(address)
  - C$PRAGMA SPARC_PREFETCH_READ_ONCE(address)
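For compilers other than Sun's, the same idea can be sketched with the GCC/Clang builtin __builtin_prefetch; this is an illustrative alternative, not the Sun interface listed above, and the prefetch distance is an assumption to be tuned:

/* Sketch: prefetch elements a fixed distance ahead of their use. */
#define AHEAD 16   /* prefetch distance in elements (assumed; tune per machine) */

double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}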
25. Concluding Observations
- The programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed (nested loop structure)
- Blocking is a general technique
- All systems favor cache-friendly code
- Getting absolute optimum performance is very platform specific (cache sizes, line sizes, etc.)
- You can get most of the advantage with generic code
  - Keep the working set reasonably small (temporal locality)
  - Use small strides (spatial locality)
26. Matrix-Vector Products
27. C Implementations

Row variant:
for (i = 0; i < m; i++) {
    sum = 0.0;
    for (j = 0; j < n; j++)
        sum += A(i,j) * x[j];
    v[i] = sum;
}

Column variant:
for (i = 0; i < m; i++)
    v[i] = 0.0;
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        v[i] += A(i,j) * x[j];
28. MxV in BLAS
- Implemented in the Basic Linear Algebra Subprograms (BLAS)
  - DGEMV: double precision
  - CGEMV: complex
  - SGEMV: single precision
- More general form: y := alpha*A*x + beta*y
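For illustration, a call through the C interface (CBLAS) might look like the sketch below; the header name and link flags depend on the installed BLAS, and the row-major layout here is an assumption:

#include <cblas.h>

/* v = 1.0*A*x + 0.0*v, with A stored as one contiguous m-by-n row-major block */
void matvec_dgemv(int m, int n, const double *A, const double *x, double *v)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n,    /* alpha, A, leading dimension */
                x, 1,         /* x and its stride */
                0.0, v, 1);   /* beta, v and its stride */
}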
29. Multi-dimensional Arrays in C
- There are two fundamental ways of representing the matrix A
- Nested array
  - One-dimensional array
  - Indexing using arithmetic: A(i,j) = a[i*n + j]
- Multi-level array
  - Array of pointers
  - Bracket indexing: A(i,j) = a[i][j]
- DGEMV in BLAS uses the nested-array representation (see the sketch below)
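A brief sketch contrasting the two representations (the function names are illustrative):

#include <stdlib.h>

/* Nested (contiguous) representation: one allocation, index arithmetic. */
double *alloc_nested(int m, int n)
{
    return malloc((size_t)m * n * sizeof(double));  /* element (i,j) is a[i*n + j] */
}

/* Multi-level representation: an array of row pointers, one allocation per row. */
double **alloc_multilevel(int m, int n)
{
    double **a = malloc(m * sizeof(double *));
    for (int i = 0; i < m; i++)
        a[i] = malloc(n * sizeof(double));          /* element (i,j) is a[i][j] */
    return a;
}

The multi-level form needs an extra pointer load per element access, which is one reason it tends to be slower for dense linear algebra.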
30. Performance Comparison
31. Memory Consumption
- One 2-D array (A), two 1-D arrays (v, x)
- Use nested arrays (DGEMV requires the nested-array representation)
- Assume m = n (square matrix)
- Datatype is double: 64 bits = 8 B
- Total memory = 8n^2 + 16n bytes
32. UltraSPARC III Cu
- Experiment run on an UltraSPARC III Cu
  - 900 MHz clock frequency
  - One FP add and one FP multiply per cycle, plus one memory operation
  - L1 cache: on-chip, 64 kB
  - L2 cache: off-chip, 8 MB
- Problem fits in L2 if 8n^2 + 16n < 8 MB
  - Roughly, 8n^2 < 8 MB gives n < about 1000
33. Compiler Flags
- Code compiled using the Sun Studio compilers
- Flags:
  - -xarch=v9b (64-bit)
  - -xO3 (basic optimizations)
34. Performance Again
35. Using the Compiler
- Added the flags:
  - -xtarget=ultra3cu
  - -xcache=64/32/4:8192/512/2 (from fpversion)
  - -fast
  - -xrestrict (aliasing)
  - -xdepend (analyze loop dependencies)
  - -xprefetch=yes (do software prefetching)
36. Compiler Optimized
37. Effect of Flags, Row Version
38. Effect of Flags, Column Version
39. Effect of Flags, Library Version
40. Effect of Representation, Multi-level
41. Manual Unrolling

Column variant, unrolled by 2 over j:
for (j = 0; j < n - jrem; j += 2) {
    (void) dummy(a[j]);
    for (i = 0; i < m; i++)
        a[i] += B(i,j)*c[j] + B(i,j+1)*c[j+1];
}
for (j = n - jrem; j < n; j++)          /* remainder columns */
    for (i = 0; i < m; i++)
        a[i] += B(i,j)*c[j];

Row variant, unrolled by 2 over i:
for (i = 0; i < m - irem; i += 2) {
    sum0 = 0.0;
    sum1 = 0.0;
    for (j = 0; j < n; j++) {
        sum0 += B(i,j)*c[j];
        sum1 += B(i+1,j)*c[j];
    }
    a[i]   = sum0;
    a[i+1] = sum1;
}
42. Effect of Unrolling
- No progress!
43. Cache Analysis
- Spatial locality
  - Row version: stride-1 access
  - Column version: stride-m access, BAD
- Temporal locality between MxVs
  - Row version: as good as it gets
  - Column version: the same
44. Cache Analysis, Column Version
- The first column will miss on all rows
- If the blocks are still in cache, accesses will hit when accessing the second column
- Memory touched per column = m * cache_block_size
  - L1: 32m bytes
  - L2: 64m bytes
- We also need the array v
45. Cache Analysis, cont'd
- L1
  - One cache line contains 32/8 = 4 elements
  - The L1 can store 64 kB / 32 B = 2048 lines
  - Total of 4 * 2048 = 8192 elements
- L2
  - One cache line contains 64/8 = 8 elements
  - The L2 can store 8 MB / 64 B = 131072 lines
  - Total of 131072 * 8 = 1048576 elements
- TLB
  - 512 entries of 8 kB pages = 4 MB of data (TLB reach)
  - Each 8 kB page holds 1024 elements
  - Total of 512 * 1024 = 524288 elements
46. Cache Analysis
- The L1 cache is full after accessing 2048 elements in column order, since we bring in whole lines
- Because of LRU replacement, the first elements are evicted if we access more than 2048 elements
- Consider the case where m = n = 2048
  - The L1 cache is full after the first column -> replacements
  - The L2 cache is full after 64 columns
  - When i > 256 we need to replace entries in the TLB
47. Increasing the Page Size
- Many CPUs and OSes have support for larger page sizes
- Increases the TLB reach, since we can use fewer entries to map a large space
- Only works for large chunks of data
- UltraSPARC supports 8 kB, 64 kB, 512 kB and 4 MB pages
- ppgsz -o heap=512k
48. Conclusions
- Compiler options have large effects
- The Sun compiler is more conservative for C than for Fortran (aliasing, unrolling)
- Hand tuning such as unrolling and blocking can help or destroy performance
- Blocking can typically not be performed by the compiler
- Representations matter for performance
  - Multi-level arrays are often slower due to the amount of memory referencing needed
- Peak performance is 900 MHz * 2 = 1.8 GFlop/s
  - We got to about 1.2 GFlop/s
  - Still some tweaking to do
49. N-body Wrap-up
- The Barnes-Hut method is faster but also less accurate
- If theta is too large, we may miss important physics
- Adds a lot of programming complexity and also indirection
- Following linked lists or trees is not very efficient on modern CPUs (more addressing)
50. Representation of Particles
- A particle has a mass, position and velocity
- Is the force really a property of a particle?
- Two ways of representing the particles
  - One big array of all variables
  - Several arrays, one for each variable
51. One Big Array

struct _particle {
    double position[2];
    double velocity[2];
    double force[2];
    double mass;
};

struct _particle *particles;
52. View of Memory
53. Several Arrays

double *positions, *velocities, *forces, *masses;

positions  = malloc(2*N*sizeof(double));
velocities = malloc(2*N*sizeof(double));
forces     = malloc(2*N*sizeof(double));
masses     = malloc(N*sizeof(double));
54. View of Memory

forces:     fx fy fx fy ...
masses:     m  m  ...
positions:  px py px py ...
velocities: ux uy ux uy ...
55. Naïve Force Calculation

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        if (i != j)
            forces[i] += compute_forces(particles[i], particles[j]);
56. Remarks
- Forces are symmetric: copy the force with the opposite sign
- Removes the conditional within the loop
- Velocities are not used in the calculations
- What about cache hit ratios?

for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++) {
        forces[i] += compute_forces(particles[i], particles[j]);
        forces[j] -= compute_forces(particles[j], particles[i]);
    }
57. Big Array
(Figure: the array of particle structs laid out across consecutive cache blocks.)
- There is potential for spatial locality in the inner loop
- Space for velocity is never used: waste of cache space
- Struct padding?
58. Many Arrays
(Layout: positions px py px py ..., forces fx fy fx fy ..., masses m m ...)
- Velocity components are not cached
- Still some potential for spatial locality within each variable, but probably not between variables
59. Force Calculations as Linear Algebra
- The force calculations can be written out in linear algebra
(Equation shown as a figure; the annotation labels the vector of all masses.)
60. Distance Matrix
- The matrix D is symmetric and dense
- The diagonal is zero
- Needs to be updated every time step: new positions -> new distances
- Small elements in D will indicate long-range forces
- Optimizations to the force calculations can be seen as approximating this matrix with a sparse one
61. Barnes-Hut
- Just a way of recursively dividing up your domain
- Uses mass lumping and a cut-off (theta) to approximate far-field forces
- Each tree node stores a reference to the particle arrays: an index into the array or arrays
- Also needs the center of mass, total mass, etc. (a possible node layout is sketched below)
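One possible node layout for such a quadtree (a sketch only; the field names are assumptions, not taken from the assignment code):

/* Hypothetical 2-D Barnes-Hut quadtree node: holds indices into the
   particle arrays plus the aggregates used for far-field mass lumping. */
struct bh_node {
    double center_of_mass[2];      /* mass-weighted mean position of the subtree */
    double total_mass;             /* lumped mass of the subtree */
    double box_min[2], box_max[2]; /* spatial extent of this cell */
    int    *particle_idx;          /* indices into the particle array(s) */
    int    num_particles;
    struct bh_node *child[4];      /* NULL children for a leaf */
};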
62. Locality for the Barnes-Hut Case
- We will not use the particle array indices directly in the force calculations
- Instead, these are looked up from the tree node
  - Depends on where we are in the domain
  - More irregular access patterns
- We can reorder the particles to ensure that indices that are close also correspond to particles that are close in space
  - Space-filling curves (see the sketch below)
  - Bandwidth minimization (matrix case)
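As an illustration of the space-filling-curve idea, a Morton (Z-order) key is one common choice; this sketch is illustrative and not necessarily what the assignment uses:

/* Interleave the bits of two 16-bit cell coordinates into a 32-bit
   Morton key; sorting particles by this key keeps particles that are
   close in space close in memory as well. */
static unsigned int spread_bits(unsigned int x)
{
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

unsigned int morton_key(unsigned int ix, unsigned int iy)
{
    return (spread_bits(iy) << 1) | spread_bits(ix);
}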
63. Other Issues
- Risk of cancellation when calculating the forces
  - Try different precisions and rounding modes to smoke it out
- Symmetric forces in the Barnes-Hut case?
- Visualization: how many particles can we draw in each display frame (30 Hz)?
64. Discussion
- Last one! None next week
- Topic: BLAS and LAPACK
- Bldg 320, Rm 220, 4:15-5:30