Title: CME212 Introduction to Large-Scale Computing in Engineering
1. CME212 Lecture 16
- Cache Blocking, Unrolling, and Assignment 3 Roundup
2. Writing Cache Friendly Code
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Examples (cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
3. Locality Example
- Question: Can you permute the loops so that the function scans the 3-D array a with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
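One possible answer (a sketch; the slide leaves the permutation as a question): since j is the last, contiguous index of a[k][i][j], any ordering with j innermost gives stride-1 access, for example with k outermost:

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    /* j varies fastest, matching the contiguous last index of a[k][i][j] */
    for (k = 0; k < N; k++)
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}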
4. Matrix Multiplication Example
- Major cache effects to consider
  - Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
  - Block size: exploit spatial locality
- Description
  - Multiply N x N matrices
  - O(N^3) total operations
  - Accesses: N reads per source element; N values summed per destination (but the variable sum is held in a register)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
5. Miss Rate Analysis for Matrix Multiply
- Assume
  - Line size = 32 B (big enough for four 64-bit words)
  - Matrix dimension N is very large: approximate 1/N as 0.0
  - Cache is not even big enough to hold multiple rows
- Analysis method
  - Look at the access pattern of the inner loop
6. Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A is scanned row-wise along row i (i,*), B is scanned down column j (*,j), C touches only the fixed element (i,j).

Misses per inner loop iteration:
    A      B      C
    0.25   1.00   0.00
7. Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A scanned along row i (i,*), B scanned down column j (*,j), C fixed at (i,j).

Misses per inner loop iteration:
    A      B      C
    0.25   1.00   0.00
8. Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned along row k (k,*), C scanned along row i (i,*).

Misses per inner loop iteration:
    A      B      C
    0.00   0.25   0.25
9. Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned along row k (k,*), C scanned along row i (i,*).

Misses per inner loop iteration:
    A      B      C
    0.00   0.25   0.25
10. Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned down column k (*,k), B fixed at (k,j), C scanned down column j (*,j).

Misses per inner loop iteration:
    A      B      C
    1.00   0.00   1.00
11. Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++)
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned down column k (*,k), B fixed at (k,j), C scanned down column j (*,j).

Misses per inner loop iteration:
    A      B      C
    1.00   0.00   1.00
12. Summary of Matrix Multiplication
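The summary figure itself is not reproduced here. Summing the per-array miss rates from the preceding slides (same cache assumptions) gives the totals per inner loop iteration:

    ijk and jik:  0.25 + 1.00 + 0.00 = 1.25 misses (2 loads, 0 stores)
    kij and ikj:  0.00 + 0.25 + 0.25 = 0.50 misses (2 loads, 1 store)
    jki and kji:  1.00 + 0.00 + 1.00 = 2.00 misses (2 loads, 1 store)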
13. Pentium Matrix Multiply Performance
- Miss rates are helpful but not perfect predictors.
- Code scheduling matters, too.
14. Improving Temporal Locality by Blocking
- Example: blocked matrix multiplication
  - "Block" (in this context) does not mean cache block
  - Instead, it means a sub-block within the matrix
  - Example: N = 8, sub-block size = 4
- Key idea: sub-blocks (i.e., the Axy) can be treated just like scalars

    [C11 C12]   [A11 A12]   [B11 B12]
    [C21 C22] = [A21 A22] x [B21 B22]

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
15. Blocking

/* Unoptimized ARRAY: X = Y * Z */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

(Figure: access patterns: X at (i,j), row i of Y swept over k, column j of Z swept over k.)
16. Blocking

/* Optimized ARRAY: X = Y * Z, blocked with block size B */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

(Figure: the first jj/kk block and the partial solution accumulated into X.)
17. Blocked Matrix Multiply (bijk)

for (jj = 0; jj < n; jj += bsize) {
    for (i = 0; i < n; i++)
        for (j = jj; j < min(jj+bsize, n); j++)
            c[i][j] = 0.0;
    for (kk = 0; kk < n; kk += bsize)
        for (i = 0; i < n; i++)
            for (j = jj; j < min(jj+bsize, n); j++) {
                sum = 0.0;
                for (k = kk; k < min(kk+bsize, n); k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] += sum;
            }
}
18. Blocked Matrix Multiply Analysis
- Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
- Loop over i steps through n row slivers of A and C, using the same block of B
19. Blocking, Analysis
20. Pentium Blocked Matrix Multiply Performance
- Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
- Relatively insensitive to array size
21. Sparse Matrices
- Many applications result in very sparse matrices
  - Discretizations of PDEs
- We can store only the nonzero elements
  - Lots of different formats
- Need to look up indices and values
  - Much slower than full matrices
- Sparsity enables us to solve much larger problems
22. Example Matrix
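The slide's example matrix appears only as a figure. As a stand-in illustration of the compressed sparse row (CSR) layout used on the next slide, a small hypothetical 4 x 4 matrix could be stored as follows (0-based indices; this matrix is not the one from the lecture):

    A = | 1 0 2 0 |        val     = [1 2 3 4 5 6 7 8]
        | 0 3 0 4 |        col_ind = [0 2 1 3 0 2 1 3]
        | 5 0 6 0 |        row_ptr = [0 2 4 6 8]
        | 0 7 0 8 |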
23. Matrix-Vector Product, Compressed Sparse Row (CSR)
- Look up indices (row_ptr holds pointers into the array of column indices)
- Loop length determined by the number of nonzeros per row
- Indirect addressing into x through col_ind

for i = 1, n
    y(i) = 0
    for j = row_ptr(i), row_ptr(i+1) - 1
        y(i) = y(i) + val(j) * x(col_ind(j))
    end
end
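A minimal C sketch of the same CSR matrix-vector product, using 0-based indexing; the array and function names are illustrative, not from a specific library:

/* y = A*x for an n-by-n CSR matrix (0-based indexing).
   row_ptr has n+1 entries; row_ptr[i] .. row_ptr[i+1]-1 locate the
   nonzeros of row i in val[] and their columns in col_ind[]. */
void csr_matvec(int n, const int *row_ptr, const int *col_ind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
            sum += val[j] * x[col_ind[j]];   /* indirect access into x */
        y[i] = sum;
    }
}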
24. Software Prefetching
- Prefetching: guess what data is needed and put it in the cache
- "Run-ahead scouting"
- You can hide some of the latency by inserting prefetch instructions
- Increases bandwidth
- On Sun:
  - -xprefetch=no
  - -xprefetch_level=n, n = 1, 2, or 3
  - void sparc_prefetch_read_once(address)
  - void sparc_prefetch_read_many(address)
  - void sparc_prefetch_write_once(address)
  - void sparc_prefetch_write_many(address)
  - C$PRAGMA SPARC_PREFETCH_READ_ONCE(address)
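For compilers other than Sun's, the same idea can be sketched with the GCC/Clang builtin __builtin_prefetch; this is an illustrative alternative, not the Sun interface listed above, and the prefetch distance is an assumption to be tuned:

/* Sketch: prefetch elements a fixed distance ahead of their use. */
#define AHEAD 16   /* prefetch distance in elements (assumed; tune per machine) */

double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}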
25. Concluding Observations
- The programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed (nested loop structure)
- Blocking is a general technique
- All systems favor cache-friendly code
- Getting absolute optimum performance is very platform specific (cache sizes, line sizes, etc.)
- You can get most of the advantage with generic code
  - Keep the working set reasonably small (temporal locality)
  - Use small strides (spatial locality)
26. Matrix-Vector Products
27. C Implementations

Row variant:
for (i = 0; i < m; i++) {
    sum = 0.0;
    for (j = 0; j < n; j++)
        sum += A(i,j) * x[j];
    v[i] = sum;
}

Column variant:
for (i = 0; i < m; i++)
    v[i] = 0.0;
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        v[i] += A(i,j) * x[j];
28. MxV in BLAS
- Implemented in the Basic Linear Algebra Subprograms (BLAS)
  - DGEMV: double precision
  - CGEMV: complex
  - SGEMV: single precision
- More general form: y := alpha*A*x + beta*y
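For illustration, a call through the C interface (CBLAS) might look like the sketch below; the header name and link flags depend on the installed BLAS, and the row-major layout here is an assumption:

#include <cblas.h>

/* v = 1.0*A*x + 0.0*v, with A stored as one contiguous m-by-n row-major block */
void matvec_dgemv(int m, int n, const double *A, const double *x, double *v)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n,    /* alpha, A, leading dimension */
                x, 1,         /* x and its stride */
                0.0, v, 1);   /* beta, v and its stride */
}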
29. Multi-dimensional Arrays in C
- There are two fundamental ways of representing the matrix A
- Nested array
  - One-dimensional array
  - Indexing using arithmetic: A(i,j) = a[i*n + j]
- Multi-level array
  - Array of pointers
  - Bracket indexing: A(i,j) = a[i][j]
- DGEMV in BLAS uses the nested-array representation (see the sketch below)
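A brief sketch contrasting the two representations (the function names are illustrative):

#include <stdlib.h>

/* Nested (contiguous) representation: one allocation, index arithmetic. */
double *alloc_nested(int m, int n)
{
    return malloc((size_t)m * n * sizeof(double));  /* element (i,j) is a[i*n + j] */
}

/* Multi-level representation: an array of row pointers, one allocation per row. */
double **alloc_multilevel(int m, int n)
{
    double **a = malloc(m * sizeof(double *));
    for (int i = 0; i < m; i++)
        a[i] = malloc(n * sizeof(double));          /* element (i,j) is a[i][j] */
    return a;
}

The multi-level form needs an extra pointer load per element access, which is one reason it tends to be slower for dense linear algebra.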
30. Performance Comparison
31. Memory Consumption
- One 2-D array (A), two 1-D arrays (v, x)
- Use nested arrays (DGEMV requires the nested-array representation)
- Assume m = n (square matrix)
- Datatype is double: 64 bits = 8 B
- Total memory = 8n^2 + 16n bytes
32. UltraSPARC III Cu
- Experiment run on an UltraSPARC III Cu
  - 900 MHz clock frequency
  - One FP add and one FP multiply per cycle, plus one memory operation
  - L1 cache: on-chip, 64 kB
  - L2 cache: off-chip, 8 MB
- Problem fits in L2 if 8n^2 + 16n < 8 MB
  - Roughly, 8n^2 < 8 MB gives n < about 1000
33. Compiler Flags
- Code compiled using the Sun Studio compilers
- Flags:
  - -xarch=v9b (64-bit)
  - -xO3 (basic optimizations)
34. Performance Again
35. Using the Compiler
- Added the flags:
  - -xtarget=ultra3cu
  - -xcache=64/32/4:8192/512/2 (from fpversion)
  - -fast
  - -xrestrict (aliasing)
  - -xdepend (analyze loop dependencies)
  - -xprefetch=yes (do software prefetching)
36. Compiler Optimized
37. Effect of Flags, Row Version
38. Effect of Flags, Column Version
39. Effect of Flags, Library Version
40. Effect of Representation, Multi-level
41. Manual Unrolling

Column variant, unrolled by 2 over j:
for (j = 0; j < n - jrem; j += 2) {
    (void) dummy(a[j]);
    for (i = 0; i < m; i++)
        a[i] += B(i,j)*c[j] + B(i,j+1)*c[j+1];
}
for (j = n - jrem; j < n; j++)          /* remainder columns */
    for (i = 0; i < m; i++)
        a[i] += B(i,j)*c[j];

Row variant, unrolled by 2 over i:
for (i = 0; i < m - irem; i += 2) {
    sum0 = 0.0;
    sum1 = 0.0;
    for (j = 0; j < n; j++) {
        sum0 += B(i,j)*c[j];
        sum1 += B(i+1,j)*c[j];
    }
    a[i]   = sum0;
    a[i+1] = sum1;
}
42. Effect of Unrolling
- No progress!
43. Cache Analysis
- Spatial locality
  - Row version: stride-1 access
  - Column version: stride-m access, BAD
- Temporal locality between MxVs
  - Row version: as good as it gets
  - Column version: the same
44. Cache Analysis, Column Version
- The first column will miss on all rows
- If the blocks are still in cache, accesses will hit when accessing the second column
- Memory touched per column = m * cache_block_size
  - L1: 32m bytes
  - L2: 64m bytes
- We also need the array v
45. Cache Analysis, cont'd
- L1
  - One cache line contains 32/8 = 4 elements
  - The L1 can store 64 kB / 32 B = 2048 lines
  - Total of 4 * 2048 = 8192 elements
- L2
  - One cache line contains 64/8 = 8 elements
  - The L2 can store 8 MB / 64 B = 131072 lines
  - Total of 131072 * 8 = 1048576 elements
- TLB
  - 512 entries of 8 kB pages = 4 MB of data (TLB reach)
  - Each 8 kB page holds 1024 elements
  - Total of 512 * 1024 = 524288 elements
46. Cache Analysis
- The L1 cache is full after accessing 2048 elements in column order, since we bring in whole lines
- Because of LRU replacement, the first elements are evicted if we access more than 2048 elements
- Consider the case where m = n = 2048
  - The L1 cache is full after the first column -> replacements
  - The L2 cache is full after 64 columns
  - When i > 256 we need to replace entries in the TLB
47. Increasing the Page Size
- Many CPUs and OSes have support for larger page sizes
- Increases the TLB reach, since we can use fewer entries to map a large space
- Only works for large chunks of data
- UltraSPARC supports 8 kB, 64 kB, 512 kB and 4 MB pages
- ppgsz -o heap=512k
48. Conclusions
- Compiler options have large effects
- The Sun compiler is more conservative for C than for Fortran (aliasing, unrolling)
- Hand tuning such as unrolling and blocking can help or destroy performance
- Blocking can typically not be performed by the compiler
- Representations matter for performance
  - Multi-level arrays are often slower due to the amount of memory referencing needed
- Peak performance is 900 MHz * 2 = 1.8 GFlop/s
  - We got to about 1.2 GFlop/s
  - Still some tweaking to do
49. N-body Wrap-up
- The Barnes-Hut method is faster but also less accurate
- If theta is too large, we may miss important physics
- Adds a lot of programming complexity and also indirection
- Following linked lists or trees is not very efficient on modern CPUs (more addressing)
50. Representation of Particles
- A particle has a mass, position and velocity
- Is the force really a property of a particle?
- Two ways of representing the particles
  - One big array of all variables
  - Several arrays, one for each variable
51. One Big Array

struct _particle {
    double position[2];
    double velocity[2];
    double force[2];
    double mass;
};

struct _particle *particles;
52. View of Memory
53. Several Arrays

double *positions, *velocities, *forces, *masses;

positions  = malloc(2*N*sizeof(double));
velocities = malloc(2*N*sizeof(double));
forces     = malloc(2*N*sizeof(double));
masses     = malloc(N*sizeof(double));
54. View of Memory

forces:     fx fy fx fy ...
masses:     m  m  ...
positions:  px py px py ...
velocities: ux uy ux uy ...
55. Naïve Force Calculation

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        if (i != j)
            forces[i] += compute_forces(particles[i], particles[j]);
56. Remarks
- Forces are symmetric: copy the force with the opposite sign
- Removes the conditional within the loop
- Velocities are not used in the calculations
- What about cache hit ratios?

for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++) {
        forces[i] += compute_forces(particles[i], particles[j]);
        forces[j] -= compute_forces(particles[j], particles[i]);
    }
57. Big Array
(Figure: the array of particle structs laid out across consecutive cache blocks.)
- There is potential for spatial locality in the inner loop
- Space for velocity is never used: waste of cache space
- Struct padding?
58. Many Arrays
(Layout: positions px py px py ..., forces fx fy fx fy ..., masses m m ...)
- Velocity components are not cached
- Still some potential for spatial locality within each variable, but probably not between variables
59. Force Calculations as Linear Algebra
- The force calculations can be written out in linear algebra
(Equation shown as a figure; the annotation labels the vector of all masses.)
60. Distance Matrix
- The matrix D is symmetric and dense
- The diagonal is zero
- Needs to be updated every time step: new positions -> new distances
- Small elements in D will indicate long-range forces
- Optimizations to the force calculations can be seen as approximating this matrix with a sparse one
61. Barnes-Hut
- Just a way of recursively dividing up your domain
- Uses mass lumping and a cut-off (theta) to approximate far-field forces
- Each tree node stores a reference to the particle arrays: an index into the array or arrays
- Also needs the center of mass, total mass, etc. (a possible node layout is sketched below)
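One possible node layout for such a quadtree (a sketch only; the field names are assumptions, not taken from the assignment code):

/* Hypothetical 2-D Barnes-Hut quadtree node: holds indices into the
   particle arrays plus the aggregates used for far-field mass lumping. */
struct bh_node {
    double center_of_mass[2];      /* mass-weighted mean position of the subtree */
    double total_mass;             /* lumped mass of the subtree */
    double box_min[2], box_max[2]; /* spatial extent of this cell */
    int    *particle_idx;          /* indices into the particle array(s) */
    int    num_particles;
    struct bh_node *child[4];      /* NULL children for a leaf */
};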
62. Locality for the Barnes-Hut Case
- We will not use the particle array indices directly in the force calculations
- Instead, these are looked up from the tree node
  - Depends on where we are in the domain
  - More irregular access patterns
- We can reorder the particles to ensure that indices that are close also correspond to particles that are close in space
  - Space-filling curves (see the sketch below)
  - Bandwidth minimization (matrix case)
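As an illustration of the space-filling-curve idea, a Morton (Z-order) key is one common choice; this sketch is illustrative and not necessarily what the assignment uses:

/* Interleave the bits of two 16-bit cell coordinates into a 32-bit
   Morton key; sorting particles by this key keeps particles that are
   close in space close in memory as well. */
static unsigned int spread_bits(unsigned int x)
{
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

unsigned int morton_key(unsigned int ix, unsigned int iy)
{
    return (spread_bits(iy) << 1) | spread_bits(ix);
}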
63. Other Issues
- Risk of cancellation when calculating the forces
  - Try different precisions and rounding modes to smoke it out
- Symmetric forces in the Barnes-Hut case?
- Visualization: how many particles can we draw in each display frame (30 Hz)?
64. Discussion
- Last one! None next week
- Topic: BLAS and LAPACK
- Bldg 320, Rm 220, 4:15-5:30