CME212 Introduction to Large-Scale Computing in Engineering

Transcript and Presenter's Notes
1
CME212 Lecture 16
  • Cache Blocking, Unrolling and Assignment 3 Roundup

2
Writing Cache Friendly Code
  • Repeated references to variables are good
    (temporal locality)
  • Stride-1 reference patterns are good (spatial
    locality)
  • Example (assume a cold cache, 4-byte words, 4-word
    cache blocks)

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25% (row-wise, stride-1: one miss per 4-word cache block)

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100% (column-wise, stride-N: every access touches a new block)
3
Locality Example
  • Question: Can you permute the loops so that the
    function scans the 3-d array a with a stride-1
    reference pattern (and thus has good spatial
    locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
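One possible answer (a sketch): a[k][i][j] varies fastest in its last index j, so making j the innermost loop, with k and i outside it, gives a stride-1 reference pattern while visiting exactly the same elements:

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (k = 0; k < N; k++)           /* same loop bounds as above, reordered */
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)   /* innermost loop walks contiguous memory */
                sum += a[k][i][j];
    return sum;
}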
4
Matrix Multiplication Example
Variable sum held in register
  • Major Cache Effects to Consider
  • Total cache size
  • Exploit temporal locality and keep the working
    set small (e.g., by using blocking)
  • Block size
  • Exploit spatial locality
  • Description
  • Multiply N x N matrices
  • O(N^3) total operations
  • Accesses
  • N reads per source element
  • N values summed per destination

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
5
Miss Rate Analysis for Matrix Multiply
  • Assume
  • Line size = 32 B (big enough for four 64-bit words)
  • Matrix dimension (N) is very large
  • Approximate 1/N as 0.0
  • Cache is not even big enough to hold multiple
    rows
  • Analysis Method
  • Look at access pattern of inner loop



6
Matrix Multiplication (ijk)
/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A scanned row-wise at (i,*), B scanned column-wise at (*,j), C fixed at (i,j)

Misses per inner loop iteration:  A = 0.25, B = 1.0, C = 0.0
7
Matrix Multiplication (jik)
/* jik */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Inner loop: A scanned row-wise at (i,*), B scanned column-wise at (*,j), C fixed at (i,j)

Misses per inner loop iteration:  A = 0.25, B = 1.0, C = 0.0
8
Matrix Multiplication (kij)
/* kij */
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned row-wise at (k,*), C scanned row-wise at (i,*)

Misses per inner loop iteration:  A = 0.0, B = 0.25, C = 0.25
9
Matrix Multiplication (ikj)
/* ikj */
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

Inner loop: A fixed at (i,k), B scanned row-wise at (k,*), C scanned row-wise at (i,*)

Misses per inner loop iteration:  A = 0.0, B = 0.25, C = 0.25
10
Matrix Multiplication (jki)
/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned column-wise at (*,k), B fixed at (k,j), C scanned column-wise at (*,j)

Misses per inner loop iteration:  A = 1.0, B = 0.0, C = 1.0
11
Matrix Multiplication (kji)
/* kji */
for (k = 0; k < n; k++)
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

Inner loop: A scanned column-wise at (*,k), B fixed at (k,j), C scanned column-wise at (*,j)

Misses per inner loop iteration:  A = 1.0, B = 0.0, C = 1.0
12
Summary of Matrix Multiplication
  • ijk and jik: 2 loads, 0 stores per inner-loop iteration;
    1.25 misses per iteration
  • kij and ikj: 2 loads, 1 store per inner-loop iteration;
    0.5 misses per iteration
  • jki and kji: 2 loads, 1 store per inner-loop iteration;
    2.0 misses per iteration
13
Pentium Matrix Multiply Performance
  • Miss rates are helpful but not perfect
    predictors.
  • Code scheduling matters, too.

14
Improving Temporal Locality by Blocking
  • Example Blocked matrix multiplication
  • "block" (in this context) does not mean cache
    block.
  • Instead, it means a sub-block within the matrix.
  • Example: N = 8, sub-block size = 4

Key idea: sub-blocks (i.e., Axy) can be treated
just like scalars.

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] X [ B21 B22 ] = [ C21 C22 ]

    C11 = A11*B11 + A12*B21      C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21      C22 = A21*B12 + A22*B22
15
Blocking
/* Unoptimized: X = Y * Z */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
(Figure: access patterns of X, Y, and Z; i indexes rows of X and Y, j indexes columns of X and Z, k indexes columns of Y and rows of Z.)
16
Blocking
/* Optimized (blocked): X = Y * Z */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] += r;   /* accumulate the partial product from this kk block */
            }

(Figure: after the first jj/kk block, X holds a partial solution built from a row strip of Y and a B x B block of Z.)
17
Blocked Matrix Multiply (bijk)
for (jj = 0; jj < n; jj += bsize) {
    for (i = 0; i < n; i++)
        for (j = jj; j < min(jj + bsize, n); j++)
            c[i][j] = 0.0;
    for (kk = 0; kk < n; kk += bsize)
        for (i = 0; i < n; i++)
            for (j = jj; j < min(jj + bsize, n); j++) {
                sum = 0.0;
                for (k = kk; k < min(kk + bsize, n); k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] += sum;
            }
}
18
Blocked Matrix Multiply Analysis
  • The innermost loop pair multiplies a 1 x bsize sliver
    of A by a bsize x bsize block of B and
    accumulates into a 1 x bsize sliver of C
  • The loop over i steps through n row slivers of A and C,
    using the same block of B

Innermost Loop Pair
19
Blocking, Analysis
20
Pentium Blocked Matrix Multiply Performance
  • Blocking (bijk and bikj) improves performance by
    a factor of two over unblocked versions (ijk and
    jik)
  • relatively insensitive to array size.

21
Sparse Matrices
  • Many applications result in very sparse matrices
  • Discretizations of PDEs, for example
  • We can store only the nonzero elements
  • Lots of different formats
  • Need to look up indices as well as values
  • Much slower per element than dense (full) matrices
  • Sparsity enables us to solve much larger problems

22
Example Matrix
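For concreteness (an illustrative example, not necessarily the matrix shown in the lecture), here is a small sparse matrix and its CSR arrays, using 0-based indexing:

/* Illustrative 4 x 4 sparse matrix:
 *     [ 10  0  0  2 ]
 *     [  3  9  0  0 ]
 *     [  0  7  8  7 ]
 *     [  3  0  8  7 ]
 * CSR stores the nonzeros row by row, their column indices,
 * and one pointer per row into those arrays.
 */
double val[]     = { 10, 2,   3, 9,   7, 8, 7,   3, 8, 7 };
int    col_ind[] = {  0, 3,   0, 1,   1, 2, 3,   0, 2, 3 };
int    row_ptr[] = {  0, 2, 4, 7, 10 };  /* row i occupies val[row_ptr[i] .. row_ptr[i+1]-1] */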
23
Matrix-Vector product, Compressed Sparse Row (CSR)
  • Lookup indices
  • Loop length determined by number of non-zeros per
    row

Pointers into array of column indices
for i = 1, n
    y(i) = 0
    for j = row_ptr(i), row_ptr(i+1) - 1
        y(i) = y(i) + val(j) * x(col_ind(j))
    end
end
Indirect addressing
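For reference, a minimal C sketch of the same CSR matrix-vector product, using 0-based arrays as in the illustrative example above (the pseudocode on this slide is 1-based):

/* y = A * x with A stored in CSR format (0-based indexing). */
void csr_matvec(int n, const int *row_ptr, const int *col_ind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_ind[j]];   /* indirect access into x */
        y[i] = sum;
    }
}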
24
Software Prefetching
  • Prefetching: guess what data is needed and put it
    in the cache
  • Run-ahead scouting
  • You can hide some of the latency by inserting
    prefetch instructions (see the sketch after this list)
  • Increases bandwidth
  • On Sun
  • -xprefetch=no
  • -xprefetch_level=n, n in {1,2,3}
  • void sparc_prefetch_read_once(address)
  • void sparc_prefetch_read_many(address)
  • void sparc_prefetch_write_once(address)
  • void sparc_prefetch_write_many(address)
  • C$PRAGMA SPARC_PREFETCH_READ_ONCE(address)  (Fortran directive)
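As an illustration (a sketch, not from the slides), explicit prefetch calls could be inserted into the CSR loop shown earlier. This assumes the <sun_prefetch.h> header that declares the sparc_prefetch_* routines listed above; the look-ahead distance of 16 elements is an arbitrary illustrative choice.

#include <sun_prefetch.h>

void csr_matvec_prefetch(int n, const int *row_ptr, const int *col_ind,
                         const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            /* Request data needed a few iterations from now; prefetches are
               only hints, so addresses past the end of the arrays are harmless. */
            sparc_prefetch_read_many((void *)&val[j + 16]);
            sparc_prefetch_read_many((void *)&col_ind[j + 16]);
            sum += val[j] * x[col_ind[j]];
        }
        y[i] = sum;
    }
}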

25
Concluding Observations
  • Programmer can optimize for cache performance
  • How data structures are organized
  • How data are accessed
  • Nested loop structure
  • Blocking is a general technique
  • All systems favor cache friendly code
  • Getting absolute optimum performance is very
    platform specific
  • Cache sizes, line sizes, etc.
  • Can get most of the advantage with generic code
  • Keep working set reasonably small (temporal
    locality)
  • Use small strides (spatial locality)

26
Matrix-vector Products
27
C Implementations
Row variant:

for (i = 0; i < m; i++) {
    sum = 0.0;
    for (j = 0; j < n; j++)
        sum += A(i,j) * x[j];
    v[i] = sum;
}

Column variant:

for (i = 0; i < m; i++)
    v[i] = 0.0;
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        v[i] += A(i,j) * x[j];
28
MxV in BLAS
  • Implemented in the Basic Linear Algebra Subprograms
    (BLAS)
  • DGEMV: double precision
  • CGEMV: complex
  • SGEMV: single precision
  • More general form: y = alpha*A*x + beta*y (see the call sketch below)
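For illustration (a sketch, not part of the slides), the product v = A*x can be delegated to DGEMV through the standard CBLAS interface, assuming A is stored row-major as one flat m x n array:

#include <cblas.h>

/* v = A * x for an m-by-n row-major matrix A, using the general
   DGEMV form y = alpha*A*x + beta*y with alpha = 1 and beta = 0. */
void matvec_blas(int m, int n, const double *A, const double *x, double *v)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0,            /* alpha */
                A, n,           /* A and its leading dimension */
                x, 1,           /* x and its stride */
                0.0,            /* beta */
                v, 1);          /* v (output) and its stride */
}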

29
Multi-dimensional Arrays in C
  • There are two fundamental ways of representing
    the matrix A (sketched below)
  • Nested arrays
  • One contiguous one-dimensional array
  • Indexing using arithmetic: A(i,j) = a[i*n + j]
  • Multi-level array
  • Array of pointers to rows
  • Bracket indexing: A(i,j) = a[i][j]
  • DGEMV in BLAS uses the nested-array representation
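A minimal sketch of the two allocations (illustrative helper names, not from the slides):

#include <stdlib.h>

/* Nested (one-dimensional) representation: element (i,j) lives at a[i*n + j]. */
double *alloc_flat(int m, int n)
{
    return malloc((size_t)m * (size_t)n * sizeof(double));
}

/* Multi-level representation: an array of m row pointers, indexed a[i][j].
   Rows are separate allocations, so they need not be contiguous in memory. */
double **alloc_multilevel(int m, int n)
{
    double **a = malloc(m * sizeof(double *));
    for (int i = 0; i < m; i++)
        a[i] = malloc(n * sizeof(double));
    return a;
}

The extra pointer dereference per row in the multi-level form is one reason it tends to be slower, as noted in the conclusions later on.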

30
Performance Comparison
31
Memory consumption
  • One 2D array (A), two 1D arrays (v, x)
  • Use nested arrays
  • DGEMV requires the nested-array (contiguous) representation
  • Assume m = n, i.e. a square matrix
  • Datatype is double: 64 bits = 8 B
  • Total memory = 8n^2 + 16n bytes

32
UltraSPARC III Cu
  • Experiment run on an UltraSPARC III Cu
  • 900 MHz clock frequency
  • One FP add and one FP mul per cycle, one memory operation
  • L1 cache: on-chip, 64 kB
  • L2 cache: off-chip, 8 MB
  • Problem fits in L2 if 8n^2 + 16n < 8 MB
  • i.e. roughly n < 1000

33
Compiler flags
  • Code compiled using the Sun Studio compilers
  • FLAGS
  • -xarch=v9b (64-bit)
  • -xO3 (basic optimizations)

34
Performance Again
35
Using the Compiler
  • Added the flags
  • -xtarget=ultra3cu
  • -xcache=64/32/4:8192/512/2 (from fpversion)
  • -fast
  • -xrestrict (aliasing)
  • -xdepend (analyze loop dependencies)
  • -xprefetch=yes (do software prefetching)

36
Compiler Optimized
37
Effect of Flags, Row Version
38
Effect of Flags, Column Version
39
Effect of Flags, Library Version
40
Effect of Representation, Multilevel
41
Manual Unrolling
Column variant, unrolled by 2 in j:

for (j = 0; j < n - jrem; j += 2) {
    (void) dummy(a[j]);
    for (i = 0; i < m; i++)
        a[i] += B(i,j) * c[j] + B(i,j+1) * c[j+1];
}
for (j = n - jrem; j < n; j++)          /* remainder columns */
    for (i = 0; i < m; i++)
        a[i] += B(i,j) * c[j];

Row variant, unrolled by 2 in i:

for (i = 0; i < m - irem; i += 2) {
    sum0 = 0.0;
    sum1 = 0.0;
    for (j = 0; j < n; j++) {
        sum0 += B(i,j)   * c[j];
        sum1 += B(i+1,j) * c[j];
    }
    a[i]   = sum0;
    a[i+1] = sum1;
}
42
Effect of Unrolling
No progress!
43
Cache Analysis
  • Spatial locality
  • Row version, stride-1 access
  • Col version, stride-m access BAD
  • Temporal locality between MxVs
  • Row version, as good as it gets
  • Col version, same

44
Cache Analysis, Column Version
  • The first column will miss on all rows
  • If the blocks are still in cache, accesses will hit
    when accessing the second column
  • Memory brought into the cache per column:
  • m * cache_block_size
  • L1: m * 32 B
  • L2: m * 64 B
  • We also need the array v
45
Cache Analysis, contd
  • L1
  • One cache line contains 32/8 = 4 elements
  • The L1 can store 64 kB / 32 B = 2048 lines
  • Total of 4 * 2048 = 8192 elements
  • L2
  • One cache line contains 64/8 = 8 elements
  • The L2 can store 8 MB / 64 B = 131072 lines
  • Total of 131072 * 8 = 1048576 elements
  • TLB
  • 512 entries of 8 kB pages = 4 MB of data (TLB
    reach)
  • Each 8 kB page holds 1024 elements
  • Total of 512 * 1024 = 524288 elements

46
Cache Analysis
  • The L1 cache is full after accessing 2048
    elements in column order, since we bring in whole
    lines
  • Because of LRU replacement, the first elements
    are evicted if we access more than 2048 elements
  • Consider the case where m = n = 2048
  • The L1 cache is full after the first column -> replacements
  • The L2 cache is full after 64 columns
  • After about 256 columns we need to replace entries in the
    TLB (each 2048-element column spans two 8 kB pages, and
    the TLB has 512 entries)

47
Increasing the Page Size
  • Many CPUs and OSes have support for larger
    page sizes
  • Increases TLB reach, since fewer entries are needed
    to map a large space
  • Only works for large chunks of data
  • UltraSPARC supports 8 kB, 64 kB, 512 kB and 4 MB pages
  • ppgsz -o heap=512k

48
Conclusions
  • Compiler options have large effects
  • The Sun compiler is more conservative for C than
    for Fortran
  • Aliasing, unrolling
  • Hand-tuning such as unrolling and blocking can
    help or destroy performance
  • Blocking can typically not be performed by the
    compiler
  • Representations matter for performance
  • Multi-level arrays are often slower due to the
    amount of memory referencing needed
  • Peak performance is 900 MHz * 2 flops/cycle = 1.8 GFlop/s
  • We got to about 1.2 GFlop/s
  • Still some tweaking to do

49
N-body Wrap-up
  • The Barnes-Hut method is faster but also less
    accurate
  • If theta is too large, we may miss important
    physics
  • Adds a lot of programming complexity and also
    indirection
  • Following linked lists or trees is not very
    efficient on modern CPUs
  • More addressing

50
Representation of Particles
  • A particle has a mass, position and velocity
  • Is the force really a property of a particle?
  • Two ways of representing the particles
  • One big array of all variables
  • Several arrays, one for each variable

51
One big Array
struct _particle {
    double position[2];
    double velocity[2];
    double force[2];
    double mass;
};
struct _particle *particles;

52
View of Memory
53
Several Arrays
double *positions, *velocities, *forces, *masses;

positions  = malloc(2 * N * sizeof(double));
velocities = malloc(2 * N * sizeof(double));
forces     = malloc(2 * N * sizeof(double));
masses     = malloc(N * sizeof(double));
54
View of Memory
(Figure: one array per variable in memory — forces fx fy fx fy ..., masses m m ..., positions px py px py ..., velocities ux uy ux uy ...)
55
Naïve Force Calculation
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        if (i != j)
            forces[i] += compute_forces(particles[i], particles[j]);
56
Remarks
  • Forces are symmetric: compute each pair once and copy the
    force with opposite sign (see the sketch after the code below)
  • Removes the conditional within the loop
  • Velocities are not used in the calculations
  • What about cache hit ratios?

for (i = 0; i < N; i++)
    for (j = i + 1; j < N; j++) {
        forces[i] += compute_forces(particles[i], particles[j]);
        forces[j] -= compute_forces(particles[j], particles[i]);
    }
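A minimal sketch of actually reusing the pairwise force, assuming the struct from the earlier slide and a hypothetical pair_force() helper (not from the slides) that returns the x/y force on the first particle due to the second:

struct _particle { double position[2], velocity[2], force[2], mass; };

/* Hypothetical helper: force on *p due to *q. */
void pair_force(const struct _particle *p, const struct _particle *q,
                double *fx, double *fy);

void accumulate_forces(struct _particle *particles, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double fx, fy;
            pair_force(&particles[i], &particles[j], &fx, &fy);
            particles[i].force[0] += fx;  particles[i].force[1] += fy;
            particles[j].force[0] -= fx;  particles[j].force[1] -= fy;  /* opposite sign */
        }
}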
57
Big Array
(Figure: the array of particle structs laid out across consecutive cache blocks.)
  • There is potential for spatial locality in inner
    loop
  • Space for velocity is never used, waste of cache
    space
  • Struct padding?

58
Many Arrays
(Figure: separate arrays in memory — positions px py px py ..., forces fx fy fx fy ..., masses m m ...)
  • Velocity components not cached
  • Still some potential for spatial locality within
    each variable but probably not between variables

59
Force Calculations as Linear Algebra
  • The force calculations can be written out in
    linear algebra

(Figure: the force computation written in matrix form; one factor is the vector of all masses.)
60
Distance Matrix
  • Matrix D is symmetric and dense
  • Diagonal is zero
  • Needs to be updated every time step
  • New positions -> new distances
  • Small elements in D will indicate long range
    forces
  • Optimizations to force calculations can be seen
    as approximating this matrix with a sparse one

61
Barnes-Hut
  • Just a way of recursively dividing up your domain
  • Use mass lumping and a cut-off (theta) to
    approximate far-field forces
  • Each tree node stores a reference to the particle
    arrays
  • An index into the array or arrays
  • Also needs the center of mass, total mass, etc.
    (see the sketch below)
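A minimal sketch of such a node, assuming the "several arrays" particle layout and purely illustrative field names:

/* One node of the Barnes-Hut quadtree (illustrative field names). */
struct bh_node {
    double xmin, ymin, xmax, ymax;  /* square of the domain covered by this node */
    double com[2];                  /* center of mass of the contained particles */
    double total_mass;              /* lumped mass used for far-field approximation */
    int    first, count;            /* index range into the (reordered) particle arrays */
    struct bh_node *child[4];       /* four quadrants; NULL for a leaf */
};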

62
Locality for the Barnes-Hut case
  • We will not use the particle array indices
    directly in the force calculations
  • Instead, these are looked up from the tree node
  • Depends on where we are in the domain
  • More irregular access patterns
  • We can reorder the particles so that indices that
    are close also correspond to particles that are
    close in space
  • Space filling curves (see the sketch after this list)
  • Bandwidth minimization (matrix case)
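As an illustration of the space-filling-curve idea (a sketch, not from the slides): quantize each particle's position to an integer grid over the domain, compute a Morton (Z-order) key by interleaving the coordinate bits, and sort the particle indices by that key before permuting the arrays.

#include <stdint.h>

/* Interleave the low 16 bits of ix and iy into a 32-bit Morton (Z-order) key.
   Particles sorted by this key are usually close in space when close in index. */
static uint32_t morton2d(uint16_t ix, uint16_t iy)
{
    uint32_t key = 0;
    for (int b = 0; b < 16; b++) {
        key |= (uint32_t)((ix >> b) & 1u) << (2 * b);
        key |= (uint32_t)((iy >> b) & 1u) << (2 * b + 1);
    }
    return key;
}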

63
Other Issues
  • Risk of cancellations when calculating the forces
  • Try different precisions and rounding modes to
    smoke it out
  • Symmetric forces in the Barnes-Hut case?
  • Visualization: how many particles can we draw in
    each display frame (30 Hz)?

64
Discussion
  • Last one! None next week
  • Topic BLAS and LAPACK
  • Bldg 320, Rm 220, 4:15-5:30
