Title: Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety
1. Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety
- Jeyarajan Thiyagalingam
- Olav Beckmann and Paul H.J. Kelly
- Software Performance Optimisation Group,
- Imperial College, London
2. Motivation
- Consider two code variants of a matrix multiply:
IJK variant:
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          C[i][j] += A[i][k] * B[k][j];
IKJ variant:
    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
          C[i][j] += A[i][k] * B[k][j];
- Both code variants are valid and apparently have the same complexity.
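The two variants can be written out as complete C functions. This is a minimal sketch, assuming square matrices stored row-major in flat arrays; the function names `mm_ijk` and `mm_ikj` are illustrative and not from the slides. With this storage, the IJK inner loop reads B with a stride of n elements, while the IKJ inner loop reads B and updates C with unit stride.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, IJK loop order.
 * The inner loop walks down a column of B (stride n). */
void mm_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Same computation, IKJ loop order.
 * The inner loop walks along a row of B and a row of C (stride 1). */
void mm_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Both functions accumulate into C, so C should be zeroed before the call if a plain product is wanted.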
3. The price of naivety
- Depending on problem size and architecture, the
IKJ variant can be up to 10 times faster than IJK.
4. Performance Programming Model
- Naively written code can suffer a factor-of-10 performance hit.
- Sometimes the compiler can help, but none of the compilers we used interchanged these loops.
- A robust performance programming model would have to account for the capabilities of the compiler.
- Offering a clear performance programming model should be part of compiler research.
5. Compromise: blocked layout
[Figure: 8x8 array, elements numbered by row-major offset]
 0  1  2  3  4  5  6  7
 8  9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
- Reason for differences in performance:
  - Row-major traversal uses 4 words per block
  - But column-major traversal uses only 1 word per block
  - Bandwidth is wasted with column-major (CM) traversal
- Blocked: a 4-word cache block contains a 2x2 subarray
  - Row-major traversal uses 2 words per block
  - Column-major traversal uses 2 words per block
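The word counts above can be checked with a small sketch. It assumes a 4-word cache block, an even array dimension, and 2x2 tiles stored one after another in row-major tile order; the names `rm_offset`, `blocked_offset` and `blocks_touched` are hypothetical helpers, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { WORDS_PER_BLOCK = 4 };   /* assumed cache block size, in array elements */

/* Offset of (i,j) in an n x n row-major array. */
size_t rm_offset(size_t n, size_t i, size_t j)
{
    return i * n + j;
}

/* Offset of (i,j) when the array is stored as 2x2 tiles: tiles are laid
 * out in row-major order, elements are row-major inside a tile. */
size_t blocked_offset(size_t n, size_t i, size_t j)
{
    assert(n % 2 == 0);
    size_t tile = (i / 2) * (n / 2) + (j / 2);
    return tile * 4 + (i % 2) * 2 + (j % 2);
}

typedef size_t (*offset_fn)(size_t, size_t, size_t);

/* Walk row 0 (by_row != 0) or column 0 (by_row == 0) and count how often
 * the walk moves to a different cache block. No block is revisited in
 * these walks, so this equals the number of distinct blocks touched. */
size_t blocks_touched(offset_fn layout, size_t n, int by_row)
{
    size_t count = 0, last = (size_t)-1;
    for (size_t t = 0; t < n; t++) {
        size_t off = by_row ? layout(n, 0, t) : layout(n, t, 0);
        size_t blk = off / WORDS_PER_BLOCK;
        if (blk != last) {
            count++;
            last = blk;
        }
    }
    return count;
}
```

For n = 8, a row of the row-major array touches 2 blocks (4 words used per block) and a column touches 8 blocks (1 word per block), while the blocked layout touches 4 blocks in either direction (2 words per block).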
6. Recursively-blocked layout
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(i,j) (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
Offset 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
(i,j) (0,4) (0,5) (1,4) (1,5) (0,6) (0,7) (1,6) (1,7) (2,4) (2,5) (3,4) (3,5) (2,6) (2,7) (3,6) (3,7)
- Real machines have deep memory hierarchies.
- Therefore, we need to apply blocking recursively.
- Layout of the blocks: Z-Morton (one of a number of space-filling curves)
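The offset table above is what results from interleaving the bits of i and j, with the bits of the row index i in the odd positions and the bits of the column index j in the even positions. A minimal sketch, assuming indices below 2^16 and using the illustrative name `morton_offset`:

```c
#include <assert.h>
#include <stdint.h>

/* Z-Morton offset of element (i,j): bit b of the column index j goes to
 * bit 2b of the offset, bit b of the row index i goes to bit 2b+1. */
uint32_t morton_offset(uint32_t i, uint32_t j)
{
    assert(i < 65536u && j < 65536u);
    uint32_t off = 0;
    for (unsigned b = 0; b < 16; b++) {
        off |= ((j >> b) & 1u) << (2 * b);       /* column bit -> even position */
        off |= ((i >> b) & 1u) << (2 * b + 1);   /* row bit    -> odd position  */
    }
    return off;
}
```

The bit-by-bit loop is chosen for readability; it is far too slow for an inner loop.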
7. Morton Layout: A Compromise
Row-major traversal (hit rate, %)
Block size   RM array   Morton array
32B          75         50
128B         93.7       75
8kB page     99.9       96.87
Column-major traversal (hit rate, %)
Block size   RM array   Morton array
32B          0          50
128B         0          75
8kB page     0          96.87
- Morton storage layout is unbiased towards either row- or column-major traversal.
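The figures in these tables are consistent with a simple model, sketched below under stated assumptions: elements are 8-byte doubles (so 32B, 128B and 8kB blocks hold 4, 16 and 1024 words), the array starts on a block boundary, and an access hits only if it falls in the same block as the access immediately before it. The names `rm_offset`, `mz_offset` and `count_hits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Row-major offset of (i,j) in an n x n array. */
uint32_t rm_offset(uint32_t n, uint32_t i, uint32_t j)
{
    return i * n + j;
}

/* Z-Morton offset of (i,j): row bits in odd positions, column bits in
 * even positions. n is unused and kept only for a common signature. */
uint32_t mz_offset(uint32_t n, uint32_t i, uint32_t j)
{
    (void)n;
    uint32_t off = 0;
    for (unsigned b = 0; b < 16; b++) {
        off |= ((j >> b) & 1u) << (2 * b);
        off |= ((i >> b) & 1u) << (2 * b + 1);
    }
    return off;
}

typedef uint32_t (*layout_fn)(uint32_t, uint32_t, uint32_t);

/* Traverse the whole n x n array row by row (column_major == 0) or
 * column by column (column_major != 0) and count the accesses that land
 * in the same block as the previous access. */
uint32_t count_hits(layout_fn layout, int column_major,
                    uint32_t n, uint32_t words_per_block)
{
    assert(n > 0 && n <= 1024 && words_per_block > 0);
    uint32_t hits = 0, last_block = UINT32_MAX;
    for (uint32_t outer = 0; outer < n; outer++) {
        for (uint32_t inner = 0; inner < n; inner++) {
            uint32_t i = column_major ? inner : outer;
            uint32_t j = column_major ? outer : inner;
            uint32_t block = layout(n, i, j) / words_per_block;
            if (block == last_block)
                hits++;
            last_block = block;
        }
    }
    return hits;
}
```

Dividing the returned count by n*n gives the hit rate. For example, a 64x64 Morton array with 4-word blocks gives 2048 hits out of 4096 accesses in both traversal orders (50%). The 0% entries for column-major traversal of a row-major array hold only when a row is at least as long as a block.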
8. So have we solved the problem?
- Unfortunately, the basic Morton scheme often performs disappointingly.
- At least Morton does not seem to suffer from pathological drops in performance.
9. Alignment
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(i,j) (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
- The statement that Morton is unbiased turns out to rest on the assumption that each cache line starts at the start of a Morton block.
10. Alignment
- It turns out that Morton layout is only unbiased for even power-of-two cache line sizes.
- The same problems happen when the base address is misaligned.
11. Alignment
- We calculated miss rates systematically for all levels of the memory hierarchy.
- In each case, we calculated the miss rates for all possible alignments of the base address.
- The difference in miss rates between the best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines, and a factor of 2 for odd power-of-two cache lines.
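A sweep over base-address alignments of this kind can be sketched as follows. This is not the authors' calculation: it assumes a deliberately simple model in which an access misses whenever it falls in a different cache line from the previous access, it shifts the base address in whole words only, and the names `morton_offset`, `count_misses` and `miss_range` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Z-Morton offset of (i,j): row bits in odd positions, column bits in
 * even positions. */
uint32_t morton_offset(uint32_t i, uint32_t j)
{
    assert(i < 65536u && j < 65536u);
    uint32_t off = 0;
    for (unsigned b = 0; b < 16; b++) {
        off |= ((j >> b) & 1u) << (2 * b);
        off |= ((i >> b) & 1u) << (2 * b + 1);
    }
    return off;
}

/* Misses for a full traversal of an n x n Morton array whose base
 * address lies `shift` words past a cache-line boundary. An access
 * misses when its line differs from the line of the previous access. */
uint32_t count_misses(int column_major, uint32_t n,
                      uint32_t line_words, uint32_t shift)
{
    assert(line_words > 0 && shift < line_words);
    uint32_t misses = 0, last_line = UINT32_MAX;
    for (uint32_t outer = 0; outer < n; outer++) {
        for (uint32_t inner = 0; inner < n; inner++) {
            uint32_t i = column_major ? inner : outer;
            uint32_t j = column_major ? outer : inner;
            uint32_t line = (morton_offset(i, j) + shift) / line_words;
            if (line != last_line)
                misses++;
            last_line = line;
        }
    }
    return misses;
}

/* Try every word-granular alignment and report the extremes. */
void miss_range(int column_major, uint32_t n, uint32_t line_words,
                uint32_t *best, uint32_t *worst)
{
    *best = UINT32_MAX;
    *worst = 0;
    for (uint32_t shift = 0; shift < line_words; shift++) {
        uint32_t m = count_misses(column_major, n, line_words, shift);
        if (m < *best)  *best = m;
        if (m > *worst) *worst = m;
    }
}
```

Even in this toy model both effects are visible. With an aligned 8-word line (an odd power of two), an 8x8 array takes 16 misses in row-major traversal but 32 in column-major traversal. With a 4-word line, shifting the base of a 4x4 array by 3 words raises the row-major miss count from 8 to 9.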
12. Alignment
- The overall miss-rates drop exponentially with
block size, but access times are generally
assumed to increase geometrically with block size.
13. Alignment
- With canonical layouts, it is often necessary to pad the row or column length in order to avoid pathological behaviour.
- Finding the right amount of padding is not trivial.
- Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. the page size.
- Aliasing in the memory hierarchy can spoil the theory.
- For example, on Pentium 4, the following aliasing patterns cause problems:
  - 2K: map to same L1 cache line
  - 16K: aliases in store-forwarding logic
  - 32K: map to the same L2 cache line
  - 64K: indistinguishable in L1 cache
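Page-aligning the base address, as the theory above suggests, can be done on a POSIX system roughly as follows. This is a sketch only: the function name `alloc_page_aligned` is hypothetical, the slides do not say how the authors allocated their arrays, and nothing here addresses the Pentium 4 aliasing patterns listed above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request posix_memalign and sysconf */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate storage for an n x n array of doubles whose base address is a
 * multiple of the system page size. Returns NULL on failure; release the
 * memory with free(). */
double *alloc_page_aligned(size_t n)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    void *p = NULL;
    if (posix_memalign(&p, (size_t)page, n * n * sizeof(double)) != 0)
        return NULL;

    assert((uintptr_t)p % (uintptr_t)page == 0);
    return p;
}
```

The page size is queried at run time rather than hard-coded, since it differs between the architectures used in the experiments.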
14. Address calculation
- With lexicographic (aka canonical) layout, it's easy to calculate the offset S of A[i,j] in an N×M array A:
  - S_rm(i,j) = N × i + j
  - S_cm(i,j) = i + M × j
- (If N and M are powers of two, this is bit-wise concatenation of i and j.)
- In loops, the multiplication is replaced by an increment.
- When unrolling loops, the address calculation can be strength-reduced.
- How can we calculate the Morton offset?
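For the row-major case, replacing the multiplication by an increment looks like this. It is a minimal sketch; the function names and the neutral parameter name `row_len` (the multiplier in S_rm) are illustrative, and in practice an optimising compiler usually performs this transformation itself.

```c
#include <assert.h>
#include <stddef.h>

/* Sum column j of a row-major array, recomputing the offset
 * row_len * i + j with a multiplication on every iteration. */
double column_sum_multiply(const double *A, size_t rows,
                           size_t row_len, size_t j)
{
    assert(j < row_len);
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        sum += A[row_len * i + j];
    return sum;
}

/* Same traversal after strength reduction: the offset is carried from
 * one iteration to the next and advanced by a constant increment. */
double column_sum_increment(const double *A, size_t rows,
                            size_t row_len, size_t j)
{
    assert(j < row_len);
    double sum = 0.0;
    size_t offset = j;
    for (size_t i = 0; i < rows; i++) {
        sum += A[offset];
        offset += row_len;
    }
    return sum;
}
```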
15. Address calculation
- Morton indices can be calculated by applying the bit-concatenation idea of RM/CM for power-of-two arrays recursively.
- For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.
- Let D0(i) = i_n 0 ... i_1 0 i_0 0
- Let D1(i) = 0 i_n ... 0 i_1 0 i_0
- Then S_mz(i,j) = D0(i) | D1(j)
- Dilation is rather expensive for an inner loop.
- Strength reduction (Wise et al.):
  - D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
  - D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0
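The dilation functions and the increment formulas can be sketched in C. The slides do not define Ones0 and Ones1; the sketch assumes Ones0 has ones in the even bit positions (0x55555555) and Ones1 has ones in the odd bit positions (0xAAAAAAAA), which is the choice that makes the formulas work for 32-bit offsets. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ONES0 0x55555555u   /* assumed: ones in even bit positions */
#define ONES1 0xAAAAAAAAu   /* assumed: ones in odd bit positions  */

/* D1(x): spread the low 16 bits of x so that bit b moves to bit 2b. */
uint32_t d1(uint32_t x)
{
    assert(x < 65536u);
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* D0(x): as D1, but bit b moves to bit 2b+1. */
uint32_t d0(uint32_t x)
{
    return d1(x) << 1;
}

/* Dilated increment: filling the unused positions with ones lets the
 * carry ripple across them; the final mask clears them again. */
uint32_t d0_next(uint32_t d) { return ((d | ONES0) + 1u) & ONES1; }
uint32_t d1_next(uint32_t d) { return ((d | ONES1) + 1u) & ONES0; }

/* S_mz(i,j) = D0(i) | D1(j) */
uint32_t s_mz(uint32_t i, uint32_t j)
{
    return d0(i) | d1(j);
}
```

With these definitions, a loop over i can keep D0(i) in a variable and advance it with `d0_next` instead of dilating i from scratch on every iteration.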
16. Address calculation
- Idea: use lookup tables for D0(i) and D1(j):
  - A[MortonTabEven[i] + MortonTabOdd[j]]
- When can we do strength reduction?
- In general, S_mz(i, j+1) could be anywhere.
  - D0(i + 1) = ???
- D0(i + k) = D0(i) + D0(k) if i's and k's bits do not overlap.
- We can do the strength reduction D0(i + k) = D0(i) + D0(k) as long as i is a multiple of 2^n and k < 2^n.
- With this, we can do loop unrolling.
17. Unrolled Code with Strength Reduction
    double mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
    {
      unsigned i, j, k;
      for (i = 0; i < sz; i++) {
        unsigned int t1i = MortonTabOdd[i];
        for (j = 0; j < sz; j++) {
          unsigned int t0j = MortonTabEven[j];
          for (k = 0; k < sz; k += 4) {
            unsigned int t0k = MortonTabEven[k];
            unsigned int t1k = MortonTabOdd[k];

            C[t1i + t0j] += A[t1i + t0k]      * B[t1k + t0j];
            C[t1i + t0j] += A[t1i + t0k + 2]  * B[t1k + t0j + 1];
            C[t1i + t0j] += A[t1i + t0k + 8]  * B[t1k + t0j + 4];
            C[t1i + t0j] += A[t1i + t0k + 10] * B[t1k + t0j + 5];
          }
        }
      }
    }
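A self-contained version of this kernel can be checked against a naive multiply. The sketch rests on several assumptions that the slide does not state: `FLOATTYPE` is `double`, `MortonTabEven[x]` holds the bits of x in odd positions and `MortonTabOdd[x]` holds them in even positions (the only choice consistent with the constants 2, 8, 10 and 1, 4, 5), the dimension is a multiple of 4 and a power of two, and the function returns nothing. `MAX_DIM`, `init_tables` and `mz` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DIM = 64 };               /* assumed upper bound on the dimension */
typedef double FLOATTYPE;            /* assumed element type */

unsigned MortonTabEven[MAX_DIM];     /* assumed: bits of x in odd positions  */
unsigned MortonTabOdd[MAX_DIM];      /* assumed: bits of x in even positions */

void init_tables(void)
{
    for (unsigned x = 0; x < MAX_DIM; x++) {
        unsigned d = 0;
        for (unsigned b = 0; b < 6; b++)       /* MAX_DIM = 2^6 */
            d |= ((x >> b) & 1u) << (2 * b);
        MortonTabOdd[x] = d;
        MortonTabEven[x] = d << 1;
    }
}

/* Offset of element (row, col) under the mapping the kernel uses:
 * the row index goes through MortonTabOdd, the column index through
 * MortonTabEven. */
unsigned mz(unsigned row, unsigned col)
{
    return MortonTabOdd[row] + MortonTabEven[col];
}

/* C += A * B, all three matrices stored with the mz() mapping. */
void mmijk_unrolled(unsigned sz, const FLOATTYPE *A,
                    const FLOATTYPE *B, FLOATTYPE *C)
{
    assert(sz % 4 == 0 && sz <= MAX_DIM);
    for (unsigned i = 0; i < sz; i++) {
        unsigned t1i = MortonTabOdd[i];
        for (unsigned j = 0; j < sz; j++) {
            unsigned t0j = MortonTabEven[j];
            for (unsigned k = 0; k < sz; k += 4) {
                /* k is a multiple of 4, so the tables are additive for
                 * k+1, k+2, k+3 and the offsets become constants. */
                unsigned t0k = MortonTabEven[k];
                unsigned t1k = MortonTabOdd[k];
                C[t1i + t0j] += A[t1i + t0k]      * B[t1k + t0j];
                C[t1i + t0j] += A[t1i + t0k + 2]  * B[t1k + t0j + 1];
                C[t1i + t0j] += A[t1i + t0k + 8]  * B[t1k + t0j + 4];
                C[t1i + t0j] += A[t1i + t0k + 10] * B[t1k + t0j + 5];
            }
        }
    }
}
```

Call `init_tables()` once before the kernel and fill A, B and C through `mz(row, col)`. Only two table lookups remain in the inner loop per four multiply-adds. Note that this mapping places the row bits in the even positions, the transpose of the one in the offset table on slide 6. The product is still correct because all three matrices use the same mapping.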
18. So have we solved the problem?
- Unrolling significantly reduces the overhead of the basic Morton scheme.
- IKJ is still faster than IJK; this might be due to having two table lookups in the inner loop.
19. Benchmarks
- Suite of simple numerical kernels operating on 2D arrays of doubles
- Used the compilers and flags which the vendors used for their SPEC-CFP2000 results
MM-ijk     Matrix multiply, ijk loop nest
MM-ikj     Matrix multiply, ikj loop nest
Cholk      Cholesky, k variant
Jacobi2d   2-dimensional, 4-point stencil smoother
ADI        Alternating Direction Implicit kernel, ij, ij order
20. Experimental Setup
- We used identical clusters of (student) lab machines during off-peak periods.
- Extensive scripting to automate data collection
- Dixon test to remove outliers from the measurements
- Use median instead of mean.
- Overall, more than 26M measurements
21. Architectures
- AMD, Thunderbird 1.8GHz, 512MB DDR-RAM
  - 64KB, 2-way, 64-byte block L1 cache; 256KB, 8-way, 64B block L2
  - Intel C Compiler v7.1 for Linux. Flags: -xK -ipo -static FDO
- Pentium III, Coppermine 450MHz, 256MB SDRAM
  - 16KB, 4-way, 32-byte block L1 cache; 256KB, 8-way, 32B block L2
  - Intel C Compiler v7.1 for Linux. Flags: -xK -ipo -O3 -static FDO
- Sun, SunFire 6800, UltraSparc III 750MHz
  - 64KB, 4-way, 32-byte block L1 cache; 8MB direct-mapped L2 cache
  - Sun Workshop Compiler V6. Flags: -fast -xcrossfile -xalias_level=std FDO
- Alpha, Compaq AlphaServer ES40, 21264 (EV6) 500MHz
  - 64KB, 2-way, 64-byte block L1 cache; 4MB direct-mapped L2 cache
  - Compaq C Compiler V6. Flags: -arch ev6 -fast -O4
- Pentium 4, 2.0GHz, 512MB DDR-RAM
  - 8KB, 8-way, 64-byte block L1 cache; 256KB, 8-way, 64B block L2
  - Intel C Compiler v7 for Linux. Flags: -xW -ipo -O3 -static FDO
22. Alpha (L1: 64KB/2-way/64-byte, L2: 4MB/direct-mapped)
23. Athlon (L1: 64KB/2-way/64-byte, L2: 256KB/8-way/64-byte)
24. Pentium III (L1: 16KB/4-way/32-byte, L2: 256KB/8-way/32-byte)
25. Pentium 4 (L1: 8KB/8-way/64-byte, L2: 256KB/8-way/64-byte)
26. Sparc (L1: 64KB/4-way/32-byte, L2: 8MB/direct-mapped/64-byte)
27. Summary
- The basic Morton scheme often performs disappointingly.
- Page-aligning the base address theoretically maximises spatial locality.
- Unrolling is facilitated by carefully aligning the start iteration of unrolled loops to power-of-two indices into the array.
- With base-address alignment and unrolling for strength reduction of the index calculation, Morton layout is beginning to actually work.
28. Future Work
- Larger factors of unrolling
  - Until now, only factor 4, hand-unrolled
  - We have used code generation to unroll by larger factors, and it seems that there are more improvements to be had.
- Prefetching
  - It's likely that hardware prefetching will fetch the wrong things.
  - Turn off hardware prefetching and use the right, compiler-directed prefetching instead.
- Tiling
  - Storage layout transformations and iteration space transformations are complementary.
  - But we should do both.