Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety

About This Presentation

Description:

Title: Is Morton layout competitive for large 2D arrays? Author: jeyan Last modified by: Olav Beckmann Created Date: 5/13/2002 10:05:23 AM Document presentation format – PowerPoint PPT presentation

Slides: 29

Transcript and Presenter's Notes

1
Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling:
Reducing the Price of Naivety

Jeyarajan Thiyagalingam
Olav Beckmann and Paul H.J. Kelly
Software Performance Optimisation Group,
Imperial College, London

2
Motivation

Consider two code variants of a matrix multiply

IJK variant:
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          C[i][j] += A[i][k] * B[k][j];

IKJ variant:
    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
          C[i][j] += A[i][k] * B[k][j];

Both code variants are valid and apparently have the same complexity.

3
The price of naivety

Depending on problem size and architecture, the
IKJ variant can be up to 10 times faster than IJK.

4
Performance Programming Model

Naively written code can suffer a factor-of-10 performance hit.
Sometimes the compiler can help, but none of the compilers we used interchanged these loops.
A robust performance programming model would have to account for the capabilities of the compiler.
Offering a clear Performance Programming Model should be part of compiler research.

5
Compromise: blocked layout
[Figure: 8x8 array, elements numbered by row-major offset]
 0  1  2  3  4  5  6  7
 8  9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63

Reason for the differences in performance, with a row-major array and 4-word cache blocks:
Row-major traversal uses all 4 words of each block.
Column-major traversal uses only 1 word of each block, so bandwidth is wasted.

Blocked layout: each 4-word cache block contains a 2x2 subarray.
Row-major traversal uses 2 words per block.
Column-major traversal uses 2 words per block.

6
Recursively-blocked layout
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(i,j) (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
Offset 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
(i,j) (0,4) (0,5) (1,4) (1,5) (0,6) (0,7) (1,6) (1,7) (2,4) (2,5) (3,4) (3,5) (2,6) (2,7) (3,6) (3,7)

Real machines have deep memory hierarchies.
Therefore, we need to apply blocking recursively.
Layout of the blocks: Z-Morton (one of a number of space-filling curves).
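
The offset table on this slide can be regenerated by de-interleaving the bits of each offset. The following is a minimal sketch, assuming (as the table suggests) that the bits of j occupy the even bit positions of the offset and the bits of i the odd positions. The function names are hypothetical and not taken from the presentation.

```c
#include <stdint.h>
#include <stdio.h>

/* Gather the bits at even positions (0, 2, 4, ...) of x into a compact value. */
uint32_t compact_even_bits(uint32_t x) {
    x &= 0x55555555u;
    x = (x | (x >> 1)) & 0x33333333u;
    x = (x | (x >> 2)) & 0x0F0F0F0Fu;
    x = (x | (x >> 4)) & 0x00FF00FFu;
    x = (x | (x >> 8)) & 0x0000FFFFu;
    return x;
}

/* Assumed convention, read off the slide's table:
   j sits in the even bits of the offset, i in the odd bits. */
void morton_decode(uint32_t offset, uint32_t *i, uint32_t *j) {
    *j = compact_even_bits(offset);
    *i = compact_even_bits(offset >> 1);
}

/* Print the first `count` offsets together with their (i,j) coordinates. */
void print_morton_table(uint32_t count) {
    for (uint32_t off = 0; off < count; off++) {
        uint32_t i, j;
        morton_decode(off, &i, &j);
        printf("%2u -> (%u,%u)\n", (unsigned)off, (unsigned)i, (unsigned)j);
    }
}
```

Calling print_morton_table(32) lists the same 32 offsets as the table, which makes it easy to cross-check individual entries.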

7
Morton Layout: A Compromise

Row-major traversal
Block size    RM array    Morton array
32B           75          50
128B          93.7        75
8kB page      99.9        96.87

Column-major traversal
Block size    RM array    Morton array
32B           0           50
128B          0           75
8kB page      0           96.87

Morton storage layout is unbiased towards either
row- or column-major traversal.
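
The figures in the table can be reproduced with a very simple model. The sketch below assumes that the entries are theoretical hit rates in percent, that elements are 8-byte doubles, and that a block is fetched once and then used for a fixed number of consecutive accesses before the traversal moves on. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hit rate (%) when a traversal uses `words_used` words of each block
   in a row: one miss to fetch the block, then words_used - 1 hits. */
double hit_rate_percent(size_t words_used) {
    assert(words_used > 0);
    return 100.0 * (1.0 - 1.0 / (double)words_used);
}

/* Words of one block used by a single row (or column) of a Morton array,
   assuming the block holds a square tile, i.e. block_words = 4^m. */
size_t morton_words_per_pass(size_t block_words) {
    size_t side = 1;
    while (side * side < block_words)
        side *= 2;
    assert(side * side == block_words); /* only even powers of two handled */
    return side;
}
```

Under these assumptions a 32B block holds 4 doubles: a row-major array traversed by rows gives hit_rate_percent(4), and a Morton array gives hit_rate_percent(morton_words_per_pass(4)) for either traversal order. A row-major array traversed by columns uses one word per block, which yields 0. For 128B and 8kB blocks the model gives 93.75 and 96.875, which the slide appears to truncate.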

8
So have we solved the problem?

Unfortunately, the basic Morton Scheme often
performs disappointingly.
At least Morton does not seem to suffer from
pathological drops in performance.

9
Alignment
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(i,j) (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)

The statement that Morton layout is unbiased turns out to rest on the assumption that each cache line starts at the start of a Morton block.

10
Alignment

It turns out that Morton layout is only unbiased for even power-of-two cache line sizes, i.e. lines that hold 4, 16, 64, ... array elements and therefore cover a square block.
The same problems arise when the base address is misaligned.
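
Both effects can be observed with a toy model. The sketch below is not the authors' method: it counts a miss whenever an access touches a different cache line from the previous access, which is a deliberately crude stand-in for a real cache. The line size and base offset are given in array elements, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x into the even bit positions. */
uint32_t dilate16(uint32_t x) {
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton offset with i in the odd bits and j in the even bits. */
uint32_t morton_offset(uint32_t i, uint32_t j) {
    return (dilate16(i) << 1) | dilate16(j);
}

/* Toy model: fraction of accesses that land on a different line from the
   previous access while traversing an n x n Morton array.
   line_words: elements per cache line; base_words: misalignment of the
   array base, in elements; column_major: traversal order. */
double morton_miss_rate(uint32_t n, uint32_t line_words,
                        uint32_t base_words, int column_major) {
    assert(n > 0 && n <= 65536 && line_words > 0);
    uint64_t misses = 0;
    int64_t prev_line = -1;
    for (uint32_t outer = 0; outer < n; outer++) {
        for (uint32_t inner = 0; inner < n; inner++) {
            uint32_t i = column_major ? inner : outer;
            uint32_t j = column_major ? outer : inner;
            int64_t line = ((int64_t)morton_offset(i, j) + base_words) / line_words;
            if (line != prev_line) {
                misses++;
                prev_line = line;
            }
        }
    }
    return (double)misses / ((double)n * (double)n);
}
```

With aligned 4-element lines the model gives the same rate for row-major and column-major traversal. With 8-element lines the two orders differ, and shifting the base by one element raises the row-major rate for 4-element lines. The absolute numbers from such a model should not be compared with the measured factors quoted in this presentation.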

11
Alignment

We calculated miss-rates systematically for all
levels of memory hierarchy
In each case, we calculated the miss-rates for
all possible alignments of the base address.
The difference in miss-rates between the best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines, and up to a factor of 2 for odd power-of-two cache lines.

12
Alignment

The overall miss-rates drop exponentially with
block size, but access times are generally
assumed to increase geometrically with block size.

13
Alignment

With canonical layouts, it is often necessary to
pad the row or column length in order to avoid
pathological behaviour.
Finding the right amount of padding is not
trivial.
Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. the page size.
Aliasing in the memory hierarchy can spoil the theory.
For example, on Pentium 4, the following aliasing patterns cause problems:
2K: map to the same L1 cache line
16K: alias in the store-forwarding logic
32K: map to the same L2 cache line
64K: indistinguishable in the L1 cache
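
Aligning the base address to a chosen block size might look as follows. This is a minimal sketch using the C11 function aligned_alloc; the helper name is hypothetical, and the alignment (for example a page size) is a parameter the caller has to choose for the target machine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` doubles whose base address is a multiple of `align`
   bytes. `align` must be a power of two. Returns NULL on failure;
   release the memory with free(). */
double *alloc_aligned_doubles(size_t count, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t bytes = count * sizeof(double);
    /* C11 asks for a size that is a multiple of the alignment. */
    size_t rounded = (bytes + align - 1) & ~(align - 1);
    return aligned_alloc(align, rounded);
}
```

For a 64x64 Morton array and an 8kB page, the call would be alloc_aligned_doubles(64 * 64, 8192). As the slide points out, page alignment is only the theoretical starting point, since aliasing may make a different base address preferable in practice.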

14
Address calculation

With lexicographic (aka canonical) layout, it's easy to calculate the offset S of A[i,j] in an N×M array A (taking N as the row length and M as the column length):
Srm(i,j) = N×i + j        Scm(i,j) = i + M×j
(If N and M are powers of two, this is bit-wise concatenation of i and j.)
In loops, the multiplication is replaced by an increment.
When unrolling loops, the address calculation can be strength-reduced.
How can we calculate the Morton offset?
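
The row-major calculation on this slide can be sketched in C as follows. The function names are hypothetical; row_len stands for the row length in the formula for Srm.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (i,j) when each row holds row_len elements. */
size_t offset_rm(size_t row_len, size_t i, size_t j) {
    return row_len * i + j;
}

/* The same offset when row_len == 1 << log2_row_len:
   the bits of i are concatenated with the bits of j. */
size_t offset_rm_pow2(unsigned log2_row_len, size_t i, size_t j) {
    assert(j < ((size_t)1 << log2_row_len));
    return (i << log2_row_len) | j;
}

/* Sum column j. The multiplication row_len * i is strength-reduced
   to adding row_len to the running offset on every iteration. */
double sum_column(const double *a, size_t rows, size_t row_len, size_t j) {
    double sum = 0.0;
    size_t off = j;
    for (size_t i = 0; i < rows; i++) {
        sum += a[off];
        off += row_len;
    }
    return sum;
}
```

The incremental update in sum_column is the kind of transformation compilers routinely apply to canonical layouts. The Morton offset has no comparably simple increment, which is the question the slide ends on.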

15
Address calculation

Morton indices can be calculated by applying the bit-concatenation idea of RM/CM for power-of-two arrays recursively.
For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.
Let D0(i) = i_n 0 ... i_1 0 i_0 0 (the bits of i in the odd bit positions).
Let D1(i) = 0 i_n ... 0 i_1 0 i_0 (the bits of i in the even bit positions).
Then Smz(i,j) = D0(i) | D1(j).
Dilation is rather expensive for an inner loop.
Strength reduction (Wise et al.), with masks Ones0 = 0101...01 and Ones1 = 1010...10:
D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0
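
The dilation functions and the incremental update can be sketched as follows. The sketch assumes 32-bit offsets and indices below 65536, and it assumes that Ones0 is the mask with ones in the even bit positions and Ones1 the mask with ones in the odd positions, which is the reading under which the two update formulas are correct. The lower-case function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ONES0 0x55555555u /* assumed: ones in the even bit positions */
#define ONES1 0xAAAAAAAAu /* assumed: ones in the odd bit positions  */

/* D1: spread the low 16 bits of x into the even bit positions. */
uint32_t d1(uint32_t x) {
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* D0: the same bits shifted into the odd bit positions. */
uint32_t d0(uint32_t x) { return d1(x) << 1; }

/* Morton offset Smz(i,j) = D0(i) | D1(j). */
uint32_t smz(uint32_t i, uint32_t j) { return d0(i) | d1(j); }

/* Incremental updates: filling the unused positions with ones lets the
   carry of "+ 1" ripple across them; the final mask clears them again. */
uint32_t d0_next(uint32_t d) { return ((d | ONES0) + 1u) & ONES1; }
uint32_t d1_next(uint32_t d) { return ((d | ONES1) + 1u) & ONES0; }
```

For example, smz(1, 7) evaluates to 23, matching the corrected table entry for (1,7) on the slide about recursively-blocked layout. The update functions replace a full dilation per iteration with three cheap bit operations.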

16
Address calculation

Idea: use lookup tables for D0(i) and D1(j):
A[MortonTabEven[i] + MortonTabOdd[j]]
When can we do strength reduction?
In general, Smz(i,j+1) could be anywhere.
D0(i + 1) = ???
D0(i + k) = D0(i) + D0(k) if the bits of i and k do not overlap.
We can do the strength reduction D0(i + k) = D0(i) + D0(k) as long as i = 2^n and k < 2^n (more generally, whenever i is a multiple of 2^n).
With this, we can do loop unrolling.
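
The lookup tables and the additive property can be sketched as follows. The table names follow the slide; their contents are an assumption, namely that MortonTabEven[x] holds D0(x) and MortonTabOdd[x] holds D1(x), so that MortonTabEven[i] + MortonTabOdd[j] equals Smz(i,j). TAB_SIZE and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TAB_SIZE 1024u /* hypothetical maximum array extent */

static uint32_t MortonTabEven[TAB_SIZE]; /* assumed to hold D0(x) */
static uint32_t MortonTabOdd[TAB_SIZE];  /* assumed to hold D1(x) */

/* Fill both tables incrementally instead of dilating every index. */
void init_morton_tables(void) {
    uint32_t d1 = 0;
    for (uint32_t x = 0; x < TAB_SIZE; x++) {
        MortonTabOdd[x] = d1;
        MortonTabEven[x] = d1 << 1;
        d1 = ((d1 | 0xAAAAAAAAu) + 1u) & 0x55555555u;
    }
}

/* Morton offset of element (i,j) via two table lookups. */
uint32_t smz_lookup(uint32_t i, uint32_t j) {
    assert(i < TAB_SIZE && j < TAB_SIZE);
    return MortonTabEven[i] + MortonTabOdd[j];
}

/* Non-zero if the entry for i + k is the sum of the entries for i and k. */
int is_additive(const uint32_t *tab, uint32_t i, uint32_t k) {
    assert(i + k < TAB_SIZE);
    return tab[i + k] == tab[i] + tab[k];
}
```

After init_morton_tables(), is_additive holds for both tables whenever i is a multiple of 4 and k < 4, because the bits of i and k then do not overlap. It fails for i = 1 and k = 1, where they do. This is what allows a loop that starts at a multiple of 4 to be unrolled four times with constant offsets.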

17
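Unrolled Code with Strength-Reduction

    double mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
    {
      unsigned i, j, k;
      for (i = 0; i < sz; i++) {
        unsigned int t1i = MortonTabOdd[i];
        for (j = 0; j < sz; j++) {
          unsigned int t0j = MortonTabEven[j];
          for (k = 0; k < sz; k += 4) {
            unsigned int t0k = MortonTabEven[k];
            unsigned int t1k = MortonTabOdd[k];
            C[t1i + t0j] += A[t1i + t0k]      * B[t1k + t0j];
            C[t1i + t0j] += A[t1i + t0k + 2]  * B[t1k + t0j + 1];
            C[t1i + t0j] += A[t1i + t0k + 8]  * B[t1k + t0j + 4];
            C[t1i + t0j] += A[t1i + t0k + 10] * B[t1k + t0j + 5];
          }
        }
      }
      /* the remainder of the function is not shown on the slide */
    }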
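
A self-contained variant of the unrolled kernel can be checked against a version that performs the table lookups for every k. The sketch below assumes that MortonTabEven[x] holds D0(x) and MortonTabOdd[x] holds D1(x), which is the reading under which the constant offsets 2, 8, 10 and 1, 4, 5 on the slide are correct. It also assumes that sz is a power of two and at least 4. FLOATTYPE as double, MAX_SZ and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef double FLOATTYPE; /* assumed element type */
#define MAX_SZ 256u       /* hypothetical upper bound on sz */

static uint32_t MortonTabEven[MAX_SZ]; /* assumed: D0(x), odd bit positions  */
static uint32_t MortonTabOdd[MAX_SZ];  /* assumed: D1(x), even bit positions */

void init_morton_tables(void) {
    uint32_t d1 = 0;
    for (uint32_t x = 0; x < MAX_SZ; x++) {
        MortonTabOdd[x] = d1;
        MortonTabEven[x] = d1 << 1;
        d1 = ((d1 | 0xAAAAAAAAu) + 1u) & 0x55555555u;
    }
}

/* C += A * B with the k loop unrolled by 4, as on the slide. */
void mm_ijk_unrolled(unsigned sz, const FLOATTYPE *A, const FLOATTYPE *B,
                     FLOATTYPE *C) {
    assert(sz >= 4 && sz <= MAX_SZ && (sz & (sz - 1)) == 0);
    for (unsigned i = 0; i < sz; i++) {
        uint32_t t1i = MortonTabOdd[i];
        for (unsigned j = 0; j < sz; j++) {
            uint32_t t0j = MortonTabEven[j];
            for (unsigned k = 0; k < sz; k += 4) {
                uint32_t t0k = MortonTabEven[k];
                uint32_t t1k = MortonTabOdd[k];
                /* k is a multiple of 4, so the entries for k+1, k+2, k+3
                   differ from those for k by constants. */
                C[t1i + t0j] += A[t1i + t0k]      * B[t1k + t0j];
                C[t1i + t0j] += A[t1i + t0k + 2]  * B[t1k + t0j + 1];
                C[t1i + t0j] += A[t1i + t0k + 8]  * B[t1k + t0j + 4];
                C[t1i + t0j] += A[t1i + t0k + 10] * B[t1k + t0j + 5];
            }
        }
    }
}

/* Reference: the same loop nest with both lookups done for every k. */
void mm_ijk_lookup(unsigned sz, const FLOATTYPE *A, const FLOATTYPE *B,
                   FLOATTYPE *C) {
    assert(sz <= MAX_SZ && (sz & (sz - 1)) == 0);
    for (unsigned i = 0; i < sz; i++)
        for (unsigned j = 0; j < sz; j++)
            for (unsigned k = 0; k < sz; k++)
                C[MortonTabOdd[i] + MortonTabEven[j]] +=
                    A[MortonTabOdd[i] + MortonTabEven[k]] *
                    B[MortonTabOdd[k] + MortonTabEven[j]];
}
```

Both functions expect all three matrices in the same Morton layout, with sz*sz elements each, and require init_morton_tables() to have been called first. Since they add the same products in the same order, their results should agree exactly, which gives a simple check that the constant offsets are right.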

18
So have we solved the problem?

Unrolling significantly reduces the overhead of
the Basic Morton Scheme.
IKJ is still faster than IJK; this might be due to the two table lookups in the inner loop of IJK.

19
Benchmarks

Suite of simple numerical kernels operating on 2D
arrays of doubles
Used the compilers and flags which the vendors
used for their SPEC-CFP2000 results

MM-ijk: Matrix multiply, ijk loop nest
MM-ikj: Matrix multiply, ikj loop nest
Cholk: Cholesky, k variant
Jacobi2d: two-dimensional, 4-point stencil smoother
ADI: Alternating Direction Implicit kernel, ij, ij order
20
Experimental Setup

We used identical clusters of (student) lab
machines during off-peak periods
Extensive scripting to automate data collection
Dixon Test to remove outliers from the
measurements
Use median instead of mean.
Overall more than 26M measurements

21
Architectures

AMD, Thunderbird 1.8GHz, 512MB DDR-RAM
  L1 cache: 64KB, 2-way, 64-byte blocks. L2 cache: 256KB, 8-way, 64-byte blocks.
  Intel C Compiler v7.1 for Linux. Flags: -xK -ipo -static, FDO
Pentium III, Coppermine 450MHz, 256MB SDRAM
  L1 cache: 16KB, 4-way, 32-byte blocks. L2 cache: 256KB, 8-way, 32-byte blocks.
  Intel C Compiler v7.1 for Linux. Flags: -xK -ipo -O3 -static, FDO
Sun, SunFire 6800, UltraSparc III 750MHz
  L1 cache: 64KB, 4-way, 32-byte blocks. L2 cache: 8MB, direct-mapped.
  Sun Workshop Compiler V6. Flags: -fast -xcrossfile -xalias_level=std, FDO
Alpha, Compaq AlphaServer ES40, 21264 (EV6) 500MHz
  L1 cache: 64KB, 2-way, 64-byte blocks. L2 cache: 4MB, direct-mapped.
  Compaq C Compiler V6. Flags: -arch ev6 -fast -O4
Pentium 4, 2.0GHz, 512MB DDR-RAM
  L1 cache: 8KB, 8-way, 64-byte blocks. L2 cache: 256KB, 8-way, 64-byte blocks.
  Intel C Compiler v7 for Linux. Flags: -xW -ipo -O3 -static, FDO

22
Alpha (L1: 64KB/2-w/64Byte, L2: 4MB/DM)
23
Athlon (L1: 64KB/2-w/64Byte, L2: 256KB/8-w/64B)
24
Pentium III (L1: 16KB/4-w/32Byte, L2: 256KB/8-w/32B)
25
Pentium 4 (L1: 8KB/8-w/64Byte, L2: 256KB/8-w/64B)
26
Sparc (L1: 64KB/4-w/32Byte, L2: 8MB/Direct-mapped/64Byte)
27
Summary

The Basic Morton Scheme often performs
disappointingly.
Page-aligning the base address theoretically
maximises spatial locality.
Unrolling is facilitated by carefully aligning
the start iteration of unrolled loops to
power-of-two indices into the array.
With base-address alignment and unrolling for
strength-reduction of index calculation, Morton
layout is beginning to actually work.

28
Future Work

Larger factors of unrolling:
Until now, we have only hand-unrolled by a factor of 4.
We have used code generation to unroll by larger factors, and it seems that there are more improvements to be had.
Prefetching:
It's likely that hardware prefetching will fetch the wrong things.
Turn off hardware prefetching and use the right, compiler-directed prefetching instead.
Tiling:
Storage layout transformations and iteration space transformations are complementary, so we should do both.