Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety
1
Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling
Reducing the Price of Naivety
  • Jeyarajan Thiyagalingam
  • Olav Beckmann and Paul H.J. Kelly
  • Software Performance Optimisation Group,
  • Imperial College, London

2
Motivation
  • Consider two code variants of a matrix multiply

IJK Variant:
    for( i = 0; i < N; i++ )
      for( j = 0; j < N; j++ )
        for( k = 0; k < N; k++ )
          C[i][j] += A[i][k] * B[k][j];

IKJ Variant:
    for( i = 0; i < N; i++ )
      for( k = 0; k < N; k++ )
        for( j = 0; j < N; j++ )
          C[i][j] += A[i][k] * B[k][j];
  • Both code variants are valid and apparently of the same complexity.

3
The price of naivety
  • Depending on problem size and architecture, the IKJ variant can be up to
    10 times faster than IJK: with row-major storage, the IKJ inner loop walks
    B and C with unit stride, while the IJK inner loop strides through B
    column by column.

4
Performance Programming Model
  • Naively-written code can suffer a factor 10
    performance hit
  • Sometimes the compiler can help, but none of the compilers we used
    interchanged these loops.
  • A robust performance programming model would have
    to account for the capabilities of the compiler
  • Offering a clear Performance Programming Model
    should be part of Compiler Research.

5
Compromise blocked layout
[Figure: an 8×8 array with offsets 0–63 shown in row-major order, with 2×2 blocks of four words marked]
  • Reason for the difference in performance (assuming 4-word cache blocks)
  • With row-major storage, a row-major traversal uses all 4 words of each
    cache block,
  • but a column-major traversal uses only 1 word per block
  • Bandwidth is wasted in the column-major case
  • Blocked layout: each 4-word cache block holds a 2x2 subarray
  • Row-major traversal uses 2 words per block
  • Column-major traversal uses 2 words per block
  • (A sketch of the blocked offset calculation follows below.)
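As a concrete illustration (our sketch, not taken from the slides), the offset of element (i, j) in an N×N array stored as 2×2 blocks, with the blocks themselves laid out in row-major order, can be computed as follows; the function name and the assumption that N is even are ours:

    #include <stddef.h>

    /* Illustrative sketch (not from the slides): offset of A[i][j] in an
     * N x N array stored as 2x2 blocks, the blocks themselves in row-major
     * order.  Assumes N is even.  Each 2x2 block fills one 4-word cache
     * block, so row- and column-major traversals both touch 2 of the 4
     * words in every block they load. */
    static size_t offset_blocked_2x2(size_t i, size_t j, size_t N)
    {
        size_t block_row = i / 2, block_col = j / 2;    /* which 2x2 block    */
        size_t in_row    = i % 2, in_col    = j % 2;    /* position in block  */
        return (block_row * (N / 2) + block_col) * 4    /* start of the block */
             + in_row * 2 + in_col;                     /* offset within it   */
    }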

6
Recursively-blocked layout
Offset: 0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
(i,j):  (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
Offset: 16    17    18    19    20    21    22    23    24    25    26    27    28    29    30    31
(i,j):  (0,4) (0,5) (1,4) (1,5) (0,6) (0,7) (1,6) (1,7) (2,4) (2,5) (3,4) (3,5) (2,6) (2,7) (3,6) (3,7)
  • Real machines have deep memory hierarchies
  • Therefore, we need to apply blocking recursively
  • Layout of the blocks: Z-Morton (one of a number of space-filling curves);
    a sketch of the offset computation follows below
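The mapping in the table above can be computed by interleaving the bits of i and j, with i's bits landing in the odd bit positions. A minimal, unoptimised sketch (the function name is ours), assuming 32-bit indices:

    /* Unoptimised sketch: Z-Morton offset of (i, j) by bit interleaving,
     * with i's bits in the odd positions and j's bits in the even positions
     * (so the 2x2 base case is (i << 1) | j, as in the table above). */
    static unsigned long long morton_offset(unsigned i, unsigned j)
    {
        unsigned long long s = 0;
        for (unsigned b = 0; b < 32; b++) {
            s |= (unsigned long long)((j >> b) & 1u) << (2 * b);      /* even bits */
            s |= (unsigned long long)((i >> b) & 1u) << (2 * b + 1);  /* odd bits  */
        }
        return s;
    }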

7
Morton Layout A Compromise
Hit rates for row-major traversal:
Block size    RM array    Morton array
32B           75%         50%
128B          93.7%       75%
8kB page      99.9%       96.87%

Hit rates for column-major traversal:
Block size    RM array    Morton array
32B           0%          50%
128B          0%          75%
8kB page      0%          96.87%
  • Morton storage layout is unbiased towards either
    row- or column-major traversal.
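  • For example, a 32-byte block holds four doubles: in a row-major array, a
    row-major traversal misses on one access in four (75% hits) while a
    column-major traversal of a large array misses on every access; in Morton
    layout the same block holds a 2×2 subarray, so either traversal misses on
    one access in two (50% hits).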

8
So have we solved the problem?
  • Unfortunately, the basic Morton Scheme often
    performs disappointingly.
  • At least Morton does not seem to suffer from
    pathological drops in performance.

9
Alignment
[Figure: the 4×4 Morton offset table above repeated twice, illustrating two different alignments of cache-line boundaries relative to the Morton blocks]
  • The statement that Morton layout is unbiased turns out to rest on the
    assumption that cache lines map to the start of Morton blocks.

10
Alignment
  • It turns out that Morton layout is only unbiased for cache lines whose
    size is an even power of two (so that each line holds a square subarray)
  • The same problems arise when the base address is misaligned

11
Alignment
  • We calculated miss rates systematically for all levels of the memory
    hierarchy
  • In each case, we calculated the miss rates for all possible alignments of
    the base address.
  • The difference in miss rates between the best and worst alignment of the
    base address of a Morton array can be up to a factor of 1.5 for even
    power-of-two cache lines, and a factor of 2 for odd power-of-two cache
    lines.

12
Alignment
  • The overall miss-rates drop exponentially with
    block size, but access times are generally
    assumed to increase geometrically with block size.

13
Alignment
  • With canonical layouts, it is often necessary to pad the row or column
    length in order to avoid pathological behaviour.
  • Finding the right amount of padding is not trivial.
  • Theoretically, one should align the base address of Morton arrays to the
    largest significant block size in the memory hierarchy, i.e. the page
    size (see the allocation sketch after this list).
  • Aliasing in the memory hierarchy can spoil the theory.
  • For example, on the Pentium 4 the following aliasing patterns cause
    problems:
  • addresses 2K apart map to the same L1 cache line
  • addresses 16K apart alias in the store-forwarding logic
  • addresses 32K apart map to the same L2 cache line
  • addresses 64K apart are indistinguishable in the L1 cache
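A minimal sketch of page-aligned allocation on a POSIX system (our illustration, not the authors' code; the function name is ours):

    #include <stdlib.h>
    #include <unistd.h>

    /* Illustrative sketch (not the authors' code): allocate the storage for
     * a Morton array with its base address aligned to the system page size,
     * the largest significant block size in the memory hierarchy. */
    static double *alloc_page_aligned(size_t n_elements)
    {
        void  *p    = NULL;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        if (posix_memalign(&p, page, n_elements * sizeof(double)) != 0)
            return NULL;
        return (double *)p;
    }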

14
Address calculation
  • With lexicographic (aka canonical) layout, it is easy to calculate the
    offset S of A[i,j] in an M×N array A (M rows, N columns)
  • S_rm(i,j) = N·i + j      S_cm(i,j) = i + M·j
  • (if N and M are powers of two, this is a bit-wise concatenation of i and j)
  • In loops, the multiplication is replaced by an increment (see the sketch
    below)
  • When unrolling loops, the address calculation can be strength-reduced.
  • How can we calculate the Morton offset?
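For example (our sketch, not from the slides), a row-major traversal can maintain S_rm(i,j) = N·i + j as a running counter instead of recomputing the multiplication; the function and its parameters are illustrative only:

    /* Sketch (ours): scale every element of an M x N row-major array,
     * keeping the offset S_rm(i,j) = N*i + j as a running counter so the
     * multiplication is replaced by an increment. */
    void scale_rowmajor(double *A, unsigned M, unsigned N, double alpha)
    {
        unsigned s = 0;                        /* s == N*i + j throughout */
        for (unsigned i = 0; i < M; i++)
            for (unsigned j = 0; j < N; j++)
                A[s++] *= alpha;               /* increment, no multiply  */
    }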

15
Address calculation
  • Morton indices can be calculated by applying the bit-concatenation idea
    of RM/CM for power-of-two arrays recursively
  • For a 2x2 array, if i and j are the indices, then the location is
    (i << 1) | j
  • Let D0(i) = i_n 0 ... i_1 0 i_0 0   (the bits of i dilated into the odd
    bit positions)
  • Let D1(i) = 0 i_n ... 0 i_1 0 i_0   (the bits of i dilated into the even
    bit positions)
  • Then S_mz(i,j) = D0(i) | D1(j)
  • Dilation is rather expensive in an inner loop
  • Strength reduction (Wise et al.), sketched in code below:
  • D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
  • D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0
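A minimal sketch of these strength-reduced increments, assuming 32-bit dilated values and taking Ones0 = 0x55555555 (ones in the even bit positions) and Ones1 = 0xAAAAAAAA (ones in the odd bit positions); the mask values and function names are our assumptions, the formulas follow the slide:

    #include <stdint.h>

    /* Assumed masks for 32-bit dilated values: Ones0 marks the even bit
     * positions, Ones1 the odd ones. */
    #define ONES0 0x55555555u
    #define ONES1 0xAAAAAAAAu

    /* D0 keeps its bits in the odd positions: filling the even positions
     * with ones lets the +1 carry ripple straight to the next odd bit. */
    static inline uint32_t d0_increment(uint32_t d0)
    {
        return ((d0 | ONES0) + 1) & ONES1;
    }

    /* D1 keeps its bits in the even positions; the symmetric case. */
    static inline uint32_t d1_increment(uint32_t d1)
    {
        return ((d1 | ONES1) + 1) & ONES0;
    }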

16
Address calculation
  • Idea: use lookup tables for D0(i) and D1(j)
  • A[ MortonTabOdd[i] | MortonTabEven[j] ]
  • When can we do strength reduction?
  • In general S_mz(i, j+1) could be anywhere
  • D0(i+1) = ???
  • D0(i + k) = D0(i) | D0(k) if i's and k's bits do not overlap
  • So we can strength-reduce D0(i + k) = D0(i) | D0(k) as long as i is a
    multiple of 2^n and k < 2^n
  • With this, we can do loop unrolling (a sketch of the table initialisation
    follows below)
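One way the lookup tables used on the next slide could be initialised (our sketch; the table names follow the slide, while the table size and the dilation helper are assumptions):

    #define MORTON_TAB_SIZE 1024   /* assumed: one entry per possible row/column index */

    static unsigned MortonTabEven[MORTON_TAB_SIZE];  /* D1(x): bits in even positions */
    static unsigned MortonTabOdd [MORTON_TAB_SIZE];  /* D0(x): bits in odd positions  */

    /* Dilate x so its bits occupy every second bit position, starting at 'start'. */
    static unsigned dilate(unsigned x, unsigned start)
    {
        unsigned d = 0;
        for (unsigned b = 0; (x >> b) != 0; b++)
            d |= ((x >> b) & 1u) << (2 * b + start);
        return d;
    }

    static void init_morton_tables(void)
    {
        for (unsigned x = 0; x < MORTON_TAB_SIZE; x++) {
            MortonTabEven[x] = dilate(x, 0);
            MortonTabOdd [x] = dilate(x, 1);
        }
    }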

17
Unrolled Code with Strength Reduction
    void mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
    {
      unsigned i, j, k;
      for (i = 0; i < sz; i++) {
        unsigned int t1i = MortonTabOdd[i];
        for (j = 0; j < sz; j++) {
          unsigned int t0j = MortonTabEven[j];
          for (k = 0; k < sz; k += 4) {
            unsigned int t0k = MortonTabEven[k];
            unsigned int t1k = MortonTabOdd[k];
            C[t1i|t0j] += A[t1i|t0k]   * B[t1k|t0j];      /* k   */
            C[t1i|t0j] += A[t1i|t0k|1] * B[t1k|t0j|2];    /* k+1 */
            C[t1i|t0j] += A[t1i|t0k|4] * B[t1k|t0j|8];    /* k+2 */
            C[t1i|t0j] += A[t1i|t0k|5] * B[t1k|t0j|10];   /* k+3 */
          }
        }
      }
    }

18
So have we solved the problem?
  • Unrolling significantly reduces the overhead of the basic Morton scheme.
  • IKJ is still faster than IJK; this might be because the IJK variant needs
    two table lookups in its inner loop.

19
Benchmarks
  • Suite of simple numerical kernels operating on 2D
    arrays of doubles
  • Used the compilers and flags which the vendors
    used for their SPEC-CFP2000 results

MM-ijk     Matrix multiply, ijk loop nest
MM-ikj     Matrix multiply, ikj loop nest
Cholk      Cholesky factorisation, k variant
Jacobi2D   Two-dimensional, four-point stencil smoother
ADI        Alternating Direction Implicit kernel, ij/ij order
20
Experimental Setup
  • We used identical clusters of (student) lab
    machines during off-peak periods
  • Extensive scripting to automate data collection
  • Dixon's test to remove outliers from the measurements
  • Used the median instead of the mean
  • Overall, more than 26 million measurements

21
Architectures
  • AMD Athlon (Thunderbird), 1.8GHz, 512MB DDR-RAM
  • L1: 64KB, 2-way, 64B blocks; L2: 256KB, 8-way, 64B blocks
  • Intel C Compiler v7.1 for Linux; flags: -xK -ipo -static + FDO
    (feedback-directed optimisation)
  • Pentium III (Coppermine), 450MHz, 256MB SDRAM
  • L1: 16KB, 4-way, 32B blocks; L2: 256KB, 8-way, 32B blocks
  • Intel C Compiler v7.1 for Linux; flags: -xK -ipo -O3 -static + FDO
  • Sun SunFire 6800, UltraSPARC III, 750MHz
  • L1: 64KB, 4-way, 32B blocks; L2: 8MB, direct-mapped
  • Sun Workshop Compiler v6; flags: -fast -xcrossfile -xalias_level=std + FDO
  • Compaq AlphaServer ES40, Alpha 21264 (EV6), 500MHz
  • L1: 64KB, 2-way, 64B blocks; L2: 4MB, direct-mapped
  • Compaq C Compiler v6; flags: -arch ev6 -fast -O4
  • Pentium 4, 2.0GHz, 512MB DDR-RAM
  • L1: 8KB, 8-way, 64B blocks; L2: 256KB, 8-way, 64B blocks
  • Intel C Compiler v7 for Linux; flags: -xW -ipo -O3 -static + FDO

22
Alpha (L1: 64KB, 2-way, 64B blocks; L2: 4MB, direct-mapped)
23
Athlon (L1: 64KB, 2-way, 64B blocks; L2: 256KB, 8-way, 64B blocks)
24
Pentium III (L1: 16KB, 4-way, 32B blocks; L2: 256KB, 8-way, 32B blocks)
25
Pentium 4 (L1: 8KB, 8-way, 64B blocks; L2: 256KB, 8-way, 64B blocks)
26
SPARC (L1: 64KB, 4-way, 32B blocks; L2: 8MB, direct-mapped, 64B blocks)
27
Summary
  • The Basic Morton Scheme often performs
    disappointingly.
  • Page-aligning the base address theoretically
    maximises spatial locality.
  • Unrolling is facilitated by carefully aligning
    the start iteration of unrolled loops to
    power-of-two indices into the array.
  • With base-address alignment and unrolling for
    strength-reduction of index calculation, Morton
    layout is beginning to actually work.

28
Future Work
  • Larger Factors of Unrolling
  • So far we have only hand-unrolled by a factor of 4
  • We have used code generation to unroll by larger
    factors, and it seems that there are more
    improvements to be had.
  • Prefetching
  • It's likely that hardware prefetching will fetch the wrong data
  • Turn off hardware prefetching and use compiler-directed prefetching
    instead.
  • Tiling
  • Storage layout transformations and iteration space transformations are
    complementary, so we should do both.