Optimizing 3D Multigrid to Be Comparable to the FFT

Provided by: kda47


1
Optimizing 3D Multigrid to Be Comparable to the
FFT
  • Michael Maire and Kaushik Datta
  • Note: Several diagrams were taken from Kathy
    Yelick's CS267 lectures

2
Outline
  • FFT and Multigrid Overview
  • FFT and MG Running Times
  • Multigrid Performance Model
  • Multigrid Optimizations
  • Results

3
3D Poisson's Equation
  • An elliptic PDE that arises in many physical
    problems (e.g. electrostatic or gravitational
    potential)
  • Many different techniques for solving are
    available

4
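3D Poisson's Equation
  • The continuous version is
    $\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2} = \rho$
    (or $\Delta \phi = \rho$)
  • The discrete version is T x = b
  • In 2D, the 9-point stencil looks like
    [Figure: 9-point stencil diagram, not transcribed]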

5
Algorithms for Solving 2D Poisson's Equation with
N unknowns

    Algorithm        Serial Flops    Memory
    Dense LU         N^3             N^2
    Band LU          N^2             N^(3/2)
    Jacobi           N^2             N
    Explicit Inv.    N^2             N^2
    Conj. Grad.      N^(3/2)         N
    RB SOR           N^(3/2)         N
    Sparse LU        N^(3/2)         N log N
    FFT              N log N         N
    Multigrid        N               N
    Lower bound      N               N

6
Multigrid Overview
  • Basic Algorithm
  • Replace problem on fine grid by an approximation
    on a coarser grid
  • Solve the coarse grid problem approximately, and
    use the solution as a starting guess for the
    fine-grid problem, which is then iteratively
    updated
  • Solve the coarse grid problem recursively, i.e.
    by using a still coarser grid approximation, etc.
  • Success depends on coarse grid solution being a
    good approximation to the fine grid
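
The recursive scheme above can be sketched in a few lines. This is a minimal illustration only, assuming a 1D model problem (-u'' = f with zero boundary values) rather than the 3D problem of the slides. The function names mirror the operators named later in the deck, but the bodies (weighted Jacobi, full-weighting restriction, linear interpolation) are assumptions, not the authors' code.

```python
# Minimal 1D sketch of a recursive multigrid V-cycle (assumed model problem:
# -u'' = f on (0, 1), u(0) = u(1) = 0, grid of 2**m + 1 points).

def residual(u, f, h):
    """r = f - T u for the 3-point stencil (-1, 2, -1) / h^2."""
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """A few sweeps of weighted Jacobi relaxation (omega is an assumption)."""
    n = len(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            jac = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jac
        u = new
    return u

def coarsen(r):
    """Full-weighting restriction onto every other grid point."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolongate(ec):
    """Linear interpolation from the coarse grid to the next finer grid."""
    n = 2 * (len(ec) - 1) + 1
    e = [0.0] * n
    for i in range(len(ec)):
        e[2 * i] = ec[i]
    for i in range(1, n, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e

def v_cycle(u, f, h):
    """One V-cycle: smooth, recurse on the coarse residual problem, correct."""
    if len(u) == 3:
        # Coarsest grid has a single unknown, so solve it exactly.
        return [0.0, 0.5 * h * h * f[1], 0.0]
    u = smooth(u, f, h)
    rc = coarsen(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    e = prolongate(ec)
    u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, f, h)
```

Calling `v_cycle` repeatedly on the same `u` reduces the residual further on each call; the coarse grid only has to approximate the smooth part of the error, which is the condition stated in the last bullet.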

7
Multigrid Sketch on a 2D Mesh
  • Consider a (2^m + 1) by (2^m + 1) grid
  • Let P(i) be the problem of solving the discrete
    Poisson equation on a (2^i + 1) by (2^i + 1) grid in 2D
  • Write linear system as T(i) x(i) = b(i)
  • P(m), P(m-1), ..., P(1) is sequence of problems
    from finest to coarsest
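
As a small worked example of the hierarchy above, the sketch below lists the grid sizes for P(m) down to P(1) and compares the total number of unknowns over all levels with the finest level alone. The helper names are hypothetical. The ratio stays close to 4/3 in 2D, which is consistent with the O(N) multigrid entry in the earlier table.

```python
def grid_sizes(m):
    """Return (level, points per side, total points) for P(m), ..., P(1) in 2D."""
    levels = []
    for i in range(m, 0, -1):
        side = 2 ** i + 1
        levels.append((i, side, side * side))
    return levels

def total_work_ratio(m):
    """Sum of all grid sizes divided by the size of the finest grid."""
    levels = grid_sizes(m)
    finest = levels[0][2]
    return sum(cells for _, _, cells in levels) / finest
```

For instance, `grid_sizes(3)` gives sides 9, 5 and 3, so the coarser grids together add well under half of the finest grid's 81 points.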

8
Multigrid Convergence
9
Multigrid Operators
  • The four operators that we examine are:
      • evaluateResidual: calculates the residual of our
        current solution
      • applySmoother: performs a Jacobi relaxation step
      • coarsen: maps from a (2^m x 2^m x 2^m) grid to a
        (2^(m-1) x 2^(m-1) x 2^(m-1)) grid
      • prolongate: maps from a (2^(m-1) x 2^(m-1) x 2^(m-1))
        grid to a (2^m x 2^m x 2^m) grid
  • All these operators perform nearest-neighbor
    computations using a 27-point stencil
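
A straightforward 27-point pass might look like the sketch below. It is an assumption-laden illustration: the grid is a flat row-major list, and the weights `w[0..3]` (chosen by how many of the three offsets are nonzero) are hypothetical, since the slides do not give the stencil coefficients. It is the unoptimized baseline that recomputes 27 indices and performs 27 loads per cell.

```python
from itertools import product

# The 27 neighbor offsets (di, dj, dk), each component in {-1, 0, 1}.
OFFSETS = list(product((-1, 0, 1), repeat=3))

def apply_stencil(u, n, w):
    """Apply a 27-point stencil to the interior of an n x n x n grid.

    u is a flat list in row-major order; w[d] is the weight for offsets
    with |di| + |dj| + |dk| == d (hypothetical weights, not from the slides).
    Boundary cells of the result are left at 0.0.
    """
    out = [0.0] * (n * n * n)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                acc = 0.0
                for di, dj, dk in OFFSETS:
                    # One index computation and one load per neighbor.
                    idx = ((i + di) * n + (j + dj)) * n + (k + dk)
                    acc += w[abs(di) + abs(dj) + abs(dk)] * u[idx]
                out[(i * n + j) * n + k] = acc
    return out
```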

10
Multigrid V-cycle
  • Just a picture of the call graph
  • In time a V-cycle looks like the following

[Figure: V-cycle drawn as grid level (5, 4, 3, 2, 1) against time; labeled operations: coarsen; prolongate, applySmoother, evaluateResidual]

11
Multigrid Performance Model
  • Memory access is performance bottleneck
  • Each pass over 3D grid requires (per cell):
      • 27 integer operations (stencil coordinates)
      • 27 FP loads of surrounding grid locations
      • 27 (approx) FP operations
      • 1 FP store
  • Traversing grid consecutively in memory causes 9
    cache misses every (# doubles stored in a cache
    line) cells, i.e. 9 / (# doubles per cache line)
    misses per cell
  • Grid size prevents reuse of cached values
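
The miss estimate in this model is simple arithmetic, sketched below. The cache-line sizes are assumed inputs, not values from the slides, and the second helper rests on a further assumption: that reuse between passes needs roughly three adjacent n x n slices of doubles to stay resident in cache.

```python
def misses_per_cell(line_bytes, streams=9, double_bytes=8):
    """Cache misses per cell for a unit-stride sweep with `streams` load streams.

    Each stream misses once per cache line, i.e. once every
    (line_bytes / double_bytes) cells.
    """
    doubles_per_line = line_bytes // double_bytes
    return streams / doubles_per_line

def planes_fit_in_cache(n, cache_bytes, planes=3, double_bytes=8):
    """True if `planes` adjacent n x n slices of doubles fit in the cache."""
    return planes * n * n * double_bytes <= cache_bytes
```

With a hypothetical 128-byte line, `misses_per_cell(128)` is 9/16; with a 64 KB cache, three 256 x 256 slices do not fit, which is the situation the last bullet describes.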

14
Multigrid Optimizations
  • Optimizations possible in 2 areas:
  • Reducing ALU operations per cell
      • Reusing stencil coordinates between cells
      • Reusing partial sums common to consecutive cells
  • Improving memory behavior
      • Reducing # of loads (register blocking)
      • Reducing # of cache misses (cache blocking)

15
Multigrid Optimizations
  • Common subexpression elimination
  • Loop unrolling
  • Memoization
  • Cache blocking
  • Memoization + cache blocking

16
Platforms
  • Power3: 375 MHz, 64 KB L1 cache
  • Itanium II: 900 MHz, 16 KB L1 cache
  • Alphaserver: 1000 MHz, 64 KB L1 cache
  • Opteron: 1600 MHz, 64 KB L1 cache

17
Optimizations - Loop Unrolling
  • Reduces # of stencil coordinates computed per cell
  • Exposes load reuse to compiler
  • Allows compiler to use FP registers to store grid
    values, reducing loads
  • Minimum number of loads is 9/grid point (given
    generous FP registers)
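
The 9-loads-per-point figure can be illustrated with a hand-written sliding window. This is a sketch of the reuse that unrolling exposes, not a literal unrolled loop: Python has no registers, so the three 3 x 3 cross-sections stand in for register-resident values and loads are counted explicitly. The function name, the flat row-major layout and the weights `w` are assumptions.

```python
def sweep_pencil(u, n, i, j, w):
    """Compute 27-point stencil values for cells (i, j, 1..n-2).

    Keeps the three most recent 3 x 3 cross-sections and loads only one
    new cross-section (9 values) per cell. Returns (values, loads).
    w[d] is the weight for offsets with |di| + |dj| + |dk| == d.
    """
    def load_plane(k):
        # 9 loads: the 3 x 3 cross-section around (i, j) at position k.
        return [u[((i + di) * n + (j + dj)) * n + k]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)]

    # In-plane distance class of each of the 9 cross-section entries.
    cls = [abs(di) + abs(dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

    prev, cur = load_plane(0), load_plane(1)
    loads = 18
    values = []
    for k in range(1, n - 1):
        nxt = load_plane(k + 1)
        loads += 9
        acc = 0.0
        for c, p, q, r in zip(cls, prev, cur, nxt):
            # Entries at k-1 and k+1 are one step further from the center.
            acc += w[c + 1] * (p + r) + w[c] * q
        values.append(acc)
        prev, cur = cur, nxt  # rotate the window instead of reloading
    return values, loads
```

For a pencil of n - 2 interior cells the count is 18 + 9(n - 2) loads, which approaches 9 per cell as n grows, against 27 per cell when every neighbor is reloaded.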

18
Optimizations - Memoization
  • Traverses grid once to precompute partial sums
    common to consecutive cells
  • Traverses grid again to compute actual cell
    values
  • 9 integer stencil operations/cell
  • 18 FP operations/cell
  • Reduces FP register pressure by breaking
    computation into two stages, but still uses 9
    load streams per cell
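
One possible form of the two-pass scheme is sketched below. The slides do not say which partial sums are stored, so the choice here is an assumption: for each cell, pass 1 stores the sum of its 4 edge-adjacent and its 4 diagonal neighbors within the cross-section perpendicular to the unit-stride direction, and pass 2 combines the sums of three consecutive cells. The weights `w` are hypothetical, and the sketch does not try to reproduce the operation counts quoted above.

```python
def memoized_stencil(u, n, w):
    """Two-pass 27-point stencil on a flat row-major n x n x n grid.

    w[d] is the weight for offsets with |di| + |dj| + |dk| == d.
    Boundary cells of the result are left at 0.0.
    """
    size = n * n * n
    s = n * n            # index stride of i; j has stride n, k has stride 1
    b = [0.0] * size     # sum of the 4 in-plane edge-adjacent neighbors
    c = [0.0] * size     # sum of the 4 in-plane diagonal neighbors

    # Pass 1: precompute partial sums shared by consecutive cells along k.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(n):
                p = (i * n + j) * n + k
                b[p] = u[p - s] + u[p + s] + u[p - n] + u[p + n]
                c[p] = (u[p - s - n] + u[p - s + n]
                        + u[p + s - n] + u[p + s + n])

    # Pass 2: combine the partial sums at k-1, k and k+1.
    out = [0.0] * size
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                p = (i * n + j) * n + k
                out[p] = (w[0] * u[p]
                          + w[1] * (u[p - 1] + u[p + 1] + b[p])
                          + w[2] * (b[p - 1] + b[p + 1] + c[p])
                          + w[3] * (c[p - 1] + c[p + 1]))
    return out
```

Pass 2 reads three arrays at three consecutive positions each, so it still touches 9 values per cell, but each value is consumed directly instead of being held across 27 multiply-adds.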

20
Optimizations - Cache Blocking
  • Break 3D grid into blocks that fit within cache
  • Attempts to allow reuse between adjacent
    2D-slices
  • Reduces memory traffic to 3 load streams per cell
  • Overhead when switching between blocks
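
A blocked traversal order can be sketched as follows. The block shape is an assumption: blocks span the two faster-varying dimensions and the sweep marches slice by slice along the slowest one, so that three adjacent slices of a block can stay in cache. The sizing helper assumes a one-cell halo around each block; neither helper comes from the slides.

```python
def blocked_order(n, bj, bk):
    """Yield the interior cells (i, j, k) of an n^3 grid block by block.

    Each block covers at most bj x bk cells of a 2D slice and is swept
    through all slices i before moving to the next block.
    """
    for j0 in range(1, n - 1, bj):
        for k0 in range(1, n - 1, bk):
            for i in range(1, n - 1):          # march through 2D slices
                for j in range(j0, min(j0 + bj, n - 1)):
                    for k in range(k0, min(k0 + bk, n - 1)):
                        yield i, j, k

def max_block_side(cache_bytes, slices=3, double_bytes=8):
    """Largest b such that `slices` slices of (b + 2) x (b + 2) doubles fit."""
    b = 1
    while slices * (b + 3) ** 2 * double_bytes <= cache_bytes:
        b += 1
    return b
```

Under these assumptions a 64 KB cache allows blocks of about 50 x 50 cells and a 16 KB cache about 24 x 24; smaller blocks mean more block switches per grid, which is the overhead noted in the last bullet.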

25
One V-cycle Time
26
FFT vs. MG on Power3
27
Summary and Continuing Work
  • Overhead of cache-blocking is too large for the
    small block sizes that fit in the IBM Power3's L1
    cache
  • Memoization offers greatest performance benefit
    due to reduced FP operations
