Optimizing 3D Multigrid to Be Comparable to the FFT

Slides: 30

Provided by: kda47

1
Optimizing 3D Multigrid to Be Comparable to the
FFT

Michael Maire and Kaushik Datta
Note: Several diagrams were taken from Kathy Yelick's CS267 lectures

2
Outline

FFT and Multigrid Overview
FFT and MG Running Times
Multigrid Performance Model
Multigrid Optimizations
Results

3
3D Poisson's Equation

An elliptic PDE that arises in many physical
problems (e.g. electrostatic or gravitational
potential)
Many different techniques for solving are
available

4
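3D Poisson's Equation

The continuous version is
$\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2} = \rho$ (or $\Delta \phi = \rho$)
The discrete version is $T x = b$
In 2D, the 9-point stencil looks like [figure not included in transcript]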

5
Algorithms for Solving 2D Poisson's Equation with N unknowns

Algorithm        Serial Flops     Memory
Dense LU         $N^3$            $N^2$
Band LU          $N^2$            $N^{3/2}$
Jacobi           $N^2$            $N$
Explicit Inv.    $N^2$            $N^2$
Conj. Grad.      $N^{3/2}$        $N$
RB SOR           $N^{3/2}$        $N$
Sparse LU        $N^{3/2}$        $N \log N$
FFT              $N \log N$       $N$
Multigrid        $N$              $N$
Lower bound      $N$              $N$
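
To get a feel for how far apart these growth rates are, the sketch below evaluates the leading-order serial flop counts for a subset of the rows. It is only an illustration: constant factors are dropped, the logarithm is assumed to be base 2, and the function name `serial_flops` is hypothetical.

```python
import math


def serial_flops(n):
    """Leading-order serial flop counts for some table rows (constants dropped)."""
    return {
        "Dense LU": n ** 3,
        "Band LU": n ** 2,
        "Jacobi": n ** 2,
        "Conj. Grad.": n ** 1.5,
        "RB SOR": n ** 1.5,
        "Sparse LU": n ** 1.5,
        "FFT": n * math.log2(n),  # assumption: log base 2
        "Multigrid": n,
    }


if __name__ == "__main__":
    n = 2 ** 20  # about one million unknowns
    for name, flops in sorted(serial_flops(n).items(), key=lambda kv: kv[1]):
        print(f"{name:12s} {flops:.3e}")
```

With these idealized counts the FFT and multigrid differ only by the log factor, so which one wins in practice depends on the constants, which is what the rest of the talk is about.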

6
Multigrid Overview

Basic Algorithm
Replace problem on fine grid by an approximation
on a coarser grid
Solve the coarse grid problem approximately, and
use the solution as a starting guess for the
fine-grid problem, which is then iteratively
updated
Solve the coarse grid problem recursively, i.e.
by using a still coarser grid approximation, etc.
Success depends on coarse grid solution being a
good approximation to the fine grid

7
Multigrid Sketch on a 2D Mesh

Consider a $(2^m + 1)$ by $(2^m + 1)$ grid
Let P(i) be the problem of solving the discrete Poisson equation on a $(2^i + 1)$ by $(2^i + 1)$ grid in 2D
Write linear system as $T(i) \, x(i) = b(i)$
P(m), P(m-1), ..., P(1) is the sequence of problems from finest to coarsest
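
As a small illustration of the hierarchy, the sketch below lists the points per side for P(m) down to P(1) and sums the grid points over all levels in 2D. The helper names `grid_sides` and `total_points_2d` are hypothetical.

```python
def grid_sides(m):
    """Points per side for P(m), P(m-1), ..., P(1): 2^i + 1 on level i."""
    return [2 ** i + 1 for i in range(m, 0, -1)]


def total_points_2d(m):
    """Total number of grid points summed over every level of the 2D hierarchy."""
    return sum(side * side for side in grid_sides(m))


if __name__ == "__main__":
    sides = grid_sides(5)
    print(sides)  # [33, 17, 9, 5, 3]
    # Ratio of all-level points to finest-level points.
    print(total_points_2d(5) / sides[0] ** 2)
```

Because each coarser 2D level has roughly a quarter of the points of the level above it, the total over all levels stays within a small constant multiple of the finest grid.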

8
Multigrid Convergence
9
Multigrid Operators

The four operators that we examine are:
evaluateResidual: calculates the residual of our current solution
applySmoother: performs a Jacobi relaxation step
coarsen: maps from a ($2^m \times 2^m \times 2^m$) grid to a ($2^{m-1} \times 2^{m-1} \times 2^{m-1}$) grid
prolongate: maps from a ($2^{m-1} \times 2^{m-1} \times 2^{m-1}$) grid to a ($2^m \times 2^m \times 2^m$) grid
All these operators perform nearest-neighbor computations using a 27-point stencil

10
Multigrid V-cycle

Just a picture of the call graph
In time a V-cycle looks like the following
[Figure: V-cycle over time, levels 5 down to 1 and back; legend: prolongate, applySmoother, evaluateResidual, coarsen; horizontal axis: time]
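
The recursion can be sketched in a few lines. What follows is a minimal 1D analogue, not the authors' 3D code: it assumes the model operator T = tridiag(-1, 2, -1)/h^2 with zero boundary values, weighted Jacobi with weight 2/3, full-weighting restriction, linear interpolation, and one smoothing step before and after the coarse-grid correction. The function names mirror the four operators, but their bodies are assumptions made for illustration.

```python
def evaluate_residual(u, b, h):
    """r = b - T u at interior points, with T = tridiag(-1, 2, -1) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = b[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def apply_smoother(u, b, h, omega=2.0 / 3.0):
    """One weighted-Jacobi relaxation step (boundary values are untouched)."""
    r = evaluate_residual(u, b, h)
    diag = 2.0 / (h * h)
    return [ui + omega * ri / diag for ui, ri in zip(u, r)]


def coarsen(r):
    """Full-weighting restriction from 2^m + 1 points to 2^(m-1) + 1 points."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for j in range(1, nc - 1):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def prolongate(ec):
    """Linear interpolation from the coarse grid back to the fine grid."""
    ef = [0.0] * (2 * (len(ec) - 1) + 1)
    for j in range(len(ec)):
        ef[2 * j] = ec[j]
    for j in range(len(ec) - 1):
        ef[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    return ef


def v_cycle(u, b, h):
    """One V-cycle: smooth, restrict the residual, recurse, correct, smooth."""
    if len(u) == 3:
        # Coarsest level: a single interior unknown, solved exactly.
        return [u[0], (b[1] * h * h + u[0] + u[2]) / 2.0, u[2]]
    u = apply_smoother(u, b, h)
    rc = coarsen(evaluate_residual(u, b, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolongate(ec))]
    return apply_smoother(u, b, h)


if __name__ == "__main__":
    n = 2 ** 5 + 1
    h = 1.0 / (n - 1)
    b = [0.0] + [1.0] * (n - 2) + [0.0]
    u = [0.0] * n
    for cycle in range(5):
        u = v_cycle(u, b, h)
        print(cycle, max(abs(v) for v in evaluate_residual(u, b, h)))
```

Each level calls the smoother and residual on its own grid and hands a half-resolution problem to the next level, which gives the V shape when the calls are laid out in time.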
11
Multigrid Performance Model

Memory access is the performance bottleneck
Each pass over the 3D grid requires (per cell):
27 integer operations (stencil coordinates)
27 FP loads of surrounding grid locations
27 (approx) FP operations
1 FP store
Traversing the grid consecutively in memory causes 9 cache misses every d cells, where d is the number of doubles stored in a cache line
Grid size prevents reuse of cached values
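
The per-cell costs above can be read off a straightforward sweep. The sketch below is an unoptimized 27-point pass over a flat n*n*n array, plus a one-line version of the miss estimate. The memory layout (x contiguous), the weight table `w`, and the 128-byte cache line used in the example are assumptions for illustration, not values taken from the slides.

```python
def idx(x, y, z, n):
    """Flat index of cell (x, y, z) in an n*n*n grid with x contiguous."""
    return (z * n + y) * n + x


def sweep_naive(grid, n, w):
    """One pass over the interior; w[dz+1][dy+1][dx+1] is the stencil weight."""
    out = list(grid)
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                acc = 0.0
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            # One stencil coordinate, one load, one multiply-add.
                            acc += w[dz + 1][dy + 1][dx + 1] * grid[idx(x + dx, y + dy, z + dz, n)]
                out[idx(x, y, z, n)] = acc  # the single store per cell
    return out


def misses_per_cell(line_bytes, streams=9, double_bytes=8):
    """Estimated cache misses per cell: one miss per stream per cache line."""
    return streams / (line_bytes // double_bytes)


if __name__ == "__main__":
    n = 6
    w = [[[1.0 / 27.0] * 3 for _ in range(3)] for _ in range(3)]
    result = sweep_naive([2.0] * n ** 3, n, w)
    print(result[idx(2, 2, 2, n)])
    print(misses_per_cell(128))  # hypothetical 128-byte line: 16 doubles
```

The 9 streams correspond to the 9 rows (3 in y times 3 in z) that the stencil touches; the 3 neighbors within a row are adjacent in memory and usually share a cache line.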

12
(No Transcript)
13
(No Transcript)
14
Multigrid Optimizations

Optimizations possible in 2 areas:
Reducing ALU operations per cell
  Reusing stencil coordinates between cells
  Reusing partial sums common to consecutive cells
Improving memory behavior
  Reducing # of loads (register blocking)
  Reducing # of cache misses (cache blocking)

15
Multigrid Optimizations

Common subexpression elimination
Loop unrolling
Memoization
Cache blocking
Memoization + cache blocking

16
Platforms

Power3: 375 MHz, 64 KB L1 cache
Itanium II: 900 MHz, 16 KB L1 cache
AlphaServer: 1000 MHz, 64 KB L1 cache
Opteron: 1600 MHz, 64 KB L1 cache

17
Optimizations - Loop Unrolling

Reduces # stencil coordinates computed per cell
Exposes load reuse to compiler
Allows compiler to use FP registers to store grid
values, reducing loads
Minimum number of loads is 9/grid point (given
generous FP registers)
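
Python cannot show compiler unrolling or register allocation directly, so the sketch below only models the load reuse that unrolling exposes. It walks one row in x while keeping the three 9-value columns of the stencil in local variables, and counts list reads as a stand-in for FP loads. All function names and the flat layout are assumptions for illustration.

```python
def idx(x, y, z, n):
    """Flat index of cell (x, y, z) in an n*n*n grid with x contiguous."""
    return (z * n + y) * n + x


def row_naive(grid, n, y, z, w):
    """Stencil along one row, reloading all 27 neighbors for every cell."""
    out, loads = [], 0
    for x in range(1, n - 1):
        acc = 0.0
        for dz in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += w[dz + 1][dy + 1][dx + 1] * grid[idx(x + dx, y + dy, z + dz, n)]
                    loads += 1
        out.append(acc)
    return out, loads


def load_column(grid, n, x, y, z):
    """The 9 neighbors that share one x coordinate, ordered by (dz, dy)."""
    return [grid[idx(x, y + dy, z + dz, n)] for dz in (-1, 0, 1) for dy in (-1, 0, 1)]


def row_reuse(grid, n, y, z, w):
    """Same stencil, but only the 9 new values are loaded at each step in x."""
    left = load_column(grid, n, 0, y, z)
    mid = load_column(grid, n, 1, y, z)
    loads = 18
    out = []
    for x in range(1, n - 1):
        right = load_column(grid, n, x + 1, y, z)
        loads += 9
        acc = 0.0
        for k in range(9):
            dz, dy = divmod(k, 3)
            acc += w[dz][dy][0] * left[k] + w[dz][dy][1] * mid[k] + w[dz][dy][2] * right[k]
        out.append(acc)
        left, mid = mid, right  # values stay "in registers" for the next cell
    return out, loads
```

For a row with many cells the reuse version approaches 9 loads per cell, compared with 27 for the naive version, at the price of holding 27 live values at once.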

18
Optimizations - Memoization

Traverses grid once to precompute partial sums
common to consecutive cells
Traverses grid again to compute actual cell
values
9 integer stencil operations/cell
18 FP operations/cell
Reduces FP register pressure by breaking
computation into two stages, but still uses 9
load streams per cell
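
One way such a two-pass scheme could look is sketched below. It assumes a symmetric stencil whose weight depends only on how many of the three offsets are nonzero (center, face, edge, corner), which the slides do not state; the names `cross` and `diag` for the partial sums are hypothetical. Pass 1 stores, for every cell, the sums of its in-plane neighbors; pass 2 combines the stored sums at x-1, x, and x+1.

```python
def idx(x, y, z, n):
    """Flat index of cell (x, y, z) in an n*n*n grid with x contiguous."""
    return (z * n + y) * n + x


def stencil_naive(grid, n, w):
    """Reference: w[k] weights neighbors with k nonzero offsets (k = 0..3)."""
    out = [0.0] * n ** 3
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                acc = 0.0
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            k = (dx != 0) + (dy != 0) + (dz != 0)
                            acc += w[k] * grid[idx(x + dx, y + dy, z + dz, n)]
                out[idx(x, y, z, n)] = acc
    return out


def stencil_memoized(grid, n, w):
    """Two passes: precompute in-plane partial sums, then combine along x."""
    cross = [0.0] * n ** 3  # sum of the 4 in-plane axis neighbors (y or z offset)
    diag = [0.0] * n ** 3   # sum of the 4 in-plane diagonal neighbors
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(n):  # includes x = 0 and x = n-1, needed by pass 2
                i = idx(x, y, z, n)
                cross[i] = (grid[idx(x, y - 1, z, n)] + grid[idx(x, y + 1, z, n)]
                            + grid[idx(x, y, z - 1, n)] + grid[idx(x, y, z + 1, n)])
                diag[i] = (grid[idx(x, y - 1, z - 1, n)] + grid[idx(x, y + 1, z - 1, n)]
                           + grid[idx(x, y - 1, z + 1, n)] + grid[idx(x, y + 1, z + 1, n)])
    out = [0.0] * n ** 3
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                i = idx(x, y, z, n)
                faces = grid[i - 1] + grid[i + 1] + cross[i]
                edges = diag[i] + cross[i - 1] + cross[i + 1]
                corners = diag[i - 1] + diag[i + 1]
                out[i] = w[0] * grid[i] + w[1] * faces + w[2] * edges + w[3] * corners
    return out
```

In this sketch pass 1 costs 6 additions per cell and pass 2 costs 8 additions and 4 multiplications, 18 FP operations in total. That happens to agree with the count quoted above, though the authors' actual decomposition is not given in the transcript.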

19
(No Transcript)
20
Optimizations - Cache Blocking

Break 3D grid into blocks that fit within cache
Attempts to allow reuse between adjacent
2D-slices
Reduces memory traffic to 3 load streams per cell
Overhead when switching between blocks
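
A blocked traversal might be organized as in the sketch below: tile the x-y plane, then sweep every z slice inside a tile before moving to the next tile, so that the slices at z-1, z, and z+1 of the current tile can stay in cache. The sizing rule (three slices of doubles, each with a one-cell halo, filling the whole cache) is an assumption for illustration and ignores everything else that competes for cache space. Function names are hypothetical.

```python
import math


def max_square_block(cache_bytes, slices=3, double_bytes=8):
    """Largest b such that `slices` tiles of (b + 2)^2 doubles fit in the cache."""
    side_with_halo = math.isqrt(cache_bytes // (slices * double_bytes))
    return side_with_halo - 2


def blocked_cells(n, bx, by):
    """Yield interior cells tile by tile; within a tile, slice by slice in z."""
    for y0 in range(1, n - 1, by):
        for x0 in range(1, n - 1, bx):
            for z in range(1, n - 1):
                for y in range(y0, min(y0 + by, n - 1)):
                    for x in range(x0, min(x0 + bx, n - 1)):
                        yield (x, y, z)


if __name__ == "__main__":
    print(max_square_block(64 * 1024))  # 64 KB L1
    print(max_square_block(16 * 1024))  # 16 KB L1
    print(sum(1 for _ in blocked_cells(10, 3, 4)))
```

Under this rule the blocks are only a few dozen cells on a side, and the halo cells at every tile edge are reloaded by neighboring tiles, which is one source of the overhead when switching between blocks.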

21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
One V-cycle Time
26
FFT vs. MG on Power3
27
Summary and Continuing Work

Overhead of cache blocking is too large for the small block sizes that fit in the IBM Power3's L1 cache
Memoization offers the greatest performance benefit due to reduced FP operations

28
(No Transcript)
29
(No Transcript)