Title: Multigrid in the Chiral Limit
1 Multigrid in the Chiral Limit
- Richard C. Brower
- Extreme Scale Computing Workshop
- Quantum Universe at Stanford, Dec. 9-11, 2008
2 Multi-grid
- Motivation
- Eigenvectors vs Inexact Deflation
- Multi-grid preconditioning
- Applications
  - Analysis
  - Disconnected Diagrams
  - All-to-All
  - HMC with Chronological Inverter
3-6 Slow convergence is due to vectors in the near-null space
- Laplace solver for A x = 0 starting with a random x
- Result is (algebraically) smooth
7 Motivation
- Algorithms for lighter mass fermions and larger lattices
- The Dirac solver D ψ = b becomes increasingly singular
- Split the vector space into the near-null space S (D S ≈ 0) and its complement S⊥
- Basic idea (as always) is the Schur decomposition! (e = near-null block, o = complement)
- The Schur form implies a factorized inverse (see below)
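The equation images for this slide did not survive extraction; as a sketch of what they presumably showed, here is the standard 2×2 block (LDU) Schur factorization in the (e, o) ordering used above:

```latex
% Standard 2x2 block (LDU) Schur factorization of the Dirac operator
% in the (e = near-null, o = complement) ordering.
\[
D \;=\;
\begin{pmatrix} D_{ee} & D_{eo} \\ D_{oe} & D_{oo} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ D_{oe} D_{ee}^{-1} & 1 \end{pmatrix}
\begin{pmatrix} D_{ee} & 0 \\ 0 & S_{oo} \end{pmatrix}
\begin{pmatrix} 1 & D_{ee}^{-1} D_{eo} \\ 0 & 1 \end{pmatrix},
\qquad
S_{oo} \;=\; D_{oo} - D_{oe} D_{ee}^{-1} D_{eo}.
\]
% The factorization implies that inverting D reduces to inverting the small
% near-null block D_{ee} and the well-conditioned Schur complement S_{oo}:
\[
D^{-1} \;=\;
\begin{pmatrix} 1 & -D_{ee}^{-1} D_{eo} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} D_{ee}^{-1} & 0 \\ 0 & S_{oo}^{-1} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ -D_{oe} D_{ee}^{-1} & 1 \end{pmatrix}.
\]
```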
8 3 Approaches to splitting
- 1. Deflation: exact projection of N_v eigenvectors
- 2. Inexact deflation plus Schwarz (Lüscher)
- 3. Multi-grid preconditioning
- Approaches 2 and 3 use the same splitting S and S⊥
9 Choosing the Restrictor (R = P†) and Prolongator (P)
- Relax from random vectors to find near-null vectors
- Cut them up on sublattice blocks (number of blocks N_B = 2 L⁴/4⁴)
- S = Range(P), with dim(S) = N_v N_B = 2 N_v L⁴/4⁴ (the factor of 2 counts the two chiralities)
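As a concrete illustration of this construction, here is a minimal numpy sketch on a 1-D Laplacian toy problem (not the Wilson-Dirac code): near-null vectors are obtained by relaxing on A x = 0 from random starts, chopped into blocks, and orthonormalized within each block so that P†P = 1 by construction. All names (build_prolongator, block_size, etc.) are illustrative.

```python
import numpy as np

def build_prolongator(near_null, block_size):
    """Toy (1-D) construction of the prolongator P.

    near_null : array of shape (N, Nv), Nv approximate null vectors of A.
    block_size: number of fine sites aggregated into one coarse site.

    Each near-null vector is chopped into blocks; within each block the
    Nv pieces are orthonormalized (QR), so that P†P = 1 by construction.
    Returns P with shape (N, n_blocks * Nv).
    """
    N, Nv = near_null.shape
    n_blocks = N // block_size
    P = np.zeros((N, n_blocks * Nv), dtype=near_null.dtype)
    for b in range(n_blocks):
        rows = slice(b * block_size, (b + 1) * block_size)
        Q, _ = np.linalg.qr(near_null[rows, :])     # orthonormalize within the block
        P[rows, b * Nv:(b + 1) * Nv] = Q
    return P

if __name__ == "__main__":
    # 1-D periodic Laplacian toy problem: relax to expose the near-null space.
    N, Nv, nu = 64, 4, 50
    A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    A[0, -1] = A[-1, 0] = -1.0                      # periodic boundary
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((N, Nv))
    for _ in range(nu):                             # damped Jacobi on A x = 0
        vecs -= (2.0 / 3.0) / A.diagonal()[:, None] * (A @ vecs)
    P = build_prolongator(vecs, block_size=8)
    print(np.allclose(P.conj().T @ P, np.eye(P.shape[1])))   # check P†P = 1
```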
10 P is a non-square matrix
- Diagram: P maps the coarse lattice into the fine lattice, with Image(P) = S; P† maps the fine lattice onto the coarse lattice, with Ker(P†) = S⊥.
- P is block diagonal: the chopped-up near-null vectors ψ1, ..., ψ8 fill the blocks, with zeros elsewhere.
- But P†P = 1, so Ker(P) = 0.
11 Multigrid Cycle (simplified)
- Smooth: x ← (1 - A) x + b
- r ← (1 - A) r
- Project: A_c = P† A P and r_c = P† r
- Solve: A_c e_c = r_c
- e = P A_c⁻¹ P† r
- Update: x ← x + e
- r = b - D(x + e) = [1 - D P (P† D P)⁻¹ P†] r   (oblique projector)
- Note: since P† r = 0 afterwards, this is exact deflation in S
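A minimal two-level version of this cycle, continuing the toy numpy setup sketched above (damped Jacobi stands in for the smoother and the coarse system is solved exactly); this is an illustration of the steps listed, not the production algorithm:

```python
import numpy as np

def two_level_cycle(A, P, b, x, nu_pre=2, nu_post=2, omega=2.0 / 3.0):
    """One simplified two-level cycle: pre-smooth, coarse-grid correct, post-smooth.

    A : fine operator, P : prolongator (P† is the restrictor),
    b : right-hand side, x : current guess. Returns the updated x.
    """
    Dinv = omega / A.diagonal()                 # damped-Jacobi smoother

    for _ in range(nu_pre):                     # pre-smoothing
        x = x + Dinv * (b - A @ x)

    r = b - A @ x                               # fine residual
    Ac = P.conj().T @ A @ P                     # A_c = P† A P
    rc = P.conj().T @ r                         # r_c = P† r
    ec = np.linalg.solve(Ac, rc)                # solve A_c e_c = r_c (exactly here)
    x = x + P @ ec                              # e = P A_c⁻¹ P† r, then x ← x + e

    for _ in range(nu_post):                    # post-smoothing
        x = x + Dinv * (b - A @ x)
    return x
```

For the singular periodic toy Laplacian one would add a small mass term (e.g. pass A + m * np.eye(N)) so that the exact coarse solve is well defined.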
12
- The real algorithm has lots of tuning!
- Multigrid is recursive to multiple levels.
- The near-null vectors are augmented recursively, using MG itself.
- Pre- and post-smoothing is done by Minimum Residual.
- The entire cycle is used as a preconditioner in CG (a sketch follows this list).
- γ₅ structure is preserved: [γ₅, P] = 0
- Current benchmarks for Wilson-Dirac:
  - V = 16³ × 32, β = 6.0, m_crit = -0.8049
  - Coarse lattice: blocks of 4⁴ × N_c × 2, N_v = 20
  - 3-level V(2,2) MG cycle
  - 1 CG application per 6 Dirac applications
- Note: N_v scales as O(1), but for deflation N_v ~ O(V)
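Since the whole cycle is used as a preconditioner in CG, here is a textbook preconditioned-CG sketch in the same toy setting, where the callable precond stands for one MG cycle applied to a residual; names are illustrative, and the production details (MR smoothing, γ₅ structure) are not modeled:

```python
import numpy as np

def pcg(A, b, precond, tol=1e-8, max_iter=500):
    """Preconditioned conjugate gradient: 'precond' plays the role of one
    multigrid cycle applied to a residual, i.e. z ≈ A⁻¹ r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = np.vdot(r, z)
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / np.vdot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = precond(r)
        rz_new = np.vdot(r, z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Usage sketch (hypothetical names): the preconditioner is one MG cycle applied
# to the residual, e.g. precond = lambda r: two_level_cycle(A, P, r, np.zeros_like(r)).
```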
13 Multigrid QCD PetaApps project
Brannick, Brower, Clark, McCormick, Manteuffel, Osborn and Rebbi, "The removal of critical slowing down", Lattice 2008 proceedings
See the Oct 10-11 workshop (http://super.bu.edu/brower/PetaAppsOct10-11)
14 SA/AMG† timings for QCD
† Adaptive Smoothed Aggregation Algebraic Multigrid
15 16³ × 64 asymmetric lattice
m_sea = -0.4125
16 MG Disconnected Diagrams
- Stochastic estimators need to solve D x = η for random sources η, many times!
- MG speeds up D⁻¹ of course, but
- it also gives a large recursive variance reduction!
- Tr[O D⁻¹] = Tr[O (D⁻¹ - P (P† D P)⁻¹ P†)] + Tr[P† O P (P† D P)⁻¹]
- The second term is a smaller operator and matrix inverse on the coarse levels
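A dense toy sketch of this variance-reduction identity: the coarse term Tr[P†OP (P†DP)⁻¹] is computed on the coarse level, and only the remainder is estimated with noise. Function and variable names are illustrative; production codes never form the dense inverses used here.

```python
import numpy as np

def mg_corrected_trace(O, D, P, n_noise=100, rng=None):
    """Stochastic estimate of Tr[O D⁻¹] with a coarse-grid correction:
        Tr[O D⁻¹] = Tr[O (D⁻¹ - P (P†DP)⁻¹ P†)]   (noisy, reduced variance)
                  + Tr[P†OP (P†DP)⁻¹]             (computed on the coarse level)
    """
    rng = np.random.default_rng(rng)
    N = D.shape[0]
    Dc = P.conj().T @ D @ P                       # coarse operator P†DP
    Oc = P.conj().T @ O @ P                       # coarse-projected observable P†OP
    coarse_term = np.trace(Oc @ np.linalg.inv(Dc))

    # Remainder term, estimated with Z2 noise: E[η η†] = 1
    acc = 0.0
    for _ in range(n_noise):
        eta = rng.choice([-1.0, 1.0], size=N)
        x = np.linalg.solve(D, eta)                       # D⁻¹ η
        xc = P @ np.linalg.solve(Dc, P.conj().T @ eta)    # P (P†DP)⁻¹ P† η
        acc += np.vdot(eta, O @ (x - xc)).real
    return acc / n_noise + coarse_term
```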
17 Summary
- Wilson-Dirac multi-grid works well
- Better when tuned further in the chiral limit
- Both Domain Wall and Staggered versions are being developed
- Applications to multiple RHS: Analysis and Disconnected Diagrams
- Both MG and Lüscher's deflation can be applied to RHMC with chronological methods
- Now the preconditioner is a by-product of HMC
- Exaflops may require concurrent RHS to spread the preconditioner workload over many processors
18 Nvidia GPU architecture
19 Consumer chip GTX 280 ⇒ Tesla C1060
20 Tesla Quad S1070 1U System, ~$8K
21 Barros, Babich, Brower, Clark and Rebbi, "Blasting Through Lattice Calculations using CUDA", Lattice 2008 proceedings
22 C870 code uses 60% of the memory bandwidth (Why?)
23 (figure only, no transcript)
24 CUDA code for Wilson-Dirac CG inverter
http://www.scala-lang.org/
25 GPU multi-core straw man
- $10K S1070 Nvidia† with ~10³ cores
- ⇒ 500 QCD-Gigaflop/s (sustained)
- ⇒ O(10⁶) cores for a QCD-Petaflop/s (sustained)
- But 32⁴ to 128⁴ lattices have O(10⁶ to 10⁸) sites
- ⇒ at 1 thread per site, only O(1 to 10²) threads/core!
- † a QCD-Petaflop/s costs ~$20 Million
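The straw-man arithmetic on this slide can be redone in a few lines; the sustained rate, unit cost and core count per S1070 below are the slide's assumptions as read here, not measurements:

```python
# Back-of-the-envelope check of the straw-man numbers on this slide.
cores_per_s1070 = 4 * 240          # 4 GPUs x 240 cores in one S1070
gflops_sustained = 500             # assumed sustained QCD Gflop/s per S1070
cost_per_s1070 = 10_000            # assumed ~$10K per unit

units_for_petaflop = 1_000_000 / gflops_sustained
total_cores = units_for_petaflop * cores_per_s1070
total_cost = units_for_petaflop * cost_per_s1070
print(f"{units_for_petaflop:.0f} units, {total_cores:.2e} cores, ${total_cost / 1e6:.0f}M")

for L in (32, 128):                # threads per core at 1 thread per lattice site
    print(L, L**4 / total_cores, "threads/core")
```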
26 Future Nvidia software plans
- Need to find out why we are only saturating 60% of the memory bandwidth
- Further reduce memory traffic:
  - 8 real numbers per SU(3) matrix (2/3 of the 12 used now; the current 12-real reconstruction is sketched after this list)
  - share spinors in 4³ blocks (5/9 of what is used now)
- Generalize to the clover Wilson and Domain Wall operators (slightly better flops/memory ratio)
- DMA between GPUs on the Quad system and the network for clusters
- Start to design a SciDAC API for many-core technologies
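For reference, the 12-real compression already in use stores only the first two rows of each SU(3) link and rebuilds the third as the complex conjugate of their cross product; a minimal numpy sketch (illustrative, not the CUDA kernel):

```python
import numpy as np

def reconstruct_su3(two_rows):
    """Rebuild a full SU(3) matrix from its first two rows (12-real compression):
    the third row is the complex conjugate of the cross product of the first two."""
    a, b = two_rows
    c = np.conj(np.cross(a, b))
    return np.stack([a, b, c])

if __name__ == "__main__":
    # Make a random SU(3) matrix: QR of a complex Gaussian gives a unitary Q;
    # dividing one row by det(Q) (a pure phase) forces det = 1.
    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
    Q[2] /= np.linalg.det(Q)
    print(np.allclose(reconstruct_su3(Q[:2]), Q))   # third row recovered
```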
27 New scaling problem to solve!
- A QCD-Petaflop/s ⇒ O(10⁶) cores
- The multigrid algorithm does data compression ⇒ fewer cores, not more!
- Possible solutions?
  - Change hardware: multigrid accelerators?
  - Change software†: multiple copies of multigrid?
  - Change algorithms: Schwarz domains, deflation, ...?
- † help from Paul F. Fischer?
28 3 Approaches to splitting
- 1. Deflation: exact near-null eigenvector projection
- 2. Inexact deflation plus Schwarz iterations (Lüscher)
- 3. Multi-grid preconditioning (TOPS/PetaApps)
- Approaches 2 and 3 have exactly the same splitting subspace S
- Diagram: P maps the coarse lattice into the fine lattice with Image(P) = S; Ker(P†) = S⊥
29 P is a non-square matrix
- Diagram (repeat of slide 10): P maps the coarse lattice into the fine lattice with Im(P) = S; Ker(P†) = S⊥; the blocked near-null vectors ψ1, ..., ψ8 fill the blocks of P.
- But P†P = 1, so Ker(P) = 0.
30 Two Generations: Consumer vs HPC GPUs
- Consumer cards ⇒ High Performance (HPC) GPUs
- I. 8800 GTX ⇒ Tesla C870 (16 multiprocessors with 8 cores each)
- II. GTX 280 ⇒ Tesla C1060 (30 multiprocessors with 8 cores each)