Title: Multigrid in the Chiral Limit
1 Multigrid in the Chiral Limit
- Richard C. Brower
- Extreme Scale Computing Workshop
- Quantum Universe at Stanford, Dec. 9-11, 2008
2 Multi-grid
- Motivation
- Eigenvectors vs Inexact Deflation
- Multi-grid preconditioning
- Applications
  - Analysis
  - Disconnected Diagrams
  - All-to-All
  - HMC with Chronological Inverter
3-6 Slow convergence is due to vectors in the near-null space
- Laplace solver for A x = 0 starting with a random x
- Result is (algebraically) smooth
7 Motivation
- Algorithms for lighter mass fermions and larger lattices
- The Dirac solver D ψ = b becomes increasingly singular
- Split the vector space into the near-null space S (D S ≈ 0) and its complement S⊥
- Basic idea (as always) is the Schur decomposition! (e = near-null block, o = complement)
- The Schur form implies a factorized inverse (see below)
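The equation images for this slide did not survive extraction; as a sketch of what they presumably showed, here is the standard 2×2 block (LDU) Schur factorization in the (e, o) ordering used above:

```latex
% Standard 2x2 block (LDU) Schur factorization of the Dirac operator
% in the (e = near-null, o = complement) ordering.
\[
D \;=\;
\begin{pmatrix} D_{ee} & D_{eo} \\ D_{oe} & D_{oo} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ D_{oe} D_{ee}^{-1} & 1 \end{pmatrix}
\begin{pmatrix} D_{ee} & 0 \\ 0 & S_{oo} \end{pmatrix}
\begin{pmatrix} 1 & D_{ee}^{-1} D_{eo} \\ 0 & 1 \end{pmatrix},
\qquad
S_{oo} \;=\; D_{oo} - D_{oe} D_{ee}^{-1} D_{eo}.
\]
% The factorization implies that inverting D reduces to inverting the small
% near-null block D_{ee} and the well-conditioned Schur complement S_{oo}:
\[
D^{-1} \;=\;
\begin{pmatrix} 1 & -D_{ee}^{-1} D_{eo} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} D_{ee}^{-1} & 0 \\ 0 & S_{oo}^{-1} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ -D_{oe} D_{ee}^{-1} & 1 \end{pmatrix}.
\]
```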
8 3 Approaches to splitting
- 1. Deflation: exact projection of N_v eigenvectors
- 2. Inexact deflation plus Schwarz (Lüscher)
- 3. Multi-grid preconditioning
- Approaches 2 and 3 use the same splitting S and S⊥
9 Choosing the Restrictor (R = P†) and Prolongator (P)
- Relax from random vectors to find near-null vectors
- Cut them up on sublattice blocks (number of blocks N_B = 2 L⁴/4⁴)
- S = Range(P), with dim(S) = N_v N_B = 2 N_v L⁴/4⁴ (the factor of 2 counts the two chiralities)
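As a concrete illustration of this construction, here is a minimal numpy sketch on a 1-D Laplacian toy problem (not the Wilson-Dirac code): near-null vectors are obtained by relaxing on A x = 0 from random starts, chopped into blocks, and orthonormalized within each block so that P†P = 1 by construction. All names (build_prolongator, block_size, etc.) are illustrative.

```python
import numpy as np

def build_prolongator(near_null, block_size):
    """Toy (1-D) construction of the prolongator P.

    near_null : array of shape (N, Nv), Nv approximate null vectors of A.
    block_size: number of fine sites aggregated into one coarse site.

    Each near-null vector is chopped into blocks; within each block the
    Nv pieces are orthonormalized (QR), so that P†P = 1 by construction.
    Returns P with shape (N, n_blocks * Nv).
    """
    N, Nv = near_null.shape
    n_blocks = N // block_size
    P = np.zeros((N, n_blocks * Nv), dtype=near_null.dtype)
    for b in range(n_blocks):
        rows = slice(b * block_size, (b + 1) * block_size)
        Q, _ = np.linalg.qr(near_null[rows, :])     # orthonormalize within the block
        P[rows, b * Nv:(b + 1) * Nv] = Q
    return P

if __name__ == "__main__":
    # 1-D periodic Laplacian toy problem: relax to expose the near-null space.
    N, Nv, nu = 64, 4, 50
    A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    A[0, -1] = A[-1, 0] = -1.0                      # periodic boundary
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((N, Nv))
    for _ in range(nu):                             # damped Jacobi on A x = 0
        vecs -= (2.0 / 3.0) / A.diagonal()[:, None] * (A @ vecs)
    P = build_prolongator(vecs, block_size=8)
    print(np.allclose(P.conj().T @ P, np.eye(P.shape[1])))   # check P†P = 1
```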
10 P is a non-square matrix
- Diagram: P maps the coarse lattice into the fine lattice, with Image(P) = S; P† maps the fine lattice onto the coarse lattice, with Ker(P†) = S⊥.
- P is block diagonal: the chopped-up near-null vectors ψ1, ..., ψ8 fill the blocks, with zeros elsewhere.
- But P†P = 1, so Ker(P) = 0.
11 Multigrid Cycle (simplified)
- Smooth: x ← (1 - A) x + b
- r ← (1 - A) r
- Project: A_c = P† A P and r_c = P† r
- Solve: A_c e_c = r_c
- e = P A_c⁻¹ P† r
- Update: x ← x + e
- r = b - D(x + e) = [1 - D P (P† D P)⁻¹ P†] r   (oblique projector)
- Note: since P† r = 0 afterwards, this is exact deflation in S
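A minimal two-level version of this cycle, continuing the toy numpy setup sketched above (damped Jacobi stands in for the smoother and the coarse system is solved exactly); this is an illustration of the steps listed, not the production algorithm:

```python
import numpy as np

def two_level_cycle(A, P, b, x, nu_pre=2, nu_post=2, omega=2.0 / 3.0):
    """One simplified two-level cycle: pre-smooth, coarse-grid correct, post-smooth.

    A : fine operator, P : prolongator (P† is the restrictor),
    b : right-hand side, x : current guess. Returns the updated x.
    """
    Dinv = omega / A.diagonal()                 # damped-Jacobi smoother

    for _ in range(nu_pre):                     # pre-smoothing
        x = x + Dinv * (b - A @ x)

    r = b - A @ x                               # fine residual
    Ac = P.conj().T @ A @ P                     # A_c = P† A P
    rc = P.conj().T @ r                         # r_c = P† r
    ec = np.linalg.solve(Ac, rc)                # solve A_c e_c = r_c (exactly here)
    x = x + P @ ec                              # e = P A_c⁻¹ P† r, then x ← x + e

    for _ in range(nu_post):                    # post-smoothing
        x = x + Dinv * (b - A @ x)
    return x
```

For the singular periodic toy Laplacian one would add a small mass term (e.g. pass A + m * np.eye(N)) so that the exact coarse solve is well defined.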
12
- The real algorithm has lots of tuning!
- Multigrid is recursive to multiple levels.
- The near-null vectors are augmented recursively, using MG itself.
- Pre- and post-smoothing is done by Minimum Residual.
- The entire cycle is used as a preconditioner in CG (a sketch follows this list).
- γ₅ structure is preserved: [γ₅, P] = 0
- Current benchmarks for Wilson-Dirac:
  - V = 16³ × 32, β = 6.0, m_crit = -0.8049
  - Coarse lattice: blocks of 4⁴ × N_c × 2, N_v = 20
  - 3-level V(2,2) MG cycle
  - 1 CG application per 6 Dirac applications
- Note: N_v scales as O(1), but for deflation N_v ~ O(V)
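Since the whole cycle is used as a preconditioner in CG, here is a textbook preconditioned-CG sketch in the same toy setting, where the callable precond stands for one MG cycle applied to a residual; names are illustrative, and the production details (MR smoothing, γ₅ structure) are not modeled:

```python
import numpy as np

def pcg(A, b, precond, tol=1e-8, max_iter=500):
    """Preconditioned conjugate gradient: 'precond' plays the role of one
    multigrid cycle applied to a residual, i.e. z ≈ A⁻¹ r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = np.vdot(r, z)
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / np.vdot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = precond(r)
        rz_new = np.vdot(r, z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Usage sketch (hypothetical names): the preconditioner is one MG cycle applied
# to the residual, e.g. precond = lambda r: two_level_cycle(A, P, r, np.zeros_like(r)).
```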
13 Multigrid QCD PetaApps project
Brannick, Brower, Clark, McCormick, Manteuffel, Osborn and Rebbi, "The removal of critical slowing down", Lattice 2008 proceedings
See the Oct 10-11 workshop (http://super.bu.edu/brower/PetaAppsOct10-11)
14 SA/AMG† timings for QCD
† Adaptive Smoothed Aggregation Algebraic Multigrid
15 16³ × 64 asymmetric lattice
m_sea = -0.4125
16 MG Disconnected Diagrams
- Stochastic estimators need to solve D x = η for random sources η, many times!
- MG speeds up D⁻¹ of course, but
- it also gives a large recursive variance reduction!
- Tr[O D⁻¹] = Tr[O (D⁻¹ - P (P† D P)⁻¹ P†)] + Tr[P† O P (P† D P)⁻¹]
- The second term is a smaller operator and matrix inverse on the coarse levels
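A dense toy sketch of this variance-reduction identity: the coarse term Tr[P†OP (P†DP)⁻¹] is computed on the coarse level, and only the remainder is estimated with noise. Function and variable names are illustrative; production codes never form the dense inverses used here.

```python
import numpy as np

def mg_corrected_trace(O, D, P, n_noise=100, rng=None):
    """Stochastic estimate of Tr[O D⁻¹] with a coarse-grid correction:
        Tr[O D⁻¹] = Tr[O (D⁻¹ - P (P†DP)⁻¹ P†)]   (noisy, reduced variance)
                  + Tr[P†OP (P†DP)⁻¹]             (computed on the coarse level)
    """
    rng = np.random.default_rng(rng)
    N = D.shape[0]
    Dc = P.conj().T @ D @ P                       # coarse operator P†DP
    Oc = P.conj().T @ O @ P                       # coarse-projected observable P†OP
    coarse_term = np.trace(Oc @ np.linalg.inv(Dc))

    # Remainder term, estimated with Z2 noise: E[η η†] = 1
    acc = 0.0
    for _ in range(n_noise):
        eta = rng.choice([-1.0, 1.0], size=N)
        x = np.linalg.solve(D, eta)                       # D⁻¹ η
        xc = P @ np.linalg.solve(Dc, P.conj().T @ eta)    # P (P†DP)⁻¹ P† η
        acc += np.vdot(eta, O @ (x - xc)).real
    return acc / n_noise + coarse_term
```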
17 Summary
- Wilson-Dirac multi-grid works well
- Better when tuned further in the chiral limit
- Both Domain Wall and Staggered versions are being developed
- Applications to multiple RHS: Analysis and Disconnected Diagrams
- Both MG and Lüscher's deflation can be applied to RHMC with chronological methods
- Now the preconditioner is a by-product of HMC
- Exaflops may require concurrent RHS to spread the preconditioner workload over many processors
18 Nvidia GPU architecture
19 Consumer chip GTX 280 ⇒ Tesla C1060
20 Tesla Quad S1070 1U System, ~$8K
21 Barros, Babich, Brower, Clark and Rebbi, "Blasting Through Lattice Calculations using CUDA", Lattice 2008 proceedings
22 C870 code uses 60% of the memory bandwidth (Why?)
23 (figure only, no transcript)
24 CUDA code for Wilson-Dirac CG inverter
http://www.scala-lang.org/
25 GPU multi-core straw man
- $10K S1070 Nvidia† with ~10³ cores
- ⇒ 500 QCD-Gigaflop/s (sustained)
- ⇒ O(10⁶) cores for a QCD-Petaflop/s (sustained)
- But 32⁴ to 128⁴ lattices have O(10⁶ to 10⁸) sites
- ⇒ at 1 thread per site, only O(1 to 10²) threads/core!
- † a QCD-Petaflop/s costs ~$20 Million
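The straw-man arithmetic on this slide can be redone in a few lines; the sustained rate, unit cost and core count per S1070 below are the slide's assumptions as read here, not measurements:

```python
# Back-of-the-envelope check of the straw-man numbers on this slide.
cores_per_s1070 = 4 * 240          # 4 GPUs x 240 cores in one S1070
gflops_sustained = 500             # assumed sustained QCD Gflop/s per S1070
cost_per_s1070 = 10_000            # assumed ~$10K per unit

units_for_petaflop = 1_000_000 / gflops_sustained
total_cores = units_for_petaflop * cores_per_s1070
total_cost = units_for_petaflop * cost_per_s1070
print(f"{units_for_petaflop:.0f} units, {total_cores:.2e} cores, ${total_cost / 1e6:.0f}M")

for L in (32, 128):                # threads per core at 1 thread per lattice site
    print(L, L**4 / total_cores, "threads/core")
```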
26 Future Nvidia software plans
- Need to find out why we are only saturating 60% of the memory bandwidth
- Further reduce memory traffic:
  - 8 real numbers per SU(3) matrix (2/3 of the 12 used now; the current 12-real reconstruction is sketched after this list)
  - share spinors in 4³ blocks (5/9 of what is used now)
- Generalize to the clover Wilson and Domain Wall operators (slightly better flops/memory ratio)
- DMA between GPUs on the Quad system and the network for clusters
- Start to design a SciDAC API for many-core technologies
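For reference, the 12-real compression already in use stores only the first two rows of each SU(3) link and rebuilds the third as the complex conjugate of their cross product; a minimal numpy sketch (illustrative, not the CUDA kernel):

```python
import numpy as np

def reconstruct_su3(two_rows):
    """Rebuild a full SU(3) matrix from its first two rows (12-real compression):
    the third row is the complex conjugate of the cross product of the first two."""
    a, b = two_rows
    c = np.conj(np.cross(a, b))
    return np.stack([a, b, c])

if __name__ == "__main__":
    # Make a random SU(3) matrix: QR of a complex Gaussian gives a unitary Q;
    # dividing one row by det(Q) (a pure phase) forces det = 1.
    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
    Q[2] /= np.linalg.det(Q)
    print(np.allclose(reconstruct_su3(Q[:2]), Q))   # third row recovered
```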
27 New scaling problem to solve!
- A QCD-Petaflop/s ⇒ O(10⁶) cores
- The multigrid algorithm does data compression ⇒ fewer cores, not more!
- Possible solutions?
  - Change hardware: multigrid accelerators?
  - Change software†: multiple copies of multigrid?
  - Change algorithms: Schwarz domains, deflation, ...?
- † help from Paul F. Fischer?
28 3 Approaches to splitting
- 1. Deflation: exact near-null eigenvector projection
- 2. Inexact deflation plus Schwarz iterations (Lüscher)
- 3. Multi-grid preconditioning (TOPS/PetaApps)
- Approaches 2 and 3 have exactly the same splitting subspace S
- Diagram: P maps the coarse lattice into the fine lattice with Image(P) = S; Ker(P†) = S⊥
29 P is a non-square matrix
- Diagram (repeat of slide 10): P maps the coarse lattice into the fine lattice with Im(P) = S; Ker(P†) = S⊥; the blocked near-null vectors ψ1, ..., ψ8 fill the blocks of P.
- But P†P = 1, so Ker(P) = 0.
30 Two Generations: Consumer vs HPC GPUs
- Consumer cards ⇒ High Performance (HPC) GPUs
- I. 8800 GTX ⇒ Tesla C870 (16 multiprocessors with 8 cores each)
- II. GTX 280 ⇒ Tesla C1060 (30 multiprocessors with 8 cores each)