Title: Running in Parallel: Theory and Practice
1. Running in Parallel: Theory and Practice
- Julian Gale
- Department of Chemistry
- Imperial College
2. Why Run in Parallel?
- Increase real-time performance
- Allow larger calculations - usually memory is the critical factor
  - distributed memory essential for all significant arrays
- Several possible mechanisms for parallelism
  - MPI / PVM / OpenMP
3. Parallel Strategies
- Massive parallelism - distribute according to spatial location
  - large systems (non-overlapping regions)
  - large numbers of processors
- Modest parallelism - distribute by orbital index / K point
  - spatially compact systems
  - spatially inhomogeneous systems
  - small numbers of processors
- Replica parallelism (transition states / phonons)
S. Itoh, P. Ordejón and R.M. Martin, CPC, 88, 173 (1995)
A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, CPC, 94, 89 (1996)
D.R. Bowler, T. Miyazaki and M.J. Gillan, CPC, 137, 255 (2001)
4. Key Steps in Calculation
- Calculating H (and S) matrices
  - Hartree potential
  - Exchange-correlation potential
  - Kinetic / overlap / pseudopotentials
- Solving for self-consistent solution
  - Diagonalisation
  - Order N
5. One/Two-Centre Integrals
- Integrals evaluated directly in real space
- Orbitals distributed according to a 1-D block-cyclic scheme
- Each node calculates the integrals relevant to its local orbitals
- Presently the set-up for numerical tabulations is duplicated on each node

Timings for 16384 atoms of Si on 4 nodes (BlockSize 4):

  Kinetic energy integrals      45 s
  Overlap integrals             43 s
  Non-local pseudopotential    136 s
  Mesh                        2213 s
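The 1-D block-cyclic distribution can be pictured with the short sketch below (Python is used purely for illustration; this is a generic owner calculation, not SIESTA's Fortran routine): orbitals are dealt out in consecutive blocks of BlockSize (4 in the timings above), round-robin over the nodes.

def block_cyclic_owner(orbital, blocksize=4, nnodes=4):
    """Return the node owning a given (0-based) orbital index
    under a 1-D block-cyclic distribution."""
    return (orbital // blocksize) % nnodes

# With BlockSize 4 on 4 nodes: orbitals 0-3 live on node 0, 4-7 on node 1,
# 8-11 on node 2, 12-15 on node 3, 16-19 wrap back to node 0, and so on.
for orb in range(20):
    print(orb, "->", block_cyclic_owner(orb))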
6. Sparse Matrices
- Order N memory
- Compressed 2-D storage
- Compressed 1-D storage
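The idea behind compressed storage can be sketched generically as follows (Python for illustration only; SIESTA's own sparse format differs in detail): only the non-zero elements of each row are kept, together with their column indices, so memory scales with the number of non-zeros - order N once the Hamiltonian range is limited by the orbital radii - rather than with N squared.

import numpy as np

def compress_rows(dense):
    """Generic compressed-row storage: keep only non-zero elements per row.
    Returns (values, columns, row_ptr)."""
    values, columns, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        columns.extend(nz)
        values.extend(row[nz])
        row_ptr.append(len(values))
    return np.array(values), np.array(columns), np.array(row_ptr)

# Example: a small, mostly empty matrix - 3 stored elements instead of 16
H = np.zeros((4, 4))
H[0, 0], H[0, 1], H[2, 3] = 1.0, 0.5, 0.2
vals, cols, ptr = compress_rows(H)
print(vals, cols, ptr)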
7. Parallel Mesh Operations
- Spatial decomposition of mesh
- 2-D blocked in y/z
- Map orbital to mesh distribution
- Perform parallel FFT -> Hartree potential
- XC calculation only involves local communication
- Map mesh back to orbitals
[Figure: 2-D blocked decomposition of the mesh over 12 processors (numbered 0-11)]
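The decomposition in the figure can be pictured with the generic sketch below (Python for illustration; not SIESTA's actual mesh code, and the node-numbering convention is an assumption): mesh points are grouped into contiguous blocks along y and z, each block owned by one node, while the x direction stays local to every node.

def mesh_owner(iy, iz, ny, nz, nproc_y, nproc_z):
    """Generic 2-D blocked decomposition: node owning mesh point (iy, iz)
    when y and z are split into contiguous blocks over an
    nproc_y x nproc_z processor grid (x is not divided)."""
    block_y = -(-ny // nproc_y)          # ceil(ny / nproc_y)
    block_z = -(-nz // nproc_z)
    py = iy // block_y
    pz = iz // block_z
    return py + nproc_y * pz             # assumes numbering runs fastest in y

# Example: the 12-node layout of the figure (4 in y, 3 in z)
print(mesh_owner(0, 0, 360, 360, 4, 3))      # -> 0
print(mesh_owner(359, 359, 360, 360, 4, 3))  # -> 11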
8. Distribution of Processors
- Better to divide work in the y direction than in z
- Command: ProcessorY
- Example
  - 8 nodes
  - ProcessorY 4
  - 4(y) x 2(z) grid of nodes
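In the input file this is a single fdf line; for the 8-node example above it would read:

# fdf fragment: with 8 MPI processes this requests a 4(y) x 2(z) processor grid
ProcessorY 4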
9. Diagonalisation
- H and S stored as sparse matrices
- Solve the generalised eigenvalue problem
- Currently convert back to dense form
- Direct sparse solution is possible
  - sparse solvers exist for the standard eigenvalue problem
  - main issue is sparse factorisation
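As a serial, single-node illustration of the dense step (the parallel runs use ScaLAPACK, as on the next slide), the generalised problem H c = eps S c can be solved with a standard dense routine once H and S have been expanded back from sparse form; the SciPy call below is only a compact stand-in for that library call, with tiny made-up matrices.

import numpy as np
from scipy.linalg import eigh

# Small dense stand-ins for H and S (S must be symmetric positive definite)
H = np.array([[-2.0, -0.5],
              [-0.5, -1.0]])
S = np.array([[ 1.0,  0.2],
              [ 0.2,  1.0]])

# Generalised eigenvalue problem H c = eps S c
eps, C = eigh(H, S)
print("eigenvalues:", eps)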
10. Dense Parallel Diagonalisation
[Figure: 1-D block-cyclic distribution of the matrix over processors 0 and 1]
- Two options
  - ScaLAPACK
  - Block Jacobi (Ian Bush, Daresbury)
  - Scaling vs absolute performance
- 1-D block cyclic (block size 12 - 20)
- Command: BlockSize
11. Order N
- Kim, Mauri, Galli functional
12. Order N
- Direct minimisation of band structure energy -> coefficients of orbitals in Wannier fns
- Three basic operations
  - calculation of gradient
  - 3 point extrapolation of energy (sketched below)
  - density matrix build
- Sparse matrices C, G, H, S, h, s, F, Fs -> localisation radius
- Arrays distributed by rhs index - nbasis or nbands
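The 3 point extrapolation of energy is a line-minimisation step; the sketch below (generic Python, not SIESTA's routine) fits a parabola through three energies computed at three trial step lengths along the search direction and returns the step at the parabola's minimum.

import numpy as np

def three_point_step(x, e):
    """Fit E(x) = a x^2 + b x + c through three (step, energy) samples and
    return the step length at the minimum of the parabola."""
    a, b, _c = np.polyfit(x, e, 2)
    if a <= 0.0:
        raise ValueError("no interior minimum: parabola is not convex")
    return -b / (2.0 * a)

# Example: energies sampled at steps 0, 0.5 and 1.0 along the search direction
print(three_point_step([0.0, 0.5, 1.0], [-10.0, -10.6, -10.4]))  # -> 0.625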
13. Putting it into practice
- Model test system: bulk Si (a = 5.43 Å)
- Conditions as previous scalar runs
- Single-zeta basis set
- Mesh cut-off 40 Ry
- Localisation radius 5.0 Bohr
- Kim / Mauri / Galli functional
- Energy shift 0.02 Ry
- Order N calculations -> 1 SCF cycle / 2 iterations
- Calculations performed on SGI R12000 / 300 MHz
  - Green at CSAR / Manchester Computing Centre
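The conditions above correspond roughly to the fdf fragment below (a sketch only: the lattice vectors, species and coordinate blocks are omitted, and keyword spellings should be checked against the manual for the SIESTA version in use).

# Bulk Si test system - sketch of the relevant fdf options
SystemLabel        si_bulk
LatticeConstant    5.43 Ang

PAO.BasisSize      SZ          # single-zeta basis
PAO.EnergyShift    0.02 Ry     # energy shift
MeshCutoff         40.0 Ry     # mesh cut-off

SolutionMethod     OrderN      # order-N instead of diagonalisation
ON.functional      Kim         # Kim / Mauri / Galli functional
ON.RcLWF           5.0 Bohr    # localisation radius of the Wannier functions

MaxSCFIterations   1           # single SCF cycle for the timing runs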
14. Scaling of Time with System Size
[Plot: scaling of time with system size, 32 processors]
15. Scaling of Memory with System Size
- NB: memory is per processor
16. Parallel Performance on Mesh
- 16384 atoms of Si / mesh 180 x 360 x 360
- Mean time per call
- Loss of performance is due to the orbital-to-mesh mapping (XC shows perfect scaling (LDA))
17. Parallel Performance in Order N
- 16384 atoms of Si / mesh 180 x 360 x 360
- Mean total time per call in the 3 point energy calculation
- Minimum memory algorithm
- Needs spatial decomposition to limit internode communication
18. Installing Parallel SIESTA
- What you need
  - f90
  - MPI
  - ScaLAPACK
  - BLACS
  - BLAS
  - LAPACK
  (f90, BLAS and LAPACK are also needed for serial runs)
- Usually already installed on parallel machines
- Source / prebuilt binaries from www.netlib.org
- If compiling, look out for f90/C cross compatibility
- arch.make available for several parallel machines (a sketch follows below)
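A minimal arch.make for an MPI build might look roughly like the sketch below; the exact variable names, compiler wrapper and library locations vary between SIESTA versions and machines, so treat it only as a template with placeholder paths.

# arch.make sketch for a parallel build (adapt names and paths to your machine)
SIESTA_ARCH   = generic-mpi
FC            = mpif90                 # MPI-aware f90 compiler wrapper
FFLAGS        = -O2
FPPFLAGS      = -DMPI                  # compile in the MPI code paths
MPI_INTERFACE = libmpi_f90.a           # SIESTA's MPI interface layer
MPI_INCLUDE   = /usr/include           # placeholder location

# ScaLAPACK and BLACS for parallel diagonalisation;
# LAPACK and BLAS are also needed for serial runs
LIBS = -lscalapack -lblacs -llapack -lblas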
19. Running Parallel SIESTA
- To run a parallel job: mpirun -np 4 siesta < job.fdf > job.out
- Sometimes must use prun instead on some sites
- Notes
  - generally must run in queues (see the sketch after this list)
  - copy files on to the local disk of the run machine
  - times reported in the output are summed over nodes
  - times can be erratic (Green/Fermat)
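Since runs generally go through a batch queue, a submission script along the following lines is typical. This is a generic PBS-style sketch: the directives, scratch location and pseudopotential file names are placeholders, and the real queue system is site specific.

#!/bin/sh
# Generic batch-queue sketch - adapt directives to your site's queue system
#PBS -l nodes=4
#PBS -l walltime=12:00:00

# Copy input to the local/scratch disk of the run machine, run, copy back
SCRATCH=/scratch/$USER/$$          # assumed scratch location - site specific
mkdir -p $SCRATCH
cp job.fdf *.psf $SCRATCH
cd $SCRATCH

mpirun -np 4 siesta < job.fdf > job.out    # or prun on some sites

cp job.out $PBS_O_WORKDIR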
[Plot: timing vs number of processors]
20. Useful Parallel Options
- ParallelOverK: distribute K points over nodes - good for metals
- ProcessorY: sets the dimension of the processor grid in the Y direction
- BlockSize: sets the size of the blocks into which orbitals are divided
- DiagMemory: controls memory available for diagonalisation; the memory required depends on clustering of eigenvalues. See also DiagScale / TryMemoryIncrease
- DirectPhi: phi values are calculated on the fly - saves memory
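Collected as they might appear in an fdf file (the values shown are illustrative placeholders, not recommendations - tune per machine and job):

ParallelOverK      .true.      # distribute k-points over nodes (metals)
ProcessorY         4           # y dimension of the processor grid
BlockSize          16          # orbitals per block in the block-cyclic distribution
DiagMemory         1.5         # memory factor allowed for diagonalisation
DirectPhi          .true.      # recompute phi on the fly to save memory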
21. Why does my job run like a dead donkey?
- Poor load balance between nodes -> alter BlockSize / ProcessorY
- I/O is too slow -> could set WriteDM false
- Job is swapping like crazy -> set DirectPhi true
- Scaling with increasing number of nodes is poor -> run a bigger job!!
- General problems with parallelism: latency / bandwidth - Linux clusters with a 100 Mbit ethernet switch - forget it!