Title: Running in Parallel: Theory and Practice
1. Running in Parallel: Theory and Practice
- Julian Gale
- Department of Chemistry
- Imperial College
2. Why Run in Parallel?
- Increase real-time performance
- Allow larger calculations - usually memory is the critical factor
  - distributed memory essential for all significant arrays
- Several possible mechanisms for parallelism
  - MPI / PVM / OpenMP
3. Parallel Strategies
- Massive parallelism - distribute according to spatial location
  - large systems (non-overlapping regions)
  - large numbers of processors
- Modest parallelism - distribute by orbital index / K point
  - spatially compact systems
  - spatially inhomogeneous systems
  - small numbers of processors
- Replica parallelism (transition states / phonons)
S. Itoh, P. Ordejón and R.M. Martin, CPC, 88, 173 (1995)
A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, CPC, 94, 89 (1996)
D.R. Bowler, T. Miyazaki and M.J. Gillan, CPC, 137, 255 (2001)
4. Key Steps in Calculation
- Calculating H (and S) matrices
  - Hartree potential
  - Exchange-correlation potential
  - Kinetic / overlap / pseudopotentials
- Solving for self-consistent solution
  - Diagonalisation
  - Order N
5. One/Two-Centre Integrals
- Integrals evaluated directly in real space
- Orbitals distributed according to a 1-D block-cyclic scheme
- Each node calculates the integrals relevant to its local orbitals
- Presently the set-up for numerical tabulations is duplicated on each node

Timings for 16384 atoms of Si on 4 nodes (BlockSize 4):

  Kinetic energy integrals      45 s
  Overlap integrals             43 s
  Non-local pseudopotential    136 s
  Mesh                        2213 s
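The 1-D block-cyclic distribution can be pictured with the short sketch below (Python is used purely for illustration; this is a generic owner calculation, not SIESTA's Fortran routine): orbitals are dealt out in consecutive blocks of BlockSize (4 in the timings above), round-robin over the nodes.

def block_cyclic_owner(orbital, blocksize=4, nnodes=4):
    """Return the node owning a given (0-based) orbital index
    under a 1-D block-cyclic distribution."""
    return (orbital // blocksize) % nnodes

# With BlockSize 4 on 4 nodes: orbitals 0-3 live on node 0, 4-7 on node 1,
# 8-11 on node 2, 12-15 on node 3, 16-19 wrap back to node 0, and so on.
for orb in range(20):
    print(orb, "->", block_cyclic_owner(orb))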
6. Sparse Matrices
- Order N memory
- Compressed 2-D storage
- Compressed 1-D storage
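The idea behind compressed storage can be sketched generically as follows (Python for illustration only; SIESTA's own sparse format differs in detail): only the non-zero elements of each row are kept, together with their column indices, so memory scales with the number of non-zeros - order N once the Hamiltonian range is limited by the orbital radii - rather than with N squared.

import numpy as np

def compress_rows(dense):
    """Generic compressed-row storage: keep only non-zero elements per row.
    Returns (values, columns, row_ptr)."""
    values, columns, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        columns.extend(nz)
        values.extend(row[nz])
        row_ptr.append(len(values))
    return np.array(values), np.array(columns), np.array(row_ptr)

# Example: a small, mostly empty matrix - 3 stored elements instead of 16
H = np.zeros((4, 4))
H[0, 0], H[0, 1], H[2, 3] = 1.0, 0.5, 0.2
vals, cols, ptr = compress_rows(H)
print(vals, cols, ptr)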
7. Parallel Mesh Operations
- Spatial decomposition of mesh
- 2-D blocked in y/z
- Map orbital to mesh distribution
- Perform parallel FFT -> Hartree potential
- XC calculation only involves local communication
- Map mesh back to orbitals
[Figure: 2-D blocked decomposition of the mesh over 12 processors (numbered 0-11)]
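The decomposition in the figure can be pictured with the generic sketch below (Python for illustration; not SIESTA's actual mesh code, and the node-numbering convention is an assumption): mesh points are grouped into contiguous blocks along y and z, each block owned by one node, while the x direction stays local to every node.

def mesh_owner(iy, iz, ny, nz, nproc_y, nproc_z):
    """Generic 2-D blocked decomposition: node owning mesh point (iy, iz)
    when y and z are split into contiguous blocks over an
    nproc_y x nproc_z processor grid (x is not divided)."""
    block_y = -(-ny // nproc_y)          # ceil(ny / nproc_y)
    block_z = -(-nz // nproc_z)
    py = iy // block_y
    pz = iz // block_z
    return py + nproc_y * pz             # assumes numbering runs fastest in y

# Example: the 12-node layout of the figure (4 in y, 3 in z)
print(mesh_owner(0, 0, 360, 360, 4, 3))      # -> 0
print(mesh_owner(359, 359, 360, 360, 4, 3))  # -> 11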
8. Distribution of Processors
- Better to divide work in the y direction than in z
- Command: ProcessorY
- Example
  - 8 nodes
  - ProcessorY 4
  - 4(y) x 2(z) grid of nodes
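In the input file this is a single fdf line; for the 8-node example above it would read:

# fdf fragment: with 8 MPI processes this requests a 4(y) x 2(z) processor grid
ProcessorY 4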
9. Diagonalisation
- H and S stored as sparse matrices
- Solve the generalised eigenvalue problem
- Currently convert back to dense form
- Direct sparse solution is possible
  - sparse solvers exist for the standard eigenvalue problem
  - main issue is sparse factorisation
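As a serial, single-node illustration of the dense step (the parallel runs use ScaLAPACK, as on the next slide), the generalised problem H c = eps S c can be solved with a standard dense routine once H and S have been expanded back from sparse form; the SciPy call below is only a compact stand-in for that library call, with tiny made-up matrices.

import numpy as np
from scipy.linalg import eigh

# Small dense stand-ins for H and S (S must be symmetric positive definite)
H = np.array([[-2.0, -0.5],
              [-0.5, -1.0]])
S = np.array([[ 1.0,  0.2],
              [ 0.2,  1.0]])

# Generalised eigenvalue problem H c = eps S c
eps, C = eigh(H, S)
print("eigenvalues:", eps)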
10. Dense Parallel Diagonalisation
[Figure: 1-D block-cyclic distribution of the matrix over processors 0 and 1]
- Two options
  - ScaLAPACK
  - Block Jacobi (Ian Bush, Daresbury)
  - Scaling vs absolute performance
- 1-D block cyclic (block size 12 - 20)
- Command: BlockSize
11. Order N
- Kim, Mauri, Galli functional
12. Order N
- Direct minimisation of band structure energy -> coefficients of orbitals in Wannier fns
- Three basic operations
  - calculation of gradient
  - 3 point extrapolation of energy (sketched below)
  - density matrix build
- Sparse matrices C, G, H, S, h, s, F, Fs -> localisation radius
- Arrays distributed by rhs index - nbasis or nbands
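The 3 point extrapolation of energy is a line-minimisation step; the sketch below (generic Python, not SIESTA's routine) fits a parabola through three energies computed at three trial step lengths along the search direction and returns the step at the parabola's minimum.

import numpy as np

def three_point_step(x, e):
    """Fit E(x) = a x^2 + b x + c through three (step, energy) samples and
    return the step length at the minimum of the parabola."""
    a, b, _c = np.polyfit(x, e, 2)
    if a <= 0.0:
        raise ValueError("no interior minimum: parabola is not convex")
    return -b / (2.0 * a)

# Example: energies sampled at steps 0, 0.5 and 1.0 along the search direction
print(three_point_step([0.0, 0.5, 1.0], [-10.0, -10.6, -10.4]))  # -> 0.625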
13. Putting it into practice
- Model test system: bulk Si (a = 5.43 Å)
- Conditions as previous scalar runs
- Single-zeta basis set
- Mesh cut-off 40 Ry
- Localisation radius 5.0 Bohr
- Kim / Mauri / Galli functional
- Energy shift 0.02 Ry
- Order N calculations -> 1 SCF cycle / 2 iterations
- Calculations performed on SGI R12000 / 300 MHz
  - Green at CSAR / Manchester Computing Centre
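The conditions above correspond roughly to the fdf fragment below (a sketch only: the lattice vectors, species and coordinate blocks are omitted, and keyword spellings should be checked against the manual for the SIESTA version in use).

# Bulk Si test system - sketch of the relevant fdf options
SystemLabel        si_bulk
LatticeConstant    5.43 Ang

PAO.BasisSize      SZ          # single-zeta basis
PAO.EnergyShift    0.02 Ry     # energy shift
MeshCutoff         40.0 Ry     # mesh cut-off

SolutionMethod     OrderN      # order-N instead of diagonalisation
ON.functional      Kim         # Kim / Mauri / Galli functional
ON.RcLWF           5.0 Bohr    # localisation radius of the Wannier functions

MaxSCFIterations   1           # single SCF cycle for the timing runs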
14. Scaling of Time with System Size
[Plot: scaling of time with system size, 32 processors]
15. Scaling of Memory with System Size
- NB: memory is per processor
16. Parallel Performance on Mesh
- 16384 atoms of Si / mesh 180 x 360 x 360
- Mean time per call
- Loss of performance is due to the orbital-to-mesh mapping (XC shows perfect scaling (LDA))
17. Parallel Performance in Order N
- 16384 atoms of Si / mesh 180 x 360 x 360
- Mean total time per call in the 3 point energy calculation
- Minimum memory algorithm
- Needs spatial decomposition to limit internode communication
18. Installing Parallel SIESTA
- What you need
  - f90
  - MPI
  - ScaLAPACK
  - BLACS
  - BLAS
  - LAPACK
  (f90, BLAS and LAPACK are also needed for serial runs)
- Usually already installed on parallel machines
- Source / prebuilt binaries from www.netlib.org
- If compiling, look out for f90/C cross compatibility
- arch.make available for several parallel machines (a sketch follows below)
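A minimal arch.make for an MPI build might look roughly like the sketch below; the exact variable names, compiler wrapper and library locations vary between SIESTA versions and machines, so treat it only as a template with placeholder paths.

# arch.make sketch for a parallel build (adapt names and paths to your machine)
SIESTA_ARCH   = generic-mpi
FC            = mpif90                 # MPI-aware f90 compiler wrapper
FFLAGS        = -O2
FPPFLAGS      = -DMPI                  # compile in the MPI code paths
MPI_INTERFACE = libmpi_f90.a           # SIESTA's MPI interface layer
MPI_INCLUDE   = /usr/include           # placeholder location

# ScaLAPACK and BLACS for parallel diagonalisation;
# LAPACK and BLAS are also needed for serial runs
LIBS = -lscalapack -lblacs -llapack -lblas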
19. Running Parallel SIESTA
- To run a parallel job: mpirun -np 4 siesta < job.fdf > job.out
- Sometimes must use prun instead on some sites
- Notes
  - generally must run in queues (see the sketch after this list)
  - copy files on to the local disk of the run machine
  - times reported in the output are summed over nodes
  - times can be erratic (Green/Fermat)
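Since runs generally go through a batch queue, a submission script along the following lines is typical. This is a generic PBS-style sketch: the directives, scratch location and pseudopotential file names are placeholders, and the real queue system is site specific.

#!/bin/sh
# Generic batch-queue sketch - adapt directives to your site's queue system
#PBS -l nodes=4
#PBS -l walltime=12:00:00

# Copy input to the local/scratch disk of the run machine, run, copy back
SCRATCH=/scratch/$USER/$$          # assumed scratch location - site specific
mkdir -p $SCRATCH
cp job.fdf *.psf $SCRATCH
cd $SCRATCH

mpirun -np 4 siesta < job.fdf > job.out    # or prun on some sites

cp job.out $PBS_O_WORKDIR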
[Plot: timing vs number of processors]
20. Useful Parallel Options
- ParallelOverK: distribute K points over nodes - good for metals
- ProcessorY: sets the dimension of the processor grid in the Y direction
- BlockSize: sets the size of the blocks into which orbitals are divided
- DiagMemory: controls memory available for diagonalisation; the memory required depends on clustering of eigenvalues. See also DiagScale / TryMemoryIncrease
- DirectPhi: phi values are calculated on the fly - saves memory
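Collected as they might appear in an fdf file (the values shown are illustrative placeholders, not recommendations - tune per machine and job):

ParallelOverK      .true.      # distribute k-points over nodes (metals)
ProcessorY         4           # y dimension of the processor grid
BlockSize          16          # orbitals per block in the block-cyclic distribution
DiagMemory         1.5         # memory factor allowed for diagonalisation
DirectPhi          .true.      # recompute phi on the fly to save memory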
21. Why does my job run like a dead donkey?
- Poor load balance between nodes -> alter BlockSize / ProcessorY
- I/O is too slow -> could set WriteDM false
- Job is swapping like crazy -> set DirectPhi true
- Scaling with increasing number of nodes is poor -> run a bigger job!!
- General problems with parallelism: latency / bandwidth - Linux clusters with a 100 Mbit ethernet switch - forget it!