Running in Parallel: Theory and Practice

Transcript and Presenter's Notes
1
Running in Parallel: Theory and Practice
  • Julian Gale
  • Department of Chemistry
  • Imperial College

2
Why Run in Parallel?
  • Increase real-time performance
  • Allow larger calculations
    - usually memory is the critical factor
    - distributed memory essential for all significant arrays
  • Several possible mechanisms for parallelism
    - MPI / PVM / OpenMP

3
Parallel Strategies
  • Massive parallelism
    - distribute according to spatial location
    - large systems (non-overlapping regions)
    - large numbers of processors
  • Modest parallelism
    - distribute by orbital index / K point
    - spatially compact systems
    - spatially inhomogeneous systems
    - small numbers of processors
  • Replica parallelism (transition states / phonons)

S. Itoh, P. Ordejón and R.M. Martin, Comput. Phys. Commun. 88, 173 (1995)
A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, Comput. Phys. Commun. 94, 89 (1996)
D.R. Bowler, T. Miyazaki and M. Gillan, Comput. Phys. Commun. 137, 255 (2001)
4
Key Steps in Calculation
  • Calculating H (and S) matrices
    - Hartree potential
    - Exchange-correlation potential
    - Kinetic / overlap / pseudopotentials
  • Solving for self-consistent solution
    - Diagonalisation
    - Order N

5
One/Two-Centre Integrals
  • Integrals evaluated directly in real space
  • Orbitals distributed according to a 1-D block cyclic
    scheme (see the sketch below)
  • Each node calculates integrals relevant to local
    orbitals
  • Set-up data for the numerical tabulations is presently
    duplicated on each node

Timings for 16384 atoms of Si on 4 nodes (Blocksize 4):
  Kinetic energy integrals      45 s
  Overlap integrals             43 s
  Non-local pseudopotential    136 s
  Mesh                        2213 s
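As a rough illustration of the 1-D block-cyclic scheme above, the toy Fortran sketch below (variable names are illustrative, not SIESTA's) deals blocks of 4 orbitals round-robin over 4 nodes, as in the timings:

    program blockcyclic
      implicit none
      ! Which node owns orbital io in a 1-D block-cyclic distribution?
      integer :: io, nb, nnodes, owner
      nb = 4                            ! block size, as in the timings above
      nnodes = 4                        ! number of nodes
      do io = 1, 16
        owner = mod((io-1)/nb, nnodes)  ! 0-based index of the owning node
        print '(a,i3,a,i2)', ' orbital', io, ' -> node', owner
      end do
    end program blockcyclic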
6
Sparse Matrices
Order N memory
Compressed 2-D
Compressed 1-D
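The compressed 1-D layout can be pictured as a per-row count of non-zeros plus flat lists of column indices and values, which gives order N memory when the number of non-zeros per row is bounded. A toy sketch (array names are illustrative, not SIESTA's own):

    program sparse1d
      implicit none
      ! Compressed 1-D storage: counts, pointers, column list, values
      integer, parameter :: nrows = 3, maxnz = 6
      integer :: numnz(nrows)      ! non-zero elements in each row
      integer :: listptr(nrows)    ! offset of each row in the flat lists
      integer :: listcol(maxnz)    ! column index of each stored element
      real(8) :: val(maxnz)        ! the stored matrix elements
      integer :: irow, ind
      numnz   = (/ 2, 1, 3 /)
      listptr = (/ 0, 2, 3 /)      ! cumulative sum of numnz, starting at 0
      listcol = (/ 1, 3, 2, 1, 2, 3 /)
      val     = (/ 1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 6.d0 /)
      do irow = 1, nrows
        do ind = listptr(irow)+1, listptr(irow)+numnz(irow)
          print '(a,i2,a,i2,a,f6.2)', ' row', irow, '  col', listcol(ind), ' =', val(ind)
        end do
      end do
    end program sparse1d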
7
Parallel Mesh Operations
  • Spatial decomposition of mesh
  • 2-D blocked in y/z (see the sketch below)
  • Map orbital to mesh distribution
  • Perform parallel FFT → Hartree potential
  • XC calculation only involves local
    communication
  • Map mesh back to orbitals

[Figure: 2-D blocked decomposition of the mesh; nodes 0-11 arranged in a 4 x 3 grid]
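A toy sketch of the 2-D blocked assignment of mesh planes to a processor grid (grid and mesh sizes below are illustrative, not SIESTA variables):

    program meshblocks
      implicit none
      ! Which node owns mesh point (iy,iz) in a blocked y/z decomposition?
      integer :: npy, npz, ny, nz, iy, iz, py, pz, node
      npy = 4;  npz = 3        ! processor grid: 4 in y, 3 in z
      ny = 360; nz = 360       ! mesh points in y and z
      iy = 100; iz = 250       ! an arbitrary mesh point
      py = (iy-1)*npy/ny       ! block index along y (0-based)
      pz = (iz-1)*npz/nz       ! block index along z (0-based)
      node = py + npy*pz       ! owning node, numbered as in the figure
      print '(a,i4,a,i4,a,i3)', ' mesh point (', iy, ',', iz, ') -> node', node
    end program meshblocks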
8
Distribution of Processors
  • Better to divide the work in the y direction than in z
  • Command: ProcessorY
  • Example
    - 8 nodes
    - ProcessorY 4
    - gives a 4(y) x 2(z) grid of nodes
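In the fdf input this example amounts to the fragment below, run on 8 MPI processes (only ProcessorY is set; the z dimension follows from the total number of nodes):

    # 8 nodes: ProcessorY 4 gives a 4(y) x 2(z) processor grid
    ProcessorY  4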
9
Diagonalisation
  • H and S stored as sparse matrices
  • Solve generalised eigenvalue problem
  • Currently convert back to dense form
  • Direct sparse solution is possible
    - sparse solvers exist for the standard eigenvalue problem
    - the main issue is sparse factorisation

10
Dense Parallel Diagonalisation
[Figure: dense matrix distributed 1-D block-cyclically over nodes 0 and 1]
  • Two options
    - Scalapack
    - Block Jacobi (Ian Bush, Daresbury)
    - scaling vs absolute performance
  • 1-D block cyclic distribution (block size 12 - 20)
  • Command: BlockSize
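In the input this is a single fdf line; a block size in the 12 - 20 range quoted above might be requested as, for example:

    BlockSize  16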
11
Order N
Kim, Mauri, Galli functional
12
Order N
  • Direct minimisation of band structure energy →
    coefficients of orbitals in Wannier functions
  • Three basic operations
    - calculation of the gradient
    - 3 point extrapolation of the energy (see the sketch below)
    - density matrix build
  • Sparse matrices C, G, H, S, h, s, F, Fs → localisation radius
  • Arrays distributed by rhs index - nbasis or nbands
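One common way to realise a 3-point extrapolation is a parabolic fit along the search direction; the sketch below shows the idea with made-up energies and is not SIESTA's actual routine:

    program threepoint
      implicit none
      real(8) :: e0, e1, e2, x1, x2, b, c, xmin
      ! Energies at three points along the search direction (illustrative)
      e0 = -105.20d0                 ! at x = 0
      e1 = -105.32d0;  x1 = 0.5d0
      e2 = -105.28d0;  x2 = 1.0d0
      ! Fit E(x) = e0 + b*x + c*x**2 and take the predicted minimum
      c = ((e2-e0)/x2 - (e1-e0)/x1) / (x2 - x1)
      b = (e1-e0)/x1 - c*x1
      xmin = -b / (2.0d0*c)
      print *, 'predicted minimum along the search direction at x =', xmin
    end program threepoint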

13
Putting it into practice.
  • Model test system: bulk Si (a = 5.43 Å)
  • Conditions as for previous scalar runs
  • Single-zeta basis set
  • Mesh cut-off 40 Ry
  • Localisation radius 5.0 Bohr
  • Kim / Mauri / Galli functional
  • Energy shift 0.02 Ry
  • Order N calculations → 1 SCF cycle / 2 iterations
  • Calculations performed on SGI R12000 / 300 MHz
  • Green at CSAR / Manchester Computing Centre
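For reference, a hedged sketch of how these conditions map onto fdf keywords (names follow the SIESTA manual of this era; check the exact spelling against your version):

    PAO.BasisSize     SZ          # single-zeta basis set
    PAO.EnergyShift   0.02 Ry     # energy shift
    MeshCutoff        40.0 Ry     # mesh cut-off
    SolutionMethod    OrderN      # order N rather than diagonalisation
    ON.functional     Kim         # Kim / Mauri / Galli functional
    ON.RcLWF          5.0 Bohr    # localisation radius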

14
Scaling of Time with System Size
32 processors
15
Scaling of Memory with System Size
NB Memory is per processor
16
Parallel Performance on Mesh
  • 16384 atoms of Si / Mesh 180 x 360 x 360
  • Mean time per call
  • Loss of performance is due to the orbital → mesh
    mapping (XC shows perfect scaling for LDA)

17
Parallel Performance in Order N
  • 16384 atoms of Si / Mesh 180 x 360 x 360
  • Mean total time per call in 3 point energy
    calculation
  • Minimum memory algorithm
  • Needs spatial decomposition to limit internode
    communication

18
Installing Parallel SIESTA
  • What you need
    - f90
    - MPI
    - scalapack
    - blacs
    - blas
    - lapack
    (blas / lapack are also needed for serial runs)
  • Usually already installed on parallel machines
  • Source / prebuilt binaries from www.netlib.org
  • If compiling, look out for f90/C cross-compatibility
  • arch.make - available for several parallel machines
    (an illustrative fragment follows)
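An MPI arch.make typically sets variables along these lines (compiler name, paths and library list below are site-specific examples, not a recipe):

    FC            = mpif90
    FFLAGS        = -O2
    MPI_INTERFACE = libmpi_f90.a
    MPI_INCLUDE   = /usr/local/include
    DEFS          = -DMPI
    LIBS          = -lscalapack -lblacs -llapack -lblas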
19
Running Parallel SIESTA
  • To run a parallel job:
      mpirun -np 4 siesta < job.fdf > job.out
  • On some sites you must use prun instead
  • Notes
    - generally you must run through the queues
    - copy files onto the local disk of the run machine
    - times reported in the output are summed over the nodes
    - times can be erratic (Green / Fermat)
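A minimal batch-job body following these notes might look like the sketch below ($TMPDIR for the local scratch disk and the file names are assumptions; adapt them to your site):

    #!/bin/sh
    cd $TMPDIR                                  # local disk of the run machine
    cp $HOME/run1/job.fdf $HOME/run1/*.psf .    # copy input files across
    mpirun -np 4 siesta < job.fdf > job.out     # run as shown above
    cp job.out *.DM $HOME/run1/                 # copy results back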

20
Useful Parallel Options
  • ParallelOverK - distribute K points over nodes; good for metals
  • ProcessorY - sets the dimension of the processor grid in the
    Y direction
  • BlockSize - sets the size of the blocks into which orbitals
    are divided
  • DiagMemory - controls the memory available for diagonalisation;
    the memory required depends on clusters of eigenvalues.
    See also DiagScale / TryMemoryIncrease
  • DirectPhi - phi values are calculated on the fly; saves memory
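Putting several of these together, a parallel fdf fragment might read as follows (the values shown are illustrative only, not recommendations):

    ParallelOverK   .true.     # metals: distribute K points over nodes
    ProcessorY      4          # processor grid dimension in y
    BlockSize       16         # orbital block size
    DiagMemory      1.5        # memory factor for diagonalisation (illustrative)
    DirectPhi       .true.     # calculate phi on the fly to save memory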

21
Why does my job run like a dead donkey?
  • Poor load balance between nodes → alter BlockSize / ProcessorY
  • I/O is too slow → could set WriteDM to false
  • Job is swapping like crazy → set DirectPhi to true
  • Scaling with increasing number of nodes is poor → run a
    bigger job!!
  • General problems with parallelism: latency / bandwidth
    - Linux clusters with a 100 Mbit Ethernet switch: forget it!