Title: Parallel computing for particle simulation
1. Parallel Computing for Particle Simulation
- Zhihong Lin
- Department of Physics & Astronomy, University of California, Irvine
- Acknowledgment: Stephane Ethier, SciDAC 2005
4th Workshop on Nonlinear Plasma Sciences
International School on Plasma Turbulence and Transport
Zhejiang University, Hangzhou, China
2. Outline
- Parallel computers
- Shared vs. distributed parallelism
- PIC domain decomposition
- GTC architecture and performance
3. Why Parallel Computing?
- We want to speed up a calculation.
- Solution: split the work between several processors.
- How? It depends on the type of parallel computer:
- Shared memory (usually thread-based)
- Distributed memory (process-based)
- Massively parallel computers: tightly-coupled nodes
- Why bother? The choice of physics model, numerical method, and algorithm depends on the target machine.
4. Main Computing Platform: NERSC's IBM SP (Seaborg)
- 9 TF peak
- 416 16-processor SMP nodes (with 64 GB, 32 GB, or 16 GB of memory); 380 compute nodes (6,080 processors)
- 375 MHz POWER3 processors with 1.5 Gflops/sec peak per processor
5. Earth Simulator
- 40 TF
- 5,120 vector processors
- 8 Gflops/sec per processor
6. CRAY X1E at ORNL
- 18 TF
- 1,024 multi-streaming vector processors (MSPs)
- 18 Gflops/sec peak performance per MSP
7. Moore's Law Comes with Issues of Power Consumption and Heat Dissipation
(Figure: transistors per die vs. year, 1960-2010, for Intel memory chips from 1K to 4G and for microprocessors from the 4004 through the 8080, 8086, 80286, i386, i486, Pentium, Pentium II, Pentium III, Pentium 4, and Itanium. Source: Intel.)
8. Microarchitecture: Low-Level Parallelism
- Larger cache
- Multi-threaded
- Multi-core
- System-on-a-chip
Adapted from Johan De Gelas, "Quest for More Processing Power," AnandTech, Feb. 8, 2005.
9. IBM Blue Gene Systems
- LLNL BG/L
- 360 TF
- 64 racks
- 65,536 nodes
- 131,072 processors
- Node
- Two 2.8 Gflops processors
- System-on-a-Chip design
- 700 MHz
- Two fused multiply-adds per cycle
- Up to 512 Mbytes of memory
- 27 Watts
10. Shared Memory Parallelism
- The program runs inside a single process.
- Several execution threads are created within that process, and the work is split between them.
- The threads run on different processors.
- All threads have access to the shared data through shared memory.
- Must be careful not to have threads overwrite each other's data.
11. Shared Memory Programming
- Loop-level parallelism is easy to do.
- Compiler-based automatic parallelization is easy but not always efficient.
- Better to do it yourself with OpenMP (see the sketch below).
- Coarse-grain parallelism can be difficult.
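As a concrete illustration of loop-level parallelism, here is a minimal OpenMP sketch (a hypothetical particle-push loop, not code from GTC): the iterations are split across threads, and the kinetic-energy sum is declared a reduction so the threads do not overwrite each other's partial results.

```c
#include <stdio.h>
#include <omp.h>

#define NP 1000000              /* number of particles (illustrative) */

int main(void)
{
    static double x[NP], v[NP]; /* particle positions and velocities */
    double dt = 0.01, ekin = 0.0;

    /* Split the loop iterations across the available threads.
       x and v are shared, the loop index is private, and the
       kinetic-energy sum is a reduction variable. */
    #pragma omp parallel for reduction(+:ekin)
    for (int i = 0; i < NP; i++) {
        x[i] += v[i] * dt;
        ekin += 0.5 * v[i] * v[i];
    }

    printf("threads: %d  kinetic energy: %g\n", omp_get_max_threads(), ekin);
    return 0;
}
```

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.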
12. Distributed Memory Parallelism
- Process-based programming.
- Each process has its own memory space that cannot be accessed by the other processes.
- The work is split between several processes.
- For efficiency, each processor runs a single process.
- Communication between the processes must be explicit, e.g. message passing.
13. Most Widely Used Method on Distributed-Memory Machines
- Run the same program on all processors.
- Each processor works on a subset of the problem.
- Exchange data when needed.
- The exchange can go through the network interconnect or through shared memory on SMP nodes.
- Coarse-grain parallelism is easy to do and scalable (see the sketch below).
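A minimal sketch of this "same program, different subset" pattern (the work loop is a stand-in, not a GTC kernel): every rank computes a partial sum over its own slice of the index range, and the pieces are combined with a single MPI_Allreduce.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const long n = 100000000;            /* total work items (illustrative) */
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each process handles its own contiguous subset of the problem */
    long chunk = n / nprocs;
    long start = rank * chunk;
    long end   = (rank == nprocs - 1) ? n : start + chunk;
    for (long i = start; i < end; i++)
        local += 1.0 / (double)(i + 1);  /* stand-in for real work */

    /* exchange data only when needed: one collective sum at the end */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %.10f on %d processes\n", global, nprocs);
    MPI_Finalize();
    return 0;
}
```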
14. How to Split the Work Between Processors?
- The most widely used method for grid-based calculations: DOMAIN DECOMPOSITION.
- Split the particles in particle-in-cell (PIC) or molecular dynamics codes.
- Split the arrays in PDE solvers.
- etc.
- Keep it LOCAL (see the ghost-cell sketch below).
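For grid-based domain decomposition, "keep it local" usually means each process stores its own block of the grid plus a layer of ghost cells, and only those boundary values are exchanged. A sketch under that assumption (1D array, one ghost cell per side, names are illustrative):

```c
#include <mpi.h>

#define NLOC 1000   /* grid points owned by each rank (assumed, for illustration) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double f[NLOC + 2];                 /* f[0] and f[NLOC+1] are ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 1; i <= NLOC; i++) f[i] = rank;   /* fill the local block */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* exchange only the boundary values with the neighboring domains */
    MPI_Sendrecv(&f[NLOC], 1, MPI_DOUBLE, right, 0,
                 &f[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[1],        1, MPI_DOUBLE, left,  1,
                 &f[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```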
15. What is MPI?
- MPI stands for Message Passing Interface.
- It is a message-passing specification, a standard for vendors to implement.
- In practice, MPI is a set of functions (C) and subroutines (Fortran) used for exchanging data between processes.
- An MPI library exists on most, if not all, parallel computing platforms, so it is highly portable.
16. How Much Do I Need to Know?
- MPI is small: many parallel programs can be written with fewer than 10 basic functions.
- MPI is large (125 functions): its extensive functionality requires many functions, but the number of functions is not necessarily a measure of complexity.
- MPI is just right: one can access the flexibility when it is required, and one need not master all parts of MPI to use it (see the sketch below).
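To illustrate how few calls a working program needs, the hypothetical example below uses only six MPI functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        /* every worker sends a partial result (here just its rank) to rank 0 */
        double result = (double)rank;
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
    } else {
        double sum = 0.0, result;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, src, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += result;
        }
        printf("collected %d partial results, sum = %g\n", nprocs - 1, sum);
    }

    MPI_Finalize();
    return 0;
}
```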
17. Good MPI Web Sites
- http://www.llnl.gov/computing/tutorials/mpi/
- http://www.nersc.gov/nusers/help/tutorials/mpi/intro/
- http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
- http://www-unix.mcs.anl.gov/mpi/tutorial/
- MPI on Linux clusters:
- MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/)
- LAM (http://www.lam-mpi.org/)
18. The Particle-in-Cell Method
- Particles sample the distribution function.
- Interactions occur via the grid, on which the potential is calculated from the deposited charges.
- The PIC steps (sketched in code below):
- SCATTER, or deposit, charges on the grid (nearest neighbors)
- Solve the Poisson equation
- GATHER the forces on each particle from the potential
- Move the particles (PUSH)
- Repeat
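The cycle above can be summarized in a short skeleton. This is only a sketch: the particle layout and the helper routines (scatter, solve_poisson, gather, push) are placeholders to be filled in, not GTC's actual data structures or kernels.

```c
/* Skeleton of the PIC time loop; the four helpers are placeholders. */
typedef struct { double x, v, e, w; } particle;  /* position, velocity, field, weight */

void scatter(const particle *p, int np, double *rho, int ng);  /* deposit charge on the grid     */
void solve_poisson(const double *rho, double *phi, int ng);    /* field solve on the grid        */
void gather(const double *phi, int ng, particle *p, int np);   /* interpolate field to particles */
void push(particle *p, int np, double dt);                     /* advance the particles          */

void pic_run(particle *p, int np, double *rho, double *phi, int ng,
             double dt, int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        scatter(p, np, rho, ng);       /* SCATTER: particles -> grid charge        */
        solve_poisson(rho, phi, ng);   /* solve the Poisson equation               */
        gather(phi, ng, p, np);        /* GATHER: grid potential -> particle force */
        push(p, np, dt);               /* PUSH: move the particles, then repeat    */
    }
}
```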
19. Charge Deposition in Gyrokinetics: the 4-Point Gyro-Averaging Method
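A hedged sketch of the idea behind 4-point gyro-averaged deposition (2D slab grid, fixed ring orientation, bilinear weighting; the real gyrokinetic geometry and GTC's routine are more involved): each marker deposits a quarter of its weight at four points on its gyro-ring, and each of those points is scattered to the surrounding grid nodes.

```c
#include <math.h>

/* Deposit each particle's charge at 4 points on its gyro-ring around the
 * guiding center (xgc, ygc), with gyroradius rho and weight w, onto a
 * uniform nx-by-ny grid with spacings dx, dy.  Illustrative only. */
void deposit_4point(int np, const double *xgc, const double *ygc,
                    const double *rho, const double *w,
                    int nx, int ny, double dx, double dy, double *charge)
{
    const double half_pi = 1.5707963267948966;
    for (int i = 0; i < np; i++) {
        for (int k = 0; k < 4; k++) {
            /* four points 90 degrees apart on the ring (orientation simplified) */
            double x = xgc[i] + rho[i] * cos(k * half_pi);
            double y = ygc[i] + rho[i] * sin(k * half_pi);
            int ix = (int)floor(x / dx), iy = (int)floor(y / dy);
            if (ix < 0 || ix >= nx - 1 || iy < 0 || iy >= ny - 1) continue;
            double fx = x / dx - ix, fy = y / dy - iy;
            double q = 0.25 * w[i];                /* quarter weight per ring point */
            /* bilinear (area-weighted) scatter to the four surrounding nodes */
            charge[iy * nx + ix]           += q * (1.0 - fx) * (1.0 - fy);
            charge[iy * nx + ix + 1]       += q * fx * (1.0 - fy);
            charge[(iy + 1) * nx + ix]     += q * (1.0 - fx) * fy;
            charge[(iy + 1) * nx + ix + 1] += q * fx * fy;
        }
    }
}
```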
20. Global Field-Aligned Mesh for the Quasi-2D Structure of the Electrostatic Potential
- Field-aligned coordinates (ψ, α, ζ), where α ≡ θ - ζ/q
- Saves a factor of about 100 in CPU time
21. Domain Decomposition
- Each MPI process holds a toroidal section.
- Each particle is assigned to a processor according to its position.
- The initial memory allocation is done locally on each processor to maximize efficiency.
- Communication between domains is done with MPI (see the particle-shift sketch below).
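A simplified sketch of the inter-domain communication this implies (one coordinate per particle, only right-going traffic, periodic neighbors; GTC's shift routine handles full particle records and both directions): particles that have crossed the local toroidal boundary are packed and sent to the neighboring MPI process.

```c
#include <mpi.h>
#include <string.h>

#define MAXP 100000   /* local particle capacity (assumed, for illustration) */

/* Each rank owns one toroidal section.  After a push, particles whose zeta
 * coordinate has crossed the right boundary are shipped to the right
 * neighbor (periodic in the toroidal direction). */
void shift_right(double *zeta, int *np, double zmax_local,
                 int rank, int nprocs, MPI_Comm comm)
{
    double sendbuf[MAXP], recvbuf[MAXP];
    int nsend = 0, nrecv = 0, nkeep = 0;

    /* split the local particles into "stay" and "send" lists */
    for (int i = 0; i < *np; i++) {
        if (zeta[i] >= zmax_local) sendbuf[nsend++] = zeta[i];
        else                       zeta[nkeep++]   = zeta[i];
    }

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    /* first exchange the counts, then the particle data itself */
    MPI_Sendrecv(&nsend, 1, MPI_INT, right, 0,
                 &nrecv, 1, MPI_INT, left,  0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, right, 1,
                 recvbuf, nrecv, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    /* append the newly arrived particles to the local list */
    memcpy(&zeta[nkeep], recvbuf, nrecv * sizeof(double));
    *np = nkeep + nrecv;
}
```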
22. Domain Decomposition
- Domain decomposition for particle-field interactions:
- Dynamic objects: the particle points
- Static objects: the field grids
- Particle-grid interactions stay on-node with the domain decomposition.
- Communication across nodes: MPI.
- On-node shared-memory parallelization: OpenMP.
- Computational bottleneck: the on-node gather-scatter.
23. 2nd Level of Parallelism: Loop-Level with OpenMP
24. New MPI-Based Particle Decomposition
- Each domain in the decomposition can have more than one processor associated with it.
- Each processor holds a fraction of the total number of particles in that domain.
- Scales well when using a large number of particles (see the communicator sketch below).
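A sketch of how such a two-level decomposition can be wired up with MPI communicators (sizes, names, and the domain-assignment rule are illustrative, not GTC's): ranks sharing a toroidal domain are grouped with MPI_Comm_split, and the partial charge each of them accumulates from its own particles is summed within that group.

```c
#include <mpi.h>

#define NGRID 4096   /* grid points per toroidal domain (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int ndomain = 64;                 /* number of toroidal domains (assumed) */
    double rho_local[NGRID] = {0.0}, rho_domain[NGRID];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* processes with the same "color" share one toroidal domain and split
       that domain's particles among themselves */
    int color = rank % ndomain;
    MPI_Comm domain_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &domain_comm);

    /* ... each rank deposits only its own particles into rho_local ... */

    /* sum the partial charge over all ranks that share this domain */
    MPI_Allreduce(rho_local, rho_domain, NGRID, MPI_DOUBLE, MPI_SUM, domain_comm);

    MPI_Comm_free(&domain_comm);
    MPI_Finalize();
    return 0;
}
```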
25. Mixed-Mode Domain Decomposition
- Particle-field domain decomposition requires the existence of simple surfaces enclosing the sub-domains.
- The field-aligned mesh is distorted as it rotates in the toroidal direction; this is neither accurate nor efficient for the FEM solver.
- Re-arrangement of the connectivity: no simple surfaces.
- Particle domain decomposition: toroidal + radial.
- Field domain decomposition: 3D.
- Solver via PETSc (see the sketch below):
- Preconditioning with hypre
- Initial guess taken from the previous time step
- Field repartitioning: the CPU overhead is minimal
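A minimal sketch of this solver pattern with PETSc (a 1D Laplacian stands in for the real gyrokinetic field matrix; assumes a reasonably recent PETSc built with hypre): the two points from the slide are KSPSetInitialGuessNonzero, so the previous time step's field seeds the iteration, and a hypre preconditioner selected through the PC object.

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A;  Vec x, b;  KSP ksp;  PC pc;
    PetscInt n = 100, i, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a stand-in matrix (1D Laplacian), distributed across ranks. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        PetscScalar d = 2.0, o = -1.0;
        PetscInt jl = i - 1, jr = i + 1;
        MatSetValues(A, 1, &i, 1, &i, &d, INSERT_VALUES);
        if (i > 0)     MatSetValues(A, 1, &i, 1, &jl, &o, INSERT_VALUES);
        if (i < n - 1) MatSetValues(A, 1, &i, 1, &jr, &o, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);   /* right-hand side (charge density in the real code)   */
    VecSet(x, 0.0);   /* x would hold the field from the previous time step  */

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);  /* reuse previous solution as initial guess */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCHYPRE);                      /* hypre preconditioning                    */
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);  MatDestroy(&A);  VecDestroy(&x);  VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```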
26. Optimization Challenges
- The gather-scatter operation in PIC codes:
- The particles are randomly distributed in the simulation volume (grid).
- Particle charge deposition on the grid leads to indirect addressing in memory.
- Not cache friendly.
- Needs to be tuned differently depending on the architecture.
- Work-vector method: each element in the processor's vector register has a private copy of the local grid (particle array -> scatter operation -> grid array); an OpenMP analogue is sketched below.
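The same idea on a cache-based SMP node can be expressed with per-thread private copies of the grid; the sketch below is an OpenMP analogue of the work-vector trick (1D grid, linear weighting, illustrative names), not the Earth Simulator vector code itself.

```c
#include <stdlib.h>
#include <omp.h>

/* Each thread scatters its particles into a private copy of the grid, so no
 * two threads ever write the same grid cell; the copies are summed at the end.
 * (The vector version keeps one copy per register element instead.) */
void scatter_private(const double *x, const double *w, int np,
                     double *rho, int ng, double dx)
{
    #pragma omp parallel
    {
        double *rho_priv = (double *)calloc(ng, sizeof(double)); /* zeroed private grid */

        #pragma omp for
        for (int i = 0; i < np; i++) {
            int j = (int)(x[i] / dx);            /* cell index: indirect addressing */
            double f = x[i] / dx - j;
            if (j >= 0 && j < ng - 1) {
                rho_priv[j]     += w[i] * (1.0 - f);
                rho_priv[j + 1] += w[i] * f;
            }
        }

        /* reduce the private copies into the shared grid, one thread at a time */
        #pragma omp critical
        for (int j = 0; j < ng; j++) rho[j] += rho_priv[j];

        free(rho_priv);
    }
}
```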
27. GTC Performance
3.7 teraflops achieved on the Earth Simulator with 2,048 processors using 6.6 billion particles!
28. Performance Results
29. Parallel Computing for China and ITER
- Parallel computer hardware: excellent
- 5th entry on the Top 500 supercomputer list; 20 TF available
- The US, Japan, and China compete for a 1 PF computer ($1B)
- Software and wide-area network support: poor
- Current MFE simulation projects in China:
- Single investigator/developer/user
- Small-scale parallel simulations on ~10 processors of a local cluster
- Simulation initiatives among the ITER partners ($13B stake):
- SciDAC and FSP in the US; integrated simulation initiatives in Japan and the EU
- A shift of paradigm is needed for China's MFE simulation:
- Team work: plasma physics, computational science, applied math, ...
- Access to national supercomputers vs. a local cluster
- Physics: compete vs. follow (e.g., nonlinear effects in RF heating)