Title: Parallel computing for particle simulation
1. Parallel Computing for Particle Simulation
- Zhihong Lin
- Department of Physics & Astronomy, University of California, Irvine
- Acknowledgment: Stephane Ethier, SciDAC 2005
4th Workshop on Nonlinear Plasma Sciences
International School on Plasma Turbulence and Transport
Zhejiang University, Hangzhou, China
2. Outline
- Parallel computers
- Shared vs. distributed parallelism
- PIC domain decomposition
- GTC architecture and performance
3. Why Parallel Computing?
- We want to speed up a calculation.
- Solution: split the work between several processors.
- How? It depends on the type of parallel computer:
- Shared memory (usually thread-based)
- Distributed memory (process-based)
- Massively parallel computers: tightly-coupled nodes
- Why bother? The choice of physics model, numerical method, and algorithm depends on the target machine.
4. Main Computing Platform: NERSC's IBM SP (Seaborg)
- 9 TF peak
- 416 16-processor SMP nodes (with 64 GB, 32 GB, or 16 GB of memory); 380 compute nodes (6,080 processors)
- 375 MHz POWER3 processors with 1.5 Gflops/sec peak per processor
5. Earth Simulator
- 40 TF
- 5,120 vector processors
- 8 Gflops/sec per processor
6. CRAY X1E at ORNL
- 18 TF
- 1,024 multi-streaming vector processors (MSPs)
- 18 Gflops/sec peak performance per MSP
7. Moore's Law Comes with Issues of Power Consumption and Heat Dissipation
(Figure: transistors per die vs. year, 1960-2010, for Intel memory chips from 1K to 4G and for microprocessors from the 4004 through the 8080, 8086, 80286, i386, i486, Pentium, Pentium II, Pentium III, Pentium 4, and Itanium. Source: Intel.)
8. Microarchitecture: Low-Level Parallelism
- Larger cache
- Multi-threaded
- Multi-core
- System-on-a-chip
Adapted from Johan De Gelas, "Quest for More Processing Power," AnandTech, Feb. 8, 2005.
9. IBM Blue Gene Systems
- LLNL BG/L
- 360 TF
- 64 racks
- 65,536 nodes
- 131,072 processors
- Node
- Two 2.8 Gflops processors
- System-on-a-Chip design
- 700 MHz
- Two fused multiply-adds per cycle
- Up to 512 Mbytes of memory
- 27 Watts
10. Shared Memory Parallelism
- The program runs inside a single process.
- Several execution threads are created within that process, and the work is split between them.
- The threads run on different processors.
- All threads have access to the shared data through shared memory.
- Must be careful not to have threads overwrite each other's data.
11. Shared Memory Programming
- Loop-level parallelism is easy to do.
- Compiler-based automatic parallelization is easy but not always efficient.
- Better to do it yourself with OpenMP (see the sketch below).
- Coarse-grain parallelism can be difficult.
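As a concrete illustration of loop-level parallelism, here is a minimal OpenMP sketch (a hypothetical particle-push loop, not code from GTC): the iterations are split across threads, and the kinetic-energy sum is declared a reduction so the threads do not overwrite each other's partial results.

```c
#include <stdio.h>
#include <omp.h>

#define NP 1000000              /* number of particles (illustrative) */

int main(void)
{
    static double x[NP], v[NP]; /* particle positions and velocities */
    double dt = 0.01, ekin = 0.0;

    /* Split the loop iterations across the available threads.
       x and v are shared, the loop index is private, and the
       kinetic-energy sum is a reduction variable. */
    #pragma omp parallel for reduction(+:ekin)
    for (int i = 0; i < NP; i++) {
        x[i] += v[i] * dt;
        ekin += 0.5 * v[i] * v[i];
    }

    printf("threads: %d  kinetic energy: %g\n", omp_get_max_threads(), ekin);
    return 0;
}
```

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.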
12. Distributed Memory Parallelism
- Process-based programming.
- Each process has its own memory space that cannot be accessed by the other processes.
- The work is split between several processes.
- For efficiency, each processor runs a single process.
- Communication between the processes must be explicit, e.g. message passing.
13. Most Widely Used Method on Distributed-Memory Machines
- Run the same program on all processors.
- Each processor works on a subset of the problem.
- Exchange data when needed.
- The exchange can go through the network interconnect or through shared memory on SMP nodes.
- Coarse-grain parallelism is easy to do and scalable (see the sketch below).
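A minimal sketch of this "same program, different subset" pattern (the work loop is a stand-in, not a GTC kernel): every rank computes a partial sum over its own slice of the index range, and the pieces are combined with a single MPI_Allreduce.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const long n = 100000000;            /* total work items (illustrative) */
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each process handles its own contiguous subset of the problem */
    long chunk = n / nprocs;
    long start = rank * chunk;
    long end   = (rank == nprocs - 1) ? n : start + chunk;
    for (long i = start; i < end; i++)
        local += 1.0 / (double)(i + 1);  /* stand-in for real work */

    /* exchange data only when needed: one collective sum at the end */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %.10f on %d processes\n", global, nprocs);
    MPI_Finalize();
    return 0;
}
```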
14. How to Split the Work Between Processors?
- The most widely used method for grid-based calculations: DOMAIN DECOMPOSITION.
- Split the particles in particle-in-cell (PIC) or molecular dynamics codes.
- Split the arrays in PDE solvers.
- etc.
- Keep it LOCAL (see the ghost-cell sketch below).
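For grid-based domain decomposition, "keep it local" usually means each process stores its own block of the grid plus a layer of ghost cells, and only those boundary values are exchanged. A sketch under that assumption (1D array, one ghost cell per side, names are illustrative):

```c
#include <mpi.h>

#define NLOC 1000   /* grid points owned by each rank (assumed, for illustration) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double f[NLOC + 2];                 /* f[0] and f[NLOC+1] are ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 1; i <= NLOC; i++) f[i] = rank;   /* fill the local block */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* exchange only the boundary values with the neighboring domains */
    MPI_Sendrecv(&f[NLOC], 1, MPI_DOUBLE, right, 0,
                 &f[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[1],        1, MPI_DOUBLE, left,  1,
                 &f[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```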
15. What is MPI?
- MPI stands for Message Passing Interface.
- It is a message-passing specification, a standard for vendors to implement.
- In practice, MPI is a set of functions (C) and subroutines (Fortran) used for exchanging data between processes.
- An MPI library exists on most, if not all, parallel computing platforms, so it is highly portable.
16. How Much Do I Need to Know?
- MPI is small: many parallel programs can be written with fewer than 10 basic functions.
- MPI is large (125 functions): its extensive functionality requires many functions, but the number of functions is not necessarily a measure of complexity.
- MPI is just right: one can access the flexibility when it is required, and one need not master all parts of MPI to use it (see the sketch below).
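To illustrate how few calls a working program needs, the hypothetical example below uses only six MPI functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        /* every worker sends a partial result (here just its rank) to rank 0 */
        double result = (double)rank;
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
    } else {
        double sum = 0.0, result;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, src, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += result;
        }
        printf("collected %d partial results, sum = %g\n", nprocs - 1, sum);
    }

    MPI_Finalize();
    return 0;
}
```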
17. Good MPI Web Sites
- http://www.llnl.gov/computing/tutorials/mpi/
- http://www.nersc.gov/nusers/help/tutorials/mpi/intro/
- http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
- http://www-unix.mcs.anl.gov/mpi/tutorial/
- MPI on Linux clusters:
- MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/)
- LAM (http://www.lam-mpi.org/)
18. The Particle-in-Cell Method
- Particles sample the distribution function.
- Interactions occur via the grid, on which the potential is calculated from the deposited charges.
- The PIC steps (sketched in code below):
- SCATTER, or deposit, charges on the grid (nearest neighbors)
- Solve the Poisson equation
- GATHER the forces on each particle from the potential
- Move the particles (PUSH)
- Repeat
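The cycle above can be summarized in a short skeleton. This is only a sketch: the particle layout and the helper routines (scatter, solve_poisson, gather, push) are placeholders to be filled in, not GTC's actual data structures or kernels.

```c
/* Skeleton of the PIC time loop; the four helpers are placeholders. */
typedef struct { double x, v, e, w; } particle;  /* position, velocity, field, weight */

void scatter(const particle *p, int np, double *rho, int ng);  /* deposit charge on the grid     */
void solve_poisson(const double *rho, double *phi, int ng);    /* field solve on the grid        */
void gather(const double *phi, int ng, particle *p, int np);   /* interpolate field to particles */
void push(particle *p, int np, double dt);                     /* advance the particles          */

void pic_run(particle *p, int np, double *rho, double *phi, int ng,
             double dt, int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        scatter(p, np, rho, ng);       /* SCATTER: particles -> grid charge        */
        solve_poisson(rho, phi, ng);   /* solve the Poisson equation               */
        gather(phi, ng, p, np);        /* GATHER: grid potential -> particle force */
        push(p, np, dt);               /* PUSH: move the particles, then repeat    */
    }
}
```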
19. Charge Deposition in Gyrokinetics: the 4-Point Gyro-Averaging Method
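A hedged sketch of the idea behind 4-point gyro-averaged deposition (2D slab grid, fixed ring orientation, bilinear weighting; the real gyrokinetic geometry and GTC's routine are more involved): each marker deposits a quarter of its weight at four points on its gyro-ring, and each of those points is scattered to the surrounding grid nodes.

```c
#include <math.h>

/* Deposit each particle's charge at 4 points on its gyro-ring around the
 * guiding center (xgc, ygc), with gyroradius rho and weight w, onto a
 * uniform nx-by-ny grid with spacings dx, dy.  Illustrative only. */
void deposit_4point(int np, const double *xgc, const double *ygc,
                    const double *rho, const double *w,
                    int nx, int ny, double dx, double dy, double *charge)
{
    const double half_pi = 1.5707963267948966;
    for (int i = 0; i < np; i++) {
        for (int k = 0; k < 4; k++) {
            /* four points 90 degrees apart on the ring (orientation simplified) */
            double x = xgc[i] + rho[i] * cos(k * half_pi);
            double y = ygc[i] + rho[i] * sin(k * half_pi);
            int ix = (int)floor(x / dx), iy = (int)floor(y / dy);
            if (ix < 0 || ix >= nx - 1 || iy < 0 || iy >= ny - 1) continue;
            double fx = x / dx - ix, fy = y / dy - iy;
            double q = 0.25 * w[i];                /* quarter weight per ring point */
            /* bilinear (area-weighted) scatter to the four surrounding nodes */
            charge[iy * nx + ix]           += q * (1.0 - fx) * (1.0 - fy);
            charge[iy * nx + ix + 1]       += q * fx * (1.0 - fy);
            charge[(iy + 1) * nx + ix]     += q * (1.0 - fx) * fy;
            charge[(iy + 1) * nx + ix + 1] += q * fx * fy;
        }
    }
}
```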
20. Global Field-Aligned Mesh for the Quasi-2D Structure of the Electrostatic Potential
- Field-aligned coordinates (ψ, α, ζ), where α ≡ θ - ζ/q
- Saves a factor of about 100 in CPU time
21. Domain Decomposition
- Each MPI process holds a toroidal section.
- Each particle is assigned to a processor according to its position.
- The initial memory allocation is done locally on each processor to maximize efficiency.
- Communication between domains is done with MPI (see the particle-shift sketch below).
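A simplified sketch of the inter-domain communication this implies (one coordinate per particle, only right-going traffic, periodic neighbors; GTC's shift routine handles full particle records and both directions): particles that have crossed the local toroidal boundary are packed and sent to the neighboring MPI process.

```c
#include <mpi.h>
#include <string.h>

#define MAXP 100000   /* local particle capacity (assumed, for illustration) */

/* Each rank owns one toroidal section.  After a push, particles whose zeta
 * coordinate has crossed the right boundary are shipped to the right
 * neighbor (periodic in the toroidal direction). */
void shift_right(double *zeta, int *np, double zmax_local,
                 int rank, int nprocs, MPI_Comm comm)
{
    double sendbuf[MAXP], recvbuf[MAXP];
    int nsend = 0, nrecv = 0, nkeep = 0;

    /* split the local particles into "stay" and "send" lists */
    for (int i = 0; i < *np; i++) {
        if (zeta[i] >= zmax_local) sendbuf[nsend++] = zeta[i];
        else                       zeta[nkeep++]   = zeta[i];
    }

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    /* first exchange the counts, then the particle data itself */
    MPI_Sendrecv(&nsend, 1, MPI_INT, right, 0,
                 &nrecv, 1, MPI_INT, left,  0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, right, 1,
                 recvbuf, nrecv, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    /* append the newly arrived particles to the local list */
    memcpy(&zeta[nkeep], recvbuf, nrecv * sizeof(double));
    *np = nkeep + nrecv;
}
```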
22. Domain Decomposition
- Domain decomposition for particle-field interactions:
- Dynamic objects: the particle points
- Static objects: the field grids
- Particle-grid interactions stay on-node with the domain decomposition.
- Communication across nodes: MPI.
- On-node shared-memory parallelization: OpenMP.
- Computational bottleneck: the on-node gather-scatter.
23. 2nd Level of Parallelism: Loop-Level with OpenMP
24. New MPI-Based Particle Decomposition
- Each domain in the decomposition can have more than one processor associated with it.
- Each processor holds a fraction of the total number of particles in that domain.
- Scales well when using a large number of particles (see the communicator sketch below).
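A sketch of how such a two-level decomposition can be wired up with MPI communicators (sizes, names, and the domain-assignment rule are illustrative, not GTC's): ranks sharing a toroidal domain are grouped with MPI_Comm_split, and the partial charge each of them accumulates from its own particles is summed within that group.

```c
#include <mpi.h>

#define NGRID 4096   /* grid points per toroidal domain (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int ndomain = 64;                 /* number of toroidal domains (assumed) */
    double rho_local[NGRID] = {0.0}, rho_domain[NGRID];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* processes with the same "color" share one toroidal domain and split
       that domain's particles among themselves */
    int color = rank % ndomain;
    MPI_Comm domain_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &domain_comm);

    /* ... each rank deposits only its own particles into rho_local ... */

    /* sum the partial charge over all ranks that share this domain */
    MPI_Allreduce(rho_local, rho_domain, NGRID, MPI_DOUBLE, MPI_SUM, domain_comm);

    MPI_Comm_free(&domain_comm);
    MPI_Finalize();
    return 0;
}
```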
25. Mixed-Mode Domain Decomposition
- Particle-field domain decomposition requires the existence of simple surfaces enclosing the sub-domains.
- The field-aligned mesh is distorted as it rotates in the toroidal direction; this is neither accurate nor efficient for the FEM solver.
- Re-arrangement of the connectivity: no simple surfaces.
- Particle domain decomposition: toroidal + radial.
- Field domain decomposition: 3D.
- Solver via PETSc (see the sketch below):
- Preconditioning with hypre
- Initial guess taken from the previous time step
- Field repartitioning: the CPU overhead is minimal
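A minimal sketch of this solver pattern with PETSc (a 1D Laplacian stands in for the real gyrokinetic field matrix; assumes a reasonably recent PETSc built with hypre): the two points from the slide are KSPSetInitialGuessNonzero, so the previous time step's field seeds the iteration, and a hypre preconditioner selected through the PC object.

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A;  Vec x, b;  KSP ksp;  PC pc;
    PetscInt n = 100, i, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a stand-in matrix (1D Laplacian), distributed across ranks. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        PetscScalar d = 2.0, o = -1.0;
        PetscInt jl = i - 1, jr = i + 1;
        MatSetValues(A, 1, &i, 1, &i, &d, INSERT_VALUES);
        if (i > 0)     MatSetValues(A, 1, &i, 1, &jl, &o, INSERT_VALUES);
        if (i < n - 1) MatSetValues(A, 1, &i, 1, &jr, &o, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);   /* right-hand side (charge density in the real code)   */
    VecSet(x, 0.0);   /* x would hold the field from the previous time step  */

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);  /* reuse previous solution as initial guess */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCHYPRE);                      /* hypre preconditioning                    */
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);  MatDestroy(&A);  VecDestroy(&x);  VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```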
26. Optimization Challenges
- The gather-scatter operation in PIC codes:
- The particles are randomly distributed in the simulation volume (grid).
- Particle charge deposition on the grid leads to indirect addressing in memory.
- Not cache friendly.
- Needs to be tuned differently depending on the architecture.
- Work-vector method: each element in the processor's vector register has a private copy of the local grid (particle array -> scatter operation -> grid array); an OpenMP analogue is sketched below.
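The same idea on a cache-based SMP node can be expressed with per-thread private copies of the grid; the sketch below is an OpenMP analogue of the work-vector trick (1D grid, linear weighting, illustrative names), not the Earth Simulator vector code itself.

```c
#include <stdlib.h>
#include <omp.h>

/* Each thread scatters its particles into a private copy of the grid, so no
 * two threads ever write the same grid cell; the copies are summed at the end.
 * (The vector version keeps one copy per register element instead.) */
void scatter_private(const double *x, const double *w, int np,
                     double *rho, int ng, double dx)
{
    #pragma omp parallel
    {
        double *rho_priv = (double *)calloc(ng, sizeof(double)); /* zeroed private grid */

        #pragma omp for
        for (int i = 0; i < np; i++) {
            int j = (int)(x[i] / dx);            /* cell index: indirect addressing */
            double f = x[i] / dx - j;
            if (j >= 0 && j < ng - 1) {
                rho_priv[j]     += w[i] * (1.0 - f);
                rho_priv[j + 1] += w[i] * f;
            }
        }

        /* reduce the private copies into the shared grid, one thread at a time */
        #pragma omp critical
        for (int j = 0; j < ng; j++) rho[j] += rho_priv[j];

        free(rho_priv);
    }
}
```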
27. GTC Performance
3.7 teraflops achieved on the Earth Simulator with 2,048 processors using 6.6 billion particles!
28. Performance Results
29. Parallel Computing for China and ITER
- Parallel computer hardware: excellent
- 5th entry on the Top 500 supercomputer list; 20 TF available
- The US, Japan, and China compete for a 1 PF computer ($1B)
- Software and wide-area network support: poor
- Current MFE simulation projects in China:
- Single investigator/developer/user
- Small-scale parallel simulations on ~10 processors of a local cluster
- Simulation initiatives among the ITER partners ($13B stake):
- SciDAC and FSP in the US; integrated simulation initiatives in Japan and the EU
- A shift of paradigm is needed for China's MFE simulation:
- Team work: plasma physics, computational science, applied math, ...
- Access to national supercomputers vs. a local cluster
- Physics: compete vs. follow (e.g., nonlinear effects in RF heating)