Title: Application Performance Analysis on Blue Gene/L
1. Application Performance Analysis on Blue Gene/L
- Jim Pool, P.I.
- Maciej Brodowicz, Sharon Brunett,
- Tom Gottschalk, Dan Meiron,
- Paul Springer, Thomas Sterling,
- Ed Upchurch
2. Caltech's Role in Blue Gene/L Project
- Understand implications of BG/L network architecture; derive results from real-world ASCI applications
- Develop statistical models of applications, processors as message generators, and the network
- Focus on
  - Application communications distribution
  - Network contention as a function of load, size and adaptive routing
- Represent 64K nodes explicitly in the statistical model
- Create trace analysis tools to characterize applications
  - Extensible Trace Facility (ETF)
3. Blue Gene/L Node
4. Blue Gene/L Network
5. ETF Built-in Trace Options
- MPI events
  - All point-to-point communications (MPI-1)
  - All collective communications (MPI-1)
  - Non-blocking request tracking
  - Communicator creation and destruction
  - MPI datatype decoding (requires MPI-2)
- Languages: C, Fortran
- Easy instrumentation of applications
- Memory reference and program execution tracing
  - Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
  - Tracking of scalar variables
  - Read and write accesses to individual scalars and array elements as well as contiguous vectors of elements
  - Function calls
  - Program execution phases
6. ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
- Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell's equations
- Part of a larger application which computes dynamic responses to strong shock waves impinging on target materials
- Fortran 90 and MPI
- MPI Cartesian communicators
- Nearest-neighbor communications use non-blocking send/recv
- MPI_Allreduce for calculating stable time steps
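The stable time step reduction can be pictured as follows: each rank computes a local limit on the time step, and a global minimum makes every rank advance with the same value. This is a minimal sketch in plain Python, not the MHD code itself. The CFL-style formula, the function names and the sample wave speeds are assumptions for illustration, and `allreduce_min` is only a stand-in for an MPI_Allreduce call with the MPI_MIN operation.

```python
def local_stable_dt(dx, max_wave_speed, cfl=0.5):
    """Local time-step limit (assumed CFL form; the slides give no formula)."""
    return cfl * dx / max_wave_speed


def allreduce_min(values):
    """Stand-in for MPI_Allreduce with MPI_MIN: all ranks get the global minimum."""
    global_min = min(values)
    return [global_min] * len(values)


# Hypothetical fastest wave speed observed on each of four ranks.
speeds = [1.0, 2.0, 4.0, 0.5]
local_dts = [local_stable_dt(dx=1.0, max_wave_speed=s) for s in speeds]

# After the reduction every rank holds the same, most restrictive time step.
global_dts = allreduce_min(local_dts)
print(local_dts, global_dts)
```

The rank with the fastest wave speed sets the time step for all ranks, which is why this step needs a collective operation rather than nearest-neighbor messages.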
7. AMR MHD Communication Profile
- 20 time steps on 32 processors, 128x128 cells
[Figure: communication profiles, two panels: Max. level 1; Max. level 2]
8. Lennard-Jones Molecular Dynamics
- Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
  - production code from Sandia National Lab
- Simulations are large in two dimensions
  - number of atoms and number of time steps
- Spatial decomposition case selected
  - each processing node keeps track of the positions and movement of the atoms in a 3-D box
- Computations carried out in a single time step correspond to femtoseconds of real time
  - a meaningful simulation of the evolution of the system's state typically requires thousands of time steps
- Point-to-point MPI messages are exchanged across each of the 6 sides of the box per time step
- Code is written in Fortran and MPI
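The six-sided exchange in the spatial decomposition can be illustrated by computing which ranks sit across each face of a rank's box. This is a minimal sketch in plain Python, not the Sandia code. The row-major rank layout, the periodic wrap-around at the domain edges and all function names are assumptions for illustration.

```python
def rank_of(coords, dims):
    """Row-major rank of the box at integer coordinates (x, y, z)."""
    x, y, z = coords
    _, ny, nz = dims
    return (x * ny + y) * nz + z


def coords_of(rank, dims):
    """Inverse of rank_of."""
    _, ny, nz = dims
    return (rank // (ny * nz), (rank // nz) % ny, rank % nz)


def face_neighbors(rank, dims):
    """Ranks across the 6 faces of this rank's box (periodic wrap assumed)."""
    here = coords_of(rank, dims)
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            there = list(here)
            there[axis] = (there[axis] + step) % dims[axis]
            neighbors.append(rank_of(there, dims))
    return neighbors


dims = (4, 4, 4)                 # 64 boxes, one per processing node
nbrs = face_neighbors(0, dims)   # order: -x, +x, -y, +y, -z, +z
print(nbrs)                      # [48, 16, 12, 4, 3, 1]
```

Each rank would exchange one message with each of these six ranks per time step, so the communication pattern stays fixed while the message sizes depend on how many atoms lie near each face.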
9. Lennard-Jones Molecular Dynamics
[Figures: Communication Steps; Typical Grid Cell and Cutoff Radius; Computational Cycle Model]
10. LJS Single Processor BG/L Performance
[Figure: Original Code vs. Tuned for BG/L. Y-axis: Improvement (0 to 12). X-axis: Number of Atoms per BG/L CPU (15,625; 31,250; 62,500; 125,000; 250,000; 500,000). Annotation: good cache reuse.]
11. LJS Molecular Dynamics Performance
[Figure: Fixed Problem Size of 1 Billion Atoms. Y-axis: Time per single iteration (ms), split into Compute Time and Communications Time. X-axis: Number of BG/L CPUs (2k, 4k, 8k, 16k, 32k, 64k).]
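For the fixed 1 billion atom problem, the work per CPU shrinks as CPUs are added. The arithmetic can be sketched as below, assuming the labels 2k to 64k mean 2,000 to 64,000 CPUs. That reading is an assumption, though it reproduces the atoms-per-CPU values (15,625 to 500,000) used in the single-processor study.

```python
TOTAL_ATOMS = 1_000_000_000

# Assumption: "2k" ... "64k" denote multiples of 1,000 CPUs.
cpu_counts = [2_000, 4_000, 8_000, 16_000, 32_000, 64_000]

atoms_per_cpu = [TOTAL_ATOMS // n for n in cpu_counts]

for n, atoms in zip(cpu_counts, atoms_per_cpu):
    print(f"{n:>6} CPUs -> {atoms:>7} atoms per CPU")
```

Each doubling of the CPU count halves the atoms per CPU, so compute time per iteration should fall roughly in proportion while the communication share of each iteration grows.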
12. LJS Speedup: BG/L vs. ASCI Red 3200 Nodes
[Figure: 1 Billion Atom Problem. Y-axis: Speedup (0 to 80). X-axis: Number of Blue Gene/L Nodes (2k, 4k, 8k, 16k, 32k, 64k).]
13. LJS Communications Time
[Figure: 500,000 Atoms per BG/L Node. Y-axis: Communications Time Per Iteration (ms, 0 to 60). X-axis: BG/L Configuration: 4x4x4 (64 BG/L nodes), 8x8x8 (512 BG/L nodes), 16x16x16 (4096 BG/L nodes). Series: Physical Nearest Neighbor Mapping; Random Mapping.]
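One way to see why the two mappings differ is to count torus hops between logically adjacent ranks under each placement. This is a minimal sketch in plain Python. It uses hop count as a rough proxy for communication cost, which is an assumption: it ignores contention, adaptive routing and message size, and it does not reproduce the measured times. All names are hypothetical.

```python
import random


def rank_of(coords, dims):
    x, y, z = coords
    _, ny, nz = dims
    return (x * ny + y) * nz + z


def coords_of(rank, dims):
    _, ny, nz = dims
    return (rank // (ny * nz), (rank // nz) % ny, rank % nz)


def torus_hops(a, b, dims):
    """Shortest hop count between two nodes of a 3-D torus."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def mean_neighbor_hops(placement, dims):
    """placement[r] is the torus node hosting logical rank r."""
    total, count = 0, 0
    for rank in range(len(placement)):
        here = coords_of(rank, dims)
        for axis in range(3):
            for step in (-1, 1):
                there = list(here)
                there[axis] = (there[axis] + step) % dims[axis]
                nbr = rank_of(there, dims)
                total += torus_hops(placement[rank], placement[nbr], dims)
                count += 1
    return total / count


dims = (4, 4, 4)
nodes = [coords_of(r, dims) for r in range(64)]

nearest = mean_neighbor_hops(nodes, dims)   # logical neighbors are physical neighbors

shuffled = nodes[:]
random.Random(0).shuffle(shuffled)          # fixed seed so the run is repeatable
scattered = mean_neighbor_hops(shuffled, dims)

print(nearest, scattered)
```

With the nearest-neighbor placement every exchange travels exactly one hop, while a random placement spreads the same exchanges over several hops and shares links among more messages.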
14. What is QMC and Why is it a Good Fit for BG/L?
- QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
- Developed at Caltech by Bill Goddard's ASCI Material Properties group
- Interesting characteristics
  - Low memory requirements
  - After initialization, highly parallel and scalable
  - Minimal set of MPI calls required
    - Non-blocking p2p, reduction, probe, communicator, collective calls
  - No communications during QMC working steps
  - Convergence statistics message is 7200 bytes regardless of problem size and node count
  - Code already ported to many platforms (Linux, AIX, IRIX, etc.)
  - C and MPI sources
15. Iterative QMC Algorithm
For each processor do
    Steps = Total Steps / number of processors
    Generate walkers
    Equilibrate walkers
    for each step
        generate QMC statistics
    send QMC statistics to master node
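The structure of this algorithm can be sketched in plain Python as below. This is not the QMC code: the walker moves and the "energy" sample are toy placeholders, the workers run one after another instead of in parallel, and all names and parameters are hypothetical. The sketch assumes each worker returns one fixed-size summary after its working steps, which mirrors the point that the statistics message does not grow with problem size.

```python
import random


def run_worker(rank, steps, n_walkers=4, equil_steps=10):
    """One processor's share of the work; returns a fixed-size summary."""
    rng = random.Random(rank)

    # Generate walkers (toy 1-D positions).
    walkers = [rng.uniform(-1.0, 1.0) for _ in range(n_walkers)]

    # Equilibrate walkers: move them without collecting statistics.
    for _ in range(equil_steps):
        walkers = [w + rng.uniform(-0.1, 0.1) for w in walkers]

    # Working steps: accumulate statistics locally, no communication.
    count, total, total_sq = 0, 0.0, 0.0
    for _ in range(steps):
        walkers = [w + rng.uniform(-0.1, 0.1) for w in walkers]
        for w in walkers:
            sample = w * w          # placeholder for a real QMC estimator
            count += 1
            total += sample
            total_sq += sample * sample

    # "Send" to the master: three numbers, independent of the step count.
    return (count, total, total_sq)


def master_combine(summaries):
    """Merge the per-worker summaries into a global sample count and mean."""
    n = sum(s[0] for s in summaries)
    mean = sum(s[1] for s in summaries) / n
    return n, mean


total_steps = 1000
n_procs = 8
steps = total_steps // n_procs      # Steps = Total Steps / number of processors

summaries = [run_worker(rank, steps) for rank in range(n_procs)]
n, mean = master_combine(summaries)
print(n, mean)
```

Because each worker only hands over running sums, the master can merge results from any number of workers with the same small message per worker.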
16. QMC Communications Time
[Figure: For 100,000 Steps Per Node (Reduce Using the Torus). Y-axis: Time (seconds), logarithmic, 0.001 to 1. X-axis: BG/L Configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K).]
17. Future Application Porting and Analysis for BG/L
- ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
- Address memory constraints, grain load imbalance and MPI_Waitall() efficiency as we port/tune to BG/L
- Good stress test for BG/L robustness
[Figure: Scalable simulation of polycrystalline response with assumed grain shape. The grain shape is the space-filling polyhedron corresponding to the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL's IBM SP3, frost.]