Application Performance Analysis on Blue Gene/L

1
Application Performance Analysis on Blue Gene/L

Jim Pool, P.I.
Maciej Brodowicz, Sharon Brunett,
Tom Gottschalk, Dan Meiron,
Paul Springer, Thomas Sterling,
Ed Upchurch

2
Caltech's Role in the Blue Gene/L Project

Understand implications of BG/L network architecture; drive results from real-world ASCI applications
Develop statistical models of applications, processors as message generators, and the network
Focus on:
Application communications distribution
Network contention as a function of load, size, and adaptive routing
Represent 64K nodes explicitly in the statistical model
Create trace analysis tools to characterize applications
Extensible Trace Facility (ETF)

3
Blue Gene/L Node
4
Blue Gene/L Network
5
ETF Built-in Trace Options

MPI events
All point-to-point communications (MPI-1)
All collective communications (MPI-1)
Non-blocking request tracking
Communicator creation and destruction
MPI datatype decoding (requires MPI-2)
Languages: C, Fortran
Easy instrumentation of applications
Memory reference and program execution tracing
Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
Tracking of scalar variables
Read and write accesses to individual scalars and array elements, as well as contiguous vectors of elements
Function calls
Program execution phases

6
ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)

Parallel MHD fluid code solves the equations of hydrodynamics and resistive Maxwell's equations
Part of a larger application that computes dynamic responses to strong shock waves impinging on target materials
Fortran 90 and MPI
MPI Cartesian communicators
Nearest-neighbor communications use non-blocking send/recv
MPI_Allreduce for calculating stable time steps
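The communication pattern listed above can be illustrated without an MPI installation. The sketch below is a plain-Python stand-in, not the MHD code itself: `cart_coords`, `cart_rank`, and `cart_shift` are hypothetical helpers that mimic the row-major rank layout of MPI Cartesian communicators, and `stable_dt` plays the role of an MPI_Allreduce with a minimum operation. The 8x4 periodic process grid and the sample time-step values are assumptions chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def cart_coords(rank: int, dims: Sequence[int]) -> Tuple[int, ...]:
    """Convert a rank to grid coordinates (row-major: last axis varies fastest)."""
    coords: List[int] = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def cart_rank(coords: Sequence[int], dims: Sequence[int]) -> int:
    """Convert grid coordinates back to a rank, wrapping periodically."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + (c % extent)
    return rank


def cart_shift(rank: int, dims: Sequence[int], axis: int, disp: int = 1) -> Tuple[int, int]:
    """Return (source, destination) ranks for a shift along one axis."""
    coords = list(cart_coords(rank, dims))
    lower = coords.copy()
    upper = coords.copy()
    lower[axis] -= disp
    upper[axis] += disp
    return cart_rank(lower, dims), cart_rank(upper, dims)


def stable_dt(local_dts: Sequence[float]) -> float:
    """Stand-in for an all-reduce with MIN: every rank adopts the smallest step."""
    return min(local_dts)


if __name__ == "__main__":
    dims = (8, 4)  # assumed process grid with 32 ranks
    for axis in range(len(dims)):
        print("rank 0, axis", axis, "neighbors:", cart_shift(0, dims, axis))
    print("global dt:", stable_dt([0.3, 0.1, 0.2]))
```

In a real MPI program each rank would post non-blocking receives and sends to the neighbors returned by the shift, then wait on the requests before the time-step reduction.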

7
AMR MHD Communication Profile

20 time steps on 32 processors, 128x128 cells

[Figure panels: Max. level 1; Max. level 2]
8
Lennard-Jones Molecular Dynamics

Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
Production code from Sandia National Laboratories
Simulations are large in two dimensions: number of atoms and number of time steps
Spatial decomposition case selected: each processing node keeps track of the positions and movement of the atoms in a 3-D box
Computations carried out in a single time step correspond to femtoseconds of real time
A meaningful simulation of the evolution of the system's state typically requires thousands of time steps
Point-to-point MPI messages are exchanged across each of the 6 faces of the box every time step
Code is written in Fortran and MPI
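The slides name the Lennard-Jones interaction and describe it as short range, but they do not give the pair potential or the cutoff value. As a minimal sketch, the standard 12-6 Lennard-Jones form with a truncation radius looks like the following. The reduced units (`epsilon = sigma = 1`) and the cutoff of 2.5 sigma are assumptions, and `lj_energy` and `lj_force` are hypothetical names, not functions from the Sandia code.

```python
def lj_energy(r: float, epsilon: float = 1.0, sigma: float = 1.0, r_cut: float = 2.5) -> float:
    """Pair energy 4*eps*[(sigma/r)^12 - (sigma/r)^6], zero at or beyond the cutoff."""
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r: float, epsilon: float = 1.0, sigma: float = 1.0, r_cut: float = 2.5) -> float:
    """Radial force -dV/dr (positive means repulsive), zero at or beyond the cutoff."""
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


# Separation at which the untruncated potential has its minimum.
R_MIN = 2.0 ** (1.0 / 6.0)

if __name__ == "__main__":
    for r in (1.0, R_MIN, 2.0, 3.0):
        print(f"r={r:.4f}  V={lj_energy(r):+.6f}  F={lj_force(r):+.6f}")
```

Because the interaction vanishes beyond the cutoff, a node only needs atom positions from a thin shell of its neighbors' boxes, which is why exchanges across the six faces are enough in a spatial decomposition.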

9
Lennard-Jones Molecular Dynamics
[Figure panels: Communication Steps; Typical Grid Cell and Cutoff Radius; Computational Cycle Model]
10
LJS Single Processor BG/L Performance
Original Code vs. Tuned for BG/L
[Chart: improvement (y-axis, 0 to 12; unit lost in extraction) against number of atoms per BG/L CPU (x-axis, 15,625 to 500,000); annotation: "good cache reuse"]
11
LJS Molecular Dynamics Performance
Fixed Problem Size of 1 Billion Atoms
[Chart: time per single iteration (ms), split into compute time and communications time, against number of BG/L CPUs (2k to 64k)]
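With a fixed problem size, doubling the CPU count halves the number of atoms each CPU owns. The arithmetic sketch below assumes that "1 billion" stands for 1,024,000,000 atoms and that "2k" means 2 x 1024 CPUs; neither is stated on the slides. Under those assumptions the per-CPU loads run from 500,000 down to 15,625, the same range shown for the single-processor measurements.

```python
# Assumed total: 1,024,000,000 atoms (the slides only say "1 billion").
TOTAL_ATOMS = 1_024_000_000

# Assumed meaning of the axis labels: "2k" = 2 * 1024 CPUs, and so on.
CPU_COUNTS = [k * 1024 for k in (2, 4, 8, 16, 32, 64)]


def atoms_per_cpu(total_atoms: int, cpus: int) -> int:
    """Even split of a fixed problem across CPUs; fails if it does not divide exactly."""
    if total_atoms % cpus != 0:
        raise ValueError("total does not divide evenly across CPUs")
    return total_atoms // cpus


ATOMS_TABLE = {cpus: atoms_per_cpu(TOTAL_ATOMS, cpus) for cpus in CPU_COUNTS}

if __name__ == "__main__":
    for cpus, atoms in ATOMS_TABLE.items():
        print(f"{cpus:6d} CPUs -> {atoms:7d} atoms per CPU")
```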
12
LJS Speedup BG/L vs. ASCI Red 3200 Nodes
1 Billion Atom Problem
[Chart: speedup (y-axis, 0 to 80) against number of Blue Gene/L nodes (2k to 64k)]
13
LJS Communications Time
500,000 Atoms per BG/L Node
[Chart: communications time per iteration (ms; y-axis, 0 to 60) for physical nearest-neighbor mapping and random mapping, across BG/L configurations 4x4x4 (64 nodes), 8x8x8 (512 nodes), and 16x16x16 (4096 nodes)]
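One way to see why task placement matters on a torus is to count network hops between communicating neighbors. The sketch below is a simplified illustration, not the model used in the study. It assumes a periodic logical grid with the same shape as the torus, counts only minimal hop distances, and ignores contention, routing, and message sizes, so it does not reproduce the measured times. All function names are hypothetical.

```python
import itertools
import random
from typing import Dict, Sequence, Tuple

Coord = Tuple[int, ...]


def torus_hops(a: Coord, b: Coord, dims: Sequence[int]) -> int:
    """Minimal hop count between two nodes on a torus with wraparound links."""
    return sum(min(abs(x - y), n - abs(x - y)) for x, y, n in zip(a, b, dims))


def all_coords(dims: Sequence[int]):
    return list(itertools.product(*(range(n) for n in dims)))


def identity_placement(dims: Sequence[int]) -> Dict[Coord, Coord]:
    """Logical task (i, j, k) runs on physical node (i, j, k)."""
    return {c: c for c in all_coords(dims)}


def random_placement(dims: Sequence[int], seed: int = 0) -> Dict[Coord, Coord]:
    """Logical tasks are scattered over physical nodes by a seeded shuffle."""
    coords = all_coords(dims)
    shuffled = coords[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(coords, shuffled))


def average_neighbor_hops(dims: Sequence[int], placement: Dict[Coord, Coord]) -> float:
    """Mean hop count over every task's exchanges with its face neighbors."""
    total = 0
    count = 0
    for task in all_coords(dims):
        for axis in range(len(dims)):
            for step in (-1, 1):
                neighbor = list(task)
                neighbor[axis] = (neighbor[axis] + step) % dims[axis]
                total += torus_hops(placement[task], placement[tuple(neighbor)], dims)
                count += 1
    return total / count


if __name__ == "__main__":
    for dims in ((4, 4, 4), (8, 8, 8)):
        near = average_neighbor_hops(dims, identity_placement(dims))
        rand = average_neighbor_hops(dims, random_placement(dims))
        print(dims, "nearest-neighbor:", near, "random:", round(rand, 2))
```

With the identity placement every exchange is a single hop regardless of machine size, while the average distance under a random placement grows with the torus dimensions.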
14
What is QMC and Why is it a Good Fit for BG/L?

QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
Developed at Caltech by Bill Goddard's ASCI Material Properties group
Interesting characteristics:
Low memory requirements
After initialization, highly parallel and scalable
Minimal set of MPI calls required: non-blocking point-to-point, reduction, probe, communicator, and collective calls
No communications during QMC working steps
Convergence statistics message is 7200 bytes regardless of problem size and node count
Code already ported to many platforms (Linux, AIX, IRIX, etc.)
C and MPI sources

15
Iterative QMC Algorithm
For each processor do
    Steps = Total Steps / number of processors
    Generate walkers
    Equilibrate walkers
    for each step
        generate QMC statistics
    send QMC statistics to master node
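The structure of this loop can be sketched with a toy Monte Carlo problem. The code below is not the Caltech QMC code: it samples a standard normal density with Metropolis moves and averages x squared (exact value 1), purely to show independent workers that each run their share of the steps and hand a fixed-size statistics record to a master. The function names, the walker count, the equilibration length, and the choice to send statistics once after the step loop are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Record = Tuple[int, float, float]  # (sample count, sum, sum of squares)


def run_worker(steps: int, n_walkers: int, equil_steps: int, seed: int) -> Record:
    """One processor's share: generate walkers, equilibrate, then accumulate."""
    rng = random.Random(seed)
    walkers: List[float] = [rng.uniform(-1.0, 1.0) for _ in range(n_walkers)]

    def sweep() -> None:
        # Metropolis move for a density proportional to exp(-x^2 / 2).
        for i, x in enumerate(walkers):
            trial = x + rng.uniform(-1.0, 1.0)
            if rng.random() < math.exp((x * x - trial * trial) / 2.0):
                walkers[i] = trial

    for _ in range(equil_steps):
        sweep()

    n, s, s2 = 0, 0.0, 0.0
    for _ in range(steps):
        sweep()
        for x in walkers:
            obs = x * x  # toy observable
            n += 1
            s += obs
            s2 += obs * obs
    return n, s, s2


def merge(records: Sequence[Record]) -> Tuple[int, float]:
    """Master side: combine fixed-size records into a sample count and a mean."""
    n = sum(r[0] for r in records)
    s = sum(r[1] for r in records)
    return n, s / n


def run(total_steps: int, n_procs: int, n_walkers: int = 10, equil_steps: int = 100) -> Tuple[int, float]:
    steps = total_steps // n_procs  # Steps = Total Steps / number of processors
    records = [run_worker(steps, n_walkers, equil_steps, seed=rank) for rank in range(n_procs)]
    return merge(records)


if __name__ == "__main__":
    print(run(total_steps=4000, n_procs=4))
```

The record each worker returns has the same size no matter how many steps or walkers it processed, which mirrors the slide's point that the statistics message does not grow with problem size or node count.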

16
QMC Communications Time
For 100,000 Steps Per Node
(Reduce Using the Torus)
[Chart: time in seconds (log scale, 0.001 to 1) against BG/L configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K)]
17
Future Application Porting and Analysis for BG/L

ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
Address memory constraints, grain load imbalance, and MPI_Waitall() efficiency as we port and tune to BG/L
Good stress test for BG/L robustness

Scalable simulation of polycrystalline response with assumed grain shape. The grain shape is the space-filling polyhedron corresponding to the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL's IBM SP3, frost.
