Title: An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations
1. An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations
- Bryan Carpenter
- University of Southampton
2. Background
- The HPspmd model was introduced around 1997 as a reaction to (the failure of) High Performance Fortran (HPF).
- The idea was to keep the data model of HPF, which we thought was good, but abandon the execution model, which was both inflexible and hard to implement.
- See papers at www.hpjava.org
3. The HPspmd Model
- The idea of HPspmd is to extend a base language with a specific set of syntax extensions for representing distributed data (distributed arrays).
- The extended language does not contain any communication or synchronization primitives.
- The programming model is SPMD, and all communication between nodes goes through libraries.
- HPspmd itself only provides a framework for writing libraries that act on distributed data, including communication libraries.
4. HPJava, and Other Base Languages
- It was always the intention that HPspmd could be bound to various base languages.
- The first full implementation was its binding to Java; HPJava was finally released in 2003.
- Up to now HPJava is the only full implementation of HPspmd (other than early C++ prototypes).
- But a full binding to, say, C or C++ is also possible (and desirable?)
5. Goals of This Work
- Revisit HPspmd, with a particular eye to a possible role in programming multicore systems.
- Push the HPspmd model beyond its original zone of application, to a highly irregular production code for astrophysics.
- See what we learn.
6. HPspmd
7. HPF Distributed Array Example
- INTEGER A(8, 8)
- !HPF$ PROCESSORS P(3,2)
- !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
8. HPF Parallel Assignments
- ! Parallel loop
- FORALL (I = 2:8, J = 1:8)
-   A(I, J) = (I + 1) * (J + 1)
- ! Communication (implicit)
- A(1, :) = A(8, :)
- Communication can happen almost anywhere.
9. HPspmd Distributed Array Example
- int N = 8;
- Procs p = new Procs2(3, 2);
- on(p) {
-   Range x = new BlockRange(N, p.dim(0));
-   Range y = new BlockRange(N, p.dim(1));
-   int [[-,-]] a = new int [[x, y]];
- }
10. HPspmd Parallel Assignments
- // Parallel loop
- overall(i = x for 1 : 7)
-   overall(j = y for :)
-     a[i, j] = (i` + 2) * (j` + 2);
- // Communication (always explicit)
- Adlib.remap(a[[0, :]], a[[7, :]]);
- In an element reference, the subscript must be a distributed index (like i or j) in the range of the array.
- Section subscripts: anything goes.
11. HPF Collapsed Dimensions
- INTEGER A(8, 8)
- !HPF$ PROCESSORS P(3)
- !HPF$ DISTRIBUTE A(BLOCK, *) ONTO P
12. HPspmd Sequential Dimensions
- int N = 8;
- Procs p = new Procs1(3);
- on(p) {
-   Range x = new BlockRange(N, p.dim(0));
-   int [[-, *]] a = new int [[x, N]];
- }
13. Working with Sequential Dimensions
- overall(i = x for 1 : 7)
-   for(int j = 0; j < N; j++)
-     a[i, j] = (i` + 2) * (j + 2);
- overall is a general control construct: it can include (almost) arbitrary blocks of code.
- There are constraints on nesting of overalls, and on calls to collective ops inside overall. See later.
14. HPspmd Co-arrays
- int N = 8;
- Procs p = new Procs1(3);
- on(p) {
-   Dimension procs = p.dim(0);
-   int [[-, *]] a = new int [[procs, N]];
- }
15. Working with Co-arrays
- overall(me = procs for :)
-   for(int i = 0; i < N; i++)
-     a[me, i] = me` * N + i;
- Just a special case of HPspmd distributed arrays, where the distributed range is a process dimension.
- Distribution format reminiscent of Co-array Fortran, but no communication in the language: still use remap() etc.
16. HPspmd and Multicore
- The HPspmd model was originally designed for distributed memory architectures.
- There are some arguments why it might be a good model for shared memory and (especially) hybrid architectures.
17. It Isn't MPI
- In principle you can do explicit message-passing in HPspmd programs, but it is not usually necessary.
- High level collective communications like remap() can be implemented efficiently by MPI, or by copying in shared memory, or a mixture.
- See also the other collective primitives proposed later in this talk.
18. It Isn't OpenMP
- It doesn't require shared memory for efficient implementation.
- Even primitives built into PGAS languages like Co-array Fortran have a bias towards the shared memory model.
- The HPspmd model is agnostic about communication, leaving it to libraries.
- Existing and proposed libraries are above the message-passing or shared memory level.
19. Cache Issues
- In cache-coherent processors, common advice to reduce false sharing, fragmentation, etc. is to use "higher dimensional" arrays to keep partitions contiguous in the address space.
- This is just how HPspmd distributed arrays are implemented on shared memory processors.
- See for example Parallel Computer Architecture, Culler and Pal Singh, 1999, §5.6
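The layout argument can be made concrete with a small sketch (hypothetical index functions, not HPspmd code): in a blocked, "higher-dimensional" layout a[p][b], each partition occupies one contiguous run of flat addresses, whereas a strided layout interleaves partitions so neighbouring elements of different partitions share cache lines.

```python
P, N = 3, 12        # 3 "processes", 12 array elements
B = N // P          # block size owned by each process

def addr_blocked(p, b):
    # flat address of element b of partition p in the blocked layout:
    # partition p owns addresses [p*B, (p+1)*B)
    return p * B + b

def addr_strided(p, b):
    # flat address in a cyclic (interleaved) layout, for contrast:
    # consecutive addresses belong to different partitions
    return b * P + p
```

Partition 0 maps to addresses 0..3 in the blocked layout, but to the strided set {0, 3, 6, 9} in the cyclic one, which is exactly the false-sharing-prone case.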
20. Gadget
21. Gadget-2
- A free production code for cosmological N-body and hydrodynamic computations.
- Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.
- Written in C; already fully parallelized using MPI.
- Versions used in various astrophysics research papers, including the Millennium Simulation.
- http://www.mpa-garching.mpg.de/gadget
22. Millennium Simulation
- Follows the evolution of 10^10 dark matter particles from the early Universe to the current day.
- Performed on 512 nodes of an IBM p690.
- Used 1 TB of distributed memory.
- 350,000 CPU hours; 28 days elapsed time.
- Floating point performance around 0.2 TFLOPS.
- Around 20 TB total data generated.
- "Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution", Springel et al. See the Gadget home page.
23. Dynamics in Gadget
- Gadget is "just" simulating the movement of (a lot of) representative particles under the influence of Newton's law of gravity, plus hydrodynamic forces on gas.
- Classical N-body problem.
24. Gadget Main Loop
- Schematic view of the Gadget code:
- Initialize
- while (not done) {
-   move_particles()              // update positions
-   domain_Decomposition()
-   compute_accelerations()
-   advance_and_find_timesteps()  // update velocities
- }
- Most of the interesting work happens in domain_Decomposition() and compute_accelerations().
25. Computing Forces
- The function compute_accelerations() must compute the forces experienced by all particles.
- In particular it must compute the gravitational force.
- Because gravity is a long range force, every particle affects every other. Naively, the total cost is O(N^2).
- With N = 10^10, this is infeasible.
- Need some kind of approximation.
- Intuitive approach: when computing the force on particle i, group together particles in regions far from i, and treat a group as a single particle, located at its centre of gravity.
26. Barnes-Hut Tree
- Recursively split space into an octree (quadtree in this 2d example), until no node contains more than 1 particle.
27. Barnes-Hut Force Computation
- BH computation of the force on a particle i: traverse the tree starting from the root.
- If a node n is far from i, just add the contribution to the force on i from the centre of mass of n; no need to visit the children of n.
- If node n is close to i, visit the children of n and recurse.
- On average, the number of nodes opened to compute the force on i is O(log N).
- c.f. visiting O(N) particles in the naïve algorithm.
28. BH in Gadget
- In Gadget, the preferred method of dealing with the long range part of the gravitational force is the TreePM algorithm (see later).
- But the BH-tree is still the fundamental data structure: traversals of the BH tree are used to locate particles near to a given particle:
- Short-range part of gravity force in TreePM
- Estimation of density in SPH
- Computation of hydrodynamic forces
29. TreePM
- Artificially split the gravitational potential into two parts: a short-range force F_short(r) and a long-range force F_long(r) (in Gadget-2 the split is made with an erfc/erf cutoff at a scale r_s).
- Calculate F_short using BH; calculate F_long by projecting the particle mass distribution onto a mesh, then working in Fourier space.
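The split can be written down concretely. This follows the erfc/erf form used in the Gadget-2 paper (with G = M = 1 and split scale r_s as assumed normalizations); the key property is that the two parts sum exactly to the full 1/r^2 force, with the tree term dominating at small r and the mesh term at large r.

```python
import math

def f_short(r, rs):
    # short-range force, computed by the Barnes-Hut tree walk
    x = r / (2.0 * rs)
    return (math.erfc(x) + (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x)) / r**2

def f_long(r, rs):
    # long-range force, computed on the PM mesh in Fourier space
    x = r / (2.0 * rs)
    return (math.erf(x) - (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x)) / r**2
```

Since erfc(x) + erf(x) = 1 and the exponential terms cancel, f_short + f_long = 1/r^2 for every r: the split changes where each contribution is computed, not the physics.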
30. Smoothed Particle Hydrodynamics
- Uses discrete particles to represent the state of the fluid.
- Equations of motion depend on mass density, estimated within a smoothing radius.
- Force calculation has 2 distinct phases:
- Calculate density, adaptively setting the smoothing radius to give enough neighbours.
- Calculate forces based on the smoothing kernel.
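The two phases can be sketched in 1-d (hypothetical names and a simple triangular kernel, not Gadget's spline kernel or its tree-based neighbour search): first grow each particle's smoothing radius h until it encloses a target number of neighbours, then evaluate the smoothed density with the kernel.

```python
def adapt_h(xs, i, n_target, h0=0.05, grow=1.3):
    """Phase 1: grow h until at least n_target particles lie within it."""
    h = h0
    while sum(1 for x in xs if abs(x - xs[i]) < h) < n_target:
        h *= grow
    return h

def kernel(r, h):
    """Normalized triangular kernel in 1-d: integrates to 1 over [-h, h]."""
    return max(0.0, 1.0 - r / h) / h

def density(xs, masses, i, h):
    """Phase 2: smoothed density at particle i."""
    return sum(m * kernel(abs(x - xs[i]), h) for x, m in zip(xs, masses))

xs = [0.1 * k for k in range(10)]        # uniform particles on [0, 0.9]
h = adapt_h(xs, 5, n_target=5)
rho = density(xs, [1.0] * 10, 5, h)
```

In the real code the adaptive radius is what makes the communication pattern irregular: which remote particles a node needs depends on h, which is itself computed iteratively.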
31. Domain Decomposition
- Can't just divide space in a fixed way, because some regions will have many more particles than others: poor load balancing.
- Can't just divide particles in a fixed way, because particles move independently through space, and we want to keep physically close particles on the same processor, as far as practical: communication problem.
32. Peano-Hilbert Curve
- Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
33. Decomposition based on P-H Curve
- Sort particles by position on the Peano-Hilbert curve, then divide evenly into P domains.
- Characteristics of this decomposition:
- Good load balancing.
- Domains simply connected and quite compact in real space, because particles that are close along the P-H curve are close in real space.
- Domains have a relatively simple mapping to BH octree nodes.
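The sort-and-cut decomposition can be sketched in 2-d (Gadget uses 3-d Peano-Hilbert keys; hilbert_key below is the standard rotate-and-flip construction for a 2^k x 2^k grid, and the function names are illustrative):

```python
def hilbert_key(n, x, y):
    """Distance of cell (x, y) along the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def decompose(points, n, P):
    """Sort by P-H key, then divide evenly into P domains."""
    ordered = sorted(points, key=lambda p: hilbert_key(n, p[0], p[1]))
    per = -(-len(ordered) // P)          # ceiling division
    return [ordered[i * per:(i + 1) * per] for i in range(P)]

cells = [(x, y) for x in range(4) for y in range(4)]
domains = decompose(cells, 4, 4)
```

With 16 uniform cells and P = 4, each domain comes out as a compact 2x2 block, illustrating the locality claim: positions that are adjacent along the curve are adjacent in space.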
34. Distribution of BH Tree in Gadget
35. Communication in Gadget
- Can identify 4 recurring non-trivial patterns:
- Distributed sort of particles, according to P-H key: implements the domain decomposition.
- Export of particles to other nodes, for calculation of remote contributions to force, density, etc., and retrieval of results.
- Projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.
- Distributed Fast Fourier Transform.
36. HPspmd Analysis of Gadget
37. Approach
- As a first step, we go through the whole code, and annotate it with the changes that would be needed to make it into HPspmd code.
- Gadget is 16,000 lines of C code.
- Gadget is more complex than any previous HPspmd code, so we expect that new communication functions may be needed.
- Also open to the possibility of new language extensions.
38. Use of Co-Arrays
- Immediately clear the translation will make extensive use of co-arrays.
- General HPspmd distributed arrays are not suitable for structures like the Nodes array that holds the BH tree, because HPspmd arrays only allow very regular patterns of local access.
- Similar considerations apply to other major arrays, like the particle array P.
39. The Particles Array
- Consider P, which holds NumPart particles in each node, where in general NumPart is different for each node.
- Could use a general HPspmd distributed array with an IrregRange:
- Range x = new IrregRange(procs, NumPart);
- struct particle_data [[-]] P =
-   new struct particle_data [[x]];
- Called GEN_BLOCK in HPF 2.0.
40. Possible IrregRange Array for P
- Assumes NumParts = {6, 8, 5}.
41. Using the General Distributed Array
- This would often be convenient, e.g. in move_particles():
- overall(i = x for :)
-   for(j = 0; j < 3; j++)
-     P[i].Pos[j] += P[i].Vel[j] * dt_drift;
- and even, in domain_Decomposition():
- sort(P, domain_compare_key);
- where sort() is a notional distributed sort.
42. Problem with the General Array
- But, in compute_potential(), for example:
- overall(i = x for :) {
-   pos_x = P[i].Pos[0];
-   ...
-   // Traverse down from root of BH tree
-   if(/* node contains the single particle no */) {
-     dx = P[no].Pos[0] - pos_x;
-     ...
-   }
-   ...
- }
- The reference P[no] is illegal in HPspmd!
43. Co-Array for P
- Assumes NumParts = {6, 8, 5}.
44. Using a Co-Array
- Now we need, in move_particles():
- overall(me = procs for :)
-   for(i = 0; i < NumParts[me]; i++)
-     for(j = 0; j < 3; j++)
-       P[me, i].Pos[j] += P[me, i].Vel[j] * dt_drift;
- and, notionally, in domain_Decomposition():
- sort(P, NumParts, domain_compare_key);
45. Different Views?
- HPJava actually has a mechanism for viewing an array as a high-level distributed array in one part of the code, and a co-array in another.
- This mechanism was designed for library-writers. It is complex, and I was reluctant to introduce it into user code.
- The first pass translation of Gadget uses co-arrays for most of the major arrays.
- The mechanism is called a splitting section.
46. Programming Style
- Use of HPspmd programming features is largely limited to patterns like this:
- Declaration of co-arrays, etc.
- Global computation
- overall(me = procs for :) {
-   Local computation
- }
- Collective communication
- Any of the above, in sequential control flow.
47. Programming Discipline
- Syntactically, distributed array elements can only be accessed inside overall.
- Global variables (variables declared outside overall, that are not distributed array elements) should not be assigned inside overall.
- In principle, message passing is allowed inside overall (the local computation section), but collective communication (outside overall) is preferred.
48. Basic Communication
- Gadget makes extensive use of MPI collectives.
- These map easily to methods from the Adlib collective library, always more succinctly.
- The translation uses:
- Adlib.remap(), Adlib.sum(), Adlib.maxval(), Adlib.minval(), Adlib.and(), Adlib.broadcast(), Adlib.sumDim(), Adlib.maxvalDim(), Adlib.minvalDim()
49. Non-trivial Communication Patterns
- Will consider in detail two recurring patterns in the Gadget code:
- The basic communication pattern in compute_potential().
- Projection to the particle mesh in TreePM.
50. Sketch of compute_potential()
- while(... some local particles not processed ...) {
-   for(... all local particles, while send buffer not exhausted ...) {
-     ... Compute local contribution to potential, and flag exports ...
-     ... Add exported particles to send buffer ...
-   }
-   ... Sort export buffer by destination ...
-   while(... there are more peers in export list ...) {
-     ... Send exports to all peers possible (without exhausting
-         recv buffers); recv particles for local computation ...
-     ... Compute local contribution for received particles ...
-     for(... all peer nodes communicated with above ...) {
-       ... Send results back to exporters; recv our results ...
-       ... Add the results to the P array ...
-     }
-   }
- }
51. Notes on compute_potential()
- Highlighted code is essentially application specific.
- Remaining code is essentially boilerplate that recurs in:
- compute_potential()
- gravity_tree()
- gravity_forcetest()
- density()
- hydro_force()
52. Boilerplate Code
- while(ntotleft > 0) {
-   overall(me = procs for :) {
-     int nexport = 0;
-     for(j = 0; j < NTask; j++)
-       nsend_local[me, j] = 0;
-     ndone[me] = 0;
-     while(i < num[me] && nexport < bunchSize - NTask) {
-       if(condition(a[me, i])) {
-         ndone[me]++;
-         for(j = 0; j < NTask; j++)
-           Exportflag[me, j] = 0;
-         ... Compute local contribution to potential, and flag exports ...
-         for(j = 0; j < NTask; j++)
-           if(Exportflag[me, j])
-             setDataIn(dataIn[me, nexport++], a[me, i], ...);
-       }
-     }
-     ...
-     MPlib.sendrecv(dataIn[[send_offset : send_offset + send_count - 1]],
-                    world.getProc(recvTask), TAG_A,
-                    dataGet[[recv_offset : recv_offset + recv_count - 1]],
-                    world.getProc(recvTask), TAG_A,
-                    MPI_COMM_WORLD, status);
-     ...
-     for(j = 0; j < NTask; j++)
-       if((j ^ ngrp) < NTask)
-         nbuffer[j] += nsend[j ^ ngrp, j];
-     ...
-     ... Compute local contributions for received particles ...
-     /* get the result */
-     for(j = 0; j < NTask; j++)
-       ...
-   }
- }
53. Remarks
- We expect this pure message-passing code is quite efficient.
- But it is presumably hard to maintain, and certainly not in the HPspmd spirit, which eschews low-level messaging.
- Too much of the code is communication specific rather than application specific.
- Can we find a better abstraction?
54. A Different Point of View
- Yes, I think so.
- Let's try considering this fragment as the main loop:
- for(... all local particles, while send buffer not exhausted ...) {
-   ... Compute local contribution to potential, and flag exports ...
-   ... Add exported particles to send buffer ...
- }
- Now suppose that the highlighted elision, rather than just flagging exports, calls a library method, passing it the values to be exported.
55. The invoke() Method
- This library method, which I'll suggestively call invoke(), can internally load values into the send buffer, and keep track of whether the buffer is full.
- When the buffer is full, invoke() internally triggers all the remaining actions on the earlier slide, including all communication with peers, then returns control for the application-specific main loop to resume with a cleared buffer.
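The buffering behaviour of invoke() can be sketched as follows (a hypothetical single-process stand-in: the "communication" is a direct call to the remote handler, where the real schedule would perform the collective exchange with peer nodes):

```python
class RMIProxy:
    def __init__(self, handler, on_response, bunch_size):
        self.handler = handler          # "remote implementation" callback
        self.on_response = on_response  # response-handler callback
        self.bunch_size = bunch_size
        self.buffer = []

    def invoke(self, context, data):
        self.buffer.append((context, data))
        if len(self.buffer) >= self.bunch_size:
            self.flush()                # buffer full: communicate now

    def flush(self):
        # stand-in for the aggregated exchange with peer nodes:
        # run the remote implementation, route each result back
        for context, data in self.buffer:
            self.on_response(context, self.handler(data))
        self.buffer.clear()

results = {}
proxy = RMIProxy(handler=lambda d: d * d,
                 on_response=lambda ctx, r: results.__setitem__(ctx, r),
                 bunch_size=4)
for i in range(10):
    proxy.invoke(i, i)                 # main loop just invokes; no comms code
proxy.flush()                          # like schedule.complete(): drain the rest
```

The application-specific main loop never mentions buffers or peers; all aggregation lives behind invoke() and the final flush, which is exactly the division of labour proposed on the next slides.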
56. A Better Code Organization
- So then we could implement exactly equivalent user code like this:
- for(... all local particles ...) {
-   ... Compute local contribution to potential, and pass
-       exports to invoke() ...
- }
- But invoke() must somehow trigger the user code segments indicated earlier by:
- ... Compute local contribution for received particles ...
- ... Add the results to the P array ...
57. Callbacks
- Clearly these must be callbacks of some kind.
- Let's regard the first callback as "remote implementation code" and the second as a "response handler".
- The structure we now have is an asynchronous remote method invocation!
- But behind the scenes it is implemented efficiently and collectively, aggregating atomic invocations.
58. Putting Things Together
- ComputePotentialProxy [[-]] proxies =
-   new ComputePotentialProxy [[procs]];
- ComputePotentialHandler [[-]] handlers =
-   new ComputePotentialHandler [[procs]];
- ... Instantiate custom proxies and handlers ...
- RMISchedule schedule = new RMISchedule(proxies, handlers);
- overall(me = procs for :) {
-   for(i = 0; i < NumPart[me]; i++) {
-     ... Compute local contribution to potential ...
-     proxies[me].invoke(remoteTask, i, GravDataIn);  // export
-     ...
-   }
- }
- schedule.complete();
59. Defining Callback Methods
- class ComputePotentialHandler extends RMIHandler {
-   ... Declare fields and constructor ...
-   gravdata_in invoke(gravdata_in GravDataGet) {
-     ... Compute local contribution for received particles ...
-   }
- }
- class ComputePotentialProxy extends RMIProxy {
-   ... Declare fields and constructor ...
-   void handleResponse(int context, gravdata_in GravDataOut) {
-     ... Add the results to the P array ...
-   }
- }
60. Projection to Particle Mesh
- The second of the non-trivial communication patterns identified earlier comes from the TreePM algorithm:
- Projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.
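The projection step itself is simple; a 1-d cloud-in-cell sketch (an assumed scheme for illustration; Gadget projects 3-d mass onto the PM grid) shares each particle's mass between its two nearest grid points, so total mass is conserved on the mesh:

```python
def project(positions, masses, ngrid, boxsize):
    """Deposit particle masses onto a periodic 1-d grid (cloud-in-cell)."""
    rho = [0.0] * ngrid
    for x, m in zip(positions, masses):
        u = (x / boxsize) * ngrid       # position in grid units
        i = int(u) % ngrid
        frac = u - int(u)               # offset within the cell
        rho[i] += m * (1.0 - frac)      # share mass between the two
        rho[(i + 1) % ngrid] += m * frac  # nearest grid points
    return rho

rho = project([0.5, 2.25, 7.9], [1.0, 2.0, 3.0], ngrid=8, boxsize=8.0)
```

The communication difficulty is not this arithmetic but the fact that each node's particles deposit into a workspace that overlaps several slabs of the distributed grid, which is what the code on the following slides is shuffling around.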
61. Sketch of pmpotential_periodic()
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Project local particle masses into workspace ...
-   ... Accumulate workspace into rhogrid ...
- }
- ... Do the FFT of the density field ...
- ... Multiply with Green's function for the potential ...
- ... Reverse the FFT to get the potential ...
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Get workspace from rhogrid ...
-   ... Read out local potential from workspace ...
- }
62. Notes on pmpotential_periodic()
- Computes the F_long contribution to the potential.
- workspace is large enough to cover all particles from P in the local domain, and maps to a subsection of the full PM distributed array, rhogrid.
- Highlighted code is essentially application independent.
- [Figure: the co-array P on nodes 0-4; the domain on node 3 is covered by workspace, which maps into the regularly distributed array rhogrid on nodes 0-4.]
63. Accumulate workspace into rhogrid
- for(level = 0; level < (1 << PTask); level++) {
-   /* note: for level = 0, target is the same task */
-   sendTask = me;
-   recvTask = me ^ level;
-   if(recvTask < NTask) {
-     /* check how much we have to send */
-     sendmin = 2 * PMGRID;
-     sendmax = -1;
-     for(slab_x = meshmin[me, 0]; slab_x < meshmax[me, 0] + 2; slab_x++)
-       if(slab_to_task[slab_x % PMGRID] == recvTask) {
-         if(slab_x < sendmin) sendmin = slab_x;
-         if(slab_x > sendmax) sendmax = slab_x;
-       }
-     if(sendmax == -1) sendmin = 0;
-     if(level > 0) {
-       MPlib.sendrecv(workspace[[sendmin - meshmin[0] : sendmax - meshmin[0], :, :]],
-                      world.getProc(recvTask), TAG_PERIODIC_C,
-                      forcegrid,
-                      world.getProc(recvTask), TAG_PERIODIC_C,
-                      MPI_COMM_WORLD, status);
-     } else {
-       HPspmd.copy(forcegrid,
-                   workspace[[sendmin - meshmin[0] : sendmax - meshmin[0], :, :]]);
-     }
-     for(slab_x = recvmin; slab_x < recvmax; slab_x++) {
-       slab_xx = (slab_x % PMGRID) - first_slab_of_task[me];
-       ...
-     }
-   }
- }
64. Get workspace from rhogrid
- for(level = 0; level < (1 << PTask); level++) {
-   /* note: for level = 0, target is the same task */
-   sendTask = me;
-   recvTask = me ^ level;
-   if(recvTask < NTask) {
-     /* check how much we have to send */
-     sendmin = 2 * PMGRID;
-     sendmax = -PMGRID;
-     for(slab_x = meshmin_list[3 * recvTask]; slab_x < meshmax_list[3 * recvTask] + 2; slab_x++)
-       if(slab_to_task[(slab_x + PMGRID) % PMGRID] == sendTask) {
-         if(slab_x < sendmin) sendmin = slab_x;
-         if(slab_x > sendmax) sendmax = slab_x;
-       }
-     for(slab_x = recvmin; slab_x < recvmax; slab_x++) {
-       if(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask) {
-         /* non-contiguous */
-         cont_recvmax[0] = slab_x - 1;
-         while(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask)
-           slab_x++;
-         cont_recvmin[1] = slab_x;
-         if(ncont == 1)
-           ncont++;
-       }
-     }
-     for(rep = 0; rep < ncont; rep++) {
-       sendmin = cont_sendmin[rep];
-       sendmax = cont_sendmax[rep];
-       ...
-     }
-   }
- }
65. Recurrences
- Essentially similar code recurs in:
- pmforce_periodic()
- pmpotential_periodic()
- pmforce_nonperiodic()
- pmpotential_nonperiodic()
66. A put Schedule
- Modelled on the earlier ideas about RMISchedule, we could implement a schedule (SectionPutSumSchedule, say) that has an equivalent MPI implementation, but hides all the explicit messaging.
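The intended semantics of the proposed schedule can be sketched in a single-process stand-in (hypothetical class, 1-d grid; the real schedule would aggregate the sendrecv exchanges of slide 63): each contributor puts a local workspace plus its offset into the global grid, and complete() sums all contributions, including overlaps, into the target.

```python
class SectionPutSumSchedule:
    def __init__(self, target):
        self.target = target             # flat global grid (e.g. rhogrid)
        self.pending = []

    def put(self, offset, workspace):
        # record a section contribution; no communication yet
        self.pending.append((offset, workspace))

    def complete(self):
        # stand-in for the aggregated message exchange: sum every
        # pending section into the (periodic) global grid
        for offset, ws in self.pending:
            for k, v in enumerate(ws):
                self.target[(offset + k) % len(self.target)] += v
        self.pending.clear()

rhogrid = [0.0] * 8
sched = SectionPutSumSchedule(rhogrid)
sched.put(0, [1.0, 1.0, 1.0])
sched.put(2, [2.0, 2.0])                 # overlaps the first section at index 2
sched.complete()
```

Summing (rather than overwriting) on overlap is the essential contract: neighbouring domains' workspaces cover shared slabs of the PM grid, and both contributions must accumulate.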
67. Rewriting pmpotential_periodic()
- SectionPutSumProxy [[-]] proxies =
-   new SectionPutSumProxy [[procs]];
- SectionPutSumSchedule schedule =
-   new SectionPutSumSchedule(proxies, rhogrid);
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Project local particle masses into workspace ...
-   proxies[me].put(new int [] {meshmin_x, meshmin_y, meshmin_z},
-                   workspace);
- }
- schedule.complete();
- etc.
68. Finishing Touches
- Likewise we could introduce a SectionGetSchedule for the second part of pmpotential_periodic().
- The proxy would have an application-specific response handler callback, similar to RMISchedule.
- We estimate this API would reduce the implementation of this single function in Gadget from 400 lines to nearer 200.
- The underlying implementation is still essentially the same in the MPI case. But it allows other implementations.
69. Conclusions
70. HPspmd
- HPspmd is a model for parallel programming that introduces a few concepts into a base language, for representing distributed data.
- Goal: minimal language extensions to support libraries that work with distributed data, including communication libraries.
- Originally inspired by fairly regular applications, and pure distributed memory.
71. HPspmd Analysis of Gadget
- Highly irregular application.
- Focus was on co-arrays, the most primitive kind of distributed array in HPspmd.
- Nevertheless, we were able to propose new high-level communication APIs, very much in the HPspmd spirit, that should make Gadget-like codes much more maintainable.
72. Multicore
- We argue that the high-level abstraction of communication in HPspmd (and to some extent the data model itself) lends naturally to multiple implementations, including hybrid parallelism across clusters of multicore processors.
- The new communication APIs proposed here may be especially interesting in this respect.
- By construction they can be efficiently implemented in MPI.
- But their nature suggests they will also be directly implementable on shared memory.