Title: An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations
1. An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations
- Bryan Carpenter
- University of Southampton
2. Background
- The HPspmd model was introduced around 1997 as a reaction to (the failure of) High Performance Fortran (HPF).
- The idea was to keep the data model of HPF, which we thought was good, but abandon the execution model, which was both inflexible and hard to implement.
- See papers at www.hpjava.org
3. The HPspmd Model
- The idea of HPspmd is to extend a base language with a specific set of syntax extensions for representing distributed data (distributed arrays).
- The extended language does not contain any communication or synchronization primitives.
- The programming model is SPMD, and all communication between nodes goes through libraries.
- HPspmd itself only provides a framework for writing libraries that act on distributed data, including communication libraries.
4. HPJava, and Other Base Languages
- It was always the intention that HPspmd could be bound to various base languages.
- The first full implementation was its binding to Java; HPJava was finally released in 2003.
- Up to now HPJava is the only full implementation of HPspmd (other than early C++ prototypes).
- But a full binding to, say, C or C++ is also possible (and desirable?)
5. Goals of This Work
- Revisit HPspmd, with a particular eye to a possible role in programming multicore systems.
- Push the HPspmd model beyond its original zone of application, to a highly irregular production code for astrophysics.
- See what we learn.
6. HPspmd
7. HPF Distributed Array Example
- INTEGER A(8, 8)
- !HPF$ PROCESSORS P(3,2)
- !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
8. HPF Parallel Assignments
- ! Parallel loop
- FORALL (I = 2:8, J = 1:8)
-   A(I, J) = (I + 1) * (J + 1)
- ! Communication (implicit)
- A(1, :) = A(8, :)
- Communication can happen almost anywhere.
9. HPspmd Distributed Array Example
- int N = 8;
- Procs p = new Procs2(3, 2);
- on(p) {
-   Range x = new BlockRange(N, p.dim(0));
-   Range y = new BlockRange(N, p.dim(1));
-   int [[-,-]] a = new int [[x, y]];
- }
10. HPspmd Parallel Assignments
- // Parallel loop
- overall(i = x for 1 : 7)
-   overall(j = y for :)
-     a[i, j] = (i` + 2) * (j` + 2);
- // Communication (always explicit)
- Adlib.remap(a[[0, :]], a[[7, :]]);
- In an element reference, the subscript must be a distributed index (like i or j) in the range of the array.
- Section subscripts: anything goes.
11. HPF Collapsed Dimensions
- INTEGER A(8, 8)
- !HPF$ PROCESSORS P(3)
- !HPF$ DISTRIBUTE A(BLOCK, *) ONTO P
12. HPspmd Sequential Dimensions
- int N = 8;
- Procs p = new Procs1(3);
- on(p) {
-   Range x = new BlockRange(N, p.dim(0));
-   int [[-, *]] a = new int [[x, N]];
- }
13. Working with Sequential Dimensions
- overall(i = x for 1 : 7)
-   for(int j = 0; j < N; j++)
-     a[i, j] = (i` + 2) * (j + 2);
- overall is a general control construct: it can include (almost) arbitrary blocks of code.
- There are constraints on nesting of overalls, and on calls to collective ops inside overall. See later.
14. HPspmd Co-arrays
- int N = 8;
- Procs p = new Procs1(3);
- on(p) {
-   Dimension procs = p.dim(0);
-   int [[-, *]] a = new int [[procs, N]];
- }
15. Working with Co-arrays
- overall(me = procs for :)
-   for(int i = 0; i < N; i++)
-     a[me, i] = me` * N + i;
- Just a special case of HPspmd distributed arrays, where the distributed range is a process dimension.
- Distribution format reminiscent of Co-array Fortran, but no communication in the language: still use remap() etc.
16. HPspmd and Multicore
- The HPspmd model was originally designed for distributed memory architectures.
- There are some arguments why it might be a good model for shared memory and (especially) hybrid architectures.
17. It Isn't MPI
- In principle you can do explicit message-passing in HPspmd programs, but it is not usually necessary.
- High level collective communications like remap() can be implemented efficiently by MPI, or by copying in shared memory, or a mixture.
- See also the other collective primitives proposed later in this talk.
18. It Isn't OpenMP
- It doesn't require shared memory for efficient implementation.
- Even primitives built into PGAS languages like Co-array Fortran have a bias towards the shared memory model.
- The HPspmd model is agnostic about communication, leaving it to libraries.
- Existing and proposed libraries are above the message-passing or shared memory level.
19. Cache Issues
- In cache-coherent processors, common advice to reduce false sharing, fragmentation, etc. is to use "higher dimensional" arrays to keep partitions contiguous in the address space.
- This is just how HPspmd distributed arrays are implemented on shared memory processors.
- See for example Parallel Computer Architecture, Culler and Pal Singh, 1999, §5.6
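The layout argument can be made concrete with a small sketch (hypothetical index functions, not HPspmd code): in a blocked, "higher-dimensional" layout a[p][b], each partition occupies one contiguous run of flat addresses, whereas a strided layout interleaves partitions so neighbouring elements of different partitions share cache lines.

```python
P, N = 3, 12        # 3 "processes", 12 array elements
B = N // P          # block size owned by each process

def addr_blocked(p, b):
    # flat address of element b of partition p in the blocked layout:
    # partition p owns addresses [p*B, (p+1)*B)
    return p * B + b

def addr_strided(p, b):
    # flat address in a cyclic (interleaved) layout, for contrast:
    # consecutive addresses belong to different partitions
    return b * P + p
```

Partition 0 maps to addresses 0..3 in the blocked layout, but to the strided set {0, 3, 6, 9} in the cyclic one, which is exactly the false-sharing-prone case.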
20. Gadget
21. Gadget-2
- A free production code for cosmological N-body and hydrodynamic computations.
- Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.
- Written in C; already fully parallelized using MPI.
- Versions used in various astrophysics research papers, including the Millennium Simulation.
- http://www.mpa-garching.mpg.de/gadget
22. Millennium Simulation
- Follows the evolution of 10^10 dark matter particles from the early Universe to the current day.
- Performed on 512 nodes of an IBM p690.
- Used 1 TB of distributed memory.
- 350,000 CPU hours; 28 days elapsed time.
- Floating point performance around 0.2 TFLOPS.
- Around 20 TB total data generated.
- "Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution", Springel et al. See the Gadget home page.
23. Dynamics in Gadget
- Gadget is "just" simulating the movement of (a lot of) representative particles under the influence of Newton's law of gravity, plus hydrodynamic forces on gas.
- Classical N-body problem.
24. Gadget Main Loop
- Schematic view of the Gadget code:
- Initialize
- while (not done) {
-   move_particles()              // update positions
-   domain_Decomposition()
-   compute_accelerations()
-   advance_and_find_timesteps()  // update velocities
- }
- Most of the interesting work happens in domain_Decomposition() and compute_accelerations().
25. Computing Forces
- The function compute_accelerations() must compute the forces experienced by all particles.
- In particular it must compute the gravitational force.
- Because gravity is a long range force, every particle affects every other. Naively, the total cost is O(N^2).
- With N = 10^10, this is infeasible.
- Need some kind of approximation.
- Intuitive approach: when computing the force on particle i, group together particles in regions far from i, and treat a group as a single particle, located at its centre of gravity.
26. Barnes-Hut Tree
- Recursively split space into an octree (quadtree in this 2d example), until no node contains more than 1 particle.
27. Barnes-Hut Force Computation
- BH computation of the force on a particle i: traverse the tree starting from the root.
- If a node n is far from i, just add the contribution to the force on i from the centre of mass of n; no need to visit the children of n.
- If node n is close to i, visit the children of n and recurse.
- On average, the number of nodes opened to compute the force on i is O(log N).
- c.f. visiting O(N) particles in the naïve algorithm.
28. BH in Gadget
- In Gadget, the preferred method of dealing with the long range part of the gravitational force is the TreePM algorithm (see later).
- But the BH-tree is still the fundamental data structure: traversals of the BH tree are used to locate particles near to a given particle:
- Short-range part of gravity force in TreePM
- Estimation of density in SPH
- Computation of hydrodynamic forces
29. TreePM
- Artificially split the gravitational potential into two parts: a short-range force F_short(r) and a long-range force F_long(r) (in Gadget-2 the split is made with an erfc/erf cutoff at a scale r_s).
- Calculate F_short using BH; calculate F_long by projecting the particle mass distribution onto a mesh, then working in Fourier space.
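The split can be written down concretely. This follows the erfc/erf form used in the Gadget-2 paper (with G = M = 1 and split scale r_s as assumed normalizations); the key property is that the two parts sum exactly to the full 1/r^2 force, with the tree term dominating at small r and the mesh term at large r.

```python
import math

def f_short(r, rs):
    # short-range force, computed by the Barnes-Hut tree walk
    x = r / (2.0 * rs)
    return (math.erfc(x) + (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x)) / r**2

def f_long(r, rs):
    # long-range force, computed on the PM mesh in Fourier space
    x = r / (2.0 * rs)
    return (math.erf(x) - (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x)) / r**2
```

Since erfc(x) + erf(x) = 1 and the exponential terms cancel, f_short + f_long = 1/r^2 for every r: the split changes where each contribution is computed, not the physics.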
30. Smoothed Particle Hydrodynamics
- Uses discrete particles to represent the state of the fluid.
- Equations of motion depend on mass density, estimated within a smoothing radius.
- Force calculation has 2 distinct phases:
- Calculate density, adaptively setting the smoothing radius to give enough neighbours.
- Calculate forces based on the smoothing kernel.
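The two phases can be sketched in 1-d (hypothetical names and a simple triangular kernel, not Gadget's spline kernel or its tree-based neighbour search): first grow each particle's smoothing radius h until it encloses a target number of neighbours, then evaluate the smoothed density with the kernel.

```python
def adapt_h(xs, i, n_target, h0=0.05, grow=1.3):
    """Phase 1: grow h until at least n_target particles lie within it."""
    h = h0
    while sum(1 for x in xs if abs(x - xs[i]) < h) < n_target:
        h *= grow
    return h

def kernel(r, h):
    """Normalized triangular kernel in 1-d: integrates to 1 over [-h, h]."""
    return max(0.0, 1.0 - r / h) / h

def density(xs, masses, i, h):
    """Phase 2: smoothed density at particle i."""
    return sum(m * kernel(abs(x - xs[i]), h) for x, m in zip(xs, masses))

xs = [0.1 * k for k in range(10)]        # uniform particles on [0, 0.9]
h = adapt_h(xs, 5, n_target=5)
rho = density(xs, [1.0] * 10, 5, h)
```

In the real code the adaptive radius is what makes the communication pattern irregular: which remote particles a node needs depends on h, which is itself computed iteratively.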
31. Domain Decomposition
- Can't just divide space in a fixed way, because some regions will have many more particles than others: poor load balancing.
- Can't just divide particles in a fixed way, because particles move independently through space, and we want to keep physically close particles on the same processor, as far as practical: communication problem.
32. Peano-Hilbert Curve
- Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
33. Decomposition based on P-H Curve
- Sort particles by position on the Peano-Hilbert curve, then divide evenly into P domains.
- Characteristics of this decomposition:
- Good load balancing.
- Domains simply connected and quite compact in real space, because particles that are close along the P-H curve are close in real space.
- Domains have a relatively simple mapping to BH octree nodes.
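The sort-and-cut decomposition can be sketched in 2-d (Gadget uses 3-d Peano-Hilbert keys; hilbert_key below is the standard rotate-and-flip construction for a 2^k x 2^k grid, and the function names are illustrative):

```python
def hilbert_key(n, x, y):
    """Distance of cell (x, y) along the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def decompose(points, n, P):
    """Sort by P-H key, then divide evenly into P domains."""
    ordered = sorted(points, key=lambda p: hilbert_key(n, p[0], p[1]))
    per = -(-len(ordered) // P)          # ceiling division
    return [ordered[i * per:(i + 1) * per] for i in range(P)]

cells = [(x, y) for x in range(4) for y in range(4)]
domains = decompose(cells, 4, 4)
```

With 16 uniform cells and P = 4, each domain comes out as a compact 2x2 block, illustrating the locality claim: positions that are adjacent along the curve are adjacent in space.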
34. Distribution of BH Tree in Gadget
35. Communication in Gadget
- Can identify 4 recurring non-trivial patterns:
- Distributed sort of particles, according to P-H key: implements the domain decomposition.
- Export of particles to other nodes, for calculation of remote contributions to force, density, etc., and retrieval of results.
- Projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.
- Distributed Fast Fourier Transform.
36. HPspmd Analysis of Gadget
37. Approach
- As a first step, we go through the whole code, and annotate it with the changes that would be needed to make it into HPspmd code.
- Gadget is 16,000 lines of C code.
- Gadget is more complex than any previous HPspmd code, so we expect that new communication functions may be needed.
- Also open to the possibility of new language extensions.
38. Use of Co-Arrays
- Immediately clear the translation will make extensive use of co-arrays.
- General HPspmd distributed arrays are not suitable for structures like the Nodes array that holds the BH tree, because HPspmd arrays only allow very regular patterns of local access.
- Similar considerations apply to other major arrays, like the particle array P.
39. The Particles Array
- Consider P, which holds NumPart particles in each node, where in general NumPart is different for each node.
- Could use a general HPspmd distributed array with an IrregRange:
- Range x = new IrregRange(procs, NumPart);
- struct particle_data [[-]] P =
-   new struct particle_data [[x]];
- Called GEN_BLOCK in HPF 2.0.
40. Possible IrregRange Array for P
- Assumes NumParts = {6, 8, 5}.
41. Using the General Distributed Array
- This would often be convenient, e.g. in move_particles():
- overall(i = x for :)
-   for(j = 0; j < 3; j++)
-     P[i].Pos[j] += P[i].Vel[j] * dt_drift;
- and even, in domain_Decomposition():
- sort(P, domain_compare_key);
- where sort() is a notional distributed sort.
42. Problem with the General Array
- But, in compute_potential(), for example:
- overall(i = x for :) {
-   pos_x = P[i].Pos[0];
-   ...
-   // Traverse down from root of BH tree
-   if(/* node contains the single particle no */) {
-     dx = P[no].Pos[0] - pos_x;
-     ...
-   }
-   ...
- }
- The reference P[no] is illegal in HPspmd!
43. Co-Array for P
- Assumes NumParts = {6, 8, 5}.
44. Using a Co-Array
- Now we need, in move_particles():
- overall(me = procs for :)
-   for(i = 0; i < NumParts[me]; i++)
-     for(j = 0; j < 3; j++)
-       P[me, i].Pos[j] += P[me, i].Vel[j] * dt_drift;
- and, notionally, in domain_Decomposition():
- sort(P, NumParts, domain_compare_key);
45. Different Views?
- HPJava actually has a mechanism for viewing an array as a high-level distributed array in one part of the code, and a co-array in another.
- This mechanism was designed for library-writers. It is complex, and I was reluctant to introduce it into user code.
- The first pass translation of Gadget uses co-arrays for most of the major arrays.
- The mechanism is called a splitting section.
46. Programming Style
- Use of HPspmd programming features is largely limited to patterns like this:
- Declaration of co-arrays, etc.
- Global computation
- overall(me = procs for :) {
-   Local computation
- }
- Collective communication
- Any of the above, in sequential control flow.
47. Programming Discipline
- Syntactically, distributed array elements can only be accessed inside overall.
- Global variables (variables declared outside overall, that are not distributed array elements) should not be assigned inside overall.
- In principle, message passing is allowed inside overall (the local computation section), but collective communication (outside overall) is preferred.
48. Basic Communication
- Gadget makes extensive use of MPI collectives.
- These map easily to methods from the Adlib collective library, always more succinctly.
- The translation uses:
- Adlib.remap(), Adlib.sum(), Adlib.maxval(), Adlib.minval(), Adlib.and(), Adlib.broadcast(), Adlib.sumDim(), Adlib.maxvalDim(), Adlib.minvalDim()
49. Non-trivial Communication Patterns
- Will consider in detail two recurring patterns in the Gadget code:
- The basic communication pattern in compute_potential().
- Projection to the particle mesh in TreePM.
50. Sketch of compute_potential()
- while(... some local particles not processed ...) {
-   for(... all local particles, while send buffer not exhausted ...) {
-     ... Compute local contribution to potential, and flag exports ...
-     ... Add exported particles to send buffer ...
-   }
-   ... Sort export buffer by destination ...
-   while(... there are more peers in export list ...) {
-     ... Send exports to all peers possible (without exhausting
-         recv buffers); recv particles for local computation ...
-     ... Compute local contribution for received particles ...
-     for(... all peer nodes communicated with above ...) {
-       ... Send results back to exporters; recv our results ...
-       ... Add the results to the P array ...
-     }
-   }
- }
51. Notes on compute_potential()
- Highlighted code is essentially application specific.
- Remaining code is essentially boilerplate that recurs in:
- compute_potential()
- gravity_tree()
- gravity_forcetest()
- density()
- hydro_force()
52. Boilerplate Code
- while(ntotleft > 0) {
-   overall(me = procs for :) {
-     int nexport = 0;
-     for(j = 0; j < NTask; j++)
-       nsend_local[me, j] = 0;
-     ndone[me] = 0;
-     while(i < num[me] && nexport < bunchSize - NTask) {
-       if(condition(a[me, i])) {
-         ndone[me]++;
-         for(j = 0; j < NTask; j++)
-           Exportflag[me, j] = 0;
-         ... Compute local contribution to potential, and flag exports ...
-         for(j = 0; j < NTask; j++)
-           if(Exportflag[me, j])
-             setDataIn(dataIn[me, nexport++], a[me, i], ...);
-       }
-     }
-     ...
-     MPlib.sendrecv(dataIn[[send_offset : send_offset + send_count - 1]],
-                    world.getProc(recvTask), TAG_A,
-                    dataGet[[recv_offset : recv_offset + recv_count - 1]],
-                    world.getProc(recvTask), TAG_A,
-                    MPI_COMM_WORLD, status);
-     ...
-     for(j = 0; j < NTask; j++)
-       if((j ^ ngrp) < NTask)
-         nbuffer[j] += nsend[j ^ ngrp, j];
-     ...
-     ... Compute local contributions for received particles ...
-     /* get the result */
-     for(j = 0; j < NTask; j++)
-       ...
-   }
- }
53. Remarks
- We expect this pure message-passing code is quite efficient.
- But it is presumably hard to maintain, and certainly not in the HPspmd spirit, which eschews low-level messaging.
- Too much of the code is communication specific rather than application specific.
- Can we find a better abstraction?
54. A Different Point of View
- Yes, I think so.
- Let's try considering this fragment as the main loop:
- for(... all local particles, while send buffer not exhausted ...) {
-   ... Compute local contribution to potential, and flag exports ...
-   ... Add exported particles to send buffer ...
- }
- Now suppose that the highlighted elision, rather than just flagging exports, calls a library method, passing it the values to be exported.
55. The invoke() Method
- This library method, which I'll suggestively call invoke(), can internally load values into the send buffer, and keep track of whether the buffer is full.
- When the buffer is full, invoke() internally triggers all the remaining actions on the earlier slide, including all communication with peers, then returns control for the application-specific main loop to resume with a cleared buffer.
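The buffering behaviour of invoke() can be sketched as follows (a hypothetical single-process stand-in: the "communication" is a direct call to the remote handler, where the real schedule would perform the collective exchange with peer nodes):

```python
class RMIProxy:
    def __init__(self, handler, on_response, bunch_size):
        self.handler = handler          # "remote implementation" callback
        self.on_response = on_response  # response-handler callback
        self.bunch_size = bunch_size
        self.buffer = []

    def invoke(self, context, data):
        self.buffer.append((context, data))
        if len(self.buffer) >= self.bunch_size:
            self.flush()                # buffer full: communicate now

    def flush(self):
        # stand-in for the aggregated exchange with peer nodes:
        # run the remote implementation, route each result back
        for context, data in self.buffer:
            self.on_response(context, self.handler(data))
        self.buffer.clear()

results = {}
proxy = RMIProxy(handler=lambda d: d * d,
                 on_response=lambda ctx, r: results.__setitem__(ctx, r),
                 bunch_size=4)
for i in range(10):
    proxy.invoke(i, i)                 # main loop just invokes; no comms code
proxy.flush()                          # like schedule.complete(): drain the rest
```

The application-specific main loop never mentions buffers or peers; all aggregation lives behind invoke() and the final flush, which is exactly the division of labour proposed on the next slides.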
56. A Better Code Organization
- So then we could implement exactly equivalent user code like this:
- for(... all local particles ...) {
-   ... Compute local contribution to potential, and pass
-       exports to invoke() ...
- }
- But invoke() must somehow trigger the user code segments indicated earlier by:
- ... Compute local contribution for received particles ...
- ... Add the results to the P array ...
57. Callbacks
- Clearly these must be callbacks of some kind.
- Let's regard the first callback as "remote implementation code" and the second as a "response handler".
- The structure we now have is an asynchronous remote method invocation!
- But behind the scenes it is implemented efficiently and collectively, aggregating atomic invocations.
58. Putting Things Together
- ComputePotentialProxy [[-]] proxies =
-   new ComputePotentialProxy [[procs]];
- ComputePotentialHandler [[-]] handlers =
-   new ComputePotentialHandler [[procs]];
- ... Instantiate custom proxies and handlers ...
- RMISchedule schedule = new RMISchedule(proxies, handlers);
- overall(me = procs for :) {
-   for(i = 0; i < NumPart[me]; i++) {
-     ... Compute local contribution to potential ...
-     proxies[me].invoke(remoteTask, i, GravDataIn);  // export
-     ...
-   }
- }
- schedule.complete();
59. Defining Callback Methods
- class ComputePotentialHandler extends RMIHandler {
-   ... Declare fields and constructor ...
-   gravdata_in invoke(gravdata_in GravDataGet) {
-     ... Compute local contribution for received particles ...
-   }
- }
- class ComputePotentialProxy extends RMIProxy {
-   ... Declare fields and constructor ...
-   void handleResponse(int context, gravdata_in GravDataOut) {
-     ... Add the results to the P array ...
-   }
- }
60. Projection to Particle Mesh
- The second of the non-trivial communication patterns identified earlier comes from the TreePM algorithm:
- Projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.
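The projection step itself is simple; a 1-d cloud-in-cell sketch (an assumed scheme for illustration; Gadget projects 3-d mass onto the PM grid) shares each particle's mass between its two nearest grid points, so total mass is conserved on the mesh:

```python
def project(positions, masses, ngrid, boxsize):
    """Deposit particle masses onto a periodic 1-d grid (cloud-in-cell)."""
    rho = [0.0] * ngrid
    for x, m in zip(positions, masses):
        u = (x / boxsize) * ngrid       # position in grid units
        i = int(u) % ngrid
        frac = u - int(u)               # offset within the cell
        rho[i] += m * (1.0 - frac)      # share mass between the two
        rho[(i + 1) % ngrid] += m * frac  # nearest grid points
    return rho

rho = project([0.5, 2.25, 7.9], [1.0, 2.0, 3.0], ngrid=8, boxsize=8.0)
```

The communication difficulty is not this arithmetic but the fact that each node's particles deposit into a workspace that overlaps several slabs of the distributed grid, which is what the code on the following slides is shuffling around.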
61. Sketch of pmpotential_periodic()
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Project local particle masses into workspace ...
-   ... Accumulate workspace into rhogrid ...
- }
- ... Do the FFT of the density field ...
- ... Multiply with Green's function for the potential ...
- ... Reverse the FFT to get the potential ...
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Get workspace from rhogrid ...
-   ... Read out local potential from workspace ...
- }
62. Notes on pmpotential_periodic()
- Computes the F_long contribution to the potential.
- workspace is large enough to cover all particles from P in the local domain, and maps to a subsection of the full PM distributed array, rhogrid.
- Highlighted code is essentially application independent.
- [Figure: the co-array P on nodes 0-4; the domain on node 3 is covered by workspace, which maps into the regularly distributed array rhogrid on nodes 0-4.]
63. Accumulate workspace into rhogrid
- for(level = 0; level < (1 << PTask); level++) {
-   /* note: for level = 0, target is the same task */
-   sendTask = me;
-   recvTask = me ^ level;
-   if(recvTask < NTask) {
-     /* check how much we have to send */
-     sendmin = 2 * PMGRID;
-     sendmax = -1;
-     for(slab_x = meshmin[me, 0]; slab_x < meshmax[me, 0] + 2; slab_x++)
-       if(slab_to_task[slab_x % PMGRID] == recvTask) {
-         if(slab_x < sendmin) sendmin = slab_x;
-         if(slab_x > sendmax) sendmax = slab_x;
-       }
-     if(sendmax == -1) sendmin = 0;
-     if(level > 0) {
-       MPlib.sendrecv(workspace[[sendmin - meshmin[0] : sendmax - meshmin[0], :, :]],
-                      world.getProc(recvTask), TAG_PERIODIC_C,
-                      forcegrid,
-                      world.getProc(recvTask), TAG_PERIODIC_C,
-                      MPI_COMM_WORLD, status);
-     } else {
-       HPspmd.copy(forcegrid,
-                   workspace[[sendmin - meshmin[0] : sendmax - meshmin[0], :, :]]);
-     }
-     for(slab_x = recvmin; slab_x < recvmax; slab_x++) {
-       slab_xx = (slab_x % PMGRID) - first_slab_of_task[me];
-       ...
-     }
-   }
- }
64. Get workspace from rhogrid
- for(level = 0; level < (1 << PTask); level++) {
-   /* note: for level = 0, target is the same task */
-   sendTask = me;
-   recvTask = me ^ level;
-   if(recvTask < NTask) {
-     /* check how much we have to send */
-     sendmin = 2 * PMGRID;
-     sendmax = -PMGRID;
-     for(slab_x = meshmin_list[3 * recvTask]; slab_x < meshmax_list[3 * recvTask] + 2; slab_x++)
-       if(slab_to_task[(slab_x + PMGRID) % PMGRID] == sendTask) {
-         if(slab_x < sendmin) sendmin = slab_x;
-         if(slab_x > sendmax) sendmax = slab_x;
-       }
-     for(slab_x = recvmin; slab_x < recvmax; slab_x++) {
-       if(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask) {
-         /* non-contiguous */
-         cont_recvmax[0] = slab_x - 1;
-         while(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask)
-           slab_x++;
-         cont_recvmin[1] = slab_x;
-         if(ncont == 1)
-           ncont++;
-       }
-     }
-     for(rep = 0; rep < ncont; rep++) {
-       sendmin = cont_sendmin[rep];
-       sendmax = cont_sendmax[rep];
-       ...
-     }
-   }
- }
65. Recurrences
- Essentially similar code recurs in:
- pmforce_periodic()
- pmpotential_periodic()
- pmforce_nonperiodic()
- pmpotential_nonperiodic()
66. A put Schedule
- Modelled on the earlier ideas about RMISchedule, we could implement a schedule (SectionPutSumSchedule, say) that has an equivalent MPI implementation, but hides all the explicit messaging.
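The intended semantics of the proposed schedule can be sketched in a single-process stand-in (hypothetical class, 1-d grid; the real schedule would aggregate the sendrecv exchanges of slide 63): each contributor puts a local workspace plus its offset into the global grid, and complete() sums all contributions, including overlaps, into the target.

```python
class SectionPutSumSchedule:
    def __init__(self, target):
        self.target = target             # flat global grid (e.g. rhogrid)
        self.pending = []

    def put(self, offset, workspace):
        # record a section contribution; no communication yet
        self.pending.append((offset, workspace))

    def complete(self):
        # stand-in for the aggregated message exchange: sum every
        # pending section into the (periodic) global grid
        for offset, ws in self.pending:
            for k, v in enumerate(ws):
                self.target[(offset + k) % len(self.target)] += v
        self.pending.clear()

rhogrid = [0.0] * 8
sched = SectionPutSumSchedule(rhogrid)
sched.put(0, [1.0, 1.0, 1.0])
sched.put(2, [2.0, 2.0])                 # overlaps the first section at index 2
sched.complete()
```

Summing (rather than overwriting) on overlap is the essential contract: neighbouring domains' workspaces cover shared slabs of the PM grid, and both contributions must accumulate.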
67. Rewriting pmpotential_periodic()
- SectionPutSumProxy [[-]] proxies =
-   new SectionPutSumProxy [[procs]];
- SectionPutSumSchedule schedule =
-   new SectionPutSumSchedule(proxies, rhogrid);
- overall(me = procs for :) {
-   workspace = new fftw_real [dimx][dimy][dimz];
-   ... Project local particle masses into workspace ...
-   proxies[me].put(new int [] {meshmin_x, meshmin_y, meshmin_z},
-                   workspace);
- }
- schedule.complete();
- etc.
68. Finishing Touches
- Likewise we could introduce a SectionGetSchedule for the second part of pmpotential_periodic().
- The proxy would have an application-specific response handler callback, similar to RMISchedule.
- We estimate this API would reduce the implementation of this single function in Gadget from 400 lines to nearer 200.
- The underlying implementation is still essentially the same in the MPI case. But it allows other implementations.
69. Conclusions
70. HPspmd
- HPspmd is a model for parallel programming that introduces a few concepts into a base language, for representing distributed data.
- Goal: minimal language extensions to support libraries that work with distributed data, including communication libraries.
- Originally inspired by fairly regular applications, and pure distributed memory.
71. HPspmd Analysis of Gadget
- Highly irregular application.
- Focus was on co-arrays, the most primitive kind of distributed array in HPspmd.
- Nevertheless, we were able to propose new high-level communication APIs, very much in the HPspmd spirit, that should make Gadget-like codes much more maintainable.
72. Multicore
- We argue that the high-level abstraction of communication in HPspmd (and to some extent the data model itself) lends naturally to multiple implementations, including hybrid parallelism across clusters of multicore processors.
- The new communication APIs proposed here may be especially interesting in this respect.
- By construction they can be efficiently implemented in MPI.
- But their nature suggests they will also be directly implementable on shared memory.