An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations


1
An HPspmd Analysis of the Gadget 2 Code for
Cosmological Simulations
  • Bryan Carpenter
  • University of Southampton

2
Background
  • HPspmd model was introduced around 1997 as a
    reaction to (the failure of) High Performance
    Fortran (HPF).
  • The idea was to keep the data model of HPF, which
    we thought was good, but abandon the execution
    model, which was both inflexible and hard to
    implement.
  • See papers at www.hpjava.org

3
The HPspmd Model
  • The idea of HPspmd is to extend a base language
    with a specific set of syntax extensions for
    representing distributed data (distributed
    arrays).
  • The extended language does not contain any
    communication or synchronization primitives.
  • The programming model is SPMD, and all
    communication between nodes goes through
    libraries.
  • HPspmd itself only provides a framework for
    writing libraries that act on distributed data,
    including communication libraries.

4
HPJava, and other Base Languages
  • It was always the intention that HPspmd could be
    bound to various base languages.
  • First full implementation was its binding to
    Java; HPJava was finally released in 2003.
  • Up to now HPJava is the only full implementation
    of HPspmd (other than early C++ prototypes).
  • But a full binding to, say, C or C++ is also
    possible (and desirable?)

5
Goals of This Work
  • Revisit HPspmd, with a particular eye to a
    possible role in programming multicore systems.
  • Push the HPspmd model beyond its original zone of
    application, to a highly irregular production code
    for astrophysics.
  • See what we learn.

6
HPspmd
7
HPF Distributed Array Example
  • INTEGER A(8, 8)
  • !HPF$ PROCESSORS P(3,2)
  • !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P

8
HPF Parallel Assignments
  • ! Parallel loop
  • FORALL (I = 2:8, J = 1:8)
  •   A(I, J) = (I + 1) * (J + 1)
  • ! Communication (implicit)
  • A(1, :) = A(8, :)
  • Communication can happen almost anywhere.

9
HPspmd Distributed Array Example
int N = 8;
Procs p = new Procs2(3, 2);
on(p) {
  Range x = new BlockRange(N, p.dim(0));
  Range y = new BlockRange(N, p.dim(1));
  int [[-,-]] a = new int [[x, y]];
}
10
HPspmd Parallel Assigments
  • // Parallel loop
  • overall(i = x for 1 : 7)
  •   overall(j = y for :)
  •     a[i, j] = (i` + 2) * (j` + 2);
  • // Communication (always explicit)
  • Adlib.remap(a[[0, :]], a[[7, :]]);
  • In element reference, subscript must be a
    distributed index (like i or j) in the range of
    the array.
  • Section subscripts: anything goes.
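The BLOCK distribution in the example above has a simple arithmetic meaning. Here is a minimal Python sketch (the helper names are mine, not part of any HPspmd library) of how a global index maps to a process coordinate and a local index under HPF-style BLOCK distribution, for the 8x8 array on a 3x2 process grid:

```python
def block_owner(i, n, p):
    """Owner and local offset of global index i when n elements are
    block-distributed over p processes (block size ceil(n/p))."""
    b = -(-n // p)          # ceiling division: the block size
    return i // b, i % b    # (process coordinate, local index)

# 8x8 array A(BLOCK, BLOCK) on a 3x2 process grid: rows split over 3
# processes (blocks of 3), columns over 2 processes (blocks of 4).
def owner_2d(i, j, n=8, pgrid=(3, 2)):
    pi, li = block_owner(i, n, pgrid[0])
    pj, lj = block_owner(j, n, pgrid[1])
    return (pi, pj), (li, lj)

print(owner_2d(0, 0))   # ((0, 0), (0, 0))
print(owner_2d(7, 7))   # ((2, 1), (1, 3))
```

A BlockRange in effect encapsulates exactly this index arithmetic, so user code never spells it out.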

11
HPF Collapsed Dimensions
  • INTEGER A(8, 8)
  • !HPF$ PROCESSORS P(3)
  • !HPF$ DISTRIBUTE A(BLOCK, *) ONTO P

12
HPspmd Sequential Dimensions
int N = 8;
Procs p = new Procs1(3);
on(p) {
  Range x = new BlockRange(N, p.dim(0));
  int [[-,*]] a = new int [[x, N]];
}
13
Working with Sequential Dimensions
  • overall(i = x for 1 : 7)
  •   for(int j = 0; j < N; j++)
  •     a[i, j] = (i` + 2) * (j + 2);
  • overall is a general control construct: it can
    include (almost) arbitrary blocks of code.
  • Constraints on nesting of overalls, and on calls
    to collective ops inside overall. See later.

14
HPspmd Co-arrays
int N = 8;
Procs p = new Procs1(3);
on(p) {
  Dimension procs = p.dim(0);
  int [[-,*]] a = new int [[procs, N]];
}
15
Working with Co-arrays
  • overall(me = procs for :)
  •   for(int i = 0; i < N; i++)
  •     a[me, i] = me` * N + i;
  • Just a special case of HPspmd distributed arrays,
    where the distributed range is a process dimension.
  • Distribution format reminiscent of Co-array
    Fortran, but no communication in the
    language; still use remap() etc.

16
HPspmd and Multicore
  • HPspmd model originally designed for distributed
    memory architectures.
  • There are some arguments why it might be a good
    model for shared memory and (especially) hybrid
    architectures.

17
It isn't MPI
  • In principle you can do explicit message-passing
    in HPspmd programs, but not usually necessary.
  • High level collective communications like remap()
    can be implemented efficiently by MPI, or by
    copying in shared memory, or a mixture.
  • See also other collective primitives proposed
    later in this talk.

18
It isn't OpenMP
  • It doesn't require shared memory for efficient
    implementation.
  • Even primitives built into PGAS languages like
    Co-array Fortran have a bias towards shared
    memory model.
  • HPspmd model is agnostic about communication,
    leaving it to libraries.
  • Existing and proposed libraries are above
    message-passing or shared memory level.

19
Cache Issues
  • In cache-coherent processors, common advice to
    reduce false sharing, fragmentation, etc is to
    use higher dimensional arrays to keep
    partitions contiguous in the address space.
  • This is just how HPspmd distributed arrays are
    implemented on shared memory processors.
  • See for example Parallel Computer Architecture,
    Culler and Pal Singh, 1999, §5.6.
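To make the contiguity point concrete, here is a small back-of-envelope calculation in Python (the element size, line size, and array sizes are illustrative assumptions, not taken from the slides): with a flat one-dimensional layout, adjacent partitions can share cache lines at their boundaries, while a higher-dimensional layout with padded rows keeps each partition in its own lines.

```python
# Flat layout: partition p owns elements [p*chunk, (p+1)*chunk) of one
# contiguous array; neighbouring partitions can share a cache line.
ELEM, LINE = 8, 64            # 8-byte elements, 64-byte cache lines
N, P = 100, 4
chunk = N // P

def shares_line_flat(p):
    """Does partition p's last element share a cache line with
    partition p+1's first element?"""
    last = ((p + 1) * chunk - 1) * ELEM // LINE
    first = ((p + 1) * chunk) * ELEM // LINE
    return last == first

print([shares_line_flat(p) for p in range(P - 1)])  # [True, True, True]

# Two-dimensional layout a[P][chunk_padded]: pad each row to a cache-line
# multiple so no two partitions ever touch the same line.
pad = -(-chunk * ELEM // LINE) * LINE // ELEM       # elements per padded row
print(pad * ELEM % LINE == 0)                        # True: rows line-aligned
```

This is exactly the layout a shared-memory implementation of HPspmd distributed arrays gets for free, since each locally held block is stored contiguously.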

20
Gadget
21
Gadget-2
  • A free production code for cosmological N-body
    and hydrodynamic computations.
  • Written by Volker Springel, of the Max Planck
    Institute for Astrophysics, Garching.
  • Written in C; already fully parallelized using
    MPI.
  • Versions used in various astrophysics research
    papers, including the Millennium Simulation.
  • http://www.mpa-garching.mpg.de/gadget

22
Millennium Simulation
  • Follows the evolution of 10^10 dark matter
    particles from early Universe to current day.
  • Performed on 512 nodes of IBM p690.
  • Used 1 TB of distributed memory.
  • 350,000 CPU hours; 28 days elapsed time.
  • Floating point performance around 0.2 TFLOPS.
  • Around 20 TB total data generated.
  • "Simulating the Joint Evolution of Quasars,
    Galaxies and their Large-Scale Distribution",
    Springel et al.; see the Gadget home page.

23
Dynamics in Gadget
  • Gadget is just simulating the movement of (a
    lot of) representative particles under the
    influence of Newton's law of gravity, plus
    hydrodynamic forces on gas.
  • Classical N-body problem.

24
Gadget Main Loop
  • Schematic view of the Gadget code:

    Initialize
    while (not done) {
      move_particles();               // update positions
      domain_Decomposition();
      compute_accelerations();
      advance_and_find_timesteps();   // update velocities
    }

  • Most of the interesting work happens in
    domain_Decomposition() and
    compute_accelerations().

25
Computing Forces
  • The function compute_accelerations() must compute
    forces experienced by all particles.
  • In particular must compute gravitational force.
  • Because gravity is a long-range force, every
    particle affects every other. Naively, total
    cost is O(N^2).
  • With N = 10^10, this is infeasible.
  • Need some kind of approximation.
  • Intuitive approach: when computing the force on
    particle i, group together particles in regions
    far from i, and treat a group as a single
    particle, located at its centre of gravity.

26
Barnes-Hut Tree
Recursively split space into an octree (a quadtree in
this 2-d example), until no node contains more
than 1 particle.
27
Barnes Hut Force Computation
  • BH computation of force on a particle i: traverse
    the tree starting from the root
  • if a node n is far from i, just add the contribution
    to the force on i from the centre of mass of n; no need
    to visit children of n
  • if node n is close to i, visit children of n and
    recurse.
  • On average, the number of nodes opened to compute the
    force on i is O(log N).
  • c.f. visiting O(N) particles in the naïve algorithm
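The traversal rule can be sketched in a few lines of Python. This is a toy 2-d quadtree with unit-mass particles and a simple size/distance opening criterion theta; it illustrates the idea only, and is not Gadget's implementation:

```python
import math, random

class Node:
    """Quadtree node: stores total mass and centre of mass of its points."""
    def __init__(self, pts, cx, cy, half):
        self.half = half                          # half-width of the cell
        self.mass = len(pts)                      # unit-mass particles
        self.mx = sum(p[0] for p in pts) / self.mass
        self.my = sum(p[1] for p in pts) / self.mass
        self.kids = []
        if self.mass > 1:                         # split until 1 particle/node
            h = half / 2
            for sx in (-1, 1):
                for sy in (-1, 1):
                    sub = [p for p in pts
                           if (p[0] >= cx) == (sx > 0) and (p[1] >= cy) == (sy > 0)]
                    if sub:
                        self.kids.append(Node(sub, cx + sx * h, cy + sy * h, h))

def accel(node, px, py, theta):
    """BH traversal: use a node's centre of mass if it is 'far'
    (opening angle below theta), otherwise open the node and recurse."""
    dx, dy = node.mx - px, node.my - py
    r = math.hypot(dx, dy)
    if not node.kids:                             # leaf: single particle
        if r == 0:
            return 0.0, 0.0                       # skip the particle itself
        return node.mass * dx / r**3, node.mass * dy / r**3
    if r > 0 and 2 * node.half / r < theta:       # far enough: use COM
        return node.mass * dx / r**3, node.mass * dy / r**3
    ax = ay = 0.0
    for k in node.kids:                           # close: open the node
        kx, ky = accel(k, px, py, theta)
        ax += kx
        ay += ky
    return ax, ay

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(200)]
root = Node(pts, 0.5, 0.5, 0.5)
px, py = pts[0]
bx, by = accel(root, px, py, theta=0.3)

# Compare against the naive O(N^2) direct sum for the same particle.
ax = sum((x - px) / math.hypot(x - px, y - py)**3 for x, y in pts[1:])
ay = sum((y - py) / math.hypot(x - px, y - py)**3 for x, y in pts[1:])
err = math.hypot(bx - ax, by - ay) / math.hypot(ax, ay)
print(err)   # small relative error: the far field is well approximated
```

As theta shrinks toward 0 the traversal degenerates to the exact O(N) sum per particle; a larger theta opens fewer nodes, giving the O(log N) per-particle cost quoted above at the price of some accuracy.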

28
BH in Gadget
  • In Gadget, the preferred method of dealing with the
    long-range part of the gravitational force is the
    TreePM algorithm (see later).
  • But the BH-tree is still the fundamental data
    structure: traversals of the BH tree are used to
    locate particles near to a given particle
  • Short-range part of gravity force in TreePM
  • Estimation of density in SPH
  • Computation of hydrodynamic forces

29
TreePM
  • Artificially split the gravitational force into
    two parts:

      F(r) = Fshort(r) + Flong(r)

  • Calculate Fshort using BH; calculate Flong by
    projecting the particle mass distribution onto a
    mesh, then working in Fourier space.

30
Smoothed Particle Hydrodynamics
  • Uses discrete particles to represent state of
    fluid.
  • Equations of motion dependent on mass density,
    estimated within a smoothing radius.
  • Force calculation has 2 distinct phases:
  • Calculate density, adaptively setting the smoothing
    radius to give enough neighbours.
  • Calculate forces based on the smoothing kernel.

31
Domain Decomposition
  • Can't just divide space in a fixed way, because
    some regions will have many more particles than
    others: poor load balancing.
  • Can't just divide particles in a fixed way,
    because particles move independently through
    space, and we want to maintain physically close
    particles on the same processor, as far as
    practical: communication problem.

32
Peano-Hilbert Curve
  • Picture borrowed from
  • http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

33
Decomposition based on P-H Curve
  • Sort particles by position on the Peano-Hilbert
    curve, then divide evenly into P domains.
  • Characteristics of this decomposition:
  • Good load balancing.
  • Domains simply connected and quite compact in
    real space, because particles that are close
    along the P-H curve are close in real space.
  • Domains have a relatively simple mapping to BH
    octree nodes.

34
Distribution of BH Tree in Gadget
  • Ibid.

35
Communication in Gadget
  • Can identify 4 recurring non-trivial patterns:
  • Distributed sort of particles, according to P-H
    key; implements domain decomposition.
  • Export of particles to other nodes, for
    calculation of remote contributions to force,
    density, etc., and retrieval of results.
  • Projection of particle density to a regular grid
    for calculation of Flong; distribution of results
    back to irregularly distributed particles.
  • Distributed Fast Fourier Transform.

36
HPspmd Analysis of Gadget
37
Approach
  • As a first step, we go through the whole code,
    and annotate it with the changes that would be
    needed to make it into HPspmd code.
  • Gadget is 16,000 lines of C code.
  • Gadget is more complex than any previous HPspmd
    code, so we expect that new communication
    functions may be needed.
  • Also open to the possibility of new language
    extensions.

38
Use of Co-Arrays
  • It is immediately clear that the translation will make
    extensive use of co-arrays.
  • General HPspmd distributed arrays are not
    suitable for structures like the Nodes array
    that holds the BH tree, because HPspmd arrays only
    allow very regular patterns of local access.
  • Similar considerations apply to other major
    arrays, like the particles array P.

39
The Particles Array
  • Consider P, which holds NumPart particles in each
    node, where in general NumPart is different for
    each node.
  • Could use a general HPspmd distributed array with
    an IrregRange:
  • Range x = new IrregRange(procs, NumPart);
  • struct particle_data [[-]] P =
  •   new struct particle_data [[x]];
  • Called GEN_BLOCK in HPF 2.0.

40
Possible IrregRange Array for P
  • Assumes NumParts = {6, 8, 5}.
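For an IrregRange layout like this, locating an element is just a prefix sum plus a binary search. A Python sketch (the helper is hypothetical, not an Adlib API), using the NumParts = {6, 8, 5} from the slide:

```python
import bisect, itertools

# Hypothetical IrregRange layout: block sizes differ per process
# (GEN_BLOCK in HPF 2.0 terms).
NumParts = [6, 8, 5]
starts = [0] + list(itertools.accumulate(NumParts))   # [0, 6, 14, 19]

def locate(i):
    """Global index i -> (owning process, local index)."""
    p = bisect.bisect_right(starts, i) - 1
    return p, i - starts[p]

print(locate(0))    # (0, 0)
print(locate(6))    # (1, 0)
print(locate(18))   # (2, 4)
```

The extra table lookup, compared with the pure arithmetic of BlockRange, is the price of the irregular block sizes.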

41
Using General Distributed Array
  • This would often be convenient, e.g. in
    move_particles():
  • overall(i = x for :)
  •   for(j = 0; j < 3; j++)
  •     P[i].Pos[j] += P[i].Vel[j] * dt_drift;
  • and even, in domain_Decomposition():
  • sort(P, domain_compare_key);
  • where sort() is a notional distributed sort.

42
Problem with General Array
  • But, in compute_potential(), for example:
  • overall(i = x for :) {
  •   pos_x = P[i].Pos[0];
  •   ... Traverse down from root of BH tree ...
  •   if(... node contains nothing but a single
        particle ...)
  •     dx = P[no].Pos[0] - pos_x;
  • }
  • Reference P[no] is illegal in HPspmd!!

43
Co-Array for P
  • Assumes NumParts = {6, 8, 5}.

44
Using a Co-Array
  • Now we need, in move_particles():
  • overall(me = procs for :)
  •   for(i = 0; i < NumParts[me]; i++)
  •     for(j = 0; j < 3; j++)
  •       P[me, i].Pos[j] += P[me, i].Vel[j] *
          dt_drift;
  • and, notionally, in domain_Decomposition():
  • sort(P, NumParts, domain_compare_key);

45
Different Views?
  • HPJava actually has a mechanism for viewing an
    array as a high-level distributed array in one
    part of the code, and a co-array in another.
  • This mechanism was designed for library-writers.
    Complex, and I was reluctant to introduce it into
    user code.
  • First pass translation of Gadget uses co-arrays
    for most of the major arrays.
  • Mechanism is called a splitting section.

46
Programming Style
  • Use of HPspmd programming features is largely
    limited to patterns like this:
  • Declaration of co-arrays, etc.
  • Global computation
  • overall(me = procs for :) {
  •   Local computation
  • }
  • Collective communication
  • Any of the above in sequential control flow

47
Programming Discipline
  • Syntactically, distributed array elements can
    only be accessed inside overall.
  • Global variables (variables declared outside
    overall that are not distributed array elements)
    should not be assigned inside overall.
  • In principle, message passing is allowed inside
    overall (local computation section), but
    collective communication (outside overall) is
    preferred.

48
Basic Communication
  • Gadget makes extensive use of MPI collectives.
  • Easily map to methods from the Adlib collective
    library; always more succinct.
  • Translation uses:

Adlib.remap(), Adlib.sum(), Adlib.maxval(),
Adlib.minval(), Adlib.and(), Adlib.broadcast(),
Adlib.sumDim(), Adlib.maxvalDim(), Adlib.minvalDim()
49
Non-trivial Communication Patterns
  • Will consider in detail two recurring patterns in
    the Gadget code:
  • Basic communication pattern in
    compute_potential().
  • Projection to particle mesh in TreePM.

50
Sketch of compute_potential()
while(... some local particles not processed ...) {
  for(... all local particles, while send
      buffer not exhausted ...) {
    ... Compute local contribution to
        potential, and flag exports ...
    ... Add exported particles to send buffer ...
  }
  ... Sort export buffer by destination ...
  while(... there are more peers in export list ...) {
    ... Send exports to all peers possible
        (without exhausting recv buffers);
        recv particles for local computation ...
    ... Compute local contribution for received
        particles ...
  }
  for(... all peer nodes communicated with
      above ...) {
    ... Send results back to exporters; recv
        our results ...
    ... Add the results to the P array ...
  }
}

51
Notes on compute_potential()
  • Highlighted code is essentially
    application-specific.
  • Remaining code is essentially boilerplate
    that recurs in:
  • compute_potential()
  • gravity_tree()
  • gravity_forcetest()
  • density()
  • hydro_force()

52
Boilerplate Code
while(ntotleft > 0) {
  overall(me = procs for :) {
    int nexport = 0;
    for(j = 0; j < NTask; j++)
      nsend_local[me, j] = 0;
    ndone[me] = 0;
    while(i < num[me] && nexport < bunchSize - NTask) {
      if(condition(a[me, i])) {
        ndone[me]++;
        for(j = 0; j < NTask; j++)
          Exportflag[me, j] = 0;
        ... Compute local contribution to
            potential, and flag exports ...
        for(j = 0; j < NTask; j++)
          if(Exportflag[me, j])
            setDataIn(dataIn[me, nexport++],
                      a[me, i], ...);
      }
    }
    ...
    MPlib.sendrecv(dataIn[[send_offset :
                     send_offset + send_count - 1]],
                   world.getProc(recvTask), TAG_A,
                   dataGet[[recv_offset :
                     recv_offset + recv_count - 1]],
                   world.getProc(recvTask), TAG_A,
                   MPI_COMM_WORLD, status);
    for(j = 0; j < NTask; j++)
      if((j ^ ngrp) < NTask)
        nbuffer[j] += nsend[j ^ ngrp, j];
    ... Compute local contributions for
        received particles ...
    /* get the result */
    for(j = 0; j < NTask; j++)
      ...
  }
}

53
Remarks
  • We expect this pure message-passing code is quite
    efficient.
  • But presumably hard to maintain, and certainly
    not in the HPspmd spirit, which eschews low-level
    messaging.
  • Too much code that is communication specific
    rather than application specific.
  • Can we find a better abstraction?

54
A Different Point of View
  • Yes, I think so.
  • Let's try considering this fragment as the main
    loop:
  • for(... all local particles, while send
      buffer not exhausted ...) {
  •   ... Compute local contribution to
        potential, and flag exports ...
  •   ... Add exported particles to send
        buffer ...
  • }
  • Now suppose that the highlighted elision, rather
    than just flagging exports, calls a library
    method, passing it the values to be exported.

55
The invoke() Method
  • This library method, which I'll suggestively call
    invoke(), can internally load values into the
    send buffer, and keep track of whether the
    buffer is full.
  • When the buffer is full, invoke() internally
    triggers all the remaining actions on the
    earlier slide, including all communication with
    peers, then returns control to the
    application-specific main loop, which resumes
    with a cleared buffer.

56
A Better Code Organization
  • So then we could implement exactly equivalent
    user code like this:
  • for(... all local particles ...) {
  •   ... Compute local contribution to
        potential, and pass exports to invoke() ...
  • }
  • But invoke() must somehow trigger the user code
    segments indicated earlier by
  • ... Compute local contribution for received
    particles ...
  • ... Add the results to the P array...

57
Callbacks
  • Clearly these must be callbacks of some kind.
  • Let's regard the first callback as remote
    implementation code and the second as a
    response handler.
  • The structure we now have is an asynchronous
    remote method invocation!
  • But behind the scenes it is implemented
    efficiently and collectively, aggregating atomic
    invocations.
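The buffering-and-flush behaviour can be mocked up in a few lines. This is a toy single-process sketch of the proposed invoke()/RMISchedule idea; the class name, methods, and behaviour are my assumptions about the proposal, not an existing API, and the "communication" is just a direct call to the handler:

```python
class RMISchedule:
    """Aggregate asynchronous invocations per destination; flush a
    destination's buffer when it fills, and drain all on complete()."""
    def __init__(self, handler, respond, capacity=4):
        self.handler, self.respond = handler, respond
        self.capacity = capacity
        self.buffers = {}                      # destination -> pending args

    def invoke(self, dest, arg):
        self.buffers.setdefault(dest, []).append(arg)
        if len(self.buffers[dest]) >= self.capacity:
            self._flush(dest)                  # buffer full: communicate now

    def _flush(self, dest):
        batch = self.buffers.pop(dest, [])
        if batch:
            # A real implementation would do one aggregated message
            # exchange here; we call the callbacks in-process instead.
            for arg, result in zip(batch, self.handler(batch)):
                self.respond(dest, arg, result)

    def complete(self):
        for dest in list(self.buffers):        # drain remaining buffers
            self._flush(dest)

flushes = []
sched = RMISchedule(handler=lambda batch: [x * x for x in batch],
                    respond=lambda dest, arg, res: flushes.append((dest, arg, res)),
                    capacity=2)
for i in range(5):
    sched.invoke(i % 2, i)      # exports to destinations 0 and 1
sched.complete()
print(flushes)
```

The point of the design is visible even in the toy: the caller writes one invoke() per export, while batching, buffer management, and the response callbacks all live in the schedule.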

58
Putting Things Together
  • ComputePotentialProxy [[-]] proxies =
  •   new ComputePotentialProxy [[procs]];
  • ComputePotentialHandler [[-]] handlers =
  •   new ComputePotentialHandler [[procs]];
  • ... Instantiate custom proxies and handlers ...
  • RMISchedule schedule = new RMISchedule(proxies,
    handlers);
  • overall(me = procs for :) {
  •   for(i = 0; i < NumPart[me]; i++) {
  •     ... Computing local contribution to
          potential ...
  •     proxies[me].invoke(remoteTask, i,
          GravDataIn);   // export
  •     ...
  •   }
  • }
  • schedule.complete();

59
Defining Callback Methods
  • class ComputePotentialHandler extends RMIHandler {
  •   ... Declare fields and constructor ...
  •   gravdata_in invoke(gravdata_in GravDataGet) {
  •     ... Compute local contribution for received
          particles ...
  •   }
  • }
  • class ComputePotentialProxy extends RMIProxy {
  •   ... Declare fields and constructor ...
  •   void handleResponse(int context, gravdata_in
        GravDataOut) {
  •     ... Add the results to the P array ...
  •   }
  • }

60
Projection to Particle Mesh
  • The second of the non-trivial communication
    patterns identified earlier comes from the TreePM
    algorithm:
  • Projection of particle density to a regular grid
    for calculation of Flong; distribution of results
    back to irregularly distributed particles.

61
Sketch of pmpotential_periodic()
  • overall(me = procs for :) {
  •   workspace = new fftw_real [dimx, dimy, dimz];
  •   ... Project local particle masses into
        workspace ...
  • }
  • ... Accumulate workspace into rhogrid ...
  • ... Do the FFT of the density field ...
  • ... Multiply with Green's function for the
      potential ...
  • ... Reverse the FFT to get the potential ...
  • overall(me = procs for :) {
  •   workspace = new fftw_real [dimx, dimy, dimz];
  •   ... Get workspace from rhogrid ...
  •   ... Read out local potential from
        workspace ...
  • }

62
Notes on pmpotential_periodic()
  • Computes Flong contribution to potential.
  • workspace is large enough to cover all particles
    from P in local domain, and maps to a subsection
    of the full PM distributed array, rhogrid.
  • Highlighted code is essentially
    application-independent.

[Figure: the co-array P, with the domain on node 3
covered by workspace, mapped onto the regularly
distributed array rhogrid (nodes 0-4).]
63
Accumulate workspace into rhogrid...
for(level = 0; level < (1 << PTask); level++) {
  /* note: for level = 0, target is the same task */
  sendTask = me;
  recvTask = me ^ level;
  if(recvTask < NTask) {
    /* check how much we have to send */
    sendmin = 2 * PMGRID;
    sendmax = -1;
    for(slab_x = meshmin[me, 0]; slab_x <
        meshmax[me, 0] + 2; slab_x++)
      if(slab_to_task[slab_x % PMGRID] == recvTask) {
        if(slab_x < sendmin)
          sendmin = slab_x;
        if(slab_x > sendmax)
          sendmax = slab_x;
      }
    if(sendmax == -1)
      sendmin = 0;
    ...
    if(level > 0)
      MPlib.sendrecv(workspace[[sendmin - meshmin[0] :
                       sendmax - meshmin[0], :, :]],
                     world.getProc(recvTask),
                     TAG_PERIODIC_C,
                     forcegrid,
                     world.getProc(recvTask),
                     TAG_PERIODIC_C,
                     MPI_COMM_WORLD, status);
    else
      HPspmd.copy(forcegrid,
                  workspace[[sendmin - meshmin[0] :
                    sendmax - meshmin[0], :, :]]);
    for(slab_x = recvmin; slab_x < recvmax; slab_x++) {
      slab_xx = (slab_x % PMGRID) -
                first_slab_of_task[me];
      ...
    }
  }
}

64
Get workspace from rhogrid
for(level = 0; level < (1 << PTask); level++) {
  /* note: for level = 0, target is the same task */
  sendTask = me;
  recvTask = me ^ level;
  if(recvTask < NTask) {
    /* check how much we have to send */
    sendmin = 2 * PMGRID;
    sendmax = -PMGRID;
    for(slab_x = meshmin_list[3 * recvTask]; slab_x <
        meshmax_list[3 * recvTask] + 2; slab_x++)
      if(slab_to_task[(slab_x + PMGRID) % PMGRID] ==
         sendTask) {
        if(slab_x < sendmin)
          sendmin = slab_x;
        if(slab_x > sendmax)
          sendmax = slab_x;
      }
    ...
    for(slab_x = recvmin; slab_x < recvmax; slab_x++)
      if(slab_to_task[(slab_x + PMGRID) % PMGRID] !=
         recvTask) {
        /* non-contiguous */
        cont_recvmax[0] = slab_x - 1;
        while(slab_to_task[(slab_x + PMGRID) % PMGRID]
              != recvTask)
          slab_x++;
        cont_recvmin[1] = slab_x;
        if(ncont == 1)
          ncont++;
      }
    for(rep = 0; rep < ncont; rep++) {
      sendmin = cont_sendmin[rep];
      sendmax = cont_sendmax[rep];
      ...
    }
  }
}

65
Recurrences
  • Essentially similar code recurs in:
  • pmforce_periodic()
  • pmpotential_periodic()
  • pmforce_nonperiodic()
  • pmpotential_nonperiodic()

66
A put Schedule
  • Modelled on the earlier ideas about RMISchedule, we
    could implement a schedule (SectionPutSumSchedule,
    say) that has an equivalent MPI implementation, but
    hides all the explicit messaging.
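The core semantics of such a put-sum schedule are easy to state in miniature. A toy Python sketch (the name and the semantics here are assumptions about the proposal, not an existing API): each node "puts" a local workspace at an offset into a global grid, and overlapping contributions are summed, as when projecting particle masses from per-node workspaces into rhogrid:

```python
def put_sum(grid, offset, workspace):
    """Accumulate a 1-D workspace into grid starting at offset,
    with periodic wrap-around (as on a periodic PM mesh)."""
    n = len(grid)
    for k, v in enumerate(workspace):
        grid[(offset + k) % n] += v
    return grid

grid = [0.0] * 8
put_sum(grid, 2, [1.0, 1.0, 1.0])          # one node's contribution
put_sum(grid, 6, [2.0, 2.0, 2.0, 2.0])     # another node's, wrapping around
print(grid)   # [2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0]
```

In the real distributed setting the schedule would work out which target slabs live on which process and aggregate the transfers, exactly the bookkeeping the explicit code on the previous slides does by hand.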

67
Rewriting pmpotential_periodic()
  • SectionPutSumProxy [[-]] proxies =
  •   new SectionPutSumProxy [[procs]];
  • SectionPutSumSchedule schedule =
  •   new SectionPutSumSchedule(proxies,
      rhogrid);
  • overall(me = procs for :) {
  •   workspace = new fftw_real [dimx, dimy,
      dimz];
  •   ... Project local particle masses into
        workspace ...
  •   proxies[me].put(new int [] {meshminx,
      meshminy, meshminz},
      workspace);
  • }
  • schedule.complete();
  • etc.

68
Finishing Touches
  • Likewise we could introduce a SectionGetSchedule
    for the second part of pmpotential_periodic().
  • Proxy would have an application-specific response
    handler callback, similar to RMISchedule.
  • Estimate this API would reduce the implementation
    of this single function in Gadget from 400 lines
    to nearer 200.
  • Underlying implementation still essentially the
    same in MPI case. But allows other
    implementations.

69
Conclusions
70
HPspmd
  • HPspmd is a model for parallel programming that
    introduces a few concepts into a base language,
    for representing distributed data.
  • Goal: minimal language extensions to support
    libraries that work with distributed
    data, including communication libraries.
  • Originally inspired by fairly regular
    applications, and pure distributed memory.

71
HPspmd Analysis of Gadget
  • Highly irregular application.
  • Focus was on co-arrays, the most primitive kind of
    distributed array in HPspmd.
  • Nevertheless, we were able to propose new high-level
    communication APIs, very much in the HPspmd
    spirit, that should make Gadget-like codes much
    more maintainable.

72
Multicore
  • We argue that the high-level abstraction of
    communication in HPspmd (and to some extent the
    data model itself) lends naturally to multiple
    implementations, including hybrid parallelism,
    across clusters of multicore processors.
  • New communications APIs proposed here may be
    especially interesting in this respect.
  • By construction they can be efficiently
    implemented in MPI.
  • But their nature suggests they will also be
    directly implementable on shared memory.