Hernquist SAC: PARTREE (Stuart Johnson)
Transcript of presentation slides

1
Hernquist SAC: PARTREE (Stuart Johnson)
  • Single-processor optimization of a particle tree code
  • Punch line:
  • tuned subroutines: 2.75x speedup
  • whole code: 2x speedup

2
Outline
  • Scientific Context and Goals
  • Initial Computational Characteristics
  • Optimization efforts
  • Performance improvements
  • Conclusions

3
Scientific Context and Goals
  • Simulate the gravitational evolution of galaxies
  • early disturbed development
  • origin of elliptical galaxies
  • background light and tidal debris
  • evolution of a "cosmological" model
  • initial mass density
  • initial spectrum of heterogeneities

4
PARTREE
  • Approximately solves the N-body problem
  • Nearby particles: particle-particle interactions
  • "Distant" particles: particle-multipole interactions
  • Converts the O(N^2) algorithm to O(N log N)
  • More general than SCF (which relies on symmetry)

5
What the code does
  • At each timestep:
  • 1) ORB domain decomposition (Parallel)
  • 2) Construct local BH trees
  • 3) Construct locally essential trees (Parallel)
  • 4) Walk through trees to calculate forces (tuned)
  • 5) Move particles
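  • In code, the timestep flow looks roughly like the stub sketch below.
    This is an illustration, not the original PARTREE source; every
    routine name except GroupForceWalk (which appears in the profiles
    later) is hypothetical.

      #include <stdio.h>

      /* Stub sketch of the per-timestep driver; all names except
         GroupForceWalk are hypothetical. */
      static void OrbDecomposition(void)    { puts("1) ORB decomposition"); }
      static void BuildLocalBHTrees(void)   { puts("2) local BH trees"); }
      static void BuildEssentialTrees(void) { puts("3) essential trees"); }
      static void GroupForceWalk(void)      { puts("4) force walk"); }
      static void MoveParticles(void)       { puts("5) move particles"); }

      int main(void)
      {
          for (int step = 0; step < 3; step++) {   /* a few timesteps */
              OrbDecomposition();
              BuildLocalBHTrees();
              BuildEssentialTrees();
              GroupForceWalk();
              MoveParticles();
          }
          return 0;
      }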

6
Tree data structures
  • ORB: Orthogonal Recursive Bisection
  • successive halving for parallel domain decomposition on 2^n
    processors
  • BH: Barnes-Hut
  • nested cubing for on-processor decomposition and localization of
    particles (see the sketch below)
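  • A minimal sketch of the nested-cubing helpers, reusing the
    whichChild and subCenter names that appear in the profiles later;
    the bodies are illustrative assumptions, not the original source.

      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Octant (0-7) of the cell centered at c containing particle p:
         one bit per axis. Illustrative body; only the name is from the
         profile. */
      static int whichChild(coordStruct p, coordStruct c)
      {
          return (p.x > c.x) | ((p.y > c.y) << 1) | ((p.z > c.z) << 2);
      }

      /* Center of child octant `which` of a cell centered at c with
         side length s (the child cube has side s/2). */
      static coordStruct subCenter(coordStruct c, double s, int which)
      {
          coordStruct sub = c;
          sub.x += (which & 1) ? s/4 : -s/4;
          sub.y += (which & 2) ? s/4 : -s/4;
          sub.z += (which & 4) ? s/4 : -s/4;
          return sub;
      }

      int main(void)
      {
          coordStruct c = {0, 0, 0}, p = {0.3, -0.2, 0.1};
          int w = whichChild(p, c);
          coordStruct sc = subCenter(c, 2.0, w);
          printf("child %d, center (%g, %g, %g)\n", w, sc.x, sc.y, sc.z);
          return 0;
      }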

7
Opening criterion
  • Open a cell if it is too close
  • Approximate a cell using a multipole expansion if
    it is distant
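  • In code, a standard BH-style opening test compares cell size to
    distance. A minimal sketch, assuming an accuracy parameter theta;
    the slide does not give the exact criterion PARTREE uses.

      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Open the cell when size/distance exceeds theta, i.e. when it is
         "too close" for its multipole expansion to be accurate.
         Comparing squares avoids both the sqrt and the divide. */
      static int mustOpen(double cellSize, coordStruct cellCenter,
                          coordStruct p, double theta)
      {
          double dx = cellCenter.x - p.x;
          double dy = cellCenter.y - p.y;
          double dz = cellCenter.z - p.z;
          double d2 = dx*dx + dy*dy + dz*dz;
          return cellSize * cellSize > theta * theta * d2;  /* s/d > theta */
      }

      int main(void)
      {
          coordStruct c = {10, 0, 0}, p = {0, 0, 0};
          printf("open? %d\n", mustOpen(1.0, c, p, 0.5));
          return 0;
      }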

8
Force Calculation
  • Particle-particle interaction
  • Note the square root for calculating the
    particle-to-particle unit vector
  • Note the divide(s)
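  • The interaction formula itself is not preserved in the transcript.
    A minimal sketch of a Plummer-softened particle-particle term, using
    the variable names that appear on later slides (epssq, phii, mor3),
    with the sqrt and divides marked:

      #include <math.h>
      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Acceleration and potential contribution of particle j (mass mj)
         on particle i, with Plummer softening epssq. A pforce-style
         sketch; the exact original expression is an assumption. */
      static void ppInteract(coordStruct ri, coordStruct rj, double mj,
                             double epssq, coordStruct *acc, double *phisum)
      {
          double xj = rj.x - ri.x, yj = rj.y - ri.y, zj = rj.z - ri.z;
          double sr2  = xj*xj + yj*yj + zj*zj + epssq;
          double sr   = sqrt(sr2);     /* the square root */
          double phii = mj / sr;       /* divide          */
          double mor3 = phii / sr2;    /* another divide  */
          *phisum -= phii;
          acc->x  += mor3 * xj;
          acc->y  += mor3 * yj;
          acc->z  += mor3 * zj;
      }

      int main(void)
      {
          coordStruct acc = {0, 0, 0};
          double phi = 0.0;
          ppInteract((coordStruct){0,0,0}, (coordStruct){1,0,0}, 1.0,
                     1e-4, &acc, &phi);
          printf("ax=%g phi=%g\n", acc.x, phi);
          return 0;
      }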

9
Particle grouping
  • Forces are calculated for groups of nearby particles using the same
    tree decomposition, since nearby particles see almost the same
    gravity field
  • implemented by using the distance (d) to the nearest edge of the
    particle group in the opening criterion (the calculation is more
    accurate than for a single particle)
  • has very significant implications for data reuse (it generates
    in-cache force calculation loops)
  • has implications for independent operations
  • particle group size set to 32 by experimentation
  • grouping is a performance tradeoff (a sketch of the group-edge
    distance follows)
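  • A minimal sketch of the group-edge distance used in the opening
    criterion, assuming the group is summarized by an axis-aligned
    bounding box; the transcript does not show the original
    implementation.

      #include <math.h>
      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Distance d from a cell center c to the nearest edge of the
         particle group's bounding box [gmin, gmax]; zero if c is
         inside. Using this d in the opening test makes one interaction
         list valid for every particle in the group. */
      static double distToGroup(coordStruct c, coordStruct gmin,
                                coordStruct gmax)
      {
          double dx = fmax(fmax(gmin.x - c.x, c.x - gmax.x), 0.0);
          double dy = fmax(fmax(gmin.y - c.y, c.y - gmax.y), 0.0);
          double dz = fmax(fmax(gmin.z - c.z, c.z - gmax.z), 0.0);
          return sqrt(dx*dx + dy*dy + dz*dz);
      }

      int main(void)
      {
          coordStruct gmin = {-1, -1, -1}, gmax = {1, 1, 1}, c = {3, 0, 0};
          printf("d = %g\n", distToGroup(c, gmin, gmax));  /* d = 2 */
          return 0;
      }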

10
Initial Computational Characteristics
  • Test problem: a 2-million-particle disk/halo simulation
  • Not huge: 640 MB (4 or more nodes on the SP)
  • 1000 interactions per particle
  • Profile of computation time
  • Scalability

11
Profile of computation time (4-node run)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    71.6    5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
    11.3    6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
     7.3    6717.45     540.64    3540223     0.15     0.15  .pforce [7]
     2.5    6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree [9]
     1.9    7040.62     137.89                               .__mcount [10]
     0.7    7090.43      49.81                               .readsocket [12]
     0.6    7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
     0.5    7172.11      40.44                               .kickpipes [16]
     0.5    7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
     0.5    7247.02      35.02  128894017     0.00     0.00  .subCenter [18]

12
Scalability of test problem
13
Optimization Efforts
  • cforce/pforce tuning for the SP and T3E
  • general comments:
  • program for the architecture
  • program for the cache
  • program for the pipelines
  • avoid slow things (like divides)

14
Optimization Efforts
  • general tuning comments
  • techniques:
  • predict performance and compare it to absolute limits
  • understand the limiting factors on performance
  • understand the effects of code modifications, and modify towards the
    predicted performance
  • look at the assembly code
  • use compiler flags

15
cforce/pforce tuning for SP/T3E
  • elimination of divides
  • "vectorization" of the inverse square root
  • the tunable loops and their properties:
  • cache behavior
  • computational intensity / performance predictions
  • elimination of statically declared temporaries
  • elimination of single-precision calculations
  • absolute performance of the modified code
  • tuning pipelining and splitting loops (T3E)
  • increase independent operations, prevent register spill

16
Elimination of divides
  • Original code (1 86 253 162 CP T3E):

      sr   = sqrt(sr2);
      phii = (c->mass) / sr;
      mor3 = phii / sr2;
      temp = 5.0 * phiquad / sr2;

  • Best, since 1/sqrt is as fast as sqrt (486 90 CP T3E):

      sr   = 1.0 / sqrt(sr2);
      phii = (c->mass) * sr;
      rsr2 = sr * sr;
      mor3 = phii * rsr2;
      temp = 5.0 * phiquad * rsr2;

17
"Vectorization" of inverse square root(T3E)
  • Intrinsic (libm) function timings (from the T3E Benchmarker's
    Guide):

      Routine   CP Scalar   CP Vector
      -------   ---------   ---------
      SQRT          86          25
      1/SQRT        86          25

  • The T3E can automatically stripmine and call the vector routines
  • BUT the C compiler is broken! (the assembly reveals this)

18
"Vectorization" of inverse square root(SP)
      Routine               CP SQRT   CP 1/SQRT
      -------------------   -------   ---------
      FSQRT/FD (HW)           13.8       22.4
      libm (scalar)           43.7       58.0
      libmass (scalar)        28.4       28.4
      libmassvp2 (vector)      7.1        7.1

  • loop timings are for 10,000,000 ops in vector lengths of 5000,
    reused to provide in-cache timings
  • the SP (without a preprocessor) requires an explicit vector call to
    get the libmassvp2 form (vrsqrt); see the sketch below
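  • A minimal sketch of that explicit vector call, assuming the
    libmassv-style interface vrsqrt(y, x, &n), which computes
    y[i] = 1/sqrt(x[i]) with a Fortran-style length argument. The
    declaration below is an assumption; link against libmassvp2 on the
    SP.

      #include <stdio.h>

      #define NG 32   /* particle group size, from the grouping slide */

      /* Assumed libmassvp2 declaration: y[i] = 1/sqrt(x[i]), i < *n */
      void vrsqrt(double *y, double *x, int *n);

      int main(void)
      {
          double tmps[NG], rs[NG];
          int n = NG;
          for (int j = 0; j < n; j++)
              tmps[j] = 1.0 + j;   /* e.g. r^2 + epssq from loop 1 */

          vrsqrt(rs, tmps, &n);    /* one vector call replaces NG
                                      scalar 1/sqrt computations   */

          printf("rs[0] = %g\n", rs[0]);   /* prints rs[0] = 1 */
          return 0;
      }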

19
Tunable loop(s)
  • Loop 1 (x, y, z, x2, y2, z2, and tmps are per-group working arrays):

      for (i = 0; i < ng; i++)
      for (j = 0, c = clist; c < clist + ccount; j++, c++) {
          x[j] = (c->r).x - pos0.x;
          y[j] = (c->r).y - pos0.y;
          z[j] = (c->r).z - pos0.z;
          x2[j] = x[j] * x[j];  y2[j] = y[j] * y[j];
          z2[j] = z[j] * z[j];
          tmps[j] = x2[j] + y2[j] + z2[j] + (c->epssq);
      }

20
Tunable loop(s)
  • Loop 2 (run after tmps[j] has been replaced by its inverse square
    root, e.g. via vrsqrt):

      for (j = 0, c = clist; c < clist + ccount; j++, c++) {
          sr2  = tmps[j] * tmps[j];
          phii = (c->mass) * tmps[j];
          mor3 = phii * sr2;
          phisum -= phii;
          acc0.x += mor3 * x[j];
          acc0.y += mor3 * y[j];
          acc0.z += mor3 * z[j];
          or5 = sr2 * sr2 * tmps[j];
          phiquad = (0.5 * (qxx0*x2[j] + qyy0*y2[j] + qzz0*z2[j]
                            - (c->epssq) * momnode)
                     + x[j] * (qxy0*y[j] + qxz0*z[j])
                     + qyz0 * y[j]*z[j]) * or5;
          phisum -= phiquad;
          temp = 5.0 * phiquad * sr2;
          acc0.x += (temp*x[j] - (qxx0*x[j] + qxy0*y[j] + qxz0*z[j]) * or5);
          acc0.y += (temp*y[j] - (qxy0*x[j] + qyy0*y[j] + qyz0*z[j]) * or5);
          acc0.z += (temp*z[j] - (qxz0*x[j] + qyz0*y[j] + qzz0*z[j]) * or5);
      }

21
Loop properties
  • 1) Cache behavior
  • Particle grouping -> data reuse -> almost all loop data can be
    considered in-cache if the size of the working arrays is adjusted
    to fit in cache

22
Loop properties
  • 2) Computational intensity / performance predictions

                            loop 1    loop 2
      Floating Point
        +-*/ ops               9        59
        FMA ops                0        18
        cycles                9/2      ((59 - 2*18) + 18)/2 = 41/2
      Load/Store
        L/S ops                9        14
        cycles (no quads)     9/2      14/2

  • (each FMA covers two of loop 2's 59 ops, leaving 23 + 18 = 41 FP
    instructions across 2 FPUs)
  • CPU-bound loops, so predicted cycles = 9/2 + 41/2 = 25 per iteration

23
Elimination of statically declared variables
  • Original code:

      static float x, y, z;
      static float x2, y2, z2, epssq;
      static float dr2, sr, sr2, phii, mor3, phisum;
      static float or5, temp, phiquad;
      static coordStruct acc0, pos0;

  • All of these should be in registers, but are forced to store back!
  • Change them to local variables, plus some statically declared
    in-cache workspace... (see the sketch below)
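  • A minimal sketch of why the change matters, with hypothetical names
    (this is not the original cforce): with the compilers of the day, an
    automatic local could live in a register for the whole loop, while a
    static forced a store back to memory on every update.

      #include <stdio.h>

      /* Before: a static accumulator is kept consistent in memory, so
         the compiler stores it back on every update. */
      static float phisum_static;

      float sumStatic(const float *phii, int n)
      {
          phisum_static = 0.0f;
          for (int i = 0; i < n; i++)
              phisum_static -= phii[i];   /* store back each iteration */
          return phisum_static;
      }

      /* After: an automatic local is register-allocatable throughout. */
      float sumLocal(const float *phii, int n)
      {
          float phisum = 0.0f;            /* can live in a register */
          for (int i = 0; i < n; i++)
              phisum -= phii[i];
          return phisum;
      }

      int main(void)
      {
          float phii[4] = {1, 2, 3, 4};
          printf("%g %g\n", sumStatic(phii, 4), sumLocal(phii, 4));
          return 0;
      }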

24
Elimination of single precision
  • Particles are stored in single or double precision
  • Calculations are performed in double precision
  • Example of the macro modification:

      #define COPYPARTICLE()                             \
      {                                                  \
          plist[pcount].type  = c->type;                 \
          plist[pcount].mass  = (double)(c->mass);       \
          plist[pcount].r.x   = (double)(c->r.x);        \
          plist[pcount].r.y   = (double)(c->r.y);        \
          plist[pcount].r.z   = (double)(c->r.z);        \
          plist[pcount].epssq = (double)(c->epssq);      \

25
Performance Improvements
  • in-cache test of the original cforce simulator loop: 7.59 seconds
  • in-cache test of the optimized cforce simulator loop: 2.07 seconds
  • 3.67 times faster! (10M iterations of the loop)
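  • A minimal sketch of this kind of in-cache simulator-loop timing,
    assuming the methodology of the earlier libm timing slide (vector
    length 5000, reused; 5000 x 2000 reps = 10M iterations). The kernel
    body is a stand-in, not the actual cforce loop.

      #include <stdio.h>
      #include <time.h>

      #define VLEN 5000   /* small enough to stay in cache     */
      #define REPS 2000   /* 5000 * 2000 = 10M loop iterations */

      int main(void)
      {
          static double x[VLEN], y[VLEN];
          for (int i = 0; i < VLEN; i++)
              x[i] = 1.0 + i;

          clock_t t0 = clock();
          for (int r = 0; r < REPS; r++)
              for (int i = 0; i < VLEN; i++)
                  y[i] = 1.0 / x[i];   /* stand-in for the tuned kernel */
          clock_t t1 = clock();

          printf("%.2f seconds\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
          return y[0] == 1.0 ? 0 : 1;
      }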

26
Performance of optimized tuned loops (no vrsqrt)
      CPU seconds               1.5913    CP executing       254610220
      Elapsed seconds           1.6457
      FPU0 results/sec        159.58M     F.P. in Math0      253934523
      FPU1 results/sec        138.09M     F.P. in Math1      219752424
      F.P. add ops/sec         25.44M     F.P. add            40480950
      F.P. mul ops/sec        113.92M     F.P. mul           181278375
      F.P. div ops/sec          0.00M     F.P. div                1776
      F.P. ma ops/sec         158.16M     F.P. ma            251677116
      MFLOPS ratio            455.67M     F.P. math ops      725115333
      Fixed instr/sec E0       64.67M     Fixed instr E0     102907722
      Fixed instr/sec E1       43.67M     Fixed instr E1      69495885
      ICU instr/sec             0.00M     ICU instr.                 0
      Integer MIPS            108.34      Total instr.       172403607
      I Cache misses/sec        2.73k
      D Cache reloads/sec      43.75k
      D Cache storebacks/sec   22.45k
      D Cache misses/sec       33.81k
      Total TLB misses/sec      0.00k
      cycles/FLOP              0.3511

27
Performance of optimized tuned loops (no vrsqrt)
  • The optimized loops need 25.5 cycles per iteration, a bit slower
    than my prediction.
  • Based on the displayed op counts, this should be
    (251 + 181 + 40)/2 = 23.6 cycles (FMA, mul, and add instructions in
    millions, over 2 FPUs and 10M iterations)
  • Missing 1.9 cycles per loop; running at only 93% of peak
  • The tuned loops are running at 456 Mflops

28
Performance improvements
  • Rs2hpm results from the 2M-particle run (optimized code)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    42.4    1598.79    1598.79    3009774     0.53     0.53  .cforce [6]
    25.0    2539.95     941.16    1264378     0.74     2.08  .GroupForceWalk
    11.7    2981.47     441.52                               vrsqrt [7]
     4.9    3165.09     183.62   12000000     0.02     0.02  .AddParticleToTree
     3.6    3301.92     136.83                               .__mcount [10]
     2.2    3385.88      83.96    1264407     0.07     0.07  .pforce [11]
     1.1    3427.49      41.61                               .readsocket [13]
     1.1    3468.57      41.08  139771115     0.00     0.00  .whichChild [14]
     1.1    3509.58      41.01        638    64.28    64.28  .WeightFrac [17]
     0.9    3543.70      34.12                               .kickpipes [18]

29
Performance Improvements
  • Rs2hpm results from the 2M-particle run (original code)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    71.6    5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
    11.3    6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
     7.3    6717.45     540.64    3540223     0.15     0.15  .pforce [7]
     2.5    6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree
     1.9    7040.62     137.89                               .__mcount [10]
     0.7    7090.43      49.81                               .readsocket [12]
     0.6    7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
     0.5    7172.11      40.44                               .kickpipes [16]
     0.5    7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
     0.5    7247.02      35.02  128894017     0.00     0.00  .subCenter [18]

  • cforce and pforce are 2.76x faster...

30
Performance improvements
  • Full-run performance (2M particles); P/Proc/S is particles processed
    per processor per second

      Processors   Opt. code (P/Proc/S)   Orig. code (P/Proc/S)   Speedup
      ----------   --------------------   ---------------------   -------
           4              3289                    1650              1.99
           8              3788                    1880              2.01
          16              3289                    1645              2.00
          32              2717                    1358              2.00
          64              2300                    1200              1.92

31
Conclusions 1
  • With 2x the speed, we can do 2x the particles in the same time as
    before
  • The code may scale well enough to 128 nodes for 100,000,000-particle
    simulations (50,000 processor-seconds/timestep, 14,000 SUs for 1000
    timesteps), but scaling problems may be inherent to the algorithm
    (ORB aspect ratios?)
  • The T3E code was also optimized similarly, with more attention to
    loop pipelining problems

32
Conclusions 2
  • Profiling and performance monitoring tools are essential for
    optimization work
  • Easy-to-use interactive nodes are REALLY nice for optimization work
  • The same code does not run best on both the T3E and the SP
  • The SP is easier to tune for, and faster than the T3E, from a
    single-PE standpoint
  • Funky DEC Alpha on-chip bandwidths and latencies