Hernquist SAC: PARTREE (Stuart Johnson)
Transcript of presentation slides

1
Hernquist SAC: PARTREE (Stuart Johnson)
  • Single-processor optimization of a particle tree code
  • Punch line:
  • tuned subroutines: 2.75x speedup
  • whole code: 2x speedup

2
Outline
  • Scientific Context and Goals
  • Initial Computational Characteristics
  • Optimization efforts
  • Performance improvements
  • Conclusions

3
Scientific Context and Goals
  • Simulate the gravitational evolution of galaxies
  • early disturbed development
  • origin of elliptical galaxies
  • background light and tidal debris
  • evolution of a "cosmological" model
  • initial mass density
  • initial spectrum of heterogeneities

4
PARTREE
  • Approximately solves the N-body problem
  • Nearby particles: particle-particle interactions
  • "Distant" particles: particle-multipole interactions
  • Converts the O(N^2) algorithm to O(N log N)
  • More general than SCF (which relies on symmetry)

5
What the code does
  • At each timestep:
  • 1) ORB domain decomposition (Parallel)
  • 2) Construct local BH trees
  • 3) Construct locally essential trees (Parallel)
  • 4) Walk through trees to calculate forces (tuned)
  • 5) Move particles
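  • In code, the timestep flow looks roughly like the stub sketch below.
    This is an illustration, not the original PARTREE source; every
    routine name except GroupForceWalk (which appears in the profiles
    later) is hypothetical.

      #include <stdio.h>

      /* Stub sketch of the per-timestep driver; all names except
         GroupForceWalk are hypothetical. */
      static void OrbDecomposition(void)    { puts("1) ORB decomposition"); }
      static void BuildLocalBHTrees(void)   { puts("2) local BH trees"); }
      static void BuildEssentialTrees(void) { puts("3) essential trees"); }
      static void GroupForceWalk(void)      { puts("4) force walk"); }
      static void MoveParticles(void)       { puts("5) move particles"); }

      int main(void)
      {
          for (int step = 0; step < 3; step++) {   /* a few timesteps */
              OrbDecomposition();
              BuildLocalBHTrees();
              BuildEssentialTrees();
              GroupForceWalk();
              MoveParticles();
          }
          return 0;
      }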

6
Tree data structures
  • ORB: Orthogonal Recursive Bisection
  • successive halving for parallel domain decomposition on 2^n
    processors
  • BH: Barnes-Hut
  • nested cubing for on-processor decomposition and localization of
    particles (see the sketch below)
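  • A minimal sketch of the nested-cubing helpers, reusing the
    whichChild and subCenter names that appear in the profiles later;
    the bodies are illustrative assumptions, not the original source.

      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Octant (0-7) of the cell centered at c containing particle p:
         one bit per axis. Illustrative body; only the name is from the
         profile. */
      static int whichChild(coordStruct p, coordStruct c)
      {
          return (p.x > c.x) | ((p.y > c.y) << 1) | ((p.z > c.z) << 2);
      }

      /* Center of child octant `which` of a cell centered at c with
         side length s (the child cube has side s/2). */
      static coordStruct subCenter(coordStruct c, double s, int which)
      {
          coordStruct sub = c;
          sub.x += (which & 1) ? s/4 : -s/4;
          sub.y += (which & 2) ? s/4 : -s/4;
          sub.z += (which & 4) ? s/4 : -s/4;
          return sub;
      }

      int main(void)
      {
          coordStruct c = {0, 0, 0}, p = {0.3, -0.2, 0.1};
          int w = whichChild(p, c);
          coordStruct sc = subCenter(c, 2.0, w);
          printf("child %d, center (%g, %g, %g)\n", w, sc.x, sc.y, sc.z);
          return 0;
      }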

7
Opening criterion
  • Open a cell if it is too close
  • Approximate a cell using a multipole expansion if
    it is distant
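  • In code, a standard BH-style opening test compares cell size to
    distance. A minimal sketch, assuming an accuracy parameter theta;
    the slide does not give the exact criterion PARTREE uses.

      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Open the cell when size/distance exceeds theta, i.e. when it is
         "too close" for its multipole expansion to be accurate.
         Comparing squares avoids both the sqrt and the divide. */
      static int mustOpen(double cellSize, coordStruct cellCenter,
                          coordStruct p, double theta)
      {
          double dx = cellCenter.x - p.x;
          double dy = cellCenter.y - p.y;
          double dz = cellCenter.z - p.z;
          double d2 = dx*dx + dy*dy + dz*dz;
          return cellSize * cellSize > theta * theta * d2;  /* s/d > theta */
      }

      int main(void)
      {
          coordStruct c = {10, 0, 0}, p = {0, 0, 0};
          printf("open? %d\n", mustOpen(1.0, c, p, 0.5));
          return 0;
      }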

8
Force Calculation
  • Particle-particle interaction
  • Note the square root for calculating the
    particle-to-particle unit vector
  • Note the divide(s)
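  • The interaction formula itself is not preserved in the transcript.
    A minimal sketch of a Plummer-softened particle-particle term, using
    the variable names that appear on later slides (epssq, phii, mor3),
    with the sqrt and divides marked:

      #include <math.h>
      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Acceleration and potential contribution of particle j (mass mj)
         on particle i, with Plummer softening epssq. A pforce-style
         sketch; the exact original expression is an assumption. */
      static void ppInteract(coordStruct ri, coordStruct rj, double mj,
                             double epssq, coordStruct *acc, double *phisum)
      {
          double xj = rj.x - ri.x, yj = rj.y - ri.y, zj = rj.z - ri.z;
          double sr2  = xj*xj + yj*yj + zj*zj + epssq;
          double sr   = sqrt(sr2);     /* the square root */
          double phii = mj / sr;       /* divide          */
          double mor3 = phii / sr2;    /* another divide  */
          *phisum -= phii;
          acc->x  += mor3 * xj;
          acc->y  += mor3 * yj;
          acc->z  += mor3 * zj;
      }

      int main(void)
      {
          coordStruct acc = {0, 0, 0};
          double phi = 0.0;
          ppInteract((coordStruct){0,0,0}, (coordStruct){1,0,0}, 1.0,
                     1e-4, &acc, &phi);
          printf("ax=%g phi=%g\n", acc.x, phi);
          return 0;
      }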

9
Particle grouping
  • Forces are calculated for groups of nearby particles using the same
    tree decomposition, since nearby particles see almost the same
    gravity field
  • implemented by using the distance (d) to the nearest edge of the
    particle group in the opening criterion (the calculation is more
    accurate than for a single particle)
  • has very significant implications for data reuse (it generates
    in-cache force calculation loops)
  • has implications for independent operations
  • particle group size set to 32 by experimentation
  • grouping is a performance tradeoff (a sketch of the group-edge
    distance follows)
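  • A minimal sketch of the group-edge distance used in the opening
    criterion, assuming the group is summarized by an axis-aligned
    bounding box; the transcript does not show the original
    implementation.

      #include <math.h>
      #include <stdio.h>

      typedef struct { double x, y, z; } coordStruct;

      /* Distance d from a cell center c to the nearest edge of the
         particle group's bounding box [gmin, gmax]; zero if c is
         inside. Using this d in the opening test makes one interaction
         list valid for every particle in the group. */
      static double distToGroup(coordStruct c, coordStruct gmin,
                                coordStruct gmax)
      {
          double dx = fmax(fmax(gmin.x - c.x, c.x - gmax.x), 0.0);
          double dy = fmax(fmax(gmin.y - c.y, c.y - gmax.y), 0.0);
          double dz = fmax(fmax(gmin.z - c.z, c.z - gmax.z), 0.0);
          return sqrt(dx*dx + dy*dy + dz*dz);
      }

      int main(void)
      {
          coordStruct gmin = {-1, -1, -1}, gmax = {1, 1, 1}, c = {3, 0, 0};
          printf("d = %g\n", distToGroup(c, gmin, gmax));  /* d = 2 */
          return 0;
      }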

10
Initial Computational Characteristics
  • Test problem: a 2-million-particle disk/halo simulation
  • Not huge: 640 MB (4 or more nodes on the SP)
  • 1000 interactions per particle
  • Profile of computation time
  • Scalability

11
Profile of computation time (4-node run)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    71.6    5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
    11.3    6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
     7.3    6717.45     540.64    3540223     0.15     0.15  .pforce [7]
     2.5    6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree [9]
     1.9    7040.62     137.89                               .__mcount [10]
     0.7    7090.43      49.81                               .readsocket [12]
     0.6    7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
     0.5    7172.11      40.44                               .kickpipes [16]
     0.5    7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
     0.5    7247.02      35.02  128894017     0.00     0.00  .subCenter [18]

12
Scalability of test problem
13
Optimization Efforts
  • cforce/pforce tuning for the SP and T3E
  • general comments:
  • program for the architecture
  • program for the cache
  • program for the pipelines
  • avoid slow things (like divides)

14
Optimization Efforts
  • general tuning comments
  • techniques:
  • predict performance and compare it to absolute limits
  • understand the limiting factors on performance
  • understand the effects of code modifications, and modify towards the
    predicted performance
  • look at the assembly code
  • use compiler flags

15
cforce/pforce tuning for SP/T3E
  • elimination of divides
  • "vectorization" of the inverse square root
  • the tunable loops and their properties:
  • cache behavior
  • computational intensity / performance predictions
  • elimination of statically declared temporaries
  • elimination of single-precision calculations
  • absolute performance of the modified code
  • tuning pipelining and splitting loops (T3E)
  • increase independent operations, prevent register spill

16
Elimination of divides
  • Original code (1 86 253 162 CP T3E):

      sr   = sqrt(sr2);
      phii = (c->mass) / sr;
      mor3 = phii / sr2;
      temp = 5.0 * phiquad / sr2;

  • Best, since 1/sqrt is as fast as sqrt (486 90 CP T3E):

      sr   = 1.0 / sqrt(sr2);
      phii = (c->mass) * sr;
      rsr2 = sr * sr;
      mor3 = phii * rsr2;
      temp = 5.0 * phiquad * rsr2;

17
"Vectorization" of inverse square root(T3E)
  • Intrinsic (libm) function timings (from the T3E Benchmarker's
    Guide):

      Routine   CP Scalar   CP Vector
      -------   ---------   ---------
      SQRT          86          25
      1/SQRT        86          25

  • The T3E can automatically stripmine and call the vector routines
  • BUT the C compiler is broken! (the assembly reveals this)

18
"Vectorization" of inverse square root(SP)
      Routine               CP SQRT   CP 1/SQRT
      -------------------   -------   ---------
      FSQRT/FD (HW)           13.8       22.4
      libm (scalar)           43.7       58.0
      libmass (scalar)        28.4       28.4
      libmassvp2 (vector)      7.1        7.1

  • loop timings are for 10,000,000 ops in vector lengths of 5000,
    reused to provide in-cache timings
  • the SP (without a preprocessor) requires an explicit vector call to
    get the libmassvp2 form (vrsqrt); see the sketch below
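  • A minimal sketch of that explicit vector call, assuming the
    libmassv-style interface vrsqrt(y, x, &n), which computes
    y[i] = 1/sqrt(x[i]) with a Fortran-style length argument. The
    declaration below is an assumption; link against libmassvp2 on the
    SP.

      #include <stdio.h>

      #define NG 32   /* particle group size, from the grouping slide */

      /* Assumed libmassvp2 declaration: y[i] = 1/sqrt(x[i]), i < *n */
      void vrsqrt(double *y, double *x, int *n);

      int main(void)
      {
          double tmps[NG], rs[NG];
          int n = NG;
          for (int j = 0; j < n; j++)
              tmps[j] = 1.0 + j;   /* e.g. r^2 + epssq from loop 1 */

          vrsqrt(rs, tmps, &n);    /* one vector call replaces NG
                                      scalar 1/sqrt computations   */

          printf("rs[0] = %g\n", rs[0]);   /* prints rs[0] = 1 */
          return 0;
      }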

19
Tunable loop(s)
  • Loop 1 (x, y, z, x2, y2, z2, and tmps are per-group working arrays):

      for (i = 0; i < ng; i++)
      for (j = 0, c = clist; c < clist + ccount; j++, c++) {
          x[j] = (c->r).x - pos0.x;
          y[j] = (c->r).y - pos0.y;
          z[j] = (c->r).z - pos0.z;
          x2[j] = x[j] * x[j];  y2[j] = y[j] * y[j];
          z2[j] = z[j] * z[j];
          tmps[j] = x2[j] + y2[j] + z2[j] + (c->epssq);
      }

20
Tunable loop(s)
  • Loop 2 (run after tmps[j] has been replaced by its inverse square
    root, e.g. via vrsqrt):

      for (j = 0, c = clist; c < clist + ccount; j++, c++) {
          sr2  = tmps[j] * tmps[j];
          phii = (c->mass) * tmps[j];
          mor3 = phii * sr2;
          phisum -= phii;
          acc0.x += mor3 * x[j];
          acc0.y += mor3 * y[j];
          acc0.z += mor3 * z[j];
          or5 = sr2 * sr2 * tmps[j];
          phiquad = (0.5 * (qxx0*x2[j] + qyy0*y2[j] + qzz0*z2[j]
                            - (c->epssq) * momnode)
                     + x[j] * (qxy0*y[j] + qxz0*z[j])
                     + qyz0 * y[j]*z[j]) * or5;
          phisum -= phiquad;
          temp = 5.0 * phiquad * sr2;
          acc0.x += (temp*x[j] - (qxx0*x[j] + qxy0*y[j] + qxz0*z[j]) * or5);
          acc0.y += (temp*y[j] - (qxy0*x[j] + qyy0*y[j] + qyz0*z[j]) * or5);
          acc0.z += (temp*z[j] - (qxz0*x[j] + qyz0*y[j] + qzz0*z[j]) * or5);
      }

21
Loop properties
  • 1) Cache behavior
  • Particle grouping -> data reuse -> almost all loop data can be
    considered in-cache if the size of the working arrays is adjusted
    to fit in cache

22
Loop properties
  • 2) Computational intensity / performance predictions

                            loop 1    loop 2
      Floating Point
        +-*/ ops               9        59
        FMA ops                0        18
        cycles                9/2      ((59 - 2*18) + 18)/2 = 41/2
      Load/Store
        L/S ops                9        14
        cycles (no quads)     9/2      14/2

  • (each FMA covers two of loop 2's 59 ops, leaving 23 + 18 = 41 FP
    instructions across 2 FPUs)
  • CPU-bound loops, so predicted cycles = 9/2 + 41/2 = 25 per iteration

23
Elimination of statically declared variables
  • Original code:

      static float x, y, z;
      static float x2, y2, z2, epssq;
      static float dr2, sr, sr2, phii, mor3, phisum;
      static float or5, temp, phiquad;
      static coordStruct acc0, pos0;

  • All of these should be in registers, but are forced to store back!
  • Change them to local variables, plus some statically declared
    in-cache workspace... (see the sketch below)
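  • A minimal sketch of why the change matters, with hypothetical names
    (this is not the original cforce): with the compilers of the day, an
    automatic local could live in a register for the whole loop, while a
    static forced a store back to memory on every update.

      #include <stdio.h>

      /* Before: a static accumulator is kept consistent in memory, so
         the compiler stores it back on every update. */
      static float phisum_static;

      float sumStatic(const float *phii, int n)
      {
          phisum_static = 0.0f;
          for (int i = 0; i < n; i++)
              phisum_static -= phii[i];   /* store back each iteration */
          return phisum_static;
      }

      /* After: an automatic local is register-allocatable throughout. */
      float sumLocal(const float *phii, int n)
      {
          float phisum = 0.0f;            /* can live in a register */
          for (int i = 0; i < n; i++)
              phisum -= phii[i];
          return phisum;
      }

      int main(void)
      {
          float phii[4] = {1, 2, 3, 4};
          printf("%g %g\n", sumStatic(phii, 4), sumLocal(phii, 4));
          return 0;
      }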

24
Elimination of single precision
  • Particles are stored in single or double precision
  • Calculations are performed in double precision
  • Example of the macro modification:

      #define COPYPARTICLE()                             \
      {                                                  \
          plist[pcount].type  = c->type;                 \
          plist[pcount].mass  = (double)(c->mass);       \
          plist[pcount].r.x   = (double)(c->r.x);        \
          plist[pcount].r.y   = (double)(c->r.y);        \
          plist[pcount].r.z   = (double)(c->r.z);        \
          plist[pcount].epssq = (double)(c->epssq);      \

25
Performance Improvements
  • in-cache test of the original cforce simulator loop: 7.59 seconds
  • in-cache test of the optimized cforce simulator loop: 2.07 seconds
  • 3.67 times faster! (10M iterations of the loop)
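  • A minimal sketch of this kind of in-cache simulator-loop timing,
    assuming the methodology of the earlier libm timing slide (vector
    length 5000, reused; 5000 x 2000 reps = 10M iterations). The kernel
    body is a stand-in, not the actual cforce loop.

      #include <stdio.h>
      #include <time.h>

      #define VLEN 5000   /* small enough to stay in cache     */
      #define REPS 2000   /* 5000 * 2000 = 10M loop iterations */

      int main(void)
      {
          static double x[VLEN], y[VLEN];
          for (int i = 0; i < VLEN; i++)
              x[i] = 1.0 + i;

          clock_t t0 = clock();
          for (int r = 0; r < REPS; r++)
              for (int i = 0; i < VLEN; i++)
                  y[i] = 1.0 / x[i];   /* stand-in for the tuned kernel */
          clock_t t1 = clock();

          printf("%.2f seconds\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
          return y[0] == 1.0 ? 0 : 1;
      }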

26
Performance of optimized tuned loops (no vrsqrt)
      CPU seconds               1.5913    CP executing       254610220
      Elapsed seconds           1.6457
      FPU0 results/sec        159.58M     F.P. in Math0      253934523
      FPU1 results/sec        138.09M     F.P. in Math1      219752424
      F.P. add ops/sec         25.44M     F.P. add            40480950
      F.P. mul ops/sec        113.92M     F.P. mul           181278375
      F.P. div ops/sec          0.00M     F.P. div                1776
      F.P. ma ops/sec         158.16M     F.P. ma            251677116
      MFLOPS ratio            455.67M     F.P. math ops      725115333
      Fixed instr/sec E0       64.67M     Fixed instr E0     102907722
      Fixed instr/sec E1       43.67M     Fixed instr E1      69495885
      ICU instr/sec             0.00M     ICU instr.                 0
      Integer MIPS            108.34      Total instr.       172403607
      I Cache misses/sec        2.73k
      D Cache reloads/sec      43.75k
      D Cache storebacks/sec   22.45k
      D Cache misses/sec       33.81k
      Total TLB misses/sec      0.00k
      cycles/FLOP              0.3511

27
Performance of optimized tuned loops (no vrsqrt)
  • The optimized loops need 25.5 cycles per iteration, a bit slower
    than my prediction.
  • Based on the displayed op counts, this should be
    (251 + 181 + 40)/2 = 23.6 cycles (FMA, mul, and add instructions in
    millions, over 2 FPUs and 10M iterations)
  • Missing 1.9 cycles per loop; running at only 93% of peak
  • The tuned loops are running at 456 Mflops

28
Performance improvements
  • Rs2hpm results from the 2M-particle run (optimized code)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    42.4    1598.79    1598.79    3009774     0.53     0.53  .cforce [6]
    25.0    2539.95     941.16    1264378     0.74     2.08  .GroupForceWalk
    11.7    2981.47     441.52                               vrsqrt [7]
     4.9    3165.09     183.62   12000000     0.02     0.02  .AddParticleToTree
     3.6    3301.92     136.83                               .__mcount [10]
     2.2    3385.88      83.96    1264407     0.07     0.07  .pforce [11]
     1.1    3427.49      41.61                               .readsocket [13]
     1.1    3468.57      41.08  139771115     0.00     0.00  .whichChild [14]
     1.1    3509.58      41.01        638    64.28    64.28  .WeightFrac [17]
     0.9    3543.70      34.12                               .kickpipes [18]

29
Performance Improvements
  • Rs2hpm results from the 2M-particle run (original code)

     %    cumulative     self                self     total
    time    seconds    seconds      calls  ms/call  ms/call  name
    71.6    5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
    11.3    6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
     7.3    6717.45     540.64    3540223     0.15     0.15  .pforce [7]
     2.5    6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree
     1.9    7040.62     137.89                               .__mcount [10]
     0.7    7090.43      49.81                               .readsocket [12]
     0.6    7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
     0.5    7172.11      40.44                               .kickpipes [16]
     0.5    7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
     0.5    7247.02      35.02  128894017     0.00     0.00  .subCenter [18]

  • cforce and pforce are 2.76x faster...

30
Performance improvements
  • Full-run performance (2M particles); P/Proc/S is particles processed
    per processor per second

      Processors   Opt. code (P/Proc/S)   Orig. code (P/Proc/S)   Speedup
      ----------   --------------------   ---------------------   -------
           4              3289                    1650              1.99
           8              3788                    1880              2.01
          16              3289                    1645              2.00
          32              2717                    1358              2.00
          64              2300                    1200              1.92

31
Conclusions 1
  • With 2x the speed, we can do 2x the particles in the same time as
    before
  • The code may scale well enough to 128 nodes for 100,000,000-particle
    simulations (50,000 processor-seconds/timestep, 14,000 SUs for 1000
    timesteps), but scaling problems may be inherent to the algorithm
    (ORB aspect ratios?)
  • The T3E code was also optimized similarly, with more attention to
    loop pipelining problems

32
Conclusions 2
  • Profiling and performance monitoring tools are essential for
    optimization work
  • Easy-to-use interactive nodes are REALLY nice for optimization work
  • The same code does not run best on both the T3E and the SP
  • The SP is easier to tune for, and faster than the T3E, from a
    single-PE standpoint
  • Funky DEC Alpha on-chip bandwidths and latencies