Title: Libraries and Program Performance
1 Libraries and Program Performance
- NERSC User Services Group
2 An Embarrassment of Riches (Serial)
3 Threaded Libraries (Threaded)
4 Parallel Libraries (Distributed, Threaded)
5 Elementary Math Functions
- Three libraries provide elementary math functions:
  - C/Fortran intrinsics
  - MASS/MASSV (Math Acceleration Subroutine System)
  - ESSL/PESSL (Engineering and Scientific Subroutine Library)
- Language intrinsics are the most convenient, but not the best performers
6 Elementary Functions in Libraries
- MASS
  - sqrt rsqrt exp log sin cos tan atan atan2 sinh cosh tanh dnint x**y
- MASSV
  - cos dint exp log sin log tan div rsqrt sqrt atan
- See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
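As a brief illustration of the difference in calling style, the sketch below contrasts the Fortran sqrt intrinsic with the vectorized MASSV routine vsqrt, which fills a whole output array in one call. The program name, array size, and loop bounds are invented for the example, and the vsqrt argument order shown here (output vector, input vector, count) should be checked against the MASS documentation linked above.

      program massv_sketch
      implicit none
      integer, parameter :: n = 1000000
      real*8  x(n), y(n), z(n)
      integer i
      do i = 1, n
         x(i) = dble(i)
      enddo
      ! intrinsic form: one scalar sqrt per loop iteration
      do i = 1, n
         y(i) = sqrt(x(i))
      enddo
      ! MASSV form: one library call for the whole vector
      call vsqrt(z, x, n)
      write(6,*) y(n), z(n)
      end

Linking with the MASS libraries (module load mass, as in the execution schema on slide 9) is what makes the vsqrt call available.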
7 Other Intrinsics in Libraries
- ESSL
  - Linear Algebra Subprograms
  - Matrix Operations
  - Linear Algebraic Equations
  - Eigensystem Analysis
  - Fourier Transforms, Convolutions, Correlations, and Related Computations
  - Sorting and Searching
  - Interpolation
  - Numerical Quadrature
  - Random Number Generation
8 Comparing Elementary Functions
- Loop schema for elementary functions:

   99 write(6,98)
   98 format( " sqrt " )
      x = pi/4.0
      call f_hpmstart(1,"sqrt")
      do 100 i = 1, loopceil
         y = sqrt(x)
         x = y*y
  100 continue
      call f_hpmstop(1)
      write(6,101) x
  101 format( " x = ", g21.14 )
9 Comparing Elementary Functions
- Execution schema for elementary functions:

   setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
   module load hpmtoolkit
   module load mass
   module list
   setenv L1 "-Wl,-v,-bC:massmap"
   xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
   timex masstest < input > mathout
10 Results Examined
- Answers after 50e6 iterations
- User execution time
- Floating-point and FMA instructions
- Operation rate in Mflip/s
11 Results Observed
- No difference in answers
- Best times/rates at -O3 or -O4
- ESSL no different from intrinsics
- MASS much faster than intrinsics
12 (No Transcript)
13 (No Transcript)
14 (No Transcript)
15 Comparing Higher-Level Functions
- Several sources of a matrix-multiply function:
  - User-coded scalar computation
  - Fortran intrinsic matmul
  - Single-processor ESSL dgemm
  - Multi-threaded SMP ESSL dgemm
  - Single-processor IMSL dmrrrr (32-bit)
  - Single-processor NAG f01ckf
  - Multi-threaded SMP NAG f01ckf
16 Sample Problem
- Multiply dense matrices:
  - A(1:n,1:n) = i + j
  - B(1:n,1:n) = j - i
  - C(1:n,1:n) = A * B
- Output C to verify the result
17 Kernel of user matrix multiply

      do i = 1, n
         do j = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
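Slide 15 lists ESSL's dgemm as one of the alternatives; for comparison with the hand-coded triple loop above, the same product reduces to a single Level-3 BLAS call. The lines below are a sketch only, reusing the a, b, c, and n of the kernel and assuming the matrices are declared with leading dimension n:

      ! C := 1.0*A*B + 0.0*C, i.e. the same C = A*B as the triple loop
      call f_hpmstart(2,"dgemm multiply")
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      call f_hpmstop(2)

(The Fortran intrinsic form is simply c = matmul(a, b).) Linked against -lessl this call runs on one processor; linked against -lesslsmp (slide 31) the same call uses the threaded ESSL-SMP version timed on the next slide.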
18 Comparison of Matrix Multiply (N1 = 5,000)

  Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
  User Scalar        1,490             168    106  (slowest)
  Intrinsic          1,477             169    106  (slowest)
  ESSL                 195           1,280   13.9
  IMSL                 194           1,290   13.8
  NAG                  195           1,280   13.9
  ESSL-SMP              14          17,800    1.0  (fastest)
  NAG-SMP               14          17,800    1.0  (fastest)
19 Observations on Matrix Multiply
- Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance
- All the single-processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor
- Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries
20 Comparison of Matrix Multiply (N2 = 10,000)

  Version     Wall Clock (sec)   Mflip/s   Scaled Time
  ESSL-SMP          101           19,800       1.01
  NAG-SMP           100           19,900       1.00

  Scaling with problem size (complexity increase: 8x)

  Version     Wall Clock (N2/N1)   Mflip/s (N2/N1)
  ESSL-SMP           7.2                1.10
  NAG-SMP            7.1                1.12

- Both ESSL-SMP and NAG-SMP showed ~10% performance gains with the larger problem size.
21 Observations on Scaling
- Scaling of the problem size was only done for the SMP libraries, to keep run times reasonable.
- Doubling N results in an 8x increase in computational complexity for dense matrix multiplication, since the operation count scales as N^3 and (2N)^3 = 8N^3.
- Performance actually increased for both routines at the larger problem size.
22 ESSL-SMP Performance vs. Number of Threads
- All for N = 10,000
- Number of threads controlled by the environment variable OMP_NUM_THREADS
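As a small, hypothetical check of that thread control (the program below is my own sketch, not from the slides), one can print the OpenMP thread limit that the ESSL-SMP library will see for a given OMP_NUM_THREADS setting:

      program threadcheck
      use omp_lib
      implicit none
      ! with e.g. "setenv OMP_NUM_THREADS 8" set before the run,
      ! this reports the thread pool ESSL-SMP can draw on
      write(6,*) 'max OpenMP threads: ', omp_get_max_threads()
      end program threadcheck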
23 Parallelism Choices Based on Problem Size
Doing a month's work in a few minutes!
- Three good choices:
  - ESSL / LAPACK
  - ESSL-SMP
  - ScaLAPACK
- Only beyond a certain problem size is there any opportunity for parallelism.
(Figure: Matrix-Matrix Multiply performance vs. problem size)
24 Larger Functions: FFTs
- ESSL, FFTW, NAG, IMSL
- See http://www.nersc.gov/nusers/resources/software/libs/math/fft/
- We looked at ESSL, NAG, and IMSL
- One-D, forward and reverse
25 One-D FFTs
- NAG
  - c06eaf - forward
  - c06ebf - inverse; conjugate needed
  - c06faf - forward; work-space needed
  - c06ebf - inverse; work-space and conjugate needed
- IMSL
  - z_fast_dft - forward and reverse; separate arrays
- ESSL
  - drcft - forward and reverse; work-space initialization step needed
- All have size constraints on their data sets
26 One-D FFT Measurement
- 2**24 real*8 data points input (a synthetic signal)
- Each transform ran in a 20-iteration loop
- All performed both forward and inverse transforms on the same data
- Input and inverse outputs were identical
- Measured with HPMToolkit:

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()
27 One-D FFT Performance

  Library  Routine      Direction     Time (sec)   Mflip/s
  NAG      c06eaf       forward         25.182      54.006
  NAG      c06ebf       inverse         24.465      40.666
  NAG      c06faf       forward         29.451      46.531
  NAG      c06ebf       inverse         24.469      40.663
           (required a data copy for each iteration of each transform)
  IMSL     z_fast_dft   forward         71.479      46.027
  IMSL     z_fast_dft   inverse         71.152      48.096
  ESSL     drcft        initialize       0.032      62.315
  ESSL     drcft        forward          3.573     274.009
  ESSL     drcft        initialize       0.058      96.384
  ESSL     drcft        inverse          3.616     277.650
28 (No Transcript)
29 (No Transcript)
30 (No Transcript)
31 ESSL and ESSL-SMP
- Easy parallelism
  - ( -qsmp=omp -qessl -lomp -lesslsmp )
- For simple problems you can dial in the local data size by adjusting the number of threads
- Cache reuse can lead to superlinear speed-up
  - NH II node has 128 MB of cache!
32 Parallelism Beyond One Node: MPI
(Figure: 2D problem decomposed across 4 tasks)
- Distributed data decomposition
- Distributed parallelism (MPI) requires both local and global addressing contexts
- Dimensionality of the decomposition can have a profound impact on scalability
  - Consider the surface-to-volume ratio (a worked example follows this slide):
    - Surface = communication (MPI)
    - Volume = local work (HPM)
- Decomposition is often a cause of load imbalance, which can reduce parallel efficiency
(Figure: 2D problem decomposed across 20 tasks)
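As a rough worked example of the surface-to-volume point (my own illustration, not from the slides): decompose an N x N grid over P tasks and compare 1D strips with a 2D block layout,

\[
\text{1D strips:}\ \frac{\text{surface}}{\text{volume}} \approx \frac{2N}{N^2/P} = \frac{2P}{N},
\qquad
\text{2D blocks:}\ \frac{\text{surface}}{\text{volume}} \approx \frac{4N/\sqrt{P}}{N^2/P} = \frac{4\sqrt{P}}{N}.
\]

For P > 4 the 2D decomposition does proportionally less communication per unit of local work, which is why the dimensionality of the decomposition matters for scalability.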
33 Example: FFTW 3D
- Popular for its portability and performance
- Also consider PESSL's FFTs (not treated here)
- Uses a slab (1D) data decomposition
- Direct algorithms for transforms of dimensions of size 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
- For parallel FFTW calls, transforms are done in place
- Plan:
  - fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags)
  - fftwnd_mpi_local_sizes(plan, lnx, lxs, lnyt, lyst, lsize)
- Transform:
  - fftwnd_mpi(plan, 1, data, work, transform_flags)
- What are these? (lnx, lxs, lnyt, lyst, lsize: the local slab sizes and offsets, explained on the next slide)
34 FFTW Data Decomposition
Each MPI rank owns a portion of the problem.
(Figure: the nx x ny x nz array split into slabs along x; this rank's slab spans planes lxs to lxs+lnx)

- Local address context:

    for(x=0; x<lnx; x++)
      for(y=0; y<ny; y++)
        for(z=0; z<nz; z++)
          data[x][y][z] = f(x+lxs, y, z);

- Global address context:

    for(x=0; x<nx; x++)
      for(y=0; y<ny; y++)
        for(z=0; z<nz; z++)
          data[x][y][z] = f(x, y, z);
35 FFTW Parallel Performance
- FFT performance can be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
- Consider data decompositions and paddings that lead to optimal local data sizes (cache use) and prime factors
36 FFTW Wisdom
- Runtime performance optimization; can be stored to a file
- Wise options:
  - FFTW_MEASURE, FFTW_USE_WISDOM
- Unwise options:
  - FFTW_ESTIMATE
- Wisdom works better for serial FFTs. There is some benefit for parallel FFTs, but it must amortize the increase in planning overhead.