Libraries and Program Performance

Transcript and Presenter's Notes
1
Libraries and Program Performance
  • NERSC User Services Group

2
An Embarrassment of Riches: Serial
3
Threaded Libraries (Threaded)
4
Parallel Libraries (Distributed + Threaded)
5
Elementary Math Functions
  • Three libraries provide elementary math
    functions
  • C/Fortran intrinsics
  • MASS/MASSV (Math Acceleration Subroutine System)
  • ESSL/PESSL (Engineering and Scientific Subroutine
    Library)
  • Language intrinsics are the most convenient, but
    not the best performers

6
Elementary Functions in Libraries
  • MASS
  • sqrt rsqrt exp log sin cos tan atan atan2 sinh
    cosh tanh dnint x**y
  • MASSV
  • cos dint exp log sin log tan div rsqrt sqrt atan
  • See
    http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
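As a quick illustration of the difference in calling style (a minimal sketch, not taken from the slides: it assumes the documented MASSV Fortran interface vexp(y, x, n), which sets y(i) = exp(x(i)) for i = 1..n, and linking with -lmassv):

      program massv_sketch
c     Scalar intrinsic loop vs. one MASSV vector call for the same work.
      integer n, i
      parameter (n = 1000000)
      real*8 x(n), y(n), z(n)
      do i = 1, n
         x(i) = 1.0d0 / dble(i)
      enddo
c     Scalar: one intrinsic exp() call per element
      do i = 1, n
         y(i) = exp(x(i))
      enddo
c     Vector: a single MASSV call covers the whole array
      call vexp(z, x, n)
      write(6,*) y(n), z(n)
      end

The intrinsic version needs no extra libraries, which is the convenience the slide refers to; the MASSV call trades a little convenience for throughput.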

7
Other Intrinsics in Libraries
  • ESSL
  • Linear Algebra Subprograms
  • Matrix Operations
  • Linear Algebraic Equations
  • Eigensystem Analysis
  • Fourier Transforms, Convolutions, Correlations,
    and Related Computations
  • Sorting and Searching
  • Interpolation
  • Numerical Quadrature
  • Random Number Generation

8
Comparing Elementary Functions
  • Loop schema for elementary functions:

     99     write(6,98)
     98     format( " sqrt " )
            x = pi/4.0
            call f_hpmstart(1,"sqrt")
            do 100 i = 1, loopceil
               y = sqrt(x)
               x = y * y
    100     continue
            call f_hpmstop(1)
            write(6,101) x
    101     format( " x = ", g21.14 )

9
Comparing Elementary Functions
  • Execution schema for elementary functions:

    setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
    module load hpmtoolkit
    module load mass
    module list
    setenv L1 "-Wl,-v,-bCmassmap"
    xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
    timex mathtest < input > mathout

10
Results Examined
  • Answers after 50e6 iterations
  • User execution time
  • Floating and FMA instructions
  • Operation rate in Mflip/sec

11
Results Observed
  • No difference in answers
  • Best times/rates at -O3 or -O4
  • ESSL no different from intrinsics
  • MASS much faster than intrinsics

12-14
(No Transcript)
15
Comparing Higher Level Functions
  • Several sources of matrix-multiply functionality:
  • User-coded scalar computation
  • Fortran intrinsic matmul
  • Single processor ESSL dgemm
  • Multi-threaded SMP ESSL dgemm
  • Single processor IMSL dmrrrr (32-bit)
  • Single processor NAG f01ckf
  • Multi-threaded SMP NAG f01ckf

16
Sample Problem
  • Multiply dense matrices:
  • A(1:n,1:n) = i + j
  • B(1:n,1:n) = j - i
  • C(1:n,1:n) = A * B
  • Output C to verify result

17
Kernel of user matrix multiply
      do i = 1, n
         do j = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
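The library versions compared on the next slide collapse this triple loop into a single call. A minimal sketch, assuming the standard BLAS-style dgemm interface that ESSL and ESSL-SMP provide (the problem size and fill pattern here are only illustrative):

      program dgemm_sketch
c     Same C = A*B as the kernel above, via one dgemm call.
c     dgemm computes C = alpha*op(A)*op(B) + beta*C; with 'N','N',
c     alpha = 1.0 and beta = 0.0 this is a plain matrix multiply.
      integer n, i, j
      parameter (n = 1000)
      real*8 a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
            c(i,j) = 0.0d0
         enddo
      enddo
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      write(6,*) c(1,1), c(n,n)
      end

Choosing between serial ESSL and ESSL-SMP is then a link-time decision (-lessl vs. -lesslsmp); the source does not change.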

18
Comparison of Matrix Multiply (N1 = 5,000)

  Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
  User Scalar        1,490            168    106   (slowest)
  Intrinsic          1,477            169    106   (slowest)
  ESSL                 195          1,280     13.9
  IMSL                 194          1,290     13.8
  NAG                  195          1,280     13.9
  ESSL-SMP              14         17,800      1.0 (fastest)
  NAG-SMP               14         17,800      1.0 (fastest)

19
Observations on Matrix Multiply
  • Fastest times were obtained by the two SMP
    libraries, ESSL-SMP and NAG-SMP, which both
    obtained 74% of the peak node performance
  • All the single-processor library functions took
    14 times more wall clock time than the SMP
    versions, each obtaining about 85% of peak for a
    single processor
  • Worst times were from the user code and the
    Fortran intrinsic, which took over 100 times more
    wall clock time than the SMP libraries
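As a worked check (assuming the usual figures for a 16-way POWER3 node of 1.5 GFlop/s peak per CPU and 24 GFlop/s peak per node): 17,800 / 24,000 is roughly 0.74 for the SMP libraries, and 1,280 / 1,500 is roughly 0.85 for the single-processor library runs, which is where the 74% and 85% figures above come from.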

20
Comparison of Matrix Multiply (N2 = 10,000)

  Version    Wall Clock (sec)   Mflip/s   Scaled Time
  ESSL-SMP        101            19,800      1.01
  NAG-SMP         100            19,900      1.00

  Scaling with problem size (complexity increase: 8x)

  Version    Wall Clock (N2/N1)   Mflip/s (N2/N1)
  ESSL-SMP         7.2                1.10
  NAG-SMP          7.1                1.12

  • Both ESSL-SMP and NAG-SMP showed ~10% performance
    gains with the larger problem size.

21
Observations on Scaling
  • Scaling of the problem size was only done for the
    SMP libraries, to keep run times reasonable
  • Doubling N results in an 8-fold increase in
    computational complexity for dense matrix
    multiplication
  • Performance actually increased for both routines
    at the larger problem size
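A worked check against the previous slide: doubling N multiplies the flop count by (2N)**3 / N**3 = 8, while the measured wall-clock time grew only by a factor of 7.1-7.2, so the delivered rate improved by roughly 8 / 7.2, about 1.1, i.e. the ~10% gain reported there.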

22
ESSL-SMP Performance vs. Number of Threads
  • All runs with N = 10,000
  • Number of threads controlled by the environment
    variable OMP_NUM_THREADS; for example:
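A minimal usage sketch, following the csh conventions of the build schema earlier (the thread count shown is only an example):

    setenv OMP_NUM_THREADS 8

The same executable is then rerun; the ESSL-SMP routines pick up the thread count at run time.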

23
Parallelism Choices Based on Problem Size
Doing a month's work in a few minutes!
  • Three Good Choices
  • ESSL / LAPACK
  • ESSL-SMP
  • ScaLAPACK

Only beyond a certain problem size is there any
opportunity for parallelism.
Matrix-Matrix Multiply
24
Larger Functions: FFTs
  • ESSL, FFTW, NAG, IMSL
  • See
    http://www.nersc.gov/nusers/resources/software/libs/math/fft/
  • We looked at ESSL, NAG, and IMSL
  • One-D, forward and reverse

25
One-D FFTs
  • NAG
  • c06eaf - forward
  • c06ebf - inverse, conjugate needed
  • c06faf - forward, work-space needed
  • c06ebf - inverse, work-space and conjugate needed
  • IMSL
  • z_fast_dft - forward/reverse, separate arrays
  • ESSL
  • drcft - forward/reverse, work-space and
    initialization step needed
  • All have size constraints on their data sets

26
One-D FFT Measurement
  • 2**24 real*8 data points input (a synthetic
    signal)
  • Each transform ran in a 20-iteration loop
  • All performed both forward and inverse transforms
    on the same data
  • Input and inverse outputs were identical
  • Measured with HPMToolkit:

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()

27
One-D FFT Performance
  NAG   c06eaf      fwd    25.182 sec.    54.006 Mflip/s
        c06ebf      inv    24.465 sec.    40.666 Mflip/s
        c06faf      fwd    29.451 sec.    46.531 Mflip/s
        c06ebf      inv    24.469 sec.    40.663 Mflip/s
        (required a data copy for each iteration for
        each transform)
  IMSL  z_fast_dft  fwd    71.479 sec.    46.027 Mflip/s
        z_fast_dft  inv    71.152 sec.    48.096 Mflip/s
  ESSL  drcft       init    0.032 sec.    62.315 Mflip/s
        drcft       fwd     3.573 sec.   274.009 Mflip/s
        drcft       init    0.058 sec.    96.384 Mflip/s
        drcft       inv     3.616 sec.   277.650 Mflip/s

28-30
(No Transcript)
31
ESSL and ESSL-SMP
  • Easy parallelism
    (compile with -qsmp=omp -qessl,
     link with -lomp -lesslsmp)
  • For simple problems, one can dial in the local
    data size by adjusting the number of threads
  • Cache reuse can lead to superlinear speed-up
  • An NH II (Nighthawk II) node has 128 MB of cache!

32
Parallelism Beyond One Node: MPI
(Figure: 2D problem decomposed into 4 tasks)
  • Distributed Data Decomposition
  • Distributed parallelism (MPI) requires both local
    and global addressing contexts
  • Dimensionality of the decomposition can have a
    profound impact on scalability
  • Consider the surface-to-volume ratio
  • Surface = communication (MPI)
  • Volume = local work (HPM)
  • Decomposition is often a cause of load imbalance,
    which can reduce parallel efficiency

(Figure: 2D problem decomposed into 20 tasks)
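A worked illustration of the surface-to-volume point (grid and task counts chosen only for illustration): decompose a 2D N x N grid with N = 1,000 over P = 100 tasks. A 1D slab decomposition gives each task an N x (N/P) strip that exchanges about 2N = 2,000 boundary points with its neighbors, while a 2D block decomposition gives each task an (N/sqrt(P)) x (N/sqrt(P)) tile with a perimeter of only 4N/sqrt(P) = 400 points, for the same amount of local work.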
33
Example: FFTW 3D
  • Popular for its portability and performance
  • Also consider PESSL's FFTs (not treated here)
  • Uses a slab (1D) data decomposition
  • Direct algorithms for transforms of dimensions of
    size 1-16, 32, and 64
  • For parallel FFTW calls, transforms are done in
    place
  • Plan
    fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny,
        nz, FFTW_FORWARD, flags)
    fftwnd_mpi_local_sizes(plan, lnx, lxs, lnyt,
        lyst, lsize)
  • Transform
    fftwnd_mpi(plan, 1, data, work, transform_flags)

What are these arguments? (See the next slide.)
34
FFTW Data Decomposition
Each MPI rank owns a portion of the problem.
(Figure: slab decomposition of the nx x ny x nz array
along x; this rank's slab runs from lxs to lxs+lnx.)
  • Local Address Context
      for (x = 0; x < lnx; x++)
        for (y = 0; y < ny; y++)
          for (z = 0; z < nz; z++)
            data[x][y][z] = f(x + lxs, y, z);
  • Global Address Context
      for (x = 0; x < nx; x++)
        for (y = 0; y < ny; y++)
          for (z = 0; z < nz; z++)
            data[x][y][z] = f(x, y, z);
35
FFTW Parallel Performance
  • FFT performance can be a complex function of
    problem size: the prime factors of the dimensions
    and the concurrency determine performance
  • Consider data decompositions and paddings that
    lead to optimal local data sizes (cache use) and
    prime factors

36
FFTW Wisdom
  • Runtime performance optimization that can be
    stored to a file
  • Wise options
  • FFTW_MEASURE, FFTW_USE_WISDOM
  • Unwise options
  • FFTW_ESTIMATE
  • Wisdom works best for serial FFTs. There is some
    benefit for parallel FFTs, but it must amortize
    the increase in planning overhead.