Title: Numerical Libraries in High Performance Computing
1. Numerical Libraries in High Performance Computing
- Dmitry Pekurovsky
- dmitry@sdsc.edu
- CIEG workshop, April 25 2007
2. Tutorial Outline
- General comments
- Overview of commonly used libraries
- MASS
- ScaLAPACK and its components
- BLAS, BLACS, PBLAS, LAPACK
- PETSc
- NAG
- ESSL, PESSL
- FFTW
- P3DFFT
- Lab assignment
3. Numerical libraries
- Use numerical (a.k.a. math or scientific)
libraries whenever possible - Minimize code development effort
- Improved performance
- Robust, accurate, up-to-date numerical algorithms
- Collection (archive) of precompiled routines
- Use nm -a to look inside if needed
- Linking: mpxlf ... /lib/loc/libnum.a OR
- mpxlf ... -L/lib/loc -lnum
- Reminder: order of linking matters
- In case two libraries contain the same routine name, the first one to appear on the linking command line gets used
- -l options should come after the object files (.o) of your source
- If library A calls routines from library B, -lA has to come before -lB (see the example line below)
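- For example (hypothetical library names, not from the slides): if libA.a in /some/dir calls routines in libB.a, the link line could look like mpxlf prog.o -L/some/dir -lA -lB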
4. Calling Fortran Libraries from C/C++
- Fortran uses column-major, C/C++ uses row-major array storage
- In C array indices begin with 0, but in Fortran they begin with 1
- Pass arguments by reference
- Pass characters as strings
- In C, prototype the functions to be called as follows
- extern fortran_func(int *n, ...)  (see the sketch below)
- Link with the -lxlf90_r flag added
- More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
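- A minimal sketch of the points above (not from the slides). It calls a hypothetical Fortran routine SUBROUTINE DSCAL2(N, A, X); the external symbol name is compiler-dependent (lowercase without underscore for IBM XL, usually a trailing underscore with other Unix compilers), so the prototype may need adjusting:

    #include <stdio.h>

    /* hypothetical Fortran routine SUBROUTINE DSCAL2(N, A, X);
       IBM XL naming convention assumed: lowercase, no trailing underscore.
       Compile the Fortran side separately and link with -lxlf90_r. */
    extern void dscal2(int *n, double *a, double *x);

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};   /* 1D array, so no layout issue */
        int    n = 4;
        double a = 2.0;
        dscal2(&n, &a, x);                    /* all arguments passed by reference */
        printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
        return 0;
    }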
5. Third-party libraries
- Advantages
- Portable code
- Extended set of features
- Disadvantages
- Suboptimal performance
- Sometimes costly, though many are free
- Sometimes, need to install yourself
- A good set is already installed on SDSC systems
- Look in /usr/local/apps directory for 64-bit
apps, in /usr/local/apps32 for 32-bit apps - Examples: ScaLapack, PETSc, FFTW, NAG
6. Vendor-supplied libraries (specific to a given
architecture)
- Advantages
- Highly optimized performance (for a given
processor, memory system etc) - Installed, tested, supported
- Usually free for the user
- Disadvantages
- Sometimes code not portable to other platforms
- Some features may not be implemented
- Examples on IBM: ESSL, MASS
- Examples on Intel: MKL
7. MASS: Mathematical Acceleration SubSystem
- IBM product
- Scalar Library Functions
- Replace standard system intrinsics from libm
- Call from Fortran or C
- No source code modification required
- Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log, pow (C/C++), x**y (Fortran)
- Provide better performance
- Sometimes less accurate results (usually in the last bit)
- Linking: -L/usr/local/apps/mass -lmass (see the sketch below)
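- A minimal sketch (not from the slides) of the "no source code modification" point: ordinary libm calls are simply resolved from libmass at link time when -lmass precedes -lm:

    #include <math.h>
    #include <stdio.h>

    /* build e.g.:  xlc prog.c -L/usr/local/apps/mass -lmass -lm  */
    int main(void)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < 1000; i++)
            s += exp(0.001 * i) + sqrt((double) i);   /* resolved from libmass */
        printf("s = %f\n", s);
        return 0;
    }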
8. Scalar MASS Performance (cycles per call, length-1000 loop)

Columns per processor: libm cycles, MASS cycles, speedup (libm/MASS)

Function  Range |      604e       |      pwr3       |      pwr4       |      pwr5
acos      B     |   91   91  1.0  |   70   69  1.0  |  108  116  0.9  |  113  115  1.0
acosh     G     |  318  182  1.7  |  256  159  1.6  |  483  253  1.9  |  527  248  2.1
asin      B     |   84   83  1.0  |   63   63  1.0  |  100  101  1.0  |  112  116  1.0
asinh     D     |  352  232  1.5  |  264  178  1.5  |  463  301  1.5  |  482  286  1.7
atan2     D     |  601  103  5.8  |  441   79  5.6  |  754  100  7.5  |  847   95  8.9
atanh     B     |  244  143  1.7  |  186  116  1.6  |  300  169  1.8  |  312  165  1.9
cbrt      D     |  279  282  1.0  |  216  216  1.0  |  328  333  1.0  |  335  336  1.0
cos       B     |   53   25  2.1  |   38   16  2.4  |   53   21  2.5  |   65   29  2.2
cos       D     |   76   45  1.7  |   58   37  1.6  |   74   58  1.3  |   86   60  1.4
cosh      D     |  191   45  4.2  |  154   32  4.8  |  237   46  5.2  |  255   50  5.1
cosisin   B     |  111   58  1.9  |   82   34  2.4  |  101   46  2.2  |  123   50  2.5
cosisin   D     |  161   97  1.7  |  125   77  1.6  |  151  102  1.5  |  170  105  1.6
div       D     |   32   32  1.0  |  8.8  8.8  1.0  |   28   28  1.0  |   29   29  1.0
exp       D     |  120   35  3.4  |  106   27  3.9  |  162   35  4.6  |  168   40  4.2
expm1     D     |  133   41  3.2  |  111   28  4.0  |  218   43  5.1  |  198   43  4.6
9. Scalar MASS Performance (cycles per call, length-1000 loop), cont.

Columns per processor: libm cycles, MASS cycles, speedup (libm/MASS)

Function  Range |      604e       |      pwr3       |      pwr4       |      pwr5
log       C     |  134   55  2.4  |  105   52  2.0  |  204   84  2.4  |  210   86  2.4
log10     C     |  133   56  2.4  |  107   57  1.9  |  206   90  2.3  |  207   92  2.3
log1p     H     |  157   58  2.7  |  124   47  2.6  |  201   68  3.0  |  220   70  3.1
pow       C     |  309  113  2.7  |  235   90  2.6  |  453  146  3.1  |  463  150  3.1
qdrt      C     |  134   98  1.4  |  116   88  1.3  |  250  156  1.6  |  283  159  1.8
rcbrt     D     |  309  309  1.0  |  236  237  1.0  |  354  354  1.0  |  362  363  1.0
rec       D     |   32   32  1.0  |  9.2  8.8  1.0  |   29   29  1.0  |   29   29  1.0
rqdrt     C     |  149  100  1.5  |  126   95  1.3  |  243  155  1.6  |  276  153  1.8
rsqrt     C     |   86   52  1.7  |   70   52  1.3  |  120   73  1.6  |  156   74  2.1
sin       B     |   52   27  1.9  |   37   16  2.3  |   54   24  2.3  |   66   32  2.1
sin       D     |   78   45  1.7  |   61   37  1.6  |   77   57  1.4  |   82   60  1.4
sincos    B     |  108   56  1.9  |   80   34  2.4  |  101   48  2.1  |  121   54  2.2
sinh      D     |  250   50  5.0  |  197   35  5.6  |  345   52  6.6  |  329   55  6.0
sqrt      C     |   69   50  1.4  |   59   46  1.3  |  128   72  1.8  |  143   73  2.0
tan       D     |  137   72  1.9  |  112   43  2.6  |  194   62  3.1  |  183   64  2.9
tanh      F     |  256   64  4.0  |  199   47  4.2  |  357   65  5.5  |  333   67  5.0
10. Vector MASS library - MASSV
- Provides optimized versions of intrinsic functions for vectorized operations
- Example (see the sketch below)
- for (i = 0; i < N; i++)
-     Z[i] = exp(x[i]);
- Replace with
- vexp(Z, x, &N);
- Takes advantage of pipelining
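- A minimal compilable sketch (not from the slides); the C prototype is an assumption (MASSV follows the Fortran convention of passing the vector length by address), so check the MASSV header/documentation on your system:

    #include <stdio.h>

    void vexp(double y[], double x[], int *n);   /* assumed MASSV prototype */

    /* build e.g.:  xlc prog.c -L/usr/local/apps/mass -lmassv  */
    int main(void)
    {
        double x[1000], y[1000];
        int i, n = 1000;
        for (i = 0; i < n; i++) x[i] = 0.001 * i;
        vexp(y, x, &n);                    /* y[i] = exp(x[i]) in one call */
        printf("y[999] = %f\n", y[999]);
        return 0;
    }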
11. Vector MASS library, cont.
- Some source code modification required
- Note: accuracy may differ even from scalar MASS functions (usually in the last bit)
- Library source code available for portability to other platforms
- Linking: -lmassvp4 on POWER4 platforms, otherwise -lmassv
- MASS URL: http://www-306.ibm.com/software/awdtools/mass/
12. Vector MASS Performance -- double precision functions (cycles per evaluation, length-1000 loop)

Columns per processor: libm cycles, MASSV cycles, speedup (libm/MASSV)

Function  Range |      604e        |      pwr3        |      pwr4        |      pwr5
vacos     B     |   91   41   2.2  |   70   17   4.1  |  108   24   4.5  |  113   23   4.9
vacosh    G     |  318   39   8.2  |  256   15  17.1  |  483   21  23.0  |  527   19  27.7
vasin     B     |   84   41   2.0  |   63   17   3.7  |  100   24   4.2  |  112   23   4.9
vasinh    D     |  352   44   8.0  |  264   18  14.7  |  463   23  20.1  |  482   22  21.9
vatan2    D     |  601   40  15.0  |  441   27  16.3  |  754   60  12.6  |  847   57  14.9
vatanh    B     |  244   41   6.0  |  186   16  11.6  |  300   21  14.3  |  312   19  16.4
vcbrt     D     |  279   20  14.0  |  216  8.8  24.5  |  328   11  29.8  |  335   12  27.9
vcos      B     |   53  9.3   5.7  |   38    4   9.5  |   53  6.5   8.2  |   65  6.5  10.0
vcos      D     |   76   27   2.8  |   58   12   4.8  |   74   20   3.7  |   86   20   4.3
vcosh     D     |  191   25   7.6  |  154  9.2  16.7  |  237   14  16.9  |  255   13  19.6
vcosisin  B     |  111   19   5.8  |   82    8  10.3  |  101   13   7.8  |  123   12  10.3
vcosisin  D     |  161   29   5.6  |  125   12  10.4  |  151   21   7.2  |  170   20   8.5
vdiv      D     |   32   11   2.9  |  8.8  6.8   1.3  |   28   13   2.2  |   29   13   2.2
vexp      D     |  120   16   7.5  |  106  6.4  16.6  |  162   14  11.6  |  168   13  12.9
vexpm1    D     |  133   19   7.0  |  111    8  13.9  |  218   12  18.2  |  198   12  16.5
vlog      C     |  134   20   6.7  |  105  7.2  14.6  |  204  9.7  21.0  |  210  9.5  22.1
13. Vector MASS Performance -- double precision functions (cycles per evaluation, length-1000 loop), cont.

Columns per processor: libm cycles, MASSV cycles, speedup (libm/MASSV)

Function  Range |      604e        |      pwr3        |      pwr4        |      pwr5
vlog10    C     |  133   19   7.0  |  107  7.6  14.1  |  206   10  20.6  |  207  9.9  20.9
vlog1p    H     |  157   26   6.0  |  124   11  11.3  |  201   13  15.5  |  220   12  18.3
vpow      C     |  309   52   5.9  |  235   20  11.8  |  453   29  15.6  |  463   30  15.4
vqdrt     C     |  134   15   8.9  |  116    6  19.3  |  250  8.2  30.5  |  283    8  35.4
vrcbrt    D     |  309   20  15.5  |  236  8.8  26.8  |  354   11  32.2  |  362   11  32.9
vrec      D     |   32   10   3.2  |  9.2  6.4   1.4  |   29   12   2.4  |   29   11   2.6
vrqdrt    C     |  149   14  10.6  |  126    6  21.0  |  243  7.8  31.2  |  276    7  39.4
vrsqrt    C     |   86   16   5.4  |   70  8.8   8.0  |  120   18   6.7  |  156   17   9.2
vsin      B     |   52   11   4.7  |   37  4.8   7.7  |   54  7.2   7.5  |   66  6.9   9.6
vsin      D     |   78   27   2.9  |   61   12   5.1  |   77   20   3.9  |   82   19   4.3
vsinh     D     |  250   27   9.3  |  197   10  19.7  |  345   15  23.0  |  329   13  25.3
vsqrt     C     |   69   16   4.3  |   59  8.8   6.7  |  128   17   7.5  |  143   17   8.4
vtan      D     |  137   31   4.4  |  112   13   8.6  |  194   19  10.2  |  183   17  10.8
vtanh     F     |  256   35   7.3  |  199   13  15.3  |  357   19  18.8  |  333   18  18.5
14. ScaLapack
- Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack
- Developing team from University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc.
- Support in Commercial Packages
- NAG Parallel Library
- IBM PESSL
- CRAY Scientific Library
- VNI IMSL
- Fujitsu, HP/Convex, Hitachi, NEC
- Handles dense and banded matrices
15. Serial vs. Parallel
16. BLAS: Basic Linear Algebra Subprograms
17. BLAS, cont.
- Use blocked operations
- Optimized for a given memory hierarchy, e.g. good use of cache
- Other libraries build on top of BLAS
- On any given system, try to use a
system-optimized version
18. LAPACK
- Linear Algebra PACKage
- System of linear equations
- Eigenvalue problems
- Built on top of BLAS
- On SDSC systems, installed at /usr/local/apps/LAPACK
19. BLACS: Basic Linear Algebra Communication Subprograms
20. BLACS basics
- nprow = 1, npcol = 3
- order = 'r', zero = 0
- call blacs_get(zero, zero, icontxt)
- call blacs_gridinit(icontxt, order, nprow, npcol)
- call blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol)
- icontxt: like MPI_COMM_WORLD
- blacs_gridinit specifies the topology
- blacs_gridinfo gets processor information (myrow, mycol)
- Here we define a 3-element (1 x 3) array of processors
21. PBLAS
- Parallel version of BLAS
- Built on top of BLAS and BLACS
22. ScaLapack data layout
23. Block-Cyclic Partitioning (1D)
24. Block-cyclic partitioning (2D)
25. Block-cyclic partitioning (2D), cont.
26. ScaLapack Array Descriptors
27. Syntax for descinit
descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)

  In/Out  Arg    Description
  OUT     idesc  Descriptor
  IN      m      Row Size (Global)
  IN      n      Column Size (Global)
  IN      mb     Row Block Size
  IN      nb     Column Block Size
  IN      i      Starting Grid Row
  IN      j      Starting Grid Column
  IN      icon   BLACS context
  IN      mla    Leading Dimension of Local Matrix
  OUT     ier    Error number

The starting grid location is usually (i=0, j=0); a small worked example follows.
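- For example (illustrative values, not from the slides): for a 1000 x 1000 global matrix distributed in 64 x 64 blocks starting at grid position (0,0), one would call descinit(desca, 1000, 1000, 64, 64, 0, 0, icontxt, lld, info), where lld is the leading dimension of the local piece of the matrix (it can be computed with the ScaLAPACK utility function NUMROC).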
28. ScaLapack functionality
29. ScaLapack API
- Prepend LAPACK equivalent names with P: routine names have the form PXYYZZZ
- X: data type
- YY: matrix type
- ZZZ: computation performed
30. ScaLapack API, cont.
- Matrix Types (YY) in PXYYZZZ
- DB General Band (diagonally dominant-like)
- DT general Tridiagonal (Diagonally
dominant-like) - GB General Band
- GE GEneral matrices (e.g., unsymmetric,
rectangular, etc.) - GG General matrices, Generalized problem
- HE Complex Hermitian
- OR Orthogonal Real
- PB Positive definite Banded (symmetric or
Hermitian) - PO Positive definite (symmetric or Hermitian)
- PT Positive definite Tridiagonal (symmetric or
Hermitian) - ST Symmetric Tridiagonal Real
- SY SYmmetric
- TR TRiangular (possibly quasi-triangular)
- TZ TrapeZoidal
- UN UNitary complex
31. ScaLapack API, cont.
Drivers (ZZZ) in PXYYZZZ
- SL Linear Equations (SVX)
- SV LU Solver
- VD Singular Value
- EV Eigenvalue (EVX)
- GVX Generalized Eigenvalue
- Estimates condition numbers.
- Provides Error Bounds.
- Equilibrates System.
- Selected Eigenvalues
32. ScaLapack function call example
- Solving Ax = B for a general square matrix
- A (N x N), B (N x NRHS)
- CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
- Global indices: IA = JA = IB = JB = 1
- IPIV: integer array (N), on output containing the permutation (row i of the matrix was interchanged with row IPIV(i))
33. ScaLapack performance tips
- Use native BLAS
- Choose a suitable number of processors, so as not to make the local matrix size too large (1000 x 1000 is about right)
- Try to use a square processor grid (Nrow = Ncol)
- Block size: 64
34. PETSc
- Portable, Extensible Toolkit for Scientific Computation
- Intended to facilitate the scalable (parallel) solution of linear and non-linear partial differential equations
- Focus on problems discretized using semi- or fully implicit methods
- Developed by Argonne National Laboratory, MCS
35. PETSc
- Fully usable from Fortran, C, and C++
- Uses MPI for all parallel communication
- Object-oriented
- Designed for both efficiency and ease of use
- Available freely from http://www.mcs.anl.gov/petsc
- Installed on Datastar at /usr/local/apps/petsc-2.3.1
36. PETSc features
- Parallel Vectors and discrete functions
- Distributed arrays (for regular grid problems)
- Parallel vector scatter/gathers (for unstructured
grid problems) - Parallel (sparse) matrices
- Parallel Krylov subspace methods
- Parallel preconditioners
- Parallel (Newton-based) nonlinear solvers
- Parallel time-stepping code
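- A minimal sketch (not from the slides) of the parallel vector objects listed above, written against the current PETSc C interface; some call signatures (e.g. VecSet, VecAXPY, VecDestroy) have changed between releases, so check the manual for the version installed (2.3.1 on Datastar):

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec       x, y;
        PetscReal nrm;
        PetscInt  n = 100;

        PetscInitialize(&argc, &argv, NULL, NULL);

        VecCreate(PETSC_COMM_WORLD, &x);   /* parallel vector of global size n */
        VecSetSizes(x, PETSC_DECIDE, n);
        VecSetFromOptions(x);
        VecDuplicate(x, &y);

        VecSet(x, 1.0);                    /* x(i) = 1 */
        VecSet(y, 2.0);                    /* y(i) = 2 */
        VecAXPY(y, 3.0, x);                /* y = y + 3*x */
        VecNorm(y, NORM_2, &nrm);
        PetscPrintf(PETSC_COMM_WORLD, "||y|| = %g\n", (double) nrm);

        VecDestroy(&x);
        VecDestroy(&y);
        PetscFinalize();
        return 0;
    }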
37. PETSc Numerical components
- (diagram of the PETSc numerical objects: Vectors, etc.)
38. MPI features used
- Communicators and attributes to manage messages inside PETSc objects
- Nonblocking sends and receives to overlap communication with computation
- Global reduction operations
- Derived data types
39. NAG: Numerical Algorithms Group
- Proprietary package, currently installed on
Datastar in /usr/local/apps/nag (64-bit only) - Differential Equations
- Linear Equations
- Interpolation, Curve and Surface Fitting
- Linear Algebra
- Eigensystem Analysis
- Statistical Analysis
- Random Number Generator
- Fourier Transforms, Summation of Series
- Time series analysis
- Mesh generation
40. ESSL: Engineering and Scientific Subroutine
Library
- IBM product optimized for IBM platforms
- Over 400 high performance mathematical functions
- C/C++ and Fortran, 32 and 64 bit, thread-safe
- API is Fortran-based. From C/C++ one must use Fortran calling conventions, e.g. column-major arrays
- Linking: -lessl
- ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf
41. ESSL features
- Numerical function categories:
- Linear Algebra - vector-scalar, sparse
vector-scalar, matrix-vector, sparse
matrix-vector - Matrix Operations - addition, subtraction,
multiplication, rank-k updates, transpose - Linear Algebra Equation solvers - dense, banded,
sparse, linear least-squares - Eigensystem Analysis - standard, generalized
- Signal Processing - Fourier transform,
convolutions, correlations - Sorting and Searching
- Interpolation - polynomial, cubic spline
- Numerical Quadrature
- Random Number Generation
- Utilities
42. ESSL example: 1D real-to-complex Fast Fourier Transform (FFT)
- #include <essl.h>
- main()
- {
-   /* M rows of length-N real data; M, N defined elsewhere */
-   double R[M][N], *work1, *work2;
-   cmplx C[M][N/2+1];
-   int nwork1, nwork2;
-   Init(R,N,M);   /* fill the input array (user routine) */
-   /* nwork1, nwork2: aux work sizes, set per the ESSL documentation */
-   work1 = malloc(nwork1*sizeof(double));
-   work2 = malloc(nwork2*sizeof(double));
-   drcft(1,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2);  /* initialize */
-   drcft(0,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2);  /* transform */
- }
43. PESSL
- Invokes ESSL
- Uses MPI for communication
- PESSL Categories
- Subset of Level 2, 3 PBLAS
- Subset of ScaLAPACK (dense, banded)
- Sparse Routines
- Subset of ScaLAPACK Eigensystem Routines
- Fourier Transforms
- Uniform Random Number Generation
- BLACS
- Link with -lpessl
44. FFTW: Fastest Fourier Transform in the West
- Third party software, free: http://www.fftw.org
- Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s
- 1, 2, and 3 dimensional FFTs, and related functions (sin/cos transforms, convolution etc.)
- Uses plans (see the sketch after this list)
- Initialize transform parameters (create a plan)
- long integer plan1 (the plan handle; fftw_plan in C)
- plan1 = fftw_plan_...( N, M, ... )
- fftw_execute(plan1)
- Work arrays allocated and initialized behind the scenes
- ESTIMATE, MEASURE, PATIENT initialization modes
- +'s
- Portable, free
- -'s
- Performance about 3 times worse than ESSL's FFTs!
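- A minimal FFTW3 sketch (not from the slides): a 1D real-to-complex transform in double precision; link with -lfftw3:

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        int i, n = 16;
        double       *in  = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n/2 + 1));

        /* create the plan first (planning may overwrite in/out), then fill input */
        fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);
        for (i = 0; i < n; i++) in[i] = i;

        fftw_execute(p);   /* work arrays handled behind the scenes */
        printf("out[1] = %f %+fi\n", out[1][0], out[1][1]);

        fftw_destroy_plan(p);
        fftw_free(in); fftw_free(out);
        return 0;
    }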
45. 3D FFT in parallel
- FFTW version 3 does not have an MPI implementation
- PESSL and other libraries use 1D processor decomposition (slabs/planes)
- Limitation: Np < N
- Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain of 4096^3
- 2D decomposition is crucial for scaling on
Petascale platforms
46. 2D decomposition
- (figure: the computational domain, axes x, y, z)
47. 2D decomposition
- 2D decomposition: processors form a square grid P1 x P2
- Columns (pencils) instead of slabs
- Local geometry:
- Stage 1: N x K x M
- Stage 2: K x N x M
- Stage 3: K x M x N
- K = N/P1, M = N/P2
- At each stage, transform along the local dimension (N)
- Software scaling limit removed! Now the limit is P < N^2
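- For instance (illustrative numbers, not from the slides): with N = 4096 on P = 8192 cores arranged as P1 x P2 = 64 x 128, each task holds K = 4096/64 = 64 by M = 4096/128 = 32 pencils, i.e. a 4096 x 64 x 32 local block at stage 1, which is well beyond the P < N limit of a 1D (slab) decomposition.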
48. Performance of DNS (2D) on Blue Gene at SDSC and IBM's T.J. Watson Lab, and on SDSC's Datastar
- VN: two processors per node
- CO: one processor per node
49. P3DFFT
- Open source library for efficient, highly
scalable 3D FFT on parallel platforms - Built on top of an optimized 1D FFT library
- Currently ESSL or FFTW
- In the future, more libraries
- Uses 2D decomposition, which includes 1D option.
- Developed at SDSC
- Outcome of a Strategic Applications
Collaborations Project (DNS turbulence code) - Available at http://www.sdsc.edu/us/resources/p3dfft.php
50. P3DFFT features
- Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms
- Written in Fortran 90 with MPI
- Implemented as an F90 module
- Single or double precision
- Arbitrary dimensions
- Handles many uneven cases (Ni does not have to be divisible by Pj)
- Assumes Ni > Pj
- Can do either in-place or out-of-place transforms
- In the latter case the input and output arrays should not overlap
- Memory requirements: input and output arrays plus 1 buffer
51. P3DFFT Storage
- R2C transform input:
- Global: (Nx,Ny,Nz) real array
- Local: (Nx, Ny/P1, Nz/P2) real array
- R2C output:
- Global: (Nx/2, Ny, Nz) complex array
- Conjugate symmetry: F(N-i) = F(i)*
- F(N/2+1) through F(N-1) are redundant
- F(0) and F(N/2) are real; pack them in one pseudo-complex pair
- Total: N/2 complex storage units
- Local: (Nx/(2*P1), Ny/P2, Nz) complex
- C2R: input and output interchanged from R2C
52. Building P3DFFT
- Subdirectory build/
- Edit makefile for your architecture
- Specify -DSINGLE_PREC or -DDOUBLE_PREC
- Default is single precision
- Specify -DESSL or -DFFTW
- Choose P3DFFT_ROOT
- Type make; this will compile and install the library in the P3DFFT_ROOT directory
- On SDSC platforms, it is already installed in
/usr/local/apps/p3dfft - Subdirectory sample/
- Contains sample programs
- For the manual, see the README file in the top-level directory
53. Using P3DFFT
- use p3dfft
- Defines mytype: 4- or 8-byte reals
- Initialization: call p3dfft_setup(proc_dims, nx, ny, nz)
- proc_dims(2): two integers, P1 and P2
- For 1D decomposition, set P1 = 1
- Sets up the square 2D grid of MPI communicators in rows and columns
- Initializes local array dimensions
- Local dimensions: call get_dims(istart, iend, isize, Q)
- integer istart(3), iend(3), isize(3)
- Set Q=1 for R2C input, Q=2 for R2C output
54. Using P3DFFT (2)
- Now allocate your arrays
- When ready, call the Fourier Transform:
- p3dfft_ftran_r2c(IN, OUT, flg_inplace): forward transform, exp(-ikx/N)
- p3dfft_btran_c2r(IN, OUT, flg_inplace): backward (inverse) transform, exp(ikx/N)
- flg_inplace: boolean, true if doing an in-place transform
55. Future work
- Include more array distribution schemes
- Expand options for memory ordering
- Include more transform types
- Derivatives
- Complex-to-complex
- C interface
- Convenience functions
- Break down in stages
- Expand choice of 1D FFT libraries
- Questions? Comments?
- dmitry@sdsc.edu
56. Acknowledgements
- Amit Majumdar, SDSC
- ScaLapack development team: http://www.netlib.org/scalapack
- ACTS tools presentation from NERSC: http://acts.nersc.gov
57. Lab assignment
- Copy the tar file ~dmitry/scalapack_examples.tar
- Untar
- Type make
- Test the example programs on several processor
counts - do they scale? - Modify source code to vary
- matrix size
- block sizes
- Rewrite example1.f in C or C++