Title: Numerical Libraries in High Performance Computing
1. Numerical Libraries in High Performance Computing
- Dmitry Pekurovsky
- dmitry@sdsc.edu
- CIEG workshop, April 25 2007
2. Tutorial Outline
- General comments
- Overview of commonly used libraries
- MASS
- ScaLAPACK and its components
- BLAS, BLACS, PBLAS, LAPACK
- PETSc
- NAG
- ESSL, PESSL
- FFTW
- P3DFFT
- Lab assignment
3. Numerical libraries
- Use numerical (a.k.a. math or scientific)
libraries whenever possible - Minimize code development effort
- Improved performance
- Robust, accurate, up-to-date numerical algorithms
- Collection (archive) of precompiled routines
- Use nm -a to look inside if needed
- Linking: mpxlf ... /lib/loc/libnum.a OR
- mpxlf ... -L/lib/loc -lnum
- Reminder: order of linking matters
- In case two libraries contain the same routine name, the first one to appear on the linking command line gets used
- -l options should come after the object files (.o) of your source
- If library A calls routines from library B, -lA has to come before -lB (see the example line below)
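- For example (hypothetical library names, not from the slides): if libA.a in /some/dir calls routines in libB.a, the link line could look like mpxlf prog.o -L/some/dir -lA -lB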
4. Calling Fortran Libraries from C/C++
- Fortran uses column-major, C/C++ uses row-major array storage
- In C array indices begin with 0, but in Fortran they begin with 1
- Pass arguments by reference
- Pass characters as strings
- In C, prototype the functions to be called as follows
- extern fortran_func(int *n, ...)  (see the sketch below)
- Link with the -lxlf90_r flag added
- More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
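- A minimal sketch of the points above (not from the slides). It calls a hypothetical Fortran routine SUBROUTINE DSCAL2(N, A, X); the external symbol name is compiler-dependent (lowercase without underscore for IBM XL, usually a trailing underscore with other Unix compilers), so the prototype may need adjusting:

    #include <stdio.h>

    /* hypothetical Fortran routine SUBROUTINE DSCAL2(N, A, X);
       IBM XL naming convention assumed: lowercase, no trailing underscore.
       Compile the Fortran side separately and link with -lxlf90_r. */
    extern void dscal2(int *n, double *a, double *x);

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};   /* 1D array, so no layout issue */
        int    n = 4;
        double a = 2.0;
        dscal2(&n, &a, x);                    /* all arguments passed by reference */
        printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
        return 0;
    }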
5. Third-party libraries
- Advantages
- Portable code
- Extended set of features
- Disadvantages
- Suboptimal performance
- Sometimes costly, though many are free
- Sometimes, need to install yourself
- A good set is already installed on SDSC systems
- Look in /usr/local/apps directory for 64-bit
apps, in /usr/local/apps32 for 32-bit apps - Examples: ScaLapack, PETSc, FFTW, NAG
6. Vendor-supplied libraries (specific to a given
architecture)
- Advantages
- Highly optimized performance (for a given
processor, memory system etc) - Installed, tested, supported
- Usually free for the user
- Disadvantages
- Sometimes code not portable to other platforms
- Some features may not be implemented
- Examples on IBM: ESSL, MASS
- Examples on Intel: MKL
7. MASS: Mathematical Acceleration SubSystem
- IBM product
- Scalar Library Functions
- Replace standard system intrinsics from libm
- Call from Fortran or C
- No source code modification required
- Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log, pow (C/C++), x**y (Fortran)
- Provide better performance
- Sometimes less accurate results (usually in the last bit)
- Linking: -L/usr/local/apps/mass -lmass (see the sketch below)
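- A minimal sketch (not from the slides) of the "no source code modification" point: ordinary libm calls are simply resolved from libmass at link time when -lmass precedes -lm:

    #include <math.h>
    #include <stdio.h>

    /* build e.g.:  xlc prog.c -L/usr/local/apps/mass -lmass -lm  */
    int main(void)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < 1000; i++)
            s += exp(0.001 * i) + sqrt((double) i);   /* resolved from libmass */
        printf("s = %f\n", s);
        return 0;
    }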
8. Scalar MASS Performance (cycles per call, length-1000 loop)

Columns per processor: libm cycles, MASS cycles, speedup (libm/MASS)

Function  Range |      604e       |      pwr3       |      pwr4       |      pwr5
acos      B     |   91   91  1.0  |   70   69  1.0  |  108  116  0.9  |  113  115  1.0
acosh     G     |  318  182  1.7  |  256  159  1.6  |  483  253  1.9  |  527  248  2.1
asin      B     |   84   83  1.0  |   63   63  1.0  |  100  101  1.0  |  112  116  1.0
asinh     D     |  352  232  1.5  |  264  178  1.5  |  463  301  1.5  |  482  286  1.7
atan2     D     |  601  103  5.8  |  441   79  5.6  |  754  100  7.5  |  847   95  8.9
atanh     B     |  244  143  1.7  |  186  116  1.6  |  300  169  1.8  |  312  165  1.9
cbrt      D     |  279  282  1.0  |  216  216  1.0  |  328  333  1.0  |  335  336  1.0
cos       B     |   53   25  2.1  |   38   16  2.4  |   53   21  2.5  |   65   29  2.2
cos       D     |   76   45  1.7  |   58   37  1.6  |   74   58  1.3  |   86   60  1.4
cosh      D     |  191   45  4.2  |  154   32  4.8  |  237   46  5.2  |  255   50  5.1
cosisin   B     |  111   58  1.9  |   82   34  2.4  |  101   46  2.2  |  123   50  2.5
cosisin   D     |  161   97  1.7  |  125   77  1.6  |  151  102  1.5  |  170  105  1.6
div       D     |   32   32  1.0  |  8.8  8.8  1.0  |   28   28  1.0  |   29   29  1.0
exp       D     |  120   35  3.4  |  106   27  3.9  |  162   35  4.6  |  168   40  4.2
expm1     D     |  133   41  3.2  |  111   28  4.0  |  218   43  5.1  |  198   43  4.6
9. Scalar MASS Performance (cycles per call, length-1000 loop), cont.

Columns per processor: libm cycles, MASS cycles, speedup (libm/MASS)

Function  Range |      604e       |      pwr3       |      pwr4       |      pwr5
log       C     |  134   55  2.4  |  105   52  2.0  |  204   84  2.4  |  210   86  2.4
log10     C     |  133   56  2.4  |  107   57  1.9  |  206   90  2.3  |  207   92  2.3
log1p     H     |  157   58  2.7  |  124   47  2.6  |  201   68  3.0  |  220   70  3.1
pow       C     |  309  113  2.7  |  235   90  2.6  |  453  146  3.1  |  463  150  3.1
qdrt      C     |  134   98  1.4  |  116   88  1.3  |  250  156  1.6  |  283  159  1.8
rcbrt     D     |  309  309  1.0  |  236  237  1.0  |  354  354  1.0  |  362  363  1.0
rec       D     |   32   32  1.0  |  9.2  8.8  1.0  |   29   29  1.0  |   29   29  1.0
rqdrt     C     |  149  100  1.5  |  126   95  1.3  |  243  155  1.6  |  276  153  1.8
rsqrt     C     |   86   52  1.7  |   70   52  1.3  |  120   73  1.6  |  156   74  2.1
sin       B     |   52   27  1.9  |   37   16  2.3  |   54   24  2.3  |   66   32  2.1
sin       D     |   78   45  1.7  |   61   37  1.6  |   77   57  1.4  |   82   60  1.4
sincos    B     |  108   56  1.9  |   80   34  2.4  |  101   48  2.1  |  121   54  2.2
sinh      D     |  250   50  5.0  |  197   35  5.6  |  345   52  6.6  |  329   55  6.0
sqrt      C     |   69   50  1.4  |   59   46  1.3  |  128   72  1.8  |  143   73  2.0
tan       D     |  137   72  1.9  |  112   43  2.6  |  194   62  3.1  |  183   64  2.9
tanh      F     |  256   64  4.0  |  199   47  4.2  |  357   65  5.5  |  333   67  5.0
10. Vector MASS library - MASSV
- Provides optimized versions of intrinsic functions for vectorized operations
- Example (see the sketch below)
- for (i = 0; i < N; i++)
-     Z[i] = exp(x[i]);
- Replace with
- vexp(Z, x, &N);
- Takes advantage of pipelining
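- A minimal compilable sketch (not from the slides); the C prototype is an assumption (MASSV follows the Fortran convention of passing the vector length by address), so check the MASSV header/documentation on your system:

    #include <stdio.h>

    void vexp(double y[], double x[], int *n);   /* assumed MASSV prototype */

    /* build e.g.:  xlc prog.c -L/usr/local/apps/mass -lmassv  */
    int main(void)
    {
        double x[1000], y[1000];
        int i, n = 1000;
        for (i = 0; i < n; i++) x[i] = 0.001 * i;
        vexp(y, x, &n);                    /* y[i] = exp(x[i]) in one call */
        printf("y[999] = %f\n", y[999]);
        return 0;
    }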
11. Vector MASS library, cont.
- Some source code modification required
- Note: accuracy may differ even from scalar MASS functions (usually in the last bit)
- Library source code available for portability to other platforms
- Linking: -lmassvp4 on POWER4 platforms, otherwise -lmassv
- MASS URL: http://www-306.ibm.com/software/awdtools/mass/
12. Vector MASS Performance -- double precision functions (cycles per evaluation, length-1000 loop)

Columns per processor: libm cycles, MASSV cycles, speedup (libm/MASSV)

Function  Range |      604e        |      pwr3        |      pwr4        |      pwr5
vacos     B     |   91   41   2.2  |   70   17   4.1  |  108   24   4.5  |  113   23   4.9
vacosh    G     |  318   39   8.2  |  256   15  17.1  |  483   21  23.0  |  527   19  27.7
vasin     B     |   84   41   2.0  |   63   17   3.7  |  100   24   4.2  |  112   23   4.9
vasinh    D     |  352   44   8.0  |  264   18  14.7  |  463   23  20.1  |  482   22  21.9
vatan2    D     |  601   40  15.0  |  441   27  16.3  |  754   60  12.6  |  847   57  14.9
vatanh    B     |  244   41   6.0  |  186   16  11.6  |  300   21  14.3  |  312   19  16.4
vcbrt     D     |  279   20  14.0  |  216  8.8  24.5  |  328   11  29.8  |  335   12  27.9
vcos      B     |   53  9.3   5.7  |   38    4   9.5  |   53  6.5   8.2  |   65  6.5  10.0
vcos      D     |   76   27   2.8  |   58   12   4.8  |   74   20   3.7  |   86   20   4.3
vcosh     D     |  191   25   7.6  |  154  9.2  16.7  |  237   14  16.9  |  255   13  19.6
vcosisin  B     |  111   19   5.8  |   82    8  10.3  |  101   13   7.8  |  123   12  10.3
vcosisin  D     |  161   29   5.6  |  125   12  10.4  |  151   21   7.2  |  170   20   8.5
vdiv      D     |   32   11   2.9  |  8.8  6.8   1.3  |   28   13   2.2  |   29   13   2.2
vexp      D     |  120   16   7.5  |  106  6.4  16.6  |  162   14  11.6  |  168   13  12.9
vexpm1    D     |  133   19   7.0  |  111    8  13.9  |  218   12  18.2  |  198   12  16.5
vlog      C     |  134   20   6.7  |  105  7.2  14.6  |  204  9.7  21.0  |  210  9.5  22.1
13. Vector MASS Performance -- double precision functions (cycles per evaluation, length-1000 loop), cont.

Columns per processor: libm cycles, MASSV cycles, speedup (libm/MASSV)

Function  Range |      604e        |      pwr3        |      pwr4        |      pwr5
vlog10    C     |  133   19   7.0  |  107  7.6  14.1  |  206   10  20.6  |  207  9.9  20.9
vlog1p    H     |  157   26   6.0  |  124   11  11.3  |  201   13  15.5  |  220   12  18.3
vpow      C     |  309   52   5.9  |  235   20  11.8  |  453   29  15.6  |  463   30  15.4
vqdrt     C     |  134   15   8.9  |  116    6  19.3  |  250  8.2  30.5  |  283    8  35.4
vrcbrt    D     |  309   20  15.5  |  236  8.8  26.8  |  354   11  32.2  |  362   11  32.9
vrec      D     |   32   10   3.2  |  9.2  6.4   1.4  |   29   12   2.4  |   29   11   2.6
vrqdrt    C     |  149   14  10.6  |  126    6  21.0  |  243  7.8  31.2  |  276    7  39.4
vrsqrt    C     |   86   16   5.4  |   70  8.8   8.0  |  120   18   6.7  |  156   17   9.2
vsin      B     |   52   11   4.7  |   37  4.8   7.7  |   54  7.2   7.5  |   66  6.9   9.6
vsin      D     |   78   27   2.9  |   61   12   5.1  |   77   20   3.9  |   82   19   4.3
vsinh     D     |  250   27   9.3  |  197   10  19.7  |  345   15  23.0  |  329   13  25.3
vsqrt     C     |   69   16   4.3  |   59  8.8   6.7  |  128   17   7.5  |  143   17   8.4
vtan      D     |  137   31   4.4  |  112   13   8.6  |  194   19  10.2  |  183   17  10.8
vtanh     F     |  256   35   7.3  |  199   13  15.3  |  357   19  18.8  |  333   18  18.5
14. ScaLapack
- Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack
- Developing team from University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc.
- Support in Commercial Packages
- NAG Parallel Library
- IBM PESSL
- CRAY Scientific Library
- VNI IMSL
- Fujitsu, HP/Convex, Hitachi, NEC
- Handles dense and banded matrices
15. Serial vs. Parallel
16. BLAS: Basic Linear Algebra Subprograms
17. BLAS, cont.
- Use blocked operations
- Optimized for a given memory hierarchy, e.g. good use of cache
- Other libraries build on top of BLAS
- On any given system, try to use a
system-optimized version
18. LAPACK
- Linear Algebra PACKage
- System of linear equations
- Eigenvalue problems
- Built on top of BLAS
- On SDSC systems, installed at /usr/local/apps/LAPACK
19. BLACS: Basic Linear Algebra Communication Subprograms
20. BLACS basics
- nprow = 1, npcol = 3
- order = 'r', zero = 0
- call blacs_get(zero, zero, icontxt)
- call blacs_gridinit(icontxt, order, nprow, npcol)
- call blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol)
- icontxt: like MPI_COMM_WORLD
- blacs_gridinit specifies the topology
- blacs_gridinfo gets processor information (myrow, mycol)
- Here we define a 3-element (1 x 3) array of processors
21. PBLAS
- Parallel version of BLAS
- Built on top of BLAS and BLACS
22. ScaLapack data layout
23. Block-Cyclic Partitioning (1D)
24. Block-cyclic partitioning (2D)
25. Block-cyclic partitioning (2D), cont.
26. ScaLapack Array Descriptors
27. Syntax for descinit
descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)

  In/Out  Arg    Description
  OUT     idesc  Descriptor
  IN      m      Row Size (Global)
  IN      n      Column Size (Global)
  IN      mb     Row Block Size
  IN      nb     Column Block Size
  IN      i      Starting Grid Row
  IN      j      Starting Grid Column
  IN      icon   BLACS context
  IN      mla    Leading Dimension of Local Matrix
  OUT     ier    Error number

The starting grid location is usually (i=0, j=0); a small worked example follows.
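- For example (illustrative values, not from the slides): for a 1000 x 1000 global matrix distributed in 64 x 64 blocks starting at grid position (0,0), one would call descinit(desca, 1000, 1000, 64, 64, 0, 0, icontxt, lld, info), where lld is the leading dimension of the local piece of the matrix (it can be computed with the ScaLAPACK utility function NUMROC).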
28. ScaLapack functionality
29. ScaLapack API
- Prepend LAPACK equivalent names with P: routine names have the form PXYYZZZ
- X: data type
- YY: matrix type
- ZZZ: computation performed
30. ScaLapack API, cont.
- Matrix Types (YY) in PXYYZZZ
- DB General Band (diagonally dominant-like)
- DT general Tridiagonal (Diagonally
dominant-like) - GB General Band
- GE GEneral matrices (e.g., unsymmetric,
rectangular, etc.) - GG General matrices, Generalized problem
- HE Complex Hermitian
- OR Orthogonal Real
- PB Positive definite Banded (symmetric or
Hermitian) - PO Positive definite (symmetric or Hermitian)
- PT Positive definite Tridiagonal (symmetric or
Hermitian) - ST Symmetric Tridiagonal Real
- SY SYmmetric
- TR TRiangular (possibly quasi-triangular)
- TZ TrapeZoidal
- UN UNitary complex
31. ScaLapack API, cont.
Drivers (ZZZ) in PXYYZZZ
- SL Linear Equations (SVX)
- SV LU Solver
- VD Singular Value
- EV Eigenvalue (EVX)
- GVX Generalized Eigenvalue
- Estimates condition numbers.
- Provides Error Bounds.
- Equilibrates System.
- Selected Eigenvalues
32. ScaLapack function call example
- Solving Ax = B for a general square matrix
- A (N x N), B (N x NRHS)
- CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
- Global indices: IA = JA = IB = JB = 1
- IPIV: integer array (N), on output containing the permutation (row i of the matrix was interchanged with row IPIV(i))
33. ScaLapack performance tips
- Use native BLAS
- Choose a suitable number of processors, so as not to make the local matrix size too large (1000 x 1000 is about right)
- Try to use a square processor grid (Nrow = Ncol)
- Block size: 64
34. PETSc
- Portable, Extensible Toolkit for Scientific Computation
- Intended to facilitate the scalable (parallel) solution of linear and non-linear partial differential equations
- Focus on problems discretized using semi- or fully implicit methods
- Developed by Argonne National Laboratory, MCS
35. PETSc
- Fully usable from Fortran, C, and C++
- Uses MPI for all parallel communication
- Object-oriented
- Designed for both efficiency and ease of use
- Available freely from http://www.mcs.anl.gov/petsc
- Installed on Datastar at /usr/local/apps/petsc-2.3.1
36. PETSc features
- Parallel Vectors and discrete functions
- Distributed arrays (for regular grid problems)
- Parallel vector scatter/gathers (for unstructured
grid problems) - Parallel (sparse) matrices
- Parallel Krylov subspace methods
- Parallel preconditioners
- Parallel (Newton-based) nonlinear solvers
- Parallel time-stepping code
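- A minimal sketch (not from the slides) of the parallel vector objects listed above, written against the current PETSc C interface; some call signatures (e.g. VecSet, VecAXPY, VecDestroy) have changed between releases, so check the manual for the version installed (2.3.1 on Datastar):

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec       x, y;
        PetscReal nrm;
        PetscInt  n = 100;

        PetscInitialize(&argc, &argv, NULL, NULL);

        VecCreate(PETSC_COMM_WORLD, &x);   /* parallel vector of global size n */
        VecSetSizes(x, PETSC_DECIDE, n);
        VecSetFromOptions(x);
        VecDuplicate(x, &y);

        VecSet(x, 1.0);                    /* x(i) = 1 */
        VecSet(y, 2.0);                    /* y(i) = 2 */
        VecAXPY(y, 3.0, x);                /* y = y + 3*x */
        VecNorm(y, NORM_2, &nrm);
        PetscPrintf(PETSC_COMM_WORLD, "||y|| = %g\n", (double) nrm);

        VecDestroy(&x);
        VecDestroy(&y);
        PetscFinalize();
        return 0;
    }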
37. PETSc Numerical components
- (diagram of the PETSc numerical objects: Vectors, etc.)
38. MPI features used
- Communicators and attributes to manage messages inside PETSc objects
- Nonblocking sends and receives to overlap communication with computation
- Global reduction operations
- Derived data types
39. NAG: Numerical Algorithms Group
- Proprietary package, currently installed on
Datastar in /usr/local/apps/nag (64-bit only) - Differential Equations
- Linear Equations
- Interpolation, Curve and Surface Fitting
- Linear Algebra
- Eigensystem Analysis
- Statistical Analysis
- Random Number Generator
- Fourier Transforms, Summation of Series
- Time series analysis
- Mesh generation
40. ESSL: Engineering and Scientific Subroutine
Library
- IBM product optimized for IBM platforms
- Over 400 high performance mathematical functions
- C/C++ and Fortran, 32 and 64 bit, thread-safe
- API is Fortran-based. From C/C++ one must use Fortran calling conventions, e.g. column-major arrays
- Linking: -lessl
- ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf
41. ESSL features
- Numerical function categories:
- Linear Algebra - vector-scalar, sparse
vector-scalar, matrix-vector, sparse
matrix-vector - Matrix Operations - addition, subtraction,
multiplication, rank-k updates, transpose - Linear Algebra Equation solvers - dense, banded,
sparse, linear least-squares - Eigensystem Analysis - standard, generalized
- Signal Processing - Fourier transform,
convolutions, correlations - Sorting and Searching
- Interpolation - polynomial, cubic spline
- Numerical Quadrature
- Random Number Generation
- Utilities
42. ESSL example: 1D real-to-complex Fast Fourier Transform (FFT)
- #include <essl.h>
- main()
- {
-   /* M rows of length-N real data; M, N defined elsewhere */
-   double R[M][N], *work1, *work2;
-   cmplx C[M][N/2+1];
-   int nwork1, nwork2;
-   Init(R,N,M);   /* fill the input array (user routine) */
-   /* nwork1, nwork2: aux work sizes, set per the ESSL documentation */
-   work1 = malloc(nwork1*sizeof(double));
-   work2 = malloc(nwork2*sizeof(double));
-   drcft(1,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2);  /* initialize */
-   drcft(0,R,N,C,N/2+1,N,M,1,1.0,work1,nwork1,work2,nwork2);  /* transform */
- }
43. PESSL
- Invokes ESSL
- Uses MPI for communication
- PESSL Categories
- Subset of Level 2, 3 PBLAS
- Subset of ScaLAPACK (dense, banded)
- Sparse Routines
- Subset of ScaLAPACK Eigensystem Routines
- Fourier Transforms
- Uniform Random Number Generation
- BLACS
- Link with -lpessl
44. FFTW: Fastest Fourier Transform in the West
- Third party software, free: http://www.fftw.org
- Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s
- 1, 2, and 3 dimensional FFTs, and related functions (sin/cos transforms, convolution etc.)
- Uses plans (see the sketch after this list)
- Initialize transform parameters (create a plan)
- long integer plan1 (the plan handle; fftw_plan in C)
- plan1 = fftw_plan_...( N, M, ... )
- fftw_execute(plan1)
- Work arrays allocated and initialized behind the scenes
- ESTIMATE, MEASURE, PATIENT initialization modes
- +'s
- Portable, free
- -'s
- Performance about 3 times worse than ESSL's FFTs!
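- A minimal FFTW3 sketch (not from the slides): a 1D real-to-complex transform in double precision; link with -lfftw3:

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        int i, n = 16;
        double       *in  = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n/2 + 1));

        /* create the plan first (planning may overwrite in/out), then fill input */
        fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);
        for (i = 0; i < n; i++) in[i] = i;

        fftw_execute(p);   /* work arrays handled behind the scenes */
        printf("out[1] = %f %+fi\n", out[1][0], out[1][1]);

        fftw_destroy_plan(p);
        fftw_free(in); fftw_free(out);
        return 0;
    }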
45. 3D FFT in parallel
- FFTW version 3 does not have an MPI implementation
- PESSL and other libraries use 1D processor decomposition (slabs/planes)
- Limitation: Np < N
- Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain of 4096^3
- 2D decomposition is crucial for scaling on
Petascale platforms
46. 2D decomposition
- (figure: the computational domain, axes x, y, z)
47. 2D decomposition
- 2D decomposition: processors form a square grid P1 x P2
- Columns (pencils) instead of slabs
- Local geometry:
- Stage 1: N x K x M
- Stage 2: K x N x M
- Stage 3: K x M x N
- K = N/P1, M = N/P2
- At each stage, transform along the local dimension (N)
- Software scaling limit removed! Now the limit is P < N^2
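- For instance (illustrative numbers, not from the slides): with N = 4096 on P = 8192 cores arranged as P1 x P2 = 64 x 128, each task holds K = 4096/64 = 64 by M = 4096/128 = 32 pencils, i.e. a 4096 x 64 x 32 local block at stage 1, which is well beyond the P < N limit of a 1D (slab) decomposition.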
48. Performance of DNS (2D) on Blue Gene at SDSC and IBM's T.J. Watson Lab, and on SDSC's Datastar
- VN: two processors per node
- CO: one processor per node
49. P3DFFT
- Open source library for efficient, highly
scalable 3D FFT on parallel platforms - Built on top of an optimized 1D FFT library
- Currently ESSL or FFTW
- In the future, more libraries
- Uses 2D decomposition, which includes 1D option.
- Developed at SDSC
- Outcome of a Strategic Applications
Collaborations Project (DNS turbulence code) - Available at http://www.sdsc.edu/us/resources/p3dfft.php
50. P3DFFT features
- Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms
- Written in Fortran 90 with MPI
- Implemented as an F90 module
- Single or double precision
- Arbitrary dimensions
- Handles many uneven cases (Ni does not have to be divisible by Pj)
- Assumes Ni > Pj
- Can do either in-place or out-of-place transforms
- In the latter case the input and output arrays should not overlap
- Memory requirements: input and output arrays plus 1 buffer
51. P3DFFT Storage
- R2C transform input:
- Global: (Nx,Ny,Nz) real array
- Local: (Nx, Ny/P1, Nz/P2) real array
- R2C output:
- Global: (Nx/2, Ny, Nz) complex array
- Conjugate symmetry: F(N-i) = F(i)*
- F(N/2+1) through F(N-1) are redundant
- F(0) and F(N/2) are real; pack them in one pseudo-complex pair
- Total: N/2 complex storage units
- Local: (Nx/(2*P1), Ny/P2, Nz) complex
- C2R: input and output interchanged from R2C
52. Building P3DFFT
- Subdirectory build/
- Edit makefile for your architecture
- Specify -DSINGLE_PREC or -DDOUBLE_PREC
- Default is single precision
- Specify -DESSL or -DFFTW
- Choose P3DFFT_ROOT
- Type make; this will compile and install the library in the P3DFFT_ROOT directory
- On SDSC platforms, it is already installed in
/usr/local/apps/p3dfft - Subdirectory sample/
- Contains sample programs
- For the manual, see the README file in the top-level directory
53. Using P3DFFT
- use p3dfft
- Defines mytype: 4- or 8-byte reals
- Initialization: call p3dfft_setup(proc_dims, nx, ny, nz)
- proc_dims(2): two integers, P1 and P2
- For 1D decomposition, set P1 = 1
- Sets up the square 2D grid of MPI communicators in rows and columns
- Initializes local array dimensions
- Local dimensions: call get_dims(istart, iend, isize, Q)
- integer istart(3), iend(3), isize(3)
- Set Q=1 for R2C input, Q=2 for R2C output
54. Using P3DFFT (2)
- Now allocate your arrays
- When ready, call the Fourier Transform:
- p3dfft_ftran_r2c(IN, OUT, flg_inplace): forward transform, exp(-ikx/N)
- p3dfft_btran_c2r(IN, OUT, flg_inplace): backward (inverse) transform, exp(ikx/N)
- flg_inplace: boolean, true if doing an in-place transform
55. Future work
- Include more array distribution schemes
- Expand options for memory ordering
- Include more transform types
- Derivatives
- Complex-to-complex
- C interface
- Convenience functions
- Break down in stages
- Expand choice of 1D FFT libraries
- Questions? Comments?
- dmitry@sdsc.edu
56. Acknowledgements
- Amit Majumdar, SDSC
- ScaLapack development team: http://www.netlib.org/scalapack
- ACTS tools presentation from NERSC: http://acts.nersc.gov
57. Lab assignment
- Copy the tar file ~dmitry/scalapack_examples.tar
- Untar
- Type make
- Test the example programs on several processor
counts - do they scale? - Modify source code to vary
- matrix size
- block sizes
- Rewrite example1.f in C or C++