Title: Simple Circuit
1Performance libraries Math Kernel Library
March 2003
2Agenda
- ??
- ??
- MKL???
- ????
- ????????
- ??
- ?????
- BLAS??
3??
- ??,??,??!
- MKL?Intel??????????????
- ??,???BLAS? FFT
- ??
- Solvers (BLAS, LAPACK)
- ????/??? solvers(BLAS, LAPACK)
- ????????? (dgemm)
- PDEs, ????, ??, solid-state physics (FFTs)
- General scientific, financial (vector transcendental functions, VML)
- ????????Intel???????
4?? don'ts
- But don't use MKL on
- Don't use MKL on small counts
- Don't call vector math functions on small n
[Figure: X Y Z W vectors and a 4x4 transformation matrix]
????
????????IPP
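The advice against library calls on small counts can be illustrated with the 4x4 transformation case: the whole operation is 16 multiplies, so a hand-written loop is short and the overhead of a library call can outweigh the work. A minimal sketch in plain C, assuming a row-major 4x4 matrix stored as 16 doubles; the name `transform4` is hypothetical and not part of MKL.

```c
#include <stddef.h>

/* Hypothetical helper: out = M * v, where m holds a 4x4 matrix in
   row-major order (16 doubles) and v is an (X, Y, Z, W) vector. */
void transform4(const double *m, const double *v, double *out)
{
    for (size_t i = 0; i < 4; i++) {
        out[i] = m[4 * i + 0] * v[0] + m[4 * i + 1] * v[1]
               + m[4 * i + 2] * v[2] + m[4 * i + 3] * v[3];
    }
}
```

With a translation matrix whose last column is (2, 3, 4, 1), transforming the point (1, 1, 1, 1) gives (3, 4, 5, 1).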
5MKL???
- BLAS (Basic Linear Algebra Subroutines)
- Level 1 BLAS vector-vector operations
- 15 function types
- 48 functions
- Level 2 BLAS matrix-vector operations
- 26 function types
- 66 functions
- Level 3 BLAS matrix-matrix operations
- 9 function types
- 30 functions
- Extended BLAS level 1 BLAS for sparse vectors
- 8 function types
- 24 functions
6MKL Contents
- LAPACK (linear algebra package)
- Solvers and eigensolvers. Many hundreds of routines in total!
- Total user-callable and support routines > 1000
- FFTs (fast Fourier transforms)
- One- and two-dimensional
- With and without frequency ordering (bit reversal)
- VML (vector math library)
- Set of vectorized transcendental functions
- Most of libm functions, but faster
7MKL???
- ??? MKL????Fortran??
- ???????
- BLAS, LAPACK ????????Fortran??
- Cblas?? ????C/C++?????BLAS
8MKL??? - ??
- ??Intel?CVF Fortran ???
- ?? Linux ? Windows ????
- ????????
- ??????? 32-bit and 64-bit
- ????????
- ????? MKL Index
9????????
- ?????????????
- ???????????? (??????)
- CPU ???????????
- Cache ?????????Cache?
- ???? ?????????
- Computer ??????????
- System ????????? (??)
10??
- ??? MKL ???????,??
- level 1, level 2 BLAS ????????( O(n) )
- ???????????
- Level 3 BLAS ( O(n^3) )
- LAPACK ( O(n^3) )
- FFTs ( O(n log(n)) )
- VML? Depends on processor and function
- ??????? OpenMP??
- ??MKL?????????????????
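The cost list above suggests why matrix-matrix routines have the most to gain: only there does the amount of arithmetic per data element grow with n. The sketch below estimates flops per distinct element touched for one representative routine at each level, for square problems of size n. The counts are the usual textbook approximations (daxpy: 2n flops on 2n elements; dgemv: 2n^2 flops on n^2 + 2n elements; dgemm: 2n^3 flops on 3n^2 elements), not figures from the slides, and the function names are hypothetical.

```c
/* Approximate flops per distinct data element, square problems of size n. */

/* Level 1, e.g. daxpy (y = alpha*x + y): 2n flops, vectors x and y. */
double intensity_level1(double n)
{
    return (2.0 * n) / (2.0 * n);
}

/* Level 2, e.g. dgemv (y = A*x): 2n^2 flops, matrix A plus vectors x and y. */
double intensity_level2(double n)
{
    return (2.0 * n * n) / (n * n + 2.0 * n);
}

/* Level 3, e.g. dgemm (C = A*B): 2n^3 flops, matrices A, B and C. */
double intensity_level3(double n)
{
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

For n = 1000 the Level 1 and Level 2 ratios stay at or below 2, while the Level 3 ratio is about 667, which leaves far more room to reuse data held in cache.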
11??? MKL??
- Assume a program calls an MKL function; then what?
- Two approaches
- Static link: all library objects linked into the program binary
- DLL: use without static link, a frequent C approach
12Static Link
- Scenario 1: ifl, BLAS, Pentium III processor
- ifl -o myprog myprog.f -static -L/opt/intel/mkl/lib/32 -lmkl_p3 -lpthread -lguide (Linux)
13Dynamic Link
- Scenario 2: C program uses BLAS but wants the optimal code determined at runtime
- ifl -o myprog myprog.f -L/opt/intel/mkl/lib/32 -lmkl -lpthread -lguide (Linux)
14BLAS ??
- 3 levels of functions + sparse
- Level 1 vector-vector operations
- Level 2 vector-matrix operations
- Level 3 matrix-matrix operations
- Sparse level 1 operations on sparse vectors
- Levels ???
- Level 1 in early 70s
- Level 2 in mid-80s, followed immediately by Level 3
15BLAS ????
- General scheme: <precision><name><modifier>
- precision: one or two letters
- 1 letter implies input and output are the same type
- s single, d double, c single complex, z double complex
- 2 letters: input and output are different
- sc, dz: complex in, real out (e.g. scnrm2, dznrm2); cs, zd: complex vector combined with a real scalar (e.g. csscal, zdscal)
16BLAS????
- <name>
- g general: ge general, gb band(??)
- s symmetric: sy symmetric, sp packed, sb band(??)
- h Hermitian: he Hermitian, hp packed, hb band( Hermitian??)
- t triangular: tr triangular, tp packed, tb band(??)
17??band(General Band)
18??band(symmetric band)
19????band(Hermitian Band)
20??band(triangular band)
21packed
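Packed storage keeps only one triangle of a symmetric, Hermitian, or triangular matrix in a one-dimensional array. The sketch below shows the upper-triangle, column-by-column layout that the reference BLAS packed routines expect, written with 0-based C indexing. The helper names are hypothetical, and the input matrix is assumed to be stored column-major with leading dimension n.

```c
#include <stddef.h>

/* Position of element (i, j), with i <= j, in upper packed storage.
   Column j starts after the 1 + 2 + ... + j elements of earlier columns. */
size_t packed_upper_index(size_t i, size_t j)
{
    return i + j * (j + 1) / 2;
}

/* Copy the upper triangle of an n x n column-major matrix a into ap.
   ap must have room for n * (n + 1) / 2 elements. */
void pack_upper(size_t n, const double *a, double *ap)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i <= j; i++)
            ap[packed_upper_index(i, j)] = a[i + j * n];
}
```

A 3x3 symmetric matrix therefore needs 6 stored values instead of 9, and element (1, 2) lands at packed position 4.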
22BLAS Naming Conventions
- Level 1 <modifier>
- c conjugated (cdotc), u unconjugated (cdotu), g Givens (srotg)
- Level 2 <modifier>
- mv matrix-vector, sv solve (vector operations), r rank update, r2 rank-2 update
- dger: double-precision general rank update
- A := alpha * x * y^T + A
- Level 3 <modifier>
- mm matrix-matrix, sm solve (matrix operations), rk rank update, r2k rank-2 update
- dsyr2k: double-precision symmetric rank-2 update
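To make the naming scheme concrete, here is a small sketch that splits a Level 2 or Level 3 routine name into its three parts. It assumes a one-letter precision followed by a two-letter matrix type, so it does not handle Level 1 names such as ddot or two-letter precisions such as scnrm2, and it does not check that the letters are valid. The struct and function names are hypothetical.

```c
#include <string.h>

/* Hypothetical container for the parts of a BLAS routine name. */
struct blas_name {
    char precision;    /* s, d, c, or z */
    char matrix[3];    /* ge, gb, sy, sp, sb, he, hp, hb, tr, tp, tb */
    char modifier[8];  /* mv, sv, r, r2, mm, sm, rk, r2k, ... */
};

/* Split e.g. "dgemm" into 'd', "ge", "mm".
   Returns 0 on success, -1 if the name is too short or the modifier
   does not fit in the struct. */
int decode_blas_name(const char *name, struct blas_name *out)
{
    size_t len = strlen(name);
    if (len < 4 || len - 3 >= sizeof out->modifier)
        return -1;
    out->precision = name[0];
    memcpy(out->matrix, name + 1, 2);
    out->matrix[2] = '\0';
    strcpy(out->modifier, name + 3);
    return 0;
}
```

Decoding "dsyr2k" yields precision d, matrix type sy, and modifier r2k, matching "double-precision symmetric rank-2 update".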
23Matrix Multiplication
- ??????
- Roll your own
- DDOT (level 1)
- DGEMV (level 2)
- DGEMM (level 3)
- Because C is used, all is not pretty :-)
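24Matrix Multiplication: Roll Your Own/Dot Product
Roll Your Own
for( i = 0; i < n; i++ )
  for( j = 0; j < m; j++ ) {
    temp = 0.0;
    for( k = 0; k < kk; k++ )
      temp += a[i][k] * b[k][j];
    c[i][j] = temp;
  }
ddot
incx = 1; incy = ldb;
for( i = 0; i < n; i++ )
  for( j = 0; j < m; j++ )
    /* dot-product length is the inner dimension kk */
    c[i][j] = DDOT( kk, a[i], incx, &b[0][j], incy );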
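The dot-product formulation works because a row of a row-major matrix is contiguous (stride 1) while a column is strided by the leading dimension. The sketch below spells that out with a stand-in `my_ddot` that takes the same kind of arguments as DDOT (length, vector, stride, vector, stride). Both function names are hypothetical, the matrices are assumed to be flat row-major arrays, and strides are assumed positive; a real program would call the library routine instead.

```c
/* Stand-in for DDOT: sum of x[i*incx] * y[i*incy] for i = 0..n-1.
   Assumes incx and incy are positive. */
double my_ddot(int n, const double *x, int incx, const double *y, int incy)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}

/* c (n x m) = a (n x kk) * b (kk x m), all row-major with leading
   dimensions lda, ldb, ldc. */
void matmul_ddot(int n, int m, int kk,
                 const double *a, int lda,
                 const double *b, int ldb,
                 double *c, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            /* Row i of a has stride 1; column j of b has stride ldb. */
            c[i * ldc + j] = my_ddot(kk, &a[i * lda], 1, &b[j], ldb);
}
```

Each call does only kk multiply-adds, so this version pays one function call per output element, which is why the Level 2 and Level 3 formulations are preferable for larger matrices.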
25Matrix Multiplication: DGEMV/DGEMM
dgemv
incx = 1; incy = ldb; alpha = 1.0; beta = 0.0; transa = 't';
for( i = 0; i < n; i++ )
  dgemv( transa, m, n, alpha, a, lda,
         &b[0][i], ldb, beta, &c[0][i], ldc );
dgemm
alpha = 1.0; beta = 0.0; transa = 'n'; transb = 'n';
dgemm( transa, transb, m, n, kk, alpha, b, ldb, a, lda, beta, c, ldc );
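The dgemm call passes b before a because the routine assumes Fortran column-major storage. A row-major C array read as column-major is the transpose, so computing C^T = B^T * A^T with swapped operands leaves the row-major product C = A * B in memory. The sketch below checks this idea with a reference column-major multiply standing in for dgemm. `ref_dgemm_nn` and `matmul_row_major` are hypothetical names, only the 'n','n' case is covered, and the matrices are assumed to be tightly packed.

```c
/* Reference column-major multiply, 'n','n' case only (stand-in for dgemm):
   C (m x n) = alpha * A (m x k) * B (k x n) + beta * C. */
void ref_dgemm_nn(int m, int n, int k, double alpha,
                  const double *a, int lda,
                  const double *b, int ldb,
                  double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[i + p * lda] * b[p + j * ldb];
            /* When beta is zero, do not read the old contents of c. */
            c[i + j * ldc] = (beta == 0.0)
                ? alpha * sum
                : alpha * sum + beta * c[i + j * ldc];
        }
}

/* Row-major C (rows x cols) = A (rows x inner) * B (inner x cols).
   Viewed as column-major, the arrays hold A^T, B^T and C^T, so pass b
   first and swap the row and column counts. */
void matmul_row_major(int rows, int cols, int inner,
                      const double *a, const double *b, double *c)
{
    ref_dgemm_nn(cols, rows, inner, 1.0, b, cols, a, inner, 0.0, c, cols);
}
```

The swapped call needs no explicit transposition or copying, which is what makes the single Level 3 call attractive from C.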
26MKL?????? ????? vs DGEMM
2.2 GHz Intel Pentium 4 processor, 512 MB memory
27MKL ?????????DGEMM
800 MHz Itanium processor, 4 MB cache, NEC Express5800