1. BLAS, LAPACK, and beyond
- Robert van de Geijn
- Department of Computer Sciences
- UT-Austin Institute for Computational Engineering and Sciences
2. Motivation
- Naïve implementation of matrix-matrix multiplication
  - Pentium 4, 1.8 GHz
  - Simple loops
  - C code
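The naïve implementation referred to above is just the textbook triple loop. A minimal C sketch (column-major storage is assumed here, matching the Fortran heritage of these libraries; the slide's actual code is not reproduced):

```c
#include <assert.h>
#include <stddef.h>

/* Naive triple-loop matrix-matrix multiply, C += A * B, with n x n
   matrices stored in column-major order (leading dimension n).
   There is no blocking, so for large n most accesses to A miss in
   cache and memory bandwidth, not floating-point throughput,
   limits performance. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```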
3. Solution: Leave it to the Experts
- Dense linear algebra libraries are often at the bottom of the food chain.
- Basic Linear Algebra Subprograms (BLAS): standardized interfaces for
  - vector-vector (BLAS1),
  - matrix-vector (BLAS2), and
  - matrix-matrix (BLAS3) operations.
- Linear Algebra Package (LAPACK)
  - De facto standard for higher-level dense and banded linear algebra operations
- Vendors are expected to provide high-performance implementations of the BLAS.
- Sometimes LAPACK-level routines are optimized as well.
4. Overview
- Optimizations over time: BLAS, LAPACK
- And beyond: FLAME
- Just show me the library!
- Conclusion
5. Once upon a time there was a package
- 1972: EISPACK
- Package for the solution of the dense eigenvalue problem, e.g. A x = λ x
- First robust software package
- Numerical stability, performance, and portability were a concern
- Consistent formatting of code
- Coded in Fortran (translations of the Algol procedures from the Wilkinson-Reinsch handbook)
- First released in 1972
6. Basic Linear Algebra Subprograms (BLAS)
- In the 1970s vector supercomputers ruled
- Dense (and many sparse) linear algebra algorithms can be formulated in terms of vector operations
- By standardizing an interface to such operations, portable high performance could be achieved
  - Vendors responsible for optimized implementations
  - Other libraries coded in terms of the BLAS
- First proposed in 1973. Published in:
  - C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for FORTRAN usage, ACM Trans. Math. Soft., 5 (1979).
- Later became known as the level-1 BLAS
- Fortran77 interface
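A representative level-1 operation is AXPY, y := αx + y. A plain-C sketch (not the reference Fortran implementation, and the function name is illustrative) shows why level-1 operations offer no data reuse:

```c
#include <assert.h>
#include <stddef.h>

/* AXPY, the archetypal level-1 BLAS operation: y := alpha*x + y.
   About 2n flops on about 2n data items: each element is loaded
   once and never reused, so the operation is bound by memory
   bandwidth no matter how cleverly it is coded. */
void daxpy_sketch(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```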
7. Linear Algebra Package (LINPACK)
- Targeted solution of linear equations and linear least-squares problems
- Coded in terms of the level-1 BLAS for portability
- Started in 1974. Published/released in 1977:
  - J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide, SIAM, 1977.
- Fortran77 interface
8. LINPACK: LU factorization with partial pivoting (abbreviated)
18. LINPACK (continued)
- Portable high performance on vector architectures
- Poor performance when architectures with complex memory hierarchies arrived in the 1980s
- O(n) operations on O(n) data means bandwidth to main memory becomes the limiting factor
19. Complex Memory Hierarchies
- [Figure: the memory pyramid, from fast and expensive (registers, L1 cache, L2 cache) down to slow and cheap (RAM, disk)]
20. Towards Higher Performance
- Level-2 BLAS: matrix-vector operations
- Started in 1984. Published in 1988:
  - J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 14 (1988), pp. 1-17.
- Casts computation in terms of operations like matrix-vector multiplication and rank-1 update
- Benefit: vector(s) can be kept in cache memory
- Problem: O(n²) operations on O(n²) data
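The two workhorse level-2 operations, matrix-vector multiply (GEMV) and rank-1 update (GER), can be sketched in plain C (illustrative names, not the standard's Fortran interfaces):

```c
#include <assert.h>
#include <stddef.h>

/* Level-2 style operations on an n x n column-major matrix A. */

/* gemv: y := A*x + y.  The vector x stays in cache while the
   columns of A stream through: O(n^2) flops on O(n^2) data. */
void gemv_sketch(size_t n, const double *A, const double *x, double *y)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            y[i] += A[i + j*n] * x[j];
}

/* ger: rank-1 update A := A + x*y^T.  Here the vectors x and y
   are reused while every element of A is touched exactly once. */
void ger_sketch(size_t n, double *A, const double *x, const double *y)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            A[i + j*n] += x[i] * y[j];
}
```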
21. LAPACK: LU factorization with partial pivoting, unblocked algorithm (abbreviated)
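The slide's Fortran source is not reproduced here; as an illustration, an unblocked right-looking LU with partial pivoting can be sketched in C (`lu_unblocked` is a hypothetical name, not an LAPACK routine) - each step is built from level-1/level-2 operations (pivot search, row swap, column scaling, rank-1 update):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Unblocked right-looking LU with partial pivoting, column-major.
   Overwrites A with L (unit lower triangular, stored below the
   diagonal) and U; ipiv[k] records the row swapped with row k. */
void lu_unblocked(size_t n, double *A, size_t *ipiv)
{
    for (size_t k = 0; k < n; k++) {
        /* pivot search in column k (IDAMAX-style) */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (fabs(A[i + k*n]) > fabs(A[p + k*n])) p = i;
        ipiv[k] = p;
        if (p != k)                     /* swap rows k and p */
            for (size_t j = 0; j < n; j++) {
                double t = A[k + j*n];
                A[k + j*n] = A[p + j*n];
                A[p + j*n] = t;
            }
        /* scale the subdiagonal of column k */
        for (size_t i = k + 1; i < n; i++)
            A[i + k*n] /= A[k + k*n];
        /* rank-1 update of the trailing submatrix */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                A[i + j*n] -= A[i + k*n] * A[k + j*n];
    }
}
```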
24. Performance (Intel Xeon 3.4GHz)
25. Towards High Performance
- Level-3 BLAS: matrix-matrix operations
- Started in 1986. Published in 1990:
  - J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 16 (1990).
- Casts computation in terms of operations like matrix-matrix multiplication
- Benefit: submatrices can be kept in cache
- O(n³) operations on O(n²) data
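The cache benefit comes from blocking. A sketch of a blocked matrix multiply in C (the block size NB and the function name are illustrative, not a tuned level-3 BLAS implementation):

```c
#include <assert.h>
#include <stddef.h>

#define NB 64   /* block size; in practice tuned so that blocks of
                   A, B, and C fit together in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked matrix multiply C += A * B (n x n, column-major).
   Once a block is brought into cache it is reused on the order of
   NB times, which is how level-3 operations exploit having O(n^3)
   flops to perform on only O(n^2) data. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t ii = 0; ii < n; ii += NB)
                /* one block product: C(ii,jj) += A(ii,kk) * B(kk,jj) */
                for (size_t j = jj; j < min_sz(jj + NB, n); j++)
                    for (size_t k = kk; k < min_sz(kk + NB, n); k++) {
                        double bkj = B[k + j*n];
                        for (size_t i = ii; i < min_sz(ii + NB, n); i++)
                            C[i + j*n] += A[i + k*n] * bkj;
                    }
}
```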
26. LAPACK: LU factorization with partial pivoting, blocked algorithm with block size nb (abbreviated)
30. Performance (Intel Xeon 3.4GHz)
31. And Beyond
- New architectures with many cores
  - Symmetric Multi-Processors (SMPs)
  - Multicore architectures
- Algorithms encoded in LAPACK don't always parallelize well
32. AMD Opteron, 8 cores (2.4GHz)
- [Performance graph: LAPACK]
33. And Beyond (continued)
- LAPACK code is hard to write/read/maintain/alter
- Formal Linear Algebra Methods Environment (FLAME) project
  - Started around 2000. First official library release (libFLAME 1.0): April 1, 2007
  - Collaboration between the UT Dept. of Computer Sciences, TACC, and UJI (Spain)
  - Systematic approach to deriving, presenting, and implementing algorithms
34. Why the FLAME API?
- From the LAPACK 3.1.1 changes log:
  - Replaced calls of the form
      CALL SCOPY( N, WORK, 1, TAU, 1 )
    with
      CALL SCOPY( N-1, WORK, 1, TAU, 1 )
    at line 694 for s/dchkhs and line 698 for c/zchkhs.
  - (TAU is only of length N-1.)
35. FLAME: LU factorization, blocked algorithm with block size nb
40. The blocked LU algorithm coded with the FLAME/C API (abbreviated):

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3(
          ATL, /**/ ATR,        &A00, /**/ &A01, &A02,
        /* ************ */    /* ******************** */
                                &A10, /**/ &A11, &A12,
          ABL, /**/ ABR,        &A20, /**/ &A21, &A22,
          b, b, FLA_BR );

      /*----------------------------------------------*/

      LU_unb_var5( A11 );
      FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );
45. An algorithm for every occasion
- There are five loop-based algorithms for LU factorization without pivoting
  - Five unblocked
  - Five blocked
- On different architectures, a different algorithm will achieve the best performance
46. Performance (Intel Xeon 3.4GHz)
47. AMD Opteron, 8 cores (2.4GHz)
- [Performance graph: FLAME vs. LAPACK]
48. Just Show Me the Library
- Reference implementations
  - BLAS reference implementation: http://www.netlib.org/blas/
  - Netlib version of LAPACK (reference implementation?): http://www.netlib.org/lapack/
- Open Source implementations
  - ATLAS (Automatically Tuned Linear Algebra Software): http://www.netlib.org/atlas/
  - GotoBLAS (TACC): http://www.tacc.utexas.edu/resources/software/
  - libFLAME (UT-Austin, Dept. of CS): http://www.tacc.utexas.edu/resources/software/
- Vendor implementations
  - ESSL (IBM), ACML (AMD), MKL (Intel), MLIB (HP), MathKeisan (NEC)
- All of these are plug-compatible. Choose based on functionality, performance, and expense
- Can mix and match
49. A Few Words about the GotoBLAS
- Implemented by Kazushige Goto (TACC)
- Often the fastest implementation of the BLAS and of the most commonly used operations in LAPACK
- Under a pretty restrictive Open Source license
- Licensed to a number of the vendors
50. And the winner is...
51. dgemm: Pentium4 (3.6 GHz)
52. dgemm: Itanium2 (1.5 GHz)
53. dgemm: Power5 (1.9 GHz)
54. GotoBLAS Level-3 BLAS: PPC440 FP2 (700 MHz)
55. GotoBLAS Level-3 BLAS: PPC440 FP2 (700 MHz)
56. LU w/ pivoting on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: GotoBLAS, FLAME, LAPACK]
57. QR factorization on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME, LAPACK]
58. Cholesky factorization on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME/GotoBLAS, LAPACK]
59. Triangular inversion on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME/GotoBLAS, LAPACK]
60. Similar results on other platforms
61. Conclusion
- BLAS
  - Netlib's BLAS are a reference implementation
  - Excellent BLAS libraries are available as Open Source software: GotoBLAS
  - ATLAS is an also-ran
  - Many vendor implementations are derived from the GotoBLAS
- LAPACK
  - Netlib's LAPACK should be considered a reference implementation
  - Excellent performance is achieved by GotoBLAS for a few important operations
  - Very good to excellent performance is achieved by libFLAME
- The future
  - Much more flexible libraries are needed as SMPs and multicore architectures become commonplace
  - Coding at a high level helps manage complexity
62. Unsung heroes
- Along the way, there were many unsung heroes
- Fred Gustavson at IBM greatly influenced the move towards blocked algorithms
- Kågström et al. noticed the importance of casting level-3 BLAS in terms of matrix-matrix multiplication
- Other library people who work for vendors get little personal credit for their achievements
63. More Information
- www.tacc.utexas.edu/resources/software
- www.cs.utexas.edu/users/flame
- www.netlib.org
- www.linearalgebrawiki.org
64. A Bit of History
- 1972: EISPACK is released
- 1973-1977: BLAS for vector-vector operations
- 1974-1977: LINPACK
- 1984-1988: Level-2 BLAS for matrix-vector operations
- 1986-1990: Level-3 BLAS for matrix-matrix operations
- 1987-present: LAPACK for linear systems and linear eigenvalue problems
- 1990-present: ScaLAPACK (subset of LAPACK for distributed-memory architectures)
- 1996-1997: PLAPACK (subset of LAPACK for distributed-memory architectures)
- 1999: ATLAS
- 2002: GotoBLAS
- 2007: libFLAME