1. BLAS, LAPACK, and beyond
- Robert van de Geijn
- Department of Computer Sciences
- UT-Austin Institute for Computational Engineering and Sciences
2. Motivation
- Naïve implementation of matrix-matrix multiplication
  - Pentium 4, 1.8 GHz
  - Simple loops
  - C code
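The naïve implementation referred to above is just the textbook triple loop. A minimal C sketch (column-major storage is assumed here, matching the Fortran heritage of these libraries; the slide's actual code is not reproduced):

```c
#include <assert.h>
#include <stddef.h>

/* Naive triple-loop matrix-matrix multiply, C += A * B, with n x n
   matrices stored in column-major order (leading dimension n).
   There is no blocking, so for large n most accesses to A miss in
   cache and memory bandwidth, not floating-point throughput,
   limits performance. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```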
3. Solution: Leave it to the Experts
- Dense linear algebra libraries are often at the bottom of the food chain.
- Basic Linear Algebra Subprograms (BLAS): standardized interfaces for
  - vector-vector (BLAS1),
  - matrix-vector (BLAS2), and
  - matrix-matrix (BLAS3) operations.
- Linear Algebra Package (LAPACK)
  - De facto standard for higher-level dense and banded linear algebra operations
- Vendors are expected to provide high-performance implementations of the BLAS.
- Sometimes LAPACK-level routines are optimized as well.
4. Overview
- Optimizations over time: BLAS, LAPACK
- And beyond: FLAME
- Just show me the library!
- Conclusion
5. Once upon a time there was a package
- 1972: EISPACK
- Package for the solution of the dense eigenvalue problem, e.g. A x = λ x
- First robust software package
- Numerical stability, performance, and portability were a concern
- Consistent formatting of code
- Coded in Fortran (translations of the Algol procedures from the Wilkinson-Reinsch handbook)
- First released in 1972
6. Basic Linear Algebra Subprograms (BLAS)
- In the 1970s vector supercomputers ruled
- Dense (and many sparse) linear algebra algorithms can be formulated in terms of vector operations
- By standardizing an interface to such operations, portable high performance could be achieved
  - Vendors responsible for optimized implementations
  - Other libraries coded in terms of the BLAS
- First proposed in 1973. Published in:
  - C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for FORTRAN usage, ACM Trans. Math. Soft., 5 (1979).
- Later became known as the level-1 BLAS
- Fortran77 interface
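A representative level-1 operation is AXPY, y := αx + y. A plain-C sketch (not the reference Fortran implementation, and the function name is illustrative) shows why level-1 operations offer no data reuse:

```c
#include <assert.h>
#include <stddef.h>

/* AXPY, the archetypal level-1 BLAS operation: y := alpha*x + y.
   About 2n flops on about 2n data items: each element is loaded
   once and never reused, so the operation is bound by memory
   bandwidth no matter how cleverly it is coded. */
void daxpy_sketch(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```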
7. Linear Algebra Package (LINPACK)
- Targeted solution of linear equations and linear least-squares problems
- Coded in terms of the level-1 BLAS for portability
- Started in 1974. Published/released in 1977:
  - J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide, SIAM, 1977.
- Fortran77 interface
8. LINPACK: LU factorization with partial pivoting (abbreviated)
18. LINPACK (continued)
- Portable high performance on vector architectures
- Poor performance when architectures with complex memory hierarchies arrived in the 1980s
- O(n) operations on O(n) data means bandwidth to main memory becomes the limiting factor
19. Complex Memory Hierarchies
- [Figure: the memory pyramid, from fast and expensive (registers, L1 cache, L2 cache) down to slow and cheap (RAM, disk)]
20. Towards Higher Performance
- Level-2 BLAS: matrix-vector operations
- Started in 1984. Published in 1988:
  - J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 14 (1988), pp. 1-17.
- Casts computation in terms of operations like matrix-vector multiplication and rank-1 update
- Benefit: vector(s) can be kept in cache memory
- Problem: O(n²) operations on O(n²) data
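The two workhorse level-2 operations, matrix-vector multiply (GEMV) and rank-1 update (GER), can be sketched in plain C (illustrative names, not the standard's Fortran interfaces):

```c
#include <assert.h>
#include <stddef.h>

/* Level-2 style operations on an n x n column-major matrix A. */

/* gemv: y := A*x + y.  The vector x stays in cache while the
   columns of A stream through: O(n^2) flops on O(n^2) data. */
void gemv_sketch(size_t n, const double *A, const double *x, double *y)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            y[i] += A[i + j*n] * x[j];
}

/* ger: rank-1 update A := A + x*y^T.  Here the vectors x and y
   are reused while every element of A is touched exactly once. */
void ger_sketch(size_t n, double *A, const double *x, const double *y)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            A[i + j*n] += x[i] * y[j];
}
```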
21. LAPACK: LU factorization with partial pivoting, unblocked algorithm (abbreviated)
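The slide's Fortran source is not reproduced here; as an illustration, an unblocked right-looking LU with partial pivoting can be sketched in C (`lu_unblocked` is a hypothetical name, not an LAPACK routine) - each step is built from level-1/level-2 operations (pivot search, row swap, column scaling, rank-1 update):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Unblocked right-looking LU with partial pivoting, column-major.
   Overwrites A with L (unit lower triangular, stored below the
   diagonal) and U; ipiv[k] records the row swapped with row k. */
void lu_unblocked(size_t n, double *A, size_t *ipiv)
{
    for (size_t k = 0; k < n; k++) {
        /* pivot search in column k (IDAMAX-style) */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (fabs(A[i + k*n]) > fabs(A[p + k*n])) p = i;
        ipiv[k] = p;
        if (p != k)                     /* swap rows k and p */
            for (size_t j = 0; j < n; j++) {
                double t = A[k + j*n];
                A[k + j*n] = A[p + j*n];
                A[p + j*n] = t;
            }
        /* scale the subdiagonal of column k */
        for (size_t i = k + 1; i < n; i++)
            A[i + k*n] /= A[k + k*n];
        /* rank-1 update of the trailing submatrix */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                A[i + j*n] -= A[i + k*n] * A[k + j*n];
    }
}
```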
24. Performance (Intel Xeon 3.4GHz)
25. Towards High Performance
- Level-3 BLAS: matrix-matrix operations
- Started in 1986. Published in 1990:
  - J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 16 (1990).
- Casts computation in terms of operations like matrix-matrix multiplication
- Benefit: submatrices can be kept in cache
- O(n³) operations on O(n²) data
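The cache benefit comes from blocking. A sketch of a blocked matrix multiply in C (the block size NB and the function name are illustrative, not a tuned level-3 BLAS implementation):

```c
#include <assert.h>
#include <stddef.h>

#define NB 64   /* block size; in practice tuned so that blocks of
                   A, B, and C fit together in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked matrix multiply C += A * B (n x n, column-major).
   Once a block is brought into cache it is reused on the order of
   NB times, which is how level-3 operations exploit having O(n^3)
   flops to perform on only O(n^2) data. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t ii = 0; ii < n; ii += NB)
                /* one block product: C(ii,jj) += A(ii,kk) * B(kk,jj) */
                for (size_t j = jj; j < min_sz(jj + NB, n); j++)
                    for (size_t k = kk; k < min_sz(kk + NB, n); k++) {
                        double bkj = B[k + j*n];
                        for (size_t i = ii; i < min_sz(ii + NB, n); i++)
                            C[i + j*n] += A[i + k*n] * bkj;
                    }
}
```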
26. LAPACK: LU factorization with partial pivoting, blocked algorithm with block size nb (abbreviated)
30. Performance (Intel Xeon 3.4GHz)
31. And Beyond
- New architectures with many cores
  - Symmetric Multi-Processors (SMPs)
  - Multicore architectures
- Algorithms encoded in LAPACK don't always parallelize well
32. AMD Opteron, 8 cores (2.4GHz)
- [Performance graph: LAPACK]
33. And Beyond (continued)
- LAPACK code is hard to write/read/maintain/alter
- Formal Linear Algebra Methods Environment (FLAME) project
  - Started around 2000. First official library release (libFLAME 1.0): April 1, 2007
  - Collaboration between the UT Dept. of Computer Sciences, TACC, and UJI (Spain)
  - Systematic approach to deriving, presenting, and implementing algorithms
34. Why the FLAME API?
- From the LAPACK 3.1.1 changes log:
  - Replaced calls of the form
      CALL SCOPY( N, WORK, 1, TAU, 1 )
    with
      CALL SCOPY( N-1, WORK, 1, TAU, 1 )
    at line 694 for s/dchkhs and line 698 for c/zchkhs.
  - (TAU is only of length N-1.)
35. FLAME: LU factorization, blocked algorithm with block size nb
40. The blocked LU algorithm coded with the FLAME/C API (abbreviated):

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3(
          ATL, /**/ ATR,        &A00, /**/ &A01, &A02,
        /* ************ */    /* ******************** */
                                &A10, /**/ &A11, &A12,
          ABL, /**/ ABR,        &A20, /**/ &A21, &A22,
          b, b, FLA_BR );

      /*----------------------------------------------*/

      LU_unb_var5( A11 );
      FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );
45. An algorithm for every occasion
- There are five loop-based algorithms for LU factorization without pivoting
  - Five unblocked
  - Five blocked
- On different architectures, a different algorithm will achieve the best performance
46. Performance (Intel Xeon 3.4GHz)
47. AMD Opteron, 8 cores (2.4GHz)
- [Performance graph: FLAME vs. LAPACK]
48. Just Show Me the Library
- Reference implementations
  - BLAS reference implementation: http://www.netlib.org/blas/
  - Netlib version of LAPACK (reference implementation?): http://www.netlib.org/lapack/
- Open Source implementations
  - ATLAS (Automatically Tuned Linear Algebra Software): http://www.netlib.org/atlas/
  - GotoBLAS (TACC): http://www.tacc.utexas.edu/resources/software/
  - libFLAME (UT-Austin, Dept. of CS): http://www.tacc.utexas.edu/resources/software/
- Vendor implementations
  - ESSL (IBM), ACML (AMD), MKL (Intel), MLIB (HP), MathKeisan (NEC)
- All of these are plug-compatible. Choose based on functionality, performance, and expense
- Can mix and match
49. A Few Words about the GotoBLAS
- Implemented by Kazushige Goto (TACC)
- Often the fastest implementation of the BLAS and of the most commonly used operations in LAPACK
- Under a pretty restrictive Open Source license
- Licensed to a number of the vendors
50. And the winner is...
51. dgemm: Pentium4 (3.6 GHz)
52. dgemm: Itanium2 (1.5 GHz)
53. dgemm: Power5 (1.9 GHz)
54. GotoBLAS Level-3 BLAS: PPC440 FP2 (700 MHz)
55. GotoBLAS Level-3 BLAS: PPC440 FP2 (700 MHz)
56. LU w/ pivoting on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: GotoBLAS, FLAME, LAPACK]
57. QR factorization on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME, LAPACK]
58. Cholesky factorization on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME/GotoBLAS, LAPACK]
59. Triangular inversion on 8 cores (4 x AMD 2.4GHz dual-core Opteron 880)
- [Performance graph: FLAME/GotoBLAS, LAPACK]
60. Similar results on other platforms
61. Conclusion
- BLAS
  - Netlib's BLAS are a reference implementation
  - Excellent BLAS libraries are available as Open Source software: GotoBLAS
  - ATLAS is an also-ran
  - Many vendor implementations are derived from the GotoBLAS
- LAPACK
  - Netlib's LAPACK should be considered a reference implementation
  - Excellent performance is achieved by GotoBLAS for a few important operations
  - Very good to excellent performance is achieved by libFLAME
- The future
  - Much more flexible libraries are needed as SMPs and multicore architectures become commonplace
  - Coding at a high level helps manage complexity
62. Unsung heroes
- Along the way, there were many unsung heroes
- Fred Gustavson at IBM greatly influenced the move towards blocked algorithms
- Kågström et al. noticed the importance of casting level-3 BLAS in terms of matrix-matrix multiplication
- Other library people who work for vendors get little personal credit for their achievements
63. More Information
- www.tacc.utexas.edu/resources/software
- www.cs.utexas.edu/users/flame
- www.netlib.org
- www.linearalgebrawiki.org
64. A Bit of History
- 1972: EISPACK is released
- 1973-1977: BLAS for vector-vector operations
- 1974-1977: LINPACK
- 1984-1988: Level-2 BLAS for matrix-vector operations
- 1986-1990: Level-3 BLAS for matrix-matrix operations
- 1987-present: LAPACK for linear systems and linear eigenvalue problems
- 1990-present: ScaLAPACK (subset of LAPACK for distributed-memory architectures)
- 1996-1997: PLAPACK (subset of LAPACK for distributed-memory architectures)
- 1999: ATLAS
- 2002: GotoBLAS
- 2007: libFLAME