1
BLAS, LAPACK, and beyond
  • Robert van de Geijn
  • Department of Computer Sciences
  • UT-Austin Institute for Computational Engineering
    and Sciences

2
Motivation
  • Naïve implementation of matrix-matrix
    multiplication
  • Pentium 4
  • 1.8 GHz
  • Simple loops
  • C code

3
Solution: Leave It to the Experts
  • Dense linear algebra libraries are often at the
    bottom of the food chain.
  • Basic Linear Algebra Subprograms (BLAS):
    standardized interfaces for
  • vector-vector (BLAS1),
  • matrix-vector (BLAS2), and
  • matrix-matrix (BLAS3) operations.
  • Linear Algebra Package (LAPACK):
  • De facto standard for higher-level dense and
    banded linear algebra operations
  • Vendors are expected to provide
    high-performance implementations of the BLAS.
  • Sometimes LAPACK-level routines are optimized as
    well.

4
Overview
  • Optimizations over time: BLAS, LAPACK
  • And beyond: FLAME
  • Just show me the library!
  • Conclusion

5
Once upon a time there was a package
  • 1972: EISPACK
  • Package for the solution of the dense eigenvalue
    problem
  • E.g. A x = λ x
  • First robust software package of its kind
  • Numerical stability, performance, and portability
    were all concerns
  • Consistent formatting of code
  • Fortran translations of Algol procedures
  • First released in 1972

6
Basic Linear Algebra Subprograms (BLAS)
  • In the 1970s vector supercomputers ruled
  • Dense (and many sparse) linear algebra algorithms
    can be formulated in terms of vector operations
  • By standardizing an interface to such operations,
    portable high performance could be achieved
  • Vendors responsible for optimized implementations
  • Other libraries coded in terms of the BLAS
  • First proposed in 1973. Published as:
  • C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T.
    Krogh, Basic Linear Algebra Subprograms for
    Fortran usage, ACM Trans. Math. Soft., 5 (1979).
  • Later became known as the level-1 BLAS
  • Fortran77 interface

7
Linear Algebra Package (LINPACK)
  • Targeted solution of linear equations and linear
    least-squares
  • Coded in terms of the level-1 BLAS for
    portability
  • Started in 1974. Published/released in 1977
  • J. J. Dongarra, J. R. Bunch, C. B. Moler, and
    G. W. Stewart, LINPACK Users' Guide, SIAM, 1979
  • Fortran77 interface

8
  • LINPACK
  • LU factorization
  • with partial pivoting
  • (abbreviated)

18
LINPACK (continued)
  • Portable high performance on vector architectures
  • Poor performance when architectures with complex
    memory hierarchies arrived in the 1980s
  • O(n) operations on O(n) data means bandwidth to
    main memory becomes the limiting factor

19
Complex Memory Hierarchies
  • registers → L1 cache → L2 cache → RAM → disk
  • fast and expensive at the top; slow and cheap at
    the bottom
20
Towards Higher Performance
  • Level 2 BLAS matrix-vector operations
  • Started in 1984. Published in 1988
  • J. J. Dongarra, J. Du Croz, S. Hammarling, and R.
    J. Hanson, An extended set of FORTRAN Basic
    Linear Algebra Subprograms, ACM Trans. Math.
    Soft., 14 (1988), pp. 1--17.
  • Casts computation in terms of operations like
    matrix-vector multiplication and rank-1 update
  • Benefit: vector(s) can be kept in cache memory
  • Problem: O(n²) operations on O(n²) data

21
  • LAPACK
  • LU factorization
  • with partial pivoting
  • unblocked algorithm
  • (abbreviated)

24
Performance (Intel Xeon 3.4GHz)
25
Towards High Performance
  • Level 3 BLAS matrix-matrix operations
  • Started in 1986. Published in 1990
  • J. J. Dongarra, J. Du Croz, I. S. Duff, and S.
    Hammarling, A set of Level 3 Basic Linear Algebra
    Subprograms, ACM Trans. Math. Soft., 16 (1990)
  • Casts computation in terms of operations like
    matrix-matrix multiplication
  • Benefit: submatrices can be kept in cache
  • O(n³) operations on O(n²) data

26
  • LAPACK
  • LU factorization
  • with partial pivoting
  • blocked algorithm
  • with block size nb
  • (abbreviated)

30
Performance (Intel Xeon 3.4GHz)
31
And Beyond
  • New architectures with many cores
  • Symmetric Multi-Processors (SMPs)
  • Multicore architectures
  • Algorithms encoded in LAPACK don't always
    parallelize well

32
AMD Opteron 8 cores (2.4GHz)
LAPACK
33
And Beyond (continued)
  • LAPACK code is hard to write/read/maintain/alter
  • Formal Linear Algebra Methods Environment (FLAME)
    Project
  • Started around 2000. First official library
    release (libFLAME 1.0) April 1, 2007
  • Collaboration between UT Dept. of Computer
    Sciences, TACC, and UJI-Spain
  • Systematic approach to deriving, presenting, and
    implementing algorithms

34
Why the FLAME API?
  • From the LAPACK 3.1.1 changes log:
  • replaced calls of the form
  • CALL SCOPY( N, WORK, 1, TAU, 1 )
  • with
  • CALL SCOPY( N-1, WORK, 1, TAU, 1 )
  • at line 694 for s/dchkhs and line 698 for
    c/zchkhs.
  • (TAU is only of length N-1.)

35
  • FLAME
  • LU factorization
  • blocked algorithm with block size nb

40
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
      /* ************** */ /* ******************** */
                            &A10, /**/ &A11, &A12,
        ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
        b, b, FLA_BR );
    /*---------------------------------------------*/

    LU_unb_var5( A11 );
    FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
45
An algorithm for every occasion
  • There are five loop-based algorithms for LU
    factorization without pivoting
  • Five unblocked
  • Five blocked
  • On different architectures, different algorithms
    achieve the best performance

46
Performance (Intel Xeon 3.4GHz)
47
AMD Opteron 8 cores (2.4GHz)
FLAME
LAPACK
48
Just Show Me the Library
  • Reference implementations
  • BLAS reference implementation:
    http://www.netlib.org/blas/
  • Netlib version of LAPACK (a reference
    implementation): http://www.netlib.org/lapack/
  • Open Source implementations
  • ATLAS (Automatically Tuned Linear Algebra
    Software): http://www.netlib.org/atlas/
  • GotoBLAS (TACC):
    http://www.tacc.utexas.edu/resources/software/
  • libFLAME (UT-Austin, Dept. of CS):
    http://www.tacc.utexas.edu/resources/software/
  • Vendor implementations
  • ESSL (IBM), ACML (AMD)
  • MKL (Intel), MLIB (HP)
  • MathKeisan (NEC)
  • All of these are plug-compatible. Choose based
    on functionality, performance, and expense
  • Can mix and match

49
A Few Words about the GotoBLAS
  • Implemented by Kazushige Goto (TACC)
  • Often the fastest implementation of the BLAS and
    of the most commonly used LAPACK operations
  • Under a pretty restrictive Open Source license
  • Licensed to a number of the vendors

50
And the winner is…
51
dgemm Pentium4 (3.6 GHz)
52
dgemm Itanium2 (1.5 GHz)
53
dgemm Power5 (1.9 GHz)
54
GotoBLAS Level 3 BLAS PPC440 FP2 (700 MHz)
55
GotoBLAS Level 3 BLAS PPC440 FP2 (700 MHz)
56
LU with pivoting on 8 cores (4 x AMD 2.4GHz
dual-core Opteron 880)
GotoBLAS
FLAME
LAPACK
57
QR factorization on 8 cores (4 x AMD 2.4GHz
dual-core Opteron 880)
FLAME
LAPACK
58
Cholesky factorization on 8 cores (4 x AMD 2.4GHz
dual-core Opteron 880)
FLAME/GotoBLAS
LAPACK
59
Triangular inversion on 8 cores (4 x AMD 2.4GHz
dual-core Opteron 880)
FLAME/GotoBLAS
LAPACK
60
Similar results on other platforms
61
Conclusion
  • BLAS
  • Netlib's BLAS are a reference implementation
  • Excellent BLAS libraries are available as Open
    Source software: GotoBLAS
  • ATLAS is an also-ran
  • Many vendor implementations are derived from the
    GotoBLAS
  • LAPACK
  • Netlib's LAPACK should be considered a reference
    implementation
  • Excellent performance is achieved by the GotoBLAS
    for a few important operations
  • Very good to excellent performance is achieved by
    libFLAME
  • The future
  • Much more flexible libraries are needed as SMPs
    and multicore architectures become commonplace
  • Coding at a high level helps manage complexity

62
Unsung heroes
  • Along the way, there were many unsung heroes
  • Fred Gustavson at IBM greatly influenced the move
    towards blocked algorithms
  • Kågström et al. noted the importance of casting
    the level-3 BLAS in terms of matrix-matrix
    multiplication
  • Other library people who work for vendors get
    little personal credit for their achievements

63
More Information
  • www.tacc.utexas.edu/resources/software
  • www.cs.utexas.edu/users/flame
  • www.netlib.org
  • www.linearalgebrawiki.org

64
A Bit of History
  • 1972: EISPACK is released
  • 1973-1977: BLAS for vector-vector operations
  • 1974-1977: LINPACK
  • 1984-1988: Level 2 BLAS for matrix-vector
    operations
  • 1986-1990: Level 3 BLAS for matrix-matrix
    operations
  • 1987-present: LAPACK for linear systems and
    linear eigenvalue problems
  • 1990-present: ScaLAPACK (subset of LAPACK for
    distributed memory architectures)
  • 1996-1997: PLAPACK (subset of LAPACK for
    distributed memory architectures)
  • 1999: ATLAS
  • 2002: GotoBLAS
  • 2007: libFLAME