Title: CS 267 Dense Linear Algebra: Possible Class Projects
1. CS 267 Dense Linear Algebra: Possible Class Projects
- James Demmel
- www.cs.berkeley.edu/demmel/cs267_Spr09
2. Kinds of class projects
- Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
- Add missing functionality to these libraries
  - Possible impact: lots of users want it
- Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
- Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while still getting most of the performance? Should we produce future versions of the libraries this way?
- Experiment with new algorithms
  - Possible impact: find a better one!
3. Challenges to Libraries (and parallel SW in general)
- Minimizing communication costs
  - Cost of bandwidth and latency (to main memory or over a network) growing exponentially compared to arithmetic
- Heterogeneous platforms
  - Different communication costs depending on destination: same chip vs. different socket vs. different board
  - CPU vs. GPU: perform different operations at very different rates
- Dynamic scheduling / load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - May be faster to grab the next available task than to use a predesigned, perfectly balanced schedule
  - OS may give or take away resources on the fly
- Fault tolerance: how to recover when one processor fails
4. Strassen's Matmul on Multicore or GPU
- Why no Strassen in most libraries?
  - See "Baleful Effect of Benchmarks" by Prof. Kahan
- Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
  - May want a hybrid: switch to the O(n^3) algorithm for smaller sizes (see the sketch below)
  - Autotuning?
- Lots of blocking opportunities, as for standard matmul
  - What is the least amount of data movement possible?
- How well does it work for the rectangular matmuls in LU, QR and Cholesky?
  - Do we need to modify LU, QR or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
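As a concrete starting point, here is a minimal NumPy sketch of the hybrid idea: recurse with Strassen's formulas above a cutoff and fall back to the classical O(n^3) product below it. The function name strassen, the CROSSOVER value, and the odd-dimension fallback are illustrative choices, not library code; a real implementation would autotune the cutoff and handle odd sizes by padding or peeling.

    import numpy as np

    CROSSOVER = 128   # illustrative threshold; should be autotuned per machine

    def strassen(A, B, cutoff=CROSSOVER):
        # Hybrid Strassen: recurse on even-dimensioned square blocks, fall back
        # to the classical product below the cutoff (or for odd sizes).
        n = A.shape[0]
        if n <= cutoff or n % 2 != 0:
            return A @ B
        k = n // 2
        A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n), dtype=M1.dtype)
        C[:k, :k] = M1 + M4 - M5 + M7
        C[:k, k:] = M3 + M5
        C[k:, :k] = M2 + M4
        C[k:, k:] = M1 - M2 + M3 + M6
        return C

    if __name__ == "__main__":
        A = np.random.rand(256, 256); B = np.random.rand(256, 256)
        print(np.allclose(strassen(A, B), A @ B))   # one level of recursion at this size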
5. Review: Alternative recursive GE formulation
- Toledo (1997)
- Described without pivoting for simplicity
- Do the left half of the matrix, then the right half
    function [L,U] = RLU(A)                   ... assume A is m by n
      if (n = 1)
        L = A / A(1,1),  U = A(1,1)
      else
        [L1,U1] = RLU( A(1:m, 1:n/2) )        ... do left half of A
                                              ... let L11 denote top n/2 rows of L1
        A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                              ... update top n/2 rows of right half of A
        A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                              - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                              ... update rest of right half of A
        [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )  ... do right half of A
        return L = [ L1, [0; L2] ]  and  U = [ U1, A(1:n/2, n/2+1:n); 0, U2 ]
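For experimentation, a direct NumPy transcription of the recursive formulation above (still without pivoting) is given below; the function name rlu and the small test at the end are mine, and the code assumes the leading submatrices are nonsingular.

    import numpy as np

    def rlu(A):
        # Recursive LU without pivoting, following the slide's RLU:
        # A is m-by-n with m >= n; returns unit lower trapezoidal L (m x n)
        # and upper triangular U (n x n) with A = L @ U.
        A = np.array(A, dtype=float)
        m, n = A.shape
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        k = n // 2
        L1, U1 = rlu(A[:, :k])                        # do left half of A
        U12 = np.linalg.solve(L1[:k, :k], A[:k, k:])  # L11^(-1) * A12
        S = A[k:, k:] - L1[k:, :] @ U12               # update rest of right half of A
        L2, U2 = rlu(S)                               # do right half of A
        L = np.zeros((m, n)); L[:, :k] = L1; L[k:, k:] = L2
        U = np.zeros((n, n)); U[:k, :k] = U1; U[:k, k:] = U12; U[k:, k:] = U2
        return L, U

    A = np.random.rand(8, 6) + np.eye(8, 6)
    L, U = rlu(A)
    print(np.allclose(L @ U, A))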
6. Register-file resident Linear Algebra on GPUs
- Vasily's results for LU, QR and Cholesky on the GPU target single large matrices, too large to fit just in the fast memory (shared memory + registers) of the GPU
- There is also demand for solving many smaller problems in parallel, e.g. A(i) x(i) = b(i) for many different A(1),...,A(k) and b(1),...,b(k) (see the sketch after this list)
- Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
  - single-precision square matrix of dimension n <= 128
- Question: does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
- Question: do we need BLAS3 code versions on such small matrices, or is BLAS2 enough?
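To illustrate the problem shape (not the GPU kernel itself), here is a small NumPy sketch of the batched task; the batch size k and dimension n are arbitrary example values, and NumPy's stacked solve stands in for the per-multiprocessor factorization, serving only as a CPU reference.

    import numpy as np

    k, n = 1000, 128                      # a 128 x 128 single-precision matrix is 64 KB
    A = np.random.rand(k, n, n).astype(np.float32)
    b = np.random.rand(k, n, 1).astype(np.float32)

    x = np.linalg.solve(A, b)             # one independent solve per matrix in the batch

    r = np.linalg.norm(A[0] @ x[0] - b[0]) / np.linalg.norm(b[0])
    print(f"relative residual for system 0: {r:.2e}")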
7. Extend Vasily's GPU analysis and code to ATI
- Vasily's Best Student Paper Award from SC08 had two parts
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR and Cholesky
- What about the ATI GPU?
  - Both of the above aspects are interesting
  - An ATI GPU is available in the ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
- Other BLAS-like operations on the GPU
  - Needed for finite-element analysis
8. Missing Drivers in Sca/LAPACK
9. More missing drivers
10. Missing matrix types in ScaLAPACK
- Symmetric, Hermitian, triangular
  - Band, packed
- Positive definite
  - Packed
- Orthogonal, unitary
  - Packed
11. Tuning the data layout
The layout depends on the block size b and the processor grid Pr x Pc (ScaLAPACK's 2D block-cyclic distribution; see the mapping sketch below). Simple layouts are easy for the user, but bad for performance.
Speedups from using a 2D processor grid range from 2x to 8x.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
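As a reminder of what the tuning parameters control, here is a minimal sketch of the standard 2D block-cyclic mapping (square blocks and zero source offsets assumed); the function name block_cyclic_owner is mine.

    def block_cyclic_owner(i, j, b, Pr, Pc):
        # Process (pr, pc) in the Pr x Pc grid that owns global entry A[i, j]
        # under a 2D block-cyclic distribution with b x b blocks.
        pr = (i // b) % Pr
        pc = (j // b) % Pc
        return pr, pc

    # A 1D column layout is the special case Pr = 1; a 2D grid spreads both
    # dimensions, which is where the 2x-8x speedups come from.
    print(block_cyclic_owner(500, 1200, b=64, Pr=6, Pc=10))   # 60 processes as a 6 x 10 grid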
12. Cost of tuning the data layout, compared to runtime
The cost of redistributing the matrix to the optimal layout is small.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
Possible project: build a wrapper that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user.
13. Parallel Eigenvalue Algorithms on GPU
- Harder to use all BLAS3 than for solving Ax = b or least squares
- Symmetric eigenvalue problem for A = A^T (SVD similar)
  - Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and right above and below it)
  - Find eigenvalues Λ = diag(λ1,...,λn) and orthogonal eigenvectors U of T: T = U Λ U^T
    - Good parallel algorithms, cheaper than the first step
  - Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
- A = Q T Q^T is the proposed challenge
  - Use Successive Band Reduction (Sun, Bischof et al.)
  - Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    - All BLAS3, fast on GPU
  - Go from B to tridiagonal T via B = W T W^T, W orthogonal
    - BLAS1 and BLAS2; do it on the CPU
  - Find T = U Λ U^T as above, then A = (VWU) Λ (VWU)^T (see the numerical sketch below)
- Prospect of minimizing communication in theory
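The factorization chain above can be checked numerically on the CPU. The sketch below uses SciPy's one-stage reduction to tridiagonal form (scipy.linalg.hessenberg) as a stand-in for Successive Band Reduction, so it only illustrates the algebra A = Q T Q^T, T = U Λ U^T, eigenvectors = QU, not the GPU strategy; the test matrix and sizes are arbitrary.

    import numpy as np
    from scipy.linalg import hessenberg

    n = 200
    X = np.random.rand(n, n)
    A = X + X.T                            # symmetric test matrix

    T, Q = hessenberg(A, calc_q=True)      # for symmetric A, T is tridiagonal and A = Q T Q^T
    lam, U = np.linalg.eigh(T)             # T = U diag(lam) U^T

    V = Q @ U                              # eigenvectors of A are Q U
    print(np.allclose(A @ V, V * lam))     # A (QU) = (QU) diag(lam)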
14. Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
- icl.cs.utk.edu/plasma/
15. Fork-Join vs. Dynamic Execution on Multicore
Source: Jack Dongarra
[Figure: execution traces comparing fork-join parallel BLAS with DAG-based dynamic scheduling; the dynamic schedule fills the idle time at the join points, giving the "time saved". Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.]
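To make the contrast concrete, here is a toy Python sketch of dependence-driven execution (it has nothing to do with PLASMA's actual API, and the diamond-shaped task graph is made up): a fork-join schedule would barrier after every step, while the loop below launches each task as soon as its own predecessors finish.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
    import time

    deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}   # toy DAG: a -> {b, c} -> d

    def kernel(name):
        time.sleep(0.1)        # stand-in for a tile kernel (GEMM, TRSM, ...)
        return name

    def run_dag(deps, workers=4):
        # Launch every task whose predecessors are done; never barrier on a whole step.
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(deps):
                for t, pre in deps.items():
                    if t not in done and t not in running and all(p in done for p in pre):
                        running[t] = pool.submit(kernel, t)
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for f in finished:
                    name = f.result()
                    done.add(name)
                    running.pop(name)
        return done

    print(run_dag(deps))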
16. Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
- icl.cs.utk.edu/plasma/
- Experiment with PLASMA
  - Implement other factorizations
  - Compare performance
    - to LAPACK with parallel BLAS
    - to ScaLAPACK
  - Evaluate expressiveness for eigenvalue problems
  - Study the interaction of its scheduler with the higher-level scheduler being designed in the ParLab
    - Can PLASMA gracefully accept and give up resources?
- Perform analogous experiments with UPC, Titanium or other PGAS languages
17. Investigate the role of the Dense Motif in ParLab Apps
- An initial study showed Dense Linear Algebra in
  - Image, Speech, Music
- Determine what is really needed
  - Functions, problem sizes, performance requirements
- What do we still need to optimize?