Title: CS 267 Dense Linear Algebra: Possible Class Projects
1. CS 267 Dense Linear Algebra: Possible Class Projects
- James Demmel
- www.cs.berkeley.edu/demmel/cs267_Spr09
2. Kinds of class projects
- Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
- Add missing functionality to these libraries
  - Possible impact: lots of users want it
- Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
- Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while still getting most of the performance? Should we produce future versions of the libraries this way?
- Experiment with new algorithms
  - Possible impact: find a better one!
3. Challenges to Libraries (and parallel SW in general)
- Minimizing communication costs
  - Cost of bandwidth and latency (to main memory or over a network) growing exponentially compared to arithmetic
- Heterogeneous platforms
  - Different communication costs depending on destination: same chip vs. different socket vs. different board
  - CPU vs. GPU: perform different operations at very different rates
- Dynamic scheduling / load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - May be faster to grab the next available task than to use a predesigned, perfectly balanced schedule
  - OS may give or take away resources on the fly
- Fault tolerance: how to recover when one processor fails
4. Strassen's Matmul on Multicore or GPU
- Why no Strassen in most libraries?
  - See "Baleful Effect of Benchmarks" by Prof. Kahan
- Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
  - May want a hybrid: switch to the O(n^3) algorithm for smaller sizes (see the sketch below)
  - Autotuning?
- Lots of blocking opportunities, as for standard matmul
  - What is the least amount of data movement possible?
- How well does it work for the rectangular matmuls in LU, QR and Cholesky?
  - Do we need to modify LU, QR or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
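As a concrete starting point, here is a minimal NumPy sketch of the hybrid idea: recurse with Strassen's formulas above a cutoff and fall back to the classical O(n^3) product below it. The function name strassen, the CROSSOVER value, and the odd-dimension fallback are illustrative choices, not library code; a real implementation would autotune the cutoff and handle odd sizes by padding or peeling.

    import numpy as np

    CROSSOVER = 128   # illustrative threshold; should be autotuned per machine

    def strassen(A, B, cutoff=CROSSOVER):
        # Hybrid Strassen: recurse on even-dimensioned square blocks, fall back
        # to the classical product below the cutoff (or for odd sizes).
        n = A.shape[0]
        if n <= cutoff or n % 2 != 0:
            return A @ B
        k = n // 2
        A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n), dtype=M1.dtype)
        C[:k, :k] = M1 + M4 - M5 + M7
        C[:k, k:] = M3 + M5
        C[k:, :k] = M2 + M4
        C[k:, k:] = M1 - M2 + M3 + M6
        return C

    if __name__ == "__main__":
        A = np.random.rand(256, 256); B = np.random.rand(256, 256)
        print(np.allclose(strassen(A, B), A @ B))   # one level of recursion at this size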
5. Review: Alternative recursive GE formulation
- Toledo (1997)
- Described without pivoting for simplicity
- Do the left half of the matrix, then the right half
    function [L,U] = RLU(A)                   ... assume A is m by n
      if (n = 1)
        L = A / A(1,1),  U = A(1,1)
      else
        [L1,U1] = RLU( A(1:m, 1:n/2) )        ... do left half of A
                                              ... let L11 denote top n/2 rows of L1
        A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                              ... update top n/2 rows of right half of A
        A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                              - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                              ... update rest of right half of A
        [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )  ... do right half of A
        return L = [ L1, [0; L2] ]  and  U = [ U1, A(1:n/2, n/2+1:n); 0, U2 ]
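For experimentation, a direct NumPy transcription of the recursive formulation above (still without pivoting) is given below; the function name rlu and the small test at the end are mine, and the code assumes the leading submatrices are nonsingular.

    import numpy as np

    def rlu(A):
        # Recursive LU without pivoting, following the slide's RLU:
        # A is m-by-n with m >= n; returns unit lower trapezoidal L (m x n)
        # and upper triangular U (n x n) with A = L @ U.
        A = np.array(A, dtype=float)
        m, n = A.shape
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        k = n // 2
        L1, U1 = rlu(A[:, :k])                        # do left half of A
        U12 = np.linalg.solve(L1[:k, :k], A[:k, k:])  # L11^(-1) * A12
        S = A[k:, k:] - L1[k:, :] @ U12               # update rest of right half of A
        L2, U2 = rlu(S)                               # do right half of A
        L = np.zeros((m, n)); L[:, :k] = L1; L[k:, k:] = L2
        U = np.zeros((n, n)); U[:k, :k] = U1; U[:k, k:] = U12; U[k:, k:] = U2
        return L, U

    A = np.random.rand(8, 6) + np.eye(8, 6)
    L, U = rlu(A)
    print(np.allclose(L @ U, A))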
6. Register-file resident Linear Algebra on GPUs
- Vasily's results for LU, QR and Cholesky on the GPU target single large matrices, too large to fit just in the fast memory (shared memory + registers) of the GPU
- There is also demand for solving many smaller problems in parallel, e.g. A(i) x(i) = b(i) for many different A(1),...,A(k) and b(1),...,b(k) (see the sketch after this list)
- Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
  - single-precision square matrix of dimension n <= 128
- Question: does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
- Question: do we need BLAS3 code versions on such small matrices, or is BLAS2 enough?
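To illustrate the problem shape (not the GPU kernel itself), here is a small NumPy sketch of the batched task; the batch size k and dimension n are arbitrary example values, and NumPy's stacked solve stands in for the per-multiprocessor factorization, serving only as a CPU reference.

    import numpy as np

    k, n = 1000, 128                      # a 128 x 128 single-precision matrix is 64 KB
    A = np.random.rand(k, n, n).astype(np.float32)
    b = np.random.rand(k, n, 1).astype(np.float32)

    x = np.linalg.solve(A, b)             # one independent solve per matrix in the batch

    r = np.linalg.norm(A[0] @ x[0] - b[0]) / np.linalg.norm(b[0])
    print(f"relative residual for system 0: {r:.2e}")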
7. Extend Vasily's GPU analysis and code to ATI
- Vasily's Best Student Paper Award from SC08 had two parts
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR and Cholesky
- What about the ATI GPU?
  - Both of the above aspects are interesting
  - An ATI GPU is available in the ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
- Other BLAS-like operations on the GPU
  - Needed for finite-element analysis
8. Missing Drivers in Sca/LAPACK
9. More missing drivers
10. Missing matrix types in ScaLAPACK
- Symmetric, Hermitian, triangular
  - Band, packed
- Positive definite
  - Packed
- Orthogonal, unitary
  - Packed
11. Tuning the data layout
The layout depends on the block size b and the processor grid Pr x Pc (ScaLAPACK's 2D block-cyclic distribution; see the mapping sketch below). Simple layouts are easy for the user, but bad for performance.
Speedups from using a 2D processor grid range from 2x to 8x.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
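As a reminder of what the tuning parameters control, here is a minimal sketch of the standard 2D block-cyclic mapping (square blocks and zero source offsets assumed); the function name block_cyclic_owner is mine.

    def block_cyclic_owner(i, j, b, Pr, Pc):
        # Process (pr, pc) in the Pr x Pc grid that owns global entry A[i, j]
        # under a 2D block-cyclic distribution with b x b blocks.
        pr = (i // b) % Pr
        pc = (j // b) % Pc
        return pr, pc

    # A 1D column layout is the special case Pr = 1; a 2D grid spreads both
    # dimensions, which is where the 2x-8x speedups come from.
    print(block_cyclic_owner(500, 1200, b=64, Pr=6, Pc=10))   # 60 processes as a 6 x 10 grid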
12. Cost of tuning the data layout, compared to runtime
The cost of redistributing the matrix to the optimal layout is small.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
Possible project: build a wrapper that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user.
13. Parallel Eigenvalue Algorithms on GPU
- Harder to use all BLAS3 than for solving Ax = b or least squares
- Symmetric eigenvalue problem for A = A^T (SVD similar)
  - Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and right above and below it)
  - Find eigenvalues Λ = diag(λ1,...,λn) and orthogonal eigenvectors U of T: T = U Λ U^T
    - Good parallel algorithms, cheaper than the first step
  - Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
- A = Q T Q^T is the proposed challenge
  - Use Successive Band Reduction (Sun, Bischof et al.)
  - Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    - All BLAS3, fast on GPU
  - Go from B to tridiagonal T via B = W T W^T, W orthogonal
    - BLAS1 and BLAS2; do it on the CPU
  - Find T = U Λ U^T as above, then A = (VWU) Λ (VWU)^T (see the numerical sketch below)
- Prospect of minimizing communication in theory
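The factorization chain above can be checked numerically on the CPU. The sketch below uses SciPy's one-stage reduction to tridiagonal form (scipy.linalg.hessenberg) as a stand-in for Successive Band Reduction, so it only illustrates the algebra A = Q T Q^T, T = U Λ U^T, eigenvectors = QU, not the GPU strategy; the test matrix and sizes are arbitrary.

    import numpy as np
    from scipy.linalg import hessenberg

    n = 200
    X = np.random.rand(n, n)
    A = X + X.T                            # symmetric test matrix

    T, Q = hessenberg(A, calc_q=True)      # for symmetric A, T is tridiagonal and A = Q T Q^T
    lam, U = np.linalg.eigh(T)             # T = U diag(lam) U^T

    V = Q @ U                              # eigenvectors of A are Q U
    print(np.allclose(A @ V, V * lam))     # A (QU) = (QU) diag(lam)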
14. Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
- icl.cs.utk.edu/plasma/
15. Fork-Join vs. Dynamic Execution on Multicore
Source: Jack Dongarra
[Figure: execution traces comparing fork-join parallel BLAS with DAG-based dynamic scheduling; the dynamic schedule fills the idle time at the join points, giving the "time saved". Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.]
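To make the contrast concrete, here is a toy Python sketch of dependence-driven execution (it has nothing to do with PLASMA's actual API, and the diamond-shaped task graph is made up): a fork-join schedule would barrier after every step, while the loop below launches each task as soon as its own predecessors finish.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
    import time

    deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}   # toy DAG: a -> {b, c} -> d

    def kernel(name):
        time.sleep(0.1)        # stand-in for a tile kernel (GEMM, TRSM, ...)
        return name

    def run_dag(deps, workers=4):
        # Launch every task whose predecessors are done; never barrier on a whole step.
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(deps):
                for t, pre in deps.items():
                    if t not in done and t not in running and all(p in done for p in pre):
                        running[t] = pool.submit(kernel, t)
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for f in finished:
                    name = f.result()
                    done.add(name)
                    running.pop(name)
        return done

    print(run_dag(deps))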
16. Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
- icl.cs.utk.edu/plasma/
- Experiment with PLASMA
  - Implement other factorizations
  - Compare performance
    - to LAPACK with parallel BLAS
    - to ScaLAPACK
  - Evaluate expressiveness for eigenvalue problems
  - Study the interaction of its scheduler with the higher-level scheduler being designed in the ParLab
    - Can PLASMA gracefully accept and give up resources?
- Perform analogous experiments with UPC, Titanium or other PGAS languages
17. Investigate the role of the Dense Motif in ParLab Apps
- An initial study showed Dense Linear Algebra in
  - Image, Speech, Music
- Determine what is really needed
  - Functions, problem sizes, performance requirements
- What do we still need to optimize?