Title: Linear Algebra on GPUs
1. Linear Algebra on GPUs
2. GPU Architecture Features
- SIMD architecture
  - Don't be confused by the scalar ISA, which is only a programming model
  - We use the vector thread programming model instead (a minimal kernel sketch follows this list)
- Similar to SSE / Cell SPE / vector units, not superscalar units
  - Native vector length is 32; can simulate larger (thread blocks)
- Multithreading using register windows
  - A context switch never flushes registers to memory
  - If there are more threads than can fit, some don't start until others finish
- Huge register files, small high-latency caches
  - Fundamental difference with CPUs; similar to vector processors
  - Cache is there to save memory bandwidth, not to reduce latency
  - Use register blocking, not cache blocking
- Large startup costs (5 μs)
  - Not good for fine-grain computations
  - Use global barriers instead? (1.5 μs)
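A minimal CUDA sketch (illustrative, not from the slides) of the vector-thread view described above: one thread block is treated as a single vector thread, each scalar CUDA thread holds one vector element in its own registers, and the block size plays the role of the simulated vector length.

// Minimal sketch: a thread block of 64 CUDA threads acts as one "vector
// thread" of vector length 64.  Each scalar thread owns one vector element
// in registers, so the scalar ISA is just the programming model on top of
// an underlying SIMD machine.
__global__ void saxpy_vector_thread(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this lane's element
    if (i < n)
        y[i] = a * x[i] + y[i];   // the whole block executes this as one vector op
}

// Launch: one vector thread per 64 elements, e.g.
//   saxpy_vector_thread<<<(n + 63) / 64, 64>>>(n, a, d_x, d_y);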
3. Relation to Other Multicores
Processor        Core2 SSE, 2.66GHz   Cell SPEs       G80/G84
Gflop/s/core     21.3 Gflop/s         25.6 Gflop/s    19-23 Gflop/s
Vector length    4                    4               32
# of cores       2-4                  8               4-16

- All offer both multithreading and SIMD capabilities
- Use CUDA to program all of them?
4. Pointer Chase Benchmark, 8800GTX
- Run k = A[k] in a loop in a scalar thread (a sketch of such a kernel follows below)
  - Latency bound
  - Larger latency at a cache miss
- Reveals cache sizes:
  - 8-word cache line
  - L1: 8 x 5kB, each 20-way associative
  - L2: 6 x 32kB, each 4-way associative
  - 512kB memory pages
  - TLB: 16 entries, fully associative
- 8800GTX / 8800GTS / 8600GTS / FX5600: different numbers of similar caches
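A hedged sketch of what such a pointer-chase microbenchmark can look like (a reconstruction; array setup, names, and timing are assumptions, not the authors' code):

// One scalar thread chases pointers: k = A[k] in a loop.  Every iteration is
// a dependent load, so the runtime divided by `iters` is the average memory
// latency.  The host pre-initializes A so that A[k] forms a cycle with a
// chosen stride and working-set size; sweeping these reveals the cache line
// size, cache sizes/associativity, and TLB reach.
__global__ void pointer_chase(const unsigned int *A, int iters,
                              unsigned int *out)
{
    unsigned int k = 0;
    for (int i = 0; i < iters; i++)
        k = A[k];          // serialized, latency-bound chain of loads
    *out = k;              // store the result so the loop is not optimized away
}
// Launched as pointer_chase<<<1, 1>>>(d_A, iters, d_out) and timed on the host.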
5. GPU Memory System (8800GTX)
6. Matrix-Matrix Multiply, C = C + AB
- 64x16 blocks in C, rank-4 updates (a simplified kernel sketch follows this list)
  - Ensures that our code is compute-bound
- Each thread computes one block in C
  - ijk/jik form; other choices produce a race condition in C
- Keep A's and C's blocks in vector registers
  - Similarly done on the IBM 3090 VF and Cray X1
- Keep B's block in local memory
  - Others keep it in scalar registers (n/a on a GPU) or in caches (slow)
- Use compiler options to enforce a tight register budget
  - To better hide memory latencies by multithreading
- Use prefetching to hide latencies even better
- Now performance is not bound by the latency or bandwidth of reading blocks of A and B!
  - Bound by instruction issue and local memory ops (230 Gflop/s)
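A simplified sketch in the spirit of this blocking scheme (assumptions: column-major layout, dimensions divisible by the tile sizes; the production kernel also uses rank-4 updates and prefetching, which are omitted here for clarity):

#define TILE_M 64   // rows of the C tile (also the thread block size)
#define TILE_N 16   // columns of the C tile
#define TILE_K 16   // depth of one pass over k

// One thread block of 64 threads computes a 64x16 tile of C = C + A*B.
// Each thread keeps the 16 partial sums of one row of the tile in registers
// (register blocking); a 16x16 tile of B is staged in shared ("local") memory.
__global__ void sgemm_tile(int k,
                           const float *A, int lda,
                           const float *B, int ldb,
                           float       *C, int ldc)
{
    __shared__ float Bs[TILE_K][TILE_N];

    int row  = blockIdx.x * TILE_M + threadIdx.x;  // this thread's row of C
    int col0 = blockIdx.y * TILE_N;                // first column of the tile

    float c[TILE_N];                               // C-tile row in registers
    for (int j = 0; j < TILE_N; j++)
        c[j] = C[row + (col0 + j) * ldc];

    for (int kk = 0; kk < k; kk += TILE_K) {
        // 64 threads cooperatively load the 16x16 tile of B (4 elements each)
        for (int t = threadIdx.x; t < TILE_K * TILE_N; t += blockDim.x)
            Bs[t % TILE_K][t / TILE_K] =
                B[(kk + t % TILE_K) + (col0 + t / TILE_K) * ldb];
        __syncthreads();

        for (int p = 0; p < TILE_K; p++) {
            float a = A[row + (kk + p) * lda];     // one element of A per step
            for (int j = 0; j < TILE_N; j++)
                c[j] += a * Bs[p][j];              // update this row of the C tile
        }
        __syncthreads();
    }

    for (int j = 0; j < TILE_N; j++)
        C[row + (col0 + j) * ldc] = c[j];
}
// Launch: sgemm_tile<<<dim3(m / TILE_M, n / TILE_N), TILE_M>>>(k, dA, lda, dB, ldb, dC, ldc);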
7. Performance of SGEMM
- CUBLAS 1.1
  - keeps square blocks of A and B in local memory
  - uses long vectors (poor instruction mix)
  - exposing too much data parallelism may cost you
- Our SGEMM is now in the CUBLAS 2.0 beta
8. SGEMM, 8800GTX, k = 1024
- Constant work per vector thread (a function of k)
- The optimized version does better load balancing by computing partial sums
9. Panel Factorization
- CPU: runtime on a Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!)
- GPU: estimated for the 8800GTX as …
10. Design of Matrix Factorizations
- Right-looking scheme: most parallelism; best on 16-core GPUs
- Crout scheme: least bandwidth; best on 4-core GPUs and if using CUBLAS 1.1
- Left-looking scheme: half of the work is in a triangular solve; limited parallelism; inefficient
- 2-level blocking
  - Both levels are right-looking; premature exit from the finer level to keep …
  - Up to 6% speedup only, at large matrices (n ≈ 10,000)
  - Block size on the GPU is 64 (same as in the matrix multiply)
- Autotuning in QR (up to 7% speedup)
- Row-major layout on the GPU in the LU decomposition
  - Since gathers with a large stride are inefficient
  - Requires a transposition at every CPU-GPU transfer
  - >2x speedup!
- Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
- Multiply by the inverse (GEMM) instead of a triangular solve (TRSM); see the sketch after this list
  - TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if the matrix is 64x64
  - Parallelism in TRSM is not embarrassing enough
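A hedged host-side sketch of the multiply-by-inverse idea (a reconstruction, not the authors' code; it uses the cuBLAS v2 API, and names such as d_Tinv and the right-side orientation are assumptions):

#include <cublas_v2.h>

// Instead of a triangular solve of a 64-wide panel against a 64x64
// triangular block T (cublasStrsm, slow on such small blocks), inv(T) is
// precomputed once and applied with a GEMM, which runs near peak.
void apply_inverse(cublasHandle_t h,
                   int m,                 // rows of the panel B
                   const float *d_B,      // m x 64 panel, leading dimension ld
                   int ld,
                   const float *d_Tinv,   // 64 x 64 inverse of the triangular block
                   float *d_X)            // m x 64 result X = B * inv(T), leading dim. ld
{
    const float one = 1.0f, zero = 0.0f;
    // X = 1.0 * B * Tinv + 0.0 * X   -- replaces a right-side TRSM on the panel
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                m, 64, 64,
                &one,  d_B,    ld,
                       d_Tinv, 64,
                &zero, d_X,    ld);
}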
11. Test Platform
- GPU
  - Core2 Duo 2.67GHz + GeForce 8800 GTX
  - Core2 Duo 2.67GHz + two GeForce 8800 GTX
- CPU
  - Core2 Duo 2.67GHz
  - Core2 Quad 2.4GHz
12. Summary of Performance
13. Speedups vs. CPU
14. Summary of Speedups
           8800GTX    Core2 Duo   Core2 Duo   Core2 Quad   Core2 Quad
           Gflop/s    Gflop/s     speedup     Gflop/s      speedup
Cholesky   183        23.0        7.4×        34.9         5.5×
LU         179        22.7        7.7×        59.2         3.0×
QR         192        22.5        8.3×        44.8         4.3×
SGEMM      206        25.9        8.0×        69.8         3.0×
15. Speedup Using 2 GPUs
- Using a column-cyclic layout (an illustrative sketch of the assignment follows)
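Purely illustrative (not the authors' code): how a block-column-cyclic layout splits the matrix across 2 GPUs.

// Block column j (of width nb) of the matrix is owned by GPU j % num_gpus,
// so the trailing-matrix updates of a factorization are split roughly evenly
// between the GPUs at every step.
static inline int owner_of_block_column(int j, int num_gpus)
{
    return j % num_gpus;   // column-cyclic assignment
}
// Example: with num_gpus = 2, block columns 0, 2, 4, ... go to GPU 0 and
// block columns 1, 3, 5, ... go to GPU 1.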
16. Breakdown of Runtime (LU)
17. What if We Omit One Optimization in LU?
18. Other Work Done
- Tridiagonal eigenvalue solver (bisection); see the Sturm-count sketch after this list
  - Most work: factorize A - λ_i I = LDL^T and count the signs in D (compute bound)
  - Done for many λ_i in parallel; traditionally vectorized
  - If more parallelism is needed, do multisection instead of bisection
    - But it increases the total flop count
  - The rest is difficult to parallelize and is not worth it
  - Our solution:
    - Run the vectorized loops on the GPU, the rest (least work) on the CPU
    - Autotune to decide the optimal redundancy and when to involve the CPU
    - Use features of IEEE arithmetic to save another 15-30% of the runtime
  - Up to 156x faster than LAPACK on a Pentium 4
- Tridiagonal eigenvector solver (inverse iteration)
  - Most work: solve (A - λ_i I) x_{k+1} = x_k for fixed λ_i (bandwidth bound)
  - Factorize A - λ_i I = LDL^T once, keep D only; reconstruct L on demand
    - Reconstruction is overlapped with memory access; still bandwidth bound
  - Don't pivot; recompute using safe code if it fails (do it on the CPU)
  - Up to 120x faster than LAPACK on a Core2 Duo so far
    - More complicated when eigenvalues are clustered
- Stencil computation (7-point on a 3D grid)
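A hedged sketch of the bisection kernel's core computation (a reconstruction; names and the exact handling of break-down are assumptions). It uses the fact that the number of negative pivots in A - λI = LDL^T equals the number of eigenvalues of the tridiagonal matrix below λ, and that IEEE arithmetic lets the recurrence run through a zero pivot without branching (division by zero yields ±inf, which simply propagates).

// One thread per shift lambda[i]: run the LDL^T pivot recurrence for the
// symmetric tridiagonal matrix with diagonal d[0..n-1] and off-diagonal
// e[0..n-2], and count the negative pivots (the Sturm count).
__global__ void sturm_count(int n, const float *d, const float *e,
                            int nshifts, const float *lambda, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nshifts) return;

    float shift = lambda[i];
    float t = d[0] - shift;                           // first pivot D[0]
    int c = (t < 0.0f);
    for (int k = 1; k < n; k++) {
        t = d[k] - shift - e[k - 1] * e[k - 1] / t;   // next pivot of LDL^T
        c += (t < 0.0f);                              // count negative pivots
    }
    count[i] = c;                                     // #eigenvalues below lambda[i]
}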
19. Future Work
- Analysis of architecture
  - Find the best parallels in past architectures to reuse their methods
  - Catching up with newer GPUs
  - More micro-benchmarking to get better performance models
- More scientific kernels
  - CUFFT is 50 Gflop/s; we can do better (e.g. by not doing sin/cos in the inner loop)
- More LAPACK
  - Two-sided factorizations used in eigensolvers and the SVD
    - In LAPACK, 50% of the work is in BLAS1/BLAS2
    - A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
    - May use divide-and-conquer instead
  - MRRR (improved inverse iteration algorithm, also rich in parallelism)
  - Non-symmetric eigensolvers such as QR iterations
    - Currently fine-grained; can we do better?
  - Iterative refinement for the eigenvalue problem?
- ScaLAPACK (distributed memory LAPACK)
  - One-sided factorizations on a GPU cluster