1
Linear Algebra on GPUs
  • Vasily Volkov

2
GPU Architecture Features
  • SIMD architecture
  • Don't be confused by the scalar ISA, which is only a programming model
  • We use the vector thread programming model instead
  • Similar to SSE/Cell SPE/vector units, not superscalar units
  • Native vector length is 32; larger lengths can be simulated (thread blocks)
  • Multithreading using register windows
  • A context switch never flushes registers to memory
  • If more threads are created than fit, some don't start until others finish
  • Huge register files, small high-latency caches
  • A fundamental difference from CPUs; similar to vector processors
  • Caches are there to save memory bandwidth, not to reduce latency
  • Use register blocking, not cache blocking
  • Large startup costs (5 µs)
  • Not good for fine-grain computations
  • Use global barriers instead? (1.5 µs)
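
As a rough illustration of that launch cost (not a measurement from this talk), the per-launch overhead can be estimated by timing a batch of empty kernel launches with CUDA events; empty_kernel and the iteration count below are made-up names for this sketch.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}   // does no work; only the launch cost is timed

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 32>>>();            // warm-up launch pays one-time setup costs
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 32>>>();        // back-to-back empty launches
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("approx. launch overhead: %.2f us per kernel\n", 1e3f * ms / iters);
    return 0;
}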

3
Relation to Other Multicores
Processor       Core2 SSE, 2.66GHz   Cell SPEs   G80/G84
Gflop/s/core    21.3                 25.6        19-23
Vector length   4                    4           32
# of cores      2-4                  8           4-16
  • All offer both multithreading and SIMD
    capabilities
  • Use CUDA to program all of them?

4
Pointer Chase Benchmark, 8800GTX
  • Run k = A[k] in a loop in a scalar thread (latency bound); a sketch follows below
  • Larger latency at a cache miss reveals the cache sizes
  • 8-word cache lines
  • L1: 8 x 5kB, each 20-way associative
  • L2: 6 x 32kB, each 4-way associative
  • 512kB memory pages
  • TLB: 16 entries, fully associative

8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
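
The measurement idea can be sketched as follows; this is not the benchmark code used for these results, and the host-side setup (a strided cycle A[i] = (i + stride) mod n, launch configuration <<<1, 1>>>) is an assumption of the sketch.

// One scalar thread chases pointers through A. Every load depends on the
// previous one, so elapsed time / steps approximates the access latency for
// the chosen array size and stride (cache hit, cache miss, TLB miss).
__global__ void pointer_chase(const int *A, int steps, int *out)
{
    int k = 0;
    for (int i = 0; i < steps; ++i)
        k = A[k];          // the serial dependent load k = A[k]
    *out = k;              // keep k live so the loop is not optimized away
}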
5
GPU Memory System (8800GTX)
6
Matrix-Matrix Multiply, C := C + AB
  • 64x16 blocks in C, rank-4 updates
  • Ensures that our code is compute-bound
  • Each thread computes one block of C
  • ijk/jik forms; other choices produce a race condition in C
  • Keep A's and C's blocks in vector registers
  • Similarly done on the IBM 3090 VF and Cray X1
  • Keep B's block in local memory
  • Others keep it in scalar registers (n/a on GPU) or caches (slow)
  • Use compiler options to enforce a tight register budget
  • To hide memory latencies better by multithreading
  • Use prefetching to hide latencies even better
  • Now performance is not bound by latency or bandwidth when reading the blocks of A and B!
  • Bound by instruction issue and local memory ops (230 Gflop/s)
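
The scheme above can be sketched as a simplified CUDA kernel. This is not the actual SGEMM kernel (which also prefetches and uses a larger k-step); it assumes column-major storage, m, n, k that are multiples of 64, 16 and 4, and a launch with grid (m/64, n/16) and 64 threads per block. All names are illustrative.

// Each thread block (one "vector thread" of 64 scalar threads) computes a
// 64x16 tile of C := C + A*B. Each thread keeps one row of the tile (16
// accumulators) in registers; the current 4x16 slice of B is staged in
// local (shared) memory; the k-loop performs one rank-4 update per step.
__global__ void sgemm_tile(const float *A, const float *B, float *C,
                           int lda, int ldb, int ldc, int k)
{
    const int tid  = threadIdx.x;              // 0..63
    const int row  = blockIdx.x * 64 + tid;    // row of C owned by this thread
    const int col0 = blockIdx.y * 16;          // first column of the C tile

    __shared__ float bs[4][16];                // 4x16 slice of B
    float c[16] = {0.0f};                      // C-tile row held in registers

    for (int kk = 0; kk < k; kk += 4) {
        // the 64 threads cooperatively load the 4x16 = 64 elements of B
        bs[tid % 4][tid / 4] = B[(kk + tid % 4) + (col0 + tid / 4) * ldb];
        __syncthreads();

        // rank-4 update: 4 elements of A's row times the 4 rows of the B slice
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            float a = A[row + (kk + i) * lda];
            #pragma unroll
            for (int j = 0; j < 16; ++j)
                c[j] += a * bs[i][j];
        }
        __syncthreads();
    }

    for (int j = 0; j < 16; ++j)               // write the accumulated row back
        C[row + (col0 + j) * ldc] += c[j];
}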

7
Performance of SGEMM
  • CUBLAS 1.1
  • keeps square blocks of A and B in local memory
  • uses long vectors (poor instruction mix)
  • exposing too much data parallelism may cost you
  • Our SGEMM is now in CUBLAS 2.0 beta

8
SGEMM, 8800GTX, k = 1024
Constant work per vector thread (a function of k); the optimized version does better load balancing by computing partial sums
9
Panel Factorization
CPU runtime measured on a Core2 Duo 2.66GHz with Intel MKL 10.0 (includes the CPU-GPU transfer!); GPU runtime estimated for the 8800GTX
10
Design of Matrix Factorizations
  • Right-looking scheme: most parallelism; best on 16-core GPUs
  • Crout scheme: least bandwidth; best on 4-core GPUs and if using CUBLAS 1.1
  • Left-looking: half of the work is in the triangular solve; limited parallelism; inefficient
  • 2-level blocking
  • Both levels are right-looking; premature exit from the finer level to keep
  • Up to 6% speedup only, at large matrices (n ≈ 10,000)
  • Block size on the GPU is 64 (same as in matrix multiply)
  • Autotuning in QR (up to 7% speedup)
  • Row-major layout on the GPU in the LU decomposition
  • Since gathers with a large stride are inefficient
  • Requires transposition at every CPU-GPU transfer
  • >2x speedup!
  • Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
  • Multiply by the inverse (GEMM) instead of a triangular solve (TRSM); see the sketch after this list
  • TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if the matrix is 64x64
  • Parallelism in TRSM is not embarrassing enough
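
The last point can be sketched on the host side with the modern cuBLAS v2 API (not the code from this work): instead of a cublasStrsm on the 64-wide panel, the 64x64 triangular block is inverted once (e.g. with LAPACK strtri on the CPU), uploaded, and the update becomes a GEMM. The handle, device pointers, and names below are assumptions of the sketch.

#include <cublas_v2.h>

// Compute U12 = inv(L11) * A12 with a GEMM, where L11 is the 64x64
// unit-lower-triangular block from the panel factorization. A cublasStrsm
// on a 64x64 triangle exposes too little parallelism (~13 Gflop/s on the
// slide); this GEMM runs near full SGEMM speed (~160 Gflop/s).
void update_with_inverse(cublasHandle_t handle, int n_trail, int lda,
                         const float *d_invL11,  // 64x64 inverse of L11, ld = 64
                         const float *d_A12,     // 64 x n_trail block, ld = lda
                         float *d_U12)           // 64 x n_trail output, ld = 64
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                64, n_trail, 64,
                &alpha, d_invL11, 64,
                d_A12, lda,
                &beta, d_U12, 64);
}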

11
Test Platform
  • GPU
  • Core2 Duo 2.67GHz + GeForce 8800 GTX
  • Core2 Duo 2.67GHz + two GeForce 8800 GTX
  • CPU
  • Core2 Duo 2.67GHz
  • Core2 Quad 2.4GHz

12
Summary of Performance
13
Speedups vs. CPU
14
Summary of Speedups
 
            8800GTX   Core2 Duo            Core2 Quad
            Gflop/s   Gflop/s   speedup    Gflop/s   speedup
Cholesky    183       23.0      7.4x       34.9      5.5x
LU          179       22.7      7.7x       59.2      3.0x
QR          192       22.5      8.3x       44.8      4.3x
SGEMM       206       25.9      8.0x       69.8      3.0x
15
Speedup using 2 GPUs
Using column-cyclic layout
16
Breakdown of runtime (LU)
17
What if we omit one optimization in LU?
18
Other Work Done
  • Tridiagonal eigenvalue solver (bisection)
  • Most work: factorize A - λ_i I = L D L^T and count the signs in D (compute bound); see the sketch after this list
  • Done for many λ_i in parallel; traditionally vectorized
  • If more parallelism is needed, do multisection instead of bisection
  • But it increases the total flop count
  • The rest is difficult to parallelize and not worth it
  • Our solution:
  • Run the vectorized loops on the GPU, the rest (least work) on the CPU
  • Autotune to decide the optimal redundancy and when to involve the CPU
  • Use features of IEEE arithmetic to save another 15-30% of runtime
  • Up to 156x faster than LAPACK on a Pentium 4
  • Tridiagonal eigenvector solver (inverse iteration)
  • Most work: solve (A - λ_i I) x_{k+1} = x_k for fixed λ_i (bandwidth bound)
  • Factorize A - λ_i I = L D L^T once, keep D only; reconstruct L on demand
  • Reconstruction is overlapped with memory access; still bandwidth bound
  • Don't pivot; recompute using safe code if it fails (do it on the CPU)
  • Up to 120x faster than LAPACK on Core2 Duo so far
  • More complicated when eigenvalues are clustered
  • Stencil computation (7-point stencil on a 3D grid)
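
The bisection step above can be sketched as follows; this is not the authors' kernel. Each thread takes one shift λ, implicitly factorizes T - λI = LDL^T for the symmetric tridiagonal T (diagonal d, off-diagonals e), and counts the negative pivots, which gives the number of eigenvalues below λ. Kernel and array names are illustrative.

// One thread per shift: Sturm count for a symmetric tridiagonal matrix.
__global__ void sturm_count(const float *d, const float *e, int n,
                            const float *shifts, int *counts, int nshifts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nshifts) return;

    float lambda = shifts[tid];
    float q = d[0] - lambda;          // first pivot of T - lambda*I = L*D*L^T
    int neg = (q < 0.0f);
    for (int i = 1; i < n; ++i) {
        // next pivot; with IEEE arithmetic a zero pivot becomes +/-inf and the
        // recurrence continues (cf. "use features of IEEE arithmetic" above)
        q = d[i] - lambda - e[i - 1] * e[i - 1] / q;
        neg += (q < 0.0f);
    }
    counts[tid] = neg;                // number of eigenvalues of T below lambda
}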

19
Future Work
  • Analysis of architecture
  • Find best parallels in past architectures to
    reuse methods
  • Catching up with newer GPUs
  • More micro-benchmarking to get better performance
    models
  • More scientific kernels
  • CUFFT runs at 50 Gflop/s; we can do better (e.g. by not computing sin/cos in the inner loop)
  • More LAPACK
  • Two-sided factorizations used in eigensolvers and
    SVD
  • In LAPACK, 50% of the work is in BLAS1/BLAS2
  • A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
  • May use divide-and-conquer instead
  • MRRR (improved inverse iteration algorithm, also
    rich in parallelism)
  • Non-symmetric eigensolvers such as QR iterations
  • Currently fine-grained; can we do better?
  • Iterative refinement for eigenvalue problem?
  • ScaLAPACK (distributed memory LAPACK)
  • One-sided factorizations on a GPU cluster