Title: Auto-tuning Sparse Matrix Kernels
1. Auto-tuning Sparse Matrix Kernels
- Sam Williams1,2, Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, James Demmel1,2
- 1University of California, Berkeley  2Lawrence Berkeley National Laboratory  3Georgia Institute of Technology
- samw@cs.berkeley.edu
2. Motivation
- Multicore is the de facto solution for improving peak performance for the next decade
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them
- We require a HW/SW solution that guarantees performance without completely sacrificing productivity
3. Overview
- Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel
- Present and analyze two threaded, auto-tuned implementations
- Benchmark performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show:
  - Auto-tuning can significantly improve performance
  - Cell consistently delivers good performance and efficiency
  - Niagara2 delivers good performance and productivity
4. Multicore SMPs Used
5. Multicore SMP Systems
6. Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy
7. Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy (Clovertown, Opteron, Niagara2)
- Disjoint local-store memory hierarchy (Cell)
8. Multicore SMP Systems (memory hierarchy)
- Cache-based: Pthreads implementations
- Local store: libspe implementations
9. Multicore SMP Systems (peak flops)
- Intel Clovertown: 75 Gflop/s
- AMD Opteron: 17 Gflop/s
- IBM Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
- Sun Niagara2: 11 Gflop/s
10. Multicore SMP Systems (peak DRAM bandwidth)
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- IBM Cell Blade: 51 GB/s
- Sun Niagara2: 42 GB/s (read), 21 GB/s (write)
11. Multicore SMP Systems
- Non-uniform memory access (NUMA): dual-socket Opteron, Cell blade
- Uniform memory access (UMA): Clovertown, Niagara2
12. Arithmetic Intensity
(Figure: kernels arranged by arithmetic intensity, from O(1) to O(N) — O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods)
- Arithmetic intensity = total flops / total DRAM bytes (see the worked SpMV estimate below)
- Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality)
- But there are many important and interesting kernels that don't
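As a rough back-of-the-envelope example (assuming double-precision CSR values with 4-byte column indices, and ignoring vector and row-pointer traffic), SpMV sits at the low end of this spectrum:

\[
\mathrm{AI}_{\mathrm{SpMV}} \;\approx\; \frac{2\ \text{flops per nonzero}}{(8+4)\ \text{bytes per nonzero}} \;\approx\; 0.17\ \tfrac{\text{flops}}{\text{byte}} \;\;\Longleftrightarrow\;\; \approx 6\ \tfrac{\text{bytes}}{\text{flop}}
\]

which is consistent with the memory-bound behavior discussed for SpMV later in the talk.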
13. Auto-tuning
14. Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible
- Goal: a productive solution for performance portability
- Our auto-tuning approach finds a good-performing solution by a combination of heuristics and exhaustive search (a minimal search-loop sketch follows below):
  - A Perl script generates many possible kernels
  - (generates SIMD-optimized kernels)
  - An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler
  - Performance depends on the optimizations generated
  - Heuristics are often desirable when the search space isn't tractable
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
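As a rough illustration of the exhaustive-search step, here is a minimal sketch in C, assuming the code generator has already produced a set of candidate kernels behind a common function-pointer signature (the names spmv_fn and pick_best_variant are illustrative, not the actual benchmark's interface):

#include <stddef.h>
#include <time.h>

/* A generated SpMV variant: computes y += A*x, where A is an opaque,
 * pre-packed sparse matrix produced by the code generator. */
typedef void (*spmv_fn)(const void *A, const double *x, double *y);

/* Exhaustive search: time each candidate kernel on the actual matrix and
 * return the index of the fastest for this architecture/dataset/compiler. */
size_t pick_best_variant(spmv_fn *cand, size_t ncand,
                         const void *A, const double *x, double *y, int trials)
{
    size_t best = 0;
    double best_sec = 1e300;
    for (size_t c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        for (int t = 0; t < trials; t++)
            cand[c](A, x, y);                 /* run this candidate repeatedly */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (sec < best_sec) {                 /* keep the fastest variant seen */
            best_sec = sec;
            best = c;
        }
    }
    return best;   /* clock() is coarse; a real harness would use a cycle counter */
}

A heuristic pass would prune the candidate list before this loop (for example, discarding block sizes that add too much fill) so the timed search stays tractable.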
15. Sparse Matrix-Vector Multiplication (SpMV)
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
16. Sparse Matrix-Vector Multiplication
- Sparse matrix:
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta-data (a CSR sketch follows below)
- Evaluate y = Ax
  - A is a sparse matrix
  - x, y are dense vectors
- Challenges:
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity (often > 6 bytes/flop); likely memory bound
(Figure: y = Ax, with sparse A and dense vectors x and y)
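For concreteness, a minimal sketch (in C) of the meta-data that CSR storage carries; the struct and field names are illustrative, not the paper's actual data structures:

/* Compressed Sparse Row (CSR): only the nonzeros are stored, but every nonzero
 * drags along an explicit column index, and every row needs a pointer. */
typedef struct {
    int     nrows, ncols, nnz;
    int    *rowptr;   /* nrows+1 entries: row i owns [rowptr[i], rowptr[i+1]) */
    int    *colidx;   /* nnz entries: column of each nonzero (4 B of meta-data each) */
    double *values;   /* nnz entries: the nonzero values themselves (8 B each) */
} csr_matrix;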
17. Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices
  - None should fit in cache
  - Rank ranges from 2K to 1M
- Subdivided them into 4 categories:
  - Dense: 2K x 2K dense matrix stored in sparse format
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology
  - Poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase
  - Extreme aspect ratio (linear programming): LP
18. Naïve Serial Implementation
- Vanilla C implementation (a kernel sketch follows below)
- Matrix stored in CSR (compressed sparse row) format
- Explored compiler options, but only the best is presented here
- An x86 core delivers > 10x the performance of a Niagara2 thread
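A sketch of such a vanilla kernel, assuming the illustrative csr_matrix layout above (not the authors' exact code):

/* Naive serial SpMV: y += A*x with A in CSR. Two flops per nonzero, one
 * indirect (irregular) load from x per nonzero, streaming access to A. */
void spmv_csr_serial(const csr_matrix *A, const double *restrict x,
                     double *restrict y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = y[i];
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->values[k] * x[A->colidx[k]];
        y[i] = sum;
    }
}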
19. Naïve Parallel Implementation
- SPMD style
- Partition by rows
- Load balance by nonzeros (see the partitioning sketch below)
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
(Figure legend: Naïve Pthreads, Naïve)
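A sketch of the nonzero-balanced row partitioning, again assuming the illustrative csr_matrix layout; each thread then runs the serial kernel on its own band of rows:

/* Partition rows among threads so each gets roughly nnz/nthreads nonzeros.
 * Thread t is assigned rows [row_start[t], row_start[t+1]). */
void partition_by_nnz(const csr_matrix *A, int nthreads, int *row_start)
{
    int t = 0;
    row_start[0] = 0;
    for (int i = 0; i < A->nrows && t < nthreads - 1; i++) {
        /* advance to the next thread once its share of nonzeros is reached */
        long target = (long)A->nnz * (t + 1) / nthreads;
        if (A->rowptr[i + 1] >= target)
            row_start[++t] = i + 1;
    }
    while (t < nthreads - 1)                /* handle degenerate distributions */
        row_start[++t] = A->nrows;
    row_start[nthreads] = A->nrows;
}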
20. Naïve Parallel Implementation
- SPMD style; partition by rows; load balance by nonzeros
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
- Scaling from naïve serial to naïve parallel:
  - Clovertown: 8x cores, 1.9x performance
  - Opteron: 4x cores, 1.5x performance
  - Niagara2: 64x threads, 41x performance
  - Cell PPEs: 4x threads, 3.4x performance
(Figure legend: Naïve Pthreads, Naïve)
21. Naïve Parallel Implementation
- SPMD style; partition by rows; load balance by nonzeros
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
- Fraction of peak achieved (naïve parallel):
  - Clovertown: 1.4% of peak flops, 29% of bandwidth
  - Opteron: 4% of peak flops, 20% of bandwidth
  - Niagara2: 25% of peak flops, 39% of bandwidth
  - Cell PPEs: 2.7% of peak flops, 4% of bandwidth
(Figure legend: Naïve Pthreads, Naïve)
22. Auto-tuned Performance (NUMA, SW Prefetching)
- Use first touch or libnuma to exploit NUMA
- Also includes process affinity
- Tag prefetches with temporal-locality hints
- Auto-tune: search for the optimal prefetch distances (see the sketch below)
(Figure legend: SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
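A sketch of the kind of inner loop the prefetch-distance search tunes, using GCC's __builtin_prefetch; pf_dist is exactly the knob the auto-tuner sweeps (the structure is illustrative, and real code pads the arrays or clamps the distance so the prefetched addresses stay in bounds):

/* CSR SpMV with software prefetching of the value and index streams. */
void spmv_csr_prefetch(const csr_matrix *A, const double *restrict x,
                       double *restrict y, int pf_dist)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = y[i];
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            /* hint the hardware to fetch data pf_dist nonzeros ahead;
             * the third argument (temporal-locality hint) can also be tuned */
            __builtin_prefetch(&A->values[k + pf_dist], 0, 0);
            __builtin_prefetch(&A->colidx[k + pf_dist], 0, 0);
            sum += A->values[k] * x[A->colidx[k]];
        }
        y[i] = sum;
    }
}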
23. Auto-tuned Performance (Matrix Compression)
- If memory bound, the only hope is minimizing memory traffic
- Heuristically compress the parallelized matrix to minimize it
- Implemented with SSE
- The benefit of prefetching is hidden by the requirement of register blocking
- Options: register blocking, index size, format, etc. (a 2x1 register-blocked sketch follows below)
(Figure legend: Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
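To make register blocking concrete, here is a scalar sketch of a 2x1 block-CSR (BCSR) kernel: two vertically adjacent values share one column index, roughly halving index meta-data and reusing each source-vector load for two multiply-adds (layout and names are illustrative; the tuned kernels also vary index size and format and use SSE intrinsics):

/* 2x1 register-blocked SpMV. Assumes the row count was padded to a multiple
 * of 2; block ib covers rows 2*ib and 2*ib+1. */
typedef struct {
    int     nbrows;   /* number of 2-row block rows */
    int    *browptr;  /* nbrows+1: block-row pointers */
    int    *bcolidx;  /* one column index per 2x1 block */
    double *bvalues;  /* two values per block, stored consecutively */
} bcsr2x1_matrix;

void spmv_bcsr_2x1(const bcsr2x1_matrix *A, const double *restrict x,
                   double *restrict y)
{
    for (int ib = 0; ib < A->nbrows; ib++) {
        double sum0 = y[2 * ib], sum1 = y[2 * ib + 1];
        for (int k = A->browptr[ib]; k < A->browptr[ib + 1]; k++) {
            double xv = x[A->bcolidx[k]];        /* one source-vector load ... */
            sum0 += A->bvalues[2 * k]     * xv;  /* ... reused for both rows */
            sum1 += A->bvalues[2 * k + 1] * xv;
        }
        y[2 * ib]     = sum0;
        y[2 * ib + 1] = sum1;
    }
}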
24. Auto-tuned Performance (Cache/TLB Blocking)
- Reorganize the matrix to maximize locality of source-vector accesses
(Figure legend: Cache/TLB Blocking, Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
25. Auto-tuned Performance (DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs
- Gave the Opteron as many DIMMs as the Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of the Cell chip area
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
26. Auto-tuned Performance (DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs
- Gave the Opteron as many DIMMs as the Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of the Cell chip area
- Fraction of peak achieved (fully tuned):
  - Clovertown: 4% of peak flops, 52% of bandwidth
  - Opteron: 20% of peak flops, 65% of bandwidth
  - Niagara2: 54% of peak flops, 57% of bandwidth
  - Cell PPEs: 10% of peak flops, 10% of bandwidth
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
27. Auto-tuned Performance (Cell/SPE version)
- Wrote a double-precision Cell/SPE version
- DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks (sketched below)
- Only the SpMV-proper routine changed
- About 12x faster (median) than using the PPEs alone
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
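For reference, a sketch of a 2x1 blocked-coordinate (BCOO) layout of the kind such SPE kernels stream through: each block record carries explicit row and column coordinates, so any fixed-size chunk of blocks can be brought into a local store and processed independently (the layout is illustrative; the actual libspe/DMA and double-buffering management is omitted):

/* 2x1 BCOO: explicit (row, col) per block, so a contiguous chunk of blocks
 * is self-describing and can be processed against a cached slice of x. */
typedef struct {
    int    brow, bcol;   /* coordinates of the 2x1 block (row in block units) */
    double v0, v1;       /* values for rows 2*brow and 2*brow+1, column bcol */
} bcoo2x1_block;

/* Process one chunk of blocks (e.g., one DMA buffer) against x and y. */
void spmv_bcoo_2x1_chunk(const bcoo2x1_block *blk, int nblocks,
                         const double *restrict x, double *restrict y)
{
    for (int k = 0; k < nblocks; k++) {
        double xv = x[blk[k].bcol];
        y[2 * blk[k].brow]     += blk[k].v0 * xv;
        y[2 * blk[k].brow + 1] += blk[k].v1 * xv;
    }
}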
28. Auto-tuned Performance (Cell/SPE version)
- Wrote a double-precision Cell/SPE version
- DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks
- Only the SpMV-proper routine changed
- About 12x faster than using the PPEs alone
- Fraction of peak achieved (fully tuned):
  - Clovertown: 4% of peak flops, 52% of bandwidth
  - Opteron: 20% of peak flops, 65% of bandwidth
  - Niagara2: 54% of peak flops, 57% of bandwidth
  - Cell SPEs: 40% of peak flops, 92% of bandwidth
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
29. Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
- Enabled 1x1 BCOO
- About 16% improvement from a better Cell implementation
(Figure legend: better Cell implementation; More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
30. Speedup from Auto-tuning: Median (max)
- Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks; only the SpMV-proper routine changed
- About 12x faster than using the PPEs alone
- Per-system speedups from auto-tuning, median (max): 3.9x (4.4x), 1.6x (2.7x), 1.3x (2.9x), 26x (34x)
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
31. Summary
32. Aggregate Performance (fully optimized)
- Cell consistently delivers the best full-system performance
- Although Niagara2 delivers nearly comparable per-socket performance
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown
- Clovertown has far too little effective FSB bandwidth
- Huron has far more bandwidth than it can exploit (too much latency, too few cores)
33. Parallel Efficiency (average performance per thread, fully optimized)
- Aggregate Mflop/s divided by the number of cores
- Niagara2 and Cell showed very good multicore scaling
- Clovertown showed very poor multicore scaling on both applications
- For SpMV, Opteron and Clovertown showed good multisocket scaling
34. Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power under load
- Calculate power efficiency as sustained performance / sustained power
- All cache-based machines delivered similar power efficiency
- FBDIMMs (12 W each) drive up sustained power:
  - 8 DIMMs on Clovertown (total of 330 W)
  - 16 DIMMs on the N2 machine (total of 450 W)
35. Productivity
- Niagara2 required significantly less work to deliver good performance
- Cache-based machines required a search over some optimizations, while Cell relied solely on heuristics (less time to tune)
36. Summary
- Paradoxically, the most complex/advanced architectures required the most tuning, yet delivered the lowest performance
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency (processor and power)
- Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed at SC07)
- Architectural transparency is invaluable in optimizing code
37. Acknowledgements
- UC Berkeley
  - RADLab cluster (Opterons)
  - PSI cluster (Clovertowns)
- Sun Microsystems
  - Niagara2 donations
- Forschungszentrum Jülich
  - Cell blade cluster access
38. Questions?
39. Switch to pOSKI