Title: Auto-tuning Sparse Matrix Kernels
1. Auto-tuning Sparse Matrix Kernels
- Sam Williams1,2, Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, James Demmel1,2
- 1University of California, Berkeley  2Lawrence Berkeley National Laboratory  3Georgia Institute of Technology
- samw@cs.berkeley.edu
2. Motivation
- Multicore is the de facto solution for improving peak performance for the next decade
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them
- We require a HW/SW solution that guarantees performance without completely sacrificing productivity
3. Overview
- Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel
- Present and analyze two threaded, auto-tuned implementations
- Benchmark performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show:
  - Auto-tuning can significantly improve performance
  - Cell consistently delivers good performance and efficiency
  - Niagara2 delivers good performance and productivity
4. Multicore SMPs Used
5. Multicore SMP Systems
6. Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy
7. Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy (Clovertown, Opteron, Niagara2)
- Disjoint local-store memory hierarchy (Cell)
8. Multicore SMP Systems (memory hierarchy)
- Cache-based: Pthreads implementations
- Local store: libspe implementations
9. Multicore SMP Systems (peak flops)
- Intel Clovertown: 75 Gflop/s
- AMD Opteron: 17 Gflop/s
- IBM Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
- Sun Niagara2: 11 Gflop/s
10. Multicore SMP Systems (peak DRAM bandwidth)
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- IBM Cell Blade: 51 GB/s
- Sun Niagara2: 42 GB/s (read), 21 GB/s (write)
11. Multicore SMP Systems
- Non-uniform memory access (NUMA): dual-socket Opteron, Cell blade
- Uniform memory access (UMA): Clovertown, Niagara2
12. Arithmetic Intensity
(Figure: kernels arranged by arithmetic intensity, from O(1) to O(N) — O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods)
- Arithmetic intensity = total flops / total DRAM bytes (see the worked SpMV estimate below)
- Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality)
- But there are many important and interesting kernels that don't
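As a rough back-of-the-envelope example (assuming double-precision CSR values with 4-byte column indices, and ignoring vector and row-pointer traffic), SpMV sits at the low end of this spectrum:

\[
\mathrm{AI}_{\mathrm{SpMV}} \;\approx\; \frac{2\ \text{flops per nonzero}}{(8+4)\ \text{bytes per nonzero}} \;\approx\; 0.17\ \tfrac{\text{flops}}{\text{byte}} \;\;\Longleftrightarrow\;\; \approx 6\ \tfrac{\text{bytes}}{\text{flop}}
\]

which is consistent with the memory-bound behavior discussed for SpMV later in the talk.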
13. Auto-tuning
14. Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible
- Goal: a productive solution for performance portability
- Our auto-tuning approach finds a good-performing solution by a combination of heuristics and exhaustive search (a minimal search-loop sketch follows below):
  - A Perl script generates many possible kernels
  - (generates SIMD-optimized kernels)
  - An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler
  - Performance depends on the optimizations generated
  - Heuristics are often desirable when the search space isn't tractable
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
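As a rough illustration of the exhaustive-search step, here is a minimal sketch in C, assuming the code generator has already produced a set of candidate kernels behind a common function-pointer signature (the names spmv_fn and pick_best_variant are illustrative, not the actual benchmark's interface):

#include <stddef.h>
#include <time.h>

/* A generated SpMV variant: computes y += A*x, where A is an opaque,
 * pre-packed sparse matrix produced by the code generator. */
typedef void (*spmv_fn)(const void *A, const double *x, double *y);

/* Exhaustive search: time each candidate kernel on the actual matrix and
 * return the index of the fastest for this architecture/dataset/compiler. */
size_t pick_best_variant(spmv_fn *cand, size_t ncand,
                         const void *A, const double *x, double *y, int trials)
{
    size_t best = 0;
    double best_sec = 1e300;
    for (size_t c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        for (int t = 0; t < trials; t++)
            cand[c](A, x, y);                 /* run this candidate repeatedly */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (sec < best_sec) {                 /* keep the fastest variant seen */
            best_sec = sec;
            best = c;
        }
    }
    return best;   /* clock() is coarse; a real harness would use a cycle counter */
}

A heuristic pass would prune the candidate list before this loop (for example, discarding block sizes that add too much fill) so the timed search stays tractable.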
15. Sparse Matrix-Vector Multiplication (SpMV)
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
16. Sparse Matrix-Vector Multiplication
- Sparse matrix:
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta-data (a CSR sketch follows below)
- Evaluate y = Ax
  - A is a sparse matrix
  - x, y are dense vectors
- Challenges:
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity (often > 6 bytes/flop); likely memory bound
(Figure: y = Ax, with sparse A and dense vectors x and y)
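For concreteness, a minimal sketch (in C) of the meta-data that CSR storage carries; the struct and field names are illustrative, not the paper's actual data structures:

/* Compressed Sparse Row (CSR): only the nonzeros are stored, but every nonzero
 * drags along an explicit column index, and every row needs a pointer. */
typedef struct {
    int     nrows, ncols, nnz;
    int    *rowptr;   /* nrows+1 entries: row i owns [rowptr[i], rowptr[i+1]) */
    int    *colidx;   /* nnz entries: column of each nonzero (4 B of meta-data each) */
    double *values;   /* nnz entries: the nonzero values themselves (8 B each) */
} csr_matrix;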
17. Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices
  - None should fit in cache
  - Rank ranges from 2K to 1M
- Subdivided them into 4 categories:
  - Dense: 2K x 2K dense matrix stored in sparse format
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology
  - Poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase
  - Extreme aspect ratio (linear programming): LP
18. Naïve Serial Implementation
- Vanilla C implementation (a kernel sketch follows below)
- Matrix stored in CSR (compressed sparse row) format
- Explored compiler options, but only the best is presented here
- An x86 core delivers > 10x the performance of a Niagara2 thread
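A sketch of such a vanilla kernel, assuming the illustrative csr_matrix layout above (not the authors' exact code):

/* Naive serial SpMV: y += A*x with A in CSR. Two flops per nonzero, one
 * indirect (irregular) load from x per nonzero, streaming access to A. */
void spmv_csr_serial(const csr_matrix *A, const double *restrict x,
                     double *restrict y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = y[i];
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->values[k] * x[A->colidx[k]];
        y[i] = sum;
    }
}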
19. Naïve Parallel Implementation
- SPMD style
- Partition by rows
- Load balance by nonzeros (see the partitioning sketch below)
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
(Figure legend: Naïve Pthreads, Naïve)
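A sketch of the nonzero-balanced row partitioning, again assuming the illustrative csr_matrix layout; each thread then runs the serial kernel on its own band of rows:

/* Partition rows among threads so each gets roughly nnz/nthreads nonzeros.
 * Thread t is assigned rows [row_start[t], row_start[t+1]). */
void partition_by_nnz(const csr_matrix *A, int nthreads, int *row_start)
{
    int t = 0;
    row_start[0] = 0;
    for (int i = 0; i < A->nrows && t < nthreads - 1; i++) {
        /* advance to the next thread once its share of nonzeros is reached */
        long target = (long)A->nnz * (t + 1) / nthreads;
        if (A->rowptr[i + 1] >= target)
            row_start[++t] = i + 1;
    }
    while (t < nthreads - 1)                /* handle degenerate distributions */
        row_start[++t] = A->nrows;
    row_start[nthreads] = A->nrows;
}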
20. Naïve Parallel Implementation
- SPMD style; partition by rows; load balance by nonzeros
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
- Scaling from naïve serial to naïve parallel:
  - Clovertown: 8x cores, 1.9x performance
  - Opteron: 4x cores, 1.5x performance
  - Niagara2: 64x threads, 41x performance
  - Cell PPEs: 4x threads, 3.4x performance
(Figure legend: Naïve Pthreads, Naïve)
21. Naïve Parallel Implementation
- SPMD style; partition by rows; load balance by nonzeros
- Naïve Niagara2 performance is roughly 2.5x that of the naïve x86 machines
- Fraction of peak achieved (naïve parallel):
  - Clovertown: 1.4% of peak flops, 29% of bandwidth
  - Opteron: 4% of peak flops, 20% of bandwidth
  - Niagara2: 25% of peak flops, 39% of bandwidth
  - Cell PPEs: 2.7% of peak flops, 4% of bandwidth
(Figure legend: Naïve Pthreads, Naïve)
22. Auto-tuned Performance (NUMA, SW Prefetching)
- Use first touch or libnuma to exploit NUMA
- Also includes process affinity
- Tag prefetches with temporal-locality hints
- Auto-tune: search for the optimal prefetch distances (see the sketch below)
(Figure legend: SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
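A sketch of the kind of inner loop the prefetch-distance search tunes, using GCC's __builtin_prefetch; pf_dist is exactly the knob the auto-tuner sweeps (the structure is illustrative, and real code pads the arrays or clamps the distance so the prefetched addresses stay in bounds):

/* CSR SpMV with software prefetching of the value and index streams. */
void spmv_csr_prefetch(const csr_matrix *A, const double *restrict x,
                       double *restrict y, int pf_dist)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = y[i];
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            /* hint the hardware to fetch data pf_dist nonzeros ahead;
             * the third argument (temporal-locality hint) can also be tuned */
            __builtin_prefetch(&A->values[k + pf_dist], 0, 0);
            __builtin_prefetch(&A->colidx[k + pf_dist], 0, 0);
            sum += A->values[k] * x[A->colidx[k]];
        }
        y[i] = sum;
    }
}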
23. Auto-tuned Performance (Matrix Compression)
- If memory bound, the only hope is minimizing memory traffic
- Heuristically compress the parallelized matrix to minimize it
- Implemented with SSE
- The benefit of prefetching is hidden by the requirement of register blocking
- Options: register blocking, index size, format, etc. (a 2x1 register-blocked sketch follows below)
(Figure legend: Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
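To make register blocking concrete, here is a scalar sketch of a 2x1 block-CSR (BCSR) kernel: two vertically adjacent values share one column index, roughly halving index meta-data and reusing each source-vector load for two multiply-adds (layout and names are illustrative; the tuned kernels also vary index size and format and use SSE intrinsics):

/* 2x1 register-blocked SpMV. Assumes the row count was padded to a multiple
 * of 2; block ib covers rows 2*ib and 2*ib+1. */
typedef struct {
    int     nbrows;   /* number of 2-row block rows */
    int    *browptr;  /* nbrows+1: block-row pointers */
    int    *bcolidx;  /* one column index per 2x1 block */
    double *bvalues;  /* two values per block, stored consecutively */
} bcsr2x1_matrix;

void spmv_bcsr_2x1(const bcsr2x1_matrix *A, const double *restrict x,
                   double *restrict y)
{
    for (int ib = 0; ib < A->nbrows; ib++) {
        double sum0 = y[2 * ib], sum1 = y[2 * ib + 1];
        for (int k = A->browptr[ib]; k < A->browptr[ib + 1]; k++) {
            double xv = x[A->bcolidx[k]];        /* one source-vector load ... */
            sum0 += A->bvalues[2 * k]     * xv;  /* ... reused for both rows */
            sum1 += A->bvalues[2 * k + 1] * xv;
        }
        y[2 * ib]     = sum0;
        y[2 * ib + 1] = sum1;
    }
}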
24. Auto-tuned Performance (Cache/TLB Blocking)
- Reorganize the matrix to maximize locality of source-vector accesses
(Figure legend: Cache/TLB Blocking, Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve)
25. Auto-tuned Performance (DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs
- Gave the Opteron as many DIMMs as the Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of the Cell chip area
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
26. Auto-tuned Performance (DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs
- Gave the Opteron as many DIMMs as the Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of the Cell chip area
- Fraction of peak achieved (fully tuned):
  - Clovertown: 4% of peak flops, 52% of bandwidth
  - Opteron: 20% of peak flops, 65% of bandwidth
  - Niagara2: 54% of peak flops, 57% of bandwidth
  - Cell PPEs: 10% of peak flops, 10% of bandwidth
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
27. Auto-tuned Performance (Cell/SPE version)
- Wrote a double-precision Cell/SPE version
- DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks (sketched below)
- Only the SpMV-proper routine changed
- About 12x faster (median) than using the PPEs alone
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
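For reference, a sketch of a 2x1 blocked-coordinate (BCOO) layout of the kind such SPE kernels stream through: each block record carries explicit row and column coordinates, so any fixed-size chunk of blocks can be brought into a local store and processed independently (the layout is illustrative; the actual libspe/DMA and double-buffering management is omitted):

/* 2x1 BCOO: explicit (row, col) per block, so a contiguous chunk of blocks
 * is self-describing and can be processed against a cached slice of x. */
typedef struct {
    int    brow, bcol;   /* coordinates of the 2x1 block (row in block units) */
    double v0, v1;       /* values for rows 2*brow and 2*brow+1, column bcol */
} bcoo2x1_block;

/* Process one chunk of blocks (e.g., one DMA buffer) against x and y. */
void spmv_bcoo_2x1_chunk(const bcoo2x1_block *blk, int nblocks,
                         const double *restrict x, double *restrict y)
{
    for (int k = 0; k < nblocks; k++) {
        double xv = x[blk[k].bcol];
        y[2 * blk[k].brow]     += blk[k].v0 * xv;
        y[2 * blk[k].brow + 1] += blk[k].v1 * xv;
    }
}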
28. Auto-tuned Performance (Cell/SPE version)
- Wrote a double-precision Cell/SPE version
- DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks
- Only the SpMV-proper routine changed
- About 12x faster than using the PPEs alone
- Fraction of peak achieved (fully tuned):
  - Clovertown: 4% of peak flops, 52% of bandwidth
  - Opteron: 20% of peak flops, 65% of bandwidth
  - Niagara2: 54% of peak flops, 57% of bandwidth
  - Cell SPEs: 40% of peak flops, 92% of bandwidth
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
29. Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
- Enabled 1x1 BCOO
- About 16% improvement from a better Cell implementation
(Figure legend: better Cell implementation; More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
30. Speedup from Auto-tuning: Median (max)
- Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks; only the SpMV-proper routine changed
- About 12x faster than using the PPEs alone
- Per-system speedups from auto-tuning, median (max): 3.9x (4.4x), 1.6x (2.7x), 1.3x (2.9x), 26x (34x)
(Figure legend: More DIMMs (Opteron), FW fix, array padding (N2), etc.; Cache/TLB Blocking; Compression; SW Prefetching; NUMA/Affinity; Naïve Pthreads; Naïve)
31. Summary
32. Aggregate Performance (fully optimized)
- Cell consistently delivers the best full-system performance
- Although Niagara2 delivers nearly comparable per-socket performance
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown
- Clovertown has far too little effective FSB bandwidth
- Huron has far more bandwidth than it can exploit (too much latency, too few cores)
33. Parallel Efficiency (average performance per thread, fully optimized)
- Aggregate Mflop/s divided by the number of cores
- Niagara2 and Cell showed very good multicore scaling
- Clovertown showed very poor multicore scaling on both applications
- For SpMV, Opteron and Clovertown showed good multisocket scaling
34. Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power under load
- Calculate power efficiency as sustained performance / sustained power
- All cache-based machines delivered similar power efficiency
- FBDIMMs (12 W each) drive up sustained power:
  - 8 DIMMs on Clovertown (total of 330 W)
  - 16 DIMMs on the N2 machine (total of 450 W)
35. Productivity
- Niagara2 required significantly less work to deliver good performance
- Cache-based machines required a search over some optimizations, while Cell relied solely on heuristics (less time to tune)
36. Summary
- Paradoxically, the most complex/advanced architectures required the most tuning, yet delivered the lowest performance
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency (processor and power)
- Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed at SC07)
- Architectural transparency is invaluable in optimizing code
37. Acknowledgements
- UC Berkeley
  - RADLab cluster (Opterons)
  - PSI cluster (Clovertowns)
- Sun Microsystems
  - Niagara2 donations
- Forschungszentrum Jülich
  - Cell blade cluster access
38. Questions?
39. Switch to pOSKI