Title: CS 267: Applications of Parallel Computers Lecture 17: Parallel Sparse Matrix-Vector Multiplication
1CS 267: Applications of Parallel Computers
Lecture 17: Parallel Sparse Matrix-Vector Multiplication
- Horst D. Simon
- http://www.cs.berkeley.edu/~strive/cs267
2Outline
- Matrix-Vector multiplication (review)
- Matrices and Graphs (review)
- Serial optimizations (cf. thesis work by E.J. Im, http://www.cs.berkeley.edu/~ejim/publication/)
- Register blocking
- Cache blocking
- Exploiting symmetry
- Reordering
- Parallel optimizations (cf. lecture by Lenny Oliker and references at http://www.nersc.gov/~oliker/paperlinks.html)
- Reordering
- Multiple vectors or operations
- Other operations
3Parallel Dense Matrix-Vector Product (Review)
- y = y + A*x, where A is a dense matrix
- Layout
- 1D by rows
- Algorithm
- For each processor j
- Broadcast X(j)
- Compute A(p)*X(j)
- A(i) refers to the n/p by n block row that processor i owns
- Algorithm uses the formula
- Y(i) = Y(i) + A(i)*X = Y(i) + Sum_j A(i,j)*X(j)
[Figure: x and y laid out in blocks across processors P0, P1, P2, P3]
4Graphs and Sparse Matrices
- Sparse matrix is a representation of a (sparse)
graph
[Figure: 6x6 sparse matrix of 0/1 entries and the corresponding 6-node graph]
- Matrix entries are edge weights
- Diagonal contains self-edges (usually non-zero)
- Matrix-vector multiplication
- Vector is spread over nodes
- Compute new vector by multiplying each incoming edge weight by the value at its source node and summing
5Data Structure for Sparse Matrix
- Matvec (matrix-vector multiplication) will
- Access rows of the matrix serially
- => use an array for each row (indices and values)
- Access the destination vector only once per element
- => layout doesn't matter
- Access the source vector once per row, touching only nonzeros
- The only opportunity for memory hierarchy improvements is in the source vector
- Layout by rows may not be best; more on this later
6Compressed Sparse Row Format
- Compressed Sparse Row (CSR), also called Compressed Row Storage (CRS), is
- General: does not make assumptions about the matrix structure
- Relatively efficient: rows are sorted, only 1 index per nonzero
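A minimal sketch (not from the original slides) of the CSR layout and the matvec loop it implies; the array names rowptr, colind, and val are illustrative:

#include <stdio.h>

/* CSR storage for an m-by-n matrix with nnz nonzeros:
 * row i's nonzeros are val[rowptr[i] .. rowptr[i+1]-1],
 * with column indices colind[rowptr[i] .. rowptr[i+1]-1]. */
void spmv_csr(int m, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = y[i];                 /* computes y = y + A*x */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];  /* irregular access to x */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [1 0 2; 0 3 0; 4 0 5] */
    int    rowptr[] = {0, 2, 3, 5};
    int    colind[] = {0, 2, 1, 0, 2};
    double val[]    = {1, 2, 3, 4, 5};
    double x[]      = {1, 1, 1};
    double y[]      = {0, 0, 0};
    spmv_csr(3, rowptr, colind, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* prints: 3 3 9 */
    return 0;
}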
7Performance Challenge
- Consider a 167 MHz UltraSPARC I
- Dense matrix operations
- Naïve matrix-vector multiplication: 38 Mflops
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops
- Reason: low flop-to-memory ratio (about 2)
- Sparse matrix operations
- Naïve matrix-vector multiplication of a dense matrix in sparse representation (Compressed Sparse Row): 25 Mflops
- As low as 5.7 Mflops for some matrices
- Reason: indexing overhead and irregular memory access patterns
8Serial Optimizations
- Several optimizations, depending on matrix structure
- Register blocking
- Sparsity and PETSc
- Cache blocking
- Sparsity
- Symmetry
- BeBop and a project at CMU
- Reordering
- Toledo (97) and Oliker et al. (00) do this
- Exploit bands and other structures
- The sparse compiler project by Bik (96) does this
9Register Blocking Optimization
- Identify small dense blocks of nonzeros
- Fill in extra zeros to complete the blocks
- Use an optimized multiplication code for the particular block size
- Reuse source elements
- Unroll the loop
- Saves index memory
[Figure: example of a 2x2 register-blocked matrix, with explicit zeros filled in to complete the blocks]
- Improves register reuse and lowers indexing overhead
- Computes extra flops for the added zeros
10Register Blocking Blocked Row Format
- Blocked Compressed Sparse Row Format
- Advantages of the format
- Better temporal locality in registers
- The multiplication loop can be unrolled for
better performance
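A hedged sketch of the 2x2 blocked-CSR (BCSR) matvec the slide describes: one column index per 2x2 block, four values stored per block, and the inner 2x2 kernel fully unrolled so that the y and x entries stay in registers. The layout details and names are illustrative, not the lecture's exact format:

/* Sketch: y += A*x for a matrix stored in 2x2 blocked CSR (BCSR).
 * brow_ptr/bcol_ind index 2x2 blocks; val holds 4 values per block in
 * row-major order; mb = number of block rows. Assumes the row and column
 * dimensions are even (padded if necessary). Names are illustrative. */
void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_ind,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = y[2 * ib], y1 = y[2 * ib + 1];   /* kept in registers */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *b = val + 4 * k;
            double x0 = x[2 * bcol_ind[k]];          /* source reused twice */
            double x1 = x[2 * bcol_ind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;             /* unrolled 2x2 kernel */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib] = y0;
        y[2 * ib + 1] = y1;
    }
}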
11Register Blocking Fill Overhead
- We use a uniform block size, adding fill overhead
- Fill overhead = 12/7 ≈ 1.71
- This increases both storage and the number of floating point operations
12Performance Model for Register Blocking
- Estimate the performance of register blocking from two quantities
- Estimated raw performance: Mflop/s of a dense matrix in sparse bxb blocked format on the machine of choice
- Estimated fill overhead for the particular matrix to fill in bxb blocks (matrix-dependent)
- Maximize (estimated raw performance) / (estimated fill overhead) over b
- Use sampling to further reduce time; row and column block dimensions are chosen separately
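A small sketch of the selection step under this model: assuming a per-machine profile of Mflop/s for a dense matrix in b x b blocked format and a per-matrix estimate of the fill ratio, pick the b that maximizes profile/fill. The inputs here are placeholders supplied by the caller, not measured data:

/* Sketch of the block-size selection heuristic: estimated performance for
 * block size b is (dense-in-sparse-blocked Mflop/s from the machine profile)
 * divided by the matrix's estimated fill ratio for b x b blocks.
 * profile_mflops[] and fill_ratio[] are supplied by the caller. */
int choose_block_size(int nb, const double *profile_mflops,
                      const double *fill_ratio)
{
    int best_b = 1;
    double best = 0.0;
    for (int b = 1; b <= nb; b++) {
        double est = profile_mflops[b - 1] / fill_ratio[b - 1];
        if (est > best) { best = est; best_b = b; }
    }
    return best_b;
}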
13Register Blocking Profile of Ultra I
- Dense Matrix profile on an UltraSPARC I
14Register Blocking Profile of MIPS R10K
- Performance of dense matrix in sparse blocked
format on MIPS R10K
15Register Blocking Profile of P III
- Performance of dense matrix in sparse blocked
format on P III
16Register Blocking Profile of P4
- Performance of dense matrix in sparse blocked
format on P 4
17Benchmark matrices
- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix, used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzeros)
18Register Blocking Performance
- The optimization is most effective on the FEM matrices and the dense matrix (lower-numbered)
- A single search step (comparing blocked and unblocked) raises the numbers below the line to 1
19Register Blocking Performance
- Speedup is generally best on the MIPS R10000, which is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38)
20Register Blocking on P4 for FEM/fluids matrix 1
21Register Blocking on P4 for FEM/fluids matrix 2
22Register Blocking Model Validation
23Register Blocking Overhead
- Pre-computation overhead
- Estimating fill overhead (red bars)
- Reorganizing the matrix (yellow bars)
- The ratio gives the number of repetitions needed for the optimization to become beneficial
24Cache Blocking Optimization
- Keep part of the source vector in cache
[Figure: sparse matrix A, source vector x, and destination vector y]
- Improves cache reuse of the source vector
- Challenge: choosing a block size
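One plausible realization of this idea (not necessarily Sparsity's exact format): split the matrix into column blocks, store each block in its own CSR structure with block-local column indices, and sweep the blocks one at a time so the corresponding slice of x stays in cache. Names are illustrative:

/* Sketch of cache-blocked SpMV: the matrix is stored as ncb column blocks,
 * each in its own CSR structure, so each pass reuses a cache-sized slice
 * of x. Accumulates into y (y += A*x). */
typedef struct {
    int m;                 /* number of rows */
    int col_start;         /* first column covered by this block */
    const int *rowptr, *colind;
    const double *val;
} csr_block;

void spmv_cache_blocked(int ncb, const csr_block *blocks,
                        const double *x, double *y)
{
    for (int cb = 0; cb < ncb; cb++) {        /* one column block at a time */
        const csr_block *B = &blocks[cb];
        const double *xs = x + B->col_start;  /* slice of x kept in cache */
        for (int i = 0; i < B->m; i++) {
            double sum = 0.0;
            for (int k = B->rowptr[i]; k < B->rowptr[i + 1]; k++)
                sum += B->val[k] * xs[B->colind[k]];  /* local column indices */
            y[i] += sum;
        }
    }
}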
25Cache Blocking
- Improves temporal locality of access to the source vector
- Makes sense when x does not fit in cache
- Rectangular matrices, in particular
[Figure: source vector x, destination vector y, and the matrix as laid out in memory]
26Cache Blocking Speedup
- MIPS speedup is generally better
- Larger cache, larger miss penalty (26/589 ns for MIPS, 36/268 ns for Ultra)
- The document retrieval and random matrices are exceptions
- They perform very poorly in the baseline code
27Cache Blocking Web Document Matrix
- This matrix seems to be an exception
- Cache blocking helps
- Only on a random matrix with similar shape/size
does it help (without other optimizations)
28Cache Blocking Searching for Block Size
29Combination of Register and Cache blocking
- The combination is rarely beneficial, and often slower than either of the two optimizations alone
30Combination of Register and Cache blocking
31Cache Blocking Summary
- Rarely useful on matrices from real problems
- Matrices need to be
- Random in structure
- Rectangular
- A reasonable block size can be determined through a coarse-grained search
- The combination of register and cache blocking is not useful
- The two optimizations work on disjoint sets of matrices (as far as we can tell)
32Exploiting Symmetry
- Many matrices are symmetric (undirected graph)
- CG, for example, works only on symmetric matrices
- Symmetric storage: store/access only half of the matrix
- Update y twice for each stored element of A
- y(i) += A(i,j) * x(j)
- y(j) += A(i,j) * x(i)
- Saves up to half of the memory traffic (to A)
- Works for sparse formats
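A minimal sketch of the symmetric kernel described above, assuming the upper triangle (including the diagonal) is stored in CSR; each stored off-diagonal entry updates both y(i) and y(j). Names are illustrative:

/* Symmetric SpMV with only the upper triangle stored: y += A*x,
 * halving the traffic to A compared with storing the full matrix. */
void spmv_symmetric(int n, const int *rowptr, const int *colind,
                    const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double xi = x[i], yi = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int j = colind[k];
            yi += val[k] * x[j];          /* y(i) += A(i,j) * x(j) */
            if (j != i)
                y[j] += val[k] * xi;      /* y(j) += A(i,j) * x(i), by symmetry */
        }
        y[i] += yi;
    }
}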
33Symmetry Speedups
34Multiple Vector Optimization
- Better potential for reuse
- Loop-unrolled codes that multiply across the vectors are generated by a code generator
[Figure: matrix element a(i,j) reused to update destination vectors y(i,1), y(i,2), ... from source vectors x(j,1), x(j,2), ...]
- Allows reuse of matrix elements
- Challenge: choosing the number of vectors for loop unrolling
35Multiple Vector Multiplication
- Better chance of optimization: analogous to BLAS2 vs. BLAS3
[Figure: repeated single-vector multiplications vs. one multiple-vector multiplication]
- Blocked algorithms (e.g., blocked CG) may use this
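A sketch of a multiple-vector kernel with the cross-vector loop unrolled by 4, so each matrix element is loaded once and reused for four source vectors; the interleaved layout of X and Y and the unrolling depth are illustrative choices:

/* Multiple-vector SpMV, Y += A*X for 4 vectors. X and Y are stored with the
 * vectors interleaved (4 values per row index), so val[k] is reused 4 times. */
void spmv_csr_4vec(int m, const int *rowptr, const int *colind,
                   const double *val, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        double y0 = 0, y1 = 0, y2 = 0, y3 = 0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            double a = val[k];                  /* one load, four uses */
            const double *x = X + 4 * colind[k];
            y0 += a * x[0];
            y1 += a * x[1];
            y2 += a * x[2];
            y3 += a * x[3];
        }
        Y[4 * i]     += y0;
        Y[4 * i + 1] += y1;
        Y[4 * i + 2] += y2;
        Y[4 * i + 3] += y3;
    }
}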
36Multiple Vector Effect on Synthetic Matrices
- matrix with random sparsity pattern
- dense matrix in sparse format
37Multiple Vectors with Register Blocking
- The speedup is larger than for single-vector register blocking
- Even the performance of the matrices that did not speed up improved (middle group on the UltraSPARC)
38Multiple Vectors with Cache Blocking
[Plots: MIPS and UltraSPARC]
- Noticeable speedup for matrices that did not speed up with cache blocking alone (UltraSPARC)
- Block sizes are much smaller than for single-vector cache blocking
39Sparsity System Organization
- Guides the choice of optimization
- Automatic selection of optimization parameters such as block size and number of vectors
- http://www.cs.berkeley.edu/~yelick/sparsity
- The released version does not include the optimization for symmetry
[Diagram: a representative matrix and a machine profile (from the Sparsity machine profiler), together with the maximum number of vectors, feed the Sparsity optimizer, which produces the optimized format and code]
40Summary of Serial Optimizations
- Sparse matrix optimization has the same problems as dense, plus the nonzero pattern
- Optimization of format/code is effective, but expensive
- Should be done once across an application, not at each matvec call or even each iterative solve
- The optimization choice may depend on matrix characteristics rather than specifics
- Optimizing over a larger kernel is also important
- Heuristics plus search can be used to select optimization parameters
- Future
- Different code organizations, instruction mixes
- Different views of A: A = A1 + A2
- Different matrix orderings
41Tuning other sparse operations
- Symmetric matrix-vector multiply A*x
- Solve a sparse triangular system of equations T^(-1)*x
- A^T*A*x
- Kernel of information retrieval via LSI
- Same number of memory references as A*x
- A^2*x, A^k*x
- Kernel of the information retrieval used by Google
- Changes the calling algorithm
- A^T*M*A
- Matrix triple product
- Used in multigrid solvers
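One way to see the "same number of memory references as A*x" claim for A^T*A*x: process A row by row, using each row a_i once for the inner product t = a_i . x and immediately again for the update y += t * a_i, so A is streamed through the memory hierarchy only once. A minimal CSR sketch under that interpretation (names illustrative; y is assumed zero-initialized):

/* y += A^T * (A * x) in one pass over the rows of A (CSR, m rows, n cols).
 * For each row a_i: t = a_i . x, then y += t * a_i; each row is brought in
 * once, as in a single SpMV. Zero y first to get y = A^T*A*x. */
void ata_x(int m, const int *rowptr, const int *colind,
           const double *val, const double *x, double *y /* length n */)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            t += val[k] * x[colind[k]];    /* t = a_i . x */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            y[colind[k]] += t * val[k];    /* y += t * a_i */
    }
}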
42Symmetric Sparse Matrix-Vector Multiply
P4 using icc
43Sparse Triangular Solve
P4 using icc
44Parallel Sparse Matrix-Vector Multiplication
45Parallel Sparse Matrix-Vector Product
- y = y + A*x, where A is a sparse matrix
- Layout
- 1D by rows
- Algorithm
- For each processor j
- Broadcast X(j)
- Compute A(p)*X(j)
- The same algorithm works, but
- It may be load imbalanced
- It communicates more of X than needed
[Figure: x and y laid out in blocks across processors P0, P1, P2, P3]
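A hedged MPI sketch of the 1D block-row algorithm above: each process owns a block of rows (CSR with global column indices) and the matching piece of x; the "broadcast X(j)" step is realized here with MPI_Allgatherv, after which each process multiplies its block row. Sending only the needed entries of x (next slide) is a further optimization not shown:

#include <mpi.h>

/* 1D block-row parallel SpMV sketch: y_local += A(block row) * x.
 * counts/displs give the length and offset of each process's piece of x.
 * Variable names and the use of MPI_Allgatherv are illustrative. */
void par_spmv(int my_m,                    /* rows owned by this process     */
              const int *rowptr, const int *colind, const double *val,
              const double *x_local,       /* my piece of x (length my_m)    */
              double *x_global,            /* buffer for all of x (length n) */
              const int *counts, const int *displs,
              double *y_local,             /* my piece of y (length my_m)    */
              MPI_Comm comm)
{
    /* Everyone gets the full x; an optimized version would send only the
     * entries actually referenced by the local column indices. */
    MPI_Allgatherv(x_local, my_m, MPI_DOUBLE,
                   x_global, counts, displs, MPI_DOUBLE, comm);

    for (int i = 0; i < my_m; i++) {
        double sum = y_local[i];
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x_global[colind[k]];
        y_local[i] = sum;
    }
}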
46Optimization Opportunities
- Send only the necessary parts of x
- Overlap communication and computation
- Load balance
- Take advantage of symmetry
- Trickier than the serial case, because of multiple accesses to Y
- Take advantage of blocked structure
- For serial performance, using FEM matrices or similar
- Take advantage of bands
- Store only diagonal bands with a starting coordinate
- Reduces index storage and access; running time?
- Reorder your matrix
- Use blocked CG
- Better serial performance; reduces the alpha (latency) term but not the beta (bandwidth) term in communication
47Following slides based on "Ordering for Efficiency of Sparse CG" by X. Li, L. Oliker, G. Heber, and R. Biswas
48Partitioning and Ordering Schemes
- RCM, METIS
- Self-Avoiding Walk (SAW) [Heber/Biswas/Gao 99]
- A SAW over a triangular mesh is an enumeration of the triangles such that two consecutive triangles in the SAW share an edge or a vertex (there are no jumps)
- It is a mesh-based linearization technique with similar application areas as space-filling curves
- Construction time is linear in the number of triangles
- Amenable to hierarchical coarsening and refinement, and can be easily parallelized
49Graph Partitioning Strategy MeTiS
- Most popular class of multilevel partitioners
- Objectives
- Balance computational workload
- Minimize edge cut (interprocessor communication)
- Algorithm
- Collapses vertices and edges using a heavy-edge matching scheme
- Applies a greedy partitioning algorithm to the coarsest graph
- Uncoarsens it back using greedy graph growing and Kernighan-Lin refinement
[Diagram: coarsening phase, initial partitioning, refinement phase]
50Strategy Reverse Cuthill-McKee (RCM)
- Matrix bandwidth (profile) has a significant impact on the efficiency of linear system and eigenvalue solvers
- Geometry-based algorithm that generates a permutation so that nonzero entries are close to the diagonal
- Good preordering for LU or Cholesky factorization (reduces fill)
- Improves cache performance (but does not explicitly reduce edge cut)
- Can be used as a partitioner by assigning segments of the ordering to processors
- Starting from a vertex of minimum degree, generates a level structure by BFS and orders vertices by decreasing distance from the original vertex
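A compact sketch of the RCM procedure just described, for a connected graph given in CSR adjacency form: BFS from a minimum-degree vertex, record the visit order, then reverse it. For brevity this sketch omits sorting each vertex's unvisited neighbors by increasing degree, which the full algorithm does; names are illustrative:

#include <stdlib.h>

/* Reverse Cuthill-McKee ordering sketch: perm[] receives the new order. */
void rcm_order(int n, const int *adjptr, const int *adj, int *perm)
{
    int *visited = calloc(n, sizeof(int));
    int *queue   = malloc(n * sizeof(int));

    /* Start from a vertex of minimum degree. */
    int start = 0;
    for (int v = 1; v < n; v++)
        if (adjptr[v + 1] - adjptr[v] < adjptr[start + 1] - adjptr[start])
            start = v;

    /* BFS, recording the Cuthill-McKee visit order. */
    int head = 0, tail = 0, count = 0;
    queue[tail++] = start;
    visited[start] = 1;
    while (head < tail) {
        int v = queue[head++];
        perm[count++] = v;
        for (int k = adjptr[v]; k < adjptr[v + 1]; k++) {
            int w = adj[k];
            if (!visited[w]) { visited[w] = 1; queue[tail++] = w; }
        }
    }

    /* Reverse the order to obtain RCM. */
    for (int i = 0; i < count / 2; i++) {
        int t = perm[i];
        perm[i] = perm[count - 1 - i];
        perm[count - 1 - i] = t;
    }
    free(visited);
    free(queue);
}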
51Strategy Self-Avoiding Walks (SAW)
- Mesh-based technique similar to space-filling curves
- Two consecutive triangles in the walk share an edge or a vertex (no jumps)
- Visits each triangle exactly once, entering/exiting over an edge/vertex
- Improves parallel efficiency related to locality (cache reuse) and load balancing, but does not explicitly reduce edge cuts
- Amenable to hierarchical coarsening and refinement
- Biswas, Oliker, Li, Concurrency: Practice and Experience, 12 (2000), 85-109
52Experiment
- Mesh generation
- 2D Delaunay triangulation, generated by Triangle [Shewchuk 96]
- The mesh is shaped like the letter A
- 661,054 vertices, 1,313,099 triangles
- Matrix generation
- For each vertex pair (vi, vj) with distance < 3, we assign a random value in (0, 1) to A(i, j)
- The diagonal is chosen to make the matrix diagonally dominant
- About 39 nonzeros per row, 25,753,034 nonzeros in total
- Implementations
- The MPI version uses Aztec
- The Origin 2000 and Tera MTA versions use compiler directives
53Sparse CG Timings
54Sparse Matvec Cache Misses
55Sparse Matvec Communication Volume
56Observations
- On the T3E and Origin, the ordering methods significantly improve CG performance
- On the T3E, matrix-vector efficiency depends more on cache reuse than on communication reduction
- RCM, SAW, METIS: > 93% of total time spent servicing cache misses
- ORIG: 72-80% of total time spent servicing cache misses
- On the Origin, SMP directives can be used, but data distribution is required
- On the Tera MTA, no ordering and no data distribution are required; the program is the simplest
57Summary
- Lots of opportunities for optimization
- Data structures
- Local performance
- Communication performance
- Numerical techniques (preconditioning)
- Take advantage of special structures
- OK to focus on a class of matrices
58Test Problem
- 2D Delaunay triangulation of a letter-A-shaped mesh generated using the Triangle package (661,054 vertices; 1,313,099 triangles)
- Underlying matrix assembled by assigning a random value in (0,1) to each (i,j) entry corresponding to a vertex pair (vi,vj) with dist(vi,vj) < 4 (other entries set to 0)
- Diagonal entries set to 40 (to make the matrix positive definite)
- Final matrix has approx. 39 entries/row and 25,753,034 nonzeros
- CG converges in 13 iterations; SPMV accounts for 87% of the flops
59Distributed-Memory Implementation
- Each processor has local memory that only it can directly access; message passing is required to access the memory of another processor
- The user decides how to distribute data and organize the communication structure
- Allows efficient code design at the cost of higher complexity
- Parallel CG calls the Aztec sparse linear algebra library
- Matrix A is partitioned into blocks of rows, each block assigned to a processor
- Associated component vectors (x, b) are distributed accordingly
- Communication is needed to transfer some components of x
- Data from
- T3E (450 MHz Alpha processor, 900 Mflops peak, 96 KB secondary cache, 3D torus interconnect)
60Distributed-Memory Implementation
- The ORIG ordering has a large edge cut (interprocessor communication) and poor locality (a high number of cache misses)
- MeTiS minimizes the edge cut, while SAW minimizes cache misses
61Distributed Shared-Memory System
- Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processors and local memory)
- Hardware makes all memory equally accessible from the software perspective by sending memory requests through routers on the nodes
- Memory access time is non-uniform (depends on the number of hops)
- Hypercube interconnect (maximum log(p) hops)
- Each processor has 4 MB cache where only it can
fetch/store data - If processor accesses data not in cache, delay
while copy fetched - When processor modifies word, all other copies of
that cache line invalidated
62Distributed Shared-Memory Implementation
- Use OpenMP-style directives to parallelize loops
- Requires less effort than MPI
- Two implementation approaches taken
- SHMEM: assume the Origin2000 has flat shared memory (arrays not explicitly distributed, non-local data handled by cache coherence)
- CC-NUMA: account for the underlying distributed shared memory by explicit data distribution
- Computational kernels are identical for SHMEM and CC-NUMA
- Each processor is assigned an equal number of rows of the matrix (a block)
- No explicit synchronization is required since there are no concurrent writes
- A global reduction is required for the DOT operation
63Distributed Shared-Memory Implementation
- CC-NUMA performs significantly better than SHMEM
- RCM and SAW reduce runtimes compared to ORIG (more so for CC-NUMA)
- Little difference between RCM and SAW, probably due to the large cache
- CC-NUMA (with ordering) and MPI runtimes are comparable, even though the programming methodologies are quite different
64Tera MTA Multithreaded Architecture
- The 255 MHz Tera uses multithreading to hide latency (100-150 cycles per word) and keep processors saturated with work
- No data cache
- Randomized memory mapping (data layout optimization impossible)
- Near-uniform data access time from any processor to any memory location
- Each processor has 100 hardware streams (32 registers + PC each)
- Context switch on each cycle, choosing the next instruction from the ready streams
- A stream can execute an instruction only once every 21 cycles, even if no instructions reference memory
- Synchronization between threads is accomplished using full/empty bits in memory, allowing fine-grained threads
- No explicit load balancing is required, since dynamic scheduling of work to threads can keep the processor saturated
- No difference between uni- and multiprocessor parallelism
65Tera MTA Implementation
- The multithreaded implementation is trivial (it only requires compiler directives)
- Special assertions are used to indicate that there are no loop-carried dependencies
- The compiler is then able to parallelize the loop segments
- Load balancing is handled by the OS (rows of the matrix are dynamically assigned to threads)
- Other than the reduction for DOT, no special synchronization constructs are required (no possible race conditions in CG)
- No special ordering is required (or possible) to achieve good parallel performance
66Tera MTA Results
- Both SPMV and CG show high scalability (over 90%) with 60 streams per processor
- Sufficient thread-level parallelism to tolerate the high overhead of memory access
- The 8-processor Tera is faster than the 32-processor Origin2000 and the 16-processor T3E with no partitioning / ordering (but will scaling continue beyond 8 processors?)
- Adaptivity: no extra work is required to maintain performance