Title: CS 267: Applications of Parallel Computers Lecture 17: Parallel Sparse Matrix-Vector Multiplication
1CS 267: Applications of Parallel Computers
Lecture 17: Parallel Sparse Matrix-Vector Multiplication
- Horst D. Simon
- http://www.cs.berkeley.edu/~strive/cs267
2Outline
- Matrix-Vector multiplication (review)
- Matrices and Graphs (review)
- Serial optimizations (cf. thesis work by E.J. Im, http://www.cs.berkeley.edu/~ejim/publication/)
- Register blocking
- Cache blocking
- Exploiting symmetry
- Reordering
- Parallel optimizations (cf. lecture by Lenny Oliker and references at http://www.nersc.gov/~oliker/paperlinks.html)
- Reordering
- Multiple vectors or operations
- Other operations
3Parallel Dense Matrix-Vector Product (Review)
- y = y + A*x, where A is a dense matrix
- Layout
- 1D by rows
- Algorithm
- For each processor j
- Broadcast X(j)
- Compute A(p)*X(j)
- A(i) refers to the n/p by n block row that processor i owns
- Algorithm uses the formula
- Y(i) = Y(i) + A(i)*X = Y(i) + Sum_j A(i,j)*X(j)
[Figure: x and y laid out in blocks across processors P0, P1, P2, P3]
4Graphs and Sparse Matrices
- Sparse matrix is a representation of a (sparse)
graph
[Figure: 6x6 sparse matrix of 0/1 entries and the corresponding 6-node graph]
- Matrix entries are edge weights
- Diagonal contains self-edges (usually non-zero)
- Matrix-vector multiplication
- Vector is spread over nodes
- Compute new vector by multiplying each incoming edge weight by the value at its source node and summing
5Data Structure for Sparse Matrix
- Matvec (matrix-vector multiplication) will
- Access rows of the matrix serially
- => use an array for each row (indices and values)
- Access the destination vector only once per element
- => layout doesn't matter
- Access the source vector once per row, touching only nonzeros
- The only opportunity for memory hierarchy improvements is in the source vector
- Layout by rows may not be best; more on this later
6Compressed Sparse Row Format
- Compressed Sparse Row (CSR), also called Compressed Row Storage (CRS), is
- General: does not make assumptions about the matrix structure
- Relatively efficient: rows are sorted, only 1 index per nonzero
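A minimal sketch (not from the original slides) of the CSR layout and the matvec loop it implies; the array names rowptr, colind, and val are illustrative:

#include <stdio.h>

/* CSR storage for an m-by-n matrix with nnz nonzeros:
 * row i's nonzeros are val[rowptr[i] .. rowptr[i+1]-1],
 * with column indices colind[rowptr[i] .. rowptr[i+1]-1]. */
void spmv_csr(int m, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = y[i];                 /* computes y = y + A*x */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];  /* irregular access to x */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [1 0 2; 0 3 0; 4 0 5] */
    int    rowptr[] = {0, 2, 3, 5};
    int    colind[] = {0, 2, 1, 0, 2};
    double val[]    = {1, 2, 3, 4, 5};
    double x[]      = {1, 1, 1};
    double y[]      = {0, 0, 0};
    spmv_csr(3, rowptr, colind, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* prints: 3 3 9 */
    return 0;
}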
7Performance Challenge
- Consider a 167 MHz UltraSPARC I
- Dense matrix operations
- Naïve matrix-vector multiplication: 38 Mflops
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops
- Reason: low flop-to-memory ratio (about 2)
- Sparse matrix operations
- Naïve matrix-vector multiplication of a dense matrix in sparse representation (Compressed Sparse Row): 25 Mflops
- As low as 5.7 Mflops for some matrices
- Reason: indexing overhead and irregular memory access patterns
8Serial Optimizations
- Several optimizations, depending on matrix structure
- Register blocking
- Sparsity and PETSc
- Cache blocking
- Sparsity
- Symmetry
- BeBop and a project at CMU
- Reordering
- Toledo (97) and Oliker et al. (00) do this
- Exploit bands and other structures
- The sparse compiler project by Bik (96) does this
9Register Blocking Optimization
- Identify small dense blocks of nonzeros
- Fill in extra zeros to complete the blocks
- Use an optimized multiplication code for the particular block size
- Reuse source elements
- Unroll the loop
- Saves index memory
[Figure: example of a 2x2 register-blocked matrix, with explicit zeros filled in to complete the blocks]
- Improves register reuse and lowers indexing overhead
- Computes extra flops for the added zeros
10Register Blocking Blocked Row Format
- Blocked Compressed Sparse Row Format
- Advantages of the format
- Better temporal locality in registers
- The multiplication loop can be unrolled for
better performance
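A hedged sketch of the 2x2 blocked-CSR (BCSR) matvec the slide describes: one column index per 2x2 block, four values stored per block, and the inner 2x2 kernel fully unrolled so that the y and x entries stay in registers. The layout details and names are illustrative, not the lecture's exact format:

/* Sketch: y += A*x for a matrix stored in 2x2 blocked CSR (BCSR).
 * brow_ptr/bcol_ind index 2x2 blocks; val holds 4 values per block in
 * row-major order; mb = number of block rows. Assumes the row and column
 * dimensions are even (padded if necessary). Names are illustrative. */
void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_ind,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = y[2 * ib], y1 = y[2 * ib + 1];   /* kept in registers */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *b = val + 4 * k;
            double x0 = x[2 * bcol_ind[k]];          /* source reused twice */
            double x1 = x[2 * bcol_ind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;             /* unrolled 2x2 kernel */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib] = y0;
        y[2 * ib + 1] = y1;
    }
}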
11Register Blocking Fill Overhead
- We use a uniform block size, adding fill overhead
- Fill overhead = 12/7 ≈ 1.71
- This increases both storage and the number of floating point operations
12Performance Model for Register Blocking
- Estimate the performance of register blocking from two quantities
- Estimated raw performance: Mflop/s of a dense matrix in sparse bxb blocked format on the machine of choice
- Estimated fill overhead for the particular matrix to fill in bxb blocks (matrix-dependent)
- Maximize (estimated raw performance) / (estimated fill overhead) over b
- Use sampling to further reduce time; row and column block dimensions are chosen separately
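A small sketch of the selection step under this model: assuming a per-machine profile of Mflop/s for a dense matrix in b x b blocked format and a per-matrix estimate of the fill ratio, pick the b that maximizes profile/fill. The inputs here are placeholders supplied by the caller, not measured data:

/* Sketch of the block-size selection heuristic: estimated performance for
 * block size b is (dense-in-sparse-blocked Mflop/s from the machine profile)
 * divided by the matrix's estimated fill ratio for b x b blocks.
 * profile_mflops[] and fill_ratio[] are supplied by the caller. */
int choose_block_size(int nb, const double *profile_mflops,
                      const double *fill_ratio)
{
    int best_b = 1;
    double best = 0.0;
    for (int b = 1; b <= nb; b++) {
        double est = profile_mflops[b - 1] / fill_ratio[b - 1];
        if (est > best) { best = est; best_b = b; }
    }
    return best_b;
}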
13Register Blocking Profile of Ultra I
- Dense Matrix profile on an UltraSPARC I
14Register Blocking Profile of MIPS R10K
- Performance of dense matrix in sparse blocked
format on MIPS R10K
15Register Blocking Profile of P III
- Performance of dense matrix in sparse blocked
format on P III
16Register Blocking Profile of P4
- Performance of dense matrix in sparse blocked
format on P 4
17Benchmark matrices
- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix, used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzeros)
18Register Blocking Performance
- The optimization is most effective on the FEM matrices and the dense matrix (lower-numbered)
- A single search step (comparing blocked and unblocked) raises the numbers below the line to 1
19Register Blocking Performance
- Speedup is generally best on the MIPS R10000, which is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38)
20Register Blocking on P4 for FEM/fluids matrix 1
21Register Blocking on P4 for FEM/fluids matrix 2
22Register Blocking Model Validation
23Register Blocking Overhead
- Pre-computation overhead
- Estimating fill overhead (red bars)
- Reorganizing the matrix (yellow bars)
- The ratio gives the number of repetitions needed for the optimization to become beneficial
24Cache Blocking Optimization
- Keep part of the source vector in cache
[Figure: sparse matrix A, source vector x, and destination vector y]
- Improves cache reuse of the source vector
- Challenge: choosing a block size
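One plausible realization of this idea (not necessarily Sparsity's exact format): split the matrix into column blocks, store each block in its own CSR structure with block-local column indices, and sweep the blocks one at a time so the corresponding slice of x stays in cache. Names are illustrative:

/* Sketch of cache-blocked SpMV: the matrix is stored as ncb column blocks,
 * each in its own CSR structure, so each pass reuses a cache-sized slice
 * of x. Accumulates into y (y += A*x). */
typedef struct {
    int m;                 /* number of rows */
    int col_start;         /* first column covered by this block */
    const int *rowptr, *colind;
    const double *val;
} csr_block;

void spmv_cache_blocked(int ncb, const csr_block *blocks,
                        const double *x, double *y)
{
    for (int cb = 0; cb < ncb; cb++) {        /* one column block at a time */
        const csr_block *B = &blocks[cb];
        const double *xs = x + B->col_start;  /* slice of x kept in cache */
        for (int i = 0; i < B->m; i++) {
            double sum = 0.0;
            for (int k = B->rowptr[i]; k < B->rowptr[i + 1]; k++)
                sum += B->val[k] * xs[B->colind[k]];  /* local column indices */
            y[i] += sum;
        }
    }
}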
25Cache Blocking
- Improves temporal locality of access to the source vector
- Makes sense when x does not fit in cache
- Rectangular matrices, in particular
[Figure: source vector x, destination vector y, and the matrix as laid out in memory]
26Cache Blocking Speedup
- MIPS speedup is generally better
- Larger cache, larger miss penalty (26/589 ns for MIPS, 36/268 ns for Ultra)
- The document retrieval and random matrices are exceptions
- They perform very poorly in the baseline code
27Cache Blocking Web Document Matrix
- This matrix seems to be an exception
- Cache blocking helps
- Only on a random matrix with similar shape/size
does it help (without other optimizations)
28Cache Blocking Searching for Block Size
29Combination of Register and Cache blocking
- The combination is rarely beneficial, and often slower than either of the two optimizations alone
30Combination of Register and Cache blocking
31Cache Blocking Summary
- Rarely useful on matrices from real problems
- Matrices need to be
- Random in structure
- Rectangular
- A reasonable block size can be determined through a coarse-grained search
- The combination of register and cache blocking is not useful
- The two optimizations work on disjoint sets of matrices (as far as we can tell)
32Exploiting Symmetry
- Many matrices are symmetric (undirected graph)
- CG, for example, works only on symmetric matrices
- Symmetric storage: store/access only half of the matrix
- Update y twice for each stored element of A
- y(i) += A(i,j) * x(j)
- y(j) += A(i,j) * x(i)
- Saves up to half of the memory traffic (to A)
- Works for sparse formats
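A minimal sketch of the symmetric kernel described above, assuming the upper triangle (including the diagonal) is stored in CSR; each stored off-diagonal entry updates both y(i) and y(j). Names are illustrative:

/* Symmetric SpMV with only the upper triangle stored: y += A*x,
 * halving the traffic to A compared with storing the full matrix. */
void spmv_symmetric(int n, const int *rowptr, const int *colind,
                    const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double xi = x[i], yi = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int j = colind[k];
            yi += val[k] * x[j];          /* y(i) += A(i,j) * x(j) */
            if (j != i)
                y[j] += val[k] * xi;      /* y(j) += A(i,j) * x(i), by symmetry */
        }
        y[i] += yi;
    }
}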
33Symmetry Speedups
34Multiple Vector Optimization
- Better potential for reuse
- Loop-unrolled codes that multiply across the vectors are generated by a code generator
[Figure: matrix element a(i,j) reused to update destination vectors y(i,1), y(i,2), ... from source vectors x(j,1), x(j,2), ...]
- Allows reuse of matrix elements
- Challenge: choosing the number of vectors for loop unrolling
35Multiple Vector Multiplication
- Better chance of optimization: analogous to BLAS2 vs. BLAS3
[Figure: repeated single-vector multiplications vs. one multiple-vector multiplication]
- Blocked algorithms (e.g., blocked CG) may use this
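A sketch of a multiple-vector kernel with the cross-vector loop unrolled by 4, so each matrix element is loaded once and reused for four source vectors; the interleaved layout of X and Y and the unrolling depth are illustrative choices:

/* Multiple-vector SpMV, Y += A*X for 4 vectors. X and Y are stored with the
 * vectors interleaved (4 values per row index), so val[k] is reused 4 times. */
void spmv_csr_4vec(int m, const int *rowptr, const int *colind,
                   const double *val, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        double y0 = 0, y1 = 0, y2 = 0, y3 = 0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            double a = val[k];                  /* one load, four uses */
            const double *x = X + 4 * colind[k];
            y0 += a * x[0];
            y1 += a * x[1];
            y2 += a * x[2];
            y3 += a * x[3];
        }
        Y[4 * i]     += y0;
        Y[4 * i + 1] += y1;
        Y[4 * i + 2] += y2;
        Y[4 * i + 3] += y3;
    }
}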
36Multiple Vector Effect on Synthetic Matrices
- matrix with random sparsity pattern
- dense matrix in sparse format
37Multiple Vectors with Register Blocking
- The speedup is larger than for single-vector register blocking
- Even the performance of the matrices that did not speed up improved (middle group on the UltraSPARC)
38Multiple Vectors with Cache Blocking
[Plots: MIPS and UltraSPARC]
- Noticeable speedup for matrices that did not speed up with cache blocking alone (UltraSPARC)
- Block sizes are much smaller than for single-vector cache blocking
39Sparsity System Organization
- Guides the choice of optimization
- Automatic selection of optimization parameters such as block size and number of vectors
- http://www.cs.berkeley.edu/~yelick/sparsity
- The released version does not include the optimization for symmetry
[Diagram: a representative matrix and a machine profile (from the Sparsity machine profiler), together with the maximum number of vectors, feed the Sparsity optimizer, which produces the optimized format and code]
40Summary of Serial Optimizations
- Sparse matrix optimization has the same problems as dense, plus the nonzero pattern
- Optimization of format/code is effective, but expensive
- Should be done once across an application, not at each matvec call or even each iterative solve
- The optimization choice may depend on matrix characteristics rather than specifics
- Optimizing over a larger kernel is also important
- Heuristics plus search can be used to select optimization parameters
- Future
- Different code organizations, instruction mixes
- Different views of A: A = A1 + A2
- Different matrix orderings
41Tuning other sparse operations
- Symmetric matrix-vector multiply A*x
- Solve a sparse triangular system of equations T^(-1)*x
- A^T*A*x
- Kernel of information retrieval via LSI
- Same number of memory references as A*x
- A^2*x, A^k*x
- Kernel of the information retrieval used by Google
- Changes the calling algorithm
- A^T*M*A
- Matrix triple product
- Used in multigrid solvers
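One way to see the "same number of memory references as A*x" claim for A^T*A*x: process A row by row, using each row a_i once for the inner product t = a_i . x and immediately again for the update y += t * a_i, so A is streamed through the memory hierarchy only once. A minimal CSR sketch under that interpretation (names illustrative; y is assumed zero-initialized):

/* y += A^T * (A * x) in one pass over the rows of A (CSR, m rows, n cols).
 * For each row a_i: t = a_i . x, then y += t * a_i; each row is brought in
 * once, as in a single SpMV. Zero y first to get y = A^T*A*x. */
void ata_x(int m, const int *rowptr, const int *colind,
           const double *val, const double *x, double *y /* length n */)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            t += val[k] * x[colind[k]];    /* t = a_i . x */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            y[colind[k]] += t * val[k];    /* y += t * a_i */
    }
}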
42Symmetric Sparse Matrix-Vector Multiply
P4 using icc
43Sparse Triangular Solve
P4 using icc
44Parallel Sparse Matrix-Vector Multiplication
45Parallel Sparse Matrix-Vector Product
- y = y + A*x, where A is a sparse matrix
- Layout
- 1D by rows
- Algorithm
- For each processor j
- Broadcast X(j)
- Compute A(p)*X(j)
- The same algorithm works, but
- It may be load imbalanced
- It communicates more of X than needed
[Figure: x and y laid out in blocks across processors P0, P1, P2, P3]
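A hedged MPI sketch of the 1D block-row algorithm above: each process owns a block of rows (CSR with global column indices) and the matching piece of x; the "broadcast X(j)" step is realized here with MPI_Allgatherv, after which each process multiplies its block row. Sending only the needed entries of x (next slide) is a further optimization not shown:

#include <mpi.h>

/* 1D block-row parallel SpMV sketch: y_local += A(block row) * x.
 * counts/displs give the length and offset of each process's piece of x.
 * Variable names and the use of MPI_Allgatherv are illustrative. */
void par_spmv(int my_m,                    /* rows owned by this process     */
              const int *rowptr, const int *colind, const double *val,
              const double *x_local,       /* my piece of x (length my_m)    */
              double *x_global,            /* buffer for all of x (length n) */
              const int *counts, const int *displs,
              double *y_local,             /* my piece of y (length my_m)    */
              MPI_Comm comm)
{
    /* Everyone gets the full x; an optimized version would send only the
     * entries actually referenced by the local column indices. */
    MPI_Allgatherv(x_local, my_m, MPI_DOUBLE,
                   x_global, counts, displs, MPI_DOUBLE, comm);

    for (int i = 0; i < my_m; i++) {
        double sum = y_local[i];
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x_global[colind[k]];
        y_local[i] = sum;
    }
}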
46Optimization Opportunities
- Send only the necessary parts of x
- Overlap communication and computation
- Load balance
- Take advantage of symmetry
- Trickier than the serial case, because of multiple accesses to Y
- Take advantage of blocked structure
- For serial performance, using FEM matrices or similar
- Take advantage of bands
- Store only diagonal bands with a starting coordinate
- Reduces index storage and access; running time?
- Reorder your matrix
- Use blocked CG
- Better serial performance; reduces the alpha (latency) term but not the beta (bandwidth) term in communication
47Following slides based on "Ordering for Efficiency of Sparse CG" by X. Li, L. Oliker, G. Heber, and R. Biswas
48Partitioning and Ordering Schemes
- RCM, METIS
- Self-Avoiding Walk (SAW) [Heber/Biswas/Gao 99]
- A SAW over a triangular mesh is an enumeration of the triangles such that two consecutive triangles in the SAW share an edge or a vertex (there are no jumps)
- It is a mesh-based linearization technique with similar application areas as space-filling curves
- Construction time is linear in the number of triangles
- Amenable to hierarchical coarsening and refinement, and can be easily parallelized
49Graph Partitioning Strategy MeTiS
- Most popular class of multilevel partitioners
- Objectives
- Balance computational workload
- Minimize edge cut (interprocessor communication)
- Algorithm
- Collapses vertices and edges using a heavy-edge matching scheme
- Applies a greedy partitioning algorithm to the coarsest graph
- Uncoarsens it back using greedy graph growing and Kernighan-Lin refinement
[Diagram: coarsening phase, initial partitioning, refinement phase]
50Strategy Reverse Cuthill-McKee (RCM)
- Matrix bandwidth (profile) has a significant impact on the efficiency of linear system and eigenvalue solvers
- Geometry-based algorithm that generates a permutation so that nonzero entries are close to the diagonal
- Good preordering for LU or Cholesky factorization (reduces fill)
- Improves cache performance (but does not explicitly reduce edge cut)
- Can be used as a partitioner by assigning segments of the ordering to processors
- Starting from a vertex of minimum degree, generates a level structure by BFS and orders vertices by decreasing distance from the original vertex
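A compact sketch of the RCM procedure just described, for a connected graph given in CSR adjacency form: BFS from a minimum-degree vertex, record the visit order, then reverse it. For brevity this sketch omits sorting each vertex's unvisited neighbors by increasing degree, which the full algorithm does; names are illustrative:

#include <stdlib.h>

/* Reverse Cuthill-McKee ordering sketch: perm[] receives the new order. */
void rcm_order(int n, const int *adjptr, const int *adj, int *perm)
{
    int *visited = calloc(n, sizeof(int));
    int *queue   = malloc(n * sizeof(int));

    /* Start from a vertex of minimum degree. */
    int start = 0;
    for (int v = 1; v < n; v++)
        if (adjptr[v + 1] - adjptr[v] < adjptr[start + 1] - adjptr[start])
            start = v;

    /* BFS, recording the Cuthill-McKee visit order. */
    int head = 0, tail = 0, count = 0;
    queue[tail++] = start;
    visited[start] = 1;
    while (head < tail) {
        int v = queue[head++];
        perm[count++] = v;
        for (int k = adjptr[v]; k < adjptr[v + 1]; k++) {
            int w = adj[k];
            if (!visited[w]) { visited[w] = 1; queue[tail++] = w; }
        }
    }

    /* Reverse the order to obtain RCM. */
    for (int i = 0; i < count / 2; i++) {
        int t = perm[i];
        perm[i] = perm[count - 1 - i];
        perm[count - 1 - i] = t;
    }
    free(visited);
    free(queue);
}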
51Strategy Self-Avoiding Walks (SAW)
- Mesh-based technique similar to space-filling curves
- Two consecutive triangles in the walk share an edge or a vertex (no jumps)
- Visits each triangle exactly once, entering/exiting over an edge/vertex
- Improves parallel efficiency related to locality (cache reuse) and load balancing, but does not explicitly reduce edge cuts
- Amenable to hierarchical coarsening and refinement
- Biswas, Oliker, Li, Concurrency: Practice and Experience, 12 (2000), 85-109
52Experiment
- Mesh generation
- 2D Delaunay triangulation, generated by Triangle [Shewchuk 96]
- The mesh is shaped like the letter A
- 661,054 vertices, 1,313,099 triangles
- Matrix generation
- For each vertex pair (vi, vj) with distance < 3, we assign a random value in (0, 1) to A(i, j)
- The diagonal is chosen to make the matrix diagonally dominant
- About 39 nonzeros per row, 25,753,034 nonzeros in total
- Implementations
- The MPI version uses Aztec
- The Origin 2000 and Tera MTA versions use compiler directives
53Sparse CG Timings
54Sparse Matvec Cache Misses
55Sparse Matvec Communication Volume
56Observations
- On the T3E and Origin, the ordering methods significantly improve CG performance
- On the T3E, matrix-vector efficiency depends more on cache reuse than on communication reduction
- RCM, SAW, METIS: > 93% of total time spent servicing cache misses
- ORIG: 72-80% of total time spent servicing cache misses
- On the Origin, SMP directives can be used, but data distribution is required
- On the Tera MTA, no ordering and no data distribution are required; the program is the simplest
57Summary
- Lots of opportunities for optimization
- Data structures
- Local performance
- Communication performance
- Numerical techniques (preconditioning)
- Take advantage of special structures
- OK to focus on a class of matrices
58Test Problem
- 2D Delaunay triangulation of a letter-A-shaped mesh generated using the Triangle package (661,054 vertices; 1,313,099 triangles)
- Underlying matrix assembled by assigning a random value in (0,1) to each (i,j) entry corresponding to a vertex pair (vi,vj) with dist(vi,vj) < 4 (other entries set to 0)
- Diagonal entries set to 40 (to make the matrix positive definite)
- Final matrix has approx. 39 entries/row and 25,753,034 nonzeros
- CG converges in 13 iterations; SPMV accounts for 87% of the flops
59Distributed-Memory Implementation
- Each processor has local memory that only it can directly access; message passing is required to access the memory of another processor
- The user decides how to distribute data and organize the communication structure
- Allows efficient code design at the cost of higher complexity
- Parallel CG calls the Aztec sparse linear algebra library
- Matrix A is partitioned into blocks of rows, each block assigned to a processor
- Associated component vectors (x, b) are distributed accordingly
- Communication is needed to transfer some components of x
- Data from
- T3E (450 MHz Alpha processor, 900 Mflops peak, 96 KB secondary cache, 3D torus interconnect)
60Distributed-Memory Implementation
- The ORIG ordering has a large edge cut (interprocessor communication) and poor locality (a high number of cache misses)
- MeTiS minimizes the edge cut, while SAW minimizes cache misses
61Distributed Shared-Memory System
- Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processors and local memory)
- Hardware makes all memory equally accessible from the software perspective by sending memory requests through routers on the nodes
- Memory access time is non-uniform (depends on the number of hops)
- Hypercube interconnect (maximum log(p) hops)
- Each processor has 4 MB cache where only it can
fetch/store data - If processor accesses data not in cache, delay
while copy fetched - When processor modifies word, all other copies of
that cache line invalidated
62Distributed Shared-Memory Implementation
- Use OpenMP-style directives to parallelize loops
- Requires less effort than MPI
- Two implementation approaches taken
- SHMEM: assume the Origin2000 has flat shared memory (arrays not explicitly distributed, non-local data handled by cache coherence)
- CC-NUMA: account for the underlying distributed shared memory by explicit data distribution
- Computational kernels are identical for SHMEM and CC-NUMA
- Each processor is assigned an equal number of rows of the matrix (a block)
- No explicit synchronization is required since there are no concurrent writes
- A global reduction is required for the DOT operation
63Distributed Shared-Memory Implementation
- CC-NUMA performs significantly better than SHMEM
- RCM and SAW reduce runtimes compared to ORIG (more so for CC-NUMA)
- Little difference between RCM and SAW, probably due to the large cache
- CC-NUMA (with ordering) and MPI runtimes are comparable, even though the programming methodologies are quite different
64Tera MTA Multithreaded Architecture
- The 255 MHz Tera uses multithreading to hide latency (100-150 cycles per word) and keep processors saturated with work
- No data cache
- Randomized memory mapping (data layout optimization impossible)
- Near-uniform data access time from any processor to any memory location
- Each processor has 100 hardware streams (32 registers + PC each)
- Context switch on each cycle, choosing the next instruction from the ready streams
- A stream can execute an instruction only once every 21 cycles, even if no instructions reference memory
- Synchronization between threads is accomplished using full/empty bits in memory, allowing fine-grained threads
- No explicit load balancing is required, since dynamic scheduling of work to threads can keep the processor saturated
- No difference between uni- and multiprocessor parallelism
65Tera MTA Implementation
- The multithreaded implementation is trivial (it only requires compiler directives)
- Special assertions are used to indicate that there are no loop-carried dependencies
- The compiler is then able to parallelize the loop segments
- Load balancing is handled by the OS (rows of the matrix are dynamically assigned to threads)
- Other than the reduction for DOT, no special synchronization constructs are required (no possible race conditions in CG)
- No special ordering is required (or possible) to achieve good parallel performance
66Tera MTA Results
- Both SPMV and CG show high scalability (over 90%) with 60 streams per processor
- Sufficient thread-level parallelism to tolerate the high overhead of memory access
- The 8-processor Tera is faster than the 32-processor Origin2000 and the 16-processor T3E with no partitioning / ordering (but will scaling continue beyond 8 processors?)
- Adaptivity: no extra work is required to maintain performance