1
CS 267: Applications of Parallel Computers
Lecture 17: Parallel Sparse Matrix-Vector Multiplication
  • Horst D. Simon
  • http://www.cs.berkeley.edu/~strive/cs267

2
Outline
  • Matrix-Vector multiplication (review)
  • Matrices and Graphs (review)
  • Serial optimizations (cf. thesis work by E.J. Im,
    http://www.cs.berkeley.edu/~ejim/publication/ )
  • Register blocking
  • Cache blocking
  • Exploiting symmetry
  • Reordering
  • Parallel optimizations (cf. lecture by Lenny
    Oliker and references at http://www.nersc.gov/~oliker
    /paperlinks.html )
  • Reordering
  • Multiple vectors or operations
  • Other operations

3
Parallel Dense Matrix-Vector Product (Review)
  • y = y + A*x, where A is a dense matrix
  • Layout
  • 1D by rows
  • Algorithm
  • For each processor j
  • Broadcast X(j)
  • Compute A(p)*X(j)
  • A(i) refers to the n/p by n block row that
    processor i owns
  • Algorithm uses the formula
  • Y(i) = Y(i) + A(i)*X = Y(i) + Σj A(i)*X(j)
    (a minimal MPI sketch follows the layout figure)

[Figure: 1D row layout, with x and y distributed across processors P0, P1, P2, P3]
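In code, the algorithm above amounts to an all-gather of x followed by a local block-row product. A minimal MPI sketch under these assumptions (the function name and layout details are ours, not from the lecture):

    #include <mpi.h>
    #include <stddef.h>

    /* Sketch: 1D row-blocked y = y + A*x.  Each of p ranks owns an
       (n/p) x n block row A_loc and matching slices x_loc, y_loc of
       length n/p; n is assumed divisible by p. */
    void dense_matvec_1d(int n, int p, const double *A_loc,
                         const double *x_loc, double *y_loc,
                         double *x_full /* scratch of length n */)
    {
        int nloc = n / p;
        /* "Broadcast X(j)": gather every rank's piece of x onto all ranks */
        MPI_Allgather((void *)x_loc, nloc, MPI_DOUBLE,
                      x_full, nloc, MPI_DOUBLE, MPI_COMM_WORLD);
        /* Local block-row product: Y(i) = Y(i) + A(i)*X */
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < n; j++)
                y_loc[i] += A_loc[(size_t)i * n + j] * x_full[j];
    }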
4
Graphs and Sparse Matrices
  • Sparse matrix is a representation of a (sparse)
    graph

[Figure: 6x6 sparse matrix and its graph on vertices 1-6; a nonzero A(i,j) corresponds to edge (i,j)]
  • Matrix entries are edge weights
  • Diagonal contains self-edges (usually non-zero)
  • Matrix-vector multiplication
  • Vector is spread over nodes
  • Compute the new vector entry at each node by multiplying
    each incoming edge weight by the value at its source node

5
Data Structure for Sparse Matrix
  • Matvec (Matrix Vector Multiplication) will
  • Access rows of matrix serially
  • ⇒ use an array for each row (indices and
    values)
  • Access destination vector only once per element
  • ⇒ layout doesn't matter
  • Access source vector once per row, only nonzeros
  • Only opportunity for memory hierarchy
    improvements is in the source vector
  • Layout by rows may not be best; more on this
    later

6
Compressed Sparse Row Format
  • Compressed Sparse Row (CSR), also called
    Compressed Row Storage (CRS), is:
  • General: does not make assumptions about the
    matrix structure
  • Relatively efficient: rows are sorted, only 1
    index per nonzero (a minimal kernel sketch follows)
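For concreteness, a sketch of the naive CSR kernel for y = y + A*x, assuming the usual row_ptr / col_idx / val arrays (not the tuned code discussed later):

    /* Naive CSR (CRS) sparse matrix-vector multiply, y = y + A*x.
       row_ptr has length nrows+1; col_idx/val hold one entry per nonzero. */
    void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = y[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* one index load per nonzero */
            y[i] = sum;
        }
    }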

7
Performance Challenge
  • Consider a 167 MHz UltraSPARC I
  • Dense matrix operations:
  • Naïve matrix-vector multiplication:
    38 Mflops
  • Vendor-optimized matrix-vector multiplication:
    57 Mflops
  • Vendor-optimized matrix-matrix multiplication:
    185 Mflops
  • Reason: low flop-to-memory ratio of 2
  • Sparse matrix operations:
  • Naïve matrix-vector multiplication of a dense
    matrix in sparse representation (Compressed
    Sparse Row): 25 Mflops
  • As low as 5.7 Mflops for some matrices
  • Reason: indexing overhead and irregular memory
    access patterns

8
Serial Optimizations
  • Several optimizations, depending on matrix
    structure
  • Register blocking
  • Sparsity and PETSc
  • Cache blocking
  • Sparsity
  • Symmetry
  • BeBop and project at CMU
  • Reordering
  • Toledo (97) and Oliker et al (00) do this
  • Exploit bands, and other structures
  • Sparse compiler project by Bik (96) does this

9
Register Blocking Optimization
  • Identify small dense blocks of nonzeros.
  • Fill in extra zeros to complete blocks
  • Use an optimized multiplication code for the
    particular block size.
  • Reuse source elements
  • Unroll loop
  • Saves index memory

[Figure: a 2x2 register-blocked matrix, with explicit zeros filled in to complete the blocks]
  • Improves register reuse, lowers indexing
    overhead.
  • Computes extra Flops for added zeros

10
Register Blocking: Blocked Row Format
  • Blocked Compressed Sparse Row (BCSR) format
  • Advantages of the format:
  • Better temporal locality in registers
  • The multiplication loop can be unrolled for
    better performance (a 2x2 sketch follows)
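A sketch of what the unrolled multiply might look like for 2x2 BCSR, assuming each block is stored as four contiguous row-major values (the actual generated code differs in detail):

    /* 2x2 BCSR multiply, y = y + A*x.  brow_ptr/bcol_idx index 2x2 blocks;
       val stores each block as 4 contiguous row-major doubles. */
    void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_idx,
                       const double *val, const double *x, double *y)
    {
        for (int ib = 0; ib < nbrows; ib++) {
            double y0 = y[2 * ib], y1 = y[2 * ib + 1];
            for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
                const double *b = &val[4 * k];
                int j = 2 * bcol_idx[k];
                double x0 = x[j], x1 = x[j + 1];   /* source elements reused */
                y0 += b[0] * x0 + b[1] * x1;       /* unrolled 2x2 block */
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib] = y0;
            y[2 * ib + 1] = y1;
        }
    }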

11
Register Blocking: Fill Overhead
  • We use a uniform block size, adding fill overhead.
  • Fill overhead = 12/7 ≈ 1.71 in this example
  • This increases both space and the number of
    floating point operations.

12
Performance Model for Register Blocking
  • Estimate the performance of register blocking from:
  • Estimated raw performance: Mflop/s of a dense
    matrix stored in sparse bxb blocked format on the
    machine of choice
  • Estimated overhead: fill incurred when the
    particular matrix is blocked into bxb blocks
    (matrix-dependent)
  • Maximize over b:
    estimated Mflop/s(b) = estimated raw performance(b) / estimated overhead(b)
  • Use sampling to further reduce estimation time; row and
    column block dimensions are computed separately
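A small sketch of the selection heuristic, assuming the machine profile and the sampled fill estimates have already been tabulated for b = 1..max_b (array names are hypothetical):

    /* Pick the block size b that maximizes the estimated Mflop rate.
       dense_mflops[b]: machine profile for b x b blocking (b = 1..max_b);
       fill_ratio[b]:   estimated fill for this matrix at b x b.
       Both arrays have max_b+1 entries (index 0 unused). */
    int choose_block_size(const double dense_mflops[],
                          const double fill_ratio[], int max_b)
    {
        int best_b = 1;
        double best_est = 0.0;
        for (int b = 1; b <= max_b; b++) {
            double est = dense_mflops[b] / fill_ratio[b];  /* maximize over b */
            if (est > best_est) { best_est = est; best_b = b; }
        }
        return best_b;
    }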
13
Register Blocking Profile of Ultra I
  • Dense Matrix profile on an UltraSPARC I

14
Register Blocking Profile of MIPS R10K
  • Performance of dense matrix in sparse blocked
    format on MIPS R10K

15
Register Blocking Profile of P III
  • Performance of dense matrix in sparse blocked
    format on P III

16
Register Blocking Profile of P4
  • Performance of dense matrix in sparse blocked
    format on P 4

17
Benchmark matrices
  • Matrix 1: dense matrix (1000 x 1000)
  • Matrices 2-17: Finite Element Method matrices
  • Matrices 18-39: matrices from Structural
    Engineering, Device Simulation
  • Matrices 40-44: Linear Programming matrices
  • Matrix 45: document retrieval matrix
    used for Latent Semantic Indexing
  • Matrix 46: random matrix (10000 x 10000, 0.15)

18
Register Blocking Performance
  • The optimization is most effective on FEM
    matrices and the dense matrix (lower-numbered).
  • A single search step (compare blocked and
    unblocked) raises the numbers below the line to 1

19
Register Blocking Performance
  • Speedup is generally best on the MIPS R10000, which
    is competitive with the dense BLAS performance
    (DGEMV/DGEMM = 0.38)

20
Register Blocking on P4 for FEM/fluids matrix 1
21
Register Blocking on P4 for FEM/fluids matrix 2
22
Register Blocking Model Validation
23
Register Blocking Overhead
  • Pre-computation overhead
  • Estimating fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
  • The ratio gives the number of multiplications after
    which the optimization becomes beneficial.

24
Cache Blocking Optimization
  • Keeping part of source vector in cache

[Figure: the sparse matrix A is processed in column blocks so that the corresponding block of the source vector x stays in cache while the destination vector y is updated]
  • Improves cache reuse of source vector.
  • Challenge: choosing a block size (a simplified sketch follows)
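A simplified sketch of the idea, assuming the matrix has been pre-split into column blocks, each stored as its own small CSR piece (the actual blocked format used in the experiments differs):

    #include <stddef.h>

    /* Column-blocked multiply: A is stored as nblk CSR pieces, each touching
       only bsize consecutive columns, so that slice of x stays in cache. */
    typedef struct {
        int nrows;
        const int *row_ptr, *col_idx;   /* col_idx relative to the block's first column */
        const double *val;
    } csr_block;

    void spmv_cache_blocked(int nblk, int bsize, const csr_block *blocks,
                            const double *x, double *y)
    {
        for (int b = 0; b < nblk; b++) {                 /* one column block at a time */
            const double *xb = x + (size_t)b * bsize;    /* x slice reused from cache */
            const csr_block *B = &blocks[b];
            for (int i = 0; i < B->nrows; i++)
                for (int k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                    y[i] += B->val[k] * xb[B->col_idx[k]];
        }
    }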

25
Cache Blocking
  • Temporal locality of access to source vector
  • Makes sense when X does not fit in cache
  • Rectangular matrices, in particular

[Figure: source vector x, destination vector y, and the matrix as laid out in memory]
26
Cache Blocking Speedup
  • MIPS speedup is generally better.
  • Larger cache, larger miss penalty (26/589 ns for
    MIPS, 36/268 ns for Ultra)
  • The document retrieval and random matrices are
    exceptions
  • They perform very poorly in the baseline code

27
Cache Blocking: Web Document Matrix
  • This matrix seems to be an exception: cache blocking
    helps
  • Among the other matrices, only a random matrix of
    similar shape/size benefits (without other
    optimizations)

28
Cache Blocking: Searching for Block Size
29
Combination of Register and Cache blocking
  • The combination is rarely beneficial, and often
    slower than either of the two optimizations alone.

30
Combination of Register and Cache blocking
31
Cache Blocking Summary
  • On matrices from real problems, rarely useful.
  • Need to be
  • Random in structure
  • Rectangular
  • Can determine reasonable block size through
    coarse grained search.
  • Combination of register and cache blocking not
    useful.
  • Work on disjoint sets of matrices (as far as we
    can tell)

32
Exploiting Symmetry
  • Many matrices are symmetric (undirected graph)
  • CG, for example, only works on symmetric matrices
  • Symmetric storage: only store/access ½ of the matrix

[Figure: symmetric access to A with source vector x and destination vector y]

  • Update y twice for each stored element of A:
  • yi += Ai,j * xj
  • yj += Ai,j * xi
  • Saves up to ½ of the memory traffic (to A)
  • Works for sparse formats (a sketch follows)

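A sketch of a symmetric CSR kernel that stores only the upper triangle (including the diagonal); each stored off-diagonal entry updates y twice, as on the slide:

    /* Symmetric CSR multiply, y = y + A*x, storing only the upper triangle
       (including the diagonal). */
    void spmv_sym_csr(int nrows, const int *row_ptr, const int *col_idx,
                      const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_idx[k];
                double a = val[k];
                y[i] += a * x[j];
                if (j != i)              /* mirror the off-diagonal entry */
                    y[j] += a * x[i];
            }
        }
    }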
33
Symmetry Speedups
34
Multiple Vector Optimization
  • Better potential for reuse
  • Loop-unrolled code that multiplies across vectors
    is generated by a code generator.

[Figure: a single matrix element aij is reused against source entries x(j,1), x(j,2), ... to update destination entries y(i,1), y(i,2), ...]
  • Allows reuse of matrix elements.
  • Challenge: choosing the number of vectors for loop
    unrolling (a sketch follows)
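A sketch of the multiple-vector kernel for CSR, with k vectors stored as separate arrays; each matrix entry is loaded once and reused k times (unrolled variants of the inner loop are what the generator produces):

    /* CSR multiply against k vectors at once (Y = Y + A*X).  X[v] and Y[v]
       are the v-th source and destination vectors. */
    void spmv_csr_multivec(int nrows, int k, const int *row_ptr,
                           const int *col_idx, const double *val,
                           const double *const *X, double *const *Y)
    {
        for (int i = 0; i < nrows; i++) {
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
                int j = col_idx[p];
                double a = val[p];                 /* reused across all k vectors */
                for (int v = 0; v < k; v++)
                    Y[v][i] += a * X[v][j];
            }
        }
    }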

35
Multiple Vector Multiplication
  • Better chance of optimization: BLAS2 vs. BLAS3

Repetition of single-vector case
Multiple-vector case
  • Blocked algorithms (blocked CG) may use this

36
Multiple Vector Effect on Synthetic Matrices
  • matrix with random sparsity pattern
  • dense matrix in sparse format

37
Multiple Vectors with Register Blocking
  • The speedup is larger than for single-vector register
    blocking.
  • Even the performance of the matrices that did not
    speed up before improved (middle group on the UltraSPARC)

38
Multiple Vectors with Cache Blocking
MIPS
UltraSPARC
  • Noticeable speedup for matrices that did not
    speed up with cache blocking alone (UltraSPARC)
  • Block sizes are much smaller than those of
    single-vector cache blocking.

39
Sparsity System Organization
  • Guides the choice of optimization
  • Automatic selection of optimization parameters
    such as block size and number of vectors
  • http://www.cs.berkeley.edu/~yelick/sparsity
  • Released version does not include the optimization
    for symmetry

[Figure: Sparsity system organization: a representative matrix and the machine profile (produced by the Sparsity machine profiler), together with the maximum number of vectors, feed the Sparsity optimizer, which emits the optimized format and code]
40
Summary of Serial Optimizations
  • Sparse matrix optimization has same problems as
    dense, plus nonzero pattern
  • Optimization of format/code is effective, but
    expensive
  • Should be done across an application, not at each
    mvm call or even iterative solve
  • Optimization choice may depend on matrix
    characteristics rather than specifics
  • Optimizing over a larger kernel is also important
  • Heuristics plus search can be used to select
    optimization parameters
  • Future
  • Different code organizations, instruction mixes
  • Different views of A: A = A1 + A2
  • Different matrix orderings

41
Tuning other sparse operations
  • Symmetric matrix-vector multiply A*x
  • Solve a triangular system of equations: T^-1 * x
  • A^T*A*x
  • Kernel of Information Retrieval via LSI
  • Same number of memory references as A*x (a one-pass
    sketch follows)
  • A^2*x, A^k*x
  • Kernel of Information Retrieval used by Google
  • Changes the calling algorithm
  • A^T*M*A
  • Matrix triple product
  • Used in multigrid solvers
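A sketch of the one-pass A^T*A*x idea: each row a_i of A is used for both the dot product with x and the scaled update of y, so A is streamed through only once (CSR storage assumed; this is an illustration, not the tuned implementation):

    /* One-pass y = A^T*A*x for CSR A: for each row a_i, accumulate
       y += (a_i . x) * a_i. */
    void ata_times_x_csr(int nrows, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double t = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                t += val[k] * x[col_idx[k]];       /* t = a_i . x   (the A*x part) */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                y[col_idx[k]] += t * val[k];       /* y += t * a_i  (the A^T part) */
        }
    }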

42
Symmetric Sparse Matrix-Vector Multiply
P4 using icc
43
Sparse Triangular Solve
P4 using icc
44
Parallel Sparse Matrix-Vector Multiplication
45
Parallel Sparse Matrix-Vector Product
  • y = y + A*x, where A is a sparse matrix
  • Layout
  • 1D by rows
  • Algorithm
  • For each processor j
  • Broadcast X(j)
  • Compute A(p)*X(j)
  • Same algorithm works, but
  • May be unbalanced
  • Communicates more of X than needed

[Figure: 1D row layout, with x and y distributed across processors P0, P1, P2, P3]
46
Optimization Opportunities
  • Send only necessary parts of x (see the sketch after this list)
  • Overlap communication and computation
  • Load balance
  • Take advantage of symmetry
  • Trickier than serial case, because of multiple
    accesses to Y
  • Take advantage of blocked structure
  • For serial performance, using FEM matrices or
    similar
  • Take advantage of bands
  • Store only diagonal bands with starting
    coordinate
  • Reduces index storage and access, running time?
  • Reorder your matrix
  • Use blocked CG
  • Better serial performance; reduces the α (latency)
    term but not the β (bandwidth) term in communication
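A sketch of how the needed entries of x can be identified once, before the iterations start (a hypothetical helper, not Aztec's actual API):

    #include <stdlib.h>

    /* Scan the local column indices once and build the sorted list of
       remote x entries this rank actually needs; only those get communicated.
       Columns my_first_col..my_last_col (inclusive) are owned locally. */
    int build_recv_list(int n_global, int nnz_local, const int *col_idx,
                        int my_first_col, int my_last_col, int **recv_idx_out)
    {
        char *needed = calloc(n_global, 1);    /* mark touched remote columns */
        int count = 0;
        for (int k = 0; k < nnz_local; k++) {
            int j = col_idx[k];
            if ((j < my_first_col || j > my_last_col) && !needed[j]) {
                needed[j] = 1;
                count++;
            }
        }
        int *recv_idx = malloc(count * sizeof(int));
        int m = 0;
        for (int j = 0; j < n_global; j++)     /* emit in sorted order */
            if (needed[j]) recv_idx[m++] = j;
        free(needed);
        *recv_idx_out = recv_idx;
        return count;
    }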

47
Following slides are based on "Ordering for
Efficiency of Sparse CG" by X. Li, L. Oliker,
G. Heber, and R. Biswas
48
Partitioning and Ordering Schemes
  • RCM, METIS
  • Self-Avoiding Walk (SAW) Heber/Biswas/Gao 99
  • A SAW over a triangular mesh is an enumeration of
    the triangles such that two consecutive triangles
    in the SAW share an edge, or a vertex (there are
    no jumps).
  • It is a mesh-based linearization technique with
    similar application areas as space-filling
    curves.
  • Construction time is linear in the number of
    triangles.
  • Amenable to hierarchical coarsening and
    refinement, and can be easily parallelized.

49
Graph Partitioning Strategy: MeTiS
  • Most popular class of multilevel partitioners
  • Objectives
  • balance computational workload
  • minimize edge cut (interprocessor communication)
  • Algorithm
  • collapses vertices and edges using a heavy-edge
    matching scheme
  • applies a greedy partitioning algorithm to the
    coarsest graph
  • uncoarsens it back using greedy graph growing and
    Kernighan-Lin refinement

[Figure: multilevel partitioning: coarsening phase, initial partitioning, refinement phase]
50
Strategy: Reverse Cuthill-McKee (RCM)
  • Matrix bandwidth (profile) has a significant impact
    on the efficiency of linear system solvers and
    eigensolvers
  • Geometry-based algorithm that generates a
    permutation so that non-zero entries are close to
    diagonal
  • Good preordering for LU or Cholesky factorization
    (reduces fill)
  • Improves cache performance (but does not
    explicitly reduce edge cut)
  • Can be used as partitioner by assigning segments
    to processors
  • Starting from a vertex of minimum degree, generates a
    level structure by BFS and orders vertices by
    decreasing distance from the original vertex

51
Strategy: Self-Avoiding Walks (SAW)
  • Mesh-based technique similar to space-filling
    curves
  • Two consecutive triangles in walk share edge or
    vertex (no jumps)
  • Visits each triangle exactly once,
    entering/exiting over edge/vertex
  • Improves parallel efficiency related to locality
    (cache reuse) and load balancing, but does not
    explicitly reduce edge cuts
  • Amenable to hierarchical coarsening and refinement

  • Biswas, Oliker, Li, Concurrency: Practice and
    Experience, 12 (2000) 85-109

52
Experiment
  • Mesh generation
  • 2D Delaunay triangulation, generated by Triangle
    [Shewchuk 96].
  • The mesh is shaped like the letter A
  • 661,054 vertices, 1,313,099 triangles
  • Matrix generation
  • For each vertex pair (vi, vj) with distance < 3,
    we assign a random value in (0, 1) to A(i, j).
  • The diagonal is chosen to make the matrix diagonally
    dominant.
  • About 39 nonzeros per row, total of 25,753,034
    nonzeros.
  • Implementations
  • The MPI version uses Aztec.
  • The Origin 2000 and Tera MTA versions use compiler
    directives.

53
Sparse CG Timings
54
Sparse Matvec Cache Misses
55
Sparse Matvec Communication Volume
56
Observations
  • On T3E and Origin, ordering methods significantly
    improve the CG performance.
  • On T3E, matrix-vector efficiency depends more on
    cache reuse than communication reduction.
  • RCM, SAW, METIS: > 93% of total time servicing
    cache misses
  • ORIG: 72-80% of total time servicing cache misses
  • On the Origin, SMP directives can be used, but they
    require data distribution
  • On the Tera MTA, no ordering or data distribution
    is required; the program is the simplest

57
Summary
  • Lots of opportunities for optimization
  • Data structures
  • Local performance
  • Communication performance
  • Numerical techniques (preconditioning)
  • Take advantage of special structures
  • OK to focus on a class of matrices

58
Test Problem
  • 2D Delaunay triangulation of a letter-A-shaped
    mesh generated using the Triangle package (661,054
    vertices, 1,313,099 triangles)
  • Underlying matrix assembled by assigning a random
    value in (0,1) to each (i,j) entry corresponding
    to a vertex pair (vi,vj) with dist(vi,vj) < 4 (other
    entries set to 0)
  • Diagonal entries set to 40 (to make the matrix
    positive definite)
  • Final matrix has approx. 39 entries/row and
    25,753,034 nonzeros
  • CG converges in 13 iterations; SPMV accounts for
    87% of the flops

59
Distributed-Memory Implementation
  • Each processor has local memory that only it can
    directly access; message passing is required to
    access the memory of another processor
  • User decides how to distribute data and organize
    the comm structure
  • Allows efficient code design at the cost of
    higher complexity
  • Parallel CG calls the Aztec sparse linear library
  • Matrix A is partitioned into blocks of rows; each
    block goes to a processor
  • Associated component vectors (x, b) distributed
    accordingly
  • Comm needed to transfer some components of x
  • Data from
  • T3E (450 MHz Alpha processor, 900 Mflops peak, 96
    KB secondary cache, 3D torus interconnect)

60
Distributed-Memory Implementation
  • ORIG ordering has large edge cut (interprocessor
    comm) and poor locality (high number of cache
    misses)
  • MeTiS minimizes edge cut, while SAW minimizes
    cache misses

61
Distributed Shared-Memory System
  • Origin2000 (SMP of nodes, each with dual 250 MHz
    R10000 processors and local memory)
  • Hardware makes all memory equally accessible from
    software perspective by sending memory requests
    through routers on nodes
  • Memory access time non-uniform (depends on
    hops)
  • Hypercube interconnect (maximum log(p) hops)
  • Each processor has 4 MB cache where only it can
    fetch/store data
  • If processor accesses data not in cache, delay
    while copy fetched
  • When processor modifies word, all other copies of
    that cache line invalidated

62
Distributed Shared-Memory Implementation
  • Use OpenMP-style directives to parallelize loops
  • requires less effort than MPI
  • Two implementation approaches taken
  • SHMEM: assume that the Origin2000 has flat
    shared memory (arrays not explicitly distributed,
    non-local data handled by cache coherence)
  • CC-NUMA: account for the underlying distributed
    shared memory by explicit data distribution
  • Computational kernels are identical for SHMEM and
    CC-NUMA (a minimal sketch follows)
  • Each processor is assigned an equal block of rows
    of the matrix
  • No explicit synchronization required since there are
    no concurrent writes
  • Global reduction required for the DOT operation
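A minimal sketch of the directive-based kernel, written here with OpenMP syntax; each iteration writes a distinct y[i], so no synchronization beyond the parallel loop is needed:

    #include <omp.h>

    /* Directive-parallelized CSR multiply: one parallel loop over rows. */
    void spmv_csr_parallel(int nrows, const int *row_ptr, const int *col_idx,
                           const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nrows; i++) {
            double sum = y[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }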

63
Distributed Shared-Memory Implementation
  • CC-NUMA performs significantly better than SHMEM
  • RCM and SAW reduce runtimes compared to ORIG (more
    so for CC-NUMA)
  • Little difference between RCM and SAW, probably due
    to the large cache
  • CC-NUMA (with ordering) and MPI runtimes are
    comparable, even though the programming methodologies
    are quite different

64
Tera MTA Multithreaded Architecture
  • The 255 MHz Tera uses multithreading to hide latency
    (100-150 cycles per word) and keep processors
    saturated with work
  • No data cache
  • Randomized memory mapping (data layout is
    impossible)
  • Near-uniform data access from any processor to
    any memory location
  • Each processor has 100 hardware streams (32
    registers and a PC each)
  • Context switch on each cycle, choose next
    instruction from ready streams
  • A stream can execute an instruction only once
    every 21 cycles, even if no instructions
    reference memory
  • Synchronization between threads accomplished
    using full/empty bits in memory, allowing
    fine-grained threads
  • No explicit load balancing required since dynamic
    scheduling of work to threads can keep processor
    saturated
  • No difference between uni- and multiprocessor
    parallelism

65
Tera MTA Implementation
  • Multithreading implementation is trivial (only
    requires compiler directives)
  • Special assertions used to indicate no
    loop-carried dependencies
  • Compiler then able to parallelize loop segments
  • Load balancing handled by OS (dynamically assigns
    rows of matrix to threads)
  • Other than reduction for DOT, no special
    synchronization constructs required (no possible
    race conditions in CG)
  • No special ordering required (or possible) to
    achieve good parallel performance

66
Tera MTA Results
  • Both SPMV and CG show high scalability (over 90%)
    with 60 streams per processor
  • Sufficient TLP to tolerate the high overhead of
    memory access
  • 8-proc Tera is faster than the 32-proc Origin2000 and
    16-proc T3E with no partitioning / ordering (but
    will scaling continue beyond 8 procs?)
  • Adaptivity: no extra work required to maintain
    performance