1
Sparse Matrix Techniques (Tutorial)
  • X. Sherry Li
  • Lawrence Berkeley National Lab
  • Math 290 / CS 298, UCB
  • Jan. 31, 2007

2
Outline
  • Part I
    • Computer representations of sparse matrices
    • Sparse matrix-vector multiply with various
      storage formats
    • Performance optimizations
  • Part II
    • Techniques for sparse factorizations
      (e.g., SuperLU solver)

3
Sparse Storage Schemes
  • Notation
    • N: dimension
    • NNZ: number of nonzeros
  • Assume arbitrary sparsity pattern
  • Triplets format (i, j, val) is not sufficient
    . . .
    • Storage: 2*NNZ integers, NNZ reals
    • Not easy to randomly access one row or column
  • Linked list format provides flexibility, but is
    not friendly to modern architectures . . .
    • Cannot call BLAS directly
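
To make the storage count and the row-access cost concrete, here is a minimal Python sketch of the triplets format. The 4x4 matrix, the array names (rows, cols, vals) and the helper get_row are hypothetical, chosen for illustration only, and indices are 0-based (the slides use 1-based Fortran indexing).

```python
# Triplet (coordinate) storage for a small hypothetical 4x4 matrix:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
# The triplets may appear in any order.
rows = [2, 0, 1, 3, 0, 2, 1, 2]                      # NNZ integers
cols = [1, 0, 1, 3, 3, 3, 0, 2]                      # NNZ integers
vals = [7.0, 10.0, 9.0, 5.0, 2.0, 7.0, 3.0, 8.0]     # NNZ reals


def get_row(i, rows, cols, vals):
    """Return row i as sorted (col, val) pairs.

    Every one of the NNZ triplets has to be inspected, which is why
    random access to a single row (or column) is expensive.
    """
    return sorted((c, v) for r, c, v in zip(rows, cols, vals) if r == i)


nnz = len(vals)
n_integers = len(rows) + len(cols)    # 2 * NNZ index entries
row2 = get_row(2, rows, cols, vals)   # the three nonzeros of row 2
```

The compressed formats on the following slides trade this flexibility for direct access to one row or one column.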

4
Compressed Row Storage (CRS)
  • Store nonzeros row by row contiguously
  • Example: N = 7, NNZ = 19
  • 3 arrays
  • Storage: NNZ reals, NNZ + N + 1 integers
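
The slide's 7x7 example did not survive in this transcript, so the sketch below builds the three CRS arrays for a smaller hypothetical 4x4 matrix instead. The array names rowptr, colind and nzval mirror the naming of the CCS pseudocode later in the deck but are an assumption for CRS, as is the helper triplets_to_crs. Indices are 0-based.

```python
def triplets_to_crs(n, rows, cols, vals):
    """Convert 0-based (row, col, val) triplets to CRS arrays."""
    rowptr = [0] * (n + 1)
    for r in rows:                      # count the nonzeros in each row
        rowptr[r + 1] += 1
    for i in range(n):                  # prefix sum: rowptr[i] = start of row i
        rowptr[i + 1] += rowptr[i]
    colind = [0] * len(vals)
    nzval = [0.0] * len(vals)
    next_slot = rowptr[:-1]             # copy: next free position in each row
    # Sorting the triplets keeps column indices ascending within each row.
    for r, c, v in sorted(zip(rows, cols, vals)):
        k = next_slot[r]
        colind[k], nzval[k] = c, v
        next_slot[r] += 1
    return rowptr, colind, nzval


# Hypothetical matrix:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
n = 4
rows = [2, 0, 1, 3, 0, 2, 1, 2]
cols = [1, 0, 1, 3, 3, 3, 0, 2]
vals = [7.0, 10.0, 9.0, 5.0, 2.0, 7.0, 3.0, 8.0]
rowptr, colind, nzval = triplets_to_crs(n, rows, cols, vals)
n_integers = len(colind) + len(rowptr)   # NNZ + N + 1
```

Row i occupies positions rowptr[i] through rowptr[i+1] - 1 of colind and nzval, so a whole row is reachable without scanning the other rows.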

5
SpMV (y = Ax) with CRS
  • Dot product
  • No locality for x
  • Vector length usually short
  • Memory-bound: 3 reads, 2 flops
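
The loop for this slide is not preserved in the transcript. A minimal Python sketch of the row-wise (dot-product) form follows; the function name spmv_crs, the array names and the test matrix are assumptions, and indices are 0-based. Reading the "3 reads" as nzval, colind and x is an interpretation, not something the slide spells out.

```python
def spmv_crs(rowptr, colind, nzval, x):
    """y = A*x with A in CRS: one sparse dot product per row."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            # Per nonzero: read nzval[k], colind[k] and x[colind[k]],
            # then one multiply and one add. The access to x is indirect,
            # hence no guaranteed locality.
            s += nzval[k] * x[colind[k]]
        y[i] = s
    return y


# Hypothetical matrix:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
rowptr = [0, 2, 4, 7, 8]
colind = [0, 3, 0, 1, 1, 2, 3, 3]
nzval = [10.0, 2.0, 3.0, 9.0, 7.0, 8.0, 7.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_crs(rowptr, colind, nzval, x)
```

The inner loop length is the number of nonzeros in one row, which is the short "vector length" the slide refers to.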

6
Compressed Column Storage (CCS)
  • Also known as Harwell-Boeing format
  • Store nonzeros columnwise contiguously
  • 3 arrays
  • Storage: NNZ reals, NNZ + N + 1 integers

7
SpMV (y = Ax) with CCS
  • SAXPY
  • No locality for y
  • Vector length usually short
  • Memory-bound: 3 reads, 1 write, 2 flops

y(i) = 0.0, i = 1:N
do j = 1, N                              . . . column j of A
   t = x(j)
   do i = colptr(j), colptr(j+1) - 1
      y(rowind(i)) = y(rowind(i)) + nzval(i) * t
   enddo
enddo

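
For readers who want to run the column-wise loop, here is a 0-based Python transcription of the pseudocode above. The array names colptr, rowind and nzval come from the slide; the function name spmv_ccs and the 4x4 test matrix are assumptions made for illustration.

```python
def spmv_ccs(colptr, rowind, nzval, x):
    """y = A*x with A in CCS: one scaled column update (SAXPY) per column."""
    n = len(colptr) - 1          # assumes a square matrix
    y = [0.0] * n
    for j in range(n):
        t = x[j]                 # x is read once per column
        for k in range(colptr[j], colptr[j + 1]):
            # Scattered read-modify-write on y: no locality for y.
            y[rowind[k]] += nzval[k] * t
    return y


# Hypothetical matrix, stored column by column:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
colptr = [0, 2, 4, 5, 8]
rowind = [0, 1, 1, 2, 2, 0, 2, 3]
nzval = [10.0, 3.0, 9.0, 7.0, 8.0, 2.0, 7.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_ccs(colptr, rowind, nzval, x)
```
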
8
Jagged Diagonal Storage (JDS)
  • Also known as ITPACK or Ellpack storage [Saad,
    Kincaid et al.]
  • Force all rows to have the same length as the
    longest row, then columns are stored
    contiguously
  • 2 arrays: nzval(N,L) and colind(N,L), where L =
    max row length
  • Storage: N*L reals, N*L integers
  • Usually L << N

9
SpMV with JDS
  • Neither dot nor SAXPY
  • Good for vector processors: long vector length (N)
  • Extra memory and flops for padded zeros,
    especially bad if row lengths vary a lot

y(i) = 0.0, i = 1:N
do j = 1, L
   do i = 1, N
      y(i) = y(i) + nzval(i, j) * x(colind(i, j))
   enddo
enddo

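
The padded layout and the loop above can be tried out with the following Python sketch. The helper names crs_to_padded and spmv_padded, the choice to pad with value 0.0 and column index 0, and the test matrix are assumptions. Python nested lists are row-major, so the stride-1 access that Fortran's column-major nzval(N,L) gives the inner loop is not reproduced here; only the loop order is.

```python
def crs_to_padded(rowptr, colind, nzval):
    """Pad every row to the length L of the longest row."""
    n = len(rowptr) - 1
    L = max(rowptr[i + 1] - rowptr[i] for i in range(n))
    pval = [[0.0] * L for _ in range(n)]   # padding value: 0.0
    pcol = [[0] * L for _ in range(n)]     # padding column: 0 (any valid index works)
    for i in range(n):
        for j, k in enumerate(range(rowptr[i], rowptr[i + 1])):
            pval[i][j] = nzval[k]
            pcol[i][j] = colind[k]
    return pval, pcol, L


def spmv_padded(pval, pcol, L, x):
    """y = A*x: outer loop over the L padded columns, inner loop of length N."""
    n = len(pval)
    y = [0.0] * n
    for j in range(L):
        for i in range(n):                 # long inner loop, suits vector hardware
            y[i] += pval[i][j] * x[pcol[i][j]]
    return y


# Hypothetical matrix with row lengths 2, 2, 3, 1:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
rowptr = [0, 2, 4, 7, 8]
colind = [0, 3, 0, 1, 1, 2, 3, 3]
nzval = [10.0, 2.0, 3.0, 9.0, 7.0, 8.0, 7.0, 5.0]
pval, pcol, L = crs_to_padded(rowptr, colind, nzval)
y = spmv_padded(pval, pcol, L, [1.0, 2.0, 3.0, 4.0])
stored = len(pval) * L                     # N*L = 12 slots for 8 nonzeros
```

Even in this tiny case the padding costs 1.5x the storage and flops of CRS; the overhead grows when one row is much longer than the rest.
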
10
Segmented-Sum [Blelloch et al.]
  • Data structure is an augmented form of CRS
  • Computational structure is similar to JDS
  • Each row is treated as a segment in a long vector
  • Underlined elements denote the beginning of each
    segment (i.e., a row in A)
  • Dimension: S * L ≈ NNZ, where L is chosen to
    approximate the hardware vector length

11
SpMV with Segmented-Sum
  • 2 arrays: nzval(S, L) and colind(S, L), where
    S * L ≈ NNZ
  • NNZ reals, NNZ integers
  • Good for vector processors
  • SpMV is performed bottom-up, with each row-sum
    (dot) of Ax stored in the beginning of each
    segment
  • Similar to JDS, but with more control logic in
    the inner loop

do i = S, 1
   do j = 1, L
      . . .
   enddo
enddo

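
The body of the loop is elided on the slide and is not reconstructed here. As a rough illustration of the idea only, the Python sketch below applies a segmented sum to one long 1-D vector of products, sweeping from the bottom up so that each row sum ends up at the start of its segment. It is serial and does not use the S-by-L layout or vector operations; the function name spmv_segsum, the flag array and the test matrix are assumptions.

```python
def spmv_segsum(rowptr, colind, nzval, x):
    """y = A*x via a serial segmented sum over all NNZ products."""
    n = len(rowptr) - 1
    nnz = len(nzval)
    # flag[k] marks the first element of each segment (row).
    flag = [False] * nnz
    for i in range(n):
        if rowptr[i] < rowptr[i + 1]:
            flag[rowptr[i]] = True
    # Step 1: all products, independent of row boundaries.
    prod = [nzval[k] * x[colind[k]] for k in range(nnz)]
    # Step 2: bottom-up sweep. An element that does not start a segment
    # is folded into its predecessor, so sums collect at segment starts.
    for k in range(nnz - 1, 0, -1):
        if not flag[k]:
            prod[k - 1] += prod[k]
    # Step 3: read each row sum from the start of its segment.
    return [prod[rowptr[i]] if rowptr[i] < rowptr[i + 1] else 0.0
            for i in range(n)]


# Hypothetical matrix:
#   [10  0  0  2]
#   [ 3  9  0  0]
#   [ 0  7  8  7]
#   [ 0  0  0  5]
rowptr = [0, 2, 4, 7, 8]
colind = [0, 3, 0, 1, 1, 2, 3, 3]
nzval = [10.0, 2.0, 3.0, 9.0, 7.0, 8.0, 7.0, 5.0]
y = spmv_segsum(rowptr, colind, nzval, [1.0, 2.0, 3.0, 4.0])
```

The per-element test on flag is the extra control logic in the inner loop that the slide mentions; unlike the padded format, no zeros are stored.
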
12
Performance (megaflop rate) [Gaeke et al.]
  • Test matrix: N = 10000, NNZ = 177782, random
    pattern
  • 18 nonzeros per row on average
  • JDS does 4.6x more operations

Machine           Ultra 2i     Pentium 4    VIRAM
Clock rate        333 MHz      1.5 GHz      200 MHz
Peak flop rate    667 Mflops   1.5 Gflops   1.6 Gflops
CRS               29           209          110
JDS (effective)   27 (6)       17 (4)       632 (137)
Seg-Sum           5            29           165

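
The JDS row lists two numbers per machine: 27 and 6, 17 and 4, 632 and 137. One reading that fits the figures, offered as an assumption rather than something the slide states, is that the second number is the raw rate divided by the 4.6x extra work spent on padded zeros. The short check below uses hypothetical names (EXTRA_WORK, raw, effective).

```python
# Raw JDS rates from the table, in Mflops (first number of each pair).
raw = {"Ultra 2i": 27, "Pentium 4": 17, "VIRAM": 632}

# The slide reports that JDS performs 4.6x more operations than needed.
EXTRA_WORK = 4.6

# Assumed definition: effective rate = raw rate / extra-work factor,
# i.e. only flops on true nonzeros are counted as useful.
effective = {machine: round(rate / EXTRA_WORK) for machine, rate in raw.items()}
```

Under this reading the rounded results match the second number of each pair, which would put effective JDS throughput below CRS on the two cache-based machines and above it on VIRAM.
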
13
Optimization Techniques
  • Matrix reordering
    • For CRS SpMV, can improve x-vector locality by
      reducing the bandwidth of matrix A
    • Example: reverse Cuthill-McKee (breadth-first
      search)
    • Observed 2-3x improvement [Toledo et al.]
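
To show how a breadth-first ordering shrinks the bandwidth, here is a small pure-Python sketch of reverse Cuthill-McKee on the adjacency graph of a symmetric sparsity pattern. It is a textbook-style simplification, not the code used in the cited work: the start vertex is simply a minimum-degree vertex, and the function names and the 5-vertex example graph are assumptions.

```python
from collections import deque


def bandwidth(adj, order):
    """Largest |p(u) - p(v)| over all edges, where p is the position in order."""
    pos = {v: p for p, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def reverse_cuthill_mckee(adj):
    """Breadth-first ordering, low-degree neighbours first, then reversed."""
    visited = set()
    order = []
    # Start each connected component from a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u], key=lambda w: (len(adj[w]), w)):
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return order[::-1]


# Hypothetical pattern: a chain 0-3-1-4-2 whose labels are scrambled,
# so the nonzeros sit far from the diagonal in the natural ordering.
adj = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}
natural = list(range(5))
rcm = reverse_cuthill_mckee(adj)
bw_before = bandwidth(adj, natural)
bw_after = bandwidth(adj, rcm)
```

After the permutation, the column indices in each row of A are close to the row index, so consecutive rows touch nearby entries of x.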

14
Optimization Techniques
  • Register blocking
    • Find dense blocks of size r-by-c in A
      (if needed, allow some zeros to be filled in)
    • Ax is processed block by block
    • Keep c elements of x and r elements of y in
      registers
    • Each x element is re-used r times, each y
      element c times
    • Amount of indexed load and store is reduced
    • Obtained up to 2.5x improvement [Vuduc et al.]
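
A minimal sketch of the idea for fixed 2-by-2 blocks (r = c = 2) in a block-CRS layout follows. Python has no registers, so local variables stand in for them, and the example shows the access pattern rather than any speedup. The names spmv_bcsr_2x2, browptr, bcolind and bval, and the test matrix, are assumptions.

```python
def spmv_bcsr_2x2(browptr, bcolind, bval, x):
    """y = A*x with A stored as 2x2 blocks; each block is row-major [a, b, c, d]."""
    nb = len(browptr) - 1                # number of block rows
    y = [0.0] * (2 * nb)
    for I in range(nb):
        y0 = y1 = 0.0                    # r = 2 elements of y kept in locals
        for k in range(browptr[I], browptr[I + 1]):
            j = 2 * bcolind[k]           # one index load per block, not per nonzero
            x0, x1 = x[j], x[j + 1]      # c = 2 elements of x, each used r = 2 times
            b = bval[k]
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * I], y[2 * I + 1] = y0, y1  # y is written once per block row
    return y


# Hypothetical matrix with 9 nonzeros:
#   [1 2 0 0]
#   [3 4 0 0]
#   [0 0 5 6]
#   [7 0 8 9]
# Block (1, 0) holds a single nonzero, so three explicit zeros are filled in.
browptr = [0, 1, 3]
bcolind = [0, 0, 1]
bval = [[1.0, 2.0, 3.0, 4.0],
        [0.0, 0.0, 7.0, 0.0],
        [5.0, 6.0, 8.0, 9.0]]
y = spmv_bcsr_2x2(browptr, bcolind, bval, [1.0, 2.0, 3.0, 4.0])
stored_values = sum(len(b) for b in bval)   # 12 values for 9 nonzeros
```

Here 3 column indices replace 9, at the cost of 3 stored zeros; whether that trade pays off depends on how well r-by-c blocks fit the actual sparsity pattern.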

15
SPARSITY [Im, Yelick]

16
Performance Improvement [Vuduc et al.]

17
Other Representations
  • Block entry formats (e.g., multiple degrees of
    freedom are associated with a single physical
    location)
    • Constant block size (BCRS)
    • Varying block sizes (VBCRS)
  • Skyline (or profile) storage (SKS)
    • Lower triangle stored row by row
    • Upper triangle stored column by column
    • In each row (column), the first nonzero defines
      a profile
    • All entries within the profile (some may be
      zeros) are stored
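
As an illustration of the skyline rule for the lower triangle, the Python sketch below stores each row from its first nonzero up to the diagonal, including any zeros inside that profile. The upper triangle would be handled the same way column by column and is omitted. The function names lower_skyline and get_lower, the array names vals and ptr, and the test matrix are assumptions.

```python
def lower_skyline(A):
    """Store row i of the lower triangle from its first nonzero to the diagonal."""
    n = len(A)
    vals, ptr = [], [0]
    for i in range(n):
        # Column of the first nonzero in row i; fall back to the diagonal
        # if the row has no nonzero at or below the diagonal.
        first = next((j for j in range(i + 1) if A[i][j] != 0.0), i)
        vals.extend(A[i][first:i + 1])   # zeros inside the profile are kept
        ptr.append(len(vals))
    return vals, ptr


def get_lower(vals, ptr, i, j):
    """Entry (i, j) for j <= i; anything left of the profile is zero."""
    length = ptr[i + 1] - ptr[i]
    first = i - length + 1
    return vals[ptr[i] + (j - first)] if j >= first else 0.0


# Hypothetical lower triangle (the upper triangle is ignored here):
A = [[4.0, 0.0, 0.0, 0.0],
     [1.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 6.0, 0.0],
     [2.0, 0.0, 3.0, 7.0]]
vals, ptr = lower_skyline(A)
```

No column indices are needed: the row pointer alone fixes where each profile starts, which is why the stored zeros are accepted.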

18
References
  • Templates for the Solution of Linear Systems,
    Barrett et al., SIAM, 1994
  • BeBOP: http://bebop.cs.berkeley.edu/
  • Sparse BLAS standard:
    http://www.netlib.org/blas/blast-forum