1
Parallelism and Locality in Matrix Computations
www.cs.berkeley.edu/~demmel/cs267_Spr09
Dense Linear Algebra: Optimizing Sequential Matrix Multiply
  • James Demmel
  • EECS & Math Departments, UC Berkeley
  • demmel@cs.berkeley.edu

2
Outline of Dense Linear Algebra
  • A brief history of linear algebra software
  • Optimizing Sequential Matrix Multiplication
    (MatMul)
  • Optimizing Parallel MatMul
  • Optimizing Gaussian Elimination
  • Lower bounds on communication (moving data)

3
A brief history of (Dense) Linear Algebra software (1/6)
  • In the beginning was the do-loop
  • Libraries like EISPACK (for eigenvalue problems)
  • Then the BLAS (1) were invented (1973-1977)
  • Standard library of 15 operations (mostly) on vectors
  • AXPY ( y = α·x + y ), dot product, scale ( x = α·x ), etc.
  • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  • Goals
  • Common pattern to ease programming, readability, self-documentation
  • Robustness, via careful coding (avoiding over/underflow)
  • Portability + Efficiency via machine-specific implementations
  • Why "BLAS 1"? They do O(n^1) ops on O(n^1) data
  • Used in libraries like LINPACK (for linear systems)
  • Source of the name "LINPACK Benchmark" (not the code!)

4
A brief history of (Dense) Linear Algebra software (2/6)
  • But the BLAS-1 weren't enough
  • Consider AXPY ( y = α·x + y ): 2n flops on 3n reads/writes
  • Computational intensity = #flops / #mem_refs = (2n)/(3n) = 2/3
    (see the C sketch below)
  • Too low to run near peak speed (time for mem_refs dominates)
  • Hard to vectorize ("SIMDize") on supercomputers of the day (1980s)
  • So the BLAS-2 were invented (1984-1986)
  • Standard library of 25 operations (mostly) on matrix/vector pairs
  • GEMV: y = α·A·x + β·y, GER: A = A + α·x·y^T, TRSV: y = T^(-1)·x
  • Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
  • Why "BLAS 2"? They do O(n^2) ops on O(n^2) data
  • So computational intensity still just ≈ (2n^2)/(n^2) = 2
  • OK for vector machines, but not for machines with caches
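
To make the intensity argument concrete, here is a minimal C sketch of a reference-style AXPY (the function name daxpy_ref and the loop itself are illustrative, not taken from the netlib sources); however it is compiled, it performs 2n flops against roughly 3n slow-memory references:

    /* y = a*x + y: each iteration does 2 flops (multiply, add) but
       3 memory references (load x[i], load y[i], store y[i]),
       so computational intensity q = 2n / 3n = 2/3. */
    void daxpy_ref(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }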

5
A brief history of (Dense) Linear Algebra software (3/6)
  • The next step: BLAS-3 (1987-1988)
  • Standard library of 9 operations (mostly) on matrix/matrix pairs
  • GEMM: C = α·A·B + β·C, SYRK: C = α·A·A^T + β·C, TRSM: C = T^(-1)·B
  • Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
  • Why "BLAS 3"? They do O(n^3) ops on O(n^2) data
  • So computational intensity = (2n^3)/(4n^2) = n/2: big at last!
  • Tuning opportunities: machines with caches, other memory hierarchy levels
  • How much BLAS1/2/3 code so far (all at www.netlib.org/blas)?
  • Source: 142 routines, 31K LOC; Testing: 28K LOC
  • Reference (unoptimized) implementation only
  • Ex: 3 nested loops for GEMM
  • Lots more optimized code
  • Most computer vendors provide their own optimized versions
  • Motivates automatic tuning of the BLAS (details later)

6
A brief history of (Dense) Linear Algebra software (4/6)
  • LAPACK: "Linear Algebra PACKage" - uses BLAS-3 (1989 - now)
  • Ex: the obvious way to express Gaussian Elimination (GE) is adding
    multiples of one row to other rows - BLAS-1
  • How do we reorganize GE to use BLAS-3? (details later)
  • Contents of LAPACK (summary)
  • Algorithms we can turn into (nearly) 100% BLAS 3
  • Linear Systems: solve Ax = b for x
  • Least Squares: choose x to minimize ||r||_2 = (Σ r_i^2)^(1/2) where r = A·x - b
  • Algorithms we can only make up to ~50% BLAS 3 (so far)
  • Eigenproblems: find λ and x where A·x = λ·x
  • Singular Value Decomposition (SVD): A^T·A·x = σ^2·x
  • Error bounds for everything
  • Lots of variants depending on A's structure (banded, A = A^T, etc.)
  • How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)
  • Source: 1582 routines, 490K LOC; Testing: 352K LOC
  • Ongoing development (at UCB, UTK and elsewhere)

7
What could go into a linear algebra library?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate!
8
A brief history of (Dense) Linear Algebra software (5/6)
  • Is LAPACK parallel?
  • Only if the BLAS are parallel (possible in shared memory)
  • ScaLAPACK = "Scalable LAPACK" (1995 - now)
  • For distributed memory - uses MPI
  • More complex data structures and algorithms than LAPACK
  • Only a (small) subset of LAPACK's functionality available
  • Details later (class projects!)
  • All at www.netlib.org/scalapack

9
Success Stories for Sca/LAPACK
  • Widely used
  • Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, ...
  • >100M web hits (in 2009; 56M in 2006) @ Netlib (incl. CLAPACK, LAPACK95)
  • New science discovered through the solution of dense matrix systems
  • Nature article on the flat universe used ScaLAPACK
  • Other articles in Physical Review B also use it
  • 1998 Gordon Bell Prize
  • www.nersc.gov/news/reports/newNERSCresults050703.pdf

Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
10
A brief history/future of (Dense) Linear Algebra software (6/6)
  • Versions for GPUs - some algorithms change significantly
  • PLASMA - extensions to multicore
  • Communication-minimizing Linear Algebra
  • Lower bounds on data movement for all direct linear algebra
  • LU, QR, eig, SVD; dense & sparse; sequential & parallel
  • New algorithms that attain these lower bounds
  • Can one software infrastructure accommodate all algorithms and
    platforms of current (and future) interest?
  • How much code generation and tuning can we automate?
  • MAGMA - planned PLASMA extension to multicore/GPU/heterogeneous
  • FLAME (www.cs.utexas.edu/users/flame/)
  • Related projects
  • BLAST Forum (www.netlib.org/blas/blast-forum): BLAS extensions
  • Other languages, some new functions, sparse matrices, extra precision
  • Extra-precision BLAS in latest LAPACK release

11
Organizing Linear Algebra in books
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
12
Outline of Dense Linear Algebra
  • A brief history of linear algebra software
  • Optimizing Sequential Matrix Multiplication
    (MatMul)
  • Simple performance models to understand
    performance
  • Warm-up Matrix-vector multiplication
  • Reducing memory traffic for MatMul
  • Lower bound on memory traffic for MatMul
  • Other variants on MatMul
  • Automatic vs Hand-Tuning of Algorithms
  • Optimizing Parallel MatMul
  • Optimizing Gaussian Elimination
  • Lower bounds on communication (moving data)

13
Why Matrix Multiplication?
  • An important kernel in many problems
  • Used in many linear algebra algorithms
  • Closely related to other algorithms, e.g.,
    transitive closure on a graph using
    Floyd-Warshall
  • Optimization ideas can be used in other problems
  • The best case for optimization payoffs
  • The most-studied algorithm in high performance
    computing

14
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun
Ultra-1/170, peak 330 MFlops
15
Note on Matrix Storage
  • A matrix is (usually) a 2-D array of elements, but memory addresses are 1-D
  • Conventions for matrix layout
  • by column, or "column major" (Fortran default): A(i,j) at A + i + j*n
  • by row, or "row major" (C default): A(i,j) at A + i*n + j
  • recursive (later)
  • Column major (for now); see the addressing sketch below

Column major matrix in memory (each column is contiguous):
     0   5  10  15
     1   6  11  16
     2   7  12  17
     3   8  13  18
     4   9  14  19

Row major (each row is contiguous):
     0   1   2   3
     4   5   6   7
     8   9  10  11
    12  13  14  15
    16  17  18  19

Blue row of matrix is stored in red cachelines.
Figure source: Larry Carter, UCSD
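
As a small illustration of the two conventions (a hedged sketch; the macro names are made up here, and A is assumed to be a flat array of n*n elements):

    #include <stddef.h>

    /* Column major (Fortran default): A(i,j) lives at A + i + j*n,
       so walking down a column touches consecutive addresses. */
    #define A_COLMAJOR(A, i, j, n) ((A)[(size_t)(i) + (size_t)(j) * (size_t)(n)])

    /* Row major (C default): A(i,j) lives at A + i*n + j,
       so walking across a row touches consecutive addresses. */
    #define A_ROWMAJOR(A, i, j, n) ((A)[(size_t)(i) * (size_t)(n) + (size_t)(j)])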
16
Using a Simple Model of Memory to Optimize
  • Assume just 2 levels in the memory hierarchy, fast and slow
  • All data initially in slow memory
  • m = number of memory elements (words) moved between fast and slow memory
  • tm = time per slow memory operation
  • f = number of arithmetic operations
  • tf = time per arithmetic operation, tf << tm
  • q = f / m = average number of flops per slow memory access
  • Minimum possible time = f * tf when all data in fast memory
  • Actual time
  • f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
  • Larger q means time closer to the minimum f * tf
  • q ≥ tm/tf needed to get at least half of peak speed
    (then m * tm ≤ f * tf, so the actual time is at most 2 * f * tf)

17
Warm up: Matrix-vector multiplication
  • implements y = y + A*x
  • for i = 1:n
  •   for j = 1:n
  •     y(i) = y(i) + A(i,j)*x(j)

[Figure: y(i) = y(i) + A(i,:) * x(:)]
18
Warm up: Matrix-vector multiplication
  • read x(1:n) into fast memory
  • read y(1:n) into fast memory
  • for i = 1:n
  •   read row i of A into fast memory
  •   for j = 1:n
  •     y(i) = y(i) + A(i,j)*x(j)
  • write y(1:n) back to slow memory
  • m = number of slow memory refs = 3n + n^2
  • f = number of arithmetic operations = 2n^2
  • q = f / m ≈ 2
  • Matrix-vector multiplication is limited by slow memory speed
    (a C sketch appears below)
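
A possible C rendering of this loop nest, assuming column-major storage as above (an illustrative sketch, not the DGEMV reference code); note that the inner j-loop then walks across row i of A with stride n, the access pattern the analysis charges n^2 slow-memory reads for:

    #include <stddef.h>

    /* y = y + A*x for an n-by-n column-major A (A(i,j) at A[i + j*n]). */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i + (size_t)j * n] * x[j];
    }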

19
Modeling Matrix-Vector Multiplication
  • Compute time for an n x n = 1000 x 1000 matrix
  • Time
  • = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
  • = 2n^2 * tf * (1 + (tm/tf) * (1/2))
  • For tf and tm, using data from R. Vuduc's PhD (pp. 351-3)
  • http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
  • For tm use minimum-memory-latency / words-per-cache-line

[Table omitted: per-machine values of tm/tf, the "machine balance" (q must be
at least this to reach half of peak speed, but matrix-vector multiply has q ≈ 2)]
20
Validating the Model
  • How well does the model predict actual performance?
  • Actual DGEMV: the most highly optimized code for the platform
  • The model is sufficient to compare across machines
  • But it under-predicts on the most recent ones, due to the latency estimate

21
Naïve Matrix Multiply
  • implements C = C + A*B
  • for i = 1 to n
  •   for j = 1 to n
  •     for k = 1 to n
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)

Algorithm has 2n^3 = O(n^3) flops and operates on 3n^2 words of memory,
so q is potentially as large as 2n^3 / 3n^2 = O(n).
(A C version is sketched below.)
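
A direct C transcription of the three nested loops (column-major storage assumed, as earlier; a sketch of the unblocked algorithm, not tuned code):

    #include <stddef.h>

    /* C = C + A*B for n-by-n column-major matrices:
       2n^3 flops touching 3n^2 words of data. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double cij = C[i + (size_t)j * n];
                for (int k = 0; k < n; k++)
                    cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                C[i + (size_t)j * n] = cij;
            }
    }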
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
22
Naïve Matrix Multiply
  • implements C = C + A*B
  • for i = 1 to n
  •   read row i of A into fast memory
  •   for j = 1 to n
  •     read C(i,j) into fast memory
  •     read column j of B into fast memory
  •     for k = 1 to n
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)
  •     write C(i,j) back to slow memory

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
23
Naïve Matrix Multiply
  • Number of slow memory references on unblocked matrix multiply
  • m =   n^3    to read each column of B n times
  •     + n^2    to read each row of A once
  •     + 2n^2   to read and write each element of C once
  •     = n^3 + 3n^2
  • So q = f / m = 2n^3 / (n^3 + 3n^2)
  •   ≈ 2 for large n: no improvement over matrix-vector multiply
  • Inner two loops are just matrix-vector multiply, of row i of A times B
  • Similar for any other order of the 3 loops

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
24
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun
Ultra-1/170, peak 330 MFlops
25
Naïve Matrix Multiply on IBM RS/6000
Size 12000 would take 1095 years
T = O(N^4.7)
Size 2000 took 5 days
O(N^3) performance would have constant cycles/flop;
measured performance looks like O(N^4.7)
Slide source: Larry Carter, UCSD
26
Naïve Matrix Multiply on RS/6000

[Figure annotations: page miss every iteration; TLB miss every iteration;
cache miss every 16 iterations; page miss every 512 iterations.
Slide source: Larry Carter, UCSD]
27
From Naïve to Blocked (or Tiled) Matrix Multiply
  • C(i,j) = C(i,j) + A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,n)*B(n,j)
  •        = C(i,j) + A(i,1:n)*B(1:n,j)
  • True if C(i,j) is a scalar (1x1 submatrix)
  • A(i,1:n) is a row of A and B(1:n,j) is a column of B
  • Operation is a dot product
  • Also true if C(i,j) is a b x b submatrix
  • A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
  • C(i,j) = C(i,j) + A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,N)*B(N,j) where
  • All factors are b x b subblocks
  • N = n/b = number of b x b blocks

28
Blocked (Tiled) Matrix Multiply
  • Consider A, B, C to be N-by-N matrices of b-by-b subblocks,
    where b = n / N is called the block size
  • for i = 1 to N
  •   for j = 1 to N
  •     read block C(i,j) into fast memory
  •     for k = 1 to N
  •       read block A(i,k) into fast memory
  •       read block B(k,j) into fast memory
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)   ... do a matrix multiply on blocks
  •     write block C(i,j) back to slow memory
    (a C sketch appears below)
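
A hedged C sketch of the same blocked loop nest (assuming column-major storage and that n is a multiple of the block size b; a real implementation would also handle fringe blocks and pack blocks into contiguous buffers):

    #include <stddef.h>

    /* Blocked C = C + A*B: each (ib,jb,kb) step multiplies one b-by-b
       block of A by one b-by-b block of B into a b-by-b block of C. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ib = 0; ib < n; ib += b)
            for (int jb = 0; jb < n; jb += b)
                for (int kb = 0; kb < n; kb += b)
                    /* b-by-b matrix multiply on the current blocks */
                    for (int i = ib; i < ib + b; i++)
                        for (int j = jb; j < jb + b; j++) {
                            double cij = C[i + (size_t)j * n];
                            for (int k = kb; k < kb + b; k++)
                                cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                            C[i + (size_t)j * n] = cij;
                        }
    }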

[Figure: block C(i,j) = block C(i,j) + A(i,k) * B(k,j), a block-times-block update]
29
Blocked (Tiled) Matrix Multiply
  • Recall
  • m is the amount of memory traffic between slow and fast memory
  • the matrix has n x n elements, and N x N blocks each of size b x b
  • f is the number of floating point operations, 2n^3 for this problem
  • q = f / m is our measure of algorithm efficiency in the memory system
  • So:

m =   N*n^2   read each block of B N^3 times (N^3 * b^2 = N^3 * (n/N)^2 = N*n^2)
    + N*n^2   read each block of A N^3 times
    + 2n^2    read and write each block of C once
  = (2N + 2) * n^2

So computational intensity q = f / m = 2n^3 / ((2N + 2) * n^2)
                             ≈ n / N = b   for large n

So we can improve performance by increasing the blocksize b.
Can be much faster than matrix-vector multiply (q = 2).
30
How large can we make the computational intensity of matmul?
  • The blocked algorithm has computational intensity q ≈ b
  • The larger the block size, the more efficient our algorithm will be
  • Limit: all three blocks from A, B, C must fit in fast memory (cache),
    so we cannot make these blocks arbitrarily large
  • Assume your fast memory has size Mfast
  • 3b^2 ≤ Mfast, so q ≈ b ≤ (Mfast/3)^(1/2)
  • To build a machine to run matrix multiply at 1/2 the peak arithmetic
    speed of the machine, we need fast memory of size
  • Mfast ≥ 3b^2 ≈ 3q^2 = 3(tm/tf)^2
  • This size is reasonable for L1 cache, but not for register sets
  • Note: the analysis assumes it is possible to schedule the instructions
    perfectly (see the block-size sketch below)
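
Turning the 3b^2 ≤ Mfast constraint into code, one possible heuristic (purely illustrative; Mfast_words would come from the actual cache size in words):

    #include <math.h>

    /* Largest b such that three b-by-b blocks (tiles of A, B and C) fit
       in a fast memory of Mfast_words words: 3*b^2 <= Mfast_words. */
    int pick_block_size(long Mfast_words)
    {
        int b = (int)floor(sqrt((double)Mfast_words / 3.0));
        return b > 1 ? b : 1;
    }

For example, a 32 KB L1 cache holds 4096 doubles, giving b = 36 (3 * 36^2 = 3888 ≤ 4096).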

31
Limits to Optimizing Matrix Multiply
  • The blocked algorithm changes the order in which values are accumulated
    into each C(i,j), by applying commutativity and associativity
  • Get slightly different answers from the naïve code, because of roundoff - OK
  • We just showed that the blocked algorithm has computational intensity
  • q ≈ b ≤ (Mfast/3)^(1/2)
  • Theorem (Hong & Kung, 1981): Any reorganization of this algorithm
    (that uses only commutativity/associativity) is limited to
    q = O( (Mfast)^(1/2) )
  • Said another way,
    #words moved between fast/slow memory = Ω( n^3 / (Mfast)^(1/2) )
  • Proof of a more general result later
  • Can get a lower bound on the latency cost (#messages sent) by dividing
    the above lower bound by the maximum message size = Mfast
  • #messages for sequential MatMul = Ω( n^3 / (Mfast)^(3/2) )
  • Ex: disk accesses

32
BLAS speeds on an IBM RS6000/590
Peak speed = 266 Mflops

[Figure: measured speed of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2
(n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors), relative to
the 266 Mflop peak]
33
Dense Linear Algebra: BLAS2 vs. BLAS3
  • BLAS2 and BLAS3 have very different computational intensity,
    and therefore different performance

Data source: Jack Dongarra
34
Recursion: Cache Oblivious Algorithms
  • The tiled algorithm requires finding a good block size
  • Machine dependent
  • What if there are multiple levels of cache? Need to block the b x b
    matrix multiply in the innermost loop
  • 1 level of memory → 3 nested loops (naïve algorithm)
  • 2 levels of memory → 6 nested loops
  • 3 levels of memory → 9 nested loops
  • Cache Oblivious Algorithms offer an alternative
  • Treat the n x n matrix multiply as a set of smaller problems
  • Eventually, these will fit in cache
  • Will minimize #words moved between every level of the memory hierarchy
    (between L1 and L2 cache, L2 and L3, L3 and main memory, etc.),
    at least asymptotically

35
Recursive Matrix Multiplication (RMM) (1/2)
  • For simplicity: square matrices with n = 2^m
  • C = A * B, written in terms of n/2 x n/2 blocks:
      [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
      [ C21 C22 ] = [ A21 A22 ] * [ B21 B22 ]
  • True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return
36
Recursive Matrix Multiplication (2/2)

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 * A(n/2) + 4*(n/2)^2 if n > 1, else 1
     = 2n^3 ... same operations as usual, in a different order
M(n) = # words moved between fast and slow memory by RMM( . , . , n)
     = 8 * M(n/2) + 4*(n/2)^2 if 3n^2 > Mfast, else 3n^2
     = O( n^3 / (Mfast)^(1/2) + n^2 ) ... same as blocked matmul

(A C sketch of RMM appears below.)
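
A C sketch of RMM for column-major storage (assumptions: n is a power of two, lda is the leading dimension of the enclosing arrays so that submatrices can be addressed in place, and the routine accumulates C = C + A*B, so C should be zeroed first to get C = A*B; the base case n = 1 matches the slide, though in practice one stops at a larger micro-kernel size):

    #include <stddef.h>

    void rmm(int n, const double *A, const double *B, double *C, int lda)
    {
        if (n == 1) {               /* base case: scalar multiply-add */
            C[0] += A[0] * B[0];
            return;
        }
        int h = n / 2;              /* locate the four n/2-by-n/2 subblocks */
        const double *A11 = A, *A21 = A + h, *A12 = A + (size_t)h * lda, *A22 = A12 + h;
        const double *B11 = B, *B21 = B + h, *B12 = B + (size_t)h * lda, *B22 = B12 + h;
        double *C11 = C, *C21 = C + h, *C12 = C + (size_t)h * lda, *C22 = C12 + h;

        rmm(h, A11, B11, C11, lda);  rmm(h, A12, B21, C11, lda);   /* C11 */
        rmm(h, A11, B12, C12, lda);  rmm(h, A12, B22, C12, lda);   /* C12 */
        rmm(h, A21, B11, C21, lda);  rmm(h, A22, B21, C21, lda);   /* C21 */
        rmm(h, A21, B12, C22, lda);  rmm(h, A22, B22, C22, lda);   /* C22 */
    }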
37
Recursion: Cache Oblivious Algorithms
  • Recursion for general A (n x m) * B (m x p): split the largest dimension
  • Case 1: m > max{n,p}: split A vertically (its columns) and B horizontally (its rows)
  • Case 2: n > max{m,p}: split A horizontally
  • Case 3: p > max{m,n}: split B vertically
  • Attains the lower bound in the O() sense

[Figures on the slide illustrate the splits for Case 1 and Case 3]
38
Experience with Cache-Oblivious Algorithms
  • In practice, need to cut off the recursion well before 1x1 blocks
  • Call a "micro-kernel" for small blocks, e.g. 16 x 16
  • Implementing a high-performance cache-oblivious code is not easy
  • Using a fully recursive approach with a highly optimized recursive
    micro-kernel, Pingali et al. report that they never got more than
    2/3 of peak.
  • Issues with the cache-oblivious (recursive) approach
  • Recursive micro-kernels yield less performance than iterative ones
    using the same scheduling techniques
  • Pre-fetching is needed to compete with the best code: not
    well-understood in the context of cache-oblivious codes

Unpublished work, presented at LACSI 2006
39
Recursive Data Layouts
  • A related idea is to use a recursive structure for the matrix
  • Goal: any square submatrix (suitably aligned) should lie in
    contiguous memory
  • Improves locality with a machine-independent data structure
  • Can minimize latency with multiple levels of memory hierarchy
  • The figure on the slide shows Z-Morton ordering (a space-filling curve)
  • Other orderings possible, e.g. C-Morton
  • See papers on cache-oblivious algorithms and recursive layouts
  • Gustavson, Kågström, et al., SIAM Review, 2004
  • Advantages
  • the recursive layout works well for any cache size
  • Disadvantages
  • The index calculations to find A(i,j) are expensive
    (see the Morton-index sketch below)
  • Implementations switch to column-major for small sizes
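
To see why the index calculation is more expensive than the single multiply-add of i + j*n, here is one standard formulation of the Z-Morton offset for a 2^k-by-2^k matrix (a sketch, not taken from the cited papers): the bits of i and j are interleaved.

    #include <stdint.h>

    /* Z-Morton offset of element (i, j): interleave the bits of i and j. */
    uint64_t morton_index(uint32_t i, uint32_t j)
    {
        uint64_t z = 0;
        for (int bit = 0; bit < 32; bit++) {
            z |= (uint64_t)((i >> bit) & 1u) << (2 * bit);      /* even bits from i */
            z |= (uint64_t)((j >> bit) & 1u) << (2 * bit + 1);  /* odd bits from j  */
        }
        return z;
    }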

40
Strassen's Matrix Multiply
  • The traditional algorithm (with or without tiling) has O(n^3) flops
  • Strassen discovered an algorithm with asymptotically lower flop count
  • O(n^2.81)
  • Consider a 2x2 matrix multiply: it normally takes 8 multiplies and 4 adds
  • Strassen does it with 7 multiplies and 18 adds

Let M = [ m11 m12 ] = [ a11 a12 ] * [ b11 b12 ]
        [ m21 m22 ]   [ a21 a22 ]   [ b21 b22 ]

Let p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

Extends to n x n by divide & conquer
41
Strassen Matrix Multiplication (2/2)

func C = StrMM (A, B, n)
  if n = 1 (or small enough), C = A * B, else
    P1 = StrMM (A12 - A22, B21 + B22, n/2)
    P2 = StrMM (A11 + A22, B11 + B22, n/2)
    P3 = StrMM (A11 - A21, B11 + B12, n/2)
    P4 = StrMM (A11 + A12, B22, n/2)
    P5 = StrMM (A11, B12 - B22, n/2)
    P6 = StrMM (A22, B21 - B11, n/2)
    P7 = StrMM (A21 + A22, B11, n/2)
    C11 = P1 + P2 - P4 + P6
    C12 = P4 + P5
    C21 = P6 + P7
    C22 = P2 - P3 + P5 - P7
  return
42
Strassen (continued)
  • Asymptotically faster
  • Several times faster for large n in practice
  • Cross-over depends on machine
  • Available in some libraries (but see below)
  • Tuning Strassen's Matrix Multiplication for
    Memory Efficiency, M. S. Thottethodi, S.
    Chatterjee, and A. Lebeck, in Proceedings of
    Supercomputing '98
  • Caveats
  • Needs more memory than standard algorithm
  • Can be a little less accurate because of roundoff
    error
  • Forbidden by rules of Top500 list (so not widely
    used)

43
Other Fast Matrix Multiplication Algorithms
  • Current world's record is O(n^2.376...)
  • Coppersmith & Winograd, 1987
  • Why does the Hong/Kung theorem not apply?
  • Extension is an open problem
  • Possibility of an O(n^(2+ε)) algorithm!
  • Cohn, Umans, Kleinberg, 2003
  • Can show they all can be made numerically stable
  • Demmel, Dumitriu, Holtz, Kleinberg, 2007
  • Can do the rest of linear algebra (solve Ax = b, least squares,
    Ax = λx, etc.) as fast, and numerically stably
  • Demmel, Dumitriu, Holtz, 2008
  • Fast methods (besides Strassen) may need unrealistically large n
  • But can reuse the ideas to minimize communication
44
Tuning Code in Practice
  • Tuning code can be tedious
  • Lots of code variations to try besides blocking - details later
  • Machine hardware performance hard to predict
  • Compiler behavior hard to predict
  • Response: "Autotuning"
  • Let the computer generate a large set of possible code variations,
    and search them for the fastest ones
  • Field started with a CS267 homework assignment in the mid 1990s
  • PHiPAC, leading to ATLAS, incorporated in Matlab
  • We still use the same assignment
  • We (and others) are extending autotuning to other dwarfs / motifs
  • Still need to understand how to do it by hand
  • Not every code will have an autotuner
  • Need to know this if you want to build autotuners

45
Search Over Block Sizes
  • Performance models are useful for high-level algorithms
  • Help in developing a blocked algorithm
  • Models have not proven very useful for block size selection
  • Too complicated to be useful
  • See work by Sid Chatterjee for a detailed model
  • Too simple to be accurate
  • Multiple multidimensional arrays, virtual memory, etc.
  • Speed depends on matrix dimensions, details of the code, compiler,
    processor (a search sketch appears below)
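
In the spirit of the autotuners discussed later, block sizes are usually chosen by measuring; a minimal sketch (the candidate list and timing harness are illustrative, and matmul_blocked is the blocked routine sketched earlier):

    #include <time.h>

    void matmul_blocked(int n, int b, const double *A, const double *B, double *C);

    /* Time a few candidate block sizes and return the fastest one.
       Assumes n is a multiple of every candidate; C is scratch output. */
    int search_block_size(int n, const double *A, const double *B, double *C)
    {
        int candidates[] = { 16, 32, 48, 64, 96, 128 };
        int best_b = candidates[0];
        double best_t = 1e30;
        for (int c = 0; c < (int)(sizeof candidates / sizeof candidates[0]); c++) {
            clock_t t0 = clock();
            matmul_blocked(n, candidates[c], A, B, C);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (t < best_t) { best_t = t; best_b = candidates[c]; }
        }
        return best_b;
    }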

46
What the Search Space Can Look Like

[Figure: a 2-D slice of a 3-D register-tile search space; the axes are the
number of rows and the number of columns in the register block. The dark
blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak,
Sun cc v5.0 compiler)]
47
Tiling Alone Might Not Be Enough
  • Naïve and naïvely tiled code on Itanium 2
  • Searched all block sizes to find the best, b = 56
  • Starting point for the next homework
48
Example: Select a Matmul Implementation
49
Example: Support Vector Classification
50
ATLAS (DGEMM n = 500)
Source Jack Dongarra
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

51
Optimizing in Practice
  • Tiling for registers
  • loop unrolling, use of named "register" variables
  • Tiling for multiple levels of cache and TLB
  • Exploiting fine-grained parallelism in the processor
  • superscalar; pipelining
  • Complicated compiler interactions
  • Hard to do by hand (but you should try!)
  • Automatic optimization is an active research area
  • BeBOP: bebop.cs.berkeley.edu; FFTW, Spiral, ...
  • Older work
  • PHiPAC: www.icsi.berkeley.edu/~bilmes/phipac
  • in particular tr-98-035.ps.gz
  • ATLAS: www.netlib.org/atlas
  • "Performance Optimization of Numerically Intensive Codes", by
    Stefan Goedecker and Adolfy Hoisie, SIAM 2001.

52
Removing False Dependencies
  • Using local variables, reorder operations to remove false dependencies

a[i]   = b[i]   + c;
a[i+1] = b[i+1] * d;
        /* false read-after-write hazard between a[i] and b[i+1] */

float f1 = b[i];
float f2 = b[i+1];
a[i]   = f1 + c;
a[i+1] = f2 * d;

  • With some compilers, you can declare a and b unaliased
  • Done via "restrict" pointers, compiler flag, or pragma
    (see the C99 sketch below)
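
A C99 sketch of the "declare unaliased" option (restrict is standard C99; the function itself is made up for illustration):

    /* Promising the compiler that a and b never alias lets it hoist the
       loads of b[i] and b[i+1] above the stores to a[i] and a[i+1],
       without the local-variable rewrite shown above. */
    void update(float * restrict a, const float * restrict b,
                float c, float d, int n)
    {
        for (int i = 0; i + 1 < n; i += 2) {
            a[i]     = b[i]     + c;
            a[i + 1] = b[i + 1] * d;
        }
    }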

53
Exploit Multiple Registers
  • Reduce demands on memory bandwidth by pre-loading into local variables

while( ... ) {
    *res++ = filter[0]*signal[0]
           + filter[1]*signal[1]
           + filter[2]*signal[2];
    signal++;
}

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( ... ) {
    *res++ = f0*signal[0]
           + f1*signal[1]
           + f2*signal[2];
    signal++;
}

also: register float f0 = ...;

Example is a convolution
54
Minimize Pointer Updates
  • Replace pointer updates for strided memory addressing with constant
    array offsets

f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;

f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;

  • Pointer vs. array expression costs may differ
  • Some compilers do a better job at analyzing one than the other

55
Loop Unrolling
  • Expose instruction-level parallelism

float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res = f0*s0 + f1*s1 + f2*s2;
do {
    signal += 3;
    s0 = signal[0];
    res[0] = f0*s1 + f1*s2 + f2*s0;
    s1 = signal[1];
    res[1] = f0*s2 + f1*s0 + f2*s1;
    s2 = signal[2];
    res[2] = f0*s0 + f1*s1 + f2*s2;
    res += 3;
} while( ... );
56
Expose Independent Operations
  • Hide instruction latency
  • Use local variables to expose independent operations that can execute
    in parallel or in a pipelined fashion
  • Balance the instruction mix (what functional units are available?)

f1 = f5 * f9;
f2 = f6 + f10;
f3 = f7 * f11;
f4 = f8 + f12;
57
Copy optimization
  • Copy input operands or blocks
  • Reduce cache conflicts
  • Constant array offsets for fixed-size blocks
  • Expose page-level locality
  • Alternative: use different data structures from the start (if users
    are willing); a packing sketch appears below

Original matrix (numbers are addresses):
     0   4   8  12
     1   5   9  13
     2   6  10  14
     3   7  11  15

Reorganized into 2x2 blocks:
     0   2   8  10
     1   3   9  11
     4   6  12  14
     5   7  13  15
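
A hedged sketch of the copy step (illustrative only): before the inner loops of the blocked code, the current b-by-b block of column-major A is packed into a contiguous buffer, so the micro-kernel sees constant offsets and far fewer cache conflicts:

    #include <stddef.h>

    /* Pack the b-by-b block of A starting at (i0, j0) into a contiguous
       b*b buffer Ablk (both column-major, A has leading dimension n). */
    void pack_block(int n, int b, const double *A, int i0, int j0, double *Ablk)
    {
        for (int j = 0; j < b; j++)
            for (int i = 0; i < b; i++)
                Ablk[i + (size_t)j * b] = A[(i0 + i) + (size_t)(j0 + j) * n];
    }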
58
Summary of Performance Tuning
  • Performance programming on uniprocessors requires
  • understanding of the memory system: moving data is slower than arithmetic
  • understanding of fine-grained parallelism in the processor
  • Simple performance models can aid in understanding
  • Two ratios are key to efficiency (relative to peak)
  • computational intensity of the algorithm:
  • q = f/m = #floating point operations / #slow memory references
  • machine balance in the memory system:
  • tm/tf = time for slow memory reference / time for floating point operation
  • Want q > tm/tf to get half of machine peak
  • Blocking (tiling) is a basic approach to increase q
  • Techniques apply generally, but the details (e.g., block size) are
    architecture dependent
  • Similar techniques are possible on other data structures and algorithms

59
Extra slides
60
More Reading for Today
  • Sourcebook Chapter 3 (note that Chapters 2 and 3 cover the material of
    Lecture 2 and Lecture 3, but not in the same order)
  • "Performance Optimization of Numerically Intensive Codes", by Stefan
    Goedecker and Adolfy Hoisie, SIAM 2001.
  • Web pages for reference
  • BeBOP Homepage
  • ATLAS Homepage
  • BLAS (Basic Linear Algebra Subroutines): reference (unoptimized)
    implementations of the BLAS, with documentation
  • LAPACK (Linear Algebra PACKage): a standard linear algebra library
    optimized to use the BLAS effectively on uniprocessors and shared
    memory machines (software, documentation and reports)
  • ScaLAPACK (Scalable LAPACK): a parallel version of LAPACK for
    distributed memory machines (software, documentation and reports)
  • "Tuning Strassen's Matrix Multiplication for Memory Efficiency",
    Mithuna S. Thottethodi, Siddhartha Chatterjee, and Alvin R. Lebeck,
    in Proceedings of Supercomputing '98, November 1998
  • "Recursive Array Layouts and Fast Parallel Matrix Multiplication",
    Chatterjee et al., IEEE TPDS, November 2002

61
Questions You Should Be Able to Answer
  • What is the key to understanding algorithm efficiency in our simple
    memory model?
  • What is the key to understanding machine efficiency in our simple
    memory model?
  • What is tiling?
  • Why does blocked matrix multiply reduce the number of memory references?
  • What are the BLAS?
  • Why does loop unrolling improve uniprocessor performance?

62
Proof of Communication Lower Bound on C = A*B (1/6)
  • Proof from Irony/Tiskin/Toledo (2004)
  • We'll need it for the rest of linear algebra
  • Think of the instruction stream being executed
  • Looks like "... add, load, multiply, store, load, add, ..."
  • We want to count the number of loads and stores, given that we are
    multiplying n-by-n matrices C = A*B using the usual 2n^3 flops,
    possibly reordered assuming addition is commutative/associative
  • It actually isn't associative in floating point, but close enough
  • Assume that at most M words can be stored in fast memory
  • Outline
  • Break the instruction stream into segments, each containing M loads
    and stores
  • Somehow bound the maximum number of adds and multiplies that could be
    done in each segment, call it F
  • So F * #segments ≥ 2n^3, and #segments ≥ 2n^3 / F
  • So #loads + #stores = M * #segments ≥ M * 2n^3 / F

63
Proof of Communication Lower Bound on C = A*B (2/6)
  • Given a segment of the instruction stream with M loads & stores, how
    many adds & multiplies (F) can we do?
  • At most 2M entries of C, 2M entries of A and/or 2M entries of B can be
    accessed
  • Use geometry:
  • Represent the 2n^3 operations by an n x n x n cube
  • One n x n face represents A
  • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
  • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
  • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) * B(k,j)
64
Proof of Communication Lower Bound on C = A*B (3/6)

[Figure: the n x n x n cube of multiply-adds, with the C face (indexed by
i,j), the A face (indexed by i,k) and the B face (indexed by k,j); sample
labels include C(1,1), C(2,3), A(1,3), A(2,1), B(1,3), B(3,1).]
65
Proof of Communication Lower Bound on C = A*B (4/6)
  • Given a segment of the instruction stream with M loads & stores, how
    many adds & multiplies (F) can we do?
  • At most 2M entries of C, and/or of A and/or of B can be accessed
  • Use geometry:
  • Represent the 2n^3 operations by an n x n x n cube
  • One n x n face represents A
  • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
  • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
  • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) * B(k,j)
  • If we have at most 2M "A squares", 2M "B squares", and 2M "C squares"
    on the faces, how many cubes can we have?
66
Proof of Communication Lower Bound on C = A*B (5/6)

# cubes in a black box with side lengths x, y and z
  = volume of the black box
  = x*y*z
  = ( (#A's) * (#B's) * (#C's) )^(1/2)
  = ( (x*z) * (z*y) * (y*x) )^(1/2)
67
Proof of Communication Lower Bound on C = A*B (6/6)
  • Consider one segment of instructions with M loads and stores
  • There can be at most 2M entries of A, B, C available in one segment
  • Volume of the set of cubes representing possible multiply/adds
    ≤ (2M * 2M * 2M)^(1/2) = (2M)^(3/2) = F
  • #Segments ≥ 2n^3 / F
  • #Loads + #Stores = M * #Segments ≥ M * 2n^3 / F = n^3 / (2M)^(1/2)
  • Parallel case: apply the reasoning to one processor out of P
  • #Adds and Muls = 2n^3 / P (assuming load balanced)
  • M = n^2 / P (each processor gets an equal fraction of the matrix)
  • #Loads & Stores = #words communicated with other procs
    ≥ M * (2n^3 / P) / F = M * (2n^3 / P) / (2M)^(3/2) = n^2 / (2P)^(1/2)

68
Communication Lower Bound for Parallel MatMul
  • Apply the reasoning to one processor out of P
  • #Adds and Muls = 2n^3 / P
  • At least one processor must do this much work
  • M = n^2 / P
  • Assume each processor gets an equal fraction of the matrix
  • (Other assumptions possible)
  • #Loads & Stores = #words communicated with other procs
    ≥ M * (2n^3 / P) / F = M * (2n^3 / P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
  • Possible to communicate less than this
  • Need to increase M by having multiple copies of the matrices
  • "3D algorithm"
  • P^(1/3) copies of each matrix, so M = n^2 / P^(2/3)
  • Lower bound becomes Ω( n^2 / (2P)^(2/3) ),
  • O(P^(1/6)) times smaller than before - attainable
69
Lower bounds on latency (#messages)
  • The Hong/Kung and Irony/Tiskin/Toledo theorems give a lower bound on the
    amount of data communicated
  • = bandwidth cost
  • Can get a lower bound on the latency cost (number of messages sent) by
    dividing the above lower bound by the maximum message size
    = fast memory size M
  • #messages for sequential MatMul = Ω( n^3 / M^(3/2) )
  • Ex: disk accesses
  • #messages for parallel MatMul = Ω( (n^2 / P^(1/2)) / (n^2 / P) ) = Ω( P^(1/2) )

70
Communication lower bounds for all Linear Algebra
  • The Irony/Tiskin/Toledo proof can be extended
  • Proof does not depend on doing matrix multiplication
  • Out(i,j) = func( In1(i,k), In2(k,j) for k ∈ S, where S is some set )
  • Only need Out(i,j), etc. to be a one-to-one map from (i,j) to memory
  • Proof does not depend on the inputs being dense
  • S can be any set, depending on (i,j)
  • Proof does not assume inputs and outputs do not overlap
  • E.g., in Gaussian Elimination A = LU, L and U overwrite A
  • Applies to LU, Cholesky, solving triangular systems, ...
  • Dense or sparse matrices
  • Sequential or parallel algorithms
  • Lower bound on #words moved = Ω( #flops / M^(1/2) )
  • Lower bound on #messages = Ω( #flops / M^(3/2) )
  • Also extends to QR, eigenvalue algorithms, SVD
  • New proof: need to account for data that is created in fast memory,
    used, and discarded without being written to main memory
  • Same lower bounds

71
  • Review of material on performance of
    matrix multiplication so far
  • More about tuning matrix multiplication, other
    codes

72
Using a Simple Model of Memory to Optimize
  • Assume just 2 levels in the hierarchy, fast and slow
  • All data initially in slow memory
  • m = number of memory elements (words) moved between fast and slow memory
  • tm = time per slow memory operation
  • f = number of arithmetic operations
  • tf = time per arithmetic operation, tf << tm
  • q = f / m = average number of flops per slow memory access
  • Minimum possible time = f * tf when all data in fast memory
  • Actual time
  • f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
  • Larger q means time closer to the minimum f * tf
  • q ≥ tm/tf needed to get at least half of peak speed
  • Goal: reorganize the algorithm (change the order of operations) to
    maximize q (i.e., minimize m)

73
Naïve Matrix Multiply: no better than Matrix-vector multiply
  • implements C = C + A*B
  • for i = 1 to n
  •   read row i of A into fast memory         ... n reads, n^2 total
  •   for j = 1 to n
  •     read C(i,j) into fast memory           ... 1 read, n^2 total
  •     read column j of B into fast memory    ... n reads, n^3 total
  •     for k = 1 to n
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)    ... all data in fast memory
  •     write C(i,j) back to slow memory       ... 1 write, n^2 total

m = #slow memory accesses = n^3 + 3n^2,
so q = f/m = 2n^3 / ( n^3 + 3n^2 ) ≈ 2
74
From Naïve to Blocked Matrix Multiply
  • C(i,j) = C(i,j) + A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,n)*B(n,j)
  •        = C(i,j) + A(i,1:n)*B(1:n,j)
  • True if C(i,j) is a scalar (1x1 submatrix)
  • A(i,1:n) is a row of A and B(1:n,j) is a column of B
  • Operation is a dot product
  • Also true if C(i,j) is a b x b submatrix
  • A(i,1:n) is a b x n submatrix of A and B(1:n,j) is an n x b submatrix of B
  • C(i,j) = C(i,j) + A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,N)*B(N,j) where
  • All factors are b x b subblocks
  • N = n/b = number of b x b blocks

75
Blocked (Tiled) Matrix Multiply
  • Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where
  • b = n / N is called the block size
  • for i = 1 to N
  •   for j = 1 to N
  •     read block C(i,j) into fast memory     ... b^2 reads, N^2 * b^2 = n^2 total
  •     for k = 1 to N
  •       read block A(i,k) into fast memory   ... b^2 reads, N^3 * b^2 = n^3/b total
  •       read block B(k,j) into fast memory   ... b^2 reads, N^3 * b^2 = n^3/b total
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)    ... do a matrix multiply on blocks
  •                                                (all data in fast memory)
  •     write block C(i,j) back to slow memory ... b^2 writes, N^2 * b^2 = n^2 total

m = #slow memory accesses = 2n^3/b + 2n^2, so q = f/m ≈ b
76
Blocked (Tiled) Matrix Multiply (Review)
  • m = #slow memory accesses = 2n^3/b + 2n^2,
  • so q = f/m = 2n^3 / ( 2n^3/b + 2n^2 ) ≈ b for large n
  • To minimize m (i.e., maximize q) we want to make b as large as possible
  • 3b^2 ≤ Mfast, so that all three b x b submatrices C(i,j), A(i,k),
    B(k,j) fit
  • So q ≈ b ≤ (Mfast/3)^(1/2)
  • Theorem (Hong/Kung, 1981; Irony/Tiskin/Toledo, 2004): Any
    reorganization of this algorithm (that uses only associativity and
    commutativity of addition) is limited to q = O( (Mfast)^(1/2) )
  • Blocked matrix multiplication minimizes the number of slow memory
    accesses (up to a constant factor)
77
78
Current Records for Solving Dense Systems (11/2008)
www.netlib.org, click on Performance Database Server

Gigaflops:  Machine                           n=100      n=1000   Any n   Peak
            Roadrunner (LANL)                                     1.1M    1.5M
              (1.1 Petaflops, 130K cores, 2.5 MW, n=2.3M)
            NEC SX 8 (fastest in 2005)
              (8 proc, 2 GHz)                            75.1             128
              (1 proc, 2 GHz)                 2.2        15.0              16
            Palm Pilot III                    .00000169
              (1.69 Kiloflops)
79
For all linear algebra problems: Ex: LAPACK Table of Contents
  • Linear Systems
  • Least Squares
  • Overdetermined, underdetermined
  • Unconstrained, constrained, weighted
  • Eigenvalues and vectors of Symmetric Matrices
  • Standard (Ax = λx), Generalized (Ax = λBx)
  • Eigenvalues and vectors of Unsymmetric Matrices
  • Eigenvalues, Schur form, eigenvectors, invariant subspaces
  • Standard, Generalized
  • Singular Values and vectors (SVD)
  • Standard, Generalized
  • Level of detail
  • Simple Driver
  • Expert Drivers with error bounds, extra-precision, other options
  • Lower-level routines ("apply a certain kind of orthogonal transformation")
80
For all matrix/problem structures: Ex: LAPACK Table of Contents
  • SB symmetric, banded
  • SP symmetric, packed
  • ST symmetric, tridiagonal
  • SY symmetric
  • TB triangular, banded
  • TG triangular, pair
  • TB triangular, banded
  • TP triangular, packed
  • TR triangular
  • TZ trapezoidal
  • UN unitary
  • UP unitary packed
  • BD bidiagonal
  • GB general banded
  • GE general
  • GG general , pair
  • GT tridiagonal
  • HB Hermitian banded
  • HE Hermitian
  • HG upper Hessenberg, pair
  • HP Hermitian, packed
  • HS upper Hessenberg
  • OR (real) orthogonal
  • OP (real) orthogonal, packed
  • PB positive definite, banded
  • PO positive definite
  • PP positive definite, packed
  • PT positive definite, tridiagonal

87
Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)
What do commercial and CSE applications have in common?
88
Basic Linear Algebra Subroutines (BLAS)
  • Industry-standard interface (evolving)
  • www.netlib.org/blas, www.netlib.org/blas/blast-forum
  • Vendors and others supply optimized implementations
  • History
  • BLAS1 (1970s)
  • vector operations: dot product, saxpy (y = α*x + y), etc.
  • m = 2n, f = 2n, q ≈ 1 or less
  • BLAS2 (mid 1980s)
  • matrix-vector operations: matrix-vector multiply, etc.
  • m = n^2, f = 2n^2, q ≈ 2, less overhead
  • somewhat faster than BLAS1
  • BLAS3 (late 1980s)
  • matrix-matrix operations: matrix-matrix multiply, etc.
  • m ≥ 3n^2, f = O(n^3), so q = f/m can possibly be as large as n
    (or ≈ Mfast^(1/2)), so BLAS3 is potentially much faster than BLAS2
  • Good algorithms use BLAS3 when possible (LAPACK & ScaLAPACK)
  • More later in the course

89
Locality in Other Algorithms
  • The performance of any algorithm is limited by q = its computational
    intensity
  • In matrix multiply, we increase q by changing the computation order
  • increased temporal locality
  • For other algorithms and data structures, even hand-transformations
    are still an open problem
  • Lots of open problems

90
Simplifying Assumptions
  • What simplifying assumptions did we make in this analysis?
  • Ignored parallelism between memory and arithmetic within the processor
  • Sometimes drop the arithmetic term in this type of analysis
  • Assumed fast memory was large enough to hold three vectors
  • Reasonable if we are talking about any level of cache
  • Not if we are talking about registers (~32 words)
  • Assumed the cost of a fast memory access is 0
  • Reasonable if we are talking about registers
  • Not necessarily if we are talking about cache (1-2 cycles for L1)
  • Memory latency is constant
  • Could simplify even further by ignoring memory operations in the X and Y
    vectors
  • Mflop rate per element = 2 / (2*tf + tm)

91
Experiments on Search vs. Modeling
  • Study compares search (ATLAS) with optimization selection based on
    performance models
  • Ten architectures
  • Model did well on most cases
  • Better on UltraSparc
  • Worse on Itanium
  • Eliminating performance gaps: think globally, search locally
  • small performance gaps: local search
  • large performance gaps: refine the model
  • Substantial gap between ATLAS CGw/S and ATLAS Unleashed on some machines

Source: K. Pingali. Results from an IEEE '05 paper by K. Yotov, X. Li,
G. Ren, M. Garzarán, D. Padua, K. Pingali, and P. Stodghill.