Title: The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev
1The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev
- Jim Demmel
- UC Berkeley
- 21 June 2006
- PARA 06
2Outline
- Motivation for new Sca/LAPACK
- Challenges (or research opportunities)
- Goals of new Sca/LAPACK
- Highlights of progress
3Motivation
- LAPACK and ScaLAPACK are widely used
- Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, ...
- >60M web hits @ Netlib (incl. CLAPACK, LAPACK95)
4Impact (with NERSC, LBNL)
Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
5Motivation
- LAPACK and ScaLAPACK are widely used
- Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, ...
- >60M web hits @ Netlib (incl. CLAPACK, LAPACK95)
- Many ways to improve them, based on
- Own algorithmic research
- Enthusiastic participation of research community
- User/vendor survey
- Opportunities and demands of new architectures, programming languages
- New releases planned (NSF support)
6Participants
- UC Berkeley
- Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, ...
- U Tennessee, Knoxville
- Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
- Other Academic Institutions
- UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara
- TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research Institutions
- CERFACS, LBL
- Industrial Partners
- Cray, HP, Intel, MathWorks, NAG, SGI
7Challenges
- For all large scale computing, not just linear
algebra!
8Parallelism in the Top500
9Challenges
- For all large scale computing, not just linear
algebra!
- Example: your laptop
- 256 Threads/multicore chip by 2010
10Challenges
- For all large scale computing, not just linear
algebra!
- Example: your laptop
- 256 Threads/multicore chip by 2010
- Exponentially growing gaps between
- Floating point time << 1/Memory BW << Memory Latency
- Floating point time << 1/Network BW << Network Latency
11Challenges
- For all large scale computing, not just linear
algebra!
- Example: your laptop
- 256 Threads/multicore chip by 2010
- Exponentially growing gaps between
- Floating point time << 1/Memory BW << Memory Latency
- Floating point time << 1/Network BW << Network Latency
- Heterogeneity (performance and semantics)
- Asynchrony
- Unreliability
12What do users want?
- High performance, ease of use, ...
- Survey results at www.netlib.org/lapack-dev
- Small but interesting sample
- What matrix sizes do you care about?
- 1000s: 34%
- 10,000s: 26%
- 100,000s or 1Ms: 26%
- How many processors, on distributed memory?
- >10: 34%, >100: 31%, >1000: 19%
- Do you use more than double precision?
- Sometimes or frequently: 16%
- Would Automatic Memory Allocation help?
- Very useful: 72%, Not useful: 14%
13Goals of next Sca/LAPACK
- Better algorithms
- Faster, more accurate
- Expand contents
- More functions, more parallel implementations
- Automate performance tuning
- Improve ease of use
- Better software engineering
- Increased community involvement
14Goal 1 Better Algorithms
- Faster
- But provide usual accuracy, stability
- Or accurate for an important subclass
- More accurate
- But provide usual speed
- Or at any cost
15Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Up to 10x faster HQR
- Byers / Mathias / Braman
- Extensions to QZ
- Kågström / Kressner / Adlerborn (MS 19)
- Faster Hessenberg, tridiagonal, bidiagonal reductions
- van de Geijn / Quintana-Orti, Howell / Fulton, Bischof / Lang
- Novel Data Layouts
- Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30)
16Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Faster and more accurate than previous algorithms
- SIAM SIAG/LA Prize in 2006
- New sequential, first parallel versions out in 2006
17Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
18Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
19Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
20Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
21Flop Count Ratios of Eigensolvers(2.2 GHz
Opteron ACML)
22Run Time Ratios of Eigensolvers(2.2 GHz Opteron
ACML)
23MFlop Rates of Eigensolvers(2.2 GHz Opteron
ACML)
24Parallel Runtimes of Eigensolvers(2.4 GHz Xeon
Cluster Ethernet)
25Accuracy of Eigensolvers
$\| Q Q^T - I \| / (n \varepsilon)$
$\max_i \| T q_i - \lambda_i q_i \| / (n \varepsilon)$
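The two accuracy measures above (loss of orthogonality and largest residual, both scaled by n times machine epsilon) can be evaluated directly. The following is a minimal sketch, not the test harness behind the plots: it assumes the Frobenius norm for the first measure and the 2-norm for the second (the slide does not say which norms were used), and it uses the closed-form eigenpairs of the tridiagonal (1,2,1) matrix as a stand-in for the output of an eigensolver. All function names are illustrative.

```python
import math
import sys

def tridiag_121_eigenpairs(n):
    """Closed-form eigenpairs of the n x n tridiagonal (1,2,1) matrix."""
    lams, vecs = [], []
    scale = math.sqrt(2.0 / (n + 1))
    for k in range(1, n + 1):
        theta = k * math.pi / (n + 1)
        lams.append(2.0 + 2.0 * math.cos(theta))
        vecs.append([scale * math.sin(j * theta) for j in range(1, n + 1)])
    return lams, vecs

def orthogonality_loss(vecs):
    """||Q Q^T - I||_F / (n eps), where the eigenvectors are the columns of Q."""
    n = len(vecs)
    eps = sys.float_info.epsilon
    total = 0.0
    for i in range(n):
        for j in range(n):
            # (Q Q^T)[i][j] is the sum over eigenvectors q of q[i] * q[j]
            entry = sum(q[i] * q[j] for q in vecs) - (1.0 if i == j else 0.0)
            total += entry * entry
    return math.sqrt(total) / (n * eps)

def max_residual(lams, vecs):
    """max_i ||T q_i - lambda_i q_i||_2 / (n eps) for the (1,2,1) matrix T."""
    n = len(vecs)
    eps = sys.float_info.epsilon
    worst = 0.0
    for lam, q in zip(lams, vecs):
        sq = 0.0
        for j in range(n):
            left = q[j - 1] if j > 0 else 0.0
            right = q[j + 1] if j < n - 1 else 0.0
            r = left + 2.0 * q[j] + right - lam * q[j]
            sq += r * r
        worst = max(worst, math.sqrt(sq))
    return worst / (n * eps)

lams, vecs = tridiag_121_eigenpairs(8)
print("orthogonality:", orthogonality_loss(vecs))
print("residual:     ", max_residual(lams, vecs))
```

Values of modest size (rather than, say, 1e6 or more) indicate that the computed eigenvectors are orthogonal and the eigenpairs satisfy the eigen-equation to roughly working precision.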
26Accuracy of Eigensolvers Old vs New Grail
$\| Q Q^T - I \| / (n \varepsilon)$
$\max_i \| T q_i - \lambda_i q_i \| / (n \varepsilon)$
27Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Faster and more accurate than previous algorithms
- SIAM SIAG/LA Prize in 2006
- New sequential, first parallel versions out in 2006
- Both DC and MR are important
28Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Up to 10x faster HQR
- Byers / Mathias / Braman
- SIAM SIAG/LA Prize in 2003
- Sequential version out in 2006
- More on performance later
29Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Up to 10x faster HQR
- Byers / Mathias / Braman
- Extensions to QZ
- Kågström / Kressner / Adlerborn (MS 19)
- LAPACK Working Note (LAWN) 173
- On 26 real test matrices, speedups up to 14.7x, 4.4x average
30Comparison of ScaLAPACK QR and new parallel
multishift QZ
- Execution times in secs for 4096 x 4096 random problems
- Ax = sx and Ax = sBx,
- using processor grids including 1-16 processors.
- Note:
- work(QZ) > 2 * work(QR) but
- Time(// QZ) << Time(// QR)!!
- Times include cost for computing eigenvalues and transformation matrices.
Adlerborn-Kågström-Kressner, SIAM PP2006, also MS 19
31Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Up to 10x faster HQR
- Byers / Mathias / Braman
- Extensions to QZ
- Kågström / Kressner / Adlerborn (MS 19)
- Faster Hessenberg, tridiagonal, bidiagonal reductions
- van de Geijn / Quintana-Orti, Howell / Fulton, Bischof / Lang
- Full nonsymmetric eigenproblem, n=1500: 3.43x faster
- HQR: 5x faster, Reduction: 14% faster
- Bidiagonal Reduction (LAWN174), n=2000: 1.32x faster
- Sequential versions out in 2006
32Goal 1a Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem / SVD
- Parlett / Dhillon / Voemel / Marques / Willems (MS 19)
- Up to 10x faster HQR
- Byers / Mathias / Braman
- Extensions to QZ
- Kågström / Kressner / Adlerborn (MS 19)
- Faster Hessenberg, tridiagonal, bidiagonal reductions
- van de Geijn / Quintana-Orti, Howell / Fulton, Bischof / Lang
- Novel Data Layouts
- Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30)
- SIAM Review Article 2004
33Novel Data Layouts and Algorithms
- Still merges multiple elimination steps into a few BLAS 3 operations
- MS 15/23/30: Novel Data Formats
- Rectangular Packed Format: good speedups for packed storage of symmetric matrices
34Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- "Promise" the right answer for O(n^2) additional cost
- Jacobi-based SVD
- Faster than QR, can be arbitrarily more accurate
- Arbitrary precision versions of everything
- Using your favorite multiple precision package
35Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- Kahan, Riedy, Hida, Li
- "Promise" the right answer for O(n^2) additional cost
- Iterative refinement with extra-precise residuals
- Extra-precise BLAS needed (LAWN165)
36More Accurate: Solve Ax = b
[Figure: error of Conventional Gaussian Elimination; $\varepsilon = n^{1/2} 2^{-24}$]
37Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- "Promise" the right answer for O(n^2) additional cost
- Iterative refinement with extra-precise residuals
- Extra-precise BLAS needed (LAWN165)
- "Guarantees" based on condition number estimates
- Condition estimate $< 1/(n^{1/2} \varepsilon)$ $\Rightarrow$ reliable answer and tiny error bounds
- No bad bounds in 6.2M tests
- Can condition estimators "lie"?
38Can condition estimators lie?
- Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization
- Demmel/Diament/Malajovich (FCM2001)
- But what if matrix multiply costs O(n^2)?
- More later
39Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- "Promise" the right answer for O(n^2) additional cost
- Iterative refinement with extra-precise residuals
- Extra-precise BLAS needed (LAWN165)
- "Guarantees" based on condition number estimates
- Get tiny componentwise bounds too
- Each $x_i$ accurate
- Slightly different condition number
- Extends to Least Squares (Li)
- Release in 2006
40Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- "Promise" the right answer for O(n^2) additional cost
- Jacobi-based SVD
- Faster than QR, can be arbitrarily more accurate
- Drmac / Veselic (MS 3)
- LAWNS 169, 170
- Can be arbitrarily more accurate on tiny singular values
- Yet faster than QR iteration, within 2x of DC
41Goal 1b More Accurate Algorithms
- Iterative refinement for Ax = b, least squares
- "Promise" the right answer for O(n^2) additional cost
- Jacobi-based SVD
- Faster than QR, can be arbitrarily more accurate
- Arbitrary precision versions of everything
- Using your favorite multiple precision package
- Quad, Quad-double, ARPREC, MPFR, ...
- Using Fortran 95 modules
42Iterative Refinement for speed (MS 3)
- What if double precision much slower than single?
- Cell processor in Playstation 3
- 256 GFlops single, 25 GFlops double
- Pentium SSE2: single twice as fast as double
- Given Ax = b in double precision
- Factor in single, do refinement in double
- If $\kappa(A) < 1/\varepsilon_{single}$, runs at speed of single
- 1.9x speedup on Intel-based laptop
- Applies to many algorithms, if difference large
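The "factor in single, refine in double" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the LAPACK implementation: single precision is emulated by rounding every intermediate result through a 4-byte float with the standard `struct` module (so it shows the numerics, not the speed), the LU factorization uses plain partial pivoting, and the names `f32`, `lu_factor_single`, `lu_solve_single` and `refine_solve` are made up for this example.

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single (emulates single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_single(A):
    """LU with partial pivoting; every arithmetic result is rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular in single precision")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_single(LU, perm, b):
    """Forward and back substitution in (emulated) single precision."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def refine_solve(A, b, max_iter=20, tol=1e-14):
    """Solve Ax = b: O(n^3) factorization in single, O(n^2) refinement in double."""
    n = len(A)
    LU, perm = lu_factor_single(A)
    x = lu_solve_single(LU, perm, b)
    for _ in range(max_iter):
        # Residual r = b - A x is computed in double precision.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_single(LU, perm, r)      # correction from the single LU
        x = [x[i] + d[i] for i in range(n)]   # update kept in double
        if max(abs(v) for v in d) <= tol * max(abs(v) for v in x):
            break
    return x

A = [[4.1, 1.3, 2.2], [1.3, 5.7, 1.1], [2.2, 1.1, 6.9]]
x_true = [1.0, -2.0, 3.0]
b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
print(refine_solve(A, b))
```

The expensive step is done once in the fast precision; each refinement step only needs a matrix-vector product and two triangular solves, which is why the overall run time tracks the single precision factorization when the matrix is not too ill-conditioned.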
43Exploiting GPUs
- Numerous emerging co-processors
- Cell, SSE, Grape, GPU, "physics coprocessor," ...
- When can we exploit them?
- Little help if memory is bottleneck
- Various attempts to use GPUs for dense linear algebra
- Bisection on GPUs for symmetric tridiagonal eigenproblem
- Evaluate Count(x) = #(evals < x) for many x
- Very little memory traffic
- Speedups up to 100x (Volkov)
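Count(x) for a symmetric tridiagonal matrix can be evaluated with a short recurrence that touches only the diagonal and off-diagonal entries, which is why it needs so little memory traffic and why many values of x can be processed independently. The sketch below is sequential plain Python, not the GPU code referred to above; the zero-pivot safeguard is a simple assumption of this example, and the function names are illustrative.

```python
def count_below(d, e, x):
    """Number of eigenvalues below x of the symmetric tridiagonal matrix with
    diagonal d and off-diagonal e, via the signs of the pivots of T - x*I."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = -1e-300  # simple safeguard: treat an exactly zero pivot as negative
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 1, 2, ...) by bisection on Count(x)."""
    n = len(d)
    # Gershgorin bounds enclose all eigenvalues.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
              for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) >= k:
            hi = mid   # at least k eigenvalues lie at or below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# The (1,2,1) matrix of order 5.
d = [2.0] * 5
e = [1.0] * 4
print(count_below(d, e, 2.5))
print(kth_eigenvalue(d, e, 1))
```

Each call to `count_below` is independent of the others, so evaluating it for many shifts x at once is the natural unit of parallel work.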
44Goal 2 Expanded Content
- Make content of ScaLAPACK mirror LAPACK as much
as possible
45Missing Drivers in Sca/LAPACK
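Problem | Algorithm | LAPACK | ScaLAPACK
Linear Equations | LU | xGESV | PxGESV
Linear Equations | Cholesky | xPOSV | PxPOSV
Linear Equations | LDL^T | xSYSV | missing
Least Squares (LS) | QR | xGELS | PxGELS
Least Squares (LS) | QR+pivot | xGELSY | missing
Least Squares (LS) | SVD/QR | xGELSS | missing
Least Squares (LS) | SVD/DC | xGELSD | missing (ok?)
Least Squares (LS) | SVD/MRRR | missing | missing
Generalized LS | LS + equality constr. | xGGLSE | missing
Generalized LS | Generalized LM | xGGGLM | missing
Generalized LS | Above + Iterative ref. | missing | missing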
46More missing drivers
Problem | Algorithm | LAPACK | ScaLAPACK
Symmetric EVD | QR / Bisection+Invit | xSYEV / X | PxSYEV / X
Symmetric EVD | DC | xSYEVD | PxSYEVD
Symmetric EVD | MRRR | xSYEVR | missing
Nonsymmetric EVD | Schur form | xGEES / X | missing (driver)
Nonsymmetric EVD | Vectors too | xGEEV / X | missing
SVD | QR | xGESVD | PxGESVD
SVD | DC | xGESDD | missing (ok?)
SVD | MRRR | missing | missing
SVD | Jacobi | missing | missing
Generalized Symmetric EVD | QR / Bisection+Invit | xSYGV / X | PxSYGV / X
Generalized Symmetric EVD | DC | xSYGVD | missing (ok?)
Generalized Symmetric EVD | MRRR | missing | missing
Generalized Nonsymmetric EVD | Schur form | xGGES / X | missing
Generalized Nonsymmetric EVD | Vectors too | xGGEV / X | missing
Generalized SVD | Kogbetliantz | xGGSVD | missing (ok)
Generalized SVD | MRRR | missing | missing
47Goal 2 Expanded Content
- Make content of ScaLAPACK mirror LAPACK as much as possible
- New functions (highlights)
- Updating / downdating of factorizations
- Stewart, Langou
- More generalized SVDs
- Bai, Wang, Drmac (MS 3)
- More generalized Sylvester/Lyapunov eqns
- Kågström, Jonsson, Granat (MS 19)
- Structured eigenproblems
- Selected matrix polynomials
- Mehrmann, Higham, Tisseur
- O(n^2) version of roots(p)
- Gu, Chandrasekaran, Bindel et al
48New algorithm for roots(p)
- To find roots of polynomial p
- Roots(p) does eig(C(p))
- Costs O(n^3), stable, reliable
- O(n^2) Alternatives
- Newton, Jenkins-Traub, Laguerre, ...
- Stable? Reliable?
- New: Exploit "semiseparable" structure of C(p)
- Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration
- Complexity drops from O(n^3) to O(n^2), stable in practice
- Related work: Van Barel (MS 3), Gemignani, Bini, Pan, et al
- Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
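To make "Roots(p) does eig(C(p))" concrete, the sketch below builds a companion matrix C(p) for a monic polynomial and checks that each root r is an eigenvalue, with eigenvector (r^(n-1), ..., r, 1). It deliberately does not run an eigensolver, and it is not the new O(n^2) algorithm; the layout (negated coefficients in the first row, ones on the subdiagonal) is one common convention, chosen here as an assumption, and the helper names are illustrative.

```python
def companion(coeffs):
    """Companion matrix of the monic polynomial
    p(x) = x^n + coeffs[0]*x^(n-1) + ... + coeffs[n-1]."""
    n = len(coeffs)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        C[0][j] = -float(coeffs[j])   # first row: negated coefficients
    for i in range(1, n):
        C[i][i - 1] = 1.0             # ones on the subdiagonal
    return C

def eigen_residual(C, r):
    """max |(C v - r v)_i| for v = (r^(n-1), ..., r, 1)."""
    n = len(C)
    v = [r ** (n - 1 - i) for i in range(n)]
    Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
    return max(abs(Cv[i] - r * v[i]) for i in range(n))

# p(x) = (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
C = companion([-6.0, 11.0, -6.0])
for root in (1.0, 2.0, 3.0):
    print(root, eigen_residual(C, root))
```

Everything above the subdiagonal of C(p) is zero except the first row, which is the kind of low-rank structure in the upper triangle that the bullet on semiseparable structure refers to.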
49Goal 2 Expanded Content
- Make content of ScaLAPACK mirror LAPACK as much as possible
- New functions (highlights)
- Updating / downdating of factorizations
- Stewart, Langou
- More generalized SVDs
- Bai, Wang, Drmac (MS 3)
- More generalized Sylvester/Lyapunov eqns
- Kågström, Jonsson, Granat (MS 19)
- Structured eigenproblems
- Selected matrix polynomials
- Mehrmann, Higham, Tisseur
- O(n^2) version of roots(p)
- Gu, Chandrasekaran, Bindel et al
- How should we prioritize missing functions?
50Goal 3 Automate Performance Tuning
- Widely used in performance tuning of Kernels
- ATLAS (PhiPAC) BLAS - www.netlib.org/atlas
- FFTW Fast Fourier Transform www.fftw.org
- Spiral signal processing - www.spiral.net
- OSKI Sparse BLAS bebop.cs.berkeley.edu/oski
- Integrated into PETSc
51Optimizing blocksizes for mat-mul
Finding a Needle in a Haystack So Automate
52Goal 3 Automate Performance Tuning
- Widely used in performance tuning of Kernels
- 1300 calls to ILAENV() to get block sizes, etc.
- Never been systematically tuned
- Extend automatic tuning techniques of ATLAS, etc. to these other parameters
- Automation important as architectures evolve
- Convert ScaLAPACK data layouts on the fly
- Important for ease-of-use too
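The basic empirical-tuning loop (try each candidate value of a parameter, time the kernel, keep the fastest) can be sketched as follows. This is only a toy in plain Python: real tuners such as ATLAS search a much larger space and generate code, and the kernel, candidate list and function names here are assumptions made for illustration.

```python
import time

def blocked_matmul(A, B, bs):
    """C = A*B for square lists-of-lists, processed in bs x bs blocks."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(kk, min(kk + bs, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + bs, n)):
                            Ci[j] += a * Bk[j]
    return C

def tune_block_size(n, candidates, repeats=3):
    """Time the kernel for each candidate block size; return (best, timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_matmul(A, B, bs)
            best = min(best, time.perf_counter() - t0)  # keep the fastest run
        timings[bs] = best
    return min(timings, key=timings.get), timings

best, timings = tune_block_size(24, [2, 4, 8, 24])
print("fastest block size on this machine:", best)
```

The winner depends on the machine and on the problem size, which is the argument for automating the search rather than hard-coding the answer.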
53ScaLAPACK Data Layouts
1D Cyclic
1D Block
2D Block Cyclic
1D Block Cyclic
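The layouts named above differ only in how a global index is mapped to an owning process and a local index. A minimal sketch of the block-cyclic map follows, assuming 0-based indices and that the first block lives on process 0; the function names are illustrative and are not ScaLAPACK routines. 1D cyclic is the special case of block size 1, and 1D block is the case where the block size is large enough that each process gets a single block.

```python
def owner_1d(g, nb, nprocs):
    """Block-cyclic map of 0-based global index g with block size nb over
    nprocs processes: returns (owning process, local index on that process)."""
    block = g // nb                       # which block the index falls in
    proc = block % nprocs                 # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb
    return proc, local

def owner_2d(i, j, mb, nb, prow, pcol):
    """2D block-cyclic map of entry (i, j) on a prow x pcol process grid."""
    pr, li = owner_1d(i, mb, prow)
    pc, lj = owner_1d(j, nb, pcol)
    return (pr, pc), (li, lj)

# 8 indices, 2 processes, under three 1D layouts.
print([owner_1d(g, 1, 2)[0] for g in range(8)])   # 1D cyclic
print([owner_1d(g, 4, 2)[0] for g in range(8)])   # 1D block
print([owner_1d(g, 2, 2)[0] for g in range(8)])   # 1D block cyclic
print(owner_2d(5, 3, 2, 2, 2, 2))                 # 2D block cyclic
```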
54Speedups for using 2D processor grid range from 2x to 8x
Cost of redistributing from 1D to best 2D layout: 1% - 10%
Times obtained on 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect, 2GB Memory
55Fast Matrix Multiplication (1) (Cohn, Kleinberg,
Szegedy, Umans)
- Can think of fast convolution of polynomials p, q as
- Map p (resp. q) into group algebra $\sum_i p_i z^i \in \mathbb{C}[G]$ of cyclic group $G = \{ z^i \}$
- Multiply elements of $\mathbb{C}[G]$ (use divide & conquer = FFT)
- Extract coefficients
- For matrix multiply, need non-abelian group satisfying triple product property
- There are subsets X, Y, Z of G where $xyz = 1$ with $x \in X$, $y \in Y$, $z \in Z$ $\Rightarrow$ $x = y = z = 1$
- Map matrix A into group algebra via $\sum_{xy} A_{xy} \, x^{-1} y$, B into $\sum_{y'z} B_{y'z} \, y'^{-1} z$.
- Since $x^{-1} y \, y'^{-1} z = x^{-1} z$ iff $y = y'$ we get $\sum_y A_{xy} B_{yz} = (AB)_{xz}$
- Search for fast algorithms reduced to search for groups with certain properties
- Fastest algorithm so far is $O(n^{2.38})$, same as Coppersmith/Winograd
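The polynomial case in the first bullets (embed the coefficients in the group algebra of a cyclic group, multiply there with an FFT, extract the coefficients) can be sketched as below. This covers only the abelian warm-up, not the matrix-multiplication construction; it is a textbook radix-2 FFT written for clarity, and the function names are illustrative.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (len(a) must be a power of two); unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def poly_multiply(p, q):
    """Coefficients of p*q (lowest degree first) via cyclic convolution."""
    m = len(p) + len(q) - 1
    size = 1
    while size < m:          # cyclic group large enough that nothing wraps around
        size *= 2
    fp = fft([complex(c) for c in p] + [0j] * (size - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (size - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [(v / size).real for v in prod[:m]]

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print([round(c, 9) for c in poly_multiply([1, 2, 3], [4, 5])])
```

The FFT steps are where roundoff enters; the embedding and the extraction of coefficients are exact.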
56Fast Matrix Multiplication (2)(Cohn, Kleinberg,
Szegedy, Umans)
- Embed A, B in group algebra (exact)
- Perform FFT (roundoff)
- Reorganize results into new matrices (exact)
- Multiply new matrices recursively (roundoff)
- Reorganize results into new matrices (exact)
- Perform IFFT (roundoff)
- Extract C = AB from group algebra (exact)
57Fast Matrix Multiplication (3)(D., Dumitriu,
Holtz, Kleinberg)
- Thm 1: Any algorithm of this class for C = AB is "numerically stable"
- $\| C_{comp} - C \| < c \, n^d \, \varepsilon \, \|A\| \, \|B\| + O(\varepsilon^2)$
- c and d are "modest" constants
- Like Strassen
- Let $\omega$ be the exponent of matrix multiplication, i.e. no algorithm is faster than $O(n^{\omega})$.
- Thm 2: For all $\eta > 0$ there exists an algorithm with complexity $O(n^{\omega + \eta})$ that is numerically stable in the sense of Thm 1.
58Conclusions
- Lots to do in Dense Linear Algebra
- New numerical algorithms
- Continuing architectural challenges
- Parallelism, performance tuning
- Ease of use, software engineering
- Grant support, but success depends on contributions from community
- www.netlib.org/lapack-dev
- www.cs.berkeley.edu/demmel
59Extra Slides
60Goal 4 Improved Ease of Use
A \ B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B,
IB, JB, DESCB, INFO)
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA,
DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C,
B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR,
BERR, WORK, LWORK, IWORK, LIWORK, INFO)
61Goal 4 Improved Ease of Use
- Easy interfaces vs access to details
- Some users want access to all details, because
- Peak performance matters
- Control over memory allocation
- Other users want "simpler" interface
- Automatic allocation of workspace
- No universal agreement across systems on "easiest interface"
- Leave decision to higher level packages
- Keep expert driver / simple driver / computational routines
- Add wrappers for other languages
- Fortran95, Java, Matlab, Python, even C
- Automatic allocation of workspace
- Add wrappers to convert to "best" parallel layout
62Goal 5 Better SW Engineering: What could go into Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate!
63Goal 5 Better SW Engineering
- How to map multiple SW layers to emerging HW layers?
- How much better are asynchronous algorithms?
- Are emerging PGAS languages better?
- Statistical modeling to limit performance tuning costs, improve use of shared clusters
- Only some things understood well enough for automation now
- Telescoping languages, Bernoulli, Rose, FLAME, ...
- Research Plan: explore above design space
- Development Plan to deliver code (some aspects)
- Maintain core in F95 subset
- Friendly wrappers for other programming environments
- Use variety of source control, maintenance, development tools
64Goal 6 Involve the Community
- To help identify priorities
- More interesting tasks than we are funded to do
- See www.netlib.org/lapack-dev for list
- To help identify promising algorithms
- What have we missed?
- To help do the work
- Bug reports, provide fixes
- Again, more tasks than we are funded to do
- Already happening thank you!
65CPU Trends
- Relative processing power will continue to double every 18 months
- 256 logical processors per chip in late 2010
66Challenges
- For all large scale computing, not just linear
algebra!
- Example: your laptop
- Exponentially growing gaps between
- Floating point time << 1/Memory BW << Memory Latency
67Commodity Processor Trends
Quantity | Annual increase | Typical value in 2006 | Predicted value in 2010 | Typical value in 2020
Single-chip floating-point performance | 59% | 4 GFLOP/s | 32 GFLOP/s | 3300 GFLOP/s
Memory bus bandwidth | 23% | 1 GWord/s = 0.25 word/flop | 3.5 GWord/s = 0.11 word/flop | 27 GWord/s = 0.008 word/flop
Memory latency | (5.5%) | 70 ns = 280 FP ops = 70 loads | 50 ns = 1600 FP ops = 170 loads | 28 ns = 94,000 FP ops = 780 loads
Will our algorithms run at a high fraction of peak?
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
68Parallel Processor Trends
Quantity | Annual increase | Typical value in 2004 | Predicted value in 2010 | Typical value in 2020
Processors | 20% | 4,000 | 12,000 | 3300 GFLOP/s
Network Bandwidth | 26% | 65 MWord/s = 0.03 word/flop | 260 MWord/s = 0.008 word/flop | 27 GWord/s = 0.008 word/flop
Network latency | (15%) | 5 µs = 20K FP ops | 2 µs = 64K FP ops | 28 ns = 94,000 FP ops = 780 loads
Will our algorithms scale up to more processors?
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
69When is High Accuracy LA Possible? (1)(D.,
Dumitriu, Holtz)
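- Model: $fl(a \otimes b) = (a \otimes b)(1 + \delta)$, $|\delta| \le \varepsilon$
- Not bit model, since $\delta$ small but arbitrary
- Goal: NASC (necessary and sufficient conditions) for $\exists$ accurate LA algorithm
- Subgoal: NASC for $\exists$ accurate algorithm to evaluate p(x)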
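The rounding model above can be checked on concrete numbers with exact rational arithmetic. The sketch below assumes Python floats are IEEE double precision with round-to-nearest (so the unit roundoff is 2^-53) and that no overflow or underflow occurs; it measures the relative error of a single floating point operation for the three classical operations +, - and x. The helper name `rounding_delta` is illustrative.

```python
from fractions import Fraction
import operator

EPS = Fraction(1, 2 ** 53)   # unit roundoff for IEEE double, round-to-nearest

def rounding_delta(op, a, b):
    """delta such that fl(a op b) = (a op b)(1 + delta), computed exactly."""
    exact = op(Fraction(a), Fraction(b))   # floats convert to Fractions exactly
    computed = Fraction(op(a, b))          # the rounded floating point result
    if exact == 0:
        return Fraction(0)
    return (computed - exact) / exact

for op in (operator.add, operator.sub, operator.mul):
    for a, b in [(0.1, 0.2), (1e16, 1.0), (1.0 / 3.0, 3.0)]:
        delta = rounding_delta(op, a, b)
        print(op.__name__, a, b, float(delta), abs(delta) <= EPS)
```

The bound holds for every pair, but the sign and size of each delta are otherwise unpredictable, which is the sense in which the model treats delta as small but arbitrary.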
70Classical Arithmetic (2)
- $\otimes \in \{ +, -, \times \}$, exact comparisons, branches
- Basic Allowable Sets (BAS)
- $Z_i = \{ x_i = 0 \}$, $S_{ik} = \{ x_i + x_k = 0 \}$, $D_{ik} = \{ x_i - x_k = 0 \}$
- V(p) allowable if $V(p) = \cup \cap \mathrm{BAS}$ (finite union of finite intersections)
- Thm 1: V(p) unallowable $\Rightarrow$ p cannot be evaluated accurately
- Thm 2: $D \cap (V(p) - \cup \{ A \text{ allowable}, A \subseteq V(p) \}) \ne \emptyset$ $\Rightarrow$ p cannot be evaluated accurately on domain D
- Thm 3: p(x) integer coeffs, x complex $\Rightarrow$ V(p) allowable iff p can be evaluated accurately
71Black Box Arithmetic (3)
- $\otimes \in$ any set of polynomials $q_i(x)$
- Ex: FMA $q(x) = x_1 \cdot x_2 + x_3$
- V(p) allowable if $V(p) = \cup \cap V(q_i)$ (finite union of irreducible parts of finite intersections)
- Thm 1: V(p) unallowable $\Rightarrow$ p cannot be evaluated accurately
- Thm 2: $D \cap (V(p) - \cup \{ A \text{ allowable}, A \subseteq V(p) \}) \ne \emptyset$ $\Rightarrow$ p cannot be evaluated accurately on domain D
- Cor: No accurate LA alg exists for Toeplitz matrices using any finite set of arithmetic operations
- Proof: Det(Toeplitz) contain irreducible components of any degree
72Open Questions and Future Work (4)
- Complete decision procedure for $\exists$ accurate algorithms, in particular for real p, arbitrary black box operations, domains D
- Apply to more structured matrix classes
- Incorporate division, rational functions
- Incorporate perturbation theory
- Conj: Accurate eval possible iff condition number has certain simple singularities
- Extend to interval arithmetic
- Math ArXiv math.NA/0508353
73Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time > .1 sec)
74Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time > .1 sec)
75Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time > .1 sec)
76Timing of Eigensolvers (only matrices where time > .1 sec)
77Accuracy Results (old vs new MRRR)
$\| Q Q^T - I \| / (n \varepsilon)$
$\max_i \| T q_i - \lambda_i q_i \| / (n \varepsilon)$
78Timing of Eigensolvers (1.9 GHz IBM Power 5 ESSL, only matrices where time > .01 sec, n>200)
79Timing of Eigensolvers (1.9 GHz IBM Power 5 ESSL, only matrices where time > .01 sec, n>200)
80Timing of Eigensolvers (1.9 GHz IBM Power 5 ESSL, only matrices where time > .01 sec, n>200)
81Timing of Eigensolvers (1.9 GHz IBM Power 5 ESSL, only matrices where time > .01 sec, n>200)
82Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices with clusters)
83Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices without clusters)
84Accuracy Results on Power 5
$\| Q Q^T - I \| / (n \varepsilon)$
$\max_i \| T q_i - \lambda_i q_i \| / (n \varepsilon)$
85Accuracy Results on Power 5, Old vs New Grail
$\| Q Q^T - I \| / (n \varepsilon)$
$\max_i \| T q_i - \lambda_i q_i \| / (n \varepsilon)$
86Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
87Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
88Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
89Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML)
90Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Matrices with tight clusters
91Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Matrices without tight clusters
92Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Random Matrices
93Performance Ratios of Eigensolvers(Cray XD1 2.2
GHz Opteron ACML) (1,2,1) Matrices
94Performance Ratios of Eigensolvers(Cray XD1 2.2
GHz Opteron ACML) Practical Matrices
95New GSVD Algorithm
Given m x n A and p x n B, factor $A = U \Sigma_a X$ and $B = V \Sigma_b X$
Bai et al, UC Davis
PSVD, CSD on the way
96Goal 2 Expanded Content
- Make content of ScaLAPACK mirror LAPACK as much as possible
- New functions (highlights)
- Updating / downdating of factorizations
- Stewart, Langou
- More generalized SVDs
- Bai, Wang, Drmac (MS 3)
97Plans for Summer 06 Release
- Byers HQR (Byers, Smith)
- MRRR (Voemel, Marques, Parlett)
- Hessenberg Reduction (Kressner)
- XBLAS (Li, Hida, Riedy)
- Iterative Refinement
- Hida, Riedy, Li, Demmel
- Dongarra, Langou
- RFP for packed Cholesky
- Langou, Gustavson
- Bug fixes
98Plans for Summer 07
- Generalized nonsymmetric eigenproblem
- Reduction to condensed form
- QZ
- Reordering evals
- Balancing
- Everything inside xGGEV(X)
- Reuse test/timing code?
- Sylvester
- New functions => new test/timing