The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

About This Presentation

Title:

The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

Description:

Motivation for new Sca/LAPACK. Challenges (or research opportunities...) Goals of new Sca/LAPACK ... van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang ... – PowerPoint PPT presentation

Number of Views:75

Avg rating:3.0/5.0

Slides: 99

Provided by: jimde3

Category:

more less

Transcript and Presenter's Notes

Title: The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

1
The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev

Jim Demmel
UC Berkeley
21 June 2006
PARA 06

2
Outline

Motivation for new Sca/LAPACK
Challenges (or research opportunities)
Goals of new Sca/LAPACK
Highlights of progress

3
Motivation

LAPACK and ScaLAPACK are widely used
Adopted by Cray, Fujitsu, HP, IBM, IMSL,
MathWorks, NAG, NEC, SGI,
gt60M web hits _at_ Netlib (incl. CLAPACK, LAPACK95)

4
Impact (with NERSC, LBNL)
Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
5
Motivation

LAPACK and ScaLAPACK are widely used
Adopted by Cray, Fujitsu, HP, IBM, IMSL,
MathWorks, NAG, NEC, SGI,
gt60M web hits _at_ Netlib (incl. CLAPACK, LAPACK95)
Many ways to improve them, based on
Own algorithmic research
Enthusiastic participation of research community
User/vendor survey
Opportunities and demands of new architectures,
programming languages
New releases planned (NSF support)

6
Participants

UC Berkeley
Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett,
Xiaoye Li, Osni Marques, Christof Voemel,
David Bindel, Yozo Hida, Jason Riedy,
Jianlin Xia, Jiang Zhu, undergrads
U Tennessee, Knoxville
Jack Dongarra, Julien Langou, Julie Langou, Piotr
Luszczek, Stan Tomov, Alfredo Buttari,
Jakub Kurzak
Other Academic Institutions
UT Austin, UC Davis, Florida IT, U Kansas, U
Maryland, North Carolina SU, San
Jose SU, UC Santa Barbara
TU Berlin, U Electrocomm. (Japan), FU Hagen, U
Carlos III Madrid, U Manchester, U Umeå, U
Wuppertal, U Zagreb
Research Institutions
CERFACS, LBL
Industrial Partners
Cray, HP, Intel, MathWorks, NAG, SGI

7
Challenges

For all large scale computing, not just linear
algebra!

Example

8
Parallelism in the Top500
9
Challenges

For all large scale computing, not just linear
algebra!

Example your laptop
256 Threads/multicore chip by 2010

10
Challenges

For all large scale computing, not just linear
algebra!

Example your laptop
256 Threads/multicore chip by 2010
Exponentially growing gaps between
Floating point time ltlt 1/Memory BW ltlt Memory
Latency
Floating point time ltlt 1/Network BW ltlt Network
Latency

11
Challenges

For all large scale computing, not just linear
algebra!

Example your laptop
256 Threads/multicore chip by 2010
Exponentially growing gaps between
Floating point time ltlt 1/Memory BW ltlt Memory
Latency
Floating point time ltlt 1/Network BW ltlt Network
Latency
Heterogeneity (performance and semantics)
Asynchrony
Unreliability

12
What do users want?

High performance, ease of use,
Survey results at www.netlib.org/lapack-dev
Small but interesting sample
What matrix sizes do you care about?
1000s 34
10,000s 26
100,000s or 1Ms 26
How many processors, on distributed memory?
gt10 34, gt100 31, gt1000 19
Do you use more than double precision?
Sometimes or frequently 16
Would Automatic Memory Allocation help?
Very useful 72, Not useful 14

13
Goals of next Sca/LAPACK

Better algorithms
Faster, more accurate
Expand contents
More functions, more parallel implementations
Automate performance tuning
Improve ease of use
Better software engineering
Increased community involvement

14
Goal 1 Better Algorithms

Faster
But provide usual accuracy, stability
Or accurate for an important subclass
More accurate
But provide usual speed
Or at any cost

15
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS 19)
Up to 10x faster HQR
Byers / Mathias / Braman
Extensions to QZ
Kågström / Kressner / Adlerborn (MS 19)
Faster Hessenberg, tridiagonal, bidiagonal
reductions
van de Geijn/Quintana-Orti, Howell / Fulton,
Bischof / Lang
Novel Data Layouts
Gustavson / Kågström / Elmroth / Jonsson /
Wasniewski (MS 15/23/30)

16
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS 19)
Faster and more accurate than previous algorithms
SIAM SIAG/LA Prize in 2006
New sequential, first parallel versions out in
2006

17
Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
18
Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
19
Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
20
Flop Counts of Eigensolvers(2.2 GHz Opteron
ACML)
21
Flop Count Ratios of Eigensolvers(2.2 GHz
Opteron ACML)
22
Run Time Ratios of Eigensolvers(2.2 GHz Opteron
ACML)
23
MFlop Rates of Eigensolvers(2.2 GHz Opteron
ACML)
24
Parallel Runtimes of Eigensolvers(2.4 GHz Xeon
Cluster Ethernet)
25
Accuracy of Eigensolvers
QQT I / (n e )
maxi Tqi li qi / ( n e )
26
Accuracy of Eigensolvers Old vs New Grail
QQT I / (n e )
maxi Tqi li qi / ( n e )
27
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS 19)
Faster and more accurate than previous algorithms
SIAM SIAG/LA Prize in 2006
New sequential, first parallel versions out in
2006
Both DC and MR are important

28
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS19)
Up to 10x faster HQR
Byers / Mathias / Braman
SIAM SIAG/LA Prize in 2003
Sequential version out in 2006
More on performance later

29
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS19)
Up to 10x faster HQR
Byers / Mathias / Braman
Extensions to QZ
Kågström / Kressner / Adlerborn (MS 19)
LAPACK Working Note (LAWN) 173
On 26 real test matrices, speedups up to 14.7x,
4.4x average

30
Comparison of ScaLAPACK QR and new parallel
multishift QZ

Execution times in secs for 4096 x 4096 random
problems
Ax sx and Ax sBx,
using processor grids including 1-16 processors.
Note
work(QZ) gt 2 work(QR) but
Time(// QZ) ltlt Time (//QR)!!
Times include cost for computing eigenvalues and
transformation matrices.

Adlerborn-Kågström-Kressner, SIAM PP2006, also
MS19
31
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS19)
Up to 10x faster HQR
Byers / Mathias / Braman
Extensions to QZ
Kågström / Kressner / Adlerborn (MS 19)
Faster Hessenberg, tridiagonal, bidiagonal
reductions
van de Geijn/Quintana-Orti, Howell / Fulton,
Bischof / Lang
Full nonsymmetric eigenproblem n1500 3.43x
faster
HQR 5x faster, Reduction 14 faster
Bidiagonal Reduction (LAWN174) n2000 1.32x
faster
Sequential versions out in 2006

32
Goal 1a Faster Algorithms (Highlights)

MRRR algorithm for symmetric eigenproblem / SVD
Parlett / Dhillon / Voemel / Marques / Willems
(MS 19)
Up to 10x faster HQR
Byers / Mathias / Braman
Extensions to QZ
Kågström / Kressner / Adlerborn (MS19)
Faster Hessenberg, tridiagonal, bidiagonal
reductions
van de Geijn/Quintana-Orti, Howell / Fulton,
Bischof / Lang
Novel Data Layouts
Gustavson / Kågström / Elmroth / Jonsson /
Wasniewski (MS 15/23/30)
SIAM Review Article 2004

33
Novel Data Layouts and Algorithms

Still merges multiple elimination steps into a
few BLAS 3 operations
MS 15/23/30 Novel Data Formats

Rectangular Packed Format good speedups for
packed storage of symmetric matrices

34
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Promise the right answer for O(n2) additional
cost
Jacobi-based SVD
Faster than QR, can be arbitrarily more accurate
Arbitrary precision versions of everything
Using your favorite multiple precision package

35
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Kahan, Riedy, Hida, Li
Promise the right answer for O(n2) additional
cost
Iterative refinement with extra-precise residuals
Extra-precise BLAS needed (LAWN165)

36
More Accurate Solve Axb
Conventional Gaussian Elimination

e
e n1/2 2-24
37
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Promise the right answer for O(n2) additional
cost
Iterative refinement with extra-precise residuals
Extra-precise BLAS needed (LAWN165)
Guarantees based on condition number estimates
Condition estimate lt 1/(n1/2 e) ? reliable
answer and tiny error bounds
No bad bounds in 6.2M tests
Can condition estimators lie?

38
Can condition estimators lie?

Yes, but rarely, unless they cost as much as
matrix multiply cost of LU factorization
Demmel/Diament/Malajovich (FCM2001)
But what if matrix multiply costs O(n2)?
More later

39
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Promise the right answer for O(n2) additional
cost
Iterative refinement with extra-precise residuals
Extra-precise BLAS needed (LAWN165)
Guarantees based on condition number estimates
Get tiny componentwise bounds too
Each xi accurate
Slightly different condition number
Extends to Least Squares (Li)
Release in 2006

40
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Promise the right answer for O(n2) additional
cost
Jacobi-based SVD
Faster than QR, can be arbitrarily more accurate
Drmac / Veselic (MS 3)
LAWNS 169, 170
Can be arbitrarily more accurate on tiny singular
values
Yet faster than QR iteration, within 2x of DC

41
Goal 1b More Accurate Algorithms

Iterative refinement for Axb, least squares
Promise the right answer for O(n2) additional
cost
Jacobi-based SVD
Faster than QR, can be arbitrarily more accurate
Arbitrary precision versions of everything
Using your favorite multiple precision package
Quad, Quad-double, ARPREC, MPFR,
Using Fortran 95 modules

42
Iterative Refinement for speed (MS 3)

What if double precision much slower than
single?
Cell processor in Playstation 3
256 GFlops single, 25 GFlops double
Pentium SSE2 single twice as fast as double
Given Axb in double precision
Factor in single, do refinement in double
If k(A) lt 1/esingle, runs at speed of single
1.9x speedup on Intel-based laptop
Applies to many algorithms, if difference large

43
Exploiting GPUs

Numerous emerging co-processors
Cell, SSE, Grape, GPU, physics coprocessor,
When can we exploit them?
LIttle help if memory is bottleneck
Various attempts to use GPUs for dense linear
algebra
Bisection on GPUs for symmetric tridiagonal
eigenproblem
Evaluate Count(x) (evals lt x) for many x
Very little memory traffic
Speedups up to 100x (Volkov)

44
Goal 2 Expanded Content

Make content of ScaLAPACK mirror LAPACK as much
as possible

45
Missing Drivers in Sca/LAPACK
LAPACK ScaLAPACK
Linear Equations LU Cholesky LDLT xGESV xPOSV xSYSV PxGESV PxPOSV missing
Least Squares (LS) QR QRpivot SVD/QR SVD/DC SVD/MRRR xGELS xGELSY xGELSS xGELSD missing PxGELS missing missing missing (ok?) missing
Generalized LS LS equality constr. Generalized LM Above Iterative ref. xGGLSE xGGGLM missing missing missing missing
46
More missing drivers
LAPACK ScaLAPACK
Symmetric EVD QR / BisectionInvit DC MRRR xSYEV / X xSYEVD xSYEVR PxSYEV / X PxSYEVD missing
Nonsymmetric EVD Schur form Vectors too xGEES / X xGEEV / X missing (driver) missing
SVD QR DC MRRR Jacobi xGESVD xGESDD missing missing PxGESVD missing (ok?) missing missing
Generalized Symmetric EVD QR / BisectionInvit DC MRRR xSYGV / X xSYGVD missing PxSYGV / X missing (ok?) missing
Generalized Nonsymmetric EVD Schur form Vectors too xGGES / X xGGEV / X missing missing
Generalized SVD Kogbetliantz MRRR xGGSVD missing missing (ok) missing
47
Goal 2 Expanded Content

Make content of ScaLAPACK mirror LAPACK as much
as possible
New functions (highlights)
Updating / downdating of factorizations
Stewart, Langou
More generalized SVDs
Bai , Wang, Drmac (MS 3)
More generalized Sylvester/Lyapunov eqns
Kågström, Jonsson, Granat (MS19)
Structured eigenproblems
Selected matrix polynomials
Mehrmann, Higham, Tisseur
O(n2) version of roots(p)
Gu, Chandrasekaran, Bindel et al

48
New algorithm for roots(p)

To find roots of polynomial p
Roots(p) does eig(C(p))
Costs O(n3), stable, reliable
O(n2) Alternatives
Newton, Jenkins-Traub, Laguerre,
Stable? Reliable?
New Exploit semiseparable structure of C(p)
Low rank of any submatrix of upper triangle of
C(p) preserved under QR iteration
Complexity drops from O(n3) to O(n2), stable in
practice
Related work Van Barel (MS3), Gemignani, Bini,
Pan, et al
Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin
Xia, David Bindel, David Garmire, Jim Demmel

49
Goal 2 Expanded Content

Make content of ScaLAPACK mirror LAPACK as much
as possible
New functions (highlights)
Updating / downdating of factorizations
Stewart, Langou
More generalized SVDs
Bai , Wang, Drmac (MS 3)
More generalized Sylvester/Lyapunov eqns
Kågström, Jonsson, Granat (MS19)
Structured eigenproblems
Selected matrix polynomials
Mehrmann, Higham, Tisseur
O(n2) version of roots(p)
Gu, Chandrasekaran, Bindel et al
How should we prioritize missing functions?

50
Goal 3 Automate Performance Tuning

Widely used in performance tuning of Kernels
ATLAS (PhiPAC) BLAS - www.netlib.org/atlas
FFTW Fast Fourier Transform www.fftw.org
Spiral signal processing - www.spiral.net
OSKI Sparse BLAS bebop.cs.berkeley.edu/oski
Integrated into PETSc

51
Optimizing blocksizes for mat-mul
Finding a Needle in a Haystack So Automate
52
Goal 3 Automate Performance Tuning

Widely used in performance tuning of Kernels
1300 calls to ILAENV() to get block sizes, etc.
Never been systematically tuned
Extend automatic tuning techniques of ATLAS, etc.
to these other parameters
Automation important as architectures evolve
Convert ScaLAPACK data layouts on the fly
Important for ease-of-use too

53
ScaLAPACK Data Layouts
1D Cyclic
1D Block
2D Block Cyclic
1D Block Cyclic
54
Speedups for using 2D processor grid range from
2x to 8x Cost of redistributing from 1D to best
2D layout 1 - 10
Times obtained on 60 processors, Dual AMD
Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB
Memory
55
Fast Matrix Multiplication (1) (Cohn, Kleinberg,
Szegedy, Umans)

Can think of fast convolution of polynomials p, q
as
Map p (q) into group algebra Si pi zi ? CG of
cyclic group G zi
Multiply elements of CG (use
divideconquer FFT)
Extract coefficients
For matrix multiply, need non-abelian group
satisfying triple product property
There are subsets X, Y, Z of G where xyz 1 with
x ? X, y ? Y, z ?Z
? x y z 1
Map matrix A into group algebra via Sxy Axy
x-1y, B into Syz
Byz y-1z.
Since x-1y y-1z x-1z iff y y we get Sy Axy
Byz (AB)xz
Search for fast algorithms reduced to search for
groups with certain properties
Fastest algorithm so far is O(n2.38), same as
Coppersmith/Winograd

56
Fast Matrix Multiplication (2)(Cohn, Kleinberg,
Szegedy, Umans)

Embed A, B in group algebra (exact)
Perform FFT (roundoff)
Reorganize results into new matrices (exact)
Multiply new matrices recursively (roundoff)
Reorganize results into new matrices (exact)
Perform IFFT (roundoff)
Extract C AB from group algebra (exact)

57
Fast Matrix Multiplication (3)(D., Dumitriu,
Holtz, Kleinberg)

Thm 1 Any algorithm of this class for C AB is
numerically stable
Ccomp - C lt c nd e A B
O(e2)
c and d are modest constants
Like Strassen
Let ? be the exponent of matrix multiplication,
i.e. no algorithm is faster than O(n?).
Thm 2 For all ?gt0 there exists an algorithm with
complexity O(n??) that is numerically stable in
the sense of Thm 1.

58
Conclusions

Lots to do in Dense Linear Algebra
New numerical algorithms
Continuing architectural challenges
Parallelism, performance tuning
Ease of use, software engineering
Grant support, but success depends on
contributions from community
www.netlib.org/lapack-dev
www.cs.berkeley.edu/demmel

59
Extra Slides
60
Goal 4 Improved Ease of Use

Which do you prefer?

A \ B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B,
IB, JB, DESCB, INFO)
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA,
DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C,
B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR,
BERR, WORK, LWORK, IWORK, LIWORK, INFO)
61
Goal 4 Improved Ease of Use

Easy interfaces vs access to details
Some users want access to all details, because
Peak performance matters
Control over memory allocation
Other users want simpler interface
Automatic allocation of workspace
No universal agreement across systems on easiest
interface
Leave decision to higher level packages
Keep expert driver / simple driver /
computational routines
Add wrappers for other languages
Fortran95, Java, Matlab, Python, even C
Automatic allocation of workspace
Add wrappers to convert to best parallel layout

62
Goal 5 Better SW EngineeringWhat could go into
Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate!
63
Goal 5 Better SW Engineering

How to map multiple SW layers to emerging HW
layers?
How much better are asynchronous algorithms?
Are emerging PGAS languages better?
Statistical modeling to limit performance tuning
costs, improve use of shared clusters
Only some things understood well enough for
automation now
Telescoping languages, Bernoulli, Rose, FLAME,
Research Plan explore above design space
Development Plan to deliver code (some aspects)
Maintain core in F95 subset
Friendly wrappers for other programming
environments
Use variety of source control, maintenance,
development tools

64
Goal 6 Involve the Community

To help identify priorities
More interesting tasks than we are funded to do
See www.netlib.org/lapack-dev for list
To help identify promising algorithms
What have we missed?
To help do the work
Bug reports, provide fixes
Again, more tasks than we are funded to do
Already happening thank you!

65
CPU Trends

Relative processing power will continue to double
every 18 months
256 logical processors per chip in late 2010

66
Challenges

For all large scale computing, not just linear
algebra!

Example your laptop
Exponentially growing gaps between
Floating point time ltlt 1/Memory BW ltlt Memory
Latency

67
Commodity Processor Trends
Annual increase Typical valuein 2006 Predicted valuein 2010 Typical valuein 2020
Single-chip floating-point performance 59 4 GFLOP/s 32 GFLOP/s 3300 GFLOP/s
Memory bus bandwidth 23 1 GWord/s 0.25 word/flop 3.5 GWord/s 0.11 word/flop 27 GWord/s 0.008 word/flop
Memory latency (5.5) 70 ns 280 FP ops 70 loads 50 ns 1600 FP ops 170 loads 28 ns 94,000 FP ops 780 loads
Will our algorithms run at a high fraction of
peak?
Source Getting Up to Speed The Future of
Supercomputing, National Research Council, 222
pages, 2004, National Academies Press, Washington
DC, ISBN 0-309-09502-6.
68
Parallel Processor Trends
Annual increase Typical valuein 2004 Predicted valuein 2010 Typical valuein 2020
Processors 20 4,000 12,000 3300 GFLOP/s
Network Bandwidth 26 65 MWord/s 0.03 word/flop 260 MWord/s 0.008 word/flop 27 GWord/s 0.008 word/flop
Network latency (15) 5 ms 20K FP ops 2 ms 64K FP ops 28 ns 94,000 FP ops 780 loads
Will our algorithms scale up to more processors?
Source Getting Up to Speed The Future of
Supercomputing, National Research Council, 222
pages, 2004, National Academies Press, Washington
DC, ISBN 0-309-09502-6.
69
When is High Accuracy LA Possible? (1)(D.,
Dumitriu, Holtz)

Model fl(a ? b) (a ? b) (1 ?), ? ? ?
Not bit model, since ? small but arbitrary
Goal NASC for ? accurate LA algorithm
Subgoal NASC for ? accurate algorithm to
evaluate p(x)

70
Classical Arithmetic (2)

? ? , -, ? , exact comparisons, branches
Basic Allowable Sets (BAS)
Zi xi 0, Sik xi xk 0, Dik xi
xk 0
V(p) allowable if V(p) ? ? BAS
Thm 1 V(p) unallowable ? p cannot be evaluated
accurately
Thm 2 D ? (V(p) - ? A allowable A ? V(p)) ?
? ? p cannot be evaluated accurately on domain
D
Thm 3 p(x) integer coeffs, x complex ?
V(p) allowable iff p can be evaluated
accurately

finite
finite
71
Black Box Arithmetic (3)

? ? any set of polynomials qi(x)
Ex FMA q(x) x1 ? x2 x3
V(p) allowable if V(p) ? ? V(qi)
Thm 1 V(p) unallowable ? p cannot be evaluated
accurately
Thm 2 D ? (V(p) - ? A allowable A ? V(p) ? ?
? p cannot be evaluated accurately on domain D
Cor No accurate LA alg exists for Toeplitz
matrices using any finite set of arithmetic
operations
Proof Det(Toeplitz) contain irreducible
components of any degree

Irred parts
finite
72
Open Questions and Future Work (4)

Complete decision procedure for ? accurate
algorithms, in particular for real p, arbitrary
block box operations, domains D
Apply to more structured matrix classes
Incorporate division, rational functions
Incorporate perturbation theory
Conj Accurate eval possible iff condition number
has certain simply singularities
Extend to interval arithmetic
Math ArXiv math.NA/0508353

73
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
74
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
75
Timing of Eigensolvers(1.2 GHz Athlon, only
matrices where time gt .1 sec)
76
Timing of Eigensolvers(only matrices where time
gt .1 sec)
77
Accuracy Results (old vs new MRRR)
QQT I / (n e )
maxi Tqi li qi / ( n e )
78
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices where time gt .01 sec, ngt200)
79
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices where time gt .01 sec, ngt200)
80
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices where time gt .01 sec, ngt200)
81
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices where time gt .01 sec, ngt200)
82
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices with clusters)
83
Timing of Eigensolvers(1.9 GHz IBM Power 5
ESSL, only matrices without clusters)
84
Accuracy Results on Power 5
QQT I / (n e )
maxi Tqi li qi / ( n e )
85
Accuracy Results on Power 5, Old vs New Grail
QQT I / (n e )
maxi Tqi li qi / ( n e )
86
Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
87
Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
88
Timing of Eigensolvers(Cray XD1 2.2 GHz Opteron
ACML)
89
Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML)
90
Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Matrices with tight clusters
91
Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Matrices without tight clusters
92
Timing Ratios of Eigensolvers(Cray XD1 2.2 GHz
Opteron ACML) Random Matrices
93
Performance Ratios of Eigensolvers(Cray XD1 2.2
GHz Opteron ACML) (1,2,1) Matrices
94
Performance Ratios of Eigensolvers(Cray XD1 2.2
GHz Opteron ACML) Practical Matrices
95
New GSVD Algorithm
Given m x n A and p x n B, factor A U ?a X
and B V ?b X
Bai et al, UC Davis
PSVD, CSD on the way
96
Goal 2 Expanded Content

Make content of ScaLAPACK mirror LAPACK as much
as possible
New functions (highlights)
Updating / downdating of factorizations
Stewart, Langou
More generalized SVDs
Bai , Wang, Drmac (MS 3)

97
Plans for Summer 06 Release

Byers HQR (Byers, Smith)
MRRR (Voemel, Marques, Parlett)
Hessenberg Reduction (Kressner)
XBLAS (Li, Hida, Riedy)
Iterative Refinement
Hida, Riedy, Li, Demmel
Dongarra, Langou
RFP for packed Cholesky
Langou, Gustavson
Bug fixes

98
Plans for Summer 07

Generalized nonsymmetric eigenproblem
Reduction to condensed form
QZ
Reordering evals
Balancing
Everything inside xGGEV(X)
Reuse test/timing code?
Sylvester
New functions gt new test/timing

Write a Comment

User Comments (0)

About PowerShow.com

The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev - PowerPoint PPT Presentation

The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev

Motivation for new Sca/LAPACK. Challenges (or research opportunities...) Goals of new Sca/LAPACK ... van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang ... – PowerPoint PPT presentation