Title: Sparse Scientific Applications: Improving Performance and Energy Characteristics on Advanced Architectures
1. Sparse Scientific Applications: Improving Performance and Energy Characteristics on Advanced Architectures
2. Motivations
- In many scientific and engineering disciplines, computational modeling and simulation are used alongside theory and experiment to understand complex systems and physical phenomena.
- Models are described by partial differential equations (PDEs).
- Numeric solutions depend on the solution of sparse linear systems.
- This thesis focuses on enabling faster and more reliable sparse linear system solution, taking into account trends in modern computer architectures.
3. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
  - Tuned SPEC benchmark
  - Towards PxP benchmark
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
  - IMD algorithm
  - Experimental results
- Research Plans and Conclusions
4. Introduction and Background
- Many scientific and engineering phenomena are described by partial differential equations (PDEs).
- The Schrödinger equation of quantum mechanics describes the wave function of a particle by prescribing the relationships among its mass, potential energy, and total energy.
- The Navier-Stokes equations describe the behavior of a fluid by prescribing the relationships among its velocity, density, pressure, and viscosity.
5. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
  - Tuned SPEC benchmark
  - Towards PxP benchmark
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
  - Incomplete Cholesky Preconditioner
  - IMD preconditioner
- Research Plans and Conclusions
6. Schrödinger Equation
- Sparse linear solution dominates the cost.
- Modeling quantum nanostructures using generalized tight-binding molecular dynamics (GTBMD).
7. Dense Method to Solve Hv = λSv
- Main steps
  - Compute a tridiagonal matrix T using orthogonal transformations, along with the orthogonal factor Q.
  - Compute the eigenvalues of T.
  - Compute the eigenvectors of T, then use Q to compute the eigenvectors of the original matrix.
- Solve the generalized eigenvalue problem using LAPACK/ScaLAPACK.
- Lionxo cluster: 80 nodes, 2.4 GHz AMD, 8 GB main memory.
8. Sparse Method to Solve Hv = λSv
- Sparse steps
  - Transform A into a band matrix B.
  - Tridiagonalize B by Givens rotations to obtain T; Q is not computed.
  - Find all eigenvalues of T with any suitable method: QR, bisection, divide and conquer, RRR.
  - Find all eigenvectors of A by sparse inverse iteration, using the eigenvalues obtained in the previous step (shifted inverse method).
- Using LAPACK for eigenvalues and ARPACK for eigenvectors.
- Significance: memory scalability. Bottleneck: efficiency.
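As a concrete illustration of the shifted inverse step above, here is a minimal dense sketch (NumPy-based; in the actual sparse setting a factorization of A - λI would be computed once and reused instead of a dense solve):

```python
import numpy as np

def inverse_iteration(A, lam, iters=20):
    """Recover the eigenvector for an (approximate) eigenvalue lam by
    iterating v <- (A - lam*I)^{-1} v and normalizing. The dense solve
    stands in for a reusable sparse factorization."""
    n = A.shape[0]
    shifted = A - (lam + 1e-8) * np.eye(n)  # tiny offset keeps the solve nonsingular
    v = np.random.default_rng(0).standard_normal(n)
    for _ in range(iters):
        v = np.linalg.solve(shifted, v)
        v /= np.linalg.norm(v)
    return v

# Toy symmetric matrix; its largest eigenvalue is exactly 5 (decoupled block)
A = np.array([[1.0, 0.1, 0.0],
              [0.1, 2.0, 0.0],
              [0.0, 0.0, 5.0]])
lam = np.linalg.eigvalsh(A)[-1]
v = inverse_iteration(A, lam)
resid = float(np.linalg.norm(A @ v - lam * v))
```

Because the shifted matrix is nearly singular at an eigenvalue, each solve amplifies the wanted eigenvector, so convergence is very fast.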
9. Dense vs. Sparse Method
10. Si-nanotree
- We consider
  - Si-nanotrees and stems, with 2000-2500 atoms.
  - A four-fold coordinated inner core surrounded by an outer surface of atoms with three-fold coordination.
- Two categories
  - Tetrahedral Si(Td)
  - Clathrate (cage-like) Si(fcc), Si(sc)
- The inner core is made of the Si clathrate structure consisting of fcc and sc.
11. Structure and Electronics
- Si-nanowire structures
  - Top: tetrahedral structure.
  - Middle: clathrate structure with a 34-atom basis, fcc.
  - Bottom: clathrate structure with a 46-atom basis, sc.
- Reconstruction
  - Smooth for Si(34) and Si(46) nanotrees.
  - Si(Td) junctions appear abrupt.
  - The cage-like arrangement allows seamless connection between the stems and the branches.
- Electronics: achieves a smaller HOMO-LUMO gap and enhanced conductivity due to more available conduction channels.
12. Summary: Quantum Nanostructures
- GTBMD computation results for structure and stability studies, using large-scale quantum mechanical simulations of Si nanotrees.
- The sparse method reduces memory usage.
- The Si structures are stable; the electronic structure shows a narrow HOMO-LUMO gap.
- Paper: HPCNano05 Workshop, "Large scale simulations of branched Si-nanowires."
- Plans: parallel spectrum slicing for sparse eigenvalue computation.
13. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
  - Tuned SPEC benchmark
  - Towards PxP benchmark
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
  - Incomplete Cholesky Preconditioner
  - IMD preconditioner
- Research Plans and Conclusions
14. Navier-Stokes Equations
- Sparse linear solution dominates the cost.
15. Computational Fluid Dynamics
- Three-dimensional flow problems in science and engineering often feature geometries that are homogeneous in at least one of the flow directions, e.g., (a) Rayleigh-Bénard convection, (b) reactor core cooling.
- Numerical simulation cost can be reduced by recasting the problem in terms of eigenvectors of the one-dimensional operators, yielding a decoupled set of lower-dimensional problems.
16. Reduced Dimension
17. Three-Step Solution
- Transformation to wave space
- Solution
- Inverse transform
18. Two-Dimensional Solves
- Consider A_FE x = b
  - A_FE = L L^T, then L y = b, then L^T x = y.
- Steps: ordering, symbolic factorization, numeric factorization, triangular solution.
- Bottleneck: the triangular solution.
- SI scheme: replace substitution with selective inverse matrix multiplication; latency tolerant.
- Multiple right-hand sides reduce the communication cost.
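The factor-then-solve sequence above can be sketched in a few lines (NumPy illustration; the real solver uses sparse factors and a parallel triangular solve, and A_FE here is just a random SPD stand-in):

```python
import numpy as np

# Factor once, then one forward and one backward substitution per right-hand side.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A_FE = M @ M.T + n * np.eye(n)    # SPD stand-in for the finite-element matrix

L = np.linalg.cholesky(A_FE)      # numeric factorization: A_FE = L L^T

B = rng.standard_normal((n, 4))   # several right-hand sides handled together
Y = np.linalg.solve(L, B)         # forward solve:  L Y = B
X = np.linalg.solve(L.T, Y)       # backward solve: L^T X = Y
resid = float(np.max(np.abs(A_FE @ X - B)))
```

Grouping the right-hand sides into one matrix B is the same idea the slide uses to amortize communication: each triangular sweep then touches several vectors at once.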
19. Using the SI Scheme
20. Reducing Communication Cost Using Multiple RHS
21. Summary: CFD Application
- Reduced dimension for three-dimensional problems using tensor-product bases.
- O(n log n) algorithm; the lower order complexity is realized by
  - using the SI scheme, and
  - handling multiple right-hand sides.
- Paper: 2nd MIT Conf. on Structural and Fluid Mechanics, "An O(n log n) Solution Algorithm for Spectral Element Methods."
22. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
  - Incomplete Cholesky Preconditioner
  - IMD preconditioner
- Research Plans and Conclusions
23. Characteristics of Scientific Applications
- What do we need?
  - Faster sparse computation for scientific applications.
- Microprocessor design focuses on performance and power, and on architectures for scaling.
- Sparse computation
  - Kernels utilize only a fraction of the available processing speed: a large number of data accesses per floating point operation, limited data locality, and little data reuse.
- Sparse applications are good candidates for co-managing performance and power.
24. Questions
- How do memory subsystem optimizations affect the power and performance of sparse applications?
- How can the interactions between algorithmic and architectural features be characterized?
25. NAS CG Benchmark
- Conjugate Gradient is the most popular sparse iterative solver.
- Each iteration performs one matrix-vector multiplication, two vector inner products, three vector additions, and two floating point divisions.
- Matrix-vector multiplication dominates the cost, so an efficient sparse matrix-vector routine is needed.
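The per-iteration kernel mix can be seen directly in a textbook CG loop. A NumPy sketch (here `A_mul` is whatever matrix-vector routine is available; a dense product stands in for the sparse kernel):

```python
import numpy as np

def cg(A_mul, b, tol=1e-10, maxit=1000):
    """Plain CG: per iteration one matrix-vector product, two inner
    products, three vector updates, and two floating point divisions."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual of the zero initial guess
    p = r.copy()
    rr = r @ r
    for _ in range(maxit):
        Ap = A_mul(p)                 # the dominant cost: the SpMV
        alpha = rr / (p @ Ap)         # division 1, inner product 1
        x += alpha * p                # vector update 1
        r -= alpha * Ap               # vector update 2
        rr_new = r @ r                # inner product 2
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p     # division 2, vector update 3
        rr = rr_new
    return x

# 1-D Laplacian as a small SPD test system
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(lambda v: A @ v, b)
resid = float(np.linalg.norm(A @ x - b))
```

Everything except the `A_mul` call is cheap dense vector work, which is why the slide focuses on optimizing the sparse matrix-vector routine.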
26. SMV Optimization: SPARSITY
- Optimized routine for matrix-vector multiplication.
- Reduces index overhead by adding explicit zero values to complete blocks.
- The matrix-vector multiplication routine uses this blocking structure to exploit registers.
- OSKI provides forward and backward solution.
(Figure: a sparse matrix in row/column/values form, and the same matrix after SPARSITY 2x1 blocking.)
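A minimal sketch of the blocking idea (plain Python for clarity; the real SPARSITY/OSKI kernels are generated C code). The 2x1 blocked format stores one column index per two values, trading a few explicit zeros for lower index overhead and register reuse:

```python
import numpy as np

def spmv_csr(n, rowptr, col, val, x):
    """Plain CSR product: one column index read per nonzero."""
    y = np.zeros(n)
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[k] * x[col[k]]
    return y

def spmv_bcsr_2x1(n, browptr, bcol, bval, x):
    """2x1 blocked product: one column index per two stored values; the
    two running sums can stay in registers across the inner loop."""
    y = np.zeros(n)
    for bi in range(n // 2):
        y0 = y1 = 0.0
        for k in range(browptr[bi], browptr[bi + 1]):
            xj = x[bcol[k]]
            y0 += bval[2 * k] * xj
            y1 += bval[2 * k + 1] * xj
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# Small demo matrix, converted to both formats (zero fill completes blocks)
A = np.array([[4., 0., 1., 0.],
              [0., 3., 0., 2.],
              [1., 0., 5., 0.],
              [0., 2., 0., 6.]])
x = np.array([1., 2., 3., 4.])

rowptr, col, val = [0], [], []
for i in range(4):
    for j in range(4):
        if A[i, j] != 0:
            col.append(j); val.append(A[i, j])
    rowptr.append(len(col))

browptr, bcol, bval = [0], [], []
for bi in range(2):
    for j in [j for j in range(4) if A[2*bi, j] != 0 or A[2*bi+1, j] != 0]:
        bcol.append(j); bval.extend([A[2*bi, j], A[2*bi+1, j]])
    browptr.append(len(bcol))

err_csr = float(np.max(np.abs(spmv_csr(4, rowptr, col, val, x) - A @ x)))
err_bcsr = float(np.max(np.abs(spmv_bcsr_2x1(4, browptr, bcol, bval, x) - A @ x)))
```

Both variants produce the same result; the blocked one simply does less index traffic per floating point operation.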
27. Experiment
- Measure the power and performance difference between optimized and unoptimized code.
- Software: SimpleScalar and Wattch.
- Metrics: execution time, power, energy.
- Data: bcsstk31 (dimension 35,588; 1,181,416 nonzeros).
- Hardware
  - Base architecture: 2 floating point units; 64- or 128-bit bus; 32 KB data / 32 KB instruction cache (L1); 2 KB L2 cache; 4 MB L3 cache.
  - Level-2 prefetcher (LP), memory prefetcher (MP).
28. CG Time and Power
- Utilizing low-power modes of caches decreases power consumption.
29. CG with LP and MP: Time
- 256 KB cache, relative to CG-U, B, 4 MB, 1 GHz.
30. CG with LP and MP: Power
- At 400 MHz, 40% lower power consumption.
- Performance does not suffer with DVS.
31. Power and Performance Tradeoffs
- CG-O has more possibilities for reducing power without sacrificing performance.
- Cache size: 256 KB.
32. Interactions Between Tuned Code and Architectures
- equake in SPEC CFP2000 simulates the propagation of elastic waves.
- The data structure is application specific; tuning it improves data locality.
(Figure: memory layout of the vectors and matrix before and after tuning.)
33. Simulation Time and Energy
- Memory subsystem optimization helps, but not as much as tuned code.
- Tuned code without memory subsystem optimization beats untuned code.
34. Relative Incremental Improvements
- For untuned code, LP degrades performance and MP does not contribute much.
- For tuned code, LP still improves performance and MP helps a lot.
35. Discrepancy
- MP makes a huge difference between tuned and untuned code.
- LP still makes a 120% difference between tuned and untuned code.
36. Toward a PxP Benchmark
- Performance analysis with such tuned codes, in addition to the traditional benchmarks, will enable a more comprehensive assessment of architectural optimizations for future high-end systems.
- Provide a parameterized synthetic benchmark built around a sparse iterative solver with multiple levels of optimization.
- Tunable optimizations: ordering (RCM, MMD), preconditioner type (ICC(0), ICC(1), ICT), blocking size, etc.
37. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
  - Tuned SPEC benchmark
  - Towards PxP benchmark
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
- Research Plans and Conclusions
38. Sparse Linear System Solution
- A is symmetric and positive definite.
- Conjugate Gradient (CG) with an incomplete Cholesky factor as a preconditioner.
- During Cholesky factorization, some of the elements that are zero in A fill in.
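A tiny NumPy illustration of fill-in and why ordering matters: an "arrow" matrix factored with its dense row first fills in completely, while eliminating that row last produces no fill at all.

```python
import numpy as np

n = 5
A = 4.0 * np.eye(n)
A[0, 1:] = A[1:, 0] = 1.0          # arrow matrix: dense first row/column

nnz = lambda M: int(np.sum(np.abs(M) > 1e-12))

L_first = np.linalg.cholesky(A)    # dense node eliminated first: factor fills in
perm = list(range(1, n)) + [0]     # dense node eliminated last: no fill
L_last = np.linalg.cholesky(A[np.ix_(perm, perm)])
```

Counting nonzeros shows the effect: the first ordering produces a fully dense lower triangle, the second keeps exactly the original 2n - 1 nonzeros.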
39. Two Types of Incomplete Cholesky Factorization
- The incomplete factor is obtained by performing Cholesky factorization while selectively discarding some fraction of the fill-in nonzeros.
- Level-of-fill (k)
  - L(i,j) ≠ 0 if there is an i-j fill path in G(A).
  - L(k)(i,j) ≠ 0 if there is an i-j fill path of length at most (k+1).
- Drop threshold
  - L(i,j) ≠ 0 if |L(i,j)| > sqrt(L(i,i) × L(j,j)) × threshold.
- The final nonzero structure is known only after factorization.
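The level-of-fill rule can be computed purely symbolically. A plain-Python sketch on a dense boolean pattern (assuming the standard recurrence: original nonzeros have level 0, and fill created through pivot p gets level lev(i,p) + lev(p,j) + 1, kept when it is at most k):

```python
def ilu_levels(pattern, k):
    """Symbolic level-of-fill factorization on a dense boolean pattern.
    Returns the pattern of entries kept at level <= k."""
    n = len(pattern)
    INF = 10**9
    lev = [[0 if pattern[i][j] else INF for j in range(n)] for i in range(n)]
    for p in range(n):
        for i in range(p + 1, n):
            if lev[i][p] > k:              # dropped entry: no fill through it
                continue
            for j in range(p + 1, n):
                if lev[p][j] <= k:
                    lev[i][j] = min(lev[i][j], lev[i][p] + lev[p][j] + 1)
    return [[lev[i][j] <= k for j in range(n)] for i in range(n)]

# Tridiagonal pattern plus a corner entry
pat = [[True,  True,  False, True],
       [True,  True,  True,  False],
       [False, True,  True,  True],
       [True,  False, True,  True]]
f0 = ilu_levels(pat, 0)
```

With k = 0 no fill is admitted, so the factor keeps exactly the original pattern; larger k monotonically admits more of the fill created during elimination.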
40. Traditional Approach
- Incomplete Cholesky factorization based on a predetermined ordering.
- Uses fill-reducing orderings: MMD, RCM, ND.
- Drawbacks
  - The ordering is chosen to reduce fill-in for the complete Cholesky factor, not the incomplete one.
  - Since the ordering is determined in advance, it is not easy to accommodate alternative pivoting sequences based on numerical magnitude.
41. Interleaved Minimum Degree Ordering (IMD)
- Consider sparse Cholesky factorization at step i, with the trailing matrix partitioned as
  H_i = [ d_ii  v_i^T ; v_i  B_{i+1} ].
- Incomplete Cholesky uses H̃_i = H_i - E_i, where E_i is the error matrix corresponding to the values dropped from v_i and H_{i+1}.
42. IMD Algorithm
- Â = D^(-1/2) A D^(-1/2)
- While the incomplete factor has not been computed do
  - While there are unnumbered columns do
    - Select a column with minimum degree and number it as column i
    - Compute column L(:,i)
    - If L(i,i) ≤ 0 then
      - Shift the diagonal by α, set α ← 2α, and restart
    - Else
      - For all remaining unnumbered columns j do
        - Update L(:,j) if L(j,i) is nonzero
        - Retain or drop L(p,j) using the level-of-fill / threshold criterion
      - End for
    - End if
  - End while
- End while
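The breakdown-and-shift loop combined with threshold dropping can be sketched as follows (a dense NumPy illustration of incomplete Cholesky with a doubling shift restart; the interleaved minimum degree selection itself is omitted, so this is a simplified model, not the full IMD algorithm):

```python
import numpy as np

def ic_threshold(A, tau=0.01, alpha0=1e-3):
    """Incomplete Cholesky with drop threshold and diagonal-shift restart.
    Returns the incomplete factor of the scaled matrix and the final shift."""
    n = A.shape[0]
    d = np.sqrt(np.diag(A))
    Ah = A / np.outer(d, d)               # symmetric scaling: unit diagonal
    alpha = 0.0
    while True:
        L = np.tril(Ah) + alpha * np.eye(n)
        failed = False
        for i in range(n):
            if L[i, i] <= 0:              # breakdown: shift and restart
                alpha = alpha0 if alpha == 0.0 else 2.0 * alpha
                failed = True
                break
            L[i, i] = np.sqrt(L[i, i])
            L[i+1:, i] /= L[i, i]
            col = L[i+1:, i]
            col[np.abs(col) < tau] = 0.0  # drop small entries in this column
            L[i+1:, i+1:] -= np.tril(np.outer(col, col))  # trailing update
        if not failed:
            return L, alpha

# With tau = 0 nothing is dropped, so the routine reduces to exact Cholesky
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)             # SPD test matrix
L, alpha = ic_threshold(A, tau=0.0)
Ah = A / np.outer(np.sqrt(np.diag(A)), np.sqrt(np.diag(A)))
err = float(np.max(np.abs(L @ L.T - Ah)))
```

Note that dropped entries are zeroed before the trailing update, so they do not propagate, which is what makes the factorization incomplete rather than merely sparsified afterwards.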
43. Empirical Results: Quality of Preconditioners
- Scaled fill: the number of nonzeros in the preconditioner relative to the number of nonzeros in the original matrix, i.e., |L|/|A|.
- Iterations: the number of iterations for PCG to converge.
- Operations: the product scaled fill × iterations, a proxy for the number of operations required to apply the preconditioner.
44. Experiment
- 34 matrices; IMD, MMD, and RCM orderings.
- Four levels of fill (0, 1, 2, and 3): IMDF, IMDF-SD, IMDF-BD, MMD-ICF, RCM-ICF.
- Four thresholds (0.1, 0.05, 0.01, and 0.001): IMDT, IMDT-SD, IMDT-BD, MMD-ICT, RCM-ICT.
- 272 tests per method; 1,360 tests over all five schemes.
- Convergence at 1e-6 relative residual, with at most 1000 iterations.
- If α > 1 (for A + αI), the run is counted as a failure.
45. Cumulative Iterations

              IMDT   BD     SD     MMD    RCM
  Scaled fill 1.16   1.05   1.04   1.04   1.0
  Iterations  0.71   0.58   0.53   1.11   1.0
46. Cumulative Operations

              IMDT   BD     SD     MMD    RCM
  Operations  0.83   0.62   0.56   1.16   1.0
47. Cumulative Time

              IMDT   BD     SD     MMD    RCM
  L           2.27   5.18   5.11   2.78   1.0
  L + 4 rhs   0.72   0.76   0.69   1.17   1.0
48. Quality of Ordering
- Error from the two orderings.
49. Fill and Dropped Elements
- Amount of fill-in relative to the incomplete factor, and dropped elements relative to the incomplete factor.
50. Summary: IMD
- Developed IMD schemes for effective preconditioning through interleaving of greedy minimum degree ordering and incomplete factorization.
- Operations: RCM-IC 110%, MMD-IC 152%.
- Time
  - IMD requires 6% more time than RCM-ICF.
  - For the four-system solution time, IMD-SD requires the least time.
- IMD approximates better by using the degree metric.
- Paper: SIAM Journal on Matrix Analysis and Applications, "Effective Preconditioning Through Ordering Interleaved with Incomplete Factorization." (18 pages)
51. Outline
- Introduction and Background
- Scientific Applications
  - Computational Quantum Mechanics
  - Computational Fluid Dynamics Applications
- Performance-Power Characteristics of Sparse Scientific Applications
  - Tuned SPEC benchmark
  - Towards PxP benchmark
- Interleaved Minimum Degree (IMD) Preconditioner for Symmetric Positive Definite Matrices
  - Incomplete Cholesky Preconditioner
  - IMD preconditioner
- Research Plans and Conclusions
52. Conclusions
- Explored the role of sparse linear solvers in scientific and engineering applications
  - CFD: reduced computational complexity.
  - Quantum nanomechanics: enabled large-scale simulations by reducing memory usage.
- Analyzed the interaction between sparse kernels and architectural features
  - Performance can be improved while system power is reduced.
- Developed the Interleaved Minimum Degree (IMD) preconditioner
  - Stable, and accelerates iterative solution.
53. Contributions Thus Far
- Application-specific sparse solvers
  - Paper: HPCNano05 Workshop, "Large scale simulations of branched Si-nanowires." (5 pages)
  - Paper: 2nd MIT Conf. on Structural and Fluid Mechanics, "An O(n log n) Solution Algorithm for Spectral Element Methods." (8 pages)
- Characterizing interactions between architectural optimizations and sparse solvers
  - Submitted: "Memory Optimizations for Tuned Scientific Applications: An Evaluation of Performance-Power Characteristics," ISPASS06.
  - Submitting: "Conjugate Gradient Sparse Iterative Solvers: Performance-Power Characteristics," HP-PAC.
- General-purpose sparse solvers
  - Paper: SIAM Journal on Matrix Analysis and Applications, "Effective Preconditioning Through Ordering Interleaved with Incomplete Factorization." (18 pages)
54. Plans
- Generalize the IMD algorithm to nonsymmetric matrices
  - Using constrained diagonal Markowitz.
  - Preliminary results show that it is more robust than ILUT.
- Provide a synthetic code for benchmarking using highly optimized sparse kernels
  - Sparse iterative solvers, parameterized to support different implementations and phases.
- Parallel spectrum slicing for eigenvalue computation.
55. Thanks
56. What Is the Main Factor?
- High performance computers.
- Efficient sparse linear solvers.
- Goal: improving the performance of sparse linear solution, and understanding how such schemes and high performance architectural features interact.
57. Computational Nanotechnology
- The length and time scales of nanoscale systems and phenomena have shrunk to where we can directly address them with computer simulations and theoretical modeling at high accuracy.
- Increasing computer power has enabled large-scale, high-fidelity simulations, making it possible for nanoscale simulations to be predictive.
- Computational nanotechnology is emerging as an engineering analysis tool for novel nanodevice design.
58. Schrödinger Equations
- Describes atoms as a collection of quantum mechanical particles, nuclei and electrons, governed by the Schrödinger equation.
- Generalized Tight Binding Molecular Dynamics (GTBMD)
  - Parameterizes the Hamiltonian, H(R_I).
  - The repulsive pair potential depends on the interatomic distance.
  - The bonding energy is a constant that shifts the zero of energy.
59. Computational Nanotechnology: Method Hierarchy
- Classical MD: Newtonian mechanics; less accurate, computationally cheap.
- Tight Binding MD: semiempirical, parameterized.
- Ab initio (first-principles): Schrödinger equations, Born-Oppenheimer, Car-Parrinello; accurate, high computational cost.
- Classical MD is not accurate enough, and ab initio computation is not feasible at this scale.
60. Schrödinger Equations
- Describes atoms as a collection of quantum mechanical particles, nuclei and electrons, governed by the Schrödinger equation.
- Born-Oppenheimer: the electronic degrees of freedom follow the corresponding nuclear positions.
61. Generalized Tight Binding MD
- Obtaining H_ij and S_ij.
- Nonorthogonality between two sp3 hybrids.
62. Generalized Tight Binding MD
- Construct the Hamiltonian in a nonorthogonal basis of atomic orbitals.
- Form the Hamiltonian and overlap matrix elements.
- One-electron energies are obtained by solving the generalized eigenvalue equation.
63. GTBMD Advantages
- Computationally efficient, because the Hamiltonian H(R_I) can be parameterized.
- Transferable parameterization scheme, obtained by explicitly including the effects of the nonorthogonality of the atomic basis.
- Can find an energy-minimized structure of the nanoscale system under consideration without symmetry constraints.
64. Problems?
- Inverse iteration
  - Once the eigenvalues are computed, the eigenvectors are computed using factorizations of the original sparse matrix.
  - Re-orthogonalization problem.
  - Reduced storage.
  - Sparse inverse iteration.
65. Lanczos Method
- Iterative method for sparse symmetric eigenvalue problems.
- The dominant cost is sparse matrix-vector multiplication (y = Ax).
- Suitable for finding several eigenvalues and eigenvectors of maximum absolute value; not suitable for finding all the eigenvalues.
- By implicitly using a shifted inverse, the method can find eigenvalues near the shift (eigenvectors of (A - λI)^(-1)).
  - Similar in concept to inverse iteration; called shifted inverse Lanczos.
  - Can find several eigenpairs per factorization.
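A toy version of shifted inverse Lanczos (NumPy sketch; a dense inverse stands in for the sparse factorization of A - σI, and full reorthogonalization is used for simplicity, which the slide's re-orthogonalization concerns would avoid at scale):

```python
import numpy as np

def nearest_eigenvalue(A, sigma, m=30):
    """Run Lanczos on (A - sigma*I)^{-1}; the extreme Ritz value of the
    tridiagonal T maps back to the eigenvalue of A nearest the shift."""
    n = A.shape[0]
    Op = np.linalg.inv(A - sigma * np.eye(n))   # factor once, reuse every step
    q = np.random.default_rng(1).standard_normal(n)
    q /= np.linalg.norm(q)
    Q, alphas, betas = [q], [], []
    beta, q_prev = 0.0, np.zeros(n)
    for _ in range(m):
        w = Op @ Q[-1] - beta * q_prev
        alpha = Q[-1] @ w
        w = w - alpha * Q[-1]
        for qi in Q:                            # full reorthogonalization
            w -= (qi @ w) * qi
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:                        # invariant subspace found
            break
        q_prev = Q[-1]
        Q.append(w / beta)
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k-1], 1) + np.diag(betas[:k-1], -1)
    theta = np.linalg.eigvalsh(T)
    extreme = theta[np.argmax(np.abs(theta))]   # largest |theta| converges first
    return sigma + 1.0 / extreme                # map the Ritz value back to A

A = np.diag(np.arange(1.0, 11.0))               # eigenvalues 1..10
val = nearest_eigenvalue(A, 4.3)
```

Because eigenvalues of A near σ become the largest-magnitude eigenvalues of the inverted operator, they are exactly the ones Lanczos resolves first; this is what makes one factorization yield several eigenpairs in the spectrum-slicing plan.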
66. Si-Nanowire Matrix Structure
(Figure: nonzero structure of Si40_stem_1776 and Si40_tree_1872.)
- Si_tree is close to a block diagonal matrix.
67. ScaLAPACK Results
68. Electronics
- HOMO-LUMO gap (highest occupied molecular orbital to lowest unoccupied molecular orbital): stems 0.57 eV, trees 0.16 eV, bulk diamond phase of silicon 1.1 eV.
- Trees
  - More states from the unoccupied levels are pulled in towards the Fermi level, E_F (dashed).
  - Enhanced conductivity due to more conduction channels being available.
(Figure: electronic densities of states (DOS).)
69. Shifted Inverse Lanczos for Sparse Matrices
- Eigenvalues near the shift converge very quickly.
- If all eigenvalues are known approximately, they can be used as shifts:
  - Separate the eigenvalues into groups according to their distribution (slicing the spectrum).
  - Select a shift in the middle of each group.
  - Run a Lanczos iteration for each group.
70. Computational Fluid Dynamics
- Numerical simulation cost can be reduced by recasting the problem in terms of eigenvectors of the one-dimensional operators, yielding a decoupled set of lower-dimensional problems.
- Reduces three-dimensional problems to independent two-dimensional subproblems.
71. Problem Formulation
72. Solution Cost
- The transform from physical to wave space and its inverse are matrix-matrix products, efficient on all architectures.
- If nz << nxy, the cost of applying S or S^T is subdominant to that of solving the two-dimensional subproblems.
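A compact model of the whole transform/solve/inverse-transform procedure (NumPy sketch; 1-D Laplacians stand in for the discrete operators, so the "two-dimensional subproblems" are one-dimensional here, and A_z is symmetric so its eigenvector matrix S is orthogonal):

```python
import numpy as np

def laplace_1d(n):
    """Standard 1-D second-difference (Dirichlet) operator."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

nz, nxy = 4, 6
Az, Axy = laplace_1d(nz), laplace_1d(nxy)
A = np.kron(Az, np.eye(nxy)) + np.kron(np.eye(nz), Axy)  # separable operator
b = np.random.default_rng(2).standard_normal(nz * nxy)

lam, S = np.linalg.eigh(Az)       # 1-D eigenpairs: Az = S diag(lam) S^T
B = b.reshape(nz, nxy)
Bw = S.T @ B                      # 1) transform to wave space
Uw = np.empty_like(Bw)
for k in range(nz):               # 2) nz independent decoupled solves
    Uw[k] = np.linalg.solve(Axy + lam[k] * np.eye(nxy), Bw[k])
u = (S @ Uw).reshape(-1)          # 3) inverse transform
resid = float(np.max(np.abs(A @ u - b)))
```

Steps 1 and 3 are the matrix-matrix products applying S and S^T; step 2 is the set of decoupled subproblems, one per eigenvalue of the homogeneous direction, which dominates the cost when nz << nxy.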
73. Preconditioning
- The reduced SE problems are solved using conjugate gradient iteration.
- Preconditioned with a system A_FE based on a finite-element discretization having nodes coincident with the SE nodes.
- Finite-element preconditioning gives bounded iteration counts.
74. Parallel Solver
- Numeric factorization is performed effectively on multiprocessors with multifrontal or column-block methods.
- The triangular solution becomes a numerical bottleneck.
- Workarounds
  - Use the Selective Inverse (SI) scheme.
  - Use multiple right-hand sides.
75. Implementation Details
- Tie-breaking
  - Random (IMD).
  - Big diagonal first (IMD-BD): stability through the pivot.
  - Small diagonal first (IMD-SD): indistinguishable node effect.
- Heap structure to reduce the degree-update overhead.
- For tie-breaking by numerical measure, set up a degree queue list and an index reference for each column.
(Figure: degree queue list and index reference.)
76. Failures?
- Manteuffel
  - Shows that if Â + αI is an H-matrix, then incomplete Cholesky factorization of Â + αI succeeds.
- Jennings and Malik
  - If a_ij^(k) is dropped, set a_ii^(k) ← a_ii^(k) + s|a_ij^(k)| and a_jj^(k) ← a_jj^(k) + (1/s)|a_ij^(k)|.
- Jones and Plassmann
  - Increase α by constants (1e-2).
- Bouaricha, Moré, and Wu
  - α = ½‖Â‖∞.
- Lin and Moré
  - Start with α = 1e-3 and multiply by two on each failure.
77. Minimum Discarded Fill (MDF, Tang)
- Motivated by the observation
  - H̃_{i+1} = H_{i+1} - E_{i+1} = B_{i+1} - v_i v_i^T / d_ii - E_{i+1}.
- At step i, eliminate the node that minimizes the Frobenius norm of the discarded fill matrix, F_i(E_i).
- Shows better performance than RCM ordering, but is expensive to compute.
78. What Should Be Considered
- Ordering schemes that use structural metrics to reduce fill-in in the incomplete factor rather than the complete factor.
- Ordering interleaved with numeric factorization to compute the incomplete factor.
- A greedy minimum degree scheme based on the elimination graphs of the incomplete factor.
79. Failures and Diagonal Shifts
- Diagonal shifts
  - IMD (2.2), SD (1.9), BD (1.8).
  - MMD (4.1), RCM (3.7).
- Failures
  - MMD (22), RCM (19).
- The IMD methods are much more stable.
80. Comparison with MDF
81. CG with LP and MP: Power and Energy
82. Derived Metrics
- RI (Relative Improvement)
- RII (Relative Incremental Improvement)
- Discrepancy in RII