
1
Sparse Scientific Applications: Improving
Performance and Energy Characteristics on
Advanced Architectures
  • 12.14.05
  • Ingyu Lee

2
Motivations
  • In many scientific and engineering disciplines,
  • Computational modeling and simulation are used
    to understand complex systems or physical
    phenomena, in addition to theory and experiment.
  • Models are described by partial differential
    equations (PDEs).
  • Numeric solutions depend on the solution of
    sparse linear systems.
  • This thesis focuses on enabling faster and more
    reliable sparse linear system solution, taking
    into account trends in modern computer
    architectures.

3
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Tuned SPEC benchmark
  • Towards PxP benchmark
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • IMD algorithm
  • Experimental results
  • Research Plans and Conclusions

4
Introduction and Background
  • Many scientific and engineering phenomena are
    described by partial differential equations
    (PDEs)
  • Schrödinger equation of quantum mechanics
    describes the wave function of a particle by
    prescribing the relationships among its mass,
    potential energy, and total energy.
  • Navier-Stokes equations describe the behavior of
    a fluid by prescribing the relationships among
    its velocity, density, pressure and viscosity.

5
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Tuned SPEC benchmark
  • Towards PxP benchmark
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • Incomplete Cholesky Preconditioner
  • IMD preconditioner
  • Research Plans and Conclusions

6
Schrödinger Equation
  • Sparse linear solution dominates cost.
  • Modeling quantum nanostructures using
    generalized tight-binding molecular dynamics
    (GTBMD).

7
Dense Method to Solve Hv = λSv
  • Main steps
  • Compute a tridiagonal matrix T using orthogonal
    transformations, with orthogonal factor Q.
  • Compute eigenvalues of T.
  • Compute eigenvectors of T, and then use Q to
    compute eigenvectors of the original matrix.
  • Solve the generalized eigenvalue problem using
    LAPACK/ScaLAPACK.
  • Lionxo cluster
  • 80 nodes, 2.4 GHz AMD, 8 GB main memory.
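
In today's terms, this dense pipeline corresponds to one call to a LAPACK generalized symmetric eigensolver. A minimal sketch using scipy.linalg.eigh (which wraps LAPACK's generalized symmetric-definite drivers), with small random stand-ins for the Hamiltonian H and overlap matrix S:

```python
# Sketch: dense generalized eigenproblem H v = lambda S v.
# H and S below are random stand-ins; in the thesis they come from GTBMD.
import numpy as np
from scipy.linalg import eigh

n = 200
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
H = (M + M.T) / 2                      # symmetric Hamiltonian stand-in
R = rng.standard_normal((n, n))
S = np.eye(n) + 0.01 * (R @ R.T) / n   # SPD overlap-matrix stand-in

w, V = eigh(H, S)                      # all eigenvalues and eigenvectors
```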

8
Sparse Method to Solve Hv = λSv
  • Sparse steps
  • Transform A into a band matrix, B.
  • Tridiagonalize B by Givens rotations to obtain T
  • Q is not computed.
  • Find all eigenvalues of T with any method
  • QR, bisection, divide and conquer, RRR.
  • Find all eigenvectors of A by sparse inverse
    iteration, using the eigenvalues obtained in the
    previous step.
  • Shifted inverse method
  • Using LAPACK for eigenvalues and ARPACK for
    eigenvectors.
  • Significance: memory scalability. Bottleneck:
    efficiency.
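
A minimal sketch of the sparse inverse-iteration step, assuming the eigenvalues are already available (e.g., from LAPACK on T); the function name is illustrative:

```python
# Sketch: sparse inverse iteration for one eigenvector of symmetric A,
# given a computed (approximate) eigenvalue lam, so that A - lam*I is
# numerically nonsingular and can be factored once.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def inverse_iteration(A, lam, tol=1e-10, maxit=50):
    n = A.shape[0]
    lu = spla.splu((A - lam * sp.identity(n)).tocsc())  # one factorization
    v = np.random.default_rng(0).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(maxit):
        w = lu.solve(v)                 # two triangular solves per iteration
        w /= np.linalg.norm(w)
        done = np.linalg.norm(w - np.sign(w @ v) * v) < tol
        v = w
        if done:
            break
    return v
```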

9
Dense vs. Sparse Method
10
Si-nanotree
  • We consider
  • Si-nanotrees and stems.
  • 2000-2500 atoms.
  • Four-fold coordinated inner core surrounded by an
    outer surface of atoms with three-fold
    coordination.
  • Two categories
  • Tetrahedral: Si(Td)
  • Clathrate (cage-like): Si(fcc), Si(sc)
  • Inner core made of the Si clathrate structure
    consisting of fcc and sc.

11
Structure and Electronics
  • Si-Nanowire Structures
  • Top: Tetrahedral structure.
  • Middle: Clathrate structure with a 34-atom
    basis, fcc.
  • Bottom: Clathrate structure with a 46-atom
    basis, sc.
  • Reconstruction
  • Smooth for Si(34), Si(46) nanotrees.
  • The Si(Td) junction appears abrupt.
  • The cage-like arrangement allows a seamless
    connection between the stems and the branches.
  • Electronics: Achieves a smaller HOMO-LUMO gap;
    enhances conductivity due to more available
    conduction channels.

12
Summary: Quantum Nanostructures
  • GTBMD computation results for structure and
    stability studies using large-scale quantum
    mechanical simulations of Si nano-trees.
  • The sparse method reduces memory usage.
  • Si structures are stable; the electronic
    structure shows a narrow HOMO-LUMO gap.
  • Paper: HPCNano05 Workshop, "Large scale
    simulations of branched Si-nanowires."
  • Plans: Parallel spectrum slicing for sparse
    eigenvalue computation.

13
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Tuned SPEC benchmark
  • Towards PxP benchmark
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • Incomplete Cholesky Preconditioner
  • IMD preconditioner
  • Research Plans and Conclusions

14
Navier-Stokes equations
  • Sparse linear solution dominates cost.

15
Computational Fluid Dynamics
  • Three-dimensional flow problems in science and
    engineering often feature geometries that are
    homogeneous in at least one of the flow
    directions
  • e.g., a) Rayleigh-Bénard convection, b) reactor
    core cooling
  • Numerical simulation cost can be reduced by
    recasting the problem in terms of eigenvectors
    of the one-dimensional operators, yielding a
    decoupled set of lower-dimensional problems.

16
Reduced Dimension
17
Three-Step Solution
  • Transformation to wave space
  • Solution
  • Inverse transform

18
Two Dimensional Solves
  • Consider A_FE x = b
  • A_FE = L L^T; L y = b; L^T x = y.
  • Steps
  • Ordering, symbolic factorization, numeric
    factorization, triangular solution (see the
    sketch below).
  • Bottleneck: Triangular solution
  • SI scheme: Replace substitution with selective
    inverse matrix multiplications; latency tolerant.
  • Multiple right-hand sides: Reduce communication
    cost.
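
A minimal sketch of these solve steps, using a dense Cholesky factor for clarity (the thesis setting is sparse, with ordering and symbolic factorization preceding the numeric phase):

```python
# Sketch: direct solve of A_FE x = b via A_FE = L L^T.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def spd_solve(A_FE, b):
    L = cholesky(A_FE, lower=True)             # numeric factorization
    y = solve_triangular(L, b, lower=True)     # forward solve:  L y = b
    x = solve_triangular(L.T, y, lower=False)  # backward solve: L^T x = y
    return x
```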

19
Using SI scheme
20
Reducing Communication Cost using Multiple RHS
21
Summary: CFD Application
  • Reduce dimension for three-dimensional problems
    using tensor-product bases.
  • O(n log n) algorithm
  • Lower-order complexity realized by
  • Using the SI scheme.
  • Handling multiple right-hand sides.
  • Paper: 2nd MIT Conf. on Structural and Fluid
    Mechanics, "An O(n log n) Solution Algorithm for
    Spectral Element Methods."

22
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • Incomplete Cholesky Preconditioner
  • IMD preconditioner
  • Research Plans and Conclusions

23
Characteristics of Scientific Applications
  • What do we need?
  • Faster sparse computation for scientific
    applications.
  • Microprocessor design focuses on performance and
    power, with architectures built for scaling.
  • Sparse computation
  • Kernels utilize only a fraction of the available
    processing speed: a large number of data
    accesses per floating-point operation, limited
    data locality, and little data reuse.
  • Sparse applications are good candidates for
    co-managing performance and power.

24
Questions?
  • How do memory subsystem optimizations affect the
    power and performance of sparse applications?
  • How can the interactions between algorithm and
    architectural features be characterized?

25
NAS CG benchmark
  • Conjugate Gradient is the most popular sparse
    iterative solver.
  • Per iteration: one matrix-vector multiplication,
    two vector inner-products, three vector
    additions, and two floating-point divisions.
  • Matrix-vector multiplication dominates the cost
  • Need an efficient sparse matrix-vector routine;
    a textbook CG sketch follows.
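
A minimal CG sketch making the per-iteration kernel mix explicit; the SpMV line is the one worth optimizing:

```python
# Sketch: unpreconditioned CG for SPD A, showing one SpMV, two inner
# products, three vector updates, and two divisions per iteration.
import numpy as np

def cg(A, b, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                     # inner product
    for _ in range(maxit):
        Ap = A @ p                 # SpMV: the dominant cost
        alpha = rr / (p @ Ap)      # inner product + division
        x += alpha * p             # vector update
        r -= alpha * Ap            # vector update
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr         # division
        p = r + beta * p           # vector update
        rr = rr_new
    return x
```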

26
SMV Optimization: SPARSITY
  • Optimized routines for sparse matrix-vector
    multiplication.
  • Reduce index overhead by register blocking,
    adding explicit zeros to fill blocks.
  • The matrix-vector multiplication routine
    exploits this blocked structure to keep operands
    in registers.
  • OSKI also provides forward and backward
    triangular solution.

[Figure: CSR storage (row, column, values) of a
sparse matrix vs. the SPARSITY 2x1 blocked layout.]
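
A minimal sketch of the 2x1 blocked kernel, assuming an even row count and a BCSR-style layout where each stored block holds two vertically adjacent values sharing one column index:

```python
# Sketch: SpMV with SPARSITY-style 2x1 register blocking. One column
# index is loaded per block and reused for two multiplies; the two
# accumulators model values held in registers.
import numpy as np

def spmv_bcsr_2x1(row_ptr, col_idx, values, x, n_rows):
    y = np.zeros(n_rows)
    for bi in range(len(row_ptr) - 1):        # block row bi = rows 2bi, 2bi+1
        y0 = y1 = 0.0
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            xj = x[col_idx[k]]                # single index load per block
            y0 += values[2 * k] * xj
            y1 += values[2 * k + 1] * xj
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y
```
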
27
Experiment
  • Measure the power and performance difference
    between optimized and unoptimized code.
  • Software
  • SimpleScalar and Wattch.
  • Metrics
  • Execution time, power, energy.
  • Data: bcsstk31 (dimension 35,588;
    nonzeros 1,181,416)
  • Hardware
  • Base architecture: 2 floating-point units, 64-
    or 128-bit bus, 32KB data / 32KB instruction L1
    caches, 2KB L2 cache, 4MB L3 cache.
  • Level 2 prefetcher (LP), memory prefetcher (MP)

28
CG: Time and Power
Utilizing low-power modes of caches decreases
power consumption.
29
CG with LP and MP: Time
256KB cache, relative to CG-U, B, 4MB, 1GHz.
30
CG with LP and MP: Power
At 400 MHz, 40% power consumption.
Performance does not suffer with DVS.
31
Power and Performance tradeoffs
  • CG-O has more possibilities for reducing power
    without sacrificing performance
  • Cache size: 256KB.

32
Interactions between tuned code and architectures
  • The equake benchmark in SPEC CFP2000
  • Simulates the propagation of elastic waves.
  • Data structure is application specific.
  • Improve data locality.

[Figure: tuning both the matrix and vector data
structures and their layout in memory.]
33
Simulation Time and Energy
  • Memory subsystem optimization helps, but not as
    much as code tuning.
  • Tuned code without memory subsystem optimization
    is still better than untuned code.

34
Relative Incremental Improvements
  • For untuned code, LP degrades performance and MP
    does not contribute much.
  • For tuned code, LP still improves performance
    and MP helps substantially.

35
Discrepancy
  • MP makes a huge difference between tuned and
    untuned code.
  • LP still makes a 120% difference between tuned
    and untuned code.

36
Towards PxP benchmark
  • Performance analysis with such tuned codes in
    addition to the traditional benchmarks will
    enable a more comprehensive assessment of
    architectural optimizations for future high end
    systems.
  • Provide a parameterized synthetic benchmark
    built around a sparse iterative solver with
    multiple levels of optimization.
  • Tunable levels of optimization: ordering (RCM,
    MMD), preconditioner type (ICC(0), ICC(1), ICT),
    blocking size, etc.; a sample configuration is
    sketched below.
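
A hedged sketch of what such a parameterization might look like; the keys and values below are illustrative, not an existing API:

```python
# Sketch: one point in the PxP benchmark's parameter space.
config = {
    "ordering": "RCM",            # or "MMD"
    "preconditioner": "ICC(1)",   # or "ICC(0)", "ICT"
    "block_size": (2, 1),         # register-blocking dimensions
    "solver": "CG",
}
```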

37
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Tuned SPEC benchmark
  • Towards PxP benchmark
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • Research Plans and Conclusions

38
Sparse Linear System Solution
  • A is symmetric and positive definite
  • Conjugate Gradients (CG) with an incomplete
    Cholesky factor as a preconditioner.
  • During Cholesky factorization, some elements of
    A that are zero will fill in.

39
Two types of Incomplete Cholesky Factorization
  • An incomplete factor is obtained by performing
    Cholesky factorization while selectively
    discarding some fraction of the fill-in nonzeros.
  • Level-of-fill (k)
  • L(i,j) ≠ 0 if there is an i-j fill path in G(A).
  • L(k)(i,j) ≠ 0 if there is an i-j fill path of
    length at most (k+1).
  • Drop-threshold
  • L(i,j) ≠ 0 only if
    |L(i,j)| > sqrt(L(i,i) x L(j,j)) x threshold.
  • The final nonzero structure of the factor is
    known only after factorization.
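
As a small illustration, the drop-threshold rule above reduces to a one-line test (a sketch; real codes apply it inside the factorization loop):

```python
# Sketch: retain L(i,j) only if it is large relative to the geometric
# mean of the corresponding diagonal entries.
import math

def keep_entry(L_ij, L_ii, L_jj, threshold):
    return abs(L_ij) > math.sqrt(L_ii * L_jj) * threshold
```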

40
Traditional Approach
  • Traditional approach
  • Incomplete Cholesky factorization based on a
    predetermined ordering.
  • Using fill-reducing orderings
  • MMD, RCM, ND.
  • Drawbacks
  • The ordering is chosen to reduce fill-in for the
    complete Cholesky factor, not the incomplete one.
  • Since the ordering is determined in advance, it
    is not easy to accommodate alternative pivoting
    sequences based on numerical magnitude.

41
Interleaved Minimum Degree Ordering (IMD)
  • Consider sparse Cholesky factorization at step i
  • Incomplete Cholesky uses Ĥ_i = H_i - E_i, where
    E_i is the error matrix corresponding to values
    dropped from v_i and H_(i+1).

[Figure: matrix at step i partitioned into the
pivot d_ii, the column v_i, and the trailing block
B_(i+1).]
42
IMD Algorithm
  • Â = D^(-1/2) A D^(-1/2)
  • While L̂ has not been computed do
  • While there are unnumbered columns do
  • Select a column i with minimum degree, number it
  • Compute column L(:,i)
  • If L(i,i) <= 0 then
  • Shift diagonal by α and set α = 2α
  • Else
  • For all remaining unnumbered columns j do
  • Retain/drop L(j,i) using level-of-fill/threshold
    criterion
  • End for
  • For all remaining unnumbered columns j do
  • Update L(:,j) if L(j,i) is nonzero
  • Drop/retain L(p,j) using level-of-fill/threshold
    criterion
  • End for
  • End if
  • End while
  • End while
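
A minimal dense sketch of this loop, using threshold dropping only; the real implementation works on sparse elimination graphs with a heap-based degree queue (slide 75), and the names here are illustrative:

```python
# Sketch: interleaved minimum-degree ordering and threshold IC.
import numpy as np

def imd_ict(A, threshold=0.01):
    n = A.shape[0]
    d = np.sqrt(np.diag(A))
    A_hat = A / np.outer(d, d)                 # D^(-1/2) A D^(-1/2)
    alpha = 0.0
    while True:                                # restart with a larger shift
        H = A_hat + alpha * np.eye(n)
        L = np.zeros((n, n))
        perm, active, failed = [], list(range(n)), False
        while active:
            act = np.array(active)
            degs = (H[np.ix_(act, act)] != 0).sum(axis=1)
            i = active.pop(int(np.argmin(degs)))   # minimum-degree column
            perm.append(i)
            if H[i, i] <= 0:                   # breakdown: shift and restart
                alpha = 1e-3 if alpha == 0 else 2 * alpha
                failed = True
                break
            piv = np.sqrt(H[i, i])
            L[i, i] = piv
            if active:
                act = np.array(active)
                col = H[act, i] / piv
                col[np.abs(col) < threshold] = 0.0   # drop before updating
                L[act, i] = col
                H[np.ix_(act, act)] -= np.outer(col, col)  # rank-1 update
            H[i, :] = 0.0
            H[:, i] = 0.0
        if not failed:
            return perm, L, alpha
```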

43
Experimental Results
  • Quality of preconditioners
  • Scaled fill
  • Number of nonzeros in the preconditioner
    relative to the number of nonzeros in the
    original matrix, i.e., |L|/|A|.
  • Iterations
  • Number of iterations for PCG to converge.
  • Operations
  • Product: scaled fill x iterations.
  • Proportional to the number of operations
    required to apply the preconditioner.

44
Experiment
  • 34 matrices
  • IMD, MMD, and RCM orderings.
  • Four levels of fill (0, 1, 2, and 3)
  • IMDF, IMDF-SD, IMDF-BD, MMD-ICF, RCM-ICF.
  • Four thresholds (0.1, 0.05, 0.01, and 0.001)
  • IMDT, IMDT-SD, IMDT-BD, MMD-ICT, RCM-ICT.
  • 272 tests per method.
  • 1,360 tests over all five schemes.
  • Converge to a 1e-6 relative residual, or stop at
    1000 iterations.
  • If α > 1 (for A + αI), declare failure.

45
Cumulative Iterations
             IMDT    BD    SD   MMD   RCM
Scaled Fill  1.16  1.05  1.04  1.04   1.0
Iterations   0.71  0.58  0.53  1.11   1.0
46
Cumulative Operations
             IMDT    BD    SD   MMD   RCM
Operations   0.83  0.62  0.56  1.16   1.0
47
Cumulative Time
             IMDT    BD    SD   MMD   RCM
L            2.27  5.18  5.11  2.78   1.0
L + 4 RHS    0.72  0.76  0.69  1.17   1.0
48
Quality of ordering
Error from two orderings
49
Fill and Dropped Elements
  • Amount of fill-in relative to the incomplete
    factor, and dropped elements relative to the
    incomplete factor.

50
Summary - IMD
  • Developed IMD schemes for effective
    preconditioning through interleaving of greedy
    minimum degree ordering and incomplete
    factorization.
  • Operations
  • RCM-IC 110%, MMD-IC 152%.
  • Time
  • IMD requires 6% more time than RCM-ICF.
  • For four-system solution time, IMD-SD requires
    the least time.
  • IMD approximates better by using the degree
    metric.
  • Paper: SIAM Journal on Matrix Analysis and
    Applications, "Effective Preconditioning Through
    Ordering Interleaved with Incomplete
    Factorization." (18 pages)

51
Outline
  • Introduction and Background
  • Scientific Applications
  • Computational Quantum Mechanics
  • Computational Fluid Dynamics Applications
  • Performance-Power Characteristics of Sparse
    Scientific Applications
  • Tuned SPEC benchmark
  • Towards PxP benchmark
  • Interleaved Minimum Degree (IMD) Preconditioner
    for Symmetric Positive Definite Matrix
  • Incomplete Cholesky Preconditioner
  • IMD preconditioner
  • Research Plans and Conclusions

52
Conclusions
  • Explore the role of sparse linear solvers in
    scientific and engineering applications
  • CFD
  • Reduce computational complexity.
  • Quantum nanomechanics
  • Enable large-scale simulations by reducing
    memory usage.
  • Analyze the interaction between sparse kernels
    and architectural features
  • Performance can be improved while system power
    is reduced.
  • Developing the Interleaved Minimum Degree (IMD)
    preconditioner
  • Stable, and accelerates iterative solution.

53
Contributions thus far
  • Application-Specific Sparse Solvers
  • Paper: HPCNano05 Workshop, "Large scale
    simulations of branched Si-nanowires." (5 pages)
  • Paper: 2nd MIT Conf. on Structural and Fluid
    Mechanics, "An O(n log n) Solution Algorithm for
    Spectral Element Methods." (8 pages)
  • Characterizing interactions between architecture
    optimizations and sparse solvers
  • Submitted: "Memory Optimizations for Tuned
    Scientific Applications: An Evaluation of
    Performance-Power Characteristics," ISPASS06.
  • In submission: "Conjugate Gradient Sparse
    Iterative Solvers: Performance-Power
    Characteristics," HP-PAC.
  • General-purpose sparse solvers
  • Paper: SIAM Journal on Matrix Analysis and
    Applications, "Effective Preconditioning Through
    Ordering Interleaved with Incomplete
    Factorization." (18 pages)

54
Plans
  • Generalize the IMD algorithm to nonsymmetric
    matrices
  • Using constrained diagonal Markowitz.
  • Preliminary results show that it is more robust
    than ILUT.
  • Provide a synthetic code for benchmarking using
    highly optimized sparse kernels
  • Sparse iterative solvers.
  • Parameterize to support different implementations
    and phases.
  • Parallel spectrum slicing for eigenvalue
    computation.

55
Thanks
  • Q & A

56
What's the Main Factor?
  • High performance computers.
  • Efficient sparse linear solvers.
  • Goal
  • Improving the performance of sparse linear
    solutions and understanding how such schemes and
    high performance architectural features interact.

57
Computational Nanotechnology
  • Length and time scales of nanoscale systems and
    phenomena
  • have shrunk to where we can directly address
    them with computer simulations and theoretical
    modeling with high accuracy.
  • Increasing computer power, used for large-scale
    and high-fidelity simulations,
  • makes it possible for nanoscale simulations to
    be predictive.
  • Computational nanotechnology is
  • emerging as an engineering analysis tool for
    novel nanodevice design.

58
Schrödinger Equations
  • Describes atoms as a collection of quantum
    mechanical particles, nuclei, and electrons,
    governed by the Schrödinger equation
  • Generalized Tight Binding Molecular Dynamics
    (GTBMD)
  • Parameterize the Hamiltonian, H(R_I).
  • The repulsive pair potential depends on the
    distance.
  • The bonding energy is a constant that shifts the
    zero of energy.

59
Computational Nanotechnology
  • Classical MD
  • Newtonian mechanics. Less accurate,
    computationally cheap.
  • Tight Binding MD
  • Semiempirical, parameterized.
  • Ab Initio (first-principles)
  • Schrödinger equations: Born-Oppenheimer,
    Car-Parrinello. Accurate, high computational
    cost.

Classical MD is not accurate enough, and ab initio
computations are not feasible; tight binding MD
sits in between.
60
Schrödinger Equations
  • Describes atoms as a collection of quantum
    mechanical particles, nuclei, and electrons,
    governed by the Schrödinger equation
  • Born-Oppenheimer
  • Electronic degrees of freedom follow the
    corresponding nuclear positions

61
Generalized Tight Binding MD
  • Obtaining H_ij and S_ij
  • Nonorthogonality between two sp3 hybrids.

62
Generalized Tight Binding MD
  • Construct Hamiltonian
  • Nonorthogonal basis of atomic orbitals
  • Hamiltonian and overlap matrix elements
  • One-electron energies are obtained by solving the
    generalized eigenvalue equation

63
GTBMD Advantage
  • GTBMD
  • Computationally efficient because we can
    parameterize the Hamiltonian, H(R_I).
  • Transferable parametrization scheme, by
    explicitly including the effects of the
    nonorthogonality of the atomic basis.
  • Can find an energy-minimized structure of a
    nanoscale system under consideration without
    symmetry constraints.

64
Problems?
  • Inverse Iteration
  • Once the eigenvalues are computed, the
    eigenvectors are computed using factorizations of
    the original sparse matrix.
  • Re-orthogonalization problem.
  • Reduced storage.
  • Sparse inverse iteration.

65
Lanczos Method
  • Iterative method for sparse symmetric eigenvalue
    problems
  • Dominant cost is sparse matrix-vector
    multiplication (Ax = y).
  • Suitable for finding several eigenvalues and
    eigenvectors with maximum absolute values
  • Not suitable for finding all the eigenvalues.
  • By using a shifted inverse implicitly, the
    method can find eigenvalues near the shift
    (eigenvectors of A - λI)
  • Similar concept to inverse iteration.
  • Called shifted inverse Lanczos.
  • Can find several eigenpairs per factorization.

66
Si-nanowire Matrix Structure
[Figure: nonzero structures of the Si40_stem_1776
and Si40_tree_1872 matrices.]
Si_tree is close to a block-diagonal matrix.
67
ScaLAPACK Results
68
Electronics
  • HOMO-LUMO gap (highest occupied molecular
    orbital - lowest unoccupied molecular orbital)
  • Stems (0.57 eV), trees (0.16 eV), and the bulk
    diamond phase of silicon (1.1 eV).
  • Trees
  • More states from the unoccupied levels are
    pulled in towards the Fermi level, E_F (dashed).
  • Enhanced conductivity due to more conduction
    channels being available.

Electronic densities of states (DOS)
69
Shifted Inverse Lanczos for Sparse Matrix
  • Eigenvalues near the shift converge very quickly.
  • If all eigenvalues are known approximately, they
    can be used as shifts:
  • Separate eigenvalues into groups according to
    their distribution (spectrum slicing).
  • Select a shift in the middle of each group.
  • Run a Lanczos iteration for each group, as
    sketched below.
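
A minimal sketch of this slicing loop, assuming rough eigenvalue estimates are already grouped; scipy's eigsh in shift-invert mode converges fastest to the eigenvalues nearest each shift:

```python
# Sketch: spectrum slicing with shift-invert Lanczos. Each call factors
# (A - sigma*I) once and recovers the whole slice around sigma.
import numpy as np
from scipy.sparse.linalg import eigsh

def sliced_eigs(A, groups):
    """groups: list of 1-D arrays of estimated eigenvalues per slice."""
    pairs = []
    for g in groups:
        sigma = float(np.median(g))    # shift in the middle of the group
        w, V = eigsh(A, k=len(g), sigma=sigma, which='LM')
        pairs.append((w, V))
    return pairs
```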

70
Computational Fluid Dynamics
  • Numerical simulation cost can be reduced by
    recasting the problem in terms of eigenvectors
    of the one-dimensional operators, yielding a
    decoupled set of lower-dimensional problems.
  • Reduce three-dimensional problems to independent
    two-dimensional subproblems.

71
Problem Formulation
72
Solution Cost
  • Transform from physical to wave space and
    inverse transform
  • Matrix-matrix products.
  • Efficient on all architectures.
  • If nz << nxy,
  • Cost of applying S or S^T is subdominant to that
    of solving the two-dimensional subproblems.
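
A minimal sketch of this cost structure with illustrative sizes; S below is a random orthogonal stand-in for the 1-D eigenvector basis:

```python
# Sketch: forward/inverse wave-space transforms as matrix-matrix
# products along z. Cost O(nxy * nz^2) is subdominant when nz << nxy.
import numpy as np

nz, nxy = 32, 100_000
rng = np.random.default_rng(0)
S = np.linalg.qr(rng.standard_normal((nz, nz)))[0]  # orthogonal stand-in
u = rng.standard_normal((nxy, nz))                  # nxy points per z-line

u_hat = u @ S        # transform to wave space
# ... solve nz independent two-dimensional subproblems on u_hat ...
u_new = u_hat @ S.T  # inverse transform (S orthogonal here)
```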

73
Preconditioning
  • Reduced SE problems
  • Solved by using conjugate gradient iteration.
  • Preconditioning with a system A_FE based on a
    finite-element discretization having nodes
    coincident with the SE nodes.
  • Finite-element preconditioning gives bounded
    iteration counts.

74
Parallel Solver
  • Numeric factorization
  • Performed effectively on multiprocessors with
    multifrontal or column-block methods.
  • Triangular solution becomes a numerical
    bottleneck.
  • Remedies
  • Using the Selective Inverse (SI) scheme.
  • Using multiple right-hand sides.

75
Implementation Details
  • Implementation
  • Tie-breaking
  • Random (IMD).
  • Big diagonal first (IMD-BD)
  • Stability through pivoting.
  • Small diagonal first (IMD-SD)
  • Indistinguishable node effect.
  • Heap structure to reduce the degree-update
    overhead
  • For tie-breaking by numerical measure, set up a
    degree queue list and an index reference for
    each column.

[Figure: degree queue list with per-column index
reference.]
76
Failures?
  • Manteuffel
  • Shows that if Â + αI is an H-matrix, then the
    incomplete Cholesky factorization of Â + αI
    succeeds.
  • Jennings & Malik
  • If a_ij(k) is dropped, set
    a_ii(k) = a_ii(k) + s|a_ij(k)| and
    a_jj(k) = a_jj(k) + (1/s)|a_ij(k)|.
  • Jones & Plassmann
  • Increase α by constants (1e-2).
  • Bouaricha, More, and Wu
  • α = ½||Â||_∞.
  • Lin & More
  • Start with α = 1e-3 and multiply by two.
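
A minimal sketch of the Lin-More style strategy on this slide, where ichol stands in for any incomplete Cholesky routine that raises on a nonpositive pivot (the callback is hypothetical):

```python
# Sketch: retry incomplete Cholesky with a doubling diagonal shift.
import numpy as np

def shifted_ic(A, ichol, alpha0=1e-3):
    alpha = 0.0
    while True:
        try:
            return ichol(A + alpha * np.eye(A.shape[0])), alpha
        except (np.linalg.LinAlgError, ValueError):   # breakdown
            alpha = alpha0 if alpha == 0.0 else 2 * alpha
```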

77
Minimum Discarded Fill (MDF, Tang)
  • Motivated by the observation
  • Ĥ_(i+1) = H_(i+1) - E_(i+1)
            = B_(i+1) - v_i v_i^T / d_ii - E_(i+1).
  • Eliminate next the node i that minimizes the
    Frobenius norm of the discarded fill matrix,
    F_i(E_i).
  • Shows better performance than RCM ordering, but
    is expensive to compute.

78
What should be considered
  • Ordering schemes use structural metrics to
    reduce fill in the complete factor L rather than
    the incomplete factor L̂.
  • Ordering should be interleaved with numeric
    factorization to compute L̂.
  • Greedy minimum degree scheme based on the
    elimination graphs of Ĥ.

79
Fails and Diagonal Shift
  • Diagonal shift
  • IMD (2.2), SD (1.9), BD (1.8).
  • MMD (4.1), RCM (3.7).
  • Fails
  • MMD (22), RCM (19).
  • The IMD methods are much more stable.

80
Compare with MDF
81
CG with LP and MP: Power and Energy
82
Derived Metric
  • RI (Relative Improvement)
  • RII (Relative Incremental Improvement)
  • Discrepancy in RII