Transcript and Presenter's Notes

Title: Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures


1
Evaluating Sparse Linear System Solvers on
Scalable Parallel Architectures
  • Ananth Grama and Ahmed Sameh
  • Department of Computer Science
  • Purdue University.
  • http://www.cs.purdue.edu/people/sameh/ayg

Linear Solvers Grant Kickoff Meeting, 9/26/06.
2
Evaluating Sparse Linear System Solvers on
Scalable Parallel Architectures
  • Project Overview
  • Identify sweet spots in the
    algorithm-architecture-programming model space
    for efficient sparse linear system solvers.
  • Design a new class of highly scalable sparse
    solvers suited to petascale HPC systems.
  • Develop analytical frameworks for performance
    prediction and projection.
  • Objectives and Methodology
  • Design scalable sparse solvers (direct,
    iterative, and hybrid) and evaluate their
    scaling/communication characteristics.
  • Evaluate architectural features and their impact
    on scalable solver performance.
  • Evaluate performance and productivity aspects of
    programming models -- PGAS languages (CAF, UPC) and MPI.
  • Challenges and Impact
  • Generalizing the space of parallel sparse linear
    system solvers.
  • Analysis and implementation on parallel
    platforms.
  • Performance projection to the petascale.
  • Guidance for architecture and programming model
    design / performance envelope.
  • Benchmarks and libraries for HPCS.
  • Milestones / Schedule
  • Final deliverable: Comprehensive evaluation of
    scaling properties of existing (and novel)
    sparse solvers.
  • Six-month target: Comparative performance of
    solvers on multicore SMPs and clusters.
  • 12-month target: Evaluation of these solvers on
    Cray X1, BG, JS20/21, for CAF/UPC/MPI
    implementations.

3
Introduction
  • A critical aspect of high productivity is the
    identification of points/regions in the
    algorithm/architecture/programming model space
    that are amenable to implementation on petascale
    systems.
  • This project aims at identifying such points for
    commonly used sparse linear system solvers, and
    at developing more robust novel solvers.
  • These novel solvers emphasize a reduction in
    memory/remote accesses at the expense of
    (possibly) higher FLOP counts, yielding much
    better actual performance.

4
Project Rationale
  • Sparse system solvers govern the overall
    performance of many CSE applications on HPC
    systems.
  • Design of HPC architectures and programming
    models should be influenced by their suitability
    for such solvers and related kernels.
  • The extreme need for concurrency on novel
    architectural models requires a fundamental
    re-examination of conventional sparse solvers.

5
Project Goals
  • Develop generalizations of direct and iterative
    solvers, e.g., the Spike polyalgorithm.
  • Implement such generalizations on various
    architectures (multicore, multicore SMPs,
    multicore SMP aggregates) and programming models
    (PGAS languages, messaging APIs).
  • Analytically quantify performance and project to
    petascale platforms.
  • Compare relative performance, identify
    architecture/programming model features, and
    guide algorithm/ architecture/ programming model
    co-design.

6
Background
  • Personnel
  • Ahmed Sameh, Samuel Conte Professor of Computer
    Science, has worked on development of parallel
    numerical algorithms for four decades.
  • Ananth Grama, Professor and University Scholar,
    has worked both on numerical aspects of sparse
    solvers, as well as analytical frameworks for
    parallel systems.
  • A postdoctoral researcher (to be named) will be
    primarily responsible for implementation and
    benchmarking.
  • We have identified three candidates for this
    position and will shortly be hiring one of them.

7
Background
  • Technical
  • We have built extensive infrastructure for
    parallel sparse solvers including the Spike
    parallel toolkit, augmented-spectral ordering
    techniques, and multipole-based preconditioners.
  • We have diverse hardware infrastructure,
    including Intel/AMD multicore SMP clusters,
    JS20/21 Blade servers, BlueGene/L, and Cray X1.

8
Background
  • Technical
  • We have initiated installation of Co-Array
    Fortran and Unified Parallel C on our machines
    and begun porting our toolkits to these PGAS languages.
  • We have extensive experience in analysis of
    performance and scalability of parallel
    algorithms, including development of the
    isoefficiency metric for scalability.

9
Technical Highlights
  • The SPIKE Toolkit

10
SPIKE Introduction
(Figure: matrix structure after RCM reordering)
  • Engineering problems usually produce large sparse
    linear systems
  • Banded (or banded with low-rank perturbations)
    structure is often obtained after reordering
  • SPIKE partitions the banded matrix into a block
    tridiagonal form
  • Each partition is associated with one CPU or
    one node → multilevel parallelism

11
SPIKE Introduction
12
SPIKE Introduction
AX = F  ⇒  SX = diag(A1⁻¹, …, Ap⁻¹) F
Reduced system: size (p − 1) × 2m
(Figure: spike matrix with spikes V1, V2, V3 and W2, W3, W4)
Retrieve solution
A is n × n; Bj, Cj are m × m, with m ≪ n
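To make the equations above concrete, here is a minimal NumPy sketch of the two-partition case (p = 2) with dense diagonal blocks A1, A2 and m × m coupling blocks B1, C2. The function name and the dense, two-partition setting are illustrative assumptions, not the SPIKE toolkit's interface.

```python
# Minimal NumPy sketch of SPIKE for p = 2 partitions with dense diagonal
# blocks; names and the 2-partition restriction are illustrative only.
import numpy as np

def spike_solve_2part(A1, A2, B1, C2, f1, f2, m):
    """Solve the coupled system in which B1 links the bottom m rows of
    partition 1 to the top m unknowns of partition 2, and C2 links the
    top m rows of partition 2 back to the bottom m unknowns of partition 1."""
    n1, n2 = A1.shape[0], A2.shape[0]
    # Spikes: V1 = A1^{-1} [0; B1],  W2 = A2^{-1} [C2; 0]
    rhsV = np.zeros((n1, m)); rhsV[-m:, :] = B1
    rhsW = np.zeros((n2, m)); rhsW[:m, :] = C2
    V1, W2 = np.linalg.solve(A1, rhsV), np.linalg.solve(A2, rhsW)
    g1, g2 = np.linalg.solve(A1, f1), np.linalg.solve(A2, f2)
    # Reduced system of size 2m: couples the bottom of partition 1 (x1b)
    # with the top of partition 2 (x2t).
    R = np.block([[np.eye(m), V1[-m:, :]],
                  [W2[:m, :], np.eye(m)]])
    t = np.linalg.solve(R, np.concatenate([g1[-m:], g2[:m]]))
    x1b, x2t = t[:m], t[m:]
    # Retrieve the full solution from the reduced one
    x1 = g1 - V1 @ x2t
    x2 = g2 - W2 @ x1b
    return np.concatenate([x1, x2])
```

The only coupling between partitions is the small reduced system (2m × 2m here, (p − 1) × 2m in general), which is what keeps inter-processor communication low.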
13
SPIKE: A Hybrid Algorithm
Different choices depending on the properties of
the matrix and/or platform architecture
  • The spikes can be computed
  • Explicitly (fully or partially)
  • On the Fly
  • Approximately
  • The diagonal blocks can be solved
  • Directly (dense LU, Cholesky, or sparse
    counterparts)
  • Iteratively (with a preconditioning strategy)
  • The reduced system can be solved
  • Directly (Recursive SPIKE)
  • Iteratively (with a preconditioning scheme)
  • Approximately (Truncated SPIKE)

14
The SPIKE algorithm
Hierarchy of Computational Modules (systems dense
within the band)
Level 3: SPIKE
Level 2: LAPACK blocked algorithms
Level 1: Primitives for banded matrices (our own)
Level 0: BLAS3 (matrix-matrix primitives)
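For the "dense within the band" case, the level-0 through level-2 kernels correspond to standard BLAS3 and banded LAPACK routines (the ?gbtrf/?gbsv family). A small illustration through SciPy's wrapper, not the toolkit's own Fortran primitives:

```python
# Illustrative only: a banded LU solve through SciPy's LAPACK wrapper (?gbsv),
# standing in for the level-1/2 banded primitives listed above.
import numpy as np
from scipy.linalg import solve_banded

n, kl, ku = 1000, 2, 2                   # sub- and super-diagonal half-bandwidths
ab = np.random.rand(kl + ku + 1, n)      # LAPACK banded storage: ab[ku + i - j, j] = A[i, j]
ab[ku, :] += 10.0                        # boost the main diagonal (diagonal dominance)
b = np.random.rand(n)
x = solve_banded((kl, ku), ab, b)        # banded LU factorization + solve
```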
15
SPIKE versions
Columns: E = Explicit, R = Recursive, T = Truncated, F = on the Fly. Rows give the factorization used for the diagonal blocks.

P (LU with pivoting)
  E: explicit generation of spikes; the reduced system is solved iteratively with a preconditioner.
  R: explicit generation of spikes; the reduced system is solved directly using recursive SPIKE.
  F: implicit generation of the reduced system, which is solved on the fly using an iterative method.

L (LU without pivoting)
  E: explicit generation of spikes; the reduced system is solved iteratively with a preconditioner.
  R: explicit generation of spikes; the reduced system is solved directly using recursive SPIKE.
  T: truncated generation of the spike tips (Vb is exact, Wt is approximate); the reduced system is solved directly.
  F: implicit generation of the reduced system, which is solved on the fly using an iterative method.

U (LU and UL without pivoting)
  T: truncated generation of the spike tips (Vb, Wt are exact); the reduced system is solved directly.
  F: implicit generation of the reduced system, which is solved on the fly using an iterative method with preconditioning.

A (alternating LU/UL)
  E: explicit generation of spikes using the new partitioning; the reduced system is solved iteratively with a preconditioner.
  T: truncated generation of spikes using the new partitioning; the reduced system is solved directly.
16
SPIKE Hybrids
  • 1. SPIKE version
  • R = recursive, E = explicit
  • F = on-the-fly, T = truncated
  • 2. Factorization
  • No pivoting:
  • L = LU
  • U = LU and UL
  • A = alternating LU and UL
  • Pivoting:
  • P = LU
  • 3. Solution improvement
  • 0 = direct solver only
  • 2 = iterative refinement
  • 3 = outer Bicgstab iterations

17
SPIKE on-the-fly
  • Does not require generating the spikes explicitly
  • Ideally suited for banded systems with large
    sparse bands
  • The reduced system is solved iteratively,
    with or without a preconditioning strategy
    (a matrix-free sketch follows below).
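In the two-partition notation used earlier, the on-the-fly idea can be sketched as a matrix-free operator: the action of the reduced system on a vector costs one solve per (pre-factored) diagonal block, so the spikes are never formed. This sketch assumes SciPy and sparse diagonal blocks; the function name and setup are illustrative, not the toolkit's API.

```python
# Hedged sketch (two partitions): apply the reduced system matrix-free, so the
# spikes V1, W2 are never generated explicitly. A1, A2 are sparse diagonal
# blocks; B1, C2 are the m x m coupling blocks (dense arrays here).
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu, LinearOperator, gmres

def reduced_operator(A1, A2, B1, C2, m):
    lu1, lu2 = splu(csc_matrix(A1)), splu(csc_matrix(A2))   # factor each block once
    n1, n2 = A1.shape[0], A2.shape[0]

    def matvec(y):
        y1b, y2t = y[:m], y[m:]                  # bottom of partition 1, top of partition 2
        r1 = np.zeros(n1); r1[-m:] = B1 @ y2t    # [0; B1] y2t
        r2 = np.zeros(n2); r2[:m]  = C2 @ y1b    # [C2; 0] y1b
        v = lu1.solve(r1)[-m:]                   # bottom tip of A1^{-1} [0; B1] y2t
        w = lu2.solve(r2)[:m]                    # top tip of A2^{-1} [C2; 0] y1b
        return y + np.concatenate([v, w])        # reduced-system product R y
    return LinearOperator((2 * m, 2 * m), matvec=matvec, dtype=np.float64)

# The reduced system R y = [g1_bottom; g2_top] can then be handed to any
# iterative solver, e.g.  y, info = gmres(reduced_operator(A1, A2, B1, C2, m), rhs)
```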

18
Numerical Experiments (dense within the band)
  • Computing platforms
  • 4-node Linux Xeon cluster
  • 512-processor IBM-SP
  • Performance comparison with LAPACK and ScaLAPACK

19
SPIKE Scalability
b = 401, RHS = 1
IBM-SP
Spike (RL0)
Speed improvement, Spike vs. ScaLAPACK:
procs      8     16    32    64    128   256   512
N = 0.5M   7.2   8.2   7.9   7.0   6.0   5.0   4.1
N = 1M           8.4   8.2   7.7   6.9   5.7   4.8
N = 2M                 8.3   8.0   7.6   6.0   5.4
N = 4M                       8.2   8.0   7.4   6.4
20
SPIKE partitioning - 4-processor example
(Figure: two partitionings of the banded matrix for 4 processors, with diagonal blocks A1, A2, A3, coupling blocks B1, B2, C2, C3, and the LU/UL factorizations (without pivoting) assigned to processors 1, 2-3, and 4.)
21
SPIKE: Small number of processors
  • ScaLAPACK needs at least 4-8 processors to
    perform as well as LAPACK.
  • SPIKE benefits from the LU-UL strategy and
    realizes speed improvement over LAPACK on 2 or
    more processors.

n = 960,000, b = 201
4-node Intel Xeon Linux cluster with InfiniBand
interconnect; two 3.2 GHz processors per node;
4 GB of memory per node; 1 MB cache; 64-bit
arithmetic. Intel Fortran, Intel MPI, and Intel MKL
libraries for LAPACK and ScaLAPACK.
The system is diagonally dominant.
22
General sparse systems
  • Science and engineering problems often produce
    large sparse linear systems
  • Banded (or banded with low-rank perturbations)
    structure is often obtained after reordering
  • While most ILU preconditioners depend on
    reordering strategies that minimize the fill-in
    in the factorization stage, we propose to
  • extract a narrow banded matrix, via a
    reordering-dropping strategy, to be used as a
    preconditioner.
  • make use of an improved SPIKE on-the-fly
    scheme that is ideally suited for banded systems
    that are sparse within the band.

23
Sparse parallel direct solvers and banded systems
N = 432,000, b = 177, nnz = 7,955,116, fill-in of the band = 10.4

PARDISO           Reordering   Factorization   Solve
1 CPU  (1 node)   19.0         4.13            0.83
2 CPUs (1 node)   18.3         3.11            0.83

SuperLU           Reordering   Factorization   Solve
1 CPU  (1 node)   10.3         17              8.4
2 CPUs (2 nodes)  8.2          37              12.5
4 CPUs (4 nodes)  8.2          41              16.7
8 CPUs (4 nodes)  7.6          178             15.9
(Memory swap: too much fill-in.)

MUMPS             Reordering   Factorization   Solve
1 CPU  (1 node)   16.2         6.3             0.8
2 CPUs (2 nodes)  17.2         4.2             0.6
4 CPUs (4 nodes)  17.8         3               0.55
8 CPUs (4 nodes)  17.7         20              1.9
24
Multilevel Parallelism: SPIKE calling MKL-PARDISO
for banded systems that are sparse within the
band
  • This SPIKE hybrid scheme exhibits better
    performance than other parallel direct sparse
    solvers used alone.

25
SPIKE-MKL on-the-fly for systems that are
sparse within the band
N = 432,000, b = 177, nnz = 7,955,116, sparsity of
the band = 10.4
  • MUMPS: time (2 nodes) = 21.35 s; time (4 nodes) =
    39.6 s
  • For narrow banded systems, SPIKE will consider
    the matrix dense within the band.
  • Reordering schemes for minimizing the
    bandwidth can be used if necessary.

N = 471,800, b = 1455, nnz = 9,499,744, sparsity
of the band = 1.4
  • Good scalability using on-the-fly SPIKE scheme

26
A Preconditioning Approach
  • Reorder the matrix to bring most of the elements
    within a band via
  • HSL's MC64 (to maximize the sum or product of the
    diagonal elements)
  • RCM (Reverse Cuthill-McKee) or MD (Minimum
    Degree)
  • Extract a band from the reordered matrix to use
    as preconditioner.
  • Use an iterative method (e.g. Bicgstab) to solve
    the linear system (outer solver).
  • Use SPIKE to solve the system involving the
    preconditioner (inner solver); a single-node
    sketch of this pipeline follows below.
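A hedged, single-node SciPy sketch of the approach just described: RCM replaces MC64 (which SciPy does not provide), and a sequential sparse LU stands in for the parallel SPIKE inner solver, so this illustrates only the structure of the pipeline, not the project's implementation. Function and parameter names are illustrative.

```python
# Illustrative sketch: RCM reordering, band extraction, and an outer Bicgstab
# preconditioned by the extracted band. MC64 scaling and the parallel SPIKE
# inner solver are replaced here by SciPy equivalents.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import splu, LinearOperator, bicgstab

def banded_precond_solve(A, b, k=50):
    """Solve A x = b; the preconditioner is the band |i - j| <= k of the reordered matrix."""
    A = sp.csr_matrix(A)
    perm = reverse_cuthill_mckee(A)                # pull entries toward the diagonal
    Ap = A[perm, :][:, perm].tocsc()
    bp = b[perm]

    band = sp.triu(sp.tril(Ap, k), -k).tocsc()     # extracted banded preconditioner
    lu = splu(band)                                # stands in for the SPIKE inner solve
    M = LinearOperator(Ap.shape, matvec=lu.solve, dtype=np.float64)

    y, info = bicgstab(Ap, bp, M=M)                # outer iteration
    return y[np.argsort(perm)], info               # undo the symmetric permutation
```

The design point is that the preconditioner solve has a predictable, narrow-band structure, which is exactly what SPIKE handles with low communication in the parallel setting.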

27
Matrix DW8192 (square dielectric waveguide): order N = 8192,
NNZ = 41,746, condest(A) = 1.39e+07, sparsity = 0.0622%,
A + Aᵀ indefinite
28
GMRES + ILU preconditioner vs. Bicgstab with
banded preconditioner
  • GMRES without preconditioner: > 5000 iterations
  • GMRES + ILU (no fill-in): fails
  • GMRES + ILU (drop tol. 1e-3): fails
  • GMRES + ILU (drop tol. 1e-4): 16 iters., residual ≈ 10⁻⁵
  • NNZ(L) = 612,889
  • NNZ(U) = 414,661
  • Spike - Bicgstab:
  • preconditioner of bandwidth 70
  • 16 iters., residual ≈ 10⁻⁷
  • NNZ = 573,440

29
Comparisons on a 4-node Xeon Linux Cluster (2
CPUs/node)
  • SuperLU
  • 1 node -- 3.56 s
  • 2 nodes -- 1.87 s
  • 4 nodes -- 3.02 s (memory limitation)
  • MUMPS
  • 1 node -- 0.26 s
  • 2 nodes -- 0.36 s
  • 4 nodes -- 0.34 s
  • SPIKE RL3 (bandwidth 129)
  • 1 node -- 0.28 s
  • 2 nodes -- 0.27 s
  • 4 nodes -- 0.20 s
  • Bicgstab required 6 iterations only
  • Norm of rel. res. ≈ 10⁻⁶

Matrix DW8192
Sparse Matvec kernel needs fine-tuning
30
Matrix BMW7st_1 (stiffness matrix)
  • Order N = 141,347, NNZ = 7,318,399

(Figure: matrix structure after RCM reordering)
31
Comparisons on a 4-node Xeon Linux Cluster (2
CPUs/node)
  • SuperLU
  • 1 node -- 26.56 s
  • 2 nodes -- 21.17 s
  • 4 nodes -- 45.03 s
  • MUMPS
  • 1 node -- 09.71 s
  • 2 nodes -- 08.03 s
  • 4 nodes -- 08.05 s
  • SPIKE TL3 (bandwidth 101)
  • 1 node -- 02.38 s
  • 2 nodes -- 01.62 s
  • 4 nodes -- 01.33 s
  • Bicgstab required 5 iterations only
  • Norm of rel. res. ≈ 10⁻⁶

Matrix BMW7st_1 is diag. dominant
Sparse Matvec kernel needs fine-tuning
32
Reservoir Modeling
33
Bicgstab + ILU (0 fill-in), NNZ = 50,720
34
Matrix after RCM reordering
35
Bicgstab + banded preconditioner, bandwidth = 61,
NNZ = 12,464
36
Driven Cavity Example
  • Square domain: −1 ≤ x, y ≤ 1
  • B.C.s: ux = uy = 0 on x = ±1 and y = −1
  • ux = 1, uy = 0 on y = 1
  • Picard iterations; Q2/Q1 elements; linearized
    equations (Oseen problem)
  • G ? 0
  • E = As = (A + Aᵀ)/2
  • viscosity = 0.1, 0.02, 0.01, 0.002

37
Linear system for 128 x 128 grid
38
(Figure: matrix structure after spectral reordering)
39
MA48 as a preconditioner for GMRES on a
uniprocessor
Droptol    # of iter.   T(factor.)   T(GMRES iters.)   nnz(LU)     final residual   ν
0          2            142.7        0.3               7,195,157   2.07E-13         1/50
1.00E-04   24           115.5        1.4               2,177,181   2.83E-09         1/50
1.00E-03   248          109.6        17.3              1,611,030   5.11E-08         1/50

For drop tolerance 0.0001, total time ≈ 117 sec.
No convergence for a drop tolerance of 10⁻².
40
Spike + Pardiso for solving systems involving the
banded preconditioner
  • On 4 nodes (8 CPUs) of an Intel Xeon cluster
  • bandwidth of extracted preconditioner = 1401
  • total time = 16 sec.
  • # of GMRES iters. = 131
  • 2-norm of residual ≈ 10⁻⁷
  • Speed improvement over the sequential procedure
    with drop tolerance 0.0001: 117/16 ≈ 7.3

41
Technical Highlights
  • Analysis of Scaling Properties
  • In early work, we developed the Isoefficiency
    metric for scalability.
  • With the likely scenario of utilizing up to 100K
    processing cores, this work becomes critical.
  • Isoefficiency quantifies the performance of a
    parallel system (a parallel program and the
    underlying architecture) as the number of
    processors is increased.

42
Technical Highlights
  • Isoefficiency Analysis
  • The efficiency of any parallel program for a
    given problem instance goes down with increasing
    number of processors.
  • For a family of parallel programs (formally
    referred to as scalable programs), increasing the
    problem size results in an increase in efficiency.

43
Technical Highlights
  • Isoefficiency is the rate at which problem size
    must be increased w.r.t. number of processors, to
    maintain constant efficiency.
  • This rate is critical, since it is ultimately
    limited by total memory size.
  • Isoefficiency is a key indicator of a program's
    ability to scale to very large machine
    configurations.
  • Isoefficiency analysis will be used extensively
    for performance projection and scaling properties
    (the defining relation is recalled below).
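For reference, the defining relation behind the bullets above (standard isoefficiency analysis, not project-specific; the overhead term in the example is illustrative):

```latex
% Efficiency of a parallel system, with W the serial work (in time units)
% and T_o(W,p) the total overhead summed over all p processors:
\[
  E \;=\; \frac{T_1}{p\,T_p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W},
\]
% so keeping E fixed as p grows requires the problem size to grow as
\[
  W \;=\; K\,T_o(W,p), \qquad K \;=\; \frac{E}{1-E}.
\]
% Illustrative example: if T_o(W,p) = \Theta(p \log p), the isoefficiency
% function is W = \Theta(p \log p); the memory available per processor then
% bounds how far constant efficiency can be maintained.
```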

44
Architecture
  • We target the following currently available
    architectures:
  • IBM JS20/21 and BlueGene/L platforms
  • Cray X1/XT3
  • AMD Opteron multicore SMP and SMP clusters
  • Intel Xeon multicore SMP and SMP clusters
  • These platforms represent a wide range of
    currently available architectural extremes.

45
Implementation
  • Current implementations are MPI based.
  • The Spike toolkit will be ported to:
  • POSIX and OpenMP
  • UPC and CAF
  • Titanium and X10 (if releases are available)
  • These implementations will be comprehensively
    benchmarked across platforms.

46
Benchmarks/Metrics
  • We aim to formally specify a number of benchmark
    problems (sparse systems arising in
    nanoelectronics, structural mechanics, and
    fluid-structure interaction).
  • We will abstract architecture characteristics:
    processor speed, memory bandwidth, link
    bandwidth, and bisection bandwidth.
  • We will quantify solvers on the basis of
    wall-clock time, FLOP count, parallel efficiency,
    scalability, and projected performance to
    petascale systems.

47
Progress/Accomplishments
  • Implementation of the parallel Spike
    polyalgorithm toolkit.
  • Incorporation of a number of parallel direct and
    preconditioned iterative solvers (e.g. SuperLU,
    MUMPS, and Pardiso) into the Spike toolkit.
  • Evaluation of Spike on the IBM/SP, SGI Altix,
    Intel Xeon clusters, and Intel multicore
    platforms.

48
Milestones
  • Final deliverable: Comprehensive evaluation of
    scaling properties of existing (and new) solvers.
  • Six-month target: Comparative performance of
    solvers on multicore SMPs and clusters.
  • Twelve-month target: Comprehensive evaluation of
    CAF/UPC/MPI implementations on Cray X1, BG, and
    JS20/21.

49
Financials
  • The total cost of this project is approximately
    $150K for its one-year duration.
  • The budget primarily accounts for a post-doctoral
    researcher's salary/benefits and minor
    summer time for the PIs.
  • Together, these three project personnel are
    responsible for accomplishing project milestones
    and reporting.

50
Concluding Remarks
  • This project takes a comprehensive view of
    parallel sparse linear system solvers and their
    suitability for petascale HPC systems.
  • Its results directly influence ongoing and future
    development of HPC systems.
  • A number of major challenges are likely to
    emerge, both as a result of this project, and
    from impending architectural innovations.

51
Concluding Remarks
  • Architectural features include:
  • Scalable multicore platforms: 64 to 128 cores on
    the horizon.
  • Heterogeneous multicore: it is likely that cores
    will be heterogeneous, some with floating-point
    units, others with vector units, and yet others
    with programmable hardware (indeed, such chips
    are commonly used in cell phones).
  • Significantly higher pressure on the memory
    subsystem.

52
Concluding Remarks
  • Impact of architectural features on algorithms
    and programming models:
  • Affinity scheduling is important for performance;
    we need to specify tasks that must be co-scheduled
    (suitable programming abstractions are needed).
  • Programming constructs for utilizing
    heterogeneity.

53
Concluding Remarks
  • Impact of architectural features on algorithms
    and programming models.
  • FLOPS are cheap, memory references are expensive:
    explore new families of algorithms that optimize
    for (i.e., minimize) the latter.
  • Algorithmic techniques and programming constructs
    for specifying algorithmic asynchrony (used to
    mask system latency).
  • Many of the optimizations are likely to be beyond
    the technical reach of application programmers,
    hence the need for scalable library support.
  • Increased emphasis on scalability analysis.