Title: Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
1. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
- Ananth Grama and Ahmed Sameh
- Department of Computer Science
- Purdue University
- http://www.cs.purdue.edu/people/sameh/ayg
Linear Solvers Grant Kickoff Meeting, 9/26/06
2. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
- Project Overview
  - Identify sweet spots in the algorithm-architecture-programming model space for efficient sparse linear system solvers.
  - Design a new class of highly scalable sparse solvers suited to petascale HPC systems.
  - Develop analytical frameworks for performance prediction and projection.
- Methodology
  - Design generalized sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
  - Evaluate architectural features and their impact on scalable solver performance.
  - Evaluate performance and productivity aspects of programming models -- PGAS languages (CAF, UPC) and MPI.
- Challenges and Impact
  - Generalizing the space of parallel sparse linear system solvers.
  - Analysis and implementation on parallel platforms.
  - Performance projection to the petascale.
  - Guidance for architecture and programming model design / performance envelope.
  - Benchmarks and libraries for HPCS.
- Milestones / Schedule
  - Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) sparse solvers.
  - Six-month target: comparative performance of solvers on multicore SMPs and clusters.
  - Twelve-month target: evaluation of these solvers on Cray X1, BlueGene, and JS20/21, for CAF/UPC/MPI implementations.
3. Introduction
- A critical aspect of high productivity is the identification of points/regions in the algorithm/architecture/programming model space that are amenable to implementation on petascale systems.
- This project aims at identifying such points for commonly used sparse linear system solvers, and at developing more robust novel solvers.
- These novel solvers emphasize a reduction in memory/remote accesses at the expense of (possibly) higher FLOP counts, yielding much better actual performance.
4. Project Rationale
- Sparse system solvers govern the overall performance of many CSE applications on HPC systems.
- The design of HPC architectures and programming models should be influenced by their suitability for such solvers and related kernels.
- The extreme need for concurrency on novel architectural models requires a fundamental re-examination of conventional sparse solvers.
5. Typical Computational Kernels for PDEs
[Figure: nested computational kernels -- time integration (step size Δt), Newton iteration (iterate k), and linear system solvers.]
6. Fluid-Structure Interaction
7. NESSIE: Nanoelectronics Simulation Environment
- Mathematical methodologies: finite element method, mode decomposition, multi-scale and non-linear numerical schemes, ...
- Numerical parallel algorithms: linear solver (SPIKE), eigenpair solver (TraceMin), preconditioning strategies, ...
- Transport/electrostatics: multi-scale, multi-physics, multi-method.
- Applications: 3D CNTs, 3D molecular devices, 3D Si nanowires, 3D III/V devices, 2D MOSFETs.
- E. Polizzi (1998-2005)
8. Simulation, Model Reduction, and Real-Time Control of Structures
9. Fluid-Solid Interaction
10. Project Goals
- Develop generalizations of direct and iterative solvers, e.g., the SPIKE polyalgorithm.
- Implement such generalizations on various architectures (multicore, multicore SMPs, multicore SMP aggregates) and programming models (PGAS languages, messaging APIs).
- Analytically quantify performance and project to petascale platforms.
- Compare relative performance, identify architecture/programming model features, and guide algorithm/architecture/programming model co-design.
11. Background
- Personnel
  - Ahmed Sameh, Samuel Conte Professor of Computer Science, has worked on the development of parallel numerical algorithms for four decades.
  - Ananth Grama, Professor and University Scholar, has worked both on numerical aspects of sparse solvers and on analytical frameworks for parallel systems.
  - A to-be-named postdoctoral researcher will be primarily responsible for implementation and benchmarking.
  - We have identified three candidates for this position and will shortly be hiring one of them.
12. Background
- Technical
  - We have built extensive infrastructure for parallel sparse solvers, including the SPIKE parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners.
  - We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 blade servers, BlueGene/L, and the Cray X1.
13. Background
- Technical
  - We have initiated installation of Co-Array Fortran and Unified Parallel C on our machines and are porting our toolkits to these PGAS languages.
  - We have extensive experience in the analysis of performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.
14. Technical Highlights
15. SPIKE: Introduction
[Figure: matrix sparsity pattern after RCM reordering.]
- Engineering problems usually produce large sparse linear systems.
- A banded structure (or banded with low-rank perturbations) is often obtained after reordering.
- SPIKE partitions the banded matrix into a block tridiagonal form.
- Each partition is associated with one CPU or one node, enabling multilevel parallelism.
16. SPIKE: Introduction
17. SPIKE: Introduction
- The banded system AX = F is transformed into SX = diag(A1^-1, ..., Ap^-1) F, where A is n x n and the coupling blocks Bj, Cj are m x m with m << n.
- The spike matrix S has identity diagonal blocks and spikes Vj (right) and Wj (left) coupling adjacent partitions.
- The reduced system, of order (p-1) x 2m, couples only the tips of the spikes; once it is solved, the full solution is retrieved from the per-partition solves (a minimal sketch of the two-partition case follows).
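To make the structure above concrete, here is a minimal serial sketch of the explicit SPIKE idea for the simplest case of p = 2 partitions, written in Python/NumPy. It is illustrative only and not the SPIKE toolkit itself: the real implementation uses banded storage, many partitions, and distributes the per-partition work across processors; the function and variable names below are hypothetical.

```python
import numpy as np

def spike_two_partition(A, f, k, m):
    """Explicit SPIKE sketch for p = 2 partitions of size k each.

    A is a dense (2k x 2k) matrix, banded with half-bandwidth m; f is the
    right-hand side.  Illustration only.
    """
    A1, A2 = A[:k, :k], A[k:, k:]
    B1 = A[k - m:k, k:k + m]          # coupling block of partition 1 (right)
    C2 = A[k:k + m, k - m:k]          # coupling block of partition 2 (left)

    # Step 1: independent per-partition solves (parallel in the real code)
    g1 = np.linalg.solve(A1, f[:k])
    g2 = np.linalg.solve(A2, f[k:])
    V1 = np.linalg.solve(A1, np.vstack([np.zeros((k - m, m)), B1]))  # right spike
    W2 = np.linalg.solve(A2, np.vstack([C2, np.zeros((k - m, m))]))  # left spike

    # Step 2: reduced system of order 2m coupling only the spike tips
    S_r = np.block([[np.eye(m),  V1[k - m:, :]],
                    [W2[:m, :],  np.eye(m)]])
    y = np.linalg.solve(S_r, np.concatenate([g1[k - m:], g2[:m]]))
    x1_b, x2_t = y[:m], y[m:]

    # Step 3: retrieve the rest of the solution from the partition solves
    x1 = g1 - V1 @ x2_t
    x2 = g2 - W2 @ x1_b
    return np.concatenate([x1, x2])

# Quick self-check on a random diagonally dominant banded system
k, m = 100, 5
n = 2 * k
rng = np.random.default_rng(0)
A = np.zeros((n, n))
for i in range(n):
    lo, hi = max(0, i - m), min(n, i + m + 1)
    A[i, lo:hi] = rng.random(hi - lo)
    A[i, i] += 2 * m                  # diagonal dominance -> no pivoting needed
f = rng.random(n)
x = spike_two_partition(A, f, k, m)
print(np.linalg.norm(A @ x - f) / np.linalg.norm(f))   # should be ~1e-15
```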
18. SPIKE: A Hybrid Algorithm
Different choices are possible depending on the properties of the matrix and/or the platform architecture:
- The spikes can be computed
  - explicitly (fully or partially),
  - on the fly, or
  - approximately.
- The diagonal blocks can be solved
  - directly (dense LU, Cholesky, or their sparse counterparts), or
  - iteratively (with a preconditioning strategy).
- The reduced system can be solved
  - directly (recursive SPIKE),
  - iteratively (with a preconditioning scheme), or
  - approximately (truncated SPIKE).
19. The SPIKE Algorithm
Hierarchy of computational modules (systems dense within the band):

Level  Description
3      SPIKE
2      LAPACK blocked algorithms
1      Primitives for banded matrices (our own)
0      BLAS3 (matrix-matrix primitives)
20. SPIKE Versions
Each algorithm combines a factorization option with a spike-computation scheme: E (explicit), R (recursive), T (truncated), F (on the fly).
- P (LU with pivoting)
  - E: explicit generation of spikes; the reduced system is solved iteratively with a preconditioner.
  - R: explicit generation of spikes; the reduced system is solved directly using recursive SPIKE.
  - F: implicit generation of the reduced system, which is solved on the fly using an iterative method.
- L (LU without pivoting)
  - E: explicit generation of spikes; the reduced system is solved iteratively with a preconditioner.
  - R: explicit generation of spikes; the reduced system is solved directly using recursive SPIKE.
  - T: truncated generation of the spike tips (Vb is exact, Wt is approximate); the reduced system is solved directly.
  - F: implicit generation of the reduced system, which is solved on the fly using an iterative method.
- U (LU and UL without pivoting)
  - T: truncated generation of the spike tips (Vb and Wt are exact); the reduced system is solved directly.
  - F: implicit generation of the reduced system, which is solved on the fly using a preconditioned iterative method.
- A (alternate LU/UL)
  - E: explicit generation of spikes using the new partitioning; the reduced system is solved iteratively with a preconditioner.
  - T: truncated generation of spikes using the new partitioning; the reduced system is solved directly.
21. SPIKE Hybrids
- 1. SPIKE version
  - R: recursive; E: explicit; F: on-the-fly; T: truncated
- 2. Factorization
  - Without pivoting: L (LU), U (LU and UL), A (alternate LU/UL)
  - With pivoting: P (LU)
- 3. Solution improvement
  - 0: direct solver only
  - 2: iterative refinement
  - 3: outer BiCGstab iterations
22. SPIKE On-the-Fly
- Does not require generating the spikes explicitly.
- Ideally suited for banded systems with large sparse bands.
- The reduced system is solved iteratively, with or without a preconditioning strategy (see the sketch after this list).
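As a complement to the explicit sketch earlier, the following hypothetical Python fragment illustrates the on-the-fly idea for the same two-partition toy case: the reduced system is never assembled; instead, each Krylov matvec re-applies the spike tips through solves with the factored diagonal blocks. The names and the choice of GMRES for the reduced system are assumptions for illustration; the actual SPIKE scheme targets many partitions and blocks that are sparse within the band.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def spike_on_the_fly(A, f, k, m):
    """On-the-fly SPIKE sketch for p = 2 partitions (illustration only).

    The reduced system S_r = [[I, V1_b], [W2_t, I]] is never formed; each
    matvec applies the spike tips by re-solving with the factored blocks.
    """
    A1, A2 = A[:k, :k], A[k:, k:]
    B1 = A[k - m:k, k:k + m]
    C2 = A[k:k + m, k - m:k]
    lu1, lu2 = lu_factor(A1), lu_factor(A2)
    g1, g2 = lu_solve(lu1, f[:k]), lu_solve(lu2, f[k:])

    def reduced_matvec(y):
        x1_b, x2_t = y[:m], y[m:]
        # bottom m rows of V1 @ x2_t, computed as A1^{-1} [0; B1 x2_t]
        v = lu_solve(lu1, np.concatenate([np.zeros(k - m), B1 @ x2_t]))[k - m:]
        # top m rows of W2 @ x1_b, computed as A2^{-1} [C2 x1_b; 0]
        w = lu_solve(lu2, np.concatenate([C2 @ x1_b, np.zeros(k - m)]))[:m]
        return np.concatenate([x1_b + v, x2_t + w])

    S_r = LinearOperator((2 * m, 2 * m), matvec=reduced_matvec)
    y, info = gmres(S_r, np.concatenate([g1[k - m:], g2[:m]]))
    x1_b, x2_t = y[:m], y[m:]

    # Retrieve the full solution
    x1 = g1 - lu_solve(lu1, np.concatenate([np.zeros(k - m), B1 @ x2_t]))
    x2 = g2 - lu_solve(lu2, np.concatenate([C2 @ x1_b, np.zeros(k - m)]))
    return np.concatenate([x1, x2])
```

The iterative reduced-system solve is what makes this variant attractive when the band itself is sparse, since the spikes never need to be formed or stored.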
23. Numerical Experiments (dense within the band)
- Computing platforms
  - 4-node Linux Xeon cluster
  - 512-processor IBM SP
- Performance comparison with LAPACK and ScaLAPACK.
24. SPIKE Scalability
Speed improvement of SPIKE (RL0) over ScaLAPACK on the IBM SP; b = 401, 1 RHS.

  procs:      8     16     32     64    128    256    512
  N = 0.5M  7.2    8.2    7.9    7.0    6.0    5.0    4.1
  N = 1M           8.4    8.2    7.7    6.9    5.7    4.8
  N = 2M                  8.3    8.0    7.6    6.0    5.4
  N = 4M                         8.2    8.0    7.4    6.4
25. SPIKE Partitioning: 4-Processor Example
[Figure: two partitionings of the banded matrix. The matrix is split into three diagonal blocks A1, A2, A3 with coupling blocks B1, B2 and C2, C3. Processor 1 factors A1 (LU), processors 2 and 3 share A2 (UL and LU, respectively), and processor 4 factors A3 (UL); all factorizations are without pivoting.]
26. SPIKE: Small Number of Processors
- ScaLAPACK needs at least 4-8 processors to perform as well as LAPACK.
- SPIKE benefits from the LU-UL strategy and realizes a speed improvement over LAPACK on 2 or more processors.

Test system: n = 960,000, b = 201, diagonally dominant.
Platform: 4-node Intel Xeon Linux cluster with InfiniBand interconnect; two 3.2 GHz processors and 4 GB of memory per node; 1 MB cache; 64-bit arithmetic; Intel Fortran, Intel MPI, and Intel MKL libraries for LAPACK and ScaLAPACK.
27. General Sparse Systems
- Science and engineering problems often produce large sparse linear systems.
- A banded structure (or banded with low-rank perturbations) is often obtained after reordering.
- While most ILU preconditioners depend on reordering strategies that minimize the fill-in in the factorization stage, we propose to
  - extract a narrow banded matrix, via a reordering-dropping strategy, to be used as a preconditioner; and
  - make use of an improved SPIKE on-the-fly scheme that is ideally suited for banded systems that are sparse within the band.
28. Sparse Parallel Direct Solvers and Banded Systems
Test matrix: N = 432,000, b = 177, nnz = 7,955,116; fill-in of the band: 10.4%.

PARDISO            Reordering  Factorization  Solve
1 CPU (1 node)     19.0        4.13           0.83
2 CPUs (1 node)    18.3        3.11           0.83

SuperLU            Reordering  Factorization  Solve
1 CPU (1 node)     10.3        17             8.4
2 CPUs (2 nodes)   8.2         37             12.5
4 CPUs (4 nodes)   8.2         41             16.7
8 CPUs (4 nodes)   7.6         178            15.9
(Memory swapping: too much fill-in.)

MUMPS              Reordering  Factorization  Solve
1 CPU (1 node)     16.2        6.3            0.8
2 CPUs (2 nodes)   17.2        4.2            0.6
4 CPUs (4 nodes)   17.8        3              0.55
8 CPUs (4 nodes)   17.7        20             1.9
29. Multilevel Parallelism: SPIKE Calling MKL-PARDISO for Banded Systems that are Sparse Within the Band
- This SPIKE hybrid scheme exhibits better performance than other parallel direct sparse solvers used alone.
30. SPIKE-MKL On-the-Fly for Systems that are Sparse Within the Band
Matrix 1: N = 432,000, b = 177, nnz = 7,955,116; sparsity of the band: 10.4%.
- MUMPS: time (2 nodes) = 21.35 s; time (4 nodes) = 39.6 s.
- For narrow-banded systems, SPIKE will consider the matrix dense within the band.
- Reordering schemes for minimizing the bandwidth can be used if necessary.

Matrix 2: N = 471,800, b = 1455, nnz = 9,499,744; sparsity of the band: 1.4%.
- Good scalability using the on-the-fly SPIKE scheme.
31. A Preconditioning Approach
- Reorder the matrix to bring most of the elements within a band via
  - HSL's MC64 (to maximize the sum or product of the diagonal elements), and
  - RCM (Reverse Cuthill-McKee) or MD (Minimum Degree).
- Extract a band from the reordered matrix to use as a preconditioner.
- Use an iterative method (e.g., BiCGstab) to solve the linear system (outer solver).
- Use SPIKE to solve the systems involving the preconditioner (inner solver).
A serial sketch of this pipeline follows.
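The following Python/SciPy sketch mirrors this pipeline under simplifying assumptions: RCM alone stands in for the MC64 + RCM/MD reordering, SciPy's sparse LU (splu) stands in for the parallel SPIKE inner solve, and the half-bandwidth kw is an arbitrary illustrative choice. All function and variable names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import LinearOperator, bicgstab, splu

def banded_preconditioned_bicgstab(A, b, kw=50, maxiter=500):
    """Serial sketch of the reorder / extract-band / precondition pipeline."""
    A = sp.csr_matrix(A)
    perm = reverse_cuthill_mckee(A, symmetric_mode=False)
    Ap = A[perm, :][:, perm].tocsc()       # symmetrically permuted matrix
    bp = b[perm]

    # Extract a band of half-width kw from the reordered matrix ...
    coo = Ap.tocoo()
    keep = np.abs(coo.row - coo.col) <= kw
    band = sp.csc_matrix((coo.data[keep], (coo.row[keep], coo.col[keep])),
                         shape=Ap.shape)
    # ... and use it as the preconditioner (the inner solve, SPIKE's role)
    band_lu = splu(band)
    M = LinearOperator(Ap.shape, matvec=band_lu.solve)

    # Outer solver: preconditioned BiCGstab on the reordered system
    xp, info = bicgstab(Ap, bp, M=M, maxiter=maxiter)

    # Undo the symmetric permutation
    inv = np.empty_like(perm)
    inv[perm] = np.arange(len(perm))
    return xp[inv], info
```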
32. Matrix DW8192 (Square Dielectric Waveguide)
Order N = 8192, NNZ = 41,746, condest(A) = 1.39e+07, sparsity = 0.0622%, (A + A^T) indefinite.
33. GMRES + ILU Preconditioner vs. BiCGstab with Banded Preconditioner
- GMRES without preconditioner: > 5000 iterations.
- GMRES + ILU (no fill-in): fails.
- GMRES + ILU (drop tolerance 1e-3): fails.
- GMRES + ILU (drop tolerance 1e-4): 16 iterations, residual ~ 10^-5; NNZ(L) = 612,889, NNZ(U) = 414,661.
- SPIKE-BiCGstab with banded preconditioner of bandwidth 70: 16 iterations, residual ~ 10^-7; NNZ = 573,440.
34. Comparisons on a 4-Node Xeon Linux Cluster (2 CPUs/node), Matrix DW8192
- SuperLU
  - 1 node -- 3.56 s
  - 2 nodes -- 1.87 s
  - 4 nodes -- 3.02 s (memory limitation)
- MUMPS
  - 1 node -- 0.26 s
  - 2 nodes -- 0.36 s
  - 4 nodes -- 0.34 s
- SPIKE RL3 (bandwidth 129)
  - 1 node -- 0.28 s
  - 2 nodes -- 0.27 s
  - 4 nodes -- 0.20 s
  - BiCGstab required only 6 iterations; norm of relative residual ~ 10^-6.
- The sparse matvec kernel needs fine-tuning.
35. Matrix BMW7st_1 (Stiffness Matrix)
- Order N = 141,347, NNZ = 7,318,399
[Figure: sparsity pattern after RCM reordering.]
36. Comparisons on a 4-Node Xeon Linux Cluster (2 CPUs/node), Matrix BMW7st_1 (diagonally dominant)
- SuperLU
  - 1 node -- 26.56 s
  - 2 nodes -- 21.17 s
  - 4 nodes -- 45.03 s
- MUMPS
  - 1 node -- 9.71 s
  - 2 nodes -- 8.03 s
  - 4 nodes -- 8.05 s
- SPIKE TL3 (bandwidth 101)
  - 1 node -- 2.38 s
  - 2 nodes -- 1.62 s
  - 4 nodes -- 1.33 s
  - BiCGstab required only 5 iterations; norm of relative residual ~ 10^-6.
- The sparse matvec kernel needs fine-tuning.
37. Reservoir Modeling
38. BiCGstab with ILU (zero fill-in) Preconditioner: NNZ = 50,720
39. Matrix After RCM Reordering
40. BiCGstab with Banded Preconditioner (bandwidth = 61): NNZ = 12,464
41. Driven Cavity Example
- Square domain: -1 <= x, y <= 1.
- Boundary conditions: u_x = u_y = 0 on x = +/-1 and y = -1; u_x = 1, u_y = 0 on y = 1.
- Picard iterations; Q2/Q1 elements; linearized equations (Oseen problem).
- G = 0; E = A_s = (A + A^T)/2.
- Viscosity = 0.1, 0.02, 0.01, 0.002.
A standard statement of the discrete Oseen system is given below.
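For reference, the Oseen problem arising at each Picard step and the resulting discrete saddle-point system can be written as follows. The notation here is an assumption chosen to match the slide's symbols: w is the current velocity iterate, B the discrete divergence operator, and G the stabilization block, which vanishes for the stable Q2/Q1 pair; E = A_s denotes the symmetric part of the velocity block A.

```latex
% Oseen problem at each Picard step (w = current velocity iterate)
-\nu\,\nabla^{2}\mathbf{u} + (\mathbf{w}\cdot\nabla)\mathbf{u} + \nabla p = \mathbf{f},
\qquad \nabla\cdot\mathbf{u} = 0,
\qquad \nu \in \{0.1,\ 0.02,\ 0.01,\ 0.002\}.

% Discrete saddle-point system (Q2/Q1 elements, so no stabilization: G = 0)
\begin{pmatrix} A & B^{T} \\ B & G \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}
=
\begin{pmatrix} f \\ g \end{pmatrix},
\qquad G = 0,
\qquad E = A_{s} = \tfrac{1}{2}\,(A + A^{T}).
```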
42. Linear System for a 128 x 128 Grid
43. After Spectral Reordering
44. MA48 as a Preconditioner for GMRES on a Uniprocessor
Viscosity = 1/50.

Drop tol.   # of iter.  T(factor.)  T(GMRES iters.)  nnz(LU)    Final residual
0           2           142.7       0.3              7,195,157  2.07E-13
1.00E-04    24          115.5       1.4              2,177,181  2.83E-09
1.00E-03    248         109.6       17.3             1,611,030  5.11E-08

- For a drop tolerance of 10^-4, the total time is about 117 s.
- No convergence for a drop tolerance of 10^-2.
45. SPIKE + Pardiso for Solving Systems Involving the Banded Preconditioner
- On 4 nodes (8 CPUs) of a Xeon Intel cluster:
  - bandwidth of the extracted preconditioner = 1401
  - total time = 16 s
  - # of GMRES iterations = 131
  - 2-norm of the residual ~ 10^-7
- Speed improvement over the sequential procedure with drop tolerance 10^-4: 117/16 ~ 7.3.
46. Technical Highlights
- Analysis of scaling properties
  - In early work, we developed the isoefficiency metric for scalability.
  - With the likely scenario of utilizing up to 100K processing cores, this work becomes critical.
  - Isoefficiency quantifies the performance of a parallel system (a parallel program and the underlying architecture) as the number of processors is increased.
47. Technical Highlights
- Isoefficiency analysis
  - The efficiency of any parallel program on a given problem instance goes down as the number of processors increases.
  - For a family of parallel programs (formally referred to as scalable programs), increasing the problem size results in an increase in efficiency.
48. Technical Highlights
- Isoefficiency is the rate at which the problem size must be increased, with respect to the number of processors, to maintain constant efficiency.
- This rate is critical, since it is ultimately limited by total memory size.
- Isoefficiency is a key indicator of a program's ability to scale to very large machine configurations.
- Isoefficiency analysis will be used extensively for performance projection and scaling properties; the standard relation is given below.
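As a reminder of the metric referenced above, the standard isoefficiency relation can be written as follows, with W the problem size measured in useful operations (unit cost per operation), T_o(W, p) the total parallel overhead, S the speedup, and E the efficiency:

```latex
% Efficiency in terms of problem size W and total overhead T_o(W, p)
E \;=\; \frac{S}{p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W}.

% Keeping E fixed as p grows yields the isoefficiency function
W \;=\; K \, T_o(W,p), \qquad K = \frac{E}{1-E}.
```

The faster T_o grows with p, the faster W must grow to hold E constant, which is why the isoefficiency function is ultimately limited by total memory size.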
49. Architecture
- We target the following currently available architectures:
  - IBM JS20/21 and BlueGene/L platforms
  - Cray X1/XT3
  - AMD Opteron multicore SMPs and SMP clusters
  - Intel Xeon multicore SMPs and SMP clusters
- These platforms represent a wide range of currently available architectural extremes.
- Intel, AMD, and IBM JS21 platforms are available locally at Purdue. We intend to get access to the BlueGene at LLNL and the Cray XT3 at ORNL.
50. Implementation
- Current implementations are MPI based.
- The SPIKE toolkit will be ported to:
  - POSIX threads and OpenMP
  - UPC and CAF
  - Titanium and X10 (if releases are available)
- These implementations will be comprehensively benchmarked across platforms.
51. Benchmarks/Metrics
- We aim to formally specify a number of benchmark problems (sparse systems arising in nanoelectronics, structural mechanics, and fluid-structure interaction).
- We will abstract architecture characteristics: processor speed, memory bandwidth, link bandwidth, and bisection bandwidth (some of these abstractions can be derived from the HPCC benchmarks).
- We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and projected performance on petascale systems.
52. Benchmarks/Metrics
- Other popular benchmarks such as HPCC do not address sparse linear solvers in meaningful ways.
- HPCC comprises seven benchmark tests: HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and communication bandwidth and latency.
- These tests only peripherally address the underlying performance issues in sparse system solvers.
53. Progress/Accomplishments
- Implementation of the parallel SPIKE polyalgorithm toolkit.
- Incorporation of a number of parallel direct and preconditioned iterative solvers (e.g., SuperLU, MUMPS, and Pardiso) into the SPIKE toolkit.
- Evaluation of SPIKE on the IBM SP, SGI Altix, Intel Xeon clusters, and Intel multicore platforms.
54. Milestones
- Final deliverable: comprehensive evaluation of the scaling properties of existing (and new) solvers.
- Six-month target: comparative performance of solvers on multicore SMPs and clusters.
- Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on Cray X1, BlueGene, and JS20/21.
55. Financials
- The total cost of this project is approximately $150K for its one-year duration.
- The budget primarily accounts for a postdoctoral researcher's salary/benefits and minor summer time for the PIs.
- Together, these three project personnel are responsible for accomplishing project milestones and reporting.
56. Concluding Remarks
- This project takes a comprehensive view of parallel sparse linear system solvers and their suitability for petascale HPC systems.
- Its results directly influence ongoing and future development of HPC systems.
- A number of major challenges are likely to emerge, both as a result of this project and from impending architectural innovations.
57. Concluding Remarks
- Architectural innovations on the horizon:
  - Scalable multicore platforms: 64 to 128 cores on the horizon.
  - Heterogeneous multicore: it is likely that cores will be heterogeneous -- some with floating-point units, others with vector units, yet others with programmable hardware (indeed, such chips are commonly used in cell phones).
  - Significantly higher pressure on the memory subsystem.
58. Concluding Remarks
- Challenges for programming models and runtime systems:
  - Affinity scheduling is important for performance; we need the ability to specify tasks that must be co-scheduled (suitable programming abstractions are needed).
  - Programming constructs for utilizing heterogeneity.
59. Concluding Remarks
- Challenges for algorithm and application development:
  - FLOPs are cheap and memory references are expensive: explore new families of algorithms that minimize the latter.
  - Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency).
  - Many of the optimizations are likely to be beyond the technical reach of application programmers, hence the need for scalable library support.
  - Increased emphasis on scalability analysis.