Title: Implementation of Parallel Iterative Methods on the IBM SP3 (Blue Horizon)
1. Implementation of Parallel Iterative Methods on the IBM SP3 (Blue Horizon)
- Federico Balbi
- University of Texas at San Antonio
2. Block OSOMIN (bosomin)
- Study and implementation of a parallel iterative Krylov method based on block methods for large non-symmetric linear systems of the form Ax = f.
- A might be large, sparse, and non-symmetric.
- The block methods use a number of linearly independent initial residual vectors. The residual vectors are used to simultaneously compute a block of direction vectors, which are orthonormalized via the Modified Gram-Schmidt (MGS) process.
- What's next:
- Previous methods
- bosomin(b, s, k) method
- The IBM SP3
- Implementation on the IBM SP3
- Test results
- Future work
3. s-Step Methods Chronology
- Omin(k): Chronopoulos, Pernice
  - implemented on the IBM 3090/600S/6VF at USI
- Osomin(s, k): Chronopoulos, Swanson
  - implemented on the CRAY C90 at PSC
- Bosomin(b, s, k): Chronopoulos, Kucherov
  - implemented on the CRAY T3E at UT Austin
- Bosomin(b, s, k): Chronopoulos, Liu
  - implemented on the IBM SP3 at SDSC
- Note: Omin(k) = Osomin(1, k) = Bosomin(1, 1, k)
4. Omin(k)
- The scalar a_i is calculated as a ratio of 2 inner products. The solution advance and residual are then calculated.
- With respect to the GCR method, Omin retains only k search directions.
- K is the right preconditioning matrix, used to improve the convergence rate.
- Ax = f is transformed to (AK)(K^-1 x) = f, then x = K x_hat, where K ~ A^-1 (from an ILU factorization).
- A p_(i+1) is orthogonal to the previous A p's.
- Preconditioned Omin(k):
  - Choose x_0, compute r_0 = f - A x_0, and set p_0 = K r_0
  - For i = 0, 1, ... (until convergence, i.e. ||r_i||_2 < eps)
    - a_i = (r_i^T A p_i) / ((A p_i)^T A p_i)   // scalar
    - x_(i+1) = x_i + a_i p_i
    - r_(i+1) = r_i - a_i A p_i
    - b_j^(i) = -((A K r_(i+1))^T A p_j) / ((A p_j)^T A p_j) for j_i <= j <= i
    - p_(i+1) = K r_(i+1) + Sum_(j=j_i)^(i) b_j^(i) p_j
  - endfor
- Note that j_i = max(0, i-k+1)
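For concreteness, a minimal serial sketch of this iteration follows; it is not the authors' code, and it assumes K = I (no preconditioner), dense storage, a random diagonally dominant test matrix, and illustrative names throughout.

  ! A minimal serial sketch of Omin(k) with K = I, dense storage,
  ! and illustrative names; this is not the authors' code.
  program omin_sketch
    implicit none
    integer, parameter :: n = 64, k = 4, maxit = 200
    real(8), parameter :: eps = 1.0d-10
    real(8) :: A(n,n), x(n), f(n), r(n), w(n), pnew(n)
    real(8) :: p(n,0:k-1), Ap(n,0:k-1), ApAp(0:k-1)
    real(8) :: alpha, beta
    integer :: i, j, s

    call random_number(A)
    call random_number(f)
    do i = 1, n
       A(i,i) = A(i,i) + n          ! make the test matrix diagonally dominant
    end do
    x = 0.0d0
    r = f - matmul(A, x)
    p(:,0)  = r                     ! p_0 = K r_0 with K = I
    Ap(:,0) = matmul(A, p(:,0))
    ApAp(0) = dot_product(Ap(:,0), Ap(:,0))

    do i = 0, maxit - 1
       s = mod(i, k)                ! circular slot holding direction i
       alpha = dot_product(r, Ap(:,s)) / ApAp(s)
       x = x + alpha * p(:,s)
       r = r - alpha * Ap(:,s)
       if (sqrt(dot_product(r, r)) < eps) exit
       ! p_(i+1) = K r_(i+1) + sum_j b_j p_j, with b_j chosen so that
       ! A p_(i+1) is orthogonal to the k retained A p_j
       w = matmul(A, r)             ! A K r_(i+1) with K = I
       pnew = r
       do j = max(0, i-k+1), i
          beta = -dot_product(w, Ap(:,mod(j,k))) / ApAp(mod(j,k))
          pnew = pnew + beta * p(:,mod(j,k))
       end do
       s = mod(i+1, k)              ! overwrite the oldest direction
       p(:,s)  = pnew
       Ap(:,s) = matmul(A, pnew)
       ApAp(s) = dot_product(Ap(:,s), Ap(:,s))
    end do
    print *, 'iterations:', i, '  residual:', sqrt(dot_product(r, r))
  end program omin_sketch

In the parallel code each dot_product becomes an MPI_Allreduce and matmul becomes the distributed sparse matrix-vector product; applying the preconditioner replaces the two places where K = I is assumed.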
5. Osomin(s, k)
- Proposed in order to increase the parallelism over Omin(k) by calculating s direction vectors at each step.
- Beta is calculated in order to make AP_i orthogonal to the previous APs.
- The MGS method is used for the orthogonalization process.
- P is made of s direction vectors.
- R: the residuals.
- alpha is calculated to minimize the residual r_i = b_i - A x_i.
- K: the preconditioning matrix.
6. Osomin(s, k)
  compute r_0 = f - A x_0
  For i = 0, 1, ... (until convergence, i.e. ||r_i||_2 < eps)
     AP_i = [A r_i, A^2 r_i, ..., A^s r_i]
     P_i  = [r_i, A r_i, ..., A^(s-1) r_i]
     if (1 <= i) then
        B_j^l = ((A^l r_i)^T A p_j^1, ..., (A^l r_i)^T A p_j^s)^T,
                where l = 1..s and j = j(i-1)..i-1
        AP_i = AP_i - Sum_(j=j(i-1))^(i-1) [AP_j B_j^l]_(l=1)^s
        P_i  = P_i  - Sum_(j=j(i-1))^(i-1) [P_j B_j^l]_(l=1)^s
     endif
     [AP_i, P_i] = MGS(AP_i)   // orthonormalize AP_i, applying the same
                               // column operations to P_i
     a_i = (r_i^T Ap_i^1, ..., r_i^T Ap_i^s)^T
     x_(i+1) = x_i + P_i a_i
     r_(i+1) = r_i - AP_i a_i
  endfor
  Note: for s = 1, Osomin(s, k) is the Omin(k) algorithm.
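The s-step basis construction above is just repeated matrix-vector products; a hypothetical helper (monomial basis, dense storage, illustrative name) might look as follows.

  ! Hypothetical helper (illustrative name): build the monomial
  ! s-step blocks P = [r, Ar, ..., A^(s-1) r] and
  ! AP = [Ar, A^2 r, ..., A^s r] with one matvec per column.
  subroutine krylov_block(n, s, A, r, P, AP)
    implicit none
    integer, intent(in)  :: n, s
    real(8), intent(in)  :: A(n,n), r(n)
    real(8), intent(out) :: P(n,s), AP(n,s)
    integer :: l
    P(:,1)  = r
    AP(:,1) = matmul(A, r)
    do l = 2, s
       P(:,l)  = AP(:,l-1)          ! A^(l-1) r
       AP(:,l) = matmul(A, P(:,l))  ! A^l r
    end do
  end subroutine krylov_block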
7. Bosomin(b, s, k)
We assume that we are going to solve a single linear system AX = F of dimension n with b right-hand sides, F = [f_1, f_2, ..., f_b]. The matrices F (RHS), X (solution vectors), and R (residual vectors) are of dimension n x b. The matrices P (direction vectors), AP (matrix times direction vectors), and Q, S (used in the orthonormalization of AP and the similar transformations on P, respectively) are of dimension n x bs. U is an upper triangular matrix of dimension bs x bs. The (parameter) matrix of steplengths a is of dimension bs x b.
8. Algorithm
  // Initialization
  F = [f_1, f_2, ..., f_b]
  R_1 = F - A X_1
  Q_0 = S_0 = 0
  // Iterations
  For i = 0, 1, ... (until convergence, i.e. ||R_i||_2 < eps)
     P_i  = [R_i, A R_i, ..., A^(s-1) R_i]
     AP_i = [A R_i, A^2 R_i, ..., A^s R_i]
     // Orthogonalize AP_i against the matrices Q_(i-1), ..., Q_j(i);
     // update P_i accordingly
     if (1 <= i) then
        AP_i = AP_i - Sum_(j=j(i))^(i-1) Q_j (Q_j^T AP_i)
        P_i  = P_i  - Sum_(j=j(i))^(i-1) S_j (Q_j^T AP_i)
     endif
     // [Q_i, S_i] = MGS(AP_i)  (QR decomposition of AP_i)
     AP_i = Q_i U_i ;  P_i = S_i U_i
     // compute steplengths, solutions, and residuals
     a_i = Q_i^T R_i
     X_(i+1) = X_i + S_i a_i
     R_(i+1) = R_i - Q_i a_i
  endfor
  Note: for b = 1, Bosomin(b, s, k) is the Osomin(s, k) algorithm.
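The step [Q_i, S_i] = MGS(AP_i) factors AP_i = Q_i U_i while mirroring every column operation on P_i, so that P_i = S_i U_i and A S_i = Q_i. A minimal serial sketch under those assumptions (dense columns, full column rank, illustrative names):

  ! Minimal serial sketch (illustrative names): factor AP = Q U by
  ! modified Gram-Schmidt while mirroring every column operation on
  ! P, so that on exit AP holds Q, P holds S, and A S = Q.
  subroutine mgs_pair(n, m, AP, P, U)
    implicit none
    integer, intent(in)    :: n, m
    real(8), intent(inout) :: AP(n,m)  ! in: AP,  out: Q
    real(8), intent(inout) :: P(n,m)   ! in: P,   out: S
    real(8), intent(out)   :: U(m,m)   ! upper triangular factor
    integer :: j, l
    U = 0.0d0
    do j = 1, m
       U(j,j)  = sqrt(dot_product(AP(:,j), AP(:,j)))
       AP(:,j) = AP(:,j) / U(j,j)      ! normalize column j of Q
       P(:,j)  = P(:,j)  / U(j,j)      ! mirror the scaling on P
       do l = j+1, m
          U(j,l)  = dot_product(AP(:,j), AP(:,l))
          AP(:,l) = AP(:,l) - U(j,l) * AP(:,j)
          P(:,l)  = P(:,l)  - U(j,l) * P(:,j)
       end do
    end do
  end subroutine mgs_pair

In the parallel code each dot_product is a collective reduction, which is why bosomin3 modifies MGS to minimize these MPI calls.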
9. Previous implementations and data mapping
- 2D data mapping (linear array of p processors)
10. Domain overlapping
[Figure: overlapped regions 0, 1, 2, ..., p of the domain assigned to CPUs 0, 1, 2, ..., p]
- Parallel execution of ILU for a linear array of p processors (preconditioning is based on backsolve).
11. Bosomin implementation: data mapping
- Data mapping significantly affects the performance of a parallel system.
- In order to process the application matrix in parallel we must partition the 3D grid which is used to generate the matrix and map it to a 3D grid of processors.
- We assume that p processors are available and arranged in a logical p^(1/3) x p^(1/3) x p^(1/3) grid.
- The processors are labeled according to their position in the grid.
- This mapping is ideal for mapping an n x n x n grid problem among p processors and provides the highest concurrency.
[Figure: space grid mapping to a 3D mesh with 27 processors]
12. Processors Grid
- Example of 8 processors mapping:
  P0 = (0,0,0)  P1 = (1,0,0)  P2 = (0,1,0)  P3 = (1,1,0)
  P4 = (0,0,1)  P5 = (1,0,1)  P6 = (0,1,1)  P7 = (1,1,1)
[Figure: 2 x 2 x 2 cube of processors 0-7 on the x, y, z axes]
13. Neighbors and Ghosts
- Each processor has 3 neighbors, one along each of the 3 dimensions.
- A neighbor along a dimension that exists in the processor grid is a real neighbor; otherwise we call it a ghost.
- For example, P1, P2, and P4 are real neighbors of P0.
[Figure: the same 2 x 2 x 2 processor cube on the x, y, z axes]
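As an illustrative sketch (not the original code), MPI's Cartesian topology routines express this 3D labeling directly; a neighbor returned as MPI_PROC_NULL plays the role of a ghost.

  ! Sketch: build a 3D process grid and find each rank's
  ! neighbors along x, y, z.  MPI_PROC_NULL marks a missing
  ! ("ghost") neighbor on a non-periodic boundary.
  ! Run with 8 MPI tasks for the 2 x 2 x 2 example.
  program grid_sketch
    use mpi
    implicit none
    integer :: ierr, rank, cart_comm, d
    integer :: dims(3), coords(3), lo(3), hi(3)
    logical :: periods(3), reorder

    call MPI_Init(ierr)
    dims    = 2                     ! 2 x 2 x 2 = 8 processors
    periods = .false.               ! non-periodic: boundaries have ghosts
    reorder = .true.
    call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, reorder, &
                         cart_comm, ierr)
    call MPI_Comm_rank(cart_comm, rank, ierr)
    call MPI_Cart_coords(cart_comm, rank, 3, coords, ierr)
    do d = 1, 3                     ! neighbors along x, y, z
       call MPI_Cart_shift(cart_comm, d-1, 1, lo(d), hi(d), ierr)
    end do
    print *, 'rank', rank, ' coords', coords, ' neighbors', lo, hi
    call MPI_Finalize(ierr)
  end program grid_sketch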
14. The IBM SP3
15. Machine Specifications
- Number of processors: 1,152
- Number of nodes: 144
- Processors per node: 8
- Memory per node: 4 GB
- Processor: Power3 RISC
- Clock rate: 375 MHz
- CPU pipeline: up to 8 instructions/clock cycle and 4 FP/cycle
- Peak performance: 1.728 Teraflops
- File storage: 15 TB
- Operating system: AIX 5L
- Internode connection: IBM SP Colony Switch2
- Peak unidirectional transfer rate: 500 MB/sec per node pair
- Peak bidirectional data transfer rate: 1000 MB/sec
16. Programming Model: Fortran 90 + MPI
- Compilers available: C, C++, Fortran 77, Fortran 90
- Interactive and batch execution
- Interactive: poe (only on 4-CPU nodes and max 2 GB)
- Batch: IBM LoadLeveler job queue manager (llsubmit, llq, llcancel)
- The original bosomin code was written in IBM Fortran 77 with MPI extensions (mpxlf77)
- The new code (bosomin3) has been ported to Fortran 90 with MPI extensions (mpxlf90)
- Straight MPI applications utilizing all 8 CPUs on each node should normally be run with the faster user space (US) protocol on the Colony switch
- Tests run on 1, 8, 64 processors
- Problem grid sizes of 32 x 32 x 32 and 64 x 64 x 64
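For example, a hypothetical LoadLeveler job command file for an 8-node, 64-task US-protocol run might look like the following (all values are illustrative; the exact network keywords on Blue Horizon may differ):

  #!/bin/ksh
  # @ job_type         = parallel
  # @ class            = normal
  # @ node             = 8
  # @ tasks_per_node   = 8
  # @ network.MPI      = css0,not_shared,us
  # @ wall_clock_limit = 00:30:00
  # @ output           = bosomin.$(jobid).out
  # @ error            = bosomin.$(jobid).err
  # @ queue
  poe ./bosomin

The file is submitted with llsubmit and monitored with llq, as in the snapshot on the next slide.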
17. LoadLeveler llq command snapshot

  Id              Owner     Submitted   ST PRI Class   Running On
  --------------- --------- ----------- -- --- ------- ----------
  tf005i.39235.0  ux452554  5/3  15:37  R  50  normal  tf009i
  tf005i.39262.0  ux452554  5/4  11:04  R  50  normal  tf093i
  tf005i.39284.0  ux453630  5/5  07:13  R  50  normal  tf021i
  tf005i.39285.0  ux453630  5/5  07:13  R  50  normal  tf029i
  tf005i.39286.0  ux453630  5/5  07:13  R  50  normal  tf083i
  tf005i.39287.0  ux453630  5/5  07:13  R  50  normal  tf141i
  tf005i.39288.0  ux453630  5/5  07:13  R  50  normal  tf155i
  tf005i.39386.0  ux453138  5/5  08:06  R  50  normal  tf200i
  tf004i.64439.0  ux450666  5/5  10:15  R  50  normal  tf081i
  tf004i.64457.0  ux451727  5/5  12:16  R  50  normal  tf207i
  tf004i.64501.0  ux452444  5/5  15:10  R  50  normal  tf019i
  tf004i.64630.0  ux453804  5/6  06:21  R  50  high    tf097i
  tf005i.39473.0  ux451204  5/6  08:27  R  50  high    tf228i
  tf004i.64809.0  ux451204  5/6  15:16  R  50  high    tf227i
  tf004i.64836.0  u13394    5/6  17:31  R  50  high    tf224i
  tf004i.64847.0  ux453039  5/6  17:58  R  50  normal  tf039i
  tf004i.64907.0  ux450833  5/7  01:36  R  50  high    tf006i
18. Original Bosomin
- Fortran 77 with MPI
- No compiler switch optimizations adopted
- No modularization (2,300 lines of code in one single file)
- No makefile
- No data locality (most objects declared in a COMMON section)
- Implicit declaration
- Hard to debug
- No b-block routines to minimize subroutine calls and data communication overhead for b > 1 cases
19. Bosomin3 implementation
- Fortran 90 and MPI
- Higher level language: modules, classes, intrinsic functions, user-defined data types
- Modularization (1,600 lines of code in 7 distinct files)
- Makefile to provide a faster incremental compilation process
- Added compiler switch optimizations (-O3)
- Data locality and use of modules to separate subroutines
- No implicit declaration: undeclared objects force the compiler to stop
- Easier to debug and extend
- b-block routines to handle b > 1 cases in one call:
  - matvec
  - preconditioner
- Modified MGS to minimize MPI calls (especially collective)
20. Bosomin3 - Fortran 90 modules
- Fortran 90 lets the programmer define independent modules. Bosomin3 is divided into the following modules:
- global_data.F: constant definitions common to different modules
- mesh3.f: 3D grid initialization routines
- precond.f: preconditioner routines
- mat.f: matrix-vector routines
- mgrs.f: Modified Gram-Schmidt routines
- m_osomin.f: bosomin algorithm
- bosomin.f: main program
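As an illustration of the style (the actual module contents are not shown in the slides; the names below are hypothetical), a global_data-like module replaces the old COMMON blocks with shared, explicitly typed declarations:

  ! Hypothetical sketch of a global_data-style Fortran 90 module;
  ! the real module contents are not shown in the slides.
  module global_data
     implicit none                             ! no implicit declarations
     integer, parameter :: dp = kind(1.0d0)    ! working precision
     real(dp), parameter :: eps = 1.0e-10_dp   ! convergence tolerance
     integer, parameter :: maxiter = 100       ! iteration cap
  end module global_data

  subroutine show_tolerance      ! a consumer simply USEs the module
     use global_data
     implicit none
     print *, 'eps =', eps
  end subroutine show_tolerance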
21. IBM XLF compiler optimization options
- -O2 option:
- Value numbering - folding several instructions into a single instruction.
- Branch straightening - rearranging program code to minimize branch logic, and combining physically separate blocks of code.
- Common expression elimination - eliminating duplicate expressions.
- Code motion - performing calculations outside a loop (if the variables in the loop are not altered within it) and using those results within the loop.
- Reassociation and strength reduction - rearranging calculation sequences in loops in order to replace less efficient instructions with more efficient ones.
- Store motion - moving store operations out of loops.
- Dead store elimination - eliminating stores when the value stored is never referred to again.
- Dead code elimination - eliminating code for calculations that are not required and portions of the code that can never be reached.
- Global register allocation - keeping variables and expressions in registers instead of memory.
- Instruction scheduling - reordering instructions to minimize program execution time.
22. Store motion
- Dot product example of the store motion optimization done by -O2:

  x = 0.0
  do i = 1, ilim
     x = a(i) * b(i) + x
  enddo

- The compiler would recognize that there is no need to store the value of x until the loop is completed, and intermediate values would be kept in registers. Even if the loads and stores are cached, the optimization could lead to an order of magnitude or better improvement in the performance of this loop.
23. -O3 option
- Performs all of the optimizations done by -O2 as well as others that require more memory or time to accomplish.
- Some optimizations might change the semantics of the program and cause slight numeric differences. The -qstrict option avoids this problem.
- Rewriting floating-point expressions: computations like a*b*c may be rewritten as a*c*b if there is a common subexpression. This is not done at the -O2 level, since different numeric results may result.
- Divides are replaced by multiplies by the reciprocal. Divides on the POWER3 are very expensive operations: they require 18 cycles for 64-bit FP operands. The reciprocal and multiply are much cheaper operations. This might produce slightly different results.
- Aggressive code motion and scheduling: the compiler will rearrange the code and instruction sequence much more aggressively. Computations that have the potential to raise an exception and whose execution is conditional in the program might be scheduled unconditionally at this level if this leads to improvements in performance. Loads and floating-point computations may be scheduled even though, according to the actual semantics of the program, they might never have been executed.
- The same principle is followed when it comes to moving many kinds of loads. The compiler will move safe types of loads for a potential performance improvement. Loads in general are not movable at the -O2 level of optimization because a program can declare a static array and then load an element of that array far beyond the declared boundary, which might cause a segmentation violation.
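As a small illustration (not from the original slides), the divide-to-reciprocal rewrite turns n divides into one divide and n multiplies:

  ! Source loop: one expensive divide per iteration
  ! (18 cycles for 64-bit FP operands on the POWER3).
  do i = 1, ilim
     y(i) = x(i) / s
  enddo
  ! At -O3 the compiler may effectively rewrite it as
  ! (suppressed by -qstrict):
  rs = 1.0d0 / s
  do i = 1, ilim
     y(i) = x(i) * rs
  enddo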
24. -O3 option
- In the following example, at -O2, the computation of b*c is not moved out of the loop for two reasons: it is considered dangerous, because it is a floating-point operation and could thus possibly cause an exception, and it does not occur on every path through the loop, so it potentially may never be executed. At -O3, the computation is moved outside the loop and done only once.

  do i = 1, ilim-1
     if (a(i) .lt. a(i+1)) a(i) = b*c
  enddo
25. Bosomin3 - makefile
- A makefile has been created in order to avoid unnecessary recompilation of unmodified modules. This also lets the programmer compile different modules using different libraries and compiler switches.

  CC   = mpxlf90
  OPS  = -O3 -qmaxmem=-1 -qstrict
  LIBS = -lessl

  all: bosomin

  bosomin: global_data.o bosomin.o mat.o mesh3.o mgrs.o \
           precond.o m_osomin.o
          $(CC) $(OPS) -o bosomin bosomin.o global_data.o mat.o \
          mesh3.o mgrs.o precond.o m_osomin.o $(LIBS)

  bosomin.o: bosomin.f mesh3.o mat.o precond.o mgrs.o m_osomin.o
          $(CC) $(OPS) -c bosomin.f mesh3.o mat.o precond.o \
          mgrs.o m_osomin.o $(LIBS)

  global_data.o: global_data.F
          $(CC) $(OPS) -c global_data.F $(LIBS)

  m_osomin.o: m_osomin.f mat.o mgrs.o global_data.o
          $(CC) $(OPS) -c m_osomin.f mat.o mgrs.o \
          global_data.o $(LIBS)

  mat.o: mat.f precond.o global_data.o
          $(CC) $(OPS) -c mat.f precond.o global_data.o $(LIBS)
26. Results
27. Parameters
- Runs have been performed to compare bosomin and bosomin3 using several different parameters:
- grids: 32x32x32 and 64x64x64
- b = 1, 4
- k = 0, 1, 2, 4
- s = 1, 2, 4, 8
- eps = 1E-10
- maxiter = 5 (comparison test)
- maxiter = 100 (convergence test)
- 1, 8, 64 processors
- -O2, -O3 optimizations
- Gain is calculated as T(F77) over T(F90).
- Speedup is calculated as T(p=1) over T(p=n).
28. Compiler optimization: 1 processor comparison
29. (No Transcript)
30. -O3: 1, 8, 64 processors comparison
31. Note that with b > 1 the block operations of bosomin3 further improve the gain over the original bosomin code. Note also the improved scalability when run on more processors. Bosomin3's speedup is twice that of the original code.
32. (No Transcript)
33. Convergence test: 1 processor
34. Convergence test: n processors (opt = -O3, maxiter = 100, eps = 1E-10)
35. Conclusions
- bosomin3 is faster and more scalable than the previous implementation by taking advantage of the cases where b > 1
- It is more precise
- The code is smaller, cleaner, and easier to maintain
36. Future Work
- Implement similar algorithms (like GMRES) on the SP3 using the 3D data mapping to see improvements
- Averaging solutions
- Test different matrices and show that the block method converges faster with b > 1 RHS
- Run tests using longhorn, the IBM SP4 (Power4, 1.3 GHz) at UT Austin
- Test the new TurboMPI library, which promises up to 300% speed improvement in collective MPI routines by taking advantage of shared memory to communicate within a node
37. The End