Title: Implementation of Parallel Iterative Methods on the IBM SP3 (Blue Horizon)
1. Implementation of Parallel Iterative Methods on the IBM SP3 (Blue Horizon)
- Federico Balbi
- University of Texas at San Antonio
2. Block OSOMIN (bosomin)
- Study and implementation of a parallel iterative Krylov method based on block methods for large non-symmetric linear systems of the form Ax = f.
- A might be large, sparse, and non-symmetric.
- The block methods use a number of linearly independent initial residual vectors. The residual vectors are used to simultaneously compute a block of direction vectors, which are orthonormalized via the Modified Gram-Schmidt (MGS) process.
- What's next:
- Previous methods
- bosomin(b, s, k) method
- The IBM SP3
- Implementation on the IBM SP3
- Test results
- Future work
3. s-Step Methods Chronology
- Omin(k): Chronopoulos, Pernice
  - implemented on the IBM 3090/600S/6VF at USI
- Osomin(s, k): Chronopoulos, Swanson
  - implemented on the CRAY C90 at PSC
- Bosomin(b, s, k): Chronopoulos, Kucherov
  - implemented on the CRAY T3E at UT Austin
- Bosomin(b, s, k): Chronopoulos, Liu
  - implemented on the IBM SP3 at SDSC
- Note: Omin(k) = Osomin(1, k) = Bosomin(1, 1, k)
4. Omin(k)
- The scalar a_i is calculated as a ratio of 2 inner products. The solution advance and residual are then calculated.
- With respect to the GCR method, Omin retains only k search directions.
- K is the right preconditioning matrix, used to improve the convergence rate.
- Ax = f is transformed to (AK)(K^-1 x) = f, then x = K x_hat, where K ~ A^-1 (from an ILU factorization).
- A p_(i+1) is orthogonal to the previous A p's.
- Preconditioned Omin(k):
  - Choose x_0, compute r_0 = f - A x_0, and set p_0 = K r_0
  - For i = 0, 1, ... (until convergence, i.e. ||r_i||_2 < eps)
    - a_i = (r_i^T A p_i) / ((A p_i)^T A p_i)   // scalar
    - x_(i+1) = x_i + a_i p_i
    - r_(i+1) = r_i - a_i A p_i
    - b_j^(i) = -((A K r_(i+1))^T A p_j) / ((A p_j)^T A p_j) for j_i <= j <= i
    - p_(i+1) = K r_(i+1) + Sum_(j=j_i)^(i) b_j^(i) p_j
  - endfor
- Note that j_i = max(0, i-k+1)
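For concreteness, a minimal serial sketch of this iteration follows; it is not the authors' code, and it assumes K = I (no preconditioner), dense storage, a random diagonally dominant test matrix, and illustrative names throughout.

  ! A minimal serial sketch of Omin(k) with K = I, dense storage,
  ! and illustrative names; this is not the authors' code.
  program omin_sketch
    implicit none
    integer, parameter :: n = 64, k = 4, maxit = 200
    real(8), parameter :: eps = 1.0d-10
    real(8) :: A(n,n), x(n), f(n), r(n), w(n), pnew(n)
    real(8) :: p(n,0:k-1), Ap(n,0:k-1), ApAp(0:k-1)
    real(8) :: alpha, beta
    integer :: i, j, s

    call random_number(A)
    call random_number(f)
    do i = 1, n
       A(i,i) = A(i,i) + n          ! make the test matrix diagonally dominant
    end do
    x = 0.0d0
    r = f - matmul(A, x)
    p(:,0)  = r                     ! p_0 = K r_0 with K = I
    Ap(:,0) = matmul(A, p(:,0))
    ApAp(0) = dot_product(Ap(:,0), Ap(:,0))

    do i = 0, maxit - 1
       s = mod(i, k)                ! circular slot holding direction i
       alpha = dot_product(r, Ap(:,s)) / ApAp(s)
       x = x + alpha * p(:,s)
       r = r - alpha * Ap(:,s)
       if (sqrt(dot_product(r, r)) < eps) exit
       ! p_(i+1) = K r_(i+1) + sum_j b_j p_j, with b_j chosen so that
       ! A p_(i+1) is orthogonal to the k retained A p_j
       w = matmul(A, r)             ! A K r_(i+1) with K = I
       pnew = r
       do j = max(0, i-k+1), i
          beta = -dot_product(w, Ap(:,mod(j,k))) / ApAp(mod(j,k))
          pnew = pnew + beta * p(:,mod(j,k))
       end do
       s = mod(i+1, k)              ! overwrite the oldest direction
       p(:,s)  = pnew
       Ap(:,s) = matmul(A, pnew)
       ApAp(s) = dot_product(Ap(:,s), Ap(:,s))
    end do
    print *, 'iterations:', i, '  residual:', sqrt(dot_product(r, r))
  end program omin_sketch

In the parallel code each dot_product becomes an MPI_Allreduce and matmul becomes the distributed sparse matrix-vector product; applying the preconditioner replaces the two places where K = I is assumed.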
5. Osomin(s, k)
- Proposed in order to increase the parallelism over Omin(k) by calculating s direction vectors at each step.
- Beta is calculated in order to make AP_i orthogonal to the previous APs.
- The MGS method is used for the orthogonalization process.
- P is made of s direction vectors.
- R: the residuals.
- alpha is calculated to minimize the residual r_i = b_i - A x_i.
- K: the preconditioning matrix.
6. Osomin(s, k)
  compute r_0 = f - A x_0
  For i = 0, 1, ... (until convergence, i.e. ||r_i||_2 < eps)
     AP_i = [A r_i, A^2 r_i, ..., A^s r_i]
     P_i  = [r_i, A r_i, ..., A^(s-1) r_i]
     if (1 <= i) then
        B_j^l = ((A^l r_i)^T A p_j^1, ..., (A^l r_i)^T A p_j^s)^T,
                where l = 1..s and j = j(i-1)..i-1
        AP_i = AP_i - Sum_(j=j(i-1))^(i-1) [AP_j B_j^l]_(l=1)^s
        P_i  = P_i  - Sum_(j=j(i-1))^(i-1) [P_j B_j^l]_(l=1)^s
     endif
     [AP_i, P_i] = MGS(AP_i)   // orthonormalize AP_i, applying the same
                               // column operations to P_i
     a_i = (r_i^T Ap_i^1, ..., r_i^T Ap_i^s)^T
     x_(i+1) = x_i + P_i a_i
     r_(i+1) = r_i - AP_i a_i
  endfor
  Note: for s = 1, Osomin(s, k) is the Omin(k) algorithm.
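The s-step basis construction above is just repeated matrix-vector products; a hypothetical helper (monomial basis, dense storage, illustrative name) might look as follows.

  ! Hypothetical helper (illustrative name): build the monomial
  ! s-step blocks P = [r, Ar, ..., A^(s-1) r] and
  ! AP = [Ar, A^2 r, ..., A^s r] with one matvec per column.
  subroutine krylov_block(n, s, A, r, P, AP)
    implicit none
    integer, intent(in)  :: n, s
    real(8), intent(in)  :: A(n,n), r(n)
    real(8), intent(out) :: P(n,s), AP(n,s)
    integer :: l
    P(:,1)  = r
    AP(:,1) = matmul(A, r)
    do l = 2, s
       P(:,l)  = AP(:,l-1)          ! A^(l-1) r
       AP(:,l) = matmul(A, P(:,l))  ! A^l r
    end do
  end subroutine krylov_block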
7. Bosomin(b, s, k)
We assume that we are going to solve a single linear system AX = F of dimension n with b right-hand sides, F = [f_1, f_2, ..., f_b]. The matrices F (RHS), X (solution vectors), and R (residual vectors) are of dimension n x b. The matrices P (direction vectors), AP (matrix times direction vectors), and Q, S (used in the orthonormalization of AP and the similar transformations on P, respectively) are of dimension n x bs. U is an upper triangular matrix of dimension bs x bs. The (parameter) matrix of steplengths a is of dimension bs x b.
8. Algorithm
  // Initialization
  F = [f_1, f_2, ..., f_b]
  R_1 = F - A X_1
  Q_0 = S_0 = 0
  // Iterations
  For i = 0, 1, ... (until convergence, i.e. ||R_i||_2 < eps)
     P_i  = [R_i, A R_i, ..., A^(s-1) R_i]
     AP_i = [A R_i, A^2 R_i, ..., A^s R_i]
     // Orthogonalize AP_i against the matrices Q_(i-1), ..., Q_j(i);
     // update P_i accordingly
     if (1 <= i) then
        AP_i = AP_i - Sum_(j=j(i))^(i-1) Q_j (Q_j^T AP_i)
        P_i  = P_i  - Sum_(j=j(i))^(i-1) S_j (Q_j^T AP_i)
     endif
     // [Q_i, S_i] = MGS(AP_i)  (QR decomposition of AP_i)
     AP_i = Q_i U_i ;  P_i = S_i U_i
     // compute steplengths, solutions, and residuals
     a_i = Q_i^T R_i
     X_(i+1) = X_i + S_i a_i
     R_(i+1) = R_i - Q_i a_i
  endfor
  Note: for b = 1, Bosomin(b, s, k) is the Osomin(s, k) algorithm.
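The step [Q_i, S_i] = MGS(AP_i) factors AP_i = Q_i U_i while mirroring every column operation on P_i, so that P_i = S_i U_i and A S_i = Q_i. A minimal serial sketch under those assumptions (dense columns, full column rank, illustrative names):

  ! Minimal serial sketch (illustrative names): factor AP = Q U by
  ! modified Gram-Schmidt while mirroring every column operation on
  ! P, so that on exit AP holds Q, P holds S, and A S = Q.
  subroutine mgs_pair(n, m, AP, P, U)
    implicit none
    integer, intent(in)    :: n, m
    real(8), intent(inout) :: AP(n,m)  ! in: AP,  out: Q
    real(8), intent(inout) :: P(n,m)   ! in: P,   out: S
    real(8), intent(out)   :: U(m,m)   ! upper triangular factor
    integer :: j, l
    U = 0.0d0
    do j = 1, m
       U(j,j)  = sqrt(dot_product(AP(:,j), AP(:,j)))
       AP(:,j) = AP(:,j) / U(j,j)      ! normalize column j of Q
       P(:,j)  = P(:,j)  / U(j,j)      ! mirror the scaling on P
       do l = j+1, m
          U(j,l)  = dot_product(AP(:,j), AP(:,l))
          AP(:,l) = AP(:,l) - U(j,l) * AP(:,j)
          P(:,l)  = P(:,l)  - U(j,l) * P(:,j)
       end do
    end do
  end subroutine mgs_pair

In the parallel code each dot_product is a collective reduction, which is why bosomin3 modifies MGS to minimize these MPI calls.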
9. Previous implementations and data mapping
- 2D data mapping (linear array of p processors)
10. Domain overlapping
[Figure: overlapped regions 0, 1, 2, ..., p of the domain assigned to CPUs 0, 1, 2, ..., p]
- Parallel execution of ILU for a linear array of p processors (preconditioning is based on backsolve).
11. Bosomin implementation: data mapping
- Data mapping significantly affects the performance of a parallel system.
- In order to process the application matrix in parallel we must partition the 3D grid which is used to generate the matrix and map it to a 3D grid of processors.
- We assume that p processors are available and arranged in a logical p^(1/3) x p^(1/3) x p^(1/3) grid.
- The processors are labeled according to their position in the grid.
- This mapping is ideal for mapping an n x n x n grid problem among p processors and provides the highest concurrency.
[Figure: space grid mapping to a 3D mesh with 27 processors]
12. Processors Grid
- Example of 8 processors mapping:
  P0 = (0,0,0)  P1 = (1,0,0)  P2 = (0,1,0)  P3 = (1,1,0)
  P4 = (0,0,1)  P5 = (1,0,1)  P6 = (0,1,1)  P7 = (1,1,1)
[Figure: 2 x 2 x 2 cube of processors 0-7 on the x, y, z axes]
13. Neighbors and Ghosts
- Each processor has 3 neighbors, one along each of the 3 dimensions.
- A neighbor along a dimension that exists in the processor grid is a real neighbor; otherwise we call it a ghost.
- For example, P1, P2, and P4 are real neighbors of P0.
[Figure: the same 2 x 2 x 2 processor cube on the x, y, z axes]
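As an illustrative sketch (not the original code), MPI's Cartesian topology routines express this 3D labeling directly; a neighbor returned as MPI_PROC_NULL plays the role of a ghost.

  ! Sketch: build a 3D process grid and find each rank's
  ! neighbors along x, y, z.  MPI_PROC_NULL marks a missing
  ! ("ghost") neighbor on a non-periodic boundary.
  ! Run with 8 MPI tasks for the 2 x 2 x 2 example.
  program grid_sketch
    use mpi
    implicit none
    integer :: ierr, rank, cart_comm, d
    integer :: dims(3), coords(3), lo(3), hi(3)
    logical :: periods(3), reorder

    call MPI_Init(ierr)
    dims    = 2                     ! 2 x 2 x 2 = 8 processors
    periods = .false.               ! non-periodic: boundaries have ghosts
    reorder = .true.
    call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, reorder, &
                         cart_comm, ierr)
    call MPI_Comm_rank(cart_comm, rank, ierr)
    call MPI_Cart_coords(cart_comm, rank, 3, coords, ierr)
    do d = 1, 3                     ! neighbors along x, y, z
       call MPI_Cart_shift(cart_comm, d-1, 1, lo(d), hi(d), ierr)
    end do
    print *, 'rank', rank, ' coords', coords, ' neighbors', lo, hi
    call MPI_Finalize(ierr)
  end program grid_sketch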
14. The IBM SP3
15. Machine Specifications
- Number of processors: 1,152
- Number of nodes: 144
- Processors per node: 8
- Memory per node: 4 GB
- Processor: Power3 RISC
- Clock rate: 375 MHz
- CPU pipeline: up to 8 instructions/clock cycle and 4 FP/cycle
- Peak performance: 1.728 Teraflops
- File storage: 15 TB
- Operating system: AIX 5L
- Internode connection: IBM SP Colony Switch2
- Peak unidirectional transfer rate: 500 MB/sec per node pair
- Peak bidirectional data transfer rate: 1000 MB/sec
16. Programming Model: Fortran 90 + MPI
- Compilers available: C, C++, Fortran 77, Fortran 90
- Interactive and batch execution
- Interactive: poe (only on 4-CPU nodes and max 2 GB)
- Batch: IBM LoadLeveler job queue manager (llsubmit, llq, llcancel)
- The original bosomin code was written in IBM Fortran 77 with MPI extensions (mpxlf77)
- The new code (bosomin3) has been ported to Fortran 90 with MPI extensions (mpxlf90)
- Straight MPI applications utilizing all 8 CPUs on each node should normally be run with the faster user space (US) protocol on the Colony switch
- Tests run on 1, 8, 64 processors
- Problem grid sizes of 32 x 32 x 32 and 64 x 64 x 64
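For example, a hypothetical LoadLeveler job command file for an 8-node, 64-task US-protocol run might look like the following (all values are illustrative; the exact network keywords on Blue Horizon may differ):

  #!/bin/ksh
  # @ job_type         = parallel
  # @ class            = normal
  # @ node             = 8
  # @ tasks_per_node   = 8
  # @ network.MPI      = css0,not_shared,us
  # @ wall_clock_limit = 00:30:00
  # @ output           = bosomin.$(jobid).out
  # @ error            = bosomin.$(jobid).err
  # @ queue
  poe ./bosomin

The file is submitted with llsubmit and monitored with llq, as in the snapshot on the next slide.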
17. LoadLeveler llq command snapshot

  Id              Owner     Submitted   ST PRI Class   Running On
  --------------- --------- ----------- -- --- ------- ----------
  tf005i.39235.0  ux452554  5/3  15:37  R  50  normal  tf009i
  tf005i.39262.0  ux452554  5/4  11:04  R  50  normal  tf093i
  tf005i.39284.0  ux453630  5/5  07:13  R  50  normal  tf021i
  tf005i.39285.0  ux453630  5/5  07:13  R  50  normal  tf029i
  tf005i.39286.0  ux453630  5/5  07:13  R  50  normal  tf083i
  tf005i.39287.0  ux453630  5/5  07:13  R  50  normal  tf141i
  tf005i.39288.0  ux453630  5/5  07:13  R  50  normal  tf155i
  tf005i.39386.0  ux453138  5/5  08:06  R  50  normal  tf200i
  tf004i.64439.0  ux450666  5/5  10:15  R  50  normal  tf081i
  tf004i.64457.0  ux451727  5/5  12:16  R  50  normal  tf207i
  tf004i.64501.0  ux452444  5/5  15:10  R  50  normal  tf019i
  tf004i.64630.0  ux453804  5/6  06:21  R  50  high    tf097i
  tf005i.39473.0  ux451204  5/6  08:27  R  50  high    tf228i
  tf004i.64809.0  ux451204  5/6  15:16  R  50  high    tf227i
  tf004i.64836.0  u13394    5/6  17:31  R  50  high    tf224i
  tf004i.64847.0  ux453039  5/6  17:58  R  50  normal  tf039i
  tf004i.64907.0  ux450833  5/7  01:36  R  50  high    tf006i
18. Original Bosomin
- Fortran 77 with MPI
- No compiler switch optimizations adopted
- No modularization (2,300 lines of code in one single file)
- No makefile
- No data locality (most objects declared in a COMMON section)
- Implicit declaration
- Hard to debug
- No b-block routines to minimize subroutine calls and data communication overhead for b > 1 cases
19. Bosomin3 implementation
- Fortran 90 and MPI
- Higher level language: modules, classes, intrinsic functions, user-defined data types
- Modularization (1,600 lines of code in 7 distinct files)
- Makefile to provide a faster incremental compilation process
- Added compiler switch optimizations (-O3)
- Data locality and use of modules to separate subroutines
- No implicit declaration: undeclared objects force the compiler to stop
- Easier to debug and extend
- b-block routines to handle b > 1 cases in one call:
  - matvec
  - preconditioner
- Modified MGS to minimize MPI calls (especially collective)
20. Bosomin3 - Fortran 90 modules
- Fortran 90 lets the programmer define independent modules. Bosomin3 is divided into the following modules:
- global_data.F: constant definitions common to different modules
- mesh3.f: 3D grid initialization routines
- precond.f: preconditioner routines
- mat.f: matrix-vector routines
- mgrs.f: Modified Gram-Schmidt routines
- m_osomin.f: bosomin algorithm
- bosomin.f: main program
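As an illustration of the style (the actual module contents are not shown in the slides; the names below are hypothetical), a global_data-like module replaces the old COMMON blocks with shared, explicitly typed declarations:

  ! Hypothetical sketch of a global_data-style Fortran 90 module;
  ! the real module contents are not shown in the slides.
  module global_data
     implicit none                             ! no implicit declarations
     integer, parameter :: dp = kind(1.0d0)    ! working precision
     real(dp), parameter :: eps = 1.0e-10_dp   ! convergence tolerance
     integer, parameter :: maxiter = 100       ! iteration cap
  end module global_data

  subroutine show_tolerance      ! a consumer simply USEs the module
     use global_data
     implicit none
     print *, 'eps =', eps
  end subroutine show_tolerance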
21. IBM XLF compiler optimization options
- -O2 option:
- Value numbering - folding several instructions into a single instruction.
- Branch straightening - rearranging program code to minimize branch logic, and combining physically separate blocks of code.
- Common expression elimination - eliminating duplicate expressions.
- Code motion - performing calculations outside a loop (if the variables in the loop are not altered within it) and using those results within the loop.
- Reassociation and strength reduction - rearranging calculation sequences in loops in order to replace less efficient instructions with more efficient ones.
- Store motion - moving store operations out of loops.
- Dead store elimination - eliminating stores when the value stored is never referred to again.
- Dead code elimination - eliminating code for calculations that are not required and portions of the code that can never be reached.
- Global register allocation - keeping variables and expressions in registers instead of memory.
- Instruction scheduling - reordering instructions to minimize program execution time.
22. Store motion
- Dot product example of the store motion optimization done by -O2:

  x = 0.0
  do i = 1, ilim
     x = a(i) * b(i) + x
  enddo

- The compiler would recognize that there is no need to store the value of x until the loop is completed, and intermediate values would be kept in registers. Even if the loads and stores are cached, the optimization could lead to an order of magnitude or better improvement in the performance of this loop.
23. -O3 option
- Performs all of the optimizations done by -O2 as well as others that require more memory or time to accomplish.
- Some optimizations might change the semantics of the program and cause slight numeric differences. The -qstrict option avoids this problem.
- Rewriting floating-point expressions: computations like a*b*c may be rewritten as a*c*b if there is a common subexpression. This is not done at the -O2 level, since different numeric results may result.
- Divides are replaced by multiplies by the reciprocal. Divides on the POWER3 are very expensive operations: they require 18 cycles for 64-bit FP operands. The reciprocal and multiply are much cheaper operations. This might produce slightly different results.
- Aggressive code motion and scheduling: the compiler will rearrange the code and instruction sequence much more aggressively. Computations that have the potential to raise an exception and whose execution is conditional in the program might be scheduled unconditionally at this level if this leads to improvements in performance. Loads and floating-point computations may be scheduled even though, according to the actual semantics of the program, they might never have been executed.
- The same principle is followed when it comes to moving many kinds of loads. The compiler will move safe types of loads for a potential performance improvement. Loads in general are not movable at the -O2 level of optimization because a program can declare a static array and then load an element of that array far beyond the declared boundary, which might cause a segmentation violation.
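As a small illustration (not from the original slides), the divide-to-reciprocal rewrite turns n divides into one divide and n multiplies:

  ! Source loop: one expensive divide per iteration
  ! (18 cycles for 64-bit FP operands on the POWER3).
  do i = 1, ilim
     y(i) = x(i) / s
  enddo
  ! At -O3 the compiler may effectively rewrite it as
  ! (suppressed by -qstrict):
  rs = 1.0d0 / s
  do i = 1, ilim
     y(i) = x(i) * rs
  enddo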
24. -O3 option
- In the following example, at -O2, the computation of b*c is not moved out of the loop for two reasons: it is considered dangerous, because it is a floating-point operation and could thus possibly cause an exception, and it does not occur on every path through the loop, so it potentially may never be executed. At -O3, the computation is moved outside the loop and done only once.

  do i = 1, ilim-1
     if (a(i) .lt. a(i+1)) a(i) = b*c
  enddo
25. Bosomin3 - makefile
- A makefile has been created in order to avoid unnecessary recompilation of unmodified modules. This also lets the programmer compile different modules using different libraries and compiler switches.

  CC   = mpxlf90
  OPS  = -O3 -qmaxmem=-1 -qstrict
  LIBS = -lessl

  all: bosomin

  bosomin: global_data.o bosomin.o mat.o mesh3.o mgrs.o \
           precond.o m_osomin.o
          $(CC) $(OPS) -o bosomin bosomin.o global_data.o mat.o \
          mesh3.o mgrs.o precond.o m_osomin.o $(LIBS)

  bosomin.o: bosomin.f mesh3.o mat.o precond.o mgrs.o m_osomin.o
          $(CC) $(OPS) -c bosomin.f mesh3.o mat.o precond.o \
          mgrs.o m_osomin.o $(LIBS)

  global_data.o: global_data.F
          $(CC) $(OPS) -c global_data.F $(LIBS)

  m_osomin.o: m_osomin.f mat.o mgrs.o global_data.o
          $(CC) $(OPS) -c m_osomin.f mat.o mgrs.o \
          global_data.o $(LIBS)

  mat.o: mat.f precond.o global_data.o
          $(CC) $(OPS) -c mat.f precond.o global_data.o $(LIBS)
26. Results
27. Parameters
- Runs have been performed to compare bosomin and bosomin3 using several different parameters:
- grids: 32x32x32 and 64x64x64
- b = 1, 4
- k = 0, 1, 2, 4
- s = 1, 2, 4, 8
- eps = 1E-10
- maxiter = 5 (comparison test)
- maxiter = 100 (convergence test)
- 1, 8, 64 processors
- -O2, -O3 optimizations
- Gain is calculated as T(F77) over T(F90).
- Speedup is calculated as T(p=1) over T(p=n).
28. Compiler optimization: 1 processor comparison
29. (No Transcript)
30. -O3: 1, 8, 64 processors comparison
31. Note that with b > 1 the block operations of bosomin3 further improve the gain over the original bosomin code. Note also the improved scalability when run on more processors. Bosomin3's speedup is twice that of the original code.
32. (No Transcript)
33. Convergence test: 1 processor
34. Convergence test: n processors (opt = -O3, maxiter = 100, eps = 1E-10)
35. Conclusions
- bosomin3 is faster and more scalable than the previous implementation by taking advantage of the cases where b > 1
- It is more precise
- The code is smaller, cleaner, and easier to maintain
36. Future Work
- Implement similar algorithms (like GMRES) on the SP3 using the 3D data mapping to see improvements
- Averaging solutions
- Test different matrices and show that the block method converges faster with b > 1 RHS
- Run tests using longhorn, the IBM SP4 (Power4, 1.3 GHz) at UT Austin
- Test the new TurboMPI library, which promises up to 300% speed improvement in collective MPI routines by taking advantage of shared memory to communicate within a node
37. The End