1
Implementation of Parallel Iterative Methods on the IBM SP3 (Blue Horizon)
  • Federico Balbi
  • University of Texas at San Antonio

2
  • Block OSOMIN (bosomin)
  • Study and implementation of a parallel iterative Krylov method based on block methods for large non-symmetric linear systems of the form Ax = f.
  • A may be large, sparse, and non-symmetric.
  • The block methods use a number of linearly independent initial residual vectors. The residual vectors are used to compute simultaneously a block of direction vectors, which are orthonormalized via Modified Gram-Schmidt (MGS).
  • What's next
  • Previous methods
  • bosomin(b, s, k) method
  • The IBM SP3
  • Implementation on the IBM SP3
  • Test results
  • Future work

3
s-Step Methods Chronology
  • Omin(k) - Chronopoulos, Pernice
  •   implemented on the IBM 3090/600S/6VF at USI
  • Osomin(s, k) - Chronopoulos, Swanson
  •   implemented on the CRAY C90 at PSC
  • Bosomin(b, s, k) - Chronopoulos, Kucherov
  •   implemented on the CRAY T3E at UT Austin
  • Bosomin(b, s, k) - Chronopoulos, Liu
  •   implemented on the IBM SP3 at SDSC
  • Note
  • Omin(k) = Osomin(1, k) = Bosomin(1, 1, k)

4
Omin(k)
  • The scalar a_i is calculated as a ratio of 2 inner products; then the solution advance and the residual are computed.
  • With respect to the GCR method, Omin retains only the last k search directions.
  • K is the right preconditioning matrix, used to improve the convergence rate.
  • Ax = f is transformed to (AK)(K^-1 x) = f, i.e. solve (AK) xhat = f and recover x = K xhat
  • K ~ A^-1
  • A p_(i+1) is made orthogonal to the previous A p's.
  • Preconditioned Omin(k):
  • Choose x_0, compute r_0 = f - A x_0, and set p_0 = r_0
  • For i = 0, 1, ... (until convergence: ||r_i||_2 < eps)
  •   a_i = (r_i^T A p_i) / ((A p_i)^T A p_i)   // scalar
  •   x_(i+1) = x_i + a_i p_i
  •   r_(i+1) = r_i - a_i A p_i
  •   b_j^(i) = -((A K r_(i+1))^T A p_j) / ((A p_j)^T A p_j)   for j_i <= j <= i
  •   p_(i+1) = K r_(i+1) + sum_(j = j_i .. i) b_j^(i) p_j
  • endfor
  • Note that j_i = max(0, i - k + 1). (A small Fortran sketch of this iteration follows.)
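As a concrete illustration of the iteration above, here is a minimal, self-contained Fortran 90 sketch of the unpreconditioned case (K = I, k = 1) on a small dense, diagonally dominant toy matrix; the matrix, problem size, and variable names are illustrative and are not taken from the bosomin code.

  ! Minimal sketch of the Omin recurrence with K = I and k = 1 on a dense toy matrix.
  program omin1_sketch
    implicit none
    integer, parameter :: n = 100
    real(8), parameter :: eps = 1.0d-10
    real(8) :: A(n,n), x(n), f(n), r(n), p(n), Ap(n), alpha, beta
    integer :: i, iter

    call random_number(A); call random_number(f)
    do i = 1, n                      ! make the toy matrix diagonally dominant
       A(i,i) = A(i,i) + n
    end do

    x = 0.0d0
    r = f - matmul(A, x)             ! r0 = f - A x0
    p = r                            ! p0 = r0
    do iter = 1, 1000
       Ap = matmul(A, p)
       alpha = dot_product(r, Ap) / dot_product(Ap, Ap)   ! ratio of 2 inner products
       x = x + alpha * p
       r = r - alpha * Ap
       if (sqrt(dot_product(r, r)) < eps) exit
       ! choose beta so that A p(i+1) is orthogonal to the previous A p
       beta = -dot_product(matmul(A, r), Ap) / dot_product(Ap, Ap)
       p = r + beta * p
    end do
    print *, 'iterations:', iter, '  residual norm:', sqrt(dot_product(r, r))
  end program omin1_sketch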

5
Osomin(s, k)
  • Proposed in order to increase the parallelism over Omin(k) by calculating s direction vectors at each step.
  • Beta is calculated so as to make AP_i orthogonal to the previous AP's.
  • The MGS method is used for the orthogonalization process.
  • P is made of s direction vectors.
  • R holds the residuals.
  • alpha is calculated to minimize the residual r_i = f - A x_i.
  • K is the (right) preconditioning matrix, as in Omin(k).

6
Osomin(s,k)
compute r_0 = f - A x_0
For i = 0, 1, ... (until convergence: ||r_i||_2 < eps)
  AP_i = [A r_i, A^2 r_i, ..., A^s r_i]
  P_i  = [r_i, A r_i, ..., A^(s-1) r_i]
  if (1 <= i) then
    B_l^j = [(A^l r_i)^T A p_1^j, ..., (A^l r_i)^T A p_s^j]^T   where l = 1..s and j = j(i-1)..i-1
    AP_i = AP_i - sum_(j = j(i-1) .. i-1) AP_j [B_l^j]_(l=1..s)
    P_i  = P_i  - sum_(j = j(i-1) .. i-1) P_j [B_l^j]_(l=1..s)
  endif
  [AP_i, P_i] = MGS(AP_i)   // orthonormalize AP_i, apply the same column operations to P_i
  a_i = [r_i^T AP_i^1, ..., r_i^T AP_i^s]^T
  x_(i+1) = x_i + P_i a_i
  r_(i+1) = r_i - AP_i a_i
endfor
Note: for s = 1, Osomin(s, k) is the Omin(k) algorithm. (A sketch of forming P_i and AP_i follows.)
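The column blocks P_i and AP_i above can be formed with s repeated matrix-vector products, as in the following Fortran 90 sketch (dense A for simplicity; the routine name and dense storage are assumptions, not the presentation's code).

  ! Sketch: build P = [r, A r, ..., A^(s-1) r] and AP = [A r, ..., A^s r].
  subroutine build_block(A, r, n, s, P, AP)
    implicit none
    integer, intent(in)  :: n, s
    real(8), intent(in)  :: A(n,n), r(n)
    real(8), intent(out) :: P(n,s), AP(n,s)
    integer :: j

    P(:,1) = r
    do j = 1, s
       AP(:,j) = matmul(A, P(:,j))          ! A^j r
       if (j < s) P(:,j+1) = AP(:,j)        ! next P column is the previous product
    end do
  end subroutine build_block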
7
Bosomin(b, s, k)
We assume that we are solving a single linear system AX = F of dimension n with b right-hand sides, F = [f_1, f_2, ..., f_b]. The matrices F (RHS), X (solution vectors), and R (residual vectors) are of dimension n x b. The matrices P (direction vectors), AP (matrix times direction vectors), and Q, S (used in the orthonormalization of AP and the corresponding transformation of P, respectively) are of dimension n x bs. U is an upper triangular matrix of dimension bs x bs. The (parameter) matrix of steplengths a is of dimension bs x b.
8
Algorithm
// Initialization
F = [f_1, f_2, ..., f_b]
R_1 = F - A X_1
Q_0 = S_0 = 0
// Iterations
For i = 0, 1, ... (until convergence: ||R_i||_2 < eps)
  P_i  = [R_i, A R_i, ..., A^(s-1) R_i]
  AP_i = [A R_i, A^2 R_i, ..., A^s R_i]
  // Orthogonalize AP_i against the matrices Q_(i-1), ..., Q_(j_i) and update P_i
  if (1 <= i) then
    AP_i = AP_i - sum_(j = j(i) .. i-1) Q_j (Q_j^T AP_i)
    P_i  = P_i  - sum_(j = j(i) .. i-1) S_j (Q_j^T AP_i)
  endif
  // MGS(AP_i): QR decomposition of AP_i (see the MGS sketch below)
  AP_i = Q_i U_i
  P_i  = S_i U_i
  // compute steplengths, solutions, and residuals
  a_i = Q_i^T R_i
  X_(i+1) = X_i + S_i a_i
  R_(i+1) = R_i - Q_i a_i
endfor
Note: for b = 1, Bosomin(b, s, k) is the Osomin(s, k) algorithm.
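The QR decomposition AP_i = Q_i U_i in the algorithm is computed with Modified Gram-Schmidt. A minimal serial Fortran 90 sketch of MGS on a dense n x (b*s) block is shown below; in bosomin the same column operations would also be applied to P_i to obtain S_i, and the routine name and interface here are illustrative, not the actual mgrs.f code.

  ! Sketch of Modified Gram-Schmidt: on exit AP holds the orthonormal factor Q
  ! and U the upper triangular factor, so that (original AP) = Q * U.
  ! Assumes AP has full column rank, so every U(j,j) > 0.
  subroutine mgs_qr(n, m, AP, U)
    implicit none
    integer, intent(in)    :: n, m
    real(8), intent(inout) :: AP(n,m)
    real(8), intent(out)   :: U(m,m)
    integer :: j, k

    U = 0.0d0
    do j = 1, m
       U(j,j) = sqrt(dot_product(AP(:,j), AP(:,j)))
       AP(:,j) = AP(:,j) / U(j,j)                  ! normalize column j
       do k = j+1, m
          U(j,k) = dot_product(AP(:,j), AP(:,k))
          AP(:,k) = AP(:,k) - U(j,k) * AP(:,j)     ! remove the q_j component
       end do
    end do
  end subroutine mgs_qr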
9
Previous implementations and data mapping
  • 2D data mapping (linear array of p processors)

10
Domain overlapping
(Figure: domain overlapping - regions 0 through p of the domain mapped to CPUs 0 through p. Parallel execution of ILU for a linear array of p processors; preconditioning is based on a backsolve.)
11
Bosomin Implementation data mapping
  • Data mapping significantly affects the performance of a parallel system.
  • In order to process the application matrix in parallel we must partition the 3D grid that is used to generate the matrix and map it to a 3D grid of processors.
  • We assume that p processors are available and arranged in a logical p^(1/3) x p^(1/3) x p^(1/3) grid.
  • The processors are labeled according to their position in the grid.
  • This mapping is ideal for distributing an n x n x n grid problem among p processors and provides the highest concurrency (see the MPI sketch below).

Space grid mapping to 3D mesh with 27 processors
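A logical 3D processor grid of this kind can be set up with MPI Cartesian topologies, as in the following Fortran 90 sketch; program and variable names are illustrative, and this is not the bosomin mesh3.f code.

  ! Sketch: arrange the p MPI processes in a logical 3D grid and query coordinates.
  program grid3d
    implicit none
    include 'mpif.h'
    integer :: ierr, rank3d, nprocs, comm3d
    integer :: dims(3), coords(3)
    logical :: periods(3)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    dims = 0                              ! let MPI factor nprocs into 3 dimensions
    call MPI_Dims_create(nprocs, 3, dims, ierr)
    periods = .false.                     ! the space grid is not periodic
    call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)

    call MPI_Comm_rank(comm3d, rank3d, ierr)
    call MPI_Cart_coords(comm3d, rank3d, 3, coords, ierr)
    print *, 'rank', rank3d, 'has grid position', coords

    call MPI_Finalize(ierr)
  end program grid3d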
12
Processors Grid
  • Example of an 8-processor mapping

(Figure: 2 x 2 x 2 processor grid, axes x, y, z)
P0 (0,0,0)   P1 (1,0,0)   P2 (0,1,0)   P3 (1,1,0)
P4 (0,0,1)   P5 (1,0,1)   P6 (0,1,1)   P7 (1,1,1)
13
Neighbors and Ghosts
  • In the 2 x 2 x 2 example each processor has 3 neighbors, one along each of the 3 dimensions.
  • A neighbor that exists in the processor grid along a dimension is a real neighbor; otherwise we call it a ghost.
  • For example, P1, P2, and P4 are the real neighbors of P0 (see the MPI_Cart_shift sketch below).

(Figure: the same 2 x 2 x 2 processor grid as above, axes x, y, z)
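With an MPI Cartesian communicator (as in the earlier grid sketch), the neighbors along each dimension can be obtained with MPI_Cart_shift; on a non-periodic grid a missing neighbor comes back as MPI_PROC_NULL, which plays the role of a ghost since communication with it is silently skipped. The subroutine name below is illustrative.

  ! Sketch: query the low/high neighbor of this process along each grid dimension.
  subroutine find_neighbors(comm3d)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: comm3d
    integer :: dim, lo, hi, ierr

    do dim = 0, 2                         ! the x, y and z directions of the grid
       call MPI_Cart_shift(comm3d, dim, 1, lo, hi, ierr)
       if (lo == MPI_PROC_NULL) print *, 'dim', dim, ': ghost on the low side'
       if (hi == MPI_PROC_NULL) print *, 'dim', dim, ': ghost on the high side'
    end do
  end subroutine find_neighbors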
14
The IBM SP3
15
Machine Specifications
  • Number of processors: 1152
  • Number of nodes: 144
  • Processors per node: 8
  • Memory per node: 4 GB
  • Processor: POWER3 RISC
  • Clock rate: 375 MHz
  • CPU pipeline: up to 8 instructions/cycle and 4 FP operations/cycle
  • Peak performance: 1.728 Teraflops
  • File storage: 15 TB
  • Operating system: AIX 5L
  • Internode connection: IBM SP Colony Switch2
  • Peak unidirectional transfer rate: 500 MB/sec per node pair
  • Peak bidirectional transfer rate: 1000 MB/sec

16
Programming Model Fortran 90 MPI
  • Compilers available: C, C++, Fortran 77, Fortran 90
  • Interactive and batch execution
  • Interactive: poe (only on 4 CPU nodes, max 2 GB)
  • Batch: IBM LoadLeveler job queue manager (llsubmit, llq, llcancel)
  • The original bosomin code was written in IBM Fortran 77 with MPI extensions (mpxlf77).
  • The new code (bosomin3) has been ported to Fortran 90 with MPI extensions (mpxlf90).
  • Straight MPI applications utilizing all 8 CPUs on each node should normally be run with the faster user space (US) protocol on the Colony switch.
  • Tests run on 1, 8, and 64 processors.
  • Problem grid sizes of 32 x 32 x 32 and 64 x 64 x 64.

17
Load Leveler llq command snapshot
Id               Owner      Submitted    ST  PRI  Class    Running On
---------------  ---------  -----------  --  ---  -------  ----------
tf005i.39235.0   ux452554   5/3  15:37   R   50   normal   tf009i
tf005i.39262.0   ux452554   5/4  11:04   R   50   normal   tf093i
tf005i.39284.0   ux453630   5/5  07:13   R   50   normal   tf021i
tf005i.39285.0   ux453630   5/5  07:13   R   50   normal   tf029i
tf005i.39286.0   ux453630   5/5  07:13   R   50   normal   tf083i
tf005i.39287.0   ux453630   5/5  07:13   R   50   normal   tf141i
tf005i.39288.0   ux453630   5/5  07:13   R   50   normal   tf155i
tf005i.39386.0   ux453138   5/5  08:06   R   50   normal   tf200i
tf004i.64439.0   ux450666   5/5  10:15   R   50   normal   tf081i
tf004i.64457.0   ux451727   5/5  12:16   R   50   normal   tf207i
tf004i.64501.0   ux452444   5/5  15:10   R   50   normal   tf019i
tf004i.64630.0   ux453804   5/6  06:21   R   50   high     tf097i
tf005i.39473.0   ux451204   5/6  08:27   R   50   high     tf228i
tf004i.64809.0   ux451204   5/6  15:16   R   50   high     tf227i
tf004i.64836.0   u13394     5/6  17:31   R   50   high     tf224i
tf004i.64847.0   ux453039   5/6  17:58   R   50   normal   tf039i
tf004i.64907.0   ux450833   5/7  01:36   R   50   high     tf006i

18
Original Bosomin
  • Fortran 77 with MPI
  • No compiler switch optimizations adopted
  • No modularization (2,300 lines of code in one
    single file)
  • No makefile
  • No data locality (most objects declared in COMMON blocks)
  • Implicit declaration
  • Hard to debug
  • No b-block routines to minimize subroutine calls and data communication overhead for b > 1 cases.

19
Bosomin3 implementation
  • Fortran 90 and MPI
  • Higher-level language features: modules, classes, intrinsic functions, user-defined data types
  • Modularization (1,600 lines of code in 7 distinct files)
  • Makefile to provide a faster incremental compilation process
  • Added compiler switch optimizations (-O3)
  • Data locality and use of modules to separate subroutines
  • No implicit declaration: undeclared objects force the compiler to stop.
  • Easier to debug and extend
  • b-block routines to handle b > 1 cases in one call:
  •   matvec
  •   preconditioner
  • Modified MGS to minimize MPI calls (especially collective ones); see the sketch below.
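One way a modified MGS can reduce collective communication, sketched below under assumed names (this is not the actual bosomin3 mgrs.f code), is to compute all the local partial inner products of one elimination step first and then combine them in a single MPI_Allreduce instead of issuing one reduction per inner product.

  ! Sketch: m inner products q^T W(:,j) over row-distributed vectors with ONE collective.
  subroutine batched_dots(q, W, local_n, m, dots, comm)
    implicit none
    include 'mpif.h'
    integer, intent(in)  :: local_n, m, comm
    real(8), intent(in)  :: q(local_n), W(local_n, m)
    real(8), intent(out) :: dots(m)
    real(8) :: partial(m)
    integer :: j, ierr

    do j = 1, m
       partial(j) = dot_product(q, W(:,j))      ! local contribution of this process
    end do
    ! one collective call for all m inner products instead of m separate ones
    call MPI_Allreduce(partial, dots, m, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
  end subroutine batched_dots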

20
Bosomin3 - Fortran 90 modules
  • Fortran 90 lets the programmer define independent modules. Bosomin3 is divided into the following modules (a minimal skeleton is sketched below):
  • global_data.F - constant definitions common to the different modules
  • mesh3.f - 3D grid initialization routines
  • precond.f - preconditioner routines
  • mat.f - matrix-vector routines
  • mgrs.f - Modified Gram-Schmidt routines
  • m_osomin.f - bosomin algorithm
  • bosomin.f - main program
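A minimal sketch of the kind of module layout this refers to is shown below; only the module and file names come from the list above, while the contents (constants, interfaces, placeholder body) are illustrative and not the bosomin3 source.

  ! global_data.F - constants shared by the other modules (illustrative contents)
  module global_data
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    real(dp), parameter :: eps = 1.0e-10_dp     ! convergence tolerance
  end module global_data

  ! mat.f - matrix-vector routines (placeholder body)
  module mat
    use global_data
    implicit none
  contains
    subroutine matvec(x, y)
      real(dp), intent(in)  :: x(:)
      real(dp), intent(out) :: y(:)
      y = x                    ! placeholder for the sparse matrix-vector product
    end subroutine matvec
  end module mat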

21
IBM XLF compiler optimization options
  • -O2 option
  • Value numbering - folding several instructions
    into a single instruction.
  • Branch straightening - rearranging program code
    to minimize branch logic, and combining
    physically separate blocks of code.
  • Common expression elimination - eliminating
    duplicate expressions.
  • Code motion - performing calculations outside a
    loop (if the variables in the loop are not
    altered within it) and using those results within
    the loop.
  • Reassociation and strength reduction -
    rearranging calculation sequences in loops in
    order to replace less efficient instructions with
    more efficient ones.
  • Store motion - moving store operations out of
    loops.
  • Dead store elimination - eliminating stores when
    the value stored is never referred to again.
  • Dead code elimination - eliminating code for
    calculations that are not required and portions
    of the code that can never be reached.
  • Global register allocation - keeping variables
    and expressions in registers instead of memory.
  • Instruction scheduling - reordering instructions
    to minimize program execution time.

22
Store motion
  • Dot-product example of the store-motion optimization done by -O2:
  • x = 0.0
  • do i = 1, ilim
  •   x = a(i) * b(i) + x
  • enddo
  • The compiler recognizes that there is no need to store the value of x until the loop is completed, and intermediate values are kept in registers. Even if the loads and stores are cached, the optimization can lead to an order of magnitude or better improvement in the performance of this loop.

23
-O3 option
  • Performs all of the optimizations done by -O2 as well as others that require more memory or time to accomplish.
  • Some optimizations might change the semantics of the program and cause slight numeric differences. The -qstrict option avoids this problem.
  • Rewriting floating-point expressions - computations like a*b*c may be rewritten as a*c*b if there is a common subexpression. This is not done at the -O2 level, since different numeric results may occur.
  • Divides are replaced by multiplies by the reciprocal. Divides on the POWER3 are very expensive operations: they require 18 cycles for 64-bit FP operands. The reciprocal and multiply are much cheaper operations. This might produce slightly different results (see the sketch after this list).
  • Aggressive code motion and scheduling - the compiler rearranges the code and instruction sequence much more aggressively. Computations that could raise an exception and whose execution is conditional in the program might be scheduled unconditionally at this level if that leads to improvements in performance. Loads and floating-point computations may be scheduled even though, according to the actual semantics of the program, they might never have been executed.
  • The same principle is followed when it comes to moving many kinds of loads. The compiler will move safe types of loads for a potential performance improvement. Loads in general are not movable at the -O2 level of optimization, because a program can declare a static array and then load an element of that array far beyond the declared boundary, which might cause a segmentation violation.
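The following is a hand-written illustration of the divide-to-reciprocal rewrite; at -O3 (without -qstrict) the compiler may do this automatically, and the toy program here is only a sketch to show why results can differ in the last bits.

  ! Sketch: dividing by d versus multiplying by 1/d may differ in the last bits.
  program reciprocal_demo
    implicit none
    integer, parameter :: n = 5
    real(8) :: x(n) = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0, 5.0d0 /)
    real(8) :: y1(n), y2(n), d, rd
    integer :: i

    d  = 3.0d0
    rd = 1.0d0 / d            ! reciprocal computed once, outside the loop
    do i = 1, n
       y1(i) = x(i) / d       ! POWER3 divide: about 18 cycles per 64-bit operand
       y2(i) = x(i) * rd      ! much cheaper multiply
    end do
    print *, 'max difference:', maxval(abs(y1 - y2))
  end program reciprocal_demo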

24
-O3 option
  • In the following example, at -O2, the computation
    of bc is not moved out of the loop for two
    reasons it is considered dangerous because it is
    a floating-point operation and could thus
    possibly cause an exception and it does not occur
    on every path through the loop, so it potentially
    may never be executed. At -O3, the computation is
    moved outside the loop and done only once.
  • do i1,ilim-1
  • if(a(i) lt a(i1)) a(i)bc
  • enddo

25
Bosomin3 - makefile
  • A makefile has been created in order to avoid unnecessary recompilation of unmodified modules. It also lets the programmer compile different modules using different libraries and compiler switches.

CC   = mpxlf90
OPS  = -O3 -qmaxmem=-1 -qstrict
LIBS = -lessl

all: bosomin

bosomin: global_data.o bosomin.o mat.o mesh3.o mgrs.o precond.o m_osomin.o
	$(CC) $(OPS) -o bosomin bosomin.o global_data.o mat.o \
	      mesh3.o mgrs.o precond.o m_osomin.o $(LIBS)

bosomin.o: bosomin.f mesh3.o mat.o precond.o mgrs.o m_osomin.o
	$(CC) $(OPS) -c bosomin.f $(LIBS)

global_data.o: global_data.F
	$(CC) $(OPS) -c global_data.F $(LIBS)

m_osomin.o: m_osomin.f mat.o mgrs.o global_data.o
	$(CC) $(OPS) -c m_osomin.f $(LIBS)

mat.o: mat.f precond.o global_data.o
	$(CC) $(OPS) -c mat.f $(LIBS)

26
Results
27
Parameters
  • Runs have been performed to compare bosomin and bosomin3 using several different parameters:
  • grids: 32 x 32 x 32 and 64 x 64 x 64
  • b = 1, 4
  • k = 0, 1, 2, 4
  • s = 1, 2, 4, 8
  • eps = 1E-10
  • maxiter = 5 (comparison test)
  • maxiter = 100 (convergence test)
  • 1, 8, 64 processors
  • -O2, -O3 optimizations
  • Gain is calculated as T(F77) over T(F90).
  • Speedup is calculated as T(1 processor) over T(n processors).

28
Compiler optimization 1 processor comparison
29
(No Transcript)
30
-O3 1, 8, 64 processors comparison
31
Note that with b > 1 the block operations of bosomin3 further improve the gain over the original bosomin code. Note also the improved scalability when run on more processors: the bosomin3 speedup is twice that of the original code.
32
(No Transcript)
33
Convergence test 1 processor
34
Convergence test, n processors (opt = -O3, maxiter = 100, eps = 1E-10)
35
Conclusions
  • bosomin3 is faster and more scalable than the previous implementation, by taking advantage of the cases where b > 1.
  • It is more precise.
  • The code is smaller, cleaner, and easier to maintain.

36
Future Work
  • Implement similar algorithms (such as GMRES) on the SP3 using the 3D data mapping to see whether they show the same improvements.
  • Averaging solutions
  • Test different matrices and show that the block method converges faster with b > 1 right-hand sides.
  • Run tests using Longhorn, the IBM SP4 (POWER4, 1.3 GHz) at UT Austin.
  • Test the new TurboMPI library, which promises up to a 300% speed improvement in collective MPI routines by taking advantage of shared memory for communication within a node.

37
FINE (The End)