Title: High Performance Parallel Computing
1. High Performance Parallel Computing
Getting Started on ACRL's IBM SP
- Virendra Bhavsar & Eric Aubanel
- Advanced Computational Research Laboratory
- Faculty of Computer Science, UNB
2. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
3. ACRL's IBM SP
- 4 Winterhawk II nodes (16 processors)
- Each node has
- 1 GB RAM
- 9 GB (mirrored) disk
- Switch adapter
- High Performance Switch
- Gigabit Ethernet (1 node)
- Control workstation
- Disk: SSA tower with 6 x 18.2 GB disks
5. The Clustered SMP
- ACRL's SP: four 4-way SMPs
- Each node has its own copy of the O/S
- Processors on the same node are closer to each other than those on different nodes
6. General Parallel File System
[Figure: nodes 1-4, connected by the SP Switch, share the parallel file system]
7. ACRL Software
- Operating System: AIX 4.3.3
- Compilers
- IBM XL Fortran 7.1
- IBM High Performance Fortran 1.4
- VisualAge C for AIX, Version 5.0.1.0
- VisualAge C++ Professional for AIX, Version 5.0.0.0
- Java
- Job Scheduler: LoadLeveler 2.2
- Parallel Programming Tools
- IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
- Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
- Computational Chemistry: NWChem
- Visualization: OpenDX (not yet installed)
- E-Commerce software (not yet installed)
8. ESSL
- Linear algebra, Fourier-related transforms, sorting, interpolation, quadrature, random numbers
- Fast!
- 560x560 real*8 matrix multiply
- Hand coding: 19 Mflops
- dgemm: 1.2 Gflops
- Parallel (threaded and distributed) versions
9. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
10. How to use the IBM SP: Basics
- www.cs.unb.ca/acrl
- Connecting and transferring files
- symphony.unb.ca
- SSH
- The standard system shell can be either ksh or tcsh
- ksh & tcsh: Note that the current directory is not in the PATH environment variable. To execute a program (say a.out) from your current directory, type ./a.out, not a.out.
- ksh: Command line editing is available using vi. Enter ESC-k to bring up the last command, then use vi keys to edit previous commands.
11. Basics (cont'd)
- tcsh: Command line editing is available using the arrow keys, delete key, etc.
- To change your shell, e.g. from ksh to tcsh: yppasswd -s <username> /bin/tcsh
- If you are new to Unix, we have suggested two good Unix tutorial web sites on our links page.
- Editing files
- vi
- vim / gvim
- emacs
- Compiling & running programs
- Recommended optimization: start with -O3 -qstrict, then try -O3, then -O3 -qarch=pwr3
12. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- (see users' guide on the ACRL web site)
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
13. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
14. Message Passing Model
- We have an ensemble of processors and the memory they can access
- Each processor executes its own program
- All processors are interconnected by a network (or a hierarchy of networks)
- Processors communicate by passing messages
- Contrast with OpenMP: implicit communication
15. The Process: basic unit of the application
- Characteristics of a process
- A running executable of a (compiled and linked) program written in a standard sequential language (e.g. Fortran or C), with library calls to implement the message passing
- A process executes on a processor
- all processes are assigned to processors in a one-to-one mapping (in the simplest model of parallel programming)
- other processes may execute on other processors
- A process communicates and synchronizes with other processes via messages.
16. Domain Decomposition
- In the scientific world (especially in the world of simulation and modelling) this is the most common approach
- The solution space (which often corresponds to real space) is divided up among the processors; each processor solves its own little piece (see the index-range sketch below)
- Finite-difference methods and finite-element methods lend themselves well to this approach
- The method of solution often leads naturally to a set of simultaneous equations that can be solved by parallel matrix solvers
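As a rough illustration of the index-range bookkeeping behind domain decomposition (an editorial sketch, not from the original slides; the function name and sizes are made up), here is how each process might work out which slice of a 1-D domain it owns:

#include <stdio.h>

/* Divide npts grid points as evenly as possible among nprocs processes;
   the first (npts % nprocs) processes get one extra point. */
static void my_range(int npts, int nprocs, int myid, int *lo, int *hi)
{
    int base = npts / nprocs;      /* minimum points per process */
    int rem  = npts % nprocs;      /* leftover points to distribute */
    *lo = myid * base + (myid < rem ? myid : rem);
    *hi = *lo + base + (myid < rem ? 1 : 0) - 1;
}

int main(void)
{
    int lo, hi, p, nprocs = 4, npts = 10;
    for (p = 0; p < nprocs; p++) {
        my_range(npts, nprocs, p, &lo, &hi);
        printf("process %d owns points %d..%d\n", p, lo, hi);
    }
    return 0;
}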
20. Message Passing Interface
- MPI 1.0 standard in 1994
- MPI 1.1 in 1995
- MPI 2.0 in 1997
- Includes 1.1 but adds new features
- MPI-IO
- One-sided communication
- Dynamic processes
21. Advantages of MPI
- Universality
- Expressivity
- Well suited to formulating a parallel algorithm
- Ease of debugging
- Memory is local
- Performance
- Explicit association of data with process allows
good use of cache
22. Disadvantages of MPI
- Harder to learn than shared memory programming (OpenMP)
- Keeping track of the communication pattern can be tricky
- Does not allow incremental parallelization: all or nothing!
23. What System Calls Enable Message Passing?
- A simple subset (see the C sketch below):
- send: send a message to another process
- receive: receive a message from another process
- size_the_system: how many processes am I using to run this code?
- who_am_i: what is my process number within the parallel application?
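To make the subset concrete, here is a minimal C sketch (an editorial addition, not from the slides) showing how the four primitives map onto actual MPI calls; the token value and tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ntasks, my_id, token = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* size_the_system */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);    /* who_am_i */

    if (ntasks > 1) {
        if (my_id == 0)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send */
        else if (my_id == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive */
    }

    MPI_Finalize();
    return 0;
}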
24. Starting and Stopping MPI
- Every MPI code needs to have the following form:
- program my_mpi_application
- include 'mpif.h'
- ...
- call mpi_init (ierror)   ! ierror is where mpi_init puts the error code describing the success or failure of the subroutine call
- ...
- < the program goes here! >
- ...
- call mpi_finalize (ierror)   ! Again, make sure ierror is present!
- stop
- end
- Although, strictly speaking, executable statements can come before MPI_INIT and after MPI_FINALIZE, they should have nothing to do with MPI.
- Best practice is to bracket your code completely by these statements.
25. Finding out about the application
- How many processes are in the application?
- call MPI_COMM_SIZE ( comm, num_procs, ierror )
- returns the number of processes in the communicator
- if comm = MPI_COMM_WORLD, the number of processes in the application is returned in num_procs
- Who am I?
- call MPI_COMM_RANK ( comm, my_id, ierror )
- returns the rank of the calling process in the communicator
- if comm = MPI_COMM_WORLD, the identity of the calling process is returned in my_id
- my_id will be a whole number between 0 and num_procs - 1
26. Simple MPI Example
Sample output from a run on 3 processes:
My_Id, numb_of_procs
0, 3
1, 3
2, 3
This is from MPI process number 0
This is from MPI processes other than 0
This is from MPI processes other than 0
27. Simple MPI Example
- Program Trivial
- implicit none
- include "mpif.h"   ! MPI header file
- integer My_Id, Numb_of_Procs, Ierr
- call MPI_INIT ( ierr )
- call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
- call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
- print *, ' My_id, numb_of_procs ', My_Id, Numb_of_Procs
- if ( My_Id .eq. 0 ) then
- print *, ' This is from MPI process number ', My_Id
- else
- print *, ' This is from MPI processes other than 0 ', My_Id
- end if
- call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
- stop
- end
28. MPI in Fortran and C
- Important Fortran and C differences
- In Fortran, the MPI library is implemented as a collection of subroutines.
- In C, it is a collection of functions.
- In Fortran, any error return code must appear as the last argument of the subroutine.
- In C, the error code is the value the function returns.
29. Simple MPI C Example
- #include <stdio.h>
- #include <mpi.h>
- int main(int argc, char *argv[])
- {
- int taskid, ntasks;
- MPI_Init(&argc, &argv);
- MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
- MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
- printf("Hello from task %d.\n", taskid);
- MPI_Finalize();
- return(0);
- }
30. MPI Functionality
- Several modes of point-to-point message passing
- blocking (e.g. MPI_SEND)
- non-blocking (e.g. MPI_ISEND)
- synchronous (e.g. MPI_SSEND)
- buffered (e.g. MPI_BSEND)
- Collective communication and synchronization
- e.g. MPI_REDUCE, MPI_BARRIER
- User-defined datatypes
- Logically distinct communicator spaces
- Application-level or virtual topologies
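A short C sketch (an editorial addition, not from the slides) of a few of these features: a non-blocking ring exchange with MPI_Isend/MPI_Irecv, followed by a collective MPI_Reduce and MPI_Barrier:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, right, left, sendval, recvval, sum;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Non-blocking ring exchange: post the receive and the send, then wait */
    right = (rank + 1) % nprocs;
    left  = (rank + nprocs - 1) % nprocs;
    sendval = rank;
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);

    /* Collective communication: sum of all ranks, result on process 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("Sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}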
31. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
32. Matrix Multiply Example
[Figure: C = A x B]
33. Matrix Multiply Serial Program
- ! Initialize matrices A and B
- time1 = rtc()
- call jikloop
- time2 = rtc()
- print *, time2-time1
- subroutine jikloop
- integer matdim, ncols
- real(8), dimension(:,:) :: a, b, c
- real(8) :: cc
- do j = 1, matdim
- do i = 1, matdim
- cc = 0.0d0
- do k = 1, matdim
- cc = cc + a(i,k)*b(k,j)
- end do
- c(i,j) = cc
- end do
- end do
- end subroutine jikloop
34. Matrix multiply over 4 processes
[Figure: C = A x B, with the columns of B and C divided into blocks 0-3, one block per process; the complete C ends up on process 0]
- Process 0
- initially has A and B
- broadcasts A to all processes
- scatters the columns of B among all processes
- All processes calculate C = A x B for the appropriate columns of C
- Columns of C are gathered into process 0
35. MPI Matrix Multiply
- real a(dim,dim), b(dim,dim), c(dim,dim)
- ncols = dim/numprocs
- if( myid .eq. master ) then
- ! Initialize matrices A and B
- time1 = rtc()
- call Broadcast(a to all)
- do i = 1, numprocs-1
- call Send(ncols columns of b to i)
- end do
- call jikloop   ! c(1st ncols) = a x b(1st ncols)
- do i = 1, numprocs-1
- call Receive(ncols columns of c from i)
- end do
- time2 = rtc()
- print *, time2-time1
- else   ! Processors other than master
- allocate ( blocal(dim,ncols), clocal(dim,ncols) )
- call Broadcast(a to all)
- call Receive(blocal from master)
- call jikloop   ! clocal = a x blocal
- call Send(clocal to master)
- endif
36. MPI Send
- call Send(ncols columns of b to i)
- call MPI_SEND( b(1,i*ncols+1), ncols*matdim, MPI_DOUBLE_PRECISION, i, tag, MPI_COMM_WORLD, ierr )
- b(1,i*ncols+1): address where the data start
- ncols*matdim: the number of elements (items) of data in the message
- MPI_DOUBLE_PRECISION: type of the data to be transmitted
- i: the message is sent to process i
- tag: message tag, an integer to help distinguish among messages
37. MPI Receive
- call Receive(ncols columns of c from i)
- call MPI_RECV( c(1,i*ncols+1), ncols*matdim, MPI_DOUBLE_PRECISION, i, tag, MPI_COMM_WORLD, status, ierr )
- status: integer array of size MPI_STATUS_SIZE containing information that is returned. For example, if you specify a wildcard (MPI_ANY_SOURCE or MPI_ANY_TAG) for the source or tag, status will tell you the actual rank (status(MPI_SOURCE)) or tag (status(MPI_TAG)) of the message received.
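For comparison, here is a C sketch (an editorial addition, not from the slides) of a send/receive pair in which the receiver uses the wildcards and then inspects the status object; the buffer size and tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

#define N 4   /* number of values per message (arbitrary) */

int main(int argc, char *argv[])
{
    int rank, nprocs, i, p, tag = 99;
    double buf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank != 0) {
        for (i = 0; i < N; i++) buf[i] = rank + 0.1 * i;
        MPI_Send(buf, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    } else {
        /* Receive one message from each worker, in whatever order they arrive */
        for (p = 1; p < nprocs; p++) {
            MPI_Recv(buf, N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d values from rank %d (tag %d)\n",
                   N, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}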
38. MPI Broadcast
- call Broadcast(a to all)
- call MPI_BCAST( a, matdim**2, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
- master: the message is broadcast from the master process (myid = 0)
39. Better MPI Matrix Multiply
- if( myid .eq. master ) then
- ! Initialize matrices A and B
- time1 = timef()/1000
- endif
- call Broadcast ( a, envelope )
- call Scatter ( b, blocal, envelope )
- call jikloop   ! clocal = a x blocal
- call Gather ( clocal, c, envelope )
- if( myid .eq. master ) then
- time2 = timef()/1000
- print *, time2-time1
- endif
40. MPI Scatter
[Figure: the master's array b is split into blocks of 100 elements; each process receives one 100-element block in blocal]
call MPI_SCATTER( b, 100, MPI_DOUBLE_PRECISION, blocal, 100, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
41. MPI Gather
[Figure: each process contributes its 100-element clocal; the blocks are collected into the array c on the master]
call MPI_GATHER( clocal, 100, MPI_DOUBLE_PRECISION, c, 100, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
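For readers working in C, here is a minimal sketch of the same broadcast/scatter/compute/gather pattern (an editorial addition, not from the slides); it assumes the matrix order is divisible by the number of processes, uses column-major storage to match the Fortran slides, and its names (MATDIM, jikloop) are illustrative:

#include <stdlib.h>
#include <mpi.h>

#define MATDIM 560   /* matrix order (illustrative) */

/* C = A x B for a block of ncols columns; column-major storage, so
   element (i,j) of a MATDIM-row array is at index [i + j*MATDIM] */
static void jikloop(const double *a, const double *b, double *c, int ncols)
{
    int i, j, k;
    for (j = 0; j < ncols; j++)
        for (i = 0; i < MATDIM; i++) {
            double cc = 0.0;
            for (k = 0; k < MATDIM; k++)
                cc += a[i + k*MATDIM] * b[k + j*MATDIM];
            c[i + j*MATDIM] = cc;
        }
}

int main(int argc, char *argv[])
{
    int myid, numprocs, ncols, master = 0;
    double *a, *b = NULL, *c = NULL, *blocal, *clocal;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    ncols = MATDIM / numprocs;   /* assumes MATDIM % numprocs == 0 */

    a      = malloc(MATDIM*MATDIM*sizeof(double));
    blocal = malloc(MATDIM*ncols*sizeof(double));
    clocal = malloc(MATDIM*ncols*sizeof(double));
    if (myid == master) {
        b = malloc(MATDIM*MATDIM*sizeof(double));
        c = malloc(MATDIM*MATDIM*sizeof(double));
        /* ... initialize a and b here ... */
    }

    MPI_Bcast(a, MATDIM*MATDIM, MPI_DOUBLE, master, MPI_COMM_WORLD);
    MPI_Scatter(b, MATDIM*ncols, MPI_DOUBLE,
                blocal, MATDIM*ncols, MPI_DOUBLE, master, MPI_COMM_WORLD);
    jikloop(a, blocal, clocal, ncols);
    MPI_Gather(clocal, MATDIM*ncols, MPI_DOUBLE,
               c, MATDIM*ncols, MPI_DOUBLE, master, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}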
42. Matrix Multiply Timing Results on IBM SP
43. Measuring Performance
- Assume we time only the parallelized region
- Speedup on p processors: S(p) = T(1) / T(p), where T(p) is the wallclock time on p processors
- Ideally, S(p) = p
44. Matrix Multiply Speedup Results on IBM SP
45. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
46. What is Shared Memory Parallelization?
- All processors can access all the memory in the parallel system (one address space).
- The time to access the memory may not be equal for all processors: not necessarily a flat memory
- Parallelizing on an SMP does not reduce CPU time; it reduces wallclock time
- Parallel execution is achieved by generating multiple threads which execute in parallel
- The number of threads is (in principle) independent of the number of processors
47. Processes vs. Threads
[Figure: a process has its own code, heap, stack, and instruction pointer (IP); threads within a process share the code and heap, but each thread has its own stack and IP]
48. OpenMP Threads
- 1. All OpenMP programs begin as a single process: the master thread
- 2. FORK: the master thread then creates a team of parallel threads
- 3. Parallel region statements are executed in parallel among the various team threads
- 4. JOIN: the threads synchronize and terminate, leaving only the master thread
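A minimal C illustration of this fork/join model (an editorial sketch, not from the slides); each member of the team reports its thread id inside the parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Before the parallel region: master thread only\n");

    /* FORK: a team of threads executes this block */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }   /* JOIN: the threads synchronize; only the master continues */

    printf("After the parallel region: master thread only\n");
    return 0;
}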
49. OpenMP example
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
  z(i) = a * x(i) + y
end do
return
end
50. Private vs Shared Variables
- Serial execution: z, a, x, y, and n live in global shared memory, and all data references are to global shared memory
- Parallel execution: references to z, a, x, y, and n are still to global shared memory
- Each thread has a private copy of the loop index i; references to i are to the private copy
51. Division of Work
- Example: n = 40, 4 threads
- The saxpy parallel do loop from the previous slide is divided among the threads
- Each thread executes its own chunk of iterations, with its private copy of i in local memory: i = 1,10; i = 11,20; i = 21,30; i = 31,40
52. OpenMP
- 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
- www.openmp.org
- OpenMP provides compiler directives, embedded in C/C++ or Fortran source code, for
- scoping data
- specifying work load
- synchronization of threads
- OpenMP provides function calls for obtaining information about threads
- e.g., omp_get_num_threads(), omp_get_thread_num()
53. OpenMP in C
- Same functionality as OpenMP for Fortran
- Differences in syntax
- #pragma omp for
- Differences in variable scoping
- variables "visible" when #pragma omp parallel is encountered are shared by default
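As a sketch of these defaults (an editorial addition, not from the slides), here is the saxpy loop of slide 49 written in C: z, a, x, y, and n are visible when the pragma is encountered and therefore shared, while the loop index i is explicitly made private:

#include <omp.h>

/* z[i] = a*x[i] + y for i = 0..n-1; the C analogue of the Fortran saxpy */
void saxpy(float *z, float a, const float *x, float y, int n)
{
    int i;
    /* z, a, x, y, n are shared by default; i is private to each thread */
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}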
54. OpenMP Overhead
- The overhead for parallelization is large (e.g. 8000 cycles for a parallel do over 16 processors of an SGI Origin 2000)
- The size of the parallel work construct must be significant enough to overcome the overhead
- Rule of thumb: it takes 10 kFLOPS to amortize the overhead (see the sketch below)
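One common way to respect this rule of thumb is OpenMP's if clause, which disables the parallel construct for small problem sizes; a C sketch with an arbitrary threshold (an editorial addition, not from the slides):

/* Parallelize only when the loop is big enough to amortize thread startup;
   below the (arbitrary) threshold the loop runs serially. */
void scale(double *z, double a, int n)
{
    int i;
    #pragma omp parallel for if(n > 10000)
    for (i = 0; i < n; i++)
        z[i] = a * z[i];
}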
55. OpenMP Use
- How is OpenMP typically used?
- OpenMP is usually used to parallelize loops
- Find your most time-consuming loops
- Split them up between threads
- Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
56. OpenMP & LoadLeveler
- To get exclusive use of a node:
- # @ network.MPI = css0,not_shared,us
- # @ node_usage = not_shared
57. Matrix Multiply with OpenMP
- subroutine jikloop
- integer matdim, ncols
- real(8), dimension(:,:) :: a, b, c
- real(8) :: cc
- !$OMP PARALLEL DO PRIVATE(i, k, cc)   ! i, k, and cc must be private to each thread
- do j = 1, matdim
- do i = 1, matdim
- cc = 0.0d0
- do k = 1, matdim
- cc = cc + a(i,k)*b(k,j)
- end do
- c(i,j) = cc
- end do
- end do
- end subroutine jikloop
60. OpenMP vs. MPI
- OpenMP
- Only for shared memory computers
- Easy to incrementally parallelize
- More difficult to write highly scalable programs
- Small API based on compiler directives and a limited set of library routines
- Same program can be used for sequential and parallel execution
- Shared vs. private variables can cause confusion
- MPI
- Portable to all platforms
- Parallelize all or nothing
- Vast collection of library routines
- Possible but difficult to use the same program for serial and parallel execution
- Variables are local to each processor
61. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
62. MPI Performance Visualization
- ParaGraph
- Developed by the University of Illinois
- A graphical display system for visualizing the behaviour and performance of MPI programs
64. Master-Slave Parallelization: static balancing
67. References
- MPI
- Using MPI, by Gropp, Lusk, and Skjellum (MIT Press)
- Using MPI-2, by the same authors
- MPI website: www-unix.mcs.anl.gov/mpi/
- OpenMP (www.openmp.org)
- Parallel Programming in OpenMP, by Chandra et al. (Morgan Kaufmann)
- Lawrence Livermore online tutorials
- www.llnl.gov/computing/tutorials/
- Matrix multiply programs: www.cs.unb.ca/acrl/acrl_workshop/
- Parallel programming with generalized fractals: www.cs.unb.ca/staff/aubanel/aubanel_fractals.html