Title: High Performance Parallel Computing
1. High Performance Parallel Computing
Getting Started on ACRL's IBM SP
- Virendra Bhavsar & Eric Aubanel
- Advanced Computational Research Laboratory
- Faculty of Computer Science, UNB
2. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
3. ACRL's IBM SP
- 4 Winterhawk II nodes (16 processors)
- Each node has
- 1 GB RAM
- 9 GB (mirrored) disk
- Switch adapter
- High Performance Switch
- Gigabit Ethernet (1 node)
- Control workstation
- Disk: SSA tower with 6 x 18.2 GB disks
5. The Clustered SMP
- ACRL's SP: four 4-way SMPs
- Each node has its own copy of the O/S
- Processors on the same node are closer to each other than those on different nodes
6. General Parallel File System
[Figure: nodes 1-4, connected by the SP Switch, share the parallel file system]
7. ACRL Software
- Operating System: AIX 4.3.3
- Compilers
- IBM XL Fortran 7.1
- IBM High Performance Fortran 1.4
- VisualAge C for AIX, Version 5.0.1.0
- VisualAge C++ Professional for AIX, Version 5.0.0.0
- Java
- Job Scheduler: LoadLeveler 2.2
- Parallel Programming Tools
- IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
- Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
- Computational Chemistry: NWChem
- Visualization: OpenDX (not yet installed)
- E-Commerce software (not yet installed)
8. ESSL
- Linear algebra, Fourier-related transforms, sorting, interpolation, quadrature, random numbers
- Fast!
- 560x560 real*8 matrix multiply
- Hand coding: 19 Mflops
- dgemm: 1.2 Gflops
- Parallel (threaded and distributed) versions
9. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
10. How to use the IBM SP: Basics
- www.cs.unb.ca/acrl
- Connecting and transferring files
- symphony.unb.ca
- SSH
- The standard system shell can be either ksh or tcsh
- ksh & tcsh: Note that the current directory is not in the PATH environment variable. To execute a program (say a.out) from your current directory, type ./a.out, not a.out.
- ksh: Command line editing is available using vi. Enter ESC-k to bring up the last command, then use vi keys to edit previous commands.
11. Basics (cont'd)
- tcsh: Command line editing is available using the arrow keys, delete key, etc.
- To change your shell, e.g. from ksh to tcsh: yppasswd -s <username> /bin/tcsh
- If you are new to Unix, we have suggested two good Unix tutorial web sites on our links page.
- Editing files
- vi
- vim / gvim
- emacs
- Compiling & running programs
- Recommended optimization: start with -O3 -qstrict, then try -O3, then -O3 -qarch=pwr3
12. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- (see users' guide on the ACRL web site)
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
13. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
14. Message Passing Model
- We have an ensemble of processors and the memory they can access
- Each processor executes its own program
- All processors are interconnected by a network (or a hierarchy of networks)
- Processors communicate by passing messages
- Contrast with OpenMP: implicit communication
15. The Process: basic unit of the application
- Characteristics of a process
- A running executable of a (compiled and linked) program written in a standard sequential language (e.g. Fortran or C), with library calls to implement the message passing
- A process executes on a processor
- all processes are assigned to processors in a one-to-one mapping (in the simplest model of parallel programming)
- other processes may execute on other processors
- A process communicates and synchronizes with other processes via messages.
16. Domain Decomposition
- In the scientific world (especially in the world of simulation and modelling) this is the most common approach
- The solution space (which often corresponds to real space) is divided up among the processors; each processor solves its own little piece (see the index-range sketch below)
- Finite-difference methods and finite-element methods lend themselves well to this approach
- The method of solution often leads naturally to a set of simultaneous equations that can be solved by parallel matrix solvers
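As a rough illustration of the index-range bookkeeping behind domain decomposition (an editorial sketch, not from the original slides; the function name and sizes are made up), here is how each process might work out which slice of a 1-D domain it owns:

#include <stdio.h>

/* Divide npts grid points as evenly as possible among nprocs processes;
   the first (npts % nprocs) processes get one extra point. */
static void my_range(int npts, int nprocs, int myid, int *lo, int *hi)
{
    int base = npts / nprocs;      /* minimum points per process */
    int rem  = npts % nprocs;      /* leftover points to distribute */
    *lo = myid * base + (myid < rem ? myid : rem);
    *hi = *lo + base + (myid < rem ? 1 : 0) - 1;
}

int main(void)
{
    int lo, hi, p, nprocs = 4, npts = 10;
    for (p = 0; p < nprocs; p++) {
        my_range(npts, nprocs, p, &lo, &hi);
        printf("process %d owns points %d..%d\n", p, lo, hi);
    }
    return 0;
}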
20. Message Passing Interface
- MPI 1.0 standard in 1994
- MPI 1.1 in 1995
- MPI 2.0 in 1997
- Includes 1.1 but adds new features
- MPI-IO
- One-sided communication
- Dynamic processes
21. Advantages of MPI
- Universality
- Expressivity
- Well suited to formulating a parallel algorithm
- Ease of debugging
- Memory is local
- Performance
- Explicit association of data with process allows
good use of cache
22. Disadvantages of MPI
- Harder to learn than shared memory programming (OpenMP)
- Keeping track of the communication pattern can be tricky
- Does not allow incremental parallelization: all or nothing!
23. What System Calls Enable Message Passing?
- A simple subset (see the C sketch below):
- send: send a message to another process
- receive: receive a message from another process
- size_the_system: how many processes am I using to run this code?
- who_am_i: what is my process number within the parallel application?
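To make the subset concrete, here is a minimal C sketch (an editorial addition, not from the slides) showing how the four primitives map onto actual MPI calls; the token value and tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ntasks, my_id, token = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* size_the_system */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);    /* who_am_i */

    if (ntasks > 1) {
        if (my_id == 0)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send */
        else if (my_id == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive */
    }

    MPI_Finalize();
    return 0;
}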
24. Starting and Stopping MPI
- Every MPI code needs to have the following form:
- program my_mpi_application
- include 'mpif.h'
- ...
- call mpi_init (ierror)   ! ierror is where mpi_init puts the error code describing the success or failure of the subroutine call
- ...
- < the program goes here! >
- ...
- call mpi_finalize (ierror)   ! Again, make sure ierror is present!
- stop
- end
- Although, strictly speaking, executable statements can come before MPI_INIT and after MPI_FINALIZE, they should have nothing to do with MPI.
- Best practice is to bracket your code completely by these statements.
25. Finding out about the application
- How many processes are in the application?
- call MPI_COMM_SIZE ( comm, num_procs, ierror )
- returns the number of processes in the communicator
- if comm = MPI_COMM_WORLD, the number of processes in the application is returned in num_procs
- Who am I?
- call MPI_COMM_RANK ( comm, my_id, ierror )
- returns the rank of the calling process in the communicator
- if comm = MPI_COMM_WORLD, the identity of the calling process is returned in my_id
- my_id will be a whole number between 0 and num_procs - 1
26. Simple MPI Example
Sample output from a run on 3 processes:
My_Id, numb_of_procs
0, 3
1, 3
2, 3
This is from MPI process number 0
This is from MPI processes other than 0
This is from MPI processes other than 0
27. Simple MPI Example
- Program Trivial
- implicit none
- include "mpif.h"   ! MPI header file
- integer My_Id, Numb_of_Procs, Ierr
- call MPI_INIT ( ierr )
- call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
- call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
- print *, ' My_id, numb_of_procs ', My_Id, Numb_of_Procs
- if ( My_Id .eq. 0 ) then
- print *, ' This is from MPI process number ', My_Id
- else
- print *, ' This is from MPI processes other than 0 ', My_Id
- end if
- call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
- stop
- end
28. MPI in Fortran and C
- Important Fortran and C differences
- In Fortran, the MPI library is implemented as a collection of subroutines.
- In C, it is a collection of functions.
- In Fortran, any error return code must appear as the last argument of the subroutine.
- In C, the error code is the value the function returns.
29. Simple MPI C Example
- #include <stdio.h>
- #include <mpi.h>
- int main(int argc, char *argv[])
- {
- int taskid, ntasks;
- MPI_Init(&argc, &argv);
- MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
- MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
- printf("Hello from task %d.\n", taskid);
- MPI_Finalize();
- return(0);
- }
30. MPI Functionality
- Several modes of point-to-point message passing
- blocking (e.g. MPI_SEND)
- non-blocking (e.g. MPI_ISEND)
- synchronous (e.g. MPI_SSEND)
- buffered (e.g. MPI_BSEND)
- Collective communication and synchronization
- e.g. MPI_REDUCE, MPI_BARRIER
- User-defined datatypes
- Logically distinct communicator spaces
- Application-level or virtual topologies
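A short C sketch (an editorial addition, not from the slides) of a few of these features: a non-blocking ring exchange with MPI_Isend/MPI_Irecv, followed by a collective MPI_Reduce and MPI_Barrier:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, right, left, sendval, recvval, sum;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Non-blocking ring exchange: post the receive and the send, then wait */
    right = (rank + 1) % nprocs;
    left  = (rank + nprocs - 1) % nprocs;
    sendval = rank;
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);

    /* Collective communication: sum of all ranks, result on process 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("Sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}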
31. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
32. Matrix Multiply Example
[Figure: C = A x B]
33. Matrix Multiply Serial Program
- ! Initialize matrices A and B
- time1 = rtc()
- call jikloop
- time2 = rtc()
- print *, time2-time1
- subroutine jikloop
- integer matdim, ncols
- real(8), dimension(:,:) :: a, b, c
- real(8) :: cc
- do j = 1, matdim
- do i = 1, matdim
- cc = 0.0d0
- do k = 1, matdim
- cc = cc + a(i,k)*b(k,j)
- end do
- c(i,j) = cc
- end do
- end do
- end subroutine jikloop
34. Matrix multiply over 4 processes
[Figure: C = A x B, with the columns of B and C divided into blocks 0-3, one block per process; the complete C ends up on process 0]
- Process 0
- initially has A and B
- broadcasts A to all processes
- scatters the columns of B among all processes
- All processes calculate C = A x B for the appropriate columns of C
- Columns of C are gathered into process 0
35. MPI Matrix Multiply
- real a(dim,dim), b(dim,dim), c(dim,dim)
- ncols = dim/numprocs
- if( myid .eq. master ) then
- ! Initialize matrices A and B
- time1 = rtc()
- call Broadcast(a to all)
- do i = 1, numprocs-1
- call Send(ncols columns of b to i)
- end do
- call jikloop   ! c(1st ncols) = a x b(1st ncols)
- do i = 1, numprocs-1
- call Receive(ncols columns of c from i)
- end do
- time2 = rtc()
- print *, time2-time1
- else   ! Processors other than master
- allocate ( blocal(dim,ncols), clocal(dim,ncols) )
- call Broadcast(a to all)
- call Receive(blocal from master)
- call jikloop   ! clocal = a x blocal
- call Send(clocal to master)
- endif
36. MPI Send
- call Send(ncols columns of b to i)
- call MPI_SEND( b(1,i*ncols+1), ncols*matdim, MPI_DOUBLE_PRECISION, i, tag, MPI_COMM_WORLD, ierr )
- b(1,i*ncols+1): address where the data start
- ncols*matdim: the number of elements (items) of data in the message
- MPI_DOUBLE_PRECISION: type of the data to be transmitted
- i: the message is sent to process i
- tag: message tag, an integer to help distinguish among messages
37. MPI Receive
- call Receive(ncols columns of c from i)
- call MPI_RECV( c(1,i*ncols+1), ncols*matdim, MPI_DOUBLE_PRECISION, i, tag, MPI_COMM_WORLD, status, ierr )
- status: integer array of size MPI_STATUS_SIZE containing information that is returned. For example, if you specify a wildcard (MPI_ANY_SOURCE or MPI_ANY_TAG) for the source or tag, status will tell you the actual rank (status(MPI_SOURCE)) or tag (status(MPI_TAG)) of the message received.
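For comparison, here is a C sketch (an editorial addition, not from the slides) of a send/receive pair in which the receiver uses the wildcards and then inspects the status object; the buffer size and tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

#define N 4   /* number of values per message (arbitrary) */

int main(int argc, char *argv[])
{
    int rank, nprocs, i, p, tag = 99;
    double buf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank != 0) {
        for (i = 0; i < N; i++) buf[i] = rank + 0.1 * i;
        MPI_Send(buf, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    } else {
        /* Receive one message from each worker, in whatever order they arrive */
        for (p = 1; p < nprocs; p++) {
            MPI_Recv(buf, N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d values from rank %d (tag %d)\n",
                   N, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}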
38. MPI Broadcast
- call Broadcast(a to all)
- call MPI_BCAST( a, matdim**2, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
- master: the message is broadcast from the master process (myid = 0)
39. Better MPI Matrix Multiply
- if( myid .eq. master ) then
- ! Initialize matrices A and B
- time1 = timef()/1000
- endif
- call Broadcast ( a, envelope )
- call Scatter ( b, blocal, envelope )
- call jikloop   ! clocal = a x blocal
- call Gather ( clocal, c, envelope )
- if( myid .eq. master ) then
- time2 = timef()/1000
- print *, time2-time1
- endif
40. MPI Scatter
[Figure: the master's array b is split into blocks of 100 elements; each process receives one 100-element block in blocal]
call MPI_SCATTER( b, 100, MPI_DOUBLE_PRECISION, blocal, 100, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
41. MPI Gather
[Figure: each process contributes its 100-element clocal; the blocks are collected into the array c on the master]
call MPI_GATHER( clocal, 100, MPI_DOUBLE_PRECISION, c, 100, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD, ierr )
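For readers working in C, here is a minimal sketch of the same broadcast/scatter/compute/gather pattern (an editorial addition, not from the slides); it assumes the matrix order is divisible by the number of processes, uses column-major storage to match the Fortran slides, and its names (MATDIM, jikloop) are illustrative:

#include <stdlib.h>
#include <mpi.h>

#define MATDIM 560   /* matrix order (illustrative) */

/* C = A x B for a block of ncols columns; column-major storage, so
   element (i,j) of a MATDIM-row array is at index [i + j*MATDIM] */
static void jikloop(const double *a, const double *b, double *c, int ncols)
{
    int i, j, k;
    for (j = 0; j < ncols; j++)
        for (i = 0; i < MATDIM; i++) {
            double cc = 0.0;
            for (k = 0; k < MATDIM; k++)
                cc += a[i + k*MATDIM] * b[k + j*MATDIM];
            c[i + j*MATDIM] = cc;
        }
}

int main(int argc, char *argv[])
{
    int myid, numprocs, ncols, master = 0;
    double *a, *b = NULL, *c = NULL, *blocal, *clocal;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    ncols = MATDIM / numprocs;   /* assumes MATDIM % numprocs == 0 */

    a      = malloc(MATDIM*MATDIM*sizeof(double));
    blocal = malloc(MATDIM*ncols*sizeof(double));
    clocal = malloc(MATDIM*ncols*sizeof(double));
    if (myid == master) {
        b = malloc(MATDIM*MATDIM*sizeof(double));
        c = malloc(MATDIM*MATDIM*sizeof(double));
        /* ... initialize a and b here ... */
    }

    MPI_Bcast(a, MATDIM*MATDIM, MPI_DOUBLE, master, MPI_COMM_WORLD);
    MPI_Scatter(b, MATDIM*ncols, MPI_DOUBLE,
                blocal, MATDIM*ncols, MPI_DOUBLE, master, MPI_COMM_WORLD);
    jikloop(a, blocal, clocal, ncols);
    MPI_Gather(clocal, MATDIM*ncols, MPI_DOUBLE,
               c, MATDIM*ncols, MPI_DOUBLE, master, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}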
42. Matrix Multiply Timing Results on IBM SP
43. Measuring Performance
- Assume we time only the parallelized region
- Speedup on p processors: S(p) = T(1) / T(p), where T(p) is the wallclock time on p processors
- Ideally, S(p) = p
44. Matrix Multiply Speedup Results on IBM SP
45. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
46. What is Shared Memory Parallelization?
- All processors can access all the memory in the parallel system (one address space).
- The time to access the memory may not be equal for all processors: not necessarily a flat memory
- Parallelizing on an SMP does not reduce CPU time; it reduces wallclock time
- Parallel execution is achieved by generating multiple threads which execute in parallel
- The number of threads is (in principle) independent of the number of processors
47. Processes vs. Threads
[Figure: a process has its own code, heap, stack, and instruction pointer (IP); threads within a process share the code and heap, but each thread has its own stack and IP]
48. OpenMP Threads
- 1. All OpenMP programs begin as a single process: the master thread
- 2. FORK: the master thread then creates a team of parallel threads
- 3. Parallel region statements are executed in parallel among the various team threads
- 4. JOIN: the threads synchronize and terminate, leaving only the master thread
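A minimal C illustration of this fork/join model (an editorial sketch, not from the slides); each member of the team reports its thread id inside the parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Before the parallel region: master thread only\n");

    /* FORK: a team of threads executes this block */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }   /* JOIN: the threads synchronize; only the master continues */

    printf("After the parallel region: master thread only\n");
    return 0;
}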
49. OpenMP example
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
  z(i) = a * x(i) + y
end do
return
end
50. Private vs Shared Variables
- Serial execution: z, a, x, y, and n live in global shared memory, and all data references are to global shared memory
- Parallel execution: references to z, a, x, y, and n are still to global shared memory
- Each thread has a private copy of the loop index i; references to i are to the private copy
51. Division of Work
- Example: n = 40, 4 threads
- The saxpy parallel do loop from the previous slide is divided among the threads
- Each thread executes its own chunk of iterations, with its private copy of i in local memory: i = 1,10; i = 11,20; i = 21,30; i = 31,40
52. OpenMP
- 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
- www.openmp.org
- OpenMP provides compiler directives, embedded in C/C++ or Fortran source code, for
- scoping data
- specifying work load
- synchronization of threads
- OpenMP provides function calls for obtaining information about threads
- e.g., omp_get_num_threads(), omp_get_thread_num()
53. OpenMP in C
- Same functionality as OpenMP for Fortran
- Differences in syntax
- #pragma omp for
- Differences in variable scoping
- variables "visible" when #pragma omp parallel is encountered are shared by default
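As a sketch of these defaults (an editorial addition, not from the slides), here is the saxpy loop of slide 49 written in C: z, a, x, y, and n are visible when the pragma is encountered and therefore shared, while the loop index i is explicitly made private:

#include <omp.h>

/* z[i] = a*x[i] + y for i = 0..n-1; the C analogue of the Fortran saxpy */
void saxpy(float *z, float a, const float *x, float y, int n)
{
    int i;
    /* z, a, x, y, n are shared by default; i is private to each thread */
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}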
54. OpenMP Overhead
- The overhead for parallelization is large (e.g. 8000 cycles for a parallel do over 16 processors of an SGI Origin 2000)
- The size of the parallel work construct must be significant enough to overcome the overhead
- Rule of thumb: it takes 10 kFLOPS to amortize the overhead (see the sketch below)
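One common way to respect this rule of thumb is OpenMP's if clause, which disables the parallel construct for small problem sizes; a C sketch with an arbitrary threshold (an editorial addition, not from the slides):

/* Parallelize only when the loop is big enough to amortize thread startup;
   below the (arbitrary) threshold the loop runs serially. */
void scale(double *z, double a, int n)
{
    int i;
    #pragma omp parallel for if(n > 10000)
    for (i = 0; i < n; i++)
        z[i] = a * z[i];
}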
55. OpenMP Use
- How is OpenMP typically used?
- OpenMP is usually used to parallelize loops
- Find your most time-consuming loops
- Split them up between threads
- Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
56. OpenMP & LoadLeveler
- To get exclusive use of a node:
- # @ network.MPI = css0,not_shared,us
- # @ node_usage = not_shared
57. Matrix Multiply with OpenMP
- subroutine jikloop
- integer matdim, ncols
- real(8), dimension(:,:) :: a, b, c
- real(8) :: cc
- !$OMP PARALLEL DO PRIVATE(i, k, cc)   ! i, k, and cc must be private to each thread
- do j = 1, matdim
- do i = 1, matdim
- cc = 0.0d0
- do k = 1, matdim
- cc = cc + a(i,k)*b(k,j)
- end do
- c(i,j) = cc
- end do
- end do
- end subroutine jikloop
60. OpenMP vs. MPI
- OpenMP
- Only for shared memory computers
- Easy to incrementally parallelize
- More difficult to write highly scalable programs
- Small API based on compiler directives and a limited set of library routines
- Same program can be used for sequential and parallel execution
- Shared vs. private variables can cause confusion
- MPI
- Portable to all platforms
- Parallelize all or nothing
- Vast collection of library routines
- Possible but difficult to use the same program for serial and parallel execution
- Variables are local to each processor
61. Overview
- I. Introduction to Parallel Processing & the ACRL
- II. How to use the IBM SP: Basics
- III. How to use the IBM SP: Job scheduling
- IV. Message Passing Interface
- V. Matrix multiply example
- VI. OpenMP
- VII. MPI performance visualization with ParaGraph
62. MPI Performance Visualization
- ParaGraph
- Developed by the University of Illinois
- A graphical display system for visualizing the behaviour and performance of MPI programs
64. Master-Slave Parallelization: static balancing
67. References
- MPI
- Using MPI, by Gropp, Lusk, and Skjellum (MIT Press)
- Using MPI-2, by the same authors
- MPI website: www-unix.mcs.anl.gov/mpi/
- OpenMP (www.openmp.org)
- Parallel Programming in OpenMP, by Chandra et al. (Morgan Kaufmann)
- Lawrence Livermore online tutorials
- www.llnl.gov/computing/tutorials/
- Matrix multiply programs: www.cs.unb.ca/acrl/acrl_workshop/
- Parallel programming with generalized fractals: www.cs.unb.ca/staff/aubanel/aubanel_fractals.html