High Performance Parallel Computing
Transcript and Presenter's Notes

Title: High Performance Parallel Computing


1
High Performance Parallel Computing
Getting Started on ACRL's IBM SP
  • Virendra Bhavsar & Eric Aubanel
  • Advanced Computational Research Laboratory
  • Faculty of Computer Science, UNB

2
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

3
ACRL's IBM SP
  • 4 Winterhawk II nodes
  • 16 processors
  • Each node has
  • 1 GB RAM
  • 9 GB (mirrored) disk on each node
  • Switch adapter
  • High Performance Switch
  • Gigabit Ethernet (1 node)
  • Control workstation
  • Disk: SSA tower with 6 x 18.2 GB disks

4

5
The Clustered SMP
ACRL's SP: four 4-way SMPs. Each node has its own copy of the O/S. Processors on the same node are closer than those on different nodes.
6
General Parallel File System
[Diagram: Nodes 1-4 connected by the SP Switch]
7
ACRL Software
  • Operating System: AIX 4.3.3
  • Compilers
  • IBM XL Fortran 7.1
  • IBM High Performance Fortran 1.4
  • VisualAge C for AIX, Version 5.0.1.0
  • VisualAge C++ Professional for AIX, Version
    5.0.0.0
  • Java
  • Job Scheduler: LoadLeveler 2.2
  • Parallel Programming Tools
  • IBM Parallel Environment 3.1: MPI, MPI-2
    parallel I/O
  • Numerical Libraries: ESSL (v. 3.2) and Parallel
    ESSL (v. 2.2)
  • Computational Chemistry: NWChem
  • Visualization: OpenDX (not yet installed)
  • E-Commerce software (not yet installed)

8
ESSL
  • Linear algebra, Fourier related transforms,
    sorting, interpolation, quadrature, random
    numbers
  • Fast!
  • 560x560 real*8 matrix multiply
  • Hand coding: 19 MFlops
  • dgemm: 1.2 GFlops
  • Parallel (threaded and distributed) versions

9
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

10
How to use the IBM SP: Basics
  • www.cs.unb.ca/acrl
  • Connecting and transferring files
  • symphony.unb.ca
  • SSH
  • The standard system shell can be either ksh or
    tcsh
  • ksh & tcsh: Note that the current directory is
    not in the PATH environment variable. To execute
    a program (say a.out) from your current directory,
    type ./a.out, not a.out.
  • ksh: Command line editing is available using vi.
    Enter ESC-k to bring up the last command, then
    use vi to edit the previous commands.

11
Basics (cont'd)
  • tcsh: Command line editing is available using the
    arrow keys, delete key, etc.
  • To change your shell, e.g. from ksh to tcsh:
    yppasswd -s <username> /bin/tcsh
  • If you are new to Unix, we have suggested two
    good Unix tutorial web sites on our links page.
  • Editing files
  • vi
  • vim / gvim
  • emacs
  • Compiling & running programs (a sample
    compile-and-run session is sketched below)
  • recommended optimization: -O3 -qstrict, then -O3,
    then -O3 -qarch=pwr3
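A minimal compile-and-run sketch, assuming the XL Fortran driver xlf90, the Parallel Environment compile script mpxlf90 and the poe launcher; program and file names are placeholders:

    # serial program
    xlf90 -O3 -qstrict myprog.f90 -o myprog
    ./myprog

    # MPI program, launched on 4 processes
    mpxlf90 -O3 -qstrict my_mpi_prog.f90 -o my_mpi_prog
    poe ./my_mpi_prog -procs 4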

12
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
    (see user's guide on ACRL web site)
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

13
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

14
Message Passing Model
  • We have an ensemble of processors and memory they
    can access
  • Each processor executes its own program
  • All processors are interconnected by a network
    (or a hierarchy of networks)
  • Processors communicate by passing messages
  • Contrast with OpenMP implicit communication

15
The Process: basic unit of the application
  • Characteristics of a process
  • A running executable of a (compiled and linked)
    program written in a standard sequential language
    (e.g. Fortran or C) with library calls to
    implement the message passing
  • A process executes on a processor
  • all processes are assigned to processors in a
    one-to-one mapping (in the simplest model of
    parallel programming)
  • other processes may execute on other processors
  • A process communicates and synchronizes with
    other processes via messages.

16
Domain Decomposition
  • In the scientific world (esp. in the world of
    simulation and modelling) this is the most common
    solution
  • The solution space (which often corresponds to
    real space) is divided up among the processors.
    Each processor solves its own little piece (see
    the sketch after this list)
  • Finite-difference methods and finite-element
    methods lend themselves well to this approach
  • The method of solution often leads naturally to a
    set of simultaneous equations that can be solved
    by parallel matrix solvers
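A minimal sketch of the idea, using the MPI calls introduced later in this talk (variable names are hypothetical): each process derives the block of the global index range it owns from its own rank.

    program decomp_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, my_id, num_procs, n, block, ilo, ihi
    n = 40                                ! global problem size (assumed divisible by num_procs)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)
    block = n / num_procs                 ! size of each process's piece
    ilo   = my_id * block + 1             ! first global index owned by this process
    ihi   = ilo + block - 1               ! last global index owned by this process
    print *, 'process', my_id, 'solves indices', ilo, 'to', ihi
    call MPI_FINALIZE(ierr)
    end program decomp_sketch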

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Message Passing Interface
  • MPI 1.0 standard in 1994
  • MPI 1.1 in 1995
  • MPI 2.0 in 1997
  • Includes 1.1 but adds new features
  • MPI-IO
  • One-sided communication
  • Dynamic processes

21
Advantages of MPI
  • Universality
  • Expressivity
  • Well suited to formulating a parallel algorithm
  • Ease of debugging
  • Memory is local
  • Performance
  • Explicit association of data with process allows
    good use of cache

22
Disadvantages of MPI
  • Harder to learn than shared memory programming
    (OpenMP)
  • Keeping track of communication pattern can be
    tricky
  • Does not allow incremental parallelization: all
    or nothing!

23
What System Calls Enable Message Passing?
  • A simple subset
  • send: send a message to another process
  • receive: receive a message from another process
  • size_the_system: how many processes am I using to
    run this code?
  • who_am_i: what is my process number within the
    parallel application?

24
Starting and Stopping MPI
  • Every MPI code needs to have the following form
  • program my_mpi_application
  • include 'mpif.h'
  • ...
  • call mpi_init (ierror)   ! ierror is where mpi_init puts the error code
  • ! describing the success or failure of the subroutine call
  • ...
  • < the program goes here! >
  • ...
  • call mpi_finalize (ierror)   ! Again, make sure ierror is present!
  • stop
  • end
  • Although, strictly speaking, executable
    statements can come before MPI_INIT and after
    MPI_FINALIZE, they should have nothing to do with
    MPI.
  • Best practice is to bracket your code completely
    with these statements.

25
Finding out about the application
  • How many processors are in the application?
  • call MPI_COMM_SIZE ( comm, num_procs )
  • returns the number of processors in the
    communicator.
  • if comm = MPI_COMM_WORLD, the number of
    processors in the application is returned in
    num_procs.
  • Who am I?
  • call MPI_COMM_RANK ( comm, my_id )
  • returns the rank of the calling process in the
    communicator.
  • if comm = MPI_COMM_WORLD, the identity of the
    calling process is returned in my_id.
  • my_id will be a whole number between 0 and
    num_procs - 1.

26
Simple MPI Example
Sample output from a run on 3 processes:
My_Id, numb_of_procs:  0, 3
My_Id, numb_of_procs:  1, 3
My_Id, numb_of_procs:  2, 3
This is from MPI process number 0
This is from MPI processes other than 0
This is from MPI processes other than 0
27
Simple MPI Example
  • Program Trivial
  • implicit none
  • include "mpif.h"   ! MPI header file
  • integer My_Id, Numb_of_Procs, Ierr
  • call MPI_INIT ( ierr )
  • call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
  • call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
  • print *, ' My_Id, numb_of_procs ', My_Id, Numb_of_Procs
  • if ( My_Id .eq. 0 ) then
  • print *, ' This is from MPI process number ', My_Id
  • else
  • print *, ' This is from MPI processes other than 0 ', My_Id
  • end if
  • call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
  • stop
  • end

28
MPI in Fortran and C
  • Important Fortran and C difference
  • In Fortran the MPI library is implemented as a
    collection of subroutines.
  • In C, it is a collection of functions.
  • In Fortran, any error return code must appear as
    the last argument of the subroutine.
  • In C, the error code is the value the function
    returns (a small sketch follows).
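A small sketch of the Fortran convention: ierr comes back as the last argument and can be compared with MPI_SUCCESS (defined in mpif.h); in C the same test would be applied to the function's return value.

    call MPI_COMM_RANK( MPI_COMM_WORLD, my_id, ierr )
    if ( ierr .ne. MPI_SUCCESS ) then
       print *, 'MPI_COMM_RANK failed, error code ', ierr
    end if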

29
Simple MPI C Example
  • #include <stdio.h>
  • #include <mpi.h>
  • int main(int argc, char *argv[])
  • {
  • int taskid, ntasks;
  • MPI_Init(&argc, &argv);
  • MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  • MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
  • printf("Hello from task %d.\n", taskid);
  • MPI_Finalize();
  • return(0);
  • }

30
MPI Functionality
  • Several modes of point-to-point message passing
  • blocking (e.g. MPI_SEND)
  • non-blocking (e.g. MPI_ISEND; see the sketch after this list)
  • synchronous (e.g. MPI_SSEND)
  • buffered (e.g. MPI_BSEND)
  • Collective communication and synchronization
  • e.g. MPI_REDUCE, MPI_BARRIER
  • User-defined datatypes
  • Logically distinct communicator spaces
  • Application-level or virtual topologies
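A sketch of the non-blocking mode (not from the original slides; partner, tag and the buffer size are hypothetical): MPI_ISEND/MPI_IRECV return immediately and MPI_WAIT completes the communication later, so computation can overlap with message transfer.

    integer :: partner, tag, ierr, req_send, req_recv
    integer :: status(MPI_STATUS_SIZE)
    real(8) :: outbuf(100), inbuf(100)
    call MPI_IRECV( inbuf, 100, MPI_DOUBLE_PRECISION, partner, tag, &
                    MPI_COMM_WORLD, req_recv, ierr )
    call MPI_ISEND( outbuf, 100, MPI_DOUBLE_PRECISION, partner, tag, &
                    MPI_COMM_WORLD, req_send, ierr )
    ! ... do useful work here while the messages are in transit ...
    call MPI_WAIT( req_recv, status, ierr )   ! inbuf is now safe to read
    call MPI_WAIT( req_send, status, ierr )   ! outbuf is now safe to reuse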

31
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

32
Matrix Multiply Example
[Diagram: C = A x B]

33
Matrix Multiply Serial Program
  • Initialize matrices A & B
  • time1 = rtc()
  • call jikloop
  • time2 = rtc()
  • print *, time2-time1
  • subroutine jikloop
  • integer matdim, ncols
  • real(8), dimension(:,:) :: a, b, c
  • real(8) cc
  • do j = 1, matdim
  • do i = 1, matdim
  • cc = 0.0d0
  • do k = 1, matdim
  • cc = cc + a(i,k)*b(k,j)
  • end do
  • c(i,j) = cc
  • end do
  • end do
  • end subroutine jikloop

34
Matrix multiply over 4 processes
[Diagram: C = A x B, with the columns of B and C divided among processes 0-3]
  • Process 0
  • initially has A and B
  • broadcasts A to all processes
  • scatters columns of B among all processes
  • All processes calculate CA x B for appropriate
    columns of C
  • Columns of C gathered into process 0

35
MPI Matrix Multiply
  • real a(dim,dim), b(dim,dim), c(dim,dim)
  • ncols = dim/numprocs
  • if( myid .eq. master ) then
  • ! Initialize matrices A & B
  • time1 = rtc()
  • call Broadcast(a to all)
  • do i = 1, numprocs-1
  • call Send(ncols columns of b to i)
  • end do
  • call jikloop   ! c(1st ncols) = a x b(1st ncols)
  • do i = 1, numprocs-1
  • call Receive(ncols columns of c from i)
  • end do
  • time2 = rtc()
  • print *, time2-time1
  • else   ! Processors other than master
  • allocate ( blocal(dim,ncols), clocal(dim,ncols) )
  • call Broadcast(a to all)
  • call Receive(blocal from master)
  • call jikloop   ! clocal = a x blocal
  • call Send(clocal to master)
  • endif

36
MPI Send
  • call Send(ncols columns of b to i)
  • call MPI_SEND( b(1,i*ncols+1), ncols*matdim,
    MPI_DOUBLE_PRECISION, i,
    tag, MPI_COMM_WORLD,
    ierr )
  • b(1,i*ncols+1): address where the data start
  • ncols*matdim: the number of elements (items) of
    data in the message
  • MPI_DOUBLE_PRECISION: type of data to be
    transmitted
  • i: message is sent to process i
  • tag: message tag, an integer to help distinguish
    among messages

37
MPI Receive
  • call Receive(ncols columns of c from i)
  • call MPI_RECV( c(1,i*ncols+1), ncols*matdim,
    MPI_DOUBLE_PRECISION, i, tag,
    MPI_COMM_WORLD, status,
    ierr )
  • status: integer array of size MPI_STATUS_SIZE of
    information that is returned. For example, if you
    specify a wildcard (MPI_ANY_SOURCE or MPI_ANY_TAG)
    for source or tag, status will tell you the
    actual rank (status(MPI_SOURCE)) or tag
    (status(MPI_TAG)) of the message received (a
    declaration sketch follows below).
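One assumption worth making explicit (the declarations are not shown on the slide): status must be declared as an integer array of size MPI_STATUS_SIZE in the calling routine; buf and count below are placeholders for the receive buffer and element count.

    integer :: status(MPI_STATUS_SIZE), ierr
    call MPI_RECV( buf, count, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                   MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr )
    print *, 'message came from rank', status(MPI_SOURCE), &
             'with tag', status(MPI_TAG)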

38
MPI Broadcast
  • call Broadcast(a to all)
  • call MPI_BCAST( a, matdim*matdim,
    MPI_DOUBLE_PRECISION, master,
    MPI_COMM_WORLD, ierr )
  • master: message is broadcast from the master
    process (myid = 0)

39
Better MPI Matrix Multiply
  • if( myid .eq. master ) then
  • ! Initialize matrices A & B
  • time1 = timef()/1000
  • endif
  • call Broadcast ( a, envelope )
  • call Scatter ( b, blocal, envelope )
  • call jikloop   ! clocal = a x blocal
  • call Gather ( clocal, c, envelope )
  • if( myid .eq. master ) then
  • time2 = timef()/1000
  • print *, time2-time1
  • endif

40
MPI Scatter
[Diagram: 100-element blocks of array b on the master are distributed, one block per process, into blocal on all processes]

call MPI_SCATTER( b, 100, MPI_DOUBLE_PRECISION,
     blocal, 100, MPI_DOUBLE_PRECISION, master,
     MPI_COMM_WORLD, ierr )
41
MPI Gather
[Diagram: 100-element blocks of clocal on all processes are collected, one block per process, into array c on the master]

call MPI_GATHER( clocal, 100, MPI_DOUBLE_PRECISION,
     c, 100, MPI_DOUBLE_PRECISION, master,
     MPI_COMM_WORLD, ierr )
42
Matrix Multiply timing Results on IBM SP
43
Measuring Performance
  • Assume we time only the parallelized region
  • Ideally, the speedup equals the number of
    processes (see the definitions below)
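For reference, with T(p) the wallclock time of the timed region on p processes, speedup and parallel efficiency are conventionally defined as

    S(p) = T(1) / T(p)        (ideally S(p) = p)
    E(p) = S(p) / p           (ideally E(p) = 1)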

44
Matrix Multiply Speedup Results on IBM SP
45
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

46
What is Shared Memory Parallelization?
  • All processors can access all the memory in the
    parallel system (one address space).
  • The time to access the memory may not be equal
    for all processors
  • not necessarily a flat memory
  • Parallelizing on an SMP does not reduce CPU time
  • it reduces wallclock time
  • Parallel execution is achieved by generating
    multiple threads which execute in parallel
  • Number of threads (in principle) is independent
    of the number of processors

47
Processes vs. Threads
[Diagram: a process has its own code, heap, stack and instruction pointer (IP); threads within one process share the code and heap, but each thread has its own stack and IP]
48
OpenMP Threads
  • 1. All OpenMP programs begin as a single process:
    the master thread
  • 2. FORK: the master thread then creates a team of
    parallel threads
  • 3. Parallel region: statements are executed in
    parallel among the various team threads
  • 4. JOIN: threads synchronize and terminate,
    leaving only the master thread

49
OpenMP example
subroutine saxpy(z, a, x, y, n)
  integer i, n
  real z(n), a, x(n), y
!$omp parallel do
  do i = 1, n
     z(i) = a * x(i) + y
  end do
  return
end
50
Private vs Shared Variables
  • Serial execution: all data references are to global shared memory
  • Parallel execution: references to z, a, x, y, n are still to global
    shared memory; each thread has a private copy of i, and references
    to i are to that private copy
51
Division of Work
n = 40, 4 threads: the saxpy loop on the previous slide is divided so
that each thread executes one contiguous block of iterations
(i = 1-10, 11-20, 21-30, 31-40). The arrays and scalars stay in global
shared memory, while each thread's copy of the loop index i lives in
its local memory.
52
OpenMP
  • In 1997, a group of hardware and software vendors
    announced their support for OpenMP, a new API for
    multi-platform shared-memory programming (SMP) on
    UNIX and Microsoft Windows NT platforms.
  • www.openmp.org
  • OpenMP provides compiler directives, embedded as
    comments in Fortran or pragmas in C/C++ source
    code, for
  • scoping data
  • specifying work load
  • synchronization of threads
  • OpenMP provides function calls for obtaining
    information about threads (a short example
    follows this list).
  • e.g., omp_get_num_threads(), omp_get_thread_num()
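A minimal sketch using these calls, assuming the omp_lib interface module shipped with an OpenMP-capable Fortran compiler:

    program hello_threads
    use omp_lib
    implicit none
    integer :: tid, nthreads
!$omp parallel private(tid, nthreads)
    tid      = omp_get_thread_num()     ! this thread's id, 0 .. nthreads-1
    nthreads = omp_get_num_threads()    ! number of threads in the current team
    print *, 'hello from thread', tid, 'of', nthreads
!$omp end parallel
    end program hello_threads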

53
OpenMP in C
  • Same functionality as OpenMP for FORTRAN
  • Differences in syntax
  • #pragma omp for
  • Differences in variable scoping
  • variables "visible" when #pragma omp parallel is
    encountered are shared by default

54
OpenMP Overhead
  • Overhead for parallelization is large (e.g. 8000
    cycles for a parallel do over 16 processors of an
    SGI Origin 2000)
  • size of the parallel work construct must be
    significant enough to overcome the overhead
  • rule of thumb: it takes 10 kFLOPS to amortize the
    overhead

55
OpenMP Use
  • How is OpenMP typically used?
  • OpenMP is usually used to parallelize loops
  • Find your most time consuming loops.
  • Split them up between threads.
  • Better scaling can be obtained using OpenMP
    parallel regions, but this can be tricky!

56
OpenMP & LoadLeveler
  • To get exclusive use of a node (a minimal job
    script is sketched below):
  • # @ network.MPI = css0,not_shared,us
  • # @ node_usage = not_shared
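A minimal LoadLeveler job-script sketch built around these keywords (an illustration only: the node/task counts, thread count and file names are placeholders, not ACRL settings):

    #!/bin/ksh
    # @ job_type       = parallel
    # @ node           = 1
    # @ tasks_per_node = 1
    # @ node_usage     = not_shared
    # @ network.MPI    = css0,not_shared,us
    # @ output         = job.$(jobid).out
    # @ error          = job.$(jobid).err
    # @ queue
    export OMP_NUM_THREADS=4    # one thread per processor on a 4-way node
    ./a.out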

57
Matrix Multiply with OpenMP
  • subroutine jikloop
  • integer matdim, ncols
  • real(8), dimension(:,:) :: a, b, c
  • real(8) cc
  • ! cc and the inner loop indices i, k must be private to each thread;
  • ! the parallel loop index j is private by default
  • !$OMP PARALLEL DO PRIVATE(i, k, cc)
  • do j = 1, matdim
  • do i = 1, matdim
  • cc = 0.0d0
  • do k = 1, matdim
  • cc = cc + a(i,k)*b(k,j)
  • end do
  • c(i,j) = cc
  • end do
  • end do
  • end subroutine jikloop

58
(No Transcript)
59
(No Transcript)
60
OpenMP vs. MPI
  • OpenMP
  • Only for shared memory computers
  • Easy to incrementally parallelize
  • More difficult to write highly scalable programs
  • Small API based on compiler directives and
    limited library routines
  • Same program can be used for sequential and
    parallel execution
  • Shared vs private variables can cause confusion
  • MPI
  • Portable to all platforms
  • Parallelize all or nothing
  • Vast collection of library routines
  • Possible but difficult to use the same program for
    serial and parallel execution
  • Variables are local to each processor

61
Overview
  • I. Introduction to Parallel Processing & the ACRL
  • II. How to use the IBM SP: Basics
  • III. How to use the IBM SP: Job scheduling
  • IV. Message Passing Interface
  • V. Matrix multiply example
  • VI. OpenMP
  • VII. MPI Performance visualization with ParaGraph

62
MPI Performance Visualization
  • ParaGraph
  • Developed at the University of Illinois
  • Graphical display system for visualizing
    behaviour and performance of MPI programs

63
(No Transcript)
64
Master-Slave Parallelization: static balancing
65
(No Transcript)
66
(No Transcript)
67
References
  • MPI
  • Using MPI, by Gropp, Lusk, and Skjellum (MIT Press)
  • Using MPI-2, by the same authors
  • MPI Website: www-unix.mcs.anl.gov/mpi/
  • OpenMP (www.openmp.org)
  • Parallel Programming in OpenMP, by Chandra et al.
    (Morgan Kaufmann)
  • Lawrence Livermore online tutorials
  • www.llnl.gov/computing/tutorials/
  • Matrix multiply programs: www.cs.unb.ca/acrl/acrl_workshop/
  • Parallel programming with generalized fractals:
    www.cs.unb.ca/staff/aubanel/aubanel_fractals.html