Title: Parallel Programming Orientation
1. Parallel Programming Orientation
2. Agenda
- Parallel jobs
- Paradigms to parallelize algorithms
- Profiling and compiler optimization
- Implementations to parallelize code
- OpenMP
- MPI
- Queuing
- Job queuing
- Integrating Parallel Programs
- Questions and Answers
3. Traditional Ad-Hoc Linux Cluster
- Full Linux install to disk; loading into memory is manual and slow (5-30 minutes)
- Full set of disparate daemons, services, user/password, and host-access setup
- Basic parallel shell with complex glue scripts to run jobs
- Monitoring and management added as isolated tools
4. Cluster Virtualization Architecture Realized
- Minimal in-memory OS with a single daemon, rapidly deployed in seconds with no disk required
- Less than 20 seconds per node
- Virtual, unified process space enables intuitive single sign-on and job submission
- Effortless job migration to nodes
- Monitor and manage efficiently from the Master
- Single System Install
- Single Process Space
- Shared cache of the cluster state
- Single point of provisioning
- Better performance due to lightweight nodes
- No version skew, which is inherently more reliable
5. Just a Primer
- Only a brief introduction is provided here. Many other in-depth tutorials are available on the web and in published sources.
- http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
- https://computing.llnl.gov/?set=training&page=index
6. Parallel Code Primer
- Paradigms for writing parallel programs depend upon the application
- SIMD (single-instruction, multiple-data)
- MIMD (multiple-instruction, multiple-data)
- MISD (multiple-instruction, single-data)
- SIMD will be presented here as it is a commonly used template
- A single application source is compiled to perform operations on different sets of data (the Single Program Multiple Data (SPMD) model); a conceptual sketch follows this list
- The data is read by the different threads or passed between threads via messages (hence MPI: Message Passing Interface)
- Contrast this with shared memory or OpenMP, where data is accessed locally via memory
- Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message-passing construct
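A conceptual SPMD sketch in C (not from the tutorial sources; the rank and process-count values are placeholders for what an MPI runtime would supply) showing how the same program text operates on different slices of the data:

/* SPMD sketch: every process runs this same source, but each one uses its
   rank to decide which part of the index space it owns. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    int rank = 0, nprocs = 4;   /* placeholders; an MPI runtime would supply these */
    double local_sum = 0.0;
    long i;

    /* Stride decomposition: rank r handles i = r, r + nprocs, r + 2*nprocs, ... */
    for (i = rank; i < N; i += nprocs)
        local_sum += (double) i;

    printf("rank %d computed a partial sum of %.0f\n", rank, local_sum);
    /* In a real message-passing program the partial sums would then be
       combined with a reduction (see the MPI example later). */
    return 0;
}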
7. Explicitly Parallel Programs
- Different paradigms exist for parallelizing programs
- Shared memory
- OpenMP
- Sockets
- PVM
- Linda
- MPI
- Most distributed parallel programs are now written using MPI
- Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
- ClusterWare comes integrated with customized versions of MPICH and OpenMPI
8. Example Code
- Calculate π through numerical integration
- Compute π by integrating f(x) = 4/(1 + x²) from 0 to 1
- The integrand is the derivative of 4·arctan(x), so the integral evaluates to 4·arctan(1) = π
- See source code (a serial sketch follows below)
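A minimal serial sketch of the midpoint-rule integration described above (illustrative only; the shipped cpi-serial.c may differ in details such as timing code):

/* Serial computation of pi by integrating 4/(1 + x*x) from 0 to 1. */
#include <stdio.h>
#include <math.h>

static double f(double a)
{
    return 4.0 / (1.0 + a * a);
}

int main(void)
{
    const double PI25DT = 3.141592653589793238462643;
    long n = 400000000;            /* number of intervals */
    double h = 1.0 / (double) n;   /* width of each interval */
    double sum = 0.0;
    long i;

    for (i = 1; i <= n; i++)
        sum += f(h * ((double) i - 0.5));   /* evaluate at the midpoint */

    printf("pi is approximately %.16f, Error is %.16f\n",
           h * sum, fabs(h * sum - PI25DT));
    return 0;
}

It would be compiled and timed as shown on the next slide.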
9. Compiling and running code
- Set n = 400,000,000
- gcc -o cpi-serial cpi-serial.c
- time ./cpi-serial
- Process 0
- pi is approximately 3.1415926535895520, Error is 0.0000000000002411
- real 0m11.009s
- user 0m11.007s
- sys 0m0.000s
- gcc -g -pg -o cpi-serial_prof cpi-serial.c
- time ./cpi-serial_prof
- Process 0
- pi is approximately 3.1415926535895520, Error is 0.0000000000002411
- real 0m11.012s
- user 0m11.010s
- sys 0m0.000s
- ls -ltra
10. Profiling
- The -g flag includes debugging information in the binary; useful for gdb tracing of an application and for profiling
- The -pg flag generates code to write profile information
- gprof cpi-serial_prof gmon.out
- Flat profile:
  Each sample counts as 0.01 seconds.
    %   cumulative   self               self     total
   time   seconds   seconds      calls  ns/call  ns/call  name
  74.48      2.85      2.85                               main
  23.95      3.77      0.92  400000000     2.29     2.29  f
   2.37      3.86      0.09                               frame_dummy
- Call graph (explanation follows)
  granularity: each sample hit covers 2 byte(s) for 0.26% of 3.86 seconds
  index  % time    self  children    called     name
                                                 <spontaneous>
11. Profiling Tips
- Code should be profiled using a realistic data set
- Contrast the call graphs of n = 100 versus n = 400,000,000
- Profiling can give tips about where to optimize the current algorithm, but it can't suggest alternative (better) algorithms
- e.g. a Monte Carlo algorithm to calculate π
- Amdahl's Law (stated below)
- The speedup parallelization achieves is limited by the serial part of the code
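Stated compactly (a standard formulation, not taken from the slides): if P is the fraction of the runtime that can be parallelized and N is the number of processors,

    Speedup(N) = 1 / ((1 - P) + P/N)

Even as N grows without bound, the speedup is capped at 1/(1 - P); for example, a code that is 95% parallel can never run more than 20x faster.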
12. OpenMP Introduction
- Parallelization using shared memory in a single machine
- A portion of the code is forked on the machine to parallelize it
- i.e. not distributed parallelization
- Done using pragmas in the source code; the compiler must support OpenMP (gcc 4+, Intel, etc.)
- gcc -fopenmp -o cpi-openmp cpi-openmp.c
- See source code (a sketch follows this list)
- Profiling can add overhead to the resulting executable
- time can be used to measure improvement
- Runtime selection of the number of threads using the OMP_NUM_THREADS environment variable
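A sketch (assumed, not the exact cpi-openmp.c) of how the integration loop might be parallelized with an OpenMP pragma:

/* OpenMP version of the pi integration: the pragma forks a team of threads,
   splits the loop iterations among them, and combines the private partial
   sums with a reduction. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    long n = 400000000;
    double h = 1.0 / (double) n;
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++) {
        double x = h * ((double) i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approximately %.16f\n", h * sum);
    return 0;
}

Build with gcc -fopenmp and select the thread count at runtime, e.g. OMP_NUM_THREADS=4, as shown on the next slide.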
13. Scaling with OpenMP
- time OMP_NUM_THREADS=1 ./cpi-openmp
- Process 0
- pi is approximately 3.1415926535895520, Error is 0.0000000000002411
- real 0m10.583s
- user 0m10.581s
- sys 0m0.001s
- time OMP_NUM_THREADS=2 ./cpi-openmp
- Process 0
- Process 1
- pi is approximately 3.1415926535900218, Error is 0.0000000000002287
- real 0m5.295s
- user 0m11.297s
- sys 0m0.000s
- time OMP_NUM_THREADS=4 ./cpi-openmp
- ...
- real 0m2.650s
- user 0m10.586s
14. Scaling with OpenMP
- The code is easy to parallelize
- Good scaling is seen up to 8 processors; the kink in the curve is expected
15. Role of the Compiler
- Parallelization using shared memory in a single machine
- i.e. not distributed parallelization
- Done using pragmas in the source code; the compiler must support OpenMP (gcc 4+, Intel, etc.)
- gcc -fopenmp -o cpi-openmp cpi-openmp.c
- Profiling can add overhead to the resulting executable
- time can be used to measure improvement
- Runtime selection of the number of threads using the OMP_NUM_THREADS environment variable
16. GCC versus Intel C
- time OMP_NUM_THREADS=1 ./cpi-openmp
- ...
- real 0m10.583s
- user 0m10.581s
- sys 0m0.001s
- gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
- time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
- Process 0
- pi is approximately 3.1415926535895520, Error is 0.0000000000002411
- real 0m3.154s
- user 0m3.143s
- sys 0m0.011s
- time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
- ...
- real 0m0.399s
- user 0m3.181s
- sys 0m0.001s
17. Compiler Timings
18. Explicitly Parallel Programs
- Different paradigms exist for parallelizing programs
- Shared memory
- OpenMP
- Sockets
- PVM
- Linda
- MPI
- Most distributed parallel programs are now written using MPI
- Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
- ClusterWare comes integrated with customized versions of MPICH and OpenMPI
19. OpenMP Summary
- OpenMP provides a mechanism to parallelize within a single machine
- Shared memory and variables are handled automatically
- Performance, with an appropriate compiler, can provide significant speedups
- Coupled with large core-count SMP machines, OpenMP could be all of the parallelization required
- GPU programming is similar to the OpenMP model
20. Explicitly Parallel Programs
- Different paradigms exist for parallelizing programs
- Shared memory
- OpenMP
- Sockets
- PVM
- Linda
- MPI
- Most distributed parallel programs are now written using MPI
- Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
- ClusterWare comes integrated with customized versions of MPICH and OpenMPI
21. Running MPI Code
- Binaries are executed simultaneously
- on the same machine or on different machines
- After the binaries start running, MPI_COMM_WORLD is established (a minimal skeleton follows this list)
- Any data to be transferred must be explicitly determined by the programmer
- Hooks exist for a number of languages
- e.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)
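A minimal skeleton (illustrative, not from the tutorial sources) of the startup sequence described above: every copy of the binary joins MPI_COMM_WORLD and learns its rank and the size of the group.

/* Minimal MPI program: initialize, query rank and size, finalize. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* establish MPI_COMM_WORLD        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank             */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of processes in the job  */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}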
22. Example MPI Source
- cpi.c calculates π using MPI in C

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f( double );
double f( double a )
{
    return (4.0 / (1.0 + a*a));
}

int main( int argc, char *argv[] )
{
    int done = 0, n = 0, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s\n", myid, processor_name);

    while (!done)
    {
        if (myid == 0)
        {
            /*
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
            */
            if (n == 0) n = 100; else n = 0;
            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }
    MPI_Finalize();
    return 0;
}
Notes on the source:
- Computes π by integrating f(x) = 4/(1 + x²)
- mpi.h is the system include file which defines the MPI functions
- MPI_Init initializes the MPI execution environment
- MPI_Comm_size determines the size of the group associated with a communicator
- MPI_Comm_rank determines the rank of the calling process in the communicator
- MPI_Get_processor_name gets the name of the processor
- The if (myid == 0) test differentiates actions based on rank; only the master performs this action
- MPI_Wtime is the MPI built-in function to get a time value
- MPI_Bcast broadcasts 1 MPI_INT from n on the process with rank 0 to all other processes of the group
- Each worker does this loop and increments the counter by the number of processes (versus dividing up the range, which risks an off-by-one error)
- MPI_Reduce applies the MPI_SUM function to 1 MPI_DOUBLE at mypi on all workers in MPI_COMM_WORLD, producing a single value at pi on rank 0
- Only rank 0 outputs the value of π
- MPI_Finalize terminates the MPI execution environment
23. Other Common MPI Functions
- MPI_Send, MPI_Recv
- Blocking send and receive between two specific ranks
- MPI_Isend, MPI_Irecv
- Non-blocking send and receive between two specific ranks (a short sketch follows this list)
- man pages exist for the MPI functions
- Poorly written programs can suffer from poor communication efficiency (e.g. stair-stepping) or lost data if the system buffer fills before a blocking send or receive is initiated to correspond with a non-blocking receive or send
- Care should be used when creating temporary files, as multiple processes may be running on the same host and overwriting the same temporary file (include the rank in the file name, in a unique temporary directory per simulation)
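A small illustrative sketch (not from the tutorial sources) of the point-to-point calls named above, mixing a blocking send on rank 0 with a non-blocking receive on rank 1:

/* Blocking vs. non-blocking point-to-point messaging between two ranks. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Status  status;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the buffer is safe to reuse. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Non-blocking receive: post it, overlap other work, then wait. */
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... useful computation could go here ... */
        MPI_Wait(&request, &status);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}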
24. Compiling MPICH programs
- mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
- Environment variables can be used to set the compiler
- CC, CPP, FC, F90
- Command-line options to set the compiler
- -cc, -cxx, -fc, -f90
- GNU, PGI, and Intel compilers are supported
25. Running MPICH programs
- mpirun is used to launch MPICH programs
- Dynamic allocation can be done when using the -np flag
- Mapping is also supported when using the -map flag
- If Infiniband is installed, the interconnect fabric can be chosen using the -machine flag
- -machine p4
- -machine vapi
26. Scaling with MPI
- which mpicc
- /usr/bin/mpicc
- mpicc -show -o cpi-mpi cpi-mpi.c
- gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
- mpicc -o cpi-mpi cpi-mpi.c
- time mpirun -np 1 ./cpi-mpi
- Process 0 on scyld.localdomain
- ...
- real 0m11.198s
- user 0m11.187s
- sys 0m0.010s
- time mpirun -np 2 ./cpi-mpi
- Process 0 on scyld.localdomain
- Process 1 on n0
- ...
- real 0m6.486s
- user 0m5.510s
- sys 0m0.009s
- time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi
27. Environment Variable Options
- Additional environment variable control:
- NP: The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
- ALL_CPUS: Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
- ALL_NODES: Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
- ALL_LOCAL: Run every process on the master node; used for debugging purposes.
- NO_LOCAL: Don't run any processes on the master node.
- EXCLUDE: A colon-delimited list of nodes to be avoided during node assignment.
- BEOWULF_JOB_MAP: A colon-delimited list of nodes. The first node listed will be the first process (MPI rank 0), and so on.
28. Compiling and Running OpenMPI programs
- The env-modules package allows users to change their environment variables according to predefined files
- module avail
- module load openmpi/gnu
- GNU, PGI, and Intel compilers are supported
- mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
- mpirun is used to run code
- The interconnect can be selected at runtime
- -mca btl openib,tcp,sm,self
- -mca btl udapl,tcp,sm,self
29. Compiling and Running OpenMPI programs
- What env-modules does
- Set the user environment prior to compiling
- export PATH=/opt/scyld/openmpi/gnu/bin:$PATH
- mpicc, mpiCC, mpif77, and mpif90 are used to automatically compile code and link in the correct MPI libraries from /opt/scyld/openmpi
- Environment variables can be used to set the compiler
- OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
- Prior to running, PATH and LD_LIBRARY_PATH should be set
- module load openmpi/gnu
- /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
- OR
- export PATH=/opt/scyld/openmpi/gnu/bin:$PATH
- export MANPATH=/opt/scyld/openmpi/gnu/share/man
- export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:$LD_LIBRARY_PATH
- /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
30. Scaling with MPI Implementations
31. Scaling with MPI Implementations
- Infiniband allows wider scaling
- Performance difference between MPICH and OpenMPI
- A little artificial because it's only two physical machines
32. Scaling with MPI Implementations
- Larger problems would allow continued scaling
33. MPI Summary
- MPI provides a mechanism to parallelize in a distributed fashion
- Localhost optimization is done when running on a shared-memory machine
- Shared variables are explicitly handled by the developer
- The tradeoff between CPU and I/O can determine the performance characteristics
- Hybrid programming models are possible (see the sketch after this list)
- MPI code with OpenMP sections
- MPI code with GPU calls
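A sketch of the hybrid model mentioned above (illustrative only, not from the tutorial sources): MPI between nodes, OpenMP threads within each rank. It would be built with something like mpicc -fopenmp.

/* Hybrid MPI + OpenMP: each MPI rank forks its own team of OpenMP threads.
   No MPI calls are made inside the parallel region, so plain MPI_Init is
   sufficient; code that calls MPI from threads should use MPI_Init_thread. */
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}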
34. Queuing
- How are resources allocated among multiple users and/or groups?
- Statically, by using bpctl user and group permissions
- ClusterWare supports a variety of queuing packages
- TaskMaster (an advanced MOAB policy-based scheduler integrated with ClusterWare)
- Torque
- SGE
35. Interacting with Torque
- To submit a job:
- qsub script.sh
- Example script.sh:
- #!/bin/sh
- #PBS -j oe
- #PBS -l nodes=4
- cd $PBS_O_WORKDIR
- hostname
- qsub does not accept arguments for script.sh; all executable arguments must be included in the script itself
- Administrators can create a qapp script that takes user arguments, creates script.sh with the user arguments embedded, and runs qsub script.sh
36. Interacting with Torque
- Other commands
- qstat: status of the queue server and jobs
- qdel: remove a job from the queue
- qhold, qrls: hold and release a job in the queue
- qmgr: administrator command to configure pbs_server
- /var/spool/torque/server_name should match the hostname of the head node
- /var/spool/torque/mom_priv/config is the file to configure pbs_mom
- "usecp /home /home" indicates that pbs_mom should use cp rather than rcp or scp to relocate the stdout and stderr files at the end of execution
- pbsnodes: administrator command to monitor the status of the resources
- qalter: administrator command to modify the parameters of a particular job (e.g. requested time)
37. Other options to qsub
- Options that can be included in a script (with the #PBS directive) or on the qsub command line:
- Join output and error files: #PBS -j oe
- Request resources: #PBS -l nodes=2:ppn=2
- Request walltime: #PBS -l walltime=24:00:00
- Define a job name: #PBS -N jobname
- Send mail at job events: #PBS -m be
- Assign the job to an account: #PBS -A account
- Export current environment variables: #PBS -V
- To start an interactive queue job use:
- qsub -I for Torque
- qrsh for SGE
38. Queue script case studies
- qapp script
- Be careful about escaping special characters in the redirect section (\$, \`, \\)

#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1=$1
opt2=$2
if [ -z "$opt2" ]
then
    echo "Not enough arguments"
    exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [ $debug -lt 1 ]
then
    qsub app.sh
fi
if [ $debug -eq 0 ]
then
    /bin/rm -f app.sh
fi
39. Queue script case studies

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
/path/to/app arg1 arg2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir
40. Queue script case studies
- Using local scratch for MPICH parallel jobs
- pbsdsh is a Torque command

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi /path/to/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
/usr/bin/pbsdsh -u /bin/rm -fr $tmpdir
41. Queue script case studies
- Using local scratch for OpenMPI parallel jobs
- Do a "module load openmpi/gnu" prior to running qsub
- OR explicitly include "module load openmpi/gnu" in the script itself

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self /path/to/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
/usr/bin/pbsdsh -u /bin/rm -fr $tmpdir
42. Other considerations
- A queue script need not be a single command
- Multiple steps can be performed from a single script
- Guaranteed resources
- Jobs should typically be a minimum of 2 minutes
- Pre-processing and post-processing can be done from the same script using the local scratch space
- If configured, it is possible to submit additional jobs from a running queued job
- To remove multiple jobs from the queue:
- qstat | grep "R\|Q" | awk '{print $1}' | xargs qdel
43. Integrating Parallel Programs
- The scheduler only keeps track of available resources
- It does not monitor how the resources are used
- The onus is on the user to request and use the correct resources
- OpenMP: be sure to request multiple processors on the same machine
- Torque: #PBS -l nodes=1:ppn=x
- SGE: use the correct PE (parallel environment) submission
- MPI: be sure to use the machines that have been assigned by the queuing system
- Torque: MPICH and OpenMPI mpirun will do the correct thing; PBS_NODEFILE contains a list of assigned hosts
- SGE: PE_HOSTFILE contains a list of assigned hosts; OpenMPI's mpirun may need to be recompiled
44. Integrating Parallel Programs
- Be careful about task pinning (taskset)
- Different jobs may assume the same CPU set, resulting in oversubscription of some cores while others sit free
- In a shared environment, not using task pinning can be easier, at a slight trade-off in performance
- Make sure that the same MPI implementation and compiler combination is used to run the code as was used to compile and link it
45. Questions?