Transcript and Presenter's Notes

Title: Introductions to Parallel Programming Using OpenMP


1
Introductions to Parallel Programming Using
OpenMP
April 7, 2005
  • Zhenying Liu, Dr. Barbara Chapman
  • High Performance Computing and Tools group
  • Computer Science Department
  • University of Houston

2
Content
  • Overview of OpenMP
  • Acknowledgement
  • OpenMP constructs (5 categories)
  • OpenMP exercises
  • References

3
Overview of OpenMP
  • OpenMP is a set of extensions to Fortran/C/C++
  • OpenMP contains compiler directives, library
    routines and environment variables.
  • Available on most single address space machines.
  • shared memory systems, including cc-NUMA
  • Chip MultiThreading / Chip MultiProcessing (Sun
    UltraSPARC IV), Simultaneous Multithreading
    (Intel Xeon)
  • not on distributed memory systems, classic MPPs,
    or PC clusters (yet!)

4
Shared Memory Architecture
  • All processors have access to one global memory
  • All processors share the same address space
  • The system runs a single copy of the OS
  • Processors communicate by reading/writing to the
    global memory
  • Examples: multiprocessor PCs (Intel P4), Sun Fire
    15K, NEC SX-7, Fujitsu PrimePower, IBM p690, SGI
    Origin 3000.

5
Shared Memory Systems (cont)
Programming models: OpenMP, Pthreads
6
Distributed Memory Systems
Programming models: MPI, HPF
7
Clusters of SMPs
Programming models: MPI, or hybrid MPI + OpenMP
8
OpenMP Usage
  • Applications
  • Applications with intense computational needs
  • From video games to big science and engineering
  • Programmer Accessibility
  • From very early programmers in school to
    scientists to parallel computing experts
  • Available to millions of programmers
  • In every major Fortran and C/C++ compiler

9
OpenMP Syntax
  • Most of the constructs in OpenMP are compiler
    directives or pragmas.
  • For C and C++, the pragmas take the form:
  • #pragma omp construct [clause [clause]...]
  • For Fortran, the directives take one of the
    forms:
  • C$OMP construct [clause [clause]...]
  • !$OMP construct [clause [clause]...]
  • *$OMP construct [clause [clause]...]
  • Since the constructs are directives, an OpenMP
    program can be compiled by compilers that don't
    support OpenMP (a small example follows below).
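As a minimal illustration of the directive form above, here is a sketch in C; the array a, its length n, and the scaling factor are assumptions made up for this example. A compiler without OpenMP support simply ignores the pragma and compiles the serial loop.

#include <omp.h>

void scale(double *a, int n, double factor)
{
  int i;
  /* one construct (parallel for) with two clauses */
  #pragma omp parallel for shared(a) schedule(static)
  for (i = 0; i < n; i++)
    a[i] = a[i] * factor;   /* iterations are divided among the threads */
}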

10
OpenMP Programming Model
  • Fork-Join Parallelism
  • Master thread spawns a team of threads as needed.
  • Parallelism is added incrementally, i.e. the
    sequential program evolves into a parallel
    program.

11
OpenMP: How is OpenMP Typically Used?
  • OpenMP is usually used to parallelize loops
  • Find your most time consuming loops.
  • Split them up between threads.

Split up this loop between multiple threads.

Sequential program:
void main()
{
  double Res[1000];
  for(int i=0; i<1000; i++) { do_huge_comp(Res[i]); }
}

Parallel program:
void main()
{
  double Res[1000];
#pragma omp parallel for
  for(int i=0; i<1000; i++) { do_huge_comp(Res[i]); }
}
12
OpenMP: How do Threads Interact?
  • OpenMP is a shared memory model.
  • Threads communicate by sharing variables.
  • Unintended sharing of data can lead to race
    conditions.
  • race condition: when the program's outcome
    changes as the threads are scheduled differently.
  • To control race conditions:
  • Use synchronization to protect data conflicts
    (a small sketch follows below).
  • Synchronization is expensive, so
  • Change how data is stored to minimize the need
    for synchronization.
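A minimal sketch of such a race and one way to remove it, assuming a shared counter; the names count and tally are made up for illustration:

#include <omp.h>

int count = 0;                 /* shared by all threads */

void tally(int n)
{
  int i;
  #pragma omp parallel for
  for (i = 0; i < n; i++) {
    /* RACE: count = count + 1;  -- two threads may update count at once */
    #pragma omp critical
    count = count + 1;         /* fix: only one thread updates at a time */
  }
}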

13
OpenMP vs. POSIX Threads
  • POSIX threads is the other widely used shared
    memory programming API.
  • Fairly widely available, usually quite simple to
    implement on top of OS kernel threads.
  • Lower level of abstraction than OpenMP
  • library routines only, no directives
  • more flexible, but harder to implement and
    maintain
  • OpenMP can be implemented on top of POSIX threads
  • Not much difference in availability
  • not that many OpenMP C implementations
  • no standard Fortran interface for POSIX threads

14
Content
  • Overview of OpenMP
  • Acknowledgement
  • OpenMP constructs (5 categories)
  • OpenMP exercises
  • References

15
Acknowledgement
  • Slides provided by
  • Tim Mattson and Rudolf Eigenmann, SC 99
  • Mark Bull from EPCC
  • OpenMP program examples
  • Lawrence Livermore National Lab
  • NAS FT parallelization from PGI tutorial
  • Dr. Garbey provided us with serial codes for
    Navier-Stokes

16
Content
  • Overview of OpenMP
  • Acknowledgement
  • OpenMP constructs (5 categories)
  • OpenMP exercises
  • References

17
OpenMP Constructs
  • OpenMP's constructs fall into 5 categories
  • Parallel Regions
  • Worksharing
  • Data Environment
  • Synchronization
  • Runtime functions/environment variables
  • OpenMP is basically the same between Fortran and
    C/C++

18
OpenMP Parallel Regions
  • You create threads in OpenMP with the omp
    parallel pragma.
  • For example, to create a 4-thread parallel
    region:
  • Each thread calls pooh(ID,A) for ID = 0 to 3.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
Each thread redundantly executes the code within
the structured block.
20
OpenMP Work-Sharing Constructs
  • The for Work-Sharing construct splits up loop
    iterations among the threads in a team

#pragma omp parallel
#pragma omp for
for (I=0; I<N; I++) {
  NEAT_STUFF(I);
}

By default, there is a barrier at the end of the
omp for. Use the nowait clause to turn off
the barrier (a small sketch follows below).
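A hedged sketch of nowait with two independent loops inside one parallel region; N and the two work functions are assumptions made for this example:

#pragma omp parallel
{
  #pragma omp for nowait      /* no barrier: threads move on immediately */
  for (int i = 0; i < N; i++)
    independent_work_a(i);

  #pragma omp for             /* implicit barrier at the end of this loop */
  for (int i = 0; i < N; i++)
    independent_work_b(i);
}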
21
Work Sharing Constructs: A Motivating Example
Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id * N / Nthrds;
  iend = (id+1) * N / Nthrds;
  for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a work-sharing for construct:
#pragma omp parallel
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
22
OpenMP For Construct: The Schedule Clause
  • The schedule clause affects how loop iterations
    are mapped onto threads (a sketch follows below):
  • schedule(static [,chunk])
  • Deal out blocks of iterations of size chunk to
    each thread.
  • schedule(dynamic [,chunk])
  • Each thread grabs chunk iterations off a queue
    until all iterations have been handled.
  • schedule(guided [,chunk])
  • Threads dynamically grab blocks of iterations.
    The size of the block starts large and shrinks
    down to size chunk as the calculation proceeds.
  • schedule(runtime)
  • Schedule and chunk size taken from the
    OMP_SCHEDULE environment variable.
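A brief, hedged sketch of the clause in use; the chunk size, loop bound, and work() function are illustrative assumptions:

#define CHUNK 100

#pragma omp parallel for schedule(dynamic, CHUNK)
for (int i = 0; i < 100000; i++)
  work(i);   /* iterations are handed out in chunks of 100 as threads become free */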

23
OpenMP Work-Sharing Constructs
  • The Sections work-sharing construct gives a
    different structured block to each thread.

#pragma omp parallel
#pragma omp sections
{
  X_calculation();
#pragma omp section
  y_calculation();
#pragma omp section
  z_calculation();
}

By default, there is a barrier at the end of the
omp sections. Use the nowait clause to turn
off the barrier.
24
Data Environment: Changing Storage Attributes
  • One can selectively change the storage attributes of
    variables in a construct using the following clauses
    (a sketch follows below):
  • SHARED
  • PRIVATE
  • FIRSTPRIVATE
  • THREADPRIVATE
  • The value of a private variable inside a parallel loop
    can be transmitted to a global value outside the loop
    with:
  • LASTPRIVATE
  • The default status can be modified with:
  • DEFAULT (PRIVATE | SHARED | NONE)

All data clauses apply to parallel regions and
worksharing constructs except shared, which only
applies to parallel regions.
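A short, hedged sketch of firstprivate and lastprivate on a loop; the variables offset and last are made up for illustration:

int offset = 10;           /* initialized before the region */
int last = 0;

#pragma omp parallel for firstprivate(offset) lastprivate(last)
for (int i = 0; i < 100; i++) {
  last = i + offset;       /* each thread starts with its own copy of offset == 10 */
}
/* after the loop, last holds the value from the sequentially final iteration */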
25
Data Environment: Default Storage Attributes
  • Shared memory programming model:
  • Most variables are shared by default
  • Global variables are SHARED among threads
  • Fortran: COMMON blocks, SAVE variables, MODULE
    variables
  • C: file scope variables, static variables
  • But not everything is shared...
  • Stack variables in sub-programs called from
    parallel regions are PRIVATE
  • Automatic variables within a statement block are
    PRIVATE.

26
Private Clause
  • private(var) creates a local copy of var for each
    thread.
  • The value is uninitialized.
  • The private copy is not storage-associated with
    the original.

void wrong()
{
  int IS = 0;
#pragma omp parallel for private(IS)
  for(int J=1; J<1000; J++)
    IS = IS + J;         /* each thread's private IS starts uninitialized */
  printf("%i", IS);      /* does not print the intended sum */
}
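A hedged sketch of how firstprivate and lastprivate change the behaviour of this loop; note that the complete sum would still require a reduction(+:IS) clause instead:

void better()
{
  int IS = 0;
#pragma omp parallel for firstprivate(IS) lastprivate(IS)
  for(int J=1; J<1000; J++)
    IS = IS + J;         /* each private copy now starts at 0 (firstprivate) */
  /* IS holds the value from the thread that ran the last iteration (lastprivate), */
  /* i.e. a partial sum, not the total */
  printf("%i", IS);
}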

27
OpenMP Reduction
  • Another clause that affects the way variables are
    shared:
  • reduction (op : list)
  • The variables in list must be shared in the
    enclosing parallel region.
  • Inside a parallel or a worksharing construct:
  • A local copy of each list variable is made and
    initialized depending on the op (e.g. 0 for +).
  • Pair-wise op updates are applied to the local value.
  • Local copies are reduced into a single global
    copy at the end of the construct.

28
OpenMP: A Reduction Example

#include <omp.h>
#define NUM_THREADS 2
void main ()
{
  int i;
  double ZZ, func(), sum = 0.0;
  omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(ZZ)
  for (i=0; i< 1000; i++) {
    ZZ = func(i);
    sum = sum + ZZ;
  }
}
29
OpenMP Synchronization
  • OpenMP has the following constructs to support
    synchronization
  • barrier
  • critical section
  • atomic
  • flush
  • ordered
  • single
  • master
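Of the constructs above, critical, atomic, master, and single are illustrated on the following slides. As a brief, hedged sketch of barrier (the two compute functions are made-up names):

#pragma omp parallel
{
  int id = omp_get_thread_num();
  compute_phase_one(id);     /* every thread must finish phase one ...  */
  #pragma omp barrier        /* ... before any thread starts phase two  */
  compute_phase_two(id);
}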

31
Critical and Atomic
  • Only one thread at a time can enter a critical
    section

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
 100  CONTINUE

  • Atomic is a special case of a critical section
    that can be used for certain simple statements

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
C$OMP ATOMIC
      X = X + B
C$OMP END PARALLEL
32
Master directive
  • The master construct denotes a structured block
    that is only executed by the master thread. The
    other threads just skip it (no implied barriers
    or flushes).

#pragma omp parallel private (tmp)
{
  do_many_things();
#pragma omp master
  { exchange_boundaries(); }
#pragma omp barrier
  do_many_other_things();
}
33
Single directive
  • The single construct denotes a block of code that
    is executed by only one thread.
  • A barrier and a flush are implied at the end of
    the single block.

#pragma omp parallel private (tmp)
{ do_many_things();
#pragma omp single
  { exchange_boundaries(); }
  do_many_other_things();
}
34
OpenMP Library routines
  • Lock routines
  • omp_init_lock(), omp_set_lock(),
    omp_unset_lock(), omp_test_lock()
  • Runtime environment routines
  • Modify/Check the number of threads
  • omp_set_num_threads(), omp_get_num_threads(),
    omp_get_thread_num(), omp_get_max_threads()
  • Turn on/off nesting and dynamic mode
  • omp_set_nested(), omp_set_dynamic(),
    omp_get_nested(), omp_get_dynamic()
  • Are we in a parallel region?
  • omp_in_parallel()
  • How many processors are in the system?
  • omp_get_num_procs() (a short sketch follows below)
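A brief, hedged sketch combining several of these routines, including a simple lock; it is not taken from the original slides:

#include <omp.h>
#include <stdio.h>

int main(void)
{
  omp_lock_t lock;
  omp_init_lock(&lock);
  omp_set_num_threads(4);

  #pragma omp parallel
  {
    omp_set_lock(&lock);           /* only one thread prints at a time */
    printf("thread %d of %d (procs: %d, in parallel: %d)\n",
           omp_get_thread_num(), omp_get_num_threads(),
           omp_get_num_procs(), omp_in_parallel());
    omp_unset_lock(&lock);
  }
  omp_destroy_lock(&lock);
  return 0;
}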

35
OpenMP Environment Variables
  • OMP_NUM_THREADS
  • bsh:
  • export OMP_NUM_THREADS=2
  • csh:
  • setenv OMP_NUM_THREADS 4

36
Content
  • Overview of OpenMP
  • Acknowledgement
  • OpenMP constructs (5 categories)
  • OpenMP exercises
  • References

37
1. Hello World!
#include <omp.h>
main ()
{
  int nthreads, tid;
  /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }  /* All threads join master thread and disband */
}
38
Example Code - Pthread Creation and Termination
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
  printf("\n%d: Hello World!\n", threadid);
  pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
  pthread_t threads[NUM_THREADS];
  int rc, t;
  for(t=0; t<NUM_THREADS; t++) {
    printf("Creating thread %d\n", t);
    rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
    if (rc) {
      printf("ERROR; return code from pthread_create() is %d\n", rc);
      exit(-1);
    }
  }
  pthread_exit(NULL);
}
39
2. Parallel Loop Reduction
      PROGRAM REDUCTION
      INTEGER I, N
      REAL A(100), B(100), SUM
      N = 100
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO I = 1, N
        SUM = SUM + (A(I) * B(I))
      ENDDO
      PRINT *, ' Sum = ', SUM
      END
40
3. Matrix-vector multiply using a parallel loop
and critical directive
/* Spawn a parallel region explicitly scoping all variables */
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
{
#pragma omp for schedule (static, chunk)
  for (i=0; i<NRA; i++) {
    printf("thread=%d did row=%d\n", tid, i);
    for(j=0; j<NCB; j++)
      for (k=0; k<NCA; k++)
        c[i][j] += a[i][k] * b[k][j];
  }
}
41
Steps of Parallelization Using OpenMP: An Example
from a PGI Tutorial
  • Compile a code with the option to enable a
    profiler
  • Run the code and check if the results are correct
  • Find out the most time-consuming part of the code
    via the profiler information
  • Parallelize the time-consuming part
  • Repeat the above steps until you get a reasonable
    speedup

42
How to Use a Profiler
  • PGI compiler:
  • pgf90 -fast -Minfo -Mprof=func fftpde.F -o fftpde
    (function-level profile)
  • -Mprof=lines (line-level profile)
  • -mp for compiling OpenMP codes
  • pgprof pgprof.out (show the profiler result)
  • Pathscale compiler:
  • pathf90 -Ofast -pg Fftpde.F -o Fftpde
  • pathprof Fftpde | more

43
The most time-consuming loop in Fftpde.F:

      do k=1,n3
        do j=1,n2
          do i=1,n1
            z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
          end do
          call fft(z,inverse,w,n1,m1)
          do i=1,n1
            x1real(i,j,k)=real(z(i))
            x1imag(i,j,k)=aimag(z(i))
          end do
        end do
      end do

The OpenMP version of this loop in Fftpde_1.F:

!$OMP PARALLEL PRIVATE(Z)
!$OMP DO
      do k=1,n3
        do j=1,n2
          do i=1,n1
            z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
          end do
          call fft(z,inverse,w,n1,m1)
          do i=1,n1
            x1real(i,j,k)=real(z(i))
            x1imag(i,j,k)=aimag(z(i))
          end do
        end do
      end do
!$OMP END PARALLEL

NEXT: compare the 1- and 2-processor profiles
after adding OpenMP to this loop.
44
Parallelizing the Remainder of Fftpde.F
  • 1) The DO 130 loop near line 64 (fftpde_2.F)
  • 2) The DO 190 loop near line 115 (fftpde_3.F)
  • 3) The DO 220 loop near line 139 (fftpde_4.F)
  • 4) The DO 250 loop near line 155 (fftpde_5.F)

45
!$OMP PARALLEL PRIVATE(KK,KL,T1,T2,IK)
!$OMP DO
      DO 130 K = 1, N3
         KK = K - 1
         KL = KK
         T1 = S
         T2 = AN
C
C        Find starting seed T1 for this KK using the
C        binary rule for exponentiation.
C
         DO 110 I = 1, 100
            IK = KK / 2
            IF (2 * IK .NE. KK) T2 = RANDLC (T1, T2)
            IF (IK .EQ. 0) GOTO 120
            T2 = RANDLC (T2, T2)
            KK = IK
 110     CONTINUE
C
C        Compute 2 * NQ pseudorandom numbers.
C
 120     CONTINUE
         CALL VRANLC (N1*N2, T2, aa, x1real(1,1,k))
         CALL VRANLC (N1*N2, T2, aa, x1imag(1,1,k))
 130  CONTINUE
!$OMP END PARALLEL
1. Parallelize the DO 130 loop in Fftpde_2.F
46
!$OMP PARALLEL PRIVATE(K1,J1,JK,I1)
!$OMP DO
      DO 190 K = 1, N3
         K1 = K - 1
         IF (K .GT. N3/2) K1 = K1 - N3
C
         DO 180 J = 1, N2
            J1 = J - 1
            IF (J .GT. N2/2) J1 = J1 - N2
            JK = J1 ** 2 + K1 ** 2
C
            DO 170 I = 1, N1
               I1 = I - 1
               IF (I .GT. N1/2) I1 = I1 - N1
               X3(I,J,K) = EXP (AP * (I1 ** 2 + JK))
 170        CONTINUE
C
 180     CONTINUE
 190  CONTINUE
!$OMP END PARALLEL
2. Parallelize the DO 190 loop in Fftpde_3.F
47
3. Parallelize the DO 220 loop in Fftpde_4.F
!$OMP PARALLEL PRIVATE(T1)
!$OMP DO
      DO 220 K = 1, N3
         DO 210 J = 1, N2
            DO 200 I = 1, N1
               T1 = X3(I,J,K) ** KT
               X2real(I,J,K) = T1 * X1real(I,J,K)
               X2imag(I,J,K) = T1 * X1imag(I,J,K)
 200        CONTINUE
 210     CONTINUE
 220  CONTINUE
!$OMP END PARALLEL
48
4. Parallelize the DO 250 loop in Fftpde_5.F
!$OMP PARALLEL
!$OMP DO
      DO 250 K = 1, N3
         DO 240 J = 1, N2
            DO 230 I = 1, N1
               X2real(I,J,K) = RN * X2real(I,J,K)
               X2imag(I,J,K) = RN * X2imag(I,J,K)
 230        CONTINUE
 240     CONTINUE
 250  CONTINUE
!$OMP END PARALLEL
49
Conclusion
  • OpenMP is successful on small-to-medium SMP
    systems.
  • Multiple cores/CPUs will dominate future computer
    architectures; OpenMP is likely to be the major
    parallel programming language on these architectures.
  • Simple: everybody can learn it in 2 weeks.
  • Not so simple: don't stop learning! Keep learning
    for better performance.

50
Some Buggy Codes
#pragma omp parallel for shared(a,b,c,chunk) \
  private(i,tid) schedule(static,chunk)
{
  tid = omp_get_thread_num();   /* BUG: a parallel for must be followed   */
  for (i=0; i < N; i++) {       /* directly by the for loop, not a block  */
    c[i] = a[i] + b[i];         /* containing an extra statement          */
    printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
  }
}  /* end of parallel for construct */
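One way to repair it, as a hedged sketch: split the combined directive so the thread-ID query sits inside a plain parallel region and the for construct applies directly to the loop:

#pragma omp parallel shared(a,b,c,chunk) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp for schedule(static,chunk)
  for (i=0; i < N; i++) {
    c[i] = a[i] + b[i];
    printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
  }
}  /* end of parallel region */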
51
Content
  • Overview of OpenMP
  • Acknowledgement
  • OpenMP constructs (5 categories)
  • OpenMP exercises
  • References

52
References
  • OpenMP Official Website
  • www.openmp.org
  • OpenMP 2.5 Specifications
  • An OpenMP book
  • Rohit Chandra, Parallel Programming in OpenMP.
    Morgan Kaufmann Publishers.
  • Compunity
  • The community of OpenMP researchers and
    developers in academia and industry
  • http://www.compunity.org/
  • Conference papers
  • WOMPAT, EWOMP, WOMPEI, IWOMP
  • http://www.nic.uoregon.edu/iwomp2005/index.html#program

53
Exercises
  • cp /tmp/omp_examples.tar.gz ~/
  • tar xzvf omp_examples.tar.gz
  • On marvin use PathScale; on medusa use PGI:
  • pathf90 / pathcc -mp -Ofast
  • pgf90 / pgcc -mp -fast
  • Compile (make) and run the codes in
  • LLNL_C, LLNL_F, and FFT
  • There is a README in each subdirectory
  • Set the number of threads before running:
  • echo $SHELL
  • export OMP_NUM_THREADS=2 (for sh/bash)
  • setenv OMP_NUM_THREADS 2 (for csh)

54
OpenMP Compilers and Platforms
  • Fujitsu/Lahey Fortran, C and C++
  • Intel Linux Systems
  • Fujitsu Solaris Systems
  • HP HP-UX PA-RISC/Itanium, HP Tru64 Unix
  • Fortran/C/C++
  • IBM XL Fortran and C from IBM
  • IBM AIX Systems
  • Intel C++ and Fortran Compilers from Intel
  • Intel IA32 Linux/Windows Systems
  • Intel Itanium-based Linux/Windows Systems
  • Guide Fortran and C/C++ from Intel's KAI Software
    Lab
  • Intel Linux/Windows Systems
  • PGF77 and PGF90 Compilers from The Portland
    Group, Inc. (PGI)
  • Intel Linux/Solaris/Windows/NT Systems

55
Compilers and Platforms
  • SGI MIPSpro 7.4 Compilers
  • SGI IRIX Systems
  • Sun Microsystems Sun ONE Studio Compiler
    Collection: Fortran 95, C, and C++
  • Sun Solaris Platforms
  • VAST from Veridian Pacific-Sierra Research
  • IBM AIX Systems
  • Intel IA32 Linux/Windows/NT Systems
  • SGI IRIX Systems
  • Sun Solaris Systems
  • PathScale EKOPath Compiler Suite for AMD64 and
    EM64T: Fortran, C, C++
  • 64-bit Linux
  • Microsoft Visual Studio 2005 (Visual C++)
  • Windows

56
Parallelize Win32 API, PI
#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2
HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
  int i, start;
  double x, sum = 0.0;
  start = *(int *) arg;
  step = 1.0/(double) num_steps;
  for (i=start; i<= num_steps; i=i+NUM_THREADS) {
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  EnterCriticalSection(&hUpdateMutex);
  global_sum += sum;
  LeaveCriticalSection(&hUpdateMutex);
}

void main ()
{
  double pi; int i;
  DWORD threadID;
  int threadArg[NUM_THREADS];
  for(i=0; i<NUM_THREADS; i++) threadArg[i] = i+1;
  InitializeCriticalSection(&hUpdateMutex);
  for (i=0; i<NUM_THREADS; i++)
    thread_handles[i] = CreateThread(0, 0,
        (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
  WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
  pi = global_sum * step;
  printf(" pi is %f \n", pi);
}
Doubles code size!
57
Solution: Keep it simple
  • Threads libraries:
  • Pro: Programmer has control over everything
  • Con: Programmer must control everything

Programmers scared away
Full control
Increased complexity
Sometimes a simple evolutionary approach is better
58
PI Program: an example
static long num_steps = 100000; double step;
void main ()
{
  int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=1; i<= num_steps; i++) {
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}
59
OpenMP PI Program: Parallel Region Example (SPMD Program)

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
  int i; double x, pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
  {
    double x; int id;
    id = omp_get_thread_num();
    for (i=id, sum[id]=0.0; i< num_steps; i=i+NUM_THREADS) {
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i<NUM_THREADS; i++) pi += sum[i] * step;
}

SPMD Programs: Each thread runs the same code,
with the thread ID selecting any thread-specific
behavior.
60
OpenMP PI Program: Work-Sharing Construct

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
  int i; double x, pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
  {
    double x; int id = omp_get_thread_num();
    sum[id] = 0;
#pragma omp for
    for (i=id; i< num_steps; i++) {
      x = (i+0.5)*step; sum[id] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i<NUM_THREADS; i++) pi += sum[i] * step;
}
61
OpenMP PI Program: Private Clause and a Critical Section

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
  int i, id; double x, sum, pi = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
#pragma omp parallel private (x, sum)
  {
    id = omp_get_thread_num();
    for (i=id, sum=0.0; i< num_steps; i=i+NUM_THREADS) {
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
    }
#pragma omp critical
    pi += sum * step;
  }
}

Note: We did not need to create an array to hold
local sums or clutter the code with explicit
declarations of x and sum.
62
OpenMP PI Program: Parallel For with a Reduction

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
  int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(x)
  for (i=1; i<= num_steps; i++) {
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

OpenMP adds 2 to 4 lines of code