Title: Fortran 90 Parallel Programming with
1 Fortran 90 Parallel Programming with
- CESUP/UFRGS June 10, 2003
2 Components of OpenMP
- OpenMP is an Application Programming Interface (API) consisting of
- OpenMP Compiler Directives
- Runtime Library Routines
- Environment Variables
3 My First F90 OpenMP program
- PROGRAM HELLO
- INTEGER NTHREADS, TID
- !$ INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
- TID = 0
- NTHREADS = 1
- ! Fork a team of NTHREADS threads
- !$OMP PARALLEL PRIVATE(TID)
- ! Obtain thread number
- !$ TID = OMP_GET_THREAD_NUM()
- PRINT *, 'Hello World from thread ', TID
- ! Only master thread does this
- IF (TID .EQ. 0) THEN
- !$ NTHREADS = OMP_GET_NUM_THREADS()
- PRINT *, 'Number of threads ', NTHREADS
- END IF
- !$OMP END PARALLEL
- END PROGRAM HELLO
- Slide callouts: lines starting with !$OMP are OpenMP Directives; the !$ prefix is the conditional-compilation Sentinel; OMP_GET_THREAD_NUM and OMP_GET_NUM_THREADS are Runtime Library Routines.
4 Compile and Execute F90 program
- SERIAL run (OpenMP directives disabled with -x omp)
- f90 -x omp -o hello.x omp_hello.f90
- hello.x
- PARALLEL OpenMP run
- f90 -o hello.x omp_hello.f90
- env OMP_NUM_THREADS=4 \
- env OMP_SCHEDULE=STATIC hello.x
5 Output from My First F90 program
- jupiter/home0/nick/115 f90 -x omp -o hello.x \
- My_First_OpemMP.f90 ; hello.x
- Hello World from thread 0
- Number of threads 1
- jupiter/home0/nick/116 f90 -o hello_p.x \
- My_First_OpemMP.f90
- jupiter/home0/nick/117 env \
- OMP_NUM_THREADS=4 env \
- OMP_SCHEDULE=STATIC hello_p.x
- Hello World from thread 2
- Number of threads 4
- Hello World from thread 0
- Hello World from thread 1
- Hello World from thread 3
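6 Parallel Region Construct
- !$OMP PARALLEL PRIVATE(TID)
- !$ TID = OMP_GET_THREAD_NUM()
- PRINT *, 'Hello World from thread ', TID
- !$OMP END PARALLEL
- [Figure: in the Serial Region only the master thread (TID=0) runs; in the Parallel Region a team of four threads (TID=0, TID=1, TID=2, TID=3) runs concurrently.]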
7 Parallel Region Sum Series Example
- The program presented in the NOTES PAGES sums the following series up to N terms:
- Prod_ABC = 1*2*3 + 2*3*4 + ... + N(N+1)(N+2)
- The closed form for the above sum is
- Close_Form_Ans = N(N+1)(N+2)(N+3)/4
- The program is serial (i.e. it uses 1 CPU) and our job is to parallelize it.
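The program in the notes pages is not reproduced here, so the following is only a minimal serial sketch of the computation described above. The value N = 100, the use of Scal_Prod_ABC for the running sum, and the DOUBLE PRECISION type are assumptions made for illustration.

```fortran
program sum_series
  implicit none
  integer, parameter :: n = 100          ! number of terms (assumed value)
  integer :: i
  double precision :: prod_abc           ! i-th term i(i+1)(i+2)
  double precision :: scal_prod_abc      ! running sum of the terms
  double precision :: close_form_ans     ! closed-form value of the sum

  scal_prod_abc = 0.0d0
  do i = 1, n
     prod_abc = dble(i) * dble(i + 1) * dble(i + 2)
     scal_prod_abc = scal_prod_abc + prod_abc
  end do

  close_form_ans = dble(n) * dble(n + 1) * dble(n + 2) * dble(n + 3) / 4.0d0

  print *, 'Sum of series = ', scal_prod_abc
  print *, 'Closed form   = ', close_form_ans
end program sum_series
```

Both printed values should agree, which gives a simple check for every parallel variant of the loop.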
8 First Attempt at Parallelizing the Program
- The following segments were added:
- Defaults for serial run
- Parallel directive in which
- i_start and i_end are calculated
- i_end is adjusted for the last slave
- Many printout statements
- Looks good, but will it work?
- Are there any DATA DEPENDENCES?
9 Comments about the First Attempt
- The variable NUM_THREADS has a SHARED attribute. There is no need for each thread in the parallel region to call the omp_get_num_threads() function.
- The shared variables Scal_Prod_ABC and Prod_ABC must be synchronized.
- The work in the original DO loop was distributed among the THREADS. Each thread starts at its own i_start value. All threads perform my_width iterations, except for the last thread, which may perform fewer.
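The comments above can be sketched as follows. This is not the program from the notes pages; it is a minimal illustration that assumes N = 100, lets a single thread set the shared NUM_THREADS, and uses a hypothetical private variable my_sum so that the shared Scal_Prod_ABC is updated only inside a CRITICAL section. The names i_start, i_end and my_width follow the slides.

```fortran
program sum_series_manual
  implicit none
  integer, parameter :: n = 100          ! number of terms (assumed value)
  integer :: i, i_start, i_end, my_width, tid, num_threads
  double precision :: prod_abc, my_sum, scal_prod_abc
!$ integer :: omp_get_thread_num, omp_get_num_threads

  ! Defaults for a serial run
  tid = 0
  num_threads = 1
  scal_prod_abc = 0.0d0

!$OMP PARALLEL PRIVATE(tid, i, i_start, i_end, my_width, prod_abc, my_sum)
!$ tid = omp_get_thread_num()
  ! NUM_THREADS is shared, so one thread sets it for the whole team;
  ! the implicit barrier at END SINGLE makes the value visible to all.
!$OMP SINGLE
!$ num_threads = omp_get_num_threads()
!$OMP END SINGLE

  my_width = (n + num_threads - 1) / num_threads   ! ceiling division
  i_start  = tid * my_width + 1
  i_end    = min(i_start + my_width - 1, n)        ! last thread may do less

  my_sum = 0.0d0                                   ! hypothetical private partial sum
  do i = i_start, i_end
     prod_abc = dble(i) * dble(i + 1) * dble(i + 2)
     my_sum = my_sum + prod_abc
  end do

  ! Only one thread at a time may update the shared total
!$OMP CRITICAL
  scal_prod_abc = scal_prod_abc + my_sum
!$OMP END CRITICAL
!$OMP END PARALLEL

  print *, 'Sum of series = ', scal_prod_abc
end program sum_series_manual
```

Making Prod_ABC private and accumulating into a per-thread partial sum keeps the CRITICAL section to one update per thread rather than one per iteration.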
10 Work-sharing construct inside a parallel region
- The parallel region
- !$OMP PARALLEL
- !$OMP DO
- <do-loop>
- !$OMP END DO
- !$OMP DO
- <another do-loop>
- !$OMP END DO
- !$OMP END PARALLEL
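As a concrete illustration of this pattern, here is a minimal sketch with two work-shared loops inside one parallel region. The arrays a and b and the size n are hypothetical. The second loop reads elements of a that another thread may have written, which is safe only because END DO carries an implicit barrier.

```fortran
program two_loops
  implicit none
  integer, parameter :: n = 1000         ! array size (assumed value)
  integer :: i
  double precision :: a(n), b(n)

!$OMP PARALLEL
!$OMP DO
  do i = 1, n
     a(i) = dble(i)
  end do
!$OMP END DO
  ! Implicit barrier here: every element of a is defined before the
  ! second loop starts.
!$OMP DO
  do i = 1, n
     b(i) = 2.0d0 * a(n + 1 - i)
  end do
!$OMP END DO
!$OMP END PARALLEL

  print *, 'b(1) = ', b(1), '  b(n) = ', b(n)
end program two_loops
```

The loop index i is private to each thread automatically in a DO work-sharing construct, so no PRIVATE clause is needed for it.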
11 PARALLEL DO construct
- The parallel construct
- !$OMP PARALLEL DO clause1, ...
- <single do-loop>
- !$OMP END PARALLEL DO
12 A more compact formulation
- The following program uses the PARALLEL DO construct and also resolves the DATA DEPENDENCE by introducing a CRITICAL directive
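The program itself is in the notes pages and is not reproduced here. A minimal sketch of the idea, assuming N = 100 and the variable names used elsewhere in these slides, might look like this:

```fortran
program sum_series_critical
  implicit none
  integer, parameter :: n = 100          ! number of terms (assumed value)
  integer :: i
  double precision :: prod_abc, scal_prod_abc

  scal_prod_abc = 0.0d0

!$OMP PARALLEL DO PRIVATE(prod_abc)
  do i = 1, n
     prod_abc = dble(i) * dble(i + 1) * dble(i + 2)
     ! Serialize the update of the shared sum
!$OMP CRITICAL
     scal_prod_abc = scal_prod_abc + prod_abc
!$OMP END CRITICAL
  end do
!$OMP END PARALLEL DO

  print *, 'Sum of series = ', scal_prod_abc
end program sum_series_critical
```

The CRITICAL section removes the race on the shared sum, but every iteration now waits its turn for the update, so this form is correct rather than fast.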
13 Comments on the Compact Formulation
- The directive !$OMP PARALLEL DO is the same as the pair of directives
- !$OMP PARALLEL
- !$OMP DO
- <single do-loop>
- !$OMP END DO
- !$OMP END PARALLEL
14 Final version with a SUM REDUCTION
- The PARALLEL DO construct has two clauses
- REDUCTION(+:Scal_Prod_ABC), which replaces the critical section.
- A PRIVATE clause for the Prod_ABC variable.
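A minimal sketch of this final form, assuming N = 100 and DOUBLE PRECISION variables (neither is stated on the slide):

```fortran
program sum_series_reduction
  implicit none
  integer, parameter :: n = 100          ! number of terms (assumed value)
  integer :: i
  double precision :: prod_abc, scal_prod_abc, close_form_ans

  scal_prod_abc = 0.0d0

  ! Each thread accumulates into its own copy of scal_prod_abc;
  ! the copies are added together at the end of the loop.
!$OMP PARALLEL DO PRIVATE(prod_abc) REDUCTION(+:scal_prod_abc)
  do i = 1, n
     prod_abc = dble(i) * dble(i + 1) * dble(i + 2)
     scal_prod_abc = scal_prod_abc + prod_abc
  end do
!$OMP END PARALLEL DO

  close_form_ans = dble(n) * dble(n + 1) * dble(n + 2) * dble(n + 3) / 4.0d0

  print *, 'Sum of series = ', scal_prod_abc
  print *, 'Closed form   = ', close_form_ans
end program sum_series_reduction
```

Compared with a CRITICAL section inside the loop, the reduction synchronizes once per thread instead of once per iteration.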
15 Parallel Region Construct - Revisited
- !$OMP PARALLEL clause1 clause2 ...
- <structured block of code>
- !$OMP END PARALLEL
- [Figure: the Master thread runs the Serial Region; thread0, thread1, thread2 and thread3 run the Parallel Region.]
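16 FIRSTPRIVATE and LASTPRIVATE (1/3)
- [Figure, FIRSTPRIVATE: the value X from the Serial Region is copied into each thread's private copy in the Parallel Region.]
- [Figure, LASTPRIVATE: the threads hold the values A, B, C, D in the Parallel Region; the last value, D, is copied to the following Serial Region.]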
17 FIRSTPRIVATE and LASTPRIVATE (2/3)
- The initial and final values of PRIVATE variables are unspecified
- A FIRSTPRIVATE variable is private, and its initial value is copied from the preceding serial region into the current parallel region
- A LASTPRIVATE variable is private, and its final value (the one from the sequentially last iteration) is copied into the serial region following the current parallel region
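The two clauses can be illustrated with a small sketch. The variable names offset and last_val and the loop length n = 8 are hypothetical, chosen only for this example.

```fortran
program first_last_private
  implicit none
  integer, parameter :: n = 8            ! loop length (assumed value)
  integer :: i, offset, last_val

  offset   = 10                          ! defined in the serial region
  last_val = -1

  ! FIRSTPRIVATE: every thread's private offset starts at 10.
  ! LASTPRIVATE:  after the loop, last_val holds the value assigned
  !               in the sequentially last iteration (i = n).
!$OMP PARALLEL DO FIRSTPRIVATE(offset) LASTPRIVATE(last_val)
  do i = 1, n
     last_val = offset + i
  end do
!$OMP END PARALLEL DO

  print *, 'last_val = ', last_val, ' (offset + n = ', offset + n, ')'
end program first_last_private
```

With a plain PRIVATE clause instead, offset would be undefined inside the loop and last_val would be undefined after it.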
18 FIRSTPRIVATE and LASTPRIVATE (3/3)
- Before and after the parallel region, the array variable zzz is global.
- Inside the parallel region zzz is recomputed, but only the last two elements of array zzz are copied to the next serial region.
- All elements of zzz are zero in the serial region following the parallel region, except for the first two elements.
19 DATA SCOPING - Lexical and Dynamic
- program bom_dia
- !$OMP PARALLEL
- call greetings
- !$OMP END PARALLEL
- end
- subroutine greetings
- !$ external OMP_GET_THREAD_NUM
- !$ integer OMP_GET_THREAD_NUM
- integer TID
- character(len=12), dimension(0:3) :: saludos
- DATA saludos /"Bom dia","Buenos dias", &
- "Good morning","Bon jour"/
- !$OMP CRITICAL
- TID = OMP_GET_THREAD_NUM()
- write(6,1001) TID, saludos(TID)
- !$OMP END CRITICAL
- 1001 format("TID=",i1," ",a12)
- return
- end
- [Figure labels: the statements between PARALLEL and END PARALLEL in program bom_dia form the LEXICAL EXTENT; subroutine greetings, called from inside the region, belongs to the DYNAMIC EXTENT.]
20 Threadprivate Directive (1/5)
- The program appearing in NOTES PAGES produces different results when run serially and when run in parallel. Why?
21 Threadprivate Directive (2/5)
- The variables istart and iend are now passed as arguments in the call to the work subroutine and the COMMON was removed, but the results are still different. Why?
- Because the DO REDUCTION is using some values of array iarray before they are defined!
- All values of iarray must be defined first; only then can the summation process start!
22 Threadprivate Directive (3/5)
- Here we use a separate region for the sum REDUCTION. Now the parallel and serial versions produce identical results:
- isum = 338350
- closed_form = 338350
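The program is in the notes pages and is not shown here. The sketch below assumes that iarray(i) holds i squared, which is inferred only from the reported values (338350 is the sum of the squares from 1 to 100), and it uses two PARALLEL DO regions so that iarray is fully defined before the reduction begins.

```fortran
program sum_squares
  implicit none
  integer, parameter :: nn = 100
  integer :: i, isum, closed_form
  integer :: iarray(nn)

  ! Region 1: define every element of iarray (assumed to be i*i)
!$OMP PARALLEL DO
  do i = 1, nn
     iarray(i) = i * i
  end do
!$OMP END PARALLEL DO

  ! Region 2: the sum starts only after region 1 has finished
  isum = 0
!$OMP PARALLEL DO REDUCTION(+:isum)
  do i = 1, nn
     isum = isum + iarray(i)
  end do
!$OMP END PARALLEL DO

  closed_form = nn * (nn + 1) * (2 * nn + 1) / 6

  print *, 'isum        = ', isum
  print *, 'closed_form = ', closed_form
end program sum_squares
```

The end of the first parallel region acts as a barrier, which is what guarantees that no thread reads an element of iarray before it has been written.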
23 Threadprivate Directive (4/5)
- The threadprivate directive makes the COMMON
block above it private to each thread. This is a
useful alternative to the previous program,
especially when the number of variables that need
to be private is large.
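A minimal sketch of this technique follows. The COMMON block name /bounds/, the chunking arithmetic, and the assumption that iarray(i) holds i squared are illustrative and not taken from the notes pages; istart, iend, iarray and the subroutine name work follow the slides.

```fortran
program tp_demo
  implicit none
  integer, parameter :: nn = 100
  integer :: iarray(nn)
  integer :: isum, tid, nthreads, chunk
!$ integer :: omp_get_thread_num, omp_get_num_threads
  integer :: istart, iend
  common /bounds/ istart, iend
!$OMP THREADPRIVATE(/bounds/)

  ! Defaults for a serial run
  tid = 0
  nthreads = 1

!$OMP PARALLEL PRIVATE(tid, nthreads, chunk)
!$ tid = omp_get_thread_num()
!$ nthreads = omp_get_num_threads()
  chunk  = (nn + nthreads - 1) / nthreads
  istart = tid * chunk + 1               ! each thread writes its own copy
  iend   = min(istart + chunk - 1, nn)
  call work(iarray, nn)
!$OMP END PARALLEL

  isum = sum(iarray)                     ! serial sum after the region
  print *, 'isum        = ', isum
  print *, 'closed_form = ', nn * (nn + 1) * (2 * nn + 1) / 6
end program tp_demo

subroutine work(iarray, nn)
  implicit none
  integer, intent(in) :: nn
  integer, intent(inout) :: iarray(nn)
  integer :: i
  integer :: istart, iend
  common /bounds/ istart, iend
!$OMP THREADPRIVATE(/bounds/)

  ! istart and iend are read from this thread's copy of /bounds/
  do i = istart, iend
     iarray(i) = i * i
  end do
end subroutine work
```

The THREADPRIVATE directive has to be repeated after the COMMON declaration in every program unit that declares the block.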
24 Threadprivate Directive (5/5)
- The COPYIN clause copies the threadprivate variables from the master thread to all the other threads in the team
- OUTPUT
- NN = 10
- isum = 385
- closed_form = 385
- NN = 100
- isum = 338350
- closed_form = 338350
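A minimal sketch of COPYIN, assuming a threadprivate COMMON block (here given the hypothetical name /sizes/) that holds NN, and a sum of squares as suggested by the output values:

```fortran
program copyin_demo
  implicit none
  integer :: i, isum, closed_form
  integer :: nn
  common /sizes/ nn
!$OMP THREADPRIVATE(/sizes/)

  nn = 10                                ! set by the master thread only
  isum = 0

  ! COPYIN gives every thread's copy of nn the master's value on entry,
  ! so all threads see the same loop bound.
!$OMP PARALLEL DO COPYIN(/sizes/) REDUCTION(+:isum)
  do i = 1, nn
     isum = isum + i * i
  end do
!$OMP END PARALLEL DO

  closed_form = nn * (nn + 1) * (2 * nn + 1) / 6

  print *, 'NN          = ', nn
  print *, 'isum        = ', isum
  print *, 'closed_form = ', closed_form
end program copyin_demo
```

Without COPYIN, the copies of nn belonging to the other threads would be undefined inside the region, because only the master thread assigned a value in the serial part.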