Title: Introduction to OpenMP
1. Introduction to OpenMP
George Mozdzynski, February 2009
2. Outline
- What is OpenMP?
- Why OpenMP?
- OpenMP standard
- Key directives
- Using OpenMP on IBM
- OpenMP parallelisation strategy
- Performance issues
- Automatic checking of OpenMP applications
- Class exercises
3. What is OpenMP?
- OpenMP is a directive-based language for expressing parallelism on Shared Memory Multiprocessor systems
  - F77, F90, C, C++ supported
- www.openmp.org for info, specifications, FAQ, mail group
- xlf90 supports
  - OpenMP version 2.0 (xlf90 v9)
  - OpenMP version 2.5 (xlf90 v10)
4. Why OpenMP?
- Most new supercomputers use shared memory multiprocessor technology
- MPI considered difficult to program
- SMPs already had vendor-specific directives for parallelisation
- OpenMP is a standard (just like MPI)
- OpenMP is relatively easy to use
- OpenMP can work together with MPI
- MPI/OpenMP vs. MPI-only performance
5. IFS T799L91 Forecast Model (64 nodes, Power5, 2048 user threads)
6. OpenMP directives

!$OMP PARALLEL DO PRIVATE(J,K, &
!$OMP L,M)
      DO J=1,N
        CALL SUB(J,...)
      ENDDO
!$OMP END PARALLEL DO

!$    ITH=OMP_GET_MAX_THREADS()
See p7-9
Conditional compilation, e.g. ifsaux/module/yomoml.F90 (see the sketch below)
OpenMP parallel programs are valid serial programs
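
A minimal sketch of conditional compilation (program and variable names are ours, not from yomoml.F90): lines beginning with the !$ sentinel are compiled only when OpenMP is enabled (e.g. xlf90_r -qsmp=omp), so the same source still builds and runs as a valid serial program.

program condcomp
  implicit none
  integer :: nthreads
!$ integer, external :: OMP_GET_MAX_THREADS
  nthreads = 1                          ! serial default
!$ nthreads = OMP_GET_MAX_THREADS()    ! only compiled with -qsmp=omp
  write(*,'("compiled for ",I3," thread(s)")') nthreads
end program condcomp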
7. IBM Supercomputer
8.

#!/bin/ksh
# @ shell        = /bin/ksh
# @ job_type     = parallel
# @ class        = np
# @ output       = omptest.out
# @ error        = omptest.out
# @ node         = 1
# @ resources    = ConsumableCpus(16) ConsumableMemory(6000mb)
# @ total_tasks  = 1
# @ core_limit   = 4096
# @ notification = never
# @ node_usage   = not_shared     !!!! for accurate timing
# @ ec_smt       = no             !!!! for accurate timing
# @ queue

cd $HOME/OpenMP_Course
export XLSMPOPTS="stack=200000000"
LDRFLAGS="-q64 -bmaxstack=0x8000000000"
xlf90_r $LDRFLAGS -O3 -qstrict -qsmp=omp -o omptest omptest1.F
for omp in 1 2 4 8 16
do
  echo "Using $omp threads"
  export OMP_NUM_THREADS=$omp
  ./omptest
done
9.

program omptest1
parameter(k=1000000)
real*8 a(k),b(k)
a(:)=0.
b(:)=0.
write(0,'("start timing")')
zbeg=timef()*1.0e-3
do irep=1,1000
  call work(k,a,b)
enddo
zend=timef()*1.0e-3
write(0,'("end timing")')
write(*,'("time=",F10.2)') zend-zbeg
stop
end

subroutine work(k,a,b)
real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
do i=1,k
  a(i)=b(i)
enddo
!$OMP END PARALLEL DO
return
end

Threads  time (s)  speedup
   1       1.35      1.00
   2       0.68      1.99
   4       0.34      3.97
   8       0.37      3.65
  16       0.29      4.66
10.

program omptest1
parameter(k=10000000)
real*8 a(k),b(k)
a(:)=0.
b(:)=0.
write(0,'("start timing")')
zbeg=timef()*1.0e-3
do irep=1,200
  call work(k,a,b)
enddo
zend=timef()*1.0e-3
write(0,'("end timing")')
write(*,'("time=",F10.2)') zend-zbeg
stop
end

subroutine work(k,a,b)
real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
do i=1,k
  a(i)=b(i)
enddo
!$OMP END PARALLEL DO
return
end

Threads  time (s)  speedup
   1       4.27      1.00
   2       4.25      1.00
   4       2.37      1.80
   8       0.78      5.47
  16       0.48      8.90
11.

program omptest1
parameter(k=10000000)
real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
do i=1,k
  a(i)=0.
  b(i)=0.
enddo
!$OMP END PARALLEL DO
write(0,'("start timing")')
zbeg=timef()*1.0e-3
do irep=1,200
  call work(k,a,b)
enddo
zend=timef()*1.0e-3
write(0,'("end timing")')
write(*,'("time=",F10.2)') zend-zbeg
stop
end

subroutine work(k,a,b)
real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
do i=1,k
  a(i)=b(i)
enddo
!$OMP END PARALLEL DO
return
end

Threads  time (s)  speedup
   1       4.27      1.00
   2       2.12      2.01
   4       0.85      5.02
   8       0.69      6.19
  16       0.38     11.24
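
The only change from slide 10 is where a and b are first initialised. A likely explanation for the recovered scaling: on systems with first-touch memory placement, a page is allocated near the CPU that first writes it, so initialising the arrays inside a parallel loop spreads them across the machine's memory banks, whereas the serial initialisation on slide 10 leaves every page local to the master thread and makes its memory bandwidth the bottleneck.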
12. K=10M, IREP=200, a(I)=b(I)*cos(b(I))

Threads  time (s)  speedup
   1      58.61      1.00
   2      29.27      2.00
   4      14.64      4.00
   8       7.31      8.02
  16       3.97     14.76

OpenMP performance tip: for perfect speedup, avoid using memory?
13. Key directives: Parallel Region
- !$OMP PARALLEL clause[,clause...]
- block
- !$OMP END PARALLEL
- where clause can be
  - PRIVATE(list)
  - etc.

See p12-14; a minimal sketch follows.
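
A minimal sketch of a bare parallel region (program and variable names are ours, not from the course): every thread executes the whole block, unlike the work-sharing constructs on the following slides, and ITH is PRIVATE so each thread keeps its own copy.

program parregion
  implicit none
  integer :: ith
!$ integer, external :: OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(ITH)
  ith = 0                                     ! serial default
!$ ith = OMP_GET_THREAD_NUM()
  write(*,'("hello from thread ",I3)') ith    ! printed once per thread
!$OMP END PARALLEL
end program parregion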
14. Key directives: Work-sharing constructs/1
- !$OMP DO clause[,clause...]
- do_loop
- !$OMP END DO
- where clause can be
  - PRIVATE(list)
  - SCHEDULE(type,chunk)
  - etc.

See p15-18; a sketch follows.
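
A sketch of the DO construct inside an existing parallel region (names and the trivial WORK body are illustrative): SCHEDULE(DYNAMIC,4) hands idle threads chunks of 4 iterations, which helps when iteration costs vary; the loop iterations are divided among the team instead of being replicated.

program doshare
  implicit none
  integer, parameter :: n = 100
  integer :: j
!$OMP PARALLEL PRIVATE(J)
!$OMP DO SCHEDULE(DYNAMIC,4)
  do j = 1, n
    call work(j)             ! iterations shared among the team, 4 at a time
  enddo
!$OMP END DO
!$OMP END PARALLEL
contains
  subroutine work(k)
    integer, intent(in) :: k
    ! stand-in for a variable-cost loop body
  end subroutine work
end program doshare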
15. Key directives: Work-sharing constructs/2
- !$OMP WORKSHARE
- block
- !$OMP END WORKSHARE
- No PRIVATE or SCHEDULE options
- A good example for block would be array assignment statements (i.e. no DO)

See p20-22; a sketch follows.
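
A sketch of WORKSHARE over array syntax (arrays illustrative): the elements of each array assignment are divided among the threads of the enclosing parallel region, with no explicit DO loop.

program wshare
  implicit none
  integer, parameter :: k = 1000000
  real*8 :: a(k), b(k), c(k)
  b = 1.0d0
  c = 2.0d0
!$OMP PARALLEL
!$OMP WORKSHARE
  a(:) = b(:) + c(:)       ! element updates split among threads
  c(:) = 2.0d0*a(:)
!$OMP END WORKSHARE
!$OMP END PARALLEL
  write(*,*) a(1), c(k)
end program wshare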
16. Key directives: combined parallel work-sharing/1
- !$OMP PARALLEL DO clause[,clause...]
- do_loop
- !$OMP END PARALLEL DO
- where clause can be
  - PRIVATE(list)
  - SCHEDULE(type,chunk)
  - etc.

See p23
17. Key directives: combined parallel work-sharing/2
- !$OMP PARALLEL WORKSHARE clause[,clause...]
- block
- !$OMP END PARALLEL WORKSHARE
- where clause can be
  - PRIVATE(list)
  - etc.

See p24-25
18. K=10M, IREP=200

!$OMP PARALLEL DO PRIVATE(I)
DO I=1,10000000
  A(I)=B(I)*COS(C(I))
ENDDO
!$OMP END PARALLEL DO

!$OMP PARALLEL WORKSHARE
A(:)=B(:)*COS(C(:))
!$OMP END PARALLEL WORKSHARE

           PARALLEL DO         PARALLEL WORKSHARE
Threads   time (s)  speedup    time (s)  speedup
   1       50.71     1.00       51.32     1.00
   2       25.36     2.00       25.56     2.01
   3       16.84     3.01       16.85     3.05
   4       12.75     3.98       12.94     3.97
   5       10.14     5.00       10.46     4.91
   6        8.70     5.83        8.78     5.85
   7        7.75     6.54        7.28     7.05
   8        6.96     7.29        6.43     7.98
21.

program omptest2
INTEGER OMP_GET_THREAD_NUM
INTEGER OMP_GET_NUM_THREADS
INTEGER OMP_GET_MAX_THREADS
!$ WRITE(0,'("NUM_THREADS=",I3)') OMP_GET_NUM_THREADS()
!$ WRITE(0,'("MAX_THREADS=",I3)') OMP_GET_MAX_THREADS()
!$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(J,ITH,ITHS)
DO J=1,10
!$ ITH=OMP_GET_THREAD_NUM()
!$ ITHS=OMP_GET_NUM_THREADS()
!$ WRITE(0,'("J=",I3," THREAD_NUM=",I3," NUM_THREADS=",I3)') J,ITH,ITHS
ENDDO
!$OMP END PARALLEL DO
stop
end

export OMP_NUM_THREADS=4
./omptest

NUM_THREADS=  1
MAX_THREADS=  4
J=  1 THREAD_NUM=  0 NUM_THREADS=  4
J=  2 THREAD_NUM=  0 NUM_THREADS=  4
J=  3 THREAD_NUM=  0 NUM_THREADS=  4
J=  4 THREAD_NUM=  1 NUM_THREADS=  4
J=  7 THREAD_NUM=  2 NUM_THREADS=  4
J=  8 THREAD_NUM=  2 NUM_THREADS=  4
J=  5 THREAD_NUM=  1 NUM_THREADS=  4
J=  6 THREAD_NUM=  1 NUM_THREADS=  4
J=  9 THREAD_NUM=  3 NUM_THREADS=  4
J= 10 THREAD_NUM=  3 NUM_THREADS=  4
22.

!$OMP PARALLEL PRIVATE(JKGLO,ICEND,IBL,IOFF)
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)
  IBL=(JKGLO-1)/NPROMA+1
  IOFF=JKGLO
  CALL EC_PHYS(NCURRENT_ITER,LFULLIMP,LNHDYN,CDCONF(4:4) &
   & ,IBL,IGL1,IGL2,ICEND,JKGLO,ICEND &
   & ,GPP(1,1,IBL),GEMU(IOFF) &
   & ,GELAM(IOFF),GESLO(IOFF),GECLO(IOFF),GM(IOFF) &
   & ,OROG(IOFF),GNORDL(IOFF),GNORDM(IOFF) &
   & ,GSQM2(IOFF),RCOLON(IOFF),RSILON(IOFF) &
   & ,RINDX(IOFF),RINDY(IOFF),GAW(IOFF) &
   ...
   & ,GPBTP9(1,1,IBL),GPBQP9(1,1,IBL) &
   & ,GPFORCEU(1,1,IBL,0),GPFORCEV(1,1,IBL,0) &
   & ,GPFORCET(1,1,IBL,0),GPFORCEQ(1,1,IBL,0))
ENDDO
!$OMP END DO
!$OMP END PARALLEL

ifs/control/gp_model.F90
23.

!$OMP PARALLEL DO SCHEDULE(STATIC,1) &
!$OMP PRIVATE(JMLOCF,IM,ISTA,IEND)
DO JMLOCF=NPTRMF(MYSETN),NPTRMF(MYSETN+1)-1
  IM=MYMS(JMLOCF)
  ISTA=NSPSTAF(IM)
  IEND=ISTA+2*(NSMAX+1-IM)-1
  CALL SPCSI(CDCONF,IM,ISTA,IEND,LLONEM,ISPEC2V, &
   & ZSPVORG,ZSPDIVG,ZSPTG,ZSPSPG)
ENDDO
!$OMP END PARALLEL DO

ifs/control/spcm.F90
24.

!$OMP PARALLEL PRIVATE(JKGLO,ICEND,IBL,IOFF,ZSLBUF1AUX,JFLD,JROF)
IF (.NOT.ALLOCATED(ZSLBUF1AUX)) ALLOCATE(ZSLBUF1AUX(NPROMA,NFLDSLB1))
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)
  IBL=(JKGLO-1)/NPROMA+1
  IOFF=JKGLO
  ZSLBUF1AUX(:,:)=_ZERO_
  CALL CPG25(CDCONF(4:4) &
   & ,ICEND,JKGLO,NGPBLKS,ZSLBUF1AUX,ZSLBUF2X(1,1,IBL) &
   & ,RCORI(IOFF),GM(IOFF),RATATH(IOFF),RATATX(IOFF) &
   ...
   & ,GT5(1,MSPT5M,IBL))
  ! move data from blocked form to latitude (NASLB1) form
  DO JFLD=1,NFLDSLB1
    DO JROF=JKGLO,MIN(JKGLO-1+NPROMA,NGPTOT)
      ZSLBUF1(NSLCORE(JROF),JFLD)=ZSLBUF1AUX(JROF-JKGLO+1,JFLD)
    ENDDO
  ENDDO
ENDDO
!$OMP END DO
IF (ALLOCATED(ZSLBUF1AUX)) DEALLOCATE(ZSLBUF1AUX)
!$OMP END PARALLEL

ifs/control/gp_model_ad.F90
25. OpenMP parallelisation strategy
- Start with a correct serial execution of the application
- Apply OpenMP directives to time-consuming do loops one at a time and TEST
- Use a high-level approach where possible
- Use a thread checker to perform a correctness check
- Results may change slightly
- Results should be bit reproducible for different numbers of threads
  - Avoid reductions for reals, except max and min (see the sketch after this list)
- Array syntax now supported with WORKSHARE
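
A sketch of why real reductions are discouraged (names illustrative): floating-point addition is not associative, so REDUCTION(+:ZSUM) combines per-thread partial sums in an order that depends on the thread count, and the last bits of the result can differ between runs with 1, 2, 4, ... threads; MAX and MIN are order-independent and therefore safe.

program realred
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real*8 :: a(n), zsum
  do i = 1, n
    a(i) = 1.0d0/dble(i)
  enddo
  zsum = 0.0d0
!$OMP PARALLEL DO PRIVATE(I) REDUCTION(+:ZSUM)
  do i = 1, n
    zsum = zsum + a(i)     ! partial sums combined in a thread-dependent order
  enddo
!$OMP END PARALLEL DO
  write(*,'("sum=",E23.16)') zsum   ! last bits may vary with OMP_NUM_THREADS
end program realred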
26. Performance issues
- Amdahl's Law
- Work imbalance
- False sharing, i.e. cache line ping-pong (see the sketch after this list)
- Memory overheads
- OpenMP overheads
  - Fiona J.L. Reid and J. Mark Bull, HPCx UK: http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0411.pdf
- IBM Redbook, The POWER4 Processor, Introduction and Tuning Guide
  - Chapter 7: parallel programming techniques and performance
  - http://www.redbooks.ibm.com/redbooks/pdfs/sg247041.pdf
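
A sketch of false sharing (sizes and names illustrative, assuming at most 16 threads and the 128-byte cache line of Power4/5): each thread updates only its own element of a small shared array, yet neighbouring 8-byte elements sit in the same cache line, so the line ping-pongs between CPUs; padding each thread's slot to a full line removes the contention.

program fsharing
  implicit none
  integer, parameter :: nrep = 10000000
  real*8 :: counts(16)       ! 16 counters packed into just two cache lines
  real*8 :: padded(16,16)    ! 16 reals = 128 bytes: one cache line per thread
  integer :: ith, irep
!$ integer, external :: OMP_GET_THREAD_NUM
  counts = 0.0d0
  padded = 0.0d0
!$OMP PARALLEL PRIVATE(ITH,IREP)
  ith = 1
!$ ith = OMP_GET_THREAD_NUM() + 1    ! assumes OMP_NUM_THREADS <= 16
  do irep = 1, nrep
    counts(ith)   = counts(ith)   + 1.0d0   ! false sharing: line ping-pong
    padded(1,ith) = padded(1,ith) + 1.0d0   ! padded: each thread owns its line
  enddo
!$OMP END PARALLEL
  write(*,*) counts(1), padded(1,1)
end program fsharing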
27. Synchronisation overheads on p690 (Power4)
28. SCHEDULE overheads on p690 (8 CPUs)
29. Store1: Is this safe?

...
!$OMP PARALLEL DO PRIVATE(J)
DO J=1,N
  CALL WORK(J)
ENDDO
!$OMP END PARALLEL DO
...

SUBROUTINE WORK(K)
USE MYMOD, ONLY : X, A
INTEGER K
X=A(K)
CALL SUB(X)
END
30. Store1: Is this safe?

(same code as slide 29)

NO, X has a Write/Read conflict
31. Store2: Is this safe?

...
!$OMP PARALLEL DO PRIVATE(J)
DO J=1,N
  CALL WORK(J)
ENDDO
!$OMP END PARALLEL DO
...

SUBROUTINE WORK(K)
USE MYMOD, ONLY : X, A
INTEGER K
A(K)=X
CALL SUB(X)
END
32. Store2: Is this safe?

(same code as slide 31)

YES, different locations in A are being written to
33.
- It is illegal to store to the same location (scalar or array) from multiple threads of a parallel region
- A safe strategy would be to
  - only read module data
  - only write to parallel-region private data and subroutine local data within the parallel region (see the sketch after this list)
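
Applying that strategy to the Store1 example makes it safe; in this sketch (the type of X is assumed), X becomes a subroutine local, so each thread gets its own copy, and module data is only read:

SUBROUTINE WORK(K)
USE MYMOD, ONLY : A      ! module data: read only
INTEGER K
REAL*8 X                 ! subroutine local: private to the calling thread
X=A(K)                   ! safe: shared read, private write
CALL SUB(X)
END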
34. Reduction example 1

...
!$OMP PARALLEL DO PRIVATE(J)
DO J=1,N
  CALL WORK(J)
ENDDO
!$OMP END PARALLEL DO
...

SUBROUTINE WORK(K)
USE MYMOD, ONLY : M
...
IVAL = ...   ! time consuming calc
!$OMP CRITICAL
M=M+IVAL
!$OMP END CRITICAL
...
END
35. Reduction example 2

USE MYMOD, ONLY : M
...
IVAL=0
!$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:IVAL)
DO J=1,N
  CALL WORK(J,IVAL)
ENDDO
!$OMP END PARALLEL DO
M=M+IVAL
...

SUBROUTINE WORK(K,KVAL)
...
IVAL = ...   ! time consuming calc
KVAL=KVAL+IVAL
...
END
36. Reduction example 3

USE MYMOD, ONLY : M
INTEGER IVAL(N)
...
!$OMP PARALLEL DO PRIVATE(J)
DO J=1,N
  CALL WORK(J,IVAL(J))
ENDDO
!$OMP END PARALLEL DO
M=M+SUM(IVAL)
...

SUBROUTINE WORK(K,KVAL)
...
KVAL = ...   ! time consuming calc
...
END
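
Comparing the three approaches: example 1 is simplest but serialises the update at the CRITICAL section; example 2 lets the runtime combine per-thread partial sums cheaply, though for real (rather than integer) values the combination order, and hence the last bits, can vary with the thread count; example 3 costs an extra array, but the final SUM is performed in a fixed order, so the result does not depend on the number of threads.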
37. Intel Thread Checker
- Thread Checker
  - analyses the correctness of an OpenMP application
  - can identify any place where a construct exists that may violate the single-threaded consistency property
  - simulates multiple threads (uses 1)
  - > 100 times slower than real time
- No equivalent product from IBM
- Technology similar to KAI's Assure/Guide products
  - Intel bought KAI a few years ago
- http://developer.intel.com/software/products/threading
40. Stack issues (IBM)
- Master thread stack inherits the task (process) stack
  - Default is 4 GBytes
  - To go beyond 4 GBytes, link with -bmaxstack=0x800000000
- Non-master thread stacks
  - Default is only 4 MBytes
  - Use XLSMPOPTS="stack=nnnnnn"
    - nnnnnn is in bytes
    - maximum used to be 256 MBytes, i.e. 268435456 (now it is up to the limit imposed by system resources in 64-bit mode)
- Large arrays (>1 MByte?) should use the heap (see the sketch below)
  - Real, allocatable :: biggy(:)
  - Allocate(biggy(100000000)) ... Deallocate(biggy)
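
A sketch (size illustrative) of the heap recommendation: an ALLOCATABLE local array lives on the heap rather than the thread stack, so it is not limited by XLSMPOPTS stack sizes even when the subroutine is called inside a parallel region.

subroutine bigwork
  implicit none
  real*8, allocatable :: biggy(:)   ! heap storage, not thread stack
  allocate(biggy(100000000))        ! ~800 MBytes: would overflow a 4 MByte thread stack
  biggy(:) = 0.0d0
  ! ... use biggy ...
  deallocate(biggy)
end subroutine bigwork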
41. And now the exercises