Introduction to OpenMP
Transcript and Presenter's Notes
1
Introduction to OpenMP
George Mozdzynski, February 2009
2
Outline
  • What is OpenMP?
  • Why OpenMP?
  • OpenMP standard
  • Key directives
  • Using OpenMP on IBM
  • OpenMP parallelisation strategy
  • Performance issues
  • Automatic checking of OpenMP applications
  • Class exercises

3
What is OpenMP?
  • OpenMP is a directive-based language for expressing parallelism on Shared Memory Multiprocessor systems (see the sketch below)
  • F77, F90, C, C++ supported
  • www.openmp.org for info, specifications, FAQ, mail group
  • xlf90 supports:
  • OpenMP version 2.0 (xlf90 v9)
  • OpenMP version 2.5 (xlf90 v10)
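A minimal sketch (illustrative, not from the course) of what "directive based" means: the !$OMP lines are comments to an ordinary compiler and directives when OpenMP compilation is enabled (e.g. xlf90_r -qsmp=omp), so the same source runs serially or in parallel.

      program hello
      implicit none
      integer :: ith
      integer, external :: OMP_GET_THREAD_NUM
      ith = 0                            ! value used by a serial compile
!$OMP PARALLEL PRIVATE(ITH)
!$    ith = OMP_GET_THREAD_NUM()         ! only compiled when OpenMP is enabled
      write(0,'("hello from thread ",i3)') ith
!$OMP END PARALLEL
      end program hello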

4
Why OpenMP?
  • Most new supercomputers use shared memory multiprocessor technology
  • MPI is considered difficult to program
  • SMPs already had vendor-specific directives for parallelisation
  • OpenMP is a standard (just like MPI)
  • OpenMP is relatively easy to use
  • OpenMP can work together with MPI
  • MPI/OpenMP vs. MPI-only performance

5
IFS T799L91 Forecast Model (64 nodes, Power5, 2048 user threads)
6
OpenMP directives
!$OMP PARALLEL DO PRIVATE(J,K, &
!$OMP&             L,M)
      DO J=1,N
        CALL SUB(J,...)
      ENDDO
!$OMP END PARALLEL DO

!$    ITHR=OMP_GET_MAX_THREADS()

See p7-9
Conditional compilation (the !$ sentinel), e.g. ifsaux/module/yomoml.F90
OpenMP parallel programs are valid serial programs
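A sketch of how the !$ sentinel works (hypothetical code, not the contents of yomoml.F90): lines starting with !$ are plain comments to a serial compiler but become executable statements under -qsmp=omp, which is why an OpenMP program remains a valid serial program.

      INTEGER ITHR
!$    INTEGER OMP_GET_MAX_THREADS
      ITHR = 1                           ! serial default
!$    ITHR = OMP_GET_MAX_THREADS()       ! only compiled under -qsmp=omp
      WRITE(0,'("up to ",I3," threads")') ITHR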
7
IBM Supercomputer
8
#!/bin/ksh
# @ shell          = /bin/ksh
# @ job_type       = parallel
# @ class          = np
# @ output         = omptest.out
# @ error          = omptest.out
# @ node           = 1
# @ resources      = ConsumableCpus(16) ConsumableMemory(6000mb)
# @ total_tasks    = 1
# @ core_limit     = 4096
# @ notification   = never
# @ node_usage     = not_shared    #### for accurate timing
# @ ec_smt         = no            #### for accurate timing
# @ queue

cd $HOME/OpenMP_Course
export XLSMPOPTS="stack=200000000"
LDRFLAGS="-q64 -bmaxstack=0x8000000000"
xlf90_r $LDRFLAGS -O3 -qstrict -qsmp=omp -o omptest omptest1.F
for omp in 1 2 4 8 16
do
  echo "Using $omp threads"
  export OMP_NUM_THREADS=$omp
  ./omptest
done
9
      program omptest1
      parameter(k=1000000)
      real*8 a(k),b(k)
      a(:)=0.
      b(:)=0.
      write(0,'("start timing")')
      zbeg=timef()*1.0e-3
      do irep=1,1000
        call work(k,a,b)
      enddo
      zend=timef()*1.0e-3
      write(0,'("end timing")')
      write(*,'("time=",F10.2)') zend-zbeg
      stop
      end

      subroutine work(k,a,b)
      real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
      do i=1,k
        a(i)=b(i)
      enddo
!$OMP END PARALLEL DO
      return
      end

Threads   time (s)   speedup
   1       1.35       1.00
   2       0.68       1.99
   4       0.34       3.97
   8       0.37       3.65
  16       0.29       4.66
10
      program omptest1
      parameter(k=10000000)
      real*8 a(k),b(k)
      a(:)=0.
      b(:)=0.
      write(0,'("start timing")')
      zbeg=timef()*1.0e-3
      do irep=1,200
        call work(k,a,b)
      enddo
      zend=timef()*1.0e-3
      write(0,'("end timing")')
      write(*,'("time=",F10.2)') zend-zbeg
      stop
      end

      subroutine work(k,a,b)
      real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
      do i=1,k
        a(i)=b(i)
      enddo
!$OMP END PARALLEL DO
      return
      end

Threads   time (s)   speedup
   1       4.27       1.00
   2       4.25       1.00
   4       2.37       1.80
   8       0.78       5.47
  16       0.48       8.90
11
      program omptest1
      parameter(k=10000000)
      real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
      do i=1,k
        a(i)=0.
        b(i)=0.
      enddo
!$OMP END PARALLEL DO
      write(0,'("start timing")')
      zbeg=timef()*1.0e-3
      do irep=1,200
        call work(k,a,b)
      enddo
      zend=timef()*1.0e-3
      write(0,'("end timing")')
      write(*,'("time=",F10.2)') zend-zbeg
      stop
      end

      subroutine work(k,a,b)
      real*8 a(k),b(k)
!$OMP PARALLEL DO PRIVATE(I)
      do i=1,k
        a(i)=b(i)
      enddo
!$OMP END PARALLEL DO
      return
      end

Threads   time (s)   speedup
   1       4.27       1.00
   2       2.12       2.01
   4       0.85       5.02
   8       0.69       6.19
  16       0.38      11.24

Initialising a and b inside a parallel loop places each memory page close to the thread that later uses it (first touch), so the timed loop now scales much better than on the previous slide.
12
K=10M, IREP=200, a(i)=b(i)+cos(b(i))

Threads   time (s)   speedup
   1      58.61       1.00
   2      29.27       2.00
   4      14.64       4.00
   8       7.31       8.02
  16       3.97      14.76

OpenMP performance tip: for perfect speedup, avoid using memory ☺
(with the cos the loop is compute-bound rather than memory-bandwidth-bound, so it scales)
13
Key directives: Parallel Region
  • !$OMP PARALLEL clause[,clause...]
  •   block
  • !$OMP END PARALLEL
  • Where clause can be:
  • PRIVATE(list)
  • etc. (example below)

See p12-14
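A sketch of a parallel region in use (illustrative, not from the course notes): every thread of the team executes the block once, and each thread gets its own copy of anything listed in PRIVATE.

      INTEGER ITH
      INTEGER OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(ITH)
      ITH = OMP_GET_THREAD_NUM()                 ! one private copy per thread
      WRITE(0,'("thread ",I3," executes the block")') ITH
!$OMP END PARALLEL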
14
Key directives: Work-sharing constructs/1
  • !$OMP DO clause[,clause...]
  •   do_loop
  • !$OMP END DO
  • Where clause can be:
  • PRIVATE(list)
  • SCHEDULE(type,chunk)
  • etc. (example below)

See p15-18
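For example (a sketch with assumed array names), an !$OMP DO inside an existing parallel region shares the loop iterations among the team, and SCHEDULE controls how they are dealt out:

!$OMP PARALLEL PRIVATE(J)
!$OMP DO SCHEDULE(DYNAMIC,4)       ! chunks of 4 iterations handed out on demand
      DO J=1,N
        A(J)=B(J)+C(J)
      ENDDO
!$OMP END DO
!$OMP END PARALLEL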
15
Key directives: Work-sharing constructs/2
!$OMP WORKSHARE
  block
!$OMP END WORKSHARE
No PRIVATE or SCHEDULE options.
A good candidate for block would be array assignment statements (i.e. no DO loop); see the sketch below.
See p20-22
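A sketch of WORKSHARE over array assignments (assumed array names): the work of each statement is divided among the threads of the enclosing parallel region.

!$OMP PARALLEL
!$OMP WORKSHARE
      A(:)=B(:)+C(:)               ! array syntax, no explicit DO loop
      D(:)=2.0*A(:)
!$OMP END WORKSHARE
!$OMP END PARALLEL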
16
Key directives: combined parallel work-sharing/1
  • !$OMP PARALLEL DO clause[,clause...]
  •   do_loop
  • !$OMP END PARALLEL DO
  • Where clause can be:
  • PRIVATE(list)
  • SCHEDULE(type,chunk)
  • etc.

See p23
17
Key directives: combined parallel work-sharing/2
  • !$OMP PARALLEL WORKSHARE clause[,clause...]
  •   block
  • !$OMP END PARALLEL WORKSHARE
  • Where clause can be:
  • PRIVATE(list)
  • etc.
See p24-25
18
K=10M, IREP=200

!$OMP PARALLEL DO PRIVATE(I)         |  !$OMP PARALLEL WORKSHARE
      DO I=1,10000000                |        A(:)=B(:)+COS(C(:))
        A(I)=B(I)+COS(C(I))          |  !$OMP END PARALLEL WORKSHARE
      ENDDO                          |
!$OMP END PARALLEL DO                |

          PARALLEL DO          PARALLEL WORKSHARE
Threads   time (s)  speedup    time (s)  speedup
   1      50.71     1.00       51.32     1.00
   2      25.36     2.00       25.56     2.01
   3      16.84     3.01       16.85     3.05
   4      12.75     3.98       12.94     3.97
   5      10.14     5.00       10.46     4.91
   6       8.70     5.83        8.78     5.85
   7       7.75     6.54        7.28     7.05
   8       6.96     7.29        6.43     7.98
19
(No Transcript)
20
(No Transcript)
21
      program omptest2
      INTEGER OMP_GET_THREAD_NUM
      INTEGER OMP_GET_NUM_THREADS
      INTEGER OMP_GET_MAX_THREADS
!$    WRITE(0,'("NUM_THREADS=",I3)') OMP_GET_NUM_THREADS()
!$    WRITE(0,'("MAX_THREADS=",I3)') OMP_GET_MAX_THREADS()
!$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(J,ITH,ITHS)
      DO J=1,10
!$      ITH=OMP_GET_THREAD_NUM()
!$      ITHS=OMP_GET_NUM_THREADS()
!$      WRITE(0,'("J=",I3," THREAD_NUM=",I3," NUM_THREADS=",I3)') J,ITH,ITHS
      ENDDO
!$OMP END PARALLEL DO
      stop
      end

export OMP_NUM_THREADS=4
./omptest

NUM_THREADS=  1
MAX_THREADS=  4
J=  1 THREAD_NUM=  0 NUM_THREADS=  4
J=  2 THREAD_NUM=  0 NUM_THREADS=  4
J=  3 THREAD_NUM=  0 NUM_THREADS=  4
J=  4 THREAD_NUM=  1 NUM_THREADS=  4
J=  7 THREAD_NUM=  2 NUM_THREADS=  4
J=  8 THREAD_NUM=  2 NUM_THREADS=  4
J=  5 THREAD_NUM=  1 NUM_THREADS=  4
J=  6 THREAD_NUM=  1 NUM_THREADS=  4
J=  9 THREAD_NUM=  3 NUM_THREADS=  4
J= 10 THREAD_NUM=  3 NUM_THREADS=  4
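Note that outside a parallel region OMP_GET_NUM_THREADS() returns 1, and that SCHEDULE(STATIC) gives each thread one contiguous block of iterations (thread 0 gets J=1-3, thread 1 gets J=4-6, and so on). For comparison, a chunked static schedule (an assumed variation, sketched here) deals iterations out round-robin:

!$OMP PARALLEL DO SCHEDULE(STATIC,1) PRIVATE(J,ITH,ITHS)
      DO J=1,10
        ...        ! with 4 threads: thread 0 gets J=1,5,9; thread 1 gets J=2,6,10; etc.
      ENDDO
!$OMP END PARALLEL DO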
22
!$OMP PARALLEL PRIVATE(JKGLO,ICEND,IBL,IOFF)
!$OMP DO SCHEDULE(DYNAMIC,1)
      DO JKGLO=1,NGPTOT,NPROMA
        ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)
        IBL=(JKGLO-1)/NPROMA+1
        IOFF=JKGLO
        CALL EC_PHYS(NCURRENT_ITER,LFULLIMP,LNHDYN,CDCONF(4:4) &
         & ,IBL,IGL1,IGL2,ICEND,JKGLO,ICEND &
         & ,GPP(1,1,IBL),GEMU(IOFF) &
         & ,GELAM(IOFF),GESLO(IOFF),GECLO(IOFF),GM(IOFF) &
         & ,OROG(IOFF),GNORDL(IOFF),GNORDM(IOFF) &
         & ,GSQM2(IOFF),RCOLON(IOFF),RSILON(IOFF) &
         & ,RINDX(IOFF),RINDY(IOFF),GAW(IOFF) &
         ...
         & ,GPBTP9(1,1,IBL),GPBQP9(1,1,IBL) &
         & ,GPFORCEU(1,1,IBL,0),GPFORCEV(1,1,IBL,0) &
         & ,GPFORCET(1,1,IBL,0),GPFORCEQ(1,1,IBL,0))
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

ifs/control/gp_model.F90
23
!$OMP PARALLEL DO SCHEDULE(STATIC,1) &
!$OMP& PRIVATE(JMLOCF,IM,ISTA,IEND)
      DO JMLOCF=NPTRMF(MYSETN),NPTRMF(MYSETN+1)-1
        IM=MYMS(JMLOCF)
        ISTA=NSPSTAF(IM)
        IEND=ISTA+2*(NSMAX+1-IM)-1
        CALL SPCSI(CDCONF,IM,ISTA,IEND,LLONEM,ISPEC2V, &
         & ZSPVORG,ZSPDIVG,ZSPTG,ZSPSPG)
      ENDDO
!$OMP END PARALLEL DO

ifs/control/spcm.F90
24
!$OMP PARALLEL PRIVATE(JKGLO,ICEND,IBL,IOFF,ZSLBUF1AUX,JFLD,JROF)
      IF (.NOT.ALLOCATED(ZSLBUF1AUX)) ALLOCATE(ZSLBUF1AUX(NPROMA,NFLDSLB1))
!$OMP DO SCHEDULE(DYNAMIC,1)
      DO JKGLO=1,NGPTOT,NPROMA
        ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)
        IBL=(JKGLO-1)/NPROMA+1
        IOFF=JKGLO
        ZSLBUF1AUX(:,:)=_ZERO_
        CALL CPG25(CDCONF(4:4) &
         & ,ICEND,JKGLO,NGPBLKS,ZSLBUF1AUX,ZSLBUF2X(1,1,IBL) &
         & ,RCORI(IOFF),GM(IOFF),RATATH(IOFF),RATATX(IOFF) &
         ...
         & ,GT5(1,MSPT5M,IBL))
        ! move data from blocked form to latitude (NASLB1) form
        DO JFLD=1,NFLDSLB1
          DO JROF=JKGLO,MIN(JKGLO-1+NPROMA,NGPTOT)
            ZSLBUF1(NSLCORE(JROF),JFLD)=ZSLBUF1AUX(JROF-JKGLO+1,JFLD)
          ENDDO
        ENDDO
      ENDDO
!$OMP END DO
      IF (ALLOCATED(ZSLBUF1AUX)) DEALLOCATE(ZSLBUF1AUX)
!$OMP END PARALLEL

ifs/control/gp_model_ad.F90
25
OpenMP parallelisation strategy
  • Start with a correct serial execution of the application
  • Apply OpenMP directives to time-consuming DO loops one at a time, and TEST
  • Use a high-level approach where possible
  • Use a thread checker to perform a correctness check
  • Results may change slightly
  • Results should be bit reproducible for different numbers of threads
  • Avoid reductions for reals (except max, min): real addition is not associative, so the summation order, and therefore the result, would vary with the number of threads
  • Array syntax is now supported with WORKSHARE

26
Performance issues
  • Amdahl's Law
  • Work imbalance
  • False sharing (cache line ping-pong; sketch below)
  • Memory overheads
  • OpenMP overheads
  • Fiona J.L. Reid and J. Mark Bull, HPCx UK:
    http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0411.pdf
  • IBM Redbook, The Power4 Processor, Introduction and Tuning Guide,
    Chapter 7: parallel programming techniques and performance:
    http://www.redbooks.ibm.com/redbooks/pdfs/sg247041.pdf
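To illustrate false sharing (a hypothetical sketch, not from the course): each thread updates its own element of a shared array, so the code is correct, but neighbouring elements share a cache line and every update forces that line to ping-pong between processors.

      REAL*8 ZSUM(8)                   ! 8 doubles: likely within one cache line
      INTEGER I, ITH
      INTEGER OMP_GET_THREAD_NUM
      ZSUM(:)=0.0D0
!$OMP PARALLEL PRIVATE(I,ITH)
      ITH=OMP_GET_THREAD_NUM()+1       ! assumes at most 8 threads
      DO I=1,1000000
        ZSUM(ITH)=ZSUM(ITH)+1.0D0      ! correct, but the cache line ping-pongs
      ENDDO
!$OMP END PARALLEL
! one fix: pad to one element per cache line, e.g. ZSUM(16,NTHREADS), update ZSUM(1,ITH)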

27
Synchronisation overheads on p690 (Power4)
28
SCHEDULE overheads on P690 (8 CPUs)
29
Store1: Is this safe?
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J)
      ENDDO
!$OMP END PARALLEL DO
      ...
      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : X, A
      INTEGER K
      X=A(K)
      CALL SUB(X)
      END
30
Store1: Is this safe?
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J)
      ENDDO
!$OMP END PARALLEL DO
      ...
      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : X, A
      INTEGER K
      X=A(K)
      CALL SUB(X)
      END

NO, X has a Write/Read conflict: every thread writes the shared module variable X (a safe rewrite follows)
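A safe rewrite of Store1 (a sketch, assuming X is only a scratch variable of type REAL*8): make X local to WORK so each thread gets its own copy on its stack, and only read the module data.

      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : A       ! module data is now only read
      INTEGER K
      REAL*8 X                  ! local variable: one copy per calling thread
      X=A(K)
      CALL SUB(X)
      END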
31
Store2: Is this safe?
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J)
      ENDDO
!$OMP END PARALLEL DO
      ...
      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : X, A
      INTEGER K
      A(K)=X
      CALL SUB(X)
      END
32
Store2: Is this safe?
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J)
      ENDDO
!$OMP END PARALLEL DO
      ...
      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : X, A
      INTEGER K
      A(K)=X
      CALL SUB(X)
      END

YES, different locations in A are being written to (each thread has a distinct K)
33
  • It is illegal (a data race: the result is undefined) to store to the same location (scalar or array element) from multiple threads of a parallel region
  • A safe strategy would be to:
  • only read module data
  • only write to parallel region private data and subroutine local data within the parallel region

34
Reduction example 1
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J)
      ENDDO
!$OMP END PARALLEL DO
      ...
      SUBROUTINE WORK(K)
      USE MYMOD, ONLY : M
      ...
      IVAL = ...             ! time consuming calc
!$OMP CRITICAL
      M=M+IVAL
!$OMP END CRITICAL
      ...
      END
35
Reduction example 2
      USE MYMOD, ONLY : M
      ...
      IVAL=0
!$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:IVAL)
      DO J=1,N
        CALL WORK(J,IVAL)
      ENDDO
!$OMP END PARALLEL DO
      M=M+IVAL
      ...
      SUBROUTINE WORK(K,KVAL)
      ...
      IVAL = ...             ! time consuming calc
      KVAL = KVAL + IVAL
      ...
      END
36
Reduction example 3
      USE MYMOD, ONLY : M
      INTEGER IVAL(N)
      ...
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,N
        CALL WORK(J,IVAL(J))
      ENDDO
!$OMP END PARALLEL DO
      M=M+SUM(IVAL)
      ...
      SUBROUTINE WORK(K,KVAL)
      ...
      KVAL = ...             ! time consuming calc
      ...
      END
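A self-contained sketch of example 2's REDUCTION clause (hypothetical program, not from the course): each thread accumulates a private copy of IVAL, and the copies are combined when the loop ends, so the integer result is identical for any number of threads.

      PROGRAM REDTEST
      IMPLICIT NONE
      INTEGER J, IVAL
      IVAL=0
!$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:IVAL)
      DO J=1,1000
        IVAL=IVAL+J                    ! private per-thread partial sums
      ENDDO
!$OMP END PARALLEL DO
      WRITE(*,'("sum = ",I8)') IVAL    ! 500500 for any thread count
      END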
37
Intel Thread Checker
  • Thread Checker
  • analyses the correctness of an OpenMP application
  • can identify any place where a construct exists that may violate the single-threaded consistency property
  • simulates multiple threads (but uses only 1)
  • > 100 times slower than real time
  • No equivalent product from IBM
  • Technology similar to KAI's Assure/Guide products
  • Intel bought KAI a few years ago
  • http://developer.intel.com/software/products/threading

38
(No Transcript)
39
(No Transcript)
40
Stack issues (IBM)
  • Master thread stack inherits the task (process) stack
  • Default is 4 Gbytes
  • To go beyond 4 Gbytes, link with -bmaxstack=0x800000000
  • Non-master thread stacks
  • Default is only 4 Mbytes
  • Use XLSMPOPTS="stack=nnnnnn"
  • nnnnnn is in bytes
  • maximum used to be 256 Mbytes (268435456); now it is up to the limit imposed by system resources for 64-bit mode
  • Large arrays (> 1 Mbyte?) should use the heap, as sketched below:
  • REAL, ALLOCATABLE :: biggy(:)
  • ALLOCATE(biggy(100000000)) ... DEALLOCATE(biggy)
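A sketch of that recommended pattern (illustrative): the large array lives on the heap, so neither the master stack limit nor the XLSMPOPTS stack size applies to it.

      REAL*8, ALLOCATABLE :: BIGGY(:)
      INTEGER J
      ALLOCATE(BIGGY(100000000))       ! heap allocation, not a thread stack
!$OMP PARALLEL DO PRIVATE(J)
      DO J=1,100000000
        BIGGY(J)=0.0D0
      ENDDO
!$OMP END PARALLEL DO
      DEALLOCATE(BIGGY)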

41
And now the exercises