Title: Parallel Programming on the SGI Origin2000

Slide 1: Parallel Programming on the SGI Origin2000
Taub Computer Center, Technion
Anne Weill-Zrahia
With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI)
Mar 2005
Slide 2: Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
Slide 3: 4) Parallel Programming - OpenMP
Slide 4: Is this your joint bank account?

Initial amount: IL500

  Limor in Haifa:       Read IL500, take IL150 (write IL350)
  Shimon in Tel Aviv:   Read IL500, take IL400 (write IL100)

Final amount: IL350 or IL100, depending on which write lands last.
Both reads saw the initial IL500, so one of the withdrawals is lost.
Slide 5: Introduction

- Parallelization is an instruction to the compiler:
  f77 -o prog -mp prog.f
  or: f77 -o prog -pfa prog.f
- Now try to understand what a compiler has to determine when deciding how to parallelize.
- Note that when we talk loosely about "parallelization", what is meant is: is the program, as presented here, parallelizable?
- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see.
Slide 6: Data dependency types

1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):

      do i = 2, n
         a(i) = a(i-1)          <-- cannot be parallelized
      enddo

2) Data dependence within a single iteration (non-loop-carried dependence):

      do i = 2, n
         c = . . .
         a(i) = . . . c . . .   <-- parallelizable
      enddo

3) Reduction:

      do i = 1, n
         s = s + x              <-- parallelizable
      enddo

All data dependencies in programs are variations on these fundamental types.
Slide 7: Data dependency analysis

Question: Are the following loops parallelizable?

      do i = 2, n
         a(i) = b(i-1)
      enddo

YES!

      do i = 2, n
         a(i) = a(i-1)
      enddo

NO!

Why?
Slide 8: Data dependency analysis

      do i = 2, n
         a(i) = b(i-1)
      enddo

YES!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=B(1)    A(3)=B(2)    A(4)=B(3)
  cycle 2    A(5)=B(4)    A(6)=B(5)    A(7)=B(6)
Slide 9: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

Scalar (non-parallel) run:

             CPU1
  cycle 1    A(2)=A(1)
  cycle 2    A(3)=A(2)
  cycle 3    A(4)=A(3)
  cycle 4    A(5)=A(4)

In each cycle, NEW data from the previous cycle is read.
Slide 10: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

No!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=A(1)    A(3)=A(2)    A(4)=A(3)

Will probably read OLD data.
Slide 11: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

No!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=A(1)    A(3)=A(2)    A(4)=A(3)    <-- will probably read OLD data
  cycle 2    A(5)=A(4)    A(6)=A(5)    A(7)=A(6)    <-- may read NEW data
Slide 12: Data dependency analysis

Another question: Are the following loops parallelizable?

      do i = 3, n, 2
         a(i) = a(i-1)
      enddo

YES!

      do i = 1, n
         s = s + a(i)
      enddo

Depends!
Slide 13: Data dependency analysis

      do i = 3, n, 2
         a(i) = a(i-1)
      enddo

YES!

             CPU1         CPU2           CPU3
  cycle 1    A(3)=A(2)    A(5)=A(4)      A(7)=A(6)
  cycle 2    A(9)=A(8)    A(11)=A(10)    A(13)=A(12)
Slide 14: Data dependency analysis

      do i = 1, n
         s = s + a(i)
      enddo

Depends!

             CPU1          CPU2          CPU3
  cycle 1    S=S+A(1)      S=S+A(2)      S=S+A(3)
  cycle 2    S=S+A(4)      S=S+A(5)      S=S+A(6)

- The value of S will be undetermined, and typically it will vary from one run to the next.
- This bug in parallel programming is called a race condition.
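OpenMP's answer to this race (covered in detail later in the deck) is the reduction clause, which gives each thread a private partial sum and combines the copies at the end. A minimal compilable sketch; the array contents here are purely illustrative:

      program redsum
      integer i, n
      parameter (n = 1000)
      real a(n), s
      do i = 1, n
         a(i) = 1.0
      enddo
      s = 0.0
c     each thread accumulates a private copy of s; the copies are
c     summed into the global s at the end of the loop
c$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + a(i)
      enddo
      print *, 's =', s
      end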
Slide 15: Data dependency analysis

What is the principle involved here? The examples shown fall into two categories:

1) Data being read is independent of data that is written:
   a(i) = b(i-1)    i = 2, 3, 4, . . .
   a(i) = a(i-1)    i = 3, 5, 7, . . .

2) Data being read depends on data that is written:
   a(i) = a(i-1)    i = 2, 3, 4, . . .
   s = s + a(i)     i = 1, 2, 3, . . .
Slide 16: Data dependency analysis

Here is a typical situation. Is there a data dependency in the following loop?

      do i = 1, n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo

No! Clearly, result is a temporary variable that is reassigned in every iteration.

Note: result must be a private variable (this will be discussed later; see the sketch below).
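A hedged sketch of that point, using the private clause introduced later in the deck. The array sizes, contents, and the "+" operators in the loop body are assumptions for illustration:

      program privdemo
      integer i, n
      parameter (n = 8)
      real a(n), b(n), c(n), x(n), result
      do i = 1, n
         x(i) = real(i)
         b(i) = 1.0
         c(i) = 2.0
      enddo
c     private(result) gives each thread its own temporary, so
c     iterations cannot interfere through result
c$omp parallel do private(result)
      do i = 1, n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo
      print *, c(1), c(n)
      end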
Slide 17: Data dependency analysis

Here is a (slightly different) typical situation. Is there a data dependency in the following loop?

      do i = 1, n
         a(i) = sin(result)
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo

Yes! The value of result is carried over from one iteration to the next. This is the classical read/write situation, but now it is somewhat hidden.
Slide 18: Data dependency analysis

The loop could (symbolically) be rewritten:

      do i = 1, n
         a(i) = sin(result(i-1))
         result(i) = a(i) + b(i)
         c(i) = result(i) + c(i)
      enddo

Now substitute the expression for a(i):

      do i = 1, n
         a(i) = sin(result(i-1))
         result(i) = sin(result(i-1)) + b(i)
         c(i) = result(i) + c(i)
      enddo

This is really of the type a(i) = a(i-1)!
Slide 19: Data dependency analysis

One more: Can the following loop be parallelized?

      do i = 3, n
         a(i) = a(i-2)
      enddo

If this is parallelized, there will probably be different answers from one run to another. Why?
Slide 20: Data dependency analysis

      do i = 3, n
         a(i) = a(i-2)
      enddo

With 2 CPUs, this looks like it will be safe:

             CPU1         CPU2
  cycle 1    A(3)=A(1)    A(4)=A(2)
  cycle 2    A(5)=A(3)    A(6)=A(4)
Slide 21: Data dependency analysis

      do i = 3, n
         a(i) = a(i-2)
      enddo

HOWEVER: what if there are 3 CPUs and not 2?

             CPU1         CPU2         CPU3
  cycle 1    A(3)=A(1)    A(4)=A(2)    A(5)=A(3)

In this case, a(3) is read and written in two threads at once.
Slide 22: RISC memory levels (single CPU)
[Diagram: CPU, Cache, Main memory]

Slide 23: RISC memory levels (single CPU)
[Diagram: CPU, Cache, Main memory]

Slide 24: RISC memory levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]

Slide 25: RISC memory levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]

Slide 26: RISC Memory Levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]
Slide 27: Definition of OpenMP

- Application Program Interface (API) for shared-memory parallel programming
- Directive-based approach with library support
- Targets existing applications and widely used languages
- Fortran API first released October 1997
- C/C++ API first released October 1998
- Multi-vendor/platform support
Slide 28: Why was OpenMP developed?

- Parallel programming before OpenMP:
  - Standards existed for distributed memory (MPI and PVM)
  - No standard for shared-memory programming
  - Vendors had different directive-based APIs for SMP: SGI, Cray, Kuck & Assoc, DEC
    - Vendor proprietary: similar, but not the same
    - Most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory
  - This sacrifices built-in SMP hardware benefits
  - Requires major effort
Slide 29: The Spread of OpenMP

Organization: Architecture Review Board
Web site: www.openmp.org
Hardware: HP/DEC, IBM, Intel, SGI, Sun
Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft
Slide 30: OpenMP interface model

Directives and pragmas:
- Control structures
- Work sharing
- Data scope attributes: private, firstprivate, lastprivate, shared, reduction

Runtime library routines:
- Control and query: number of threads, nested parallelism?, throughput mode
- Lock API

Environment variables:
- Runtime environment: schedule type, max number of threads, nested parallelism, throughput mode
Slide 31: OpenMP execution model

An OpenMP program starts in a single thread (sequential mode). To create additional threads, the user opens a parallel region: additional slave threads are launched, and the master thread is part of the team. The threads disappear at the end of the parallel region, and the program returns to a single-threaded run. This model is repeated as needed.

[Diagram: a master thread forks successive parallel regions of 4, 2, and 3 threads, rejoining the master thread after each]
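A minimal sketch of this fork-join model (the runtime routines used here are introduced later in the deck; the thread count is assumed to be set outside the program):

      program forkjoin
      integer omp_get_thread_num, omp_get_num_threads
      print *, 'sequential part: master thread only'
c$omp parallel
c     this block is executed by every thread in the team
      print *, 'hello from thread', omp_get_thread_num(),
     &         ' of', omp_get_num_threads()
c$omp end parallel
      print *, 'back to the master thread'
      end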
Slide 32: Creating parallel threads

Fortran:

c$omp parallel [clause[,clause]...]
      code to run in parallel
c$omp end parallel

C/C++:

#pragma omp parallel [clause[,clause]...]
      code to run in parallel

Replicated execution:

      i = 0
c$omp parallel
      call foo(i, a, b)
c$omp end parallel
      print *, i

[Diagram: each thread in the team executes foo; execution rejoins at the print]

The number of threads is set by a library call or an environment variable.
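The skeleton above, filled out as a compilable sketch (foo is a stand-in subroutine, as on the slide; a and b are unused placeholders):

      program replic
      integer i
      real a, b
      i = 0
c$omp parallel
      call foo(i, a, b)
c$omp end parallel
      print *, i
      end

      subroutine foo(i, a, b)
      integer i
      real a, b
c     executed once by every thread in the team
      print *, 'foo sees i =', i
      end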
Slides 33-35: (no transcript)
Slide 36: OpenMP on the Origin 2000 - Fortran

Switches and formats:

f77 -mp

c$omp parallel do
c$omp&shared(a,b,c)

OR:

c$omp parallel do shared(a,b,c)

Conditional compilation:

c$    iam = omp_get_thread_num() + 1
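A sketch of conditional compilation in context: compiled with f77 -mp, the c$ line is an active statement; without -mp it is treated as an ordinary comment (the variable iam is illustrative):

      program condcomp
      integer iam, omp_get_thread_num
      iam = 0
c     active only under -mp; otherwise iam stays 0
c$    iam = omp_get_thread_num() + 1
      print *, 'iam =', iam
      end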
Slide 37: OpenMP on the Origin 2000 - C

Switches and formats:

cc -mp

#pragma omp parallel for \
    shared(a,b,c)

OR:

#pragma omp parallel for shared(a,b,c)
Slide 38: OpenMP on the Origin 2000

Parallel Do directive:

c$omp parallel do private(i)
      do i = 1, n
         a(i) = i + 1
      enddo
c$omp end parallel do        <-- optional

Topics: clauses, detailed construct
Slide 39: OpenMP on the Origin 2000

Parallel Do directive - clauses:

shared
private
default(private | shared | none)
firstprivate
lastprivate
reduction(operator|intrinsic : var)
schedule(type[,chunk])
if(scalar_logical_expression)
ordered
copyin(var)
Slide 40: Allocating private and shared variables

S = shared variable, P = private variable

[Diagram: S exists in the single-thread regions before and after the parallel region; inside the parallel region all threads refer to the same S, while each thread gets its own copy of P]
Slide 41: Clauses in OpenMP - 1

Clauses for the parallel directive specify data association rules and conditional computation:

shared(list) - data accessible by all threads, which all refer to the same storage
private(list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region
default(private | shared | none) - default association for variables not otherwise mentioned
firstprivate(list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region
lastprivate(list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread-private variable in the work-sharing construct

(A sketch of firstprivate and lastprivate follows below.)
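A small sketch under the rules just listed; variable names and values are illustrative:

      program fplp
      integer i, n
      parameter (n = 10)
      real t, last
      t = 5.0
c     each thread starts with its own t initialized to 5.0;
c     after the loop, last holds the value from iteration i = n
c$omp parallel do firstprivate(t) lastprivate(last)
      do i = 1, n
         last = t + real(i)
      enddo
      print *, 'last =', last
      end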
Slide 42: Clauses in OpenMP - 2

reduction(op|intrinsic : list) - variables in the list are named scalars of intrinsic type. A private copy of each variable will be made in each thread and initialized according to the intended operation; at the end of the parallel region or other synchronization point, all private copies will be combined. The operation must be of one of the forms:

   x = x op expr
   x = intrinsic(x, expr)
   if (x .LT. expr) x = expr
   x++   x--   ++x   --x

where expr does not contain x.

Fortran initial values:

   Op/intrinsic    Init
   + or -          0
   *               1
   .AND.           .TRUE.
   .OR.            .FALSE.
   .EQV.           .TRUE.
   .NEQV.          .FALSE.
   MAX             smallest number
   MIN             largest number
   IAND            all bits on
   IOR             0
   IEOR            0

C initial values:

   Op       Init
   + or -   0
   *        1
   &        ~0
   |        0
   ^        0
   &&       1
   ||       0

Example (expanded into a compilable sketch below):

c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
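The example directive above, expanded into a compilable sketch (the loop body and values are illustrative):

      program redmix
      integer i, n
      parameter (n = 100)
      real a, y
      logical s
      a = 0.0
      y = 0.0
      s = .FALSE.
c     a and y get private copies initialized to 0; s gets a
c     private copy initialized to .FALSE., combined with .OR.
c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
      do i = 1, n
         a = a + 1.0
         y = y + 2.0
         s = s .OR. (i .EQ. 50)
      enddo
      print *, a, y, s
      end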
Slide 43: Clauses in OpenMP - 3

copyin(list) - the list must contain common block (or global) names that have been declared threadprivate. Data in the master thread in that common block will be copied to the thread-private storage at the beginning of the parallel region. There is no copyout clause: data in a private common block is not available outside of that thread. (A sketch follows at the end of this slide.)

if(scalar_logical_expression) - when an if clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.

ordered - only for do/for work-sharing constructs: the code in the ORDERED block will be executed in the same sequence as sequential execution.

schedule(kind[,chunk]) - only for do/for work-sharing constructs: specifies the scheduling discipline for loop iterations.

nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point, unless nowait is specified.
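A sketch of copyin with a threadprivate common block, as described above (the common block /work/ and its contents are illustrative):

      program cpin
      integer id, omp_get_thread_num
      common /work/ id
c$omp threadprivate(/work/)
      id = 99
c     copyin copies the master's id into each thread's private copy
c$omp parallel copyin(/work/)
      print *, 'thread', omp_get_thread_num(), 'starts with id =', id
      id = omp_get_thread_num()
c$omp end parallel
      end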
Slide 44: OpenMP on the Origin 2000

Parallel Sections directive:

c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections

Topics: clauses, detailed construct
Slide 45: OpenMP on the Origin 2000

Parallel Sections directive - clauses:

shared
private
default(private | shared | none)
firstprivate
lastprivate
reduction(operator|intrinsic : var)
if(scalar_logical_expression)
copyin(var)
Slide 46: OpenMP on the Origin 2000

Defining a parallel region - individual do loops:

c$omp parallel shared(a,b)
c$omp do private(j)
      do j = 1, n
         a(j) = j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k = 1, n
         b(k) = k
      enddo
c$omp end do
c$omp end parallel
Slide 47: OpenMP on the Origin 2000

Defining a parallel region - explicit sections:

c$omp parallel shared(a,b)
c$omp section
      block1
c$omp single
      block2
c$omp section
      block3
c$omp end parallel
Slide 48: OpenMP on the Origin 2000

Synchronization constructs (see the sketch below):

master / end master
critical / end critical
barrier
atomic
flush
ordered / end ordered
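A sketch using three of these constructs together (the counter is illustrative):

      program synch
      integer count
      count = 0
c$omp parallel shared(count)
c$omp critical
c     only one thread at a time updates the shared counter
      count = count + 1
c$omp end critical
c     wait until every thread has added its contribution
c$omp barrier
c$omp master
      print *, 'count =', count
c$omp end master
c$omp end parallel
      end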
Slide 49: OpenMP on the Origin 2000

Run-time library routines - execution environment:

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_in_parallel
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
Slide 50: OpenMP on the Origin 2000

Run-time library routines - lock routines (see the sketch below):

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
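A sketch of the lock routines serializing an update, equivalent in effect to a critical section. Declaring the lock variable as integer*8 follows the 64-bit Origin convention of an address-sized integer, an assumption here:

      program locks
      integer*8 lck
      integer count
      count = 0
      call omp_init_lock(lck)
c$omp parallel shared(count, lck)
      call omp_set_lock(lck)
c     the lock serializes access to count
      count = count + 1
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end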
Slide 51: OpenMP on the Origin 2000

Environment variables:

OMP_NUM_THREADS (or MP_SET_NUMTHREADS)
OMP_DYNAMIC
OMP_NESTED
Slide 52: Exercise 5 - use OpenMP to parallelize a loop
Slide 53: (no transcript)
Slide 54: [Code listing: initial values, main loop]
Slides 55-56: (no transcript)
Slide 57: Enhancing Performance

- Ensure sufficient work: running a loop in parallel adds runtime costs (see the if-clause sketch below)
- Schedule loops for load balancing
Slide 58: The SCHEDULE clause

Static: each thread is assigned one chunk of iterations, sized by the chunk argument or equally sized by default.
Dynamic: at runtime, chunks are assigned to threads dynamically as each thread finishes its previous chunk.
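A sketch of both schedule kinds (the chunk size of 4 is illustrative):

      program sched
      integer i, n
      parameter (n = 32)
      real a(n)
c     static: iterations are divided into chunks of 4 and handed
c     out to threads in a fixed round-robin order
c$omp parallel do schedule(static,4)
      do i = 1, n
         a(i) = real(i)
      enddo
c     dynamic: each thread grabs the next chunk of 4 as it finishes
c$omp parallel do schedule(dynamic,4)
      do i = 1, n
         a(i) = a(i) * 2.0
      enddo
      print *, a(1), a(n)
      end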
Slide 59: OpenMP summary

- A small number of compiler directives to set up parallel execution of code, plus a runtime library with locking functions
- Portable directives (supported by different vendors in the same way)
- Parallelization is for the SMP programming model: the machine should have a global address space
- The number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads, nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent: that is, the results of the computation should be within rounding accuracy when compared to the sequential program
- On SGI, OpenMP programming can be mixed with the MPI library, so it is possible to have hierarchical parallelism:
  - OpenMP parallelism within a single node (global address space)
  - MPI parallelism between nodes in a cluster (network connection)