Title: Parallel Programming on the SGI Origin2000

Slide 1: Parallel Programming on the SGI Origin2000
Taub Computer Center, Technion
Anne Weill-Zrahia
With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI)
Mar 2005
Slide 2: Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
Slide 3: 4) Parallel Programming - OpenMP
Slide 4: Is this your joint bank account?

Initial amount: IL500

  Limor in Haifa:       Read IL500, take IL150 (write IL350)
  Shimon in Tel Aviv:   Read IL500, take IL400 (write IL100)

Final amount: IL350 or IL100, depending on which write lands last.
Both reads saw the initial IL500, so one of the withdrawals is lost.
Slide 5: Introduction

- Parallelization is an instruction to the compiler:
  f77 -o prog -mp prog.f
  or: f77 -o prog -pfa prog.f
- Now try to understand what a compiler has to determine when deciding how to parallelize.
- Note that when we talk loosely about "parallelization", what is meant is: is the program, as presented here, parallelizable?
- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see.
Slide 6: Data dependency types

1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):

      do i = 2, n
         a(i) = a(i-1)          <-- cannot be parallelized
      enddo

2) Data dependence within a single iteration (non-loop-carried dependence):

      do i = 2, n
         c = . . .
         a(i) = . . . c . . .   <-- parallelizable
      enddo

3) Reduction:

      do i = 1, n
         s = s + x              <-- parallelizable
      enddo

All data dependencies in programs are variations on these fundamental types.
Slide 7: Data dependency analysis

Question: Are the following loops parallelizable?

      do i = 2, n
         a(i) = b(i-1)
      enddo

YES!

      do i = 2, n
         a(i) = a(i-1)
      enddo

NO!

Why?
Slide 8: Data dependency analysis

      do i = 2, n
         a(i) = b(i-1)
      enddo

YES!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=B(1)    A(3)=B(2)    A(4)=B(3)
  cycle 2    A(5)=B(4)    A(6)=B(5)    A(7)=B(6)
Slide 9: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

Scalar (non-parallel) run:

             CPU1
  cycle 1    A(2)=A(1)
  cycle 2    A(3)=A(2)
  cycle 3    A(4)=A(3)
  cycle 4    A(5)=A(4)

In each cycle, NEW data from the previous cycle is read.
Slide 10: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

No!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=A(1)    A(3)=A(2)    A(4)=A(3)

Will probably read OLD data.
Slide 11: Data dependency analysis

      do i = 2, n
         a(i) = a(i-1)
      enddo

No!

             CPU1         CPU2         CPU3
  cycle 1    A(2)=A(1)    A(3)=A(2)    A(4)=A(3)    <-- will probably read OLD data
  cycle 2    A(5)=A(4)    A(6)=A(5)    A(7)=A(6)    <-- may read NEW data
Slide 12: Data dependency analysis

Another question: Are the following loops parallelizable?

      do i = 3, n, 2
         a(i) = a(i-1)
      enddo

YES!

      do i = 1, n
         s = s + a(i)
      enddo

Depends!
Slide 13: Data dependency analysis

      do i = 3, n, 2
         a(i) = a(i-1)
      enddo

YES!

             CPU1         CPU2           CPU3
  cycle 1    A(3)=A(2)    A(5)=A(4)      A(7)=A(6)
  cycle 2    A(9)=A(8)    A(11)=A(10)    A(13)=A(12)
Slide 14: Data dependency analysis

      do i = 1, n
         s = s + a(i)
      enddo

Depends!

             CPU1          CPU2          CPU3
  cycle 1    S=S+A(1)      S=S+A(2)      S=S+A(3)
  cycle 2    S=S+A(4)      S=S+A(5)      S=S+A(6)

- The value of S will be undetermined, and typically it will vary from one run to the next.
- This bug in parallel programming is called a race condition.
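OpenMP's answer to this race (covered in detail later in the deck) is the reduction clause, which gives each thread a private partial sum and combines the copies at the end. A minimal compilable sketch; the array contents here are purely illustrative:

      program redsum
      integer i, n
      parameter (n = 1000)
      real a(n), s
      do i = 1, n
         a(i) = 1.0
      enddo
      s = 0.0
c     each thread accumulates a private copy of s; the copies are
c     summed into the global s at the end of the loop
c$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + a(i)
      enddo
      print *, 's =', s
      end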
Slide 15: Data dependency analysis

What is the principle involved here? The examples shown fall into two categories:

1) Data being read is independent of data that is written:
   a(i) = b(i-1)    i = 2, 3, 4, . . .
   a(i) = a(i-1)    i = 3, 5, 7, . . .

2) Data being read depends on data that is written:
   a(i) = a(i-1)    i = 2, 3, 4, . . .
   s = s + a(i)     i = 1, 2, 3, . . .
Slide 16: Data dependency analysis

Here is a typical situation. Is there a data dependency in the following loop?

      do i = 1, n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo

No! Clearly, result is a temporary variable that is reassigned in every iteration.

Note: result must be a private variable (this will be discussed later; see the sketch below).
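A hedged sketch of that point, using the private clause introduced later in the deck. The array sizes, contents, and the "+" operators in the loop body are assumptions for illustration:

      program privdemo
      integer i, n
      parameter (n = 8)
      real a(n), b(n), c(n), x(n), result
      do i = 1, n
         x(i) = real(i)
         b(i) = 1.0
         c(i) = 2.0
      enddo
c     private(result) gives each thread its own temporary, so
c     iterations cannot interfere through result
c$omp parallel do private(result)
      do i = 1, n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo
      print *, c(1), c(n)
      end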
Slide 17: Data dependency analysis

Here is a (slightly different) typical situation. Is there a data dependency in the following loop?

      do i = 1, n
         a(i) = sin(result)
         result = a(i) + b(i)
         c(i) = result + c(i)
      enddo

Yes! The value of result is carried over from one iteration to the next. This is the classical read/write situation, but now it is somewhat hidden.
Slide 18: Data dependency analysis

The loop could (symbolically) be rewritten:

      do i = 1, n
         a(i) = sin(result(i-1))
         result(i) = a(i) + b(i)
         c(i) = result(i) + c(i)
      enddo

Now substitute the expression for a(i):

      do i = 1, n
         a(i) = sin(result(i-1))
         result(i) = sin(result(i-1)) + b(i)
         c(i) = result(i) + c(i)
      enddo

This is really of the type a(i) = a(i-1)!
Slide 19: Data dependency analysis

One more: Can the following loop be parallelized?

      do i = 3, n
         a(i) = a(i-2)
      enddo

If this is parallelized, there will probably be different answers from one run to another. Why?
Slide 20: Data dependency analysis

      do i = 3, n
         a(i) = a(i-2)
      enddo

With 2 CPUs, this looks like it will be safe:

             CPU1         CPU2
  cycle 1    A(3)=A(1)    A(4)=A(2)
  cycle 2    A(5)=A(3)    A(6)=A(4)
Slide 21: Data dependency analysis

      do i = 3, n
         a(i) = a(i-2)
      enddo

HOWEVER: what if there are 3 CPUs and not 2?

             CPU1         CPU2         CPU3
  cycle 1    A(3)=A(1)    A(4)=A(2)    A(5)=A(3)

In this case, a(3) is read and written in two threads at once.
Slide 22: RISC memory levels (single CPU)
[Diagram: CPU, Cache, Main memory]

Slide 23: RISC memory levels (single CPU)
[Diagram: CPU, Cache, Main memory]

Slide 24: RISC memory levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]

Slide 25: RISC memory levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]

Slide 26: RISC Memory Levels (multiple CPUs)
[Diagram: CPU 0 with Cache 0 and CPU 1 with Cache 1, both connected to Main memory]
Slide 27: Definition of OpenMP

- Application Program Interface (API) for shared-memory parallel programming
- Directive-based approach with library support
- Targets existing applications and widely used languages
- Fortran API first released October 1997
- C/C++ API first released October 1998
- Multi-vendor/platform support
Slide 28: Why was OpenMP developed?

- Parallel programming before OpenMP:
  - Standards existed for distributed memory (MPI and PVM)
  - No standard for shared-memory programming
  - Vendors had different directive-based APIs for SMP: SGI, Cray, Kuck & Assoc, DEC
    - Vendor proprietary: similar, but not the same
    - Most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory
  - This sacrifices built-in SMP hardware benefits
  - Requires major effort
Slide 29: The Spread of OpenMP

Organization: Architecture Review Board
Web site: www.openmp.org
Hardware: HP/DEC, IBM, Intel, SGI, Sun
Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft
Slide 30: OpenMP interface model

Directives and pragmas:
- Control structures
- Work sharing
- Data scope attributes: private, firstprivate, lastprivate, shared, reduction

Runtime library routines:
- Control and query: number of threads, nested parallelism?, throughput mode
- Lock API

Environment variables:
- Runtime environment: schedule type, max number of threads, nested parallelism, throughput mode
Slide 31: OpenMP execution model

An OpenMP program starts in a single thread (sequential mode). To create additional threads, the user opens a parallel region: additional slave threads are launched, and the master thread is part of the team. The threads disappear at the end of the parallel region, and the program returns to a single-threaded run. This model is repeated as needed.

[Diagram: a master thread forks successive parallel regions of 4, 2, and 3 threads, rejoining the master thread after each]
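A minimal sketch of this fork-join model (the runtime routines used here are introduced later in the deck; the thread count is assumed to be set outside the program):

      program forkjoin
      integer omp_get_thread_num, omp_get_num_threads
      print *, 'sequential part: master thread only'
c$omp parallel
c     this block is executed by every thread in the team
      print *, 'hello from thread', omp_get_thread_num(),
     &         ' of', omp_get_num_threads()
c$omp end parallel
      print *, 'back to the master thread'
      end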
Slide 32: Creating parallel threads

Fortran:

c$omp parallel [clause[,clause]...]
      code to run in parallel
c$omp end parallel

C/C++:

#pragma omp parallel [clause[,clause]...]
      code to run in parallel

Replicated execution:

      i = 0
c$omp parallel
      call foo(i, a, b)
c$omp end parallel
      print *, i

[Diagram: each thread in the team executes foo; execution rejoins at the print]

The number of threads is set by a library call or an environment variable.
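The skeleton above, filled out as a compilable sketch (foo is a stand-in subroutine, as on the slide; a and b are unused placeholders):

      program replic
      integer i
      real a, b
      i = 0
c$omp parallel
      call foo(i, a, b)
c$omp end parallel
      print *, i
      end

      subroutine foo(i, a, b)
      integer i
      real a, b
c     executed once by every thread in the team
      print *, 'foo sees i =', i
      end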
Slides 33-35: (no transcript)
Slide 36: OpenMP on the Origin 2000 - Fortran

Switches and formats:

f77 -mp

c$omp parallel do
c$omp&shared(a,b,c)

OR:

c$omp parallel do shared(a,b,c)

Conditional compilation:

c$    iam = omp_get_thread_num() + 1
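A sketch of conditional compilation in context: compiled with f77 -mp, the c$ line is an active statement; without -mp it is treated as an ordinary comment (the variable iam is illustrative):

      program condcomp
      integer iam, omp_get_thread_num
      iam = 0
c     active only under -mp; otherwise iam stays 0
c$    iam = omp_get_thread_num() + 1
      print *, 'iam =', iam
      end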
Slide 37: OpenMP on the Origin 2000 - C

Switches and formats:

cc -mp

#pragma omp parallel for \
    shared(a,b,c)

OR:

#pragma omp parallel for shared(a,b,c)
Slide 38: OpenMP on the Origin 2000

Parallel Do directive:

c$omp parallel do private(i)
      do i = 1, n
         a(i) = i + 1
      enddo
c$omp end parallel do        <-- optional

Topics: clauses, detailed construct
Slide 39: OpenMP on the Origin 2000

Parallel Do directive - clauses:

shared
private
default(private | shared | none)
firstprivate
lastprivate
reduction(operator|intrinsic : var)
schedule(type[,chunk])
if(scalar_logical_expression)
ordered
copyin(var)
Slide 40: Allocating private and shared variables

S = shared variable, P = private variable

[Diagram: S exists in the single-thread regions before and after the parallel region; inside the parallel region all threads refer to the same S, while each thread gets its own copy of P]
Slide 41: Clauses in OpenMP - 1

Clauses for the parallel directive specify data association rules and conditional computation:

shared(list) - data accessible by all threads, which all refer to the same storage
private(list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region
default(private | shared | none) - default association for variables not otherwise mentioned
firstprivate(list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region
lastprivate(list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread-private variable in the work-sharing construct

(A sketch of firstprivate and lastprivate follows below.)
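A small sketch under the rules just listed; variable names and values are illustrative:

      program fplp
      integer i, n
      parameter (n = 10)
      real t, last
      t = 5.0
c     each thread starts with its own t initialized to 5.0;
c     after the loop, last holds the value from iteration i = n
c$omp parallel do firstprivate(t) lastprivate(last)
      do i = 1, n
         last = t + real(i)
      enddo
      print *, 'last =', last
      end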
Slide 42: Clauses in OpenMP - 2

reduction(op|intrinsic : list) - variables in the list are named scalars of intrinsic type. A private copy of each variable will be made in each thread and initialized according to the intended operation; at the end of the parallel region or other synchronization point, all private copies will be combined. The operation must be of one of the forms:

   x = x op expr
   x = intrinsic(x, expr)
   if (x .LT. expr) x = expr
   x++   x--   ++x   --x

where expr does not contain x.

Fortran initial values:

   Op/intrinsic    Init
   + or -          0
   *               1
   .AND.           .TRUE.
   .OR.            .FALSE.
   .EQV.           .TRUE.
   .NEQV.          .FALSE.
   MAX             smallest number
   MIN             largest number
   IAND            all bits on
   IOR             0
   IEOR            0

C initial values:

   Op       Init
   + or -   0
   *        1
   &        ~0
   |        0
   ^        0
   &&       1
   ||       0

Example (expanded into a compilable sketch below):

c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
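The example directive above, expanded into a compilable sketch (the loop body and values are illustrative):

      program redmix
      integer i, n
      parameter (n = 100)
      real a, y
      logical s
      a = 0.0
      y = 0.0
      s = .FALSE.
c     a and y get private copies initialized to 0; s gets a
c     private copy initialized to .FALSE., combined with .OR.
c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
      do i = 1, n
         a = a + 1.0
         y = y + 2.0
         s = s .OR. (i .EQ. 50)
      enddo
      print *, a, y, s
      end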
Slide 43: Clauses in OpenMP - 3

copyin(list) - the list must contain common block (or global) names that have been declared threadprivate. Data in the master thread in that common block will be copied to the thread-private storage at the beginning of the parallel region. There is no copyout clause: data in a private common block is not available outside of that thread. (A sketch follows at the end of this slide.)

if(scalar_logical_expression) - when an if clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.

ordered - only for do/for work-sharing constructs: the code in the ORDERED block will be executed in the same sequence as sequential execution.

schedule(kind[,chunk]) - only for do/for work-sharing constructs: specifies the scheduling discipline for loop iterations.

nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point, unless nowait is specified.
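A sketch of copyin with a threadprivate common block, as described above (the common block /work/ and its contents are illustrative):

      program cpin
      integer id, omp_get_thread_num
      common /work/ id
c$omp threadprivate(/work/)
      id = 99
c     copyin copies the master's id into each thread's private copy
c$omp parallel copyin(/work/)
      print *, 'thread', omp_get_thread_num(), 'starts with id =', id
      id = omp_get_thread_num()
c$omp end parallel
      end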
Slide 44: OpenMP on the Origin 2000

Parallel Sections directive:

c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections

Topics: clauses, detailed construct
Slide 45: OpenMP on the Origin 2000

Parallel Sections directive - clauses:

shared
private
default(private | shared | none)
firstprivate
lastprivate
reduction(operator|intrinsic : var)
if(scalar_logical_expression)
copyin(var)
Slide 46: OpenMP on the Origin 2000

Defining a parallel region - individual do loops:

c$omp parallel shared(a,b)
c$omp do private(j)
      do j = 1, n
         a(j) = j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k = 1, n
         b(k) = k
      enddo
c$omp end do
c$omp end parallel
Slide 47: OpenMP on the Origin 2000

Defining a parallel region - explicit sections:

c$omp parallel shared(a,b)
c$omp section
      block1
c$omp single
      block2
c$omp section
      block3
c$omp end parallel
Slide 48: OpenMP on the Origin 2000

Synchronization constructs (see the sketch below):

master / end master
critical / end critical
barrier
atomic
flush
ordered / end ordered
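A sketch using three of these constructs together (the counter is illustrative):

      program synch
      integer count
      count = 0
c$omp parallel shared(count)
c$omp critical
c     only one thread at a time updates the shared counter
      count = count + 1
c$omp end critical
c     wait until every thread has added its contribution
c$omp barrier
c$omp master
      print *, 'count =', count
c$omp end master
c$omp end parallel
      end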
Slide 49: OpenMP on the Origin 2000

Run-time library routines - execution environment:

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_in_parallel
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
Slide 50: OpenMP on the Origin 2000

Run-time library routines - lock routines (see the sketch below):

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
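A sketch of the lock routines serializing an update, equivalent in effect to a critical section. Declaring the lock variable as integer*8 follows the 64-bit Origin convention of an address-sized integer, an assumption here:

      program locks
      integer*8 lck
      integer count
      count = 0
      call omp_init_lock(lck)
c$omp parallel shared(count, lck)
      call omp_set_lock(lck)
c     the lock serializes access to count
      count = count + 1
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end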
Slide 51: OpenMP on the Origin 2000

Environment variables:

OMP_NUM_THREADS (or MP_SET_NUMTHREADS)
OMP_DYNAMIC
OMP_NESTED
Slide 52: Exercise 5 - use OpenMP to parallelize a loop
Slide 53: (no transcript)
Slide 54: [Code listing: initial values, main loop]
Slides 55-56: (no transcript)
Slide 57: Enhancing Performance

- Ensure sufficient work: running a loop in parallel adds runtime costs (see the if-clause sketch below)
- Schedule loops for load balancing
Slide 58: The SCHEDULE clause

Static: each thread is assigned one chunk of iterations, sized by the chunk argument or equally sized by default.
Dynamic: at runtime, chunks are assigned to threads dynamically as each thread finishes its previous chunk.
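A sketch of both schedule kinds (the chunk size of 4 is illustrative):

      program sched
      integer i, n
      parameter (n = 32)
      real a(n)
c     static: iterations are divided into chunks of 4 and handed
c     out to threads in a fixed round-robin order
c$omp parallel do schedule(static,4)
      do i = 1, n
         a(i) = real(i)
      enddo
c     dynamic: each thread grabs the next chunk of 4 as it finishes
c$omp parallel do schedule(dynamic,4)
      do i = 1, n
         a(i) = a(i) * 2.0
      enddo
      print *, a(1), a(n)
      end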
Slide 59: OpenMP summary

- A small number of compiler directives to set up parallel execution of code, plus a runtime library with locking functions
- Portable directives (supported by different vendors in the same way)
- Parallelization is for the SMP programming model: the machine should have a global address space
- The number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads, nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent: that is, the results of the computation should be within rounding accuracy when compared to the sequential program
- On SGI, OpenMP programming can be mixed with the MPI library, so it is possible to have hierarchical parallelism:
  - OpenMP parallelism within a single node (global address space)
  - MPI parallelism between nodes in a cluster (network connection)