Title: Shared Memory Parallel Programming
1 Shared Memory Parallel Programming
2 Laplace Equation
- An elliptic partial differential equation (PDE)
- Can model many natural phenomena, e.g., heat dissipation in a metal sheet
- PDEs are used to model many physical systems (weather, flow over a wing, turbulence, etc.)
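In two dimensions the equation reads

  \nabla^2 F \;=\; \frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} \;=\; 0,

and the 4-point stencil update used later in the deck is its standard finite-difference approximation.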
3 Laplace Equation
- Typical approach is to generate a mesh by covering the region of interest with a grid of points
- Then impose an initial state or initial approximate solution on the grid
- At each time step, update the current values at each point on the grid
- Terminate either after a specified number of iterations, or when a steady state is reached
4 Problem Statement
Diagram: a rectangular region in the (x, y) plane. Inside the region, F satisfies Laplace's equation, ∇²F(x, y) = 0. The boundary conditions fix F = 1 on one edge of the rectangle and F = 0 on the other three edges.
5 Discretization
- Represent F in the continuous rectangle by a 2-dimensional discrete grid (array)
- The boundary conditions on the rectangle become the boundary values of the array
- The internal values are found by repeatedly updating values using a combination of values at neighboring grid points until termination
6 Discretized Problem Statement
Diagram: the discrete grid indexed by i and j. Each interior point is updated from its four nearest neighbors (a 4-point stencil).
7 Solution Methods
- At each time step, update the solution and test for steady state
- Variety of methods for the update operation
  - Jacobi, Gauss-Seidel, SOR (Successive Over-Relaxation), Red-Black, Multigrid, ...
- Test for steady state by comparing values at grid points from the previous time step with those at the current time step
8 Typical Algorithm
- For some number of iterations
  - for each internal grid point
    - update value using average of its neighbors
- Termination condition
  - values at grid points change very little
  - (we will ignore this part in our example)
9 Jacobi Method
  /* Initialization */
  for( i=0; i<=n+1; i++ ) grid[i][0]   = 0.0;
  for( i=0; i<=n+1; i++ ) grid[i][n+1] = 0.0;
  for( j=0; j<=n+1; j++ ) grid[0][j]   = 1.0;
  for( j=0; j<=n+1; j++ ) grid[n+1][j] = 0.0;
  for( i=1; i<=n; i++ )
    for( j=1; j<=n; j++ )
      grid[i][j] = 0.0;
10 Jacobi Method
  for some number of timesteps/iterations {
    for( i=1; i<=n; i++ )
      for( j=1; j<=n; j++ )
        temp[i][j] = 0.25 *
          ( grid[i-1][j] + grid[i+1][j] +
            grid[i][j-1] + grid[i][j+1] );
    for( i=1; i<=n; i++ )
      for( j=1; j<=n; j++ )
        grid[i][j] = temp[i][j];
  }
11 Data Usage in Parallel Jacobi
Diagram: the old array and the new array, indexed by i and j. Each updated value in the new array is computed from the four neighboring values in the old array.
12 Parallel Jacobi Method
- No dependences between iterations of the first (i,j) loop nest
- No dependences between iterations of the second (i,j) loop nest
- True and anti-dependences between the first and second loop nests in the same timestep
- True and anti-dependences between the second loop nest and the first loop nest of the next timestep
13 Data Usage in Parallel Jacobi
Diagram: the grid partitioned by rows between thread1 and thread2. A value on the partition boundary is updated by a thread on another processor; that value needs to be back in main memory before the next loop.
14 Parallel Jacobi (continued)
- First (i,j) loop nest can be parallelized
- Second (i,j) loop nest can be parallelized
- But keep the order of loops and timesteps
  - threads must not begin the second loop nest until the first loop nest completes
  - or begin a new timestep until the previous one completes
- In other words, threads may need to wait at the end of each (i,j) loop nest
- This is a barrier (barrier synchronization)
15 Parallel Jacobi Method
  for some number of timesteps/iterations {
    for( i=1; i<=n; i++ )          /* <- distribute iterations among threads */
      for( j=1; j<=n; j++ )
        temp[i][j] = 0.25 *
          ( grid[i-1][j] + grid[i+1][j] +
            grid[i][j-1] + grid[i][j+1] );
    /* synchronization point */
    for( i=1; i<=n; i++ )          /* <- distribute iterations among threads */
      for( j=1; j<=n; j++ )
        grid[i][j] = temp[i][j];
    /* synchronization point */
  }
16 OpenMP Jacobi Method
  for some number of timesteps/iterations {
    #pragma omp parallel for
    for( i=1; i<=n; i++ )
      for( j=1; j<=n; j++ )
        temp[i][j] = 0.25 *
          ( grid[i-1][j] + grid[i+1][j] +
            grid[i][j-1] + grid[i][j+1] );
    #pragma omp parallel for
    for( i=1; i<=n; i++ )
      for( j=1; j<=n; j++ )
        grid[i][j] = temp[i][j];
  }
OpenMP automatically inserts a barrier at the end of each parallel loop. At a barrier, data is automatically flushed to main memory.
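For concreteness, here is a minimal, compilable sketch of the whole OpenMP Jacobi computation. The grid size N, the number of timesteps NSTEPS, and the choice of boundary values are illustrative assumptions, not taken from the slides; loop indices are declared inside the for statements so they are automatically private.

  /* Compile with: cc -std=c99 -fopenmp jacobi.c */
  #include <stdio.h>

  #define N      512     /* number of interior points per dimension (assumed) */
  #define NSTEPS 100     /* number of timesteps (assumed) */

  static double grid[N + 2][N + 2];   /* static arrays start zero-initialized */
  static double temp[N + 2][N + 2];

  int main(void)
  {
      /* Boundary conditions: F = 1 on one edge (row 0), F = 0 on the others */
      for (int i = 0; i <= N + 1; i++) {
          grid[i][0] = 0.0;  grid[i][N + 1] = 0.0;
          grid[0][i] = 1.0;  grid[N + 1][i] = 0.0;
      }

      for (int step = 0; step < NSTEPS; step++) {
          /* Update phase: iterations of the i loop are shared among threads */
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  temp[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                       grid[i][j - 1] + grid[i][j + 1]);
          /* Implicit barrier here: no thread starts the copy phase early */
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  grid[i][j] = temp[i][j];
          /* Implicit barrier again before the next timestep */
      }

      printf("grid[N/2][N/2] = %f\n", grid[N / 2][N / 2]);
      return 0;
  }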
17 Gauss-Seidel Method
  for some number of timesteps/iterations {
    for( i=1; i<=n; i++ )
      for( j=1; j<=n; j++ )
        grid[i][j] = 0.25 *
          ( grid[i-1][j] + grid[i+1][j] +
            grid[i][j-1] + grid[i][j+1] );
  }
This method cannot be easily parallelized: each update reads values (grid[i-1][j] and grid[i][j-1]) that were already updated earlier in the same sweep, so there are loop-carried dependences across both i and j.
19 Molecular Dynamics
  for some number of timesteps
    for all molecules i
      for all nearby molecules j
        force[i] += f( loc[i], loc[j] )
    for all molecules i
      loc[i] = g( loc[i], force[i] )
20 Molecular Dynamics (continued)
- On a distributed system, we have to assign molecules to processors
- With shared memory, that is not needed
Diagram: molecules scattered in space, with nearby groups of molecules handled by proc1, proc2, and proc3.
21 Molecular Dynamics (continued)
  for some number of timesteps {
    for( i=0; i<num_mol; i++ )
      for( j=0; j<count[i]; j++ )
        force[i] += f( loc[i], loc[index[j]] );
    for( i=0; i<num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
    /* compute new neighbors, count[i], for each i */
  }
22 Molecular Dynamics (continued)
- In the first loop nest
  - No loop-carried dependence in the outer loop
  - Loop-carried dependence (a reduction) in the j-loop
- No loop-carried dependence in the second loop nest
- True dependence between the first and second loop nests
23 Molecular Dynamics (continued)
- Outer loop in the first loop nest can be parallelized
- Second loop nest can be parallelized
- OpenMP performs synchronization between the loops
- Memory is shared, so
  - share molecules between threads
  - if there are large differences in the number of neighbors, we can get a load-balancing problem
  - cache interference between threads is possible
24 Molecular Dynamics (continued)
  for some number of timesteps {
    #pragma omp parallel for
    for( i=0; i<num_mol; i++ )
      for( j=0; j<count[i]; j++ )
        force[i] += f( loc[i], loc[index[j]] );
    #pragma omp parallel for
    for( i=0; i<num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
    /* exchange loc[i] values with neighbors */
  }
What schedule would you use? (One possible answer is sketched below.)
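A hedged sketch of one answer: because count[i] (the number of neighbors) varies per molecule, a static schedule can leave threads with very uneven work, while schedule(dynamic) or schedule(guided) hands out iterations on demand. All names, sizes, and the toy force function below are illustrative assumptions, not from the slides.

  #include <stdio.h>

  #define NUM_MOL 1000
  #define MAX_NBR 64

  static double loc[NUM_MOL], force[NUM_MOL];
  static int    count[NUM_MOL], nbr[NUM_MOL][MAX_NBR];

  static double f(double a, double b) { return b - a; }   /* toy pair force */

  int main(void)
  {
      /* Deliberately uneven neighbor counts to create load imbalance */
      for (int i = 0; i < NUM_MOL; i++) {
          loc[i]   = (double)i;
          count[i] = (i % 10 == 0) ? MAX_NBR : 4;
          for (int j = 0; j < count[i]; j++)
              nbr[i][j] = (i + j + 1) % NUM_MOL;
      }

      /* dynamic scheduling: threads grab chunks of molecules as they finish,
         so molecules with many neighbors do not serialize the loop.
         The chunk size 16 is an illustrative choice. */
      #pragma omp parallel for schedule(dynamic, 16)
      for (int i = 0; i < NUM_MOL; i++)
          for (int j = 0; j < count[i]; j++)
              force[i] += f(loc[i], loc[nbr[i][j]]);

      printf("force[0] = %f\n", force[0]);
      return 0;
  }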
25 Irregular codes in OpenMP
- Fairly easy to parallelize irregular computations using OpenMP (sometimes)
- Don't need to figure out which data elements are neighbors, or partition the data
- But there are hidden costs, as multiple threads may need to update data in the same cache line
- And the threads may have different amounts of work to perform
26 OpenMP Overview
- OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C, and C++
- Standardizes the last 20 years of SMP practice
Sample directives, runtime calls, and environment settings (Fortran and C):
  C$OMP FLUSH                        #pragma omp critical
  C$OMP THREADPRIVATE(/ABC/)         CALL OMP_SET_NUM_THREADS(10)
  C$OMP parallel do shared(a, b, c)  call omp_test_lock(jlok)
  call OMP_INIT_LOCK (ilok)          C$OMP MASTER
  C$OMP ATOMIC                       C$OMP SINGLE PRIVATE(X)
  setenv OMP_SCHEDULE "dynamic"      C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
  C$OMP ORDERED                      C$OMP PARALLEL REDUCTION (+: A, B)
  C$OMP SECTIONS                     #pragma omp parallel for private(A, B)
  !$OMP BARRIER                      C$OMP PARALLEL COPYIN(/blk/)
  C$OMP DO lastprivate(XX)           Nthrds = OMP_GET_NUM_PROCS()
  omp_set_lock(lck)
The name OpenMP is the property of the OpenMP Architecture Review Board.
27 Data Environment: Default storage attributes
- Shared memory programming model
  - Most variables are shared by default
- Global variables are shared among threads
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file-scope variables, static variables
- But not everything is shared...
  - Stack variables in subprograms called from parallel regions are PRIVATE
  - Automatic variables within a statement block are PRIVATE
28 A Shared Memory Architecture
Diagram: processors proc1, proc2, proc3, ..., procN, each with its own cache (cache1, cache2, cache3, ..., cacheN), all connected to a single shared memory.
29 Data in OpenMP
- Data assigned to a thread is private or local to that thread and is not accessible to other threads
- Threads owning data that is adjacent in the grid are sometimes called neighboring threads
- Performance problems arise if two threads write data on the same cache line (illustrated below)
- This doesn't usually happen with private data
- Optimization principle: computation in each thread should use, as far as possible, private data
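A hedged illustration of the cache-line point above. If each thread keeps updating its own slot of a small shared array, the slots sit on the same cache line and the line bounces between processors; accumulating in a private variable avoids this. All names and sizes are illustrative assumptions, and the code assumes at most MAX_THREADS threads.

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1 << 22)
  #define MAX_THREADS 64

  double partial[MAX_THREADS];   /* adjacent doubles: many share one cache line */

  int main(void)
  {
      double *x = malloc(N * sizeof *x);
      for (long i = 0; i < N; i++) x[i] = 1.0;

      /* Version 1: every iteration writes the shared array -> cache-line contention */
      #pragma omp parallel
      {
          int t = omp_get_thread_num();
          partial[t] = 0.0;
          #pragma omp for
          for (long i = 0; i < N; i++)
              partial[t] += x[i];
      }

      /* Version 2: accumulate in a private variable, combine once per thread */
      double sum = 0.0;
      #pragma omp parallel
      {
          double local = 0.0;        /* private data: no contention */
          #pragma omp for
          for (long i = 0; i < N; i++)
              local += x[i];
          #pragma omp atomic
          sum += local;
      }

      printf("sum = %f\n", sum);
      free(x);
      return 0;
  }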
30 Data Sharing Examples
  subroutine work (index)
    common /input/ A(10)
    integer index(*)
    real temp(10)
    integer count
    save count
  end

  program sort
    common /input/ A(10)
    integer index(10)
!$OMP PARALLEL
    call work(index)
!$OMP END PARALLEL
    print*, index(1)
  end

A, index and count are shared by all threads. temp is local to each thread.
31 Data Environment: Changing storage attributes
- One can selectively change storage attributes of constructs using the following clauses
  - SHARED
  - PRIVATE
  - FIRSTPRIVATE
  - THREADPRIVATE
- The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with
  - LASTPRIVATE
- The default status can be modified with
  - DEFAULT (PRIVATE | SHARED | NONE)
All the clauses on this page apply to the OpenMP construct, NOT to the entire region.
All data clauses apply to parallel regions and worksharing constructs, except SHARED, which only applies to parallel regions.
32 Private Clause
- private(var) creates a local copy of var for each thread
  - The value is uninitialized
  - The private copy is not storage-associated with the original
  - The original is undefined at the end

  program wrong
    IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
    DO J=1,1000
      IS = IS + J
    END DO
    print *, IS

IS was not initialized inside the loop.
Regardless of initialization, IS is undefined at the print statement.
33 Firstprivate Clause
- Firstprivate is a special case of private
  - Initializes each private copy with the corresponding value from the master thread

  program almost_right
    IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
    DO J=1,1000
      IS = IS + J
1000  CONTINUE
    print *, IS

Each thread gets its own IS with an initial value of 0.
Regardless of initialization, IS is undefined at the print statement.
34 Lastprivate Clause
- Lastprivate passes the value of a private from the last iteration to a global variable

  program closer
    IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
    DO J=1,1000
      IS = IS + J
1000  CONTINUE
    print *, IS

Each thread gets its own IS with an initial value of 0.
IS is defined as its value at the last sequential iteration (i.e., for J=1000).
(A C sketch combining these three clauses follows.)
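A hedged C counterpart to the three Fortran programs above, showing the same private / firstprivate / lastprivate progression; the variable names and loop bounds simply mirror the Fortran examples.

  #include <stdio.h>

  int main(void)
  {
      int is;

      /* private: each thread's copy of is starts uninitialized, and is is
         undefined after the loop - analogous to "program wrong". */
      is = 0;
      #pragma omp parallel for private(is)
      for (int j = 1; j <= 1000; j++)
          is = is + j;                 /* result is meaningless */

      /* firstprivate: each thread's copy starts at 0, but is is still
         undefined after the loop - analogous to "program almost_right". */
      is = 0;
      #pragma omp parallel for firstprivate(is)
      for (int j = 1; j <= 1000; j++)
          is = is + j;

      /* firstprivate + lastprivate: after the loop, is holds the value from
         the thread that executed the last iteration (j == 1000) - analogous
         to "program closer", and still not the full sum in general. */
      is = 0;
      #pragma omp parallel for firstprivate(is) lastprivate(is)
      for (int j = 1; j <= 1000; j++)
          is = is + j;
      printf("is = %d\n", is);

      return 0;
  }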
35 OpenMP: A data environment test
- Consider this example of PRIVATE and FIRSTPRIVATE
- Are A, B, C local to each thread or shared inside the parallel region?
- What are their initial values inside, and their values after, the parallel region?

  variables A, B, and C = 1
C$OMP PARALLEL PRIVATE(B)
C$OMP& FIRSTPRIVATE(C)

- Inside this parallel region ...
  - A is shared by all threads; equals 1
  - B and C are local to each thread
    - B's initial value is undefined
    - C's initial value equals 1
- Outside this parallel region ...
  - The values of B and C are undefined
36 OpenMP: Reduction
- Combines an accumulation operation across threads:
  reduction (op : list)
- Inside a parallel or a work-sharing construct:
  - A local copy of each list variable is made and initialized depending on the op (e.g. 0 for +)
  - The compiler finds standard reduction expressions containing op and uses them to update the local copy
  - Local copies are reduced into a single value and combined with the original global value
- The variables in list must be shared in the enclosing parallel region
37 OpenMP: Reduction example
- Remember the code we used to demo private, firstprivate and lastprivate (a reduction version is sketched below):

  program closer
    IS = 0
    DO J=1,1000
      IS = IS + J
1000  CONTINUE
    print *, IS
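A hedged sketch of the same accumulation written with a reduction clause, in C for consistency with the other examples added here (the Fortran form of the directive would be C$OMP PARALLEL DO REDUCTION(+:IS)).

  #include <stdio.h>

  int main(void)
  {
      int is = 0;
      /* Each thread gets a private is initialized to 0; the partial sums are
         combined into the shared is at the end of the loop. */
      #pragma omp parallel for reduction(+:is)
      for (int j = 1; j <= 1000; j++)
          is = is + j;
      printf("is = %d\n", is);   /* always 500500, independent of thread count */
      return 0;
  }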
38 OpenMP: Reduction operands/initial values
- A range of associative operands can be used with reduction (see the list below)
- Initial values are the ones that make sense mathematically
Min and Max are not available in C/C++.
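For reference, the standard operators and the initial value each private copy receives, as defined by the OpenMP specification of that era:

  Fortran:  +  (0)   *  (1)   -  (0)   MAX (smallest representable number)
            MIN (largest representable number)   .AND. (.true.)   .OR. (.false.)
            .EQV. (.true.)   .NEQV. (.false.)   IAND (all bits on)   IOR (0)   IEOR (0)
  C/C++:    +  (0)   *  (1)   -  (0)   &  (~0)   |  (0)   ^  (0)   &&  (1)   ||  (0)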
39 Default Clause
- Note that the default storage attribute is DEFAULT(SHARED) (so no need to use it)
- To change the default: DEFAULT(PRIVATE)
  - each variable in the static extent of the parallel region is made private as if specified in a private clause
  - mostly saves typing
- DEFAULT(NONE): no default for variables in the static extent. Must list the storage attribute for each variable in the static extent
Only the Fortran API supports default(private). C/C++ only has default(shared) or default(none).
40 Default Clause Example
  itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
C$OMP END PARALLEL

  itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
C$OMP END PARALLEL

Are these two codes equivalent?
41 Threadprivate
- Makes global data private to a thread
  - Fortran: COMMON blocks
  - C: file-scope and static variables (a C sketch follows below)
- Different from making them PRIVATE
  - with PRIVATE, global variables are masked
  - THREADPRIVATE preserves global scope within each thread
- Threadprivate variables can be initialized using COPYIN or by using DATA statements
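A minimal C sketch of THREADPRIVATE on a file-scope variable (the deck's own Fortran COMMON-block example follows on the next slide). The counter name and the parallel region are illustrative assumptions.

  #include <omp.h>
  #include <stdio.h>

  int counter = 0;                     /* file-scope: shared by default...     */
  #pragma omp threadprivate(counter)   /* ...but made private per thread while
                                          keeping global scope inside every
                                          routine each thread calls            */

  void bump(void) { counter++; }       /* sees the calling thread's copy       */

  int main(void)
  {
      #pragma omp parallel
      {
          bump();
          bump();
          printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
      }
      return 0;   /* every thread prints counter = 2 */
  }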
42 A threadprivate example
Consider two different routines called within a parallel region.

  subroutine poo
    parameter (N=1000)
    common/buf/A(N), B(N)
!$OMP THREADPRIVATE(/buf/)
    do i=1, N
      B(i) = const * A(i)
    end do
    return
  end

  subroutine bar
    parameter (N=1000)
    common/buf/A(N), B(N)
!$OMP THREADPRIVATE(/buf/)
    do i=1, N
      A(i) = sqrt(B(i))
    end do
    return
  end

Because of the threadprivate construct, each thread executing these routines has its own copy of the common block /buf/.
43 Copyin
You initialize threadprivate data using a COPYIN clause.

  parameter (N=1000)
  common/buf/A(N)
!$OMP THREADPRIVATE(/buf/)

C Initialize the A array
  call init_data(N, A)

!$OMP PARALLEL COPYIN(A)
  ... Now each thread sees the threadprivate array A initialized to the global value set in the subroutine init_data() ...
!$OMP END PARALLEL
  end
44 Copyprivate
Used with a single region to broadcast values of privates from one member of a team to the rest of the team.

  #include <omp.h>
  void input_parameters(int*, int*);   // fetch values of input parameters
  void do_work(int, int);

  int main(void)
  {
     int Nsize, choice;
     #pragma omp parallel private(Nsize, choice)
     {
        #pragma omp single copyprivate(Nsize, choice)
           input_parameters(&Nsize, &choice);
        do_work(Nsize, choice);
     }
     return 0;
  }