Title: OpenMP
1. OpenMP
Credits/Sources: OpenMP C/C++ standard (openmp.org); OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/Introduction); OpenMP SC99 tutorial presentation (openmp.org); Dr. Eric Strohmaier (University of Tennessee, CS594 class, Feb 9, 2000)
2. Introduction
- An API for multi-threaded shared memory parallelism
- A specification for a set of compiler directives, library routines, and environment variables (standardizing pragmas)
- Standardized by a group of hardware and software vendors
- Both fine-grain and coarse-grain parallelism (orphaned directives)
- Much easier to program than MPI
3. History
- Many different vendors provided their own compiler directives for shared memory programming using threads
- The OpenMP standard effort started in 1997
- October 1997: Fortran version 1.0
- Late 1998: C/C++ version 1.0
- June 2000: Fortran version 2.0
- April 2002: C/C++ version 2.0
4. Introduction
- Parallelism: loop-level (fine-grained) and coarse-grained parallelism
- Threaded parallelism
- Explicit parallelism
- No task-level parallelism
- Supports nested parallelism
- Follows the fork-join model
- The number of threads can vary from one region to another
- Based on compiler directives
- It is the user's responsibility to ensure program correctness, avoid deadlocks, etc.
5. Execution Model
- Execution begins as a single thread, called the master thread
- Fork: when a parallel construct is encountered, a team of threads is created
- Statements in the parallel region are executed in parallel
- Join: at the end of the parallel region, the team threads synchronize and terminate
6. Definitions
- Construct: a statement containing a directive and a structured block
- Directive: #pragma omp <id> <other text>
- Based on C #pragma directives
- #pragma omp directive-name [clause[ [,] clause] ...] new-line
- Example:
- #pragma omp parallel default(shared) private(beta,pi)
7. Types of constructs, Calls, Variables
- Work-sharing constructs
- Synchronization constructs
- Data environment constructs
- Library calls, environment variables
8. parallel construct
- #pragma omp parallel [clause[ [,] clause] ...] new-line
- structured-block
- Clauses: if, num_threads, default, private, firstprivate, shared, copyin, reduction
9. Parallel construct
- A parallel region is executed by multiple threads
- Implied barrier at the end of the parallel region
- If the num_threads clause, omp_set_num_threads(), and OMP_NUM_THREADS are not used, the number of threads created is implementation dependent
- The number of threads can be dynamically adjusted using omp_set_dynamic() or OMP_DYNAMIC
- The number of physical processors hosting the threads is also implementation dependent
- Threads are numbered from 0 to N-1
- Nested parallelism: embedding one parallel construct inside another
10. Parallel construct - Example
  #include <omp.h>
  #include <stdio.h>

  main ()
  {
    int nthreads, tid;
    #pragma omp parallel private(nthreads, tid)
    {
      tid = omp_get_thread_num();
      printf("Hello World from thread %d\n", tid);
    }
  }
11. Work-sharing constructs
- For distributing the execution among the threads that encounter it
- 3 types: for, sections, single
12. for construct
- For distributing the loop iterations among the threads
- #pragma omp for [clause[ [,] clause] ...] new-line
- for-loop
- Clauses: private, firstprivate, lastprivate, reduction, ordered, schedule, nowait
13. for construct
- The structure of the for loop is restricted so that the compiler can determine the number of iterations (e.g., no branching out of the loop)
- The assignment of iterations to threads depends on the schedule clause
- Implicit barrier at the end of for unless nowait is specified
14. schedule clause
- schedule(static, chunk_size): iterations/chunk_size chunks are distributed round-robin among the threads
- schedule(dynamic, chunk_size): a chunk of chunk_size iterations is given to the next ready thread
- schedule(guided, chunk_size): the actual chunk size given to the next ready thread is unassigned_iterations/(threads*chunk_size), so chunk sizes decrease exponentially
- schedule(runtime): decision made at run time; implementation dependent
15. for - Example
  #include <omp.h>
  #define CHUNKSIZE 100
  #define N 1000

  main ()
  {
    int i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i=0; i < N; i++)
      a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
      #pragma omp for schedule(dynamic,chunk) nowait
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
    } /* end of parallel section */
  }
16. sections construct
- For distributing non-iterative blocks of code (sections) among threads
- Clauses: private, firstprivate, lastprivate, reduction, nowait
17. sections - Example
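The example code for this slide is missing; below is a minimal sketch (not from the original slides) with two sections, each executed once by some thread of the team:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    #pragma omp parallel
    {
      #pragma omp sections
      {
        #pragma omp section
        printf("Section 1 run by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("Section 2 run by thread %d\n", omp_get_thread_num());
      } /* implicit barrier here unless nowait is specified */
    }
    return 0;
  }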
18. single directive
- The block is executed by only one thread in the team; the other threads wait at an implicit barrier at the end of the single construct unless nowait is specified
19. single - Example
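The original example is not included; a small illustrative sketch, assuming a shared variable that must be initialized by exactly one thread:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int input = 0;

    #pragma omp parallel shared(input)
    {
      #pragma omp single
      input = 42;   /* executed by one thread; the others wait at the implicit barrier */

      printf("Thread %d sees input = %d\n", omp_get_thread_num(), input);
    }
    return 0;
  }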
20. Combined parallel work-sharing directives
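The slide body is missing. As an illustration (assumed, not taken from the source), parallel for and parallel sections merge a parallel directive and a work-sharing directive into one:

  #include <omp.h>

  int main(void)
  {
    float a[100], b[100], c[100];
    int i;

    for (i = 0; i < 100; i++) { a[i] = i; b[i] = 2 * i; }

    /* shorthand for a parallel region containing a single for construct */
    #pragma omp parallel for shared(a,b,c) private(i)
    for (i = 0; i < 100; i++)
      c[i] = a[i] + b[i];

    return 0;
  }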
21. Synchronization directives
- critical, atomic, barrier, flush, ordered, master
22. critical - Example
  #include <omp.h>

  main()
  {
    int x;
    x = 0;

    #pragma omp parallel shared(x)
    {
      #pragma omp critical
      x = x + 1;
    }
  }
23. atomic - Example
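The original example is missing; a minimal sketch of atomic, which protects a single memory update (compare with the critical example above):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int counter = 0;

    #pragma omp parallel shared(counter)
    {
      #pragma omp atomic
      counter += 1;   /* the update of counter is performed atomically */
    }
    printf("counter = %d\n", counter);
    return 0;
  }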
24. flush directive
- A point at which a consistent view of memory is provided among the threads
- Thread-visible variables (global variables, shared variables, etc.) are written back to memory
- If var-list is given, only the variables in the list are flushed
25. flush - Example
26. flush - Example (Contd.)
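The code for slides 25-26 is missing; below is a sketch of the usual producer/consumer flag idiom (an assumption, not the original example), where flush makes the writes visible in the intended order:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int data = 0, flag = 0;

    #pragma omp parallel sections shared(data, flag)
    {
      #pragma omp section
      {                              /* producer */
        data = 42;
        #pragma omp flush(data)
        flag = 1;
        #pragma omp flush(flag)
      }
      #pragma omp section
      {                              /* consumer: spin until the flag becomes visible */
        while (1) {
          #pragma omp flush(flag)
          if (flag) break;
        }
        #pragma omp flush(data)
        printf("data = %d\n", data);
      }
    }
    return 0;
  }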
27. ordered - Example
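The original example is missing; a small sketch, assuming a loop whose output must appear in iteration order even though the iterations run in parallel:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int i;

    /* the ordered clause on the for directive is required for the ordered construct */
    #pragma omp parallel for ordered schedule(dynamic)
    for (i = 0; i < 8; i++) {
      #pragma omp ordered
      printf("iteration %d\n", i);   /* printed in loop order */
    }
    return 0;
  }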
28. Data Environment
- Global variables in the variable-list of a threadprivate directive are made private to each thread
- Each thread gets its own copy
- The copies persist between different parallel regions

  #include <omp.h>
  #include <stdio.h>
  int alpha[10], beta[10], i;
  #pragma omp threadprivate(alpha)

  main ()
  {
    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);
    /* First parallel region */
    #pragma omp parallel private(i,beta)
    for (i=0; i < 10; i++)
      alpha[i] = beta[i] = i;
    /* Second parallel region */
    #pragma omp parallel
    printf("alpha[3]= %d and beta[3]= %d\n", alpha[3], beta[3]);
  }
29. Data Scope Attribute Clauses
- Most variables are shared by default
- Data scopes can be explicitly specified with data scope attribute clauses
- Shared by default (if not specified in a threadprivate directive):
  - Static variables in the dynamic extent
  - Heap-allocated memory
  - Global variables
- Clauses:
  - private
  - firstprivate
  - lastprivate
  - shared
  - default
  - reduction
  - copyin
  - copyprivate
30. private, firstprivate, lastprivate
- private(variable-list)
  - variable-list is private to each thread
  - A new object with automatic storage duration is allocated for the construct
- firstprivate(variable-list)
  - The new object is initialized with the value of the old object that existed prior to the construct
- lastprivate(variable-list)
  - The value of the private object corresponding to the last iteration (or the last section) is assigned to the original object
31. private - Example
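The original example is missing; a minimal sketch of private (with firstprivate, each copy would instead start with the value the variable had before the region):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int tid;

    #pragma omp parallel private(tid)
    {
      /* each thread works on its own (uninitialized) copy of tid */
      tid = omp_get_thread_num();
      printf("Hello from thread %d\n", tid);
    }
    return 0;
  }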
32. lastprivate - Example
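The original example is missing; a minimal sketch in which the value from the sequentially last iteration is copied back to the original variable:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int i, last = 0;

    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < 100; i++)
      last = i;                 /* the copy from iteration i == 99 survives */

    printf("last = %d\n", last);   /* prints 99 */
    return 0;
  }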
33. shared, default, reduction
- shared(variable-list)
- default(shared | none)
  - Specifies the sharing behavior of all variables visible in the construct
- reduction(op : variable-list)
  - A private copy of each variable is made for each thread
  - The final value of the original object at the end of the reduction is the combination of all the private copies
34. default - Example
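The original example is missing; a minimal sketch of default(none), which forces every variable used in the region to be given an explicit data-sharing attribute:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int a = 1, b = 2, c = 0;

    /* with default(none), omitting a, b, or c from the clauses is a compile error */
    #pragma omp parallel default(none) shared(a, b, c)
    {
      #pragma omp critical
      c = a + b;
    }
    printf("c = %d\n", c);
    return 0;
  }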
35. reduction - Example
  #include <omp.h>
  #include <stdio.h>

  main ()
  {
    int i, n, chunk;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100; chunk = 10; result = 0.0;
    for (i=0; i < n; i++) {
      a[i] = i * 1.0; b[i] = i * 2.0;
    }

    #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
    for (i=0; i < n; i++)
      result = result + (a[i] * b[i]);

    printf("Final result= %f\n", result);
  }
36. copyin, copyprivate
- copyin(variable-list)
  - Applicable to threadprivate variables
  - The value of each variable in the master thread is copied to the thread-private copies of the other threads
- copyprivate(variable-list)
  - Appears on a single directive
  - The variables in variable-list are broadcast to the other threads in the team from the thread that executed the single construct
37. copyprivate - Example
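The original example is missing; a minimal sketch where one thread produces a value inside single and copyprivate broadcasts it to the private copies of all other threads:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int value;

    #pragma omp parallel private(value)
    {
      #pragma omp single copyprivate(value)
      value = 100;   /* e.g. read once from a file or computed by one thread */

      /* every thread's private copy now holds 100 */
      printf("thread %d has value %d\n", omp_get_thread_num(), value);
    }
    return 0;
  }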
38. Nested parallelism
- A parallel directive nested within another parallel directive
- By default it establishes a new team consisting of only the current thread
- If nested parallelism is enabled, the current thread can spawn a new team of multiple threads (see the sketch below)
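A minimal sketch of nested parallelism (not from the original slides), assuming the implementation supports it; once nesting is enabled, each outer thread becomes the master of its own inner team:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    omp_set_nested(1);   /* nested parallelism is off by default */

    #pragma omp parallel num_threads(2)
    {
      int outer = omp_get_thread_num();

      #pragma omp parallel num_threads(2)
      printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
    }
    return 0;
  }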
39. Library Routines (API)
- Querying functions (number of threads, etc.)
- General-purpose locking routines
- Setting the execution environment (dynamic threads, nested parallelism, etc.)
40. API
- omp_set_num_threads(num_threads)
- omp_get_num_threads()
- omp_get_max_threads()
- omp_get_thread_num()
- omp_get_num_procs()
- omp_in_parallel()
- omp_set_dynamic(dynamic_threads)
- omp_get_dynamic()
- omp_set_nested(nested)
- omp_get_nested()
41. API (Contd.)
- omp_init_lock(omp_lock_t *lock)
- omp_init_nest_lock(omp_nest_lock_t *lock)
- omp_destroy_lock(omp_lock_t *lock)
- omp_destroy_nest_lock(omp_nest_lock_t *lock)
- omp_set_lock(omp_lock_t *lock)
- omp_set_nest_lock(omp_nest_lock_t *lock)
- omp_unset_lock(omp_lock_t *lock)
- omp_unset_nest_lock(omp_nest_lock_t *lock)
- omp_test_lock(omp_lock_t *lock)
- omp_test_nest_lock(omp_nest_lock_t *lock)
- omp_get_wtime()
- omp_get_wtick()
42. Lock details
- Simple locks and nestable locks
- A simple lock may not be set again if it is already in a locked state
- A nestable lock can be set multiple times by the same thread
- A simple lock is available if it is unlocked
- A nestable lock is available if it is unlocked or already owned by the calling thread
43. Example: Lock functions
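The original example is missing; a minimal sketch using a simple lock to serialize updates of a shared sum:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    omp_lock_t lock;
    int sum = 0;

    omp_init_lock(&lock);

    #pragma omp parallel shared(sum, lock)
    {
      omp_set_lock(&lock);           /* blocks until the lock is acquired */
      sum += omp_get_thread_num();
      omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("sum = %d\n", sum);
    return 0;
  }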
44. Example: Nested lock
45. Example: Nested lock (Contd.)
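The original example for slides 44-45 is missing; a sketch (with assumed helper names, for illustration only) where the same thread legally acquires a nestable lock more than once:

  #include <omp.h>

  omp_nest_lock_t lock;
  int tasks = 0;

  void add_task(void)               /* hypothetical helper */
  {
    omp_set_nest_lock(&lock);       /* may already be held by this thread */
    tasks++;
    omp_unset_nest_lock(&lock);
  }

  void add_pair_of_tasks(void)      /* hypothetical helper */
  {
    omp_set_nest_lock(&lock);       /* first acquisition */
    add_task();                     /* second acquisition by the same thread is legal */
    add_task();
    omp_unset_nest_lock(&lock);
  }

  int main(void)
  {
    omp_init_nest_lock(&lock);

    #pragma omp parallel
    add_pair_of_tasks();

    omp_destroy_nest_lock(&lock);
    return 0;
  }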
46. Environment Variables
- OMP_SCHEDULE
  - setenv OMP_SCHEDULE "guided, 4"
  - setenv OMP_SCHEDULE "dynamic"
- OMP_NUM_THREADS
  - setenv OMP_NUM_THREADS 8
- OMP_DYNAMIC
  - setenv OMP_DYNAMIC TRUE
- OMP_NESTED
  - setenv OMP_NESTED TRUE
47. Hybrid Programming: Combining MPI and OpenMP benefits
- MPI
  - Explicit parallelism, no synchronization problems
  - Suitable for coarse-grain parallelism
- OpenMP
  - Easy to program, dynamic scheduling allowed
  - Only for shared memory; data synchronization problems
- MPI/OpenMP hybrid
  - Can combine MPI data placement with OpenMP fine-grain parallelism
  - Suitable for clusters of SMPs ("clumps")
  - Can implement a hierarchical model
48. Hierarchical Model
49. Benefits of Mixed Modes
- When MPI codes scale poorly
- When MPI codes have load-balance problems
- When MPI codes have memory-related problems within a single process
- For restricted MPI process applications (e.g., only power-of-2 numbers of processes)
- When the MPI implementation is poorly optimized
- When there are efficient shared memory algorithms
50. Case 1: WaTor / Laplace problem with the hybrid model
- Divide the grid among processes using MPI; within a process, create threads (using PARALLEL DO) to share the work (hierarchical model), or
- Divide the whole domain into a fixed number of threads and processes (see diagram)
51. Case 2: Molecular dynamics (Henty)
- A list of links between particles that are separated by less than the cut-off distance
- The main computation is to calculate the forces on the links
- MPI implementation: domain decomposition (parallelization across cells) with a block-cyclic distribution
- OpenMP: parallelization across links (automatic load balancing)
  - Force loop parallelized over links
  - Force updates done with atomic operations
- Hybrid: combination of domain decomposition and parallelization across links
  - May be less efficient than the others depending on the block size