Title: OpenMP
1. OpenMP
Credits/Sources: OpenMP C/C++ standard (openmp.org); OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/Introduction); OpenMP SC99 tutorial presentation (openmp.org); Dr. Eric Strohmaier (University of Tennessee, CS594 class, Feb 9, 2000)
2. Introduction
- An API for multi-threaded shared memory parallelism
- A specification for a set of compiler directives, library routines, and environment variables (standardizing pragmas)
- Standardized by a group of hardware and software vendors
- Both fine-grain and coarse-grain parallelism (orphaned directives)
- Much easier to program than MPI
3. History
- Many different vendors provided their own compiler directives for shared memory programming using threads
- The OpenMP standard effort started in 1997
- October 1997: Fortran version 1.0
- Late 1998: C/C++ version 1.0
- June 2000: Fortran version 2.0
- April 2002: C/C++ version 2.0
4. Introduction
- Parallelism: loop-level (fine-grained) and coarse-grained parallelism
- Threaded parallelism
- Explicit parallelism
- No task-level parallelism
- Supports nested parallelism
- Follows the fork-join model
- The number of threads can vary from one region to another
- Based on compiler directives
- It is the user's responsibility to ensure program correctness, avoid deadlocks, etc.
5. Execution Model
- Execution begins as a single thread, called the master thread
- Fork: when a parallel construct is encountered, a team of threads is created
- Statements in the parallel region are executed in parallel
- Join: at the end of the parallel region, the team threads synchronize and terminate
6. Definitions
- Construct: a statement containing a directive and a structured block
- Directive: #pragma omp <id> <other text>
- Based on C #pragma directives
- #pragma omp directive-name [clause[ [,] clause] ...] new-line
- Example:
- #pragma omp parallel default(shared) private(beta,pi)
7. Types of constructs, Calls, Variables
- Work-sharing constructs
- Synchronization constructs
- Data environment constructs
- Library calls, environment variables
8. parallel construct
- #pragma omp parallel [clause[ [,] clause] ...] new-line
- structured-block
- Clauses: if, num_threads, default, private, firstprivate, shared, copyin, reduction
9. Parallel construct
- A parallel region is executed by multiple threads
- Implied barrier at the end of the parallel region
- If the num_threads clause, omp_set_num_threads(), and OMP_NUM_THREADS are not used, the number of threads created is implementation dependent
- The number of threads can be dynamically adjusted using omp_set_dynamic() or OMP_DYNAMIC
- The number of physical processors hosting the threads is also implementation dependent
- Threads are numbered from 0 to N-1
- Nested parallelism: embedding one parallel construct inside another
10. Parallel construct - Example
  #include <omp.h>
  #include <stdio.h>

  main ()
  {
    int nthreads, tid;
    #pragma omp parallel private(nthreads, tid)
    {
      tid = omp_get_thread_num();
      printf("Hello World from thread %d\n", tid);
    }
  }
11. Work-sharing constructs
- For distributing the execution among the threads that encounter it
- 3 types: for, sections, single
12. for construct
- For distributing the loop iterations among the threads
- #pragma omp for [clause[ [,] clause] ...] new-line
- for-loop
- Clauses: private, firstprivate, lastprivate, reduction, ordered, schedule, nowait
13. for construct
- The structure of the for loop is restricted so that the compiler can determine the number of iterations (e.g., no branching out of the loop)
- The assignment of iterations to threads depends on the schedule clause
- Implicit barrier at the end of for unless nowait is specified
14. schedule clause
- schedule(static, chunk_size): iterations/chunk_size chunks are distributed round-robin among the threads
- schedule(dynamic, chunk_size): a chunk of chunk_size iterations is given to the next ready thread
- schedule(guided, chunk_size): the actual chunk size given to the next ready thread is unassigned_iterations/(threads*chunk_size), so chunk sizes decrease exponentially
- schedule(runtime): decision made at run time; implementation dependent
15. for - Example
  #include <omp.h>
  #define CHUNKSIZE 100
  #define N 1000

  main ()
  {
    int i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i=0; i < N; i++)
      a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
      #pragma omp for schedule(dynamic,chunk) nowait
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
    } /* end of parallel section */
  }
16. sections construct
- For distributing non-iterative blocks of code (sections) among threads
- Clauses: private, firstprivate, lastprivate, reduction, nowait
17. sections - Example
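The example code for this slide is missing; below is a minimal sketch (not from the original slides) with two sections, each executed once by some thread of the team:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    #pragma omp parallel
    {
      #pragma omp sections
      {
        #pragma omp section
        printf("Section 1 run by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("Section 2 run by thread %d\n", omp_get_thread_num());
      } /* implicit barrier here unless nowait is specified */
    }
    return 0;
  }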
18. single directive
- The block is executed by only one thread in the team; the other threads wait at an implicit barrier at the end of the single construct unless nowait is specified
19. single - Example
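The original example is not included; a small illustrative sketch, assuming a shared variable that must be initialized by exactly one thread:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int input = 0;

    #pragma omp parallel shared(input)
    {
      #pragma omp single
      input = 42;   /* executed by one thread; the others wait at the implicit barrier */

      printf("Thread %d sees input = %d\n", omp_get_thread_num(), input);
    }
    return 0;
  }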
20. Combined parallel work-sharing directives
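The slide body is missing. As an illustration (assumed, not taken from the source), parallel for and parallel sections merge a parallel directive and a work-sharing directive into one:

  #include <omp.h>

  int main(void)
  {
    float a[100], b[100], c[100];
    int i;

    for (i = 0; i < 100; i++) { a[i] = i; b[i] = 2 * i; }

    /* shorthand for a parallel region containing a single for construct */
    #pragma omp parallel for shared(a,b,c) private(i)
    for (i = 0; i < 100; i++)
      c[i] = a[i] + b[i];

    return 0;
  }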
21. Synchronization directives
- critical, atomic, barrier, flush, ordered, master
22. critical - Example
  #include <omp.h>

  main()
  {
    int x;
    x = 0;

    #pragma omp parallel shared(x)
    {
      #pragma omp critical
      x = x + 1;
    }
  }
23. atomic - Example
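The original example is missing; a minimal sketch of atomic, which protects a single memory update (compare with the critical example above):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int counter = 0;

    #pragma omp parallel shared(counter)
    {
      #pragma omp atomic
      counter += 1;   /* the update of counter is performed atomically */
    }
    printf("counter = %d\n", counter);
    return 0;
  }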
24. flush directive
- A point at which a consistent view of memory is provided among the threads
- Thread-visible variables (global variables, shared variables, etc.) are written back to memory
- If var-list is given, only the variables in the list are flushed
25. flush - Example
26. flush - Example (Contd.)
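The code for slides 25-26 is missing; below is a sketch of the usual producer/consumer flag idiom (an assumption, not the original example), where flush makes the writes visible in the intended order:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int data = 0, flag = 0;

    #pragma omp parallel sections shared(data, flag)
    {
      #pragma omp section
      {                              /* producer */
        data = 42;
        #pragma omp flush(data)
        flag = 1;
        #pragma omp flush(flag)
      }
      #pragma omp section
      {                              /* consumer: spin until the flag becomes visible */
        while (1) {
          #pragma omp flush(flag)
          if (flag) break;
        }
        #pragma omp flush(data)
        printf("data = %d\n", data);
      }
    }
    return 0;
  }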
27. ordered - Example
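The original example is missing; a small sketch, assuming a loop whose output must appear in iteration order even though the iterations run in parallel:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int i;

    /* the ordered clause on the for directive is required for the ordered construct */
    #pragma omp parallel for ordered schedule(dynamic)
    for (i = 0; i < 8; i++) {
      #pragma omp ordered
      printf("iteration %d\n", i);   /* printed in loop order */
    }
    return 0;
  }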
28. Data Environment
- Global variables in the variable-list of a threadprivate directive are made private to each thread
- Each thread gets its own copy
- The copies persist between different parallel regions

  #include <omp.h>
  #include <stdio.h>
  int alpha[10], beta[10], i;
  #pragma omp threadprivate(alpha)

  main ()
  {
    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);
    /* First parallel region */
    #pragma omp parallel private(i,beta)
    for (i=0; i < 10; i++)
      alpha[i] = beta[i] = i;
    /* Second parallel region */
    #pragma omp parallel
    printf("alpha[3]= %d and beta[3]= %d\n", alpha[3], beta[3]);
  }
29. Data Scope Attribute Clauses
- Most variables are shared by default
- Data scopes can be explicitly specified with data scope attribute clauses
- Shared by default (if not specified in a threadprivate directive):
  - Static variables in the dynamic extent
  - Heap-allocated memory
  - Global variables
- Clauses:
  - private
  - firstprivate
  - lastprivate
  - shared
  - default
  - reduction
  - copyin
  - copyprivate
30. private, firstprivate, lastprivate
- private(variable-list)
  - variable-list is private to each thread
  - A new object with automatic storage duration is allocated for the construct
- firstprivate(variable-list)
  - The new object is initialized with the value of the old object that existed prior to the construct
- lastprivate(variable-list)
  - The value of the private object corresponding to the last iteration (or the last section) is assigned to the original object
31. private - Example
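The original example is missing; a minimal sketch of private (with firstprivate, each copy would instead start with the value the variable had before the region):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int tid;

    #pragma omp parallel private(tid)
    {
      /* each thread works on its own (uninitialized) copy of tid */
      tid = omp_get_thread_num();
      printf("Hello from thread %d\n", tid);
    }
    return 0;
  }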
32. lastprivate - Example
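The original example is missing; a minimal sketch in which the value from the sequentially last iteration is copied back to the original variable:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int i, last = 0;

    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < 100; i++)
      last = i;                 /* the copy from iteration i == 99 survives */

    printf("last = %d\n", last);   /* prints 99 */
    return 0;
  }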
33. shared, default, reduction
- shared(variable-list)
- default(shared | none)
  - Specifies the sharing behavior of all variables visible in the construct
- reduction(op : variable-list)
  - A private copy of each variable is made for each thread
  - The final value of the original object at the end of the reduction is the combination of all the private copies
34. default - Example
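The original example is missing; a minimal sketch of default(none), which forces every variable used in the region to be given an explicit data-sharing attribute:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int a = 1, b = 2, c = 0;

    /* with default(none), omitting a, b, or c from the clauses is a compile error */
    #pragma omp parallel default(none) shared(a, b, c)
    {
      #pragma omp critical
      c = a + b;
    }
    printf("c = %d\n", c);
    return 0;
  }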
35. reduction - Example
  #include <omp.h>
  #include <stdio.h>

  main ()
  {
    int i, n, chunk;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100; chunk = 10; result = 0.0;
    for (i=0; i < n; i++) {
      a[i] = i * 1.0; b[i] = i * 2.0;
    }

    #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
    for (i=0; i < n; i++)
      result = result + (a[i] * b[i]);

    printf("Final result= %f\n", result);
  }
36. copyin, copyprivate
- copyin(variable-list)
  - Applicable to threadprivate variables
  - The value of each variable in the master thread is copied to the thread-private copies of the other threads
- copyprivate(variable-list)
  - Appears on a single directive
  - The variables in variable-list are broadcast to the other threads in the team from the thread that executed the single construct
37. copyprivate - Example
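The original example is missing; a minimal sketch where one thread produces a value inside single and copyprivate broadcasts it to the private copies of all other threads:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    int value;

    #pragma omp parallel private(value)
    {
      #pragma omp single copyprivate(value)
      value = 100;   /* e.g. read once from a file or computed by one thread */

      /* every thread's private copy now holds 100 */
      printf("thread %d has value %d\n", omp_get_thread_num(), value);
    }
    return 0;
  }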
38. Nested parallelism
- A parallel directive nested within another parallel directive
- By default it establishes a new team consisting of only the current thread
- If nested parallelism is enabled, the current thread can spawn a new team of multiple threads (see the sketch below)
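A minimal sketch of nested parallelism (not from the original slides), assuming the implementation supports it; once nesting is enabled, each outer thread becomes the master of its own inner team:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    omp_set_nested(1);   /* nested parallelism is off by default */

    #pragma omp parallel num_threads(2)
    {
      int outer = omp_get_thread_num();

      #pragma omp parallel num_threads(2)
      printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
    }
    return 0;
  }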
39. Library Routines (API)
- Querying functions (number of threads, etc.)
- General-purpose locking routines
- Setting the execution environment (dynamic threads, nested parallelism, etc.)
40. API
- omp_set_num_threads(num_threads)
- omp_get_num_threads()
- omp_get_max_threads()
- omp_get_thread_num()
- omp_get_num_procs()
- omp_in_parallel()
- omp_set_dynamic(dynamic_threads)
- omp_get_dynamic()
- omp_set_nested(nested)
- omp_get_nested()
41. API (Contd.)
- omp_init_lock(omp_lock_t *lock)
- omp_init_nest_lock(omp_nest_lock_t *lock)
- omp_destroy_lock(omp_lock_t *lock)
- omp_destroy_nest_lock(omp_nest_lock_t *lock)
- omp_set_lock(omp_lock_t *lock)
- omp_set_nest_lock(omp_nest_lock_t *lock)
- omp_unset_lock(omp_lock_t *lock)
- omp_unset_nest_lock(omp_nest_lock_t *lock)
- omp_test_lock(omp_lock_t *lock)
- omp_test_nest_lock(omp_nest_lock_t *lock)
- omp_get_wtime()
- omp_get_wtick()
42. Lock details
- Simple locks and nestable locks
- A simple lock may not be set again if it is already in a locked state
- A nestable lock can be set multiple times by the same thread
- A simple lock is available if it is unlocked
- A nestable lock is available if it is unlocked or already owned by the calling thread
43. Example: Lock functions
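The original example is missing; a minimal sketch using a simple lock to serialize updates of a shared sum:

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    omp_lock_t lock;
    int sum = 0;

    omp_init_lock(&lock);

    #pragma omp parallel shared(sum, lock)
    {
      omp_set_lock(&lock);           /* blocks until the lock is acquired */
      sum += omp_get_thread_num();
      omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("sum = %d\n", sum);
    return 0;
  }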
44. Example: Nested lock
45. Example: Nested lock (Contd.)
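The original example for slides 44-45 is missing; a sketch (with assumed helper names, for illustration only) where the same thread legally acquires a nestable lock more than once:

  #include <omp.h>

  omp_nest_lock_t lock;
  int tasks = 0;

  void add_task(void)               /* hypothetical helper */
  {
    omp_set_nest_lock(&lock);       /* may already be held by this thread */
    tasks++;
    omp_unset_nest_lock(&lock);
  }

  void add_pair_of_tasks(void)      /* hypothetical helper */
  {
    omp_set_nest_lock(&lock);       /* first acquisition */
    add_task();                     /* second acquisition by the same thread is legal */
    add_task();
    omp_unset_nest_lock(&lock);
  }

  int main(void)
  {
    omp_init_nest_lock(&lock);

    #pragma omp parallel
    add_pair_of_tasks();

    omp_destroy_nest_lock(&lock);
    return 0;
  }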
46. Environment Variables
- OMP_SCHEDULE
  - setenv OMP_SCHEDULE "guided, 4"
  - setenv OMP_SCHEDULE "dynamic"
- OMP_NUM_THREADS
  - setenv OMP_NUM_THREADS 8
- OMP_DYNAMIC
  - setenv OMP_DYNAMIC TRUE
- OMP_NESTED
  - setenv OMP_NESTED TRUE
47. Hybrid Programming: Combining MPI and OpenMP benefits
- MPI
  - Explicit parallelism, no synchronization problems
  - Suitable for coarse-grain parallelism
- OpenMP
  - Easy to program, dynamic scheduling allowed
  - Only for shared memory; data synchronization problems
- MPI/OpenMP hybrid
  - Can combine MPI data placement with OpenMP fine-grain parallelism
  - Suitable for clusters of SMPs ("clumps")
  - Can implement a hierarchical model
48. Hierarchical Model
49. Benefits of Mixed Modes
- When MPI codes scale poorly
- When MPI codes have load-balance problems
- When MPI codes have memory-related problems within a single process
- For restricted MPI process applications (e.g., only power-of-2 numbers of processes)
- When the MPI implementation is poorly optimized
- When there are efficient shared memory algorithms
50. Case 1: WaTor / Laplace problem with the hybrid model
- Divide the grid among processes using MPI; within a process, create threads (using PARALLEL DO) to share the work (hierarchical model), or
- Divide the whole domain into a fixed number of threads and processes (see diagram)
51. Case 2: Molecular dynamics (Henty)
- A list of links between particles that are separated by less than the cut-off distance
- The main computation is to calculate the forces on the links
- MPI implementation: domain decomposition (parallelization across cells) with a block-cyclic distribution
- OpenMP: parallelization across links (automatic load balancing)
  - Force loop parallelized over links
  - Force updates done with atomic operations
- Hybrid: combination of domain decomposition and parallelization across links
  - May be less efficient than the others depending on the block size