1
OpenMP
  • Sathish Vadhiyar

Credits/Sources: OpenMP C/C++ standard (openmp.org);
OpenMP tutorial (http://www.llnl.gov/
computing/tutorials/openMP/Introduction); OpenMP
SC99 tutorial presentation (openmp.org); Dr. Eric
Strohmaier (University of Tennessee, CS594 class,
Feb 9, 2000)
2
Introduction
  • An API for multi-threaded shared memory
    parallelism
  • A specification for a set of compiler directives,
    library routines, and environment variables that
    standardizes such pragmas
  • Standardized by a group of hardware and software
    vendors
  • Both fine-grain and coarse-grain parallelism
    (orphaned directives)
  • Much easier to program than MPI

3
History
  • Many different vendors provided their own ways of
    compiler directives for shared memory programming
    using threads
  • OpenMP standard started in 1997
  • October 1997: Fortran version 1.0
  • Late 1998: C/C++ version 1.0
  • June 2000: Fortran version 2.0
  • April 2002: C/C++ version 2.0

4
Introduction
  • Parallelism
  • loop-level (fine-grained) and coarse-grained
    parallelism
  • Threaded parallelism
  • Explicit parallelism
  • No task level parallelism
  • Supports nested parallelism
  • Follows fork-join model
  • The number of threads can be varied from one
    region to another
  • Based on compiler directives
  • User's responsibility to ensure program
    correctness, avoid deadlocks, etc.

5
Execution Model
  • Begins as a single thread, called the master thread
  • Fork: when a parallel construct is encountered, a
    team of threads is created
  • Statements in the parallel region are executed in
    parallel
  • Join: at the end of the parallel region, the team's
    threads synchronize and terminate

6
Definitions
  • Construct: statement containing a directive and a
    structured block
  • Directive: #pragma <omp id> <other text>
  • Based on C #pragma directives
  • #pragma omp directive-name [clause[, clause] ...]
    new-line
  • Example:
  • #pragma omp parallel default(shared)
    private(beta,pi)

7
Types of constructs, Calls, Variables
  • Work-sharing constructs
  • Synchronization constructs
  • Data environment constructs
  • Library calls, environment variables

8
parallel construct
  • #pragma omp parallel [clause[, clause] ...]
    new-line
  • structured-block
  • Clauses:

9
Parallel construct
  • Parallel region executed by multiple threads
  • Implied barrier at the end of the parallel region
  • If the num_threads clause, omp_set_num_threads(),
    or the OMP_NUM_THREADS environment variable is not
    used, the number of created threads is
    implementation dependent
  • Dynamic adjustment of the number of threads can be
    enabled with omp_set_dynamic() or OMP_DYNAMIC
  • The number of physical processors hosting the
    threads is also implementation dependent
  • Threads are numbered from 0 to N-1
  • Nested parallelism by embedding one parallel
    construct inside another

10
Parallel construct - Example
#include <omp.h>
#include <stdio.h>

int main() {
  int nthreads, tid;
  #pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();   /* each thread reads its own id */
    printf("Hello World from thread %d\n", tid);
  }
  return 0;
}

11
Work sharing construct
  • For distributing the execution among the threads
    that encounter it
  • 3 types: for, sections, single

12
for construct
  • For distributing the iterations among the threads

#pragma omp for [clause[, clause] ...] new-line
  for-loop
Clauses:
13
for construct
  • Restrictions on the structure of the for loop so
    that the compiler can determine the number of
    iterations, e.g. no branching out of the loop
  • The assignment of iterations to threads depends on
    the schedule clause
  • Implicit barrier at the end of for if nowait is
    not specified

14
schedule clause
  • schedule(static, chunk_size):
    iterations/chunk_size chunks distributed in
    round-robin
  • schedule(dynamic, chunk_size): chunk_size chunk
    given to the next ready thread
  • schedule(guided, chunk_size): actual chunk size is
    unassigned_iterations/(threads*chunk_size), given
    to the next ready thread; thus exponential
    decrease in chunk sizes
  • schedule(runtime): decision at runtime;
    implementation dependent

15
for - Example
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

int main() {
  int i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i = 0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
  return 0;
}

16
sections construct
  • For distributing non-iterative sections among
    threads
  • Clauses:

17
sections - Example
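A minimal sketch (the two vector operations and array names are illustrative, following the for example above):

#include <omp.h>
#define N 1000

int main() {
  int i;
  float a[N], b[N], c[N], d[N];
  for (i = 0; i < N; i++) {   /* some initializations */
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section     /* one thread computes the sums */
      for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
      #pragma omp section     /* another thread computes the products */
      for (i = 0; i < N; i++)
        d[i] = a[i] * b[i];
    }
  }
  return 0;
}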
18
single directive
  • Only a single thread can execute the block

19
Single - Example
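A minimal sketch (the one-time printf standing in for the single thread's work is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    #pragma omp single   /* executed by exactly one thread; the others wait at the implicit barrier */
    printf("one-time work done by a single thread\n");
    printf("done by every thread\n");
  }
  return 0;
}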
20
Combined parallel work-sharing directives
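These fuse a parallel region with a single work-sharing construct, e.g. parallel for and parallel sections. A minimal sketch of parallel for (the arrays are illustrative):

#include <omp.h>
#define N 1000

int main() {
  int i;
  float a[N], b[N], c[N];
  for (i = 0; i < N; i++)   /* serial initialization */
    a[i] = b[i] = i * 1.0;
  /* one directive creates the team and distributes the loop */
  #pragma omp parallel for shared(a,b,c) private(i)
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
  return 0;
}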
21
Synchronization directives
22
critical - Example
#include <omp.h>

int main() {
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  }
  return 0;
}

23
atomic - Example
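A minimal sketch, mirroring the critical example above but protecting only the single memory update:

#include <omp.h>

int main() {
  int x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp atomic   /* the update of x is performed atomically */
    x += 1;
  }
  return 0;
}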
24
flush directive
  • Point at which a consistent view of memory is
    provided among the threads
  • Thread-visible variables (global variables, shared
    variables, etc.) are written back to memory
  • If var-list is used, only the variables in the
    list are flushed

25
flush - Example
26
flush Example (Contd)
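A minimal producer/consumer sketch (the data and flag variables are illustrative; assumes the team really gets two threads):

#include <omp.h>
#include <stdio.h>

int main() {
  int data = 0, flag = 0;
  #pragma omp parallel sections shared(data, flag) num_threads(2)
  {
    #pragma omp section          /* producer */
    {
      data = 42;
      #pragma omp flush(data)    /* make data visible before the flag */
      flag = 1;
      #pragma omp flush(flag)
    }
    #pragma omp section          /* consumer */
    {
      while (1) {                /* spin until the producer's flag is seen */
        #pragma omp flush(flag)
        if (flag) break;
      }
      #pragma omp flush(data)    /* ensure the producer's data is visible */
      printf("data = %d\n", data);
    }
  }
  return 0;
}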
27
ordered - Example
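A minimal sketch (printing the iteration numbers in order is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  int i;
  #pragma omp parallel for ordered schedule(dynamic)
  for (i = 0; i < 10; i++) {
    /* work before the ordered block may execute out of order */
    #pragma omp ordered
    printf("iteration %d\n", i);   /* printed in loop-iteration order */
  }
  return 0;
}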
28
Data Environment
  • threadprivate(variable-list): global variables in
    the list are made private to each thread
  • Each thread gets its own copy
  • The copies persist between different parallel
    regions

#include <omp.h>
#include <stdio.h>

int alpha[10], beta[10], i;
#pragma omp threadprivate(alpha)

int main() {
  /* Explicitly turn off dynamic threads */
  omp_set_dynamic(0);
  /* First parallel region */
  #pragma omp parallel private(i,beta)
  for (i = 0; i < 10; i++)
    alpha[i] = beta[i] = i;
  /* Second parallel region */
  #pragma omp parallel
  printf("alpha[3]= %d and beta[3]= %d\n",
         alpha[3], beta[3]);
  return 0;
}
29
Data Scope Attribute Clauses
  • Most variables are shared by default
  • Data scopes explicitly specified by data scope
    attribute clauses
  • Shared by default (unless named in a threadprivate
    directive):
  • static variables in the dynamic extent
  • heap-allocated memory
  • global variables
  • Clauses
  • private
  • firstprivate
  • lastprivate
  • shared
  • default
  • reduction
  • copyin
  • copyprivate

30
private, firstprivate, lastprivate
  • private (variable-list)
  • variable-list private to each thread
  • A new object with automatic storage duration
    allocated for the construct
  • firstprivate (variable-list)
  • The new object is initialized with the value of
    the old object that existed prior to the
    construct
  • lastprivate (variable-list)
  • The value of the private object corresponding to
    the last iteration or the last section is
    assigned to the original object

31
private - Example
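A minimal sketch (the initial value -1 is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  int i, tid = -1;
  #pragma omp parallel for private(i, tid)
  for (i = 0; i < 8; i++) {
    tid = omp_get_thread_num();   /* each thread writes only its own copy */
    printf("thread %d handles iteration %d\n", tid, i);
  }
  printf("tid after the region: %d\n", tid);   /* private copies do not update the original */
  return 0;
}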
32
lastprivate - Example
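A minimal sketch:

#include <omp.h>
#include <stdio.h>

int main() {
  int i, last = 0;
  #pragma omp parallel for lastprivate(last)
  for (i = 0; i < 100; i++)
    last = i;                    /* each thread updates its private copy */
  printf("last = %d\n", last);   /* 99, from the sequentially last iteration */
  return 0;
}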
33
shared, default, reduction
  • shared(variable-list)
  • default(shared | none)
  • Specifies the sharing behavior of all of the
    variables visible in the construct
  • reduction(op : variable-list)
  • Private copies of the variables are made for each
    thread
  • The final object value at the end of the reduction
    will be the combination of all the private object
    values

34
default - Example
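A minimal sketch combining default(none) with explicit scoping clauses (the variable names are illustrative):

#include <omp.h>

int main() {
  int i, n = 100, sum = 0;
  /* default(none): every variable used in the region must be scoped explicitly */
  #pragma omp parallel for default(none) shared(n) private(i) reduction(+:sum)
  for (i = 0; i < n; i++)
    sum += i;
  return 0;
}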
35
reduction - Example
#include <omp.h>
#include <stdio.h>

int main() {
  int i, n, chunk;
  float a[100], b[100], result;
  /* Some initializations */
  n = 100; chunk = 10; result = 0.0;
  for (i = 0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }
  #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
  for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);
  printf("Final result= %f\n", result);
  return 0;
}

36
copyin, copyprivate
  • copyin(variable-list)
  • Applicable to threadprivate variables
  • Value of the variable in the master thread is
    copied to the individual threadprivate copies
  • copyprivate(variable-list)
  • Appears on a single directive
  • Variables in variable-list are broadcast to other
    threads in the team from the thread that executed
    the single construct

37
copyprivate - Example
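A minimal sketch (the value 42 is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  int x;
  #pragma omp parallel private(x)
  {
    #pragma omp single copyprivate(x)   /* one thread sets x; the value is broadcast to all private copies */
    x = 42;
    printf("thread %d sees x = %d\n", omp_get_thread_num(), x);
  }
  return 0;
}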
38
Nested parallelism
  • A parallel directive nested within another
    parallel directive
  • Establishes a new team consisting of only the
    current thread (the default)
  • If nested parallelism is enabled, the current
    thread can spawn more threads (see the sketch
    below)
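
A minimal sketch (the team sizes of 2 are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_nested(1);   /* enable nested parallelism */
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(2)   /* each outer thread spawns its own team */
    printf("outer %d, inner %d\n", outer, omp_get_thread_num());
  }
  return 0;
}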

39
Library Routines (API)
  • Querying functions (number of threads, etc.)
  • General purpose locking routines
  • Setting execution environment (dynamic threads,
    nested parallelism etc.)

40
API
  • OMP_SET_NUM_THREADS(num_threads)
  • OMP_GET_NUM_THREADS()
  • OMP_GET_MAX_THREADS()
  • OMP_GET_THREAD_NUM()
  • OMP_GET_NUM_PROCS()
  • OMP_IN_PARALLEL()
  • OMP_SET_DYNAMIC(dynamic_threads)
  • OMP_GET_DYNAMIC()
  • OMP_SET_NESTED(nested)
  • OMP_GET_NESTED()
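
A minimal sketch using a few of these routines (the C names are lowercase):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(4);   /* request a team of 4 threads */
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)   /* master reports the team state */
      printf("%d threads, dynamic=%d, nested=%d\n",
             omp_get_num_threads(), omp_get_dynamic(), omp_get_nested());
  }
  return 0;
}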

41
API(Contd..)
  • omp_init_lock(omp_lock_t *lock)
  • omp_init_nest_lock(omp_nest_lock_t *lock)
  • omp_destroy_lock(omp_lock_t *lock)
  • omp_destroy_nest_lock(omp_nest_lock_t *lock)
  • omp_set_lock(omp_lock_t *lock)
  • omp_set_nest_lock(omp_nest_lock_t *lock)
  • omp_unset_lock(omp_lock_t *lock)
  • omp_unset_nest_lock(omp_nest_lock_t *lock)
  • omp_test_lock(omp_lock_t *lock)
  • omp_test_nest_lock(omp_nest_lock_t *lock)
  • omp_get_wtime()
  • omp_get_wtick()

42
Lock details
  • Simple locks and nestable locks
  • A simple lock may not be set again if it is
    already in a locked state
  • A nestable lock may be set multiple times by the
    same thread
  • A simple lock is available when it is unlocked
  • A nestable lock is available when it is unlocked
    or already owned by the calling thread

43
Example Lock functions
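A minimal sketch of the simple-lock routines (the shared counter is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_lock_t lock;
  int count = 0;
  omp_init_lock(&lock);
  #pragma omp parallel shared(count)
  {
    omp_set_lock(&lock);     /* blocks until the lock is acquired */
    count = count + 1;       /* protected update */
    omp_unset_lock(&lock);
  }
  omp_destroy_lock(&lock);
  printf("count = %d\n", count);
  return 0;
}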
44
Example Nested lock
45
Example Nested lock (Contd..)
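A minimal sketch of a nestable lock (the two-level call chain is illustrative):

#include <omp.h>

omp_nest_lock_t nlock;
int counter = 0;

void add(int n) {
  omp_set_nest_lock(&nlock);   /* may be re-acquired by the owning thread */
  counter += n;
  omp_unset_nest_lock(&nlock);
}

void add_twice(int n) {
  omp_set_nest_lock(&nlock);   /* outer acquisition */
  add(n);                      /* nested acquisition inside add() succeeds */
  add(n);
  omp_unset_nest_lock(&nlock);
}

int main() {
  omp_init_nest_lock(&nlock);
  #pragma omp parallel
  add_twice(1);
  omp_destroy_nest_lock(&nlock);
  return 0;
}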
46
Environment Variables
  • OMP_SCHEDULE
  • setenv OMP_SCHEDULE "guided, 4"
  • setenv OMP_SCHEDULE "dynamic"
  • OMP_NUM_THREADS
  • setenv OMP_NUM_THREADS 8
  • OMP_DYNAMIC
  • setenv OMP_DYNAMIC TRUE
  • OMP_NESTED

47
Hybrid Programming: Combining MPI and OpenMP
benefits
  • MPI
  •   - explicit parallelism, no synchronization
    problems
  •   - suitable for coarse-grain parallelism
  • OpenMP
  •   - easy to program, dynamic scheduling allowed
  •   - only for shared memory, data
    synchronization problems
  • MPI/OpenMP hybrid
  •   - can combine MPI data placement with OpenMP
    fine-grain parallelism
  •   - suitable for clusters of SMPs ("Clumps")
  •   - can implement a hierarchical model (see the
    sketch below)
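
A minimal hybrid sketch (assumes an MPI library providing MPI_Init_thread with at least MPI_THREAD_FUNNELED support):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, provided;
  /* coarse grain: one MPI process per node */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* fine grain: OpenMP threads inside each process */
  #pragma omp parallel
  printf("rank %d, thread %d\n", rank, omp_get_thread_num());
  MPI_Finalize();
  return 0;
}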

48
Hierarchical Model
49
Benefits of Mixed Modes
  • When MPI codes scale poorly
  •   - e.g., when MPI codes involve load balance
    problems
  • When MPI codes have memory-related problems for a
    single process
  • When applications are restricted in their MPI
    process counts, e.g., to powers of 2
  • When the MPI implementation is poorly optimized
  • When there are efficient shared memory algorithms

50
Case 1: WaTor / Laplace problem with hybrid model
  • Divide the grid into processes with MPI; within a
    process, create threads (using PARALLEL DO) to
    work on the subgrid (hierarchical model), or
  • Divide the whole domain into a fixed number of
    threads and processes (see diagram)

51
Case 2: Molecular dynamics (Henty)
  • List of links between particles that are
    separated by less than the cut-off distance
  • Main computation is to calculate the forces on
    the links
  • MPI implementation: domain decomposition
    (parallelization across cells) and block-cyclic
    distribution
  • OpenMP: parallelization across links (automatic
    load balancing)
  • Force loop parallelized over links
  •   - force updates by atomic operations
  • Hybrid: combination of domain decomposition and
    parallelization across links
  •   - may be less efficient than the others,
    depending on the block size