1
Tuesday, November 07, 2006
  • If anything can go wrong, it will.
  • Murphy's Law

5
Shared Memory Model
  • A collection of processors, each with access to the same shared memory.
  • Processors can interact and synchronize with each
    other through shared variables.

6
Shared Memory Programming
  • It is possible to write parallel programs for multiprocessors using MPI.
  • But we can achieve better performance by using a
    programming model tailored for a shared-memory
    environment.

7
OpenMP
  • On shared-memory multiprocessors, memory among
    processors can be shared.
  • A directive-based OpenMP Application Program
    Interface (API) has been developed specifically
    for shared-memory parallel processing.
  • Directives assist the compiler in the
    parallelization of application codes.

8
  • In the past, almost all major manufacturers of high-performance shared-memory multiprocessor computers had their own sets of directives.
  • The functionalities and syntaxes of these
    directive sets varied among vendors.

9
  • Code portability

10
  • To define a standard ensuring code portability across shared-memory platforms, an independent organization, openmp.org, was established in 1996.
  • As a result, the OpenMP API came into being in
    1997.
  • The primary benefit of using OpenMP is the
    relative ease of code parallelization made
    possible by the shared-memory architecture.

11
OpenMP
  • OpenMP has broad support from many major computer
    hardware and software manufacturers.
  • Similar to MPI's achievement as the standard for
    distributed-memory parallel processing, OpenMP
    has emerged as the standard for shared-memory
    parallel computing.

12
  • fork
  • join

13
  • The standard view of parallelism in a shared
    memory program is fork/join parallelism.
  • At the beginning of a program, only a single thread, called the master thread, is active.
  • At points where parallel operations are required,
    the master thread forks (creates/awakens)
    additional threads.

15
  • The master thread and child threads work
    concurrently through the parallel section.
  • At the end of the parallel code, the child threads die or are suspended, and the flow of control returns to the single master thread (join).
  • The number of active threads can change dynamically throughout the execution of the program (a minimal code sketch of this fork/join pattern follows this slide).
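A minimal sketch of the fork/join pattern described above, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); the messages and layout are illustrative, not from the slides:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("Before the parallel region: only the master thread is active.\n");

        /* fork: the master thread creates (or awakens) additional threads */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();   /* this thread's number in the team */
            printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
        }   /* join: the child threads die or are suspended here */

        printf("After the parallel region: only the master thread remains.\n");
        return 0;
    }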

17
Parallel for loops
  • Parallel operations are often expressed as loops.
  • With OpenMP it is easy to indicate when the iterations of a for loop may be executed in parallel.

18
  • for (i = first; i < size; i += prime)
  •     marked[i] = 1;
  • There is no dependence between one iteration of the loop and another.

19
  • for (i = first; i < size; i += prime)
  •     marked[i] = 1;
  • In OpenMP we will simply indicate that the iterations of the for loop may be executed in parallel.
  • The compiler will take care of generating the
    code that forks/joins threads and schedules the
    iterations.

20
pragma
  • A compiler directive in C/C++ is called a pragma (pragmatic information).
  • A pragma is used to communicate information to
    the compiler.

21
pragma
  • The compiler may ignore that information and still generate a correct object program.
  • Information provided by a pragma can help the compiler optimize the program.

22
parallel for pragma
  • #pragma omp <rest of pragma>

23
parallel for pragma
  • #pragma omp parallel for
  • Instructs the compiler to parallelize the for loop that immediately follows this directive.

24
parallel for pragma
  • #pragma omp parallel for
  • for (i = first; i < size; i += prime)
  •     marked[i] = 1;

25
parallel for pragma
  • #pragma omp parallel for
  • for (i = first; i < size; i += prime)
  •     marked[i] = 1;
  • The runtime system must have the information it needs to determine the number of iterations when it evaluates the control clause.
  • The for loop must not contain statements that allow the loop to be exited prematurely (e.g. break, return, exit, goto).
  • continue is allowed. (A complete, compilable example follows below.)
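A complete, compilable sketch of the loop above under the parallel for pragma; the array size and the values of first and prime are illustrative assumptions, not taken from the slides:

    #include <stdio.h>

    #define SIZE 1000                  /* illustrative problem size */

    int main(void)
    {
        char marked[SIZE] = {0};
        int  first = 4, prime = 2;     /* illustrative starting values */
        int  i;

        /* The iterations are independent, so they may be executed in parallel. */
        #pragma omp parallel for
        for (i = first; i < SIZE; i += prime)
            marked[i] = 1;

        printf("marked[10] = %d\n", marked[10]);
        return 0;
    }

Compiled with OpenMP enabled (e.g. gcc -fopenmp), the loop iterations are divided among the threads; compiled without it, the pragma is ignored and the loop runs serially.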

26
parallel for pragma
  • #pragma omp parallel for
  • for (i = first; i < size; i += prime)
  •     marked[i] = 1;
  • In the parallel for pragma, variables are shared by default, except the loop index, which is private.

27
parallel for pragma
  • int b[3];
  • char *cptr;
  • int i;
  • cptr = malloc(1);
  • #pragma omp parallel for
  • for (i = 0; i < 3; i++)
  •     b[i] = i;
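By the default rule on the previous slide, b and cptr are shared in this loop while the loop index i is private. The same data-sharing can be spelled out explicitly with shared and private clauses, as in this sketch (the explicit clauses are the only addition to the slide's code):

    #include <stdlib.h>

    int main(void)
    {
        int  b[3];
        char *cptr;
        int  i;

        cptr = malloc(1);

        /* Equivalent to the defaults: b and cptr are shared, i is private. */
        #pragma omp parallel for shared(b, cptr) private(i)
        for (i = 0; i < 3; i++)
            b[i] = i;

        free(cptr);
        return 0;
    }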

28
parallel for pragma
  • for (i = 2; i <= 5; i++)
  •     a[i] = a[i] + a[i-1];

29
parallel for pragma
  • for (i = 2; i <= 5; i++)
  •     a[i] = a[i] + a[i-1];

Assume that the array a has been initialized with
integers from 1-5.
30
parallel for pragma
  • Suppose we have 2 threads.
  • Assume the first thread (thread 0) is assigned loop indices 2 and 3.
  • The second thread (thread 1) is assigned 4 and 5.

31
parallel for pragma
  • One possible order of execution:
  • Thread 1 performs the computation for i = 4, reading the value of a[3] before thread 0 has completed the computation for i = 3, which updates a[3].

32
  • OpenMP will do what you tell it to.
  • If you parallelize a loop with a data dependence, it will give the wrong result (the sketch below demonstrates this).
  • The programmer is responsible for the correctness of the code.
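A sketch that demonstrates the dependence; the initialization follows the slides (the integers 1 through 5), while the printing and array size are illustrative. Run serially, the loop produces 1, 3, 6, 10, 15; run in parallel it may not, because a[i-1] can be read before the thread responsible for iteration i-1 has written it:

    #include <stdio.h>

    int main(void)
    {
        /* a[1]..a[5] hold the integers 1-5; a[0] is unused. */
        int a[6] = {0, 1, 2, 3, 4, 5};
        int i;

        /* Incorrect parallelization: iteration i reads a[i-1], which
           iteration i-1 writes, so the result depends on thread timing. */
        #pragma omp parallel for
        for (i = 2; i <= 5; i++)
            a[i] = a[i] + a[i-1];

        for (i = 1; i <= 5; i++)
            printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }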

33
parallel for pragma
  • The runtime system needs to know how many threads
    to create.
  • There are several ways to specify the number of
    threads to be used.
  • One of these is to set the environment variable
    OMP_NUM_THREADS to the required number of
    threads.

34
parallel for pragma
  • Environment variable OMP_NUM_THREADS
  • In bash:
  • export OMP_NUM_THREADS=4
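Two other standard ways to specify the thread count, sketched below: the omp_set_num_threads() library call and the num_threads clause (the counts of 4 and 2 are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Request 4 threads for subsequent parallel regions. */
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            #pragma omp single
            printf("team size after omp_set_num_threads: %d\n",
                   omp_get_num_threads());
        }

        /* Request a team size for this one region only. */
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            printf("team size from num_threads clause: %d\n",
                   omp_get_num_threads());
        }
        return 0;
    }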

35
parallel for pragma
  • The loop indices are distributed among the
    specified number of threads.
  • The way in which the loop indices are distributed
    is known as the schedule.

36
parallel for pragma
  • In the "static" schedule, which is typically the
    default, each thread will get a chunk of indices
    of approximately equal size.
  • For example, if the loop goes from 1 to 100 and
    there are 3 threads,
  • The first thread will process i = 1 through i = 34.
  • The second thread will process i = 35 through i = 67.
  • The third thread will process i = 68 through i = 100.
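A sketch that makes the static distribution visible: each iteration prints the thread that executes it. The explicit schedule(static) clause simply names the typical default; with OMP_NUM_THREADS=3 a common split is the 1-34 / 35-67 / 68-100 described above, though the exact chunk boundaries are implementation dependent:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i;

        /* With schedule(static) and no chunk size, each thread receives one
           contiguous block of iterations of approximately equal size. */
        #pragma omp parallel for schedule(static)
        for (i = 1; i <= 100; i++)
            printf("thread %d handles i = %d\n", omp_get_thread_num(), i);

        return 0;
    }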

37
parallel for pragma
  • There is an implied barrier at the end of the
    loop
  • Each thread will wait at the end of the loop
    until all threads have reached that point before
    they continue.
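Within a parallel region, the worksharing for construct carries the same implied barrier; OpenMP's nowait clause removes it when the code that follows does not depend on the loop's results. A sketch (array names and sizes are illustrative):

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N];
        int i;

        #pragma omp parallel
        {
            /* nowait removes the implied barrier at the end of this loop;
               that is safe here only because the next loop never reads a[]. */
            #pragma omp for nowait
            for (i = 0; i < N; i++)
                a[i] = i;

            /* This loop keeps its implied barrier: every thread waits here
               before leaving the parallel region. */
            #pragma omp for
            for (i = 0; i < N; i++)
                b[i] = 2.0 * i;
        }

        printf("a[N-1] + b[N-1] = %f\n", a[N-1] + b[N-1]);
        return 0;
    }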

38
  • A sequential program is a special case of a shared-memory parallel program (i.e. one with no forks/joins in it).

39
  • Directives, as well as OpenMP function calls, are treated as comments when OpenMP is not enabled or not available during compilation; the code is, in effect, a serial code.
  • This affords a unified code base for both serial and parallel applications, which can ease code maintenance.
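The standard _OPENMP macro, defined only when OpenMP compilation is enabled, is one common way to keep a single source that also builds and runs serially; a small sketch:

    #include <stdio.h>

    #ifdef _OPENMP
    #include <omp.h>              /* available only in an OpenMP build */
    #endif

    int main(void)
    {
        #pragma omp parallel      /* ignored by a non-OpenMP compiler */
        {
    #ifdef _OPENMP
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
    #else
            printf("serial build: a single thread\n");
    #endif
        }
        return 0;
    }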

40
  • The shared-memory model supports incremental parallelization.
  • Incremental parallelization is the process of
    transforming a sequential program into a parallel
    program one block of code at a time.

41
  • Benefits of incremental parallelization?

42
  • Benefits of incremental parallelization:
  • Profile the execution of the sequential program.
  • Sort the program blocks in terms of the time they consume.
  • Parallelize the most time-consuming blocks first, where the potential payoff is greatest.

43
  • An overhead is incurred any time parallel threads are spawned, such as by the parallel for directive; the cost is system dependent.
  • Therefore, when a short loop is parallelized, it will probably take longer to execute on multiple threads than on a single thread, since the overhead is greater than the time saved by parallelization.

44
"How long is long enough?"
  • The answer depends on the system and the loop under consideration.
  • As a very rough estimate: several thousand operations (total over all loop iterations, not per iteration).
  • There is only one way to know for sure:
  • Try parallelizing the loop, then time it and see whether it runs faster (a timing sketch follows below).
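One way to time it, as the slide suggests: compare the serial loop with the parallel one using omp_get_wtime(). The loop body and length below are illustrative; for a short loop the parallel version may well be slower because of the fork/join overhead discussed earlier.

    #include <stdio.h>
    #include <omp.h>

    #define N 100000                 /* illustrative loop length; vary and compare */

    static double a[N];

    int main(void)
    {
        int i;
        double t;

        t = omp_get_wtime();
        for (i = 0; i < N; i++)      /* serial version */
            a[i] = 0.5 * i;
        printf("serial:   %f s\n", omp_get_wtime() - t);

        t = omp_get_wtime();
        #pragma omp parallel for     /* parallel version */
        for (i = 0; i < N; i++)
            a[i] = 0.5 * i;
        printf("parallel: %f s\n", omp_get_wtime() - t);

        return 0;
    }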