1
Shared Memory Parallel Programming
  • OpenMP Performance

2
OpenMP Overview
C$OMP FLUSH
#pragma omp critical
C$OMP THREADPRIVATE(/ABC/)
CALL OMP_SET_NUM_THREADS(10)
  • OpenMP: An API for Writing Multithreaded
    Applications
  • A set of compiler directives and library routines
    for parallel application programmers
  • Greatly simplifies writing multi-threaded (MT)
    programs in Fortran, C and C++
  • Standardizes the last 20 years of SMP practice

C$OMP parallel do shared(a, b, c)
call omp_test_lock(jlok)
call OMP_INIT_LOCK (ilok)
C$OMP MASTER
C$OMP ATOMIC
C$OMP SINGLE PRIVATE(X)
setenv OMP_SCHEDULE "dynamic"
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP ORDERED
C$OMP PARALLEL REDUCTION (+: A, B)
C$OMP SECTIONS
#pragma omp parallel for private(A, B)
!$OMP BARRIER
C$OMP PARALLEL COPYIN(/blk/)
C$OMP DO lastprivate(XX)
Nthrds = OMP_GET_NUM_PROCS()
omp_set_lock(lck)
The name OpenMP is the property of the OpenMP
Architecture Review Board.
3
How to Get Started?
  • First thing: figure out what takes the time in your
    sequential program -> profile it!
  • Typically, a few parts (a few loops) take the bulk of
    the time.
  • Parallelize those parts first, worrying about
    granularity and load balance (see the sketch below).
  • Advantage of shared memory: you can do that
    incrementally.
  • Then worry about locality.
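A minimal sketch of the incremental approach above (the arrays x and y,
the scalar a, and the length n are hypothetical, not from the slides):
once the profile shows that one loop dominates the run time, that loop
alone gets a directive while the rest of the program stays sequential.

    /* hypothetical hot loop identified by profiling (e.g. with gprof) */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* iterations are independent */

Later passes can then tune granularity, load balance and locality
without rewriting the whole program.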

4
Factors that Determine Speedup
  • Amount of sequential code.
  • Characteristics of parallel code
  • granularity
  • load balancing
  • locality
  • uniprocessor
  • multiprocessor
  • synchronization and communication

5
Major Performance Impact
  • Amdahl's law tells us that we need to avoid
    serial bottlenecks in code if we are to achieve
    scalable parallelism
  • If 1% of a program is serial, speedup is limited
    to 100x, no matter how many processors it is
    computed on
  • We must profile codes very carefully to find out
    how much of them is sequential

Time = tseq + tpar / p on p threads
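A quick worked check of the 1% claim, using the time model above with
illustrative numbers tseq = 1 and tpar = 99 (so the sequential run
takes 100): speedup(p) = 100 / (1 + 99/p), which is about 9.2 at
p = 10, 50 at p = 99, and approaches but never exceeds 100 however
many threads are used.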
6
Major Performance Impact
  • The equation is very coarse
  • Some code might be replicated across threads
  • It is essentially sequential
  • There are some overheads that are not present in
    the sequential program
  • A parallel program may use the cache better than
    sequential code
  • Since there is more cache available overall
  • This occasionally leads to superlinear speedup

Time = tseq + tpar / p on p threads
7
Uniprocessor Memory Hierarchy
[Diagram: uniprocessor memory hierarchy, with access time and size per
level: memory (100 cycles, 128Mb and up), L2 cache (20 cycles,
256-512k), L1 cache (2 cycles, 32-128k), CPU]
8
Shared Memory
[Diagram: multiple CPUs, each with its own L1 cache (2 cycles) and L2
cache (20 cycles), all connected to one shared memory (100 cycles)]
9
Distributed Shared Memory
[Diagram: each CPU has its own L1 cache (2 cycles), L2 cache (20
cycles) and local memory; memory is distributed across the machine and
remote accesses take 100s of cycles]
10
Recall Locality
  • Locality (or re-use): the extent to which a
    thread continues to use the same data or nearby
    data.
  • Temporal locality: if you have accessed a
    particular word in memory, you access that word
    again (before the line gets replaced).
  • Spatial locality: if you access a word in a
    cache line, you access other word(s) in that
    cache line before it gets replaced (illustrated below).
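A small illustration of spatial locality (the matrix a, its size N and
the accumulator sum are assumed for the example): C stores rows
contiguously, so a row-major traversal uses every word of a cache line
before it is replaced, while a column-major traversal of the same data
touches a different line on almost every access once N is large.

    /* good spatial locality: consecutive accesses hit the same line */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* poor spatial locality: stride-N accesses across rows */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];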

11
Bottom Line
  • To get good performance,
  • You have to have a high hit rate.
  • You have to continue to access data close
    to the data that you accessed recently.
  • Each thread must access its data with good locality
  • This matters much more than for sequential code,
    since access costs are higher
  • The penalty for getting it wrong can be severe

12
Scalability
  • Scalability: the performance of the program for large p
  • If performance grows roughly in proportion to p, the code
    is scalable
  • In practice, some reasonable growth may be
    sufficient
  • Often extremely difficult to achieve

13
Granularity
  • Granularity: the size of the piece of code that is
    executed by a single processor.
  • May be a statement, a single loop iteration, a
    set of loop iterations, etc.
  • Fine granularity leads to
  • (positive) ability to use lots of processors
  • (positive) finer-grain load balancing
  • (negative) increased overhead

The appropriate size may depend on the hardware (see the
sketch below)
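One hedged way to see the trade-off (the routine work() and the chunk
size are arbitrary choices for illustration): the OpenMP schedule
clause controls how many iterations each thread grabs at a time.

    /* coarse grain: one contiguous block of iterations per thread,
       minimal scheduling overhead */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) work(i);

    /* finer grain: chunks of 16 iterations handed out on demand;
       better load balancing, more scheduling overhead */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) work(i);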
14
Load Balance
  • Difference in execution time between threads
    between synchronization points
  • Sum up minimal times
  • Sum up maximal times
  • Difference is total imbalance
  • Unpredictable for some codes
  • For these, we have dynamic and guided schedules

15
Load in Molecular Dynamics
  • for some number of timesteps
      #pragma omp parallel for
      for( i = 0; i < num_mol; i++ )
        for( j = 0; j < count[i]; j++ )
          force[i] += f( loc[i], loc[index[j]] );
      #pragma omp parallel for
      for( i = 0; i < num_mol; i++ )
        loc[i] = g( loc[i], force[i] );

How much work is there?
May have poor load balance if number of neighbors
varies a lot
16
Better Load Balance
  • Rewrite to assign iterations of the first loop nest
    such that each thread has the same number of
    neighbors
  • Extra overheads: we would have to compute this
    repeatedly, as the neighbor list can change during
    the computation
  • Use an explicit schedule to assign work
Is it worth it? Can we express the desired
schedule?
17
Parallel Code Performance Issues
  • Concurrency: we want simultaneous execution of
    as many actions as possible
  • Data locality: a high percentage of memory
    accesses are local (to the local cache or memory)
  • Load balance: the same amount of work is performed on
    each processor
  • Scalability: the ability to exploit increasing
    numbers of processors (resources) with good
    efficiency

18
Some Conclusions
  • You need to parallelize most of the code
  • But also minimize overheads introduced by
    parallelization
  • Avoid having threads idle
  • Take care that parallel code makes good use of
    memory hierarchy at all levels

What size problems are worth parallel
execution? How important is ease of code
maintenance?
19
Performance Scalability Hindrances
  • Too fine grained
  • Symptom: high overhead
  • Caused by: small parallel/critical regions
  • Overly synchronized
  • Symptom: high overhead
  • Caused by: large synchronized sections
  • Are the dependencies real?
  • Load imbalance
  • Symptom: large wait times
  • Caused by: uneven work distribution

20
Specific OpenMP Improvements
  • Too many parallel regions
  • Reduce overheads and improve cache use by merging
    them (see the sketch below)
  • May need to use the single directive for sequential
    parts
  • Too many barriers
  • Can you remove some (by using nowait)?
  • Too much work in critical region
  • Can you remove some or create multiple critical
    regions?
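A hedged sketch of the merging idea (the arrays a and b and the
routines init(), setup_step() and use() are made up for illustration):
two adjacent parallel regions with a short sequential part between
them become one region, with the sequential part run under a single
directive.

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) a[i] = init(i);

        #pragma omp single        /* sequential part, one thread only */
        setup_step(a, n);

        #pragma omp for
        for (int i = 0; i < n; i++) b[i] = use(a[i]);
    }   /* one fork/join instead of two */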

21
Tuning Critical Sections
  • It often helps to chop up large critical sections
    into finer, named ones
  • Original Code
      #pragma omp critical (foo)
      {
        update( a );
        update( b );
      }
  • Transformed Code
      #pragma omp critical (foo_a)
        update( a );
      #pragma omp critical (foo_b)
        update( b );
  • Still need to avoid the wait at the first critical!

22
Tuning Locks Instead of Critical
  • Original Code
      #pragma omp critical
      for( i = 0; i < n; i++ ) {
        a[i] = ...;
        b[i] = ...;
        c[i] = ...;
      }
  • Idea: cycle through different parts of the array
    using locks!
  • Transformed Code
      jstart = omp_get_thread_num();
      for( k = 0; k < nlocks; k++ ) {
        j = ( jstart + k ) % nlocks;
        omp_set_lock( &lck[j] );
        for( i = lb[j]; i < ub[j]; i++ ) {
          a[i] = ...;
          b[i] = ...;
          c[i] = ...;
        }
        omp_unset_lock( &lck[j] );
      }
  • Adapt this to your situation

23
Tuning Eliminate Implicit Barriers
Remember: work-sharing constructs have an implicit
barrier at the end
  • When consecutive work-sharing constructs modify
    (or use) different objects, the barrier in the
    middle can be eliminated (see the sketch below)
  • When the same object is modified (or used), the
    barrier can be safely removed if the iteration
    spaces align
  • On most systems this will be OK with OpenMP 3.0
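A hedged sketch (arrays a and b and routines f1() and f2() are
assumed): the first loop writes only a and the second writes only b,
so the implicit barrier after the first loop can be dropped with
nowait and threads that finish early move straight on.

    #pragma omp parallel
    {
        #pragma omp for nowait    /* no barrier after this loop */
        for (int i = 0; i < n; i++) a[i] = f1(i);

        #pragma omp for           /* implicit barrier kept here */
        for (int i = 0; i < n; i++) b[i] = f2(i);
    }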

24
Problems with Data Accesses
  • Potential problems
  • Contention for access to memory
  • Too much remote data
  • Frequent false sharing
  • Poor sequential access pattern
  • Some remedies
  • Can you privatize data to make it local (and
    remove false sharing)? (see the sketch below)
  • Can you use OS features to pin data and memory
    down on a big system?
  • First-touch default placement, dplace,
    next_touch

Might need to revisit parallelization strategy
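A hedged sketch of the privatization remedy (the array a, its length
n, and the thread count NTHREADS are assumptions): per-thread partial
sums kept in one shared array sit on the same cache line and ping-pong
between processors, while a private accumulator removes the false
sharing.

    /* false sharing: the per-thread slots of sum[] share cache lines */
    double sum[NTHREADS];
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        sum[id] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum[id] += a[i];      /* each update invalidates the line
                                     in the other threads' caches */
    }

    /* remedy: privatize the accumulator, combine once at the end */
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;       /* private: no shared cache line */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i];
        #pragma omp atomic
        total += local;
    }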
25
Parallelism worth it?
  • When would parallelizing this loop help?
      DO I = 1, N
        A(I) = 0
      ENDDO
  • Some issues to consider
  • Would it help to increase the size of a parallel region?
  • Number of threads/processors being used
  • Bandwidth of the memory system
  • Value of N
  • Very large N, so A is not cache contained
  • Placement of object A
  • If distributed onto different processor caches,
    or about to be distributed
  • On NUMA systems, when using a first-touch policy
    for placing objects, to achieve a certain
    placement for object A (see the sketch below)
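A hedged sketch of exploiting the first-touch policy (the array a, its
length n, and compute() are assumptions): initializing A in parallel
with the same static schedule as the later compute loop places each
page near the thread that will use it, which is one case where
parallelizing even a trivial initialization loop pays off on a NUMA
system.

    /* each page of a[] is first touched, and therefore placed,
       by the thread that will later work on it */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;

    /* the compute loop with the same schedule reuses that placement */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = compute(i);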

26
Too fine grained?
  • When would parallelizing this loop help?
      DO I = 1, N
        SUM = SUM + A(I) * B(I)
      ENDDO
  • Know your compiler!
  • Some issues to consider
  • Number of threads/processors being used
  • How are reductions implemented?
  • Atomic, critical, expanded scalar, logarithmic
  • All the issues from the previous slide about the
    existing distribution of A and B (see the sketch below)
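For comparison, a hedged C rendering of the loop above (a, b and n
assumed): with a reduction clause the runtime combines per-thread
partial sums for you, and how it does so is exactly the implementation
detail worth checking in your compiler.

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* partial sums merged by the runtime */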

27
Tuning Load Balancing
do I = 1, N
  do J = I, M
  • A notorious problem for triangular loops
  • Within a parallel do/for, use the schedule clause
  • Remember, dynamic is much more expensive than static
  • Chunked static can be very effective for load
    imbalance (see the sketch below)
  • When dealing with consecutive dos/fors, nowait
    can help, but be careful to avoid data races
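A hedged C rendering of the triangular loop above (N, M and the loop
body are placeholders): a chunked static schedule interleaves small
blocks of outer iterations across threads, so no thread gets only the
long (early) rows or only the short (late) ones.

    #pragma omp parallel for schedule(static, 8)   /* chunk size arbitrary */
    for (int i = 0; i < N; i++)
        for (int j = i; j < M; j++)
            body(i, j);                            /* hypothetical work */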

28
Load Imbalance Thread Profiler
  • Thread Profiler is a performance analysis tool
    from Intel Corporation
  • There are a few tools out there for use with
    OpenMP.

29
High Performance Tuning
  • To fine-tune for high performance, break all the
    good software engineering rules (i.e. stop being
    so portable).
  • Step 1
  • Know your application
  • For best performance, also know your compiler,
    performance tool, and hardware
  • The safest pragma to use is
  • parallel do/for
  • Sections can introduce load imbalance
  • There is usually more than one way to express the
    desired parallelism
  • So how do you pick which constructs to use?

Tradeoff: level of performance vs. portability
30
Understand the Overheads!
(Approximate numbers, just to illustrate)
31
Performance Optimization Summary
  • Getting maximal performance is difficult
  • Might involve changing
  • Some aspect of application
  • Parallelization strategy
  • Directives used
  • Compiler flags used
  • Use of OS features
  • Requires understanding of potential problems
  • Not always easy to pinpoint actual problem

32
Cache Ping-Ponging: Varying Times for Sequential
Regions
  • The picture shows three runs of the same program
    (4, 2, and 1 threads)
  • Each set of three bars is a serial region
  • Why does the runtime change for the serial regions?
  • No reason pinpointed
  • Time to think!
  • Thread migration
  • Data migration
  • Overhead?

[Bar chart: run time of each serial region in the three runs]