Transcript and Presenter's Notes

Title: ECE1747 Parallel Programming

1
ECE1747 Parallel Programming
  • Shared Memory Multithreading Pthreads

2
Shared Memory
  • All threads access the same shared memory data
    space.

[Diagram: processors proc1 through procN all connected to a single shared memory address space.]
3
Shared Memory (continued)
  • Concretely, it means that a variable x, a pointer
    p, or an array a refers to the same object, no
    matter what processor the reference originates
    from.
  • We have more or less implicitly assumed this to
    be the case in earlier examples.

4
Shared Memory
[Diagram: a single copy of variable a in shared memory, accessed by processors proc1 through procN.]
5
Distributed Memory - Message Passing
  • The alternative model to shared memory.

[Diagram: processors proc1 through procN, each with its own memory mem1 through memN holding a private copy of a, connected by a network.]
6
Shared Memory vs. Message Passing
  • The same terminology is used to distinguish
    hardware.
  • For us, these terms distinguish programming
    models, not hardware.

7
Programming vs. Hardware
  • One can implement a shared memory programming
    model on shared or distributed memory hardware
    (also in software or in hardware).
  • One can implement a message passing programming
    model on shared or distributed memory hardware.

8
Portability of programming models
[Diagram: both shared memory programming and message passing programming can be implemented on both shared memory and distributed memory machines.]
9
Shared Memory Programming: Important Point to
Remember
  • No matter what the implementation, it
    conceptually looks like shared memory.
  • There may be some (important) performance
    differences.

10
Multithreading
  • User has explicit control over threads.
  • Good: control can be used for performance
    benefit.
  • Bad: user has to deal with it.

11
Pthreads
  • POSIX standard shared-memory multithreading
    interface.
  • Provides primitives for process management and
    synchronization.

12
What does the user have to do?
  • Decide how to decompose the computation into
    parallel parts.
  • Create (and destroy) processes to support that
    decomposition.
  • Add synchronization to make sure dependences are
    covered.

13
General Thread Structure
  • Typically, a thread is a concurrent execution of
    a function or a procedure.
  • So, your program needs to be restructured such
    that parallel parts form separate procedures or
    functions.

14
Example of Thread Creation (contd.)
[Diagram: main() calls pthread_create(func); func() then executes concurrently with main().]
15
Thread Joining Example
    void *func(void *) { .. }

    pthread_t id; int X;
    pthread_create(&id, NULL, func, &X);
    ..
    pthread_join(id, NULL);
    ..

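Filled out into a complete, runnable sketch (the body of func and the use of X are illustrative assumptions, not from the slide):

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: receives a pointer to an int and prints it. */
    void *func(void *arg)
    {
        int *x = (int *)arg;
        printf("thread received %d\n", *x);
        return NULL;
    }

    int main(void)
    {
        pthread_t id;
        int X = 42;

        /* Create a thread executing func, passing &X as its argument. */
        pthread_create(&id, NULL, func, &X);

        /* Wait until the thread has terminated. */
        pthread_join(id, NULL);
        return 0;
    }
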
16
Example of Thread Creation (contd.)
[Diagram: main() calls pthread_create(func); func() runs concurrently and finishes with pthread_exit(); main() waits in pthread_join(id) for it to terminate.]
17
Sequential SOR
    for some number of timesteps/iterations {
      for( i=0; i<n; i++ )
        for( j=1; j<n; j++ )
          temp[i][j] = 0.25 *
            ( grid[i-1][j] + grid[i+1][j] +
              grid[i][j-1] + grid[i][j+1] );
      for( i=0; i<n; i++ )
        for( j=1; j<n; j++ )
          grid[i][j] = temp[i][j];
    }

18
Parallel SOR
  • First (i,j) loop nest can be parallelized.
  • Second (i,j) loop nest can be parallelized.
  • Must wait to start second loop nest until all
    processors have finished first.
  • Must wait to start the first loop nest of the
    next iteration until all processors have
    finished the second loop nest of the previous
    iteration.
  • Give n/p rows to each processor.

19
Pthreads SOR: Parallel parts (1)

    void *sor_1(void *s)
    {
      int slice = (int)s;
      int from = (slice * n)/p;
      int to = ((slice+1) * n)/p;

      for( i=from; i<to; i++ )
        for( j=0; j<n; j++ )
          temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                               grid[i][j-1] + grid[i][j+1]);
    }

20
Pthreads SOR: Parallel parts (2)

    void *sor_2(void *s)
    {
      int slice = (int)s;
      int from = (slice * n)/p;
      int to = ((slice+1) * n)/p;

      for( i=from; i<to; i++ )
        for( j=0; j<n; j++ )
          grid[i][j] = temp[i][j];
    }

21
Pthreads SOR: main

    for some number of timesteps {
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_1, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_2, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
    }

22
Summary: Thread Management
  • pthread_create() creates a parallel thread
    executing a given function (and arguments),
    returns thread identifier.
  • pthread_exit() terminates thread.
  • pthread_join() waits for thread with particular
    thread identifier to terminate.

23
Summary: Program Structure
  • Encapsulate parallel parts in functions.
  • Use function arguments to parameterize what a
    particular thread does.
  • Call pthread_create() with the function and
    arguments, save thread identifier returned.
  • Call pthread_join() with that thread identifier.

24
Pthreads Synchronization
  • Create/exit/join
  • provide some form of synchronization,
  • but only at a very coarse level,
  • and require thread creation/destruction.
  • Need for finer-grain synchronization
  • mutex locks,
  • condition variables.

25
Use of Mutex Locks
  • To implement critical sections.
  • Pthreads provides only exclusive locks.
  • Some other systems allow shared-read,
    exclusive-write locks.

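A minimal critical-section sketch with an exclusive Pthreads lock (the shared counter is an illustrative assumption):

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    int counter = 0;   /* shared variable updated by several threads */

    void increment(void)
    {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* exclusive access to counter */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
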
26
Barrier Synchronization
  • A wait at a barrier causes a thread to wait until
    all threads have performed a wait at the barrier.
  • At that point, they all proceed.

27
Implementing Barriers in Pthreads
  • Count the number of arrivals at the barrier.
  • Wait if this is not the last arrival.
  • Make everyone unblock if this is the last
    arrival.
  • Since the arrival count is a shared variable,
    enclose the whole operation in a mutex
    lock-unlock.

28
Implementing Barriers in Pthreads
    void barrier()
    {
      pthread_mutex_lock(&mutex_arr);
      arrived++;
      if (arrived < N) {
        pthread_cond_wait(&cond, &mutex_arr);
      }
      else {
        pthread_cond_broadcast(&cond);
        arrived = 0;  /* be prepared for next barrier */
      }
      pthread_mutex_unlock(&mutex_arr);
    }

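The shared state this barrier relies on, as a minimal sketch (the names mutex_arr, cond, arrived, and N follow the slide; the initializers and the value of N are assumptions):

    #include <pthread.h>

    #define N 4   /* total number of threads (assumed value) */

    pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
    int arrived = 0;   /* number of threads that have reached the barrier */

Note that POSIX allows pthread_cond_wait() to return spuriously, so production code would re-check its condition in a loop; the slide's version keeps the logic minimal.
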
29
Parallel SOR with Barriers (1 of 2)
    void *sor (void *arg)
    {
      int slice = (int)arg;
      int from = (slice * (n-1))/p + 1;
      int to = ((slice+1) * (n-1))/p + 1;

      for some number of iterations {

30
Parallel SOR with Barriers (2 of 2)
        for( i=from; i<to; i++ )
          for( j=1; j<n; j++ )
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
        barrier();
        for( i=from; i<to; i++ )
          for( j=1; j<n; j++ )
            grid[i][j] = temp[i][j];
        barrier();
      }
    }

31
Parallel SOR with Barriers: main

    int main(int argc, char *argv[])
    {
      pthread_t thrd[p];

      /* Initialize mutex and condition variables */

      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], &attr, sor, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);

      /* Destroy mutex and condition variables */
    }

32
Note again
  • Many shared memory programming systems (other
    than Pthreads) have barriers as a basic
    primitive.
  • If they do, you should use it, not construct it
    yourself.
  • The implementation may be more efficient than
    what you can do yourself.

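POSIX has since added an optional barrier primitive of its own; a minimal, runnable usage sketch (the worker payload is an illustrative assumption):

    #include <pthread.h>
    #include <stdio.h>

    #define N 4
    pthread_barrier_t barr;

    void *worker(void *arg)
    {
        printf("phase 1 done in thread %ld\n", (long)arg);
        pthread_barrier_wait(&barr);   /* all N threads block here until the last arrives */
        printf("phase 2 starts in thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        pthread_barrier_init(&barr, NULL, N);   /* N participating threads */
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barr);
        return 0;
    }
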
33
Busy Waiting
  • Not an explicit part of the API.
  • Available in a general shared memory programming
    environment.

34
Busy Waiting
    initially: flag = 0;

    P1:  produce data;
         flag = 1;

    P2:  while( !flag ) ;
         consume data;

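With a modern C compiler, a plain int flag is not guaranteed to work (the compiler may keep it in a register, and the hardware may reorder the stores); one way to express the same pattern safely is with C11 atomics, sketched here (names and the data payload are illustrative assumptions):

    #include <stdatomic.h>

    atomic_int flag = 0;
    int data;                        /* value handed from producer to consumer */

    void producer(void)
    {
        data = 42;                   /* produce data */
        atomic_store(&flag, 1);      /* then publish it (sequentially consistent) */
    }

    void consumer(void)
    {
        while (!atomic_load(&flag))  /* busy-wait (spin) until published */
            ;
        /* consume data ... */
    }
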
35
Use of Busy Waiting
  • On the surface, simple and efficient.
  • In general, not a recommended practice.
  • Often leads to messy and unreadable code (blurs
    data/synchronization distinction).
  • May be inefficient.

36
Private Data in Pthreads
  • To make a variable private in Pthreads, you need
    to make an array out of it.
  • Index the array by thread identifier, which you
    can get by the pthread_self() call.
  • Not very elegant or efficient.

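A minimal sketch of the pattern. Because pthread_self() returns an opaque pthread_t rather than a small integer, the variant sketched here indexes by an integer id passed to each thread at creation (all names are illustrative assumptions):

    #include <pthread.h>

    #define P 4                        /* number of threads (assumed) */

    int my_sum[P];                     /* one "private" slot per thread */

    void *worker(void *arg)
    {
        int me = (int)(long)arg;       /* this thread's index, passed at creation */
        my_sum[me] = 0;                /* each thread touches only its own slot */
        for (int i = me; i < 1000; i += P)
            my_sum[me] += i;
        return NULL;
    }
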
37
Other Primitives in Pthreads
  • Set the attributes of a thread.
  • Set the attributes of a mutex lock.
  • Set scheduling parameters.

38
ECE 1747 Parallel Programming
  • Machine-independent Performance Optimization
    Techniques

39
Returning to Sequential vs. Parallel
  • Sequential execution time: t seconds.
  • Startup overhead of parallel execution: t_st
    seconds (depends on architecture).
  • (Ideal) parallel execution time: t/p + t_st.
  • If t/p + t_st > t, no gain.

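For example, with t = 1 s, t_st = 0.1 s, and p = 4, the ideal parallel time is 1/4 + 0.1 = 0.35 s, a speedup of about 2.9 rather than 4; with p = 100 the same overhead caps the time at 0.01 + 0.1 = 0.11 s, a speedup of only about 9.
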
40
General Idea
  • Parallelism limited by dependences.
  • Restructure code to eliminate or reduce
    dependences.
  • Sometimes possible by compiler, but good to know
    how to do it by hand.

41
Summary
  • Reorganize code such that
  • dependences are removed or reduced
  • large pieces of parallel work emerge
  • loop bounds become known
  • Code can become messy; there is a point of
    diminishing returns.

42
Factors that Determine Speedup
  • Characteristics of parallel code
  • granularity
  • load balance
  • locality
  • communication and synchronization

43
Granularity
  • Granularity = size of the program unit that is
    executed by a single processor.
  • May be a single loop iteration, a set of loop
    iterations, etc.
  • Fine granularity leads to
  • (positive) ability to use lots of processors
  • (positive) finer-grain load balancing
  • (negative) increased overhead

44
Granularity and Critical Sections
  • Small granularity => more processors => more
    critical section accesses => more contention.

45
Issues in Performance of Parallel Parts
  • Granularity.
  • Load balance.
  • Locality.
  • Synchronization and communication.

46
Load Balance
  • Load imbalance = difference in execution time
    between processors between barriers.
  • Execution time may not be predictable.
  • Regular data parallel: yes.
  • Irregular data parallel or pipeline: perhaps.
  • Task queue: no.

47
Static vs. Dynamic
  • Static: done once, by the programmer
  • block, cyclic, etc.
  • fine for regular data parallel
  • Dynamic: done at runtime
  • task queue
  • fine for unpredictable execution times
  • usually high overhead
  • Semi-static: done once, at run-time

48
Choice is not inherent
  • MM or SOR could be done using task queues: put
    all iterations in a queue.
  • In a heterogeneous environment.
  • In a multitasked environment.
  • TSP could be done using static partitioning:
    give length-1 paths to the processors.
  • If we did exhaustive search.

49
Static Load Balancing
  • Block
  • best locality
  • possibly poor load balance
  • Cyclic
  • better load balance
  • worse locality
  • Block-cyclic
  • load balancing advantages of cyclic (mostly)
  • better locality (see later)
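
As a sketch, here is how the three distributions map iteration i (of n iterations on p processors) to an owner; B is the block size for block-cyclic (all names are illustrative assumptions):

    /* Owner of iteration i under each static distribution. */
    int block_owner(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int cyclic_owner(int i, int p)              { return i % p; }
    int block_cyclic_owner(int i, int B, int p) { return (i / B) % p; }
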

50
Dynamic Load Balancing (1 of 2)
  • Centralized: single task queue (see the sketch
    below).
  • Easy to program.
  • Excellent load balance.
  • Distributed: task queue per processor.
  • Less communication/synchronization.

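A minimal sketch of a centralized task queue for a loop of independent iterations (the counter representation and all names are illustrative assumptions):

    #include <pthread.h>

    pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    int next_task   = 0;     /* next unclaimed iteration */
    int total_tasks = 1024;  /* total number of work items (assumed) */

    /* Each worker calls this repeatedly; returns -1 when the queue is empty. */
    int get_task(void)
    {
        int t;
        pthread_mutex_lock(&q_lock);
        t = (next_task < total_tasks) ? next_task++ : -1;
        pthread_mutex_unlock(&q_lock);
        return t;
    }
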
51
Dynamic Load Balancing (2 of 2)
  • Task stealing
  • Processes normally insert tasks into and remove
    tasks from their own queue.
  • When their own queue is empty, they remove
    task(s) from other queues.
  • Extra overhead and programming difficulty.
  • Better load balancing.

52
Semi-static Load Balancing
  • Measure the cost of program parts.
  • Use the measurements to partition the
    computation.
  • Done once, every iteration, or every n
    iterations.

53
Molecular Dynamics (MD)
  • Simulation of a set of bodies under the influence
    of physical laws.
  • Atoms, molecules, celestial bodies, ...
  • All have the same basic structure.

54
Molecular Dynamics (Skeleton)
    for some number of timesteps {
      for all molecules i
        for all other molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }

55
Molecular Dynamics
  • To reduce amount of computation, account for
    interaction only with nearby molecules.

56
Molecular Dynamics (continued)
    for some number of timesteps {
      for all molecules i
        for all nearby molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }

57
Molecular Dynamics (continued)
    for each molecule i:
      number of nearby molecules: count[i]
      array of indices of nearby molecules: index[j]
        ( 0 <= j < count[i] )

58
Molecular Dynamics (continued)
    for some number of timesteps {
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

59
Molecular Dynamics (simple)
    for some number of timesteps {
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

60
Molecular Dynamics (simple)
  • Simple to program.
  • Possibly poor load balance
  • block distribution of i iterations (molecules)
  • could lead to uneven neighbor distribution
  • cyclic does not help

61
Better Load Balance
  • Assign iterations such that each processor has
    the same number of neighbors.
  • Array of assign records (see the sketch below):
  • size = number of processors
  • two elements:
  • beginning i value (molecule)
  • ending i value (molecule)
  • Recompute the partition periodically.

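A minimal sketch of the assign record indexed on the next slide (the field names b and e follow the slide's usage; everything else is an illustrative assumption):

    #define P 4                 /* number of processors/threads (assumed) */

    struct assign_rec {
        int b;                  /* beginning molecule index for this thread */
        int e;                  /* ending molecule index (exclusive) */
    };

    struct assign_rec *assign[P];   /* one record per thread: assign[pr]->b .. assign[pr]->e */
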
62
Molecular Dynamics (continued)
    for some number of timesteps {
      #pragma omp parallel
      {
        pr = omp_get_thread_num();
        for( i=assign[pr]->b; i<assign[pr]->e; i++ )
          for( j=0; j<count[i]; j++ )
            force[i] += f(loc[i], loc[index[j]]);
      }
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

63
Frequency of Balancing
  • Every time the neighbor list is recomputed
  • once during initialization,
  • every iteration, or
  • every n iterations.
  • Extra overhead vs. better approximation and
    better load balance.

64
Summary
  • Parallel code optimization
  • Critical section accesses.
  • Granularity.
  • Load balance.