ECE1747 Parallel Programming - PowerPoint PPT Presentation

1
ECE1747 Parallel Programming
  • Shared Memory Multithreading with Pthreads

2
Shared Memory
  • All threads access the same shared memory data
    space.

[Diagram: proc1, proc2, proc3, ..., procN all accessing one Shared Memory Address Space]
3
Shared Memory (continued)
  • Concretely, it means that a variable x, a pointer
    p, or an array a refer to the same object, no
    matter what processor the reference originates
    from.
  • We have more or less implicitly assumed this to
    be the case in earlier examples.

4
Shared Memory
[Diagram: a single variable a in the shared address space, referenced by proc1 through procN]
5
Distributed Memory - Message Passing
  • The alternative model to shared memory.

[Diagram: each processor proc1..procN has its own memory mem1..memN, each holding its own copy of a; the processors communicate over a network]
6
Shared Memory vs. Message Passing
  • The same terminology is used to distinguish
    hardware.
  • For us, these terms distinguish programming
    models, not hardware.

7
Programming vs. Hardware
  • One can implement a shared memory programming
    model on shared or distributed memory hardware
    (in software or in hardware).
  • One can implement a message passing programming
    model on shared or distributed memory hardware.

8
Portability of programming models
[Diagram: shared memory programming and message passing programming can each be implemented on both shared memory machines and distributed memory machines]
9
Shared Memory Programming: Important Point to
Remember
  • No matter what the implementation, it
    conceptually looks like shared memory.
  • There may be some (important) performance
    differences.

10
Multithreading
  • The user has explicit control over threads.
  • Good: control can be exploited for performance
    benefit.
  • Bad: the user has to deal with it.

11
Pthreads
  • POSIX standard shared-memory multithreading
    interface.
  • Provides primitives for process management and
    synchronization.

12
What does the user have to do?
  • Decide how to decompose the computation into
    parallel parts.
  • Create (and destroy) processes to support that
    decomposition.
  • Add synchronization to make sure dependences are
    covered.

13
General Thread Structure
  • Typically, a thread is a concurrent execution of
    a function or a procedure.
  • So, your program needs to be restructured such
    that parallel parts form separate procedures or
    functions.

14
Example of Thread Creation
[Diagram: main() calls pthread_create(func); func() then runs concurrently with main()]
15
Thread Joining Example
  void *func(void *) { ... }

  pthread_t id;
  int X;
  ...
  pthread_create(&id, NULL, func, &X);
  ...
  pthread_join(id, NULL);
  ...
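The fragment above, filled out into a compilable sketch (the worker body and the `run_create_join_demo` wrapper are illustrative assumptions, not from the slides):

```c
#include <pthread.h>

/* Hypothetical worker: receives a pointer to an int and doubles it. */
static void *func(void *arg)
{
    int *x = (int *)arg;
    *x = *x * 2;
    return NULL;
}

int run_create_join_demo(void)
{
    pthread_t id;
    int X = 21;
    pthread_create(&id, NULL, func, &X);  /* start the thread */
    pthread_join(id, NULL);               /* wait for it to finish */
    return X;                             /* 42 once the thread is done */
}
```

Note that the argument is passed by address, so the thread and its creator share the variable X through the common address space.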

16
Example of Thread Creation (contd.)
Example of Thread Creation (contd.)
[Diagram: main() calls pthread_create(func) and continues; func() runs and calls pthread_exit(); main() calls pthread_join(id) to wait for it]
17
Matrix Multiply
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) {
      c[i][j] = 0.0;
      for( k=0; k<n; k++ )
        c[i][j] += a[i][k]*b[k][j];
    }

18
Parallel Matrix Multiply
  • All i- or j-iterations can be run in parallel.
  • If we have p processors, we assign n/p rows to
    each processor.
  • Corresponds to partitioning i-loop.
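This partitioning can be written as two small helpers (the names `slice_from`/`slice_to` are mine, not the slides'); with integer division the slices exactly tile rows 0..n-1 even when p does not divide n:

```c
/* Row range [from, to) for one slice, matching the (slice*n)/p
   formula used on the next slide. */
int slice_from(int slice, int n, int p) { return (slice * n) / p; }
int slice_to(int slice, int n, int p)   { return ((slice + 1) * n) / p; }
```

For example, n = 10 and p = 3 gives slices [0,3), [3,6), and [6,10): adjacent slices always meet, so no row is dropped or duplicated.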

19
Matrix Multiply: Parallel Part
  void *mmult(void *s)
  {
    int slice = (int) s;
    int from = (slice*n)/p;
    int to = ((slice+1)*n)/p;
    for( i=from; i<to; i++ )
      for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=0; k<n; k++ )
          c[i][j] += a[i][k]*b[k][j];
      }
    return NULL;
  }

20
Matrix Multiply: Main
  int main()
  {
    pthread_t thrd[p];
    for( i=0; i<p; i++ )
      pthread_create(&thrd[i], NULL, mmult, (void *) i);
    for( i=0; i<p; i++ )
      pthread_join(thrd[i], NULL);
  }
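Putting the two slides together, a complete sketch might look like this (the fixed sizes N and P, the identity-matrix test data, and the `run_mmult_demo` wrapper are assumptions for the demo, not part of the lecture code):

```c
#include <pthread.h>

#define N 4   /* matrix size (assumption for the demo) */
#define P 2   /* number of threads */

static double a[N][N], b[N][N], c[N][N];

static void *mmult(void *s)
{
    long slice = (long)s;            /* long round-trips through void* */
    int from = (slice * N) / P;
    int to   = ((slice + 1) * N) / P;
    for (int i = from; i < to; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}

/* Multiply the identity by b; returns c[1][2] as a smoke test. */
double run_mmult_demo(void)
{
    pthread_t thrd[P];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (i == j) ? 1.0 : 0.0;   /* identity matrix */
            b[i][j] = i * N + j;
        }
    for (long i = 0; i < P; i++)
        pthread_create(&thrd[i], NULL, mmult, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(thrd[i], NULL);
    return c[1][2];   /* identity * b leaves b, so expect b[1][2] = 6 */
}
```

No locking is needed here because each thread writes a disjoint set of rows of c and only reads a and b.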

21
Sequential SOR
  for some number of timesteps/iterations {
    for( i=1; i<n; i++ )
      for( j=1; j<n; j++ )
        temp[i][j] = 0.25 *
          ( grid[i-1][j] + grid[i+1][j] +
            grid[i][j-1] + grid[i][j+1] );
    for( i=1; i<n; i++ )
      for( j=1; j<n; j++ )
        grid[i][j] = temp[i][j];
  }

22
Parallel SOR
  • First (i,j) loop nest can be parallelized.
  • Second (i,j) loop nest can be parallelized.
  • Must wait to start the second loop nest until all
    processors have finished the first.
  • Must wait to start the first loop nest of the next
    iteration until all processors have finished the
    second loop nest of the previous iteration.
  • Give n/p rows to each processor.

23
Pthreads SOR: Parallel Parts (1)
  void *sor_1(void *s)
  {
    int slice = (int) s;
    int from = (slice*n)/p;
    int to = ((slice+1)*n)/p;
    for( i=from; i<to; i++ )
      for( j=1; j<n; j++ )
        temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
    return NULL;
  }

24
Pthreads SOR: Parallel Parts (2)
  void *sor_2(void *s)
  {
    int slice = (int) s;
    int from = (slice*n)/p;
    int to = ((slice+1)*n)/p;
    for( i=from; i<to; i++ )
      for( j=1; j<n; j++ )
        grid[i][j] = temp[i][j];
    return NULL;
  }

25
Pthreads SOR: Main
  for some number of timesteps {
    for( i=0; i<p; i++ )
      pthread_create(&thrd[i], NULL, sor_1, (void *) i);
    for( i=0; i<p; i++ )
      pthread_join(thrd[i], NULL);
    for( i=0; i<p; i++ )
      pthread_create(&thrd[i], NULL, sor_2, (void *) i);
    for( i=0; i<p; i++ )
      pthread_join(thrd[i], NULL);
  }

26
Summary: Thread Management
  • pthread_create() creates a parallel thread
    executing a given function (with given arguments),
    and returns a thread identifier.
  • pthread_exit() terminates the calling thread.
  • pthread_join() waits for the thread with a
    particular thread identifier to terminate.

27
Summary: Program Structure
  • Encapsulate parallel parts in functions.
  • Use function arguments to parameterize what a
    particular thread does.
  • Call pthread_create() with the function and
    arguments, save thread identifier returned.
  • Call pthread_join() with that thread identifier.

28
Pthreads Synchronization
  • Create/exit/join
  • provide some form of synchronization,
  • but only at a very coarse level,
  • and require thread creation/destruction.
  • Need for finer-grain synchronization
  • mutex locks,
  • condition variables.

29
Use of Mutex Locks
  • To implement critical sections (as needed, e.g.,
    in en_queue and de_queue in TSP).
  • Pthreads provides only exclusive locks.
  • Some other systems allow shared-read,
    exclusive-write locks.
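As a minimal sketch of a Pthreads critical section (the shared counter, thread count, and `run_mutex_demo` wrapper are assumptions for illustration):

```c
#include <pthread.h>

#define NTHREADS 4
#define NITERS   10000

static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&count_mutex);    /* enter critical section */
        counter++;                           /* read-modify-write is now atomic */
        pthread_mutex_unlock(&count_mutex);  /* leave critical section */
    }
    return NULL;
}

long run_mutex_demo(void)
{
    pthread_t t[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;   /* 4 * 10000 = 40000 with the lock in place */
}
```

Without the lock-unlock pair the concurrent increments would race and the final count would typically be less than 40000.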

30
Condition variables (1 of 5)
  pthread_cond_init(
    pthread_cond_t *cond,
    pthread_condattr_t *attr)

  • Creates a new condition variable cond.
  • Ignore the attribute for now.

31
Condition Variables (2 of 5)
  pthread_cond_destroy(
    pthread_cond_t *cond)

  • Destroys the condition variable cond.

32
Condition Variables (3 of 5)
  pthread_cond_wait(
    pthread_cond_t *cond,
    pthread_mutex_t *mutex)

  • Blocks the calling thread, waiting on cond.
  • Unlocks the mutex.
  • Re-acquires the mutex before returning, once the
    thread is awakened.

33
Condition Variables (4 of 5)
  pthread_cond_signal(
    pthread_cond_t *cond)

  • Unblocks one thread waiting on cond.
  • Which one is determined by the scheduler.
  • If no thread waiting, then signal is a no-op.

34
Condition Variables (5 of 5)
  pthread_cond_broadcast(
    pthread_cond_t *cond)
  • Unblocks all threads waiting on cond.
  • If no thread waiting, then broadcast is a no-op.

35
Use of Condition Variables
  • To implement signal-wait synchronization
    discussed in earlier examples.
  • Important note: a signal is lost if no
    corresponding wait has already happened.
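The standard way to avoid losing a signal is to pair the condition variable with a predicate checked under the mutex; a minimal sketch (the `ready` flag and `run_cond_demo` wrapper are assumptions):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* the predicate that "remembers" the signal */

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;                  /* record the event ... */
    pthread_cond_signal(&c);    /* ... then wake a waiter, if any */
    pthread_mutex_unlock(&m);
    return NULL;
}

int run_cond_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                  /* safe even if the signal came first */
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return ready;
}
```

Because the waiter tests `ready` under the mutex, it never blocks on an event that has already occurred.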

36
Use of Wait/Signal (Pipelining)
  • Sequential
  • Parallel
[Diagram: pipeline pattern; each horizontal line is one processor]

37
PIPE
  P1: for( i=0; i<num_pics, read(in_pic); i++ ) {
        int_pic_1[i] = trans1( in_pic );
        signal( event_1_2[i] );
      }

  P2: for( i=0; i<num_pics; i++ ) {
        wait( event_1_2[i] );
        int_pic_2[i] = trans2( int_pic_1[i] );
        signal( event_2_3[i] );
      }

38
PIPE Using Pthreads
  • Replacing the original wait/signal with a Pthreads
    condition variable wait/signal will not work:
  • signals before a wait are lost,
  • so we need to remember a signal.

39
How to remember a signal (1 of 2)
  semaphore_signal(i)
  {
    pthread_mutex_lock(&mutex_rem[i]);
    arrived[i] = 1;
    pthread_cond_signal(&cond[i]);
    pthread_mutex_unlock(&mutex_rem[i]);
  }

40
How to Remember a Signal (2 of 2)
  semaphore_wait(i)
  {
    pthread_mutex_lock(&mutex_rem[i]);
    if( arrived[i] == 0 )
      pthread_cond_wait(&cond[i], &mutex_rem[i]);
    arrived[i] = 0;
    pthread_mutex_unlock(&mutex_rem[i]);
  }
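A self-contained version of the two routines above (the array size NEVENTS and the two-stage demo are assumptions; the `if` is replaced by a `while`, which additionally guards against the spurious wakeups POSIX permits):

```c
#include <pthread.h>

#define NEVENTS 2   /* assumption: two events for the demo */

static pthread_mutex_t mutex_rem[NEVENTS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER };
static pthread_cond_t cond[NEVENTS] = {
    PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER };
static int arrived[NEVENTS];

void semaphore_signal(int i)
{
    pthread_mutex_lock(&mutex_rem[i]);
    arrived[i] = 1;                 /* remember the signal */
    pthread_cond_signal(&cond[i]);
    pthread_mutex_unlock(&mutex_rem[i]);
}

void semaphore_wait(int i)
{
    pthread_mutex_lock(&mutex_rem[i]);
    while (arrived[i] == 0)         /* 'while' handles spurious wakeups */
        pthread_cond_wait(&cond[i], &mutex_rem[i]);
    arrived[i] = 0;                 /* consume the signal */
    pthread_mutex_unlock(&mutex_rem[i]);
}

static int result;

static void *stage2(void *arg)
{
    (void)arg;
    semaphore_wait(0);    /* returns even though the signal came first */
    result = 42;
    semaphore_signal(1);
    return NULL;
}

int run_semaphore_demo(void)
{
    pthread_t t;
    semaphore_signal(0);              /* signal BEFORE any wait exists */
    pthread_create(&t, NULL, stage2, NULL);
    semaphore_wait(1);
    pthread_join(t, NULL);
    return result;
}
```

The demo signals event 0 before the waiting thread even exists; the `arrived` flag remembers it, so stage2 does not block.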

41
PIPE with Pthreads
  P1: for( i=0; i<num_pics, read(in_pic); i++ ) {
        int_pic_1[i] = trans1( in_pic );
        semaphore_signal( event_1_2[i] );
      }

  P2: for( i=0; i<num_pics; i++ ) {
        semaphore_wait( event_1_2[i] );
        int_pic_2[i] = trans2( int_pic_1[i] );
        semaphore_signal( event_2_3[i] );
      }

42
Note
  • Many shared memory programming systems (other
    than Pthreads) have semaphores as a basic
    primitive.
  • If yours does, use them rather than constructing
    your own.
  • The built-in implementation may be more efficient
    than what you can do yourself.

43
Parallel TSP
  process i:
    while( (p = de_queue()) != NULL ) {
      for each expansion by one city {
        q = add_city(p);
        if( complete(q) ) update_best(q);
        else en_queue(q);
      }
    }

44
Parallel TSP
  • Need critical section
  • in update_best,
  • in en_queue/de_queue.
  • In de_queue
  • wait if q is empty,
  • terminate if all processes are waiting.
  • In en_queue
  • signal q is no longer empty.

45
Parallel TSP: Mutual Exclusion
  en_queue() / de_queue():
    pthread_mutex_lock(&queue);
    ...
    pthread_mutex_unlock(&queue);

  update_best():
    pthread_mutex_lock(&best);
    ...
    pthread_mutex_unlock(&best);

46
Parallel TSP: Condition Synchronization
  de_queue()
  {
    while( (q is empty) and (not done) ) {
      waiting++;
      if( waiting == p ) {
        done = true;
        pthread_cond_broadcast(&empty);
      } else {
        pthread_cond_wait(&empty, &queue);
        waiting--;
      }
    }
    if( done )
      return null;
    else
      remove and return head of the queue;
  }
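A runnable sketch of this termination protocol on a trivial work queue (the integer items, fixed-size array, and `run_tsp_demo` driver are stand-ins for the TSP tour queue, not the lecture's actual data structures):

```c
#include <pthread.h>

#define P 3          /* number of worker threads (assumption) */
#define QCAP 64

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;
static int queue[QCAP];
static int head = 0, tail = 0;       /* simple FIFO for the demo */
static int waiting = 0, done = 0;

void en_queue(int v)
{
    pthread_mutex_lock(&qlock);
    queue[tail++] = v;
    pthread_cond_signal(&empty);     /* queue is no longer empty */
    pthread_mutex_unlock(&qlock);
}

/* Returns -1 once all P workers are waiting on an empty queue. */
int de_queue(void)
{
    int v = -1;
    pthread_mutex_lock(&qlock);
    while (head == tail && !done) {
        waiting++;
        if (waiting == P) {          /* everyone is idle: terminate */
            done = 1;
            pthread_cond_broadcast(&empty);
        } else {
            pthread_cond_wait(&empty, &qlock);
        }
        waiting--;
    }
    if (!done)
        v = queue[head++];
    pthread_mutex_unlock(&qlock);
    return v;
}

static long total = 0;               /* items processed, lock-protected */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    int v;
    while ((v = de_queue()) != -1) {
        if (v > 0)
            en_queue(v - 1);         /* "expand" the item into one child */
        pthread_mutex_lock(&total_lock);
        total++;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

long run_tsp_demo(void)
{
    pthread_t t[P];
    en_queue(5);    /* 5 expands into 4, 3, 2, 1, 0: six items in total */
    for (int i = 0; i < P; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    return total;
}
```

Termination is detected by the last thread to arrive at an empty queue: a thread holding an item is not counted in `waiting`, so `waiting == P` can only become true when no work remains anywhere.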

47
Pthreads SOR: Main
  for some number of timesteps {
    for( i=0; i<p; i++ )
      pthread_create(&thrd[i], NULL, sor_1, (void *) i);
    for( i=0; i<p; i++ )
      pthread_join(thrd[i], NULL);
    for( i=0; i<p; i++ )
      pthread_create(&thrd[i], NULL, sor_2, (void *) i);
    for( i=0; i<p; i++ )
      pthread_join(thrd[i], NULL);
  }

48
Pthreads SOR: Parallel Parts (1)
  void *sor_1(void *s)
  {
    int slice = (int) s;
    int from = (slice*n)/p;
    int to = ((slice+1)*n)/p;
    for( i=from; i<to; i++ )
      for( j=1; j<n; j++ )
        temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
    return NULL;
  }

49
Pthreads SOR: Parallel Parts (2)
  void *sor_2(void *s)
  {
    int slice = (int) s;
    int from = (slice*n)/p;
    int to = ((slice+1)*n)/p;
    for( i=from; i<to; i++ )
      for( j=1; j<n; j++ )
        grid[i][j] = temp[i][j];
    return NULL;
  }

50
Reality bites ...
  • Create/exit/join is not so cheap.
  • It would be more efficient if we could come up
    with a parallel program in which
  • create/exit/join happened rarely (ideally once),
  • and cheaper synchronization were used.
  • We need something that makes all threads wait,
    until all have arrived -- a barrier.

51
Barrier Synchronization
  • A wait at a barrier causes a thread to wait until
    all threads have performed a wait at the barrier.
  • At that point, they all proceed.

52
Implementing Barriers in Pthreads
  • Count the number of arrivals at the barrier.
  • Wait if this is not the last arrival.
  • Make everyone unblock if this is the last
    arrival.
  • Since the arrival count is a shared variable,
    enclose the whole operation in a mutex
    lock-unlock.

53
Implementing Barriers in Pthreads
  void barrier()
  {
    pthread_mutex_lock(&mutex_arr);
    arrived++;
    if (arrived < N) {
      pthread_cond_wait(&cond, &mutex_arr);
    } else {
      pthread_cond_broadcast(&cond);
      arrived = 0;    /* be prepared for next barrier */
    }
    pthread_mutex_unlock(&mutex_arr);
  }
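A self-contained variant of the barrier above, extended with a generation counter so that it survives spurious wakeups and can safely be reused across phases (the two-phase demo, N, and the function names are assumptions):

```c
#include <pthread.h>

#define N 4   /* number of threads (assumption) */

static pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
static int arrived = 0;
static int generation = 0;   /* which barrier "round" we are in */

void barrier(void)
{
    pthread_mutex_lock(&mutex_arr);
    int my_gen = generation;
    arrived++;
    if (arrived < N) {
        while (my_gen == generation)       /* wait for this round to end */
            pthread_cond_wait(&cond, &mutex_arr);
    } else {
        arrived = 0;                       /* be prepared for next barrier */
        generation++;
        pthread_cond_broadcast(&cond);
    }
    pthread_mutex_unlock(&mutex_arr);
}

static int before[N], after[N];

static void *phase_worker(void *arg)
{
    long id = (long)arg;
    before[id] = 1;                 /* phase 1 */
    barrier();
    int sum = 0;                    /* phase 2: every phase-1 write is done */
    for (int i = 0; i < N; i++)
        sum += before[i];
    after[id] = sum;                /* must see all N phase-1 writes */
    barrier();
    return NULL;
}

int run_barrier_demo(void)
{
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, phase_worker, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return after[0];   /* N if the barrier ordered the phases correctly */
}
```

The slide's version works only if no thread re-enters the barrier before all waiters have left; the generation counter removes that restriction.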

54
Parallel SOR with Barriers (1 of 2)
  void *sor (void *arg)
  {
    int slice = (int)arg;
    int from = (slice * (n-1))/p + 1;
    int to = ((slice+1) * (n-1))/p + 1;
    for some number of iterations { ... }
  }

55
Parallel SOR with Barriers (2 of 2)
    for (i=from; i<to; i++)
      for (j=1; j<n; j++)
        temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                             grid[i][j-1] + grid[i][j+1]);
    barrier();
    for (i=from; i<to; i++)
      for (j=1; j<n; j++)
        grid[i][j] = temp[i][j];
    barrier();

56
Parallel SOR with Barriers: Main
  int main(int argc, char *argv[])
  {
    pthread_t thrd[p];
    /* Initialize mutex and condition variables */
    for (i=0; i<p; i++)
      pthread_create (&thrd[i], &attr, sor, (void *)i);
    for (i=0; i<p; i++)
      pthread_join (thrd[i], NULL);
    /* Destroy mutex and condition variables */
  }

57
Note again
  • Many shared memory programming systems (other
    than Pthreads) have barriers as a basic primitive.
  • If yours does, use it rather than constructing
    your own.
  • The built-in implementation may be more efficient
    than what you can do yourself.

58
Busy Waiting
  • Not an explicit part of the API.
  • Available in any general shared memory programming
    environment.

59
Busy Waiting
  initially: flag = 0;

  P1: produce data;
      flag = 1;

  P2: while( !flag ) ;
      consume data;
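In modern C this pattern needs more than a plain int flag: the compiler may hoist the flag read out of the spin loop, and the hardware may reorder the data write past the flag write. A sketch using C11 atomics (the names and `run_busywait_demo` wrapper are assumptions):

```c
#include <pthread.h>
#include <stdatomic.h>

/* _Atomic gives the flag accesses the visibility and ordering that
   a plain int does not guarantee. */
static _Atomic int flag = 0;
static int data = 0;

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                                   /* produce data */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

int run_busywait_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                        /* spin until flag set */
    int v = data;                                /* consume data */
    pthread_join(t, NULL);
    return v;
}
```

The release store and acquire load together guarantee that the consumer sees the producer's write to `data` once it sees `flag == 1`.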

60
Use of Busy Waiting
  • On the surface, simple and efficient.
  • In general, not a recommended practice.
  • Often leads to messy and unreadable code (blurs
    the data/synchronization distinction).
  • May be inefficient.

61
Private Data in Pthreads
  • To make a variable private in Pthreads, you need
    to make an array out of it.
  • Index the array by thread identifier, which you
    can get via the pthread_self() call.
  • Not very elegant or efficient.
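A small sketch of the array-per-thread idiom, using a slice index passed at creation time instead of mapping pthread_self() to a small integer (all names are assumptions):

```c
#include <pthread.h>

#define P 3   /* number of threads (assumption) */

/* "Private" variable: one array slot per thread, indexed by an id
   handed to each thread at creation time. */
static int local_sum[P];

static void *work(void *arg)
{
    long id = (long)arg;
    for (int i = 0; i < 10; i++)
        local_sum[id] += (int)id;   /* no lock needed: the slot is private */
    return NULL;
}

int run_private_demo(void)
{
    pthread_t t[P];
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    return local_sum[0] + local_sum[1] + local_sum[2];  /* 0 + 10 + 20 */
}
```

Note that adjacent array slots can share a cache line and therefore false-share, which is one reason this idiom is not very efficient.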

62
Other Primitives in Pthreads
  • Set the attributes of a thread.
  • Set the attributes of a mutex lock.
  • Set scheduling parameters.