Title: ECE1747 Parallel Programming
1 ECE1747 Parallel Programming
- Shared Memory Multithreading: Pthreads
2 Shared Memory
- All threads access the same shared memory data space.
(Figure: processors proc1 through procN all reference a single shared memory address space.)
3 Shared Memory (continued)
- Concretely, it means that a variable x, a pointer p, or an array a refer to the same object, no matter which processor the reference originates from.
- We have more or less implicitly assumed this to be the case in earlier examples.
4 Shared Memory
(Figure: processors proc1 through procN all reference the same array a in shared memory.)
5 Distributed Memory - Message Passing
- The alternative model to shared memory.
(Figure: each processor proc1 through procN has its own memory mem1 through memN, each holding a private copy of a; processors communicate over a network.)
6 Shared Memory vs. Message Passing
- The same terminology is used to distinguish hardware.
- Here, however, we distinguish programming models, not hardware.
7 Programming vs. Hardware
- One can implement
- a shared memory programming model
- on shared or distributed memory hardware
- (also in software or in hardware)
- One can implement
- a message passing programming model
- on shared or distributed memory hardware
8 Portability of Programming Models
(Figure: both shared-memory and message-passing programming models can be implemented on both shared-memory and distributed-memory machines.)
9 Shared Memory Programming: Important Point to Remember
- No matter what the implementation, it conceptually looks like shared memory.
- There may be some (important) performance differences.
10 Multithreading
- The user has explicit control over threads.
- Good: this control can be used for performance benefit.
- Bad: the user has to deal with it.
11 Pthreads
- POSIX standard shared-memory multithreading interface.
- Provides primitives for process management and synchronization.
12 What does the user have to do?
- Decide how to decompose the computation into parallel parts.
- Create (and destroy) processes to support that decomposition.
- Add synchronization to make sure dependences are covered.
13 General Thread Structure
- Typically, a thread is a concurrent execution of a function or a procedure.
- So, your program needs to be restructured such that parallel parts form separate procedures or functions.
14 Example of Thread Creation
(Figure: main() calls pthread_create(func); a new thread begins executing func() concurrently with main().)
15 Thread Joining Example
- void *func(void *) { ... }
- pthread_t id; int X;
- pthread_create(&id, NULL, func, (void *) X);
- ...
- pthread_join(id, NULL);
- ...
16 Example of Thread Creation (contd.)
(Figure: main() calls pthread_create(func); func() runs in the new thread and ends with pthread_exit(), while main() waits in pthread_join(id).)
17 Matrix Multiply
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=0; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
18 Parallel Matrix Multiply
- All i- or j-iterations can be run in parallel.
- If we have p processors, give n/p rows to each processor.
- This corresponds to partitioning the i-loop.
19 Matrix Multiply: Parallel Part
- void *mmult(void *s)
- {
-   int slice = (int) s;
-   int from = (slice*n)/p;
-   int to = ((slice+1)*n)/p;
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ ) {
-       c[i][j] = 0.0;
-       for( k=0; k<n; k++ )
-         c[i][j] += a[i][k]*b[k][j];
-     }
- }
20 Matrix Multiply: Main
- int main()
- {
-   pthread_t thrd[p];
-   ...
-   for( i=0; i<p; i++ )
-     pthread_create(&thrd[i], NULL, mmult, (void *) i);
-   for( i=0; i<p; i++ )
-     pthread_join(thrd[i], NULL);
- }
21 Sequential SOR
- for some number of timesteps/iterations {
-   for( i=0; i<n; i++ )
-     for( j=1; j<n; j++ )
-       temp[i][j] = 0.25 *
-         ( grid[i-1][j] + grid[i+1][j] +
-           grid[i][j-1] + grid[i][j+1] );
-   for( i=0; i<n; i++ )
-     for( j=1; j<n; j++ )
-       grid[i][j] = temp[i][j];
- }
22 Parallel SOR
- The first (i,j) loop nest can be parallelized.
- The second (i,j) loop nest can be parallelized.
- Must wait to start the second loop nest until all processors have finished the first.
- Must wait to start the first loop nest of the next iteration until all processors have finished the second loop nest of the previous iteration.
- Give n/p rows to each processor.
23 Pthreads SOR: Parallel Parts (1)
- void *sor_1(void *s)
- {
-   int slice = (int) s;
-   int from = (slice*n)/p;
-   int to = ((slice+1)*n)/p;
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j]
-         + grid[i][j-1] + grid[i][j+1]);
- }
24 Pthreads SOR: Parallel Parts (2)
- void *sor_2(void *s)
- {
-   int slice = (int) s;
-   int from = (slice*n)/p;
-   int to = ((slice+1)*n)/p;
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ )
-       grid[i][j] = temp[i][j];
- }
25 Pthreads SOR: main
- for some number of timesteps {
-   for( i=0; i<p; i++ )
-     pthread_create(&thrd[i], NULL, sor_1, (void *) i);
-   for( i=0; i<p; i++ )
-     pthread_join(thrd[i], NULL);
-   for( i=0; i<p; i++ )
-     pthread_create(&thrd[i], NULL, sor_2, (void *) i);
-   for( i=0; i<p; i++ )
-     pthread_join(thrd[i], NULL);
- }
26 Summary: Thread Management
- pthread_create() creates a parallel thread executing a given function (with arguments) and returns a thread identifier.
- pthread_exit() terminates the calling thread.
- pthread_join() waits for the thread with a particular thread identifier to terminate.
27 Summary: Program Structure
- Encapsulate parallel parts in functions.
- Use function arguments to parameterize what a particular thread does.
- Call pthread_create() with the function and arguments; save the thread identifier returned.
- Call pthread_join() with that thread identifier.
28 Pthreads Synchronization
- Create/exit/join
- provide some form of synchronization,
- at a very coarse level,
- require thread creation/destruction.
- Need for finer-grain synchronization:
- mutex locks,
- condition variables.
29 Use of Mutex Locks
- To implement critical sections (as needed, e.g., in en_queue and de_queue in TSP).
- Pthreads provides only exclusive locks.
- Some other systems allow shared-read, exclusive-write locks.
30 Condition Variables (1 of 5)
- pthread_cond_init(
-   pthread_cond_t *cond,
-   const pthread_condattr_t *attr)
- Creates a new condition variable cond.
- Attribute: ignore for now.
31 Condition Variables (2 of 5)
- pthread_cond_destroy(
-   pthread_cond_t *cond)
- Destroys the condition variable cond.
32 Condition Variables (3 of 5)
- pthread_cond_wait(
-   pthread_cond_t *cond,
-   pthread_mutex_t *mutex)
- Blocks the calling thread, waiting on cond.
- Unlocks the mutex (and re-acquires it before returning).
33 Condition Variables (4 of 5)
- pthread_cond_signal(
-   pthread_cond_t *cond)
- Unblocks one thread waiting on cond.
- Which one is determined by the scheduler.
- If no thread is waiting, the signal is a no-op.
34 Condition Variables (5 of 5)
- pthread_cond_broadcast(
-   pthread_cond_t *cond)
- Unblocks all threads waiting on cond.
- If no thread is waiting, the broadcast is a no-op.
35 Use of Condition Variables
- To implement the signal-wait synchronization discussed in earlier examples.
- Important note: a signal is forgotten if there is no corresponding wait that has already happened.
36 Use of Wait/Signal (Pipelining)
(Figure: sequential vs. parallel pipelined execution; each horizontal line in the picture is one processor.)
37 PIPE
- P1: for( i=0; i<num_pics, read(in_pic); i++ ) {
-   int_pic_1[i] = trans1( in_pic );
-   signal( event_1_2[i] );
- }
- P2: for( i=0; i<num_pics; i++ ) {
-   wait( event_1_2[i] );
-   int_pic_2[i] = trans2( int_pic_1[i] );
-   signal( event_2_3[i] );
- }
38 PIPE Using Pthreads
- Replacing the original wait/signal by a Pthreads condition variable wait/signal will not work.
- Signals before a wait are forgotten.
- We need to remember a signal.
39 How to Remember a Signal (1 of 2)
- semaphore_signal(i)
- {
-   pthread_mutex_lock(&mutex_rem[i]);
-   arrived[i] = 1;
-   pthread_cond_signal(&cond[i]);
-   pthread_mutex_unlock(&mutex_rem[i]);
- }
40 How to Remember a Signal (2 of 2)
- semaphore_wait(i)
- {
-   pthread_mutex_lock(&mutex_rem[i]);
-   if( arrived[i] == 0 ) {
-     pthread_cond_wait(&cond[i], &mutex_rem[i]);
-   }
-   arrived[i] = 0;
-   pthread_mutex_unlock(&mutex_rem[i]);
- }
41 PIPE with Pthreads
- P1: for( i=0; i<num_pics, read(in_pic); i++ ) {
-   int_pic_1[i] = trans1( in_pic );
-   semaphore_signal( event_1_2[i] );
- }
- P2: for( i=0; i<num_pics; i++ ) {
-   semaphore_wait( event_1_2[i] );
-   int_pic_2[i] = trans2( int_pic_1[i] );
-   semaphore_signal( event_2_3[i] );
- }
42 Note
- Many shared memory programming systems (other than Pthreads) have semaphores as a basic primitive.
- If they do, you should use them, not construct them yourself.
- The implementation may be more efficient than what you can do yourself.
43 Parallel TSP
- process i:
- while( (p = de_queue()) != NULL ) {
-   for each expansion by one city {
-     q = add_city(p);
-     if( complete(q) ) update_best(q);
-     else en_queue(q);
-   }
- }
44 Parallel TSP
- Need critical sections
- in update_best,
- in en_queue/de_queue.
- In de_queue:
- wait if the queue is empty,
- terminate if all processes are waiting.
- In en_queue:
- signal that the queue is no longer empty.
45 Parallel TSP: Mutual Exclusion
- en_queue() / de_queue():
-   pthread_mutex_lock(&queue);
-   ...
-   pthread_mutex_unlock(&queue);
- update_best():
-   pthread_mutex_lock(&best);
-   ...
-   pthread_mutex_unlock(&best);
46 Parallel TSP: Condition Synchronization
- de_queue()
- {
-   while( (queue is empty) and (not done) ) {
-     waiting++;
-     if( waiting == p ) {
-       done = true;
-       pthread_cond_broadcast(&empty);
-     }
-     else {
-       pthread_cond_wait(&empty, &queue);
-       waiting--;
-     }
-   }
-   if( done )
-     return NULL;
-   else
-     remove and return head of the queue;
- }
47 Pthreads SOR: main
- for some number of timesteps {
-   for( i=0; i<p; i++ )
-     pthread_create(&thrd[i], NULL, sor_1, (void *) i);
-   for( i=0; i<p; i++ )
-     pthread_join(thrd[i], NULL);
-   for( i=0; i<p; i++ )
-     pthread_create(&thrd[i], NULL, sor_2, (void *) i);
-   for( i=0; i<p; i++ )
-     pthread_join(thrd[i], NULL);
- }
48 Pthreads SOR: Parallel Parts (1)
- void *sor_1(void *s)
- {
-   int slice = (int) s;
-   int from = (slice*n)/p;
-   int to = ((slice+1)*n)/p;
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j]
-         + grid[i][j-1] + grid[i][j+1]);
- }
49 Pthreads SOR: Parallel Parts (2)
- void *sor_2(void *s)
- {
-   int slice = (int) s;
-   int from = (slice*n)/p;
-   int to = ((slice+1)*n)/p;
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ )
-       grid[i][j] = temp[i][j];
- }
50 Reality bites ...
- Create/exit/join is not so cheap.
- It would be more efficient if we could come up with a parallel program in which
- create/exit/join happen rarely (once!),
- cheaper synchronization is used.
- We need something that makes all threads wait until all have arrived -- a barrier.
51 Barrier Synchronization
- A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier.
- At that point, they all proceed.
52 Implementing Barriers in Pthreads
- Count the number of arrivals at the barrier.
- Wait if this is not the last arrival.
- Make everyone unblock if this is the last arrival.
- Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.
53 Implementing Barriers in Pthreads
- void barrier()
- {
-   pthread_mutex_lock(&mutex_arr);
-   arrived++;
-   if (arrived < N) {
-     pthread_cond_wait(&cond, &mutex_arr);
-   }
-   else {
-     pthread_cond_broadcast(&cond);
-     arrived = 0;   /* be prepared for next barrier */
-   }
-   pthread_mutex_unlock(&mutex_arr);
- }
54 Parallel SOR with Barriers (1 of 2)
- void *sor (void *arg)
- {
-   int slice = (int)arg;
-   int from = (slice * (n-1))/p + 1;
-   int to = ((slice+1) * (n-1))/p + 1;
-   for some number of iterations {
-     ...
-   }
- }
55 Parallel SOR with Barriers (2 of 2)
- for (i=from; i<to; i++)
-   for (j=1; j<n; j++)
-     temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
- barrier();
- for (i=from; i<to; i++)
-   for (j=1; j<n; j++)
-     grid[i][j] = temp[i][j];
- barrier();
56 Parallel SOR with Barriers: main
- int main(int argc, char *argv[])
- {
-   pthread_t thrd[p];
-   /* Initialize mutex and condition variables */
-   for (i=0; i<p; i++)
-     pthread_create (&thrd[i], &attr, sor, (void *)i);
-   for (i=0; i<p; i++)
-     pthread_join (thrd[i], NULL);
-   /* Destroy mutex and condition variables */
- }
57 Note again
- Many shared memory programming systems (other than Pthreads) have barriers as a basic primitive.
- If they do, you should use them, not construct them yourself.
- The implementation may be more efficient than what you can do yourself.
58 Busy Waiting
- Not an explicit part of the API.
- Available in a general shared memory programming environment.
59 Busy Waiting
- initially: flag = 0;
- P1: produce data;
-     flag = 1;
- P2: while( !flag ) ;
-     consume data;
60 Use of Busy Waiting
- On the surface, simple and efficient.
- In general, not a recommended practice.
- Often leads to messy and unreadable code (blurs the data/synchronization distinction).
- May be inefficient.
61 Private Data in Pthreads
- To make a variable private in Pthreads, you need to make an array out of it.
- Index the array by thread identifier, which you can get via the pthread_self() call.
- Not very elegant or efficient.
62 Other Primitives in Pthreads
- Set the attributes of a thread.
- Set the attributes of a mutex lock.
- Set scheduling parameters.