Title: ECE1747 Parallel Programming
1 ECE1747 Parallel Programming
- Shared Memory Multithreading: Pthreads
2 Shared Memory
- All threads access the same shared-memory data space.
(Figure: a single Shared Memory Address Space accessed by proc1 ... procN.)
3 Shared Memory (continued)
- Concretely, this means that a variable x, a pointer p, or an array a refers to the same object, no matter which processor the reference originates from.
- We have more or less implicitly assumed this to be the case in earlier examples.
4 Shared Memory
(Figure: array a resides in the shared address space and is accessed by proc1 ... procN.)
5 Distributed Memory - Message Passing
- The alternative model to shared memory.
(Figure: each processor proc1 ... procN has its own memory mem1 ... memN, each holding a private copy of a; processors communicate over a network.)
6 Shared Memory vs. Message Passing
- The same terminology is used to distinguish hardware.
- For us, the terms distinguish programming models, not hardware.
7 Programming vs. Hardware
- One can implement
  - a shared memory programming model
  - on shared or distributed memory hardware
  - (also in software or in hardware)
- One can implement
  - a message passing programming model
  - on shared or distributed memory hardware
8 Portability of Programming Models
(Figure: both shared-memory programming and message-passing programming can run on either a shared-memory machine or a distributed-memory machine.)
9 Shared Memory Programming: Important Point to Remember
- No matter what the implementation, it conceptually looks like shared memory.
- There may be some (important) performance differences.
10 Multithreading
- The user has explicit control over threads.
- Good: control can be used for performance benefit.
- Bad: the user has to deal with it.
11 Pthreads
- POSIX standard shared-memory multithreading interface.
- Provides primitives for thread management and synchronization.
12 What does the user have to do?
- Decide how to decompose the computation into parallel parts.
- Create (and destroy) threads to support that decomposition.
- Add synchronization to make sure dependences are covered.
13 General Thread Structure
- Typically, a thread is a concurrent execution of a function or a procedure.
- So, your program needs to be restructured such that parallel parts form separate procedures or functions.
14 Example of Thread Creation
(Figure: main() calls pthread_create(func), which starts a new thread running func().)
15 Thread Joining Example
    void *func(void *arg) { .. }

    pthread_t id;  int X;

    pthread_create(&id, NULL, func, &X);
    ..
    pthread_join(id, NULL);
    ..
16 Example of Thread Creation (contd.)
(Figure: main() calls pthread_create(func); the new thread runs func() and finishes with pthread_exit(), while main() waits in pthread_join(id).)
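Putting the two diagrams together, a minimal, self-contained sketch of the create/join pattern (the names func and X follow the slides; the printed message is illustrative):

    #include <pthread.h>
    #include <stdio.h>

    void *func(void *arg)
    {
        int *x = (int *) arg;              /* argument handed over by main() */
        printf("worker sees X = %d\n", *x);
        return NULL;                       /* same effect as pthread_exit(NULL) */
    }

    int main(void)
    {
        pthread_t id;
        int X = 42;

        pthread_create(&id, NULL, func, &X);   /* start the thread */
        pthread_join(id, NULL);                /* wait for it to finish */
        return 0;
    }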
17 Sequential SOR
    for some number of timesteps/iterations {
      for( i=1; i<n; i++ )
        for( j=1; j<n; j++ )
          temp[i][j] = 0.25 *
            ( grid[i-1][j] + grid[i+1][j] +
              grid[i][j-1] + grid[i][j+1] );
      for( i=1; i<n; i++ )
        for( j=1; j<n; j++ )
          grid[i][j] = temp[i][j];
    }
18 Parallel SOR
- First (i,j) loop nest can be parallelized.
- Second (i,j) loop nest can be parallelized.
- Must wait to start the second loop nest until all processors have finished the first.
- Must wait to start the first loop nest of the next iteration until all processors have finished the second loop nest of the previous iteration.
- Give n/p rows to each processor.
19 Pthreads SOR: Parallel parts (1)
    void *sor_1(void *s)
    {
      int slice = (int) s;
      int from = (slice * n) / p;
      int to = ((slice + 1) * n) / p;
      for( i=from; i<to; i++ )
        for( j=1; j<n; j++ )
          temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1] );
    }
20 Pthreads SOR: Parallel parts (2)
    void *sor_2(void *s)
    {
      int slice = (int) s;
      int from = (slice * n) / p;
      int to = ((slice + 1) * n) / p;
      for( i=from; i<to; i++ )
        for( j=1; j<n; j++ )
          grid[i][j] = temp[i][j];
    }
21 Pthreads SOR: main
    for some number of timesteps {
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_1, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_2, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
    }
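The (void *)i cast above is the idiom the slides use; it works on common platforms but is not strictly portable. A self-contained sketch of an alternative, passing each thread a pointer to its own slot (NTHREADS, ids, and sor_part are illustrative names, not the course code):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    void *sor_part(void *s)                /* stands in for sor_1 / sor_2 */
    {
        int slice = *(int *) s;            /* read the slice index back */
        printf("thread got slice %d\n", slice);
        return NULL;
    }

    int main(void)
    {
        pthread_t thrd[NTHREADS];
        int ids[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++) {
            ids[i] = i;                    /* each thread gets its own slot */
            pthread_create(&thrd[i], NULL, sor_part, &ids[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(thrd[i], NULL);
        return 0;
    }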
22 Summary: Thread Management
- pthread_create() creates a parallel thread executing a given function (and arguments), and returns the thread identifier.
- pthread_exit() terminates the calling thread.
- pthread_join() waits for the thread with a particular thread identifier to terminate.
23 Summary: Program Structure
- Encapsulate parallel parts in functions.
- Use function arguments to parameterize what a particular thread does.
- Call pthread_create() with the function and arguments; save the thread identifier returned.
- Call pthread_join() with that thread identifier.
24 Pthreads Synchronization
- Create/exit/join
  - provide some form of synchronization,
  - at a very coarse level,
  - require thread creation/destruction.
- Need for finer-grain synchronization:
  - mutex locks,
  - condition variables.
25 Use of Mutex Locks
- To implement critical sections.
- Pthreads provides only exclusive locks.
- Some other systems allow shared-read, exclusive-write locks.
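A minimal sketch of a critical section built from a Pthreads mutex (the shared counter, thread count, and iteration count are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    long counter = 0;                      /* shared data */

    void *work(void *arg)
    {
        (void) arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            counter++;                     /* exclusive access to shared data */
            pthread_mutex_unlock(&lock);   /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* NTHREADS * 100000 */
        return 0;
    }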
26 Barrier Synchronization
- A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier.
- At that point, they all proceed.
27 Implementing Barriers in Pthreads
- Count the number of arrivals at the barrier.
- Wait if this is not the last arrival.
- Make everyone unblock if this is the last arrival.
- Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.
28 Implementing Barriers in Pthreads
    void barrier()
    {
      pthread_mutex_lock(&mutex_arr);
      arrived++;
      if (arrived < N) {
        pthread_cond_wait(&cond, &mutex_arr);
      }
      else {
        pthread_cond_broadcast(&cond);
        arrived = 0;   /* be prepared for next barrier */
      }
      pthread_mutex_unlock(&mutex_arr);
    }
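The barrier() above relies on shared state declared elsewhere; a minimal sketch of those declarations, with N standing for the number of participating threads (the names mutex_arr, cond, and arrived follow the slide):

    #include <pthread.h>

    #define N 4   /* number of threads participating in the barrier */

    pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
    int arrived = 0;          /* how many threads have reached the barrier */

In production code the wait is usually also guarded against spurious wakeups (e.g., with a generation counter), or a library barrier is used instead (see below).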
29 Parallel SOR with Barriers (1 of 2)
    void *sor(void *arg)
    {
      int slice = (int) arg;
      int from = (slice * (n-1)) / p + 1;
      int to = ((slice + 1) * (n-1)) / p + 1;
      for some number of iterations {
30 Parallel SOR with Barriers (2 of 2)
        for (i=from; i<to; i++)
          for (j=1; j<n; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
        barrier();
        for (i=from; i<to; i++)
          for (j=1; j<n; j++)
            grid[i][j] = temp[i][j];
        barrier();
      }
    }
31 Parallel SOR with Barriers: main
    int main(int argc, char *argv[])
    {
      pthread_t thrd[p];
      /* Initialize mutex and condition variables */
      for (i=0; i<p; i++)
        pthread_create(&thrd[i], &attr, sor, (void *)i);
      for (i=0; i<p; i++)
        pthread_join(thrd[i], NULL);
      /* Destroy mutex and condition variables */
    }
32 Note again
- Many shared-memory programming systems (other than Pthreads) have barriers as a basic primitive.
- If they do, you should use it, not construct it yourself.
- The implementation may be more efficient than what you can do yourself.
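Pthreads itself also offers such a primitive: pthread_barrier_t (an optional part of POSIX, but widely available). A minimal sketch of its use, with an illustrative thread count and worker:

    #include <pthread.h>

    #define NTHREADS 4

    pthread_barrier_t barr;

    void *worker(void *arg)
    {
        (void) arg;
        /* ... first phase of work ... */
        pthread_barrier_wait(&barr);       /* blocks until NTHREADS have arrived */
        /* ... second phase of work ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barr, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barr);
        return 0;
    }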
33 Busy Waiting
- Not an explicit part of the API.
- Available in any general shared-memory programming environment.
34 Busy Waiting
    initially:  flag = 0;

    P1:  produce data;
         flag = 1;

    P2:  while( !flag ) ;
         consume data;
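A self-contained sketch of this pattern in C; note that with a plain int flag a modern compiler may hoist the load out of the loop, so the flag is declared atomic here (the data value and thread setup are illustrative):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int flag = 0;
    int data;                              /* produced by P1, consumed by P2 */

    void *p1(void *arg)                    /* producer */
    {
        (void) arg;
        data = 42;                         /* produce data */
        atomic_store(&flag, 1);            /* then publish it */
        return NULL;
    }

    void *p2(void *arg)                    /* consumer */
    {
        (void) arg;
        while (!atomic_load(&flag))        /* spin until the flag is set */
            ;                              /* busy waiting: burns CPU cycles */
        printf("consumed %d\n", data);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, p1, NULL);
        pthread_create(&b, NULL, p2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }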
35 Use of Busy Waiting
- On the surface, simple and efficient.
- In general, not a recommended practice.
- Often leads to messy and unreadable code (blurs the data/synchronization distinction).
- May be inefficient.
36 Private Data in Pthreads
- To make a variable private in Pthreads, you need to make an array out of it.
- Index the array by thread identifier, which you can get via the pthread_self() call.
- Not very elegant or efficient.
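A sketch of the array-per-variable approach; since pthread_self() returns an opaque pthread_t rather than a small index, this sketch passes each thread an explicit integer id instead (all names here are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    double partial_sum[NTHREADS];          /* "private" variable: one slot per thread */

    void *work(void *arg)
    {
        int me = *(int *) arg;             /* this thread's index */
        partial_sum[me] = 0.0;
        for (int i = 0; i < 1000; i++)
            partial_sum[me] += 1.0;        /* each thread touches only its own slot */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, work, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("partial_sum[0] = %.0f\n", partial_sum[0]);
        return 0;
    }

Adjacent slots can also end up in the same cache line (false sharing), which is part of why this approach is not very efficient.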
37 Other Primitives in Pthreads
- Set the attributes of a thread.
- Set the attributes of a mutex lock.
- Set scheduling parameters.
38 ECE 1747 Parallel Programming
- Machine-Independent Performance Optimization Techniques
39 Returning to Sequential vs. Parallel
- Sequential execution time: t seconds.
- Startup overhead of parallel execution: t_st seconds (depends on architecture).
- (Ideal) parallel execution time: t/p + t_st.
- If t/p + t_st > t, there is no gain.
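- In symbols: parallel time ~ t/p + t_st, so speedup S = t / (t/p + t_st), and there is a gain only when t/p + t_st < t.
- Illustrative numbers (not from the course): t = 10 s, p = 4, t_st = 1 s gives 10/4 + 1 = 3.5 s and S ~ 2.9; with t_st = 8 s the parallel time is 10.5 s, so there is no gain.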
40 General Idea
- Parallelism is limited by dependences.
- Restructure code to eliminate or reduce dependences.
- Sometimes possible by the compiler, but good to know how to do it by hand.
41 Summary
- Reorganize code such that
  - dependences are removed or reduced,
  - large pieces of parallel work emerge,
  - loop bounds become known,
  - ...
- Code can become messy; there is a point of diminishing returns.
42 Factors that Determine Speedup
- Characteristics of parallel code
  - granularity
  - load balance
  - locality
  - communication and synchronization
43 Granularity
- Granularity = size of the program unit that is executed by a single processor.
- May be a single loop iteration, a set of loop iterations, etc.
- Fine granularity leads to
  - (positive) ability to use lots of processors
  - (positive) finer-grain load balancing
  - (negative) increased overhead
44 Granularity and Critical Sections
- Small granularity => more processors => more critical section accesses => more contention.
45 Issues in Performance of Parallel Parts
- Granularity.
- Load balance.
- Locality.
- Synchronization and communication.
46 Load Balance
- Load imbalance = difference in execution time between processors between barriers.
- Execution time may not be predictable.
  - Regular data parallel: yes.
  - Irregular data parallel or pipeline: perhaps.
  - Task queue: no.
47 Static vs. Dynamic
- Static: done once, by the programmer
  - block, cyclic, etc.
  - fine for regular data parallel
- Dynamic: done at runtime
  - task queue
  - fine for unpredictable execution times
  - usually high overhead
- Semi-static: done once, at run-time
48 Choice is not inherent
- MM or SOR could be done using task queues: put all iterations in a queue.
  - In a heterogeneous environment.
  - In a multitasked environment.
- TSP could be done using static partitioning: give a length-1 path to each processor.
  - If we did exhaustive search.
49 Static Load Balancing
- Block
  - best locality
  - possibly poor load balance
- Cyclic
  - better load balance
  - worse locality
- Block-cyclic
  - load-balancing advantages of cyclic (mostly)
  - better locality (see later)
- (The three schemes are sketched in code below.)
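A small sequential illustration of which iterations each processor receives under the three schemes (N, P, B, and the printing are illustrative; a real program would run the chosen loop inside each thread):

    #include <stdio.h>

    #define N 16   /* iterations */
    #define P 4    /* processors */
    #define B 2    /* block size for block-cyclic */

    /* Print which iterations processor `me` would execute under each scheme. */
    static void show(int me)
    {
        int i, start;

        printf("proc %d, block       :", me);
        for (i = me * N / P; i < (me + 1) * N / P; i++)
            printf(" %d", i);

        printf("\nproc %d, cyclic      :", me);
        for (i = me; i < N; i += P)
            printf(" %d", i);

        printf("\nproc %d, block-cyclic:", me);
        for (start = me * B; start < N; start += P * B)
            for (i = start; i < start + B && i < N; i++)
                printf(" %d", i);
        printf("\n");
    }

    int main(void)
    {
        for (int me = 0; me < P; me++)
            show(me);
        return 0;
    }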
50 Dynamic Load Balancing (1 of 2)
- Centralized: single task queue (sketched below).
  - Easy to program.
  - Excellent load balance.
- Distributed: task queue per processor.
  - Less communication/synchronization.
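A minimal sketch of the centralized variant: one shared queue of task indices protected by a mutex (NTASKS, NTHREADS, and work() are illustrative):

    #include <pthread.h>

    #define NTASKS   64
    #define NTHREADS 4

    pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    int next_task = 0;                     /* head of the shared queue */

    static void work(int task) { (void) task; /* ... one unit of work ... */ }

    void *worker(void *arg)
    {
        (void) arg;
        for (;;) {
            pthread_mutex_lock(&qlock);    /* critical section: dequeue */
            int task = next_task++;
            pthread_mutex_unlock(&qlock);
            if (task >= NTASKS)
                break;                     /* queue drained: done */
            work(task);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Every dequeue goes through the single lock, which is exactly the contention that the distributed, per-processor queues avoid.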
51 Dynamic Load Balancing (2 of 2)
- Task stealing:
  - Processes normally remove and insert tasks from their own queue.
  - When the queue is empty, remove task(s) from other queues.
  - Extra overhead and programming difficulty.
  - Better load balancing.
52 Semi-static Load Balancing
- Measure the cost of program parts.
- Use the measurements to partition the computation.
- Done once, done every iteration, or done every n iterations.
53 Molecular Dynamics (MD)
- Simulation of a set of bodies under the influence of physical laws.
- Atoms, molecules, celestial bodies, ...
- All have the same basic structure.
54 Molecular Dynamics (Skeleton)
    for some number of timesteps {
      for all molecules i
        for all other molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }
55 Molecular Dynamics
- To reduce the amount of computation, account for interaction only with nearby molecules.
56 Molecular Dynamics (continued)
    for some number of timesteps {
      for all molecules i
        for all nearby molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }
57 Molecular Dynamics (continued)
- For each molecule i, keep:
  - the number of nearby molecules: count[i]
  - an array of indices of nearby molecules: index[j] ( 0 <= j < count[i] )
58 Molecular Dynamics (continued)
    for some number of timesteps {
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }
59 Molecular Dynamics (simple)
    for some number of timesteps {
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }
60 Molecular Dynamics (simple)
- Simple to program.
- Possibly poor load balance:
  - block distribution of i iterations (molecules)
  - could lead to uneven neighbor distribution
  - cyclic does not help
61 Better Load Balance
- Assign iterations such that each processor has the same number of neighbors.
- Array of assign records
  - size: number of processors
  - two elements:
    - beginning i value (molecule)
    - ending i value (molecule)
- Recompute the partition periodically (one possible implementation is sketched below).
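One possible shape for the assign records and the repartitioning step, consistent with the assign[pr]->b / assign[pr]->e accesses used on the next slide; the record layout, the greedy heuristic, and the example counts are all illustrative, not the course's implementation:

    #include <stdio.h>
    #include <stdlib.h>

    #define P       4                      /* number of threads */
    #define NUM_MOL 12                     /* number of molecules (toy value) */

    struct assign_rec { int b, e; };       /* [b, e) range of molecules */
    struct assign_rec *assign[P];          /* one record per thread */

    /* toy neighbor counts; a real MD code computes these from positions */
    int count[NUM_MOL] = { 8, 1, 1, 1, 6, 6, 1, 1, 1, 1, 1, 12 };

    /* Give each thread roughly total/P neighbors (greedy prefix sums). */
    void repartition(void)
    {
        long total = 0, target, acc = 0;
        int i, pr = 0;

        for (i = 0; i < NUM_MOL; i++)
            total += count[i];
        target = (total + P - 1) / P;

        assign[0]->b = 0;
        for (i = 0; i < NUM_MOL; i++) {
            acc += count[i];
            if (acc >= target && pr < P - 1) {
                assign[pr]->e = i + 1;     /* close this thread's range  */
                assign[++pr]->b = i + 1;   /* open the next thread's one */
                acc = 0;
            }
        }
        assign[pr]->e = NUM_MOL;
        for (pr++; pr < P; pr++)           /* leftover threads: empty ranges */
            assign[pr]->b = assign[pr]->e = NUM_MOL;
    }

    int main(void)
    {
        for (int i = 0; i < P; i++)
            assign[i] = malloc(sizeof(struct assign_rec));
        repartition();
        for (int i = 0; i < P; i++)
            printf("thread %d: molecules [%d, %d)\n", i, assign[i]->b, assign[i]->e);
        return 0;
    }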
62 Molecular Dynamics (continued)
    for some number of timesteps {
      #pragma omp parallel
      {
        int pr = omp_get_thread_num();
        for( i=assign[pr]->b; i<assign[pr]->e; i++ )
          for( j=0; j<count[i]; j++ )
            force[i] += f(loc[i], loc[index[j]]);
      }
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }
63 Frequency of Balancing
- Every time the neighbor list is recomputed:
  - once during initialization,
  - every iteration,
  - every n iterations.
- Extra overhead vs. better approximation and better load balance.
64 Summary
- Parallel code optimization:
  - Critical section accesses.
  - Granularity.
  - Load balance.