Transcript and Presenter's Notes

Title: ECE1747 Parallel Programming

1
ECE1747 Parallel Programming
  • Shared Memory Multithreading Pthreads

2
Shared Memory
  • All threads access the same shared memory data
    space.

[Diagram: processors proc1 through procN all connected to a single shared memory address space.]
3
Shared Memory (continued)
  • Concretely, it means that a variable x, a pointer
    p, or an array a refers to the same object, no
    matter what processor the reference originates
    from.
  • We have more or less implicitly assumed this to
    be the case in earlier examples.

4
Shared Memory
[Diagram: a single copy of variable a in shared memory, accessed by processors proc1 through procN.]
5
Distributed Memory - Message Passing
  • The alternative model to shared memory.

[Diagram: processors proc1 through procN, each with its own memory mem1 through memN holding a private copy of a, connected by a network.]
6
Shared Memory vs. Message Passing
  • The same terminology is used to distinguish
    hardware.
  • For us, these terms distinguish programming
    models, not hardware.

7
Programming vs. Hardware
  • One can implement a shared memory programming
    model on shared or distributed memory hardware
    (also in software or in hardware).
  • One can implement a message passing programming
    model on shared or distributed memory hardware.

8
Portability of programming models
[Diagram: both shared memory programming and message passing programming can be implemented on both shared memory and distributed memory machines.]
9
Shared Memory Programming: Important Point to
Remember
  • No matter what the implementation, it
    conceptually looks like shared memory.
  • There may be some (important) performance
    differences.

10
Multithreading
  • User has explicit control over threads.
  • Good: control can be used for performance
    benefit.
  • Bad: user has to deal with it.

11
Pthreads
  • POSIX standard shared-memory multithreading
    interface.
  • Provides primitives for process management and
    synchronization.

12
What does the user have to do?
  • Decide how to decompose the computation into
    parallel parts.
  • Create (and destroy) processes to support that
    decomposition.
  • Add synchronization to make sure dependences are
    covered.

13
General Thread Structure
  • Typically, a thread is a concurrent execution of
    a function or a procedure.
  • So, your program needs to be restructured such
    that parallel parts form separate procedures or
    functions.

14
Example of Thread Creation (contd.)
[Diagram: main() calls pthread_create(func); func() then executes concurrently with main().]
15
Thread Joining Example
    void *func(void *) { .. }

    pthread_t id; int X;
    pthread_create(&id, NULL, func, &X);
    ..
    pthread_join(id, NULL);
    ..

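Filled out into a complete, runnable sketch (the body of func and the use of X are illustrative assumptions, not from the slide):

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: receives a pointer to an int and prints it. */
    void *func(void *arg)
    {
        int *x = (int *)arg;
        printf("thread received %d\n", *x);
        return NULL;
    }

    int main(void)
    {
        pthread_t id;
        int X = 42;

        /* Create a thread executing func, passing &X as its argument. */
        pthread_create(&id, NULL, func, &X);

        /* Wait until the thread has terminated. */
        pthread_join(id, NULL);
        return 0;
    }
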
16
Example of Thread Creation (contd.)
[Diagram: main() calls pthread_create(func); func() runs concurrently and finishes with pthread_exit(); main() waits in pthread_join(id) for it to terminate.]
17
Sequential SOR
    for some number of timesteps/iterations {
      for( i=0; i<n; i++ )
        for( j=1; j<n; j++ )
          temp[i][j] = 0.25 *
            ( grid[i-1][j] + grid[i+1][j] +
              grid[i][j-1] + grid[i][j+1] );
      for( i=0; i<n; i++ )
        for( j=1; j<n; j++ )
          grid[i][j] = temp[i][j];
    }

18
Parallel SOR
  • First (i,j) loop nest can be parallelized.
  • Second (i,j) loop nest can be parallelized.
  • Must wait to start second loop nest until all
    processors have finished first.
  • Must wait to start the first loop nest of the
    next iteration until all processors have
    finished the second loop nest of the previous
    iteration.
  • Give n/p rows to each processor.

19
Pthreads SOR: Parallel parts (1)

    void *sor_1(void *s)
    {
      int slice = (int)s;
      int from = (slice * n)/p;
      int to = ((slice+1) * n)/p;

      for( i=from; i<to; i++ )
        for( j=0; j<n; j++ )
          temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                               grid[i][j-1] + grid[i][j+1]);
    }

20
Pthreads SOR: Parallel parts (2)

    void *sor_2(void *s)
    {
      int slice = (int)s;
      int from = (slice * n)/p;
      int to = ((slice+1) * n)/p;

      for( i=from; i<to; i++ )
        for( j=0; j<n; j++ )
          grid[i][j] = temp[i][j];
    }

21
Pthreads SOR: main

    for some number of timesteps {
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_1, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], NULL, sor_2, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);
    }

22
Summary: Thread Management
  • pthread_create() creates a parallel thread
    executing a given function (and arguments),
    returns thread identifier.
  • pthread_exit() terminates thread.
  • pthread_join() waits for thread with particular
    thread identifier to terminate.

23
Summary: Program Structure
  • Encapsulate parallel parts in functions.
  • Use function arguments to parameterize what a
    particular thread does.
  • Call pthread_create() with the function and
    arguments, save thread identifier returned.
  • Call pthread_join() with that thread identifier.

24
Pthreads Synchronization
  • Create/exit/join
  • provide some form of synchronization,
  • but only at a very coarse level,
  • and require thread creation/destruction.
  • Need for finer-grain synchronization
  • mutex locks,
  • condition variables.

25
Use of Mutex Locks
  • To implement critical sections.
  • Pthreads provides only exclusive locks.
  • Some other systems allow shared-read,
    exclusive-write locks.

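A minimal critical-section sketch with an exclusive Pthreads lock (the shared counter is an illustrative assumption):

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    int counter = 0;   /* shared variable updated by several threads */

    void increment(void)
    {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* exclusive access to counter */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
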
26
Barrier Synchronization
  • A wait at a barrier causes a thread to wait until
    all threads have performed a wait at the barrier.
  • At that point, they all proceed.

27
Implementing Barriers in Pthreads
  • Count the number of arrivals at the barrier.
  • Wait if this is not the last arrival.
  • Make everyone unblock if this is the last
    arrival.
  • Since the arrival count is a shared variable,
    enclose the whole operation in a mutex
    lock-unlock.

28
Implementing Barriers in Pthreads
    void barrier()
    {
      pthread_mutex_lock(&mutex_arr);
      arrived++;
      if (arrived < N) {
        pthread_cond_wait(&cond, &mutex_arr);
      }
      else {
        pthread_cond_broadcast(&cond);
        arrived = 0;  /* be prepared for next barrier */
      }
      pthread_mutex_unlock(&mutex_arr);
    }

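The shared state this barrier relies on, as a minimal sketch (the names mutex_arr, cond, arrived, and N follow the slide; the initializers and the value of N are assumptions):

    #include <pthread.h>

    #define N 4   /* total number of threads (assumed value) */

    pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
    int arrived = 0;   /* number of threads that have reached the barrier */

Note that POSIX allows pthread_cond_wait() to return spuriously, so production code would re-check its condition in a loop; the slide's version keeps the logic minimal.
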
29
Parallel SOR with Barriers (1 of 2)
    void *sor (void *arg)
    {
      int slice = (int)arg;
      int from = (slice * (n-1))/p + 1;
      int to = ((slice+1) * (n-1))/p + 1;

      for some number of iterations {

30
Parallel SOR with Barriers (2 of 2)
        for( i=from; i<to; i++ )
          for( j=1; j<n; j++ )
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
        barrier();
        for( i=from; i<to; i++ )
          for( j=1; j<n; j++ )
            grid[i][j] = temp[i][j];
        barrier();
      }
    }

31
Parallel SOR with Barriers: main

    int main(int argc, char *argv[])
    {
      pthread_t thrd[p];

      /* Initialize mutex and condition variables */

      for( i=0; i<p; i++ )
        pthread_create(&thrd[i], &attr, sor, (void *)i);
      for( i=0; i<p; i++ )
        pthread_join(thrd[i], NULL);

      /* Destroy mutex and condition variables */
    }

32
Note again
  • Many shared memory programming systems (other
    than Pthreads) have barriers as a basic
    primitive.
  • If they do, you should use it, not construct it
    yourself.
  • The implementation may be more efficient than
    what you can do yourself.

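POSIX has since added an optional barrier primitive of its own; a minimal, runnable usage sketch (the worker payload is an illustrative assumption):

    #include <pthread.h>
    #include <stdio.h>

    #define N 4
    pthread_barrier_t barr;

    void *worker(void *arg)
    {
        printf("phase 1 done in thread %ld\n", (long)arg);
        pthread_barrier_wait(&barr);   /* all N threads block here until the last arrives */
        printf("phase 2 starts in thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        pthread_barrier_init(&barr, NULL, N);   /* N participating threads */
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barr);
        return 0;
    }
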
33
Busy Waiting
  • Not an explicit part of the API.
  • Available in a general shared memory programming
    environment.

34
Busy Waiting
    initially: flag = 0;

    P1:  produce data;
         flag = 1;

    P2:  while( !flag ) ;
         consume data;

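With a modern C compiler, a plain int flag is not guaranteed to work (the compiler may keep it in a register, and the hardware may reorder the stores); one way to express the same pattern safely is with C11 atomics, sketched here (names and the data payload are illustrative assumptions):

    #include <stdatomic.h>

    atomic_int flag = 0;
    int data;                        /* value handed from producer to consumer */

    void producer(void)
    {
        data = 42;                   /* produce data */
        atomic_store(&flag, 1);      /* then publish it (sequentially consistent) */
    }

    void consumer(void)
    {
        while (!atomic_load(&flag))  /* busy-wait (spin) until published */
            ;
        /* consume data ... */
    }
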
35
Use of Busy Waiting
  • On the surface, simple and efficient.
  • In general, not a recommended practice.
  • Often leads to messy and unreadable code (blurs
    data/synchronization distinction).
  • May be inefficient.

36
Private Data in Pthreads
  • To make a variable private in Pthreads, you need
    to make an array out of it.
  • Index the array by thread identifier, which you
    can get by the pthread_self() call.
  • Not very elegant or efficient.

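A minimal sketch of the pattern. Because pthread_self() returns an opaque pthread_t rather than a small integer, the variant sketched here indexes by an integer id passed to each thread at creation (all names are illustrative assumptions):

    #include <pthread.h>

    #define P 4                        /* number of threads (assumed) */

    int my_sum[P];                     /* one "private" slot per thread */

    void *worker(void *arg)
    {
        int me = (int)(long)arg;       /* this thread's index, passed at creation */
        my_sum[me] = 0;                /* each thread touches only its own slot */
        for (int i = me; i < 1000; i += P)
            my_sum[me] += i;
        return NULL;
    }
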
37
Other Primitives in Pthreads
  • Set the attributes of a thread.
  • Set the attributes of a mutex lock.
  • Set scheduling parameters.

38
ECE 1747 Parallel Programming
  • Machine-independent Performance Optimization
    Techniques

39
Returning to Sequential vs. Parallel
  • Sequential execution time: t seconds.
  • Startup overhead of parallel execution: t_st
    seconds (depends on architecture).
  • (Ideal) parallel execution time: t/p + t_st.
  • If t/p + t_st > t, no gain.

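For example, with t = 1 s, t_st = 0.1 s, and p = 4, the ideal parallel time is 1/4 + 0.1 = 0.35 s, a speedup of about 2.9 rather than 4; with p = 100 the same overhead caps the time at 0.01 + 0.1 = 0.11 s, a speedup of only about 9.
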
40
General Idea
  • Parallelism limited by dependences.
  • Restructure code to eliminate or reduce
    dependences.
  • Sometimes possible by compiler, but good to know
    how to do it by hand.

41
Summary
  • Reorganize code such that
  • dependences are removed or reduced
  • large pieces of parallel work emerge
  • loop bounds become known
  • Code can become messy; there is a point of
    diminishing returns.

42
Factors that Determine Speedup
  • Characteristics of parallel code
  • granularity
  • load balance
  • locality
  • communication and synchronization

43
Granularity
  • Granularity = size of the program unit that is
    executed by a single processor.
  • May be a single loop iteration, a set of loop
    iterations, etc.
  • Fine granularity leads to
  • (positive) ability to use lots of processors
  • (positive) finer-grain load balancing
  • (negative) increased overhead

44
Granularity and Critical Sections
  • Small granularity => more processors => more
    critical section accesses => more contention.

45
Issues in Performance of Parallel Parts
  • Granularity.
  • Load balance.
  • Locality.
  • Synchronization and communication.

46
Load Balance
  • Load imbalance = difference in execution time
    between processors between barriers.
  • Execution time may not be predictable.
  • Regular data parallel: yes.
  • Irregular data parallel or pipeline: perhaps.
  • Task queue: no.

47
Static vs. Dynamic
  • Static: done once, by the programmer
  • block, cyclic, etc.
  • fine for regular data parallel
  • Dynamic: done at runtime
  • task queue
  • fine for unpredictable execution times
  • usually high overhead
  • Semi-static: done once, at run-time

48
Choice is not inherent
  • MM or SOR could be done using task queues: put
    all iterations in a queue.
  • In a heterogeneous environment.
  • In a multitasked environment.
  • TSP could be done using static partitioning:
    give length-1 paths to the processors.
  • If we did exhaustive search.

49
Static Load Balancing
  • Block
  • best locality
  • possibly poor load balance
  • Cyclic
  • better load balance
  • worse locality
  • Block-cyclic
  • load balancing advantages of cyclic (mostly)
  • better locality (see later)
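
As a sketch, here is how the three distributions map iteration i (of n iterations on p processors) to an owner; B is the block size for block-cyclic (all names are illustrative assumptions):

    /* Owner of iteration i under each static distribution. */
    int block_owner(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int cyclic_owner(int i, int p)              { return i % p; }
    int block_cyclic_owner(int i, int B, int p) { return (i / B) % p; }
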

50
Dynamic Load Balancing (1 of 2)
  • Centralized: single task queue (see the sketch
    below).
  • Easy to program.
  • Excellent load balance.
  • Distributed: task queue per processor.
  • Less communication/synchronization.

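A minimal sketch of a centralized task queue for a loop of independent iterations (the counter representation and all names are illustrative assumptions):

    #include <pthread.h>

    pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    int next_task   = 0;     /* next unclaimed iteration */
    int total_tasks = 1024;  /* total number of work items (assumed) */

    /* Each worker calls this repeatedly; returns -1 when the queue is empty. */
    int get_task(void)
    {
        int t;
        pthread_mutex_lock(&q_lock);
        t = (next_task < total_tasks) ? next_task++ : -1;
        pthread_mutex_unlock(&q_lock);
        return t;
    }
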
51
Dynamic Load Balancing (2 of 2)
  • Task stealing
  • Processes normally insert tasks into and remove
    tasks from their own queue.
  • When their own queue is empty, they remove
    task(s) from other queues.
  • Extra overhead and programming difficulty.
  • Better load balancing.

52
Semi-static Load Balancing
  • Measure the cost of program parts.
  • Use the measurements to partition the
    computation.
  • Done once, every iteration, or every n
    iterations.

53
Molecular Dynamics (MD)
  • Simulation of a set of bodies under the influence
    of physical laws.
  • Atoms, molecules, celestial bodies, ...
  • All have the same basic structure.

54
Molecular Dynamics (Skeleton)
    for some number of timesteps {
      for all molecules i
        for all other molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }

55
Molecular Dynamics
  • To reduce amount of computation, account for
    interaction only with nearby molecules.

56
Molecular Dynamics (continued)
    for some number of timesteps {
      for all molecules i
        for all nearby molecules j
          force[i] += f( loc[i], loc[j] );
      for all molecules i
        loc[i] = g( loc[i], force[i] );
    }

57
Molecular Dynamics (continued)
    for each molecule i:
      number of nearby molecules: count[i]
      array of indices of nearby molecules: index[j]
        ( 0 <= j < count[i] )

58
Molecular Dynamics (continued)
    for some number of timesteps {
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

59
Molecular Dynamics (simple)
    for some number of timesteps {
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
          force[i] += f(loc[i], loc[index[j]]);
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

60
Molecular Dynamics (simple)
  • Simple to program.
  • Possibly poor load balance
  • block distribution of i iterations (molecules)
  • could lead to uneven neighbor distribution
  • cyclic does not help

61
Better Load Balance
  • Assign iterations such that each processor has
    the same number of neighbors.
  • Array of assign records (see the sketch below):
  • size = number of processors
  • two elements:
  • beginning i value (molecule)
  • ending i value (molecule)
  • Recompute the partition periodically.

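A minimal sketch of the assign record indexed on the next slide (the field names b and e follow the slide's usage; everything else is an illustrative assumption):

    #define P 4                 /* number of processors/threads (assumed) */

    struct assign_rec {
        int b;                  /* beginning molecule index for this thread */
        int e;                  /* ending molecule index (exclusive) */
    };

    struct assign_rec *assign[P];   /* one record per thread: assign[pr]->b .. assign[pr]->e */
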
62
Molecular Dynamics (continued)
    for some number of timesteps {
      #pragma omp parallel
      {
        pr = omp_get_thread_num();
        for( i=assign[pr]->b; i<assign[pr]->e; i++ )
          for( j=0; j<count[i]; j++ )
            force[i] += f(loc[i], loc[index[j]]);
      }
      #pragma omp parallel for
      for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
    }

63
Frequency of Balancing
  • Every time the neighbor list is recomputed
  • once during initialization,
  • every iteration, or
  • every n iterations.
  • Extra overhead vs. better approximation and
    better load balance.

64
Summary
  • Parallel code optimization
  • Critical section accesses.
  • Granularity.
  • Load balance.