Title: Parallel Programming
1. Parallel Programming
- Sathish S. Vadhiyar
- Course Web Page
- http://www.serc.iisc.ernet.in/vss/courses/PPP2007
2. Motivation for Parallel Programming
- Faster execution time due to non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves naturally to parallelization
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade: 40%
- Memory access time improvement in the past decade: 10%
- Grand challenge problems (more later)
3. Challenges / Problems in Parallel Algorithms
- Building efficient algorithms
- Avoiding:
- Communication delay
- Idling
- Synchronization
4. Challenges
[Figure: execution timeline for processes P0 and P1, showing computation, communication, synchronization, and idle time]
5. How do we evaluate a parallel program?
- Execution time, Tp
- Speedup, S
- S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E
- E(p, n) = S(p, n) / p
- Usually, E(p, n) < 1
- Sometimes, greater than 1
- Scalability: limitations in parallel computing, relation to n and p
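A quick worked illustration (the numbers are chosen here for exposition, not from the slides): suppose a program takes T(1, n) = 100 s on one processor and T(8, n) = 16 s on 8 processors. Then

$$S(8, n) = \frac{T(1,n)}{T(8,n)} = \frac{100}{16} = 6.25, \qquad E(8, n) = \frac{S(8,n)}{8} \approx 0.78.$$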
6. Speedups and efficiency
[Figure: speedup S and efficiency E as functions of p, comparing ideal and practical curves]
7. Limitations on speedup: Amdahl's law
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Expresses the overall speedup in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhancement.
- Places a limit on the speedup due to parallelism:
- Speedup = 1 / (fs + fp/P), where fs is the serial fraction and fp = 1 - fs is the parallelizable fraction
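As a worked illustration (figures chosen for exposition): with a serial fraction fs = 0.1 (so fp = 0.9) and P = 16 processors,

$$S = \frac{1}{f_s + f_p/P} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \lim_{P \to \infty} S = \frac{1}{f_s} = 10,$$

so no matter how many processors are used, the speedup never exceeds 10.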
8. Amdahl's law: Illustration
S = 1 / (s + (1 - s)/p)
Courtesy: http://www.metz.supelec.fr/dedu/docs/kohPaper/node2.html, http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
9. Amdahl's law: Analysis
- For a fixed serial fraction, the speedup falls increasingly short of the processor count as the number of processors grows.
- Thus Amdahl's law is a bit depressing for parallel programming.
- In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
10. Gustafson's Law
- Amdahl's law keeps the parallel work fixed.
- Gustafson's law keeps the computation time on the parallel processors fixed, and changes the fraction of parallel work to match that computation time.
- The serial component of the code is independent of problem size.
- The parallel component scales with problem size, which in turn scales with the number of processors.
- Scaled speedup: S = (Seq + Par(P) * P) / (Seq + Par(P)), where Seq is the serial time and Par(P) the parallel time per processor
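A worked illustration (numbers mine): normalize the time on the parallel machine so that Seq + Par(P) = 1, and take Seq = 0.05. For P = 16,

$$S = \frac{Seq + P \cdot Par(P)}{Seq + Par(P)} = 0.05 + 16 \times 0.95 = 15.25,$$

nearly linear speedup, in contrast to the fixed-size Amdahl bound.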
11. Metrics (Contd.)
[Table 5.1: Efficiency as a function of n and p]
12. Scalability
- Efficiency decreases with increasing P and increases with increasing N.
- Scalability: how effectively the parallel algorithm can use an increasing number of processors.
- Isoefficiency: how the amount of computation performed must scale with P to keep E constant. This function of N in terms of P is called the isoefficiency function.
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
13. Scalability Analysis: Finite Difference algorithm with 1D decomposition
For constant efficiency, a function of P, when substituted for N, must satisfy the following relation for increasing P and constant E.
The relation can be satisfied with N = P, except for small P.
Hence the isoefficiency function is O(P^2), since computation is O(N^2).
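A sketch of the underlying relation, under stated assumptions (per-processor computation time t_c N^2 / P, and a boundary exchange proportional to N per processor for a 1D row-block decomposition; t_c and c are illustrative constants, not from the slides):

$$E = \frac{t_c N^2}{t_c N^2 + c\,P\,N} = \frac{1}{1 + (c/t_c)(P/N)},$$

which stays constant only if N grows in proportion to P; since total computation is O(N^2), the isoefficiency function is O(P^2).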
14. Scalability Analysis: Finite Difference algorithm with 2D decomposition
The relation can be satisfied with N = sqrt(P).
Hence the isoefficiency function is O(P).
The 2D algorithm is more scalable than the 1D algorithm.
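Under the same illustrative assumptions, a 2D block decomposition exchanges borders of length N / sqrt(P) per processor, so

$$E = \frac{t_c N^2}{t_c N^2 + c\,N\sqrt{P}} = \frac{1}{1 + (c/t_c)(\sqrt{P}/N)},$$

and constant E requires only N proportional to sqrt(P), giving isoefficiency O(N^2) = O(P).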
15. Parallel Algorithm Design, Types and Models
16. Parallel Algorithm Types and Models
- Single Program Multiple Data (SPMD): see the sketch below
- Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
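A minimal sketch of the SPMD style in C with MPI (illustrative, not from the slides): every process runs the same program, and behavior is specialized by rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    /* Same binary everywhere; each rank works on its own part of the data. */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}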
17. Parallel Algorithm Types and Models
- Master-Worker / parameter sweep / task farming
- Pipeline / systolic / wavefront
[Figure: a master-worker arrangement of processes P0-P4, and a pipeline chain P0 -> P1 -> P2 -> P3 -> P4]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
18. Parallel Algorithm Types and Models
- Data parallel model: processes perform identical tasks on different data.
- Task parallel model: different processes perform different tasks on the same or different data, based on a task dependency graph.
- Work pool model: any task can be performed by any process; tasks are added to a work pool dynamically (see the sketch after this list).
- Pipeline model: a stream of data passes through a chain of processes (stream parallelism).
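A minimal sketch of the work pool model using OpenMP dynamic scheduling (the task function and counts are illustrative assumptions): idle threads grab the next unclaimed task, so any task can run on any thread.

#include <stdio.h>

#define NTASKS 16

/* Illustrative task whose cost varies with t, so a static assignment
   of tasks to threads would be load-imbalanced. */
static double do_task(int t) {
    double x = 0.0;
    for (long i = 0; i < (t + 1) * 1000000L; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void) {
    double total = 0.0;
    /* schedule(dynamic): iterations are handed out from a shared pool,
       one at a time, as threads become idle. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int t = 0; t < NTASKS; t++)
        total += do_task(t);
    printf("total = %f\n", total);
    return 0;
}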
19. Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks
20. Classification of Architectures: Flynn's classification
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD)
- - Vector processors and processor arrays
- - Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
21. Classification of Architectures: Flynn's classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD)
- - Most popular
- - IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
22. Classification of Architectures: Based on Memory
- Shared memory
- 2 types: UMA and NUMA
- NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
[Figure: UMA and NUMA shared-memory organizations]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
23. Classification of Architectures: Based on Memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
- Recently: multi-cores
- Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids
24. Programming Paradigms, Algorithm Types, Techniques
- Shared memory model: threads, OpenMP
- Message passing model: MPI
- Data parallel model: HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
25. Cache Coherence in SMPs
- All processors read variable x residing in cache line a.
- Each processor updates x at different points in time.
[Figure: CPU0-CPU3, each with a private cache (cache0-cache3) holding a copy of cache line a, above a shared main memory]
- Challenge: to maintain a consistent view of the data
- Protocols:
- - Write-update
- - Write-invalidate
26. Cache Coherence: Protocols and Implementations
- Write-update: propagate the cache line to the other processors on every write by a processor.
- Write-invalidate: each processor gets the updated cache line whenever it reads stale data.
- Which is better?
27. Caches: False Sharing
- Different processors update different parts of the same cache line.
- Leads to ping-pong of cache lines between processors.
- The situation is better under update protocols than invalidate protocols. Why?
[Figure: CPU0 updates A0, A2, A4 while CPU1 updates A1, A3, A5; the elements lie in the same cache lines (A0-A8, A9-A15) in main memory]
- Remedy: modify the algorithm to change the stride (see the sketch below)
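A hedged sketch in C with OpenMP (the line size, loop counts, and names are illustrative assumptions): two threads increment adjacent counters; packing each counter into its own cache line removes the false sharing that adjacent array elements would cause.

#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each counter is padded to occupy a full cache line, so the two
   threads never write to the same line. */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };

int main(void) {
    struct padded counters[2] = {{0}, {0}};
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++)
            counters[id].value++;   /* no line ping-pong with padding */
    }
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}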
28. Caches: Coherence Using Invalidate Protocols
- 3 states associated with data items:
- - Shared: the variable is shared by 2 caches
- - Invalid: another processor (say P0) has updated the data item
- - Dirty: the state of the data item in P0
- Implementations:
- - Snoopy: for bus-based architectures; memory operations are propagated over the bus and snooped.
- - Directory-based: instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors. A central directory maintains the states of cache blocks and the associated processors, implemented with presence bits (see the sketch below).
29. Interconnection Networks
- An interconnection network is defined by switches, links and interfaces.
- Switches provide mapping between input and output ports, buffering, routing, etc.
- Interfaces connect nodes with the network.
- Network topologies:
- - Static: point-to-point communication links among processing nodes
- - Dynamic: communication links are formed dynamically by switches
30. Interconnection Networks
- Static
- - Bus: SGI Challenge
- - Completely connected
- - Star
- - Linear array, ring (1-D torus)
- - Mesh: Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
- - k-d mesh: d dimensions with k nodes in each dimension
- - Hypercubes: a (log p)-dimensional mesh with 2 nodes per dimension; e.g. many MIMD machines
- - Trees: our campus network
- Dynamic: communication links are formed dynamically by switches
- - Crossbar: Cray X series; non-blocking network
- - Multistage: SP2; blocking network
31. Evaluating Interconnection Topologies
- Diameter: maximum distance between any two processing nodes
- - Fully connected: 1
- - Star: 2
- - Ring: p/2
- - Hypercube: log p
- Connectivity: multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
- - Linear array: 1
- - Ring: 2
- - 2-D mesh: 2
- - 2-D mesh with wraparound: 4
- - d-dimensional hypercube: d
32. Evaluating Interconnection Topologies
- Bisection width: minimum number of links to be removed from the network to partition it into 2 equal halves
- - Ring: 2
- - P-node 2-D mesh: sqrt(P)
- - Tree: 1
- - Star: 1
- - Completely connected: P^2/4
- - Hypercube: P/2
33. Evaluating Interconnection Topologies
- Channel width: number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes
- Channel rate: performance of a single physical wire
- Channel bandwidth: channel rate times channel width
- Bisection bandwidth: maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth
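A worked illustration (all figures chosen for exposition, not from the slides): consider a 64-node hypercube whose links have 32 wires, each wire carrying 1 Gbit/s.

$$\text{channel bandwidth} = 32 \times 1\ \text{Gbit/s} = 32\ \text{Gbit/s}, \qquad \text{bisection bandwidth} = \frac{P}{2} \times 32\ \text{Gbit/s} = 32 \times 32 = 1024\ \text{Gbit/s}.$$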
35. Parallel Algorithm Design: Components
- Decomposition: splitting the problem into tasks or modules
- Mapping: assigning tasks to processors
- Mapping has contradictory objectives:
- - To minimize idle times
- - To reduce communications
36. Parallel Algorithm Design: Containing Interaction Overheads
- Maximizing data locality
- Minimizing volume of data exchange
- Minimizing frequency of interactions
- Minimizing contention and hot spots
- Overlapping computations with interactions (see the sketch after this list)
- Overlapping interactions with interactions
- Replicating data or computations
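A minimal sketch of overlapping computation with interaction, in C with non-blocking MPI calls (the ring partners, message sizes, and work loop are illustrative assumptions): start the exchange, do work that does not depend on it, then wait.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;          /* illustrative ring neighbors */
    int prev = (rank - 1 + size) % size;
    double halo_out = (double)rank, halo_in = 0.0, interior = 0.0;
    MPI_Request reqs[2];

    /* Start the exchange without blocking. */
    MPI_Irecv(&halo_in, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&halo_out, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Interior computation proceeds while the messages are in flight. */
    for (long i = 0; i < 1000000L; i++)
        interior += 1e-6;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* now the halo is needed */
    printf("rank %d: interior=%f halo_in=%f\n", rank, interior, halo_in);
    MPI_Finalize();
    return 0;
}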