Title: Parallel Programming
1. Parallel Programming
- Sathish S. Vadhiyar
- Course Web Page
- http://www.serc.iisc.ernet.in/vss/courses/PPP2007
2. Motivation for Parallel Programming
- Faster execution time due to non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves naturally to parallelization
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade: 40%
- Memory access time improvement in the past decade: 10%
- Grand challenge problems (more later)
3. Challenges / Problems in Parallel Algorithms
- Building efficient algorithms
- Avoiding:
- Communication delay
- Idling
- Synchronization
4. Challenges
[Figure: execution timeline for processes P0 and P1, showing computation, communication, synchronization, and idle time]
5. How do we evaluate a parallel program?
- Execution time, Tp
- Speedup, S
- S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E
- E(p, n) = S(p, n) / p
- Usually, E(p, n) < 1
- Sometimes, greater than 1
- Scalability: limitations in parallel computing, relation to n and p
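A quick worked illustration (the numbers are chosen here for exposition, not from the slides): suppose a program takes T(1, n) = 100 s on one processor and T(8, n) = 16 s on 8 processors. Then

$$S(8, n) = \frac{T(1,n)}{T(8,n)} = \frac{100}{16} = 6.25, \qquad E(8, n) = \frac{S(8,n)}{8} \approx 0.78.$$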
6. Speedups and efficiency
[Figure: speedup S and efficiency E as functions of p, comparing ideal and practical curves]
7. Limitations on speedup: Amdahl's law
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Expresses the overall speedup in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhancement.
- Places a limit on the speedup due to parallelism:
- Speedup = 1 / (fs + fp/P), where fs is the serial fraction and fp = 1 - fs is the parallelizable fraction
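As a worked illustration (figures chosen for exposition): with a serial fraction fs = 0.1 (so fp = 0.9) and P = 16 processors,

$$S = \frac{1}{f_s + f_p/P} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \lim_{P \to \infty} S = \frac{1}{f_s} = 10,$$

so no matter how many processors are used, the speedup never exceeds 10.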
8. Amdahl's law: Illustration
S = 1 / (s + (1 - s)/p)
Courtesy: http://www.metz.supelec.fr/dedu/docs/kohPaper/node2.html, http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
9. Amdahl's law: Analysis
- For a fixed serial fraction, the speedup falls increasingly short of the processor count as the number of processors grows.
- Thus Amdahl's law is a bit depressing for parallel programming.
- In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
10. Gustafson's Law
- Amdahl's law keeps the parallel work fixed.
- Gustafson's law keeps the computation time on the parallel processors fixed, and changes the fraction of parallel work to match that computation time.
- The serial component of the code is independent of problem size.
- The parallel component scales with problem size, which in turn scales with the number of processors.
- Scaled speedup: S = (Seq + Par(P) * P) / (Seq + Par(P)), where Seq is the serial time and Par(P) the parallel time per processor
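A worked illustration (numbers mine): normalize the time on the parallel machine so that Seq + Par(P) = 1, and take Seq = 0.05. For P = 16,

$$S = \frac{Seq + P \cdot Par(P)}{Seq + Par(P)} = 0.05 + 16 \times 0.95 = 15.25,$$

nearly linear speedup, in contrast to the fixed-size Amdahl bound.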
11. Metrics (Contd.)
[Table 5.1: Efficiency as a function of n and p]
12. Scalability
- Efficiency decreases with increasing P and increases with increasing N.
- Scalability: how effectively the parallel algorithm can use an increasing number of processors.
- Isoefficiency: how the amount of computation performed must scale with P to keep E constant. This function of N in terms of P is called the isoefficiency function.
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
13. Scalability Analysis: Finite Difference algorithm with 1D decomposition
For constant efficiency, a function of P, when substituted for N, must satisfy the following relation for increasing P and constant E.
The relation can be satisfied with N = P, except for small P.
Hence the isoefficiency function is O(P^2), since computation is O(N^2).
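A sketch of the underlying relation, under stated assumptions (per-processor computation time t_c N^2 / P, and a boundary exchange proportional to N per processor for a 1D row-block decomposition; t_c and c are illustrative constants, not from the slides):

$$E = \frac{t_c N^2}{t_c N^2 + c\,P\,N} = \frac{1}{1 + (c/t_c)(P/N)},$$

which stays constant only if N grows in proportion to P; since total computation is O(N^2), the isoefficiency function is O(P^2).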
14. Scalability Analysis: Finite Difference algorithm with 2D decomposition
The relation can be satisfied with N = sqrt(P).
Hence the isoefficiency function is O(P).
The 2D algorithm is more scalable than the 1D algorithm.
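Under the same illustrative assumptions, a 2D block decomposition exchanges borders of length N / sqrt(P) per processor, so

$$E = \frac{t_c N^2}{t_c N^2 + c\,N\sqrt{P}} = \frac{1}{1 + (c/t_c)(\sqrt{P}/N)},$$

and constant E requires only N proportional to sqrt(P), giving isoefficiency O(N^2) = O(P).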
15. Parallel Algorithm Design, Types and Models
16. Parallel Algorithm Types and Models
- Single Program Multiple Data (SPMD): see the sketch below
- Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
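A minimal sketch of the SPMD style in C with MPI (illustrative, not from the slides): every process runs the same program, and behavior is specialized by rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    /* Same binary everywhere; each rank works on its own part of the data. */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}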
17. Parallel Algorithm Types and Models
- Master-Worker / parameter sweep / task farming
- Pipeline / systolic / wavefront
[Figure: a master-worker arrangement of processes P0-P4, and a pipeline chain P0 -> P1 -> P2 -> P3 -> P4]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
18. Parallel Algorithm Types and Models
- Data parallel model: processes perform identical tasks on different data.
- Task parallel model: different processes perform different tasks on the same or different data, based on a task dependency graph.
- Work pool model: any task can be performed by any process; tasks are added to a work pool dynamically (see the sketch after this list).
- Pipeline model: a stream of data passes through a chain of processes (stream parallelism).
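A minimal sketch of the work pool model using OpenMP dynamic scheduling (the task function and counts are illustrative assumptions): idle threads grab the next unclaimed task, so any task can run on any thread.

#include <stdio.h>

#define NTASKS 16

/* Illustrative task whose cost varies with t, so a static assignment
   of tasks to threads would be load-imbalanced. */
static double do_task(int t) {
    double x = 0.0;
    for (long i = 0; i < (t + 1) * 1000000L; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void) {
    double total = 0.0;
    /* schedule(dynamic): iterations are handed out from a shared pool,
       one at a time, as threads become idle. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int t = 0; t < NTASKS; t++)
        total += do_task(t);
    printf("total = %f\n", total);
    return 0;
}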
19. Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks
20. Classification of Architectures: Flynn's classification
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD)
- - Vector processors and processor arrays
- - Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
21. Classification of Architectures: Flynn's classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD)
- - Most popular
- - IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
22. Classification of Architectures: Based on Memory
- Shared memory
- 2 types: UMA and NUMA
- NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
[Figure: UMA and NUMA shared-memory organizations]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
23. Classification of Architectures: Based on Memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
- Recently: multi-cores
- Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids
24. Programming Paradigms, Algorithm Types, Techniques
- Shared memory model: threads, OpenMP
- Message passing model: MPI
- Data parallel model: HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
25. Cache Coherence in SMPs
- All processors read variable x residing in cache line a.
- Each processor updates x at different points in time.
[Figure: CPU0-CPU3, each with a private cache (cache0-cache3) holding a copy of cache line a, above a shared main memory]
- Challenge: to maintain a consistent view of the data
- Protocols:
- - Write-update
- - Write-invalidate
26. Cache Coherence: Protocols and Implementations
- Write-update: propagate the cache line to the other processors on every write by a processor.
- Write-invalidate: each processor gets the updated cache line whenever it reads stale data.
- Which is better?
27. Caches: False Sharing
- Different processors update different parts of the same cache line.
- Leads to ping-pong of cache lines between processors.
- The situation is better under update protocols than invalidate protocols. Why?
[Figure: CPU0 updates A0, A2, A4 while CPU1 updates A1, A3, A5; the elements lie in the same cache lines (A0-A8, A9-A15) in main memory]
- Remedy: modify the algorithm to change the stride (see the sketch below)
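A hedged sketch in C with OpenMP (the line size, loop counts, and names are illustrative assumptions): two threads increment adjacent counters; packing each counter into its own cache line removes the false sharing that adjacent array elements would cause.

#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each counter is padded to occupy a full cache line, so the two
   threads never write to the same line. */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };

int main(void) {
    struct padded counters[2] = {{0}, {0}};
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++)
            counters[id].value++;   /* no line ping-pong with padding */
    }
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}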
28. Caches: Coherence Using Invalidate Protocols
- 3 states associated with data items:
- - Shared: the variable is shared by 2 caches
- - Invalid: another processor (say P0) has updated the data item
- - Dirty: the state of the data item in P0
- Implementations:
- - Snoopy: for bus-based architectures; memory operations are propagated over the bus and snooped.
- - Directory-based: instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors. A central directory maintains the states of cache blocks and the associated processors, implemented with presence bits (see the sketch below).
29. Interconnection Networks
- An interconnection network is defined by switches, links and interfaces.
- Switches provide mapping between input and output ports, buffering, routing, etc.
- Interfaces connect nodes with the network.
- Network topologies:
- - Static: point-to-point communication links among processing nodes
- - Dynamic: communication links are formed dynamically by switches
30. Interconnection Networks
- Static
- - Bus: SGI Challenge
- - Completely connected
- - Star
- - Linear array, ring (1-D torus)
- - Mesh: Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
- - k-d mesh: d dimensions with k nodes in each dimension
- - Hypercubes: a (log p)-dimensional mesh with 2 nodes per dimension; e.g. many MIMD machines
- - Trees: our campus network
- Dynamic: communication links are formed dynamically by switches
- - Crossbar: Cray X series; non-blocking network
- - Multistage: SP2; blocking network
31. Evaluating Interconnection Topologies
- Diameter: maximum distance between any two processing nodes
- - Fully connected: 1
- - Star: 2
- - Ring: p/2
- - Hypercube: log p
- Connectivity: multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
- - Linear array: 1
- - Ring: 2
- - 2-D mesh: 2
- - 2-D mesh with wraparound: 4
- - d-dimensional hypercube: d
32. Evaluating Interconnection Topologies
- Bisection width: minimum number of links to be removed from the network to partition it into 2 equal halves
- - Ring: 2
- - P-node 2-D mesh: sqrt(P)
- - Tree: 1
- - Star: 1
- - Completely connected: P^2/4
- - Hypercube: P/2
33. Evaluating Interconnection Topologies
- Channel width: number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes
- Channel rate: performance of a single physical wire
- Channel bandwidth: channel rate times channel width
- Bisection bandwidth: maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth
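A worked illustration (all figures chosen for exposition, not from the slides): consider a 64-node hypercube whose links have 32 wires, each wire carrying 1 Gbit/s.

$$\text{channel bandwidth} = 32 \times 1\ \text{Gbit/s} = 32\ \text{Gbit/s}, \qquad \text{bisection bandwidth} = \frac{P}{2} \times 32\ \text{Gbit/s} = 32 \times 32 = 1024\ \text{Gbit/s}.$$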
35. Parallel Algorithm Design: Components
- Decomposition: splitting the problem into tasks or modules
- Mapping: assigning tasks to processors
- Mapping has contradictory objectives:
- - To minimize idle times
- - To reduce communications
36. Parallel Algorithm Design: Containing Interaction Overheads
- Maximizing data locality
- Minimizing volume of data exchange
- Minimizing frequency of interactions
- Minimizing contention and hot spots
- Overlapping computations with interactions (see the sketch after this list)
- Overlapping interactions with interactions
- Replicating data or computations
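A minimal sketch of overlapping computation with interaction, in C with non-blocking MPI calls (the ring partners, message sizes, and work loop are illustrative assumptions): start the exchange, do work that does not depend on it, then wait.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;          /* illustrative ring neighbors */
    int prev = (rank - 1 + size) % size;
    double halo_out = (double)rank, halo_in = 0.0, interior = 0.0;
    MPI_Request reqs[2];

    /* Start the exchange without blocking. */
    MPI_Irecv(&halo_in, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&halo_out, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Interior computation proceeds while the messages are in flight. */
    for (long i = 0; i < 1000000L; i++)
        interior += 1e-6;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* now the halo is needed */
    printf("rank %d: interior=%f halo_in=%f\n", rank, interior, halo_in);
    MPI_Finalize();
    return 0;
}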