1
Parallel Programming
  • Sathish S. Vadhiyar
  • Course Web Page
  • http://www.serc.iisc.ernet.in/~vss/courses/PPP2007

2
Motivation for Parallel Programming
  • Faster execution time due to non-dependencies
    between regions of code
  • Presents a level of modularity
  • Resource constraints. Large databases.
  • Certain classes of algorithms lend themselves to parallelism
  • Aggregate bandwidth to memory/disk. Increase in
    data throughput.
  • Clock rate improvement in the past decade: 40%
  • Memory access time improvement in the past decade: 10%
  • Grand challenge problems (more later)

3
Challenges / Problems in Parallel Algorithms
  • Building efficient algorithms.
  • Avoiding
  • Communication delay
  • Idling
  • Synchronization

4
Challenges
[Figure: execution timeline for processes P0 and P1 showing computation, communication, synchronization, and idle time]
5
How do we evaluate a parallel program?
  • Execution time, Tp
  • Speedup, S
  • S(p, n) = T(1, n) / T(p, n) (see the sketch below)
  • Usually, S(p, n) < p
  • Sometimes S(p, n) > p (superlinear speedup)
  • Efficiency, E
  • E(p, n) = S(p, n) / p
  • Usually, E(p, n) < 1
  • Sometimes, greater than 1
  • Scalability: limitations in parallel computing,
    relation to n and p
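A minimal sketch of computing the speedup and efficiency defined above from measured wall-clock times (the function names and the example timings are illustrative, not from the course):

  #include <stdio.h>

  /* Speedup: S(p, n) = T(1, n) / T(p, n) */
  double speedup(double t_serial, double t_parallel) {
      return t_serial / t_parallel;
  }

  /* Efficiency: E(p, n) = S(p, n) / p */
  double efficiency(double t_serial, double t_parallel, int p) {
      return speedup(t_serial, t_parallel) / p;
  }

  int main(void) {
      double t1 = 100.0;  /* hypothetical serial time in seconds */
      double t8 = 16.0;   /* hypothetical time on 8 processors */
      printf("S = %.2f, E = %.2f\n",
             speedup(t1, t8), efficiency(t1, t8, 8));
      /* prints S = 6.25, E = 0.78 -- below the ideal S = 8, E = 1 */
      return 0;
  }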

6
Speedups and efficiency
[Figure: speedup S and efficiency E versus number of processors p, ideal vs. practical curves]
7
Limitations on speedup: Amdahl's law
  • Amdahl's law states that the performance
    improvement to be gained from using some faster
    mode of execution is limited by the fraction of
    the time the faster mode can be used.
  • Overall speedup is expressed in terms of the fractions of
    computation time with and without the enhancement, and the
    speedup of the enhanced portion.
  • Places a limit on the speedup due to parallelism.
  • Speedup = 1 / (fs + fp/P), where fs is the serial fraction and
    fp = 1 - fs is the parallel fraction (see the sketch below)
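A small sketch of the bound (the serial fraction fs = 0.1 is an arbitrary assumption): no matter how many processors are used, the speedup cannot exceed 1/fs.

  #include <stdio.h>

  /* Amdahl's law: S = 1 / (fs + fp/P), with fp = 1 - fs */
  double amdahl(double fs, int p) {
      return 1.0 / (fs + (1.0 - fs) / p);
  }

  int main(void) {
      double fs = 0.1;  /* assumed serial fraction */
      for (int p = 1; p <= 1024; p *= 4)
          printf("P = %4d  S = %5.2f\n", p, amdahl(fs, p));
      /* S approaches 1/fs = 10 no matter how large P grows */
      return 0;
  }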

8
Amdahl's law: Illustration
S = 1 / (s + (1 - s)/p)
Courtesy: http://www.metz.supelec.fr/dedu/docs/kohPaper/node2.html,
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
9
Amdahl's law: analysis
  • For a fixed serial fraction, the achievable speedup falls
    further and further behind the processor count as P grows.
  • Thus Amdahl's law is a bit depressing for
    parallel programming.
  • In practice, the number of parallel portions of
    work has to be large enough to match a given
    number of processors.

10
Gustafson's Law
  • Amdahl's law: keep the parallel work fixed
  • Gustafson's law: keep the computation time on the parallel
    processors fixed, and change the fraction of parallel work to
    match that computation time
  • The serial component of the code is independent of problem size
  • The parallel component scales with the problem size, which
    scales with the number of processors
  • Scaled Speedup, S = (Seq + Par(P)·P) / (Seq + Par(P))
    (see the sketch below)
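A sketch contrasting this with Amdahl's law for the same serial fraction (the 0.1 is again an arbitrary assumption):

  #include <stdio.h>

  /* Gustafson's scaled speedup: S = (Seq + Par*P) / (Seq + Par),
     with Seq + Par = 1 measured on the P-processor run */
  double gustafson(double seq, int p) {
      return seq + (1.0 - seq) * p;
  }

  int main(void) {
      double seq = 0.1;  /* assumed serial fraction of the parallel run */
      for (int p = 1; p <= 1024; p *= 4)
          printf("P = %4d  scaled S = %7.1f\n", p, gustafson(seq, p));
      /* the scaled speedup keeps growing with P, because the parallel
         part of the problem grows with the machine */
      return 0;
  }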

11
Metrics (Contd..)
Table 5.1 Efficiency as a function of n and p.
12
Scalability
  • Efficiency decreases with increasing P and increases with
    increasing N
  • How effectively the parallel algorithm can use an
    increasing number of processors
  • How the amount of computation performed must
    scale with P to keep E constant
  • This function of N in terms of P is called the
    isoefficiency function.
  • An algorithm with an isoefficiency function of
    O(P) is highly scalable while an algorithm with
    quadratic or exponential isoefficiency function
    is poorly scalable

13
Scalability Analysis: Finite Difference algorithm with 1D
decomposition
For constant efficiency, a function of P, when substituted for N,
must satisfy the following relation (sketched below) for increasing
P and constant E.
Can be satisfied with N ∝ P, except for small P.
Hence the isoefficiency function is O(P²), since the computation is
O(N²).
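The relation itself is a figure that did not survive the transcript; a sketch of the usual cost model for an N x N grid with 1-D decomposition (the constants t_c for computation per grid point and t_w for communication per word are assumptions of this sketch) is:

  T_1 \approx t_c N^2, \qquad
  T_P \approx t_c \frac{N^2}{P} + 2 t_w N

  E = \frac{T_1}{P \, T_P} = \frac{1}{1 + 2 t_w P / (t_c N)}

Keeping E constant therefore requires P/N to stay constant, i.e. N ∝ P, so the total computation N² must grow as O(P²), matching the isoefficiency function above.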
14
Scalability Analysis: Finite Difference algorithm with 2D
decomposition
Can be satisfied with N ∝ √P.
Hence the isoefficiency function is O(P).
The 2D algorithm is more scalable than the 1D one.
15
  • Parallel Algorithm Design, Types and Models

16
Parallel Algorithm Types and Models
  • Single Program Multiple Data (SPMD)
  • Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
17
Parallel Algorithm Types and Models
  • Master-Worker / parameter sweep / task farming
  • Pipeline / systolic / wavefront

[Figures: a master-worker farm and a pipeline, each drawn over processes P0-P4]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
18
Parallel Algorithm Types and Models
  • Data parallel model
  • Processes perform identical tasks on different
    data
  • Task parallel model
  • Different processes perform different tasks on
    same or different data based on task dependency
    graph
  • Work pool model
  • Any task can be performed by any process. Tasks
    are added to a work pool dynamically
  • Pipeline model
  • A stream of data passes through a chain of
    processes (stream parallelism)
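A minimal sketch of the master-worker / work-pool model described above, in C with MPI (the task ids, tags and the trivial squaring "work" are placeholders; it assumes at least as many tasks as workers and omits error handling):

  #include <mpi.h>
  #include <stdio.h>

  #define NTASKS   20
  #define TAG_WORK 1
  #define TAG_STOP 2

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                 /* master: farm out task ids */
          int next = 0, active = 0, result;
          MPI_Status st;
          /* seed every worker with one task (assumes NTASKS >= size - 1) */
          for (int w = 1; w < size; w++, next++, active++)
              MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
          /* collect results; hand remaining tasks to whoever finishes */
          while (active > 0) {
              MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &st);
              active--;
              if (next < NTASKS) {
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD);
                  next++; active++;
              } else {
                  MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                           MPI_COMM_WORLD);
              }
          }
      } else {                         /* worker: loop until told to stop */
          int task, result;
          MPI_Status st;
          while (1) {
              MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
              if (st.MPI_TAG == TAG_STOP) break;
              result = task * task;    /* placeholder "work" */
              MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
          }
      }
      MPI_Finalize();
      return 0;
  }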

19
  • Parallel Architectures
  • - Classification
  • - Cache coherence in shared memory platforms
  • - Interconnection networks

20
Classification of Architectures: Flynn's classification
  • Single Instruction Single Data (SISD): Serial computers
  • Single Instruction Multiple Data (SIMD)
  • - Vector processors and processor arrays
  • - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
21
Classification of Architectures: Flynn's classification
  • Multiple Instruction Single Data (MISD): Not popular
  • Multiple Instruction Multiple Data (MIMD)
  • - Most popular
  • - IBM SP and most other supercomputers, clusters,
    computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
22
Classification of Architectures: Based on Memory
  • Shared memory
  • 2 types: UMA and NUMA

[Figure: UMA and NUMA shared-memory organizations. NUMA examples:
HP-Exemplar, SGI Origin, Sequent NUMA-Q]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
23
Classification of Architectures: Based on Memory
  • Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
  • Recently: multi-cores
  • Yet another classification: MPPs, NOW (Berkeley),
    COW, Computational Grids

24
Programming Paradigms, Algorithm Types, Techniques
  • Shared memory model: Threads, OpenMP
  • Message passing model: MPI
  • Data parallel model: HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
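A minimal sketch of the shared-memory (OpenMP) style listed above; the array size and loop body are arbitrary (compile with, e.g., -fopenmp):

  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  int main(void) {
      static double a[N];
      double sum = 0.0;

      /* data-parallel loop: iterations are split among the threads,
         and the reduction clause combines the per-thread partial sums */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++) {
          a[i] = 0.5 * i;
          sum += a[i];
      }
      printf("up to %d threads, sum = %f\n", omp_get_max_threads(), sum);
      return 0;
  }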
25
Cache Coherence in SMPs
  • All processes read variable x residing in
    cache line a
  • Each process updates x at different points of
    time

[Figure: CPU0-CPU3 each hold a copy of cache line a in their caches (cache0-cache3); main memory holds the original line a]
  • Challenge: To maintain a consistent view of the data
  • Protocols
  • Write update
  • Write invalidate

26
Caches: Coherence Protocols and Implementations
  • Write update: propagate the cache line to the other
    processors on every write by a processor
  • Write invalidate: each processor gets the updated
    cache line whenever it reads stale data
  • Which is better??

27
Caches: False sharing
  • Different processors update different parts of
    the same cache line
  • Leads to ping-pong of cache lines between
    processors
  • Situation better in update protocols than
    invalidate protocols. Why?

[Figure: CPU0 updates A0, A2, A4 while CPU1 updates A1, A3, A5; the elements (A0-A8, A9-A15 in main memory) fall on shared cache lines held in both cache0 and cache1]
  • Modify the algorithm to change the stride (see the sketch below)

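A sketch of the situation above and of one common fix: padding each processor's data so that it falls in its own cache line (the 64-byte line size, thread count and iteration count are assumptions):

  #include <omp.h>
  #include <stdio.h>

  #define NTHREADS 4
  #define ITERS    10000000
  #define LINE     64        /* assumed cache-line size in bytes */

  /* False sharing: the four counters sit in one cache line, so every
     increment ping-pongs that line between the processors' caches. */
  long counters_shared[NTHREADS];

  /* Fix (a stride/padding change): give each counter its own line. */
  struct padded { long value; char pad[LINE - sizeof(long)]; };
  struct padded counters_padded[NTHREADS];

  int main(void) {
      #pragma omp parallel num_threads(NTHREADS)
      {
          int id = omp_get_thread_num();
          for (long i = 0; i < ITERS; i++) {
              counters_shared[id]++;        /* slow: false sharing */
              counters_padded[id].value++;  /* fast: one line per thread */
          }
      }
      printf("%ld %ld\n", counters_shared[0], counters_padded[0].value);
      return 0;
  }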
28
Caches: Coherence using invalidate protocols
  • 3 states associated with data items
  • Shared: a variable shared by 2 caches
  • Invalid: another processor (say P0) has updated
    the data item
  • Dirty: the state of the data item in P0
  • Implementations
  • Snoopy
  • for bus based architectures
  • Memory operations are propagated over the bus and
    snooped
  • Instead of broadcasting memory operations to all
    processors, propagate coherence operations to
    relevant processors
  • Directory-based
  • A central directory maintains states of cache
    blocks, associated processors
  • Implemented with presence bits

29
Interconnection Networks
  • An interconnection network is defined by switches,
    links and interfaces
  • Switches: provide the mapping between input and
    output ports, buffering, routing, etc.
  • Interfaces: connect nodes with the network
  • Network topologies
  • Static: point-to-point communication links among
    processing nodes
  • Dynamic: communication links are formed
    dynamically by switches

30
Interconnection Networks
  • Static
  • Bus: SGI Challenge
  • Completely connected
  • Star
  • Linear array, Ring (1-D torus)
  • Mesh: Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  • k-d mesh: d dimensions with k nodes in each
    dimension
  • Hypercubes: a (log p)-dimensional mesh with 2 nodes in
    each dimension, e.g. many MIMD machines
  • Trees: our campus network
  • Dynamic: communication links are formed
    dynamically by switches
  • Crossbar: Cray X series, a non-blocking network
  • Multistage: SP2, a blocking network

31
Evaluating Interconnection topologies
  • Diameter: maximum distance between any two
    processing nodes
  • Fully connected: 1
  • Star: 2
  • Ring: p/2
  • Hypercube: log p
  • Connectivity: multiplicity of paths between 2 nodes,
    i.e. the minimum number of arcs to be removed from the
    network to break it into two disconnected networks
  • Linear array: 1
  • Ring: 2
  • 2-D mesh: 2
  • 2-D mesh with wraparound: 4
  • d-dimensional hypercube: d
32
Evaluating Interconnection topologies
  • Bisection width: minimum number of links to be
    removed from the network to partition it into 2 equal
    halves
  • Ring: 2
  • P-node 2-D mesh: √P
  • Tree: 1
  • Star: 1
  • Completely connected: P²/4
  • Hypercube: P/2
33
Evaluating Interconnection topologies
  • Channel width: number of bits that can be
    simultaneously communicated over a link, i.e. the
    number of physical wires between 2 nodes
  • Channel rate: performance of a single physical wire
  • Channel bandwidth: channel rate times channel width
  • Bisection bandwidth: maximum volume of communication
    between the two halves of the network, i.e. bisection
    width times channel bandwidth (a small worked example
    follows)
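A small worked example with assumed numbers (not from the slides): a 64-node hypercube with 32-bit channels clocked at 1 GHz.

  \text{channel bandwidth} = 32\ \text{bits} \times 1\ \text{GHz} = 4\ \text{GB/s}
  \text{bisection width (hypercube)} = P/2 = 32\ \text{links}
  \text{bisection bandwidth} = 32 \times 4\ \text{GB/s} = 128\ \text{GB/s}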

34
  • Back-up

35
Parallel Algorithm Design - Components
  • Decomposition: Splitting the problem into tasks
    or modules
  • Mapping: Assigning tasks to processors
  • Mapping has contradictory objectives:
  • To minimize idle times
  • To reduce communications

36
Parallel Algorithm Design - Containing
Interaction Overheads
  • Maximizing data locality
  • Minimizing volume of data exchange
  • Minimizing frequency of interactions
  • Minimizing contention and hot spots
  • Overlapping computations with interactions (see the MPI
    sketch after this list)
  • Overlapping interactions with interactions
  • Replicating data or computations
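One common way to overlap computation with interactions is nonblocking message passing; a minimal sketch in C with MPI (the ring exchange, buffer sizes and the trivial interior "work" are placeholders):

  #include <mpi.h>
  #include <stdio.h>

  #define NH 1000       /* halo size (placeholder) */
  #define NI 100000     /* interior size (placeholder) */

  int main(int argc, char **argv) {
      int rank, size;
      double halo_out[NH], halo_in[NH];
      static double interior[NI];
      MPI_Request reqs[2];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int next = (rank + 1) % size;           /* ring neighbours */
      int prev = (rank - 1 + size) % size;

      for (int i = 0; i < NH; i++) halo_out[i] = rank;

      /* start the halo exchange, but do not wait for it yet */
      MPI_Irecv(halo_in,  NH, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(halo_out, NH, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

      /* overlap: work on the interior, which does not need the halo */
      for (int i = 0; i < NI; i++) interior[i] = 0.5 * i;

      /* only now wait for the messages, then use halo_in */
      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
      printf("rank %d received halo from rank %.0f\n", rank, halo_in[0]);

      MPI_Finalize();
      return 0;
  }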