Models of Parallel Computation

1
Models of Parallel Computation
  • WA Appendix D
  • Culler, D., et al., "LogP: Towards a Realistic
    Model of Parallel Computation," PPoPP, May 1993
  • Alpern, B., L. Carter, and J. Ferrante,
    "Modeling Parallel Computers as Memory
    Hierarchies," in Programming Models for Massively
    Parallel Computers, W. K. Giloi, S. Jähnichen,
    and B. D. Shriver, eds., IEEE Press, 1993.

2
Computation Models
  • Model provides underlying abstraction useful for
    analysis of costs, design of algorithms
  • Serial computational models use RAM or TM as
    underlying models for algorithm design

3
RAM: Random Access Machine
  • an unalterable program consisting of optionally
    labeled instructions.
  • memory is composed of a sequence of words, each
    capable of containing an arbitrary integer.
  • an accumulator, referenced implicitly by most
    instructions.
  • a read-only input tape
  • a write-only output tape

4
RAM Assumptions
  • We assume
  • all instructions take the same time to execute
  • word-length unbounded
  • the RAM has arbitrary amounts of memory
  • arbitrary memory locations can be accessed in the
    same amount of time
  • RAM provides an ideal model of a serial computer
    for analyzing the efficiency of serial
    algorithms.

5
PRAM: Parallel Random Access Machine
  • PRAM provides an ideal model of a parallel
    computer for analyzing the efficiency of parallel
    algorithms.
  • PRAM composed of
  • P unmodifiable programs, each composed of
    optionally labeled instructions.
  • a single shared memory composed of a sequence of
    words, each capable of containing an arbitrary
    integer.
  • P accumulators, one associated with each program
  • a read-only input tape
  • a write-only output tape

6
More PRAM
  • PRAM is a synchronous, MIMD, shared memory
    parallel computer.
  • Different protocols can be used for reading and
    writing shared memory.
  • EREW (exclusive read, exclusive write)
  • CREW (concurrent read, exclusive write)
  • CRCW (concurrent read, concurrent write) --
    requires additional protocol for arbitrating
    write conflicts
  • PRAM can emulate a message-passing machine by
    logically dividing shared memory into private
    memories for the P processors.

7
Broadcasting on a PRAM
  • Broadcast can be done on CREW PRAM in O(1)
  • Broadcaster sends value to shared memory
  • Processors read from shared memory
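The O(1) CREW broadcast above can be mimicked, loosely, with threads and a single shared cell: one exclusive write, then concurrent reads. This is only an illustrative sketch (real PRAM steps are synchronous, unlike threads), and all names and values here are made up:

```python
import threading

# One shared memory cell, one exclusive writer, many concurrent readers.
shared_cell = [None]
written = threading.Event()
results = [None] * 4

def reader(pid):
    written.wait()                    # wait until the broadcaster has written
    results[pid] = shared_cell[0]     # concurrent read (the "CR" in CREW)

threads = [threading.Thread(target=reader, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
shared_cell[0] = 42                   # exclusive write by the broadcaster
written.set()
for t in threads:
    t.join()
print(results)  # [42, 42, 42, 42]
```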

8
LogP machine model
  • Model of distributed memory multicomputer
  • Developed by Culler, Karp, Patterson, et al.
  • Authors tried to model prevailing parallel
    architectures (circa 1993).
  • Machine model represents prevalent MPP
    organization
  • machine constructed from at most a few thousand
    nodes,
  • each node contains a powerful processor
  • each node contains substantial memory
  • interconnection structure has limited bandwidth
  • interconnection structure has significant latency

9
LogP parameters
  • L upper bound on latency incurred by sending a
    message from a source to a destination
  • o overhead, defined as the time the processor is
    engaged in sending or receiving a message, during
    which time it cannot do anything else
  • g gap, defined as the minimum time between
    consecutive message transmissions or receptions
  • P number of processor/memory modules
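The four parameters can be bundled into a small record, with the time for one point-to-point message being o + L + o (send overhead, latency, receive overhead). The class name and the sample values below are illustrative, not from the slides:

```python
from dataclasses import dataclass

@dataclass
class LogP:
    L: int  # upper bound on message latency (processor cycles)
    o: int  # send/receive overhead (cycles)
    g: int  # minimum gap between consecutive sends (cycles)
    P: int  # number of processor/memory modules

    def point_to_point(self) -> int:
        # One small message: send overhead + latency + receive overhead.
        return self.o + self.L + self.o

m = LogP(L=6, o=2, g=4, P=8)
print(m.point_to_point())  # 2 + 6 + 2 = 10
```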

10
LogP Assumptions
  • network has finite capacity.
  • at most ceiling(L/g) messages can be in transit
    from any one processor to any other at one time.
  • asynchronous communication.
  • latency and order of messages is unpredictable
  • all messages are small
  • context switching overhead is 0 (not modeled)
  • multithreading (virtual processes) may be
    employed but only up to a limit of L/g virtual
    processors

11
LogP notes
  • All parameters measured in processor cycles
  • Local operations take one cycle
  • Messages are assumed to be small
  • LogP was particularly well-suited to modeling
    CM-5. Not clear if the same correlation is found
    with other machines.

12
LogP Analysis of PRAM Broadcasting Algorithm
  • Algorithm:
  • Broadcaster sends value to shared memory (we'll
    assume the value is in P0's memory)
  • P processors read from shared memory (other
    processors receive messages from P0)
  • Time for P0 to send P messages: o + g(P-1)
  • Maximum time for other processors to receive
    messages: o + (P-2)g + o + L + o
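Plugging numbers into the two slide formulas makes the cost concrete; the parameter values below are arbitrary sample values, not from the slides:

```python
def pram_broadcast_logp(L, o, g, P):
    # Slide formulas for the PRAM-style broadcast under LogP:
    # sender finishes its sends at o + g*(P-1); the last receiver
    # is done at o + (P-2)*g + o + L + o.
    send_done = o + g * (P - 1)
    last_recv = o + (P - 2) * g + o + L + o
    return send_done, last_recv

print(pram_broadcast_logp(L=6, o=2, g=4, P=8))  # (30, 36)
```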

13
Efficient Broadcasting in LogP Model
  • Gap includes overhead time, so overhead < gap

14
Mapping induced by LogP Broadcasting algorithm on
8 processors
15
Analysis of LogP Broadcasting Algorithm to 7
Processors
  • Time to receive one message from P0 for first
    processor (P5) is L + 2o
  • Time to receive message for last processor is
    max(3g + L + 2o, 2g + L + 2o, g + 2L + 4o,
    4o + 2L, g + 4o + 2L) = max(3g + L + 2o, g + 2L + 4o)
  • Compare to LogP analysis of PRAM Broadcast, which
    is o + (P-2)g + o + L + o = 5g + 3o + L
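The tree broadcast can be sketched greedily: keep a priority queue of times at which informed processors can next start a send; each send informs a new processor at start + o + L + o, after which the sender can send again g later and the receiver can start sending immediately. This is a sketch of the idea, not the paper's algorithm verbatim, and the demo values are arbitrary:

```python
import heapq

def logp_broadcast_time(L, o, g, P):
    """Greedy single-item broadcast time in the LogP model."""
    ready = [0]        # times at which informed processors can start a send
    informed = 1       # P0 starts with the value
    finish = 0
    while informed < P:
        t = heapq.heappop(ready)         # earliest available sender
        arrival = t + o + L + o          # new processor is informed here
        informed += 1
        finish = max(finish, arrival)
        heapq.heappush(ready, t + g)     # sender's next send slot
        heapq.heappush(ready, arrival)   # receiver can now send too
    return finish

print(logp_broadcast_time(L=6, o=2, g=4, P=8))  # 24
```

With these sample values the tree finishes at cycle 24, well before the flat PRAM-style broadcast's 36.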

16
Scalable Performance
  • LogP Broadcast utilizes tree structure to
    optimize broadcast time
  • Tree depends on values of L,o,g,P
  • Strategy is much more scalable (and ultimately
    more efficient) than PRAM Broadcast

17
Moral
  • Analysis can be no better than underlying model.
    The more accurate the model, the more accurate
    the analysis.
  • (This is why we use TM to determine
    undecidability but RAM to determine complexity.)

18
Other Models used for Analysis
  • BSP (Bulk Synchronous Parallel)
  • Slight precursor and competitor to LogP
  • PMH (Parallel Memory Hierarchy)
  • Focuses on memory costs

19
BSP: Bulk Synchronous Parallel
  • BSP proposed by Valiant
  • BSP model consists of
  • P processors, each with local memory
  • Communication network for point-to-point message
    passing between processors
  • Mechanism for synchronizing all or some of the
    processors at defined intervals

20
BSP Programs
  • BSP programs composed of supersteps
  • In each superstep, processors execute L
    computational steps using locally stored data,
    and send and receive messages
  • Processors synchronized at the end of the
    superstep (at which time all messages have been
    received)
  • BSP programs can be implemented through
    mechanisms like Oxford BSP library (C routines
    for implementing BSP programs) and BSP-L.

21
BSP Parameters
  • P number of processors (with memory)
  • L synchronization periodicity
  • g communication cost
  • s processor speed (measured in number of time
    steps/second)
  • Processor sends at most h messages and receives
    at most h messages in a single superstep
    (communication called an h-relation)

22
BSP Notes
  • Complete program = set of supersteps
  • Communication startup not modeled; g is for
    continuous traffic conditions
  • Message size is one data word
  • More than one process or thread can be executed
    by a processor.
  • Generally assumed that computation and
    communication are not overlapped
  • Time for a superstep = (max number of local
    operations performed by any processor) + g(max
    number of messages sent or received by a
    processor) + L
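The superstep cost formula is a one-liner; the sample values below are arbitrary:

```python
def superstep_time(w_max, h_max, g, L):
    # Slide formula: (max local operations by any processor)
    # + g * (max messages sent or received by any processor)
    # + L for the synchronization.
    return w_max + g * h_max + L

print(superstep_time(w_max=100, h_max=5, g=4, L=30))  # 100 + 20 + 30 = 150
```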

23
BSP Analysis of PRAM Broadcast
  • Algorithm:
  • Broadcaster sends value to shared memory (we'll
    assume the value is in P0's memory)
  • P processors read from shared memory (other
    processors receive messages from P0)
  • In BSP model, processors only allowed to send or
    receive at most h messages in a single superstep.
    Broadcast for more than h processors would
    require a tree structure
  • If there were more than Lh processors, then a
    tree broadcast would require more than one
    superstep.
  • How much time does it take for a P processor
    broadcast?
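One way to estimate the answer, assuming each informed processor forwards the value to h others per superstep (so the informed count grows by a factor of h + 1) and ignoring local work. This is an illustrative sketch under those assumptions, not the slide's own derivation:

```python
def bsp_tree_broadcast_time(P, h, g, L):
    informed, supersteps = 1, 0
    while informed < P:
        informed *= h + 1   # every informed processor sends to h others
        supersteps += 1
    # Each superstep performs at most an h-relation, then a barrier.
    return supersteps * (g * h + L)

print(bsp_tree_broadcast_time(P=8, h=1, g=4, L=30))  # 3 supersteps: 102
```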

24
BSP Analysis of PRAM Broadcast
  • How much time does it take for a P processor
    broadcast?

25
PMH: Parallel Memory Hierarchy Model
  • PMH seeks to represent memory. Goal is to model
    algorithms so that good decisions can be made
    about where to allocate data during execution.
  • Model represents costs of interprocessor
    communication and memory hierarchy traffic (e.g.
    between main memory and disk, between registers
    and cache).
  • Proposed by Carter, Ferrante, Alpern

26
PMH Model
  • Computer is modeled as a tree of memory modules
    with the processors at the leaves.
  • All data movement takes the form of block
    transfers between children and their parents.
  • PMH is composed of a tree of modules
  • all modules hold data
  • leaf modules also perform computation
  • data in a module is partitioned into blocks
  • Each module has four parameters

27
Un-parameterized PMH Models for a Cluster of
Workstations
Bandwidth from processor to disk > bandwidth from
processor to network
Bandwidth between 2 processors > bandwidth to disk
28
PMH Module Parameters
  • Blocksize s_m tells how many bytes there are per
    block of m
  • Blockcount n_m tells how many blocks fit in m
  • Childcount c_m tells how many children m has
  • Transfer time t_m tells how many cycles it takes
    to transfer a block between m and its parent
  • Size of "node" and length of "edge" in PMH graph
    should correspond to blocksize, blockcount and
    transfer time
  • Generally all modules at a given level of the
    tree will have the same parameters
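A minimal sketch of a PMH module with the four slide parameters, plus the cost of moving one block along a root-to-leaf path (one parent-child transfer per edge). The tree shape and all numbers below are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    s: int                 # blocksize: bytes per block of this module
    n: int                 # blockcount: blocks that fit in this module
    c: int                 # childcount: number of children
    t: int                 # cycles to transfer a block to/from the parent
    children: list = field(default_factory=list)

def transfer_cost(path):
    # One block moved along a root-to-leaf path of modules:
    # the root has no parent, so its t is not counted.
    return sum(m.t for m in path[1:])

disk   = Module(s=4096, n=10**6, c=1, t=0)
memory = Module(s=64,   n=10**5, c=2, t=500)
cache  = Module(s=64,   n=512,   c=1, t=10)
disk.children = [memory]
memory.children = [cache]
print(transfer_cost([disk, memory, cache]))  # 500 + 10 = 510
```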

29
Summary
  • Goal of parallel computation models is to provide
    a realistic representation of the costs of
    programming.
  • Model provides algorithm designers and
    programmers a measure of algorithm complexity
    which helps them decide what is good (i.e.
    performance-efficient)
  • Next up: Mapping and Scheduling