Title: Models of Parallel Computation
1. Models of Parallel Computation
- WA, Appendix D
- Culler, D., et al., "LogP: Towards a Realistic Model of Parallel Computation," PPOPP, May 1993
- Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," Programming Models for Massively Parallel Computers, Giloi, W. K., S. Jahnichen, and B. D. Shriver, eds., IEEE Press, 1993
2. Computation Models
- Model provides underlying abstraction useful for analysis of costs, design of algorithms
- Serial computational models use RAM or TM as underlying models for algorithm design
3. RAM (Random Access Machine)
- an unalterable program consisting of optionally labeled instructions
- a memory composed of a sequence of words, each capable of containing an arbitrary integer
- an accumulator, referenced implicitly by most instructions
- a read-only input tape
- a write-only output tape
4. RAM Assumptions
- We assume
  - all instructions take the same time to execute
  - word length is unbounded
  - the RAM has arbitrary amounts of memory
  - arbitrary memory locations can be accessed in the same amount of time
- RAM provides an ideal model of a serial computer for analyzing the efficiency of serial algorithms.
5. PRAM (Parallel Random Access Machine)
- PRAM provides an ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.
- PRAM composed of
  - P unmodifiable programs, each composed of optionally labeled instructions
  - a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  - P accumulators, one associated with each program
  - a read-only input tape
  - a write-only output tape
6. More PRAM
- PRAM is a synchronous, MIMD, shared-memory parallel computer.
- Different protocols can be used for reading and writing shared memory:
  - EREW (exclusive read, exclusive write)
  - CREW (concurrent read, exclusive write)
  - CRCW (concurrent read, concurrent write) -- requires additional protocol for arbitrating write conflicts
- PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors.
7. Broadcasting on a PRAM
- Broadcast can be done on CREW PRAM in O(1) (sketched below):
  - Broadcaster sends value to shared memory
  - Processors read from shared memory
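As a concrete (purely illustrative) rendering of the two steps above, the Python sketch below models the shared memory as a one-cell list and the P processors as a loop. On an actual CREW PRAM the concurrent reads in step 2 all happen in a single time step, which is what makes the broadcast O(1).

    # Hypothetical sketch of CREW PRAM broadcast; names and values are illustrative.
    P = 8                # number of PRAM processors
    shared = [None]      # one shared-memory cell
    local = [None] * P   # each processor's accumulator

    # Step 1: the broadcaster (processor 0) writes its value to shared memory.
    shared[0] = 42

    # Step 2: all P processors read the cell concurrently (legal under CREW).
    # On the PRAM these reads occur in one time step, so the broadcast is O(1).
    for p in range(P):
        local[p] = shared[0]

    assert all(v == 42 for v in local)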
8. LogP machine model
- Model of a distributed-memory multicomputer
- Developed by Culler, Karp, Patterson, et al.
- Authors tried to model prevailing parallel architectures (circa 1993).
- Machine model represents prevalent MPP organization:
  - machine constructed from at most a few thousand nodes
  - each node contains a powerful processor
  - each node contains substantial memory
  - interconnection structure has limited bandwidth
  - interconnection structure has significant latency
9. LogP parameters
- L: upper bound on latency incurred by sending a message from a source to a destination
- o: overhead, defined as the time the processor is engaged in sending or receiving a message, during which time it cannot do anything else
- g: gap, defined as the minimum time between consecutive message transmissions or receptions
- P: number of processor/memory modules
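To make the parameters concrete, here is a small LogP cost calculator (illustrative, not from the slides; the parameter values are assumed). The two formulas are the standard ones: 2o + L for a single point-to-point message, and (k-1)g + o for the time a processor is busy issuing k consecutive sends, since consecutive sends must start at least g apart.

    # Illustrative LogP cost calculator; parameter values are made up.
    L, o, g, P = 6, 2, 4, 8    # cycles; the model requires o <= g

    def point_to_point(L, o):
        """Send overhead + wire latency + receive overhead."""
        return o + L + o       # = 2o + L

    def send_busy_time(k, o, g):
        """Time a processor is occupied issuing k consecutive sends:
        send starts are at least g apart, each send costs overhead o."""
        return (k - 1) * g + o

    print(point_to_point(L, o))          # 10 cycles for one message
    print(send_busy_time(P - 1, o, g))   # P0 issuing P-1 sends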
10. LogP Assumptions
- network has finite capacity
  - at most ceiling(L/g) messages can be in transit from any one processor to any other at one time
- asynchronous communication
  - latency and order of messages are unpredictable
- all messages are small
- context-switching overhead is 0 (not modeled)
- multithreading (virtual processors) may be employed, but only up to a limit of L/g virtual processors
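A one-line check of the capacity constraint, with assumed values for L and g:

    import math

    # With the assumed values L = 6 and g = 4, at most ceil(L/g) = 2 messages
    # may be in transit from one processor to any other at the same time.
    L, g = 6, 4
    print(math.ceil(L / g))   # -> 2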
11. LogP notes
- All parameters are measured in processor cycles
- Local operations take one cycle
- Messages are assumed to be small
- LogP was particularly well suited to modeling the CM-5. It is not clear whether the same correlation is found with other machines.
12. LogP Analysis of PRAM Broadcasting Algorithm
- Algorithm
  - Broadcaster sends value to shared memory (we'll assume the value is in P0's memory)
  - P processors read from shared memory (the other processors receive messages from P0)
- Time for P0 to send P messages: o + g(P-1)
- Maximum time for other processors to receive messages: o + (P-2)g + o + L + o
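Plugging assumed parameter values into the two bounds above (a sketch using the slides' accounting, with made-up numbers):

    # Flat (PRAM-style) broadcast in LogP: P0 sends the value to every other
    # processor one at a time. Parameter values are assumed for illustration.
    L, o, g, P = 6, 2, 4, 8

    p0_send_time = o + g * (P - 1)              # slide's bound on P0's sending
    last_receive = o + (P - 2) * g + o + L + o  # slide's bound for the last receiver
    print(p0_send_time, last_receive)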
13. Efficient Broadcasting in LogP Model
- Gap includes overhead time, so overhead < gap
14. Mapping induced by LogP Broadcasting algorithm on 8 processors
15. Analysis of LogP Broadcasting Algorithm to 7 Processors
- Time to receive one message from P0 for the first processor (P5) is L + 2o
- Time to receive the message for the last processor is max(3g + L + 2o, 2g + L + 2o, g + 2L + 4o, 4o + 2L, g + 4o + 2L) = max(3g + L + 2o, g + 2L + 4o)
- Compare to the LogP analysis of PRAM Broadcast, which is o + (P-2)g + o + L + o = 5g + 3o + L
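The broadcast tree behind these numbers can be generated greedily: every processor that already has the value keeps sending it on, and the earliest-available sender is always used next. The sketch below computes the completion time of such a schedule; the greedy rule is the standard LogP broadcast idea (not code from the slides), and the parameter values are assumed.

    import heapq

    def logp_broadcast_time(P, L, o, g):
        """Greedy LogP broadcast sketch: every informed processor keeps
        sending to uninformed ones; consecutive sends from one processor
        start at least g apart, and a receiver is fully informed (receive
        overhead done) 2o + L after the send begins."""
        ready = [0]              # times at which informed processors can next send
        informed, finish = 1, 0
        while informed < P:
            t = heapq.heappop(ready)        # earliest available sender
            arrival = t + o + L + o         # new processor informed at this time
            finish = max(finish, arrival)
            informed += 1
            heapq.heappush(ready, t + g)    # sender's next send slot
            heapq.heappush(ready, arrival)  # newcomer can start sending
        return finish

    # Example with assumed parameters: an 8-processor broadcast.
    print(logp_broadcast_time(8, L=6, o=2, g=4))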
16. Scalable Performance
- LogP Broadcast utilizes a tree structure to optimize broadcast time
- Tree depends on the values of L, o, g, P
- Strategy is much more scalable (and ultimately more efficient) than PRAM Broadcast
17. Moral
- Analysis can be no better than the underlying model. The more accurate the model, the more accurate the analysis.
- (This is why we use the TM to determine undecidability but the RAM to determine complexity.)
18. Other Models used for Analysis
- BSP (Bulk Synchronous Parallel)
  - Slight precursor and competitor to LogP
- PMH (Parallel Memory Hierarchy)
  - Focuses on memory costs
19. BSP: Bulk Synchronous Parallel
- BSP proposed by Valiant
- BSP model consists of
  - P processors, each with local memory
  - communication network for point-to-point message passing between processors
  - mechanism for synchronizing all or some of the processors at defined intervals
20. BSP Programs
- BSP programs are composed of supersteps
- In each superstep, processors execute L computational steps using locally stored data, and send and receive messages
- Processors are synchronized at the end of the superstep (at which time all messages have been received)
- BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L.
21. BSP Parameters
- P: number of processors (with memory)
- L: synchronization periodicity
- g: communication cost
- s: processor speed (measured in number of time steps/second)
- Processor sends at most h messages and receives at most h messages in a single superstep (such communication is called an h-relation)
22. BSP Notes
- Complete program = set of supersteps
- Communication startup not modeled; g is for continuous traffic conditions
- Message size is one data word
- More than one process or thread can be executed by a processor.
- Generally assumed that computation and communication are not overlapped
- Time for a superstep = (max number of local operations performed by any processor) + g * (max number of messages sent or received by a processor) + L
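A quick worked example of the superstep formula (values assumed, not from the slides):

    # Cost of one BSP superstep under the slide's formula: w + g*h + L,
    # where w is the max local operations and h the max messages sent or
    # received by any processor. Values below are assumed for illustration.
    def superstep_time(w, h, g, L):
        return w + g * h + L

    print(superstep_time(w=1000, h=8, g=5, L=100))   # -> 1140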
23. BSP Analysis of PRAM Broadcast
- Algorithm
  - Broadcaster sends value to shared memory (we'll assume the value is in P0's memory)
  - P processors read from shared memory (the other processors receive messages from P0)
- In the BSP model, processors are only allowed to send or receive at most h messages in a single superstep. Broadcast to more than h processors would require a tree structure.
- If there were more than L·h processors, then a tree broadcast would require more than one superstep.
- How much time does it take for a P-processor broadcast?
24. BSP Analysis of PRAM Broadcast
- How much time does it take for a P-processor broadcast?
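One plausible way to answer this is sketched below, under the assumptions (mine, not the slides') that each informed processor relays the value to h others per superstep, each superstep is charged g*h + L, and local work is ignored; the parameter values are made up.

    def bsp_broadcast(P, h, g, L):
        """Tree broadcast sketch: in each superstep every processor that has
        the value relays it to h others, so the informed set grows by a
        factor of h + 1 per superstep. Each superstep is charged g*h + L."""
        supersteps, informed = 0, 1
        while informed < P:
            informed *= (h + 1)   # each informed processor recruits h more
            supersteps += 1
        return supersteps, supersteps * (g * h + L)

    # Example with assumed parameters: 64 processors, fan-out h = 7.
    print(bsp_broadcast(P=64, h=7, g=5, L=100))   # -> (2, 270)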
25. PMH (Parallel Memory Hierarchy) Model
- PMH seeks to represent memory. Goal is to model algorithms so that good decisions can be made about where to allocate data during execution.
- Model represents costs of interprocessor communication and memory-hierarchy traffic (e.g., between main memory and disk, between registers and cache).
- Proposed by Carter, Ferrante, and Alpern
26. PMH Model
- Computer is modeled as a tree of memory modules with the processors at the leaves.
- All data movement takes the form of block transfers between children and their parents.
- PMH is composed of a tree of modules:
  - all modules hold data
  - leaf modules also perform computation
  - data in a module is partitioned into blocks
  - each module has 4 parameters
27. Un-parameterized PMH Models for a Cluster of Workstations
- Bandwidth from processor to disk > bandwidth from processor to network
- Bandwidth between 2 processors > bandwidth to disk
28. PMH Module Parameters
- Blocksize s_m tells how many bytes there are per block of m
- Blockcount n_m tells how many blocks fit in m
- Childcount c_m tells how many children m has
- Transfer time t_m tells how many cycles it takes to transfer a block between m and its parent
- Size of "node" and length of "edge" in the PMH graph should correspond to blocksize, blockcount, and transfer time
- Generally, all modules at a given level of the tree will have the same parameters
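The four parameters map naturally onto a small data structure. The sketch below is illustrative only (module names and numbers are made up); it builds a three-level PMH tree and derives c_m and module capacity from the slide's parameters.

    from dataclasses import dataclass, field

    @dataclass
    class Module:
        """One PMH memory module with the four slide parameters."""
        name: str
        blocksize: int        # s_m: bytes per block of m
        blockcount: int       # n_m: blocks that fit in m
        transfer_time: int    # t_m: cycles to move a block to/from m's parent
        children: list = field(default_factory=list)   # c_m = len(children)

        @property
        def childcount(self):
            return len(self.children)

        @property
        def capacity(self):
            return self.blocksize * self.blockcount   # bytes held by m

    # Illustrative three-level hierarchy (made-up numbers):
    cache = Module("cache", blocksize=64, blockcount=512, transfer_time=10)
    ram = Module("RAM", blocksize=4096, blockcount=2**18, transfer_time=100,
                 children=[cache])
    disk = Module("disk", blocksize=2**16, blockcount=2**20, transfer_time=10**5,
                  children=[ram])
    print(disk.childcount, ram.capacity)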
29. Summary
- Goal of parallel computation models is to provide a realistic representation of the costs of programming.
- Model provides algorithm designers and programmers a measure of algorithm complexity which helps them decide what is good (i.e., performance-efficient)
- Next up: Mapping and Scheduling