Title: Basic Communication Operations: Implementation and Complexity
- Operations
- one-to-all broadcast and reduction
- all-to-all broadcast and reduction
- all-reduce, parallel prefix operations
- all-to-all scatter
- Topologies
- linear array/ring
- 2D mesh
- hypercube
- Improving complexity
- splitting and routing messages in parts
- these operations are used frequently, so you should know well what they do, how they do it, and at what cost
- the algorithms are simple and practical
- the techniques used nicely demonstrate many useful concepts of parallel algorithm design
Linear Array vs Ring, Mesh vs Torus
- Linear Array vs Ring
- for simplicity, we will mostly see examples of algorithms for the ring topology
- the ring can be simulated by a linear array by simply embedding the ring into the linear array
- we assume the time to communicate a message of size m is $t_s + t_w m$
- therefore we are concerned only with the congestion of the embedding
- Mesh vs Torus
- all mesh and torus algorithms will be of the form:
- apply the linear array/ring algorithm for each row
- apply the linear array/ring algorithm for each column
- the only difference is that the mesh uses the linear array algorithm, while the torus uses the ring algorithm
One-to-all Broadcast
- Linear Array/Ring
- Naïve approach
- send to the right neighbour
- terrible complexity: $(t_s + t_w m)(p-1)$
- Recursive doubling
- use a logical binary tree for broadcasting
- make sure to minimize congestion
- complexity: $(t_s + t_w m)\log p$
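The recursive doubling idea can be sketched as a small simulation. This is only an illustration under stated assumptions: p is a power of two, and the function name `recursive_doubling_schedule` is hypothetical (not from the slides). In each round, every node that already holds the message sends it to the node half the previous distance away, so messages sent in the same round use disjoint segments of the ring.

```python
def recursive_doubling_schedule(p, source=0):
    """Return the broadcast rounds as lists of (sender, receiver) pairs.

    Assumption: p is a power of two. Nodes are numbered 0..p-1 around the ring.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    have = {source}          # nodes that already hold the message
    rounds = []
    dist = p // 2            # first hop covers half of the ring
    while dist >= 1:
        # every current holder sends to the node `dist` positions to its right
        sends = [(s, (s + dist) % p) for s in sorted(have)]
        have.update(r for _, r in sends)
        rounds.append(sends)
        dist //= 2           # halve the distance in each round
    return rounds


if __name__ == "__main__":
    for number, sends in enumerate(recursive_doubling_schedule(8), start=1):
        print("round", number, sends)
```

The number of rounds is log p, which is where the $(t_s + t_w m)\log p$ cost comes from when each round costs $t_s + t_w m$.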
One-to-all Broadcast
- 2D Mesh
- treat rows and columns as linear arrays
- broadcast from the source within its row
- each node of that row then broadcasts within its column
- use the balanced binary tree broadcast within each row/column
- complexity: ?
- 3D Mesh
- the same dimension-wise approach can be used
- Hypercube
- the same dimension-wise approach can be used
One-to-all Broadcast and Reduction
- Summary
- balanced binary tree approach common to all topologies
- complexity: $(t_s + t_w m)\log p$
- All to one reduction
- just reverse the flow of messages
- combine the incoming messages using the operation
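Reversing the flow can be sketched as follows. This is a minimal simulation, assuming p is a power of two and an associative, commutative operation; `tree_reduce` is a hypothetical name. The rounds of the binary tree broadcast are replayed in reverse order, and each receiver combines the incoming value with its own.

```python
import operator


def tree_reduce(values, op=operator.add, root=0):
    """All-to-one reduction along a reversed binary broadcast tree.

    Assumptions: len(values) is a power of two; op is associative and commutative.
    """
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    buf = list(values)       # buf[i] is the partial result held by node i
    dist = 1                 # the last broadcast round (distance 1) is undone first
    while dist < p:
        for rel in range(0, p, 2 * dist):
            receiver = (root + rel) % p
            sender = (root + rel + dist) % p
            # the message travels towards the root and is combined on arrival
            buf[receiver] = op(buf[receiver], buf[sender])
        dist *= 2
    return buf[root]


if __name__ == "__main__":
    print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
```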
All-to-all Broadcast and Reduction
- All-to-all broadcast
- each process sends a message to all other processes
- different processes send different messages
- often used in matrix operations
- All-to-all reduction
- each process is the destination of an all-to-one reduction
- each process starts with p messages
- Naïve approach
- sequentially execute p broadcasts/reductions
- Better approach
- pipeline the broadcasts
- be careful not to overload/congest the links
All-to-all Broadcast and Reduction
- Linear Array/Ring
- pipelining the binary tree approach results in excessive congestion
- use the simple send-to-the-right broadcast and pipeline it
- 2D Mesh
- 2 phases
- all2all broadcast within each row (unit-size messages)
- all2all broadcast within each column (consolidated messages of size $\sqrt{p}$)
- Hypercube
- extend the 2D mesh algorithm to d dimensions
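The pipelined send-to-the-right scheme for the ring can be sketched as a short simulation. The function name `ring_all_to_all_broadcast` is hypothetical, and all nodes are assumed to send simultaneously in each round: every node forwards the message it received in the previous round.

```python
def ring_all_to_all_broadcast(messages):
    """Simulate all-to-all broadcast on a ring of len(messages) nodes."""
    p = len(messages)
    collected = [[msg] for msg in messages]   # each node starts with its own message
    in_flight = list(messages)                # what each node sends in the next round
    for _ in range(p - 1):
        # all nodes send to their right neighbour at the same time,
        # so node i receives what node i-1 was holding
        received = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            collected[i].append(received[i])
        in_flight = received                  # forward the new message next round
    return collected


if __name__ == "__main__":
    for node, msgs in enumerate(ring_all_to_all_broadcast(list("abcde"))):
        print(node, msgs)
```

Each link carries exactly one message per round, so the pipeline does not congest any link.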
All-to-all Broadcast and Reduction
- Complexity
- Linear Array/Ring
- $p-1$ communication rounds of cost $t_s + t_w m$ each
- total: $(t_s + t_w m)(p-1)$
- 2D Mesh
- $\sqrt{p}-1$ communication rounds of cost $t_s + t_w m$ each (rows)
- $\sqrt{p}-1$ communication rounds of cost $t_s + t_w m\sqrt{p}$ each (columns)
- total: $2t_s(\sqrt{p}-1) + t_w m(p-1)$
- Hypercube
- $\log p$ communication rounds
- messages of size $2^{i-1}m$ are exchanged in round $i$
- cost: $\sum_{i=1}^{\log p}\left(t_s + 2^{i-1}t_w m\right)$
- total: $t_s\log p + t_w m(p-1)$
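The hypercube figures can be checked with a small simulation. This is a sketch assuming p is a power of two; `hypercube_all_to_all_broadcast` is a hypothetical name. In round i, each node exchanges everything it has collected so far with its neighbour across one dimension, so the exchanged message doubles in size every round.

```python
def hypercube_all_to_all_broadcast(messages):
    """Simulate all-to-all broadcast on a hypercube with len(messages) nodes.

    Returns the final buffers and the per-round message sizes (in units of m).
    Assumption: the number of nodes is a power of two.
    """
    p = len(messages)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    d = p.bit_length() - 1                    # number of dimensions, log2(p)
    buffers = [[msg] for msg in messages]
    sizes = []
    for i in range(d):
        sizes.append(len(buffers[0]))         # every node sends the same amount
        snapshot = [list(b) for b in buffers] # all exchanges happen simultaneously
        for node in range(p):
            partner = node ^ (1 << i)         # neighbour across dimension i
            buffers[node] = snapshot[node] + snapshot[partner]
    return buffers, sizes


if __name__ == "__main__":
    buffers, sizes = hypercube_all_to_all_broadcast(list(range(8)))
    print(sizes, sum(sizes))
```

The per-round sizes 1, 2, 4, ... add up to p-1, which matches the $t_w m(p-1)$ term of the total.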
All-to-all Broadcast and Reduction
- Discussion
- the bandwidth-limited time $t_w m(p-1)$ is the same for all topologies
- each processor has to receive data of total size $m(p-1)$ over a link whose bandwidth is determined by $t_w$
- the hypercube algorithm cannot be efficiently applied to other topologies
- unlike the one2all broadcast
- it would cause excessive congestion
- All2All Reduction
- reverse the order and direction of messages
- instead of being concatenated, the messages are combined
All-Reduce and Parallel Prefix
- All-Reduce
- semantics: all2one reduce followed by a broadcast of the result
- often used to implement barrier() (a synchronization primitive)
- implementation: the all2all broadcast communication pattern, but instead of concatenating the messages, they are combined according to the operation
- complexity: $(t_s + t_w m)\log p$
- Parallel Prefix (Scan)
- apply the operation only to your predecessors
- solution: use all2all broadcast with combining
- keep two combining buffers: one for yourself, where only messages from predecessors are combined, and one for others, where all messages are combined
- complexity: $(t_s + t_w m)\log p$
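The two-buffer scheme can be sketched on a hypercube as follows. This is an illustration under assumptions: p is a power of two, the operation is associative, and the sketch computes the inclusive prefix (each node's own value is part of its result). The name `hypercube_scan` is hypothetical.

```python
import operator


def hypercube_scan(values, op=operator.add):
    """Parallel prefix using the all-to-all broadcast pattern with combining.

    Returns (prefix, total): prefix[i] combines values[0..i], and total[i] is the
    all-reduce result held by node i. Assumes len(values) is a power of two.
    """
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    d = p.bit_length() - 1
    result = list(values)      # buffer "for yourself": predecessors only
    outgoing = list(values)    # buffer "for others": everything seen so far
    for i in range(d):
        sent = list(outgoing)  # all nodes exchange simultaneously
        for node in range(p):
            partner = node ^ (1 << i)
            incoming = sent[partner]
            if partner < node:
                # the partner's block consists of predecessors only
                result[node] = op(incoming, result[node])
                outgoing[node] = op(incoming, sent[node])
            else:
                outgoing[node] = op(sent[node], incoming)
    return result, outgoing


if __name__ == "__main__":
    print(hypercube_scan([1, 2, 3, 4, 5, 6, 7, 8]))
```

The second returned buffer shows that the same communication pattern also yields the all-reduce result on every node.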
Scatter and Gather
- Scatter
- like one2all broadcast, but starts with the combined message, which is subsequently split into smaller and smaller blocks
- Gather
- reverse of scatter()
- Complexity
- step $i$: messages of size $mp/2^{i}$
- total: $t_s\log p + t_w m(p-1)$
- again, there is a lower bound of $t_w m(p-1)$ regardless of the topology (that much must leave the source)
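A scatter on a hypercube can be sketched as below, assuming p is a power of two and one block of size m per destination; `hypercube_scatter` is a hypothetical name. In every step, each node that holds data passes on the half destined for the other side of the current dimension.

```python
def hypercube_scatter(blocks, source=0):
    """Simulate scatter: blocks[k] must end up at node k.

    Returns (received, sizes), where sizes[i] is the number of blocks
    in each message of step i+1. Assumes len(blocks) is a power of two.
    """
    p = len(blocks)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    d = p.bit_length() - 1
    held = {source: {dest: blocks[dest] for dest in range(p)}}
    sizes = []
    for i in reversed(range(d)):              # highest dimension first
        bit = 1 << i
        step_size = 0
        for node in list(held):               # nodes holding data before this step
            partner = node ^ bit
            # blocks whose destination lies on the partner's side of dimension i
            moving = {k: b for k, b in held[node].items() if (k ^ node) & bit}
            held[node] = {k: b for k, b in held[node].items() if not (k ^ node) & bit}
            held[partner] = moving
            step_size = len(moving)
        sizes.append(step_size)
    received = [held[node][node] for node in range(p)]
    return received, sizes


if __name__ == "__main__":
    print(hypercube_scatter(["b%d" % k for k in range(8)], source=3))
```

For p = 8 the message sizes are 4, 2, 1 blocks, that is p/2, p/4, ..., 1, which sums to p-1.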
Summary of techniques
- One2All broadcast
- balanced binary tree, dimension-wise
- All2one reduction
- reverse the messages, combine them
- All2All broadcast
- pipeline simple linear broadcasts, dimension-wise, concatenate
- All2All reduce
- reverse the all2all messages, combine them
- Scatter
- one2all communication pattern, split the messages
- Gather
- reverse of scatter
- Consider the following starting configuration in a 4x4 torus
- Answer the following questions
- which processors have received 7 after the 3rd time step?
- which processors has a message reaching p1 travelled through?
- what are the contents of the buffers after the 3rd time step?
- for broadcast from p6
- for all2all broadcast
- for reduce to p4
- for allreduce
- for gather to p11
- Homework
- do this for a hypercube as well
- Row2All broadcast
- describe how to implement the communication procedure Row2All(i) in a c x d torus
- each node of row i contains a message
- at the end, each node of the torus should have received all messages from row i
- derive the time complexity of your approach(es)
- can you prove that your solution is bandwidth optimal?
- what happens if d (or c) is very large with respect to the other one?
All2All broadcast on a tree: Given a balanced binary tree, describe a procedure to perform all2all broadcast that takes time $(t_s + t_w m p/2)\log p$ for m-word messages on p nodes. Assume that only the leaves of the tree contain nodes, and that an exchange of two m-word messages between any two nodes connected by bidirectional channels takes time $t_s + t_w m k$ if the channel (or a part of it) is shared by k simultaneous messages.
All2All Scatter
- Linear Array/Ring
- send all your data to the right
- receive a message from the left, extract your part and delete it, then forward the rest to the right
- Complexity
- step $i$: messages of size $m(p-i)$
- total: $(t_s + t_w m p/2)(p-1)$
- the average distance travelled by a piece of data is $p/2$
- therefore the total traffic is $m \times (p-1) \times p/2 \times p$
- this travels over $p$ edges, so the lower bound on the total time is $t_w m(p-1)p/2$
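The extract-and-forward scheme on a ring can be sketched as a simulation. `ring_all_to_all_scatter` and the `outbox[src][dst]` layout are hypothetical choices for illustration; each block stands for one message of size m.

```python
def ring_all_to_all_scatter(outbox):
    """Simulate all-to-all scatter on a ring.

    outbox[src][dst] is the block that node src wants to deliver to node dst.
    Returns (inbox, sizes): inbox[dst][src] is the delivered block and
    sizes[i] is the number of blocks in each message of step i+1.
    """
    p = len(outbox)
    inbox = [{i: outbox[i][i]} for i in range(p)]     # own block needs no transfer
    # initially every node sends everything that is not for itself
    packets = [[(i, dst, outbox[i][dst]) for dst in range(p) if dst != i]
               for i in range(p)]
    sizes = []
    for _ in range(p - 1):
        sizes.append(len(packets[0]))                 # same size on every link
        arriving = [packets[(i - 1) % p] for i in range(p)]
        packets = []
        for i in range(p):
            for src, dst, block in arriving[i]:
                if dst == i:
                    inbox[i][src] = block             # extract your part
            # forward the rest to the right in the next step
            packets.append([t for t in arriving[i] if t[1] != i])
    return inbox, sizes


if __name__ == "__main__":
    p = 4
    outbox = [[(s, d) for d in range(p)] for s in range(p)]
    print(ring_all_to_all_scatter(outbox)[1])
```

With p = 4 the message sizes are 3, 2, 1 blocks, that is p-i in step i, which sums to p(p-1)/2 and gives the $t_w m p(p-1)/2$ term.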
All2All Scatter/Gather
- 2D Mesh
- each node first groups its messages according to the destination column
- all2all scatter independently within each row (messages of size $m\sqrt{p}$)
- sort the combined packet destined for a column according to the row destinations within the column
- all2all scatter within each column
- Complexity
- all2all scatter among $\sqrt{p}$ nodes with messages of size $m\sqrt{p}$: $(t_s + t_w m p/2)(\sqrt{p}-1)$
- total: $2(t_s + t_w m p/2)(\sqrt{p}-1)$
All2All Scatter/Gather
- Hypercube
- Naïve Approach
- extend the mesh algorithm
- $\log p$ iterations; in iteration $i$, send a message of size $mp/2$ over the link in dimension $i$
- Complexity
- $\log p$ iterations, each of cost $t_s + t_w m p/2$
- total: $(t_s + t_w m p/2)\log p$
- lower bound: $p \times m(p-1) \times (\log p)/2$ total traffic over $(p\log p)/2$ links results in $t_w m(p-1)$
- not bandwidth optimal!
All2All Scatter/Gather
- Hypercube, Bandwidth Optimal
- let every pair of nodes communicate directly with each other!
- $p-1$ rounds of communication
- in round $i$, node $x$ communicates with node $x$ XOR $i$
- if dimensional routing is used, there is no congestion
- Complexity
- $p-1$ rounds of cost $t_s + t_w m$ each
- total: $(t_s + t_w m)(p-1)$
- note that the $t_s$ factor is higher here; the optimal algorithm depends on the message size $m$ and on the ratio between $t_s$ and $t_w$
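The pairing rule and the no-congestion claim can be checked with a short sketch. It assumes that "dimensional routing" means correcting the differing address bits from the lowest to the highest dimension, and that each link carries one message per direction at a time; the names `ecube_path` and `pairwise_exchange_rounds` are hypothetical.

```python
def ecube_path(src, dst, d):
    """Directed links used when differing bits are corrected from dimension 0 upwards."""
    links, cur = [], src
    for i in range(d):
        bit = 1 << i
        if (cur ^ dst) & bit:
            links.append((cur, cur ^ bit))
            cur ^= bit
    return links


def pairwise_exchange_rounds(p):
    """Return, for rounds 1..p-1, the (sender, receiver) pairs and a congestion flag.

    Assumption: p is a power of two.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    d = p.bit_length() - 1
    rounds = []
    for i in range(1, p):
        pairs = [(x, x ^ i) for x in range(p)]        # node x talks to x XOR i
        used = [link for x, y in pairs for link in ecube_path(x, y, d)]
        congestion_free = len(used) == len(set(used)) # no directed link used twice
        rounds.append((pairs, congestion_free))
    return rounds


if __name__ == "__main__":
    for i, (pairs, ok) in enumerate(pairwise_exchange_rounds(8), start=1):
        print("round", i, "congestion free:", ok)
```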
Improving Complexity
- Splitting and Routing Messages in Parts
- all discussed operations, with the exception of one2all broadcast, all2one reduction and all-reduce, are bandwidth optimal
- we can reduce the bandwidth complexity of these three operations by splitting the message and routing it in parts
- m must be big enough
- the factor of $t_s$ increases and the factor of $t_w$ decreases; whether that really speeds up execution depends on $t_s$, $t_w$ and $m$
Improving Complexity
- One2All Broadcast
- scatter m onto p nodes (fragments of size $m/p$)
- all2all broadcast of these fragments; everybody then concatenates the fragments to get the original m
- Complexity
- scatter: $t_s\log p + t_w (m/p)(p-1)$
- all2all broadcast (hypercube): $t_s\log p + t_w (m/p)(p-1)$
- total (hypercube): $2\left(t_s\log p + t_w (m/p)(p-1)\right)$
- All2One Reduction
- dual of One2All Broadcast
- all2all reduction followed by gather
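To see when splitting pays off, the binary tree broadcast cost $(t_s + t_w m)\log p$ can be compared with the scatter plus all2all broadcast cost on a hypercube. The sketch below simply evaluates the two formulas; the function names and the parameter values in the example are hypothetical and chosen only for illustration.

```python
import math


def tree_broadcast_cost(p, m, ts, tw):
    """Cost of the binary tree one2all broadcast: (ts + tw*m) * log p."""
    return (ts + tw * m) * math.log2(p)


def split_broadcast_cost(p, m, ts, tw):
    """Cost of scatter followed by all2all broadcast on a hypercube."""
    return 2 * (ts * math.log2(p) + tw * (m / p) * (p - 1))


if __name__ == "__main__":
    p, ts, tw = 64, 100.0, 1.0          # illustrative values, not measurements
    for m in (10, 100, 1000, 10000):
        print(m, tree_broadcast_cost(p, m, ts, tw), split_broadcast_cost(p, m, ts, tw))
```

For small m the doubled $t_s$ term dominates and the tree is cheaper; for large m the split version wins because its $t_w$ term is roughly $2m$ instead of $m\log p$.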
Improving Complexity
- AllReduce
- equals All2One reduce followed by One2All broadcast
- All2One reduce = All2All reduce followed by gather
- One2All broadcast = scatter followed by All2All broadcast
- the inner gather and scatter cancel each other (i.e. the data are already where we wanted them to be)
- what remains is All2All reduce followed by All2All broadcast
- Complexity
- total (hypercube): $2\left(t_s\log p + t_w (m/p)(p-1)\right)$
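The composition of all2all reduce and all2all broadcast can be sketched on a hypercube as follows. This is an illustration under assumptions: p is a power of two, each node's message is already split into p fragments, and the operation is associative and commutative. The name `allreduce_split` is hypothetical.

```python
import operator


def allreduce_split(vectors, op=operator.add):
    """All-reduce as all2all reduction followed by all2all broadcast.

    vectors[node][k] is fragment k of the message held by node.
    Every node ends up with the element-wise reduction of all vectors.
    """
    p = len(vectors)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    assert all(len(v) == p for v in vectors), "sketch assumes p fragments per node"
    d = p.bit_length() - 1
    partial = [dict(enumerate(v)) for v in vectors]
    # Phase 1: all2all reduction; each step halves the fragments a node is responsible for
    for i in reversed(range(d)):
        bit = 1 << i
        old = partial
        partial = [
            {k: op(old[node][k], old[node ^ bit][k])
             for k in old[node] if (k & bit) == (node & bit)}
            for node in range(p)
        ]
    # now node k holds only the fully reduced fragment k
    # Phase 2: all2all broadcast of the reduced fragments
    for i in range(d):
        bit = 1 << i
        old = partial
        partial = [{**old[node], **old[node ^ bit]} for node in range(p)]
    return [[frags[k] for k in range(p)] for frags in partial]


if __name__ == "__main__":
    data = [[node * 10 + k for k in range(4)] for node in range(4)]
    print(allreduce_split(data))
```

No gather or scatter step appears between the two phases, which reflects the observation that the reduced fragments are already where the all2all broadcast needs them.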
Summary of Techniques
- Communication Patterns
- binary tree: one2all broadcast, scatter, reduction
- linear pipelining: all2all operations in linear array/ring
- dimensional: all2all operations in hypercubes, mesh/torus
- all2all direct: all2all scatter in hypercubes
- Communication Directions
- forward: broadcast, scatter
- backward: reduction, gather
- Data Handling
- copy/forward: broadcast
- combine: reductions
- concatenate: all2all broadcast and scatter
- split: all2all broadcast, all2all scatter
Examples of Combinations of the Techniques
- Linear Array/Ring Combinations
- binary tree + forward + copy = One2All broadcast
- binary tree + backward + combine = All2One reduction
- linear pipelining + forward + copy = All2All broadcast
- linear pipelining + backward + extract + combine = All2All reduction
- Hypercube Combinations
- binary tree + forward + copy = One2All broadcast
- dimensional + forward + concatenate = All2All broadcast
- dimensional + backward + split + combine = All2All reduction
- all2all direct + forward + copy = All2All