1
Basic Communication Operations: Implementation
and Complexity
  • Operations
  • one-to-all broadcast and reduction
  • all-to-all broadcast and reduction
  • all-reduce, parallel prefix operations
  • all-to-all scatter
  • Topologies
  • linear array/ring
  • 2D mesh
  • hypercube
  • Improving complexity
  • splitting and routing messages in parts

2
Why?
  • frequently used operations, so you had better
    know what they do, how they do it, and at what cost
  • the algorithms are simple and practical
  • the techniques used demonstrate nicely many
    useful concepts of parallel algorithm design and
    analysis

3
Linear Array vs Ring, Mesh vs Torus
  • Linear Array vs Ring
  • for simplicity reasons, we will see mostly
    examples of algorithms for ring topology
  • the ring can be simulated by a linear array by
    simply embedding the ring into the linear array
  • we assume the time to communicate a message of
    size m is ts + tw·m
  • therefore we only worry about the congestion of
    the embedding
  • Mesh vs Torus
  • all mesh and torus algorithms will be of the
    form
  • apply the linear array/ring algorithm for each
    row
  • apply the linear array/ring algorithm for each
    column
  • the only difference is that the mesh uses the
    linear array algorithm, while the torus uses the
    ring algorithm

4
One-to-all Broadcast
  • Linear Array/Ring
  • Naïve approach
  • send to the right neighbour
  • terrible complexity: (ts + tw·m)(p-1)

(figure: linear array of processors p0 through p7,
the message forwarded rightward)
  • Recursive doubling
  • use logical binary tree for broadcasting
  • make sure to minimize congestion
  • complexity: (ts + tw·m) log p
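The recursive-doubling schedule is easy to sketch in
Python (an illustrative simulation, not code from the
slides; it assumes p is a power of two and uses XOR
relabeling so any source works):

```python
# Simulated recursive-doubling one-to-all broadcast for p = 2^d processes.
# In round k, every process already holding the message sends it to the
# process whose id differs in bit k, so the set of holders doubles each round.

def broadcast_rounds(p, source=0):
    """Return the (sender, receiver) pairs of each round; p a power of two."""
    assert p & (p - 1) == 0, "p must be a power of two"
    holders = {source}
    schedule = []
    k = 0
    while 1 << k < p:
        pairs = [(x, x ^ (1 << k)) for x in holders]
        holders |= {recv for _, recv in pairs}
        schedule.append(pairs)
        k += 1
    return schedule

for rnd, pairs in enumerate(broadcast_rounds(8)):
    print("round", rnd, pairs)   # log2(8) = 3 rounds -> cost (ts + tw·m) log p
```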

5
One-to-all Broadcast
  • 2D Mesh
  • Treat rows and columns as linear arrays
  • broadcast from the source within its row
  • each node of that row then broadcasts within its
    column
  • balanced binary tree broadcasts within
    row/column
  • complexity?
  • 3D Mesh
  • the same dimension-wise approach can be used
  • Hypercube
  • the same dimension-wise approach can be used
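A hedged sketch of the round count for the
dimension-wise mesh broadcast (our own helper; it
assumes p is a perfect square and √p a power of two):

```python
# Dimension-wise one-to-all broadcast on a sqrt(p) x sqrt(p) mesh:
# recursive doubling along a line of sqrt(p) nodes takes log2(sqrt(p))
# rounds; the mesh algorithm runs that once for the row phase and once
# for the column phase.
import math

def mesh_broadcast_rounds(p):
    side = math.isqrt(p)
    assert side * side == p and side & (side - 1) == 0
    line_rounds = side.bit_length() - 1   # log2(side) rounds per phase
    return 2 * line_rounds                # = log2(p) rounds in total

print(mesh_broadcast_rounds(16))  # 4 rounds, i.e. (ts + tw·m) log p overall
```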

6
One-to-all Broadcast and Reduction
  • Summary
  • balanced binary tree approach common for all
    topologies
  • complexity: (ts + tw·m) log p
  • All to one reduction
  • just reverse the flow of messages
  • combine the incoming messages using the operation

7
All-to-all Broadcast and Reduction
  • All to all broadcast
  • each process sends a message to all other
    processes
  • different processes send different messages
  • often used in matrix operations
  • All to all reduction
  • each process is a destination of an all-to-one
    reduction
  • each process starts with p messages
  • Naïve approach
  • sequentially execute p broadcasts/reductions
  • Better approach
  • pipeline the broadcasts
  • be careful not to overload/congest the links

8
All-to-all Broadcast and Reduction
  • Linear Array/Ring
  • pipelining the binary tree approach results in
    excessive congestion
  • use simple send-to-the-right broadcasting and
    pipeline it
  • 2D Mesh
  • 2 phases
  • all2all broadcast within each row (messages of
    size m)
  • all2all broadcast within each column
    (consolidated messages of size m·√p)
  • Hypercube
  • extend the 2D mesh algorithm to d dimensions
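The ring algorithm is small enough to simulate
(illustrative Python, not from the slides): each node
repeatedly forwards to its right neighbour the message
it received in the previous step, so every message
travels once around the ring:

```python
# All-to-all broadcast on a ring, simulated synchronously: in each of the
# p-1 steps every node sends its most recently received message (initially
# its own) to its right neighbour and collects what arrives from the left.

def ring_all_to_all_broadcast(p):
    known = [{i} for i in range(p)]   # messages collected at each node
    in_transit = list(range(p))       # message each node will forward next
    for _ in range(p - 1):
        incoming = [in_transit[(i - 1) % p] for i in range(p)]
        for i in range(p):
            known[i].add(incoming[i])
        in_transit = incoming
    return known

print(ring_all_to_all_broadcast(5))   # every node ends with {0, 1, 2, 3, 4}
```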

9
All-to-all Broadcast and Reduction
  • Complexity
  • Linear Array/Ring
  • p-1 communication rounds of cost ts + tw·m each
  • total: (ts + tw·m)(p-1)
  • 2D Mesh
  • (√p - 1) communication rounds of cost ts + tw·m
    each (rows)
  • (√p - 1) communication rounds of cost ts + tw·m·√p
    each (columns)
  • total: 2·ts(√p - 1) + tw·m(p-1)
  • Hypercube
  • log p communication rounds
  • messages of size 2^(i-1)·m are exchanged in
    round i
  • sum of ts + 2^(i-1)·tw·m for i = 1 to log p
  • total: ts·log p + tw·m(p-1)
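Plugging illustrative numbers into the three totals
makes the trade-off visible (the values of ts, tw, m
below are made up for the example, not measurements):

```python
# Compare the all-to-all broadcast totals derived above for the three
# topologies; only the startup (ts) term differs between them.
from math import log2, isqrt

def ring_cost(ts, tw, m, p):      return (ts + tw * m) * (p - 1)
def mesh_cost(ts, tw, m, p):      return 2 * ts * (isqrt(p) - 1) + tw * m * (p - 1)
def hypercube_cost(ts, tw, m, p): return ts * log2(p) + tw * m * (p - 1)

ts, tw, m, p = 100.0, 1.0, 10.0, 64
print(ring_cost(ts, tw, m, p),
      mesh_cost(ts, tw, m, p),
      hypercube_cost(ts, tw, m, p))
# the bandwidth term tw·m(p-1) = 630 is identical in all three results
```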

10
All-to-all Broadcast and Reduction
  • Discussion
  • The bandwidth-limited term tw·m(p-1) is the same
    for all topologies
  • each processor has to receive a total of m(p-1)
    words over a link of bandwidth determined
    by tw
  • The hypercube algorithm cannot be efficiently
    applied to other topologies
  • unlike the one2all broadcast
  • would cause excessive congestion
  • All2All Reduction
  • reverse the order and direction of messages
  • instead of concatenating, the messages are
    combined

11

All-Reduce and Parallel Prefix
  • All-Reduce
  • semantics: all2one reduce followed by a broadcast
    of the result
  • often implements barrier() (a synchronization
    primitive)
  • implementation: the all2all broadcast
    communication pattern, but instead of being
    concatenated, the messages are combined according
    to the operation
  • complexity: (ts + tw·m) log p
  • Parallel Prefix (Scan)
  • apply the operation only to your predecessors
  • solution: use all2all broadcast with combining
  • have two combining buffers: one for yourself,
    where only messages from predecessors are
    combined, and one for others, where all messages
    are combined
  • complexity: (ts + tw·m) log p
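The two-buffer scan is concise enough to simulate
(our sketch, assuming p is a power of two and an
associative op):

```python
# Hypercube parallel prefix (scan) with the two buffers from the slide:
# `total` combines every message seen in the current sub-cube, while
# `prefix` combines only messages arriving from lower-numbered nodes.

def hypercube_scan(values, op):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    prefix = list(values)   # running result, own value included
    total = list(values)    # combined value of the current sub-cube
    k = 0
    while 1 << k < p:
        bit = 1 << k
        new_total = [op(total[x], total[x ^ bit]) for x in range(p)]
        for x in range(p):
            if x & bit:     # the partner x^bit is a predecessor of x
                prefix[x] = op(total[x ^ bit], prefix[x])
        total = new_total
        k += 1
    return prefix

print(hypercube_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# -> [1, 3, 6, 10, 15, 21, 28, 36]: inclusive prefix sums in log p rounds
```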

12
Scatter and Gather
  • Scatter
  • like one2all broadcast, but starts with the
    combined message, which is subsequently split
    into smaller and smaller blocks
  • Gather
  • reverse of scatter()
  • Complexity
  • step i: messages of size m·p/2^i
  • total: ts·log p + tw·m(p-1)
  • again, there is a lower bound of tw·m(p-1)
    regardless of the topology (that much data must
    leave the source)
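A small simulation of hypercube scatter shows the
halving message sizes (helper names are ours; it
assumes p is a power of two):

```python
# Hypercube scatter, simulated: the source starts with all p blocks; in each
# step every current holder hands the half of its blocks destined for the
# other sub-cube to its partner across that dimension, so messages halve.

def hypercube_scatter(p, source=0):
    assert p & (p - 1) == 0, "p must be a power of two"
    buf = {source: list(range(p))}   # node -> destination blocks it holds
    bit = p >> 1                     # start with the highest dimension
    while bit:
        for node in list(buf):
            partner = node ^ bit
            keep = [b for b in buf[node] if (b ^ node) & bit == 0]
            send = [b for b in buf[node] if (b ^ node) & bit]
            buf[node], buf[partner] = keep, send
        bit >>= 1
    return buf

print(hypercube_scatter(8))   # each node i is left holding exactly block i
# log p steps with message sizes m·p/2, m·p/4, ... -> ts·log p + tw·m(p-1)
```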

13
Summary of Techniques
  • One2All broadcast
  • balanced binary tree / dimension-wise
  • All2One reduction
  • reverse messages + combine them
  • All2All broadcast
  • pipeline simple linear broadcasts /
    dimension-wise + concatenate
  • All2All reduce
  • reverse all2all messages + combine them
  • Scatter
  • one2all communication pattern + split messages
  • Gather
  • reverse of scatter

14
Exercises
  • Consider the following starting configuration in
    a 4x4 torus
  • Answer the following questions
  • which processors have received 7 after the 3rd
    time step?
  • which processors has a message reaching p1
    traveled through?
  • what are the contents of the buffers after the
    3rd time step?
  • for broadcast from p6
  • for all2all broadcast
  • for reduce to p4
  • for allreduce
  • for gather to p11
  • Homework
  • do this for a hypercube as well

15
Exercises
  • Row2All broadcast
  • Describe how to implement the communication
    procedure Row2All(i) in a c×d torus
  • each node of row i contains a message
  • at the end each node of the torus should receive
    all messages from row i
  • derive the time complexity of your approach(es)
  • can you prove that your solution is bandwidth
    optimal?
  • what happens if d (or c) is very large with
    respect to the other?

16
Exercises
All2All broadcast on a tree: Given a balanced
binary tree, describe a procedure to perform
all2all broadcast that takes time
(ts + tw·m·p/2) log p for m-word messages on p
nodes. Assume that only the leaves of the tree
contain nodes, and that an exchange of two m-word
messages between any two nodes connected by
bidirectional channels takes time ts + tw·m·k if
the channel (or a part of it) is shared by k
simultaneous messages.
17
All2All Scatter
18
All2All Scatter
  • Linear Array/Ring
  • send all your data to the right
  • receive a message from the left, extract your
    part and delete it, then forward the rest to the
    right
  • Complexity
  • step i: messages of size m(p-i)
  • total: (ts + tw·m·p/2)(p-1)
  • the average distance traveled by a piece of data
    is p/2
  • therefore the total traffic is m(p-1) · p/2 · p
  • this travels over p edges, so the lower bound on
    the total time is tw·m(p-1)·p/2
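The extract-and-forward rule in a short simulation
(labels like "i->j" stand for the m-word block node i
owes node j; the helper is ours, for illustration):

```python
# All2all scatter (all-to-all personalized communication) on a ring:
# each node pushes everything rightward; at every step it keeps the block
# addressed to itself and forwards the shrinking remainder.

def ring_all_to_all_scatter(p):
    received = [[f"{i}->{i}"] for i in range(p)]
    in_transit = [{j: f"{i}->{j}" for j in range(p) if j != i}
                  for i in range(p)]
    for _ in range(p - 1):
        incoming = [dict(in_transit[(i - 1) % p]) for i in range(p)]
        for i in range(p):
            if i in incoming[i]:                  # extract your own block
                received[i].append(incoming[i].pop(i))
        in_transit = incoming
    return received

print(ring_all_to_all_scatter(4))
# node i collects the four blocks "j->i"; the forwarded packets shrink from
# p-1 blocks to 1, matching the step-i message size m(p-i)
```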

19
All2All Scatter/Gather
  • 2D Mesh
  • each node first groups its messages according to
    the destination column
  • all2all scatter independently within each row
    (messages of size m·√p)
  • sort the combined packet destined for a column
    according to the row destinations within the
    column
  • all2all scatter within each column
  • Complexity
  • all2all scatter among √p nodes with messages of
    size m·√p: (ts + tw·m·p/2)(√p - 1)
  • total: 2(ts + tw·m·p/2)(√p - 1)

20
All2All Scatter/Gather
  • Hypercube
  • Naïve Approach
  • extend the mesh algorithm
  • log p iterations; in iteration i send a message
    of size m·p/2 over the link in dimension i
  • Complexity
  • log p iterations, each of cost ts + tw·m·p/2
  • total: (ts + tw·m·p/2) log p
  • lower bound: total traffic of p · m(p-1) · (log p)/2
    over (p·log p)/2 links results in tw·m(p-1)
  • not bandwidth optimal!

21
All2All Scatter/Gather
  • Hypercube, Bandwidth Optimal
  • let every pair of nodes communicate directly
    with each other!
  • p-1 rounds of communication
  • in round i, node x communicates with node x XOR i
  • if dimensional routing is used, there is no
    congestion
  • Complexity
  • p-1 rounds of cost ts + tw·m each
  • total: (ts + tw·m)(p-1)
  • note that the ts factor is higher here; the
    optimal algorithm depends on the message size m
    and on the ratio between ts and tw
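That each round x ↔ x XOR i really pairs all nodes off
can be checked mechanically (a sanity-check sketch,
not slide code):

```python
# Direct pairwise all2all scatter on a hypercube: in round i (1 <= i < p)
# node x exchanges one m-word block with node x XOR i. Since XOR with a
# fixed nonzero i is a fixed-point-free involution, every round is a
# perfect matching of the p nodes.

def pairwise_schedule(p):
    assert p & (p - 1) == 0, "p must be a power of two"
    for i in range(1, p):
        pairs = sorted((x, x ^ i) for x in range(p) if x < x ^ i)
        assert len(pairs) == p // 2   # every node is in exactly one pair
        yield i, pairs

for i, pairs in pairwise_schedule(8):
    print(f"round {i}: {pairs}")
# p-1 rounds of cost ts + tw·m each -> total (ts + tw·m)(p-1)
```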

22
Improving Complexity
  • Splitting and Routing Messages in Parts
  • all discussed operations, with the exception of
    one2all broadcast, all2one reduction and
    all-reduce, are bandwidth optimal
  • we can reduce the bandwidth complexity of these
    operations by splitting the message and routing
    the parts separately
  • m must be big enough
  • the factor of ts increases and the factor of tw
    decreases; whether that really speeds up
    execution depends on ts, tw and m

23
Improving Complexity
  • One2All Broadcast
  • scatter the message m onto the p nodes
    (fragments of size m/p)
  • all2all broadcast of these fragments; everybody
    then concatenates the fragments to recover the
    original m
  • Complexity (see the sketch below)
  • scatter: ts·log p + tw·(m/p)(p-1)
  • all2all broadcast (hypercube): ts·log p +
    tw·(m/p)(p-1)
  • total (hypercube): 2(ts·log p + tw·(m/p)(p-1))
  • All2One Reduction
  • dual of One2All Broadcast
  • all2all reduction followed by gather
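The scatter-then-allgather composition for One2All
broadcast, sketched end-to-end (a toy model of the two
phases' end states with hypothetical names; the
round-by-round schedules are in the earlier sketches):

```python
# Improved one-to-all broadcast: scatter the m-word message into p fragments
# of size m/p, then all-to-all broadcast (allgather) the fragments so every
# node reassembles the original message.

def improved_broadcast(message, p):
    assert len(message) % p == 0, "assume p divides the message length"
    frag = len(message) // p
    # phase 1 (scatter): node i ends up holding fragment i
    held = {i: message[i * frag:(i + 1) * frag] for i in range(p)}
    # phase 2 (allgather): every node concatenates all p fragments
    return ["".join(held[j] for j in range(p)) for _ in range(p)]

print(improved_broadcast("abcdefgh", 4))   # all 4 nodes recover "abcdefgh"
# hypercube cost: 2(ts·log p + tw·(m/p)(p-1)), better for large m
```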

24
Improving Complexity
  • AllReduce
  • equals All2One reduce followed by One2All
    broadcast
  • All2One reduce = All2All reduce followed by
    gather
  • One2All broadcast = scatter followed by All2All
    broadcast
  • the inner gather and scatter cancel each other
    (i.e. the data is already where we want it to be)
  • so: All2All reduce followed by All2All broadcast
  • Complexity
  • total (hypercube): 2(ts·log p + tw·(m/p)(p-1))
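The resulting all-reduce (reduce-scatter followed by
allgather) in miniature, again modelling phase end
states only (our helper names; assumes p divides the
vector length):

```python
# All-reduce as All2All reduce (reduce-scatter) followed by All2All
# broadcast (allgather), after the inner gather and scatter cancel.
from functools import reduce
import operator

def all_reduce(vectors, op):
    p = len(vectors)
    chunk = len(vectors[0]) // p   # assume p divides the vector length
    # phase 1 (reduce-scatter): node i is left with reduced chunk i
    chunks = [[reduce(op, (v[j] for v in vectors))
               for j in range(i * chunk, (i + 1) * chunk)] for i in range(p)]
    # phase 2 (allgather): every node concatenates the reduced chunks
    full = [x for c in chunks for x in c]
    return [list(full) for _ in range(p)]

print(all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]], operator.add))
# both "nodes" end with [11, 22, 33, 44]; cost 2(ts·log p + tw·(m/p)(p-1))
```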

25
Summary of Techniques
  • Communication Patterns
  • binary tree: one2all broadcast, scatter,
    reduction
  • linear pipelining: all2all operations in
    linear array/ring
  • dimensional: all2all operations in hypercubes,
    mesh/torus
  • all2all direct: all2all scatter in hypercubes
  • Communication Directions
  • forward: broadcast, scatter
  • backward: reduction, gather
  • Data Handling
  • copy/forward: broadcast
  • combine: reductions
  • concatenate: all2all broadcast, gather
  • split: scatter, all2all scatter

26
Examples of Combinations of the Techniques
  • Linear Array/Ring Combinations
  • binary tree + forward + copy = One2All broadcast
  • binary tree + backward + combine = All2One
    reduction
  • linear pipelining + forward + copy = All2All
    broadcast
  • linear pipelining + backward + extract/combine =
    All2All reduction
  • Hypercube Combinations
  • binary tree + forward + copy = One2All broadcast
  • dimensional + forward + concatenate = All2All
    broadcast
  • dimensional + backward + split/combine =
    All2All reduction
  • all2all direct + forward + copy = All2All
    scatter

27
Summary