Title: Basic Communication Operations
1. Basic Communication Operations
- Carl Tropper
- Department of Computer Science
2. Building Blocks in Parallel Programs
- Common interactions between processes in parallel programs: broadcast, reduction, prefix sum.
- Study their behavior on standard networks: mesh, hypercube, linear array.
- Derive expressions for their time complexity.
- Assumptions: a link carries one message at a time in each direction; a node can send on one link while receiving on another; cut-through routing; bidirectional links.
- Message transfer time on a link: t_s + t_w m, where t_s is the startup time, t_w is the per-word transfer time, and m is the message size in words.
3. Broadcast and Reduction
- One-to-all broadcast
- All-to-one reduction
- p processes have m words apiece
- An associative operation (sum, product, max, min) is applied to each word
- Both are used in:
- Gaussian elimination
- Shortest paths
- Inner product of vectors
4. One-to-all broadcast on a ring or linear array
- One-to-all broadcast
- Naïve approach: send p-1 separate messages from the source to the other processes. Inefficient.
- Recursive doubling (below): first send to the node furthest away, then halve the distance at each step. Takes log p steps.
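The recursive-doubling schedule can be sketched as a small simulation (this is illustrative code I am adding, not code from the slides; `broadcast_recursive_doubling` is a made-up name, and p is assumed to be a power of two):

```python
# Sketch: one-to-all broadcast by recursive doubling on a p-node ring,
# source node 0. The first hop covers half the array; each later step
# halves the distance, and every holder of the message sends in parallel.
def broadcast_recursive_doubling(p, source=0):
    has_msg = {source}
    steps = 0
    d = p // 2                          # first hop: half the array
    while d >= 1:
        for node in list(has_msg):      # all holders send concurrently
            has_msg.add((node + d) % p)
        d //= 2
        steps += 1
    return has_msg, steps
```

Each iteration of the loop is one communication step, so all p nodes are reached in log2(p) steps.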
5. Reduction on a linear array
- Odd-numbered nodes send to the preceding even-numbered nodes
- Then 2 sends to 0 and 6 sends to 4, concurrently
- Finally 4 sends to 0
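The same schedule, generalized to any power-of-two array size, can be sketched in Python (illustrative code, not from the slides; `reduce_to_zero` is a hypothetical name):

```python
# Sketch: all-to-one sum-reduction to node 0 on a p-node linear array,
# p a power of two. In step k, the node at distance 2^(k-1) sends its
# partial sum down, matching the slide's 8-node schedule:
# odd nodes -> even nodes, then 2->0 and 6->4, then 4->0.
def reduce_to_zero(values):
    vals = list(values)
    p = len(vals)
    d = 1
    while d < p:
        for node in range(0, p, 2 * d):
            vals[node] += vals[node + d]   # node+d sends its partial sum
        d *= 2
    return vals[0]
```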
6. Matrix-vector multiplication on an n×n mesh
- Broadcast the input vector to each row
- Each row performs a reduction
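The two phases can be sketched as follows, assuming one matrix element per process (illustrative code; `mesh_matvec` is a made-up name):

```python
# Sketch: matrix-vector multiply y = A x on an n x n mesh with one
# element A[i][j] per process. After the vector broadcast, process (i, j)
# holds x[j]; each row then sum-reduces its local products.
def mesh_matvec(A, x):
    n = len(x)
    # local step: process (i, j) computes A[i][j] * x[j]
    partial = [[A[i][j] * x[j] for j in range(n)] for i in range(n)]
    # each row performs an all-to-one sum reduction
    return [sum(row) for row in partial]
```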
7. Broadcast on a mesh
- Build on the linear-array algorithm
- One-to-all broadcast from the source (node 0) along its row (nodes 4, 8, 12)
- Then a one-to-all broadcast down each column
8. Broadcast on a hypercube
- The source sends to its neighbor across the highest dimension (msb), then across the next lower dimension; source and destination then both send across the next lower dimension, and so on, until every node holding the message sends across the lowest dimension
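This msb-first schedule can be sketched as a simulation (illustrative code, not from the slides; `hypercube_broadcast` is a hypothetical name):

```python
# Sketch: one-to-all broadcast on a d-dimensional hypercube from node 0.
# In step k the message crosses dimension d-1-k (highest first): every
# node already holding the message sends to its neighbor across that bit.
def hypercube_broadcast(d, source=0):
    has_msg = {source}
    for dim in reversed(range(d)):          # msb first
        for node in list(has_msg):
            has_msg.add(node ^ (1 << dim))  # neighbor differing in bit dim
    return has_msg
```

The set of holders doubles each step, so all 2^d nodes are covered in d = log p steps.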
9. Broadcast on a tree
- Same idea as the hypercube
- Grey circles are switches
- Start from node 0 and go to node 4; the rest is the same as the hypercube algorithm
13. All-to-all broadcast on a linear array/ring
- Idea: every node is kept busy by a circular transfer of messages
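The circular transfer can be sketched as a simulation (illustrative code; `ring_all_to_all` is a made-up name):

```python
# Sketch: all-to-all broadcast on a p-node ring. Each node starts with
# one message; in each of the p-1 steps every node forwards the message
# it most recently received to its right neighbor, keeping a copy.
def ring_all_to_all(p):
    received = [[i] for i in range(p)]   # node i starts with message i
    in_flight = list(range(p))           # message each node sends next
    for _ in range(p - 1):
        nxt = [in_flight[(i - 1) % p] for i in range(p)]  # receive from left
        for i in range(p):
            received[i].append(nxt[i])
        in_flight = nxt
    return received
```

Every link is busy in every step, which is why the p-1 steps cannot be improved on a ring.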
14. All-to-all broadcast on a ring
16. All-to-all broadcast
- Mesh: each row performs an all-to-all broadcast; nodes then perform an all-to-all broadcast within their columns.
- Hypercube: pairs of nodes exchange messages in each dimension. Requires log p steps.
19. All-to-all broadcast on a mesh
21. All-to-all broadcast on a hypercube
- A hypercube of dimension d is composed of two hypercubes of dimension d-1
- Start with the one-dimensional hypercubes: adjacent nodes exchange messages
- Adjacent nodes in the two-dimensional hypercubes then exchange their accumulated messages
- In general, adjacent nodes in each next higher-dimensional hypercube exchange messages
- Requires log p steps
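The dimension-by-dimension exchange can be sketched as follows (illustrative code; `hypercube_all_to_all` is a hypothetical name):

```python
# Sketch: all-to-all broadcast on a 2^d-node hypercube. In step k each
# node exchanges its whole accumulated buffer with its neighbor across
# dimension k, so the buffer doubles every step (log p steps total).
def hypercube_all_to_all(d):
    p = 1 << d
    buf = [[i] for i in range(p)]
    for dim in range(d):
        buf = [buf[i] + buf[i ^ (1 << dim)] for i in range(p)]
    return buf
```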
22. All-to-all broadcast on a hypercube
23. All-to-all broadcast time
- Ring or linear array: T = (t_s + t_w m)(p - 1)
- Mesh:
- 1st phase: √p simultaneous all-to-all broadcasts among √p nodes each take time (t_s + t_w m)(√p - 1)
- 2nd phase: the size of each message is now m√p, so the phase takes time (t_s + t_w m√p)(√p - 1)
- Total time: T = 2 t_s (√p - 1) + t_w m (p - 1)
- Hypercube:
- The size of the message in the i-th step is 2^(i-1) m
- It takes t_s + 2^(i-1) t_w m for a pair of nodes to send and receive messages
- T = Σ_{i=1..log p} (t_s + 2^(i-1) t_w m) = t_s log p + t_w m (p - 1)
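These cost expressions can be written out directly (illustrative code; the function names are made up, and p is assumed to be a power of two, or a perfect square for the mesh):

```python
# Sketch: the all-to-all broadcast times above, with the hypercube time
# computed step by step so it can be checked against the closed form
# t_s log p + t_w m (p - 1).
from math import isqrt, log2

def ring_time(ts, tw, m, p):
    return (ts + tw * m) * (p - 1)

def mesh_time(ts, tw, m, p):
    rp = isqrt(p)                        # sqrt(p), p a perfect square
    phase1 = (ts + tw * m) * (rp - 1)
    phase2 = (ts + tw * m * rp) * (rp - 1)
    return phase1 + phase2               # = 2 ts (sqrt(p)-1) + tw m (p-1)

def hypercube_time(ts, tw, m, p):
    d = int(log2(p))
    return sum(ts + (2 ** (i - 1)) * tw * m for i in range(1, d + 1))
```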
24. Observation on all-to-all broadcast time
- The send time, neglecting startup, is the same for all three architectures: t_w m (p - 1)
- This is a lower bound on communication time for all three architectures
26. All-Reduce Operation
- All-reduce: every node starts with a buffer of size m; an associative operation is applied across all the buffers, and every node receives the same result.
- Semantically equivalent to an all-to-one reduction followed by a one-to-all broadcast.
- All-reduce on a single word implements barrier synchronization on message-passing machines: no node can finish the reduction before every node has contributed to it.
- Implement all-reduce using the all-to-all broadcast pattern, but add message contents instead of concatenating messages.
- On a hypercube, T = (t_s + t_w m) log p for the log p steps, because the message size does not double in each dimension.
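The "add instead of concatenate" variant of the hypercube exchange can be sketched as (illustrative code; `hypercube_all_reduce` is a hypothetical name):

```python
# Sketch: all-reduce (sum) on a 2^d-node hypercube. Same exchange
# pattern as all-to-all broadcast, but partners' buffers are added,
# so the message stays size m in every one of the log p steps.
def hypercube_all_reduce(values):
    p = len(values)                     # must be a power of two
    vals = list(values)
    dim = 1
    while dim < p:
        vals = [vals[i] + vals[i ^ dim] for i in range(p)]
        dim <<= 1
    return vals                         # every node holds the full sum
```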
27. Prefix Sum Operation
- Prefix sums (scans) are all the partial sums s_k of p numbers n_0, ..., n_{p-1}, one number living on each node.
- Node k starts with n_k and winds up with s_k = n_0 + ... + n_k.
- Modify the all-to-all broadcast: each node only incorporates partial sums from nodes with smaller labels.
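This modification can be sketched as follows (illustrative code, not from the slides; `hypercube_prefix_sum` is a made-up name):

```python
# Sketch: prefix sum on a 2^d-node hypercube. Each node keeps two values:
# `total` (sum over its current subcube, always exchanged) and `prefix`
# (its own result, updated only with contributions from smaller labels).
def hypercube_prefix_sum(values):
    p = len(values)                     # must be a power of two
    prefix = list(values)
    total = list(values)
    dim = 1
    while dim < p:
        new_total = [total[i] + total[i ^ dim] for i in range(p)]
        new_prefix = list(prefix)
        for i in range(p):
            if i & dim:                 # partner i^dim has a smaller label
                new_prefix[i] += total[i ^ dim]
        prefix, total = new_prefix, new_total
        dim <<= 1
    return prefix
```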
28. Prefix sum on a hypercube
29. Prefix sum on a hypercube
30. Scatter and gather
- Scatter (one-to-all personalized communication): the source sends a unique message to each destination (vs. broadcast, where every destination gets the same message).
- Hypercube: use the one-to-all broadcast pattern. At each step, a node holding data keeps half of its messages and transfers the other half to a neighbor; data moves from one subcube to the other. Takes log p steps.
- Gather: one node collects unique messages from the other nodes.
- Hypercube: the dual of scatter; odd-numbered nodes send their buffers to even-numbered nodes in the other (lower-dimensional) subcube, and this continues recursively.
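The scatter pattern can be sketched as a simulation (illustrative code; `hypercube_scatter` is a hypothetical name, and the source is assumed to be node 0):

```python
# Sketch: scatter on a 2^d-node hypercube from node 0. At each step every
# node holding messages keeps the half destined for its own subcube and
# sends the other half across the current (highest remaining) dimension.
def hypercube_scatter(d, messages):
    held = {0: dict(enumerate(messages))}  # messages[i] is for node i
    for dim in reversed(range(d)):
        bit = 1 << dim
        for node in list(held):
            keep = {dst: m for dst, m in held[node].items()
                    if dst & bit == node & bit}
            send = {dst: m for dst, m in held[node].items()
                    if dst & bit != node & bit}
            held[node] = keep
            held[node ^ bit] = send        # other subcube's half
    return {node: msgs[node] for node, msgs in held.items()}
```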
31. Scatter operation on a hypercube
32. All-to-all personalized communication
- Each node sends a distinct message to every other node.
- Used in FFT, matrix transpose, sample sort.
- Matrix transpose: A^T[i,j] = A[j,i] for 0 ≤ i, j < n.
- Put row i, i.e. elements (i,0), (i,1), ..., (i,n-1), on processor P_i.
- To transpose, (i,0) goes to P_0, (i,1) goes to P_1, ..., (i,n-1) goes to P_{n-1}.
- Every processor sends a distinct element to every other processor!
33. All-to-all personalized communication on ring, mesh, hypercube
- Ring: all nodes send messages in the same direction.
- Each node sends p-1 messages of size m to its neighbor.
- Nodes extract the messages addressed to them and forward the rest.
- Mesh: a √p × √p mesh. Each node groups its messages by destination column.
- All-to-all personalized communication on each row.
- Upon arrival, messages are sorted by destination row.
- All-to-all personalized communication on each column.
- Hypercube: a p-node hypercube.
- p/2 links in the same dimension connect two subcubes of p/2 nodes each.
- Each node exchanges p/2 messages across a given dimension.
36. All-to-all personalized communication: optimal algorithm on a hypercube
- Nodes choose exchange partners so as not to suffer congestion
- In step j, node i exchanges messages with node i XOR j
- First step: all pairs of nodes differing in the lsb exchange messages
- Last step: all pairs of nodes differing in the msb exchange messages
- E-cube routing in the hypercube implements this
- The (Hamming) distance between nodes i and j is the number of non-zero bits in i XOR j
- Sort the links corresponding to the non-zero bits in ascending order
- Route messages along these links
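The partner schedule and the e-cube path can be sketched as follows (illustrative code; both function names are made up):

```python
# Sketch: the congestion-free exchange schedule and e-cube routing.
# In step j, node i exchanges with node i XOR j; a message from src to
# dst crosses the differing dimensions in ascending order (lsb first).
def exchange_schedule(p):
    """Partner of each node i in steps j = 1 .. p-1."""
    return [[i ^ j for i in range(p)] for j in range(1, p)]

def ecube_route(src, dst):
    """Node sequence visited by e-cube routing from src to dst."""
    path, node, diff, dim = [src], src, src ^ dst, 0
    while diff:
        if diff & 1:                    # cross this differing dimension
            node ^= 1 << dim
            path.append(node)
        diff >>= 1
        dim += 1
    return path
```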
37. Optimal algorithm (picture)
38. Optimal algorithm
39. Circular q-Shift
- Node i sends its packet to node (i + q) mod p
- Useful in string and image pattern matching, and in matrix computations
- Example: a 5-shift on a 4×4 mesh:
- Everyone shifts q mod √p = 1 position to the right (with wraparound)
- Everyone shifts ⌊q/√p⌋ = 1 position upward
- The left column shifts up one extra position before the general upward shift, to make up for the wraparound effect in the first step
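The decomposition into a row shift, a wraparound compensation, and a column shift can be checked with a small simulation (illustrative code; `mesh_circular_shift` is a made-up name, the compensation is applied per wrapped packet, and row-major node numbering is assumed):

```python
# Sketch: circular q-shift of p = n*n nodes on an n x n mesh, row-major
# numbering. Phase 1 shifts within rows; packets that wrapped around a
# row get one extra row of compensation; phase 2 shifts whole rows.
def mesh_circular_shift(n, q):
    p = n * n
    dest = []
    for i in range(p):
        row, col = divmod(i, n)
        # phase 1: shift q mod n within the row, with wraparound
        new_col = (col + q % n) % n
        j = row * n + new_col
        if col + q % n >= n:            # wrapped: compensate one row
            j = (j + n) % p
        # phase 2: shift floor(q/n) rows
        j = (j + n * (q // n)) % p
        dest.append(j)
    return dest
```

The net effect matches the direct definition (i + q) mod p.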
40. Mesh circular shift
41. Circular shift on a hypercube
- Map the linear array onto the hypercube: map array node i to hypercube node j, where j is the d-bit reflected Gray code of i. Array neighbors then map to hypercube neighbors.
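The reflected Gray code has a one-line closed form, shown here for illustration:

```python
# Sketch: the reflected Gray code used to embed a linear array (or ring)
# in a hypercube. Consecutive values of i map to codes differing in
# exactly one bit, i.e. to hypercube neighbors.
def gray(i):
    return i ^ (i >> 1)
```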
44. What happens if we split messages into packets?
- One-to-all broadcast = a scatter operation followed by an all-to-all broadcast
- Scatter: t_s log p + t_w (m/p)(p - 1)
- All-to-all broadcast of messages of size m/p: t_s log p + t_w (m/p)(p - 1) on a hypercube
- Total: T ≈ 2 (t_s log p + t_w m) on a hypercube
- Bottom line: the start-up cost doubles, but the t_w cost is reduced by a factor of about (log p)/2
- All-to-one reduction is the dual of one-to-all broadcast, so it becomes an all-to-all reduction followed by a gather
- All-reduce: combine the two approaches above
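The trade-off above can be checked numerically (illustrative code in the t_s/t_w model; both function names are made up):

```python
# Sketch: simple hypercube broadcast time (t_s + t_w m) log p versus the
# scatter + all-to-all variant, each phase costing
# t_s log p + t_w (m/p)(p - 1). Splitting wins for large messages.
from math import log2

def simple_broadcast(ts, tw, m, p):
    return (ts + tw * m) * log2(p)

def split_broadcast(ts, tw, m, p):
    phase = ts * log2(p) + tw * (m / p) * (p - 1)
    return 2 * phase                    # scatter, then all-to-all broadcast
```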