1. Parallel algorithms often require processors to exchange data with other processors.
2. Basic communication operations
- Basic patterns of communication among processors
- Building blocks in parallel algorithms:
  - one-to-all broadcast
  - all-to-one reduction
  - all-to-all broadcast
  - all-to-all reduction
  - all-reduce operations
  - prefix-sum operations
  - scatter and gather
  - all-to-all personalized communication
3. A key to the efficiency of parallel programs
- Proper implementation of basic communication operations on various parallel architectures.
4. One-to-all broadcast and all-to-one reduction
- One-to-all broadcast: one processor sends a message M to all other (or some) processors.
- All-to-one reduction: each processor has a buffer M of size m; one processor accumulates the data from all other (or some) processors into its buffer of size m. (Both operations are sketched below.)
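To make the two operations concrete, here is a minimal sketch in C with MPI (an illustration added to the slides, not part of them; the choice of rank 0 as the root and the integer payload are assumptions):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One-to-all broadcast: rank 0 sends one integer to everyone. */
        int msg = (rank == 0) ? 42 : 0;
        MPI_Bcast(&msg, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* All-to-one reduction: every rank contributes one value;
           rank 0 accumulates the sum into its own buffer. */
        int contribution = rank, sum = 0;
        MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("broadcast value %d, sum of ranks = %d\n", msg, sum);

        MPI_Finalize();
        return 0;
    }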
5. An example of one-to-all broadcast and all-to-one reduction
- Matrix-vector multiplication A × x = y, where
  - A: a 4 × 4 matrix
  - x, y: 4 × 1 vectors
- Use 16 processors
6. Mapping of the matrix and the vectors to processors
7. Perform a one-to-all broadcast of each element of x among the processors in each column
8. Each processor multiplies its matrix element with its element of x
9. Perform an all-to-one reduction among the processors in each row, where the associative operation is sum (slides 6-9 are sketched in code below)
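Slides 6-9 can be expressed directly in MPI by splitting the 16 processes into per-column and per-row communicators. The sketch below is an added illustration; the slides do not fix where x initially resides, so placing x[j] on the top-row processor (0, j), and the placeholder matrix values, are assumptions. Run with exactly 16 processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int i = rank / 4, j = rank % 4;   /* this rank holds A[i][j] */

        /* Split into per-column and per-row communicators. */
        MPI_Comm col, row;
        MPI_Comm_split(MPI_COMM_WORLD, j, i, &col);  /* same column j */
        MPI_Comm_split(MPI_COMM_WORLD, i, j, &row);  /* same row i   */

        double a = 1.0;                       /* placeholder for A[i][j] */
        double x = (i == 0) ? j + 1.0 : 0.0;  /* assume x[j] lives on row 0 */

        /* Step 1 (slide 7): one-to-all broadcast of x[j] down each column. */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, col);

        /* Step 2 (slide 8): local multiply. */
        double prod = a * x;

        /* Step 3 (slide 9): all-to-one reduction (sum) across each row;
           the first processor of each row ends up holding y[i]. */
        double y = 0.0;
        MPI_Reduce(&prod, &y, 1, MPI_DOUBLE, MPI_SUM, 0, row);
        if (j == 0) printf("y[%d] = %g\n", i, y);

        MPI_Finalize();
        return 0;
    }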
10. All-to-all broadcast and all-to-all reduction
- All-to-all broadcast: each processor simultaneously performs a one-to-all broadcast (sketched below).
- All-to-all reduction: each processor becomes the destination of an all-to-one reduction.
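In MPI terms, an all-to-all broadcast of one-word messages corresponds to MPI_Allgather. A minimal sketch (added here; the payload values are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Every rank broadcasts its own one-word message; afterwards
           each rank holds the messages of all p ranks. */
        int mine = 100 + rank;
        int all[64];                      /* assumes p <= 64 */
        MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d received %d ... %d\n", rank, all[0], all[p - 1]);

        MPI_Finalize();
        return 0;
    }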
11. All-reduce operation
- Initially, each processor has a buffer of size m.
- Each (not just one) processor collects the final result, combined from all other processors through an associative operator, into its buffer of size m. (See the sketch below.)
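MPI exposes this operation directly as MPI_Allreduce; a minimal added sketch, summing the ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, sum;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Unlike MPI_Reduce, every process receives the combined result. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d has sum %d\n", rank, sum);  /* same on all ranks */
        MPI_Finalize();
        return 0;
    }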
12. All-to-one reduction vs. all-reduce
- The associative operation is addition
[Figure: four processors, each holding one buffer; all-to-one reduction leaves the combined result on a single processor, whereas all-reduce leaves the same combined result on every processor.]
13. A simple method to perform all-reduce
- First, perform an all-to-one reduction
- Next, perform a one-to-all broadcast
[Figure: the all-to-one reduction gathers the combined result on one processor; the subsequent one-to-all broadcast distributes it to all processors.]
14. Can we perform all-reduce faster?
- Maybe or maybe not; here we can improve the time by using the communication pattern of all-to-all broadcast.
- Example: all-reduce on a hypercube (sketched below)
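A sketch of the hypercube all-reduce (an added illustration, assuming the number of processes is a power of two): in each of the log2(p) dimensions, every process exchanges its partial result with the neighbor whose rank differs in that bit and combines the two values, which is exactly the pairwise-exchange pattern of an all-to-all broadcast:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int value = rank, recv;
        /* In dimension k, exchange partial sums with the neighbor whose
           rank differs in bit k, then combine. After log2(p) steps
           every process holds the full sum. */
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;
            MPI_Sendrecv(&value, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += recv;
        }
        printf("rank %d: all-reduced sum = %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }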
15. Prefix sums
- Given p numbers n0, n1, ..., np-1 and an associative operation ⊕, the problem is to compute the p quantities:
  - n0
  - n0 ⊕ n1
  - ...
  - n0 ⊕ n1 ⊕ ... ⊕ np-1
- Example: the associative operation is addition
  - input: 3, 1, 4, 0, 2; the prefix sums are 3, 4, 8, 8, 10 (reproduced in the sketch below)
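MPI provides inclusive prefix sums as MPI_Scan. A minimal added sketch using the slide's input values (it must be run with exactly 5 processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int input[5] = {3, 1, 4, 0, 2};  /* process i holds the i-th value */
        int mine = input[rank], prefix;
        /* MPI_Scan is inclusive: process i receives n0 + ... + ni. */
        MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix sum = %d\n", rank, prefix);  /* 3,4,8,8,10 */

        MPI_Finalize();
        return 0;
    }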
16. An application of prefix sums
- Packing the uppercase letters from an array A that holds both uppercase and lowercase letters: build a flag array T with a 1 at each uppercase position, compute prefix sums on T, and use each prefix sum as the packed position of the corresponding letter (sketched below).
[Figure: the array A, the flag array T, the prefix sums computed on T, and the packed array of uppercase letters.]
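A sequential sketch of the packing idea (added here; the slides show it only as a diagram, and the example string is an assumption):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *A = "aBcDeF";        /* mixed-case input */
        int n = strlen(A), T[16];
        char out[16];

        /* T[i] = 1 where A[i] is uppercase, 0 otherwise. */
        for (int i = 0; i < n; i++)
            T[i] = isupper((unsigned char)A[i]) ? 1 : 0;

        /* The inclusive prefix sum of T gives each uppercase letter's
           1-based position in the packed output. */
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += T[i];
            if (T[i]) out[sum - 1] = A[i];
        }
        out[sum] = '\0';
        printf("%s\n", out);             /* prints BDF */
        return 0;
    }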
17. Scatter and gather
- Scatter operation: one processor sends a unique message to every other processor.
- Gather operation: one processor collects a unique message from every other processor.
- No duplication of messages. (Both are sketched below.)
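A minimal MPI sketch of the two operations (added; rank 0 as the root and the four-element array are assumptions, so run with 4 processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int send[4] = {10, 20, 30, 40}, mine, back[4];
        /* Scatter: rank 0 sends a distinct integer to every process. */
        MPI_Scatter(send, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
        mine += rank;                    /* some local work */
        /* Gather: rank 0 collects one distinct integer from each process. */
        MPI_Gather(&mine, 1, MPI_INT, back, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d %d %d %d\n", back[0], back[1], back[2], back[3]);

        MPI_Finalize();
        return 0;
    }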
18. All-to-all personalized communication
- Each processor sends a distinct message to every other processor.
- Equivalently, each processor performs a scatter operation.
19. An example of all-to-all personalized communication
- Matrix transposition
- The transpose of a matrix A is the matrix A^T such that A^T[i, j] = A[j, i].
[Figure: a matrix A and its transpose A^T.]
20. Transposing a 4 × 4 matrix using four processors
- The scatter operation among P0, P1, P2, P3 (sketched below)
[Figure: the 4 × 4 matrix distributed among P0-P3; each processor scatters one element of its part to each of the others.]
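In MPI, the simultaneous scatters are a single MPI_Alltoall. A sketch of the 4 × 4 example (added; the assumption that each of 4 processes owns one row of A is an illustrative layout):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int row[4], tcol[4];
        for (int j = 0; j < 4; j++)
            row[j] = 4 * rank + j;       /* A[rank][j] */
        /* MPI_Alltoall delivers element j of P_i's row to process j;
           afterwards P_i holds column i of A, i.e. row i of A^T. */
        MPI_Alltoall(row, 1, MPI_INT, tcol, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d holds A^T row: %d %d %d %d\n",
               rank, tcol[0], tcol[1], tcol[2], tcol[3]);

        MPI_Finalize();
        return 0;
    }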
21. A key to the efficiency of parallel programs
- Proper implementation of basic communication operations on various parallel architectures.
22. Parallel architectures
- Parallel architectures are often modeled as interconnection networks, where
  - processors or processing elements are represented as nodes, and
  - physical media between processors are represented as links between the corresponding nodes.
- Example: ring, mesh, hypercube, tree
23. Interconnection network topologies
- Ring, mesh, hypercube, tree, hypertree,
- pyramid, butterfly, cube-connected cycles,
- shuffle exchange, de Bruijn, star, omega,
- fat tree, tori, hyper-star, meta-cube,
- circulant, k-cube, honeycomb, Clos,
- k-ary n-trees, Myrinet, mesh of trees, ...
24. Ring or linear array
- Linear array: each node (except the two nodes at the ends) has two neighbors.
- Ring: a linear array with wraparound
25. Mesh or torus
- 2-dimensional (2-D) mesh: each node, identified by two indices (i, j), is connected to the four nodes whose indices differ by one: (i-1, j), (i+1, j), (i, j-1), and (i, j+1).
[Figure: a 2-D mesh with no wraparound, and a 2-D mesh with wraparound (a 2-D torus).]
26. A 2-D mesh is an extension of the linear array to two dimensions; a 2-D torus is an extension of the ring to two dimensions.
27. Hypercube
- In a d-dimensional (d-D) hypercube, each node is identified by a binary number of d bits and is connected to the d neighbors whose labels differ from its own in exactly one bit (sketched below).
- Many algorithms with recursive patterns map naturally onto hypercubes.
28. 4-D hypercube
- A 4-D hypercube: two 3-D hypercubes with 8 additional links connecting the corresponding nodes
29. A mesh can be mapped to a hypercube by numbering the nodes of the mesh.
- Example: mapping a 2-D mesh to a 4-D hypercube (one such numbering is sketched below)
[Figure: a 4 × 4 mesh whose nodes carry 4-bit labels, e.g. 0000, 0001, 0011, 0010 along the bottom row; adjacent mesh nodes differ in exactly one bit.]
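The labels in the figure follow a Gray-code numbering: node (i, j) of the mesh gets the 2-bit Gray code of i concatenated with the 2-bit Gray code of j, so mesh neighbors map to hypercube neighbors. A small added sketch that reproduces such a labeling (the row ordering relative to the figure is an assumption):

    #include <stdio.h>

    /* Gray code: consecutive integers map to labels differing in one bit. */
    int gray(int x) { return x ^ (x >> 1); }

    int main(void) {
        for (int i = 0; i < 4; i++) {             /* mesh row */
            for (int j = 0; j < 4; j++) {         /* mesh column */
                int label = (gray(i) << 2) | gray(j);   /* 4-bit node label */
                printf("%d%d%d%d ", (label >> 3) & 1, (label >> 2) & 1,
                                    (label >> 1) & 1, label & 1);
            }
            printf("\n");
        }
        return 0;
    }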
30. Tree
- For any pair of nodes, there is exactly one path between them.
- Static binary tree: each node is a processor.
- Dynamic binary tree: each leaf is a processor, and the non-leaf nodes serve as switching units.
31. Communication model
- Routing technique: cut-through routing
- Links are bidirectional: two directly connected nodes can send messages to each other simultaneously.
- Single port: a node can send and/or receive on only one link at a time.
32. Cut-through routing
- A message is divided into fixed-size units called flits (4 bits to 32 bytes).
- Once a connection has been established, the flits are sent one after the other.
- The time to transfer a message between two processors is independent of the path length between them.
- Time to transfer an m-word message: ts + m * tw, where
  - ts is the time required to handle a message at the sender (or receiver), and
  - tw is the time required to transfer one word.
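For instance (with illustrative numbers that are not from the slides): if ts = 10 microseconds, tw = 0.1 microseconds per word, and m = 1000 words, the transfer takes ts + m * tw = 10 + 1000 * 0.1 = 110 microseconds, whether the two processors are adjacent or several links apart.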
33. Some features affecting the efficiency of parallel programs
- Basic communication operation: e.g., one-to-all broadcast
- Parallel architecture: e.g., ring, mesh, tree, hypercube
- Communication model: cut-through routing, packet-switching routing, or store-and-forward routing; bidirectional links; single-port or multi-port nodes