1. Parallel algorithms often require processors to exchange data with other processors.
2. Basic communication operations
- Basic patterns of communication among processors
- Building blocks in parallel algorithms:
  - one-to-all broadcast
  - all-to-one reduction
  - all-to-all broadcast
  - all-to-all reduction
  - all-reduce operations
  - prefix-sum operations
  - scatter and gather
  - all-to-all personalized communication
3. A key to the efficiency of parallel programs
- Proper implementation of basic communication operations on various parallel architectures.
4. One-to-all broadcast and all-to-one reduction
- One-to-all broadcast: one processor sends a message M to all other (or some) processors.
- All-to-one reduction: each processor has a buffer M of size m; one processor accumulates the data from all other (or some) processors into its buffer of size m. (Both operations are sketched below.)
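To make the two operations concrete, here is a minimal sketch in C with MPI (an illustration added to the slides, not part of them; the choice of rank 0 as the root and the integer payload are assumptions):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One-to-all broadcast: rank 0 sends one integer to everyone. */
        int msg = (rank == 0) ? 42 : 0;
        MPI_Bcast(&msg, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* All-to-one reduction: every rank contributes one value;
           rank 0 accumulates the sum into its own buffer. */
        int contribution = rank, sum = 0;
        MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("broadcast value %d, sum of ranks = %d\n", msg, sum);

        MPI_Finalize();
        return 0;
    }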
5. An example of one-to-all broadcast and all-to-one reduction
- Matrix-vector multiplication A × x = y, where
  - A: a 4 × 4 matrix
  - x, y: 4 × 1 vectors
- Use 16 processors
6. Mapping of the matrix and the vectors to processors
7. Perform a one-to-all broadcast of each element of x among the processors in each column
8. Each processor multiplies its matrix element with its element of x
9. Perform an all-to-one reduction among the processors in each row, where the associative operation is sum (slides 6-9 are sketched in code below)
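Slides 6-9 can be expressed directly in MPI by splitting the 16 processes into per-column and per-row communicators. The sketch below is an added illustration; the slides do not fix where x initially resides, so placing x[j] on the top-row processor (0, j), and the placeholder matrix values, are assumptions. Run with exactly 16 processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int i = rank / 4, j = rank % 4;   /* this rank holds A[i][j] */

        /* Split into per-column and per-row communicators. */
        MPI_Comm col, row;
        MPI_Comm_split(MPI_COMM_WORLD, j, i, &col);  /* same column j */
        MPI_Comm_split(MPI_COMM_WORLD, i, j, &row);  /* same row i   */

        double a = 1.0;                       /* placeholder for A[i][j] */
        double x = (i == 0) ? j + 1.0 : 0.0;  /* assume x[j] lives on row 0 */

        /* Step 1 (slide 7): one-to-all broadcast of x[j] down each column. */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, col);

        /* Step 2 (slide 8): local multiply. */
        double prod = a * x;

        /* Step 3 (slide 9): all-to-one reduction (sum) across each row;
           the first processor of each row ends up holding y[i]. */
        double y = 0.0;
        MPI_Reduce(&prod, &y, 1, MPI_DOUBLE, MPI_SUM, 0, row);
        if (j == 0) printf("y[%d] = %g\n", i, y);

        MPI_Finalize();
        return 0;
    }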
10. All-to-all broadcast and all-to-all reduction
- All-to-all broadcast: each processor simultaneously performs a one-to-all broadcast (sketched below).
- All-to-all reduction: each processor becomes the destination of an all-to-one reduction.
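In MPI terms, an all-to-all broadcast of one-word messages corresponds to MPI_Allgather. A minimal sketch (added here; the payload values are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Every rank broadcasts its own one-word message; afterwards
           each rank holds the messages of all p ranks. */
        int mine = 100 + rank;
        int all[64];                      /* assumes p <= 64 */
        MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d received %d ... %d\n", rank, all[0], all[p - 1]);

        MPI_Finalize();
        return 0;
    }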
11. All-reduce operation
- Initially, each processor has a buffer of size m.
- Each (not just one) processor collects the final result, combined from all other processors through an associative operator, into its buffer of size m. (See the sketch below.)
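MPI exposes this operation directly as MPI_Allreduce; a minimal added sketch, summing the ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, sum;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Unlike MPI_Reduce, every process receives the combined result. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d has sum %d\n", rank, sum);  /* same on all ranks */
        MPI_Finalize();
        return 0;
    }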
12. All-to-one reduction vs. all-reduce
- The associative operation is addition
[Figure: four processors, each holding one buffer; all-to-one reduction leaves the combined result on a single processor, whereas all-reduce leaves the same combined result on every processor.]
13. A simple method to perform all-reduce
- First, perform an all-to-one reduction
- Next, perform a one-to-all broadcast
[Figure: the all-to-one reduction gathers the combined result on one processor; the subsequent one-to-all broadcast distributes it to all processors.]
14. Can we perform all-reduce faster?
- Maybe or maybe not; here we can improve the time by using the communication pattern of all-to-all broadcast.
- Example: all-reduce on a hypercube (sketched below)
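A sketch of the hypercube all-reduce (an added illustration, assuming the number of processes is a power of two): in each of the log2(p) dimensions, every process exchanges its partial result with the neighbor whose rank differs in that bit and combines the two values, which is exactly the pairwise-exchange pattern of an all-to-all broadcast:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int value = rank, recv;
        /* In dimension k, exchange partial sums with the neighbor whose
           rank differs in bit k, then combine. After log2(p) steps
           every process holds the full sum. */
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;
            MPI_Sendrecv(&value, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += recv;
        }
        printf("rank %d: all-reduced sum = %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }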
15. Prefix sums
- Given p numbers n0, n1, ..., np-1 and an associative operation ⊕, the problem is to compute the p quantities:
  - n0
  - n0 ⊕ n1
  - ...
  - n0 ⊕ n1 ⊕ ... ⊕ np-1
- Example: the associative operation is addition
  - input: 3, 1, 4, 0, 2; the prefix sums are 3, 4, 8, 8, 10 (reproduced in the sketch below)
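MPI provides inclusive prefix sums as MPI_Scan. A minimal added sketch using the slide's input values (it must be run with exactly 5 processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int input[5] = {3, 1, 4, 0, 2};  /* process i holds the i-th value */
        int mine = input[rank], prefix;
        /* MPI_Scan is inclusive: process i receives n0 + ... + ni. */
        MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix sum = %d\n", rank, prefix);  /* 3,4,8,8,10 */

        MPI_Finalize();
        return 0;
    }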
16. An application of prefix sums
- Packing the uppercase letters from an array A that holds both uppercase and lowercase letters: build a flag array T with a 1 at each uppercase position, compute prefix sums on T, and use each prefix sum as the packed position of the corresponding letter (sketched below).
[Figure: the array A, the flag array T, the prefix sums computed on T, and the packed array of uppercase letters.]
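A sequential sketch of the packing idea (added here; the slides show it only as a diagram, and the example string is an assumption):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *A = "aBcDeF";        /* mixed-case input */
        int n = strlen(A), T[16];
        char out[16];

        /* T[i] = 1 where A[i] is uppercase, 0 otherwise. */
        for (int i = 0; i < n; i++)
            T[i] = isupper((unsigned char)A[i]) ? 1 : 0;

        /* The inclusive prefix sum of T gives each uppercase letter's
           1-based position in the packed output. */
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += T[i];
            if (T[i]) out[sum - 1] = A[i];
        }
        out[sum] = '\0';
        printf("%s\n", out);             /* prints BDF */
        return 0;
    }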
17. Scatter and gather
- Scatter operation: one processor sends a unique message to every other processor.
- Gather operation: one processor collects a unique message from every other processor.
- No duplication of messages. (Both are sketched below.)
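A minimal MPI sketch of the two operations (added; rank 0 as the root and the four-element array are assumptions, so run with 4 processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int send[4] = {10, 20, 30, 40}, mine, back[4];
        /* Scatter: rank 0 sends a distinct integer to every process. */
        MPI_Scatter(send, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
        mine += rank;                    /* some local work */
        /* Gather: rank 0 collects one distinct integer from each process. */
        MPI_Gather(&mine, 1, MPI_INT, back, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d %d %d %d\n", back[0], back[1], back[2], back[3]);

        MPI_Finalize();
        return 0;
    }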
18. All-to-all personalized communication
- Each processor sends a distinct message to every other processor.
- Equivalently, each processor performs a scatter operation.
19. An example of all-to-all personalized communication
- Matrix transposition
- The transpose of a matrix A is the matrix A^T such that A^T[i, j] = A[j, i].
[Figure: a matrix A and its transpose A^T.]
20. Transposing a 4 × 4 matrix using four processors
- The scatter operation among P0, P1, P2, P3 (sketched below)
[Figure: the 4 × 4 matrix distributed among P0-P3; each processor scatters one element of its part to each of the others.]
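In MPI, the simultaneous scatters are a single MPI_Alltoall. A sketch of the 4 × 4 example (added; the assumption that each of 4 processes owns one row of A is an illustrative layout):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int row[4], tcol[4];
        for (int j = 0; j < 4; j++)
            row[j] = 4 * rank + j;       /* A[rank][j] */
        /* MPI_Alltoall delivers element j of P_i's row to process j;
           afterwards P_i holds column i of A, i.e. row i of A^T. */
        MPI_Alltoall(row, 1, MPI_INT, tcol, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d holds A^T row: %d %d %d %d\n",
               rank, tcol[0], tcol[1], tcol[2], tcol[3]);

        MPI_Finalize();
        return 0;
    }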
21. A key to the efficiency of parallel programs
- Proper implementation of basic communication operations on various parallel architectures.
22. Parallel architectures
- Parallel architectures are often modeled as interconnection networks, where
  - processors or processing elements are represented as nodes, and
  - physical media between processors are represented as links between the corresponding nodes.
- Example: ring, mesh, hypercube, tree
23. Interconnection network topologies
- Ring, mesh, hypercube, tree, hypertree,
- pyramid, butterfly, cube-connected cycles,
- shuffle exchange, de Bruijn, star, omega,
- fat tree, tori, hyper-star, meta-cube,
- circulant, k-cube, honeycomb, Clos,
- k-ary n-trees, Myrinet, mesh of trees, ...
24. Ring or linear array
- Linear array: each node (except the two nodes at the ends) has two neighbors.
- Ring: a linear array with wraparound
25. Mesh or torus
- 2-dimensional (2-D) mesh: each node, identified by two indices (i, j), is connected to the four nodes whose indices differ by one: (i-1, j), (i+1, j), (i, j-1), and (i, j+1).
[Figure: a 2-D mesh with no wraparound, and a 2-D mesh with wraparound (a 2-D torus).]
26. A 2-D mesh is an extension of the linear array to two dimensions; a 2-D torus is an extension of the ring to two dimensions.
27. Hypercube
- In a d-dimensional (d-D) hypercube, each node is identified by a binary number of d bits and is connected to the d neighbors whose labels differ from its own in exactly one bit (sketched below).
- Many algorithms with recursive patterns map naturally onto hypercubes.
28. 4-D hypercube
- A 4-D hypercube: two 3-D hypercubes with 8 additional links connecting the corresponding nodes
29. A mesh can be mapped to a hypercube by numbering the nodes of the mesh.
- Example: mapping a 2-D mesh to a 4-D hypercube (one such numbering is sketched below)
[Figure: a 4 × 4 mesh whose nodes carry 4-bit labels, e.g. 0000, 0001, 0011, 0010 along the bottom row; adjacent mesh nodes differ in exactly one bit.]
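The labels in the figure follow a Gray-code numbering: node (i, j) of the mesh gets the 2-bit Gray code of i concatenated with the 2-bit Gray code of j, so mesh neighbors map to hypercube neighbors. A small added sketch that reproduces such a labeling (the row ordering relative to the figure is an assumption):

    #include <stdio.h>

    /* Gray code: consecutive integers map to labels differing in one bit. */
    int gray(int x) { return x ^ (x >> 1); }

    int main(void) {
        for (int i = 0; i < 4; i++) {             /* mesh row */
            for (int j = 0; j < 4; j++) {         /* mesh column */
                int label = (gray(i) << 2) | gray(j);   /* 4-bit node label */
                printf("%d%d%d%d ", (label >> 3) & 1, (label >> 2) & 1,
                                    (label >> 1) & 1, label & 1);
            }
            printf("\n");
        }
        return 0;
    }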
30. Tree
- For any pair of nodes, there is exactly one path between them.
- Static binary tree: each node is a processor.
- Dynamic binary tree: each leaf is a processor, and the non-leaf nodes serve as switching units.
31. Communication model
- Routing technique: cut-through routing
- Links are bidirectional: two directly connected nodes can send messages to each other simultaneously.
- Single port: a node can send and/or receive on only one link at a time.
32. Cut-through routing
- A message is divided into fixed-size units called flits (4 bits to 32 bytes).
- Once a connection has been established, the flits are sent one after the other.
- The time to transfer a message between two processors is independent of the path length between them.
- Time to transfer an m-word message: ts + m * tw, where
  - ts is the time required to handle a message at the sender (or receiver), and
  - tw is the time required to transfer one word.
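For instance (with illustrative numbers that are not from the slides): if ts = 10 microseconds, tw = 0.1 microseconds per word, and m = 1000 words, the transfer takes ts + m * tw = 10 + 1000 * 0.1 = 110 microseconds, whether the two processors are adjacent or several links apart.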
33. Some features affecting the efficiency of parallel programs
- Basic communication operation: e.g., one-to-all broadcast
- Parallel architecture: e.g., ring, mesh, tree, hypercube
- Communication model: cut-through routing, packet-switching routing, or store-and-forward routing; bidirectional links; single-port or multi-port nodes