Title: Parallel algorithms often require processors to exchange data with other processors.
Slides: 34
Provided by: engi57

1
  • Parallel algorithms often require processors
    to exchange data with other processors.

2
Basic communication operations
  • Basic patterns of communication among processors
  • Building blocks in parallel algorithms
  • one-to-all broadcast
  • all-to-one reduction
  • all-to-all broadcast
  • all-to-all reduction
  • all-reduce operations
  • prefix-sum operations
  • scatter and gather
  • all-to-all personalized communication

3
A key to the efficiency of parallel programs
  • Proper implementation of basic communication
    operations on various parallel architectures.

4
One-to-all broadcast and all-to-one reduction
  • one-to-all broadcast: one processor sends a
    message M to all other (or some) processors
  • all-to-one reduction: each processor has a
    buffer M of size m; one processor accumulates the
    data from all other (or some) processors into its
    own buffer of size m through an associative operator
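The two operations above can be sketched in plain Python, modeling each processor's buffer as one entry of a list (a single-process simulation; the helper names are illustrative, not a real message-passing API):

```python
from functools import reduce
import operator

def one_to_all_broadcast(buffers, source):
    """Copy the source processor's message into every processor's buffer."""
    msg = buffers[source]
    return [msg for _ in buffers]

def all_to_one_reduction(buffers, dest, op):
    """Accumulate every processor's buffer with an associative operator
    `op`; only processor `dest` ends up holding the combined result."""
    out = list(buffers)
    out[dest] = reduce(op, buffers)
    return out

print(one_to_all_broadcast([7, 0, 0, 0], source=0))                    # [7, 7, 7, 7]
print(all_to_one_reduction([1, 2, 3, 4], dest=0, op=operator.add)[0])  # 10
```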

5
An example of one-to-all broadcast and
all-to-one reduction
  • Matrix-vector multiplication: A × x = y, where
  • A: a 4 × 4 matrix
  • x, y: 4 × 1 vectors
  • Use 16 processors

6
Mapping of the matrix and the vectors to
processors
7
Perform a one-to-all broadcast of each element of x
among the processors in each column
8
Each processor multiplies its matrix element with
its element of x
9
Perform an all-to-one reduction among the processors
in each row, where the associative operation is addition
10
All-to-all broadcast and All-to-all reduction
  • all-to-all broadcast: each processor
    simultaneously performs a one-to-all broadcast
  • all-to-all reduction: each processor becomes the
    destination of an all-to-one reduction

11
All-reduce operation
  • Initially, each processor has a buffer of size m.
  • Every processor (not just one) collects the final
    result, combined from all processors through an
    associative operator, into its buffer of size m.

12
All-to-one reduction vs. all-reduce
  • An associative operation: addition

(Figure: four processors hold buffers 0, 1, 2, 3. All-to-one
reduction leaves the combined result 0+1+2+3 on a single processor;
all-reduce leaves 0+1+2+3 on every processor.)
13
A simple method to perform all-reduce
  • First, perform an all-to-one reduction
  • Next, perform a one-to-all broadcast
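The two steps above can be sketched as follows, again simulating the p buffers as a list:

```python
from functools import reduce
import operator

def simple_all_reduce(buffers, op):
    """All-reduce as all-to-one reduction followed by one-to-all broadcast."""
    # Step 1: all-to-one reduction of every buffer at one processor.
    total = reduce(op, buffers)
    # Step 2: one-to-all broadcast of the combined result to all processors.
    return [total for _ in buffers]

print(simple_all_reduce([0, 1, 2, 3], operator.add))  # [6, 6, 6, 6]
```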

(Figure: four processors hold buffers 0, 1, 2, 3. The all-to-one
reduction first gathers 0+1+2+3 at one processor; the one-to-all
broadcast then copies 0+1+2+3 to all processors.)
14
Can we perform all-reduce faster?
  • Yes, on some architectures: we can improve the
    time by using the communication pattern of
    all-to-all broadcast
  • Example: all-reduce on a hypercube
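One way to sketch this (assuming p = 2^d simulated processors) is the recursive-doubling exchange pattern of all-to-all broadcast: in round k, each node pairs with the node whose label differs in bit k, and both combine their partial results. This finishes in log2(p) rounds rather than the two phases of reduce-then-broadcast.

```python
import operator

def hypercube_all_reduce(values, op):
    """All-reduce on p = 2^d simulated processors: in round k, node i
    exchanges its partial result with node i XOR 2^k and combines."""
    p = len(values)
    assert p & (p - 1) == 0 and p > 0, "p must be a power of two"
    vals = list(values)
    for k in range(p.bit_length() - 1):
        # All pairs exchange simultaneously, so combine into a fresh list.
        new = [op(vals[i], vals[i ^ (1 << k)]) for i in range(p)]
        vals = new
    return vals

print(hypercube_all_reduce([0, 1, 2, 3, 4, 5, 6, 7], operator.add))
# every entry is 28 after log2(8) = 3 rounds
```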

15
Prefix Sums
  • Given p numbers n0, n1, ..., np-1 and an
    associative operation ⊕, the problem is to
    compute the p quantities
  • n0
  • n0 ⊕ n1
  • ...
  • n0 ⊕ n1 ⊕ ... ⊕ np-1
  • Example: the associative operation is addition;
    input 3, 1, 4, 0, 2; the prefix sums are 3, 4, 8, 8, 10
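A direct sequential sketch of the definition, which reproduces the slide's example (parallel formulations exist, e.g. on a hypercube, but are not shown here):

```python
import operator

def prefix_sums(nums, op):
    """Compute n0, n0⊕n1, ..., n0⊕n1⊕...⊕n(p-1) for an associative op."""
    out, acc = [], None
    for n in nums:
        acc = n if acc is None else op(acc, n)
        out.append(acc)
    return out

print(prefix_sums([3, 1, 4, 0, 2], operator.add))  # [3, 4, 8, 8, 10]
```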

16
An application of prefix sums
Packing uppercase letters from an array with both
uppercase and lowercase letters
(Figure: array A contains uppercase and lowercase letters; T is a
marker array with 1 at each uppercase position and 0 elsewhere.
Compute prefix sums on T: the sum at each marked position gives that
letter's position in the packed array A.)
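The packing can be sketched as follows, assuming (as the figure suggests) that T is a 0/1 marker array, so the prefix sum of T gives each uppercase letter's 1-based position in the packed output:

```python
def pack_uppercase(chars):
    """Pack the uppercase letters of `chars` using prefix sums on a
    0/1 marker array T."""
    t = [1 if c.isupper() else 0 for c in chars]
    # Prefix sums of the marker array give 1-based target positions.
    pos, acc = [], 0
    for m in t:
        acc += m
        pos.append(acc)
    # Scatter each uppercase letter to its computed position.
    out = [None] * acc
    for c, m, p in zip(chars, t, pos):
        if m:
            out[p - 1] = c
    return out

print(pack_uppercase(list("aBcDeF")))  # ['B', 'D', 'F']
```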
17
Scatter and gather
  • Scatter operation: one processor sends a unique
    message to every other processor
  • Gather operation: one processor collects a
    unique message from every other processor
  • Unlike broadcast, no message is duplicated

18
All-to-all personalized communication
  • Each processor sends a distinct message to every
    other processor.
  • Each processor performs a scatter operation.

19
An example of all-to-all personalized
communication
  • Matrix transposition
  • The transpose of a matrix A is the matrix AT
    such that AT(i,j) = A(j,i)
  • Example: a matrix A and its transpose AT (figure)
20
Transposing a 4 × 4 matrix using four processors
(Figure: the scatter operation among P0, P1, P2, P3)
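A sequential sketch of the transpose as all-to-all personalized communication: processor i scatters its row, sending element (i,j) to processor j, and each processor assembles one row of AT from the messages it receives.

```python
def transpose_via_all_to_all(A):
    """Transpose A by simulating an all-to-all personalized exchange:
    each processor i performs a scatter, sending A[i][j] to processor j."""
    p = len(A)
    # messages[i][j] is the distinct message processor i sends to processor j.
    messages = [[A[i][j] for j in range(p)] for i in range(p)]
    # Processor j gathers the message addressed to it from every sender i,
    # which is exactly row j of the transpose.
    return [[messages[i][j] for i in range(p)] for j in range(p)]

A = [[1, 2], [3, 4]]
print(transpose_via_all_to_all(A))  # [[1, 3], [2, 4]]
```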
21
A key to the efficiency of parallel programs
  • Proper implementation of basic communication
    operations on various parallel architectures.

22
Parallel architectures
  • Parallel architectures are often modeled as
    interconnection networks, where
  • Processors or processing elements are represented
    as nodes
  • Physical media between processors are represented
    as links between corresponding nodes
  • Example: ring, mesh, hypercube, tree

23
Interconnection network topologies
  • Ring, mesh, hypercube, tree, hypertree,
  • pyramid, butterfly, cube-connected cycles,
  • shuffle-exchange, de Bruijn, star, omega,
  • fat tree, tori, hyper-star, meta-cube,
  • circulant, k-cube, honeycomb, Clos,
  • k-ary n-trees, Myrinet, mesh of trees, ...

24
Ring or Linear array
  • Linear array: each node (except the two nodes at
    the ends) has two neighbors
  • Ring: a linear array with wraparound

25
Mesh or Torus
  • 2-dimensional (or 2-D) mesh: each interior node,
    identified by two indices (i,j), is connected to
    the four nodes (i-1,j), (i+1,j), (i,j-1), (i,j+1),
    whose indices differ from its own by one.

2-D mesh with no wraparound
2-D mesh with wraparound (2-D torus)
26
A 2-D mesh is an extension of the linear array to
two dimensions. A 2-D torus is an extension of the
ring to two dimensions.
27
Hypercube
  • Each node is identified by a binary label of d
    bits and is connected to the d neighbors whose
    labels differ from its own in exactly one bit
    (a d-D hypercube)
  • Many algorithms with recursive patterns map
    naturally onto a hypercube
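Finding a node's d neighbors is just flipping each bit of its label in turn, e.g.:

```python
def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip each of
    the d bits of its binary label."""
    return [node ^ (1 << k) for k in range(d)]

print(hypercube_neighbors(0b0000, 4))  # [1, 2, 4, 8]
print(hypercube_neighbors(0b0101, 3))  # [4, 7, 1]
```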

28
4-D hypercube
  • 4-D hypercube: two 3-D hypercubes with 8
    additional links connecting corresponding nodes

29
  • A mesh can be mapped to a hypercube by numbering
    the nodes of the mesh.
  • Example: mapping a 2-D mesh to a 4-D hypercube

1000  1001  1011  1010
1100  1101  1111  1110
0100  0101  0111  0110
0000  0001  0011  0010
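One such numbering uses the binary-reflected Gray code for each mesh coordinate, which guarantees that mesh neighbors receive labels differing in exactly one bit and are therefore hypercube neighbors. The sketch below reproduces the 4 × 4 labeling shown in the figure:

```python
def gray(i):
    """Binary-reflected Gray code: consecutive values differ in one bit."""
    return i ^ (i >> 1)

def mesh_to_hypercube(r, c, cols_bits):
    """Label mesh node (r, c) by concatenating the Gray codes of its
    row and column indices."""
    return (gray(r) << cols_bits) | gray(c)

# 4 x 4 mesh onto a 4-D hypercube: 2 bits per coordinate.
labels = [[mesh_to_hypercube(r, c, 2) for c in range(4)] for r in range(4)]
print([[format(x, '04b') for x in row] for row in labels])
```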
30
Tree
  • For any pair of nodes, there is exactly one path
    between them
  • Static binary tree: each node is a processor
  • Dynamic binary tree: each leaf is a processor and
    the non-leaf nodes serve as switching units

31
Communication model
  • Routing technique: cut-through routing
  • Links are bidirectional: two directly-connected
    nodes can send messages to each other simultaneously
  • Single port: a node can send and/or receive on
    only one link at a time

32
Cut-through routing
  • A message is divided into fixed-size units called
    flits (typically 4 bits to 32 bytes)
  • Once a connection has been established, the flits
    are sent one after the other
  • The time to transfer a message between two
    processors is independent of the path length
    between them
  • Time to transfer an m-word message: ts + m * tw
  • ts: the time required to handle a message at the
    sender (or receiver)
  • tw: the time required to transfer one word
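The cost model is simple enough to encode directly; the parameter values below are made up purely for illustration.

```python
def cut_through_time(m_words, t_s, t_w):
    """Transfer time for an m-word message under cut-through routing:
    t_s + m * t_w, independent of the number of links traversed."""
    return t_s + m_words * t_w

# Hypothetical values: startup time 50 units, 0.5 units per word.
print(cut_through_time(1024, t_s=50.0, t_w=0.5))  # 50 + 1024 * 0.5 = 562.0
```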

33
Some features affecting the efficiency of
parallel programs
  • Basic communication operation: one-to-all
    broadcast, ...
  • Parallel architecture: ring, mesh, tree,
    hypercube, ...
  • Communication model: cut-through routing,
    packet-switching routing, store-and-forward
    routing; bidirectional links; single-port or
    multi-port