1
Basic Communication Operations: Implementation
and Complexity
  • Operations
  • one-to-all broadcast and reduction
  • all-to-all broadcast and reduction
  • all-reduce, parallel prefix operations
  • all-to-all scatter
  • Topologies
  • linear array/ring
  • 2D mesh
  • hypercube
  • Improving complexity
  • splitting and routing messages in parts

2
Why?
  • frequently used operations, so you had better
    know what they do, how they do it, and at what cost
  • the algorithms are simple and practical
  • the techniques used demonstrate nicely many
    useful concepts of parallel algorithm design and
    analysis

3
Linear Array vs Ring, Mesh vs Torus
  • Linear Array vs Ring
  • for simplicity reasons, we will see mostly
    examples of algorithms for ring topology
  • the ring can be simulated by a linear array by
    simply embedding the ring into the linear array
  • we assume the time to communicate a message of
    size m is ts + tw·m
  • therefore we only worry about the congestion of
    the embedding
  • Mesh vs Torus
  • all mesh and torus algorithms will be of the
    form
  • apply the linear array/ring algorithm for each
    row
  • apply the linear array/ring algorithm for each
    column
  • the only difference is that the mesh uses the
    linear array algorithm, while the torus uses the
    ring algorithm

4
One-to-all Broadcast
  • Linear Array/Ring
  • Naïve approach
  • send to the right neighbour
  • terrible complexity: (ts + tw·m)(p-1)

(figure: linear array of processors p0 through p7,
the message forwarded rightward)
  • Recursive doubling
  • use logical binary tree for broadcasting
  • make sure to minimize congestion
  • complexity: (ts + tw·m) log p
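The recursive-doubling schedule is easy to sketch in
Python (an illustrative simulation, not code from the
slides; it assumes p is a power of two and uses XOR
relabeling so any source works):

```python
# Simulated recursive-doubling one-to-all broadcast for p = 2^d processes.
# In round k, every process already holding the message sends it to the
# process whose id differs in bit k, so the set of holders doubles each round.

def broadcast_rounds(p, source=0):
    """Return the (sender, receiver) pairs of each round; p a power of two."""
    assert p & (p - 1) == 0, "p must be a power of two"
    holders = {source}
    schedule = []
    k = 0
    while 1 << k < p:
        pairs = [(x, x ^ (1 << k)) for x in holders]
        holders |= {recv for _, recv in pairs}
        schedule.append(pairs)
        k += 1
    return schedule

for rnd, pairs in enumerate(broadcast_rounds(8)):
    print("round", rnd, pairs)   # log2(8) = 3 rounds -> cost (ts + tw·m) log p
```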

5
One-to-all Broadcast
  • 2D Mesh
  • Treat rows and columns as linear arrays
  • broadcast from the source within its row
  • each node of that row then broadcasts within its
    column
  • balanced binary tree broadcasts within
    row/column
  • complexity?
  • 3D Mesh
  • the same dimension-wise approach can be used
  • Hypercube
  • the same dimension-wise approach can be used
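A hedged sketch of the round count for the
dimension-wise mesh broadcast (our own helper; it
assumes p is a perfect square and √p a power of two):

```python
# Dimension-wise one-to-all broadcast on a sqrt(p) x sqrt(p) mesh:
# recursive doubling along a line of sqrt(p) nodes takes log2(sqrt(p))
# rounds; the mesh algorithm runs that once for the row phase and once
# for the column phase.
import math

def mesh_broadcast_rounds(p):
    side = math.isqrt(p)
    assert side * side == p and side & (side - 1) == 0
    line_rounds = side.bit_length() - 1   # log2(side) rounds per phase
    return 2 * line_rounds                # = log2(p) rounds in total

print(mesh_broadcast_rounds(16))  # 4 rounds, i.e. (ts + tw·m) log p overall
```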

6
One-to-all Broadcast and Reduction
  • Summary
  • balanced binary tree approach common for all
    topologies
  • complexity: (ts + tw·m) log p
  • All to one reduction
  • just reverse the flow of messages
  • combine the incoming messages using the operation

7
All-to-all Broadcast and Reduction
  • All to all broadcast
  • each process sends a message to all other
    processes
  • different processes send different messages
  • often used in matrix operations
  • All to all reduction
  • each process is a destination of an all-to-one
    reduction
  • each process starts with p messages
  • Naïve approach
  • sequentially execute p broadcasts/reductions
  • Better approach
  • pipeline the broadcasts
  • be careful not to overload/congest the links

8
All-to-all Broadcast and Reduction
  • Linear Array/Ring
  • pipelining the binary tree approach results in
    excessive congestion
  • use simple send-to-the-right broadcasting and
    pipeline it
  • 2D Mesh
  • 2 phases
  • all2all broadcast within each row (messages of
    size m)
  • all2all broadcast within each column
    (consolidated messages of size m·√p)
  • Hypercube
  • extend the 2D mesh algorithm to d dimensions
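The ring algorithm is small enough to simulate
(illustrative Python, not from the slides): each node
repeatedly forwards to its right neighbour the message
it received in the previous step, so every message
travels once around the ring:

```python
# All-to-all broadcast on a ring, simulated synchronously: in each of the
# p-1 steps every node sends its most recently received message (initially
# its own) to its right neighbour and collects what arrives from the left.

def ring_all_to_all_broadcast(p):
    known = [{i} for i in range(p)]   # messages collected at each node
    in_transit = list(range(p))       # message each node will forward next
    for _ in range(p - 1):
        incoming = [in_transit[(i - 1) % p] for i in range(p)]
        for i in range(p):
            known[i].add(incoming[i])
        in_transit = incoming
    return known

print(ring_all_to_all_broadcast(5))   # every node ends with {0, 1, 2, 3, 4}
```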

9
All-to-all Broadcast and Reduction
  • Complexity
  • Linear Array/Ring
  • p-1 communication rounds of cost ts + tw·m each
  • total: (ts + tw·m)(p-1)
  • 2D Mesh
  • (√p - 1) communication rounds of cost ts + tw·m
    each (rows)
  • (√p - 1) communication rounds of cost ts + tw·m·√p
    each (columns)
  • total: 2·ts(√p - 1) + tw·m(p-1)
  • Hypercube
  • log p communication rounds
  • messages of size 2^(i-1)·m are exchanged in
    round i
  • sum of ts + 2^(i-1)·tw·m for i = 1 to log p
  • total: ts·log p + tw·m(p-1)
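Plugging illustrative numbers into the three totals
makes the trade-off visible (the values of ts, tw, m
below are made up for the example, not measurements):

```python
# Compare the all-to-all broadcast totals derived above for the three
# topologies; only the startup (ts) term differs between them.
from math import log2, isqrt

def ring_cost(ts, tw, m, p):      return (ts + tw * m) * (p - 1)
def mesh_cost(ts, tw, m, p):      return 2 * ts * (isqrt(p) - 1) + tw * m * (p - 1)
def hypercube_cost(ts, tw, m, p): return ts * log2(p) + tw * m * (p - 1)

ts, tw, m, p = 100.0, 1.0, 10.0, 64
print(ring_cost(ts, tw, m, p),
      mesh_cost(ts, tw, m, p),
      hypercube_cost(ts, tw, m, p))
# the bandwidth term tw·m(p-1) = 630 is identical in all three results
```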

10
All-to-all Broadcast and Reduction
  • Discussion
  • The bandwidth-limited term tw·m(p-1) is the same
    for all topologies
  • each processor has to receive a total of m(p-1)
    words over a link of bandwidth determined
    by tw
  • The hypercube algorithm cannot be efficiently
    applied to other topologies
  • unlike the one2all broadcast
  • would cause excessive congestion
  • All2All Reduction
  • reverse the order and direction of messages
  • instead of concatenating, the messages are
    combined

11

All-Reduce and Parallel Prefix
  • All-Reduce
  • semantics: all2one reduce followed by a broadcast
    of the result
  • often implements barrier() (a synchronization
    primitive)
  • implementation: the all2all broadcast
    communication pattern, but instead of being
    concatenated, the messages are combined according
    to the operation
  • complexity: (ts + tw·m) log p
  • Parallel Prefix (Scan)
  • apply the operation only to your predecessors
  • solution: use all2all broadcast with combining
  • have two combining buffers: one for yourself,
    where only messages from predecessors are
    combined, and one for others, where all messages
    are combined
  • complexity: (ts + tw·m) log p
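The two-buffer scan is concise enough to simulate
(our sketch, assuming p is a power of two and an
associative op):

```python
# Hypercube parallel prefix (scan) with the two buffers from the slide:
# `total` combines every message seen in the current sub-cube, while
# `prefix` combines only messages arriving from lower-numbered nodes.

def hypercube_scan(values, op):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    prefix = list(values)   # running result, own value included
    total = list(values)    # combined value of the current sub-cube
    k = 0
    while 1 << k < p:
        bit = 1 << k
        new_total = [op(total[x], total[x ^ bit]) for x in range(p)]
        for x in range(p):
            if x & bit:     # the partner x^bit is a predecessor of x
                prefix[x] = op(total[x ^ bit], prefix[x])
        total = new_total
        k += 1
    return prefix

print(hypercube_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# -> [1, 3, 6, 10, 15, 21, 28, 36]: inclusive prefix sums in log p rounds
```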

12
Scatter and Gather
  • Scatter
  • like one2all broadcast, but starts with the
    combined message, which is subsequently split
    into smaller and smaller blocks
  • Gather
  • reverse of scatter()
  • Complexity
  • step i: messages of size m·p/2^i
  • total: ts·log p + tw·m(p-1)
  • again, there is a lower bound of tw·m(p-1)
    regardless of the topology (that much data must
    leave the source)
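A small simulation of hypercube scatter shows the
halving message sizes (helper names are ours; it
assumes p is a power of two):

```python
# Hypercube scatter, simulated: the source starts with all p blocks; in each
# step every current holder hands the half of its blocks destined for the
# other sub-cube to its partner across that dimension, so messages halve.

def hypercube_scatter(p, source=0):
    assert p & (p - 1) == 0, "p must be a power of two"
    buf = {source: list(range(p))}   # node -> destination blocks it holds
    bit = p >> 1                     # start with the highest dimension
    while bit:
        for node in list(buf):
            partner = node ^ bit
            keep = [b for b in buf[node] if (b ^ node) & bit == 0]
            send = [b for b in buf[node] if (b ^ node) & bit]
            buf[node], buf[partner] = keep, send
        bit >>= 1
    return buf

print(hypercube_scatter(8))   # each node i is left holding exactly block i
# log p steps with message sizes m·p/2, m·p/4, ... -> ts·log p + tw·m(p-1)
```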

13
Summary of Techniques
  • One2All broadcast
  • balanced binary tree / dimension-wise
  • All2One reduction
  • reverse messages + combine them
  • All2All broadcast
  • pipeline simple linear broadcasts /
    dimension-wise + concatenate
  • All2All reduce
  • reverse all2all messages + combine them
  • Scatter
  • one2all communication pattern + split messages
  • Gather
  • reverse of scatter

14
Exercises
  • Consider the following starting configuration in
    a 4x4 torus
  • Answer the following questions
  • which processors have received 7 after the 3rd
    time step?
  • which processors has a message reaching p1
    traveled through?
  • what are the contents of the buffers after the
    3rd time step?
  • for broadcast from p6
  • for all2all broadcast
  • for reduce to p4
  • for allreduce
  • for gather to p11
  • Homework
  • do this for a hypercube as well

15
Exercises
  • Row2All broadcast
  • Describe how to implement the communication
    procedure Row2All(i) in a c×d torus
  • each node of row i contains a message
  • at the end each node of the torus should receive
    all messages from row i
  • derive the time complexity of your approach(es)
  • can you prove that your solution is bandwidth
    optimal?
  • what happens if d (or c) is very large with
    respect to the other?

16
Exercises
All2All broadcast on a tree: Given a balanced
binary tree, describe a procedure to perform
all2all broadcast that takes time
(ts + tw·m·p/2) log p for m-word messages on p
nodes. Assume that only the leaves of the tree
contain nodes, and that an exchange of two m-word
messages between any two nodes connected by
bidirectional channels takes time ts + tw·m·k if
the channel (or a part of it) is shared by k
simultaneous messages.
17
All2All Scatter
18
All2All Scatter
  • Linear Array/Ring
  • send all your data to the right
  • receive a message from the left, extract your
    part and delete it, then forward the rest to the
    right
  • Complexity
  • step i: messages of size m(p-i)
  • total: (ts + tw·m·p/2)(p-1)
  • the average distance traveled by a piece of data
    is p/2
  • therefore the total traffic is m(p-1) · p/2 · p
  • this travels over p edges, so the lower bound on
    the total time is tw·m(p-1)·p/2
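The extract-and-forward rule in a short simulation
(labels like "i->j" stand for the m-word block node i
owes node j; the helper is ours, for illustration):

```python
# All2all scatter (all-to-all personalized communication) on a ring:
# each node pushes everything rightward; at every step it keeps the block
# addressed to itself and forwards the shrinking remainder.

def ring_all_to_all_scatter(p):
    received = [[f"{i}->{i}"] for i in range(p)]
    in_transit = [{j: f"{i}->{j}" for j in range(p) if j != i}
                  for i in range(p)]
    for _ in range(p - 1):
        incoming = [dict(in_transit[(i - 1) % p]) for i in range(p)]
        for i in range(p):
            if i in incoming[i]:                  # extract your own block
                received[i].append(incoming[i].pop(i))
        in_transit = incoming
    return received

print(ring_all_to_all_scatter(4))
# node i collects the four blocks "j->i"; the forwarded packets shrink from
# p-1 blocks to 1, matching the step-i message size m(p-i)
```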

19
All2All Scatter/Gather
  • 2D Mesh
  • each node first groups its messages according to
    the destination column
  • all2all scatter independently within each row
    (messages of size m·√p)
  • sort the combined packet destined for a column
    according to the row destinations within the
    column
  • all2all scatter within each column
  • Complexity
  • all2all scatter among √p nodes with messages of
    size m·√p: (ts + tw·m·p/2)(√p - 1)
  • total: 2(ts + tw·m·p/2)(√p - 1)

20
All2All Scatter/Gather
  • Hypercube
  • Naïve Approach
  • extend the mesh algorithm
  • log p iterations; in iteration i send a message
    of size m·p/2 over the link in dimension i
  • Complexity
  • log p iterations, each of cost ts + tw·m·p/2
  • total: (ts + tw·m·p/2) log p
  • lower bound: total traffic of p · m(p-1) · (log p)/2
    over (p·log p)/2 links results in tw·m(p-1)
  • not bandwidth optimal!

21
All2All Scatter/Gather
  • Hypercube, Bandwidth Optimal
  • let every pair of nodes communicate directly
    with each other!
  • p-1 rounds of communication
  • in round i, node x communicates with node x XOR i
  • if dimensional routing is used, there is no
    congestion
  • Complexity
  • p-1 rounds of cost ts + tw·m each
  • total: (ts + tw·m)(p-1)
  • note that the ts factor is higher here; the
    optimal algorithm depends on the message size m
    and on the ratio between ts and tw
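That each round x ↔ x XOR i really pairs all nodes off
can be checked mechanically (a sanity-check sketch,
not slide code):

```python
# Direct pairwise all2all scatter on a hypercube: in round i (1 <= i < p)
# node x exchanges one m-word block with node x XOR i. Since XOR with a
# fixed nonzero i is a fixed-point-free involution, every round is a
# perfect matching of the p nodes.

def pairwise_schedule(p):
    assert p & (p - 1) == 0, "p must be a power of two"
    for i in range(1, p):
        pairs = sorted((x, x ^ i) for x in range(p) if x < x ^ i)
        assert len(pairs) == p // 2   # every node is in exactly one pair
        yield i, pairs

for i, pairs in pairwise_schedule(8):
    print(f"round {i}: {pairs}")
# p-1 rounds of cost ts + tw·m each -> total (ts + tw·m)(p-1)
```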

22
Improving Complexity
  • Splitting and Routing Messages in Parts
  • all discussed operations, with the exception of
    one2all broadcast, all2one reduction and
    all-reduce, are bandwidth optimal
  • we can reduce the bandwidth complexity of these
    operations by splitting the message and routing
    the parts separately
  • m must be big enough
  • the factor of ts increases and the factor of tw
    decreases; whether that really speeds up
    execution depends on ts, tw and m

23
Improving Complexity
  • One2All Broadcast
  • scatter the message m onto the p nodes
    (fragments of size m/p)
  • all2all broadcast of these fragments; everybody
    then concatenates the fragments to recover the
    original m
  • Complexity (see the sketch below)
  • scatter: ts·log p + tw·(m/p)(p-1)
  • all2all broadcast (hypercube): ts·log p +
    tw·(m/p)(p-1)
  • total (hypercube): 2(ts·log p + tw·(m/p)(p-1))
  • All2One Reduction
  • dual of One2All Broadcast
  • all2all reduction followed by gather
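The scatter-then-allgather composition for One2All
broadcast, sketched end-to-end (a toy model of the two
phases' end states with hypothetical names; the
round-by-round schedules are in the earlier sketches):

```python
# Improved one-to-all broadcast: scatter the m-word message into p fragments
# of size m/p, then all-to-all broadcast (allgather) the fragments so every
# node reassembles the original message.

def improved_broadcast(message, p):
    assert len(message) % p == 0, "assume p divides the message length"
    frag = len(message) // p
    # phase 1 (scatter): node i ends up holding fragment i
    held = {i: message[i * frag:(i + 1) * frag] for i in range(p)}
    # phase 2 (allgather): every node concatenates all p fragments
    return ["".join(held[j] for j in range(p)) for _ in range(p)]

print(improved_broadcast("abcdefgh", 4))   # all 4 nodes recover "abcdefgh"
# hypercube cost: 2(ts·log p + tw·(m/p)(p-1)), better for large m
```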

24
Improving Complexity
  • AllReduce
  • equals All2One reduce followed by One2All
    broadcast
  • All2One reduce = All2All reduce followed by
    gather
  • One2All broadcast = scatter followed by All2All
    broadcast
  • the inner gather and scatter cancel each other
    (i.e. the data is already where we want it to be)
  • so: All2All reduce followed by All2All broadcast
  • Complexity
  • total (hypercube): 2(ts·log p + tw·(m/p)(p-1))
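The resulting all-reduce (reduce-scatter followed by
allgather) in miniature, again modelling phase end
states only (our helper names; assumes p divides the
vector length):

```python
# All-reduce as All2All reduce (reduce-scatter) followed by All2All
# broadcast (allgather), after the inner gather and scatter cancel.
from functools import reduce
import operator

def all_reduce(vectors, op):
    p = len(vectors)
    chunk = len(vectors[0]) // p   # assume p divides the vector length
    # phase 1 (reduce-scatter): node i is left with reduced chunk i
    chunks = [[reduce(op, (v[j] for v in vectors))
               for j in range(i * chunk, (i + 1) * chunk)] for i in range(p)]
    # phase 2 (allgather): every node concatenates the reduced chunks
    full = [x for c in chunks for x in c]
    return [list(full) for _ in range(p)]

print(all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]], operator.add))
# both "nodes" end with [11, 22, 33, 44]; cost 2(ts·log p + tw·(m/p)(p-1))
```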

25
Summary of Techniques
  • Communication Patterns
  • binary tree: one2all broadcast, scatter,
    reduction
  • linear pipelining: all2all operations in
    linear array/ring
  • dimensional: all2all operations in hypercubes,
    mesh/torus
  • all2all direct: all2all scatter in hypercubes
  • Communication Directions
  • forward: broadcast, scatter
  • backward: reduction, gather
  • Data Handling
  • copy/forward: broadcast
  • combine: reductions
  • concatenate: all2all broadcast, gather
  • split: scatter, all2all scatter

26
Examples of Combinations of the Techniques
  • Linear Array/Ring Combinations
  • binary tree + forward + copy = One2All broadcast
  • binary tree + backward + combine = All2One
    reduction
  • linear pipelining + forward + copy = All2All
    broadcast
  • linear pipelining + backward + extract/combine =
    All2All reduction
  • Hypercube Combinations
  • binary tree + forward + copy = One2All broadcast
  • dimensional + forward + concatenate = All2All
    broadcast
  • dimensional + backward + split/combine =
    All2All reduction
  • all2all direct + forward + copy = All2All
    scatter

27
Summary