Title: Exercise 1
Exercise 1
- Row2All broadcast
- Describe how to implement the communication procedure
Row2All(i) in a c x d torus
  - each node of row i contains a message
  - at the end, each node of the torus should have received
    all messages from row i
- Derive the time complexity of your approach(es)
- Can you prove that your solution is bandwidth
optimal?
- What happens if d (or c) is very large with
respect to the other?
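One possible two-phase approach can be simulated as a sanity check: an all-to-all broadcast along the ring formed by row i, followed by a broadcast of the collected messages along each column. The sketch below is only an illustration, not a reference solution. It assumes that c is the number of rows and d the number of columns, and the function name `row2all` and the message labels `m0`, `m1`, ... are hypothetical.

```python
def row2all(c, d, i):
    """Simulate Row2All(i) on a torus with c rows and d columns (assumed layout).

    Returns a c x d grid where each entry is the set of message labels
    held by that node at the end.
    """
    # Only the nodes of row i start with a message.
    held = [[set() for _ in range(d)] for _ in range(c)]
    for k in range(d):
        held[i][k] = {f"m{k}"}

    # Phase 1: all-to-all broadcast on the ring formed by row i (d - 1 steps).
    # In every step each node forwards the message it received last
    # to its right neighbour.
    in_transit = [f"m{k}" for k in range(d)]
    for _ in range(d - 1):
        in_transit = [in_transit[(k - 1) % d] for k in range(d)]
        for k in range(d):
            held[i][k].add(in_transit[k])

    # Phase 2: each node of row i passes its full set along its column.
    for k in range(d):
        for step in range(1, c):
            held[(i + step) % c][k] = set(held[i][k])
    return held


grid = row2all(c=3, d=4, i=1)
print(all(node == {"m0", "m1", "m2", "m3"} for row in grid for node in row))
```

The simulation checks only that every node ends up with all d messages. It does not model time, so the complexity and optimality questions remain to be worked out by hand.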
Exercise 2
All2All broadcast on a tree
Given a balanced binary tree, describe a procedure to perform
all2all broadcast that takes time (ts + tw m p / 2) log p
for m-word messages on p nodes. Assume that
only the leaves of the tree contain nodes, and
that an exchange of two m-word messages between any two
nodes connected by bidirectional channels takes
time ts + tw m k if the channel (or a part of it) is
shared by k simultaneous messages.
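To check a candidate schedule against the target bound, the cost model can be evaluated numerically. The sketch below assumes a hypercube-like schedule that starts with the exchanges crossing the root and moves down one tree level per step. It also assumes that an exchange of s-word messages over a channel shared by k exchanges costs ts + tw * s * k, and that p is a power of two. The function name `tree_all2all_time` is hypothetical.

```python
import math


def tree_all2all_time(p, m, ts, tw):
    """Total time of a root-first exchange schedule on a tree with p leaves.

    Assumes p is a power of two and a per-step cost of ts + tw * size * k,
    where k exchanges share the most heavily loaded channel.
    """
    steps = int(math.log2(p))
    total = 0.0
    for j in range(1, steps + 1):
        size = m * 2 ** (j - 1)  # words each leaf has collected before step j
        k = p // 2 ** j          # exchanges sharing one channel in step j
        total += ts + tw * size * k
    return total


p, m, ts, tw = 8, 4, 100, 1
print(tree_all2all_time(p, m, ts, tw))
print((ts + tw * m * p / 2) * math.log2(p))
```

Under these assumptions the message size doubles while the channel sharing halves, so every step costs ts + tw m p / 2.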
Exercise 3
- All-reduce operation in ring
- Consider the all-reduce operation in which each
processor starts with an array of m words, and
needs to get the sum of the respective
words in the array at each processor. This
operation can be implemented on an n x n torus
using one of the following three alternatives:
  - all2all broadcast of all the arrays, followed by
    a local computation of the sum of the respective
    elements of the array
  - single node accumulation of all the arrays,
    followed by one2all broadcast of the result array
  - an algorithm that uses the pattern of all2all
    broadcast, but simply adds numbers rather than
    concatenating messages
- For each of the above cases, compute the run
time in terms of m, ts and tw.
- Assume that ts = 100, tw = 1 and m is very large.
Which of the three alternatives is best?
- Assume that ts = 100, tw = 1 and m is very small
(say 1). Which of the three alternatives is
best?
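The third alternative can be illustrated with a small simulation. The sketch below assumes a ring of p processors, as in the exercise title, and uses the communication pattern of a ring all2all broadcast: in every step each processor forwards the array it received last and adds it to a running sum. The function name `ring_allreduce` is hypothetical, and the code models only the data movement, not the run time.

```python
def ring_allreduce(arrays):
    """Simulate all-reduce on a ring using the all2all broadcast pattern.

    arrays[k] is the m-word array initially held by processor k.
    Returns the list of arrays held by the processors at the end.
    """
    p = len(arrays)
    result = [list(a) for a in arrays]      # running element-wise sums
    in_transit = [list(a) for a in arrays]  # array each processor sends next
    for _ in range(p - 1):
        # Each processor receives from its left neighbour ...
        in_transit = [in_transit[(k - 1) % p] for k in range(p)]
        # ... and adds the received array instead of concatenating it.
        for k in range(p):
            result[k] = [x + y for x, y in zip(result[k], in_transit[k])]
    return result


print(ring_allreduce([[1, 2], [3, 4], [5, 6]]))
```

Every message in this pattern stays m words long, which is the property to exploit when comparing its run time with the other two alternatives.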
Exercise 4
How to do prefix sums with p processors
Describe an algorithm for computing prefix sums of an
n-element array distributed among p
processors. Evaluate speedup and efficiency of
your solution. What is the isoefficiency function
of your solution?
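One common scheme, given here only as a possible starting point, works in three phases: each processor sums its own block, the block totals are combined by a prefix sum across processors, and each processor then computes its local prefix sums shifted by the total of all preceding blocks. The sketch below simulates this sequentially. The block distribution and the function name `parallel_prefix_sums` are assumptions.

```python
from itertools import accumulate


def parallel_prefix_sums(a, p):
    """Sequential simulation of a three-phase blocked prefix-sum scheme."""
    n = len(a)
    size = -(-n // p)  # block size, ceil(n / p)
    blocks = [a[k * size:(k + 1) * size] for k in range(p)]

    # Phase 1: every processor sums its own block (independent work).
    totals = [sum(b) for b in blocks]

    # Phase 2: exclusive prefix sum over the p block totals
    # (this is the step that needs communication).
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: local prefix sums, shifted by the offset of the block.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(offset + s for s in accumulate(block))
    return out


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(parallel_prefix_sums(data, p=3) == list(accumulate(data)))
```

Phases 1 and 3 take time proportional to n/p per processor, while phase 2 depends on the interconnect, which is where the speedup and isoefficiency analysis comes in.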
Exercise 5
- Simplified bucket sort
- Consider a simplified version of bucket-sort. You
are given an array A of n random integers in the
range 1..r as input. The output data consists
of r buckets, such that at the end of the
algorithm, bucket i contains the indices of all
elements of A that are equal to i.
  - describe a decomposition based on partitioning
    the input data (array A) and how it would work
  - describe a decomposition based on partitioning
    the output data and how the resulting
    algorithm would work
  - evaluate speedup and efficiency of these
    approaches
  - derive the isoefficiency function for each
    approach
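The two decompositions can be contrasted with a sequential simulation in which the loop over `w` stands in for p workers. This is a sketch under assumptions, not a reference solution: the block distribution of A and of the value range 1..r, and the function names `buckets_by_input` and `buckets_by_output`, are hypothetical.

```python
def buckets_by_input(a, r, p):
    """Input partitioning: each worker buckets its own slice of A, then merges."""
    size = -(-len(a) // p)  # slice length, ceil(n / p)
    buckets = {v: [] for v in range(1, r + 1)}
    for w in range(p):
        local = {}
        for idx in range(w * size, min((w + 1) * size, len(a))):
            local.setdefault(a[idx], []).append(idx)
        # Merge step: local buckets are appended to the shared buckets.
        for v, idxs in local.items():
            buckets[v].extend(idxs)
    return buckets


def buckets_by_output(a, r, p):
    """Output partitioning: each worker owns a range of buckets and scans all of A."""
    size = -(-r // p)  # number of buckets per worker, ceil(r / p)
    buckets = {}
    for w in range(p):
        lo, hi = w * size + 1, min((w + 1) * size, r)
        for v in range(lo, hi + 1):
            buckets[v] = []
        for idx, v in enumerate(a):
            if lo <= v <= hi:
                buckets[v].append(idx)
    return buckets


A = [2, 1, 3, 2, 1]
print(buckets_by_input(A, r=3, p=2))
print(buckets_by_output(A, r=3, p=2))
```

With input partitioning each worker touches about n/p elements but the buckets must be merged. With output partitioning no merge is needed but every worker reads the whole array.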