Title: Pipelined Broadcast on Ethernet Switched Clusters
1Pipelined Broadcast on Ethernet Switched Clusters
- Pitch Patarasuk, Ahmad Faraj, Xin Yuan
- Department of Computer Science
- Florida State University
- Tallahassee, FL 32306
2Broadcast communication (MPI_Bcast)
[Figure: before the broadcast, only n0 holds the message (blocks A, B, C, D); after the broadcast, each of n0, n1, n2, and n3 holds A, B, C, D.]
Let T(msize) = time to send a message of size msize
Broadcast(msize) ≥ T(msize)
3Ethernet Switched Cluster
[Figure: an Ethernet switched cluster; the diagram shows four switches.]
4- Problem statement
- How to efficiently realize the broadcast operation with large message sizes on Ethernet switched clusters.
- Using pipelined broadcast can achieve near-optimal results (T(msize) time for broadcasting a message of size msize).
- Finding a contention-free broadcast tree
- Finding a good segment size
5Traditional Broadcast algorithms
[Figure: first broadcast tree over nodes 0-7.]
Time = (P-1) x T(msize)
[Figure: second broadcast tree over nodes 0-7.]
Time = (P-1) x T(msize)
6
[Figure: broadcast tree over nodes 0-7.]
- Time = 2 x (log2(P+1) - 1) x T(msize)
7
[Figure: broadcast tree over nodes 0, 4, 2, 1, 6, 5, 3, 7.]
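8
[Figure: scatter followed by allgather on n0-n3. Before: n0 holds A, B, C, D. After the scatter: each node holds one of the blocks A, B, C, D. After the allgather: every node holds A, B, C, D.]
Time = 2 x T(msize)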
9Time Complexity for large messages
10Pipelined Broadcast Algorithm
[Figure: pipelined broadcast over nodes 0-3.]
11- Performance of pipelined broadcast
- Assume no network contention
- A message of size msize is broken into X segments of size msize/X.
- H = tree height, D = number of children of a node
- Size of a pipeline stage = D x T(msize/X)
- Total time T = (X + H - 1) x (D x T(msize/X))
- Linear tree: H = P, D = 1, T ≈ T(msize)
- Binary tree: H = log(P), D = 2, T ≈ 2T(msize)
- K-ary tree: H = log_k(P), D = k; in general not as efficient as the binary tree.
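The cost model on this slide can be tried out numerically. The sketch below is only an illustration, reading the total time as (X + H - 1) x (D x T(msize/X)). The point-to-point model T(m) = startup + m / bandwidth, its default constants, and all function names are assumptions, not part of the original slides.

```python
def point_to_point_time(nbytes, startup=1e-4, bandwidth=12.5e6):
    """Hypothetical T(m): fixed startup cost plus m / bandwidth (seconds)."""
    return startup + nbytes / bandwidth


def pipelined_broadcast_time(msize, num_segments, height, children,
                             t=point_to_point_time):
    """Total time = (X + H - 1) * (D * T(msize / X)), assuming no contention."""
    stage = children * t(msize / num_segments)  # duration of one pipeline stage
    return (num_segments + height - 1) * stage


def best_segment_count(msize, height, children, candidates,
                       t=point_to_point_time):
    """Return the candidate X with the smallest modelled broadcast time."""
    return min(candidates,
               key=lambda x: pipelined_broadcast_time(msize, x, height,
                                                      children, t))


if __name__ == "__main__":
    msize = 1 << 20                      # 1 MiB message
    powers_of_two = [2 ** i for i in range(13)]
    x = best_segment_count(msize, height=16, children=1,
                           candidates=powers_of_two)
    total = pipelined_broadcast_time(msize, x, height=16, children=1)
    print("segments:", x, "ratio to T(msize):",
          total / point_to_point_time(msize))
```

Under this assumed model, too few segments lengthen the pipeline fill time and too many segments pay the startup cost too often, so the best X sits in between.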
12Time Complexity for large messages
13Pipelined broadcast
- How to find a contention-free broadcast tree?
- How to select the best segment size?
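14Example of network contention
[Figure: two connected switches; n0, n1, n2, n3 attach to one switch and n4, n5, n6, n7 to the other. A broadcast tree over nodes 0-7 is drawn on this topology.]
- There is link contention caused by the communications (1->4), (2->5), (2->6), and (3->7): all four cross the link between the two switches in the same direction.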
15
[Figure: two connected switches; n0, n1, n4, n5 attach to one switch and n2, n3, n6, n7 to the other.]
The linear tree 0->1->2->3->...->7 will have contention caused by (1->2) and (5->6)
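The contention in this example can be checked mechanically. The sketch below is a minimal illustration and rests on several assumptions: the switches form a tree, links are full duplex (so only communications crossing the same inter-switch link in the same direction contend), and all tree edges of a pipelined broadcast are active at once. The function names and the switch labels "A" and "B" are hypothetical.

```python
from collections import defaultdict, deque


def switch_path(links, src, dst):
    """Directed switch-to-switch hops from src to dst (BFS, connected graph)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            break
        for nxt in links.get(cur, ()):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    hops, node = [], dst
    while parent[node] is not None:
        hops.append((parent[node], node))
        node = parent[node]
    return hops[::-1]


def find_contention(links, attached_to, tree_edges):
    """Map each directed inter-switch link used more than once to its users."""
    users = defaultdict(list)
    for sender, receiver in tree_edges:
        path = switch_path(links, attached_to[sender], attached_to[receiver])
        for hop in path:
            users[hop].append((sender, receiver))
    return {hop: edges for hop, edges in users.items() if len(edges) > 1}


if __name__ == "__main__":
    links = {"A": ["B"], "B": ["A"]}          # two connected switches
    attached_to = {0: "A", 1: "A", 4: "A", 5: "A",
                   2: "B", 3: "B", 6: "B", 7: "B"}
    naive = [(i, i + 1) for i in range(7)]    # 0->1->2->...->7
    print(find_contention(links, attached_to, naive))
    reordered = [(0, 1), (1, 4), (4, 5), (5, 2), (2, 3), (3, 6), (6, 7)]
    print(find_contention(links, attached_to, reordered))
```

For the two-switch topology above, the checker reports (1->2) and (5->6) sharing the same directed link, while the chain that visits all machines of one switch before moving to the other reports none.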
16Algorithm for constructing contention free linear tree
- Step 1: Traverse all switches using the depth-first search (DFS) algorithm, and name the switches S0, S1, S2, ... in the order in which the DFS first reaches them.
- Step 2: The linear tree consists of all machines attached to switch S0, followed by all machines attached to S1, then S2, and so on.
17Example of contention free linear tree
[Figure: four switches S0-S3. Machines by switch: S0: n0, n1, n4, n5; S1: n2, n3, n6, n7; S2: n8, n9, n10, n11; S3: n12, n13, n14, n15.]
Linear tree: n0->n1->n4->n5->n2->n3->n6->n7->n8->n9->...->n15
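The two steps of the linear tree algorithm can be sketched as follows. This is an illustrative sketch, not the authors' routine generator. The switch identifiers and the inter-switch links in the demo are assumptions, because the slide does not state how the four switches are connected; only the machine lists come from the example.

```python
def contention_free_linear_tree(links, machines, start):
    """Step 1: number switches in DFS order. Step 2: chain machines switch by switch."""
    order, seen = [], set()

    def dfs(switch):
        seen.add(switch)
        order.append(switch)            # order[i] plays the role of switch S_i
        for nxt in links.get(switch, ()):
            if nxt not in seen:
                dfs(nxt)

    dfs(start)
    chain = []
    for switch in order:
        chain.extend(machines.get(switch, ()))
    return order, chain


if __name__ == "__main__":
    # Hypothetical wiring: sw_a - sw_b, with sw_c and sw_d both attached to sw_b.
    links = {"sw_a": ["sw_b"], "sw_b": ["sw_a", "sw_c", "sw_d"],
             "sw_c": ["sw_b"], "sw_d": ["sw_b"]}
    machines = {"sw_a": ["n0", "n1", "n4", "n5"],
                "sw_b": ["n2", "n3", "n6", "n7"],
                "sw_c": ["n8", "n9", "n10", "n11"],
                "sw_d": ["n12", "n13", "n14", "n15"]}
    order, chain = contention_free_linear_tree(links, machines, "sw_a")
    print(order)
    print("->".join(chain))
```

Starting the DFS at the switch that holds the broadcast root puts the root first in the chain.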
18Algorithm for constructing contention free binary tree
- Start with a contention free linear tree
- Recursively divide the tree into 2 sub-trees
- Make sure that there cannot be contention
- The sub-trees are chosen such that the height of the whole tree will be minimal
[Figure: nodes 0-15 of the linear tree divided into a binary tree.]
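The recursive division step can be illustrated with a small sketch. It is deliberately simplified: it takes the first node of a chain as the root and splits the remaining nodes into two contiguous halves, which is an assumption about how the sub-trees are formed. The contention check required by the algorithm is omitted, so a real implementation may have to pick different split points. The function names are hypothetical.

```python
def divide(chain):
    """Return (root, left_subtree, right_subtree) for a chain of nodes, or None."""
    if not chain:
        return None
    root, rest = chain[0], chain[1:]
    mid = (len(rest) + 1) // 2          # balanced split keeps the height small
    return (root, divide(rest[:mid]), divide(rest[mid:]))


def height(tree):
    """Number of edges on the longest root-to-leaf path (-1 for an empty tree)."""
    if tree is None:
        return -1
    return 1 + max(height(tree[1]), height(tree[2]))


if __name__ == "__main__":
    tree = divide(list(range(16)))      # nodes 0-15 in linear tree order
    print("root:", tree[0], "height:", height(tree))
```

When a split point has to move to avoid contention, the two halves become unbalanced and the height grows accordingly.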
19Binary tree height
- Performance of binary pipeline broadcast depends on the height of the binary tree
- Even though a contention free binary tree may not be a complete binary tree, its height is not much more than that of a complete binary tree
20Average tree heights for 20 randomly generated
topologies
21Evaluation
- Contention free pipelined algorithms
- Routine generators from topology information
- The generated routines are based on MPICH p2p primitives.
- Linear tree
- Binary tree
- 3-ary tree
- Targets for comparison
- MPICH: Binomial tree, Scatter/allgather
- LAM: Flat-tree, Binomial
- Topology unaware pipelined linear and binary algorithms
22Evaluation
23Performance of different pipelined trees
(topology 1)
24Comparing pipelined broadcast with other schemes
25Topology unaware and contention-free pipelined
broadcast
26Segment size for pipelined broadcast
27Conclusions
- Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages
- Linear pipeline has a completion time roughly equal to T(msize)
- Binary pipeline broadcast is best for medium messages
- A contention free broadcast tree is necessary for pipelined algorithms
- A good segment size for pipelined broadcast is not difficult to find.
28Questions?