Title: Scaling Collective Multicast Fat-tree Networks
1. Scaling Collective Multicast Fat-tree Networks
- Sameer Kumar
- Parallel Programming Laboratory
- University of Illinois at Urbana-Champaign
- ICPADS 2004
2. Collective Communication
- Communication operation in which all processors, or a large subset, participate; for example, broadcast
- Performance impediment: all-to-all communication
- All-to-all personalized communication (AAPC)
- All-to-all multicast (AAM)
3. Communication Model
- Overhead of a point-to-point message:
- T_p2p = α + m·β
- α is the total software overhead of sending the message
- β is the per-byte network overhead
- m is the size of the message
- Direct all-to-all overhead:
- T_AAM = (P - 1)(α + m·β)
- α dominates when m is small
- β dominates when m is large
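As a minimal illustration of this cost model (the α and β values below are assumptions for the example, not measurements from the talk):

```python
# Minimal sketch of the cost model above; the alpha/beta values are
# illustrative assumptions, not numbers from the slides.

def t_p2p(alpha, beta, m):
    """Point-to-point cost: T_p2p = alpha + m * beta."""
    return alpha + m * beta

def t_aam_direct(alpha, beta, m, P):
    """Direct all-to-all multicast: every processor sends its m-byte
    message to each of the other P - 1 processors."""
    return (P - 1) * t_p2p(alpha, beta, m)

alpha = 9e-6        # software overhead per message: 9 us (assumed)
beta = 1.0 / 320e6  # per-byte cost of a 320 MB/s link (assumed)

print(t_aam_direct(alpha, beta, m=8, P=64))        # small m: alpha dominates
print(t_aam_direct(alpha, beta, m=1 << 20, P=64))  # large m: beta dominates
```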
4. Optimization Strategies
- Short messages
- Parameter α dominates
- Message combining
- Reduce the total number of messages
- Multistage algorithms send messages along a virtual topology
- Large messages
- Parameter β dominates
- Network contention
- Topology-specific optimizations that minimize network contention
5. Direct Strategies
- Direct strategies optimize all-to-all multicast for large messages
- Minimize network contention
- Topology-specific optimizations that take advantage of contention-free schedules
6. Fat-tree Networks
- Popular network topology for clusters
- Bisection bandwidth O(P)
- Network scales to several thousands of nodes
- Topology: k-ary n-tree
7. k-ary n-trees
[Figure: a 4-ary 3-tree]
8. Contention-Free Permutations
- Fat-trees have a nice property: some processor permutations are contention free
- Prefix permutation k
- Processor i sends data to processor i XOR k
- Cyclic shift by k
- Processor i sends a message to processor (i + k) mod P
- Contention free for certain values of k
- Contention-free permutations were presented in Heller et al. for the CM-5
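A small sketch of the two permutation families, using P = 8 so the destinations match the node labels on the following slides:

```python
# Destinations under the two contention-free permutation families above,
# shown for P = 8 (node labels 0..7, as in the next slides).

P = 8

def prefix_permutation(k):
    """Prefix permutation k: processor i sends to processor i XOR k."""
    return [i ^ k for i in range(P)]

def cyclic_shift(k):
    """Cyclic shift by k: processor i sends to processor (i + k) mod P."""
    return [(i + k) % P for i in range(P)]

print(prefix_permutation(1))  # [1, 0, 3, 2, 5, 4, 7, 6]
print(cyclic_shift(2))        # [2, 3, 4, 5, 6, 7, 0, 1]
```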
9. Prefix Permutation 1
[Figure: prefix permutation by 1; processor p sends to p XOR 1]
10. Prefix Permutation 2
[Figure: prefix permutation by 2 on nodes 0-7; processor p sends to p XOR 2]
11. Prefix Permutation 3
[Figure: prefix permutation by 3 on nodes 0-7; processor p sends to p XOR 3]
12. Prefix Permutation 4
[Figure: prefix permutation by 4 on nodes 0-7; processor p sends to p XOR 4]
13. Cyclic Shift by k
[Figure: cyclic shift by 2 on nodes 0-7]
14. Quadrics HPC Interconnect
- Popular interconnect; several machines in the Top500 use Quadrics
- Used by Pittsburgh's Lemieux (6 TF) and ASCI-Q (20 TF)
- Features:
- Low latency (5 µs for MPI)
- High bandwidth (320 MB/s/node)
- Fat-tree topology
- Scales to 2K nodes
15. Effect of Contention on Throughput
[Plot: node bandwidth for the k-th permutation (MB/s); sending data from main memory is much slower]
16. Performance Bottlenecks
- 320-byte packet size
- Packet protocol restricts bandwidth to far-away nodes
- PCI/DMA bandwidth is restrictive
- Achievable bandwidth is only 128 MB/s
17. Quadrics Packet Protocol
[Diagram: sender transmits the first packet, receives the ack header, then the ack; for nearby nodes the link reaches full utilization]
18. Far-Away Messages
[Diagram: the same exchange between a distant sender and receiver; the longer wait for each ack leaves the link underutilized]
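A back-of-the-envelope model of this effect. The one-outstanding-packet-per-ack behavior is my assumption for illustration; the slides state only that the packet protocol restricts bandwidth to far-away nodes:

```python
# Toy model (assumed stop-and-wait: one outstanding 320-byte packet per
# ack) of why far-away nodes see lower bandwidth than nearby ones.

PACKET_BYTES = 320  # Quadrics packet size, from the slides

def effective_bandwidth_mb_s(link_mb_s, ack_rtt_us):
    """Throughput when the sender waits for each packet's ack."""
    serialize_us = PACKET_BYTES / link_mb_s   # MB/s == bytes per us
    return PACKET_BYTES / (serialize_us + ack_rtt_us)

print(effective_bandwidth_mb_s(320, 0.5))  # nearby node: ~213 MB/s
print(effective_bandwidth_mb_s(320, 4.0))  # far-away node: ~64 MB/s
```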
19. AAM on Fat-tree Networks
- Overcoming the bottlenecks:
- Messages sent from NIC memory have 2.5 times better performance
- Avoid sending messages to far-away nodes
- Use contention-free permutations
- Permutation: every processor sends a message to a different destination
20. AAM Strategy: Ring
- Performs all-to-all multicast by sending messages along a ring formed by the processors (sketched below)
- Equivalent to P-1 cyclic-shift-by-1 operations
- Congestion free
- Has appeared in the literature before
- Drawback: processors send different messages in each step
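A minimal sketch of the ring schedule, as I read the description above:

```python
# Ring strategy sketch: P-1 steps, each a cyclic shift by 1. The link
# pattern is identical in every step; what changes is the block being
# forwarded, which is the drawback noted above.

def ring_steps(P):
    for step in range(1, P):
        # In step s, processor i forwards the block that originated at
        # processor (i - s + 1) mod P to its neighbor (i + 1) mod P.
        yield [(i, (i + 1) % P, (i - step + 1) % P) for i in range(P)]

for s, sends in enumerate(ring_steps(4), start=1):
    print(f"step {s}: (src, dst, block) = {sends}")
```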
21. Prefix-Send Strategy
- P-1 prefix permutations: in stage j, processor i sends a message to processor i XOR (j + 1) (sketched below)
- Congestion free
- Can send messages from Elan memory
- Bad performance on large fat-trees:
- Sends P/2 messages to far-away nodes at distance P/2 or more
- Wire/switch delays restrict performance
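A sketch of the prefix-send schedule described above:

```python
# Prefix-send sketch: P-1 stages; in stage j every processor i sends its
# own block to i XOR (j + 1). Each stage is a contention-free prefix
# permutation, and the send buffer never changes, so it can stay in
# Elan (NIC) memory.

def prefix_send_stages(P):
    # P is assumed to be a power of two, so i XOR (j + 1) for
    # j = 0 .. P-2 enumerates every destination except i itself.
    for j in range(P - 1):
        yield [(i, i ^ (j + 1)) for i in range(P)]

for j, sends in enumerate(prefix_send_stages(8)):
    print(f"stage {j}: {sends[:4]} ...")
```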
22. k-Prefix Strategy
- A hybrid of the ring and prefix-send strategies (sketched below)
- Prefix-send is used within partitions of size k
- Ring is used between the partitions
- Our contribution!
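A rough sketch of how I read this hybrid; the paper's exact schedule and interleaving of stages may differ:

```python
# Hedged sketch of the k-Prefix communication pattern: prefix-send (XOR)
# covers the k-1 peers inside a processor's partition, and a ring of
# cyclic shifts by k moves data between partitions.

def k_prefix_pattern(i, P, k):
    base = (i // k) * k  # first processor in i's partition
    # prefix-send inside the partition: k-1 distinct destinations
    within = [base + ((i - base) ^ x) for x in range(1, k)]
    # ring between partitions: P/k - 1 steps, always to the peer one
    # partition over, forwarding a different block each step
    ring_peer = (i + k) % P
    ring_step_count = P // k - 1
    return within, ring_peer, ring_step_count

print(k_prefix_pattern(5, P=16, k=4))  # ([4, 7, 6], 9, 3)
```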
23. Performance
[Plot: node bandwidth (MB/s) each way; our strategies send messages from Elan memory]
24. Cost Equation
- α: host and network software overhead
- α_b: cost of a barrier (barriers are needed to synchronize the nodes)
- β_em: per-byte network transmission cost
- δ: copying overhead to NIC memory
- P: number of processors
- k: size of the partition in k-Prefix
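The cost equation itself did not survive in this copy of the slides, so the following is only an illustrative stand-in showing one way these parameters might combine; consult the paper for the actual formula:

```python
# Purely hypothetical combination of the parameters listed above; the
# slide's real cost equation is not recoverable from this copy.

def t_k_prefix_sketch(alpha, alpha_b, beta_em, delta, P, k, m):
    per_message = alpha + m * beta_em  # send overhead + wire time
    copy = m * delta                   # one-time copy into NIC memory
    barriers = (P // k) * alpha_b      # hypothetical: one barrier per ring step
    return copy + (P - 1) * per_message + barriers

# Parameter values from the "Predicted vs. Actual" slide: alpha = 9 us,
# alpha_b = 15 us, beta_em and delta equivalent to 294 MB/s.
rate = 294e6  # bytes per second
print(t_k_prefix_sketch(9e-6, 15e-6, 1/rate, 1/rate, P=128, k=8, m=65536))
```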
25. k-Prefixlb Strategy
- The k-Prefixlb strategy synchronizes nodes after a few steps
26. CPU Overhead
- Strategies should also be evaluated on compute overhead
- Asynchronous, non-blocking primitives are needed
- A data-driven system like Charm++ supports this automatically
27. Predicted vs. Actual Performance
[Plot: the predicted curve assumes α = 9 µs, α_b = 15 µs, and β, δ equivalent to 294 MB/s]
28. Missing Nodes
- Missing nodes arise when nodes in the fat tree are down
- Prefix-send and k-Prefix do badly in this scenario
29. k-Shift Strategy
- Processor i sends data to the consecutive nodes i-k/2+1, ..., i-1, i+1, ..., i+k/2, and to i+k (sketched below)
- Contention free, with good performance on non-contiguous nodes when k = 8
- Our contribution!
- k-Shift gains because most of each node's destinations do not change in the presence of missing nodes
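A small sketch of the k-shift destination set described above:

```python
# k-shift destinations for processor i: the consecutive nodes
# i-k/2+1 .. i-1 and i+1 .. i+k/2, plus i+k (all modulo P).

def k_shift_destinations(i, P, k):
    near = [(i + d) % P for d in range(-(k // 2) + 1, k // 2 + 1) if d != 0]
    return near + [(i + k) % P]

# With k = 8 most destinations are near neighbors, so when a node in the
# schedule is missing, most remaining destinations are unchanged.
print(k_shift_destinations(0, P=64, k=8))  # [61, 62, 63, 1, 2, 3, 4, 8]
```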
30. Conclusion
- We optimize AAM for Quadrics QsNet
- Copying a message to the NIC and sending it from there achieves higher bandwidth
- k-Prefix avoids sending messages to far-away nodes
- Missing nodes are handled through the k-Shift strategy
- Cluster interconnects other than Quadrics also have such problems
- Impressive performance results
- CPU overhead should be a metric for evaluating AAM strategies
31. Future Work