1
Scaling Collective Multicast Fat-tree Networks
  • Sameer Kumar
  • Parallel Programming Laboratory
  • University Of Illinois at Urbana Champaign
  • ICPADS 2004

2
Collective Communication
  • Communication operation in which all or a large
    subset of the processors participate
  • For example, broadcast
  • Performance impediment
  • All-to-all communication
  • All-to-all personalized communication (AAPC)
  • All-to-all multicast (AAM)

3
Communication Model
  • Overhead of a point-to-point message is
  • T_p2p = α + mβ
  • α is the total software overhead of sending the
    message
  • β is the per-byte network overhead
  • m is the size of the message
  • Direct all-to-all overhead
  • T_AAM = (P - 1)(α + mβ)
  • α dominates when m is small
  • β dominates when m is large
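The cost model above can be sketched in a few lines. This is a minimal illustration of the α/β model; the default parameter values are ours, not measurements from the talk.

```python
# Sketch of the slide's alpha-beta cost model. Default alpha and beta
# are illustrative placeholders, not measured Quadrics values.

def t_p2p(m, alpha=9e-6, beta=1 / 320e6):
    """Time for one point-to-point message of m bytes: alpha + m*beta."""
    return alpha + m * beta

def t_aam_direct(m, p, alpha=9e-6, beta=1 / 320e6):
    """Direct all-to-all multicast: each node sends P-1 messages."""
    return (p - 1) * (alpha + m * beta)

# alpha dominates for small m, beta for large m:
small = t_aam_direct(8, 64)        # dominated by (P-1)*alpha
large = t_aam_direct(1 << 20, 64)  # dominated by (P-1)*m*beta
```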

4
Optimization Strategies
  • Short messages
  • Parameter α dominates
  • Message combining
  • Reduce the total number of messages
  • Multistage algorithm to send messages along a
    virtual topology
  • Large messages
  • Parameter β dominates
  • Network contention
  • Network topology specific optimizations that
    minimize network contention

5
Direct Strategies
  • Direct strategies optimize all to all multicast
    for large messages
  • Minimize network contention
  • Topology specific optimizations that take
    advantage of contention free schedules

6
Fat-tree Networks
  • Popular network topology for clusters
  • Bisection bandwidth O(P)
  • Network scales to several thousands of nodes
  • Topology: k-ary n-tree

7
k-ary n-trees
(Figure: a 4-ary 3-tree)
8
Contention Free Permutations
  • Fat-trees have a nice property: some processor
    permutations are contention free
  • Prefix permutation k
  • Processor i sends data to i XOR k
  • Cyclic shift by k
  • Processor i sends a message to (i + k) mod P
  • Contention free for certain values of k
  • Contention free permutations presented in Heller
    et al. for the CM-5
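The two permutation families above can be written down directly; the XOR and modulo forms follow the later slides, and the function names are ours.

```python
# Sketch of the two permutation families from this slide.

def prefix_permutation(i, k):
    """Prefix permutation k: processor i sends to i XOR k."""
    return i ^ k

def cyclic_shift(i, k, p):
    """Cyclic shift by k: processor i sends to (i + k) mod P."""
    return (i + k) % p

# Each maps the P processors onto distinct destinations,
# so both are true permutations:
p = 8
assert sorted(prefix_permutation(i, 3) for i in range(p)) == list(range(p))
assert sorted(cyclic_shift(i, 2, p) for i in range(p)) == list(range(p))
```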

9
Prefix Permutation 1
Prefix permutation by 1: processor p sends to p XOR 1
10
Prefix Permutation 2
Prefix permutation by 2: processor p sends to p XOR 2 (shown for nodes 0-7)
11
Prefix Permutation 3
Prefix permutation by 3: processor p sends to p XOR 3 (shown for nodes 0-7)
12
Prefix Permutation 4
Prefix permutation by 4: processor p sends to p XOR 4 (shown for nodes 0-7)
13
Cyclic Shift by k
Cyclic shift by 2: processor p sends to (p + 2) mod 8 (shown for nodes 0-7)
14
Quadrics HPC Interconnect
  • Popular interconnect
  • Several machines in the Top500 use Quadrics
  • Used by Pittsburgh's Lemieux (6 TF) and ASCI-Q
    (20 TF)
  • Features
  • Low latency (5 µs for MPI)
  • High bandwidth (320 MB/s/node)
  • Fat-tree topology
  • Scales to 2K nodes

15
Effect of Contention on Throughput
(Plot: node bandwidth (MB/s) vs. k-th permutation)
Sending data from main memory is much slower
16
Performance Bottlenecks
  • 320 byte packet size
  • Packet protocol restricts bandwidth to faraway
    nodes
  • PCI/DMA bandwidth is restrictive
  • Achievable bandwidth is only 128MB/s

17
Quadrics Packet Protocol
(Diagram: the sender transmits the first packet, the receiver returns an ack header; with nearby nodes the link achieves full utilization)
18
Far Away Messages
(Diagram: the same exchange with a far-away node; the sender waits longer for each ack before sending the next packet)
19
AAM on Fat-tree Networks
  • Overcome bottlenecks
  • Messages sent from NIC memory achieve 2.5 times
    better performance
  • Avoid sending messages to far-away nodes
  • Use contention free permutations
  • Permutation: every processor sends a message to a
    different destination

20
AAM Strategy Ring
  • Performs all to all multicast by sending messages
    along a ring formed by the processors
  • Equivalent to P-1 cyclic-shift-by-1 operations
  • Congestion free
  • Has appeared in literature before
  • Drawback
  • Processors send different messages in each step

21
Prefix Send Strategy
  • P-1 prefix permutations
  • In stage j, processor i sends a message to
    processor i XOR (j + 1)
  • Congestion free
  • Can send messages from Elan memory
  • Bad performance on large fat-trees
  • Sends P/2 messages to far-away nodes at distance
    P/2 or more away
  • Wire/Switch delays restrict performance
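The drawback above can be made concrete with a small sketch of the schedule. Here stage j simply uses the XOR mask j (indexing is ours), and a destination in the other half of the tree counts as "far":

```python
# Sketch of the prefix-send schedule: P-1 stages, stage j pairing
# processor i with i XOR j. Destinations whose top bit differs lie in
# the other half of the fat-tree, i.e. at distance P/2 or more.

def prefix_send_schedule(p):
    """Yield (stage, sender, receiver) for the P-1 prefix permutations."""
    for j in range(1, p):
        for i in range(p):
            yield j, i, i ^ j

p = 16
far_per_node = sum(1 for j, i, dst in prefix_send_schedule(p)
                   if i == 0 and dst >= p // 2)
# Processor 0 sends P/2 of its P-1 messages to the far half of the tree,
# matching the P/2 far-away messages noted on the slide.
```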

22
K-Prefix Strategy
  • Hybrid of ring strategy and prefix send
  • Prefix send used in partitions of size k
  • Ring used between the partitions
  • Our contribution!
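The hybrid can be sketched as a destination set: XOR partners inside a partition of k nodes, everything else reached via the inter-partition ring. This is our reading of the slide, not the paper's code, and the ring scheduling itself is not modeled.

```python
# Rough sketch of which nodes a processor reaches under k-Prefix.
# Partitions are assumed to be contiguous, aligned blocks of k nodes.

def k_prefix_destinations(i, p, k):
    """All P-1 destinations of processor i."""
    base = (i // k) * k
    local = [base + ((i % k) ^ j) for j in range(1, k)]   # prefix-send
    remote = [j for j in range(p) if j // k != i // k]    # via the ring
    return local + remote

# Each processor covers every other processor exactly once:
assert sorted(k_prefix_destinations(2, 8, 4)) == [0, 1, 3, 4, 5, 6, 7]
```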

23
Performance
(Plot: node bandwidth (MB/s) each way, by strategy)
Our strategies send messages from Elan memory
24
Cost Equation
  • α: host and network software overhead
  • α_b: cost of a barrier (barriers needed to
    synchronize the nodes)
  • β_em: per-byte network transmission cost
  • δ: copying overhead to NIC memory
  • P: number of processors
  • k: size of the partition in k-Prefix
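The transcript lists the parameters but omits the equation itself. Purely as an illustration of how such terms might compose, and emphatically our guess rather than the paper's formula: copy the message to NIC memory once, send P-1 contention-free messages, and pay one barrier per partition-sized phase.

```python
# Hypothetical composition of the slide's cost parameters (our guess).

def t_k_prefix_estimate(m, p, k, alpha, alpha_b, beta_em, delta):
    copy = m * delta                         # copy into NIC (Elan) memory
    sends = (p - 1) * (alpha + m * beta_em)  # P-1 contention-free sends
    barriers = (p // k) * alpha_b            # inter-phase synchronization
    return copy + sends + barriers
```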

25
K-Prefixlb Strategy
k-Prefixlb strategy synchronizes nodes after a
few steps
26
CPU Overhead
  • Strategies should also be evaluated on compute
    overhead
  • Asynchronous, non-blocking primitives are needed
  • A data-driven system like Charm++ supports this
    automatically

27
Predicted vs Actual Performance
Predicted plot assumes α = 9 µs, α_b = 15 µs, and β and δ corresponding to 294 MB/s
28
Missing Nodes
  • Nodes can be missing from the allocation when
    nodes in the fat tree are down
  • Prefix-send and k-Prefix perform badly in this
    scenario


29
K-Shift Strategy
  • Processor i sends data to the consecutive nodes
  • i - k/2 + 1, …, i - 1, i + 1, …, i + k/2, and to i + k
  • Contention free and good performance with
    non-contiguous nodes when k = 8
  • Our contribution

K-shift gains because most of the destinations
for each node do not change in the presence of
missing nodes
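The k-shift destination set above can be sketched directly (arithmetic mod P; the function name is ours):

```python
# Sketch of the k-shift destination set: the consecutive neighbors
# i-k/2+1 .. i+k/2 (excluding i itself), plus i+k, all mod P.

def k_shift_destinations(i, p, k=8):
    near = [(i + d) % p for d in range(-k // 2 + 1, k // 2 + 1) if d != 0]
    return near + [(i + k) % p]

# For k = 8 each node has 8 destinations; when a neighbor is missing,
# only a small part of this set changes, which is why k-shift tolerates
# gaps in the allocation.
```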
30
Conclusion
  • We optimize AAM for Quadrics QsNet
  • Copying a message to the NIC and sending it from
    there achieves higher bandwidth
  • k-Prefix avoids sending messages to far-away
    nodes
  • Missing nodes are handled by the k-shift strategy
  • Cluster interconnects other than Quadrics also
    have such problems
  • Impressive performance results
  • CPU overhead should be a metric for evaluating
    AAM strategies

31
Future Work