1
Scaling Collective Multicast Fat-tree Networks
  • Sameer Kumar
  • Parallel Programming Laboratory
  • University Of Illinois at Urbana Champaign
  • ICPADS 2004

2
Collective Communication
  • Communication operation in which all or a large
    subset of the processors participate
  • For example, broadcast
  • Performance impediment
  • All-to-all communication
  • All-to-all personalized communication (AAPC)
  • All-to-all multicast (AAM)

3
Communication Model
  • Overhead of a point-to-point message is
  • T_p2p = α + mβ
  • α is the total software overhead of sending the
    message
  • β is the per-byte network overhead
  • m is the size of the message
  • Direct all-to-all overhead
  • T_AAM = (P - 1)(α + mβ)
  • α dominates when m is small
  • β dominates when m is large
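The cost model above can be sketched in a few lines. This is a minimal illustration of the α/β model; the default parameter values are ours, not measurements from the talk.

```python
# Sketch of the slide's alpha-beta cost model. Default alpha and beta
# are illustrative placeholders, not measured Quadrics values.

def t_p2p(m, alpha=9e-6, beta=1 / 320e6):
    """Time for one point-to-point message of m bytes: alpha + m*beta."""
    return alpha + m * beta

def t_aam_direct(m, p, alpha=9e-6, beta=1 / 320e6):
    """Direct all-to-all multicast: each node sends P-1 messages."""
    return (p - 1) * (alpha + m * beta)

# alpha dominates for small m, beta for large m:
small = t_aam_direct(8, 64)        # dominated by (P-1)*alpha
large = t_aam_direct(1 << 20, 64)  # dominated by (P-1)*m*beta
```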

4
Optimization Strategies
  • Short messages
  • Parameter α dominates
  • Message combining
  • Reduce the total number of messages
  • Multistage algorithm to send messages along a
    virtual topology
  • Large messages
  • Parameter β dominates
  • Network contention
  • Network topology specific optimizations that
    minimize network contention

5
Direct Strategies
  • Direct strategies optimize all to all multicast
    for large messages
  • Minimize network contention
  • Topology specific optimizations that take
    advantage of contention free schedules

6
Fat-tree Networks
  • Popular network topology for clusters
  • Bisection bandwidth O(P)
  • Network scales to several thousands of nodes
  • Topology: k-ary n-tree

7
k-ary n-trees
(Figure: a 4-ary 3-tree)
8
Contention Free Permutations
  • Fat-trees have a nice property: some processor
    permutations are contention free
  • Prefix permutation k
  • Processor i sends data to i XOR k
  • Cyclic shift by k
  • Processor i sends a message to (i + k) mod P
  • Contention free for certain values of k
  • Contention free permutations presented in Heller
    et al. for the CM-5
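The two permutation families above can be written down directly; the XOR and modulo forms follow the later slides, and the function names are ours.

```python
# Sketch of the two permutation families from this slide.

def prefix_permutation(i, k):
    """Prefix permutation k: processor i sends to i XOR k."""
    return i ^ k

def cyclic_shift(i, k, p):
    """Cyclic shift by k: processor i sends to (i + k) mod P."""
    return (i + k) % p

# Each maps the P processors onto distinct destinations,
# so both are true permutations:
p = 8
assert sorted(prefix_permutation(i, 3) for i in range(p)) == list(range(p))
assert sorted(cyclic_shift(i, 2, p) for i in range(p)) == list(range(p))
```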

9
Prefix Permutation 1
Prefix permutation by 1: processor p sends to p XOR 1
10
Prefix Permutation 2
Prefix permutation by 2: processor p sends to p XOR 2 (shown for nodes 0-7)
11
Prefix Permutation 3
Prefix permutation by 3: processor p sends to p XOR 3 (shown for nodes 0-7)
12
Prefix Permutation 4
Prefix permutation by 4: processor p sends to p XOR 4 (shown for nodes 0-7)
13
Cyclic Shift by k
Cyclic shift by 2: processor p sends to (p + 2) mod 8 (shown for nodes 0-7)
14
Quadrics HPC Interconnect
  • Popular interconnect
  • Several machines in the Top500 use Quadrics
  • Used by Pittsburgh's Lemieux (6 TF) and ASCI-Q
    (20 TF)
  • Features
  • Low latency (5 µs for MPI)
  • High bandwidth (320 MB/s/node)
  • Fat-tree topology
  • Scales to 2K nodes

15
Effect of Contention on Throughput
(Plot: node bandwidth (MB/s) vs. k-th permutation)
Sending data from main memory is much slower
16
Performance Bottlenecks
  • 320 byte packet size
  • Packet protocol restricts bandwidth to faraway
    nodes
  • PCI/DMA bandwidth is restrictive
  • Achievable bandwidth is only 128MB/s

17
Quadrics Packet Protocol
(Diagram: the sender transmits the first packet, the receiver returns an ack header; with nearby nodes the link achieves full utilization)
18
Far Away Messages
(Diagram: the same exchange with a far-away node; the sender waits longer for each ack before sending the next packet)
19
AAM on Fat-tree Networks
  • Overcome bottlenecks
  • Messages sent from NIC memory achieve 2.5 times
    better performance
  • Avoid sending messages to far-away nodes
  • Use contention free permutations
  • Permutation: every processor sends a message to a
    different destination

20
AAM Strategy Ring
  • Performs all to all multicast by sending messages
    along a ring formed by the processors
  • Equivalent to P-1 cyclic-shift-by-1 operations
  • Congestion free
  • Has appeared in literature before
  • Drawback
  • Processors send different messages in each step

21
Prefix Send Strategy
  • P-1 prefix permutations
  • In stage j, processor i sends a message to
    processor i XOR (j + 1)
  • Congestion free
  • Can send messages from Elan memory
  • Bad performance on large fat-trees
  • Sends P/2 messages to far-away nodes at distance
    P/2 or more away
  • Wire/Switch delays restrict performance
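The drawback above can be made concrete with a small sketch of the schedule. Here stage j simply uses the XOR mask j (indexing is ours), and a destination in the other half of the tree counts as "far":

```python
# Sketch of the prefix-send schedule: P-1 stages, stage j pairing
# processor i with i XOR j. Destinations whose top bit differs lie in
# the other half of the fat-tree, i.e. at distance P/2 or more.

def prefix_send_schedule(p):
    """Yield (stage, sender, receiver) for the P-1 prefix permutations."""
    for j in range(1, p):
        for i in range(p):
            yield j, i, i ^ j

p = 16
far_per_node = sum(1 for j, i, dst in prefix_send_schedule(p)
                   if i == 0 and dst >= p // 2)
# Processor 0 sends P/2 of its P-1 messages to the far half of the tree,
# matching the P/2 far-away messages noted on the slide.
```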

22
K-Prefix Strategy
  • Hybrid of ring strategy and prefix send
  • Prefix send used in partitions of size k
  • Ring used between the partitions
  • Our contribution!
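The hybrid can be sketched as a destination set: XOR partners inside a partition of k nodes, everything else reached via the inter-partition ring. This is our reading of the slide, not the paper's code, and the ring scheduling itself is not modeled.

```python
# Rough sketch of which nodes a processor reaches under k-Prefix.
# Partitions are assumed to be contiguous, aligned blocks of k nodes.

def k_prefix_destinations(i, p, k):
    """All P-1 destinations of processor i."""
    base = (i // k) * k
    local = [base + ((i % k) ^ j) for j in range(1, k)]   # prefix-send
    remote = [j for j in range(p) if j // k != i // k]    # via the ring
    return local + remote

# Each processor covers every other processor exactly once:
assert sorted(k_prefix_destinations(2, 8, 4)) == [0, 1, 3, 4, 5, 6, 7]
```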

23
Performance
(Plot: node bandwidth (MB/s) each way, by strategy)
Our strategies send messages from Elan memory
24
Cost Equation
  • α: host and network software overhead
  • α_b: cost of a barrier (barriers needed to
    synchronize the nodes)
  • β_em: per-byte network transmission cost
  • δ: copying overhead to NIC memory
  • P: number of processors
  • k: size of the partition in k-Prefix
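The transcript lists the parameters but omits the equation itself. Purely as an illustration of how such terms might compose, and emphatically our guess rather than the paper's formula: copy the message to NIC memory once, send P-1 contention-free messages, and pay one barrier per partition-sized phase.

```python
# Hypothetical composition of the slide's cost parameters (our guess).

def t_k_prefix_estimate(m, p, k, alpha, alpha_b, beta_em, delta):
    copy = m * delta                         # copy into NIC (Elan) memory
    sends = (p - 1) * (alpha + m * beta_em)  # P-1 contention-free sends
    barriers = (p // k) * alpha_b            # inter-phase synchronization
    return copy + sends + barriers
```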

25
K-Prefixlb Strategy
k-Prefixlb strategy synchronizes nodes after a
few steps
26
CPU Overhead
  • Strategies should also be evaluated on compute
    overhead
  • Asynchronous, non-blocking primitives are needed
  • A data-driven system like Charm++ supports this
    automatically

27
Predicted vs Actual Performance
Predicted plot assumes α = 9 µs, α_b = 15 µs, and β and δ corresponding to 294 MB/s
28
Missing Nodes
  • Nodes can be missing from the allocation when
    nodes in the fat tree are down
  • Prefix-send and k-Prefix perform badly in this
    scenario


29
K-Shift Strategy
  • Processor i sends data to the consecutive nodes
  • i - k/2 + 1, …, i - 1, i + 1, …, i + k/2, and to i + k
  • Contention free and good performance with
    non-contiguous nodes when k = 8
  • Our contribution

K-shift gains because most of the destinations
for each node do not change in the presence of
missing nodes
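The k-shift destination set above can be sketched directly (arithmetic mod P; the function name is ours):

```python
# Sketch of the k-shift destination set: the consecutive neighbors
# i-k/2+1 .. i+k/2 (excluding i itself), plus i+k, all mod P.

def k_shift_destinations(i, p, k=8):
    near = [(i + d) % p for d in range(-k // 2 + 1, k // 2 + 1) if d != 0]
    return near + [(i + k) % p]

# For k = 8 each node has 8 destinations; when a neighbor is missing,
# only a small part of this set changes, which is why k-shift tolerates
# gaps in the allocation.
```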
30
Conclusion
  • We optimize AAM for Quadrics QsNet
  • Copying a message to the NIC and sending it from
    there achieves higher bandwidth
  • k-Prefix avoids sending messages to far-away
    nodes
  • Missing nodes are handled by the k-shift strategy
  • Cluster interconnects other than Quadrics also
    have such problems
  • Impressive performance results
  • CPU overhead should be a metric for evaluating
    AAM strategies

31
Future Work