Title: Integrating New Capabilities into NetPIPE
1 Integrating New Capabilities into NetPIPE
- Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes
- Scalable Computing Laboratory of Ames Laboratory
- This work was funded by the MICS office of the US Department of Energy
2 ... with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
3 The NetPIPE utility
- NetPIPE does a series of ping-pong tests between two nodes (a minimal sketch of the pattern follows this list).
- Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
- Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
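The ping-pong pattern itself is simple; the sketch below shows the core timing loop in MPI terms (illustrative only, not NetPIPE source; the message size and repeat count are assumptions):

    /* Ping-pong timing sketch (illustrative only, not NetPIPE source).
     * Rank 0 sends nbytes to rank 1, which echoes it back; half the
     * average round-trip time approximates the one-way latency for
     * small messages. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, nrepeat = 1000;   /* repeat count: an assumption */
        int nbytes = 1024;             /* message size: an assumption */
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < nrepeat; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d Bytes: %0.2f us one-way\n", nbytes,
                   0.5 * (t1 - t0) / nrepeat * 1.0e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }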
Some typical uses
- Measuring the overhead of message-passing protocols.
- Helping to tune the optimization parameters of message-passing libraries.
- Optimizing driver and OS parameters (socket buffer sizes, etc.).
- Identifying dropouts in networking hardware and drivers.
What is not measured
- NetPIPE cannot yet measure the load on the CPU.
- The effects of the different methods for maintaining message progress.
- Scalability with system size.
4 Recent additions to NetPIPE
- Can do an integrity test instead of measuring performance.
- Streaming mode measures performance in one direction only.
  - Must reset the sockets to avoid effects from a collapsing window size.
- A bi-directional ping-pong mode has been added (-2).
- One-sided Get and Put calls can be measured (MPI or SHMEM).
  - Can choose whether to use an intervening MPI_Fence call to synchronize (a sketch follows at the end of this slide).
- Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
  - There are lots of cache effects in SMP message-passing.
  - InfiniBand can show similar effects since memory must be registered with the card.
[Figure: ping-pong between Process 0 and Process 1, with successive transfers shown starting from buffer regions 0, 1, 2, and 3.]
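For the one-sided measurements, the core of the MPI test looks roughly like the following (a sketch under MPI-2 semantics, not NetPIPE's actual module; the window size is an assumption):

    /* One-sided Put sketch (illustrative, not NetPIPE's actual module).
     * NetPIPE can bracket each Put with fence calls, or skip the
     * intervening fence to see whether the library (e.g. MP_Lite with
     * SIGIO, or polling-based ARMCI) completes the transfer on its own. */
    #include <mpi.h>
    #include <string.h>

    #define NBYTES 1024                  /* message size: an assumption */

    int main(int argc, char **argv)
    {
        int rank;
        char sbuf[NBYTES], rbuf[NBYTES];
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(sbuf, rank, NBYTES);

        /* Expose rbuf on every process as the target window. */
        MPI_Win_create(rbuf, NBYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                       /* open the epoch */
        if (rank == 0)
            MPI_Put(sbuf, NBYTES, MPI_BYTE, 1, 0, NBYTES, MPI_BYTE, win);
        MPI_Win_fence(0, win);                       /* the intervening
                                                        synchronization that
                                                        NetPIPE can be told
                                                        to use or skip */
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }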
5 Current projects
- Overlapping pair-wise ping-pong tests (a sketch follows at the end of this slide).
  - Must consider synchronization if not using bi-directional communications.
[Diagram: four nodes (n0-n3) attached to an Ethernet switch, comparing line-speed-limited and end-point-limited communication patterns.]
- Investigate other methods for testing the global network.
- Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all.
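One simple way to stage the overlapping pair-wise test (a sketch only; the pairing rule, message size, and repeat count are assumptions, and NetPIPE's eventual scheme may differ):

    /* Pair-wise ping-pong sketch: every even rank pairs with the next
     * odd rank, and all pairs exchange messages at the same time so the
     * switch backplane is loaded.  Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES 65536                /* message size: an assumption */

    int main(int argc, char **argv)
    {
        int rank, nprocs, partner, i, nrepeat = 100;
        static char sbuf[NBYTES], rbuf[NBYTES];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs % 2) {               /* needs an even process count */
            MPI_Finalize();
            return 1;
        }
        partner = rank ^ 1;             /* 0<->1, 2<->3, ... */

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < nrepeat; i++) {
            /* MPI_Sendrecv exchanges in both directions at once, which
             * sidesteps the synchronization concern noted above. */
            MPI_Sendrecv(sbuf, NBYTES, MPI_BYTE, partner, 0,
                         rbuf, NBYTES, MPI_BYTE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d exchanges of %d Bytes took %f sec per pair\n",
                   nrepeat, NBYTES, t1 - t0);

        MPI_Finalize();
        return 0;
    }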
6 Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv). Burst mode preposts all receives to duplicate the Mellanox test (the idea is sketched below). The no-cache performance is much lower when the memory has to be registered with the card. An MP_Lite InfiniBand module will be incorporated into LAM/MPI.
[Plot includes MVAPICH 0.9.1 for comparison.]
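Expressed as an MPI analogy rather than the actual VAPI code, the burst idea is simply to post every receive before the sender starts (a sketch; the burst length and message size are assumptions):

    /* Burst-mode idea sketched with MPI (the real module preposts
     * Mellanox VAPI receive descriptors, not MPI receives): every
     * receive is posted before the first send, so no transfer ever
     * waits for a receive to be posted. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NBURST 100                  /* burst length: an assumption */
    #define NBYTES 4096                 /* message size: an assumption */

    int main(int argc, char **argv)
    {
        int rank, i;
        char *bufs = malloc((size_t)NBURST * NBYTES);
        MPI_Request req[NBURST];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1)                  /* prepost every receive up front */
            for (i = 0; i < NBURST; i++)
                MPI_Irecv(bufs + (size_t)i * NBYTES, NBYTES, MPI_BYTE,
                          0, i, MPI_COMM_WORLD, &req[i]);

        MPI_Barrier(MPI_COMM_WORLD);    /* receives are now in place */

        if (rank == 0)
            for (i = 0; i < NBURST; i++)
                MPI_Send(bufs + (size_t)i * NBYTES, NBYTES, MPI_BYTE,
                         1, i, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Waitall(NBURST, req, MPI_STATUSES_IGNORE);

        free(bufs);
        MPI_Finalize();
        return 0;
    }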
7 10 Gigabit Ethernet
- Intel 10 Gigabit Ethernet cards
- 133 MHz PCI-X bus
- Single-mode fiber
- Intel ixgb driver
- Can only achieve 2 Gbps now, with a latency of 75 us.
- Streaming mode delivers up to 3 Gbps.
- Much more development work is needed.
8 Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster. GigE cards cost $40 each and 24-port switches cost $1400, or roughly $100 per computer. This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
9 Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62 us for PCs with 64-bit / 66 MHz PCI slots. Channel-bonding 2 GigE cards per PC using MP_Lite doubles the performance for large messages; adding a 3rd card does not help much. Channel-bonding 2 GigE cards per PC using Linux kernel-level bonding actually results in poorer performance. The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better. Any message-passing system could then make use of channel-bonding on Linux systems.
[Figure: Channel-bonding multiple GigE cards using MP_Lite and Linux kernel bonding.]
10 Channel-bonding in MP_Lite
[Diagram: data path on node 0. The application hands messages to MP_Lite in user space, which drives two independent streams (a and b); each stream has its own large socket buffer, TCP/IP stack, dev_q_xmit call, device queue, device driver, and GigE card fed by DMA.]
Flow control may stop a given stream at several
places. With MP_Lite channel-bonding, each
stream is independent of the others.
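The striping itself can be pictured with a sketch like the one below (illustrative only, not MP_Lite source; the helper name send_striped() and the buffer sizes are assumptions):

    /* Striping sketch (illustrative only, not MP_Lite source): half of
     * the message goes out each of two sockets, one per GigE card, so
     * flow control on one stream cannot stall the other. */
    #include <sys/socket.h>
    #include <sys/types.h>

    /* send_striped() is a hypothetical helper; sock_a and sock_b are
     * assumed to be TCP sockets already connected over the two cards. */
    static int send_striped(int sock_a, int sock_b, const char *buf, size_t n)
    {
        int bufsize = 4 * 1024 * 1024;  /* large socket buffers */
        size_t half = n / 2;

        setsockopt(sock_a, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
        setsockopt(sock_b, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));

        /* A real implementation would loop on partial sends and use
         * non-blocking I/O so the two streams truly overlap. */
        if (send(sock_a, buf, half, 0) < 0)
            return -1;
        if (send(sock_b, buf + half, n - half, 0) < 0)
            return -1;
        return 0;
    }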
11 Linux kernel channel-bonding
[Diagram: data path on node 0 with Linux kernel bonding. The application feeds a single large socket buffer and TCP/IP stack; bonding.c then passes packets through dev_q_xmit (dqx) to the two device queues, device drivers, and GigE cards fed by DMA.]
A full device queue will stop the flow at
bonding.c to both device queues. Flow control on
the destination node may stop the flow out of the
socket buffer. In both of these cases, problems
with one stream can affect both streams.
12 Comparison of high-speed interconnects
- InfiniBand can deliver 4500-6500 Mbps at a 7.5 us latency.
- Atoll delivers 1890 Mbps with a 4.7 us latency.
- SCI delivers 1840 Mbps with only a 4.2 us latency.
- Myrinet performance reaches 1820 Mbps with an 8 us latency.
- Channel-bonded GigE offers 1800 Mbps for very large messages.
- Gigabit Ethernet delivers 900 Mbps with a 25-62 us latency.
- 10 GigE only delivers 2 Gbps with a 75 us latency.
13 Conclusions
- NetPIPE provides a consistent set of analytical tools in the same flexible framework to many message-passing and native communication layers.
- New modules have been developed.
  - 1-sided MPI and SHMEM
  - GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
  - Internal tests like memcpy
- New modes have been incorporated into NetPIPE.
  - Streaming and bi-directional modes.
  - Testing without cache effects.
  - The ability to test integrity instead of performance.
14 Current projects
- Developing new modules.
  - ATOLL
  - IBM Blue Gene/L
  - I/O performance
- Need to be able to measure CPU load during communications.
- Expanding NetPIPE to do multiple pair-wise communications.
  - Can measure the backplane performance on switches.
  - Compare the line speed to end-point limited performance.
- Working toward measuring more of the global properties of a network.
  - The network topology will need to be considered.
15 Contact information
- Dave Turner - turner@ameslab.gov
- http://www.scl.ameslab.gov/Projects/MP_Lite/
- http://www.scl.ameslab.gov/Projects/NetPIPE/
16 One-sided Puts between two Linux PCs
- MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
- LAM/MPI has no message progress, so a fence is required.
- ARMCI uses a polling method, and therefore does not require a fence.
- An MPI-2 implementation of MPICH is under development.
- An MPI-2 implementation of MPI/Pro is under development.
[Test setup: Netgear GA620 fiber GigE cards, 32/64-bit 33/66 MHz PCI, AceNIC driver.]
17 The MP_Lite message-passing library
- A light-weight MPI implementation
- Highly efficient for the architectures supported
- Designed to be very user-friendly
- Ideal for performing message-passing research
- http://www.scl.ameslab.gov/Projects/MP_Lite/
18 A NetPIPE example: Performance on a Cray T3E
- Raw SHMEM (sketched below) delivers
  - 2600 Mbps
  - 2-3 us latency
- Cray MPI originally delivered
  - 1300 Mbps
  - 20 us latency
- MP_Lite delivers
  - 2600 Mbps
  - 9-10 us latency
- The new Cray MPI delivers
  - 2400 Mbps
  - 20 us latency
The tops of the spikes are where the message size is divisible by 8 Bytes.
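For reference, a raw SHMEM transfer looks roughly like this (a sketch only, not NetPIPE's SHMEM module; the message size is an assumption, and newer systems use <shmem.h> rather than the T3E's <mpp/shmem.h>):

    /* Raw SHMEM sketch (illustrative only, not NetPIPE's SHMEM module).
     * Buffers must live in symmetric memory (here, a static array) so
     * that shmem_putmem() can target the same address on the remote PE. */
    #include <stdio.h>
    #include <string.h>
    #include <mpp/shmem.h>      /* Cray T3E header; shmem.h elsewhere */

    #define NBYTES 1024         /* message size: an assumption */

    static char buf[NBYTES];    /* symmetric across all PEs */

    int main(void)
    {
        int me;

        start_pes(0);           /* initialize SHMEM */
        me = shmem_my_pe();

        if (me == 0) {
            memset(buf, 1, NBYTES);
            shmem_putmem(buf, buf, NBYTES, 1);  /* put into PE 1's buf */
            shmem_quiet();                      /* wait for completion */
        }
        shmem_barrier_all();                    /* PE 1 can now read buf */

        if (me == 1)
            printf("PE 1 received %d Bytes, first byte = %d\n",
                   NBYTES, buf[0]);

        return 0;
    }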