Title: Networkshop March 2005 Richard Hughes-Jones, Manchester
1 Bandwidth Challenge, Land Speed Record, TCP/IP and You
2 Bandwidth Lust at SC2003
- Working with S2io, Cisco folks
- At the SLAC Booth, running the BW Challenge
3 The Bandwidth Challenge at SC2003
- The peak aggregate bandwidth from the 3 booths was 23.21 Gbit/s
- 1-way link utilisations of >90%
- 6.6 TBytes in 48 minutes
4 Multi-Gigabit flows at SC2003 BW Challenge
- Three server systems with 10 Gigabit Ethernet NICs
- Used the DataTAG altAIMD stack, 9000 byte MTU
- Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
  - Palo Alto PAIX
    - rtt 17 ms, window 30 MB
    - Shared with Caltech booth
    - 4.37 Gbit HighSpeed TCP I5
    - Then 2.87 Gbit I16
    - Fall when 10 Gbit on link
    - 3.3 Gbit Scalable TCP I8
    - Tested 2 flows, sum 1.9 Gbit I39
  - Chicago Starlight
    - rtt 65 ms, window 60 MB
    - Phoenix CPU 2.2 GHz
    - 3.1 Gbit HighSpeed TCP I1.6
5 Collaboration at SC2004
- Working with S2io, Sun, Chelsio
- The BW Challenge at the SLAC Booth
6 UKLight ESLEA at SC2004
- UK e-Science researchers from Manchester, UCL and ULCC involved in the Bandwidth Challenge
- Collaborated with scientists and engineers from Caltech, CERN, FERMI, SLAC, Starlight, UKERNA and U. of Florida
- Networks used by the SLAC/UK team:
  - 10 Gbit Ethernet link from SC2004 to the ESnet/QWest PoP in Sunnyvale
  - 10 Gbit Ethernet link from SC2004 to the CENIC/NLR/Level(3) PoP in Sunnyvale
  - 10 Gbit Ethernet link from SC2004 to Chicago and on to UKLight
- UKLight focused on Gigabit disk-to-disk transfers between UK sites and Pittsburgh
- UK had generous support from Boston Ltd, who loaned the servers
- The BWC Collaboration had support from:
  - S2io NICs
  - Chelsio TOE
  - Sun, who loaned servers
- Essential support from Boston, Sun and Cisco
7 The Bandwidth Challenge at SC2004
- The peak aggregate bandwidth from the booths was 101.13 Gbit/s
- That is 3 full-length DVDs per second!
- 4 times greater than SC2003!
- Saturated TEN 10 Gigabit Ethernet waves
- SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight).
8 Land Speed Record SC2004: Pittsburgh-Tokyo-CERN, single stream TCP
- LSR metric: distance x speed; categories for Single Stream, Multiple Stream, IPv4 and IPv6, Standard TCP
- Current single-stream IPv4 record: University of Tokyo, Fujitsu, WIDE, 9 Nov 04
  - 20,645 km connection, SC2004 booth - CERN via Tokyo
  - Latency 433 ms RTT
  - 10 Gbit Chelsio TOE card
  - 7.21 Gbps (TCP payload), 1500 B MTU, taking about 10 min
  - 148,850 Terabit-metres / second (Internet2 approved LSR record)
  - Full DVD in 5 s
9 So what's the matter with TCP? Did we cheat?
- Just a well engineered end-to-end connection:
  - End-to-end no-loss environment
  - NO contention, NO sharing on the end-to-end path
  - Processor speed and system bus characteristics
  - TCP configuration: window size and frame size (MTU) (window sizing is sketched after this slide)
  - Tuned PCI-X bus
  - Tuned Network Interface Card driver
  - A single TCP connection on the end-to-end path
  - Memory-to-memory transfer - no disk system involved
From Robin Tasker
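
To make the "window size" item concrete: the TCP window has to cover the bandwidth-delay product (BDP) of the path. A minimal Python sketch, using only rates and RTTs quoted elsewhere in this talk; it is illustrative arithmetic, not part of the original slides.

    # Bandwidth-delay product: the TCP window needed to keep a path full.
    def bdp_mbytes(rate_bit_s, rtt_s):
        """Window (MBytes) needed to sustain rate_bit_s over a path with RTT rtt_s."""
        return rate_bit_s * rtt_s / 8.0 / 1e6

    # UKLight London-Chicago-London: 1 Gbit/s at rtt 177 ms -> ~22 MBytes,
    # matching the 22 MByte socket size used for the bbftp tests later in the talk.
    print("UKLight  1 Gbit/s, 177 ms : %5.1f MBytes" % bdp_mbytes(1e9, 0.177))

    # LSR path: 10 Gbit/s line rate at 433 ms RTT -> several hundred MBytes of window.
    print("LSR     10 Gbit/s, 433 ms : %5.1f MBytes" % bdp_mbytes(10e9, 0.433))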
10 TCP (Reno) - What's the problem?
- TCP has 2 phases:
  - Slowstart: probe the network to estimate the available BW; exponential growth
  - Congestion Avoidance: the main data transfer phase; the transfer rate grows slowly
- AIMD and high bandwidth, long distance networks:
  - Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm
  - For each ACK in an RTT without loss:
      cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
  - For each window experiencing loss:
      cwnd -> cwnd - b * cwnd   (Multiplicative Decrease, b = 1/2)
- Packet loss is a killer!
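
The AIMD rules above fit in a few lines of code. This is a minimal sketch of idealised Reno congestion avoidance (units of segments, one update per RTT, no slowstart or timeouts), not any real kernel stack:

    # Idealised TCP Reno congestion avoidance, in segments, one step per RTT.
    # cwnd grows by a (=1) segment per RTT; a loss event removes b (=1/2) of cwnd.
    def reno_cwnd_trace(rtts, loss_rtts, a=1.0, b=0.5, cwnd=10.0):
        trace = []
        for t in range(rtts):
            if t in loss_rtts:
                cwnd -= b * cwnd      # multiplicative decrease on loss
            else:
                cwnd += a             # the per-ACK a/cwnd increases sum to ~a per RTT
            trace.append(cwnd)
        return trace

    # A single loss at RTT 500 halves cwnd; it then needs ~cwnd/2 RTTs to climb back.
    trace = reno_cwnd_trace(rtts=1000, loss_rtts={500})
    print("cwnd before loss %.0f, just after %.0f, 200 RTTs later %.0f"
          % (trace[499], trace[500], trace[700]))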
11 TCP (Reno) - Details
- The time for TCP to recover its throughput from 1 lost packet is given (for line rate C and segment size MSS) by roughly C * RTT^2 / (2 * MSS)
- For an rtt of 200 ms: about 2 min
- Typical RTTs: UK 6 ms, Europe 20 ms, USA 150 ms
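
Because recovery adds one segment per RTT starting from half the full window, the recovery time works out as C x RTT^2 / (2 x MSS). A short sketch, assuming a 1 Gbit/s line rate and 1500 byte segments (assumptions chosen because they reproduce the MB-NG "recover in 1.6 s" figure later in the talk):

    # Time for Reno to regain full rate after one loss:
    #   full cwnd = C*RTT/MSS segments; it halves, then regains 1 segment per RTT,
    #   so t_recover ~ (C*RTT / (2*MSS)) * RTT = C * RTT^2 / (2 * MSS).
    def reno_recovery_s(line_rate_bit_s, rtt_s, mss_bytes=1500):
        return line_rate_bit_s * rtt_s ** 2 / (2.0 * 8 * mss_bytes)

    C = 1e9   # assumed 1 Gbit/s line rate
    for name, rtt in [("MB-NG", 0.0062), ("DataTAG", 0.128), ("UKLight", 0.177)]:
        print("%-8s rtt %5.1f ms -> recovery %7.1f s"
              % (name, rtt * 1e3, reno_recovery_s(C, rtt)))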
12 Investigation of new TCP Stacks
- The AIMD algorithm - Standard TCP (Reno)
  - For each ACK in an RTT without loss:
      cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
  - For each window experiencing loss:
      cwnd -> cwnd - b * cwnd   (Multiplicative Decrease, b = 1/2)
- High Speed TCP
  - a and b vary depending on the current cwnd, using a table
  - a increases more rapidly with larger cwnd - returns to the optimal cwnd size sooner for the network path
  - b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
- Scalable TCP (a small sketch comparing these rules with Reno follows this slide)
  - a and b are fixed adjustments for the increase and decrease of cwnd
  - a = 1/100 - the increase is greater than for TCP Reno
  - b = 1/8 - the decrease on loss is less than for TCP Reno
  - Scalable over any link speed
- Fast TCP
  - Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
- Others: HSTCP-LP, H-TCP, BiC-TCP
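
A compact way to see the difference is to code the per-ACK and per-loss updates side by side. The sketch below uses exactly the Reno and Scalable TCP constants given above (a = 1/100, b = 1/8 for Scalable); HighSpeed TCP's table of a(cwnd) and b(cwnd) values is not reproduced here.

    # Per-ACK / per-loss cwnd updates for Reno and Scalable TCP (units: segments).
    def reno_update(cwnd, loss, a=1.0, b=0.5):
        return cwnd - b * cwnd if loss else cwnd + a / cwnd

    def scalable_update(cwnd, loss, a=0.01, b=0.125):
        return cwnd - b * cwnd if loss else cwnd + a

    def rtts_to_recover(update, cwnd0=1000.0):
        """RTTs needed to climb back to the pre-loss window after one loss event."""
        cwnd, rtts = update(cwnd0, loss=True), 0
        while cwnd < cwnd0:
            for _ in range(int(cwnd)):        # roughly one ACK per segment per RTT
                cwnd = update(cwnd, loss=False)
            rtts += 1
        return rtts

    print("Reno     recovery:", rtts_to_recover(reno_update), "RTTs")       # ~500 RTTs
    print("Scalable recovery:", rtts_to_recover(scalable_update), "RTTs")   # ~14 RTTs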
13 Packet Loss with new TCP Stacks
- TCP response function
  - Throughput vs loss rate - curves further to the right mean faster recovery
  - Drop packets in the kernel
- MB-NG rtt 6 ms
- DataTAG rtt 120 ms
14 Packet Loss and new TCP Stacks
- TCP response function
- UKLight London-Chicago-London, rtt 177 ms
- 2.6.6 kernel
- Agreement with theory is good
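
The "response function" in these plots is the steady-state throughput of a stack as a function of the packet-loss rate. For standard Reno the usual Mathis approximation is throughput ~ (MSS/RTT) x 1.22 / sqrt(p); a sketch of it for the two RTT regimes above, with MSS = 1500 bytes assumed:

    # Mathis et al. approximation of the standard TCP (Reno) response function:
    #   throughput ~ (MSS / RTT) * 1.22 / sqrt(p)  for packet-loss probability p.
    import math

    def reno_throughput_mbit(p, rtt_s, mss_bytes=1500):
        return (mss_bytes * 8.0 / rtt_s) * 1.22 / math.sqrt(p) / 1e6

    for rtt in (0.006, 0.177):                  # MB-NG-like and UKLight-like RTTs
        for p in (1e-4, 1e-6, 1e-8):
            print("rtt %5.0f ms  loss %g : %10.1f Mbit/s"
                  % (rtt * 1e3, p, reno_throughput_mbit(p, rtt)))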
15 High Throughput Demonstrations
[Network diagram. End hosts: Manchester (Geneva) and London (Chicago), each a dual Xeon 2.2 GHz PC with 1 GEth; routers: Cisco 7609 and Cisco GSR at each end; core: 2.5 Gbit SDH MB-NG]
16 High Performance TCP - DataTAG
- Different TCP stacks tested on the DataTAG network
- rtt 128 ms
- Drop 1 in 10^6
- High-Speed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins
17 Is TCP fair?
- TCP Flows Sharing the Bandwidth
18 Test of TCP Sharing: Methodology (1 Gbit/s)
Les Cottrell, PFLDnet 2005
- Chose 3 paths from SLAC (California)
- Caltech (10ms), Univ Florida (80ms), CERN (180ms)
- Used iperf/TCP and UDT/UDP to generate traffic
- Each run was 16 minutes, in 7 regions
19 TCP Reno single stream
Les Cottrell, PFLDnet 2005
- Low performance on fast long-distance paths
- AIMD (add a=1 pkt to cwnd per RTT; decrease cwnd by factor b=0.5 on congestion)
- Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
- Unequal sharing
SLAC to CERN
20 UK Transfers: MB-NG and SuperJANET4
- Throughput for real users
21 iperf Throughput + Web100
- SuperMicro on MB-NG network
- HighSpeed TCP
- Line speed 940 Mbit/s
- DupACKs < 10 (expect ~400)
22 Applications: Throughput Mbit/s
- HighSpeed TCP
- 2 GByte file RAID5
- SuperMicro SuperJANET
- bbcp
- bbftp
- Apache
- Gridftp
- Previous work used RAID0 (not disk limited)
23 bbftp: What else is going on?
- Scalable TCP
- BaBar SuperJANET
- SuperMicro SuperJANET
- Congestion window duplicate ACK
- Variation not TCP related?
- Disk speed / bus transfer
- Application
24 SC2004 Transfers with UKLight
- A taster for Lambda / Packet Switched Hybrid Networks
25 Transatlantic Ethernet: TCP Throughput Tests
- Supermicro X5DPE-G2 PCs
- Dual 2.9 GHz Xeon CPU, FSB 533 MHz
- 1500 byte MTU
- 2.6.6 Linux Kernel
- Memory-memory TCP throughput
- Standard TCP
- Wire rate throughput of 940 Mbit/s
- First 10 sec
- Work in progress to study
- Implementation detail
- Advanced stacks
- Effect of packet loss
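
These memory-to-memory figures come from iperf-style tests with Web100 instrumentation. As an illustration of the method only, a minimal Python sketch that streams a buffer from memory over one TCP connection and reports the rate; the host name, port and duration are placeholders, and this toy script is not the tool used for the slide's measurements.

    # Minimal memory-to-memory TCP throughput test (iperf-like, illustrative only).
    import socket, time

    BUF = b"\0" * (1 << 20)                    # 1 MByte of in-memory data, sent repeatedly

    def serve(port=5001):
        """Sink: accept one connection and discard everything it sends."""
        s = socket.socket()
        s.bind(("", port)); s.listen(1)
        conn, _ = s.accept()
        while conn.recv(1 << 20):
            pass

    def send(host, port=5001, seconds=10, sndbuf=22 * 1024 * 1024):
        """Source: stream from memory for `seconds` and print the achieved rate."""
        s = socket.socket()
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)   # ~BDP-sized buffer
        s.connect((host, port))
        sent, t0 = 0, time.time()
        while time.time() - t0 < seconds:
            sent += s.send(BUF)
        print("%.0f Mbit/s memory-to-memory" % (sent * 8 / (time.time() - t0) / 1e6))

    # Run serve() on one host and send("remote.host.example") on the other.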
26 SC2004 Disk-Disk bbftp (work in progress)
- bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes, socket size 22 Mbytes, rtt 177 ms, SACK off
- Move a 2 Gbyte file
- Web100 plots
- Standard TCP
  - Average 825 Mbit/s
  - (bbcp 670 Mbit/s)
- Scalable TCP
  - Average 875 Mbit/s
  - (bbcp 701 Mbit/s; ~4.5 s of overhead)
- Disk-TCP-Disk at 1 Gbit/s is here!
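
As a cross-check on the averages above, the wall-clock time to move the 2 GByte test file at the quoted rates (illustrative arithmetic; 2 GByte is taken as 2 x 10^9 bytes, which is an assumption about the units):

    # Time to move a 2 GByte file at the average rates quoted above.
    FILE_BITS = 2e9 * 8                        # 2 GByte taken as 2 x 10^9 bytes
    for label, mbit in [("Standard TCP bbftp", 825), ("Scalable TCP bbftp", 875),
                        ("bbcp, Standard TCP", 670), ("bbcp, Scalable TCP", 701)]:
        print("%-20s %4d Mbit/s -> %5.1f s" % (label, mbit, FILE_BITS / (mbit * 1e6)))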
27 Summary, Conclusions and Thanks
- The Super Computing Bandwidth Challenge gives the opportunity to make world-wide high performance tests.
- The Land Speed Record shows what can be achieved with state-of-the-art kit
- Standard TCP is not optimum for high throughput, long distance links
- Packet loss is a killer for TCP
  - Check on campus links and equipment, and access links to backbones
  - Users need to collaborate with the Campus Network Teams
  - Dante PERT
- New stacks are stable and give better response and performance
  - Still need to set the TCP buffer sizes! (a buffer-sizing sketch follows this slide)
  - Check other kernel settings, e.g. window-scale maximum
  - Watch for TCP stack implementation enhancements
- The host is critical: think server quality, not supermarket PC
  - Motherboards, NICs, RAID controllers and disks matter
  - NIC should use 64 bit 133 MHz PCI-X
    - 66 MHz PCI can be OK, but 32 bit 33 MHz is too slow for Gigabit rates
  - Worry about the CPU-memory bandwidth as well as the PCI bandwidth
    - Data crosses the memory bus at least 3 times
  - Separate the data transfers - use motherboards with multiple 64 bit PCI-X buses
  - Choose a modern high throughput RAID controller
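
On the "set the TCP buffer sizes" point, a hedged sketch of how an application sizes its socket buffers from the bandwidth-delay product. The rate and RTT are example values; the kernel settings named in the comments (net.core.rmem_max / wmem_max, net.ipv4.tcp_rmem / tcp_wmem and net.ipv4.tcp_window_scaling) are the standard Linux knobs that must also allow buffers this large.

    # Size TCP socket buffers to the path's bandwidth-delay product.
    # The kernel must permit this too: check net.core.rmem_max / net.core.wmem_max,
    # net.ipv4.tcp_rmem / net.ipv4.tcp_wmem, and that net.ipv4.tcp_window_scaling = 1.
    import socket

    def tuned_tcp_socket(rate_bit_s=1e9, rtt_s=0.177):
        """TCP socket with send/receive buffers sized to the BDP (example path values)."""
        window = int(rate_bit_s * rtt_s / 8)        # ~22 MBytes for 1 Gbit/s at 177 ms
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Set before connect()/listen() so the TCP window scale is negotiated to match.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, window)
        return s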
28 More Information - Some URLs
- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit and writeup: http://www.hep.man.ac.uk/rich/net
- Motherboard and NIC tests:
  - http://www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt
  - http://datatag.web.cern.ch/datatag/pfldnet2003/
  - "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004
  - http://www.hep.man.ac.uk/rich/
- TCP tuning information may be found at:
  - http://www.ncne.nlanr.net/documentation/faq/performance.html
  - http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
29-30 [diagram slides]
31 Topology of the MB-NG Network
32 Topology of the Production Network
[Network diagram. Manchester domain: 3 routers, 2 switches; RAL domain: routers and switches. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS]
33 SC2004 UKLIGHT Overview
[Network diagram. Elements: SC2004 SLAC Booth (Cisco 6509), Caltech Booth (UltraLight IP, Caltech 7600), MB-NG 7600 OSR, Manchester, UCL network, UCL HEP, ULCC UKLight, NLR Lambda NLR-PITT-STAR-10GE-16, Chicago Starlight, UKLight 10G carrying four 1GE channels, Surfnet / EuroLink 10G carrying two 1GE channels]
34 High Performance TCP - MB-NG
- Drop 1 in 25,000
- rtt 6.2 ms
- Recover in 1.6 s
- Standard, HighSpeed and Scalable stacks compared
35 bbftp: Host and Network Effects
- 2 Gbyte file RAID5 Disks
- 1200 Mbit/s read
- 600 Mbit/s write
- Scalable TCP
- BaBar SuperJANET
- Instantaneous 220 - 625 Mbit/s
- SuperMicro SuperJANET
- Instantaneous 400 - 665 Mbit/s for 6 sec
- Then 0 - 480 Mbit/s
- SuperMicro MB-NG
- Instantaneous 880 - 950 Mbit/s for 1.3 sec
- Then 215 - 625 Mbit/s
36 Average Transfer Rates (Mbit/s)
37 UKLight and ESLEA
- Collaboration forming for SC2005
  - Caltech, CERN, FERMI, SLAC, Starlight, UKLight
- Current proposals include:
  - Bandwidth Challenge with even faster disk-to-disk transfers between UK sites and SC2005
  - Radio Astronomy demo at 512 Mbit user data or 1 Gbit user data: Japan, Haystack (MIT), Jodrell Bank, JIVE
  - High bandwidth link-up between UK and US HPC systems
  - 10 Gig NLR wave to Seattle
  - Set up a 10 Gigabit Ethernet test bench
- Experiments (CALICE) need to investigate >25 Gbit to the processor
- ESLEA/UKLight need resources to study:
  - New protocols and congestion / sharing
  - The interaction between protocol processing, applications and storage
  - Monitoring L1/L2 behaviour in hybrid networks
38 10 Gigabit Ethernet: UDP Throughput Tests
- 1500 byte MTU gives ~2 Gbit/s
- Used 16144 byte MTU, max user length 16080
- DataTAG Supermicro PCs
  - Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  - PCI-X mmrbc 512 bytes
  - Wire rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs
  - Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
  - PCI-X mmrbc 512 bytes
  - Wire rate of 5.7 Gbit/s
- SLAC Dell PCs
  - Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  - PCI-X mmrbc 4096 bytes
  - Wire rate of 5.4 Gbit/s
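
These figures were measured with UDPmon (see the URLs slide). As an illustration of the method only, a minimal Python sketch that sends a burst of fixed-size UDP datagrams and reports the offered (send-side) rate; the host and port are placeholders, the 16080 byte user length is the one quoted above, and this toy script will not itself reach multi-Gbit rates.

    # UDPmon-style throughput probe, send side only (illustrative).
    import socket, time

    def udp_burst(host, port=5001, length=16080, count=10000):
        """Send `count` UDP datagrams of `length` bytes and report the offered rate."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = b"\0" * length               # 16080 bytes of user data (16144 byte MTU)
        t0 = time.time()
        for _ in range(count):
            s.sendto(payload, (host, port))
        dt = time.time() - t0
        print("sent %d x %d bytes in %.3f s -> %.2f Gbit/s offered"
              % (count, length, dt, count * length * 8 / dt / 1e9))

    # A matching receiver would count datagrams and bytes to give the received wire rate.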
39 10 Gigabit Ethernet: Tuning PCI-X
- 16080 byte packets every 200 µs
- Intel PRO/10GbE LR Adapter
- PCI-X bus occupancy vs mmrbc
- Measured times
- Times based on PCI-X times from the logic
analyser - Expected throughput 7 Gbit/s
- Measured 5.7 Gbit/s
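
A crude model of why mmrbc (the maximum memory read byte count) matters: PCI-X at 64 bit / 133 MHz moves 8 bytes per clock, but every burst carries a fixed overhead of a few clocks, so small bursts waste a larger fraction of the bus. The per-burst overhead below is an assumed, illustrative number, not a measured one.

    # Crude PCI-X efficiency model: 64 bit @ 133 MHz = 8 bytes per clock raw,
    # with an assumed fixed overhead of a few clocks per burst of mmrbc bytes.
    RAW_GBIT = 133e6 * 8 * 8 / 1e9             # ~8.5 Gbit/s raw bus bandwidth
    OVERHEAD_CLOCKS = 10                       # assumed per-burst overhead (illustrative)

    def pcix_usable_gbit(mmrbc_bytes):
        data_clocks = mmrbc_bytes / 8.0
        return RAW_GBIT * data_clocks / (data_clocks + OVERHEAD_CLOCKS)

    for mmrbc in (512, 1024, 2048, 4096):
        print("mmrbc %4d bytes -> ~%.1f Gbit/s usable PCI-X bandwidth"
              % (mmrbc, pcix_usable_gbit(mmrbc)))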
40 10 Gigabit Ethernet: SC2004 TCP Tests
- Sun AMD Opteron compute servers (v20z)
- Chelsio TOE tests between Linux 2.6.6 hosts
  - 10 Gbit Ethernet link from SC2004 to the CENIC/NLR/Level(3) PoP in Sunnyvale
  - Two 2.4 GHz AMD 64 bit Opteron processors with 4 GB of RAM at SC2004
  - 1500B MTU, all Linux 2.6.6
  - In one direction 9.43 Gbit/s, i.e. 9.07 Gbit/s goodput
  - And in the reverse direction 5.65 Gbit/s, i.e. 5.44 Gbit/s goodput
  - Total of ~15 Gbit/s on the wire
- 10 Gbit Ethernet link from SC2004 to the ESnet/QWest PoP in Sunnyvale
  - One 2.4 GHz AMD 64 bit Opteron at each end
  - 2 MByte window, 16 streams, 1500B MTU, all Linux 2.6.6
  - In one direction 7.72 Gbit/s, i.e. 7.42 Gbit/s goodput
  - 120 mins (6.6 Tbits shipped)
- S2io NICs with Solaris 10 in a 4 x 2.2 GHz Opteron v40z, to one or more S2io or Chelsio NICs with Linux 2.6.5 or 2.6.6 in 2 x 2.4 GHz v20zs
  - LAN 1: S2io NIC back to back, 7.46 Gbit/s
  - LAN 2: S2io in the v40z to 2 v20zs, each NIC ~6 Gbit/s, total 12.08 Gbit/s
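
The wire-rate / goodput pairs above are consistent with simple header accounting for 1500 byte MTU frames. A sketch, assuming 40 bytes of TCP/IP header per packet and 18 bytes of Ethernet framing, and taking the quoted rates as Ethernet frame rates; it reproduces the quoted goodputs to within about 0.01 Gbit/s.

    # TCP goodput from a quoted Ethernet-level rate, for 1500 byte MTU frames.
    MTU, TCPIP_HDR, ETH_HDR = 1500, 40, 18
    PAYLOAD_FRACTION = (MTU - TCPIP_HDR) / float(MTU + ETH_HDR)   # 1460/1518 ~ 0.962

    for wire_gbit in (9.43, 5.65, 7.72):
        print("%.2f Gbit/s on the wire -> ~%.2f Gbit/s goodput"
              % (wire_gbit, wire_gbit * PAYLOAD_FRACTION))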
41 Transatlantic Ethernet: disk-to-disk Tests
- Supermicro X5DPE-G2 PCs
- Dual 2.9 GHz Xeon CPU, FSB 533 MHz
- 1500 byte MTU
- 2.6.6 Linux kernel
- RAID0 (6 SATA disks)
- bbftp (disk-disk) throughput
- Standard TCP
- Throughput of 436 Mbit/s
- First 10 sec
- Work in progress to study
- Throughput limitations
- Help real users
42 SC2004 Disk-Disk bbftp (work in progress)
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes, socket size 22 Mbytes, rtt 177 ms, SACK off
- Move a 2 Gbyte file
- Web100 plots
- HS TCP
- Don't believe this is a protocol problem!