Transcript and Presenter's Notes

Title: Networkshop March 2005, Richard Hughes-Jones, Manchester


1
Bandwidth Challenge, Land Speed Record, TCP/IP
and You
2
Bandwidth Lust at SC2003
  • The SC Network
  • Working with S2io and Cisco folks
  • At the SLAC Booth, running the BW Challenge

3
The Bandwidth Challenge at SC2003
  • The peak aggregate bandwidth from the 3 booths
    was 23.21 Gbit/s
  • 1-way link utilisations of >90%
  • 6.6 TBytes in 48 minutes

4
Multi-Gigabit flows at SC2003 BW Challenge
  • Three Server systems with 10 Gigabit Ethernet
    NICs
  • Used the DataTAG altAIMD stack, 9000 byte MTU
  • Sent mem-mem iperf TCP streams from the SLAC/FNAL
    booth in Phoenix to:
  • Palo Alto PAIX
  • rtt 17 ms, window 30 MB
  • Shared with the Caltech booth
  • 4.37 Gbit HighSpeed TCP I5
  • Then 2.87 Gbit I16
  • Fall when 10 Gbit on link
  • 3.3Gbit Scalable TCP I8
  • Tested 2 flows sum 1.9Gbit I39
  • Chicago Starlight
  • rtt 65 ms, window 60 MB
  • Phoenix CPU 2.2 GHz
  • 3.1 Gbit HighSpeed TCP I1.6

5
Collaboration at SC2004
  • Working with S2io, Sun, Chelsio
  • Setting up the BW Bunker
  • SCINet
  • The BW Challenge at the SLAC Booth

6
UKLight ESLEA at SC2004
  • UK e-Science researchers from Manchester, UCL
    and ULCC involved in the Bandwidth Challenge
  • Collaborated with scientists and engineers from
    Caltech, CERN, FERMI, SLAC, Starlight, UKERNA and
    U. of Florida
  • Networks used by the SLAC/UK team
  • 10 Gbit Ethernet link from SC2004 to ESnet/QWest
    PoP in Sunnyvale
  • 10 Gbit Ethernet link from SC2004 to the
    CENIC/NLR/Level(3) PoP in Sunnyvale
  • 10 Gbit Ethernet link from SC2004 to Chicago and
    on to UKLight
  • UKLight focused on Gigabit disk-to-disk transfers
    between UK sites and Pittsburgh
  • UK had generous support from Boston Ltd who
    loaned the servers
  • The BWC Collaboration had support from
  • S2io NICs
  • Chelsio TOE
  • Sun who loaned servers
  • Essential support from Boston, Sun and Cisco

7
The Bandwidth Challenge SC2004
  • The peak aggregate bandwidth from the booths was
    101.13 Gbit/s
  • That is 3 full-length DVDs per second!
  • 4 times greater than SC2003!
  • Saturated ten 10 Gigabit Ethernet waves
  • SLAC Booth Sunnyvale to Pittsburgh, LA to
    Pittsburgh and Chicago to Pittsburgh (with
    UKLight).

8
Land Speed Record SC2004 Pittsburgh-Tokyo-CERN
Single stream TCP
  • LSR metric is distance × speed
  • Single Stream, Multiple Stream, IPv4 and IPv6
    Standard TCP
  • Current single-stream IPv4 record: University of
    Tokyo, Fujitsu and WIDE, 9 Nov 04
  • 20,645 km connection SC2004 booth - CERN via
    Tokyo
  • Latency 433 ms RTT
  • 10 Gbit Chelsio TOE Card
  • 7.21 Gbit/s (TCP payload), 1500 B MTU, taking
    about 10 min
  • 148,850 Terabit meter / second (Internet2 LSR
    approved record)
  • Full DVD in 5 s

9
So what's the matter with TCP? Did we cheat?
  • Just a Well Engineered End-to-End Connection
  • End-to-End no loss environment
  • NO contention, NO sharing on the end-to-end path
  • Processor speed and system bus characteristics
  • TCP configuration: window size and frame size
    (MTU) (see the window-size sketch below)
  • Tuned PCI-X bus
  • Tuned Network Interface Card driver
  • A single TCP connection on the end-to-end path
  • Memory-to-Memory transfer
  • no disk system involved

From Robin Tasker
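The window-size part of that tuning is just the bandwidth-delay product. A minimal sketch, using the 17 ms and 65 ms RTTs quoted on the SC2003 slide; the 5 Gbit/s target rate is an illustrative assumption:

    def tcp_window_bytes(target_bps, rtt_s):
        """Socket buffer (window) needed to keep a path of given rate and RTT full."""
        return target_bps * rtt_s / 8

    # rtt 17 ms (Phoenix - Palo Alto) and 65 ms (Phoenix - Chicago) from slide 4
    for rtt_ms in (17, 65):
        window_mb = tcp_window_bytes(5e9, rtt_ms / 1000) / 2**20
        print(f"rtt {rtt_ms} ms: ~{window_mb:.0f} MB window needed for 5 Gbit/s")

The 30 MB and 60 MB windows quoted earlier comfortably exceed these values.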
10
TCP (Reno): What's the problem?
  • TCP has 2 phases
  • Slow start: probe the network to estimate the
    available bandwidth; exponential growth
  • Congestion avoidance: main data transfer phase;
    transfer rate grows slowly
  • AIMD and High Bandwidth Long Distance networks
  • Poor performance of TCP in high-bandwidth wide
    area networks is due in part to the TCP
    congestion control algorithm
  • For each ACK in an RTT without loss
  • cwnd → cwnd + a / cwnd (Additive Increase, a = 1)
  • For each window experiencing loss
  • cwnd → cwnd - b × cwnd (Multiplicative Decrease, b = ½)
  • Packet loss is a killer !!
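A minimal sketch of those two rules, written per ACK and per loss event with the a = 1, b = ½ values above (illustrative only, not the kernel implementation):

    def reno_on_ack(cwnd, a=1.0):
        """Additive increase: about a segments are added per RTT (a/cwnd per ACK)."""
        return cwnd + a / cwnd

    def reno_on_loss(cwnd, b=0.5):
        """Multiplicative decrease: cwnd is cut by the factor b on a loss event."""
        return max(1.0, cwnd - b * cwnd)

    # A single loss at a 1000-segment window halves it; each later loss-free RTT
    # (roughly cwnd ACKs) then adds only about one segment back.
    print(reno_on_loss(1000.0))   # -> 500.0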

11
TCP (Reno) Details
  • Time for TCP to recover its throughput after 1
    lost packet is given approximately by
  • τ ≈ C × RTT² / (2 × MSS) for a path of capacity C:
    the halved window is rebuilt at one segment per RTT
  • For an rtt of 200 ms this is of the order of 2 min

(Plot of recovery time vs rtt, with typical values marked: UK 6 ms, Europe 20 ms, USA 150 ms.)
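That recovery time can be evaluated directly. A small sketch, assuming a 1 Gbit/s path and 1500-byte frames (roughly 1460 bytes of TCP payload); the MB-NG backup slide later (rtt 6.2 ms, recover in 1.6 s) matches this estimate:

    def reno_recovery_seconds(capacity_bps, rtt_s, mss_bytes=1460):
        """Approximate time for Reno to regain full rate after one loss:
        the halved window is rebuilt at one segment per RTT."""
        window_segments = capacity_bps * rtt_s / (mss_bytes * 8)
        return (window_segments / 2) * rtt_s

    # Illustrative 1 Gbit/s path at the RTTs quoted on the slide
    for name, rtt in [("UK 6 ms", 0.006), ("Europe 20 ms", 0.020),
                      ("USA 150 ms", 0.150), ("MB-NG 6.2 ms", 0.0062)]:
        print(f"{name}: {reno_recovery_seconds(1e9, rtt):.1f} s")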
12
Investigation of new TCP Stacks
  • The AIMD algorithm: Standard TCP (Reno)
  • For each ACK in an RTT without loss
  • cwnd → cwnd + a / cwnd (Additive Increase, a = 1)
  • For each window experiencing loss
  • cwnd → cwnd - b × cwnd (Multiplicative Decrease, b = ½)
  • High Speed TCP
  • a and b vary depending on the current cwnd, using
    a table
  • a increases more rapidly with larger cwnd and
    returns to the optimal cwnd size sooner for the
    network path
  • b decreases less aggressively and, as a
    consequence, so does the cwnd. The effect is that
    there is not such a decrease in throughput.
  • Scalable TCP (compared with Reno in the sketch
    after this list)
  • a and b are fixed adjustments for the increase
    and decrease of cwnd
  • a = 1/100: the increase is greater than for TCP Reno
  • b = 1/8: the decrease on loss is less than for TCP
    Reno
  • Scalable over any link speed.
  • Fast TCP
  • Uses round trip time as well as packet loss to
    indicate congestion with rapid convergence to
    fair equilibrium for throughput.
  • HSTCP-LP, H-TCP, BiC-TCP
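A sketch contrasting the per-ACK rules quoted above for Standard and Scalable TCP (a = 1, b = ½ versus a = 1/100, b = 1/8). HighSpeed TCP is omitted because its a(cwnd), b(cwnd) table is not reproduced here, and the 1000-segment window is an arbitrary illustration:

    def reno_step(cwnd, loss):
        # Standard TCP: a = 1 (per RTT, i.e. 1/cwnd per ACK), b = 1/2 on loss
        return cwnd - 0.5 * cwnd if loss else cwnd + 1.0 / cwnd

    def scalable_step(cwnd, loss):
        # Scalable TCP: fixed a = 1/100 per ACK, b = 1/8 on loss
        return cwnd - cwnd / 8 if loss else cwnd + 0.01

    def acks_to_recover(step, w0=1000.0):
        """Count ACKs needed to climb back to the pre-loss window after one loss."""
        cwnd, acks = step(w0, True), 0
        while cwnd < w0:
            cwnd, acks = step(cwnd, False), acks + 1
        return acks

    print("Standard TCP:", acks_to_recover(reno_step), "ACKs")      # ~375,000
    print("Scalable TCP:", acks_to_recover(scalable_step), "ACKs")  # ~12,500

The fixed per-ACK increase and smaller decrease are what let Scalable TCP recover orders of magnitude faster at large windows.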

13
Packet Loss with new TCP Stacks
  • TCP Response Function
  • Throughput vs loss rate: curves further to the
    right give faster recovery
  • Drop packets in kernel

(Plots for MB-NG, rtt 6 ms, and DataTAG, rtt 120 ms.)
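For the Standard-TCP curve, the response function is the familiar Mathis estimate, throughput ≈ (MSS/RTT) × sqrt(3/(2p)). A sketch evaluating it at the two testbed RTTs named above (Reno only; 1460-byte MSS assumed, values are illustrative):

    from math import sqrt

    def reno_throughput_bps(loss_rate, rtt_s, mss_bytes=1460):
        """Mathis et al. steady-state estimate for Standard TCP."""
        return (mss_bytes * 8 / rtt_s) * sqrt(1.5 / loss_rate)

    for name, rtt in [("MB-NG", 0.006), ("DataTAG", 0.120)]:
        for p in (1e-4, 1e-6):
            print(f"{name} rtt {rtt*1000:.0f} ms, loss {p:.0e}: "
                  f"{reno_throughput_bps(p, rtt)/1e6:.0f} Mbit/s")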
14
Packet Loss and new TCP Stacks
  • TCP Response Function
  • UKLight London-Chicago-London rtt 177 ms
  • 2.6.6 Kernel
  • Agreement with theory is good

15
High Throughput Demonstrations
(Diagram: Manchester (Geneva) and London (Chicago) end hosts, each a dual 2.2 GHz Xeon, connected by 1 GEth through Cisco 7609 and Cisco GSR routers over the 2.5 Gbit SDH MB-NG core.)
16
High Performance TCP DataTAG
  • Different TCP stacks tested on the DataTAG
    Network
  • rtt 128 ms
  • Drop 1 in 10⁶
  • High-Speed
  • Rapid recovery
  • Scalable
  • Very fast recovery
  • Standard
  • Recovery would take 20 mins

17
  • Is TCP fair?
  • TCP Flows Sharing the Bandwidth

18
Test of TCP Sharing: Methodology (1 Gbit/s)
Les Cottrell PFLDnet 2005
  • Chose 3 paths from SLAC (California)
  • Caltech (10ms), Univ Florida (80ms), CERN (180ms)
  • Used iperf/TCP and UDT/UDP to generate traffic
  • Each run was 16 minutes, in 7 regions
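A minimal sketch of driving such memory-to-memory iperf TCP runs from a script. The host names and window sizes below are placeholders, and the exact iperf options used in the SLAC tests are not given here; -c, -t and -w are standard iperf client options:

    import subprocess

    # Placeholder endpoints standing in for the Caltech, U. Florida and CERN hosts
    paths = [("caltech-host", "8M"), ("ufl-host", "32M"), ("cern-host", "64M")]

    for host, window in paths:
        # One 16-minute (960 s) memory-to-memory TCP run per path
        subprocess.run(["iperf", "-c", host, "-t", "960", "-w", window], check=True)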

19
TCP Reno single stream
Les Cottrell PFLDnet 2005
  • Low performance on fast long distance paths
  • AIMD (add a = 1 pkt to cwnd per RTT, decrease cwnd
    by a factor b = 0.5 on congestion)
  • Net effect: recovers slowly, does not effectively
    use the available bandwidth, so poor throughput
  • Unequal sharing

SLAC to CERN
20
  • UK Transfers MB-NG and SuperJANET4
  • Throughput for real users

21
iperf Throughput Web100
  • SuperMicro on MB-NG network
  • HighSpeed TCP
  • Line speed 940 Mbit/s
  • DupACKs < 10 (expect 400)

22
Applications Throughput Mbit/s
  • HighSpeed TCP
  • 2 GByte file, RAID5
  • SuperMicro on SuperJANET
  • bbcp
  • bbftp
  • Apache
  • GridFTP
  • Previous work used RAID0 (not disk limited)

23
bbftp: What else is going on?
  • Scalable TCP
  • BaBar on SuperJANET
  • SuperMicro on SuperJANET
  • Congestion window and duplicate ACKs
  • Variation not TCP related?
  • Disk speed / bus transfer
  • Application

24
  • SC2004 Transfers with UKLight
  • A Taster for Lambda and Packet-Switched Hybrid
    Networks

25
Transatlantic Ethernet TCP Throughput Tests
  • Supermicro X5DPE-G2 PCs
  • Dual 2.9 GHz Xeon CPU, FSB 533 MHz
  • 1500 byte MTU
  • 2.6.6 Linux Kernel
  • Memory-memory TCP throughput
  • Standard TCP
  • Wire rate throughput of 940 Mbit/s
  • First 10 sec
  • Work in progress to study
  • Implementation detail
  • Advanced stacks
  • Effect of packet loss

26
SC2004 Disk-Disk bbftp (work in progress)
  • bbftp file transfer program uses TCP/IP
  • UKLight path: London-Chicago-London; PCs:
    Supermicro with 3Ware RAID0
  • MTU 1500 bytes, socket size 22 MBytes, rtt 177 ms,
    SACK off
  • Move a 2 Gbyte file
  • Web100 plots
  • Standard TCP
  • Average 825 Mbit/s
  • (bbcp 670 Mbit/s)
  • Scalable TCP
  • Average 875 Mbit/s
  • (bbcp 701 Mbit/s, about 4.5 s of overhead; see the
    arithmetic sketch below)
  • Disk-TCP-Disk at 1 Gbit/s is here!
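The bbcp figure can be cross-checked against the quoted averages for the 2 GByte transfer. A small sketch of that arithmetic (2 GByte taken as 2 x 10^9 bytes):

    FILE_BITS = 2e9 * 8   # the 2 GByte test file

    def transfer_seconds(rate_mbit_s):
        return FILE_BITS / (rate_mbit_s * 1e6)

    bbftp, bbcp = transfer_seconds(875), transfer_seconds(701)
    print(f"bbftp {bbftp:.1f} s, bbcp {bbcp:.1f} s, difference {bbcp - bbftp:.1f} s")
    # -> roughly 18.3 s vs 22.8 s, i.e. about 4.5 s of extra overhead for bbcp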

27
Summary, Conclusions and Thanks
  • Super Computing Bandwidth Challenge gives
    opportunity to make world-wide High performance
    tests.
  • Land Speed Record shows what can be achieved with
    state of the art kit
  • Standard TCP not optimum for high throughput long
    distance links
  • Packet loss is a killer for TCP
  • Check on-campus links and equipment, and access
    links to backbones
  • Users need to collaborate with the Campus Network
    Teams
  • Dante PERT
  • New stacks are stable and give better response and
    performance
  • Still need to set the TCP buffer sizes!
    (see the socket sketch after this list)
  • Check other kernel settings e.g. window-scale
    maximum
  • Watch for TCP Stack implementation Enhancements
  • Host is critical: think server quality, not
    supermarket PC
  • Motherboards, NICs, RAID controllers and disks
    matter
  • NIC should use 64 bit 133 MHz PCI-X
  • 66 MHz PCI can be OK but 32 bit 33 MHz is too
    slow for Gigabit rates
  • Worry about the CPU-Memory bandwidth as well as
    the PCI bandwidth
  • Data crosses the memory bus at least 3 times
  • Separate the data transfers: use motherboards
    with multiple 64 bit PCI-X buses
  • Choose a modern high throughput RAID controller
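On the buffer-size point, a minimal sketch of setting the TCP socket buffers from an application (Linux assumed; the kernel still caps these at net.core.rmem_max / net.core.wmem_max, and window scaling must be enabled):

    import socket

    def connect_with_window(host, port, window_bytes):
        """Open a TCP connection with explicit send/receive buffer sizes.

        The buffers are set before connect() so that the window-scale option
        is negotiated correctly when the connection is opened.
        """
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window_bytes)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, window_bytes)
        s.connect((host, port))
        return s

    # e.g. a ~30 MB window for a multi-gigabit path with rtt ~17 ms, as on slide 4
    # sock = connect_with_window("example.org", 5001, 30 * 1024 * 1024)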

28
More Information: Some URLs
  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit write-up:
    http://www.hep.man.ac.uk/rich/net
  • Motherboard and NIC tests:
    http://www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt
    http://datatag.web.cern.ch/datatag/pfldnet2003/
  • "Performance of 1 and 10 Gigabit Ethernet Cards
    with Server Quality Motherboards", FGCS Special
    Issue 2004: http://www.hep.man.ac.uk/rich/
  • TCP tuning information may be found at:
    http://www.ncne.nlanr.net/documentation/faq/performance.html
    http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced
    TCP Stacks on Fast Long-Distance Production
    Networks", Journal of Grid Computing, 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

29
  • Any Questions?

30
  • Backup Slides

31
Topology of the MB-NG Network
32
Topology of the Production Network
(Diagram: Manchester domain with 3 routers and 2 switches; RAL domain with routers and switches. Key: Gigabit Ethernet, 2.5 Gbit POS access, 10 Gbit POS.)
33
SC2004 UKLIGHT Overview
(Diagram labels: SC2004 SLAC Booth, Cisco 6509; Caltech Booth, UltraLight IP, Caltech 7600; NLR Lambda NLR-PITT-STAR-10GE-16; Chicago Starlight; UKLight 10G with four 1GE channels; Surfnet / EuroLink 10G with two 1GE channels; ULCC UKLight; MB-NG 7600 OSR, Manchester; UCL network, UCL HEP.)
34
High Performance TCP MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • Standard, HighSpeed, Scalable

35
bbftp Host Network Effects
  • 2 GByte file on RAID5 disks
  • 1200 Mbit/s read
  • 600 Mbit/s write
  • Scalable TCP
  • BaBar on SuperJANET
  • Instantaneous 220 - 625 Mbit/s
  • SuperMicro on SuperJANET
  • Instantaneous 400 - 665 Mbit/s for 6 sec
  • Then 0 - 480 Mbit/s
  • SuperMicro on MB-NG
  • Instantaneous 880 - 950 Mbit/s for 1.3 sec
  • Then 215 - 625 Mbit/s

36
Average Transfer Rates Mbit/s
37
UKLight and ESLEA
  • Collaboration forming for SC2005
  • Caltech, CERN, FERMI, SLAC, Starlight, UKLight,
  • Current Proposals include
  • Bandwidth Challenge with even faster disk-to-disk
    transfers between UK sites and SC2005
  • Radio Astronomy demo at 512 Mbit or 1 Gbit user
    data: Japan, Haystack (MIT), Jodrell Bank, JIVE
  • High Bandwidth linkup between UK and US HPC
    systems
  • 10Gig NLR wave to Seattle
  • Set up a 10 Gigabit Ethernet Test Bench
  • Experiments (CALICE) need to investigate >25 Gbit
    to the processor
  • ESLEA/UKlight need resources to study
  • New protocols and congestion / sharing
  • The interaction between protocol processing,
    applications and storage
  • Monitoring L1/L2 behaviour in hybrid networks

38
10 Gigabit Ethernet UDP Throughput Tests
  • 1500 byte MTU gives 2 Gbit/s
  • Used 16144 byte MTU, max user length 16080
  • DataTAG Supermicro PCs
  • Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  • PCI-X mmrbc 512 bytes
  • wire rate throughput of 2.9 Gbit/s
  • CERN OpenLab HP Itanium PCs
  • Dual 1.0 GHz 64 bit Itanium CPU FSB 400 MHz
  • PCI-X mmrbc 512 bytes
  • wire rate of 5.7 Gbit/s
  • SLAC Dell PCs giving a
  • Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  • PCI-X mmrbc 4096 bytes
  • wire rate of 5.4 Gbit/s
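Rough packet-rate arithmetic behind those MTU figures (illustrative only, using the quoted throughputs; 1472 bytes is the UDP payload that fits a 1500-byte MTU):

    def packets_per_second(throughput_bps, payload_bytes):
        """Approximate UDP packet rate the host must sustain."""
        return throughput_bps / (payload_bytes * 8)

    print(f"1500 B MTU at 2 Gbit/s:    {packets_per_second(2.0e9, 1472):,.0f} pkt/s")
    print(f"16144 B MTU at 5.7 Gbit/s: {packets_per_second(5.7e9, 16080):,.0f} pkt/s")

The jumbo frames cut the per-packet work by roughly a factor of four even while carrying nearly three times the data rate.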

39
10 Gigabit Ethernet Tuning PCI-X
  • 16080 byte packets every 200 µs
  • Intel PRO/10GbE LR Adapter
  • PCI-X bus occupancy vs mmrbc
  • Measured times
  • Times based on PCI-X times from the logic
    analyser
  • Expected throughput 7 Gbit/s
  • Measured 5.7 Gbit/s
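mmrbc is the PCI-X maximum memory read byte count, i.e. the largest burst one read transaction can carry, so it sets how many bus transactions (each with its own overhead) are needed per packet. A small sketch for the 16080-byte packets above:

    import math

    def bursts_per_packet(packet_bytes, mmrbc_bytes):
        """PCI-X memory-read bursts needed to move one packet across the bus."""
        return math.ceil(packet_bytes / mmrbc_bytes)

    for mmrbc in (512, 1024, 2048, 4096):
        print(f"mmrbc {mmrbc:4d} B: {bursts_per_packet(16080, mmrbc):2d} bursts per packet")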

40
10 Gigabit Ethernet SC2004 TCP Tests
  • Sun AMD Opteron compute servers (V20z)
  • Chelsio TOE tests between Linux 2.6.6 hosts
  • 10 Gbit Ethernet link from SC2004 to the
    CENIC/NLR/Level(3) PoP in Sunnyvale
  • Two 2.4GHz AMD 64 bit Opteron processors with 4GB
    of RAM at SC2004
  • 1500B MTU, all Linux 2.6.6
  • in one direction 9.43G i.e. 9.07G goodput
  • and the reverse direction 5.65G i.e. 5.44G
    goodput
  • Total of 15G on wire.
  • 10 Gbit Ethernet link from SC2004 to ESnet/QWest
    PoP in Sunnyvale
  • One 2.4GHz AMD 64 bit Opteron each end
  • 2MByte window, 16 streams, 1500B MTU, all Linux
    2.6.6
  • in one direction 7.72Gbit/s i.e. 7.42 Gbit/s
    goodput
  • 120 mins (6.6 Tbits shipped)
  • S2io NICs with Solaris 10 in a 4 x 2.2 GHz Opteron
    CPU V40z to one or more S2io or Chelsio NICs with
    Linux 2.6.5 or 2.6.6 in 2 x 2.4 GHz V20zs
  • LAN, 1 S2io NIC back to back: 7.46 Gbit/s
  • LAN, 2 S2io in the V40z to 2 V20zs, each NIC 6
    Gbit/s, total 12.08 Gbit/s
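The goodput figures above are consistent with simple per-frame arithmetic: with a 1500-byte MTU each 1518-byte Ethernet frame carries at most 1460 bytes of TCP payload. A sketch of that check (the 40-byte TCP/IP and 18-byte Ethernet overheads counted here are an assumption about how the quoted goodput was derived):

    def goodput_from_wire_rate(wire_gbps, mtu=1500, tcp_ip_hdr=40, eth_overhead=18):
        """Estimate TCP payload rate from the on-wire Ethernet rate."""
        payload = mtu - tcp_ip_hdr        # 1460 B of TCP payload per frame
        frame = mtu + eth_overhead        # 1518 B Ethernet frame (header + FCS)
        return wire_gbps * payload / frame

    for wire in (9.43, 7.72, 5.65):
        print(f"{wire} Gbit/s on the wire -> ~{goodput_from_wire_rate(wire):.2f} Gbit/s goodput")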

41
Transatlantic Ethernet disk-to-disk Tests
  • Supermicro X5DPE-G2 PCs
  • Dual 2.9 GHz Xeon CPU, FSB 533 MHz
  • 1500 byte MTU
  • 2.6.6 Linux Kernel
  • RAID0 (6 SATA disks)
  • Bbftp (disk-disk) throughput
  • Standard TCP
  • Throughput of 436 Mbit/s
  • First 10 sec
  • Work in progress to study
  • Throughput limitations
  • Help real users

42
SC2004 Disk-Disk bbftp (work in progress)
  • UKLight path: London-Chicago-London; PCs:
    Supermicro with 3Ware RAID0
  • MTU 1500 bytes, socket size 22 MBytes, rtt 177 ms,
    SACK off
  • Move a 2 Gbyte file
  • Web100 plots
  • HS TCP
  • Don't believe this is a protocol problem!