Title: GridPP Meeting, Edinburgh, 4-5 Feb 04
1 High Performance Networking for ALL
Members of GridPP are in many network collaborations, including:
- Close links with SLAC, UKERNA, SURFNET and other NRNs
- Dante, Internet2, Starlight, Netherlight
- GGF, RIPE, Industry
- UKLight
2 Network Monitoring 1
3 Network Monitoring 2
- 24 Jan to 4 Feb 04: TCP iperf RAL to HEP. Only 2 sites >80 Mbit/s
- 24 Jan to 4 Feb 04: TCP iperf DL to HEP
- HELP!
4 High bandwidth, long distance. Where is my throughput?
- Robin Tasker
- CCLRC, Daresbury Laboratory, UK
- r.tasker@dl.ac.uk
DataTAG is a project sponsored by the European
Commission - EU Grant IST-2001-32459
5 Throughput: What's the problem?
One terabyte of data transferred in less than an hour.
On February 27-28 2003, the transatlantic DataTAG network was extended, i.e. CERN - Chicago - Sunnyvale (>10,000 km). For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream. The transfer was accomplished from Sunnyvale to Geneva at a rate of 2.38 Gbit/s.
6 Internet2 Land Speed Record
On October 1 2003, DataTAG set a new Internet2
Land Speed Record by transferring 1.1 Terabytes
of data in less than 30 minutes from Geneva to
Chicago across the DataTAG provision,
corresponding to an average rate of 5.44 Gbits/s
using a single TCP (Reno) stream
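As a sanity check, the quoted volumes and average rates for the two records are consistent with the stated durations (decimal units assumed; the timings below are back-of-envelope, not from the slides):

```python
def avg_rate_gbps(bytes_moved: float, seconds: float) -> float:
    """Average throughput in Gbit/s for a transfer."""
    return bytes_moved * 8 / seconds / 1e9

# Feb 2003 record: 1 TB at 2.38 Gbit/s -> well under an hour
secs = 1e12 * 8 / 2.38e9          # ~3361 s, about 56 minutes
print(f"1 TB at 2.38 Gbit/s takes {secs / 60:.0f} min")

# Oct 2003 record: 1.1 TB at 5.44 Gbit/s -> well under 30 minutes
secs2 = 1.1e12 * 8 / 5.44e9       # ~1618 s, about 27 minutes
print(f"1.1 TB at 5.44 Gbit/s takes {secs2 / 60:.0f} min")
```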
7 So how did we do that?
- Management of the end-to-end connection
- Memory-to-memory transfer; no disk system involved
- Processor speed and system bus characteristics
- TCP configuration: window size and frame size (MTU)
- Network Interface Card and associated driver and their configuration
- End-to-end no-loss environment from CERN to Sunnyvale!
- At least a 2.5 Gbit/s capacity pipe on the end-to-end path
- A single TCP connection on the end-to-end path
- No real user application
- That's to say - not the usual user experience!
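Window size matters because TCP can only keep a pipe full when its window covers the bandwidth-delay product of the path. A rough sketch; the ~180 ms CERN-Sunnyvale RTT used here is an assumption for illustration, not a figure from the slides:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

# Assumed path: 2.5 Gbit/s capacity, ~180 ms round-trip CERN - Sunnyvale
window = bdp_bytes(2.5e9, 0.180)
print(f"Required TCP window: {window / 1e6:.1f} MB")

# A common default window of 64 KB would use only a sliver of the pipe:
util = 64e3 / window
print(f"64 KB window fills {util:.2%} of the path")
```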
8 Realistically, what's the problem? Why do network research?
- End system issues:
  - Network Interface Card and driver, and their configuration
  - TCP and its configuration
  - Operating system and its configuration
  - Disk system
  - Processor speed
  - Bus speed and capability
- Network infrastructure issues:
  - Obsolete network equipment
  - Configured bandwidth restrictions
  - Topology
  - Security restrictions (e.g., firewalls)
  - Sub-optimal routing
- Transport protocols
- Network capacity and the influence of others!
  - Many, many TCP connections
  - Mice and elephants on the path
9 End Hosts: Buses, NICs and Drivers
- Use UDP packets to characterise Intel PRO/10GbE Server Adaptor
- SuperMicro P4DP8-G2 motherboard
- Dual Xeon 2.2 GHz CPU
- 400 MHz system bus
- 133 MHz PCI-X bus
(Plots: Throughput; Latency; Bus Activity)
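A UDP characterisation of the kind described works by blasting fixed-size datagrams and comparing send rate with what arrives. A minimal loopback sketch in that spirit (not the UDPmon tool itself; port and packet counts are arbitrary):

```python
import socket
import threading
import time

def udp_blast(pkt_size=1472, n_pkts=5000, port=9990):
    """Send n_pkts UDP datagrams over loopback; report send rate and loss."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", port))
    rx.settimeout(0.5)
    received = 0

    def receiver():
        nonlocal received
        while True:
            try:
                rx.recv(pkt_size + 64)
                received += 1
            except socket.timeout:
                return            # 0.5 s of silence: sender has finished

    t = threading.Thread(target=receiver)
    t.start()
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\0" * pkt_size
    start = time.perf_counter()
    for _ in range(n_pkts):
        tx.sendto(payload, ("127.0.0.1", port))
    elapsed = time.perf_counter() - start
    t.join()
    rx.close()
    tx.close()
    rate_mbps = n_pkts * pkt_size * 8 / elapsed / 1e6
    return rate_mbps, n_pkts - received

rate_mbps, lost = udp_blast()
print(f"send rate {rate_mbps:.0f} Mbit/s, {lost} packets lost")
```

On loopback, any loss comes from socket-buffer overflow rather than the network, which is itself a reminder that the end host is part of the measurement.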
10 End Hosts: Understanding NIC Drivers
- Linux driver basics: TX
  - Application system call
  - Encapsulation in UDP/TCP and IP headers
  - Enqueue on device send queue
  - Driver places information in DMA descriptor ring
  - NIC reads data from main memory via DMA and sends on wire
  - NIC signals to processor that TX descriptor sent
- Linux driver basics: RX
  - NIC places data in main memory via DMA to a free RX descriptor
  - NIC signals RX descriptor has data
  - Driver passes frame to IP layer and cleans RX descriptor
  - IP layer passes data to application
- Linux NAPI driver model
  - On receiving a packet, NIC raises interrupt
  - Driver switches off RX interrupts and schedules RX DMA ring poll
  - Frames are pulled off the DMA ring and processed up to the application
  - When all frames are processed, RX interrupts are re-enabled
  - Dramatic reduction in RX interrupts under load
- "Improving the performance of a Gigabit Ethernet driver under Linux": http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/
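The NAPI gain is easy to see with a toy model (the burst sizes below are invented for illustration): in pure interrupt mode every frame raises an interrupt, while under NAPI one interrupt starts a poll that drains the whole burst.

```python
def interrupts_needed(burst_sizes, napi=False):
    """Toy model of RX interrupt count for a sequence of frame bursts.

    Without NAPI, every frame raises an interrupt. With NAPI, only the
    first frame of a burst does; the rest are drained by the ring poll.
    """
    if napi:
        return sum(1 for b in burst_sizes if b > 0)   # one per burst
    return sum(burst_sizes)                           # one per frame

# Assumed load: frames arriving between successive polls, growing with load
bursts = [1, 4, 16, 64, 64, 64]
print("interrupt-per-frame mode:", interrupts_needed(bursts))
print("NAPI polling mode:      ", interrupts_needed(bursts, napi=True))
```

The heavier the load, the bigger the bursts, so the interrupt saving grows exactly when the CPU needs it most.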
11 Protocols: TCP (Reno) Performance
- AIMD and high bandwidth, long distance networks
- Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm
- For each ACK in an RTT without loss:
  cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
- For each window experiencing loss:
  cwnd -> cwnd - b * cwnd (Multiplicative Decrease, b = 1/2)
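The per-ACK rule sums to roughly one extra segment per RTT, because a window of cwnd packets returns about cwnd ACKs in one RTT. A quick numerical check of that, plus the halving on loss:

```python
def one_rtt_reno(cwnd, a=1.0):
    """Apply the per-ACK increase cwnd += a/cwnd once per ACK in an RTT.

    With cwnd packets in flight, one RTT returns about int(cwnd) ACKs.
    """
    for _ in range(int(cwnd)):
        cwnd += a / cwnd
    return cwnd

w = 100.0
w2 = one_rtt_reno(w)
print(f"cwnd after one loss-free RTT: {w2:.3f}")   # ~101: +a per RTT

w_after_loss = w2 - 0.5 * w2                       # b = 1/2
print(f"cwnd after one loss event:    {w_after_loss:.1f}")
```

So on a long fat pipe the window is halved in one RTT but rebuilt at only one segment per RTT, which is the heart of the problem the next slide addresses.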
12 Protocols: HighSpeed TCP & Scalable TCP
- Adjusting the AIMD algorithm: TCP Reno
  - For each ACK in an RTT without loss:
    cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
  - For each window experiencing loss:
    cwnd -> cwnd - b * cwnd (Multiplicative Decrease, b = 1/2)
- HighSpeed TCP
  - a and b vary depending on current cwnd, where
  - a increases more rapidly with larger cwnd, and as a consequence returns to the optimal cwnd size sooner for the network path, and
  - b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
- Scalable TCP
  - a and b are fixed adjustments for the increase and decrease of cwnd,
  - such that the increase is greater than TCP Reno, and the decrease on loss is less than TCP Reno
13 Protocols: HighSpeed TCP & Scalable TCP Success
HighSpeed TCP implemented by Gareth (Manc). Scalable TCP implemented by Tom Kelly (Camb). Integration of stacks into the DataTAG kernel by Yee (UCL) and Gareth.
14 Some Measurements of Throughput: CERN - SARA
- Using the GÉANT backup link
- 1 GByte file transfers
- Blue: data; Red: TCP ACKs
- Standard TCP
  - Average throughput 167 Mbit/s
  - Users see 5 - 50 Mbit/s!
- HighSpeed TCP
  - Average throughput 345 Mbit/s
- Scalable TCP
  - Average throughput 340 Mbit/s
15 Users, the Campus & the MAN 1
Pete White, Pat Meyrs
- NNW to SJ4 Access: 2.5 Gbit PoS - hits 1 Gbit 50%
- MAN to NNW Access: 2 x 1 Gbit Ethernet
16 Users, the Campus & the MAN 2
- Message:
  - Continue to work with your network group
  - Understand the traffic levels
  - Understand the network topology
- LMN to site 1 Access: 1 Gbit Ethernet
- LMN to site 2 Access: 1 Gbit Ethernet
17 10 GigEthernet: Tuning PCI-X
18 10 GigEthernet at SC2003 BW Challenge (Phoenix)
- Three server systems with 10 GigEthernet NICs
- Used the DataTAG altAIMD stack, 9000 byte MTU
- Streams from SLAC/FNAL booth in Phoenix to:
  - Palo Alto PAIX: 17 ms rtt
  - Chicago Starlight: 65 ms rtt
  - Amsterdam SARA: 175 ms rtt
19 Helping Real Users 1: Radio Astronomy VLBI
PoC with NRNs & GEANT: 1024 Mbit/s, 24 on 7 - NOW
20 VLBI Project: Throughput, Jitter, 1-way Delay
- 1472 byte packets, Manchester -> Dwingeloo (JIVE)
- FWHM 22 µs (B2B 3 µs)
- 1-way delay: note the packet loss (points with zero 1-way delay)
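Measurements like these come from timestamped, sequence-numbered test packets: given (sequence, send time, receive time) records and synchronised clocks, the 1-way delay, its spread, and the loss positions fall out directly. A sketch with invented records, not the measured VLBI values:

```python
# Illustrative records: (sequence number, send time s, receive time s).
# A sequence number absent from the records is a lost packet.
records = [(0, 0.000, 0.015), (1, 0.001, 0.016), (3, 0.003, 0.018),
           (4, 0.004, 0.0162), (6, 0.006, 0.021)]

one_way = [rx - tx for _, tx, rx in records]          # per-packet 1-way delay
seqs = [s for s, _, _ in records]
lost = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))

mean = sum(one_way) / len(one_way)
spread = max(one_way) - min(one_way)                  # crude jitter measure
print(f"mean 1-way delay {mean * 1e3:.2f} ms, spread {spread * 1e6:.0f} µs")
print("lost sequence numbers:", lost)
```

In practice the two clocks must be disciplined (e.g. by GPS or NTP) or the "1-way delay" absorbs their offset; the zero-delay points on the slide are simply how lost packets were plotted.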
21 VLBI Project: Packet Loss Distribution
- Measure the time between lost packets in the time series of packets sent
- Lost 1410 in 0.6 s
- Is it a Poisson process?
- Assume Poisson is stationary: λ(t) = λ
- Use prob. density function: P(t) = λ e^(-λt)
- Mean λ = 2360 /s (mean spacing 426 µs)
- Plot log: slope -0.0028; expect -0.0024
- Could be an additional process involved
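If loss really is Poisson, the gaps between losses are exponentially distributed, so a log plot of the gap density is a straight line with slope -λ; with λ = 2360 /s that is -0.00236 per µs, which is where the "expect -0.0024" comes from. A check on synthetic Poisson data (the data here is generated, not the measured trace):

```python
import random

random.seed(42)
lam = 2360.0                                   # loss rate per second (slide)
gaps = [random.expovariate(lam) for _ in range(100_000)]

mean_gap = sum(gaps) / len(gaps)               # should be ~1/λ ~ 424 µs
est_lam = 1.0 / mean_gap
slope_per_us = -est_lam / 1e6                  # slope of ln P(t), t in µs

print(f"mean gap {mean_gap * 1e6:.0f} µs (slide: 426 µs)")
print(f"log slope {slope_per_us:.4f} per µs (expected -0.0024)")
```

The measured slope of -0.0028 sits outside what a pure Poisson process at this rate produces, supporting the slide's suggestion of an additional loss process.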
22 VLBI Traffic Flows: Only testing!
- Manchester, NetNorthWest & SuperJANET access links
- Two 1 Gbit/s
- Access links: SJ4 to GÉANT, GÉANT to SurfNet
23 Throughput & PCI transactions on the Mark5 PC
- Mark5 uses Supermicro P3TDLE
- 1.2 GHz PIII
- Mem bus 133/100 MHz
- 2 x 64 bit, 66 MHz PCI
- 4 x 32 bit, 33 MHz PCI
(Diagram: Ethernet NIC; IDE Disc Pack; SuperStor Input Card; Logic Analyser Display)
24 PCI Activity: Read multiple data blocks, 0 wait
- Read 999424 bytes
- Each data block:
  - Setup CSRs
  - Data movement
  - Update CSRs
- For 0 wait between reads:
  - Data blocks 600 µs long, take 6 ms
  - Then 744 µs gap
- PCI transfer rate 1188 Mbit/s (148.5 Mbytes/s)
- Read_sstor rate 778 Mbit/s (97 Mbyte/s)
- PCI bus occupancy 68.44%
- Concern about Ethernet traffic: 64 bit 33 MHz PCI needs 82% for 930 Mbit/s. Expect 360 Mbit/s
(Plot annotations: Data transfer; Data block 131,072 bytes; CSR access; PCI burst 4096 bytes)
25 PCI Activity: Read Throughput
- Flat, then 1/t dependence
- 860 Mbit/s for read blocks > 262144 bytes
- CPU load 20%
- Concern about CPU load needed to drive Gigabit link
26 Helping Real Users 2: HEP - BaBar & CMS Application Throughput
27 BaBar Case Study: Disk Performance
- BaBar disk server:
  - Tyan Tiger S2466N motherboard
  - 1 x 64 bit, 66 MHz PCI bus
  - Athlon MP2000 CPU
  - AMD-760 MPX chipset
  - 3Ware 7500-8 RAID5
  - 8 x 200 GB Maxtor IDE 7200 rpm disks
- Note the VM parameter readahead max
- Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
- Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) - not as fast as RAID0
28 BaBar: Serial ATA RAID Controllers
29 BaBar Case Study: RAID Throughput & PCI Activity
- 3Ware 7500-8 RAID5, parallel EIDE
- 3Ware forces PCI bus to 33 MHz
- BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
- Disk-disk throughput with bbcp: 40-45 Mbytes/s (320 - 360 Mbit/s)
- PCI bus effectively full!
(Plots: Read from RAID5 disks; Write to RAID5 disks)
30 MB-NG SuperJANET4 Development Network: BaBar Case Study
- Status / Tests:
  - Manc host has DataTAG TCP stack
  - RAL host now available
- BaBar-BaBar mem-mem
- BaBar-BaBar real data, MB-NG
- BaBar-BaBar real data, SJ4
- mbng-mbng real data, MB-NG
- mbng-mbng real data, SJ4
- Different TCP stacks already installed
31 Study of Applications: MB-NG SuperJANET4 Development Network
32 24 Hours HighSpeed TCP mem-mem
- TCP mem-mem lon2-man1
- Tx 64, Tx-abs 64
- Rx 64, Rx-abs 128
- 941.5 Mbit/s ± 0.5 Mbit/s
33 Gridftp Throughput: HighSpeedTCP
- Int Coal 64, 128
- Txqueuelen 2000
- TCP buffer 1 MByte (rtt x BW = 750 kbytes)
- Interface throughput
- ACKs received
- Data moved
- 520 Mbit/s
- Same for B2B tests
- So it's not that simple!
34 Gridftp Throughput: Web100
- Throughput, Mbit/s
- See alternate 600/800 Mbit/s and zero
- Cwnd smooth
- No dup ACK / send stall / timeouts
35 HTTP data transfers: HighSpeed TCP
- Apache web server, out of the box!
- Prototype client - curl HTTP library
- 1 MByte TCP buffers
- 2 GByte file
- Throughput 72 MBytes/s
- Cwnd: some variation
- No dup ACK / send stall / timeouts
36 More Information: Some URLs
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit writeup: http://www.hep.man.ac.uk/rich/net
- Motherboard and NIC tests: www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
- TCP tuning information may be found at http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html