Title: Huge Data Transfer Experimentation over Lightpaths
1 Huge Data Transfer Experimentation over Lightpaths
- Corrie Kost, Steve McDonald
- TRIUMF
- Wade Hong
- Carleton University
2 Motivation
- LHC expected to come online in 2007
- data rates expected to exceed a petabyte a year
- large Canadian HEP community involved in the ATLAS experiment
- establishment of a Canadian Tier 1 at TRIUMF
  - replicate all or part of the experimental data
- need to be able to transfer huge data volumes to our Tier 1
3 TRIUMF
- Tri-University Meson Facility
- Canada's laboratory for particle and nuclear physics
- operated as a joint venture by UofA, UBC, Carleton U, SFU, and UVic
- located on the UBC campus in Vancouver
- five-year funding for 2005 - 2009 announced in the federal budget
- planned as the Canadian ATLAS Tier 1
4 TRIUMF
5 Lightpaths
- a significant design principle of CA*net 4 is the ability to provide dedicated point-to-point bandwidth over lightpaths under user control
- the similar philosophy of SURFnet provides the ability to establish an end-to-end lightpath from Canada to CERN
- optical bypass isolates huge data transfers from other users of the R&E networks
- lightpaths permit the extension of Ethernet LANs into the wide area
6 Ethernet - local to global
- the de facto LAN technology
- original Ethernet
  - shared media, half duplex, distance limited by the protocol
- modern Ethernet
  - point to point, full duplex, switched, distance limited by the optical components
  - cost effective
7 Why native Ethernet Long Haul?
- more than 90% of Internet traffic originates from an Ethernet LAN
- data traffic on the LAN increases due to new applications
- Ethernet services with incremental bandwidth offer new business opportunities for carriers
- so why not native Ethernet?
  - scalability, reliability, service guarantees
  - all of the above are active research areas
- native Ethernet long-haul connections can be used today as a complement to the routed networks, not a replacement
8 Experimentation
- experimenting with 10 GbE hardware for the past 3 years
- engaged 10 GbE NIC and network vendors
- mostly interested in disk-to-disk transfers with commodity hardware
- tweaking the performance of Linux-based disk servers
- engaged hardware vendors to help build systems
- testing data transfers over dedicated lightpaths
- engineering solutions for the e2e lightpath last mile, especially for 10 GbE
9 2002 Activities
- established the first end-to-end trans-Atlantic lightpath between TRIUMF and CERN for iGrid 2002
  - bonded dual GbEs transported across a 2.5 Gbps OC-48
- initial experimentation with 10 GbE
  - alpha Intel 10GbE LR NICs, Extreme Black Diamond 6808 with 10GbE LRi blades
- transferred ATLAS DC data from TRIUMF to CERN using bbftp and tsunami
10 Live continent to continent
- e2e lightpath up and running Sept 20, 20:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
11 iGrid 2002 Topology
12 Exceeding a Gbps (Tsunami)
13 2003 Activities
- CANARIE-funded directed research project (CA*net 4 IGT) to continue the experimentation
  - Canadian HEP community and CERN
- GbE lightpath experimentation between CERN and UofA for real-time remote farms
- data transfers over a GbE lightpath between CERN and Carleton U, moving 700 GB of ATLAS FCAL test beam data
  - took 6.5 hrs versus 67 days
14 Current IGT Topology
15 2003 Activities
- re-establishment of the 10 GbE experiments
  - newer Intel 10 GbE NICs, Force10 Networks E600 switches, IXIA network testers, servers from Intel and the CERN OpenLab
- established the first native 10 GbE end-to-end trans-Atlantic lightpath between Carleton U and CERN
- demonstrated at ITU Telecom World 2003
16 Demo during ITU Telecom World 2003
10 GbE WAN PHY over an OC-192 circuit using lightpaths provided by SURFnet and CA*net 4
9.24 Gbps using traffic generators
6 Gbps using UDP on PCs
5.65 Gbps using TCP on PCs
17 Results on the transatlantic 10 GbE
Single-stream UDP throughput
Single-stream TCP throughput
Data rates limited by the PC, even for memory-to-memory tests
UDP uses fewer resources than TCP on high bandwidth-delay-product networks
18 2004-2005 Activities
- with the arrival of the third CA*net 4 lambda in the summer of 2004, looked at establishing a 10 GbE lightpath from TRIUMF
  - Neterion (S2io) Xframe 10 GbE NICs, Foundry NetIron 40Gs, Foundry NetIron 1500, servers from Sun Microsystems, and custom-built disk servers from Ciara Technologies
- distance problem between TRIUMF and the CA*net 4 OME 6500 in Vancouver
  - XENPAK 10 GbE WAN PHY at 1310 nm
19 2004-2005 Activities
- testing data transfers between TRIUMF and CERN, and between TRIUMF and Carleton U, over a 10 GbE lightpath
- experimenting with robust data transfers
- attempting to maximize disk I/O performance from Linux-based disk servers
  - experimenting with disk controllers and processors
- ATLAS Service Challenges in 2005
20 2004-2005 Activities
- exploring a more permanent 10 GbE lightpath to CERN and lightpaths from TRIUMF to the Canadian Tier 2 ATLAS sites
- CANARIE playing a lead role in helping to facilitate this
- still need to solve some last-mile lightpath issues
22 Xeon-based Servers
- Dual 3.2 GHz Xeons
- 4 GB memory
- 4 x 3Ware 9500S-4LP (8)
- 16 x SATA150 120 GB drives
- 40 GB Hitachi 14R9200 drives
- Intel 10GbE PXLA8590LR
23 Some Xeon Server I/O Results
- read a pair of 80 GB (xfs) files for 67 hours: 120 TB total, average 524 MB/s (software RAID0 of 8 SATA disks on each of a pair of hardware-RAID0 RocketRaid 1820A controllers on Storm2)
- 10GbE S2io NICs back-to-back for 17 hrs: 10 TB, average 180 MB/s (from Storm2 to Storm1 with software RAID0 of 4 disks on each of 3 3Ware 9500S-4 controllers in RAID0)
- 10GbE lightpath, Storm2 to an Itanium machine at CERN: 10, 15, 20, 25 bbftp streams averaged 18, 24, 27, 29 MB/s disk-to-disk (only 1 disk at CERN, max write speed 48 MB/s)
- continued Storm1 to Storm2 testing: many sustainability problems encountered and resolved; details available on request. Don't do test flights too close to the ground:
echo 100000 > /proc/sys/vm/min_free_kbytes
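A sustained sequential-read benchmark like the 67-hour test above can be approximated with a simple loop; this is only a minimal sketch, and the file names and block size below are illustrative assumptions, not the exact procedure used for these numbers.
  # read two large xfs files back to back; with 80 GB files on a 4 GB machine,
  # page-cache reuse is negligible, so dd's reported rate tracks the array speed
  for f in /data/test0.dat /data/test1.dat; do
      dd if="$f" of=/dev/null bs=1M
  done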
24 Opteron-based Servers
- Dual 2.4 GHz Opterons
- 4 GB memory
- 1 x WD800JB 80 GB HD
- 16 x SATA 300 GB HD (Seagate ST3300831AS)
- 4 x 4-port Infiniband-SATA
- 2 x RocketRaid 1820A
- 10GbE NIC
- 2 PCI-X at 133 MHz
- 2 PCI-X at 100 MHz
Note: 64-bit at 133 MHz = 8.4 Gb/s
25 Multilane Infiniband SATA
26 Server Specifications
27 The Parameters
- 5 types of controllers
- number of controllers to use (1 to 4)
- number of disks/controller (1 to 16)
- RAID0, RAID5, RAID6, JBOD
- dual or quad Opteron systems
- 4-6 possible PCI-X slots (1 reserved for 10GigE)
- Linux kernels (2.6.9, 2.6.10, 2.6.11)
- many tuning parameters (in addition to WAN tuning), e.g. (a combined sketch follows this list)
  - blockdev --setra 8192 /dev/md0
  - chunk-size in mdadm (1024)
  - /sbin/setpci -d 8086:1048 e6.b=2e
    (modifies the MMRBC field in PCI-X configuration space for vendor 8086, device 1048, to increase the transmit burst length on the bus)
  - echo 100000 > /proc/sys/vm/min_free_kbytes
  - ifconfig eth3 txqueuelen 100000
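A minimal sketch of applying these tunings together on one of the disk servers; the device names (/dev/sd[a-h]1, /dev/md0, eth3) and the RAID0 layout are illustrative assumptions, and the 8086:1048 selector applies only to the Intel 10 GbE NIC.
  # build the software RAID0 with a 1024 KB chunk size (illustrative devices)
  mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=1024 /dev/sd[a-h]1
  # raise the read-ahead on the md device to 8192 sectors
  blockdev --setra 8192 /dev/md0
  # bump MMRBC in PCI-X config space for the Intel NIC (vendor 8086, device 1048)
  /sbin/setpci -d 8086:1048 e6.b=2e
  # keep more free memory around so the VM does not stall under heavy I/O
  echo 100000 > /proc/sys/vm/min_free_kbytes
  # lengthen the transmit queue on the 10 GbE interface
  ifconfig eth3 txqueuelen 100000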
28 The SATA Controllers
3Ware-9500S-4
3Ware-9500S-8
Areca 1160
Highpoint RocketRaid 1820A
SuperMicro DAC-SATA-MV8
29 Areca 1160 Details
Extensive tests were done by tweakers.net on the Areca and 8 other controllers:
www.tweakers.net/benchdb/search/product/104629
www.tweakers.net/reviews/557
30 Why do we need RAID 6?
- our experience is that 1 out of 30 disks fails every 6 months
- a RAID5 rebuild of 15 300 GB disks in full operation takes 100 hrs
- probability that a second disk fails during the rebuild is about 1% (a rough check is sketched at the end of this slide)
- Areca-1160 tests with 15 300 GB disks (1 broken)
  - RAID5 or 6 fast build in 100 minutes
  - RAID5 or 6 background build up to 100 hrs on a busy system
- acid test
  - RAID6: removed a disk while very busy, degraded to RAID5; rebuild takes 100 hrs
  - removed a second disk, now critical; but after the RAID5 rebuilt, it proceeded to RAID6
RAID 5 with 4 TB of disk is too risky; the marginal cost of RAID 6 is minimal
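A rough back-of-the-envelope check of the 1% figure, using only the numbers above (1 in 30 disks failing per 6 months, a 100-hour rebuild, 14 surviving disks in a 15-disk set); this is a sketch of the estimate, not a measured value.
  # expected number of second-disk failures during one RAID5 rebuild
  awk 'BEGIN {
      p_per_hour = (1.0/30.0) / (182.5*24)   # per-disk failure probability per hour
      print 14 * 100 * p_per_hour            # 14 disks exposed for ~100 hours => ~0.011
  }'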
31 Some optimal I/O results
32 Some details of I/O
33 Puzzling I/O results
- read speeds for some 80 GB files are consistently 50% faster (620 MB/s) for an md0 of 2x8-disk RAID5 on RR 1820A controllers
- reads of other files are consistently lower
- read speeds are up to 50% faster using /dev/md0 over direct use of /dev/sda (e.g. Areca 1160 15-disk RAID5: 190 to 323 MB/s)
- bi-stable (fast/slow) read modes within the same file
- diskscrubb utility re-maps bad blocks - takes 2 hrs for a 300 GB drive
- weak blocks not being remapped are a possible reason for the slow spots
- a room temperature gradient was suspected - tested and discounted
34 Puzzling I/O results
Bi-stable state for reads: a useful tool to display which disk may be slowing the I/O is iostat -x 1

Device  rrqm/s wrqm/s     r/s   w/s    rsec/s wsec/s     rkB/s  wkB/s avgrq-sz avgqu-sz  await svctm  %util
hda       0.00   0.00    1.00  0.00      8.00   0.00      4.00   0.00     8.00     0.01   9.00  9.00   0.90
md0       0.00   0.00 1920.00  0.00 491520.00   0.00 245760.00   0.00   256.00     0.00   0.00  0.00   0.00
sda       0.00   0.00  239.00  0.00  61440.00   0.00  30720.00   0.00   257.07    10.88  45.31  4.19 100.10  BAD
sdb       0.00   0.00  238.00  0.00  61440.00   0.00  30720.00   0.00   258.15     2.80  11.76  2.46  58.50
sdc       0.00   0.00  240.00  0.00  61440.00   0.00  30720.00   0.00   256.00     2.85  11.91  2.40  57.70
sdd       0.00   0.00  240.00  0.00  61440.00   0.00  30720.00   0.00   256.00     3.01  12.61  2.58  61.80
sde       0.00   0.00  237.00  0.00  61440.00   0.00  30720.00   0.00   259.24     2.94  12.39  2.57  61.00
sdf       0.00   0.00  236.00  0.00  61440.00   0.00  30720.00   0.00   260.34     2.96  12.47  2.61  61.60
sdg       0.00   0.00  239.00  0.00  61440.00   0.00  30720.00   0.00   257.07     3.04  12.77  2.51  60.00
sdh       0.00   0.00  235.00  0.00  61440.00   0.00  30720.00   0.00   261.45     3.02  12.72  2.49  58.60

When working properly this is...

Device  rrqm/s wrqm/s     r/s   w/s    rsec/s wsec/s     rkB/s  wkB/s avgrq-sz avgqu-sz  await svctm  %util
hda       0.00   1.00    1.00 37.00      8.00 304.00      4.00 152.00     8.21     0.09   2.37  0.21   0.80
md0       0.00   0.00 3520.00  0.00 901120.00   0.00 450560.00   0.00   256.00     0.00   0.00  0.00   0.00
sda       0.00   0.00  434.00  0.00 112640.00   0.00  56320.00   0.00   259.54     8.57  19.52  2.30 100.00
sdb       0.00   0.00  446.00  1.00 112640.00   0.00  56320.00   0.00   251.99     8.07  20.50  2.20  98.30
sdc       0.00   0.00  440.00  0.00 112640.00   0.00  56320.00   0.00   256.00     6.11  13.89  2.25  98.80
sdd       0.00   0.00  440.00  0.00 112640.00   0.00  56320.00   0.00   256.00     4.63  10.52  2.18  96.10
sde       0.00   0.00  439.00  0.00 112640.00   0.00  56320.00   0.00   256.58     4.64  10.54  2.18  95.70
sdf       0.00   0.00  441.00  0.00 112640.00   0.00  56320.00   0.00   255.42     6.26  14.22  2.25  99.20
sdg       0.00   0.00  437.00  0.00 112640.00   0.00  56320.00   0.00   257.76     4.89  11.11  2.19  95.80
sdh       0.00   0.00  439.00  0.00 112640.00   0.00  56320.00   0.00   256.58     5.21  11.84  2.19  96.10

Solution? Replace the slow disk with a normal one.
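A small filter like the following could watch for the slow state automatically; it assumes the column layout shown above (await in field 12) and a 20 ms threshold, both of which are illustrative.
  # flag any array member whose average wait time is far above its peers
  iostat -x 1 | awk '/^sd/ && $12+0 > 20 { print $1, "high await:", $12 }'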
35 I/O related results
Shows the drop in read speed depending on the location of the file. Reads are significantly faster on the outer part of the software RAID0 (JBOD) set.
36 TRIUMF-CERN GbE lightpath
- currently a GbE circuit, established since April 18th, 2005
- uses ONS 15454
- used primarily for the ATLAS Service Challenge
- hoping to have a 10 GbE lightpath to CERN by Jan/Feb 2006
37 ATLAS SC3 Setup
ATLAS Tier 1 Service Challenge 3 (primary contact: Reda Tafirout, tafirout@triumf.ca)
- 3 Ciara servers
  - Intel SE7520BD2 (dual GigE, PCI-X, etc.)
  - dual 3 GHz Nocona EM64T (1 MB cache / 800 MHz FSB)
  - 2 GB RAM
  - 1 system disk, 80 GB IDE (laptop)
  - 8 x 250 GB SATA150 (Seagate Barrac. NCQ, 8 MB)
  - 3Ware 9500S-8MI RAID5, Infiniband connections
- 1 Evetek server (management node)
  - dual Opteron 246, 2.0 GHz (800 MHz FSB)
  - 2 GB RAM
  - 1 system disk, WD 80 GB SATA
  - 2 x 250 GB WD SATA, 3Ware 9500S-LP 4 channels
  - Adaptec Ultra160 SCSI 29160-LP
- tape system
  - 2 x IBM 4560SLX SDLT libraries, each with 1 SDLT drive and 26 SDLT tapes; both have a fibre channel interface card
All systems are running FC3 x86_64 with a 2.6 kernel, and dCache for disk management (with gridftp and SRM access doors)
38 TRIUMF-Carleton U lightpath
Servers
- 5 x dual Opteron 250 (2.4 GHz), 2 GB memory, 16 x 300 GB SATA drives
- SunFire V40z, quad Opteron 850 (2.4 GHz), 8 GB memory, 3 x 146 GB SCSI
Network cards
- Intel PRO/10GbE-LR
- S2io/Neterion Xframe
RAID and SATA controllers
- 3Ware 9500S 8-port
- RocketRaid 1820A 8-port
- SuperMicro MV8 8-port
- Areca 1160 16-port
Network
- MRV CWDM
- Foundry NI1500, NI40G
- 10G-ER 1550 nm LAN PHY, 10G-LR 1310 nm LAN/WAN PHY
- CA*net 4 OME 6500
39 Transfer results over 10 GbE
GbE transfers, disk-to-disk, between TRIUMF and Carleton (Ottawa) over the 10G circuit: 115 MB/s sustained for 5 days, equivalent to 46 TB
Iperf between TRIUMF and Ottawa, memory-to-memory, for 1 week: 3.74 Gbps averaged (460 MB/s), 350 TB transferred (errors ignored)
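For reference, a memory-to-memory test of this kind could be driven with iperf roughly as follows; the window size, stream count, and hostnames are illustrative assumptions, not the exact settings used for the week-long run.
  # on the receiver (Ottawa end), assumed settings
  iperf -s -w 16M
  # on the sender (TRIUMF end): 4 parallel TCP streams, 16 MB windows,
  # report every 60 s, run for 7 days (604800 s)
  iperf -c ottawa-host -w 16M -P 4 -i 60 -t 604800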
40 Transfer results over 10 GbE
Disk-to-memory, back-to-back, short distance, 24 hrs, single TCP stream: average of 2.4 Gbps, 300 MB/s (max disk read 361 MB/s, 16-disk RAID5)
Disk-to-disk, back-to-back, short distance, 76 TB in 4 days, bbftp with 5 TCP streams: average of 1.8 Gbps, 220 MB/s (max disk write 303 MB/s, 15-disk Areca RAID5; max disk read 361 MB/s, 16 disks as 2x8 RR1820A RAID5)
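A bbftp transfer with 5 parallel TCP streams, as in the disk-to-disk test above, could be invoked roughly as follows; the user, hosts, and file paths are illustrative, and option details may differ between bbftp versions.
  # push one large file with 5 parallel TCP streams (illustrative names)
  bbftp -u atlas -p 5 -e "put /data/testfile /scratch/testfile" remote-host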
41 Pumping data into a 10GbE circuit
Bottleneck - buffering? What are the solutions?
Zero-copy
42 Conclusions/Observations
- dual Opterons may still be I/O limited - exploring hot-wired quad Opterons
- SATA drives may need more quality control/screening/repair
- RAID 5 for 1-4 TB, RAID 6 for larger sets (now up to 24 disks/controller)
- some cards have a 2 TB limit
- GbE delivers stable disk-to-disk long distance transfers at 120 MB/s
- there are critical tuning requirements - servers cannot be used blindly
- achieving robustness is not easy!
- lightpaths, however, make this much easier!
43 Further Explorations
- 10 GbE network infrastructure
  - over the past 3 years the 10 GbE networking vendor space has matured
  - perhaps time to acquire something more permanent - under consideration
  - XFP-based optics are the latest trend
- re-visit evaluation of different data transfer protocols
44 Further Explorations
- ATA over Ethernet
  - had some discussions with Coraid
  - explore how Ethernet-attached drives would behave over long-haul networks
- iSCSI
  - iSCSI over long-haul networks
  - Sun V40z with Solaris 10 (native iSCSI stack)
  - demonstrated I/O over 500 MB/s
45 Further Explorations
- 10 GbE NICs
  - NICs with TOE
  - Myrinet recently announced new lower-cost 10GbE-compatible NICs
- PCI Express
  - emergence of PCI-E disk controllers and NICs
46 Thank You!
kost@triumf.ca
mcdonald@triumf.ca
xiong@physics.carleton.ca