Huge Data Transfer Experimentation over Lightpaths

About This Presentation
Title:

Huge Data Transfer Experimentation over Lightpaths

Transcript and Presenter's Notes

Title: Huge Data Transfer Experimentation over Lightpaths


1
Huge Data Transfer Experimentation over Lightpaths
  • Corrie Kost, Steve McDonald
  • TRIUMF
  • Wade Hong
  • Carleton University

2
Motivation
  • LHC expected to come on line in 2007
  • data rates expected to exceed a petabyte a year
  • large Canadian HEP community involved in the
    ATLAS experiment
  • establishment of a Canadian Tier 1 at TRIUMF
  • replicate all/part of the experimental data
  • need to be able to transfer huge data sets to our Tier 1

3
TRIUMF
  • Tri University Meson Facility
  • Canada's Laboratory for Particle and Nuclear Physics
  • operated as a joint venture by UofA, UBC,
    Carleton U, SFU, and UVic
  • located on the UBC campus in Vancouver
  • five year funding from 2005 - 2009 announced in
    federal budget
  • planned as the Canadian ATLAS Tier 1

4
TRIUMF
5
Lightpaths
  • a significant design principle of CAnet 4 is the
    ability to provide dedicated point to point
    bandwidth over lightpaths under user control
  • a similar philosophy at SURFnet provides the ability to establish an end to end lightpath from Canada to CERN
  • optical bypass isolates huge data transfers from other users of the R&E networks
  • lightpaths permit the extension of ethernet LANs
    to the wide area

6
Ethernet local to global
  • the de facto LAN technology
  • original ethernet
  • shared media, half duplex, distance limited by
    protocol
  • modern ethernet
  • point to point, full duplex, switched, distance
    limited by the optical components
  • cost effective

7
Why native Ethernet Long Haul?
  • more than 90% of the Internet traffic originates from an Ethernet LAN
  • data traffic on the LAN increases due to new
    applications
  • Ethernet services with incremental bandwidth
    offer new business opportunities for carriers
  • why not native Ethernet?
  • scalability, reliability, service guarantees
  • all the above are research areas
  • native Ethernet long haul connections can be used
    today as a complement to the routed networks, not
    a replacement

8
Experimentation
  • experimenting with 10 GbE hardware for the past 3
    years
  • engaged 10 GbE NIC and network vendors
  • mostly interested in disk to disk transfers with
    commodity hardware
  • tweaking performance of Linux-based disk servers
  • engaged hardware vendors to help build systems
  • testing data transfers over dedicated lightpaths
  • engineering solutions for the e2e lightpath last
    mile
  • especially for 10 GbE

9
2002 Activities
  • established the first end to end trans-atlantic
    lightpath between TRIUMF and CERN for iGrid 2002
  • bonded dual GbEs transported across a 2.5 Gbps
    OC-48
  • initial experimentation with 10GbE
  • alpha Intel 10GbE LR NICs, Extreme Black Diamond
    6808 with 10GbE LRi blades
  • transferred ATLAS DC data from TRIUMF to CERN using bbftp and tsunami

10
Live continent to continent
  • e2e lightpath up and running Sept 20, 20:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
11
iGrid 2002 Topology
12
Exceeding a Gbps
(Tsunami)
13
2003 Activities
  • CANARIE-funded directed research project, CAnet 4 IGT, to continue with experimentation
  • Canadian HEP community and CERN
  • GbE lightpath experimentation between CERN and
    UofA for real-time remote farms
  • data transfers over a GbE lightpath between CERN
    and Carleton U for transferring 700GB of ATLAS
    FCAL test beam data
  • took 6.5 hrs versus 67 days

14
Current IGT Topology
15
2003 Activities
  • re-establishment of 10 GbE experiments
  • newer Intel 10 GbE NICs and Force 10 Networks
    E600 switches, IXIA network testers, servers from
    Intel and CERN OpenLab
  • established first native 10GbE end to end
    trans-atlantic lightpath between Carleton U and
    CERN
  • demonstrated at ITU Telecom World 2003

16
Demo during ITU Telecom World 2003
10 GbE WAN PHY over an OC-192 circuit using
lightpaths provided by SURFnet and CAnet 4
9.24 Gbps using traffic generators
6 Gbps using UDP on PCs
5.65 Gbps using TCP on PCs
17
Results on the transatlantic 10GbE
Single stream UDP throughput
Single stream TCP throughput
Data rates limited by the PC, even for memory to
memory tests
UDP uses fewer resources than TCP on high bandwidth-delay product networks
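For context, a minimal sketch of the kind of single-stream memory-to-memory test behind numbers like these, in the style of the iperf runs mentioned later in the talk; the window sizes, durations, and offered load are illustrative assumptions, not the actual settings used:

# receiver (e.g. the CERN end): TCP and UDP servers with enlarged socket buffers
iperf -s -w 16M
iperf -s -u -w 4M
# sender (e.g. the TRIUMF end): single streams over the lightpath
iperf -c cern-10g -w 16M -t 300          # one TCP stream, large window for the high bandwidth-delay product path
iperf -c cern-10g -u -b 6000M -t 300     # one UDP stream at an offered load of roughly 6 Gbps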
18
2004-2005 Activities
  • with the arrival of the third CAnet 4 lambda in the summer of 2004, we looked at establishing a 10 GbE lightpath from TRIUMF
  • Neterion (s2io) Xframe 10 GbE NICs, Foundry
    NetIron 40Gs, Foundry NetIron 1500, servers from
    Sun Microsystems, and custom built disk servers
    from Ciara Technologies.
  • distance problem between TRIUMF and the CAnet 4
    OME 6500 in Vancouver
  • XENPAK 10 GbE WAN PHY at 1310nm

19
2004-2005 Activities
  • testing data transfers between TRIUMF and CERN,
    and TRIUMF and Carleton U over a 10 GbE lightpath
  • experimenting with robust data transfers
  • attempt to maximize disk i/o performance from
    Linux-based disk servers
  • experimenting with disk controllers and
    processors
  • ATLAS Service Challenges in 2005

20
2004-2005 Activities
  • exploring a more permanent 10 GbE lightpath to
    CERN and lightpaths to Canadian Tier 2 ATLAS
    sites from TRIUMF
  • CANARIE playing a lead role in helping to
    facilitate
  • still need to solve some last mile lightpath
    issues

21
(No Transcript)
22
Xeon-based Servers
  • Dual 3.2 GHz Xeons
  • 4GB memory
  • 4 3WARE 9500S-4LP (8)
  • 16 SATA150 120GB drives
  • 40GB HITACHI 14R9200 drives
  • INTEL 10GBE PXLA8590LR

23
Some Xeon Server I/O Results
  • read a pair of 80 GB (xfs) files for 67 hours: 120 TB read at an average of 524 MB/sec (software Raid0 of 8 SATA disks on each of a pair of hardware Raid0 RocketRaid 1820A controllers on Storm2); a simple read-benchmark sketch follows below
  • 10GbE S2io NICs back-to-back, 17 hrs, 10 TB, average 180 MB/sec (from Storm2 to Storm1 with software Raid0 of 4 disks on each of 3 3ware-9500S4 controllers in Raid0)
  • 10GbE lightpath, Storm2 to an Itanium machine at CERN: 10, 15, 20, 25 bbftp streams averaged 18, 24, 27, 29 MB/sec disk-to-disk (only 1 disk at CERN, max write speed 48 MB/sec)
  • continued Storm1 to Storm2 testing: many sustainability problems encountered and resolved (details available on request). Don't do test flights too close to the ground; one key fix:
echo 100000 > /proc/sys/vm/min_free_kbytes
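For reference, a minimal way to reproduce this kind of sequential-read measurement (not necessarily the harness actually used); the file name, sizes, and the md0 device are placeholders:

# time a long sequential read of a large test file on the xfs volume
time dd if=/raid/test-80GB of=/dev/null bs=1M
# or read the first 80 GB of the software-RAID device directly
time dd if=/dev/md0 of=/dev/null bs=1M count=80000
# average MB/s = bytes read / elapsed seconds / 10^6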
24
Opteron-based Servers
  • Dual 2.4GHz Opterons
  • 4GB Memory
  • 1 WD800JB 80GB HD
  • 16 SATA 300GB HD
    (Seagate ST3300831AS)
  • 4x 4-port Infiniband-SATA
  • 2x RocketRaid 1820A
  • 10GbE NIC
  • 2 PCI-X slots at 133MHz
  • 2 PCI-X slots at 100MHz
    (Note: 64-bit at 133 MHz = 8.4 Gb/s)

25
Multilane Infiniband SATA
26
Server Specifications
27
The Parameters
  • 5 types of controllers
  • number of controllers to use (1 to 4)
  • number of disks/controller (1 to 16)
  • RAID0, RAID5, RAID6, JBOD
  • dual or quad Opteron systems
  • 4-6 possible PCI-X slots (1 reserved for 10GigE)
  • linux kernels (2.6.9, 2.6.10, 2.6.11)
  • many tuning parameters (in addition to WAN tuning), e.g. (collected into a script sketch after this list)
  • blockdev --setra 8192 /dev/md0
  • chunk-size in mdadm (1024)
  • /sbin/setpci -d 8086:1048 e6.b=2e
  • (modifies the MMRBC field in PCI-X configuration space for vendor 8086, device 1048, to increase the transmit burst length on the bus)
  • echo 100000 > /proc/sys/vm/min_free_kbytes
  • ifconfig eth3 txqueuelen 100000
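The same settings gathered into one illustrative host-tuning sketch; the device names (/dev/md0, eth3) and the 8086:1048 vendor/device ID simply follow the examples above, and the values are the ones quoted on this slide rather than general recommendations:

#!/bin/sh
# illustrative collection of the tuning settings listed above
blockdev --setra 8192 /dev/md0               # larger read-ahead on the software RAID device
/sbin/setpci -d 8086:1048 e6.b=2e            # raise MMRBC (PCI-X transmit burst length) on the Intel 10 GbE NIC
echo 100000 > /proc/sys/vm/min_free_kbytes   # keep more memory free for network and disk buffers
ifconfig eth3 txqueuelen 100000              # deeper transmit queue on the 10 GbE interface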

28
The SATA Controllers
3Ware-9500S-4
3Ware-9500S-8
Areca 1160
Highpoint RocketRaid 1820A
SuperMicro DAC-SATA-MV8
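The software Raid0 sets used throughout are built with mdadm on top of drives exposed by these controllers; a minimal sketch, assuming eight drives appear as /dev/sda through /dev/sdh, using the 1024 KB chunk size quoted on the previous slide, and a placeholder mount point:

# create an 8-drive software RAID0 with a 1024 KB chunk, format it with xfs, and mount it
mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=1024 /dev/sd[a-h]
mkfs.xfs /dev/md0
mount /dev/md0 /raid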
29
Areca 1160 Details
Extensive tests were done by tweakers.net on the ARECA and 8 others:
www.tweakers.net/benchdb/search/product/104629
www.tweakers.net/reviews/557
30
Why do we need Raid 6?
  • our experience is that 1 out of 30 disks fails every 6 months
  • a Raid5 rebuild of 15 300GB disks, in full operation, takes 100 hrs
  • probability that a second disk fails during the rebuild: ~1% (a rough estimate follows below)
  • ARECA-1160 tests of 15 300GB disks (1 broken)
  • Raid5 or 6 fast build in 100 minutes
  • Raid5 or 6 background build: up to 100 hrs for a busy system
  • Acid test
  • Raid6: removed a disk while very busy; degraded to Raid5
  • rebuild takes 100 hrs
  • removed a second disk: now critical, but after the Raid5 was rebuilt it proceeded to Raid6

Raid 5 with 4 TB of disk is too risky; the marginal cost of Raid 6 is minimal.
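A rough check of that ~1% estimate, assuming independent failures at the observed rate of 1 disk in 30 per 6 months (about 4380 hours), with 14 surviving disks exposed during a 100-hour rebuild:

\[
P(\text{2nd failure during rebuild}) \approx 14 \times \frac{1}{30} \times \frac{100\ \text{h}}{4380\ \text{h}} \approx 0.011 \approx 1\%
\]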
31
Some optimal I/O results
32
Some details of I/O
33
Puzzling I/O results
  • read speeds for some 80 GB files are consistently 50% faster (620 MB/sec) for md0 of a 2x8-disk RAID5 on RR 1820A controllers
  • reads for other files are consistently lower
  • read speeds are up to 50% faster using /dev/md0 over direct use of /dev/sda (e.g. Areca 1160 15-disk Raid5: 190 to 323 MB/s)
  • bi-stable (fast/slow) read modes within the same file
  • the diskscrubb utility re-maps bad blocks; takes 2 hrs for a 300 GB drive
  • weak blocks that are not being remapped: a possible reason for slow spots
  • room temperature gradient suspected; tested; discounted

34
Puzzling I/O results
Bi-stable state for reads: a useful tool to show which disk may be slowing I/O is iostat -x 1

Device  rrqm/s  wrqm/s      r/s    w/s     rsec/s  wsec/s      rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
hda       0.00    0.00     1.00   0.00       8.00    0.00       4.00    0.00      8.00      0.01   9.00   9.00    0.90
md0       0.00    0.00  1920.00   0.00  491520.00    0.00  245760.00    0.00    256.00      0.00   0.00   0.00    0.00
sda       0.00    0.00   239.00   0.00   61440.00    0.00   30720.00    0.00    257.07     10.88  45.31   4.19  100.10  <-- BAD
sdb       0.00    0.00   238.00   0.00   61440.00    0.00   30720.00    0.00    258.15      2.80  11.76   2.46   58.50
sdc       0.00    0.00   240.00   0.00   61440.00    0.00   30720.00    0.00    256.00      2.85  11.91   2.40   57.70
sdd       0.00    0.00   240.00   0.00   61440.00    0.00   30720.00    0.00    256.00      3.01  12.61   2.58   61.80
sde       0.00    0.00   237.00   0.00   61440.00    0.00   30720.00    0.00    259.24      2.94  12.39   2.57   61.00
sdf       0.00    0.00   236.00   0.00   61440.00    0.00   30720.00    0.00    260.34      2.96  12.47   2.61   61.60
sdg       0.00    0.00   239.00   0.00   61440.00    0.00   30720.00    0.00    257.07      3.04  12.77   2.51   60.00
sdh       0.00    0.00   235.00   0.00   61440.00    0.00   30720.00    0.00    261.45      3.02  12.72   2.49   58.60

When working properly this is:

Device  rrqm/s  wrqm/s      r/s    w/s     rsec/s  wsec/s      rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
hda       0.00    1.00     1.00  37.00       8.00  304.00       4.00  152.00      8.21      0.09   2.37   0.21    0.80
md0       0.00    0.00  3520.00   0.00  901120.00    0.00  450560.00    0.00    256.00      0.00   0.00   0.00    0.00
sda       0.00    0.00   434.00   0.00  112640.00    0.00   56320.00    0.00    259.54      8.57  19.52   2.30  100.00
sdb       0.00    0.00   446.00   1.00  112640.00    0.00   56320.00    0.00    251.99      8.07  20.50   2.20   98.30
sdc       0.00    0.00   440.00   0.00  112640.00    0.00   56320.00    0.00    256.00      6.11  13.89   2.25   98.80
sdd       0.00    0.00   440.00   0.00  112640.00    0.00   56320.00    0.00    256.00      4.63  10.52   2.18   96.10
sde       0.00    0.00   439.00   0.00  112640.00    0.00   56320.00    0.00    256.58      4.64  10.54   2.18   95.70
sdf       0.00    0.00   441.00   0.00  112640.00    0.00   56320.00    0.00    255.42      6.26  14.22   2.25   99.20
sdg       0.00    0.00   437.00   0.00  112640.00    0.00   56320.00    0.00    257.76      4.89  11.11   2.19   95.80
sdh       0.00    0.00   439.00   0.00  112640.00    0.00   56320.00    0.00    256.58      5.21  11.84   2.19   96.10

Solution: replace the slow disk with a normal one.
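A small helper in the same spirit, purely illustrative: take an iostat -x sample of the RAID members and flag any device whose await is far above the group average (the sd* name pattern and the 3x threshold are assumptions, not part of the original setup):

#!/bin/sh
# take two iostat -x reports one second apart, keep the most recent await per sd* device,
# and flag any member whose average wait time is well above the group average
iostat -x 1 2 | awk '
  /^sd/ { await[$1] = $(NF-2) }            # await is the third-from-last column
  END {
    n = 0; sum = 0
    for (d in await) { sum += await[d]; n++ }
    if (n == 0) exit
    avg = sum / n
    for (d in await)
      if (await[d] > 3 * avg)
        printf "%s: await %.1f ms vs group average %.1f ms -- suspect slow disk\n", d, await[d], avg
  }'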
35
I/O related results
Shows the drop in read speed depending on the location of the file. Reads are significantly faster on the outer part of the software Raid0 (JBOD) set.
36
TRIUMF-CERN GbE lightpath
  • a GbE circuit has been established since April 18, 2005
  • uses ONS15454
  • used primarily for the ATLAS Service Challenge
  • hoping to have a 10 GbE lightpath to CERN by
    Jan/Feb 2006

37
Atlas SC3 Setup
ATLAS Tier 1 Service Challenge 3 (primary contact: Reda Tafirout, tafirout_at_triumf.ca)
  • 3 Ciara servers
  • Intel SE7520BD2 (dual GigE, PCI-X, etc.)
  • dual 3 GHz Nocona EM64T (1 MB cache / 800 MHz FSB)
  • 2 GB RAM
  • 1 system disk: 80 GB IDE (laptop)
  • 8 x 250 GB SATA150 (Seagate Barracuda NCQ, 8 MB)
  • 3Ware 9500S-8MI RAID5
  • Infiniband connections
  • 1 Evetek server (management node)
  • dual Opteron 246 2.0 GHz (800 MHz FSB)
  • 2 GB RAM
  • 1 system disk: WD 80 GB SATA
  • 2 x 250 GB WD SATA
  • 3Ware 9500S-LP, 4 channels
  • ADAPTEC Ultra160 SCSI 29160-LP
  • Tape system: 2 x IBM 4560SLX SDLT libraries, each with 1 SDLT drive and 26 SDLT tapes; the libraries have a fibre channel interface card
  • All systems run an FC3 x86_64 2.6 kernel and dCache for disk management (with gridftp and SRM access doors)
38
TRIUMF-Carleton U lightpath
Servers
  • 5x dual Opteron 250 (2.4 GHz), 2GB memory, 16x 300GB SATA drives
  • SunFire V40z quad Opteron 850 (2.4 GHz), 8GB memory, 3x 146GB SCSI
Network cards
  • Intel Pro/10GbE-LR
  • S2io/Neterion Xframe
Raid and SATA controllers
  • 3ware 9500S 8-port
  • RocketRaid 1820A 8-port
  • SuperMicro MV8 8-port
  • Areca 1160 16-port
Network
  • MRV CWDM
  • Foundry NI1500, NI40G
  • 10G-ER 1550nm LAN PHY
  • 10G-LR 1310nm LAN/WAN PHY
  • CAnet 4 OME 6500
39
Transfer results over 10 GbE
1 GbE disk-to-disk transfers between TRIUMF and Carleton U (Ottawa) over the 10G circuit: 115 MB/s sustained for 5 days, equivalent to 46 TB.
iperf between TRIUMF and Ottawa, memory-to-memory: 3.74 Gbps (460 MB/s) averaged over 1 week, 350 TB transferred (errors ignored).
40
Transfer results over 10 GbE
Disk-to-memory, back-to-back over a short distance, 24 hrs: single TCP stream, average of 2.4 Gbps (300 MB/s); max disk read 361 MB/s (16-disk RAID5).
Disk-to-disk, back-to-back over a short distance, 76 TB in 4 days: bbftp with 5 TCP streams, average of 1.8 Gbps (220 MB/s); max disk write 303 MB/s (15-disk ARECA RAID5), max disk read 361 MB/sec (16 disks as 2x8 RR1820A RAID5).
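A hedged sketch of the kind of multi-stream bbftp transfer quoted above; the host name, user, and file paths are placeholders, and the 5 parallel streams mirror the test described:

# client side: push one large file over 5 parallel TCP streams
bbftp -u atlas -p 5 -e "put /raid/run01234.data /raid/run01234.data" storm1.triumf.ca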
41
Pumping data into a 10GbE circuit
Bottleneck: buffering? What are the solutions? Zero-copy.
42
Conclusions/Observations
  • dual opterons may still be I/O limited -
    exploring hot wired quad opterons
  • SATA drives may need more quality
    control/screening/repair
  • Raid 5 for 1-4 TB, Raid 6 for larger sets (now
    up to 24 disks/controller)
  • some cards have a 2TB limit
  • GbE delivers stable disk-to-disk long distance transfers at 120 MB/s
  • there are critical tuning requirements - servers
    cannot be used blindly
  • achieving robustness is not easy!
  • lightpaths, however, make this much easier!

43
Further Explorations
  • 10 GbE network infrastructure
  • over the past 3 years the 10 GbE networking
    vendor space has matured
  • perhaps time to acquire something more permanent
    - under consideration
  • XFP-based optics is the latest trend
  • re-visit evaluation of different data transfer
    protocols

44
Further Explorations
  • ATA over Ethernet
  • had some discussions with Coraid
  • explore how ethernet attached drives would behave
    over long haul networks
  • iSCSI
  • iSCSI over long haul networks
  • Sun V40z with Solaris 10 (native iSCSI stack)
  • demonstrated I/O over 500 MB/s (an illustrative initiator sketch follows this list)
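As a purely illustrative sketch of the Linux side of such a test, using the open-iscsi initiator (the Solaris 10 native stack mentioned above has its own tooling); the portal address and target IQN are placeholders:

# discover targets exported by a remote iSCSI portal, then log in to one;
# the resulting /dev/sdX can then be benchmarked like a local disk over the long-haul path
iscsiadm -m discovery -t sendtargets -p 192.0.2.10
iscsiadm -m node -T iqn.2005-10.example:storage.array0 -p 192.0.2.10 --login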

45
Further Explorations
  • 10 GbE NICs
  • NICs with TOE
  • Myrinet recently announced new lower cost 10GbE
    compatible NICs
  • PCI-Express
  • emergence of PCI-E disk controllers and NICs

46
Thank You!  kost_at_triumf.ca  mcdonald_at_triumf.ca  xiong_at_physics.carleton.ca