Title: Huge Data Transfer Experimentation over Lightpaths
1 Huge Data Transfer Experimentation over Lightpaths
- Corrie Kost, Steve McDonald
- TRIUMF
- Wade Hong
- Carleton University
2 Motivation
- LHC expected to come online in 2007
- data rates expected to exceed a petabyte a year
- large Canadian HEP community involved in the ATLAS experiment
- establishment of a Canadian Tier 1 at TRIUMF
  - replicate all or part of the experimental data
- need to be able to transfer huge data volumes to our Tier 1
3 TRIUMF
- Tri-University Meson Facility
- Canada's laboratory for particle and nuclear physics
- operated as a joint venture by UofA, UBC, Carleton U, SFU, and UVic
- located on the UBC campus in Vancouver
- five-year funding for 2005 - 2009 announced in the federal budget
- planned as the Canadian ATLAS Tier 1
4 TRIUMF
5 Lightpaths
- a significant design principle of CA*net 4 is the ability to provide dedicated point-to-point bandwidth over lightpaths under user control
- the similar philosophy of SURFnet provides the ability to establish an end-to-end lightpath from Canada to CERN
- optical bypass isolates huge data transfers from other users of the R&E networks
- lightpaths permit the extension of Ethernet LANs into the wide area
6 Ethernet - local to global
- the de facto LAN technology
- original Ethernet
  - shared media, half duplex, distance limited by the protocol
- modern Ethernet
  - point to point, full duplex, switched, distance limited by the optical components
  - cost effective
7 Why native Ethernet Long Haul?
- more than 90% of Internet traffic originates from an Ethernet LAN
- data traffic on the LAN increases due to new applications
- Ethernet services with incremental bandwidth offer new business opportunities for carriers
- so why not native Ethernet?
  - scalability, reliability, service guarantees
  - all of the above are active research areas
- native Ethernet long-haul connections can be used today as a complement to the routed networks, not a replacement
8 Experimentation
- experimenting with 10 GbE hardware for the past 3 years
- engaged 10 GbE NIC and network vendors
- mostly interested in disk-to-disk transfers with commodity hardware
- tweaking the performance of Linux-based disk servers
- engaged hardware vendors to help build systems
- testing data transfers over dedicated lightpaths
- engineering solutions for the e2e lightpath last mile, especially for 10 GbE
9 2002 Activities
- established the first end-to-end trans-Atlantic lightpath between TRIUMF and CERN for iGrid 2002
  - bonded dual GbEs transported across a 2.5 Gbps OC-48
- initial experimentation with 10 GbE
  - alpha Intel 10GbE LR NICs, Extreme Black Diamond 6808 with 10GbE LRi blades
- transferred ATLAS DC data from TRIUMF to CERN using bbftp and tsunami
10 Live continent to continent
- e2e lightpath up and running Sept 20, 20:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
11 iGrid 2002 Topology
12 Exceeding a Gbps (Tsunami)
13 2003 Activities
- CANARIE-funded directed research project (CA*net 4 IGT) to continue the experimentation
  - Canadian HEP community and CERN
- GbE lightpath experimentation between CERN and UofA for real-time remote farms
- data transfers over a GbE lightpath between CERN and Carleton U, moving 700 GB of ATLAS FCAL test beam data
  - took 6.5 hrs versus 67 days
14 Current IGT Topology
15 2003 Activities
- re-establishment of the 10 GbE experiments
  - newer Intel 10 GbE NICs, Force10 Networks E600 switches, IXIA network testers, servers from Intel and the CERN OpenLab
- established the first native 10 GbE end-to-end trans-Atlantic lightpath between Carleton U and CERN
- demonstrated at ITU Telecom World 2003
16 Demo during ITU Telecom World 2003
10 GbE WAN PHY over an OC-192 circuit using lightpaths provided by SURFnet and CA*net 4
9.24 Gbps using traffic generators
6 Gbps using UDP on PCs
5.65 Gbps using TCP on PCs
17 Results on the transatlantic 10 GbE
Single-stream UDP throughput
Single-stream TCP throughput
Data rates limited by the PC, even for memory-to-memory tests
UDP uses fewer resources than TCP on high bandwidth-delay-product networks
18 2004-2005 Activities
- with the arrival of the third CA*net 4 lambda in the summer of 2004, looked at establishing a 10 GbE lightpath from TRIUMF
  - Neterion (S2io) Xframe 10 GbE NICs, Foundry NetIron 40Gs, Foundry NetIron 1500, servers from Sun Microsystems, and custom-built disk servers from Ciara Technologies
- distance problem between TRIUMF and the CA*net 4 OME 6500 in Vancouver
  - XENPAK 10 GbE WAN PHY at 1310 nm
19 2004-2005 Activities
- testing data transfers between TRIUMF and CERN, and between TRIUMF and Carleton U, over a 10 GbE lightpath
- experimenting with robust data transfers
- attempting to maximize disk I/O performance from Linux-based disk servers
  - experimenting with disk controllers and processors
- ATLAS Service Challenges in 2005
20 2004-2005 Activities
- exploring a more permanent 10 GbE lightpath to CERN and lightpaths from TRIUMF to the Canadian Tier 2 ATLAS sites
- CANARIE playing a lead role in helping to facilitate this
- still need to solve some last-mile lightpath issues
22 Xeon-based Servers
- Dual 3.2 GHz Xeons
- 4 GB memory
- 4 x 3Ware 9500S-4LP (8)
- 16 x SATA150 120 GB drives
- 40 GB Hitachi 14R9200 drives
- Intel 10GbE PXLA8590LR
23 Some Xeon Server I/O Results
- read a pair of 80 GB (xfs) files for 67 hours: 120 TB total, average 524 MB/s (software RAID0 of 8 SATA disks on each of a pair of hardware-RAID0 RocketRaid 1820A controllers on Storm2)
- 10GbE S2io NICs back-to-back for 17 hrs: 10 TB, average 180 MB/s (from Storm2 to Storm1 with software RAID0 of 4 disks on each of 3 3Ware 9500S-4 controllers in RAID0)
- 10GbE lightpath, Storm2 to an Itanium machine at CERN: 10, 15, 20, 25 bbftp streams averaged 18, 24, 27, 29 MB/s disk-to-disk (only 1 disk at CERN, max write speed 48 MB/s)
- continued Storm1 to Storm2 testing: many sustainability problems encountered and resolved; details available on request. Don't do test flights too close to the ground:
echo 100000 > /proc/sys/vm/min_free_kbytes
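A sustained sequential-read benchmark like the 67-hour test above can be approximated with a simple loop; this is only a minimal sketch, and the file names and block size below are illustrative assumptions, not the exact procedure used for these numbers.
  # read two large xfs files back to back; with 80 GB files on a 4 GB machine,
  # page-cache reuse is negligible, so dd's reported rate tracks the array speed
  for f in /data/test0.dat /data/test1.dat; do
      dd if="$f" of=/dev/null bs=1M
  done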
24 Opteron-based Servers
- Dual 2.4 GHz Opterons
- 4 GB memory
- 1 x WD800JB 80 GB HD
- 16 x SATA 300 GB HD (Seagate ST3300831AS)
- 4 x 4-port Infiniband-SATA
- 2 x RocketRaid 1820A
- 10GbE NIC
- 2 PCI-X at 133 MHz
- 2 PCI-X at 100 MHz
Note: 64-bit at 133 MHz = 8.4 Gb/s
25 Multilane Infiniband SATA
26 Server Specifications
27 The Parameters
- 5 types of controllers
- number of controllers to use (1 to 4)
- number of disks/controller (1 to 16)
- RAID0, RAID5, RAID6, JBOD
- dual or quad Opteron systems
- 4-6 possible PCI-X slots (1 reserved for 10GigE)
- Linux kernels (2.6.9, 2.6.10, 2.6.11)
- many tuning parameters (in addition to WAN tuning), e.g. (a combined sketch follows this list)
  - blockdev --setra 8192 /dev/md0
  - chunk-size in mdadm (1024)
  - /sbin/setpci -d 8086:1048 e6.b=2e
    (modifies the MMRBC field in PCI-X configuration space for vendor 8086, device 1048, to increase the transmit burst length on the bus)
  - echo 100000 > /proc/sys/vm/min_free_kbytes
  - ifconfig eth3 txqueuelen 100000
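A minimal sketch of applying these tunings together on one of the disk servers; the device names (/dev/sd[a-h]1, /dev/md0, eth3) and the RAID0 layout are illustrative assumptions, and the 8086:1048 selector applies only to the Intel 10 GbE NIC.
  # build the software RAID0 with a 1024 KB chunk size (illustrative devices)
  mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=1024 /dev/sd[a-h]1
  # raise the read-ahead on the md device to 8192 sectors
  blockdev --setra 8192 /dev/md0
  # bump MMRBC in PCI-X config space for the Intel NIC (vendor 8086, device 1048)
  /sbin/setpci -d 8086:1048 e6.b=2e
  # keep more free memory around so the VM does not stall under heavy I/O
  echo 100000 > /proc/sys/vm/min_free_kbytes
  # lengthen the transmit queue on the 10 GbE interface
  ifconfig eth3 txqueuelen 100000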
28 The SATA Controllers
3Ware-9500S-4
3Ware-9500S-8
Areca 1160
Highpoint RocketRaid 1820A
SuperMicro DAC-SATA-MV8
29 Areca 1160 Details
Extensive tests were done by tweakers.net on the Areca and 8 other controllers:
www.tweakers.net/benchdb/search/product/104629
www.tweakers.net/reviews/557
30 Why do we need RAID 6?
- our experience is that 1 out of 30 disks fails every 6 months
- a RAID5 rebuild of 15 300 GB disks in full operation takes 100 hrs
- probability that a second disk fails during the rebuild is about 1% (a rough check is sketched at the end of this slide)
- Areca-1160 tests with 15 300 GB disks (1 broken)
  - RAID5 or 6 fast build in 100 minutes
  - RAID5 or 6 background build up to 100 hrs on a busy system
- acid test
  - RAID6: removed a disk while very busy, degraded to RAID5; rebuild takes 100 hrs
  - removed a second disk, now critical; but after the RAID5 rebuilt, it proceeded to RAID6
RAID 5 with 4 TB of disk is too risky; the marginal cost of RAID 6 is minimal
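A rough back-of-the-envelope check of the 1% figure, using only the numbers above (1 in 30 disks failing per 6 months, a 100-hour rebuild, 14 surviving disks in a 15-disk set); this is a sketch of the estimate, not a measured value.
  # expected number of second-disk failures during one RAID5 rebuild
  awk 'BEGIN {
      p_per_hour = (1.0/30.0) / (182.5*24)   # per-disk failure probability per hour
      print 14 * 100 * p_per_hour            # 14 disks exposed for ~100 hours => ~0.011
  }'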
31 Some optimal I/O results
32 Some details of I/O
33 Puzzling I/O results
- read speeds for some 80 GB files are consistently 50% faster (620 MB/s) for an md0 of 2x8-disk RAID5 on RR 1820A controllers
- reads of other files are consistently lower
- read speeds are up to 50% faster using /dev/md0 over direct use of /dev/sda (e.g. Areca 1160 15-disk RAID5: 190 to 323 MB/s)
- bi-stable (fast/slow) read modes within the same file
- diskscrubb utility re-maps bad blocks - takes 2 hrs for a 300 GB drive
- weak blocks not being remapped are a possible reason for the slow spots
- a room temperature gradient was suspected - tested and discounted
34 Puzzling I/O results
Bi-stable state for reads: a useful tool to display which disk may be slowing the I/O is iostat -x 1

Device  rrqm/s wrqm/s     r/s   w/s    rsec/s wsec/s     rkB/s  wkB/s avgrq-sz avgqu-sz  await svctm  %util
hda       0.00   0.00    1.00  0.00      8.00   0.00      4.00   0.00     8.00     0.01   9.00  9.00   0.90
md0       0.00   0.00 1920.00  0.00 491520.00   0.00 245760.00   0.00   256.00     0.00   0.00  0.00   0.00
sda       0.00   0.00  239.00  0.00  61440.00   0.00  30720.00   0.00   257.07    10.88  45.31  4.19 100.10  BAD
sdb       0.00   0.00  238.00  0.00  61440.00   0.00  30720.00   0.00   258.15     2.80  11.76  2.46  58.50
sdc       0.00   0.00  240.00  0.00  61440.00   0.00  30720.00   0.00   256.00     2.85  11.91  2.40  57.70
sdd       0.00   0.00  240.00  0.00  61440.00   0.00  30720.00   0.00   256.00     3.01  12.61  2.58  61.80
sde       0.00   0.00  237.00  0.00  61440.00   0.00  30720.00   0.00   259.24     2.94  12.39  2.57  61.00
sdf       0.00   0.00  236.00  0.00  61440.00   0.00  30720.00   0.00   260.34     2.96  12.47  2.61  61.60
sdg       0.00   0.00  239.00  0.00  61440.00   0.00  30720.00   0.00   257.07     3.04  12.77  2.51  60.00
sdh       0.00   0.00  235.00  0.00  61440.00   0.00  30720.00   0.00   261.45     3.02  12.72  2.49  58.60

When working properly this is...

Device  rrqm/s wrqm/s     r/s   w/s    rsec/s wsec/s     rkB/s  wkB/s avgrq-sz avgqu-sz  await svctm  %util
hda       0.00   1.00    1.00 37.00      8.00 304.00      4.00 152.00     8.21     0.09   2.37  0.21   0.80
md0       0.00   0.00 3520.00  0.00 901120.00   0.00 450560.00   0.00   256.00     0.00   0.00  0.00   0.00
sda       0.00   0.00  434.00  0.00 112640.00   0.00  56320.00   0.00   259.54     8.57  19.52  2.30 100.00
sdb       0.00   0.00  446.00  1.00 112640.00   0.00  56320.00   0.00   251.99     8.07  20.50  2.20  98.30
sdc       0.00   0.00  440.00  0.00 112640.00   0.00  56320.00   0.00   256.00     6.11  13.89  2.25  98.80
sdd       0.00   0.00  440.00  0.00 112640.00   0.00  56320.00   0.00   256.00     4.63  10.52  2.18  96.10
sde       0.00   0.00  439.00  0.00 112640.00   0.00  56320.00   0.00   256.58     4.64  10.54  2.18  95.70
sdf       0.00   0.00  441.00  0.00 112640.00   0.00  56320.00   0.00   255.42     6.26  14.22  2.25  99.20
sdg       0.00   0.00  437.00  0.00 112640.00   0.00  56320.00   0.00   257.76     4.89  11.11  2.19  95.80
sdh       0.00   0.00  439.00  0.00 112640.00   0.00  56320.00   0.00   256.58     5.21  11.84  2.19  96.10

Solution? Replace the slow disk with a normal one.
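A small filter like the following could watch for the slow state automatically; it assumes the column layout shown above (await in field 12) and a 20 ms threshold, both of which are illustrative.
  # flag any array member whose average wait time is far above its peers
  iostat -x 1 | awk '/^sd/ && $12+0 > 20 { print $1, "high await:", $12 }'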
35 I/O related results
Shows the drop in read speed depending on the location of the file. Reads are significantly faster on the outer part of the software RAID0 (JBOD) set.
36 TRIUMF-CERN GbE lightpath
- currently a GbE circuit, established since April 18th, 2005
- uses ONS 15454
- used primarily for the ATLAS Service Challenge
- hoping to have a 10 GbE lightpath to CERN by Jan/Feb 2006
37 ATLAS SC3 Setup
ATLAS Tier 1 Service Challenge 3 (primary contact: Reda Tafirout, tafirout@triumf.ca)
- 3 Ciara servers
  - Intel SE7520BD2 (dual GigE, PCI-X, etc.)
  - dual 3 GHz Nocona EM64T (1 MB cache / 800 MHz FSB)
  - 2 GB RAM
  - 1 system disk, 80 GB IDE (laptop)
  - 8 x 250 GB SATA150 (Seagate Barrac. NCQ, 8 MB)
  - 3Ware 9500S-8MI RAID5, Infiniband connections
- 1 Evetek server (management node)
  - dual Opteron 246, 2.0 GHz (800 MHz FSB)
  - 2 GB RAM
  - 1 system disk, WD 80 GB SATA
  - 2 x 250 GB WD SATA, 3Ware 9500S-LP 4 channels
  - Adaptec Ultra160 SCSI 29160-LP
- tape system
  - 2 x IBM 4560SLX SDLT libraries, each with 1 SDLT drive and 26 SDLT tapes; both have a fibre channel interface card
All systems are running FC3 x86_64 with a 2.6 kernel, and dCache for disk management (with gridftp and SRM access doors)
38 TRIUMF-Carleton U lightpath
Servers
- 5 x dual Opteron 250 (2.4 GHz), 2 GB memory, 16 x 300 GB SATA drives
- SunFire V40z, quad Opteron 850 (2.4 GHz), 8 GB memory, 3 x 146 GB SCSI
Network cards
- Intel PRO/10GbE-LR
- S2io/Neterion Xframe
RAID and SATA controllers
- 3Ware 9500S 8-port
- RocketRaid 1820A 8-port
- SuperMicro MV8 8-port
- Areca 1160 16-port
Network
- MRV CWDM
- Foundry NI1500, NI40G
- 10G-ER 1550 nm LAN PHY, 10G-LR 1310 nm LAN/WAN PHY
- CA*net 4 OME 6500
39 Transfer results over 10 GbE
GbE transfers, disk-to-disk, between TRIUMF and Carleton (Ottawa) over the 10G circuit: 115 MB/s sustained for 5 days, equivalent to 46 TB
Iperf between TRIUMF and Ottawa, memory-to-memory, for 1 week: 3.74 Gbps averaged (460 MB/s), 350 TB transferred (errors ignored)
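For reference, a memory-to-memory test of this kind could be driven with iperf roughly as follows; the window size, stream count, and hostnames are illustrative assumptions, not the exact settings used for the week-long run.
  # on the receiver (Ottawa end), assumed settings
  iperf -s -w 16M
  # on the sender (TRIUMF end): 4 parallel TCP streams, 16 MB windows,
  # report every 60 s, run for 7 days (604800 s)
  iperf -c ottawa-host -w 16M -P 4 -i 60 -t 604800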
40 Transfer results over 10 GbE
Disk-to-memory, back-to-back, short distance, 24 hrs, single TCP stream: average of 2.4 Gbps, 300 MB/s (max disk read 361 MB/s, 16-disk RAID5)
Disk-to-disk, back-to-back, short distance, 76 TB in 4 days, bbftp with 5 TCP streams: average of 1.8 Gbps, 220 MB/s (max disk write 303 MB/s, 15-disk Areca RAID5; max disk read 361 MB/s, 16 disks as 2x8 RR1820A RAID5)
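A bbftp transfer with 5 parallel TCP streams, as in the disk-to-disk test above, could be invoked roughly as follows; the user, hosts, and file paths are illustrative, and option details may differ between bbftp versions.
  # push one large file with 5 parallel TCP streams (illustrative names)
  bbftp -u atlas -p 5 -e "put /data/testfile /scratch/testfile" remote-host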
41 Pumping data into a 10GbE circuit
Bottleneck - buffering? What are the solutions?
Zero-copy
42 Conclusions/Observations
- dual Opterons may still be I/O limited - exploring hot-wired quad Opterons
- SATA drives may need more quality control/screening/repair
- RAID 5 for 1-4 TB, RAID 6 for larger sets (now up to 24 disks/controller)
- some cards have a 2 TB limit
- GbE delivers stable disk-to-disk long distance transfers at 120 MB/s
- there are critical tuning requirements - servers cannot be used blindly
- achieving robustness is not easy!
- lightpaths, however, make this much easier!
43 Further Explorations
- 10 GbE network infrastructure
  - over the past 3 years the 10 GbE networking vendor space has matured
  - perhaps time to acquire something more permanent - under consideration
  - XFP-based optics are the latest trend
- re-visit evaluation of different data transfer protocols
44 Further Explorations
- ATA over Ethernet
  - had some discussions with Coraid
  - explore how Ethernet-attached drives would behave over long-haul networks
- iSCSI
  - iSCSI over long-haul networks
  - Sun V40z with Solaris 10 (native iSCSI stack)
  - demonstrated I/O over 500 MB/s
45 Further Explorations
- 10 GbE NICs
  - NICs with TOE
  - Myrinet recently announced new lower-cost 10GbE-compatible NICs
- PCI Express
  - emergence of PCI-E disk controllers and NICs
46 Thank You!
kost@triumf.ca
mcdonald@triumf.ca
xiong@physics.carleton.ca