Title: End-to-end performance: issues and suggestions
1. End-to-end performance: issues and suggestions
- TERENA 5th NRENs and Grids Workshop
- Paris, June 2007
- Mark Leese
2. Talk Emphasis
- MonALISA: a monitoring tool/framework
- DANTE: a network operator
- EGEE-II: a Grid
- Mark: a pseudo-Grid end user
- I'm not a real user, but I look at the issues from their viewpoint:
- Large Hadron Collider in the UK (GridPP)
- UK e-Science
- OGF
- Aimed at a mixed audience (NRENs and Grid users), so some network/Grid things you will already know. Zzzzzzzzzzzz :-)
3. Contents
- Just two things:
- What makes the Grid different to other network users, wrt performance?
- What are the end-to-end performance (monitoring) issues? Any suggestions?
- If the links in the presentation don't work, they are listed again on the last three slides
4. 1. What makes the Grid different to other network users, wrt performance?
5. The Grid
- The Grid is all about:
- Sharing resources
- the obvious, e.g. databases
- the specialised, e.g. remotely controlled telescopes
- and new ideas, e.g. CPU time
- co-allocate resources to a task to remove the limitations of the individual resources
- most basic analogy: you can move house faster if you have two vans
- Sharing resources which are geographically distributed
- Sharing resources efficiently
- optimisation: selecting the best resources for the job
6. The Grid
Network(s)
7. The Grid
- Get apps running on the right resources (wherever they are)
- Make disparate compute resources into a coherent whole
Network(s)
8. Optimisation
- It's a little like the checkout counters in a supermarket
- There is a line of 10 checkouts to which you can take your big shopping basket
- Two checkouts you cannot use. They are for people with five items or less ("caisse express")
- Another two checkouts cannot be used. They are reserved for something else (the staff's lunch break)
- Six left: how big is each queue, and how long will it take each person to exit the queue (how many items in each basket)?
- If you choose wrong, you get delayed!
- You miss the train, you get home late, your partner has given your dinner to the dog
- To take the analogy to extremes: hopefully your basket does not have a broken wheel :-)
9. Scheduling
- Grid job: the basic unit of work
- SEs (Storage Elements) provide storage resources and access to mass storage systems
- CEs (Compute Elements) provide processing power, e.g. a cluster of Worker Nodes (PC farm)
- Scheduling: deciding when a job will run, and with which resources
- Typically there will be many CEs capable of running a job
- If a CE already has lots of jobs queued, you would like to use another
- File replication: a proven technique for improving data access
- Distribute multiple copies of the same file across a Grid
- Increases the number of CEs with good network connectivity to the data
- Extreme example: Pisa→Roma or Pisa→Fermilab?
- So, typically there may also be several SEs holding the required data
10. Network Aware Scheduling (i)
- So we have a set of CEs {a, b, c, ...} and SEs {x, y, z, ...} capable of running a job
- We want a node from each list such that the job will complete the fastest
- Take account of:
- capability of CEs
- size and number of jobs already waiting (queued) at CEs
- performance of the network link for each CE-SE combination
- Further complicated by the compute/data intensity of the job:
- computationally intensive job: lots of maths
- data intensive job: lots and lots and lots of data
- do we pull the data to the job, or push the job to the data? (a minimal sketch of this kind of selection follows below)
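A minimal sketch, in Python, of the CE/SE selection described above. Everything here is an illustrative assumption: the CE and SE names, queue times, speed factors and per-path throughputs are invented, and in reality would come from the Grid Information Service and from monitored measurements (e.g. iperf time series).

```python
# Minimal sketch (not production code): choose the (CE, SE) pair that
# minimises estimated job completion time. All numbers are illustrative.

from itertools import product

# Hypothetical monitored data: CE queue time (s) and CE speed factor
ces = {"ce-a": {"queue_s": 600, "speed": 1.0},
       "ce-b": {"queue_s": 60,  "speed": 0.5},
       "ce-c": {"queue_s": 0,   "speed": 0.8}}

# Hypothetical measured throughput (Mbit/s) for each CE-SE network path
throughput_mbps = {("ce-a", "se-x"): 500, ("ce-a", "se-y"): 200,
                   ("ce-b", "se-x"): 100, ("ce-b", "se-y"): 400,
                   ("ce-c", "se-x"): 300, ("ce-c", "se-y"): 50}

def completion_time(ce, se, data_mbit, base_exec_s):
    """Estimated time: wait in the queue, pull the input data, execute."""
    transfer_s = data_mbit / throughput_mbps[(ce, se)]
    exec_s = base_exec_s / ces[ce]["speed"]
    return ces[ce]["queue_s"] + transfer_s + exec_s

# A data-intensive job: 400 Gbit of input, 10 min of CPU on a reference CE
best = min(product(ces, ["se-x", "se-y"]),
           key=lambda p: completion_time(*p, data_mbit=400_000, base_exec_s=600))
print("best (CE, SE) pair:", best)   # -> ('ce-a', 'se-x') with these numbers
```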
11. Network Aware Scheduling (ii)
- In Utopia we would know about the current state of the network, and any future reserved bandwidth
- In reality we could use monitored network performance to make an estimate
- It's not perfect, but patterns (diurnal variation, chronic poor performance) can be identified
- The following slides show iperf tests between dedicated test nodes at LHC sites in the UK (GridPP's gridmon infrastructure)
12. Network Aware Scheduling (iii.a)
- Transfer at 00:00, yes. Transfer at 12:00, no. There's a big difference between 500 and 200 Mbps for data intensive jobs!
13. Network Aware Scheduling (iii.b)
- RAL Tier-2→Tier-1 local transfers are likely the best performers
14. Network Aware Scheduling (iii.c)
- Here, you have absolutely no idea what performance you would get → avoid
- Summary: ignore the network at your peril :-)
15. Network Aware Scheduling (iv)
- Two good papers to read:
- "Network Aware Scheduling in Grids" - B. Volckaert, P. Thysebaert, M. De Leenheer, F. De Turck, B. Dhoedt, P. Demeester
- "Data Intensive and Network Aware (DIANA) Grid Scheduling" - Richard McClatchey, Ashiq Anjum, Heinz Stockinger, Arshad Ali, Ian Willers, Michael Thomas
- We don't consider the potential uses in more detail (job placement, replica selection) because we don't know if it will happen!
16. Network Aware Scheduling (v)
- There are some -ve feelings:
- "The network is not a problem. Over-provisioning will always keep us ahead. Either that or fibre and GigE everywhere."
- The Report of the International Grid Performance Workshop 2005 concluded that "Performance simply is not on the critical path for many application projects. Applications that struggle to get code to execute correctly simply do not consider whether they are using resources efficiently or achieving good performance."
- Personal experience suggests that there is so much to think about elsewhere that the network is often the last thing to be considered
- Right now, Grid apps rely on the network being good, with no real checks
- And by way of real life indications:
- EDG WP7 developed a network cost function
- Returned the cost of variable size file transfers between source and destination Grid elements
- Based on periodic (WP7) iperf measurements
- Used by the WP2 Replica Optimization Service for:
- job placement: where to start a job so that it is as close as possible to the required data
- replica selection: from where to fetch the closest replica once a job had started
- EDG was not a production Grid, and the work was not taken forward
17. Network Aware Scheduling (vi)
- In EGEE:
- Tommaso Coviello and Tiziana Ferrari proposed to use network performance data from EGEE-JRA4:
- CompletionTime(CE_i) = JobExecutionTime + max(InputDataTransferTime, QueueTime)
- estimate file transfer times based on throughput
- reject paths exhibiting packet loss
- SE selection refined based on SEs using low congestion links (jitter the suggested test)
- Some prototype work, but not taken forward:
- QueueTime was found to be unreliable
- Data for 100 paths was required within 0.2 seconds of receiving a request
- the Grid Information Service was not ready to hold the data
- a problem for JRA4's Web Service interface (WS: accessible, but slow)
18. Network Aware Scheduling (vii)
- In WLCG/EGEE (if I understand correctly):
- The "close SE" approach is applied
- Each CE must have a close SE: the node with the best access for data retrieval from that CE
- These relationships are statically defined in the Grid's Information Service, which provides information about the Grid resources and their status
- lcg-infosites --vo dteam closeSE
- Name of the CE: g02.phy.bg.ac.yu:2119/blah-pbs-dteam
- se.phy.bg.ac.yu
- Name of the CE: fangorn.man.poznan.pl:2119/jobmanager-lcgpbs-dteam
- se1.egee.man.poznan.pl
- se2.egee.man.poznan.pl
19. Network Aware Scheduling (viii)
- To run a job the user submits a job description in JDL (Job Description Language) format
- It defines which executable to run, any parameters, input data (Grid files) etc.
- A match-making process then takes place to identify a CE to execute the job:
- Identify all CEs which:
- can run the job, i.e. match the user's requirements (JDL)
- are close to an SE holding the required input Grid files
- select the CE with the highest rank
- by default, rank is an estimation of the time interval between the job being submitted and execution actually beginning
- a function of the number of running and queued jobs at each CE
- See the gLite User Guide for more info
- As already stated, the presence of replicas of data increases the number of CEs close to the data which can potentially execute the job
- But decisions are still made on the static declaration of close SEs
- Users are able to re-write the site selection code themselves (a sketch of this match-making follows below)
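A rough sketch of that match-making, under stated assumptions: this is not the real gLite code, all hostnames and delay estimates are hypothetical, and rank is reduced to "shortest estimated start delay".

```python
# Sketch of WLCG/EGEE-style match-making as described above. Each CE has
# a statically declared "close SE"; keep only CEs whose close SE holds a
# replica of the input file, then pick the CE with the highest rank.

close_se = {  # static CE -> close SE declarations (hypothetical)
    "ce1.example.org": "se1.example.org",
    "ce2.example.org": "se2.example.org",
    "ce3.example.org": "se2.example.org",
}
replicas = {"se2.example.org", "se4.example.org"}  # SEs holding the input file
est_start_delay_s = {  # derived from running/queued job counts (hypothetical)
    "ce1.example.org": 30, "ce2.example.org": 900, "ce3.example.org": 120,
}

candidates = [ce for ce, se in close_se.items() if se in replicas]
# Highest rank = shortest estimated delay before execution begins
best_ce = min(candidates, key=est_start_delay_s.get)
print("selected CE:", best_ce)  # -> ce3.example.org
```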
20. Difference 1
- So, difference 1:
- The Grid may use network performance data to improve its decision making
21. Difference 2
- Difference 2:
- The Grid will exercise the network
22. Qualitative View
- By its very nature:
- sharing lots of resources to build powerful systems
- to process complex, large data sets
- in geographically distributed teams
- some in real-time, e.g. visualisation
- so far there have been lots of embarrassingly parallel problems (completely independent tasks which can be executed in parallel), but what about tasks requiring inter-processor communication (MPI, Message Passing Interface)?
- a lot of data moving across the network:
- high bandwidth
- low latency
- stable and guaranteed transmission rates
23. Quantitative View (i)
- The Large Hadron Collider is a collection of four experiments based at CERN (ALICE, ATLAS, CMS and LHCb) that will monitor the collision of accelerated particles
- 15 Petabytes of data generated every year
- Around 100,000 standard CPUs required to process it
- GridPP (UK) is contributing the equivalent of 10,000 PCs
24. Quantitative View (ii)
- My understanding is that the LHC, when operational, will be pushing out 700 Mbytes/s (≈5.6 Gbps) from the Tier-0 to each Tier-1
- 11 Tier-1s, linked to CERN with a 10 Gbps Optical Private Network
- So no problems there
- Additional variable flows of ~4 Gbps are expected between the Tier-1s
- What about Tier-1s to Tier-2s?
- > 150 Tier-2s, 18 in the UK
- Tier-1s and Tier-2s currently linked by standard research networks
- Are you going to commission dedicated fibres or lambdas for each?
25. Quantitative View (iii)
26. Rolls Royce Networks
- Lots of projects are working on adding extra intelligence into the network, and/or interfacing Grid applications with the network control plane for auto-provisioning of dedicated bandwidth:
- Cisco's Network Based On-demand/Grid System (NBGS)
- The NAREGI project
- Enlightened Computing
- http://www.g-lambda.net/
- These are still development projects
- Can fibre/lambdas be provided for all that need it?
- Even if provided, temptation to spend on CPU power?
- May still fall victim to end-system and "last mile" (e.g. firewall) problems
27. Is the Grid a lot of Hype?
- It's good to be skeptical about things. Every four years people say England will win the World Cup/Coupe du Monde :-)
- The Grid is ambitious
- but so was the "World Wide Wait"
- Now everyone loves the Web, and it has become important to people
- Internet banking, online shopping (flights, holidays, music, supermarket), e-Government etc. etc.
- MySpace, Facebook, YouTube
- The Web also drove investment in the Net infrastructure, and as a result it can now support video conferencing, VoIP etc.
28. Summary of Differences
- Network Operations: we can safely say that greater demands will be placed on the network
- massive datasets, 1000s of networked resources
- geographically distributed: Long Fat Networks
- high bandwidth, high availability, low latency
- networks will need to be debugged for efficiency
- Network Intelligence: the Grid may want to consume network performance data to improve its decision making
29. 2. What are the end-to-end performance (monitoring) issues?
30. The Overall Issue
- We have seen that the Grid could use network performance data for decision making
- but we don't know whether it will
- As a result, we concentrate on debugging the network for Grid users
31. End-to-End?
- When I say end-to-end I mean PC-to-PC, not PoP-to-PoP or similar
- Core and Metro Area are normally fine
- Most problems are in the "last mile":
- End-system:
- NIC
- disc
- TCP config
- poor cabling
- the application itself (e.g. older versions of scp)
- I could go on for ever (no, please don't!)
- Site firewall
- Off-site connections
32. So Many Issues
- Beyond the basics of which tests to run, and how to control/schedule them, there are too many end-to-end performance issues to consider when monitoring. Here, I mention a few and make some suggestions.
- TCP performance
- Parallel TCP streams
- Different data transfer protocols (e.g. GridFTP vs HTTP)
- New protocols, e.g. DDCP
- TCP/IP is ubiquitous so we stick with it - we can't necessarily wait for new protocols and network architectures
- Measurement types:
- active vs passive
- capture logs of real GridFTP transfers - is there Grid Information Service support?
- can we monitor Grid workflows in real-time?
- Too many test paths. Can we plug in to VO data to test only the required paths?
33. Over-Provisioning
- Q: Okay, so why don't we just throw some more bandwidth at the problem? Upgrade the links.
- A: For want of a more interesting term (to make sure you're still paying attention), this is what I call the "Heroin Effect"
- You start off with a little, but that's not really doing it for you - it's not solving the problem. So you keep increasing the dose, yet it's never as good as you thought it would be.
- By analogy, you keep buying more and more bandwidth to take you to new highs, but it's never quite as good as you thought it would be
- Simple over-provisioning is not sufficient:
- Doesn't address the key issue of end-to-end performance
- The network backbone in most cases is genuinely not the source of the problem
- The last mile (campus network→end-user system→your app) is often the cause of the problem: firewall, wiring, hard disc, application and many more potential culprits
- Also, if simple over-provisioning were a total solution, there would not be so much other work going on, e.g. protocol research (high speed TCPs)
34. Let's Put Fibre Everywhere (1)
- Fibre is cheaper than it was, but for large deployments it's still expensive
- We can see the benefits of fibre with the UKLight infrastructure and the ESLEA exploitation project, but it still doesn't address the end-to-end issue. Take a real-life ESLEA example (thanks to ESLEA for the figures):
- The UK wanted to transfer data from FermiLab (Chicago) to UCL for analysis by physicists, before returning the results
- datasets currently 1-50TB
- 50TB would take > 6 months on the production net, or one week at 700Mbps
- So a 1Gbps circuit-switched light path was provisioned
- Result: disc-to-disc transfers @ 250Mbps, just 1/4 of the theoretical max
- Tests revealed a problem at an end site
35. Let's Put Fibre Everywhere (2)
- UCL: RealityGrid, for modelling complex condensed matter systems - computational steering, visualisation
- Test node: 2 × 1.8GHz Athlon, 4 GB, GigE, CentOS
- DL: HPCx super computer
- Test node: 3 GHz P4, 2 GB, GigE, Scientific Linux
- RTT is always 9 ms
- TCP bandwidth is, errr....
36. Mark's Tips
- There are lots of tools, frameworks and infrastructures out there
- Massive list at http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
- Pick something that works for you - it's a balance of:
- ongoing administration
- deployment effort (e.g. persuading remote sites to install tools and allow you to run tests)
- how intrusive the tests are
- Start your investigations in the last mile
- Do put real data over the network:
- you can send 1 ping a second forever and see 10^-8 loss
- you then run an iperf test and the performance is terrible
- Keep historic data - things change:
- you will want to look back, and you will want points of reference
- When you see a problem, follow it up and get information
- Not only is the problem fixed, but you get to demonstrate why this is useful, which helps with deployment, support, and growing the user base
- Remember the social aspects - be persistent but patient :-)
37. Suggestions: Tools and Techniques
- Start with the local host
- As you would expect:
- uname
- netstat
- ifconfig (watch error counters etc.)
- LISA (Localhost Information Service Agent)
- a component of MonALISA
- almost complete system monitoring (load, CPU, memory, disk, disk I/O, paging, processes, network traffic and connectivity...)
- Check everything:
- TCP configuration
- machine load
- disc (SAS, SATA, nasty old IDE?)
- If TCP is the problem, what UDP rates can you achieve?
38. Suggestions: Tools and Techniques
- ping: still useful, but you need to send much faster than 1 per second, and for a long time, to see 10^-8 loss
- back of envelope calculation: on Saturday I ran a 10 sec iperf test which transferred 624MB in 480,000 packets, so ~1.3KB per packet
- 1 loss every 100,000,000 packets means ~128GB transferred before a loss causes your transfer rate to drop (see the sketch below)
- can use the Synack tool (sparingly) if ICMP is blocked
- traceroute and reverse traceroutes: regularly measuring the routes to your most important collaborators is very useful
- dedicated monitoring boxes are useful here because they may be allowed (firewalls etc.) for ICMP
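The back-of-envelope above, worked through in a short Python sketch (the figures are the ones quoted on the slide):

```python
# Back-of-envelope: how much data must cross the wire before you expect
# a single loss at a 1-in-10^8 packet loss rate, and how long naive
# 1 Hz pings would take to observe it.

bytes_transferred = 624e6          # 10 s iperf test: 624 MB...
packets = 480_000                  # ...in 480,000 packets
bytes_per_packet = bytes_transferred / packets
print(f"{bytes_per_packet:.0f} bytes/packet")        # ~1300, i.e. ~1.3 KB

loss_rate = 1e-8                   # one loss per 10^8 packets
data_before_loss = bytes_per_packet / loss_rate
print(f"{data_before_loss / 1e9:.0f} GB before an expected loss")
# ~130 GB, matching the slide's ~128GB estimate

# At 1 ping/second you would wait ~10^8 seconds to see that one loss
print(f"{1 / loss_rate / 86400 / 365:.1f} years of 1 Hz pings")  # ~3.2
```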
39. Suggestions: Tools and Techniques
- As we will see, time series data is probably the most useful
- When did your problems start? When did things change?
- Unfortunately, this relies on there being proximity between your paths/devices and ones for which there is available data
- If you suspect the problem is in the core, you may be able to find the problem router (or its rough location) through so-called "looking glass" servers: statistics of network operator performance
- ping and iperf are very useful here, but be wary
- In May 2004, Les Cottrell (SLAC) said: "As measured by NetFlow, 25% of the traffic on Abilene is iperf and ping type traffic"
40. Suggestions: Tools and Techniques
- Thrulay is an iperf-like tool for measuring TCP and UDP bandwidth
- useful because it also gives you the RTT seen by the transfer, not ping/traceroute's estimate
- Two "detective" type tools:
- Tom Dunnigan and Rich Carlson's Network Diagnostic Tool (NDT)
- client-server
- useful because the client can be a lightweight Java applet, running in a Web browser on most systems
- a command line client (compile and install) is also available
- public servers (Linux boxes with Web100 kernels), although I think only one outside the US (thank you SWITCH)
- detects problems, makes suggestions: duplex problems, TCP tuning amongst others
- The SURFnet Detective
41. Suggestions: Tools and Techniques
42. Suggestions: Tools and Techniques
- We could do these, but don't, because there's too much data to process/correlate:
- Cisco NetFlow data: routers record details of all traffic flows which they see
- src and dest IP addresses and ports
- start and end time
- amount of traffic transferred
- Parsing firewall logs (a parsing sketch follows below):
- root@gridmon2# iperf -c hepgrid7.ph.liv.ac.uk
- ------------------------------------------------------------
- Client connecting to hepgrid7.ph.liv.ac.uk, TCP port 5001
- TCP window size: 16.0 KByte (default)
- ------------------------------------------------------------
- [ 3] local 193.62.125.96 port 58316 connected with 138.253.178.107 port 5001
- [ 3] 0.0-10.0 sec 873 MBytes 732 Mbits/sec
- Jun 10 22:12:58 NetScreen device_id=gw-fw system-notification-00257(traffic) start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001 src zone=ESC-DMZ dst zone=Untrust action=Permit sent=948533470 rcvd=40793960 src=<hidden> dst=<hidden> src_port=58316 dst_port=5001 session_id=995619
- Not wholly accurate (22 secs not 10), and it ignores overheads, but it can be used for relative comparisons
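To illustrate how such logs can be used, a small sketch that parses key=value fields out of a NetScreen-style traffic log line and derives an approximate rate. The field names are assumed from the example above; a real parser would have to cope with format variants.

```python
# Minimal sketch: extract byte counters and duration from key=value
# firewall log fields and derive an approximate transfer rate.

import re

line = ('start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001 '
        'action=Permit sent=948533470 rcvd=40793960 '
        'src_port=58316 dst_port=5001 session_id=995619')

fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
sent = int(fields["sent"])            # bytes reported by the firewall
duration = int(fields["duration"])    # seconds (coarse, hence "relative" use)

rate_mbps = sent * 8 / duration / 1e6
print(f"~{rate_mbps:.0f} Mbit/s over {duration}s")   # ~345 Mbit/s
```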
43. Suggestions: Tools and Techniques
- SNMP data is (understandably) impossible to obtain for non-networkers
- Sharing data with the OGF NM-WG XML schemas may improve things
- And now some quick examples from gridmon:
- Dedicated boxes
- Same spec, OS, configuration - makes life a lot easier (comparing like-for-like)
- If running regular tests, get the results into an SQL database: fast, repeatable queries
- If no dedicated boxes are available, deploy a box for:
- either the best performance possible
- or something representative of systems at that end-site
- Sorry, no end-system examples here - we configured the boxes ourselves :-)
44. Example 1
- Glasgow running transfer tests to Edinburgh over the weekend of 28-29th October
- Experiencing poor rates (~80Mbps)
- 1st thing: despite transferring just 80Mbps, residual TCP bandwidth drops by 400Mbps
- Warning bells!
45. Example 1
- Traceroute data reveals a suspect router:
- traceroute to gridmon.epcc.ed.ac.uk (129.215.175.71), 30 hops max, 38 byte packets
- 1 194.36.1.1 (194.36.1.1) 0.941 ms 0.882 ms 0.815 ms
- 2 130.209.2.1 (130.209.2.1) 0.875 ms 0.831 ms 0.830 ms
- 3 130.209.2.118 (130.209.2.118) 60.415 ms 55.453 ms 31.327 ms
- 4 glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153) 32.420 ms 34.404 ms 29.424 ms
- 5 glasgow-bar.ja.net (146.97.40.57) 43.467 ms 52.298 ms 39.349 ms
- 6 po9-0.glas-scr.ja.net (146.97.35.53) 45.856 ms 44.445 ms 41.388 ms
- 7 po3-0.edin-scr.ja.net (146.97.33.62) 51.509 ms 63.493 ms 31.435 ms
- 8 po0-0.edinburgh-bar.ja.net (146.97.35.62) 22.454 ms 25.412 ms 31.381 ms
- 9 146.97.40.122 (146.97.40.122) 44.602 ms 42.494 ms 35.492 ms
- 10 gridmon.epcc.ed.ac.uk (129.215.175.71) 33.515 ms 34.623 ms 37.694 ms
46. Example 1
- The reverse route confirms it. Traceroutes are normal until we hit the suspect router:
- traceroute to gppmon-gla.scotgrid.ac.uk (194.36.1.56), 30 hops max, 38 byte packets
- 1 vlan175.srif-kb1.net.ed.ac.uk (129.215.175.126) 0.435 ms 0.387 ms 0.380 ms
- 2 edinburgh-bar.ja.net (146.97.40.121) 0.357 ms 0.329 ms 0.322 ms
- 3 po9-0.edin-scr.ja.net (146.97.35.61) 0.564 ms 0.485 ms 0.485 ms
- 4 po3-0.glas-scr.ja.net (146.97.33.61) 1.656 ms 1.511 ms 1.499 ms
- 5 po0-0.glasgow-bar.ja.net (146.97.35.54) 1.850 ms 1.352 ms 1.422 ms
- 6 146.97.40.58 (146.97.40.58) 1.679 ms 1.661 ms 1.569 ms
- 7 glasgowuni-ge1-1-glasgowpop-ge1-2-v152.clyde.net.uk (194.81.62.154) 1.796 ms 1.677 ms 1.646 ms
- 8 130.209.2.117 (130.209.2.117) 31.197 ms 34.615 ms 29.121 ms
- 9 130.209.2.2 (130.209.2.2) 32.814 ms 32.158 ms 32.145 ms
- 10 gppmon-gla.scotgrid.ac.uk (194.36.1.56) 41.634 ms 37.555 ms 24.635 ms
- The graphs and traceroutes provide evidence for further investigation
47. Example 1
- Further investigation revealed that the router had exhausted its CAM space
- <see the next slide if you want to know what this is>
- In simple terms, the router was forced to switch in software
- Because a particular lookup in a routing/switching/access table was not being hardware accelerated, problems were caused under certain flow conditions
- The solution: the CAM dynamic database was re-optimised (to free up CAM space) and the unit began switching in hardware again
48. Example 1
- CAM: Content-Addressable Memory
- Hardware (fast) implementation of an associative array:
- a data word (not a memory address!) is used to access it
- the CAM searches its entire contents to see if the data word is stored
- if the word is found, the CAM returns a list of one or more corresponding storage addresses, or other data associated with those storage addresses
- CAM is used for switching and routing, e.g. Ethernet switches store learned MAC addresses and their associated switch port in CAM (a toy model follows below):

  MAC Address      Located on Port
  -------------    ---------------
  000039-0643f5    26
  000089-01af9a    5
  000102-162346    16

- When an Ethernet frame arrives at the switch with a destination address of 000089-01af9a, the switch searches its CAM for that address
- The CAM returns 5, so the switch sends this Ethernet frame out on port 5
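A toy model of that table lookup: a Python dict plays the role of the CAM, with the content (the MAC address) as the key and the associated data (the port) as the lookup result. This only illustrates the behaviour, not the hardware's parallel search.

```python
# Toy model of the CAM-backed MAC table above: content in, associated
# data out. A dict gives the same key -> value behaviour that the CAM
# implements in hardware.

mac_table = {
    "000039-0643f5": 26,
    "000089-01af9a": 5,
    "000102-162346": 16,
}

def forward(dst_mac):
    """Return the output port, or None to flood (unknown destination)."""
    return mac_table.get(dst_mac)

print(forward("000089-01af9a"))  # -> 5, as in the example above
```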
49. Example 2
- A local departmental firewall was reconfigured to switch off strict checking of TCP sequence numbers
- A potential minefield: SACK etc.
50. Example 3
- Almost constant 33% UDP packet loss
- Fatal to most/all applications using UDP
- Occasional dip to 0%
51. Example 3
- Zooming into a particular day shows a period of 0% loss
- The site firewall limits UDP to 1,000 packets per second, per endpoint pair
- Temporarily raised to 20,000 pps for video conferences
52. The Answer
- Blair (vintage 1996), before he came to power
- "Education, education, education" became a mantra for his party
- NRENs are ideally placed to provide this
55. NFNN
- As an example: the Networks for non-Networkers (NFNN) workshops
- Aimed at people working at the technical level in high-bandwidth dependent science
- Talks on TCP, LAN, diagnostic steps, security
- http://gridmon.dl.ac.uk/nfnn/
56. Your Application
- Is your application making effective use of the network?
- Consider using multiple TCP sockets (i.e. multiple streams) for your data transfers
- One thread per socket (see the sketch below)
- Keep your pipe full of data:
- use asynchronous I/O, i.e. run computation and I/O in parallel
- pre-fetch data you know you are going to need, again in parallel with other computation or I/O
- when possible, read/write large blocks of data at a time - better to infrequently r/w ~1MB than frequently r/w 4K
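A minimal sketch of the one-thread-per-socket pattern suggested above. The destination host and port are placeholders, and error handling, stream negotiation and receiver-side reassembly are all omitted.

```python
# Minimal sketch: send one large buffer over several parallel TCP
# streams, one thread per socket. Host/port are hypothetical.

import socket
import threading

HOST, PORT = "destination.example.org", 5001   # placeholder endpoint
STREAMS = 4

def send_chunk(chunk: bytes) -> None:
    """Open one TCP connection and push one slice of the data."""
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(chunk)

data = b"\x00" * (64 * 1024 * 1024)            # 64 MB of dummy payload
step = len(data) // STREAMS
threads = [threading.Thread(target=send_chunk,
                            args=(data[i * step:(i + 1) * step],))
           for i in range(STREAMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```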
57. What Is Your Application Doing?
- Instrument your code, e.g. with NetLogger, a "Networked Application Logger"
- A methodology and set of tools
- Low overhead: can generate up to 5000/500 events/sec using the C/Java APIs with negligible impact on the app
- A simple and sensible methodology (a sketch follows below), e.g.:
- Rule 3: Log all of the following events: entering and exiting any program or software component, and begin/end of all I/O (disk and network)
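A sketch of that methodology using only the Python standard library. To be clear, this is not the NetLogger API; it just applies the same begin/end-event idea (Rule 3), so that overheads like the handshaking on the next slide become visible.

```python
# Instrumentation sketch: timestamped begin/end events around each
# component and I/O step, per Rule 3 above.

import logging
import time
from contextlib import contextmanager

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

@contextmanager
def event(name: str):
    """Log begin/end events with elapsed time."""
    logging.info("%s.start", name)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        logging.info("%s.end elapsed=%.3fs", name, time.perf_counter() - t0)

with event("transfer"):
    with event("transfer.connect"):
        time.sleep(0.1)            # stand-in for connection/handshake
    with event("transfer.write"):
        time.sleep(0.2)            # stand-in for the actual data write
```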
58. NetLogger
- client side GridFTP
- note the large overhead (~8s) of initial handshaking before real writing begins
59. Conclusion
- The Grid could use network performance data
- The reality is that it doesn't
- The Grid will exercise networks
- Core: fine. Metro: mostly fine. Most problems are in the last mile.
- Not every Grid app wants, needs or can afford dedicated λs
- Education, education, education. But please, no wars!
- Tune your end systems and applications
- Instrument your application so you can see what's happening
- For more information: m.j.leese@dl.ac.uk
60. Links (1)
- The GridPP (LHC in the UK) "gridmon" network monitoring infrastructure: http://gridmon3.dl.ac.uk/gridmon/
- Network Aware Scheduling in Grids:
- "Network Aware Scheduling in Grids" paper: http://users.atlantis.ugent.be/bvolckae/papers/NOC2004.pdf
- "Data Intensive and Network Aware (DIANA) Grid Scheduling" paper: http://hst.web.cern.ch/hst/publications/diana-JoGC.pdf
- Report of the International Grid Performance Workshop 2005: http://www-unix.mcs.anl.gov/schopf/GPW2005/report.pdf
- EDG WP7 Final Report: https://edms.cern.ch/file/414132/2.1/DataGrid-07-D7-4-0206-2.0.pdf
- EGEE-JRA4: http://egee-jra4.web.cern.ch/EGEE-JRA4/
- gLite User Guide: https://edms.cern.ch/file/722398/gLite-3-UserGuide.html
61. Links (2)
- Rolls Royce Networks:
- Cisco's Network Based On-demand/Grid System: http://www.terena.org/activities/nrens-n-grids/workshop-03/NBGS-Terena.pdf
- The NAREGI project: http://www.naregi.org/index_e.html
- Enlightened Computing: http://www.mcnc.org/index.cfm?fuseaction=page&filename=enlightened_computing.html
- G-Lambda: http://www.g-lambda.net
- Monitoring Grid workflows in real-time: http://www.di.unipi.it/augusto/seminars/200705_OGF20/2007-04-09_OGF-Slides.pdf
- Exploiting fibre infrastructures, UK ESLEA project closing conference: http://www.eslea.uklight.ac.uk/conf.html
- UCL RealityGrid project: http://www.realitygrid.org
- Daresbury Laboratory HPCx super computer: http://www.hpcx.ac.uk
62. Links (3)
- End host monitoring, LISA (Localhost Information Service Agent): http://monalisa.cacr.caltech.edu
- Synack, alternative ping tool: http://www-iepm.slac.stanford.edu/tools/synack/
- Thrulay, iperf-like tool: http://www.internet2.edu/shalunov/thrulay/
- Network Diagnostic Tool: http://e2epi.internet2.edu/ndt/
- SURFnet Detective: http://detective.surfnet.nl/en/index_en.html
- Sharing network performance data, OGF Network Measurements Working Group: http://nmwg.internet2.edu/
- TCP Selective Acknowledgements (SACK): http://www.ietf.org/rfc/rfc2018.txt
- NetLogger (Networked Application Logger): http://dsd.lbl.gov/NetLogger/