Transcript and Presenter's Notes

Title: End-to-end performance: issues and suggestions


1
End-to-end performance: issues and suggestions
  • TERENA 5th NRENs and Grids Workshop
  • Paris, June 2007
  • Mark Leese

2
Talk Emphasis
  • monALISA = a monitoring tool/framework
  • DANTE = a network operator
  • EGEE-II = a Grid
  • Mark = a pseudo-Grid end user
  • I'm not a real user, but I look at the issues from their viewpoint
  • Large Hadron Collider in the UK (GridPP)
  • UK e-Science
  • OGF
  • Aimed at a mixed audience (NRENs and Grid users), so some network/Grid things you will already know. Zzzzzzzzzzzz :-)

3
Contents
  • Just two things:
  • What makes the Grid different to other network users, wrt performance?
  • What are the end-to-end performance (monitoring) issues? Any suggestions?
  • If the links in the presentation don't work, they are listed again on the last three slides

4
1. What makes the Grid different to other network users, wrt performance?
5
The Grid
  • The Grid is all about:
  • Sharing resources
  • the obvious, e.g. databases
  • the specialised, e.g. remotely controlled telescopes
  • and new ideas, e.g. CPU time
  • co-allocate resources to a task to remove the limitations of the individual resources
  • most basic analogy: you can move house faster if you have two vans
  • Sharing resources which are geographically distributed
  • Sharing resources efficiently
  • optimisation = selecting the best resources for the job

6
The Grid
Network(s)
7
The Grid
  • Get apps running on the right resources
    (wherever they are)
  • Make disparate compute resources into a coherent
    whole

Network(s)
8
Optimisation
  • It's a little like the checkout counters in a supermarket
  • There is a line of 10 checkouts to which you can take your big shopping basket
  • Two checkouts you cannot use. They are for people with five items or less (the express checkout, "caisse express")
  • Another two checkouts cannot be used. They are reserved for something else (the staff's lunch break)
  • Six left: how big is each queue, and how long will it take each person to exit the queue (how many items in each basket)?
  • If you choose wrong, you get delayed!
  • You miss the train, you get home late, your partner has given your dinner to the dog
  • To take the analogy to extremes: hopefully your basket does not have a broken wheel :-)

9
Scheduling
  • Grid job = the basic unit of work
  • SEs (Storage Elements) provide storage resources and access to mass storage systems
  • CEs (Computing Elements) provide processing power, e.g. a cluster of Worker Nodes (PC farm)
  • Scheduling = deciding when a job will run, and with which resources
  • Typically there will be many CEs capable of running a job
  • If a CE already has lots of jobs queued, you would like to use another
  • File replication = a proven technique for improving data access
  • Distribute multiple copies of the same file across a Grid
  • Increases the number of CEs with good network connectivity to the data
  • Extreme example: Pisa→Roma or Pisa→Fermilab?
  • So, typically there may also be several SEs holding the required data

10
Network Aware Scheduling (i)
  • So we have a set of CEs {a, b, c, ...} and SEs {x, y, z, ...} capable of running a job
  • We want a node from each list such that the job will complete the fastest (see the sketch below)
  • Take account of:
  • the capability of the CEs
  • the size and number of jobs already waiting (queued) at the CEs
  • the performance of the network link for each CE-SE combination
  • Further complicated by the compute/data intensity of the job:
  • computationally intensive job = lots of maths
  • data intensive job = lots and lots and lots of data
  • do we pull the data to the job or push the job to the data?
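  • To make this concrete, here is a minimal illustrative sketch (not any real Grid scheduler): the CE/SE names, queue waits and per-path throughputs are invented stand-ins for what a monitoring framework might report

    # Illustrative only: pick the CE/SE pair with the lowest estimated turnaround.
    # Queue waits (hours) and per-path throughputs (Mbps) are invented numbers.
    input_size_gb = 50
    cpu_hours = 2.0

    queue_wait_h = {"ce_a": 0.5, "ce_b": 3.0, "ce_c": 0.1}
    path_mbps = {
        ("ce_a", "se_x"): 450, ("ce_a", "se_y"): 90,
        ("ce_b", "se_x"): 700, ("ce_b", "se_y"): 300,
        ("ce_c", "se_x"): 40,  ("ce_c", "se_y"): 200,
    }

    def transfer_h(size_gb, mbps):
        """Hours to move size_gb over a link sustaining mbps."""
        return size_gb * 8 * 1000 / mbps / 3600

    def estimate_h(ce, se):
        """Crude additive estimate: wait in the queue, pull the data, then run."""
        return queue_wait_h[ce] + transfer_h(input_size_gb, path_mbps[(ce, se)]) + cpu_hours

    best_ce, best_se = min(path_mbps, key=lambda pair: estimate_h(*pair))
    print(f"run on {best_ce}, read from {best_se}: ~{estimate_h(best_ce, best_se):.1f} h")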

11
Network Aware Scheduling (ii)
  • In Utopia we would know about the current state of the network, and any future reserved bandwidth
  • In reality we could use monitored network performance to make an estimate
  • It's not perfect, but patterns (diurnal variation, chronic poor performance) can be identified
  • The following slides show iperf tests between dedicated test nodes at LHC sites in the UK (GridPP's gridmon infrastructure)

12
Network Aware Scheduling (iii.a)
  • Transfer at 00:00, yes. Transfer at 12:00, no. There's a big difference between 500 and 200 Mbps for data intensive jobs!

13
Network Aware Scheduling (iii.b)
  • RAL Tier-2 → Tier-1 local transfers are likely the best performers

14
Network Aware Scheduling (iii.c)
  • Here, you have absolutely no idea what performance you would get → avoid
  • Summary: ignore the network at your peril :-)

15
Network Aware Scheduling (iv)
  • Two good papers to read:
  • B. Volckaert, P. Thysebaert, M. De Leenheer, F. De Turck, B. Dhoedt, P. Demeester, "Network Aware Scheduling in Grids"
  • Richard McClatchey, Ashiq Anjum, Heinz Stockinger, Arshad Ali, Ian Willers, Michael Thomas, "Data Intensive and Network Aware (DIANA) Grid Scheduling"
  • We don't consider potential uses in more detail (job placement, replica selection) because we don't know if it will happen!

16
Network Aware Scheduling (v)
  • There are some -ve (negative) feelings:
  • "The network is not a problem. Over-provisioning will always keep us ahead. Either that or fibre and GigE everywhere"
  • The Report of the International Grid Performance Workshop 2005 concluded that "Performance simply is not on the critical path for many application projects. Applications that struggle to get code to execute correctly simply do not consider whether they are using resources efficiently or achieving good performance"
  • Personal experience suggests that there is so much to think about elsewhere that the network is often the last thing to be considered
  • Right now, Grid apps rely on the network being good, with no real checks
  • And by way of real life indications:
  • EDG WP7 developed a network cost function
  • It returned the cost of variable-size file transfers between source and destination Grid elements
  • Based on periodic (WP7) iperf measurements
  • Used by the WP2 Replica Optimization Service for:
  • job placement = where to start a job so that it is as close as possible to the required data
  • replica selection = from where to fetch the closest replica once a job had started
  • EDG was not a production Grid, and the work was not taken forward

17
Network Aware Scheduling (vi)
  • In EGEE:
  • Tommaso Coviello and Tiziana Ferrari proposed to use network performance data from EGEE-JRA4
  • CompletionTime(CEi) = JobExecutionTime + max(InputDataTransferTime, QueueTime), sketched in code below
  • estimate file transfer times based on throughput
  • reject paths exhibiting packet loss
  • SE selection refined to prefer SEs on low-congestion links (jitter was the suggested test)
  • Some prototype work, but not taken forward:
  • QueueTime was found to be unreliable
  • data for 100 paths was required within 0.2 seconds of receiving a request
  • the Grid Information Service was not ready to hold the data
  • a problem for JRA4's Web Service interface (WS: accessible but slow)
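  • The selection rule above is only a few lines of code. A minimal sketch with invented numbers (not the JRA4 prototype itself): lossy paths are rejected first, then the remaining CE/SE candidates are compared using the CompletionTime formula

    # Invented example values: throughput in Mbps, loss as a fraction, times in hours.
    candidates = [
        {"ce": "ce_a", "se": "se_x", "mbps": 600, "loss": 0.0,   "queue_h": 1.5},
        {"ce": "ce_b", "se": "se_y", "mbps": 900, "loss": 0.002, "queue_h": 0.2},
        {"ce": "ce_c", "se": "se_z", "mbps": 250, "loss": 0.0,   "queue_h": 0.1},
    ]
    exec_h = 2.0       # JobExecutionTime
    input_gb = 100     # size of the input data set

    def completion_h(c):
        transfer_h = input_gb * 8 * 1000 / c["mbps"] / 3600   # InputDataTransferTime
        return exec_h + max(transfer_h, c["queue_h"])

    usable = [c for c in candidates if c["loss"] == 0.0]      # reject lossy paths
    best = min(usable, key=completion_h)
    print(best["ce"], best["se"], round(completion_h(best), 2), "hours")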

18
Network Aware Scheduling (vii)
  • In WLCG/EGEE (if I understand correctly):
  • The "close SE" approach is applied
  • Each CE must have a close SE = the node with the best access for data retrieval from that CE
  • These relationships are statically defined in the Grid's Information Service, which provides information about the Grid resources and their status
  • lcg-infosites --vo dteam closeSE
  • Name of the CE: g02.phy.bg.ac.yu:2119/blah-pbs-dteam
  • se.phy.bg.ac.yu
  • Name of the CE: fangorn.man.poznan.pl:2119/jobmanager-lcgpbs-dteam
  • se1.egee.man.poznan.pl
  • se2.egee.man.poznan.pl

19
Network Aware Scheduling (viii)
  • To run a job the user submits a job description in JDL (Job Description Language) format
  • It defines which executable to run, any parameters, input data (Grid files) etc.
  • A match-making process then takes place to identify a CE to execute the job (see the sketch below):
  • Identify all CEs which:
  • can run the job, i.e. match the user's requirements (JDL)
  • are close to an SE holding the required input Grid files
  • select the CE with the highest rank
  • by default, rank = an estimate of the time interval between the job being submitted and execution actually beginning
  • a function of the number of running and queued jobs at each CE
  • See the gLite User Guide for more info
  • As already stated, the presence of replicas of the data increases the number of CEs close to the data which can potentially execute the job
  • But decisions are still made on the static declaration of close SEs
  • Users are able to re-write the site selection code themselves
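  • A toy rendering of that match-making step (not the real gLite Workload Management System; the CE attributes and rank function are invented stand-ins)

    # Toy match-making: keep CEs 'close' to an SE holding the input data,
    # then take the highest-ranked one. All attributes are invented.
    candidate_ces = [
        {"name": "ce_a", "close_se": {"se_x"}, "waiting": 12, "free_slots": 0},
        {"name": "ce_b", "close_se": {"se_y"}, "waiting": 0,  "free_slots": 8},
        {"name": "ce_c", "close_se": {"se_z"}, "waiting": 3,  "free_slots": 2},
    ]

    required_se = {"se_x", "se_y"}   # SEs holding replicas of the job's input files

    def rank(ce):
        """Higher rank = expected to start sooner. A crude stand-in for the
        default rank, which estimates time from submission to execution."""
        return ce["free_slots"] - ce["waiting"]

    eligible = [ce for ce in candidate_ces if ce["close_se"] & required_se]
    chosen = max(eligible, key=rank)
    print("job sent to", chosen["name"])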

20
Difference 1
  • So, difference 1
  • The Grid may use network performance data to
    improve its decision making

21
Difference 2
  • Difference 2
  • The Grid will exercise the network

22
Qualitative View
  • By its very nature:
  • sharing lots of resources to build powerful systems
  • to process complex, large data sets
  • in geographically distributed teams
  • some of it in real-time, e.g. visualisation
  • so far there have been lots of embarrassingly parallel problems (completely independent tasks which can be executed in parallel), but what about tasks requiring inter-processor communication (MPI, Message Passing Interface)?
  • the result is a lot of data moving across the network, needing:
  • high bandwidth
  • low latency
  • stable and guaranteed transmission rates

23
Quantitative View (i)
  • The Large Hadron Collider at CERN hosts four experiments (ALICE, ATLAS, CMS and LHCb) that will monitor the collision of accelerated particles
  • 15 Petabytes of data generated every year
  • Around 100,000 standard CPUs required to process it
  • GridPP (UK) is contributing the equivalent of 10,000 PCs

24
Quantitative View (ii)
  • My understanding is that the LHC, when operational, will be pushing out 700 Mbytes/s (≈ 5 Gbps) from the Tier-0 to each Tier-1
  • 11 Tier-1s, linked to CERN with a 10 Gbps Optical Private Network
  • So no problems there
  • Additional variable flows of ≈ 4 Gbps are expected between the Tier-1s
  • What about Tier-1s to Tier-2s?
  • > 150 Tier-2s, 18 in the UK
  • Tier-1s and Tier-2s are currently linked by standard research networks
  • Are you going to commission dedicated fibres or lambdas for each?

25
Quantitative View (iii)
26
Rolls Royce Networks
  • Lots of projects are working on adding extra intelligence into the network, and/or interfacing Grid applications with the network control plane for auto-provisioning of dedicated bandwidth:
  • Cisco's Network Based On-demand/Grid System (NBGS)
  • The NAREGI project
  • Enlightened Computing
  • http://www.g-lambda.net/
  • These are still development projects
  • Can fibre/lambdas be provided for all that need it?
  • Even if provided, is there a temptation to spend on CPU power instead?
  • May still fall victim to end-system and last mile (e.g. firewall) problems

27
Is the Grid a lot of Hype?
  • It's good to be skeptical about things. Every four years people say England will win the World Cup / Coupe du Monde ;-)
  • The Grid is ambitious
  • but so was the "World Wide Wait"
  • Now everyone loves the Web, and it has become important to people:
  • Internet banking, online shopping (flights, holidays, music, supermarket), e-Government etc. etc.
  • MySpace, Facebook, YouTube
  • The Web also drove investment in the Net infrastructure, and as a result it can now support video conferencing, VoIP etc.

28
Summary of Differences
  • Network Operations: we can safely say that greater demands will be placed on the network:
  • massive datasets, 1000s of networked resources
  • geographically distributed = Long Fat Networks
  • high bandwidth, high availability, low latency
  • networks will need to be debugged for efficiency
  • Network Intelligence: the Grid may want to consume network performance data to improve its decision making

29
2. What are the end-to-end performance (monitoring) issues?
30
The Overall Issue
  • We have seen that the Grid could use network performance data for decision making
  • but we don't know whether it will
  • As a result, we concentrate on debugging the network for Grid users

31
End-to-End?
  • When I say end-to-end I mean PC-to-PC, not PoP-to-PoP or similar
  • Core and Metro Area are normally fine
  • Most problems are in the last mile:
  • End-system:
  • NIC
  • disc
  • TCP config
  • poor cabling
  • the application itself (e.g. older versions of scp)
  • I could go on forever (no, please don't!)
  • Site firewall
  • Off-site connections

32
So Many Issues
  • Beyond the basics of which tests to run, and how to control/schedule them, there are too many end-to-end performance issues to consider when monitoring. Here, I mention a few and make some suggestions.
  • TCP performance
  • Parallel TCP streams
  • Different data transfer protocols (e.g. GridFTP vs HTTP)
  • New protocols, e.g. DCCP
  • TCP/IP is ubiquitous so we stick with it; we can't necessarily wait for new protocols and network architectures
  • Measurement types:
  • active vs passive
  • capture logs of real GridFTP transfers. Is there Grid Information Service support?
  • can we monitor Grid workflows in real-time?
  • Too many test paths. Can we plug in to VO data to test only the required paths?

33
Over-Provisioning
  • Q: Okay, so why don't we just throw some more bandwidth at the problem? Upgrade the links.
  • A: For want of a more interesting term (to make sure you're still paying attention), this is what I call the "Heroin Effect"
  • You start off with a little, but that's not really doing it for you; it's not solving the problem. So you keep increasing the dose, yet it's never as good as you thought it would be.
  • By analogy, you keep buying more and more bandwidth to take you to new highs, but it's never quite as good as you thought it would be
  • Simple over-provisioning is not sufficient:
  • it doesn't address the key issue of end-to-end performance
  • the network backbone in most cases is genuinely not the source of the problem
  • the last mile (campus network → end-user system → your app) is often the cause of the problem: firewall, wiring, hard disc, application and many more potential culprits
  • Also, if simple over-provisioning were a total solution, there would not be so much other work going on, e.g. protocol research (high-speed TCPs)

34
Let's Put Fibre Everywhere (1)
  • Fibre is cheaper than it was, but for large deployments it's still expensive
  • We can see the benefits of fibre with the UKLight infrastructure and the ESLEA exploitation project, but it still doesn't address the end-to-end issue. Take a real-life ESLEA example (thanks to ESLEA for the figures):
  • The UK wanted to transfer data from FermiLab (Chicago) to UCL for analysis by physicists, before returning the results
  • datasets currently 1-50 TB
  • 50 TB would take > 6 months on the production net, or one week at 700 Mbps
  • So a 1 Gbps circuit-switched light path was provisioned
  • Result: disc-to-disc transfers @ 250 Mbps, just 1/4 of the theoretical max
  • Tests revealed a problem at an end site

35
Let's Put Fibre Everywhere (2)
  • UCL: RealityGrid, for modelling complex condensed matter systems: computational steering, visualisation
  • Test node: 2 × 1.8 GHz Athlon, 4 GB, GigE, CentOS
  • DL: HPCx supercomputer
  • Test node: 3 GHz P4, 2 GB, GigE, Scientific Linux
  • RTT is always 9 ms
  • TCP bandwidth is, errr....

36
Mark's Tips
  • There are lots of tools, frameworks and infrastructures out there
  • Massive list at http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • Pick something that works for you - it's a balance of:
  • ongoing administration
  • deployment effort (e.g. persuading remote sites to install tools and allow you to run tests)
  • how intrusive the tests are
  • Start your investigations in the last mile
  • Do put real data over the network:
  • you can send 1 ping a second forever and see 10^-8 loss
  • you then run an iperf test and the performance is terrible
  • Keep historic data: things change
  • you will want to look back, and you will want points of reference
  • When you see a problem, follow it up and get information
  • Not only is the problem fixed, but you get to demonstrate why this is useful, which helps with deployment, support and growing the user base
  • Remember the social aspects - be persistent but patient :-)

37
Suggestions Tools and Techniques
  • Start with the local host
  • As you would expect:
  • uname
  • netstat
  • ifconfig (watch error counters etc.)
  • LISA (Localhost Information Service Agent)
  • a component of MonALISA
  • almost complete system monitoring (load, CPU, memory, disk, disk I/O, paging, processes, network traffic and connectivity...)
  • Check everything:
  • TCP configuration (see the sketch below)
  • machine load
  • disc (SAS, SATA, nasty old IDE?)
  • If TCP is the problem, what UDP rates can you achieve?
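  • As one concrete TCP-configuration check, the sketch below compares the bandwidth-delay product with the kernel's maximum TCP receive buffer (Linux-specific; the 1 Gbps target and 9 ms RTT are example numbers)

    # Sanity-check TCP buffer sizing against the bandwidth-delay product.
    target_gbps = 1.0     # rate you hope to achieve (example value)
    rtt_ms = 9.0          # round-trip time, e.g. from ping/thrulay (example value)

    bdp_bytes = int(target_gbps * 1e9 / 8 * rtt_ms / 1e3)
    print(f"bandwidth-delay product: {bdp_bytes / 1e6:.2f} MB")

    # Maximum receive buffer TCP may use (third value of min, default, max).
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        rmem_max = int(f.read().split()[2])

    if rmem_max < bdp_bytes:
        print(f"tcp_rmem max ({rmem_max} B) is below the BDP: the window will cap throughput")
    else:
        print("receive buffer is large enough for the target rate")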

38
Suggestions Tools and Techniques
  • ping: still useful, but you need to send much faster than 1 per second, and for a long time, to catch 10^-8 loss
  • back-of-envelope calculation (reproduced in code below): on Saturday I ran a 10 sec iperf test which transferred 624 MB in 480,000 packets, so ≈ 1.3 KB per packet
  • 1 loss every 100,000,000 packets means ≈ 128 GB transferred before a loss causes your transfer rate to drop
  • can use the Synack tool (sparingly) if ICMP is blocked
  • traceroute and reverse traceroutes: regularly measuring the routes to your most important collaborators is very useful
  • dedicated monitoring boxes are useful here because they may be allowed through (firewalls etc.) for ICMP
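  • The same arithmetic, spelled out (it lands around 130 GB, the same ballpark as the 128 GB quoted above)

    # Reproducing the back-of-envelope numbers above.
    transferred_mb = 624
    packets = 480_000
    bytes_per_packet = transferred_mb * 1e6 / packets
    print(f"{bytes_per_packet:.0f} bytes per packet")      # ~1300, i.e. ~1.3 KB

    loss_rate = 1e-8                                       # 1 loss per 10^8 packets
    gb_between_losses = bytes_per_packet / loss_rate / 1e9
    print(f"~{gb_between_losses:.0f} GB transferred per lost packet")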

39
Suggestions Tools and Techniques
  • As we will see, time series data is probably the most useful
  • When did your problems start? When did things change?
  • Unfortunately, this relies on there being proximity between your paths/devices and ones for which there is available data
  • If you suspect the problem is in the core, you may be able to find the problem router (or its rough location) through so-called "looking glass" servers: statistics of network operator performance
  • ping and iperf are very useful here, but be wary
  • In May 2004, Les Cottrell (SLAC) said "As measured by NetFlow, 25% of the traffic on Abilene is iperf and ping type traffic"

40
Suggestions Tools and Techniques
  • Thrulay is an iperf-like tool for measuring TCP and UDP bandwidth
  • useful because it also gives you the RTT seen by the transfer, not ping/traceroute's estimate
  • Two detective-type tools:
  • Tom Dunnigan and Rich Carlson's Network Diagnostic Tool (NDT)
  • client-server
  • useful because the client can be a lightweight Java applet, which runs in a Web browser on most systems
  • a command line client (compile and install) is also available
  • public servers (Linux boxes with Web100 kernels), although I think only one outside the US (thank you SWITCH)
  • detects problems, makes suggestions: duplex problems, TCP tuning amongst others
  • The SURFnet Detective

41
Suggestions Tools and Techniques
42
Suggestions Tools and Techniques
  • We could do these, but don't, because there's too much data to process/correlate:
  • Cisco NetFlow data: routers record details of all traffic flows which they see:
  • src and dest IP addresses and ports
  • start and end time
  • amount of traffic transferred
  • Parsing firewall logs (see the sketch below):
  • root@gridmon2# iperf -c hepgrid7.ph.liv.ac.uk
  • ------------------------------------------------------------
  • Client connecting to hepgrid7.ph.liv.ac.uk, TCP port 5001
  • TCP window size: 16.0 KByte (default)
  • ------------------------------------------------------------
  • [ 3] local 193.62.125.96 port 58316 connected with 138.253.178.107 port 5001
  • [ 3] 0.0-10.0 sec 873 MBytes 732 Mbits/sec
  • Jun 10 22:12:58 NetScreen device_id=gw-fw system-notification-00257(traffic) start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001 src zone=ESC-DMZ dst zone=Untrust action=Permit sent=948533470 rcvd=40793960 src=<hidden> dst=<hidden> src_port=58316 dst_port=5001 session_id=995619
  • Not wholly accurate (22 secs, not 10) and it ignores overheads, but it can be used for relative comparisons
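  • A rough sketch of the "parse the firewall log" idea, built around the reconstructed key=value line above (illustrative only, not a general NetScreen log parser)

    import re

    # Pull the byte counter and duration out of the log line and turn them
    # into an approximate rate. Field names follow the line shown above.
    log = ('start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001 '
           'action=Permit sent=948533470 rcvd=40793960 '
           'src_port=58316 dst_port=5001 session_id=995619')

    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', log))
    sent_bytes = int(fields["sent"])
    duration_s = int(fields["duration"])

    print(f"~{sent_bytes * 8 / duration_s / 1e6:.0f} Mbit/s as seen by the firewall")
    # The absolute figure differs from iperf's 732 Mbit/s (longer accounting
    # window, protocol overheads), but tracked over time it is useful relatively.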

43
Suggestions Tools and Techniques
  • SNMP data is (understandably) impossible to obtain for non-networkers
  • Sharing data with the OGF NM-WG XML schemas may improve things
  • And now some quick examples from gridmon:
  • Dedicated boxes
  • Same spec, OS and configuration - makes life a lot easier (comparing like-for-like)
  • If running regular tests, get the results into an SQL database: fast, repeatable queries (see the sketch below)
  • If no dedicated boxes are available, deploy a box for:
  • either the best performance possible
  • or something representative of systems at that end-site
  • Sorry, no end-system examples here: we configured the boxes ourselves ;-)
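  • A minimal sketch of the "results in an SQL database" point (the table layout, host names and figures are invented)

    import sqlite3

    # Keep regular test results in SQL so "when did this path change?" is a
    # one-line query. Schema and values are illustrative.
    db = sqlite3.connect("gridmon.db")
    db.execute("""CREATE TABLE IF NOT EXISTS iperf_results (
                      src TEXT, dst TEXT, ts INTEGER, mbps REAL)""")
    db.execute("INSERT INTO iperf_results VALUES (?, ?, ?, ?)",
               ("gridmon-gla", "gridmon-ed", 1177770000, 512.3))
    db.commit()

    # Repeatable query: daily average throughput for one path.
    rows = db.execute("""SELECT date(ts, 'unixepoch') AS day, avg(mbps)
                         FROM iperf_results
                         WHERE src = 'gridmon-gla' AND dst = 'gridmon-ed'
                         GROUP BY day ORDER BY day""").fetchall()
    for day, avg_mbps in rows:
        print(day, round(avg_mbps, 1))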

44
Example 1
  • Glasgow running transfer tests to Edinburgh over the weekend of 28-29th October
  • Experiencing poor rates (80 Mbps)
  • 1st thing: despite transferring just 80 Mbps, residual TCP bandwidth drops by 400 Mbps
  • Warning bells

45
Example 1
  • Traceroute data reveals a suspect router:
  • traceroute to gridmon.epcc.ed.ac.uk (129.215.175.71), 30 hops max, 38 byte packets
  • 1 194.36.1.1 (194.36.1.1) 0.941 ms 0.882 ms 0.815 ms
  • 2 130.209.2.1 (130.209.2.1) 0.875 ms 0.831 ms 0.830 ms
  • 3 130.209.2.118 (130.209.2.118) 60.415 ms 55.453 ms 31.327 ms
  • 4 glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153) 32.420 ms 34.404 ms 29.424 ms
  • 5 glasgow-bar.ja.net (146.97.40.57) 43.467 ms 52.298 ms 39.349 ms
  • 6 po9-0.glas-scr.ja.net (146.97.35.53) 45.856 ms 44.445 ms 41.388 ms
  • 7 po3-0.edin-scr.ja.net (146.97.33.62) 51.509 ms 63.493 ms 31.435 ms
  • 8 po0-0.edinburgh-bar.ja.net (146.97.35.62) 22.454 ms 25.412 ms 31.381 ms
  • 9 146.97.40.122 (146.97.40.122) 44.602 ms 42.494 ms 35.492 ms
  • 10 gridmon.epcc.ed.ac.uk (129.215.175.71) 33.515 ms 34.623 ms 37.694 ms

46
Example 1
  • Reverse route confirms. Traceroutes are normal until we hit the suspect router:
  • traceroute to gppmon-gla.scotgrid.ac.uk (194.36.1.56), 30 hops max, 38 byte packets
  • 1 vlan175.srif-kb1.net.ed.ac.uk (129.215.175.126) 0.435 ms 0.387 ms 0.380 ms
  • 2 edinburgh-bar.ja.net (146.97.40.121) 0.357 ms 0.329 ms 0.322 ms
  • 3 po9-0.edin-scr.ja.net (146.97.35.61) 0.564 ms 0.485 ms 0.485 ms
  • 4 po3-0.glas-scr.ja.net (146.97.33.61) 1.656 ms 1.511 ms 1.499 ms
  • 5 po0-0.glasgow-bar.ja.net (146.97.35.54) 1.850 ms 1.352 ms 1.422 ms
  • 6 146.97.40.58 (146.97.40.58) 1.679 ms 1.661 ms 1.569 ms
  • 7 glasgowuni-ge1-1-glasgowpop-ge1-2-v152.clyde.net.uk (194.81.62.154) 1.796 ms 1.677 ms 1.646 ms
  • 8 130.209.2.117 (130.209.2.117) 31.197 ms 34.615 ms 29.121 ms
  • 9 130.209.2.2 (130.209.2.2) 32.814 ms 32.158 ms 32.145 ms
  • 10 gppmon-gla.scotgrid.ac.uk (194.36.1.56) 41.634 ms 37.555 ms 24.635 ms
  • Graphs and traceroutes provide evidence for further investigation

47
Example 1
  • Further investigation revealed that the router had exhausted its CAM space
  • <see next slide if you want to know what this is>
  • In simple terms, the router was forced to switch in software
  • Because a particular lookup in a routing/switching/access table was not being hardware accelerated, problems were caused under certain flow conditions
  • The solution: the CAM dynamic database was re-optimised (to free up CAM space) and the unit began switching in hardware again

48
Example 1
  • CAM = Content-Addressable Memory
  • Hardware (fast) implementation of an associative array:
  • a data word (not a memory address!) is used to access it
  • the CAM searches its entire contents to see if the data word is stored
  • if the word is found, the CAM returns a list of one or more corresponding storage addresses, or other data associated with those storage addresses
  • CAM is used for switching and routing, e.g. Ethernet switches store learned MAC addresses and their associated switch port in CAM:
  • MAC Address       Located on Port
  • -------------     ---------------
  • 000039-0643f5     26
  • 000089-01af9a     5
  • 000102-162346     16
  • When an Ethernet frame arrives at the switch with a destination address of 000089-01af9a, the switch searches its CAM for that address
  • The CAM will return 5, so the switch sends this Ethernet frame out on port 5 (modelled in the sketch below)
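  • The same lookup, modelled in a few lines: a dictionary stands in for the CAM, which performs this search in parallel in hardware

    # The MAC address (the data word) is the key; the switch port is the
    # associated data, exactly as in the table above.
    cam = {
        "000039-0643f5": 26,
        "000089-01af9a": 5,
        "000102-162346": 16,
    }

    frame_dst = "000089-01af9a"
    port = cam.get(frame_dst)
    if port is not None:
        print(f"forward frame out of port {port}")
    else:
        print("unknown destination: flood the frame to all ports")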

49
Example 2
  • Local departmental firewall reconfigured to
    switch off strict checking of TCP sequence
    numbers
  • Potential minefield: SACK etc.

50
Example 3
  • Almost constant 33% UDP packet loss
  • Fatal to most/all applications using UDP
  • Occasional dip to 0%

51
Example 3
  • Zooming in to a particular day shows a period of 0% loss
  • The site firewall limits UDP to 1,000 packets per second, per endpoint pair
  • It was temporarily raised to 20,000 pps for video conferences

52
The Answer
  • Blair (vintage 1996), before he came to power
  • "Education, education, education" became a mantra for his party
  • NRENs are ideally placed to provide this

55
NFNN
  • As an example: the Networks for non-Networkers (NFNN) workshops
  • Aimed at people working at the technical level in high-bandwidth dependent science
  • Talks on TCP, LAN, diagnostic steps, security
  • http://gridmon.dl.ac.uk/nfnn/

56
Your Application
  • Is your application making effective use of the network?
  • Consider using multiple TCP sockets (i.e. multiple streams) for your data transfers
  • One thread per socket
  • Keep your pipe full of data:
  • use asynchronous I/O, i.e. run computation and I/O in parallel
  • pre-fetch data you know you are going to need, again in parallel with other computation or I/O
  • when possible, read/write large blocks of data at a time: better to infrequently r/w ≥ 1 MB than to frequently r/w 4 KB (see the sketch below)
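  • A minimal illustrative sender along these lines: one thread per TCP socket, large reads, interleaved file slices (host, ports and filename are placeholders, and a matching receiver that reassembles the slices is assumed)

    import socket
    import threading

    HOST, PORT, FILENAME = "data.example.org", 5000, "dataset.bin"   # placeholders
    STREAMS, BLOCK = 4, 8 * 1024 * 1024                              # 4 sockets, 8 MB reads

    def send_slice(stream_id):
        # Each stream gets its own connection and pushes every STREAMS-th block.
        with socket.create_connection((HOST, PORT + stream_id)) as s, \
             open(FILENAME, "rb") as f:
            f.seek(stream_id * BLOCK)
            while True:
                block = f.read(BLOCK)             # large reads, few syscalls
                if not block:
                    break
                s.sendall(block)
                f.seek((STREAMS - 1) * BLOCK, 1)  # skip the other streams' slices

    threads = [threading.Thread(target=send_slice, args=(i,)) for i in range(STREAMS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()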

57
What Is Your Application Doing?
  • Instrument your code, e.g. with NetLogger, a Networked Application Logger
  • A methodology and set of tools
  • Low overhead: can generate up to 5000/500 events/sec using the C/Java APIs with negligible impact on the app
  • A simple and sensible methodology, e.g.
  • Rule 3: Log all of the following events: entering and exiting any program or software component, and begin/end of all I/O (disk and network). (A minimal stand-in is sketched below.)
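  • A minimal stand-in that follows the methodology (this is not the real NetLogger API; the event names and fields are invented)

    import sys
    import time

    def log_event(event, **fields):
        # Timestamped, machine-parseable one-line events.
        extras = " ".join(f"{k}={v}" for k, v in fields.items())
        print(f"ts={time.time():.6f} event={event} {extras}", file=sys.stderr)

    def transfer(path, blocks=4, block_size=1 << 20):
        log_event("transfer.start", path=path)
        for i in range(blocks):
            log_event("disk.read.start", block=i)
            data = b"\0" * block_size        # stand-in for a real disk read
            log_event("disk.read.end", block=i, bytes=len(data))
            log_event("net.write.start", block=i)
            time.sleep(0.01)                 # stand-in for a socket send
            log_event("net.write.end", block=i)
        log_event("transfer.end", path=path)

    transfer("dataset.bin")
    # Plotting the gaps between *.start and *.end events shows where the time
    # goes (e.g. the ~8 s GridFTP handshake on the next slide).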

58
Netlogger
  • client-side GridFTP
  • note the large overhead (≈ 8 s) of initial handshaking before real writing begins

59
Conclusion
  • The Grid could use network performance data
  • The reality is that it doesn't
  • The Grid will exercise networks
  • Core: fine. Metro: mostly fine. Most problems are in the last mile.
  • Not every Grid app wants, needs or can afford dedicated λs
  • Education, education, education. But please, no wars!
  • Tune your end systems and applications
  • Instrument your application so you can see what's happening
  • For more information: m.j.leese@dl.ac.uk

60
Links (1)
  • The GridPP (LHC in the UK) "gridmon" network monitoring infrastructure: http://gridmon3.dl.ac.uk/gridmon/
  • Network Aware Scheduling in Grids:
  • "Network Aware Scheduling in Grids" paper: http://users.atlantis.ugent.be/bvolckae/papers/NOC2004.pdf
  • "Data Intensive and Network Aware (DIANA) Grid Scheduling" paper: http://hst.web.cern.ch/hst/publications/diana-JoGC.pdf
  • Report of the International Grid Performance Workshop 2005: http://www-unix.mcs.anl.gov/schopf/GPW2005/report.pdf
  • EDG WP7 Final Report: https://edms.cern.ch/file/414132/2.1/DataGrid-07-D7-4-0206-2.0.pdf
  • EGEE-JRA4: http://egee-jra4.web.cern.ch/EGEE-JRA4/
  • gLite User Guide: https://edms.cern.ch/file/722398/gLite-3-UserGuide.html

61
Links (2)
  • Rolls Royce Networks:
  • Cisco's Network Based On-demand/Grid System: http://www.terena.org/activities/nrens-n-grids/workshop-03/NBGS-Terena.pdf
  • The NAREGI project: http://www.naregi.org/index_e.html
  • Enlightened Computing: http://www.mcnc.org/index.cfm?fuseaction=page&filename=enlightened_computing.html
  • G-Lambda: http://www.g-lambda.net
  • Monitoring Grid workflows in real-time: http://www.di.unipi.it/augusto/seminars/200705_OGF20/2007-04-09_OGF-Slides.pdf
  • Exploiting fibre infrastructures, UK ESLEA project closing conference: http://www.eslea.uklight.ac.uk/conf.html
  • UCL RealityGrid project: http://www.realitygrid.org
  • Daresbury Laboratory HPCx super computer: http://www.hpcx.ac.uk

62
Links (3)
  • End host monitoring, LISA (Localhost Information Service Agent): http://monalisa.cacr.caltech.edu
  • Synack, alternative ping tool: http://www-iepm.slac.stanford.edu/tools/synack/
  • Thrulay, iperf-like tool: http://www.internet2.edu/shalunov/thrulay/
  • Network Diagnostic Tool: http://e2epi.internet2.edu/ndt/
  • SURFnet Detective: http://detective.surfnet.nl/en/index_en.html
  • Sharing network performance data, OGF Network Measurements Working Group: http://nmwg.internet2.edu/
  • TCP Selective Acknowledgements (SACK): http://www.ietf.org/rfc/rfc2018.txt
  • NetLogger (Networked Application Logger): http://dsd.lbl.gov/NetLogger/