Title: EGEE/EPCC Work on Network Performance Monitoring
1. EGEE/EPCC Work on Network Performance Monitoring
- GÉANT2 JRA1 Meeting, Berlin
- June 26 2008
2. Outline
- Introduction
- EGEE Challenges and Strategy for NPM
- Architecture and Tools Developed
- Diagnostic Tool
- PCP: Probes Coordination Protocol
- Deployment Challenges
- Future Work
3. EPCC involvement with NPM
- EGEE-NPM
- Aim: Make network performance measurements for the EGEE infrastructure available to the rest of the project
- EGEE: 1 April 2004 to 31 March 2006
- JRA4
- BAR
- NPM
- Initial version of NPM services
- EGEE-II: 1 April 2006 to 31 March 2008
- SA1 (Operations)
- NPM only
- NM-WG version 2 services
- PCP
- JISC-NPM: 1 April 2008 to 31 March 2009
- Disseminate work to other JISC projects, collaborate with others, e.g. DEISA2
4. EGEE Challenges
- Scale and heterogeneity of EGEE fabric pose a requirement to support diversity of all kinds
- Multitude of ways of collecting monitoring data
- Different measurement types
- End-to-end
- Appropriate to experience of user and application, e.g. TCP achievable bandwidth
- Backbone
- Lower level measurements, used to pin-point source of problems
- Different measurement tools
- Different data formats
- Many administrative domains
- Different user groups
5. The Importance of end-to-end monitoring
- Most network problems can be attributed to the last mile
- Campus issues, not backbone
- Grid users want to know the expected performance of their application
- Don't always realise that they won't get the full backbone bandwidth
- Network infrastructure, configuration, firewalls etc.
- Influence of other network users
- Machine TCP configuration, disk system, memory, processor speed etc.
- Application itself
Grids require reliable monitoring of the network from source machine on one campus right through to destination machine elsewhere
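To illustrate why the end-to-end figure can fall well short of the backbone capacity, here is a minimal sketch that treats the achievable rate as the minimum of the limits along the path. The function names and every number are illustrative assumptions, not EGEE measurements. The only standard result used is that a single TCP stream cannot exceed its window size divided by the round-trip time.

```python
from typing import Dict, Tuple


def tcp_window_limit_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP stream: window / RTT, in Mbit/s."""
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1e6


def end_to_end_limit_mbps(limits_mbps: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and rate of the slowest element on the path."""
    name = min(limits_mbps, key=limits_mbps.get)
    return name, limits_mbps[name]


# Hypothetical limits for one source-to-destination path (Mbit/s).
limits = {
    "backbone": 10_000.0,
    "campus link": 1_000.0,
    "disk system": 800.0,
    # Assumed 64 KiB TCP window and 20 ms round-trip time.
    "tcp window": tcp_window_limit_mbps(window_bytes=65_536, rtt_ms=20.0),
}

bottleneck, rate = end_to_end_limit_mbps(limits)
print(f"{bottleneck}: {rate:.1f} Mbit/s")
```

With these assumed values the host TCP configuration, not the backbone, sets the limit, which is the kind of effect only a measurement between the two end machines would reveal.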
6. Strategy
- Facilitate access to data collected by existing measurement tools
- Lots already exist so no need to develop our own
- Data federation through use of GGF/OGF NM-WG schema
- Has prompted fruitful collaboration with GÉANT2 JRA1
- Availability of perfSONAR utilisation data through our tools
7. NPM Architecture
Clients
- User Interface
- Path Selection
- Metric Selection
- Plotting of results
Middleware
- Mediator
- Single point of contact for clients
- Metadata discovery
- Brokers data requests
Frameworks
- e2emonit
- Active end-to-end data
- perfSONAR
- Passive utilisation data from networks such as GÉANT2
8. Tools and Supported Frameworks
- Clients
- Diagnostic Tool
- For use by people
- Web based application for ease of access
- Middleware
- Mediator
- Single point of contact for clients
- Clients do not need to maintain list of frameworks
- Discovery of metadata
- Insulate clients from interface changes
- Exposes NM-WG web-service interface
- Added value services
- caching of data
- Measurement Frameworks
- e2emonit
- End-to-end metrics (TCP/UDP achievable bandwidth, RTT, packet loss, OWDV)
- Active measurement tools (iperf, ping, udpmon)
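The Mediator role listed above (single point of contact, metadata discovery, brokering of requests, caching) can be sketched in a few lines. This is an in-process illustration only: the real Mediator exposes an NM-WG web-service interface, which is not modelled here, and the class name, method names, metric names and the time-based cache policy are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A framework adapter is modelled as a function (path, metric) -> values.
Fetcher = Callable[[str, str], List[float]]


class Mediator:
    """Hypothetical broker between clients and measurement frameworks."""

    def __init__(self, cache_ttl_s: float = 60.0) -> None:
        self._frameworks: Dict[str, Tuple[Set[str], Fetcher]] = {}
        self._cache: Dict[Tuple[str, str], Tuple[float, List[float]]] = {}
        self._ttl = cache_ttl_s

    def register(self, name: str, metrics: Iterable[str], fetch: Fetcher) -> None:
        self._frameworks[name] = (set(metrics), fetch)

    def discover_metrics(self) -> Dict[str, List[str]]:
        """Metadata discovery: which framework offers which metrics."""
        return {n: sorted(m) for n, (m, _) in self._frameworks.items()}

    def query(self, path: str, metric: str) -> List[float]:
        """Broker a request; clients never contact a framework directly."""
        key = (path, metric)
        now = time.monotonic()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # served from the cache
        for metrics, fetch in self._frameworks.values():
            if metric in metrics:
                data = fetch(path, metric)
                self._cache[key] = (now, data)
                return data
        raise KeyError(f"no framework provides metric {metric!r}")


calls: List[Tuple[str, str, str]] = []


def fake_e2emonit(path: str, metric: str) -> List[float]:
    calls.append(("e2emonit", path, metric))  # stand-in for a real query
    return [1.0, 2.0]


def fake_perfsonar(path: str, metric: str) -> List[float]:
    calls.append(("perfSONAR", path, metric))  # stand-in for a real query
    return [0.3]


mediator = Mediator()
mediator.register("e2emonit", ["rtt", "packet_loss"], fake_e2emonit)
mediator.register("perfSONAR", ["utilisation"], fake_perfsonar)

first = mediator.query("siteA->siteB", "rtt")
second = mediator.query("siteA->siteB", "rtt")  # second call hits the cache
```

Because clients only see `discover_metrics` and `query`, a framework can be added or its interface changed by touching one adapter, which is the insulation the slide describes.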
9. NPM Diagnostic Tool
- The Diagnostic Tool is accessed through a standard web browser, and users can be individually authorised to use it.
- The intended user is a NOC/GOC/ROC operator, but anyone can use it to investigate problems.
- The sites and metrics displayed depend on which measurement tools have been deployed and where; this is determined using NM-WG metadata queries to the Mediator.
- Currently deployed with access to some perfSONAR MAs and test e2emonit data.
10. NPM Diagnostic Tool (2)
- The parameters used to gather measurements are shown.
- Here the iperf tool was used to measure the TCP achievable bandwidth.
- These parameters can be useful in interpreting the results.
11. NPM Diagnostic Tool (3)
- Information from multiple paths may be plotted at the same time.
- Here utilisation data for the GÉANT2/JANET router is plotted for both inbound and outbound traffic over the course of one week, obtained from the GÉANT2 perfSONAR Measurement Archive.
12. Deployment Challenges
- The usefulness of NPM depends on the data that is available
- Providing data federation tools not enough by itself
- We would like to use data that is already collected
- But monitoring tools currently not sufficiently deployed across sites
- Ideally individual regional federations or VOs make decisions on which tools to deploy for their infrastructure
- E.g. GridPP deployment of gridmon within UK
- We then help to make this data available through an NM-WG interface
13. Gridmon (1)
- Network monitoring for the UK GridPP infrastructure (UK contribution to EGEE)
- Mark Leese at STFC Daresbury Lab
- Active end-to-end measurements
- Similar tools/metrics to e2emonit
- TCP/UDP achievable bandwidth, RTT, packet loss
- Well defined set of sites and paths of interest
- Tier 1 centre to all, Tier 2 centres to others in same region
- Hope to soon deploy NM-WG web service
- Useful comparison of schema implementations
- Integrate into Mediator and DT
14. Gridmon (2)
15. Gridmon (3)
16. More Deployment Challenges
- Deployment of monitoring tools is not so easy
- There has to be a clear benefit to the site before they install tools
- This benefit is not obvious until after an incident has occurred, by which time it is too late
- Firewall changes may be difficult (e.g. ICMP blocked by default)
- Technically or politically
- Tools need to be trivial to install and robust when running
- Sys-admins very busy
- Need to carefully consider scheduling for end-to-end tests
- Overlapping measurements
- Network overload
Solution: Develop PCP
17. PCP: Probes Control Protocol
- Developed to solve management overhead of running active measurement probes
- e.g. manual cron jobs
- Token-based mechanism to co-ordinate periodic execution of monitoring tasks
- But applicable to any kind of task requiring regular scheduling across administrative domains
- Prevents overlapping measurements
- Probe will not run until token received
- Groups of sites form cliques
- Robust
- Can cope with sites in the clique being unreachable
- Secure
- Only pre-defined activities may be run
- VOMS/X.509 based authentication of users
- But designed as pluggable security
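The token idea can be illustrated with a single-process simulation. This is not the PCP implementation or its wire protocol: the function name, the site names, the fixed visiting order and the choice to skip an unreachable site are assumptions made for the sketch, and authentication is left out entirely. It shows only the three properties named above: one probe at a time, tolerance of unreachable sites, and a fixed list of permitted tasks.

```python
from typing import Callable, Dict, List


def run_clique_round(
    clique: List[str],
    reachable: Callable[[str], bool],
    allowed_tasks: Dict[str, Callable[[str], str]],
    task_name: str,
) -> List[str]:
    """Pass one token round the clique; only the token holder runs the task."""
    if task_name not in allowed_tasks:
        # Only pre-defined activities may be run.
        raise PermissionError(f"task {task_name!r} is not pre-defined")
    task = allowed_tasks[task_name]
    log: List[str] = []
    for site in clique:  # the token visits the sites one after another
        if not reachable(site):
            # Assumed recovery rule: skip the site so the token is not lost.
            log.append(f"{site}: skipped (unreachable)")
            continue
        # The task runs only while this site holds the token, so two
        # sites in the clique never measure at the same time.
        log.append(f"{site}: {task(site)}")
    return log


# Hypothetical clique with one site down.
clique = ["siteA", "siteB", "siteC"]
down = {"siteB"}
allowed = {"ping": lambda site: f"ping run from {site}"}

round_log = run_clique_round(clique, lambda s: s not in down, allowed, "ping")
for line in round_log:
    print(line)
```

Serialising execution through the token is what removes the need for hand-tuned cron offsets at each site, at the cost of needing a recovery rule for sites that cannot be reached.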
18. PCP Operation
19. Even More Deployment Challenges
- Different user groups may have widely different requirements for displaying data
- e.g. site or service admins may just want an alarm that tells them "your network is broken", and never look at the DT
- But network people would not contemplate investigating problems without clear historical data to look at
- The network is still assumed by many to "just work"
20. The Future (1)
- We are no longer involved in EGEE
- But funded by JISC for a further year to do similar work
- EGEE plans
- EGEE SA2 (ENOC etc.) have a small amount of effort from DFN
- On-demand measurements, requested by ENOC
- Central web server for authorisation, archiving, control
- New BWCTL-like plugins for traceroute, ping, DNS lookup, nmap
- LHC-OPN: Deploying perfSONAR services
21. (TSA2.2.4) Network monitoring tools (DFN)
- Network monitoring tools for efficient troubleshooting
- Launch test on demand from the Grid site under central server control: ping, traceroute, DNS lookup, nmap and bandwidth measurements
[Diagram with numbered steps 1 to 5. Labels: ENOC supervisor; ENOC; administrator; Grid site A; Grid site B; Local site light perfSONAR sensor; Central ENOC monitoring server]
SA2 Networking support Transition meeting May 08
22. The Future (2)
- Gridmon
- Collaboration around NM-WG v2 interfaces
- DEISA
- Fewer sites involved, currently 11
- DEISA plan to evaluate perfSONAR in the coming months
- But we need to do something useful soon
- Is there an opportunity to work together to deliver something useful to DEISA that would also enhance perfSONAR?
- Alarms?
- Presentation?
23. The Future (3)
- For large projects in general
- If multiple frameworks are deployed, then have to pursue interoperability through NM-WG and use Mediator-like components
- But are multiple frameworks really deployed?
- Where is NM-WG going?
- Why not try to install perfSONAR services everywhere?
24. Summary
- Provision of federated access to network measurement data has been demonstrated
- Based on OGF NM-WG schema
- Getting access to data itself is much harder
- Deployment challenges
- Need to sell to sites the value of having data available
- Differences between metrics provided by network providers and those that can be provided by individual sites
- End-to-end active vs. passive monitoring
- Should projects be attempting to do their own monitoring?
- If they don't then it is left up to providers
- But only projects can provide meaningful end-to-end measurements
- What happens when a site is active in multiple projects?