Title: LHCOPN Monitoring Working Group Update
1. LHC-OPN Monitoring Working Group Update
- Shawn McKee
- LHC-OPN T0-T1 Meeting
- Rome, Italy
- April 4th, 2006
2. LHC-OPN Monitoring Overview
- The LHC-OPN exists to share LHC data with, and between, the T1 centers.
- Being able to monitor this network is vital to its success and is required for operations.
- Monitoring is important for
- Fault notification
- Performance tracking
- Problem diagnosis
- Scheduling and prediction
- Security
- See previous (Amsterdam) talk for an overview and
details on all this
3. The LHC-OPN Network
4. LHC-OPN Monitoring View
- The diagram to the right is a logical representation of the LHC-OPN showing monitoring hosts.
- The LHC-OPN extends to just inside the T1 edge.
- Read/query access should be guaranteed on LHC-OPN-owned equipment.
- We also request RO access to devices along the path to enable quick fault isolation.
5. Status Update
- During the Amsterdam meeting (Jan 2006) we decided to focus on two areas:
- Important/required metrics
- Prototyping LHC-OPN monitoring
- There is an updated LHC-OPN Monitoring document on the LHC-OPN web page emphasizing this new focus.
- This meeting:
- What metrics should be required for the LHC-OPN?
- We need to move forward on prototyping LHC-OPN monitoring services: volunteer sites?
6. Monitoring Possibilities by Layer
- For each layer we could monitor a number of metrics of the LHC-OPN:
- Layer-1
- Optical power levels
- Layer-2
- Packet statistics (e.g., RMON)
- Layer-3/4
- Netflow
- All Layers
- Utilization (bandwidth in use, Mbits/sec)
- Availability (track accessibility of device over time)
- Error Rates
- Capacity
- Topology
7. LHC-OPN Paths: Multiple Layers
- Each T0-T1 path has many views
- Each OSI Layer (1-3) may have different devices
involved.
- This diagram is likely simpler than most cases in
the LHC-OPN
8. Metrics for the LHC-OPN (EGEE Network Performance Metrics V2)
- For edge-to-edge monitoring, the relevant metrics include:
- Availability (of T0-T1 path, each hop, T1-T1?)
- Capacity (T0-T1, each hop)
- Utilization (T0-T1, each hop)
- Delays (T0-T1 paths, One-way, RTT, jitter)
- Error Rates (T0-T1, each hop)
- Topology (L3 traceroute, L1?, L2)
- MTU (each path and hop)
- What about Scheduled Downtime, Trouble Tickets?
9. Availability
- Availability (or uptime) measures the amount of time the network is up and running.
- Can be by hop or a complete path
- Methodology:
- Layer 1: Measure power levels/bit rate?
- Layer 2: Utilize SNMP to check interface
- Layer 3: ping
- Units: Expressed as a percentage
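- For illustration, a minimal sketch of the Layer-3 check described above, using the system ping (Linux iputils flags assumed) against hypothetical target hosts and polling parameters:

    #!/usr/bin/env python
    # Sketch of a Layer-3 availability check via ping.
    # Host names and the polling parameters are illustrative placeholders.
    import subprocess, time

    TARGETS = ["t1-edge.example.org"]   # hypothetical LHC-OPN monitoring targets
    SAMPLES = 12                        # e.g. one probe every 5 minutes for an hour
    INTERVAL = 300                      # seconds between probes

    def is_reachable(host):
        """Return True if a single ICMP echo request gets a reply within 2 s."""
        rc = subprocess.call(["ping", "-c", "1", "-W", "2", host],
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return rc == 0

    for host in TARGETS:
        up = 0
        for _ in range(SAMPLES):
            if is_reachable(host):
                up += 1
            time.sleep(INTERVAL)
        # Availability expressed as a percentage, per the units above.
        print("%s availability: %.1f%%" % (host, 100.0 * up / SAMPLES))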
10. Capacity
- Capacity is the maximum amount of data per unit time a hop or path can transport.
- Can be listed by hop or path
- Methodology:
- Layer 1: Surveyed (operator entry)
- Layer 2: SNMP query on interface
- Layer 3: Minimum of component hops
- Units: Bit rate (Kbits, Mbits, or Gbits per second)
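- For illustration, a sketch of the Layer-2/Layer-3 methodology above: per-hop interface speed read via net-snmp's snmpget (hosts, community string, and interface indices are placeholders), with the path capacity taken as the minimum over the hops:

    # Sketch: read per-hop interface capacity via SNMP (net-snmp's snmpget assumed
    # to be installed) and take the minimum as the Layer-3 path capacity.
    # Hosts, community string, and ifIndex values are hypothetical placeholders.
    import subprocess

    IF_HIGH_SPEED = "1.3.6.1.2.1.31.1.1.1.15"   # IF-MIB::ifHighSpeed, in Mbit/s

    def hop_capacity_mbps(host, community, if_index):
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host,
             "%s.%d" % (IF_HIGH_SPEED, if_index)])
        return int(out.decode().strip())

    hops = [("router-a.example.net", "public", 3),
            ("router-b.example.net", "public", 12)]
    per_hop = [hop_capacity_mbps(h, c, i) for h, c, i in hops]
    print("Per-hop capacities (Mbit/s):", per_hop)
    print("Path capacity (Mbit/s):", min(per_hop))   # minimum of component hops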
11. Utilization
- Utilization is the amount of capacity being consumed on a hop or path.
- Can be listed by hop or path
- Methodology:
- Layer 2: Use of SNMP to query interface stats
- Layer 3: List of utilization along path
- Units: Bits per second
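- For illustration, a sketch of the SNMP methodology above: two readings of the 64-bit octet counter converted to bits per second (device, community, and ifIndex are placeholders; counter-wrap handling is omitted):

    # Sketch: estimate inbound interface utilization in bits/s from two readings
    # of the 64-bit IF-MIB octet counter. Host, community, and ifIndex are
    # placeholders; net-snmp's snmpget is assumed to be installed.
    import subprocess, time

    IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"   # IF-MIB::ifHCInOctets

    def read_octets(host, community, if_index):
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host,
             "%s.%d" % (IF_HC_IN_OCTETS, if_index)])
        return int(out.decode().strip())

    host, community, if_index = "edge-router.example.net", "public", 5
    first = read_octets(host, community, if_index)
    time.sleep(60)                                # one-minute sample window
    second = read_octets(host, community, if_index)
    bits_per_second = (second - first) * 8 / 60.0
    print("Inbound utilization: %.1f Mbit/s" % (bits_per_second / 1e6))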
12. Delay
- Delay metrics are at Layer 3 (IP) and are defined by RFC 2679, RFC 2681, and the IETF IPPM working group.
- Delay-related information comes in three types: one-way delay (OWD), one-way delay variation (jitter), and round-trip time (RTT).
- One-way delay between two observation points is the time between the occurrence of the first bit of the packet at the first point and the last bit of the packet at the second point.
- Methodology: an application (OWAMP) generating packets of defined size, with time-stamps, to a target end-host application
- Units: Time (seconds)
- Jitter is the one-way delay difference along a given unidirectional path (RFC 3393).
- Methodology: statistical analysis of the OWD application's results
- Units: Time (positive or negative)
- Round-trip time (RFC 2681) is well defined.
- Methodology: ping
- Units: Time (min/max/average) or a histogram of times
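- For illustration, a sketch of the RTT methodology above that parses the summary line of Linux iputils ping; one-way delay and jitter would instead require an OWAMP-style tool and synchronized clocks (the target host is a placeholder):

    # Sketch: round-trip time (RFC 2681 style) via ping. The target host is a
    # placeholder and the iputils summary-line format is assumed.
    import re, subprocess

    def rtt_stats(host, count=10):
        out = subprocess.check_output(["ping", "-c", str(count), host]).decode()
        # iputils ping summary line: "rtt min/avg/max/mdev = a/b/c/d ms"
        m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", out)
        if not m:
            raise RuntimeError("could not parse ping output")
        mn, avg, mx, mdev = map(float, m.groups())
        return {"min_ms": mn, "avg_ms": avg, "max_ms": mx, "mdev_ms": mdev}

    print(rtt_stats("t1-edge.example.org"))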
13. Error Rates
- Error rates track the bit or packet error rate (depending upon layer).
- Can be listed by hop or path
- Methodology:
- Layer 1: Read (TL1) equipment error rate
- Layer 2: SNMP access to interface error counter
- Layer 3: Checksum errors on packets
- Units: Fraction (erroneous/total for bits or packets)
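- For illustration, a sketch of the Layer-2 methodology above: interface error and packet counters sampled twice via snmpget and reported as a fraction (device, community, and ifIndex are placeholders):

    # Sketch: Layer-2 error rate as a fraction, from IF-MIB error and packet
    # counters over a sampling interval. Device, community, and ifIndex values
    # are placeholders; snmpget (net-snmp) is assumed to be installed.
    import subprocess, time

    OIDS = {
        "in_errors": "1.3.6.1.2.1.2.2.1.14",      # IF-MIB::ifInErrors
        "in_ucast":  "1.3.6.1.2.1.31.1.1.1.7",    # IF-MIB::ifHCInUcastPkts
    }

    def snmp_int(host, community, oid, if_index):
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host,
             "%s.%d" % (oid, if_index)])
        return int(out.decode().strip())

    def sample(host, community, if_index):
        return {k: snmp_int(host, community, oid, if_index)
                for k, oid in OIDS.items()}

    host, community, if_index = "edge-router.example.net", "public", 5
    a = sample(host, community, if_index)
    time.sleep(300)                               # five-minute sample window
    b = sample(host, community, if_index)
    errors = b["in_errors"] - a["in_errors"]
    packets = b["in_ucast"] - a["in_ucast"]
    print("Error rate: %.3e (erroneous/total packets)" % (float(errors) / max(packets, 1)))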
14. Topology
- Topology refers to the connectivity between nodes in the network (varies by OSI layer).
- Methodology:
- Layer 1: Surveyed (input)
- Layer 2: Surveyed (input); possible L2 discovery?
- Layer 3: Traceroute or equivalent
- Units: Representation should record a vector of node-link pairs representing the described path
- May vary with time (that is what is interesting), but that is probably only trackable at L3.
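- For illustration, a sketch of the Layer-3 methodology above, recording a traceroute path as a vector of (hop, node) pairs (the destination is a placeholder; standard Linux traceroute output is assumed):

    # Sketch: record a Layer-3 path as a vector of (hop_number, address) pairs
    # using traceroute. The destination host is a placeholder.
    import subprocess

    def l3_path(dest):
        out = subprocess.check_output(["traceroute", "-n", dest]).decode()
        hops = []
        for line in out.splitlines()[1:]:          # skip the header line
            fields = line.split()
            if len(fields) >= 2:
                hops.append((int(fields[0]), fields[1]))   # (hop index, address or '*')
        return hops

    for hop, addr in l3_path("t1-edge.example.org"):
        print(hop, addr)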
15. MTU
- The Maximum Transmission Unit is defined as the maximum size of a packet which an interface can transmit without having to fragment it.
- Can be listed by hop or path
- Methodology: Use Path MTU Discovery (RFC 1191)
- Units: Bytes
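- For illustration, a sketch that approximates Path MTU Discovery by probing with the DF bit set and binary-searching the largest packet that passes (Linux iputils ping flags and a placeholder target are assumed):

    # Sketch: approximate Path MTU Discovery by probing with the DF bit set and
    # shrinking the payload until a probe gets through. The -M do / -s flags are
    # Linux iputils ping options; the target host is a placeholder.
    import subprocess

    def path_mtu(dest, low=1000, high=9000):
        """Binary-search the largest unfragmented packet (payload + 28-byte
        IPv4/ICMP header) that reaches dest; returns None if even `low` fails."""
        best = None
        while low <= high:
            payload = (low + high) // 2
            rc = subprocess.call(
                ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), dest],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            if rc == 0:
                best, low = payload + 28, payload + 1      # fits: try larger
            else:
                high = payload - 1                         # too big: try smaller
        return best

    print("Path MTU (bytes):", path_mtu("t1-edge.example.org"))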
16. LHC-OPN: Which Metrics Are REQUIRED (if any)?
- We should converge on a minimal set of metrics that LHC-OPN Monitoring needs to provide.
- Example, for each T0-T1 path:
- Availability (is path up?)
- Capacity (path bottleneck bandwidth)
- Utilization (current usage along path)
- Error rates? (bit errors along path)
- Delay?
- Topology?
- MTU?
- Do we need/require hop-level metrics at various layers?
- How to represent/monitor downtime and trouble tickets? (Is this in scope?)
17. REMINDER: T0 Site Requests
- A robust machine meeting the following specs must be made available:
- Dual-CPU Xeon 3 GHz or dual Opteron 2.2 GHz processors, or better
- 4 Gigabytes of memory to support monitoring apps and large TCP buffers
- 1 or 10 Gigabit network interface on the LHC-OPN
- 200 GB of disk space to allow for the LHC-OPN apps data repository
- A separate disk (200 GB) to back up the LHC-OPN data repository
- OPTIONAL: An out-of-band link for maintenance/problem diagnosis
- Suitably privileged account(s) for software installation/access
- This machine should NOT be used for other services.
- SNMP RO access for the above machine is required for all L2 and L3 devices, or proxies (in case of security/performance concerns).
- Access to Netflow (or equivalent) LHC-OPN data from the edge device
- Appropriate RO access (via proxy?) to the optical components (for optical power monitoring) must be allowed from this same host.
- Access (testing/maintenance) must be allowed from all LHC-OPN nets.
- The Tier-0 needs a point-of-contact (POC) for LHC-OPN monitoring.
18. REMINDER: T1 Site Requests
- A dedicated LHC-OPN monitoring host must be provided:
- A gigabyte of memory
- 2 GHz Xeon or better CPU
- 1 Gigabit network interface on the LHC-OPN
- At least 20 GB of disk space allocated for LHC-OPN monitoring apps
- A suitably privileged account for software installation
- OPTIONAL: An out-of-band network link for maintenance/problem diagnosis
- OPTIONAL: This host should only be used for LHC-OPN monitoring
- OPTIONAL: Each Tier-1 site should provide a machine similar to the Tier-0's.
- SNMP RO access for the above machine is required for all T1 LHC-OPN L2 and L3 devices, or proxies (for security/performance concerns).
- Access to Netflow (or equivalent) LHC-OPN data from the edge device
- Appropriate RO access, possibly via proxy, to the T1 LHC-OPN optical components (for optical power monitoring) must be allowed from this host.
- Access (testing/maintenance) should be allowed from all LHC-OPN networks.
- The Tier-1 needs to provide a point-of-contact (POC) for LHC-OPN monitoring.
19. REMINDER: NREN Desired Access
- We expect that we will be unable to require anything of all possible NRENs in the LHC-OPN. However, the following list represents what we would like to have for the LHC-OPN:
- SNMP (read-only) access to LHC-OPN-related L2/L3 devices from either a closely associated Tier-1 site or the Tier-0 site. We require associated details about the device(s) involved with the LHC-OPN for this NREN.
- Suitable (read-only) access to the optical components along the LHC-OPN path which are part of this NREN. We require associated details about the devices involved.
- Topology information on how the LHC-OPN maps onto the NREN
- Information about planned service outages and interruptions. For example, URLs containing this information, mailing lists, applications which manage them, etc.
- Responsibility for acquiring each NREN's information should be distributed to the various Tier-1 POCs.
20. Prototype Deployments
- We would like to begin distributed prototype deployments to at least two Tier-1s and the Tier-0.
- The goal is to prototype various software which might be used for LHC-OPN monitoring:
- Active measurements (and scheduling?)
- Various applications which can provide LHC-OPN metrics (perhaps in different ways)
- GUI interfaces to LHC-OPN data
- Metric data management/searching for the LHC-OPN
- Alerts and automated problem-handling applications
- Interactions between all the preceding
- This process should lead to a final LHC-OPN monitoring system matched to our needs.
21. Prototype Deployment Needs
- For sites volunteering to support the LHC-OPN monitoring prototypes, we need:
- A suitable host (see requirements)
- Account details (username/password). An SSH public key can be provided as an alternative to a password.
- Any constraints or limitations on host usage
- Out-of-band access info (if any)
- Each site should also provide a monitoring point-of-contact.
- VOLUNTEERS? (email smckee@umich.edu)
22. Monitoring Site Requirements
- Eventually each LHC-OPN site should provide the following for monitoring:
- Appropriate host(s) (see previous slides)
- Point-of-contact for monitoring
- L1/L2/L3 map to the Tier-0, listing relevant nodes and links
- Responsible for contacting intervening NRENs
- Map is used for topology and capacity information
- Should include node (device) address, description, and access information
- Read-only access for LHC-OPN components
- Suitable account(s) on the monitoring host
- Sooner rather than later, as dictated by interest
23. Future Directions / Related Activities
- There are a number of existing efforts we anticipate actively prototyping for LHC-OPN monitoring (alphabetically):
- EGEE JRA4 / EGEE-II SA1 Network Performance Monitoring: This project has been working on an architecture and a series of prototype services intended to provide Grid operators and middleware with both end-to-end and edge-to-edge performance data. See http://egee-jra4.web.cern.ch/EGEE-JRA4/ and a demo at https://egee.epcc.ed.ac.uk:28443/npm-dt/
- IEPM -- Internet End-to-end Performance Monitoring: The IEPM effort has its origins in the 1995 WAN monitoring group at SLAC. IEPM-BW was developed to provide an infrastructure more focused on making active end-to-end performance measurements for a few high-performance paths.
- MonALISA -- Monitoring Agents using a Large-scale Integrated Services Architecture: This framework has been designed and implemented as a set of autonomous agent-based dynamic services that collect and analyze real-time information from a wide variety of sources (grid nodes, network routers and switches, optical switches, running jobs, etc.).
- NMWG Schema: The NMWG (Network Measurement Working Group) focuses on characteristics of interest to grid applications and works in collaboration with other standards groups such as the IETF IPPM WG and the Internet2 End-to-End Performance Initiative. The NMWG will determine which of the network characteristics are relevant to Grid applications, and pursue standardization of the attributes required to describe these characteristics.
- PerfSonar: This project plans to deploy a monitoring infrastructure across Abilene (Internet2), ESnet, and GEANT. A standard set of measurement applications will regularly measure these backbones and store their results in the Global Grid Forum Network Measurement Working Group schema (see the NMWG entry above).
24. Summary and Conclusion
- The LHC-OPN monitoring document has been updated to reflect the new emphasis on:
- Determining the appropriate metrics
- Prototyping possible applications/systems
- All sites should identify an LHC-OPN point-of-contact to help expedite the monitoring effort.
- We have a number of possibilities regarding metrics. Defining which (if any) are required will help direct the prototyping efforts.
- Prototyping is ready to proceed -- we need to identify sites which will host this effort.