1
LHC-OPN Monitoring Working Group Update
  • Shawn McKee
  • LHC-OPN T0-T1 Meeting
  • Rome, Italy
  • April 4th, 2006

2
LHC-OPN Monitoring Overview
  • The LHC-OPN exists to share LHC data with, and
    between, the T1 centers
  • Being able to monitor this network is vital to
    its success and is required for operations.
  • Monitoring is important for
  • Fault notification
  • Performance tracking
  • Problem diagnosis
  • Scheduling and prediction
  • Security
  • See previous (Amsterdam) talk for an overview and
    details on all this

3
The LHC-OPN Network
4
LHC-OPN Monitoring View
The diagram to the right is a logical
representation of the LHC-OPN showing monitoring
hosts. The LHC-OPN extends to just inside the T1
edge. Read/query access should be guaranteed on
LHC-OPN owned equipment. We also request RO
access to devices along the path to enable quick
fault isolation.
5
Status Update
  • During the Amsterdam meeting (Jan 2006) we
    decided to focus on two areas:
  • Important/required metrics
  • Prototyping LHC-OPN monitoring
  • There is an updated LHC-OPN Monitoring document
    on the LHC-OPN web page emphasizing this new
    focus.
  • This Meeting
  • What metrics should be required for LHC-OPN?
  • We need to move forward on prototyping LHC-OPN
    monitoring services -- volunteer sites?

6
Monitoring Possibilities by Layer
  • For each layer we could monitor a number of
    metrics of the LHC-OPN:
  • Layer-1
  • Optical power levels
  • Layer-2
  • Packet statistics (e.g., RMON)
  • Layer-3/4
  • Netflow
  • All Layers
  • Utilization (bandwidth in use, Mbits/sec)
  • Availability (track accessibility of device over
    time)
  • Error Rates
  • Capacity
  • Topology

7
LHC-OPN Paths: Multiple Layers
  • Each T0-T1 path has many views
  • Each OSI Layer (1-3) may have different devices
    involved.
  • This diagram is likely simpler than most cases in
    the LHC-OPN

8
Metrics for the LHC-OPN (EGEE Network Performance
Metrics V2)
  • For edge-to-edge monitoring, the list of
    relevant metrics includes:
  • Availability (of T0-T1 path, each hop, T1-T1?)
  • Capacity (T0-T1, each hop)
  • Utilization (T0-T1, each hop)
  • Delays (T0-T1 paths, One-way, RTT, jitter)
  • Error Rates (T0-T1, each hop)
  • Topology (L3 traceroute, L1?, L2)
  • MTU (each path and hop)
  • What about Scheduled Downtime, Trouble Tickets?

9
Availability
  • Availability (or uptime) measures the amount of
    time the network is up and running.
  • Can be by hop or a complete path
  • Methodology (see sketch below):
  • Layer 1: Measure power levels/bit rate?
  • Layer 2: Utilize SNMP to check interface
  • Layer 3: ping
  • Units: Expressed as a percentage
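
A minimal sketch of the Layer-3 availability check above, assuming
a Linux host with the standard ping utility; the target names,
sample count and polling interval are illustrative placeholders,
not actual LHC-OPN values:

# Layer-3 availability: ping each endpoint periodically and report
# uptime as a percentage of polls that were answered.
import subprocess
import time

TARGETS = ["t1-edge.example.net", "t0-edge.example.net"]  # placeholders
SAMPLES = 10      # polls in this measurement window
INTERVAL = 30     # seconds between polls

def is_reachable(host: str) -> bool:
    """Return True if a single ICMP echo request gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def availability(host: str) -> float:
    """Fraction of polls answered, expressed as a percentage."""
    up = 0
    for _ in range(SAMPLES):
        if is_reachable(host):
            up += 1
        time.sleep(INTERVAL)
    return 100.0 * up / SAMPLES

if __name__ == "__main__":
    for target in TARGETS:
        print(f"{target}: {availability(target):.1f}% available")
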

10
Capacity
  • Capacity is the maximum amount of data per unit
    time a hop or path can transport.
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 1: Surveyed (operator entry)
  • Layer 2: SNMP query on interface
  • Layer 3: Minimum of component hops
  • Units: Bit rate (Kbits, Mbits or Gbits per second)
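
A minimal sketch of the capacity computation above, assuming
net-snmp's snmpget command and SNMP RO access to each hop; the
device names, community string and interface indices are
placeholders:

# Path capacity: read each hop's nominal interface speed via SNMP
# and take the minimum (bottleneck) as the path capacity.
import subprocess

# (device, ifIndex) pairs along a hypothetical T0-T1 path
PATH_HOPS = [("router-a.example.net", 3), ("router-b.example.net", 12)]
COMMUNITY = "public"                        # RO community (placeholder)
IF_HIGH_SPEED = "1.3.6.1.2.1.31.1.1.1.15"   # IF-MIB ifHighSpeed, Mbit/s

def hop_capacity_mbps(device: str, if_index: int) -> int:
    """Read one interface's nominal speed in Mbit/s via SNMP."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         device, f"{IF_HIGH_SPEED}.{if_index}"],
        text=True,
    )
    return int(out.strip())

def path_capacity_mbps(hops) -> int:
    """Path capacity is the minimum of the component hop capacities."""
    return min(hop_capacity_mbps(dev, idx) for dev, idx in hops)

if __name__ == "__main__":
    print("Bottleneck capacity:", path_capacity_mbps(PATH_HOPS), "Mbit/s")
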

11
Utilization
  • Utilization is the amount of capacity being
    consumed on a hop or path.
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 2: Use of SNMP to query interface stats
  • Layer 3: List of utilization along path
  • Units: Bits per second
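
A minimal sketch of the Layer-2 utilization measurement above (two
SNMP counter readings converted to bits per second), again assuming
net-snmp's snmpget and placeholder device details:

# Interface utilization: read the 64-bit inbound octet counter twice
# and convert the delta to bits per second.
import subprocess
import time

DEVICE = "router-a.example.net"             # hypothetical edge device
COMMUNITY = "public"                        # RO community (placeholder)
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"  # IF-MIB ifHCInOctets
IF_INDEX = 3
POLL_SECONDS = 60

def read_octets() -> int:
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         DEVICE, f"{IF_HC_IN_OCTETS}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())

def utilization_bps() -> float:
    """Average inbound rate in bits/s over one polling interval."""
    first = read_octets()
    time.sleep(POLL_SECONDS)
    second = read_octets()
    # A 64-bit counter is very unlikely to wrap within one minute.
    return (second - first) * 8 / POLL_SECONDS

if __name__ == "__main__":
    print(f"Inbound utilization: {utilization_bps() / 1e6:.1f} Mbit/s")
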

12
Delay
  • Delay metrics are at Layer 3 (IP) and are defined
    by RFC 2679, RFC 2681 and the IPPM WG.
  • Delay-related information comes in three types:
    one-way delay (OWD), one-way delay variation
    (jitter) and round-trip time (RTT)
  • One-way delay between two observation points is
    the time between the first bit of the packet
    appearing at the first point and the last bit of
    the packet arriving at the second point.
  • Methodology: an application (OWAMP) generates
    packets of a defined size, with time-stamps, sent
    to a target end-host application.
  • Units: Time (seconds)
  • Jitter is the one-way delay difference along a
    given unidirectional path (RFC 3393)
  • Methodology: statistical analysis of OWD samples
    (see sketch below)
  • Units: Time (positive or negative)
  • Round-trip time (RFC 2681) is well defined
  • Methodology: ping
  • Units: Time (min/max/average) or a histogram of
    times
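
A minimal sketch of the jitter computation above: given one-way
delay samples (as OWAMP would produce), per-packet delay variation
is the difference between consecutive delays (RFC 3393). The sample
values are made up for illustration:

# Jitter from one-way delay samples: consecutive-delay differences,
# summarized with min/mean/max of the OWD series.
from statistics import mean

# One-way delays in seconds for consecutive test packets (illustrative).
owd_samples = [0.01402, 0.01411, 0.01398, 0.01420, 0.01405]

def jitter_series(delays):
    """IP delay variation: difference between consecutive one-way delays."""
    return [b - a for a, b in zip(delays, delays[1:])]

if __name__ == "__main__":
    jitters = jitter_series(owd_samples)
    print("OWD min/mean/max (ms):",
          f"{min(owd_samples)*1e3:.3f}/"
          f"{mean(owd_samples)*1e3:.3f}/"
          f"{max(owd_samples)*1e3:.3f}")
    print("Jitter samples (ms):", [f"{j*1e3:+.3f}" for j in jitters])
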

13
Error Rates
  • Error rates track the bit or packet error rate
    (depending upon layer).
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 1: Read (TL1) equipment error rate
  • Layer 2: SNMP access to interface error counter
  • Layer 3: Checksum errors on packets
  • Units: Fraction (erroneous/total, for bits or
    packets)
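
A minimal sketch of the Layer-2 error-rate calculation above,
reading the interface error and packet counters over SNMP; the
device, community string and interface index are placeholders, and
treating ifInUcastPkts plus ifInErrors as the packet total is only
an approximation:

# Inbound packet error fraction from IF-MIB counters.
import subprocess

DEVICE = "router-a.example.net"          # hypothetical device
COMMUNITY = "public"                     # RO community (placeholder)
IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"    # IF-MIB ifInErrors
IF_IN_UCAST = "1.3.6.1.2.1.2.2.1.11"     # IF-MIB ifInUcastPkts
IF_INDEX = 3

def snmp_int(oid: str) -> int:
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         DEVICE, f"{oid}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())

def error_fraction() -> float:
    """Erroneous packets as a fraction of inbound packets (since boot)."""
    errors = snmp_int(IF_IN_ERRORS)
    total = snmp_int(IF_IN_UCAST) + errors
    return errors / total if total else 0.0

if __name__ == "__main__":
    print(f"Inbound packet error rate: {error_fraction():.2e}")
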

14
Topology
  • Topology refers to the connectivity between nodes
    in the network (varies by OSI layer)
  • Methodology (see sketch below):
  • Layer 1: Surveyed (input)
  • Layer 2: Surveyed (input); possible L2 discovery?
  • Layer 3: Traceroute or equivalent
  • Units: Representation should record a vector of
    node-link pairs representing the described path
  • May vary with time (that is what is interesting),
    but that is probably only trackable at L3.
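
A minimal sketch of the Layer-3 methodology above: run traceroute
and record the path as a vector of node-link pairs. The destination
host is a placeholder:

# Layer-3 topology capture via traceroute.
import subprocess

def l3_path(destination: str):
    """Return the Layer-3 path as a list of (node, link) pairs."""
    out = subprocess.check_output(
        ["traceroute", "-n", destination], text=True
    )
    hops = []
    for line in out.splitlines()[1:]:        # skip the header line
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "*":
            hops.append(fields[1])           # hop IP address
    # Represent the path as node-link pairs: (node, link-to-next-node)
    return [(a, f"{a}->{b}") for a, b in zip(hops, hops[1:])]

if __name__ == "__main__":
    for node, link in l3_path("t1-edge.example.net"):  # placeholder target
        print(node, link)
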

15
MTU
  • The Maximum Transmission Unit is defined as the
    maximum size of a packet which an interface can
    transmit without having to fragment it.
  • Can be listed by hop or path
  • Methodology: Use Path MTU Discovery (RFC 1191;
    see sketch below)
  • Units: Bytes
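
A minimal sketch of Path MTU Discovery in the spirit of RFC 1191,
probing with the Don't Fragment bit set (Linux "ping -M do") and
binary-searching the largest payload that is delivered; the target
host is a placeholder:

# Path MTU estimate: largest DF-marked ICMP payload that gets through,
# plus 28 bytes for the IPv4 and ICMP headers.
import subprocess

def probe(host: str, payload: int) -> bool:
    """True if a DF-marked packet with this payload size is delivered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def path_mtu(host: str, low: int = 548, high: int = 8972) -> int:
    """Binary search between minimal and jumbo-frame payload sizes."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe(host, mid):
            low = mid
        else:
            high = mid - 1
    return low + 28

if __name__ == "__main__":
    print("Path MTU:", path_mtu("t1-edge.example.net"), "bytes")  # placeholder
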

16
LHC-OPN: Which Metrics Are REQUIRED (if any)?
  • We should converge on a minimal set of metrics
    that the LHC-OPN Monitoring needs to provide
  • Example: for each T0-T1 path
  • Availability (is path up?)
  • Capacity (path bottleneck bandwidth)
  • Utilization (current usage along path)
  • Error rates? (bit errors along path)
  • Delay?
  • Topology?
  • MTU?
  • Do we need/require hop level metrics at various
    layers?
  • How to represent/monitor downtime and trouble
    tickets? (Is this in scope?)

17
REMINDER: T0 Site Requests
  • A robust machine meeting the following specs must
    be made available:
  • Dual-CPU 3 GHz Xeon or dual 2.2 GHz Opteron
    processors, or better
  • 4 Gigabytes of memory to support monitoring apps
    and large TCP buffers
  • 1 or 10 Gigabit network interface on the LHC-OPN.
  • 200 GB of disk space to allow for the LHC-OPN
    monitoring applications' data repository.
  • A separate disk (200 GB) to back up the LHC-OPN
    data repository.
  • OPTIONAL: An out-of-band link for
    maintenance/problem diagnosis.
  • Suitably privileged account(s) for software
    installation/access.
  • This machine should NOT be used for other
    services.
  • SNMP RO access for the above machine is required
    for all L2 and L3 devices or proxies (in case of
    security/performance concerns)
  • Access to Netflow (or equiv.) LHC-OPN data from
    the edge device
  • Appropriate RO access (via proxy?) to the optical
    components (for optical power monitoring) must be
    allowed from this same host.
  • Access (testing/maint.) must be allowed from all
    LHC-OPN nets.
  • The Tier-0 needs a point-of-contact (POC) for
    LHC-OPN monitoring.

18
REMINDER: T1 Site Requests
  • A dedicated LHC-OPN monitoring host must be
    provided
  • A gigabyte of memory
  • 2 GHz Xeon or better CPU.
  • 1 Gigabit network interface on the LHC-OPN.
  • At least 20 GB of disk space allocated for
    LHC-OPN monitoring apps.
  • A suitably privileged account for software
    installation.
  • OPTIONAL: An out-of-band network link for
    maintenance/problem diagnosis
  • OPTIONAL: This host should only be used for
    LHC-OPN monitoring
  • OPTIONAL: Each Tier-1 site should provide a
    machine similar to the Tier-0's.
  • SNMP RO access for the above machine is required
    for all T1 LHC-OPN L2 and L3 devices or proxies
    (for security/performance concerns)
  • Access to Netflow (or equiv.) LHC-OPN data from
    the edge device
  • Appropriate RO access, possibly via proxy, to the
    T1 LHC-OPN optical components (for optical power
    monitoring) must be allowed from this host.
  • Access (testing/maint.) should be allowed from
    all LHC-OPN networks.
  • The Tier-1 needs to provide a point-of-contact
    (POC) for LHC-OPN monitoring

19
REMINDER: NREN Desired Access
  • We expect that we will be unable to require
    anything for all possible NRENs in the LHC-OPN.
    However, the following list represents what we
    would like to have for the LHC-OPN:
  • SNMP (readonly) access to LHC-OPN related L2/L3
    devices from either a closely associated Tier-1
    site or the Tier-0 site. We require associated
    details about the device(s) involved with the
    LHC-OPN for this NREN
  • Suitable (readonly) access to the optical
    components along the LHC-OPN path which are part
    of this NREN. We require associated details
    about the devices involved.
  • Topology information on how the LHC-OPN maps onto
    the NREN
  • Information about planned service outages and
    interruptions. For example, URLs containing this
    information, mailing lists, applications which
    manage them, etc.
  • Responsibility for acquiring each NREN's
    information should be distributed to the various
    Tier-1 POCs.

20
Prototype Deployments
  • We would like to begin prototype deployments at
    a minimum of two Tier-1s and the Tier-0
  • The goal is to prototype various software which
    might be used for LHC-OPN monitoring
  • Active measurements (and scheduling?)
  • Various applications which can provide LHC-OPN
    metrics (perhaps in different ways)
  • GUI interfaces to LHC-OPN data
  • Metric data management/searching for LHC-OPN
  • Alerts and automated problem handling
    applications
  • Interactions between all the preceding
  • This process should lead to a final LHC-OPN
    monitoring system matched to our needs.

21
Prototype Deployment Needs
  • For sites volunteering to support the LHC-OPN
    monitoring prototypes we need
  • Suitable host (see requirements)
  • Account details (username/password); an SSH
    public key can be provided as an alternative to a
    password.
  • Any constraints or limitations about host usage.
  • Out-of-band access info (if any)
  • Each site should also provide a monitoring
    point-of-contact.
  • VOLUNTEERS? (email smckee_at_umich.edu)

22
Monitoring Site Requirements
  • Eventually each LHC-OPN site should provide the
    following for monitoring
  • Appropriate host(s) (see previous slides)
  • Point-of-contact for monitoring
  • L1/L2/L3 Map to Tier-0 listing relevant nodes
    and links
  • Responsible for contacting intervening NRENs
  • Map is used for topology and capacity information
  • Should include node(device) address, description
    and access information
  • Readonly access for LHC-OPN components
  • Suitable account(s) on monitoring host
  • Sooner rather than later, dictated by interest

23
Future Directions / Related Activities
  • There are a number of existing efforts we
    anticipate actively prototyping for LHC-OPN
    monitoring (listed alphabetically):
  • EGEE JRA4/ EGEE-II SA1 Network Performance
    Monitoring - This project has been working on an
    architecture and a series of prototype services
    intended to provide Grid operators and middleware
    with both end-to-end and edge-to-edge performance
    data. See http://egee-jra4.web.cern.ch/EGEE-JRA4/
    and a demo at
    https://egee.epcc.ed.ac.uk:28443/npm-dt/
  • IEPM -- Internet End-to-end Performance
    Monitoring. The IEPM effort has its origins in
    the 1995 WAN monitoring group at SLAC. IEPM-BW
    was developed to provide an infrastructure more
    focused on making active end-to-end performance
    measurements for a few high-performance paths.
  • MonALISA (Monitoring Agents using a Large-scale
    Integrated Services Architecture). This framework
    has been designed and implemented as a set of
    autonomous agent-based dynamic services that
    collect and analyze real-time information from a
    wide variety of sources (grid nodes, network
    routers and switches, optical switches, running
    jobs, etc.)
  • NMWG Schema - The NMWG (Network Measurement
    Working Group) focuses on characteristics of
    interest to grid applications and works in
    collaboration with other standards groups such
    as the IETF IPPM WG and the Internet2 End-to-End
    Performance Initiative. The NMWG will determine
    which of the network characteristics are relevant
    to Grid applications, and pursue standardization
    of the attributes required to describe these
    characteristics.
  • PerfSONAR - This project plans to deploy a
    monitoring infrastructure across Abilene
    (Internet2), ESnet, and GEANT. A standard set of
    measurement applications will regularly measure
    these backbones and store their results in the
    Global Grid Forum Network Measurement Working
    Group schema (see above).

24
Summary and Conclusion
  • The LHC-OPN monitoring document has been updated
    to reflect the new emphasis on:
  • Determining the appropriate metrics
  • Prototyping possible applications/systems
  • All sites should identify an LHC-OPN
    point-of-contact to help expedite the monitoring
    effort.
  • We have a number of possibilities regarding
    metrics. Defining which (if any) are required
    will help direct the prototyping efforts.
  • Prototyping is ready to proceed -- we need to
    identify sites which will host this effort.