1
LHC-OPN Monitoring Working Group Update
  • Shawn McKee
  • LHC-OPN T0-T1 Meeting
  • Rome, Italy
  • April 4th, 2006

2
LHC-OPN Monitoring Overview
  • The LHC-OPN exists to share LHC data with, and
    between, the T1 centers
  • Being able to monitor this network is vital to
    its success and is required for operations.
  • Monitoring is important for
  • Fault notification
  • Performance tracking
  • Problem diagnosis
  • Scheduling and prediction
  • Security
  • See previous (Amsterdam) talk for an overview and
    details on all this

3
The LHC-OPN Network
4
LHC-OPN Monitoring View
The diagram to the right is a logical
representation of the LHC-OPN showing monitoring
hosts. The LHC-OPN extends to just inside the T1
edge. Read/query access should be guaranteed on
LHC-OPN owned equipment. We also request RO
access to devices along the path to enable quick
fault isolation.
5
Status Update
  • During the Amsterdam meeting (Jan 2006) we
    decided to focus on two areas:
  • Important/required metrics
  • Prototyping LHC-OPN monitoring
  • There is an updated LHC-OPN Monitoring document
    on the LHC-OPN web page emphasizing this new
    focus.
  • This Meeting
  • What metrics should be required for LHC-OPN?
  • We need to move forward on prototyping LHC-OPN
    monitoring services -- volunteer sites?

6
Monitoring Possibilities by Layer
  • For each layer we could monitor a number of
    metrics of the LHC-OPN:
  • Layer-1
  • Optical power levels
  • Layer-2
  • Packet statistics (e.g., RMON)
  • Layer-3/4
  • Netflow
  • All Layers
  • Utilization (bandwidth in use, Mbits/sec)
  • Availability (track accessibility of device over
    time)
  • Error Rates
  • Capacity
  • Topology

7
LHC-OPN Paths: Multiple Layers
  • Each T0-T1 path has many views
  • Each OSI Layer (1-3) may have different devices
    involved.
  • This diagram is likely simpler than most cases in
    the LHC-OPN

8
Metrics for the LHC-OPN (EGEE Network Performance
Metrics V2)
  • For edge-to-edge monitoring, the list of
    relevant metrics includes:
  • Availability (of T0-T1 path, each hop, T1-T1?)
  • Capacity (T0-T1, each hop)
  • Utilization (T0-T1, each hop)
  • Delays (T0-T1 paths, One-way, RTT, jitter)
  • Error Rates (T0-T1, each hop)
  • Topology (L3 traceroute, L1?, L2)
  • MTU (each path and hop)
  • What about Scheduled Downtime, Trouble Tickets?

9
Availability
  • Availability (or uptime) measures the amount of
    time the network is up and running.
  • Can be by hop or a complete path
  • Methodology (see sketch below):
  • Layer 1: Measure power levels/bit rate?
  • Layer 2: Utilize SNMP to check interface
  • Layer 3: ping
  • Units: Expressed as a percentage
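
A minimal sketch of the Layer-3 availability check above, assuming
a Linux host with the standard ping utility; the target names,
sample count and polling interval are illustrative placeholders,
not actual LHC-OPN values:

# Layer-3 availability: ping each endpoint periodically and report
# uptime as a percentage of polls that were answered.
import subprocess
import time

TARGETS = ["t1-edge.example.net", "t0-edge.example.net"]  # placeholders
SAMPLES = 10      # polls in this measurement window
INTERVAL = 30     # seconds between polls

def is_reachable(host: str) -> bool:
    """Return True if a single ICMP echo request gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def availability(host: str) -> float:
    """Fraction of polls answered, expressed as a percentage."""
    up = 0
    for _ in range(SAMPLES):
        if is_reachable(host):
            up += 1
        time.sleep(INTERVAL)
    return 100.0 * up / SAMPLES

if __name__ == "__main__":
    for target in TARGETS:
        print(f"{target}: {availability(target):.1f}% available")
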

10
Capacity
  • Capacity is the maximum amount of data per unit
    time a hop or path can transport.
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 1: Surveyed (operator entry)
  • Layer 2: SNMP query on interface
  • Layer 3: Minimum of component hops
  • Units: Bit rate (Kbits, Mbits or Gbits per second)
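
A minimal sketch of the capacity computation above, assuming
net-snmp's snmpget command and SNMP RO access to each hop; the
device names, community string and interface indices are
placeholders:

# Path capacity: read each hop's nominal interface speed via SNMP
# and take the minimum (bottleneck) as the path capacity.
import subprocess

# (device, ifIndex) pairs along a hypothetical T0-T1 path
PATH_HOPS = [("router-a.example.net", 3), ("router-b.example.net", 12)]
COMMUNITY = "public"                        # RO community (placeholder)
IF_HIGH_SPEED = "1.3.6.1.2.1.31.1.1.1.15"   # IF-MIB ifHighSpeed, Mbit/s

def hop_capacity_mbps(device: str, if_index: int) -> int:
    """Read one interface's nominal speed in Mbit/s via SNMP."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         device, f"{IF_HIGH_SPEED}.{if_index}"],
        text=True,
    )
    return int(out.strip())

def path_capacity_mbps(hops) -> int:
    """Path capacity is the minimum of the component hop capacities."""
    return min(hop_capacity_mbps(dev, idx) for dev, idx in hops)

if __name__ == "__main__":
    print("Bottleneck capacity:", path_capacity_mbps(PATH_HOPS), "Mbit/s")
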

11
Utilization
  • Utilization is the amount of capacity being
    consumed on a hop or path.
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 2: Use of SNMP to query interface stats
  • Layer 3: List of utilization along path
  • Units: Bits per second
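
A minimal sketch of the Layer-2 utilization measurement above (two
SNMP counter readings converted to bits per second), again assuming
net-snmp's snmpget and placeholder device details:

# Interface utilization: read the 64-bit inbound octet counter twice
# and convert the delta to bits per second.
import subprocess
import time

DEVICE = "router-a.example.net"             # hypothetical edge device
COMMUNITY = "public"                        # RO community (placeholder)
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"  # IF-MIB ifHCInOctets
IF_INDEX = 3
POLL_SECONDS = 60

def read_octets() -> int:
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         DEVICE, f"{IF_HC_IN_OCTETS}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())

def utilization_bps() -> float:
    """Average inbound rate in bits/s over one polling interval."""
    first = read_octets()
    time.sleep(POLL_SECONDS)
    second = read_octets()
    # A 64-bit counter is very unlikely to wrap within one minute.
    return (second - first) * 8 / POLL_SECONDS

if __name__ == "__main__":
    print(f"Inbound utilization: {utilization_bps() / 1e6:.1f} Mbit/s")
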

12
Delay
  • Delay metrics are at Layer 3 (IP) and are defined
    by RFC 2679, RFC 2681 and the IPPM WG.
  • Delay-related information comes in three types:
    one-way delay (OWD), one-way delay variation
    (jitter) and round-trip time (RTT)
  • One-way delay between two observation points is
    the time between the first bit of the packet
    appearing at the first point and the last bit of
    the packet arriving at the second point.
  • Methodology: an application (OWAMP) generates
    packets of a defined size, with time-stamps, sent
    to a target end-host application.
  • Units: Time (seconds)
  • Jitter is the one-way delay difference along a
    given unidirectional path (RFC 3393)
  • Methodology: statistical analysis of OWD samples
    (see sketch below)
  • Units: Time (positive or negative)
  • Round-trip time (RFC 2681) is well defined
  • Methodology: ping
  • Units: Time (min/max/average) or a histogram of
    times
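
A minimal sketch of the jitter computation above: given one-way
delay samples (as OWAMP would produce), per-packet delay variation
is the difference between consecutive delays (RFC 3393). The sample
values are made up for illustration:

# Jitter from one-way delay samples: consecutive-delay differences,
# summarized with min/mean/max of the OWD series.
from statistics import mean

# One-way delays in seconds for consecutive test packets (illustrative).
owd_samples = [0.01402, 0.01411, 0.01398, 0.01420, 0.01405]

def jitter_series(delays):
    """IP delay variation: difference between consecutive one-way delays."""
    return [b - a for a, b in zip(delays, delays[1:])]

if __name__ == "__main__":
    jitters = jitter_series(owd_samples)
    print("OWD min/mean/max (ms):",
          f"{min(owd_samples)*1e3:.3f}/"
          f"{mean(owd_samples)*1e3:.3f}/"
          f"{max(owd_samples)*1e3:.3f}")
    print("Jitter samples (ms):", [f"{j*1e3:+.3f}" for j in jitters])
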

13
Error Rates
  • Error rates track the bit or packet error rate
    (depending upon layer).
  • Can be listed by hop or path
  • Methodology (see sketch below):
  • Layer 1: Read (TL1) equipment error rate
  • Layer 2: SNMP access to interface error counter
  • Layer 3: Checksum errors on packets
  • Units: Fraction (erroneous/total, for bits or
    packets)
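
A minimal sketch of the Layer-2 error-rate calculation above,
reading the interface error and packet counters over SNMP; the
device, community string and interface index are placeholders, and
treating ifInUcastPkts plus ifInErrors as the packet total is only
an approximation:

# Inbound packet error fraction from IF-MIB counters.
import subprocess

DEVICE = "router-a.example.net"          # hypothetical device
COMMUNITY = "public"                     # RO community (placeholder)
IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"    # IF-MIB ifInErrors
IF_IN_UCAST = "1.3.6.1.2.1.2.2.1.11"     # IF-MIB ifInUcastPkts
IF_INDEX = 3

def snmp_int(oid: str) -> int:
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         DEVICE, f"{oid}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())

def error_fraction() -> float:
    """Erroneous packets as a fraction of inbound packets (since boot)."""
    errors = snmp_int(IF_IN_ERRORS)
    total = snmp_int(IF_IN_UCAST) + errors
    return errors / total if total else 0.0

if __name__ == "__main__":
    print(f"Inbound packet error rate: {error_fraction():.2e}")
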

14
Topology
  • Topology refers to the connectivity between nodes
    in the network (varies by OSI layer)
  • Methodology (see sketch below):
  • Layer 1: Surveyed (input)
  • Layer 2: Surveyed (input); possible L2 discovery?
  • Layer 3: Traceroute or equivalent
  • Units: Representation should record a vector of
    node-link pairs representing the described path
  • May vary with time (that is what is interesting),
    but that is probably only trackable at L3.
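
A minimal sketch of the Layer-3 methodology above: run traceroute
and record the path as a vector of node-link pairs. The destination
host is a placeholder:

# Layer-3 topology capture via traceroute.
import subprocess

def l3_path(destination: str):
    """Return the Layer-3 path as a list of (node, link) pairs."""
    out = subprocess.check_output(
        ["traceroute", "-n", destination], text=True
    )
    hops = []
    for line in out.splitlines()[1:]:        # skip the header line
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "*":
            hops.append(fields[1])           # hop IP address
    # Represent the path as node-link pairs: (node, link-to-next-node)
    return [(a, f"{a}->{b}") for a, b in zip(hops, hops[1:])]

if __name__ == "__main__":
    for node, link in l3_path("t1-edge.example.net"):  # placeholder target
        print(node, link)
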

15
MTU
  • The Maximum Transmission Unit is defined as the
    maximum size of a packet which an interface can
    transmit without having to fragment it.
  • Can be listed by hop or path
  • Methodology: Use Path MTU Discovery (RFC 1191;
    see sketch below)
  • Units: Bytes
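
A minimal sketch of Path MTU Discovery in the spirit of RFC 1191,
probing with the Don't Fragment bit set (Linux "ping -M do") and
binary-searching the largest payload that is delivered; the target
host is a placeholder:

# Path MTU estimate: largest DF-marked ICMP payload that gets through,
# plus 28 bytes for the IPv4 and ICMP headers.
import subprocess

def probe(host: str, payload: int) -> bool:
    """True if a DF-marked packet with this payload size is delivered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def path_mtu(host: str, low: int = 548, high: int = 8972) -> int:
    """Binary search between minimal and jumbo-frame payload sizes."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe(host, mid):
            low = mid
        else:
            high = mid - 1
    return low + 28

if __name__ == "__main__":
    print("Path MTU:", path_mtu("t1-edge.example.net"), "bytes")  # placeholder
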

16
LHC-OPN: Which Metrics Are REQUIRED (if any)?
  • We should converge on a minimal set of metrics
    that the LHC-OPN Monitoring needs to provide
  • Example: for each T0-T1 path
  • Availability (is path up?)
  • Capacity (path bottleneck bandwidth)
  • Utilization (current usage along path)
  • Error rates? (bit errors along path)
  • Delay?
  • Topology?
  • MTU?
  • Do we need/require hop level metrics at various
    layers?
  • How to represent/monitor downtime and trouble
    tickets? (Is this in scope?)

17
REMINDER: T0 Site Requests
  • A robust machine meeting the following specs must
    be made available:
  • Dual-CPU 3 GHz Xeon or dual 2.2 GHz Opteron
    processors, or better
  • 4 Gigabytes of memory to support monitoring apps
    and large TCP buffers
  • 1 or 10 Gigabit network interface on the LHC-OPN.
  • 200 GB of disk space to allow for the LHC-OPN
    monitoring applications' data repository.
  • A separate disk (200 GB) to back up the LHC-OPN
    data repository.
  • OPTIONAL: An out-of-band link for
    maintenance/problem diagnosis.
  • Suitably privileged account(s) for software
    installation/access.
  • This machine should NOT be used for other
    services.
  • SNMP RO access for the above machine is required
    for all L2 and L3 devices or proxies (in case of
    security/performance concerns)
  • Access to Netflow (or equiv.) LHC-OPN data from
    the edge device
  • Appropriate RO access (via proxy?) to the optical
    components (for optical power monitoring) must be
    allowed from this same host.
  • Access (testing/maint.) must be allowed from all
    LHC-OPN nets.
  • The Tier-0 needs a point-of-contact (POC) for
    LHC-OPN monitoring.

18
REMINDER: T1 Site Requests
  • A dedicated LHC-OPN monitoring host must be
    provided
  • A gigabyte of memory
  • 2 GHz Xeon or better CPU.
  • 1 Gigabit network interface on the LHC-OPN.
  • At least 20 GB of disk space allocated for
    LHC-OPN monitoring apps.
  • A suitably privileged account for software
    installation.
  • OPTIONAL: An out-of-band network link for
    maintenance/problem diagnosis
  • OPTIONAL: This host should only be used for
    LHC-OPN monitoring
  • OPTIONAL: Each Tier-1 site should provide a
    machine similar to the Tier-0's.
  • SNMP RO access for the above machine is required
    for all T1 LHC-OPN L2 and L3 devices or proxies
    (for security/performance concerns)
  • Access to Netflow (or equiv.) LHC-OPN data from
    the edge device
  • Appropriate RO access, possibly via proxy, to the
    T1 LHC-OPN optical components (for optical power
    monitoring) must be allowed from this host.
  • Access (testing/maint.) should be allowed from
    all LHC-OPN networks.
  • The Tier-1 needs to provide a point-of-contact
    (POC) for LHC-OPN monitoring

19
REMINDER: NREN Desired Access
  • We expect that we will be unable to require
    anything for all possible NRENs in the LHC-OPN.
    However, the following list represents what we
    would like to have for the LHC-OPN:
  • SNMP (readonly) access to LHC-OPN related L2/L3
    devices from either a closely associated Tier-1
    site or the Tier-0 site. We require associated
    details about the device(s) involved with the
    LHC-OPN for this NREN
  • Suitable (readonly) access to the optical
    components along the LHC-OPN path which are part
    of this NREN. We require associated details
    about the devices involved.
  • Topology information on how the LHC-OPN maps onto
    the NREN
  • Information about planned service outages and
    interruptions. For example, URLs containing this
    information, mailing lists, applications which
    manage them, etc.
  • Responsibility for acquiring each NREN's
    information should be distributed to the various
    Tier-1 POCs.

20
Prototype Deployments
  • We would like to begin prototype deployments at
    a minimum of two Tier-1s and the Tier-0
  • The goal is to prototype various software which
    might be used for LHC-OPN monitoring
  • Active measurements (and scheduling?)
  • Various applications which can provide LHC-OPN
    metrics (perhaps in different ways)
  • GUI interfaces to LHC-OPN data
  • Metric data management/searching for LHC-OPN
  • Alerts and automated problem handling
    applications
  • Interactions between all the preceding
  • This process should lead to a final LHC-OPN
    monitoring system matched to our needs.

21
Prototype Deployment Needs
  • For sites volunteering to support the LHC-OPN
    monitoring prototypes we need
  • Suitable host (see requirements)
  • Account details (username/password); an SSH
    public key can be provided as an alternative to a
    password.
  • Any constraints or limitations about host usage.
  • Out-of-band access info (if any)
  • Each site should also provide a monitoring
    point-of-contact.
  • VOLUNTEERS? (email smckee_at_umich.edu)

22
Monitoring Site Requirements
  • Eventually each LHC-OPN site should provide the
    following for monitoring
  • Appropriate host(s) (see previous slides)
  • Point-of-contact for monitoring
  • L1/L2/L3 Map to Tier-0 listing relevant nodes
    and links
  • Responsible for contacting intervening NRENs
  • Map is used for topology and capacity information
  • Should include node(device) address, description
    and access information
  • Readonly access for LHC-OPN components
  • Suitable account(s) on monitoring host
  • Sooner rather than later, dictated by interest

23
Future Directions / Related Activities
  • There are a number of existing efforts we
    anticipate actively prototyping for LHC-OPN
    monitoring (listed alphabetically):
  • EGEE JRA4/ EGEE-II SA1 Network Performance
    Monitoring - This project has been working on an
    architecture and a series of prototype services
    intended to provide Grid operators and middleware
    with both end-to-end and edge-to-edge performance
    data. See http://egee-jra4.web.cern.ch/EGEE-JRA4/
    and a demo at
    https://egee.epcc.ed.ac.uk:28443/npm-dt/
  • IEPM -- Internet End-to-end Performance
    Monitoring. The IEPM effort has its origins in
    the 1995 WAN monitoring group at SLAC. IEPM-BW
    was developed to provide an infrastructure more
    focused on making active end-to-end performance
    measurements for a few high-performance paths.
  • MonALISA (Monitoring Agents using a Large-scale
    Integrated Services Architecture). This framework
    has been designed and implemented as a set of
    autonomous agent-based dynamic services that
    collect and analyze real-time information from a
    wide variety of sources (grid nodes, network
    routers and switches, optical switches, running
    jobs, etc.)
  • NMWG Schema - The NMWG (Network Measurement
    Working Group) focuses on characteristics of
    interest to grid applications and works in
    collaboration with other standards groups such
    as the IETF IPPM WG and the Internet2 End-to-End
    Performance Initiative. The NMWG will determine
    which of the network characteristics are relevant
    to Grid applications, and pursue standardization
    of the attributes required to describe these
    characteristics.
  • PerfSONAR - This project plans to deploy a
    monitoring infrastructure across Abilene
    (Internet2), ESnet, and GEANT. A standard set of
    measurement applications will regularly measure
    these backbones and store their results in the
    Global Grid Forum Network Measurement Working
    Group schema (see above).

24
Summary and Conclusion
  • The LHC-OPN monitoring document has been updated
    to reflect the new emphasis on:
  • Determining the appropriate metrics
  • Prototyping possible applications/systems
  • All sites should identify an LHC-OPN
    point-of-contact to help expedite the monitoring
    effort.
  • We have a number of possibilities regarding
    metrics. Defining which (if any) are required
    will help direct the prototyping efforts.
  • Prototyping is ready to proceed -- we need to
    identify sites which will host this effort.