Carrier Grade IP? - PowerPoint PPT Presentation

1 / 16
About This Presentation
Title:

Carrier Grade IP?

Description:

Why carrier grade IP? What makes it hard? Solution approach and ... IP? Increasing number of diverse applications over IP ... Rings and IP Link Failures ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 17
Provided by: Albe157
Category:
Tags: carrier | grade | ip

less

Transcript and Presenter's Notes

Title: Carrier Grade IP?


1
Carrier Grade IP?
  • Albert Greenberg
  • Jennifer Yates
  • Fred True
  • ATT Labs-Research

2
Agenda
  • Why carrier grade IP?
  • What makes it hard?
  • Solution approach and roadmap
  • Focus
  • Information Systems for fault/performance data
    management
  • Payoffs in improved network and service

3
Why Carrier Grade IP?
9
9
9
9
9
?
  • Increasing number of diverse applications over IP
  • Data, Web, Voice, Video (IPTV), Gaming,
  • Increasingly stringent requirements
  • Commerce / business critical transactions
  • Outages expose enterprises to huge losses
  • Web-based apps 24x7
  • Activity at all hours when to schedule
    maintenance?
  • Performance sensitive applications
  • Small network glitches cascade that trigger large
    application outages
  • Increasing pressure to scale
  • More service, more infrastructure, lower cost,
    fewer people

4
What Are We Up Against?
  • Hard, long lasting failures
  • Fiber cuts, router failures, line card failures,
  • Hardware and software problems
  • Approach design and control for
    diversity/resilience engineer net mgt systems
    for rapid service restoration
  • Chronic, intermittent faults
  • Outages that clear themselves, but keep
    recurring
  • Impact that adds up, even if the per event impact
    is small
  • Hardware and software problems
  • Approach engineer net mgt systems for forensics
    network/systems update to prevent recurrence

5
Solution and Roadmap
  • Removal of single points of failure
  • Fast and reliable failure detection
  • Fast service recovery (restoration)
  • Fast fault repair
  • Hitless maintenance
  • What about the edge?
  • Cost and single points of failure concentrated at
    the edge
  • Innovation 1N interface sparing, 1N router
    sparing (router farm)
  • What about the network (edge core)?
  • Fast diagnosis for real time response and
    off-line forensics
  • Innovation network data management systems that
    simplify analysis of complex and/or massive
    network data

6
Focus Information systems for fault/performance
data management -- Goals
  • Scale Efficient storage of potentially large and
    complex data feeds over long periods of time
  • Feature-Richness Comprehensive capabilities for
    data querying and reporting, which could be used
    to construct a variety of higher level
    applications
  • Speed Support for real-time data
  • Ease of Operation Very low maintenance and
    management overhead DBA-less! (DBA Data
    Base Administrator)
  • Straightforward paradigm for adding new
    feeds/tables Wizard-like
  • Automatic creation of various database
    mechanisms bulk data ingest, load control
    scripts, schema, data aging, logging/alerts
  • Automatic configuration of logs, alerts
  • Open design Employ the use of open toolsets
    where possible

7
Whats Hard in Network Data Management?
  • Data Distribution Getting data where it needs
    to be without complex, disjoint interfaces
  • Solution Data Distribution Bus
  • Managing Change Constant churn of new data,
    changing record layouts and schema, field values,
    etc.
  • Solution Automation and code generation
  • Keeping Track of Things Managing a coherent
    catalog of metadata loading status, schema,
    business intelligence (e.g. field validation
    rules)
  • Solution Integrated metadata database and query
    tools
  • Scale Building a system with features that
    scale evenly. Harnessing parallelism throughout
    the design scaling outside the box.
  • Solution Daytona data management system
    provides scale, stability, speed optimized for
    reliable processing of reliable data
  • Maintaining Uniformity across Data Sources
    facilitating data correlation and combining
    encouraging the use of common conventions, field
    types, keys, etc.
  • Solution Automation brings homogeneity to the
    data model!

8
Logical Data Architecture
Applications Correlation, Reporting, Ticketing,
Planning, Custom analysis
Application
Application
Application
Application
Application
DataStore
RealtimeHistorical Data
Data Distribution Bus
Network Elements
Ri
Ri
Ri
Provider Network(s)
Ri
9
Data Distribution Bus
  • Data Distribution Bus Glue between data
    collectors, repositories, and reporting/analysis
    systems. One logical system (and associated
    business process) to transport data everywhere.

  • Automated data transit management, data tracking,
    and publisher/subscriber model
  • Decoupling of data publishers from data
    subscribers easier to manage
  • 1-time configuration for each publisher/subscriber
  • Inherently parallel/scalable
  • Unified interface for all publishers/subscribers
  • Unified alerting/alarming
  • Short-term recovery buffer for critical data

Subscriber Systems

Data Distribution Bus Architecture


DB/Indexed


Staging

Staging

Controller


Raw Feed

Staging

Data index

metadata

..
Data

Normalizers


.
Subscription Manager (inbound
feeds)

10
Data Automation
Source template
User World
Automated World
MakeSource
  • Systems for services
  • Ingest
  • Validation
  • Logging
  • Versioning
  • Retry
  • Aging of old data
  • So that OPs and automated systems can concentrate
    on challenges in improving network reliability

Code, Function,
and
Configlet
Library
DS Catalog Update
DS Catalog
DB Tables
Ingest
Control Scripts
Indices
Module
Data loading,
Data aging,
Maintenance
Generated Code/Files
11
Automated Analytic Toolkit
  • Libraries for temporal, spatial clustering
  • Pairwise network data time series correlation
    testing
  • Chronic, intermittent fault identification
    (temporal correlation)
  • Silent fault localization (spatial correlation)
  • Reduced false alarm rate
  • Automated, rapid classification of all
    performance impacting events
  • Real time and offline customer trouble shooting
  • Select edge interfaces, network paths, traffic,
    applications by customer
  • Select customer traffic, services, applications
    by network element
  • Fault prediction

12
Example CPU Anomalies Link Load
  • Anomaly detection identifies unusual behaviour
    Correlation testing identified routers with CPU
    varying daily with load surprising!!!!
  • Some observed increasing over time
  • Operations forensics tracks this down to subtle
    configuration issue
  • Closed out a DOS vulnerability, potentially
    amplifying small attacks on an interface into
    total router failure
  • Automated global configuration repair

13
Example Cross-Layer and Automated Correlation
Degree of Correlation
Repair Rolled back configuration across network
Investigation / testing
High
Root cause identified field measurement and
extensive lab test lab experiments guided by
network insights to reproduce failure scenarios
Correlation identified Correlation across
vendors Specific ring technology? Specific
router technology (vendor, card type)?
Time
Configuration changes in network for fast
routing convergence
SONET Rings and IP Link Failures
14
Example Silent Failure Localization
Network
Route Monitor
E2E Data Monitor
  • Real time localization of outages for rapid
    failure recovery
  • Particularly for silent faults (i.e., no alarms
    generated to indicate which network element is
    having a problem)
  • Designed to operate in harsh network environment
  • Multiple simultaneous failures
  • Missing data
  • Correlate end to end monitoring alerts with
    topology to find most likely fault location

Alarms Failed MPLS tunnels
Real-time routing
Route calculation for each failed MPLS tunnell
Alarms and corresponding routes of all impacted
MPLS tunnels
Temporal clustering and spatial
correlation localize fault
15
Outcomes
9
9
9
9
9
  • Improved Network
  • Identification and permanent removal of egregious
    problems, that had been flying under the radar
  • Improved Network Management Systems and Processes
  • Faster service restoration and network repair

Detect
Localize
Restore
Diagnose
Repair
16
How To Push Automation As Far As Possible
  • Timely, accurate information is essential!!!
  • Example Precise topology and available capacity
    now
  • Tools that separate capabilities from policies,
    since policies can change fast
  • Example link utilizations should be lt 80 except
    for links involved in that new VoIP trial in
    Phoenix with vendor X equipment, where
    utilizations should be lt 50, except for ...
  • Statistical versus those rooted in domain
    expertise?
  • Big Guard Rails extensive monitoring and info
    correlation/validation
  • Huge Operations involvement at every step
  • Simpler, repeatable tasks/repairs automated
Write a Comment
User Comments (0)
About PowerShow.com