Advanced Network Management Introduction and Background

Provided by: chadi

1
Fault Management
Mani Subramanian, Network Management: Principles
and Practice, Addison-Wesley, 2000.
2
Fault Management
  • The process of locating and correcting network
    problems and faults
  • A fault is a failure of a network component that
    results in loss of connectivity
  • It is the most important functional management
    area
  • Process, 5 steps
      • Identify faults
          • Gather information via traps (linkDown,
            egpNeighborLoss) and polling
          • Traps alone may not be sufficient
          • Is a received trap an important one?
      • Locate the fault
          • Detect all failed components and trace down
            the tree topology to the source (e.g.,
            interface card failure on a router → all
            connected components will indicate a
            failure)
          • Fault isolation by network and SNMP tools
          • Use artificial intelligence/correlation
            techniques
      • Restore service (high priority)
      • Identify the root cause of the problem (trouble
        ticket)
      • Resolve the problem
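
The "locate fault" step above can be sketched in a few lines. This is a minimal illustration, assuming the topology is a tree stored as a child-to-parent map and that the set of failed components is already known from traps and polling. The function name `locate_root_faults` and the device names are hypothetical, not taken from the slides.

```python
from typing import Dict, Optional, Set


def locate_root_faults(parent: Dict[str, Optional[str]], failed: Set[str]) -> Set[str]:
    """Return the failed nodes whose parent is healthy (or that have no parent).

    Everything below such a node reports a failure only because its
    upstream component is down, so these nodes are the likely sources.
    """
    roots = set()
    for node in failed:
        upstream = parent.get(node)
        if upstream is None or upstream not in failed:
            roots.add(node)
    return roots


# Hypothetical topology: child -> parent (None marks the top of the tree).
parent = {
    "router": None,
    "if-card-1": "router",
    "switch-a": "if-card-1",
    "host-1": "switch-a",
    "host-2": "switch-a",
    "if-card-2": "router",
    "host-3": "if-card-2",
}

# An interface card failure makes every component behind it look failed.
failed = {"if-card-1", "switch-a", "host-1", "host-2"}
print(locate_root_faults(parent, failed))
```

In this example only `if-card-1` has a healthy parent, so it is reported as the probable source while the switch and hosts behind it are treated as consequences.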

3
Network Restoration - example
Collapsed Hierarchy, Improved Efficiency
[Figure labels: Failure detected; Source notified;
Message received and resources configured, send
ACK; Resources successfully set up, restore traffic]
  • Traffic is successfully restored only after
    failure notification and a round-trip
    configuration/confirmation.

4
Preliminaries
  • An event is an exceptional condition in the
    operation of the network
      • Software failure
      • Performance bottleneck
      • Configuration inconsistencies
      • Intrusion attempts
  • Network management operations
      • Monitoring events
      • Interpreting events
      • Handling events
  • A single problem event may cause many symptom
    events
  • Correlation means relating symptom events to
    identify and localize the underlying problems

5
Illustrative scenario
  • A client application exchanges data over a TCP
    connection with a DB server
  • Distinct domains each administered by a different
    organization

6
Illustrative scenario
  • Problem scenario
  • A clock at an interface in WAN2 that supports a
    T3 link loses SYNC 4 times a second for 0.25 ms
  • → intermittent noise causing loss of 0.1% of T3
    capacity
  • → this small noise causes bit errors in a large
    number of packets routed over C-D
  • → bit errors cause packet losses, either at
    routers (if the IP header is corrupted) or at
    destinations
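
The capacity figure follows directly from the numbers on the slide: 4 outages per second of 0.25 ms each add up to 1 ms of lost time per second. The short check below reproduces that arithmetic. The T3 line rate of 44.736 Mbit/s is the nominal DS3 rate and is not stated on the slide, so treat it as an added assumption.

```python
LOSSES_PER_SECOND = 4        # from the slide
LOSS_DURATION_S = 0.25e-3    # 0.25 ms per loss of sync, from the slide
T3_RATE_BPS = 44.736e6       # nominal T3 (DS3) line rate; assumption, not on the slide

# Fraction of each second during which the interface is out of sync.
lost_fraction = LOSSES_PER_SECOND * LOSS_DURATION_S

# Rough number of bits that fall into the outage windows every second.
lost_bits_per_second = lost_fraction * T3_RATE_BPS

print(f"{lost_fraction:.1%} of capacity, about {lost_bits_per_second:,.0f} bits/s affected")
```

Although the lost fraction is tiny, the affected bits are spread over four separate bursts per second, which is why many distinct packets end up corrupted.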

7
Illustrative scenario
  • → performance of the TCP connection degrades due
    to packet loss
  • → the TCP sender interprets the loss as congestion
    and hence reduces its window
  • TCP increases its window gradually until the next
    packet loss
  • However, because the noise keeps causing losses,
    the TCP window cannot grow
  • DB transactions by the client will last longer
  • DB server performance will degrade due to record
    lock-out, causing frequent aborts for remote
    transactions
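
The window behaviour described above can be illustrated with a toy additive-increase/multiplicative-decrease loop. This is a deliberately simplified sketch, not a model of any real TCP stack: it assumes one window update per round trip, no slow start, no timeouts, and a loss at fixed intervals. The function name and all parameters are hypothetical.

```python
from typing import List


def simulate_aimd(rtts: int, loss_every: int, max_window: int = 64) -> List[int]:
    """Return the window size (in segments) at the start of each round trip."""
    window = 1
    history = []
    for rtt in range(1, rtts + 1):
        history.append(window)
        if rtt % loss_every == 0:
            # Loss is read as congestion: halve the window.
            window = max(1, window // 2)
        else:
            # No loss: grow by one segment per round trip, up to the cap.
            window = min(max_window, window + 1)
    return history


clean = simulate_aimd(rtts=200, loss_every=1000)  # no loss within the run
noisy = simulate_aimd(rtts=200, loss_every=5)     # recurring noise-induced loss

print("largest window without loss:", max(clean))
print("largest window with recurring loss:", max(noisy))
```

With recurring losses the window is cut back before it can grow, so the sender stays far below the rate the link could carry even though almost all of the capacity is intact.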

8
Illustrative scenario
  • Three important points
      • Problems propagate among related objects and
        may be amplified by various protocol
        mechanisms
      • A single problem can cause numerous observable
        events in multiple domains
      • Some problems are not observable where they
        originate
          • The WAN2 domain may observe minor error
            events at the T3 interface, but these events
            may be indistinguishable from normal
            operating noise → WAN2 may be unaware that
            there is a problem
  • Challenges
      • Determine which events to monitor and how to
        analyze them
      • Operations staff must know the operational
        parameters of managed objects and the
        significance of their events
      • Correlate events and coordinate among
        different domains
      • Automate the management activities (manual
        processing does not scale)
9
Modeling the Scenario
  • Partition the system into multiple management
    domains (e.g., enterprise domain, ED, and router
    domain, RD)
  • Each domain has a domain manager (DM) to monitor,
    correlate and handle its events
  • A DM may subscribe to receive notifications from
    other domains
  • ED sees the RD as a single entity connecting LAN1
    and LAN2

10
Modeling the Scenario
[Figure annotation: detects only IP header corruption]
  • Any problem in the connection is seen as an RD
    problem
  • Inside each domain, finer-grained correlation can
    determine the particular problem using symptoms
    from other domains
  • Example: packet loss, seen as degraded TCP
    performance, is detected by the ED, not by the RD.
  • This symptom is received by the RD and can be
    correlated with other observable symptoms to
    isolate the clock problem.

11
Automating Event Management
  • An automated event management system (AEMS) must
    accurately model and store knowledge of the
    underlying system and its associated events.
      • Static information is associated with managed
        objects, such as SNMP traps, thresholds for MIB
        variables, etc.
      • Dynamic information reflects addition, removal,
        and upgrades of network devices, etc.
  • The process of automation is that of developing
    correlation algorithms to analyze observable
    events
  • Correlation algorithms must be
      • Scalable to large networks involving complex
        systems
      • Able to handle a large number of symptoms
        caused by a single problem
      • Fast (real-time correlation)
      • Robust (loss of a single alarm or generation of
        a spurious event should not affect the
        decision), i.e., insensitive or resilient to
        noise

12
Problems and Symptoms
  • A problem is an event that can be handled
    directly, e.g., a faulty interface
  • Some problems are directly observable; others can
    be observed only indirectly, through their
    symptoms
  • Symptoms are observable events
  • Degraded application performance is a symptom of
    a faulty interface
  • Symptoms cannot be handled directly; they persist
    until the underlying problem is resolved
  • Problems and symptoms propagate from one object
    to another
  • Noise in WAN → bit errors in link C-D → loss of
    packets at routers → poor TCP performance →
    frequent transaction aborts in the DB server

13
Event Correlation System
  • Monitors typically collect managed data at
    network elements and detect out of tolerance
    conditions, generating appropriate alarms.
  • The correlator uses an event model to analyze
    these alarms.
  • The event model represents knowledge of various
    events and their causal relationships
  • The event model depends on the knowledge of human
    experts
  • The correlator determines the common problems
    that caused the observed alarms.

14
Event Knowledge
  • The modeler's event knowledge contains the
    following information for each class of managed
    objects:
  • The data attributes of objects of this class
    (e.g., MIB variables).
  • The set of events that are observable within
    instances of this class (e.g., a particular MIB
    variable is above threshold), or by asynchronous
    event notifications.
  • The set of events caused by each problem. This
    set can include events within the object, as well
    as events in other objects to which the object is
    related.
  • The problems that can originate within instances
    of this class.
  • The relationships in which an instance of the
    class can be involved.
  • The events and/or problems that are exported by
    instances of the class.
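
One way to picture the six items above is as a per-class record. The sketch below is only an illustrative layout: the class name `ClassEventKnowledge`, its field names, and the example event names are hypothetical. `ifInErrors` is a standard interface counter from the SNMP interfaces MIB, used here purely as an example attribute.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ClassEventKnowledge:
    """Event knowledge for one class of managed objects (illustrative layout)."""
    class_name: str
    attributes: List[str] = field(default_factory=list)        # e.g. MIB variables
    observable_events: Set[str] = field(default_factory=set)   # seen inside instances
    problems: Set[str] = field(default_factory=set)            # can originate here
    caused_events: Dict[str, Set[str]] = field(default_factory=dict)  # problem -> events
    relationships: List[str] = field(default_factory=list)     # links to other classes
    exported_events: Set[str] = field(default_factory=set)     # visible to other objects

    def local_symptoms(self, problem: str) -> Set[str]:
        """Events caused by `problem` that are observable within this class."""
        return self.caused_events.get(problem, set()) & self.observable_events

    def exported_symptoms(self, problem: str) -> Set[str]:
        """Events caused by `problem` that related objects can see."""
        return self.caused_events.get(problem, set()) & self.exported_events


# Hypothetical entry for the T3 interface of the running scenario.
t3_interface = ClassEventKnowledge(
    class_name="T3Interface",
    attributes=["ifInErrors"],
    observable_events={"input_errors_above_threshold"},
    problems={"clock_sync_loss"},
    caused_events={
        "clock_sync_loss": {"input_errors_above_threshold", "bit_errors_on_link"},
    },
    relationships=["terminates Link"],
    exported_events={"bit_errors_on_link"},
)

print(t3_interface.local_symptoms("clock_sync_loss"))
print(t3_interface.exported_symptoms("clock_sync_loss"))
```

Keeping local and exported events separate mirrors the point that a problem may be only weakly visible where it originates while its effects show up in related objects.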

15
Coding Approach for Event Correlation
  • Treat the complete set of events caused by a
    problem as a code that identifies the problem
  • Correlation is the process of decoding the set of
    observed symptoms
  • Determine which problem has these symptoms as its
    code
  • Note: traditionally, alarms are correlated through
    searches over the event model knowledge base
  • The complexity of the search limits scalability
  • The event model is a large database, and the set
    of received alarms or symptoms may also be quite
    large

16
Coding Approach for Event Correlation
  • Two phases
  • Codebook selection phase
      • Select a subset of events to monitor: the
        codebook
      • The codebook is an optimal subset of events
        that must be monitored to distinguish the
        problems of interest from one another
      • Ensure a desired level of noise tolerance
      • Algorithms must decode or infer the problem in
        the presence of lost alarms or spurious alarms
  • Decoding
      • Find the problem whose associated symptoms
        (i.e., code) match the observed symptoms most
        closely
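
The selection phase can be sketched as a search for the smallest set of symptom columns that still keeps every pair of problem codes at least a chosen Hamming distance apart. The brute-force search below is an illustration only and is not presented as the algorithm used by the authors; it is exponential in the number of symptoms. The problem names and code values are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def select_codebook(codes: Dict[str, Tuple[int, ...]],
                    min_distance: int = 1) -> Optional[Tuple[int, ...]]:
    """Return the smallest tuple of symptom indices that keeps every pair of
    problems at least `min_distance` apart, or None if no subset does."""
    n_symptoms = len(next(iter(codes.values())))
    for size in range(1, n_symptoms + 1):
        for subset in combinations(range(n_symptoms), size):
            projected = [tuple(code[i] for i in subset) for code in codes.values()]
            if all(hamming(a, b) >= min_distance
                   for a, b in combinations(projected, 2)):
                return subset
    return None


# Hypothetical problem codes over five candidate symptoms.
codes = {
    "P1": (1, 1, 0, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 1, 0, 1),
    "P4": (0, 0, 1, 1, 1),
}

print(select_codebook(codes, min_distance=1))  # enough to tell problems apart
print(select_codebook(codes, min_distance=2))  # extra symptoms buy noise tolerance
```

Raising `min_distance` forces more symptoms into the codebook, which is the trade-off between monitoring cost and noise tolerance that the selection phase has to settle.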

17
Causality Graph Models
  • Correlation is concerned with analysis of
    causality relations among events
  • e → f denotes that event e causes event f
  • Causality is a partial order relation between
    events
  • The relation → can be described by a graph whose
    nodes represent the events and whose edges
    represent causality
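
A causality graph of this kind maps naturally onto an adjacency list, and "everything event e eventually causes" is then plain graph reachability. The sketch below uses the propagation chain from the clock scenario as its graph. The event identifiers and the helper name `caused_by` are hypothetical.

```python
from typing import Dict, List, Set


def caused_by(graph: Dict[str, List[str]], event: str) -> Set[str]:
    """Return every event reachable from `event` along causality edges."""
    seen: Set[str] = set()
    stack = list(graph.get(event, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


# Edges e -> f mean "e causes f" (chain taken from the running scenario).
graph = {
    "clock_noise": ["bit_errors"],
    "bit_errors": ["packet_loss"],
    "packet_loss": ["poor_tcp_performance"],
    "poor_tcp_performance": ["transaction_aborts"],
}

print(sorted(caused_by(graph, "clock_noise")))
print(sorted(caused_by(graph, "packet_loss")))
```

The `seen` set also keeps the traversal finite if the graph happens to contain a cycle of mutually causing events.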

18
Causality Graph Models
  • A symptom caused by another symptom does not
    contribute any additional information about the
    problem

[Figure labels: Event that is neither a symptom nor
a problem; Causal equivalence; All these indirect
symptoms can be eliminated without loss of
information; Correlation graph]
19
Correlation
  • Information contained in the correlation graph
    must be converted into codes, one for each
    problem in the graph.
  • A code for a problem p is a vector p of 0s and
    1s. Each bit corresponds to a symptom in the
    graph
  • In the example, the code is of length 3 (3
    symptoms) after ordering of the symptoms (e.g.,
    <S3, S6, S9>)
  • → the code for p1 is p1 = (1, 0, 1)
  • This means p1 causes symptoms S3 and S9
  • p2 = (1, 1, 0) and p11 = (1, 0, 1)

[Figure: correlation graph]
Event correlation is finding problems whose
codes optimally match an observed symptom vector
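
The conversion from "which symptoms each problem causes" to bit vectors is mechanical, as the sketch below shows for the three problems and the symptom ordering <S3, S6, S9> given on the slide. The helper names `code_for` and `exact_matches` are hypothetical, and only exact matching is shown here.

```python
from typing import Dict, List, Set, Tuple

SYMPTOMS = ("S3", "S6", "S9")          # fixed ordering of the symptoms

CAUSES: Dict[str, Set[str]] = {        # symptoms caused by each problem (from the slide)
    "p1": {"S3", "S9"},
    "p2": {"S3", "S6"},
    "p11": {"S3", "S9"},
}


def code_for(problem: str) -> Tuple[int, ...]:
    """Bit i is 1 when the problem causes the i-th symptom in SYMPTOMS."""
    return tuple(int(symptom in CAUSES[problem]) for symptom in SYMPTOMS)


CODES = {problem: code_for(problem) for problem in CAUSES}


def exact_matches(observed: Tuple[int, ...]) -> List[str]:
    """Problems whose code equals the observed symptom vector."""
    return sorted(p for p, code in CODES.items() if code == observed)


print(CODES)
print(exact_matches((1, 1, 0)))
print(exact_matches((1, 0, 1)))
```

An observed vector that matches more than one code signals that the chosen symptoms cannot separate those problems.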
20
Correlation
  • What happens when we observe symptoms S3 and S9,
    i.e., the vector (1, 0, 1)?
  • Both P1 and P11 match the observed vector!
  • We know there is a problem but cannot identify
    which one, since both problems have identical
    codes.
  • What happens when we observe the vector
    (0, 1, 0)?
  • Two possibilities: (1) a false (spurious) event
    or (2) P2 occurred but one symptom was lost.

[Figure: correlation graph]
The interpretation depends on whether loss is
more likely than false alarm generation.
If spurious or lost symptoms are unlikely, the
information provided by S9 is redundant
→ (1, 0) and (1, 1) are sufficient to correlate
event vectors. The subset of symptoms required to
provide the desired level of distinction between
problems is called the codebook.
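
How the observation (0, 1, 0) is read depends on the relative odds of losing a symptom versus seeing a spurious one. A simple way to make that concrete is to score each hypothesis under an assumed noise model. The sketch below assumes every symptom is lost or spuriously generated independently with fixed probabilities and ignores how likely each problem is a priori. These modelling choices and the probability values are assumptions for illustration, not part of the slides.

```python
from typing import Dict, Tuple


def likelihood(code: Tuple[int, ...], observed: Tuple[int, ...],
               p_loss: float, p_spurious: float) -> float:
    """Probability of `observed` given the true code, with independent noise."""
    result = 1.0
    for expected, seen in zip(code, observed):
        if expected and seen:
            result *= 1 - p_loss        # symptom raised and delivered
        elif expected and not seen:
            result *= p_loss            # symptom raised but lost
        elif seen:
            result *= p_spurious        # symptom seen without a cause
        else:
            result *= 1 - p_spurious    # correctly silent
    return result


# Hypotheses over <S3, S6, S9>: nothing is wrong, or p2 = (1, 1, 0) occurred.
HYPOTHESES: Dict[str, Tuple[int, ...]] = {
    "no problem": (0, 0, 0),
    "p2": (1, 1, 0),
}


def most_likely(observed: Tuple[int, ...], p_loss: float, p_spurious: float) -> str:
    return max(HYPOTHESES,
               key=lambda h: likelihood(HYPOTHESES[h], observed, p_loss, p_spurious))


observed = (0, 1, 0)
print(most_likely(observed, p_loss=0.2, p_spurious=0.01))   # losses are common
print(most_likely(observed, p_loss=0.01, p_spurious=0.2))   # false alarms are common
```

When losses dominate, the lone S6 is best explained by p2 with S3 lost; when false alarms dominate, a spurious S6 with no underlying problem becomes the better explanation.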
21
Correlation- example
  • The codebook contains only three symptoms
  • The codebook distinguishes among all problems;
    however, it guarantees distinction by only a
    single symptom

A loss or spurious generation of S4 will result
in a decoding error.
Distinction between problems is measured by the
Hamming distance between their codes.
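
The noise tolerance of a codebook can be read off its smallest pairwise Hamming distance d: up to d - 1 symptom errors are guaranteed to be detected, and up to (d - 1) // 2 are guaranteed to be corrected, which is the standard result for block codes. The sketch below computes these figures for a hypothetical three-problem codebook; the problem names and code values are made up for illustration.

```python
from itertools import combinations
from typing import Dict, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook: Dict[str, Sequence[int]]) -> int:
    """Smallest Hamming distance between any two problem codes."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))


# Hypothetical codebook over six symptoms.
codebook = {
    "A": (1, 1, 1, 0, 0, 0),
    "B": (0, 0, 1, 1, 1, 0),
    "C": (1, 0, 0, 0, 1, 1),
}

d = min_distance(codebook)
print("minimum distance:", d)
print("errors always detected:", d - 1)
print("errors always corrected:", (d - 1) // 2)
```

A codebook whose minimum distance is 1 therefore has no tolerance at all: a single lost or spurious symptom can turn one problem's code into another's.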
22
Correlation- example
Event vectors 011100, 101100, 110100, and 111000
will be decoded as P1 with a single symptom
loss, and 111110 and 111101 are interpreted as P1
with a single spurious symptom.
When two symptom errors occur, the decoder will
detect the error but cannot correctly (uniquely)
decode the event (e.g., P1 and P4).
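
The listed vectors are consistent with P1 having the code 111100: each "loss" vector clears one of its first four bits, and each "spurious" vector sets one of the last two. That code is inferred from the vectors and is not stated on the slide, so treat it as an assumption. The sketch below classifies each observed vector by how it deviates from that code, with symptom positions simply numbered 1 to 6.

```python
from typing import List, Tuple

P1_CODE = "111100"  # inferred from the listed vectors; an assumption


def deviation(observed: str, code: str = P1_CODE) -> Tuple[List[int], List[int]]:
    """Return (lost, spurious) symptom positions, numbered from 1."""
    lost = [i + 1 for i, (c, o) in enumerate(zip(code, observed))
            if c == "1" and o == "0"]
    spurious = [i + 1 for i, (c, o) in enumerate(zip(code, observed))
                if c == "0" and o == "1"]
    return lost, spurious


for vector in ["011100", "101100", "110100", "111000", "111110", "111101"]:
    lost, spurious = deviation(vector)
    print(vector, "lost:", lost, "spurious:", spurious)
```

Every vector in the list differs from 111100 in exactly one position, which is why a decoder that picks the nearest code can still attribute them to P1.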
23
Correlation- Advantages