Advanced Network Management Introduction and Background - PowerPoint PPT Presentation

Slides: 24

Provided by: chadi

Transcript and Presenter's Notes

1
Fault Management
Mani Subramanian Network Management Principles
and practice, Addison-Wesley, 2000.
2
Fault Management

The process of locating and correcting network
problems and faults
A fault is a failure of a network component, which
results in loss of connectivity
It is the most important functional management
area
Process, 5 steps
Identify faults
Gathering information via traps (linkDown,
egpNeighborLoss) and polling
Traps may not be sufficient
Is a received trap an important one?
Locate Fault
Detect all failed components and trace down the
tree topology to the source (e.g., interface card
failure on a router → all connected components
will indicate a failure)
Fault isolation by network and SNMP tools
Use artificial intelligence/correlation
techniques
Restore service (high priority)
Identify the root cause of the problem (trouble
ticket)
Resolve problem
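
The "Locate Fault" step can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes the topology is a tree stored as a child-to-parent map, and the device names and the helper `locate_root_faults` are hypothetical.

```python
def locate_root_faults(parent, failed):
    """Return the failed nodes whose parent is not itself failed.

    These topmost failures are the candidates for the source of the fault;
    failures below them are treated as consequences.
    """
    return sorted(n for n in failed if parent.get(n) not in failed)


# Hypothetical tree topology: child -> parent (None marks the root).
parent = {
    "router1": None,
    "card_a": "router1",
    "card_b": "router1",
    "host1": "card_a",
    "host2": "card_a",
    "host3": "card_b",
}

# An interface card fails, so everything connected through it reports a failure.
failed = {"card_a", "host1", "host2"}

print(locate_root_faults(parent, failed))
```

Only the interface card is reported, because the two hosts sit below a component that has already failed.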

3
Network Restoration- example
Collapsed Hierarchy, Improved Efficiency
Failure detected
Resources successfully setup, Restore traffic
Source notified
Message received and resources configured. SEND
ACK

Traffic is successfully restored only after
failure notification
and a round trip configuration/confirmation.

4
Preliminaries

An event is an exceptional condition in the
operation of the network
Software failure
Performance bottleneck
Configuration inconsistencies
Intrusion attempts
Network management operations
Monitoring events
Interpreting events
Handling events
A single problem event may cause many symptom
events
Correlating symptom events to identify and
localize the underlying problems

5
Illustrative scenario

A client application exchanges data over a TCP
connection with a DB server
Distinct domains each administered by a different
organization

6
Illustrative scenario

Problem scenario
A clock at an interface in WAN2 that supports a
T3 link loses SYNC 4 times a second for 0.25 ms
→ intermittent noise causing loss of 0.1% of T3
capacity
→ this small noise causes bit errors in a large
number of packets routed over C-D
→ Bit errors cause packet losses, either at
routers (if IP header corrupted) or at
destinations
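
The size of the outage can be checked with a little arithmetic. The sketch below uses the two figures from the scenario; the nominal T3 line rate of 44.736 Mbit/s is not stated in the slides and is added here as an assumption.

```python
# Figures taken from the scenario.
losses_per_second = 4
loss_duration_s = 0.25e-3          # 0.25 ms per loss of SYNC

# Assumption: nominal T3 (DS3) line rate, not given in the slides.
t3_rate_bps = 44.736e6

fraction_lost = losses_per_second * loss_duration_s   # share of each second lost
bits_affected_per_second = t3_rate_bps * fraction_lost

print(f"fraction of capacity lost: {fraction_lost:.1%}")
print(f"bits affected per second:  {bits_affected_per_second:,.0f}")
```

The link is out of sync for 1 ms in every second, that is 0.1% of the time, yet the four outages are spread across the second and so can touch many different packets.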

7
Illustrative scenario

→ performance of TCP connection degrades due to
packet loss
→ TCP sender interprets this as congestion and
hence reduces its window
TCP increases its window gradually until new
packet loss
However due to the noise, the TCP window will not
increase
DB transactions by client will last longer
DB server performance will degrade due to records
lock-out, causing frequent aborts for remote
transactions
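
The effect on the window can be illustrated with a toy additive-increase, multiplicative-decrease loop. This is a simplified sketch and not real TCP: it assumes the window grows by one segment per round trip, halves on a loss, and that the noise causes a loss every fourth round trip. The function `simulate_aimd` and its parameters are hypothetical.

```python
def simulate_aimd(rounds, loss_every, start=1, cap=64):
    """Return the window size after each round trip.

    loss_every=0 means no losses; otherwise a loss hits every
    `loss_every`-th round trip and halves the window.
    """
    window = start
    history = []
    for r in range(1, rounds + 1):
        if loss_every and r % loss_every == 0:
            window = max(1, window // 2)    # loss seen as congestion
        else:
            window = min(cap, window + 1)   # gradual increase
        history.append(window)
    return history


clean = simulate_aimd(40, loss_every=0)
noisy = simulate_aimd(40, loss_every=4)

print("final window without noise:", clean[-1])
print("largest window with noise: ", max(noisy))
```

Under these assumptions the noisy connection never gets its window past a handful of segments, which is the behaviour the slide describes.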

8
Illustrative scenario

Three important points
Problems propagate among related objects, and
are possibly amplified by various protocol mechanisms
A single problem can cause numerous observable
events in multiple domains
Some problems are not observable where they
originate
WAN2 domain may observe minor error events at the
T3 interface, but these events may be
indistinguishable from normal operating noise →
WAN2 may be unaware that there is a problem

Challenges
Determine events to monitor and ways to analyze
them
Operations staff must have knowledge of
operational parameters of managed objects and the
significance of their events
Correlation of events and coordination among
different domains
Automating the management activities (manual
processing does not scale)

9
Modeling the Scenario

Partition the system into multiple management
domains (e.g., enterprise domain, ED, and router
domain, RD)
Each domain has a domain manager (DM) to monitor,
correlate and handle its events
A DM may subscribe to receive notifications from
other domains
ED sees the RD as a single entity connecting LAN1
and LAN2

10
Modeling the Scenario
Detects only IP header corruption

Any problem in the connection is seen as RD
problem
Inside each domain, finer grained correlation can
determine the particular problem using symptoms
from other domains
Example: packet loss, which shows up as degraded
TCP performance, is detected by the ED, not by the RD.
This symptom is received by the RD and can be
correlated along with other observable symptoms
to isolate the clock problem.

11
Automating Event Management

An automated event management system (AEMS) must
accurately model and store knowledge of the
underlying system and its associated events.
Static Information associated with managed
objects such as SNMP traps, thresholds for MIB
variables, etc.
Dynamic information reflects addition, removal,
upgrades of network devices, etc.
The process of automation is that of developing
correlation algorithms to analyze observable
events
Correlation algorithms must be
Scalable to large networks involving complex
systems
Able to handle a large number of symptoms caused
by a single problem
Fast (real-time correlation)
Robust (loss of a single alarm or generation of
a spurious event should not affect the decision)
→ insensitive or resilient to noise

12
Problems and Symptoms

A problem is an event that can be handled
directly, e.g., a faulty interface
Some problems are directly observable; others are
observed indirectly through their symptoms
Symptoms are observable events
Degraded application performance is a symptom of
a faulty interface
Symptoms cannot be handled directly: symptoms
persist unless the problem is resolved
Problems and symptoms propagate from one object
to another
Noise in WAN → bit errors in link C-D → loss of
packets at routers → poor TCP performance →
frequent transaction aborts in the DB server

13
Event Correlation System

Monitors typically collect managed data at
network elements and detect out of tolerance
conditions, generating appropriate alarms.
The correlator uses an event model to analyze
these alarms.
The event model represents knowledge of various
events and their causal relationships
The event model depends on human experts
The correlator determines the common problems
that caused the observed alarms.

14
Event Knowledge

The modeler's event knowledge contains the
following information for each class of managed
objects
The data attributes of objects of this class
(e.g., MIB variables).
The set of events that are observable within
instances of this class (e.g., a particular MIB
variable is above threshold), or by asynchronous
event notifications.
The set of events caused by each problem. This
set can include events within the object, as well
as events in other objects to which the object is
related.
The problems that can originate within instances
of this class.
The relationships in which an instance of the
class can be involved.
The events and/or problems that are exported by
instances of the class.

15
Coding Approach for Event Correlation

Treat the complete set of events caused by a
problem as a code that identifies the problem
Correlation is the process of decoding the set of
observed symptoms
Determine which problem has these symptoms as its
code
Note: traditionally, alarms are correlated
through searches over the event model
knowledge base
Complexity of search limits scalability
The event model is a large database, and the set
of received alarms or symptoms may also be quite large

16
Coding Approach for Event Correlation

Two phases
Codebook selection phase
Select a subset of events for monitoring: the
codebook
Codebook is an optimal subset of events that must
be monitored to distinguish the problems of
interest from one another
Ensure a desired level of noise tolerance
Algorithms must decode or infer the problem in
the presence of lost alarms or the existence of
spurious alarms
Decoding
Find the problem whose associated symptoms (i.e.,
code) match the observed symptoms most closely

17
Causality Graph Models

Correlation is concerned with analysis of
causality relations among events
e → f denotes causality of event f by event e
Causality is a partial order relation between
events
Relation → can be described by a graph whose
nodes represent the events and edges represent
causality
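
A causality graph of this kind can be held as an adjacency map and walked to find everything a given event causes. The sketch below reuses the propagation chain from the Problems and Symptoms slide; the node labels and the helper `reachable` are hypothetical names chosen for illustration.

```python
def reachable(graph, start):
    """Return every event that `start` causes, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Each entry e: [f, ...] encodes the causal edges e -> f.
graph = {
    "wan_noise": ["bit_errors_cd"],
    "bit_errors_cd": ["packet_loss"],
    "packet_loss": ["poor_tcp"],
    "poor_tcp": ["db_aborts"],
}

print(sorted(reachable(graph, "wan_noise")))
```

Because causality is transitive, the noise in the WAN is linked to the transaction aborts in the DB server even though no single edge connects them.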

18
Causality Graph Models

A symptom caused by another symptom
does not contribute any information
about the problem

Event that is neither a symptom nor a problem.
Causal equivalence
All these indirect symptoms can be eliminated
without loss of information
Correlation graph
19
Correlation

Information contained in the correlation graph
must be converted into codes, one for each
problem in the graph.
A code for a problem p is a vector p of 0s and
1s. Each bit corresponds to a symptom in the
graph
Example: code is of length 3 (3 symptoms) after
ordering of the symptoms (e.g., <S3, S6, S9>)
→ code for p1 is p1 = (1, 0, 1)
This means p1 causes symptoms S3 and S9
p2 = (1, 1, 0) and p11 = (1, 0, 1)

Correlation graph
Event correlation is finding problems whose
codes optimally match an observed symptom vector
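
The three codes given on this slide can be written down directly and matched against an observation. This is a minimal sketch of exact matching only; the helper names `to_vector` and `exact_matches` are hypothetical, and only the three problems whose codes appear in the text are included.

```python
# Symptom ordering and codes as given in the slide.
SYMPTOMS = ("S3", "S6", "S9")
CODES = {
    "p1": (1, 0, 1),
    "p2": (1, 1, 0),
    "p11": (1, 0, 1),
}


def to_vector(observed):
    """Turn a set of observed symptom names into a 0/1 vector."""
    return tuple(1 if s in observed else 0 for s in SYMPTOMS)


def exact_matches(observed):
    """Return the problems whose code equals the observed vector."""
    vec = to_vector(observed)
    return sorted(p for p, code in CODES.items() if code == vec)


print(exact_matches({"S3", "S9"}))   # two problems share this code
print(exact_matches({"S3", "S6"}))
print(exact_matches({"S6"}))         # no code matches exactly
```

Observing S3 and S9 returns both p1 and p11, since their codes are identical, while an observation of S6 alone matches no code exactly.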
20
Correlation

What happens when we observe symptoms S3 and S9?
Both P1 and P11 match the observed vector!
Clearly we know there is a problem but
cannot identify the problem since both problems
have identical codes.
What happens when we observe symptoms (0, 1, 0)?
Two possibilities: (1) a false event or (2)
P3 occurred but one symptom was lost.

Correlation graph
Interpretation depends on whether loss is
more likely than false alarm generation
In case spurious or lost symptoms are
unlikely, information provided by S9 is redundant
→ (1, 0) and (1, 1) are sufficient to correlate
event vectors. The subset of symptoms required to
provide the desired level of distinction between
problems is called the codebook
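
The redundancy of S9 can be checked by projecting each code onto the symptoms <S3, S6> and confirming that the same pairs of problems stay distinguishable. The sketch uses only the three codes stated earlier (p1, p2, p11); the helpers `project` and `distinguishable_pairs` are hypothetical names.

```python
from itertools import combinations

SYMPTOMS = ("S3", "S6", "S9")
CODES = {
    "p1": (1, 0, 1),
    "p2": (1, 1, 0),
    "p11": (1, 0, 1),
}


def project(codes, keep):
    """Keep only the bits of each code that belong to the symptoms in `keep`."""
    idx = [SYMPTOMS.index(s) for s in keep]
    return {p: tuple(code[i] for i in idx) for p, code in codes.items()}


def distinguishable_pairs(codes):
    """Return the pairs of problems whose codes differ in at least one bit."""
    return {(a, b) for a, b in combinations(sorted(codes), 2)
            if codes[a] != codes[b]}


reduced = project(CODES, ("S3", "S6"))
print(reduced)
print(distinguishable_pairs(reduced) == distinguishable_pairs(CODES))
```

Dropping S9 leaves the codes (1, 0) and (1, 1) and loses no distinction between problems, but it also leaves no spare bit to absorb a lost or spurious symptom.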
21
Correlation- example

Codebook contains only three symptoms
The codebook distinguishes among all problems;
however, it guarantees distinction by only a
single symptom

A loss or spurious generation of S4 will result
in a decoding error
Distinction between problems is measured by the
Hamming distance between their codes
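
The noise tolerance of a codebook can be read off its smallest pairwise Hamming distance. The two codebooks below are illustrative assumptions, not the ones in the slide's figure, and `min_distance` is a hypothetical helper.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook):
    """Smallest Hamming distance between any two codes in the codebook."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))


# Illustrative codebooks (assumed, not taken from the slides).
weak = {"P1": "100", "P2": "110", "P3": "111"}
strong = {"P1": "111100", "P2": "000111", "P3": "110011"}

print(min_distance(weak))
print(min_distance(strong))
```

A minimum distance of 1 means one lost or spurious symptom can turn one problem's code into another's. A minimum distance of 3 means a single such error still leaves the observation closest to the correct code.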
22
Correlation- example
Event vectors 011100, 101100, 110100, 111000
will be decoded as P1 with a single symptom
loss, and 111110, 111101 are interpreted as P1
with a single spurious symptom
When two error symptoms occur, the decoder will
detect the error but cannot correctly (uniquely)
decode the event (e.g., P1 and P4)
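
Minimum-distance decoding of these event vectors can be sketched as follows. The code 111100 for P1 is inferred from the vectors listed above (each is one bit away from it). The code for P4 is not given in the text, so 110011 is a hypothetical value chosen to sit at distance 4 from P1, and `decode` is a hypothetical helper.

```python
def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(observed, codebook):
    """Return (closest problems, distance) for an observed event vector."""
    dists = {p: hamming(observed, code) for p, code in codebook.items()}
    best = min(dists.values())
    return sorted(p for p, d in dists.items() if d == best), best


CODEBOOK = {
    "P1": "111100",   # inferred from the event vectors in the slide
    "P4": "110011",   # hypothetical code, not given in the slides
}

print(decode("011100", CODEBOOK))   # one symptom lost
print(decode("111110", CODEBOOK))   # one spurious symptom
print(decode("110000", CODEBOOK))   # two symptoms lost: ambiguous
```

With one error the decoder still settles on P1. With two errors the observation 110000 is equally far from P1 and P4, so the error is detected but the problem cannot be identified uniquely.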
23
Correlation- Advantages
