Title: Advanced Network Management Introduction and Background
1. Fault Management
Mani Subramanian, Network Management: Principles and Practice, Addison-Wesley, 2000.
2. Fault Management
- The process of locating and correcting network problems and faults
- A fault is a failure of a network component that results in loss of connectivity
- It is the most important functional management area
- Resolve problem
- Process, 5 steps
- Identify faults
- Gather information via traps (linkDown, egpNeighborLoss) and polling
- Traps may not be sufficient
- Is a received trap an important one?
- Locate fault
- Detect all failed components and trace down the tree topology to the source (e.g., an interface card failure on a router → all connected components will indicate a failure)
- Fault isolation by network and SNMP tools
- Use artificial intelligence / correlation techniques
- Restore service (high priority)
- Identify the root cause of the problem (trouble ticket)
- Resolve problem
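The "locate fault" step can be sketched in a few lines. This is only an illustration, assuming a hypothetical tree topology stored as a child-to-parent map and a set of components that currently report failure; the names `PARENT`, `locate_fault_sources` and the device labels are invented for the example and do not come from the slides.

```python
# Hypothetical tree topology: each component maps to its parent (None = root).
PARENT = {
    "core-router": None,
    "router-A": "core-router",
    "if-card-1": "router-A",
    "switch-1": "if-card-1",
    "host-1": "switch-1",
    "host-2": "switch-1",
    "router-B": "core-router",
}

def locate_fault_sources(failed, parent=PARENT):
    """Return the failed components whose parent has not failed.

    Everything below a failed node also reports failure, so the
    candidate sources are the topmost failed nodes in the tree.
    """
    return sorted(c for c in failed if parent.get(c) not in failed)

# An interface card fails; everything connected below it indicates failure too.
failed = {"if-card-1", "switch-1", "host-1", "host-2"}
print(locate_fault_sources(failed))
```

With the sample data, only the interface card is reported as a candidate source, although four components indicate a failure.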
3. Network Restoration - example
[Figure: restoration message sequence, "Collapsed Hierarchy, Improved Efficiency". Labels: Failure detected; Resources successfully set up, restore traffic; Source notified; Message received and resources configured, send ACK]
- Traffic is successfully restored only after failure notification and a round-trip configuration/confirmation.
4. Preliminaries
- An event is an exceptional condition in the operation of the network
- Software failure
- Performance bottleneck
- Configuration inconsistencies
- Intrusion attempts
- Network management operations
- Monitoring events
- Interpreting events
- Handling events
- A single problem event may cause many symptom events
- Correlating symptom events to identify and localize the underlying problems
5. Illustrative scenario
- A client application exchanges data over a TCP connection with a DB server
- Distinct domains, each administered by a different organization
6. Illustrative scenario
- Problem scenario
- A clock at an interface in WAN2 that supports a T3 link loses SYNC 4 times a second for 0.25 ms
- → intermittent noise causing loss of 0.1% of T3 capacity
- → this small noise causes bit errors in a large number of packets routed over C-D
- → bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations
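The size of the disturbance can be checked with a short calculation. The sketch below uses the two figures given in the scenario (4 losses per second, 0.25 ms each); the nominal T3 line rate of 44.736 Mbit/s is an added assumption that the slides do not state.

```python
# Figures from the scenario: 4 sync losses per second, 0.25 ms each.
losses_per_second = 4
loss_duration_s = 0.25e-3

# Fraction of every second during which the link is disturbed.
lost_fraction = losses_per_second * loss_duration_s

# Assumption: nominal T3 (DS3) line rate, not given in the slides.
T3_RATE_BPS = 44.736e6

print(f"fraction of time affected: {lost_fraction:.1%}")
print(f"bits affected per second: {lost_fraction * T3_RATE_BPS:.0f}")
```

The disturbance covers 1 ms of every second, which is small in capacity terms but is spread over four separate bursts, so it can touch many different packets.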
7. Illustrative scenario
- → performance of the TCP connection degrades due to packet loss
- → the TCP sender interprets this as congestion and hence reduces its window
- TCP increases its window gradually until a new packet loss occurs
- However, due to the noise, the TCP window will not increase
- DB transactions by the client will last longer
- DB server performance will degrade due to record lock-out, causing frequent aborts for remote transactions
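The effect of recurring loss on the TCP window can be illustrated with a toy model. This is a deliberately simplified sketch, not real TCP: it assumes additive increase of one segment per loss-free round trip, halving on loss, and a deterministic loss every few round trips. The function `aimd_window` and all of its parameters are hypothetical.

```python
def aimd_window(rounds, loss_every, start=1, max_window=64):
    """Trace a simplified AIMD congestion window, one value per round trip.

    Assumptions (not from the slides): +1 segment per loss-free round,
    halving on loss, and a loss in every `loss_every`-th round
    (0 means no loss at all).
    """
    window, trace = start, []
    for r in range(1, rounds + 1):
        if loss_every and r % loss_every == 0:
            window = max(1, window // 2)           # multiplicative decrease
        else:
            window = min(max_window, window + 1)   # additive increase
        trace.append(window)
    return trace

clean = aimd_window(rounds=40, loss_every=0)
noisy = aimd_window(rounds=40, loss_every=4)
print("no loss, final window:", clean[-1])
print("loss every 4 rounds, peak window:", max(noisy))
```

In this model the window under periodic loss stays pinned near its starting value, while the loss-free window keeps growing, which is the behavior the slide describes.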
8. Illustrative scenario
- Three important points
- Problems propagate among related objects and are possibly amplified by various protocol mechanisms
- A single problem can cause numerous observable events in multiple domains
- Some problems are not observable where they originate
- The WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise → WAN2 may be unaware that there is a problem
- Challenges
- Determine events to monitor and ways to analyze them
- Operations staff must have knowledge of the operational parameters of managed objects and the significance of their events
- Correlation of events and coordination among different domains
- Automating the management activities (manual processing does not scale)
9. Modeling the Scenario
- Partition the system into multiple management domains (e.g., enterprise domain, ED, and router domain, RD)
- Each domain has a domain manager (DM) to monitor, correlate and handle its events
- A DM may subscribe to receive notifications from other domains
- ED sees the RD as a single entity connecting LAN1 and LAN2
10. Modeling the Scenario
Detects only IP header corruption
- Any problem in the connection is seen as an RD problem
- Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains
- Example: packet loss, observed as degraded TCP performance, is detected by the ED, not by the RD.
- This symptom is received by the RD and can be correlated along with other observable symptoms to isolate the clock problem.
11. Automating Event Management
- An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events.
- Static information is associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
- Dynamic information reflects addition, removal, and upgrades of network devices, etc.
- The process of automation is that of developing correlation algorithms to analyze observable events
- Correlation algorithms must be
- Scalable to large networks involving complex systems
- Able to handle a large number of symptoms caused by a single problem
- Fast (real-time correlation)
- Robust: loss of a single alarm or generation of a spurious event should not affect the decision → insensitive or resilient to noise
12. Problems and Symptoms
- A problem is an event that can be handled directly, e.g., a faulty interface
- Some problems are directly observable; others can be observed only indirectly, through their symptoms
- Symptoms are observable events
- Degraded application performance is a symptom of a faulty interface
- Symptoms cannot be handled directly: symptoms persist unless the problem is resolved
- Problems and symptoms propagate from one object to another
- Noise in WAN → bit errors in link C-D → loss of packets at routers → poor TCP performance → frequent transaction aborts in the DB server
13. Event Correlation System
- Monitors typically collect managed data at network elements and detect out-of-tolerance conditions, generating appropriate alarms.
- The correlator uses an event model to analyze these alarms.
- The event model represents knowledge of various events and their causal relationships
- The event model depends on the knowledge of human experts
- The correlator determines the common problems that caused the observed alarms.
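The monitor's role, detecting out-of-tolerance conditions and raising alarms for the correlator, might look like the following sketch. The variable names and threshold values are made up for illustration, and `out_of_tolerance` is a hypothetical helper rather than part of any management tool.

```python
def out_of_tolerance(samples, thresholds):
    """Yield (variable, value) alarms for samples above their threshold.

    Variables without a configured threshold are ignored.
    """
    for var, value in samples.items():
        limit = thresholds.get(var)
        if limit is not None and value > limit:
            yield var, value

# Hypothetical thresholds and one polling sample from a network element.
THRESHOLDS = {"ifInErrors_per_s": 50, "cpu_util_pct": 90}
samples = {"ifInErrors_per_s": 120, "cpu_util_pct": 35, "ifOutDiscards_per_s": 3}

alarms = list(out_of_tolerance(samples, THRESHOLDS))
print(alarms)
```

The resulting alarm list is what a correlator would take as its input.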
14. Event Knowledge
- The modeler's event knowledge contains the following information for each class of managed objects
- The data attributes of objects of this class (e.g., MIB variables).
- The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or that are reported by asynchronous event notifications.
- The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related.
- The problems that can originate within instances of this class.
- The relationships in which an instance of the class can be involved.
- The events and/or problems that are exported by instances of the class.
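The items listed above map naturally onto a record type with one field per item. The sketch below is one possible layout, not the modeling language of any particular product; the class `ManagedClassModel`, its field names and the sample T3 interface entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ManagedClassModel:
    """Event knowledge for one class of managed objects (illustrative only)."""
    name: str
    attributes: list = field(default_factory=list)         # e.g., MIB variables
    observable_events: list = field(default_factory=list)  # events seen in instances
    problems: list = field(default_factory=list)           # problems originating here
    caused_events: dict = field(default_factory=dict)      # problem -> events it causes
    relationships: list = field(default_factory=list)      # links to other classes
    exported: list = field(default_factory=list)           # visible to other objects

# Hypothetical entry for the interface class in the scenario.
t3_interface = ManagedClassModel(
    name="T3Interface",
    attributes=["ifInErrors"],
    observable_events=["error_rate_above_threshold"],
    problems=["clock_sync_loss"],
    caused_events={
        "clock_sync_loss": ["error_rate_above_threshold", "bit_errors_on_link"],
    },
    relationships=["connected_to:Link"],
    exported=["bit_errors_on_link"],
)
print(t3_interface.caused_events["clock_sync_loss"])
```

Keeping `caused_events` keyed by problem makes it straightforward to derive a code for each problem later on.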
15. Coding Approach for Event Correlation
- Treat the complete set of events caused by a problem as a code that identifies the problem
- Correlation is the process of decoding the set of observed symptoms
- Determine which problem has these symptoms as its code
- Note: traditionally, alarms are correlated through searches over the event model knowledge base
- Complexity of search limits scalability
- The event model is a large database, and the set of received alarms or symptoms may also be quite large
16. Coding Approach for Event Correlation
- Two phases
- Codebook selection phase
- Select a subset of events for monitoring: the codebook
- The codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
- Ensure a desired level of noise tolerance
- Algorithms must decode or infer the problem in the presence of lost alarms or spurious alarms
- Decoding phase
- Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
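The decoding phase can be sketched as a nearest-code search. The example below assumes that "most closely" means smallest Hamming distance, and it uses a made-up codebook; the problem names, bit patterns and the helper `decode` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(observed, codebook):
    """Return (distance, problems) for the code(s) closest to `observed`."""
    best = min(hamming(observed, code) for code in codebook.values())
    matches = sorted(p for p, code in codebook.items()
                     if hamming(observed, code) == best)
    return best, matches

# Hypothetical codebook: one bit per monitored symptom, values made up.
CODEBOOK = {
    "link_down":    (1, 1, 1, 0, 0, 0),
    "card_failure": (0, 0, 1, 1, 1, 0),
    "clock_drift":  (1, 0, 0, 0, 1, 1),
}

print(decode((1, 1, 1, 0, 0, 0), CODEBOOK))  # all symptoms of link_down observed
print(decode((1, 0, 1, 0, 0, 0), CODEBOOK))  # same problem, one symptom lost
```

Returning every problem at the minimum distance, rather than a single winner, keeps ambiguous cases visible to the caller.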
17. Causality Graph Models
- Correlation is concerned with analysis of causality relations among events
- e → f denotes causality of event f by event e
- Causality is a partial order relation between events
- The relation → can be described by a graph whose nodes represent the events and edges represent causality
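A causality graph can be held as an adjacency map, and the events a problem ultimately causes are then the nodes reachable from it. The sketch below encodes the propagation chain from the illustrative scenario; the event labels and the helper `effects` are names chosen for the example.

```python
# Causality edges e -> f, taken from the propagation chain in the scenario.
CAUSES = {
    "clock_noise": ["bit_errors"],
    "bit_errors": ["packet_loss"],
    "packet_loss": ["poor_tcp_performance"],
    "poor_tcp_performance": ["transaction_aborts"],
}

def effects(event, causes=CAUSES):
    """All events reachable from `event` by following causality edges."""
    seen, stack = set(), [event]
    while stack:
        for nxt in causes.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(effects("clock_noise")))
```

Because causality is transitive, the clock problem reaches the DB server aborts even though no single edge connects them.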
18. Causality Graph Models
- A symptom caused by another symptom does not contribute any information about the problem
Event that is neither a symptom nor a problem.
Causal equivalence
All these indirect symptoms can be eliminated
without loss of information
Correlation graph
19. Correlation
- Information contained in the correlation graph must be converted into codes, one for each problem in the graph.
- A code for a problem p is a vector of 0s and 1s. Each bit corresponds to a symptom in the graph
- Example: the code is of length 3 (3 symptoms) after ordering of the symptoms (e.g., <S3, S6, S9>)
- → the code for p1 is p1 = (1, 0, 1)
- This means p1 causes symptoms S3 and S9
- p2 = (1, 1, 0) and p11 = (1, 0, 1)
Correlation graph
Event correlation is finding problems whose
codes optimally match an observed symptom vector
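The three codes from the example can be written down directly. The sketch below uses the symptom ordering and code values given on the slide; the helper names `caused_symptoms` and `exact_matches` are invented for illustration.

```python
SYMPTOMS = ("S3", "S6", "S9")   # symptom ordering used for the code bits
CODES = {
    "p1":  (1, 0, 1),
    "p2":  (1, 1, 0),
    "p11": (1, 0, 1),
}

def caused_symptoms(problem):
    """Names of the symptoms whose bit is set in the problem's code."""
    return [s for s, bit in zip(SYMPTOMS, CODES[problem]) if bit]

def exact_matches(observed):
    """Problems whose code equals the observed symptom vector."""
    return sorted(p for p, code in CODES.items() if code == tuple(observed))

print(caused_symptoms("p1"))     # symptoms encoded by p1's code
print(exact_matches((1, 1, 0)))  # observing S3 and S6
```

Reading the bits back against the symptom ordering recovers the statement that p1 causes S3 and S9.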
20. Correlation
- What happens when we observe symptoms S3 and S9?
- Both p1 and p11 match the observed vector!
- Clearly we know there is a problem, but we cannot identify it since both problems have identical codes.
- What happens when we observe the symptom vector (0, 1, 0)?
- Two possibilities: (1) a false event (spurious S6), or (2) p2 occurred but one symptom (S3) was lost.
Correlation graph
Interpretation depends on whether loss is more likely than false alarm generation
In case spurious or lost symptoms are unlikely, the information provided by S9 is redundant → (1, 0) and (1, 1) are sufficient to correlate event vectors. The subset of symptoms required to provide the desired level of distinction between problems is called the codebook.
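The redundancy of S9 can be checked mechanically: restricting the example codes to S3 and S6 leaves exactly the same problems distinguishable as before. The sketch below uses the codes from the example; `project` and `indistinguishable` are hypothetical helper names.

```python
SYMPTOMS = ("S3", "S6", "S9")
CODES = {"p1": (1, 0, 1), "p2": (1, 1, 0), "p11": (1, 0, 1)}

def project(codes, symptoms, keep):
    """Restrict every code to the symptoms named in `keep`."""
    idx = [symptoms.index(s) for s in keep]
    return {p: tuple(code[i] for i in idx) for p, code in codes.items()}

def indistinguishable(codes):
    """Groups of problems that share the same code."""
    groups = {}
    for p, code in codes.items():
        groups.setdefault(code, []).append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]

reduced = project(CODES, SYMPTOMS, keep=("S3", "S6"))
print(reduced)
print(indistinguishable(CODES))    # with all three symptoms
print(indistinguishable(reduced))  # with the codebook {S3, S6}
```

Both runs report the same single group (p1 and p11), so monitoring S9 adds no distinguishing power as long as symptoms are neither lost nor spurious.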
21. Correlation - example
- The codebook contains only three symptoms
- The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom
A loss or spurious generation of S4 will result in a decoding error
Distinction between problems is measured by the Hamming distance between their codes
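The noise tolerance of a codebook follows from the smallest Hamming distance between any two of its codes: with minimum distance d, up to (d - 1) // 2 symptom errors can be decoded uniquely. The sketch below computes this for two made-up codebooks; the problem names and bit patterns are hypothetical and are not the ones in the slide's figure.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(codebook):
    """Smallest Hamming distance between any two codes in the codebook."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))

# Hypothetical codebooks; problem names and bit patterns are made up.
WEAK = {"pA": (1, 0, 0), "pB": (1, 1, 0), "pC": (0, 1, 1)}
STRONG = {
    "pA": (1, 1, 1, 0, 0, 0),
    "pB": (0, 0, 1, 1, 1, 0),
    "pC": (1, 0, 0, 0, 1, 1),
}

for name, book in (("weak", WEAK), ("strong", STRONG)):
    d = min_distance(book)
    print(name, "min distance:", d, "| uniquely correctable errors:", (d - 1) // 2)
```

A codebook with minimum distance 1 still tells the problems apart, but a single lost or spurious symptom can turn one code into another; a larger codebook buys tolerance at the cost of monitoring more events.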
22. Correlation - example
Event vectors 011100, 101100, 110100, 111000 will be decoded as P1 with a single symptom loss, and 111110, 111101 will be interpreted as P1 with a single spurious symptom
When two symptom errors occur, the decoder will detect the error but cannot correctly (uniquely) decode the event (e.g., P1 and P4)
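The listed vectors can be reproduced by flipping one bit of P1's code at a time. The code 111100 used below is inferred from those vectors, since the text does not print it; the helper names are hypothetical.

```python
# Assumption: P1's code, inferred from the example vectors in the text.
P1_CODE = "111100"

def single_loss(code):
    """Vectors obtained by clearing one set bit (one symptom lost)."""
    return [code[:i] + "0" + code[i + 1:] for i, b in enumerate(code) if b == "1"]

def single_spurious(code):
    """Vectors obtained by setting one clear bit (one spurious symptom)."""
    return [code[:i] + "1" + code[i + 1:] for i, b in enumerate(code) if b == "0"]

print(single_loss(P1_CODE))
print(single_spurious(P1_CODE))
```

Every vector produced is at Hamming distance 1 from the assumed code, which is why the decoder can still attribute it to P1.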
23. Correlation - Advantages