Title: Facing fault management as it is,
1 Facing fault management as it is, aiming for what you would like it to be
Roy Sterritt, University of Ulster
2 Fault Management Domain
- High-speed telecommunication networks comprise many complex interacting components.
- Faults cause alarm messages in both local components and the rest of the network.
- These alarms are cascaded to the neighbouring multiplexer until they reach the Element Controller (EC).
- Thus the alarm behaviour is complex, apparently chaotic, and difficult to characterise.
3 Fault Management Domain, e.g.
150 runs of the same simple simulated fault:
- lowest: 24 alarm events raised (run 124)
- highest: 31 alarm events (run 22)
- average: 27 alarm events
Therefore there are two real-world concerns:
(1) the sheer volume of alarm event traffic when a fault occurs
(2) the cause, not the symptoms
4 Commercial Fault Management Systems
Monitoring, filtering and masking deal with the first concern (the sheer volume of alarm event traffic).
Yet they do not deal with the second (the cause, not the symptoms):
- they present a reduced set of the symptoms,
- the operator still has to diagnose the fault.
5 Facing fault management as it is (RBS)
- Time-to-market and R&D life-cycle constantly being squeezed
- Market demands for features and functionality increase with each release
- Multi-vendor support is now the rule rather than the exception (heterogeneity)
- Competing to offer sophisticated services, including intelligent F.M., for competitive advantage
-> Impact on fault management
- RBS development
- RBS maintenance burden
-> Automation
6 Aiming for what you would like it to be
- Automated or intelligent fault diagnosis is not achievable from rules alone
- Reluctance to use anything other than RBS in the engine of critical FM systems
- Data mining, one approach to automating learning, is not user-centred
- Transformation from data discoveries to knowledge discoveries requires human interpretation and evaluation
Therefore:
- Computer-aided Human Discovery (visualisation)
- Human-aided Computer Discovery (data mining / knowledge discovery)
7 Intelligent Fault Management
Tier 1: Visualisation Correlation
Rules
Tier 2: KA / RBS Correlation
Tier 3: Data Mining Correlation
Historical Fault Management Data
Fault Management Data
8 Explicit
Implicit
rule portDisabled
when
  ?x Form(alarm=PPI-Unexpl_Signal port=?p)
  ?y Form(alarm=LP-PLM port=?p)
then
  retract ?x
  retract ?y
  assert Form(JigAlm-portDisabled, port=?p)
Tacit
Unknown
9 Example: Simple Injected Fault
as recorded as user actions in the Event Log,
i.e. on Enfield, slot 2 ports 1-8 disconnected then connected
10 Event Log - Data Mountain
For this simple test with 16 commands, the breakdown of the event log is:
No. of lines: 10,100
No. of event records: 758
No. of alarm records: 476
No. of login records: 106
No. of user action records: 16
No. of message tool records: 159
No. of system error records: 1
11 Event Log - Data Mountain
Event -----
09/11/1998 14:28:38
Slip: 8901
Type: TN-1X
Path: /bireh708/TN-1X/Acton/S9-13
Event Type: PPI-AIS
User Label: S9-13
NE Time: 09/11/1998 14:28:38
Alarm present
Sev: Minor
NE ID: 5002
Alarm ID: 1557
12 Example: Simple Auto-test
on Enfield, slot 2 ports 1-8 disconnected then connected
13 Example: Simple Auto-test
PPI-AIS - An AIS has been detected in the incoming 2 Mbit/s or 34 Mbit/s traffic. Note: if the signal is unstructured (e.g. does not conform to ITU-T G.732), AIS may be a valid signal.
PPI-Unexp_Signal - A 2 Mbit/s or 34 Mbit/s HDB3 signal has been detected on a tributary which is configured not to expect traffic (no connection).
LP-PLM - The value of the signal label bits in the V5 byte does not correspond with the expected value.
INT-TU-AIS - An AIS has been detected internally in the pointer bytes of the TU.
LP-EXC - The BER of the BIP-2 error check has exceeded the upper threshold (excessive BER).
14 Example: Simple Auto-test
Visual correlation between PPI-Unexp_Signal (Enfield) and LP-PLM (Acton)
15 ILOG JRules
PPI-Unexpl_Signal, LP-PLM, Disconnected Port",
0, 2, 15
rule portDisabled
when
  ?x Form(alarm=PPI-Unexpl_Signal port=?p)
  ?y Form(alarm=LP-PLM port=?p)
then
  retract ?x
  retract ?y
  assert Form(JigAlm-portDisabled, port=?p)
A simplified version to demonstrate the style of an ILOG JRule
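The correlation logic of the portDisabled rule can be sketched outside a rule engine. The Python below is an illustration only, not ILOG JRules code: the `Form` tuple, the `port_disabled` function and the port label `S2-1` are hypothetical names, and the alarm names follow the spelling used in the rule. It retracts the two symptom alarms on a shared port and asserts a single derived alarm in their place.

```python
from typing import NamedTuple, Set


class Form(NamedTuple):
    """One fact in working memory: an alarm name and the port it refers to."""
    alarm: str
    port: str


def port_disabled(working_memory: Set[Form]) -> Set[Form]:
    """Apply the portDisabled correlation to every port; return a new fact set."""
    result = set(working_memory)
    for port in sorted({fact.port for fact in working_memory}):
        x = Form("PPI-Unexpl_Signal", port)
        y = Form("LP-PLM", port)
        if x in result and y in result:
            # retract ?x, retract ?y, then assert the derived alarm
            result -= {x, y}
            result.add(Form("JigAlm-portDisabled", port))
    return result


# Hypothetical working memory: two correlated symptoms plus one unrelated alarm.
memory = {
    Form("PPI-Unexpl_Signal", "S2-1"),
    Form("LP-PLM", "S2-1"),
    Form("PPI-AIS", "S9-13"),
}
after = port_disabled(memory)
print(sorted(after))
```

Alarms that match no rule, such as the PPI-AIS fact here, stay in working memory untouched, which is why rules alone leave part of the diagnosis to the operator.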
16 Inducing a BBN: PowerConstructor
- BBN: structure (variables = nodes, dependencies = directed arcs) and parameters (prior and conditional probabilities)
- PowerConstructor (J. Cheng, D.A. Bell and W. Liu, 1997) uses a three-phase approach based on Chow and Liu (1968)
- 1st phase - drafting - uses the Chow-Liu algorithm to identify strong dependencies between variables by calculating mutual information.
- 2nd phase - thickening - performs conditional independence (CI) tests on pairs of nodes that were not included in the first phase.
- 3rd phase - thinning - then performs further CI tests to ensure that all edges that have been added are necessary.
- This three-phase approach manages to keep to one CI test per decision on an edge throughout each phase and as such has a favourable time complexity of O(N^2).
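The drafting idea can be illustrated with a short sketch: estimate the mutual information of every pair of variables, then keep the strongest pairs that do not close a loop (a maximum-weight spanning tree, or a forest when some pairs carry no information). This is a minimal Python sketch of the Chow-Liu step only, not PowerConstructor's implementation; the function names, the `min_mi` cut-off and the eight observations per alarm are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits from two equal-length samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum of p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def chow_liu_edges(data, min_mi=0.0):
    """data maps variable name -> list of observations; returns (a, b, mi) edges."""
    scored = sorted(
        ((mutual_information(data[a], data[b]), a, b)
         for a, b in combinations(sorted(data), 2)),
        reverse=True,
    )
    parent = {name: name for name in data}  # union-find forest (Kruskal)

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for mi, a, b in scored:
        if mi <= min_mi:
            break  # every remaining pair is at least as weak
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip pairs that would close a loop
            parent[root_a] = root_b
            edges.append((a, b, mi))
    return edges


# Hypothetical data: 1 = alarm seen in a run, 0 = not seen.
runs = {
    "PPI-Unexp_Signal": [1, 1, 1, 1, 0, 0, 0, 0],
    "LP-PLM":           [1, 1, 1, 1, 0, 0, 0, 0],
    "PPI-AIS":          [1, 0, 1, 0, 1, 0, 1, 0],
}
edges = chow_liu_edges(runs)
print(edges)
```

In this toy data the two alarms that always occur together share one full bit of information and are linked, while PPI-AIS is independent of both and stays unconnected. The thickening and thinning phases, which rely on CI tests, are not shown.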
17 Data Mining Probabilistic Networks
Part of the BBN mined/induced results.
BBN: structure (variables = nodes, dependencies = directed arcs) and parameters (prior and conditional probabilities)
18 BBN used in FMS
after consultation with engineers and standards
[Figure: BBN with nodes PPI_AIS, INT-TU-AIS and Fault; the Fault node lists Faulty TU, Faulty Payload Manager and Unstructured Signal]
19 BBN used in FMS
[Figure: BBN with nodes PPI_Unexp_Sig, LP_PLM and Fault; the Fault node lists Faulty TU and Cable mis-connection]
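To show how such a network turns observed alarms into a ranked diagnosis, here is a minimal Python sketch of inference by enumeration. It assumes the simplest possible structure, a single Fault node with arcs to each alarm; the arc directions and every probability below are invented for illustration and are not taken from the FMS or from the mined results.

```python
# Hypothetical parameters: NOT values from the fault management system.
PRIOR = {"Faulty TU": 0.5, "Cable mis-connection": 0.5}

# Assumed P(alarm present | fault) for each alarm node.
P_ALARM = {
    "PPI_Unexp_Sig": {"Faulty TU": 0.2, "Cable mis-connection": 0.8},
    "LP_PLM": {"Faulty TU": 0.5, "Cable mis-connection": 0.5},
}


def posterior(evidence):
    """Return P(fault | evidence); evidence maps alarm name -> True/False."""
    scores = {}
    for fault, prior in PRIOR.items():
        likelihood = 1.0
        for alarm, present in evidence.items():
            p = P_ALARM[alarm][fault]
            likelihood *= p if present else 1.0 - p
        scores[fault] = prior * likelihood  # unnormalised joint probability
    total = sum(scores.values())
    return {fault: score / total for fault, score in scores.items()}


post = posterior({"PPI_Unexp_Sig": True, "LP_PLM": True})
for fault, probability in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{fault}: {probability:.2f}")
```

With these made-up numbers the cable mis-connection becomes the more probable cause once both alarms are present. The point of the sketch is that the operator sees a ranked list of causes rather than a reduced list of symptoms.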
20 Conclusion
Facing fault management as it is
- Automated / intelligent development discovery
- Open/visible holistic process
- Human and computer discovery
Aiming for what you would like it to be
- Automated / intelligent fault diagnosis
- Open/visible holistic process
- Utilise open graphical AI techniques such as BBN
21 Future Work
- Integration
- Adaptation
22 1999-2002 - Jigsaw Programme, Nortel Networks Belfast Labs and IRTU (Start programme 187)