1Master's Thesis, Mikko Nieminen
Espoo, February 14th, 2006
- TROUBLESHOOTING IN LIVE WCDMA NETWORKS
Supervisor: Professor Heikki Hämmäinen
2Background to the Study
- The number of live WCDMA networks is growing
quickly.
- The first commercial Third Generation Partnership
Project (3GPP) compliant network, J-Phone, was
opened in December 2002.
- By October 2005, there were 80 live commercial
WCDMA networks with nearly 40 million
subscribers. By that time, around 140 licenses
had been awarded for WCDMA, and the WCDMA license
holders had more than 500 million subscribers in
their Second Generation (2G) networks.
- Especially in Europe and Asia, WCDMA network
deployment has, after successful field trials and
service launches, entered a new critical stage:
the phase of network optimisation and network
troubleshooting.
3Research Problem
- As the number of WCDMA subscribers increases
quickly, operators and equipment vendors face
major challenges in maintaining and
troubleshooting their networks.
- How can one efficiently narrow down the root
causes of problems when a live WCDMA network
carries a huge number of subscribers and a large
volume of traffic?
- What are the principles for examining fault
scenarios and dividing the problem investigation
into logical, manageable pieces?
- Which tools and methods are used in practice in
WCDMA network troubleshooting today?
- In order to tackle these questions and
challenges, this Thesis presents a Framework for
KPI-triggered troubleshooting in live WCDMA
networks.
- The applicability of the Framework is
demonstrated by applying it to a selection of
real troubleshooting cases that have occurred in
commercial WCDMA networks.
4Scope of the Study
- This study concentrates on KPI-triggered
problems in live WCDMA networks.
- In general, faults can be classified into three
categories:
  - Critical, which are emergency problems that require immediate action
  - Major, which this study refers to as KPI-triggered problems
  - Minor, which do not affect the services of the network
- The viewpoint is that of the equipment vendor.
The main objective is to create guidelines that
help troubleshooting experts and technical
support personnel of WCDMA network manufacturers
to troubleshoot and narrow problems down
following a defined logic.
- This Thesis mainly concentrates on WCDMA network
troubleshooting from a Radio Access Network
perspective. The reasoning behind this approach
is that the UTRAN covers most of the
WCDMA-specific functionality and intelligence,
and therefore also brings the majority of the
troubleshooting challenges.
5Research Methods
- This Thesis is mainly based on the study of
various technical specifications and on
interviews with WCDMA network troubleshooting
experts.
- The main literature sources are the 3GPP
Release 99 specifications, since the majority of
live WCDMA networks were based on 3GPP Release 99
at the time of writing.
- 3GPP Release 4 networks are currently gaining a
foothold among live WCDMA networks. However,
there are only minor differences in the Radio
Access functionality of these two 3GPP
specification releases.
6Structure of the Thesis
- Introduction to WCDMA Networks
- UTRAN Protocols
- Call Trace Analysis
- Key Performance Indicators
- Framework for KPI-Triggered Troubleshooting
- Cases from Live WCDMA Networks
7WCDMA network architecture
[Figure: WCDMA network architecture]
- External networks: PSTN, Internet
- Core Network: GMSC, GGSN, MSC/VLR, SGSN, HLR, AuC, EIR
- UTRAN: RNCs, each controlling Node Bs that serve cells
- UE: ME and USIM
8UTRAN architecture
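[Figure: UTRAN architecture]
- Uu: between the User Equipment (UE) and the Node B
- Iub: between the Node B and the RNC
- Iur: between two RNCs
- Iu-CS: between the RNC and the 3G MSC in the Core Network (CN)
- Iu-PS: between the RNC and the SGSN in the Core Network (CN)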
9UMTS Bearer Services
10Summary of Protocols (CS user plane)
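[Figure: protocol stacks of the CS user plane, listed from top to bottom]
- UE (Uu): CS application and coding, RLC, MAC, WCDMA L1
- Node B (Uu): WCDMA L1
- Node B (Iub): FP, AAL2, ATM, PDH/SDH
- RNC (Iub): RLC, MAC, FP, AAL2, ATM, PDH/SDH
- RNC (Iu): Iu-UP protocol, AAL2, ATM, PDH/SDH
- MSC (Iu): CS application and coding, Iu-UP protocol, AAL2, ATM, PDH/SDH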
11Summary of Protocols (UE control plane)
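[Figure: protocol stacks of the UE control plane, listed from top to bottom]
- UE (Uu): NAS, RRC, RLC, MAC, WCDMA L1
- Node B (Uu): WCDMA L1
- Node B (Iub): FP, AAL2, ATM, PDH/SDH
- RNC (Iub): RRC, RLC, MAC, FP, AAL2, ATM, PDH/SDH
- RNC (Iu): RANAP, SCCP, MTP3b, SSCF-NNI, SSCOP, AAL5, ATM, PDH/SDH
- CN (Iu): NAS, RANAP, SCCP, MTP3b, SSCF-NNI, SSCOP, AAL5, ATM, PDH/SDH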
12Overview of WCDMA Call Setup
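[Figure: overview of WCDMA call setup]
- MT Call: Paging, RRC Connection Establishment, Radio Access Bearer Establishment, User Plane Data Flow
- MO Call: RRC Connection Establishment, Radio Access Bearer Establishment, User Plane Data Flow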
13RRC connection establishment (DCH)
[Sequence diagram between UE, Node B and RNC]
1. RRC CONNECTION REQUEST
2. Admission Control
3. RADIO LINK SETUP REQUEST
4. Start RX
5. RADIO LINK SETUP RESPONSE
6. ESTABLISH REQUEST
7. ESTABLISH CONFIRM
8. UPLINK DOWNLINK SYNC (FP)
9. Start TX
10. RRC CONNECTION SETUP
11. L1 SYNCH
12. RL RESTORE INDICATION
13. RRC CONNECTION SETUP COMPLETE
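When a trace of a failed setup is analysed, a useful first step is to find the first message of the sequence above that is missing. The following is a minimal sketch in Python. It assumes that the trace has already been decoded into a list of message names; the function name and the input format are hypothetical and do not correspond to the output of any particular protocol analyser. Only the signalling messages are listed, on the assumption that the internal steps (admission control, start RX/TX, L1 synch) are not visible as messages in the trace.

```python
# Signalling messages of the DCH RRC connection establishment, in the
# order given on the slide. Internal steps are deliberately left out.
EXPECTED_MESSAGES = [
    "RRC CONNECTION REQUEST",
    "RADIO LINK SETUP REQUEST",
    "RADIO LINK SETUP RESPONSE",
    "ESTABLISH REQUEST",
    "ESTABLISH CONFIRM",
    "UPLINK DOWNLINK SYNC",
    "RRC CONNECTION SETUP",
    "RL RESTORE INDICATION",
    "RRC CONNECTION SETUP COMPLETE",
]


def first_missing_message(observed):
    """Return the first expected message that is not found, in order,
    in the observed trace, or None if the whole sequence was seen."""
    position = 0
    for expected in EXPECTED_MESSAGES:
        try:
            # Search only after the previously matched message.
            position = observed.index(expected, position) + 1
        except ValueError:
            return expected
    return None


# Hypothetical trace of a setup that stalled on the Iub interface.
trace = [
    "RRC CONNECTION REQUEST",
    "RADIO LINK SETUP REQUEST",
    "RADIO LINK SETUP RESPONSE",
    "ESTABLISH REQUEST",
]
stalled_at = first_missing_message(trace)
```

In this invented trace the function points at ESTABLISH CONFIRM, which would direct the investigation towards the transport bearer setup between the RNC and the Node B.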
14Protocol Analysers
Company | Product | Home Country
Nethawk [47] | 3G Analyser | Finland
Agilent [48] | Signaling Analyzer | United States
Tektronix [49] | K15 | United States
Radcom [50] | Performer Analyser | Israel
Acterna [51] | Telecom Protocol Analyzer | United States
15RRC Connection Events and KPIs
[Figure: RRC connection events between UE, RNC and CN, and the counters they increment]
- Event 1: RRC CONNECTION REQUEST; RRC_CONN_ATT_EST incremented (setup phase)
- Event 2: RRC CONNECTION SETUP; RRC_CONN_ATT_COMP incremented
- Event 3: RRC CONNECTION SETUP COMPLETE; RRC_CONN_ACC_COMP incremented
- Event 4: IU RELEASE COMMAND; RRC_CONN_ACT_COMP incremented
\[ \text{RRC Setup Complete Rate} = \frac{\sum \text{RRC\_CONN\_STP\_COMP}}{\sum \text{RRC\_CONN\_STP\_ATT}} \times 100 \]

\[ \text{RRC Retainability Rate} = \frac{\sum \text{RRC\_CONN\_ACT\_COMP}}{\sum \text{RRC\_CONN\_ACC\_COMP}} \times 100 \]
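The two RRC formulas can be evaluated as in the following minimal Python sketch. It assumes that each counter is available as a list of values, one per measurement period, and that "Sum of" means the sum over those periods. The counter values are invented for illustration, and the helper name `rate` is hypothetical.

```python
def rate(numerator_samples, denominator_samples):
    """Ratio of two counter sums as a percentage.

    Returns None when the denominator sum is zero, because the KPI is
    undefined if no attempts were counted.
    """
    denominator = sum(denominator_samples)
    if denominator == 0:
        return None
    return 100.0 * sum(numerator_samples) / denominator


# Invented counter values for two measurement periods.
counters = {
    "RRC_CONN_STP_ATT": [1000, 1000],
    "RRC_CONN_STP_COMP": [990, 970],
    "RRC_CONN_ACC_COMP": [980, 960],
    "RRC_CONN_ACT_COMP": [930, 913],
}

rrc_setup_complete_rate = rate(
    counters["RRC_CONN_STP_COMP"], counters["RRC_CONN_STP_ATT"]
)
rrc_retainability_rate = rate(
    counters["RRC_CONN_ACT_COMP"], counters["RRC_CONN_ACC_COMP"]
)
```

Summing the counters before dividing, rather than averaging per-period rates, keeps busy periods from being under-weighted.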
16RRC connection Phases
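[Figure: RRC connection phases]
- Setup phase: from Attempts to Setup Complete; otherwise Setup Failures, Blocking
- Access phase: from Setup Complete to Access Complete; otherwise Access Failures
- Active phase: from Access Complete to Active Complete (Success, Active Release); otherwise Active Failures (RRC Drop)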
17Other WCDMA network KPIs
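\[ \text{RAB Setup Complete Rate} = \frac{\sum \text{RAB\_STP\_COMP}}{\sum \text{RAB\_STP\_ATT}} \times 100 \]

\[ \text{RAB Establishment Complete Rate} = \frac{\sum \text{RAB\_ACC\_COMP}}{\sum \text{RAB\_STP\_ATT}} \times 100 \]

\[ \text{RAB Retainability Rate} = \frac{\sum \text{RAB\_ACT\_COMP}}{\sum \text{RAB\_ACC\_COMP}} \times 100 \]

\[ \text{CSSR} = \frac{\sum \text{RAB\_ACC\_COMP}}{\sum \text{RRC\_CONN\_STP\_ATT}} \times 100 \]

\[ \text{CCSR} = \frac{\sum \text{RAB\_ACT\_COMP}}{\sum \text{RRC\_CONN\_STP\_ATT}} \times 100 \]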
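Since all five KPIs on this slide share the same form (one counter sum divided by another, times 100), they can be described in a small table and computed in one pass. The following Python sketch takes this approach. The counter sums are invented, and the names `KPI_DEFINITIONS` and `compute_kpis` are hypothetical.

```python
# KPI name -> (numerator counter, denominator counter), as on the slide.
KPI_DEFINITIONS = {
    "RAB Setup Complete Rate": ("RAB_STP_COMP", "RAB_STP_ATT"),
    "RAB Establishment Complete Rate": ("RAB_ACC_COMP", "RAB_STP_ATT"),
    "RAB Retainability Rate": ("RAB_ACT_COMP", "RAB_ACC_COMP"),
    "CSSR": ("RAB_ACC_COMP", "RRC_CONN_STP_ATT"),
    "CCSR": ("RAB_ACT_COMP", "RRC_CONN_STP_ATT"),
}


def compute_kpis(counter_sums):
    """Compute every KPI in KPI_DEFINITIONS as a percentage.

    A KPI is None when its denominator counter sum is zero.
    """
    kpis = {}
    for name, (numerator, denominator) in KPI_DEFINITIONS.items():
        total = counter_sums[denominator]
        if total:
            kpis[name] = 100.0 * counter_sums[numerator] / total
        else:
            kpis[name] = None
    return kpis


# Invented counter sums for one RNC over one reporting period.
sums = {
    "RRC_CONN_STP_ATT": 1000,
    "RAB_STP_ATT": 1000,
    "RAB_STP_COMP": 990,
    "RAB_ACC_COMP": 980,
    "RAB_ACT_COMP": 931,
}
kpis = compute_kpis(sums)
```

Keeping the definitions in data rather than in code makes it easy to add a vendor-specific KPI without touching the computation.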
18Fault Classification
Fault Class: A-CRITICAL
Description: Total or major outages that are not avoidable with a workaround solution. Critical (emergency duty contacted) problems severely affect service, capacity/traffic, billing, and maintenance capabilities and require immediate corrective action, regardless of time of day or day of the week, as viewed by the operator.
Examples:
- System restart, all links down
- Simultaneous restarts of active computer units
- More than 50 per cent of traffic handling capacity out of use
- Subscriber related network element functionality is not working

Fault Class: B-MAJOR
Description: The problem leads to degradation of network performance or the fault affects traffic randomly. Major problems cause conditions that seriously affect system performance, operation, maintenance, and administration and require immediate attention as viewed by the operator. The urgency is less than in critical situations because of a lesser immediate or impending effect on system performance, customers, and the customer's operation and revenue.
Examples:
- Capacity/quality related functionality is not working as intended
- Problems seriously affecting end user service, but avoidable with a workaround solution
- Configuration changes (network, HW, and SW) are not working as intended
- Subscriber related functions are not working completely
- Performance measurement, alarm management or activation of a new feature fails
- Single restart of computer units

Fault Class: C-MINOR
Description: Minor fault not affecting operation or service quality. Other problems that the operator does not view as critical or major are considered minor. Minor problems do not significantly impair the functioning of the system or affect the service to customers. These problems are tolerable during system use.
Examples:
- Failures not seriously affecting traffic
- Errors in operating command syntax
- Cosmetic errors in operational commands or statistics output
- Minor errors in documentation
19Framework for KPI-Triggered Troubleshooting
- The Framework is designed for investigating and
solving B-MAJOR level, i.e. KPI-triggered,
faults.
- Before applying the Framework:
  - The general alarm status of the network has
  been checked. No clear network alarms pointing
  to the root cause of the fault can be detected.
  - Traces from the external interfaces of the RNC
  have been taken with a protocol analyser in
  order to record the fault scenario. An RNC
  internal trace has also been taken when the
  fault took place.
  - The basic fault scenario has been analysed and
  clarified.
20Is the problem new in the operator network?
[Flowchart continuing from the question above; each decision has a Yes and a No branch]
Decisions:
- New SW, HW, parameters, UE model or feature introduced?
- Perform simulation of the fault in test bed. Does the fault still occur?
- Is the fault operator specific?
- Perform simulation of the fault with reference conditions. Does the fault still occur?
- Has average network load increased significantly and/or does the problem occur at a specific time of day?
Actions:
- Analyse and investigate the differences between the working and faulty conditions.
- Use RNC Performance Tester to generate load in test bed and perform analysis.
- Analyse the traces. Investigate fault scope: Transmission specific, Node B specific, Service specific, RNC specific, CN specific, Country specific, UE specific.
- Analyse network element and interface specific alarms, parameters, capacity, logs and traces. Take specific actions depending on problem scope (refer to detailed Framework notes).
- In case of MVI environment, check IOT results and contact foreign vendor. Investigate own vendor's default parameters and compare implementation against 3GPP specifications. Compare own default parameters with the default parameters of other vendors. Execute air interface protocol analysis and drive tests.
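One way to approach the "Investigate fault scope" step is to check whether the failures concentrate on a single Node B, RNC, service or UE model. The following Python sketch illustrates the idea. It assumes that failure records have already been extracted from the traces into dictionaries. The field names, the sample records and the 80 per cent dominance threshold are all hypothetical and are not taken from the Framework.

```python
from collections import Counter


def dominant_values(failures, dimensions, threshold=0.8):
    """For each dimension, report the value that accounts for at least
    `threshold` of all failures, if such a value exists."""
    findings = {}
    if not failures:
        return findings
    for dimension in dimensions:
        counts = Counter(record[dimension] for record in failures)
        value, count = counts.most_common(1)[0]
        if count / len(failures) >= threshold:
            findings[dimension] = value
    return findings


# Invented failure records: spread over Node Bs and services,
# but all from the same UE model.
failures = [
    {"node_b": "NB-1", "service": "AMR", "ue_model": "model-X"},
    {"node_b": "NB-2", "service": "AMR", "ue_model": "model-X"},
    {"node_b": "NB-3", "service": "PS", "ue_model": "model-X"},
    {"node_b": "NB-4", "service": "AMR", "ue_model": "model-X"},
    {"node_b": "NB-5", "service": "PS", "ue_model": "model-X"},
]
scope = dominant_values(failures, ["node_b", "service", "ue_model"])
```

With these invented records only the UE model dominates, which would suggest a UE specific fault rather than a Node B or service specific one.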
21Case: Increased AMR call drop rate
- A decrease in the RAB Retainability Rate KPI for
the AMR telephony service had been experienced
over the last three months in an operator
network.
- The decrease was around 2 % on each RNC compared
to the time when the network was performing well.
- Actions that had already been taken with no
positive effect:
  - Soft reset of all Node Bs and all RNCs
  - Hard reset and re-commissioning of Node Bs
  - Alarms checked and no major alarms found
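A KPI-triggered fault of this kind can be detected by comparing the current KPI value of each RNC against a baseline period in which the network performed well. The following Python sketch shows one way to do this. The RNC names, the KPI values and the trigger threshold of 1.0 percentage point are invented for illustration; the function name `degraded_rncs` is hypothetical.

```python
def degraded_rncs(baseline, current, threshold=1.0):
    """Return {rnc: drop} for every RNC whose KPI fell by more than
    `threshold` percentage points relative to the baseline."""
    drops = {}
    for rnc, baseline_value in baseline.items():
        drop = baseline_value - current[rnc]
        if drop > threshold:
            drops[rnc] = round(drop, 2)
    return drops


# Invented RAB Retainability Rate values (per cent) for the AMR service.
baseline = {"RNC-1": 99.0, "RNC-2": 98.5, "RNC-3": 99.0}
current = {"RNC-1": 97.0, "RNC-2": 96.5, "RNC-3": 98.75}

triggered = degraded_rncs(baseline, current)
```

A similar drop on several RNCs at once, as in the case described above, points away from a single faulty network element and towards something common to all of them, such as a software release or a parameter setting.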
22Case: Increased AMR call drop rate
I. Is the problem new in the operator network? Yes
II. New SW, HW, parameters, UE model or feature introduced? Yes
III. Perform simulation of the fault in reference conditions. Does the fault still occur? No
IV. Analyse and investigate the differences between the working and faulty conditions.
23Case: Increased AMR call drop rate
- Solution
  - The short-term solution was to change the
  parameter for planned maximum downlink
  transmission power of all the Node Bs in the
  operator network to the default value of 34 dBm.
  After this change, the problem disappeared in
  the operator network.
  - The long-term solution was to include a fix for
  the bug in the next software release of the
  Node B.
24Results
- As a result of the research conducted for this
Thesis, a Framework for KPI-triggered
troubleshooting in live WCDMA networks was
developed.
- The Framework is mainly targeted at WCDMA
network equipment vendors, to help them solve
major service-affecting faults occurring in
today's live WCDMA networks.
- Troubleshooting cases from live WCDMA networks
were solved using the Framework in order to
verify the results and to test the applicability
and practicality of the Framework.
25Assessment of the results
- The applicability and relevance of the
troubleshooting Framework were tested against
three different fault cases from live WCDMA
networks.
- The results were fairly promising, since all the
cases were successfully solved by utilising the
Framework. The Framework was found to be
practical and suitable for solving KPI-triggered
problems in live WCDMA networks.
- However, the Framework was tested with a limited
number of cases because of time and resource
limitations. More extensive testing and
verification with a large number of cases could
lead to optimisations and improvements to the
Framework.
- Still, the basic logic of the Framework was
shown to be reasonably relevant. The results
presented in this study can easily be tested in
the future against a larger number of cases in
order to verify them with better statistical
reliability.
26Exploitation of the results
- The results of this study will be used as source
material for UTRAN troubleshooting competence
development and for creating advanced learning
solutions targeted at troubleshooting experts
and customer support engineers of one of the
leading WCDMA network equipment vendors.
- The results of the Thesis will also be used as
input for customer documentation on UTRAN
troubleshooting.
- There is also an intention to further test the
relevance and reliability of the results of this
Thesis by applying them in the 24/7 RAN technical
support service for operators of the equipment
vendor in question.
27Future Research
- The significance of Performance Indicator (PI)
based troubleshooting is increasing continuously
in live WCDMA networks.
- Once the PI and KPI specifications become more
mature, a more extensive study of the most
relevant Performance Indicators used in WCDMA
network troubleshooting will be essential.
- There is also a need to develop a Framework and
logic for solving emergency problems in WCDMA
networks.
- As the complexity of telecommunication networks
grows, effective and efficient troubleshooting
procedures are essential in order to manage the
diversity of network technologies and the
increasing quality requirements of the
operators.