Title: OAM and QoS
1. OAM and QoS
- Presented by
- Yaakov (J) Stein
- Chief Scientist
Unique Access Solutions
2. Service Guarantees
3. Why do we pay for services?
- Generally good (and frequently much better than toll quality) voice service is available free of charge (Skype, Fring, Nimbuzz, ...)
- So why does anyone pay for voice services?
- Similarly, one can get free
  - (WiFi) Internet access
  - email boxes
  - file storage and sharing
  - web hosting
  - software services
- So why pay?
4. Paying for QoS
- The simple answer is that one doesn't pay for the service
  - one pays for Quality of Service guarantees
- In our voice model
- But what does QoS mean
  - and why are we willing to pay for it?
- To explain, we need to review some history
5. Father of the telephone
- Everyone knows that the father of the telephone was
  - Alexander Graham Bell
  - (along with his assistant Mr. Watson)
- But Bell did not invent the telephone network
- Bell and Watson sold pairs of phones to customers
- The father of the telephone network was
  - Theodore Vail
6. Theodore Vail
- Theodore Who?
  - Cousin of Alfred Vail (Morse's coworker)
  - Ex-General Superintendent of US Railway Mail Service
  - First general manager of Bell Telephone
  - Father of the PSTN
- Why is he so important?
  - Organized PSTN
  - Established principle of reinvestment in R&D
  - Established Bell Telephone's IPR division
  - Executed merger with Western Union to form AT&T
  - Solved the main technological problems
    - use of copper wire
    - use of twisted pairs
  - Organized telephony as a service (like the postal service!)
- Vailism is the philosophy that public services should be run as closed centralized monopolies for the public good
7. What's the difference?
- In the Bell-Watson model
  - the customer pays once, but is responsible for
    - installation
    - wires
    - wiring
    - operations
    - power
    - fault repair
    - performance (distortion and noise)
    - infrastructure maintenance
  - while the Bell company is responsible only for
    - providing functioning telephones
- In the Vail model the customer pays a monthly fee
  - but the provider assumes responsibility for everything
  - including fault repair and performance maintenance
  - the telephone company owns the telephone sets and even the wires in the walls!
8. Service Level Agreements
- In order to justify recurring payments
  - the provider agrees to a minimum level of service in an SLA
- SLAs should capture Quality of user Experience (QoE)
  - but this is often hard to quantify
- So SLAs usually detail measurable network parameters
  - that influence QoE, such as
    - availability (e.g., the famous five nines)
    - time to repair (e.g., the famous 50 ms)
    - information rate (throughput)
    - information latency (delay)
    - allowable defect densities (noise/distortion)
- Availability (basic connectivity) always influences QoE
- It is hard to predict the effect of the other parameters on QoE even when there is only one application (e.g., voice)
- When multiple applications are in use
  - it may be impossible
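To make the "five nines" figure concrete, the following minimal sketch converts an availability target into the downtime it permits. The function names and the choice of an average 365.25-day year are assumptions made for illustration; they do not come from the slides.

```python
# Sketch: translate an availability target into permitted downtime.
# Assumption: an average year of 365.25 days is used as the reference period.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def availability_from_nines(nines: int) -> float:
    """Return the availability fraction for a given number of nines (5 -> 0.99999)."""
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) that the availability target permits per period."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    for n in (3, 4, 5):
        a = availability_from_nines(n)
        print(f"{n} nines: {allowed_downtime_minutes(a):.2f} minutes of downtime per year")
```

Under these assumptions five nines leaves only about five minutes of outage per year, which is why fast repair matters so much for availability.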
9. Some Applications
- System traffic
  - routing protocols, DNS, DHCP, time delivery, system update, OAM, tunneling and VPN setup
- Business processes
  - database access, backup and data-center, B2B, ERP
- Communications - interactive
  - voice, video conferencing, telepresence, instant messaging, remote desktop, application sharing
- Communications - non-interactive
  - email, broadcast programming, music
  - video: progressive download, live streaming, interactive
- Information gathering
  - http(s), Web 2.0, file transfer
- Recreational
  - gaming, p2p file transfer
- Malicious
  - DoS, malware injection, illicit information retrieval
10. What do applications need?
- Some applications only require availability
- Some also require minimum available throughput
- Some require end-to-end (or round-trip, RT) delay less than some bound
- Some require packet loss ratio (PLR) less than some percentage
- and these parameters are not necessarily independent
- For example,
  - TCP throughput drops with PLR
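The slides do not say how TCP throughput depends on PLR. One widely quoted rule of thumb is the Mathis approximation, in which steady-state throughput is proportional to MSS / (RTT * sqrt(PLR)). The sketch below uses that approximation purely as an illustration; the function name, the constant sqrt(3/2), and the example numbers are assumptions, not part of the original material.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, plr: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Approximate steady-state TCP throughput in bits per second.

    Mathis approximation: rate ~ (MSS / RTT) * C / sqrt(PLR).
    Only meaningful for small, nonzero loss ratios.
    """
    if not 0.0 < plr <= 1.0:
        raise ValueError("plr must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(plr)


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 50 ms round-trip time.
    for plr in (1e-4, 1e-3, 1e-2):
        mbps = mathis_throughput_bps(1460, 0.050, plr) / 1e6
        print(f"PLR {plr:.0e}: about {mbps:.1f} Mbit/s")
```

In this model a hundredfold increase in PLR costs a factor of ten in throughput, so loss and throughput targets in an SLA cannot be chosen independently.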
11. Some rules of thumb
- Mission Critical (and life critical) applications require
  - high availability
- If there are any MC applications
  - then system traffic requires high availability too
- MC applications do not necessarily require strict throughput
  - but always indirectly require
    - a certain minimal average throughput
    - bounded delay
- If the MC application uses TCP then it requires
  - low PLR
- Real-time applications require
  - sufficient throughput
  - but not necessarily low PLR (audio and video codecs have PLC)
- Interactive applications require
  - low RT delay
  - It may be more scalable for an SP to measure 1-way delays
12. OAM
13. Monitoring an SLA
- The Service Provider's justification for payment
  - is the maintenance of an SLA
- To ensure SLA compliance, the SP must
  - monitor the SLA parameters
  - take action if a parameter is dropping below compliance levels
- But how does the SP verify/ensure that the SLA is being met?
- Monitoring is carried out using
  - Operations, Administration, Maintenance (OAM)
- The customer too may use OAM to see that the SP is compliant!
- Technical note
  - OAM is a user-plane function
  - but may influence control and management plane operations
  - for example
    - OAM may trigger protection switching, but doesn't switch
    - OAM may detect provisioned links, but doesn't provision them
14. Operations, Administration, Maintenance
- Traditionally, one distinguishes between 2 OAM functionalities
- Fault Monitoring
  - OAM runs continuously/periodically at required rate
  - detection and reporting of anomalies, defects, and failures
  - used to trigger mechanisms in the
    - control plane (e.g. protection switching) and
    - management plane (alarms)
  - required for maintenance of basic connectivity (availability)
- Performance Monitoring
  - OAM run
    - before enabling a service
    - on-demand or
    - per schedule
  - measurement of performance criteria (delay, PDV, etc.)
  - required for maintenance of all other QoE attributes
15. Early OAM
- Analog channels and 64 kbps digital channels
  - did not have mechanisms to check signal validity and quality
- Thus
  - major faults could go undetected for long periods of time
  - hard to characterize and localize faults when reported
  - minor defects might be unnoticed indefinitely
- As PDH networks evolved, more and more OAM was added on
  - monitoring for valid signal
  - loopbacks
  - defect reporting
  - alarm indication/inhibition
- The OAM overhead started to explode in size!
- When SONET/SDH was designed
  - bounded overhead was reserved for OAM functions
16. OAM for Packet Switched Networks
- OAM is more complex for Packet Switched Networks
- in addition to the previous defects
  - loss of signal
  - bit errors
- we have new defect types
  - packets may be lost
  - packets may be delayed
  - packets may be delivered to the wrong destination
- The first PSN-like network to acquire OAM was ATM (I.610)
  - Although technically ATM is cell-based, not packet-based
17. Some FM OAM mechanisms (1)
- How do we perform Continuity Check?
  - send OAM packets at a constant known rate
  - if CC packets are not received for >3 intervals then declare a fault
  - see also LB / echo mode
- How do we perform Connectivity Verification?
  - send OAM packets to a known destination
  - if CV packets are received somewhere else then declare a fault
- How do we indicate AIS (FDI)?
  - when forward traffic is not received, send AIS OAM packets
  - if AIS packets are received then declare a fault
- How do we indicate RDI (BDI)?
  - when reverse traffic is not received, send RDI OAM packets
  - if RDI packets are received then declare a fault
  - Note: RDI is often a flag set on the CC message
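The Continuity Check rule above (declare a fault when no CC packet has arrived for more than 3 intervals) can be sketched as a small receive-side monitor. The class and method names are hypothetical, timestamps are passed in explicitly so the logic is easy to test, and the behaviour before the first CC packet arrives is an assumption.

```python
from typing import Optional


class ContinuityMonitor:
    """Receive-side Continuity Check: fault if silent for more than N intervals."""

    def __init__(self, interval_s: float, missed_intervals: float = 3.0) -> None:
        self.interval_s = interval_s
        # Time without CC packets after which a fault is declared.
        self.timeout_s = interval_s * missed_intervals
        self.last_rx: Optional[float] = None

    def on_cc_received(self, now: float) -> None:
        """Record the arrival time of a CC packet from the peer."""
        self.last_rx = now

    def fault(self, now: float) -> bool:
        """Return True if more than the allowed number of intervals passed silently."""
        if self.last_rx is None:
            # Assumption: no fault is declared before the first CC packet is seen.
            return False
        return (now - self.last_rx) > self.timeout_s


if __name__ == "__main__":
    mon = ContinuityMonitor(interval_s=1.0)
    mon.on_cc_received(now=10.0)
    print(mon.fault(now=12.0), mon.fault(now=13.5))
```

A real implementation would drive `fault` from a timer and would typically also raise an alarm or trigger protection switching, which is outside this sketch.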
18. Some FM OAM mechanisms (2)
- How do we use LoopBack?
  - non-intrusive (in-service) (echo mode)
    - send LB request OAM packet to remote site
    - remote site replies with LB reply
    - if LB reply not received then declare a fault
  - intrusive (out-of-service)
    - put remote site into LB mode
    - remote site reflects (and does not forward) all traffic
    - (note that it must monitor OAM traffic)
    - if packets sent are not received then declare a fault
    - note: need to inform next hops of LB by locking
- How do we use LinkTrace?
  - send LB request OAM packet to next hop
  - send LB request to following hop
  - etc.
19. Some PM OAM mechanisms (1)
- How do we measure Packet Loss Ratio?
  - Traffic (counter) based
    - maintain 2 counters
      - number of packets transmitted to peer: Tx
      - number of packets received from peer: Rx
    - send Tx counter to peer at time 1: Tx(1)
    - peer notes its Rx counter at time of reception: Rx(2)
    - and its Tx counter at time of its reply: Tx(3)
    - originator notes its Rx counter when reply is received: Rx(4)
    - calculate PLR in both directions
  - Synthetic
    - do not maintain counters, use OAM packets
    - Note: synthetic loss is only a rough estimate
- How do we measure Throughput?
  - Primitive way (RFC 2544)
    - send packets at maximum rate and observe packet loss
    - reduce rate until no loss is observed
  - Note: there are more sophisticated mechanisms!
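The counter-based loss measurement can be sketched as follows. The slides list the four counter values but not the arithmetic, so this sketch assumes that PLR is computed from the differences between two successive exchanges. The type and function names are hypothetical, and counter wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LossSample:
    """Counter values collected during one request/reply exchange."""
    tx1: int  # originator's Tx counter when the request is sent
    rx2: int  # peer's Rx counter when the request is received
    tx3: int  # peer's Tx counter when the reply is sent
    rx4: int  # originator's Rx counter when the reply is received


def plr_both_directions(prev: LossSample, curr: LossSample) -> Tuple[float, float]:
    """Return (forward PLR, reverse PLR) between two successive samples."""
    sent_fwd = curr.tx1 - prev.tx1   # originator -> peer, packets sent
    recv_fwd = curr.rx2 - prev.rx2   # originator -> peer, packets received
    sent_rev = curr.tx3 - prev.tx3   # peer -> originator, packets sent
    recv_rev = curr.rx4 - prev.rx4   # peer -> originator, packets received
    fwd = (sent_fwd - recv_fwd) / sent_fwd if sent_fwd else 0.0
    rev = (sent_rev - recv_rev) / sent_rev if sent_rev else 0.0
    return fwd, rev


if __name__ == "__main__":
    first = LossSample(tx1=1000, rx2=990, tx3=2000, rx4=1995)
    second = LossSample(tx1=2000, rx2=1980, tx3=3000, rx4=2995)
    print(plr_both_directions(first, second))
```

Working with differences means that packets lost before the first sample do not distort the result for the measurement interval.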
20. Some PM OAM mechanisms (2)
- How do we measure 1-way Packet Delay (Latency)?
  - synchronize clocks at both OAM peers
  - send timestamp T1 to peer
  - peer timestamps receipt with T2
  - calculate time difference T2 - T1
- How do we measure 2-way Packet Delay (Latency)?
  - send timestamp T1 to peer
  - peer timestamps receipt with T2
  - peer replies at T3
  - originator timestamps receipt of reply at T4
  - calculate time difference (T4 - T1) - (T3 - T2)
  - assuming symmetry, 1-way delay is half this amount
  - Note: do not need to synchronize clocks
- How do we measure Packet Delay Variation?
  - send timestamps at a constant rate
  - peer calculates timestamp differences and statistics thereof
  - Note: do not need to synchronize clocks
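The timestamp arithmetic for 2-way delay and delay variation can be sketched as below. Timestamps are integers in microseconds, and the function names are hypothetical. The example deliberately gives the peer clock a large offset to show that it cancels out. Expressing delay variation relative to the minimum observed value is one common choice and an assumption here.

```python
from typing import List, Sequence


def two_way_delay(t1: int, t2: int, t3: int, t4: int) -> int:
    """Round-trip delay excluding the peer's processing time: (T4 - T1) - (T3 - T2)."""
    return (t4 - t1) - (t3 - t2)


def delay_variation(tx_stamps: Sequence[int], rx_stamps: Sequence[int]) -> List[int]:
    """Per-packet delay variation relative to the smallest observed T2 - T1.

    Each difference T2 - T1 contains the unknown clock offset, but the offset
    is the same for every packet and drops out when the minimum is subtracted.
    """
    diffs = [rx - tx for tx, rx in zip(tx_stamps, rx_stamps)]
    base = min(diffs)
    return [d - base for d in diffs]


if __name__ == "__main__":
    offset = 5_000_000                      # peer clock is 5 s ahead (unknown to us)
    t1 = 0
    t2 = t1 + 10_000 + offset               # 10 ms forward delay, peer clock
    t3 = t2 + 2_000                         # 2 ms processing at the peer
    t4 = t1 + 10_000 + 2_000 + 10_000       # 10 ms return delay, originator clock
    rtt = two_way_delay(t1, t2, t3, t4)
    print("2-way delay (us):", rtt, "estimated 1-way delay (us):", rtt // 2)
```

The 1-way estimate is only as good as the symmetry assumption; measuring a true 1-way delay still requires synchronized clocks.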
21. Ethernet OAM
22. What about Ethernet?
- Carrier Ethernet has replaced ATM as the default layer-2
- Ethernet is by far the most widespread network interface
- Ethernet has some advantages as compared to ATM
  - it has network-wide unique addresses
  - it has a source address in every packet
- but some aspects make Ethernet OAM more difficult
  - ConnectionLess (CL)
  - multipoint to multipoint
  - overlapping layering: need OAM for operator, SPs, customer
  - some specific problematic ETH behaviors (flooding, multicast, ...)
23. What's the problem with CL?
- OAM makes a lot of sense in Connection Oriented environments
  - connections last a relatively long amount of time
  - there is some SLA at the connection level
- For CL networks, the network path is neither known nor pinned
- So it doesn't really make sense to talk about FM
  - what does continuity mean if, when a link goes down,
  - the network automatically reroutes around the failure?
- The Ethernet CL problem is solved by overlaying CO functionality
  - flows or
  - EVCs
24. Ethernet OAM
- For many years there was no OAM for Ethernet
  - (LANs don't need OAM)
  - now there are two incompatible ones!
- Link layer OAM: 802.3 clause 57 (EFM OAM, 802.3ah)
  - single link only
  - slow protocol, limited functionality
  - some management functions
- Service OAM: Y.1731, 802.1ag (CFM)
  - any network configuration
  - multilevel OAM functionality
- In some cases one may need to run both
  - while in others only service OAM makes sense
- Link layer OAM is only for a single link, which is necessarily CO
- Service OAM is most frequently used for infrastructure networks,
  - which are also CO
25. Layer 2 control protocols (L2CPs)
- Do not be confused - L2CPs are NOT OAM!
- Here are a few well-known L2CPs

protocol            | DA                                                            | reference
STP/RSTP/MSTP       | 01-80-C2-00-00-00, 802.2 LLC                                  | 802.1D 8,9; 802.1D 17; 802.1Q 13
PAUSE               | 01-80-C2-00-00-01                                             | 802.3 31B (802.3x)
LACP/LAMP           | 01-80-C2-00-00-02, EtherType 88-09, Subtype 01 and 02         | 802.3 43 (ex 802.3ad)
Link OAM            | 01-80-C2-00-00-02, EtherType 88-09, Subtype 03                | 802.3 57 (ex 802.3ah)
ESMC                | 01-80-C2-00-00-02, EtherType 88-09, Subtype 10                | G.8264
Port Authentication | 01-80-C2-00-00-03                                             | 802.1X
E-LMI               | 01-80-C2-00-00-07                                             | MEF-16
Provider MSTP       | 01-80-C2-00-00-08                                             | 802.1D, 802.1ad
Provider MMRP       | 01-80-C2-00-00-0D                                             | 802.1ak
LLDP                | 01-80-C2-00-00-0E, EtherType 88-CC                            | 802.1AB-2009
GARP (GMRP, GVRP)   | Block 01-80-C2-00-00-20 through 01-80-C2-00-00-2F             | 802.1D 10, 11, 12

- Note: IEEE disallows forwarding of L2CPs; MEF allows it under certain circumstances
26. Link Layer OAM (AKA EFM OAM)
- Ethernet in the First Mile (Last Mile?)
- EFM networks are mostly p2p DSL links or p2mp PONs
  - thus a link layer OAM is sufficient for EFM applications
- Since the EFM link is between customer and Service Provider
  - EFM OAM entities are either active (SP) or passive (customer)
  - active entity can place passive one into LB mode but not the reverse
- EFM OAMPDUs are slow protocol frames, never forwarded
  - Ethertype 88-09 and subtype 03
  - messages multicast to slow protocol specific group address
  - OAMPDUs must be sent once per second (heartbeat)
  - messages are TLV-based
27. EFM OAM capabilities
- 6 codes are defined
  - Information (autodiscovery, heartbeat, fault notification)
  - Event notification (statistics reporting)
  - Variable request (active entity queries passive's configuration) (mngt)
  - Variable response (passive entity responds to query) (mngt)
  - Loopback control (active entity enable/disable of intrusive LB mode)
  - Organization specific (proprietary extensions)
- and there are flags in every OAMPDU to
  - expedite notification of critical events
    - link fault (RDI)
    - dying gasp
    - unspecified
  - monitor slow degradations in performance
28. Service OAM (AKA CFM, Y.1731)
- Many SPs need to monitor full networks
  - not just single links
- Service layer OAM provides end-to-end integrity
  - of the Ethernet service over arbitrary server layers
- Because Ethernet is flat
  - not true client-server layering (except MAC-in-MAC)
  - service layer OAM is multilevel
- Because SPs want to replace transport networks with Ethernet
  - service OAM must support all OAM features
  - and must enable advanced transport capabilities
  - (such as linear/ring protection switching)
- a transport network is a network with
  - High availability (Fault Management OAM and Automatic Protection Switching)
  - SLA support (Performance Management OAM and QoS mechanisms)
  - a Management plane (optionally a control plane) for configuration and provisioning
  - Efficiency and Scalability
29. Y.1731 messages
- Y.1731 supports many OAM message types
  - Continuity Check: proactive heartbeat with 7 possible rates
  - Synthetic Loss Measurement: on demand loss rate estimation
  - LoopBack: unicast/multicast pings with optional patterns
  - Link Trace: identify path taken to detect failures and loops
  - AIS: periodically sent when CC fails
  - RDI: flag set to indicate reverse defect
  - Client Signal Fail: sent by MEP when client doesn't support AIS
  - LoCK signal: inform peer entity about diagnostic actions
  - TeST signal: in-service/out-of-service tests for loss rate, etc.
  - Automatic Protection Switching
  - Maintenance Communications Channel: remote maintenance
  - EXPerimental
  - Vendor SPecific
30. Y.1731 frame format
- after DA, SA and Ethertype (8902)
  - Y.1731/802.1ag PDUs have the following header (may be VLAN tagged)
- if there are sequence numbers/timestamp(s)
  - they immediately follow
- then come TLVs, the end TLV, followed by the CRC
- TLVs have 1B type and 2B length fields
  - there may or may not be a value field
  - the end-TLV has type zero and no length or value fields
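The TLV layout described above (1-byte type, 2-byte length, optional value, terminated by an end TLV of type zero with no length field) can be walked with a few lines of code. This is a minimal sketch: the function name is hypothetical, the length field is assumed to be in network (big-endian) byte order, and the input is assumed to start at the first TLV, after the fixed header fields.

```python
import struct
from typing import List, Tuple


def parse_tlvs(buf: bytes) -> List[Tuple[int, bytes]]:
    """Return (type, value) pairs up to, but not including, the end TLV."""
    tlvs: List[Tuple[int, bytes]] = []
    pos = 0
    while pos < len(buf):
        tlv_type = buf[pos]
        if tlv_type == 0:
            # End TLV: a single zero byte with no length or value fields.
            return tlvs
        if pos + 3 > len(buf):
            raise ValueError("truncated TLV header")
        # Assumption: 2-byte length in network byte order, counting value bytes only.
        (length,) = struct.unpack_from("!H", buf, pos + 1)
        start = pos + 3
        end = start + length
        if end > len(buf):
            raise ValueError("TLV value runs past end of buffer")
        tlvs.append((tlv_type, buf[start:end]))
        pos = end
    raise ValueError("missing end TLV")


if __name__ == "__main__":
    # Two made-up TLVs (type 1 with a 2-byte value, type 3 with no value), then the end TLV.
    raw = bytes([1, 0, 2, 0xAA, 0xBB, 3, 0, 0, 0])
    print(parse_tlvs(raw))
```

A zero-length TLV is legal in this layout, which is why the parser distinguishes "no value field" from "missing bytes".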
31. Y.1731 PDU types

opcode         | OAM type   | DA
1              | CCM        | M1 or U
3              | LBM        | M1 or U
2              | LBR        | U
5              | LTM        | M2
4              | LTR        | U
6-31           | RES IEEE   |
32-63 (unused) | RES ITU-T  |
33             | AIS        | M1 or U
35             | LCK        | M1 or U
37             | TST        | M1 or U
39             | Linear APS | M1 or U
40             | Ring APS   | M1 or U
41             | MCC        | M1 or U
43             | LMM        | M1 or U
42             | LMR        | U
45             | 1DM        | M1 or U
47             | DMM        | M1 or U
46             | DMR        | U
49             | EXM        |
48             | EXR        |
51             | VSM        |
50             | VSR        |
52             | CSF        | M1 or U
55             | SLM        | U
54             | SLR        | U
64-255         | RES IEEE   |
32. MEPs and MIPs
- Maintenance Entity (ME): entity that requires maintenance
  - ME is a relationship between ME end points
  - because Ethernet is MP2MP, we need to define an ME Group (MEG)
  - MEGs can be nested, but not overlapped
- MEG LEVEL takes a value 0-7
  - by default: 0,1,2 operator; 3,4 SP; 5,6,7 customer
- MEP = MEG end point (MEG = ME group, ME = Maintenance Entity)
  - (in IEEE, MEG is called MA = Maintenance Association)
  - unique MEG IDs specify to which MEG we send the OAM message
  - MEPs are responsible for OAM messages not leaking out
  - but transparently transfer OAM messages of higher level
- MIPs = MEG Intermediate Points
  - never originate OAM messages,
  - process some OAM messages
  - transparently transfer others
33. MEPs and MIPs (cont.)
34. How is OAM used?
- MEF-30 Service OAM FM and MEF-xx Service OAM PM
  - describe the use of OAM for Carrier Ethernet networks, such as
    - which Y.1731/802.1 features/messages should be used
    - where to put MEPs, what MA and MEG levels and names should be used
    - minimum number of EVCs that must be supported
    - what should be reported and how
- Y.1564 (ex Y.156sam) Ethernet Service Activation Test Methodology
  - describes commissioning procedures (replaces RFC2544-like benchmarking)
  - Tests that desired performance level can be achieved, including
    - CIR, EIR (and optionally CBS and EBS for bursting)
    - traffic policing
    - rate, loss, delay, delay variation, availability (measured simultaneously)
  - Testing in two steps
    - Service Configuration Test: each service separately
    - Service Performance Test: all services together
  - Performance testing may be for
    - 15 minutes (new service on operational network)
    - 2 hours (single operator network)
    - 24 hours (multiple operator networks)
35. QoS enforcement
36. QoS approaches
- There are two approaches to QoS handling
- IntServ (guaranteed QoS)
  - define traffic flows (CO approach)
  - guarantee QoS attributes for each flow
  - reserve resources at each router along the flow
  - signaling protocol (e.g., RSVP) needed
- DiffServ (statistical QoS)
  - retain CL paradigm
  - no guaranteed QoS attributes
  - mark packets (differentiated, e.g., gold, silver, bronze)
  - marking can be by VLAN, P-bits, IP-ToS/DSCP, or general flow
  - offer special treatment (priority) relative to other packets
  - no resource reservation
- For Ethernet and IP, DiffServ is the preferred approach
37. Some fields for marking
- Example
  - For an IPv4 packet inside Q-in-Q Ethernet
  - we have various choices for marking priority
- 802.1p user priority field, AKA P-bits
  - values 0-7
  - priority tagging (VLAN 0) if no VLAN
  - P=0 means non-expedited traffic
  - 802.1Q recommends mappings
- IP ToS
  - RFC 2474 redefined ToS to contain
    - 6 bit DSCP (see also RFC 4594)
    - 2 bit ECN (assigned later, in RFC 3168)
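The two marking fields mentioned here are simple bit fields, and extracting them is a matter of shifts and masks. The sketch below assumes the standard layouts: the IPv4 ToS/DS byte carries the DSCP in its upper 6 bits and ECN in its lower 2 bits, and the 16-bit 802.1Q tag control field carries the P-bits in its upper 3 bits, then 1 drop-eligibility bit, then the 12-bit VLAN ID. The function names are hypothetical.

```python
from typing import Tuple


def split_tos(tos_byte: int) -> Tuple[int, int]:
    """Split the IPv4 ToS/DS byte into (DSCP, ECN)."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return tos_byte >> 2, tos_byte & 0x3


def split_tci(tci: int) -> Tuple[int, int, int]:
    """Split the 16-bit VLAN tag control field into (P-bits, drop eligible, VLAN ID)."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must fit in 16 bits")
    return tci >> 13, (tci >> 12) & 0x1, tci & 0xFFF


if __name__ == "__main__":
    print(split_tos(0xB8))     # ToS byte commonly seen on voice traffic
    print(split_tci(0xA064))   # priority 5 on VLAN 100
```

In a Q-in-Q frame the same `split_tci` logic applies to both the outer (S-tag) and inner (C-tag) control fields, so a classifier has to decide which of the markings to trust.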
38. Queuing
- Ethernet switches have queues (FIFO buffers)
  - on each output port
- If there were only one queue
  - then traffic handling would be FIFO
- To enable DiffServ prioritization
  - multiple queues are used
- Outgoing frames are inserted into queues
  - according to priority marking
- Many methods for emptying queues
- The most popular are
  - Strict Priority
    - always take from nonempty queue
    - of highest priority
  - Weighted Fair Queuing
    - take from nonempty queues according
    - to configured weight
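The two queue-emptying disciplines can be contrasted in a few lines. Strict priority is shown directly. For the weighted case the sketch uses plain weighted round robin, which is a simpler stand-in for Weighted Fair Queuing (real WFQ also accounts for frame lengths); that substitution, the function names, and the frame labels are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterator, List, Optional, Sequence


def strict_priority_pop(queues: Sequence[Deque[str]]) -> Optional[str]:
    """Take one frame from the highest-priority nonempty queue (index 0 is highest)."""
    for q in queues:
        if q:
            return q.popleft()
    return None


def weighted_round_robin(queues: Sequence[Deque[str]],
                         weights: Sequence[int]) -> Iterator[str]:
    """Drain the queues, taking up to weights[i] frames from queue i per round."""
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()


if __name__ == "__main__":
    gold: Deque[str] = deque(["g1", "g2", "g3", "g4"])
    bronze: Deque[str] = deque(["b1", "b2"])
    order: List[str] = list(weighted_round_robin([gold, bronze], [2, 1]))
    print(order)
```

Strict priority can starve low-priority queues under sustained high-priority load, whereas the weighted scheme guarantees every nonempty queue a share of each round.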
39. Traffic shaping
- One of the most important parts of an SLA is the
  - Committed Information Rate (CIR, bps)
  - This is the datarate (bandwidth) the SP guarantees will be forwarded
- There may also be an
  - Excess Information Rate (EIR, bps)
  - This is a datarate that the SP will forward if possible
- Packet traffic is often bursty
  - A customer who did not send data for a while
  - will expect to be able to send a higher rate afterwards
- This is accomplished via traffic shaping
  - time integration is accomplished by leaky/token buckets
  - the effect of shaping is marking drop eligibility
  - (marking a packet on the line is only possible with S-tags!)
- There is often also traffic policing
  - policing simply discards packets to police a maximum rate!
40. MEF token bucket algorithm
- Metro Ethernet Forum 10.x defines a bandwidth profile
  - there are two byte buckets, C of size CBS and E of size EBS (in bytes)
  - tokens are added to the buckets at rate CIR/8 and EIR/8
  - when a bucket overflows, tokens are lost (use it or lose it)
- if ingress frame length < number of tokens in C bucket
  - frame is green and its length in tokens is debited from C bucket
- else
  - if ingress frame length < number of tokens in E bucket
    - frame is yellow and its length in tokens is debited from E bucket
  - else frame is red
- green frames are delivered
  - and service objectives apply
- yellow frames are delivered
  - but service objectives don't apply
- red frames are discarded
- for simplicity we assume
  - no coupling and
  - no sharing!
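The two-bucket coloring rule above translates almost line for line into code. The following is a minimal sketch under the same simplifications (no coupling, no sharing). The class name, the explicit `now` argument, and the choice to start with full buckets are assumptions. The comparison follows the slide's strict "less than" test.

```python
class BandwidthProfile:
    """Two-rate, three-color marker with byte buckets C (CBS) and E (EBS)."""

    def __init__(self, cir_bps: float, cbs_bytes: float,
                 eir_bps: float, ebs_bytes: float) -> None:
        self.cir_bps = cir_bps
        self.cbs = cbs_bytes
        self.eir_bps = eir_bps
        self.ebs = ebs_bytes
        # Assumption: both buckets start full at time zero.
        self.c_tokens = float(cbs_bytes)
        self.e_tokens = float(ebs_bytes)
        self.last_time = 0.0

    def color(self, now: float, frame_len: int) -> str:
        """Refill both buckets up to `now`, then color and debit one ingress frame."""
        elapsed = now - self.last_time
        self.last_time = now
        # Tokens are bytes, so the fill rates are CIR/8 and EIR/8; overflow is lost.
        self.c_tokens = min(self.cbs, self.c_tokens + elapsed * self.cir_bps / 8)
        self.e_tokens = min(self.ebs, self.e_tokens + elapsed * self.eir_bps / 8)
        if frame_len < self.c_tokens:
            self.c_tokens -= frame_len
            return "green"
        if frame_len < self.e_tokens:
            self.e_tokens -= frame_len
            return "yellow"
        return "red"


if __name__ == "__main__":
    bwp = BandwidthProfile(cir_bps=8000, cbs_bytes=2000, eir_bps=8000, ebs_bytes=2000)
    burst = [bwp.color(now=0.0, frame_len=1500) for _ in range(3)]
    later = bwp.color(now=2.0, frame_len=1500)
    print(burst, later)
```

The bucket sizes CBS and EBS set how large a burst is tolerated after an idle period, while CIR and EIR set the long-term rates, which is exactly the "time integration" role attributed to token buckets in the traffic shaping slide.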