Title: Multiservice Switch Architecture
1 Multiservice Switch Architecture
2 Scope
- Discuss only distributed architecture
- Focus on data path functions
3 Outline
- Architecture Overview
- Data Path Processing
- Data path functions
- Fast or slow path processing
- Control and Data Plane partitioning
- High Availability
4 Architecture Overview: Logical View
[Figure: block diagram of a Control Module, an ATM NIC, and two IP NICs interconnected by a Cell/Packet Switch Fabric]
5 Architecture Overview: Forwarding Paths, Fast and Slow
[Figure: the same Control Module, ATM NIC, two IP NICs, and Cell/Packet Switch Fabric, annotated with the fast and slow forwarding paths]
6 Architecture Overview: Interfaces
Source: Agilent Technologies
7 Physical View
Source: Network Processors and Coprocessors for Broadband Network Applications, T. A. Chu, ACORN Networks
8 Card-Level Paths: Fast and Slow
Source: Network Processors and Coprocessors for Broadband Network Applications, T. A. Chu, ACORN Networks
9 Architecture Overview: Coprocessors
Source: Network Processor Based Systems Design Issues and Challenges, I. Jeyasubramanian et al., HCL Technologies Limited
10 Architecture Overview: Coprocessors
Source: Network Processor Based Systems Design Issues and Challenges, I. Jeyasubramanian et al., HCL Technologies Limited
11 Architecture Overview: Software Architecture
[Figure: software architecture. The control plane (EMS, Policy, Routing, MPLS, FIB) sits above a forwarding/control interface; below it, the forwarding engine (FE) chains the data path modules Pre-IP Processing, IP Header Validation, IP/MPLS Header Processing, Classifier, Action, Policer, Mapper, Buffer Mgt., and Scheduler, labeled with the numbered steps (1)-(10) walked through on the next two slides]
12 Data Path Processing
- (1) The ingress Ethernet frame from an input port, or the frame from the switching fabric, is validated and decapsulated.
- (2) For non-IP frames, such as PPP and IS-IS, Pre-IP Processing forwards the frame PDU directly to the control card.
- (2) For an MPLS-labeled packet that needs to be forwarded by label swapping, the label swap table is looked up in the Pre-IP Processing module and the labeled packet is sent to the IP/MPLS Header Processing module for further processing.
- (2) IP packet header information is validated.
- (3) In the Classifier, firewall/policy-based classification and the IP forwarding table lookup are performed.
- (4) For DiffServ-based filtering, classified packet flows are policed and marked/remarked.
- (4) For a non-DiffServ router, or a DiffServ router in the core, the Policer module may be bypassed, and the packet is acted upon based on the outcome of the Classifier.
13 Data Path Processing
- (4) IP-based control protocol packets, e.g., OSPF and RSVP-TE packets, are sent to the control card for further processing.
- (5) The marked packet from the Policer is sent to the Action module to be rate limited. One or multiple thresholds can be used to decide whether the packet should be dropped, based on the current traffic rate and (for DiffServ only) the color of the packet.
- (6) The packet is processed, including TTL update, fragmentation, checksum update, and encapsulation.
- (7) The Mapper maps the packet to one of the eight output queues based on the IP precedence subfield, the DSCP, or even the input interface ID or circuit ID the packet came from.
- (8) The Buffer Manager then sends the packet to the appropriate queue.
- (9) The Scheduler schedules the packet out to the circuit. (A sketch of the whole pipeline follows.)
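To make the walkthrough concrete, here is a minimal C sketch of the per-packet fast path, following steps (1)-(9) above. All names (pkt_t, the module functions, the verdict values) are hypothetical placeholders for the modules in the slide 11 figure, not a vendor API.

```c
typedef struct pkt pkt_t;                 /* opaque packet handle (hypothetical) */

typedef enum { FWD_DROP, FWD_TO_CONTROL, FWD_CONTINUE } verdict_t;

extern verdict_t decap_and_validate(pkt_t *p);   /* (1) frame validation/decap    */
extern verdict_t pre_ip_process(pkt_t *p);       /* (2) non-IP punt, MPLS lookup  */
extern verdict_t classify(pkt_t *p);             /* (3) policy + FIB lookup       */
extern verdict_t police_and_mark(pkt_t *p);      /* (4) DiffServ policer/marker   */
extern verdict_t action_rate_limit(pkt_t *p);    /* (5) threshold-based drops     */
extern void      header_update(pkt_t *p);        /* (6) TTL, frag, cksum, encap   */
extern int       map_to_queue(pkt_t *p);         /* (7) pick 1 of 8 output queues */
extern void      enqueue(pkt_t *p, int q);       /* (8) buffer manager            */
extern void      punt_to_control_card(pkt_t *p);
extern void      drop(pkt_t *p);

void forward(pkt_t *p)
{
    verdict_t v = decap_and_validate(p);
    if (v == FWD_CONTINUE) v = pre_ip_process(p);
    if (v == FWD_CONTINUE) v = classify(p);
    if (v == FWD_CONTINUE) v = police_and_mark(p);  /* may be bypassed in the core */
    if (v == FWD_CONTINUE) v = action_rate_limit(p);

    if (v == FWD_TO_CONTROL) { punt_to_control_card(p); return; }  /* (2), (4) */
    if (v == FWD_DROP)       { drop(p); return; }                  /* (5)      */

    header_update(p);                 /* (6) */
    enqueue(p, map_to_queue(p));      /* (7) + (8); the scheduler (9) drains the queues */
}
```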
14 Protocol Stack Overview
15 ISO LLC
- Logical Link Control (LLC) is specified in ISO/IEC 8802-2. LLC consists of three fields: a destination SAP address, a source SAP address, and a control field. Multiple protocols encapsulated over LLC can be identified by protocol identification.
- ISO provides its scheme for network layer protocol identification (NLPID), as specified in ISO 9577, and assigns an LLC SAP address (0xFE) for use with the ISO 9577 NLPID scheme. IEEE 802.1a provides its own scheme for network layer protocol identification (SNAP); for this purpose, ISO assigns an LLC SAP address (0xAA) for use with the IEEE 802.1a SNAP scheme.
- The LLC encapsulation comes in two different formats: one is based on the ISO NLPID (Network Layer Protocol Identifier) format, and the other is based on the IEEE 802.1a SubNetwork Attachment Point (SNAP) format, also called the LLC/SNAP format.
16 ISO LLC
- The LLC header value 0xFE-FE-03 must be used to identify a routed PDU in the ISO NLPID format (e.g., PPP, IS-IS, etc.).
- The LLC header is 3 octets in length; the value 0xAA-AA-03 indicates the presence of a SNAP header. Note: the LLC/SNAP format must be used for IP datagram encapsulation.
- The SNAP header consists of a three-octet Organizationally Unique Identifier (OUI) and a two-octet PID. The SNAP header uniquely identifies a routed or bridged protocol. The OUI value 0x00-00-00 indicates that the PID is an EtherType. (A parsing sketch follows.)
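As an illustration, here is a minimal C sketch that checks a raw byte buffer for LLC/SNAP-encapsulated routed IPv4, using the values above; the struct layout and function name are this sketch's own, not from any particular stack.

```c
#include <stdint.h>
#include <string.h>

/* LLC/SNAP header as described above: DSAP, SSAP, control, 3-octet OUI, 2-octet PID. */
struct llc_snap {
    uint8_t dsap, ssap, ctrl;   /* 0xAA, 0xAA, 0x03 for LLC/SNAP      */
    uint8_t oui[3];             /* 0x00-00-00 => the PID is an EtherType */
    uint8_t pid[2];             /* e.g., 0x08-00 for IPv4              */
};

/* Returns 1 if the buffer starts with an LLC/SNAP header carrying routed IPv4. */
static int is_llc_snap_ipv4(const uint8_t *buf, size_t len)
{
    struct llc_snap h;
    if (len < sizeof h)
        return 0;
    memcpy(&h, buf, sizeof h);
    return h.dsap == 0xAA && h.ssap == 0xAA && h.ctrl == 0x03 &&
           h.oui[0] == 0x00 && h.oui[1] == 0x00 && h.oui[2] == 0x00 &&
           h.pid[0] == 0x08 && h.pid[1] == 0x00;   /* EtherType 0x0800 */
}
```

The AppleTalk example on the next slide follows the same pattern, with OUI 0x08-00-07 and PID 0x80-9B instead.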
17 ISO LLC
- Examples
- AppleTalk: LLC 0xAA-AA-03, OUI 0x08-00-07, SNAP PID 0x80-9B
18 Frame Format
19 Ethernet Frame Format
20 Pre-IP Processing: Ingress
21 Pre-IP Processing: Generic MPLS Label Swapping
22 ATM-LSR Label Swapping
23 IP Header Format
24 TCP Header Format
25 UDP Header Format
26 IP Header Validation
27 Search Key and Filter Rule
28 Packet Classification
29 LER
30 Action Types
- Accept
- Discard
- Reject
- Routing instance
- Alert
- Count
- Log
- DSCP set
- Rate limit
31 IP/MPLS Header Processing: TTL Update
32 MTU Check and Fragmentation
33 Fragmentation at an LSR
34 A Fragmentation Algorithm
- FO -- Fragment Offset, in units of 8 octets
- IHL -- Internet Header Length, in units of 4 octets
- DF -- Don't Fragment flag
- MF -- More Fragments flag
- TL -- Total Length, in octets
- OFO -- Old Fragment Offset
- OIHL -- Old Internet Header Length
- OMF -- Old More Fragments flag
- OTL -- Old Total Length
- NFB -- Number of Fragment Blocks (block size 8 octets)
- MTU -- Maximum Transmission Unit, in octets
35 A Fragmentation Algorithm
- IF TL <= MTU
- THEN submit this datagram to the next step in datagram processing
- ELSE IF DF = 1
- THEN discard the datagram, optionally sending an ICMP Destination Unreachable message (see Section 6.2.2) back to the source
- ELSE
- To produce the first fragment:
  i.    Copy the original internet header
  ii.   OIHL <- IHL; OTL <- TL; OFO <- FO; OMF <- MF
  iii.  NFB <- (MTU - IHL*4)/8
  iv.   Attach the first NFB*8 data octets
  v.    Correct the header: MF <- 1; TL <- (IHL*4) + (NFB*8); recompute the checksum
  vi.   Submit this fragment to the next step in datagram processing
- To produce the second fragment:
  vii.  Selectively copy the internet header (some options are not copied, see Section 6.2.1.4)
  viii. Append the remaining data
  ix.   Correct the header: IHL <- ((OIHL*4) - (length of options not copied) + 3)/4; TL <- OTL - (NFB*8) - (OIHL - IHL)*4; FO <- OFO + NFB; MF <- OMF; recompute the checksum
  x.    Submit this fragment to the fragmentation test
- DONE. (A C sketch of the fragment geometry follows.)
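For concreteness, a minimal C sketch of the fragment geometry computed by steps iii, v, and ix, under the simplifying assumption of a header with no options (so OIHL = IHL and nothing is dropped from the copied header); the names are this sketch's own.

```c
#include <stdint.h>

/* Geometry of a two-way split, mirroring steps iii, v, and ix above.
 * Lengths are in octets; fragment offsets are in 8-octet units, as in the IP header.
 * The first fragment carries MF = 1; the second inherits the original MF. */
struct frag_geom {
    uint16_t first_tl;    /* total length of the first fragment   */
    uint16_t second_tl;   /* total length of the second fragment  */
    uint16_t second_fo;   /* fragment offset of the second fragment */
};

/* Returns 1 and fills *g if fragmentation is needed, 0 otherwise. */
static int split(uint16_t tl, uint8_t ihl, uint16_t fo, uint16_t mtu,
                 struct frag_geom *g)
{
    if (tl <= mtu)
        return 0;                          /* fits: no fragmentation needed  */
    uint16_t nfb = (mtu - ihl * 4) / 8;    /* iii: number of 8-octet blocks  */
    g->first_tl  = ihl * 4 + nfb * 8;      /* v                              */
    g->second_tl = tl - nfb * 8;           /* ix, with no options dropped    */
    g->second_fo = fo + nfb;               /* ix                             */
    return 1;
}
```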
36 Checksum Update
- HC  -- old checksum in header
- HC' -- new checksum in header
- m   -- old value of a 16-bit field
- m'  -- new value of a 16-bit field
- Then the algorithm is as follows:
- IF m' - m = 1
-   HC' = HC + 0xfffe (with end-around carry)
- ELSE
-   HC' = HC - m' + m (with borrow)
(An incremental-update sketch in C follows.)
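For concreteness, a small C sketch of the same update in the equivalent one's-complement form of RFC 1624, HC' = ~(~HC + ~m + m'), which folds the carry/borrow handling above into one's-complement additions; the function name is this sketch's own.

```c
#include <stdint.h>

/* Incremental IP checksum update after changing one 16-bit header field,
 * per RFC 1624: HC' = ~(~HC + ~m + m'), with end-around carry folding. */
static uint16_t cksum_update(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
    uint32_t sum = (uint16_t)~hc;          /* ~HC */
    sum += (uint16_t)~m_old;               /* ~m  */
    sum += m_new;                          /* m'  */
    sum = (sum & 0xffff) + (sum >> 16);    /* fold the carries back in */
    sum = (sum & 0xffff) + (sum >> 16);    /* a second fold may be needed */
    return (uint16_t)~sum;
}
```

For a TTL decrement, m is the 16-bit header word holding TTL and Protocol, so m' = m - 0x0100 and the checksum increases by 0x0100, matching the classic RFC 1141 example.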
37 Fast or Slow Path Forwarding
- Some gray areas
- ICMP
- Options field
- Packet fragmentation
38 ICMP
39 ICMP
40 ICMP
41 ICMP
- May have different handling for different ICMP message types
- Informational ICMP may be handled by the control card, e.g.,
- Timestamp/Timestamp Reply
- Echo/Echo Reply
- ICMP relevant to data forwarding may be handled by the network processor itself, e.g.,
- Destination Unreachable
- Source Quench (obsolete)
- Redirect
- Time Exceeded
- Parameter Problem
- Rate limiting of ICMP packets sent to the central control card should be enforced to prevent an ICMP DoS (a token-bucket sketch follows)
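One common way to enforce such a limit is a token bucket on the punt path to the control card; here is a minimal sketch, with field and function names that are this sketch's own assumptions, not from any vendor SDK.

```c
#include <stdint.h>

/* Token bucket limiting ICMP punts to the control card:
 * rate_pps tokens are added per second, up to a burst ceiling. */
struct tbucket {
    uint32_t tokens;     /* current tokens (packets)    */
    uint32_t burst;      /* bucket depth (max burst)    */
    uint32_t rate_pps;   /* refill rate, packets/second */
    uint64_t last_ns;    /* timestamp of last refill    */
};

/* Returns 1 if the ICMP packet may be punted, 0 if it must be dropped. */
static int icmp_punt_allowed(struct tbucket *b, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - b->last_ns;
    uint64_t add = elapsed * b->rate_pps / 1000000000ull;
    if (add > 0) {
        b->tokens = (b->tokens + add > b->burst) ? b->burst
                                                 : (uint32_t)(b->tokens + add);
        b->last_ns = now_ns;
    }
    if (b->tokens == 0)
        return 0;        /* over rate: drop rather than flood the control card */
    b->tokens--;
    return 1;
}
```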
42 Options Field
- Needs to be handled by either the central control card or the local CPU, preferably the central control card
43 Fragmentation
- About 3% of Internet traffic needs fragmentation
- Slow path forwarding can be problematic
- An example: for an OC-192 interface, the CPU has to handle 300 Mbps of traffic!
44 Fragmentation
- Concept of wire-speed forwarding
- Assumptions
- A network processor working at a 200 MHz clock rate, i.e., 5 ns per cycle
- One instruction per clock cycle
- There are 8 threads working in a pipeline
- Minimum frame size is 60 bytes
- Line rate is 1 Gigabit per second
- Per-frame time = 60 x 8 / 1 Gbps = 480 ns
- Instruction budget = 480/5 = 96 instructions per packet
- Latency budget = 480 x 8 = 3840 ns
- Wire speed: so long as the network processor is work conserving and the instruction budget is not exceeded, wire-speed forwarding is maintained (the arithmetic is spelled out below)
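The budget arithmetic above, restated as a tiny self-checking C program; the values are the slide's assumptions, and the constant names are this sketch's own.

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz  = 200e6;   /* 200 MHz => 5 ns per cycle  */
    const double line_bps  = 1e9;     /* 1 Gb/s line rate           */
    const int    min_frame = 60;      /* minimum frame size, bytes  */
    const int    threads   = 8;       /* pipelined hardware threads */

    double frame_ns   = min_frame * 8 / line_bps * 1e9;   /* 480 ns  */
    double insns      = frame_ns / (1e9 / clock_hz);      /* 96      */
    double latency_ns = frame_ns * threads;               /* 3840 ns */

    printf("per-frame time    : %.0f ns\n", frame_ns);
    printf("instruction budget: %.0f per packet\n", insns);
    printf("latency budget    : %.0f ns\n", latency_ns);
    return 0;
}
```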
45 Fragmentation
- Traditional perception
- Fragmentation should not be done by the network processor because it consumes too many clock cycles or instructions
- The traditional perception could be wrong; the truth might be
- Care needs to be taken with the load and store of the IP header information being updated, to avoid long latency for packet fragmentation
- The instruction budget is not an issue, because it is calculated from the clock cycles available for a minimum-sized packet, while only larger packets need fragmentation
46 Function Partitioning
- Why is it important?
- Distributed or centralized?
- Ideally, local information should be handled by local components; however, the need for information exchange between components sometimes calls for a centralized approach
- The components mainly involved are the control card/central CPU, the NICs, and the local CPUs
47 Function Partitioning
- Examples
- Framing at the ingress or egress NIC?
- ARP and/or PPP running on the local CPU or the central CPU?
- Control plane functions running on the local CPU or the central CPU?
48 Framing at Ingress or Egress
- Definitions
- Ingress framing: do the layer 2 framing for the outgoing packet at the ingress NIC
- Egress framing: do the layer 2 framing for the outgoing packet at the egress NIC
- Which one is better?
- Ingress framing requires globalization of local information (e.g., ARP tables, interface MTUs, etc.), i.e., more memory space
- Egress framing requires more processing on the same packet, e.g., another IP forwarding table lookup to find the next hop IP address, or more overhead to carry the next hop IP address from ingress to egress
- Prioritizing ingress processing versus egress processing in the network processor may favor one solution over the other
49 ARP Scope
- Within an IP subnet
- A physical interface may support multiple IP subnets
50 ARP
- Design choices
- Distributed solution
- Run ARP locally on the local CPUs in the NICs
- Centralized solution
- Run ARP on the central control processor
- Hybrid solution
- Run ARP locally, but keep the ARP tables centralized
- Impact of the different design choices
- The distributed solution is good when packet framing is done at the egress NICs
- If packet framing is done at the ingress NICs, the centralized solution may be better
- The hybrid solution can be a good choice when central control processor power is constrained while packet framing needs to be done at the ingress NICs
51 PPP
- Two main purposes for using PPP
- Broadband access, e.g., ADSL
- Can be distributed or centralized, depending on how the subscriber management server is connected to the router
- Support for POS framing
- Local to the POS interface, but centralized control is OK
52 Card Redundancy
- Background
- The IP Internet was built with high resiliency in mind, i.e., recovery from a link or node failure through
- rerouting, without bounded delay, delay jitter, or loss guarantees
- packet recovery through transport layer retransmission
- When a node comes back, it takes the following steps for a routing domain to become stable
- Bring up the environment, i.e., the RTOS
- Activate the processes for the routing protocol stacks
- Bring the IP interfaces up
- Establish neighbor relationships through the hello protocol
53 Card Redundancy
- Establish adjacency relationships
- Bring the databases in sync through topology information flooding
- Calculate the shortest path routes and create the FIB
- Download the FIB into each NIC card to facilitate packet classification and forwarding
- A network-wide stable state is reached after all the nodes in the routing domain have completed the above steps
- In general, it takes tens of seconds to tens of minutes to reach a stable state; during that time period, packets can be sent into a transient loop or black-holed.
54 Card Redundancy
- To support mission-critical and voice applications, resiliency alone is not sufficient
- High availability (e.g., five-9s, or 99.999%, uptime) becomes an essential requirement for router design. Today's ISPs care more about availability/reliability than price/performance
- A primary approach to achieving high availability is card-level redundancy
55 Card Redundancy
- One-for-one redundancy
56 Card Redundancy
- Assume that most of the control plane functions are carried out in the control card, and the line cards make use of the FIB downloaded from the control card to perform the data path functions
- The physical separation of the data path functions from the control path functions allows headless forwarding during the switchover phase: line cards continue to forward packets based on the FIB information downloaded from the control card prior to the switchover
- However, headless forwarding alone cannot achieve high availability. Why?
57 Control Card Redundancy
- Control card switchover triggers
- Software upgrade
- Hardware failure
- Software failure
- Software failures can be further broken down into
- Software bugs (e.g., memory leaks, deadlocks, etc.)
- Reception of bad protocol information
- Software failures can also be classified into two failure types
- Deterministic
- Non-deterministic (e.g., due to race conditions)
58 Two Solutions
- Nonstop Routing
- Hide the failure from all other nodes
- Nonstop Forwarding
- Hide the failure from all other nodes except its
neighbors
59 Nonstop Routing
- Three levels of redundancy, in terms of the backup card's readiness to take over
- Cold standby: after switchover, the backup card starts its software from scratch
- Warm standby: after switchover, the backup card has partial information, like the database and partial protocol state info, of the active one
- Hot standby: after switchover, the backup card immediately works as if there were no switchover.
60 Nonstop Routing
- Hot standby is relatively easy to implement for line cards and the switch fabric
- It is very difficult to achieve hot standby for control cards
- Why?
61 Nonstop Routing
- Three levels to achieve warm to hot standby
- Database replication, e.g., FIB replication
- Full or partial state replication, e.g., a repository of neighbor/adjacency state
- Partial or full runtime environment/operating system level redundancy, e.g., running the processes in the redundant card in sync with their counterparts in the active card
62 Nonstop Routing
- Level 1 redundancy
- The easiest to implement, and recommended by some routing software vendors, e.g., IP Infusion and Data Connection Ltd
- Can deal with non-deterministic software failures
- May recover from certain deterministic software failures if the backup card runs a different version of the protocol software
- Has a significant impact on network performance due to slow routing convergence
63 Nonstop Routing
- Level 2 redundancy
- Improves over level 1 by duplicating state information in addition to the database; it therefore improves routing convergence time
- The FIB can still become stale due to the system bring-up (or boot) phase and the state info retrieval phase
64 Nonstop Routing
- Level 3 redundancy
- Offers the highest level of availability when a software upgrade or hardware failure occurs
- Difficult to achieve, since the backup card needs to run in lockstep with the primary card, i.e., hot standby mode
- Cannot recover from software failures
65 Nonstop Routing
- In practice
- Combine the three levels; for example:
- Full database replication
- Partial state machine replication
- Full TCP redundancy at the operating system level
- Vendor solutions
- Alcatel, Avici, etc. (2002)
66 Nonstop Routing
- A critical open issue
- 60% of failures are due to software
- The existing solutions cannot deal with software failures well
- Any other solutions?
67 Nonstop Forwarding
- An example: Hitless Restart
- During the switchover phase, tell all your neighboring nodes to pretend nothing has gone wrong, and your neighbors, in turn, tell their neighbors that you are doing just fine
- Can this approach alone provide high availability? Why or why not?
- A combined approach may be helpful