Multiservice Switch Architecture - PowerPoint PPT Presentation

Slides: 68
Provided by: hche6
Learn more at: https://crystal.uta.edu

Transcript and Presenter's Notes

Title: Multiservice Switch Architecture


1
Multiservice Switch Architecture
2
Scope
  • Discuss only distributed architecture
  • Focus on data path functions

3
Outline
  • Architecture Overview
  • Data Path Processing
  • Data path functions
  • Fast or slow path processing
  • Control and Data Plane partitioning
  • High Availability

4
Architecture Overview: Logic View
Control Module
ATM NIC
Cell / Packet Switch Fabric
IP NIC
IP NIC
5
Architecture Overview: Forwarding Paths, Fast and Slow
Control Module
ATM NIC
Cell / Packet Switch Fabric
IP NIC
IP NIC
6
Architecture Overview: Interfaces
Source: Agilent Technologies
7
Physical View
Source: Network Processors and Coprocessors for
Broadband Network Applications, T. A. Chu, ACORN
Networks
8
Card Level Paths: Fast and Slow
Source: Network Processors and Coprocessors for
Broadband Network Applications, T. A. Chu, ACORN
Networks
9
Architecture Overview: Coprocessors
Source: Network Processor Based Systems Design
Issues and Challenges, I. Jeyasubramanian,
et al., HCL Technologies Limited
10
Architecture Overview: Coprocessors
Source: Network Processor Based Systems Design
Issues and Challenges, I. Jeyasubramanian,
et al., HCL Technologies Limited
11
Architecture Overview: Software Architecture
[Figure: software architecture. Control Plane blocks:
EMS, Policy, Routing, MPLS, FIB. Forwarding/control
interface. Data path blocks: Pre-IP Processing, IP
Header Validation, Classifier, Policer, Action,
IP/MPLS Header Processing, Mapper, Buffer Mgt.,
Scheduler, FE. The blocks are annotated with step
numbers (1) to (10).]
12
Data Path Processing
  • (1) An ingress Ethernet frame from an input port,
    or a frame from the switching fabric, is validated
    and decapsulated.
  • (2) For non-IP frames, such as PPP and IS-IS,
    Pre-IP Processing forwards the frame PDU directly
    to the control card
  • (2) For an MPLS labeled packet that needs to be
    forwarded by label swapping, the label swap table
    is looked up in the Pre-IP Processing module and
    the labeled packet is sent to the IP/MPLS Header
    Processing module for further processing
  • (2) IP packet header information is validated
  • (3) In the Classifier, the firewall/policy-based
    classification and the IP forwarding table lookup
    are performed
  • (4) For DiffServ-based filtering, classified
    packet flows are policed and marked or re-marked
  • (4) For a non-DiffServ router, or a DiffServ
    router in the core, the Policer module may be
    bypassed, and the packet is acted upon based on
    the outcome of the Classifier.

13
Data Path Processing
  • (4) IP-based control protocol packets, e.g.,
    OSPF and RSVP-TE packets, are sent to the control
    card for further processing.
  • (5) The marked packet from the policer is sent
    to the Action module to be rate limited. One or
    multiple thresholds can be used to decide whether
    the packet should be dropped based on the current
    traffic rate and the color of the packet (only
    for DiffServ)
  • (6) The packet is processed including TTL
    update, fragmentation, checksum update, and
    encapsulation
  • (7) The Mapper maps the packet to one of the
    eight output queues based on IP precedence
    subfield, DSCP, or even input interface ID or
    circuit ID the packet came from.
  • (8) The Buffer Manager further sends the packet
    to the appropriate queue.
  • (9) The scheduler schedules the packet out to
    the circuit.
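The Mapper step (7) can be sketched as follows. This is only an illustration, assuming the queue index is taken from the three most significant bits of the DSCP (these coincide with the IP precedence subfield). The function name map_to_queue and the per-interface override table are hypothetical and do not come from the slides.

```python
from typing import Dict, Optional

NUM_QUEUES = 8  # the slide mentions eight output queues


def map_to_queue(tos_byte: int,
                 in_if: Optional[int] = None,
                 if_override: Optional[Dict[int, int]] = None) -> int:
    """Return an output queue index in the range 0..7.

    tos_byte is the IPv4 TOS / DS byte. If the (hypothetical) override
    table has an entry for the input interface, that entry wins.
    """
    if if_override is not None and in_if in if_override:
        return if_override[in_if] % NUM_QUEUES
    dscp = (tos_byte >> 2) & 0x3F   # upper six bits of the DS byte
    return dscp >> 3                # top three DSCP bits = IP precedence


# Usage: a packet marked EF (DSCP 46) and a best-effort packet.
ef_queue = map_to_queue(0xB8)
be_queue = map_to_queue(0x00)
```

A real mapper would typically use a configurable DSCP-to-queue table rather than a fixed bit shift; the shift is used here only to keep the sketch short.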

14
Protocol Stack Overview
15
ISO LLC
  • Logical Link Control is specified in ISO 11802.
    The LLC header consists of three fields: a
    destination SAP address, a source SAP address,
    and a control field. Multiple protocols
    encapsulated over LLC can be distinguished by a
    protocol identifier.
  • ISO provides its scheme for network layer
    protocol identification (NLPID) as specified in
    ISO 9577. ISO assigns an LLC SAP address (0xFE)
    for use of the ISO 9577 NLPID scheme. IEEE 802.1a
    provides its own scheme for network layer
    protocol identification (SNAP). For this purpose,
    ISO assigns an LLC SAP address (0xAA) for the use
    of the IEEE 802.1a SNAP scheme.
  • The LLC encapsulation comes in two different
    formats. One is based on the ISO Network Layer
    Protocol Identifier (NLPID) format and the other
    is based on the IEEE 802.1a SubNetwork Attachment
    Point (SNAP) format, or LLC/SNAP format.

16
ISO LLC
  • For the ISO NLPID format, the LLC header value
    0xFE-FE-03 must be used to identify a routed PDU
    (e.g., PPP, IS-IS, etc.).
  • For the LLC/SNAP format, the LLC header is 3
    octets long and its value is 0xAA-AA-03,
    indicating the presence of a SNAP header. Note:
    the LLC/SNAP format must be used for IP datagram
    encapsulation.
  • The SNAP header consists of a three-octet
    Organizationally Unique Identifier (OUI) and a
    two-octet PID. The SNAP header uniquely
    identifies a routed or bridged protocol. The OUI
    value 0x00-00-00 indicates that the PID is an
    EtherType.
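The two LLC formats can be told apart with a few byte comparisons. The sketch below is a minimal illustration, assuming the buffer starts at the first LLC octet and, for the NLPID format, that the octet right after the LLC header is the NLPID. The function name parse_llc and the returned tuple layout are hypothetical.

```python
import struct
from typing import Optional, Tuple

LLC_SNAP = bytes([0xAA, 0xAA, 0x03])    # LLC header announcing a SNAP header
LLC_NLPID = bytes([0xFE, 0xFE, 0x03])   # LLC header for the ISO NLPID format
OUI_ETHERTYPE = bytes([0x00, 0x00, 0x00])


def parse_llc(frame: bytes) -> Tuple[str, Optional[bytes], Optional[int], bytes]:
    """Return (kind, oui, protocol_id, payload) for an LLC-encapsulated PDU."""
    if len(frame) < 3:
        raise ValueError("truncated LLC header")
    llc = frame[:3]
    if llc == LLC_SNAP:
        if len(frame) < 8:
            raise ValueError("truncated SNAP header")
        oui = frame[3:6]
        (pid,) = struct.unpack("!H", frame[6:8])   # two-octet PID, network order
        kind = "ethertype" if oui == OUI_ETHERTYPE else "snap-other"
        return kind, oui, pid, frame[8:]
    if llc == LLC_NLPID:
        if len(frame) < 4:
            raise ValueError("missing NLPID octet")
        return "nlpid", None, frame[3], frame[4:]
    return "unknown", None, None, frame[3:]


# Usage: an IPv4 datagram carried in LLC/SNAP (OUI 0x00-00-00, PID 0x0800).
kind, oui, pid, payload = parse_llc(bytes.fromhex("aaaa030000000800") + b"ip-datagram")
```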

17
ISO LLC
  • Examples
  • Note: AppleTalk uses LLC 0xAA-AA-03, OUI
    0x08-00-07, and SNAP PID 0x809B

18
Frame Format
19
Ethernet Frame Format
20
Pre-IP Processing: Ingress
21
Pre-IP Processing: Generic MPLS Label Swapping
22
ATM-LSR Label Swapping
23
IP Header Format
24
TCP Header Format
25
UDP Header Format
26
IP Header Validation
27
Search Key and Filter Rule
28
Packet Classification
29
LER
30
Action Types
  • Accept
  • Discard
  • Reject
  • Routing instance
  • Alert
  • Count
  • Log
  • DSCP set
  • Rate limit

31
IP/MPLS Header Processing: TTL Update
32
MTU Check and Fragmentation
33
Fragmentation at a LSR
34
A Fragmentation Algorithm
  • FO -- Fragment Offset in units of 8 octets
  • IHL -- Internet Header Length in units of
    4 octets
  • DF -- Don't Fragment flag
  • MF -- More Fragments flag
  • TL -- Total Length in octets
  • OFO -- Old Fragment Offset
  • OIHL -- Old Internet Header Length
  • OMF -- Old More Fragments flag
  • OTL -- Old Total Length
  • NFB -- Number of Fragment Blocks (block size: 8
    octets)
  • MTU -- Maximum Transmission Unit in octets

35
A Fragmentation Algorithm
  • IF TL <= MTU
  • THEN submit this datagram to the next step in
    datagram processing
  • ELSE IF DF = 1
  • THEN discard the datagram and may send an ICMP
    Destination Unreachable message (See Section
    6.2.2) back to the source
  • ELSE
  • To produce the first fragment:
  •     i. Copy the original internet header
  •    ii. OIHL <- IHL; OTL <- TL; OFO <- FO; OMF <- MF
  •   iii. NFB <- (MTU - IHL*4)/8
  •    iv. Attach the first NFB*8 data octets
  •     v. Correct the header: MF <- 1;
           TL <- (IHL*4) + (NFB*8); Recompute Checksum
  •    vi. Submit this fragment to the next step in
           datagram processing
  • To produce the second fragment:
  •   vii. Selectively copy the internet header (some
           options are not copied, see Section 6.2.1.4)
  •  viii. Append the remaining data
  •    ix. Correct the header:
           IHL <- (((OIHL*4) - (Length of options not copied)) + 3)/4;
           TL <- OTL - NFB*8 - (OIHL - IHL)*4;
           FO <- OFO + NFB; MF <- OMF; Recompute Checksum
  •     x. Submit this fragment to the
           fragmentation test
  • DONE.
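The procedure can be sketched in a few lines. This is a simplified illustration, not the slides' implementation: it assumes every option is copied into each fragment (so IHL never changes), it loops instead of resubmitting the second fragment to the fragmentation test, and it leaves out checksum recomputation and header serialization. The Datagram class and the fragment function are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Datagram:
    ihl: int      # Internet Header Length, in 4-octet units
    tl: int       # Total Length in octets (header plus data)
    fo: int       # Fragment Offset, in 8-octet units
    df: bool      # Don't Fragment flag
    mf: bool      # More Fragments flag
    data: bytes   # payload octets only (header not included)


def fragment(d: Datagram, mtu: int) -> List[Datagram]:
    """Split d so that every piece fits into mtu octets.

    Returns an empty list when the datagram must be discarded because
    DF is set; the caller may then send ICMP Destination Unreachable.
    """
    fragments: List[Datagram] = []
    while d.tl > mtu:
        if d.df:
            return []
        nfb = (mtu - d.ihl * 4) // 8          # NFB <- (MTU - IHL*4)/8
        if nfb < 1:
            raise ValueError("MTU too small to carry any data")
        cut = nfb * 8
        # First fragment: MF <- 1, TL <- IHL*4 + NFB*8
        fragments.append(replace(d, mf=True, tl=d.ihl * 4 + cut, data=d.data[:cut]))
        # Remainder: TL <- OTL - NFB*8, FO <- OFO + NFB, MF <- OMF
        d = replace(d, tl=d.tl - cut, fo=d.fo + nfb, data=d.data[cut:])
    fragments.append(d)
    return fragments


# Usage: a 3000-octet payload with a 20-octet header over a 1500-octet MTU.
pieces = fragment(Datagram(ihl=5, tl=3020, fo=0, df=False, mf=False, data=bytes(3000)), 1500)
```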

36
Checksum Update
  • HC: old checksum in header
  • HC': new checksum in header
  • M: old value of a 16-bit field
  • M': new value of a 16-bit field
  • Then the algorithm is as follows (one's complement
    arithmetic):
  • IF M - M' = 1
  • HC' = HC - 0xfffe with borrow
  • ELSE
  • HC' = HC - ~M - M' with borrow
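The incremental update rule HC' = HC - ~M - M' in one's complement arithmetic (the form given as Eqn. 4 in RFC 1624) can be sketched as follows. The helper names ones_sub, update_checksum and checksum16 are hypothetical. checksum16 is a full recomputation, included only so that the incremental result can be cross-checked.

```python
from typing import Iterable


def ones_sub(a: int, b: int) -> int:
    """16-bit one's complement subtraction with end-around borrow."""
    r = a - b
    if r < 0:
        r -= 1              # the borrow wraps around
    return r & 0xFFFF


def update_checksum(hc: int, m_old: int, m_new: int) -> int:
    """HC' = HC - ~M - M' for one changed 16-bit header field."""
    return ones_sub(ones_sub(hc, ~m_old & 0xFFFF), m_new)


def checksum16(words: Iterable[int]) -> int:
    """Full header checksum over 16-bit words (checksum field set to 0)."""
    s = sum(words)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)    # fold the carries back in
    return ~s & 0xFFFF


# Usage: decrement the TTL (0x40 -> 0x3F) in the TTL/Protocol word 0x4011.
header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011, 0x0000,
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]
old_hc = checksum16(header)
new_hc = update_checksum(old_hc, 0x4011, 0x3F11)
```

When M - M' = 1, the term ~M + M' is always 0xfffe, so the update collapses to a single subtraction of a constant.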

37
Fast or Slow Paths Forwarding
  • Some gray areas
  • ICMP
  • Options field
  • Packet fragmentation

38
ICMP
39
ICMP
40
ICMP
41
ICMP
  • Different ICMP message types may be handled
    differently
  • Informational ICMP may be handled by the control
    card, e.g.,
  • Timestamp/Timestamp Reply
  • Echo/Echo Reply
  • ICMP relevant to data forwarding may be handled
    by the network processor itself, e.g.,
  • Destination Unreachable
  • Source Quench (obsolete)
  • Redirect
  • Time Exceeded
  • Parameter Problem
  • Rate limiting of ICMP packets sent to the central
    control card should be enforced to prevent ICMP
    DoS attacks
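The split above can be sketched as a lookup table keyed by ICMP type, plus a token bucket in front of the control card. This is an assumption-laden illustration: the names ICMP_HANDLER, TokenBucket and dispatch_icmp are hypothetical, sending unknown types to the control card is a choice made only for the sketch, and time is passed in explicitly so the example is deterministic.

```python
CONTROL_CARD = "control_card"
NETWORK_PROCESSOR = "network_processor"

# ICMP type numbers for the message types named on the slide.
ICMP_HANDLER = {
    0: CONTROL_CARD,        # Echo Reply
    8: CONTROL_CARD,        # Echo
    13: CONTROL_CARD,       # Timestamp
    14: CONTROL_CARD,       # Timestamp Reply
    3: NETWORK_PROCESSOR,   # Destination Unreachable
    4: NETWORK_PROCESSOR,   # Source Quench (obsolete)
    5: NETWORK_PROCESSOR,   # Redirect
    11: NETWORK_PROCESSOR,  # Time Exceeded
    12: NETWORK_PROCESSOR,  # Parameter Problem
}


class TokenBucket:
    """Packet-count token bucket: rate packets per second, depth burst."""

    def __init__(self, rate_pps: float, burst: int) -> None:
        self.rate = rate_pps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def dispatch_icmp(icmp_type: int, limiter: TokenBucket, now: float) -> str:
    """Return where the packet goes, or "drop" if the punt rate is exceeded."""
    target = ICMP_HANDLER.get(icmp_type, CONTROL_CARD)  # assumption for unknown types
    if target == CONTROL_CARD and not limiter.allow(now):
        return "drop"
    return target


# Usage: at most 1 punted ICMP packet per second, burst of 2.
limiter = TokenBucket(rate_pps=1.0, burst=2)
decision = dispatch_icmp(8, limiter, now=0.0)
```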

42
Options Field
  • Packets with IP options need to be handled by
    either the central control card or the local CPU,
    preferably the central control card

43
Fragmentation
  • About 3% of Internet traffic needs fragmentation
  • Slow path forwarding can be problematic
  • An example: for an OC-192 interface (about 10
    Gbps), 3% means the CPU has to handle about 300
    Mbps of traffic!

44
Fragmentation
  • Concept of wire-speed forwarding
  • Assumptions:
  • A network processor working at a 200 MHz clock
    rate, i.e., 5 ns per clock cycle
  • One instruction per clock cycle
  • There are 8 threads working in pipeline
  • Minimum frame size is 60 bytes
  • Line rate: 1 Gigabit per second
  • Per-frame time: 60 x 8 / 1 Gbps = 480 ns
  • Instruction budget: 480/5 = 96 instructions per
    packet
  • Latency budget: 480 x 8 = 3840 ns
  • Wire-speed: so long as the network processor is
    work conserving and the instruction budget is not
    exceeded, wire-speed forwarding is maintained
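The budget arithmetic can be reproduced with a small helper, which also makes it easy to try other line rates or clock speeds. The function name wire_speed_budget is hypothetical; it assumes one instruction per clock cycle, as on the slide.

```python
from typing import Tuple


def wire_speed_budget(clock_hz: int, frame_bytes: int,
                      line_rate_bps: int, threads: int) -> Tuple[float, float, float, float]:
    """Return (cycle_ns, frame_time_ns, instruction_budget, latency_budget_ns)."""
    cycle_ns = 10**9 / clock_hz                               # one clock period
    frame_time_ns = frame_bytes * 8 * 10**9 / line_rate_bps   # time on the wire
    instruction_budget = frame_time_ns / cycle_ns             # 1 instruction per cycle
    latency_budget_ns = frame_time_ns * threads               # pipeline of `threads` stages
    return cycle_ns, frame_time_ns, instruction_budget, latency_budget_ns


# Usage with the slide's assumptions: 200 MHz, 60-byte frames, 1 Gbps, 8 threads.
cycle, frame_time, instr, latency = wire_speed_budget(200_000_000, 60, 1_000_000_000, 8)
```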

45
Fragmentation
  • Traditional perception
  • Fragmentation should not be done by the
    network processor because it consumes too many
    clock cycles or instructions
  • The traditional perception could be wrong, and the
    truth might be:
  • Care needs to be taken with the load and store
    of the IP header information being updated, to
    avoid long latency for packet fragmentation
  • Instruction budget is not an issue because it is
    calculated based on the clock cycles available
    for a minimum-sized packet, and a packet that
    needs fragmentation is much larger than that

46
Function Partitioning
  • Why is it important?
  • Distributed or Centralized?
  • Ideally, local information should be handled by
    local components; however, the need for
    information exchange between components sometimes
    calls for a centralized approach
  • The components involved are mainly the control
    card/central CPU, the NICs, and the local CPUs

47
Function Partitioning
  • Examples
  • Framing at ingress or egress NIC?
  • ARP and/or PPP running on local CPU or central
    CPU?
  • Control plane functions running on local CPU or
    central CPU?

48
Framing at Ingress or Egress
  • Definitions
  • Ingress framing: do the layer 2 framing for the
    outgoing packet at the ingress NIC
  • Egress framing: do the layer 2 framing for the
    outgoing packet at the egress NIC
  • Which one is better?
  • Ingress framing requires globalization of local
    information, e.g., ARP tables, interface MTUs,
    etc., which takes more memory space
  • Egress framing requires more processing on the
    same packet, e.g., another IP forwarding table
    lookup to find the next hop IP address or more
    overhead on carrying next hop IP address from
    ingress to egress
  • Prioritizing ingress processing versus egress
    processing in the network processor may favor one
    solution over the other

49
ARP Scope
  • Within an IP subnet
  • A physical interface may support multiple IP
    subnets

50
ARP
  • Design choices
  • Distributed Solution
  • Run ARPs locally on the local CPUs in NICs
  • Centralized Solution
  • Run ARPs on the central control processor
  • Hybrid solution
  • Run ARPs locally but the ARP tables are
    centralized
  • Impact of different design choices
  • Distributed solution is good when packet framing
    is done at the egress NICs
  • If packet framing is done at the ingress NICs,
    centralized solution may be better
  • Hybrid solution can be a good choice when central
    control processor power is constrained while
    packet framing needs to be done at the ingress
    NICs

51
PPP
  • Two main purposes for using PPP
  • Broadband access, i.e., ADSL
  • Can be distributed or centralized depending on
    how the subscriber management server is connected
    with the router
  • Support for POS framing
  • Local to the POS interface, but centralized
    control is OK

52
Card Redundancy
  • Background
  • The IP Internet was built with high resiliency in
    mind, i.e., it recovers from a link or node
    failure through
  • rerouting, without bounded delay, delay jitter,
    or loss guarantees
  • Packet recovery through transport layer
    retransmission
  • When a node comes back, it takes the following
    steps for a routing domain to become stable
  • Bring up the environment, i.e., RTOS
  • Activate the Processes for routing protocol
    stacks
  • Bring the IP interfaces up
  • Establish neighboring relationships through hello
    protocol

53
Card Redundancy
  • Establish adjacency relationships
  • Bring the database in synch through topology
    information flooding
  • Calculate the shortest path routes and create the
    FIB
  • Download the FIB into each NIC card to facilitate
    packet classification and forwarding
  • A network-wide stable state is reached after all
    the nodes in the routing domain have completed
    the above steps
  • In general, it takes tens of seconds to tens of
    minutes to reach a stable state; during that time
    period, packets can be sent into a transient loop
    or black-holed.

54
Card Redundancy
  • To support mission-critical and voice
    applications, resiliency alone is not sufficient
  • High availability (e.g., five 9s, or 99.999%,
    uptime) becomes an essential requirement for
    router design. Today's ISPs care more about
    availability/reliability than price/performance
  • A primary approach to achieving high availability
    is through card-level redundancy
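To put the five 9s target in perspective, the allowed downtime per year can be computed directly. The sketch assumes a 365-day year; the function name downtime_minutes_per_year is hypothetical.

```python
def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime in minutes per 365-day year for an N-nines target."""
    unavailability = 10.0 ** -nines     # e.g. 5 nines -> 0.00001
    return unavailability * 365 * 24 * 60


# Usage: five 9s leaves a little over five minutes of downtime per year.
five_nines = downtime_minutes_per_year(5)
three_nines = downtime_minutes_per_year(3)
```

Compared with the tens of seconds to tens of minutes that routing convergence can take after a restart, a single unprotected control card failure can consume the whole yearly allowance.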

55
Card Redundancy
One for one redundancy
56
Card Redundancy
  • Assume that most of the control plane functions
    are carried out in the control card and the line
    cards make use of the FIB downloaded from the
    control card to perform data path functions
  • The physical separation of the data path
    functions from the control path functions allows
    headless forwarding: during the switchover phase,
    line cards continue to forward packets based on
    the FIB info downloaded from the control card
    prior to switchover
  • However, headless forwarding alone cannot achieve
    high availability, why?

57
Control Card Redundancy
  • Control card switchover triggers
  • Software upgrade
  • Hardware failure
  • Software failure
  • Software failures can be further broken down into
  • Software bugs (e.g., memory leak, deadlock, etc.)
  • Reception of bad protocol info
  • Software failures can also be classified into two
    failure types
  • Deterministic
  • Non-deterministic (e.g., due to race conditions)

58
Two Solutions
  • Nonstop Routing
  • Hide the failure from all other nodes
  • Nonstop Forwarding
  • Hide the failure from all other nodes except its
    neighbors

59
Nonstop Routing
  • Three levels of redundancy in terms of the backup
    card's readiness to take over
  • Cold standby: after switchover, the backup card
    starts its software from scratch
  • Warm standby: after switchover, the backup card
    has partial information, such as the database and
    partial protocol state info of the active one
  • Hot standby: after switchover, the backup card
    immediately works as if there were no switchover.
60
Nonstop Routing
  • Hot standby is relatively easy to implement for
    line cards and switch fabric
  • It is very difficult to achieve hot standby for
    control cards
  • Why?

61
Nonstop Routing
  • 3 levels to achieve warm to hot standby
  • Database replication, e.g., FIB replication
  • Full state or partial state replication, e.g., a
    repository of neighboring/adjacency state
  • Partial or full runtime environment/operating
    system level redundancy, e.g., running the
    processes in the redundant card in synch with
    their counterparts in the active card

62
Nonstop Routing
  • Level 1 redundancy
  • The easiest to implement, and recommended by some
    routing software vendors, e.g., IP Infusion and
    Data Connection Ltd
  • Can deal with non-deterministic software failures
  • May recover from certain deterministic software
    failures if the backup card runs a different
    version of the protocol software
  • Has significant impact on network performance due
    to slow routing convergence

63
Nonstop Routing
  • Level 2 redundancy
  • Improves over level 1 by duplicating state
    information, in addition to database. It
    therefore improves routing convergence time
  • FIB can still become stale due to system bring-up
    (or bootup) phase and state info retrieval phase

64
Nonstop Routing
  • Level 3 redundancy
  • Offers the highest level availability when
    software upgrade or hardware failure occurs
  • Difficult to achieve since the backup card needs
    to run in lockstep with the primary card, i.e.,
    in hot standby mode
  • Cannot recover from software failures

65
Nonstop Routing
  • In practice
  • Combine the three levels, for example:
  • Full database replication
  • Partial state machine replication
  • Full TCP redundancy at operating system level
  • Vendor solutions
  • Alcatel, Avici, etc. (2002)

66
Nonstop Routing
  • A critical open issue
  • 60% of the failures are due to software
  • The existing solutions cannot deal with software
    failures well
  • Any other solutions?

67
Nonstop Forwarding
  • An example: Hitless Restart
  • During the switchover phase, tell all your
    neighboring nodes to pretend nothing has gone
    wrong, and your neighbors, in turn, tell their
    neighbors that you are doing just fine
  • Can this approach alone provide high
    availability? Why or why not?
  • A combined approach may be helpful