Design Requirements for Bullet-Proof Packet Passers - PowerPoint PPT Presentation

About This Presentation
Title:

Design Requirements for Bullet-Proof Packet Passers

Description:

Design Requirements for Bullet-Proof Packet Passers Avi Freedman avi_at_freedman.net Chief Technical Officer, Netaxs VP and Chief Network Architect, Akamai – PowerPoint PPT presentation

Slides: 24
Provided by: AviFre5

Transcript and Presenter's Notes

Title: Design Requirements for Bullet-Proof Packet Passers


1
Design Requirements for Bullet-Proof Packet Passers
  • Avi Freedman
  • avi_at_freedman.net
  • Chief Technical Officer, Netaxs
  • VP and Chief Network Architect, Akamai

2
Overview
  • Goals and problems in Good Networking
  • Current and future SLAs
  • Failure analysis
  • Hardware requirements
  • Software requirements
  • Sample architecture: Nortel OPC
  • Open questions

3
Goals for Good Networking
  • The four things that customers seem to want from
    IP networking:
  • Stability
  • Performance
  • Burstability/capacity assurance
  • Price
  • Order varies, but Stability is almost always #1

4
Problems in Good Networking
  • Performance is often a backbone capacity issue, and
    more often a peering/transit issue.
  • Burstability problems come from a lack of large
    aggregation capabilities (no 100 Gb ports to
    connect 1 Gb customers to). This is a soluble
    engineering effort, though, with enough of even
    today's hardware.

5
Problems in Good Networking
  • The biggest problem is stability. Four main
    causes:
  • Operator error
  • Software
  • Fiber cuts
  • Hardware
  • One can argue over ranking, but all are
    important.
  • Fiber is a soluble issue with money and
    engineering.
  • We'll revisit these.

6
Current and Future SLAs
  • Today's SLAs are fairly weak. SLAs of the future
    will trend towards minutes per year of outage,
    with large credits for complete outages.
  • CDNs already offer SLAs that give a 1-day credit
    for a 15-minute slowdown (not even an outage).
  • Today's hardware and software cannot be relied
    upon to pass IP packets reliably enough to meet
    these SLAs.
  • To meet these SLAs, 5 minutes/year of system-wide
    outage is probably all that customers will
    tolerate at some point, and the first network to
    offer it in a vacuum will win huge market share.

7
Failure Analysis - Op Error
  • What causes operator error?
  • Often it's not ignorance, but the fact that doing
    distributed configuration is hard with today's
    tools.
  • Key point: the Cisco "no" method has caused many a
    network outage.
  • GUIs are unwieldy, though.
  • And a Unix OS on routers is a security problem!
  • Industry work on safer GUIs is needed.

8
Failure Analysis - Hardware
  • Hardware is typically less of a problem, but OIR
    often stands for "Online Insert and Reboot."
  • The design needs to be simple, elegant, and
    redundant.
  • Ideally, scalable and expandable as well, but
    simplicity of design is the best assurance of
    stability.

9
Failure Analysis - Software
  • Router software causes literally hundreds of
    outages per year, even (excuse the term)
    "megalapses," inside networks.
  • Most of the problems do NOT relate to protocol
    design, though there are scaling issues to be
    solved there.
  • Most of the problems come from:
  • Bad code
  • Bad OS (OS fails to protect against bad code)

10
Failure Analysis - CPU Protection
  • Additionally, there is a chronic problem in that
    vendors are not providing sufficient protection
    for the route-processing engines, and as denial
    of service attacks get more aggressive, this is a
    growing problem!
  • The industry needs to describe to vendors what
    rules are needed.
  • (Don't allow multicast except for OSPF to
    connected interfaces, etc.)

11
Failure Analysis - Software Modularity
  • In addition to contributing to bad code, the more
    monolithic nature of current router OSs makes it
    hard to avoid downtime while upgrading the
    network.
  • Upgrade-on-the-fly (with a base OS that remains
    unchanged) is an elusive goal, but it is
    achievable: 5ESS and DMS boxes prove it.

12
Sample Architecture - Nortel OPC
  • As a case study, we consider the Nortel OPTera
    Packet Core, which has been designed around
    carrier-class robustness, with feedback from
    industry and telephony-switch engineers.
  • The OPC is a 3-year-old research project that
    went into product mode about a year ago.
    Products are about a year out, so Nortel is
    aggressively seeking input about robustness!

13
OPC Design Requirements
  • The OPC team defined 99.999% as the target
    uptime, and defined uptime as uptime across
    ports. So, 5 minutes of downtime per year across
    all (of up to) 480 ports, or potentially more
    downtime across fewer ports.
  • The team figures on 2 software upgrades/year, and
    splits acceptable failures roughly evenly between
    hardware and software.
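
The arithmetic behind these targets can be sketched as follows. This is a minimal illustration, not Nortel's own accounting: the function names are hypothetical, and it assumes the downtime budget is pooled as port-minutes so that an outage hitting fewer ports may last proportionally longer.

```python
# Downtime budget implied by an availability target (illustrative only).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year of allowed downtime for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def per_port_outage_minutes(availability: float,
                            total_ports: int,
                            ports_affected: int) -> float:
    """Allowed outage length if only `ports_affected` ports go down.

    Assumption: the budget is pooled as port-minutes across all ports.
    """
    port_minutes = downtime_budget_minutes(availability) * total_ports
    return port_minutes / ports_affected


budget = downtime_budget_minutes(0.99999)
print(round(budget, 2))        # 5.26 minutes/year across all ports
print(round(budget / 2, 2))    # 2.63 minutes each for hardware and software
print(round(per_port_outage_minutes(0.99999, 480, 48), 1))  # 52.6
```

Under that pooling assumption, an outage confined to 48 of the 480 ports could last roughly ten times as long as a whole-box outage before the yearly budget is spent.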

14
OPC Hardware Overview
  • The OPC starts with a base 20-slot application
    shelf chassis of port and/or processor cards,
    and fabric slots. The base config can run an
    in-chassis fabric, but is not expandable on the fly.
  • If broken out into an application shelf and a
    fabric shelf, it can be expanded to the full
    480-slot config without downtime or packet loss.

15
OPC Hardware Overview
  • Each slot has (up to) 10 Gb of port capacity,
    and 16 Gb of backplane (14.5 Gb effective after
    overhead).
  • Maximally configured, it is a 4.8 Tb router
    consisting of 24 application shelves in 12 racks,
    16 fabric shelves in 4 bays, and a processor shelf.
  • Shelves can be up to 1 km apart (the entire system
    must be within a 1 km diameter per spec), though
    it's not clear this is a robustness-enhancing
    function until the router can operate partitioned.
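
The capacity figures on this slide are consistent with each other, as a quick check shows. The constants below come from the slides; the function names are made up for illustration.

```python
# Sanity check of the OPC capacity figures quoted on the slides.
SLOTS_PER_SHELF = 20          # base application shelf
APPLICATION_SHELVES = 24      # maximal configuration
PORT_GB_PER_SLOT = 10         # up to 10 Gb of port capacity per slot
BACKPLANE_GB_RAW = 16         # per-slot backplane, before overhead
BACKPLANE_GB_EFFECTIVE = 14.5 # per-slot backplane, after overhead


def total_slots() -> int:
    return SLOTS_PER_SHELF * APPLICATION_SHELVES


def total_port_capacity_tb() -> float:
    return total_slots() * PORT_GB_PER_SLOT / 1000


def backplane_overhead_fraction() -> float:
    return 1 - BACKPLANE_GB_EFFECTIVE / BACKPLANE_GB_RAW


def backplane_speedup() -> float:
    """Effective backplane capacity relative to port capacity, per slot."""
    return BACKPLANE_GB_EFFECTIVE / PORT_GB_PER_SLOT


print(total_slots())                  # 480
print(total_port_capacity_tb())       # 4.8
print(backplane_overhead_fraction())  # 0.09375
print(backplane_speedup())            # 1.45
```

So the 480-slot and 4.8 Tb figures follow directly from 24 shelves of 20 slots, and each slot has about 45% more effective backplane than port capacity.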

16
OPC Fabric
  • The OPC fabric is passive: with each possible
    set of boards, the config is fixed, and no
    software is required to drive or configure the
    fabric.
  • Can be imagined as parallel train tracks, with
    each board being a station, and slightly fewer
    trains shuttling 4 cells of traffic (each cell
    being one of 4 fixed priorities). More
    boards is more stations.

17
OPC Card Architecture
  • Each card has a general-purpose CPU (Motorola
    750), and two packet-processor chips (the RSP2).
  • The RSP2 runs software, mostly microcode,
    scheduling, etc.
  • The RSP2 can do up to 100 instructions on each of
    16 packets in parallel, and then in serial for
    packet modification.
  • For read-only packet processing, within 1% of
    line rate is possible per card. 40-43 byte
    packets are line-rate, 65-70 byte packets yield
    <1% loss, and beyond that is line-rate.

18
OPC - Software
  • The major cause of software-based router failures
    is bad code. Ultimately, better software
    engineering is required.
  • Along the way, sound software architecture and
    protective features are needed.
  • And on-the-fly upgrade-ability.
  • As well as main-CPU-protection.

19
OPC Main-CPU Protection
  • Each board's RSP2s can do packet classification
    inbound or outbound, can throw away packets,
    replicate them (multicast or sniffing), kick them
    up to the main CPU, or send them to another
    port/card.
  • The capability exists as well to shape different
    classes of traffic as part of kicking packets up
    to the main CPU on-card or on another card.
  • The key is the ruleset; input is needed.

20
Main CPU Protection
  • As a general issue, rules should be reflected in
    multiple router vendors.
  • Rules such as:
  • 64k/sec of BGP from an IP, only if we are talking
    to that IP
  • No non-OSPF multicast
  • 10 packets per second to each connected IP
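
The three example rules can be sketched as a small classifier sitting in front of the main CPU. This is a hedged illustration, not OPC microcode: the class and field names are hypothetical, "64k/sec" is assumed to mean 64,000 bits per second, rate limits are modeled as token buckets, only inbound BGP to TCP port 179 is matched, and anything not covered by a rule is dropped (default deny is an assumption).

```python
import ipaddress
from dataclasses import dataclass

TCP_PROTO, OSPF_PROTO, BGP_PORT = 6, 89, 179


@dataclass
class Packet:
    src: str
    dst: str
    proto: int       # IP protocol number
    dport: int = 0   # destination port, 0 if not applicable
    size: int = 64   # bytes


class TokenBucket:
    """Allows `rate` units per second with a burst of `burst` units."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, cost: float, now: float) -> bool:
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class CpuGuard:
    """Decides whether a packet may be kicked up to the main CPU."""

    def __init__(self, bgp_peers, connected_ips):
        self.bgp_peers = set(bgp_peers)
        self.connected = set(connected_ips)
        self.bgp_buckets = {}   # one bucket per BGP peer
        self.dst_buckets = {}   # one bucket per connected IP

    def admit(self, pkt: Packet, now: float) -> bool:
        # Rule: no non-OSPF multicast.
        if ipaddress.ip_address(pkt.dst).is_multicast:
            return pkt.proto == OSPF_PROTO
        # Rule: 64k/sec of BGP from an IP, only if it is a configured peer.
        if pkt.proto == TCP_PROTO and pkt.dport == BGP_PORT:
            if pkt.src not in self.bgp_peers:
                return False
            bucket = self.bgp_buckets.setdefault(
                pkt.src, TokenBucket(64_000, 64_000))
            return bucket.allow(pkt.size * 8, now)
        # Rule: 10 packets per second to each connected IP.
        if pkt.dst in self.connected:
            bucket = self.dst_buckets.setdefault(
                pkt.dst, TokenBucket(10, 10))
            return bucket.allow(1, now)
        return False  # assumed default: drop everything else


guard = CpuGuard(bgp_peers={"10.0.0.2"}, connected_ips={"10.0.0.1"})
print(guard.admit(Packet("10.0.0.2", "224.0.0.5", OSPF_PROTO), 0.0))   # True
print(guard.admit(Packet("192.0.2.9", "10.0.0.1", TCP_PROTO, 179), 0.0))  # False
```

Expressing the rules as data plus per-source and per-destination buckets is one way to make the same ruleset portable across vendors, which is the point the slide argues for.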

21
Nortel OPC - CLI
  • Nortel is soliciting input on robust CLI design
    to reduce operator error.
  • Possibilities include the ability to add comments,
    transactions (commit/rollback), and network-wide
    synchronized update (though this can cause
    instability as well).
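
As one possible shape for the comment and commit/rollback ideas, here is a minimal sketch of a transactional configuration session. It is not a description of any Nortel CLI: the class, its methods, and the config keys are all hypothetical.

```python
import copy


class ConfigSession:
    """Edits go to a candidate config; nothing is live until commit()."""

    def __init__(self, running: dict):
        self.running = copy.deepcopy(running)
        self.candidate = copy.deepcopy(running)
        self.history = []  # previously committed configs, oldest first

    def set(self, key: str, value, comment: str = "") -> None:
        # Each setting carries an operator comment explaining why it exists.
        self.candidate[key] = {"value": value, "comment": comment}

    def delete(self, key: str) -> None:
        self.candidate.pop(key, None)

    def diff(self) -> dict:
        """Pending changes as {key: (running_entry, candidate_entry)}."""
        keys = set(self.running) | set(self.candidate)
        return {k: (self.running.get(k), self.candidate.get(k))
                for k in keys
                if self.running.get(k) != self.candidate.get(k)}

    def discard(self) -> None:
        """Throw away uncommitted edits."""
        self.candidate = copy.deepcopy(self.running)

    def commit(self) -> None:
        """Apply all pending edits as one unit."""
        self.history.append(self.running)
        self.running = copy.deepcopy(self.candidate)

    def rollback(self) -> None:
        """Return to the config that was running before the last commit."""
        if not self.history:
            raise RuntimeError("nothing to roll back")
        self.running = self.history.pop()
        self.candidate = copy.deepcopy(self.running)


session = ConfigSession({"hostname": {"value": "r1", "comment": ""}})
session.set("mtu", 9000, comment="jumbo frames for core links")
print(sorted(session.diff()))       # ['mtu']
session.commit()
session.rollback()
print("mtu" in session.running)     # False
```

Reviewing `diff()` before `commit()` gives the operator a chance to catch a destructive change before it takes effect, which is the kind of operator error the earlier slides describe.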

22
OPC Software Architecture
  • We now talk about the software that runs on the
    main CPUs, and on the main Motorola 750 procs per
    board.
  • Chorus, a multi-threaded, multi-CPU real-time OS,
    as a base. It has memory protection and preemptive
    multitasking.
  • IPC layer (RACE) on top, which handles communication
    between processes, agents, and threads. Among
    other things, RACE allows virtual synchrony:
    running multiple processes in parallel and taking
    the first answer as a result.
  • This allows for easy upgrading of processes, and
    robustness in case of single- or multi-card
    failures.
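
The "take the first answer" behavior described above can be illustrated with ordinary threads. This is only a sketch of that one idea, not RACE or a full virtual-synchrony implementation; `first_answer` and the replica functions are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_answer(replicas, request):
    """Send `request` to every replica and return the first good result.

    A replica that raises is ignored as long as another one answers.
    Note: leaving the `with` block still waits for slower replicas to
    finish; a real system would cancel or detach them.
    """
    errors = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed replica or card
                errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def old_version(prefix):
    raise RuntimeError("process crashed")


def new_version(prefix):
    return f"route for {prefix} installed"


print(first_answer([old_version, new_version], "192.0.2.0/24"))
# route for 192.0.2.0/24 installed
```

Because any healthy replica can supply the answer, one process can be stopped and replaced, or can fail along with its card, without the caller noticing.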

23
Open Questions
  • What are other vendors doing? Cisco, Juniper, and
    Avici all seem to be missing major areas that
    Nortel is addressing. Of course, you can buy
    Cisco, Juniper, and Avici products now.
  • CLI design input
  • CPU protection rule input
  • Software architecture input (what modules should
    be on-the-fly upgrade-able): for example,
    trade-offs in BGP convergence vs.
    upgrade-ability.