Transcript and Presenter's Notes

Title: Congestion Avoidance


1
Congestion Avoidance & Control for OSPF Networks
(draft-ash-manral-ospf-congestion-control-00.txt)
Anurag Maunder, Sanera Systems, amaunder@sanera.net
Jerry Ash, AT&T, gash@att.com
Gagan Choudhury, AT&T, choudhury@att.com
Vera Sapozhnikova, AT&T, sapozhnikova@att.com
Vishwas Manral, NetPlane Systems, vishwasm@netplane.com
Mostafa Hashem Sherif, AT&T, mhs@att.com
2
Outline (draft-ash-manral-ospf-congestion-control-00.txt)
  • problem
  • concerns over scalability of IGP link-state
    protocols (e.g., OSPF)
  • much evidence that LS protocols cannot recover
    from large failures & widespread loss of topology
    database information
  • failure experience
  • vendor analysis
  • simulation modeling
  • propose protocol mechanisms to address problem
  • throttle LSA updates/retransmissions
  • detect & notify congestion state
  • neighbor nodes throttle LSA updates/retransmissions
  • keep adjacencies up
  • database backup & resynchronization
  • proprietary implementations of mechanisms have
    improved scalability/stability
  • need standard features for uniform implementation
    & interoperability
  • issues discussed on list

3
Background & Motivation
  • Failure experience
  • LS routing protocols cannot recover from large
    flooding storms
  • triggered by wide range of causes: network
    failures, bugs, operational errors, etc.
  • flooding storm overwhelms processors, causes
    database asynchrony, incorrect shortest path
    calculation, etc.
  • AT&T has experienced several very large LS
    protocol failures (4/13/1998, 7/2000, 2/20/2001,
    described in the I-D)
  • vendor analysis of LS protocol recovery from
    total network failure (loss of all database
    information) in the specified scenario (400 nodes,
    etc.)
  • recovery time estimates up to 5.5 hours
  • expectation is that vendor equipment recovery not
    adequate under large failure scenario
  • network-wide event simulation model [choudhury]
  • medium to large flooding storms cause network to
    recover with difficulty and/or not recover at all
  • model validated -- results match actual network
    experience

4
Failure Experience: AT&T Frame Relay Network, 4/13/98
  • cause & effect
  • administrative error coupled with a software bug
  • result was the loss of all topology database
    information
  • the link-state protocol then attempted to recover
    the database with the usual Hello & topology
    state updates (TSUs)
  • huge overload of control messages kept network
    down for very long time
  • several problems occurred to prevent the network
    from recovering properly (based on root-cause
    analysis)
  • very large number of TSUs being sent to every
    node to process, causing general processor
    overload
  • route computation based on incomplete topology
    recovery: routes generated based on transient,
    asynchronous topology information, then in need
    of frequent re-computation
  • inadequate work queue management to allow
    processes to complete before more work is put
    into the process queue
  • inability to access node processors with network
    management commands due to lack of necessary
    priority of these messages
  • worked with vendor to make protocol fixes to
    address problems
  • along the lines suggested in the I-D

5
Proposed Protocol Mechanisms: Throttle LSA
Updates/Retransmissions
  • detect node-congestion by
  • length of internal work queues
  • high processor occupancy & long CPU busy times
  • notify congestion state to other nodes
  • use TBD packet to convey congestion signal
  • when a node detects congestion from a neighbor
  • progressively decrease flooding rate (see sketch
    after this slide), e.g.
  • double LSA_RETRANSMIT_INTERVAL for low congestion
  • quadruple LSA_RETRANSMIT_INTERVAL for high
    congestion
  • simulation analysis shows proposed mechanisms
    perform effectively (Choudhury)
  • deals better with non-linear failure modes than
    statistical detection/notification methods
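As a rough illustration of the two mechanisms above, the following Python sketch shows congestion detection from work-queue length and CPU occupancy, and the progressive back-off of the LSA retransmit interval at a neighbor. All names, thresholds, and the two-level congestion scale are illustrative assumptions, not values from the draft.

  # Illustrative sketch only; thresholds and names are assumptions.
  BASE_LSA_RETRANSMIT_INTERVAL = 5.0   # seconds (common OSPF RxmtInterval default)

  def local_congestion_level(work_queue_len, cpu_occupancy):
      # Classify this node's congestion from queue length and CPU load.
      if work_queue_len > 10000 or cpu_occupancy > 0.95:
          return "high"
      if work_queue_len > 1000 or cpu_occupancy > 0.80:
          return "low"
      return "none"

  def neighbor_retransmit_interval(congestion_level):
      # Throttle LSA retransmissions toward a congested neighbor:
      # double the interval for low congestion, quadruple it for high.
      if congestion_level == "high":
          return 4 * BASE_LSA_RETRANSMIT_INTERVAL
      if congestion_level == "low":
          return 2 * BASE_LSA_RETRANSMIT_INTERVAL
      return BASE_LSA_RETRANSMIT_INTERVAL

  # Example: a neighbor signalling "low" congestion is retried
  # every 10 s instead of every 5 s.
  assert neighbor_retransmit_interval("low") == 10.0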

6
Issues Discussed on List
  • is there a problem (need to prevent catastrophic
    network collapse)
  • most seem to agree there is a problem
  • several have observed LSA storms & their ill
    effects
  • storms triggered by hardware failure, software
    bug, faulty operational practice, etc., many
    different events
  • sometimes network cannot recover
  • unacceptable to operators
  • vendors invited to analyze failure scenario given
    in draft
  • no response yet
  • how to solve problem
  • better/smart implementation/coding of protocol
    within current specification
  • e.g., never losing an adjacency solves problem
  • these are proprietary, single-vendor,
    implementation extensions
  • standard protocol extensions
  • for uniform implementation
  • for multi-vendor interoperability
  • already demonstrated with proprietary,
    single-vendor implementations

7
Issues Discussed on List
  • what protocol extensions?
  • not just a congestion-signaling message on the
    wire, but also the response
  • need uniform response to congestion signal (slow
    down by this much) to be effective
  • rather than implementation-dependent response
  • like helper router response to grace LSA from
    congested router in hitless restart
  • how to evaluate effectiveness of proposals
  • expert analysis based on experience
  • simulation
  • a couple of "academic"/"shaky" simulation
    comments
  • validated simulations used widely
  • for network design of routing features, network
    management (NM) features, congestion control, etc.
  • for many years
  • many large-scale network design examples (e.g.,
    Dynamic Routing in Telecommunications Networks,
    McGraw Hill)
  • white-box approach
  • implement & test in the lab
  • expert analysis, simulation, & white-box all useful

8
Issues Discussed at IETF-55 Routing Area Meeting
& MPLS WG Meeting
  • box builders' view
  • stop intruding into our box
  • design choices should be made by box builders
  • nothing wrong with current way of building boxes
  • box users' view
  • still observe major failures
  • most agree there is a problem (from list
    discussion)
  • box-builder/vendor analysis shows unacceptable
    failure response (in draft)
  • box-builders/vendors invited to analyze scenario
    in draft
  • box-builders' approach doesn't work to prevent
    failures
  • boxes need a few, critical, standard protocol
    mechanisms to address problem
  • have gotten vendors to make proprietary changes
    to fix problem
  • require standard protocol extensions
  • for uniform implementation
  • for multi-vendor interoperability
  • user requirements need to drive solution to
    problem

9
Conclusions
  • problem
  • concerns over scalability of IGP link-state
    protocols
  • evidence that LS routing protocols (e.g., OSPF)
    currently cannot recover from large failures &
    widespread loss of topology database information
  • problem is flooding, database asynchrony,
    shortest path calculation, etc.
  • evidence based on failure experience, vendor
    analysis, simulation modeling
  • propose protocol mechanisms to address problem,
    e.g.
  • throttle LSA updates/retransmissions
  • detect & notify congestion state
  • neighbor nodes throttle LSA updates/retransmissions
  • simulation analysis shows effectiveness of
    proposed changes (Choudhury)
  • propose draft as an OSPF WG document
  • refine/evolve proposed protocol extensions

10
Backup Slides
11
Proposed Congestion Control Mechanisms
  • throttle LSA updates/retransmissions
  • detect & notify congestion state
  • congested node signals other nodes to limit rate
    of LSA messages sent to it
  • neighbor nodes throttle LSA updates/retransmissions
  • automatically reduce rate under congestion
  • keep adjacencies up
  • database backup & resynchronization
  • topology database automatically recovered from
    loss based on local backup mechanisms
  • allows a node to recover gracefully from local
    faults on the node
  • prioritized processing of Hello & LSA Ack
    messages (Choudhury draft; sketched below)
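A minimal sketch of the prioritized-processing idea from the last bullet (Python; the priority ordering and packet classes are assumptions for illustration): Hello and LSA Ack packets are drained before bulk LSA updates, so adjacencies survive even while the update queue is long.

  import heapq

  # Lower number = higher priority; the ordering below is an assumption
  # illustrating "process Hello & LSA Ack before LSA Update".
  PRIORITY = {"hello": 0, "lsa_ack": 1, "lsa_update": 2}

  class PacketQueue:
      def __init__(self):
          self._heap = []
          self._seq = 0   # tie-breaker keeps arrival order within a class

      def enqueue(self, pkt_type, pkt):
          heapq.heappush(self._heap, (PRIORITY[pkt_type], self._seq, pkt))
          self._seq += 1

      def dequeue(self):
          return heapq.heappop(self._heap)[2] if self._heap else None

  q = PacketQueue()
  q.enqueue("lsa_update", "LSU #1")
  q.enqueue("hello", "Hello from neighbor A")
  q.enqueue("lsa_ack", "Ack for LSU #0")
  # The Hello is processed first even though it arrived last.
  assert q.dequeue() == "Hello from neighbor A"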

12
Keep Adjacencies Up
  • increase adjacency break interval under
    congestion
  • goal is to avoid breaking adjacencies by
    increasing wait interval for non-receipt of Hello
    messages
  • if node detects congestion from a neighbor & no
    packet is received within NODE_DEAD_INTERVAL
  • wait additional time ADJACENCY_BREAK_INTERVAL
    before calling adjacency down
  • throttle setups of link adjacencies
  • define MAX_ADJACENCY_BUILD_COUNT: the maximum
    number of adjacencies a node can bring up at one
    time (both mechanisms sketched below)
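A minimal sketch of the two mechanisms on this slide (Python; the constant values and function shapes are assumptions, only the constant names come from the slide):

  import time

  NODE_DEAD_INTERVAL = 40.0          # seconds; illustrative value
  ADJACENCY_BREAK_INTERVAL = 40.0    # extra grace time under congestion; illustrative
  MAX_ADJACENCY_BUILD_COUNT = 5      # adjacencies allowed to form at once; illustrative

  def adjacency_down(last_packet_time, neighbor_congested, now=None):
      # Declare the adjacency down only after the dead interval, plus an
      # extra ADJACENCY_BREAK_INTERVAL if the neighbor signalled congestion.
      now = time.time() if now is None else now
      limit = NODE_DEAD_INTERVAL
      if neighbor_congested:
          limit += ADJACENCY_BREAK_INTERVAL
      return (now - last_packet_time) > limit

  def may_start_adjacency(currently_forming):
      # Throttle adjacency setups to MAX_ADJACENCY_BUILD_COUNT at a time.
      return currently_forming < MAX_ADJACENCY_BUILD_COUNT

  # 60 s of silence breaks an adjacency to an uncongested neighbor,
  # but not to one that has signalled congestion.
  assert adjacency_down(0.0, neighbor_congested=False, now=60.0)
  assert not adjacency_down(0.0, neighbor_congested=True, now=60.0)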

13
Database Backup Resynchronization
  • database backup
  • node should provide a local, primary, nonvolatile
    memory backup [GR-472-CORE]
  • node should back up all non-self-originated LSAs,
    routing tables, states of interfaces
  • database should be backed up at least every 5
    minutes (sketched after this slide)
  • restoration of data should be completed within 5
    minutes of initiation [GR-472-CORE]
  • nodes signal neighbors when safe to perform
    resynchronization procedures
  • based on TBD packet format
  • under resynchronization, node
  • should generate all its own LSAs
  • should receive only LSAs that have changed
    between the time it failed & the current time
  • should base its routing on current database,
    derived as above
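A minimal sketch of the local backup idea (Python; the file path, snapshot layout, and data structures are assumptions, only the 5-minute cadence and the backed-up contents come from the slide):

  import pickle, time

  BACKUP_INTERVAL = 300                  # back up at least every 5 minutes
  BACKUP_PATH = "/var/lib/ospf/db.bak"   # hypothetical nonvolatile location

  def backup_database(lsdb, routing_table, interface_states, path=BACKUP_PATH):
      # Persist non-self-originated LSAs, routing tables, and interface
      # states to local nonvolatile storage.
      snapshot = {
          "lsas": {k: v for k, v in lsdb.items() if not v.get("self_originated")},
          "routes": routing_table,
          "interfaces": interface_states,
          "timestamp": time.time(),
      }
      with open(path, "wb") as f:
          pickle.dump(snapshot, f)

  def restore_database(path=BACKUP_PATH):
      # Reload the last snapshot; the node then generates its own LSAs and
      # asks neighbors only for LSAs changed after snapshot["timestamp"].
      with open(path, "rb") as f:
          return pickle.load(f)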

14
Database Backup Resynchronization
  • database resynchronization
  • propose changes to receiving/transmitting
    database summary & LSA request packets
  • when in Full state
  • node sends & receives database summary & LSA
    request packets as if performing database
    synchronization when peer data structure is in
    Negotiating, Exchanging, & Loading states
  • node informs neighbor when to use resync
    procedures
  • node supports a neighbor's resync request by
    receiving/transmitting database summary & LSA
    request packets (sketched below)
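A minimal sketch of the resynchronization behaviour described above (Python; the state names follow OSPF terminology, but the resync trigger and handler are assumptions): a node in Full state that has agreed to resync keeps exchanging database summary and LSA request packets without tearing the adjacency down.

  FULL, EXCHANGE, LOADING = "Full", "Exchange", "Loading"

  class Neighbor:
      def __init__(self):
          self.state = FULL
          self.resync_in_progress = False

      def start_resync(self):
          # Begin resynchronization without dropping the adjacency: exchange
          # database summaries and LSA requests as if in Exchange/Loading,
          # while the neighbor state stays Full.
          if self.state == FULL:
              self.resync_in_progress = True

      def handle_db_summary(self, summary_lsas, have_lsa):
          # Accept database summary packets during initial synchronization
          # or during an in-place resync, and request only the LSAs that
          # we do not already hold.
          if self.state in (EXCHANGE, LOADING) or self.resync_in_progress:
              return [lsa for lsa in summary_lsas if not have_lsa(lsa)]
          return []

  n = Neighbor()
  n.start_resync()
  # Request the one LSA we do not already have, while staying in Full state.
  assert n.handle_db_summary(["LSA-1", "LSA-2"], lambda lsa: lsa == "LSA-1") == ["LSA-2"]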

15
Failure Experience
  • other failures which have occurred with similar
    consequences
  • moderate TSU storm following ATM node upgrades,
    7/2000
  • network recovered, with difficulty
  • large TSU storm in ATM network, 2/20/2001
    [pappalardo1, pappalardo2]
  • manual procedures required to reduce TSU flooding
    & stabilize network
  • desirable to automate procedures for TSU flooding
    reduction under overload
  • worked with vendor to make protocol fixes to
    address problems
  • along the lines suggested in the I-D
  • other relevant LS-network failures have been
    reported [cholewka, jander]
  • conclusions
  • LS protocols vulnerable to loss of database
    information, control overload to re-sync databases,
    & other failure/overload scenarios
  • networks more vulnerable in absence of adequate
    protection mechanisms
  • generic problem of LS protocols
  • across a variety of implementations
  • across FR, ATM, IP-based technologies

16
Vendor Analysis
  • vendors & service providers asked to analyze LS
    protocol recovery from total network failure (loss
    of all database information) in the specified
    scenario
  • network scenario
  • 400 node network
  • 100 backbone nodes
  • 3 edge nodes per backbone node (edge nodes
    single-homed)
  • backbone nodes connected to max of 10 backbone
    nodes
  • max node adjacency is 13
  • sparse network
  • 101 peer groups
  • 1 backbone peer group with 100 backbone nodes
  • 100 edge peer groups, each with 3 nodes, all
    homed on the backbone peer group
  • 1,000,000 addresses advertised

17
Vendor Analysis
  • projected recovery times
  • Recovery Time Estimate A: 3.5 hours
  • Recovery Time Estimate B: 5-15 minutes
  • Recovery Time Estimate C: 5.5 hours
  • expectation is that vendor equipment recovery not
    adequate under large failure scenario

18
Analysis & Modeling
  • various studies published [atmf00-0249, maunder,
    choudhury]
  • [choudhury] reports network-wide event simulation
    model
  • study impact of a TSU storm
  • captures
  • node congestion
  • propagation delay between nodes
  • retransmissions if TSU not acknowledged within 5
    seconds
  • link declared down if Hello delayed beyond
    node-dead interval (aka inactivity timer in
    PNNI, router-dead interval in OSPF)
  • link recovery following database synchronization
  • approximates real network behavior & processing
    times
  • results show
  • dispersion -- number of control packets generated
    but not processed in at least one node (sketched
    below)
  • medium to large TSU storms cause network to
    recover with difficulty and/or not recover at all
  • results match actual network experience
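A minimal sketch of the dispersion metric as described above (Python; the data layout is an assumption): a control packet counts toward dispersion as long as at least one node has not yet processed it.

  def dispersion(generated, processed_by_node):
      # generated: set of control-packet ids flooded network-wide.
      # processed_by_node: dict mapping node -> set of packet ids processed.
      return sum(
          1
          for pkt in generated
          if any(pkt not in done for done in processed_by_node.values())
      )

  generated = {"LSA-1", "LSA-2", "LSA-3"}
  processed = {"A": {"LSA-1", "LSA-2", "LSA-3"}, "B": {"LSA-1"}}
  # LSA-2 and LSA-3 are still pending at node B, so dispersion is 2.
  assert dispersion(generated, processed) == 2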

19
Impact of TSU Storm on Network Stability