EtherRake: Diagnosis and Monitoring in Data Center - PowerPoint PPT Presentation

About This Presentation

Slides: 37
Provided by: Zhich8

Transcript and Presenter's Notes



1
EtherRake: Diagnosis and Monitoring in Data Center
and Enterprise Networks
Lab for Internet and Security Technology (LIST)
Northwestern Univ.
2
General Idea of EtherRake
  • Problem statement
    • Emerging data center (DC) and enterprise networks
      consist mainly of large numbers of switches,
      which need monitoring and diagnosis.
    monitoring and diagnosis

3
General Idea of EtherRake
  • A centralized structure with three parts:
  • Collector at each switch
    • Collects neighbor information
    • Collects port information
    • Collects forwarding tables
  • Monitor plane
    • Transmits the collected information
  • Processing center
    • Links the frames
    • Constructs the logical topology
    • Finds the problems

4
Collector at Each Switch
  • Taking Cisco switches as an example:
  • Port information
    • show port status (display interface ethernet0/1
      on Huawei switches)
  • Neighbor information
    • show cdp neighbors
  • Forwarding tables (a.k.a. switch tables)
    • Show the MAC-address-to-interface mapping

5
Collector at Each Switch
  • Port information
    • Port number: 2 bytes
    • Status: 4 bits
    • Total: 3 bytes (rounded up) × 100 ports
      = 300 bytes < 0.4 KB per switch
  • Neighbor information
    • MAC address: 48 bits
    • Total: 6 bytes × 100 = 600 bytes < 0.6 KB per
      switch
  • Forwarding tables
    • To be decided; not used in our current approach.
      Only updates need to be transferred, so normally
      nothing needs to be sent.
  • Total: ~1 KB × 1024 (number of switches) = ~1 MB
    per collection round.
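A minimal sketch of how these sizes add up, assuming a hypothetical record layout: each port record packs the 2-byte port number and the 4-bit status (stored in the low nibble of a third byte) into 3 bytes, and each neighbor record is the raw 6-byte MAC address. The names encode_port, encode_neighbor and per_switch_bytes, the one-neighbor-per-port assumption, and the figure of 100 ports per switch are illustrative assumptions rather than EtherRake's actual format.

```python
import struct

PORTS_PER_SWITCH = 100  # assumed, matching the "100" in the slide's totals
NUM_SWITCHES = 1024


def encode_port(port_number: int, status: int) -> bytes:
    """Hypothetical 3-byte port record: 16-bit port number, then a byte whose low nibble is the status."""
    if not 0 <= status <= 0xF:
        raise ValueError("status must fit in 4 bits")
    return struct.pack(">HB", port_number, status)


def encode_neighbor(mac: str) -> bytes:
    """Hypothetical 6-byte neighbor record: the neighbor's 48-bit MAC address."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 48 bits")
    return raw


def per_switch_bytes(ports: int = PORTS_PER_SWITCH) -> tuple[int, int]:
    """(port information, neighbor information) bytes for one switch, one neighbor per port."""
    port_info = sum(len(encode_port(p, 0x1)) for p in range(ports))
    neighbor_info = ports * len(encode_neighbor("00:1a:2b:3c:4d:5e"))
    return port_info, neighbor_info


port_info, neighbor_info = per_switch_bytes()
per_round = (port_info + neighbor_info) * NUM_SWITCHES
print(port_info, neighbor_info, per_round)  # 300 600 921600
```

The 900 bytes per switch are what the slide rounds up to 1 KB; across 1024 switches that is 921,600 bytes, just under 1 MB per round.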

6
Collector at Each Switch
  • Synchronization
    • Cristian's algorithm (P is the processing center
      and S is a collector)
      • P requests the time from S.
      • After receiving the request from P, S prepares a
        response and appends the time T from its own
        clock.
      • P then sets its time to T + RTT/2, where RTT is
        the round-trip time P measures for the exchange.
      • Multiple measurements can reduce the error
        (e.g., keep the one with the smallest RTT).
      • Accuracy. The true time lies between (T + min)
        and (T + RTT - min), where min is the minimum
        one-way transmission time.
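Cristian's algorithm as described above can be sketched in a few lines. This sketch assumes P timestamps the request and the response with its own local clock and that all times are integer milliseconds; the Sample type and the functions cristian_estimate and best_estimate are hypothetical names. Following the usual practice for Cristian's algorithm, the multi-measurement variant keeps the exchange with the smallest RTT, because it gives the tightest error bound.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t_send: int       # P's local clock when the request was sent (ms)
    server_time: int  # T: S's clock value appended to the response (ms)
    t_recv: int       # P's local clock when the response arrived (ms)


def cristian_estimate(sample: Sample, min_delay: int = 0) -> tuple[float, float]:
    """Return (estimate of S's time when the response arrived, maximum error)."""
    rtt = sample.t_recv - sample.t_send
    # S's true time lies in [T + min, T + RTT - min]; the midpoint is T + RTT/2.
    estimate = sample.server_time + rtt / 2
    max_error = rtt / 2 - min_delay
    return estimate, max_error


def best_estimate(samples: list[Sample], min_delay: int = 0) -> tuple[float, float]:
    """Repeat the exchange and keep the sample with the smallest RTT."""
    best = min(samples, key=lambda s: s.t_recv - s.t_send)
    return cristian_estimate(best, min_delay)


samples = [
    Sample(t_send=10_000, server_time=52_030, t_recv=10_080),  # RTT 80 ms
    Sample(t_send=11_000, server_time=53_010, t_recv=11_020),  # RTT 20 ms
]
print(best_estimate(samples, min_delay=2))  # (53020.0, 8.0)
```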

7
Monitor Plane
  • The monitor plane co-exists with the data plane and
    the control plane on the same channel. It is used
    to transfer monitoring data.

8
Monitor Plane
  • The monitor plane is used to collect data for
    monitoring the data plane.
  • Switching in the monitor plane has two methods:
    • Normally, the control plane assists monitor-plane
      forwarding.
    • Under error conditions, the monitor plane floods.
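One way to picture the two switching methods is a per-switch forwarding decision: use the port the control plane points to when that link is up, and flood otherwise. This is a minimal sketch, not EtherRake's implementation; the MonitorSwitch and MonitorMessage types and the output_ports method are hypothetical, and the (origin, seq) duplicate check is an added assumption, since flooding without some form of duplicate suppression would circulate copies indefinitely in a looped topology.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MonitorMessage:
    origin: str     # collector that produced the report
    seq: int        # hypothetical per-origin sequence number used to drop duplicates
    payload: bytes


@dataclass
class MonitorSwitch:
    name: str
    ports_up: dict[int, bool]              # port -> link state from the collector
    next_hop_port: Optional[int] = None    # port toward the processing center, per the control plane
    seen: set[tuple[str, int]] = field(default_factory=set)

    def output_ports(self, msg: MonitorMessage, in_port: Optional[int]) -> list[int]:
        """Ports on which to send a monitoring message."""
        key = (msg.origin, msg.seq)
        if key in self.seen:
            return []                      # already handled: stop flooded copies here
        self.seen.add(key)
        nh = self.next_hop_port
        if nh is not None and self.ports_up.get(nh, False):
            return [nh]                    # normal case: follow the control plane
        # error case: flood on every working port except the one the message came in on
        return [p for p, up in sorted(self.ports_up.items()) if up and p != in_port]


sw = MonitorSwitch("s1", ports_up={1: True, 2: False, 3: True, 4: True}, next_hop_port=2)
msg = MonitorMessage(origin="s7", seq=1, payload=b"port-status")
print(sw.output_ports(msg, in_port=1))  # [3, 4]: next hop 2 is down, so flood
print(sw.output_ports(msg, in_port=3))  # []: a duplicate copy is dropped
```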

9
Processing Center
  • Collects port information, forwarding tables and
    neighbor information from all the switches.
  • Constructs the logical topology of the switches
    from the port and neighbor information.
  • Detects loops in the logical topology to find STP
    loop problems.
  • Checks for any missing or dead switches.
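The loop check and the missing-switch check can be sketched over the neighbor reports. This sketch assumes each switch reports only the neighbors it reaches through forwarding (non-blocked) ports, so a cycle in the resulting undirected graph indicates an STP loop; it uses union-find to spot the first link that closes a cycle. The report format and the functions find_loop_link and missing_switches are hypothetical.

```python
from typing import Optional


def find_loop_link(reports: dict[str, list[str]]) -> Optional[tuple[str, ...]]:
    """Return a link that closes a cycle in the reported topology, or None if it is loop-free."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    seen_links = set()
    for switch, neighbors in reports.items():
        for neighbor in neighbors:
            link = frozenset((switch, neighbor))
            if link in seen_links:      # both ends report the same link; count it once
                continue
            seen_links.add(link)
            root_a, root_b = find(switch), find(neighbor)
            if root_a == root_b:        # endpoints already connected: this link closes a loop
                return tuple(sorted(link))
            parent[root_a] = root_b
    return None


def missing_switches(inventory: list[str], reports: dict[str, list[str]]) -> list[str]:
    """Switches that are expected but sent no report (missing or dead)."""
    return sorted(set(inventory) - set(reports))


loop_reports = {"s1": ["s2", "s3"], "s2": ["s1", "s3"], "s3": ["s1", "s2"]}
tree_reports = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"]}
print(find_loop_link(loop_reports))  # ('s2', 's3')
print(find_loop_link(tree_reports))  # None
print(missing_switches(["s1", "s2", "s3", "s4"], tree_reports))  # ['s4']
```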

10
Problems to Solve
  • STP Error Detection
  • End-to-end Error Detection
  • Other Hardware/Software Errors of Switches and
    Their Detection
  • TRILL Potential Problems

11
End-to-end Connectivity Monitoring
  • Based on the neighbor and port information, check
    whether all switches and end hosts are connected in
    the spanning tree (ST).
  • End hosts are also neighbors of the leaf-node
    switches.
  • The forwarding table also records information about
    past connectivity.
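The connectivity check reduces to a graph traversal once end hosts are added as neighbors of their leaf switches. A minimal breadth-first-search sketch, assuming the links come from the neighbor and port reports and that any node (for example the spanning-tree root) can serve as the starting point; the function unreachable_nodes and the link format are hypothetical.

```python
from collections import deque


def unreachable_nodes(links: list[tuple[str, str]], nodes: list[str], root: str) -> list[str]:
    """Return the switches and hosts that cannot be reached from root."""
    adj: dict[str, set[str]] = {n: set() for n in nodes}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(set(adj) - seen)


# s1, s2, s3 are switches; h1 and h2 are end hosts attached to leaf switches.
links = [("s1", "s2"), ("s2", "h1"), ("s3", "h2")]
print(unreachable_nodes(links, ["s1", "s2", "s3", "h1", "h2"], root="s1"))  # ['h2', 's3']
```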

12
Other Software Errors of Switches and Their
Detection
  • One-Way Link Problem. No frames flow in the
    backward direction.
    • From EtherRake's view, the interface in the
      other direction is dead.
  • Deferred Frames. The buffer is full, so frames
    have to be dropped.
    • Encode the buffer status (e.g., full) in the
      status bits.
  • Links between switches and routers are disabled
    or not activated.
    • Detected by the port status bits or by the lack
      of a heartbeat.
  • Switches are down, e.g., because of unbootable
    IOS problems.
    • Detected the same way as above.
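Two of these checks can be sketched directly from the collected reports. A one-way link shows up as an asymmetry in the neighbor reports (A lists B, but B does not list A), and a dead switch or link shows up as a missing heartbeat. The report format, the timeout value and the functions one_way_links and missing_heartbeats are hypothetical.

```python
def one_way_links(reports: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Pairs (a, b) where a hears b but b does not hear a, i.e. the a -> b direction looks dead."""
    suspects = []
    for a, heard in reports.items():
        for b in heard:
            # Only judge pairs where both switches actually sent a report.
            if b in reports and a not in reports[b]:
                suspects.append((a, b))
    return sorted(suspects)


def missing_heartbeats(last_seen: dict[str, float], now: float, timeout: float) -> list[str]:
    """Switches whose last heartbeat is older than the timeout (seconds)."""
    return sorted(s for s, t in last_seen.items() if now - t > timeout)


reports = {"s1": {"s2"}, "s2": set(), "s3": {"s2"}}
print(one_way_links(reports))  # [('s1', 's2'), ('s3', 's2')]
print(missing_heartbeats({"s1": 100.0, "s2": 97.0, "s3": 88.0}, now=100.0, timeout=10.0))  # ['s3']
```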

13
Limitations on Detecting Other Switch Software
Errors
  • Some errors have to be detected at the data plane
    or application plane.
  • VLAN Problems. Hosts in the same VLAN cannot
    communicate with each other.

14
Hardware Errors of Switches and Their Detection
  • Switch Port Errors.
  • Switch Module Errors.
  • Both are detected through the port status reports.

15
STP Errors (1)
  • Count to infinity when the root bridge is removed

16
STP Errors (2)
  • Forwarding Loops
    • BPDU-Loss-Induced Forwarding Loops. If a blocked
      port fails to receive BPDUs from its peer bridge
      for an extended period of time, it may start
      forwarding data.

17
STP Errors (3)
  • Forwarding Loops
    • MaxAge-Induced Forwarding Loops (MaxAge = 6)

18
STP Errors (4)
  • Forwarding Loops
    • Count-to-Infinity-Induced Forwarding Loops
    • Pollution of Forwarding Tables

19
Previous STP Error Detection
  • EtherFuse (SIGCOMM '07)
    • Plugs a fuse into the Ethernet network
    • Remaining problems
      • Where should it be plugged in?
      • How many fuses are needed?

20
Previous STP Error Detection
  • Cisco prevention methods
    • Loop Guard. Prevents loops induced by BPDU loss.

21
Some Existing Solutions
  • Cisco Discovery Protocol (CDP)
    • Discovers neighboring Cisco devices
    • Monitors the liveness of neighboring nodes
    • Limitations
      • No detailed status reports for diagnosis
      • Limited to one hop
  • Cisco Unidirectional Link Detection (UDLD)
    • Detects one-way link problems

22
General Monitoring Metrics for Detection
  • Connectivity. Based on the tree of linked frames,
    EtherRake can determine whether a path is
    connected.
  • Delay. EtherRake can link frames and calculate the
    time spent at each switch.
  • Throughput. EtherRake can calculate throughput from
    the collected frames.
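Once frames are linked, delay and throughput are simple arithmetic over the records. This sketch assumes each collector records synchronized ingress and egress timestamps (in microseconds) and the frame length for every sighting; the Sighting type and the functions time_per_switch and throughput_bps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    switch: str
    t_in: int    # synchronized ingress timestamp, microseconds (assumed to be collected)
    t_out: int   # synchronized egress timestamp, microseconds (assumed to be collected)
    size: int    # frame length in bytes


def time_per_switch(trace: list[Sighting]) -> dict[str, int]:
    """Delay: time one linked frame spent inside each switch on its path."""
    return {s.switch: s.t_out - s.t_in for s in trace}


def throughput_bps(sightings: list[Sighting], switch: str, start_us: int, end_us: int) -> float:
    """Throughput: bits that left the given switch in [start_us, end_us), per second."""
    total_bytes = sum(s.size for s in sightings
                      if s.switch == switch and start_us <= s.t_out < end_us)
    return total_bytes * 8 * 1_000_000 / (end_us - start_us)


trace = [Sighting("s1", 0, 12, 1500), Sighting("s2", 20, 45, 1500), Sighting("s3", 50, 58, 1500)]
print(time_per_switch(trace))  # {'s1': 12, 's2': 25, 's3': 8}

frames = [Sighting("s1", i * 1000, i * 1000 + 10, 1250) for i in range(100)]
print(throughput_bps(frames, "s1", 0, 1_000_000))  # 1000000.0
```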

23
TRILL Potential Problems
  • Routing loops
    • Caused by inconsistent views of the network
      topology
    • Mitigated using a hop count
  • Scalability issues
    • Unclear how far TRILL can scale
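The hop-count mitigation can be illustrated with a simplified model in which every forwarding step decrements the count and the frame is dropped when the count reaches zero, so a routing loop caused by inconsistent topology views only wastes a bounded number of hops. TRILL's Hop Count field is 6 bits wide, so at most 63 hops; the exact decrement-and-check rules of the TRILL specification are not reproduced here, and the function forward and the next-hop table format are hypothetical.

```python
MAX_HOP_COUNT = 63  # largest value of TRILL's 6-bit Hop Count field


def forward(next_hop: dict[str, str], ingress: str, hop_count: int) -> list[str]:
    """Follow per-RBridge next hops until the frame exits or its hop count runs out."""
    if not 0 <= hop_count <= MAX_HOP_COUNT:
        raise ValueError("hop count must fit in 6 bits")
    path = [ingress]
    node = ingress
    while node in next_hop:          # a node with no next hop is the egress RBridge
        if hop_count == 0:
            break                    # simplified: drop instead of forwarding further
        hop_count -= 1
        node = next_hop[node]
        path.append(node)
    return path


loop = {"A": "B", "B": "C", "C": "A"}   # inconsistent views produce a cycle
print(forward(loop, "A", hop_count=5))  # ['A', 'B', 'C', 'A', 'B', 'C']
print(forward({"A": "B", "B": "D"}, "A", hop_count=5))  # ['A', 'B', 'D']
```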

24
Backup
25
Detection of STP Errors by EtherRake
  • Find STP errors with EtherRake
    • Link collected frames into traces
    • Detect frame forwarding loops
    • Leverage the switch (MAC) table and ARP table
      information
  • Challenges
    • Scalability: optimize the collection of traces
    • Ambiguity and accuracy: frame linking
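Linking frames into traces and spotting forwarding loops can be sketched by fingerprinting each frame and ordering its sightings by time: a frame that is seen twice by the same switch is looping. This assumes switches do not modify the frame bytes between hops and that collectors report (switch, timestamp, frame) records; the functions frame_digest, link_frames and looping_traces are hypothetical. Hashing whole frames is also where the ambiguity challenge appears, because two genuinely different but byte-identical frames (for example repeated ARP requests) would be merged into one trace.

```python
import hashlib
from collections import defaultdict


def frame_digest(frame: bytes) -> str:
    """Fingerprint a frame; assumes switches forward it unmodified."""
    return hashlib.sha256(frame).hexdigest()


def link_frames(records: list[tuple[str, int, bytes]]) -> dict[str, list[str]]:
    """Group (switch, timestamp, frame) sightings into time-ordered traces."""
    sightings = defaultdict(list)
    for switch, timestamp, frame in records:
        sightings[frame_digest(frame)].append((timestamp, switch))
    return {d: [sw for _, sw in sorted(s)] for d, s in sightings.items()}


def looping_traces(traces: dict[str, list[str]]) -> list[list[str]]:
    """A trace that visits the same switch twice indicates a forwarding loop."""
    return [path for path in traces.values() if len(path) != len(set(path))]


records = [
    ("s1", 1, b"frame-A"), ("s2", 2, b"frame-A"), ("s3", 3, b"frame-A"), ("s1", 4, b"frame-A"),
    ("s1", 1, b"frame-B"), ("s2", 2, b"frame-B"),
]
print(looping_traces(link_frames(records)))  # [['s1', 's2', 's3', 's1']]
```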

26
End-to-end Connectivity Monitoring
  • Diagnose a connectivity problem from A to B with
    EtherRake
    • Find the frames that are on the way from A to B.
    • Link the frames and find a path.
    • Locate the problem.
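Locating the problem can be sketched as comparing the expected path from A to B (for example, derived from the forwarding tables, which is an assumption here) with the switches at which the linked frame was actually observed; the fault lies between the last switch that saw the frame and the first one that did not. The function locate_break and its inputs are hypothetical.

```python
from typing import Optional


def locate_break(expected_path: list[str], seen_at: set[str]) -> Optional[tuple[Optional[str], str]]:
    """Return (last switch that saw the frame, first switch that should have but did not)."""
    last_ok: Optional[str] = None
    for switch in expected_path:
        if switch not in seen_at:
            return last_ok, switch
        last_ok = switch
    return None  # the frame was observed along the whole path


print(locate_break(["s1", "s2", "s3", "s4"], {"s1", "s2"}))  # ('s2', 's3')
print(locate_break(["s1", "s2"], {"s1", "s2"}))              # None
```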

27
IP Router Errors OSPF (1)
  • Network Convergence Time. The time taken by all
    the OSPF routers in the network to return to
    steady-state operation after a change in the
    network state.

28
IP Router Errors OSPF (2)
  • Routing Load on Processors

29
IP Router Errors OSPF (3)
  • Route Flaps. Routing table changes in a router,
    usually in response to a network failure or a
    recovery.

30
Cisco Solution
  • Bidirectional Forwarding Detection (BFD)
  • Tries to speed up network convergence, which has
    three parts:
    • Failure detection. The speed with which a device
      on the network can detect and react to a failure
      of one of its own components or of a component in
      a routing protocol peer.
    • Information dissemination. The speed with which
      the failure detected in the previous stage can be
      communicated to other devices in the network.
    • Repair. The speed with which all devices on the
      network, having been notified of the failure, can
      calculate an alternate path through which data
      can flow.
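The three parts add up to the overall convergence time, and BFD mainly shortens the first one. In BFD's asynchronous mode, a session is declared down when no control packet arrives within the detection time, which is roughly the negotiated transmit interval multiplied by the detect multiplier. The sketch below uses that relation; the functions bfd_detection_time_ms and convergence_time_ms are hypothetical names, and the dissemination and repair times in the example are hypothetical placeholders, not measurements.

```python
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Time without BFD control packets after which the peer is declared down."""
    return tx_interval_ms * detect_multiplier


def convergence_time_ms(detection_ms: int, dissemination_ms: int, repair_ms: int) -> int:
    """Convergence = failure detection + information dissemination + repair."""
    return detection_ms + dissemination_ms + repair_ms


detection = bfd_detection_time_ms(tx_interval_ms=50, detect_multiplier=3)
print(detection)                                # 150
print(convergence_time_ms(detection, 40, 200))  # 390 (hypothetical dissemination/repair)
```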

31
IP Router Errors DHCP
  • DHCP problem
  • Configuration problem.
  • Inability to acquire or renew a lease.
  • How to keep the same IP address in multi-boot
    machines?

32
EtherFuse (1)
  • An Ethernet fuse that is plugged into the network
    to monitor its status.

33
EtherFuse (2)
  • Detection of count to infinity
    • Tracks the cost to the same root R advertised
      in BPDUs
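A minimal sketch of this idea: watch the BPDUs seen on one link and flag a root R whose advertised root path cost keeps rising, which is the signature of bridges counting to infinity after the real root has gone away. The threshold of three consecutive increases, the (root, cost) input format and the function detect_count_to_infinity are assumptions for illustration, not EtherFuse's exact rule.

```python
from typing import Iterable, Optional


def detect_count_to_infinity(bpdus: Iterable[tuple[str, int]], min_increases: int = 3) -> Optional[str]:
    """Flag a root whose advertised root path cost rises on min_increases consecutive BPDUs.

    bpdus: (root_id, root_path_cost) pairs in arrival order on one link.
    """
    last_cost: dict[str, int] = {}
    streak: dict[str, int] = {}
    for root, cost in bpdus:
        if root in last_cost and cost > last_cost[root]:
            streak[root] = streak.get(root, 0) + 1
            if streak[root] >= min_increases:
                return root
        else:
            streak[root] = 0   # cost stable or falling: not counting to infinity
        last_cost[root] = cost
    return None


print(detect_count_to_infinity([("R", 4), ("R", 4), ("R", 8), ("R", 12), ("R", 16)]))  # R
print(detect_count_to_infinity([("R", 4)] * 5))  # None
```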

34
EtherFuse (3)
  • Detection of forwarding loops
    • Combination of passive sniffing and active
      probing

35
Packet View Switching
  • Forwarding packets from the packets' point of view.
  • Each packet keeps a memory of the path it has
    already traversed and decides which way to go
    based on that memory.
  • The steps are as follows. (Generally speaking, it
    is a depth-first search from the packet's point of
    view.)

36
Packet View Switching
  1. Normally, when a packet arrives at a switch, it
     chooses the default port, which is the port
     provided by the control plane.
  2. If the packet has already tried the default port,
     it randomly chooses a port it has never used.
  3. If the packet has tried every port at this switch,
     it goes back through the port it came from.
  4. The search ends either when the packet returns to
     its origin and finds no other way to go (the
     packet is discarded) or when it arrives at its
     destination, the monitor center.
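The four rules amount to a depth-first search driven by the packet itself. The sketch below keeps the packet's memory (ports tried at each switch, switches visited, and the switches to walk back through) in local variables, whereas in the scheme described it would travel with the packet. It also assumes that a port leading to an already visited switch counts as tried, so the search cannot cycle; the Topology format, the function packet_view_route and that extra rule are hypothetical.

```python
import random
from typing import Optional

# topology[switch][port] -> neighbor reached through that port, or None if the link is down
Topology = dict[str, dict[int, Optional[str]]]


def packet_view_route(topology: Topology, default_port: dict[str, int],
                      origin: str, destination: str,
                      rng: Optional[random.Random] = None) -> Optional[list[str]]:
    """Return the sequence of switches the packet visits, or None if it is discarded."""
    rng = rng or random.Random(0)
    tried: dict[str, set[int]] = {sw: set() for sw in topology}  # ports used at each switch
    visited = {origin}
    back: list[str] = []    # switches to walk back to (rule 3)
    current = origin
    hops = [origin]
    while current != destination:
        ports = topology[current]
        untried = sorted(p for p in ports if p not in tried[current])
        if default_port.get(current) in untried:
            port = default_port[current]          # rule 1: control plane's port first
        elif untried:
            port = rng.choice(untried)            # rule 2: a random port not yet tried
        else:
            if not back:
                return None                       # rule 4: back at origin, nowhere to go
            current = back.pop()                  # rule 3: return the way it came
            hops.append(current)
            continue
        tried[current].add(port)
        neighbor = ports[port]
        if neighbor is None or neighbor in visited:
            continue                              # dead link or visited switch: try another port
        back.append(current)
        visited.add(neighbor)
        current = neighbor
        hops.append(current)
    return hops


topology: Topology = {
    "s1": {1: "s2", 2: "s3"},
    "s2": {1: "s1", 2: None},   # the link the control plane prefers is down
    "s3": {1: "s1", 2: "mc"},
    "mc": {1: "s3"},            # monitor center
}
default_port = {"s1": 1, "s2": 2, "s3": 2}
print(packet_view_route(topology, default_port, "s1", "mc"))  # ['s1', 's2', 's1', 's3', 'mc']
```

The packet first follows the control plane's (broken) path to s2, finds no usable port there, walks back to s1, and then reaches the monitor center through s3.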