Low Overhead Fault Tolerant Networking in Myrinet

Provided by: mallikag

Transcript and Presenter's Notes

1
Low Overhead Fault Tolerant Networking (in Myrinet)
  • Architecture and Real-Time Systems (ARTS) Lab.
  • Department of Electrical and Computer Engineering
  • University of Massachusetts Amherst MA 01003

2
Motivation
  • An increasing use of COTS components in systems
    has been motivated by the need to
      • Reduce cost in design and maintenance
      • Reduce software complexity
  • The emergence of low cost, high performance COTS
    networking solutions
      • e.g., Myrinet, SCI, Fibre Channel
  • The increasing complexity of network interfaces
    has renewed concerns about their reliability
      • The amount of silicon used has increased
        tremendously

3
The Basic Question
How can we incorporate fault tolerance into a
COTS network technology without greatly
compromising its performance?
4
Microprocessor-based Networks
  • Most modern network technologies have processors
    in their interface cards that help to achieve
    superior network performance
  • Many of these technologies allow changes in the
    program running on the network processor
  • Such programmable interfaces offer numerous
    benefits
      • Developing different fault tolerance techniques
      • Validating fault recovery using fault injection
      • Experimenting with different communication
        protocols
  • We use Myrinet as the platform for our study

5
Myrinet
  • Myrinet is a cost-effective high performance (2.2
    Gb/s) packet switching technology
  • At its core is a powerful RISC processor
  • It is scalable to thousands of nodes
  • Low latency communication (8 µs) is achieved
    through direct interaction with network interface
    (OS bypass)
  • Flow control, error control and simple heartbeat
    mechanisms are incorporated in hardware
  • Link and routing specifications are a public
    standard
  • Myrinet support software is supplied as open
    source

6
Myrinet Configuration
[Figure: block diagram of a host node (host processor, system memory, system bridge, I/O bus) attached to a LANai 9 interface. Interface labels: RISC, LANai SRAM, Timers 0 to 2, PCI Bridge, DMA Engine, PCIDMA, Host Interface, Packet Interface, SAN/LAN Conversion.]
7
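Hardware and Software
[Figure: layered view of the system. Host side: application, middleware (e.g., MPI), TCP/IP interface and OS driver, running on the host processor and system memory. Across the I/O bus sits the Myrinet card: network processor and local memory running the Myrinet Control Program, labelled "Programmable Interface".]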
8
Susceptibility to Failures
  • Dependability evaluation was carried out using
    software implemented fault injection
  • Faults were injected in the Control Program (MCP)
  • A wide range of failures were observed
  • Unexpected latencies and reduction of bandwidth
  • The network processor can hang and stop
    responding
  • A host system can crash/hang
  • A remote network interface can be affected
  • Similar types of failures can be expected from
    other high-speed networks
  • Such failures can greatly impact the
    reliability/availability of the system

9
Summary of Experiments
Failure Category    Count    % of Injections
Total               2080     100
[Per-category rows of the table are not present in this transcript.]
  • More than 50% of the failures were host interface
    hangs

10
Design Considerations
  • The faults must be detected and diagnosed as
    quickly as possible
  • The network interface must be up and running as
    soon as possible
  • The recovery process must ensure that no messages
    are lost or improperly received/sent
  • Complete correctness should be achieved
  • The overhead on the normal running of the system
    must be minimal
  • The fault tolerance should be made as transparent
    to the user as possible

11
Fault Detection
  • Continuously polling the card can be very costly
  • We use a spare interval timer to implement a
    watchdog timer functionality for fault detection
  • We set the LANai to raise an interrupt when the
    timer expires
  • A routine (L_timer) that the LANai is supposed to
    execute periodically resets this interval timer
  • If the interface hangs, L_timer is not executed,
    so our interval timer expires and raises a FATAL
    interrupt
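
The watchdog idea on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the LANai firmware: `WatchdogTimer`, `run`, the tick-based time model and the interval values are all hypothetical names and numbers chosen for illustration. Only `L_timer` and the FATAL interrupt come from the slides.

```python
class WatchdogTimer:
    """Interval timer that fires unless it is reset in time (hypothetical model)."""

    def __init__(self, interval, on_expire):
        self.interval = interval      # ticks allowed between resets
        self.remaining = interval
        self.on_expire = on_expire    # stands in for raising the FATAL interrupt
        self.expired = False

    def reset(self):
        """Called by the periodic routine (L_timer in the slides)."""
        self.remaining = self.interval

    def tick(self):
        """Advance simulated time by one tick; fire once on expiry."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True
            self.on_expire()


def run(ticks, l_timer_period, hang_at=None, interval=5):
    """Simulate an interface whose L_timer runs every l_timer_period ticks.

    If hang_at is given, the interface stops executing L_timer from that tick on.
    Returns the tick at which FATAL was raised, or None if it never was.
    """
    fatal_at = []
    now = 0
    wd = WatchdogTimer(interval, lambda: fatal_at.append(now))
    for now in range(1, ticks + 1):
        hung = hang_at is not None and now >= hang_at
        if not hung and now % l_timer_period == 0:
            wd.reset()                # healthy interface keeps resetting the timer
        wd.tick()
    return fatal_at[0] if fatal_at else None
```

In this model a healthy interface never triggers FATAL as long as `l_timer_period` is shorter than `interval`, and the host does no polling at all: it only reacts when the interrupt arrives.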

12
Fault Recovery Summary
  • The FATAL interrupt signal is picked up by the
    fault recovery daemon on the host
  • The failure is verified through numerous probing
    messages
  • The control program is reloaded into the LANai
    SRAM
  • Any process that was accessing the board prior to
    the failure is also restored to its original
    state
  • Simply reloading the MCP will not ensure
    correctness

13
Myrinet Programming Model
  • Flow control is achieved through send and receive
    tokens
  • Myrinet software (GM) provides reliable in-order
    delivery of messages
  • A modified form of Go-Back-N protocol is used
  • Sequence numbers for the protocol are provided by
    the MCP
  • One stream of sequence numbers exists per
    destination
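
As a rough illustration of the in-order delivery described above, here is the receiving side of a plain textbook Go-Back-N protocol. The slides say GM uses a modified form, so this is not GM's actual protocol; `GoBackNReceiver` and `on_packet` are hypothetical names. Since the sender keeps one sequence stream per destination, the receiver in this sketch tracks one expected number per source.

```python
class GoBackNReceiver:
    """Textbook Go-Back-N receiver with one sequence stream per source."""

    def __init__(self):
        self.expected = {}    # source -> next in-order sequence number
        self.delivered = []   # (source, payload) in delivery order

    def on_packet(self, source, seq, payload):
        """Handle one packet and return the cumulative ACK for that source.

        The ACK is the highest in-order sequence number received so far,
        or -1 if nothing has been accepted yet.
        """
        nxt = self.expected.get(source, 0)
        if seq == nxt:
            # In-order packet: deliver it and advance the stream.
            self.delivered.append((source, payload))
            nxt += 1
            self.expected[source] = nxt
        # Out-of-order and duplicate packets are dropped; the ACK is repeated.
        return nxt - 1
```

The sender side (not shown) would retransmit everything after the last acknowledged number when a timeout occurs, which is what makes the sequence state critical during recovery.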

14
Typical Control Flow
Sender
  • User process prepares message
  • User process sets send token
  • LANai sdmas message
  • LANai sends message
  • LANai receives ACK
  • LANai sends event to process
  • User process handles notification event
  • User process reuses buffer
Receiver
  • User process provides receive buffer
  • User process sets recv token
  • LANai recvs message
  • LANai sends ACK
  • LANai rdmas message
  • LANai sends event to process
  • User process handles notification event
  • User process reuses buffer
15
Duplicate Messages
Sender
  • User process prepares message
  • User process sets send token
  • LANai sdmas message
  • LANai sends message
  • LANai goes down (the ACK is lost)
  • Driver reloads MCP into board
  • Driver resends all unacked messages
  • LANai sdmas message
  • LANai sends message
Receiver
  • User process provides receive buffer
  • User process sets recv token
  • LANai recvs message
  • LANai sends ACK
  • LANai rdmas message
  • LANai sends event to process
  • User process handles notification event
  • User process reuses buffer
  • LANai recvs message: duplicate message, ERROR!
Lack of redundant state information is the cause
of this problem
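
The duplicate scenario can be reproduced in a toy model. This is a sketch, not GM behaviour: `Receiver` and `crash_and_resend` are hypothetical, and the model assumes that a freshly reloaded MCP, having lost its sequence state, stamps the resent message with whatever number the receiver currently expects. Under that assumption the receiver cannot tell the resend from a new message.

```python
class Receiver:
    """Accepts only the next in-order sequence number."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1      # cumulative ACK


def crash_and_resend(host_keeps_seq):
    """Send one message, lose the ACK in a sender crash, then resend it.

    Returns the list of payloads the receiving process ends up with.
    """
    rx = Receiver()
    unacked = {"payload": "m0", "seq": 0}   # host-side record of the pending send

    rx.receive(unacked["seq"], unacked["payload"])   # delivered, but the ACK is lost
    # The sender's LANai goes down here: sequence state held in its SRAM is gone.

    if host_keeps_seq:
        seq = unacked["seq"]          # original number survives on the host
    else:
        seq = rx.expected             # ASSUMPTION: reloaded MCP issues a fresh number
    rx.receive(seq, unacked["payload"])
    return rx.delivered
```

With `host_keeps_seq=False` the message is delivered twice; with `host_keeps_seq=True` the resend carries its original number and is dropped as a duplicate, which is the redundancy the slide says is missing.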
16
Lost Messages
Sender
  • User process prepares message
  • User process sets send token
  • LANai sdmas message
  • LANai sends message
  • LANai receives ACK
  • LANai sends event to process
  • User process handles notification event
  • User process reuses buffer
Receiver
  • User process provides receive buffer
  • User process sets recv token
  • LANai recvs message
  • LANai sends ACK
  • LANai goes down
  • Driver reloads MCP into board
  • Driver sets all recv tokens again
  • LANai waits for message: ERROR!
Incorrect commit point is the cause of this
problem
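
The commit-point problem comes down to the order of two steps on the receiving interface. The following sketch makes that explicit; the function names and the two-step model are hypothetical simplifications, and the crash is assumed to fall exactly between the two steps.

```python
def receive_with_crash(ack_after_dma):
    """Model a receiving LANai that crashes between its two receive steps.

    Returns (sender_got_ack, message_in_host_memory).
    """
    if ack_after_dma:
        steps = ["dma_to_host", "send_ack"]
    else:
        steps = ["send_ack", "dma_to_host"]
    done = {steps[0]}                 # only the first step completes
    # The LANai goes down here; the second step never runs.
    return ("send_ack" in done, "dma_to_host" in done)


def message_lost(ack_after_dma):
    """A message is lost if the sender saw an ACK but the host never got the data."""
    acked, in_host = receive_with_crash(ack_after_dma)
    # Once ACKed, the sender releases its buffer and will not retransmit.
    return acked and not in_host
```

When the ACK goes out first, the sender discards its copy while the data exists only in the failed interface. When the ACK follows the DMA, the worst case is an unacknowledged message that the sender retransmits, which sequence numbers can then filter as a duplicate.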
17
Fault Recovery
  • We need to keep a copy of the state information
  • Checkpointing can be a big overhead
  • Logging critical message information is enough
  • GM functions are modified so that
      • A copy of the send tokens and the receive tokens
        is made with every send and receive call
      • The host processes provide the sequence numbers,
        one per (destination node, local port) pair
      • The copy of each send and receive token is
        removed when the send/receive completes
        successfully
  • MCP is modified
      • ACK is sent out only after a message is DMAed to
        host memory
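
A minimal sketch of the host-side logging described above, covering send tokens only (receive tokens would be logged the same way). `TokenLog` and its methods are hypothetical names, not GM functions; the only details taken from the slides are that the host supplies sequence numbers per (destination node, local port) pair and that a logged copy is dropped when the operation completes.

```python
class TokenLog:
    """Host-resident copy of outstanding send tokens (hypothetical structure)."""

    def __init__(self):
        self.next_seq = {}   # (dest_node, local_port) -> next sequence number
        self.pending = {}    # (dest_node, local_port, seq) -> payload

    def log_send(self, dest_node, local_port, payload):
        """Record a send before handing it to the interface; return its seq."""
        key = (dest_node, local_port)
        seq = self.next_seq.get(key, 0)
        self.next_seq[key] = seq + 1
        self.pending[(dest_node, local_port, seq)] = payload
        return seq

    def complete(self, dest_node, local_port, seq):
        """Drop the copy once the send has completed successfully."""
        self.pending.pop((dest_node, local_port, seq), None)

    def replay(self):
        """Sends to reissue after the MCP is reloaded, ordered within each stream."""
        return sorted(self.pending.items())
```

For example, after logging two sends to node 3 and one to node 4 and completing the first, `replay()` returns only the two unfinished sends with their original sequence numbers. The log grows and shrinks with the number of outstanding operations, which is why it is cheaper than checkpointing the whole interface state.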

18
Performance Impact
  • The scheme has been integrated successfully into
    GM
  • Over 1 man year for complete implementation
  • How much of the performance of the system has
    been compromised?
  • After all, one can't get a free lunch these days!
  • Performance is measured using two key parameters
  • Bandwidth obtained with large messages
  • Latency of small messages

19
Latency
20
Bandwidth
21
Summary of Results
Host platform: Pentium III with 256 MB, Red Hat Linux 7.2
22
Summary of Results
  • Fault detection latency: 50 ms
  • Fault recovery latency: 0.765 s
  • Per-process latency: 0.50 s
23
Our Contributions
  • We have devised smart ways to detect and recover
    from network interface failures
  • Our fault detection technique for network
    processor hangs uses software implemented
    watchdog timers
  • Fault recovery time (including reloading of the
    network control program) is under 2 seconds
  • Performance impact is under 1% for messages over
    1 KB
  • Complete user transparency was achieved