Title: Low Overhead Fault Tolerant Networking in Myrinet
1. Low Overhead Fault Tolerant Networking in Myrinet
- Architecture and Real-Time Systems (ARTS) Lab.
- Department of Electrical and Computer Engineering
- University of Massachusetts, Amherst, MA 01003
2. Motivation
- The increasing use of COTS components in systems has been motivated by the need to
  - Reduce design and maintenance cost
  - Reduce software complexity
- The emergence of low-cost, high-performance COTS networking solutions
  - e.g., Myrinet, SCI, Fibre Channel
- The increasing complexity of network interfaces has renewed concerns about their reliability
  - The amount of silicon used has increased tremendously
3. The Basic Question
How can we incorporate fault tolerance into a
COTS network technology without greatly
compromising its performance?
4. Microprocessor-based Networks
- Most modern network technologies have processors in their interface cards that help achieve superior network performance
- Many of these technologies allow changes to the program running on the network processor
- Such programmable interfaces offer numerous benefits
  - Developing different fault tolerance techniques
  - Validating fault recovery using fault injection
  - Experimenting with different communication protocols
- We use Myrinet as the platform for our study
5. Myrinet
- Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
- At its core is a powerful RISC processor
- It is scalable to thousands of nodes
- Low-latency communication (8 µs) is achieved through direct interaction with the network interface (OS bypass)
- Flow control, error control, and simple heartbeat mechanisms are incorporated in hardware
- The link and routing specifications are a public standard
- Myrinet support software is supplied as open source
6. Myrinet Configuration
[Block diagram of a host node attached to a LANai 9 card: the host processor, system memory, and system bridge connect over the I/O bus to the card's PCI bridge and PCIDMA engine; the card itself contains the RISC core, the LANai SRAM, three interval timers (0, 1, 2), a DMA engine, the host interface, the packet interface, and SAN/LAN conversion logic.]
7. Hardware / Software
[Layered diagram: on the host side, the application sits above middleware (e.g., MPI), the TCP/IP interface, and the OS driver, all running on the host processor and system memory; across the I/O bus, the Myrinet card provides a programmable interface with a network processor and local memory running the Myrinet Control Program.]
8. Susceptibility to Failures
- Dependability evaluation was carried out using software-implemented fault injection (sketched below)
- Faults were injected into the Myrinet Control Program (MCP)
- A wide range of failures was observed
  - Unexpected latencies and reduced bandwidth
  - The network processor can hang and stop responding
  - A host system can crash/hang
  - A remote network interface can be affected
- Similar types of failures can be expected from other high-speed networks
- Such failures can greatly impact the reliability/availability of the system
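To make the injection step concrete, the sketch below flips a single random bit in the MCP image. The mcp_sram pointer, its size, and the function name are assumptions for illustration; in practice the LANai SRAM is memory-mapped into the host by the driver.

```c
/* Sketch of software-implemented fault injection into the MCP image.
 * mcp_sram and mcp_size are hypothetical: the driver maps the LANai
 * SRAM into the host's address space. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

static void inject_fault(volatile uint8_t *mcp_sram, size_t mcp_size)
{
    size_t byte = rand() % mcp_size;          /* random location in the MCP */
    int    bit  = rand() % 8;                 /* random bit within that byte */
    mcp_sram[byte] ^= (uint8_t)(1u << bit);   /* single bit flip */
}
```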
9. Summary of Experiments

Failure Category          Count    % of Injections
Total                     2080     100

- More than 50% of the failures were host interface hangs
10. Design Considerations
- Faults must be detected and diagnosed as quickly as possible
- The network interface must be up and running again as soon as possible
- The recovery process must ensure that no messages are lost or improperly received/sent
  - Complete correctness should be achieved
- The overhead on the normal running of the system must be minimal
- The fault tolerance should be as transparent to the user as possible
11. Fault Detection
- Continuously polling the card can be very costly
- We instead use a spare interval timer to implement watchdog-timer functionality for fault detection (sketched below)
- The LANai is set to raise an interrupt when the timer expires
- A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer
- If the interface hangs, L_timer is not executed, causing the interval timer to expire and raise a FATAL interrupt
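A minimal sketch of the LANai-side arrangement follows. IT_SPARE and WATCHDOG_PERIOD are hypothetical names standing in for one of the LANai's interval-timer registers and its reload value.

```c
/* Watchdog sketch, LANai side. IT_SPARE is a placeholder for the
 * spare interval-timer special register. */
#define WATCHDOG_PERIOD 50000          /* assumed tick count */

extern volatile int IT_SPARE;          /* spare interval timer (placeholder) */

/* Called periodically from the MCP's dispatch loop. */
void L_timer(void)
{
    IT_SPARE = WATCHDOG_PERIOD;        /* re-arm: push expiry into the future */
    /* ... regular periodic housekeeping ... */
}

/* If the MCP hangs, L_timer never runs again, the spare timer counts
 * down to zero, and the hardware raises the FATAL interrupt that the
 * host-side recovery daemon fields. */
```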
12. Fault Recovery Summary
- The FATAL interrupt signal is picked up by the fault recovery daemon on the host (sketched below)
- The failure is verified through a number of probing messages
- The control program is reloaded into the LANai SRAM
- Any process that was accessing the board prior to the failure is also restored to its original state
- Simply reloading the MCP will not ensure correctness
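A sketch of the host-side recovery path is below; every function in it is an illustrative placeholder rather than an actual GM driver entry point.

```c
/* Host-side recovery sketch; all callee names are hypothetical. */
#define NPROBES 3

extern int  probe_interface(int nprobes);   /* nonzero if card responds */
extern void reset_lanai(void), load_mcp_image(void), start_lanai(void);
extern int  num_open_ports(void);
extern void restore_port_state(int port);
extern void resend_unacked_messages(void);

void on_fatal_interrupt(void)
{
    /* Verify the failure: a few probe messages weed out spurious traps. */
    if (probe_interface(NPROBES))
        return;                          /* interface still responding */

    reset_lanai();                       /* hold the RISC in reset */
    load_mcp_image();                    /* reload the MCP into LANai SRAM */
    start_lanai();

    /* Restore every process that had the board open before the fault. */
    for (int p = 0; p < num_open_ports(); p++)
        restore_port_state(p);           /* tokens, sequence numbers */

    resend_unacked_messages();           /* replay from the logged copies */
}
```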
13. Myrinet Programming Model
- Flow control is achieved through send and receive tokens (sketched below)
- The Myrinet software (GM) provides reliable, in-order delivery of messages
- A modified form of the Go-Back-N protocol is used
- Sequence numbers for the protocol are provided by the MCP
- One stream of sequence numbers exists per destination
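The sketch below captures the bookkeeping this model implies; the struct and field names are illustrative, not GM's actual definitions.

```c
/* Token and sequence-number bookkeeping, sketched with assumed names. */
#include <stdint.h>

#define MAX_NODES 1024                 /* assumed cluster size */

struct send_token {
    void    *buf;                      /* payload in registered memory */
    uint32_t len;
    uint16_t dest_node;                /* target node */
    uint8_t  dest_port;                /* target port on that node */
    uint16_t seqno;                    /* Go-Back-N sequence number */
};

struct recv_token {
    void    *buf;                      /* host buffer for the rDMA */
    uint32_t size;
};

/* One stream of sequence numbers per destination node, kept by the
 * MCP in unmodified GM. */
uint16_t next_seqno[MAX_NODES];
```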
14. Typical Control Flow
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message from host memory and sends it; on receiving the ACK, the LANai posts an event, and the user process handles the notification and reuses the buffer.
Receiver: the LANai receives the message, sends an ACK, rDMAs the message into the host buffer, and posts an event; the user process handles the notification and reuses the buffer.
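The same flow from the host's point of view might look like the following GM-flavored sketch, where provide_recv_buffer, post_send, wait_event, and consume are hypothetical stand-ins for the corresponding GM operations.

```c
/* Host-side view of the control flow; all calls are hypothetical. */
struct port;                                  /* opaque port handle */
struct event { int type; void *buf; unsigned len; };
enum { SEND_DONE, RECV_DONE };

extern void provide_recv_buffer(struct port *p, void *buf, unsigned size);
extern void post_send(struct port *p, void *buf, unsigned len,
                      int dest_node, int dest_port);
extern struct event *wait_event(struct port *p);
extern void consume(void *buf, unsigned len);

void exchange(struct port *port, int dest_node, int dest_port,
              void *sbuf, unsigned slen, void *rbuf, unsigned rsize)
{
    provide_recv_buffer(port, rbuf, rsize);            /* sets a recv token */
    post_send(port, sbuf, slen, dest_node, dest_port); /* sets a send token */

    for (int pending = 2; pending > 0; pending--) {
        struct event *ev = wait_event(port);           /* blocks on the LANai */
        if (ev->type == SEND_DONE) {
            /* ACK arrived: sbuf may now be reused */
        } else {                                       /* RECV_DONE */
            consume(ev->buf, ev->len);                 /* payload was rDMAed */
        }
    }
}
```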
15. Duplicate Messages
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message and sends it.
Receiver: the LANai receives the message, sends an ACK, rDMAs the message, and posts an event; the user process handles the notification and reuses the buffer.
The sender's LANai goes down, and the ACK is lost.
Sender: the driver reloads the MCP into the board and resends all unacked messages; the LANai sDMAs and sends the message again.
Receiver: the LANai receives the same message a second time. Duplicate message: ERROR!

The lack of redundant state information is the cause of this problem.
16. Lost Messages
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message and sends it.
Receiver: the LANai receives the message and sends the ACK, then goes down before rDMAing the message to the host.
Sender: the LANai receives the ACK and posts an event; the user process handles the notification and reuses the buffer.
Receiver: the driver reloads the MCP into the board and sets all recv tokens again; the LANai waits for a message that will never arrive. ERROR!

The incorrect commit point (ACK sent before the rDMA completes) is the cause of this problem.
17. Fault Recovery
- We need to keep a copy of the state information
  - Checkpointing can be a big overhead
  - Logging critical message information is enough
- The GM functions are modified so that
  - A copy of the send and receive tokens is made with every send and receive call
  - The host processes provide the sequence numbers, one per (destination node, local port) pair
  - The copy of a send/receive token is removed when the send/receive completes successfully
- The MCP is modified so that an ACK is sent out only after the message has been DMAed to host memory (see the sketch below)
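The two modifications might look like the following sketch; all names (logged_send, mcp_deliver, and the helpers they call) are illustrative, not the actual GM or MCP symbols.

```c
/* Both modifications, sketched with assumed names and limits. */
#include <stdint.h>

#define MAX_NODES 1024
#define MAX_PORTS 8

struct port;                               /* opaque GM port handle */
struct message;                            /* incoming packet, MCP side */
struct send_token { void *buf; uint32_t len;
                    uint16_t dest_node, seqno; uint8_t dest_port; };

extern int  port_id(struct port *p);
extern void log_append(const struct send_token *t); /* dropped on completion */
extern void post_send_token(struct port *p, const struct send_token *t);

/* Host side (modified GM): log a copy of every token, and draw the
 * sequence number from a host-owned stream, one per (destination node,
 * local port) pair, so it survives an MCP reload. */
uint16_t next_seqno[MAX_NODES][MAX_PORTS];

void logged_send(struct port *port, void *buf, uint32_t len,
                 uint16_t dest_node, uint8_t dest_port)
{
    struct send_token t = {
        .buf = buf, .len = len, .dest_node = dest_node,
        .dest_port = dest_port,
        .seqno = next_seqno[dest_node][port_id(port)]++,
    };
    log_append(&t);                        /* replayed by the recovery daemon */
    post_send_token(port, &t);
}

extern void rdma_to_host(struct message *m);
extern void wait_dma_complete(void);
extern void send_ack(struct message *m);
extern void post_event_to_process(struct message *m);

/* LANai side (modified MCP): the commit point moves, so an ACKed
 * message can never be lost to a later interface failure. */
void mcp_deliver(struct message *msg)
{
    rdma_to_host(msg);                     /* commit to host memory first */
    wait_dma_complete();
    send_ack(msg);                         /* ...then acknowledge */
    post_event_to_process(msg);
}
```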
18. Performance Impact
- The scheme has been integrated successfully into GM
  - Over one man-year for the complete implementation
- How much of the system's performance has been compromised?
  - After all, one can't get a free lunch these days!
- Performance is measured using two key parameters (see the benchmark sketch below)
  - Bandwidth obtained with large messages
  - Latency of small messages
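For reference, a small-message latency measurement of this kind is typically a ping-pong loop; the sketch below reuses the hypothetical wrappers from the earlier sketch and assumes a now() timer that returns seconds with microsecond resolution.

```c
/* Ping-pong latency sketch; post_send and wait_for are the
 * hypothetical wrappers used earlier, now() an assumed timer. */
struct port;
enum { SEND_DONE, RECV_DONE };

extern void   post_send(struct port *p, void *buf, unsigned len,
                        int dest_node, int dest_port);
extern void   wait_for(struct port *p, int event_type);
extern double now(void);

#define ITERS 10000

double ping_pong_latency(struct port *port, int peer)
{
    char buf[8];                          /* small message */
    double t0 = now();
    for (int i = 0; i < ITERS; i++) {
        post_send(port, buf, sizeof buf, peer, 0);
        wait_for(port, SEND_DONE);
        wait_for(port, RECV_DONE);        /* peer echoes it back */
    }
    /* Half of the average round-trip time is the one-way latency. */
    return (now() - t0) / (2.0 * ITERS);
}
```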
19. Latency
[Plot: small-message latency results]
20. Bandwidth
[Plot: large-message bandwidth results]
21. Summary of Results

Host Platform: Pentium III, 256 MB memory, RedHat Linux 7.2
22. Summary of Results

Fault Detection Latency    50 ms
Fault Recovery Latency     0.765 s
Per-Process Latency        0.50 s
23. Our Contributions
- We have devised smart ways to detect and recover from network interface failures
- Our fault detection technique for network processor hangs uses software-implemented watchdog timers
- Fault recovery time (including reloading of the network control program) is under 2 seconds
- Performance impact is under 1% for messages over 1 KB
- Complete user transparency was achieved