Title: Low Overhead Fault Tolerant Networking in Myrinet
1. Low Overhead Fault Tolerant Networking in Myrinet
- Architecture and Real-Time Systems (ARTS) Lab.
- Department of Electrical and Computer Engineering
- University of Massachusetts, Amherst, MA 01003
2. Motivation
- The increasing use of COTS components in systems has been motivated by the need to
  - Reduce design and maintenance cost
  - Reduce software complexity
- The emergence of low-cost, high-performance COTS networking solutions
  - e.g., Myrinet, SCI, Fibre Channel
- The increasing complexity of network interfaces has renewed concerns about their reliability
  - The amount of silicon used has increased tremendously
3. The Basic Question
How can we incorporate fault tolerance into a
COTS network technology without greatly
compromising its performance?
4. Microprocessor-based Networks
- Most modern network technologies have processors in their interface cards that help achieve superior network performance
- Many of these technologies allow changes to the program running on the network processor
- Such programmable interfaces offer numerous benefits
  - Developing different fault tolerance techniques
  - Validating fault recovery using fault injection
  - Experimenting with different communication protocols
- We use Myrinet as the platform for our study
5. Myrinet
- Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
- At its core is a powerful RISC processor
- It is scalable to thousands of nodes
- Low-latency communication (8 µs) is achieved through direct interaction with the network interface (OS bypass)
- Flow control, error control, and simple heartbeat mechanisms are incorporated in hardware
- The link and routing specifications are a public standard
- Myrinet support software is supplied as open source
6. Myrinet Configuration
[Block diagram of a host node attached to a LANai 9 card: the host processor, system memory, and system bridge connect over the I/O bus to the card's PCI bridge and PCIDMA engine; the card itself contains the RISC core, the LANai SRAM, three interval timers (0, 1, 2), a DMA engine, the host interface, the packet interface, and SAN/LAN conversion logic.]
7. Hardware / Software
[Layered diagram: on the host side, the application sits above middleware (e.g., MPI), the TCP/IP interface, and the OS driver, all running on the host processor and system memory; across the I/O bus, the Myrinet card provides a programmable interface with a network processor and local memory running the Myrinet Control Program.]
8. Susceptibility to Failures
- Dependability evaluation was carried out using software-implemented fault injection (sketched below)
- Faults were injected into the Myrinet Control Program (MCP)
- A wide range of failures was observed
  - Unexpected latencies and reduced bandwidth
  - The network processor can hang and stop responding
  - A host system can crash/hang
  - A remote network interface can be affected
- Similar types of failures can be expected from other high-speed networks
- Such failures can greatly impact the reliability/availability of the system
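To make the injection step concrete, the sketch below flips a single random bit in the MCP image. The mcp_sram pointer, its size, and the function name are assumptions for illustration; in practice the LANai SRAM is memory-mapped into the host by the driver.

```c
/* Sketch of software-implemented fault injection into the MCP image.
 * mcp_sram and mcp_size are hypothetical: the driver maps the LANai
 * SRAM into the host's address space. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

static void inject_fault(volatile uint8_t *mcp_sram, size_t mcp_size)
{
    size_t byte = rand() % mcp_size;          /* random location in the MCP */
    int    bit  = rand() % 8;                 /* random bit within that byte */
    mcp_sram[byte] ^= (uint8_t)(1u << bit);   /* single bit flip */
}
```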
9. Summary of Experiments

Failure Category          Count    % of Injections
Total                     2080     100

- More than 50% of the failures were host interface hangs
10. Design Considerations
- Faults must be detected and diagnosed as quickly as possible
- The network interface must be up and running again as soon as possible
- The recovery process must ensure that no messages are lost or improperly received/sent
  - Complete correctness should be achieved
- The overhead on the normal running of the system must be minimal
- The fault tolerance should be as transparent to the user as possible
11. Fault Detection
- Continuously polling the card can be very costly
- We instead use a spare interval timer to implement watchdog-timer functionality for fault detection (sketched below)
- The LANai is set to raise an interrupt when the timer expires
- A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer
- If the interface hangs, L_timer is not executed, causing the interval timer to expire and raise a FATAL interrupt
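A minimal sketch of the LANai-side arrangement follows. IT_SPARE and WATCHDOG_PERIOD are hypothetical names standing in for one of the LANai's interval-timer registers and its reload value.

```c
/* Watchdog sketch, LANai side. IT_SPARE is a placeholder for the
 * spare interval-timer special register. */
#define WATCHDOG_PERIOD 50000          /* assumed tick count */

extern volatile int IT_SPARE;          /* spare interval timer (placeholder) */

/* Called periodically from the MCP's dispatch loop. */
void L_timer(void)
{
    IT_SPARE = WATCHDOG_PERIOD;        /* re-arm: push expiry into the future */
    /* ... regular periodic housekeeping ... */
}

/* If the MCP hangs, L_timer never runs again, the spare timer counts
 * down to zero, and the hardware raises the FATAL interrupt that the
 * host-side recovery daemon fields. */
```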
12. Fault Recovery Summary
- The FATAL interrupt signal is picked up by the fault recovery daemon on the host (sketched below)
- The failure is verified through a number of probing messages
- The control program is reloaded into the LANai SRAM
- Any process that was accessing the board prior to the failure is also restored to its original state
- Simply reloading the MCP will not ensure correctness
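A sketch of the host-side recovery path is below; every function in it is an illustrative placeholder rather than an actual GM driver entry point.

```c
/* Host-side recovery sketch; all callee names are hypothetical. */
#define NPROBES 3

extern int  probe_interface(int nprobes);   /* nonzero if card responds */
extern void reset_lanai(void), load_mcp_image(void), start_lanai(void);
extern int  num_open_ports(void);
extern void restore_port_state(int port);
extern void resend_unacked_messages(void);

void on_fatal_interrupt(void)
{
    /* Verify the failure: a few probe messages weed out spurious traps. */
    if (probe_interface(NPROBES))
        return;                          /* interface still responding */

    reset_lanai();                       /* hold the RISC in reset */
    load_mcp_image();                    /* reload the MCP into LANai SRAM */
    start_lanai();

    /* Restore every process that had the board open before the fault. */
    for (int p = 0; p < num_open_ports(); p++)
        restore_port_state(p);           /* tokens, sequence numbers */

    resend_unacked_messages();           /* replay from the logged copies */
}
```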
13. Myrinet Programming Model
- Flow control is achieved through send and receive tokens (sketched below)
- The Myrinet software (GM) provides reliable, in-order delivery of messages
- A modified form of the Go-Back-N protocol is used
- Sequence numbers for the protocol are provided by the MCP
- One stream of sequence numbers exists per destination
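The sketch below captures the bookkeeping this model implies; the struct and field names are illustrative, not GM's actual definitions.

```c
/* Token and sequence-number bookkeeping, sketched with assumed names. */
#include <stdint.h>

#define MAX_NODES 1024                 /* assumed cluster size */

struct send_token {
    void    *buf;                      /* payload in registered memory */
    uint32_t len;
    uint16_t dest_node;                /* target node */
    uint8_t  dest_port;                /* target port on that node */
    uint16_t seqno;                    /* Go-Back-N sequence number */
};

struct recv_token {
    void    *buf;                      /* host buffer for the rDMA */
    uint32_t size;
};

/* One stream of sequence numbers per destination node, kept by the
 * MCP in unmodified GM. */
uint16_t next_seqno[MAX_NODES];
```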
14. Typical Control Flow
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message from host memory and sends it; on receiving the ACK, the LANai posts an event, and the user process handles the notification and reuses the buffer.
Receiver: the LANai receives the message, sends an ACK, rDMAs the message into the host buffer, and posts an event; the user process handles the notification and reuses the buffer.
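The same flow from the host's point of view might look like the following GM-flavored sketch, where provide_recv_buffer, post_send, wait_event, and consume are hypothetical stand-ins for the corresponding GM operations.

```c
/* Host-side view of the control flow; all calls are hypothetical. */
struct port;                                  /* opaque port handle */
struct event { int type; void *buf; unsigned len; };
enum { SEND_DONE, RECV_DONE };

extern void provide_recv_buffer(struct port *p, void *buf, unsigned size);
extern void post_send(struct port *p, void *buf, unsigned len,
                      int dest_node, int dest_port);
extern struct event *wait_event(struct port *p);
extern void consume(void *buf, unsigned len);

void exchange(struct port *port, int dest_node, int dest_port,
              void *sbuf, unsigned slen, void *rbuf, unsigned rsize)
{
    provide_recv_buffer(port, rbuf, rsize);            /* sets a recv token */
    post_send(port, sbuf, slen, dest_node, dest_port); /* sets a send token */

    for (int pending = 2; pending > 0; pending--) {
        struct event *ev = wait_event(port);           /* blocks on the LANai */
        if (ev->type == SEND_DONE) {
            /* ACK arrived: sbuf may now be reused */
        } else {                                       /* RECV_DONE */
            consume(ev->buf, ev->len);                 /* payload was rDMAed */
        }
    }
}
```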
15. Duplicate Messages
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message and sends it.
Receiver: the LANai receives the message, sends an ACK, rDMAs the message, and posts an event; the user process handles the notification and reuses the buffer.
The sender's LANai goes down, and the ACK is lost.
Sender: the driver reloads the MCP into the board and resends all unacked messages; the LANai sDMAs and sends the message again.
Receiver: the LANai receives the same message a second time. Duplicate message: ERROR!

The lack of redundant state information is the cause of this problem.
16. Lost Messages
Receiver setup: the user process provides a receive buffer and sets a recv token.
Sender: the user process prepares a message and sets a send token; the LANai sDMAs the message and sends it.
Receiver: the LANai receives the message and sends the ACK, then goes down before rDMAing the message to the host.
Sender: the LANai receives the ACK and posts an event; the user process handles the notification and reuses the buffer.
Receiver: the driver reloads the MCP into the board and sets all recv tokens again; the LANai waits for a message that will never arrive. ERROR!

The incorrect commit point (ACK sent before the rDMA completes) is the cause of this problem.
17. Fault Recovery
- We need to keep a copy of the state information
  - Checkpointing can be a big overhead
  - Logging critical message information is enough
- The GM functions are modified so that
  - A copy of the send and receive tokens is made with every send and receive call
  - The host processes provide the sequence numbers, one per (destination node, local port) pair
  - The copy of a send/receive token is removed when the send/receive completes successfully
- The MCP is modified so that an ACK is sent out only after the message has been DMAed to host memory (see the sketch below)
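The two modifications might look like the following sketch; all names (logged_send, mcp_deliver, and the helpers they call) are illustrative, not the actual GM or MCP symbols.

```c
/* Both modifications, sketched with assumed names and limits. */
#include <stdint.h>

#define MAX_NODES 1024
#define MAX_PORTS 8

struct port;                               /* opaque GM port handle */
struct message;                            /* incoming packet, MCP side */
struct send_token { void *buf; uint32_t len;
                    uint16_t dest_node, seqno; uint8_t dest_port; };

extern int  port_id(struct port *p);
extern void log_append(const struct send_token *t); /* dropped on completion */
extern void post_send_token(struct port *p, const struct send_token *t);

/* Host side (modified GM): log a copy of every token, and draw the
 * sequence number from a host-owned stream, one per (destination node,
 * local port) pair, so it survives an MCP reload. */
uint16_t next_seqno[MAX_NODES][MAX_PORTS];

void logged_send(struct port *port, void *buf, uint32_t len,
                 uint16_t dest_node, uint8_t dest_port)
{
    struct send_token t = {
        .buf = buf, .len = len, .dest_node = dest_node,
        .dest_port = dest_port,
        .seqno = next_seqno[dest_node][port_id(port)]++,
    };
    log_append(&t);                        /* replayed by the recovery daemon */
    post_send_token(port, &t);
}

extern void rdma_to_host(struct message *m);
extern void wait_dma_complete(void);
extern void send_ack(struct message *m);
extern void post_event_to_process(struct message *m);

/* LANai side (modified MCP): the commit point moves, so an ACKed
 * message can never be lost to a later interface failure. */
void mcp_deliver(struct message *msg)
{
    rdma_to_host(msg);                     /* commit to host memory first */
    wait_dma_complete();
    send_ack(msg);                         /* ...then acknowledge */
    post_event_to_process(msg);
}
```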
18. Performance Impact
- The scheme has been integrated successfully into GM
  - Over one man-year for the complete implementation
- How much of the system's performance has been compromised?
  - After all, one can't get a free lunch these days!
- Performance is measured using two key parameters (see the benchmark sketch below)
  - Bandwidth obtained with large messages
  - Latency of small messages
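For reference, a small-message latency measurement of this kind is typically a ping-pong loop; the sketch below reuses the hypothetical wrappers from the earlier sketch and assumes a now() timer that returns seconds with microsecond resolution.

```c
/* Ping-pong latency sketch; post_send and wait_for are the
 * hypothetical wrappers used earlier, now() an assumed timer. */
struct port;
enum { SEND_DONE, RECV_DONE };

extern void   post_send(struct port *p, void *buf, unsigned len,
                        int dest_node, int dest_port);
extern void   wait_for(struct port *p, int event_type);
extern double now(void);

#define ITERS 10000

double ping_pong_latency(struct port *port, int peer)
{
    char buf[8];                          /* small message */
    double t0 = now();
    for (int i = 0; i < ITERS; i++) {
        post_send(port, buf, sizeof buf, peer, 0);
        wait_for(port, SEND_DONE);
        wait_for(port, RECV_DONE);        /* peer echoes it back */
    }
    /* Half of the average round-trip time is the one-way latency. */
    return (now() - t0) / (2.0 * ITERS);
}
```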
19. Latency
[Plot: small-message latency results]
20. Bandwidth
[Plot: large-message bandwidth results]
21. Summary of Results

Host Platform: Pentium III, 256 MB memory, RedHat Linux 7.2
22. Summary of Results

Fault Detection Latency    50 ms
Fault Recovery Latency     0.765 s
Per-Process Latency        0.50 s
23. Our Contributions
- We have devised smart ways to detect and recover from network interface failures
- Our fault detection technique for network processor hangs uses software-implemented watchdog timers
- Fault recovery time (including reloading of the network control program) is under 2 seconds
- Performance impact is under 1% for messages over 1 KB
- Complete user transparency was achieved