Title: A new transformation scheme based on active replication strategy that tolerates failures
- Hamoudi Kalla, Alain Girault and Yves Sorel
Pop Art team and Aoste team
Paris, April 23, 2004
2 Outline
- Introduction
- Model and problem
- State of the art
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication media failures
  - Both processor and communication media failures
- Example
- Conclusion and future work
3 Introduction
[Diagram: design flow]
- High-level program -> Compiler -> Model of the algorithm
- Inputs of the heuristic: model of the algorithm, architecture specification, distribution constraints, execution times, real-time constraints, failure specification
- Fault-tolerant distribution and scheduling heuristic -> Fault-tolerant distributed static schedule
- Code generator -> Fault-tolerant distributed code
4 Models: Application algorithm
- Algorithm graph
[Figure: algorithm graph with operations I1, I2, A, B, C and O]
- I1 and I2 are input operations (sensors).
- O is an output operation (actuator).
- A, B and C are computation operations.
- A -> C is a data dependence.
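As a minimal sketch, an algorithm graph of this kind can be stored as a list of data dependences and ordered so that every operation comes after its producers. Only the dependence A -> C is stated on the slide; the other edges below are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Data dependences as (producer, consumer) pairs.
# Only ("A", "C") comes from the slide; the other edges are assumed.
edges = [("I1", "A"), ("I2", "B"), ("A", "C"), ("B", "C"), ("C", "O")]

# Build the mapping operation -> set of its predecessors.
predecessors = {}
for src, dst in edges:
    predecessors.setdefault(src, set())
    predecessors.setdefault(dst, set()).add(src)

# One execution order that respects every data dependence.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

Any order produced this way places sensors before computations and computations before the actuator, which is the precedence a static schedule has to respect.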
5 Models: Hardware architecture
- Architecture graph
[Figure: two architectures with processors P1, P2 and P3, one with point-to-point links L12, L13 and L23 and one with a multipoint link B1; detail of a processor with its operator, memory and communicators com1 and com2]
- P1, P2 and P3 are processors.
- L12, L13 and L23 are point-to-point communication links.
- B1 is a multipoint communication link.
- com1 and com2 are communicators.
6 Models: Component failures
- Only processors and communication media (point-to-point and multipoint) can fail.
- Failures can be characterized as transient or permanent.
- At most a fixed number of processors can fail (fail-stop).
- At most a fixed number of communication media can fail (fail-stop), partially or completely.
[Figure: three panels on architectures with processors P1, P2 and P3: processor failures, partial communication media failures, complete communication media failures]
7 Problem
- Find a distributed schedule of the algorithm on the architecture that tolerates processor and communication media failures.
[Figure: SynDEx distributes and schedules the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3 with links L12, L13 and L23)]
SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.
8 State of the art
- A system is fault-tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy.
[Figure: SynDEx distribution/scheduling of the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3 with links L12, L13 and L23), with an additional processor P4]
(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication media failures
10 State of the art
- A system is fault-tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy.
- Active software redundancy (Hashimoto et al., 2002 (a); Fragopoulou and Akl, 1995 (b)):
  (a) Multiple redundant copies of an operation are scheduled on different processors.
  (b) Multiple redundant copies of a message are sent along disjoint paths.
- Passive software redundancy (Qin et al., 2002 (a); Sriram et al., 1999 (b)):
  (a) Each operation is replicated as a primary copy and backup copies, but only the primary copy is executed.
  (b) One copy of the message is sent; if it fails, another copy is transmitted.
(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication media failures
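The difference between the two message strategies in (b) can be illustrated with a rough timing sketch. The route names, delays and the fixed `timeout` that a failed passive attempt costs are all assumptions made for this example; the cited papers may model detection and retransmission differently.

```python
def active_delivery_time(path_delays, failed_paths):
    """Active redundancy: all copies leave at time 0 on disjoint paths.

    The receiver uses the first copy that arrives on a surviving path.
    Returns None if every path has failed.
    """
    alive = [d for p, d in path_delays.items() if p not in failed_paths]
    return min(alive) if alive else None


def passive_delivery_time(path_delays, failed_paths, timeout):
    """Passive redundancy: one copy at a time, in dictionary order.

    Assumption: a failed attempt is detected after `timeout`, then the
    next path is tried. Returns None if every path has failed.
    """
    elapsed = 0.0
    for path, delay in path_delays.items():
        if path in failed_paths:
            elapsed += timeout
        else:
            return elapsed + delay
    return None


# Hypothetical delays on two disjoint routes.
delays = {"route1": 4.0, "route2": 6.0}
active_t = active_delivery_time(delays, failed_paths={"route1"})
passive_t = passive_delivery_time(delays, failed_paths={"route1"}, timeout=10.0)
print(active_t, passive_t)
```

Under these assumptions the active strategy pays for extra messages on every run, while the passive strategy pays the detection timeout only when a failure occurs.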
11 Outline
- Introduction
- Model and problem
- State of the art
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication media failures (point-to-point links)
  - Both processor and communication media failures
- Example
- Conclusion and future work
12 The proposed fault-tolerant method
Principle (1)
We use active software redundancy for both operations and communications.
Motivations
- It bounds the time needed to recover from failures.
- It makes the system predictable.
- It is easier to integrate into SynDEx.
13 The proposed fault-tolerant method
Principle (2)
[Diagram: the algorithm graph (Alg), NPF processor failures and NLF link failures are the inputs of a graph transformation, which produces a new Alg with redundancy and exclusion relations; the architecture graph (Arc) and the real-time and embedding constraints are the other inputs shown]
14 The proposed fault-tolerant method
Algorithm graph transformation (1): tolerating NPF processor failures
[Figure: (a) initial algorithm sub-graph with operations A and B; (b1) final algorithm sub-graph with NPF+1 replicas of A and NPF+1 replicas of B]
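A minimal sketch of this first transformation follows. The function name and the replica naming (A1, A2, ...) are illustrative, and the wiring is an assumption: the slide only gives the number of replicas, so the sketch connects every replica of a producer to every replica of its consumer.

```python
def replicate_operations(edges, npf):
    """Replace each operation by npf + 1 replicas.

    edges: list of (producer, consumer) data dependences.
    Assumption (not stated on the slide): each replica of the producer
    sends its result to each replica of the consumer.
    """
    def replicas(op):
        return [f"{op}{k}" for k in range(1, npf + 2)]

    new_edges = []
    for src, dst in edges:
        for s in replicas(src):
            for d in replicas(dst):
                new_edges.append((s, d))
    return new_edges


# Sub-graph A -> B, tolerating one processor failure.
transformed = replicate_operations([("A", "B")], npf=1)
print(transformed)
```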
15 The proposed fault-tolerant method
Algorithm graph transformation (2): tolerating NLF link failures
[Figure: (a) initial algorithm sub-graph with operations A and B; (b2) final algorithm sub-graph with one replica of A, one replica of B, and NLF+1 replicas of the data]
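The second transformation can be sketched in the same style: operations are left alone and each data dependence is copied NLF+1 times. The copy index and the function name are illustrative; the sketch assumes a later scheduling step maps the copies onto different links.

```python
def replicate_data(edges, nlf):
    """Return nlf + 1 copies of every data dependence.

    Each copy is tagged with an index (1 .. nlf + 1). Assumption: the
    scheduler later routes the copies over different links.
    """
    return [
        (src, dst, copy)
        for src, dst in edges
        for copy in range(1, nlf + 2)
    ]


# Sub-graph A -> B, tolerating one link failure.
copies = replicate_data([("A", "B")], nlf=1)
print(copies)
```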
16 The proposed fault-tolerant method
Algorithm graph transformation (3): tolerating NPF processor failures and NLF link failures
Case NPF = 1 and NLF = 1
17 The proposed fault-tolerant method
Algorithm graph transformation (4): tolerating NPF processor failures and NLF link failures
Case NPF = 1 and NLF = 1
18 The proposed fault-tolerant method
Algorithm graph transformation (5): tolerating NPF processor failures and NLF link failures
Case NPF > 1 and NLF > 1
[Figure: (a) initial algorithm sub-graph with operations A and B; (b) final algorithm sub-graph with NPF+1 replicas of A, NPF+1 replicas of B, and NLF routing operations R]
19 The proposed fault-tolerant method
[Diagram: the graph transformation takes the algorithm graph with NPF processor failures and NLF link failures and produces a new Alg with redundancy and exclusion relations (NPF+1 replicas of A, NPF+1 replicas of B, NLF routing operations R); together with the architecture graph Arc and the real-time and embedding constraints, this leads to a fault-tolerant distributed real-time executive]
20 The proposed fault-tolerant method
Implementation
- B1 will receive its input data NPF+NLF+1 times (here NPF = 1 and NLF = 1). As soon as it receives the first input, B1 is executed, and it ignores the later inputs.
- start time(B1) = min(end of the communications from A1, A2 and R)
[Figure: SynDEx maps a transformed algorithm sub-graph (two replicas of A, named A1 and A2; two replicas of B; a routing operation R; the data sent to B1) onto an architecture graph with processors P1 to P4 and links L12, L14, L23, L24 and L34, giving a temporary schedule drawn along a time axis]
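The first-arrival rule for B1 can be sketched as follows. The arrival times are made-up values, and using `None` for a copy that never arrives because of a failure is an assumption of this sketch.

```python
def expected_copies(npf, nlf):
    """Number of times a replica receives its input data (NPF + NLF + 1)."""
    return npf + nlf + 1


def start_time_first_arrival(arrival_times):
    """Start time of a replica that runs on its first received input.

    arrival_times: sender -> end time of its communication, or None if
    that copy never arrives. Later copies are simply ignored.
    """
    received = [t for t in arrival_times.values() if t is not None]
    if not received:
        raise RuntimeError("no copy of the input data arrived")
    return min(received)


# Hypothetical end-of-communication times for the three senders of B1.
no_failure = start_time_first_arrival({"A1": 5.0, "A2": 7.0, "R": 9.0})
# Same schedule when the copy from A1 is lost.
after_failure = start_time_first_arrival({"A1": None, "A2": 7.0, "R": 9.0})
print(expected_copies(1, 1), no_failure, after_failure)
```

Because every copy is already scheduled, a failure only shifts the start of B1 to the next scheduled arrival, which is what keeps the recovery time bounded.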
21 Outline
- Introduction
- Model and problem
- State of the art
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication media failures (multipoint links)
  - Both processor and communication media failures
- Example
- Conclusion and future work
22 The proposed fault-tolerant method
- We use active software redundancy of operations: each operation is replicated on NPF+1 different processors to tolerate NPF processor failures.
[Figure: algorithm sub-graph with replicas B1 and B2, architecture graph with processors P1 to P4, and the resulting temporary schedule]
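Why NPF+1 different processors are enough can be checked by brute force on a small example. The placement below is hypothetical (the slide does not give the mapping), and the function name is illustrative.

```python
from itertools import combinations


def tolerates_processor_failures(placement, processors, npf):
    """Check that every operation keeps a replica after any npf failures.

    placement: operation -> set of processors hosting its replicas.
    """
    for failed in combinations(processors, npf):
        failed = set(failed)
        for hosts in placement.values():
            if hosts <= failed:  # every replica of this operation is lost
                return False
    return True


processors = ["P1", "P2", "P3", "P4"]
# Hypothetical placement: each operation on NPF + 1 = 2 distinct processors.
placement = {"A": {"P1", "P2"}, "B": {"P3", "P4"}}

ok = tolerates_processor_failures(placement, processors, npf=1)
too_many = tolerates_processor_failures(placement, processors, npf=2)
print(ok, too_many)
```

With two replicas per operation, one failure always leaves a surviving replica, while two failures (for example P1 and P2) can remove both replicas of A.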
23 The proposed fault-tolerant method
- We use passive software redundancy of communications.
24 The proposed fault-tolerant method
Why data fragmentation?
- To distinguish between complete and partial communication link failures.
- To enable rapid recovery from processor and communication link failures.
25 The proposed fault-tolerant method
- Recovery from processor failures
26 The proposed fault-tolerant method
- Recovery from partial communication link failures
27 The proposed fault-tolerant method
- Recovery from complete communication media failures
28 Example (1)
29 Example (2)
30 Conclusion and future work
Result
- A new method to tolerate both communication link and processor failures in distributed real-time systems, which may reduce the overhead of recovery from failures.
Future work
- Benchmarks.
- Using passive redundancy to tolerate communication link failures.
- Taking sensor and actuator failures into account.
31 References
- Fragopoulou and Akl, 1995. Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS'95, Kingston, Canada.
- Sriram et al., 1999. Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528-533.
- Qin et al., 2002. Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.
- Hashimoto et al., 2002. Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.
32 Questions?