1
A new transformation scheme based on active
replication strategy that tolerates failures

Hamoudi Kalla, Alain Girault and Yves Sorel

Pop Art team and Aoste team
Paris, April 23, 2004
2
Outline

Introduction
Model and problem
State of the art
The proposed fault-tolerant method for tolerating
Processor failures
Communication media failures
Both processor and communication media failures
Example
Conclusion and future work

3
Introduction
High-level program
Compiler
Model of the algorithm
Architecture specification
Distribution constraints
Execution times
Real-time constraints
Failure specification
Fault-tolerant distribution and scheduling heuristic
Fault-tolerant distributed static schedule
Code generator
Fault-tolerant distributed code
4
Models: Application algorithm

Algorithm graph

[Figure: algorithm graph with the operations I1, I2, A, B, C and O]

I1 and I2 are input operations (sensors).
O is the output operation (actuator).
A, B and C are computation operations.
A → C is a data dependence.
5
Models: Hardware architecture

Architecture graph

[Figure: two architecture graphs over the processors P1, P2 and P3. One is an architecture with point-to-point links (L12, L13, L23), the other an architecture with multipoint links (B1). A third panel, labeled "Processor", shows an operator, a memory and the communicators com1 and com2.]

P1, P2 and P3 are processors.
L12, L13 and L23 are point-to-point communication links.
B1 is a multipoint communication link.
com1 and com2 are communicators.
6
Models: Component failures

Only processors and communication media
(point-to-point and multipoint) can fail.
Failures can be characterized as transient or
permanent.
At most a fixed number of processors can
fail-stop.
At most a fixed number of communication media
can fail-stop, partially or completely.

[Figure: three panels over the processors P1, P2 and P3, captioned "Processor failures", "Partial communication media failures" and "Complete communication media failures". The panels carry the labels L12, L13, L23 and m1.]
7
Problem

Find a distributed schedule of the algorithm on
the architecture that tolerates processor and
communication media failures.

[Figure: SynDEx performs the distribution/scheduling of the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3 with the links L12, L13, L23).]
SynDEx is a system-level CAD software tool for
optimizing the implementation of real-time
embedded applications on multicomponent
architectures.
8
State of the art

A system is fault tolerant if it can mask the
presence of faults in the system by using
hardware and/or software redundancy

[Figure: SynDEx distribution/scheduling of the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3 with the links L12, L13, L23). The figure also carries the label P4.]

(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication
media failures
10
State of the art

A system is fault tolerant if it can mask the
presence of faults in the system by using
hardware and/or software redundancy

Active software redundancy (Hashimoto et al., 2002 (a); Fragopoulou and Akl, 1995 (b))
(a) Multiple redundant copies of an operation are scheduled on different processors.
(b) Multiple redundant copies of a message are sent along disjoint paths.

Passive software redundancy (Qin et al., 2002 (a); Sriram et al., 1999 (b))
(a) Each operation is replicated as a primary copy and backup copies, but only the primary is executed.
(b) One copy of the message is sent, and if it fails, another copy is transmitted.

(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication
media failures
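
As a rough illustration of the two strategies for messages, the sketch below contrasts them. The `Path` type, the function names and the two toy paths are assumptions made for this example and do not come from the cited work. A path is modeled as a function that returns the message, or `None` if the path has failed.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model: a path delivers the message, or returns None if it failed.
Path = Callable[[str], Optional[str]]


def send_active(message: str, paths: List[Path]) -> Tuple[Optional[str], int]:
    """Active redundancy: send one copy on every path, keep the first that arrives."""
    deliveries = [path(message) for path in paths]
    received = next((d for d in deliveries if d is not None), None)
    return received, len(paths)  # every copy is sent, with or without failures


def send_passive(message: str, paths: List[Path]) -> Tuple[Optional[str], int]:
    """Passive redundancy: send one copy, and send another only if it fails."""
    copies_sent = 0
    for path in paths:
        copies_sent += 1
        delivered = path(message)
        if delivered is not None:
            return delivered, copies_sent
    return None, copies_sent


def working_path(message: str) -> Optional[str]:
    """Toy path that always delivers."""
    return message


def failed_path(message: str) -> Optional[str]:
    """Toy path that always loses the message."""
    return None
```

In this toy model the active version always sends every copy, while the passive version sends a single copy in the fault-free case and pays for extra transmissions only after a failure.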
11
Outline

Introduction
Model and problem
State of the art
The proposed fault-tolerant method for tolerating
Processor failures
Communication media failures (point-to-point
links)
Both processor and communication media failures
Example
Conclusion and future work

12
The Proposed fault-tolerant method
Principle (1)
We use active software redundancy for both
operations and communications.
Motivations

Makes the recovery from failures bounded.

Makes the system predictable.

Easier to integrate into SynDEx.

13
The Proposed fault-tolerant method
Principle (2)
Algorithm graph (Alg)
Graph transformation
NPF processor failures
NLF link failures
New Alg with redundancy and exclusion relations
Architecture graph (Arc)
Real-time and embedding constraints
14
The Proposed fault-tolerant method
Algorithm graph transformation (1): tolerating
NPF processor failures
[Figure: (a) initial algorithm sub-graph, with the operations A and B; (b1) final algorithm sub-graph, with NPF+1 replicas of A and NPF+1 replicas of B.]
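
A minimal sketch of this transformation, assuming the algorithm graph is given as a list of (producer, consumer) pairs. The function names, the replica naming (A1, A2, ...) and the choice to connect every producer replica to every consumer replica are assumptions of the example, not details stated on the slide.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (producer, consumer)


def replica_names(operation: str, npf: int) -> List[str]:
    """Names of the NPF+1 replicas of an operation, e.g. A1 and A2 for NPF = 1."""
    return [f"{operation}{k}" for k in range(1, npf + 2)]


def replicate_operations(edges: List[Edge], npf: int) -> List[Edge]:
    """Replace each operation by NPF+1 replicas.

    Assumption of this sketch: every replica of the consumer receives
    the data from every replica of the producer.
    """
    return [
        (src, dst)
        for producer, consumer in edges
        for src in replica_names(producer, npf)
        for dst in replica_names(consumer, npf)
    ]
```

For example, `replicate_operations([("A", "B")], 1)` yields the four dependences between A1, A2 and B1, B2.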
15
The Proposed fault-tolerant method
Algorithm graph transformation (2): tolerating
NLF link failures
[Figure: (a) initial algorithm sub-graph, with the operations A and B; (b2) final algorithm sub-graph, with one replica of A, one replica of B and NLF+1 replicas of the data.]
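
The link-only case can be sketched in the same style: the operations are kept as they are and each data dependence is carried NLF+1 times. The tuple layout and the function name are assumptions of this example.

```python
from typing import List, Tuple


def replicate_data(edges: List[Tuple[str, str]], nlf: int) -> List[Tuple[str, str, int]]:
    """Keep one replica of each operation, but carry each dependence NLF+1 times.

    Each result is (producer, consumer, copy_index), with copy_index in 1..NLF+1.
    """
    return [
        (producer, consumer, copy)
        for producer, consumer in edges
        for copy in range(1, nlf + 2)
    ]
```

How the copies are then mapped onto distinct links is left to the distribution and scheduling step and is not modeled here.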
16
The Proposed fault-tolerant method
Algorithm graph transformation (3): tolerating
NPF processor and NLF link failures
NPF = 1 and NLF = 1
17
The Proposed fault-tolerant method
Algorithm graph transformation (4): tolerating
NPF processor and NLF link failures
NPF = 1 and NLF = 1
18
The Proposed fault-tolerant method
Algorithm graph transformation (5): tolerating
NPF processor and NLF link failures
NPF > 1 and NLF > 1
[Figure: (a) initial algorithm sub-graph, with the operations A and B; (b) final algorithm sub-graph, with NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R.]
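
A small sketch of what one replica of B sees after the combined transformation, assuming NPF+1 replicas of the producer and NLF routing operations. The function name and the names R1, R2, ... are hypothetical, and how the routing operations are themselves fed is not modeled.

```python
from typing import List


def inputs_of_consumer_replica(producer: str, npf: int, nlf: int) -> List[str]:
    """Sources from which one consumer replica receives the same data.

    NPF+1 producer replicas plus NLF routing operations give
    NPF+NLF+1 copies of the data in total.
    """
    replicas = [f"{producer}{k}" for k in range(1, npf + 2)]
    routers = [f"R{k}" for k in range(1, nlf + 1)]
    return replicas + routers
```

With NPF = 1 and NLF = 1 this gives three sources: two replicas of A and one routing operation.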
19
The Proposed fault-tolerant method
Graph transformation
NPF processor failures
NLF link failures
New Alg with redundancy and exclusion relations
[Figure: final algorithm sub-graph with NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R.]
Architecture graph (Arc)
Real-time and embedding constraints
Fault-tolerant distributed real-time executive
20
The Proposed fault-tolerant method
Implementation

B1 receives its input data NPF+NLF+1 times (here NPF = 1 and NLF = 1, so three times). As soon as it receives the first input, B1 is executed, and it ignores the later inputs.

[Figure: SynDEx maps a transformed algorithm sub-graph (two replicas of A, named A1 and A2; a routing operation R; two replicas of B) onto an architecture graph with the processors P1 to P4 and the links L12, L14, L23, L24 and L34. The resulting temporary schedule is drawn against time.]

start time (B1) = min (end of communication from A1, A2, R)
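
The start-time rule can be written out as a short sketch. The dictionary layout, the use of `None` for a copy lost to a failure, and the numeric times in the usage note are assumptions of the example.

```python
from typing import Dict, Optional


def start_time(end_of_communication: Dict[str, Optional[float]]) -> float:
    """Earliest time at which the consumer can start.

    Maps each source (e.g. "A1", "A2", "R") to the time its copy of the
    data arrives, or to None if that copy was lost to a failure. The
    consumer starts on the first arrival and ignores the later copies.
    """
    arrivals = [t for t in end_of_communication.values() if t is not None]
    if not arrivals:
        raise RuntimeError("no copy arrived: more failures than tolerated")
    return min(arrivals)
```

For instance, with hypothetical arrival times `{"A1": 5.0, "A2": 3.0, "R": 4.0}` the function returns 3.0, and if the copy from A2 is lost it returns 4.0.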
21
Outline

Introduction
Model and problem
State of the art
The proposed fault-tolerant method for tolerating
Processor failures
Communication media failures (multipoint links)
Both processor and communication media failures
Example
Conclusion and future work

22
The Proposed fault-tolerant method

We use active software redundancy of
operations: each operation is replicated on
NPF+1 different processors to tolerate NPF
processor failures.

[Figure: algorithm sub-graph, architecture graph (P1, P2, P3, P4, with the labels B1 and B2) and temporary schedule.]
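
The sketch below shows why NPF+1 replicas on distinct processors are enough: whichever NPF processors fail, at least one replica is left. The placement rule (take the first NPF+1 processors) is a deliberately naive stand-in, not the scheduling heuristic of the presentation, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List


def place_replicas(operation: str, processors: List[str], npf: int) -> Dict[str, str]:
    """Put the NPF+1 replicas of an operation on NPF+1 distinct processors."""
    if npf + 1 > len(processors):
        raise ValueError("need at least NPF+1 processors")
    return {f"{operation}{k + 1}": processors[k] for k in range(npf + 1)}


def survives(placement: Dict[str, str], failed: Iterable[str]) -> bool:
    """True if at least one replica sits on a processor that has not failed."""
    failed_set = set(failed)
    return any(proc not in failed_set for proc in placement.values())


def tolerates(placement: Dict[str, str], processors: List[str], npf: int) -> bool:
    """Check every possible set of NPF failed processors."""
    return all(survives(placement, failed) for failed in combinations(processors, npf))
```

With four processors and NPF = 1, the two replicas land on two different processors, so any single processor failure leaves one replica running.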
23
The Proposed fault-tolerant method

Use the passive software redundancy of
communication

24
The Proposed fault-tolerant method
Why data fragmentation?

Distinguishes between complete and partial
communication link failures

Enables rapid recovery from processor and
communication link failures

25
The Proposed fault-tolerant method

Recovery from processor failures

26
The Proposed fault-tolerant method

Recovery from partial communication link failures

27
The Proposed fault-tolerant method

Recovery from complete communication media
failures

28
Example (1)
29
Example (2)
30
Conclusion and future work
Result

A new method to tolerate both communication
link and processor failures in distributed
real-time systems, which may reduce the
overhead of recovering from failures.

Future work

Benchmarks.
Using passive redundancy to tolerate
communication link failures.
Taking into account sensor and actuator
failures.

31
References
Fragopoulou and Akl, 1995. Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS'95, Kingston, Canada.

Sriram et al., 1999. Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528-533.

Qin et al., 2002. Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.

Hashimoto et al., 2002. Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.
32
Questions?
