A new transformation scheme based on active replication strategy that tolerates failures

1
A new transformation scheme based on active
replication strategy that tolerates failures
  • Hamoudi Kalla, Alain Girault and Yves Sorel

Pop Art team and Aoste team
Paris, April 23, 2004
2
Outline
  • Introduction
  • Model and problem
  • State of the art
  • The proposed fault-tolerant method for tolerating:
    • Processor failures
    • Communication media failures
    • Both processor and communication media failures
  • Example
  • Conclusion and future work

3
Introduction
  • High-level program → Compiler → Model of the algorithm
  • Inputs: architecture specification, distribution constraints, execution times, real-time constraints, failure specification
  • Distribution and scheduling fault-tolerant heuristic → Fault-tolerant distributed static schedule
  • Code generator → Fault-tolerant distributed code
4
Models: Application algorithm
  1. Algorithm graph

[Figure: algorithm graph with operations I1, I2, A, B, C and O]
  • I1 and I2 are input operations (sensors).
  • O is an output operation (actuator).
  • A, B and C are computation operations.
  • A → C is a data dependence.
5
Models: Hardware architecture
  1. Architecture graph

[Figure: two architecture graphs over processors P1, P2 and P3: one with point-to-point links L12, L13 and L23, and one with the multipoint link B1; inset of a processor made of a memory, an operator and communicators com1 and com2]
  • P1, P2 and P3 are processors.
  • L12, L13 and L23 are point-to-point communication links.
  • B1 is a multipoint communication link.
  • com1 and com2 are communicators.
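The two architecture graphs on this slide can be written down as a small data structure. The sketch below is only an illustration in Python: the class name `Architecture` and the method `neighbours` are hypothetical, and it assumes that the multipoint link B1 connects all three processors.

```python
from dataclasses import dataclass, field


@dataclass
class Architecture:
    """Processors plus communication media (hypothetical helper class).

    Each medium maps to the set of processors it connects.
    """
    processors: set
    media: dict = field(default_factory=dict)

    def neighbours(self, proc):
        """Processors reachable from `proc` through a single medium."""
        reachable = set()
        for endpoints in self.media.values():
            if proc in endpoints:
                reachable |= endpoints - {proc}
        return reachable


# Point-to-point variant: three links, each joining two processors.
point_to_point = Architecture(
    processors={"P1", "P2", "P3"},
    media={
        "L12": frozenset({"P1", "P2"}),
        "L13": frozenset({"P1", "P3"}),
        "L23": frozenset({"P2", "P3"}),
    },
)

# Multipoint variant: one bus B1, assumed here to be shared by all processors.
multipoint = Architecture(
    processors={"P1", "P2", "P3"},
    media={"B1": frozenset({"P1", "P2", "P3"})},
)
```

In this representation a point-to-point link is simply a medium with exactly two endpoints, so both variants can be handled by the same code.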
6
Models: Component failures
  1. Only processors and communication media
    (point-to-point and multipoint) can fail.
  2. Failures can be characterized as transient or
    permanent.
  3. At most a fixed number of processors can
    fail-stop.
  4. At most a fixed number of communication media
    can fail-stop, partially or completely.

[Figure: three failure scenarios on processors P1, P2 and P3: processor failures (point-to-point links L12, L13, L23); partial communication media failures (medium m1); complete communication media failures (medium m1)]
7
Problem
  • Find a distributed schedule of the algorithm on
    the architecture that tolerates processor and
    communication media failures.

[Figure: SynDEx distribution/scheduling of the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3; links L12, L13, L23)]
SynDEx is a system-level CAD software tool for
optimizing the implementation of real-time
embedded applications on multicomponent
architectures.
8
State of the art
  • A system is fault tolerant if it can mask the
    presence of faults in the system by using
    hardware and/or software redundancy

[Figure: SynDEx distribution/scheduling of the algorithm graph (I1, I2, A, B, C, O) onto the architecture graph (P1, P2, P3, P4; links L12, L13, L23)]
(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication
media failures
10
State of the art
  • A system is fault tolerant if it can mask the
    presence of faults in the system by using
    hardware and/or software redundancy
  1. Active software redundancy (Hashimoto et al.,
    2002 (a); Fragopoulou and Akl, 1995 (b))
    (a) Multiple redundant copies of an operation are
    scheduled on different processors.
    (b) Multiple redundant copies of a message are
    sent along disjoint paths.
  2. Passive software redundancy (Qin et al.,
    2002 (a); Sriram et al., 1999 (b))
    (a) Each operation is replicated as a primary copy
    and backup copies, but only the primary is executed.
    (b) One copy of the message is sent; if it fails,
    another copy is transmitted.

(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication
media failures
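The difference between the two redundancy styles can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the cited authors' algorithms: replicas are plain callables, a failure is modelled as a `RuntimeError`, and the names `run_active` and `run_passive` are hypothetical.

```python
def run_active(replicas):
    """Active redundancy: every replica executes; failures are masked."""
    results = []
    for replica in replicas:
        try:
            results.append(replica())
        except RuntimeError:
            pass  # a failed replica is masked by the remaining ones
    if not results:
        raise RuntimeError("all replicas failed")
    # Return the first available result and the number of executions attempted.
    return results[0], len(replicas)


def run_passive(primary, backups):
    """Passive redundancy: a backup executes only after the previous copy failed."""
    attempts = 0
    for replica in [primary, *backups]:
        attempts += 1
        try:
            return replica(), attempts
        except RuntimeError:
            continue  # failure detected, fall back to the next copy
    raise RuntimeError("primary and all backups failed")
```

In the fault-free case the active version always pays for every replica, while the passive version executes only the primary; in exchange, the passive version needs an extra execution after each detected failure before a result is available.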
11
Outline
  • Introduction
  • Model and problem
  • State of the art
  • The proposed fault-tolerant method for tolerating:
    • Processor failures
    • Communication media failures (point-to-point links)
    • Both processor and communication media failures
  • Example
  • Conclusion and future work

12
The Proposed fault-tolerant method
Principle (1)
We use active software redundancy for both
operations and communications.
Motivations
  • Keeps the recovery from failures bounded.
  • Makes the system predictable.
  • Is easier to integrate into SynDEx.

13
The Proposed fault-tolerant method
Principle (2)
  • Inputs: algorithm graph (Alg), NPF processor failures, NLF link failures
  • Graph transformation → new Alg with redundancy and exclusion relations
  • Further inputs: architecture graph (Arc), real-time and embedding constraints
14
The Proposed fault-tolerant method
Algorithm graph transformation (1): Tolerating
NPF processor failures
[Figure: (a) initial algorithm sub-graph with operations A and B; (b1) final algorithm sub-graph with NPF+1 replicas of A and NPF+1 replicas of B]
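A minimal sketch of this transformation, written in Python for illustration only. The function name `replicate_operations` and the replica naming (A1, A2, ...) are hypothetical, and the sketch assumes that every replica of a producer sends its data to every replica of its consumer, which the slide does not state explicitly.

```python
def replicate_operations(edges, npf):
    """Replace each operation by NPF+1 replicas.

    `edges` lists the data dependences (producer, consumer) of the initial
    algorithm graph. Assumption: replicas are connected all-to-all.
    """
    def replicas(op):
        # Replica names A1, A2, ... are an illustrative convention.
        return [f"{op}{k}" for k in range(1, npf + 2)]

    return [
        (src_replica, dst_replica)
        for src, dst in edges
        for src_replica in replicas(src)
        for dst_replica in replicas(dst)
    ]


# Initial sub-graph A -> B with NPF = 1 gives two replicas of each operation.
transformed = replicate_operations([("A", "B")], npf=1)
```

With NPF = 1 the single dependence A → B becomes four dependences between {A1, A2} and {B1, B2}, so each replica of B still receives its input if one processor, and hence one replica of A, is lost.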
15
The Proposed fault-tolerant method
Algorithm graph transformation (2): Tolerating
NLF link failures
[Figure: (a) initial algorithm sub-graph with operations A and B; (b2) final algorithm sub-graph with one replica of A, one replica of B and NLF+1 replicas of the data]
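The link-failure case can be sketched in the same spirit: operations are kept as single copies and each data dependence is sent NLF+1 times. The Python below is an illustration under assumptions that are not on the slide: the function name `replicate_data` is hypothetical, and each copy is assumed to travel over its own link-disjoint route supplied by the caller.

```python
def replicate_data(edges, nlf, routes):
    """Turn each data dependence into NLF+1 copies, one per route.

    `routes` is a hypothetical list of routes, each a tuple of link names.
    At least NLF+1 link-disjoint routes are needed so that NLF link
    failures cannot cut every copy.
    """
    if len(routes) < nlf + 1:
        raise ValueError("need at least NLF+1 routes")
    chosen = routes[:nlf + 1]
    used_links = [link for route in chosen for link in route]
    if len(used_links) != len(set(used_links)):
        raise ValueError("routes must not share a link")
    return [(src, dst, route) for src, dst in edges for route in chosen]


# A on P1 and B on P2 (assumed placement): direct link L12, or via P3.
copies = replicate_data([("A", "B")], nlf=1, routes=[("L12",), ("L13", "L23")])
```

Here the data from A to B is sent twice, once over L12 and once over L13 then L23, so a single link failure leaves at least one copy intact.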
16
The Proposed fault-tolerant method
Algorithm graph transformation (3): Tolerating
NPF processor and NLF link failures
NPF = 1 and NLF = 1
17
The Proposed fault-tolerant method
Algorithm graph transformation (4): Tolerating
NPF processor and NLF link failures
NPF = 1 and NLF = 1
18
The Proposed fault-tolerant method
Algorithm graph transformation (5): Tolerating
NPF processor and NLF link failures
NPF > 1 and NLF > 1
[Figure: (a) initial algorithm sub-graph with operations A and B; (b) final algorithm sub-graph with NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R]
19
The Proposed fault-tolerant method
  • Inputs: algorithm graph, NPF processor failures, NLF link failures
  • Graph transformation → new Alg with redundancy and exclusion relations (NPF+1 replicas of A, NPF+1 replicas of B, NLF routing operations R)
  • New Alg, architecture graph (Arc), and real-time and embedding constraints → fault-tolerant distributed real-time executive
20
The Proposed fault-tolerant method
Implementation
  • B1 receives its input data NPF+NLF+1 times
    (here NPF = 1 and NLF = 1). As soon as the first
    input arrives, B1 is executed; it ignores the
    later inputs.

[Figure: SynDEx maps a transformed algorithm sub-graph (two replicas of A, named A1 and A2; two replicas of B; routing operation R; data dependence "data") onto an architecture graph with processors P1 to P4 and links L12, L14, L23, L24 and L34, and produces a temporary schedule over time]
start time(B1) = min(end of communication from A1, A2, R)
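The start-time rule for B1 can be illustrated with a short Python sketch. The function name `start_time`, the `failed` parameter and the numeric times are hypothetical; only the rule itself (B1 starts at the earliest end of communication among A1, A2 and R) comes from the slide.

```python
def start_time(arrivals, failed=()):
    """Earliest time B1 can start: when the first copy of its input arrives.

    `arrivals` maps each sender ("A1", "A2", "R") to the end time of its
    communication; senders listed in `failed` never deliver their copy.
    """
    live = [t for sender, t in arrivals.items() if sender not in failed]
    if not live:
        raise RuntimeError("no copy of the input arrived")
    return min(live)


# Illustrative end-of-communication times, not taken from the slides.
arrivals = {"A1": 4.0, "A2": 6.0, "R": 7.5}
nominal = start_time(arrivals)                  # first copy wins
degraded = start_time(arrivals, failed={"A1"})  # A1's copy is lost
```

In the fault-free case B1 starts as soon as the copy from A1 arrives and the later copies are ignored; if that copy is lost, B1 simply starts when the next one arrives, with no separate failure-detection step.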
21
Outline
  • Introduction
  • Model and problem
  • State of the art
  • The proposed fault-tolerant method for tolerating:
    • Processor failures
    • Communication media failures (multipoint links)
    • Both processor and communication media failures
  • Example
  • Conclusion and future work

22
The Proposed fault-tolerant method
  1. We use active software redundancy of
    operations: each operation is replicated on
    NPF+1 different processors to tolerate NPF
    processor failures.

[Figure: algorithm sub-graph, architecture graph and temporary schedule, showing P1, P2, P3, P4, B1 and B2]
23
The Proposed fault-tolerant method
  1. Use the passive software redundancy of
    communication

24
The Proposed fault-tolerant method
Why data fragmentation?
  1. It distinguishes complete from partial
    communication link failures.
  2. It enables rapid recovery from processor and
    communication link failures.

25
The Proposed fault-tolerant method
  1. Recovery from processor failures

26
The Proposed fault-tolerant method
  1. Recovery from partial communication link failures

27
The Proposed fault-tolerant method
  1. Recovery from complete communication media
    failures

28
Example (1)
29
Example (2)
30
Conclusion and future work
Result
  • A new method to tolerate both communication
    link and processor failures in distributed
    real-time systems, which may reduce the
    overhead of recovery from failures.

Future work
  • Benchmarks.
  • Using passive redundancy to tolerate
    communication link failures.
  • Taking into account sensor and actuator
    failures.

31
References
Fragopoulou and Akl, 1995. Fragopoulou, P. and
Akl, S.G. (1995). Fault tolerant communication
algorithms on the star network using disjoint
paths. In Proceedings of the 28th Hawaii
International Conference on System Sciences,
HICSS'95, Kingston, Canada.
Sriram et al., 1999. Sriram, R., Manimaran,
G., and Murthy, C.S.R. (1999). An integrated
scheme for establishing dependable real-time
channels in multihop networks. In Proc. ICCCN,
pages 528–533.
Qin et al., 2002. Qin, X., Jiang, H., and
Swanson, D.R. (2002). An efficient fault-tolerant
scheduling algorithm for real-time tasks with
precedence constraints in heterogeneous systems.
In Proceedings of the 31st International
Conference on Parallel Processing, Vancouver,
Canada.
Hashimoto et al., 2002. Hashimoto, K.,
Tsuchiya, T., and Kikuno, T. (2002). Effective
scheduling of duplicated tasks for fault
tolerance in multiprocessor systems. IEICE
Transactions on Information and Systems.
32
Questions ?