Title: 7. Fault Tolerance Through Dynamic (or Standby) Redundancy
- The lowest-cost fault-tolerance technique in multiprocessors.
- Steps performed:
- When a fault is detected, a fault location or diagnosis procedure is triggered.
- The faulty processor is then replaced by a spare processor or spare processing capability through reconfiguration.
- Finally, error recovery is performed, whereby the spare processor, typically using checkpointed information, takes over the computations of the faulty processor from where it left off.
- In summary, dynamic redundancy is performed in 3 steps:
- I. Fault detection and location
- II. Reconfiguration of the system around the faulty processor
- III. Error recovery
- Several approaches perform fault detection in
multiprocessors
- Scheduled off-line testing for permanent faults
- Duplication and comparison
- Diagnostics and coding techniques
Described next ...
- 7.1 Fault Detection in Multiprocessors
- 7.1.1 Fault Detection Through Duplication and
Comparison
- A) Each processor of the multiprocessor can be duplicated, and the results compared before being communicated to other processor pairs.
- B) Another approach is dividing the P processors of a multiprocessor into P/2 pairs. The global memory, which consists of M memory modules, can likewise be divided into M/2 pairs. Comparators can be kept inside each processor and memory module, and the results of both computations must match for an operation to be executed. If an error is detected by a processor pair, both processors of the pair are powered off, and the computations proceed on the P - 2 remaining processors, configured as (P - 2)/2 pairs of processors.
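The pairing scheme in B) can be sketched as follows. This is a minimal illustration, not a description of any particular machine: the class name `PairedMultiprocessor`, the `faulty` parameter, and the fault model (a faulty processor returns a result that is off by one) are all assumptions made for the example.

```python
class PairedMultiprocessor:
    """P processors grouped into P/2 duplicate-and-compare pairs."""

    def __init__(self, num_processors):
        if num_processors % 2:
            raise ValueError("P must be even to form P/2 pairs")
        ids = list(range(num_processors))
        self.pairs = [(ids[i], ids[i + 1]) for i in range(0, num_processors, 2)]

    def execute(self, pair, operation, operand, faulty=frozenset()):
        """Run `operation` on both processors of `pair` and compare results.

        `faulty` is a hypothetical fault model: any processor listed in it
        returns a corrupted (off-by-one) integer result.
        Returns the agreed result, or None if the pair was powered off.
        """
        def run(processor):
            result = operation(operand)
            return result + 1 if processor in faulty else result

        first, second = run(pair[0]), run(pair[1])
        if first != second:
            # Mismatch: both processors of the pair are powered off,
            # leaving (P - 2)/2 pairs for the remaining computations.
            self.pairs.remove(pair)
            return None
        return first
```

With P = 8, a mismatch in one pair leaves three pairs, which matches the (P - 2)/2 figure above. Note that comparison alone cannot tell which processor of the pair failed, which is why both are removed.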
- C) Alternatively, the comparison operation can
also be performed in software, by means of
checkpoint comparison techniques.
- D) Finally, the duplication and comparison operation can be performed by means of time redundancy. This is useful when one cannot afford the redundancy of duplication because of cost, weight, power, and space constraints (e.g., embedded, battery-powered electronics).
- In the presence of task dependencies (see example), one often finds processors that are idle, since there are no ready tasks. In such situations, one can map the original task graph on P/2 processors, get better processor utilization, and use the remaining P/2 processors to perform the duplicate computation of the task graph. Hence, in real task graphs, one can observe less than 100% time overhead.
[Figure: a) Original task graph mapping (tasks 1 to 7). b) Example of mapping duplicated task graphs on disjoint sets of processors (tasks 1 to 7 and their duplicates 1d to 7d; task groupings shown: 1 | 2,5 | 3,7 | 4,6 and 1d | 2d,5d | 3d,7d | 4d,6d).]
- 7.1.2 Fault Detection Using Diagnostics and Coding Techniques
- See 2.2 Information Redundancy
- 7.2 Recovery Strategies for Multiprocessor Systems
- Since most faults are transient or intermittent, a simple recovery procedure may be merely to re-execute the computation.
- Recovery issues are more complex in distributed systems (communicating processes): one has to ensure that the correct execution of one process is not affected by the faulty execution of a communicating process.
- Recovery techniques are different for distributed- and shared-memory multiprocessors: multiple processes can access memory and have different or erroneous copies of the same variables, creating an inconsistent state when the error is detected.
- Therefore, some scheme must be devised that stores enough error-free processor state information in a reliable place, from where it can be retrieved and used to restart the program (rollback recovery) from a consistent state in the event of a transient failure in one or more processors during program execution.
- The most popular scheme: checkpointing.
- It involves storing as much information about the processor state as necessary at discrete points (checkpoints, or rollback points) in the program, to ensure that the program can be rolled back to those points in the event of a node failure and restarted from there, as though no fault had occurred.
- The processor state varies from one system to another. Generally it involves the register set of the processor, the program counter, the state of the cache, and even memory as well, or at least those parts of it that have been altered by the processor since the last checkpoint.
- This information is stored in reliable storage, that is, memory assumed not to fail. Such a memory could be a disk, memory protected by error-correcting codes, or duplicated memory and/or registers.
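The checkpoint-and-rollback idea can be sketched in a few lines. This is an illustration under stated assumptions, not a real implementation: the processor state is reduced to a dictionary with a program counter `pc` and an accumulator `acc`, reliable storage is modeled by a deep copy held in memory, and the names `CheckpointedTask`, `run`, and `fault_at` are hypothetical.

```python
import copy


class CheckpointedTask:
    """Processor state plus a saved copy standing in for reliable storage."""

    def __init__(self, state):
        self.state = state
        self._stable = copy.deepcopy(state)  # assumed never to fail

    def checkpoint(self):
        self._stable = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._stable)


def run(steps, task, fault_at=None):
    """Execute `steps`, checkpointing after each one.

    `fault_at` injects one simulated transient fault before that step:
    the state is corrupted, the error is assumed to be detected, and the
    task rolls back to the last checkpoint and re-executes.
    Returns the number of rollbacks performed.
    """
    injected = False
    rollbacks = 0
    while task.state["pc"] < len(steps):
        pc = task.state["pc"]
        if pc == fault_at and not injected:
            injected = True
            task.state["acc"] = -999   # simulated corruption
            task.rollback()            # restore the last consistent state
            rollbacks += 1
            continue
        steps[pc](task.state)
        task.state["pc"] = pc + 1
        task.checkpoint()
    return rollbacks
```

Because the fault is transient, re-executing from the checkpoint produces the same final state as a fault-free run. Checkpointing after every step is chosen only for simplicity; a real system would trade checkpoint frequency against overhead.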
- 7.3 Rollback Recovery Using Checkpoints
- Rollback recovery using checkpoints is a very cost-effective method of providing fault tolerance against transient and intermittent faults.
- Various implementations and overhead issues are illustrated in the following.
- 7.3.1 Processor Cache-Based Checkpoints
[Figure: Processor-based checkpoint and rollback recovery; fault-tolerant technique to flush the cache. Labels: CPU registers (active state), cache, and main memory holding the CPU register save area (checkpoint state) in Bank A and Bank B. Numbered steps: 1 Ta1, 2 Flush, 3 Ta2, 4 Tb1, 5 Flush, 6 Tb2.]
Condition                 Failure   Action
Ta1 = Ta2 = Tb1 = Tb2     None      None
Ta1 > Ta2 = Tb1 = Tb2     Flush A   Copy Bank B to A
Ta1 = Ta2 > Tb1 = Tb2     Between   Copy Bank A to B
Ta1 = Ta2 = Tb1 > Tb2     Flush B   Copy Bank A to B
Failure Conditions.
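The failure-condition table can be read as a decision procedure on the four timestamps. The sketch below assumes that "gt" in the table stands for ">" and that adjacent timestamps without a marker are equal; the function name `recovery_action` and the returned strings are illustrative only.

```python
def recovery_action(ta1, ta2, tb1, tb2):
    """Map the four bank timestamps to (failure point, recovery action).

    Assumed reading of the table: the timestamps are written in the order
    Ta1, Ta2, Tb1, Tb2, so the first one that is newer than its successor
    shows where the checkpoint sequence was interrupted.
    """
    if ta1 == ta2 == tb1 == tb2:
        return ("none", "none")
    if ta1 > ta2 and ta2 == tb1 == tb2:
        # Failed while flushing to Bank A: Bank B still holds a good copy.
        return ("flush A", "copy Bank B to A")
    if ta1 == ta2 and ta2 > tb1 and tb1 == tb2:
        # Failed between the two flushes: Bank A is complete and newer.
        return ("between", "copy Bank A to B")
    if ta1 == ta2 == tb1 and tb1 > tb2:
        # Failed while flushing to Bank B: Bank A is complete.
        return ("flush B", "copy Bank A to B")
    raise ValueError("timestamps do not match any listed condition")
```

The two banks are what make the flush itself fault tolerant: at any instant at least one bank holds a complete, consistent checkpoint.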
- 7.3.2 Virtual Checkpoints
[Figure: Overview of single page mapping (basic concept). Labels: virtual memory, real memory, paging disk; pages j and k; active copy (v = V) and checkpoint copies (v < V).]
[Figure: Virtual checkpoint page states over checkpoint times tc1 and tc2, with version counter V = 0, 1, 2 and reference times tr0 to tr3. Case 1: first reference after checkpoint. Case 2: page previously referenced.]
[Figure: Primary and backup processes exchanging checkpointed state and "I am alive" messages.]
- The primary process checkpoints the state with the backup process.
- "I am alive" messages are used for fault detection.
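A minimal sketch of the primary/backup arrangement follows. It assumes a simulated clock passed in as `now` and a fixed `timeout`; these, together with the class and method names, are hypothetical choices for the example and are not taken from the slides.

```python
class BackupProcess:
    """Backup that holds the last checkpoint and watches for heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout        # assumed maximum silence tolerated
        self.last_heartbeat = 0
        self.checkpoint = None
        self.is_primary = False

    def receive_checkpoint(self, state):
        # The primary checkpoints its state with the backup.
        self.checkpoint = dict(state)

    def receive_heartbeat(self, now):
        # An "I am alive" message from the primary.
        self.last_heartbeat = now

    def tick(self, now):
        """Check for a missed heartbeat.

        Returns the checkpointed state to resume from if the backup takes
        over at this tick, otherwise None.
        """
        silent_for = now - self.last_heartbeat
        if not self.is_primary and silent_for > self.timeout:
            self.is_primary = True
            return dict(self.checkpoint) if self.checkpoint is not None else None
        return None
```

A timeout cannot distinguish a crashed primary from a slow one or a lost message, so the choice of `timeout` is a trade-off between detection latency and false takeovers.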
- 7.4 Rollback Recovery in Communicating Multiprocessors
- 7.4.1 Shared-Memory Multiprocessors
Bus Line                   Set by Processor to Indicate ...
Shared                     sharing a block on the bus.
Establish Rollback Point   that a rollback point is being established.
Rollback                   that it is backing up to the prior rollback point.
Bus lines.
- 7.4.2 Distributed-Memory Multiprocessors
[Figure: Domino effect in recovery of multiprocesses. Processes P1 and P2, with communication and checkpoint markers.]
Consistent and inconsistent recovery lines.
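The notions of a consistent recovery line and the domino effect can be made concrete with a small sketch. It uses the usual definition that a recovery line is inconsistent if some message is recorded as received before the receiver's checkpoint but sent after the sender's checkpoint (an orphan message). The data layout, function names, and the convention that every process has an initial checkpoint at time 0 are assumptions for the example; messages lost in transit are ignored.

```python
def is_consistent(line, messages):
    """line: {process: checkpoint time}.
    messages: list of (sender, t_send, receiver, t_recv)."""
    for sender, t_send, receiver, t_recv in messages:
        if t_send > line[sender] and t_recv <= line[receiver]:
            return False  # orphan message: received but never sent
    return True


def find_recovery_line(checkpoints, messages, failed, fail_time):
    """Roll processes back until no orphan message remains.

    checkpoints: {process: list of checkpoint times, including 0}.
    Processes that have not rolled back stay at `fail_time`.
    """
    line = {p: fail_time for p in checkpoints}
    line[failed] = max(c for c in checkpoints[failed] if c <= fail_time)
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            if t_send > line[sender] and t_recv <= line[receiver]:
                # The receiver must forget this message: roll it back to
                # its latest checkpoint taken before the receive.
                line[receiver] = max(c for c in checkpoints[receiver] if c < t_recv)
                changed = True
    return line
```

With checkpoints `{"P1": [0, 30, 70], "P2": [0, 50]}` and messages alternating between the two processes, a failure of P1 can force each rollback to trigger another, until both processes are back at their initial state. Coordinated checkpointing or message logging are the usual ways to avoid this cascade.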
- 7.4.3 Recovery in Distributed Shared-Memory Systems
- Typically, distributed shared-memory (DSM) systems are loosely coupled, geographically distributed systems of processors, each processor with its own memory.
- DSM is implemented using virtual memory: programmers see a single shared memory, which in reality is made up of individual memories residing in different processors.
- Pages are used as the basic blocks of memory transfer. Each node keeps in its own local memory a subset of the total number of pages from the shared virtual memory.
- A page fault is generated whenever a node tries to access a nonresident page. A page request is then generated and sent to a distinguished owner node that has a copy of the page needed. Upon reception of the page request, the owner node transfers the page to the requester, which then becomes the new owner.
- An owner node keeps a page table with information on the nodes that hold read-only copies of the pages it owns.
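The page-fault and ownership-transfer behavior described above can be sketched as follows. The class names, the central `owner_of` directory used to locate the owner, and the decision to move the page-table entry along with the page are assumptions made for brevity; real DSM systems locate owners and maintain coherence in more elaborate ways.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}      # owned page_id -> contents
        self.read_only = {}   # page_id -> read-only copy held locally
        self.page_table = {}  # owned page_id -> names of read-only holders


class SharedVirtualMemory:
    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}
        self.owner_of = {}    # hypothetical directory: page_id -> owner name
        self.page_faults = 0

    def allocate(self, owner, page_id, contents):
        node = self.nodes[owner]
        node.memory[page_id] = contents
        node.page_table[page_id] = set()
        self.owner_of[page_id] = owner

    def read_copy(self, requester, page_id):
        """Fetch a read-only copy; the owner records who holds it."""
        owner = self.nodes[self.owner_of[page_id]]
        if owner.name != requester:
            owner.page_table[page_id].add(requester)
            self.nodes[requester].read_only[page_id] = owner.memory[page_id]
        return owner.memory[page_id]

    def access(self, requester, page_id):
        """Access a page as owner; a nonresident page causes a page fault
        and the requester becomes the new owner."""
        node = self.nodes[requester]
        if self.owner_of[page_id] != requester:
            self.page_faults += 1
            old_owner = self.nodes[self.owner_of[page_id]]
            node.memory[page_id] = old_owner.memory.pop(page_id)
            holders = old_owner.page_table.pop(page_id)
            holders.discard(requester)
            node.read_only.pop(page_id, None)
            node.page_table[page_id] = holders
            self.owner_of[page_id] = requester
        return node.memory[page_id]
```

After `access("B", 7)` on a page owned by A, the page and its page-table entry live on B, and a second access from B causes no further fault.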
- A local checkpoint for such a system consists of the following information, which must be saved:
- a) the contents of locally owned pages that have been modified since the last checkpoint on the local node;
- b) the page-table entries for locally owned pages that have been modified since the last checkpoint.
- This is in addition to the state information of the local processor, which is also stored with each checkpoint in reliable storage.
- How the reliable storage is implemented depends upon the resources available, as well as on the level of reliability desired from the system.
- A process on a recovering processor is expected to retrieve any clean pages that it might need from previous checkpoints stored on disk, in addition to any dirty pages that were stored in the last checkpoint before failure.
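The incremental nature of these local checkpoints can be sketched as below. Only page contents and processor state are saved here; page-table entries would be handled the same way and are left out for brevity. The class name `LocalCheckpointer` and the in-memory list standing in for reliable storage are assumptions of the example.

```python
class LocalCheckpointer:
    """Saves only pages modified since the previous checkpoint."""

    def __init__(self):
        self.pages = {}     # locally owned page_id -> contents
        self.dirty = set()  # pages modified since the last checkpoint
        self.stable = []    # checkpoints in reliable storage, oldest first

    def write(self, page_id, contents):
        self.pages[page_id] = contents
        self.dirty.add(page_id)

    def checkpoint(self, processor_state):
        self.stable.append({
            "processor_state": dict(processor_state),
            "pages": {p: self.pages[p] for p in self.dirty},
        })
        self.dirty.clear()

    def recover(self):
        """Rebuild memory from all checkpoints (at least one is assumed).

        Dirty pages come from the last checkpoint; clean pages are
        retrieved from earlier checkpoints. Returns the processor state
        saved with the last checkpoint.
        """
        restored = {}
        for saved in self.stable:          # later versions overwrite earlier
            restored.update(saved["pages"])
        self.pages = restored
        self.dirty.clear()
        return dict(self.stable[-1]["processor_state"])
```

Saving only dirty pages keeps each checkpoint small, at the price of a recovery step that may have to read several older checkpoints.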
- 7.4.4 Recovery in Database Systems
- Database systems employ atomic actions known as transactions to maintain consistency and integrity in the presence of concurrent activities.
- Since transactions are atomic activities, in the event a transaction is aborted, its actions have to be undone to restore consistency to the system.
- Because of the all-or-nothing property of atomic actions, a significant amount of work might be abandoned needlessly when an internal error is encountered.
- Shadowing is a typical recovery-oriented mechanism in database systems. It involves using a new disk page to write the modified version of a database page. When the transaction completes (or commits), the page to which it was writing becomes the permanent page; if the transaction aborts, that page is discarded. Recovery is fast, since it only involves discarding the modified pages into which the transactions in the active list are writing.
- Thus, a scheme for distributed systems has been considered which uses pages as the indivisible unit of memory that is stored as part of a checkpoint and used for recovery.
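Shadowing can be illustrated with a small in-memory sketch. The class `ShadowPagedDB` and its methods are hypothetical; a real database would write shadow pages to disk and switch page pointers atomically at commit, which this example does not model.

```python
class ShadowPagedDB:
    """Transactions write to private shadow pages until they commit."""

    def __init__(self, pages):
        self.committed = dict(pages)  # permanent pages
        self.shadow = {}              # active transaction -> {page_id: contents}

    def begin(self, txn):
        self.shadow[txn] = {}

    def write(self, txn, page_id, contents):
        # The modified version goes to a new page, not the permanent one.
        self.shadow[txn][page_id] = contents

    def read(self, txn, page_id):
        return self.shadow[txn].get(page_id, self.committed[page_id])

    def commit(self, txn):
        # The shadow pages become the permanent pages.
        self.committed.update(self.shadow.pop(txn))

    def abort(self, txn):
        # The shadow pages are simply discarded.
        self.shadow.pop(txn)

    def recover(self):
        # After a failure, drop the pages of every active transaction.
        self.shadow.clear()
```

Because permanent pages are never overwritten in place before commit, recovery needs no undo work: discarding the shadow pages of the active transactions is enough.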