Title: Static and Dynamic Fault Diagnosis
1Static and Dynamic Fault Diagnosis
- Richard Beigel
- Univ. Illinois at Chicago
- and DIMACS
2Nonstandard computing architectures
- Perceptrons and small-depth circuits
- Optically interconnected multiprocessors
- DNA computing
- Self-diagnosing Systems
3Brief history of system-level fault diagnosis
- Preparata et al 67
- static, nonadaptive
- Nakajima 81
- static, adaptive, serial
- Hakimi Nakajima 84
- static, adaptive, parallel
4Recent advances in system-level diagnosis
- Distributed diagnosis
- Diagnosing intermittent faults
- Diagnosis with errors
- Fast parallel diagnosis of static faults
- Ongoing diagnosis and repair of dynamic faults
5Fault diagnosis problem
- Given n processors
- a primitive by which each processor can test any other
- a reliable external controller that observes test results
- Determine which are good and which are faulty
- Assume perfect communication in a complete network
6What's so hard about that?
Say Ah
Ha Ha!
OK, you pass
Faulty processors may give incorrect test results
7Possible test results
8A majority of processors must be good for diagnosis to be possible
We're all good. They're all faulty.
We're all good. They're all faulty.
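The standoff above can be checked mechanically. The sketch below is illustrative only: it assumes good processors report truthfully and that the faulty half answers exactly as if it were the good half. Under those assumptions both scenarios yield the same set of test results, so no controller can tell them apart. The function `syndrome` and its parameters are hypothetical names.

```python
def syndrome(n, good, claimed_good_by_faulty):
    """All pairwise test results; True means "tester calls testee good".

    good: set of processors that really are good (they report the truth).
    claimed_good_by_faulty: the set that faulty processors pretend is good.
    """
    results = {}
    for tester in range(n):
        for testee in range(n):
            if tester == testee:
                continue
            view = good if tester in good else claimed_good_by_faulty
            results[(tester, testee)] = testee in view
    return results

n = 6
half_a = {0, 1, 2}
half_b = {3, 4, 5}

# Scenario 1: A is good; faulty B claims "B is good, A is faulty".
scenario_1 = syndrome(n, good=half_a, claimed_good_by_faulty=half_b)
# Scenario 2: B is good; faulty A claims "A is good, B is faulty".
scenario_2 = syndrome(n, good=half_b, claimed_good_by_faulty=half_a)

indistinguishable = scenario_1 == scenario_2
```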
9Serial diagnosis of static faults
- n processors, at most t faults, t < n/2
- Nonadaptive diagnosis
- n(t+1) tests are necessary and sufficient
- Preparata et al 67
- Adaptive diagnosis
- n+t-1 tests are necessary and sufficient
- Nakajima 81
10Distributed diagnosis of static faults
- In the distributed diagnosis model there is no central controller, and all good processors must learn the status of the other processors.
- Distributed diagnosis is reducible to the cooperative collect problem, and can be solved with tests Aspnes-Hurwood 96
11INTERMITTENT FAULTS AND ERRORS
- Work in progress by Beigel and Fu
12Intermittent faults
- An intermittent fault may appear faulty in some tests and good in others
- We cannot hope to diagnose intermittent faults as such because they might exhibit consistent behavior in all tests
- Goal: correctly diagnose all other processors
13Errors
- An error is a misdiagnosis by a good processor.
- Note the similarity to an intermittent fault
faulty
good
good
14Results
- In rounds, we can perform static diagnosis assuming that a majority of the processors are good and at most t of them are intermittently faulty.
- In rounds, we can perform static diagnosis in the presence of errors. Assuming at most t errors per round, the results will be within of a correct diagnosis.
15PARALLEL DIAGNOSIS OF STATIC FAULTS
- Perform many tests simultaneously
16Parallel diagnosis of static faults
- 84 Hakimi, Schmeichel: O(n/log n)
- 90 Schmeichel, Hakimi, Otsuka, Sullivan: O(log n)
- 89 Beigel, Kosaraju, Sullivan: O(1)
- 93 Beigel, Margulis, Spielman: 32
- 94 Beigel, Hurwood, Kahale: 10
- best lower bound: 5
17Digraphs
- tester → testee
- testing round = directed matching
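A round of tests modeled as a set of directed edges can be validated with a few lines. This is a minimal sketch, assuming a round is a list of (tester, testee) pairs in which no processor appears more than once; `is_directed_matching` is a hypothetical helper name.

```python
def is_directed_matching(tests):
    """True if no processor takes part in more than one test of the round."""
    busy = set()
    for tester, testee in tests:
        if tester == testee or tester in busy or testee in busy:
            return False
        busy.add(tester)
        busy.add(testee)
    return True

valid_round = [(0, 1), (2, 3), (5, 4)]
invalid_round = [(0, 1), (1, 2)]   # processor 1 would be in two tests at once
```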
18SHOS 90 generates a large mutual admiration society
- MAS = strongly connected component with all good edges
- Either
- all nodes good, or
- all nodes faulty
(Figure: digraph in which every edge is labeled g, for a good test result)
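Given a table of test results, mutual admiration societies can be found with a standard strongly-connected-components pass over the "good" edges. The sketch below uses Tarjan's algorithm; the input format (a dict from (tester, testee) to a boolean) and the function names are assumptions for illustration, not the SHOS procedure itself.

```python
def strongly_connected_components(n, adj):
    """Tarjan's algorithm on nodes 0..n-1; returns a list of node sets."""
    index, low = {}, {}
    stack, on_stack, comps = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(comp)

    for v in range(n):
        if v not in index:
            visit(v)
    return comps

def mutual_admiration_societies(n, results):
    """results[(tester, testee)] is True when the tester said "good"."""
    adj = [[] for _ in range(n)]
    for (tester, testee), says_good in results.items():
        if says_good:
            adj[tester].append(testee)
    societies = []
    for comp in strongly_connected_components(n, adj):
        if len(comp) < 2:
            continue
        # Keep only components with no internal "faulty" verdict.
        if all(ok for (u, v), ok in results.items() if u in comp and v in comp):
            societies.append(comp)
    return societies

example = {(0, 1): True, (1, 2): True, (2, 0): True,
           (2, 3): False, (3, 4): True, (4, 3): True, (3, 0): False}
found = sorted(sorted(s) for s in mutual_admiration_societies(5, example))
```

In the example, processors 0, 1, 2 form one society and 3, 4 another; the verdicts alone do not say which society is the good one.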
19SHOS 90: O(log n) pairing algorithm
20What about processors that don't like each other?
- Build one chain for each good processor we found (4 rounds)
- Most chains must have a good processor in each level (count!)
- Total 4 1 rounds
21Beigel-Margulis-Spielman 94
- nonconstructive (32 rounds)
- Find several MASs of size including at least one good MAS
- Large MASs test each other and all remaining processors in 4 rounds
- constructive (84 rounds)
- Find several MASs of size including at least one good MAS
- Large MASs test each other and all remaining processors in 6 rounds
22Expander graphs guarantee a good big MAS
- In the Cayley graphs of Margulis and LPS with p = 37, every n/2-node induced subgraph contains a strong component of size
- (cf. Alon, Chung 88, who find long paths)
- degree of undirected graph = 38
- 78 directed matchings cover graph
- 78 + 6 = 84 rounds
23Random graphs guarantee a good big MAS
- If G consists of 14 directed Hamiltonian paths on n vertices then, whp, every n/2-node induced subgraph contains a strong component of size
- 28 directed matchings cover graph
- 28 + 4 = 32 rounds
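The count of 28 matchings follows because each directed Hamiltonian path splits into two directed matchings: its odd-numbered edges and its even-numbered edges. A minimal sketch, assuming the 14 paths are drawn as independent uniformly random orderings of the vertices (an assumption about how the random graph is sampled); all function names are hypothetical.

```python
import random

def random_hamiltonian_paths(n, count, rng):
    """Each path is a random ordering of the vertices 0..n-1."""
    paths = []
    for _ in range(count):
        order = list(range(n))
        rng.shuffle(order)
        paths.append(order)
    return paths

def path_to_matchings(path):
    """Split a directed path into two vertex-disjoint sets of edges."""
    edges = list(zip(path, path[1:]))
    return edges[0::2], edges[1::2]

def is_matching(edges):
    """True if no vertex appears in more than one edge."""
    used = [v for edge in edges for v in edge]
    return len(used) == len(set(used))

rng = random.Random(0)
paths = random_hamiltonian_paths(n=50, count=14, rng=rng)
matchings = [m for path in paths for m in path_to_matchings(path)]
```

Each matching can then be run as one testing round, which gives the 28 rounds spent on the graph itself.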
24Beigel-Hurwood-Kahale 95 speeds up BMS 94
- In k+1 rounds build MASs of size
- also build one chain of don't-likes
- each MAS can be in simultaneous tests
- Perform G's directed matchings in 1 round
- Process chain in 2 or 3 more rounds
- Constructive: 13 rounds. Nonconstructive: 10 rounds.
25Lower bound; upper bound for smaller t
- n processors, at most t faults
- If 5 rounds are necessary
- If 4 rounds suffice
- algorithm uses lower-degree expanders
26DIAGNOSIS AND REPAIR OF DYNAMIC FAULTS
- Processors fail each round,
- but algorithm may order repairs
27Ongoing diagnosis and repair of dynamic faults
- Processors may fail each round, but algorithm may order repairs
- In each round
- 1. perform tests
- 2. direct that up to t processors are repaired
- 3. at most t processors fail
- Goal: bound number of faults at all times
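The three-step round structure can be exercised with a small simulation harness. Everything below is a sketch under stated assumptions: tests form a random directed matching, good testers are truthful, faulty testers answer at random, the adversary fails t random healthy processors, and `repair_accused` is a deliberately naive placeholder policy, not the algorithm from these slides.

```python
import random

def run_round(n, t, faulty, choose_repairs, rng):
    """One round: tests, then up to t repairs, then up to t new failures."""
    # 1. Perform tests along a random directed matching.
    order = list(range(n))
    rng.shuffle(order)
    results = {}
    for tester, testee in zip(order[0::2], order[1::2]):
        truth = testee not in faulty
        results[(tester, testee)] = truth if tester not in faulty else rng.choice([True, False])
    # 2. The algorithm may order at most t repairs.
    repairs = list(choose_repairs(results))[:t]
    faulty = faulty - set(repairs)
    # 3. At most t processors fail (here: exactly t, chosen at random).
    healthy = [p for p in range(n) if p not in faulty]
    return faulty | set(rng.sample(healthy, min(t, len(healthy))))

def repair_accused(results):
    """Naive placeholder policy: repair whoever was called faulty."""
    return [testee for (_, testee), says_good in results.items() if not says_good]

def simulate(n, t, rounds, choose_repairs, rng):
    """Return the number of faulty processors after each round."""
    faulty, history = set(), []
    for _ in range(rounds):
        faulty = run_round(n, t, faulty, choose_repairs, rng)
        history.append(len(faulty))
    return history

history = simulate(n=40, t=2, rounds=25, choose_repairs=repair_accused, rng=random.Random(0))
```

Swapping in a different `choose_repairs` function is the intended way to compare repair policies against the same adversary.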
28Results for n processors, at most t failures per round
- When t > 70 and n > 376 t log t + 50t, we can maintain n - 64 t log t - 10t good processors at all times
- This works even if the number of faults exceeds n/2
- When n = 640 and t = 1, we can maintain 520 good processors at all times.
29Why's this hard?
- We can't determine the status of a chosen processor because its testers might fail right before we choose them
- Mutual admiration societies don't work either
30SIFT and WINNOW
- SIFT finds a large set G consisting of processors that were good when SIFT started running, and a small set F containing some faulty processors
- WINNOW uses G to diagnose most of the faulty processors in F
- Algorithm: SIFT, WINNOW, repair, repeat
31SIFT algorithm
- Let r = 2 log t
- In 2r rounds form undirected hypercubes of size
- Put MASs into G, others into F
- MASs must have been entirely good at start of
SIFT, and are still mostly good
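The hypercube step can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes processors are pre-grouped into blocks of 2^r (the hypercube size is not given above, so that value is an assumption), that every pair of hypercube neighbors tests each other, and that no processor fails while SIFT runs. `sift` and `make_run_test` are hypothetical names.

```python
def sift(blocks, r, run_test):
    """Split blocks into G (all mutual tests good) and F (everything else).

    blocks: lists of 2**r processor ids, indexed as hypercube corners.
    run_test(tester, testee): True if the tester calls the testee good.
    """
    good_blocks, suspect_blocks = [], []
    for block in blocks:
        if len(block) != 2 ** r:
            raise ValueError("each block must hold 2**r processors")
        is_mas = True
        for d in range(r):                       # one hypercube dimension at a time
            for i, tester in enumerate(block):
                neighbor = block[i ^ (1 << d)]   # flip bit d to get the neighbor
                if not run_test(tester, neighbor):
                    is_mas = False
        (good_blocks if is_mas else suspect_blocks).append(block)
    return good_blocks, suspect_blocks

def make_run_test(faulty):
    """Demo oracle: good testers are truthful, faulty ones call everyone good."""
    def run_test(tester, testee):
        return True if tester in faulty else testee not in faulty
    return run_test

G, F = sift([[0, 1, 2, 3], [4, 5, 6, 7]], r=2, run_test=make_run_test({6}))
```

Each dimension needs two rounds because both neighbors must act as tester once, which matches the 2r rounds stated above.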
32WINNOW algorithm
- Choose a processor P in F
- For 2 log t rounds,
- test P and every processor that has tested P so far, using testers in G
- If the tests always call P faulty but don't call any of the others faulty then we can be sure that P really is faulty
- Most old faults are diagnosed, but 4t log t new ones could accumulate.
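The WINNOW check for a single processor P can be sketched as below. Assumptions made for illustration: each test uses a fresh tester taken from G, and only processors that tested P directly are re-tested in later rounds. The names `winnow_check` and `make_run_test` are hypothetical, and the demo oracle does not model processors failing mid-run.

```python
def winnow_check(p, g_pool, rounds, run_test):
    """Return True if P is certified faulty, False if evidence is inconclusive."""
    next_free = 0
    tested_p = []                        # processors that have tested P so far
    for _ in range(rounds):
        targets = [p] + tested_p
        if next_free + len(targets) > len(g_pool):
            raise ValueError("g_pool too small for this many rounds")
        testers = g_pool[next_free:next_free + len(targets)]
        next_free += len(targets)
        for tester, target in zip(testers, targets):
            says_good = run_test(tester, target)
            if target == p and says_good:
                return False             # some test called P good
            if target != p and not says_good:
                return False             # a tester of P was itself called faulty
        tested_p.append(testers[0])      # testers[0] is the one that tested P
    return True

def make_run_test(faulty):
    """Demo oracle: good testers are truthful, faulty ones call everyone faulty."""
    def run_test(tester, target):
        return False if tester in faulty else target not in faulty
    return run_test

pool = list(range(10))
certified = winnow_check(100, pool, rounds=3, run_test=make_run_test({100}))
```

With this scheme round k uses k testers, so `rounds` rounds consume rounds*(rounds+1)/2 members of G.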
33Summary
- We have efficient algorithms for
- diagnosis in the presence of a small number of intermittent faults
- diagnosis with a small number of diagnosis errors
- parallel fault diagnosis
- ongoing diagnosis of dynamic faults