Title: Network Fault Tolerance
1 HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 1: INTRODUCTION
Winter Semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/rok/ftc
2 FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
- Introduction (Unit I)
  - Motivation
  - System views
  - Dependability rings
  - Dependable design methodology
- Dependability Concepts, Measures and Models (UNIT DCMM)
  - Basic definitions
  - Dependability measures
  - Dependability models
  - Examples
  - Dependability evaluation tools
- Testing Techniques (UNIT TT)
  - Testing techniques principles
  - Processor testing
  - Memory testing
  - Network testing
3 FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
- Fault Diagnosis Techniques (UNIT FST)
  - Fault detection techniques
  - Fault location (isolation) methods
- Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
  - Dynamic techniques
  - Static techniques
  - Hybrid techniques
- Fault-tolerant and Fault-secure Memories (UNIT FRTT)
  - Fault-tolerant techniques in manufacturing
  - Replication
  - Coding
  - Reconfiguration
4 FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
- Network Fault Tolerance (UNIT NFT)
  - Computer networks
  - Basic techniques
  - Example multistage networks
- Case Studies (UNIT CS)
  - ESS and 3B20
  - FTMP: Fault-Tolerant Multiprocessor
  - SIFT: Software-Implemented Fault Tolerance
  - Communication controller
  - Fault-tolerant Building Block Architecture
5 COURSE ACTIVITIES
- PROJECT
- PRESENTATION
- INVITED SPEAKERS
- CONFERENCES AND WORKSHOPS
- Some Websites
  - www.dependability.org
  - www.paradise.caltech.edu
  - www.milan.eas.asu.edu
  - www.crhc.uiuc.edu
6 Major References on Fault-tolerant Computing (Books/General) 1
- Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley Interscience, 1970.
- Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
- Breuer, M. A. and A. D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
- Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
- Anderson, T. and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice-Hall, 1982.
- Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982, 1995.
- Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
- Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.
7 Major References on Fault-tolerant Computing (Books/General) 2
- Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
- Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
- Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
- Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5, Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
- Landwehr, C. E., B. Randell and L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
- Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.
- Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.
8 Major References on Fault-tolerant Computing (Books/General) 3
- Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
- Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
- Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
- Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
- Pradhan, D. K., Fault-Tolerant Computer System Design, 1996.
- Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
- Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
- Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.
9 Major References on Fault-tolerant Computing (Books/Reliability Evaluation)
- Myers, G. J., Software Reliability: Principles and Practice, Wiley-Interscience, 1976.
- Trivedi, K. S., Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, 1982.
- Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
- Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
- Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.
10 Major References on Fault-tolerant Computing (Books/Coding)
- Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
- Peterson, W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
- Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
- Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
- Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
- Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.
11 Major References on Fault-tolerant Computing (Books/Software)
- Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
- Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
- Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
- Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
- Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
- Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
- Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
- Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.
12 Major References on Fault-tolerant Computing (Journals)
- Special Issue of Proc. of IEEE, October 1978
- Special Issue of Computer, October 1979
- Special Issue of Computer, March 1980
- Special Issue of Computer, August 1984
- Special Issue of IEEE Software, May 1995
- IEEE Trans. on Reliability
- IEEE Trans. on Software Engineering
- Computer
- Design and Test
- Electronics
- Proc. of IEEE
- Computer Design
- Journal of Electronic Testing: Theory and Applications
- Journal of Parallel and Distributed Computing
- IEEE Trans. on Parallel and Distributed Systems
- Real-Time Systems Journal
13 Major References on Fault-tolerant Computing (Conference Proceedings)
- Fault-Tolerant Computing Symposium
- Reliability and Maintainability Symposium
- Reliability in Distributed Software and Database Systems Symposium
- Test Conference
- Distributed Computing Systems Conference
- Parallel Processing Conference
- Real-Time Systems Symposium
- Computer Architecture Symposium
14 INTRODUCTION
- OBJECTIVES
  - MOTIVATION FOR FAULT-TOLERANT SYSTEMS
  - TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
  - TO PRESENT BASIC CONCEPTS AND APPROACHES
  - TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
- CONTENTS
  - MOTIVATION
  - SYSTEM VIEWS
  - SYSTEM DEPENDABILITY CONCEPTS
  - APPROACHES TO DEPENDABLE DESIGN
  - DEPENDABILITY RINGS
  - DEPENDABLE DESIGN METHODOLOGY
15 TYPES OF SYSTEMS
- Dependable (Reliable) System
  - A system that delivers a required service during its lifetime
- Fault-Tolerant Computer System
  - A system that has the capability to continue the correct execution of its programs and input/output functions in the presence of faults
- Real-Time Computer System
  - A system that delivers service to a user within a specified deadline (physical time, duration, etc.)
- Responsive Computer System
  - A fault-tolerant real-time system that delivers satisfactory service in a timely manner
16 MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
- ECONOMIC NECESSITY
- LIFE SAVING
- NOVICE USERS
- HARSH ENVIRONMENTS
- MORE COMPLEX SYSTEMS
17 DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: Mean Time between Failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted over the years 1950 to 1990. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability. Technologies along the time axis: Relays, Vacuum Tubes, Semiconductors, SSI, MSI, LSI, VLSI.]
18 DEPENDABILITY PERFORMANCE TRADE-OFF
[Figure: Availability (0.9 to 0.99999) plotted against Throughput (1 to 100000 MIPS). Regions: Ultra Reliable Systems, Commercial Fault-Tolerant Systems, Massively Parallel/Distributed Systems.]
19 EXAMPLES
- DEFENSE SYSTEMS
- FLIGHT SYSTEMS
- AIR TRAFFIC CONTROL
- COMMUNICATION SYSTEMS
- BANKING SYSTEMS
- AIRLINE SEAT RESERVATIONS
- TELEPHONE SYSTEMS
- HOUSEHOLD APPLIANCES
- VIDEO GAMES
20 VIEW 1: SYSTEM LIFE CYCLE
[Figure: life-cycle phases CONCEPT FORMULATION, SYSTEM SPECIFICATION, DESIGN, PROTOTYPE, PRODUCTION, INSTALLATION, OPERATIONAL LIFE, MODIFICATION AND RETIREMENT. Other labels in the diagram: SYSTEM CONSTRAINTS, NEW TECHNOLOGY, OBSOLESCENCE, NEEDS.]
- Notice that testing, verification or validation should occur after every phase of the life cycle
- Very few tools exist, and only for some steps of the cycle
21 VIEW 2: PACKAGING LEVELS OF INTEGRATION
- APPLICATIONS
- APPLICATIONS MODULES
- SPECIAL-PURPOSE LANGUAGES
- STANDARD LANGUAGES
- OPERATING SYSTEMS
- CABINETS/FRAMES
- BOXES/CAGES
- PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
- INTEGRATED CIRCUITS (CHIPS)
- Dependability must be considered at every level
- System decomposition (partitioning) may have a
significant impact on dependability
22 VIEW 3: WORKLOAD VIEW
[Figure: workload breakdown for LIVEWARE and HARDWARE/SOFTWARE. Categories: USEFUL WORK, PREPARATION, SEMI USEFUL WORK, IDLING, FAULT SERVICING.]
- ELIMINATE IDLING AND USE IT FOR TESTING TO
IMPROVE DEPENDABILITY
23 VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
- DEPENDABILITY AND TESTING MUST BE CONSIDERED AT
EVERY LEVEL
24 VIEW 5: COMPUTER SYSTEM
- LIVEWARE: MAINTENANCE PERSONNEL, OPERATORS, SYSTEM DESIGNERS, SYSTEM ANALYSTS, PROGRAMMERS, USERS
- SOFTWARE: PACKAGES, ASSEMBLERS, COMPILERS, OPERATING SYSTEMS, UTILITY PROGRAMS, DEBUGGING PROGRAMS, FILE PROCESSING PROGRAMS
- FIRMWARE: MICROPROGRAM, MICROPROGRAMMING SYSTEMS
- HARDWARE: CPUs, I/O DEVICES, MEMORIES, INTERCONNECTION NETWORKS
- FAULTS ARE ATTRIBUTED TO: HARDWARE 20-65%, SOFTWARE 20-80%, PEOPLE 15-40%
- AT&T's: 20-40-40 (2/3 applications, 1/3 OS)
25 (WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY YOU MAY END UP WITH THE FOLLOWING
- SIX PHASES OF A PROJECT
  - ENTHUSIASM
  - DISILLUSIONMENT
  - PANIC AND HYSTERIA
  - SEARCH FOR THE GUILTY
  - PUNISHMENT OF THE INNOCENT
  - PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
- (Author unknown; found in one of the computer companies)
26 SYSTEM DEPENDABILITY CONCEPTS
- RELIABILITY
  - The conditional probability that the system will perform its intended function without failure up to time t, provided it was fully operational at time t = 0
- AVAILABILITY
  - Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it is equal to reliability: A(t) = R(t)
  - Steady-state availability, denoted A_s, is the probability that a system will be operational at any random point in time; it is expressed as the fraction of time a system is operational during its expected lifetime
- SURVIVABILITY
  - The probability that a system will deliver the required service in the presence of a set of faults defined a priori, or any subset of it
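The measures above can be illustrated with a small calculation. This is a minimal sketch, not part of the original slides: it assumes a constant failure rate (exponential model) for reliability, and it takes steady-state availability as uptime divided by uptime plus downtime, using mean time to failure (MTTF) and mean time to repair (MTTR) as the two averages. All function and parameter names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0  # 365 days, ignoring leap years


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) under the ASSUMED constant-failure-rate (exponential) model."""
    return math.exp(-failure_rate_per_hour * t_hours)


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: uptime / (uptime + downtime)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in one year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Example: a system that fails on average every 999 hours
# and takes 1 hour to repair.
a_s = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"steady-state availability: {a_s:.5f}")
print(f"downtime per year: {downtime_hours_per_year(a_s):.2f} hours")
print(f"R(1000 h) at 0.001 failures/h: {reliability(1000.0, 1e-3):.3f}")
```

For a non-repairable system there is no repair term, so the availability at time t reduces to `reliability(t, rate)`, matching A(t) = R(t).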
27 APPROACHES
- FAULT INTOLERANCE
- FAULT TOLERANCE
- MAINTAINABILITY
- HARDWARE/SOFTWARE TRADE-OFFS
28 HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
[Figure: continuum between HARDWARE and SOFTWARE. EXAMPLES: M6800, MC68000, VAX-11/780, IBM-30XX, CRAY-XMP, C-205, SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS. INSTRUCTIONS: INTEGER ARITHMETIC (ADD/SUB, MPY/DIV), FLOATING-POINT ARITHMETIC, VECTOR PROCESSING, MULTIPROCESSING (e.g., submachine set-up).]
VERTICAL MIGRATION is the transfer of the implementation of functions from software to firmware and/or hardware, or vice versa. Vertical migration can improve performance and dependability and reduce cost.
29 DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE
[Figure: Dependability Rings, each guarded by an Acceptance Test. Rings as listed: Operating System, Languages and Application; System Hardware; Register-Transfer Level; Logic Level.]
Each Dependability Ring should provide measures
and mechanisms for Fault Tolerance (Detection,
Location, Testability and Recovery)
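As an illustration of an acceptance test guarding one ring, the following minimal sketch checks each result before it is passed on and falls back to an alternate when the check fails. It is an assumption about how such a test could be wired up in software, not the mechanism defined in the lecture. The function names and the deliberately faulty primary routine are hypothetical.

```python
import math
from typing import Callable, Sequence


def run_with_acceptance_test(
    alternates: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float, float], bool],
    x: float,
) -> float:
    """Try each alternate in order; return the first result that passes."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # detection: the alternate itself raised an error
        if acceptance_test(x, result):
            return result  # accepted result may leave this ring
        # rejected result: recover by trying the next alternate
    raise RuntimeError("all alternates failed the acceptance test")


def faulty_sqrt(x: float) -> float:
    """Hypothetical faulty primary: wrong for most inputs."""
    return x / 2.0


def sqrt_acceptance_test(x: float, result: float) -> bool:
    """Accept only a non-negative result whose square is close to x."""
    return result >= 0.0 and abs(result * result - x) <= 1e-9 * max(1.0, x)


value = run_with_acceptance_test(
    [faulty_sqrt, math.sqrt], sqrt_acceptance_test, 9.0
)
print(value)
```

The acceptance test here only detects a wrong result and triggers recovery. Locating which component caused the fault would need additional diagnostics.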
30 A BOOTSTRAP: TEST RINGS IN A MULTICOMPUTER SYSTEM
[Figure: Test Rings. Rings as listed: Network; Memories; Processor; Diagnostic and Maintenance Processor(s) (Hardcore).]
31 DEPENDABLE DESIGN METHODOLOGY
- Identify fault classes, fault latency and fault impact
- Determine qualitative and quantitative specs for fault tolerance and evaluate your design in a specific environment
- Identify weak spots and assess potential damage
- Decompose the system
- Develop fault and error detection techniques and algorithms
- Develop fault isolation techniques and algorithms
- Develop recovery/reintegration/restart
- Evaluate degree of fault tolerance
- Refine and iterate for improvement: try to eliminate weak spots and minimize potential damage
32 REAL-TIME SYSTEMS DESIGN
- Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
- Characterize timing of a system (hardware and software).
- Map timing specification onto a system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
- Verify and validate the design for quantitative and qualitative specifications.
- Refine, iterate and fine-tune the design.
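A first sanity check when mapping a timing specification onto system timing can be sketched as follows. This is an illustrative assumption, not a method from the lecture: tasks are periodic, each deadline equals its period, and everything runs on a single processor. Under those assumptions a total utilization above 1 means the task set cannot be scheduled. The check is necessary but not sufficient. The task names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    period_ms: float    # assumed to equal the task's deadline
    duration_ms: float  # assumed worst-case execution time


def utilization(tasks: Iterable[PeriodicTask]) -> float:
    """Fraction of processor time the task set demands."""
    return sum(t.duration_ms / t.period_ms for t in tasks)


def passes_necessary_check(tasks: List[PeriodicTask]) -> bool:
    """False means infeasible on one processor; True is only a first hint."""
    each_fits = all(t.duration_ms <= t.period_ms for t in tasks)
    return each_fits and utilization(tasks) <= 1.0


tasks = [
    PeriodicTask("sensor", period_ms=10.0, duration_ms=2.0),
    PeriodicTask("control", period_ms=20.0, duration_ms=5.0),
    PeriodicTask("logger", period_ms=100.0, duration_ms=10.0),
]
print(f"utilization: {utilization(tasks):.2f}")
print(f"passes necessary check: {passes_necessary_check(tasks)}")
```

A task set that passes still needs a concrete scheduling analysis for the chosen policy, followed by the verification and validation step listed above.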
33 RESPONSIVE SYSTEM DESIGN
- Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
- Determine system timing (hardware and software); assess damage, availability and responsiveness.
- Develop and time fault and error detection techniques and algorithms.
- Develop and time fault isolation techniques and algorithms.
- Develop and time recovery/reintegration/restart.
- Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
- Evaluate responsiveness.
- Refine and iterate for improvement.
- RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
34 REFERENCES (TEXTBOOK)
- Bell, C. G., J. C. Mudge and J. E. McNamara, Seven Views of Computer Systems, Chapter 1 in Computer Engineering by the same authors, Digital Press, 1978.
- Lipovski, G. J. and M. Malek, Parallel Computing: Theory and Comparisons, Wiley-Interscience, New York, 1987.
- Malek, M., Parallel Computer Systems Testing and Integration, in Testing and Diagnosis of VLSI and LSI, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
- Jalote, P., Fault Tolerance in Distributed Systems, 1994.
- Pradhan, D. K., Fault-Tolerant Computer System Design, 1996.