Title: 332:437 Concepts in Digital Systems Design
1 332:437 Concepts in Digital Systems Design
- Instructor: Prof. Michael L. Bushnell
- Teaching Assistants
- Ms. Srihitha Yerabaka
- Mr. Raghuveer Ausoori
- Course web site
- http://www.caip.rutgers.edu/bushnell
- ECE Department, Rutgers University
2 Changes in ECE Undergrad. Hardware Courses
3 Course Structure
- Uses Verilog
- No more troubles with arithmetic conversions
- Less trouble with configurations
- Course is now a 2-term sequence
- 1st term: Academic content
- 2nd term: Design Project (exclusively)
- Use automatic logic synthesis to synthesize the project as a Field-Programmable Gate Array
4 332:437 Lecture 1: History of Fault Tolerance
- Motivation for fault tolerance
- Generations of Computers
- Fault Tolerance
- Definitions
- Applications
- Triple Modular Redundancy
- Mid-Value Select Technique and Flux Summing
- Summary
Material from: Design and Analysis of Fault-Tolerant Digital Systems, by Barry Johnson, Addison-Wesley Publishers
5 Photographs from NASA
- Made possible by fault-tolerant interplanetary spacecraft (unmanned) and the Hubble Space Telescope
- Spacecraft Electronics
- Severe radiation fields (gamma rays)
- Extremes of heat and cold
- Cosmic rays
- Extreme electromagnetic disturbance (particularly around Jupiter)
6 Hubble Space Telescope Rising from Space Shuttle
7 Neptune Rising Above Surface of its Moon
8 The Orion Nebula
9 Cartwheel Galaxy
10 Four Largest Moons of Jupiter -- Ganymede, Callisto, Io, and Europa (Voyager II)
11 New Asteroid Found by Voyager II
12 Voyager II Photograph of Neptune
13 Voyager II -- Rings of Saturn
14 Voyager II -- Rings of Neptune
15 Device and System Reliability
16 Dependability-Performance Trade-off
17 Operational Times for Fault-Tolerant Mobile Computers
- Simplex: single processor, no fault-tolerant hardware
- TMR: Triple Modular Redundancy (described later) fault-tolerant hardware
- Uses a lot of power
18 Generations of Computers
- 1st: All electronic, stored program (1945-55)
- Manchester Mark I: Kilburn
- Princeton IAS: von Neumann
- Univac I: Mauchly and Eckert
- Fault tolerance needed for computer to work at all
- 2nd: Discrete Transistor Computers (1955-64)
- IBM 7090/7094
- Univac 1100
- MIT Whirlwind: Forrester, Wang, Olsen
- Invention of Magnetic Core Memory
- Invention of CRT Display with light pen
19 Generations of Computers
- 3rd: Invention of Integrated Circuit (1964-74)
- 1959: Kilby, Texas Instruments
- 1960: Noyce, Fairchild
- IBM System/360: Blaauw and Brooks
- Single architecture, whole range of price/performance
- 4th: Invention of Dynamic RAM Memory (1974-88)
- Invention of microprocessor (Intel 4004)
- IBM System/370
- DEC VAX 11/780
- DRAM replaces magnetic core memory
- Virtual Memory (segmentation and paging) used
20 Generations of Computers
- 5th: Parallel Distributed Computers (1988-present)
- Enabled by cheap VLSI Hardware
- Pervasive Computer Networking
- Carnegie-Mellon C.mmp: 16-processor DEC PDP-11
- Distributed memory, crossbar interconnect
- IBM SP-2: 1 to 64 processors
- Sun Workstation
- IBM PC
- N-Cube 10: 1024 processors
21 Current Ultra-Large Scale IC Technology
- AMD K8: 233 million transistors
- Intel Smithfield: 230 million transistors
- New IBM Cell Chip microprocessor: 234 million transistors
- Hardware on chip doubles every 1½ years
- Currently
- 2.8 GHz microprocessor clock rates
- 1 billion transistors on a chip
- Beginning of Network-on-a-Chip Era
22 Major Changes in Systems Design
- Fault-Tolerant Computing now affordable and necessary for mobile sensor-on-silicon applications for medicine, networking
- Severe problems in verification and testing of computers
- 60% of cost of hardware, Intel's biggest capital cost
- Hardware cannot be tested unless specific design procedures are followed
- Network reliability now a major headache
- Low-Power Design: most important hardware problem
- System-on-a-Chip: put several microprocessors (e.g., 8086 and DSP), DRAM, glue logic, A/D, D/A, analog filters, chemical sensors, wireless transmitter/receiver on 1 chip
23 Definitions
- Fault tolerance: system can continue correct performance in presence of hardware/software faults
- Fault: physical defect or flaw in hardware/software component
- Error: manifestation of a fault
- Failure: situation where the error results in a system incorrectly performing its function
24 3-Universe Model (Avizienis)
- Physical Universe: semiconductor, power supply, printer, etc.
- Informational Universe: where errors occur
- Incorrect computer data words
- Incorrect digital voice/picture image
- External or User's Universe: where user of system ultimately sees effects of faults and errors
- Fault latency: time between occurrence of fault and appearance of an error caused by that fault
- Error latency: time between appearance of error and appearance of resulting failure
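The two latency definitions above reduce to simple differences between event times. A minimal sketch, assuming the three events have been timestamped in seconds; the class name, field names, and sample values are hypothetical and only illustrate the definitions:

```python
from dataclasses import dataclass


@dataclass
class FaultTimeline:
    """Hypothetical record of when each event was observed (seconds)."""
    fault_time: float    # the fault occurs
    error_time: float    # an error caused by that fault appears
    failure_time: float  # the resulting failure appears

    def fault_latency(self) -> float:
        # Time between occurrence of the fault and appearance of the error.
        return self.error_time - self.fault_time

    def error_latency(self) -> float:
        # Time between appearance of the error and the resulting failure.
        return self.failure_time - self.error_time


timeline = FaultTimeline(fault_time=2.0, error_time=5.5, failure_time=9.0)
print(timeline.fault_latency())  # 3.5
print(timeline.error_latency())  # 3.5
```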
25 Fault Causes
- Specification mistakes: wrong algorithms, architectures, hardware/software specifications
- Implementation mistakes: poor design, poor component selection, poor construction, software coding errors
- Component defects (manufacturing): main cause of faults
- Imperfections, random defects, wear-out (broken bonds, corrosion)
- External disturbance: γ rays, α particles, electromagnetic interference, battle damage, environment extremes
26 Fault Description
- Nature: hardware/software/analog/digital
- Duration: how long fault is active
- Permanent: in existence indefinitely
- Transient: appears/disappears in very short time
- Intermittent: appears/disappears/reappears repeatedly
- Extent: localized to a given hardware or software module, or globally affects hardware/software/both
- Value
- Determinate: status is unchanged throughout time
- Indeterminate: status at time T may differ from status before or after T
27 Fault Tolerance Methods
- Fault Tolerance: give system the ability to keep performing its tasks after faults occur
- Fault Avoidance: prevent fault occurrence
- Design reviews, component screening, testing
- Fault Masking: prevent system faults from introducing errors into system informational structure
28 Methods to Achieve Fault Tolerance
- Fault Masking
- Reconfiguration: eliminate faulty module and restore system to operation
- Fault Detection: recognize that fault occurred
- Fault Location: find where fault occurred
- Fault Containment: isolate fault and prevent it from spreading through system
- Fault Recovery: remain operational or regain operation after faults
29 Metrics
- Reliability R(t): conditional probability that system works throughout [t0, t], given that it worked at t0
- Note: 0.9_7 = 0.9999999 (the subscript counts the nines)
- Availability A(t): probability that system is available at time instant t to perform its function
- Can be unreliable but available
- Safety S(t): probability that system will perform its function correctly or will discontinue working in a way that does not affect operation of other systems or endanger people
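A system that fails often but is repaired quickly can be unreliable yet available. The sketch below illustrates this under two assumptions that the slide does not state: a constant failure rate, which gives the exponential model R(t) = exp(-λt), and the usual steady-state availability ratio MTTF / (MTTF + MTTR). The function names and the numbers are illustrative only.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-failure_rate * t), assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical system: fails on average every 100 hours, repaired in 0.1 hour.
mttf_hours = 100.0
mttr_hours = 0.1

r_100h = reliability(1.0 / mttf_hours, 100.0)
a = steady_state_availability(mttf_hours, mttr_hours)

print(f"R(100 h) = {r_100h:.3f}")   # 0.368: poor reliability
print(f"A        = {a:.4f}")        # 0.9990: high availability
```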
30 Metrics (continued)
- Performability P(L, t): probability that system will be at or above level L of performance at time t
- Graceful degradation
- Maintainability M(t): probability that a failed system will be restored to operation within time period t
- Testability: ability to test for system attributes (controllability and observability)
- Design for Testability: now critical to construct any digital system if it is to be successfully manufactured
- Method: add hardware strictly for testing purposes
- Dependability: combines reliability, availability, performability, and testability
31 Fault-Tolerant Computing Applications
- Long-life applications
- P(operation at time t = 10 yr.) ≥ 0.95
- Satellites, Unmanned Space Flight
- AT&T Telstar Communications Satellites
- NASA Mars Pathfinder
- Critical Computation
- NASA Space Shuttle
- Foxboro Nitroglycerine Plant Controls
- Maintenance Postponement
- Lucent 5ESS Telephone Exchange
- High Availability
- New York Stock Exchange Quotron System
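The long-life requirement above can be turned into a failure-rate budget. This sketch reads the requirement as a probability of at least 0.95 of still operating after 10 years, and assumes a constant failure rate with no repair, so that R(t) = exp(-λt). That model is an assumption for illustration and does not come from the slide; the function name is hypothetical.

```python
import math


def max_failure_rate(required_reliability: float, mission_time: float) -> float:
    """Largest constant rate with exp(-rate * mission_time) >= required_reliability."""
    return -math.log(required_reliability) / mission_time


rate = max_failure_rate(0.95, 10.0)            # failures per year
print(f"{rate:.5f} failures/year")             # 0.00513 failures/year
print(f"MTTF of about {1 / rate:.0f} years")   # MTTF of about 195 years
```

Under these assumptions, a 10-year mission at 0.95 needs a mean time to failure of roughly two centuries, which is why redundancy is used instead of relying on a single unit.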
32 Redundancy Techniques
- Passive: fault masking (hide faults)
- Active or dynamic: detect fault and remove broken hardware from system with electronic switch
- Hybrid: combine passive and active
33 Passive Redundancy
- Triple Modular Redundancy
- Single point of failure, restoring organ
- Generalize: N-modular redundancy
- N must be odd if majority voting is used
- Problems
- Very expensive
- 3 results may not agree, e.g., in analog control system, A/D converters have jitter in least significant bits and may disagree
- Solve: Mid-value select technique
34 Triple Modular Redundancy (TMR)
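Majority voting over redundant module outputs can be sketched in a few lines. The function names below are hypothetical. The reliability formula 3R² - 2R³ assumes independent module failures and a perfect voter, neither of which is stated on the slides; it is the probability that at least two of three modules work.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of N modules (N odd)."""
    if len(outputs) % 2 == 0:
        raise ValueError("use an odd number of modules for majority voting")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many modules disagree")
    return value


def bitwise_majority3(a: int, b: int, c: int) -> int:
    """Hardware-style TMR voter: each output bit is the majority of three bits."""
    return (a & b) | (a & c) | (b & c)


def tmr_reliability(r: float) -> float:
    """P(at least 2 of 3 modules work), assuming independence and a perfect voter."""
    return 3 * r**2 - 2 * r**3


print(majority_vote([42, 42, 17]))                      # 42
print(bin(bitwise_majority3(0b1010, 0b1010, 0b0110)))   # 0b1010
print(round(tmr_reliability(0.9), 3))                   # 0.972
```

The single faulty module is masked in both voters. Note that the voter itself remains a single point of failure in this sketch.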
35 TMR with Triplicated Voters
36 Software Voting
37 Mid-Value Select Technique
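When three redundant readings differ slightly, for example because of jitter in the least significant bits of A/D converters, exact majority voting finds no agreement. Mid-value selection instead takes the middle of the three values. A minimal sketch with a hypothetical function name and made-up readings:

```python
def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the middle (median) of three redundant readings."""
    return sorted((a, b, c))[1]


# Small disagreement between healthy channels: the middle reading is used.
print(mid_value_select(2.01, 1.99, 2.00))   # 2.0

# One channel fails high: its value is never the middle one, so it is masked.
print(mid_value_select(2.01, 9.75, 2.00))   # 2.01
```

As long as at most one channel is faulty, the selected value always lies between two healthy readings.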
38 Passive Redundancy (continued)
- Frequently, one result must be produced
- Leads to single point of failure problem
- Example: Motor Controller
- Solve
- Flux summing: uses closed-loop control system to compensate for faults
- Secondary current ∝ Σ primary currents
- Works because flux summer transformer is incredibly reliable
39 Flux-Summing
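The compensating effect of flux summing can be illustrated with a toy numerical loop. This is only a sketch under stated assumptions, not a model of a real transformer: the output is taken to be the plain sum of the channel currents, each healthy channel applies a simple integral-style correction to the shared feedback error, and a failed channel is stuck at zero. The function name, gain, and step counts are all hypothetical.

```python
def simulate_flux_summing(target, steps, failed_channel=None, fail_step=0,
                          gain=0.2, channels=3):
    """Toy closed loop: output is the sum of the redundant channel currents."""
    currents = [0.0] * channels
    output = 0.0
    for step in range(steps):
        error = target - output              # feedback seen by every channel
        for i in range(channels):
            if i == failed_channel and step >= fail_step:
                currents[i] = 0.0            # failed channel is stuck at zero
            else:
                currents[i] += gain * error  # integral-style correction
        output = sum(currents)               # summed output of all channels
    return output, currents


output, currents = simulate_flux_summing(target=3.0, steps=200,
                                         failed_channel=1, fail_step=50)
print(round(output, 6))                   # 3.0
print([round(c, 3) for c in currents])    # [1.5, 0.0, 1.5]
```

Before the failure each channel supplies one third of the output. After channel 1 fails, the feedback error drives the two remaining channels up until the summed output returns to the target.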
40 Summary
- Motivation for fault tolerance
- Generations of computers
- Fault Tolerance
- Applications
- Triple Modular Redundancy
- Mid-Value Select Technique and Flux Summing
- Fault tolerance necessary for applications
- Medicine
- Transportation
- Defense
- Inter-Planetary Exploration