ftc6 - PowerPoint PPT Presentation

Slides: 22
Provided by: alexande87
1
HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK
Dependable Systems, Lecture 6: Fault-Tolerant and Fault-Secure Memories
Winter semester 2002/03
Led by Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/rok/ftc
2
Fault-tolerant and Fault-secure Memories
  • Objectives
  • To study techniques of fault-tolerant and
    fault-secure memory design used in memory
    manufacturing and applications
  • Contents
  • Fault-tolerant techniques in manufacturing
  • Replication
  • Codes
  • Reconfiguration

3
Fault-tolerant Technique in Memory Manufacturing
(Overhead From 2% to 10%)
  • Depending on the expected failure density, a
    number of additional rows and/or columns are
    included on the chip.
  • Polysilicon fuses in decoding circuitry are
    selectively blown to allow addressing of the
    spare rows and columns.
  • Two methods exist for blowing fuses
  • By focusing a laser on a given fuse for about one
    second
  • By applying a 10 - 50 volt signal across a highly
    resistive fuse
  • With the rapidly increasing chip densities, the
    use of redundancy is standard among memory
    manufacturers

4
Fault-tolerant Memories (Overhead From 2% to
200%)
  • Identical copies of memory are used to mask
    erroneous results
  • Replication is usually implemented at the module
    level to minimize the number of voters needed to
    determine the correct output, and may consist of
    static or dynamic redundancy, or a combination of
    both.
  • Duplex
  • Half-duplex (two halves of memory are encoded
    into a third half residing in a back-up module
    such that the original data may be recovered if
    one of the three modules fails)
  • N-modular redundancy (usually triple modular
    redundancy)
  • Additional hardware includes
  • A voter across the memory units
  • Or a disagreement detector
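
The voter and the disagreement detector can be illustrated with a minimal Python sketch. It assumes memory words are plain integers and that the half-duplex backup is a bitwise XOR of the two halves; the slide does not name the encoding, so the XOR choice and all function names are illustrative only.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority over three copies of a memory word (TMR)."""
    return (a & b) | (a & c) | (b & c)


def disagreement(a: int, b: int) -> bool:
    """Duplex check: reports a mismatch but cannot tell which copy is wrong."""
    return a != b


def backup_word(half_a: int, half_b: int) -> int:
    """Half-duplex backup word, ASSUMING a bitwise XOR encoding."""
    return half_a ^ half_b


good = 0b1011_0010
faulty = good ^ 0b0000_1000            # one copy with a single flipped bit
voted = majority_vote(good, faulty, good)

other_half = 0b0110_0001
backup = backup_word(good, other_half)
recovered = backup ^ other_half        # rebuild a lost half from the two survivors
print(bin(voted), disagreement(good, faulty), bin(recovered))
```

The voter masks the faulty copy without identifying it, whereas the duplex detector only signals that the two copies differ, which is why duplex schemes need an additional means of deciding which copy to trust.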

5
Fault-tolerant Memories (Continued)
  • Exemplary systems with replicated memories
    include
  • STAR
  • (Self-testing and self-repairing computer)
  • FTMP
  • (Fault-tolerant multiprocessor)
  • SIFT
  • (Software-implemented fault tolerance computer)
  • COMTRAC
  • (Computer-aided traffic control system)
  • (4,2) concept
  • (Communication controller with four processors
    and duplicated memory, from Philips)
  • Stratus
  • (Commercial fault-tolerant system)
  • 3B20 from AT&T
  • (Commercial fault-tolerant system)

6
Memory Codes: Parity Codes
  • Even parity
  • Odd parity
  • (Better coverage, since all-0s or all-1s errors
    can be detected in any word with an even number
    of bits)
  • Byte-parity
  • (Parity bit is appended to every 7 or 8 bits)
  • Interlaced parity
  • Chip-wide parity
  • Two-dimensional parity
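
The coverage claim for odd parity can be checked with a short Python sketch. It assumes a word of 7 data bits plus 1 parity bit (8 bits in total, an even length) represented as a list of 0/1 values; the function names are illustrative.

```python
def parity_bit(data_bits, odd=False):
    """Parity bit that makes the total number of 1s even (or odd if odd=True)."""
    ones = sum(data_bits) % 2
    return ones ^ 1 if odd else ones


def parity_ok(word, odd=False):
    """Check a full word (data bits followed by the parity bit)."""
    return sum(word) % 2 == (1 if odd else 0)


data = [1, 0, 1, 1, 0, 0, 1]                 # 7 data bits
even_word = data + [parity_bit(data)]
odd_word = data + [parity_bit(data, odd=True)]

all_zeros = [0] * 8                          # e.g. a word stuck at all 0s
all_ones = [1] * 8                           # e.g. a word stuck at all 1s

# Even parity accepts both stuck patterns; odd parity rejects both.
print(parity_ok(all_zeros), parity_ok(all_ones))
print(parity_ok(all_zeros, odd=True), parity_ok(all_ones, odd=True))
```

With an odd total word length the all-1s pattern would pass the odd-parity check, which is why the slide restricts the claim to words with an even number of bits.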

7
Chip-wide Parity Method
8
Two-dimensional Parity Method
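
A minimal Python sketch of the general two-dimensional parity idea, assuming even parity over each row (word) and each column (bit position) of a small block. The block layout and function names are illustrative and are not taken from the slide.

```python
def parities(block):
    """Return (row parities, column parities) for a block of 0/1 rows."""
    rows = [sum(row) % 2 for row in block]
    cols = [sum(col) % 2 for col in zip(*block)]
    return rows, cols


def locate_single_error(block, stored_rows, stored_cols):
    """Return (row, col) of a single flipped bit, or None if parities match."""
    rows, cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, stored_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, stored_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one bit in error; cannot locate")


block = [[1, 0, 1, 1],
         [0, 1, 1, 0],
         [1, 1, 0, 0]]
stored_rows, stored_cols = parities(block)   # parities saved at write time

corrupted = [row[:] for row in block]
corrupted[1][2] ^= 1                         # flip one bit
location = locate_single_error(corrupted, stored_rows, stored_cols)
corrupted[location[0]][location[1]] ^= 1     # correct it
```

A single error shows up as exactly one failing row parity and one failing column parity, and their intersection identifies the bit.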
9
Hamming Codes
  • Hamming codes provide error detection as well as
    error correction in a b-bit word. Roughly
    log2(b) + 1 check bits are generated (the smallest
    r with 2^r >= b + r + 1), and their values locate
    the erroneous bit when a single-bit error occurs.
  • As an example, a (7, 4) single-error-correcting
    Hamming code is shown. There are a total of
    seven bits, four of which are data bits.
  • Even though the code requires 15% to 70%
    additional hardware and results in degraded
    memory speed (due to encoding and decoding of the
    check bits), it often increases the mean time
    between failures (MTBF) of the memory by orders
    of magnitude, a tradeoff which is often accepted.
  • Hamming codes may be extended to provide k-error
    correction and 2k-error detection, but such
    modifications require even greater hardware and
    software overheads.

10
Single-error Correction Example for a (7, 4)
Hamming Code (Bit d3 Is in Error)
  • Data bits are d1, d2, d3, d4
  • Check bits are c1, c2, c3
  • Equations used for syndrome generation
  • s3 = d1 ⊕ d2 ⊕ d4 ⊕ c1
  • s2 = d1 ⊕ d3 ⊕ d4 ⊕ c2
  • s1 = d2 ⊕ d3 ⊕ d4 ⊕ c3
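
The syndrome equations can be exercised with a short Python sketch. It assumes the check bits are generated so that every syndrome bit is zero for an error-free word (c1 = d1 ⊕ d2 ⊕ d4, and so on); the dictionary representation and function names are illustrative.

```python
# Each single-bit error produces a unique non-zero syndrome (s3, s2, s1),
# derived from which syndrome equations the bit takes part in.
SYNDROME_TO_BIT = {
    (1, 1, 0): "d1", (1, 0, 1): "d2", (0, 1, 1): "d3", (1, 1, 1): "d4",
    (1, 0, 0): "c1", (0, 1, 0): "c2", (0, 0, 1): "c3",
}


def encode(d1, d2, d3, d4):
    """Append check bits so that all three syndrome bits come out as 0."""
    return {"d1": d1, "d2": d2, "d3": d3, "d4": d4,
            "c1": d1 ^ d2 ^ d4, "c2": d1 ^ d3 ^ d4, "c3": d2 ^ d3 ^ d4}


def syndrome(w):
    s3 = w["d1"] ^ w["d2"] ^ w["d4"] ^ w["c1"]
    s2 = w["d1"] ^ w["d3"] ^ w["d4"] ^ w["c2"]
    s1 = w["d2"] ^ w["d3"] ^ w["d4"] ^ w["c3"]
    return s3, s2, s1


def correct(w):
    """Return a copy of w with a single-bit error (if any) flipped back."""
    fixed = dict(w)
    s = syndrome(w)
    if s != (0, 0, 0):
        fixed[SYNDROME_TO_BIT[s]] ^= 1
    return fixed


word = encode(1, 0, 1, 1)
bad = dict(word)
bad["d3"] ^= 1                      # bit d3 is in error
print(syndrome(bad), correct(bad) == word)
```

An error in d3 disturbs only s2 and s1, so the syndrome (s3, s2, s1) = (0, 1, 1) points at d3 and no other bit.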

11
SEC-DED Memory Design
  • 32-bit error detection and correction unit
  • Corrects all single-bit errors
  • Detects all double errors
  • Detects some triple errors
  • Detection in 32 ns, correction in 64 ns
  • 7 check bits for a 32-bit word via a modified
    Hamming code
  • May also work on 8-bit bytes
  • Built-in diagnostics
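
A minimal Python sketch of a SEC-DED code with 7 check bits for a 32-bit word. The slide does not specify its modified Hamming code, so this sketch substitutes a textbook extended Hamming code as a stand-in: six Hamming check bits at positions 1, 2, 4, 8, 16 and 32 plus one overall parity bit at position 0. All names are illustrative.

```python
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]
LENGTH = 39                                   # 32 data + 6 check + 1 overall parity
DATA_POSITIONS = [p for p in range(1, LENGTH) if p not in CHECK_POSITIONS]


def encode(data):
    """Return the 39-bit code word as a list; index 0 holds overall parity."""
    word = [0] * LENGTH
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for c in CHECK_POSITIONS:
        # Even parity over every position whose index has bit c set.
        word[c] = sum(word[p] for p in range(1, LENGTH) if p & c) % 2
    word[0] = sum(word) % 2                   # make the total number of 1s even
    return word


def decode(word):
    """Return (data, status) after correcting a single error if possible."""
    word = list(word)
    syndrome = 0
    for pos in range(1, LENGTH):
        if word[pos]:
            syndrome ^= pos
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < LENGTH:
        word[syndrome] ^= 1                   # syndrome 0 means the parity bit itself
        status = "corrected"
    elif overall_odd:
        status = "multiple error"             # syndrome points outside the word
    else:
        status = "double error"               # non-zero syndrome, even overall parity
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


code_word = encode(0xDEADBEEF)
single = list(code_word)
single[17] ^= 1
double = list(single)
double[5] ^= 1
print(decode(single), decode(double)[1])
```

In this stand-in code, a triple error is flagged only when its syndrome falls outside the word; other triple errors are miscorrected, which is in line with the slide's "detects some triple errors".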

12
Block Diagram Of Memory System
13
EDC Unit Operation
  • Configuration: 32-bit memory array/data bus,
    7-bit check array
  • Memory read cycle
  • 1. Data is read from the memory array to the
    buffers and from the check array to the
    check-bit inputs
  • 2. EDC unit gets data from the buffers
  • 3. EDC unit computes check bits and syndrome
  • 4. On a non-zero syndrome, the error(s) are
    indicated via the error or multierror lines, and
    bit correction occurs (for a 1-bit error)
  • 5. EDC unit passes the (corrected) data to the
    buffers and then to the data bus
  • Memory write cycle
  • 1. Data goes from the data bus via the buffers to
    the EDC unit
  • 2. Check bits are computed
  • 3. Data goes from the EDC unit via the buffers to
    memory, and check bits go from the EDC unit to
    the check-bit memory array
  • In the 2-Mbyte memory, MTBF improved from 95 h to
    15,000 h
  • Up to 35% increase in cost (on 16K memory cost)
  • Up to 40% increase in power consumption
  • Parity Complement Method for Error Correction

14
EDC Unit Operation (Continued)
  • 1st write: 1 1 0 1 0 0 1 1 0 (original data)
  • 1st read: 1 1 0 1 0 1 1 1 0 (PE, parity error)
  • Complement of the data read: 0 0 1 0 1 0 0 0 1
  • 2nd write: 0 0 1 0 1 0 0 0 1 (complemented data)
  • 2nd read: 0 0 1 0 1 1 0 0 1 (PE, parity error)
  • Complement of the data read: 1 1 0 1 0 0 1 1 0
    (correct data)
  • Hard error location: the sixth bit from the left,
    which reads as 1 in both reads
  • This double complement method in combination
    with an ECC system can correct additional errors,
    e.g., the National Semiconductor DP8400 chip
    (detects 100% of 2-bit errors, and both errors are
    correctable if no more than one of them is soft)
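
The double complement sequence can be replayed with a minimal Python sketch. It assumes a single 9-bit location with odd parity over the whole word and models the hard error as a bit stuck at 1; the class and function names are illustrative.

```python
WIDTH = 9
MASK = (1 << WIDTH) - 1


class StuckAtCell:
    """One memory location in which some bits always read a fixed value."""

    def __init__(self, stuck_mask=0, stuck_values=0):
        self.stuck_mask = stuck_mask
        self.stuck_values = stuck_values
        self.stored = 0

    def write(self, word):
        self.stored = word & MASK

    def read(self):
        healthy = self.stored & ~self.stuck_mask
        return (healthy | (self.stuck_values & self.stuck_mask)) & MASK


def has_odd_parity(word):
    return bin(word).count("1") % 2 == 1


def read_with_double_complement(cell):
    """Read a word and recover it if a stuck bit caused a parity error."""
    first = cell.read()
    if has_odd_parity(first):
        return first                     # no parity error
    cell.write(~first & MASK)            # 2nd write: complement of the data read
    second = cell.read()                 # the stuck bit corrupts this read as well
    return ~second & MASK                # complementing again restores the data


original = 0b110100110
cell = StuckAtCell(stuck_mask=0b000001000, stuck_values=0b000001000)
cell.write(original)
first_read = cell.read()
recovered = read_with_double_complement(cell)
print(format(first_read, "09b"), format(recovered, "09b"))
```

The stuck bit holds the wrong value for the original data but the right value for its complement at that position, so complementing the second read restores every bit. After the procedure the location holds the complemented data, which a real controller would have to track or rewrite.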

15
Reconfiguration
  • Reconfiguration involves the permutation of the
    address and/or data lines between an array of
    memory chips and the CPU to prevent the build-up
    of multiple hard errors
  • Spare memory locations technique
    (spare blocks method)
  • Spare switchable columns technique

16
Spare Blocks Method
  • Special purpose hardware
  •  Intel's iQX Module using Reallocation Technique

[Figure: spare blocks method. Labels: memory address from host (high-order address, low-order address), memory allocation RAM (mapping table), main memory, block containing faulty data, spare memory blocks.]
Hard error rate is 0.027 in 1000 hours; soft
error rate is 0.1 in 1000 hours in the 2-Mbyte
memory system
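
A minimal Python sketch of the spare blocks idea, assuming the high-order part of the host address indexes a mapping table that names the physical block and the low-order part selects the word within it. The class, its methods and the block sizes are hypothetical and do not describe Intel's hardware.

```python
class SpareBlockMemory:
    """Block-mapped memory with spare blocks reachable through a mapping table."""

    def __init__(self, num_blocks, block_size, num_spares):
        self.block_size = block_size
        self.storage = [[0] * block_size for _ in range(num_blocks + num_spares)]
        self.mapping = list(range(num_blocks))          # logical -> physical block
        self.free_spares = list(range(num_blocks, num_blocks + num_spares))

    def _translate(self, address):
        high, low = divmod(address, self.block_size)    # high-order / low-order address
        return self.mapping[high], low

    def read(self, address):
        block, offset = self._translate(address)
        return self.storage[block][offset]

    def write(self, address, value):
        block, offset = self._translate(address)
        self.storage[block][offset] = value

    def retire_block(self, logical_block):
        """Copy a faulty block into a spare and redirect the mapping table."""
        if not self.free_spares:
            raise RuntimeError("no spare blocks left")
        spare = self.free_spares.pop(0)
        faulty = self.mapping[logical_block]
        # In a real system the data would first be corrected (e.g. by ECC).
        self.storage[spare] = list(self.storage[faulty])
        self.mapping[logical_block] = spare
        return spare


memory = SpareBlockMemory(num_blocks=4, block_size=8, num_spares=2)
memory.write(10, 99)
memory.retire_block(10 // 8)        # block 1 has developed a hard error
value_after = memory.read(10)
```

The host keeps using the same addresses; only the mapping table entry changes, so the reallocation is transparent to software.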
17
Spare Switchable Columns Method
18
Fault-tolerant Memories In Commercial
Systems (1)
  • Intel's Series 90/iQX
  • (A SEC-DED code on the data, a parity check on
    the address bus, scrubbing of memory, which is
    the periodic dumping and rewriting of data to
    prevent the build-up of multiple soft errors, and
    spare memory with a pointer table)
  • VAX-11/780 and MicroVAXes
  • (A 7-bit SEC-DED Hamming code for 32-bit words
    and error logging)
  • Memory systems for spaceborne computers
  • (SEC-DED with periodic scrubbing, or bit-per-chip
    memory organization with row/column, power
    isolation and error protocol data to assist
    reconfiguration)

19
Fault-tolerant Memories In Commercial
Systems (2)
  • IBM 30xx and 43xx
  • Use a Hamming SEC-DED code and the parity
    complement method
  • UNIVAC 1100/60
  • Employs SEC-DED and sends an error signal to the
    requesting device if a double error is detected
  • VAX-11/780 and MicroVAX
  • Employ a Hamming SEC-DED code with error logging
  • CRAY X-MP and Y-MP
  • Use an 8-bit SEC-DED code with each 64-bit
    memory word
  • Sun workstations
  • Some use SEC-DED

20
Fault-tolerant Memories In Fault-tolerant
Computers
  • Self-Testing And Repairing (STAR) computer
  • 12 bits of each instruction word are stored in a
    2-out-of-4 code, while the remaining 20 bits
    consist of 16 bits for the address field and 4
    check bits.
  • An inverse modulo-15 code is used to set the
    check bits such that the combined 20 bits
    represent a number that is divisible by 15.
  • Operands also use the inverse modulo-15 code (28
    data bits and 4 check bits in the data words).
  • Critical programs can be written into multiple
    memory units.
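
The divisibility property of the inverse modulo-15 code can be sketched in Python. The sketch assumes the 4 check bits are appended as the low-order bits of the word and that the check value is 15 minus the residue of the data modulo 15; the slide states only the divisibility requirement, so the bit placement and function names are illustrative.

```python
def inverse_mod15_check(value):
    """4-bit check symbol, ASSUMED to be 15 minus (value mod 15), range 1..15."""
    return 15 - (value % 15)


def encode(value):
    """Append the check symbol as the four low-order bits."""
    return (value << 4) | inverse_mod15_check(value)


def is_valid(word):
    """A code word is valid when the whole number is divisible by 15."""
    return word % 15 == 0


address = 0x1234                       # 16-bit address field
word = encode(address)                 # 20-bit protected field
flipped = word ^ (1 << 7)              # a single-bit error
print(is_valid(word), is_valid(flipped))
```

Because 16 leaves a remainder of 1 when divided by 15, the word value 16 × data + check has the same residue as data + check, which the check symbol brings to a multiple of 15. A single-bit flip changes the value by a power of two, which is never a multiple of 15, so every single-bit error is detected.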

21
Examples
  • Carnegie Mellon University computers
  • C.mmp uses two parity bits (one odd, one even) in
    its memory.
  • Cm* employs retry and error reporting mechanisms.
  • C.vmp uses TMR.
  • Electronic switching systems (Bell Labs)
  • ESS 1 uses two parity bits (one covering both
    address and data, the other covering just the
    address). The system also supports error
    logging, auto-retry and software error handling.
  • ESS 3A makes extensive use of totally
    self-checking checkers and duplication of
    critical processors to recover from errors.
  • Fault-tolerant building blocks architecture
    (JPL-UCLA)
  • Uses SEC-DED and two spare switchable bits
  • Other examples
  • Tandem, Stratus, August Systems, Plessey (Great
    Britain), Philips (4,2) concept (the
    Netherlands), COMTRAC (Japan) and COPRA (France)