Failure Correction Techniques for Large Disk Array


Title: Failure Correction Techniques for Large Disk Array


1
Failure Correction Techniques for Large Disk Array
Garth A. Gibson, Lisa Hellerstein, et al.
University of California at Berkeley
2
What is the problem?
  • Disk arrays can increase I/O bandwidth and
    access parallelism
  • The chance of data loss increases as the
    number of disks in the array grows

Figure 1. The mean time to data loss (MTTDL) in a
single-erasure-correcting array.
3
Types of data failure
  • Transient or noise-related errors: corrected by
    repeating the offending operation or by applying
    per-sector error-correction facilities
  • Media defects: detected and masked at the factory
  • Catastrophic failures: head crashes or failures
    of the read/write or controller electronics

4
The goal of this paper
  • Avoid loss of user data
  • Recover from catastrophic disk failures
  • Make disk arrays as reliable as an individual disk

5
Concept 1 -- erasure-correcting codes and
error-correcting codes
  • Erasure-correcting codes are designed to recover
    erased bits in a message word
  • An unreadable bit is called an erasure
  • The positions of the erased bits are known
  • For a catastrophic disk failure, the bits on a
    failed disk can be designated as unreadable
  • Error-correcting codes are designed to correct
    messages in which some of the bits may have been
    flipped, but the positions of those bits are
    unknown.
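
The difference between an erasure and an error can be illustrated with a single parity bit. The following is a minimal sketch, not taken from the presentation; the helper names and the example bits are assumptions chosen for illustration.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """Even parity: XOR of all bits."""
    return reduce(xor, bits, 0)

def recover_erasure(word, pos):
    """Rebuild the bit at the known position pos from all other bits."""
    return parity(b for i, b in enumerate(word) if i != pos)

data = [1, 0, 1, 1]                  # illustrative information bits
codeword = data + [parity(data)]     # parity(codeword) is 0

# Erasure: position 2 is known to be unreadable, so it can be rebuilt.
recovered = recover_erasure(codeword, 2)

# Error: one bit is flipped at an unknown position. Parity shows that
# something is wrong but cannot say which bit.
corrupted = codeword.copy()
corrupted[1] ^= 1
error_detected = parity(corrupted) != 0
```

With one parity bit, one erasure is correctable because its position is known, while one flipped bit is only detectable.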

6
Concept 2 -- Redundancy Metrics
  • Each disk is viewed as a stack of bits: the i-th
    bit of each disk forms the i-th codeword in the
    redundancy encoding
  • Mean time to data loss (MTTDL): measure of
    reliability
  • Check disk overhead: check disks / data disks
  • Update penalty: number of check disks that must
    be updated when an information disk is updated
  • Group size: the information and check disks that
    must be accessed during the reconstruction of a
    failed disk form a group

7
1d-Parity
  • Single-erasure-correction scheme
  • For G data disks, one check disk holds the parity
    of all G data disks
  • Overhead: 1/G
  • Update penalty: 1
  • Group size: G + 1

G = 4
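
As a rough sketch of 1d-parity with G = 4, the check disk can be computed as the byte-wise XOR of the data disks, and a single failed disk rebuilt from the survivors. The block contents and the function name xor_blocks are assumptions made for this example.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# G = 4 data disks, each holding a two-byte block (illustrative values).
data_disks = [b"\x0f\xa0", b"\x33\x01", b"\xff\x10", b"\x00\x7e"]
check_disk = xor_blocks(data_disks)

# Disk 2 fails: XOR the G surviving disks of the group (data + check).
failed = 2
survivors = [d for i, d in enumerate(data_disks) if i != failed] + [check_disk]
rebuilt = xor_blocks(survivors)
```

Updating any one data disk changes only the single check disk, which is the update penalty of 1.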
8
2d-Parity
  • Double-erasure-correction scheme
  • G^2 data disks arranged in a 2-dimensional array
  • For each row and each column, one check disk
    stores the parity for that row or column
  • Check disk overhead: 2G / G^2 = 2/G
  • Update penalty: 2
  • Group size: G + 1

G = 4
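
To see why row and column parity together survive two failures, here is a minimal sketch for G = 4 with one bit per data disk. It only handles failed data disks (check disks are assumed intact), and the grid values and the function name reconstruct are assumptions for illustration.

```python
from functools import reduce
from operator import xor

G = 4
# One bit per data disk, arranged as a G x G grid (illustrative values).
grid = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1]]
row_check = [reduce(xor, row) for row in grid]
col_check = [reduce(xor, col) for col in zip(*grid)]

def reconstruct(grid, failed):
    """Rebuild failed data disks, given as a set of (row, col) positions.
    Repeatedly repairs any failure that is alone in its row or column."""
    g = [row[:] for row in grid]
    failed = set(failed)
    progress = True
    while failed and progress:
        progress = False
        for (r, c) in sorted(failed):
            row_alone = all((r, j) not in failed for j in range(G) if j != c)
            col_alone = all((i, c) not in failed for i in range(G) if i != r)
            if row_alone:
                g[r][c] = row_check[r] ^ reduce(xor, (g[r][j] for j in range(G) if j != c))
            elif col_alone:
                g[r][c] = col_check[c] ^ reduce(xor, (g[i][c] for i in range(G) if i != r))
            else:
                continue
            failed.discard((r, c))
            progress = True
    if failed:
        raise ValueError("pattern not recoverable by this simple procedure")
    return g

# Two data disks in the same row fail; their contents are unavailable.
damaged = [row[:] for row in grid]
damaged[1][0] = damaged[1][2] = None
repaired = reconstruct(damaged, {(1, 0), (1, 2)})
```

Row parity alone cannot repair two failures in one row, but each failed disk is alone in its column, so the column check disks recover them.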
9
N-dimensional parity (Nd-parity)
  • N-erasure-correction scheme
  • Check disk overhead: N·G^(N-1) / G^N = N/G
  • Update penalty: N
  • Group size: G + 1
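
The metrics for 1d-, 2d-, and Nd-parity follow one pattern, which the sketch below tabulates. The function name nd_parity_metrics and the dictionary keys are assumptions for this example; the formulas are the ones listed on the slides.

```python
from fractions import Fraction

def nd_parity_metrics(n, g):
    """Redundancy metrics for N-dimensional parity with side length G."""
    data_disks = g ** n
    check_disks = n * g ** (n - 1)
    return {
        "data_disks": data_disks,
        "check_disks": check_disks,
        "overhead": Fraction(check_disks, data_disks),  # equals N/G
        "update_penalty": n,
        "group_size": g + 1,
    }

one_d = nd_parity_metrics(1, 4)
two_d = nd_parity_metrics(2, 4)
three_d = nd_parity_metrics(3, 4)
```

For G = 4 this gives overheads of 1/4, 1/2, and 3/4: each extra dimension corrects one more erasure at the cost of more check disks and a larger update penalty.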

10
Linear Codes
  • Contain the original information unmodified
    within each codeword and compute the check bits
    of each codeword as the parity of subsets of the
    information bits



[Figure: an example codeword and its parity bits]
11
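Parity Check Matrix H = [P | I]
Fig. 4
How to compute the check (parity) bits? H·X = 0 (mod 2)
First row of H = [1 0 0 1 0 1 | 1 0 0]   (P | I)
X = (1 1 1 0 1 0 x1 x2 x3)
(first row of H)·X = 1 + 0 + 0 + 0 + 0 + 0 + x1 + 0 + 0 = 0 (mod 2)
x1 = 1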
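
Because H = [P | I], the condition H·X = 0 (mod 2) means the check bits are simply P times the information bits, mod 2. The sketch below uses the first row of P and the information bits from this slide; the second and third rows of P are not recoverable from the transcript and are assumed purely for illustration.

```python
P = [
    [1, 0, 0, 1, 0, 1],  # first row of P, as on the slide
    [0, 1, 0, 1, 1, 0],  # assumed for illustration
    [0, 0, 1, 0, 1, 1],  # assumed for illustration
]
info = [1, 1, 1, 0, 1, 0]  # information bits of X, as on the slide

def check_bits(P, info):
    """Check bits x = P * info (mod 2)."""
    return [sum(p * d for p, d in zip(row, info)) % 2 for row in P]

def syndrome(P, codeword):
    """H * X (mod 2) for H = [P | I]; all zeros for a valid codeword."""
    k = len(P[0])
    info_part, check_part = codeword[:k], codeword[k:]
    return [(s + c) % 2 for s, c in zip(check_bits(P, info_part), check_part)]

x = check_bits(P, info)
codeword = info + x
```

The first check bit comes out as x1 = 1, matching the slide; the other two depend on the assumed rows.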
12
Parity Check Matrix for 1d-parity and 2d-parity
Fig. 5
13
Properties of the parity check matrix
  • Expressed in terms of a parameter t whose value
    is between 0 and c
  • H will allow any t erasures to be corrected
  • H will allow any t errors to be detected
  • The minimum number of bits in which any two
    codewords differ, known as the distance of the
    code, is at least t + 1
  • Any set of t columns selected from H will be
    linearly independent
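
The column-independence property can be checked by brute force on a small code. The sketch below assumes a 2d-parity array with G = 2 (four data disks, two row check disks, two column check disks); this matrix layout and the function names are assumptions for the example, not the matrices of Fig. 5.

```python
from itertools import combinations

def gf2_rank(vectors):
    """Rank over GF(2) of a collection of equal-length 0/1 vectors."""
    rows = [list(v) for v in vectors]
    rank = 0
    for bit in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][bit]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][bit]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def any_t_columns_independent(H, t):
    """True if every set of t columns of H is linearly independent."""
    cols = list(zip(*H))
    return all(gf2_rank(subset) == t for subset in combinations(cols, t))

# Columns: d00 d01 d10 d11 | r0 r1 c0 c1 (assumed layout, H = [P | I]).
H = [
    [1, 1, 0, 0, 1, 0, 0, 0],  # row 0 parity
    [0, 0, 1, 1, 0, 1, 0, 0],  # row 1 parity
    [1, 0, 1, 0, 0, 0, 1, 0],  # column 0 parity
    [0, 1, 0, 1, 0, 0, 0, 1],  # column 1 parity
]
two_ok = any_t_columns_independent(H, 2)
three_ok = any_t_columns_independent(H, 3)
```

Every pair of columns is independent, so any two erasures are correctable, but some triples are dependent (for example a data disk together with its row and column check disks), so t = 3 does not hold for this code.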

14
Implementing Reconstruction
Fig. 6(a). When 4 disks fail in a 16-information-disk
2d-parity array, the controllers allow us to
identify which disks need to be repaired and
reconstructed.
15
Implementing Reconstruction cont.

Fig. 6(b). Apply elementary row operations (the
essence of Gaussian elimination) to find a matrix
M such that the product MB has the 4 × 4 identity
matrix in its first four rows.
16
Elementary operation
Example:
    x + y + z = 0       (1)
    x - 2y + 2z = 4     (2)
    x + 2y - z = 2      (3)
(3) - (1) to replace (3):
    x + y + z = 0       (1)
    x - 2y + 2z = 4     (2)
    y - 2z = 2          (4)
(2) - (1) to replace (2):
    x + y + z = 0       (1)
    -3y + z = 4         (5)
    y - 2z = 2          (4)
(5) + (4) × 3 to replace (4):
    x + y + z = 0       (1)
    -3y + z = 4         (5)
    -5z = 10            (6)
Result: x = 4, y = -2, z = -2
  • If we interchange two equations, the new system
    is still equivalent to the old one.
  • If we multiply an equation by a nonzero number,
    the new system is still equivalent to the old
    one.
  • If we replace one equation with the sum of two
    equations, we obtain an equivalent system.

17
Gaussian Elimination
Augmented matrix:
     1   1   1 |   0
     1  -2   2 |   4
     1   2  -1 |   2
(3) - (1) to replace (3):
     1   1   1 |   0
     1  -2   2 |   4
     0   1  -2 |   2
(2) - (1) to replace (2):
     1   1   1 |   0
     0  -3   1 |   4
     0   1  -2 |   2
(5) + (4) × 3 to replace (4):
     1   1   1 |   0
     0  -3   1 |   4
     0   0  -5 |  10
Definition: Using elementary operations, at every
step the new matrix is exactly the augmented
matrix associated with the new system. Once we
obtain a triangular matrix, we write the associated
linear system and then solve it.
Example:
    x + y + z = 0       (1)
    x - 2y + 2z = 4     (2)
    x + 2y - z = 2      (3)
The resulting triangular linear system:
    x + y + z = 0
    -3y + z = 4
    -5z = 10
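
The elimination on this slide can be reproduced in a few lines. The following is a minimal sketch using exact rational arithmetic; the function name solve is an assumption, and the pivoting differs slightly from the hand calculation (it scales rows rather than using the slide's (5) + (4) × 3 step), although it reaches the same solution.

```python
from fractions import Fraction

def solve(augmented):
    """Gaussian elimination plus back substitution on an n x (n+1) matrix."""
    m = [[Fraction(v) for v in row] for row in augmented]
    n = len(m)
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Clear the column below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # Back substitution on the triangular system.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

solution = solve([[1, 1, 1, 0],
                  [1, -2, 2, 4],
                  [1, 2, -1, 2]])
```

For the example system this yields x = 4, y = -2, z = -2.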
18
Implementing Reconstruction cont.
Fig. 6(c). The first 4 rows of MA describe the
operations that must be performed to reconstruct
our 4 disks.
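
The reconstruction procedure of Fig. 6 can be sketched on a smaller array. The code below assumes a 2d-parity array with G = 2 rather than the 16-information-disk array of the figure, and the function name reconstruction_plan, the disk numbering, and the disk contents are assumptions for illustration. It row-reduces [B | A] over GF(2), where B holds the columns of H for the failed disks and A the columns for the surviving disks; the first rows of the reduced A then list which surviving disks to XOR.

```python
def reconstruction_plan(H, failed):
    """For each failed disk, list the surviving disks whose XOR rebuilds it."""
    n = len(H[0])
    surviving = [j for j in range(n) if j not in failed]
    order = list(failed) + surviving          # columns of B first, then A
    rows = [[row[j] for j in order] for row in H]
    t = len(failed)
    for col in range(t):
        pivot = next((i for i in range(col, len(rows)) if rows[i][col]), None)
        if pivot is None:
            raise ValueError("failed disks cannot be reconstructed")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(len(rows)):
            if i != col and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[col])]
    # Row k now reads: failed[k] XOR (selected surviving disks) = 0.
    return {failed[k]: [surviving[j] for j in range(len(surviving)) if rows[k][t + j]]
            for k in range(t)}

# Disks 0-3: data d00 d01 d10 d11; 4-5: row checks; 6-7: column checks.
H = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1],
]
disks = [1, 0, 1, 1, 1, 0, 0, 1]   # one bit per disk, a valid codeword
plan = reconstruction_plan(H, [0, 1])

rebuilt = {}
for disk, sources in plan.items():
    value = 0
    for j in sources:
        value ^= disks[j]
    rebuilt[disk] = value
```

The plan depends only on which disks failed, so it can be computed once and then applied to every codeword (every bit position) on the disks.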
19
Where to implement t-erasure-correcting codes
  • Implemented in software
  • Runs in an I/O processor
  • Software learns of failures directly from the
    disk controllers

20
Conclusion
  • Implement the redundancy codes for disk arrays
  • Minimize the number of check disks that must be
    updated whenever an information disk is updated
  • Improve the reliability of disk arrays

21
Questions
  • What is a codeword in a redundant disk array?
  • List three redundancy metrics.
  • What are the 1d-parity and 2d-parity schemes?
  • What mathematical operation is used to
    recover a failed disk?