Failure Correction Techniques for Large Disk Array presentation

Transcript and Presenter's Notes

1
Failure Correction Techniques for Large Disk Array
Garth A. Gibson, Lisa Hellerstein, et al.
University of California at Berkeley
2
What is the problem?

Disk arrays can increase I/O bandwidth and
access parallelism
The chance of data loss increases as the
number of disks in an array grows

Figure 1. The mean time to data loss (MTTDL) in a
single-erasure-correcting array.
3
Types of data failure

Transient or noise-related errors
Correct by repeating the offending operation
or by applying per sector error-correction
facilities
Media defects
Detected and masked at the factory
Catastrophic failures -- Head crashes or failures
of the read/write or controller electronics

4
The goal of this paper

Avoid loss of user data
Recover from catastrophic disk failures
Make disk arrays as reliable as an individual disk

5
Concept 1 -- erasure-correcting codes and
error-correcting codes

Erasure-correcting codes are designed to recover
erased bits in a message word
An unreadable bit is called an erasure
The positions of the erased bits are known
For a catastrophic disk failure, the bits on a
failed disk can be designated as unreadable
Error-correcting codes are designed to correct
messages in which some of the bits may have been
flipped, but the positions of those bits are
unknown.
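
The difference between the two kinds of code can be sketched with a single parity bit. This is an illustrative example only; the 4-bit message and the helper name `parity_bit` are assumptions, not taken from the paper. With one parity bit, an erasure at a known position can be recovered, while a flipped bit at an unknown position can only be detected.

```python
def parity_bit(bits):
    """Even parity: 1 if the number of 1 bits is odd, else 0."""
    return sum(bits) % 2

message = [1, 0, 1, 1]                      # hypothetical message word
codeword = message + [parity_bit(message)]  # even parity over the whole codeword

# Erasure: position 2 is unreadable, and we know which position it is.
erased_pos = 2
recovered = sum(b for i, b in enumerate(codeword) if i != erased_pos) % 2
assert recovered == codeword[erased_pos]

# Error: one bit is flipped, but its position is unknown.
corrupted = codeword.copy()
corrupted[1] ^= 1
error_detected = sum(corrupted) % 2 == 1    # parity no longer even
assert error_detected                       # detected, but not locatable
```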

6
Concept 2 -- Redundancy Metrics

Disk as a stack of bits
- The i-th bit on each disk forms the
i-th codeword in the redundancy encoding
Mean time to data loss (MTTDL): measure of
reliability
Check disk overhead: check disks / data disks
Update penalty: number of check disks to be
updated
Group size: the information and check disks that
must be accessed during the reconstruction of a
failed disk form a group

7
1d - Parity

Single-erasure-correction scheme
For G data disks, one check disk with parity of
all G disks.

Overhead: 1/G
Update penalty: 1
Group size: G + 1

G = 4
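
A minimal sketch of 1d-parity with G = 4, assuming each disk is reduced to a two-byte block (the block contents and the helper name `parity` are made up for illustration). The check disk is the XOR of the data disks, and a failed disk is rebuilt by XOR-ing the other G disks in its group.

```python
from functools import reduce

def parity(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# G = 4 data disks, each shown as a 2-byte block (illustrative values).
data = [b"\x0f\x10", b"\xf0\x22", b"\x55\x03", b"\xaa\x40"]
check = parity(data)

# Disk 2 fails: rebuild it from the G - 1 surviving data disks plus the check disk.
survivors = data[:2] + data[3:]
rebuilt = parity(survivors + [check])
assert rebuilt == data[2]
```

Reconstruction reads G disks in total, which is why the group (data disks plus check disk) has size G + 1.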
8
2d - Parity

Double-erasure-correction scheme
G^2 data disks arranged in a 2-dimensional array
For each row and each column, one check disk
stores parity for that row or column

Check disk overhead:
2G / G^2 = 2/G
Update penalty: 2
Group size: G + 1

G = 4
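
The following sketch shows why the second dimension helps. It assumes a 4 x 4 grid in which every disk holds one small integer word; the grid contents and the helper `rebuild_via_column` are hypothetical. Two disks in the same row fail, so the row parity alone cannot separate them, but each is the only failure in its column.

```python
from functools import reduce
from operator import xor

G = 4
# grid[r][c] is the content of the data disk in row r, column c (made-up values).
grid = [[(7 * r + 3 * c + 1) & 0xFF for c in range(G)] for r in range(G)]
row_check = [reduce(xor, grid[r]) for r in range(G)]
col_check = [reduce(xor, (grid[r][c] for r in range(G))) for c in range(G)]

def rebuild_via_column(grid, col_check, r, c):
    """Recover disk (r, c) from the other disks in column c and that column's check disk."""
    others = (grid[i][c] for i in range(len(grid)) if i != r)
    return reduce(xor, others, col_check[c])

# Disks (1, 0) and (1, 2) fail together. Row 1 now has two unknowns,
# but each failed disk is the only failure in its own column.
lost = {(1, 0): grid[1][0], (1, 2): grid[1][2]}
recovered = {pos: rebuild_via_column(grid, col_check, *pos) for pos in lost}
assert recovered == lost
```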
9
N-dimensional parity (Nd-parity)

N-erasure-correction scheme
Check disk overhead: N G^(N-1) / G^N = N/G
Update penalty: N
Group size: G + 1
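
The metrics on the parity slides can be tabulated with a small helper. This is only a convenience sketch; the function name `nd_parity_metrics` and the dictionary layout are assumptions.

```python
from fractions import Fraction

def nd_parity_metrics(n, g):
    """Metrics for N-dimensional parity with G data disks along each dimension."""
    data_disks = g ** n
    check_disks = n * g ** (n - 1)
    return {
        "data_disks": data_disks,
        "check_disks": check_disks,
        "overhead": Fraction(check_disks, data_disks),  # simplifies to N/G
        "update_penalty": n,   # one check disk per dimension
        "group_size": g + 1,   # G data disks plus their check disk
    }

for n in (1, 2, 3):
    print(n, nd_parity_metrics(n, 4))
```

For G = 4 the overhead is 1/4, 1/2 and 3/4 for N = 1, 2 and 3: each added dimension tolerates one more erasure at the cost of one more check-disk update per write.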

10
Linear Codes

Contain the original information unmodified
within each codeword and compute the check bits
of each codeword as the parity of subsets of the
information bits

Figure: example codeword (information bits and parity bits)
11
Parity Check Matrix H = [P | I]
Fig. 4
How to compute the check (parity) bits? Solve Hx = 0 (mod 2)
First row of H = [P | I]: 100101 | 100
x = (1 1 1 0 1 0 x1 x2 x3)
First row of Hx: 1 + 0 + 0 + 0 + 0 + 0 + x1 + 0 + 0 = 0 (mod 2)
so x1 = 1
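
Because H = [P | I], the condition Hx = 0 (mod 2) means each check bit is the parity of the information bits selected by one row of P. The sketch below uses the first row of P from the slide (100101) and the information bits 111010. The second and third rows of P are hypothetical, since the rest of Fig. 4 is not in the transcript, and the helper names are assumptions.

```python
P = [
    [1, 0, 0, 1, 0, 1],  # first row of P, as on the slide
    [0, 1, 0, 1, 1, 0],  # hypothetical row, for illustration only
    [0, 0, 1, 0, 1, 1],  # hypothetical row, for illustration only
]

def check_bits(P, info):
    """Each check bit is the mod-2 sum of the information bits its row selects."""
    return [sum(p * d for p, d in zip(row, info)) % 2 for row in P]

def syndrome(P, codeword):
    """Compute Hx (mod 2) for H = [P | I]; all zeros means x is a codeword."""
    k = len(P[0])
    info, check = codeword[:k], codeword[k:]
    return [(s + c) % 2 for s, c in zip(check_bits(P, info), check)]

info = [1, 1, 1, 0, 1, 0]
x = info + check_bits(P, info)
assert check_bits(P, info)[0] == 1  # x1 = 1, matching the slide
assert syndrome(P, x) == [0, 0, 0]
```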
12
Parity Check Matrix for 1d-parity and 2d-parity
Fig. 5
13
Properties of the parity check matrix

Expressed in terms of a parameter t,
whose value is between 0 and c
H will allow any t erasures to be corrected
H will allow any t errors to be detected
The minimum number of bits in which any two
codewords differ, known as the distance of the
code, is at least t + 1
Any set of t columns selected from H will be
linearly independent
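
The column-independence property can be checked by brute force for a small code. The sketch below builds the parity check matrix columns for 2d-parity with G = 4 and tests every set of t columns for linear independence over GF(2). Columns are stored as integer bitmasks; the column layout and function names are assumptions made for this example.

```python
from itertools import combinations

def gf2_rank(vectors):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clear b's leading bit from v if it is set
        if v:
            basis.append(v)
    return len(basis)

def two_d_parity_columns(g):
    """Columns of H: bit r is row check r, bit g + c is column check c."""
    cols = [(1 << r) | (1 << (g + c)) for r in range(g) for c in range(g)]  # data disks
    cols += [1 << i for i in range(2 * g)]                                  # check disks
    return cols

def corrects_all_erasures(cols, t):
    """True if every set of t columns is linearly independent over GF(2)."""
    return all(gf2_rank(subset) == t for subset in combinations(cols, t))

cols = two_d_parity_columns(4)
assert corrects_all_erasures(cols, 2)      # any 2 erasures are correctable
assert not corrects_all_erasures(cols, 3)  # some 3-disk failures are not
```

A dependent triple is a data disk together with its row check disk and its column check disk: the three columns sum to zero, so losing all three cannot be undone.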

14
Implementing Reconstruction
Fig. 6(a). When 4 disks fail in a 2d-parity array
with 16 information disks, the controllers allow us
to identify which disks need to be repaired and
reconstructed.
15
Implementing Reconstruction cont.

Fig. 6(b). Apply elementary row operations (the
essence of Gaussian elimination) to find a matrix
M such that the product MB has the 4 x 4 identity
matrix in its first four rows.
16
Elementary operation
Example:
  x + y + z = 0      (1)
  x - 2y + 2z = 4    (2)
  x + 2y - z = 2     (3)
(3) - (1) to replace (3):
  x + y + z = 0      (1)
  x - 2y + 2z = 4    (2)
  y - 2z = 2         (4)
(2) - (1) to replace (2):
  x + y + z = 0      (1)
  -3y + z = 4        (5)
  y - 2z = 2         (4)
(5) + 3 x (4) to replace (4):
  x + y + z = 0      (1)
  -3y + z = 4        (5)
  -5z = 10           (6)
Result: x = 4, y = -2, z = -2

If we interchange two equations, the new system is
still equivalent to the old one.
If we multiply an equation by a nonzero number,
the new system is still equivalent to the old
one.
If we replace one equation with the sum of two
equations, we obtain an equivalent system.

17
Gaussian Elimination
Augmented matrix:
  [ 1   1   1 |  0 ]
  [ 1  -2   2 |  4 ]
  [ 1   2  -1 |  2 ]
(3) - (1) to replace (3):
  [ 1   1   1 |  0 ]
  [ 1  -2   2 |  4 ]
  [ 0   1  -2 |  2 ]
(2) - (1) to replace (2):
  [ 1   1   1 |  0 ]
  [ 0  -3   1 |  4 ]
  [ 0   1  -2 |  2 ]
(5) + 3 x (4) to replace (4):
  [ 1   1   1 |  0 ]
  [ 0  -3   1 |  4 ]
  [ 0   0  -5 | 10 ]
Definition: Apply elementary operations to the
augmented matrix. At every step the new matrix is
exactly the augmented matrix associated with the
new system. Once we obtain a triangular matrix,
write the associated linear system and then solve it.
Example:
  x + y + z = 0      (1)
  x - 2y + 2z = 4    (2)
  x + 2y - z = 2     (3)
The triangular linear system:
  x + y + z = 0
  -3y + z = 4
  -5z = 10
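
The same procedure can be sketched in code for the three-equation example (x + y + z = 0, x - 2y + 2z = 4, x + 2y - z = 2). This is a minimal illustration using exact fractions, assuming a square system with a unique solution; the function name `solve` is an assumption. It eliminates column by column, so its intermediate rows differ slightly from the order used on the slide, but the result is the same.

```python
from fractions import Fraction

def solve(augmented):
    """Solve a square linear system given as an augmented matrix [A | b]."""
    m = [[Fraction(v) for v in row] for row in augmented]
    n = len(m)
    # Forward elimination to upper-triangular form.
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]  # interchange two equations
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            # Replace row r with (row r) - factor * (pivot row).
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # Back substitution on the triangular system.
    x = [Fraction(0)] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

solution = solve([[1, 1, 1, 0], [1, -2, 2, 4], [1, 2, -1, 2]])
assert solution == [4, -2, -2]  # x = 4, y = -2, z = -2
```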
18
Implementing Reconstruction cont.
Fig. 6(c). The first 4 rows of MA describe the
operations that must be performed to reconstruct
our 4 disks.
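
The reconstruction procedure can be sketched end to end for a 2d-parity array with 16 information disks. Everything specific here is an assumption made for illustration: the disk numbering (0 to 15 data, 16 to 19 row checks, 20 to 23 column checks), the random disk contents, and the failure pattern, which is not the one shown in Fig. 6. Each parity check gives one equation over GF(2) in the failed disks, and row operations reduce the coefficient part to the identity, leaving each failed disk expressed as an XOR of surviving disks.

```python
import random

G = 4
N_INFO = G * G  # disks 0..15 hold data; 16..19 are row checks, 20..23 column checks

def check_rows():
    """Each parity-check row of H as the set of disk indices it covers."""
    rows = [{r * G + c for c in range(G)} | {N_INFO + r} for r in range(G)]
    cols = [{r * G + c for r in range(G)} | {N_INFO + G + c} for c in range(G)]
    return rows + cols

def reconstruct(disks, failed):
    """Rebuild the failed disks by Gaussian elimination over GF(2)."""
    failed = list(failed)
    # One equation per check: (coefficients of failed disks, XOR of surviving disks).
    eqs = []
    for members in check_rows():
        coeffs = [1 if f in members else 0 for f in failed]
        rhs = 0
        for d in members - set(failed):
            rhs ^= disks[d]
        eqs.append((coeffs, rhs))
    for k in range(len(failed)):
        # Pivot search raises StopIteration if the pattern is not recoverable.
        p = next(i for i in range(k, len(eqs)) if eqs[i][0][k])
        eqs[k], eqs[p] = eqs[p], eqs[k]
        for i in range(len(eqs)):
            if i != k and eqs[i][0][k]:
                eqs[i] = ([a ^ b for a, b in zip(eqs[i][0], eqs[k][0])],
                          eqs[i][1] ^ eqs[k][1])
    return {f: eqs[k][1] for k, f in enumerate(failed)}

# Fill the data disks with made-up words, then compute the 8 check disks.
rng = random.Random(0)
disks = [rng.randrange(256) for _ in range(N_INFO)]
for members in check_rows():
    check_disk = max(members)
    value = 0
    for d in members - {check_disk}:
        value ^= disks[d]
    disks.append(value)

failed = [0, 1, 5, 16]  # three data disks and one row check disk (hypothetical pattern)
damaged = [None if i in failed else v for i, v in enumerate(disks)]
assert reconstruct(damaged, failed) == {f: disks[f] for f in failed}
```

2d-parity only guarantees recovery from any two failures. This four-disk pattern happens to be recoverable, whereas four data disks at the corners of a rectangle would not be.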
19
Where to implement t-erasure-correcting codes

Implemented in software
Runs in an I/O processor
Software learns of failures directly from disk
controllers

20
Conclusion

Implement the redundancy codes for disk arrays
Minimize the number of check disks that must be
updated whenever an information disk is updated
Improve the reliability of disk arrays
