Title: Algorithm-Based Fault Tolerance Matrix Multiplication
1Algorithm-Based Fault Tolerance Matrix Multiplication
2Problem at Hand
- Have matrices A and B
- Want to compute their product AB
- Ask a matrix-matrix-multiply (MMM) implementation to compute the product
- Answer: C
- Question: Is C the correct answer? How could we know for sure?
3Algorithm-Based Fault Tolerance
- Encode input matrices via error-correcting code
- Run regular MMM algorithm on encoded matrices
- Encoding invariant under MMM
- Naturally outputs encoded matrices
- Encoding guarantees
- If up to t errors in output, will detect the error
- If up to c < t errors in output, can decode the correct output matrix
4Outline
Linear Error Correcting Codes
ABFT Linear Encoding of Matrices
Algorithm-Based Fault Tolerance
5Error Correcting Codes
- Map $f: \Sigma^k \to \Sigma^n$
- k-long data words $\to$ n-long codewords
- We use $\Sigma = \{0, 1\}$
- Code of length n is a sparse subset of $\Sigma^n$
- Very few possible words are valid codewords
- Rate of code: $k/n$
- Amount of information communicated by each codeword
6Minimum Distance
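- Minimum distance: $d_{min} = \min_{x \neq y} d(x, y)$ over all pairs of distinct codewords x, y, where $d(\cdot,\cdot)$ is the Hamming distance
- Hamming distance: number of spots where two words differ
- Measures difficulty of decoding/correcting corrupted codewords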
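The two definitions above can be sketched in a few lines. This is a minimal illustration, assuming words are given as equal-length strings; the function names and the 3-fold repetition code are hypothetical choices for the example.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

# Hypothetical example code: 3-fold repetition of one bit (n = 3, k = 1).
repetition_code = ["000", "111"]
d_min = minimum_distance(repetition_code)
```

For the repetition code the only pair of codewords differs in all three positions, so the minimum distance equals the code length.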
7Detection and Correction
- Code may detect errors in $< d_{min}$ spots
- No such error can morph one codeword into another
- May correct errors in $\le (d_{min}-1)/2$ spots
- Can still find closest codeword
- More details later
- Figure: each codeword defines a circle around itself of radius $d_{min}/2$
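A brute-force sketch of detection and nearest-codeword correction may help make the two bounds concrete. It assumes a hypothetical 5-fold repetition code (so $d_{min} = 5$) and searches the whole code, which is only practical for tiny examples.

```python
def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))

def detect(word, code):
    """An error is detected whenever the received word is not a codeword."""
    return word not in code

def correct(word, code):
    """Decode to the closest codeword (exhaustive search over the code)."""
    return min(code, key=lambda c: hamming_distance(word, c))

# Hypothetical example: 5-fold repetition code, d_min = 5.
code = ["00000", "11111"]

two_errors = "01001"   # "00000" with 2 flipped bits: 2 <= (5 - 1) / 2
four_errors = "01111"  # "00000" with 4 flipped bits: 4 < d_min, but 4 > 2

fixed = correct(two_errors, code)
misdecoded = correct(four_errors, code)
```

Both corrupted words are detected, since neither is a codeword. Only the first is corrected back to "00000"; the second lies closer to "11111", which is why correction is limited to $(d_{min}-1)/2$ errors.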
8Linear Codes
- Codewords form a linear subspace inside $\Sigma^n$
- In the row space of generator matrix G
- [Figure: generator matrix G of a (n = 7, k = 3) code]
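As a sketch of how a generator matrix produces codewords, the snippet below encodes 3-bit messages with a (n = 7, k = 3) binary code. The particular matrix G is an assumption chosen for illustration and is not necessarily the one shown on the slide.

```python
from itertools import product

# Hypothetical generator matrix of a binary (n = 7, k = 3) code.
G = [
    [1, 0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
]

def encode(message, G):
    """Codeword = message (row vector) times G, with arithmetic mod 2."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % 2
            for j in range(n)]

# The code is the row space of G: one codeword per k-bit message.
codewords = [encode(list(m), G) for m in product([0, 1], repeat=len(G))]
```

Because this G starts with an identity block, each codeword begins with the message bits themselves, followed by four check bits.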
9Property 1
- Linear combination of any codewords is also a codeword
- For any $x, y \in C$, $(x + y) \in C$
- Codeword × constant is a codeword
- For any $z \in C$ and constant k, $kz \in C$
- $\langle 0, 0, \dots, 0 \rangle$ is always a codeword
- Proof: basic properties of linear spaces
10Property 2
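- Minimum distance of a linear code: $d_{min} = \min_{c \in C,\ c \neq 0} weight(c)$
- Where $weight(c)$ = number of nonzero entries of c
- Proof: $d(x, y) = weight(x - y)$, and by Property 1 $x - y$ is itself a codeword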
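Properties 1 and 2 can be checked exhaustively on a small code. The sketch below assumes a hypothetical binary (n = 7, k = 3) generator matrix, enumerates all codewords, and compares the minimum pairwise distance with the minimum nonzero weight.

```python
from itertools import combinations, product

# Hypothetical generator matrix of a binary (n = 7, k = 3) code.
G = [
    [1, 0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
]

def encode(message, G):
    """Message times G over GF(2)."""
    return tuple(sum(m * row[j] for m, row in zip(message, G)) % 2
                 for j in range(len(G[0])))

def weight(word):
    """Number of nonzero entries."""
    return sum(1 for x in word if x != 0)

def distance(x, y):
    """Hamming distance."""
    return sum(a != b for a, b in zip(x, y))

codewords = {encode(m, G) for m in product((0, 1), repeat=len(G))}

# Property 1: the sum (mod 2) of any two codewords is again a codeword.
closed_under_addition = all(
    tuple((a + b) % 2 for a, b in zip(x, y)) in codewords
    for x in codewords for y in codewords
)

# Property 2: minimum distance equals minimum weight of a nonzero codeword.
min_distance = min(distance(x, y) for x, y in combinations(codewords, 2))
min_weight = min(weight(c) for c in codewords if any(c))
```

For this particular G every nonzero codeword has weight 4, so both quantities come out equal to 4.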
11Parity Check Matrix
- H: dual matrix to G
- Contains a basis of the space orthogonal to G's row space
- An (n-k)-dimensional space
- H is $(n-k) \times n$
- Code space defined as $\{x \in \Sigma^n : Hx = 0\}$
- Note: H also defines a linear code
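One common way to obtain H, sketched here under the assumption that G is binary and in systematic form $G = [I_k \mid P]$, is to take $H = [P^T \mid I_{n-k}]$. The example matrix and function names are hypothetical.

```python
from itertools import product

# Hypothetical systematic generator matrix G = [I_3 | P] of a (7, 3) code.
G = [
    [1, 0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
]

def parity_check_from_systematic(G):
    """Build H = [P^T | I_(n-k)] from a binary G = [I_k | P]."""
    k, n = len(G), len(G[0])
    P = [row[k:] for row in G]  # k x (n - k) block of check columns
    H = []
    for i in range(n - k):
        transposed = [P[r][i] for r in range(k)]
        identity = [1 if j == i else 0 for j in range(n - k)]
        H.append(transposed + identity)
    return H

def mat_vec_mod2(H, x):
    """Product H x over GF(2)."""
    return [sum(h * v for h, v in zip(row, x)) % 2 for row in H]

def encode(message, G):
    """Message times G over GF(2)."""
    return [sum(m * row[j] for m, row in zip(message, G)) % 2
            for j in range(len(G[0]))]

H = parity_check_from_systematic(G)
all_codewords_pass = all(
    mat_vec_mod2(H, encode(m, G)) == [0] * len(H)
    for m in product((0, 1), repeat=len(G))
)
```

The resulting H has n - k = 4 rows and n = 7 columns, and every codeword x satisfies Hx = 0.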
12Property 3
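- $d_{min}$ = min # of columns of H that can sum to 0
- Proof: $Hc = 0$ for a codeword c says that the columns of H selected by the 1s of c sum to 0; the fewest such columns equals the minimum weight of a nonzero codeword, which is $d_{min}$ by Property 2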
13Property 4
- Minimum distance of a linear code $\le n-k+1$
- Proof
- Total n dimensions (since codewords are n-vectors)
- G's row space has rank k
- Thus, H's column space has rank n-k
- Thus, any n-k+1 columns will be linearly dependent
- Some of them add up to 0
- By Property 3, their number is $\ge d_{min}$
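Properties 3 and 4 can be checked by brute force on a small parity check matrix. The sketch below assumes a hypothetical binary H for a (n = 7, k = 3) code and searches for the smallest set of columns that sums to zero mod 2.

```python
from itertools import combinations

# Hypothetical parity check matrix of a binary (n = 7, k = 3) code.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def min_columns_summing_to_zero(H):
    """Smallest number of columns of H whose sum is 0 over GF(2)."""
    n = len(H[0])
    columns = [tuple(row[j] for row in H) for j in range(n)]
    for size in range(1, n + 1):
        for subset in combinations(columns, size):
            # zip(*subset) walks the chosen columns one coordinate at a time.
            if all(sum(bits) % 2 == 0 for bits in zip(*subset)):
                return size
    return None  # no dependent set of columns exists

n = len(H[0])
k = n - len(H)  # assumes the rows of H are linearly independent
d_min = min_columns_summing_to_zero(H)
singleton_bound = n - k + 1
```

Here the search stops at four columns, so $d_{min} = 4$, which respects the bound $n - k + 1 = 5$.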
14Outline
Linear Error Correcting Codes
ABFT Linear Encoding of Matrices
Algorithm-Based Fault Tolerance
15Encoding a Matrix
- Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984
- Encode each row of matrix via extra column
- Column entries: sums of matrix rows
16Encoding a Matrix
- Encode each column of matrix via extra row
- Row entries: sums of matrix columns
- Full encoding: both the extra column and the extra row
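The three encodings can be sketched as follows, assuming matrices are stored as lists of rows. The function names are hypothetical; "row-encoded" means an extra checksum column and "column-encoded" means an extra checksum row, as on the slides.

```python
def row_encode(A):
    """Encode each row: append a column holding every row's sum."""
    return [row + [sum(row)] for row in A]

def column_encode(A):
    """Encode each column: append a row holding every column's sum."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def full_encode(A):
    """Apply both encodings; the corner entry is the sum of all entries."""
    return column_encode(row_encode(A))

A = [[1, 2],
     [3, 4]]
A_row = row_encode(A)
A_col = column_encode(A)
A_full = full_encode(A)
```

The order of the two steps in `full_encode` does not matter, since the corner entry is the total of all data entries either way.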
17Detecting Errors
- Suppose matrix A is corrupted to matrix Â
- Entry $\hat{a}_{i,j}$ is wrong
- Can detect the error's exact position $\langle i, j \rangle$
18Correcting Errors
- Can correct the error using the row or column checksum
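Locating and repairing a single wrong entry can be sketched as below. This is an illustration under stated assumptions: the matrix is fully encoded with integer entries (floating-point data would need a tolerance instead of exact comparison), exactly one data entry is corrupted, and the function names are hypothetical.

```python
def locate_error(F):
    """Return (i, j) of the single corrupted data entry of a fully
    encoded matrix F, or None if all checksums agree."""
    m, n = len(F) - 1, len(F[0]) - 1  # data block is m x n
    bad_rows = [i for i in range(m) if sum(F[i][:n]) != F[i][n]]
    bad_cols = [j for j in range(n)
                if sum(F[i][j] for i in range(m)) != F[m][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single corrupted data entry")

def correct_error(F, i, j):
    """Recompute entry (i, j) in place from the row checksum."""
    n = len(F[0]) - 1
    others = sum(F[i][c] for c in range(n) if c != j)
    F[i][j] = F[i][n] - others

# Fully encoded version of [[1, 2], [3, 4]], then one entry corrupted.
F = [[1, 2, 3],
     [3, 4, 7],
     [4, 6, 10]]
F[1][0] = 9
position = locate_error(F)
correct_error(F, *position)
```

The failing row checksum and the failing column checksum intersect at the corrupted entry; the column checksum could be used for the repair just as well as the row checksum.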
19Big Trick: Preservation of Encoding
- Column-encoded mtx × Row-encoded mtx = Fully-encoded mtx
- Can check MMM computation by checking encoding of output
- If product matrix has an erroneous entry
- Can detect
- Can correct
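The preservation property can be checked directly on a small example. The sketch below uses a naive triple-loop style multiply standing in for "any MMM implementation"; all function names are hypothetical and integer entries are assumed so that checksums compare exactly.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def row_encode(B):
    """Append a checksum column (sum of each row)."""
    return [row + [sum(row)] for row in B]

def column_encode(A):
    """Append a checksum row (sum of each column)."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def has_valid_full_encoding(F):
    """Check every row and column checksum of a fully encoded matrix."""
    m, n = len(F) - 1, len(F[0]) - 1
    rows_ok = all(sum(F[i][:n]) == F[i][n] for i in range(m + 1))
    cols_ok = all(sum(F[i][j] for i in range(m)) == F[m][j]
                  for j in range(n + 1))
    return rows_ok and cols_ok

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiply the encoded inputs with the ordinary algorithm.
C_full = matmul(column_encode(A), row_encode(B))
# The data block of the result is AB; the border holds its checksums.
C = [row[:-1] for row in C_full[:-1]]
```

The multiply routine itself is unchanged: it never needs to know that the last row of one input and the last column of the other are checksums.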
20Applications
- Matrix Multiplication
- Given encoded A and B,
- Check whether MMM result C (expected to equal AB) has valid encoding
- Matrix Factorization
- Given a factorization A = WZ
- Verify correctness by verifying encodings of factors
- Factors row- OR column-encoded
- Can only detect, not correct errors
21Weighted ABFT
- Oftentimes need to check row- or column-encoded matrices
- Ex: factorization, data integrity check
- Can only detect errors in such matrices
- Can we also correct?
- Yes, by generalizing to weighted checking rows/columns
22Weighting
- Suppose we have d n-vectors $w_1, \dots, w_d$
- Can column-encode matrix A
- Let's try out
23Weighted Error Detection
24Weighted Error Correction
- Weighted encoding detects and corrects single errors
- Even for non-full encoding
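Weighted detection and correction can be sketched for a column-encoded matrix with two checksum rows. The weight vectors $w_1 = (1, \dots, 1)$ and $w_2 = (1, 2, \dots, n)$ are an assumption made for this illustration and are not necessarily the ones used in these slides; integer entries and at most one corrupted data entry per column are also assumed.

```python
def weighted_column_encode(A, weights):
    """Append one checksum row per weight vector w: entry j is w . column j."""
    columns = list(zip(*A))
    checks = [[sum(w_i * a for w_i, a in zip(w, col)) for col in columns]
              for w in weights]
    return [row[:] for row in A] + checks

def correct_column(E, j, n_rows):
    """Fix a single wrong data entry in column j, in place.
    Assumes weights (1, ..., 1) and (1, 2, ..., n_rows).
    Returns the corrected row index, or None if the column is consistent."""
    col = [E[i][j] for i in range(n_rows)]
    s1 = sum(col) - E[n_rows][j]                       # equals the error delta
    s2 = sum((i + 1) * a for i, a in enumerate(col)) - E[n_rows + 1][j]
    if s1 == 0 and s2 == 0:
        return None
    # For one error of size delta in row i: s1 = delta, s2 = (i + 1) * delta.
    if s1 == 0 or s2 % s1 != 0 or not 1 <= s2 // s1 <= n_rows:
        raise ValueError("not a single corrupted data entry")
    i = s2 // s1 - 1
    E[i][j] -= s1
    return i

A = [[1, 2], [3, 4], [5, 6]]
weights = [[1, 1, 1], [1, 2, 3]]
E = weighted_column_encode(A, weights)
E[1][1] = 10                      # corrupt one data entry
fixed_row = correct_column(E, 1, n_rows=3)
```

The first checksum gives the size of the error and the ratio of the two discrepancies gives its row, so no row checksum is needed to pinpoint it.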
25Outline
Linear Error Correcting Codes
ABFT Linear Encoding of Matrices
Algorithm-Based Fault Tolerance
26Surprise
- But this is all just a linear code!
- Generator matrix for above scheme
27Generating Encodings
- Given $m = \langle a_{i,1}, a_{i,2}, \dots, a_{i,k} \rangle$ as message word (or matrix row/column)
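Written out, one way to express the checksum encoding of a message word as a linear code is shown below. The notation is an assumption for illustration: the weight vectors $w_1, \dots, w_d$ are taken to be column vectors of length k, and the plain (unweighted) checksum is the special case $d = 1$ with $w_1$ the all-ones vector.

```latex
\begin{align*}
m  &= \langle a_{i,1},\, a_{i,2},\, \dots,\, a_{i,k} \rangle \\
G  &= \left[\; I_k \;\middle|\; w_1 \;\; w_2 \;\; \cdots \;\; w_d \;\right]
      \qquad (k \times (k + d)) \\
mG &= \Big\langle a_{i,1},\, \dots,\, a_{i,k},\;
      \sum_{j=1}^{k} w_{1,j}\, a_{i,j},\; \dots,\;
      \sum_{j=1}^{k} w_{d,j}\, a_{i,j} \Big\rangle
\end{align*}
```

The identity block copies the data unchanged and each weight column contributes one checksum entry, so the encoded row is a codeword of length $n = k + d$.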
28Surprise??
- Not too surprising really
- Why else would MMM preserve encoding?
- Another possibility
- Efficient: can be implemented via bit shifts
- Room open for using any linear code!
29Error Detection/Correction in General
- To show for linear codes
- Can detect $< d_{min}$ errors
- Can correct $\le (d_{min}-1)/2$ errors
- Let $c$ be the original codeword
- Let $\hat{c}$ be the corrupted codeword
- $\hat{c} = c + e$
- e: error vector
30Error Detection in General
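- $s = H\hat{c} = H(c + e) = Hc + He$, where $\hat{c} = c + e$ is the corrupted codeword
- $Hc = 0$ since c is a codeword, so $s = He$
- s called the syndrome vector
- Independent of original codeword
- Note: $weight(e) < d_{min}$ since $< d_{min}$ errors
- Thus $He \neq 0$ (by Property 3)
- Detection: if $s \neq 0$, then ERROR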
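Syndrome-based detection can be sketched in a few lines. The parity check matrix below belongs to a hypothetical binary (n = 7, k = 3) code chosen for illustration, and the variable names are assumptions.

```python
# Hypothetical parity check matrix of a binary (n = 7, k = 3) code.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def syndrome(H, word):
    """s = H * word over GF(2)."""
    return [sum(h * x for h, x in zip(row, word)) % 2 for row in H]

codeword = [1, 1, 0, 0, 1, 1, 0]   # a valid codeword of this code
error = [0, 0, 1, 0, 0, 0, 0]      # single bit flip in position 2
received = [(c + e) % 2 for c, e in zip(codeword, error)]

s_codeword = syndrome(H, codeword)
s_received = syndrome(H, received)
s_error = syndrome(H, error)
error_detected = any(s_received)
```

The syndrome of the received word matches the syndrome of the error vector alone, which illustrates that it does not depend on which codeword was sent.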
31Error Correction in General
- Clearly e is the correction vector
- $\hat{c} - e$ corrects the error in the corrupted codeword $\hat{c}$
- Sufficient to prove:
- $weight(e) \le (d_{min}-1)/2 \Rightarrow$ H is an isomorphism: correction vectors $\leftrightarrow$ syndrome vectors
- i.e. for each correction vector (want to know) $\exists$ a unique syndrome vector
- Thus, possible to correct any such error
- May not be efficient
32H is Onto
- $weight(e) \le (d_{min}-1)/2 < d_{min}$
- $rank(H) = n-k \ge (d_{min}-1)/2$
- Thus, $rank(H) \ge weight(e)$ and $He \neq 0$
- Not enough 1s in e to sum H's columns to 0
- H maps onto its range
- Thus, every syndrome in that range comes from some correction vector
33H is 1-1
- Let $e_1$ and $e_2$ be correction vectors, $e_1 \neq e_2$
- Suppose that
- $weight(e_1), weight(e_2) \le (d_{min}-1)/2$
- $He_1 = He_2 = s$
- $He_1 - He_2 = H(e_1 - e_2) = s - s = 0$
- And so, $(e_1 - e_2)$ is a nonzero codeword
- Thus, $weight(e_1 - e_2) \ge d_{min}$
- But $weight(e_1), weight(e_2) \le (d_{min}-1)/2$ and so $weight(e_1 - e_2) \le d_{min}-1$
- Contradiction! $e_1 = e_2$
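The one-to-one correspondence argued above is what makes table-based correction possible. The sketch below builds a syndrome table for a hypothetical binary (n = 7, k = 3) code with $d_{min} = 4$, so it can correct $t = 1$ error; it is a brute-force illustration, not an efficient decoder.

```python
from itertools import combinations

# Hypothetical parity check matrix of a binary (7, 3) code with d_min = 4.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def syndrome(H, word):
    """s = H * word over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, word)) % 2 for row in H)

def build_syndrome_table(H, t):
    """Map the syndrome of every error vector of weight <= t to that vector.
    Raises if two such vectors collide, i.e. if H is not 1-1 on them."""
    n = len(H[0])
    table = {}
    for w in range(t + 1):
        for positions in combinations(range(n), w):
            e = [1 if i in positions else 0 for i in range(n)]
            s = syndrome(H, e)
            if s in table:
                raise ValueError("two correction vectors share a syndrome")
            table[s] = e
    return table

def decode(H, table, received):
    """Look up the correction vector and remove it (KeyError if unknown)."""
    e = table[syndrome(H, received)]
    return [(r + x) % 2 for r, x in zip(received, e)]

table = build_syndrome_table(H, t=1)
received = [1, 1, 1, 0, 1, 1, 0]   # codeword 1100110 with bit 2 flipped
corrected = decode(H, table, received)
```

Calling `build_syndrome_table(H, t=2)` on this code raises, because two errors exceed $(d_{min}-1)/2$ and distinct error vectors start sharing syndromes.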
34Other Encoding Schemes
- Linear codes preserved by matrix multiplication
- Presumably, fancier codes might be preserved by fancier computations
- Limit
- S. Winograd showed in 1962 that any code s.t. $f(x \circ y) = f(x) \circ f(y)$ has rate (k/n) or minimum weight $\to 0$ as $k \to \infty$
- How general can we get?
- Do good solutions exist for small k?
- k = 64 bits should be good enough
35Summary
- For Matrix Multiplication can encode input via linear codes
- Solutions exist for more complex codes
- Ex: Fourier Transforms
- On parallel systems must ensure
- No processor touches > 1 element per row/column
- Else, if one processor fails, encoding overwhelmed with errors
- To ensure this must modify algorithm
- Separate check placement theory