Title: Koh-i-Noor
1. Koh-i-Noor
- Mark Manasse, Chandu Thekkath
- Microsoft Research
- Silicon Valley
5/29/2016
2. Motivation
- Large-scale storage systems can be expensive to build and maintain:
  - cost of managing the storage,
  - cost of provisioning the storage.
- Total cost of ownership is dominated by management cost,
  - but provisioning 10,000 spindles' worth of useful data is expensive, especially with redundant storage.
3. Rationale / Mathematics I
- Hotmail has storage needs approaching a petabyte.
- Currently built from difficult-to-manage components:
  - 1,000 10-disk RAID arrays.
  - Requires significant backup effort, incessant replacement of failed drives, etc.
- Reliability could be improved by mirroring,
  - but 10,000 mirrored drives is a lot of work to manage.
- For either system, difficult to expand gracefully, since the storage is in small, isolated clumps.
- The Galois field of order p^k (for p prime) is formed by considering polynomials in (Z/pZ)[x] modulo a primitive polynomial of degree k.
- Facts:
  - Any primitive polynomial will do; all the resulting fields are isomorphic.
  - We write GF(p^k) to denote one such field.
  - x is a generator of the field (checked in the sketch below).
  - Everything you know about algebra is still true.
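A minimal sketch of this arithmetic, specialized to GF(2^8) as used later in the deck. The reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is an assumption; the slides never name one, and any primitive polynomial of degree 8 works.

```python
# Sketch of GF(2^8) arithmetic. ASSUMPTION: reduction polynomial 0x11D;
# the deck never names one, only that it must be primitive.
POLY = 0x11D

def gf_add(a, b):
    # Addition of polynomials over Z/2Z is coefficient-wise XOR.
    return a ^ b

def gf_mul(a, b):
    # Shift-and-add polynomial multiplication, reducing modulo POLY
    # whenever the intermediate degree reaches 8.
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

# "x is a generator of the field": x (= 0x02) cycles through all 255
# non-zero elements before returning to 1.
e, seen = 1, set()
for _ in range(255):
    seen.add(e)
    e = gf_mul(e, 0x02)
assert e == 1 and len(seen) == 255
```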
4. Reducing the Total Cost of Ownership
- Reduce management cost:
  - Use automatic management, automatic load balancing, incremental system expansion, fast backup.
  - (Focus of earlier research and the Sepal project.)
- Reduce provisioning cost:
  - Use redundancy schemes that tolerate multiple failed components without the cost of duplicating them, as is done in mirroring or triplexing.
5. Rationale / Mathematics II
- There are existing products that improve storage management.
  - It would be difficult to use them to build petabyte storage systems that can support something like Hotmail.
  - Need thousands of these components, and even with squadrons of Perl and Java programmers, it would be difficult to configure and manage such a system.
  - These products lack formal management interfaces that can be used by external programs or by each other to reliably manage the system.
- A Vandermonde matrix is of the form
    ( 1  a  a^2  ...  a^(n-1) )
    ( 1  b  b^2  ...  b^(n-1) )
    (           ...           )
    ( 1  z  z^2  ...  z^(n-1) )
- The determinant of a Vandermonde matrix is the product of the pairwise differences of its elements: (a-b)(a-c)...(a-z)(b-c)...(b-z)...(y-z) (verified numerically below).
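A quick numeric check of the product-of-differences formula, over the integers for simplicity (the identity is polynomial, so it holds in any field; signs are moot in characteristic 2 anyway).

```python
# Numeric check of the product-of-differences formula for a 4x4 example.
def det(m):
    # Laplace expansion along the first row; fine for tiny matrices.
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))

xs = [2, 3, 5, 7]                                   # the elements a, b, c, d
V = [[x ** i for x in xs] for i in range(len(xs))]  # row i holds i-th powers
prod = 1
for i in range(len(xs)):
    for j in range(i + 1, len(xs)):
        prod *= xs[j] - xs[i]   # sign convention matching this orientation
assert det(V) == prod           # both are 240 here
```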
6. Large-Scale Mirroring
- Assume a disk mean time to failure (MTTF) of 50 years.
- Mirroring needs 20,000 disks for 10,000 data disks.
  - Expect 8 disks to fail every week.
- Assume 1 day to repair a disk by copying from its mirror.
- Mean time before data is unavailable (because of a double failure) is 45 years.
- Cost of storage is double what is actually needed.
7. Rationale / Mathematics III
- Goal: build a distributed block-level storage system that is highly available, scalable, and easy to manage.
- Design each component of the system to automate or eliminate many management decisions that are today made by human beings.
- Automatically tolerate and recover from exceptional conditions, such as component failures, that traditionally require human intervention.
- Facts about determinants:
  - det(AB) = det(A) · det(B).
  - det(A | k·a) = k · det(A | a): multiplying one column by k multiplies the determinant by k.
- Facts about GF(256):
  - a + b = a XOR b.
  - Every element other than 0 is x^k, for some 0 ≤ k < 255.
  - If a = x^k and b = x^j, then ab = x^(k+j).
  - With a 512-byte table of logarithms and a 1024-byte table of anti-logarithms, multiplication becomes trivial (see the sketch below).
  - A byte of data, viewed as an element of GF(256), multiplied and added with other bytes, still occupies a byte of storage.
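A sketch of the table-driven multiplication described above, again assuming polynomial 0x11D with generator x = 0x02. The doubled anti-log table avoids the mod-255 reduction, and the tables land in the ballpark of the small sizes the slide quotes.

```python
# Table-driven GF(256) multiplication. ASSUMPTION: polynomial 0x11D with
# generator x = 0x02 (the slide only requires some primitive polynomial).
POLY = 0x11D

def gf_mul_slow(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

# exp[k] = x^k and log[x^k] = k. Doubling the anti-log table to 510
# entries lets us skip reducing (k + j) mod 255.
exp = [0] * 510
log = [0] * 256
e = 1
for k in range(255):
    exp[k] = exp[k + 255] = e
    log[e] = k
    e = gf_mul_slow(e, 0x02)

def gf_mul(a, b):
    # a = x^k, b = x^j  =>  ab = x^(k+j); zero has no logarithm.
    if a == 0 or b == 0:
        return 0
    return exp[log[a] + log[b]]

assert all(gf_mul(a, b) == gf_mul_slow(a, b)
           for a in range(256) for b in range(256))
```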
8. Improving Time to Data Unavailability
- Consider 10,000 disks, each with 50 years MTTF.
  - 50 years with mirroring, at 2x provisioning cost.
    - May be too short a period at too high a price.
  - ½M years with triplexing, at 3x provisioning cost.
    - Very long, at a very high price.
- An alternative: use Reed-Solomon erasure codes.
  - 40 clusters of 256 disks each.
  - Can tolerate triple (or more) failures.
  - 50K years before data unavailability.
  - 1.03x provisioning cost.
9. Rationale IV
- Suppose the instantaneous probability of a disk being in a failed state is p (computed as MTTR/MTTF).
  - The probability of k particular disks all being failed is p^k.
  - The probability of finding k disks failed out of j is q = (j C k) p^k (1-p)^(j-k).
- Cluster MTTF ≈ MTTR/q; with N total disks, System MTTF ≈ k! MTTF^k / (N (j·MTTR)^(k-1)) (if j ≫ k; evaluated in the sketch below).
- RAID sets:
  - N = 10,000, k = 2, j = 10, MTTF = 50 years, MTTR = 1 day. System MTTF ≈ 50^2 · 365 / 100,000 ≈ 9 years.
- Duplexing:
  - N = 20,000, k = 2, j = 2. System MTTF ≈ 50^2 · 365 / 20,000 ≈ 45 years.
- Triplexing:
  - N = 30,000, k = 3, j = 3. System MTTF ≈ 50^3 · 365^2 / 30,000 ≈ 555,000 years.
- Erasure codes:
  - N = 10,000, k = 4, j = 256. System MTTF ≈ 50^4 · 365^3 · 24 / (256^3 · 10,000) ≈ 43,500 years. Is that enough?
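As a cross-check, a small script (a sketch; all times in days, MTTR = 1 day) plugs the four configurations into the approximation above. The duplexing and erasure-code rows reproduce the slide's figures; the RAID and triplexing rows differ by the small combinatorial prefactors that the slide's quoted arithmetic appears to fold away.

```python
# Plugging the slide's four configurations into the approximation above.
from math import factorial

def system_mttf_years(mttf_d, mttr_d, n, j, k):
    # System MTTF ~ k! * MTTF^k / (N * (j * MTTR)^(k-1)), valid for j >> k.
    days = factorial(k) * mttf_d ** k / (n * (j * mttr_d) ** (k - 1))
    return days / 365

MTTF = 50 * 365
for name, n, j, k in [("RAID sets", 10_000, 10, 2),
                      ("duplexing", 20_000, 2, 2),
                      ("triplexing", 30_000, 3, 3),
                      ("erasure codes", 10_000, 256, 4)]:
    print(f"{name:13s} ~{system_mttf_years(MTTF, 1, n, j, k):12,.0f} years")
# Prints ~18, ~46, ~370,059, and ~43,476 years. The middle two and the last
# match the slide; the RAID and triplexing figures quoted on the slide
# appear to drop the k! and j^(k-1) prefactors.
```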
10. Inductive step, proving the determinant of a Vandermonde matrix is the product of differences. The determinant here is 1. Expand on the last column; after removing common factors, what's left is V_k.
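As a sketch of the step this slide's figure illustrates (the original drawing may arrange rows and columns differently):

```latex
% One standard rendering of the inductive step. Apply the column
% operations C_j <- C_j - x_1 * C_{j-1}, rightmost column first:
\det V_n =
\det\begin{pmatrix}
1      & 0         & \cdots & 0 \\
1      & x_2 - x_1 & \cdots & x_2^{n-2}(x_2 - x_1) \\
\vdots & \vdots    &        & \vdots \\
1      & x_n - x_1 & \cdots & x_n^{n-2}(x_n - x_1)
\end{pmatrix}
= \prod_{i=2}^{n} (x_i - x_1) \, \det V_{n-1}
% Expand on the row of zeros, then remove the common factor (x_i - x_1)
% from each remaining row: what's left is V_{n-1}. The base case
% det V_1 = 1 starts the induction, giving the product of all pairwise
% differences.
```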
11. Reed-Solomon Erasure Codes
1. We use an (n+k) × n coding matrix to store data on n data disks and k check disks (k = 3 in our example).
2. Suppose data disks 2 and 3 and check disk 3 fail.
3. Omitting failed rows, we get an invertible n × n matrix R.
4. Multiplying both sides by R^-1, we recover all the data (sketched below).
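A minimal end-to-end sketch of these four steps, with single bytes standing in for whole disk blocks. The check rows use the k = 3 construction described on the "Mathematics Vb" slide; the polynomial 0x11D, generator x = 0x02, and n = 4 data disks are assumptions, since the deck doesn't pin them down.

```python
# End-to-end sketch of steps 1-4, one byte per "disk block".
POLY = 0x11D

def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)  # non-zero elements form a group of order 255

def mat_vec(m, v):
    out = []
    for row in m:
        acc = 0
        for a, b in zip(row, v):
            acc ^= gf_mul(a, b)
        out.append(acc)
    return out

def mat_inv(m):
    # Gauss-Jordan elimination over GF(256); subtraction is XOR.
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c])   # non-zero pivot
        aug[c], aug[p] = aug[p], aug[c]
        inv = gf_inv(aug[c][c])
        aug[c] = [gf_mul(inv, v) for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [v ^ gf_mul(f, w) for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

n, k = 4, 3
A = [[int(i == j) for j in range(n)] for i in range(n)]           # data rows
A += [[gf_pow(0x02, c * j) for j in range(n)] for c in range(k)]  # check rows

data = [0x37, 0x80, 0x1F, 0xA5]
stored = mat_vec(A, data)    # step 1: what the n + k disks hold

alive = [0, 3, 4, 5]         # step 2: data disks 2, 3 and check disk 3 fail
R = [A[i] for i in alive]    # step 3: surviving rows form an invertible matrix
recovered = mat_vec(mat_inv(R), [stored[i] for i in alive])   # step 4
assert recovered == data
```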
12. Rationale / Mathematics Va
- The matrices were transposed so that the data would be in column vectors, which fits better on the slide.
- The correction matrix in the example uses my special erasure code for k = 3.
- The correction rows are taken from a Vandermonde matrix,
  - as you might recognize, had you been reading the other half of the slides.
- For general k, we take a Vandermonde matrix using elements 0, 1, ..., 255.
  - The product of element differences is non-zero, because all the individual element differences are non-zero.
- Make it tall enough for all the data disks.
- Diagonalize the square part, and use the remaining columns for check disks (sketched below).
  - Row operations change the determinant in simple ways.
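A toy-scale sketch of this construction, in the transposed (data-as-columns) orientation, where the check coefficients land in the bottom rows: build a tall Vandermonde matrix over GF(256), then right-multiply by the inverse of its top square block so the square part becomes the identity. The polynomial 0x11D and the tiny dimensions are assumptions.

```python
# Tall Vandermonde matrix with the square part diagonalized.
POLY = 0x11D

def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)

def mat_inv(m):
    # Same Gauss-Jordan over GF(256) as in the earlier sketch.
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[p] = aug[p], aug[c]
        inv = gf_inv(aug[c][c])
        aug[c] = [gf_mul(inv, v) for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [v ^ gf_mul(f, w) for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def mat_mul(a, b):
    out = []
    for row in a:
        new = []
        for j in range(len(b[0])):
            acc = 0
            for t, v in enumerate(row):
                acc ^= gf_mul(v, b[t][j])
            new.append(acc)
        out.append(new)
    return out

n, k = 4, 3
V = [[gf_pow(e, j) for j in range(n)] for e in range(n + k)]  # elements 0..6

G = mat_mul(V, mat_inv(V[:n]))   # diagonalize the square part
assert G[:n] == [[int(i == j) for j in range(n)] for i in range(n)]
for row in G[n:]:
    print(row)   # check-disk coefficients, one row per check disk
# Any n of the n + k rows of G stay invertible: each n-row selection of V
# is Vandermonde on distinct elements (non-zero product of differences),
# and the column operations scale all those minors by the same non-zero
# factor.
```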
13. Rationale / Mathematics Vb
- Computing check digits is easy, and well-parallelized.
- To update a block on data disk j, compute the XOR of the old block with the new block (sketched below).
  - Multicast the log of the XOR together with j to the check disks (and have them start reading the block off disk).
  - Each check disk multiplies the XORed data block by the j-th entry in its column, XORs that with the old check value, and writes the new check block.
- Rotate which disks are check disks, making the load uniform.
- No difference in latency compared to RAID-5 (but double the disk bandwidth).
- For k = 3, we can just take the identity matrix plus three columns of Vandermonde, using 1, x, x^2.
  - Computation of check digits is faster: the logarithm for the k-th check disk and j-th data disk is k·j.
- We omit non-failed data disks in computing determinants.
  - What remains is a small minor of the whole matrix.
  - 1×1 works because all entries are non-zero.
  - 3×3 works because it's a transposed Vandermonde matrix.
  - 2×2 works because it's either a Vandermonde matrix, or a Vandermonde matrix with columns multiplied by powers of x.
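A sketch of that write path for the k = 3 code: check disk c holds the XOR over j of x^(c·j) · data_j, so an update only needs the delta. One-byte blocks and polynomial 0x11D are assumptions, as before.

```python
# Incremental check update on a write, verified against full recomputation.
POLY = 0x11D

def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def check_value(c, data):
    # Full recomputation, used only to verify the shortcut.
    acc = 0
    for j, d in enumerate(data):
        acc ^= gf_mul(gf_pow(0x02, c * j), d)
    return acc

data = [0x10, 0x22, 0x34, 0x46]
checks = [check_value(c, data) for c in range(3)]

# Overwrite the block on data disk j = 2: multicast (j, old XOR new) and
# let each check disk fold the delta into its stored value.
j, new = 2, 0x99
delta = data[j] ^ new
data[j] = new
for c in range(3):
    checks[c] ^= gf_mul(gf_pow(0x02, c * j), delta)

assert checks == [check_value(c, data) for c in range(3)]
```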
14. Parallel Reconstruction of a Failed Disk
- Given the inverse matrix (as above), a failed disk is the sum (exclusive or) of products of the individual bytes on data and check disks with the elements of the matrix.
- The example below shows reconstruction of disk 2, in a system with 4 data disks and 2 check disks; disks 2 and 3 have failed.
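The original example is a figure; here is a stand-in sketch of the same configuration (4 data disks d0..d3, check disks c0 = XOR of the data and c1 = XOR of x^j · d_j; zero-based disks 2 and 3 assumed failed). Disk 2 comes back as an XOR tree over products of the four survivors with one row of R^-1, derived by hand in the comments.

```python
# Reconstruction of d2 as an XOR tree of products. ASSUMPTION: 0x11D.
POLY = 0x11D

def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)

x = 0x02
d = [0x11, 0x22, 0x33, 0x44]
c0 = d[0] ^ d[1] ^ d[2] ^ d[3]
c1 = (d[0] ^ gf_mul(x, d[1]) ^ gf_mul(gf_pow(x, 2), d[2])
      ^ gf_mul(gf_pow(x, 3), d[3]))

# Eliminating d3 from the two check equations gives
#   (x^3 + x^2) d2 = x^3 (c0 + d0 + d1) + (c1 + d0 + x d1),
# i.e. one coefficient per surviving source: a row of R^-1.
idet = gf_inv(gf_pow(x, 3) ^ gf_pow(x, 2))
srcs = [d[0], d[1], c0, c1]
coefs = [gf_mul(idet, gf_pow(x, 3) ^ 1),
         gf_mul(idet, gf_pow(x, 3) ^ x),
         gf_mul(idet, gf_pow(x, 3)),
         idet]
products = [gf_mul(cf, s) for cf, s in zip(coefs, srcs)]

def tree_xor(parts):
    # Pairwise XOR up a binary tree, the combining order the switch
    # network on the next slide provides; XOR is associative, so the
    # tree shape doesn't change the answer.
    while len(parts) > 1:
        nxt = [a ^ b for a, b in zip(parts[0::2], parts[1::2])]
        if len(parts) % 2:
            nxt.append(parts[-1])
        parts = nxt
    return parts[0]

assert tree_xor(products) == d[2]
```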
15. Rationale / Mathematics VI
- The reconstruction on the previous slide requires:
  - Reading all of the surviving disks.
  - Parallel multiplication by precomputed entries from the inverse matrix.
  - XOR of pairs of disks' entries, working up a binary tree.
- A network of small, cheap switches provides the necessary connectivity.
  - Put hot-spare CPUs and drives into the network; skip up the tree at nodes with no useful partner.
- Given the nature of the matrix, the inverse matrix can be constructed via Gaussian elimination very quickly for small values of k.
- With more complicated erasure coding (2D parity, for example), it's harder to see how to wire the network to accommodate hot spares without requiring more from the interconnect.
16. Related Work
- Reed-Solomon error-correcting codes have been used for single disks and CDs.
- Reed-Solomon and Even-Odd erasure codes have been proposed:
  - for software RAID [ABC],
  - for wide-area storage systems [LANL].
- Our usage of the code is different.
- Parallel reconstruction is novel.
- Patents in progress on both.
17. Potential deficiencies
- This could be a bad idea, depending on what you're trying to do:
  - The bandwidth in at the top of the tree isn't enough to keep all the disk arms busy with small reads and writes.
    - This could be a bad disk for building databases.
  - The total disk bandwidth at least doubles during degraded operation, which might affect many more disks than with mirroring or standard RAID-5.
  - Failures may not be independent, invalidating predicted reliability.
  - CPUs (which, if you attach multiple disks, should serve disks in different pods) may fail too often, in ways that rebooting doesn't fix.
  - Power supplies, cables, and switches may fail, which we haven't accounted for.
  - Writes are going to be relatively expensive.
18. Comparisons and future work
- Other people have already built small erasure-coded arrays. Is there enough new here? We think so.
- Real systems need:
  - backup,
  - geographical redundancy.