Koh-i-Noor - PowerPoint PPT Presentation

1
Koh-i-Noor

Mark Manasse, Chandu Thekkath
Microsoft Research
Silicon Valley

2/23/2014
2
Motivation

Large-scale storage systems can be expensive to
build and maintain
Cost of managing the storage,
Cost of provisioning the storage.
Total cost of ownership is dominated by
management cost,
but provisioning 10,000 spindles worth of useful
data is expensive, especially with redundant
storage.

3
Rationale / Mathematics I

Hotmail has storage needs approaching a petabyte.
Currently built from difficult-to-manage
components.
1,000 10-disk RAID arrays
Requires significant backup effort, incessant
replacement of failed drives, etc.
Reliability could be improved by mirroring.
But 10,000 mirrored drives is a lot of work to
manage.
For either system, difficult to expand
gracefully, since the storage is in small,
isolated clumps.

The Galois field of order $p^k$ (for $p$ prime) is
formed by considering polynomials in $(\mathbb{Z}/p\mathbb{Z})[x]$
modulo a primitive polynomial of degree $k$.
Facts:
Any primitive polynomial will do; all the
resulting fields are isomorphic.
We write $GF(p^k)$ to denote one such field.
$x$ is a generator of the field.
Everything you know about algebra is still true.

4
Reducing the Total Cost of Ownership

Reduce management cost
Use automatic management, automatic load
balancing, incremental system expansion, fast
backup.
(Focus of earlier research and the Sepal
project.)
Reduce provisioning cost
Use redundancy schemes that tolerate multiple
failed components without the cost of
duplicating them as is done in mirroring or
triplexing.

5
Rationale / Mathematics II

There are existing products that improve storage
management.
It would be difficult to use them to build
petabyte storage systems that can support
something like Hotmail.
Thousands of these components would be needed and,
even with squadrons of Perl and Java programmers, it
would be difficult to configure and manage such a
system.
These products lack formal management interfaces
that can be used by external programs or each
other to reliably manage the system.

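A Vandermonde matrix is of the form
$$V = \begin{pmatrix} 1 & a & a^2 & \cdots & a^{n-1} \\ 1 & b & b^2 & \cdots & b^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & z & z^2 & \cdots & z^{n-1} \end{pmatrix}$$
The determinant of a Vandermonde matrix is, up to sign,
$(a-b)(a-c)\cdots(a-z)\,(b-c)\cdots(b-z)\cdots(y-z)$.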

6
Large-Scale Mirroring

Assume disk mean-time-to-failure of 50 years.
Mirroring needs 20,000 disks for 10,000 data
disks.
Expect 8 disks to fail every week.
Assume 1 day to repair a disk by copying
from its mirror.
Mean time before data is unavailable (because of
a double failure) is 45 years.
Cost of storage is double what is actually needed.
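
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes independent failures and the usual approximation that a mirrored pair loses data when its second disk fails during the repair window of the first (pair MTTF of roughly MTTF^2 / (2 MTTR)); the function name and parameters are illustrative, not from the slides.

```python
def mirroring_estimates(data_disks=10_000, mttf_years=50.0, mttr_days=1.0):
    """Return (expected disk failures per week, system MTTF in years)."""
    total_disks = 2 * data_disks              # every data disk has one mirror
    failures_per_week = total_disks / mttf_years / 52.0
    mttr_years = mttr_days / 365.0
    # Assumed model: a pair becomes unavailable when the partner fails
    # while the first failed disk is still being copied from its mirror.
    pair_mttf_years = mttf_years ** 2 / (2.0 * mttr_years)
    # With data_disks independent pairs, the first pair loss arrives
    # data_disks times sooner.
    system_mttf_years = pair_mttf_years / data_disks
    return failures_per_week, system_mttf_years


per_week, years = mirroring_estimates()
print(f"{per_week:.1f} failures/week, {years:.1f} years to data unavailability")
```

Under these assumptions the result is a little under 8 failures per week and about 45 years before a double failure.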

7
Rationale / Mathematics III

Goal: build a distributed block-level storage
system that is highly available, scalable, and
easy to manage.
Design each component of the system to automate
or eliminate many management decisions that are
today made by human beings.
Automatically tolerate and recover from
exceptional conditions such as component failures
that traditionally require human intervention.

Facts about determinants:
$\det(AB) = \det(A)\,\det(B)$
$\det(A \mid k\,a) = k \det(A \mid a)$
Facts about GF(256):
$a + b = a \oplus b$ (XOR).
Every element other than 0 is $x^k$, for some
$0 \le k < 255$.
If $a = x^k$ and $b = x^j$, then $ab = x^{k+j}$.
With a 512-byte table of logarithms and a
1024-byte table of anti-logarithms,
multiplication becomes trivial.
A byte of data, viewed as an element of GF(256),
multiplied and added with other bytes still
occupies a byte of storage.
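
A minimal sketch of table-driven GF(256) arithmetic follows. The slides do not say which primitive polynomial is used; the value 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption, as are the names `build_tables`, `gf_add` and `gf_mul`. The anti-logarithm table is stored at double length so that the sum of two logarithms can be used as an index without a modulo step, which is one plausible reason for it being twice the size of the logarithm table.

```python
def build_tables(poly=0x11D):
    """Build anti-log (EXP) and log (LOG) tables for GF(256), generator x = 2."""
    exp = [0] * 510          # doubled: index log(a) + log(b) is at most 508
    log = [0] * 256          # log[0] is unused, since 0 has no logarithm
    value = 1
    for k in range(255):
        exp[k] = value
        log[value] = k
        value <<= 1          # multiply by x
        if value & 0x100:    # reduce modulo the primitive polynomial
            value ^= poly
    for k in range(255, 510):
        exp[k] = exp[k - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_add(a, b):
    """Addition in GF(256) is bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply by adding logarithms; zero is handled separately."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


print(hex(gf_mul(0x80, 0x02)))   # x^7 * x wraps around through the polynomial
```

Every product is again a value in 0..255, so a byte multiplied and added with other bytes still fits in a byte.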

8
Improving Time to Data Unavailability

Consider 10,000 disks, each with 50 years MTTF.
50 years with mirroring at 2x provisioning
cost.
May be too short a period at too high a price.
½M years with triplexing at 3x provisioning
cost.
Very long at very high price.
An alternative: use Reed-Solomon erasure codes.
40 clusters of 256 disks each.
Can tolerate triple (or more) failures.
50K years before data unavailability.
1.03x provisioning cost.

9
Rationale IV

Suppose the instantaneous probability of a disk
being in a failed state is $p$ (computed as
MTTR/MTTF).
The probability of $k$ disks failed is $p^k$.
The probability of finding $k$ disks failed out of
$j$ is $q = \binom{j}{k}\,p^k$.
Cluster MTTF $= \mathrm{MTTR}/q$. With $N$ total disks, System
MTTF $\approx k!\,\mathrm{MTTF}^k / \bigl(N\,(j\,\mathrm{MTTR})^{k-1}\bigr)$ (if $j \gg k$).
RAID sets:
$N = 10{,}000$, $k = 2$, $j = 10$, MTTF $= 50$ years, MTTR $= 1$ day,
System MTTF $= 50^2 \cdot 365 \cdot 1 / 50000 \approx 9$ years.
Duplexing:
$N = 20{,}000$, $k = 2$, $j = 2$, System MTTF $= 50^2 \cdot 365 \cdot 1 / 20000$
years $\approx 45$ years.
Triplexing:
$N = 30{,}000$, $k = 3$, $j = 3$, System MTTF $= 50^3 \cdot 365^2 / 30000$
years $\approx 555{,}000$ years.
Erasure codes:
$N = 10{,}000$, $k = 4$, $j = 256$, System MTTF $= 50^4 \cdot 365^3 \cdot 24 / (256^3 \cdot 10000)$
years $\approx 43{,}500$ years. Is that
enough?
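
As a rough check, the approximation System MTTF = k! MTTF^k / (N (j MTTR)^(k-1)) can be evaluated directly. The function below is a sketch with assumed names and units (MTTF in years, MTTR in days). Evaluated exactly as written, it gives about 45.6 years for duplexing and about 43,500 years for the erasure-coded configuration; the RAID-set and triplexing figures quoted on the slide appear to treat the constant factor k!/j^(k-1) differently, so the function's output for those rows need not match them.

```python
from math import factorial


def system_mttf_years(n_disks, k, j, mttf_years=50.0, mttr_days=1.0):
    """Approximate system MTTF when k failures among j disks lose data.

    Uses the large-j approximation C(j, k) ~ j**k / k!.
    """
    mttr_years = mttr_days / 365.0
    return (factorial(k) * mttf_years ** k
            / (n_disks * (j * mttr_years) ** (k - 1)))


configs = {
    "duplexing": dict(n_disks=20_000, k=2, j=2),
    "erasure codes": dict(n_disks=10_000, k=4, j=256),
}
for name, params in configs.items():
    print(f"{name}: {system_mttf_years(**params):,.0f} years")
```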

10
Inductive step proving that the determinant of a
Vandermonde matrix is the product of
differences. Determinant here is 1. Expand
on last column; after removing common factors,
what's left is $V_k$.
11
Reed-Solomon Erasure Codes
1. We use an $n \times (n+k)$ coding matrix to store data
on $n$ data disks and $k$ check disks ($k = 3$ in our
example).
2. Suppose data disks 2, 3 and check disk 3 fail.
3. Omitting failed rows, we get an invertible
$n \times n$ matrix $R$.
4. Multiplying both sides by $R^{-1}$, we recover all
the data.
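
The four steps can be sketched end to end on one byte per disk. This is an illustration under stated assumptions, not the authors' implementation: the primitive polynomial 0x11D, the choice of n = 5 data disks, the sample data bytes, and all function names are hypothetical. The matrix is stored with one row per disk (identity rows for data disks, then three check rows with entries x^(c*j)), so "omitting failed rows" is a literal row deletion.

```python
def build_tables(poly=0x11D):
    exp, log, value = [0] * 510, [0] * 256, 1
    for k in range(255):
        exp[k], log[value] = value, k
        value <<= 1
        if value & 0x100:
            value ^= poly
    exp[255:510] = exp[0:255]
    return exp, log


EXP, LOG = build_tables()


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[255 - LOG[a]]          # a must be non-zero


def mat_vec(matrix, vector):
    out = []
    for row in matrix:
        acc = 0
        for a, b in zip(row, vector):
            acc ^= mul(a, b)          # addition in GF(256) is XOR
        out.append(acc)
    return out


def invert(matrix):
    """Gauss-Jordan elimination over GF(256)."""
    n = len(matrix)
    aug = [list(row) + [int(i == r) for i in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = inv(aug[col][col])
        aug[col] = [mul(scale, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def coding_matrix(n, k=3):
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    checks = [[EXP[(c * j) % 255] for j in range(n)] for c in range(k)]
    return identity + checks


n, k = 5, 3
data = [0x10, 0x32, 0x54, 0x76, 0x98]             # one byte per data disk
A = coding_matrix(n, k)
stored = mat_vec(A, data)                          # step 1: data plus checks
failed = {1, 2, n + 2}                             # step 2: data disks 2, 3 and check disk 3
alive = [r for r in range(n + k) if r not in failed]
R = [A[r] for r in alive]                          # step 3: surviving rows
recovered = mat_vec(invert(R), [stored[r] for r in alive])   # step 4
print(recovered == data)
```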
12
Rationale / Mathematics Va

The matrices were transposed so that the data
would be in column vectors, which fits better on
the slide.
The correction matrix in the example uses my
special erasure code for $k = 3$.
The correction rows are taken from a Vandermonde
matrix.
As you might recognize, had you been reading the
other half of the slides.

For general $k$, we take a Vandermonde matrix using
elements 0, 1, ..., 255.
The product of element differences is non-zero,
because all the individual element differences
are non-zero.
Make it tall enough for all the data disks.
Diagonalize the square part, and use the
remaining columns for check disks.
Row operations change the determinant in simple
ways.

13
Rationale / Mathematics Vb

Computing check digits is easy, and
well-parallelized.
To update a block on data disk j, compute the XOR
of the old block with the new block.
Multicast the log of the XOR together with j to
the check disks (and have them start reading the
block off disk).
Each check disk multiplies the XORed data block
by the jth entry in their column, XORs that
with the old check value, and writes the new
check block.
Rotate which disks are check disks, making the
load uniform.
No difference in latency compared to RAID-5 (but
double the disk bandwidth).
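
The update path can be sketched as follows, assuming check disk c uses coefficient x^(c*j) for data disk j and the primitive polynomial 0x11D; both are assumptions, as are the function names. Because the code is linear, XORing the scaled difference into the old check block gives the same result as recomputing the check block from scratch.

```python
def build_tables(poly=0x11D):
    exp, log, value = [0] * 510, [0] * 256, 1
    for k in range(255):
        exp[k], log[value] = value, k
        value <<= 1
        if value & 0x100:
            value ^= poly
    exp[255:510] = exp[0:255]
    return exp, log


EXP, LOG = build_tables()


def full_check_block(data_blocks, c):
    """Recompute check block c from all data blocks (reference path)."""
    out = bytearray(len(data_blocks[0]))
    for j, block in enumerate(data_blocks):
        shift = (c * j) % 255                  # log of the coefficient x^(c*j)
        for i, byte in enumerate(block):
            if byte:
                out[i] ^= EXP[LOG[byte] + shift]
    return bytes(out)


def xor_blocks(a, b):
    return bytes(p ^ q for p, q in zip(a, b))


def apply_update(old_check, c, j, delta):
    """Fold the XOR of old and new data for disk j into check block c."""
    shift = (c * j) % 255
    return bytes(chk ^ (EXP[LOG[d] + shift] if d else 0)
                 for chk, d in zip(old_check, delta))


blocks = [b"abcd", b"efgh", b"ijkl"]
j, new_block = 1, b"WXYZ"
delta = xor_blocks(blocks[j], new_block)       # what the data disk multicasts
updated = [apply_update(full_check_block(blocks, c), c, j, delta)
           for c in range(3)]
new_blocks = blocks[:j] + [new_block] + blocks[j + 1:]
print(updated == [full_check_block(new_blocks, c) for c in range(3)])
```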

For $k = 3$, we can just take the identity matrix
plus three columns of Vandermonde, using 1, $x$,
$x^2$.
Computation of check digits is faster: the
logarithm for the $k$th check disk and $j$th data
disk is $kj$.
We omit non-failed data disks in computing
determinants.
What remains is a small minor of the whole
matrix.
1x1 works because all entries are non-zero.
3x3 works because it's a transposed Vandermonde
matrix.
2x2 works because it's either a Vandermonde
matrix, or a Vandermonde matrix with columns
multiplied by powers of $x$.
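
The claim about minors can be checked by brute force for a small configuration. The sketch below assumes the primitive polynomial 0x11D and check entries x^(c*j) for check disk c and data disk j; names are illustrative. In characteristic 2 the signs in the determinant expansion vanish, so the determinant is the XOR of the products over all permutations.

```python
from itertools import combinations, permutations


def build_tables(poly=0x11D):
    exp, log, value = [0] * 510, [0] * 256, 1
    for k in range(255):
        exp[k], log[value] = value, k
        value <<= 1
        if value & 0x100:
            value ^= poly
    exp[255:510] = exp[0:255]
    return exp, log


EXP, LOG = build_tables()


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def det(minor):
    """Determinant over GF(256); all signs are +1 in characteristic 2."""
    total = 0
    for perm in permutations(range(len(minor))):
        term = 1
        for row, col in enumerate(perm):
            term = mul(term, minor[row][col])
        total ^= term
    return total


def all_minors_nonzero(n_data, k=3):
    """Check every square minor of the n_data x k check-column block."""
    for size in range(1, k + 1):
        for disks in combinations(range(n_data), size):
            for checks in combinations(range(k), size):
                minor = [[EXP[(c * j) % 255] for c in checks] for j in disks]
                if det(minor) == 0:
                    return False
    return True


print(all_minors_nonzero(32))
```

The check is exhaustive only for the chosen number of data disks; larger values take longer because the number of 3x3 minors grows cubically.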

14
Parallel Reconstruction of a Failed Disk

Given the inverse matrix (as above), a failed
disk is the sum (exclusive or) of products of the
individual bytes on data and check disks with the
elements of the matrix.
The example below shows reconstruction of disk 2,
in a system with 4 data disks and 2 check disks;
disks 2 and 3 have failed.

15
Rationale / Mathematics VI

The reconstruction on the previous slide
requires
Reading all of the surviving disks.
Parallel multiplication by precomputed entries
from the inverse matrix.
XOR of pairs of disk entries, working up a binary
tree.
A network of small, cheap switches provides the
necessary connectivity.
Put hot-spare CPUs and drives into the network; skip
up the tree at nodes with no useful partner.
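
The pairwise XOR reduction can be sketched as below. It assumes each surviving disk's block has already been multiplied by its entry from the inverse matrix; the function name is hypothetical, and the network topology is reduced to list positions. A block with no partner at some level is passed up unchanged, which mirrors skipping up the tree.

```python
def tree_xor(blocks):
    """Combine equal-length blocks by XOR, pairing neighbours level by level.

    Returns the combined block and the number of tree levels used.
    """
    level = list(blocks)
    rounds = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            left, right = level[i], level[i + 1]
            next_level.append(bytes(a ^ b for a, b in zip(left, right)))
        if len(level) % 2:
            next_level.append(level[-1])   # no partner: skip up one level
        level = next_level
        rounds += 1
    return level[0], rounds


scaled_blocks = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
result, rounds = tree_xor(scaled_blocks)
print(result.hex(), rounds)
```

Each level halves the number of partial results, so the depth grows with the logarithm of the number of surviving disks.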

Given the nature of the matrix, the inverse matrix can be
constructed via Gaussian elimination very quickly
for small values of $k$.
With more complicated erasure coding (2D parity,
for example), it's harder to see how to wire the
network to accommodate hot spares without
requiring more from the interconnect.

16
Related Work

Reed-Solomon error correcting codes have been
used for single disks and CDs.
Reed-Solomon and Even-Odd erasure codes have been
proposed:
For software RAID [ABC],
For wide-area storage systems [LANL].
Our usage of the code is different.
Parallel reconstruction is novel.
Patents in progress on both.

17
Potential deficiencies

This could be a bad idea, depending on what
you're trying to do.
The bandwidth at the top of the tree isn't
enough to keep all the disk arms busy with small
reads and writes.
This could be a bad disk for building databases.
The total disk bandwidth at least doubles during
degraded operation, which might affect many more
disks than with mirroring or standard RAID-5.
Failures may not be independent, invalidating
predicted reliability.
CPUs (which, if you attach multiple disks, should
serve disks in different pods) may fail too
often, in ways that rebooting doesn't fix.
Power supplies, cables, and switches may fail,
which we haven't accounted for.
Writes are going to be relatively expensive.