Koh-i-Noor - PowerPoint PPT Presentation

Stanford Dash/Flash seminar. ... Mark Manasse, Chandu Thekkath, Microsoft Research Silicon Valley

Slides: 19
Provided by: Edwar263

Transcript and Presenter's Notes


1
Koh-i-Noor
  • Mark Manasse, Chandu Thekkath
  • Microsoft Research
  • Silicon Valley

5/29/2016
2
Motivation
  • Large-scale storage systems can be expensive to
    build and maintain
  • Cost of managing the storage,
  • Cost of provisioning the storage.
  • Total cost of ownership is dominated by
    management cost,
  • but provisioning 10,000 spindles worth of useful
    data is expensive, especially with redundant
    storage.

3
Rationale / Mathematics I
  • Hotmail has storage needs approaching a petabyte.
  • Currently built from difficult-to-manage
    components.
  • 1,000 10-disk RAID arrays
  • Requires significant backup effort, incessant
    replacement of failed drives, etc.
  • Reliability could be improved by mirroring.
  • But 10,000 mirrored drives is a lot of work to
    manage.
  • For either system, difficult to expand
    gracefully, since the storage is in small,
    isolated clumps.
  • The Galois Field of order $p^k$ (for $p$ prime) is
    formed by considering polynomials in
    $(\mathbb{Z}/p\mathbb{Z})[x]$ modulo a primitive
    polynomial of degree $k$.
  • Facts
  • Any primitive polynomial will do; all the
    resulting fields are isomorphic.
  • We write $\mathrm{GF}(p^k)$ to denote one such field.
  • $x$ is a generator of the field.
  • Everything you know about algebra is still true.
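
As a small worked instance of the construction above, consider GF(8). The choice of $p = 2$, $k = 3$ and the primitive polynomial $x^3 + x + 1$ is an assumption made for illustration and does not come from the slides. Reducing powers of $x$ modulo that polynomial visits all seven non-zero elements before returning to 1, which is what "x is a generator" means:

```latex
\mathrm{GF}(2^3) = (\mathbb{Z}/2\mathbb{Z})[x] \,/\, (x^3 + x + 1)

\begin{aligned}
x^3 &= x + 1, & x^4 &= x^2 + x, & x^5 &= x^2 + x + 1, \\
x^6 &= x^2 + 1, & x^7 &= 1.
\end{aligned}
```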

4
Reducing the Total Cost of Ownership
  • Reduce management cost
  • Use automatic management, automatic load
    balancing, incremental system expansion, fast
    backup.
  • (Focus of earlier research and the Sepal
    project.)
  • Reduce provisioning cost
  • Use redundancy schemes that tolerate multiple
    failed components without the cost of
    duplicating them as is done in mirroring or
    triplexing.

5
Rationale / Mathematics II
  • There are existing products that improve storage
    management.
  • It would be difficult to use them to build
    petabyte storage systems that can support
    something like Hotmail.
  • Need thousands of these components and even with
    squadrons of Perl and Java programmers, it would
    be difficult to configure and manage such a
    system.
  • These products lack formal management interfaces
    that can be used by external programs or each
    other to reliably manage the system.
  • A Vandermonde matrix is of the form
    [matrix figure not preserved in this transcript].
  • The determinant of a Vandermonde matrix is
    $(a-b)(a-c)\cdots(a-z)(b-c)\cdots(b-z)\cdots(y-z)$.
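
For reference, a Vandermonde matrix in one common textbook convention is written out below. The orientation (rows versus columns) and the indexed names $a_1, \dots, a_n$ are assumptions, since the matrix drawn on the slide is not in the transcript. In this convention the determinant is the product of $(a_j - a_i)$ over $i < j$; it can differ from the product of $(a_i - a_j)$ only by a sign, and in a field of characteristic 2 such as GF(256) the two are equal.

```latex
V(a_1, \dots, a_n) =
\begin{pmatrix}
1 & a_1 & a_1^2 & \cdots & a_1^{n-1} \\
1 & a_2 & a_2^2 & \cdots & a_2^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & a_n & a_n^2 & \cdots & a_n^{n-1}
\end{pmatrix},
\qquad
\det V = \prod_{1 \le i < j \le n} (a_j - a_i).
```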

6
Large-Scale Mirroring
  • Assume disk mean-time-to-failure of 50 years.
  • Mirroring needs 20,000 disks for 10,000 data
    disks.
  • Expect 8 disks to fail every week.
  • Assume 1 day to repair a disk by copying
    from its mirror.
  • Mean time before data is unavailable (because of
    a double failure) is 45 years.
  • Cost of storage is double what is actually needed.
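
One way to arrive at the figures on this slide is sketched below. It assumes independent failures and the approximation $\mathrm{MTTF}^2 / (N \cdot \mathrm{MTTR})$ for the time to a double failure within a mirrored pair, which is consistent with the duplexing arithmetic given later in the deck but is not spelled out on this slide.

```latex
\frac{20{,}000 \text{ disks}}{50 \text{ years} \times 52 \text{ weeks/year}}
  \approx 7.7 \text{ failures per week}

\frac{\mathrm{MTTF}^2}{N \cdot \mathrm{MTTR}}
  = \frac{50^2 \times 365}{20{,}000} \text{ years}
  \approx 45.6 \text{ years}
  \qquad (\mathrm{MTTR} = 1/365 \text{ year})
```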

7
Rationale / Mathematics III
  • Goal: build a distributed block-level storage
    system that is highly available, scalable, and
    easy to manage.
  • Design each component of the system to automate
    or eliminate many management decisions that are
    today made by human beings.
  • Automatically tolerate and recover from
    exceptional conditions such as component failures
    that traditionally require human intervention.
  • Facts about determinants
  • $\det(AB) = \det(A)\,\det(B)$
  • $\det(A \mid k\mathbf{a}) = k\,\det(A \mid \mathbf{a})$
  • Facts about GF(256)
  • $a + b = a \text{ XOR } b$.
  • Every element other than 0 is $x^k$, for some
    $0 \le k < 255$.
  • If $a = x^k$ and $b = x^j$, then $ab = x^{k+j}$.
  • With a 512-byte table of logarithms and a
    1024-byte table of anti-logarithms,
    multiplication becomes trivial.
  • A byte of data, viewed as an element of GF(256),
    multiplied and added with other bytes still
    occupies a byte of storage.
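
The table-driven arithmetic described above can be sketched as follows. The primitive polynomial `0x11D` ($x^8 + x^4 + x^3 + x^2 + 1$) is an assumption, since the slides do not say which one was used, and the names `build_tables`, `gf_add` and `gf_mul` are illustrative. The anti-logarithm table is stored twice over so that the sum of two logarithms can index it without a modulo-255 step.

```python
PRIMITIVE_POLY = 0x11D  # assumed: x^8 + x^4 + x^3 + x^2 + 1


def build_tables():
    """Return (exp, log) tables for GF(256) generated by x = 0x02."""
    exp = [0] * 510   # anti-logarithms, doubled to avoid a modulo on lookup
    log = [0] * 256   # log[0] is unused; zero has no logarithm
    value = 1
    for k in range(255):
        exp[k] = exp[k + 255] = value
        log[value] = k
        value <<= 1                   # multiply by x
        if value & 0x100:             # degree 8 reached: reduce
            value ^= PRIMITIVE_POLY
    return exp, log


EXP, LOG = build_tables()


def gf_add(a, b):
    """Addition (and subtraction) in GF(256) is bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply by adding logarithms; zero is handled separately."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


product = gf_mul(0x02, 0x80)  # x * x^7 = x^8, reduced by the polynomial
print(hex(product))
```

Every result stays in the range 0 to 255, so a product or sum of bytes still fits in one byte, as the last bullet notes.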

8
Improving Time to Data Unavailability
  • Consider 10,000 disks, each with 50 years MTTF.
  • 50 years with mirroring at 2x provisioning
    cost.
  • May be too short a period at too high a price.
  • ½M years with triplexing at 3x provisioning
    cost.
  • Very long at very high price.
  • An alternative: use Reed-Solomon Erasure Codes.
  • 40 clusters of 256 disks each.
  • Can tolerate triple (or more) failures.
  • 50K years before data unavailability.
  • 1.03x provisioning cost.

9
Rationale IV
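  • Suppose the instantaneous probability of a disk
    being in a failed state is $p$ (computed as
    MTTR/MTTF).
  • The probability of $k$ disks failed is $p^k$.
  • The probability of finding $k$ disks failed out of
    $j$ is $\binom{j}{k} p^k = q$.
  • Cluster MTTF $= \mathrm{MTTR}/q$; with $N$ total disks,
    System MTTF $= k!\,\mathrm{MTTF}^k / \bigl(N\,(j\,\mathrm{MTTR})^{k-1}\bigr)$
    (if $j \gg k$).
  • RAID sets
  • $N = 10{,}000$, $k = 2$, $j = 10$, MTTF $= 50$ years,
    MTTR $= 1$ day: System MTTF $= 50^2 \cdot 365 / 50{,}000$
    years, given as 9 years (the expression as written
    evaluates to about 18 years).
  • Duplexing
  • $N = 20{,}000$, $k = 2$, $j = 2$: System MTTF
    $= 50^2 \cdot 365 / 20{,}000$ years $\approx 45$ years.
  • Triplexing
  • $N = 30{,}000$, $k = 3$, $j = 3$: System MTTF
    $= 50^3 \cdot 365^2 / 30{,}000$ years $\approx 555{,}000$ years.
  • Erasure codes
  • $N = 10{,}000$, $k = 4$, $j = 256$: System MTTF
    $= 50^4 \cdot 365^3 \cdot 24 / (256^3 \cdot 10{,}000)$ years
    $\approx 43{,}500$ years. Is that enough?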
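
The approximation on this slide, System MTTF = k! · MTTF^k / (N · (j · MTTR)^(k-1)), can be evaluated directly for the four configurations. This is a minimal sketch; the function name `system_mttf_years` and the dictionary layout are illustrative. Applied uniformly, the formula reproduces the duplexing and erasure-code figures, but gives roughly 18 years for RAID sets and roughly 370,000 years for triplexing, so the slide's 9 and 555,000 appear to come from a slightly different calculation.

```python
from math import factorial


def system_mttf_years(n_disks, k, j, mttf_years=50.0, mttr_days=1.0):
    """k! * MTTF^k / (N * (j * MTTR)^(k-1)), with every time in years.

    k is the number of simultaneous failures in one cluster that loses data,
    j is the cluster size, and n_disks is the total number of disks.
    """
    mttr_years = mttr_days / 365.0
    return factorial(k) * mttf_years ** k / (n_disks * (j * mttr_years) ** (k - 1))


# (total disks N, failures to lose data k, cluster size j), as on the slide
configurations = {
    "RAID sets": (10_000, 2, 10),
    "Duplexing": (20_000, 2, 2),
    "Triplexing": (30_000, 3, 3),
    "Erasure codes": (10_000, 4, 256),
}

for name, (n_disks, k, j) in configurations.items():
    print(f"{name}: {system_mttf_years(n_disks, k, j):,.0f} years")
```

The approximation assumes $j \gg k$, which does not hold for duplexing and triplexing, so those two rows are only rough estimates under this model.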

10
Inductive step proving the determinant of a
Vandermonde matrix is the product of
differences. Determinant here is 1. Expand
on last column; after removing common factors,
what's left is $V_k$.
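
The figure for this slide is not in the transcript. One standard way to write the induction is given below, using the convention in which row $i$ of $V_n$ is $(1, a_i, a_i^2, \dots, a_i^{n-1})$; the slide may have used the transposed layout. Subtracting $a_n$ times each column from the column to its right turns the last row into $(1, 0, \dots, 0)$. Expanding along that row and pulling the common factor $(a_i - a_n)$ out of each remaining row leaves a smaller Vandermonde matrix.

```latex
\det V_1 = 1,
\qquad
\det V_n = \Bigl( \prod_{i=1}^{n-1} (a_n - a_i) \Bigr) \det V_{n-1}
\quad\Longrightarrow\quad
\det V_n = \prod_{1 \le i < j \le n} (a_j - a_i).
```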
11
Reed-Solomon Erasure Codes
1. We use an $n \times (n+k)$ coding matrix to store data
on $n$ data disks and $k$ check disks ($k = 3$ in our
example).
2. Suppose data disks 2 and 3 and check disk 3 fail.
3. Omitting failed rows, we get an invertible
$n \times n$ matrix $R$.
4. Multiplying both sides by $R^{-1}$, we recover all
the data.
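
The four steps can be sketched end to end on one byte per disk. This is a minimal illustration and not the authors' implementation: the primitive polynomial `0x11D`, the choice of $n = 5$ data disks, the sample bytes, and all function names are assumptions. The check rows use the entries $x^{cj}$ that the deck describes for $k = 3$, and the matrix is stored with one row per disk, so a failed disk corresponds to an omitted row.

```python
# GF(256) log/antilog tables; 0x11D is an assumed primitive polynomial.
EXP, LOG, value = [0] * 510, [0] * 256, 1
for power in range(255):
    EXP[power] = EXP[power + 255] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[255 - LOG[a]]  # a must be non-zero


def coding_matrix(n, k):
    """One row per disk: n identity rows (data), then k check rows x^(c*j)."""
    identity = [[int(r == c) for c in range(n)] for r in range(n)]
    checks = [[EXP[(c * j) % 255] for j in range(n)] for c in range(k)]
    return identity + checks


def mat_vec(matrix, vector):
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, vector):
            acc ^= mul(coeff, byte)  # addition in GF(256) is XOR
        out.append(acc)
    return out


def invert(matrix):
    """Gauss-Jordan elimination over GF(256); assumes the matrix is invertible."""
    n = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_inv = inv(aug[col][col])
        aug[col] = [mul(pivot_inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a ^ mul(factor, b) for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


n, k = 5, 3
G = coding_matrix(n, k)                    # step 1
data = [0x12, 0x34, 0x56, 0x78, 0x9A]      # one byte per data disk
stored = mat_vec(G, data)                  # bytes on all n + k disks

failed = {1, 2, n + 2}                     # step 2: data disks 2, 3 and check disk 3
survivors = [i for i in range(n + k) if i not in failed]
R = [G[i] for i in survivors]              # step 3: n x n matrix of surviving rows
recovered = mat_vec(invert(R), [stored[i] for i in survivors])  # step 4
print(recovered == data)
```

Real blocks would apply the same coefficients to every byte position independently, so the matrix inverse is computed once per failure pattern rather than once per byte.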
12
Rationale / Mathematics Va
  • The matrices were transposed so that the data
    would be in column vectors, which fits better on
    the slide.
  • The correction matrix in the example uses my
    special erasure code for $k = 3$.
  • The correction rows are taken from a Vandermonde
    matrix.
  • As you might recognize, had you been reading the
    other half of the slides.
  • For general $k$, we take a Vandermonde matrix using
    elements $0, 1, \dots, 255$.
  • The product of element differences is non-zero,
    because all the individual element differences
    are non-zero.
  • Make it tall enough for all the data disks.
  • Diagonalize the square part, and use the
    remaining columns for check disks.
  • Row operations change the determinant in simple
    ways.

13
Rationale / Mathematics Vb
  • Computing check digits is easy, and
    well-parallelized.
  • To update a block on data disk j, compute the XOR
    of the old block with the new block.
  • Multicast the log of the XOR together with j to
    the check disks (and have them start reading the
    block off disk).
  • Each check disk multiplies the XORed data block
    by the jth entry in their column, XORs that
    with the old check value, and writes the new
    check block.
  • Rotate which disks are check disks, making the
    load uniform.
  • No difference in latency compared to RAID-5 (but
    double the disk bandwidth).
  • For $k = 3$, we can just take the identity matrix
    plus three columns of Vandermonde, using $1$, $x$,
    $x^2$.
  • Computation of check digits is faster: the
    logarithm for the $k$th check disk and $j$th data
    disk is $kj$.
  • We omit non-failed data-disks in computing
    determinants.
  • What remains is a small minor of the whole
    matrix.
  • 1x1 works because all entries are non-zero.
  • 3x3 works because it's a transposed Vandermonde
    matrix.
  • 2x2 works because it's either a Vandermonde
    matrix, or a Vandermonde matrix with columns
    multiplied by powers of x.
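
The update path and the $k = 3$ construction on this slide can be checked together in a short sketch. It assumes the primitive polynomial `0x11D`, six data disks and four-byte blocks, none of which come from the slides, and the names `check_blocks` and `apply_update` are illustrative. The sketch multiplies the XOR delta directly, whereas the slide multicasts its logarithm so that each check disk only has to add exponents.

```python
# GF(256) log/antilog tables; 0x11D is an assumed primitive polynomial.
EXP, LOG, value = [0] * 510, [0] * 256, 1
for power in range(255):
    EXP[power] = EXP[power + 255] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


n, k = 6, 3
# Entry for check disk c and data disk j is x^(c*j): its logarithm is c*j.
COEFF = [[EXP[(c * j) % 255] for j in range(n)] for c in range(k)]


def check_blocks(data_blocks):
    """Compute all k check blocks from scratch."""
    block_len = len(data_blocks[0])
    checks = []
    for c in range(k):
        block = bytearray(block_len)
        for j, data in enumerate(data_blocks):
            for pos, byte in enumerate(data):
                block[pos] ^= mul(COEFF[c][j], byte)
        checks.append(bytes(block))
    return checks


def apply_update(checks, j, old_block, new_block):
    """Update every check block from the XOR of old and new data on disk j."""
    delta = bytes(a ^ b for a, b in zip(old_block, new_block))
    return [
        bytes(old ^ mul(COEFF[c][j], d) for old, d in zip(checks[c], delta))
        for c in range(k)
    ]


data_blocks = [bytes(j * 16 + i for i in range(4)) for j in range(n)]
checks = check_blocks(data_blocks)

new_block = bytes([0xDE, 0xAD, 0xBE, 0xEF])
updated = apply_update(checks, 2, data_blocks[2], new_block)
data_blocks[2] = new_block

print(updated == check_blocks(data_blocks))
```

Because the code is linear over GF(256), the incremental update touches only the changed data disk and the check disks, and gives the same result as recomputing the check blocks from all data disks.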

14
Parallel Reconstruction of a Failed Disk
  • Given the inverse matrix (as above), a failed
    disk is the sum (exclusive or) of products of the
    individual bytes on data and check disks with the
    elements of the matrix.
  • The example below shows reconstruction of disk 2,
    in a system with 4 data disks and 2 check disks;
    disks 2 and 3 have failed.

15
Rationale / Mathematics VI
  • The reconstruction on the previous slide
    requires
  • Reading all of the surviving disks.
  • Parallel multiplication by precomputed entries
    from the inverse matrix.
  • XOR of pairs of disk entries working up a binary
    tree.
  • A network of small, cheap switches provides the
    necessary connectivity.
  • Put hot spare CPUs and drives into the network;
    skip up the tree at nodes with no useful partner.
  • Given the nature of the matrix, the inverse matrix
    can be constructed via Gaussian elimination very
    quickly for small values of k.
  • With more complicated erasure coding (2D parity,
    for example), it's harder to see how to wire the
    network to accommodate hot-spares without
    requiring more from the interconnect.
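
The three reconstruction steps listed above (read the survivors, scale each by a precomputed inverse-matrix entry, XOR pairwise up a tree) can be sketched for the 4-data, 2-check example with data disks 2 and 3 failed. Everything here runs in one process, so the tree only models the order of the XORs and not the switch network. The primitive polynomial `0x11D`, the check entries $x^{cj}$, the sample block contents and all names are assumptions.

```python
# GF(256) log/antilog tables; 0x11D is an assumed primitive polynomial.
EXP, LOG, value = [0] * 510, [0] * 256, 1
for power in range(255):
    EXP[power] = EXP[power + 255] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def invert(matrix):
    """Gauss-Jordan elimination over GF(256); assumes the matrix is invertible."""
    n = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_inv = EXP[255 - LOG[aug[col][col]]]
        aug[col] = [mul(pivot_inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a ^ mul(factor, b) for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def scale_block(coeff, block):
    return bytes(mul(coeff, byte) for byte in block)


def xor_tree(blocks):
    """XOR blocks pairwise, level by level; an unpaired block skips up a level."""
    level = list(blocks)
    while len(level) > 1:
        paired = [
            bytes(a ^ b for a, b in zip(level[i], level[i + 1]))
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]


n, k = 4, 2
G = [[int(r == c) for c in range(n)] for r in range(n)]          # data rows
G += [[EXP[(c * j) % 255] for j in range(n)] for c in range(k)]  # check rows

data = [b"disk", b"two!", b"3rd.", b"four"]
stored = [xor_tree([scale_block(c, d) for c, d in zip(row, data)]) for row in G]

failed = {1, 2}                                   # data disks 2 and 3 (1-based)
survivors = [i for i in range(n + k) if i not in failed]
coefficients = invert([G[i] for i in survivors])[1]  # inverse row for disk 2

# Each surviving disk scales its own block in parallel; the tree combines them.
partials = [scale_block(c, stored[i]) for c, i in zip(coefficients, survivors)]
rebuilt = xor_tree(partials)
print(rebuilt == data[1])
```

With an odd number of blocks at some level, the last block is passed up unchanged, which corresponds to the slide's remark about skipping up the tree at nodes with no useful partner.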

16
Related Work
  • Reed-Solomon error correcting codes have been
    used for single disks and CDs.
  • Reed-Solomon and Even-Odd erasure codes have been
    proposed:
  • for software RAID ABC,
  • for wide area storage systems LANL.
  • Our usage of the code is different.
  • Parallel reconstruction is novel.
  • Patents in progress on both.

17
Potential deficiencies
  • This could be a bad idea, depending on what
    you're trying to do:
  • The bandwidth at the top of the tree isn't
    enough to keep all the disk arms busy with small
    reads and writes.
  • This could be a bad disk for building databases.
  • The total disk bandwidth at least doubles during
    degraded operation, which might affect many more
    disks than with mirroring or standard RAID-5.
  • Failures may not be independent, invalidating
    predicted reliability.
  • CPUs (which, if you attach multiple disks, should
    serve disks in different pods) may fail too
    often, in ways that rebooting doesn't fix.
  • Power supplies, cables, and switches may fail,
    which we haven't accounted for.
  • Writes are going to be relatively expensive.

18
Comparisons and future work
  • Other people have already built small
    erasure-coded arrays. Is there enough new here?
    We think so.
  • Real systems need:
  • Backup,
  • Geographical redundancy.