Pastiche: Making Backup Cheap and Easy

Slides: 36
Provided by: csF2
Learn more at: http://www.cs.fsu.edu
1
Pastiche: Making Backup Cheap and Easy
2
Introduction
  • Backup is cumbersome and expensive
  • $4/GB/month
  • Small-scale solutions dominated by administrative
    efforts
  • Large-scale solutions require centralized
    management

3
Pastiche
  • Observation 1: disks are no longer full
  • Can use excess capacity for efficient, effective,
    and administration-free backup
  • Use untrusted machines to perform backup services
  • Need replication for reliability
  • Need to balance locality and reliability

4
Pastiche
  • Observation 2: much of the data on a given
    machine is not unique
  • Office 2000: 217 MB footprint
  • Different installations are largely the same
  • Exploiting this redundancy yields storage savings

5
Pastiche
  • Built on three pieces of research
  • Pastry: peer-to-peer, self-administering,
    scalable routing
  • Content-based indexing: easy discovery of
    redundant data
  • Convergent encryption: use the same encrypted
    representation without sharing keys

6
Challenges
  • How to discover backup buddies without a
    centralized directory?
  • How can nodes reuse their own state to back up
    others?
  • How can nodes restore files/machines without
    requiring administrative intervention?
  • How can nodes detect unfaithful buddies?

7
Basic Idea
  • Summarize storage content with abstracts
  • Use abstracts to locate buddies
  • A skeleton tree is used to represent and restore
    an entire file system
  • Periodic queries of buddies for stored data

8
Enabling Technologies
  • Peer-to-peer routing
  • Content-based indexing
  • Convergent encryption

9
Peer-to-Peer Routing
  • Pastry: a scalable, self-organizing routing and
    object location infrastructure
  • Each node has a nodeID
  • IDs are uniformly distributed in the ID space
  • A proximity metric measures the distance
    between two nodes

10
More on Pastry
  • Each node maintains three sets of state
  • Leaf set: closest nodes in terms of nodeIDs
  • Neighborhood set: closest nodes in terms of the
    proximity metric
  • Routing table: used for prefix routing

11
Prefix Routing
  • In each step, a node forwards the message to a
    node whose nodeID shares a prefix with the
    destination that is at least one digit longer
    than the prefix shared by the current nodeID
  • Destination: 1230
  • Current nodeID: 1023
  • Next hop: 12--
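
The hop selection in this example can be sketched in a few lines of Python. This is a simplified illustration, not Pastry's actual routing-table lookup: `shared_prefix_len`, `next_hop`, and the flat `known_nodes` list are hypothetical names, and a real node would consult its routing table and fall back to its leaf set.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Return a known node that shares a longer prefix with dest than we do.

    Assumption: known_nodes is a flat list standing in for the routing table.
    Returns None when no known node improves on the current prefix match.
    """
    have = shared_prefix_len(current, dest)
    better = [n for n in known_nodes if shared_prefix_len(n, dest) > have]
    if not better:
        return None
    # Any improving node is valid; pick the best match to keep the sketch short.
    return max(better, key=lambda n: shared_prefix_len(n, dest))


# The slide's example: node 1023 forwards a message for 1230 to a 12-- node.
hop = next_hop("1023", "1230", ["1201", "1102", "3321"])
```

Here `hop` is `"1201"`, the only candidate whose ID starts with `12`.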

12
Pastiche's Use of Pastry
  • Uses two separate Pastry overlay networks during
    buddy discovery
  • Once a node is discovered, traffic is sent
    directly via IP
  • Pastiche adds two mechanisms
  • Lighthouse sweep: to discover distinct Pastry
    nodes
  • Distance metric: based on the FS contents

13
Content-Based Indexing
  • Goal: identify file regions for sharing
  • Use Rabin fingerprints
  • A fingerprint is generated for each overlapping
    k-byte substring in a file
  • If the lower-order bits of a fingerprint match a
    predetermined value, that offset is marked as an
    anchor
  • Anchors divide files into chunks; each chunk is
    associated with a secure hash value
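
The anchoring steps above can be sketched as follows. Note the substitution: this uses a simple polynomial rolling hash as a stand-in for true Rabin fingerprints, and `WINDOW`, `MASK`, `MAGIC`, `BASE`, and `MOD` are assumed values chosen for illustration, not parameters from the slides. SHA-1 serves as the secure hash here by assumption.

```python
import hashlib

WINDOW = 48            # k: bytes covered by each fingerprint (assumed)
MASK = (1 << 11) - 1   # low-order bits that are inspected (assumed)
MAGIC = 0              # predetermined value that marks an anchor (assumed)
BASE = 257             # rolling-hash base (stand-in for a Rabin polynomial)
MOD = (1 << 61) - 1    # rolling-hash modulus


def chunk(data: bytes):
    """Split data at content-defined anchors; return (sha1_hex, chunk) pairs."""
    pieces = []
    start = 0
    fp = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            fp = (fp - data[i - WINDOW] * top) % MOD
        fp = (fp * BASE + byte) % MOD
        # Only full windows are tested; a match ends the chunk after offset i.
        if i + 1 >= WINDOW and (fp & MASK) == MAGIC:
            pieces.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])  # trailing bytes after the last anchor
    return [(hashlib.sha1(c).hexdigest(), c) for c in pieces]
```

Because anchors depend only on the bytes inside the window, inserting data near the start of a file shifts later chunk boundaries along with the content, so the chunks after the first unaffected anchor keep their hashes.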

14
Sharing with Confidentiality
  • Sharing encrypted data without sharing keys
  • Need to have a single encrypted representation
  • For the ease of comparisons
  • Use convergent encryption

15
Convergent Encryption
  • So, say, how do you share a door without sharing
    its corresponding keys?

16
Convergent Encryption
  • How about using different safes to store those
    keys?

17
Convergent Encryption
  • And use different keys to access those keys
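
The door-and-safe analogy can be sketched in Python. This is a toy under stated assumptions: the cipher is a SHA-256 counter-mode keystream XORed with the data, used only so the example needs nothing beyond the standard library. It is not the cipher Pastiche uses and is not suitable for real data. All function names are hypothetical.

```python
import hashlib


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(chunk: bytes):
    """Derive the key from the chunk itself, so equal chunks encrypt equally."""
    key = hashlib.sha256(chunk).digest()
    return key, _xor(chunk, key)


def convergent_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return _xor(ciphertext, key)


def wrap_key(chunk_key: bytes, owner_secret: bytes) -> bytes:
    """The 'safe': each owner hides the chunk key under its own secret."""
    return _xor(chunk_key, hashlib.sha256(b"wrap" + owner_secret).digest())


# Two owners holding the same chunk produce the same ciphertext (the door),
# but each stores the chunk key in a safe only it can open.
key_a, ct_a = convergent_encrypt(b"shared library bytes")
key_b, ct_b = convergent_encrypt(b"shared library bytes")
safe_a = wrap_key(key_a, b"alice-secret")
safe_b = wrap_key(key_b, b"bob-secret")
```

The two ciphertexts are identical, so a store can deduplicate them, while `safe_a` and `safe_b` differ and neither owner's secret is shared.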

18
Implications of the Use of Convergent Encryption
  • If a backup node does not already hold the data
    (it is not part of the participating group), it
    cannot decrypt the data
  • If it does, the backup node learns that the
    other node also stores that data
  • Information leak vs. storage efficiency

19
Design
  • Pastiche data is stored in chunks
  • Chunk boundaries determined by content-based
    indexing
  • Encrypted with convergent encryption
  • Chunks carry owner lists

20
Design
  • When a newly written file is closed, it is
    scheduled for chunking
  • If a chunk already exists, the local host is
    added to the owner list
  • If not, encrypt the chunk and write it out
  • Chunking and writing deferred to avoid
    short-lived files

21
Design
  • Chunks are immutable
  • When a file is written, its set of chunks may
    change
  • A chunk is not deleted until the last reference
    to it is removed
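
The owner-list bookkeeping from the Design slides can be sketched as a small in-memory store. `ChunkStore`, `put`, and `release` are hypothetical names, SHA-1 chunk IDs are an assumption, and encryption is omitted to keep the sketch short.

```python
import hashlib


class ChunkStore:
    """Immutable chunks keyed by content hash, each with an owner list."""

    def __init__(self):
        self.data = {}    # chunk id -> stored bytes
        self.owners = {}  # chunk id -> set of hosts referencing the chunk

    def put(self, chunk: bytes, owner: str) -> str:
        """Store a chunk, or just add the owner if it already exists."""
        cid = hashlib.sha1(chunk).hexdigest()
        if cid not in self.data:
            # A real node would encrypt the chunk before writing it out.
            self.data[cid] = chunk
        self.owners.setdefault(cid, set()).add(owner)
        return cid

    def release(self, cid: str, owner: str) -> None:
        """Drop one reference; delete the chunk when no owner remains."""
        owners = self.owners.get(cid)
        if owners is None:
            return
        owners.discard(owner)
        if not owners:
            del self.owners[cid]
            del self.data[cid]


store = ChunkStore()
cid = store.put(b"common chunk", "local-host")
store.put(b"common chunk", "backup-client")  # stored once, two owners
store.release(cid, "local-host")             # still referenced, so kept
```

After these calls the chunk is still present, with `backup-client` as its only owner; releasing that last reference would remove it.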

22
Abstracts: Finding Redundancy
  • An ideal backup buddy is one that holds a
    superset of the new machine's data
  • To find it, send the full signature (hashes) of
    the new node to candidate buddies
  • However, this requires transferring 1.3 MB of
    signatures per GB of stored data
  • Solution: abstracts, which transfer only a
    random subset of signatures
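
A minimal sketch of the abstract idea: sample a random subset of the signature and let a candidate report what fraction of the sample it already stores. The function names are hypothetical. The size calculation is one way to arrive at a figure near 1.3 MB per GB; its 16 KiB average chunk size and 20-byte hash length are assumptions, not values stated in the slides.

```python
import random
from typing import Optional, Set


def signature_bytes_per_gb(avg_chunk: int = 16 * 1024, hash_len: int = 20) -> int:
    """Signature size for 1 GiB of data under the assumed chunk and hash sizes."""
    return (2 ** 30 // avg_chunk) * hash_len


def make_abstract(signature: Set[str], size: int,
                  seed: Optional[int] = None) -> Set[str]:
    """Pick a random subset of the signature's chunk hashes."""
    rng = random.Random(seed)
    size = min(size, len(signature))
    # sorted() gives random.sample the sequence it requires.
    return set(rng.sample(sorted(signature), size))


def estimated_overlap(abstract: Set[str], candidate: Set[str]) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate) / len(abstract)


mine = {"h%d" % i for i in range(1000)}
abstract = make_abstract(mine, 50, seed=1)
twin = set(mine)                # a machine with identical content
stranger = {"x1", "x2", "x3"}   # a machine with nothing in common
```

With these inputs, `estimated_overlap(abstract, twin)` is 1.0 and `estimated_overlap(abstract, stranger)` is 0.0, while only 50 hashes cross the network instead of 1000.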

23
Compare one disk to another
[Figure: Node1 signature, Node2 signature, Node1 abstract]

24
Overlays: Finding a Set of Buddies
  • A desirable buddy should
  • Have a substantial overlap
  • Be physically nearby (with at least one far away
    to survive geographically correlated failures)

25
Applied Use of Pastry
  • Pastiche uses two Pastry overlays to facilitate
    buddy discovery
  • One for network proximity
  • One for file system overlap
  • Coverage: the fraction of overlapping chunks
    stored on a site

26
Security Problems
  • A malicious node can
  • Under-report coverage to avoid being chosen as a
    buddy
  • Over-report coverage to attract clients just to
    discard their chunks

27
Backup Protocol
  • A Pastiche node has full control over the backup
    schedule
  • A snapshot consists of three things
  • Chunks to be added
  • Chunks to be removed
  • Metadata of those chunks
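
The three parts of a snapshot can be sketched as a small data structure. `Snapshot` and `make_snapshot` are hypothetical names, and attaching metadata only to newly added chunks is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Snapshot:
    add: Set[str]                  # chunk ids to be added
    remove: Set[str]               # chunk ids to be removed
    metadata: Dict[str, dict] = field(default_factory=dict)


def make_snapshot(previous: Set[str], current: Dict[str, dict]) -> Snapshot:
    """Diff the previous chunk set against the current one.

    current maps chunk id -> metadata for every chunk now in use.
    """
    now = set(current)
    added = now - previous
    return Snapshot(
        add=added,
        remove=previous - now,
        metadata={cid: current[cid] for cid in added},
    )


snap = make_snapshot({"a", "b"}, {"b": {}, "c": {"size": 3}})
```

Here chunk `c` is new, chunk `a` is no longer referenced, and chunk `b` is unchanged, so it appears in neither set.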

28
Restoration
  • A Pastiche node retains its archive skeleton, so
    performing partial restores is easy
  • To recover the whole machine, a node has to
    obtain its root node from one of the backup
    machines first

29
Detecting Failure and Malice
  • A node randomly requests data from its buddies
  • This bounds the probability that failures and
    malicious nodes go undetected
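
One way to make the bound concrete, under assumptions not stated in the slides: each query picks a chunk uniformly at random and independently of earlier queries, and a buddy cannot answer for a chunk it no longer holds. If a buddy is missing a fraction f of the chunks, k queries all succeed with probability (1 - f)^k. The function names are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that every one of the random queries hits a chunk still held."""
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest query count that keeps the miss probability at or below the target."""
    if not 0.0 < missing_fraction < 1.0:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    if not 0.0 < max_undetected < 1.0:
        raise ValueError("max_undetected must be strictly between 0 and 1")
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))


# A buddy that silently dropped 10% of the chunks, with a 1% tolerance for
# missing it, needs about 44 random queries under these assumptions.
k = queries_needed(0.10, 0.01)
```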

30
Preventing Greed
  • Someone can store things everywhere
  • Need to institute distributed quota
  • Very difficult
  • Some proposed solutions
  • Each node monitors the overall storage costs
    imposed by its backup clients
  • Problem: Sybil attacks (forge many identities
    that each consume little storage)

31
Preventing Greed
  • Force each node to solve puzzles proportional to
    storage consumption
  • Problems
  • Needlessly expensive
  • Storage is traded against something other than
    storage
  • Heterogeneous computing power

32
Preventing Greed
  • Electronic currency
  • Problems
  • Need to add atomic currency transactions
  • Complicated

33
Implementation
  • Chunkstore file system
  • Backup daemon

34
Performance Overhead
35
The Chance of Finding Buddies