Pond and CFS - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pond and CFS


1
Pond and CFS
  • CS599
  • Special Topics in OS and Distributed Storage
    Systems
  • Professor Banu Ozden
  • Jan 2004
  • Ho Chung

2
Table of Contents
  • Part 1: Pond
  • Overview: OceanStore and Pond
  • Pond Architecture
  • Techniques: Erasure Codes, Push-based Update,
    Byzantine Agreement, Proactive Threshold
    Signature
  • Experimental Results
  • Part 2: CFS
  • Overview and Design Goals
  • Chord Layer, DHash Layer, FS Layer
  • Experimental Results

3
PART 1: Pond, the OceanStore Prototype
4
OceanStore Overview
  • Design Goal: Persistent Storage
  • Design criteria
  • High durability
  • Universal availability
  • Privacy
  • Data integrity
  • Assumptions about the infrastructure
  • Untrusted (e.g., hosts and routers can fail
    arbitrarily)
  • Dynamic (so the system must be self-organizing
    and self-repairing, i.e., self-tuning)
  • Support for nomadic data (how? promiscuous
    caching)

5
OceanStore as an Application
The OceanStore software stack, top to bottom:
  • New distributed applications (OceanStore)
  • Tapestry (routes messages; locates objects)
  • Network (Java NBIO)
  • Operating system
6
Pond Overview
Inner ring
  • Serializes concurrent writes
  • Enforces access control
  • Checks updates

7
Pond Data Model
  • Versioning
  • Each data object in Pond is versioned
  • Allows "time travel"
  • Each version of an object contains metadata, the
    actual data, and pointers to the previous version
  • The entire stream of versions of a given data
    object is named by an AGUID
  • GUIDs (a naming sketch follows this list)
  • BGUID (block): secure hash of a block of data
  • VGUID (version): BGUID of the top block
  • AGUID (active): hash(app-specified name + owner's public key)
  • The mapping from an AGUID to the latest version's VGUID
    may change over time
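
(A minimal sketch of the naming scheme above, not Pond's actual code; the
use of SHA-1 as the secure hash and the key encoding are illustrative
assumptions.)

    import hashlib

    def bguid(block: bytes) -> str:
        # BGUID: secure hash of a block's contents
        return hashlib.sha1(block).hexdigest()

    def aguid(app_name: str, owner_public_key: bytes) -> str:
        # AGUID: hash of the app-specified name plus the owner's public key
        return hashlib.sha1(app_name.encode() + owner_public_key).hexdigest()

    blocks = [b"version 1, part a", b"version 1, part b"]
    print([bguid(b) for b in blocks])                      # content-derived block names
    print(aguid("/photos/trip", b"<owner public key>"))    # names the whole version stream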

8
Pond - GUID
(Figure: the AGUID names the whole version stream and maps to the latest
VGUID. Each version i has a root block holding metadata M and a backpointer
to version i-1; indirect blocks point to data blocks d1-d7. Blocks that are
unchanged between versions are shared via copy-on-write.)
9
Pond Architecture (1)
  • Virtualization of resources
  • Virtualized resources are not tied to specific hardware
  • DOLR (Decentralized Object Location and Routing)
    interface
  • Tapestry virtualizes resources
  • An object is addressed by a GUID, not an IP address
  • (e.g.) PublishObject(Object_GUID, App_ID)
  • Locality aware
  • No restriction on the placement of objects
  • Queries find a nearby copy of an object with high probability

10
Pond Architecture (2)
  • Replication and consistency
  • Each object has a single primary replica
  • Heartbeat certificate: (AGUID, VGUID, timestamp,
    version)
  • Inner ring (implements the primary replica)
  • Enforces access control
  • Serializes concurrent updates from multiple users
  • Uses a Byzantine-fault-tolerant protocol to agree on
    updates to the data object, and digitally signs
    the result

11
Pond Architecture (3)
  • High durability for archival storage
  • Motivation: if we create 2 replicas of a data
    block, we get tolerance of one failure for an
    additional 100% of storage. Can we improve on this?
    Yes
  • Erasure codes: more durable than replication for the
    same space
  • After an update at the primary replica, all newly
    created blocks are erasure-coded and the fragments
    are stored

(Figure: a block is encoded into fragments A-H, which are distributed
across servers; the block can be rebuilt from a subset of the fragments.)
12
Pond - Erasure Codes
  • A block is divided into m identically-sized
    fragments, which are then encoded into n
    fragments, where n > m
  • The original block can be reconstructed from any
    m fragments
  • Rate of encoding: r = m/n < 1
  • Intuitively, erasure encoding gives higher fault
    tolerance for the storage used than replication
  • Disadvantage? Expensive computation (a toy example follows)
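
(A toy m = 2, n = 3 XOR-parity code with rate r = 2/3, as a stand-in for the
erasure codes Pond actually uses; any 2 of the 3 fragments rebuild the block.)

    def encode(block: bytes):
        half = len(block) // 2
        a, b = block[:half], block[half:2 * half]        # m = 2 data fragments
        parity = bytes(x ^ y for x, y in zip(a, b))      # 1 parity fragment
        return {"a": a, "b": b, "p": parity}             # n = 3 fragments total

    def decode(frags: dict) -> bytes:
        if "a" in frags and "b" in frags:
            return frags["a"] + frags["b"]
        if "a" in frags:                                  # b lost: b = a XOR p
            return frags["a"] + bytes(x ^ y for x, y in zip(frags["a"], frags["p"]))
        return bytes(x ^ y for x, y in zip(frags["b"], frags["p"])) + frags["b"]

    frags = encode(b"abcdefgh")
    del frags["b"]                                        # lose one fragment
    assert decode(frags) == b"abcdefgh"                   # still reconstructable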

13
Pond Caching Data Objects
  • Frequently-read objects?
  • Use whole-block caching (instead of erasure-coded fragments)
  • However, if no whole-block copy exists at the local
    node, the node reads fragments, decodes them to
    reconstruct the block, and then caches the block
  • LRU replacement
  • To read the latest version of a document?
  • Use Tapestry to retrieve a heartbeat for the
    object from its primary replica

14
Pond Push-based Update
  • Update?
  • Push-based update of an object's secondary replicas
  • Every time the primary replica applies an update to
    create a new version, it pushes the corresponding
    update and heartbeat down to the secondary replicas

15
Pond Byzantine Agreement
4 Byzantine generals, N ≥ 3f + 1
P1 (commander)
  • 1st round: the commander sends a value to each
    of the lieutenants
  • 2nd round: each lieutenant sends the value it
    received to its peers
  • P3 is a faulty general
  • Requires O(N^(f+1)) messages
  • (A toy sketch of the two rounds follows the figure below)

(Figure: in round 1, commander P1 sends value v to lieutenants P2, P3, and
P4 ("1:v"). In round 2 each lieutenant relays what it received; the faulty
P3 relays bogus values u and w ("3:1:u", "3:1:w"), while P2 and P4 relay v
("2:1:v", "4:1:v"). Each correct lieutenant takes the majority of the values
it saw and decides v.)
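
(A minimal simulation of these two rounds for N = 4, f = 1; this is a sketch
of the classic oral-messages majority vote, not Pond's actual agreement
protocol.)

    from collections import Counter

    def om1(commander_value, faulty):
        lieutenants = ["P2", "P3", "P4"]
        # Round 1: commander P1 sends its value to every lieutenant.
        received = {n: commander_value for n in lieutenants}
        # Round 2: each lieutenant relays what it received; the faulty one lies.
        relayed = {n: {} for n in lieutenants}
        for sender in lieutenants:
            for peer in lieutenants:
                if peer == sender:
                    continue
                value = "bogus-" + peer if sender == faulty else received[sender]
                relayed[peer][sender] = value
        # Each correct lieutenant decides by majority over everything it saw.
        decisions = {}
        for n in lieutenants:
            if n == faulty:
                continue
            votes = [received[n]] + list(relayed[n].values())
            decisions[n] = Counter(votes).most_common(1)[0][0]
        return decisions

    print(om1("v", faulty="P3"))   # both correct lieutenants decide 'v'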
16
Pond Authentication
  • Authentication in Byzantine agreement: use hybrid
    cryptography
  • MACs are used for all communication within the inner ring
  • Public-key cryptography is used to communicate with
    all other machines
  • Secondary replicas can verify the authenticity of
    data received from other replicas without
    contacting the inner ring
  • (e.g.) Most read traffic can be satisfied by the
    secondary replicas

17
Pond Proactive Threshold Signature (1)
  • Goals
  • To support flexibility in choosing the membership
    of the inner ring
  • To replace machines in the inner ring without
    changing the public key
  • PTS pairs a single public key with l private key
    shares. Each of the l servers uses its key share
    to generate a signature share, and any k
    correctly generated signature shares may be
    combined by any party to produce a full
    signature, where l = 3f + 1, k = f + 1, and f is the
    number of faulty hosts (a toy k-of-l sketch follows
    the next slide's figure)

18
Pond Proactive Threshold Signature (2)
(Figure: an inner ring of l = 4 servers shares a single public key PK; each
server holds one private key share SK1-SK4 and produces a signature share
SS1-SS4. Here l = 4, k = 2, f = 1 (l = 3f + 1, k = f + 1). When a server is
replaced by a new node, the key shares are regenerated, but NOTE: the public
key doesn't change!)
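
(A toy illustration of the k-of-l idea using Shamir secret sharing over a
prime field; this is not the proactive threshold signature scheme itself,
only the share-combination intuition. Parameters match the figure: l = 4,
k = 2.)

    import random

    P = 2**127 - 1          # prime modulus (illustrative choice)

    def split(secret, l, k):
        # Random degree-(k-1) polynomial with the secret as its constant term.
        coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
        def poly(x):
            return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
        return [(x, poly(x)) for x in range(1, l + 1)]    # l shares

    def combine(shares):
        # Lagrange interpolation at x = 0 recovers the secret.
        secret = 0
        for xi, yi in shares:
            num, den = 1, 1
            for xj, _ in shares:
                if xj != xi:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, -1, P)) % P
        return secret

    shares = split(123456789, l=4, k=2)
    assert combine(random.sample(shares, 2)) == 123456789   # any 2 of 4 suffice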
19
Pond Prototype implementation
  • All major subsystems operational
  • Self-organizing Tapestry base
  • Primary replicas use Byzantine agreement
  • Secondary replicas self-organize into a multicast
    tree
  • Erasure-coding archive
  • Staged event-driven software architecture
  • Built on SEDA
  • 280K lines of Java (J2SE v1.3)
  • JNI libraries for cryptography and erasure coding

20
Pond Deployment on PlanetLab
  • http://www.planet-lab.org
  • 100 hosts, 40 sites
  • Hosts are spread across North America, Europe,
    Australia, and New Zealand
  • Pond: up to 1000 virtual nodes
  • Deployed using custom Perl scripts
  • 5-minute startup
  • Gives global scale for free

21
Pond Results (Latency)
Table 1. Latency breakdown of an update: the
majority of the time is spent computing the
threshold signature share for small updates. With
larger updates, the time to apply and archive the
update dominates the signature time.
Figure 1. Latency to read objects from the
archive: the time to read an object increases with
the number of blocks that must be retrieved.
22
Pond Results (Throughput)
Table 2. Throughput in the wide area: the
throughput for a distributed ring is limited by
the wide-area bandwidth.
Table 3. Results of the Andrew benchmark: OceanStore
outperforms NFS by a factor of 4.3 in
read-intensive phases, but write performance
is worse by as much as a factor of 7.3.
23
Pond Conclusion
  • Likes
  • Supports a higher degree of consistency (via the
    Byzantine agreement protocol)
  • The idea of using proactive threshold signatures
  • Don't like
  • Not suitable for write sharing
  • The idea of a Responsible Party (which chooses the
    hosts for the inner ring)
  • Complex! Data privacy, client updates, and
    durable storage all come with an increase in
    complexity (e.g., Byzantine protocol, Plaxton
    tree, proactive threshold signatures, erasure
    encoding, etc.)

24
PART 2: Wide-area cooperative storage with CFS
(CFS: distributed read-only file storage)
25
CFS: Chord-based Distributed File Storage System
(Figure: one CFS client and two CFS servers, each running a layered stack.)
  • FS (client only): interprets blocks as files and presents a
    file-system interface to applications
  • DHash (storage layer): storage/retrieval and replication/caching
    of data blocks
  • Chord (lookup layer): maintains the routing tables used to find
    blocks
26
CFS Design Goals
  • Efficiency and scalability
  • (See Chord's algorithmic performance on the next
    slide)
  • Availability
  • Chord allows a client to always retrieve data
    (assuming the absence of network partitions, etc.)
  • Fault tolerance: replication and caching
  • Block-level storage: store blocks, NOT whole
    files (cf. PAST)
  • Block-level caching: cache along the lookup path
  • Whole-file caching: only if files are small
  • Load balance
  • Virtual servers: spread blocks evenly over the
    available virtual servers (several per physical server)
  • Per-publisher quotas
  • To avoid malicious injection of large quantities
    of data (cf. PAST)
  • Decentralization
  • cf. a CDN (e.g., Akamai) is managed by a central
    entity

27
CFS Chord Layer
  • Chord is a structured P2P overlay
  • Chord maps keys onto a circular identifier space
  • Given a key, it maps the key onto a node
  • (e.g., lookup(key) → IP address of node)
  • Key idea: keep pointers (fingers) to
    exponentially spaced points around the ID space
  • Algorithmic performance
  • In an N-node network, each node maintains O(log N)
    entries in its routing table
  • A lookup requires O(log N) messages

28
Chord: A Simple Lookup Protocol
(Figure: a Chord identifier ring of 10 nodes (N1, N8, N14, N21, N32, N38,
N42, N48, N51, N56) storing 5 keys (K10, K24, K30, K38, K54). Node N8
resolves Lookup(K54) by following successor pointers around the ring.)
Fig 2.1: Chord ring consisting of 10 nodes storing 5 keys.
Fig 2.2: Node 8 performs a lookup for key 54 using only the successor list.
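
(A sketch of the successor-only lookup on the ring above; this is a toy
model, not the Chord implementation, and the 6-bit ID space is an
illustrative assumption.)

    NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
    SPACE = 64                      # 6-bit toy identifier space

    def successor(ident):
        # First node clockwise from ident on the ring.
        for n in NODES:
            if n >= ident:
                return n
        return NODES[0]             # wrap around

    def lookup_with_successors(start, key):
        # Walk the ring one successor at a time until the key's node is reached.
        hops, node = [start], start
        while node != successor(key):
            node = successor((node + 1) % SPACE)
            hops.append(node)
        return hops

    print(lookup_with_successors(8, 54))   # N8 -> N14 -> ... -> N56 (holds K54)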
29
Chord: A Fast Lookup Protocol
(Fig 2.3: node N8's finger table, with entries at distances 1, 2, 4, 8, 16,
and 32 around the ring. Fig 2.4: using the finger table to accelerate
Lookup(K54).)
The i-th entry in the table at node n contains the ID
of the first node that succeeds n by at least
2^(i-1) on the Chord ring.
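
(A sketch of finger-table construction and greedy lookup on the same toy
ring as the previous sketch; again a toy model, not the Chord implementation.)

    NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
    BITS, SPACE = 6, 64

    def successor(ident):
        ident %= SPACE
        for n in NODES:
            if n >= ident:
                return n
        return NODES[0]

    def fingers(n):
        # i-th finger: first node that succeeds n by at least 2^(i-1).
        return [successor(n + 2 ** (i - 1)) for i in range(1, BITS + 1)]

    def lookup_with_fingers(start, key):
        hops, node = [start], start
        while successor(key) != node:
            table = fingers(node)
            nxt = table[0]                 # default: immediate successor
            for f in table:
                # Jump to the farthest finger that does not pass the key.
                if (f - node) % SPACE <= (key - node) % SPACE:
                    nxt = f
            node = nxt
            hops.append(node)
        return hops

    print(fingers(8))                      # [14, 14, 14, 21, 32, 42]
    print(lookup_with_fingers(8, 54))      # far fewer hops than the successor walk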
30
CFS Chord Layer: Server Selection
  • Goal
  • Reduce lookup latency by preferentially
    contacting nodes likely to be nearby in the
    underlying network
  • Cost metric: contact the candidate ni with the minimum
    C(ni) = di + H(ni) × d_avg (a small sketch follows this list)
  • Notation
  • H(ni): an estimate of the number of Chord hops that
    would remain after contacting ni
  • di: latency to node ni as reported by node m (m =
    the previous hop)
  • d_avg: average latency of all the RPCs that node n
    has ever issued
  • log N: an estimate of the number of significant high
    bits in an ID
  • ones(): counts how many bits are set in its argument
  • (ni - id) >> (160 - log N): the significant
    bits of the ID-space distance between ni and the
    target key id; H(ni) = ones() of these bits
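
(A small sketch of this cost metric; the candidate list, latencies, and the
160-bit ID arithmetic below are illustrative assumptions.)

    N_ESTIMATE = 1024                      # estimated network size
    LOG_N = N_ESTIMATE.bit_length() - 1    # significant high bits in an ID
    ID_BITS = 160

    def remaining_hops(ni, target_id):
        # ones() of the significant bits of the ID-space distance.
        distance = (ni - target_id) % (2 ** ID_BITS)
        return bin(distance >> (ID_BITS - LOG_N)).count("1")

    def best_next_hop(candidates, target_id, d_avg):
        # candidates: list of (node_id, measured latency di from the previous hop)
        def cost(c):
            ni, di = c
            return di + remaining_hops(ni, target_id) * d_avg
        return min(candidates, key=cost)

    candidates = [(0x1234 << 140, 0.080), (0xabcd << 140, 0.015)]
    print(best_next_hop(candidates, target_id=0x2000 << 140, d_avg=0.050))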

31
CFS Chord Layer: Node ID Authentication
  • When a new node wants to join, an existing node
    authenticates it
  • Chord ID = SHA-1(node's IP address, virtual
    node index)
  • Check whether the claimed IP address and virtual
    index hash to the claimed Chord ID (see the check below)
  • Why do this? If Chord nodes could use arbitrary
    IDs, an attacker could destroy chosen data by
    choosing a node ID just after the data's ID
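
(A sketch of the ID check; the concatenation format fed into SHA-1 is an
illustrative assumption.)

    import hashlib

    def chord_id(ip: str, virtual_index: int) -> str:
        return hashlib.sha1(f"{ip}:{virtual_index}".encode()).hexdigest()

    def verify_claim(claimed_id: str, ip: str, virtual_index: int) -> bool:
        # Reject a joining node whose claimed ID does not match its IP and index.
        return claimed_id == chord_id(ip, virtual_index)

    nid = chord_id("192.0.2.7", 3)
    print(verify_claim(nid, "192.0.2.7", 3))    # True
    print(verify_claim(nid, "203.0.113.9", 3))  # False: ID does not match the IP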

32
CFS Block vs. Whole-file
  • Advantages of block granularity
  • Well-suited to serving large, popular files
  • The network bandwidth consumed for lookups is small
  • (CFS also hides the block lookup latency by
    pre-fetching blocks)
  • Less work to achieve load balance
  • Allows client applications a flexible choice of
    format, and different data structures can
    coexist
  • Advantages of whole-file granularity
  • Efficient for serving large, unpopular files
  • Decreases the number of messages required to fetch a file
  • Lower lookup costs (one lookup per file rather
    than per block)

33
CFS DHash Layer: Replication and Caching
  • The block is stored at the successor of its ID
    (square)
  • The block is replicated at the successor's
    immediate successors (circles)
  • The block is cached at the servers along the
    lookup path (triangles); a placement sketch follows
(Figure: the placement of block replicas and cached copies on the ring.
Tick mark: block ID. Square: server storing the block. Circles: the
server's immediate successors. Triangles: servers along the lookup path.)
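
(A sketch of this placement on a toy ring; the node IDs, replication factor
k = 2, and lookup path are illustrative assumptions.)

    NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

    def successor_index(ident):
        for i, n in enumerate(NODES):
            if n >= ident:
                return i
        return 0

    def placement(block_id, k=2, lookup_path=()):
        i = successor_index(block_id)
        home = NODES[i]                                      # square: stores the block
        replicas = [NODES[(i + j) % len(NODES)] for j in range(1, k + 1)]  # circles
        caches = [n for n in lookup_path if n != home]       # triangles
        return home, replicas, caches

    print(placement(54, k=2, lookup_path=[8, 42, 51]))
    # (56, [1, 8], [8, 42, 51]): stored at N56, replicated at N1 and N8,
    # cached along the lookup path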

34
CFS DHash Layer: Load Balance
  • Motivation: assume every CFS server had exactly one ID.
    Then every server would bear the same (expected) storage
    burden. Is this what we want? No
  • Every server has different network and storage
    capacity
  • Thus a uniform distribution doesn't produce
    perfect load balance (due to heterogeneity)
  • A solution: virtual servers (see the sketch below)
  • ADV: allows adaptive configuration of the server
    according to its capacity
  • DISADV: introduces a potential increase in the number
    of hops in a Chord lookup
  • A quick remedy: allow virtual servers on the same
    physical server to look up each other's tables
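
(A sketch of why multiple virtual servers per physical server tighten the
spread of ID-space responsibility; the ring parameters are illustrative
assumptions.)

    import hashlib, statistics

    def id_space_fractions(num_physical, vnodes_per_server, bits=160):
        space = 2 ** bits
        points = {}                      # ring position -> owning physical server
        for s in range(num_physical):
            for v in range(vnodes_per_server):
                h = int(hashlib.sha1(f"server{s}-vnode{v}".encode()).hexdigest(), 16)
                points[h] = s
        ring = sorted(points)
        frac = [0.0] * num_physical
        for i, p in enumerate(ring):
            gap = (p - ring[i - 1]) % space      # keys this point is responsible for
            frac[points[p]] += gap / space
        return frac

    for v in (1, 8):
        spread = statistics.stdev(id_space_fractions(64, v))
        print(f"{v} virtual server(s): stdev of ID-space fraction = {spread:.4f}")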

35
CFS DHash Layer: Update and Delete
  • CFS allows updates only by the publisher of the
    file
  • CFS doesn't support an explicit delete
  • Publishers must periodically refresh their blocks
    if they want CFS to continue to store them

36
CFS FS Layer (1)
  • The file system is read-only as far as clients
    are concerned
  • The file system may be updated by its publisher
  • Key idea
  • WE WANT integrity and authenticity guarantees on
    public data while serving many clients
  • HOW?
  • Use SFSRO (the SFS read-only file system)
  • Self-certifying read-only FS
  • Filenames contain public keys
  • ADVANTAGE of a read-only FS?
  • The distribution infrastructure is independent of
    the published content
  • Avoids cryptographic operations on servers
    and keeps that overhead on clients

37
CFS FS Layer (2)
A simple CFS file system
(Figure: the root block is identified by the publisher's public key and
carries a signature; it points to a directory block via H(D). The directory
block maps <name, H(inode)>; the inode block points to data blocks B1 and B2
via H(B1) and H(B2).)
  • The public key is the root block's identifier
  • Data blocks and inodes are named by hashes of their
    contents (a sketch of this naming chain follows)
  • An update involves updating the root block to point
    to the new data
  • The root block includes a timestamp to prevent replay
    attacks
  • It covers a finite time interval → indefinite storage
    needs periodic refresh
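
(A minimal sketch of the self-certified naming chain, not SFSRO's actual
on-disk format; the signing step is a placeholder function and the block
layout is an illustrative assumption.)

    import hashlib

    def h(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    def sign(data: bytes, private_key) -> str:
        # Placeholder for a real public-key signature (e.g., Rabin in SFSRO).
        return "sig(" + h(data) + ")"

    # Data blocks are named by hashes of their contents.
    b1, b2 = b"file contents, part 1", b"file contents, part 2"
    inode = (h(b1) + h(b2)).encode()                 # inode lists its blocks' hashes
    directory = ("readme.txt " + h(inode)).encode()  # <name, H(inode)> entry
    root = ("H(D)=" + h(directory) + " ts=2004-01-20").encode()

    public_key = b"<publisher public key bytes>"
    root_guid = h(public_key)        # root identified via the publisher's key
                                     # (hashed here for a fixed-size ID; illustrative)
    root_signature = sign(root, private_key=None)

    # A client that knows the public key can verify the whole chain:
    # the signature on the root, then H(D), H(inode), H(B1), H(B2) downward.
    print(root_guid, root_signature)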

38
Some thoughts on implementation
  • Why the Rabin public-key cryptosystem in SFSRO? Why
    NOT RSA or DSA?
  • → Fast signature verification
  • How fast counts as cheap for digital
    signature verification?
  • → Far smaller than a typical network RTT (e.g.,
    82 µs)

39
CFS Implementation
  • CFS
  • 7K lines of C++ (including 3K lines for Chord)
  • Servers communicate over UDP with a C++ RPC
    package (provided by the SFS toolkit)
  • Why not TCP? The overhead of TCP connection setup
  • CFS runs on Linux, OpenBSD, and FreeBSD

40
CFS Results (1)
  • Lookup cost is O(log N)
  • Pre-fetching increases speed
  • Server selection increases speed

41
CFS Results (2)
With only 1 virtual server per real server, some servers
store no blocks at all, while others store more than
the average.
  • You can control storage space!
  • Load balance: with multiple virtual servers per
    real server, the sum of the fractions of the ID
    space that a server's virtual servers are
    responsible for is more tightly clustered around
    the average

42
CFS Conclusions
  • Pros (or likes)
  • Simplicity
  • Aggressive load balancing (via virtual servers)
  • The algorithm guarantees data availability with high
    probability (e.g., tighter bounds on lookup cost)
  • Cons (or don't like)
  • Read-only storage system
  • No anonymity
  • No (keyword) search feature

43
APPENDIX
44
Appendix - P2P Comparisons (1)
  • Tapestry (UCB), Chord (MIT), Pastry (Microsoft),
    and CAN (ATT) all provide functionality to route
    messages to an object
  • Disadv. of CAN and Chord: they route along the
    shortest overlay-hop paths available, without regard
    to underlying network distance
  • Adv. of Tapestry and Pastry: they construct locally
    optimal routing tables at initialization and
    maintain them in order to reduce routing stretch
  • Adv. of Pastry: it constrains the routing distance
    per overlay hop to achieve efficiency in
    point-to-point routing between overlay nodes

45
Appendix - P2P Comparisons (2)
  • Adv. of Tapestry: locality awareness
  • The number and location of object replicas are not
    fixed
  • The difference between Pastry and Tapestry lies in
    object location:
  • While Tapestry helps the user or application
    locate the nearest copy of an object,
  • Pastry actively replicates the object and places
    replicas at random locations in the network.
  • The result is that when a client searches for a
    nearby object, Tapestry routes through a few
    hops to the object, while Pastry might require
    the client to route to a distant replica of the
    object.

46
Appendix - Bloom Filter (1)
  • Goal: to support membership queries
  • Given a set A = {a1, a2, ..., an} of n elements, the
    Bloom filter uses hash functions to compute whether
    a query message is a member of the set
  • Factors: reject time, hash area size, allowable
    fraction of errors
  • Idea: examine only part of the message to
    recognize it as not matching a test message; reduce
    the hash area size by allowing errors

47
Appendix - Bloom Filter (2)
  • Initially, an m-bit vector v is set to all 0s
  • Choose k independent hash functions h1(), ...,
    hk(), each with range {1, ..., m} (e.g., here k = 4)
  • For each element a ∈ A, the bits at positions
    h1(a), h2(a), ..., hk(a) in v are set to 1
  • Given a query for b, we check the bits at
    positions h1(b), h2(b), ..., hk(b)
  • If any of them is 0, then b is not in the set A
    (a small sketch follows the figure)

A Bloom Filter with 4 hash functions
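
(A small sketch of this construction; the bit-array size, k = 4, and the
double-hashing trick used to derive the k functions are illustrative
assumptions.)

    import hashlib

    M, K = 256, 4                 # m-bit vector, k hash functions

    def positions(item: bytes):
        # Derive k positions from two SHA-1-based values (double hashing).
        digest = hashlib.sha1(item).digest()
        h1 = int.from_bytes(digest[:10], "big")
        h2 = int.from_bytes(digest[10:], "big")
        return [(h1 + i * h2) % M for i in range(K)]

    v = [0] * M

    def add(item: bytes):
        for p in positions(item):
            v[p] = 1

    def maybe_contains(item: bytes) -> bool:
        # If any bit is 0, the item is definitely not in the set.
        return all(v[p] for p in positions(item))

    for a in (b"pond", b"cfs", b"chord"):
        add(a)
    print(maybe_contains(b"cfs"))       # True
    print(maybe_contains(b"tapestry"))  # False (or, rarely, a false positive)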