Title: Pond and CFS
1. Pond and CFS
- CS599: Special Topics in OS and Distributed Storage Systems
- Professor Banu Ozden
- Jan 2004
- Ho Chung
2. Table of Contents
- Part 1: Pond
- Overview: OceanStore and Pond
- Pond Architecture
- Techniques: Erasure Codes, Push-based Update, Byzantine Agreement, Proactive Threshold Signature
- Experimental Results
- Part 2: CFS
- Overview and Design Goals
- Chord Layer, DHash Layer, FS Layer
- Experimental Results
3. PART 1: Pond, the OceanStore Prototype
4. OceanStore Overview
- Design goal: persistent storage
- Design criteria
- High durability
- Universal availability
- Privacy
- Data integrity
- Assumptions on infrastructure
- Untrusted (e.g., hosts and routers can fail arbitrarily)
- Dynamic (so the system must be self-organizing and self-repairing, i.e. self-tuning)
- Support for nomadic data (how? promiscuous caching)
5. OceanStore as an Application
Figure: the software stack, from top to bottom:
- New distributed applications (OceanStore)
- Tapestry (routing of messages, location of objects)
- Network (Java NBIO)
- Operating System
6. Pond Overview
Inner ring:
- Serialize concurrent writes
- Enforce access control
- Check updates
7. Pond: Data Model
- Versioning
- Each data object in Pond has versions
- Allows "time travel"
- Each version of an object contains metadata, the actual data, and pointers to the previous version
- The entire stream of versions of a given data object is named by an AGUID
- GUIDs (a hashing sketch follows below)
- BGUID (block): secure hash of a block of data
- VGUID (version): BGUID of the top block
- AGUID (active): hash(application-specified name + owner's public key)
- The mapping from an AGUID to the latest VGUID may change over time
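The GUID scheme above is just a set of hash computations. Below is a minimal sketch of how a BGUID and an AGUID might be derived; SHA-1 and the concatenation order of the AGUID inputs are assumptions for illustration, not Pond's exact encoding.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Guids {
    // Hash a sequence of byte arrays with SHA-1 (the hash choice is an assumption).
    static byte[] sha1(byte[]... parts) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // BGUID: secure hash of a block's contents.
    static byte[] bguid(byte[] blockData) throws NoSuchAlgorithmException {
        return sha1(blockData);
    }

    // AGUID: hash of the application-specified name plus the owner's public key;
    // it names the entire stream of versions of the object.
    static byte[] aguid(String appName, byte[] ownerPublicKey) throws NoSuchAlgorithmException {
        return sha1(appName.getBytes(StandardCharsets.UTF_8), ownerPublicKey);
    }
}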
8. Pond: GUIDs
Figure: the AGUID names the stream of versions; each version's VGUID points to a root block whose backpointer references the previous version (VGUID i+1 points back to VGUID i); indirect blocks and data blocks (d1..d7) are shared between versions via copy-on-write.
9. Pond Architecture (1)
- Virtualization of resources
- Virtualized resources are not tied to hardware
- DOLR (Decentralized Object Location and Routing) interface
- Tapestry virtualizes resources
- An object is addressed with a GUID, not an IP address
- e.g., PublishObject(Object_GUID, App_ID)
- Locality aware
- No restriction on the placement of objects
- Queries find a nearby copy of the object with high probability
10. Pond Architecture (2)
- Replication and consistency
- Each object has a single primary replica
- Heartbeat certificate: (AGUID, VGUID, timestamp, version)
- Enforces access control
- Serializes concurrent updates from multiple users
- Inner ring
- Uses a Byzantine-fault-tolerant protocol to agree on updates to the data object, and digitally signs the object
11. Pond Architecture (3)
- High durability for archival storage
- Motivation: if we create 2 replicas of a data block, we get tolerance of one failure at the cost of an additional 100% of storage. Can we improve on this? Yes.
- Erasure codes are more durable than replication for the same space
- After an update at the primary replica, all newly created blocks are erasure-coded and the fragments are stored
Figure: a block is erasure-coded into fragments (A-H) that are dispersed across servers; a subset of the fragments (e.g., A, D, E, H) suffices to reconstruct it.
12. Pond: Erasure Codes
- A block is divided into m identically-sized fragments, which are then encoded into n fragments, where n > m
- The original block can be reconstructed from any m fragments
- Rate of encoding: r = m/n < 1
- Intuitively, erasure encoding gives higher fault tolerance for the storage used than replication (see the sketch below)
- Disadvantage? Expensive computation
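To make the durability claim concrete, here is a back-of-the-envelope comparison: with independent failures at probability p, two replicas (2x storage) lose a block only if both copies fail, while a 16-of-32 erasure code (also 2x storage) loses a block only if more than 16 of the 32 fragments fail. The parameters and failure model are illustrative assumptions, not figures from the paper.

public class ErasureVsReplication {
    // C(n, k) computed in floating point; adequate for small n.
    static double binomial(int n, int k) {
        double c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    // P(block lost) for an m-of-n erasure code: more than n - m fragments fail.
    static double erasureLoss(int m, int n, double p) {
        double loss = 0;
        for (int failed = n - m + 1; failed <= n; failed++) {
            loss += binomial(n, failed) * Math.pow(p, failed) * Math.pow(1 - p, n - failed);
        }
        return loss;
    }

    // P(block lost) for r full replicas: all replicas fail.
    static double replicationLoss(int r, double p) {
        return Math.pow(p, r);
    }

    public static void main(String[] args) {
        double p = 0.1; // assumed independent per-server failure probability
        System.out.println("2 replicas       (2x storage): " + replicationLoss(2, p));
        System.out.println("16-of-32 erasure (2x storage): " + erasureLoss(16, 32, p));
    }
}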
13. Pond: Caching Data Objects
- Frequently-read objects?
- Use whole-block caching (instead of erasure-coded fragments)
- However, if a whole-block copy does not exist at the local node, the node reads the fragments, decodes them to reconstruct the block, and then caches the block
- Cached blocks are managed with an LRU policy (see the sketch below)
- To read the latest version of a document?
- Use Tapestry to retrieve a heartbeat for the object from its primary replica
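A whole-block cache with LRU eviction can be sketched in a few lines; the key type (a BGUID string) and the fixed capacity below are illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for reconstructed whole blocks, keyed by BGUID.
public class BlockCache extends LinkedHashMap<String, byte[]> {
    private final int capacity;

    public BlockCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > capacity; // evict the least-recently-used block when full
    }
}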
14. Pond: Push-based Update
- Updates?
- Secondary replicas of an object are updated with a push-based scheme
- Every time the primary replica applies an update to create a new version, it sends the corresponding update and heartbeat down to the secondary replicas
15. Pond: Byzantine Agreement
- Example: 4 Byzantine generals, N > 3f + 1
- P1 is the commander; P3 is a faulty general
- 1st round: the commander sends a value to each of the lieutenants
- 2nd round: each of the lieutenants sends the value it received to its peers
- Needs O(N^(f+1)) messages
Figure: P1 sends 1:v to P2, P3, and P4; in the second round P2 and P4 relay 2:1:v and 4:1:v correctly, while the faulty P3 relays inconsistent values 3:1:u and 3:1:w.
16. Pond: Authentication
- Authentication in Byzantine agreement uses hybrid cryptography
- MACs are used for all communication within the inner ring (see the sketch below)
- Public-key cryptography is used to communicate with all other machines
- Secondary replicas can verify the authenticity of data received from other replicas without contacting the inner ring
- e.g., most read traffic can be satisfied by the secondary replicas
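As a rough illustration of the cheap symmetric half of this hybrid scheme, the sketch below authenticates inner-ring messages with an HMAC over a shared key; the algorithm choice (HmacSHA1) and the key-distribution details are assumptions.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class InnerRingMac {
    // Compute a MAC tag that is sent alongside the message.
    static byte[] authenticate(byte[] sharedKey, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
        return mac.doFinal(message);
    }

    // Verify a received tag by recomputing it with the shared key.
    static boolean verify(byte[] sharedKey, byte[] message, byte[] tag) throws Exception {
        byte[] expected = authenticate(sharedKey, message);
        return java.security.MessageDigest.isEqual(expected, tag); // constant-time compare
    }
}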
17. Pond: Proactive Threshold Signature (1)
- Goals
- To support flexibility in choosing the membership of the inner ring
- To replace machines in the inner ring without changing the public key
- PTS pairs a single public key with l private key shares. Each of the l servers uses its key share to generate a signature share, and any k correctly generated signature shares can be combined by any party to produce a full signature, where l = 3f + 1, k = f + 1, and f is the number of faulty hosts tolerated.
18. Pond: Proactive Threshold Signature (2)
Figure: an inner ring with l = 4, k = 2, f = 1 (l = 3f + 1, k = f + 1). One public key PK is paired with private key shares SK1-SK4, each producing a signature share SS1-SS4. When a new node replaces a ring member, the key shares are regenerated.
NOTE: the public key doesn't change!
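The k-of-l property can be illustrated with plain Shamir secret sharing, which underlies threshold schemes: any k shares reconstruct the secret, while fewer reveal nothing. This is only a sketch of the share-and-combine idea, not the proactive threshold signature scheme Pond actually uses.

import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class KofL {
    // Prime field modulus; the size is an illustrative assumption.
    static final BigInteger P = BigInteger.probablePrime(256, new SecureRandom());

    // Split the secret into l shares with threshold k: points on a random degree-(k-1) polynomial.
    static Map<Integer, BigInteger> split(BigInteger secret, int k, int l) {
        SecureRandom rnd = new SecureRandom();
        BigInteger[] coeff = new BigInteger[k];
        coeff[0] = secret;
        for (int i = 1; i < k; i++) coeff[i] = new BigInteger(P.bitLength() - 1, rnd);
        Map<Integer, BigInteger> shares = new HashMap<>();
        for (int x = 1; x <= l; x++) {
            BigInteger y = BigInteger.ZERO;                       // evaluate polynomial at x (Horner)
            for (int i = k - 1; i >= 0; i--) y = y.multiply(BigInteger.valueOf(x)).add(coeff[i]).mod(P);
            shares.put(x, y);
        }
        return shares;
    }

    // Combine any k shares by Lagrange interpolation at x = 0.
    static BigInteger combine(Map<Integer, BigInteger> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (Map.Entry<Integer, BigInteger> si : shares.entrySet()) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (Integer xj : shares.keySet()) {
                if (xj.equals(si.getKey())) continue;
                num = num.multiply(BigInteger.valueOf(-xj)).mod(P);
                den = den.multiply(BigInteger.valueOf(si.getKey() - xj)).mod(P);
            }
            secret = secret.add(si.getValue().multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }
}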
19. Pond: Prototype Implementation
- All major subsystems operational
- Self-organizing Tapestry base
- Primary replicas use Byzantine agreement
- Secondary replicas self-organize into a multicast tree
- Erasure-coding archive
- Staged event-driven software architecture
- Built on SEDA
- 280K lines of Java (J2SE v1.3)
- JNI libraries for cryptography and erasure coding
20. Pond: Deployment on PlanetLab
- http://www.planet-lab.org
- 100 hosts, 40 sites
- Hosts are spread across North America, Europe, Australia, and New Zealand
- Pond runs with up to 1000 virtual nodes
- Using custom Perl scripts
- 5-minute startup
- Gives global scale for free
21. Pond: Results (Latency)
Table 1. Latency breakdown of an update: for small updates, the majority of the time is spent computing the threshold signature share; with larger updates, the time to apply and archive the update dominates the signature time.
Figure 1. Latency to read objects from the archive: the time to read an object increases with the number of blocks that must be retrieved.
22. Pond: Results (Throughput)
Table 2. Throughput in the wide area: the throughput for a distributed ring is limited by the wide-area bandwidth.
Table 3. Results of the Andrew benchmark: OceanStore outperforms NFS by a factor of 4.3 in read-intensive phases, but write performance is worse by as much as a factor of 7.3.
23. Pond: Conclusion
- Likes
- Supports a high degree of consistency (via the Byzantine agreement protocol)
- The idea of using proactive threshold signatures
- Dislikes
- Not suitable for write-sharing
- The idea of a Responsible Party (which chooses the hosts for inner rings)
- Complex! Data privacy, client updates, and durable storage all come with an increase in complexity (e.g., the Byzantine protocol, Plaxton trees, proactive threshold signatures, erasure encoding, etc.)
24. PART 2: Wide-area Cooperative Storage with CFS
CFS: distributed read-only file storage
25. CFS: a Chord-based Distributed File Storage System
Figure: a CFS client stacks three layers (FS, DHash, Chord); CFS servers run only DHash and Chord.
- FS: interprets blocks as files and presents a file-system interface to applications
- DHash: storage layer; storage/retrieval, replication, and caching of data blocks
- Chord: lookup layer; maintains the routing tables used to find blocks
26. CFS: Design Goals
- Efficiency and scalability
- (See Chord's algorithmic performance on the next slide)
- Availability
- Chord allows a client to always retrieve data (assuming the absence of network partitions, etc.)
- Fault tolerance: replication and caching
- Block-level storage: store blocks, not whole files (cf. PAST)
- Block-level caching: cache along the lookup path
- Whole-file caching: only if files are small
- Load balance
- Virtual servers: spread blocks evenly over the available virtual servers (several per physical server)
- Per-publisher quotas
- To avoid malicious injection of large quantities of data (cf. PAST)
- Decentralization
- cf. CDNs (e.g., Akamai) are managed by a central entity
27. CFS: Chord Layer
- Chord is a structured P2P overlay
- Chord maps keys onto a circular, one-dimensional key space
- Given a key, it maps the key onto a node (e.g., lookup(key) -> IP address of node)
- Key idea: keep pointers (fingers) to exponentially spaced places around the key space
- Algorithmic performance
- In an N-node network, each node maintains O(log N) entries in its routing table
- A lookup requires O(log N) messages (a sketch of the finger-table lookup follows below)
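The sketch below shows the finger-table lookup that yields these bounds, in a small 2^8 identifier space; the node structure, recursive style, and ring size are illustrative assumptions rather than the Chord implementation used in CFS.

public class ChordNode {
    static final int M = 8;                       // bits in the identifier space (2^M IDs); an assumption
    final int id;
    final ChordNode[] finger = new ChordNode[M];  // finger[i] = first node succeeding id by at least 2^i
    ChordNode successor;

    ChordNode(int id) { this.id = id; }

    // True if x lies on the ring segment (a, b], wrapping around 2^M.
    static boolean inHalfOpen(int x, int a, int b) {
        return (a < b) ? (x > a && x <= b) : (x > a || x <= b);
    }

    // True if x lies strictly inside the ring segment (a, b), wrapping around 2^M.
    static boolean inOpen(int x, int a, int b) {
        return (a < b) ? (x > a && x < b) : (x > a || x < b);
    }

    // Find the node responsible for key; with correct fingers this takes O(log N) hops.
    ChordNode findSuccessor(int key) {
        if (inHalfOpen(key, id, successor.id)) return successor;
        ChordNode next = closestPrecedingFinger(key);
        if (next == this) return successor;       // no closer finger known, fall back to the successor
        return next.findSuccessor(key);
    }

    // Scan fingers from farthest to nearest for the closest known predecessor of key.
    ChordNode closestPrecedingFinger(int key) {
        for (int i = M - 1; i >= 0; i--) {
            if (finger[i] != null && inOpen(finger[i].id, id, key)) return finger[i];
        }
        return this;
    }
}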
28. Chord: A Simple Lookup Protocol
Fig 2.1. A Chord ring of 10 nodes (N1, N8, N14, N21, N32, N38, N42, N48, N51, N56) storing 5 keys (K10, K24, K30, K38, K54).
Fig 2.2. Node N8 performs lookup(K54) using only the successor list, walking the ring node by node until it reaches K54's successor, N56.
29. Chord: A Fast Lookup Protocol
Fig 2.3. The finger table of node N8, with entries at distances 1, 2, 4, 8, 16, and 32 around the ring.
Fig 2.4. Node N8 performs lookup(K54) using the finger table to accelerate the lookup.
- The i-th entry in the table at node n contains the ID of the first node that succeeds n by at least 2^(i-1) on the Chord ring
30. CFS Chord Layer: Server Selection
- Goal
- Reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
- Cost metric: pick the candidate n_i with the minimum C(n_i)
- Notation
- H(n_i): an estimate of the number of Chord hops that would remain after contacting n_i
- d_i: latency to node n_i as reported by node m (m = the previous hop)
- d-bar: average latency of all the RPCs that this node has ever issued
- log N: an estimate of the number of significant high bits in an ID
- ones(): counts how many bits are set in its argument
- (n_i - id) >> (160 - log N): the significant bits of the ID-space distance between n_i and the target key id
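Putting the notation together, the cost of contacting candidate n_i is its measured latency plus the expected cost of the estimated remaining hops, roughly C(n_i) = d_i + d-bar * H(n_i) with H(n_i) = ones((n_i - id) >> (160 - log N)). The sketch below is reconstructed from these definitions and should be read as an approximation of the paper's metric, not its exact code.

import java.math.BigInteger;

public class ServerSelection {
    // H(n_i): estimated remaining Chord hops after contacting candidate ni.
    static int estimatedHops(BigInteger ni, BigInteger targetId, int logN) {
        BigInteger dist = ni.subtract(targetId).mod(BigInteger.ONE.shiftLeft(160)); // ID-space distance, per the notation above
        return dist.shiftRight(160 - logN).bitCount();                              // ones() of the significant high bits
    }

    // C(n_i): latency to ni plus the expected latency of the remaining hops.
    static double cost(double latencyToNi, double avgRpcLatency, BigInteger ni, BigInteger targetId, int logN) {
        return latencyToNi + avgRpcLatency * estimatedHops(ni, targetId, logN);
    }
}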
31. CFS Chord Layer: Node ID Authentication
- When a new node wants to join, an existing node authenticates it
- Chord ID = SHA-1(node's IP address, virtual node index)
- Check whether the claimed IP address and virtual index hash to the claimed Chord ID
- Why do this? If Chord nodes could use arbitrary IDs, an attacker could destroy chosen data by choosing a node ID just after the data's ID (see the sketch below)
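A minimal sketch of this check, assuming the ID is the SHA-1 of the raw IP address bytes followed by the virtual node index (the exact byte layout is an assumption):

import java.math.BigInteger;
import java.net.InetAddress;
import java.security.MessageDigest;

public class NodeIdCheck {
    // Recompute the Chord ID a node with this address and virtual index should have.
    static BigInteger expectedId(InetAddress addr, int virtualIndex) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(addr.getAddress());                               // node's IP address bytes
        sha1.update(BigInteger.valueOf(virtualIndex).toByteArray());  // virtual node index
        return new BigInteger(1, sha1.digest());                      // 160-bit Chord ID
    }

    // Accept the join only if the claimed ID matches the recomputed one.
    // (The verifier would also contact claimedAddr to confirm the address is really in use.)
    static boolean authenticate(InetAddress claimedAddr, int virtualIndex, BigInteger claimedId) throws Exception {
        return expectedId(claimedAddr, virtualIndex).equals(claimedId);
    }
}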
32. CFS: Block vs. Whole-file Storage
- Advantages of block granularity
- Well-suited to serving large, popular files
- The network bandwidth consumed for lookups is small (CFS also hides block lookup latency by pre-fetching blocks)
- Less work to achieve load balance
- Allows client applications a flexible choice of format, so different data structures can coexist
- Advantages of whole-file granularity
- Efficient for serving large, unpopular files
- Decreases the number of messages required to fetch a file
- Lower lookup cost (one lookup per file rather than per block)
33. CFS DHash Layer: Replication and Caching
- A block is stored at the successor of its ID (square)
- The block is replicated at the successor's immediate successors (circles)
- The block is cached at the servers along the lookup path (triangles)
Figure: the placement of block replicas and cached copies; the tick mark is the block's ID, the square is the server storing the block, circles are its immediate successors, and triangles are the servers along the lookup path.
34. CFS DHash Layer: Load Balance
- Motivation: assume every CFS server had one ID; then every server has the same storage burden. Is this what we want? No.
- Every server has different network and storage capacity
- Thus a uniform distribution doesn't produce perfect load balance (due to heterogeneity)
- A solution: virtual servers
- Advantage: allows adaptive configuration according to the server's capacity
- Disadvantage: introduces a potential increase in the number of hops in a Chord lookup
- A quick remedy: allow virtual servers on the same physical server to look up entries in each other's tables
35. CFS DHash Layer: Update and Delete
- CFS allows updates, but only by the publisher of the file
- CFS doesn't support an explicit delete
- Publishers must periodically refresh their blocks if they want CFS to continue to store them
36. CFS: FS Layer (1)
- The file system is read-only as far as clients are concerned
- The file system may be updated by its publisher
- Key idea
- We want integrity and authenticity guarantees on public data while serving many clients
- How?
- Use SFSRO (the SFS read-only file system)
- A self-certifying read-only file system
- File names contain public keys
- Advantages of a read-only FS?
- The distribution infrastructure is independent of the published content
- Avoids any cryptographic operations on servers and keeps the overhead on clients
37. CFS: FS Layer (2)
Figure: a simple CFS file system. The signed root block (identified by a public key) points via H(D) to a directory block D; the directory maps <name, H(inode)> to an inode block F; the inode block points via H(B1) and H(B2) to data blocks B1 and B2.
- The public key is the root block's identifier
- Data blocks and inodes are named by hashes of their contents
- An update involves updating the root block to point to the new data
- The root block includes a timestamp to prevent replay attacks
- The timestamp covers a finite time interval, so indefinite storage needs periodic refresh (a sketch of the client-side checks follows below)
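A client can verify such a file system bottom-up from the root: the root block must carry a valid signature under the public key that names it, and every other block must hash to the pointer stored in its parent. The sketch below illustrates those two checks; the use of SHA-1 and RSA here is an assumption (SFSRO actually uses Rabin, as noted on the next slide), and the block layout is simplified.

import java.security.MessageDigest;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Arrays;

public class SelfCertify {
    // A content block (data, inode, or directory) is authentic if its hash matches
    // the H(...) pointer found in its parent block.
    static boolean verifyContentBlock(byte[] block, byte[] expectedHash) throws Exception {
        byte[] actual = MessageDigest.getInstance("SHA-1").digest(block);
        return Arrays.equals(actual, expectedHash);
    }

    // The root block is authentic if its signature verifies under the public key that
    // names the file system (the client would also check that its timestamp is recent).
    static boolean verifyRootBlock(byte[] rootBlock, byte[] signature, PublicKey fsKey) throws Exception {
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(fsKey);
        verifier.update(rootBlock);
        return verifier.verify(signature);
    }
}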
38. Some Thoughts on Implementation
- Why the Rabin public-key cryptosystem in SFSRO? Why not RSA or DSA?
- Because Rabin offers fast signature verification
- How fast would be considered cheap for digital signature verification?
- Far smaller than a typical network RTT (e.g., 82 µs)
39. CFS: Implementation
- CFS is about 7K lines of C++ (including about 3K lines for Chord)
- Servers communicate over UDP with a C++ RPC package (provided by the SFS toolkit)
- Why not TCP? To avoid the overhead of TCP connection setup
- CFS runs on Linux, OpenBSD, and FreeBSD
40. CFS: Results (1)
- Lookup cost is O(log N)
- Pre-fetching increases speed
- Server selection increases speed
41. CFS: Results (2)
- With only 1 virtual server per real server, some servers store no blocks while others store more than the average
- You can control storage space!
- Load balance: with multiple virtual servers per real server, the sum of the fractions of the ID space that a server's virtual servers are responsible for is much more tightly clustered around the average
42. CFS: Conclusions
- Pros (likes)
- Simplicity
- Aggressive load balancing (via virtual servers)
- The algorithms guarantee data availability with high probability (e.g., tight bounds on lookup cost)
- Cons (dislikes)
- Read-only storage system
- No anonymity
- No (keyword) search feature
43. APPENDIX
44. Appendix: P2P Comparisons (1)
- Tapestry (UCB), Chord (MIT), Pastry (Microsoft), and CAN (AT&T) all provide functionality to route messages to an object
- Disadvantage of CAN and Chord: they route on the shortest overlay hops available
- Advantage of Tapestry and Pastry: they construct locally optimal routing tables from initialization and maintain them, in order to reduce routing stretch
- Advantage of Pastry: it constrains the routing distance per overlay hop to achieve efficiency in point-to-point routing between overlay nodes
45. Appendix: P2P Comparisons (2)
- Advantage of Tapestry: locality-awareness
- The number and location of object replicas are not fixed
- The difference between Pastry and Tapestry is in object location
- While Tapestry helps the user or application locate the nearest copy of an object, Pastry actively replicates the object and places replicas at random locations in the network
- The result is that when a client searches for a nearby object, Tapestry routes through a few hops to the object, while Pastry might require the client to route to a distant replica of the object
46. Appendix: Bloom Filter (1)
- Goal: to support membership queries
- Given a set A = {a1, a2, ..., an} of n elements, the Bloom filter uses hash functions to compute whether a queried message is a member of the set
- Factors: reject time, hash area size, allowable fraction of errors
- Idea: examine only part of a message to recognize it as not matching a test message; reduce the hash area size by allowing errors
47. Appendix: Bloom Filter (2)
- Initially, an m-bit vector v is set to all 0s
- Choose k independent hash functions h1(), ..., hk(), each with range 1..m (e.g., here k = 4)
- For each element a in A, the bits at positions h1(a), h2(a), ..., hk(a) in v are set to 1
- Given a query for b, check the bits at positions h1(b), h2(b), ..., hk(b)
- If any of them is 0, then b is not in the set A
Figure: a Bloom filter with 4 hash functions (see the sketch below).
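A minimal Bloom filter matching this description might look like the sketch below; deriving the k positions from two base hashes (double hashing) is an implementation assumption, not part of the original construction.

import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits in the vector
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th bit position for an element from two base hashes.
    private int position(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Set the k positions for the element to 1.
    public void add(Object element) {
        for (int i = 0; i < k; i++) bits.set(position(element, i));
    }

    // "Maybe present" can be a false positive; "not present" is always correct.
    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) return false;
        }
        return true;
    }
}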