Title: Pond and CFS
1. Pond and CFS
- CS599: Special Topics in OS and Distributed Storage Systems
- Professor Banu Ozden
- Jan 2004
- Ho Chung
2. Table of Contents
- Part 1: Pond
- Overview: OceanStore and Pond
- Pond Architecture
- Techniques: Erasure Codes, Push-based Update, Byzantine Agreement, Proactive Threshold Signature
- Experimental Results
- Part 2: CFS
- Overview and Design Goals
- Chord Layer, DHash Layer, FS Layer
- Experimental Results
3. PART 1: Pond, the OceanStore Prototype
4. OceanStore Overview
- Design goal: persistent storage
- Design criteria
- High durability
- Universal availability
- Privacy
- Data integrity
- Assumptions on infrastructure
- Untrusted (e.g., hosts and routers can fail arbitrarily)
- Dynamic (so the system must be self-organizing and self-repairing, i.e. self-tuning)
- Support for nomadic data (how? promiscuous caching)
5. OceanStore as an Application
Figure: the software stack, from top to bottom:
- New distributed applications (OceanStore)
- Tapestry (routing of messages, location of objects)
- Network (Java NBIO)
- Operating System
6. Pond Overview
Inner ring:
- Serialize concurrent writes
- Enforce access control
- Check updates
7. Pond: Data Model
- Versioning
- Each data object in Pond has versions
- Allows "time travel"
- Each version of an object contains metadata, the actual data, and pointers to the previous version
- The entire stream of versions of a given data object is named by an AGUID
- GUIDs (a hashing sketch follows below)
- BGUID (block): secure hash of a block of data
- VGUID (version): BGUID of the top block
- AGUID (active): hash(application-specified name + owner's public key)
- The mapping from an AGUID to the latest VGUID may change over time
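The GUID scheme above is just a set of hash computations. Below is a minimal sketch of how a BGUID and an AGUID might be derived; SHA-1 and the concatenation order of the AGUID inputs are assumptions for illustration, not Pond's exact encoding.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Guids {
    // Hash a sequence of byte arrays with SHA-1 (the hash choice is an assumption).
    static byte[] sha1(byte[]... parts) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // BGUID: secure hash of a block's contents.
    static byte[] bguid(byte[] blockData) throws NoSuchAlgorithmException {
        return sha1(blockData);
    }

    // AGUID: hash of the application-specified name plus the owner's public key;
    // it names the entire stream of versions of the object.
    static byte[] aguid(String appName, byte[] ownerPublicKey) throws NoSuchAlgorithmException {
        return sha1(appName.getBytes(StandardCharsets.UTF_8), ownerPublicKey);
    }
}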
8. Pond: GUIDs
Figure: the AGUID names the stream of versions; each version's VGUID points to a root block whose backpointer references the previous version (VGUID i+1 points back to VGUID i); indirect blocks and data blocks (d1..d7) are shared between versions via copy-on-write.
9. Pond Architecture (1)
- Virtualization of resources
- Virtualized resources are not tied to hardware
- DOLR (Decentralized Object Location and Routing) interface
- Tapestry virtualizes resources
- An object is addressed with a GUID, not an IP address
- e.g., PublishObject(Object_GUID, App_ID)
- Locality aware
- No restriction on the placement of objects
- Queries find a nearby copy of the object with high probability
10. Pond Architecture (2)
- Replication and consistency
- Each object has a single primary replica
- Heartbeat certificate: (AGUID, VGUID, timestamp, version)
- Enforces access control
- Serializes concurrent updates from multiple users
- Inner ring
- Uses a Byzantine-fault-tolerant protocol to agree on updates to the data object, and digitally signs the object
11. Pond Architecture (3)
- High durability for archival storage
- Motivation: if we create 2 replicas of a data block, we get tolerance of one failure at the cost of an additional 100% of storage. Can we improve on this? Yes.
- Erasure codes are more durable than replication for the same space
- After an update at the primary replica, all newly created blocks are erasure-coded and the fragments are stored
Figure: a block is erasure-coded into fragments (A-H) that are dispersed across servers; a subset of the fragments (e.g., A, D, E, H) suffices to reconstruct it.
12. Pond: Erasure Codes
- A block is divided into m identically-sized fragments, which are then encoded into n fragments, where n > m
- The original block can be reconstructed from any m fragments
- Rate of encoding: r = m/n < 1
- Intuitively, erasure encoding gives higher fault tolerance for the storage used than replication (see the sketch below)
- Disadvantage? Expensive computation
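To make the durability claim concrete, here is a back-of-the-envelope comparison: with independent failures at probability p, two replicas (2x storage) lose a block only if both copies fail, while a 16-of-32 erasure code (also 2x storage) loses a block only if more than 16 of the 32 fragments fail. The parameters and failure model are illustrative assumptions, not figures from the paper.

public class ErasureVsReplication {
    // C(n, k) computed in floating point; adequate for small n.
    static double binomial(int n, int k) {
        double c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    // P(block lost) for an m-of-n erasure code: more than n - m fragments fail.
    static double erasureLoss(int m, int n, double p) {
        double loss = 0;
        for (int failed = n - m + 1; failed <= n; failed++) {
            loss += binomial(n, failed) * Math.pow(p, failed) * Math.pow(1 - p, n - failed);
        }
        return loss;
    }

    // P(block lost) for r full replicas: all replicas fail.
    static double replicationLoss(int r, double p) {
        return Math.pow(p, r);
    }

    public static void main(String[] args) {
        double p = 0.1; // assumed independent per-server failure probability
        System.out.println("2 replicas       (2x storage): " + replicationLoss(2, p));
        System.out.println("16-of-32 erasure (2x storage): " + erasureLoss(16, 32, p));
    }
}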
13. Pond: Caching Data Objects
- Frequently-read objects?
- Use whole-block caching (instead of erasure-coded fragments)
- However, if a whole-block copy does not exist at the local node, the node reads the fragments, decodes them to reconstruct the block, and then caches the block
- Cached blocks are managed with an LRU policy (see the sketch below)
- To read the latest version of a document?
- Use Tapestry to retrieve a heartbeat for the object from its primary replica
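A whole-block cache with LRU eviction can be sketched in a few lines; the key type (a BGUID string) and the fixed capacity below are illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for reconstructed whole blocks, keyed by BGUID.
public class BlockCache extends LinkedHashMap<String, byte[]> {
    private final int capacity;

    public BlockCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > capacity; // evict the least-recently-used block when full
    }
}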
14. Pond: Push-based Update
- Updates?
- Secondary replicas of an object are updated with a push-based scheme
- Every time the primary replica applies an update to create a new version, it sends the corresponding update and heartbeat down to the secondary replicas
15. Pond: Byzantine Agreement
- Example: 4 Byzantine generals, N > 3f + 1
- P1 is the commander; P3 is a faulty general
- 1st round: the commander sends a value to each of the lieutenants
- 2nd round: each of the lieutenants sends the value it received to its peers
- Needs O(N^(f+1)) messages
Figure: P1 sends 1:v to P2, P3, and P4; in the second round P2 and P4 relay 2:1:v and 4:1:v correctly, while the faulty P3 relays inconsistent values 3:1:u and 3:1:w.
16. Pond: Authentication
- Authentication in Byzantine agreement uses hybrid cryptography
- MACs are used for all communication within the inner ring (see the sketch below)
- Public-key cryptography is used to communicate with all other machines
- Secondary replicas can verify the authenticity of data received from other replicas without contacting the inner ring
- e.g., most read traffic can be satisfied by the secondary replicas
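As a rough illustration of the cheap symmetric half of this hybrid scheme, the sketch below authenticates inner-ring messages with an HMAC over a shared key; the algorithm choice (HmacSHA1) and the key-distribution details are assumptions.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class InnerRingMac {
    // Compute a MAC tag that is sent alongside the message.
    static byte[] authenticate(byte[] sharedKey, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
        return mac.doFinal(message);
    }

    // Verify a received tag by recomputing it with the shared key.
    static boolean verify(byte[] sharedKey, byte[] message, byte[] tag) throws Exception {
        byte[] expected = authenticate(sharedKey, message);
        return java.security.MessageDigest.isEqual(expected, tag); // constant-time compare
    }
}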
17. Pond: Proactive Threshold Signature (1)
- Goals
- To support flexibility in choosing the membership of the inner ring
- To replace machines in the inner ring without changing the public key
- PTS pairs a single public key with l private key shares. Each of the l servers uses its key share to generate a signature share, and any k correctly generated signature shares can be combined by any party to produce a full signature, where l = 3f + 1, k = f + 1, and f is the number of faulty hosts tolerated.
18. Pond: Proactive Threshold Signature (2)
Figure: an inner ring with l = 4, k = 2, f = 1 (l = 3f + 1, k = f + 1). One public key PK is paired with private key shares SK1-SK4, each producing a signature share SS1-SS4. When a new node replaces a ring member, the key shares are regenerated.
NOTE: the public key doesn't change!
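The k-of-l property can be illustrated with plain Shamir secret sharing, which underlies threshold schemes: any k shares reconstruct the secret, while fewer reveal nothing. This is only a sketch of the share-and-combine idea, not the proactive threshold signature scheme Pond actually uses.

import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class KofL {
    // Prime field modulus; the size is an illustrative assumption.
    static final BigInteger P = BigInteger.probablePrime(256, new SecureRandom());

    // Split the secret into l shares with threshold k: points on a random degree-(k-1) polynomial.
    static Map<Integer, BigInteger> split(BigInteger secret, int k, int l) {
        SecureRandom rnd = new SecureRandom();
        BigInteger[] coeff = new BigInteger[k];
        coeff[0] = secret;
        for (int i = 1; i < k; i++) coeff[i] = new BigInteger(P.bitLength() - 1, rnd);
        Map<Integer, BigInteger> shares = new HashMap<>();
        for (int x = 1; x <= l; x++) {
            BigInteger y = BigInteger.ZERO;                       // evaluate polynomial at x (Horner)
            for (int i = k - 1; i >= 0; i--) y = y.multiply(BigInteger.valueOf(x)).add(coeff[i]).mod(P);
            shares.put(x, y);
        }
        return shares;
    }

    // Combine any k shares by Lagrange interpolation at x = 0.
    static BigInteger combine(Map<Integer, BigInteger> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (Map.Entry<Integer, BigInteger> si : shares.entrySet()) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (Integer xj : shares.keySet()) {
                if (xj.equals(si.getKey())) continue;
                num = num.multiply(BigInteger.valueOf(-xj)).mod(P);
                den = den.multiply(BigInteger.valueOf(si.getKey() - xj)).mod(P);
            }
            secret = secret.add(si.getValue().multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }
}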
19. Pond: Prototype Implementation
- All major subsystems operational
- Self-organizing Tapestry base
- Primary replicas use Byzantine agreement
- Secondary replicas self-organize into a multicast tree
- Erasure-coding archive
- Staged event-driven software architecture
- Built on SEDA
- 280K lines of Java (J2SE v1.3)
- JNI libraries for cryptography and erasure coding
20. Pond: Deployment on PlanetLab
- http://www.planet-lab.org
- 100 hosts, 40 sites
- Hosts are spread across North America, Europe, Australia, and New Zealand
- Pond runs with up to 1000 virtual nodes
- Using custom Perl scripts
- 5-minute startup
- Gives global scale for free
21. Pond: Results (Latency)
Table 1. Latency breakdown of an update: for small updates, the majority of the time is spent computing the threshold signature share; with larger updates, the time to apply and archive the update dominates the signature time.
Figure 1. Latency to read objects from the archive: the time to read an object increases with the number of blocks that must be retrieved.
22. Pond: Results (Throughput)
Table 2. Throughput in the wide area: the throughput for a distributed ring is limited by the wide-area bandwidth.
Table 3. Results of the Andrew benchmark: OceanStore outperforms NFS by a factor of 4.3 in read-intensive phases, but write performance is worse by as much as a factor of 7.3.
23. Pond: Conclusion
- Likes
- Supports a high degree of consistency (via the Byzantine agreement protocol)
- The idea of using proactive threshold signatures
- Dislikes
- Not suitable for write-sharing
- The idea of a Responsible Party (which chooses the hosts for inner rings)
- Complex! Data privacy, client updates, and durable storage all come with an increase in complexity (e.g., the Byzantine protocol, Plaxton trees, proactive threshold signatures, erasure encoding, etc.)
24. PART 2: Wide-area Cooperative Storage with CFS
CFS: distributed read-only file storage
25. CFS: a Chord-based Distributed File Storage System
Figure: a CFS client stacks three layers (FS, DHash, Chord); CFS servers run only DHash and Chord.
- FS: interprets blocks as files and presents a file-system interface to applications
- DHash: storage layer; storage/retrieval, replication, and caching of data blocks
- Chord: lookup layer; maintains the routing tables used to find blocks
26. CFS: Design Goals
- Efficiency and scalability
- (See Chord's algorithmic performance on the next slide)
- Availability
- Chord allows a client to always retrieve data (assuming the absence of network partitions, etc.)
- Fault tolerance: replication and caching
- Block-level storage: store blocks, not whole files (cf. PAST)
- Block-level caching: cache along the lookup path
- Whole-file caching: only if files are small
- Load balance
- Virtual servers: spread blocks evenly over the available virtual servers (several per physical server)
- Per-publisher quotas
- To avoid malicious injection of large quantities of data (cf. PAST)
- Decentralization
- cf. CDNs (e.g., Akamai) are managed by a central entity
27. CFS: Chord Layer
- Chord is a structured P2P overlay
- Chord maps keys onto a circular, one-dimensional key space
- Given a key, it maps the key onto a node (e.g., lookup(key) -> IP address of node)
- Key idea: keep pointers (fingers) to exponentially spaced places around the key space
- Algorithmic performance
- In an N-node network, each node maintains O(log N) entries in its routing table
- A lookup requires O(log N) messages (a sketch of the finger-table lookup follows below)
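The sketch below shows the finger-table lookup that yields these bounds, in a small 2^8 identifier space; the node structure, recursive style, and ring size are illustrative assumptions rather than the Chord implementation used in CFS.

public class ChordNode {
    static final int M = 8;                       // bits in the identifier space (2^M IDs); an assumption
    final int id;
    final ChordNode[] finger = new ChordNode[M];  // finger[i] = first node succeeding id by at least 2^i
    ChordNode successor;

    ChordNode(int id) { this.id = id; }

    // True if x lies on the ring segment (a, b], wrapping around 2^M.
    static boolean inHalfOpen(int x, int a, int b) {
        return (a < b) ? (x > a && x <= b) : (x > a || x <= b);
    }

    // True if x lies strictly inside the ring segment (a, b), wrapping around 2^M.
    static boolean inOpen(int x, int a, int b) {
        return (a < b) ? (x > a && x < b) : (x > a || x < b);
    }

    // Find the node responsible for key; with correct fingers this takes O(log N) hops.
    ChordNode findSuccessor(int key) {
        if (inHalfOpen(key, id, successor.id)) return successor;
        ChordNode next = closestPrecedingFinger(key);
        if (next == this) return successor;       // no closer finger known, fall back to the successor
        return next.findSuccessor(key);
    }

    // Scan fingers from farthest to nearest for the closest known predecessor of key.
    ChordNode closestPrecedingFinger(int key) {
        for (int i = M - 1; i >= 0; i--) {
            if (finger[i] != null && inOpen(finger[i].id, id, key)) return finger[i];
        }
        return this;
    }
}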
28. Chord: A Simple Lookup Protocol
Fig 2.1. A Chord ring of 10 nodes (N1, N8, N14, N21, N32, N38, N42, N48, N51, N56) storing 5 keys (K10, K24, K30, K38, K54).
Fig 2.2. Node N8 performs lookup(K54) using only the successor list, walking the ring node by node until it reaches K54's successor, N56.
29. Chord: A Fast Lookup Protocol
Fig 2.3. The finger table of node N8, with entries at distances 1, 2, 4, 8, 16, and 32 around the ring.
Fig 2.4. Node N8 performs lookup(K54) using the finger table to accelerate the lookup.
- The i-th entry in the table at node n contains the ID of the first node that succeeds n by at least 2^(i-1) on the Chord ring
30. CFS Chord Layer: Server Selection
- Goal
- Reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
- Cost metric: pick the candidate n_i with the minimum C(n_i)
- Notation
- H(n_i): an estimate of the number of Chord hops that would remain after contacting n_i
- d_i: latency to node n_i as reported by node m (m = the previous hop)
- d-bar: average latency of all the RPCs that this node has ever issued
- log N: an estimate of the number of significant high bits in an ID
- ones(): counts how many bits are set in its argument
- (n_i - id) >> (160 - log N): the significant bits of the ID-space distance between n_i and the target key id
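Putting the notation together, the cost of contacting candidate n_i is its measured latency plus the expected cost of the estimated remaining hops, roughly C(n_i) = d_i + d-bar * H(n_i) with H(n_i) = ones((n_i - id) >> (160 - log N)). The sketch below is reconstructed from these definitions and should be read as an approximation of the paper's metric, not its exact code.

import java.math.BigInteger;

public class ServerSelection {
    // H(n_i): estimated remaining Chord hops after contacting candidate ni.
    static int estimatedHops(BigInteger ni, BigInteger targetId, int logN) {
        BigInteger dist = ni.subtract(targetId).mod(BigInteger.ONE.shiftLeft(160)); // ID-space distance, per the notation above
        return dist.shiftRight(160 - logN).bitCount();                              // ones() of the significant high bits
    }

    // C(n_i): latency to ni plus the expected latency of the remaining hops.
    static double cost(double latencyToNi, double avgRpcLatency, BigInteger ni, BigInteger targetId, int logN) {
        return latencyToNi + avgRpcLatency * estimatedHops(ni, targetId, logN);
    }
}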
31. CFS Chord Layer: Node ID Authentication
- When a new node wants to join, an existing node authenticates it
- Chord ID = SHA-1(node's IP address, virtual node index)
- Check whether the claimed IP address and virtual index hash to the claimed Chord ID
- Why do this? If Chord nodes could use arbitrary IDs, an attacker could destroy chosen data by choosing a node ID just after the data's ID (see the sketch below)
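A minimal sketch of this check, assuming the ID is the SHA-1 of the raw IP address bytes followed by the virtual node index (the exact byte layout is an assumption):

import java.math.BigInteger;
import java.net.InetAddress;
import java.security.MessageDigest;

public class NodeIdCheck {
    // Recompute the Chord ID a node with this address and virtual index should have.
    static BigInteger expectedId(InetAddress addr, int virtualIndex) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(addr.getAddress());                               // node's IP address bytes
        sha1.update(BigInteger.valueOf(virtualIndex).toByteArray());  // virtual node index
        return new BigInteger(1, sha1.digest());                      // 160-bit Chord ID
    }

    // Accept the join only if the claimed ID matches the recomputed one.
    // (The verifier would also contact claimedAddr to confirm the address is really in use.)
    static boolean authenticate(InetAddress claimedAddr, int virtualIndex, BigInteger claimedId) throws Exception {
        return expectedId(claimedAddr, virtualIndex).equals(claimedId);
    }
}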
32. CFS: Block vs. Whole-file Storage
- Advantages of block granularity
- Well-suited to serving large, popular files
- The network bandwidth consumed for lookups is small (CFS also hides block lookup latency by pre-fetching blocks)
- Less work to achieve load balance
- Allows client applications a flexible choice of format, so different data structures can coexist
- Advantages of whole-file granularity
- Efficient for serving large, unpopular files
- Decreases the number of messages required to fetch a file
- Lower lookup cost (one lookup per file rather than per block)
33. CFS DHash Layer: Replication and Caching
- A block is stored at the successor of its ID (square)
- The block is replicated at the successor's immediate successors (circles)
- The block is cached at the servers along the lookup path (triangles)
Figure: the placement of block replicas and cached copies; the tick mark is the block's ID, the square is the server storing the block, circles are its immediate successors, and triangles are the servers along the lookup path.
34. CFS DHash Layer: Load Balance
- Motivation: assume every CFS server had one ID; then every server has the same storage burden. Is this what we want? No.
- Every server has different network and storage capacity
- Thus a uniform distribution doesn't produce perfect load balance (due to heterogeneity)
- A solution: virtual servers
- Advantage: allows adaptive configuration according to the server's capacity
- Disadvantage: introduces a potential increase in the number of hops in a Chord lookup
- A quick remedy: allow virtual servers on the same physical server to look up entries in each other's tables
35. CFS DHash Layer: Update and Delete
- CFS allows updates, but only by the publisher of the file
- CFS doesn't support an explicit delete
- Publishers must periodically refresh their blocks if they want CFS to continue to store them
36. CFS: FS Layer (1)
- The file system is read-only as far as clients are concerned
- The file system may be updated by its publisher
- Key idea
- We want integrity and authenticity guarantees on public data while serving many clients
- How?
- Use SFSRO (the SFS read-only file system)
- A self-certifying read-only file system
- File names contain public keys
- Advantages of a read-only FS?
- The distribution infrastructure is independent of the published content
- Avoids any cryptographic operations on servers and keeps the overhead on clients
37. CFS: FS Layer (2)
Figure: a simple CFS file system. The signed root block (identified by a public key) points via H(D) to a directory block D; the directory maps <name, H(inode)> to an inode block F; the inode block points via H(B1) and H(B2) to data blocks B1 and B2.
- The public key is the root block's identifier
- Data blocks and inodes are named by hashes of their contents
- An update involves updating the root block to point to the new data
- The root block includes a timestamp to prevent replay attacks
- The timestamp covers a finite time interval, so indefinite storage needs periodic refresh (a sketch of the client-side checks follows below)
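A client can verify such a file system bottom-up from the root: the root block must carry a valid signature under the public key that names it, and every other block must hash to the pointer stored in its parent. The sketch below illustrates those two checks; the use of SHA-1 and RSA here is an assumption (SFSRO actually uses Rabin, as noted on the next slide), and the block layout is simplified.

import java.security.MessageDigest;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Arrays;

public class SelfCertify {
    // A content block (data, inode, or directory) is authentic if its hash matches
    // the H(...) pointer found in its parent block.
    static boolean verifyContentBlock(byte[] block, byte[] expectedHash) throws Exception {
        byte[] actual = MessageDigest.getInstance("SHA-1").digest(block);
        return Arrays.equals(actual, expectedHash);
    }

    // The root block is authentic if its signature verifies under the public key that
    // names the file system (the client would also check that its timestamp is recent).
    static boolean verifyRootBlock(byte[] rootBlock, byte[] signature, PublicKey fsKey) throws Exception {
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(fsKey);
        verifier.update(rootBlock);
        return verifier.verify(signature);
    }
}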
38. Some Thoughts on Implementation
- Why the Rabin public-key cryptosystem in SFSRO? Why not RSA or DSA?
- Because Rabin offers fast signature verification
- How fast would be considered cheap for digital signature verification?
- Far smaller than a typical network RTT (e.g., 82 µs)
39. CFS: Implementation
- CFS is about 7K lines of C++ (including about 3K lines for Chord)
- Servers communicate over UDP with a C++ RPC package (provided by the SFS toolkit)
- Why not TCP? To avoid the overhead of TCP connection setup
- CFS runs on Linux, OpenBSD, and FreeBSD
40. CFS: Results (1)
- Lookup cost is O(log N)
- Pre-fetching increases speed
- Server selection increases speed
41. CFS: Results (2)
- With only 1 virtual server per real server, some servers store no blocks while others store more than the average
- You can control storage space!
- Load balance: with multiple virtual servers per real server, the sum of the fractions of the ID space that a server's virtual servers are responsible for is much more tightly clustered around the average
42. CFS: Conclusions
- Pros (likes)
- Simplicity
- Aggressive load balancing (via virtual servers)
- The algorithms guarantee data availability with high probability (e.g., tight bounds on lookup cost)
- Cons (dislikes)
- Read-only storage system
- No anonymity
- No (keyword) search feature
43. APPENDIX
44. Appendix: P2P Comparisons (1)
- Tapestry (UCB), Chord (MIT), Pastry (Microsoft), and CAN (AT&T) all provide functionality to route messages to an object
- Disadvantage of CAN and Chord: they route on the shortest overlay hops available
- Advantage of Tapestry and Pastry: they construct locally optimal routing tables from initialization and maintain them, in order to reduce routing stretch
- Advantage of Pastry: it constrains the routing distance per overlay hop to achieve efficiency in point-to-point routing between overlay nodes
45. Appendix: P2P Comparisons (2)
- Advantage of Tapestry: locality-awareness
- The number and location of object replicas are not fixed
- The difference between Pastry and Tapestry is in object location
- While Tapestry helps the user or application locate the nearest copy of an object, Pastry actively replicates the object and places replicas at random locations in the network
- The result is that when a client searches for a nearby object, Tapestry routes through a few hops to the object, while Pastry might require the client to route to a distant replica of the object
46. Appendix: Bloom Filter (1)
- Goal: to support membership queries
- Given a set A = {a1, a2, ..., an} of n elements, the Bloom filter uses hash functions to compute whether a queried message is a member of the set
- Factors: reject time, hash area size, allowable fraction of errors
- Idea: examine only part of a message to recognize it as not matching a test message; reduce the hash area size by allowing errors
47. Appendix: Bloom Filter (2)
- Initially, an m-bit vector v is set to all 0s
- Choose k independent hash functions h1(), ..., hk(), each with range 1..m (e.g., here k = 4)
- For each element a in A, the bits at positions h1(a), h2(a), ..., hk(a) in v are set to 1
- Given a query for b, check the bits at positions h1(b), h2(b), ..., hk(b)
- If any of them is 0, then b is not in the set A
Figure: a Bloom filter with 4 hash functions (see the sketch below).
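A minimal Bloom filter matching this description might look like the sketch below; deriving the k positions from two base hashes (double hashing) is an implementation assumption, not part of the original construction.

import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits in the vector
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th bit position for an element from two base hashes.
    private int position(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Set the k positions for the element to 1.
    public void add(Object element) {
        for (int i = 0; i < k; i++) bits.set(position(element, i));
    }

    // "Maybe present" can be a false positive; "not present" is always correct.
    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) return false;
        }
        return true;
    }
}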