Title: Peer-to-Peer File Systems
1 - Peer-to-Peer File Systems
- Presented by Serge Kreiker
2 - P2P in the Internet
- Napster: a peer-to-peer file-sharing application
- allows Internet users to exchange files directly
- simple idea, hugely successful
- fastest-growing Web application
- 50 million users in January 2001
- shut down in February 2001
- similar systems/startups followed in rapid succession: Napster, Gnutella, Freenet
3 - Napster
[Diagram: peer 128.1.2.3 registers (xyz.mp3, 128.1.2.3) with the central Napster server]
4 - Napster
[Diagram: a peer asks the central Napster server "xyz.mp3 ?" and is given the address 128.1.2.3]
5 - Napster
[Diagram: the requesting peer fetches xyz.mp3 directly from 128.1.2.3]
6 - Gnutella
[Diagram: Gnutella overlay network of peers, no central server]
7 - Gnutella
[Diagram: the query "xyz.mp3 ?" is sent to neighboring peers]
8 - Gnutella
[Diagram: the query is flooded onward by the neighbors]
9 - Gnutella
[Diagram: a peer holding xyz.mp3 responds]
10 - So far
- Centralized: Napster
- - Table size: O(N)
- - Number of hops: O(1)
- Flooded queries: Gnutella
- - Table size: O(1)
- - Number of hops: O(N)
11 - Storage management system challenges
- Distributed
- Nodes have identical capabilities and responsibilities
- Anonymity
- Storage management: spread the storage burden evenly
- Tolerate unreliable participants
- Robustness: survive massive failures
- Resilience to DoS attacks, censorship, and other node failures
- Cache management: cache additional copies of popular files
12 - Routing challenges
- Efficiency: O(log N) messages per lookup
- N is the total number of servers
- Scalability: O(log N) state per node
- Robustness: survive massive failures
13 - We are going to look at
- PAST (Rice and Microsoft Research; routing substrate: Pastry)
- CFS (MIT; routing substrate: Chord)
14 - What is PAST?
- An archival storage and content-distribution utility
- Not a general-purpose file system
- Stores multiple replicas of files
- Caches additional copies of popular files in the local file system
15 - How it works
- Built over a self-organizing, Internet-based overlay network
- Based on the Pastry routing scheme
- Offers persistent storage services for replicated read-only files
- Owners can insert/reclaim files
- Clients just look up
16 - PAST nodes
- The collection of PAST nodes forms an overlay network
- Minimally, a PAST node is an access point
- Optionally, it contributes storage and participates in routing
17 - PAST operations
- fileId = Insert(name, owner-credentials, k, file)
- file = Lookup(fileId)
- Reclaim(fileId, owner-credentials)
18 - Insertion
- fileId is computed as the secure hash of the file name, the owner's public key, and a salt (sketched below)
- The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits of fileId
- How to map key IDs to node IDs?
- Use Pastry
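To make the fileId construction concrete, here is a minimal Python sketch (not the actual PAST code), assuming SHA-1 as the secure hash; the exact encoding of name, key and salt is illustrative only.

import hashlib

def compute_file_id(name: str, owner_public_key: bytes, salt: bytes) -> int:
    # fileId = secure hash of (file name, owner's public key, salt)
    digest = hashlib.sha1(name.encode() + owner_public_key + salt).digest()
    return int.from_bytes(digest, "big")          # 160-bit fileId

def pastry_key(file_id: int) -> int:
    # PAST uses the 128 most significant bits of the fileId as the Pastry key
    return file_id >> (160 - 128)

The k nodes whose nodeIds are numerically closest to pastry_key(fileId) then store the replicas.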
19 - Insert, contd.
- The required storage is debited against the owner's storage quota
- A file certificate is returned (sketched below)
- Signed with the owner's private key
- Contains fileId, a hash of the content, the replication factor, and more
- The file certificate is routed via Pastry
- Each of the k replica-storing nodes attaches a store receipt
- An ack is sent back after all k nodes have accepted the file
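The certificate fields listed above can be pictured with the following Python sketch; the signing routine is a hypothetical stand-in for whatever the owner's smartcard does, and the field layout is assumed, not taken from the PAST implementation.

import hashlib
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: int
    content_hash: bytes      # hash of the file content
    replication_k: int       # replication factor k
    signature: bytes         # produced with the owner's private key

def make_certificate(file_id: int, content: bytes, k: int, sign) -> FileCertificate:
    # 'sign' is a hypothetical callback, e.g. the owner's smartcard signing function
    content_hash = hashlib.sha1(content).digest()
    payload = file_id.to_bytes(20, "big") + content_hash + k.to_bytes(2, "big")
    return FileCertificate(file_id, content_hash, k, sign(payload))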
20 - Insert file with fileId = 117, k = 4
21 - Lookup / Reclaim
- Lookup: Pastry locates a near node that has a copy and retrieves it
- Reclaim: weak consistency
- After a reclaim, a lookup is no longer guaranteed to retrieve the file
- But it is not guaranteed that the file is no longer available
22 - Pastry: peer-to-peer routing
- Provides generic, scalable indexing, data location and routing
- Inspired by Plaxton's algorithm (used in web content distribution, e.g. Akamai) and Landmark hierarchy routing
- Goals
- Efficiency
- Scalability
- Fault Resilience
- Self-organization (completely decentralized)
23 - Pastry: how it works
- Each node has a unique nodeId.
- Each message has a key.
- Both are uniformly distributed and lie in the same namespace.
- A Pastry node routes a message to the node whose nodeId is closest to the key.
- The number of routing steps is O(log N).
- Pastry takes network locality into account.
- PAST uses the fileId as the key and stores the file on the k closest nodes.
24 - Pastry: nodeId space
- Each node is assigned a 128-bit node identifier, its nodeId.
- The nodeId is assigned randomly when joining the system, e.g. using a SHA-1 hash of the node's IP address or public key (sketched below).
- Nodes with adjacent nodeIds are diverse in geography, ownership, network attachment, etc.
- nodeIds and keys are written in base 2^b; b is a configuration parameter with a typical value of 4.
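A small Python sketch of one of the assignment options mentioned above (SHA-1 of the node's IP address), together with the base-2^b digit view of a nodeId; the constants follow the slide's typical values and are assumptions, not Pastry's actual code.

import hashlib

B = 4          # bits per digit, the typical value of b on the slide
ID_BITS = 128  # Pastry nodeIds are 128 bits long

def node_id_from_ip(ip: str) -> int:
    # take the 128 most significant bits of SHA-1(ip) as the nodeId
    return int.from_bytes(hashlib.sha1(ip.encode()).digest(), "big") >> (160 - ID_BITS)

def digits(node_id: int) -> list:
    # the nodeId viewed as 128/b base-2^b digits, most significant digit first
    return [(node_id >> (ID_BITS - B * (i + 1))) & ((1 << B) - 1)
            for i in range(ID_BITS // B)]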
25 - Pastry: nodeId space
[Diagram: circular namespace of 128-bit nodeIds (max. 2^128 nodes) running from 0 to 2^128 - 1; a nodeId is a sequence of L = 128/b base-2^b digits, b bits per level]
26 - Pastry: node state (1)
- Each node maintains a routing table R, a neighborhood set M, and a leaf set L.
- The routing table is organized into ⌈log_{2^b} N⌉ rows with 2^b - 1 entries each.
- Each entry in row n contains the IP address of a close node whose nodeId matches the present node's nodeId in the first n digits and differs in digit n+1.
- Choice of b: a trade-off between the size of the routing table and the length of routes.
27 - Pastry: node state (2)
- Neighborhood set: nodeIds and IP addresses of the |M| nodes nearest to the present node in terms of network proximity
- Leaf set: the set of the |L| nodes with nodeIds closest to the current node
- L is divided in two: the |L|/2 closest larger nodeIds and the |L|/2 closest smaller nodeIds
- Typical values for |L| and |M| are 2^b
28 - Example: nodeId = 10233102, b = 2, nodeIds are 16 bits. All numbers are in base 4.
29 - Pastry: routing requests
- Route(my-id, key-id, message)
- - if key-id is in the range of my leaf set:
- - - forward to the numerically closest node in the leaf set
- - else if the routing table has a node-id that shares a longer prefix with key-id than my-id does:
- - - forward to that node
- - else:
- - - forward to a node-id that shares a prefix with key-id of the same length as my-id but is numerically closer to key-id
Routing takes O(log N) messages. (A runnable sketch follows below.)
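The routing rule above can be written as a small, self-contained Python sketch. It is a simplification: nodeIds are plain integers, numeric distance ignores the circular wrap-around, and missing routing-table entries are handled only by the fallback case.

B, ID_BITS = 4, 128

def shared_prefix_len(a: int, b: int) -> int:
    # number of leading base-2^b digits that a and b have in common
    for i in range(ID_BITS // B):
        shift = ID_BITS - B * (i + 1)
        if (a >> shift) & ((1 << B) - 1) != (b >> shift) & ((1 << B) - 1):
            return i
    return ID_BITS // B

def next_hop(my_id, key, leaf_set, routing_table, all_known):
    if my_id == key:
        return my_id
    # 1) key within the leaf-set range: deliver to the numerically closest node
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set + [my_id], key=lambda n: abs(n - key))
    # 2) routing-table entry sharing a longer prefix with the key
    l = shared_prefix_len(my_id, key)
    digit = (key >> (ID_BITS - B * (l + 1))) & ((1 << B) - 1)
    entry = routing_table.get((l, digit))   # row l, column = (l+1)-th digit of the key
    if entry is not None:
        return entry
    # 3) rare case: any known node with an equally long prefix but numerically closer
    closer = [n for n in all_known
              if shared_prefix_len(n, key) >= l and abs(n - key) < abs(my_id - key)]
    return min(closer, key=lambda n: abs(n - key)) if closer else my_id

Here routing_table is assumed to be a dict keyed by (row, digit); the real Pastry table also records IP addresses and proximity information.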
30 - Routing example: b = 2, l = 4, key = 1230
31 - Pastry: node addition
- X: the joining node
- A: a node near X (network proximity)
- Z: the node with nodeId numerically closest to X
- State of X:
- - leaf-set(X) = leaf-set(Z)
- - neighborhood-set(X) = neighborhood-set(A)
- - routing table of X, row i = routing table of N_i, row i, where N_i is the i-th node encountered along the route from A to Z (sketched below)
- X notifies all nodes in leaf-set(X), which update their state.
[Diagram: join example with nodes 10 (A), N1, N2, N36, 240 and 210 (Z); a Lookup(216) message is routed from A toward Z]
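A rough Python sketch of the state initialization described on this slide; PastryNode and its fields are hypothetical stand-ins for the real node state, and the RPCs that actually transfer the tables are omitted.

from dataclasses import dataclass, field

@dataclass
class PastryNode:                                     # hypothetical, minimal node state
    node_id: int
    leaf_set: list = field(default_factory=list)
    neighborhood_set: list = field(default_factory=list)
    routing_rows: dict = field(default_factory=dict)  # row index -> {digit: node_id}

def initialize_join_state(x: PastryNode, a: PastryNode, z: PastryNode,
                          route: list) -> None:
    x.leaf_set = list(z.leaf_set)                     # leaf set comes from Z (numerically closest)
    x.neighborhood_set = list(a.neighborhood_set)     # neighborhood set comes from A (nearby)
    for i, n_i in enumerate(route):                   # row i comes from the i-th node on the route A -> Z
        x.routing_rows[i] = dict(n_i.routing_rows.get(i, {}))
    # X would then notify every node in its leaf set so they can add X to their own state.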
32 - X joins the system, first stage
- Route a join message with key = X
33 - Pastry: node failures and recovery
- Rely on a soft-state protocol to deal with node failures
- Nodes neighboring in the nodeId space periodically exchange keep-alive messages
- Nodes unresponsive for a period T are removed from leaf sets
- A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
- Randomized routing deals with malicious nodes that can cause repeated query failures
34 - Security
- Each PAST node and each user of the system holds a smartcard
- A private/public key pair is associated with each card
- Smartcards generate and verify certificates and maintain storage quotas
35 - More on security
- Smartcards ensure the integrity of nodeId and fileId assignments
- Store receipts prevent malicious nodes from creating fewer than k copies
- File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content, and to enforce storage quotas
36 - Storage management
- Based on local coordination among nodes with nearby nodeIds
- Responsibilities:
- - Balance the free storage among nodes
- - Maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId
37 - Causes of storage imbalance, and solutions
- The number of files assigned to each node may vary
- The sizes of the inserted files may vary
- The storage capacities of PAST nodes differ
- Solutions:
- - Replica diversion
- - File diversion
38 - Replica diversion
- Recall: each node maintains a leaf set, the l nodes with nodeIds numerically closest to the given node
- If a node A cannot accommodate a copy locally, it considers replica diversion
- A chooses a node B in its leaf set and asks it to store the replica
- A then enters a pointer to B's copy in its table and issues a store receipt
39 - Policies for accepting a replica
- If (file size / remaining free storage) > t: reject (sketched below)
- t is a fixed threshold
- t has different values for primary replicas (nodes among the k numerically closest) and diverted replicas (nodes in the same leaf set, but not among the k closest)
- t(primary) > t(diverted)
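A minimal sketch of the acceptance test, with illustrative threshold values (the actual values are tuned; only the relation t(primary) > t(diverted) is taken from the slide).

def accept_replica(file_size: int, free_storage: int, is_primary: bool,
                   t_primary: float = 0.1, t_diverted: float = 0.05) -> bool:
    # reject if the file would consume too large a fraction of the remaining space
    if free_storage <= 0:
        return False
    threshold = t_primary if is_primary else t_diverted
    return (file_size / free_storage) <= threshold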
40 - File diversion
- When one of the k nodes declines to store a replica, replica diversion is tried
- If the node chosen for the diverted replica also declines, the entire file is diverted
- A negative ack is sent; the client generates another fileId and starts again
- After 3 rejections, the failure is reported to the user
41 - Maintaining replicas
- Pastry uses keep-alive messages and adjusts the leaf set after failures
- The same adjustment takes place at join
- What happens to the copies stored by a failed node?
- What about the copies stored by a node that leaves or enters a new leaf set?
42 - Maintaining replicas, contd.
- To maintain the invariant (k copies), the replicas have to be re-created in the previous cases
- Big overhead
- Proposed solution for join: lazy re-creation
- First insert a pointer to the node that holds them, then migrate them gradually
43 - Caching
- The k replicas are maintained in PAST for availability
- The fetch distance is measured in overlay-network hops (which says little about distance in the underlying network)
- Caching is used to improve performance
44 - Caching, contd.
- PAST nodes use the unused portion of their advertised disk space to cache files
- When storing a new primary or diverted replica, a node evicts one or more cached copies
- How it works: a file that is routed through a node by Pastry (insert or lookup) is inserted into the local cache if its size is less than a fraction c of the current cache size (sketched below)
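A minimal sketch of this admission rule. The eviction policy shown is plain LRU as a stand-in, and treating the fixed capacity as the "current cache size" is a simplification.

from collections import OrderedDict

class UnusedSpaceCache:
    def __init__(self, capacity_bytes: int, c: float = 0.05):
        self.capacity, self.c = capacity_bytes, c
        self.files = OrderedDict()     # file_id -> size, kept in LRU order
        self.used = 0

    def maybe_insert(self, file_id, size) -> bool:
        if size >= self.c * self.capacity:      # too big relative to the cache: do not cache
            return False
        while self.used + size > self.capacity and self.files:
            _, evicted_size = self.files.popitem(last=False)   # evict least recently used
            self.used -= evicted_size
        self.files[file_id] = size
        self.used += size
        return True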
45 - Evaluation
- PAST implemented in Java
- Network emulation using the Java VM
- 2 workloads (based on NLANR traces) for file sizes
- 4 normal distributions of node storage sizes
46 - Key results
- Storage
- - Replica and file diversion improve global storage utilization from 60.8% to 98% compared to running without them
- - Insertion failures drop to
- - Caveat: the storage capacities used in the experiments are 1000x below what might be expected in practice
- Caching
- - Routing hops with caching are lower than without caching, even at 99% storage utilization
- - Caveat: median file sizes are very low; caching performance will likely degrade if they are higher
47 - CFS: introduction
- Peer-to-peer read-only storage system
- Decentralized architecture focusing mainly on
- efficiency of data access
- robustness
- load balance
- scalability
- Provides a distributed hash table for block storage
- Uses Chord to map keys to nodes
- Does not provide
- anonymity
- strong protection against malicious participants
- The focus is on providing an efficient and robust lookup and storage layer with simple algorithms
48 - CFS software structure
[Diagram: a CFS client stack (FS / DHash / Chord) and CFS server stacks (DHash / Chord); the client FS uses a local API to its DHash layer, and the DHash/Chord layers talk to the servers through the RPC API]
49 - CFS layer functionalities
- The client file system uses the DHash layer to retrieve blocks
- The server DHash layer and the client DHash layer use the Chord layer to locate the servers that hold the desired blocks
- The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
- The Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block
50 - Fetching a file
- The client identifies the root block using a public key generated by the publisher
- It uses the public key as the root-block identifier to fetch the root block, and checks the validity of the block using the signature
- The file's inode key is obtained by the usual search through directory blocks; these contain the keys of the file inode blocks, which are used to fetch the inode blocks
- The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks (the whole sequence is sketched below)
51 - CFS properties
- Decentralized control: no administrative relationship between servers and publishers
- Scalability: a lookup uses space and messages at most logarithmic in the number of servers
- Availability: a client can retrieve data as long as at least one replica is reachable using the underlying network
- Load balance: for large files, achieved by spreading blocks over a number of servers; for small files, blocks are cached at the servers involved in the lookup
- Persistence: once data is inserted, it is available for the agreed-upon interval
- Quotas: implemented by limiting the amount of data inserted by any particular IP address
- Efficiency: the delay of file fetches is comparable with FTP thanks to efficient lookup, pre-fetching, caching and server selection
52 - Chord
- Consistent hashing
- - maps a node's IP address + virtual host number into an m-bit node identifier
- - maps block keys into the same m-bit identifier space
- The node responsible for a key is the successor of the key's id, with wrap-around in the m-bit identifier space (sketched below)
- Consistent hashing balances the keys so that all nodes share equal load with high probability, with minimal movement of keys as nodes enter and leave the network
- For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability
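A minimal consistent-hashing sketch in the spirit of this slide: nodes and keys share one m-bit circular identifier space and a key belongs to its successor. SHA-1 is assumed as the hash.

import bisect, hashlib

M = 160                                          # identifier bits (SHA-1 output size)

def chord_id(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")

def successor(key_id: int, node_ids: list) -> int:
    # the node responsible for key_id is the first node clockwise from it (wrap-around)
    ring = sorted(node_ids)
    i = bisect.bisect_left(ring, key_id % (1 << M))
    return ring[i] if i < len(ring) else ring[0]

# e.g. successor(chord_id("block-42"), [chord_id(f"10.0.0.{i}:0") for i in range(8)])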
53 - Chord details
- Two data structures are used for performing lookups:
- - Successor list: maintains the next r successors of the node. The successor list can be used to traverse the nodes and find the node responsible for the data in O(N) time.
- - Finger table: the i-th entry contains the identity of the first node that succeeds n by at least 2^(i-1) on the ID circle.
- Lookup pseudocode (a runnable sketch follows below):
- - find the id's predecessor; its successor is the node responsible for the key
- - to find the predecessor, check whether the key lies between the node's id and its successor's id; otherwise, using the finger table and successor list, find the node that is the closest predecessor of the id and repeat this step
- - since finger table entries point to nodes at power-of-two intervals around the ID ring, each iteration of the above step halves the distance between the current node and the predecessor
54 - Finger i points to successor(n + 2^i)
[Diagram: the finger intervals of node N80 span 1/2, 1/4, 1/8, 1/16, 1/32, 1/64 and 1/128 of the identifier ring; N120 and 112 appear as example targets]
55 - Chord: node join / failure
- Chord tries to preserve two invariants:
- - Each node's successor is correctly maintained.
- - For every key k, node successor(k) is responsible for k.
- To preserve these invariants, when a node n joins the network:
- - Initialize the predecessor, successors and finger table of node n
- - Update the existing finger tables of other nodes to reflect the addition of n
- - Notify the higher-layer software so that state can be transferred
- For concurrent operations and failures, each Chord node periodically runs a stabilization algorithm to update the finger tables and successor lists to reflect the addition or failure of nodes (sketched below).
- If lookups fail during the stabilization process, the higher layer can look up again. Chord guarantees that the stabilization algorithm results in a consistent ring.
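A rough sketch of the periodic stabilization step, using the same dict-shaped nodes and the in_interval helper from the lookup sketch above; timers, failure detection and finger-table refresh are omitted.

def stabilize(node):
    # ask our successor for its predecessor and adopt it if it lies between us
    x = node["successor"].get("predecessor")
    if (x is not None and x["id"] != node["successor"]["id"]
            and in_interval(x["id"], node["id"], node["successor"]["id"])):
        node["successor"] = x
    notify(node["successor"], node)

def notify(successor, candidate):
    # the successor adopts 'candidate' as predecessor if it is closer than the current one
    pred = successor.get("predecessor")
    if pred is None or in_interval(candidate["id"], pred["id"], successor["id"]):
        successor["predecessor"] = candidate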
56 - Chord: server selection
- Added to Chord as part of the CFS implementation
- Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
- Latencies are measured during finger-table creation, so no extra measurements are necessary
- This works well only if latency is roughly transitive: low latency from a to b and from b to c implies low latency between a and c
- Measurements suggest this is true ("A case study of server selection", Master's thesis)
57 - CFS: nodeId authentication
- An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data
- To prevent this, when a new node joins the system, existing nodes check (sketched below):
- - whether hash(node IP + virtual number) is the same as the professed node ID
- - a random nonce sent to the claimed IP, to check for IP spoofing
- To succeed, the attacker would have to control a large number of machines so that he can target blocks of the same file (which are randomly distributed over multiple servers)
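A small sketch of the two checks, assuming SHA-1 over "ip:virtual_number" (the exact encoding is an assumption) and a hypothetical send_nonce_and_wait RPC that returns True only if the claimed IP echoes the nonce back.

import hashlib, os

def verify_node_id(claimed_id: int, ip: str, virtual_num: int, send_nonce_and_wait) -> bool:
    expected = int.from_bytes(hashlib.sha1(f"{ip}:{virtual_num}".encode()).digest(), "big")
    if claimed_id != expected:       # the nodeId must be the hash of (IP, virtual number)
        return False
    nonce = os.urandom(16)           # challenge the claimed IP to rule out spoofing
    return send_nonce_and_wait(ip, nonce)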
58 - CFS: DHash layer
- Provides a distributed hash table for block storage
- Reflects a key CFS design decision: split each file into blocks and randomly distribute the blocks over many servers
- This provides good load distribution for large files
- The disadvantage is that lookup cost increases, since a lookup is executed for each block; the lookup cost is nevertheless small compared to the much higher cost of block fetches
- Also supports pre-fetching of blocks to reduce user-perceived latency
- Supports replication, caching, quotas, and updates of blocks
59 - CFS: replication
- Replicates each block on k servers to increase availability
- Places the replicas at the k servers that are the immediate successors of the node responsible for the key
- These servers are easily found from the successor list (r >= k)
- Provides fault tolerance: when the successor fails, the next server can serve the block
- Since successor nodes are in general not physically close to each other (the node id is a hash of IP address + virtual number), this provides robustness against the failure of multiple servers located on the same network
- The client can fetch the block from any of the k servers; latency can be used as the deciding factor. This also has the side effect of spreading load across multiple servers. It works under the assumption that proximity in the underlying network is transitive.
60 - CFS: caching
- DHash implements caching to avoid overloading servers that hold popular data
- Caching is based on the observation that as a lookup proceeds toward the desired key, the distance travelled across the key space with each hop decreases. This implies that, with high probability, the nodes just before the key are involved in a large number of lookups for the same block. So when the client fetches the block from the successor node, it also caches it at the servers that were involved in the lookup.
- The cache replacement policy is LRU. Blocks cached on servers far from the key are evicted sooner, since not many lookups touch those servers; blocks cached on servers close to the key stay in the cache as long as they are referenced.
61 - CFS: implementation
- Implemented in 7,000 lines of C++ code, including 3,000 lines of Chord
- User-level programs communicate over UDP with RPC primitives provided by the SFS toolkit
- The Chord library maintains the successor lists and the finger tables. For multiple virtual servers on the same physical server, the routing tables are shared for efficiency.
- Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency
- The client FS implementation exports an ordinary Unix-like file system. The client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers.
62 - CFS: experimental results
- Two sets of tests
- To test real-world, client-perceived performance, the first test explores performance on a subset of 12 machines of the RON testbed
- - A 1-megabyte file is split into 8 KB blocks
- - All machines download the file one at a time
- - Download speed is measured with and without server selection
- The second test is a controlled test in which a number of servers run on the same physical machine and use the local loopback interface for communication. In this test, robustness, scalability, load balancing, etc. of CFS are studied.
63-72 - (Result graphs; no transcript)
73 - Future research
- Support keyword search
- - by adopting an existing centralized search engine (like Napster)
- - or by using a distributed set of index files stored in CFS
- Improve security against malicious participants
- - Malicious nodes can form a consistent internal ring, route all lookups to nodes internal to the ring, and then deny the existence of the data
- - Content hashes help guard against block substitution
- - Future versions will add periodic routing-table consistency checks by randomly selected nodes to try to detect malicious participants
- Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time
74 - Conclusions
- PAST (Pastry) and CFS (Chord) represent peer-to-peer routing and location schemes for storage
- The ideas are almost the same in all of them
- CFS load management is less complex
- Questions raised at SOSP about them:
- - Is there any real application for them?
- - Who will trust these infrastructures to store his/her files?