Title: Peer-to-Peer File Systems
1 - Peer-to-Peer File Systems
- Presented by Serge Kreiker
2 - P2P in the Internet
- Napster: a peer-to-peer file-sharing application
- allows Internet users to exchange files directly
- simple idea, hugely successful
- fastest-growing Web application
- 50 million users in January 2001
- shut down in February 2001
- similar systems/startups followed in rapid succession: Napster, Gnutella, Freenet
3 - Napster
[Diagram: peer 128.1.2.3 registers (xyz.mp3, 128.1.2.3) with the central Napster server]
4 - Napster
[Diagram: a peer asks the central Napster server "xyz.mp3 ?" and is given the address 128.1.2.3]
5 - Napster
[Diagram: the requesting peer fetches xyz.mp3 directly from 128.1.2.3]
6 - Gnutella
[Diagram: Gnutella overlay network of peers, no central server]
7 - Gnutella
[Diagram: the query "xyz.mp3 ?" is sent to neighboring peers]
8 - Gnutella
[Diagram: the query is flooded onward by the neighbors]
9 - Gnutella
[Diagram: a peer holding xyz.mp3 responds]
10 - So far
- Centralized: Napster
- - Table size: O(N)
- - Number of hops: O(1)
- Flooded queries: Gnutella
- - Table size: O(1)
- - Number of hops: O(N)
11 - Storage management system challenges
- Distributed
- Nodes have identical capabilities and responsibilities
- Anonymity
- Storage management: spread the storage burden evenly
- Tolerate unreliable participants
- Robustness: survive massive failures
- Resilience to DoS attacks, censorship, and other node failures
- Cache management: cache additional copies of popular files
12 - Routing challenges
- Efficiency: O(log N) messages per lookup
- N is the total number of servers
- Scalability: O(log N) state per node
- Robustness: survive massive failures
13 - We are going to look at
- PAST (Rice and Microsoft Research; routing substrate: Pastry)
- CFS (MIT; routing substrate: Chord)
14 - What is PAST?
- An archival storage and content-distribution utility
- Not a general-purpose file system
- Stores multiple replicas of files
- Caches additional copies of popular files in the local file system
15 - How it works
- Built over a self-organizing, Internet-based overlay network
- Based on the Pastry routing scheme
- Offers persistent storage services for replicated read-only files
- Owners can insert/reclaim files
- Clients just look up
16 - PAST nodes
- The collection of PAST nodes forms an overlay network
- Minimally, a PAST node is an access point
- Optionally, it contributes storage and participates in routing
17 - PAST operations
- fileId = Insert(name, owner-credentials, k, file)
- file = Lookup(fileId)
- Reclaim(fileId, owner-credentials)
18 - Insertion
- fileId is computed as the secure hash of the file name, the owner's public key, and a salt (sketched below)
- The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits of fileId
- How to map key IDs to node IDs?
- Use Pastry
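To make the fileId construction concrete, here is a minimal Python sketch (not the actual PAST code), assuming SHA-1 as the secure hash; the exact encoding of name, key and salt is illustrative only.

import hashlib

def compute_file_id(name: str, owner_public_key: bytes, salt: bytes) -> int:
    # fileId = secure hash of (file name, owner's public key, salt)
    digest = hashlib.sha1(name.encode() + owner_public_key + salt).digest()
    return int.from_bytes(digest, "big")          # 160-bit fileId

def pastry_key(file_id: int) -> int:
    # PAST uses the 128 most significant bits of the fileId as the Pastry key
    return file_id >> (160 - 128)

The k nodes whose nodeIds are numerically closest to pastry_key(fileId) then store the replicas.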
19 - Insert, contd.
- The required storage is debited against the owner's storage quota
- A file certificate is returned (sketched below)
- Signed with the owner's private key
- Contains fileId, a hash of the content, the replication factor, and more
- The file certificate is routed via Pastry
- Each of the k replica-storing nodes attaches a store receipt
- An ack is sent back after all k nodes have accepted the file
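The certificate fields listed above can be pictured with the following Python sketch; the signing routine is a hypothetical stand-in for whatever the owner's smartcard does, and the field layout is assumed, not taken from the PAST implementation.

import hashlib
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: int
    content_hash: bytes      # hash of the file content
    replication_k: int       # replication factor k
    signature: bytes         # produced with the owner's private key

def make_certificate(file_id: int, content: bytes, k: int, sign) -> FileCertificate:
    # 'sign' is a hypothetical callback, e.g. the owner's smartcard signing function
    content_hash = hashlib.sha1(content).digest()
    payload = file_id.to_bytes(20, "big") + content_hash + k.to_bytes(2, "big")
    return FileCertificate(file_id, content_hash, k, sign(payload))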
20 - Insert file with fileId = 117, k = 4
21 - Lookup / Reclaim
- Lookup: Pastry locates a near node that has a copy and retrieves it
- Reclaim: weak consistency
- After a reclaim, a lookup is no longer guaranteed to retrieve the file
- But it is not guaranteed that the file is no longer available
22 - Pastry: peer-to-peer routing
- Provides generic, scalable indexing, data location and routing
- Inspired by Plaxton's algorithm (used in web content distribution, e.g. Akamai) and Landmark hierarchy routing
- Goals
- Efficiency
- Scalability
- Fault Resilience
- Self-organization (completely decentralized)
23 - Pastry: how it works
- Each node has a unique nodeId.
- Each message has a key.
- Both are uniformly distributed and lie in the same namespace.
- A Pastry node routes a message to the node whose nodeId is closest to the key.
- The number of routing steps is O(log N).
- Pastry takes network locality into account.
- PAST uses the fileId as the key and stores the file on the k closest nodes.
24 - Pastry: nodeId space
- Each node is assigned a 128-bit node identifier, its nodeId.
- The nodeId is assigned randomly when joining the system, e.g. using a SHA-1 hash of the node's IP address or public key (sketched below).
- Nodes with adjacent nodeIds are diverse in geography, ownership, network attachment, etc.
- nodeIds and keys are written in base 2^b; b is a configuration parameter with a typical value of 4.
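A small Python sketch of one of the assignment options mentioned above (SHA-1 of the node's IP address), together with the base-2^b digit view of a nodeId; the constants follow the slide's typical values and are assumptions, not Pastry's actual code.

import hashlib

B = 4          # bits per digit, the typical value of b on the slide
ID_BITS = 128  # Pastry nodeIds are 128 bits long

def node_id_from_ip(ip: str) -> int:
    # take the 128 most significant bits of SHA-1(ip) as the nodeId
    return int.from_bytes(hashlib.sha1(ip.encode()).digest(), "big") >> (160 - ID_BITS)

def digits(node_id: int) -> list:
    # the nodeId viewed as 128/b base-2^b digits, most significant digit first
    return [(node_id >> (ID_BITS - B * (i + 1))) & ((1 << B) - 1)
            for i in range(ID_BITS // B)]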
25 - Pastry: nodeId space
[Diagram: circular namespace of 128-bit nodeIds (max. 2^128 nodes) running from 0 to 2^128 - 1; a nodeId is a sequence of L = 128/b base-2^b digits, b bits per level]
26 - Pastry: node state (1)
- Each node maintains a routing table R, a neighborhood set M, and a leaf set L.
- The routing table is organized into ⌈log_{2^b} N⌉ rows with 2^b - 1 entries each.
- Each entry in row n contains the IP address of a close node whose nodeId matches the present node's nodeId in the first n digits and differs in digit n+1.
- Choice of b: a trade-off between the size of the routing table and the length of routes.
27 - Pastry: node state (2)
- Neighborhood set: nodeIds and IP addresses of the |M| nodes nearest to the present node in terms of network proximity
- Leaf set: the set of the |L| nodes with nodeIds closest to the current node
- L is divided in two: the |L|/2 closest larger nodeIds and the |L|/2 closest smaller nodeIds
- Typical values for |L| and |M| are 2^b
28 - Example: nodeId = 10233102, b = 2, nodeIds are 16 bits. All numbers are in base 4.
29 - Pastry: routing requests
- Route(my-id, key-id, message)
- - if key-id is in the range of my leaf set:
- - - forward to the numerically closest node in the leaf set
- - else if the routing table has a node-id that shares a longer prefix with key-id than my-id does:
- - - forward to that node
- - else:
- - - forward to a node-id that shares a prefix with key-id of the same length as my-id but is numerically closer to key-id
Routing takes O(log N) messages. (A runnable sketch follows below.)
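The routing rule above can be written as a small, self-contained Python sketch. It is a simplification: nodeIds are plain integers, numeric distance ignores the circular wrap-around, and missing routing-table entries are handled only by the fallback case.

B, ID_BITS = 4, 128

def shared_prefix_len(a: int, b: int) -> int:
    # number of leading base-2^b digits that a and b have in common
    for i in range(ID_BITS // B):
        shift = ID_BITS - B * (i + 1)
        if (a >> shift) & ((1 << B) - 1) != (b >> shift) & ((1 << B) - 1):
            return i
    return ID_BITS // B

def next_hop(my_id, key, leaf_set, routing_table, all_known):
    if my_id == key:
        return my_id
    # 1) key within the leaf-set range: deliver to the numerically closest node
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set + [my_id], key=lambda n: abs(n - key))
    # 2) routing-table entry sharing a longer prefix with the key
    l = shared_prefix_len(my_id, key)
    digit = (key >> (ID_BITS - B * (l + 1))) & ((1 << B) - 1)
    entry = routing_table.get((l, digit))   # row l, column = (l+1)-th digit of the key
    if entry is not None:
        return entry
    # 3) rare case: any known node with an equally long prefix but numerically closer
    closer = [n for n in all_known
              if shared_prefix_len(n, key) >= l and abs(n - key) < abs(my_id - key)]
    return min(closer, key=lambda n: abs(n - key)) if closer else my_id

Here routing_table is assumed to be a dict keyed by (row, digit); the real Pastry table also records IP addresses and proximity information.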
30 - Routing example: b = 2, l = 4, key = 1230
31 - Pastry: node addition
- X: the joining node
- A: a node near X (network proximity)
- Z: the node with nodeId numerically closest to X
- State of X:
- - leaf-set(X) = leaf-set(Z)
- - neighborhood-set(X) = neighborhood-set(A)
- - routing table of X, row i = routing table of N_i, row i, where N_i is the i-th node encountered along the route from A to Z (sketched below)
- X notifies all nodes in leaf-set(X), which update their state.
[Diagram: join example with nodes 10 (A), N1, N2, N36, 240 and 210 (Z); a Lookup(216) message is routed from A toward Z]
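A rough Python sketch of the state initialization described on this slide; PastryNode and its fields are hypothetical stand-ins for the real node state, and the RPCs that actually transfer the tables are omitted.

from dataclasses import dataclass, field

@dataclass
class PastryNode:                                     # hypothetical, minimal node state
    node_id: int
    leaf_set: list = field(default_factory=list)
    neighborhood_set: list = field(default_factory=list)
    routing_rows: dict = field(default_factory=dict)  # row index -> {digit: node_id}

def initialize_join_state(x: PastryNode, a: PastryNode, z: PastryNode,
                          route: list) -> None:
    x.leaf_set = list(z.leaf_set)                     # leaf set comes from Z (numerically closest)
    x.neighborhood_set = list(a.neighborhood_set)     # neighborhood set comes from A (nearby)
    for i, n_i in enumerate(route):                   # row i comes from the i-th node on the route A -> Z
        x.routing_rows[i] = dict(n_i.routing_rows.get(i, {}))
    # X would then notify every node in its leaf set so they can add X to their own state.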
32 - X joins the system, first stage
- Route a join message with key = X
33 - Pastry: node failures and recovery
- Rely on a soft-state protocol to deal with node failures
- Nodes neighboring in the nodeId space periodically exchange keep-alive messages
- Nodes unresponsive for a period T are removed from leaf sets
- A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
- Randomized routing deals with malicious nodes that can cause repeated query failures
34 - Security
- Each PAST node and each user of the system holds a smartcard
- A private/public key pair is associated with each card
- Smartcards generate and verify certificates and maintain storage quotas
35 - More on security
- Smartcards ensure the integrity of nodeId and fileId assignments
- Store receipts prevent malicious nodes from creating fewer than k copies
- File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content, and to enforce storage quotas
36 - Storage management
- Based on local coordination among nodes with nearby nodeIds
- Responsibilities:
- - Balance the free storage among nodes
- - Maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId
37 - Causes of storage imbalance, and solutions
- The number of files assigned to each node may vary
- The sizes of the inserted files may vary
- The storage capacities of PAST nodes differ
- Solutions:
- - Replica diversion
- - File diversion
38 - Replica diversion
- Recall: each node maintains a leaf set, the l nodes with nodeIds numerically closest to the given node
- If a node A cannot accommodate a copy locally, it considers replica diversion
- A chooses a node B in its leaf set and asks it to store the replica
- A then enters a pointer to B's copy in its table and issues a store receipt
39 - Policies for accepting a replica
- If (file size / remaining free storage) > t: reject (sketched below)
- t is a fixed threshold
- t has different values for primary replicas (nodes among the k numerically closest) and diverted replicas (nodes in the same leaf set, but not among the k closest)
- t(primary) > t(diverted)
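A minimal sketch of the acceptance test, with illustrative threshold values (the actual values are tuned; only the relation t(primary) > t(diverted) is taken from the slide).

def accept_replica(file_size: int, free_storage: int, is_primary: bool,
                   t_primary: float = 0.1, t_diverted: float = 0.05) -> bool:
    # reject if the file would consume too large a fraction of the remaining space
    if free_storage <= 0:
        return False
    threshold = t_primary if is_primary else t_diverted
    return (file_size / free_storage) <= threshold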
40 - File diversion
- When one of the k nodes declines to store a replica, replica diversion is tried
- If the node chosen for the diverted replica also declines, the entire file is diverted
- A negative ack is sent; the client generates another fileId and starts again
- After 3 rejections, the failure is reported to the user
41 - Maintaining replicas
- Pastry uses keep-alive messages and adjusts the leaf set after failures
- The same adjustment takes place at join
- What happens to the copies stored by a failed node?
- What about the copies stored by a node that leaves or enters a new leaf set?
42 - Maintaining replicas, contd.
- To maintain the invariant (k copies), the replicas have to be re-created in the previous cases
- Big overhead
- Proposed solution for join: lazy re-creation
- First insert a pointer to the node that holds them, then migrate them gradually
43 - Caching
- The k replicas are maintained in PAST for availability
- The fetch distance is measured in overlay-network hops (which says little about distance in the underlying network)
- Caching is used to improve performance
44 - Caching, contd.
- PAST nodes use the unused portion of their advertised disk space to cache files
- When storing a new primary or diverted replica, a node evicts one or more cached copies
- How it works: a file that is routed through a node by Pastry (insert or lookup) is inserted into the local cache if its size is less than a fraction c of the current cache size (sketched below)
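A minimal sketch of this admission rule. The eviction policy shown is plain LRU as a stand-in, and treating the fixed capacity as the "current cache size" is a simplification.

from collections import OrderedDict

class UnusedSpaceCache:
    def __init__(self, capacity_bytes: int, c: float = 0.05):
        self.capacity, self.c = capacity_bytes, c
        self.files = OrderedDict()     # file_id -> size, kept in LRU order
        self.used = 0

    def maybe_insert(self, file_id, size) -> bool:
        if size >= self.c * self.capacity:      # too big relative to the cache: do not cache
            return False
        while self.used + size > self.capacity and self.files:
            _, evicted_size = self.files.popitem(last=False)   # evict least recently used
            self.used -= evicted_size
        self.files[file_id] = size
        self.used += size
        return True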
45 - Evaluation
- PAST implemented in Java
- Network emulation using the Java VM
- 2 workloads (based on NLANR traces) for file sizes
- 4 normal distributions of node storage sizes
46 - Key results
- Storage
- - Replica and file diversion improve global storage utilization from 60.8% to 98% compared to running without them
- - Insertion failures drop to
- - Caveat: the storage capacities used in the experiments are 1000x below what might be expected in practice
- Caching
- - Routing hops with caching are lower than without caching, even at 99% storage utilization
- - Caveat: median file sizes are very low; caching performance will likely degrade if they are higher
47 - CFS: introduction
- Peer-to-peer read-only storage system
- Decentralized architecture focusing mainly on
- efficiency of data access
- robustness
- load balance
- scalability
- Provides a distributed hash table for block storage
- Uses Chord to map keys to nodes
- Does not provide
- anonymity
- strong protection against malicious participants
- The focus is on providing an efficient and robust lookup and storage layer with simple algorithms
48 - CFS software structure
[Diagram: a CFS client stack (FS / DHash / Chord) and CFS server stacks (DHash / Chord); the client FS uses a local API to its DHash layer, and the DHash/Chord layers talk to the servers through the RPC API]
49 - CFS layer functionalities
- The client file system uses the DHash layer to retrieve blocks
- The server DHash layer and the client DHash layer use the Chord layer to locate the servers that hold the desired blocks
- The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
- The Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block
50 - Fetching a file
- The client identifies the root block using a public key generated by the publisher
- It uses the public key as the root-block identifier to fetch the root block, and checks the validity of the block using the signature
- The file's inode key is obtained by the usual search through directory blocks; these contain the keys of the file inode blocks, which are used to fetch the inode blocks
- The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks (the whole sequence is sketched below)
51 - CFS properties
- Decentralized control: no administrative relationship between servers and publishers
- Scalability: a lookup uses space and messages at most logarithmic in the number of servers
- Availability: a client can retrieve data as long as at least one replica is reachable using the underlying network
- Load balance: for large files, achieved by spreading blocks over a number of servers; for small files, blocks are cached at the servers involved in the lookup
- Persistence: once data is inserted, it is available for the agreed-upon interval
- Quotas: implemented by limiting the amount of data inserted by any particular IP address
- Efficiency: the delay of file fetches is comparable with FTP thanks to efficient lookup, pre-fetching, caching and server selection
52 - Chord
- Consistent hashing
- - maps a node's IP address + virtual host number into an m-bit node identifier
- - maps block keys into the same m-bit identifier space
- The node responsible for a key is the successor of the key's id, with wrap-around in the m-bit identifier space (sketched below)
- Consistent hashing balances the keys so that all nodes share equal load with high probability, with minimal movement of keys as nodes enter and leave the network
- For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability
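A minimal consistent-hashing sketch in the spirit of this slide: nodes and keys share one m-bit circular identifier space and a key belongs to its successor. SHA-1 is assumed as the hash.

import bisect, hashlib

M = 160                                          # identifier bits (SHA-1 output size)

def chord_id(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")

def successor(key_id: int, node_ids: list) -> int:
    # the node responsible for key_id is the first node clockwise from it (wrap-around)
    ring = sorted(node_ids)
    i = bisect.bisect_left(ring, key_id % (1 << M))
    return ring[i] if i < len(ring) else ring[0]

# e.g. successor(chord_id("block-42"), [chord_id(f"10.0.0.{i}:0") for i in range(8)])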
53 - Chord details
- Two data structures are used for performing lookups:
- - Successor list: maintains the next r successors of the node. The successor list can be used to traverse the nodes and find the node responsible for the data in O(N) time.
- - Finger table: the i-th entry contains the identity of the first node that succeeds n by at least 2^(i-1) on the ID circle.
- Lookup pseudocode (a runnable sketch follows below):
- - find the id's predecessor; its successor is the node responsible for the key
- - to find the predecessor, check whether the key lies between the node's id and its successor's id; otherwise, using the finger table and successor list, find the node that is the closest predecessor of the id and repeat this step
- - since finger table entries point to nodes at power-of-two intervals around the ID ring, each iteration of the above step halves the distance between the current node and the predecessor
54 - Finger i points to successor(n + 2^i)
[Diagram: the finger intervals of node N80 span 1/2, 1/4, 1/8, 1/16, 1/32, 1/64 and 1/128 of the identifier ring; N120 and 112 appear as example targets]
55 - Chord: node join / failure
- Chord tries to preserve two invariants:
- - Each node's successor is correctly maintained.
- - For every key k, node successor(k) is responsible for k.
- To preserve these invariants, when a node n joins the network:
- - Initialize the predecessor, successors and finger table of node n
- - Update the existing finger tables of other nodes to reflect the addition of n
- - Notify the higher-layer software so that state can be transferred
- For concurrent operations and failures, each Chord node periodically runs a stabilization algorithm to update the finger tables and successor lists to reflect the addition or failure of nodes (sketched below).
- If lookups fail during the stabilization process, the higher layer can look up again. Chord guarantees that the stabilization algorithm results in a consistent ring.
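A rough sketch of the periodic stabilization step, using the same dict-shaped nodes and the in_interval helper from the lookup sketch above; timers, failure detection and finger-table refresh are omitted.

def stabilize(node):
    # ask our successor for its predecessor and adopt it if it lies between us
    x = node["successor"].get("predecessor")
    if (x is not None and x["id"] != node["successor"]["id"]
            and in_interval(x["id"], node["id"], node["successor"]["id"])):
        node["successor"] = x
    notify(node["successor"], node)

def notify(successor, candidate):
    # the successor adopts 'candidate' as predecessor if it is closer than the current one
    pred = successor.get("predecessor")
    if pred is None or in_interval(candidate["id"], pred["id"], successor["id"]):
        successor["predecessor"] = candidate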
56 - Chord: server selection
- Added to Chord as part of the CFS implementation
- Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
- Latencies are measured during finger-table creation, so no extra measurements are necessary
- This works well only if latency is roughly transitive: low latency from a to b and from b to c implies low latency between a and c
- Measurements suggest this is true ("A case study of server selection", Master's thesis)
57 - CFS: nodeId authentication
- An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data
- To prevent this, when a new node joins the system, existing nodes check (sketched below):
- - whether hash(node IP + virtual number) is the same as the professed node ID
- - a random nonce sent to the claimed IP, to check for IP spoofing
- To succeed, the attacker would have to control a large number of machines so that he can target blocks of the same file (which are randomly distributed over multiple servers)
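A small sketch of the two checks, assuming SHA-1 over "ip:virtual_number" (the exact encoding is an assumption) and a hypothetical send_nonce_and_wait RPC that returns True only if the claimed IP echoes the nonce back.

import hashlib, os

def verify_node_id(claimed_id: int, ip: str, virtual_num: int, send_nonce_and_wait) -> bool:
    expected = int.from_bytes(hashlib.sha1(f"{ip}:{virtual_num}".encode()).digest(), "big")
    if claimed_id != expected:       # the nodeId must be the hash of (IP, virtual number)
        return False
    nonce = os.urandom(16)           # challenge the claimed IP to rule out spoofing
    return send_nonce_and_wait(ip, nonce)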
58 - CFS: DHash layer
- Provides a distributed hash table for block storage
- Reflects a key CFS design decision: split each file into blocks and randomly distribute the blocks over many servers
- This provides good load distribution for large files
- The disadvantage is that lookup cost increases, since a lookup is executed for each block; the lookup cost is nevertheless small compared to the much higher cost of block fetches
- Also supports pre-fetching of blocks to reduce user-perceived latency
- Supports replication, caching, quotas, and updates of blocks
59 - CFS: replication
- Replicates each block on k servers to increase availability
- Places the replicas at the k servers that are the immediate successors of the node responsible for the key
- These servers are easily found from the successor list (r >= k)
- Provides fault tolerance: when the successor fails, the next server can serve the block
- Since successor nodes are in general not physically close to each other (the node id is a hash of IP address + virtual number), this provides robustness against the failure of multiple servers located on the same network
- The client can fetch the block from any of the k servers; latency can be used as the deciding factor. This also has the side effect of spreading load across multiple servers. It works under the assumption that proximity in the underlying network is transitive.
60 - CFS: caching
- DHash implements caching to avoid overloading servers that hold popular data
- Caching is based on the observation that as a lookup proceeds toward the desired key, the distance travelled across the key space with each hop decreases. This implies that, with high probability, the nodes just before the key are involved in a large number of lookups for the same block. So when the client fetches the block from the successor node, it also caches it at the servers that were involved in the lookup.
- The cache replacement policy is LRU. Blocks cached on servers far from the key are evicted sooner, since not many lookups touch those servers; blocks cached on servers close to the key stay in the cache as long as they are referenced.
61 - CFS: implementation
- Implemented in 7,000 lines of C++ code, including 3,000 lines of Chord
- User-level programs communicate over UDP with RPC primitives provided by the SFS toolkit
- The Chord library maintains the successor lists and the finger tables. For multiple virtual servers on the same physical server, the routing tables are shared for efficiency.
- Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency
- The client FS implementation exports an ordinary Unix-like file system. The client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers.
62 - CFS: experimental results
- Two sets of tests
- To test real-world, client-perceived performance, the first test explores performance on a subset of 12 machines of the RON testbed
- - A 1-megabyte file is split into 8 KB blocks
- - All machines download the file one at a time
- - Download speed is measured with and without server selection
- The second test is a controlled test in which a number of servers run on the same physical machine and use the local loopback interface for communication. In this test, robustness, scalability, load balancing, etc. of CFS are studied.
63-72 - (Result graphs; no transcript)
73 - Future research
- Support keyword search
- - by adopting an existing centralized search engine (like Napster)
- - or by using a distributed set of index files stored in CFS
- Improve security against malicious participants
- - Malicious nodes can form a consistent internal ring, route all lookups to nodes internal to the ring, and then deny the existence of the data
- - Content hashes help guard against block substitution
- - Future versions will add periodic routing-table consistency checks by randomly selected nodes to try to detect malicious participants
- Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time
74 - Conclusions
- PAST (Pastry) and CFS (Chord) represent peer-to-peer routing and location schemes for storage
- The ideas are almost the same in all of them
- CFS load management is less complex
- Questions raised at SOSP about them:
- - Is there any real application for them?
- - Who will trust these infrastructures to store his/her files?