Peer-to-peer file systems

1
  • peer-to-peer file systems
  • Presented by Serge Kreiker

2
P2P in the Internet
  • Napster: a peer-to-peer file-sharing application
  • allows Internet users to exchange files directly
  • simple idea, hugely successful
  • fastest-growing Web application
  • 50 million users in January 2001
  • shut down in February 2001
  • similar systems/startups followed in rapid
    succession
  • Napster, Gnutella, Freenet

3
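Napster
[Figure labels: peer 128.1.2.3; (xyz.mp3, 128.1.2.3); Central Napster server]
4
Napster
[Figure labels: 128.1.2.3; query "xyz.mp3 ?"; reply 128.1.2.3; Central Napster server]
5
Napster
[Figure labels: 128.1.2.3; request "xyz.mp3 ?"; Central Napster server]
6
Gnutella
7
Gnutella
[Figure label: query "xyz.mp3 ?"]
8
Gnutella
9
Gnutella
[Figure label: xyz.mp3]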
10
So Far
  • Centralized: Napster
  • - Table size O(n)
  • - Number of hops O(1)
  • Flooded queries: Gnutella
  • - Table size O(1)
  • - Number of hops O(n)
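
The two designs compared above can be sketched as follows. This is a minimal illustration, not either system's actual protocol: the class and function names, the in-memory dictionaries, and the TTL parameter are all assumptions made for the example.

```python
from collections import deque


class CentralIndex:
    """Napster-style directory: one server maps every file name to its peers."""

    def __init__(self):
        self.table = {}  # file name -> set of peer addresses (O(n) entries)

    def register(self, name, peer):
        self.table.setdefault(name, set()).add(peer)

    def lookup(self, name):
        # A single request to the server, independent of network size.
        return self.table.get(name, set())


def flood_lookup(neighbors, files, start, name, ttl):
    """Gnutella-style flooding: breadth-first search bounded by a TTL.

    neighbors: peer -> list of neighbouring peers (each peer knows only these)
    files:     peer -> set of file names held locally
    Returns (holder, hops), or (None, None) if nothing is found within ttl hops.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if name in files.get(peer, set()):
            return peer, hops
        if hops < ttl:
            for nxt in neighbors.get(peer, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return None, None
```

The central index answers in one step but holds every entry; flooding keeps per-peer state small but may visit a large part of the network.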

11
Storage management systems: challenges
  • Distributed
  • Nodes have identical capabilities and
    responsibilities
  • Anonymity
  • Storage management: spread the storage burden evenly
  • Tolerate unreliable participants
  • Robustness: surviving massive failures
  • Resilience to DoS attacks, censorship, and other node
    failures
  • Cache management: cache additional copies of
    popular files

12
Routing challenges
  • Efficiency: O(log N) messages per lookup
  • N is the total number of servers
  • Scalability: O(log N) state per node
  • Robustness: surviving massive failures

13
We are going to look at
  • PAST (Rice and Microsoft Research; routing
    substrate: Pastry)
  • CFS (MIT; routing substrate: Chord)

14
What is PAST?
  • Archival storage and content distribution utility
  • Not a general purpose file system
  • Stores multiple replicas of files
  • Caches additional copies of popular files in the
    local file system

15
How it works
  • Built over a self-organizing, Internet-based
    overlay network
  • Based on Pastry routing scheme
  • Offers persistent storage services for replicated
    read-only files
  • Owners can insert and reclaim files
  • Clients can only look up files

16
PAST Nodes
  • The collection of PAST nodes forms an overlay
    network
  • Minimally, a PAST node is an access point
  • Optionally, it contributes storage and
    participates in the routing

17
PAST operations
  • fileId = Insert(name, owner-credentials, k,
    file)
  • file = Lookup(fileId)
  • Reclaim(fileId, owner-credentials)

18
Insertion
  • fileId is computed as the secure hash of the file name,
    the owner's public key, and a salt
  • The file is stored on the k nodes whose nodeIds are
    numerically closest to the 128 most significant bits of fileId
  • How to map key IDs to node IDs?
  • Use Pastry
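
A rough sketch of the insertion mapping described above. It assumes SHA-1 as the secure hash and a circular 128-bit nodeId space; the function names and the way the three inputs are concatenated are illustrative choices, not PAST's actual encoding.

```python
import hashlib

ID_SPACE = 1 << 128  # size of the 128-bit nodeId space


def file_id(name: str, owner_public_key: bytes, salt: bytes) -> int:
    """160-bit fileId: SHA-1 over the name, the owner's public key and a salt."""
    h = hashlib.sha1()
    h.update(name.encode("utf-8"))
    h.update(owner_public_key)
    h.update(salt)
    return int.from_bytes(h.digest(), "big")


def msb128(fid: int) -> int:
    """Keep the 128 most significant bits of a 160-bit fileId."""
    return fid >> 32


def circular_distance(a: int, b: int) -> int:
    """Distance between two ids on the ring (assumed wrap-around)."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def k_closest(node_ids, key, k):
    """The k nodeIds numerically closest to key."""
    return sorted(node_ids, key=lambda n: circular_distance(n, key))[:k]
```

In a real deployment no node knows all nodeIds; Pastry routing finds the closest node, and the sorted list here only stands in for that.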

19
Insert (cont'd)
  • The required storage is debited against the
    owner's storage quota
  • A file certificate is returned
  • Signed with the owner's private key
  • Contains the fileId, a hash of the content, the replication
    factor, and other fields
  • The file and its certificate are routed via Pastry
  • Each of the k replica-storing nodes attaches
    a store receipt
  • An ack is sent back after all k nodes have accepted the
    file

20
Insert file with fileId = 117, k = 4
21
Lookup and Reclaim
  • Lookup: Pastry locates a nearby node that has a
    copy and retrieves it
  • Reclaim: weak consistency
  • After a reclaim, a lookup is no longer guaranteed to
    retrieve the file
  • However, a reclaim does not guarantee that the file is no
    longer available

22
Pastry: Peer-to-peer routing
  • Provides generic, scalable indexing, data location,
    and routing
  • Inspired by Plaxton's algorithm (used in web
    content distribution, e.g. Akamai) and Landmark
    hierarchy routing
  • Goals:
  • Efficiency
  • Scalability
  • Fault resilience
  • Self-organization (completely decentralized)

23
Pastry: How does it work?
  • Each node has a unique nodeId.
  • Each message has a key.
  • Both are uniformly distributed and lie in the
    same namespace.
  • A Pastry node routes the message to the node whose
    nodeId is numerically closest to the key.
  • The number of routing steps is O(log N).
  • Pastry takes network locality into account.
  • PAST uses the fileId as the key and stores the file on
    the k closest nodes.

24
Pastry: Node ID space
  • Each node is assigned a 128-bit node identifier,
    the nodeId.
  • The nodeId is assigned randomly when the node joins the
    system (e.g. using the SHA-1 hash of its IP address or the node's
    public key).
  • Nodes with adjacent nodeIds are diverse in
    geography, ownership, network attachment, etc.
  • nodeIds and keys are written in base 2^b; b is a
    configuration parameter with a typical value of 4.
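
To make the base-2^b representation concrete, here is a small helper that splits an identifier into its digits. The function name and defaults are assumptions for illustration only.

```python
def digits(node_id: int, b: int = 4, bits: int = 128) -> list[int]:
    """Split a `bits`-bit identifier into base-2^b digits, most significant first."""
    if bits % b:
        raise ValueError("bits must be a multiple of b")
    mask = (1 << b) - 1
    count = bits // b
    return [(node_id >> (b * (count - 1 - i))) & mask for i in range(count)]
```

With b = 4 a 128-bit nodeId has 32 hexadecimal digits; with b = 2 and a 16-bit id, as in the example used later in the deck, it has 8 base-4 digits.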

25
Pastry: Node ID space
[Figure: a nodeId is 128 bits (max. 2^128 nodes), written as a sequence of L base-2^b (b-bit) digits at positions 0, 1, ..., L-1. L levels, b = 128/L bits per level. The namespace is circular, running from 0 to 2^128 - 1.]
26
Pastry: Node state (1)
  • Each node maintains a routing table R, a
    neighborhood set M, and a leaf set L.
  • The routing table is organized into ⌈log_{2^b} N⌉ rows
    with 2^b - 1 entries each.
  • Each entry in row n contains the IP address of a close
    node whose nodeId matches the local nodeId in the first n digits
    but differs in digit n+1.
  • Choice of b: a tradeoff between the size of the routing
    table and the length of routes.

27
Pastry: Node state (2)
  • Neighborhood set: nodeIds and IP addresses of the |M|
    nodes closest to the local node in network proximity
    (not in nodeId space)
  • Leaf set: the set of |L| nodes with nodeIds closest
    to the current node.
  • L is divided into two halves: the |L|/2 closest larger
    nodeIds and the |L|/2 closest smaller nodeIds.
  • Typical values for |L| and |M| are 2^b

28
Example: nodeId = 10233102, b = 2, nodeId is 16 bits.
All numbers are in base 4.
29
Pastry: Routing requests
  • Route(my-id, key-id, message)
  • if (key-id is in the range of my leaf set)
  • forward to the numerically closest node in the
    leaf set
  • else if (the routing table has a node node-id
    such that node-id shares a longer prefix with
    key-id than my-id does)
  • forward to node-id
  • else
  • forward to a node node-id that shares a prefix of
    the same length with key-id as my-id does but is
    numerically closer to key-id

Routing takes O(log N) messages
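
The three routing cases can be sketched as one next-hop function. This is a simplified illustration: ids are equal-length digit strings, wrap-around of the circular namespace is ignored, and the dictionary layout of the routing table is an assumption made for this example.

```python
def shared_prefix(a, b):
    """Number of leading digits two equal-length digit strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(my_id, key, leaf_set, routing_table, base=4):
    """Choose the next hop for `key`; returns my_id if this node is the target.

    leaf_set:      ids numerically adjacent to my_id
    routing_table: dict (row, digit) -> id that shares `row` digits with my_id
                   and has `digit` in the next position
    """
    def num(s):
        return int(s, base)

    def dist(s):
        return abs(num(s) - num(key))

    # Case 1: key falls within the span of the leaf set.
    members = list(leaf_set) + [my_id]
    if min(map(num, members)) <= num(key) <= max(map(num, members)):
        return min(members, key=dist)
    # Case 2: a routing-table entry shares a longer prefix with the key.
    row = shared_prefix(my_id, key)
    entry = routing_table.get((row, key[row]))
    if entry is not None:
        return entry
    # Case 3 (rare): any known node with an equally long prefix that is closer.
    known = members + list(routing_table.values())
    candidates = [n for n in known
                  if shared_prefix(n, key) >= row and dist(n) < dist(my_id)]
    return min(candidates, key=dist) if candidates else my_id
```

Each case-2 hop fixes at least one more digit of the key, which is where the O(log N) bound on routing steps comes from.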
30
b = 2, l = 4, key = 1230
31
Pastry: Node addition
  • X: joining node
  • A: node near X (network proximity)
  • Z: node with nodeId numerically closest to X
  • Routing state of X:
  • leaf-set(X) = leaf-set(Z)
  • neighborhood-set(X) = neighborhood-set(A)
  • routing table of X, row i = routing table of Ni,
    row i, where Ni is the ith node encountered
    along the route from A to Z
  • X notifies all nodes in leaf-set(X), which update
    their state.

[Figure labels: A = 10, N1, N2, N36, Z = 210, 240, Lookup(216)]
32
X joins the system, first stage
[Figure label: route message, key = X]
33
Pastry: Node failures and recovery
  • Relies on a soft-state protocol to deal with node
    failures
  • Neighboring nodes in the nodeId space
    periodically exchange keepalive messages
  • Nodes that are unresponsive for a period T are removed
    from leaf sets
  • A recovering node contacts its last known leaf set,
    updates its own leaf set, and notifies the members of its
    presence.
  • Randomized routing deals with malicious nodes
    that can cause repeated query failures

34
Security
  • Each PAST node and each user of the system holds a
    smartcard
  • A private/public key pair is associated with each
    card
  • Smartcards generate and verify certificates and
    maintain storage quotas

35
More on Security
  • Smartcards ensure the integrity of nodeId and fileId
    assignments
  • Store receipts prevent malicious nodes from creating
    fewer than k copies
  • File certificates allow storage nodes and clients
    to verify the integrity and authenticity of stored
    content, or to enforce the storage quota

36
Storage Management
  • Based on local coordination among nodes
    with nearby nodeIds
  • Responsibilities:
  • Balance the free storage among nodes
  • Maintain the invariant that the replicas of each
    file are stored on the k nodes closest to its
    fileId

37
Causes of storage imbalance and solutions
  • The number of files assigned to each node may
    vary
  • The size of the inserted files may vary
  • The storage capacity of PAST nodes differs
  • Solutions:
  • Replica diversion
  • File diversion

38
Replica diversion
  • Recall: each node maintains a leaf set
  • the l nodes with nodeIds numerically closest to the given
    node
  • If a node A cannot accommodate a copy locally, it
    considers replica diversion
  • A chooses a node B in its leaf set and asks it to store
    the replica
  • A then enters a pointer to B's copy in its table
    and issues a store receipt

39
Policies for accepting a replica
  • If (file size / remaining free storage) > t
  • Reject
  • t is a fixed threshold
  • t has different values for a primary replica
    (nodes among the k numerically closest) and a diverted
    replica (nodes in the same leaf set, but not among the k
    closest)
  • t(primary) > t(diverted)
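
The acceptance rule above reduces to a single comparison. In this sketch the threshold values are placeholders chosen only to satisfy t(primary) > t(diverted); the function name is likewise illustrative.

```python
T_PRIMARY = 0.1    # assumed value, for illustration only
T_DIVERTED = 0.05  # assumed value; only the ordering T_PRIMARY > T_DIVERTED matters


def accepts_replica(file_size, free_space, diverted=False):
    """Accept unless file_size / free_space exceeds the applicable threshold."""
    t = T_DIVERTED if diverted else T_PRIMARY
    if free_space <= 0:
        return False
    return file_size / free_space <= t
```

Because the diverted threshold is stricter, a node fills up with its own primary replicas before it takes on replicas diverted from neighbours, and large files are rejected earlier than small ones as free space shrinks.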

40
File diversion
  • When one of the k nodes declines to store a
    replica → try replica diversion
  • If the node chosen for the diverted replica also
    declines → the entire file is diverted
  • A negative ack is sent; the client generates
    another fileId and starts again
  • After 3 rejections the user is notified
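
File diversion amounts to a bounded retry loop on the client side. The sketch below assumes the new fileId is obtained by drawing a fresh random salt, and `try_insert` is a hypothetical callback standing in for the whole Pastry insert; neither detail comes from the slides.

```python
import hashlib
import os


def insert_with_file_diversion(name, owner_public_key, try_insert, max_rejections=3):
    """Retry an insert under a fresh fileId each time the file is rejected.

    try_insert(file_id) -> bool is a hypothetical callback that returns True
    once all k replicas were stored and False on a negative ack.
    Returns the accepted fileId, or None after max_rejections rejections.
    """
    for _ in range(max_rejections):
        salt = os.urandom(8)  # a new salt yields a new fileId
        fid = hashlib.sha1(name.encode("utf-8") + owner_public_key + salt).hexdigest()
        if try_insert(fid):
            return fid
    return None  # the caller reports the failure to the user
```

A new fileId lands the file in a different region of the nodeId space, where the k closest nodes may have more free storage.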

41
Maintaining replicas
  • Pastry uses keep-alive messages and it adjusts
    the leaf set after failures
  • The same adjustment takes place at join
  • What happens to the copies stored by a failed
    node?
  • What about the copies stored by a node that leaves
    or enters a new leaf set?

42
Maintaining replicas (cont'd)
  • To maintain the invariant (k copies), the
    replicas have to be re-created in the cases
    above
  • Big overhead
  • Proposed solution for join: lazy re-creation
  • First insert a pointer to the node that holds
    the replicas, then migrate them gradually

43
Caching
  • The k replicas are maintained in PAST for
    availability
  • The fetch distance is measured in terms of
    overlay network hops (which does not necessarily
    reflect distance in the real network)
  • Caching is used to improve performance

44
Caching (cont'd)
  • PAST nodes use the unused portion of their
    advertised disk space to cache files
  • When storing a new primary or diverted replica, a
    node evicts one or more cached copies
  • How it works: a file that is routed through a
    node by Pastry (insert or lookup) is inserted
    into the local cache if its size is less than c
  • c is a fraction of the current cache size
45
Evaluation
  • PAST is implemented in Java
  • Network emulation using the Java VM
  • 2 workloads (based on NLANR traces) for
    file sizes
  • 4 normal distributions of node storage sizes

46
Key Results
  • STORAGE
  • Replica and file diversion improved global
    storage utilization from 60.8% to 98% compared to
    running without them
  • insertion failures drop to
  • Caveat: the storage capacities used in the experiment are
    1000 times below what might be expected in
    practice.
  • CACHING
  • Routing hops with caching are lower than without
    caching, even at 99% storage utilization
  • Caveat: median file sizes are very low; caching
    performance will likely degrade if they are
    higher.

47
CFS: Introduction
  • Peer-to-peer read-only storage system
  • Decentralized architecture focusing mainly on:
  • efficiency of data access
  • robustness
  • load balance
  • scalability
  • Provides a distributed hash table for block
    storage
  • Uses Chord to map keys to nodes.
  • Does not provide:
  • anonymity
  • strong protection against malicious participants
  • The focus is on providing an efficient and robust
    lookup and storage layer with simple algorithms.

48
CFS Software Structure
[Figure: the CFS client stacks FS over DHash over Chord; each of the two CFS servers stacks DHash over Chord. Labels: Local API, RPC API.]
49
CFS: Layer functionalities
  • The client file system uses the DHash layer to
    retrieve blocks
  • The client DHash layer uses
    the client Chord layer to locate the servers that
    hold desired blocks
  • The server DHash layer is responsible for storing
    keyed blocks, maintaining proper levels of
    replication as servers come and go, and caching
    popular blocks
  • The server DHash and Chord layers interact in order to integrate
    looking up a block identifier with checking for
    cached copies of the block

50
  • The client identifies the root block using a public
    key generated by the publisher.
  • It uses the public key as the root block identifier
    to fetch the root block and checks the validity of
    the block using the signature.
  • The file inode key is obtained by the usual search
    through directory blocks. These contain the keys of
    the file inode blocks, which are used to fetch the
    inode blocks.
  • The inode block contains the block numbers and
    their corresponding keys, which are used to fetch
    the data blocks.
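
The block hierarchy described above can be sketched with a dictionary standing in for DHash. Everything here is illustrative: the JSON encoding, the function names, and the single flat directory are assumptions, and the signature check on the root block is omitted (only content-hash checks are shown).

```python
import hashlib
import json

store = {}  # stand-in for the DHash block store: key -> block bytes


def put(block: bytes) -> str:
    """Store a block under the SHA-1 hash of its content."""
    key = hashlib.sha1(block).hexdigest()
    store[key] = block
    return key


def get(key: str) -> bytes:
    """Fetch a block and verify that it matches its content-hash key."""
    block = store[key]
    if hashlib.sha1(block).hexdigest() != key:
        raise ValueError("block does not match its key")
    return block


def publish(files, public_key, block_size=8192):
    """Build data blocks, inode blocks, a directory block and the root block."""
    directory = {}
    for name, data in files.items():
        keys = [put(data[i:i + block_size]) for i in range(0, len(data), block_size)]
        directory[name] = put(json.dumps(keys).encode())  # inode block
    dir_key = put(json.dumps(directory).encode())
    # The root block is keyed by the publisher's public key, not by its hash.
    store[public_key] = json.dumps({"dir": dir_key}).encode()


def read_file(public_key, name):
    """Root block -> directory block -> inode block -> data blocks."""
    root = json.loads(store[public_key])
    directory = json.loads(get(root["dir"]))
    block_keys = json.loads(get(directory[name]))
    return b"".join(get(k) for k in block_keys)
```

Only the root block needs a signature: every other block is named by the hash of its content, so a client that trusts the root can verify everything reachable from it.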

51
CFS Properties
  • decentralized control: no administrative
    relationship between servers and publishers.
  • scalability: lookup uses space and messages at
    most logarithmic in the number of servers.
  • availability: a client can retrieve data as long
    as at least one replica is reachable using the
    underlying network.
  • load balance: for large files, it is achieved
    by spreading blocks over a number of
    servers. For small files, blocks are cached at
    servers involved in the lookup.
  • persistence: once data is inserted, it is
    available for the agreed-upon interval.
  • quotas: implemented by limiting the amount
    of data inserted by any particular IP address
  • efficiency: the delay of file fetches is
    comparable to FTP due to efficient lookup,
    pre-fetching, caching, and server selection.

52
Chord
  • Consistent hashing
  • maps a node's IP address + virtual host number into
    an m-bit node identifier.
  • maps block keys into the same m-bit identifier
    space.
  • The node responsible for a key is the successor of
    the key's id, with wrap-around in the m-bit
    identifier space.
  • Consistent hashing balances the keys so that all
    nodes share equal load with high probability, with
    minimal movement of keys as nodes enter and leave
    the network.
  • For scalability, Chord uses a distributed version
    of consistent hashing in which nodes maintain
    only O(log N) state and use O(log N) messages per
    lookup with high probability.
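
A small sketch of the consistent-hashing mapping, using a global sorted list of node ids for clarity. The 16-bit identifier size, the "ip:virtual" string encoding, and the reduction of SHA-1 modulo 2^m are simplifications assumed for this example.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits; kept small here for readability


def chord_id(data: bytes, m: int = M) -> int:
    """Hash arbitrary bytes into the m-bit identifier space."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % (1 << m)


def node_id(ip: str, virtual: int, m: int = M) -> int:
    """Identifier of a (possibly virtual) node: hash of IP address and index."""
    return chord_id(f"{ip}:{virtual}".encode(), m)


def successor(sorted_node_ids, key):
    """First node id >= key, wrapping around to the smallest id."""
    i = bisect_left(sorted_node_ids, key)
    return sorted_node_ids[i % len(sorted_node_ids)]
```

When a node joins, only the keys between its predecessor and itself change owner, which is the "minimal movement" property mentioned above.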

53
Chord details
  • Two data structures are used for performing lookups
  • Successor list: maintains the next r
    successors of the node. The successor list can be
    used to traverse the nodes and find the node
    that is responsible for the data in O(N) time.
  • Finger table: the ith entry in the finger table
    contains the identity of the first node that
    succeeds n by at least 2^(i-1) on the ID circle.
  • Lookup pseudocode:
  • find id's predecessor; its successor is the node
    responsible for the key
  • to find the predecessor, check whether the key lies
    between the node id and its successor. Otherwise,
    using the finger table and successor list, find
    the node that is the closest predecessor of id
    and repeat this step.
  • since finger table entries point to nodes at
    power-of-two intervals around the ID ring, each
    iteration of the step above reduces the distance
    between the predecessor and the current node by
    at least half.
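
The lookup steps above can be sketched on a simulated ring. This sketch numbers fingers from 0, so finger i is the successor of n + 2^i; the `Ring` class, which builds every finger table from global knowledge, is an assumption made so the example is self-contained, whereas real nodes learn their fingers through the join and stabilization protocols.

```python
def in_interval(x, a, b, size):
    """True if x lies in the ring interval (a, b]; a == b means the whole ring."""
    x, a, b = x % size, a % size, b % size
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Ring:
    def __init__(self, node_ids, m):
        self.size = 1 << m
        self.nodes = sorted(node_ids)
        # fingers[n][i] = successor of (n + 2^i); fingers[n][0] is n's successor.
        self.fingers = {
            n: [self._succ((n + (1 << i)) % self.size) for i in range(m)]
            for n in self.nodes
        }

    def _succ(self, key):
        for n in self.nodes:
            if n >= key:
                return n
        return self.nodes[0]

    def closest_preceding(self, n, key):
        """Highest finger of n that lies strictly between n and key."""
        for f in reversed(self.fingers[n]):
            if f != key and in_interval(f, n, key, self.size):
                return f
        return n

    def find_successor(self, start, key):
        """Return (node responsible for key, number of forwarding hops)."""
        n, hops = start, 0
        while not in_interval(key, n, self.fingers[n][0], self.size):
            nxt = self.closest_preceding(n, key)
            if nxt == n:
                break
            n, hops = nxt, hops + 1
        return self.fingers[n][0], hops
```

Each hop jumps to the finger that covers the largest part of the remaining distance, which is why the distance to the key shrinks by at least half per step.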

54
Finger i points to the successor of n + 2^i
[Figure: ID ring with node N80 and its fingers spanning 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring; labels N120 and 112]
55
Chord: Node join/failure
  • Chord tries to preserve two invariants:
  • Each node's successor is correctly maintained.
  • For every key k, node successor(k) is responsible
    for k.
  • To preserve these invariants, when a node n joins a
    network:
  • Initialize the predecessor, successors, and
    finger table of node n
  • Update the existing finger tables of other nodes
    to reflect the addition of n
  • Notify higher-layer software so that state can be
    transferred.
  • For concurrent operations and failures, each
    Chord node periodically runs a stabilization algorithm
    to update the finger tables and
    successor lists to reflect the addition/failure of
    nodes.
  • If lookups fail during the stabilization process,
    the higher layer can look up again. Chord
    guarantees that the stabilization algorithm will
    result in a consistent ring.

56
Chord: Server selection
  • Added to Chord as part of the CFS implementation.
  • Basic idea: reduce lookup latency by
    preferentially contacting nodes likely to be
    nearby in the underlying network
  • Latencies are measured during finger table
    creation, so no extra measurements are necessary.
  • This works well only if latency is roughly
    transitive: low latency from a to b and from b to c
    implies low latency between a and c
  • Measurements suggest this is true ("A case study
    of server selection", Master's thesis)

57
CFS: Node ID authentication
  • An attacker can destroy chosen data by selecting a
    node ID that is the successor of the data key
    and then denying the existence of the data.
  • To prevent this, when a new node joins the
    system, existing nodes:
  • check that hash(node IP + virtual number) equals
    the professed node ID
  • send a random nonce to the claimed IP to check
    for IP spoofing
  • To succeed, the attacker would have to control a
    large number of machines in order to target
    blocks of the same file (which are randomly
    distributed over multiple servers)
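
The two checks can be sketched as follows. The "ip:virtual" string encoding, the use of SHA-1, and the hypothetical `send_nonce` callback (which stands in for a network round trip and is expected to echo the nonce back) are assumptions for illustration, not the CFS wire protocol.

```python
import hashlib
import secrets


def expected_node_id(ip: str, virtual: int) -> str:
    """Node ID that a server at this IP and virtual index is allowed to claim."""
    return hashlib.sha1(f"{ip}:{virtual}".encode()).hexdigest()


def verify_join(claimed_id, ip, virtual, send_nonce):
    """Accept a joining node only if both checks pass.

    send_nonce(ip, nonce) is a hypothetical callback that delivers the nonce
    to the claimed IP address and returns whatever comes back.
    """
    # Check 1: the professed ID must be the hash of the IP and virtual number.
    if claimed_id != expected_node_id(ip, virtual):
        return False
    # Check 2: only a host that really receives traffic at `ip` can echo the nonce.
    nonce = secrets.token_hex(16)
    return send_nonce(ip, nonce) == nonce
```

Because IDs are tied to IP addresses, an attacker cannot pick an arbitrary position on the ring; it needs a machine whose address happens to hash next to the target key.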

58
CFS: DHash layer
  • Provides a distributed hash table for block
    storage
  • Reflects a key CFS design decision: split each
    file into blocks and randomly distribute the
    blocks over many servers.
  • This provides good load distribution for large
    files.
  • The disadvantage is that lookup cost increases, since
    a lookup is executed for each block. The lookup
    cost is small, though, compared to the much higher
    cost of block fetches.
  • Also supports pre-fetching of blocks to reduce
    user-perceived latency.
  • Supports replication, caching, quotas, and updates
    of blocks.

59
CFS: Replication
  • Replicates each block on k servers to increase
    availability.
  • Places the replicas at the k servers that are
    the immediate successors of the node that is
    responsible for the key
  • Can easily find these servers from the successor
    list (r >= k)
  • Provides fault tolerance: when the successor
    fails, the next server can serve the block.
  • Since the node id is a hash of the IP address +
    virtual number, successor nodes are in general
    not likely to be physically close to each other;
    this provides robustness against the failure of
    multiple servers located on the same network.
  • The client can fetch the block from any of the
    k servers. Latency can be used as the deciding
    factor. This also has the side effect of
    spreading the load across multiple servers. This
    works under the assumption that proximity in
    the underlying network is transitive.
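
A sketch of replica placement and server choice as described above, again using a global sorted list of node ids in place of a real successor list. The function names and the latency table are assumptions made for the example.

```python
from bisect import bisect_left


def replica_servers(sorted_ids, key, k):
    """The k servers that immediately follow successor(key) on the ring."""
    n = len(sorted_ids)
    i = bisect_left(sorted_ids, key) % n  # index of successor(key)
    count = min(k, n - 1)                 # cannot use more servers than exist
    return [sorted_ids[(i + j) % n] for j in range(1, count + 1)]


def pick_server(candidates, latency_ms, alive):
    """Among live candidates, fetch from the one with the lowest latency."""
    live = [s for s in candidates if s in alive]
    return min(live, key=lambda s: latency_ms[s]) if live else None
```

Because the replicas sit on consecutive successors, a node that takes over after a failure already appears in the successor lists of its neighbours.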

60
CFS: Caching
  • DHash implements caching to avoid overloading
    servers that hold popular data.
  • Caching is based on the observation that as a
    lookup proceeds towards the desired
    key, the distance traveled across the key space
    with each hop decreases. This implies that, with
    high probability, the nodes just before the key
    are involved in a large number of lookups for the
    same block. So when the client fetches the block
    from the successor node, it also caches it at the
    servers that were involved in the lookup.
  • The cache replacement policy is LRU. Blocks that are
    cached on servers far from the key are evicted
    faster from the cache, since not many lookups
    touch these servers. Blocks
    cached on closer servers remain alive in the
    cache as long as they are referenced.
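
A sketch of path caching with LRU replacement. The `LRUCache` class, the per-server cache dictionary, and the `origin` mapping that stands in for the successor's block store are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity block cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)


def lookup_with_caching(path, caches, origin, key):
    """Walk the lookup path and stop at the first server holding a cached copy.

    If no server on the path has the block, it is fetched from `origin`.
    The block is then cached on every server visited before the hit.
    Returns (block, number of servers visited without a cached copy).
    """
    visited = []
    for server in path:
        block = caches[server].get(key)
        if block is not None:
            break
        visited.append(server)
    else:
        block = origin[key]
    for server in visited:
        caches[server].put(key, block)
    return block, len(visited)
```

After the first fetch of a popular block, later lookups that pass through any server on the same path are answered from a cache instead of reaching the successor.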

61
CFS: Implementation
  • Implemented in 7000 lines of C++ code, including
    3000 lines of Chord
  • User-level programs communicate over UDP with
    RPC primitives provided by the SFS toolkit.
  • The Chord library maintains the successor lists and
    the finger tables. For multiple virtual servers
    on the same physical server, the routing tables
    are shared for efficiency.
  • Each DHash instance is associated with a Chord
    virtual server. DHash has its own implementation of the
    Chord lookup protocol to increase efficiency.
  • The client FS implementation exports an ordinary Unix-like
    file system. The client runs on the same
    machine as a server, uses Unix domain sockets
    to communicate with the local server, and uses that
    server as a proxy to send queries to non-local
    CFS servers.

62
CFS: Experimental results
  • Two sets of tests
  • To test real-world client-perceived performance,
    the first test explores performance on a subset
    of 12 machines of the RON testbed.
  • A 1-megabyte file is split into 8 KB blocks
  • All machines download the file, one at a time.
  • The download speed is measured with and without
    server selection
  • The second test is a controlled test in which a
    number of servers run on the same physical
    machine and use the local loopback interface for
    communication. In this test, the robustness,
    scalability, load balancing, etc. of CFS are
    studied.

73
Future Research
  • Support keyword search:
  • by adopting an existing centralized search engine
    (like Napster)
  • or by using a distributed set of index files stored on
    CFS
  • Improve security against malicious participants.
  • Malicious nodes can form a consistent internal ring, route
    all lookups to nodes internal to that ring, and
    then deny the existence of the data
  • Content hashes help guard against block
    substitution.
  • Future versions will add periodic routing table
    consistency checks by randomly selected nodes to
    try to detect malicious participants.
  • Lazy replica copying, to reduce the overhead for
    hosts that join the network for a short period
    of time.

74
Conclusions
  • PAST (Pastry) and CFS (Chord) represent peer-to-peer
    routing and location schemes for storage
  • The ideas are almost the same in both
  • CFS load management is less complex
  • Questions raised at SOSP about them:
  • Is there any real application for them?
  • Who will trust these infrastructures to store
    their files?