Title: PAST: A large-scale persistent peer-to-peer storage utility
1. PAST: A large-scale persistent peer-to-peer storage utility
- LECS Reading Group
- 10/23/2001
2. P2P in the Internet
- Napster: a peer-to-peer file sharing application
- allows Internet users to exchange files directly
- simple idea, hugely successful
- fastest growing Web application
- 50 million users in January 2001
- shut down in February 2001
- similar systems/startups followed in rapid succession: Gnutella, Scour, Freenet, Groove, Flycode, vTrails
3. Peer-to-peer computing
- Peer-to-peer systems
- distributed; nodes have identical capabilities and responsibilities; communication is symmetric
- Technical potential
- can harness huge amounts of resources
- user PCs: disk space, upstream bandwidth, CPU cycles
- without requiring expensive hardware, bandwidth, rack space
- completely distributed
- robust, less vulnerable to DoS attacks, harder to censor
Technical challenges: decentralized control, self-organization, adaptation and scalability!
4. Napster
(diagram: peer 128.1.2.3 registers (xyz.mp3, 128.1.2.3) with the central Napster server)
5. Napster
(diagram: a peer asks the central Napster server "xyz.mp3 ?" and receives the address 128.1.2.3)
6. Napster
(diagram: the requesting peer fetches xyz.mp3 directly from 128.1.2.3)
7. Gnutella
(diagram: overlay of peers with no central server)
8. Gnutella
(diagram: the query "xyz.mp3 ?" is forwarded from peer to peer)
9. Gnutella
(diagram: the query propagates through the overlay)
10. Gnutella
(diagram: xyz.mp3 is transferred directly from the peer that holds it)
11. Peer-to-peer file sharing
- Napster
- decentralized storage of actual content
- transfer content directly from one peer (client) to another
- centralized index and search
- simple, but O(N) state and a single point of failure
- Gnutella
- like a decentralized Napster
- distributed index and search
- robust, but worst case O(N) messages per lookup
Next-generation systems build on distributed indexing and lookup services
12. Large-scale storage management systems
- Distributed storage infrastructures
- PAST (Rice and Microsoft Research; routing substrate: Pastry)
- OceanStore (U.C. Berkeley; routing substrate: Tapestry)
- Publius (AT&T)
- Farsite (Microsoft Research)
- CFS (MIT; routing substrate: Chord)
- GRCD (UC Berkeley; builds on CAN)
- Goals
- continuous access to persistent information
- utility infrastructure that manages customer content
- resilience to DoS attacks, censorship, and other node failures
13. PAST
- Internet-based, peer-to-peer global storage utility
- Goals: strong persistence, high availability, scalability and security
- Overview
- PAST API for clients
- Pastry: peer-to-peer routing substrate
- storage management: store multiple replicas of files
- cache management: cache additional copies of popular files
14. PAST API for clients
- fileId = Insert(name, owner-credentials, k, file)
- stores the file at k distinct nodes in the PAST network
- fileId = SHA-1(name, owner-credentials, random number)
- file = Lookup(fileId)
- reliably retrieves a copy of the file, normally from a nearby node
- Reclaim(fileId, owner-credentials)
- reclaims the storage occupied by the k copies of the file identified by fileId
Archival storage and content distribution, not a general-purpose FS: no searching, directory lookup, or key distribution operations
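The fileId construction above can be sketched in a few lines. This is an illustrative Python sketch (PAST itself is implemented in Java), and the function name `make_file_id` is invented for the example:

```python
import hashlib
import os

def make_file_id(name, owner_credentials, salt=None):
    """Compute a 160-bit fileId as SHA-1 over the file name, the
    owner's credentials, and a random salt, mirroring
    fileId = SHA-1(name, owner-credentials, random number)."""
    if salt is None:
        salt = os.urandom(20)            # the "random number" component
    h = hashlib.sha1()
    h.update(name.encode())
    h.update(owner_credentials.encode())
    h.update(salt)
    return h.hexdigest()                 # 40 hex digits = 160 bits
```

The random salt makes repeated inserts of the same name land at different places in the id space, which is what file diversion (slide 32) relies on.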
15. PAST IDs
- File identifier: 160 bits; the 128 most significant bits form the keyId
- Node identifier: 128 bits
- Both are uniformly distributed
- Both lie in the same namespace
- How to map keyIds to nodeIds?
- Use Pastry
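The mapping Pastry provides can be illustrated as picking, for a given keyId, the k nodeIds numerically closest to it. A minimal Python sketch, assuming a circular 128-bit namespace and a hypothetical function name:

```python
def closest_nodes(key_id, node_ids, k):
    """Return the k nodeIds numerically closest to key_id in a
    circular 128-bit namespace (ties broken by nodeId value)."""
    ring = 1 << 128
    def circular_distance(n):
        d = abs(n - key_id)
        return min(d, ring - d)          # wrap around the ring
    return sorted(node_ids, key=lambda n: (circular_distance(n), n))[:k]
```

In the real system no node knows all nodeIds, of course; Pastry's routing (slide 25) converges on the same set without global knowledge.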
16. Pastry: peer-to-peer routing substrate
- Provides generic, scalable indexing, data location and routing for peer-to-peer applications
- Inspired by Plaxton's algorithm (used in web content distribution, e.g. Akamai) and Landmark hierarchy routing
- Goals
- efficiency
- scalability
- fault resilience
- self-organization (completely decentralized)
17.–21. Pastry: basic idea
(animated diagram: insert(K1,V1) is routed across the overlay to the node responsible for K1, which stores the pair (K1,V1); retrieve(K1) follows the same route to the value)
22. PAST/Pastry node id space
- 128 bits (max. 2^128 nodes)
- L levels, b = 128/L bits per level; a nodeId is a sequence of L base-2^b (b-bit) digits
- circular namespace, from 0 to 2^128 − 1
(diagram: node ids placed around a ring)
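The digit decomposition of a nodeId can be sketched directly from the definitions above; the function name is illustrative:

```python
def node_id_digits(node_id, b=2, bits=128):
    """Split a nodeId into L = bits/b base-2^b digits,
    most significant digit first (b = 2 gives base-4 digits)."""
    L = bits // b
    mask = (1 << b) - 1                  # selects one b-bit digit
    return [(node_id >> (b * (L - 1 - i))) & mask for i in range(L)]
```

These digits are exactly what the routing table on the next slide is indexed by: row n looks at digit n.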
23. State of a Pastry node
- Routing table
- entries consist of a nodeId and the IP address of that node
- ceil(log_{2^b} N) levels; each level corresponds to a row
- 2^b − 1 entries per level, i.e. columns per row
- each entry in row n corresponds to a node whose nodeId matches the local nodeId in the first n digits and differs in digit n+1
(example table for node 123: row 0 holds entries of the form X…, row 1 holds 1Y…, row 2 holds 12Z…, where X, Y, Z range over 0,…,2^b − 1, excluding the local node's own digit at that position)
24. State of a Pastry node
- Leaf set
- l nearby nodes based on proximity in nodeId space
- Neighborhood set
- l nearby nodes based on a network proximity metric
- not used for routing
- used during node addition/recovery
(example: 16-bit nodeId space, l = 8, b = 2; leaf-set entries shown around node 10233102)
25. Routing requests in Pastry
- Route(my-id, key-id, message)
- if key-id is in the range of my leaf set
- forward to the numerically closest node in the leaf set
- else
- forward to a node-id in the routing table such that node-id shares a longer prefix with key-id than my-id does
- else
- forward to a node-id that shares a prefix with key-id of the same length as my-id does, but is numerically closer
Routing takes O(log N) messages
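A single routing step from the pseudocode above can be sketched as follows. This is a simplified Python illustration with hypothetical names; it represents nodes as base-4 digit sequences (i.e. it assumes b = 2) and takes the leaf set and routing table as plain lists:

```python
def shared_prefix_len(a, b):
    """Number of leading digits two digit sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_digits, key_digits, leaf_set, routing_table):
    """One Pastry routing step (sketch). leaf_set is a list of digit
    sequences; routing_table[row][col] is a digit sequence or None.
    Returns the node to forward to, or my_digits if this node is
    already the numerically closest."""
    def value(d):                        # numeric value, base 4 (b = 2)
        v = 0
        for digit in d:
            v = v * 4 + digit
        return v

    key = value(key_digits)
    # Case 1: key falls within the leaf set -> numerically closest node.
    candidates = leaf_set + [my_digits]
    lo = min(value(d) for d in candidates)
    hi = max(value(d) for d in candidates)
    if lo <= key <= hi:
        return min(candidates, key=lambda d: abs(value(d) - key))
    # Case 2: routing table entry sharing one more prefix digit.
    p = shared_prefix_len(my_digits, key_digits)
    entry = routing_table[p][key_digits[p]] if p < len(routing_table) else None
    if entry is not None:
        return entry
    # Case 3 (rare): any known node with an equally long prefix that is
    # numerically closer to the key than this node.
    known = leaf_set + [e for row in routing_table for e in row if e]
    closer = [d for d in known
              if shared_prefix_len(d, key_digits) >= p
              and abs(value(d) - key) < abs(value(my_digits) - key)]
    return min(closer, key=lambda d: abs(value(d) - key)) if closer else my_digits
```

Because each routing-table hop fixes at least one more digit of the key, the number of hops is bounded by the number of digits, giving the O(log N) claim.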
26. Node addition
- X: the joining node
- A: a node nearby X (network proximity)
- Z: the node with nodeId numerically closest to X
- Routing state of X
- leaf-set(X) = leaf-set(Z)
- neighborhood-set(X) = neighborhood-set(A)
- routing table of X, row i = routing table of Ni, row i, where Ni is the ith node encountered along the route from A to Z
- X notifies all nodes in leaf-set(X), which update their state
(diagram: A = 10 routes Lookup(216) via intermediate nodes 36 and 240 to Z = 210)
27. Node failures, recovery
- Rely on a soft-state protocol to deal with node failures
- Neighboring nodes in the nodeId space periodically
- exchange keepalive msgs
- nodes unresponsive for a period T are removed from leaf sets
- a recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
- Randomized routing to deal with malicious nodes that can cause repeated query failures
Pastry details buried in the Middleware 2001 paper
28. PAST storage management
- Goals
- high global storage utilization
- graceful degradation near maximal utilization
- Design goals
- local coordination among nodes
- fully integrate storage management with file insertion
- modest performance overheads
- Challenge
- balancing unused storage among nodes vs. the requirement to maintain copies of each file at the k nodes with nodeIds closest to the fileId
29. Storage load imbalance
- Causes
- storage capacity differences among individual PAST nodes
- high variance in the file size distribution
- statistical variation in fileId and nodeId assignments
- Impact
- not all of the k closest nodes can accommodate a file replica
- 3 solutions to deal with imbalances
30. (1) Per-node storage control
- No more than 2 orders of magnitude difference in storage capacity among individual nodes is assumed
- Advertised capacity controls admission of new nodes (compared to the average capacity)
- too large: split into multiple nodeIds
- too small: reject
31. (2) Replica diversion
- Necessary when a node A among the k closest (to the fileId) cannot accommodate the file copy locally
- GOAL: balance the unused storage space among the nodes in a leaf set
- Node A diverts the copy to a node B in its leaf set if
- B is not among the k closest
- B does not already have a diverted replica
- Replica diversion is controlled by 3 policies that avoid the performance penalty of unnecessary diversion
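The eligibility test for a diversion target can be sketched as follows. The "prefer the node with the most free space" tie-break is an illustrative choice, not stated on the slide, and the dict-based node representation is invented for the example:

```python
def choose_diversion_target(leaf_set, file_size, k_closest_ids):
    """Pick a leaf-set node B to hold a diverted replica (sketch).
    B must not be among the k nodes closest to the fileId, must not
    already hold a diverted replica, and must have enough free space.
    Among eligible nodes, prefer the one with the most free space."""
    eligible = [b for b in leaf_set
                if b['id'] not in k_closest_ids
                and not b['has_replica']
                and b['free'] >= file_size]
    return max(eligible, key=lambda b: b['free']) if eligible else None
```

If no leaf-set node qualifies, the insert fails at this level and file diversion (next slide) takes over.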
32. (3) File diversion
- Necessary when a file insert fails even with replica diversion
- GOAL: balance the unused storage space among different portions of the nodeId space in PAST
- the client generates a new fileId for the file and retries up to 3 times
- the application is notified after 4 successive file insert failures
- it can then retry with a smaller file size or a smaller k (number of replicas)
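The retry loop on this slide can be sketched as below; `try_insert` is a stand-in for the whole PAST insert path (including replica diversion), and the function name is invented:

```python
import hashlib
import os

def insert_with_diversion(name, credentials, k, file_bytes,
                          try_insert, max_retries=3):
    """File-diversion loop (sketch): on an insert failure, generate a
    fresh fileId by drawing a new random salt and retry, up to
    max_retries times; after that, report failure to the application.
    try_insert(file_id, k, file_bytes) -> bool stands in for the
    underlying PAST insert operation."""
    for attempt in range(1 + max_retries):      # 1 try + 3 retries
        salt = os.urandom(20)                   # new salt => new fileId
        h = hashlib.sha1(name.encode() + credentials.encode() + salt)
        file_id = h.hexdigest()
        if try_insert(file_id, k, file_bytes):
            return file_id
    return None  # caller may retry with a smaller file or smaller k
```

Because the fileId is re-randomized each time, each retry targets a different region of the nodeId space, which is exactly how file diversion spreads load.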
33. PAST cache management at nodes
- Why cache file copies?
- k replicas may not be enough for very popular files
- beneficial if there exists spatial locality among the clients of a particular file
- Goals
- minimize client access latencies (fetch distance, in terms of Pastry routing hops)
- maximize query throughput
- balance the query load in the system
34. Caching policies
- Insertion: a file routed through a node as part of an Insert or Lookup operation is cached if
- the file size is less than a fraction of the node's current cache size
- Replacement: GreedyDual-Size (GD-S) policy
- assign a weight H_d to each file d, inversely proportional to the file's size
- evict the file v with minimum weight H_v
- subtract H_v from the weights of all remaining cached files (enforces aging)
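The GD-S policy above can be sketched compactly. This sketch uses the standard inflation-value trick: instead of subtracting H_v from every remaining file on eviction, it raises a floor L by H_v, which is equivalent and O(1). The class name and per-file cost of 1 are illustrative choices:

```python
class GreedyDualSizeCache:
    """GreedyDual-Size replacement (sketch). Each cached file d gets
    weight H(d) = L + 1/size(d), i.e. inversely proportional to its
    size; on eviction the victim's weight becomes the new floor L,
    which ages all remaining files."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.L = 0.0                 # inflation value (aging floor)
        self.files = {}              # file_id -> (size, weight)

    def insert(self, file_id, size):
        if size > self.capacity:
            return False             # never cache oversized files
        while self.used + size > self.capacity:
            victim = min(self.files, key=lambda f: self.files[f][1])
            vsize, vweight = self.files.pop(victim)
            self.used -= vsize
            self.L = vweight         # same effect as subtracting H_v
        self.files[file_id] = (size, self.L + 1.0 / size)
        self.used += size
        return True

    def __contains__(self, file_id):
        return file_id in self.files
```

With cost fixed at 1, small files get high weights and survive longer, which matches the slide's bias toward caching many small popular files.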
35. Evaluation
- PAST implemented in Java
- network emulation within a single Java VM
- 2 workloads (based on NLANR traces) for file sizes
- 4 normal distributions of node storage sizes
36. Key results
- STORAGE
- replica and file diversion improved global storage utilization from 60.8% to 98% compared to no diversion
- insertion failures drop sharply
- caveat: storage capacities used in the experiments are 1000x below what might be expected in practice
- CACHING
- routing hops with caching are lower than without caching, even at 99% storage utilization
- caveat: median file sizes are very low; caching performance will likely degrade if this is higher
37. Questions
- Is Pastry really self-organizing?
- IP-multicast-based expanding ring search etc. is not viable
- getting the nearest network node externally for node joins/additions: how will you do this in practice?
- Is strong persistence overkill?
- makes the system needlessly complicated (especially w.r.t. replica maintenance and diversion policies)
- k < the number of replicas anyway
- How do caches purge copies of Reclaimed files?
- How to deal with arbitrarily large files?
- Isn't CFS's block-based storage scheme much better in this case?