Title: Pastry and PAST
1. Pastry and PAST
Alex Shraer
Based on slides by Peter Druschel and Gabi Kliot
(CS Department, Technion)
2. Sources
- "Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility" - Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)
- "Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems" - Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)
3. PASTRY: scalable, decentralized object location and routing for large-scale peer-to-peer systems
4. Pastry
- Generic p2p location and routing substrate (DHT)
- Self-organizing overlay network (join, departures, locality repair)
- Consistent hashing
- Lookup/insert of an object in < log_{2^b} N routing steps (expected)
- O(log N) per-node state
- Network locality heuristics
- Scalable, fault resilient, self-organizing, locality aware
5. Pastry: Object distribution
[Figure: circular 128-bit id space from 0 to 2^128 - 1, with nodeIds and an objId/key marked on the ring]
- Consistent hashing
- 128-bit circular id space
- nodeIds (uniform random)
- objIds/keys (uniform random)
- Invariant: the node with the numerically closest nodeId maintains the object (sketched below)
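The invariant is easy to state in code. A minimal sketch (not from the paper; the helper names and the wrap-around distance are illustrative assumptions):

    # Toy model of Pastry's invariant: a key is maintained by the live
    # node whose nodeId is numerically closest on the circular id space.
    RING = 2 ** 128

    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, RING - d)  # circular (wrap-around) distance

    def owner(key, node_ids):
        return min(node_ids, key=lambda n: ring_distance(n, key))

    nodes = [10, 2 ** 100, 2 ** 127]
    assert owner(3 * 2 ** 98, nodes) == 2 ** 100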
6. Pastry: Object insertion/lookup
[Figure: Route(X) traverses the ring to the node with nodeId closest to X]
- A message with key X is routed to the live node with nodeId closest to X
- Problem: a complete routing table is not feasible
7. Pastry: Routing table (of node 10233102)
[Figure: routing table rows 0 through 7; row l holds nodes that share the first l digits with 10233102]
- L nodes in the leaf set
- log_{2^b} N rows in use (the table actually has log_{2^b} 2^128 = 128/b rows)
- 2^b columns per row
- L neighbors in the neighborhood set
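To make the row/column structure concrete, here is a small sketch (assumed, not from the slides) that computes where another nodeId would sit in this table for b = 2, i.e. base-4 digits as in the example node 10233102:

    b = 2  # digits are base 2**b = 4

    def table_slot(local_id, other_id):
        # row = length of the shared digit prefix,
        # col = value of the first digit where the ids differ
        # (assumes the two ids are distinct and of equal length)
        row = 0
        while local_id[row] == other_id[row]:
            row += 1
        return row, int(other_id[row], 2 ** b)

    print(table_slot("10233102", "10223302"))  # -> (3, 2)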
8. Pastry: Leaf sets
- Each node maintains the IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively
- routing efficiency/robustness
- fault detection (keep-alive)
- application-specific local coordination
9. Pastry: Routing procedure

if (destination D is within range of our leaf set)
    forward to the numerically closest leaf-set member
else
    let l = length of the prefix shared between D and our nodeId
    let d = value of the l-th digit of D
    if (routing table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node (from the leaf set, routing table,
        or neighborhood set) that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node

Unless L/2 adjacent nodes in the leaf set fail simultaneously, at least one such node must be alive.
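A runnable sketch of this procedure, assuming hex nodeIds (b = 4), ignoring ring wrap-around, and using illustrative names (my_id, leaf_set, table):

    # Pastry routing decision, simplified: ids are equal-length hex strings.
    def prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def next_hop(my_id, key, leaf_set, table):
        # assumes key != my_id; table[row][col] is a nodeId or None
        num = lambda s: int(s, 16)
        ids = leaf_set + [my_id]
        # Case 1: key within leaf-set range -> numerically closest member
        if min(map(num, ids)) <= num(key) <= max(map(num, ids)):
            return min(ids, key=lambda n: abs(num(n) - num(key)))
        # Case 2: routing table entry matching one more digit of the key
        l = prefix_len(my_id, key)
        if table[l][int(key[l], 16)] is not None:
            return table[l][int(key[l], 16)]
        # Case 3 (rare): any known node with an equally long shared
        # prefix that is numerically closer to the key than we are
        known = [n for row in table for n in row if n] + leaf_set
        closer = [n for n in known if prefix_len(n, key) >= l
                  and abs(num(n) - num(key)) < abs(num(my_id) - num(key))]
        return min(closer, key=lambda n: abs(num(n) - num(key)), default=my_id)

Case 3 is the fallback whose success the leaf-set remark above guarantees, barring L/2 simultaneous adjacent failures.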
10. Pastry: Routing
[Figure: ring with nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1, d46a1c; Route(d46a1c) from 65a1fc resolves at least one more digit of the key at each hop]
- Properties
  - log_{2^b} N steps
  - O(log N) state
11. Pastry: Routing
- Integrity of the overlay
  - guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
- Number of routing hops
  - no failures: < log_{2^b} N expected, 128/b + 1 max
  - during failure recovery: O(N) worst case, average case much better
12. Pastry: Locality properties
- Assumption: a scalar proximity metric
  - e.g. ping/RTT delay, IP hops, geographical distance
  - a node can probe its distance to any other node
- Proximity invariant: each routing table entry refers to a node that is close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix
13. Pastry: Geometric routing in proximity space
[Figure: the same route drawn in the proximity space and in the nodeId space]
- The proximity distance traveled by a message in each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^{bl})
- The distance traveled by the message from its source increases monotonically at each step (the message takes exponentially larger strides at each step)
14. Pastry: Locality properties
- Simulations show:
  - the expected distance traveled by a message in the proximity space is within a small constant of the minimum
  - among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first
    - the nearest copy in 76% of lookups
    - one of the two nearest in 92% of lookups
15. Pastry: Self-organization
- Initializing and maintaining routing tables and leaf sets
- Node addition
- Node departure (failure)
- Goal: keep every routing table entry referring to a near node, among all live nodes with the appropriate prefix
16. Pastry: Node addition
- New node X contacts a nearby node A
- A routes a join message toward X's nodeId; it arrives at Z, the node closest to X
- X obtains its leaf set from Z, and the i-th row of its routing table from the i-th node on the path from A to Z
- X informs any nodes that need to be aware of its arrival
- X also improves its table's locality by requesting neighborhood sets from all nodes it knows
- In practice: an optimistic approach
17. Pastry: Node addition
[Figure: the join message Route(d46a1c) from A = 65a1fc travels via d13da3, d4213f, d462ba to Z = d467c4, the live node closest to the new node X = d46a1c]
18. Pastry: Node addition
X is close to A, and B is close to B1 (a row-one entry of B). Why is X also close to B1? Because the expected distance from B to its row-one entries (such as B1) is much larger than the expected distance from A to B: row entries are chosen from node sets of exponentially decreasing size, so the small A-to-B (and hence X-to-B) offset is negligible.
19. Node departure (failure)
- Leaf set repair (eager: runs all the time)
  - leaf set members exchange keep-alive messages
  - if a node in the leaf set fails, request the leaf set from the furthest live node in the set; update the leaf set and notify the nodes that were added to it
- Routing table repair (lazy: upon failure)
  - get a replacement entry from peers in the same row; if none is found, from higher rows
- Neighborhood set repair (eager)
  - periodically contact neighbors; if a neighbor has failed, take the neighbor lists from the other neighbors, check distances, and update your list with the closest nodes found
20. Randomized routing
- So far routing is deterministic: if a node on the routing path has failed or refuses to pass the message on, retransmitting will not help
- At each step, the message must be forwarded to a node whose nodeId shares at least as long a prefix with the key, and is numerically closer to it than the current node
- If there are several such candidate nodes, choose one at random, heavily biased towards the closest (see the sketch below)
- If routing fails, the client needs to retransmit
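The slides do not fix the exact bias, so the following sketch assumes exponentially decaying weights over the candidates ranked by closeness:

    import random

    def pick_next_hop(candidates, key):
        # candidates: nodeIds (ints) that share at least as long a
        # prefix with the key and are numerically closer than we are
        ranked = sorted(candidates, key=lambda n: abs(n - key))
        weights = [2.0 ** -i for i in range(len(ranked))]  # assumed bias
        return random.choices(ranked, weights=weights, k=1)[0]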
21. Pastry: Distance traveled
- 30-40% longer than the optimum. Not bad, considering that Pastry stores only 75 entries in the routing table, instead of the 99,999 entries a complete routing table would need
[Figure: |L| = 16, 100k random queries; proximity measured in an emulated network, nodes placed randomly]
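The 75-entry figure checks out, assuming N = 100,000 nodes and b = 4 (16 possible digit values):

    import math

    N, b = 100_000, 4
    rows = math.ceil(math.log(N, 2 ** b))  # log_16 100000 ~= 4.2 -> 5 rows
    entries = rows * (2 ** b - 1)          # 15 non-self columns per row
    print(entries, "vs", N - 1)            # -> 75 vs 99999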
22. Pastry: Summary
- Generic p2p overlay network
- Scalable, fault resilient, self-organizing, secure
- log_{2^b} N routing steps (expected)
- O(log_{2^b} N) routing table size
- Network locality properties
23. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
24. INTRODUCTION
- PAST system
  - Internet-based, peer-to-peer global storage utility
- Characteristics
  - strong persistence, high availability (by keeping k replicas)
  - scalability (due to efficient Pastry routing)
  - short insert and query paths
  - query load balancing and latency reduction (due to wide dispersion, Pastry locality, and caching)
  - security
- Composed of nodes connected to the Internet; each node has a 128-bit nodeId
- Uses Pastry as its efficient routing scheme
- No support for mutable files, searching, or directory lookup
25. INTRODUCTION
- Functions of nodes
  - store replicas of files
  - initiate and route client requests to insert or retrieve files in PAST
- File-related properties
  - inserted files have a quasi-unique fileId
  - a file is replicated across multiple nodes
  - to retrieve a file, a client must know its fileId and decryption key (if necessary)
  - fileId: 160 bits, computed as the SHA-1 hash of the file name, the owner's public key, and a random salt (sketched below)
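A sketch of the fileId computation as stated above; the concatenation order and salt length are assumptions, not specified on the slide:

    import hashlib, os

    def compute_file_id(name, owner_public_key):
        salt = os.urandom(20)            # fresh random salt
        h = hashlib.sha1()
        h.update(name.encode())          # file name
        h.update(owner_public_key)       # owner's public key (bytes)
        h.update(salt)
        return h.digest(), salt          # 160-bit fileId + the salt used

    file_id, salt = compute_file_id("report.pdf", b"...key bytes...")
    print(file_id.hex())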
26. PAST Operations
- Insert: fileId = Insert(name, owner-credentials, k, file)
  - fileId is computed (hash of the file name, public key, etc.)
  - the request message reaches one of the k nodes closest to fileId
  - that node accepts a replica of the file and forwards the message to the other k-1 nodes in its leaf set
  - once k nodes accept, an ack message with store receipts is returned to the client; clients can be charged for the storage
- Lookup: Lookup(fileId)
  - retrieves a copy of the file if it was inserted earlier and at least one of the k nodes that store it is connected to the network; the closest node will usually provide the copy
- Reclaim: Reclaim(fileId, owner-credentials)
  - after this, retrieval of the file is no longer guaranteed
  - unlike a delete operation, Reclaim does not guarantee that the file becomes inaccessible; these weaker semantics simplify the algorithm
27. STORAGE MANAGEMENT: why?
- Ensure availability of files
- Balance the storage load
- Provide graceful degradation in performance as the system (globally) runs out of storage
28. STORAGE MANAGEMENT
- Responsibility
  - replicas of a file are maintained by the k nodes with nodeIds closest to the fileId
  - why is this a good thing?
- This creates a conflict: what if these k nodes have insufficient storage space while other nodes have plenty?
- Challenge: balance free storage space among nodes
- Causes of such load imbalance
  - the number of files assigned to each node differs
  - the size of each inserted file differs
  - the storage capacity of each node differs
- Solution: replica diversion + file diversion
29. STORAGE MANAGEMENT: Replica diversion
- Purpose: balance the remaining free storage space among the nodes in a leaf set
- Diversion steps of node A (which received an insertion request but has insufficient space):
  - choose a node B among the nodes in A's leaf set such that
    - B does not already hold a diverted replica of the file
    - B is not one of the k closest nodes (where the file will be stored anyway)
  - ask B to store a copy
  - enter a file entry in A's table with a pointer to B
  - send a store receipt as usual
30. Replica diversion, continued
- If B fails, the replica should be stored elsewhere (described later)
- If A fails, the replica at B should remain available
  - otherwise the probability that all k replicas are inaccessible doubles with each replica diversion
- Ask the (k+1)-th closest node C to keep a pointer to B
  - if A fails, the k closest nodes still hold replicas (or a pointer to one)
  - if C fails, A asks the new (k+1)-th closest node to keep a pointer to B
- Cost
  - A and C both store an additional entry in their file tables (the pointer to B)
  - a few additional RPCs during insert and during lookup
31. Replica diversion, continued
- A node rejects a file if file_size / remaining_storage > t
  - i.e., the file would consume more than a fraction t of the node's remaining storage
- Primary replica stores (among the k closest) use t = t_pri
- Diverted replica stores (not among the k closest) use t = t_div
- t_pri > t_div
- Some properties (see the sketch after this list)
  - avoids unnecessary diversion when a node still has space
  - prefers diverting large files, minimizing the number of diversions
  - prefers accepting primary replicas over diverted replicas
- A primary store A that rejects the file diverts it to B, where
  - B is a node in A's leaf set
  - B is not already a primary or diverted store for this replica
  - B has the most free space among all such candidate nodes
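A sketch of the acceptance test and the choice of diverted store under the description above; the Node type and field names are illustrative, and the t values are the experimental settings quoted on slide 39:

    T_PRI, T_DIV = 0.1, 0.05

    def accepts(file_size, free_space, is_primary):
        # reject if the file would take more than a fraction t of the
        # node's remaining storage; primaries use the laxer threshold
        t = T_PRI if is_primary else T_DIV
        return free_space > 0 and file_size / free_space <= t

    def choose_diverted_store(leaf_set, holders):
        # leaf-set node with the most free space that is not already
        # a primary or diverted store for this replica
        candidates = [n for n in leaf_set if n.node_id not in holders]
        return max(candidates, key=lambda n: n.free_space, default=None)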
32. Replica diversion, continued
- If the chosen node B also rejects the replica:
  - nodes that already stored a replica discard it
  - a negative ack message is returned to the client, causing a file diversion
33. STORAGE MANAGEMENT: File diversion
- Purpose: balance the remaining free storage space among different portions of the nodeId space in the network
- The client node generates a new fileId using a different salt value and reissues the file insert (see the sketch below)
- This is repeated at most 4 times
- If the fourth attempt fails:
  - reduce the file size by fragmenting it
  - reduce k (the number of replicas)
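A sketch of the client-side loop; insert_rpc is a hypothetical stand-in for the PAST insert operation, and compute_file_id is the sketch from slide 25:

    def insert_with_file_diversion(name, owner_key, k, data, attempts=4):
        for _ in range(attempts):
            file_id, salt = compute_file_id(name, owner_key)  # fresh salt
            if insert_rpc(file_id, k, data):  # hypothetical insert RPC
                return file_id, salt
        # all attempts failed: fragment the file or reduce k instead
        raise RuntimeError("insert failed after 4 file diversions")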
34. STORAGE MANAGEMENT: maintaining k replicas
- In Pastry, neighboring nodes in the nodeId space exchange keep-alive messages; on a timeout:
  - remove the failed node from the leaf set
  - include the live node with the next closest nodeId
- A change in the leaf set affects the replicas
  - if the failed node stored a file (as a primary or diverted replica holder), the primary store(s) assign another node to keep the replica
  - there might not be space in the leaf set for another replica; in this case the number of replicas might temporarily drop below k
- To cope with the failure of a primary that diverted a replica: replicate the diversion pointers
- Optimization: a joining node may, instead of requesting and copying a replica, install in its file table a pointer to the previous replica holder (a node that is no longer in the leaf set), as in replica diversion, followed by gradual migration
35. STORAGE MANAGEMENT: Fragmenting and file encoding
- Instead of replication, it is possible to use erasure coding
  - for example, Reed-Solomon
- Suppose the file has n blocks
  - to tolerate m failures, we can replicate m times: m*n blocks
  - instead, we can add m checksum blocks (for example), such that any n blocks out of the n+m can restore the file (see the comparison below)
- This approach fragments the file
- Although erasure coding may seem clearly better than replication, it has its disadvantages
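A quick comparison of the two storage costs described above, for a file of n blocks tolerating m losses:

    def replication_blocks(n, m):
        return m * n          # m full copies, as on the slide

    def erasure_blocks(n, m):
        return n + m          # n data blocks + m checksum blocks;
                              # any n of the n + m suffice to rebuild

    print(replication_blocks(100, 4))  # -> 400
    print(erasure_blocks(100, 4))      # -> 104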
36. Erasure coding vs. replication
- Some pros and cons of erasure coding
  - (+) improves the balance of disk utilization across the system
  - (+) same availability for much less storage (or much more availability for the same storage)
  - (+) should probably be preferred when failures are frequent
  - (-) with replication, the data object can be downloaded from the replica closest to the client, whereas with coding the download latency is bounded by the distance to the n-th closest fragment
  - (-) coding and decoding add complexity to the system design
  - (-) the whole object needs to be downloaded and reconstructed (with replication, a single block can be downloaded)
  - (-) higher network load (several nodes must be contacted to retrieve a file)
37. CACHING
- GOAL: minimize client access latency, maximize query throughput, balance the query load
- The k replicas are kept mainly for availability, though they also help balance the access load, and proximity-aware routing minimizes access latency. Sometimes, however, this is not enough.
- Examples
  - a popular object may require many more than k replicas to sustain its load while keeping access time and network traffic low
  - suppose a file is popular among a cluster of clients; it is better to keep a copy near that cluster
38. CACHING, continued
- Caching: create and maintain additional copies of highly popular files in the unused disk space of nodes
  - evict cached files when storage is needed
  - cache performance decreases as system utilization increases
- During a successful insertion or lookup, the file is inserted into the cache of every node along the route (unless it is larger than some fraction c of the node's free storage)
- GreedyDual-Size (GD-S) replacement policy (see the sketch below)
  - a weight w_f = cost(f)/size(f) is assigned to each cached file
  - the file with the lowest w_f is evicted
  - this w_f is then subtracted from the weights of all remaining cached files
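A sketch of the eviction step exactly as stated above (weight = cost/size, evict the minimum, subtract its weight from the rest); the cost function is left abstract:

    def gds_insert(cache, file_id, cost, size):
        cache[file_id] = cost / size          # w_f = cost(f) / size(f)

    def gds_evict(cache):
        victim = min(cache, key=cache.get)    # lowest-weight file
        w = cache.pop(victim)
        for f in cache:                       # aging: keeps long-resident
            cache[f] -= w                     # popular files competitive
        return victim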
39. EXPERIMENTAL RESULTS: Effects of storage management
- No diversion (t_pri = 1, t_div = 0):
  - max utilization 60.8%
  - 51.1% of inserts failed
- Replica/file diversion (t_pri = 0.1, t_div = 0.05):
  - max utilization > 98%
  - < 1% of inserts failed
- Leaf set size shows the effect of local load balancing
- Policy: accept a file if file_size / free_space < t
40. EXPERIMENTAL RESULTS: Determining threshold values
- Insertion statistics and utilization as t_pri is varied, with t_div = 0.05
- Insertion statistics and utilization as t_div is varied, with t_pri = 0.1
- The lower t_pri, the less likely a large file can be stored, so many small files can be stored instead -> the number of stored files increases, but utilization drops, since large files are rejected at low utilization levels
- Similarly, as t_div increases, storage utilization improves, but fewer files are successfully inserted, for the same reasons
- Policy: accept a file if file_size / free_space < t
41. EXPERIMENTAL RESULTS: Impact of file and replica diversion
- The number of replica diversions is small even at high utilization
  - at 80% utilization, fewer than 10% of replicas are diverted
- As long as utilization is below 95%, a file is rarely diverted more than once, and file diversions are very rare
- Fewer than 16% of all replicas are diverted when utilization is 95%
42. EXPERIMENTAL RESULTS: Caching
- Global cache hit ratio and average number of message hops
[Figure: hit ratio and hops vs. utilization for 2250 nodes; without caching, hops stay at about log_16 2250 ~= 3 until diversion begins, then one more hop is required]
- No caching: a constant number of hops until diversion begins, then one more hop is required
- As storage utilization and the number of files increase, cached files are replaced by replicas -> the cache hit ratio decreases
- hit ratio drops -> routing hops increase (however, no caching is still worse, even at 99% utilization)
43. CONCLUSION
- Design and evaluation of PAST
  - storage management, caching
- Nodes and files are assigned uniformly distributed IDs
- Replicas of a file are stored at the k nodes closest to its fileId
- Experimental results
  - storage utilization of 98% achieved
  - below 5% file insertion failures at 95% utilization
  - mostly large files are rejected
  - caching achieves load balancing and reduces fetch distance and network traffic