
1
Pastry and Past
Alex Shraer
Based on slides by Peter Druschel and Gabi Kliot
(CS Department, Technion)
2
Sources
  • Storage management and caching in PAST, a
    large-scale persistent peer-to-peer storage
    utility
  • Antony Rowstron (Microsoft Research)
  • Peter Druschel (Rice University)
  • Pastry: scalable, decentralized object location
    and routing for large-scale peer-to-peer systems
  • Antony Rowstron (Microsoft Research)
  • Peter Druschel (Rice University)

3
PASTRY
scalable, decentralized object location and
routing for large-scale peer-to-peer systems
4
Pastry
  • Generic p2p location and routing substrate (DHT)
  • Self-organizing overlay network (join,
    departures, locality repair)
  • Consistent hashing
  • Lookup/insert of an object in < log_{2^b} N routing steps
    (expected)
  • O(log N) per-node state
  • Network locality heuristics
  • Scalable, fault resilient, self-organizing, locality aware

5
Pastry: Object distribution
(Figure: circular id space from 0 to 2^128 - 1, with nodeIds and an
objId/key placed on the ring)
  • Consistent hashing
  • 128-bit circular id space
  • nodeIds (uniform random)
  • objIds/keys (uniform random)
  • Invariant: the node with the numerically closest nodeId
    maintains the object (see the sketch below)

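A minimal sketch (my own illustration, not from the slides) of this
invariant: the object is stored on the node whose nodeId is
numerically closest to its objId on the circular id space. It assumes
a full view of all nodeIds, which no real Pastry node has.

    ID_SPACE = 2 ** 128

    def circular_distance(a: int, b: int) -> int:
        """Distance between two ids on the circular 128-bit id space."""
        d = abs(a - b) % ID_SPACE
        return min(d, ID_SPACE - d)

    def responsible_node(obj_id: int, node_ids: list) -> int:
        """Return the nodeId numerically closest to obj_id."""
        return min(node_ids, key=lambda n: circular_distance(n, obj_id))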
6
Pastry: Object insertion/lookup
(Figure: circular id space from 0 to 2^128 - 1; Route(X) travels
around the ring toward the node responsible for X)
A message with key X is routed to the live node with the nodeId
closest to X. Problem: a complete routing table is not feasible.
7
Pastry: Routing table (example nodeId 10233102)
(Figure: routing table of node 10233102, rows 0, 1, 2, ...)
  • Leaf set: L nodes with nodeIds numerically closest to the local
    node
  • Routing table: about log_{2^b} N rows (at most
    log_{2^b} 2^128 = 128/b), with 2^b columns per row
  • Neighborhood set: L nearby nodes
    (a digit/row sketch follows below)
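As a concrete illustration (my own sketch, not from the slides), the
routing table position for a destination follows directly from the
base-2^b digits of the ids: the row is the length of the shared
prefix, the column is the destination's next digit.

    B = 4                      # bits per digit, so 2^b = 16 columns per row
    DIGITS = 128 // B          # digits per id = maximum number of rows

    def digit(node_id: int, i: int) -> int:
        """The i-th base-2^b digit of a 128-bit id (0 = most significant)."""
        shift = (DIGITS - 1 - i) * B
        return (node_id >> shift) & ((1 << B) - 1)

    def shared_prefix_len(x: int, y: int) -> int:
        """Number of leading base-2^b digits x and y have in common."""
        for i in range(DIGITS):
            if digit(x, i) != digit(y, i):
                return i
        return DIGITS

    def table_position(local_id: int, dest: int) -> tuple:
        """(row, column) of the routing table entry used toward dest."""
        row = shared_prefix_len(local_id, dest)
        return row, digit(dest, row)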
8
Pastry: Leaf sets
  • Each node maintains the IP addresses of the nodes with the L/2
    numerically closest larger and the L/2 numerically closest
    smaller nodeIds.
  • routing efficiency/robustness
  • fault detection (keep-alive)
  • application-specific local coordination

9
Pastry: Routing procedure

    if (destination D is within the range of our leaf set)
        forward to the numerically closest leaf-set member
    else
        let l = length of the prefix shared with D
        let d = value of the l-th digit of D's address
        if (routing table entry R[l][d] exists)
            forward to R[l][d]
        else
            forward to a known node (from the leaf set, routing
            table, or neighborhood set) that (a) shares at least as
            long a prefix with D and (b) is numerically closer to D
            than this node

Unless L/2 adjacent nodes in the leaf set fail simultaneously, at
least one such node must be alive (a Python sketch follows below).
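To make the procedure concrete, here is a minimal sketch (my own
illustration, not the authors' code). It reuses shared_prefix_len and
table_position from the sketch after slide 7, assumes hypothetical
node attributes (node_id, leaf_set, leaf_min, leaf_max, table,
neighborhood), and simplifies numeric closeness to absolute
difference.

    def route(local, dest: int):
        """One Pastry routing step at node `local`: return the next
        hop, or None if `local` itself is responsible for dest."""
        # Case 1: dest falls within the range covered by our leaf set.
        if local.leaf_min <= dest <= local.leaf_max:
            closest = min(local.leaf_set + [local],
                          key=lambda n: abs(n.node_id - dest))
            return None if closest is local else closest
        # Case 2: use the routing table entry R[l][d].
        row, col = table_position(local.node_id, dest)
        entry = local.table[row][col]
        if entry is not None:
            return entry
        # Case 3 (rare): any known node sharing at least as long a
        # prefix with dest and numerically closer than this node.
        known = (local.leaf_set
                 + [n for r in local.table for n in r if n is not None]
                 + local.neighborhood)
        for n in known:
            if (shared_prefix_len(n.node_id, dest) >= row
                    and abs(n.node_id - dest) < abs(local.node_id - dest)):
                return n
        return None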
10
Pastry: Routing example
(Figure: Route(d46a1c) from node 65a1fc; successive hops d13da3,
d4213f, d462ba, d467c4 share progressively longer prefixes with the
key, near nodes d471f1 and d46a1c)
  • Properties
  • < log_{2^b} N routing steps
  • O(log_{2^b} N) state
11
Pastry: Routing
  • Integrity of overlay
  • guaranteed unless L/2 nodes with adjacent nodeIds fail
    simultaneously
  • Number of routing hops
  • no failures: < log_{2^b} N expected, 128/b + 1 max
  • during failure recovery: O(N) worst case, average case much
    better
12
Pastry: Locality properties
  • Assumption: a scalar proximity metric
  • e.g. ping/RTT delay, IP hops, geographical
    distance
  • a node can probe distance to any other node
  • Proximity invariant
  • Each routing table entry refers to a node close to the local node
    (in the proximity space), among all nodes with the appropriate
    nodeId prefix.

13
Pastry: Geometric routing in proximity space
(Figure: a route shown both in the proximity space and in the nodeId
space)
  • The proximity distance traveled by the message in each routing
    step is exponentially increasing (the entry in row l is chosen
    from a set of nodes of size N/2^(bl))
  • The distance traveled by message from its source
    increases monotonically at each step (message
    takes exponentially larger strides each step)

14
Pastry: Locality properties
  • Simulations show:
  • Expected distance traveled by a message in the
    proximity space is within a small constant of the
    minimum
  • Among k nodes with nodeIds closest to the key,
    message likely to reach the node closest to the
    source node first
  • The nearest copy is found in 76% of lookups
  • One of the two nearest copies in 92% of lookups

15
Pastry: Self-organization
  • Initializing and maintaining routing tables and
    leaf sets
  • Node addition
  • Node departure (failure)
  • The goal: maintain every routing table entry so that it refers to
    a nearby node, among all live nodes with the appropriate prefix

16
Pastry: Node addition
  • New node X contacts a nearby node A
  • A routes a join message with key X, which arrives at Z, the node
    with nodeId numerically closest to X
  • X obtains its leaf set from Z, and the i-th row of its routing
    table from the i-th node on the path from A to Z
  • X informs any nodes that need to be aware of its arrival
  • X also improves the locality of its table by requesting
    neighborhood sets from all nodes X knows
  • In practice: an optimistic approach (see the sketch below)
17
Pastry: Node addition
(Figure: new node X = d46a1c contacts nearby node A = 65a1fc; the
join message Route(d46a1c) travels via d13da3, d4213f, d462ba and
arrives at Z = d467c4, the node numerically closest to X)
18
Pastry: Node addition
X is close to A, and B is close to B1. Why is X also close to B1?
Because the expected distance from B to its row-one entries (such as
B1) is much larger than the expected distance from A to B (which is
chosen from a set of exponentially decreasing size).
19
Node departure (failure)
  • Leaf set repair (eager: performed all the time)
  • Leaf set members exchange keep-alive messages
  • In case a node in the leaf set fails, request set
    from furthest live node in set. Update the
    leafset and notify the nodes that were added to
    the leafset
  • Routing table repair (lazy: upon failure)
  • request entries from peers in the same row; if none are found,
    from higher rows
  • Neighborhood set repair (eager)
  • Periodically contact neighbors. If a neighbor
    failed take neighbor lists from other neighbors,
    check distances, and update your list with the
    closest nodes found

20
Randomized Routing
  • So far, the routing is deterministic. If a node
    in the routing path has failed or refuses to pass
    the message, re-transmitting will not help.
  • At each step, the message must be forwarded to a node whose ID
    shares at least as long a prefix with the key, but is numerically
    closer than the current node
  • If there are several such candidate nodes, choose one randomly,
    heavily biased towards the closest (sketch below)
  • If routing fails, the client needs to retransmit
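A minimal sketch (my own illustration) of one way to implement the
heavily biased random choice, using exponentially decaying weights
over candidates sorted by proximity; the decay weighting is an
assumption, not specified on the slide.

    import random

    def pick_next_hop(candidates: list, decay: float = 0.25):
        """Pick a next hop at random, heavily biased to the closest.

        candidates: (proximity, node) pairs; lower proximity = closer.
        """
        ranked = sorted(candidates, key=lambda c: c[0])
        weights = [decay ** i for i in range(len(ranked))]
        return random.choices([n for _, n in ranked],
                              weights=weights, k=1)[0]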

21
Pastry: Distance traveled
Only 30-40% longer than the optimum. Not bad, considering that Pastry
stores only 75 entries in the routing table, instead of the 99,999
entries of a complete routing table.
(L = 16, 100k random queries; proximity in an emulated network; nodes
placed randomly)
22
Pastry: Summary
  • Generic p2p overlay network
  • Scalable, fault resilient, self-organizing,
    secure
  • O(log_{2^b} N) routing steps (expected)
  • O(log_{2^b} N) routing table size
  • Network locality properties
23
Storage management and caching in PAST, a
large-scale, persistent peer-to-peer storage
utility
24
INTRODUCTION
  • PAST system
  • Internet-based, peer-to-peer global storage
    utility
  • Characteristics
  • strong persistence, high availability (by using k
    replicas)
  • scalability (due to efficient Pastry routing)
  • short insert and query paths
  • query load balancing and latency reduction (due
    to wide dispersion, Pastry locality and caching)
  • security
  • Composed of nodes connected to the Internet; each node has a
    128-bit nodeId
  • Uses Pastry as an efficient routing scheme
  • No support for mutable files, searching,
    directory lookup

25
INTRODUCTION
  • Function of nodes
  • store replicas of files
  • initiate and route client requests to insert or
    retrieve files in PAST
  • File-related properties
  • inserted files have a quasi-unique fileId
  • a file is replicated across multiple nodes
  • to retrieve a file, a client must know its fileId and, if
    necessary, its decryption key
  • fileId: 160 bits, computed as the SHA-1 hash of the file name,
    the owner's public key, and a random salt (see the sketch below)
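A minimal sketch of the fileId computation, assuming a byte-string
public key and salt (the exact serialization is not specified on the
slides):

    import hashlib

    def compute_file_id(name: str, owner_public_key: bytes,
                        salt: bytes) -> int:
        """160-bit fileId = SHA-1 over name, owner's key, and salt."""
        h = hashlib.sha1()
        h.update(name.encode("utf-8"))
        h.update(owner_public_key)
        h.update(salt)
        return int.from_bytes(h.digest(), "big")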

26
PAST Operation
  • Insert: fileId = Insert(name, owner-credentials, k, file)
  • fileId is computed (hash of the file name, public key, etc.)
  • the request message reaches one of the k nodes closest to fileId
  • that node accepts a replica of the file and forwards the message
    to the other k-1 closest nodes in its leaf set
  • once k nodes accept, an ack message with store receipts is passed
    to the client; clients can be charged for the storage
  • Lookup: file = Lookup(fileId)
  • retrieves a copy of the file, if it was inserted earlier and one
    of the k nodes that store it is connected to the network; the
    closest such node will usually provide the copy
  • Reclaim: Reclaim(fileId, owner-credentials)
  • after this, retrieval of the file is no longer guaranteed
  • unlike a delete operation, Reclaim does not guarantee that the
    file becomes inaccessible; these weaker semantics simplify the
    algorithm (a client API sketch follows below)
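A sketch of the client-facing operations as named on this slide; the
class, method names, and types are illustrative, not the authors'
code.

    class PastClient:
        def insert(self, name: str, owner_credentials, k: int,
                   data: bytes) -> int:
            """Insert a file with k replicas; returns the fileId.
            May fail if fewer than k nodes can accept the file."""
            ...

        def lookup(self, file_id: int) -> bytes:
            """Retrieve a copy, usually from the closest replica."""
            ...

        def reclaim(self, file_id: int, owner_credentials) -> None:
            """Reclaim storage; retrieval no longer guaranteed."""
            ...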

27
STORAGE MANAGEMENT - why?
  • Ensure availability of files
  • Balance the storage load
  • Provide graceful degradation in performance as
    the system (globally) runs out of storage

28
STORAGE MANAGEMENT
  • Responsibility
  • Replicas of files are maintained by k nodes with
    nodeId closest to fileId.
  • Why is this a good thing?
  • It creates a conflict: what if these k nodes have insufficient
    storage space, while other nodes have plenty?
  • Challenge: balance the free storage space among nodes
  • Causes of such load imbalance
  • the number of files assigned to each node differs
  • the size of each inserted file differs
  • the storage capacity of each node differs
  • Solution: replica diversion and file diversion

29
STORAGE MANAGEMENT: Replica Diversion
  • Purpose: balance the remaining free storage space among the nodes
    in a leaf set
  • Diversion steps of a node A that received an insertion request
    but has insufficient space:
  • choose a node B among the nodes in A's leaf set, such that
  • B does not already hold a diverted replica
  • B is not one of the k closest (where the file will be stored
    anyway)
  • ask B to store a copy
  • enter an entry in A's file table with a pointer to B
  • send a store receipt as usual

30
Replica Diversion (continued)
  • If B fails, the replica should be stored elsewhere
  • how this is handled is described later
  • If A fails, the replica at B should remain available
  • otherwise the probability that all k replicas are inaccessible
    would double with each replica diversion
  • so the (k+1)-th closest node C is asked to keep a pointer to B
  • if A fails, the k closest nodes still hold replicas (or pointers
    to them)
  • if C fails, A asks the new (k+1)-th node to keep a pointer to B
  • Cost
  • A and C both store an additional entry in their file tables (the
    pointer to B)
  • a few additional RPCs during insert and during lookup

31
Replica Diversion (continued)
  • A node rejects a file if file_size / remaining_storage > t
  • meaning: the file would consume more than a fraction t of the
    remaining storage on the node
  • primary replica stores (among the k closest) use t = t_pri
  • diverted replica stores (not among the k closest) use t = t_div
  • t_pri > t_div
  • Some properties of this policy
  • avoids unnecessary diversion when a node still has space
  • prefers diverting large files, minimizing the number of
    diversions
  • prefers accepting primary replicas over diverted replicas
  • A primary store A that rejects a file diverts it to a node B such
    that (see the sketch below)
  • B is a node in the leaf set of A
  • B is not already a primary or diverted store for this replica
  • B has the most free space among all such candidate nodes
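A minimal sketch of the acceptance and diversion-target policy. The
threshold values are the ones reported on slide 39 (t_pri = 0.1,
t_div = 0.05); the node objects and their free_space attribute are
illustrative.

    T_PRI, T_DIV = 0.1, 0.05

    def accepts(file_size: int, free_space: int, primary: bool) -> bool:
        """Accept a replica only if it uses less than a fraction t of
        the node's remaining free storage."""
        t = T_PRI if primary else T_DIV
        return free_space > 0 and file_size / free_space < t

    def choose_divert_target(leaf_set, file_size: int, holders: set):
        """Pick B: in the leaf set, not already holding this replica,
        willing to accept as a diverted store, with most free space."""
        candidates = [n for n in leaf_set
                      if n not in holders
                      and accepts(file_size, n.free_space, False)]
        return max(candidates, key=lambda n: n.free_space, default=None)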

32
Replica Diversion continued
  • If the chosen node B also rejects the replica
  • nodes that already stored a replica discard it
  • a negative ack message is returned to the client, causing a file
    diversion

33
STORAGE MANAGEMENT: File Diversion
  • Purpose: balance the remaining free storage space among different
    portions of the nodeId space in the network
  • The client node generates a new fileId using a different salt
    value and reissues the file insert (a retry sketch follows below)
  • This is repeated at most 4 times
  • If the fourth attempt fails
  • reduce the file size by fragmenting
  • reduce k (the number of replicas)
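A minimal sketch of client-side file diversion. It assumes the insert
operation accepts an explicit salt parameter (an illustrative
variation of the signature on slide 26) and that a hypothetical
InsufficientStorage error signals that k replicas could not be
placed.

    import os

    class InsufficientStorage(Exception):
        """Raised when k replicas cannot be placed (illustrative)."""

    MAX_ATTEMPTS = 4

    def insert_with_diversion(past_client, name, credentials, k, data):
        # Each retry uses a fresh salt, giving a new fileId and thus
        # a different set of k storage nodes (file diversion).
        for _ in range(MAX_ATTEMPTS):
            salt = os.urandom(20)
            try:
                return past_client.insert(name, credentials, k,
                                          data, salt)
            except InsufficientStorage:
                continue
        # After the final attempt: fragment the file or reduce k.
        raise InsufficientStorage("insert failed after file diversion")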

34
STORAGE MANAGEMENT: node strategy to maintain k replicas
  • In Pastry, neighboring nodes in the nodeId space exchange
    keep-alive messages. On a timeout:
  • remove the failed node from the leaf set
  • include the live node with the next closest nodeId
  • A change in the leaf set affects the replicas
  • if the failed node stored a file (as a primary or diverted
    replica holder), the primary store(s) assign another node to keep
    the replica
  • there might not be space in the leaf set for another replica; in
    that case the number of replicas may temporarily drop below k
  • To cope with the failure of a primary that diverted a replica,
    the diversion pointers are replicated
  • Optimization: a joining node may, instead of requesting and
    copying a replica, install a pointer to the previous replica
    holder (a node that is no longer in the leaf set) in its file
    table, as in replica diversion, followed by gradual migration

35
STORAGE MANAGEMENT: Fragmenting and file encoding
  • Instead of replication, it is possible to use erasure coding
  • for example, Reed-Solomon
  • Suppose the file has n blocks
  • to tolerate m failures, we can replicate the file m times: m*n
    extra blocks
  • instead, we can add m checksum blocks (for example), such that
    any n blocks out of the m+n can restore the file
  • This approach fragments the file
  • Although erasure coding may seem clearly better than replication,
    it has its disadvantages (a worked example follows below)
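To make the storage comparison concrete (an illustrative example, not
from the slides): for a file of n = 8 blocks that must survive m = 2
failures, replication stores 2 extra whole copies, i.e. 2 * 8 = 16
additional blocks, while coding stores only m = 2 checksum blocks;
any 8 of the 10 stored blocks suffice to restore the file. Note that
the two schemes tolerate failures at different granularity (whole
copies vs. individual blocks), which is part of the trade-off
discussed on the next slide.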

36
Erasure coding vs. Replication
  • Some pros and cons of erasure coding
  • improves balancing of disk utilization in the
    system
  • Same availability for much less storage (or
    much more availability for the same storage)
  • Should probably be preferred when there are a lot
    of failures
  • With replication, the data object can be downloaded from the
    replica closest to the client, whereas with coding the download
    latency is bounded by the distance to the n-th closest fragment
  • The need for coding and decoding adds complexity to the system
    design
  • The whole object needs to be downloaded and reconstructed
    (with replication, a single block of the object can be downloaded
    on its own)
  • Higher network load (need to contact several
    nodes to retrieve a file)

37
CACHING
  • GOAL: minimize client access latency, maximize query throughput,
    and balance the query load
  • The k replicas are kept mainly for file availability, although
    they also help balance the access load, and proximity-aware
    routing minimizes access latency. But sometimes that is not
    enough.
  • Examples
  • a popular object may require many more than k replicas to sustain
    its load while keeping access time and network traffic low
  • suppose a file is popular among a cluster of clients; it is
    better to keep a copy near that cluster

38
CACHING (continued)
  • Caching: create and maintain additional copies of highly popular
    files in the unused disk space of nodes
  • Evict cached files when storage is needed
  • cache performance therefore decreases as system utilization
    increases
  • During a successful insertion or lookup, the file is inserted
    into the cache of every node along the route (unless it is larger
    than some fraction c of the node's free storage)
  • GreedyDual-Size (GD-S) replacement policy (a sketch follows
    below)
  • a weight w_f = cost(f) / size(f) is assigned to each cached file
  • the file with the lowest w_f is evicted
  • its w_f is subtracted from the weights of all remaining cached
    files
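A minimal sketch of GD-S eviction as described above. In the classic
formulation the subtraction is implemented lazily with a global
inflation value; the direct form from the slide is shown here.

    def admit(cache: dict, file_id: str, cost: float,
              size: float) -> None:
        """Insert a file with weight w_f = cost/size."""
        cache[file_id] = cost / size

    def evict_one(cache: dict) -> str:
        """cache maps file_id -> w_f. Evict the lowest-weight file
        and subtract its weight from all remaining entries."""
        victim = min(cache, key=cache.get)
        w_min = cache.pop(victim)
        for f in cache:
            cache[f] -= w_min
        return victim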

39
EXPERIMENTAL RESULTS: Effects of storage management
  • No diversion (t_pri = 1, t_div = 0)
  • max utilization 60.8%
  • 51.1% of inserts failed
  • Replica/file diversion (t_pri = 0.1, t_div = 0.05)
  • max utilization > 98%
  • < 1% of inserts failed
  • The leaf set size shows the effect of local load balancing
  • Policy: accept a file if file_size / free_space < t

40
EXPERIMENTAL RESULTS: Determining the threshold values
  • Insertion statistics and utilization as t_pri is varied, with
    t_div = 0.05
  • Insertion statistics and utilization as t_div is varied, with
    t_pri = 0.1
  • The lower t_pri, the less likely a large file can be stored, so
    many small files can be stored instead -> the number of stored
    files increases, but utilization drops, since large files are
    rejected at low utilization levels
  • Similarly, as t_div increases, storage utilization improves, but
    fewer files are successfully inserted, for the same reasons
  • Policy: accept a file if file_size / free_space < t

41
EXPERIMENTAL RESULTS: Impact of file and replica diversion
  • The number of replica diversions is small, even at high
    utilization
  • at 80% utilization, fewer than 10% of the replicas are diverted
  • As long as utilization is below 95%, a file is rarely redirected
    more than once, and file diversions are very rare
  • Fewer than 16% of all replicas are diverted at 95% utilization

42
EXPERIMENTAL RESULTS: Caching
  • Global cache hit ratio and average number of message hops, as a
    function of storage utilization (N = 2250 nodes, b = 4, so
    routing alone takes about log_16 2250 = ~2.8, i.e. 3 hops)
  • Without caching: a constant number of hops until redirection
    begins, and then one more hop is required
  • As storage utilization and the number of files increase, cached
    files are replaced by replicas -> the cache hit ratio decreases
  • As the hit ratio drops, routing hops increase (however, no
    caching is still worse, even at 99% utilization)

43
CONCLUSION
  • Design and evaluation of PAST
  • storage management, caching
  • Nodes and files are assigned uniformly distributed IDs
  • Replicas of a file are stored at the k nodes closest to its
    fileId
  • Experimental results
  • storage utilization of 98% is achieved
  • below 5% of file insertions fail at 95% utilization
  • mostly large files are rejected
  • caching achieves load balancing, and reduces fetch distance and
    network traffic