Storage Management and Caching in PAST

Provided by: AlbertT9

Learn more at: https://www.cse.psu.edu

1
Storage Management and Caching in PAST

A Large-scale persistent
peer-to-peer storage utility

Presented by Albert Tannous
CSE 598D: Storage Systems, Dr. Bhuvan Urgaonkar
2
Introduction to PAST (1)

P2P systems can be characterized as distributed
systems in which all nodes have identical
capabilities and responsibilities and all
communication is symmetric
PAST: an Internet-based, P2P global storage
utility that aims to provide strong
persistence, high availability, scalability, and
security
PAST is based on a self-organizing,
Internet-based overlay network of storage nodes that
cooperatively route file queries, store multiple
replicas of files, and cache additional copies of
popular files
Storage nodes and files are each assigned
uniformly distributed identifiers

3
Introduction to PAST (2)

PAST is composed of nodes connected to the
Internet, where each node is capable of
initiating and routing client requests to insert
or retrieve files
PAST nodes form a self-organizing overlay network
Inserted files are replicated across multiple
nodes for availability
Advantages of PAST:
Exploits the multitude and diversity of nodes in
the Internet to achieve strong persistence and
high availability
No need for physical transport of storage media
to protect backup and archival data
No need for explicit mirroring to ensure high
availability and throughput for shared data

4
Introduction to PAST (3)

Each stored file is associated with a quasi-unique
fileId that is generated at the time of the
file's insertion into PAST
Stored files are immutable
The owner can share a file by distributing its
fileId (and the decryption key, if necessary)
Pastry is the routing scheme used to route
client requests to the proper nodes

5
Introduction to PAST (4)

A client needs the fileId (and maybe the
decryption key) to retrieve a file
PAST does not provide facilities for searching,
directory lookup, or key distribution.
PAST is intended as an archival storage and
content distribution utility, not a
general-purpose file system

6
PAST Overview (1)

Any host connected to the Internet can be a PAST
node; it only needs to run the PAST software
Set of operations exported to PAST clients:
fileId Insert(name, owner-credentials, k, file)
file Lookup(fileId)
Reclaim(fileId, owner-credentials)
Each PAST node is assigned a 128-bit node
identifier, called a nodeId
A set of nodes with adjacent nodeIds is an excellent
candidate for storing the replicas of a file, since such
nodes are likely to be diverse in geographic location,
ownership, jurisdiction, and network attachment

7
PAST Overview (2)

During an insert operation, PAST stores the file
on the k PAST nodes whose nodeIds are numerically
closest to the 128 most significant bits (msb) of
the file's fileId
k is chosen to meet the availability needs of a
file, relative to the expected failure rates of
individual nodes
PAST is layered on top of Pastry
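
The placement rule on this slide can be sketched in a few lines. This is only an illustration, not PAST's implementation: the function names are hypothetical, the fileId is taken to be a 160-bit SHA-1 value, and distance is assumed to be measured on a circular 128-bit nodeId space.

```python
ID_BITS = 128
ID_SPACE = 1 << ID_BITS  # size of the nodeId space (assumed circular)


def circular_distance(a: int, b: int) -> int:
    """Distance between two ids on a ring of size 2**128 (assumption)."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def replica_nodes(file_id: int, node_ids: list[int], k: int) -> list[int]:
    """Return the k nodeIds numerically closest to the 128 msb of a 160-bit fileId."""
    key = file_id >> (160 - ID_BITS)  # keep only the 128 most significant bits
    return sorted(node_ids, key=lambda n: circular_distance(n, key))[:k]


if __name__ == "__main__":
    nodes = [10, 20, 30, ID_SPACE - 5]
    file_id = 12 << 32  # a 160-bit fileId whose 128 msb equal 12
    print(replica_nodes(file_id, nodes, k=3))
```

With these toy values the node just below the wrap-around point is chosen ahead of node 30, because it is closer on the ring.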

8
Pastry

Pastry: a P2P request routing and content
location scheme
Pastry is efficient, scalable, fault resilient
and self-organizing
Given a fileId, Pastry routes an associated message
towards the node whose nodeId is numerically closest
to the 128 msb of the fileId, among all live nodes
A file can be located unless all k nodes whose
nodeIds are numerically closest to the 128 msb
of the fileId fail within a single recovery period
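
How k relates to availability can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each replica holder fails independently with probability p_fail within one recovery period; the slides do not state a failure model, and the function names are hypothetical.

```python
def unavailability(p_fail: float, k: int) -> float:
    """Probability that all k replica holders fail in one recovery period,
    assuming independent failures (an assumption, not from the slides)."""
    return p_fail ** k


def min_replicas(p_fail: float, max_unavailability: float) -> int:
    """Smallest k whose unavailability does not exceed the target."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < max_unavailability < 1.0:
        raise ValueError("probabilities must lie strictly between 0 and 1")
    k = 1
    while unavailability(p_fail, k) > max_unavailability:
        k += 1
    return k


if __name__ == "__main__":
    print(unavailability(0.5, 3))
    print(min_replicas(0.2, 0.01))
```

Under this model each extra replica multiplies the chance of losing access by p_fail, which is why a modest k already gives high availability when nodes are diverse.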

9
PAST Operations (1)

Insert request: A fileId is computed as the SHA-1
hash of the file's textual name, the client's
public key, and a random salt. The required
storage (file size times k) is debited against
the client's storage quota, and a file
certificate is issued and signed with the owner's
private key
The file and the certificate are routed to the
first of the k nodes closest to the fileId. That node
accepts the replica and forwards it to the k-1
other nodes
When all k nodes have accepted the replica, the
client receives an acknowledgement
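
The fileId computation and quota debit described above might look like the following. The slides only say which inputs are hashed; the length-prefixed encoding, the salt size, and both function names are assumptions made for this sketch.

```python
import hashlib
import os


def compute_file_id(name: str, owner_public_key: bytes, salt: bytes) -> bytes:
    """SHA-1 over the file name, the owner's public key, and a salt.
    The length-prefixed layout is an assumption, not PAST's wire format."""
    h = hashlib.sha1()
    for part in (name.encode("utf-8"), owner_public_key, salt):
        h.update(len(part).to_bytes(4, "big"))  # prefix keeps fields unambiguous
        h.update(part)
    return h.digest()  # 20 bytes = 160 bits


def debit_quota(remaining_quota: int, file_size: int, k: int) -> int:
    """Charge file_size * k against the quota; raise if it does not fit."""
    required = file_size * k
    if required > remaining_quota:
        raise ValueError("insufficient storage quota")
    return remaining_quota - required


if __name__ == "__main__":
    salt = os.urandom(8)  # hypothetical salt length
    file_id = compute_file_id("report.pdf", b"example-public-key", salt)
    print(file_id.hex(), debit_quota(1000, 100, 5))
```

Because the salt is an input to the hash, choosing a new salt yields a new fileId for the same file and owner.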

10
PAST Operations (2)

Lookup request: The client node sends a request
message, using the requested fileId as the
destination. When a node that stores the file
receives the request, it responds with the
content and the stored file certificate
Reclaim request: As with an insert request, the
client's node issues a reclaim certificate, which
allows the replica-storing nodes to verify that
the file's legitimate owner is requesting the
operation. The storing nodes each issue and
return a reclaim receipt, which the client node
verifies for a credit against the user's storage
quota

11
Security

Each PAST node and each user of the system holds a
smartcard (read-only clients don't need a card).
A private/public key pair is associated with each
card
Each smartcard's public key is signed with the
smartcard issuer's private key for certification
purposes
The smartcards ensure the integrity of nodeId and
fileId assignments, thus preventing an attacker
from controlling adjacent nodes in the nodeId
space or directing file insertions to a specific
portion of the fileId space.

12
Storage Management (1)

The goal is to allow high global storage
utilization and graceful degradation as the
system approaches its maximal utilization
Responsibilities of storage management:
Balance the remaining free storage space among
nodes in the PAST network as the system-wide
storage utilization approaches 100%
Maintain the invariant that copies of each file
are maintained by the k nodes with nodeIds
closest to the fileId

13
Storage Management (2)

Two ways to reconcile these conflicting
responsibilities:
Replica diversion: Allows a node that is not one
of the k numerically closest nodes to the fileId
to store the file instead, provided it is in the
leaf set of one of those k nodes. This accommodates
differences in the storage capacity and
utilization of nodes within a leaf set
File diversion: A file is diverted to a different
part of the nodeId space by choosing a different
salt in the generation of its fileId when a
node's entire leaf set is reaching capacity. Its
purpose is to achieve more global load balancing
across large portions of the nodeId space

14
Causes of Storage Load Imbalance

Suppose not all of the k closest nodes can accommodate a
replica (insufficient storage), but k nodes that can
accommodate the file exist within the leaf sets of
those k closest nodes
Such an imbalance in the available storage among the
l + k nodes in the intersection of the k leaf
sets can arise for several reasons:
The number of files assigned to each node may
differ because of statistical variation in the
assignment of nodeIds and fileIds
The size distribution of inserted files may have
high variance and may be heavy-tailed
The storage capacity of individual PAST nodes
differs

15
Per-Node Storage

Assumption: The storage capacities of individual
PAST nodes differ by no more than two orders of
magnitude at a given time
PAST controls the distribution of per-node
storage capacities by comparing the advertised
storage capacity of a newly joining node with the
average storage capacity of nodes in its leaf
set
If the node is too large, it is asked to split
and join under multiple nodeIds
If the node is too small, it is rejected
A node is free to advertise only a fraction of
its actual disk space for use by PAST. The
advertised capacity is used as the basis for the
admission decision
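
The admission decision above could be sketched as a comparison against the leaf-set average. The slide does not give the cut-off ratios, so the two thresholds below are hypothetical placeholders, as is the function name.

```python
from statistics import mean


def admission_decision(
    advertised_capacity: float,
    leaf_set_capacities: list[float],
    max_ratio: float = 100.0,  # hypothetical "too large" cut-off
    min_ratio: float = 0.01,   # hypothetical "too small" cut-off
) -> str:
    """Return "split", "reject", or "accept" for a joining node,
    based on its advertised capacity relative to the leaf-set average."""
    average = mean(leaf_set_capacities)
    ratio = advertised_capacity / average
    if ratio > max_ratio:
        return "split"   # join as several nodes under multiple nodeIds
    if ratio < min_ratio:
        return "reject"  # contributes too little storage
    return "accept"


if __name__ == "__main__":
    print(admission_decision(50, [40, 60]))
    print(admission_decision(10_000, [40, 60]))
    print(admission_decision(0.1, [40, 60]))
```

Only the advertised capacity enters the decision, which matches the point that a node may offer less than its actual disk space.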

16
Replica Diversion

Goal: Balance the remaining free storage space
among the nodes in a leaf set
Three policies are used:
Acceptance of replicas into a node's local store
Selecting a node to store a diverted replica
Deciding when to divert a file to a different
part of the nodeId space

17
File Diversion

Goal: Balance the remaining free storage space
among different portions of the nodeId space in
PAST
When a file insert operation fails, a negative
acknowledgment is returned to the client node
The client node in turn generates a new fileId
using a different salt value and retries the
insert operation
A client node repeats this process up to 3
times. If the insert operation still fails after
4 attempts (the original plus 3 retries), the
operation is aborted and an insert failure is
reported to the application
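
The retry behaviour on this slide amounts to a bounded loop. In the sketch below, try_insert is a hypothetical callback that stands in for a full insert attempt and returns False on a negative acknowledgment; the salt length is also an assumption.

```python
import os
from typing import Callable, Optional


def insert_with_file_diversion(
    try_insert: Callable[[bytes], bool],
    max_retries: int = 3,
    new_salt: Callable[[], bytes] = lambda: os.urandom(8),
) -> Optional[bytes]:
    """Attempt an insert, re-salting after each failure.

    Makes 1 + max_retries attempts in total (4 with the default).
    Returns the salt that succeeded, or None if every attempt failed.
    """
    for _ in range(1 + max_retries):
        salt = new_salt()  # a new salt gives a new fileId
        if try_insert(salt):
            return salt
    return None  # caller reports an insert failure to the application


if __name__ == "__main__":
    attempts = []

    def flaky_insert(salt: bytes) -> bool:
        attempts.append(salt)
        return len(attempts) == 3  # succeed on the third attempt

    print(insert_with_file_diversion(flaky_insert) is not None, len(attempts))
```

Each new salt moves the file to a different region of the nodeId space, so a retry targets a different set of k nodes.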

18
Maintaining Replicas

PAST maintains k copies of each inserted file on
different nodes within a leaf set
Neighboring nodes in the nodeId space
periodically exchange keep-alive messages
When a new node joins the system or a previously
failed node comes back online, it is included in
the leaf sets of its neighbors, and another node
is dropped from each of those leaf sets
A node that becomes one of the k closest nodes
for certain files has to acquire a replica of
each file, re-creating replicas that were
previously held by the failed node
A node that ceases to be one of the k nodes for
certain files can discard the copies

19
File Encoding

Reed-Solomon encoding: Adding m additional
checksum blocks to n original data blocks (all of
equal size) allows recovery from up to m losses
of data or checksum blocks
This reduces the storage overhead required to
tolerate m failures from m to (m + n)/n times the
file size
By fragmenting a file into a large number of data
blocks, the storage overhead for availability can
be made very small
Storing fragments of a file at separate nodes
(and thereby striping the file over several
disks) can also improve bandwidth
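
The overhead figure on this slide is easy to check numerically. The function name and the example block counts below are illustrative only.

```python
def erasure_overhead(n: int, m: int) -> float:
    """Storage used, as a multiple of the file size, when n data blocks
    are protected by m checksum blocks: (m + n) / n."""
    if n <= 0 or m < 0:
        raise ValueError("need n > 0 and m >= 0")
    return (m + n) / n


if __name__ == "__main__":
    # Tolerating m = 4 lost blocks with different fragment counts n.
    for n in (4, 16, 100):
        print(n, erasure_overhead(n, 4))
```

As n grows with m fixed, the ratio approaches 1, which is the sense in which fragmenting a file into many blocks makes the overhead very small.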

20
Caching (1)

Goal: Minimize client access latencies (fetch
distance), maximize the query throughput, and
balance the query load in the system
Maintaining replicas leads to availability, query
load balancing, and latency reduction
A highly popular file may demand many more than k
replicas in order to sustain its lookup load
while minimizing client latency and network
traffic
If a file is popular among one or more local
clusters of clients, it is advantageous to store a
copy near each cluster

21
Caching (2)

PAST nodes use the unused portion of their
advertised disk space to cache files
Cached copies can be evicted and discarded at any
time. When a node stores a new primary or
diverted replica of a file, it evicts one or
more cached files to make room for the replica
Cache insertion policy:
A file routed through a node as part of a lookup
or insert operation is inserted into the local
disk cache if its size is less than a fraction c
of the node's current cache size (the portion of
the node's storage not currently used to store
primary or diverted replicas)
Cache replacement policy: Based on the
GreedyDual-Size (GD-S) policy
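
Both cache policies can be sketched together. The insertion check follows the rule stated above. The replacement part follows the standard GreedyDual-Size scheme with a cost of 1 per file, which is an assumption here since the slide only names the policy. Class and function names are hypothetical.

```python
def should_cache(file_size: int, cache_size: int, c: float) -> bool:
    """Insertion policy: cache only files smaller than fraction c of the cache."""
    return file_size < c * cache_size


class GreedyDualSizeCache:
    """Minimal GreedyDual-Size sketch: each file gets weight
    H = inflation + cost / size, and the lowest-weight file is evicted first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.inflation = 0.0  # weight of the most recently evicted file
        self.entries: dict[str, tuple[int, float]] = {}  # id -> (size, H)

    def access(self, file_id: str, size: int, cost: float = 1.0) -> bool:
        """Record an access; return True on a hit, False on a miss."""
        if file_id in self.entries:
            size, _ = self.entries[file_id]
            self.entries[file_id] = (size, self.inflation + cost / size)
            return True
        if size > self.capacity:
            return False  # can never fit, so do not cache it
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda f: self.entries[f][1])
            victim_size, victim_weight = self.entries.pop(victim)
            self.used -= victim_size
            self.inflation = victim_weight  # ages the remaining entries
        self.entries[file_id] = (size, self.inflation + cost / size)
        self.used += size
        return False


if __name__ == "__main__":
    cache = GreedyDualSizeCache(capacity=10)
    for name, size in (("a", 5), ("b", 4), ("c", 3)):
        cache.access(name, size)
    print(sorted(cache.entries))
```

With a uniform cost, larger files carry lower weights and are evicted first, so the cache tends to hold many small files.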