Title: OceanStore: An Architecture for Global-scale Persistent Storage
1. OceanStore: An Architecture for Global-scale Persistent Storage
- By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao
- http://oceanstore.cs.berkeley.edu
Presented by Yongbo Wang, Hailing Yu
2. Ubiquitous Computing
3. OceanStore Overview
- A global-scale utility infrastructure
- Internet-based, distributed storage system for information appliances such as computers, PDAs, and cellular phones
- It is designed to support 10^10 users, each having 10^4 data files (over 10^14 files in total)
4. OceanStore Overview (cont.)
- Automatically recovers from server and network failures
- Utilizes redundancy and client-side cryptographic techniques to protect data
- Allows replicas of a data object to exist anywhere, at any time
- Incorporates new resources
- Adjusts to usage patterns
5. (No transcript; figure only)
6. OceanStore
- Two unique design goals
- Ability to be constructed from an untrusted infrastructure
  - Servers may crash
  - Information can be stolen
- Support of nomadic data
  - Data can be cached anywhere, anytime (promiscuous caching)
  - Data is separated from its physical location
7. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
8. Naming
- Objects are identified by a globally unique identifier (GUID)
- Different objects in OceanStore use different mechanisms to generate their GUIDs
9. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
10. Access Control
- Reader restriction
  - Encrypt the data that is not public
  - Distribute the encryption key to users having read permission
- Writer restriction
  - The owner of an object can define an access control list (ACL) for the object
  - All writes are verified by well-behaved servers and clients based on the ACL (a sketch follows this slide)
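A minimal sketch of the two restrictions, assuming a symmetric key scheme (Fernet from the Python cryptography package) and a plain dictionary ACL; the GUID, user names, and structures are illustrative, not OceanStore's actual interfaces.

```python
from cryptography.fernet import Fernet

# --- Reader restriction: non-public data is stored encrypted. ---
read_key = Fernet.generate_key()          # distributed only to authorized readers
ciphertext = Fernet(read_key).encrypt(b"private object contents")

def read(key: bytes) -> bytes:
    """Only holders of the read key can recover the plaintext."""
    return Fernet(key).decrypt(ciphertext)

# --- Writer restriction: writes are checked against the owner's ACL. ---
acl = {"object-guid-1234": {"alice", "bob"}}   # hypothetical GUID and writer list

def verify_write(guid: str, writer: str) -> bool:
    """A well-behaved server accepts an update only if the writer is on the ACL."""
    return writer in acl.get(guid, set())

assert verify_write("object-guid-1234", "alice")
assert not verify_write("object-guid-1234", "mallory")
```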
11. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
12. Data Location and Routing
- Provides the services necessary to route messages to their destinations and to locate objects in the system
- Works on top of IP
13. Data Location and Routing
- Each object in the system is identified by a globally unique identifier, GUID (a pseudo-random, fixed-length bit string)
- An object's GUID is a secure hash over the object's contents
- OceanStore uses a 160-bit SHA-1 hash, for which the probability that two out of 10^14 objects hash to the same value is approximately 1 in 10^20 (see the sketch after this slide)
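A small sketch of how a content-derived GUID could be computed, plus a birthday-bound estimate that is consistent with the collision figure quoted above; the helper name is illustrative.

```python
import hashlib

def guid(contents: bytes) -> str:
    """Content-derived GUID: the 160-bit SHA-1 digest of the object's contents."""
    return hashlib.sha1(contents).hexdigest()

print(guid(b"hello, ocean"))   # 40 hex digits = 160 bits

# Birthday-bound estimate: with n objects and a b-bit hash, the collision
# probability is roughly n^2 / 2^(b+1).
n, b = 1e14, 160
p_collision = n * n / 2 ** (b + 1)
print(f"collision odds: about 1 in {1 / p_collision:.1e}")   # on the order of 1 in 10^20
```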
14. Data Location and Routing
- In the OceanStore system, entities that are accessed frequently are likely to reside close to where they are being used
- Two-tiered approach
  - First use a fast, probabilistic algorithm
  - If necessary, use a slower but reliable hierarchical algorithm
15. Probabilistic algorithm
- Each server has a set of neighbors, chosen from the servers closest to it in network latency
- A server associates with each neighbor a probability of finding each object in the system through that neighbor
- This association is maintained in constant space using an attenuated Bloom filter
16. Bloom Filters
- An efficient, lossy way of describing sets
- A Bloom filter is a bit vector of length w with a family of hash functions
- Each hash function maps the elements of the represented set to an integer in [0, w)
- To form a representation of a set, each element is hashed and the bits in the vector corresponding to the hash function results are set
17. Bloom Filters
- To check if an element is in the set
  - The element is hashed
  - The corresponding bits in the filter are checked
    - If any of the bits are not set, the element is not in the set
    - If all bits are set, the element may be in the set
- The element may not be in the set even if all of the hashed bits are set (a false positive)
- The false positive rate of a Bloom filter is a function of its width, the number of hash functions, and the cardinality of the represented set (a sketch follows this slide)
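A minimal Bloom filter sketch illustrating the insert/check behavior described above; the width, hash count, and class name are illustrative choices, not OceanStore code.

```python
import hashlib

class BloomFilter:
    def __init__(self, width: int = 64, num_hashes: int = 3):
        self.width = width
        self.num_hashes = num_hashes
        self.bits = [False] * width

    def _positions(self, name: str):
        # Derive num_hashes positions in [0, width) from independently salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{name}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.width

    def add(self, name: str) -> None:
        for pos in self._positions(name):
            self.bits[pos] = True

    def might_contain(self, name: str) -> bool:
        # False means definitely absent; True means possibly present (false positives allowed).
        return all(self.bits[pos] for pos in self._positions(name))

bf = BloomFilter()
bf.add("Uncle John's Band")
print(bf.might_contain("Uncle John's Band"))   # True
print(bf.might_contain("some other object"))   # almost certainly False
```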
18. A Bloom Filter: to check an object's name against a Bloom filter summary, the name is hashed with n different hash functions (here, n = 3) and the bits corresponding to the results are checked
19. Attenuated Bloom Filters
- An attenuated Bloom filter of depth d is an array of d normal Bloom filters
- An attenuated Bloom filter is kept for each neighbor link
- The k-th Bloom filter in the array is the merger of the Bloom filters of all nodes k hops away through any path starting with that neighbor link
20. Attenuated Bloom Filter for the outgoing link A→B: in F_A→B, the document "Uncle John's Band" would map to potential value 1/4 + 1/8 = 3/8.
21. The Query Algorithm
- The querying node examines the 1st level of each of its neighbors' filters
- If matches are found, the query is forwarded to the closest matching neighbor
- If no filter matches, the querying node examines the next level of each filter, forwarding the query as soon as a match is found (a sketch follows this slide)
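A minimal sketch of the probabilistic query step, representing each level of an attenuated Bloom filter abstractly as the set of its set bit positions, and weighting a match at depth d by 1/2^(d+1) (one plausible reading of the 1/4 + 1/8 = 3/8 example above); the neighbor names and filter contents are hypothetical.

```python
def query_potential(attenuated_filter, object_bits):
    """Sum 1/2^(d+1) over every depth d whose Bloom filter has all of the object's bits set.
    attenuated_filter is a list of sets of set-bit positions, one set per depth."""
    return sum(1 / 2 ** (d + 1)
               for d, level_bits in enumerate(attenuated_filter)
               if object_bits <= level_bits)

# Object X hashes to bits {0, 1, 3} (as in the figure on the next slide).
object_x = {0, 1, 3}

# Hypothetical attenuated filters kept by the querying node for two neighbor links.
neighbors = {
    "n2": [{0, 1, 3}, {2, 5}],       # match at depth 1 -> potential 1/2
    "n3": [{2, 4}, {0, 1, 3, 6}],    # match at depth 2 -> potential 1/4
}

best = max(neighbors, key=lambda n: query_potential(neighbors[n], object_x))
print(best)   # the query is forwarded toward the most promising neighbor ("n2")
```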
22. The probabilistic query process: n1 is looking for object X, which is hashed to bits 0, 1, and 3.
23. Probabilistic location and routing
- A filter of depth d stores information about servers up to d hops from the server
- If a query reaches a server d hops away from its source due to a false positive, it is not forwarded further
- In this case, the probabilistic algorithm gives up and hands the query to the deterministic algorithm
24. Deterministic location and routing
- Tapestry: OceanStore's self-organizing routing and object location subsystem
- An IP overlay network with a distributed, fault-tolerant architecture
- A query is routed from node to node until the location of a replica is discovered
25. Tapestry
- A hierarchical distributed data structure
- Every server is assigned a random, unique node-ID
- The node-IDs are then used to construct a mesh of neighbor links
26. Tapestry
- Every node is connected to other nodes via neighbor links of various levels
- Level-1 edges connect to a set of nodes closest in network latency with different values in the lowest digit of their node-IDs
- Level-2 edges connect to the closest nodes that match in the lowest digit and differ in the second digit, etc.
27. Tapestry
- Each node has a neighbor map with multiple levels
  - For example, the 9th entry of the 4th level for node 325AE is the node closest to 325AE which ends in 95AE
- Messages are routed to the destination ID digit by digit
  - 8 → 98 → 598 → 4598 (a routing sketch follows this slide)
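A small sketch of the digit-by-digit (suffix-matching) routing step described above, over a hypothetical set of node-IDs; real Tapestry neighbor maps, surrogate routing, and fault handling are omitted.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits on which two node-IDs agree."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(current: str, dest: str, nodes: list[str]) -> str:
    """Pick a node matching the destination in at least one more trailing digit."""
    need = shared_suffix_len(current, dest) + 1
    return next(n for n in nodes if shared_suffix_len(n, dest) >= need)

nodes = ["0325", "9098", "2598", "4598", "7777"]   # hypothetical overlay members
hop, dest, path = "0325", "4598", ["0325"]
while hop != dest:
    hop = next_hop(hop, dest, nodes)
    path.append(hop)
print(" -> ".join(path))   # 0325 -> 9098 -> 2598 -> 4598, one digit resolved per hop
```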
28. Neighbor Map for Tapestry node 0642
29. Tapestry routing example: a potential path for a message originating at node 0325 destined for node 4598
30. Tapestry
- Each object is associated with a location root through a deterministic mapping function
- To advertise an object o, the server s storing the object sends a publish message toward the object's root, leaving location pointers at each hop
31. Tapestry routing example: to publish an object, the server storing the object sends a publish message toward the object's root (e.g. node 4598), leaving location pointers at each node
32. Locating an object
- To locate an object, a client sends a message toward the object's root. When the message encounters a location pointer, it routes directly to the object (a publish/locate sketch follows this slide)
- It has been proved that Tapestry routes the request to an asymptotically optimal node (in terms of shortest-path network distance) containing a replica
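A toy sketch of the publish/locate interplay: publishing drops a location pointer at each hop on the path from the storing server to the object's root, and a later query walks its own path toward the root, cutting over to the object as soon as it meets a pointer. The paths and node-IDs are hard-coded for illustration; in Tapestry they come from the digit-by-digit routing sketched earlier.

```python
# pointers[node] maps an object GUID to the server that stores it.
pointers: dict[str, dict[str, str]] = {}

def publish(guid: str, server: str, path_to_root: list[str]) -> None:
    """Leave a location pointer for guid at every hop between the server and the root."""
    for node in path_to_root:
        pointers.setdefault(node, {})[guid] = server

def locate(guid: str, path_to_root: list[str]) -> str | None:
    """Walk toward the root; the first pointer encountered short-circuits to the server."""
    for node in path_to_root:
        if guid in pointers.get(node, {}):
            return pointers[node][guid]
    return None

# Hypothetical paths toward the object's root, node 4598.
publish("obj-X", server="B4F8", path_to_root=["9098", "7598", "4598"])
print(locate("obj-X", path_to_root=["2118", "1598", "4598"]))   # meets the pointer at the root -> "B4F8"
```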
33. Tapestry routing example: to locate an object, node 0325 sends a message toward the object's root (e.g. node 4598)
34. Data Location and Routing
- Fault tolerance
  - Tapestry uses redundant neighbor pointers when it detects a primary route failure
  - Uses periodic UDP probes to check link conditions
  - Tapestry deterministically chooses multiple root nodes for each object
35. Data Location and Routing
- Automatic repair
  - Node insertions
    - A new node needs the address of at least one existing node
    - It then advertises its services, and the roles it can assume, to the system through the existing node
  - Exiting nodes
    - If possible, the exiting node runs a shutdown script to inform the system
    - In any case, neighbors will detect its absence and update their routing tables accordingly
36. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
37. Updates
- Updates are made by clients, and all updates are logged
- OceanStore allows concurrent updates
- Serializing updates
  - Since the infrastructure is untrusted, using a single master replica will not work
  - Instead, a group of servers called the inner ring is responsible for choosing the final commit order
38. Update commitment
- The inner ring is a group of servers working on behalf of an object
- It consists of a small number of highly connected servers
- Each object has an inner ring, which can be located through Tapestry
39. Inner ring
- An object's inner ring:
  - Generates new versions of the object from client updates
  - Generates encoded archival fragments and distributes them
  - Provides the mapping from the active GUID to the GUID of the most recent version of the object
  - Verifies a data object's legitimate writers
  - Maintains an update history, providing an undo mechanism
40. Update commitment
- Each inner ring makes its decisions through a Byzantine agreement protocol
- Byzantine agreement lets a group of 3n + 1 servers reach agreement whenever no more than n of them are faulty (see the note below)
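A quick arithmetic sketch of the 3n + 1 bound quoted above: for a ring of a given size, it computes how many faulty servers can be tolerated. The function name and the example ring sizes are only illustrative.

```python
def max_tolerated_faults(ring_size: int) -> int:
    """A ring of 3n + 1 servers tolerates up to n Byzantine-faulty members."""
    return (ring_size - 1) // 3

for size in (4, 7, 10):
    print(f"inner ring of {size} servers tolerates {max_tolerated_faults(size)} faulty member(s)")
# 4 -> 1, 7 -> 2, 10 -> 3
```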
41. Update commitment
- Other nodes containing the data of that object are called secondary nodes
- They do not participate in the serialization protocol
- They are organized into one or more multicast trees (dissemination trees)
42. Path of an update
- After generating an update, a client sends it directly to the object's inner ring
- While the inner ring performs Byzantine agreement to commit the update, secondary nodes propagate the update among themselves
- The result of the update is multicast down the dissemination tree to all secondary nodes
43. Cost of an update in bytes sent across the network, normalized to the minimum cost needed to send the update to each of the replicas
44. Update commitment
- Fault tolerance
  - Guarantees fault tolerance as long as fewer than one third of the servers in the inner ring are malicious
  - Secondary nodes do not participate in the Byzantine protocol, but receive consistency information
45. Update commitment
- Automatic repair
  - Servers of the inner ring can be changed without affecting the rest of the system
  - The servers participating in the inner ring are changed continuously to maintain the Byzantine assumption
46. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
47. Deep Archival Storage
- Each object is treated as a series of m fragments and then transformed into n fragments, where n > m, using Reed-Solomon erasure coding
- Any m of the n coded fragments are sufficient to reconstruct the original data
- Rate of encoding: r = m/n
- Storage overhead: 1/r = n/m (see the sketch after this slide)
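A small numeric sketch of the rate and overhead formulas above; the fragment counts m = 16 and n = 32 are hypothetical values chosen for illustration, not fixed by this slide.

```python
def erasure_coding_stats(m: int, n: int) -> tuple[float, float]:
    """Rate r = m/n and storage overhead 1/r = n/m for an (n, m) erasure code."""
    assert n > m > 0
    rate = m / n
    overhead = n / m
    return rate, overhead

# Hypothetical choice: split an object into m = 16 fragments, encode into n = 32.
rate, overhead = erasure_coding_stats(m=16, n=32)
print(f"rate r = {rate}, storage overhead = {overhead}x")   # r = 0.5, overhead = 2.0x
# Any 16 of the 32 coded fragments suffice to reconstruct the object.
```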
48. Underlying Technologies
- Naming
- Access control
- Data Location and Routing
- Data Update
- Deep Archival Storage
- Introspection
49. Introspection
- It is impossible to manually administer millions of servers and objects
- OceanStore contains introspection tools
  - Event monitoring
  - Event analysis
  - Self-adaptation
50. Introspection
- Introspective modules on servers observe network traffic and take local measurements
- They automatically create, replace, and remove replicas in response to objects' usage patterns
51. Introspection
- If a replica becomes unavailable
  - Clients will receive service from a more distant replica
  - This produces extra load on the distant replicas
  - The introspective mechanism detects this, and new replicas are created (a sketch follows this slide)
- The above actions provide fault tolerance and automatic repair
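A toy sketch of the adaptation loop just described: an event handler counts reads served from remote replicas, and when the observed load crosses a threshold, a new local replica is created. The threshold, names, and triggering rule are invented for illustration.

```python
from collections import Counter

REMOTE_READ_THRESHOLD = 100            # invented threshold for this sketch

remote_reads: Counter = Counter()      # per-object count of reads served remotely
local_replicas: set = set()

def record_remote_read(guid: str) -> None:
    """Event monitoring: summarize each remotely served read locally."""
    remote_reads[guid] += 1

def adapt() -> None:
    """Event analysis + self-adaptation: replicate objects that are hot remotely."""
    for guid, count in remote_reads.items():
        if count > REMOTE_READ_THRESHOLD and guid not in local_replicas:
            local_replicas.add(guid)    # stand-in for creating a real replica nearby
            remote_reads[guid] = 0

for _ in range(150):
    record_remote_read("obj-X")
adapt()
print(local_replicas)                  # {'obj-X'}: the hot object now has a nearby replica
```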
52. Event handlers summarize local events. These summaries are stored in a database. The information in the database is periodically analyzed and the necessary actions are taken. A summary is also sent to other nodes.
53. Conclusion
- OceanStore provides a global-scale, distributed storage platform through adaptation, fault tolerance, and repair
- It is self-maintaining
- A prototype implemented in Java is under construction at UC Berkeley. Although it is not yet operational, many components are already functioning in isolation
54. The end