OceanStore: An Architecture for Global-scale Persistent Storage

1
OceanStore: An Architecture for Global-scale
Persistent Storage
  • By John Kubiatowicz, David Bindel, Yan Chen,
    Steven Czerwinski, Patrick Eaton, Dennis Geels,
    Ramakrishna Gummadi, Sean Rhea, Hakim
    Weatherspoon, Westley Weimer, Chris Wells, and
    Ben Zhao
  • http://oceanstore.cs.berkeley.edu

Presented by Yongbo Wang, Hailing Yu
2
Ubiquitous Computing
3
OceanStore Overview
  • A global-scale utility infrastructure
  • Internet-based, distributed storage system for
    information appliances such as computers, PDAs,
    and cellular phones
  • It is designed to support 10^10 users, each having
    10^4 data files (over 10^14 files in total)

4
OceanStore Overview (cont)
  • Automatically recovers from server and network
    failures
  • Utilizes redundancy and client-side cryptographic
    techniques to protect data
  • Allows replicas of a data object to exist
    anywhere, at any time
  • Incorporates new resources
  • Adjusts to usage patterns

5
(No Transcript)
6
OceanStore
  • Two unique design goals
  • Ability to be constructed from an untrusted
    infrastructure
  • Servers may crash
  • Information can be stolen
  • Support of nomadic data
  • Data can be cached anywhere, anytime (promiscuous
    caching)
  • Data is separated from its physical location

7
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

8
Naming
  • Objects are identified by a globally unique
    identifier (GUID)
  • Different kinds of objects in OceanStore use
    different mechanisms to generate their GUIDs

9
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

10
Access Control
  • Reader Restriction
  • Encrypt the data that is not public
  • Distribute the encryption key to users having
    read permission (see the sketch below)
  • Writer Restriction
  • The owner of an object can specify an access
    control list (ACL) for the object
  • All writes are verified by well-behaved servers
    and clients based on the ACL.
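The reader-restriction idea above can be illustrated with standard Java cryptography: object contents are encrypted under a symmetric key, the untrusted servers store only ciphertext, and the key is handed only to users with read permission. This is a minimal sketch using AES-GCM; the cipher choice, class name, and key handling are illustrative assumptions, not the OceanStore prototype's actual mechanism.

    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    public class ReaderRestrictionSketch {
        public static void main(String[] args) throws Exception {
            // Symmetric key: in OceanStore this would be distributed only to
            // users holding read permission for the object.
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            SecretKey readKey = kg.generateKey();

            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);

            // The untrusted infrastructure stores only this ciphertext.
            Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, readKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext =
                enc.doFinal("non-public object contents".getBytes(StandardCharsets.UTF_8));

            // An authorized reader holding the key decrypts locally.
            Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, readKey, new GCMParameterSpec(128, iv));
            System.out.println(new String(dec.doFinal(ciphertext), StandardCharsets.UTF_8));
        }
    }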

11
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

12
Data Location and Routing
  • Provides necessary service to route messages to
    their destinations and to locate objects in the
    system
  • Works on top of IP

13
Data Location and Routing
  • Each object in the system is identified by a
    globally unique identifier, GUID (a pseudo-random,
    fixed-length bit string)
  • An object's GUID is a secure hash over the
    object's contents
  • OceanStore uses the 160-bit SHA-1 hash, for which
    the probability that two out of 10^14 objects hash
    to the same value is approximately 1 in 10^20
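As a minimal sketch of a content-derived GUID, the standard Java MessageDigest API can compute the 160-bit SHA-1 hash of an object's contents; the helper name and hex formatting here are illustrative, not the prototype's actual code.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class GuidSketch {
        // Hash the object's contents with SHA-1, yielding a 160-bit identifier.
        static String guidOf(byte[] contents) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(contents);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            byte[] contents = "example object contents".getBytes(StandardCharsets.UTF_8);
            System.out.println(guidOf(contents)); // 40 hex digits = 160 bits
        }
    }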

14
Data Location and Routing
  • In the OceanStore system, entities that are accessed
    frequently are likely to reside close to where
    they are being used
  • Two-tiered approach
  • First use a fast probabilistic algorithm
  • If necessary, use a slower but reliable
    hierarchical algorithm

15
Probabilistic algorithm
  • Each server has a set of neighbors, chosen from
    servers closest to it in network latency
  • A server associates with each neighbor a
    probability of finding each object in the system
    through that neighbor
  • This association is maintained in constant space
    using an attenuated Bloom filter

16
Bloom Filters
  • An efficient, lossy way of describing sets
  • A Bloom filter is a bit vector of length w with a
    family of hash functions
  • Each hash function maps the elements of the
    represented set to an integer in [0, w)
  • To form a representation of a set, each set
    element is hashed, and the bits in the vector
    corresponding to the hash results are set

17
Bloom Filters
  • To check if an element is in the set
  • Element is hashed
  • Corresponding bits in the filter are checked
  • - If any of the bits are not set, it is not in
    the set
  • - If all bits are set, it may be in the set
  • The element may not be in the set even if all of
    the hashed bits are set (false positive)
  • The false-positive rate of a Bloom filter depends
    on its width, the number of hash functions, and
    the cardinality of the represented set
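A minimal Bloom filter sketch in Java covering the add and membership-check operations described on the two slides above; the double-hashing scheme and all names are illustrative assumptions, not the OceanStore implementation.

    import java.util.BitSet;

    public class BloomFilterSketch {
        private final BitSet bits;
        private final int width;      // w: length of the bit vector
        private final int numHashes;  // size of the hash-function family

        public BloomFilterSketch(int width, int numHashes) {
            this.bits = new BitSet(width);
            this.width = width;
            this.numHashes = numHashes;
        }

        // Derive the i-th hash from the element's hashCode (simple double hashing).
        private int hash(Object element, int i) {
            int h1 = element.hashCode();
            int h2 = (h1 >>> 16) | 1;               // odd stride so the hashes differ
            return Math.floorMod(h1 + i * h2, width);
        }

        public void add(Object element) {
            for (int i = 0; i < numHashes; i++) bits.set(hash(element, i));
        }

        // False means definitely absent; true means "possibly present"
        // (false positives are allowed).
        public boolean mightContain(Object element) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(hash(element, i))) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            BloomFilterSketch f = new BloomFilterSketch(1024, 3);
            f.add("Uncle John's Band");
            System.out.println(f.mightContain("Uncle John's Band")); // true
            System.out.println(f.mightContain("some other object")); // almost certainly false
        }
    }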

18
A Bloom filter: to check an object's name against
a Bloom filter summary, the name is hashed with n
different hash functions (here, n = 3) and the bits
corresponding to the results are checked
19
Attenuated Bloom Filters
  • An attenuated Bloom filter of depth d is an array
    of d normal Bloom filters
  • For each neighbor link, an attenuated Bloom
    filter is kept
  • The k-th Bloom filter in the array is the merger
    of the Bloom filters of all nodes k hops
    away through any path starting with that neighbor
    link
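A minimal sketch of the attenuated-filter structure kept for one neighbor link, assuming a fixed width and depth; the per-level bitwise-OR merge follows the description above, while the class and method names are assumptions for illustration.

    import java.util.BitSet;

    public class AttenuatedFilterSketch {
        final BitSet[] levels;  // levels[k] summarizes nodes k hops away via this link
        final int width;

        AttenuatedFilterSketch(int depth, int width) {
            this.width = width;
            levels = new BitSet[depth];
            for (int k = 0; k < depth; k++) levels[k] = new BitSet(width);
        }

        // Merge a filter received for nodes k hops away by bitwise OR.
        void mergeAtLevel(int k, BitSet filter) {
            levels[k].or(filter);
        }

        // True if every bit of the queried object's hash is set at level k.
        boolean matchesAtLevel(int k, int[] objectBits) {
            for (int bit : objectBits) {
                if (!levels[k].get(bit)) return false;
            }
            return true;
        }
    }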

20
Attenuated Bloom filter for the outgoing link
A→B: in F_AB, the document "Uncle John's Band"
would map to potential value 1/4 + 1/8 = 3/8.
21
The Query Algorithm
  • The query node examines the 1st level of each of
    its neighbors' filters
  • If matches are found, the query is forwarded to
    the closest matching neighbor
  • If no filter matches, the querying node examines
    the next level of each filter, and so on,
    forwarding the query once a match is found
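A sketch of the level-by-level scan, assuming each neighbor's attenuated filter is available as an array of BitSets; choosing the single best neighbor among several matches (for instance by network latency or match depth) is omitted, and all names are illustrative.

    import java.util.BitSet;
    import java.util.Map;

    public class ProbabilisticQuerySketch {
        // Returns a neighbor to forward the query to, or null if no level of any
        // neighbor's filter matches (the caller then falls back to Tapestry).
        static String chooseNextHop(Map<String, BitSet[]> filtersByNeighbor,
                                    int[] objectBits, int depth) {
            for (int level = 0; level < depth; level++) {        // shallow levels first
                for (Map.Entry<String, BitSet[]> e : filtersByNeighbor.entrySet()) {
                    if (allBitsSet(e.getValue()[level], objectBits)) {
                        return e.getKey();
                    }
                }
            }
            return null;
        }

        static boolean allBitsSet(BitSet filter, int[] objectBits) {
            for (int bit : objectBits) {
                if (!filter.get(bit)) return false;
            }
            return true;
        }
    }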

22
The probabilistic query process: n1 is looking
for object X, which hashes to bits 0, 1, and 3.
23
Probabilistic location and routing
  • A filter of depth d stores information about
    servers up to d hops away from the server
  • If a query reaches a server d hops away from its
    source due to a false positive, it is not
    forwarded further
  • In this case, the probabilistic algorithm gives
    up and hands the query to the deterministic
    algorithm

24
Deterministic location and routing
  • Tapestry: OceanStore's self-organizing routing
    and object-location subsystem
  • An IP overlay network with a distributed,
    fault-tolerant architecture
  • A query is routed from node to node until the
    location of a replica is discovered

25
Tapestry
  • A hierarchical distributed data structure
  • Every server is assigned a random and unique
    node-ID
  • The node-IDs are then used to construct a mesh
    of neighbor links

26
Tapestry
  • Every node is connected to other nodes via
    neighbor links of various levels
  • Level-1 edges connect to a set of nodes closest
    in network latency with different values in the
    lowest digit of their node-IDs
  • Level-2 edges connect to the closest nodes that
    match in the lowest digit and have different
    second digits, etc.

27
Tapestry
  • Each node has a neighbor map with multiple levels
  • For example, the 9th entry of the 4th level for
    node 325AE is the node closest to 325AE whose ID
    ends in 95AE
  • Messages are routed to the destination ID digit
    by digit
  • 8 → 98 → 598 → 4598
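A minimal sketch of suffix-based, digit-by-digit next-hop selection for equal-length IDs; a real Tapestry node would consult level (k + 1) of its neighbor map rather than computing the required suffix directly, and the method names are illustrative.

    public class TapestryRoutingSketch {
        // Number of trailing digits of nodeId that already match destId.
        static int matchingSuffixLength(String nodeId, String destId) {
            int k = 0;
            while (k < nodeId.length()
                    && nodeId.charAt(nodeId.length() - 1 - k)
                       == destId.charAt(destId.length() - 1 - k)) {
                k++;
            }
            return k;
        }

        // The next hop must agree with the destination in one more trailing digit.
        static String requiredSuffix(String destId, int currentMatch) {
            return destId.substring(destId.length() - (currentMatch + 1));
        }

        public static void main(String[] args) {
            String current = "0325", dest = "4598";
            int k = matchingSuffixLength(current, dest);      // 0 digits match
            System.out.println(requiredSuffix(dest, k));      // "8"
            System.out.println(requiredSuffix(dest, k + 1));  // "98", then "598", then "4598"
        }
    }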

28
Neighbor Map for Tapestry node 0642
29
Tapestry routing example: a potential path for a
message originating at node 0325 destined for
node 4598
30
Tapestry
  • Each object is associated with a location root
    through a deterministic mapping function
  • To advertise an object o, the server s storing
    the object sends a publish message toward the
    object's root, leaving location pointers at each
    hop

31
Tapestry routing example: to publish an object,
the server storing the object sends a publish
message toward the object's root (e.g. node
4598), leaving location pointers at each node
32
Locating an object
  • To locate an object, a client sends a message
    toward the object's root. When the message
    encounters a pointer, it routes directly to the
    object (see the sketch below)
  • It has been proven that Tapestry routes the
    request to an asymptotically optimal node (in
    terms of shortest-path network distance)
    containing a replica
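A minimal sketch of publish and locate using per-node location pointers, assuming the Tapestry paths toward the root are given as plain lists of node IDs; the node IDs, map layout, and method names are illustrative assumptions.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LocationPointerSketch {
        // Each node's pointer cache: object GUID -> server holding a replica.
        static Map<String, Map<String, String>> pointersByNode = new HashMap<>();

        // Publishing leaves a pointer at every hop from the storing server to the root.
        static void publish(List<String> pathToRoot, String guid, String server) {
            for (String node : pathToRoot) {
                pointersByNode.computeIfAbsent(node, k -> new HashMap<>()).put(guid, server);
            }
        }

        // Locating walks toward the root and stops at the first node holding a pointer.
        static String locate(List<String> pathToRoot, String guid) {
            for (String node : pathToRoot) {
                Map<String, String> pointers = pointersByNode.get(node);
                if (pointers != null && pointers.containsKey(guid)) {
                    return pointers.get(guid);
                }
            }
            return null;  // reached the root without finding the object
        }

        public static void main(String[] args) {
            // Server 0325 publishes an object whose root is node 4598.
            publish(List.of("B4F8", "9098", "7598", "4598"), "some-guid", "node-0325");
            // A query from elsewhere converges on the same root and meets a pointer early.
            System.out.println(locate(List.of("2BB8", "9098", "7598", "4598"), "some-guid"));
        }
    }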

33
Tapestry routing example: to locate an object,
node 0325 sends a message toward the object's
root (e.g. node 4598)
34
Data Location and Routing
  • Fault tolerance
  • Tapestry uses redundant neighbor pointers when it
    detects a primary route failure
  • Uses periodic UDP probes to check link conditions
  • Tapestry deterministically chooses multiple root
    nodes for each object

35
Data Location and Routing
  • Automatic repair
  • Node insertions
  • A new node needs the address of at least one
    existing node
  • Through that existing node, it then advertises to
    the system its services and the roles it can
    assume
  • Exiting nodes
  • If possible, the exiting node runs a shutdown
    script to inform the system
  • In any case, neighbors will detect its absence
    and update routing tables accordingly

36
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

37
Updates
  • Updates are made by clients and all updates are
    logged
  • OceanStore allows concurrent updates
  • Serializing updates
  • Since the infrastructure is untrusted, using a
    master replica will not work
  • Instead, a group of servers called the inner ring
    is responsible for choosing the final commit order

38
Update commitment
  • The inner ring is a group of servers working on
    behalf of an object.
  • It consists of a small number of highly-connected
    servers.
  • Each object has an inner ring which can be
    located through Tapestry

39
Inner ring
  • An object's inner ring
  • Generates new versions of the object from client
    updates
  • Generates encoded, archival fragments and
    distributes them
  • Provides the mapping from the active GUID to the
    GUID of the most recent version of the object
  • Verifies a data object's legitimate writers
  • Maintains an update history, providing an undo
    mechanism

40
Update commitment
  • Each inner ring makes its decisions through a
    Byzantine agreement protocol
  • Byzantine agreement lets a group of 3n + 1 servers
    reach agreement whenever no more than n of them
    are faulty (e.g., a ring of 7 servers tolerates
    up to 2 faulty members)

41
Update commitment
  • Other nodes containing the data of that object
    are called secondary nodes
  • They do not participate in the serialization
    protocol
  • They are organized into one or more multicast
    trees (dissemination trees)

42
  • Path of an update
  • After generating an update, a client sends it
    directly to the object's inner ring
  • While the inner ring performs a Byzantine agreement
    to commit the update, secondary nodes propagate
    the update among themselves
  • The result of the update is multicast down the
    dissemination tree to all secondary nodes

43
Cost of an update in bytes sent across the
network, normalized to the minimum cost needed to
send the update to each of the replicas
44
Update commitment
  • Fault tolerance
  • Guarantees fault tolerance as long as less than
    one third of the servers in the inner ring are
    malicious
  • Secondary nodes do not participate in the
    Byzantine protocol, but receive consistency
    information

45
Update commitment
  • Automatic repair
  • Servers of the inner ring can be changed without
    affecting the rest of the system
  • Servers participating in the inner ring are
    altered continuously to maintain the Byzantine
    assumption

46
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

47
Deep Archival Storage
  • Each object is treated as a series of m fragments
    and then transformed into n fragments, where n > m,
    using Reed-Solomon encoding
  • Any m of the n coded fragments are sufficient to
    reconstruct the original data (see the sketch below)
  • Rate of encoding: r = m/n
  • Storage overhead: 1/r = n/m
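A worked example of the rate and overhead formulas above; the fragment counts m = 16 and n = 64 are illustrative values, not the prototype's actual configuration.

    public class ErasureRateSketch {
        public static void main(String[] args) {
            int m = 16;                            // fragments needed to reconstruct
            int n = 64;                            // fragments produced (n > m)
            double rate = (double) m / n;          // r = m/n
            double overhead = (double) n / m;      // 1/r = n/m
            int tolerableLosses = n - m;           // any m of the n fragments suffice
            System.out.printf("rate = %.2f, overhead = %.0fx, losses tolerated = %d%n",
                    rate, overhead, tolerableLosses);
            // Prints: rate = 0.25, overhead = 4x, losses tolerated = 48
        }
    }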

48
Underlying Technologies
  • Naming
  • Access control
  • Data Location and Routing
  • Data Update
  • Deep Archival Storage
  • Introspection

49
Introspection
  • It is impossible to manually administer millions
    of servers and objects
  • OceanStore contains introspection tools
  • Event monitoring
  • Event analysis
  • Self-adaptation

50
Introspection
  • Introspective modules on servers observe network
    traffic and measure local traffic.
  • They automatically create, replace, and remove
    replicas in response to objects' usage patterns

51
Introspection
  • If a replica becomes unavailable
  • Clients will receive service from a more distant
    replica
  • This produces extra load on distant replicas
  • The introspective mechanism detects this and
    creates new replicas (see the sketch below)
  • These actions provide fault tolerance and
    automatic repair
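A minimal sketch of one possible self-adaptation rule suggested by the bullets above: count requests served for clients far from any replica and create a new replica once a threshold is crossed. The threshold, region granularity, and all names are assumptions for illustration, not the actual OceanStore heuristics.

    import java.util.HashMap;
    import java.util.Map;

    public class ReplicaManagerSketch {
        static final int REMOTE_REQUEST_THRESHOLD = 1000;  // illustrative threshold
        private final Map<String, Integer> remoteRequestsByRegion = new HashMap<>();

        // Called whenever this replica serves a request for a distant client region.
        void recordRemoteRequest(String clientRegion) {
            int count = remoteRequestsByRegion.merge(clientRegion, 1, Integer::sum);
            if (count >= REMOTE_REQUEST_THRESHOLD) {
                createReplicaNear(clientRegion);
                remoteRequestsByRegion.put(clientRegion, 0);  // reset after acting
            }
        }

        void createReplicaNear(String region) {
            // In the real system this would ask the infrastructure to place a replica.
            System.out.println("creating replica near " + region);
        }
    }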

52
Event handlers summarize local events. These
summaries are stored in a database. The
information in the database is periodically
analyzed, and the necessary actions are taken. A
summary is also sent to other nodes.
53
Conclusion
  • OceanStore provides a global-scale, distributed
    storage platform through adaptation, fault
    tolerance and repair
  • It is self-maintaining
  • A prototype implemented in Java is under
    construction at UC Berkeley. Although it is not
    operational yet, many components are already
    functioning in isolation

54
The end
  • Questions?