Title: OceanStore: An Infrastructure for Global-Scale Persistent Storage
1OceanStore: An Infrastructure for Global-Scale
Persistent Storage
- John Kubiatowicz, David Bindel, Yan Chen, Steven
Czerwinski, Patrick Eaton, Dennis Geels,
Ramakrishna Gummadi, Sean Rhea, Hakim
Weatherspoon, Westley Weimer, Chris Wells, Ben
Zhao
A few slides have been borrowed from the authors'
presentations
2Vision
- What is OceanStore?
- a utility infrastructure to span the globe and
provide continuous access to persistent
information
Source: Berkeley OceanStore website
3Vision
- What is OceanStore?
- a utility infrastructure to span the globe and
provide continuous access to persistent
information
- data
- all kinds of information
- desktop, laptop, palmtop
- cars, cellular phones, other devices
- futuristic: embedded in environment
4Vision
- What is OceanStore?
- a utility infrastructure to span the globe and
provide continuous access to persistent
information
- persistence
- devices can be rebooted, lost, replaced
- reliable, durable data (deep archival will last
forever)
- automatic maintenance
5Vision
- What is OceanStore?
- a utility infrastructure to span the globe and
provide continuous access to persistent
information
- connectivity
- even to tiniest devices, possibly intermittent
- variable bandwidth, latency
- availability
- uniform access, comparable to LAN-based networked
storage
- fault-tolerant, DoS-tolerant
6Vision
- What is OceanStore?
- a utility infrastructure to span the globe and
provide continuous access to persistent
information
- scale
- geographically distributed
- 10^10 users
- 10^14 files / objects
7Questions about information
- Where is persistent information stored?
- 20th-century tie between location and content
outdated
- In world-scale system, locality is key
- How is it protected?
- Can disgruntled employee of ISP sell your
secrets?
- Can't trust anyone (how paranoid are you?)
- Can we make it indestructible?
- Want our data to survive the big one!
- Highly resistant to hackers (denial of service)
- Wide-scale disaster recovery
- Is it hard to manage?
- Worst failures are human-related
- Want automatic (introspective) diagnosis and
repair
8First Observation: Want Utility Infrastructure
- Mark Weiser from Xerox: Transparent computing is
the ultimate goal. Computers should disappear
into the background
- In the context of storage
- Don't want to worry about backup
- Don't want to worry about obsolescence
- Need lots of resources to make data secure and
highly available, BUT don't want to own them
- Outsourcing of storage already becoming popular
- Pay monthly fee and your data is out there
9Utility-based Infrastructure
(Figure labels: Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM)
- Service provided by confederation of companies
- Monthly fee paid to one service provider
- Companies buy and sell capacity from each other
10Target applications
- Email
- Group calendar, contacts
- Distributed design tools
- Computer Supported Cooperative Work
- Digital libraries
- Distributed/shared repositories
11Assumptions
- Untrusted infrastructure
- A small number of servers may crash or leak
information
- most of the servers function correctly
- a financially responsible party of servers ensures
integrity
- but only clients are trusted with cleartext
- Nomadic data
- data divorced from location
- flows freely within the storage infrastructure
- promiscuous caching anywhere, anytime
- location important for performance
- dynamic system tuning through introspection
12System overview
- persistent object
- GUID: 160-bit SHA-1 hash
- secure identification: globally unique and
unforgeable
- 2^80 unique objects before collisions (birthday
paradox)
- floating object replicas, independent of location
- encrypted data
- read
- try fast probabilistic replica search (Bloom
filter)
- fall back to slower deterministic search
(Tapestry)
- write
- update with predicates as in Bayou (what is
Bayou?)
- creates new version
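The birthday-paradox estimate for a 160-bit hash can be checked numerically. The sketch below uses the standard approximation p ≈ 1 - exp(-k(k-1)/2N) for k objects in a space of N = 2^160 identifiers; the function name `collision_probability` is illustrative and not part of OceanStore.

```python
import math

def collision_probability(num_objects: int, hash_bits: int = 160) -> float:
    """Approximate probability that at least two of num_objects
    uniformly random hash_bits-bit identifiers collide."""
    space = 2 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-k(k-1) / (2N)).
    exponent = num_objects * (num_objects - 1) / (2 * space)
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-exponent)

# Around 2^80 objects the collision probability becomes substantial,
# while at the 10^14 objects targeted by the system it is negligible.
p_at_2_80 = collision_probability(2 ** 80)
p_at_target = collision_probability(10 ** 14)
```

Under this approximation the probability is already about 0.39 at 2^80 objects, which is why that figure is quoted as the practical limit.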
13What is Bayou?
- The Bayou System (Xerox PARC) is a platform of
replicated, highly-available, variable-consistency
databases on which collaborative applications
can be built. It caters to portable devices
having intermittent connections.
14System overview
- application interface
- sessions: sequence of reads/writes
- session guarantees (Bayou)
- consistency levels from loose to ACID
- active and archival forms
- active: latest version, with update handle
- archive: erasure-coded read-only version
- dynamic optimization
- object location
- degree of replication
15Tentative Updates: Epidemic Dissemination
16Committed Updates: Multicast Dissemination
17naming
- self-certifying path names (Mazières)
- object GUID: hash of owner key and readable name
- create hierarchies using directory objects
- read restriction
- through client encryption of data
- write restriction, access control
- associate ACLs with object, respected by
servers
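A minimal sketch of how a self-certifying GUID might be derived from the owner's key and a human-readable name. The byte layout (a length-prefixed key followed by the UTF-8 name) and the function name `object_guid` are assumptions made for illustration; the slides only state that the GUID is a SHA-1 hash of the two.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> str:
    """Return a 160-bit GUID (40 hex digits) for (owner key, name).

    Assumed encoding: 4-byte big-endian key length, key bytes, then
    the UTF-8 name. The length prefix avoids ambiguity between
    different (key, name) pairs that concatenate to the same bytes.
    """
    h = hashlib.sha1()
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()

# Hypothetical keys and names, for illustration only.
alice_key = b"alice-public-key-bytes"
bob_key = b"bob-public-key-bytes"
guid_a = object_guid(alice_key, "/home/alice/calendar")
guid_b = object_guid(bob_key, "/home/alice/calendar")
```

Because the owner's key is part of the hash input, two owners using the same readable name get different GUIDs, and anyone holding the key and name can recompute the GUID to check it.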
18addressing
- address an object by its GUID
- message: GUID, random number, small predicate
- route to closest GUID replica matching predicate
- combines data location and routing
- no central name service to attack
- save one round-trip for location discovery
- routing
- fast, probabilistic search algorithm
- slow, deterministic search algorithm
19routing
- fast, probabilistic search algorithm
- Bloom filter
- probabilistic set membership test using bit
vector
- n-bit vector generated from n hashes of each set
element
- filter is union (OR) of all bit vectors
- attenuated Bloom filter
- array of d Bloom filters
- i-th Bloom filter is union of all <i-hop nodes
- slow, deterministic algorithm
- Tapestry
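The Bloom filter and its attenuated variant can be sketched as follows. This is an illustration under stated assumptions, not the OceanStore implementation: the filter size, the number of hashes, the use of salted SHA-1 as the hash family, and the grouping of nodes into levels by hop distance are all choices made for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int serves as the bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits  # union is a bitwise OR
        return merged

def attenuated_filter(filters_by_level):
    """Build an array of d filters; entry i is the OR of the per-node
    filters grouped at level i (assumed here: grouped by hop distance)."""
    result = []
    for node_filters in filters_by_level:
        level = BloomFilter()
        for f in node_filters:
            level = level.union(f)
        result.append(level)
    return result

def first_matching_level(attenuated, guid):
    """Return the smallest level whose filter may hold guid, else None."""
    for depth, level in enumerate(attenuated):
        if level.might_contain(guid):
            return depth
    return None

# Hypothetical neighbourhood: nothing stored locally, one object nearby.
local = BloomFilter()
neighbour = BloomFilter()
neighbour.add("guid-of-object-B")
levels = attenuated_filter([[local], [neighbour]])
```

A query that finds no matching level is the point at which a reader would fall back to the slower deterministic search. A match is only a hint, since Bloom filters can report false positives but never false negatives.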
20addressing and routing
deterministic
probabilistic
21Attenuated Bloom Filter
22updates
- Updates based on versioning and conflict
resolution
- i.e. no locking
- update: actions with predicates
- commit: apply action of first true predicate
- abort: no true predicates
- conflict resolution on encrypted data
- possible predicates
- compare-version, compare-size, compare-block,
search
- possible actions
- replace-block, insert-block, delete-block, append
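The commit/abort rule for predicate-based updates can be sketched as below. The data model (an object as a version number plus a list of opaque, already-encrypted blocks) and all names such as `ObjectVersion` and `apply_update` are hypothetical; only two of the listed predicates and actions are shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ObjectVersion:
    version: int
    blocks: List[bytes]  # opaque ciphertext blocks; servers never decrypt

Predicate = Callable[[ObjectVersion], bool]
Action = Callable[[List[bytes]], List[bytes]]

def compare_version(expected: int) -> Predicate:
    return lambda obj: obj.version == expected

def compare_block(index: int, expected: bytes) -> Predicate:
    return lambda obj: index < len(obj.blocks) and obj.blocks[index] == expected

def append(block: bytes) -> Action:
    return lambda blocks: blocks + [block]

def replace_block(index: int, block: bytes) -> Action:
    def act(blocks: List[bytes]) -> List[bytes]:
        new_blocks = list(blocks)
        new_blocks[index] = block
        return new_blocks
    return act

def apply_update(obj: ObjectVersion,
                 update: List[Tuple[Predicate, List[Action]]]):
    """Commit the actions paired with the first true predicate,
    producing a new version; abort if no predicate holds."""
    for predicate, actions in update:
        if predicate(obj):
            blocks = list(obj.blocks)
            for action in actions:
                blocks = action(blocks)
            return ObjectVersion(obj.version + 1, blocks), True
    return obj, False  # abort: object left unchanged

current = ObjectVersion(version=3, blocks=[b"enc0", b"enc1"])
update = [
    (compare_version(2), [append(b"stale")]),
    (compare_version(3), [replace_block(0, b"new0"), append(b"enc2")]),
]
new_version, committed = apply_update(current, update)
```

The predicates here compare only version numbers and ciphertext bytes, which is consistent with resolving conflicts without access to cleartext. No locks are taken: a writer whose predicates all fail simply sees an abort.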
23archival
- produced when objects idle
- use erasure codes (redundant fragmentation)
- simplest example: parity bit
- need any (n-1) out of n fragments
- interleaved Reed-Solomon codes, Tornado codes
- fragmentation improves reliability
- deep archival storage
- sweeper processes ensure replication sustained
over time
- fragmentation improves performance
24Erasure Codes
Simple parity bits, or generalized Reed-Solomon
codes can be used to implement it.
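The simple parity case can be illustrated with XOR: n-1 data fragments plus one parity fragment, any n-1 of which suffice to rebuild the rest. This is a sketch of the simplest erasure code only, assuming equal-length fragments; it is not the Reed-Solomon or Tornado coding mentioned for the archive, and the function names are illustrative.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_fragments):
    """Append one parity fragment: the XOR of all data fragments."""
    parity = reduce(xor_bytes, data_fragments)
    return list(data_fragments) + [parity]

def recover(fragments):
    """Rebuild at most one missing fragment (marked as None)."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    if not missing:
        return list(fragments)
    present = [f for f in fragments if f is not None]
    # XOR of the n-1 surviving fragments equals the lost one.
    rebuilt = reduce(xor_bytes, present)
    restored = list(fragments)
    restored[missing[0]] = rebuilt
    return restored

stored = encode([b"abcd", b"efgh", b"ijkl"])  # 3 data + 1 parity
damaged = list(stored)
damaged[1] = None                             # lose one fragment
repaired = recover(damaged)
```

Codes such as Reed-Solomon generalize this idea so that more than one lost fragment can be tolerated.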
25Floating Replica and Deep Archival Coding
26dynamic optimization (introspection)
- observation modules
- collect and summarize information
- incrementally update system database
- optimization modules
- periodically process the observation database
- cluster recognition: group related objects
- replica management: maintain replica number and
location
- periodic migration: work-home-work-home
- maintenance: routing, dissemination,
availability, durability