Title: DStore: Recovery-friendly, self-managing clustered hash table
DStore: Recovery-friendly, self-managing clustered hash table
- Andy Huang and Armando Fox, Stanford University
Outline
- Proposal
  - Why? The goal
  - What? The class of state we focus on
  - How? Technique for achieving the goal
- Quorum algorithm and recovery results
- Repartitioning algorithm and availability results
- Conclusion
Why? Simplify state management
- The stateless tiers (frontends, app servers) are SIMPLE to manage; the stateful tier (DB/FS) is COMPLEX:
  - Configuration: plug-and-play vs. repartitioning
  - Recovery: simple and non-intrusive vs. minutes of unavailability
What? Non-transactional data
- User preferences
  - Explicit: name, address, etc.
  - Implicit: usage statistics (Amazon's "items viewed")
- Collaborative workflow data
  - Examples: insurance claims, human resources files
- Spectrum of data: read-mostly (catalogs), non-transactional read/write (user prefs, workflow data), transactional (billing)
How? Decouple using a hash table and quorums
- Hypothesis: a state store designed for non-transactional data can be decoupled so that it can be managed like a stateless system
- Technique 1: expose a hash table API
  - Repartitioning scheme is simple (no complex data dependencies; see the partitioning sketch below)
- Technique 2: use quorums (read/write majority)
  - Recovery is simple (no special-case recovery mechanism)
  - Recovery is non-intrusive (data available throughout)
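
A minimal sketch of why hash-table partitioning stays simple, assuming a prefix-of-the-hash scheme; the hash function, bit counts, and function names here are illustrative, not DStore's actual code:

```python
import hashlib

RGID_BITS = 2  # illustrative: 2 bits of replica group ID = 4 groups

def rgid(key: str, bits: int = RGID_BITS) -> int:
    """Map a key to its replica group ID using only the key's hash."""
    digest = hashlib.sha1(key.encode()).digest()
    return digest[0] >> (8 - bits)        # top `bits` bits of the hash

# Because placement depends only on the key's hash, splitting a group just
# appends one rgid bit: keys in group 0b01 land in 0b010 or 0b011, and no
# other key or replica group is touched; there are no cross-key data
# dependencies to untangle.
print(rgid("user:42"), rgid("user:42", bits=RGID_BITS + 1))
```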
Architecture overview
- Brick: stores data
- Dlib: exposes the hash table API to the app server and executes quorum-based reads/writes on bricks (sketched below)
- Replica groups: bricks storing the same portion of the key space form a replica group
- (Diagram: app servers, each with a Dlib, talking to a cluster of bricks)
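
A structural sketch of these roles; the class names, method signatures, and sequential quorum calls are assumptions for illustration, and the read path here omits the writeback described on the following slides:

```python
import hashlib

class Brick:
    """Stores its share of the key space as (value, timestamp) pairs."""
    def __init__(self):
        self.store = {}

    def write(self, key, value, ts):
        self.store[key] = (value, ts)
        return True                                  # ack the write

    def read(self, key):
        return self.store.get(key)                   # (value, ts) or None

class Dlib:
    """Runs in the app server: exposes a hash table API and issues
    quorum-based reads/writes to the bricks of the key's replica group."""
    def __init__(self, groups):
        self.groups = groups                         # list of replica groups (lists of Bricks)

    def _replica_group(self, key):
        h = hashlib.sha1(key.encode()).digest()[0]
        return self.groups[h % len(self.groups)]

    def put(self, key, value, ts):
        bricks = self._replica_group(key)
        acks = sum(1 for b in bricks if b.write(key, value, ts))
        return acks > len(bricks) // 2               # need a majority of acks

    def get(self, key):
        bricks = self._replica_group(key)
        majority = len(bricks) // 2 + 1
        replies = [r for r in (b.read(key) for b in bricks[:majority]) if r]
        return max(replies, key=lambda r: r[1])[0] if replies else None
```

For instance, Dlib([[Brick(), Brick(), Brick()]]) gives a single three-brick replica group.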
Algorithm: Wavering reads
- No two-phase commit (it complicates recovery and introduces coupling)
- Client C1 attempts to write, but fails before completing the write
- Quorum property violated: reading a majority doesn't guarantee the latest value is returned
- Result: wavering reads (illustrated below)
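
A standalone illustration of the scenario, using plain dicts as stand-ins for three bricks in one replica group; the values x0/x1 follow the slide:

```python
# All three bricks initially hold x0 (timestamp 0) for key "k".
bricks = [{"k": ("x0", 0)}, {"k": ("x0", 0)}, {"k": ("x0", 0)}]

# C1 starts writing x1 (timestamp 1) but crashes after reaching only brick 0.
bricks[0]["k"] = ("x1", 1)

def read_majority(indices):
    """Read two of the three bricks and return the freshest value seen."""
    replies = [bricks[i]["k"] for i in indices]
    return max(replies, key=lambda r: r[1])[0]

# Without writeback, the answer depends on which majority is contacted:
print(read_majority((0, 1)))  # "x1": this majority includes the partial write
print(read_majority((1, 2)))  # "x0": this one misses it, so reads waver
```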
Algorithm: Read writeback
- Idea: commit a partial write when it is first read (see the sketch below)
- The writeback is the commit point:
  - Before it, reads may return x0
  - After it, reads return x1
- Proven linearizable under the fail-stop model
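
A minimal sketch of a quorum read with writeback, again using dicts of key -> (value, timestamp) as stand-in bricks; the interface is assumed, not DStore's:

```python
def read_with_writeback(bricks, key):
    """Quorum read that commits any partial write it observes."""
    majority = len(bricks) // 2 + 1
    polled = bricks[:majority]                       # any majority will do
    replies = [b.get(key, (None, -1)) for b in polled]
    value, ts = max(replies, key=lambda r: r[1])     # freshest value seen

    # Writeback: this is the commit point. Before it, another majority may
    # still return the old value (x0); after it, a majority holds the new
    # value (x1), so every later majority read returns it.
    for b in polled:
        if b.get(key, (None, -1))[1] < ts:
            b[key] = (value, ts)
    return value

# Continuing the wavering-reads example: three bricks, one partial write.
bricks = [{"k": ("x1", 1)}, {"k": ("x0", 0)}, {"k": ("x0", 0)}]
print(read_with_writeback(bricks, "k"))      # "x1", and brick 1 is repaired
print(read_with_writeback(bricks[1:], "k"))  # "x1": any majority now agrees
```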
Algorithm: Crash recovery
- Fail-stop is not an accurate model: it implies the client that generated the request fails permanently
- With writeback, the commit point occurs sometime in the future
- A writer expects a request to succeed or fail, not to remain in progress
Algorithm: Write in-progress
- Requirement: a write must be committed or aborted on the next read
- Record in-progress writes on the client (sketched below):
  - On submit: write a start cookie
  - On return: write the matching end cookie
  - On read: if a start cookie has no matching end cookie, read all replicas
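
A sketch of the client-side cookie log; the names and structure are assumptions, and the point is only the start/end pairing and the read-all rule:

```python
import uuid

class CookieLog:
    """Client-side record of writes that may still be in progress."""

    def __init__(self):
        self.start_cookies = {}   # key -> cookie id of the last submitted write
        self.end_cookies = set()  # cookie ids of writes known to have returned

    def on_submit(self, key) -> str:
        cookie = str(uuid.uuid4())
        self.start_cookies[key] = cookie     # start cookie: write submitted
        return cookie

    def on_return(self, cookie: str):
        self.end_cookies.add(cookie)         # end cookie: write completed

    def must_read_all(self, key) -> bool:
        """True if the last write to `key` has no matching end cookie, so
        the next read must contact all replicas to commit or abort it."""
        cookie = self.start_cookies.get(key)
        return cookie is not None and cookie not in self.end_cookies

log = CookieLog()
c = log.on_submit("user:42")           # logged before the write is sent
# ... the write's fate is unknown until on_return(c) is called ...
assert log.must_read_all("user:42")    # next read of this key must read all
```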
Algorithm: The common case
- Write all, wait for a majority (see the sketch below)
  - Normally, all replicas perform the write
- Read a majority
  - Normally, the replicas return non-conflicting values
- Writeback is performed only when a brick fails, or when it is temporarily overloaded and missed some writes
- Read-all is performed only when an app server fails
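
A sketch of the write path under these assumptions: a thread pool stands in for parallel RPCs, and brick.write(key, value, ts) -> bool is the same illustrative interface as the Brick sketch earlier, not DStore's real one:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def write_all_wait_majority(bricks, key, value, ts, timeout=1.0):
    """Send the write to every brick in the replica group, but report
    success as soon as a majority has acked, so one slow or crashed
    brick never stalls the request."""
    majority = len(bricks) // 2 + 1
    pool = ThreadPoolExecutor(max_workers=len(bricks))
    futures = [pool.submit(b.write, key, value, ts) for b in bricks]
    acks = 0
    try:
        for f in as_completed(futures, timeout=timeout):
            try:
                if f.result():
                    acks += 1            # one replica acked the write
            except Exception:
                pass                     # a crashed brick simply never acks
            if acks >= majority:
                break                    # majority reached; stop waiting
    except TimeoutError:
        pass                             # overloaded bricks: decide with the acks we have
    finally:
        pool.shutdown(wait=False)        # let stragglers finish in the background
    return acks >= majority
```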
Results: Simple, non-intrusive recovery
- Normal operation: a majority must complete each write
- Failure: if fewer than a majority of bricks fail, writes still succeed
- Recovery: equivalent to missing a few writes during normal operation
  - Simple: no special recovery code
  - Non-intrusive: data remains available throughout
Benchmark: Simple, non-intrusive recovery
- Benchmark
  - t = 60 sec: one brick killed
  - t = 120 sec: brick restarted
- Summary
  - Data available during failure and recovery
  - Recovering brick restores throughput within seconds
Benchmark: Availability under performance faults
- Fault causes
  - Cache warming
  - Garbage collection
- Benchmark
  - Degrade one brick by wasting CPU cycles
- Comparison
  - DStore: throughput remains steady
  - ROWA (read-one/write-all): throughput is throttled by the slowest brick
Repartitioning algorithm and availability results
Algorithm: Online repartitioning (sketched below)
- Split the replica group ID (rgid), but announce both halves
- Take a brick offline (looks just like a failure)
- Copy its data to a new brick
- Change the rgids and bring both bricks online
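
A step-by-step sketch of this procedure over an assumed in-memory representation: dicts for bricks and groups, and a left/right rgid split made by appending one bit; none of the names are DStore's own:

```python
def split_replica_group(group, new_brick):
    """Walk through the four repartitioning steps for one brick pair.

    `group` is {"rgid": int, "announced_rgids": [...], "bricks": [...]};
    each brick is {"rgid": int, "online": bool, "store": dict}. All of
    this is an illustrative stand-in, not DStore's data structures."""
    # 1. Split the rgid, but announce both halves so clients keep finding
    #    the group under either ID while the split is in progress.
    old = group["rgid"]
    left_rgid, right_rgid = old << 1, (old << 1) | 1
    group["announced_rgids"] = [left_rgid, right_rgid]

    # 2. Take one brick offline; to the quorum protocol this looks exactly
    #    like a brick failure, so reads and writes keep succeeding.
    donor = group["bricks"][0]
    donor["online"] = False

    # 3. Copy the offline brick's data onto the freshly provisioned brick.
    new_brick["store"] = dict(donor["store"])

    # 4. Assign the new rgids and bring both bricks online; any writes the
    #    donor missed while offline are repaired lazily by read writeback,
    #    just as after an ordinary failure.
    donor["rgid"], new_brick["rgid"] = left_rgid, right_rgid
    donor["online"] = True
    new_brick["online"] = True
    group["bricks"].append(new_brick)
```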
Benchmark: Online repartitioning
- Benchmark
  - t = 120 sec: group 0 repartitioned
  - t = 240 sec: group 1 repartitioned
- Non-intrusive
  - Data available during the entire process
  - Appears as if a brick just failed and recovered (but there are now more bricks)
Conclusion
- Goal: simplify management for non-transactional data
- Techniques: expose a hash table API and use quorums
- Results
  - Recovery is simple and non-intrusive
  - Repartitioning can be done fully online
- Next steps
  - True plug-and-play: automatically repartition when bricks are added or removed (simplified by the hash table partitioning scheme)
- Questions: andy.huang_at_stanford.edu