Title: Efficient Replica Maintenance for Distributed Storage Systems
1 Efficient Replica Maintenance for Distributed Storage Systems
- B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. Kaashoek, J. Kubiatowicz, and R. Morris. In Proc. of NSDI, May 2006.
- Presenter: Fabián E. Bustamante
2 Replication in Wide-Area Storage
- Applications put/get objects into/from the wide-area storage system
- Objects are replicated for
  - Availability
    - A get on an object will return promptly
  - Durability
    - Objects put by the app are not lost due to disk failures
- An object may be durably stored but not immediately available (see the sketch below)
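To make the distinction concrete, here is a minimal illustrative check (the function and variable names are assumptions for illustration, not part of the paper): an object is durable as long as at least one replica sits on a disk that has not failed permanently, and available only if at least one replica sits on a node that is reachable right now.

```python
def is_durable(replica_nodes, failed_disks):
    # Durable: at least one copy survives on a disk that has not failed permanently.
    return any(n not in failed_disks for n in replica_nodes)

def is_available(replica_nodes, reachable_nodes):
    # Available: at least one copy is on a node that can answer a get right now.
    return any(n in reachable_nodes for n in replica_nodes)

# Example: the only surviving copy is on node "c", which is temporarily offline,
# so the object is durable but not immediately available.
replicas = {"a", "b", "c"}
failed_disks = {"a", "b"}       # permanent disk failures destroyed these copies
reachable = {"b", "d"}          # nodes currently online ("c" is down transiently)
print(is_durable(replicas, failed_disks))    # True
print(is_available(replicas, reachable))     # False
```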
3 Goal: durability at low bandwidth cost
- Durability is a more practical, useful goal than availability
- Threat to durability
  - Losing the last copy of an object
  - So, create copies faster than they are destroyed
- Challenges
  - Replication can eat your bandwidth
  - Hard to distinguish between transient and permanent failures
  - After recovery, some replicas may be on nodes the lookup algorithm does not check
- Paper presents Carbonite, an efficient wide-area replication technique for durability
4 System Environment
- Use PlanetLab (PL) as representative
  - >600 nodes distributed world-wide
  - Historical traces collected by the CoMon project (every 5 minutes)
  - Disk failures from event logs of PlanetLab Central
- Synthetic traces (a trace-generation sketch follows the table below)
  - 632 nodes, as in PL
  - Failure inter-arrival times from an exponential dist. (mean session time and downtime as in PL)
  - Two years instead of one, and avg. node lifetime of 1 year
- Simulation
  - Trace-driven, event-based simulator
  - Assumptions
    - Network paths are independent
    - All nodes reachable from all other nodes
    - Each node with the same link capacity
PlanetLab trace characteristics (median / avg / 90th percentile where three values are given):
  Dates                           3/1/05 - 2/28/06
  Hosts                           632
  Transient failures              21355
  Disk failures                   219
  Transient host downtime (s)     1208 / 104647 / 14242
  Any failure interarrival (s)    305 / 1467 / 3306
  Disk failure interarrival (s)   54411 / 143476 / 490047
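A minimal sketch of how a synthetic failure trace of this kind could be generated, assuming exponential inter-arrival times (the function name, parameter names, and default means below are illustrative assumptions, not the paper's trace generator):

```python
import random

def synthetic_trace(n_nodes=632, years=2.0,
                    mean_session_s=432_000, mean_downtime_s=86_400, seed=0):
    """Return a list of (node, down_at, up_at) transient-failure events.

    Session lengths and downtimes are drawn from exponential distributions;
    the means here are placeholders standing in for the per-node statistics
    measured on PlanetLab.
    """
    rng = random.Random(seed)
    horizon = years * 365 * 24 * 3600
    events = []
    for node in range(n_nodes):
        t = 0.0
        while t < horizon:
            t += rng.expovariate(1.0 / mean_session_s)     # node stays up
            down = rng.expovariate(1.0 / mean_downtime_s)  # then goes down
            if t < horizon:
                events.append((node, t, min(t + down, horizon)))
            t += down
    return sorted(events, key=lambda e: e[1])

print(len(synthetic_trace()), "transient failure events")
```

Permanent (disk) failures could be drawn the same way, using the one-year mean node lifetime mentioned in the slide above.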
5 Understanding durability
- To handle some avg. rate of failure, create new replicas faster than they are destroyed (see the sketch below)
  - Creation rate is a function of the per-node access link, the number of nodes, and the amount of data stored per node
- An infeasible system, unable to keep pace with the avg. failure rate, will eventually "adapt" by discarding objects (which ones?)
- If the creation rate is just above the failure rate, a failure burst may be a problem
- Target number of replicas to maintain: rL
  - Durability does not increase continuously with rL
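A rough back-of-the-envelope sketch of the feasibility condition above, under the assumption that over a disk lifetime each node's access link must push out at least one full copy of the data it stores (names and example values are illustrative, not from the paper):

```python
def max_sustainable_data_per_node(spare_link_bytes_per_s, mean_disk_lifetime_s):
    """Upper bound on the data a node can durably maintain.

    Each disk failure forces the system to re-create the replicas that disk
    held, so copies must be created at least as fast as failures destroy them.
    """
    return spare_link_bytes_per_s * mean_disk_lifetime_s

# Example: 150 KB/s of spare upload bandwidth and a 1-year mean disk lifetime
# (the synthetic-trace assumption) bound how much data each node can keep durable.
spare_bw = 150 * 1024               # bytes/s
disk_lifetime = 365 * 24 * 3600     # seconds
print(max_sustainable_data_per_node(spare_bw, disk_lifetime) / 2**30, "GiB")
```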
6 Improving repair time
- Scope: the set of other nodes that can hold copies of the objects a node is responsible for (see the ring sketch after this list)
- Small scope
  - Easier to keep track of copies
  - Effort of creating copies falls on a small set of nodes
  - Addition of nodes may result in needless copying of objects (when combined with consistent hashing)
- Large scope
  - Spreads work among more nodes
  - Network traffic sources/destinations are spread out
  - Temporary failures will be noticed by more nodes
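One way to picture scope under consistent hashing, as a minimal sketch (the ring construction and helper names are assumptions for illustration, not the paper's placement code): with scope k, only the k successors of an object's key may ever hold its replicas.

```python
import bisect
import hashlib

def ring_id(name: str) -> int:
    # Hash names onto a circular ID space (illustrative, not the paper's DHT).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def replica_candidates(key: int, ring: list, scope: int) -> list:
    """Return the `scope` successors of `key` on a sorted ID ring.

    A small scope concentrates repair work (and bookkeeping) on a few nodes;
    a large scope spreads both, and lets more nodes notice a failure.
    """
    ring = sorted(ring)
    start = bisect.bisect_left(ring, key) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(scope, len(ring)))]

nodes = [ring_id(f"node{i}") for i in range(16)]
print(replica_candidates(ring_id("some-object"), nodes, scope=4))
```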
7 Reducing transient costs
- It is impossible to distinguish transient from permanent failures
- To minimize network traffic due to transient failures, reintegrate returning replicas
- Carbonite (see the maintenance sketch below)
  - Select a suitable value for rL
  - Respond to a detected failure by creating a new replica
  - Reintegrate replicas that come back after transient failures
(Figure: bytes sent by different maintenance algorithms)
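A minimal sketch of the maintenance rule summarized above (class and parameter names are assumptions, not the paper's code): repair only when fewer than rL copies are reachable, and never discard copies that reappear after a transient failure.

```python
class ObjectState:
    """Per-object bookkeeping for a Carbonite-style maintainer (illustrative)."""

    def __init__(self, r_l: int):
        self.r_l = r_l        # target number of reachable replicas
        self.known = set()    # every node that has ever been given a copy

    def maintain(self, reachable_nodes: set, make_replica) -> None:
        # Count only copies on nodes that currently respond to probes; copies
        # on unreachable nodes are kept, so they are reintegrated on return.
        live = self.known & reachable_nodes
        # Repair: create new copies until r_L replicas are reachable again.
        while len(live) < self.r_l:
            new_node = make_replica()   # assumed helper: copy the object to a node in scope
            self.known.add(new_node)
            live.add(new_node)
```

Because `known` is never pruned, a node that was only transiently down contributes its copy again as soon as it is reachable, which is what reduces the bytes sent in the reintegration comparison on the next slide.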
8 Reducing transient costs
(Figure: bytes sent with and without reintegration)
(Figure: impact of timeouts on bandwidth and durability)
9 Assumptions
- The PlanetLab testbed can be seen as representative of something
- Immutable data
- Relatively stable system membership; data loss driven by disk failures
- Disk failures are uncorrelated
- Simulation
  - Network paths are independent
  - All nodes reachable from all other nodes
  - Each node with the same link capacity