Title: Jan 9-10, 2006
Tempest: An Architecture for Scalable
Time-Critical Services
Mahesh Balakrishnan, Ken Birman,
Tudor Marian, Amar Phanishayee
Cornell University
Scalability in # of Groups: Ricochet vs SRM
Tempest
Application and data structures
Replication for scalability, fault-tolerance,
time-criticality
- Goals
- Provide programmers with replicated data storage
primitives that have very fast average performance and
good worst-case timing guarantees.
- Easy deployment, monitoring, and management of
time-critical scalable services in a clustered
environment
SRM's discovery delay is the lower bound on
recovery. SRM's recovery delay scales poorly with
# of groups (delay in seconds!). Ricochet scales
in # of groups (14 ms in 1 group to 24 ms in 1024
groups).
- Approach
- clone services for scalability, fault tolerance
- automate replica placement (service colocation)
- fine-grained data caching
- response time monitoring to detect service
slowdown
- redundant querying for faster response
- UI to drag and drop services onto a cluster
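The "redundant querying for faster response" item above can be illustrated with a minimal sketch. It assumes each cloned service replica can be modeled as a plain callable; the names `redundant_query`, `make_replica`, `fanout`, and `timeout` are hypothetical and are not part of Tempest's actual interface. The same request goes to several replicas at once and the first successful reply wins.

```python
import concurrent.futures as cf
import time


def redundant_query(replicas, request, fanout=2, timeout=1.0):
    """Send `request` to the first `fanout` replicas; return the earliest good reply.

    `replicas` is a list of callables standing in for cloned services
    (an assumption made for illustration only).
    """
    pool = cf.ThreadPoolExecutor(max_workers=fanout)
    futures = [pool.submit(replica, request) for replica in replicas[:fanout]]
    try:
        # as_completed yields futures in the order they finish.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return fut.result()
        raise RuntimeError("all queried replicas failed")
    finally:
        # Do not block on slower replicas once an answer is in hand.
        pool.shutdown(wait=False)


def make_replica(name, delay):
    """Hypothetical replica that answers after `delay` seconds."""
    def handle(request):
        time.sleep(delay)
        return f"{name}:{request}"
    return handle


replicas = [make_replica("slow", 0.2), make_replica("fast", 0.01)]
answer = redundant_query(replicas, "get x")
print(answer)  # fast:get x
```

The trade-off in this sketch is extra load (every request is executed `fanout` times) in exchange for a response time set by the fastest replica rather than by a single, possibly slow, one.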
- Accomplishments
- Ricochet: Low-Latency Multicast for Scalable
Time-Critical Services (submitted to NSDI, Oct
2005)
- Scalable Services Architecture (submitted to
ICDCS, Nov 2005)
Recovery Distribution: Ricochet vs SRM in 64
groups
Ricochet - Low-Latency Scalable Multicast
SRM recovery is centered around 9 seconds, Ricochet
around 15 milliseconds: more than 2 orders of magnitude
improvement! The improvement increases with the number of
groups.
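As a quick check of the "orders of magnitude" figure, the ratio can be worked out directly from the two central recovery times quoted above (9 seconds and 15 milliseconds). The variable names are only illustrative.

```python
import math

srm_recovery_s = 9.0         # SRM recovery, centered around 9 seconds
ricochet_recovery_s = 0.015  # Ricochet recovery, centered around 15 milliseconds

speedup = srm_recovery_s / ricochet_recovery_s
orders = math.log10(speedup)

print(f"{speedup:.0f}x, about {orders:.1f} orders of magnitude")
# 600x, about 2.8 orders of magnitude
```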
- Receiver-based FEC (forward error correction)
- Random selection of nodes to send repair packet
- Tunable per-group overhead
- Recovery time dependent on data incoming at a
node across all groups
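The receiver-based FEC idea in the list above can be sketched as follows. This is a simplified illustration, not Ricochet's actual wire format or algorithm: it assumes repair packets are the XOR of fixed-length data packets, and the names `make_repair`, `pick_targets`, and `recover` are hypothetical. A receiver combines packets it has received into one repair packet, sends it to randomly chosen peers, and a peer missing exactly one of the covered packets can rebuild it.

```python
import random
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (fixed-length packets are assumed)."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets):
    """Build a repair packet: the covered packet ids plus the XOR of their payloads."""
    ids = frozenset(packets)
    payload = reduce(xor_bytes, packets.values())
    return ids, payload


def pick_targets(peers, count, rng):
    """Randomly select which peers receive the repair packet."""
    return rng.sample(peers, min(count, len(peers)))


def recover(repair, received):
    """Recover the single missing packet covered by `repair`, if exactly one is missing."""
    ids, payload = repair
    missing = [i for i in ids if i not in received]
    if len(missing) != 1:
        return None  # nothing to repair, or too many losses for one XOR
    for i in ids:
        if i in received:
            payload = xor_bytes(payload, received[i])
    return missing[0], payload


# A receiver that got packets 1-3 builds a repair and picks random peers.
sent = {1: b"aaaa", 2: b"bbbb", 3: b"cccc"}
repair = make_repair(sent)
targets = pick_targets(["n1", "n2", "n3", "n4"], 2, random.Random(0))

# A peer that lost packet 2 rebuilds it from the repair packet.
have = {1: b"aaaa", 3: b"cccc"}
print(recover(repair, have))  # (2, b'bbbb')
```

In this sketch, how many data packets are folded into each repair and how many peers receive it are the natural knobs for trading overhead against recovery probability; treating them as the "tunable per-group overhead" is an assumption, not something the slides specify.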
Histogram of Inconsistency Window in Ricochet
(64 groups)
Updates are reflected at all replicas:
65% within 1.25 ms, 90% within 18 ms,
99% within 77 ms, 100% within 125 ms
Jan 9-10, 2006