Title: Epidemic Protocols
1. Epidemic Protocols
- CS614
- March 7th, 2002
- Ashish Motivala
2. Papers
- "Epidemic algorithms for replicated database maintenance", Alan Demers, Dan Greene, Carl Hauser, Wes Irish and John Larson. Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, 1987.
- "Bimodal multicast", Kenneth P. Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu and Yaron Minsky. ACM Trans. Comput. Syst. 17, 2 (May 1999).
- "Managing update conflicts in Bayou, a weakly connected replicated storage system", D. B. Terry, M. M. Theimer, Karin Petersen, A. J. Demers, M. J. Spreitzer and C. H. Hauser. SOSP, 1995.
- "Flexible update propagation for weakly consistent replication", Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers. SOSP, 1997.
- "Fighting fire with fire: using randomized gossip to combat stochastic scalability limits", Indranil Gupta, Kenneth P. Birman, Robbert van Renesse. To appear, March 2002.
- "The Dangers of Replication and a Solution", Jim Gray, Pat Helland, Patrick O'Neil, Dennis Shasha. SIGMOD 1996 (read in CS632).
3. Simple Epidemic
- Assume a fixed population of size n
- For simplicity, assume homogeneous spreading
- Simple epidemic: anyone can infect anyone with equal probability
- Assume that k members are already infected
- Infection occurs in rounds
4Probability of Infection
- Probability Pinfect(k,n) that a particular
uninfected member is infected in a round if k are
already in a round if k are already infected? - Pinfect(k,n) 1 P(nobody infects member)
- 1 (1 1/n)k
- E(newly infected members) (n-k)x Pinfect(k,n)
- Basically its a Binomial Distribution
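The formulas above are easy to check numerically. A minimal sketch (assuming, as on the slide, that each of the k infected members contacts one uniformly random member per round; function names are invented here):

```python
def p_infect(k, n):
    """Probability that a particular uninfected member is infected
    in one round, given k already-infected members out of n:
    P_infect(k, n) = 1 - (1 - 1/n)^k."""
    return 1.0 - (1.0 - 1.0 / n) ** k

def expected_new(k, n):
    """Expected number of newly infected members in one round:
    E = (n - k) * P_infect(k, n)."""
    return (n - k) * p_infect(k, n)

# For k = n/2 and large n this approaches 1 - 1/sqrt(e) ~ 0.39:
print(round(p_infect(50, 100), 3))   # 0.395
```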
5. 2 Phases
- Intuition: 2 phases
- Infection
  - Initial growth factor is very high, about 2
  - Exponential growth
- Uninfection
  - Slow death of the uninfected population to start
  - Exponential decline
- Number of rounds necessary to infect the entire population is O(log n)
  - First half, 1 → n/2: Phase 1
  - Second half, n/2 → n: Phase 2
- For large n, P_infect(n/2, n) ≈ 1 - (1/e)^0.5 ≈ 0.4
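A quick simulation, a sketch under the same uniform-target assumption (the function name is invented here), shows both phases and the O(log n) round count:

```python
import random

def rounds_to_infect(n, seed=0):
    """Simulate simple push gossip: each round, every infected member
    infects one member chosen uniformly at random. Returns the number
    of rounds until all n members are infected."""
    rng = random.Random(seed)
    infected = {0}
    rounds = 0
    while len(infected) < n:
        infected |= {rng.randrange(n) for _ in range(len(infected))}
        rounds += 1
    return rounds

# Roughly doubles per round early (phase 1), then slowly clears the
# remaining stragglers (phase 2); total rounds grow like log n.
print(rounds_to_infect(1000))
```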
6. Applications for Epidemic Protocols
- Reliable multicast: virtual synchrony, randomized rumour spreading
- Systems (database replication): Clearinghouse, Grapevine, Bayou
- Membership and failure detection: SWIM, SCAMP
- Data aggregation
- Other distributed protocols: leader election, Lightweight Probabilistic Broadcast (delta reliability), Li Li's work, Kempe and Kleinberg's work
- Our focus today
7. Grapevine and Clearinghouse
- Weakly consistent replication was used at Xerox PARC
- Grapevine and Clearinghouse name services
- Updates are propagated by unreliable multicast ("direct mail")
- Periodic anti-entropy exchanges among replicas ensure that they eventually converge, even if updates are lost
- Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases
- Various mechanisms (e.g., MD5 digests and update logs) reduce the volume of data exchanged in the common case
- Deletions are handled as a special case via "death certificates" recording the delete operation as an update
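As a sketch of the digest idea (all names and the timestamp scheme are invented here; the real systems use more elaborate mechanisms): compare a hash of each whole database first, and ship entries only when the hashes differ, keeping the newer timestamp per key.

```python
import hashlib

def digest(db):
    """MD5 digest over a replica's (key -> (timestamp, value)) map."""
    h = hashlib.md5()
    for key in sorted(db):
        ts, val = db[key]
        h.update(f"{key}:{ts}:{val}".encode())
    return h.hexdigest()

def anti_entropy(a, b):
    """Resolve all differences between two replicas in place."""
    if digest(a) == digest(b):
        return                      # common case: nothing to exchange
    for key in set(a) | set(b):
        newer = max(a.get(key, (0, None)), b.get(key, (0, None)))
        a[key] = b[key] = newer     # keep the newer entry per key

a = {"alice": (2, "host1"), "bob": (1, "host9")}
b = {"alice": (1, "host0"), "carol": (3, "host5")}
anti_entropy(a, b)
print(a == b)   # True: both replicas now hold the newest entries
```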
8. Epidemic Algorithm: Rumour Mongering
- Each replica periodically touches a selected "susceptible" peer site and infects it with updates
- Transfer every update known to the carrier but not the victim (in pull; vice versa in push). Rumours are dropped using "counter" or "coin" schemes
- Partner selection is randomized, using a variety of heuristics. Distance vs. convergence tradeoff:
  - i.e., if only neighbours are updated, then link traffic is O(1) but convergence traffic is O(n)
  - Sites connect to others at distance d with probability proportional to d^(-a)
- Theory shows that the epidemic will eventually reach the entire population (assuming it is connected)
- Heuristics (push vs. pull) affect traffic load and the expected time to convergence. Pull converges faster than push:
  - Pull: p_{i+1} = (p_i)^2
  - Push: p_{i+1} = p_i / e
  - where p_i = prob. of a site being susceptible after i rounds (cycles)
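Iterating the two recurrences makes the difference concrete (a sketch; the starting value p_0 = 0.5 and the convergence threshold are arbitrary choices here):

```python
import math

def rounds_until(p0, step, eps=1e-6):
    """Count rounds until the susceptible probability drops below eps."""
    p, i = p0, 0
    while p > eps:
        p = step(p)
        i += 1
    return i

pull = rounds_until(0.5, lambda p: p * p)        # p_{i+1} = p_i^2
push = rounds_until(0.5, lambda p: p / math.e)   # p_{i+1} = p_i / e
print(pull, push)   # 5 14 -- pull squares the residue, push only divides it
```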
9. Recap
- Two reliable multicast models
- SRM
  - Local repair of problems, but no end-to-end guarantees
- Virtual synchrony model (Isis, Horus, Ensemble)
  - All-or-nothing message delivery with ordering
  - Membership managed on behalf of the group
  - State transfer to a joining member
- Great performance for small systems. In large groups, under perturbations (heavy load, applications acting a little flaky), performance is very hard to maintain
10. Multicast scaling issue (SRM)
11. Multicast scaling issue (Ensemble)
12. Bimodal Multicast
- 2 sub-protocols
- Unreliable data distribution (IP multicast)
  - Upon arrival, a message enters the receiver's message buffer
  - Messages are delivered to the application layer in FIFO order, and are garbage collected out of the message buffer after some period of time
- The second sub-protocol is used to repair gaps in the message delivery record
  - Processes maintain a list of a random subset of the full system membership. In practice, we weight this list to contain primarily processes close by, accessible over low-latency links
13. Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
14. Periodically (e.g. every 100 ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
15. The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
16. Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
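The digest-and-repair loop of the last few slides can be caricatured in a few lines (a toy model with invented names; it collapses digest, solicitation, and retransmission into one bidirectional exchange rather than the protocol's separate rounds):

```python
import random

class Process:
    def __init__(self):
        self.have = {}                        # seqno -> message body

    def digest(self):
        return set(self.have)                 # identifies messages only

    def exchange(self, peer):
        for seq in peer.digest() - self.digest():
            self.have[seq] = peer.have[seq]   # solicit missing copies
        for seq in self.digest() - peer.digest():
            peer.have[seq] = self.have[seq]   # retransmit to the peer

rng = random.Random(42)
procs = [Process() for _ in range(8)]
for seq in range(5):
    procs[0].have[seq] = f"msg{seq}"          # the sender keeps every message
    for p in procs[1:]:
        if rng.random() < 0.7:                # unreliable multicast: ~70% arrive
            p.have[seq] = f"msg{seq}"

rounds = 0
while not all(len(p.have) == 5 for p in procs):
    for p in procs:
        p.exchange(rng.choice(procs))         # gossip with a random partner
    rounds += 1
print(rounds)   # gaps are repaired within a handful of gossip rounds
```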
17. Optimizations
- Request retransmissions most-recent-multicast-first
- The idea is to catch up quickly, leaving at most one gap in the retrieved sequence
- Participants bound the amount of data they will retransmit during any given round of gossip. If too much is solicited, they ignore the excess requests
18. Optimizations
- Label each gossip message with the sender's gossip round number
- Ignore solicitations that have an expired round number, reasoning that they arrived very late and hence are probably no longer correct
- Don't retransmit the same message twice in a row to any given destination (the copy may still be in transit, hence the request may be redundant)
19. Optimizations
- Use IP multicast when retransmitting a message if several processes lack a copy
  - For example, if solicited twice
  - Also, if a retransmission is received from far away
  - Tradeoff: excess messages versus low latency
- Use a regional TTL to restrict multicast scope
20. Bimodal Multicast and SRM with system-wide constant noise, tree topology (graph of repair requests per sec)
21. (No transcript)
22. Two Predicates
- Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast
- Predicate II: A faulty outcome is one where roughly half get the multicast and failures might conceal the true outcome
23. Bimodal Multicast is amenable to formal analysis
24. Unlimited scalability!
- Probabilistic gossip routes around congestion
- And the probabilistic reliability model lets the system move on if a computer lags behind
- Results in:
  - Constant communication costs
  - Constant loads on links
  - Steady behavior even under stress
25. Good things?
- Overcome Internet limitations using randomized P2P gossip
- However, Internet routing can defeat our clever solutions unless we know the network topology
- Both have great scalability and can survive under stress
- And both are backed by formal models as well as real code and experimental data
26. Further Work
27. Research Locations
- Cornell Spinglass: http://www.cs.cornell.edu/Info/Projects/Spinglass/index.html
- SWIM: http://www.cs.cornell.edu/gupta/swim
- MSR Cambridge (Kermarrec): http://research.microsoft.com/camdis/gossip.htm
- EPFL (Guerraoui): http://lpdwww.epfl.ch/publications
28. Bayou Basics
- The motivation for Bayou comes from observations of mobile computing
- Connections are expensive, infrequent, and often intermittent
- Collaborating agents are unlikely to be guaranteed simultaneous connections
- Bayou accommodates these applications by helping them manage weakly consistent data. Bayou does not attempt to be transparent
29. Bayou Basics (cont.)
- Applications should use specific knowledge of their data, along with the knowledge that data may be stale, to detect and resolve conflicts
- Applications detect and resolve conflicts differently
- Bayou allows for arbitrary dependencies, constraints, and detection of write/write and read/write conflicts
- Programs resolve conflicts with each write. Resolution may involve cascading back-outs
- Procedures must be deterministic so that they may be replayed on multiple machines
- A write is considered tentative until committed at the primary server
- A global ordering is used by the primary server to dictate which of several conflicting writes wins
- A modification is stable once it reaches the primary server
- Primary servers have authority, a tradeoff that allows data to become stable without hearing responses from all clients and servers
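A toy model of the tentative/committed distinction (class and method names are invented here; real Bayou writes also carry dependency checks and merge procedures):

```python
class Replica:
    def __init__(self):
        self.committed = []   # writes in the primary's global order
        self.tentative = []   # writes not yet committed at the primary

    def write(self, w):
        self.tentative.append(w)

    def commit(self, w):
        # the primary decided w's position: move it to the stable prefix
        self.tentative.remove(w)
        self.committed.append(w)

    def view(self):
        # applications see committed writes first, then tentative ones,
        # so tentative results may be reordered when commits arrive
        return self.committed + self.tentative

r = Replica()
r.write("A"); r.write("B")
r.commit("B")                 # primary committed B first
print(r.view())               # ['B', 'A'] -- A is still tentative
```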
30. Implementation
- Two applications are studied: a bibliographic database and a meeting-room scheduler
- Anti-entropy: a client may connect to any server for reading and writing data
- Servers replicate all data, and synchronize using pair-wise communication
- Anti-entropy ensures eventual consistency of the database (the servers "gossip"). A primary server is the authoritative source of consistency
- Implementation: each server logs committed and tentative data. Anti-entropy sessions update these logs accordingly
- Access control and security: security is achieved with public-key cryptography, access control by allowing users to grant and revoke privileges. Primary servers are responsible for managing revocation lists