Title: Virtual Synchrony
1Virtual Synchrony
- Justin W. Hart
- CS 614
- 11/17/2005
2Papers
- The Process Group Approach to Reliable
Distributed Computing. Birman. CACM, Dec 1993,
36(12):37-53.
- Understanding the Limitations of Causally and
Totally Ordered Communication. Cheriton and
Skeen. 14th SOSP, 1993.
3Background
- Lamport Logical Clocks
- Consistent Cuts
- Chandy-Lamport Distributed Snapshots
- Publish/Subscribe
- Fail-Stop
4Fail Stop
- Group Membership Service
- Processes appear to fail by halting
- How does this affect the FLP result?
5Motivation
- Information Backplane
- Customization
- Hierarchical Structure
- Fault-Tolerance
- Reliability
6Process Groups
- Types of groups
- Anonymous groups
- Explicit groups
- Implementation Requirements
- Group communication
- Group membership as input
- Synchronization
7Anonymous Groups
- Group addressing
- Messages sent exactly once to all or no
recipients
- Ordering
- Logging
8Explicit Groups
- Group members cooperate directly
- May execute algorithms based on membership
knowledge
- Communication is sensitive to membership changes
9Building groups over conventional technology
- Conventional message passing technologies
- Group addressing
- Logical time and causal dependency
- Message delivery ordering
- State transfer
- Fault tolerance
10Close Synchrony
- Close Synchrony
- 100% lock-step execution model
11A synchronous execution
[Figure: time-line of processes p, q, r, s, t, u]
- With true synchrony executions run in genuine
lock-step.
12So what's wrong with that?
- Under close synchrony, execution is limited by
the slowest process in the group!
13Virtual Synchrony
- Relax synchronization requirements where possible
- Benefit by allowing for asynchronous interactions
- Do this where the result is identical to close
synchrony
14A few protocols
- fbcast
- cbcast
- abcast
- gbcast
15Four protocols!?!?
- But Justin, the paper only discussed 2
protocols. You're getting off-topic!
16A few protocols
- fbcast
- Simple protocol upon which we'll build the
others.
- Delivery is FIFO ordered, with respect to the
original sender
- Accomplished easily with a logical timestamp
- cbcast
- abcast
- gbcast
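The FIFO rule for fbcast can be sketched with a per-sender sequence number standing in for the logical timestamp. This is a minimal sketch, not the ISIS implementation: the class name FifoReceiver, its fields, and the message payloads are hypothetical, and reliable retransmission is left out.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order that sender sent them."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return  # duplicate of a message that was already delivered
        self.pending[sender][seq] = payload
        # Deliver for as long as the next expected message is buffered.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1


receiver = FifoReceiver()
receiver.receive("p", 1, "update-2")   # arrives early, so it is held back
receiver.receive("p", 0, "update-1")   # fills the gap, both are delivered
print(receiver.delivered)
```

Messages from different senders are never held back for one another here, which is exactly why fbcast is cheap and also why it gives no ordering guarantee across senders.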
17Single updater
- If p is the only update source, the need is a bit
like the TCP FIFO ordering
- fbcast is a good choice for this case
[Figure: time-line of processes p, r, s, t; p multicasts updates 1, 2, 3, 4]
18A few protocols
- fbcast
- cbcast
- Receipt is causally ordered
- Protocol in paper uses token passing
- Another simple protocol uses vector timestamps
- abcast
- gbcast
19Causally ordered updates
- Simple protocol based on token passing
20Causally ordered updates
- Example: messages from p and s arrive out of
order at t
[Figure: time-line of processes p, r, s, t with messages a, b, c]
- VT(a) = [0,0,0,1], VT(b) = [1,0,0,1], VT(c) = [1,0,1,1]
- c is early: VT(c) = [1,0,1,1] but
VT(t) = [0,0,0,1]; clearly we are missing one
message from s
- When b arrives, we can deliver both it and
message c, in order
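The vector-timestamp delivery test behind this example can be sketched as follows. It is a minimal sketch under assumptions: the class CausalReceiver is hypothetical, and each message's sender index is inferred from the vector entry that the message increments (index 0 for b, index 2 for c). A message from sender j is delivered once it is the next one expected from j and everything it depends on from other processes has already been delivered.

```python
class CausalReceiver:
    """Holds back messages until their causal predecessors are delivered."""

    def __init__(self, vt):
        self.vt = list(vt)     # messages delivered so far, per process
        self.pending = []      # (name, sender_index, timestamp) held back
        self.delivered = []    # message names in delivery order

    def _deliverable(self, sender, ts):
        return all(
            (ts[k] == self.vt[k] + 1) if k == sender else (ts[k] <= self.vt[k])
            for k in range(len(self.vt))
        )

    def receive(self, name, sender, ts):
        self.pending.append((name, sender, ts))
        progress = True
        while progress:            # one delivery may unblock others
            progress = False
            for msg in list(self.pending):
                if self._deliverable(msg[1], msg[2]):
                    self.pending.remove(msg)
                    self.vt[msg[1]] += 1
                    self.delivered.append(msg[0])
                    progress = True


t = CausalReceiver([0, 0, 0, 1])       # VT(t) after message a
t.receive("c", 2, [1, 0, 1, 1])        # early: entry 0 is ahead of VT(t)
held_back = list(t.delivered)          # nothing delivered yet
t.receive("b", 0, [1, 0, 0, 1])        # b is deliverable, then c follows
print(held_back, t.delivered, t.vt)
```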
21Causally ordered updates
- Each thread corresponds to a different lock
- In effect red events never conflict with green
ones!
[Figure: time-line of processes p, r, s, t with two independent threads of numbered updates, one red and one green]
22Hey, that sped things up!
- Now I get it! Processes only have to wait for
processes that they depend on. Not the slowest
in the group!
23A few protocols
- fbcast
- cbcast
- abcast
- Atomic delivery ordering
- With respect to other abcasts
- More costly than cbcast, but with a stronger
ordering property
- ISIS builds abcast over cbcast
- gbcast
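One way to picture the abcast guarantee is a fixed sequencer that assigns every message a slot in a single global order. This is a swapped-in simplification for illustration, not the ISIS protocol (which the slide says is built over cbcast). The names Sequencer and TotalOrderReceiver, and the message ids, are hypothetical.

```python
class Sequencer:
    """Hypothetical fixed sequencer: hands out slots in one global order."""

    def __init__(self):
        self.next_seq = 0

    def assign(self, msg_id):
        seq = self.next_seq
        self.next_seq += 1
        return seq


class TotalOrderReceiver:
    def __init__(self):
        self.payloads = {}   # msg_id -> payload that has arrived
        self.slots = {}      # global sequence number -> msg_id
        self.next_seq = 0    # next slot to deliver
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.payloads[msg_id] = payload
        self._drain()

    def on_order(self, msg_id, seq):
        self.slots[seq] = msg_id
        self._drain()

    def _drain(self):
        # Deliver only when both the slot and its payload are known.
        while self.slots.get(self.next_seq) in self.payloads:
            msg_id = self.slots.pop(self.next_seq)
            self.delivered.append(self.payloads.pop(msg_id))
            self.next_seq += 1


sequencer = Sequencer()
order = {"m1": sequencer.assign("m1"), "m2": sequencer.assign("m2")}
r1, r2 = TotalOrderReceiver(), TotalOrderReceiver()
r1.on_message("m1", "x=1"); r1.on_message("m2", "x=2")
r2.on_message("m2", "x=2"); r2.on_message("m1", "x=1")   # opposite arrival order
for receiver in (r1, r2):
    for msg_id, seq in order.items():
        receiver.on_order(msg_id, seq)
print(r1.delivered, r2.delivered)
```

Both receivers deliver in the same order regardless of arrival order. The extra ordering message per multicast is the cost the slide refers to.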
24A few protocols
- fbcast
- cbcast
- abcast
- gbcast
- Atomic delivery ordering
- With respect to everything
25Three Round Multicast
26As a time-line picture
[Figure: time-line of processes p, q, r, s, t. Phase 1: the 2PC initiator asks "Vote?" and all vote commit. Phase 2: the initiator sends "Commit!"]
27Just one more
28Flush protocol
- We say that a message is unstable if some
receiver has it but (perhaps) others don't
- For example, q's message is unstable at process r
- If q fails we want to flush unstable messages
out of the system
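The idea of stability and flushing can be sketched as bookkeeping over who holds each message. This is a simplified illustration, not the actual flush protocol: the functions unstable_messages and flush, the message ids, and the "lowest-named holder forwards" rule are all assumptions made for the example.

```python
def unstable_messages(holders, members):
    """Ids of messages that at least one group member does not yet hold."""
    return sorted(m for m, have in holders.items() if not members <= have)


def flush(holders, members, failed):
    """Plan the forwarding needed after a failure.

    Returns {msg_id: (forwarder, [recipients])} so that every surviving
    member ends up with every message that some survivor holds.
    """
    survivors = members - {failed}
    plan = {}
    for msg_id, have in holders.items():
        missing = survivors - have
        can_forward = survivors & have
        if missing and can_forward:
            plan[msg_id] = (min(can_forward), sorted(missing))
    return plan


members = {"p", "q", "r", "s"}
holders = {
    "m1": {"q", "r"},              # q's message reached only r: unstable
    "m2": {"p", "q", "r", "s"},    # held by everyone: stable
}
unstable = unstable_messages(holders, members)
plan = flush(holders, members, failed="q")
print(unstable, plan)
```

Here r holds q's unstable message, so after q fails r forwards it to p and s before the new membership is installed.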
29Styles of groups
- Peer Groups
- Processes cooperate closely
- Client-Server Groups
- Group acts as a server
- Client multicasts repeatedly to the group
- Diffusion Groups
- Group serves information
- Clients connect to receive data from group
- Hierarchical Groups
- Offer scalability through a hierarchy of
connected groups
30Historical Aside
- Two major classes of real systems
- Virtual synchrony
- Weaker properties: not quite FLP consensus
- Much higher performance (orders of magnitude)
- Requires that a majority of the system remain
connected. Partitioning failures force protocols
to wait for repair
- Quorum-based state machine protocols
- Closer to FLP definition of consensus
- Slower (by orders of magnitude)
- Sometimes can make progress in partitioning
situations where virtual synchrony can't
31Names of some famous systems
- Isis was first practical virtual synchrony system
- Later followed by Transis, Totem, Horus
- Today: best options are JGroups, Spread, Ensemble
- Technology is now used in IBM WebSphere and
Microsoft Windows Clusters products!
- Paxos was first major state machine system
- BASE and other Byzantine Quorum systems now
getting attention from the security community
- (End of Historical aside)
32Sounds good, what's wrong with it?
- Tries to solve state problems at communication
level
- This violates the end-to-end argument!
- Consistency requirements are typically stated
with respect to application state
33Stable vs Durable
- Stable: messages are buffered until received by
all group members
- Durable: messages will be delivered, even if the
sender dies
34Ordering semantics
- Incidental Ordering
- Semantic Ordering
- Prescriptive Ordering
35The problem with CATOCS
- It can't say for sure
- It can't say the whole story
- It can't say together
- It can't say it efficiently
36It can't say for sure
- Processes communicating over a hidden channel
- Common database
- Shared memory
- Two threads reacting to external event
37It can't say together
- Standard solution: locking
- Transaction models allow for abort and rollback
- Higher-level conditions: what happens if a
message arrives, but is not successfully processed?
38Stock trading example
39Can't say the whole story
- Not everything can be expressed through the
happens-before relationship
- Semantic ordering constraints
- Causal memory, the weakest of these, cannot be
expressed in causal multicast
- Total ordering helps some of these, but is far
too expensive
- Inexpensive, state-level protocols with logical
clocks can solve these
40It can't say it efficiently
- False causality
- Potential causality ≠ actual causality
- Memory requirements for buffering unstable
messages
- Ordering information during transmission and
reception
41And what of the end to end argument?
- All of this considers our communication channels.
Isn't the application-level check far more
important?
42Classes of distributed applications
- Data dissemination
- Netnews
- Trading application example
- Global predicate evaluation
- Transactional applications
- Replicated data
- Replication in the large
- Distributed real-time applications
43Implementing only part of the messaging?
- Can you cut down on overhead by implementing only
part of the messaging using CATOCS?
44Semantics
- Are the semantics of state-based approaches
superior to those of virtual synchrony?
45Scalability
- N Processes
- Time T to propagate a message across the system
- Grows roughly proportional with the square root
of the number of processes
- Arcs in the active causal graph grow
quadratically
- Quadratic causal graph
46Buffering grows
- Quadratic arcs
- Linear communication of causal dependencies
- Linear growth in required buffering
- Changing topologies doesn't help
- CATOCS would require separate process groups for
read and write to accomplish optimization of
updates vs queries
47Group membership protocols
- Must enforce atomic delivery semantics
- Run our most expensive protocol, gbcast
- Failures increase with the size of the system,
increasing load on the GMS
48Who uses ISIS?
- Brokerage
- Database replication and triggers
49ISIS-based utilities
- NEWS
- A pub/sub application that will replay
histories
- NMGR
- Manages batch-style jobs and performs load
sharing
- Parallel make
50ISIS-based utilities
- DECEIT
- NFS compatible file system
- META/LOMITA
- Sensors and actuators
- Abstract sensors
- Specify control actions in high-level terms
- SPOOLER/LONG-HAUL FACILITY
51Now somewhat supported
- ISIS/Horus/Ensemble/QuickSilver
- JGroups
- Spread
- Totem
- Transis
- WebSphere, Windows Cluster (internally)
52and people actually use it.
- NYSE
- French ATC System
- AEGIS
53An ongoing debate
- The effort continues here at Cornell with the
QuickSilver effort
- You've been presented the options. What are your
conclusions?
54References
- Some slides borrowed from Ken Birman's CS 614
slide sets on Virtual Synchrony:
http://www.cs.cornell.edu/courses/cs514/2005sp/Slide%20Sets.htm
- Images have been borrowed from The Process Group
Approach to Reliable Distributed Computing.
Birman. CACM, Dec 1993, 36(12):37-53.
- Images have been borrowed from Understanding the
Limitations of Causally and Totally Ordered
Communication. Cheriton and Skeen. 14th SOSP,
1993.
- Statements and ideas have been borrowed verbatim
from both papers, including section headings, and
statements in notes. This has been mostly for
coherence between the slides and papers.
- Also sourced data from http://www.cs.cornell.edu/ken/