Reliable Distributed Systems - PowerPoint PPT Presentation

Slides: 21

Provided by: KennethP6

Learn more at: https://cse.buffalo.edu

1
Reliable Distributed Systems

Membership

2
Group Membership

Foundational concept for high-speed data replication protocols.
Essential for large-scale grid-based virtual organizations (VOs), resource discovery, and scheduling.
Solution: a group membership service (GMS).
The GMS manages its own membership first, then the general membership of other services: a 2-tier architecture.
GMP: the Group Membership Protocol, used among GMS members to manage the GMS's own membership.
The GMS then works on its group.
Another problem is static vs. dynamic membership.

3
Agreement on Membership

Detecting failure is a lost cause.
Too many things can mimic failure.
To be accurate, we would end up waiting for a process to recover.
Substitute agreement on membership.
Now we can drop a process because it isn't fast enough.
This can seem arbitrary, e.g. A kills B.
GMS implements this service for everyone else.

4
Architecture
Applications use replicated data for high availability
2PC-like protocols use membership changes instead of failure notification
Membership agreement: join/leave and "P seems to be unresponsive"
5
Architecture
[Figure: application processes (A, B, C, D) send join and leave requests to the GMS processes (X, Y, Z). Membership views over time: {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}. Annotation: "A seems to have failed".]
6
GMS API

Guess?

7
GMS API

P.278
Three operations
Join(process-id, callback)
Leave(process-id)
Monitor(process-id, callback)
GMS needs to be highly available
Here is the problem: adapt it to grid services and VOs
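
The three operations can be sketched as a toy, single-process service. The slide gives only the operation names and arguments, so the callback semantics are assumptions: here the Join callback receives each new membership view, and the Monitor callback receives the id of the process that was dropped. A real GMS would be replicated for high availability; this sketch is not.

```python
from typing import Callable, Dict, List

# Assumption: a Join callback receives the new view (sorted process ids),
# and a Monitor callback receives the id of the process that was dropped.
ViewCallback = Callable[[List[str]], None]
DropCallback = Callable[[str], None]


class GroupMembershipService:
    """Toy in-memory stand-in for a GMS (no replication, no networking)."""

    def __init__(self) -> None:
        self._members: Dict[str, ViewCallback] = {}
        self._monitors: Dict[str, List[DropCallback]] = {}

    def view(self) -> List[str]:
        """Return the current membership view."""
        return sorted(self._members)

    def join(self, pid: str, callback: ViewCallback) -> None:
        """Admit pid and report the new view to every member."""
        self._members[pid] = callback
        self._report()

    def leave(self, pid: str) -> None:
        """Drop pid, fire its monitors, then report the new view."""
        if self._members.pop(pid, None) is None:
            return  # pid was not a member; nothing changes
        for notify in self._monitors.pop(pid, []):
            notify(pid)
        self._report()

    def monitor(self, pid: str, callback: DropCallback) -> None:
        """Register interest in pid being dropped from the membership."""
        self._monitors.setdefault(pid, []).append(callback)

    def _report(self) -> None:
        current = self.view()
        for callback in list(self._members.values()):
            callback(current)
```

Note that Leave can be issued on behalf of another process, which is how "A kills B" style exclusions would be expressed in this sketch.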

8
Example

A distributed system using the GMS, such as an air traffic control system, would need to reconfigure itself with the surviving processes after a process fails.
In some cases, such as a grid VO, dynamically changing membership is simply a fact of life.

9
Contrast dynamic with static model

Static model: fixed set of processes tied to resources
Processes may be unreachable (while failed or partitioned away) but later recover
Think: cluster of PCs
Dynamic model: changing set of processes launched while the system runs; some fail/terminate
Failed processes never recover (a partitioned process may reconnect, but uses a new pid)
A process can still own a physical resource, allowing us to emulate a static model
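
A minimal sketch of the dynamic model, assuming a pid is a (host, incarnation) pair. The `Pid` and `DynamicRegistry` names are hypothetical. A failed process never returns under its old pid, but a fresh process on the same host can take over the same physical resource, which is how the static model is emulated.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional


@dataclass(frozen=True)
class Pid:
    host: str
    incarnation: int  # assumption: a fresh number for every launch


class DynamicRegistry:
    """Tracks which live process currently owns each physical resource."""

    def __init__(self) -> None:
        self._next_incarnation = count(1)
        self._owner: Dict[str, Pid] = {}

    def launch(self, host: str, resource: Optional[str] = None) -> Pid:
        """Start a new process; it always gets a pid never used before."""
        pid = Pid(host, next(self._next_incarnation))
        if resource is not None:
            self._owner[resource] = pid
        return pid

    def fail(self, pid: Pid) -> None:
        """A failed pid is gone for good; release whatever it owned."""
        for resource, owner in list(self._owner.items()):
            if owner == pid:
                del self._owner[resource]

    def owner(self, resource: str) -> Optional[Pid]:
        return self._owner.get(resource)
```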

10
Commit protocol
[Figure: commit protocol exchange. Labels: "ok to commit?", "ok", "ok", "vote unknown!", "decision unknown!".]
11
Suppose this is a partitioning failure (or merging)
[Figure: the same commit exchange. Labels: "ok to commit?", "ok", "ok", "vote unknown!", "decision unknown!".]
Do these processes actually need to be consistent with the others?
12
Primary partition concept

The idea is to identify "the system" with a unique component of the partitioned system
Call this distinguished component the primary partition of the system as a whole
The primary partition can speak with authority for the system as a whole
Non-primary partitions have weaker consistency guarantees and limited ability to initiate new actions

13
Ricciardi Group Membership Protocol

For use in a group membership service (usually just a few processes that run on behalf of the whole system)
Tracks its own membership; its members use this to maintain the membership list for the whole system
All users of the service see subsequences of a single system-wide group membership history
GMS also tracks the primary partition
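
The "subsequences of a single history" property can be checked with a few lines. This is only an illustration of the property, not part of the protocol; the function name is hypothetical, and views are modeled as frozen sets of process names.

```python
from typing import FrozenSet, Iterable, Sequence

View = FrozenSet[str]


def is_subsequence(observed: Iterable[View], history: Sequence[View]) -> bool:
    """True if `observed` appears in `history` in the same order.

    Views may be skipped, but never reordered or invented.
    """
    remaining = iter(history)
    # `view in remaining` consumes the iterator up to the first match,
    # so each later view must be found strictly after the previous one.
    return all(view in remaining for view in observed)


# Example history, using the views from the Architecture slide.
HISTORY = [
    frozenset("A"),
    frozenset("ABD"),
    frozenset("AD"),
    frozenset("ADC"),
    frozenset("DC"),
]
```

A user that saw only {A,B,D} and then {D,C} is consistent with this history; one that saw {A,D} before {A,B,D} is not.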

14
GMP protocol itself

Used only to track membership of the core GMS
Designates one GMS member as the coordinator
Switches between 2PC and 3PC
2PC if the coordinator didn't fail and other members failed or are joining
3PC if the coordinator failed and some other member is taking over as new coordinator
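
The switching rule above amounts to a single test on the coordinator. A minimal sketch, with hypothetical names, that shows only the choice of protocol and none of the message rounds:

```python
from enum import Enum
from typing import Set


class Protocol(Enum):
    TWO_PHASE = "2PC"
    THREE_PHASE = "3PC"


def choose_protocol(coordinator: str, suspected: Set[str]) -> Protocol:
    """Pick the commit style for the next membership change.

    Assumption: `suspected` holds the members currently believed to have
    failed. If the coordinator is among them, another member is taking
    over and the slide's rule calls for 3PC; otherwise 2PC is used.
    """
    if coordinator in suspected:
        return Protocol.THREE_PHASE
    return Protocol.TWO_PHASE
```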

15
GMS majority requirement

To move from system view i to view i+1, GMS requires explicit acknowledgement by a majority of the processes in view i
If it can't get a majority, the GMS loses its primaryness information
GMP can be extended to support partitioning and remerging
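
The majority requirement can be sketched as a strict-majority test over the current view. The function name is hypothetical; acknowledgements from processes outside view i are ignored.

```python
from typing import Iterable, Set


def has_majority(view: Set[str], acks: Iterable[str]) -> bool:
    """True if strictly more than half of `view` acknowledged the change."""
    acked = set(acks) & view  # only members of the current view count
    return 2 * len(acked) > len(view)


def can_install_next_view(view: Set[str], acks: Iterable[str]) -> bool:
    """A side of a partition may move to view i+1 only with a majority."""
    return has_majority(view, acks)
```

Because two disjoint sets cannot both hold a strict majority of the same view, at most one side of a partition can install the next view.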

16
GMS in Action
p0 p1 ... p5
p0 is the initial coordinator. p1 and p2 join, then p3...p5 join. But p0 fails during the join protocol, and later so does p3. Majority consent is used to avoid partitioning!
17
GMS in Action
[Figure: timeline for p0 ... p5. Protocol rounds, in order: 2-phase commit, 3-phase, 2-phase. Annotations: "P0 is coordinator", "P1 takes over", "P1 is new coordinator".]
18
What if system has thousands of processes?

The idea is to build a GMS subsystem that runs on just a few nodes
GMS members track themselves
Other processes ask to be admitted to the system or for faulty processes to be excluded
GMS treats overall system membership as a form of replicated data that it manages and reports to its listeners

19
Uses of membership?

If we rewire TCP and RPC to use membership changes as the trigger for breaking connections, we can eliminate many problems!
But nobody really does this
Problem is that networks lack standard GMS subsystems now!
But we can try using it in a Grid/Web services environment?!
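
A minimal sketch of the idea, assuming a hypothetical connection table that is fed each new membership view. Real TCP and RPC stacks expose no such hook, so connections are modeled as plain labels rather than sockets.

```python
from typing import Dict, Iterable, List, Set


class ConnectionTable:
    """Breaks connections when the peer leaves the membership view."""

    def __init__(self) -> None:
        self._open: Dict[str, List[str]] = {}  # peer pid -> connection labels

    def connect(self, peer: str, label: str) -> None:
        self._open.setdefault(peer, []).append(label)

    def open_peers(self) -> Set[str]:
        return set(self._open)

    def on_view_change(self, view: Iterable[str]) -> List[str]:
        """Close every connection whose peer is not in the new view.

        Returns the labels of the connections that were broken. A timeout
        alone never breaks a connection here; only a membership change does.
        """
        members = set(view)
        broken: List[str] = []
        for peer in list(self._open):
            if peer not in members:
                broken.extend(self._open.pop(peer))
        return broken
```

All processes that receive the same view break the same connections, which is the consistency that independent per-connection timeouts cannot give.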

20
Summary