Reliable Distributed Systems - PowerPoint PPT Presentation

Slides: 21
Provided by: KennethP6
Learn more at: https://cse.buffalo.edu

Transcript and Presenter's Notes


1
Reliable Distributed Systems
  • Membership

2
Group Membership
  • Foundational concept for high-speed data
    replication protocols
  • Essential for large-scale grid-based virtual
    organizations (VOs), resource discovery, and
    scheduling
  • Solution: a group membership service (GMS)
  • 2-tier architecture: manage the membership of
    the GMS itself, then manage the general
    membership of other services
  • GMP: a Group Membership Protocol is used among
    the GMS processes to manage their own membership
  • The GMS then works on its group
  • Another problem is static vs. dynamic membership

3
Agreement on Membership
  • Detecting failure is a lost cause
  • Too many things can mimic failure
  • To be accurate, we would end up waiting for a
    process to recover
  • Substitute: agreement on membership
  • Now we can drop a process because it isn't fast
    enough
  • This can seem arbitrary, e.g. A kills B
  • GMS implements this service for everyone else

4
Architecture
  • Applications use replicated data for high
    availability
  • 2PC-like protocols use membership changes instead
    of failure notification
  • Membership agreement: join/leave and "P seems
    to be unresponsive"

5
Architecture
[Figure: application processes A, B, C, D and GMS
processes X, Y, Z. Event labels: join, leave, join,
"A seems to have failed". Membership views over
time: {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}.]

6
GMS API
  • Guess?

7
GMS API
  • P.278
  • Three operations
  • Join(process-id, callback)
  • Leave(process-id)
  • Monitor(process-id,callback)
  • GMS needs to be highly available
  • Here is the problem: adapt it to grid services
    and VOs
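
The three operations can be illustrated with a minimal single-process sketch. The slide gives only the operation names and arguments, so everything else here is an assumption: the class name ToyGMS, the idea that a join callback receives every new view, and the idea that a monitor callback fires when the monitored process is dropped. A real GMS would be replicated for high availability; this toy is not.

```python
from typing import Callable, Dict, List, Tuple

View = Tuple[str, ...]


class ToyGMS:
    """Hypothetical in-memory stand-in for a GMS (not fault-tolerant)."""

    def __init__(self) -> None:
        # process-id -> callback that receives each new membership view
        self.view_callbacks: Dict[str, Callable[[View], None]] = {}
        # monitored process-id -> callbacks to run when it is dropped
        self.monitors: Dict[str, List[Callable[[str], None]]] = {}

    def view(self) -> View:
        return tuple(sorted(self.view_callbacks))

    def join(self, process_id: str, callback: Callable[[View], None]) -> None:
        self.view_callbacks[process_id] = callback
        self._report()

    def leave(self, process_id: str) -> None:
        # Assumed to cover both voluntary departure and exclusion of a
        # process that seems to have failed.
        if process_id not in self.view_callbacks:
            return
        del self.view_callbacks[process_id]
        for cb in self.monitors.pop(process_id, []):
            cb(process_id)
        self._report()

    def monitor(self, process_id: str, callback: Callable[[str], None]) -> None:
        self.monitors.setdefault(process_id, []).append(callback)

    def _report(self) -> None:
        # Tell every current member about the new view.
        current = self.view()
        for cb in list(self.view_callbacks.values()):
            cb(current)


gms = ToyGMS()
views_seen_by_a: List[View] = []
dropped: List[str] = []
gms.join("A", views_seen_by_a.append)
gms.monitor("B", dropped.append)
gms.join("B", lambda view: None)
gms.leave("B")
```

After this run, A has been told about three views in order (A alone, then A and B, then A alone again), and the monitor on B has fired once.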

8
Example
  • Suppose the distributed system using the GMS is
    an air traffic control system: after a process
    fails, it must reconfigure itself with the
    remaining processes.
  • In some cases, such as a grid VO, dynamically
    changing membership may simply be a fact of life.

9
Contrast dynamic with static model
  • Static model: fixed set of processes tied to
    resources
  • Processes may be unreachable (while failed or
    partitioned away) but later recover
  • Think: cluster of PCs
  • Dynamic model: changing set of processes launched
    while the system runs; some fail/terminate
  • Failed processes never recover (a partitioned
    process may reconnect, but uses a new pid)
  • A process can still own a physical resource,
    allowing us to emulate a static model
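
A small sketch of the dynamic model, under assumptions: the class DynamicSystem, its method names, and the resource name "disk0" are all hypothetical. It shows a failed pid never coming back, while a replacement process with a fresh pid takes over the same physical resource, which is how the dynamic model can emulate the static one.

```python
import itertools
from typing import Dict, Optional, Set


class DynamicSystem:
    """Hypothetical bookkeeping for the dynamic membership model."""

    def __init__(self) -> None:
        self._pids = itertools.count(1)      # pids are never reused
        self.live: Dict[int, str] = {}       # pid -> resource it owns
        self.failed: Set[int] = set()

    def launch(self, resource: str) -> int:
        pid = next(self._pids)
        self.live[pid] = resource
        return pid

    def fail(self, pid: int) -> None:
        # In the dynamic model a failed pid never recovers.
        del self.live[pid]
        self.failed.add(pid)

    def owner(self, resource: str) -> Optional[int]:
        for pid, owned in self.live.items():
            if owned == resource:
                return pid
        return None


system = DynamicSystem()
first = system.launch("disk0")
system.fail(first)
# The "same" process reconnects after a partition, but under a new pid.
second = system.launch("disk0")
```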

10
Commit protocol
[Figure: commit protocol timeline. Labels: "ok to
commit?", "ok", "ok", "vote unknown!", "decision
unknown!"]

11
Suppose this is a partitioning failure (or
merging)
[Figure: commit protocol timeline, as on the previous
slide. Labels: "ok to commit?", "ok", "ok", "vote
unknown!", "decision unknown!"]
Do these processes actually need to be consistent
with the others?
12
Primary partition concept
  • The idea is to identify the notion of "the
    system" with a unique component of the
    partitioned system
  • Call this distinguished component the primary
    partition of the system as a whole
  • The primary partition can speak with authority
    for the system as a whole
  • Non-primary partitions have weaker consistency
    guarantees and limited ability to initiate new
    actions

13
Ricciardi Group Membership Protocol
  • For use in a group membership service (usually
    just a few processes that run on behalf of the
    whole system)
  • The GMS tracks its own membership; its members
    use this to maintain the membership list for the
    whole system
  • All users of the service see subsequences of a
    single system-wide group membership history
  • GMS also tracks the primary partition
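
The "subsequences of a single history" property can be checked mechanically. This is only an illustrative sketch: the function name and the example views are made up, and the slides do not say how (or whether) such a check is performed.

```python
from typing import Sequence, Tuple

View = Tuple[str, ...]


def is_subsequence(observed: Sequence[View], history: Sequence[View]) -> bool:
    """True if `observed` appears within `history` in the same order."""
    remaining = iter(history)
    # Each observed view must be found in what is left of the history,
    # so views may be skipped but never reordered.
    return all(any(view == h for h in remaining) for view in observed)


history = [("A",), ("A", "B"), ("A", "B", "C"), ("B", "C")]
in_order = is_subsequence([("A", "B"), ("B", "C")], history)
out_of_order = is_subsequence([("B", "C"), ("A", "B")], history)
```

A user that joined late may miss early views, but it must never see two views in an order that contradicts the system-wide history.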

14
GMP protocol itself
  • Used only to track membership of the core GMS
  • Designates one GMS member as the coordinator
  • Switches between 2PC and 3PC
  • 2PC if the coordinator didn't fail and other
    members failed or are joining
  • 3PC if the coordinator failed and some other
    member is taking over as the new coordinator
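
The switching rule above can be written as a tiny decision function. This is a sketch, not the Ricciardi protocol itself: the function names are hypothetical, and picking the first surviving member in view order as the new coordinator is an assumption made only for illustration.

```python
from typing import Iterable, List, Optional


def choose_protocol(coordinator: str, failed: Iterable[str]) -> str:
    # 2PC while the coordinator is alive; 3PC when it has failed and
    # another member must take over.
    return "3PC" if coordinator in set(failed) else "2PC"


def next_coordinator(view: List[str], failed: Iterable[str]) -> Optional[str]:
    # Assumption: the first surviving member in view order takes over.
    failed_set = set(failed)
    for member in view:
        if member not in failed_set:
            return member
    return None


view = ["p0", "p1", "p2"]
member_failed = choose_protocol("p0", failed=["p2"])
coordinator_failed = choose_protocol("p0", failed=["p0"])
successor = next_coordinator(view, failed=["p0"])
```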

15
GMS majority requirement
  • To move from system view i to view i+1, GMS
    requires explicit acknowledgement by a majority
    of the processes in view i
  • If it can't get a majority, the GMS loses its
    primaryness information
  • GMP can be extended to support partitioning
    and remerging
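
The majority rule is simple arithmetic: a view of n processes needs at least floor(n/2) + 1 acknowledgements from members of that view. The helper below is a hypothetical sketch of that check only; it does not model the message exchange.

```python
from typing import Iterable


def can_install_next_view(view_i: Iterable[str], acks: Iterable[str]) -> bool:
    """True if a majority of view i acknowledged the move to view i+1."""
    members = set(view_i)
    needed = len(members) // 2 + 1
    # Acknowledgements from processes outside view i do not count.
    return len(members & set(acks)) >= needed


view_i = ["p0", "p1", "p2", "p3", "p4"]
three_of_five = can_install_next_view(view_i, ["p1", "p2", "p3"])
two_of_five = can_install_next_view(view_i, ["p1", "p2"])
# An even split of a four-member view leaves neither side with a majority.
half_of_four = can_install_next_view(["a", "b", "c", "d"], ["a", "b"])
```

Because two disjoint subsets of view i cannot both be majorities, at most one side of a partition can install view i+1.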

16
GMS in Action
[Figure: timelines for processes p0 p1 ... p5]
p0 is the initial coordinator. p1 and p2 join,
then p3...p5 join. But p0 fails during the join
protocol, and later so does p3. Majority
consent is used to avoid partitioning!

17
GMS in Action
[Figure: timelines for processes p0 p1 ... p5, in
three stages: 2-phase commit (P0 is coordinator),
3-phase commit (P1 takes over), 2-phase commit
(P1 is new coordinator)]

18
What if system has thousands of processes?
  • The idea is to build a GMS subsystem that runs
    on just a few nodes
  • GMS members track themselves
  • Other processes ask to be admitted to the system
    or for faulty processes to be excluded
  • GMS treats overall system membership as a form of
    replicated data that it manages and reports to
    its listeners

19
Uses of membership?
  • If we rewire TCP and RPC to use membership
    changes as the trigger for breaking connections,
    we can eliminate many problems!
  • But nobody really does this
  • The problem is that networks lack standard GMS
    subsystems now!
  • But can we try using it in a Grid/Web services
    environment?

20
Summary
  • We know how to build a GMS that tracks its own
    membership
  • Examine how this can be applied to grid services
  • This is an M.S. or Ph.D. problem