1
Distributed Systems
2
A Distributed System
3
Loosely Coupled Distributed Systems
  • Users are aware of multiplicity of machines.
    Access to resources of various machines is done
    explicitly by
  • Remote logging into the appropriate remote
    machine.
  • Transferring data from remote machines to local
    machines, via the File Transfer Protocol (FTP)
    mechanism.

4
Tightly Coupled Distributed Systems
  • Users not aware of multiplicity of machines.
    Access to remote resources similar to access to
    local resources
  • Examples
  • Data Migration: transfer data by transferring the
    entire file, or transferring only those portions
    of the file necessary for the immediate task.
  • Computation Migration: transfer the computation,
    rather than the data, across the system.

5
Distributed-Operating Systems (Cont.)
  • Process Migration: execute an entire process, or
    parts of it, at different sites.
  • Load balancing: distribute processes across
    network to even the workload.
  • Computation speedup: subprocesses can run
    concurrently on different sites.
  • Hardware preference: process execution may
    require specialized processor.
  • Software preference: required software may be
    available at only a particular site.
  • Data access: run process remotely, rather than
    transfer all data locally.

6
Why Distributed Systems?
  • Communication
  • Dealt with this when we talked about networks
  • Resource sharing
  • Computational speedup
  • Reliability

7
Resource Sharing
  • Distributed Systems offer access to specialized
    resources of many systems
  • Example
  • Some nodes may have special databases
  • Some nodes may have access to special hardware
    devices (e.g. tape drives, printers, etc.)
  • DS offers benefits of locating processing near
    data or sharing special devices

8
OS Support for resource sharing
  • Resource Management?
  • Distributed OS can manage diverse resources of
    nodes in system
  • Make resources visible on all nodes
  • Like VM, can provide functional illusion but
    rarely hide the performance cost
  • Scheduling?
  • Distributed OS could schedule processes to run
    near the needed resources
  • If need to access data in a large database may be
    easier to ship code there and results back than
    to request data be shipped to code

9
Design Issues
  • Transparency: the distributed system should
    appear as a conventional, centralized system to
    the user.
  • Fault tolerance: the distributed system should
    continue to function in the face of failure.
  • Scalability: as demands increase, the system
    should easily accept the addition of new
    resources to accommodate the increased demand.
  • Clusters vs Client/Server
  • Clusters: a collection of semi-autonomous
    machines that acts as a single system.

10
Computation Speedup
  • Some tasks too large for even the fastest single
    computer
  • Real time weather/climate modeling, human genome
    project, fluid turbulence modeling, ocean
    circulation modeling, etc.
  • http://www.nersc.gov/research/GC/gcnersc.html
  • What to do?
  • Leave the problem unsolved?
  • Engineer a bigger/faster computer?
  • Harness resources of many smaller (commodity?)
    machines in a distributed system?

11
Breaking up the problems
  • To harness computational speedup must first break
    up the big problem into many smaller problems
  • More art than science?
  • Sometimes break up by function
  • Pipeline?
  • Job queue?
  • Sometimes break up by data
  • Each node responsible for portion of data set?

12
Decomposition Examples
  • Decrypting a message
  • Easily parallelizable, give each node a set of
    keys to try
  • Job queue: when you have tried all your keys, go
    back for more? (see the sketch after this list)
  • Modeling ocean circulation
  • Give each node a portion of the ocean to model (N
    square ft region?)
  • Model flows within region locally
  • Communicate with nodes managing neighboring
    regions to model flows into other regions
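A rough sketch of the job-queue decomposition for the key-search example above. Python's multiprocessing pool on one machine stands in for the distributed nodes, and the key space, chunk size, and the pretend "winning" key are all invented for illustration.

```python
from multiprocessing import Pool

def try_keys(chunk):
    """Try every key in one chunk; return the key if it 'decrypts' the message."""
    start, end = chunk
    for key in range(start, end):
        if key == 123_456:              # stand-in for a successful decryption test
            return key
    return None

# Split the key space into jobs; workers pull the next job when they finish one.
chunks = [(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]

if __name__ == "__main__":
    with Pool(4) as pool:
        for result in pool.imap_unordered(try_keys, chunks):
            if result is not None:
                print("key found:", result)
                break                   # stop handing out further jobs
```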

13
Decomposition Examples (cont)
  • Barnes-Hut: calculating the effect of bodies in
    space on each other
  • Could divide space into NxN regions?
  • Some regions have many more bodies
  • Instead divide up so have roughly same number of
    bodies
  • Within a region, bodies have lots of effect on
    each other (close together)
  • Abstract other regions as a single body to
    minimize communication

14
Linear Speedup
  • Linear speedup is often the goal.
  • Allocate N nodes to the job: it goes N times as fast
  • Once you've broken up the problem into N pieces,
    can you expect it to go N times as fast? (see the
    rough estimate after this list)
  • Are the pieces equal?
  • Is there a piece of the work that cannot be
    broken up (inherently sequential?)
  • Synchronization and communication overhead
    between pieces?
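The slides do not name it, but Amdahl's law is the standard way to quantify the last two questions: if a fraction s of the job is inherently sequential, N nodes give a speedup of at most 1 / (s + (1 - s) / N). A quick back-of-the-envelope calculation:

```python
def max_speedup(s, n):
    """Amdahl's law: best possible speedup when a fraction s of the work
    is inherently sequential and the rest is split across n nodes."""
    return 1.0 / (s + (1.0 - s) / n)

# Even with only 5% sequential work, 100 nodes give at most ~16.8x,
# nowhere near the 100x of linear speedup.
print(max_speedup(0.05, 100))
```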

15
Super-linear Speedup
  • Sometimes can actually do better than linear
    speedup!
  • Especially if divide up a big data set so that
    the piece needed at each node fits into main
    memory on that machine
  • Savings from avoiding disk I/O can outweigh the
    communication/ synchronization costs
  • When split up a problem, tension between
    duplicating processing at all nodes for
    reliability and simplicity and allowing nodes to
    specialize

16
OS Support for Parallel Jobs
  • Process Management?
  • OS could manage all pieces of a parallel job as
    one unit
  • Allow all pieces to be created, managed,
    destroyed at a single command line
  • Fork (process,machine)?
  • Scheduling?
  • Programmer could specify where pieces should run
    and/or OS could decide
  • Process Migration? Load Balancing?
  • Try to schedule pieces together so they can
    communicate effectively

17
OS Support for Parallel Jobs (cont)
  • Group Communication?
  • OS could provide facilities for pieces of a
    single job to communicate easily
  • Location independent addressing?
  • Shared memory?
  • Distributed file system?
  • Synchronization?
  • Support for mutually exclusive access to data
    across multiple machines
  • Can't rely on HW atomic operations any more
  • Deadlock management?
  • We'll talk about clock synchronization and
    two-phase commit later

18
Reliability
  • Distributed system offers potential for increased
    reliability
  • If one part of system fails, rest could take over
  • Redundancy, fail-over
  • !BUT! Often reality is that distributed systems
    offer less reliability
  • A distributed system is one in which some
    machine I've never heard of fails and I can't do
    work!
  • Hard to get rid of all hidden dependencies
  • No clean failure model
  • Nodes don't just fail; they can continue in a
    broken state
  • Network partition: many, many nodes fail at once!
    (Determine who you can still talk to. Are you cut
    off, or are they?)
  • Network goes down and up and down again!

19
Robustness
  • Detect and recover from site failure, function
    transfer, reintegrate failed site
  • Failure detection
  • Reconfiguration

20
Failure Detection
  • Detecting hardware failure is difficult.
  • To detect a link failure, a handshaking protocol
    can be used.
  • Assume Site A and Site B have established a link.
    At fixed intervals, each site will exchange an
    I-am-up message indicating that they are up and
    running.
  • If Site A does not receive a message within the
    fixed interval, it assumes either (a) the other
    site is not up or (b) the message was lost.
  • Site A can now send an Are-you-up? message to
    Site B.
  • If Site A does not receive a reply, it can repeat
    the message or try an alternate route to Site B.
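A minimal sketch of the Are-you-up? exchange described above, assuming UDP datagrams. The host names, port, timeout, and retry count are made-up placeholders, and a real system would also exchange the periodic I-am-up heartbeats on a fixed interval.

```python
import socket

PEER_ADDR = ("site-b.example.org", 9999)       # hypothetical direct route to Site B
ALT_ADDR = ("site-b-alt.example.org", 9999)    # hypothetical alternate route

def probe(addr, timeout=2.0, retries=3):
    """Send Are-you-up? and wait for a reply; True if any reply arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    for _ in range(retries):
        try:
            sock.sendto(b"Are-you-up?", addr)
            data, _ = sock.recvfrom(1024)
            if data == b"I-am-up":
                return True
        except socket.timeout:
            continue            # the message may simply have been lost; retry
    return False

def site_b_reachable():
    # Try the direct link first, then the alternate route.
    return probe(PEER_ADDR) or probe(ALT_ADDR)
```

Note that even when both probes fail, Site A still cannot tell which failure occurred, which is exactly the problem discussed on the next slide.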

21
Failure Detection (cont)
  • If Site A does not ultimately receive a reply
    from Site B, it concludes some type of failure
    has occurred.
  • Types of failures:
  • - Site B is down
  • - The direct link between A and B is down
  • - The alternate link from A to B is down
  • - The message has been lost
  • However, Site A cannot determine exactly why the
    failure has occurred.
  • B may be assuming A is down at the same time
  • Can either assume it can make decisions alone?

22
Reconfiguration
  • When Site A determines a failure has occurred, it
    must reconfigure the system
  • 1. If the link from A to B has failed, this must
    be broadcast to every site in the system.
  • 2. If a site has failed, every other site must
    also be notified indicating that the services
    offered by the failed site are no longer
    available.
  • When the link or the site becomes available
    again, this information must again be broadcast
    to all other sites.

23
Event Ordering
  • Problem: distributed systems do not share a clock
  • Many coordination problems would be simplified if
    they did (first one wins)
  • Distributed systems do have some sense of time
  • Events in a single process happen in order
  • Messages between processes must be sent before
    they can be received
  • How helpful is this?

24
Happens-before
  • Define a Happens-before relation (denoted by →).
  • 1) If A and B are events in the same process, and
    A was executed before B, then A → B.
  • 2) If A is the event of sending a message by one
    process and B is the event of receiving that
    message by another process, then A → B.
  • 3) If A → B and B → C, then A → C.

25
Total ordering?
  • Happens-before gives a partial ordering of events
  • We still do not have a total ordering of events

26
Partial Ordering
Pi → Pi+1    Qi → Qi+1    Ri → Ri+1
R0 → Q4    Q3 → R4    Q1 → P4    P1 → Q2
27
Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
28
Timestamps
  • Assume each process has a local logical clock
    that ticks once per event and that the processes
    are numbered
  • Clocks tick once per event (including message
    send)
  • When you send a message, send your clock value
  • When you receive a message, set your clock to
    MAX(your clock, timestamp of message + 1)
  • Thus sending comes before receiving
  • The only visibility into actions at other nodes
    happens during communication; communication
    synchronizes the clocks
  • If the timestamps of two events A and B are the
    same, then use the process identity numbers to
    break ties.
  • This gives a total ordering!
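A small sketch of the logical-clock rules on this slide (Lamport clocks). The class and method names are invented for illustration; ties between equal timestamps are broken by process id, which is what yields the total order.

```python
class LamportClock:
    """Logical clock for one process: tick on every event, attach the clock
    to outgoing messages, and take the max and tick on receipt."""
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        self.time += 1
        return (self.time, self.pid)      # (timestamp, pid) breaks ties

    def send(self):
        self.time += 1                    # sending is itself an event
        return self.time                  # value piggybacked on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)

# Sending is ordered before receiving under the resulting total order:
p, q = LamportClock(pid=1), LamportClock(pid=2)
t = p.send()                              # P's send event has timestamp (1, 1)
print(q.receive(t) > (t, p.pid))          # Q's receive event (2, 2) > (1, 1): True
```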

29
Distributed Mutual Exclusion (DME)
  • Problem: we can no longer rely on just an atomic
    test-and-set operation on a single machine to
    build mutual exclusion primitives
  • Requirement:
  • If Pi is executing in its critical section, then
    no other process Pj is executing in its critical
    section.

30
Solution
  • We present three algorithms to ensure mutually
    exclusive execution of processes in their
    critical sections.
  • Centralized Distributed Mutual Exclusion (CDME)
  • Fully Distributed Mutual Exclusion (DDME)
  • Token passing

31
CDME Centralized Approach
  • One of the processes in the system is chosen to
    coordinate the entry to the critical section.
  • A process that wants to enter its critical
    section sends a request message to the
    coordinator.
  • The coordinator decides which process can enter
    the critical section next, and it sends that
    process a reply message.
  • When the process receives a reply message from
    the coordinator, it enters its critical section.
  • After exiting its critical section, the process
    sends a release message to the coordinator and
    proceeds with its execution.
  • 3 messages per critical section entry
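A compressed sketch of the centralized coordinator above, with the request / reply / release messages reduced to method calls. The class and field names are invented; the FIFO queue is what gives the freedom from starvation mentioned on a later slide.

```python
from collections import deque

class Coordinator:
    """Centralized mutual exclusion: one process arbitrates CS entry."""
    def __init__(self):
        self.holder = None          # process currently in its critical section
        self.waiting = deque()      # FIFO queue of pending requests

    def request(self, pid):
        """Handle a request message; return 'reply' if pid may enter now."""
        if self.holder is None:
            self.holder = pid
            return "reply"
        self.waiting.append(pid)
        return "deferred"           # reply will be sent on a later release

    def release(self, pid):
        """Handle a release message; grant the CS to the next waiter."""
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder          # the process to send the next reply to
```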

32
Problems of CDME
  • Electing the master process? Hardcoded?
  • Single point of failure? Electing a new master
    process?
  • Distributed Election algorithms later

33
DDME Fully Distributed Approach
  • When process Pi wants to enter its critical
    section, it generates a new timestamp, TS, and
    sends the message request (Pi, TS) to all other
    processes in the system.
  • When process Pj receives a request message, it
    may reply immediately or it may defer sending a
    reply back.
  • When process Pi receives a reply message from all
    other processes in the system, it can enter its
    critical section.
  • After exiting its critical section, the process
    sends reply messages to all its deferred requests.

34
DDME Fully Distributed Approach (Cont.)
  • The decision whether process Pj replies
    immediately to a request(Pi, TS) message or
    defers its reply is based on three factors:
  • If Pj is in its critical section, then it defers
    its reply to Pi.
  • If Pj does not want to enter its critical
    section, then it sends a reply immediately to Pi.
  • If Pj wants to enter its critical section but has
    not yet entered it, then it compares its own
    request timestamp with the timestamp TS.
  • If its own request timestamp is greater than TS,
    then it sends a reply immediately to Pi (Pi asked
    first).
  • Otherwise, the reply is deferred.
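A sketch of the reply-or-defer rule above, as seen by process Pj when request(Pi, TS) arrives (this is the Ricart-Agrawala scheme). The field names on pj and the send_reply callback are assumptions; ties between equal timestamps are broken by process id, as in the earlier total-ordering slide.

```python
def handle_request(pj, pi, ts, send_reply):
    """pj: Pj's state with fields in_cs, wants_cs, my_ts, my_pid, and
    deferred (requests to answer after Pj leaves its critical section).
    send_reply(pi): delivers a reply message to process Pi."""
    if pj.in_cs:
        pj.deferred.append(pi)                 # in the CS: defer the reply
    elif not pj.wants_cs:
        send_reply(pi)                         # not interested: reply at once
    elif (ts, pi) < (pj.my_ts, pj.my_pid):
        send_reply(pi)                         # Pi asked first: reply at once
    else:
        pj.deferred.append(pi)                 # Pj asked first: defer
```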

35
Problems of DDME
  • Requires complete trust that other processes will
    play fair
  • Easy to cheat just by delaying the reply!
  • The processes need to know the identity of all
    other processes in the system
  • Makes the dynamic addition and removal of
    processes more complex.
  • If one of the processes fails, then the entire
    scheme collapses.
  • Dealt with by continuously monitoring the state
    of all the processes in the system.
  • Constantly bothering people who don't care
  • Can I enter my critical section? Can I?

36
Token Passing
  • Circulate a token among processes in the system
  • Possession of the token entitles the holder to
    enter the critical section
  • Organize processes in system into a logical ring
  • Pass token around the ring
  • When you get it, enter the critical section if you
    need to, then pass it on when you are done (or
    just pass it on if you don't need it)
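A toy simulation of the ring described above; the process list, number of laps, and which processes want the critical section are all made up. The token simply visits processes in ring order, and the holder enters only if it currently needs to.

```python
import itertools

processes = ["P0", "P1", "P2", "P3"]      # logical ring order
needs_cs = {"P1", "P3"}                   # processes that want the critical section

# Circulate the token for two laps around the ring.
for holder in itertools.islice(itertools.cycle(processes), 8):
    if holder in needs_cs:
        print(f"{holder} enters its critical section, then passes the token")
        needs_cs.discard(holder)
    # otherwise the holder just passes the token to the next process
```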

37
Problems of Token Passing
  • If the machine with the token fails, how do we
    regenerate a new token?
  • A lot like electing a new coordinator
  • If process fails, need to repair the break in the
    logical ring

38
Compare Number of Messages?
  • CDME: 3 messages per critical section entry
  • DDME: the number of messages per critical-section
    entry is 2 x (n - 1)
  • Request/reply for everyone but myself
  • Token passing: between 0 and n messages
  • Might luck out and ask for token while I have it
    or when the person right before me has it
  • Might need to wait for token to visit everyone
    else first

39
Compare Starvation
  • CDME: freedom from starvation is ensured if the
    coordinator uses FIFO
  • DDME: freedom from starvation is ensured, since
    entry to the critical section is scheduled
    according to the timestamp ordering. The
    timestamp ordering ensures that processes are
    served in first-come, first-served order.
  • Token passing: freedom from starvation if the
    ring is unidirectional
  • Caveats:
  • Network reliable (i.e. machines not starved by
    inability to communicate)
  • If machines fail they are restarted or taken out
    of consideration (i.e. machines not starved by
    non-response of the coordinator or another
    participant)
  • Processes play by the rules

40
Why DDME?
  • Harder
  • More messages
  • Bothers more people
  • Coordinator just as bothered

41
Atomicity
  • Recall atomicity: either all the operations
    associated with a program unit are executed to
    completion, or none are performed.
  • In a distributed system we may have multiple
    copies of the data; replicas are good for
    reliability/availability
  • PROBLEM: how do we atomically update all of the
    copies?

42
Replica Consistency Problem
  • Imagine we have multiple bank servers and a
    client desiring to update their bank account
  • How can we do this?
  • Allow a client to update any server, then have
    the server propagate the update to the other
    servers
  • Simple and wrong!
  • Simultaneous and conflicting updates can occur at
    different servers
  • Have the client send the update to all servers
  • Same problem: a race condition over which of the
    conflicting updates reaches each server first

43
Two-phase commit
  • Algorithm for providing atomic updates in a
    distributed system
  • Give the servers (or replicas) a chance to say no
    and if any server says no, client aborts the
    operation

44
Framework
  • Goal: update all replicas atomically
  • Either everyone commits or everyone aborts
  • No inconsistencies even in the face of failures
  • Caveat: assume no Byzantine failures (servers
    stop when they fail; they do not continue and
    generate bad data)
  • Definitions
  • Coordinator: software entity that shepherds the
    process (in our example it could be one of the
    servers)
  • Ready to commit: side effects of the update are
    safely stored on non-volatile storage
  • Even after a crash, once a site says it is ready
    to commit, it will find evidence when it recovers
    and continue with the commit protocol

45
Two Phase Commit Phase 1
  • Coordinator sends a PREPARE message to each
    replica
  • Coordinator waits for all replicas to reply with
    a vote
  • Each participant sends its vote
  • Votes PREPARED if ready to commit, and locks the
    data items being updated
  • Votes NO if unable to get a lock or unable to
    ensure it is ready to commit

46
Two Phase Commit Phase 2
  • If the coordinator receives a PREPARED vote from
    all replicas, then it may decide to commit or
    abort
  • Coordinator sends its decision to all participants
  • If a participant receives a COMMIT decision, it
    commits the changes resulting from the update
  • If a participant receives an ABORT decision, it
    discards the changes resulting from the update
  • Participant replies DONE
  • When the coordinator has received DONE from all
    participants, it can delete its record of the
    outcome
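A condensed sketch of both phases, with the messages above replaced by method calls. The participant interface (prepare/commit/abort/done) and the coordinator_log list are assumptions; a real implementation would exchange messages and force PREPARED/COMMIT/ABORT records to stable storage at each step.

```python
def two_phase_commit(coordinator_log, participants):
    """coordinator_log: list standing in for the coordinator's stable log.
    participants: objects with prepare(), commit(), abort(), done()."""
    # Phase 1: ask every replica to prepare and collect the votes.
    votes = [p.prepare() for p in participants]        # "PREPARED" or "NO"
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    coordinator_log.append(decision)                    # the decision point

    # Phase 2: broadcast the decision; participants apply or discard changes.
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    if all(p.done() for p in participants):             # wait for DONE acks
        coordinator_log.clear()                          # safe to forget the outcome
    return decision
```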

47
Performance
  • In absence of failure, 2PC makes a total of 2
    (1.5?) round trips of messages before decision is
    made
  • Prepare
  • Vote NO or PREPARE
  • Commit/abort
  • Done (but done just for bookkeeping, does not
    affect response time)

48
Failure Handling in 2PC Replica Failure
  • The log contains a <commit T> record. In this
    case, the site executes redo(T).
  • The log contains an <abort T> record. In this
    case, the site executes undo(T).
  • The log contains a <ready T> record: consult the
    coordinator Ci. If Ci is down, the site sends a
    query-status T message to the other sites.
  • The log contains no control records concerning T.
    In this case, the site executes undo(T).

49
Failure Handling in 2PC Coordinator Ci Failure
  • If an active site contains a <commit T> record in
    its log, then T must be committed.
  • If an active site contains an <abort T> record in
    its log, then T must be aborted.
  • If some active site does not contain the record
    <ready T> in its log, then the failed coordinator
    Ci cannot have decided to commit T. Rather than
    wait for Ci to recover, it is preferable to abort
    T.
  • All active sites have a <ready T> record in their
    logs, but no additional control records. In this
    case we must wait for the coordinator to recover.
  • Blocking problem: T is blocked pending the
    recovery of the coordinator's site.

50
Failure Handling
  • Failure detected with timeouts
  • If participant times out before getting a PREPARE
    can abort
  • If coordinator times out waiting for a vote can
    abort
  • If a participant times out waiting for a decision
    it is blocked!
  • Wait for Coordinator to recover?
  • Punt to some other resolution protocol
  • If a coordinator times out waiting for done, keep
    record of outcome
  • other sites may have a replica.

51
Failures in distributed systems
  • We may want to avoid relying on a single
    server/coordinator/boss to make progress
  • Thus we want the decision making to be
    distributed among the participants (all nodes
    created equal) => the consensus problem in
    distributed systems.
  • However depending on what we can assume about the
    network, it may be impossible to reach a decision
    in some cases!

52
Impossibility of Consensus
  • Network characteristics
  • Synchronous - some upper bound on
    network/processing delay.
  • Asynchronous - no upper bound on
    network/processing delay.
  • Fischer, Lynch and Paterson showed:
  • In the asynchronous model, with even just one
    failure possible, you cannot guarantee consensus.
  • Essence of proof: just before a decision would be
    reached, we can always delay a node just long
    enough that no decision is reached.
  • But we still want to do it... right?

53
Paxos, etc
  • Simply don't mention the impossibility
  • A number of rounds.
  • Each round has a leader
  • Each leader tries to get a majority to agree to
    what it's proposing
  • If little progress, move on to the next leader.
  • (the impossibility arises in the last sentence
    there...)

54
Randomized consensus
  • The first approach to circumventing the
    impossibility.
  • A number of rounds.
  • In each round there are two phases.
  • In phase one, send your proposal.
  • In phase two, if you get a majority for a
    proposal, decide. Else flip a coin to choose the
    next proposal (all nodes do)
  • Circumvents the impossibility by showing that,
    eventually, with probability 1, all nodes will
    flip the coin and end up with the same choice for
    the next proposal => decision in the next round.
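A very small, synchronous simulation of the coin-flip idea above (in the spirit of Ben-Or's algorithm). It is not a faithful asynchronous, fault-tolerant protocol; the point is only that repeated coin flips eventually produce a majority, so a decision is reached with probability 1.

```python
import random
from collections import Counter

def randomized_consensus(proposals, rng=None):
    """proposals: the initial 0/1 value of each node. Returns (value, round)."""
    rng = rng or random.Random(0)     # fixed seed for a reproducible demo
    n = len(proposals)
    for round_no in range(1, 10_000):
        # Phase one: every node broadcasts its proposal; count the votes.
        value, count = Counter(proposals).most_common(1)[0]
        # Phase two: decide if some value has a strict majority...
        if count > n // 2:
            return value, round_no
        # ...otherwise all nodes flip a coin to pick the next proposal.
        proposals = [rng.choice([0, 1]) for _ in range(n)]
    raise RuntimeError("no decision reached (vanishingly unlikely)")

print(randomized_consensus([0, 1, 0, 1]))   # decided value and round number
```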

55
In the real world
  • Consensus is everywhere - a number of interesting
    problems in distributed computing can be reduced
    to consensus (learn to recognize them!)
  • Asynchronous solutions to consensus are typically
    faster and simpler, and will solve your problem
    with probability 1. Which will do for me.