Title: Distributed Systems
1. Distributed Systems
2. A Distributed System
3. Loosely Coupled Distributed Systems
- Users are aware of the multiplicity of machines.
- Access to the resources of the various machines is done explicitly by:
  - Remote logging into the appropriate remote machine.
  - Transferring data from remote machines to local machines via the File Transfer Protocol (FTP) mechanism.
4. Tightly Coupled Distributed Systems
- Users are not aware of the multiplicity of machines.
- Access to remote resources is similar to access to local resources.
- Examples:
  - Data Migration: transfer data by transferring the entire file, or by transferring only those portions of the file necessary for the immediate task.
  - Computation Migration: transfer the computation, rather than the data, across the system.
5. Distributed Operating Systems (Cont.)
- Process Migration: execute an entire process, or parts of it, at different sites.
  - Load balancing: distribute processes across the network to even out the workload.
  - Computation speedup: subprocesses can run concurrently on different sites.
  - Hardware preference: process execution may require a specialized processor.
  - Software preference: required software may be available at only a particular site.
  - Data access: run the process remotely, rather than transfer all the data locally.
6. Why Distributed Systems?
- Communication
  - Dealt with this when we talked about networks
- Resource sharing
- Computational speedup
- Reliability
7. Resource Sharing
- Distributed systems offer access to the specialized resources of many systems.
- Examples:
  - Some nodes may have special databases.
  - Some nodes may have access to special hardware devices (e.g. tape drives, printers, etc.).
- A DS offers the benefits of locating processing near the data or of sharing special devices.
8. OS Support for Resource Sharing
- Resource management?
  - A distributed OS can manage the diverse resources of the nodes in the system.
  - Make resources visible on all nodes.
  - Like VM, it can provide the functional illusion but rarely hides the performance cost.
- Scheduling?
  - A distributed OS could schedule processes to run near the needed resources.
  - If a process needs to access data in a large database, it may be easier to ship the code there and the results back than to request that the data be shipped to the code.
9. Design Issues
- Transparency: the distributed system should appear as a conventional, centralized system to the user.
- Fault tolerance: the distributed system should continue to function in the face of failure.
- Scalability: as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand.
- Clusters vs. Client/Server
  - Clusters: a collection of semi-autonomous machines that acts as a single system.
10. Computation Speedup
- Some tasks are too large for even the fastest single computer:
  - Real-time weather/climate modeling, the human genome project, fluid turbulence modeling, ocean circulation modeling, etc.
  - http://www.nersc.gov/research/GC/gcnersc.html
- What to do?
  - Leave the problem unsolved?
  - Engineer a bigger/faster computer?
  - Harness the resources of many smaller (commodity?) machines in a distributed system?
11. Breaking up the problems
- To harness computational speedup, we must first break up the big problem into many smaller problems.
- More art than science?
- Sometimes break up by function:
  - Pipeline?
  - Job queue?
- Sometimes break up by data:
  - Each node responsible for a portion of the data set?
12. Decomposition Examples
- Decrypting a message
  - Easily parallelizable: give each node a set of keys to try.
  - Job queue: when you have tried all your keys, go back for more? (A minimal sketch follows below.)
- Modeling ocean circulation
  - Give each node a portion of the ocean to model (an N-square-foot region?).
  - Model flows within the region locally.
  - Communicate with the nodes managing neighboring regions to model flows into other regions.
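To make the job-queue idea concrete, here is a minimal single-process sketch in Python: the keyspace is split into chunks and a small worker pool stands in for the nodes. The try_key check and the hard-coded SECRET are made-up stand-ins for a real decryption test.

    # Sketch of the "job queue" decomposition from the decryption example.
    from multiprocessing import Pool

    SECRET = 271828  # pretend this is the key we are searching for

    def try_key(key_range):
        """Worker: try every key in its assigned range, return the match or None."""
        for key in key_range:
            if key == SECRET:          # stand-in for "does this key decrypt the message?"
                return key
        return None

    if __name__ == "__main__":
        # Break the keyspace into chunks; each chunk is one job in the queue.
        chunks = [range(start, start + 100_000) for start in range(0, 1_000_000, 100_000)]
        with Pool(processes=4) as pool:        # 4 "nodes"
            for result in pool.imap_unordered(try_key, chunks):
                if result is not None:
                    print("key found:", result)
                    pool.terminate()           # no need to finish the remaining chunks
                    break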
13. Decomposition Examples (cont.)
- Barnes-Hut: calculating the effect of bodies in space on each other
  - Could divide space into NxN regions?
  - Some regions have many more bodies than others.
  - Instead, divide up space so that regions have roughly the same number of bodies.
  - Within a region, bodies have a large effect on each other (they are close together).
  - Abstract other regions as a single body to minimize communication.
14. Linear Speedup
- Linear speedup is often the goal: allocate N nodes to the job and it goes N times as fast.
- Once you've broken up the problem into N pieces, can you expect it to go N times as fast?
  - Are the pieces equal?
  - Is there a piece of the work that cannot be broken up (inherently sequential)? (See the estimate below.)
  - Synchronization and communication overhead between pieces?
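One standard way to put numbers on the "inherently sequential piece" question is Amdahl's law: with sequential fraction s and N nodes, the best speedup is 1 / (s + (1 - s)/N). A tiny illustration (the 5% figure is just an example):

    # Amdahl's-law style estimate of why speedup stops being linear.
    def speedup(seq_fraction, n_nodes):
        return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_nodes)

    for n in (2, 8, 64, 1024):
        print(n, round(speedup(0.05, n), 1))   # even 5% sequential work caps speedup near 20x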
15. Super-linear Speedup
- Sometimes you can actually do better than linear speedup!
  - Especially if you divide up a big data set so that the piece needed at each node fits into main memory on that machine.
  - Savings from avoiding disk I/O can outweigh the communication/synchronization costs.
- When splitting up a problem, there is tension between duplicating processing at all nodes (for reliability and simplicity) and allowing nodes to specialize.
16. OS Support for Parallel Jobs
- Process management?
  - The OS could manage all pieces of a parallel job as one unit.
  - Allow all pieces to be created, managed, and destroyed from a single command line.
  - Fork(process, machine)?
- Scheduling?
  - The programmer could specify where pieces should run, and/or the OS could decide.
  - Process migration? Load balancing?
  - Try to schedule pieces together so they can communicate effectively.
17. OS Support for Parallel Jobs (cont.)
- Group communication?
  - The OS could provide facilities for the pieces of a single job to communicate easily.
  - Location-independent addressing?
  - Shared memory?
  - Distributed file system?
- Synchronization?
  - Support for mutually exclusive access to data across multiple machines.
  - Can't rely on HW atomic operations any more.
  - Deadlock management?
  - We'll talk about clock synchronization and two-phase commit later.
18. Reliability
- A distributed system offers the potential for increased reliability:
  - If one part of the system fails, the rest could take over.
  - Redundancy, fail-over.
- BUT! Often the reality is that distributed systems offer less reliability:
  - "A distributed system is one in which some machine I've never heard of fails and I can't do work!"
  - Hard to get rid of all hidden dependencies.
  - No clean failure model:
    - Nodes don't just fail; they can continue on in a broken state.
    - Network partition: many, many nodes fail at once! (Determine who you can still talk to. Are you cut off, or are they?)
    - The network goes down and up and down again!
19. Robustness
- Detect and recover from site failure: transfer the failed site's functions, then reintegrate the failed site.
- Failure detection
- Reconfiguration
20. Failure Detection
- Detecting hardware failure is difficult.
- To detect a link failure, a handshaking protocol can be used.
- Assume Site A and Site B have established a link. At fixed intervals, the sites exchange I-am-up messages indicating that they are up and running.
- If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lost.
- Site A can then send an Are-you-up? message to Site B.
- If Site A does not receive a reply, it can repeat the message or try an alternate route to Site B. (A rough sketch of this probe appears below.)
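A hedged sketch of the Are-you-up? probe, assuming UDP datagrams; the peer address and interval are made up for illustration. A real implementation would also send periodic I-am-up messages and retry over an alternate route.

    import socket

    PEER = ("127.0.0.1", 9999)   # Site B's address (assumed)
    INTERVAL = 2.0               # the "fixed interval" from the slide, in seconds

    def probe_peer(sock):
        """Send Are-you-up? and wait up to INTERVAL for an I-am-up reply."""
        sock.settimeout(INTERVAL)
        sock.sendto(b"are-you-up?", PEER)
        try:
            data, _ = sock.recvfrom(64)
            return data == b"i-am-up"
        except OSError:          # timed out: B may be down, or the message may be lost
            return False

    if __name__ == "__main__":
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            if not probe_peer(s):
                print("no reply from", PEER, "- retry, or try an alternate route to Site B")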
21. Failure Detection (cont.)
- If Site A does not ultimately receive a reply from Site B, it concludes that some type of failure has occurred.
- Types of failure:
  - Site B is down
  - The direct link between A and B is down
  - The alternate link from A to B is down
  - The message has been lost
- However, Site A cannot determine exactly why the failure has occurred.
  - B may be assuming A is down at the same time.
  - Can either one assume it can make decisions alone?
22. Reconfiguration
- When Site A determines that a failure has occurred, it must reconfigure the system:
  - 1. If the link from A to B has failed, this must be broadcast to every site in the system.
  - 2. If a site has failed, every other site must also be notified that the services offered by the failed site are no longer available.
- When the link or the site becomes available again, this information must again be broadcast to all other sites.
23. Event Ordering
- Problem: distributed systems do not share a clock.
- Many coordination problems would be simplified if they did ("first one wins").
- Distributed systems do have some sense of time:
  - Events in a single process happen in order.
  - Messages between processes must be sent before they can be received.
- How helpful is this?
24. Happens-before
- Define a happens-before relation (denoted by →).
  - 1) If A and B are events in the same process, and A was executed before B, then A → B.
  - 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
  - 3) If A → B and B → C, then A → C.
25. Total ordering?
- Happens-before gives a partial ordering of events
- We still do not have a total ordering of events
26. Partial Ordering
- Within each process: Pi → Pi+1, Qi → Qi+1, Ri → Ri+1
- Across processes (message send → receive): R0 → Q4, Q3 → R4, Q1 → P4, P1 → Q2
27. Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
28. Timestamps
- Assume each process has a local logical clock that ticks once per event, and that the processes are numbered.
  - Clocks tick once per event (including message sends).
  - When you send a message, send your clock value.
  - When you receive a message, set your clock to MAX(your clock, timestamp of message + 1).
  - Thus sending comes before receiving.
  - The only visibility into actions at other nodes happens during communication; communication synchronizes the clocks.
- If the timestamps of two events A and B are the same, then use the process identity numbers to break ties.
- This gives a total ordering! (A sketch of these rules appears below.)
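These rules are the classic Lamport logical clock. A small sketch of them, with (clock, pid) pairs providing the tie-breaking total order:

    class LamportClock:
        def __init__(self, process_id):
            self.pid = process_id
            self.time = 0

        def local_event(self):
            self.time += 1
            return self.time

        def send(self):
            self.time += 1                      # sending is itself an event
            return self.time                    # timestamp carried by the message

        def receive(self, msg_timestamp):
            self.time = max(self.time, msg_timestamp + 1)   # so send happens-before receive
            return self.time

        def stamp(self):
            return (self.time, self.pid)        # pid breaks ties -> total order

    p, q = LamportClock(0), LamportClock(1)
    ts = p.send()
    q.receive(ts)
    print(p.stamp(), q.stamp())   # q's stamp is strictly later; equal clocks tie-break by pid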
29. Distributed Mutual Exclusion (DME)
- Problem: we can no longer rely on just an atomic test-and-set operation on a single machine to build mutual exclusion primitives.
- Requirement:
  - If Pi is executing in its critical section, then no other process Pj is executing in its critical section.
30. Solution
- We present three algorithms to ensure the mutually exclusive execution of processes in their critical sections:
  - Centralized Distributed Mutual Exclusion (CDME)
  - Fully Distributed Mutual Exclusion (DDME)
  - Token passing
31. CDME: Centralized Approach
- One of the processes in the system is chosen to coordinate entry to the critical section.
- A process that wants to enter its critical section sends a request message to the coordinator.
- The coordinator decides which process can enter the critical section next, and it sends that process a reply message.
- When the process receives a reply message from the coordinator, it enters its critical section.
- After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution.
- 3 messages per critical-section entry. (A sketch of the coordinator follows below.)
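A minimal sketch of the coordinator's request/reply/release logic, with message sends modeled as local calls (the class and callback names are mine); the FIFO queue anticipates the starvation discussion on slide 39.

    from collections import deque

    class Coordinator:
        def __init__(self):
            self.holder = None          # process currently in its critical section
            self.waiting = deque()      # FIFO queue of waiting requests avoids starvation

        def request(self, pid, grant):
            """grant(pid) stands in for the 'reply' message back to the requester."""
            if self.holder is None:
                self.holder = pid
                grant(pid)
            else:
                self.waiting.append((pid, grant))

        def release(self, pid):
            assert pid == self.holder
            self.holder = None
            if self.waiting:
                next_pid, grant = self.waiting.popleft()
                self.holder = next_pid
                grant(next_pid)

    coord = Coordinator()
    coord.request(1, lambda p: print("reply ->", p))   # enters immediately
    coord.request(2, lambda p: print("reply ->", p))   # queued behind process 1
    coord.release(1)                                   # now process 2 gets the reply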
32. Problems of CDME
- Electing the master process? Hardcoded?
- Single point of failure? Electing a new master process?
- Distributed election algorithms: later
33. DDME: Fully Distributed Approach
- When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request(Pi, TS) to all other processes in the system.
- When process Pj receives a request message, it may reply immediately or it may defer sending a reply back.
- When process Pi receives a reply message from all other processes in the system, it can enter its critical section.
- After exiting its critical section, the process sends reply messages to all of its deferred requests.
34. DDME: Fully Distributed Approach (Cont.)
- The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors (sketched in code below):
  - If Pj is in its critical section, then it defers its reply to Pi.
  - If Pj does not want to enter its critical section, then it sends a reply immediately to Pi.
  - If Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS:
    - If its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first).
    - Otherwise, the reply is deferred.
35. Problems of DDME
- Requires complete trust that other processes will play fair.
  - Easy to cheat just by delaying the reply!
- The processes need to know the identity of all other processes in the system.
  - Makes the dynamic addition and removal of processes more complex.
- If one of the processes fails, then the entire scheme collapses.
  - Dealt with by continuously monitoring the state of all the processes in the system.
- Constantly bothering people who don't care:
  - "Can I enter my critical section? Can I?"
36. Token Passing
- Circulate a token among the processes in the system.
- Possession of the token entitles the holder to enter the critical section.
- Organize the processes in the system into a logical ring.
- Pass the token around the ring.
- When you get it, enter the critical section if you need to, then pass it on when you are done (or just pass it on if you don't need it). (A small simulation follows below.)
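A toy single-process simulation of the ring; in a real system each node would hold the token briefly and forward it over the network. The node count and the wants_cs flags are arbitrary.

    N = 5
    wants_cs = [False, True, False, True, False]   # which nodes currently need the critical section

    def circulate(start_node, rounds=1):
        token_at = start_node
        for _ in range(rounds * N):
            if wants_cs[token_at]:
                print(f"node {token_at} enters critical section, then passes token")
                wants_cs[token_at] = False
            token_at = (token_at + 1) % N           # pass the token to the next node in the ring
        return token_at

    circulate(start_node=0)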
37. Problems of Token Passing
- If the machine holding the token fails, how do we regenerate a new token?
  - A lot like electing a new coordinator.
- If a process fails, we need to repair the break in the logical ring.
38. Compare: Number of Messages
- CDME: 3 messages per critical-section entry.
- DDME: the number of messages per critical-section entry is 2 x (n - 1).
  - Request/reply for everyone but myself.
- Token passing: between 0 and n messages.
  - Might luck out and ask for the token while I have it, or when the person right before me has it.
  - Might need to wait for the token to visit everyone else first.
39. Compare: Starvation
- CDME: freedom from starvation is ensured if the coordinator uses FIFO ordering.
- DDME: freedom from starvation is ensured, since entry to the critical section is scheduled according to the timestamp ordering. The timestamp ordering ensures that processes are served in first-come, first-served order.
- Token passing: freedom from starvation if the ring is unidirectional.
- Caveats:
  - The network is reliable (i.e. machines are not starved by an inability to communicate).
  - If machines fail, they are restarted or taken out of consideration (i.e. machines are not starved by non-response of the coordinator or another participant).
  - Processes play by the rules.
40. Why DDME?
- Harder
- More messages
- Bothers more people
- Coordinator just as bothered
41. Atomicity
- Recall atomicity: either all the operations associated with a program unit are executed to completion, or none are performed.
- In a distributed system we may have multiple copies of the data; replicas are good for reliability/availability.
- PROBLEM: how do we atomically update all of the copies?
42. Replica Consistency Problem
- Imagine we have multiple bank servers and a client desiring to update their bank account.
- How can we do this?
- Allow a client to update any server, then have the server propagate the update to the other servers?
  - Simple and wrong!
  - Simultaneous and conflicting updates can occur at different servers.
- Have the client send the update to all servers?
  - Same problem: a race condition determines which of the conflicting updates reaches each server first. (A tiny illustration follows below.)
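A tiny illustration of why "send the update everywhere" is not enough: the same two conflicting updates, applied in different orders at two replicas, leave them inconsistent. The specific operations here are arbitrary examples.

    def apply_updates(balance, updates):
        for op in updates:
            balance = op(balance)
        return balance

    deposit_100 = lambda b: b + 100
    set_to_50   = lambda b: 50            # e.g. an overwrite-style update

    server_a = apply_updates(0, [deposit_100, set_to_50])   # -> 50
    server_b = apply_updates(0, [set_to_50, deposit_100])   # -> 150
    print(server_a, server_b)             # the replicas disagree: 50 vs 150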
43. Two-Phase Commit
- An algorithm for providing atomic updates in a distributed system.
- Give the servers (or replicas) a chance to say no, and if any server says no, the client aborts the operation.
44. Framework
- Goal: update all replicas atomically.
  - Either everyone commits or everyone aborts.
  - No inconsistencies, even in the face of failures.
- Caveat: assume no Byzantine failures (servers stop when they fail; they do not continue and generate bad data).
- Definitions:
  - Coordinator: the software entity that shepherds the process (in our example, it could be one of the servers).
  - Ready to commit: the side effects of the update are safely stored on non-volatile storage.
    - Even if it crashes, once a site says "I am ready to commit," it will find evidence when it recovers and continue with the commit protocol.
45. Two-Phase Commit: Phase 1
- The coordinator sends a PREPARE message to each replica.
- The coordinator waits for all replicas to reply with a vote.
- Each participant sends its vote:
  - Votes PREPARED if it is ready to commit, and locks the data items being updated.
  - Votes NO if it is unable to get a lock or unable to ensure that it is ready to commit.
46. Two-Phase Commit: Phase 2
- If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort.
- The coordinator sends its decision to all participants.
- If a participant receives a COMMIT decision, it commits the changes resulting from the update.
- If a participant receives an ABORT decision, it discards the changes resulting from the update.
- Each participant replies DONE.
- When the coordinator has received DONE from all participants, it can delete its record of the outcome. (A sketch of both phases follows below.)
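A sketch of both phases with participants modeled as local objects; real participants would write <ready T> / <commit T> records to stable storage and hold locks, which this toy version only notes in comments.

    class Participant:
        def __init__(self, can_commit=True):
            self.can_commit = can_commit
            self.state = "init"

        def prepare(self):
            if self.can_commit:
                self.state = "prepared"       # would also lock data and log <ready T>
                return "PREPARED"
            return "NO"

        def decide(self, decision):
            self.state = decision             # "commit": apply the update; "abort": discard it
            return "DONE"

    def two_phase_commit(participants):
        # Phase 1: collect votes
        votes = [p.prepare() for p in participants]
        decision = "commit" if all(v == "PREPARED" for v in votes) else "abort"
        # Phase 2: broadcast the decision, wait for DONE from everyone
        acks = [p.decide(decision) for p in participants]
        assert all(a == "DONE" for a in acks)
        return decision

    print(two_phase_commit([Participant(), Participant()]))                  # commit
    print(two_phase_commit([Participant(), Participant(can_commit=False)]))  # abort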
47. Performance
- In the absence of failure, 2PC makes a total of 2 (1.5?) round trips of messages before the decision is made:
  - Prepare
  - Vote: NO or PREPARED
  - Commit/abort
  - Done (but DONE is just for bookkeeping; it does not affect response time)
48. Failure Handling in 2PC: Replica Failure
- The log contains a <commit T> record. In this case, the site executes redo(T).
- The log contains an <abort T> record. In this case, the site executes undo(T).
- The log contains a <ready T> record. In this case, consult Ci. If Ci is down, the site sends a query-status T message to the other sites.
- The log contains no control records concerning T. In this case, the site executes undo(T).
49. Failure Handling in 2PC: Coordinator Ci Failure
- If an active site contains a <commit T> record in its log, then T must be committed.
- If an active site contains an <abort T> record in its log, then T must be aborted.
- If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.
- If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover.
  - Blocking problem: T is blocked pending the recovery of site Si.
50. Failure Handling
- Failures are detected with timeouts.
- If a participant times out before getting a PREPARE, it can abort.
- If the coordinator times out waiting for a vote, it can abort.
- If a participant times out waiting for a decision, it is blocked!
  - Wait for the coordinator to recover?
  - Punt to some other resolution protocol?
- If the coordinator times out waiting for DONE, it keeps its record of the outcome.
  - Other sites may have a replica.
51. Failures in Distributed Systems
- We may want to avoid relying on a single server/coordinator/boss to make progress.
- Thus we want the decision making to be distributed among the participants (all nodes created equal) => the consensus problem in distributed systems.
- However, depending on what we can assume about the network, it may be impossible to reach a decision in some cases!
52. Impossibility of Consensus
- Network characteristics:
  - Synchronous: some upper bound on network/processing delay.
  - Asynchronous: no upper bound on network/processing delay.
- Fischer, Lynch, and Paterson showed:
  - With even just one failure possible, you cannot guarantee consensus.
  - Essence of the proof: just before a decision is reached, we can always delay a node slightly too long for the decision to be reached.
- But we still want to do it... right?
53. Paxos, etc.
- Simply don't mention the impossibility.
- A number of rounds.
- Each round has a leader.
- Each leader tries to get a majority to agree to what it's proposing.
- If there is little progress, move on to the next leader.
- (The impossibility arises in that last sentence...)
54. Randomized Consensus
- The first approach to circumventing the impossibility.
- A number of rounds.
- In each round there are two phases:
  - In phase one, send your proposal.
  - In phase two, if you get a majority for a proposal, decide. Else flip a coin to choose the next proposal (all nodes do).
- Circumvents the impossibility by showing that, eventually, with probability 1, all nodes will flip the coin and end up with the same choice for the next proposal => decision in the next round. (A rough simulation follows below.)
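A very rough, single-process simulation of the round structure (loosely in the spirit of Ben-Or's randomized protocol); it ignores failures, message loss, and the partial views a real protocol must handle.

    import random
    from collections import Counter

    def one_round(proposals):
        n = len(proposals)
        # Phase 1: every node broadcasts its proposal; here we just tally all of them.
        value, votes = Counter(proposals).most_common(1)[0]
        # Phase 2: if a majority agrees on one value, decide; otherwise every node flips a coin.
        if votes > n // 2:
            return "decided", value, proposals
        return "retry", None, [random.choice([0, 1]) for _ in range(n)]

    # 4 nodes with random initial proposals; a 2-2 split forces the coin-flip path.
    proposals = [random.choice([0, 1]) for _ in range(4)]
    status, value, proposals = one_round(proposals)
    while status != "decided":
        status, value, proposals = one_round(proposals)
    print("consensus on", value)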
55. In the Real World
- Consensus is everywhere: a number of interesting problems in distributed computing can be reduced to consensus (learn to recognize them!).
- Asynchronous solutions to consensus are typically faster and simpler, and will solve your problem with probability 1. Which will do for me.