Title: Reliable Distributed Systems
1Reliable Distributed Systems
2Topics for this lecture
- Adding clocks to distributed systems
- Also, to non-distributed ones
- We'll just touch on real-time operating systems and scheduling, an active area until recently
- Recent machines are so fast that the importance of the topic is now reduced
- Using time in broadcast protocols
- Comparison of CASD with gossip
3Many systems need real-time
- Air traffic control talks about when planes will be at various locations
- Doctors talk about how patients responded after a drug was given, or change therapy after some amount of time
- Process control software runs factory floors by coordinating what machine tools do, and when
4Many systems need real-time
- Video and multi-media systems need isochronous communication protocols that coordinate video, voice, and other data sources
- Telecommunications systems must guarantee real-time response despite failures, for example when switching telephone calls
5Real time in real systems
- These real systems combine the need for logical consistency and coordination (for control) with the need for physical coordination and timely behavior
- Issue becomes one of integrating real-time tools and support with logical tools and support such as we have already considered
- Not real-time or logical time, but want both for different aspects of a single application!
6Clock Synchronization and Synchronous Systems
- Up to now, we restricted attention to logical notions of time: "a happens before b"
- But recall that we also touched on real clocks
- Two views of real-time
  - Supporting clocks that programs can consult as needed
  - Making direct use of real-time inside protocols
7Clock Synchronization
- Topic was extremely active during early 1980s
- Best known algorithms include the one in OSF/1 UNIX (based on one by Marzullo at Xerox), the optimally accurate algorithm of Srikanth and Toueg, and the probabilistic algorithm of Cristian
- Introduction of Global Positioning System is eliminating the need for this form of synchronization
8Clock synchronization protocols
- Would like to think of network as having a single clock that all processes can consult
- But need to implement this using many local clocks that can:
  - Initially be set incorrectly
  - Drift over the course of time
- Clock synchronization tries to keep clocks close to true real-time and minimize their tendency to drift
9Precision and Accuracy
- Accuracy measures local clocks relative to an external source of accurate real-time. Accurate clocks are close to real-time
- Precision measures local clocks relative to each other. Precise clocks are close to each other
- Skew is the numerical limit on the maximum distance that correct clocks can drift apart. E.g., could say the maximum skew is 1 sec for some system
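The two measures can be made concrete with a small sketch. The function names and the clock readings below are hypothetical, chosen only to illustrate the definitions on this slide.

```python
def accuracy_error(clocks, true_time):
    """Largest distance between any local clock and the external reference."""
    return max(abs(c - true_time) for c in clocks)


def precision_error(clocks):
    """Largest distance between any two local clocks."""
    return max(clocks) - min(clocks)


# Hypothetical readings, in seconds, taken at the same real-time instant.
readings = [100.25, 100.5, 99.75]
reference = 100.0

print(accuracy_error(readings, reference))  # worst distance to real time
print(precision_error(readings))            # worst distance between clocks
print(precision_error(readings) <= 1.0)     # within a 1 sec maximum skew?
```

Note that a set of clocks can be precise without being accurate: if every clock is wrong by the same amount, they agree with each other but not with real time.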
10How clock synchronization used to work
- Periodically, all processors would run a clock sync protocol, for example by broadcasting the reading from their clocks
- Each receives a set of values from the others (sets may differ due to faults!)
- Algorithm would pick a synchronized value from the set; analysis used to prove properties of clocks
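The slide does not say which rule was used to pick the synchronized value. One well-known choice is a fault-tolerant midpoint: discard the most extreme readings, then take the midpoint of what is left. The sketch below assumes that rule; the function name, the `max_faulty` parameter, and the readings are illustrative only.

```python
def fault_tolerant_midpoint(readings, max_faulty):
    """Drop the max_faulty lowest and highest readings, then take the
    midpoint of what remains, so that extreme (possibly faulty) values
    cannot pull the result arbitrarily far."""
    if len(readings) <= 2 * max_faulty:
        raise ValueError("need more than 2 * max_faulty readings")
    ordered = sorted(readings)
    trimmed = ordered[max_faulty:len(ordered) - max_faulty]
    return (trimmed[0] + trimmed[-1]) / 2


# Hypothetical clock readings; 5000 comes from a faulty clock.
received = [1002, 998, 1000, 5000, 1001]
print(fault_tolerant_midpoint(received, max_faulty=1))
```

Because each processor may receive a slightly different set of values, the results will not be identical everywhere, but trimming keeps them within a bounded distance of one another.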
11Global Positioning System
- Satellite system launched by military in early 1990s, became public and inexpensive
- Can think of satellites as broadcasting the time
- Small radio receiver picks up signals from three satellites and triangulates to determine position
- Same computation also yields extremely accurate clock (accurate to a few milliseconds)
12Clock synchronization with GPS
- Put two GPS receivers (or more) on a network
- Periodically, receivers broadcast the true time
- Other machines only need to adjust their clocks to overcome propagation delays for clock sync messages through the network!
- Well matched to the a-posteriori clock synchronization approach
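One common way to compensate for propagation delay is to measure a round trip and assume the one-way delay is half of it. This is a swapped-in technique for illustration, not something the slide specifies, and the function and parameter names are hypothetical.

```python
def estimate_offset(request_sent, server_time, reply_received):
    """Estimate how far the local clock is from the time source.

    request_sent and reply_received are local clock readings; server_time
    is the time reported by the source. Assumes the one-way delay is half
    of the measured round trip (an assumption, not stated in the lecture).
    """
    round_trip = reply_received - request_sent
    estimated_now = server_time + round_trip / 2
    offset = estimated_now - reply_received
    return offset, round_trip / 2  # estimate and its worst-case error


offset, error_bound = estimate_offset(
    request_sent=100.0, server_time=250.0, reply_received=104.0)
print(offset, error_bound)
```

The shorter the measured round trip, the tighter the error bound, which is why a node might repeat the exchange and keep the sample with the smallest round trip.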
13Basic idea
- GPS receiver broadcasts "the time is now 10:00" on a broadcast network (ethernet)
- Receivers note the time when they receive the message: 10:01, 9:58, ... and reply with values
- GPS receiver subtracts the median value
- Differences +1, -2, ... now give the drift of the clock of the destination relative to the median clock
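The median-subtraction step can be sketched as follows. The node names and readings are hypothetical, and the readings are plain integers in arbitrary clock ticks.

```python
from statistics import median


def offsets_from_median(readings):
    """Map each node to (its reading - median reading)."""
    mid = median(readings.values())
    return {node: value - mid for node, value in readings.items()}


# Hypothetical readings (in clock ticks) noted when the broadcast arrived.
noted = {"a": 1001, "b": 998, "c": 1000, "d": 1003, "e": 999}
print(offsets_from_median(noted))
```

Using the median rather than the mean means that a single wildly wrong clock does not shift the reference point for everyone else.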
14A-posteriori method, adjustment stage
- Now we distribute these drift values back to the processors, which compensate for the rate of drift over the time period to the next synchronization
- Can show that this yields clocks that are optimal both in accuracy and precision
- A processor with a broken clock has a good chance of discovering it during synchronization
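A minimal sketch of the compensation step, assuming the drift rate is estimated linearly from two successive synchronization rounds. The function names, parameters, and numbers are hypothetical.

```python
def estimate_drift_rate(prev_offset, new_offset, interval):
    """Offset growth per unit of time between two synchronization rounds."""
    return (new_offset - prev_offset) / interval


def corrected_time(raw_clock, last_sync_raw, offset_at_sync, drift_rate):
    """Correct a raw local clock reading.

    offset_at_sync: how far ahead the clock was at the last synchronization.
    drift_rate: extra offset gained per unit of raw clock time.
    """
    elapsed = raw_clock - last_sync_raw
    return raw_clock - offset_at_sync - drift_rate * elapsed


# Hypothetical clock that gained 2 ticks over the last 1024 ticks.
rate = estimate_drift_rate(prev_offset=0, new_offset=2, interval=1024)
print(corrected_time(raw_clock=5512, last_sync_raw=5000,
                     offset_at_sync=2, drift_rate=rate))
```

A clock whose estimated rate is far outside the expected range is a candidate for the "broken clock" case mentioned on the slide.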
15Using real-time
- One option is to use a real-time operating system, clock synchronization algorithm, and to design protocols that exploit time
- Example: MARS system uses pairs of redundant processors to perform actions fault-tolerantly and meet deadlines. Has been applied in process control systems. (Another example: Delta-4)
16Features of real-time operating systems
- The O/S itself tends to be rather simple
  - Big black boxes behave unpredictably
- They are structured in terms of tasks
  - A task is more or less a thread
  - But tasks typically come with expected runtime, deadlines, priorities, interruptibility, etc.
- User decomposes application into task-like component parts and then expresses goals in a form that RTOS can handle
- Widely used on things like medical devices
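To make the idea of a task with an expected runtime and a deadline concrete, here is a small sketch that runs tasks in earliest-deadline-first order. That policy is a common textbook choice, not one the slide names, and the `Task` fields and the task set are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    expected_runtime: int  # time units the task needs
    deadline: int          # absolute time by which it must finish


def run_earliest_deadline_first(tasks):
    """Run tasks one after another, earliest deadline first, starting at
    time 0. Returns {task name: True if it finished by its deadline}."""
    now = 0
    met = {}
    for task in sorted(tasks, key=lambda t: t.deadline):
        now += task.expected_runtime
        met[task.name] = now <= task.deadline
    return met


# Hypothetical task set for a medical-device style controller.
tasks = [
    Task("log", expected_runtime=4, deadline=20),
    Task("read_sensor", expected_runtime=2, deadline=3),
    Task("update_pump", expected_runtime=3, deadline=6),
]
print(run_earliest_deadline_first(tasks))
```

The sketch ignores preemption and priorities; a real RTOS would also use those attributes when deciding what to run.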
17RTOS can be beneficial
- Lockheed Martin ATL timed CORBA method invocations
- Variation in response time was huge with a normal Linux OS
- When using a Timesys RTOS the variability is eliminated!
18Real-time broadcast protocols
- Can also implement broadcast protocols that make direct use of temporal information
- Examples:
  - Broadcast that is delivered at same time by all correct processes (plus or minus the clock skew)
  - Distributed shared memory that is updated within a known maximum delay
  - Group of processes that can perform periodic actions
19A real-time broadcast
[Figure: timeline of processes with times t, ta and tb marked]
Message is sent at time t by p0. Later both p0
and p1 fail. But message is still delivered
atomically, after a bounded delay, and within a
bounded interval of time (at non-faulty processes)
20A real-time distributed shared memory
[Figure: timeline of processes with times t, ta and tb marked; labels "set x=3" and "x=3"]
At time t p0 updates a variable in a distributed
shared memory. All correct processes observe the
new value after a bounded delay, and within a
bounded interval of time.
21Periodic process group (Marzullo)
Periodically, all members of a group take some
action. Idea is to accomplish this with minimal
communication
22The CASD protocols
- Also known as the Δ-T protocols
- Developed by Cristian and others at IBM, was intended for use in the (ultimately, failed) FAA project
- Goal is to implement a timed atomic broadcast tolerant of Byzantine failures
23Basic idea of the CASD protocols
- Assumes use of clock synchronization
- Sender timestamps message
- Recipients forward the message using a flooding technique (each echoes the message to others)
- Wait until all correct processors have a copy, then deliver in unison (up to limits of the clock skew)
24CASD picture
[Figure: timeline of processes with times t, ta and tb marked]
p0, p1 fail. Messages are lost when echoed by
p2, p3
25Idea of CASD
- Assume known limits on number of processes that fail during protocol, number of messages lost
- Using these and the temporal assumptions, deduce worst-case scenario
- Now know that if we wait long enough, all (or no) correct processes will have the message
- Then schedule delivery using original time plus a delay computed from the worst-case assumptions
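The "timestamp plus worst-case delay" rule can be sketched as a hold-back queue. The delay formula below is a deliberately simple illustration and is not the published CASD bound; the class, function, and parameter names are all hypothetical.

```python
import heapq


def delivery_delay(max_failures, max_lost, hop_delay, clock_skew):
    """Illustrative worst case: every failure or loss can waste one hop
    before a copy finally gets through. Not the published CASD bound."""
    return (max_failures + max_lost + 1) * hop_delay + clock_skew


class TimedDeliveryQueue:
    """Holds timestamped messages until timestamp + delay on the local clock."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = []  # min-heap of (timestamp, sender, payload)
        self.seen = set()

    def receive(self, timestamp, sender, payload, local_now):
        key = (timestamp, sender)
        if key in self.seen:
            return  # duplicate copy from the flooding step
        if local_now > timestamp + self.delay:
            return  # arrived after its delivery time: too late, discard
        self.seen.add(key)
        heapq.heappush(self.pending, (timestamp, sender, payload))

    def deliver_due(self, local_now):
        """Return payloads whose delivery time has been reached, in
        timestamp order (ties broken by sender)."""
        due = []
        while self.pending and self.pending[0][0] + self.delay <= local_now:
            due.append(heapq.heappop(self.pending)[2])
        return due


delay = delivery_delay(max_failures=2, max_lost=1, hop_delay=10, clock_skew=5)
queue = TimedDeliveryQueue(delay)
queue.receive(100, "p0", "m1", local_now=112)
queue.receive(90, "p3", "m0", local_now=115)
queue.receive(100, "p0", "m1", local_now=120)  # echoed duplicate, ignored
print(queue.deliver_due(local_now=140))  # m0 is due (90 + delay), m1 is not
print(queue.deliver_due(local_now=150))  # m1 is now due (100 + delay)
```

Because every correct process applies the same rule to the same timestamps, they deliver in the same order at roughly the same time, up to the clock skew.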
26The problems with CASD
- In the usual case, nothing goes wrong, hence the delay can be very conservative
- Even if things do go wrong, is it right to assume that if a message needs between 0 and δ ms to make one hop, it needs [0, n*δ] to make n hops?
- How realistic is it to bound the number of failures expected during a run?
27CASD in a more typical run
[Figure: timeline of processes with times t, ta and tb marked]
28... leading developers to employ more aggressive
parameter settings
[Figure: timeline of processes with times t, ta and tb marked]
29CASD with over-aggressive parameter settings starts to malfunction
[Figure: timeline of processes with times t, ta and tb marked]
all processes look incorrect (red) from time to
time
30CASD mile high
- When run slowly, protocol is like a real-time version of abcast
- When run quickly, protocol starts to give probabilistic behavior
- If I am correct (and there is no way to know!) then I am guaranteed the properties of the protocol, but if not, I may deliver the wrong messages
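A toy example shows how shrinking the delay parameter changes the outcome. It assumes a process simply drops any copy that arrives after the scheduled delivery time; the function name and the arrival times are hypothetical.

```python
def who_delivers(send_time, arrival_times, delay):
    """Return the set of processes that deliver a message: those whose
    copy arrived no later than send_time + delay on their clock."""
    deadline = send_time + delay
    return {p for p, arrived in arrival_times.items() if arrived <= deadline}


# Hypothetical arrival times (ms) of one message sent at time 0.
arrivals = {"p1": 12, "p2": 35, "p3": 80}

print(sorted(who_delivers(0, arrivals, delay=100)))  # conservative setting
print(sorted(who_delivers(0, arrivals, delay=50)))   # aggressive setting
```

With the conservative setting every process delivers. With the aggressive one, p3 is treated as faulty for this message and misses it, even though nothing about p3 itself is broken.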
31How to repair CASD in this case?
- Gopal and Toueg developed an extension, but it slows the basic CASD protocol down, so it wouldn't be useful in the case where we want speed and also real-time guarantees
- Can argue that the best we can hope to do is to superimpose a process group mechanism over CASD (Verissimo and Almeida are looking at this).
32Why worry?
- CASD can be used to implement a distributed shared memory (delta-common storage)
- But when this is done, the memory consistency properties will be those of the CASD protocol itself
- If CASD protocol delivers different sets of messages to different processes, memory will become inconsistent
33Why worry?
- In fact, we have seen that CASD can do just this, if the parameters are set aggressively
- Moreover, the problem is not detectable either by technically faulty processes or correct ones
- Thus, DSM can become inconsistent and we lack any obvious way to get it back into a consistent state
34Using CASD in real environments
- Would probably need to set the parameters close to the range where CASD can malfunction, but rarely
- Hence would need to add a self-stabilization algorithm to restore consistent state of memory after it becomes inconsistent
- Problem has not been treated in papers on CASD
- pbcast protocol does this
35Using CASD in real environments
- Once we build the CASD mechanism how would we use it?
  - Could implement a shared memory
  - Or could use it to implement a real-time state machine replication scheme for processes
- US air traffic project adopted latter approach
  - But stumbled on many complexities
36Using CASD in real environments
- Pipelined computation
- Transformed computation
37Issues?
- Could be quite slow if we use conservative parameter settings
- But with aggressive settings, either process could be deemed faulty by the protocol
  - If so, it might become inconsistent
  - Protocol guarantees don't apply
- No obvious mechanism to reconcile states within the pair
- Method was used by IBM in a failed effort to build a new US Air Traffic Control system
38Similar to MARS
- Research system done in Austria by Hermann Kopetz
- Basic idea is that everything happens twice
- Receiver can suppress duplicates but is guaranteed of at least one copy of each message
- Used to overcome faults without loss of real-time guarantees
- MARS is used in the BMW but gets close to a hardware fault-tolerance scheme
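The "everything happens twice" idea, with duplicate suppression at the receiver, can be sketched as below. This is only an illustration of the principle and not how MARS is implemented; the class, function, and message names are hypothetical.

```python
class DuplicateSuppressor:
    """Receiver side: pass each message on once, drop later copies."""

    def __init__(self):
        self.delivered_ids = set()

    def accept(self, message_id, payload):
        """Return payload the first time message_id is seen, else None."""
        if message_id in self.delivered_ids:
            return None
        self.delivered_ids.add(message_id)
        return payload


def send_twice(receiver, message_id, payload, lost_copies=()):
    """Send two copies; copies whose index is in lost_copies are dropped.
    Returns the list of payloads the receiver actually delivered."""
    results = []
    for copy in (0, 1):
        if copy in lost_copies:
            continue  # simulate a fault on this copy
        results.append(receiver.accept(message_id, payload))
    return [r for r in results if r is not None]


receiver = DuplicateSuppressor()
print(send_twice(receiver, 1, "open valve"))                     # both arrive
print(send_twice(receiver, 2, "close valve", lost_copies=(0,)))  # one lost
```

In both cases the receiver delivers exactly one copy, and no retransmission delay is needed when a single copy is lost, which is what preserves the timing guarantee.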
39Many more issues.
- What if a process starts to lag?
- What if applications aren't strictly deterministic?
- How should such a system be managed?
- How can a process be restarted?
  - If not, the system eventually shuts down!
- How to measure the timing behavior of components, including the network
40FAA experience?
- It became too hard to work all of this out
- Then they tried a transactional approach, also had limited success
- Finally, they gave up!
  - $6B was lost
  - A major fiasco, ATC is still a mess
41Totem approach
- Start with extended virtual synchrony model
- Analysis used to prove real-time delivery properties
- Enables them to guarantee delivery within about 100-200ms on a standard broadcast LAN
- Contrast with our 85us latency for Horus!
42Tradeoffs between consistency, time
- Notice that as we push CASD to run faster we lose consistency
- Contrast with our virtual synchrony protocols: they run as fast as they can (often, much faster than CASD when it is not malfunctioning) but don't guarantee real-time delivery
43A puzzle
- Suppose that experiments show that 99.99% of Horus or Ensemble messages are delivered in 85us +/- 10us for some known maximum load
- Also have a theory that shows that 100% of Totem messages are delivered in about 150ms for reasonable assumptions
- And have the CASD protocols which work well with Δ around 250ms for similar LANs
44A puzzle
- Question: is there really a difference between these forms of guarantees?
- We saw that CASD is ultimately probabilistic. Since Totem makes assumptions, it is also, ultimately, probabilistic
- But the experimentally observed behavior of Horus is also probabilistic
- ... so why isn't Horus a real-time system?
45What does real-time mean?
- To the real-time community?
  - A system that provably achieves its deadlines under stated assumptions
  - Often achieved using delays!
- To the pragmatic community?
  - The system is fast enough to accomplish our goals
  - Experimentally, it never seems to lag behind or screw up
46Some real-time issues
- Scheduling
- Given goals, how should tasks be scheduled?
- Periodic, a-periodic and completely ad-hoc tasks
- What should we do if a system misses its goals?
- How can we make components highly predictable in
terms of their real-time performance profile?
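For periodic tasks, one classical answer to the scheduling question is rate-monotonic scheduling, with the Liu and Layland utilization bound as a quick schedulability check. The lecture does not name this test; it is included here as a familiar example, and the task sets are hypothetical.

```python
def utilization(tasks):
    """tasks: list of (worst_case_runtime, period) pairs."""
    return sum(runtime / period for runtime, period in tasks)


def rate_monotonic_bound(n):
    """Liu and Layland utilization bound for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)


def passes_rm_test(tasks):
    """Sufficient (not necessary) schedulability test: a task set that
    fails may still be schedulable, but one that passes always is."""
    return utilization(tasks) <= rate_monotonic_bound(len(tasks))


light = [(1, 4), (1, 5), (2, 10)]  # utilization about 0.65
heavy = [(2, 4), (2, 5), (2, 10)]  # utilization about 1.1
print(passes_rm_test(light))
print(passes_rm_test(heavy))
```

The bound falls toward ln 2 (about 0.69) as the number of tasks grows, so a processor kept below roughly 69% utilization passes the test for any number of periodic tasks.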
47Real-time today
- Slow transition
  - Older, special purpose operating systems and components, carefully hand-crafted for predictability
  - Newer systems are simply so fast (and can be dedicated to task) that what used to be hard is now easy
- In effect, we no longer need to worry about real-time, in many cases, because our goals are so easily satisfied!