Clusters Part 4 - Systems (transcript of a PowerPoint presentation)

1
Clusters Part 4 - Systems
  • Lars Lundberg
  • The slides in this presentation cover Part 4
    (Chapters 12-15) in Pfister's book. We will,
    however, only present slides for chapter 12.
  • This part is the most important one in Pfister's
    book!

2
High Availability
  • What we today call high availability was
    previously called fault tolerance.
  • Traditionally there have been hardware
    fault-tolerant systems. This means that faults are
    handled entirely by the hardware, and the software
    does not have to care.
  • Cluster systems offer fault tolerance in
    software, i.e. they use standard hardware.

3
Classes of Availability
4
Measuring Availability
  • Availability is usually measured as the percentage
    of the time that a system is available, assuming
    that the system is either fully available or not
    available at all (see the sketch below).
  • Potential problems when measuring availability
  • What if the system is only partly available?
  • Should we include periods when the system is not
    used?
  • Should we include planned outages for maintenance
    etc.? Planned outages can be a real problem in
    non-stop operation environments.
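
  • The availability calculation above can be sketched in
    Python as follows (not from the slides; the
    include_planned flag reflects the planned-outage
    question):

    def availability(total_hours, unplanned_down, planned_down,
                     include_planned=True):
        """Availability as a percentage of total_hours, assuming the
        system is either fully up or fully down."""
        downtime = unplanned_down + (planned_down if include_planned else 0)
        return 100.0 * (total_hours - downtime) / total_hours

    # One year (8760 h) with 2 h of crashes and 10 h of planned maintenance.
    print(availability(8760, 2, 10))                         # about 99.86
    print(availability(8760, 2, 10, include_planned=False))  # about 99.98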

5
High Availability vs. Continuous operation
  • If we separate the planned outages (maintenance,
    upgrades etc.) from the unplanned ones (crashes,
    faults etc.), we can make the distinction
    between
  • High availability (few and short unplanned
    outages)
  • Continuous operation (few and short planned and
    unplanned outages)
  • High availability and continuous operation are
    not always equally important.

6
Reasons for unplanned outages
  • Loss of power
  • Application software
  • Operating system software
  • Subsystem software (e.g. databases)
  • Hardware with moving parts (e.g. disks, fans,
    printers)
  • I/O-adapters
  • Memory
  • Processors, caches etc.

7
Outage Duration
  • Hardware does not break as often as software,
    but when it does it takes longer to repair.
  • Traditional hardware fault tolerance can recover
    from a fault faster than software fault tolerant
    cluster systems.
  • Very few clusters can recover from a fault in
    less than 30 seconds. It often takes much longer.

8
Definition of High Availability
  • A system is highly available if
  • No replaceable piece is a single point of
    failure.
  • The system is sufficiently reliable that you are
    likely to be able to repair or replace any broken
    parts before anything else breaks.
  • A single point of failure is a single element of
    hardware or software which, if it fails, brings
    down the entire system.

9
Summary of High Availability
  • For 24x365 operation (24 hours a day, 365 days per
    year), you must consider things like cooling and
    power supply, and also provide careful system
    management.
  • 24x365 operation also implies dealing with
    planned outages and disasters, not just breakage
    and errors.
  • Disregarding power failure, software causes the
    largest number of outages.
  • The longest unplanned outages are caused as much
    by hardware as by software (again disregarding
    power failure).

10
Summary of High Availability cont.
  • Avoid single point of failure
  • Clusters can help with planned outages and some
    unplanned errors in hardware and software.
  • Hardware-based fault tolerance fails over
    instantaneously, but does not help with software
    errors and planned outages.
  • There is no industry consensus on what high
    availability and fault tolerance mean.

11
Failover
One computer (Alice) watches another computer
(Bozo); if Bozo dies, Alice takes over Bozo's work.

12
Failover problems
  • If Alice tries to take over control at the same
    time as Bozo comes back up again, we will have two
    computers struggling for control at the same time.
    This can cause a lot of problems.

13
Avoiding planned outages
  • If we want to upgrade Bozo we can do the
    following
  • Do a controlled (forced) failover to Alice
  • Upgrade Bozo while Alice is taking care of
    business
  • Do a failback to Bozo
  • Alice can now also be upgraded
  • Consequently, one of the advantages with
    clusters is that we do not have to take the
    system down during upgrades and maintenance.
  • Problems may, however, occur when
  • the upgrade includes a change of data format on
    disk, or when
  • the software runs in parallel across the cluster
    nodes

14
Moving resources when failing over
  • When an application is moved from one node to
    another the resources that it needs must also be
    moved, e.g. files and IP-addresses.
  • Early high-availability cluster systems left this
    problem to the user, i.e. the user had to write a
    number of shell scripts that were executed during
    a failover.
  • One way to help the user is to define the
    dependencies between different applications and
    resources. The user then only has to define where
    a certain application should go, and the cluster
    software will move the necessary resources along
    with the application (sketched below).
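
  • A hypothetical sketch of that dependency idea in
    Python (the resource names and the single target node
    are made up; real cluster software does much more):

    # Each resource lists the resources it depends on.
    DEPENDS_ON = {
        "web_app": ["shared_disk", "service_ip"],
        "service_ip": [],
        "shared_disk": [],
    }

    def bring_online(resource, started=None):
        """Start a resource after its dependencies (no cycle handling)."""
        if started is None:
            started = set()
        for dep in DEPENDS_ON.get(resource, []):
            bring_online(dep, started)
        if resource not in started:
            print("bringing", resource, "online on Alice")
            started.add(resource)
        return started

    # The user only says "move web_app"; its resources follow along.
    bring_online("web_app")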

15
Potential problems when moving resources
  • Resources may depend on individual cluster nodes,
    e.g. a certain disk may only be accessible on a
    certain node.
  • The procedure for bringing resources on-line may
    depend on the node, e.g. a printer queue may
    already be defined on some nodes, and redefining
    it may cause problems.
  • The information about the resource dependencies
    must be available and consistent throughout the
    cluster nodes, even when the node responsible for
    updating this information crashes.

16
Moving data - replication vs. switchover
  • Moving data from Bozo to Alice can be done in two
    ways
  • Replication (separate disks/shared nothing, see
    Figure 108)
  • Bozo and Alice have their own separate disks, and
    the changes made on Bozo are continuously sent to
    Alice.
  • As an alternative, the changes in Bozo could be
    sent in batches at certain time intervals.
  • Switchover (shared disk, see Figure 109)
  • A disk (or other storage device) is connected to
    both Bozo and Alice, and when Bozo crashes, Alice
    takes control over the disk.
  • Switchover is often preferred in high
    availability systems

17
Replication vs. switchover
  • Replication advantages
  • It is easier to add a new node when using
    replication.
  • It can be difficult to synchronize the disks in
    switchover configurations, e.g. the two systems
    must agree on disk partitions, volume names etc.
  • In switchover the disks are in one place. This
    limits the distance between the nodes, and it can
    also be a problem if the room with the disks is
    flooded or hit by some other disaster.
  • Replication can use simpler storage units
    because
  • The disks do not need to support dual access
  • The disks themselves are not a single point of
    failure

18
Replication vs. switchover
  • Switchover advantages
  • Easier to backup the disk
  • Less disk space is required
  • Less overhead, i.e. when using replication Bozo
    must send copies of the changes to Alice, and
    Alice must write these updates to its local
    disks. This uses CPU and I/O capacity.
  • If Bozo waits for Alice to signal that each
    update has been recorded correctly, the
    performance will be degraded. If Bozo does not
    wait, data may be lost when a failure occurs.
  • Failback is easier.
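
  • The wait-or-not trade-off in the bullet above can be
    sketched like this (an in-memory queue stands in for
    the network link from Bozo to Alice; illustration
    only):

    import queue, threading, time

    link = queue.Queue()   # the "network" link Bozo -> Alice
    alice_disk = []        # Alice's replica of the data

    def alice_side():
        """Alice applies each update and acknowledges it."""
        while True:
            update, done = link.get()
            alice_disk.append(update)
            done.set()

    threading.Thread(target=alice_side, daemon=True).start()

    def write_sync(update):
        """Bozo waits for Alice's ack: nothing is lost, but every
        write pays the round-trip delay."""
        done = threading.Event()
        link.put((update, done))
        done.wait()

    def write_async(update):
        """Bozo returns at once: faster, but the update is lost if
        Bozo crashes before it reaches Alice."""
        link.put((update, threading.Event()))

    write_sync("row 1 = 42")
    write_async("row 2 = 43")
    time.sleep(0.1)        # give the async update time to arrive
    print(alice_disk)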

19
Avoiding corrupt data - transactions
  • When Bozo crashes, it might corrupt data or leave
    it in an inconsistent state.
  • Transactions are used for avoiding this problem
  • Transactions are usually implemented by having a
    log file on stable storage (e.g. mirrored disk)
  • No matter what happens (assuming the stable
    storage stays stable) a consistent state of the
    data can be recreated from the log file.
  • In replicated systems, transactions are
    implemented by a technique called two-phase
    commit.
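
  • A minimal write-ahead-log sketch of the log-file idea
    (a plain file stands in for stable storage, and an
    update only counts once its transaction's COMMIT
    record is on disk; this is an illustration, not
    two-phase commit):

    import os

    LOG = "txn.log"   # the log on stable (e.g. mirrored) storage

    def run_transaction(txn_id, updates):
        """Write all updates plus a COMMIT record to the log."""
        with open(LOG, "a") as log:
            for key, value in updates.items():
                log.write("%s SET %s %s\n" % (txn_id, key, value))
            log.write("%s COMMIT\n" % txn_id)
            log.flush()
            os.fsync(log.fileno())   # force the records onto disk

    def recover():
        """Recreate a consistent state: replay, in log order, only
        the updates of transactions that committed."""
        if not os.path.exists(LOG):
            return {}
        with open(LOG) as log:
            records = [line.split() for line in log]
        committed = {r[0] for r in records if r[1] == "COMMIT"}
        state = {}
        for r in records:
            if r[1] == "SET" and r[0] in committed:
                state[r[2]] = r[3]
        return state

    run_transaction("t1", {"balance": "100"})
    print(recover())   # {'balance': '100'}, even after a crash at this point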

20
Failing over communication
  • When Alice takes over the job from Bozo, the
    communication from the client is redirected using
    IP takeover
  • IP takeover is obtained by resetting one (or
    more) of the communication adapters on Alice to
    respond to the IP address(es) that Bozo was
    using.
  • Since most communication protocols have routines
    for retransmission after a time-out limit, the
    client computers never notice the difference.
    However, the people at the client computers
    probably have to log in again, i.e. their
    sessions are usually aborted at failover.
  • An alternative way of failing over communication
    is that each client knows the IP addresses of
    multiple servers: the primary server, the
    secondary server, and so on. If the primary
    server does not respond, the client tries to
    contact the secondary server, and so on
    (sketched below).
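
  • A rough sketch of that client-side alternative (the
    host names are hypothetical; a real client would also
    retransmit within an open connection):

    import socket

    # Server addresses known to the client, in priority order.
    SERVERS = [("bozo.example.com", 5000), ("alice.example.com", 5000)]

    def send_request(payload, timeout=2.0):
        """Try the primary first; on failure, fall through to the
        next server in the list."""
        for host, port in SERVERS:
            try:
                with socket.create_connection((host, port), timeout=timeout) as sock:
                    sock.sendall(payload)
                    return sock.recv(4096)
            except OSError:
                continue   # unreachable or timed out, try the next one
        raise RuntimeError("no server responded")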

21
Time for doing a failover
  • The time for reaching a fully operational state
    after a failover can be substantial. In best case
    scenarios the time can be as low as tens of
    seconds.
  • The failover times can be reduced by having pairs
    of processes
  • There is one process on Alice for each process on
    Bozo.
  • Every time the process on Bozo changes its state,
    that change is reflected in the process on Alice.
  • Tandem has claimed that by using this technique,
    sub-second failover is achievable.
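
  • A toy sketch of the process-pair idea (everything runs
    in one Python process here, just to show the data
    flow; the pair would really run on two nodes):

    class Backup:
        """Shadow process on Alice: holds an up-to-date copy of the state."""
        def __init__(self):
            self.state = {}
        def mirror(self, key, value):
            self.state[key] = value
        def take_over(self):
            # Nothing has to be rebuilt, so failover is almost immediate.
            return self.state

    class Primary:
        """Process on Bozo: every state change is also sent to its pair."""
        def __init__(self, pair):
            self.state, self.pair = {}, pair
        def update(self, key, value):
            self.state[key] = value
            self.pair.mirror(key, value)   # reflect the change on Alice

    backup = Backup()
    primary = Primary(backup)
    primary.update("open_orders", 17)
    print(backup.take_over())   # {'open_orders': 17}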

22
Failover to where?
  • This question becomes interesting when there are
    more than two nodes in the cluster
  • Simple add-on high-availability systems often use
    static schemes, e.g. if Bozo dies, put jobs A and
    B on Alice and the rest on Clara.
  • Sophisticated cluster systems provide mechanisms
    for automatic load balancing (possibly also
    considering some user defined priorities).
  • Dynamic load balancing is easier in shared-disk
    clusters than in shared-nothing clusters. In
    shared-nothing clusters replication is used, and
    this makes the backup order more static.

23
Global locks
  • In a shared-disk system, one must handle the
    problem of system-wide locks when a node crashes
  • The processes on the node that crashed were
    probably holding resources that processes on
    other nodes will have to use. If the locks are
    not released, the entire system will lock up.
  • There are two ways of handling this problem
  • Letting each application keep track of the locks
    that it was holding
  • Letting a global lock manager keep track of the
    locks that the applications on the crashed node
    were holding (sketched below).
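
  • A simplified sketch of the second option: a global
    lock manager that records which node holds each lock,
    so a crashed node's locks can be released in one sweep
    (single-process illustration only):

    class GlobalLockManager:
        """Tracks which cluster node holds each lock."""
        def __init__(self):
            self.holders = {}          # lock name -> node name

        def acquire(self, lock, node):
            if lock in self.holders:
                raise RuntimeError(lock + " is already held")
            self.holders[lock] = node

        def release_all(self, node):
            """Called when a node is declared dead, so that the
            survivors are not blocked forever."""
            for lock in [l for l, n in self.holders.items() if n == node]:
                del self.holders[lock]

    glm = GlobalLockManager()
    glm.acquire("customer_table", "Bozo")
    glm.acquire("order_table", "Alice")
    glm.release_all("Bozo")    # Bozo crashed: its locks are freed
    print(glm.holders)         # {'order_table': 'Alice'}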

24
Heartbeats
  • Heartbeat messages are used for detecting when a
    node is dead.
  • Each node sends short messages to the other
    nodes, telling them that the node is alive
  • If a heartbeat message does not arrive within a
    time-out period, the node is declared dead.
  • One problem with this approach is that the
    message could be delayed for various reasons, and
    in that case a node which is declared dead may be
    OK. This can cause a lot of problems.
  • Another problem with this approach is that the
    node may be OK, but the communication link for
    the heartbeat is not OK. This could also lead to
    the dangerous conclusion that an OK node is dead.
  • In order to improve the reliability of the
    heartbeat method the cluster might send heartbeat
    signals on a number of different channels, e.g.
    normal LAN, RS232 serial links, I/O links etc.
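
  • A bare-bones sketch of the detection logic (single
    process, one fixed time-out; as the last bullet says,
    a real cluster would combine several channels):

    import time

    TIMEOUT = 3.0                  # seconds of silence before declaring a node dead
    last_seen = {"Bozo": time.time(), "Clara": time.time()}

    def on_heartbeat(node):
        """Called whenever a heartbeat message arrives from a node."""
        last_seen[node] = time.time()

    def suspected_dead():
        """Nodes whose heartbeats have timed out (possibly wrongly, if
        the message was only delayed or the link is down)."""
        now = time.time()
        return [n for n, t in last_seen.items() if now - t > TIMEOUT]

    on_heartbeat("Clara")
    last_seen["Bozo"] -= 10        # simulate Bozo falling silent
    print(suspected_dead())        # ['Bozo']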

25
Actions when Bozo is declared dead
  • Establish a new heartbeat chain that excludes
    Bozo
  • Inform parallel subsystems that were running on
    Bozo, such as databases, of what has occurred and
    what is about to happen
  • Fence Bozo off from its resources (e.g. disks)
  • Form a cluster-wide, consistent plan defining how
    Bozo's resources should be redistributed
  • Execute the plan, i.e. move the resources etc.
  • Inform the subsystems that the resource
    reallocation has been completed
  • Resume normal operation

26
Alternatives to heartbeats
  • Instead of heartbeats, one can use the opposite
    approach: a liveness check.
  • This means that Alice will at certain points ask
    Bozo if he is OK.
  • A liveness check suffers from the same kind of
    problems as heartbeats, i.e. it is hard to
    guarantee a response within certain limits.
  • If a cluster node has reasons to believe that the
    rest of the system thinks that the node is dead,
    the node had better commit suicide. This could
    happen when a node detects that its heartbeat
    signals have been delayed beyond the time-out
    limit.

27
IBM RS/6000 Cluster Technology (Phoenix)
  • The purpose of Phoenix is to help the developer
    to build cluster-parallel applications that are
    highly available, i.e. Phoenix is a development
    tool and does not do anything by itself.
  • The product is highly scalable: it is designed for
    512 nodes and has been run on clusters with more
    than 400 nodes.
  • There are three core services in Phoenix (see
    Figure 111)
  • Topology Services: This service has no direct
    interface to the application. It manages
    heartbeats and maintains a dynamic map of the
    state of the other cluster nodes.
  • Group Services: The key interface that helps the
    application deal with high-availability issues
    when some event happens.
  • Event Manager: This service provides a way to
    inform a program running anywhere in the cluster
    when something interesting happens.

28
Microsoft's Clustering Services (MSCS)
  • MSCS currently supports only two-node clusters;
    later versions will, however, support a larger
    number of nodes.
  • MSCS is, unlike Phoenix, a self-contained
    high-availability cluster product
  • A key component in MSCS is the quorum resource,
    which is usually a disk. The purpose of the
    quorum resource is to make sure that only one of
    the two nodes thinks that it is in charge of the
    cluster.
  • Each node has access to a dynamic, but
    cluster-wide consistent, configuration database.

29
Scaling
  • The more nodes there are in a cluster, the less
    you pay for high availability, e.g.
  • The additional cost of handling a node failure in
    a one-node system is 100 percent, i.e. we need
    two computers instead of one.
  • The additional cost of handling a node failure in
    a four-node system is 25 percent, i.e. we need
    five computers instead of four.
  • One implication of this is that it is desirable
    to use computers that cannot individually fulfill
    the job requirements.

30
Disaster Recovery
  • Disasters differ from ordinary failures in that
    they are distributed over an area, e.g. flooding
    of a room, earthquakes etc.
  • Shared disk switchover solutions will not work
    for disasters.
  • Some crude and simple solutions are often used
  • Sending away a backup tape to a remote location
    at certain intervals
  • Sending away a backup electronically to a remote
    location at certain intervals
  • The key difference between disaster recovery and
    normal clustering is the distance between the
    nodes. This causes delays which can strongly
    affect performance.

31
SMP and CC-NUMA Availability
  • If one processor node in an SMP or a CC-NUMA
    multiprocessor crashes, the entire system will
    crash.
  • There are a number of reasons for this, e.g.
  • The caches on the processor nodes may contain the
    only valid copy of a certain variable.
  • The data structures in the operating system are
    shared between the processors, and if a processor
    crashes it may corrupt the shared data.