Dependability Considerations in Distributed Control Systems

Provided by: miri78

1
Dependability Considerations in Distributed
Control Systems
  • Klemen Žagar, Cosylab

2
Dependability
  • A dependable system is one that its users can
    trust.
  • Examples of dependable distributed systems:
      • The Internet
      • Power distribution grid
      • Water supply
  • Dependability is a very general term. Among
    other things, it covers:
      • Availability: it is there when needed.
      • Reliability: it can work autonomously for a
        long period of time.
      • Maintainability: it is easily fixed when broken.
      • Safety: it will not harm other equipment or
        personnel.
      • Security: unauthorized, possibly malicious,
        users cannot gain control.

3
Motivation
  • Nodes of a distributed system are like dominoes
      • The domino effect: one falls, all may go down
      • May happen often, and takes a long time to
        rebuild
  • Thus, fault tolerance is important
      • Improved mean time to failure of the system as
        a whole
      • Lower mean time to repair
      • Improved availability
      • Reduced maintenance effort
  • Fault tolerance in distributed control systems?

4
Research Objectives
  • Dependable Distributed Systems (DeDiSys) research
    project with the European Union.
  • What are the most frequent causes of faults in
    distributed control systems?
  • What mitigation mechanisms are available?
  • How to improve availability by trading it against
    constraint consistency?
  • What is constraint consistency in control systems?

5
Reliability
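  • Reliability, $R(t)$, is the probability that a
    system will perform as specified for a given
    period of time $t$.
  • Typically exponential: $R(t) = e^{-\lambda t}$,
    where $\lambda$ is the constant failure rate.
  • An alternative measure is the mean time to failure
    (MTTF/MTBF); in the exponential case,
    $\mathrm{MTTF} = 1/\lambda$.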
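
The exponential model mentioned above can be sketched in a few lines of Python. This is a minimal illustration that assumes a constant failure rate equal to 1/MTTF; the function name and the 10,000-hour MTTF are made up for the example and do not come from the slides.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t_hours without failure.

    Assumes a constant failure rate (exponential model), so the
    failure rate is simply the reciprocal of the MTTF.
    """
    failure_rate = 1.0 / mttf_hours  # failures per hour
    return math.exp(-failure_rate * t_hours)


# Hypothetical node with an MTTF of 10,000 hours, observed for one year.
one_year_hours = 24 * 365
r_year = reliability(one_year_hours, 10_000)
print(f"R(1 year) = {r_year:.3f}")
```

Under this model the chance of surviving exactly one MTTF is only about 37%, which is why MTTF alone can give an overly optimistic impression.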

6
Reliability of Composed Systems
  • Weakest link: the reliability of a coupled composed
    system is less than the reliability of its least
    reliable constituent.
  • Redundancy: the reliability of a redundant subsystem
    is greater than the reliability of its most
    reliable constituent.
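
Both statements can be checked numerically. The sketch below assumes that constituents fail independently, which the slide does not state explicitly; the function names and the three example reliabilities are illustrative only.

```python
from math import prod


def coupled_reliability(parts):
    """All constituents must work: multiply the individual reliabilities."""
    return prod(parts)


def redundant_reliability(parts):
    """At least one constituent must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in parts)


# Hypothetical constituent reliabilities over the same time period.
parts = [0.99, 0.95, 0.90]

coupled = coupled_reliability(parts)
redundant = redundant_reliability(parts)

print(f"coupled:   {coupled:.5f} (least reliable part: {min(parts)})")
print(f"redundant: {redundant:.5f} (most reliable part: {max(parts)})")
```

With independent failures, the coupled system is worse than its weakest part and the redundant one is better than its strongest part, as the slide claims.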

7
Maintainability and Availability
  • Maintainability: how long it takes to repair a
    system after a failure.
      • The measure is the mean time to repair (MTTR).
  • Availability: percentage of time the system is
    actually available during periods when it should
    be available.
      • Directly experienced by users!
      • Expressed in percent. In marketing, also as a
        number of nines (e.g., 99.999% availability
        means unavailable for about 5 min/year).
  • Example: a gas station (working hours 6 AM to
    10 PM = 16 hours)
      • Ran out of gas at 10 AM (2 h)
      • Pump malfunction at 2 PM (2 h)
      • Availability: 12 h / 16 h = 75%
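
The gas station arithmetic can be written out as a short Python sketch. The second function uses the common steady-state relation availability = MTTF / (MTTF + MTTR), which is a standard textbook formula rather than something stated on the slide; all function names and the 990 h / 10 h figures are illustrative.

```python
def availability_from_outages(scheduled_hours, outage_hours):
    """Fraction of scheduled time during which the service was usable."""
    downtime = sum(outage_hours)
    return (scheduled_hours - downtime) / scheduled_hours


def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Gas station: 16 scheduled hours, two outages of 2 hours each.
gas_station = availability_from_outages(16, [2, 2])
print(f"gas station availability: {gas_station:.0%}")

# Hypothetical system that runs 990 h between failures and needs 10 h to repair.
example = steady_state_availability(990, 10)
print(f"steady-state availability: {example:.0%}")
```

The second formula shows why both levers matter: availability improves either by failing less often (higher MTTF) or by recovering faster (lower MTTR).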

8
Research Methodology
  • Research in the context of the DeDiSys project
  • Collection of requirements from
      • DeDiSys project's interest group members
      • Cosylab's customers (e.g., ANKA, SLS, ...)
  • Identification of scenarios
      • ALMA Common Software (ACS)
      • EPICS
      • Geographical Information Systems
  • Definition of the architecture for a
    fault-tolerant naming service (FTNS)

9
Faults in Distributed Systems
  • Node failures
      • A host crashes or a process dies
      • Volatile state is lost
  • Link failures
      • A network link is broken
      • Results in two or more partitions
      • Difficult to distinguish from a host crash
  • Consequences
      • Affected services are lost
      • Dependent systems malfunction
      • User interface doesn't show actual status

10
Improving Hardware MTTF
  • Reduce the number of mechanical parts
      • Solid-state storage instead of hard disks
      • Passive cooling of power supplies and CPUs (no
        fans)
  • High-quality or redundant power supplies
  • Replication
      • Network links
      • CPU boards
  • Remote reset (e.g., via power cycling)

11
Improving Software MTTF
  • Ensure that overflows of variables that
    constantly increase (handle IDs, timers,
    counters, ...) are properly handled.
  • Ensure all resources are properly released when
    no longer needed (memory leaks, ...)
      • Use a managed platform (Java, .NET)
      • Use auto-pointers (C++)
  • Avoid using heap storage on a per-transaction
    basis (may result in memory fragmentation); e.g.,
    use free-lists
  • Restart a process in a controllable fashion
    (rejuvenation)
  • Isolate processes through inter-process
    communication
  • Recovery
      • Recover state after a crash
      • Effective for host and process crashes
      • Automated repair
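
As an illustration of the first point, here is a minimal Python sketch of a timeout check that stays correct when a fixed-width tick counter wraps around. Python integers do not overflow, so the 32-bit counter is simulated with a mask; the names and the tick values are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF  # simulates an unsigned 32-bit tick counter


def ticks_elapsed(start, now):
    """Ticks from start to now, correct across a single wrap-around."""
    return (now - start) & MASK32


def has_expired(start, now, timeout):
    """Wrap-safe check: compare elapsed ticks, never absolute deadlines."""
    return ticks_elapsed(start, now) >= timeout


start = 0xFFFFFFF0            # 16 ticks before the counter wraps
now = (start + 100) & MASK32  # 100 ticks later, after the wrap

# Naive check on absolute values: the wrapped counter looks "earlier"
# than the deadline, so the timeout is never reported.
naive_expired = now >= start + 50

print("naive check:    ", naive_expired)
print("wrap-safe check:", has_expired(start, now, 50))
```

The design choice is to compare differences modulo the counter width instead of absolute values, which works as long as the measured interval is shorter than one full counter period.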

12
Decreasing MTTR
  • Foresee failures during design
      • "The major difference between a thing that might
        go wrong and a thing that cannot possibly go
        wrong is that when a thing that cannot possibly
        go wrong goes wrong it usually turns out to be
        impossible to get at or repair."
        (Douglas Adams, Mostly Harmless)
  • Provide good diagnostics
      • Alarms
      • Detailed description of where and when an error
        occurred
      • Logs
      • State dump at failures
          • ADC buffers after a beam dump
          • Status of synchronization primitives
          • Memory dump
  • Automated fail-over
      • In combination with redundancy
      • Passive replica must have up-to-date state of the
        primary copy
      • Fault detection (network ping, analog signal, ...)
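
Fault detection for automated fail-over is often built on heartbeats with a timeout. The Python sketch below is one possible shape of such a detector, not the mechanism used in any particular control system; the class name, the 3-second timeout, and the "primary"/"backup" labels are hypothetical. Time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Suspects a peer once no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_seen = now

    def heartbeat(self, now):
        """Record that the peer was heard from at time `now`."""
        self.last_seen = now

    def is_suspected(self, now):
        return now - self.last_seen > self.timeout_s


def choose_active(monitor, now):
    """Return which replica should serve requests at time `now`."""
    return "backup" if monitor.is_suspected(now) else "primary"


monitor = HeartbeatMonitor(timeout_s=3.0, now=0.0)
monitor.heartbeat(now=1.0)

print(choose_active(monitor, now=2.0))  # last heartbeat 1.0 s ago
print(choose_active(monitor, now=5.5))  # 4.5 s of silence
```

A timeout only reveals silence. It cannot tell a crashed primary from a broken network link, so a real fail-over scheme also has to guard against both replicas believing they are active.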

13
Consistency/Availability Trade-Off
  • Favoring consistency
      • Finance
      • Banking
      • Access control
      • Corporate databases
  • Favoring availability
      • Control systems
      • Air-traffic control
      • Fly-by-wire
      • Drive-by-wire

14
Constraint Consistency in Control Systems
  • Constraints: rules that one or more objects must
    satisfy, for example
      • serverChannel.monitors.contains(client) if and
        only if client.isSubscribedTo(serverChannel)
      • serverChannel.value = clientChannel.value
      • server.getFromDatabase(x) = database.get(x)
      • If client.referencesComponent(component) then
        component.isReferencedBy(client)
  • Can some constraints be temporarily relaxed in the
    presence of faults?
  • If so, how to reconcile the system into a
    consistent state when faults are removed?
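
The first constraint can be made concrete with a small Python sketch that checks it, lets it be violated by a simulated fault, and then reconciles. Everything here is hypothetical: the classes, the channel and client names, and especially the reconciliation policy (the client's view wins), which is only one of several possible choices and is not taken from the slides.

```python
class ServerChannel:
    """Hypothetical server-side channel that tracks its monitoring clients."""

    def __init__(self, name):
        self.name = name
        self.monitors = set()


class Client:
    """Hypothetical client that tracks the channels it believes it monitors."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = set()

    def is_subscribed_to(self, channel):
        return channel in self.subscriptions


def subscribe(client, channel):
    """Update both sides so that the constraint holds."""
    channel.monitors.add(client)
    client.subscriptions.add(channel)


def is_consistent(channel, client):
    """Both sides must agree on whether the client monitors the channel."""
    return (client in channel.monitors) == client.is_subscribed_to(channel)


def reconcile(channel, client):
    """Assumed policy: the client's view wins and the server is corrected."""
    if client.is_subscribed_to(channel):
        channel.monitors.add(client)
    else:
        channel.monitors.discard(client)


channel = ServerChannel("beam_current")
client = Client("operator_panel")
subscribe(client, channel)
ok_before_fault = is_consistent(channel, client)

channel.monitors.clear()  # simulated fault: server restarts, volatile state lost
ok_during_fault = is_consistent(channel, client)

reconcile(channel, client)
ok_after_repair = is_consistent(channel, client)

print(ok_before_fault, ok_during_fault, ok_after_repair)
```

During the fault the constraint is temporarily violated while the client keeps running, which is the kind of relaxation the slide asks about; reconciliation restores it afterwards.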

15
Future Work
  • DeDiSys
      • Design and implementation (due January 2007)
      • Validation (due June 2007)
  • Possible inclusion of research findings in
    control system infrastructures
      • ACS (e.g., replication of the manager and
        components)
      • EPICS (e.g., V4 fault-tolerance efforts of the
        EPICS community)
  • Inclusion in products
      • The microIOC platform
      • Servers for Geographical Information Systems
      • Other high-availability products
        (telecommunications, automotive)
  • Know-how for consulting and development services

16
Conclusion
  • Distributed systems are inherently fragile
  • Fault tolerance is difficult to program
      • Should be addressed by infrastructure/middleware,
        but frequently isn't
  • Comments/questions/contributions:
    klemen.zagar_at_cosylab.com