Title: Ardea: A Reconfigurable Architecture for Fault Tolerant Distributed Embedded Systems
1 Ardea: A Reconfigurable Architecture for Fault Tolerant Distributed Embedded Systems
- Osamah A. Rawashdeh
- Ph.D. Defense
- Department of Electrical and Computer Engineering
- University of Kentucky
- Lexington, KY
- November 29, 2005
2 Outline
- Motivation and Background
- Objective and Contributions
- Ardea Framework Overview
- Ardea Hardware Architecture
- Software Module Dependency Graphs
- Ardea Fault Tolerance
- Runtime Behavior
- Related Work
- Example Implementation
- Future Work
- Conclusion
3 Motivation
- The majority of processors are in embedded systems
- Systems are becoming more complex and perform more critical operations
- Correct operation must be ensured for the safety of the public and the environment
- Faults will always occur => systems must be designed to be dependable
4 Dependable Systems
- Dependability: trustworthiness of a system, allowing reliance to be justifiably placed on its services
- Failures: deviation of the service provided from compliance with specifications
- Faults: the cause of failures
- Failure semantics: omission, timing, response, and crash
- Hardware versus software faults
- Fault tolerance: ability to continue operation despite failures
Figure 1 - Page 6
5 Traditional Fault Tolerance
- Fault tolerance entails fault detection and subsequent handling
- Fault tolerance requires redundancy
- Static redundancy (spatial redundancy)
- Modular redundancy
- Design Diversity
- Dynamic redundancy (temporal redundancy)
- Recovery blocks
- Failover programming
6 BIG BLUE Fault Tolerance
- BIG BLUE I (May 2003)
- Single processor design
- Static sensor redundancies
- Ad hoc fault tolerance
- BIG BLUE II (May 2004)
- Distributed 3 processor design
- I2C communication bus
- Shared data memory for communication
- Static redundancies
- BIG BLUE III (May 2005)
- Single processor design
- Real-time multi-tasking OS used
- Task interdependencies limited by using a mailman
7 Reconfiguration-Based FT
- Run-time reconfiguration FT is feasible in distributed embedded systems
- Cost, size, power constraints
- Availability of non-critical resources
- Graceful degradation: a loss of or reduction in the quality of services a system provides in response to faults
- Graceful degradation for distributed embedded systems is a new research area
8 The Challenge
- How to specify a dynamically reconfiguring system that includes static and dynamic redundancies
- How to manage the redundancies
- What infrastructure is needed to run these dynamic applications
9 Objective and Contributions
- Ardea is a systematic framework for designing and implementing real-time, dynamically reconfiguring, fault-tolerant distributed embedded systems
- Contributions
- A graphical software specification technique for real-time applications that supports static redundancies as well as reconfiguration fault tolerance
- An infrastructure supporting run-time application reconfiguration
10 The Ardea Framework
- Ardea: Automatically Reconfigurable Distributed Embedded Architectures
- Ardea herodias: the Great Blue Heron, a wading bird of the heron family Ardeidae, common all over North and Central America. It is the largest North American heron.
11 Ardea Overview
- Software is developed in a modular fashion
- Mobile software modules can have several implementations with different resource requirements and output qualities
- Dependencies among modules are graphically captured in software module dependency graphs (DGs) specifying application operating modes and execution parameters
- A set of networked processors for running application software
12 Ardea Overview (cont.)
- A global system manager tracks the status of hardware and software resources
- The system manager computes new system configurations (a mapping of software modules onto processing elements)
- Local management tasks are responsible for OS scheduling and data routing
- Target applications: real-time distributed embedded control/periodic applications
14 HW Architecture Overview
- Processing Elements (PEs)
- - Homogeneous set of processors
- - Real-time OS
- - Local management tasks (scheduler, network interface, loader)
- I/O Devices
- - Sensors and actuators
- - Hosted by PEs
- Communication Network
- - Broadcast Network
- - Bandwidth and Latency
- System Manager
- - Fault tolerant by other means
- - Tracks status of resources
- - Finds and deploys configurations
Figure 14 - Page 34
15 Application Software Specification
- Dependency graphs show the periodic flow of information from sensors to actuators (i.e., data pipelines)
- Graph nodes: software modules, data variables, I/O devices, and dependency gates
- Software modules
- Executable code schedulable on a processing element
- Suspended while input(s) unavailable
- Produce and consume data variables
- Attributes: worst-case execution time and rate factor
Figure 3 - Page 18
16 Data Exchange
- Data variables
- Application data passed between software modules
- State data variables are local to a software module
- Management data variables contain data consumed by the system manager
- Attributes
- Size
- Quality value or function
Figure 5 - Page 19
17 Specifying Dependencies
- Dependency gates
- k-out-of-n OR gates: n > 0, k between 0 and n
- AND: all inputs required
- XOR: only one input required
- DEMUX: for fanning out
- OR gates can be specified to distribute inputs
Figure 5 - Page 20
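The gate semantics listed on this slide can be sketched in a few lines of Python. This is only an illustrative reading of the slide, not code from the Ardea implementation: the `Gate` class and both function names are hypothetical, the bounds on `k` are assumed to be inclusive, and XOR is assumed to mean "forward exactly one input even when several are available".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gate:
    kind: str   # "AND", "OR", or "XOR"
    k: int = 1  # threshold, only used by k-out-of-n OR gates


def gate_satisfied(gate: Gate, inputs_available: List[bool]) -> bool:
    """Return True if enough inputs are available for the gate to produce output."""
    n = len(inputs_available)
    have = sum(inputs_available)
    if gate.kind == "AND":
        return n > 0 and have == n          # all inputs required
    if gate.kind == "OR":
        if not 0 <= gate.k <= n:            # assumption: inclusive bounds on k
            raise ValueError("k must lie between 0 and n")
        return have >= gate.k               # at least k of the n inputs
    if gate.kind == "XOR":
        return have >= 1                    # one input is enough
    raise ValueError(f"unknown gate kind: {gate.kind}")


def select_inputs(gate: Gate, inputs_available: List[bool]) -> List[int]:
    """Return the indices of the inputs the gate would forward."""
    if not gate_satisfied(gate, inputs_available):
        return []
    available = [i for i, ok in enumerate(inputs_available) if ok]
    # Assumption: XOR forwards only the first available input.
    return available[:1] if gate.kind == "XOR" else available
```

For example, `gate_satisfied(Gate("OR", k=2), [False, False, True])` is false because only one of three inputs is present, while the same inputs satisfy a 1-out-of-3 gate.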
18 DG with Node Attributes
- ID: yaw_cntrl1; Exec_T: 900 cycles; Rate_factor: 15
- ID: out1, out2; Criticality: critical; Priority: 1; Rate: 10 Hz; State: Enabled
- ID: rud_Angle1; Size: 2 bytes; Quality: 1
- ID: mag_drv1, mag_drv2; Exec_T: 300 cycles; Rate_factor: n/a
- ID: yaw_history; Size: 8 bytes; Quality: n/a
- ID: servo1_drv, servo2_drv; Exec_T: 200 cycles; Rate_factor: 11
- ID: yaw1, yaw2; Size: 2 bytes; Quality: 1, 2
- ID: rud_Angle2; Size: 2 bytes; Quality: 2
- ID: yaw_cntrl2; Exec_T: 400 cycles; Rate_factor: 12
19 Ardea Fault Detection and Handling
- Failure detection of sensors, actuators, and software modules is the responsibility of application software
- Ardea built-in fault detection
- PE crash failures detected by heartbeat messages
- Network link failures detected and handled as PE failures
- Software module crashes detected locally by module execution monitors
- Critical output modules detect missed deadlines
- Fault handling: masking, reconfiguration, or fail-stop
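Heartbeat-based detection of PE crash failures can be sketched as follows. This is a minimal illustration rather than the Ardea system manager: the class name, the timeout parameter, and the explicit `now` timestamps (used instead of a real clock so the behavior is deterministic) are all assumptions.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat seen from each PE."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                  # seconds of silence tolerated
        self.last_seen: Dict[str, float] = {}   # PE id -> time of last heartbeat

    def beat(self, pe_id: str, now: float) -> None:
        """Record a heartbeat message received from pe_id at time now."""
        self.last_seen[pe_id] = now

    def failed_pes(self, now: float) -> List[str]:
        """Return the PEs whose last heartbeat is older than the timeout."""
        return sorted(
            pe for pe, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Because a broken network link also silences heartbeats, a monitor like this cannot tell a link failure from a PE crash, which is consistent with the slide's statement that link failures are handled as PE failures.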
20 Sensor Fault Detection
Figure 22 - Page 52
Figure 21 - Page 51
21 Actuator Fault Detection
Figure 23 - Page 53
22 Software Fault Detection
Figure 25 - Page 55
23 Triple Modular Redundancy
Figure 20 - Page 50
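The voting step at the heart of triple modular redundancy can be illustrated with a small majority voter. This is a generic sketch, not taken from the figure; the function name is hypothetical and exact-match voting on integer values is assumed (real sensor data would normally need a tolerance band).

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(values: Sequence[int]) -> Optional[int]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    # A strict majority is required; a three-way disagreement yields None.
    return value if count > len(values) // 2 else None
```

With three replicas, a single faulty output is masked (`majority_vote([10, 10, 99])` returns `10`), while total disagreement is reported as `None` so that a higher-level handler can react.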
24 Ardea Runtime Behavior
- Supporting mobile software modules (moving object code, scheduling/unscheduling, and data re-routing)
- Tracking resource availability
- Finding configurations
- Deploying resources
- Managing state data variables
25 The System Manager
Figure 18 - Page 45
26 Processing Elements (PEs)
- Memory Loader: copies code into program memory
- Scheduler: starts and stops execution of modules
- Network Interface: handles public data variables (data routing)
Figure 17 - Page 42
27 Mobility and Data Routing
- Module I/O data passed through mailboxes
- Data routing transparent to modules
- Starting, stopping of modules
Figure 26 - Page 61
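One way to picture "data routing transparent to modules" is a per-PE network interface that owns a routing table: a module always publishes the same way, and only the table changes when a consumer moves to another PE. The sketch below is an assumption-laden illustration; the class, its methods, and the `outbox` list standing in for the bus are hypothetical and not the Ardea API.

```python
from typing import Any, Dict, List, Tuple


class NetworkInterface:
    """Hypothetical per-PE router between local mailboxes and the network."""

    def __init__(self, pe_id: str) -> None:
        self.pe_id = pe_id
        self.mailboxes: Dict[str, Any] = {}             # local mailbox per variable
        self.routes: Dict[str, List[str]] = {}          # variable -> consumer PE ids
        self.outbox: List[Tuple[str, str, Any]] = []    # (dest PE, variable, value)

    def set_route(self, var: str, consumer_pes: List[str]) -> None:
        """Called on reconfiguration to redirect a data variable."""
        self.routes[var] = list(consumer_pes)

    def publish(self, var: str, value: Any) -> None:
        """Deliver locally or queue for the bus; the caller cannot tell which."""
        for dest in self.routes.get(var, []):
            if dest == self.pe_id:
                self.mailboxes[var] = value
            else:
                self.outbox.append((dest, var, value))
```

A producing module would call `publish("yaw1", value)` identically before and after its consumer migrates; only `set_route` is invoked in between.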
28 Reconfiguration Policies
- Two configuration-finding algorithms
- High-fidelity (NP-hard) to find high-utility configurations
- Low-fidelity (fast) to ensure running of critical services
- Response based on criticality of detected/reported fault
- Deploying configurations starting from sensor side of a DG
See Figure 31 - Page 75
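To make the idea of a fast, low-fidelity search concrete, the sketch below places critical modules first using greedy first-fit on processor time. It is not the algorithm from the thesis: the `Module` type, the `load` field (fraction of one PE's processor time), and the first-fit rule are all assumptions chosen for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Module(NamedTuple):
    name: str
    load: float      # assumed: fraction of one PE's processor time
    critical: bool


def find_configuration(
    modules: List[Module], pe_capacity: Dict[str, float]
) -> Optional[Dict[str, str]]:
    """Greedy first-fit mapping of modules onto PEs, critical modules first.

    Returns a module-name -> PE-id mapping, or None when some critical
    module cannot be placed. Non-critical modules that do not fit are dropped.
    """
    free = dict(pe_capacity)
    mapping: Dict[str, str] = {}
    # Critical modules first, larger loads before smaller ones.
    ordered = sorted(modules, key=lambda m: (not m.critical, -m.load))
    for module in ordered:
        target = next((pe for pe in sorted(free) if free[pe] >= module.load), None)
        if target is None:
            if module.critical:
                return None      # no feasible configuration for critical services
            continue             # graceful degradation: shed non-critical work
        free[target] -= module.load
        mapping[module.name] = target
    return mapping
```

A search like this runs in time roughly proportional to modules times PEs, which is why it can respond quickly; it makes no attempt to maximize utility, which is the job the slide assigns to the high-fidelity algorithm.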
30 Related Work: Analysis Tools
- Goal-Based Success Trees
- Failure Mode Analysis (fault trees)
- AADL: Architecture Analysis and Design Language
- By SAE
- A textual modeling language for specification of real-time embedded systems
- System is defined as a set of components with resource and timing properties
- No support for components with degraded modes of operation
- Graphical tools currently under development
31 Related Work: Graceful Degradation
- RoSES at CMU and Chameleon at the Technical University of Kaiserslautern
- Both are abstract, not considering implementation
- Both consider non-safety-critical and non-real-time applications
- RoSES focuses on the complexity of configuration search algorithms and on product families
- Chameleon focuses on modeling and analysis of gracefully degrading systems
32 Related Work: Distributed Object Computing
- Examples: Jini, CORBA, RT-CORBA, and ARMORs
- Principle: service-based computing, where services are brokered at runtime
- Designed with large information systems in mind
- Depend on TCP/UDP (not reliable)
- Not suitable for embedded systems
34 Example Ardea Implementation
- Specified an implementation of a control system and scientific data collection system for a light UAV
- Application includes redundant sensors, actuators, and yaw controllers with different fidelities
- Application includes non-critical functions in the form of a scientific data collection system
- Limitations
- Object code is preloaded on PEs
- Uses pre-computed configurations
- Considers only processor time as a constraint
35 Pseudocode of Modules
1 Check connection to magnetometer i
2 IF no connection, write failure message into mag i fail mailbox and suspend
3 ELSE
4   Sample magnetometer n times
5   Average the n samples
6   Place average into yaw i mailbox
7   Suspend for sampling period amount of time
8 GOTO 1
Figure 33 Magnetometer Drivers Pseudocode, p. 83
1 Check connection to servo
2 IF no connection, write failure message into servo i fail mailbox and suspend
3 ELSE
4   Read rudder angle from input mailbox
5   IF rudder angle is fail-stop code, move servo to 180 degrees
6   ELSE set servo to rudder angle
7   Delete data in input mailbox
8   Reset deadline timer
9   Suspend until input mailbox full or deadline timer overflow
10  IF deadline timer overflow
11    transmit fail-stop to system manager
12  ELSE GOTO 1
Figure 35 Servo Drivers Pseudocode, p. 84
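The decision logic of one pass through the servo driver loop can be written as a pure function, which makes it easy to test without hardware. This is a sketch under stated assumptions: the function name, the sentinel value chosen for the fail-stop code, and the message strings are hypothetical, and `deadline_missed` models whether the wait in step 9 ended through a deadline timer overflow.

```python
from typing import List, Optional, Tuple

FAIL_STOP_CODE = 0xFFFF   # hypothetical sentinel; the slides do not give the value
FAIL_STOP_ANGLE = 180     # degrees, as stated in the pseudocode


def servo_driver_pass(
    connected: bool, rudder_angle: int, deadline_missed: bool
) -> Tuple[Optional[int], List[str]]:
    """Return (commanded servo angle or None, messages emitted) for one pass."""
    if not connected:
        # Step 2: report the failure; the driver would then suspend.
        return None, ["servo_fail"]
    # Steps 5-6: a fail-stop code forces the safe position.
    position = FAIL_STOP_ANGLE if rudder_angle == FAIL_STOP_CODE else rudder_angle
    # Steps 10-11: a missed deadline is reported to the system manager.
    messages = ["fail_stop_to_system_manager"] if deadline_missed else []
    return position, messages
```

Separating the decision from the mailbox and timer calls keeps the real-time plumbing out of the logic under test.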
36 COTS Components
- Network: CAN 2.0B bus
- Controller Area Network
- Robust: differential signaling, CRC, fail-silent nodes
- CSMA/CD with non-destructive bitwise arbitration
- PEs: high-performance Silicon Labs 8051-core microcontrollers
- OS: microC/OS-II
- Preemptive, multitasking, priority based, ROMable
- DO-178B Level A certified
Figure 39 - Page 87
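Non-destructive bitwise arbitration on CAN can be simulated in a few lines: every contending node transmits its identifier most significant bit first, a dominant 0 overrides a recessive 1 on the bus, and a node that sends 1 but reads back 0 withdraws. The sketch below is a simplified model (the function name is hypothetical, all identifiers are assumed to share one width, and frame details such as the SRR/IDE bits are ignored).

```python
from typing import Iterable


def arbitrate(ids: Iterable[int], id_bits: int = 29) -> int:
    """Return the identifier that wins bus arbitration among contending nodes."""
    contenders = set(ids)
    if not contenders:
        raise ValueError("at least one contending identifier is required")
    for bit in range(id_bits - 1, -1, -1):          # most significant bit first
        # Wired-AND bus: any node sending a dominant 0 pulls the bus to 0.
        bus_level = min((ident >> bit) & 1 for ident in contenders)
        # Nodes whose bit differs from the bus level lose and stop transmitting.
        contenders = {i for i in contenders if ((i >> bit) & 1) == bus_level}
    return contenders.pop()
```

The winner is always the numerically lowest identifier, so message priority can be assigned simply by choosing identifiers, and the winning frame is transmitted without being corrupted by the losers.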
37 microC/OS-II API
- OS functions called by the scheduler task for managing task execution and mailboxes
- Other calls: read/write mailboxes, suspend while mailbox empty, and suspend for time t
Table 8 Operating System API, p. 93
38 CAN API and Messages
Table 10 The CAN Bus API, p. 95
Table 11 Overview of Messages, p. 96
39 Resource Tracking and Finding New Configurations
Table 12 Highest Utility Configuration Array, p. 99
Table 13 Example Resource Status Array, p. 101
40 Future Work: Ardea System Monitor
Figure 46 - Page 104
41 Future Work: A Wireless Bus Extension
Figure 47 - Page 105
42 Future Work: The Ardea CAD Tool
Figures 12, 13 - Pages 31, 32
43 Future Work: Honeywell's EAFTC
- Honeywell's Environmentally Adaptive Fault Tolerant Computing System (EAFTC)
- One of four technology validation payloads on the New Millennium Program's Space Technology 8 (ST8) mission, scheduled for 2008
- Purpose: fault-tolerant high-rate onboard parallel processing for science data
- We are currently investigating the use/modification of Ardea for EAFTC
- Supported by the Kentucky Space Grant Consortium
Testing Tomorrow's Technology Today!
44 Ardea Benefits
- More flexible fault tolerance at reduced cost
- Ability to analyze reconfigurable architectures using DGs
- Simplified debugging and maintenance
- Runtime system testing
- Graceful upgrade and repair
- Reduction of design errors
- Software reusability
45 Conclusion
- Graceful degradation in distributed embedded systems is a new research area currently focusing on either abstract modeling or on non-real-time/non-critical systems
- Ardea provides a structured framework for the design and implementation of real-time systems
- Dependency graphs were presented to capture fault-tolerant, dynamically reconfiguring software architectures
- An infrastructure supporting reconfigurable distributed applications was presented
46 Misc. Publications and Patents
- Patents
- R.R. Vallance, S. Chikkamaranahalli, O.A. Rawashdeh, J.E. Lumpp, B. Walcott, and E. Wolsing, "System and Device for Characterizing Shape Memory Alloy Wires," U.S. Patent 6,916,115, July 12, 2005.
- D. Wermeling, R. Vallance, B. Walcott, J. Main, J. Lumpp, and O.A. Rawashdeh, "Programmable Multi-Dose Intranasal Drug Delivery Device," U.S. patent pending, application for utility patent filed December 2002.
- A. Balasubramanian, R.R. Vallance, B.L. Walcott, J.E. Lumpp, and O.A. Rawashdeh, "Linear Actuator Using Shape Memory Wire with Controller," U.S. patent pending, provisional application filed September 2002.
- Publications
- D. Jackson, A. Groves, O. Rawashdeh, G. Chandler, W. Smith, and J. Lumpp, "Evolution of an Avionics System for a High-Altitude UAV," proc. AIAA Infotech@Aerospace Conference, paper AIAA-2005-7152, September 2005.
- S. Chikkamaranahalli, R.R. Vallance, A. Khan, E.R. Marsh, O.A. Rawashdeh, J.E. Lumpp, and B.L. Walcott, "Precision Instrument for Characterizing Shape Memory Alloy Wires in Bias Spring Actuation," Review of Scientific Instruments Journal, vol. 76, June 2005.
- G. Chandler, D. Jackson, A. Groves, O.A. Rawashdeh, N.A. Rawashdeh, W. Smith, J. Jacob, and J.E. Lumpp, Jr., "A Low-Cost Control System for a High-Altitude UAV," IEEE Aerospace Conference, IEEEAC paper 1438, March 2005.
- A. Simpson, O.A. Rawashdeh, S. Smith, J. Jacob, W. Smith, and J.E. Lumpp, Jr., "BIG BLUE: A High-Altitude UAV Demonstrator of Mars Airplane Technology," IEEE Aerospace Conference, IEEEAC paper 1436, March 2005.
- A. Simpson, J. Jacob, S. Smith, O.A. Rawashdeh, J.E. Lumpp, and W. Smith, "BIG BLUE II: Mars Aircraft Prototype with Inflatable-Rigidizable Wings," 43rd AIAA Aerospace Sciences Meeting and Exhibit, January 2005.
- K.N. Roberts, K.M. Miller, J.E. Lumpp, M. Wells, C.P. Harr, O.A. Rawashdeh, and S.W. Scheff, "Computer Controlled Cortical Contusion Device for the Mouse," Journal of Neurotrauma, vol. 21, no. 1296, November 2004.
- O.A. Rawashdeh, "Design of a Computer Controller for a Nasal Drug Delivery Device using SMA Actuators," Thesis, Master of Science in Electrical Engineering, Dept. of Electrical and Computer Engr., University of Kentucky, May 2003.
- S. Chikkamaranahalli, R.R. Vallance, O.A. Rawashdeh, J.E. Lumpp, and B. Walcott, "Setup to Characterize Nitinol Wires," Int. Conf. on Shape Memory and Superelastic Technologies (SMST-2003), Pacific Grove, CA, May 4-8, 2003.
- S. Chikkamaranahalli, R.R. Vallance, O.A. Rawashdeh, J.E. Lumpp, and B. Walcott, "Characterization of SMA Wire in Bias Spring Actuation," Proceedings of the International Conference on Shape Memory and Superelastic Technologies (SMST-2003), Pacific Grove, CA, May 4-8, 2003.
- J.E. Lumpp, K.N. Roberts, M. Wells, J.A. Main, C.P. Harr, O.A. Rawashdeh, and S.W. Scheff, "Characterization of a Computer Controlled Non-penetrating Cortical Contusion Device," Journal of Neurotrauma, vol. 20, no. 1087, May 2003.
- S. Chikkamaranahalli, R.R. Vallance, O.A. Rawashdeh, J.E. Lumpp, and B. Walcott, "Precision Instrument for Characterizing Contraction and Extension of Nitinol Wire," Proceedings of the 17th Annual Meeting of the American Society for Precision Engineering (ASPE), October 20-25, 2002.
47 Ardea Related Publications
- O.A. Rawashdeh and J.E. Lumpp, Jr., "Run-Time Behavior of Ardea: A Dynamically Reconfiguring Distributed Embedded Control Architecture," to appear, IEEE Aerospace Conference, IEEEAC paper 1516, March 2006.
- O.A. Rawashdeh and J.E. Lumpp, Jr., "Ardea: A Dynamic Reconfiguration Framework for Fault-Tolerant Distributed Embedded Systems," under review, Journal of Systems and Software, Special Issue: Architecting Dependable Systems, submitted October 2005.
- G. Chandler, C. Harr, O. Rawashdeh, D. Feinauer, D. Jackson, A. Groves, and J. Lumpp, "Wireless Extension of an Avionics Bus for Prototyping and Testing Reconfigurable UAVs," proc. AIAA Infotech@Aerospace Conference, paper AIAA-2005-7151, September 2005.
- O. Rawashdeh, D. Feinauer, C. Harr, G. Chandler, D. Jackson, A. Groves, and J. Lumpp, "A Dynamically Reconfiguring Avionics Architecture for UAVs," proc. AIAA Infotech@Aerospace Conference, paper AIAA-2005-7050, September 2005.
- O.A. Rawashdeh, G.D. Chandler, and J.E. Lumpp, Jr., "A UAV Test and Development Environment Based on Dynamic System Reconfiguration," International Conference on Software Engineering (ICSE), proc. of the 2005 Workshop on Architecting Dependable Systems (WADS05), pp. 1-7, May 2005.
- O.A. Rawashdeh and J.E. Lumpp, Jr., "A Technique for Specifying Dynamically Reconfigurable Embedded Systems," IEEE Aerospace Conference, IEEEAC paper 1435, March 2005.
49 Reconfiguration Policies
Figure 31 - Page 75
50 Related Work: RoSES
- Robust Self-Configuring Embedded Systems
- Long-term abstract graceful degradation research at CMU
- Decomposes system into feature subsets, each having a utility value depending on the operation of its software components
- Offline exponential reconfiguration search algorithms find optimal configurations
- Not deterministic and not testable
- Focus
- Reducing complexity of search algorithms
- Software fault tolerance for non-critical functionality
- Use as product family specification
51 Related Work: Chameleon
- Focus is on modeling and analysis of gracefully degrading distributed embedded systems
- System modeled as a set of services, each with input requirements
- Each service has a configuration tree (or success tree)
- Abstract modeling work, no implementation or runtime considerations
52 I/O Devices
- I/O devices
- Interfaces to the environment
- Output device attributes
- Criticality
- Priority
- Real-time deadline
- Status
- Attributes are modifiable
- I/O software modules
- Input modules are time triggered
- Output modules monitor deadlines
Figure 8 - Page 24
53 Ardea Fault Handling
- Static redundancies do not cause reconfiguration
- Reports of failures to the system manager trigger reconfiguration to employ redundancies dynamically
- Fail-stop mode is initiated when critical deadlines are missed (due to undetected failures or reconfiguration delays)
54 Example Dependency Graph
Figure 9 - Page 24
55 Scheduling and Unscheduling
- Starting, stopping, and restarting modules
- Restarting requires
- State Preservation
- Unprocessed data preservation
Figures 27,30 - Pages 65,69