Title: RTES Approach to Software Fault Tolerance
1 RTES Approach to Software Fault Tolerance
- Jim Kowalkowski, Fermilab-CD
2 Purpose
To give you a brief, high-level overview of some
of the concepts, techniques, and tools that the
RTES collaboration developed to address software
reliability concerns in the BTeV trigger system.
(Although looking at these slides, one may think
it is not so brief.)
3 Very brief history
- BTeV was developing a complex trigger system, which included a large collection of FPGAs and commodity hardware in two large clusters with
  - High availability requirement
  - High throughput requirement
  - Real time processing constraints
- The RTES collaboration was formed to help assure that the availability requirement was met
  - Four universities with expertise in software and hardware fault tolerance, reliability engineering, and real time processing
  - (I was the liaison between BTeV and RTES)
4 Problem and RTES Goal
- Problem: Software reliability depends on
  - Physics detector-machine performance
  - Program testing procedures, implementation, and design quality
  - Behavior of the electronics (front-end and within the trigger)
- Goal: Create fault handling infrastructure capable of
  - Accurately identifying problems (where, what, and why)
  - Compensating for problems (shifting the load, changing thresholds)
  - Performing automated recovery procedures (restart / reconfiguration)
  - Accurate accounting
  - Being extended (capturing new detection/recovery procedures)
  - Policy driven monitoring and control
- (Also wanted to simplify operations)
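The goal list above (identify, compensate, recover, account, extend, policy driven) can be illustrated with a minimal sketch. Everything named here (Fault, FaultHandler, the fault kinds, the node names) is hypothetical and chosen for illustration only; it is not an RTES interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Fault:
    where: str  # component that reported the problem
    what: str   # fault kind, e.g. "process_hung"
    why: str    # best-effort diagnosis


@dataclass
class FaultHandler:
    # Policy table: fault kind -> action (hypothetical, for illustration).
    policy: Dict[str, Callable[[Fault], str]] = field(default_factory=dict)
    # Accounting: every fault seen and the action taken for it.
    log: List[Tuple[Fault, str]] = field(default_factory=list)

    def register(self, what: str, action: Callable[[Fault], str]) -> None:
        """Extend the handler with a new recovery procedure."""
        self.policy[what] = action

    def handle(self, fault: Fault) -> str:
        action = self.policy.get(fault.what)
        # Faults without a policy entry are passed up rather than ignored.
        outcome = action(fault) if action else "escalate to operator"
        self.log.append((fault, outcome))
        return outcome


handler = FaultHandler()
handler.register("queue_overflow", lambda f: f"shift load away from {f.where}")
handler.register("process_hung", lambda f: f"restart {f.where}")

print(handler.handle(Fault("node07", "process_hung", "no heartbeat")))
print(handler.handle(Fault("node03", "bit_error", "unknown")))
```

Keeping the policy in a table rather than in the handler logic is what lets new detection/recovery procedures be added without touching existing code.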
5 What aspects are interesting?
- Hierarchical decomposition of problem, which addresses
  - Real time constraints (react quickly when necessary)
  - Scalability
  - Resource usage constraints
  - Protocols between levels
- Separation of concerns
  - the various contributors write code specific to their need
- Very low coupling
  - linkage through message subscriptions
  - Separation of monitoring, problem detection, and actions
- The system can change dynamically (as it is running)
- Interprocess messaging infrastructure
  - Based on Elvin
  - A Publish/subscribe system
  - Supplied gateways and routers at various levels
- High-level abstractions
  - System behavior and configuration can be expressed using domain specific concepts and terms
  - Tools actually executing the configuration can evolve independently
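To make "linkage through message subscriptions" concrete, here is a toy in-process publish/subscribe broker. It is a sketch only: it does not use or imitate the Elvin API, and the Broker class, message fields, threshold, and node names are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]
Predicate = Callable[[Message], bool]
Callback = Callable[[Message], None]


class Broker:
    """Toy content-based publish/subscribe broker (not the Elvin API)."""

    def __init__(self) -> None:
        self._subs: List[Tuple[Predicate, Callback]] = []

    def subscribe(self, predicate: Predicate, callback: Callback) -> None:
        self._subs.append((predicate, callback))

    def publish(self, message: Message) -> int:
        """Deliver to every matching subscriber; return the delivery count."""
        delivered = 0
        # Iterate over a copy so subscriptions may change while running.
        for predicate, callback in list(self._subs):
            if predicate(message):
                callback(message)
                delivered += 1
        return delivered


broker = Broker()
actions: List[str] = []


def detect(msg: Message) -> None:
    # Detection: turns raw readings into fault messages.
    if msg["value"] > 0.9:
        broker.publish({"type": "fault", "source": msg["source"]})


def act(msg: Message) -> None:
    # Action: reacts to faults only, never sees raw readings.
    actions.append(f"throttle {msg['source']}")


broker.subscribe(lambda m: m.get("type") == "reading", detect)
broker.subscribe(lambda m: m.get("type") == "fault", act)

# Monitoring: publishes readings without knowing who consumes them.
broker.publish({"type": "reading", "source": "l1-node-12", "value": 0.95})
broker.publish({"type": "reading", "source": "l1-node-13", "value": 0.40})
print(actions)
```

Monitoring, detection, and action share nothing but message content, so each can be written or replaced by a different contributor.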
6 Development approach
- Generated a series of use cases that describe typical system behavior
- Generated a set of prototypical problems that may occur on each of the systems
- Generated a system architecture that looked similar to the vision of the real system
- Created demonstration systems that match this architecture and emulate operation of various parts of the system using RTES developed products
  - Purpose is to detect and react to a given set of problems
  - Made Level 1 trigger project using DSP event processors, Linux regional managers, and a high-level control system
  - Made Level 2/3 trigger project using a Linux farm
7 Hierarchical Detection/Mitigation
8 Configuration through Modeling
- Multi-aspect tool, separate views of
  - Hardware components and physical connectivity
  - Executables configuration and logical connectivity
  - Fault handling behavior using hierarchical state machines
- Model interpreters can generate the system image
  - At the code fragment level (for fault handling)
  - Download scripts and configuration
  - Validation and testing
- Modeling languages are application specific
  - Shapes, properties, associations, constraints
  - Appropriate to application/context
  - System architecture/configuration
  - Component states
  - Message structures
  - Fault mitigations
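As a rough idea of what "fault handling behavior using hierarchical state machines" can look like in code, the sketch below lets a nested state inherit transitions from its enclosing state. The state names, events, and the HSM class are assumptions for illustration; this is not output of the RTES model interpreters.

```python
from typing import Dict, Optional, Tuple


class HSM:
    """Minimal hierarchical state machine driven by a transition table."""

    def __init__(self,
                 parents: Dict[str, Optional[str]],
                 transitions: Dict[Tuple[str, str], str],
                 initial: str) -> None:
        self.parents = parents          # state -> enclosing state (or None)
        self.transitions = transitions  # (state, event) -> target state
        self.state = initial

    def dispatch(self, event: str) -> bool:
        """Try the current state first, then each enclosing state in turn."""
        s: Optional[str] = self.state
        while s is not None:
            target = self.transitions.get((s, event))
            if target is not None:
                self.state = target
                return True
            s = self.parents[s]
        return False  # no state in the chain handles this event


# Hypothetical model of one worker process.
parents = {
    "Operational": None,
    "Running": "Operational",
    "Degraded": "Operational",
    "Failed": None,
}
transitions = {
    ("Running", "timeout"): "Degraded",
    ("Degraded", "recovered"): "Running",
    ("Operational", "crash"): "Failed",  # inherited by Running and Degraded
    ("Failed", "restart"): "Running",
}

machine = HSM(parents, transitions, "Running")
for event in ["timeout", "crash", "restart"]:
    machine.dispatch(event)
    print(event, "->", machine.state)
```

Because the behavior lives in two tables, a model interpreter could plausibly emit those tables from a drawn model while the dispatch loop stays fixed.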
9 Modeling Environment
10 ARMORs
- Are multithreaded processes composed of replaceable or pluggable building blocks called Elements
- Elements provide error detection and recovery services to the trigger and other applications
  - Restarts, reconfiguration
  - Removal from service
- ARMOR framework routes messages and schedules Elements based on their message subscriptions
- A hierarchy of ARMOR processes forms a reconfigurable runtime environment
  - System management, error detection, and error recovery services are distributed across ARMOR processes
  - ARMOR runtime environment can handle self failure
- ARMOR support for the application
  - Completely transparent and external support
  - Instrumentation with ARMOR API
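The idea of a framework that routes messages to pluggable Elements based on their subscriptions can be sketched as follows. This is a single-threaded toy, whereas real ARMORs are multithreaded, and the class names, message types, and target name are assumptions; none of it reflects the actual ARMOR API.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List


class Element:
    """Pluggable building block; declares the message types it wants."""
    subscriptions: List[str] = []

    def handle(self, msg: dict, send: Callable[[dict], None]) -> None:
        raise NotImplementedError


class HeartbeatWatcher(Element):
    """Detection: turns a missed heartbeat into a restart request."""
    subscriptions = ["heartbeat_missed"]

    def handle(self, msg, send):
        send({"type": "restart_request", "target": msg["target"]})


class Restarter(Element):
    """Recovery: records which targets it was asked to restart."""
    subscriptions = ["restart_request"]

    def __init__(self) -> None:
        self.restarted: List[str] = []

    def handle(self, msg, send):
        self.restarted.append(msg["target"])


class Container:
    """Routes queued messages to Elements by subscription."""

    def __init__(self) -> None:
        self.routes: Dict[str, List[Element]] = defaultdict(list)
        self.queue: Deque[dict] = deque()

    def plug(self, element: Element) -> None:
        for msg_type in element.subscriptions:
            self.routes[msg_type].append(element)

    def send(self, msg: dict) -> None:
        self.queue.append(msg)

    def run(self) -> None:
        # Drain the queue; Elements may enqueue follow-up messages.
        while self.queue:
            msg = self.queue.popleft()
            for element in list(self.routes.get(msg["type"], [])):
                element.handle(msg, self.send)


container = Container()
restarter = Restarter()
container.plug(HeartbeatWatcher())
container.plug(restarter)
container.send({"type": "heartbeat_missed", "target": "filter-worker-3"})
container.run()
print(restarter.restarted)
```

The two Elements never reference each other; swapping in a different recovery Element only requires plugging in another subscriber to the same message type.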
11 Very Lightweight Agents (VLA)
- Message scheduling and priority assignments
- Fast, simple reactive decisions
- Read, summarize, and report sensor data
- Are pluggable components
- Live alongside the application
- Some predictive capabilities
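A minimal sketch of the "reads, summarizes, and reports sensor data" and "fast, simple reactive decisions" roles is shown below. The TemperatureAgent class, window size, limit, and readings are invented for illustration; as noted under Shortcomings, no standardized VLA API was established.

```python
from collections import deque
from statistics import mean
from typing import Deque, List, Optional


class TemperatureAgent:
    """Keeps a short window of readings and reports only when needed."""

    def __init__(self, window: int = 3, limit: float = 75.0) -> None:
        self.samples: Deque[float] = deque(maxlen=window)
        self.limit = limit

    def observe(self, value: float) -> Optional[dict]:
        """Record one reading; return an alarm summary or None."""
        self.samples.append(value)
        average = mean(self.samples)
        # Reactive decision: a single cheap comparison per reading.
        if average > self.limit:
            return {"type": "alarm", "mean": average, "max": max(self.samples)}
        return None


agent = TemperatureAgent(window=3, limit=75.0)
readings: List[float] = [60.0, 65.0, 72.0, 80.0, 85.0, 90.0]
reports = [r for r in (agent.observe(v) for v in readings) if r is not None]
print(len(reports), "alarm(s) raised out of", len(readings), "readings")
```

Reporting a summary instead of every raw reading keeps the agent's message traffic small, which matters when it runs alongside the application it watches.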
12 Shortcomings
- VLAs
  - Standardized API and management framework never established
  - Definition much too vague to know precisely if your code is one of these or not
- ARMORs
  - Difficult to write pluggable components (complex execution model)
  - Only support for C
  - Management and configuration not adequate
(Please note: Some of the developments were cut short due to the cancellation of BTeV)
13 Current Activities
- LQCD cluster, need to automate
  - Routine administration tasks
  - Recovery procedures (for jobs and nodes)
  - Collection of performance information
- CMS online
  - Automate system validation and testing
(These are the ones I am aware of.)