Design Environment for Fault-Adaptive Systems

Created: 1/30/2001. Company: ISIS.

1
Design Environment for Fault-Adaptive Systems
  • Ted Bapty
  • Sandeep Neema
  • Sweta Shetty, Steve Nordstrom, Divya Vashishtha,
    Jason Scott, Jason Overdorf
  • Vanderbilt Univ.

2
BTeV RTES Team (NSF/ITR)
  • Fermilab
  • Building BTeV Trigger Hardware
  • Domain Experts, Define Goals, Constraints, etc.
  • Vanderbilt
  • RTES Lead (Physics)
  • Design Environment, System Synthesis, System
    Integration, Prototype Hardware
  • UIUC
  • ARMOR, Fault Tolerant Middleware
  • Syracuse & Pitt
  • Very Lightweight Agents, Diagnostics, Load
    Balancing

3
High Energy Physics
BTeV Experiment
Fermilab Accelerator
4
Particle Measurement
Detector Grids
  • Problem
  • Massive amounts of data (Terabytes/Sec)
  • Determine the set of particle trajectories
  • Decide if it is interesting, keep or toss
  • Hardware: > 2500 DSPs, 2500 PCs
  • Never Fail (ok to degrade)

5
Trigger System (20,000 ft. view)
[Diagram labels: Pre-Process (FPGA); 1st Level (DSP); Memory Queue, ms; 2nd Level (PC); Store; 2000 Nodes (shown twice)]
6
System Constraints
  • Triple-Mode Redundancy Too Expensive
  • Some Over-capacity designed in
  • Parallel System, Real-Time
  • Heterogeneous Processors
  • RT Constraints: Queue Length
  • No Generic Response to Faults
  • Based on application requirements
  • Based on system state
  • Based on available resources

7
Fault Mitigation
  • System has excess capacity
  • But not much (10%)
  • Cannot pre-plan use of redundancy
  • Excess capacity may be used for disposable
    tasks
  • Fault Occurs
  • React quickly to regain minimal function
  • Rearrange Resources to make Best Use of Remaining
    Resources
  • User-defined recovery behavior

8
Reflex & Healing
  • Reflex Action
  • Simple,
  • Rapid,
  • Real-Time, Guaranteed Response Time,
  • Sub-Optimal
  • Handle a Single Failure
  • Healing
  • Re-Evaluate Resources & Tasks
  • Re-Balance/Re-Allocate Resources
  • Recover Failed Resources (After Testing)
  • Generate New Reflex Actions
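
The reflex/healing split above can be sketched as a two-tier controller: a precomputed table gives the fast, sub-optimal reflex for a single failure, and a slower healing pass re-balances the tasks and regenerates the table. This is a minimal Python sketch under stated assumptions; `System`, `generate_reflexes`, `reflex_action`, `heal`, and the round-robin re-balancing are illustrative names and choices, not the presenters' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Hypothetical system state: which tasks run on which processor."""
    assignment: dict[str, list[str]]                       # processor -> tasks
    failed: set[str] = field(default_factory=set)
    reflex: dict[str, str] = field(default_factory=dict)   # processor -> fallback processor


def generate_reflexes(system: System) -> None:
    """Precompute one fallback per healthy processor (covers a single failure)."""
    healthy = sorted(p for p in system.assignment if p not in system.failed)
    if len(healthy) < 2:
        system.reflex = {}
        return
    system.reflex = {p: healthy[(i + 1) % len(healthy)] for i, p in enumerate(healthy)}


def reflex_action(system: System, failed_proc: str) -> None:
    """Reflex: one table lookup, then move the failed processor's tasks as-is."""
    target = system.reflex[failed_proc]
    system.assignment[target].extend(system.assignment[failed_proc])
    system.assignment[failed_proc] = []
    system.failed.add(failed_proc)


def heal(system: System) -> None:
    """Healing: re-balance all tasks over healthy processors, then build new reflexes."""
    healthy = sorted(p for p in system.assignment if p not in system.failed)
    tasks = sorted(t for ts in system.assignment.values() for t in ts)
    for p in system.assignment:
        system.assignment[p] = []
    for i, task in enumerate(tasks):
        system.assignment[healthy[i % len(healthy)]].append(task)
    generate_reflexes(system)
```

In this sketch the reflex leaves one processor overloaded until `heal` runs, which mirrors the "sub-optimal" reflex followed by "re-balance/re-allocate" healing described on the slide.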

9
Reflex: Mitigation Example
User-Defined Mitigation Actions
1. Normal Operation
2. Processor Failure
3. Subdivide Primary Task
4. Migrate to Adjacent Processors
5. Replace Secondary Task
6. Reset/Test Failed Processor
[Diagram legend: Primary Task; Secondary Task]
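
The reflex steps on this slide can be traced on a row of processors. The sketch below is an assumption-laden illustration in Python: the dictionary layout, the `reflex_mitigate` name, and the way a primary task is "subdivided" into labeled parts are all hypothetical, chosen only to make the sequence concrete.

```python
def reflex_mitigate(procs: list[dict], failed: int) -> list[dict]:
    """Apply the reflex steps to a row of processors and return a new row.

    Each processor is {"primary": ..., "secondary": ..., "state": ...};
    this layout is hypothetical.
    """
    procs = [dict(p) for p in procs]              # work on a copy
    task = procs[failed]["primary"]
    # Adjacent processors that are still healthy.
    neighbors = [i for i in (failed - 1, failed + 1)
                 if 0 <= i < len(procs) and procs[i]["state"] == "ok"]
    if not neighbors:
        raise RuntimeError("no healthy adjacent processor; reflex cannot apply")
    # Subdivide the failed processor's primary task into one part per neighbor.
    parts = [f"{task}[{k + 1}/{len(neighbors)}]" for k in range(len(neighbors))]
    # Migrate each part to a neighbor, replacing its (disposable) secondary task.
    for i, part in zip(neighbors, parts):
        procs[i]["secondary"] = part
    # Take the failed processor out of service for reset/test.
    procs[failed] = {"primary": None, "secondary": None, "state": "testing"}
    return procs
```

For example, with three processors running primaries `P0`, `P1`, `P2`, a failure of the middle one leaves `P1[1/2]` and `P1[2/2]` in the neighbors' secondary slots.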
10
Healing: Mitigation Example
Mitigation Actions
1. Normal Operation
2. Processor Failure, Reflex Action
3. Update Models
4. Re-Evaluate Resources
5. Re-Plan System
6. Rearrange tasks
[Diagram labels: Re-Eval, Re-Plan; Primary Task; Secondary Task]
11
Design Issues
  • Complex System
  • Thousands of Processors
  • High Data Rates
  • Real-Time Constraints
  • User-Defined Behaviors
  • Domain-Specific Design Tool
  • System-Specific Implementation
  • Run-Time Implementation
  • Heterogeneous Architecture
  • Real-Time Execution & Mitigation
  • Fault-Tolerant

12
Model Integrated Computing
[Design-time diagram labels: System Models (Algorithm; Resource; Fault Behavior; Reconfig Behavior); Design and Analysis; Analysis (Performance Simulation; Diagnosability Analysis; Reliability Analysis); Feedback; Synthesis; Runtime]
[Runtime diagram labels: Experiment Interface; Global Operations Manager; Global Fault Manager; Region Operations Mgr; Region Fault Mgr; Logical Control Network; Logical Data Net; Local Oper. Manager and Local Fault Mgr on each L1/DSP node (ARMOR/RTOS) and each L2,3/RISC node (ARMOR/Linux), running Trig Algo. tasks; Soft Real-Time; Hard]
13
Modeling Language
  • Processing Data Flow
  • Concepts: Processes, streams, data channels,
    functions, data types, communication
  • Hardware Resources
  • Concepts: Processors, Memory, Topology,
    Reliability, Failure Modes
  • Hierarchical Fault Management
  • Concepts: Recovery Strategies, Modes of
    Operation, goals/importance
  • Modes shown in the diagram: Full, Recov. Mode 1,
    Recov. Mode 2, Recov. Mode 3
14
Resource Models
  • Capture Hardware Resources
  • Nodes
  • Networks
  • Attributes
  • Hierarchy
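
A resource model with these four elements could be represented roughly as follows. This is a sketch in Python with hypothetical class and field names (`Node`, `Network`, `Group`, `all_nodes`); the slides do not show the actual modeling language's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                                   # e.g. "DSP", "PC", "FPGA"
    attributes: dict[str, float] = field(default_factory=dict)


@dataclass
class Network:
    name: str
    endpoints: tuple[str, ...]                  # names of the connected nodes


@dataclass
class Group:
    """Hierarchy: a group holds nodes, networks, and nested groups."""
    name: str
    nodes: list[Node] = field(default_factory=list)
    networks: list[Network] = field(default_factory=list)
    children: list["Group"] = field(default_factory=list)

    def all_nodes(self):
        """Yield every node in this group and, recursively, in its children."""
        yield from self.nodes
        for child in self.children:
            yield from child.all_nodes()
```

Walking the hierarchy with `all_nodes()` is the kind of query a generator or analysis tool would need, for example to count the DSPs available in one region.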

15
Algorithm Models
  • Processes
  • Info Flow
  • Interfaces
  • Hierarchy

16
Fault Mitigation Models
[Diagram labels: Local Manager; Regional Manager; State; Transition; Conditions; Mitigation Actions]
  • Finite State Machine
  • Parallel, Hierarchical
  • Events, Transitions
  • Mitigation Actions
  • Time Specs
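
One way to read this slide is as a table-driven state machine per manager, with unhandled events escalating from the local to the regional level. The Python sketch below assumes that reading; `Transition`, `Manager`, `handle`, and the `deadline_ms` field are hypothetical, and the time spec is only recorded here, not enforced.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transition:
    target: str               # next state
    actions: list[str]        # mitigation actions to run
    deadline_ms: float        # time spec (assumed unit), recorded but not enforced


@dataclass
class Manager:
    name: str
    state: str
    table: dict[tuple[str, str], Transition]    # (state, event) -> transition
    parent: Optional["Manager"] = None
    log: list[str] = field(default_factory=list)

    def handle(self, event: str) -> bool:
        """Fire a matching transition, or escalate the event to the parent."""
        transition = self.table.get((self.state, event))
        if transition is None:
            return self.parent.handle(event) if self.parent else False
        self.log.extend(transition.actions)
        self.state = transition.target
        return True
```

Each manager keeps its own state, so several local managers can run side by side under one regional manager.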

17
System Generation
[Diagram labels: Algorithms; Resources; SW Loads; Comm Maps; Schedules; Boot Maps; Task Assign; OS Cfg]
18
Generation of Reflex Networks
[Diagram labels: System Fault State; State A; State B; State C; Primary Struct.; Reflex Struct.; Action AB1, Action AB2, Action AB3; Action AC1, Action AC2, Action AC3]
Reflex Scripts: 1 set for each processor and failure type
ON (L76 Fail) DO Del P1 Conn P1,S22 Map S22, C3 Kill T22 Migrate T33,
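
The idea of one script set per processor and failure type can be sketched as an offline generation step that fills a lookup table. The Python below is illustrative only: the action tuples (`kill_secondary`, `migrate_primary`, `reroute`), the ring topology, and the function names are assumptions, and they do not reproduce the script syntax shown on the slide.

```python
from itertools import product


def plan_reflex(proc: str, failure: str, ring: list[str]) -> list[tuple[str, ...]]:
    """Hypothetical planner: choose actions for one (processor, failure type) pair."""
    neighbor = ring[(ring.index(proc) + 1) % len(ring)]
    if failure == "processor":
        # Free capacity on the neighbor, then move the primary task there.
        return [("kill_secondary", neighbor), ("migrate_primary", proc, neighbor)]
    if failure == "link":
        return [("reroute", proc, neighbor)]
    raise ValueError(f"unknown failure type: {failure}")


def generate_reflex_scripts(ring: list[str], failures: list[str]) -> dict:
    """Precompute one action list for every processor and failure type."""
    return {(p, f): plan_reflex(p, f, ring) for p, f in product(ring, failures)}
```

At run time a fault then costs only a dictionary lookup, for example `scripts[("C3", "processor")]`.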
19
Model-Based Healing
[Diagram labels: MIC Healing Controller; Nominal; Re-Balance; New Reflex; Update Model; Faults; Reflex; Interface; System Hardware]
20
Runtime Environment
[Diagram labels: Model Interface; Experiment Interface; Global Manager, Regional Manager, and Local Manager, each with a Mitigation Engine and Reflex Actions; Actions and Feed Back (each shown twice); DSP Kernel; DSP Hardware]
21
Fault Mitigation Interface
  • The FMA interfaces with the local diagnostics
    facility (receive local status, clear errors,
    trigger rediagnosis, set diagnosis mode, etc.)
  • Commands
  • RETRY_LINK(link_id)
  • Function: Reset/resync a comm link;
    returns failure or success
  • REROUTE_LINK(link_id)
  • Function: Reroute communications through a
    separate link
  • ADD_TASK(task_id, link_id)
  • Function: Adds a task to the task list, to operate
    on data from link_id
  • TEST_MEMORY(memory_bank)
  • Function: Intensive test on memory bank
  • RELOCATE_DATA(from_bank, to_bank)
  • Function: Moves data, marks source memory bank as
    unused/unavailable
  • GET_LOCAL_STATUS
  • Function: Reports status of a resource on a
    local node
  • SEND_MESSAGE
  • RECEIVE_MESSAGE
  • . . .
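
A few of the commands listed above can be exercised against an in-memory stand-in for a node. This Python sketch is hypothetical throughout: `MockNode`, `dispatch`, the state strings, and every return value other than RETRY_LINK's success/failure are assumptions, since the slide gives only command names and one-line descriptions.

```python
class MockNode:
    """In-memory stand-in for a local node; all behavior is assumed."""

    def __init__(self, links, banks):
        self.links = {link: "ok" for link in links}
        self.banks = {bank: "available" for bank in banks}
        self.tasks = []

    def retry_link(self, link_id):
        """RETRY_LINK: reset/resync a comm link; returns success or failure."""
        if link_id not in self.links:
            return False
        self.links[link_id] = "ok"
        return True

    def add_task(self, task_id, link_id):
        """ADD_TASK: add a task that operates on data from link_id."""
        self.tasks.append((task_id, link_id))
        return True

    def relocate_data(self, from_bank, to_bank):
        """RELOCATE_DATA: move data and mark the source bank unavailable."""
        self.banks[from_bank] = "unavailable"
        self.banks[to_bank] = "in_use"
        return True

    def get_local_status(self):
        """GET_LOCAL_STATUS: report a snapshot of local resources."""
        return {"links": dict(self.links), "banks": dict(self.banks),
                "tasks": list(self.tasks)}


def dispatch(node, command, *args):
    """Map a command name from the slide onto a MockNode method."""
    handlers = {
        "RETRY_LINK": node.retry_link,
        "ADD_TASK": node.add_task,
        "RELOCATE_DATA": node.relocate_data,
        "GET_LOCAL_STATUS": node.get_local_status,
    }
    if command not in handlers:
        raise ValueError(f"unsupported command: {command}")
    return handlers[command](*args)
```

Keeping the command names as strings lets a generated reflex script be replayed through `dispatch` without the script knowing how a node implements each command.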

22
Synthesis & Analysis/Offline
  • Simulation
  • Functional (e.g. Matlab)
  • Performance (Timing, Discrete Event)
  • Interfacing/generating to Swarms/Jackal/TAEMS
  • Diagnosability
  • Failure Modes & Sensors
  • Predict ability to Detect/Isolate Failures
  • Reliability Analysis
  • Predict MTBF, Maximum Failures
  • Robustness
  • Stability Analysis
  • Reconfiguration Strategies/Control System
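
As a rough illustration of the reliability item, the expected time until the k-th node failure can be estimated under a textbook model. The Python sketch below assumes independent nodes with a constant failure rate and no repair; the function name and any MTBF figures used with it are placeholders, not data from the BTeV system.

```python
def mean_time_to_kth_failure(node_mtbf_h: float, n: int, k: int) -> float:
    """Expected hours until the k-th failure among n identical nodes.

    Assumes independent exponential lifetimes and no repair: with i nodes
    already failed, the next failure arrives after node_mtbf_h / (n - i)
    hours on average.
    """
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    return node_mtbf_h * sum(1.0 / (n - i) for i in range(k))
```

With k = 1 this reduces to the node MTBF divided by the node count, which is why a system of thousands of processors sees failures far more often than any single node does.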

23
System Simulation
System Model
Task Model
Communication Model
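
A performance simulation of this kind can be reduced to a toy example that tracks queue length, the real-time constraint named earlier in the deck. The sketch below is a fixed-time-step simulation in Python rather than a true discrete-event simulator; `simulate_queue` and all rates are hypothetical.

```python
def simulate_queue(arrivals_per_tick: int, workers: int, ticks: int,
                   fail_at=None, workers_lost: int = 0) -> int:
    """Return the peak queue length over the run.

    Each worker processes one event per tick. If fail_at is given,
    workers_lost workers are removed at that tick.
    """
    queue = 0
    peak = 0
    for tick in range(ticks):
        if fail_at is not None and tick == fail_at:
            workers -= workers_lost
        queue += arrivals_per_tick
        queue -= min(queue, workers)      # serve as many events as capacity allows
        peak = max(peak, queue)
    return peak
```

With 10% spare capacity the queue stays empty, but losing more workers than the spare margin makes the backlog grow every tick until a mitigation action restores capacity.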
24
Summary
  • Developing Model-Based Approach
  • Capture Algorithm, Resource, and Mitigation
    Aspects
  • Generation of Software
  • Normal application Code
  • Fault Mitigation Code
  • Two Fault Mitigation Approaches
  • Reflex: Fast, Limited Response
  • Healing: Slower, system re-design
  • Analysis & Simulation
  • Runtime Infrastructure