Title: A Self-Healing Approach for Developing Complex Software Systems
1The Shadows Project
- A Self-Healing Approach for Developing Complex Software Systems
IBM Haifa Research Lab, Reliable Systems
Presented by Onn Shehory, Shadows project coordinator
IBM Academy Conference, April 2006
2Outline
- Introduction
- Technical overview
- Organization
- Shadows background technologies
- ConTest: Concurrency Testing
- ATS: Automated Threshold Setting
- BCT: Behavior Capture and Test
- Contribution to standards
- Summary
3System Complexity
Actual Application Architecture for Consumer Electronics Company
4Shadows - Profile
- Consortium formed to address challenges formulated by EU
- EU 6th Framework R&D Program, call no. 5
- Strategic Objective 2.5.5: Software and Services
- Research proposal submitted to EU 9/2005
- Members
- IBM, Univ of Milan Bicocca, Univ of Potsdam, Univ of Brno, Artisys, Comverse Technologies, Net Technologies, Philips, Scapa Technologies
- Color legend: Blue - Technology Validator, Green - Technology Provider, Pink - Dissemination/Exploitation
5Technical Overview
- A paradigm for developing complex software systems with design-time and run-time self-healing (SH) capabilities
- Goal: mitigate the challenge of growing software complexity and its detrimental impact on software quality
- Integration of several SH technologies across the system lifecycle
- Mainly in middleware and applications
6Shadows Technologies
- The underlying set of SH technologies will include
- Verification, and run-time amelioration, of Concurrent Systems (IBM)
- Automatic Threshold Setting for Performance Management (IBM)
- Behavioral Capture and Test (Univ. Milan)
- Formal Methods (Univ. Brno/Potsdam)
7Technology Validation
- The contribution of validators includes
- Gap analysis
- Requirement definitions
- Technology evaluation
- Validation environments include, e.g.
- Real-time, resource-constrained embedded C software (Philips Nexperia)
- Server-side Java software for high-availability telco systems (Comverse's MMS)
- Avionics software (Artisys)
8Methodology Flow
[Figure: methodology flow diagram. Labels as extracted: Requirements Definition; System Design; Analysis; Development; Assurance; Testing / Debugging; Healing; System Deployment; "SH-Oriented" (three boxes); "SH" (one box).]
9Abstract Architecture
Integrated Model-Based Framework for Designing and Managing Self-Healing Systems
System Design and Management Standards
Methodology and Tools
Concurrency Testing
Fault Prediction and Automatic Threshold Setting
Behavioral Capture and Test
Model-Based Technologies
Open Standards
CIM
TPTP
10Shadows Solution Architecture
11Background Technologies
- ConTest: Concurrency Testing
- ATS: Automated Threshold Setting
- BCT: Behavior Capture and Test
12ConTest: Testing Concurrent and Distributed Applications
13ConTest the Challenge
- Finding bugs in parallel and multi-threaded software is challenging
- Bugs depend on the program execution order
- In a lab environment, only a small subset of the possible execution orders occurs
- As a result, many bugs are discovered only in the field
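The dependence on execution order can be made concrete with a toy model. The sketch below is an illustration only, not taken from the slides: it simulates two threads, each performing a non-atomic increment (read, then write) of a shared counter, and enumerates every possible interleaving. The function `run` and the thread names are hypothetical.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of two threads doing a non-atomic increment.

    `schedule` is a sequence such as ("A", "A", "B", "B"). The first time a
    thread is scheduled it reads the shared counter; the second time it
    writes back the value it read plus one.
    """
    counter = 0
    local = {}                      # value each thread has read so far
    for thread in schedule:
        if thread not in local:     # first step of this thread: read
            local[thread] = counter
        else:                       # second step of this thread: write
            counter = local[thread] + 1
    return counter

# All distinct orders of two steps per thread.
schedules = sorted(set(permutations("AABB")))
outcomes = {s: run(s) for s in schedules}
buggy = [s for s, value in outcomes.items() if value != 2]

for s in schedules:
    print("".join(s), "->", outcomes[s])
print(f"{len(buggy)} of {len(schedules)} execution orders lose an update")
```

Only the two orders in which one thread finishes before the other starts give the correct result. A test environment that happens to run the threads back to back never sees the lost update.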
14ConTest the Solution
- ConTest runs existing tests multiple times
- Using different scheduling orders created by ConTest
- ConTest increases the probability of revealing timing-related bugs in Java programs
- ConTest supports execution replay to reproduce the execution that caused the bugs
- Replay and debugging aids assist once a bug is found
- Solution for Java done; C/C++ and C# under development
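The idea of re-running an existing test under varied schedules, and replaying a failing one, can be sketched as follows. This is a toy simulation under stated assumptions, not ConTest's implementation: the "threads" are simulated steps, the scheduler is a seeded pseudo-random choice, and `run_test` and `find_failure` are hypothetical names.

```python
import random

def run_test(seed):
    """Run the 'test' once under a pseudo-random schedule derived from `seed`.

    Two simulated threads each perform a non-atomic increment (read, then
    write). Returns the final counter value and the schedule that was used.
    """
    rng = random.Random(seed)       # the seed fully determines the schedule
    counter = 0
    local = {}                      # values read but not yet written back
    remaining = {"A": 2, "B": 2}    # steps left per thread
    schedule = []
    while remaining:
        thread = rng.choice(sorted(remaining))   # scheduling decision
        schedule.append(thread)
        if remaining[thread] == 2:
            local[thread] = counter              # step 1: read
        else:
            counter = local[thread] + 1          # step 2: write
        remaining[thread] -= 1
        if remaining[thread] == 0:
            del remaining[thread]
    return counter, "".join(schedule)

def find_failure(max_runs=100):
    """Re-run the same test under different schedules until it fails."""
    for seed in range(max_runs):
        value, schedule = run_test(seed)
        if value != 2:              # two increments should give 2
            return seed, schedule
    return None

failure = find_failure()
if failure is not None:
    seed, schedule = failure
    # Replay: the recorded seed reproduces the failing execution exactly.
    replay_value, replay_schedule = run_test(seed)
    print(f"seed {seed}: schedule {replay_schedule} gives {replay_value}")
```

Recording the seed is what makes replay possible here; a real tool has to record or control the scheduling decisions of actual threads, which this sketch does not attempt.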
15ConTest Technology in Brief
16ConTest Benefits and Future
- Benefits
- ConTest improves testing of concurrent and distributed applications for timing-related bugs from early development stages
- ConTest has minimal impact on the testing process and allows re-use of existing tests
- Reduction of maintenance cost due to higher quality
- Planned, or in the works
- Automated fix of concurrency bugs
- For some bug families, this already works
17ATS Automated Threshold Setting
18ATS Problem Statement
- Given
- A computer system, its components, and the applications running on the system
- A service dependency of applications on components
- When unknown, must revert to correlation analysis (data mining, statistical methods)
- Service-Level Objectives (SLOs) for the system/applications and indications of their violations
- A monitoring infrastructure that
- Monitors operational parameters at the components
- Generates/sends component alarms when measurements violate thresholds
- Compute thresholds on operational values of each component metric, such that
- Percentages of false alarms meet pre-specified levels
- Thresholds adapt to changes in workload patterns, system configuration, and SLOs
- The solution should be computationally efficient
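One way to quantify how well a threshold's alarms match SLO violations is through the predictive values (PPV and NPV) that the solution slide refers to. The sketch below is a minimal illustration with made-up data; the function name, the metric, and the alarm rule `metric > threshold` are assumptions, not part of the slides.

```python
def predictive_values(samples, threshold):
    """Return (PPV, NPV) of the alarm rule `metric > threshold`.

    `samples` is a list of (metric_value, slo_violated) pairs.
    PPV: fraction of alarms that coincide with an SLO violation.
    NPV: fraction of non-alarms that coincide with no violation.
    A value is None when there are no alarms (or no non-alarms) at all.
    """
    tp = fp = tn = fn = 0
    for value, violated in samples:
        alarm = value > threshold
        if alarm and violated:
            tp += 1                 # alarm, and the SLO was violated
        elif alarm:
            fp += 1                 # false alarm
        elif violated:
            fn += 1                 # missed violation
        else:
            tn += 1                 # correctly silent
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv

# Made-up history: (CPU utilisation in %, was the SLO violated?)
history = [(35, False), (50, False), (62, False), (70, True),
           (74, False), (81, True), (88, True), (93, True)]

for threshold in (60, 72, 80):
    print(threshold, predictive_values(history, threshold))
```

On this data, raising the threshold from 60 to 80 removes the false alarms (PPV rises) at the price of a missed violation (NPV falls), which is the trade-off a threshold-setting procedure has to balance.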
19ATS Motivation
- In complex computer systems, manually-set thresholds are NOT
- Indicative
- Adaptive
- Scalable
- Sub-optimal and rigid performance management
- Administrator overloaded
- Automating threshold setting will allow more
reliable use of component-level performance
parameters and thresholds for system-level
performance management
20ATS Solution Approach
- Use standard tools to measure operational parameters on components
- Use SLOs set by administrators or policy
- Automation of threshold computation procedure
- Start with initial component-level threshold values
- Use histories
- Of thresholds
- Of SLO violations
- Build a statistical model for the PPV (positive predictive value) and NPV (negative predictive value) of the thresholds, based on the SLO and threshold histories
- Compute updated thresholds via the model to satisfy target PPV and NPV
- Iterate the process to dynamically update the thresholds
- Regular regression is inapplicable; we use logistic regression
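The loop above can be sketched in a much simplified form. A plausible reason for preferring logistic regression is that the outcome (SLO violated or not) is binary. The code below is an assumption-laden illustration, not the ATS algorithm: it fits a one-variable logistic regression of SLO violation on a single metric by plain gradient descent, and picks the metric value at which the predicted violation probability reaches a target. The data, function names, and the use of a probability target in place of the slides' PPV/NPV targets are all hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit P(y=1 | x) = sigmoid(w*z + b), where z is the standardized x."""
    mu, sigma = mean(xs), pstdev(xs)
    zs = [(x - mu) / sigma for x in xs]
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for z, y in zip(zs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * z + b)))
            grad_w += (p - y) * z   # gradient of the log-loss w.r.t. w
            grad_b += (p - y)       # gradient of the log-loss w.r.t. b
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, mu, sigma

def threshold_for(target, model):
    """Metric value at which the fitted violation probability equals `target`."""
    w, b, mu, sigma = model
    z = (math.log(target / (1.0 - target)) - b) / w
    return mu + sigma * z

def update_threshold(history, target=0.9):
    """Re-fit the model on the current history and return a new threshold."""
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    return threshold_for(target, fit_logistic(xs, ys))

# Made-up history: (utilisation as a fraction, 1 if the SLO was violated)
history = [(0.35, 0), (0.50, 0), (0.62, 0), (0.70, 1),
           (0.74, 0), (0.81, 1), (0.88, 1), (0.93, 1)]

threshold = update_threshold(history)
# New measurements arrive: re-fit on the extended history (the "iterate" step).
history += [(0.66, 1), (0.58, 0)]
threshold = update_threshold(history)
print(f"updated threshold: {threshold:.3f}")
```

Standardizing the metric before fitting is a design choice that keeps plain gradient descent well behaved; a library implementation of logistic regression would serve equally well.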
21ATS Status and plans
- ATS algorithms formulated and successfully tested on a small laboratory system (2005)
- Paper published and patent filed (2005)
- Future versions will address large, complex systems
- Multiple and compound SLOs
- Suggest system reconfiguration to allow for better SLOs
22BCT Behavior Capture and Test
23Component-based software
- Component reuse
- Reduce costs
- Increase productivity
- Unexpected failures
- Components are robust and reliable, but designed without knowledge of the final system -> integration problems
- Integration testing problems
- no source code
- incomplete specifications
24Integration problems
- Inconsistent interpretation of parameters or values
- Each component's interpretation may be reasonable, but incompatible (Mars Climate Orbiter, Sept. 1999)
- Violations of value domains or of capacity or size limits
- Implicit assumptions on ranges of values or sizes
- Buffer overflow
- Side effects on parameters or resources
- Resources not explicitly mentioned in the interface
- Temporary files
- Missing or misunderstood functionality
- Underspecified functionality leads to incorrect assumptions
- Hit counts
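The first problem class, inconsistent interpretation of values, can be illustrated with a unit mismatch between two components. The sketch below is hypothetical: both functions and their numbers are invented for illustration, and only the conversion factor between pound-force seconds and newton seconds is a fixed physical constant.

```python
LBF_S_TO_N_S = 4.4482216152605   # 1 pound-force second in newton seconds

def thruster_impulse():
    """Component A (hypothetical): reports impulse in pound-force seconds."""
    return 10.0

def update_trajectory(impulse_newton_seconds, mass_kg):
    """Component B (hypothetical): expects newton seconds, returns delta-v in m/s."""
    return impulse_newton_seconds / mass_kg

# Direct integration: each component behaves as documented, the system does not.
wrong = update_trajectory(thruster_impulse(), mass_kg=100.0)

# Integration with an explicit unit conversion at the component boundary.
right = update_trajectory(thruster_impulse() * LBF_S_TO_N_S, mass_kg=100.0)

print(f"delta-v without conversion: {wrong:.4f} m/s")
print(f"delta-v with conversion:    {right:.4f} m/s")
```

Neither function is buggy in isolation, and no exception is raised; the error only shows up as a value that is off by a constant factor, which is why this class of problem is hard to catch without information about how each component behaves.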
25Verifying component-based systems
- Testing
- Mutational analysis [Ghosh, Mathur, TOOLS 2000]
- Dynamic analysis
- Only numeric data [Raz, Koopman, Shaw, ICSE 2002]
- Requires source code and focuses on data [McCamant, Ernst, ESEC/FSE 2003]
26Behavior Capture and Test (BCT)
- Key idea
- Integration analysis and test require information about component behavior
- Extensive reuse of components produces a lot of information
- Can we capture behavior information to test and analyze component integration?
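Capturing behavior information does not require the component's source code if calls are observed at its interface. The following is a minimal sketch of that idea, assuming a proxy object that forwards calls and logs them; the `Monitor` class and the `Stack` example component are hypothetical and are not BCT's actual monitoring mechanism.

```python
class Monitor:
    """Proxy that forwards calls to a component and records each one."""

    def __init__(self, component):
        self._component = component
        self.trace = []             # list of (method, args, result) events

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself,
        # i.e. for the component's own methods.
        target = getattr(self._component, name)

        def recorded(*args):
            result = target(*args)
            self.trace.append((name, args, result))
            return result

        return recorded

class Stack:
    """Example component; stands in for a reused third-party component."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def size(self):
        return len(self._items)

monitored = Monitor(Stack())
monitored.push(3)
monitored.push(5)
monitored.pop()
monitored.size()

for event in monitored.trace:
    print(event)
```

Every reuse of the component through such a proxy adds to the recorded traces, which is the raw material for the behavior models discussed next.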
27BCT Main Steps
- BCT
- Capture Behavior Data
- Monitor component execution
- Capture run-time information
- Distill Behavior Models
- I/O models
- Interaction models
- Verify the Run-Time Behavior
- Verify reused/replaced components with behavior
models
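The three steps can be sketched end to end with deliberately simple stand-ins for the models. In this illustration (an assumption, since the slide does not specify the model forms), the I/O model is the observed range of each method's numeric result, and the interaction model is the set of observed consecutive call pairs. The traces, the price-lookup component, and the function names are all made up.

```python
def distill(traces):
    """Build two simple models from traces of (method, argument, result) events."""
    io_model = {}                   # method -> (min result, max result)
    interactions = set()            # observed (previous method, next method) pairs
    for trace in traces:
        previous = "<start>"
        for method, _arg, result in trace:
            low, high = io_model.get(method, (result, result))
            io_model[method] = (min(low, result), max(high, result))
            interactions.add((previous, method))
            previous = method
    return io_model, interactions

def verify(trace, io_model, interactions):
    """Return descriptions of events in `trace` that contradict the models."""
    violations = []
    previous = "<start>"
    for method, arg, result in trace:
        if (previous, method) not in interactions:
            violations.append(f"unexpected call order: {previous} -> {method}")
        if method in io_model:
            low, high = io_model[method]
            if not low <= result <= high:
                violations.append(
                    f"{method}({arg}) returned {result}, outside ({low}, {high})")
        previous = method
    return violations

# Step 1: traces captured from the original component (made-up data).
recorded = [
    [("open", "db", 0), ("price", "apple", 12), ("price", "pear", 30), ("close", "db", 0)],
    [("open", "db", 0), ("price", "plum", 8), ("close", "db", 0)],
]
# Step 2: distill the models.
io_model, interactions = distill(recorded)

# Step 3: verify a replacement component that reports prices in cents.
replacement_trace = [("open", "db", 0), ("price", "apple", 1200), ("close", "db", 0)]
print(verify(replacement_trace, io_model, interactions))
```

The models only describe behavior that was actually observed, so a reported violation is a hint to investigate rather than proof of a fault; richer model forms would reduce false reports.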
28Contribution to Standards
- The Shadows project will be based on open standards for software lifecycle management
- Enable true collaboration and interoperability
- Faster adoption
- Example: the TPTP framework, enabled by the Eclipse open-source standard IDE
- Supports software modeling, testing, logging and profiling
29Contribution to Standards cont
- The Consortium seeks close and productive working connections with standards working groups
- Potential collaboration with DMTF
- CIM enhancements and refinements
- Automated Management models
- Behavior and State models
- Policy-Based Management
- Self-healing models
30Summary
- The Shadows initiative is an independent R&D effort, which aims to improve the state of the art in system lifecycle and system management
- Shadows will rely on its background technologies
- Expand them to fix bugs of various types
- Combine them
- to cover a large variety of problems
- for data sharing and mutual improvement
- Shadows will build on open standards and influence them
- The project entails collaboration with partners in Europe
- Feedback, early access, validation
31Backup Material