Title: Some thoughts for the industry session
1Some thoughts for the industry session
Cochin Conference Dec 18, 2002
- Prof. Kishor S. Trivedi
- Department of Electrical and Computer Engineering
- Duke University
- Durham, NC 27708-0291
- Phone (919)660-5269
- e-mail kst_at_ee.duke.edu
- At present Visiting Professor, IIT Kanpur, CSE Dept.
2What does industry want?
- Well trained students
- Short term research problems solved
- Short courses on timely topics
3What do faculty want?
- Funding for their research
- Place their students in good company labs
- Hope to get their research results transferred to industry
- To get to know important and difficult problems that can drive their research
4Some lessons learned
- Student placement should be guided by the advisor
- Start early with summer internship
- Patience is needed in listening to problems from industry
- Patience is needed in getting the IP problems resolved
- Expect to do at least 50% more work than the funding provided
- Tech transfer is a double-edged sword
- Practical problems can give rise to respectable research papers
- Short courses are ideal entry points
5Characteristics of the Systems being Studied
Dependability (Reliability, Availability, Safety)
- Redundancy: Hardware (Static, Dynamic), Information, Time
- Fault Types: Permanent, Intermittent, Transient, Design
- Fault Detection, Automated Reconfiguration
- Imperfect Coverage
- Maintenance: scheduled, unscheduled
6Characteristics of the Systems being Studied
- Performance
- Resource Contention, Concurrency and Synchronization
- Timeliness (Have to Meet Deadlines)
- Composite Performance and Dependability
- Degradable Levels of Performance
- Need Techniques and Tools that can Evaluate
- Systems with All the Characteristics Above
- Explicitly Address Complexity
7MEASURES TO BE EVALUATED
- Dependability
- Reliability R(t), System MTTF
- Availability: Steady-state, Transient, Interval
- Safety
- "Does it work, and for how long?"
- Performance
- Throughput, Loss Probability, Response Time
- "Given that it works, how well does it work?"
8MEASURES TO BE EVALUATED
- Composite Performance and Dependability
- "How much work will be done (lost) in a given interval, including the effects of failure/repair/contention?"
- Need Techniques and Tools That Can Evaluate
- Performance, Dependability and Their Combinations
9PURPOSE OF EVALUATION
- Understanding a System
- Observation
- Operational Environment
- Controlled Environment
- Reasoning
- A Model is a Convenient Abstraction
10PURPOSE OF EVALUATION
- Predicting Behavior of a System
- Need a Model
- Accuracy Based on Degree of Extrapolation
- All Models are Wrong; Some Models are Useful
- Prediction is fine as long as it is not about the
future
11Methods of Quantitative EVALUATION
- Measurement-Based
- Most believable, most expensive
- Not always possible or cost effective during
system design
12Methods of Quantitative Evaluation(Continued)
- Model-Based
- Less believable, Less expensive
- 1. Discrete-Event Simulation vs. Analytic
- 2. State-Space Methods vs. Non-State-Space Methods
- 3. Hybrid: Simulation + Analytic (SPNP)
- 4. State Space + Non-State Space (SHARPE)
13Why MODEL?
- Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g. Zitel, USS, HP
- A cost-effective means to evaluate a system, e.g. Boeing, USS, HP, IBM, Motorola, Cisco, SUN
14Why MODEL? (continued)
- Provides a means of evaluating a set of alternatives in a structured and quantitative manner, e.g. Zitel, DEC, HP
- Sometimes needed due to legal and contractual obligations, e.g. FAA
- Sometimes needed for business reasons, e.g. Motorola, SUN, Cisco
15Compare two CLIENT-SERVER Architectures
Architecture 1
16Compare Connection Reliabilities
- Connection reliability R(t) is the probability that throughout the interval [0,t) at least one path exists from the client to server on which all components are operational.
- From R(t), system mean time to failure can be computed
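The step from R(t) to MTTF uses the standard relation MTTF = integral of R(t) over [0, infinity). A minimal sketch of that computation follows, assuming a hypothetical connection with two independent parallel paths, each failing at a constant rate. The rates and function names are illustrative assumptions, not the two architectures compared in the talk.

```python
import math

def connection_reliability(t, path_rates):
    """R(t) for independent parallel paths, each with an exponential lifetime."""
    prob_all_paths_failed = 1.0
    for rate in path_rates:
        prob_all_paths_failed *= 1.0 - math.exp(-rate * t)
    return 1.0 - prob_all_paths_failed

def mttf_numeric(path_rates, horizon, steps=200_000):
    """Approximate MTTF = integral of R(t) over [0, horizon) by the trapezoid rule."""
    h = horizon / steps
    total = 0.5 * (connection_reliability(0.0, path_rates)
                   + connection_reliability(horizon, path_rates))
    for i in range(1, steps):
        total += connection_reliability(i * h, path_rates)
    return total * h

# Hypothetical per-path failure rates (per hour); illustrative values only.
rates = [1.0e-3, 2.0e-3]

# Closed form for two parallel exponential paths: 1/a + 1/b - 1/(a + b).
a, b = rates
mttf_exact = 1.0 / a + 1.0 / b - 1.0 / (a + b)

# The horizon is chosen large enough that the neglected tail of R(t) is tiny.
mttf_approx = mttf_numeric(rates, horizon=20_000.0)

print(f"closed-form MTTF = {mttf_exact:.2f} h, numeric MTTF = {mttf_approx:.2f} h")
```

The numeric route is the one that generalizes: once R(t) can be evaluated for a given architecture, the same integration applies even when no closed form is available.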
17Compare Connection Reliabilities
18Compare Connection Availabilities
- Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to server on which all components are operational.
- A(t) ≥ R(t), and the limiting or steady-state availability is the limit of A(t) as t → ∞
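The relation between A(t), R(t) and steady-state availability can be illustrated on the simplest repairable model: one component with constant failure rate and constant repair rate (a two-state Markov chain). This is a sketch under that assumption only; the rates `lam` and `mu` are hypothetical and the client-server architectures of the talk are not modeled here.

```python
import math

def reliability(t, lam):
    """R(t): no failure in [0, t) for a constant failure rate lam."""
    return math.exp(-lam * t)

def availability(t, lam, mu):
    """A(t) for a two-state (up/down) Markov chain that starts in the up state."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

# Hypothetical rates per hour: one failure per 1000 h, two-hour mean repair.
lam, mu = 1.0e-3, 0.5
steady_state = mu / (lam + mu)

for t in (0.0, 10.0, 100.0, 1000.0, 10000.0):
    print(f"t={t:8.0f} h  R(t)={reliability(t, lam):.6f}  A(t)={availability(t, lam, mu):.6f}")
print(f"steady-state availability = {steady_state:.6f}")
```

Repair is what separates the two curves: R(t) decays to zero, while A(t) settles at the steady-state value.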
19Compare Connection Availabilities
20MODELING THROUGHOUT SYSTEM LIFECYCLE
- System Specification/Design Phase
- Answer "What-if" Questions
- Compare design alternatives (Zitel, HP, Motorola)
- Performance-Dependability Trade-offs (DEC)
- Design Optimization (wireless handoff)
21MODELING THROUGHOUT SYSTEM LIFECYCLE
- Design Verification Phase
- Use Measurements + Models
- E.g. Fault Injection + Reliability Model
- Union Switch and Signals, Boeing, Draper
- Configuration Selection Phase: DEC
- System Operational Phase: Lucent
22CASE STUDY ZITEL
- Comparison of two different fault-tolerant RAMdisks.
- Stochastic Petri Net Package (SPNP) was used to model the two systems for their reliability.
23CASE STUDY ZITEL
- Trivedi worked with the designers directly
- Model Validation was done using face validation and sanity checks.
- Parameterization was easy due to the experience of the designers.
- One difficult research problem originated from the study. Subsequently solved and published in Microelectronics and Reliability journal.
24CASE STUDY VAXCLUSTER
- Developed three models of Processor Subsystem
- Two-Level Decomposition (IEEE-TR, Apr 89)
- Inner level: 9-state Markov
- Outer level: n parallel diodes
- A Detailed SPN Model (PNPM 89)
- A Detailed SPN model for Heterogeneous Cluster (Avresky book)
25CASE STUDY VAXCLUSTER
- Storage Subsystem Model: A fixed-point iteration over a set of Markov submodels. (IEEE-TR, to appear)
- Observed that availability is maximized with 2 processors (HCSS 90)
- Many interesting reliability, availability, performability measures computed
26Case Study HP
- Cluster Availability Modeling
- Server Availability
- Mass Storage Arrays Availability Modeling
- Started with Markov chains via SHARPE
- Progressed toward Stochastic Petri Nets and Stochastic Reward Nets via SPNP
27CASE STUDY LUCENT
- A Validated Model of Hardware-Software Availability.
- Worked with V. Mendiratta of Naperville.
- Model is semi-Markov, solved using SHARPE.
- Parameters collected from field data.
- Model results validated against actual
measurements.
28CASE STUDY LUCENT, IBM, Motorola, SUN
- Software Rejuvenation
- A technique to counter software aging and increase its availability to clients.
- Evaluated optimum rejuvenation interval which maximizes steady state availability (minimizes expected cost) for IBM cluster, Motorola CMTS cluster
- Collected data from real systems to show aging and to determine proactive fault management strategies. Worked in our lab, with SUN Microsystems
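The idea of an optimum rejuvenation interval can be sketched with a classical age-replacement renewal argument. This is a swapped-in textbook model, not the models used in the IBM or Motorola studies. Assumptions, all hypothetical: time to failure is Weibull with increasing hazard (aging), an unplanned repair takes `repair_time`, a planned rejuvenation takes the shorter `rejuv_time`, and the software is as good as new after either.

```python
import math

def weibull_survival(t, shape, scale):
    """Probability that the software has not failed by age t."""
    return math.exp(-((t / scale) ** shape))

def availability_curve(shape, scale, repair_time, rejuv_time, max_interval, steps):
    """Steady-state availability for each candidate rejuvenation interval T.

    A cycle ends at a failure (before age T) or at rejuvenation (at age T), so
    A(T) = E[uptime] / (E[uptime] + E[downtime]) with
    E[uptime] = integral of the survival function over [0, T).
    """
    h = max_interval / steps
    expected_uptime = 0.0
    previous = weibull_survival(0.0, shape, scale)
    curve = []
    for i in range(1, steps + 1):
        t = i * h
        current = weibull_survival(t, shape, scale)
        expected_uptime += 0.5 * (previous + current) * h  # trapezoid rule
        previous = current
        expected_downtime = repair_time * (1.0 - current) + rejuv_time * current
        curve.append((t, expected_uptime / (expected_uptime + expected_downtime)))
    return curve

# Hypothetical parameters (hours); shape > 1 means the hazard grows with age.
shape, scale = 2.0, 100.0
repair_time, rejuv_time = 5.0, 0.5

curve = availability_curve(shape, scale, repair_time, rejuv_time,
                           max_interval=500.0, steps=5000)
best_interval, best_availability = max(curve, key=lambda point: point[1])

# Baseline without rejuvenation: MTTF / (MTTF + repair time).
mttf = scale * math.gamma(1.0 + 1.0 / shape)
no_rejuvenation = mttf / (mttf + repair_time)

print(f"best interval ~ {best_interval:.1f} h, availability {best_availability:.4f}")
print(f"availability without rejuvenation {no_rejuvenation:.4f}")
```

Rejuvenating too often wastes uptime on planned outages, and rejuvenating too rarely lets costly failures dominate, so the availability curve has an interior maximum whenever the hazard is increasing and rejuvenation is cheaper than repair.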
29CASE STUDY MOTOROLA
- Availability and Performability Modeling
- Modeled several configurations of Communication Enterprise Common Platform.
- Practical approaches for approximating steady state measures in large, repairable, and highly dependable systems: model decomposition, state space truncation, etc.
- Both SHARPE and SPNP used.
30CASE STUDY MOTOROLA
- Recovery strategies in wireless handoff
- proposed and modeled several strategies
- a patent being filed by Motorola
- SPNP was used
- Hierarchy of two-level models used
- Fixed-point iteration was used
31CASE STUDY BELLCORE
- Architecture-based software reliability
- proposed a methodology
- applied the methodology to SHARPE
- used Bellcore's test coverage tool, ATAC, to parameterize the model
- Bellcore is currently enhancing ATAC to incorporate our methodology
32CASE STUDY DRAPER LAB
- Overall aim was verification of a system with very high reliability/availability specifications. Prototype under consideration was FTPP cluster 3.
- Hybrid approach proposed
- Fault injection based measurements.
- Statistical analysis of measured data to enable
parameterization of analytical models.
33CASE STUDY DRAPER LAB
- Reliability modeling of the prototype done. Parameterization done with the aid of existing reliability databases.
- Analytical solution provided exact closed form expressions
- Markov model solved using SHARPE
- Petri net model solved using SPNP
- Reliability bottlenecks found
34CASE STUDY AT&T
- GSHARPE
- A Preprocessor to SHARPE developed at Bell Labs by a Duke Student.
- User can specify Weibull failure times and lognormal and other repair time distributions.
- GSHARPE fits these to phase type distributions and produces a Markov model that is generated for processing by SHARPE
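To make the phase-type idea concrete, here is a minimal sketch that replaces a Weibull failure time with an Erlang distribution by matching the mean and approximating the squared coefficient of variation. This simple moment-matching rule is an assumption chosen for illustration; it is not the fitting algorithm GSHARPE uses, and all function names are hypothetical.

```python
import math

def weibull_moments(shape, scale):
    """Mean and variance of a Weibull(shape, scale) distribution."""
    mean = scale * math.gamma(1.0 + 1.0 / shape)
    second_moment = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mean, second_moment - mean ** 2

def fit_erlang(mean, variance):
    """Return (stages, rate) of an Erlang distribution with the same mean.

    An Erlang with k stages has squared coefficient of variation 1/k, so k is
    the integer closest to 1/scv. Only usable when the target scv is <= 1.
    """
    scv = variance / mean ** 2
    if scv > 1.0:
        raise ValueError("Erlang fit needs a squared coefficient of variation <= 1")
    stages = max(1, round(1.0 / scv))
    return stages, stages / mean

def erlang_generator(stages, rate):
    """Generator matrix of the resulting Markov chain; the last state is absorbing."""
    size = stages + 1
    q = [[0.0] * size for _ in range(size)]
    for i in range(stages):
        q[i][i] = -rate
        q[i][i + 1] = rate
    return q

# Hypothetical Weibull failure time: shape 2, scale 100 hours.
mean, variance = weibull_moments(2.0, 100.0)
stages, rate = fit_erlang(mean, variance)
q = erlang_generator(stages, rate)

print(f"Weibull mean {mean:.2f} h -> Erlang with {stages} stages, rate {rate:.5f} per hour")
```

Each Erlang stage becomes one exponential state, which is why a non-exponential distribution can be handed to a Markov solver at the cost of a larger state space.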
35CASE STUDY BOEING
- An Integrated Reliability Environment
- A working prototype
- Developed a high-level modeling language (SDM)
- Designed and implemented an intelligent
interpreter
36CASE STUDY BOEING (Continued)
- Interpreter determines which solution method is applicable
- Five different modeling engines are integrated
- CAFTA, SETS, EHARP, SHARPE and SPNP.
37QUANTITATIVE EVALUATION TAXONOMY
Closed-form solution
Numerical solution using a tool
38 MODELING TAXONOMY
39STATE SPACE MODELING TAXONOMY
40ANALYTIC MODELING TAXONOMY
- NON-STATE SPACE MODELING TECHNIQUES
Product form queuing models
SP reliability block diagrams
Non-SP reliability block diagrams
41State Space Modeling Taxonomy
State space methods
- Markovian modeling: discrete-time Markov chains, continuous-time Markov chains, Markov reward models
- Non-Markovian modeling: Semi-Markov models, Markov regenerative models, Non-Homogeneous Markov
42State-Space Based Models
- Transition label
- Probability: (homogeneous) discrete-time Markov chain (DTMC)
- Time-independent Rate: homogeneous continuous-time Markov chain
- Time-dependent Rate: non-homogeneous continuous-time Markov chain
- Distribution function: semi-Markov process
- Two Dist. Functions: Markov Regenerative Process
43IN ORDER TO FULFILL OUR GOALS OF
- Modeling Performance, Dependability and Performability
- Modeling Complex Systems
- We Need
- Automatic Generation and Solution of Large Markov
Reward Models
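A Markov reward model attaches a reward rate to each state of a Markov chain, and a measure is then an expectation of that reward. The sketch below is a small hand-built example, not output of SPNP or SHARPE: a hypothetical two-processor system with one repair facility, where the reward is the number of working processors. The steady-state vector is found by power iteration on the uniformized chain, chosen here only because it needs no external library.

```python
def steady_state(q, tol=1e-14, max_iter=1_000_000):
    """Steady-state vector of an irreducible CTMC with generator matrix q.

    Uses power iteration on the DTMC P = I + Q/rate, which has the same
    stationary vector as the CTMC.
    """
    n = len(q)
    rate = 1.05 * max(-q[i][i] for i in range(n))  # strictly above every exit rate
    p = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        change = max(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if change < tol:
            break
    total = sum(pi)
    return [x / total for x in pi]

# Hypothetical rates per hour: processor failure and (single) repair.
lam, mu = 0.001, 0.1

# State index 0: two processors up, 1: one up, 2: none up.
q = [
    [-2.0 * lam, 2.0 * lam, 0.0],
    [mu, -(mu + lam), lam],
    [0.0, mu, -mu],
]
rewards = [2.0, 1.0, 0.0]  # reward rate = number of working processors

pi = steady_state(q)
expected_capacity = sum(r * p_state for r, p_state in zip(rewards, pi))
availability = pi[0] + pi[1]  # same chain, reward 1 in the up states

print(f"steady-state probabilities: {[round(x, 6) for x in pi]}")
print(f"expected capacity = {expected_capacity:.6f}, availability = {availability:.6f}")
```

Changing only the reward vector turns the same chain into an availability, performance, or performability model, which is the appeal of the reward formulation.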
44IN ORDER TO FULFILL OUR GOALS OF
- Facility for State Truncation, Hierarchical composition of Non-State-Space and State-Space Models, Fixed-Point Iteration
- There are Two Tools that Potentially meet these Goals
- Stochastic Petri Net Package (SPNP)
- Symbolic Hierarchical Automated Rel. and Perf. Evaluator (SHARPE)
45MODELING SOFTWARE PACKAGES
- HARP - Hybrid Automated Reliability Predictor (Duke Univ., funded by NASA Langley)
- SAVE - System Availability Estimator (Duke Univ., funded by IBM)
- SHARPE - Symbolic Hierarchical Automated Reliability and Performance Evaluator, installed at nearly 280 locations (GUI available)
- SPNP - Stochastic Petri Net Package, installed at nearly 120 locations (iSPN - GUI available)
- D_RAMP for Union Switch and Signals by Duke, UVA and CMU
- SDM - Boeing Integrated Reliability Modeling Environment (Jointly developed by Duke Univ., Univ. of Wash. and Boeing)
- SDDS - Developed by Sohar with the help from K. Trivedi
- SREPT - Software Reliability Estimation and Prediction Tool
46Challenges in Modeling
47COMPLEXITIES OF MODELS
- Large State Space
- Model construction problem
- Model solution problem
- Model Stiffness.
- Fast and slow rates acting together
- Failure And Recovery/Repair
- Performance and failure
48COMPLEXITIES OF MODELS
- Modeling Non-Exponential Distributions
- Combining performance and reliability
- Believability/Understandability/Usability
- Incorporation in the design process
- Connection between measurements and models
- Parameterization
- Validation
49LARGENESS TOLERANCE
- Automated Model Construction
- Stochastic Petri nets (GreatSPN, SPNP, SHARPE, DSPNexpress, ULTRASAN)
- High level languages (SAVE, QNAP, ASSIST, SDM)
- Fault-Tree + Recovery Info (HARP)
- Object-Oriented Approaches (TANGRAM)
- Loops in the specification of CTMC (SHARPE)
50LARGENESS TOLERANCE
- Efficient numerical solution techniques
- Sparse Storage
- Accurate and Efficient Solution Methods
- We have Generated and Solved Models with 1,000,000 states (has gone up considerably recently)
- Steady-State: NEAR-Optimal SOR
- Transient: Modified Jensen's method
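Jensen's method is also known as uniformization (or randomization). The sketch below shows the plain textbook form of it for transient state probabilities, not the modified variant the slide refers to and not the code in SHARPE or SPNP. It uses dense lists for brevity, whereas a real solver would use sparse storage; the two-state example and its rates are hypothetical.

```python
import math

def transient_uniformization(q, p0, t, eps=1e-12):
    """State probabilities at time t for a CTMC with generator q and initial vector p0.

    p(t) = sum over k of Poisson(k; rate * t) * p0 * P^k, with P = I + Q/rate.
    The series is truncated once the neglected Poisson mass drops below eps.
    """
    n = len(q)
    rate = max(-q[i][i] for i in range(n))
    if rate == 0.0:
        return list(p0)  # no transitions at all
    if rate * t > 700.0:
        # exp(-rate * t) would underflow; large rate * t needs a more careful variant.
        raise ValueError("rate * t too large for this plain version")
    p = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
         for i in range(n)]
    weight = math.exp(-rate * t)  # Poisson probability of k = 0 jumps
    accumulated = weight
    vec = list(p0)                # holds p0 * P^k
    result = [weight * x for x in vec]
    k = 0
    while 1.0 - accumulated > eps:
        k += 1
        vec = [sum(vec[i] * p[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, vec)]
    return result

# Hypothetical two-state availability model: state 0 = up, state 1 = down.
lam, mu = 1.0e-3, 0.5
q = [[-lam, lam], [mu, -mu]]
t = 10.0
p_up, p_down = transient_uniformization(q, [1.0, 0.0], t)

# Closed form for the two-state chain, used only as a cross-check.
closed_form = mu / (lam + mu) + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)
print(f"uniformization A({t:.0f}) = {p_up:.9f}, closed form = {closed_form:.9f}")
```

The number of series terms grows with rate * t, which is why stiff models (large rates, long time horizons) call for the specialized methods discussed under stiffness tolerance.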
51MODEL SPECIFICATION LANGUAGES
- Different languages can be used to specify a single model type
- SAVE, QNAP, SPNP all appear very different; underlying model type is Markov
- Same language can be used to specify different model types
- RESQ input language used for PFQN or EQN
52LARGENESS AVOIDANCE
- Non-State-Space methods
- Reliability block diagrams
- Fault-trees
- Product-Form Queuing Networks
- Approximate solutions
- State Truncation
- SAVE, SPNP, ASSIST (Kantz and Trivedi PNPM91)
53LARGENESS AVOIDANCE
- Approximate solutions
- Hierarchical Decomposition (Chapter 11) and Fixed-Point Iteration among submodels
- Heidelberger and Trivedi, IEEE-TC, 1983 (Queueing Models)
- Ciardo and Trivedi PNPM91 (SPN Models)
- Tomek and Trivedi (Availability Models)
- Singhal (IEEE-TPDS, 1992)
- Chapter 11 of Sahner et al.
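Fixed-point iteration among submodels can be sketched as follows: each submodel is solved separately, its output becomes an input parameter of the other, and the exchange repeats until the values stop changing. The coupling used here is a hypothetical illustration (repair of one subsystem slows down in proportion to the other's unavailability) and is not taken from any of the cited papers.

```python
def unavailability(failure_rate, repair_rate):
    """Steady-state unavailability of a two-state (up/down) Markov submodel."""
    return failure_rate / (failure_rate + repair_rate)

def solve_fixed_point(lam_a, lam_b, mu, tol=1e-12, max_iter=1000):
    """Iterate between two availability submodels that share a repair resource.

    Hypothetical coupling: the effective repair rate of one subsystem is
    mu / (1 + q_other), where q_other is the other subsystem's unavailability.
    """
    q_a = q_b = 0.0  # start by assuming no interference
    for iteration in range(1, max_iter + 1):
        new_q_a = unavailability(lam_a, mu / (1.0 + q_b))
        new_q_b = unavailability(lam_b, mu / (1.0 + new_q_a))
        change = max(abs(new_q_a - q_a), abs(new_q_b - q_b))
        q_a, q_b = new_q_a, new_q_b
        if change < tol:
            return q_a, q_b, iteration
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical rates per hour.
lam_a, lam_b, mu = 0.01, 0.02, 1.0
q_a, q_b, iterations = solve_fixed_point(lam_a, lam_b, mu)

print(f"unavailability A = {q_a:.8f}, B = {q_b:.8f} after {iterations} iterations")
```

Neither submodel ever sees the joint state space; the price is that the result is an approximation whose accuracy depends on how weakly the submodels interact.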
54LARGENESS AVOIDANCE
- Approximate solutions
- Time-Scale Decomposition
- Bobbio and Trivedi (IEEE-TC, 1986), Section 11.2
- Fluid Approximation
- Mitra, Kulkarni, Ciardo, Nicol, and Trivedi
- FSPN
- Performability (Chapters 6 and 12)
55Difficulties in Modeling Using MRMs
- Stiffness
- Causes numerical difficulties in solution
- Stiffness Tolerance
- Develop stiffness tolerant numerical solution methods
- Stiffness Avoidance
- Avoid generating stiff models through decomposition
56STIFFNESS TOLERANCE
- Automatic Detection of Stiffness (HARP)
- Special Stable ODE Solver
- Reibman and Trivedi (TR-BDF2)
- Computers and Operations Research, 1988.
- Malhotra and Trivedi (Pade, Implicit RK)
57STIFFNESS TOLERANCE
- Uniformization for Stiff Markov Chains
- Muppala and Trivedi
- We can solve models with rate ratios of 10^8 or higher
- Implemented in SHARPE and SPNP
58STIFFNESS AVOIDANCE
- Model-level decomposition
- Behavioral Decomposition (HARP, Bobbio and Trivedi): Fault-Occurrence vs. Fault/Error Handling
- Hierarchical Composition (SHARPE): Composition of Submodel solutions without generating a single one-level overall model
- Fixed-Point Iteration (Ciardo and Trivedi, SPNP)
59Non-Exponential Behavior
- Non-state-space models (Fault Trees, Reliability Graphs, RBDs): no problem
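Why non-exponential distributions pose no problem for non-state-space models can be seen in a short sketch: a series-parallel reliability block diagram only combines component reliabilities at time t, so any lifetime distribution can be plugged in. The block structure and Weibull parameters below are hypothetical, and components are assumed to fail independently.

```python
import math

def weibull_reliability(t, shape, scale):
    """Component reliability at time t for a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))

def series(*reliabilities):
    """All blocks must work."""
    product = 1.0
    for r in reliabilities:
        product *= r
    return product

def parallel(*reliabilities):
    """At least one block must work."""
    prob_all_fail = 1.0
    for r in reliabilities:
        prob_all_fail *= 1.0 - r
    return 1.0 - prob_all_fail

def system_reliability(t):
    """Hypothetical system: two redundant servers in series with one network link."""
    server = weibull_reliability(t, shape=1.5, scale=2000.0)  # wear-out behavior
    link = weibull_reliability(t, shape=1.0, scale=10000.0)   # exponential special case
    return series(parallel(server, server), link)

for t in (0.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f"t={t:6.0f} h  R_sys(t)={system_reliability(t):.6f}")
```

No state space is generated at all, which is why these models stay cheap to solve; the restriction is the independence assumption and the lack of repair dependencies.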
60Non-Exponential Behavior in State Space Models
61NON-EXPONENTIAL DISTRIBUTIONS
- Phase-Type Expansions
- Malhotra and Reibman (GSHARPE)
- See Figure 9.38 on p. 191 (Red Book)
- Non-Homogeneous Markov Chains
- CARE III, HARP
- Soft Reliability model with imperfect repairs
- solved using SHARPE
62NON-EXPONENTIAL DISTRIBUTIONS
- Semi-Markov Chains
- Ciardo et al, IEEE-TC Oct. 90
- Markov Regenerative Processes
- Choi, Logothetis, Kulkarni, Trivedi
- DSPN and MRSPN
- Choi, Kulkarni, Trivedi
- Discrete-Event Simulation
- Now in SPNP (FSPN and Non-Markovian SPN Simulation), RESQ, QNAP
63BELIEVABILITY/UNDERSTANDABILITY
- Integration of Measurements and Models
- Measurements Provide Parameters to Models
- Models Provide Guidelines For Measurements
- Models Validated Against Measurements
- Integration of Different Modeling Tools
- Boeing SDM project
- IDEAS project at Duke
64BELIEVABILITY/UNDERSTANDABILITY
- Many Case-Studies of Validations Needed
- VAXcluster Availability Model: Wein and Sathaye
- Hsueh, Iyer and Trivedi, IEEE-TC, Apr. 1988
- AT&T Validation of ESS
- Technology Transfer
- Seminars and Workshops
- Development and Dissemination of Tools
- Application of the Techniques and Tools
65MODELING AND MEASUREMENTS INTERFACES
- Measurements supply Input Parameters to Models
- (Model Calibration or Parameterization)
- Confidence Intervals should be obtained
- Boeing, Draper, Union Switch projects
- Model Sensitivity Analysis can suggest which Parameters to Measure More Accurately: Blake, Reibman and Trivedi, SIGMETRICS 1988.
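How sensitivity analysis points at the parameters worth measuring carefully can be sketched with finite differences. This is an illustration only and not the method of the cited paper: the model is a hypothetical series system of two repairable components, and the scaled sensitivity (parameter value times the partial derivative) is estimated numerically.

```python
def system_availability(params):
    """Steady-state availability of two independent repairable components in series."""
    a1 = params["mu1"] / (params["lam1"] + params["mu1"])
    a2 = params["mu2"] / (params["lam2"] + params["mu2"])
    return a1 * a2

def scaled_sensitivities(model, params, rel_step=1e-6):
    """Estimate p * d(model)/dp for every parameter p by central differences."""
    result = {}
    for name, value in params.items():
        up = dict(params)
        up[name] = value * (1.0 + rel_step)
        down = dict(params)
        down[name] = value * (1.0 - rel_step)
        result[name] = (model(up) - model(down)) / (2.0 * rel_step)
    return result

# Hypothetical failure and repair rates per hour.
params = {"lam1": 1.0e-3, "mu1": 0.1, "lam2": 1.0e-4, "mu2": 1.0}

sens = scaled_sensitivities(system_availability, params)
ranking = sorted(sens, key=lambda name: abs(sens[name]), reverse=True)

for name in ranking:
    print(f"{name}: scaled sensitivity {sens[name]:+.3e}")
```

Parameters at the top of the ranking are the ones where a given relative measurement error changes the result the most, so they deserve the tightest confidence intervals.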
66MODELING AND MEASUREMENTS INTERFACES
- Model Validation
- 1. Face Validation
- 2. Input-Output Validation
- 3. Validation of Model Assumptions
- (Hypothesis Testing)
- Rejection of a hypothesis regarding model
assumption based on measurement data leads to an
improved model
67MODELING AND MEASUREMENTS INTERFACES
- Model Structure Based on Measurement Data: Hsueh, Iyer and Trivedi, IEEE-TC, April 1988; Gokhale et al., IPDS 98