Title: PhD Dissertation Defense: Energy-Efficient Pro-active Techniques for Safe and Survivable Cyber-Physical Systems
1PhD Dissertation Defense: Energy-Efficient
Pro-active Techniques for Safe and Survivable
Cyber-Physical Systems
- By
- Tridib Mukherjee
- Committee
- Prof. Sandeep Gupta
- Prof. Karamvir Chatha
- Prof. Partha Dasgupta
- Prof. Daniel Stanzione
Sponsors
2Outline
- Cyber-Physical Systems (CPS)
- Crisis response planning and preparedness
- Energy-efficient job management in data centers
- Ad hoc Networks
- Conclusions/ Future Research Directions
3Cyber Physical Systems (CPS)
From interactive to pro-active systems
Courtesy Vanderbilt University Drexel
University
Courtesy Idealog Magazine
- Pro-active systems can anticipate an event and
act in advance to avoid or minimize the
consequences of the event.
- Migration from interactive to pro-active
computing for systems intimately connected to
the world around them was suggested in 2000 by David
Tennenhouse, Director of Intel Research.
- Pro-active CPS can involve actions in both the
physical and cyber world.
- Examples of pro-active operations in the physical
environment:
- pre-setting the cooling in a data center to avoid
equipment redline temperatures.
- preparedness drills for responding to
crises/disasters.
- Dynamic distributed systems to monitor,
coordinate, control, integrate and facilitate
physical processes.
- Physical environment can consist of human
inhabitants.
- Computing entities are autonomic and embedded.
- Operations in computing entities affect the
physical environment and vice versa.
- Key Issues
- Physical Interactions
- Critical Applications
4Research Problem and Approach
- Three major problems of pro-active operations in
CPS:
- can be difficult to achieve under uncertain
environments in CPSs (e.g. crisis response).
- can lead to high cost of operation for large
scale CPS (e.g. data centers).
- can be highly energy-inefficient for
energy-constrained computing entities (e.g.
ad hoc networks).
- Research approach: constraint-based optimization
to balance pro-activity for three different
applications with different objectives and
requirements:
- Crisis preparedness: pro-active planning and
evaluation of crisis response when action
outcomes are uncertain, while meeting real-time
constraints for human survivability.
- Data centers: pro-active job scheduling to
dynamically reduce cooling demands while meeting
thermal constraints for equipment safety.
- Ad hoc networks: pro-active route management to
meet end-to-end reliability constraints while
minimizing the energy overhead.
How to balance pro-activity depending on system
requirements?
5Research Contributions
6Outline
- Cyber-Physical Systems (CPS)
- Crisis response planning and preparedness
- Energy-efficient job management in data centers
- Ad hoc Networks
- Conclusions/ Future Research Directions
7Importance of Crisis Preparedness
- In 2004, over $4 billion of Homeland Security
Grants was allocated for assistance to the first
responders.
- In 2005, a $7.4 billion fund was budgeted for Emergency
Preparedness and Response (around 20% of the
total budget).
- Over $3.5 billion (50%) was budgeted for assistance
to first responders.
- Since March 1, 2003, approximately $8 billion has been
awarded to state, tribal and local governments to
prevent, prepare for, respond to and recover from
acts of terrorism and all hazards.
8Critical Application Fire in Building
Critical Event
Additional Critical Events
Detection
Detection
Crisis
Response
Recovery
Preparedness
Evaluation of Crises Response
Trapped People Rescuers
Detect fire using information from sensors
- Notify 911
- provide information to the first responders
Detect trapped people
HUMAN INTERACTION
- Analyze the spatial properties:
- how to reach the source of the fire
- which exits are closest
- whether the closest exit is free to get out
- Determine the required actions:
- instruct the inhabitants to go to the nearest safe
place
- co-ordinate with the rescuers to evacuate
(normally using ad hoc networks).
- Requires pro-active evaluation and planning of
crisis response
Survivability: effectiveness of the response plan in
avoiding disasters (life/property losses)
How to evaluate and plan actions with uncertain
outcomes?
9Criticality Critical Event Management
- Critical events
- Cause emergencies/crises.
- Lead to loss of lives/property.
- Criticality
- Effects of critical events on the
smart-infrastructure.
- Critical state: state of the system under
criticality.
- Window-of-opportunity (W): temporal constraint
for criticality.
- Survivability: effectiveness of the criticality
response actions in minimizing the disasters.
Critical Event
CRITICAL STATE
NORMAL STATE
Timely Criticality Response within
window-of-opportunity
Mismanagement of any criticality
DISASTER (loss of lives/property)
10Related Work
Preparedness Measures
Unaware of uncertainties
Cumbersome Documents
Model-based Verification
Reliability
Formal Modeling
QOS
Preparedness Drills
Physical lay-out design
Real-time
Objective Evaluation
Personnel Training
Pro-activity
Synergistic Planning
Human
Cyber
Physical
Cyber- physical
Level of Abstraction
- Different modeling options
- Hybrid automata can capture continuous time
dynamics in the physical world.
- A special case is timed I/O automata, which can
capture time variation for the system.
- Recent work has focused on probabilistic timed
automata.
- We use a Markov Decision Process.
- Can enable developing stochastic planning
policies.
11Background on Model based Verification/Analysis
Markov Decision Process based Criticality
Response Model (CRM)
- Model-based analysis is normally used to verify
critical systems such as avionics.
- No need for actual scenario generation putting
lives/property at risk.
- Formal models for abstraction of the system
behavior.
- Expected system properties depend on the
requirements.
- Formal models are analyzed through model checking to
verify the system properties.
- We use model-based analysis to evaluate the
effectiveness of crisis response processes.
System Behavior
System Requirements
Formal Models
Expected Properties
Model Checking
Property Verification
Requirement Verification
Criticality Response Evaluation Tool (CRET)
CRM can also be used to develop Criticality
Response Planning (CRP) policies
12Proposed Markov Decision Process Criticality
Response behavior Model (CRM)
- State-based stochastic model
- System in different critical states
- A state represents the combination of
criticalities in the system
- States are organized in a hierarchical manner
- A level in the hierarchy represents the number of
criticalities in each state in that level
- Normal state has level 0 (i.e. there are no
criticalities in the normal state)
- Critical Events
- Make state transitions down the hierarchy
- Associated with criticality characteristics:
- Window-of-opportunity
- Probability of the critical event
- Time to detect the criticality
- Mitigative Link
- Corresponds to response actions
- Makes state transitions up the state hierarchy.
- Associated with response action characteristics:
- Probability of the action's success considering
uncertainties due to human involvement.
- Time to complete the action
NORMAL STATE
Mitigative Link (ML)
Critical Event
Survivability: the probability of reaching the normal
state depends on the MLs' success probabilities,
the probabilities of additional criticalities, and
conformity to the window-of-opportunity.
T. Mukherjee, K. Venkatasubramanian and S. K. S.
Gupta, "Performance Modeling of Critical Event
Management for Ubiquitous Computing Applications",
Proceedings of ACM MSWiM (MSWiM'06),
Torremolinos, Spain, October 2006
13Reachability to the Normal State
- Reachability to the normal state (sn) from any
arbitrary critical state s.
- An action's Q-value (qualifiedness) is determined by
the probability of reaching the normal state when the
action is performed.
- s' is an immediate upstream state reached when action a is
performed at state s, with probability p(s, a, s').
[Figure: state s reaches upstream state s' through action a with probability p(s, a, s'). The action's Q-value distinguishes the cases s' = sn and s' ≠ sn, and whether the window-of-opportunity (WOOP) is met or not met.]
- The probability of reaching the normal state from
state i combines:
- the probability of a criticality at state i
- the probability of reaching the normal state if no
additional criticality occurs at state i
- the probability of reaching the normal state if any
additional criticality occurs at state i
Normal state is stochastically reachable from a
state iff the maximum Q-value from that state is
non-zero.
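The reachability idea on this slide can be illustrated with a minimal sketch. The slide's exact equations did not survive extraction, so the recursion below is an assumed form: an action's Q-value is its success probability times the reachability of the upstream state, and it is zero when the window-of-opportunity would be missed. The state names, probabilities, durations and the `WINDOW` value are all hypothetical.

```python
NORMAL = "normal"

# Hypothetical model; all numbers are illustrative only.
# ACTIONS[state] -> list of (action name, success probability, duration, upstream state)
ACTIONS = {
    "c1": [("extinguish", 0.8, 4.0, NORMAL)],
    "c2": [("rescue", 0.9, 2.0, NORMAL)],
    "c1+c2": [("evacuate", 0.6, 3.0, "c1"), ("extinguish", 0.5, 4.0, "c2")],
}
# CRITICAL_EVENTS[state] -> (probability of an additional criticality, downstream state)
CRITICAL_EVENTS = {"c1": (0.25, "c1+c2")}
WINDOW = 8.0  # assumed window-of-opportunity, same time unit as the durations


def q_value(state, action, elapsed):
    """Assumed Q-value: probability of reaching the normal state via this action."""
    _name, p_success, duration, upstream = action
    finish = elapsed + duration
    if finish > WINDOW:  # window-of-opportunity not met
        return 0.0
    return p_success * reach_probability(upstream, finish)


def reach_probability(state, elapsed):
    """Probability of reaching the normal state from `state` after `elapsed` time."""
    if state == NORMAL:
        return 1.0
    best = max((q_value(state, a, elapsed) for a in ACTIONS.get(state, [])), default=0.0)
    p_crit, downstream = CRITICAL_EVENTS.get(state, (0.0, None))
    if downstream is None:
        return best
    # Weight the two cases: no additional criticality vs. an additional criticality.
    return (1.0 - p_crit) * best + p_crit * reach_probability(downstream, elapsed)


def is_reachable(state, elapsed=0.0):
    """Normal state is stochastically reachable iff the maximum Q-value is non-zero."""
    return max((q_value(state, a, elapsed) for a in ACTIONS.get(state, [])), default=0.0) > 0.0
```

The recursion terminates because every action advances the elapsed time and actions that overshoot the window contribute zero.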
14CRP strategies
- Optimal: at each state, select the action with the max
Q-value.
- Greedy: at each state, select the action with the
optimum values of immediate parameters,
e.g. Minimum Time (MT), Maximum Probability (MP),
Maximum number of Mitigated Criticalities (MMC).
- Markov Decision Planning (MDP): at each state,
select the action with maximum utility.
- Utility uses the state-based stochastic model
parameters.
15MDP-based CRP strategies
- At a state, an action has a utility based on the
action's probability and reward.
- The action's reward function can be a combination of
the associated parameters.
- Locally maximum criticality Mitigation Per unit
Time (LMPT)
- No knowledge of subsequent criticalities in the
reward.
- Reward: number of criticalities mitigated
per unit time.
- Subsequent Criticality Aware locally maximum
criticality Mitigation Per unit Time (SCAMPT)
- The action's reward in LMPT is enhanced with the
knowledge of probable subsequent criticalities.
- Reward is the same as LMPT except that probabilities
of subsequent criticalities are taken into
account.
[Equation labels: utility combines the action's reward and the expected maximum utility from the next state.]
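To make the difference between the selection rules concrete, here is a small sketch of the greedy strategies (MT, MP, MMC) next to a one-step LMPT-style utility. The utility form used here (success probability times criticalities mitigated per unit time, ignoring the expected utility of the next state) is an assumption, and the candidate actions and their numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    success_prob: float   # probability that the action succeeds
    duration: float       # time to complete the action
    mitigated: int        # number of criticalities mitigated on success


def select_mt(actions):
    """Minimum Time: pick the action that completes fastest."""
    return min(actions, key=lambda a: a.duration)


def select_mp(actions):
    """Maximum Probability: pick the action most likely to succeed."""
    return max(actions, key=lambda a: a.success_prob)


def select_mmc(actions):
    """Maximum number of Mitigated Criticalities."""
    return max(actions, key=lambda a: a.mitigated)


def lmpt_utility(action):
    """Assumed one-step utility: probability x criticalities mitigated per unit time."""
    return action.success_prob * action.mitigated / action.duration


def select_lmpt(actions):
    return max(actions, key=lmpt_utility)


# Hypothetical candidates at one critical state.
CANDIDATES = [
    Action("guide_to_exit", 0.9, 6.0, 1),
    Action("clear_path", 0.6, 2.0, 1),
    Action("full_evacuation", 0.5, 8.0, 3),
    Action("open_vents", 0.8, 4.0, 2),
]
```

With these illustrative numbers each rule picks a different action, which is the point of comparing strategies through a common measure such as the Q-value.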
Tridib Mukherjee, and Sandeep K. S. Gupta, CRM
A Formal Method to Model Evaluate Crises
Response of Distributed Cyber-Physical Systems,
Under Review in TPDS.
16CRM for fire emergencies in Offshore Oil Gas
Production Platforms (OGPP)
- c1: Fire Alarm.
- c2: Imminent danger, e.g. health hazards.
- c3: Assistance required to others, e.g. trapped
personnel.
- c4: Evacuation path not tenable.
[Figure: CRM state diagram for fire emergencies in OGPP. States combine the criticalities: Fire Alarm; Fire Alarm + Imminent Danger; Fire Alarm + Non-tenable Path; Fire Alarm + Assistance Required; Fire Alarm + Imminent Danger + Assistance Required; Fire Alarm + Imminent Danger + Non-tenable Path; Fire Alarm + Assistance Required + Non-tenable Path; Fire Alarm + Imminent Danger + Assistance Required + Non-tenable Path. Edges are labeled with state transition probabilities.]
Window-of-opportunity: survival time under asphyxiation.
State transition probabilities are derived from the
established probability distribution in [1].
[1] D. G. DiMattia, F. I. Khan, and P. R.
Amyotte, "Determination of human error
probabilities for offshore platform musters",
Journal of Loss Prevention in the Process
Industries, vol. 18, pp. 488-501, 2005.
Tridib Mukherjee, and Sandeep K. S. Gupta, A
Modeling Framework for Evaluating Effectiveness
of Smart-Infrastructure Crises Management
Systems , 2008 IEEE International Conference on
Technologies for Homeland Security (HST'08),
Waltham, MA, USA, April 2008
Enables Objective Evaluation of Criticality
Response in OGPP to Improve Crisis Preparedness
17Sample Q-value Analysis
- Preparedness: Q-value based analysis allows
comparison among plans for
- Different numbers of criticalities
- Different detection and action completion times
- Different states (i.e. different combinations of
simultaneous criticalities)
Other applications: resource access control to
facilitate the planned actions under emergencies.
18Criticality Response Evaluation Tool (CRET)
AADL based Criticality Response System
Architecture Specification
Model based decision
Model Representation
AADL based CRP Specification
AADL based CRM Specification
Model Parsing
AADL OSATE Analysis Plug-ins
XML Representation and Analysis Software using
Matlab
Can specify any response planning policy
transcending beyond the proposed CRP strategies
Q-value Analysis
Model Processing
Preparedness Check Reachability to normal state
based on Q-value analysis
Tridib Mukherjee, and Sandeep K. S. Gupta, CRET
A Crisis Response Evaluation Tool to Improve
Crisis Preparedness, 2009 IEEE International
Conference on Technologies for Homeland Security
(HST'09), Waltham, MA, USA, May 2009
19Summary of Contributions
- Crisis Response Model (CRM)
- Markov decision process based modeling of crisis
response
- Development of Q-value as evaluation criterion for
reachability to normal state
- Crisis Response Planning (CRP)
- Optimal and naïve (greedy) strategies
- Markov decision planning strategies
- Crisis Response Evaluation Tool
- Objective evaluation of crisis response
T. Mukherjee, K. Venkatasubramanian and S. K. S.
Gupta, "Performance Modeling of Critical Event
Management for Ubiquitous Computing Applications",
Proceedings of ACM MSWiM (MSWiM'06),
Torremolinos, Spain, October 2006
Tridib Mukherjee, and Sandeep K. S. Gupta, CRM
A Formal Method to Model Evaluate Crises
Response of Distributed Cyber-Physical Systems,
Under Review in TPDS.
Tridib Mukherjee, and Sandeep K. S. Gupta, A
Modeling Framework for Evaluating Effectiveness
of Smart-Infrastructure Crises Management
Systems , 2008 IEEE International Conference on
Technologies for Homeland Security (HST'08),
Waltham, MA, USA, April 2008
Tridib Mukherjee, and Sandeep K. S. Gupta, CRET
A Crisis Response Evaluation Tool to Improve
Crisis Preparedness, 2009 IEEE International
Conference on Technologies for Homeland Security
(HST'09), Waltham, MA, USA, May 2009
K. Venkatasubramanian, T. Mukherjee, and S. K. S.
Gupta, ''CAAC - An Adaptive and Proactive Access
Control Approach for Emergencies for Smart
Infrastructures", Accepted in the Special Issue
on Adaptive Security Systems in ACM Transactions
on Autonomic and Adaptive Systems (TAAS).
S. K. S. Gupta , T. Mukherjee, and K.
Venkatasubramanian, Criticality Aware Access
Control Model For Pervasive Applications",
Proceedings of 4th IEEE Conf. on Pervasive
Computing (PERCOM), Pisa, Italy, 2006.
20Outline
- Cyber-Physical Systems (CPS)
- Crisis response planning and preparedness
- Energy-efficient job management in data centers
- Ad hoc Networks
- Conclusions/ Future Research Directions
21Importance of the Problem
- Cooling is the chief driver of increased data
center construction cost, costing up to $5000 per
square foot in initial purchase price.
- Cooling is one of the leading contributors to
ongoing total cost of ownership, costing one half
to one watt per watt spent on computation.
- If we can eliminate even 25% of total cooling
costs, that can translate to a $1-2 million
annual cost reduction in a single large data
center.
22Related Work
Proactive Approach
Reactive Solutions
23Heat Interferences in Data Centers
Safety: inlet temperatures should be within the red-line
temperature to avoid equipment failure.
Problem: cooling has to be pro-actively set very
low to keep all inlet temperatures under the redline.
Solution: proactive spatio-temporal job scheduling
to minimize interference and cooling demands.
24Typical HPC Job Characteristics
- Job execution times are usually overestimated
during submission in HPC data centers.
- Jobs can be spread over time to reduce peak
utilization.
- Trade-off with throughput, turn-around time and
resource utilization.
From job traces at the ASU HPC data center
25Conventional Spatial and Temporal Scheduling
26Balancing Utilization Over Time
27Conceptual overview of thermal-aware job
scheduling
Balancing utilization over time reduces the peak
computing resource utilization, leaving room for
thermal-aware spatial scheduling at all times.
Peak air inlet temperature determines the upper bound
to the CRAC temperature setting.
CRAC temperature setting determines its
efficiency (Coefficient of Performance).
Spatial job scheduling (placement) determines the
temperature distribution at any time using a
linear thermal model.
Coefficient of Performance (source: HP)
The lower the peak inlet temperature, the higher
the CRAC efficiency.
Q. Tang, T. Mukherjee, S. K. S. Gupta, and P.
Cayton, ''Sensor-based Fast Thermal Evaluation
Model for Energy-efficient High-performance
Datacenters", In the International Conf.
Intelligent Sensing Info.Proc. (ICISIP2006), Dec
2006.
Temperature distribution determines the
equipment peak air inlet temperature.
T. Mukherjee, G. Varsamopoulos, S. K. S. Gupta,
and S. Rungta, "Measurement-based Power
Profiling of Datacenter Equipment", (Extended
Abstract) In the Workshop on Green Computing
(with CLUSTER2007), Austin, USA, Sep 2007.
There is a spatio-temporal job schedule that
minimizes the total energy (cooling + computing)
consumption. Find it!
28Thermal-aware Job Scheduling Problem
- PROBLEM: Given a set of incoming jobs, find a job
schedule (i.e. job start times) and placement
(i.e. server assignment) that minimize the total
data center energy consumption subject to meeting
the job deadlines (submitted times for execution).
This requires 3D (job x server x time)
decision-making.
Cooling Energy
Supply Temperature Upper Bound
Computing Energy
Job Migration Overhead
Capacity Constraint: servers assigned must not exceed
servers available.
Server Required: the required number of servers is assigned
to each job.
Deadline Constraint: job finish time is earlier than its
deadline.
Arrival Constraint: job start time is later than its
arrival.
T. Mukherjee, A. Banerjee, G. Varasamopoulos, and
S. K. S. Gupta, Spatio-temporal Thermal-Aware
Job Scheduling to Minimize Energy Consumption in
Virtualized Heterogeneous Data Centers", Elsevier
Journal on Computer Networks (ComNet), Special
Issue on Virtualized Data Centers, ACCEPTED
(2009).
29Thermal-aware Job Scheduling Algorithms
- SCINT algorithm: heuristic solution (genetic
algorithm).
- Takes a feasible solution and performs mutations
until a certain number of iterations is reached.
- Spreads the jobs over time while meeting the
deadlines.
- Offline in nature, requiring the job backlog
information.
- Takes hours of operation.
- EDF-LRH algorithm: tries to mimic the behavior of
SCINT by spreading jobs using the Earliest
Deadline First (EDF) scheduling approach.
- Places jobs on servers contributing the Lowest
Recirculated Heat (LRH).
- Online in nature, maintaining EDF job queues as
and when jobs arrive.
- Takes milliseconds of operation.
- FCFS algorithm: keeps the conventional temporal
scheduling approach but uses thermal-aware job
placement techniques for energy savings.
- Places jobs on servers contributing the Lowest
Recirculated Heat (LRH).
- Online in nature, taking milliseconds of
operation.
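As a rough illustration of the EDF-LRH idea described above (earliest deadline first in time, lowest recirculated heat in space), the sketch below places jobs for a single scheduling instant. It is not the published algorithm: the function name, the job tuples and the per-server recirculated-heat values are hypothetical, and job durations, server release and migration are left out.

```python
def edf_lrh_place(jobs, recirculated_heat):
    """Place jobs for one scheduling instant.

    jobs: list of (job_id, deadline, servers_needed)
    recirculated_heat: dict mapping server id -> recirculated heat contribution
    Returns (placement, waiting): placement maps job_id -> list of servers,
    waiting lists the jobs that could not be placed yet.
    """
    # LRH: servers that contribute the least recirculated heat come first.
    free = sorted(recirculated_heat, key=recirculated_heat.get)
    placement = {}
    waiting = []
    # EDF: consider the job with the earliest deadline first.
    for job_id, _deadline, needed in sorted(jobs, key=lambda job: job[1]):
        if needed <= len(free):
            placement[job_id] = free[:needed]
            free = free[needed:]
        else:
            waiting.append(job_id)  # stays queued until servers are released
    return placement, waiting


# Hypothetical inputs.
HEAT = {"s1": 0.30, "s2": 0.10, "s3": 0.20, "s4": 0.40}
JOBS = [("j1", 50, 2), ("j2", 20, 1), ("j3", 30, 3)]
PLACEMENT, WAITING = edf_lrh_place(JOBS, HEAT)
```

Both steps reduce to sorting, which is in line with the slide's point that the online variants run in milliseconds while the genetic search does not.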
30Total Energy Consumption
- SCINT saves up to 60% of energy consumption.
- EDF-LRH mimics the behavior of SCINT, especially
for low average data center utilization.
31Summary of Contributions
- Problem formulation to minimize
energy consumption in data centers
- Spatio-temporal thermal-aware job scheduling
algorithms
- Offline algorithm: SCINT
- Online algorithm: EDF-LRH
- Measurement-based power profiling of data center
equipment
- Linear power model
- Preliminary software architecture
- Configure MOAB for thermal-aware job placement.
Q. Tang, T. Mukherjee, S. K. S. Gupta, and P.
Cayton, ''Sensor-based Fast Thermal Evaluation
Model for Energy-efficient High-performance
Datacenters", In the International Conf.
Intelligent Sensing Info.Proc. (ICISIP2006), Dec
2006.
T. Mukherjee, G. Varsamopoulos, S. K. S. Gupta,
and S. Rungta, 'Measurement-based Power
Profiling of Datacenter Equipment", (Extended
Abstract) In the Workshop on Green Computing
(with CLUSTER2007), Austiin, USA, Sep 2007.
T. Mukherjee, A. Banerjee, G. Varasamopoulos, and
S. K. S. Gupta, Spatio-temporal Thermal-Aware
Job Scheduling to Minimize Energy Consumption in
Virtualized Heterogeneous Data Centers", Elsevier
Journal on Computer Networks (ComNet), Special
Issue on Virtualized Data Centers, ACCEPTED
(2009).
T. Mukherjee, Q. Tang, C. Ziesman, S. K. S.
Gupta, and P. Cayton, "Software Architecture for
Dynamic Thermal Management in Data Centers",
International Conference on Communication Systems
Software (COMSWARE), Bangalore, India, Jan, 2007.
T. Mukherjee, Q. Tang, C. Ziesman, and S. K. S.
Gupta, "Dynamic Thermal Control and Management
towards Reducing Utility Cost in Data Centers",
International Workshop on Feedback Control
Implementation and Design in Computing Systems
and Networks (FeBID), 2006.
T. Mukherjee, S. K. S. Gupta, and P. Cayton, "Demo
- Temperature-aware job placement in data centers
using Moab cluster management software",
Research_at_Intel Day, Intel, Santa Clara, June,
2006.
32Outline
- Cyber-Physical Systems (CPS)
- Crisis response planning and preparedness
- Energy-efficient job management in data centers
- Ad hoc Networks
- Conclusions/ Future Research Directions
33Optimum Tuning of Pro-active Route Maintenance in
ad-hoc networks
34Application-aware Adaptive Optimization Sub-layer
35Proactive Routing Protocol Classification and
Research Contributions
Employs Beacons, Triggered Updates
Employs only Beacons
Employs Beacons, Periodic Updates
Employs Beacons, Periodic, Triggered Update
WRP, OLSR etc.
BFST, SS-SPST etc.
FSR, IARP etc.
DSDV, TBRPF etc.
- Contributions
- Analytical model for determining the optimum β and f
for different proactive protocols. [1,2,3]
- Developing a PPB type of protocol maintaining
energy-efficient routes.
- Improves Self-Stabilizing Shortest Path Spanning
Tree (SS-SPST) for energy-efficiency. [4,5]
1T. Mukherjee, S. K. S. Gupta, and G.
Varasamopoulos, ''Energy Optimization for
Proactive Unicast Route Maintenance in MANETs
under End-to-End Reliability Requirements", In
Elsevier Journal on Performance Evaluation, Vol.
66, Issue 3-5, Pages 141-157, Mar, 2009.
2T. Mukherjee, S. K. S. Gupta, and G.
Varasamopoulos, ''Analytical Model for Optimizing
Periodic Route Maintenance in Proactive Routing
for MANETs", In the Proc of ACM MSWiM, Crete
Island, Greece, Oct 2007.
3T. Mukherjee, S. K. S. Gupta, and G.
Varasamopoulos, ''Application-Aware Adaptive
Tuning of Proactive Routing Protocols for
MANETs", Under review in Transactions on
Autonomic and Adaptive Systems (TAAS).
4T. Mukherjee, G. Varasamopoulos, and S. K. S.
Gupta, ''Self-Managing Energy-Efficient Multicast
Support in MANETs under End-to-End Reliability
Constraints", In Elsevier Journal on Computer
Networks (ComNet), Vol. 53, Issue 10, Pages
1603-1627, July, 2009.
5T. Mukherjee, G. Sridharan, and S. K. S. Gupta,
''Energy-Aware Self-Stabilization in Mobile Ad
Hoc Networks A Multicasting Case Study", In the
21st IEEE Int'l Parallel and Distributed
Processing Symposium (IPDPS), Long Beach,
California, 26-30th March, 2007.
36Outline
- Cyber-Physical Systems (CPS)
- Crisis response planning and preparedness
- Energy-efficient job management in data centers
- Ad hoc Networks
- Conclusions/ Future Research Directions
37Conclusions
- Pro-activity needs to be incorporated in a
synergistic manner to ensure safety and
survivability in CPSs.
- Pro-activity requires handling of uncertain
outcomes of the pro-active actions.
- Pro-activity leads to high energy consumption.
- Crisis preparedness and planning for human
survivability under crisis
- Abstracting the crisis response behavior as
system-as-a-whole can take into account the human
uncertainties in the physical world.
- Facilitates stochastic planning and evaluation of
the crisis response.
- Model-based verification and analysis enables
crisis response evaluation in an objective
manner.
- Dynamic determination of the period of route
maintenance in ad hoc networks can
effectively balance the energy-reliability
trade-off.
- Data center thermal management for thermal safety
of the equipment
- Dynamically reducing cooling demands through
thermal-aware job scheduling and placement can
save up to 60% of the energy consumption while
ensuring the users' perception of job completion.
38Future Research Directions
- Abstract modeling for CPS
- Physical interference modeling
- Can be governed by differential equations for
physical dynamics.
- Crisis preparedness
- Considering action cost in the analysis of
response processes
- Enhance the action's Q-value with the cost.
- Model dynamics in complex scenarios
- Dynamic unpredictable state-space instead of
static predictable state-space.
- Model composition in distributed and composite
systems
- Derive a system-level global stochastic model by
combining multiple sub-system-level local
stochastic models (e.g. fire in a hospital
requires two sub-systems: i) fire management and
ii) medical emergency management).
- Data center
- Integration of power management with
thermal-aware job scheduling.
- Integration of cooling control with thermal-aware
job scheduling to develop a synergistic control
architecture.
39Questions ??
Impact Lab (http//impact.asu.edu) Creating
Humane Technologies for Ever-Changing World
40Background
- Pro-active systems can anticipate an event and
act in advance to avoid or minimize the
consequences of the event.
- Pro-active CPS is necessary to address the
following design requirements:
- Safety: impact of the physical interactions
should remain within a desirable limit to avoid
any damage to the physical and computing
entities.
- Survivability: the operations in the physical and
cyber subsystems ensure and/or incur no harm to
the human inhabitants.
- Migration from interactive to pro-active
computing for systems intimately connected to
the world around them was suggested in 2000 by David
Tennenhouse, Director of Intel Research.
- Pro-active CPS can involve actions in both the
physical and cyber world.
- Examples of pro-active operations in the physical
environment:
- Safety: pre-setting the cooling in a data center to
avoid equipment redline temperatures.
- Survivability: preparedness drills for responding
to crises/disasters.
- Examples of pro-active operations in the cyber
world:
- Safety: schedule jobs in data centers such that
equipment redline temperatures are avoided.
- Survivability: pro-active route maintenance in ad
hoc networks employed for crisis response to
ensure low latency for information exchange.
41Example Cyber-Physical Systems
- Utilities
- Advanced Electric Power Grid
- Water Distribution
- Pressure Pipes Gas/Oil
- Search Rescue
- Crisis Response, etc.
- Monitoring Systems
- Pervasive Health Monitoring
- Monitoring of fire and chemical radiation plumes
- Wild-life Monitoring
- Forest Monitoring
42Design Decisions
Critical applications should be able to
avoid/handle dangerous physical conditions (e.g.
life/property losses).
Security
Survivability
Reliability
Real-time
Safety
Quality
Interactions between physical and cyber
components should not detrimentally impact the
physical conditions.
This dissertation focuses on the safety
survivability of CPS
43Physical Interactions (Interference)
44Reachability Metric
NORMAL STATE
- Reachability in terms of Q-value or qualifiedness
of actions: probability of reaching the normal state based on
- Probabilities of MLs.
- Probabilities of CLs at intermediate states.
- Conformity to timing requirements.
Q-value is a quantitative measure to evaluate
crisis response.
Critical Link (CL)
Mitigative Link (ML)
45Execution Times
46AADL based criticality response system
architecture specification
47Criticality Specification
48State and State Transition Specification
49State and State Transition Specification
Criticalities
Events in System
Critical States
System Modes
Event Dependent Mode Transition
State Transitions
Response Actions
Windows of Opportunity
Mode Properties
Action Times
mapped to
AADL Constructs
MCMA Components
50Sample Schema for Intermediate XML representation
Allows any expressions to specify policies
51Thermal issues in Data Centers
- Heat recirculation
- Hot air from the equipment air outlets is fed
back to the equipment air inlets.
- Hot spots
- Effect of heat recirculation.
- Areas in the data center with alarmingly high
temperature.
- Consequence
- Cooling has to be set very low to keep all inlet
temperatures in the safe operating range.
- Solution
- Jobs to be placed to minimize heat recirculation.
- Linear thermal model developed previously to
predict the chassis inlet temperature from equipment
utilization.
Courtesy Intel Labs
52Instrumentation
On-site Set-up
Remote Power Meter Reading
Chassis
NETWORK
DualCom Power Meter
SNMP Control
Power Supply (208 V)
T. Mukherjee, G. Varsamopoulos, S. K. S. Gupta,
and Sanjay Rungta, "Measurement based Power
Profiling of Data Center Equipment", In the First
International Workshop on Green Computing (in
conjunction with CLUSTER 2007), Austin, USA,
Sept, 2007
53Equipment Power Consumption
Power Supply
Blade Server Power
Empty Chassis Power
Memory Power
Hard Disk Power
CPU Power
54Power Model
- Power consumption is mainly affected by the CPU
utilization.
- Power consumption is linear in the CPU
utilization:
$P = aU + b$
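A short sketch of how the coefficients of the linear power model could be obtained from measurements, assuming an ordinary least-squares fit of power against CPU utilization. The function names and the sample readings are hypothetical and are not the measured values from the profiling study.

```python
def fit_linear_power(utilizations, powers):
    """Least-squares fit of P = a * U + b; returns (a, b)."""
    n = len(utilizations)
    mean_u = sum(utilizations) / n
    mean_p = sum(powers) / n
    covariance = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilizations, powers))
    variance = sum((u - mean_u) ** 2 for u in utilizations)
    a = covariance / variance
    b = mean_p - a * mean_u
    return a, b


def predict_power(utilization, a, b):
    """Power drawn at a given CPU utilization under the linear model."""
    return a * utilization + b


# Hypothetical readings: utilization as a fraction, power in watts.
SAMPLE_U = [0.0, 0.5, 1.0]
SAMPLE_P = [200.0, 250.0, 300.0]
A, B = fit_linear_power(SAMPLE_U, SAMPLE_P)
```

In this reading, `b` is the idle power of the equipment and `a` is the additional power drawn at full utilization.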
55Linear Thermal Model
- Heat Recirculation Coefficients
- Analytical
- Matrix-based
- Properties of model
- Granularity at air inlets
- Assumes steadiness of air flow
$P = aU + b$
$T_{in} = T_{sup} + D \times P$
$\max(T_{in}) < T_{red}$
$T_{sup} < T_{red} - \max(D \times P)$
D: heat distribution matrix
P: power vector
Tin: inlet temperatures
Tsup: supplied air temperature
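The linear thermal model can be exercised with a small numeric sketch: inlet temperatures are the supply temperature plus the recirculated contribution D x P, and the highest admissible supply temperature is the redline minus the largest recirculated contribution. The 2 x 2 matrix, the power values and the redline below are hypothetical.

```python
def recirculated_rise(d_matrix, power):
    """Temperature rise at each inlet due to recirculation: D x P."""
    return [sum(d * p for d, p in zip(row, power)) for row in d_matrix]


def inlet_temperatures(t_sup, d_matrix, power):
    """T_in = T_sup + D x P for every inlet."""
    return [t_sup + rise for rise in recirculated_rise(d_matrix, power)]


def max_supply_temperature(t_red, d_matrix, power):
    """Upper bound on T_sup so that max(T_in) stays at or below the redline."""
    return t_red - max(recirculated_rise(d_matrix, power))


# Hypothetical values: D in degrees C per watt, P in watts, redline in degrees C.
D = [[0.01, 0.02],
     [0.03, 0.01]]
P = [200.0, 300.0]
T_RED = 25.0

T_SUP_MAX = max_supply_temperature(T_RED, D, P)
T_IN = inlet_temperatures(T_SUP_MAX, D, P)
```

A placement that lowers the largest entry of D x P raises the admissible supply temperature, which is what thermal-aware placement exploits.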
56Thermal-aware Job Scheduling
- PROBLEM: Given a set of incoming jobs, find a job
schedule (i.e. job start times) and placement
(i.e. server assignment) that minimize the total
data center energy consumption subject to meeting
the job deadlines (submitted times for execution).
This requires 3D (job x server x time)
decision-making.
- SCINT algorithm: heuristic solution (genetic
algorithm).
- Takes a feasible solution and performs mutations
until a certain number of iterations is reached.
- Spreads the jobs over time while meeting the
deadlines.
- Offline in nature, requiring the job backlog
information.
- Takes hours of operation.
- EDF-LRH algorithm: tries to mimic the behavior of
SCINT by spreading jobs using the Earliest
Deadline First (EDF) scheduling approach.
- Places jobs on servers contributing the Lowest
Recirculated Heat (LRH).
- Online in nature, maintaining EDF job queues as
and when jobs arrive.
- Takes milliseconds of operation.
57Energy Consumption
- Total power = computing power + cooling power.
- Cooling power depends on the computing power and
the COP.
- Energy consumption is the total power multiplied
by the observed period of time.
$P_{tot} = P_{comp} + P_{cool}$
$P_{tot} = P_{comp} + P_{comp}/COP(T_{sup}) = P_{comp} + P_{comp}/COP(T_{red} - \max(D \times P))$
$E = P_{tot} \times time$
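A worked numeric sketch of these relations follows. The slides show the COP curve only as a figure attributed to HP, so the quadratic used below (the commonly cited HP Labs CRAC model) is an assumption here, and the power and time values are hypothetical.

```python
def cop(t_sup):
    """Assumed COP curve as a function of supply temperature in degrees C.

    The coefficients are the commonly cited HP Labs chilled-water CRAC model;
    they are not taken from the slides, which show the curve only as a figure.
    """
    return 0.0068 * t_sup ** 2 + 0.0008 * t_sup + 0.458


def total_power(p_comp, t_sup):
    """P_tot = P_comp + P_comp / COP(T_sup)."""
    return p_comp + p_comp / cop(t_sup)


def energy(p_comp, t_sup, hours):
    """E = P_tot x time, in watt-hours when power is in watts."""
    return total_power(p_comp, t_sup) * hours


# Hypothetical operating point: 1000 W of computing power for 2 hours.
P_TOT_15 = total_power(1000.0, 15.0)
P_TOT_20 = total_power(1000.0, 20.0)
E_15 = energy(1000.0, 15.0, 2.0)
```

Under this assumed curve, raising the supply temperature from 15 to 20 degrees C lowers the total power for the same computing load, which is why a lower peak inlet temperature translates into cooling savings.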
58Software Architecture
Presentation
Scheduling Control
Access data from the chassis level sensors
Datacenter Servers
59Modularized Implementation of Thermal Awareness
in Task Scheduling
T. Mukherjee, Q. Tang, C. Ziesman, S. K. S.
Gupta, and P. Cayton, ''Software Architecture for
Dynamic Thermal Management in Datacenters", In
the International Conf. Communication System
Software Middleware (COMSWARE), Bangalore, India,
Jan 2007.
60Proactive Route Maintenance Operations in MANETs
- Overhead
- Periodic beacon messages for link state
maintenance: $E \times N \times \lceil \log N \rceil / \beta$
- Periodic route update broadcast: $E \times N^3 \times \lceil \log N \rceil / f$
- Triggered route update broadcast with each link
change: $E \times N^3 \times \lceil \log N \rceil$ for each triggered update
High energy overhead in maintenance operations
Reduces applicability
Low scalability
Reduce maintenance operations and find the optimum β
and f to minimize energy overhead.
61PDR Constraint
- Derived through the required Packet Delivery Ratio (PDR).
- $P$: probability of packet loss due to a single link
failure.
- $P = \lambda \times$ route-reconstruction delay, where $\lambda$ is a
function of link change and application traffic
distribution.
- Packet Delivery Ratio $= (1 - P)^D$.
- $(1 - P)^D > \rho$, where $\rho$ is the application
reliability requirement.
route-reconstruction delay $< (1 - \rho^{1/D}) / \lambda$
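The bound on the route-reconstruction delay follows by solving the PDR constraint for the delay, and can be checked numerically. The symbol names are placeholders: `lam` stands for the factor that depends on link change and traffic distribution, `rho` for the application reliability requirement, and `d` for the exponent D (assumed here to be the route length in hops). All values are hypothetical.

```python
def packet_delivery_ratio(lam, delay, d):
    """PDR = (1 - P)^D with P = lam * route-reconstruction delay."""
    loss = lam * delay
    return (1.0 - loss) ** d


def max_reconstruction_delay(lam, rho, d):
    """Largest delay satisfying (1 - lam * delay)^D >= rho."""
    return (1.0 - rho ** (1.0 / d)) / lam


# Hypothetical requirement: 81% delivery ratio with D = 2.
LAM = 0.05
RHO = 0.81
D_HOPS = 2
DELAY_BOUND = max_reconstruction_delay(LAM, RHO, D_HOPS)
```

Any delay below `DELAY_BOUND` keeps the delivery ratio above the requirement, and a larger D tightens the bound for the same requirement.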
62Optimizations for different Pro-active Protocols
63Sample Results
64Sample Results
65Self-stabilization in Distributed Computing
- A distributed system is self-stabilizing if it
- Guarantees convergence to a valid global state in
finite time from any invalid state based on local
actions in distributed nodes.
- Ensures closure by keeping the system in a valid
state unless faults occur.
- Self-stabilization can adapt to topological
changes and node failures in MANETs based on
local actions.
Topological Changes and Node Failures for MANETs.
Fault
Closure
Invalid State
Valid State
Convergence
Local actions in distributed nodes.
Applied to Multicasting in MANETs
66PPB Type of Multicast Routing using
Self-Stabilization
Multicast source
- Maintains a source-based multicast tree.
- Actions based on local information in the nodes
and neighbors.
- Pro-active neighbor monitoring through periodic
beacon messages.
- Neighbor check at each round (with at least one
beacon reception from all the neighbors).
- Execute actions only in case of changes in the
neighborhood.
Topological Change
Convergence Based on Local actions
Problem: energy-efficiency
is not considered
Self-Stabilizing Spanning Tree
67Energy Aware Self-Stabilizing Protocol (SS-SPST-E)
- Actions at each node
(parent selection)
- Identify potential parents.
- Estimate additional cost after joining each potential
parent.
- Select the parent with minimum additional cost.
- Change distance to root.
Loop Detected
E
Not in tree
F
A
B
D
C
X
AdditionalCost (B ? X) TB R
Potential Parents of X
AdditionalCost (A ? X) TA 2R
- Action Triggers
- Parent disconnection.
- Parent cost not minimum.
- Change in distance of parent to root.
Select the parent with minimum additional cost
Minimum overall cost when the parent is locally
selected
Execute the action when any action trigger is on
Tree validity: the tree will remain connected
with no loops.
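The parent-selection action can be sketched as follows. This is only an illustration of the steps listed on the slide (identify potential parents, estimate additional cost, pick the minimum), not the SS-SPST-E protocol itself: the data structures, node names and cost values are hypothetical, and the additional costs are taken as given rather than computed from transmission energies.

```python
def creates_loop(node, candidate, parent_of):
    """True if `candidate` is a descendant of `node`, so joining it would form a loop."""
    seen = set()
    current = candidate
    while current is not None and current not in seen:
        if current == node:
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


def select_parent(node, neighbors, in_tree, parent_of, additional_cost):
    """Pick the potential parent with the minimum additional cost, or None."""
    potential = [n for n in neighbors
                 if n in in_tree and not creates_loop(node, n, parent_of)]
    if not potential:
        return None
    return min(potential, key=lambda n: additional_cost[n])


# Hypothetical neighborhood of node X: E is a descendant of X, F is not in the tree.
PARENT_OF = {"S": None, "A": "S", "B": "A", "E": "X"}
IN_TREE = {"S", "A", "B", "E"}
NEIGHBORS = ["A", "B", "E", "F"]
ADDITIONAL_COST = {"A": 5.0, "B": 3.0, "E": 1.0, "F": 0.5}

CHOSEN = select_parent("X", NEIGHBORS, IN_TREE, PARENT_OF, ADDITIONAL_COST)
```

E and F have the lowest costs here but are excluded, one because it would create a loop and the other because it is not in the tree, so the choice falls on B.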
68Sample Result