Title: Resilient Real-Time Cyber-Physical Systems Josef De Vaughn
1. Resilient Real-Time Cyber-Physical Systems
- Josef De Vaughn Allen, PhD
- INFOSEC Professional
2. Agenda
- Purpose
- The Problem
- ORNL Value Add
- Resilient Cyber-Physical System
- What is SCADA?
- Current Limitations
- ORNL's Approach
- Risks
- Payoff
- Team
- References
3. Purpose
- ORNL is investing significant IR&D in Resilient Infrastructure Systems
- We want guidance on whether our findings and approach are heading in a direction that will add value to the United States and abroad
- Target areas
  - Smart Grid
  - Water
  - Oil & Gas
  - Should there be others?
- About me
  - Leading the effort at ORNL to gather requirements and lay out a plan of attack for Resilient Systems
  - Full Faculty Member at Florida State University (Computer Science)
  - NSA INFOSEC Professional
    - Primary systems installed were TS/SCI classified systems
    - Intelligence Community/DoD/Five Eyes
  - Large Scale Systems (eight years of experience)
    - System Security Architect for several large-scale systems
  - Marine Corps; served in Desert Storm/Desert Shield
We want to provide a relevant solution that the customers (YOU) can use
4. The Problem
- Munitions-grade malware is going beyond just affecting computing and network resources
  - It compromises physical systems (e.g., Stuxnet)
  - We must be able to create cyber-secure interfaces
- Core Issues
  - A lot of attention is being paid to intrusions going undetected
  - Attention is NOT being paid to the payload
    - The payload was targeting physical devices, not computing/network resources
  - Smart or not, most physical devices are too trusting
    - Physical devices must be cyber-aware
  - Firmware and embedded software for remote sensor devices cannot be trusted
    - No known federal mandate has been made for quality assurance by country of origin
    - Firmware needs to be certified (affirmed to be correct, true, or genuine)
  - Computing/network devices and physical devices cannot communicate compromise, so a cyber-physical layer needs to be developed
- Reality
  - We will get attacked and penetrated (i.e., hacked)
  - We must be able to move forward after being attacked (i.e., resilience)
How do we respond after getting punched in the face?
5. ORNL Value Add
Vision Statement: Provide end-to-end solutions for cyber-secure resilient systems using cutting-edge, reference-based COTS HW/SW with open-source standards, without being cost-prohibitive
6. Resilient Cyber-Physical Systems
Resilient Cyber-Physical Systems refers to the tight conjoining of, and coordination between, computational and physical resources, and to how adaptable the system is in the midst of adversity
7. Resilient Cyber-Physical Systems (CPS)
[Figure labels: CPS; Dynamic Computing Systems; Dynamic Smart Sensors]
8. Supervisory Control And Data Acquisition (SCADA) system
- SCADA (controls) consists of three elements
  - Master system at a control center (Computing)
  - Communications system (Network, phone lines)
  - Multiple remote monitoring and control devices (Sensors, RTUs, protection relays, meters)
Paraphrased from the IEEE tutorial course on fundamentals of supervisory systems, Technical Report 91 EH0337-6 PWR
9. Current Limitations
- Models and work toward generic resilient computing/network systems
  - 1999-2003, DoD/IC
    - Organically Assured and Survivable Information System (OASIS)
    - OASIS Program Manager: Jaynarayan Lala, DARPA/ITO
  - 2003-2006 (Europe)
    - Malicious and Accidental-Fault Tolerance for Internet Applications (MAFTIA)
  - 2009
    - DHS, A Roadmap for Cyber Security Research
- Trusted systems have not adequately evolved.
- Can we leverage lessons learned?
  - Yes
No framework and no implementation exist for a resilient cyber-physical system
10. Current Limitations: Today's Use
- Married to the machine architecture
  - Controllers are not fully optimized
  - Registry
  - Shared memory
  - Shared libraries
  - Not agnostic to the CPU instruction set
- Static planning & reactive planning
  - Hot swaps
  - OS imaging
  - Hashing
  - Non-adaptive configuration management
- Who cares?
  - Users & maintainers of mission-critical systems
11. ORNL Approach
- Inline continuity of operations (ICOOP) versus COOP
  - No matter the attack or disruption, there is a plan to complete the mission at hand
  - Need in-place fail-over nodes, not remote back-up
- Real-Time Trusted Platform Module/Base Aware Sensors/Controllers
  - We cannot build up for resilience; we must build within
  - Must be cost realistic
- Adaptive planning algorithms for redirecting existing resources (i.e., grid computing)
- Leverage mobile phone embedded implementation architectures, creating cyber-aware hardware/software for mission-critical systems
- Dynamic Virtualization with Program Differentiation
Leverage ORNL Distributed Energy Communications and Control, and create a mission-critical ICOOP framework and implementation model
12. ORNL Approach
- We will discuss resiliency for the three main components of a control system
- Dynamic Computing Systems (Computing)
  - Dynamic Virtualization for Health Status of System
  - Dynamic Program Differentiation
  - Wireless Power System Fingerprinting
- Communications system (Network, phone lines)
  - Dynamic White-Listing (Harris Corporation)
- Multiple remote monitoring and control devices
  - Dynamic Smart Sensors
  - Trusted Computing/TPM-Aware Sensors
  - Elliptic Curve Cryptography Aware for Real-Time
    - 155-bit ECC uses 11,000 transistors, while a 512-bit RSA implementation uses 50,000
  - Penetration Defense
    - Command Validation
    - Input Validation
    - Disturbance Rejection
- Mathematical (Framework)
  - Mixture Model via GPU and/or Secure Cloud Computing
  - Model the system to get a snapshot of SCADA/PMU
13. Risk
- Tighter coupling of resources with the mission
- Complexity of process scheduling may increase
- Antiquated resources will need to be taken offline and retrofitted
14. Payoff
- Create a dynamic architecture based on adaptive planning
  - No single point of failure
  - If a system is compromised, there is a best path to finish the existing/current mission
  - Based on mission focus
- Maximization of open-standard COTS/SW/HW/OS in a directed mission
- Creating an ontology/taxonomy that leads to novel non-linear scheduling algorithms on computing nodes
- Define the critical components for resilience
- Reduces carbon footprint
- Vision for Securing Control Systems in the Energy Sector
  - "In 10 years, control systems for critical applications will be designed, installed, operated, and maintained to survive an intentional cyber assault with no loss of critical function."
  - Roadmap to Secure Control Systems in the Energy Sector, DOE/DHS, January 2006
15. Team
- Oak Ridge National Laboratory
  - Josef D. Allen
  - Aleksandar Dimitrovski
  - Robert Gillen
  - Shaun Gleason
  - Dilip Reddy
  - Isabelle Snyder
  - Bogdan Vacaliuc
  - Phillip Vallance
  - Richard Wallace
- Florida State University
  - David Whalley
  - Xiuwen Liu
  - Gary Tyson
  - Michael Steurer
  - Karl Schoder
- GE Research
  - Arthur "Chip" Cotton
- Harris Corporation
  - Travis Berrier
- University of Tennessee
  - Yilu Liu
A cross-discipline team is essential for success!
16. References and Related Work
- References
  - D. Chang, S. Hines, P. West, G. Tyson, and D. Whalley, "Program Differentiation," Journal of Circuits, Systems, and Computers, accepted March 2011.
  - X. Liu, A. Srivastava, and D. L. Wang, "Intrinsic generalization analysis of low dimensional representations," Neural Networks, vol. 16, no. 5/6, pp. 537-545, 2003.
  - S. C. Zhu and X. Liu, "Learning in Gibbsian fields: How accurate and how fast can it be?" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 1001-1006, 2002.
  - F. Wang, F. Gong, C. Sargor, K. Goseva-Popstojanova, K. S. Trivedi, and F. Jou, "SITAR: A Scalable Intrusion Tolerant Architecture for Distributed Services," 2nd Annual IEEE Systems, Man, and Cybernetics Information Assurance Workshop, West Point, New York, June 2001.
  - J. F. Ferreira, J. Lobo, and J. Dias, "Bayesian real-time perception algorithms on GPU," Journal of Real-Time Image Processing, 2010.
  - D. Powell and R. Stroud, "Conceptual model and architecture of MAFTIA," MAFTIA Deliverable D21, 2003.
  - N. F. Neves and P. Verissimo, "Complete Specifications of APIs and Protocols for the MAFTIA middleware," MAFTIA Deliverable D9, 2002.
  - M. Dacier (editor), "Design of an Intrusion-Tolerant Intrusion Detection System," MAFTIA Deliverable D10, 2002.
  - M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance and Proactive Recovery," ACM Transactions on Computer Systems, vol. 20, no. 4, pp. 398-461, 2002.
  - J. Levy, H. Saidi, and T. Uribe, "Combining monitors for runtime system verification," Electronic Notes in Theoretical Computer Science, vol. 70, no. 4, 2002.
  - S. Berger, R. Cáceres, K. A. Goldman, R. Perez, R. Sailer, and L. van Doorn, "vTPM: Virtualizing the Trusted Platform Module," USENIX-SS '06: Proceedings of the 15th USENIX Security Symposium, USENIX Association, 2006, pp. 21-21.
  - A. Sadeghi, M. Scheibel, C. Stüble, and M. Wolf, "Play it once again, Sam: Enforcing Stateful Licenses on Open Platforms," Second Workshop on Advances in Trusted Computing (WATC '06 Fall), 2006.
  - Y. Xue, "Some Viewpoints and Experiences on Wide Area Measurement Systems and Wide Area Control Systems," 2008 IEEE Journal.
  - C. Aguayo Gonzalez and J. Reed, "Detecting Unauthorized Software Execution in SDR Using Power Fingerprinting," MILCOM 2010.
  - T. Messerges, E. Dabbish, and R. Sloan, "Examining Smart-Card Security under the Threat of Power Analysis Attacks," IEEE Transactions on Computers, vol. 51, no. 4, April 2002.
- Related Work
  - Very recently there were several noticeable efforts toward intrusion-tolerant systems
    - DARPA OASIS (Organically Assured and Survivable Information Systems, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8932)
17. Guidance
18. Guidance
- We want to make sure that our direction makes sense.
- Please give feedback!
- Presenter: Josef D. Allen
- Email: allenjd_at_ornl.gov
19. Thank You
20. Our Solution
21. SCADA for Power System
22. Generators Control Strategies
[Figure: several generators (G), each with local control (speed governor), coordinated by area control (AGC system)]
- Local control (speed governors): responds to frequency or load changes at the generator output; fast response (10 s)
- Area control (AGC system): continuously monitors frequency and tie-line flows, and changes the output of the participating generators to bring both frequency and interchange back to schedule; slow response (10 minutes)
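The area-control behavior described above is often summarized by an area control error (ACE). The following is a minimal sketch, not the presenter's implementation: it uses the conventional tie-line bias form of ACE, and the function names, the bias of -50 MW per 0.1 Hz, and the participation factors are all illustrative assumptions.

```python
def area_control_error(tie_actual_mw, tie_sched_mw, freq_actual_hz,
                       freq_sched_hz=60.0, bias_mw_per_tenth_hz=-50.0):
    """Tie-line bias ACE in MW (negative: the area is under-generating).

    bias_mw_per_tenth_hz is an assumed frequency bias; by convention it is
    a negative number expressed in MW per 0.1 Hz.
    """
    interchange_error = tie_actual_mw - tie_sched_mw
    frequency_error = freq_actual_hz - freq_sched_hz
    return interchange_error - 10.0 * bias_mw_per_tenth_hz * frequency_error


def allocate_correction(ace_mw, participation):
    """Split the correction (-ACE) among generators by participation factor."""
    total = sum(participation.values())
    return {name: -ace_mw * share / total
            for name, share in participation.items()}


# Interchange on schedule, frequency 0.02 Hz low: the area is about 10 MW short.
ace = area_control_error(100.0, 100.0, 59.98)
raises = allocate_correction(ace, {"G1": 0.5, "G2": 0.3, "G3": 0.2})
print(round(ace, 3), {name: round(mw, 3) for name, mw in raises.items()})
```

A real AGC system would filter and integrate ACE over time, which is one reason its response is on the order of minutes. This sketch only shows the instantaneous error and a proportional split.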
23. Large Frequency Deviation Impact
- Electrical islands: a local power system is cut off from outside power sources due to tripping of ties.
- Frequency decays at between 0.5 Hz and 4 Hz per second
- At around 59.3 Hz, under-frequency relays shed load to prevent complete shutdown
- If enough load cannot be shed, generating units will trip (automatically or manually), leading to complete shutdown
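To give a sense of how little time these numbers leave, the sketch below computes how long frequency takes to fall to the load-shedding threshold. It assumes a 60 Hz nominal starting frequency and a constant decay rate, neither of which is stated on the slide; the function name is illustrative.

```python
NOMINAL_HZ = 60.0          # assumed nominal frequency
SHED_THRESHOLD_HZ = 59.3   # under-frequency relay threshold quoted above


def seconds_to_threshold(decay_hz_per_s, start_hz=NOMINAL_HZ,
                         threshold_hz=SHED_THRESHOLD_HZ):
    """Time for frequency to fall to the threshold at a constant decay rate."""
    if decay_hz_per_s <= 0:
        raise ValueError("decay rate must be positive")
    return (start_hz - threshold_hz) / decay_hz_per_s


for rate in (0.5, 4.0):
    print(f"{rate} Hz/s -> {seconds_to_threshold(rate):.3f} s")
```

Under these assumptions the window before load shedding is roughly 1.4 s at 0.5 Hz/s and 0.175 s at 4 Hz/s.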
24. Synchronous Generator Control Loops: Prime Mover and Exciter Control
25. Network Protection
26. Isolate Corrupted Devices in Network
- Monitor for network pre-intrusion from generation to substation
- Change current reactive security to proactive, anticipatory security
- Robust, adaptable security without perimeter reconfiguration limitations
- Low cost; power scavenging makes installation easy
Figure legend: 1. Hydroelectric dam 2. Generator 3. Step-up transformer 4. Grid high-voltage transmission lines 5. Terminal station 6. Subtransmission lines 7. How it is used by the customer 8. Distribution substation
27. Protecting Device Communication
- One-way communications path with high-assurance firewall/VPN
- Can stop propagation of malware
- Containment function for digital traffic with dynamic whitelisting
[Figure labels: Management; Switch; Monitored Comm.; Invalid Comm.]
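The containment idea can be sketched as an allow-list that changes at run time. This is an illustrative model only, not the Harris Corporation product: the class, its methods, and the device and function names are all hypothetical.

```python
class DynamicWhitelist:
    """Hypothetical allow-list of (source, destination, function) flows."""

    def __init__(self):
        self._allowed = set()

    def allow(self, src, dst, function):
        """Permit one specific flow."""
        self._allowed.add((src, dst, function))

    def revoke_device(self, device):
        """Drop every rule that involves a device, e.g. once it looks corrupted."""
        self._allowed = {rule for rule in self._allowed
                         if device not in (rule[0], rule[1])}

    def classify(self, src, dst, function):
        """Listed flows are passed and monitored; everything else is invalid."""
        if (src, dst, function) in self._allowed:
            return "monitored"
        return "invalid"


wl = DynamicWhitelist()
wl.allow("rtu-7", "master", "read_status")
print(wl.classify("rtu-7", "master", "read_status"))     # monitored
print(wl.classify("rtu-7", "master", "write_firmware"))  # invalid
wl.revoke_device("rtu-7")
print(wl.classify("rtu-7", "master", "read_status"))     # invalid
```

Defaulting to "invalid" for any flow not explicitly listed is what lets this kind of filter stop malware from propagating rather than only reporting it.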
28. Server Architecture Resiliency
29. Resilient Controls Under Duress
- Resilient systems under duress rely on the following
  - The computing devices in question for control systems run a constrained set of software (i.e., they are not general-purpose machines)
  - The devices are generally available commodity hardware
  - The devices serve to accept input data, make a decision, and respond appropriately.
    - They are not individually responsible for trending, persistence, etc., but may (and likely do) send such data to an external system for historical or other analyses
  - The computing systems environment is limited in power, space, and available options for individual system redundancy
  - The systems can operate on virtualized hardware (standard type-1 virtualization approaches)
30. VM Health Management
Dom0 monitors VM1-VMn based on the rules defined for each system.
[Figure labels: VM1; VM2; Health Monitor; System1-a; System2-a; Original System Image Cache; Hypervisor]
31. VM Health Management
The monitors for VM1 trigger a risk condition and apply the configured response (mitigation)
[Figure labels: VM1; VM2; Health Monitor; System1-a; System2-a; Original System Image Cache; Hypervisor]
32. VM Health Management
The original version of System1 is pulled from the cache, and a uniquely obfuscated variant is produced (System1-b)
[Figure labels: VM1; VM2; Health Monitor; System1-b; System1-a; System2-a; Original System Image Cache; Hypervisor]
33. VM Health Management
VM3 is brought online and System1-b is deployed.
[Figure labels: VM1; VM2; VM3; Health Monitor; System1-a; System2-a; System1-b; Original System Image Cache; Hypervisor]
34. VM Health Management
Once VM3 is fully online, I/O ports from VM1 are migrated to VM3
[Figure labels: VM1; VM2; VM3; Health Monitor; System1-a; System2-a; System1-b; Original System Image Cache; Hypervisor]
35. VM Health Management
VM1 is archived for later forensic review and shut down
[Figure labels: VM2; VM3; Health Monitor; System2-a; System1-b; Original System Image Cache; Hypervisor]
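The sequence on slides 30-35 can be sketched as a small state model. This is a toy illustration under stated assumptions, not a hypervisor integration: the `VM` and `Host` classes, their fields, and the variant-naming rule are hypothetical, and the "obfuscation" step is represented only by a new image label.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    system: str        # logical system, e.g. "System1"
    image: str         # deployed variant, e.g. "System1-a"
    io_ports: list
    state: str = "running"


@dataclass
class Host:
    image_cache: dict                               # system -> original image
    vms: dict = field(default_factory=dict)
    archive: list = field(default_factory=list)
    generation: dict = field(default_factory=dict)  # variants made per system

    def deploy(self, vm_name, system, io_ports=()):
        """Build a fresh variant from the cached original and start a VM."""
        _original = self.image_cache[system]        # pull original from cache
        gen = self.generation.get(system, 0)
        self.generation[system] = gen + 1
        image = f"{system}-{chr(ord('a') + gen)}"   # stand-in for obfuscation
        vm = VM(vm_name, system, image, list(io_ports))
        self.vms[vm_name] = vm
        return vm

    def mitigate(self, risky_name, new_name):
        """Replace a VM that triggered a risk condition."""
        old = self.vms[risky_name]
        new = self.deploy(new_name, old.system)     # new variant comes online
        new.io_ports, old.io_ports = old.io_ports, []  # migrate I/O ports
        old.state = "archived"                      # keep for forensic review
        self.archive.append(self.vms.pop(risky_name))
        return new


host = Host({"System1": "orig-1", "System2": "orig-2"})
host.deploy("VM1", "System1", ["io0"])
host.deploy("VM2", "System2", ["io1"])
vm3 = host.mitigate("VM1", "VM3")
print(vm3.image, vm3.io_ports, sorted(host.vms), host.archive[0].name)
```

The ordering matters: I/O is migrated only after the replacement VM exists, and the suspect VM is archived rather than destroyed so it can be examined later.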
36. Digital Fingerprinting
37. Overcoming Embedded Code
- Dynamic power consumption in a digital processor is caused by transient currents and by the charging and discharging of load capacitance that occur during bit transitions.
- Key Comments
  - Transitions depend on specific/unique
    - Instruction sets
    - Memory addresses
    - Inter-instruction transitions
- Bottom Line
  - Execution of a specific routine yields a unique power consumption signature
  - All manufactured hardware is unique
    - Even if it is from the same assembly line!
  - PMU/GridEye can allow us to obtain power signatures of desired devices directly or wirelessly
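One simple way to compare an observed power trace against a stored signature is a correlation test. The sketch below is an assumption-laden illustration, not the method of the cited power-fingerprinting work: it uses Pearson correlation, a made-up threshold of 0.9, and toy traces, and it assumes the two traces are already time-aligned and of equal length.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length traces (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def matches_signature(trace, reference, threshold=0.9):
    """True if the trace correlates strongly with the reference signature."""
    return pearson(trace, reference) >= threshold


reference = [1.0, 3.0, 2.0, 5.0, 4.0]          # toy signature of a trusted routine
same_code = [2.0 * v + 0.1 for v in reference]  # same shape, different gain/offset
other_code = [5.0, 1.0, 4.0, 2.0, 3.0]          # different execution pattern
print(matches_signature(same_code, reference))   # True
print(matches_signature(other_code, reference))  # False
```

Correlation ignores gain and offset, which is convenient when the measurement path (direct versus wireless) changes the amplitude of the trace but not its shape.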
38. Dynamic Smart Sensors
39. Trusted Secure Control Devices
- Control system devices will integrate a secure element (TPM, smartcard, USIM)
- Required for new use cases
  - Near field communication (NFC)
    - Contactless applications are executed on a secure element
    - Enables mobile payment, ticketing, smart posters, DRM
  - Secure transactions
  - Secure browsing
- Research is needed on secure element (SE) integration
- Research is needed on how secure channels to the SE are established

- A secure element (TPM) that supports trust establishment is required
  - Trust decisions are based on integrity data stored in the secure element
- TPM was designed for current (insecure) operating system environments
- Access to the TPM across the virtualization boundary is not directly possible
- Possible solutions
  - Virtualization of TPM (insecure)
  - Only one instance gets access to TPM (inflexible, trust statements incomplete)
- Better approach
  - Provide vT-enhanced TPM(s)
  - Next specification of TCG (TPM.Next)
40. Mathematical Framework
41. System Model
- We will develop efficient and effective statistical models of process behaviors
- We will use local windows of system call profiles, port scanning activities, local resource-access patterns, controller outputs, and inferred underlying controller state parameters as feature vectors
- Each group of related processes will be modeled as a mixture model to reflect the different operating states. As the system is very complex, the key is to find efficient and effective statistical models that allow accurate, real-time inference
42. Mixture Models
- We use a unified mixture model framework to model different processes in the system
- Here x is a vector of observed variables, q consists of all the parameters, and P(wj) are the priors for the different types of underlying processes
- For example, for a generator, q consists of the physics-based model parameters (such as the frequency of the generated electricity); here the estimated probability models will have very small variation due to stringent requirements
- For cyber processes, q consists of model parameters in representations derived from all monitored measurements
- In order to learn effective representations and therefore enable real-time inference, we use optimal component analysis learning
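The equation that this slide's symbols refer to did not survive in the text. A standard finite mixture density consistent with the symbols defined above would read as follows. This is an assumed reconstruction, not the presenter's original formula, and K (the number of underlying process types) is a symbol introduced here for illustration.

```latex
p(x \mid q) \;=\; \sum_{j=1}^{K} P(w_j)\, p(x \mid w_j, q),
\qquad \sum_{j=1}^{K} P(w_j) = 1
```

Each term pairs the prior P(w_j) of one process type with the density of the observation x under that type, so inference amounts to evaluating the K terms and comparing them.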
43. Suitability of Real-Time Inference for GPU
- The inference can be decomposed into computational components that are highly parallelizable
- As we have a unified framework, the differences among different types of processes are in the data, which leads naturally to single instruction, multiple data (SIMD) parallelization
- The high cost also arises from the large data sets involved.
- SIMD features of GPUs provide a means of dealing with the scalability of highly parallelizable algorithms operating on large data structures.
- GPUs provide massive parallelism and high speed gains at low cost.
- Source: "Bayesian real-time perception algorithms on GPU" by João Filipe Ferreira, Jorge Lobo, and Jorge Dias, Journal of Real-Time Image Processing (2010)
Inspired by the study of biological systems, several Bayesian inference algorithms for artificial perception exist
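The data-parallel structure can be seen even in a plain CPU sketch: every observation is scored by the same instructions, and only the data differ. The example below assumes one-dimensional Gaussian components and invented parameter values (two operating states of a generator's frequency); none of these numbers or names come from the slides, and a GPU version would map the per-observation function onto parallel threads.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (assumed component model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x, priors, means, stds):
    """Posterior probability of each component given one observation.

    Assumes x is not so extreme that every density underflows to zero;
    a production version would work in log space.
    """
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def classify_batch(xs, priors, means, stds):
    """Same computation for every observation: the SIMD-friendly part."""
    labels = []
    for x in xs:
        post = posterior(x, priors, means, stds)
        labels.append(post.index(max(post)))
    return labels


# Illustrative states: 0 = normal (tight around 60 Hz), 1 = disturbed.
priors, means, stds = [0.95, 0.05], [60.0, 59.5], [0.02, 0.3]
print(classify_batch([60.01, 59.4, 59.99], priors, means, stds))  # [0, 1, 0]
```

Because no observation depends on any other, the loop in `classify_batch` can be split across as many parallel lanes as the hardware provides.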
44. Timing
- Global protection (SCADA) operates on the order of seconds
  - (2-4 s)
- Local protection operates on the order of power system cycles
  - (1/f = 1/60 Hz = 16.67 ms)
- Instantaneous protection (physical) operates on the order of milliseconds
  - 2-6 cycles
- Time-delayed back-up and system-wide protection (COOP) operates on the order of hundreds of milliseconds
  - 20-30 cycles
- An additional level of back-up adds to the time-delayed level
  - Additional 20-30 cycles
- PMU data: 30 samples/second (typically)
  - Can go to 240 Hz, 8000 Hz, or higher
General reference for times: Electric Power Engineering Handbook, 2nd ed., L. Grigsby, editor, CRC Press, 2006.
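The cycle counts above translate into wall-clock budgets as follows. This is a small worked calculation assuming a 60 Hz system, as the slide's 1/60 Hz figure implies; the function name and the tier labels are illustrative.

```python
SYSTEM_HZ = 60.0  # assumed power system frequency


def cycles_to_ms(cycles, system_hz=SYSTEM_HZ):
    """Duration of a number of power-system cycles, in milliseconds."""
    return cycles * 1000.0 / system_hz


tiers = {
    "one cycle": (1, 1),
    "instantaneous protection": (2, 6),
    "time-delayed back-up": (20, 30),
}
for name, (low, high) in tiers.items():
    print(f"{name}: {cycles_to_ms(low):.2f} to {cycles_to_ms(high):.2f} ms")
```

Under this assumption, instantaneous protection has roughly 33 to 100 ms to act and time-delayed back-up roughly 333 to 500 ms, both well below the 2-4 s scale of SCADA-level protection.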
45. Timing
- Heisenberg uncertainty principle
  - Cannot measure the present position and determine future momentum
- Direction
  - Leverage all information in a unified way to provide an actionable decision (cyber-physical)
  - Must continue the mission while under duress
- We can NOT overcome physics
  - (Nonlinear dynamic decision making)
- We propose a unified resilient manifold framework to model interactions among all the components in different states (both steady states and transient states)
46. Timing
- Combine RTU/SCADA with PMU/WAMS
- Real-time cognitive trajectory-based data mining
  - Suggest that PMUs use wavelets instead of FFT to allow for faster decisions
- Online dynamic state estimation (tracking)
  - Suggests quantum path planning
  - Leverage work done in DoD tracking (GMTI)
- Developing a mathematical and computational framework for detecting and classifying weak, distributed anomalous behavior in computer networks
- Online optimization for decision making
47. Comment
- 2008: Nanjing Automation Research Institute (NARI), China
  - Vision obtained from Western Systems Coordinating Council
  - NASPINET
- China is implementing real-time system dynamics of the power grid via non-linear estimation of real/static state estimations
- Yusheng Xue: Chief Engineer of NARI since 1993, People's Republic of China