Title: Cognitive Support for Intelligent Survivability Management
1. Cognitive Support for Intelligent Survivability Management
Dec 18, 2007
2. Outline
- Project Summary and Progress Report
- Goals/Objectives
- Changes
- Current status
- Technical Details of Ongoing Tasks
- Event Interpretation (OLC)
- Response Selection (OLC)
- Rapid Response (ILC)
- Learning Augmentation
- Simulated test bed (simulator)
- Next steps
- Development and Integration
- Red team evaluation
3. Project Summary and Progress Report
4. Background
- Outcome of DARPA OASIS Dem/Val program
- Survivability architecture
- Protection, detection, and reaction (defense mechanisms)
- Synergistic organization of overlapping defense functionality
- Demonstrated in the context of an AFRL exemplar (JBI)
- With knowledge about the architecture, human defenders can be highly effective
- Even against a sophisticated adversary with significant inside access and privilege (learning exercise runs)
The survivability architecture provides the dials and knobs, but an intelligent control loop, in the form of human experts, was needed for managing them.
Managing here means making effective decisions.
What was this knowledge? How did the human defenders use it? Can the intelligent control loop be automated?
5. Incentives and Obstacles
- Incentives
- Narrowing of the qualitative gap in automated cyber-defense decision making
- Self-managed survivability architecture
- Self-regenerative systems
- Next generation of adaptive system technology
- From hard-coded adaptation rules to cognitive rules to evolutionary rules
- Obstacles (at various levels)
- Concept: insight (sort of) but no formalization
- Implementation: architecture, tool capability/choice
- Evaluation: how to create a reasonably complex context and a wide range of incidents; a real system?
- Evaluation: how to quantify and validate
- Usefulness and effectiveness
- Measuring technological advancement
6. CSISM Objectives
- Design and implement an automated cyber-defense decision-making mechanism
- Expert-level proficiency
- Drive a typical defense-enabled system
- Effective, reusable, easy to port and retarget
- Evaluate it in a wider context and scope
- Nature and type of events and observations
- Size and complexity of the system
- Readiness for a real system context
- Understanding the residual issues and challenges
7. Main Problem
- Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved
- Doing it as well as human experts
- And also as well as in other disciplines
- Additional difficulties
- Rapid and real-time decision-making and response
- Uncertainty due to incomplete and imperfect information
- Widely varying operating conditions (from no alerts to 100s of alerts per second)
- New symptoms and changes in the adversary's strategy
8. For Example...
- Consider a missing protocol message alert
- Observable: a system-specific alert
- A accuses B of omission
- Interpretation
- A is not dead (it reported the alert)
- Is A lying? (corrupt)
- B is dead
- B is not dead, just behaving badly (corrupt)
- A and B cannot communicate
- Refinement (depending on what else we know about the system, the attacker objective, ...)
- Other communications between A and B
- A service is dead if the host is dead...
- OS platform and likelihood of multi-platform exploits...
- Response selection
- Now or later?
- Many options
9. Approach
[Figure: the CSISM control loop. A stream of events and observations is interpreted into hypotheses; responses produce actions on the system; additional react loops and a learn step that modifies parameters or policies complete the loop.]
- Multiple levels of reasoning
- Varying spatial and temporal scope
- Different techniques
- The main control loop is partitioned into two main parts: event interpretation and response selection
10. Concrete Goals
- A working prototype integrating
- Policy-based reactive (cyber-defense) response
- Cognitive control loop for system-wide (cyber-defense) event interpretation and response
- Learning augmentation to modify defense parameters and policies
- Achieve expert-level proficiency
- In making appropriate cyber-defense decisions
- Evaluation by
- Ground truth: ODV operator responses to symptoms caused by the red team
- Program metrics
11. Current State
- Accomplished quite a bit in 1 year
- KR and reasoning framework for handling cyber-defense events is well developed
- Proof-of-concept capability demonstrated for various components at multiple levels
- OLC, ILC, Learner, and Simulator
- E.g., Prover9, Soar (various iterations)
- Began integration and tackling incidental issues
- Evaluation ongoing (internal and external)
- Slightly behind in terms of response implementation and integration
- Various reasons (inherent complexity, and the fact that it is very hard to debug the reasoning mechanism)
- Longer-term issues: confidence in such a cognitive engine? Is a system-wide scope really tenable? Is it possible to build better debugging support?
- Taken mitigating steps (see next)
12. Significant Changes
Recall the linear flow using various types of knowledge? That was what we were planning in June. This evolved, and the actual flow looks like the following:
[Figure: revised processing flow. Accusations and evidence are processed, translated, and mapped down; a constraint network is built and refined; hypotheses are pruned via coherence and proof, and garbage collection cleans up. The steps draw on knowledge about bad behavior (bin 1), information flow (bin 2), the attacker goal (bin 3), and protocols and scenarios (bin 4).]
13. Significant Changes (cont'd)
- Response mechanism
- Do in Jess/Java instead of Soar
- Issues
- Get the state accessible to Jess/Java
- Viewers
- Dual purpose: usability and debugging
- Was: rule driven; write a Soar rule to produce what to display
- Now: get the state from Soar and process it
14. Schedule
- Midterm release (Aug 2007) done
- Red team visit (Early 2008)
- Next release (Feb 2008)
- Code freeze (April 2008)
- Red team exercises (May/June 2008)
15. Event Interpretation and Response (OLC)
16. OLC Overall Goals
- Interpret alerts and observations
- (sometimes a lack of observations triggers alerts)
- Find an appropriate response
- (sometimes it may decide that no response is necessary)
- Housekeep
- Keep history
- Clean up
17. OLC Components
[Figure: OLC component diagram. Accusations and evidence flow into Event Interpretation; a summary flows to Response Selection, which produces responses; Learning and History support both.]
18. Event Interpretation
- Main Objectives
- Essential: event interpretation
- Interpreting events in terms of hypotheses and models
- Uses deduction and coherence to decide which hypotheses are candidates for response
- Incidental undertakings
- Protecting the interpretation mechanisms from attack: flooding and resource consumption
- Current status and plans
- Note that marked items are in progress
Event interpretation creates candidate hypotheses which can be responded to.
19. Event Interpretation Decision Flow
[Figure: decision flow. A generator turns accusations and evidence into hypotheses; theorem proving yields claims and coherence handles dilemmas; a summary is passed to Response Selection, with learning and history alongside.]
20. Knowledge Representation
- Turn very specific knowledge into an intermediate form amenable to reasoning
- e.g., "Q2SM sent a malformed Spread message" -> "Q2SM is corrupt"
- Create a graph of inputs and intermediate states to enable reasoning about the whole system
- Accusations and Evidence
- Hypotheses
- Constraints between A and B
- Use the graph to enable deduction via proof and to perform a coherence search
Specific system inputs are translated into a reusable intermediate form which is used for reasoning.
21. Preparing to Reason
- Observations and Alerts are transformed into Accusations and Evidence (a sketch follows below)
- Currently translation is done in Soar, but it may move outside to keep the translation and reasoning separate
Alert: notification of an anomalous event
Accusation: generic alert
Observation: notification of an expected event
Evidence: generic observation
Alerts and Observations are turned into Accusations and Evidence that can be reasoned about.
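A minimal Java sketch of such a translation (the actual translation is done in Soar and may move to Jess/Java); the alert kinds and component names used here, e.g. "missing-protocol-message", are hypothetical placeholders, not CSISM's real vocabulary:

import java.util.Optional;

// Sketch only: illustrates mapping system-specific alerts to the five generic
// accusation types. Alert kinds and component names (e.g. "Q2SM") are hypothetical.
final class AlertTranslator {
    enum AccusationType { VALUE, POLICY, TIMING, OMISSION, FLOOD }
    record Alert(String reporter, String kind, String subject, long timestamp) {}
    record Accusation(AccusationType type, String accuser, String accused, long timestamp) {}

    // Map a raw, system-specific alert into a generic accusation usable by event interpretation.
    static Optional<Accusation> translate(Alert a) {
        return switch (a.kind()) {
            case "malformed-spread-message" ->   // sender produced bad data -> Value
                Optional.of(new Accusation(AccusationType.VALUE, a.reporter(), a.subject(), a.timestamp()));
            case "missing-protocol-message" ->   // expected data never arrived -> Omission
                Optional.of(new Accusation(AccusationType.OMISSION, a.reporter(), a.subject(), a.timestamp()));
            case "selinux-policy-violation" ->   // a security policy was violated -> Policy
                Optional.of(new Accusation(AccusationType.POLICY, a.reporter(), a.subject(), a.timestamp()));
            default -> Optional.empty();         // unrecognized alerts are not translated here
        };
    }

    public static void main(String[] args) {
        Alert alert = new Alert("Q2PSQ", "malformed-spread-message", "Q2SM", System.currentTimeMillis());
        translate(alert).ifPresent(acc ->
            System.out.println(acc.accuser() + " accuses " + acc.accused() + " (" + acc.type() + ")"));
    }
}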
22. Alerts and Accusations
- By using accusations, the universe of bad behavior used in reasoning is limited, with limited loss of fidelity
- The five accusations below are representative of attacks in the system
- Value: accused sent malformed data
- Policy: accused violated a security policy
- Timing: accused sent well-formed data at the wrong time
- Omission: expected data was never received from the accused
- Flood: accused is sending much more data than expected
CSISM uses 5 types of accusations to reason about a potentially infinite number of bad actions that could be reported.
23. Evidence
- While accusations capture unexpected behavior, evidence is used for expected behavior
- Evidence limits the universe of expected behavior used in reasoning, with limited loss of fidelity
- Alive: the subject is alive
- Timely: the subject participated in a timely exchange of information
- Specific historical data about interactions is used by the OLC, just not in event interpretation
CSISM uses two types of evidence to represent the occurrence of expected actions for event interpretation.
24. Hypotheses
- When an accusation is created, a set of hypotheses is proposed that explains the accusation
- For example, a value accusation means either the accuser or the accused is corrupt, and that the accuser is not dead (see the sketch below)
- The following hypotheses (both positive and negative) can be proposed
- Dead: subject is dead (fail-stop failure)
- Corrupt: subject is corrupt
- Communication-Broken: subject has lost connectivity
- Flooded: subject is starved of critical resources
- OR: a meta-hypothesis that one of a number of related hypotheses is true
Accusations lead to hypotheses about the cause of the accusation.
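An illustrative sketch, again hypothetical Java rather than the Soar implementation, of proposing hypotheses for a value accusation; the hypothesis kinds come from the list above, and the linkage by an OR meta-hypothesis and constraints is only noted in a comment:

import java.util.List;

// Sketch only: proposes hypotheses explaining a single accusation. The hypothesis
// kinds mirror the slide (Dead, Corrupt, Communication-Broken, Flooded); in CSISM the
// alternatives would additionally be tied together by an OR meta-hypothesis and constraints.
final class HypothesisProposer {
    enum HypKind { DEAD, CORRUPT, COMM_BROKEN, FLOODED }
    record Hypothesis(HypKind kind, String subject, boolean positive) {
        @Override public String toString() { return (positive ? "" : "NOT ") + kind + "(" + subject + ")"; }
    }

    // For a value accusation (accuser says the accused sent malformed data):
    // either the accused or the accuser is corrupt, and the accuser is not dead.
    static List<Hypothesis> proposeForValueAccusation(String accuser, String accused) {
        return List.of(
            new Hypothesis(HypKind.CORRUPT, accused, true),
            new Hypothesis(HypKind.CORRUPT, accuser, true),   // the accuser may be lying
            new Hypothesis(HypKind.DEAD, accuser, false));    // it reported, so it is not dead
    }

    public static void main(String[] args) {
        proposeForValueAccusation("A", "B").forEach(System.out::println);
    }
}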
25. Reasoning Structure
- Hypotheses, Accusations, and Evidence are connected using constraints
- The resulting graph is used for
- Coherence search
- Proving system facts
[Figure: example constraint graph. An accusation node connects through an OR meta-hypothesis to hypotheses such as host dead, host corrupt, and comm broken; edges carry positive constraint weights (e.g., 100) and negative ones (e.g., -100, -400).]
A graph is created to enable reasoning about
hypotheses.
26. Proofs about the System
- The OLC needs to derive as much certain information as it can, but it needs to do this very quickly. The OLC does model-theoretic reasoning to find hypotheses that are theorems (i.e., always true) or necessarily false (sketched below)
- For example, it can assume the attacker has a single-platform exploit and consider each platform in turn, finding which hypotheses are true or false in all cases. Then it can assume the attacker has exploits for two platforms and repeat the process
- A hypothesis can be proven true, proven false, or have an unknown proof status
- Claims: hypotheses that are proven true
Claims are definite candidates for response
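A toy sketch of this style of reasoning: enumerate every model consistent with the single-platform-exploit assumption and see whether a hypothesis holds in all, none, or only some of them. The platform and host names are invented for the example:

import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Sketch only: "proof" by enumerating all models consistent with an assumption.
// A hypothesis that holds in every model is a claim; one that fails in every model is refuted.
final class ModelEnumeration {
    enum Status { PROVEN_TRUE, PROVEN_FALSE, UNKNOWN }

    // Each model is the set of hosts the attacker could have corrupted under the assumption
    // of a single-platform exploit (platform names and host sets are hypothetical).
    static final List<Set<String>> SINGLE_PLATFORM_MODELS = List.of(
        Set.of("linuxHost1", "linuxHost2"),   // attacker exploit works only on Linux
        Set.of("winHost1"),                   // ... only on Windows
        Set.of("solarisHost1"));              // ... only on Solaris

    static Status prove(Predicate<Set<String>> hypothesis, List<Set<String>> models) {
        if (models.stream().allMatch(hypothesis)) return Status.PROVEN_TRUE;
        if (models.stream().noneMatch(hypothesis)) return Status.PROVEN_FALSE;
        return Status.UNKNOWN;
    }

    public static void main(String[] args) {
        // "winHost1 and linuxHost1 are both corrupt" fails in every single-platform model,
        // so it is proven false under that assumption.
        Predicate<Set<String>> bothCorrupt =
            m -> m.contains("winHost1") && m.contains("linuxHost1");
        System.out.println("both corrupt: " + prove(bothCorrupt, SINGLE_PLATFORM_MODELS));

        // "At most one platform's hosts are corrupt" holds in every model, so it is a claim.
        Predicate<Set<String>> singlePlatform =
            m -> m.stream().map(h -> h.replaceAll("Host\\d+", "")).distinct().count() <= 1;
        System.out.println("single platform: " + prove(singlePlatform, SINGLE_PLATFORM_MODELS));
    }
}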
27. Coherence
- Coherence partitions the system into clusters that make sense together (sketched below)
- For example, for a single accusation, either the accuser or the accused may be corrupt, but these hypotheses will cluster apart
- Responses can be made on the basis of the partition, or partition membership, when a proof is not available
In the absence of provable information, coherence may enable actions to be taken.
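A small sketch of a coherence search, assuming the usual accept/reject formulation over a weighted constraint graph; the elements and the 100 / -400 weights echo the example graph on the earlier slide but are otherwise made up, and the exhaustive search is only sensible for a tiny graph:

import java.util.List;

// Sketch only: a brute-force coherence search over accept/reject assignments.
// Positive constraints reward putting two elements on the same side; negative constraints
// reward separating them (e.g., "accuser corrupt" vs "accused corrupt" cluster apart).
final class CoherenceSketch {
    record Constraint(int a, int b, int weight) {}

    static final String[] ELEMENTS = { "accusation(A,B)", "corrupt(B)", "corrupt(A)", "dead(A)" };
    static final List<Constraint> CONSTRAINTS = List.of(
        new Constraint(0, 1, 100),    // the accusation coheres with "accused corrupt"
        new Constraint(0, 2, 100),    // ... and with "accuser corrupt" (accuser may be lying)
        new Constraint(1, 2, -400),   // but the two explanations compete: keep them apart
        new Constraint(0, 3, -400));  // a dead accuser could not have reported the accusation

    // Score an assignment: accepted[i] == true means element i is in the "accepted" cluster.
    static int score(boolean[] accepted) {
        int s = 0;
        for (Constraint c : CONSTRAINTS) {
            boolean together = accepted[c.a()] == accepted[c.b()];
            if (c.weight() > 0 && together) s += c.weight();
            if (c.weight() < 0 && !together) s += -c.weight();
        }
        return s;
    }

    public static void main(String[] args) {
        int n = ELEMENTS.length, bestMask = 0, best = Integer.MIN_VALUE;
        for (int mask = 0; mask < (1 << n); mask++) {      // exhaustive: fine for tiny graphs
            boolean[] accepted = new boolean[n];
            for (int i = 0; i < n; i++) accepted[i] = ((mask >> i) & 1) == 1;
            int s = score(accepted);
            if (s > best) { best = s; bestMask = mask; }
        }
        for (int i = 0; i < n; i++) {
            boolean acc = ((bestMask >> i) & 1) == 1;
            System.out.println(ELEMENTS[i] + " -> " + (acc ? "accepted" : "rejected"));
        }
        System.out.println("coherence score = " + best);
    }
}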
28. Protection and Cleanup
- Without oversight, resources can be overwhelmed
- To cope with flooding, we rate-limit incoming messages (sketched below)
- Excessive information accumulation
- We take two approaches to mitigate excessive information accumulation
- Removing outdated information by making it inactive
- If some remedial action has cleared up a past problem
- If new information makes previous information outdated or redundant
- If old information contradicts new information
- If an inconsistency occurs, we remove low-confidence information until the inconsistency is removed
- When resources are very constrained, more drastic measures are taken
- Hypotheses that have not been acted upon for some time will be removed, along with related accusations
Resources are reclaimed and managed to prevent uncontrolled data loss or corruption.
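A sketch of one plausible way to rate-limit incoming alerts, a standard token bucket; the rates and capacities are hypothetical, and CSISM's actual flood control may differ:

// Sketch only: a token-bucket rate limiter of the kind that could protect the
// interpretation mechanism from alert floods. Rates and capacities are hypothetical.
final class AlertRateLimiter {
    private final double tokensPerSecond;  // sustained admission rate
    private final double capacity;         // burst size
    private double tokens;
    private long lastRefillNanos;

    AlertRateLimiter(double tokensPerSecond, double capacity) {
        this.tokensPerSecond = tokensPerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the incoming alert may be admitted; false means drop or summarize it.
    synchronized boolean tryAdmit() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * tokensPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }

    public static void main(String[] args) {
        AlertRateLimiter limiter = new AlertRateLimiter(50.0, 10.0); // 50 alerts/s, bursts of 10
        int admitted = 0;
        for (int i = 0; i < 1000; i++) if (limiter.tryAdmit()) admitted++;
        System.out.println("admitted " + admitted + " of 1000 back-to-back alerts");
    }
}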
29. Current Status and Future Plans
- Knowledge Representation
- Accusation translation is implemented
- May need to change to better align with the evidence
- Evidence implementation is in process
- Will leverage the code and structure from accusation generation
- Use of the coherence partition in response selection: ongoing
- Protection and Cleanup are being implemented
- Flood control development is ongoing
- The active/inactive distinction is designed and ready to implement
- Drastic hypothesis removal is still being designed
Much work has been accomplished; work still remains.
30. Response Selection
Main Objectives
- Decide promptly how to react to an attack
- Block the attack in most situations
- Make gaming the system difficult
- Reaction based on high-confidence event interpretation
- History of responses is taken into account when selecting the next response
- Not necessarily deterministic
31. Response Selection Decision Flow
[Figure: decision flow. Claims and dilemmas from Event Interpretation's summary are used to propose potentially useful responses, which are then pruned to yield the responses to carry out, with learning and history as inputs.]
32. Response Terminology
- A response is an abstract OLC action, described generically
- Example: quarantine(X), where X could be a host, file, process, memory segment, network segment, etc.
- A response will be carried out as a sequence of response steps (a sketch follows the figure below)
- Steps for quarantine(X), isHost(X), include
- Reconfigure process protection domains on X
- Reconfigure the firewall local to X
- Reconfigure firewalls remote to X
- Steps for quarantine(X), isFile(X), include
- Mark the file non-executable
- Take a specimen, then delete
- A command is the input to actuators that implement a single response step
- Use /sbin/iptables to reconfigure software firewalls
- Use ADF Policy Server commands to reconfigure ADF cards
- Use Tripwire commands to scan file systems
[Figure: response hierarchy. Alternative responses (or) specialize into a conjunction (and) of response steps, each implemented by one or more commands.]
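A sketch of the response -> step -> command hierarchy in plain Java; the step descriptions follow the quarantine(X), isHost(X) example above, but the concrete command strings (other than /sbin/iptables) are invented placeholders:

import java.util.List;

// Sketch only: the response -> response-step -> command hierarchy from this slide,
// specialized for quarantine(X). Step names and command strings are illustrative.
final class ResponseHierarchy {
    record Command(String actuatorInput) {}                       // input to an actuator
    record ResponseStep(String description, List<Command> commands) {}
    record Response(String name, List<ResponseStep> steps) {}     // abstract OLC action

    // Specialize the abstract response quarantine(X) for the case isHost(X).
    static Response quarantineHost(String host) {
        return new Response("quarantine(" + host + ")", List.of(
            new ResponseStep("reconfigure process protection domains on " + host,
                List.of(new Command("selinux-reload-policy " + host))),        // placeholder command
            new ResponseStep("reconfigure firewall local to " + host,
                List.of(new Command("/sbin/iptables -A INPUT -j DROP"))),
            new ResponseStep("reconfigure firewalls remote to " + host,
                List.of(new Command("adf-policy-server block " + host)))));    // placeholder command
    }

    public static void main(String[] args) {
        Response r = quarantineHost("Q1PSQ-host");
        System.out.println(r.name());
        for (ResponseStep step : r.steps()) {
            System.out.println("  step: " + step.description());
            step.commands().forEach(c -> System.out.println("    cmd: " + c.actuatorInput()));
        }
    }
}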
33. Kinds of Response
- Refresh: e.g., start from checkpoint
- Reset: e.g., start from scratch
- Isolate: permanent
- Quarantine/unquarantine: temporary
- Downgrade/upgrade services and resources
- Ping: check liveness
- Move: migrate component
The DPASA design used all of these except
move. The OLC design has similar emphasis.
34. Response Selection Phases
- Phase I: propose
- A set of claims (hypotheses that are likely true) implies a set of possibly useful responses
- Phase II: prune
- Discard lower priority
- Discard based on history
- Discard based on lookahead
- Choose between incompatible alternatives
- Choose unpredictably if possible
- A learning algorithm will tune Phase II parameters (the two phases are sketched below)
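A compact sketch of the two phases, with placeholder claims, responses, and pruning criteria drawn from the example on the next slide; the real proposer and pruner are rule-based and richer than this:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: the two-phase propose/prune structure of response selection.
// Claims, responses, and the pruning criteria here are simplified placeholders.
final class ResponseSelection {
    record Candidate(String response, int priority, boolean reversible) {}

    // Phase I: each claim implies a set of possibly useful responses.
    static List<Candidate> propose(Set<String> claims) {
        List<Candidate> out = new ArrayList<>();
        if (claims.contains("corrupt(Q1PSQ)")) {
            out.add(new Candidate("reset(Q1PSQ)", 2, false));
            out.add(new Candidate("quarantine(Q1PSQ)", 3, true));
            out.add(new Candidate("isolate(Quad1)", 1, false));
        }
        return out;
    }

    // Phase II: discard based on history, then prefer reversible and higher-priority responses.
    static List<Candidate> prune(List<Candidate> candidates, Set<String> alreadyTried) {
        return candidates.stream()
            .filter(c -> !alreadyTried.contains(c.response()))               // discard based on history
            .sorted((a, b) -> {
                if (a.reversible() != b.reversible()) return a.reversible() ? -1 : 1;
                return Integer.compare(b.priority(), a.priority());
            })
            .toList();
    }

    public static void main(String[] args) {
        List<Candidate> pruned = prune(propose(Set.of("corrupt(Q1PSQ)")), Set.of("reset(Q1PSQ)"));
        pruned.forEach(c -> System.out.println("candidate: " + c.response()));
    }
}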
35. Example
- Event interpretation claims Q1PSQ is corrupt
- Relevant knowledge
- PSQ is not checkpointable
- Propose
- (A) Reset Q1PSQ, i.e., reboot, or
- (B) Quarantine Q1PSQ using the firewall, or
- (C) Isolate Quad 1
- Prune
- Reboot has already been tried, so discard (A)
- Q1PSQ is not critical, so no need to discard (B)
- Prefer (B) to (C) because it is more easily reversible, but override if there have been too many previous anomalies in Quad 1
- Learning
- Modify the definition of "too many" used when pruning (B)
36. Using Lookahead for Pruning
- Event interpretation provides an intelligent guess about the attacker's capability
- OLC rules encode knowledge about the attacker's possible goals
- Lookahead estimates the potential future state, given assumptions about capability, goals, and response selection
- If response X has a better potential future than Y, favor X (sketched below)
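A sketch of one-step lookahead under these assumptions: each candidate response is scored against each assumed attacker action, and the response with the best worst case is favored. All names and scores are hypothetical:

import java.util.List;
import java.util.Map;

// Sketch only: one-step lookahead for pruning. For each candidate response we assume
// an attacker capability and goal, estimate the worst resulting future state, and favor
// the response with the best worst case. States, scores, and names are hypothetical.
final class LookaheadPruning {
    // Estimated "potential future" score if we take `response` and the attacker then
    // takes `attackerAction` (higher is better for the defender).
    static int futureScore(String response, String attackerAction) {
        Map<String, Integer> base = Map.of(
            "quarantine(Q1PSQ)", 80,   // contains the problem but keeps the quad running
            "isolate(Quad1)", 60,      // safe but costs capacity
            "do-nothing", 40);
        int penalty = attackerAction.equals("spread-to-Quad1") && response.equals("do-nothing") ? 30 : 0;
        return base.getOrDefault(response, 0) - penalty;
    }

    static String pickByLookahead(List<String> responses, List<String> attackerActions) {
        String best = null;
        int bestWorst = Integer.MIN_VALUE;
        for (String r : responses) {
            int worst = attackerActions.stream().mapToInt(a -> futureScore(r, a)).min().orElse(0);
            if (worst > bestWorst) { bestWorst = worst; best = r; }
        }
        return best;
    }

    public static void main(String[] args) {
        String chosen = pickByLookahead(
            List.of("quarantine(Q1PSQ)", "isolate(Quad1)", "do-nothing"),
            List.of("spread-to-Quad1", "lie-low"));
        System.out.println("favored response: " + chosen);
    }
}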
37. Status
- Design
- Rules for proposing responses encoded in first-order logic
- Corresponding pruning rules described in English
- Implementation
- Mounting responses for given hypotheses prototyped in Soar
- Actual response logic is being moved outside Soar
- Risk mitigation step
- Some specific to a particular Learning Exercise run
Much less complete than Event Interpretation, but we are getting there
38. Fast Containment Response and Policies
- Michael Atighetchi
- (On behalf of Steve Harp)
39. Inner Loop Controller (ILC) Objectives
Goal: attempt to contain and correct problems at the earliest stage possible
- Policy driven: implement policies and tactics from the OLC on a single host
- Autonomous: high-speed response can work when disconnected from the OLC by an attack or failure
- Flexible: policies can be updated at any time
- Adaptive: use learned characteristics of the host and monitored services to tune the policy
- Low impact on mission: able to back out of defensive decisions when warranted
40. Current ILC Policies
- Generic...
- Excessive resource consumption
- Memory, CPU load
- Tampering with component files
- Unexpected service termination
- Unexpected network connections
- Specialized...
- JBI client registration failure
- ADF Agent failure modes
- SELinux enforcement status
41. Policy Example
[Figure: example policy flow (sketched in code below). A failed file integrity check on file F1 sets the local file status to corrupted and raises a File Tampering Problem; the policy blocks the network and suspends affected services, then a File Cleanup subgoal quarantines and restores the file and removes the block/suspend reasons.]
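The real ILC policies are Jess rules; the sketch below renders the file-tampering flow from the figure as plain Java, with invented action names, purely to show the ordering of containment and cleanup:

// Sketch only: the file-tampering policy flow rendered as plain Java.
// The real ILC encodes this as Jess rules; action names here are illustrative.
final class FileTamperingPolicy {
    private void act(String action) { System.out.println(action); }

    // Triggered when a file integrity check fails for file f.
    void onIntegrityCheckFailure(String f) {
        act("mark " + f + " status=corrupted");         // local file status
        act("raise file-tampering problem for " + f);
        act("block network on this host");              // contain first
        act("suspend services that depend on " + f);
        cleanup(f);                                     // then pursue the cleanup subgoal
    }

    // File cleanup subgoal: quarantine and restore, then undo the containment.
    private void cleanup(String f) {
        act("quarantine " + f);
        act("restore " + f + " from known-good copy");
        act("remove block/suspend reasons tied to " + f);
    }

    public static void main(String[] args) {
        new FileTamperingPolicy().onIntegrityCheckFailure("F1");
    }
}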
42. ILC Recent Work
- ILC configuration
- Made this driven by an (ontology) model of DPASA
- No need for configuration files for each host
- ILC runs on each simulated host and automatically loads the required policy components
- Integration
- Hitherto the ILC has been (unit) tested stand-alone
- Initial integration with the Jess DPASA simulator complete; broadening support for multiple policies
- Adjustments to the API to match the simulator
43. ILC Current Status
- ILC policy to handle various applications
- Model-driven configuration
- Metrics
- Rules: 94, Functions: 134, Frames: 24, Globals: 20
- Base reaction time (in unit test): 4 ms
- (Measuring the inference part only)
- Target reaction time is < 100 ms
44. ILC Ongoing Work
- Complete integration with the rest of the CSISM framework
- DPASA Simulator
- ILC/OLC interaction
- Designed; integration TBD
- Testing
- Verify correct reactions in the simulator to various simulated attacks
- Measure reaction times
45. Learning Augmentation
- Michael Atighetchi
- (On behalf of Karen Haigh)
46. Learning Augmentation Motivation
- Why learning?
- Extremely difficult to capture all the complexities of the system, particularly interactions among activities
- The system is dynamic (a static configuration gets out of date)
- Core Challenge
- Offline training: good data, complex environment, not a dynamic system
- Online training: unknown data, complex environment, dynamic system
- Human: good data, not a complex environment, not a dynamic system
- CSISM's experimental sandbox: good data (self-labeled), complex environment, dynamic system
Very hard for adversary to train the learner!!!
Sandbox approach successfully tried in SRS phase 1
Adaptation is the key to survival
47. Development Plan for Learning in CSISM
- Responses under normal conditions (calibration)
- An important first step because it learns how to respond to normal conditions
- Shown at the June PI meeting
- Situation-dependent responses under attack conditions
- Multi-stage attacks
- Since June
48. Calibration Results for All Registration Times
[Figure: calibration results shown at the June '07 PI meeting (beta = 0.0005). Two shoulder points indicate the upper and lower limits; as more observations are collected, the estimates become more confident of the range of expected values (i.e., tighter estimates around the observations).]
49. Multistage Attacks
- Multistage attacks involve a sequence of actions that span multiple hosts and take multiple steps to succeed
- A sequence of actions with causal relationships
- An action A must occur to set up the initial conditions for action B; action B would have no effect without previously executing action A
- Challenge: identify which observations indicate the necessary and sufficient elements of an attack (credit assignment)
- Incidental observations that are either
- side effects of normal operations, or
- chaff explicitly added by an attacker to divert the defender
- Concealment (e.g., to remove evidence)
- Probabilistic actions (e.g., to improve probability of success)
Not yet
50. Architectural Schema for Learning of Attack Theories and Situation-Dependent Responses
[Figure: architectural schema. CSISM sensors (ILC, IDS) produce a numbered sequence of observations ending in failure of the protected system, only some of which are essential; from these the learner derives viable attack theories and, in turn, viable defense strategies and detection rules.]
51. Multi-Stage Learner
The hard part!
- do
- Generate a theory according to a heuristic
- (the complete set of theories is all permutations of all members of Powerset(observations))
- Test the theory
- Incrementally update the OLC/ILC rulebase
- while theories remain
(a generate-and-test sketch follows below)
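A sketch of the generate-and-test loop with the shortest-first heuristic; to keep it small it enumerates only order-preserving subsets (not all permutations), and the sandbox verdict is a stand-in predicate:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: the generate-and-test loop of the multi-stage learner, using a
// "shortest first" heuristic over subsets of the observed actions. The sandbox
// oracle here is a stand-in for replaying the candidate attack in the sandbox.
final class MultiStageLearnerSketch {
    // Enumerate all non-empty subsets of the observations (order preserved, no permutations,
    // to keep the sketch small), shortest first.
    static List<List<String>> generateTheories(List<String> observations) {
        List<List<String>> theories = new ArrayList<>();
        for (int mask = 1; mask < (1 << observations.size()); mask++) {
            List<String> t = new ArrayList<>();
            for (int i = 0; i < observations.size(); i++)
                if (((mask >> i) & 1) == 1) t.add(observations.get(i));
            theories.add(t);
        }
        theories.sort(Comparator.comparingInt(List::size));   // shortest-first heuristic
        return theories;
    }

    public static void main(String[] args) {
        List<String> observed = List.of("A", "B", "C", "D", "E");
        // Hypothetical sandbox verdict: the attack succeeds iff the theory contains A and C.
        Predicate<List<String>> sandboxSucceeds = t -> t.contains("A") && t.contains("C");

        for (List<String> theory : generateTheories(observed)) {   // do ... while theories remain
            if (sandboxSucceeds.test(theory)) {                    // test the theory in the sandbox
                System.out.println("valid attack theory: " + theory);
                // here the OLC/ILC rulebase would be incrementally updated
                break;  // shortest valid theory found first under this heuristic
            }
        }
    }
}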
52. Heuristics and Structure of Results
- Primary goal: find all shortest valid attacks (i.e., minimum required subsets) as soon as possible
- Example: in ABCDE, AC and DE may both be valid
- Secondary goal: find all valid attacks as soon as possible
- Example: in ABCDE, ABC may also be valid
- Heuristics
- Shortest first
- Longest first
- Edit distance to original (sketched below)
- Dynamic resort to valid set
- Initially, edit distance to the original attack
- Remaining theories are compared to all valid attacks; edit distance is averaged
- Dynamic resort / free to remove chaff
- Same as dynamic resort to valid set, but the cost of deletion is zero
- Worst-case comparison: sort theories so that
- the shortest valid attack is found last
- all valid attacks are at the end
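A sketch of the edit-distance heuristic: candidate theories are ordered by Levenshtein distance from the originally observed sequence, and the "free to remove chaff" variant corresponds to setting the deletion cost to zero:

import java.util.Comparator;
import java.util.List;

// Sketch only: the "edit distance to original" heuristic. Candidate theories are ordered
// by their Levenshtein distance from the originally observed attack sequence; the
// "free to remove chaff" variant would simply set the deletion cost to zero.
final class EditDistanceHeuristic {
    static int editDistance(List<String> a, List<String> b, int deletionCost) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i * deletionCost;      // delete from a
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;                     // insert into a
        for (int i = 1; i <= a.size(); i++)
            for (int j = 1; j <= b.size(); j++) {
                int subst = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + deletionCost,      // deletion
                                            d[i][j - 1] + 1),                // insertion
                                   d[i - 1][j - 1] + subst);                 // substitution
            }
        return d[a.size()][b.size()];
    }

    public static void main(String[] args) {
        List<String> original = List.of("A", "B", "C", "D", "E");
        List<List<String>> candidates = List.of(
            List.of("A", "C"), List.of("D", "E"), List.of("A", "B", "C"));
        candidates.stream()
            .sorted(Comparator.comparingInt((List<String> c) -> editDistance(original, c, 1)))
            .forEach(c -> System.out.println(c + " dist=" + editDistance(original, c, 1)));
    }
}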
53. Comparison of Heuristics
[Figure: comparison of the heuristics on a 4-observation, 3-stage attack. With 4 observations there are 64 potential trials; with 10 observations, roughly 10 million potential trials.]
54. Incremental Hypothesis Generation
- Enhanced query learner generates attack hypotheses
- incrementally, with low memory overhead; it is able to explore large observation spaces (>> 8 steps)
- in heuristic order, to acquire the concept rapidly
- Heuristic bias
- look for shorter attacks first (adjustable prior)
- suspect the order of steps has an influence
- suspect steps interact positively (for the attacker)
- performance comparable to edit-dist/length
55. Incremental Hypothesis Generation Results
- Target concept: the disjunction ".A.B." or ".B.C."
- Scores represent the sum of trial numbers for the elementary concepts
- Note
- There are many possible observation sequences that could generate these target concepts; the score is the average over 8 of the sequences
- For observation sequences longer than 8, learners that pre-enumerate and sort their queries run out of memory
SONNI: Short-Ordered-NonNegative Incremental Hypothesis Generator
56. Status, Development Plan, and Future Steps
- June '07 PI Meeting
- Responses under normal conditions (calibration)
- Analyze DPASA data (done)
- Integrate with ILC (single node) (done)
- Add experimentation sandbox (single node)
- Calibrate across nodes
- Situation-dependent responses under attack conditions
- Multi-stage attacks
- Since June
- Development of the sandbox, and initial integration efforts with the learner (done)
- Attack actions, observations, and control actions
- Quality signal
- Development of multistage algorithm (version 1.0 done)
- Theories with sandbox
- Incremental generation of theories
- TODO: ILC input / OLC output
57. Simulated Testbed
- Michael Atighetchi
- (on behalf of Michael Atighetchi)
58. Why Simulation?
Defense-enabled JBI as tested under OASIS Dem/Val
- Simulation of the defense-enabled system
- Use as a specification
- Use as integration middleware
- Use for red team experimentation
59. JessSim: The JESS Simulator
Implemented via JESS rules and functions
Generated via a Protégé plugin
60. JessSim Current Status
- Implemented Protocols (14)
- Plumbing (5 rules)
- Alert (6 rules)
- Registration (8 rules)
- SELinux (1 rule)
- Reboot (3 rules)
- LC message (3 rules)
- ADF (3 rules)
- Heartbeat (1 rule)
- PSQ (3 rules)
- Tripwire (3 rules)
- ServiceControl (1 rule)
- POSIX Signals (1 rule)
- Process Memory/CPU status (2 rules)
- Host Memory/CPU status (2 rules)
- Implemented Attacks (8)
- Availability: disable SELinux service
- Availability: shut down a host
- Availability: cause a Downstream Controller to crash
- Availability: cause corruption of endpoint references in SMs
- Availability: killing of processes via kill -9
- Integrity: corruption of files
- Policy violation: creation of a new (rogue) process
- Availability: causing a process to overload the CPU
- Test Coverage
- Unit tests: 28 JUnit tests covering protocol details
- OASIS Dem/Val: main events of DPASA Run 6
- Fidelity
- Focused on application-level protocols
61. JessSim Ongoing Work
- Increase fidelity of network simulation
- Checks for network connectivity: crash(router) -> com broken(A, B)
- Simulation of TCP/IP flows for the ILC
- Increase fidelity of host simulation for the ILC
- install-network-block / remove-network-block
- note-network-connection / reset-network-connection
- quarantine-file / restore-file / delete-file / checkpoint-file
- note-selinux-down / note-selinux-up
- shun-network-address / unshun-network-address
- enable-interface / disable-interface
- set-boot-option
- Protocols for ILC/OLC communication
- forward-to-olc()
- Cleanup
- Convert all time units to seconds in all scenarios
62. Next Steps: Integration and Evaluation
63. Learning Integration
- ILC learning
- Pre-deployment calibration: learn threshold parameters for registration times
- Calibrate across nodes
- OLC learning
- Results from learning with the experimentation sandbox
- Parameter tuning
- New rules/heuristics
64. ILC <-> OLC Integration
- ILC -> OLC
- Calls to the OLC implemented in ILC policies via calls to the ilc-api
- ILC as an informant to the OLC
- ILC as a henchman of the OLC
- OLC -> ILC
- The OLC can process alerts forwarded to it from the ILC
- Consider the ILC as a mechanism during response selection
65. JessSim Integration
- ILC integration with JessSim
- ArrestRunawayProcess loop working
- Implement the file, network, and reboot protocols necessary to support other existing ILC loops
- OLC integration with JessSim
- OLC fully integrated with JessSim
- Adjust integration given changes due to
- moving transcription logic (Alerts -> Accusations, Observations -> Evidence) into Jess
- performing response selection in Jess
- Integration Framework
- All components execute within a single JVM
- Support execution of the ILC and OLC on dedicated hosts to measure timeliness
66. Integration Framework: Current Status
67. Integration Framework: Needed for Red Team Experimentation
68. Evaluation
- Interaction with Red and White Teams
- Initial telecon (late October)
- Continued technical interchange about CSISM capabilities
- Potential gaps/disagreements
- How to use the simulator
- Evaluation goals
- Next steps
- Demonstration of the system
- Red team visit
- Code drop