Title: Cognitive Support for Intelligent Survivability Management
1. Cognitive Support for Intelligent Survivability Management
Dec 18, 2007
2. Outline
- Project Summary and Progress Report
- Goals/Objectives
- Changes
- Current status
- Technical Details of Ongoing Tasks
- Event Interpretation (OLC)
- Response Selection (OLC)
- Rapid Response (ILC)
- Learning Augmentation
- Simulated test bed (simulator)
- Next steps
- Development and Integration
- Red team evaluation
3. Project Summary and Progress Report
4. Background
- Outcome of DARPA OASIS Dem/Val program
- Survivability architecture
- Protection, detection, and reaction (defense mechanisms)
- Synergistic organization of overlapping defense functionality
- Demonstrated in the context of an AFRL exemplar (JBI)
- With knowledge about the architecture, human defenders can be highly effective
- Even against a sophisticated adversary with significant inside access and privilege (learning exercise runs)
The survivability architecture provides the dials and knobs, but an intelligent control loop, in the form of human experts, was needed for managing them.
Managing here means making effective decisions.
What was this knowledge? How did the human defenders use it? Can the intelligent control loop be automated?
5. Incentives and Obstacles
- Incentives
- Narrowing of the qualitative gap in automated cyber-defense decision making
- Self-managed survivability architecture
- Self-regenerative systems
- Next generation of adaptive system technology
- From hard-coded adaptation rules to cognitive rules to evolutionary rules
- Obstacles (at various levels)
- Concept: insight (sort of) but no formalization
- Implementation: architecture, tool capability/choice
- Evaluation: how to create a reasonably complex context and a wide range of incidents; a real system?
- Evaluation: how to quantify and validate
- Usefulness and effectiveness
- Measuring technological advancement
6. CSISM Objectives
- Design and implement an automated cyber-defense decision-making mechanism
- Expert-level proficiency
- Drive a typical defense-enabled system
- Effective, reusable, easy to port and retarget
- Evaluate it in a wider context and scope
- Nature and type of events and observations
- Size and complexity of the system
- Readiness for a real system context
- Understanding the residual issues and challenges
7. Main Problem
- Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved
- Doing it as well as human experts
- And also as well as in other disciplines
- Additional difficulties
- Rapid and real-time decision-making and response
- Uncertainty due to incomplete and imperfect information
- Widely varying operating conditions (from no alerts to 100s of alerts per second)
- New symptoms and changes in the adversary's strategy
8. For Example...
- Consider a missing protocol message alert
- Observable: a system-specific alert
- A accuses B of omission
- Interpretation
- A is not dead (it reported the alert)
- Is A lying? (corrupt)
- B is dead
- B is not dead, just behaving badly (corrupt)
- A and B cannot communicate
- Refinement (depending on what else we know about the system, the attacker objective, ...)
- Other communications between A and B
- A service is dead if the host is dead...
- OS platform and likelihood of multi-platform exploits...
- Response selection
- Now or later?
- Many options
9. Approach
[Figure: the CSISM control loop. A stream of events and observations is interpreted into hypotheses; responses produce actions on the system; additional react loops and a learn step that modifies parameters or policies complete the loop.]
- Multiple levels of reasoning
- Varying spatial and temporal scope
- Different techniques
- The main control loop is partitioned into two main parts: event interpretation and response selection
10. Concrete Goals
- A working prototype integrating
- Policy-based reactive (cyber-defense) response
- Cognitive control loop for system-wide (cyber-defense) event interpretation and response
- Learning augmentation to modify defense parameters and policies
- Achieve expert-level proficiency
- In making appropriate cyber-defense decisions
- Evaluation by
- Ground truth: ODV operator responses to symptoms caused by the red team
- Program metrics
11. Current State
- Accomplished quite a bit in 1 year
- KR and reasoning framework for handling cyber-defense events is well developed
- Proof-of-concept capability demonstrated for various components at multiple levels
- OLC, ILC, Learner, and Simulator
- E.g., Prover9, Soar (various iterations)
- Began integration and tackling incidental issues
- Evaluation ongoing (internal and external)
- Slightly behind in terms of response implementation and integration
- Various reasons (inherent complexity, and the fact that it is very hard to debug the reasoning mechanism)
- Longer-term issues: confidence in such a cognitive engine? Is a system-wide scope really tenable? Is it possible to build better debugging support?
- Taken mitigating steps (see next)
12. Significant Changes
Recall the linear flow using various types of knowledge? That was what we were planning in June. This evolved, and the actual flow looks like the following:
[Figure: revised processing flow. Accusations and evidence are processed, translated, and mapped down; a constraint network is built and refined; hypotheses are pruned via coherence and proof, and garbage collection cleans up. The steps draw on knowledge about bad behavior (bin 1), information flow (bin 2), the attacker goal (bin 3), and protocols and scenarios (bin 4).]
13. Significant Changes (cont'd)
- Response mechanism
- Do in Jess/Java instead of Soar
- Issues
- Get the state accessible to Jess/Java
- Viewers
- Dual purpose: usability and debugging
- Was: rule driven; write a Soar rule to produce what to display
- Now: get the state from Soar and process it
14. Schedule
- Midterm release (Aug 2007) done
- Red team visit (Early 2008)
- Next release (Feb 2008)
- Code freeze (April 2008)
- Red team exercises (May/June 2008)
15. Event Interpretation and Response (OLC)
16. OLC Overall Goals
- Interpret alerts and observations
- (sometimes a lack of observations triggers alerts)
- Find an appropriate response
- (sometimes it may decide that no response is necessary)
- Housekeep
- Keep history
- Clean up
17. OLC Components
[Figure: OLC component diagram. Accusations and evidence flow into Event Interpretation; a summary flows to Response Selection, which produces responses; Learning and History support both.]
18. Event Interpretation
- Main Objectives
- Essential: event interpretation
- Interpreting events in terms of hypotheses and models
- Uses deduction and coherence to decide which hypotheses are candidates for response
- Incidental undertakings
- Protecting the interpretation mechanisms from attack: flooding and resource consumption
- Current status and plans
- Note that marked items are in progress
Event interpretation creates candidate hypotheses which can be responded to.
19. Event Interpretation Decision Flow
[Figure: decision flow. A generator turns accusations and evidence into hypotheses; theorem proving yields claims and coherence handles dilemmas; a summary is passed to Response Selection, with learning and history alongside.]
20. Knowledge Representation
- Turn very specific knowledge into an intermediate form amenable to reasoning
- e.g., "Q2SM sent a malformed Spread message" -> "Q2SM is corrupt"
- Create a graph of inputs and intermediate states to enable reasoning about the whole system
- Accusations and Evidence
- Hypotheses
- Constraints between A and B
- Use the graph to enable deduction via proof and to perform a coherence search
Specific system inputs are translated into a reusable intermediate form which is used for reasoning.
21. Preparing to Reason
- Observations and Alerts are transformed into Accusations and Evidence (a sketch follows below)
- Currently translation is done in Soar, but it may move outside to keep the translation and reasoning separate
Alert: notification of an anomalous event
Accusation: generic alert
Observation: notification of an expected event
Evidence: generic observation
Alerts and Observations are turned into Accusations and Evidence that can be reasoned about.
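A minimal Java sketch of such a translation (the actual translation is done in Soar and may move to Jess/Java); the alert kinds and component names used here, e.g. "missing-protocol-message", are hypothetical placeholders, not CSISM's real vocabulary:

import java.util.Optional;

// Sketch only: illustrates mapping system-specific alerts to the five generic
// accusation types. Alert kinds and component names (e.g. "Q2SM") are hypothetical.
final class AlertTranslator {
    enum AccusationType { VALUE, POLICY, TIMING, OMISSION, FLOOD }
    record Alert(String reporter, String kind, String subject, long timestamp) {}
    record Accusation(AccusationType type, String accuser, String accused, long timestamp) {}

    // Map a raw, system-specific alert into a generic accusation usable by event interpretation.
    static Optional<Accusation> translate(Alert a) {
        return switch (a.kind()) {
            case "malformed-spread-message" ->   // sender produced bad data -> Value
                Optional.of(new Accusation(AccusationType.VALUE, a.reporter(), a.subject(), a.timestamp()));
            case "missing-protocol-message" ->   // expected data never arrived -> Omission
                Optional.of(new Accusation(AccusationType.OMISSION, a.reporter(), a.subject(), a.timestamp()));
            case "selinux-policy-violation" ->   // a security policy was violated -> Policy
                Optional.of(new Accusation(AccusationType.POLICY, a.reporter(), a.subject(), a.timestamp()));
            default -> Optional.empty();         // unrecognized alerts are not translated here
        };
    }

    public static void main(String[] args) {
        Alert alert = new Alert("Q2PSQ", "malformed-spread-message", "Q2SM", System.currentTimeMillis());
        translate(alert).ifPresent(acc ->
            System.out.println(acc.accuser() + " accuses " + acc.accused() + " (" + acc.type() + ")"));
    }
}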
22. Alerts and Accusations
- By using accusations, the universe of bad behavior used in reasoning is limited, with limited loss of fidelity
- The five accusations below are representative of attacks in the system
- Value: accused sent malformed data
- Policy: accused violated a security policy
- Timing: accused sent well-formed data at the wrong time
- Omission: expected data was never received from the accused
- Flood: accused is sending much more data than expected
CSISM uses 5 types of accusations to reason about a potentially infinite number of bad actions that could be reported.
23. Evidence
- While accusations capture unexpected behavior, evidence is used for expected behavior
- Evidence limits the universe of expected behavior used in reasoning, with limited loss of fidelity
- Alive: the subject is alive
- Timely: the subject participated in a timely exchange of information
- Specific historical data about interactions is used by the OLC, just not in event interpretation
CSISM uses two types of evidence to represent the occurrence of expected actions for event interpretation.
24. Hypotheses
- When an accusation is created, a set of hypotheses is proposed that explains the accusation
- For example, a value accusation means either the accuser or the accused is corrupt, and that the accuser is not dead (see the sketch below)
- The following hypotheses (both positive and negative) can be proposed
- Dead: subject is dead (fail-stop failure)
- Corrupt: subject is corrupt
- Communication-Broken: subject has lost connectivity
- Flooded: subject is starved of critical resources
- OR: a meta-hypothesis that one of a number of related hypotheses is true
Accusations lead to hypotheses about the cause of the accusation.
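An illustrative sketch, again hypothetical Java rather than the Soar implementation, of proposing hypotheses for a value accusation; the hypothesis kinds come from the list above, and the linkage by an OR meta-hypothesis and constraints is only noted in a comment:

import java.util.List;

// Sketch only: proposes hypotheses explaining a single accusation. The hypothesis
// kinds mirror the slide (Dead, Corrupt, Communication-Broken, Flooded); in CSISM the
// alternatives would additionally be tied together by an OR meta-hypothesis and constraints.
final class HypothesisProposer {
    enum HypKind { DEAD, CORRUPT, COMM_BROKEN, FLOODED }
    record Hypothesis(HypKind kind, String subject, boolean positive) {
        @Override public String toString() { return (positive ? "" : "NOT ") + kind + "(" + subject + ")"; }
    }

    // For a value accusation (accuser says the accused sent malformed data):
    // either the accused or the accuser is corrupt, and the accuser is not dead.
    static List<Hypothesis> proposeForValueAccusation(String accuser, String accused) {
        return List.of(
            new Hypothesis(HypKind.CORRUPT, accused, true),
            new Hypothesis(HypKind.CORRUPT, accuser, true),   // the accuser may be lying
            new Hypothesis(HypKind.DEAD, accuser, false));    // it reported, so it is not dead
    }

    public static void main(String[] args) {
        proposeForValueAccusation("A", "B").forEach(System.out::println);
    }
}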
25. Reasoning Structure
- Hypotheses, Accusations, and Evidence are connected using constraints
- The resulting graph is used for
- Coherence search
- Proving system facts
[Figure: example constraint graph. An accusation node connects through an OR meta-hypothesis to hypotheses such as host dead, host corrupt, and comm broken; edges carry positive constraint weights (e.g., 100) and negative ones (e.g., -100, -400).]
A graph is created to enable reasoning about
hypotheses.
26. Proofs about the System
- The OLC needs to derive as much certain information as it can, but it needs to do this very quickly. The OLC does model-theoretic reasoning to find hypotheses that are theorems (i.e., always true) or necessarily false (sketched below)
- For example, it can assume the attacker has a single-platform exploit and consider each platform in turn, finding which hypotheses are true or false in all cases. Then it can assume the attacker has exploits for two platforms and repeat the process
- A hypothesis can be proven true, proven false, or have an unknown proof status
- Claims: hypotheses that are proven true
Claims are definite candidates for response
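A toy sketch of this style of reasoning: enumerate every model consistent with the single-platform-exploit assumption and see whether a hypothesis holds in all, none, or only some of them. The platform and host names are invented for the example:

import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Sketch only: "proof" by enumerating all models consistent with an assumption.
// A hypothesis that holds in every model is a claim; one that fails in every model is refuted.
final class ModelEnumeration {
    enum Status { PROVEN_TRUE, PROVEN_FALSE, UNKNOWN }

    // Each model is the set of hosts the attacker could have corrupted under the assumption
    // of a single-platform exploit (platform names and host sets are hypothetical).
    static final List<Set<String>> SINGLE_PLATFORM_MODELS = List.of(
        Set.of("linuxHost1", "linuxHost2"),   // attacker exploit works only on Linux
        Set.of("winHost1"),                   // ... only on Windows
        Set.of("solarisHost1"));              // ... only on Solaris

    static Status prove(Predicate<Set<String>> hypothesis, List<Set<String>> models) {
        if (models.stream().allMatch(hypothesis)) return Status.PROVEN_TRUE;
        if (models.stream().noneMatch(hypothesis)) return Status.PROVEN_FALSE;
        return Status.UNKNOWN;
    }

    public static void main(String[] args) {
        // "winHost1 and linuxHost1 are both corrupt" fails in every single-platform model,
        // so it is proven false under that assumption.
        Predicate<Set<String>> bothCorrupt =
            m -> m.contains("winHost1") && m.contains("linuxHost1");
        System.out.println("both corrupt: " + prove(bothCorrupt, SINGLE_PLATFORM_MODELS));

        // "At most one platform's hosts are corrupt" holds in every model, so it is a claim.
        Predicate<Set<String>> singlePlatform =
            m -> m.stream().map(h -> h.replaceAll("Host\\d+", "")).distinct().count() <= 1;
        System.out.println("single platform: " + prove(singlePlatform, SINGLE_PLATFORM_MODELS));
    }
}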
27. Coherence
- Coherence partitions the system into clusters that make sense together (sketched below)
- For example, for a single accusation, either the accuser or the accused may be corrupt, but these hypotheses will cluster apart
- Responses can be made on the basis of the partition, or partition membership, when a proof is not available
In the absence of provable information, coherence may enable actions to be taken.
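A small sketch of a coherence search, assuming the usual accept/reject formulation over a weighted constraint graph; the elements and the 100 / -400 weights echo the example graph on the earlier slide but are otherwise made up, and the exhaustive search is only sensible for a tiny graph:

import java.util.List;

// Sketch only: a brute-force coherence search over accept/reject assignments.
// Positive constraints reward putting two elements on the same side; negative constraints
// reward separating them (e.g., "accuser corrupt" vs "accused corrupt" cluster apart).
final class CoherenceSketch {
    record Constraint(int a, int b, int weight) {}

    static final String[] ELEMENTS = { "accusation(A,B)", "corrupt(B)", "corrupt(A)", "dead(A)" };
    static final List<Constraint> CONSTRAINTS = List.of(
        new Constraint(0, 1, 100),    // the accusation coheres with "accused corrupt"
        new Constraint(0, 2, 100),    // ... and with "accuser corrupt" (accuser may be lying)
        new Constraint(1, 2, -400),   // but the two explanations compete: keep them apart
        new Constraint(0, 3, -400));  // a dead accuser could not have reported the accusation

    // Score an assignment: accepted[i] == true means element i is in the "accepted" cluster.
    static int score(boolean[] accepted) {
        int s = 0;
        for (Constraint c : CONSTRAINTS) {
            boolean together = accepted[c.a()] == accepted[c.b()];
            if (c.weight() > 0 && together) s += c.weight();
            if (c.weight() < 0 && !together) s += -c.weight();
        }
        return s;
    }

    public static void main(String[] args) {
        int n = ELEMENTS.length, bestMask = 0, best = Integer.MIN_VALUE;
        for (int mask = 0; mask < (1 << n); mask++) {      // exhaustive: fine for tiny graphs
            boolean[] accepted = new boolean[n];
            for (int i = 0; i < n; i++) accepted[i] = ((mask >> i) & 1) == 1;
            int s = score(accepted);
            if (s > best) { best = s; bestMask = mask; }
        }
        for (int i = 0; i < n; i++) {
            boolean acc = ((bestMask >> i) & 1) == 1;
            System.out.println(ELEMENTS[i] + " -> " + (acc ? "accepted" : "rejected"));
        }
        System.out.println("coherence score = " + best);
    }
}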
28. Protection and Cleanup
- Without oversight, resources can be overwhelmed
- To cope with flooding, we rate-limit incoming messages (sketched below)
- Excessive information accumulation
- We take two approaches to mitigate excessive information accumulation
- Removing outdated information by making it inactive
- If some remedial action has cleared up a past problem
- If new information makes previous information outdated or redundant
- If old information contradicts new information
- If an inconsistency occurs, we remove low-confidence information until the inconsistency is removed
- When resources are very constrained, more drastic measures are taken
- Hypotheses that have not been acted upon for some time will be removed, along with related accusations
Resources are reclaimed and managed to prevent uncontrolled data loss or corruption.
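A sketch of one plausible way to rate-limit incoming alerts, a standard token bucket; the rates and capacities are hypothetical, and CSISM's actual flood control may differ:

// Sketch only: a token-bucket rate limiter of the kind that could protect the
// interpretation mechanism from alert floods. Rates and capacities are hypothetical.
final class AlertRateLimiter {
    private final double tokensPerSecond;  // sustained admission rate
    private final double capacity;         // burst size
    private double tokens;
    private long lastRefillNanos;

    AlertRateLimiter(double tokensPerSecond, double capacity) {
        this.tokensPerSecond = tokensPerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the incoming alert may be admitted; false means drop or summarize it.
    synchronized boolean tryAdmit() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * tokensPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }

    public static void main(String[] args) {
        AlertRateLimiter limiter = new AlertRateLimiter(50.0, 10.0); // 50 alerts/s, bursts of 10
        int admitted = 0;
        for (int i = 0; i < 1000; i++) if (limiter.tryAdmit()) admitted++;
        System.out.println("admitted " + admitted + " of 1000 back-to-back alerts");
    }
}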
29. Current Status and Future Plans
- Knowledge Representation
- Accusation translation is implemented
- May need to change to better align with the evidence
- Evidence implementation is in process
- Will leverage the code and structure from accusation generation
- Use of the coherence partition in response selection: ongoing
- Protection and Cleanup are being implemented
- Flood control development is ongoing
- The active/inactive distinction is designed and ready to implement
- Drastic hypothesis removal is still being designed
Much work has been accomplished; work still remains.
30. Response Selection
Main Objectives
- Decide promptly how to react to an attack
- Block the attack in most situations
- Make gaming the system difficult
- Reaction based on high-confidence event interpretation
- History of responses is taken into account when selecting the next response
- Not necessarily deterministic
31. Response Selection Decision Flow
[Figure: decision flow. Claims and dilemmas from Event Interpretation's summary are used to propose potentially useful responses, which are then pruned to yield the responses to carry out, with learning and history as inputs.]
32. Response Terminology
- A response is an abstract OLC action, described generically
- Example: quarantine(X), where X could be a host, file, process, memory segment, network segment, etc.
- A response will be carried out as a sequence of response steps (a sketch follows the figure below)
- Steps for quarantine(X), isHost(X), include
- Reconfigure process protection domains on X
- Reconfigure the firewall local to X
- Reconfigure firewalls remote to X
- Steps for quarantine(X), isFile(X), include
- Mark the file non-executable
- Take a specimen, then delete
- A command is the input to actuators that implement a single response step
- Use /sbin/iptables to reconfigure software firewalls
- Use ADF Policy Server commands to reconfigure ADF cards
- Use Tripwire commands to scan file systems
[Figure: response hierarchy. Alternative responses (or) specialize into a conjunction (and) of response steps, each implemented by one or more commands.]
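A sketch of the response -> step -> command hierarchy in plain Java; the step descriptions follow the quarantine(X), isHost(X) example above, but the concrete command strings (other than /sbin/iptables) are invented placeholders:

import java.util.List;

// Sketch only: the response -> response-step -> command hierarchy from this slide,
// specialized for quarantine(X). Step names and command strings are illustrative.
final class ResponseHierarchy {
    record Command(String actuatorInput) {}                       // input to an actuator
    record ResponseStep(String description, List<Command> commands) {}
    record Response(String name, List<ResponseStep> steps) {}     // abstract OLC action

    // Specialize the abstract response quarantine(X) for the case isHost(X).
    static Response quarantineHost(String host) {
        return new Response("quarantine(" + host + ")", List.of(
            new ResponseStep("reconfigure process protection domains on " + host,
                List.of(new Command("selinux-reload-policy " + host))),        // placeholder command
            new ResponseStep("reconfigure firewall local to " + host,
                List.of(new Command("/sbin/iptables -A INPUT -j DROP"))),
            new ResponseStep("reconfigure firewalls remote to " + host,
                List.of(new Command("adf-policy-server block " + host)))));    // placeholder command
    }

    public static void main(String[] args) {
        Response r = quarantineHost("Q1PSQ-host");
        System.out.println(r.name());
        for (ResponseStep step : r.steps()) {
            System.out.println("  step: " + step.description());
            step.commands().forEach(c -> System.out.println("    cmd: " + c.actuatorInput()));
        }
    }
}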
33. Kinds of Response
- Refresh: e.g., start from checkpoint
- Reset: e.g., start from scratch
- Isolate: permanent
- Quarantine/unquarantine: temporary
- Downgrade/upgrade services and resources
- Ping: check liveness
- Move: migrate component
The DPASA design used all of these except
move. The OLC design has similar emphasis.
34. Response Selection Phases
- Phase I: propose
- A set of claims (hypotheses that are likely true) implies a set of possibly useful responses
- Phase II: prune
- Discard lower priority
- Discard based on history
- Discard based on lookahead
- Choose between incompatible alternatives
- Choose unpredictably if possible
- A learning algorithm will tune Phase II parameters (the two phases are sketched below)
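A compact sketch of the two phases, with placeholder claims, responses, and pruning criteria drawn from the example on the next slide; the real proposer and pruner are rule-based and richer than this:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: the two-phase propose/prune structure of response selection.
// Claims, responses, and the pruning criteria here are simplified placeholders.
final class ResponseSelection {
    record Candidate(String response, int priority, boolean reversible) {}

    // Phase I: each claim implies a set of possibly useful responses.
    static List<Candidate> propose(Set<String> claims) {
        List<Candidate> out = new ArrayList<>();
        if (claims.contains("corrupt(Q1PSQ)")) {
            out.add(new Candidate("reset(Q1PSQ)", 2, false));
            out.add(new Candidate("quarantine(Q1PSQ)", 3, true));
            out.add(new Candidate("isolate(Quad1)", 1, false));
        }
        return out;
    }

    // Phase II: discard based on history, then prefer reversible and higher-priority responses.
    static List<Candidate> prune(List<Candidate> candidates, Set<String> alreadyTried) {
        return candidates.stream()
            .filter(c -> !alreadyTried.contains(c.response()))               // discard based on history
            .sorted((a, b) -> {
                if (a.reversible() != b.reversible()) return a.reversible() ? -1 : 1;
                return Integer.compare(b.priority(), a.priority());
            })
            .toList();
    }

    public static void main(String[] args) {
        List<Candidate> pruned = prune(propose(Set.of("corrupt(Q1PSQ)")), Set.of("reset(Q1PSQ)"));
        pruned.forEach(c -> System.out.println("candidate: " + c.response()));
    }
}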
35. Example
- Event interpretation claims Q1PSQ is corrupt
- Relevant knowledge
- PSQ is not checkpointable
- Propose
- (A) Reset Q1PSQ, i.e., reboot, or
- (B) Quarantine Q1PSQ using the firewall, or
- (C) Isolate Quad 1
- Prune
- Reboot has already been tried, so discard (A)
- Q1PSQ is not critical, so no need to discard (B)
- Prefer (B) to (C) because it is more easily reversible, but override if there have been too many previous anomalies in Quad 1
- Learning
- Modify the definition of "too many" used when pruning (B)
36. Using Lookahead for Pruning
- Event interpretation provides an intelligent guess about the attacker's capability
- OLC rules encode knowledge about the attacker's possible goals
- Lookahead estimates the potential future state, given assumptions about capability, goals, and response selection
- If response X has a better potential future than Y, favor X (sketched below)
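A sketch of one-step lookahead under these assumptions: each candidate response is scored against each assumed attacker action, and the response with the best worst case is favored. All names and scores are hypothetical:

import java.util.List;
import java.util.Map;

// Sketch only: one-step lookahead for pruning. For each candidate response we assume
// an attacker capability and goal, estimate the worst resulting future state, and favor
// the response with the best worst case. States, scores, and names are hypothetical.
final class LookaheadPruning {
    // Estimated "potential future" score if we take `response` and the attacker then
    // takes `attackerAction` (higher is better for the defender).
    static int futureScore(String response, String attackerAction) {
        Map<String, Integer> base = Map.of(
            "quarantine(Q1PSQ)", 80,   // contains the problem but keeps the quad running
            "isolate(Quad1)", 60,      // safe but costs capacity
            "do-nothing", 40);
        int penalty = attackerAction.equals("spread-to-Quad1") && response.equals("do-nothing") ? 30 : 0;
        return base.getOrDefault(response, 0) - penalty;
    }

    static String pickByLookahead(List<String> responses, List<String> attackerActions) {
        String best = null;
        int bestWorst = Integer.MIN_VALUE;
        for (String r : responses) {
            int worst = attackerActions.stream().mapToInt(a -> futureScore(r, a)).min().orElse(0);
            if (worst > bestWorst) { bestWorst = worst; best = r; }
        }
        return best;
    }

    public static void main(String[] args) {
        String chosen = pickByLookahead(
            List.of("quarantine(Q1PSQ)", "isolate(Quad1)", "do-nothing"),
            List.of("spread-to-Quad1", "lie-low"));
        System.out.println("favored response: " + chosen);
    }
}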
37. Status
- Design
- Rules for proposing responses encoded in first-order logic
- Corresponding pruning rules described in English
- Implementation
- Mounting responses for given hypotheses prototyped in Soar
- Actual response logic is being moved outside Soar
- Risk mitigation step
- Some specific to a particular Learning Exercise run
Much less complete than Event Interpretation, but we are getting there
38. Fast Containment Response and Policies
- Michael Atighetchi
- (On behalf of Steve Harp)
39. Inner Loop Controller (ILC) Objectives
Goal: attempt to contain and correct problems at the earliest stage possible
- Policy driven: implement policies and tactics from the OLC on a single host
- Autonomous: high-speed response can work when disconnected from the OLC by an attack or failure
- Flexible: policies can be updated at any time
- Adaptive: use learned characteristics of the host and monitored services to tune the policy
- Low impact on mission: able to back out of defensive decisions when warranted
40. Current ILC Policies
- Generic...
- Excessive resource consumption
- Memory, CPU load
- Tampering with component files
- Unexpected service termination
- Unexpected network connections
- Specialized...
- JBI client registration failure
- ADF Agent failure modes
- SELinux enforcement status
41. Policy Example
[Figure: example policy flow (sketched in code below). A failed file integrity check on file F1 sets the local file status to corrupted and raises a File Tampering Problem; the policy blocks the network and suspends affected services, then a File Cleanup subgoal quarantines and restores the file and removes the block/suspend reasons.]
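The real ILC policies are Jess rules; the sketch below renders the file-tampering flow from the figure as plain Java, with invented action names, purely to show the ordering of containment and cleanup:

// Sketch only: the file-tampering policy flow rendered as plain Java.
// The real ILC encodes this as Jess rules; action names here are illustrative.
final class FileTamperingPolicy {
    private void act(String action) { System.out.println(action); }

    // Triggered when a file integrity check fails for file f.
    void onIntegrityCheckFailure(String f) {
        act("mark " + f + " status=corrupted");         // local file status
        act("raise file-tampering problem for " + f);
        act("block network on this host");              // contain first
        act("suspend services that depend on " + f);
        cleanup(f);                                     // then pursue the cleanup subgoal
    }

    // File cleanup subgoal: quarantine and restore, then undo the containment.
    private void cleanup(String f) {
        act("quarantine " + f);
        act("restore " + f + " from known-good copy");
        act("remove block/suspend reasons tied to " + f);
    }

    public static void main(String[] args) {
        new FileTamperingPolicy().onIntegrityCheckFailure("F1");
    }
}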
42. ILC Recent Work
- ILC configuration
- Made this driven by an (ontology) model of DPASA
- No need for configuration files for each host
- ILC runs on each simulated host and automatically loads the required policy components
- Integration
- Hitherto the ILC has been (unit) tested stand-alone
- Initial integration with the Jess DPASA simulator complete; broadening support for multiple policies
- Adjustments to the API to match the simulator
43. ILC Current Status
- ILC policy to handle various applications
- Model-driven configuration
- Metrics
- Rules: 94, Functions: 134, Frames: 24, Globals: 20
- Base reaction time (in unit test): 4 ms
- (Measuring the inference part only)
- Target reaction time is < 100 ms
44. ILC Ongoing Work
- Complete integration with the rest of the CSISM framework
- DPASA Simulator
- ILC/OLC interaction
- Designed; integration TBD
- Testing
- Verify correct reactions in the simulator to various simulated attacks
- Measure reaction times
45. Learning Augmentation
- Michael Atighetchi
- (On behalf of Karen Haigh)
46. Learning Augmentation Motivation
- Why learning?
- Extremely difficult to capture all the complexities of the system, particularly interactions among activities
- The system is dynamic (a static configuration gets out of date)
- Core Challenge
- Offline training: good data, complex environment, not a dynamic system
- Online training: unknown data, complex environment, dynamic system
- Human: good data, not a complex environment, not a dynamic system
- CSISM's experimental sandbox: good data (self-labeled), complex environment, dynamic system
Very hard for adversary to train the learner!!!
Sandbox approach successfully tried in SRS phase 1
Adaptation is the key to survival
47. Development Plan for Learning in CSISM
- Responses under normal conditions (calibration)
- An important first step because it learns how to respond to normal conditions
- Shown at the June PI meeting
- Situation-dependent responses under attack conditions
- Multi-stage attacks
- Since June
48. Calibration Results for All Registration Times
[Figure: calibration results shown at the June '07 PI meeting (beta = 0.0005). Two shoulder points indicate the upper and lower limits; as more observations are collected, the estimates become more confident of the range of expected values (i.e., tighter estimates around the observations).]
49. Multistage Attacks
- Multistage attacks involve a sequence of actions that span multiple hosts and take multiple steps to succeed
- A sequence of actions with causal relationships
- An action A must occur to set up the initial conditions for action B; action B would have no effect without previously executing action A
- Challenge: identify which observations indicate the necessary and sufficient elements of an attack (credit assignment)
- Incidental observations that are either
- side effects of normal operations, or
- chaff explicitly added by an attacker to divert the defender
- Concealment (e.g., to remove evidence)
- Probabilistic actions (e.g., to improve probability of success)
Not yet
50. Architectural Schema for Learning of Attack Theories and Situation-Dependent Responses
[Figure: architectural schema. CSISM sensors (ILC, IDS) produce a numbered sequence of observations ending in failure of the protected system, only some of which are essential; from these the learner derives viable attack theories and, in turn, viable defense strategies and detection rules.]
51. Multi-Stage Learner
The hard part!
- do
- Generate a theory according to a heuristic
- (the complete set of theories is all permutations of all members of Powerset(observations))
- Test the theory
- Incrementally update the OLC/ILC rulebase
- while theories remain
(a generate-and-test sketch follows below)
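A sketch of the generate-and-test loop with the shortest-first heuristic; to keep it small it enumerates only order-preserving subsets (not all permutations), and the sandbox verdict is a stand-in predicate:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: the generate-and-test loop of the multi-stage learner, using a
// "shortest first" heuristic over subsets of the observed actions. The sandbox
// oracle here is a stand-in for replaying the candidate attack in the sandbox.
final class MultiStageLearnerSketch {
    // Enumerate all non-empty subsets of the observations (order preserved, no permutations,
    // to keep the sketch small), shortest first.
    static List<List<String>> generateTheories(List<String> observations) {
        List<List<String>> theories = new ArrayList<>();
        for (int mask = 1; mask < (1 << observations.size()); mask++) {
            List<String> t = new ArrayList<>();
            for (int i = 0; i < observations.size(); i++)
                if (((mask >> i) & 1) == 1) t.add(observations.get(i));
            theories.add(t);
        }
        theories.sort(Comparator.comparingInt(List::size));   // shortest-first heuristic
        return theories;
    }

    public static void main(String[] args) {
        List<String> observed = List.of("A", "B", "C", "D", "E");
        // Hypothetical sandbox verdict: the attack succeeds iff the theory contains A and C.
        Predicate<List<String>> sandboxSucceeds = t -> t.contains("A") && t.contains("C");

        for (List<String> theory : generateTheories(observed)) {   // do ... while theories remain
            if (sandboxSucceeds.test(theory)) {                    // test the theory in the sandbox
                System.out.println("valid attack theory: " + theory);
                // here the OLC/ILC rulebase would be incrementally updated
                break;  // shortest valid theory found first under this heuristic
            }
        }
    }
}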
52. Heuristics and Structure of Results
- Primary goal: find all shortest valid attacks (i.e., minimum required subsets) as soon as possible
- Example: in ABCDE, AC and DE may both be valid
- Secondary goal: find all valid attacks as soon as possible
- Example: in ABCDE, ABC may also be valid
- Heuristics
- Shortest first
- Longest first
- Edit distance to original (sketched below)
- Dynamic resort to valid set
- Initially, edit distance to the original attack
- Remaining theories are compared to all valid attacks; edit distance is averaged
- Dynamic resort / free to remove chaff
- Same as dynamic resort to valid set, but the cost of deletion is zero
- Worst-case comparison: sort theories so that
- the shortest valid attack is found last
- all valid attacks are at the end
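A sketch of the edit-distance heuristic: candidate theories are ordered by Levenshtein distance from the originally observed sequence, and the "free to remove chaff" variant corresponds to setting the deletion cost to zero:

import java.util.Comparator;
import java.util.List;

// Sketch only: the "edit distance to original" heuristic. Candidate theories are ordered
// by their Levenshtein distance from the originally observed attack sequence; the
// "free to remove chaff" variant would simply set the deletion cost to zero.
final class EditDistanceHeuristic {
    static int editDistance(List<String> a, List<String> b, int deletionCost) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i * deletionCost;      // delete from a
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;                     // insert into a
        for (int i = 1; i <= a.size(); i++)
            for (int j = 1; j <= b.size(); j++) {
                int subst = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + deletionCost,      // deletion
                                            d[i][j - 1] + 1),                // insertion
                                   d[i - 1][j - 1] + subst);                 // substitution
            }
        return d[a.size()][b.size()];
    }

    public static void main(String[] args) {
        List<String> original = List.of("A", "B", "C", "D", "E");
        List<List<String>> candidates = List.of(
            List.of("A", "C"), List.of("D", "E"), List.of("A", "B", "C"));
        candidates.stream()
            .sorted(Comparator.comparingInt((List<String> c) -> editDistance(original, c, 1)))
            .forEach(c -> System.out.println(c + " dist=" + editDistance(original, c, 1)));
    }
}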
53. Comparison of Heuristics
[Figure: comparison of the heuristics on a 4-observation, 3-stage attack. With 4 observations there are 64 potential trials; with 10 observations, roughly 10 million potential trials.]
54. Incremental Hypothesis Generation
- Enhanced query learner generates attack hypotheses
- incrementally, with low memory overhead; it is able to explore large observation spaces (>> 8 steps)
- in heuristic order, to acquire the concept rapidly
- Heuristic bias
- look for shorter attacks first (adjustable prior)
- suspect the order of steps has an influence
- suspect steps interact positively (for the attacker)
- performance comparable to edit-dist/length
55. Incremental Hypothesis Generation Results
- Target concept: the disjunction ".A.B." or ".B.C."
- Scores represent the sum of trial numbers for the elementary concepts
- Note
- There are many possible observation sequences that could generate these target concepts; the score is the average over 8 of the sequences
- For observation sequences longer than 8, learners that pre-enumerate and sort their queries run out of memory
SONNI: Short-Ordered-NonNegative Incremental Hypothesis Generator
56. Status, Development Plan, and Future Steps
- June '07 PI Meeting
- Responses under normal conditions (calibration)
- Analyze DPASA data (done)
- Integrate with ILC (single node) (done)
- Add experimentation sandbox (single node)
- Calibrate across nodes
- Situation-dependent responses under attack conditions
- Multi-stage attacks
- Since June
- Development of the sandbox, and initial integration efforts with the learner (done)
- Attack actions, observations, and control actions
- Quality signal
- Development of multistage algorithm (version 1.0 done)
- Theories with sandbox
- Incremental generation of theories
- TODO: ILC input / OLC output
57. Simulated Testbed
- Michael Atighetchi
- (on behalf of Michael Atighetchi)
58. Why Simulation?
Defense-enabled JBI as tested under OASIS Dem/Val
- Simulation of the defense-enabled system
- Use as a specification
- Use as integration middleware
- Use for red team experimentation
59. JessSim: The JESS Simulator
Implemented via JESS rules and functions
Generated via a Protégé plugin
60. JessSim Current Status
- Implemented Protocols (14)
- Plumbing (5 rules)
- Alert (6 rules)
- Registration (8 rules)
- SELinux (1 rule)
- Reboot (3 rules)
- LC message (3 rules)
- ADF (3 rules)
- Heartbeat (1 rule)
- PSQ (3 rules)
- Tripwire (3 rules)
- ServiceControl (1 rule)
- POSIX Signals (1 rule)
- Process Memory/CPU status (2 rules)
- Host Memory/CPU status (2 rules)
- Implemented Attacks (8)
- Availability: disable SELinux service
- Availability: shut down a host
- Availability: cause a Downstream Controller to crash
- Availability: cause corruption of endpoint references in SMs
- Availability: killing of processes via kill -9
- Integrity: corruption of files
- Policy violation: creation of a new (rogue) process
- Availability: causing a process to overload the CPU
- Test Coverage
- Unit tests: 28 JUnit tests covering protocol details
- OASIS Dem/Val: main events of DPASA Run 6
- Fidelity
- Focused on application-level protocols
61. JessSim Ongoing Work
- Increase fidelity of network simulation
- Checks for network connectivity: crash(router) -> com broken(A, B)
- Simulation of TCP/IP flows for the ILC
- Increase fidelity of host simulation for the ILC
- install-network-block / remove-network-block
- note-network-connection / reset-network-connection
- quarantine-file / restore-file / delete-file / checkpoint-file
- note-selinux-down / note-selinux-up
- shun-network-address / unshun-network-address
- enable-interface / disable-interface
- set-boot-option
- Protocols for ILC/OLC communication
- forward-to-olc()
- Cleanup
- Convert all time units to seconds in all scenarios
62. Next Steps: Integration and Evaluation
63. Learning Integration
- ILC learning
- Pre-deployment calibration: learn threshold parameters for registration times
- Calibrate across nodes
- OLC learning
- Results from learning with the experimentation sandbox
- Parameter tuning
- New rules/heuristics
64. ILC <-> OLC Integration
- ILC -> OLC
- Calls to the OLC implemented in ILC policies via calls to the ilc-api
- ILC as an informant to the OLC
- ILC as a henchman of the OLC
- OLC -> ILC
- The OLC can process alerts forwarded to it from the ILC
- Consider the ILC as a mechanism during response selection
65. JessSim Integration
- ILC integration with JessSim
- ArrestRunawayProcess loop working
- Implement the file, network, and reboot protocols necessary to support other existing ILC loops
- OLC integration with JessSim
- OLC fully integrated with JessSim
- Adjust integration given changes due to
- moving transcription logic (Alerts -> Accusations, Observations -> Evidence) into Jess
- performing response selection in Jess
- Integration Framework
- All components execute within a single JVM
- Support execution of the ILC and OLC on dedicated hosts to measure timeliness
66. Integration Framework: Current Status
67. Integration Framework: Needed for Red Team Experimentation
68. Evaluation
- Interaction with Red and White Teams
- Initial telecon (late October)
- Continued technical interchange about CSISM capabilities
- Potential gaps/disagreements
- How to use the simulator
- Evaluation goals
- Next steps
- Demonstration of the system
- Red team visit
- Code drop