Title: Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
1Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
Ginger Davis, University of Virginia, Systems and Information Engineering Department
Joint work with David Marchette and Karen Kafadar
INTERFACE 2008, May 22, 2008
2Outline
- Motivation
- Data
- TCP Classification
- Graphical Displays
3Motivation
- Cyber attacks on computer networks are threats to nearly all operations in society.
- We need computational tools and statistical methods to identify attacks and stop them before they force shutdowns.
- Use patterns in Internet traffic data to
- Perform user profiling
- Detect anomalies, network interruptions, unusual behavior, masquerades
4Project Background
(Images: a personal computer; the Internet, circa 2006; a burning power transformer, May 2007)
- Facts
- The Internet is growing
- Computer network attacks are increasing
- Need for network security research tools
5Previous Work in Detecting Aberrations
- Examples
- Disease surveillance
- Nuclear product manufacturing
- Fraud detection (credit cards, phone use)
- These data sets are often
- Reasonably small (say, less than 100 per day)
- Easily stratified (by disease, site, cardholder)
- Approximately independent
- Can often apply Statistical Process Control tools
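For contrast with the streaming setting described next, here is a minimal sketch of the kind of Statistical Process Control check such low-volume, roughly independent data admit: a Shewhart-style chart with a 3-sigma rule. All names and data below are illustrative, not from the talk.

```python
import numpy as np

def shewhart_limits(baseline, k=3.0):
    """Center line and k-sigma control limits estimated from a baseline sample."""
    mu = np.mean(baseline)
    sigma = np.std(baseline, ddof=1)
    return mu, mu - k * sigma, mu + k * sigma

def out_of_control(counts, baseline, k=3.0):
    """Indices of new observations falling outside the control limits."""
    _, lo, hi = shewhart_limits(baseline, k)
    counts = np.asarray(counts)
    return np.where((counts < lo) | (counts > hi))[0]

# Hypothetical daily counts: 30 baseline days, then 10 monitored days with one
# injected aberration. Real surveillance data would replace these draws.
rng = np.random.default_rng(0)
baseline = rng.poisson(lam=20, size=30)
monitored = rng.poisson(lam=20, size=10)
monitored[7] = 45
print(out_of_control(monitored, baseline))
```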
6Features of Internet Traffic Data
- Relentless (streaming)
- Not independent of other systems; thousands of messages from thousands of ports/addresses each minute
- Diverse (text, numeric, image)
- Dispersed (geographically)
- Data often not from some convenient mathematical pdf
7Four Stages of Data Graphics
- Static Graphics
- Scatterplot, conditioning plot, density plot
- Interactive Graphics
- Brushing, cropping, cutting, coloring, rotating, linked plots
- Dynamic Graphics (interact directly with a fixed-size data set on the client)
- Recursive or dynamically smoothed plot, mode tree
- Evolutionary Graphics (continually evolving streaming data sets)
- Waterfall diagram, streaming chart, skyline plot
8Challenges
- Internet traffic data are streaming
- Unusable in raw form and require pre-processing
- Detecting anomalies requires characterizing
typical behavior
9Specific challenges for streaming data
- Data value
- what to collect/discard/save for later
- Data warehouse
- acquisition, storage, distribution
- Tools/algorithms for pre-processing
- Methods for analysis
- Robustness, sufficiency
- Informative visual displays
10Internet Traffic Data
- All internet communications are transmitted via packets.
- Fundamental unit of information is a packet
- Packet consists of data and headers that control communication
- Internet Protocol (IP) addresses
- Transmission Control Protocol (TCP)
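To make the packet/header structure concrete, here is a minimal parsing sketch (not part of the talk) that pulls out the fields later used to identify sessions, assuming raw IPv4/TCP bytes with the link-layer header already stripped:

```python
import socket
import struct

def parse_ipv4_tcp(packet: bytes):
    """Extract the header fields that identify a TCP session from raw IPv4 bytes.

    Assumes the link-layer header has already been stripped; returns None for
    non-TCP packets. This is an illustrative sketch, not the project's code.
    """
    ihl = (packet[0] & 0x0F) * 4              # IP header length in bytes
    if packet[9] != 6:                        # protocol field: 6 means TCP
        return None
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    seq, ack = struct.unpack("!II", packet[ihl + 4:ihl + 12])
    flags = packet[ihl + 13]                  # SYN/ACK/FIN/RST control bits
    return {"src": (src_ip, src_port), "dst": (dst_ip, dst_port),
            "seq": seq, "ack": ack, "flags": flags}
```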
11Internet Traffic Data
12Internet Traffic Data
13Internet Traffic Data
14IP Header (Marchette 2001)
15TCP
16TCP
17TCP Header
18Hierarchy of Data
- Packets
- Identifying characteristics
- Bytes of information being sent
- Flows
- Communication between source-destination
- Connection
- Collection of source flows and destination flows
- Activity
- Collection of similar connections
- User session
- Collection of activities
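A minimal sketch of how this hierarchy could be represented in code; the field names are illustrative only, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Endpoint = Tuple[str, int]    # (IP address, port)

@dataclass
class Packet:
    src: Endpoint
    dst: Endpoint
    time: float
    nbytes: int               # bytes of information being sent

@dataclass
class Flow:                   # communication between a source-destination pair
    src: Endpoint
    dst: Endpoint
    packets: List[Packet] = field(default_factory=list)

@dataclass
class Connection:             # collection of source flows and destination flows
    flows: List[Flow] = field(default_factory=list)

@dataclass
class Activity:               # collection of similar connections
    connections: List[Connection] = field(default_factory=list)

@dataclass
class UserSession:            # collection of activities
    activities: List[Activity] = field(default_factory=list)
```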
19Hierarchy of Data Example
20Goal for Data Hierarchy
- Develop models for each level of the hierarchy that depend on models for the other levels in the hierarchy
21TCP Classification
- Detecting anomalies requires characterizing typical behavior
- We will classify network traffic according to its application
22Background
- Motivation
- Port numbers map packets to their respective applications
- The only thing that matters is that the two communicating hosts know which port number to look for
- Malicious users can use a well-known port like 80 (web traffic) for other uses and, as a result, are less likely to be noticed.
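A toy illustration of the point: the port-to-application mapping is only a convention, so labeling traffic by destination port says nothing about what is actually carried. The mapping below is a small subset of the IANA well-known ports.

```python
# A small subset of the IANA well-known port assignments. Nothing in TCP
# enforces this mapping, which is why port-based labels can be fooled.
WELL_KNOWN = {80: "http", 443: "https", 25: "smtp", 110: "pop3"}

def nominal_application(dst_port: int) -> str:
    """Label a session by its destination port alone."""
    return WELL_KNOWN.get(dst_port, "unknown")

print(nominal_application(80))   # "http", even if the payload is something else entirely
```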
23Goal and Objective
- Goal
- To prevent malicious users from masquerading their activities.
- Objective
- To develop classification tree and multinomial logit models that could be used to correctly identify application protocols by looking at session variable characteristics
24Data
- Preliminary Data Processing Methodology
- Convert Binary -> Text -> SQL
- Proved to be slow and inefficient
- Inadequate Session Aggregation Results
25Data
- Revised Data Processing Methodology
- Convert Binary -> Text -> SQL
- Faster, more efficient, tracks more variables for each session
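A hedged sketch of the Text -> SQL step, under the assumption that the text stage produces one CSV row per packet. The column names, file name, and choice of SQLite are illustrative, not the project's actual pipeline.

```python
import csv
import sqlite3

def load_packets(csv_path: str, db_path: str = "packets.db") -> None:
    """Load already-converted packet records (one CSV row per packet) into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS packets (
                       ts REAL, src_ip TEXT, src_port INTEGER,
                       dst_ip TEXT, dst_port INTEGER,
                       nbytes INTEGER, flags TEXT)""")
    with open(csv_path, newline="") as f:
        rows = ((float(r["ts"]), r["src_ip"], int(r["src_port"]),
                 r["dst_ip"], int(r["dst_port"]), int(r["nbytes"]), r["flags"])
                for r in csv.DictReader(f))
        con.executemany("INSERT INTO packets VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    con.close()
```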
26Data
- Session Aggregation Process
- Ordered observations in database by time
- Logically grouped each packet into a session using standard TCP semantics
- Created unique session definitions
- Maintained averages and variances for each session's variables
- Session completion status is determined and marked according to TCP semantics
- Packet and session tables were linked by foreign keys
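One standard way to maintain per-session averages and variances in a single pass, without storing every packet, is Welford's online update. This is a minimal sketch, not the project's code; the session keys and variable names are made up.

```python
from collections import defaultdict

class RunningStats:
    """Welford's online algorithm: running mean and variance in a single pass."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# One statistics object per (session key, variable), updated as packets stream in.
stats = defaultdict(RunningStats)
for session_key, packet_size in [("A", 40), ("A", 1500), ("B", 60), ("A", 576)]:
    stats[(session_key, "packet_size")].update(packet_size)
print(stats[("A", "packet_size")].mean, stats[("A", "packet_size")].variance)
```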
27Data
- Enterprise Data Set: collected by Lawrence Berkeley National Laboratory; contains 129,903,861 TCP packets and 453,135 TCP sessions
- GMU Data Set: collected by George Mason University _______; contains 7,024,590 TCP packets and 91,016 TCP sessions
- House Data Set: collected by Capstone Team 8 _______________; contains 1,110,335 TCP packets and 21,311 TCP sessions
28Model Creation
Training and Testing Data Set Creation
29Model Creation
Scenarios Used in Data Analysis
- Real World Corporate Scenario: used all application ports present in the data sets
- Idealized Scenario: used only the top application ports in the data sets
- Home Network Scenario: used only the http, https, pop, and smtp application ports present in the data sets
30Model Creation
- Classification Tree Algorithm Parameters
- RPART originally developed for R
- Dependent Variable: Application Port
- Independent Variables: 39 session variables
- Splitting Criterion: Gini Index
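The deck cites RPART (in R). An analogous sketch in Python, using scikit-learn's decision tree with the Gini criterion on stand-in data (39 synthetic session variables and randomly drawn port labels), is shown below; the real features and labels would come from the aggregated session table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: rows are sessions, columns are the 39 session variables,
# labels are application ports (http, https, pop3, smtp).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 39))
y = rng.choice([80, 443, 110, 25], size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini")    # Gini index as the splitting criterion
tree.fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
```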
31Model Creation
Classification Tree Snapshot
32Model Creation
- Multinomial Logit Models
- Dependent variable: Application Port
- Independent variables: 39 session variables
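A comparable sketch for the multinomial logit stage, again on stand-in data rather than the real session tables. With its default lbfgs solver, scikit-learn's LogisticRegression fits a multinomial (softmax) model for multi-class targets, i.e. a multinomial logit over application ports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data again: 39 session variables per session, application-port labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 39))
y = rng.choice([80, 443, 110, 25], size=5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the session variables, then fit the multinomial logit.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)
print("held-out accuracy:", logit.score(X_test, y_test))
```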
33Results Classification Trees
Real World Corporate Scenario: All Ports and All Variables
Takeaway: Good prediction capability within the same data set; inconsistent results when benchmarked against different data sets.
34Results Classification Trees
Idealized Scenario: Top Ports and All Variables
Takeaway: Significant prediction improvement for the Enterprise data set. Limiting ports cleansed the noise from the data.
35Results Classification Trees
Home Network Scenario: Four Ports and All Variables
Takeaway: Improved prediction results both within and across data sets.
36Results Classification Trees
Port 80 Across Data Sets: 4 Application Ports
Takeaway: HTTP traffic (port 80) predictions appear to be robust across the models when looking at only four application variables.
37Results Multinomial Logit
Idealized Scenario: Top Ports and All Variables
Takeaway: Weaker prediction results in the Enterprise data set. Practical in a real-time environment given an appropriate environment/implementation.
38Conclusion
- Project Takeaways
- Successfully replicated and expanded prior research work on real network data
- Used a fast, exportable model creation and classification process -> Classification Trees
- Created a robust toolkit for processing and storing network data
39Future Work
- Implement classification trees in a real network security application
- Handle minority class presence in the data
- Make use of pruning to develop smaller models
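One way the pruning item could be approached (a sketch under the same stand-in data assumptions as above, not the project's plan) is minimal cost-complexity pruning, which trades tree size against held-out accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in session data; the real session variables and port labels would be used instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 39))
y = rng.choice([80, 443, 110, 25], size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the cost-complexity pruning path, then refit at increasing alpha and
# watch how the tree shrinks relative to held-out accuracy.
path = DecisionTreeClassifier(criterion="gini").cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```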
40Evolutionary Displays for EDA
41Waterfall Diagrams (Wegman and Marchette 2003)
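A minimal sketch of a waterfall-style display for streaming summaries (hypothetical data, not the figures shown in the talk): each curve is the density of one recent time window, offset vertically so the display scrolls as new windows arrive.

```python
import matplotlib.pyplot as plt
import numpy as np

def waterfall(windows, bins=50):
    """Plot each time window's density as a curve, offset vertically so the
    newest window sits at the bottom and older windows recede upward."""
    edges = np.histogram_bin_edges(np.concatenate(windows), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    densities = [np.histogram(w, bins=edges, density=True)[0] for w in windows]
    offset = 1.1 * max(d.max() for d in densities)
    fig, ax = plt.subplots()
    for i, d in enumerate(reversed(densities)):   # newest window first (bottom row)
        ax.plot(centers, d + i * offset, lw=1)
    ax.set_xlabel("traffic summary variable (hypothetical)")
    ax.set_yticks([])
    ax.set_title("Waterfall diagram of recent time windows")
    return fig

# Hypothetical stream: each window is one minute of a traffic summary variable
# whose distribution drifts over time.
rng = np.random.default_rng(1)
windows = [rng.gamma(shape=2 + 0.2 * t, scale=100, size=500) for t in range(10)]
waterfall(windows)
plt.show()
```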
44Summary / Future Work