Title: Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
1Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
Ginger Davis, University of Virginia, Systems and Information Engineering Department
Joint work with David Marchette and Karen Kafadar
INTERFACE 2008, May 22, 2008
2Outline
- Motivation
- Data
- TCP Classification
- Graphical Displays
3Motivation
- Cyber attacks on computer networks are threats to nearly all operations in society.
- We need computational tools and statistical methods to identify attacks and stop them before they force shutdowns.
- Use patterns in Internet traffic data to
- Perform user profiling
- Detect anomalies, network interruptions, unusual behavior, masquerades
4Project Background
(Images: a personal computer; the Internet, circa 2006; a burning power transformer, May 2007)
- Facts
- The Internet is growing
- Computer network attacks are increasing
- Need for network security research tools
5Previous Work in Detecting Aberrations
- Examples
- Disease surveillance
- Nuclear product manufacturing
- Fraud detection (credit cards, phone use)
- These data sets are often
- Reasonably small (say, less than 100 per day)
- Easily stratified (by disease, site, cardholder)
- Approximately independent
- Can often apply Statistical Process Control tools
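For contrast with the streaming setting described next, here is a minimal sketch of the kind of Statistical Process Control check such low-volume, roughly independent data admit: a Shewhart-style chart with a 3-sigma rule. All names and data below are illustrative, not from the talk.

```python
import numpy as np

def shewhart_limits(baseline, k=3.0):
    """Center line and k-sigma control limits estimated from a baseline sample."""
    mu = np.mean(baseline)
    sigma = np.std(baseline, ddof=1)
    return mu, mu - k * sigma, mu + k * sigma

def out_of_control(counts, baseline, k=3.0):
    """Indices of new observations falling outside the control limits."""
    _, lo, hi = shewhart_limits(baseline, k)
    counts = np.asarray(counts)
    return np.where((counts < lo) | (counts > hi))[0]

# Hypothetical daily counts: 30 baseline days, then 10 monitored days with one
# injected aberration. Real surveillance data would replace these draws.
rng = np.random.default_rng(0)
baseline = rng.poisson(lam=20, size=30)
monitored = rng.poisson(lam=20, size=10)
monitored[7] = 45
print(out_of_control(monitored, baseline))
```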
6Features of Internet Traffic Data
- Relentless (streaming)
- Not independent of other systems; thousands of messages from thousands of ports/addresses each minute
- Diverse (text, numeric, image)
- Dispersed (geographically)
- Data often not from some convenient mathematical pdf
7Four Stages of Data Graphics
- Static Graphics
- Scatterplot, conditioning plot, density plot
- Interactive Graphics
- Brushing, cropping, cutting, coloring, rotating, linked plots
- Dynamic Graphics (interact directly with a fixed-size data set on the client)
- Recursive or dynamically smoothed plot, mode tree
- Evolutionary Graphics (continually evolving streaming data sets)
- Waterfall diagram, streaming chart, skyline plot
8Challenges
- Internet traffic data are streaming
- Unusable in raw form and require pre-processing
- Detecting anomalies requires characterizing
typical behavior
9Specific challenges for streaming data
- Data value
- what to collect/discard/save for later
- Data warehouse
- acquisition, storage, distribution
- Tools/algorithms for pre-processing
- Methods for analysis
- Robustness, sufficiency
- Informative visual displays
10Internet Traffic Data
- All internet communications are transmitted via packets.
- Fundamental unit of information is a packet
- Packet consists of data and headers that control communication
- Internet Protocol (IP) addresses
- Transmission Control Protocol (TCP)
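To make the packet/header structure concrete, here is a minimal parsing sketch (not part of the talk) that pulls out the fields later used to identify sessions, assuming raw IPv4/TCP bytes with the link-layer header already stripped:

```python
import socket
import struct

def parse_ipv4_tcp(packet: bytes):
    """Extract the header fields that identify a TCP session from raw IPv4 bytes.

    Assumes the link-layer header has already been stripped; returns None for
    non-TCP packets. This is an illustrative sketch, not the project's code.
    """
    ihl = (packet[0] & 0x0F) * 4              # IP header length in bytes
    if packet[9] != 6:                        # protocol field: 6 means TCP
        return None
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    seq, ack = struct.unpack("!II", packet[ihl + 4:ihl + 12])
    flags = packet[ihl + 13]                  # SYN/ACK/FIN/RST control bits
    return {"src": (src_ip, src_port), "dst": (dst_ip, dst_port),
            "seq": seq, "ack": ack, "flags": flags}
```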
11Internet Traffic Data
12Internet Traffic Data
13Internet Traffic Data
14IP Header (Marchette 2001)
15TCP
16TCP
17TCP Header
18Hierarchy of Data
- Packets
- Identifying characteristics
- Bytes of information being sent
- Flows
- Communication between source-destination
- Connection
- Collection of source flows and destination flows
- Activity
- Collection of similar connections
- User session
- Collection of activities
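A minimal sketch of how this hierarchy could be represented in code; the field names are illustrative only, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Endpoint = Tuple[str, int]    # (IP address, port)

@dataclass
class Packet:
    src: Endpoint
    dst: Endpoint
    time: float
    nbytes: int               # bytes of information being sent

@dataclass
class Flow:                   # communication between a source-destination pair
    src: Endpoint
    dst: Endpoint
    packets: List[Packet] = field(default_factory=list)

@dataclass
class Connection:             # collection of source flows and destination flows
    flows: List[Flow] = field(default_factory=list)

@dataclass
class Activity:               # collection of similar connections
    connections: List[Connection] = field(default_factory=list)

@dataclass
class UserSession:            # collection of activities
    activities: List[Activity] = field(default_factory=list)
```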
19Hierarchy of Data Example
20Goal for Data Hierarchy
- Develop models for each level of the hierarchy that depend on models for the other levels in the hierarchy
21TCP Classification
- Detecting anomalies requires characterizing typical behavior
- We will classify network traffic according to its application
22Background
- Motivation
- Port numbers map packets to their respective applications
- The only thing that matters is that the two communicating hosts know which port number to look for
- Malicious users can use a well-known port like 80 (web traffic) for other uses and, as a result, are less likely to be noticed.
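A toy illustration of the point: the port-to-application mapping is only a convention, so labeling traffic by destination port says nothing about what is actually carried. The mapping below is a small subset of the IANA well-known ports.

```python
# A small subset of the IANA well-known port assignments. Nothing in TCP
# enforces this mapping, which is why port-based labels can be fooled.
WELL_KNOWN = {80: "http", 443: "https", 25: "smtp", 110: "pop3"}

def nominal_application(dst_port: int) -> str:
    """Label a session by its destination port alone."""
    return WELL_KNOWN.get(dst_port, "unknown")

print(nominal_application(80))   # "http", even if the payload is something else entirely
```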
23Goal and Objective
- Goal
- To prevent malicious users from masquerading their activities.
- Objective
- To develop classification tree and multinomial logit models that could be used to correctly identify application protocols by looking at session variable characteristics
24Data
- Preliminary Data Processing Methodology
- Convert Binary -> Text -> SQL
- Proved to be slow and inefficient
- Inadequate Session Aggregation Results
25Data
- Revised Data Processing Methodology
- Convert Binary -> Text -> SQL
- Faster, more efficient, tracks more variables for each session
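A hedged sketch of the Text -> SQL step, under the assumption that the text stage produces one CSV row per packet. The column names, file name, and choice of SQLite are illustrative, not the project's actual pipeline.

```python
import csv
import sqlite3

def load_packets(csv_path: str, db_path: str = "packets.db") -> None:
    """Load already-converted packet records (one CSV row per packet) into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS packets (
                       ts REAL, src_ip TEXT, src_port INTEGER,
                       dst_ip TEXT, dst_port INTEGER,
                       nbytes INTEGER, flags TEXT)""")
    with open(csv_path, newline="") as f:
        rows = ((float(r["ts"]), r["src_ip"], int(r["src_port"]),
                 r["dst_ip"], int(r["dst_port"]), int(r["nbytes"]), r["flags"])
                for r in csv.DictReader(f))
        con.executemany("INSERT INTO packets VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    con.close()
```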
26Data
- Session Aggregation Process
- Ordered observations in database by time
- Logically grouped each packet into a session using standard TCP semantics
- Created unique session definitions
- Maintained averages and variances for each session's variables
- Session completion status is determined and marked according to TCP semantics
- Packet and session tables were linked by foreign keys
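One standard way to maintain per-session averages and variances in a single pass, without storing every packet, is Welford's online update. This is a minimal sketch, not the project's code; the session keys and variable names are made up.

```python
from collections import defaultdict

class RunningStats:
    """Welford's online algorithm: running mean and variance in a single pass."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# One statistics object per (session key, variable), updated as packets stream in.
stats = defaultdict(RunningStats)
for session_key, packet_size in [("A", 40), ("A", 1500), ("B", 60), ("A", 576)]:
    stats[(session_key, "packet_size")].update(packet_size)
print(stats[("A", "packet_size")].mean, stats[("A", "packet_size")].variance)
```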
27Data
- Enterprise Data Set: collected by Lawrence Berkeley National Laboratory; contains 129,903,861 TCP packets and 453,135 TCP sessions
- GMU Data Set: collected by George Mason University _______; contains 7,024,590 TCP packets and 91,016 TCP sessions
- House Data Set: collected by Capstone Team 8 _______________; contains 1,110,335 TCP packets and 21,311 TCP sessions
28Model Creation
Training and Testing Data Set Creation
29Model Creation
Scenarios Used in Data Analysis
- Real World Corporate Scenario: used all application ports present in the data sets
- Idealized Scenario: used only the top application ports in the data sets
- Home Network Scenario: used only the http, https, pop, and smtp application ports present in the data sets
30Model Creation
- Classification Tree Algorithm Parameters
- RPART originally developed for R
- Dependent Variable: Application Port
- Independent Variables: 39 session variables
- Splitting Criterion: Gini Index
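The deck cites RPART (in R). An analogous sketch in Python, using scikit-learn's decision tree with the Gini criterion on stand-in data (39 synthetic session variables and randomly drawn port labels), is shown below; the real features and labels would come from the aggregated session table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: rows are sessions, columns are the 39 session variables,
# labels are application ports (http, https, pop3, smtp).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 39))
y = rng.choice([80, 443, 110, 25], size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini")    # Gini index as the splitting criterion
tree.fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
```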
31Model Creation
Classification Tree Snapshot
32Model Creation
- Multinomial Logit Models
- Dependent variable: Application Port
- Independent variables: 39 session variables
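A comparable sketch for the multinomial logit stage, again on stand-in data rather than the real session tables. With its default lbfgs solver, scikit-learn's LogisticRegression fits a multinomial (softmax) model for multi-class targets, i.e. a multinomial logit over application ports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data again: 39 session variables per session, application-port labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 39))
y = rng.choice([80, 443, 110, 25], size=5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the session variables, then fit the multinomial logit.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)
print("held-out accuracy:", logit.score(X_test, y_test))
```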
33Results Classification Trees
Real World Corporate Scenario: All Ports and All Variables
Takeaway: Good prediction capability within the same data set; inconsistent results when benchmarked against different data sets.
34Results Classification Trees
Idealized Scenario: Top Ports and All Variables
Takeaway: Significant prediction improvement for the Enterprise data set. Limiting ports cleansed the noise from the data.
35Results Classification Trees
Home Network Scenario: Four Ports and All Variables
Takeaway: Improved prediction results both within and across data sets.
36Results Classification Trees
Port 80 Across Data Sets: 4 Application Ports
Takeaway: HTTP traffic (port 80) predictions appear to be robust across the models when looking at only four application variables.
37Results Multinomial Logit
Idealized Scenario: Top Ports and All Variables
Takeaway: Weaker prediction results in the Enterprise data set. Practical in a real-time environment given an appropriate environment/implementation.
38Conclusion
- Project Takeaways
- Successfully replicated and expanded prior research work on real network data
- Used a fast, exportable model creation and classification process -> Classification Trees
- Created a robust toolkit for processing and storing network data
39Future Work
- Implement classification trees in a real network security application
- Handle minority class presence in the data
- Make use of pruning to develop smaller models
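One way the pruning item could be approached (a sketch under the same stand-in data assumptions as above, not the project's plan) is minimal cost-complexity pruning, which trades tree size against held-out accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in session data; the real session variables and port labels would be used instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 39))
y = rng.choice([80, 443, 110, 25], size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the cost-complexity pruning path, then refit at increasing alpha and
# watch how the tree shrinks relative to held-out accuracy.
path = DecisionTreeClassifier(criterion="gini").cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```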
40Evolutionary Displays for EDA
41Waterfall Diagrams (Wegman and Marchette 2003)
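A minimal sketch of a waterfall-style display for streaming summaries (hypothetical data, not the figures shown in the talk): each curve is the density of one recent time window, offset vertically so the display scrolls as new windows arrive.

```python
import matplotlib.pyplot as plt
import numpy as np

def waterfall(windows, bins=50):
    """Plot each time window's density as a curve, offset vertically so the
    newest window sits at the bottom and older windows recede upward."""
    edges = np.histogram_bin_edges(np.concatenate(windows), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    densities = [np.histogram(w, bins=edges, density=True)[0] for w in windows]
    offset = 1.1 * max(d.max() for d in densities)
    fig, ax = plt.subplots()
    for i, d in enumerate(reversed(densities)):   # newest window first (bottom row)
        ax.plot(centers, d + i * offset, lw=1)
    ax.set_xlabel("traffic summary variable (hypothetical)")
    ax.set_yticks([])
    ax.set_title("Waterfall diagram of recent time windows")
    return fig

# Hypothetical stream: each window is one minute of a traffic summary variable
# whose distribution drifts over time.
rng = np.random.default_rng(1)
windows = [rng.gamma(shape=2 + 0.2 * t, scale=100, size=500) for t in range(10)]
waterfall(windows)
plt.show()
```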
44Summary / Future Work