Statistical Methods for Detecting Computer Attacks from Streaming Internet Data

Slides: 45
Provided by: tm9v
Learn more at: https://www.niss.org
1
Statistical Methods for Detecting Computer Attacks from Streaming Internet Data
Ginger Davis, University of Virginia, Systems and Information Engineering Department
Joint work with David Marchette and Karen Kafadar
INTERFACE 2008, May 22, 2008
2
Outline
  • Motivation
  • Data
  • TCP Classification
  • Graphical Displays

3
Motivation
  • Cyber attacks on computer networks are threats to
    nearly all operations in society.
  • We need computational tools and statistical
    methods to identify attacks and stop them before
    they force shutdowns.
  • Use patterns in Internet traffic data to:
      • Perform user profiling
      • Detect anomalies, network interruptions, unusual
        behavior, masquerades

4
Project Background


Images: Personal Computer; The Internet (circa 2006); Burning Power Transformer (May 2007)
  • Facts
      • The Internet is growing
      • Computer network attacks are increasing
      • Need for network security research tools

5
Previous Work in Detecting Aberrations
  • Examples
      • Disease surveillance
      • Nuclear product manufacturing
      • Fraud detection (credit cards, phone use)
  • These data sets are often
      • Reasonably small (say, fewer than 100 per day)
      • Easily stratified (by disease, site, cardholder)
      • Approximately independent
  • Statistical Process Control tools can often be applied
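
A minimal sketch of the kind of Statistical Process Control check that suits small, roughly independent daily counts: a Shewhart-style chart that flags observations outside the baseline mean plus or minus three standard deviations. The counts, the function names, and the 3-sigma rule are illustrative assumptions, not taken from the presentation.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, upper): baseline mean +/- k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread

def out_of_control(baseline, new_counts, k=3.0):
    """Return the indices of new observations that fall outside the limits."""
    lower, upper = control_limits(baseline, k)
    return [i for i, x in enumerate(new_counts) if x < lower or x > upper]

# Hypothetical daily counts (made-up numbers, for illustration only).
baseline_counts = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15]
new_counts = [14, 13, 29, 12]

# Only the third new observation (index 2) lies outside the limits.
flagged = out_of_control(baseline_counts, new_counts)
```

A chart like this relies on the small, stratified, independent structure listed above, which is exactly what streaming Internet traffic lacks.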

6
Features of Internet Traffic Data
  • Relentless (streaming)
  • Not independent of other systems: thousands of
    messages from thousands of ports/addresses each
    minute
  • Diverse (text, numeric, image)
  • Dispersed (geographically)
  • Data often do not follow a convenient mathematical
    pdf

7
Four Stages of Data Graphics
  • Static Graphics
      • Scatterplot, conditioning plot, density plot
  • Interactive Graphics
      • Brushing, cropping, cutting, coloring, rotating,
        linked plots
  • Dynamic Graphics (interact directly with fixed
    size data set on the client)
      • Recursive or dynamically smoothed plot, mode tree
  • Evolutionary Graphics (continually evolving
    streaming data sets)
      • Waterfall diagram, streaming chart, skyline plot

8
Challenges
  • Internet traffic data are streaming
  • Unusable in raw form and require pre-processing
  • Detecting anomalies requires characterizing
    typical behavior

9
Specific challenges for streaming data
  • Data value
      • What to collect/discard/save for later
  • Data warehouse
      • Acquisition, storage, distribution
  • Tools/algorithms for pre-processing
  • Methods for analysis
  • Robustness, sufficiency
  • Informative visual displays

10
Internet Traffic Data
  • All Internet communications are transmitted via
    packets.
  • The fundamental unit of information is a packet
  • A packet consists of data and headers that control
    communication
  • Internet Protocol (IP) addresses
  • Transmission Control Protocol (TCP)
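
To make the "data plus headers" structure concrete, here is a minimal sketch that reads the addresses, ports, and flags out of a raw IPv4/TCP packet using the standard header layouts. The function name and the sample packet are illustrative assumptions; the presentation does not describe its parsing code.

```python
import socket
import struct

def parse_ipv4_tcp(packet: bytes) -> dict:
    """Extract addresses, ports, flags, and payload size from an IPv4/TCP packet."""
    version_ihl = packet[0]
    version = version_ihl >> 4
    ip_header_len = (version_ihl & 0x0F) * 4      # IHL counts 32-bit words
    (total_len,) = struct.unpack_from("!H", packet, 2)
    protocol = packet[9]                          # 6 means TCP
    if version != 4 or protocol != 6:
        raise ValueError("not an IPv4 TCP packet")
    src_port, dst_port, seq, ack, offset_flags = struct.unpack_from(
        "!HHIIH", packet, ip_header_len
    )
    tcp_header_len = (offset_flags >> 12) * 4     # data offset, in 32-bit words
    return {
        "src_ip": socket.inet_ntoa(packet[12:16]),
        "dst_ip": socket.inet_ntoa(packet[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "syn": bool(offset_flags & 0x02),
        "fin": bool(offset_flags & 0x01),
        "payload_len": total_len - ip_header_len - tcp_header_len,
    }

# Build a hypothetical SYN packet carrying 5 bytes, purely for illustration.
payload = b"hello"
ip_header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("192.0.2.7"),
)
tcp_header = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
info = parse_ipv4_tcp(ip_header + tcp_header + payload)
```

The sketch ignores checksums, IP fragmentation, and IPv6, all of which a real capture pipeline would need to handle.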

11
Internet Traffic Data
12
Internet Traffic Data
13
Internet Traffic Data
14
IP Header (Marchette 2001)
15
TCP
16
TCP
17
TCP Header
18
Hierarchy of Data
  • Packets
      • Identifying characteristics
      • Bytes of information being sent
  • Flows
      • Communication between a source and a destination
  • Connection
      • Collection of source flows and destination flows
  • Activity
      • Collection of similar connections
  • User session
      • Collection of activities
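
The lowest levels of this hierarchy can be sketched as a grouping step: packets are collected into directional flows, and the two opposite flows between the same endpoints form a connection. The dictionary field names and sample packets below are hypothetical, and the activity and user-session levels are left out because the presentation does not define how they are built.

```python
from collections import defaultdict

def flow_key(pkt):
    """A flow is one direction of traffic: source endpoint to destination endpoint."""
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])

def connection_key(pkt):
    """A connection pairs the two opposite flows, so its key ignores direction."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (a, b) if a <= b else (b, a)

def build_hierarchy(packets):
    """Group packets into flows, and flows into connections."""
    flows = defaultdict(list)        # flow key -> list of packets
    connections = defaultdict(set)   # connection key -> set of flow keys
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
        connections[connection_key(pkt)].add(flow_key(pkt))
    return flows, connections

# Hypothetical packets: a client request, a server reply, and a second request.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 49152, "dst_ip": "192.0.2.7", "dst_port": 80, "bytes": 60},
    {"src_ip": "192.0.2.7", "src_port": 80, "dst_ip": "10.0.0.1", "dst_port": 49152, "bytes": 1500},
    {"src_ip": "10.0.0.1", "src_port": 49152, "dst_ip": "192.0.2.7", "dst_port": 80, "bytes": 52},
]
flows, connections = build_hierarchy(packets)
```

Here the three packets form two flows (one per direction) that belong to a single connection.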

19
Hierarchy of Data Example
20
Goal for Data Hierarchy
  • Develop models for each level of the hierarchy
    that depend on the models for the other levels

21
TCP Classification
  • Detecting anomalies requires characterizing
    typical behavior
  • We will classify network traffic according to its
    application

22
Background
  • Motivation
      • Port numbers map packets to their respective
        applications
      • The only thing that matters is that the two
        communicating hosts know which port number to
        look for
      • Malicious users can use a well-known port such as 80
        (web traffic) for other purposes and, as a result,
        are less likely to be noticed.
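
A short sketch of why port numbers alone are a weak label: a lookup table names the application from the destination port, so any traffic sent over port 80 is labeled as web traffic regardless of what it carries. The session records and the "remote-shell" application are made up for illustration.

```python
# A few well-known TCP port assignments.
WELL_KNOWN_PORTS = {25: "smtp", 80: "http", 110: "pop3", 443: "https"}

def label_by_port(dst_port: int) -> str:
    """Name the application purely from the destination port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")

# Hypothetical sessions: the second one tunnels a non-web application over port 80.
sessions = [
    {"dst_port": 80, "true_app": "http"},
    {"dst_port": 80, "true_app": "remote-shell"},
    {"dst_port": 25, "true_app": "smtp"},
]

# Sessions whose port-based label disagrees with what they actually carry.
mislabeled = [s for s in sessions if label_by_port(s["dst_port"]) != s["true_app"]]
```

Classifying on session characteristics rather than on the port is meant to catch exactly the kind of session that ends up in `mislabeled`.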

23
Goal and Objective
  • Goal
      • To prevent malicious users from disguising
        their activities.
  • Objective
      • To develop classification tree and multinomial
        logit models that can correctly identify
        application protocols from session variable
        characteristics

24
Data
  • Preliminary Data Processing Methodology
      • Convert: Binary -> Text -> SQL
      • Proved to be slow and inefficient
      • Inadequate session aggregation results

25
Data
  • Revised Data Processing Methodology
      • Convert: Binary -> Text -> SQL
      • Faster, more efficient, tracks more variables for
        each session

26
Data
  • Session Aggregation Process
      • Ordered observations in the database by time
      • Logically grouped each packet into a session
        using standard TCP semantics
      • Created unique session definitions
      • Maintained averages and variances for each
        session's variables
      • Determined and marked session completion status
        according to TCP semantics
      • Linked packet and session tables by foreign
        keys
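
The aggregation steps above can be sketched as follows, under simplifying assumptions: packets are sorted by time, keyed by their unordered endpoint pair, summarized with running means and variances (Welford's online algorithm, chosen here so no per-packet values need to be stored), and marked closed or reset from the FIN and RST flags. The field names, the single tracked variable, and the status rules are hypothetical; the presentation does not give its exact session definition.

```python
from dataclasses import dataclass, field

FIN, SYN, RST = 0x01, 0x02, 0x04   # TCP flag bits

@dataclass
class RunningStat:
    """Welford's online algorithm: mean and variance without storing values."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

@dataclass
class Session:
    packet_size: RunningStat = field(default_factory=RunningStat)
    saw_syn: bool = False
    status: str = "open"

def aggregate(packets):
    """Assign time-ordered packets to sessions keyed by their endpoint pair."""
    sessions = {}
    for pkt in sorted(packets, key=lambda p: p["time"]):
        key = frozenset([(pkt["src"], pkt["sport"]), (pkt["dst"], pkt["dport"])])
        s = sessions.setdefault(key, Session())
        s.packet_size.update(pkt["size"])
        if pkt["flags"] & SYN:
            s.saw_syn = True
        if pkt["flags"] & RST:
            s.status = "reset"
        elif (pkt["flags"] & FIN) and s.status == "open":
            s.status = "closed"
    return sessions

# Hypothetical packets, deliberately out of time order.
packets = [
    {"time": 0.2, "src": "10.0.0.1", "sport": 49152, "dst": "192.0.2.7", "dport": 80, "size": 80, "flags": 0x11},
    {"time": 0.0, "src": "10.0.0.1", "sport": 49152, "dst": "192.0.2.7", "dport": 80, "size": 40, "flags": 0x02},
    {"time": 0.1, "src": "192.0.2.7", "sport": 80, "dst": "10.0.0.1", "dport": 49152, "size": 60, "flags": 0x12},
]
sessions = aggregate(packets)
only_session = next(iter(sessions.values()))
```

A fuller version would also handle port reuse, require FINs from both sides before marking a session complete, and track every session variable rather than packet size alone.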

27
Data
  • Enterprise Data Set
      • Collected by Lawrence Berkeley National Laboratory
      • Contains 129,903,861 TCP packets and 453,135 TCP sessions
  • GMU Data Set
      • Collected by George Mason University
      • Contains 7,024,590 TCP packets and 91,016 TCP sessions
  • House Data Set
      • Collected by Capstone Team 8
      • Contains 1,110,335 TCP packets and 21,311 TCP sessions
28
Model Creation
Training and Testing Data Set Creation
29
Model Creation
Scenarios Used in Data Analysis
  • Real World Corporate Scenario: used all
    application ports present in the data sets
  • Idealized Scenario: used only the top application
    ports in the data sets
  • Home Network Scenario: used only the http, https,
    pop, and smtp application ports present in the
    data sets
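
One way to read the three scenarios is as filters on the application port of each session. The sketch below assumes that "top" ports means the most frequent ones, with a hypothetical `top_n` parameter since the presentation does not say how many were kept, and that "pop" means POP3 on port 110.

```python
from collections import Counter

# http, https, pop (assumed to be POP3), smtp
HOME_PORTS = {80, 443, 110, 25}

def select_sessions(sessions, scenario, top_n=10):
    """Return the sessions whose application port belongs to the scenario."""
    if scenario == "corporate":
        return list(sessions)                      # all ports
    if scenario == "idealized":
        counts = Counter(s["port"] for s in sessions)
        keep = {port for port, _ in counts.most_common(top_n)}
    elif scenario == "home":
        keep = HOME_PORTS
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return [s for s in sessions if s["port"] in keep]

# Hypothetical sessions, identified only by their application port.
sessions = [{"port": p} for p in [80, 80, 80, 443, 443, 25, 22, 6667]]
corporate = select_sessions(sessions, "corporate")
home = select_sessions(sessions, "home")
top2 = select_sessions(sessions, "idealized", top_n=2)
```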
30
Model Creation
  • Classification Tree Algorithm Parameters
      • RPART, originally developed for R
      • Dependent variable: application port
      • Independent variables: 39 session variables
      • Splitting criterion: Gini index
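
As a worked illustration of the Gini splitting criterion, the sketch below computes the Gini impurity of a node and the size-weighted impurity of its two children for a threshold split; a tree-growing algorithm prefers the split that lowers this value the most. It is written in Python rather than R, and the session variable `mean_packet_size` and the four sample rows are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, threshold, target="port"):
    """Size-weighted Gini impurity of the two children of a threshold split."""
    left = [r[target] for r in rows if r[feature] <= threshold]
    right = [r[target] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical sessions described by a single made-up session variable.
rows = [
    {"mean_packet_size": 900, "port": 80},
    {"mean_packet_size": 850, "port": 80},
    {"mean_packet_size": 120, "port": 25},
    {"mean_packet_size": 150, "port": 25},
]
parent_impurity = gini([r["port"] for r in rows])
child_impurity = split_impurity(rows, "mean_packet_size", 500)
```

With two evenly mixed classes the parent impurity is 0.5, and the split at 500 separates them perfectly, so the weighted child impurity drops to 0.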

31
Model Creation
Classification Tree Snapshot
32
Model Creation
  • Multinomial Logit Models
      • Dependent variable: application port
      • Independent variables: 39 session variables
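
A multinomial logit model assigns each application port a linear score in the session variables and converts the scores to probabilities with the softmax function. The sketch below shows only the prediction step; the two session variables, the coefficients, and the choice of port 80 as the reference class are made-up assumptions, and fitting the coefficients is not shown.

```python
import math

def multinomial_logit_probs(x, coefs, intercepts):
    """Class probabilities P(port = k | x) from the softmax of linear scores."""
    scores = {
        k: intercepts[k] + sum(b * xi for b, xi in zip(coefs[k], x))
        for k in coefs
    }
    top = max(scores.values())                    # subtract the max for stability
    exp_scores = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}

# Made-up coefficients for two session variables; port 80 is the reference
# class, so its coefficients and intercept are fixed at zero.
coefs = {80: [0.0, 0.0], 25: [-0.5, 1.0], 443: [0.3, -0.2]}
intercepts = {80: 0.0, 25: 0.1, 443: -0.1}

probs = multinomial_logit_probs([1.0, 2.0], coefs, intercepts)
predicted_port = max(probs, key=probs.get)
```

For this input the linear scores are 0 for port 80, 1.6 for port 25, and -0.2 for port 443, so port 25 receives the highest probability.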

33
Results: Classification Trees
Real World Corporate Scenario: All Ports and All Variables
Takeaway: Good prediction capability within the
same data set; inconsistent results when
benchmarked against different data sets.
34
Results: Classification Trees
Idealized Scenario: Top Ports and All Variables
Takeaway: Significant prediction improvement for
the Enterprise data set. Limiting the ports
cleansed noise from the data.
35
Results: Classification Trees
Home Network Scenario: Four Ports and All Variables
Takeaway: Improved prediction results both within
and across data sets.
36
Results: Classification Trees
Port 80 Across Data Sets: Four Application Ports
Takeaway: HTTP traffic (port 80) predictions
appear to be robust across the models when only
four application ports are considered.
37
Results: Multinomial Logit
Idealized Scenario: Top Ports and All Variables
Takeaway: Weaker prediction results in the
Enterprise data set. Practical in a real-time
setting given an appropriate environment/implementation.
38
Conclusion
  • Project Takeaways
      • Replicated / expanded prior research work
        successfully on real network data
      • Used a fast/exportable model creation and
        classification process -> Classification Trees
      • Created a robust toolkit for processing and
        storing network data

39
Future Work
  • Implement classification trees in a real network
    security application
  • Handle minority class presence in the data
  • Make use of pruning to develop smaller models

40
Evolutionary Displays for EDA
41
Waterfall Diagrams (Wegman and Marchette 2003)
44
Summary / Future Work