Title: Data Mining Approaches for Network Intrusion Detection
1. Data Mining Approaches for Network Intrusion
Detection
- Karla Bracamonte
- Jeffrey Gawlinski
- Jordan Harstad
- Omar Rodriguez
- Michael Wright
2. Intrusion Detection: Current
- Detection is best if it occurs during the
scanning step
- Real-time intrusion detection
- Pros: Scans network traffic on the fly, looking
for well-known scan patterns
- Cons: Tuned specifically to detect known
service-level network attacks
- Intrusion detection should follow a proactive
approach
3. Visualization: Presenting a Graphical Summary of
the Data
- Communication: presenting a graphical summary of
the data
- Pros: Possible to communicate the most important
aspects of collected data
- Cons: Not all information can be communicated
visually; it is limited by the complexity that
the human eye can appreciate.
4. Visualization: Presenting a Graphical Summary of
the Data
- Visual Techniques
- Scatterplots
- Projection Matrices
- Coplots
- Parallel Coordinates
- Etc
5. Visualization: Presenting a Graphical Summary of
the Data
- Distortion Methods: narrow the scope of the data
so that a particular subset can be studied in
detail without losing the overall perspective.
- Interactive Methods: view output dynamically
through a UI that can project, zoom, and
manipulate the data on demand.
6. Visualization: Presenting a Graphical Summary of
the Data
- Audio Data Mining: uses visual techniques,
converting signal pitches into graphs, to
recognize unique patterns.
- Can help find a pattern of early warning signs of
human anger in telephone communication.
7. Data Summarization
- Data summarization is an important data analysis
task in data warehousing and online analytical
processing (OLAP). Another term used for data
summarization is summary statistics.
8. Data Summarization: Offline Data Mining and
Importance of Statistics
- For example, networks with high traffic are faced
with a larger amount of data to analyze. With
data summarization, however, the data can be
analyzed pattern by pattern to detect abnormal
behavior and/or results.
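As a rough illustration of using summary statistics to spot abnormal behavior, the sketch below flags time intervals whose packet count lies far from the mean. The function name, the sample counts, and the threshold of two standard deviations are assumptions made for this example; they do not come from the slides.

```python
from statistics import mean, stdev

def flag_abnormal_intervals(counts, threshold=2.0):
    """Return the indices of intervals whose count is more than
    `threshold` sample standard deviations away from the mean."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # all intervals identical, nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# Hypothetical packets-per-minute summary of a traffic capture.
packets_per_minute = [120, 118, 125, 130, 122, 119, 940, 121, 127, 124]
abnormal = flag_abnormal_intervals(packets_per_minute)
print(abnormal)
```

Only the summarized counts are inspected here, not the individual packets, which is what makes this kind of check practical on high-traffic networks.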
9. Data Summarization: Offline Data Mining and
Importance of Statistics
- Summary statistics are quantities, such as the
mean and standard deviation, that capture various
characteristics of a potentially large set of
values with a single number or a small set of
numbers.
- Indeed, for many people, summary statistics are
the most visible manifestation of statistics.
10. Data Summarization: Offline Data Mining and
Importance of Statistics
- Frequencies and Mode
- Given a set of unordered categorical values,
there is not much that can be done to further
characterize the values except to compute the
frequency with which each value occurs for a
particular set of data. The mode is the value
with the highest frequency.
- Percentiles
- For ordered data, it is more useful to consider
the percentiles of a set of values.
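A minimal sketch of these two ideas using only the Python standard library. The protocol labels and connection durations are made-up sample data, not values from the slides.

```python
from collections import Counter
from statistics import quantiles

# Categorical attribute: count how often each value occurs.
protocols = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]
frequencies = Counter(protocols)
mode_value, mode_count = frequencies.most_common(1)[0]

# Ordered attribute: the quartiles are the 25th, 50th and 75th percentiles.
connection_durations = [1, 2, 2, 3, 4, 5, 8, 13, 21, 34, 55]
q1, q2, q3 = quantiles(connection_durations, n=4)

print(frequencies, mode_value, q1, q2, q3)
```

`quantiles` returns the cut points between groups, so `n=4` yields three values; passing `n=100` would give percentile cut points directly.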
11. Data Summarization: Offline Data Mining and
Importance of Statistics
- Measures of Location: Mean and Median
- For continuous data, two of the most widely used
summary statistics are the mean and median, which
are measures of the location of the set of values.
- Measures of Spread: Range and Variance
- Another set of commonly used summary statistics
for continuous data are those that measure the
dispersion or spread of a set of values. Such
measures indicate whether the attribute values
are widely spread out or relatively concentrated
around a single point such as the mean.
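The location and spread measures above can be computed as in the following sketch. The byte counts are invented sample values, chosen so that one outlier shows how the mean and median can differ.

```python
from statistics import mean, median, pvariance

# Hypothetical bytes transferred per connection; the last value is an outlier.
bytes_per_connection = [200, 220, 210, 190, 205, 5000]

location_mean = mean(bytes_per_connection)
location_median = median(bytes_per_connection)

# Range: difference between the largest and smallest value.
value_range = max(bytes_per_connection) - min(bytes_per_connection)
# Population variance: average squared distance from the mean.
spread_variance = pvariance(bytes_per_connection)

print(location_mean, location_median, value_range, spread_variance)
```

The single large value pulls the mean well above the median, which is one reason to report both.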
12. Data Summarization: Offline Data Mining and
Importance of Statistics
- Multivariate Summary Statistics
- Measures of location for data that consists of
several attributes (multivariate data) can be
obtained by computing the mean or median
separately for each attribute.
- Other Ways to Summarize Data
- Skewness
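The sketch below computes a per-attribute mean and median for a few records and a simple skewness value. The record fields and values are hypothetical, and the skewness shown is the population form (the mean of the cubed standardized values), which is one of several definitions in use.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: mean of cubed standardized values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical connection records with three attributes each.
records = [
    {"duration": 1.0, "bytes": 200, "packets": 4},
    {"duration": 2.0, "bytes": 300, "packets": 6},
    {"duration": 3.0, "bytes": 400, "packets": 8},
    {"duration": 30.0, "bytes": 500, "packets": 10},
]
attributes = ["duration", "bytes", "packets"]

# Multivariate location: one mean and one median per attribute.
mean_vector = {a: mean([r[a] for r in records]) for a in attributes}
median_vector = {a: median([r[a] for r in records]) for a in attributes}
duration_skew = skewness([r["duration"] for r in records])

print(mean_vector, median_vector, duration_skew)
```

A positive skewness, as for the durations here, indicates a long tail toward large values.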
13. Data Summarization: Offline Data Mining and
Importance of Statistics
- Offline processing is a reasonable solution.
- Offline processing provides the techniques for
broader analysis of network traffic.
14. Network Intelligence Gathering
- Footprinting
- Administrative, technical, and billing contacts,
which include employee names, email addresses,
and phone and fax numbers
- IP address range
- DNS servers
- Mail servers
15. Network Intelligence Gathering
- Enumeration
- The process of extracting valid accounts or
exported resource names from systems
- Scanning
- The art of detecting which systems are alive and
reachable via the Internet, and what services
they offer, using techniques such as ping sweeps,
port scans, and operating system identification
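From the defender's side, a port scan tends to show up as one source touching many distinct ports on the same host. The sketch below counts distinct destination ports per source and target in a list of connection attempts. The tuple layout, the threshold of ten ports, and the addresses (taken from the documentation-only ranges) are assumptions made for this example.

```python
from collections import defaultdict

def find_port_scanners(connections, min_distinct_ports=10):
    """Return source addresses that contacted at least
    `min_distinct_ports` different ports on a single destination."""
    ports_by_pair = defaultdict(set)
    for src_ip, dst_ip, dst_port in connections:
        ports_by_pair[(src_ip, dst_ip)].add(dst_port)
    scanners = {
        src for (src, _dst), ports in ports_by_pair.items()
        if len(ports) >= min_distinct_ports
    }
    return sorted(scanners)

# Hypothetical connection attempts: (source, destination, destination port).
connections = [("192.0.2.10", "198.51.100.5", port) for port in range(20, 35)]
connections += [
    ("192.0.2.20", "198.51.100.5", 80),
    ("192.0.2.20", "198.51.100.5", 443),
    ("192.0.2.20", "198.51.100.5", 80),
]

suspects = find_port_scanners(connections)
print(suspects)
```

A real detector would also bound the count by a time window; that is left out here to keep the example short.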
16. Network-Based Attacks
- Attack on availability
- Making a network unavailable or unusable to a
user or a group of users
- Attack on confidentiality
- Many attacks target personal data. Whether it is
a name, address, email, social security number,
or credit card number, many network-based attacks
exist solely to gather confidential and/or
personal information on an individual, group of
individuals, company, or object.
17. Network-Based Attacks
- Attack on integrity
- It is possible for the data to be intercepted
altogether and thus never reach the intended
recipient.
- Attack on authenticity
- Modifies an original data cluster and then passes
it on as unmodified.
18. Network-Based Attacks
- Attack on access control
- This method attacks a legitimate machine within a
secure network in the hope of gaining access to
network and server resources.
- Attack on privacy
- An attack on privacy is mainly used to record
data in some way or another. Whether it is
tracking specific website usage, online video
game play, or email addresses, this method is
used by attackers to exploit an individual's
activity on a computer.
19. Network-Based Attacks
- Prevention
- Firewalls
- Virus Scanners
- Common Sense
20. Known Flags: Data Mining for Security
- Suspicious red flags are not conclusive proof
that fraud has been committed.
- Data mining is simply one tool of many for
preventative measures.
- There is no single catch-all rule in data mining;
it should not be solely relied upon.
- A consistent pattern is a must for possible fraud
identification.
21. Known Flags: Data Mining for Security
- Example: telecommunications fraud
- Nodes represent different countries
- Lines represent international phone calls
- Unusually bright activity represents unusual
activity identified as fraud
22. Known Flags: Data Mining for Security
- Example: compromised credit card accounts
- A distinct pattern usually involves a lost or
stolen card being swiped at a gas station. No
gas is purchased; the swipe is only used to check
whether the account is still active.
- Large jewelry and electronics purchases follow
shortly afterward.
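The credit card pattern above can be written as a simple rule, sketched below. The field names, category labels, amount thresholds, and the 24-hour window are all hypothetical choices for illustration. As the earlier slide notes, a match is a red flag to investigate, not proof of fraud.

```python
def flag_test_swipe_pattern(transactions, probe_max=1.00,
                            large_min=500.00, window_hours=24):
    """Return (probe_hour, purchase_hour) pairs where a near-zero gas
    station swipe is followed within `window_hours` by a large jewelry
    or electronics purchase. Transactions must be sorted by time."""
    flagged = []
    for i, probe in enumerate(transactions):
        if probe["category"] != "gas_station" or probe["amount"] > probe_max:
            continue  # not a candidate "is the card still active?" swipe
        for later in transactions[i + 1:]:
            if later["hour"] - probe["hour"] > window_hours:
                break  # outside the time window
            if (later["category"] in {"jewelry", "electronics"}
                    and later["amount"] >= large_min):
                flagged.append((probe["hour"], later["hour"]))
    return flagged

# Hypothetical account history; "hour" is hours since the first record.
transactions = [
    {"hour": 0, "category": "grocery", "amount": 54.10},
    {"hour": 30, "category": "gas_station", "amount": 0.00},
    {"hour": 31, "category": "jewelry", "amount": 2400.00},
    {"hour": 33, "category": "electronics", "amount": 1800.00},
    {"hour": 200, "category": "gas_station", "amount": 42.00},
    {"hour": 201, "category": "electronics", "amount": 900.00},
]

flags = flag_test_swipe_pattern(transactions)
print(flags)
```

The ordinary fill-up at hour 200 is not flagged, because gas was actually purchased.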
23. Known Flags: Data Mining for Security
- Example: terrorist activity has not been
countered as a result of data mining.
- It has no distinct pattern; a terrorist's profile
has no clear definition.
- Large government (NSA) programs have attempted
to use data mining for preventive measures
without success.
- Total Information Awareness generated thousands
of tips every month for over a year without a
single lead into terrorist organizations.
24. Classification: Predicting the Category to Which
a Particular Record Belongs
- A major part of the classification process is the
initial information gathering task.
- The idea behind this collection of data is that
normal and abnormal patterns of occurrences can
be differentiated from one another, and
algorithms can then be created to detect such
patterns.
- Once such patterns are detected, the algorithms
can flag suspicious events as abnormal in real
time and alert the appropriate person(s) to the
potential intrusion(s).
25. Classification: Predicting the Category to Which
a Particular Record Belongs
- There are several ways these algorithms can
operate. Commonly they are built on decision
trees or simply on a set of predefined rules that
the system data must satisfy.
26. Classification: Predicting the Category to Which
a Particular Record Belongs
- There are many different options available that
employ decision trees. Some of these options
include:
- Classification and Regression Trees (CART)
- Chi-Square Automatic Interaction Detection
(CHAID)
- CART works by inducing two-way splits in a
dataset, causing it to become segmented, whereas
CHAID uses chi-square tests to create multiway
splits with a variable number of branches, also
causing the data to become segmented.
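To make the two splitting criteria more concrete, the sketch below searches for the best two-way split of one numeric attribute by Gini impurity, the usual CART criterion for classification, and computes the chi-square statistic that a CHAID-style test would evaluate for a candidate split. It is a toy illustration, not a full implementation of either algorithm; the attribute, the labels, and the function names are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    n = len(values)
    distinct = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

def chi_square(table):
    """Chi-square statistic for a table of counts (rows: branches, columns: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical attribute: failed logins per session, with known labels.
failed_logins = [0, 1, 0, 2, 8, 9, 7, 10]
labels = ["normal"] * 4 + ["abnormal"] * 4

threshold, score = best_binary_split(failed_logins, labels)
# Counts of (normal, abnormal) on each side of that split.
split_stat = chi_square([[4, 0], [0, 4]])
print(threshold, score, split_stat)
```

A full CHAID implementation would also merge similar categories and convert the statistic to an adjusted p-value; both steps are omitted here.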
27. Classification: Predicting the Category to Which
a Particular Record Belongs
- Lee and Stolfo conducted several experiments
pertaining to classification methods in their
paper "Data Mining Approaches for Intrusion
Detection". The first of these experiments was
on a set of sendmail system call data. This data
consisted of sendmail traces, with the trace data
consisting of two columns of integers.
- The traces contained within the data were
classified as either normal or abnormal. The
normal data constituted a trace of the sendmail
daemon and a concatenation of several invocations
of the sendmail program. The abnormal data was
composed of the following attacks:
- Three traces of sunsendmailcp (sscp)
- Two traces of syslog-remote
- Two traces of syslog-local
- Two traces of decode
- One trace of sm5x
- One trace of sm565a
28. Classification: Predicting the Category to Which
a Particular Record Belongs
- After this data was obtained, system call
sequences had to be derived and labeled as normal
or abnormal so that they could then be supplied
to RIPPER, the rule learning program that was
used to generate rules that predict whether a
sequence is normal or abnormal. The intrusion
detection system then followed a post-processing
scheme to decide whether the current trace was
an intrusion, using the RIPPER predictions.
- The logic here is that when there is an intrusion
on the system, most of the adjacent system call
sequences will be abnormal.
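The pipeline described above can be sketched roughly as follows. RIPPER itself is not used here: a plain lookup in a set of known normal sequences stands in for the learned rules. The majority vote over small regions of predictions is one way to act on the idea that abnormal sequences cluster together. The sequence length, region size, threshold, and system call numbers are illustrative assumptions, not the values used in the experiments.

```python
def sliding_windows(calls, length):
    """All consecutive system call sequences of the given length."""
    return [tuple(calls[i:i + length]) for i in range(len(calls) - length + 1)]

def abnormal_region_fraction(predictions, half_width=1):
    """Slide a region of 2*half_width+1 predictions over the trace and
    return the fraction of regions in which abnormal predictions are the majority."""
    region_len = 2 * half_width + 1
    starts = range(0, len(predictions) - region_len + 1, half_width)
    regions = [predictions[s:s + region_len] for s in starts]
    if not regions:
        return 0.0
    abnormal = sum(1 for r in regions if r.count("abnormal") > half_width)
    return abnormal / len(regions)

def is_intrusion(trace, normal_sequences, seq_len=3, half_width=1, threshold=0.02):
    """Label each sequence, then decide on the whole trace."""
    predictions = [
        "normal" if seq in normal_sequences else "abnormal"  # stand-in for rules
        for seq in sliding_windows(trace, seq_len)
    ]
    return abnormal_region_fraction(predictions, half_width) > threshold

# Hypothetical traces of system call numbers.
normal_trace = [4, 2, 66] * 4
attack_trace = [4, 2, 66, 4, 2, 66, 105, 104, 104, 106, 105, 104, 4, 2, 66]

normal_sequences = set(sliding_windows(normal_trace, 3))
print(is_intrusion(normal_trace, normal_sequences))
print(is_intrusion(attack_trace, normal_sequences))
```

Voting over regions keeps a single mispredicted sequence from marking a whole trace as an intrusion.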
29. Classification: Predicting the Category to Which
a Particular Record Belongs
- From the results it is important to notice that,
generally speaking, intrusion traces will create
much larger abnormal regions than normal traces.
- Also note that the results show that the rules
that were generated can be applied to intrusion
traces not included in the training dataset.
- This means that the rules for normal patterns
(experiments A and B) can be used to detect
anomalies.
- The rules from experiments C and D, on the other
hand, represent the abnormal sequence patterns.
These rules work very well for detecting
intrusion types seen in the training data, but
perform worse than A and B when it comes to
detecting intrusions on traces that were not seen
in the training data.
- The implication here is that the rule set for
abnormal patterns performs well on predictable
intrusions from things such as misuse or other
repeatable events where good basis data can be
used to generate the rules, but is unreliable
when it comes to flagging new types of intrusions
that may occur in the future.
30. Classification: Predicting the Category to Which
a Particular Record Belongs
- The next approach that Lee and Stolfo attempted
involved creating an anomaly detection routine
using only normal traces for training data.
Experiments were carried out to determine the
normal correlation between system calls, that is,
how the nth or the middle system call in a normal
sequence of length n relates to the other calls
in the sequence.
- Lee and Stolfo declared that improvement in
accuracy can come from adding more features,
rather than just system calls, into the models of
program execution. Items such as the file
structure and the paths that were traversed
(directories and names of touched files) could be
used to generate stronger rules.
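One simple way to picture a model of the nth system call is a table that maps each run of n-1 calls seen in normal traces to the call that most often follows it. The sketch below uses such a frequency table as a stand-in for learned rules, which is an assumption of this example rather than the method from the paper. The call numbers and n = 3 are invented.

```python
from collections import Counter, defaultdict

def learn_next_call_model(normal_traces, n):
    """Map each (n-1)-call prefix to the call that most often follows it."""
    table = defaultdict(Counter)
    for trace in normal_traces:
        for i in range(len(trace) - n + 1):
            prefix = tuple(trace[i:i + n - 1])
            table[prefix][trace[i + n - 1]] += 1
    return {prefix: counts.most_common(1)[0][0] for prefix, counts in table.items()}

def violation_rate(trace, model, n):
    """Fraction of positions where the observed nth call differs from
    the model's prediction (unknown prefixes count as violations)."""
    total = violations = 0
    for i in range(len(trace) - n + 1):
        prefix = tuple(trace[i:i + n - 1])
        total += 1
        if model.get(prefix) != trace[i + n - 1]:
            violations += 1
    return violations / total if total else 0.0

# Hypothetical traces of system call numbers; only normal data is used for training.
normal_trace = [4, 2, 66] * 4
suspect_trace = [4, 2, 66, 105, 104, 4, 2, 66]

model = learn_next_call_model([normal_trace], n=3)
normal_rate = violation_rate(normal_trace, model, n=3)
suspect_rate = violation_rate(suspect_trace, model, n=3)
print(normal_rate, suspect_rate)
```

Adding further features, such as the names of touched files, would amount to widening the prefix used as the key.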
31. Classification: Predicting the Category to Which
a Particular Record Belongs
- Lee and Stolfo further examined network intrusion
detection by monitoring network traffic directly,
using a packet capturing program, tcpdump, to
collect data.
- In conclusion, when the data is not designed
specifically for security purposes (as in this
case), it cannot be used to build a detection
model without a certain amount of modification
(or pre-processing). Making these changes
requires considerable knowledge of the domain
being tested, and as such the process is not
easily automated. On the other hand, it is
important to note again that adding extra
measures can improve the accuracy of the
classification model.
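As one possible picture of such pre-processing, the sketch below groups already-parsed packet records into per-connection summaries that a classifier could consume. The tuple layout, the field names, and the choice of features are assumptions for illustration. Parsing real tcpdump output is not shown, and each direction of a conversation is treated as its own connection to keep the example short.

```python
def packets_to_connections(packets):
    """Summarize packets by (src, src_port, dst, dst_port) into
    connection records with duration, packet count and byte count."""
    grouped = {}
    for ts, src, sport, dst, dport, size in packets:
        key = (src, sport, dst, dport)
        rec = grouped.get(key)
        if rec is None:
            grouped[key] = {"start": ts, "end": ts, "packets": 1, "bytes": size}
        else:
            rec["start"] = min(rec["start"], ts)
            rec["end"] = max(rec["end"], ts)
            rec["packets"] += 1
            rec["bytes"] += size
    return [
        {
            "src": key[0], "src_port": key[1],
            "dst": key[2], "dst_port": key[3],
            "duration": rec["end"] - rec["start"],
            "packets": rec["packets"],
            "bytes": rec["bytes"],
        }
        for key, rec in grouped.items()
    ]

# Hypothetical parsed packets: (timestamp, src, src_port, dst, dst_port, bytes).
packets = [
    (0.0, "192.0.2.10", 40000, "198.51.100.5", 80, 60),
    (0.5, "192.0.2.10", 40000, "198.51.100.5", 80, 500),
    (1.5, "192.0.2.10", 40000, "198.51.100.5", 80, 60),
    (2.0, "192.0.2.20", 40001, "198.51.100.5", 25, 120),
]

connection_records = packets_to_connections(packets)
print(connection_records)
```

Deciding which connection-level features to derive is where the domain knowledge mentioned above comes in.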