Title: Data Mining for Analysis of Rare Events: A Case of Computer Security and Other Applications
- Jozef Zurada
- Department of Computer Information Systems
- College of Business
- University of Louisville
- Louisville, Kentucky
- USA
- email: jmzura01@louisville.edu
2. Outline
- Introduction to Knowledge Discovery in Databases and Data Mining
- Data Mining Tools, Techniques, and Tasks
- High-dimensional data
- Feature and value reduction, and sampling
- Rare Events
- What are they?
- What are the application domains exhibiting these characteristics?
- What are the limitations of standard data mining techniques?
- Major Techniques for Detecting Rare Events
- Supervised (classification) techniques - predictive modeling
- Tree-based approaches, neural networks
- Unsupervised techniques
- Anomaly/outlier detection, clustering
- Other data mining techniques: association rules
- Case Study: Intrusion Detection Systems
- What are the general types/categories of cyber attacks?
- Data mining architecture for Intrusion Detection Systems
- Conclusion and Questions
3. What is KDD?
- Finding/extracting interesting information from data stored in large databases/data warehouses
- Interesting
- non-trivial
- implicit
- previously unknown (novel)
- easily understood
- rule length, number of conditions in a rule
- potentially useful (actionable)
- Information
- patterns
- rules
- correlations
- relationships hidden in data
- descriptions of rare events
- detection of outliers/anomalies/rare events
- prediction of events
- Interesting patterns represent knowledge
4. Measures of Pattern Interestingness
- Objective
- Rule support
- Represents the percentage of transactions from a transaction database that the given rule satisfies
- Probability P(X∩Y), where X∩Y indicates that a transaction contains both X and Y
- support(X⇒Y) = P(X∩Y)
- Rule confidence
- Assesses the degree of certainty of the detected association
- Conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y
- confidence(X⇒Y) = P(Y|X)
- Subjective
- based on user beliefs in the data
- Each measure is associated with a threshold controlled by the user
- Rules that do not satisfy a confidence threshold of, say, 50% are considered uninteresting
- they reflect noise, exceptions, or minority cases
- Objective measures are combined with subjective measures
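The two objective measures above can be sketched directly. A minimal example; the toy transaction database, the item names, and the candidate rule bread ⇒ milk are all illustrative:

```python
# Computing support and confidence for a candidate rule X => Y
# over a toy transaction database (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "milk", "butter"},
]

def support(X, Y):
    # support(X => Y) = P(X ∩ Y): fraction of transactions containing both X and Y
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y):
    # confidence(X => Y) = P(Y | X): of the transactions containing X,
    # the fraction that also contain Y
    has_x = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / has_x if has_x else 0.0

s = support({"bread"}, {"milk"})     # 3 of 5 transactions -> 0.6
c = confidence({"bread"}, {"milk"})  # 3 of the 4 bread transactions -> 0.75
```

With a 50% confidence threshold as on the slide, this rule (confidence 0.75) would be kept as interesting.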
5. Steps in the KDD Process
- Understanding the application domain
- relevant prior knowledge and goals of the application
- Data cleaning, integration, and preprocessing (60% of effort)
- Creating a target data set
- data selection and transformation
- feature and data reduction
- selection of variables, sampling of rows
- Applying the DM technique(s) - the core of KDD
- choosing the task: classification, prediction, clustering
- choosing the algorithm
- searching for patterns of interest
- Interpreting/evaluating mined patterns
- Using the discovered knowledge
6. A KDD Process
[Diagram: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7. A KDD Process
- These activities are iterative, interactive, and have a user-friendly character
- The end-user has to accept/reject the results produced by the KDD system
8. KDD: Integration of Many Disciplines
- Database Technology
- Statistics
- Machine Learning / Artificial Intelligence
- Information Science
- High-Performance Computing
- Visualization
- Pattern Recognition
- Neural Networks
- Fuzzy Logic
- Evolutionary Computing
- Graph Theory
9. Data Mining Techniques
- Neural Networks
- Decision Trees
- Fuzzy Systems (Logic, Rules)
- Genetic Algorithms
- Association Rules
- Memory-based Reasoning (k-Nearest Neighbor)
- Deviation/Anomaly Detection
- Allow one to
- learn from data
- understand something new
- answer tough questions
- locate a problem
- Can be complemented by traditional statistical techniques, OLAP, and SQL queries
10. Unsupervised DM Techniques
- Use unsupervised learning
- no target or class variable
- group input data records into classes based on self-similarities in the data
- The goal is not specific
- "Tell me something interesting about the data"
- What common characteristics/profiles do terrorists share?
- What is the activity pattern of a typical network intruder?
- No constraints on a DM system
- No indications of what the user expects and what kind of discovery could be of interest
- Examples: clustering, finding association rules, deviation detection, neural networks
11. Supervised DM Techniques
- Use supervised learning
- classification, prediction
- the target (dependent) variable has a clearly defined label
- Attempt to
- predict a specific data value
- weight, height, age
- classify/categorize an item into a fixed set of known classes
- (yes/no, friend/foe, healthy/bankrupt, legitimate/illegitimate)
- The goal is specific
- Ex. Will this company go bankrupt?
- Is this individual a friend or a foe (terrorist)?
- Is this credit card transaction legitimate or fraudulent?
- Is someone trying to access a computer network an intruder or not?
12. Classification Task
- Deals with discrete outcomes: intruder/non-intruder, legitimate/fraudulent, friend/foe
- Learning a function that classifies a data item into one of several predefined classes
- set of rules
- mathematical equation
- set of weights
- The training set consists of pre-classified examples
- A newly presented object is assigned a class
- A network system administrator can use the classifier to decide whether a person accessing the network is an intruder or not
13. Clustering Task
- Unsupervised learning
- Segmenting a heterogeneous population into a number of more homogeneous clusters or groups
- No predefined classes to be used for training
- Records are grouped together based on self-similarity
- It is up to you what meaning, if any, to attach to the resulting clusters
- Often done as a prelude to some other form of DM (classification)
- Often based on computing the distances between data points
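A minimal distance-based clustering sketch of the task above, in the spirit of k-means: records are grouped purely by self-similarity, with no predefined classes. The toy 2-D points, the choice k = 2, and the naive initialization are all illustrative:

```python
# Naive k-means sketch: alternate assignment (nearest center) and
# update (cluster mean) steps over toy 2-D data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

def kmeans(points, k, iters=10):
    # naive deterministic init: spread the starting centers across the data
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda m: (p[0] - centers[m][0]) ** 2
                                + (p[1] - centers[m][1]) ** 2)
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        for j, c in enumerate(clusters):
            if c:
                centers[j] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

centers, clusters = kmeans(points, k=2)  # recovers the two tight groups
```

What meaning to attach to the two resulting groups is, as the slide notes, up to the analyst.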
14. Optimization Task
- Finding one or a series of optimal solutions from among a very large number of possible solutions
- Traditional mathematical techniques may break down because of billions of combinations
15. High-Dimensionality Data
- Data/dimensionality reduction
- # of features
- # of samples
- # of values for the features
- Gains from data reduction
- Improved predictive/descriptive accuracy
- Model is better understood
- Uses fewer rules, weights, variables
- Fewer features
- In the next round of data collection, irrelevant features can be discarded
16. Data Preparation
- Always done, regardless of the DM task and technique
- Depends on
- amounts of data
- DM task (classification, clustering/segmentation)
- types of values (numeric or categorical) for features/variables
- behavior of data with respect to time
- Normalization
- data values scaled to a specific range [0,1], or to z-scores
- Reasons
- features with larger values outweigh features with smaller values
- clustering techniques are based on computing the distance between data points
- neural networks learn better
- prevents saturation of neurons
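The two scalings named above can be sketched in a few lines; the sample values are illustrative:

```python
# Min-max scaling to [0, 1] and z-score standardization of one feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]   # rescaled into [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscores = [(v - mean) / std for v in values]      # mean 0, std 1
```

After either scaling, a feature measured in thousands no longer dominates the distance computations that clustering and neural network training rely on.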
17. Data Preparation
- Data smoothing/rounding
- Minor differences between the values of a feature are unimportant
- Binning
- placing values in different intervals by consulting their neighbors
- Transformation of features
- Reduces the # of features
18. Data Preparation
- Outlier detection
- Samples inconsistent with respect to the remaining data
- Not an easy subject
- Some applications focus on outlier detection; others do not
- Ex. detecting fraudulent credit card transactions
- 1 out of 10,000 transactions is fraudulent
- In many classes of DM applications, we remove outliers
- Be careful with the automatic removal of outliers
- Methods for outlier detection
- Visualization for 2-D, 3-D, or 4-D data
- Based on the mean and variance of a feature
- Distance-based
- multidimensional samples
- calculate the distance between all samples in an n-dimensional dataset
- outliers are those samples which do not have enough neighbors
19. Sampling
- Millions of cases? Often 20,000 or so is enough
- The sample should have the same probability distribution as the population
- Random sampling
- with replacement
- without replacement
- Stratified sampling
- the initial data set is split into non-overlapping subsets (strata)
- sampling is performed on each stratum independently of the others
- Incremental sampling
- Increasingly larger random subsets, to observe the trends in the tool's performance and to stop when no progress is made
- How many samples?
- No simple answer - "enough"
- The # depends on
- the algorithm
- the # of classes the algorithm predicts
- the # of variables in the data set
- the required reliability of the results
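Stratified sampling as described above can be sketched as follows. The class labels, the 99:1 imbalance, and the 10% sampling rate are illustrative (an imbalance like the intrusion-detection setting discussed later):

```python
import random

# Stratified sampling: split records by class label into non-overlapping
# strata, then sample each stratum independently at the same rate.
records = ([("normal", i) for i in range(9900)]
           + [("attack", i) for i in range(100)])

def stratified_sample(records, rate, seed=42):
    strata = {}
    for label, rec in records:
        strata.setdefault(label, []).append((label, rec))
    rng = random.Random(seed)
    sample = []
    for label, rows in strata.items():
        k = max(1, int(len(rows) * rate))  # keep at least one record per stratum
        sample.extend(rng.sample(rows, k))  # sampling without replacement
    return sample

sample = stratified_sample(records, rate=0.1)  # 990 normal + 10 attack records
```

Unlike plain random sampling, the rare "attack" stratum is guaranteed to be represented in proportion to the population.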
20. Feature Reduction
- Hundreds of features
- many irrelevant, correlated, redundant
- Feature selection is often a space-search problem
- Small # of features → can be searched exhaustively (all combinations)
- 20 features: 2^20 combinations > 1,000,000 combinations
21. Feature Reduction Methods
- Independent examination of features based on the mean and variance
- tests features separately, one feature at a time
- assumes the examined feature is normally distributed
- assumes a given feature is independent of the others
- examines one feature at a time without taking into account its relationship to other features
- Collective examination of features based on feature means and covariances
- tests all features together
- assumes features have normally distributed values
- impractical and computationally prohibitive
- yields a huge search space
22. Feature Reduction Methods
- Principal component analysis (PCA)
- Very popular, well-established, frequently used
- Complex in terms of calculations
- Components which contribute the least to the variation in the data set are eliminated
- Entropy measure
- Called unsupervised feature selection
- no output feature containing a class label
- Removing an irrelevant feature from a set may not change the information content of the data set
- Information content is measured by entropy
- Features on a numeric or categorical scale
- Numeric: normalized Euclidean distance
- Categorical: Hamming distance
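A minimal PCA sketch for the two-feature case: build the covariance matrix, take its eigenvalues, and measure how much variance the first component retains (and hence how little is lost by eliminating the second). The toy data are illustrative:

```python
import math

# PCA on two correlated features via the 2x2 covariance matrix.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
cyy = sum((b - my) ** 2 for b in y) / (n - 1)
cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# eigenvalues of the covariance matrix [[cxx, cxy], [cxy, cyy]]
tr, det = cxx + cyy, cxx * cyy - cxy * cxy
l1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2  # variance along PC1
l2 = (tr - math.sqrt(tr * tr - 4 * det)) / 2  # variance along PC2

explained = l1 / (l1 + l2)  # fraction of total variance kept by PC1 alone
```

Here `explained` exceeds 0.9, so dropping the second component discards less than 10% of the variation; with hundreds of features the same eigenvalue ranking decides which components to eliminate.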
23. Neural Networks
- Enable one to acquire, store, and utilize experiential knowledge
- Try to emulate biological neurological systems
- Try to mimic/approximate the way the human brain functions and processes information
- Used successfully for the following tasks
- Classification
- Clustering
- Optimization
- Implemented as mathematical models of the human brain
24. Neural Networks
- Characterized by three properties
- Computational property
- built of neurons
- summation node and activation function
- organized in layers
- interconnected using weights
- Architecture of the network
- Feed-forward NN with error back-propagation
- classification, prediction
- Kohonen network
- clustering (segmentation)
- Learning property
- supervised mode (with a teacher)
- unsupervised mode (without a teacher)
- Knowledge is encoded in the network's weights
26. Decision Trees
- Useful for classification tasks
- Learn from data, like neural networks
- Operation is based on algorithms that
- make the clusters at each node purer and purer by progressively reducing disorder (impurity) in the original data set
- impurity is measured by entropy
- find the optimum number of splits and determine where to partition the data to maximize the information gain
- Nodes, branches, and leaves indicate the variables, conditions, and outcomes, respectively
- The most predictive variable is placed at the top node of the tree
- The model is represented in the form of explicit and understandable rule-like relationships among variables
- Each rule represents a unique path from the root to a leaf
- Not as robust and good as neural networks at detecting complex nonlinear relationships between variables
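The impurity calculation at the heart of the split search above can be sketched directly: entropy of a node, and the information gain of one candidate split. The toy "normal"/"attack" labels and the particular split are illustrative:

```python
import math

# Entropy of a node's class labels, and the information gain of a split.
def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

parent = ["normal"] * 6 + ["attack"] * 6   # 50/50 mix: maximally impure node
left   = ["normal"] * 5 + ["attack"] * 1   # the split yields purer children
right  = ["normal"] * 1 + ["attack"] * 5

n = len(parent)
gain = (entropy(parent)
        - (len(left) / n) * entropy(left)
        - (len(right) / n) * entropy(right))
```

The tree-growing algorithm evaluates `gain` for every candidate split and keeps the one with the largest value, i.e. the split that reduces disorder the most.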
28. Fuzzy Logic
- Enables one to build fuzzy systems
- Knowledge is encoded in fuzzy sets and fuzzy rules
- Fuzzy rules enable one to reason about or describe a process in terms of approximations
- Fuzzy sets: sets without clearly defined boundaries
- Can produce very accurate results
- Fast response time
- Knowledge about the fuzzy rules and fuzzy sets is
- elicited from domain experts
- generated from the given data
- neuro-fuzzy systems
29. Genetic Algorithms
- Solve problems (mainly optimization) by borrowing a technique from nature
- Use three of Darwin's basic principles
- Survival of the fittest (reproduction)
- Cross-breeding (crossovers)
- Mutation
- to create approximate solutions to problems
- choosing the fitness function, selection scheme, and genome encoding is often difficult
- Example
- You work for a shipping firm and have to make shipments to 6 different towns. You have one vehicle, and your task is to minimize the distance traveled. The vehicle can visit each city only once and can start from any city.
30. Traveling Salesman Problem (TSP)
- N = 6 cities. Number of unique paths = (N-1)!/2 = (6-1)!/2 = 5!/2 = (1·2·3·4·5)/2 = 60
- N = 25 cities. Number of unique paths = (N-1)!/2 = (25-1)!/2 = (1·2·3·4·5·...·22·23·24)/2 ≈ 3.1×10^23 paths (a very large number). It would take the fastest computer millions of years to calculate all possible solutions (paths). Computationally intractable.
[Diagram: a tour through Chicago, New York, Miami, Mexico City, Los Angeles, and San Francisco with inter-city distances]
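The path counts on this slide follow from the (N-1)!/2 formula and can be checked in two lines:

```python
from math import factorial

# A symmetric TSP on N cities has (N-1)!/2 distinct tours:
# fix one city as the start and divide by 2 for direction.
def unique_tours(n):
    return factorial(n - 1) // 2

n6 = unique_tours(6)    # 60 tours for 6 cities
n25 = unique_tours(25)  # about 3.1e23 tours for 25 cities - intractable to enumerate
```

This is why exhaustive search breaks down and a genetic algorithm, which only samples the tour space, becomes attractive.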
31. Rare Events
- We are drowning in the massive amounts of data being collected, while starving for knowledge at the same time
- Despite the enormous amount of data, particular events of interest are still quite rare
- Rare events are events that occur very infrequently, i.e., their frequency ranges from 0.01% to 10%
- However, when they occur, their consequences can be quite dramatic, often in a negative sense
32. Applications of Rare Cases
- Network intrusion detection
- The number of intrusions on the network is typically a very small fraction of the total network traffic
- Credit card fraud detection
- Millions of legitimate transactions are stored, while only a very small percentage is fraudulent
- Medical diagnostics
- When classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image
33. Applications of Rare Cases
- Web mining
- < 3% of all people visiting Amazon.com make a purchase
- Identifying passengers at airports (through biometrics) and screening their luggage
- Only an extremely small number of passengers is suspected of hostile activities; the same applies to passengers' luggage that may contain explosives
- Fraud detection
- auto insurance: detecting people who stage accidents to collect on insurance
- Profiling individuals
- finding clusters of "model" terrorists who share similar characteristics
- Money laundering, financial fraud, churn analysis
34. Key Technical Challenges for Detecting Rare Events
- Large data size
- High dimensionality
- Temporal nature of the data
- Skewed class distribution
- Rare events are underrepresented in the data set (minority class)
- Data preprocessing
- On-line analysis
35. Limitations of Standard Data Mining Schemes
- Many classic data mining issues and methods apply in the domain of rare cases
- Limitations
- Standard approaches for feature selection and construction, computing distances between samples, and sampling do not work well for rare-case analysis
- While most normal events are similar to each other, rare events are quite different from one another
- Regular network traffic is fairly standard, while suspicious traffic varies from the standard in many different ways
- Metrics used to evaluate normal event detection methods
- Overall classification accuracy is not appropriate for evaluating methods for rare event detection
- In many applications data keeps arriving in real time, and there is a need to detect rare events on the fly, with models built only on the events seen so far
36. Computer Security
- Broad and extremely important field
- Generally encompasses two aspects
- How computers can be used to secure the information contained within organizations
- Detection and/or prevention of unauthorized access or attacks on computers, networks, operating systems, data, and applications local to an organization
- How computers can be used to detect hostile activity in a sensitive geographical area (such as an airport)
- Involves computer vision technology
- Identifying patterns of activities that can suggest a friend or foe
37. Computer Security
- The ability of a computer system to protect information and system resources with respect to
- Confidentiality: prevention of unauthorized disclosure of information
- Integrity: prevention of unauthorized modification of information
- Availability: prevention of unauthorized withholding of information
- Intrusion: a cyber attack that tries to bypass security mechanisms
- Outsider attack on the system from the Internet
- Hackers, spies, kiddies
- Stealing, spying, probing (to collect information about the host)
- DoS attacks, viruses, worms
- Insider (employee) attempt to gain and misuse non-authorized privileges
38. Taxonomy of Computer Attacks
- Intrusions can be classified according to several categories
- Attack type
- DoS, worms/trojan horses
- Number of network connections involved in the attack
- single-connection cyber attacks
- multiple-connection cyber attacks
- Source of the attack
- multiple-location (coordinated/distributed) vs. single-location attacks
- inside vs. outside
- Target of the attack
- single or many different destinations
- Environment (network, host, P2P (peer-to-peer), wireless networks, ...)
- less secure physical layer
- no traffic concentration points for monitoring packets
- Automation (manual, automated, semi-automated attack)
- Need to analyze network data from several sites to detect these attacks
39. Prevention: Existing Security Mechanisms
- Security protocols and policies
- IPSec: security at the IP layer
- source authentication
- encryption
- Secure Socket Layer (SSL)
- source authentication
- encryption
- Host-based protections
- Regularly installing patches, defending accounts, integrity checks
- Firewalls
- Control the flow of traffic between networks
- Block traffic from the Internet and to the Internet
- Monitor communication between networks and examine each packet to see if it should be let through
- All the above mechanisms are insufficient due to
- Security holes, insider attacks, multiple levels of data confidentiality within an organization
- Sophistication of cyber attacks, their severity, and increased intruders' knowledge
- Data mining can help
- It is not a cure for all problems
40. Motivation - Data Mining for Intrusion Detection
- Increased interest in data-mining-based intrusion detection
- Attacks for which it is difficult to build signatures
- Attack stealthiness
- Unforeseen/unknown/emerging attacks
- Distributed/coordinated attacks
- Data mining approaches for intrusion detection
- Misuse detection
- supervised learning
- Anomaly detection
- unsupervised learning
- Summarization of attacks using association rules
41. Motivation - Data Mining for Intrusion Detection
- Data mining approaches for intrusion detection
- Misuse detection
- Supervised learning
- Based on extensive knowledge of patterns associated with known attacks, provided by human experts
- Building predictive models from labeled data sets (instances are labeled as "normal" or "intrusive") to identify known intrusions
- Major advantages
- High accuracy in detecting many kinds of known attacks
- Produces models that can be easily understood
- Major limitations
- Cannot detect unknown and emerging attacks
- The data has to be labeled
- The signature database has to be manually revised for each new type of discovered attack
- Major approaches: pattern (signature) matching, expert systems, neural networks, decision trees, logistic regression, memory-based reasoning
- SNORT system
42. Motivation - Data Mining for Intrusion Detection
- Data mining approaches for intrusion detection
- Anomaly detection
- Unsupervised learning
- Based on profiles that represent the normal behavior of users, hosts, or networks, and detecting attacks as significant deviations from this profile
- Major benefit: potentially able to recognize unforeseen attacks
- Major limitation: possible high false alarm rate, since detected deviations do not necessarily represent actual attacks
- Major approaches: statistical methods, expert systems, clustering, neural networks, outlier detection schemes, deviation/anomaly detection
- Analyze each event to determine how similar (or dissimilar) it is to the majority
- Success depends on the choice of similarity measures and dimension weighting
- Summarization of attacks using association rules
43. IDS Information Sources
- Host-based IDS
- base decisions on information obtained from a single host (e.g., system log data, system call data)
- Network-based IDS
- make decisions according to information and data obtained by monitoring the traffic in the network to which the hosts are connected
- Wireless network IDS
- detect intrusions by analyzing traffic between mobile nodes
- Application logs
- detect intrusions by analyzing, for example, database logs (database misuse) or web logs
- IDS sensor alerts
- analysis of low-level sensor alarms
- analysis of alarms generated by other IDSs
44. Data Sources in Network Intrusion Detection
- Network traffic data is usually collected using network sniffers
- Tcpdump
- 08:02:15.471817 0:10:7b:38:46:33 0:10:7b:38:46:33 loopback 60
- 0000 0100 0000 0000 0000 0000 0000 0000
- 0000 0000 0000 0000 0000 0000 0000 0000
- 0000 0000 0000 0000 0000 0000 0000
- 08:02:19.391039 172.16.112.100.3055 > 172.16.112.10.ntp v1 client strat 0 poll 0 prec 0
- 08:02:19.391456 172.16.112.10.ntp > 172.16.112.100.3055 v1 server strat 5 poll 4 prec -16 (DF)
- Net-flow tools
- Source and destination IP address, source and destination ports, type of service, packet and byte counts, start and end time, input and output interface numbers, TCP flags, routing information (next-hop address, source autonomous system (AS) number, destination AS number)
- 0624.12439.344 0624.12448.292 211.59.18.101 4350 160.94.179.138 1433 6 2 3 144
- 0624.9110.667 0624.9119.635 24.201.13.122 3535 160.94.179.151 1433 6 2 3 132
- 0624.12440.572 0624.12449.496 211.59.18.101 4362 160.94.179.150 1433 6 2 3 152
- Collected data are in the form of network connections or network packets (a network connection may contain several packets)
45. Projects: Data Mining in Intrusion Detection
- MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) - Columbia University, Georgia Tech, Florida Tech
- ADAM (Audit Data Analysis and Mining) - George Mason University
- MINDS (University of Minnesota)
- Intelligent Intrusion Detection (IIDS) - Mississippi State University
- Data Mining for Network Intrusion Detection (MITRE Corporation)
- Institute for Security Technology Studies (ISTS), Dartmouth College
- Intrusion Detection Techniques (Arizona State University)
- Agent-based data mining system (Iowa State University)
- IDDM (Intrusion Detection using Data Mining Techniques) - Department of Defense, Australia
46. Data Preprocessing for Data Mining in ID
- Converting the data from the monitored system (computer network, host machine, ...) into data (features) that will be used in data mining models
- For misuse detection, labeling data examples as normal or intrusive may require enormous time from many human experts
- Building data mining models
- Misuse detection models
- Anomaly detection models
- Analysis and summarization of results
49. Misuse Detection - Evaluation of Rare Class Problems: F-value
- Accuracy is not a sufficient metric for evaluation
- Accuracy = (TN + TP)/(TN + FP + FN + TP)
- Ex. a network traffic data set with 99.99% normal data and 0.01% intrusions
- A trivial classifier that labels everything with the normal class can achieve 99.99% accuracy!!!
- Focus on both recall and precision
- Recall (R) = TP/(TP + FN)
- Precision (P) = TP/(TP + FP)
- F-measure = 2RP/(R + P)
- C: rare class (attacks)
- NC: normal class (normal connections)
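The formulas above, computed on a toy confusion matrix for the rare class C (attacks); the counts are illustrative:

```python
# 50 attacks hidden among 10,000 connections: the classifier finds 40 of
# them (TP), misses 10 (FN), and raises 60 false alarms (FP).
TP, FN, FP, TN = 40, 10, 60, 9890

accuracy  = (TN + TP) / (TN + FP + FN + TP)   # 0.993: looks excellent
recall    = TP / (TP + FN)                    # 0.8: detection rate on attacks
precision = TP / (TP + FP)                    # 0.4: most alarms are false
f_measure = 2 * recall * precision / (recall + precision)
```

Accuracy is above 99% while the F-measure is only about 0.53, which is exactly the slide's point: on skewed classes, accuracy hides poor rare-class performance.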
50. Misuse Detection - Evaluation of Rare Class Problems: ROC
- C: rare class (attacks)
- NC: normal class (normal connections)
- Standard measures for evaluating rare class problems
- Detection rate (recall): the ratio between the number of correctly detected rare events (attacks) and the total number of rare events (attacks)
- False alarm (false positive) rate: the ratio between the number of data records from the majority class (normal connections) that are misclassified as rare events (attacks) and the total number of data records from the majority class (normal connections)
- ROC curve: the trade-off between detection rate and false alarm rate
51. Misuse Detection - Evaluation of Rare Class Problems: ROC
- The area under the ROC curve (AUC) is computed using a form of the trapezoid rule
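The trapezoid-rule AUC computation can be sketched in a few lines; the (false alarm rate, detection rate) points below are an illustrative ROC curve, not real classifier output:

```python
# AUC via the trapezoid rule: sum the area of the trapezoid under each
# consecutive pair of ROC points (x = false alarm rate, y = detection rate).
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(roc, roc[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2.0   # width * average height
```

A random classifier traces the diagonal and gets AUC 0.5; this curve scores 0.80, and a perfect detector would score 1.0.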
52. Misuse Detection - Manipulating Data Records
- Over-sampling the rare class
- Make duplicates of the rare events until the data set contains as many examples as the majority class ⇒ balance the classes
- Does not increase information, but increases the misclassification cost
- SMOTE (Synthetic Minority Over-sampling TEchnique)
- Synthetically generates minority class examples
- When generating an artificial minority class example, distinguish two types of features
- Continuous features
- Nominal (categorical) features
- Down-sizing (under-sampling) the majority class
- Sample the data records from the majority class
- Randomly
- "Near miss" examples
- Examples far from minority class examples (far from decision boundaries)
- Introduce the sampled data records into the original data set instead of the original data records from the majority class
- Usually results in a general loss of information and potentially overly general rules
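For continuous features, the SMOTE idea above can be sketched as interpolation: a synthetic minority example is placed on the line segment between a minority point and one of its minority-class neighbors. The toy data and the simplified neighbor choice (any same-class point, rather than one of the k nearest) are illustrative:

```python
import random

# SMOTE-style synthetic generation for continuous features: unlike plain
# duplication, each new example is a fresh point between two real ones.
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]

def smote_one(points, rng):
    p = rng.choice(points)
    q = rng.choice([r for r in points if r is not p])  # a same-class neighbor
    t = rng.random()                                   # interpolation factor in [0, 1]
    return tuple(a + t * (b - a) for a, b in zip(p, q))

rng = random.Random(7)
synthetic = [smote_one(minority, rng) for _ in range(5)]
```

Every synthetic point lies inside the region spanned by the minority class, which widens the minority decision region instead of merely re-weighting existing points.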
53. Unsupervised Techniques: Anomaly Detection
- Build models of normal behavior and detect anomalies as deviations from it
- Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognized as anomalies
- Two types of techniques
- with access to normal data
- with NO access to normal data (what is normal is not known)
54. Outlier Detection Schemes
- An outlier is defined as a data point which is very different from the rest of the data, based on some measure
- Detect novel attacks/intrusions by identifying them as deviations from normal behavior
- Identify normal behavior
- Construct a useful set of features
- Define a similarity function
- Use an outlier detection algorithm
- Statistics-based approaches
- Distance-based approaches
- Nearest neighbor approaches
- Clustering-based approaches
- Density-based schemes
55. Distance-based Outlier Detection Scheme
- k-Nearest Neighbor approach
- For each data point d, compute the distance to its k-th nearest neighbor, dk
- Sort all data points according to the distance dk
- Outliers are points that have the largest distance dk and are therefore located in the sparsest neighborhoods
- Usually the data points with the top n distances dk are identified as outliers
- n is a user parameter
- Not suitable for datasets that have modes with varying density
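The scheme above in a few lines: score every point by the distance to its k-th nearest neighbor and flag the top n scores. The toy 2-D points and the choices k = 2, n = 1 are illustrative:

```python
# k-NN distance outlier detection: a point in a sparse neighborhood has a
# large distance to its k-th nearest neighbor.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (9.0, 9.0)]

def knn_distance(p, points, k):
    dists = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in points if q is not p)
    return dists[k - 1]          # distance to the k-th nearest neighbor

k, n = 2, 1
ranked = sorted(points, key=lambda p: knn_distance(p, points, k), reverse=True)
outliers = ranked[:n]            # the n points in the sparsest neighborhoods
```

The isolated point (9.0, 9.0) has a k-NN distance of roughly 11 while the clustered points score around 0.2, so it is the one flagged.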
56. Distance-based Outlier Detection Scheme
- Mahalanobis-distance based approach
- The Mahalanobis distance is more appropriate for computing distances with skewed distributions
- The measure takes the shape of the data distribution into account
- Example
- In Euclidean space, data point p1 is closer to the origin than data point p2
- When computing the Mahalanobis distance, data points p1 and p2 are equally distant from the origin
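A complementary two-feature sketch of the same shape-awareness: two points at the *same* Euclidean distance from the mean get very *different* Mahalanobis distances, because the data are stretched along one direction. The toy data and test points are illustrative:

```python
import math

# Mahalanobis distance for 2 features: deviations are divided by the
# data's own variance/correlation, so direction matters.
data = [(-4.0, -1.0), (-2.0, 0.0), (0.0, 0.5), (2.0, 0.0), (4.0, 0.5)]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
cxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
cyy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
cxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)

def mahalanobis(p):
    dx, dy = p[0] - mx, p[1] - my
    det = cxx * cyy - cxy * cxy
    # d^2 = [dx dy] * (inverse covariance) * [dx dy]^T, 2x2 closed form
    return math.sqrt((cyy * dx * dx - 2 * cxy * dx * dy + cxx * dy * dy) / det)

# same Euclidean distance (3.0) from the mean, different Mahalanobis distance
d_along  = mahalanobis((mx + 3.0, my))   # along the stretched x-direction
d_across = mahalanobis((mx, my + 3.0))   # across it: far more anomalous
```

A deviation across the stretched direction is far rarer under this distribution, and the Mahalanobis distance reflects that where the Euclidean distance cannot.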
57. Model-based Outlier Detection Schemes
- Use a prediction model to learn the normal behavior
- Every deviation from the learned prediction model can be treated as an anomaly or potential intrusion
- Recent approaches
- Neural networks
- Unsupervised Support Vector Machines (SVMs)
58. Neural Networks for Outlier Detection
- Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes
- The input variables are also the output variables, so that the RNN forms a compressed model of the data during training
- A measure of outlyingness is the reconstruction error of individual data points
59. Conclusions
- Data mining analysis of rare events requires special attention
- Many real-world applications exhibit a needle-in-the-haystack type of problem
- Current state-of-the-art data mining techniques are still insufficient for efficiently handling rare events
- There is a need to design better and more accurate data mining models
60. References
- Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
- Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, IEEE Press/Wiley, 2003
- Lazarevic, A., Srivastava, J., and Kumar, V., Data Mining for Computer Security Applications, IEEE ICDM 2003 Tutorial
- Kantardzic, M., and Zurada, J. (Eds.), Next Generation of Data Mining Applications, IEEE Press/Wiley, 2005
- Tan, P., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison Wesley, 2005
- Zurada, J., Knowledge Discovery and Data Mining, Lecture Notes on Blackboard, Spring 2005
61. Thank you for attending this lecture!