Title: Data Mining for Analysis of Rare Events: A Case of Computer Security and Other Applications
- Jozef Zurada
- Department of Computer Information Systems
- College of Business
- University of Louisville
- Louisville, Kentucky
- USA
- email: jmzura01@louisville.edu
2. Outline
- Introduction to Knowledge Discovery in Databases and Data Mining
- Data Mining Tools, Techniques, and Tasks
- High-dimensional data
- Feature and value reduction, and sampling
- Rare Events
- What are they?
- What are the application domains exhibiting these characteristics?
- What are the limitations of standard data mining techniques?
- Major Techniques for Detecting Rare Events
- Supervised (classification) techniques - predictive modeling
- Tree-based approaches, neural networks
- Unsupervised techniques
- Anomaly/outlier detection, clustering
- Other data mining techniques: association rules
- Case Study: Intrusion Detection Systems
- What are the general types/categories of cyber attacks?
- Data mining architecture for Intrusion Detection Systems
- Conclusion and Questions
3. What is KDD?
- Finding/extracting interesting information from data stored in large databases/data warehouses
- Interesting
- non-trivial
- implicit
- previously unknown (novel)
- easily understood
- rule length, number of conditions in a rule
- potentially useful (actionable)
- Information
- patterns
- rules
- correlations
- relationships hidden in data
- descriptions of rare events
- detection of outliers/anomalies/rare events
- prediction of events
- Interesting patterns represent knowledge
4. Measures of Pattern Interestingness
- Objective
- Rule support
- Represents the percentage of transactions from a transaction database that the given rule satisfies
- Probability P(X∩Y), where X∩Y indicates that a transaction contains both X and Y
- support(X⇒Y) = P(X∩Y)
- Rule confidence
- Assesses the degree of certainty of the detected association
- Conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y
- confidence(X⇒Y) = P(Y|X)
- Subjective
- based on user beliefs in the data
- Each measure is associated with a threshold controlled by the user
- Rules that do not satisfy a confidence threshold of, say, 50% are considered uninteresting
- they reflect noise, exceptions, or minority cases
- Objective measures are combined with subjective measures
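The two objective measures above can be sketched directly. A minimal example; the toy transaction database, the item names, and the candidate rule bread ⇒ milk are all illustrative:

```python
# Computing support and confidence for a candidate rule X => Y
# over a toy transaction database (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "milk", "butter"},
]

def support(X, Y):
    # support(X => Y) = P(X ∩ Y): fraction of transactions containing both X and Y
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y):
    # confidence(X => Y) = P(Y | X): of the transactions containing X,
    # the fraction that also contain Y
    has_x = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / has_x if has_x else 0.0

s = support({"bread"}, {"milk"})     # 3 of 5 transactions -> 0.6
c = confidence({"bread"}, {"milk"})  # 3 of the 4 bread transactions -> 0.75
```

With a 50% confidence threshold as on the slide, this rule (confidence 0.75) would be kept as interesting.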
5. Steps in the KDD Process
- Understanding the application domain
- relevant prior knowledge and goals of the application
- Data cleaning, integration, and preprocessing (60% of effort)
- Creating a target data set
- data selection and transformation
- feature and data reduction
- selection of variables, sampling of rows
- Applying the DM technique(s) - the core of KDD
- choosing the task: classification, prediction, clustering
- choosing the algorithm
- searching for patterns of interest
- Interpreting/evaluating mined patterns
- Using the discovered knowledge
6. A KDD Process
[Diagram: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7. A KDD Process
- These activities are iterative, interactive, and have a user-friendly character
- The end-user has to accept/reject the results produced by the KDD system
8. KDD: Integration of Many Disciplines
- Database Technology
- Statistics
- Machine Learning / Artificial Intelligence
- Information Science
- High-Performance Computing
- Visualization
- Pattern Recognition
- Neural Networks
- Fuzzy Logic
- Evolutionary Computing
- Graph Theory
9. Data Mining Techniques
- Neural Networks
- Decision Trees
- Fuzzy Systems (Logic, Rules)
- Genetic Algorithms
- Association Rules
- Memory-based Reasoning (k-Nearest Neighbor)
- Deviation/Anomaly Detection
- Allow one to
- learn from data
- understand something new
- answer tough questions
- locate a problem
- Can be complemented by traditional statistical techniques, OLAP, and SQL queries
10. Unsupervised DM Techniques
- Use unsupervised learning
- no target or class variable
- group input data records into classes based on self-similarities in the data
- The goal is not specific
- "Tell me something interesting about the data"
- What common characteristics/profiles do terrorists share?
- What is the activity pattern of a typical network intruder?
- No constraints on a DM system
- No indications of what the user expects and what kind of discovery could be of interest
- Examples: clustering, finding association rules, deviation detection, neural networks
11. Supervised DM Techniques
- Use supervised learning
- classification, prediction
- the target (dependent) variable has a clearly defined label
- Attempt to
- predict a specific data value
- weight, height, age
- classify/categorize an item into a fixed set of known classes
- (yes/no, friend/foe, healthy/bankrupt, legitimate/illegitimate)
- The goal is specific
- Ex. Will this company go bankrupt?
- Is this individual a friend or a foe (terrorist)?
- Is this credit card transaction legitimate or fraudulent?
- Is someone trying to access a computer network an intruder or not?
12. Classification Task
- Deals with discrete outcomes: intruder/non-intruder, legitimate/fraudulent, friend/foe
- Learning a function that classifies a data item into one of several predefined classes
- set of rules
- mathematical equation
- set of weights
- The training set consists of pre-classified examples
- A newly presented object is assigned a class
- A network system administrator can use the classifier to decide whether a person accessing the network is an intruder or not
13. Clustering Task
- Unsupervised learning
- Segmenting a heterogeneous population into a number of more homogeneous clusters or groups
- No predefined classes to be used for training
- Records are grouped together based on self-similarity
- It is up to you what meaning, if any, to attach to the resulting clusters
- Often done as a prelude to some other form of DM (classification)
- Often based on computing the distances between data points
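A minimal distance-based clustering sketch of the task above, in the spirit of k-means: records are grouped purely by self-similarity, with no predefined classes. The toy 2-D points, the choice k = 2, and the naive initialization are all illustrative:

```python
# Naive k-means sketch: alternate assignment (nearest center) and
# update (cluster mean) steps over toy 2-D data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

def kmeans(points, k, iters=10):
    # naive deterministic init: spread the starting centers across the data
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda m: (p[0] - centers[m][0]) ** 2
                                + (p[1] - centers[m][1]) ** 2)
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        for j, c in enumerate(clusters):
            if c:
                centers[j] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

centers, clusters = kmeans(points, k=2)  # recovers the two tight groups
```

What meaning to attach to the two resulting groups is, as the slide notes, up to the analyst.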
14. Optimization Task
- Finding one or a series of optimal solutions from among a very large number of possible solutions
- Traditional mathematical techniques may break down because of billions of combinations
15. High-Dimensionality Data
- Data/dimensionality reduction
- # of features
- # of samples
- # of values for the features
- Gains from data reduction
- Improved predictive/descriptive accuracy
- Model is better understood
- Uses fewer rules, weights, variables
- Fewer features
- In the next round of data collection, irrelevant features can be discarded
16. Data Preparation
- Always done, regardless of the DM task and technique
- Depends on
- amounts of data
- DM task (classification, clustering/segmentation)
- types of values (numeric or categorical) for features/variables
- behavior of data with respect to time
- Normalization
- data values scaled to a specific range [0,1], or to z-scores
- Reasons
- features with larger values outweigh features with smaller values
- clustering techniques are based on computing the distance between data points
- neural networks learn better
- prevents saturation of neurons
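The two scalings named above can be sketched in a few lines; the sample values are illustrative:

```python
# Min-max scaling to [0, 1] and z-score standardization of one feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]   # rescaled into [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscores = [(v - mean) / std for v in values]      # mean 0, std 1
```

After either scaling, a feature measured in thousands no longer dominates the distance computations that clustering and neural network training rely on.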
17. Data Preparation
- Data smoothing/rounding
- Minor differences between the values of a feature are unimportant
- Binning
- placing values in different intervals by consulting their neighbors
- Transformation of features
- Reduces the # of features
18. Data Preparation
- Outlier detection
- Samples inconsistent with respect to the remaining data
- Not an easy subject
- Some applications focus on outlier detection; others do not
- Ex. detecting fraudulent credit card transactions
- 1 out of 10,000 transactions is fraudulent
- In many classes of DM applications, we remove outliers
- Be careful with the automatic removal of outliers
- Methods for outlier detection
- Visualization for 2-D, 3-D, or 4-D data
- Based on the mean and variance of a feature
- Distance-based
- multidimensional samples
- calculate the distance between all samples in an n-dimensional dataset
- outliers are those samples which do not have enough neighbors
19. Sampling
- Millions of cases? Often 20,000 or so is enough
- The sample should have the same probability distribution as the population
- Random sampling
- with replacement
- without replacement
- Stratified sampling
- the initial data set is split into non-overlapping subsets (strata)
- sampling is performed on each stratum independently of the others
- Incremental sampling
- Increasingly larger random subsets, to observe the trends in the tool's performance and to stop when no progress is made
- How many samples?
- No simple answer - "enough"
- The # depends on
- the algorithm
- the # of classes the algorithm predicts
- the # of variables in the data set
- the required reliability of the results
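Stratified sampling as described above can be sketched as follows. The class labels, the 99:1 imbalance, and the 10% sampling rate are illustrative (an imbalance like the intrusion-detection setting discussed later):

```python
import random

# Stratified sampling: split records by class label into non-overlapping
# strata, then sample each stratum independently at the same rate.
records = ([("normal", i) for i in range(9900)]
           + [("attack", i) for i in range(100)])

def stratified_sample(records, rate, seed=42):
    strata = {}
    for label, rec in records:
        strata.setdefault(label, []).append((label, rec))
    rng = random.Random(seed)
    sample = []
    for label, rows in strata.items():
        k = max(1, int(len(rows) * rate))  # keep at least one record per stratum
        sample.extend(rng.sample(rows, k))  # sampling without replacement
    return sample

sample = stratified_sample(records, rate=0.1)  # 990 normal + 10 attack records
```

Unlike plain random sampling, the rare "attack" stratum is guaranteed to be represented in proportion to the population.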
20. Feature Reduction
- Hundreds of features
- many irrelevant, correlated, redundant
- Feature selection is often a space-search problem
- Small # of features → can be searched exhaustively (all combinations)
- 20 features: 2^20 combinations > 1,000,000 combinations
21. Feature Reduction Methods
- Independent examination of features based on the mean and variance
- tests features separately, one feature at a time
- assumes the examined feature is normally distributed
- assumes a given feature is independent of the others
- examines one feature at a time without taking into account its relationship to other features
- Collective examination of features based on feature means and covariances
- tests all features together
- assumes features have normally distributed values
- impractical and computationally prohibitive
- yields a huge search space
22. Feature Reduction Methods
- Principal component analysis (PCA)
- Very popular, well-established, frequently used
- Complex in terms of calculations
- Components which contribute the least to the variation in the data set are eliminated
- Entropy measure
- Called unsupervised feature selection
- no output feature containing a class label
- Removing an irrelevant feature from a set may not change the information content of the data set
- Information content is measured by entropy
- Features on a numeric or categorical scale
- Numeric: normalized Euclidean distance
- Categorical: Hamming distance
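A minimal PCA sketch for the two-feature case: build the covariance matrix, take its eigenvalues, and measure how much variance the first component retains (and hence how little is lost by eliminating the second). The toy data are illustrative:

```python
import math

# PCA on two correlated features via the 2x2 covariance matrix.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
cyy = sum((b - my) ** 2 for b in y) / (n - 1)
cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# eigenvalues of the covariance matrix [[cxx, cxy], [cxy, cyy]]
tr, det = cxx + cyy, cxx * cyy - cxy * cxy
l1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2  # variance along PC1
l2 = (tr - math.sqrt(tr * tr - 4 * det)) / 2  # variance along PC2

explained = l1 / (l1 + l2)  # fraction of total variance kept by PC1 alone
```

Here `explained` exceeds 0.9, so dropping the second component discards less than 10% of the variation; with hundreds of features the same eigenvalue ranking decides which components to eliminate.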
23. Neural Networks
- Enable one to acquire, store, and utilize experiential knowledge
- Try to emulate biological neurological systems
- Try to mimic/approximate the way the human brain functions and processes information
- Used successfully for the following tasks
- Classification
- Clustering
- Optimization
- Implemented as mathematical models of the human brain
24. Neural Networks
- Characterized by three properties
- Computational property
- built of neurons
- summation node and activation function
- organized in layers
- interconnected using weights
- Architecture of the network
- Feed-forward NN with error back-propagation
- classification, prediction
- Kohonen network
- clustering (segmentation)
- Learning property
- supervised mode (with a teacher)
- unsupervised mode (without a teacher)
- Knowledge is encoded in the network's weights
26. Decision Trees
- Useful for classification tasks
- Learn from data, like neural networks
- Operation is based on algorithms that
- make the clusters at each node purer and purer by progressively reducing disorder (impurity) in the original data set
- impurity is measured by entropy
- find the optimum number of splits and determine where to partition the data to maximize the information gain
- Nodes, branches, and leaves indicate the variables, conditions, and outcomes, respectively
- The most predictive variable is placed at the top node of the tree
- The model is represented in the form of explicit and understandable rule-like relationships among variables
- Each rule represents a unique path from the root to a leaf
- Not as robust and good as neural networks at detecting complex nonlinear relationships between variables
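The impurity calculation at the heart of the split search above can be sketched directly: entropy of a node, and the information gain of one candidate split. The toy "normal"/"attack" labels and the particular split are illustrative:

```python
import math

# Entropy of a node's class labels, and the information gain of a split.
def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

parent = ["normal"] * 6 + ["attack"] * 6   # 50/50 mix: maximally impure node
left   = ["normal"] * 5 + ["attack"] * 1   # the split yields purer children
right  = ["normal"] * 1 + ["attack"] * 5

n = len(parent)
gain = (entropy(parent)
        - (len(left) / n) * entropy(left)
        - (len(right) / n) * entropy(right))
```

The tree-growing algorithm evaluates `gain` for every candidate split and keeps the one with the largest value, i.e. the split that reduces disorder the most.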
28. Fuzzy Logic
- Enables one to build fuzzy systems
- Knowledge is encoded in fuzzy sets and fuzzy rules
- Fuzzy rules enable one to reason about or describe a process in terms of approximations
- Fuzzy sets: sets without clearly defined boundaries
- Can produce very accurate results
- Fast response time
- Knowledge about the fuzzy rules and fuzzy sets is
- elicited from domain experts
- generated from the given data
- neuro-fuzzy systems
29. Genetic Algorithms
- Solve problems (mainly optimization) by borrowing a technique from nature
- Use three of Darwin's basic principles
- Survival of the fittest (reproduction)
- Cross-breeding (crossovers)
- Mutation
- to create approximate solutions to problems
- choosing the fitness function, selection scheme, and genome encoding is often difficult
- Example
- You work for a shipping firm and have to make shipments to 6 different towns. You have one vehicle, and your task is to minimize the distance traveled. The vehicle can visit each city only once and can start from any city.
30. Traveling Salesman Problem (TSP)
- N = 6 cities. Number of unique paths = (N-1)!/2 = (6-1)!/2 = 5!/2 = (1·2·3·4·5)/2 = 60
- N = 25 cities. Number of unique paths = (N-1)!/2 = (25-1)!/2 = (1·2·3·4·5·...·22·23·24)/2 ≈ 3.1×10^23 paths (a very large number). It would take the fastest computer millions of years to calculate all possible solutions (paths). Computationally intractable.
[Diagram: a tour through Chicago, New York, Miami, Mexico City, Los Angeles, and San Francisco with inter-city distances]
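The path counts on this slide follow from the (N-1)!/2 formula and can be checked in two lines:

```python
from math import factorial

# A symmetric TSP on N cities has (N-1)!/2 distinct tours:
# fix one city as the start and divide by 2 for direction.
def unique_tours(n):
    return factorial(n - 1) // 2

n6 = unique_tours(6)    # 60 tours for 6 cities
n25 = unique_tours(25)  # about 3.1e23 tours for 25 cities - intractable to enumerate
```

This is why exhaustive search breaks down and a genetic algorithm, which only samples the tour space, becomes attractive.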
31. Rare Events
- We are drowning in the massive amounts of data being collected, while starving for knowledge at the same time
- Despite the enormous amount of data, particular events of interest are still quite rare
- Rare events are events that occur very infrequently, i.e., their frequency ranges from 0.01% to 10%
- However, when they occur, their consequences can be quite dramatic, often in a negative sense
32. Applications of Rare Cases
- Network intrusion detection
- The number of intrusions on the network is typically a very small fraction of the total network traffic
- Credit card fraud detection
- Millions of legitimate transactions are stored, while only a very small percentage is fraudulent
- Medical diagnostics
- When classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image
33. Applications of Rare Cases
- Web mining
- < 3% of all people visiting Amazon.com make a purchase
- Identifying passengers at airports (through biometrics) and screening their luggage
- Only an extremely small number of passengers is suspected of hostile activities; the same applies to passengers' luggage that may contain explosives
- Fraud detection
- auto insurance: detecting people who stage accidents to collect on insurance
- Profiling individuals
- finding clusters of "model" terrorists who share similar characteristics
- Money laundering, financial fraud, churn analysis
34. Key Technical Challenges for Detecting Rare Events
- Large data size
- High dimensionality
- Temporal nature of the data
- Skewed class distribution
- Rare events are underrepresented in the data set (minority class)
- Data preprocessing
- On-line analysis
35. Limitations of Standard Data Mining Schemes
- Many classic data mining issues and methods apply in the domain of rare cases
- Limitations
- Standard approaches for feature selection and construction, computing distances between samples, and sampling do not work well for rare-case analysis
- While most normal events are similar to each other, rare events are quite different from one another
- Regular network traffic is fairly standard, while suspicious traffic varies from the standard in many different ways
- Metrics used to evaluate normal event detection methods
- Overall classification accuracy is not appropriate for evaluating methods for rare event detection
- In many applications data keeps arriving in real time, and there is a need to detect rare events on the fly, with models built only on the events seen so far
36. Computer Security
- Broad and extremely important field
- Generally encompasses two aspects
- How computers can be used to secure the information contained within organizations
- Detection and/or prevention of unauthorized access or attacks on computers, networks, operating systems, data, and applications local to an organization
- How computers can be used to detect hostile activity in a sensitive geographical area (such as an airport)
- Involves computer vision technology
- Identifying patterns of activities that can suggest a friend or foe
37. Computer Security
- The ability of a computer system to protect information and system resources with respect to
- Confidentiality: prevention of unauthorized disclosure of information
- Integrity: prevention of unauthorized modification of information
- Availability: prevention of unauthorized withholding of information
- Intrusion: a cyber attack that tries to bypass security mechanisms
- Outsider attack on the system from the Internet
- Hackers, spies, kiddies
- Stealing, spying, probing (to collect information about the host)
- DoS attacks, viruses, worms
- Insider (employee) attempt to gain and misuse non-authorized privileges
38. Taxonomy of Computer Attacks
- Intrusions can be classified according to several categories
- Attack type
- DoS, worms/trojan horses
- Number of network connections involved in the attack
- single-connection cyber attacks
- multiple-connection cyber attacks
- Source of the attack
- multiple-location (coordinated/distributed) vs. single-location attacks
- inside vs. outside
- Target of the attack
- single or many different destinations
- Environment (network, host, P2P (peer-to-peer), wireless networks, ...)
- less secure physical layer
- no traffic concentration points for monitoring packets
- Automation (manual, automated, semi-automated attack)
- Need to analyze network data from several sites to detect these attacks
39. Prevention: Existing Security Mechanisms
- Security protocols and policies
- IPSec: security at the IP layer
- source authentication
- encryption
- Secure Socket Layer (SSL)
- source authentication
- encryption
- Host-based protections
- Regularly installing patches, defending accounts, integrity checks
- Firewalls
- Control the flow of traffic between networks
- Block traffic from the Internet and to the Internet
- Monitor communication between networks and examine each packet to see if it should be let through
- All the above mechanisms are insufficient due to
- Security holes, insider attacks, multiple levels of data confidentiality within an organization
- Sophistication of cyber attacks, their severity, and increased intruders' knowledge
- Data mining can help
- It is not a cure for all problems
40. Motivation - Data Mining for Intrusion Detection
- Increased interest in data-mining-based intrusion detection
- Attacks for which it is difficult to build signatures
- Attack stealthiness
- Unforeseen/unknown/emerging attacks
- Distributed/coordinated attacks
- Data mining approaches for intrusion detection
- Misuse detection
- supervised learning
- Anomaly detection
- unsupervised learning
- Summarization of attacks using association rules
41. Motivation - Data Mining for Intrusion Detection
- Data mining approaches for intrusion detection
- Misuse detection
- Supervised learning
- Based on extensive knowledge of patterns associated with known attacks, provided by human experts
- Building predictive models from labeled data sets (instances are labeled as "normal" or "intrusive") to identify known intrusions
- Major advantages
- High accuracy in detecting many kinds of known attacks
- Produces models that can be easily understood
- Major limitations
- Cannot detect unknown and emerging attacks
- The data has to be labeled
- The signature database has to be manually revised for each new type of discovered attack
- Major approaches: pattern (signature) matching, expert systems, neural networks, decision trees, logistic regression, memory-based reasoning
- SNORT system
42. Motivation - Data Mining for Intrusion Detection
- Data mining approaches for intrusion detection
- Anomaly detection
- Unsupervised learning
- Based on profiles that represent the normal behavior of users, hosts, or networks, and detecting attacks as significant deviations from this profile
- Major benefit: potentially able to recognize unforeseen attacks
- Major limitation: possible high false alarm rate, since detected deviations do not necessarily represent actual attacks
- Major approaches: statistical methods, expert systems, clustering, neural networks, outlier detection schemes, deviation/anomaly detection
- Analyze each event to determine how similar (or dissimilar) it is to the majority
- Success depends on the choice of similarity measures and dimension weighting
- Summarization of attacks using association rules
43. IDS Information Sources
- Host-based IDS
- base decisions on information obtained from a single host (e.g., system log data, system call data)
- Network-based IDS
- make decisions according to information and data obtained by monitoring the traffic in the network to which the hosts are connected
- Wireless network IDS
- detect intrusions by analyzing traffic between mobile nodes
- Application logs
- detect intrusions by analyzing, for example, database logs (database misuse) or web logs
- IDS sensor alerts
- analysis of low-level sensor alarms
- analysis of alarms generated by other IDSs
44. Data Sources in Network Intrusion Detection
- Network traffic data is usually collected using network sniffers
- Tcpdump
- 08:02:15.471817 0:10:7b:38:46:33 0:10:7b:38:46:33 loopback 60
- 0000 0100 0000 0000 0000 0000 0000 0000
- 0000 0000 0000 0000 0000 0000 0000 0000
- 0000 0000 0000 0000 0000 0000 0000
- 08:02:19.391039 172.16.112.100.3055 > 172.16.112.10.ntp v1 client strat 0 poll 0 prec 0
- 08:02:19.391456 172.16.112.10.ntp > 172.16.112.100.3055 v1 server strat 5 poll 4 prec -16 (DF)
- Net-flow tools
- Source and destination IP address, source and destination ports, type of service, packet and byte counts, start and end time, input and output interface numbers, TCP flags, routing information (next-hop address, source autonomous system (AS) number, destination AS number)
- 0624.12439.344 0624.12448.292 211.59.18.101 4350 160.94.179.138 1433 6 2 3 144
- 0624.9110.667 0624.9119.635 24.201.13.122 3535 160.94.179.151 1433 6 2 3 132
- 0624.12440.572 0624.12449.496 211.59.18.101 4362 160.94.179.150 1433 6 2 3 152
- Collected data are in the form of network connections or network packets (a network connection may contain several packets)
45. Projects: Data Mining in Intrusion Detection
- MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) - Columbia University, Georgia Tech, Florida Tech
- ADAM (Audit Data Analysis and Mining) - George Mason University
- MINDS (University of Minnesota)
- Intelligent Intrusion Detection (IIDS) - Mississippi State University
- Data Mining for Network Intrusion Detection (MITRE Corporation)
- Institute for Security Technology Studies (ISTS), Dartmouth College
- Intrusion Detection Techniques (Arizona State University)
- Agent-based data mining system (Iowa State University)
- IDDM (Intrusion Detection using Data Mining Techniques) - Department of Defense, Australia
46. Data Preprocessing for Data Mining in ID
- Converting the data from the monitored system (computer network, host machine, ...) into data (features) that will be used in data mining models
- For misuse detection, labeling data examples as normal or intrusive may require enormous time from many human experts
- Building data mining models
- Misuse detection models
- Anomaly detection models
- Analysis and summarization of results
49. Misuse Detection - Evaluation of Rare Class Problems: F-value
- Accuracy is not a sufficient metric for evaluation
- Accuracy = (TN + TP)/(TN + FP + FN + TP)
- Ex. a network traffic data set with 99.99% normal data and 0.01% intrusions
- A trivial classifier that labels everything with the normal class can achieve 99.99% accuracy!!!
- Focus on both recall and precision
- Recall (R) = TP/(TP + FN)
- Precision (P) = TP/(TP + FP)
- F-measure = 2RP/(R + P)
- C: rare class (attacks)
- NC: normal class (normal connections)
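The formulas above, computed on a toy confusion matrix for the rare class C (attacks); the counts are illustrative:

```python
# 50 attacks hidden among 10,000 connections: the classifier finds 40 of
# them (TP), misses 10 (FN), and raises 60 false alarms (FP).
TP, FN, FP, TN = 40, 10, 60, 9890

accuracy  = (TN + TP) / (TN + FP + FN + TP)   # 0.993: looks excellent
recall    = TP / (TP + FN)                    # 0.8: detection rate on attacks
precision = TP / (TP + FP)                    # 0.4: most alarms are false
f_measure = 2 * recall * precision / (recall + precision)
```

Accuracy is above 99% while the F-measure is only about 0.53, which is exactly the slide's point: on skewed classes, accuracy hides poor rare-class performance.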
50. Misuse Detection - Evaluation of Rare Class Problems: ROC
- C: rare class (attacks)
- NC: normal class (normal connections)
- Standard measures for evaluating rare class problems
- Detection rate (recall): the ratio between the number of correctly detected rare events (attacks) and the total number of rare events (attacks)
- False alarm (false positive) rate: the ratio between the number of data records from the majority class (normal connections) that are misclassified as rare events (attacks) and the total number of data records from the majority class (normal connections)
- ROC curve: the trade-off between detection rate and false alarm rate
51. Misuse Detection - Evaluation of Rare Class Problems: ROC
- The area under the ROC curve (AUC) is computed using a form of the trapezoid rule
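The trapezoid-rule AUC computation can be sketched in a few lines; the (false alarm rate, detection rate) points below are an illustrative ROC curve, not real classifier output:

```python
# AUC via the trapezoid rule: sum the area of the trapezoid under each
# consecutive pair of ROC points (x = false alarm rate, y = detection rate).
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(roc, roc[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2.0   # width * average height
```

A random classifier traces the diagonal and gets AUC 0.5; this curve scores 0.80, and a perfect detector would score 1.0.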
52. Misuse Detection - Manipulating Data Records
- Over-sampling the rare class
- Make duplicates of the rare events until the data set contains as many examples as the majority class ⇒ balance the classes
- Does not increase information, but increases the misclassification cost
- SMOTE (Synthetic Minority Over-sampling TEchnique)
- Synthetically generates minority class examples
- When generating an artificial minority class example, distinguish two types of features
- Continuous features
- Nominal (categorical) features
- Down-sizing (under-sampling) the majority class
- Sample the data records from the majority class
- Randomly
- "Near miss" examples
- Examples far from minority class examples (far from decision boundaries)
- Introduce the sampled data records into the original data set instead of the original data records from the majority class
- Usually results in a general loss of information and potentially overly general rules
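For continuous features, the SMOTE idea above can be sketched as interpolation: a synthetic minority example is placed on the line segment between a minority point and one of its minority-class neighbors. The toy data and the simplified neighbor choice (any same-class point, rather than one of the k nearest) are illustrative:

```python
import random

# SMOTE-style synthetic generation for continuous features: unlike plain
# duplication, each new example is a fresh point between two real ones.
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]

def smote_one(points, rng):
    p = rng.choice(points)
    q = rng.choice([r for r in points if r is not p])  # a same-class neighbor
    t = rng.random()                                   # interpolation factor in [0, 1]
    return tuple(a + t * (b - a) for a, b in zip(p, q))

rng = random.Random(7)
synthetic = [smote_one(minority, rng) for _ in range(5)]
```

Every synthetic point lies inside the region spanned by the minority class, which widens the minority decision region instead of merely re-weighting existing points.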
53. Unsupervised Techniques: Anomaly Detection
- Build models of normal behavior and detect anomalies as deviations from it
- Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognized as anomalies
- Two types of techniques
- with access to normal data
- with NO access to normal data (what is normal is not known)
54. Outlier Detection Schemes
- An outlier is defined as a data point which is very different from the rest of the data, based on some measure
- Detect novel attacks/intrusions by identifying them as deviations from normal behavior
- Identify normal behavior
- Construct a useful set of features
- Define a similarity function
- Use an outlier detection algorithm
- Statistics-based approaches
- Distance-based approaches
- Nearest neighbor approaches
- Clustering-based approaches
- Density-based schemes
55. Distance-based Outlier Detection Scheme
- k-Nearest Neighbor approach
- For each data point d, compute the distance to its k-th nearest neighbor, dk
- Sort all data points according to the distance dk
- Outliers are points that have the largest distance dk and are therefore located in the sparsest neighborhoods
- Usually the data points with the top n distances dk are identified as outliers
- n is a user parameter
- Not suitable for datasets that have modes with varying density
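The scheme above in a few lines: score every point by the distance to its k-th nearest neighbor and flag the top n scores. The toy 2-D points and the choices k = 2, n = 1 are illustrative:

```python
# k-NN distance outlier detection: a point in a sparse neighborhood has a
# large distance to its k-th nearest neighbor.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (9.0, 9.0)]

def knn_distance(p, points, k):
    dists = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in points if q is not p)
    return dists[k - 1]          # distance to the k-th nearest neighbor

k, n = 2, 1
ranked = sorted(points, key=lambda p: knn_distance(p, points, k), reverse=True)
outliers = ranked[:n]            # the n points in the sparsest neighborhoods
```

The isolated point (9.0, 9.0) has a k-NN distance of roughly 11 while the clustered points score around 0.2, so it is the one flagged.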
56. Distance-based Outlier Detection Scheme
- Mahalanobis-distance based approach
- The Mahalanobis distance is more appropriate for computing distances with skewed distributions
- The measure takes the shape of the data distribution into account
- Example
- In Euclidean space, data point p1 is closer to the origin than data point p2
- When computing the Mahalanobis distance, data points p1 and p2 are equally distant from the origin
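A complementary two-feature sketch of the same shape-awareness: two points at the *same* Euclidean distance from the mean get very *different* Mahalanobis distances, because the data are stretched along one direction. The toy data and test points are illustrative:

```python
import math

# Mahalanobis distance for 2 features: deviations are divided by the
# data's own variance/correlation, so direction matters.
data = [(-4.0, -1.0), (-2.0, 0.0), (0.0, 0.5), (2.0, 0.0), (4.0, 0.5)]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
cxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
cyy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
cxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)

def mahalanobis(p):
    dx, dy = p[0] - mx, p[1] - my
    det = cxx * cyy - cxy * cxy
    # d^2 = [dx dy] * (inverse covariance) * [dx dy]^T, 2x2 closed form
    return math.sqrt((cyy * dx * dx - 2 * cxy * dx * dy + cxx * dy * dy) / det)

# same Euclidean distance (3.0) from the mean, different Mahalanobis distance
d_along  = mahalanobis((mx + 3.0, my))   # along the stretched x-direction
d_across = mahalanobis((mx, my + 3.0))   # across it: far more anomalous
```

A deviation across the stretched direction is far rarer under this distribution, and the Mahalanobis distance reflects that where the Euclidean distance cannot.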
57. Model-based Outlier Detection Schemes
- Use a prediction model to learn the normal behavior
- Every deviation from the learned prediction model can be treated as an anomaly or potential intrusion
- Recent approaches
- Neural networks
- Unsupervised Support Vector Machines (SVMs)
58. Neural Networks for Outlier Detection
- Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes
- The input variables are also the output variables, so that the RNN forms a compressed model of the data during training
- A measure of outlyingness is the reconstruction error of individual data points
59. Conclusions
- Data mining analysis of rare events requires special attention
- Many real-world applications exhibit a needle-in-the-haystack type of problem
- Current state-of-the-art data mining techniques are still insufficient for efficiently handling rare events
- There is a need to design better and more accurate data mining models
60. References
- Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
- Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, IEEE Press/Wiley, 2003
- Lazarevic, A., Srivastava, J., and Kumar, V., Data Mining for Computer Security Applications, IEEE ICDM 2003 Tutorial
- Kantardzic, M., and Zurada, J. (Eds.), Next Generation of Data Mining Applications, IEEE Press/Wiley, 2005
- Tan, P., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison Wesley, 2005
- Zurada, J., Knowledge Discovery and Data Mining, Lecture Notes on Blackboard, Spring 2005
61. Thank you for attending this lecture!