KDD Cup - PowerPoint PPT Presentation

About This Presentation

Slides: 14

Provided by: clif8

Learn more at: https://www.cs.purdue.edu

Transcript and Presenter's Notes

1
KDD Cup '99 Classifier Learning: Predictive Model for Intrusion Detection

Charles Elkan
1999 Conference on Knowledge Discovery and Data Mining
Presented by Chris Clifton

2
KDD Cup Overview

Held annually in conjunction with the Knowledge Discovery and Data Mining Conference (now ACM-sponsored)
Challenge problem(s) released well before the conference
Goal is to give best solution to problem
Relatively informal contest
Gives standard test for comparing techniques
Winner announced at KDD conference
Lots of recognition to winner

3
Classifier Learning for Intrusion Detection

One of two KDD99 challenge problems
Other was a knowledge discovery problem
Goal is to learn a classifier that labels TCP/IP connections as intrusion/okay
Data: Collection of features describing a TCP connection
Class: Non-attack or type of attack
Scoring: Cost per test sample
Wrong answers penalized based on the type of error

4
Data: TCP connection information

Dataset developed for the 1998 DARPA Intrusion Detection Evaluation Program
Nine weeks of raw TCP dump data from a simulated USAF LAN
Simulated attacks to give positive examples
Processed into 5 million training connections, 2 million test
Some attributes derived from raw data
Twenty-four attack types in training data, four classes:
DOS: denial-of-service, e.g. syn flood
R2L: unauthorized access from a remote machine, e.g. guessing password
U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks
probing: surveillance and other probing, e.g., port scanning
Test set includes fourteen attack types not found in training set

5
Basic features of individual TCP connections

feature name    description                                                   type
duration        length (number of seconds) of the connection                  continuous
protocol_type   type of the protocol, e.g. tcp, udp, etc.                     discrete
service         network service on the destination, e.g., http, telnet, etc.  discrete
src_bytes       number of data bytes from source to destination               continuous
dst_bytes       number of data bytes from destination to source               continuous
flag            normal or error status of the connection                      discrete
land            1 if connection is from/to the same host/port; 0 otherwise    discrete
wrong_fragment  number of "wrong" fragments                                   continuous
urgent          number of urgent packets                                      continuous
6
Content features within a connection suggested by domain knowledge

feature name        description                                            type
hot                 number of "hot" indicators                             continuous
num_failed_logins   number of failed login attempts                        continuous
logged_in           1 if successfully logged in; 0 otherwise               discrete
num_compromised     number of "compromised" conditions                     continuous
root_shell          1 if root shell is obtained; 0 otherwise               discrete
su_attempted        1 if "su root" command attempted; 0 otherwise          discrete
num_root            number of "root" accesses                              continuous
num_file_creations  number of file creation operations                     continuous
num_shells          number of shell prompts                                continuous
num_access_files    number of operations on access control files           continuous
num_outbound_cmds   number of outbound commands in an ftp session          continuous
is_hot_login        1 if the login belongs to the "hot" list; 0 otherwise  discrete
is_guest_login      1 if the login is a "guest" login; 0 otherwise         discrete
7
Traffic features computed using a two-second time window

feature name        description                                                                                  type
count               number of connections to the same host as the current connection in the past two seconds     continuous
Note: The following features refer to these same-host connections.
serror_rate         % of connections that have "SYN" errors                                                      continuous
rerror_rate         % of connections that have "REJ" errors                                                      continuous
same_srv_rate       % of connections to the same service                                                         continuous
diff_srv_rate       % of connections to different services                                                       continuous
srv_count           number of connections to the same service as the current connection in the past two seconds  continuous
Note: The following features refer to these same-service connections.
srv_serror_rate     % of connections that have "SYN" errors                                                      continuous
srv_rerror_rate     % of connections that have "REJ" errors                                                      continuous
srv_diff_host_rate  % of connections to different hosts                                                          continuous
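
A minimal sketch of how the same-host features above might be derived from a connection log. The `Conn` record, its field names, the set of flags treated as SYN errors, and the choice to exclude the current connection from the window are all assumptions made for illustration, not details taken from the slides. Rates are returned as fractions in [0, 1] rather than percentages.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical connection record; field names are illustrative."""
    time: float      # start time in seconds
    dst_host: str    # destination host
    service: str     # e.g. "http", "telnet"
    flag: str        # connection status, e.g. "SF" (normal) or "S0"


# Assumption: these status flags count as SYN errors.
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}


def same_host_features(history, current, window=2.0):
    """Compute count, serror_rate, same_srv_rate and diff_srv_rate.

    `history` holds earlier connections; only those to the same
    destination host within the past `window` seconds are used.
    """
    recent = [
        c for c in history
        if 0.0 <= current.time - c.time <= window
        and c.dst_host == current.dst_host
    ]
    count = len(recent)
    if count == 0:
        return {"count": 0, "serror_rate": 0.0,
                "same_srv_rate": 0.0, "diff_srv_rate": 0.0}
    serror = sum(c.flag in SYN_ERROR_FLAGS for c in recent) / count
    same_srv = sum(c.service == current.service for c in recent) / count
    return {"count": count, "serror_rate": serror,
            "same_srv_rate": same_srv, "diff_srv_rate": 1.0 - same_srv}
```

The same-service features (srv_count and the srv_* rates) would follow the same pattern with the filter on `service` instead of `dst_host`.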
8
Scoring

Each prediction gets a score
Row is correct answer
Column is prediction made
Score is average over all predictions

        normal  probe  DOS  U2R  R2L
normal    0       1     2    2    2
probe     1       0     2    2    2
DOS       2       1     0    2    2
U2R       3       2     2    0    2
R2L       4       2     2    2    0
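
The scoring rule can be sketched as follows: look up the cost of each (correct answer, prediction) pair in the matrix and average over all test samples. The cost values are copied from the slide; the function and variable names are illustrative.

```python
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]

# COST[actual][predicted], rows and columns in the order of CLASSES.
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]


def average_cost(actual, predicted):
    """Return the mean misclassification cost over all predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    index = {name: i for i, name in enumerate(CLASSES)}
    total = sum(COST[index[a]][index[p]] for a, p in zip(actual, predicted))
    return total / len(actual)
```

For example, mistaking an R2L attack for normal traffic costs 4, while mistaking normal traffic for a probe costs only 1, so a classifier is rewarded for erring on the side of flagging rare attacks.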
9
Results

Twenty-four entries, scores:
0.2331 0.2356 0.2367 0.2411 0.2414 0.2443 0.2474 0.2479
0.2523 0.2530 0.2531 0.2545 0.2552 0.2575 0.2588 0.2644
0.2684 0.2952 0.3344 0.3767 0.3854 0.3899 0.5053 0.9414
1-Nearest Neighbor scored 0.2523

10
Winning Method: Bagged Boosting

Submitted by Bernhard Pfahringer, ML Group, Austrian Research Institute for AI
50 samples drawn from the original set of roughly 5 million examples
Contrary to standard bagging, the sampling was slightly biased:
all of the examples of the two smallest classes, U2R and R2L
4000 PROBE, 80000 NORMAL, and 400000 DOS examples
duplicate entries in the original data set removed
Ten C5 decision trees induced from each sample, using both C5's error-cost and boosting options
Final predictions computed from the 50 predictions (one from each training sample) by minimizing conditional risk
minimizes sum of error-costs times class-probabilities
Took approximately 1 day on a 2-processor 200MHz Sparc to train
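
A minimal sketch of the conditional-risk step: given class probabilities for one connection, choose the prediction whose expected cost (sum of error-costs times class-probabilities) is smallest. The cost matrix is the contest's scoring matrix. Averaging the probability vectors of the ensemble members before computing the risk is an assumption here; the slides do not say how the 50 predictions were combined.

```python
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]

# COST[actual][predicted], taken from the contest scoring matrix.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]


def min_risk_class(prob_vectors):
    """Pick the class with the lowest expected cost.

    `prob_vectors` is a list of per-model probability vectors, each
    ordered like CLASSES. Averaging them is an illustrative assumption.
    """
    n_models = len(prob_vectors)
    n_classes = len(CLASSES)
    avg = [sum(p[i] for p in prob_vectors) / n_models
           for i in range(n_classes)]
    # Risk of predicting class j: sum over true classes i of P(i) * COST[i][j].
    risks = [sum(avg[i] * COST[i][j] for i in range(n_classes))
             for j in range(n_classes)]
    best = min(range(n_classes), key=risks.__getitem__)
    return CLASSES[best], risks


# Two hypothetical models agree: 60% normal, 40% R2L.
label, risks = min_risk_class([[0.6, 0.0, 0.0, 0.0, 0.4],
                               [0.6, 0.0, 0.0, 0.0, 0.4]])
print(label)  # R2L: the costlier miss outweighs the higher probability
```

The example shows why this differs from picking the most probable class: predicting normal would risk a cost of 4 on the 40% chance of R2L, so the lower-risk answer is R2L.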

11
Confusion Matrix (Breakdown of score)
12
Analysis of winning entry

Result comparable to 1-NN except on rare classes
Training sample of winner biased to rare classes
Does this give us a general principle?
Misses badly for some attack categories
True for 1-NN as well
Problem with feature set?

13
Second and Third places (Probably not statistically significant)

Itzhak Levin, LLSoft, Inc.: Kernel Miner
Link broken?
Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin, MP13, Moscow, Russia
Verbal rules constructed by an expert
First echelon of voting decision trees
Second echelon of voting decision trees
Steps applied sequentially: branch to the next step occurs whenever the current one has failed to recognize the connection
Trees constructed using their own (previously developed) tree learning algorithm
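
The sequential decision procedure described above can be sketched as a chain of stages, each of which either returns a label or declines so that the next stage is tried. This shows only the control flow; the stage functions, their rules, and the `fallback` value are hypothetical and are not the entrants' actual rules or trees.

```python
from typing import Callable, Optional, Sequence

# A stage returns a class label, or None if it fails to recognize the record.
Stage = Callable[[dict], Optional[str]]


def classify_in_sequence(record: dict,
                         stages: Sequence[Stage],
                         fallback: Optional[str] = None) -> Optional[str]:
    """Try each stage in order and return the first label produced.

    `fallback` is what to return if every stage declines; the slides
    do not say what happens in that case, so it defaults to None.
    """
    for stage in stages:
        label = stage(record)
        if label is not None:
            return label
    return fallback


# Hypothetical toy stages standing in for expert rules and tree echelons.
def expert_rules(record: dict) -> Optional[str]:
    return "R2L" if record.get("num_failed_logins", 0) >= 3 else None


def first_echelon(record: dict) -> Optional[str]:
    return "DOS" if record.get("serror_rate", 0.0) > 0.9 else None


stages = [expert_rules, first_echelon]
```

A record that the expert rules do not recognize falls through to the first echelon, and so on down the list.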