Title: KDD Cup
1. KDD Cup 99 Classifier Learning: Predictive Model for Intrusion Detection
- Charles Elkan
- 1999 Conference on Knowledge Discovery and Data Mining
- Presented by Chris Clifton
2. KDD Cup Overview
- Held annually in conjunction with the Knowledge Discovery and Data Mining Conference (now ACM-sponsored)
- Challenge problem(s) released well before the conference
- Goal is to give the best solution to the problem
- Relatively informal contest
- Gives a standard test for comparing techniques
- Winner announced at the KDD conference
- Lots of recognition for the winner
3. Classifier Learning for Intrusion Detection
- One of two KDD99 challenge problems
  - Other was a knowledge discovery problem
- Goal is to learn a classifier that labels TCP/IP connections as intrusion/okay
  - Data: collection of features describing a TCP connection
  - Class: non-attack or type of attack
- Scoring: cost per test sample
  - Wrong answers penalized based on the type of wrong answer
4. Data: TCP connection information
- Dataset developed for the 1998 DARPA Intrusion Detection Evaluation Program
  - Nine weeks of raw TCP dump data from a simulated USAF LAN
  - Simulated attacks to give positive examples
- Processed into 5 million training connections, 2 million test
  - Some attributes derived from raw data
- Twenty-four attack types in training data, four classes
  - DOS: denial-of-service, e.g. syn flood
  - R2L: unauthorized access from a remote machine, e.g. guessing password
  - U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks
  - probing: surveillance and other probing, e.g., port scanning
- Test set includes fourteen attack types not found in training set
5. Basic features of individual TCP connections
| feature name | description | type |
|---|---|---|
| duration | length (number of seconds) of the connection | continuous |
| protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete |
| service | network service on the destination, e.g., http, telnet, etc. | discrete |
| src_bytes | number of data bytes from source to destination | continuous |
| dst_bytes | number of data bytes from destination to source | continuous |
| flag | normal or error status of the connection | discrete |
| land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
| wrong_fragment | number of "wrong" fragments | continuous |
| urgent | number of urgent packets | continuous |
6. Content features within a connection suggested by domain knowledge
| feature name | description | type |
|---|---|---|
| hot | number of "hot" indicators | continuous |
| num_failed_logins | number of failed login attempts | continuous |
| logged_in | 1 if successfully logged in; 0 otherwise | discrete |
| num_compromised | number of "compromised" conditions | continuous |
| root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
| su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete |
| num_root | number of "root" accesses | continuous |
| num_file_creations | number of file creation operations | continuous |
| num_shells | number of shell prompts | continuous |
| num_access_files | number of operations on access control files | continuous |
| num_outbound_cmds | number of outbound commands in an ftp session | continuous |
| is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
| is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete |
7. Traffic features computed using a two-second time window
| feature name | description | type |
|---|---|---|
| count | number of connections to the same host as the current connection in the past two seconds | continuous |
| Note | The following features refer to these same-host connections. | |
| serror_rate | % of connections that have "SYN" errors | continuous |
| rerror_rate | % of connections that have "REJ" errors | continuous |
| same_srv_rate | % of connections to the same service | continuous |
| diff_srv_rate | % of connections to different services | continuous |
| srv_count | number of connections to the same service as the current connection in the past two seconds | continuous |
| Note | The following features refer to these same-service connections. | |
| srv_serror_rate | % of connections that have "SYN" errors | continuous |
| srv_rerror_rate | % of connections that have "REJ" errors | continuous |
| srv_diff_host_rate | % of connections to different hosts | continuous |
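A minimal sketch of how window features of this kind might be derived from a time-ordered list of connections. The slides do not give the derivation, so several things here are assumptions for illustration: the `Conn` record layout, the boolean `syn_error` field, the decision to count the current connection inside its own window, and returning rates as fractions between 0 and 1.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 2.0


@dataclass
class Conn:
    time: float       # start time in seconds (hypothetical field)
    dst_host: str     # destination host
    service: str      # destination service, e.g. "http"
    syn_error: bool   # True if the connection had a "SYN" error (hypothetical field)


def traffic_features(conns, i):
    """Window features for conns[i]; conns must be sorted by time."""
    cur = conns[i]
    # Connections in the two seconds up to and including the current one
    # (including the current connection is an assumption of this sketch).
    window = [c for c in conns[: i + 1] if cur.time - c.time <= WINDOW_SECONDS]
    same_host = [c for c in window if c.dst_host == cur.dst_host]
    same_srv = [c for c in window if c.service == cur.service]
    count = len(same_host)      # at least 1, because the window contains cur
    srv_count = len(same_srv)   # at least 1, for the same reason
    return {
        "count": count,
        "serror_rate": sum(c.syn_error for c in same_host) / count,
        "same_srv_rate": sum(c.service == cur.service for c in same_host) / count,
        "diff_srv_rate": sum(c.service != cur.service for c in same_host) / count,
        "srv_count": srv_count,
        "srv_diff_host_rate": sum(c.dst_host != cur.dst_host for c in same_srv) / srv_count,
    }


# Toy trace (made-up values, not taken from the contest data).
conns = [
    Conn(0.0, "10.0.0.5", "http", False),    # too old for the last connection's window
    Conn(1.5, "10.0.0.5", "http", True),
    Conn(2.0, "10.0.0.5", "telnet", True),
    Conn(2.5, "10.0.0.9", "http", False),
    Conn(3.0, "10.0.0.5", "http", False),
]
features = traffic_features(conns, 4)
```

For the last connection in the toy trace, three same-host and three same-service connections fall inside the window, so `count` and `srv_count` are both 3.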
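8. Scoring
- Each prediction gets a score
- Row is correct answer
- Column is prediction made
- Score is average over all predictions
| correct / predicted | normal | probe | DOS | U2R | R2L |
|---|---|---|---|---|---|
| normal | 0 | 1 | 2 | 2 | 2 |
| probe | 1 | 0 | 2 | 2 | 2 |
| DOS | 2 | 1 | 0 | 2 | 2 |
| U2R | 3 | 2 | 2 | 0 | 2 |
| R2L | 4 | 2 | 2 | 2 | 0 |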
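The scoring rule can be sketched in a few lines of Python. The cost values are the ones in the matrix on this slide; the names `CLASSES`, `COST`, and `average_cost`, and the toy label lists, are assumptions made for illustration and are not part of the contest materials.

```python
# Cost matrix from the slide: row = correct answer, column = prediction made.
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
COST = [
    [0, 1, 2, 2, 2],  # correct answer: normal
    [1, 0, 2, 2, 2],  # correct answer: probe
    [2, 1, 0, 2, 2],  # correct answer: DOS
    [3, 2, 2, 0, 2],  # correct answer: U2R
    [4, 2, 2, 2, 0],  # correct answer: R2L
]
INDEX = {name: i for i, name in enumerate(CLASSES)}


def average_cost(actual, predicted):
    """Return the mean cost over all test samples (lower is better)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one prediction to score")
    total = sum(COST[INDEX[a]][INDEX[p]] for a, p in zip(actual, predicted))
    return total / len(actual)


# Toy example (made-up labels, not contest data).
actual = ["normal", "DOS", "R2L", "U2R"]
predicted = ["normal", "DOS", "normal", "probe"]
score = average_cost(actual, predicted)  # (0 + 0 + 4 + 2) / 4
```

Calling an R2L connection normal costs 4, the most expensive cell in the matrix, so a single missed remote intrusion outweighs several cheaper mistakes.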
9. Results
- Twenty-four entries, scores: 0.2331 0.2356 0.2367 0.2411 0.2414 0.2443 0.2474 0.2479 0.2523 0.2530 0.2531 0.2545 0.2552 0.2575 0.2588 0.2644 0.2684 0.2952 0.3344 0.3767 0.3854 0.3899 0.5053 0.9414
- 1-Nearest Neighbor scored 0.2523
10. Winning Method: Bagged Boosting
- Submitted by Bernhard Pfahringer, ML Group, Austrian Research Institute for AI
- 50 samples from the original set of 5 million odd examples
  - Contrary to standard bagging, the sampling was slightly biased
  - all of the examples of the two smallest classes, U2R and R2L
  - 4000 PROBE, 80000 NORMAL, and 400000 DOS examples
  - duplicate entries in the original data set removed
- Ten C5 decision trees induced from each sample
  - used both C5's error-cost and boosting options
- Final predictions computed from 50 single predictions of each training sample by minimizing conditional risk
  - minimizes sum of error-costs times class-probabilities
- Took approximately 1 day on a 200MHz 2-processor Sparc to train
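The last step, minimizing conditional risk, can be illustrated with a small sketch. For each candidate prediction it sums error-cost times class-probability over the possible true classes and picks the cheapest. The cost matrix is the contest's scoring matrix; how the winner turned the 50 single predictions into class probabilities is not stated in the slides, so the plain averaging in `average_probs`, like the function names themselves, is an assumption for illustration only.

```python
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
# Contest cost matrix: row = true class, column = predicted class.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]


def average_probs(prob_vectors):
    """Average per-model class-probability vectors (assumed combination rule)."""
    n = len(prob_vectors)
    return [sum(v[k] for v in prob_vectors) / n for k in range(len(CLASSES))]


def min_risk_class(probs):
    """Return the prediction with the lowest expected cost, plus all risks."""
    risks = [
        sum(probs[true] * COST[true][pred] for true in range(len(CLASSES)))
        for pred in range(len(CLASSES))
    ]
    best = min(range(len(CLASSES)), key=lambda j: risks[j])
    return CLASSES[best], risks


# Two hypothetical models that both lean towards "normal".
model_outputs = [
    [0.7, 0.0, 0.0, 0.0, 0.3],
    [0.5, 0.0, 0.0, 0.0, 0.5],
]
probs = average_probs(model_outputs)   # roughly 0.6 normal, 0.4 R2L
label, risks = min_risk_class(probs)
```

In this toy case "normal" is the most probable class, yet the minimum-risk prediction is R2L, because mistaking an R2L connection for normal carries the highest cost in the matrix.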
11. Confusion Matrix (Breakdown of score)
12. Analysis of winning entry
- Result comparable to 1-NN except on rare classes
  - Training sample of winner biased to rare classes
  - Does this give us a general principle?
- Misses badly for some attack categories
  - True for 1-NN as well
  - Problem with feature set?
13. Second and Third places (Probably not statistically significant)
- Itzhak Levin, LLSoft, Inc.: Kernel Miner
  - Link broken?
- Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin, MP13, Moscow, Russia
  - Verbal rules constructed by an expert
  - First echelon of voting decision trees
  - Second echelon of voting decision trees
  - Steps sequentially
    - Branch to the next step occurs whenever the current one has failed to recognize the connection
  - Trees constructed using their own (previously developed) tree learning algorithm
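The sequential structure of the MP13 entry (expert rules, then two echelons of voting trees, branching to the next step whenever the current one fails to recognize the connection) can be sketched as a simple cascade. Every stage below is a hypothetical stand-in: the toy rule, the stub voters used in place of decision trees, the agreement thresholds, and the `normal` fallback are assumptions for illustration and do not reproduce the authors' rules or trees.

```python
from collections import Counter


def expert_rules(conn):
    """Hypothetical verbal rule; returns a label, or None if unrecognized."""
    if conn.get("num_failed_logins", 0) >= 3:
        return "R2L"
    return None


def voting_stage(voters, min_agreement):
    """Build a stage that answers only when enough voters agree."""
    def stage(conn):
        votes = Counter(voter(conn) for voter in voters)
        label, n = votes.most_common(1)[0]
        return label if n >= min_agreement else None
    return stage


def classify(conn, stages, fallback="normal"):
    """Try each stage in order; move on whenever a stage returns None."""
    for stage in stages:
        label = stage(conn)
        if label is not None:
            return label
    return fallback


# Stub voters standing in for decision trees (thresholds are made up).
voters = [
    lambda c: "DOS" if c.get("count", 0) > 100 else "normal",
    lambda c: "DOS" if c.get("serror_rate", 0.0) > 0.5 else "normal",
    lambda c: "probe" if c.get("diff_srv_rate", 0.0) > 0.5 else "normal",
]
stages = [
    expert_rules,
    voting_stage(voters, min_agreement=3),  # first echelon: unanimous vote
    voting_stage(voters, min_agreement=2),  # second echelon: simple majority
]

r_rules = classify({"num_failed_logins": 5}, stages)
r_second = classify({"count": 200, "serror_rate": 0.9}, stages)
r_first = classify({}, stages)
```

Here the first connection is caught by the rule stage, the second falls through the unanimous first echelon and is labeled by the second, and the third is labeled by the first echelon.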