Using Text Categorization Techniques for Intrusion Detection - PowerPoint PPT Presentation


About This Presentation

Title:

Using Text Categorization Techniques for Intrusion Detection

Description:

3. Text Categorization ... by a loud, hooting noise from his nephew Harry's room...' --- J. K. Rowling, Harry Potter and the Chamber of Secrets. 8/7/02. USENIX ... – PowerPoint PPT presentation

Provided by: Yih26

Transcript and Presenter's Notes

1
Using Text Categorization Techniques for
Intrusion Detection

Yihua Liao
V. Rao Vemuri
Univ. of California, Davis

Represented by Sylvain Leblanc at RMC CSL,
13 November 2002
2
Outline

Text categorization
Methodology
Experiments with DARPA data
Conclusions

3
Text Categorization

Group text documents into one or more predefined
categories based on their content
Important for information retrieval, sorting of
email or files, etc.

4
Sample Document

"Not for the first time, an argument had broken
out over breakfast at number four, Privet Drive.
Mr. Vernon Dursley had been woken in the early
hours of the morning by a loud, hooting noise
from his nephew Harry's room..."
--- J. K. Rowling, Harry Potter and the Chamber
of Secrets.

5
Text Categorization

Preprocessing
Remove HTML (or other) tags
Remove stop words that carry no information
(pronouns, prepositions, conjunctions, etc.)
Word stemming (suffix removal): group words such
as play, played and playing.
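
The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the presenters' code: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "at", "in", "by", "for", "his", "from"}

def strip_tags(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def naive_stem(word):
    """Crude suffix removal; keeps at least three characters of the word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Strip tags, lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", strip_tags(text).lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<b>play</b>, played and playing"))
```

All three word forms collapse to the same stem, which is the grouping the slide describes.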

6
Document Representation

Vector space model: transform each document into a
vector
Count word frequencies
Frequency of word i in document j: $f_{ij}$
Weight of word i in document j: $a_{ij}$
Word-by-document matrix: $A = (a_{ij})$
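
As a rough sketch of the vector space model, the frequency matrix can be built from tokenized documents as below. The function name `term_document_matrix` and the two toy documents are illustrative assumptions, not part of the presentation.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs: list of token lists. Returns (vocab, F) with F[i][j] = f_ij,
    the frequency of word i (row) in document j (column)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = [Counter(doc) for doc in docs]
    F = [[c[word] for c in counts] for word in vocab]
    return vocab, F

# Two toy "documents" (hypothetical data).
docs = [["open", "mmap", "mmap", "close"], ["open", "close", "exit"]]
vocab, F = term_document_matrix(docs)
for word, row in zip(vocab, F):
    print(word, row)
```

Each row is a word and each column a document, matching the word-by-document layout of the matrix A.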

7
Weighting techniques

Boolean weighting: $a_{ij} \in \{0, 1\}$
Frequency weighting: $a_{ij} = f_{ij}$
Term frequency-inverse document frequency
weighting (tf-idf)
$N_i$: number of documents in which word i occurs at
least once.
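
The three weighting schemes can be sketched as below. The tf-idf formula itself did not survive in this transcript, so the code assumes the common form a_ij = f_ij * log(N / N_i), with N the total number of documents; treat that as an assumption rather than the slide's exact definition.

```python
import math

def boolean_weight(F):
    """a_ij = 1 if word i occurs in document j, else 0."""
    return [[1 if f > 0 else 0 for f in row] for row in F]

def frequency_weight(F):
    """a_ij = f_ij (returned as a copy of the count matrix)."""
    return [list(row) for row in F]

def tfidf_weight(F):
    """a_ij = f_ij * log(N / N_i); assumed standard form, see lead-in."""
    n_docs = len(F[0])
    A = []
    for row in F:
        n_i = sum(1 for f in row if f > 0)  # documents containing word i
        idf = math.log(n_docs / n_i) if n_i else 0.0
        A.append([f * idf for f in row])
    return A

# Rows are words, columns are documents (hypothetical counts).
F = [[2, 0], [1, 1]]
print(boolean_weight(F))
print(tfidf_weight(F))
```

Under this form, a word that occurs in every document gets weight zero, since it does not help tell documents apart.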

8
Dimension reduction

Major challenge: high dimensionality of the feature
space (15,000 words)
Dimension reduction approaches
Feature selection
Re-parameterization

9
Machine Learning Methods

Neural networks
Support vector machines
Naïve Bayesian classifiers
K-nearest neighbor

10
k-Nearest Neighbor Classifier

To classify an unknown document X:
Calculate the similarity between X and training
samples
Look at the class labels of k nearest neighbors.
Assume instances of the same class cluster
together in vector space.

11
KNN

Use the class labels of the k most similar
neighbors (by cosine similarity) to predict the
class of a new document X
Use a cutoff threshold to assign a new document
to an existing class
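
A minimal sketch of kNN with cosine similarity and a cutoff threshold follows. The slide does not spell out the decision rule, so averaging the k highest similarities and comparing that average to the threshold is an assumption; `knn_accepts` and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)

def knn_accepts(x, class_vectors, k=3, threshold=0.8):
    """True if the mean similarity between x and its k nearest
    training vectors of the class exceeds the threshold (assumed rule)."""
    sims = sorted((cosine_similarity(x, v) for v in class_vectors), reverse=True)
    top = sims[:k]
    if not top:
        return False
    return sum(top) / len(top) > threshold

training = [[1, 0], [1, 1], [2, 0.1]]
print(knn_accepts([1, 0], training, k=2))  # close to two training vectors
print(knn_accepts([0, 1], training, k=2))  # far from most of them
```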

12
Outline

Text categorization
Methodology
Experiments with DARPA data
Conclusions

13
Modeling program behavior

Intrusions often occur when a program is misused.
Learn program behavior profiles from previous
executions. (Forrest, Lee, etc.)
Short sequences of system calls (local ordering)
Profiles for individual programs
Time-consuming training and testing process
Different method?

14
Analogy

Text Categorization <-> Intrusion Detection
word <-> system call
text document <-> list of system calls
issued by a process
different categories <-> normal / intrusive

15
System call document

Process id 994
close execve open mmap open mmap mmap munmap
mmap mmap close open mmap close open mmap
mmap munmap mmap close close munmap open ioctl
access chown ioctl access chmod close close
close close close exit
-> Process vector
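
To make the conversion concrete, the trace listed above for process 994 can be turned into a frequency vector as sketched here. Building the vocabulary from this single trace is a simplification; in practice it would presumably be the fixed set of system calls seen in training.

```python
from collections import Counter

# System calls issued by process 994, copied from the slide.
trace = """
close execve open mmap open mmap mmap munmap
mmap mmap close open mmap close open mmap
mmap munmap mmap close close munmap open ioctl
access chown ioctl access chmod close close
close close close exit
""".split()

counts = Counter(trace)

# Assumption: vocabulary taken from this trace only, in sorted order.
vocabulary = sorted(counts)
process_vector = [counts[call] for call in vocabulary]

for call, freq in zip(vocabulary, process_vector):
    print(f"{call:8s} {freq}")
```

The ordering of calls is discarded; only their frequencies are kept, just as word order is discarded in the text case.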

16
Advantages

Limited system-call vocabulary. No dimension
reduction techniques needed.
Simple binary categorization problem
No need to learn individual program profiles with
kNN.

17
(No Transcript)
18
Remarks

Assume the frequencies of system calls issued by a
program appear in a consistent manner
If an intrusion does not reveal an anomaly in the
above sense, we will fail to catch it
e.g., attacks caused by abuse of perfectly normal
processes

19
Remarks (cont.)

Each process is classified when it terminates
If an attack spans one or more sessions and a
session comprises several processes, then
detection is close to real time.

20
Outline

Text categorization
Methodology
Experiments with DARPA data
Conclusions

21
Experiments

Data set: 1998 DARPA BSM data
Provides a large sample of network-based attacks
embedded in normal background traffic.
TCPDUMP and BSM audit data collected on a
simulated network.

22
(No Transcript)
23
Anomaly Detection

Training data
606 distinct process vectors
from 4 simulation days (no attacks)
50 distinct system calls
Testing data
55 sessions, 35 attacks (multiple sessions)
412 sessions, 5285 normal processes from one
simulation day. (DARPA training)

24
(No Transcript)
25
Result (ROC curves for tf-idf weighting)
26
Result (comparison of tf-idf and f weighting)
27
Anomaly Detection + Signature Verification

Training data
606 distinct process vectors
19 intrusive processes
Testing data
412 sessions, 5285 normal processes from one
simulation day.
24 attacks from DARPA testing data

28
Anomaly Detection + Signature Verification

A new test process is compared to the intrusive
processes first
If there is a perfect match -> attack process
Otherwise, the anomaly detection procedure is
performed
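
The two-stage procedure above might look like the sketch below. It assumes that a "perfect match" means an identical process vector and that the anomaly step averages the k highest cosine similarities to normal training vectors; both are assumptions, and all names and data are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def classify(x, intrusive, normal, k=10, threshold=0.8):
    """Stage 1: signature verification. Stage 2: kNN anomaly detection."""
    # Stage 1: an exact match with a known intrusive vector is an attack.
    if any(list(x) == list(sig) for sig in intrusive):
        return "attack"
    # Stage 2: compare against the k nearest normal training vectors.
    sims = sorted((cosine(x, v) for v in normal), reverse=True)[:k]
    if sims and sum(sims) / len(sims) > threshold:
        return "normal"
    return "attack"

intrusive = [[0, 5, 1]]
normal = [[3, 0, 1], [4, 0, 1], [3, 1, 1]]
print(classify([0, 5, 1], intrusive, normal, k=2))  # matches a signature
print(classify([3, 0, 1], intrusive, normal, k=2))  # close to normal profiles
print(classify([0, 1, 0], intrusive, normal, k=2))  # unlike anything normal
```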

29
Result

tf-idf weighting, k = 10 and threshold = 0.8
False positive rate is 0.59% (31 false alarms)
8 were captured with signature verification
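
As a quick check, the reported false positive figure is consistent with 31 false alarms out of the 5285 normal test processes when read as a percentage. Using 5285 as the denominator is an assumption based on the testing data listed earlier.

```python
false_alarms = 31
normal_test_processes = 5285  # assumed denominator, from the testing data slide

false_positive_rate = false_alarms / normal_test_processes
print(f"{false_positive_rate:.2%}")
```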

30
Missed Attacks

Process table -- new denial of service attack
Consists of abuse of a perfectly legal action
Matches one normal training process exactly;
no abnormality when analyzing individual processes

31
Outline

Text categorization
Methodology
Experiments with DARPA data
Conclusions

32
Summary

Frequencies of system calls are used to
characterize program behavior.
Text categorization weighting techniques combined
with kNN can effectively detect intrusive program
behavior. No individual program profiles are
needed.

33
Comments on KNN

Lazy learning: no training needed, computation
happens at query time
Efficient memory indexing may improve performance

34
Future Work

Combine local similarity and global similarity of
system call sequences?
N-gram text classification
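
One way to picture the n-gram idea is to count short runs of consecutive system calls instead of single calls, which keeps some local ordering. This is only an illustration of the general technique, not the authors' planned method; the function name and sample trace are hypothetical.

```python
from collections import Counter

def ngrams(calls, n):
    """All runs of n consecutive system calls, in order of appearance."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

trace = ["open", "mmap", "close", "open", "mmap", "exit"]
bigram_counts = Counter(ngrams(trace, 2))
for gram, freq in bigram_counts.most_common():
    print(gram, freq)
```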

35
Comments from Author on the presentation at USENIX

Real-time usefulness of the method
The processes are compared when they close.
Relative frequency of system calls
Although the individual system calls will be the
same for identical/similar processes, their
relative frequency will change
There are questions about the method's usefulness
in a real environment vs. the DARPA data set.

36
Acknowledgement

Thank MIT Lincoln Laboratory for providing the
DARPA training and testing data!
Thank reviewers and Dr. Vern Paxson!
Work supported by AFOSR grant F49620-01-1-0327 to
the Center for Digital Security of the University
of California, Davis.

37
Thank you!!! Questions?
