Using Text Categorization Techniques for Intrusion Detection

Originally presented at USENIX, 8/7/02.

1
Using Text Categorization Techniques for
Intrusion Detection
  • Yihua Liao
  • V. Rao Vemuri
  • Univ. of California, Davis

Presented by Sylvain Leblanc at RMC CSL, 13 November 2002
2
Outline
  • Text categorization
  • Methodology
  • Experiments with DARPA data
  • Conclusions

3
Text Categorization
  • Group text documents into one or more predefined
    categories based on their content
  • Important for information retrieval, sorting of
    email or files, etc.

4
Sample Document
  • Not for the first time, an argument had broken
    out over breakfast at number four, Privet Drive.
    Mr. Vernon Dursley had been woken in the early
    hours of the morning by a loud, hooting noise
    from his nephew Harry's room...
  • --- J. K. Rowling, Harry Potter and the Chamber
    of Secrets.

5
Text Categorization
  • Preprocessing:
  • Remove HTML (or other) tags
  • Remove stop words that carry no information
    (pronouns, prepositions, conjunctions, etc.)
  • Word stemming (suffix removal): group words such
    as play, played and playing.
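
The three preprocessing steps can be sketched as below. This is only an illustration: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, which the slides do not name.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "the", "of", "at", "by", "for", "from", "in",
              "his", "had", "been", "not", "over", "out", "and"}

def strip_tags(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def crude_stem(word):
    """Naive suffix removal; keeps at least three characters of the stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Strip tags, lowercase, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", strip_tags(text).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, play, played and playing all map to the same token, which is the grouping the slide describes.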

6
Document Representation
  • Vector space model: transform a document into a
    vector
  • Count word frequencies
  • Frequency of word i in document j: f_ij
  • Weight of word i in document j: a_ij
  • Word-by-document matrix: A = (a_ij)
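
A minimal sketch of building the word-by-document frequency matrix, with one row per word i and one column per document j. The function name and the two toy documents are hypothetical, chosen only to show the layout.

```python
from collections import Counter

def word_by_document_matrix(documents):
    """Return (vocabulary, F) where F[i][j] is the frequency f_ij of
    word i in document j. Each document is a list of tokens."""
    vocabulary = sorted({word for doc in documents for word in doc})
    counts = [Counter(doc) for doc in documents]
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

# Toy example (hypothetical documents, already preprocessed).
docs = [["play", "ball", "play"], ["ball", "game"]]
vocab, F = word_by_document_matrix(docs)
```

Each column of `F` is the vector for one document; the weighting schemes turn these raw counts into the weights a_ij.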

7
Weighting techniques
  • Boolean weighting: a_ij = 0 or 1
  • Frequency weighting: a_ij = f_ij
  • Term frequency inverse document frequency
    weighting (tf-idf)
  • N_i: number of documents in which word i occurs
    at least once.
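
The three weighting schemes can be sketched as follows. The tf-idf formula on the original slide did not survive in the transcript, so the code assumes the common textbook form, weight = frequency × log(total documents / documents containing the word); the authors' exact variant (for example, with length normalization) may differ.

```python
import math

def boolean_weight(F):
    """a_ij = 1 if word i occurs in document j, else 0."""
    return [[1 if f > 0 else 0 for f in row] for row in F]

def frequency_weight(F):
    """a_ij = f_ij (raw counts, copied)."""
    return [list(row) for row in F]

def tfidf_weight(F):
    """Assumed form: a_ij = f_ij * log(n_docs / docs_with_word_i)."""
    n_docs = len(F[0])
    weighted = []
    for row in F:
        docs_with_word = sum(1 for f in row if f > 0)
        idf = math.log(n_docs / docs_with_word) if docs_with_word else 0.0
        weighted.append([f * idf for f in row])
    return weighted

# Rows are words, columns are documents (toy counts).
F = [[1, 1], [0, 1], [2, 0]]
W = tfidf_weight(F)
```

Under this form, a word that occurs in every document gets weight zero, since it does nothing to tell documents apart.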

8
Dimension reduction
  • Major challenge: high dimensionality of the
    feature space (15,000 words)
  • Dimension reduction approaches:
  • Feature selection
  • Re-parameterization

9
Machine Learning Methods
  • Neural networks
  • Support vector machines
  • Naïve Bayesian classifiers
  • K-nearest neighbor

10
k-Nearest Neighbor Classifier
  • To classify an unknown document X:
  • Calculate the similarity between X and training
    samples
  • Look at the class labels of the k nearest neighbors.
  • Assume instances of the same class cluster
    together in vector space.

11
KNN
  • Use the class labels of k most similar neighbors
    to predict the class of new document X (cosine
    similarity)
  • Use a cutoff threshold to assign a new document
    to an existing class
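
A minimal kNN sketch with cosine similarity. How the neighbors' labels are combined is an assumption here: each of the k nearest neighbors votes for its class with its similarity, and the winning class is accepted only if its average vote reaches the cutoff threshold. The function names and default parameter values are hypothetical.

```python
import math
from collections import defaultdict

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def knn_classify(x, training, k=3, threshold=0.5):
    """training is a list of (vector, label) pairs.
    Returns the winning label, or None if no class passes the threshold."""
    scored = sorted(
        ((cosine_similarity(x, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if not scored:
        return None
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim  # similarity-weighted vote
    best = max(votes, key=votes.get)
    return best if votes[best] / k >= threshold else None

# Toy training set (hypothetical vectors and labels).
training = [([1, 0], "sports"), ([0.9, 0.1], "sports"), ([0, 1], "politics")]
label = knn_classify([1, 0.05], training, k=2, threshold=0.5)
```

There is no training phase: all the work happens when a new document is compared against the stored samples.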

12
Outline
  • Text categorization
  • Methodology
  • Experiments with DARPA data
  • Conclusions

13
Modeling program behavior
  • Intrusions often occur when a program is misused.
  • Learn program behavior profiles from previous
    executions. (Forrest, Lee, etc.)
  • Short sequences of system calls (local ordering)
  • Profiles for individual programs
  • Time-consuming training and testing process
  • Different method?

14
Analogy
  • Text categorization <-> intrusion detection
  • Word <-> system call
  • Text document <-> list of system calls
    issued by a process
  • Different categories <-> normal / intrusive

15
System call document
  • Process ID: 994
  • close execve open mmap open mmap mmap munmap
    mmap mmap close open mmap close open mmap
    mmap munmap mmap close close munmap open ioctl
    access chown ioctl access chmod close close
    close close close exit
  • -> Process vector
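
Turning the system call "document" above into a process vector amounts to counting how often each call occurs. The sketch below uses the trace for process 994 exactly as listed; `VOCABULARY` is a hypothetical stand-in built from this one trace, whereas a real vocabulary would cover every system call seen in the training data.

```python
from collections import Counter

# System calls issued by process 994, copied from the slide.
TRACE = (
    "close execve open mmap open mmap mmap munmap "
    "mmap mmap close open mmap close open mmap "
    "mmap munmap mmap close close munmap open ioctl "
    "access chown ioctl access chmod close close "
    "close close close exit"
)

calls = TRACE.split()
counts = Counter(calls)

# Hypothetical vocabulary: the distinct calls in this trace, sorted.
VOCABULARY = sorted(counts)

# Frequency-weighted process vector, one entry per vocabulary call.
process_vector = [counts[call] for call in VOCABULARY]
```

The order of the calls is discarded; only their frequencies remain, just as word order is discarded in the vector space model.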

16
Advantages
  • Limited system-call vocabulary. No dimension
    reduction techniques needed.
  • Simple binary categorization problem
  • No need to learn individual program profiles with
    kNN.

18
Remarks
  • Assume the frequencies of system calls issued by
    a program appear in a consistent manner
  • If an intrusion does not reveal an anomaly in the
    above sense, we will fail to catch it
  • Example: attacks caused by abuse of perfectly
    normal processes

19
Remarks (cont.)
  • Each process is classified when it terminates
  • If an attack spans one or more sessions and a
    session comprises several processes, then we
    achieve something close to real-time detection.

20
Outline
  • Text categorization
  • Methodology
  • Experiments with DARPA data
  • Conclusions

21
Experiments
  • Data set: 1998 DARPA BSM data
  • Provides a large sample of network-based attacks
    embedded in normal background traffic.
  • TCPDUMP and BSM audit data collected on a
    simulated network.

23
Anomaly Detection
  • Training data:
  • 606 distinct process vectors
  • From 4 simulation days (no attacks)
  • 50 distinct system calls
  • Testing data:
  • 55 sessions, 35 attacks (multiple sessions)
  • 412 sessions, 5285 normal processes from one
    simulation day (DARPA training)

25
Result (ROC curves for tf-idf weighting)
26
Result (comparison of tf-idf and f weighting)
27
Anomaly Detection + Signature Verification
  • Training data
  • 606 distinct process vectors
  • 19 intrusive processes
  • Testing data
  • 412 sessions, 5285 normal processes from one
    simulation day.
  • 24 attacks from DARPA testing data

28
Anomaly Detection + Signature Verification
  • New test process is compared to intrusive process
    first
  • If there is a perfect match -> attack process
  • Otherwise, anomaly detection procedure is
    performed
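
The two-stage check can be sketched as below. Several details are assumptions, since the transcript does not spell them out: a "perfect match" is taken to mean cosine similarity of 1 (within a small tolerance), and the anomaly step flags a process when the average similarity to its k nearest normal training vectors is not above the threshold. All function and parameter names are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def classify_process(x, normal_vectors, intrusive_vectors,
                     k=10, threshold=0.8, tol=1e-9):
    # Stage 1: signature verification against known intrusive processes.
    for signature in intrusive_vectors:
        if cosine_similarity(x, signature) >= 1.0 - tol:
            return "attack"
    # Stage 2 (assumed form): kNN anomaly detection against normal processes.
    sims = sorted((cosine_similarity(x, v) for v in normal_vectors),
                  reverse=True)[:k]
    if not sims:
        return "abnormal"
    average = sum(sims) / len(sims)
    return "normal" if average > threshold else "abnormal"

# Toy vectors (hypothetical), three system calls wide.
normal = [[5, 1, 0], [4, 1, 0]]
intrusive = [[0, 0, 3]]
```

Checking signatures first lets a known attack be reported even when its process vector also happens to sit close to normal behavior.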

29
Result
  • tf-idf weighting, k = 10 and threshold = 0.8
  • False positive rate is 0.59% (31 false alarms)
  • 8 were captured with signature verification
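
The reported rate is consistent with 31 false alarms out of the 5285 normal test processes, assuming the rate is counted per process and expressed as a percentage:

```python
false_alarms = 31
normal_processes = 5285  # normal test processes from one simulation day

# False positive rate = false alarms / normal processes, as a percentage.
rate_percent = 100 * false_alarms / normal_processes
print(f"{rate_percent:.2f}%")  # prints 0.59%
```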

30
Missed Attacks
  • Process table: a new denial-of-service attack
  • Consists of abuse of a perfectly legal action
  • Matches one normal training process exactly; no
    abnormality appears when individual processes
    are analyzed

31
Outline
  • Text categorization
  • Methodology
  • Experiments with DARPA data
  • Conclusions

32
Summary
  • Frequencies of system calls used to characterize
    program behavior.
  • Text categorization weighting techniques
  • KNN can effectively detect intrusive program
    behavior. No individual program profiles are
    needed.

33
Comments on KNN
  • Lazy learning: no training needed; computation
    happens at query time
  • Efficient memory indexing may improve performance

34
Future Work
  • Combine local similarity and global similarity
    of system call sequences?
  • N-gram text classification
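
One way to read the n-gram idea, offered only as an illustration of this future-work item: count short runs of consecutive system calls, which keeps some local ordering while still giving a frequency vector per process. The function name is hypothetical.

```python
from collections import Counter

def call_ngrams(calls, n=2):
    """Count every run of n consecutive system calls in a trace."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

# Toy trace (hypothetical).
bigrams = call_ngrams(["open", "mmap", "open", "mmap", "close"], n=2)
```

The resulting n-gram counts could be weighted and compared with the same techniques used for single system calls.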

35
Comments from Author on the presentation at USENIX
  • Real-time usefulness of the method
  • The processes are compared when they close.
  • Relative frequency of system calls
  • Although the individual system calls will be the
    same for identical/similar processes, their
    relative frequency will change
  • There are questions about the method's usefulness
    in a real environment vs. the DARPA data set.

36
Acknowledgement
  • Thank MIT Lincoln Laboratory for providing the
    DARPA training and testing data!
  • Thank reviewers and Dr. Vern Paxson!
  • Work supported by AFOSR grant F49620-01-1-0327 to
    the Center for Digital Security of the University
    of California, Davis.

37
Thank you!!! Questions?