High-Performance Data Mining for Cyber Security

University of California, Davis, Feb 16, 2006.

1
High-Performance Data Mining for Cyber Security
Vipin Kumar, Department of Computer Science,
University of Minnesota, kumar_at_cs.umn.edu
http://www.cs.umn.edu/kumar
Collaborators: Paul Dokas, Eric Eilertson, Levent
Ertoz, Aleksandar Lazarevic, Michael
Steinbach, Haiyang Liu, Changho Choi, Mark
Shaneck, George Simon, Jaideep Srivastava,
Pang-Ning Tan, Varun Chandola, Yongdae Kim,
Zhi-li Zhang
2
Cyber Intrusion Detection - Motivation
  • The sophistication and severity of cyber attacks
    are increasing
  • Large-scale denial of service attacks
  • Identity Theft / Fraud
  • Espionage
  • DoD and other U.S. Government agencies are major
    targets for sophisticated state-sponsored cyber
    attacks
  • Security mechanisms always have inevitable
    vulnerabilities
  • Firewalls are not sufficient to ensure security
    in computer networks
  • Insider attacks are difficult to detect

[Figure: Incidents reported to the Computer Emergency Response Team/Coordination Center, 1990-2003]
[Figure: Spread of the SQL Slammer worm 10 minutes after its deployment]
3
Intrusion Detection Systems
  • Intrusion Detection System
  • Combination of software and hardware that
    attempts to perform intrusion detection
  • Raises an alarm when a possible intrusion occurs
  • Traditional intrusion detection system (IDS) tools
    are based on signatures of known attacks
  • Limitations
  • Signature database has to be manually revised
    for each new type of discovered intrusion
  • Substantial latency in deployment of newly
    created signatures across the computer system
  • Cannot detect emerging cyber threats
  • Not suitable for detecting policy violations and
    insider abuse
  • Do not provide understanding of network traffic
  • Generate too many false alarms
  • Not suited for detecting multi-step attacks

Example of SNORT rule (MS-SQL Slammer worm):
any -> udp port 1434 (content:"81 F1 03 01 04 9B 81 F1 01"; content:"sock"; content:"send")
Source: www.snort.org
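To make the signature idea concrete, here is a minimal sketch of how such a rule could be checked against a single packet. It is not how SNORT is implemented; the dictionary layout and the function name `matches_signature` are assumptions made for illustration, and only the port, the hex bytes, and the two keywords come from the rule above.

```python
# Hypothetical in-memory form of the rule shown above. Only the port number,
# the hex byte string, and the two keywords are taken from the slide.
SLAMMER_SIGNATURE = {
    "protocol": "udp",
    "dst_port": 1434,
    "contents": [
        bytes.fromhex("81F10301049B81F101"),
        b"sock",
        b"send",
    ],
}


def matches_signature(protocol, dst_port, payload, signature=SLAMMER_SIGNATURE):
    """Return True if the packet matches every field of the signature."""
    if protocol != signature["protocol"] or dst_port != signature["dst_port"]:
        return False
    # Every content pattern must occur somewhere in the payload.
    return all(pattern in payload for pattern in signature["contents"])
```

A payload that drops or alters any one of the three byte patterns no longer matches, which is the limitation the slide points to: a new or modified attack needs a new signature.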
4
Data Mining for Intrusion Detection
  • Increased interest in data mining based intrusion
    detection over the past decade
  • Misuse detection
  • Suitable for attacks for which it is difficult to
    build signatures
  • Builds predictive models from labeled data sets
    (instances are labeled as normal or
    intrusive) to identify known intrusions
  • Cannot detect unknown and emerging attacks
  • Madam ID project, ADAM project, fuzzy association
    rules [Bridges00], decision trees [Sinclair99],
    neural networks [Lippmann00, Ghosh99], genetic
    algorithms [Bridges00, Sinclair99], cost
    sensitive modeling (AdaCost [Fan99], MetaCost
    [Domingos99], [Ting00]), learning from rare class
    (Kubat97, Fawcett97, Provost01, Japkowicz01,
    Joshi02, Lazarevic03)
  • Anomaly detection
  • Detects emerging/novel attacks as deviations from
    normal behavior
  • Potential high false alarm rate - previously
    unseen (yet legitimate) system behaviors may also
    be recognized as anomalies
  • PHAD, ALAD [Chan01, Cha02], ADAM [Barbara01],
    finite mixture model [Yamanishi00], χ² based
    [Ye01], temporal sequence learning [Lane98],
    neural networks [Ryan98], generating artificial
    anomalies [Fan01], clustering [Eskin02],
    unsupervised SVM [Eskin02, Lazarevic03], outlier
    detection schemes (MINDS), Bayesian net
    [Valdes00], Hidden Markov models [Ourston03]
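The idea of flagging "deviations from normal behavior" can be sketched with a very simple outlier score. The example below uses the distance to the k-th nearest neighbor, which is a stand-in chosen for brevity and is not claimed to be the scheme used by MINDS or by any of the systems cited above. The function name and the toy points are assumptions.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its distance to its k-th nearest neighbor.

    Points far from every dense region get a large score. This costs
    O(N^2) distance computations, so it is only a sketch for small inputs.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Toy data: four points close together and one far away.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
toy_scores = knn_outlier_scores(toy_points, k=2)
most_anomalous = toy_scores.index(max(toy_scores))
```

The far-away point receives the highest score. A previously unseen but legitimate behavior would score high in exactly the same way, which is the false alarm risk noted in the list above.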

5
Technical Approach: Data Mining for Intrusion
Detection
[Figure labels: Training Set (continuous, categorical, categorical, temporal, class); Live data]
  • Misuse Detection: Building Predictive Models

  • Key Technical Challenges
  • Large data size
  • High dimensionality
  • Temporal nature of the data
  • Skewed class distribution
  • Data preprocessing
  • On-line analysis

[Figure labels: Summarization of attacks using association rules; Learn Classifier; Clustering; Anomaly Detection; Link Analysis]
Rules Discovered: {Src IP = 206.163.37.95, Dest Port = 139, Bytes ∈ [150, 200]} --> ATTACK
6
MINDS: Minnesota INtrusion Detection System
[Figure: MINDS system diagram. Components shown: network; data capturing device (net flow tools, tcpdump); filtering; feature extraction; known attack detection (detected known attacks); anomaly detection (anomaly scores, detected novel attacks); association pattern analysis (summary and characterization of attacks); human analyst; labels]
  • Data mining based anomaly detection system
  • Used at the University of Minnesota to analyze
    network traffic to/from 40,000 computers
  • Incorporated into Interrogator architecture at
    ARL Center for Intrusion Monitoring and
    Protection (CIMP), PoC Bencevenko and Long (ARL)
  • Helps analyze data from multiple sensors at DoD
    sites around the country
  • Routinely detects attacks and intrusive behavior
    not detected by widely used intrusion detection
    systems
  • Insider Abuse / Policy Violations / Worms / Scans
  • ARL-CIMP considers MINDS as the first effective
    anomaly intrusion detection system

7
Feature Extraction Module
  • Three groups of features
  • Basic features of individual TCP connections
  • Source and destination IP - Features 1 & 2
  • Source and destination port - Features 3 & 4
  • Protocol - Feature 5
  • Duration - Feature 6
  • Bytes per packet - Feature 7
  • Number of bytes - Feature 8
  • Time based features
  • For the same source (destination) IP address,
    number of unique destination (source) IP
    addresses inside the network in last T seconds -
    Features 9 (13)
  • Number of connections from source (destination)
    IP to the same destination (source) port in last
    T seconds - Features 11 (15)
  • Connection based features
  • For the same source (destination) IP address,
    number of unique destination (source) IP
    addresses inside the network in last N
    connections - Features 10 (14)
  • Number of connections from source (destination)
    IP to the same destination (source) port in last
    N connections - Features 12 (16)
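The two derived feature groups can be sketched as follows. This is an illustrative reading of the definitions above, not the MINDS code: the tuple layout, the function names, and the sample addresses are assumptions, and the "inside the network" restriction is left out for brevity. Only the unique-destination count for a source IP is shown, once over a time window and once over a connection window.

```python
from collections import deque


def time_window_features(connections, window_seconds):
    """Unique destination IPs per source IP in the last `window_seconds`.

    `connections` is a list of (timestamp, src_ip, dst_ip, dst_port) tuples
    sorted by timestamp. The current connection is included in its own count.
    """
    recent = deque()
    counts = []
    for ts, src, dst, port in connections:
        recent.append((ts, src, dst, port))
        # Drop connections that fell out of the time window.
        while recent and recent[0][0] < ts - window_seconds:
            recent.popleft()
        counts.append(len({d for _, s, d, _ in recent if s == src}))
    return counts


def connection_window_features(connections, n):
    """Same count, but over the last `n` connections instead of T seconds."""
    counts = []
    for i, (_, src, _, _) in enumerate(connections):
        window = connections[max(0, i - n + 1): i + 1]
        counts.append(len({d for _, s, d, _ in window if s == src}))
    return counts


# Hypothetical traffic: 10.0.0.1 probes a new host every 100 seconds.
sample = [
    (0, "10.0.0.1", "160.94.0.1", 445),
    (1, "10.0.0.2", "160.94.0.9", 80),
    (100, "10.0.0.1", "160.94.0.2", 445),
    (101, "10.0.0.2", "160.94.0.9", 80),
    (200, "10.0.0.1", "160.94.0.3", 445),
]
by_time = time_window_features(sample, window_seconds=10)
by_count = connection_window_features(sample, n=5)
```

In this toy trace the time-window count never exceeds 1 for the slow prober, while the connection-window count grows to 3. That suggests one reason to keep both feature groups, although the slides do not state the motivation explicitly.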

8
Detection of Anomalies on Real Network Data
  • Anomalies/attacks picked up by MINDS include
    scanning activities, worms, and non-standard
    behavior such as policy violations and insider
    attacks. Many of the attacks detected by MINDS
    were already on the CERT/CC list of recent
    advisories and incident notes.
  • Some illustrative examples of intrusive behavior
    detected using MINDS at U of M
  • Scans
  • Detected scanning for Microsoft DS service on
    port 445/TCP
  • Undetected by SNORT since the scanning was
    non-sequential (very slow). Rule added to SNORT
    in September 2002
  • Detected scanning for Oracle server
  • Undetected by SNORT because the scanning was
    hidden within another Web scanning
  • Detected a distributed Windows networking scan
    from multiple source locations
  • Policy Violations
  • Identified machine running Microsoft PPTP VPN
    server on non-standard ports
  • Undetected by SNORT since the collected GRE
    traffic was part of the normal traffic
  • Identified compromised machines running FTP
    servers on non-standard ports, which is a policy
    violation
  • Example of anomalous behavior following a
    successful Trojan horse attack
  • Detected computers on the network apparently
    communicating with outside computers over a VPN
    or on IPv6
  • Worms
  • Detected several instances of the Slapper worm
    that were not identified by SNORT since they were
    variations of existing worm code
  • Detected unsolicited ICMP ECHOREPLY messages to a
    computer previously infected with the Stacheldraht
    worm (a DDoS agent)

9
Typical Anomaly Detection Output
  • January 25, 2003 (24 hours after the Slammer worm)
  • Anomalous connections that correspond to the
    Slammer worm

10
Typical Anomaly Detection Output
  • January 26, 2003 (48 hours after the Slammer worm)
  • Anomalous connections that correspond to the
    Slammer worm
  • Anomalous connections that correspond to the ping
    scan
  • Connections corresponding to UM machines
    connecting to Half-Life game servers

11
Summarization of Anomalous Connections
  • January 26, 2003 (48 hours after the Slammer worm)

Potential Rules:
1. {Dest Port = 1434/UDP, packets ∈ [0, 2)} --> Highly anomalous behavior (Slammer Worm)
2. {Src IP = 142.150.Y.101, Dest Port = 2048/ICMP, bytes ∈ [0, 1829]} --> Highly anomalous behavior (ping scan)
12
Association Pattern Analysis
  • Given a set of records, each of which contains
    some number of items from a given collection
  • Generate subsets of items that frequently occur
    together
  • Produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items.

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
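The two steps in the list above (frequent subsets, then dependency rules) are usually quantified with support and confidence. The sketch below computes both by brute force; the five transactions are hypothetical toy data chosen so that the two rules above can be checked, and the function names are assumptions.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the slides).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(min_support, max_size=3, transactions=TRANSACTIONS):
    """Brute-force enumeration of itemsets that meet `min_support`."""
    items = sorted(set().union(*transactions))
    return [
        set(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(items, size)
        if support(combo, transactions) >= min_support
    ]


milk_coke = confidence({"Milk"}, {"Coke"})
diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"})
frequent = frequent_itemsets(min_support=0.5)
```

On this toy data {Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk, and {Diaper, Milk} --> {Beer} in 2 of 3. Brute-force enumeration is exponential in the number of items, so a practical system would use a pruning algorithm such as Apriori instead.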
13
Summarization Using Association Patterns
[Figure labels: Anomaly Detection System; Ranked connections (attack, normal); Discriminating Association Pattern Generator; update]
  • Build normal profile
  • Study changes in normal behavior
  • Create attack summary
  • Detect misuse behavior
  • Understand nature of the attack

R1: TCP, DstPort=1863 --> Attack
R100: TCP, DstPort=80 --> Normal
Knowledge Base
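One way to read "discriminating" patterns is: attribute-value combinations that are common among the top-ranked (attack) connections and rare among the normal ones. The sketch below implements that reading with two support thresholds. The thresholds, the record layout, and the function name are assumptions, and the slides do not specify the actual scoring used by MINDS.

```python
from collections import Counter
from itertools import combinations


def discriminating_patterns(attack, normal,
                            min_attack_support=0.5, max_normal_support=0.1):
    """Return patterns common in `attack` and rare in `normal`.

    Each connection is a dict such as {"proto": "TCP", "dst_port": 1863}.
    A pattern is a frozenset of one or two (attribute, value) pairs.
    Results are (pattern, attack_support, normal_support) tuples.
    """
    def pattern_support(records):
        counts = Counter()
        for rec in records:
            items = sorted(rec.items())
            for size in (1, 2):
                for combo in combinations(items, size):
                    counts[frozenset(combo)] += 1
        return {p: c / len(records) for p, c in counts.items()}

    attack_sup = pattern_support(attack)
    normal_sup = pattern_support(normal)
    rows = [
        (p, s, normal_sup.get(p, 0.0))
        for p, s in attack_sup.items()
        if s >= min_attack_support
        and normal_sup.get(p, 0.0) <= max_normal_support
    ]
    # Most frequent in the attack set first, then rarest in the normal set.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


# Hypothetical ranked connections, split into attack and normal sets.
attack_set = [{"proto": "TCP", "dst_port": 1863}] * 3 + [{"proto": "TCP", "dst_port": 80}]
normal_set = [{"proto": "TCP", "dst_port": 80}] * 9 + [{"proto": "UDP", "dst_port": 53}]
found = {p for p, _, _ in discriminating_patterns(attack_set, normal_set)}
```

With this toy input the pattern {proto=TCP, dst_port=1863} is reported, while {proto=TCP} alone is not because it is just as common in normal traffic.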
14
Typical MINDS Output
  • UMN computer connecting to a remote FTP server,
    running on port 5002
  • Summarized TCP reset packets received from
    64.156.X.74, which is a victim of DoS attack, and
    we were observing backscatter, i.e. replies to
    spoofed packets
  • Summarization of FTP scan from a computer in
    Columbia, 200.75.X.2
  • Summary of IDENT lookups, where a remote computer
    tries to get user name
  • Summarization of a USENET server transferring a
    large amount of data

15
Typical MINDS Output
  • UMN computers doing bulk transfers
  • 160.94.122.142 is running a rogue FTP server on
    60000/TCP
  • UMN Computers doing large transfers via
    BitTorrent to many outside hosts
  • 208.2.X.101 is scanning for computers on port
    139/TCP. The majority of the packets are 192 bytes
    or 144 bytes, except for the second summary (score
    88.2)
  • UMN computer running a RealMedia server, that was
    not known to the analyst
  • Odd looking P2P traffic to/from a UMN computer
    (potentially KaZaA or Gnutella)
  • The remote computer was scanning for 57/TCP,
    where RESET packets are sent back from computers
    that do not have 57/TCP open.

16
Typical MINDS Output
  • UM computers doing bulk transfers
  • Attack on Real-Media server (Reported by CERT on
    September 9, 2003, RealNetworks media server
    RTSP protocol parser buffer overflow)
  • 8200/tcp traffic related to gotomypc.com which
    allows users to remotely control a desktop
    (involves a third party)
  • Mysterious traffic (investigation inconclusive)

17
Detecting Modes of Network Traffic Using
Clustering
  • Used Shared Nearest Neighbor (SNN) clustering
  • Not distracted by noise in the data
  • CPU intensive: O(N^2)
  • Requires storing an N x K matrix
  • K (number of neighbors) is typically between 10
    and 20
  • K should be about the size of the smallest
    expected mode
  • Clustered 850,000 connections collected over one
    hour at one US Army Fort
  • Took 10 hours on a 16 CPU cluster
  • Found 3135 clusters
  • Largest clusters around 500 records, smallest
    cluster 10 records
  • Large clusters correspond to normal behavior
  • Many small clusters correspond to policy
    violations or other undesired behavior
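The N x K matrix and the O(N^2) cost mentioned above both come from the nearest-neighbor step. The sketch below shows only that step and the shared-neighbor similarity built on it: two points are similar if each appears in the other's neighbor list, and the similarity is the number of neighbors they share. The later stages of SNN clustering are not shown, and the function names and toy points are assumptions.

```python
import math


def knn_lists(points, k):
    """Return the N x K table of nearest-neighbor indices.

    All pairwise distances are computed, which is the O(N^2) cost.
    """
    table = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        table.append(order[:k])
    return table


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, counted only if i and j list each other."""
    if j not in neighbors[i] or i not in neighbors[j]:
        return 0
    return len(set(neighbors[i]) & set(neighbors[j]))


# Toy data: two well-separated groups of four points each.
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
nbrs = knn_lists(toy, k=3)
same_group = snn_similarity(nbrs, 0, 1)
different_group = snn_similarity(nbrs, 0, 4)
```

Points in the same group share neighbors and get a positive similarity, while points from different groups get zero. Because similarity depends on neighbor lists instead of raw distances, isolated noise points tend to end up with low similarity to everything.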

18
Detecting Modes of Network Traffic Using
Clustering
  • Large clusters of VPN traffic (hundreds of
    connections)
  • Used between forts for secure sharing of data and
    working remotely

19
Detecting Modes of Network Traffic Using
Clustering
  • Clusters Involving GoToMyPC.com (Army Data)
  • Policy violation, allows remote control of a
    desktop

20
Detecting Modes of Network Traffic Using
Clustering
  • Clusters involving mysterious ping and SNMP
    traffic
  • Misconfigured computer subjected to SNMP
    surveillance

21
Detecting Modes of Network Traffic Using
Clustering
  • Clusters involving unusual repeated ftp sessions
  • Further investigations revealed a mis-configured
    Army computer was trying to contact Microsoft

22
Clustering of UMN Traffic
  • A sample of 7500 network connections was
    clustered
  • Cluster samples
  • Clusters representing KaZaA traffic. Each cluster
    corresponds to traffic from a different source
  • Bulk transfers from various source IPs to various
    destinations

23
Clustering of UMN Traffic
  • Connections to several Hotmail servers from UMN
    computers
  • Miscellaneous FTP traffic with small payload

24
Current MINDS Research and Development Work
  • Correlation of suspicious events across network
    sites
  • Helps detect sophisticated attacks not
    identifiable by single site analyses
  • Scalable parallel anomaly detection
  • Distributed correlation algorithms
  • Grids middleware
  • Analysis of long term data (months/years)
  • Uncover suspicious stealth activities (e.g.
    insiders leaking/modifying information)
  • Size of the data and inherent computational
    complexity of the pattern finding algorithms
    require HPC

How to detect a distributed network attack?
25
Map of the Global IP Space
26
Suspicious Traffic on Port 80
Destination IPs of suspicious connections within
the 3 class B networks at the U of M
Source IPs of suspicious connections in the
global IP space
999 unique sources, 1126 unique destinations,
1516 total flows involved
[Plot legend: failed connections; O successful connections]
28
Suspicious Traffic on Port 445
Destination IPs of suspicious connections within
the 3 class B networks at the U of M
Source IPs of suspicious connections in the
global IP space
7982 unique sources, 6184 unique destinations,
9930 total flows involved
[Plot legend: failed connections; O successful connections]
30
Publications
  • Managing Cyber Threats: Issues, Approaches and
    Challenges, edited by V. Kumar, J. Srivastava,
    and A. Lazarevic, Kluwer Academic Publishers,
    2005.
  • A Survey of Intrusion Detections Systems, A.
    Lazarevic, V. Kumar, and J. Srivastava, in
    Managing Cyber Threats: Issues, Approaches and
    Challenges, edited by V. Kumar, J. Srivastava,
    and A. Lazarevic, Kluwer Academic Publishers, May
    2005.
  • Scan Detection: A Data Mining Approach, Gyorgy
    Simon, Hui Xiong, Eric Eilertson, Vipin Kumar,
    Proceedings of SIAM International Conference in
    Data Mining, April 2006.
  • Compressing Data into an Informative
    Representation. Varun Chandola and Vipin Kumar,
    Proceedings of 5th International Conference on
    Data Mining (ICDM) 2005, pages 98-105, Houston,
    TX
  • Parallel and Distributed Computing for Cyber
    Security. An article based on the keynote talk by
    Vipin Kumar at 17th International Conference on
    Parallel and Distributed Computing Systems
    (PDCS-2004). DS Online Journal, 2005.
  • Feature Bagging for Outlier Detection, Aleksandar
    Lazarevic and Vipin Kumar, Proceedings of the
    Eleventh ACM SIGKDD Intl Conf. on Knowledge
    Discovery and Data Mining (SIGKDD 2005) Chicago,
    2005.
  • MINDS: A New Approach to the Information Security
    Process, E. E. Eilertson, L. Ertoz and V. Kumar,
    Proceedings of the 24th Army Science Conference,
    Orlando, November 2004.
  • MINDS - Minnesota Intrusion Detection System,
    Ertöz, L., Eilertson, E., Lazarevic, A., Tan, P.,
    Srivastava, J., Kumar, V., Dokas, P., in Data
    Mining: Next Generation Challenges and Future
    Directions, editors H. Kargupta, A. Joshi, K.
    Sivakumar, Y. Yesha, MIT/AAAI Press, 2004, AHPCRC
    Technical Report 2003-121
  • A Comparative Study of Anomaly Detection Schemes
    in Network Intrusion Detection, Lazarevic, A.,
    Ertoz, L., Kumar, V., Ozgur, A., Srivastava, J.,
    SIAM 2003, San Francisco, CA.
  • Detection of Novel Network Attacks Using Data
    Mining, L. Ertöz, E. Eilertson, A. Lazarevic, P.
    Tan, P. Dokas, V. Kumar, J. Srivastava, Workshop
    on Data Mining for Computer Security, IEEE
    International Conference on Data Mining,
    Melbourne, FL, November 19, 2003, AHPCRC
    Technical Report 2003-108