CS 590M Fall 2001: Security Issues in Data Mining

Slides: 12
Provided by: clif8

1
CS 590M Fall 2001: Security Issues in Data Mining
  • Lecture 3: Classification

2
What is Classification?
  • Problem: assign items to pre-defined classes
  • Sample Y = Y1 ... Yn
  • Set of classes X
  • Given Y, choose the class C in X that contains Y
  • How do we know how to do this?
  • Training data: a set of items for which the proper class Xi
    is known.

3
Issues
  • Classification accuracy
    • False positives, false negatives
    • No clear best metric
  • Computation cost
    • Training
    • Classification
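
The accuracy issues above can be made concrete with a small sketch. It counts false positives and false negatives for one class treated as "positive" and derives three common rates. The labels ("attack", "normal") and the data are hypothetical, chosen only for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class treated as positive."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, really positive
            else:
                fp += 1   # false positive (false alarm)
        else:
            if a == positive:
                fn += 1   # false negative (missed case)
            else:
                tn += 1   # predicted negative, really negative
    return tp, fp, tn, fn

# Hypothetical true classes and classifier outputs.
actual    = ["attack", "normal", "normal", "attack", "normal", "normal"]
predicted = ["attack", "attack", "normal", "normal", "normal", "normal"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="attack")
accuracy = (tp + tn) / len(actual)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(accuracy, false_positive_rate, false_negative_rate)
```

The three numbers can rank classifiers differently, which is one way to read "no clear best metric".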

4
Approaches
  • Naïve Bayes
  • K-Nearest Neighbor
  • Decision rules/Decision trees
  • Neural Networks

5
Naïve Bayes History
  • Bayes classifier: from probability theory
  • Idea: a-posteriori probability of class given
    all inputs is the best possible classifier
  • Problem: doesn't generalize.
  • Solution: Bayesian belief network

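[Figure: Bayesian belief network over Y1, Y2, Y3, Y4, in which Y2 and Y3 depend on Y1, and Y4 depends on Y2 and Y3]
$P(X_i \mid Y) = P(Y_4 \mid Y_2, Y_3)\, P(Y_2 \mid Y_1)\, P(Y_3 \mid Y_1)\, P(Y_1)$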
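
As a rough illustration of how the network structure turns into arithmetic, the sketch below evaluates the product on the right-hand side of the factorization for binary variables. All probability values are hypothetical, and the table and function names are made up for this example.

```python
from itertools import product

# Hypothetical conditional probability tables for binary (True/False) variables.
p_y1 = {True: 0.3, False: 0.7}            # P(Y1)
p_y2_true = {True: 0.8, False: 0.1}       # P(Y2 = True | Y1)
p_y3_true = {True: 0.6, False: 0.2}       # P(Y3 = True | Y1)
p_y4_true = {                             # P(Y4 = True | Y2, Y3)
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.05,
}

def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def network_product(y1, y2, y3, y4):
    """P(Y4 | Y2, Y3) * P(Y2 | Y1) * P(Y3 | Y1) * P(Y1) for one assignment."""
    return (bernoulli(p_y4_true[(y2, y3)], y4)
            * bernoulli(p_y2_true[y1], y2)
            * bernoulli(p_y3_true[y1], y3)
            * p_y1[y1])

# Sanity check: the product summed over all 16 assignments should be 1.
total = sum(network_product(*values) for values in product([True, False], repeat=4))
print(network_product(True, True, True, True), total)
```

Each table only needs entries for a node's parents, which is what makes the factored form smaller than a full joint table.
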
6
Problems with Bayesian Belief Network
  • What should the network structure be?
    • Some work on how to learn the structure
    • Getting it wrong results in over-specificity
  • What are the probabilities?
    • Learning techniques exist here
  • Computational cost to learn network

7
Naïve Bayes
  • Two-layer Bayes network
  • No need to learn structure
  • Assumes inputs are independent given the class
  • Learn the probabilities that work best on
    training data

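[Figure: two-layer network linking class node X to input nodes Y1, Y2, Y3]
$P(X \mid Y_1 \ldots Y_n) \propto P(X) \prod_i P(Y_i \mid X)$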
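
A minimal sketch of a Naïve Bayes classifier for categorical inputs follows. It estimates P(X) and P(Yi | X) by counting over the training data and picks the class with the largest product. Two things here are assumptions rather than anything stated in the slides: the add-one smoothing (so an unseen value does not zero out a class) and the use of log probabilities. The data and names are hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(items, labels):
    """Count classes and per-class input values over the training data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (input index, class) -> value counts
    values_seen = defaultdict(set)        # input index -> distinct values
    for y, x in zip(items, labels):
        for i, v in enumerate(y):
            value_counts[(i, x)][v] += 1
            values_seen[i].add(v)
    return class_counts, value_counts, values_seen

def classify(model, y):
    """Return the class X maximizing log P(X) + sum_i log P(Yi | X)."""
    class_counts, value_counts, values_seen = model
    n = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for x, count in class_counts.items():
        score = math.log(count / n)
        for i, v in enumerate(y):
            # Add-one smoothing: an assumption of this sketch.
            numerator = value_counts[(i, x)][v] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = x, score
    return best_class

# Hypothetical training data: (protocol, traffic volume) -> class.
items = [("tcp", "high"), ("tcp", "low"), ("udp", "low"),
         ("udp", "low"), ("tcp", "high")]
labels = ["attack", "normal", "normal", "normal", "attack"]

model = train_naive_bayes(items, labels)
print(classify(model, ("tcp", "high")), classify(model, ("udp", "low")))
```

No structure is learned here: the only parameters are the per-class value frequencies.
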
8
K-Nearest Neighbor
  • Idea: choose the closest training item
    • Class of test item is the same as class of closest
      training item
    • Need to define distance
  • What if this is a bad match?
    • Find the K closest items
    • Use the most common class among those K
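
The procedure above can be sketched in a few lines. Euclidean distance is an assumption here, since the slides only say a distance must be defined, and the training points are hypothetical. One "attack" point sits inside the "normal" cluster to show how a larger K can outvote a single bad match.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=1, distance=euclidean):
    """Return the most common class among the k training items nearest to query."""
    nearest = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set of (point, class) pairs.
training = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.3), "normal"),
    ((1.5, 1.5), "attack"),   # lone point inside the "normal" cluster
    ((5.0, 5.0), "attack"),
    ((5.2, 4.9), "attack"),
]

query = (1.4, 1.4)
print(knn_classify(training, query, k=1))   # decided by the single closest item
print(knn_classify(training, query, k=3))   # decided by a vote of three
```

Sorting the whole training set on every query is the simplest approach and scans all training data each time.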

9
KNN Advantages
  • As training set → ∞ and K → ∞, result approaches
    optimal
    • View as best probability over all samples:
      this is Bayes' theorem
  • Training is simple
    • Just put the training set into a data structure

10
KNN Problems
  • With small K, only captures convex classes
  • High dimensionality: may be nearest in
    irrelevant attributes
  • Query time: search all training data
    • Algorithms exist to make this faster
  • But good enough to be the standard for comparison
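
The irrelevant-attribute problem can be illustrated with a small sketch, assuming Euclidean distance and hypothetical data. Adding one attribute that has nothing to do with the class changes which training item is nearest.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_label(training, query):
    """Class of the single closest training item (K = 1)."""
    return min(training, key=lambda item: euclidean(item[0], query))[1]

# One relevant attribute: the query at 2.0 is closest to the "normal" item.
relevant_only = [((1.0,), "normal"), ((9.0,), "attack")]
print(nearest_label(relevant_only, (2.0,)))

# Same items plus a hypothetical irrelevant second attribute with a wide range.
with_irrelevant = [((1.0, 50.0), "normal"), ((9.0, 3.0), "attack")]
print(nearest_label(with_irrelevant, (2.0, 4.0)))
```

In the second case the irrelevant attribute dominates the distance, so the nearest item is the one that happens to match on it.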

11
Classification and Security
  • Ideas on how to use classifiers to improve
    security
    • Intrusion detection
    • ?
  • Potential risks
    • Identifying private information based on
      similarity with training data