1
Spam Detection
  • Jingrui He
  • 10/08/2007

2
Spam Types
  • Email Spam
  • Unsolicited commercial email
  • Blog Spam
  • Unwanted comments in blogs
  • Splogs
  • Fake blogs to boost PageRank

3
From a Learning Point of View
  • Spam Detection
  • Classification problem (ham vs. spam)
  • Feature Extraction
  • A Learning Approach to Spam Detection based on
    Social Networks. H.Y. Lam and D.Y. Yeung
  • Fast Classifier
  • Relaxed Online SVMs for Spam Filtering. D.
    Sculley, G.M. Wachman

4
A Learning Approach to Spam Detection based on
Social Networks
  • H.Y. Lam and D.Y. Yeung
  • CEAS 2007

5
Problem Statement
  • n Email Accounts
  • Sender set S and receiver set R
  • Labeled sender set L ⊆ S, s.t. every account in L is labeled spam or
    legitimate
  • Goal
  • Assign a label to each remaining account in S \ L

6
System Flow Chart
7
Social Network from Logs
  • Directed Graph over email accounts
  • Directed Edge e_ij
  • An email sent from account v_i to account v_j
  • Edge Weight w_ij
  • The number of emails sent from v_i to v_j

8
System Flow Chart
9
Features from Email Social Networks
  • In-count / Out-count
  • The sum of in-coming / out-going edge weights
  • In-degree / Out-degree
  • The number of email accounts that a node receives
    emails from / sends emails to

10
Features from Email Social Networks
  • Communication Reciprocity (CR)
  • The percentage of interactive neighbors that a
    node has

S(v_i): the set of accounts that sent emails to v_i
R(v_i): the set of accounts that received emails from v_i
CR(v_i) = |S(v_i) ∩ R(v_i)| / |R(v_i)|
11
Features from Email Social Networks
  • Communication Interaction Average (CIA)
  • The level of interaction between a sender and
    each of the corresponding recipients

12
Features from Email Social Networks
  • Clustering Coefficient (CC)
  • Friends-of-friends relationship between email
    accounts

Number of connections between the neighbors of v_i
Number of neighbors of v_i
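To make the features on slides 9-12 concrete, here is a minimal sketch (my own reading, not code from the paper) that computes them from a list of (sender, receiver) pairs parsed from email logs; the CIA computation in particular is one plausible interpretation of slide 11.

```python
from collections import defaultdict

def build_graph(log):
    """w[u][v] = number of emails sent from u to v (edge weight w_uv)."""
    w = defaultdict(lambda: defaultdict(int))
    for sender, receiver in log:
        w[sender][receiver] += 1
    return w

def social_features(node, w):
    out_neighbors = set(w.get(node, {}))                      # accounts node sent to
    in_neighbors = {u for u in w if node in w[u]}              # accounts that sent to node
    out_count = sum(w.get(node, {}).values())                  # sum of outgoing edge weights
    in_count = sum(w[u][node] for u in in_neighbors)           # sum of incoming edge weights
    out_degree, in_degree = len(out_neighbors), len(in_neighbors)

    # Communication Reciprocity: percentage of recipients that also wrote back.
    interactive = out_neighbors & in_neighbors
    cr = len(interactive) / out_degree if out_degree else 0.0

    # Communication Interaction Average (one plausible reading of slide 11):
    # average back-and-forth volume between the sender and each recipient.
    cia = (sum(min(w[node][v], w.get(v, {}).get(node, 0)) for v in out_neighbors)
           / out_degree if out_degree else 0.0)

    # Clustering Coefficient: connections among neighbors, normalized here by
    # the number of possible ordered neighbor pairs (normalization assumed).
    nbrs = out_neighbors | in_neighbors
    k = len(nbrs)
    links = sum(1 for u in nbrs for v in nbrs if u != v and v in w.get(u, {}))
    cc = links / (k * (k - 1)) if k > 1 else 0.0

    return {"in_count": in_count, "out_count": out_count,
            "in_degree": in_degree, "out_degree": out_degree,
            "CR": cr, "CIA": cia, "CC": cc}
```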
13
System Flow Chart
14
Preprocessing
  • Sender Feature Vector
  • x_i = (in-count, out-count, in-degree, out-degree, CR, CIA, CC) for sender v_i
  • Weighted Features
  • Each feature is scaled by a fixed weight before comparing senders

Problematic?
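Reading slide 14 together with the final weights reported on slide 22, a sketch of the weighted feature vector could look like the following; the feature ordering and the helper name sender_vector are assumptions.

```python
import numpy as np

# Weight values taken from slide 22; the feature ordering is assumed.
FEATURE_ORDER = ["in_count", "out_count", "in_degree", "out_degree", "CR", "CIA", "CC"]
WEIGHTS = {"in_count": 1, "out_count": 1, "in_degree": 1, "out_degree": 1,
           "CR": 1, "CIA": 10, "CC": 15}

def sender_vector(feats):
    """Turn a per-sender feature dict (e.g. from social_features above) into
    the weighted feature vector used when comparing senders."""
    return np.array([WEIGHTS[name] * feats[name] for name in FEATURE_ORDER], dtype=float)
```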
15
System Flow Chart
16
Assigning Spam Score
  • Similarity Weighted k-NN Method
  • Gaussian similarity: sim(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
  • Similarity weighted mean of k-NN scores:
    score(v_i) = Σ_{j ∈ N_k(i)} sim(x_i, x_j) · score_j / Σ_{j ∈ N_k(i)} sim(x_i, x_j)
  • Score scaling

N_k(i): the set of k nearest labeled neighbors of v_i
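A minimal sketch of this scoring step, assuming labeled senders carry a numeric legitimacy score (e.g. 1 for legitimate, 0 for spam) and omitting the final score-scaling step:

```python
import numpy as np

def knn_score(x, X_labeled, labels, k=20, sigma=1.0):
    """Similarity-weighted k-NN score for a (weighted) sender feature vector x.
    X_labeled: (n, d) array of labeled sender vectors; labels: array of their
    scores. sigma is an assumed width for the Gaussian similarity."""
    d2 = np.sum((X_labeled - x) ** 2, axis=1)        # squared distances to labeled senders
    sim = np.exp(-d2 / (2 * sigma ** 2))             # Gaussian similarity
    nn = np.argsort(-sim)[:k]                        # indices of the k most similar senders
    return float(np.dot(sim[nn], labels[nn]) / np.sum(sim[nn]))  # similarity-weighted mean
```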
17
Experiments
  • Enron Dataset: 9150 Senders
  • To get legitimate senders: Enron senders with email transactions
    within the Enron email domain
  • 5000 generated spam accounts
  • 120 senders from each class
  • Results Averaged over 100 Runs

18
Number of Nearest Neighbors
19
Feature Weights (CC)
20
Feature Weights (CIA)
21
Feature Weights (CR)
22
Feature Weights
  • In/Out-Count and In/Out-Degree
  • The smaller the weight, the better
  • Final Weights
  • In/Out-count, In/Out-degree: 1
  • CR: 1
  • CIA: 10
  • CC: 15

23
Conclusion
  • Legitimacy Score
  • No email content needed
  • Can Be Combined with Content-Based Filters
  • More Sophisticated Classifiers
  • SVM, boosting, etc.
  • Classifiers Using Combined Features

24
Relaxed Online SVMs for Spam Filtering
  • D. Sculley and G.M. Wachman
  • SIGIR 2007

25
Anti-Spam Controversy
  • Support Vector Machines (SVMs)
  • Academic Researchers
  • Statistically robust
  • State-of-the-art performance
  • Practitioners
  • Quadratic in the number of training examples
  • Impractical!
  • Solution: Relaxed Online SVMs

26
Background SVMs
  • Data Set: {(x_i, y_i)}, i = 1, ..., n
  • Class Label: y_i = +1 for spam, -1 for ham
  • Classifier: f(x) = w · x + b
  • To Find w and b
  • Minimize: (1/2)||w||² + C Σ_i ξ_i
  • Constraints: y_i (w · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0

C: tradeoff parameter
ξ_i: slack variable
(1/2)||w||²: maximizing the margin
Σ_i ξ_i: minimizing the loss function
27
Online SVMs
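The figure on slide 27 is not reproduced in this transcript. As I read the setup, the baseline online SVM classifies each incoming message and then re-trains on everything seen so far, which is what makes the total training cost roughly quadratic; a sketch (helper names assumed, not the authors' code):

```python
from sklearn.svm import SVC

def online_svm(stream, C=100.0):
    """Baseline online SVM: classify each message, then re-fit on all examples
    seen so far. stream yields (feature_vector, label) pairs with labels in
    {+1, -1}; C is the tradeoff parameter (large C preferred per slide 28)."""
    X, y, predictions = [], [], []
    clf = None
    for x, label in stream:
        # Classify before seeing the label (default to -1 until a model exists).
        predictions.append(int(clf.predict([x])[0]) if clf is not None else -1)
        X.append(x)
        y.append(label)
        if len(set(y)) > 1:                 # SVC needs examples of both classes
            clf = SVC(kernel="linear", C=C).fit(X, y)
    return predictions
```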
28
Tuning the Tradeoff Parameter C
  • Spamassassin data set: 6034 examples

Large C preferred
29
Email Spam and SVMs
  • TREC05P-1: 92189 messages
  • TREC06P: 37822 messages

30
Blog Comment Spam and SVMs
  • Leave One Out Cross Validation
  • 50 Blog Posts, 1024 Comments

31
Splogs and SVMs
  • Leave One Out Cross Validation
  • 1380 Examples

32
Computational Cost
  • Online SVMs: Quadratic Training Time

33
Relaxed Online SVMs (ROSVM)
  • Objective Function of SVMs
  • Large C Preferred
  • Minimizing training error more important than
    maximizing the margin
  • ROSVM
  • Full margin maximization not necessary
  • Relax this requirement

34
Three Ways to Relax SVMs (1)
  • Only Optimize Over the Recent p Examples
  • Dual form of SVMs: optimize over the coefficients α_i
  • Constraints: 0 ≤ α_i ≤ C for the p most recent examples
  • For older examples, α_i is fixed at the last value found for it while the
    example was still among the p most recent
35
Three Ways to Relax SVMs (2)
  • Only Update on Actual Errors
  • Original online SVMs
  • Update when y · f(x) < 1 (example inside the margin)
  • ROSVM
  • Update when y · f(x) < M, with M < 1
  • M = 0: mistake-driven online SVMs
  • No significant degradation in performance
  • Significantly reduced cost

36
Three Ways to Relax SVMs (3)
  • Reduce the Number of Iterations in Iterative
    SVMs
  • SMO: repeated passes over the training set to
    minimize the objective function
  • Parameter T: the maximum number of iterations
  • T = 1: little impact on performance
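Putting the three relaxations on slides 34-36 together, one rough sketch (assumptions: a sliding window stands in for fixing the old α_i, and the solver's iteration cap stands in for limiting SMO passes; this is not the authors' implementation):

```python
import numpy as np
from sklearn.svm import SVC

def rosvm(stream, C=100.0, p=500, M=0.0, T=1):
    """Relaxed Online SVM sketch. stream yields (feature_vector, label) pairs
    with labels in {+1, -1}."""
    X, y, predictions = [], [], []
    clf = None
    for x, label in stream:
        score = float(clf.decision_function([x])[0]) if clf is not None else 0.0
        predictions.append(1 if score > 0 else -1)
        # Relaxation 2: update only when y * f(x) < M (M = 0 is mistake-driven).
        if clf is None or label * score < M:
            X.append(x)
            y.append(label)
            # Relaxation 1 (approximated): keep only the p most recent examples;
            # the paper instead freezes the old alphas rather than discarding them.
            X, y = X[-p:], y[-p:]
            if len(set(y)) > 1:
                # Relaxation 3 (approximated): cap solver iterations instead of
                # capping SMO passes inside an incremental solver.
                clf = SVC(kernel="linear", C=C, max_iter=T * len(y)).fit(np.array(X), y)
    return predictions
```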

37
Testing Reduced Size
38
Testing Reduced Iterations
39
Testing Reduced Updates
40
Online SVMs and ROSVM
  • ROSVM compared with Online SVMs on three data sets:
  • Email Spam
  • Blog Comment Spam
  • Splog Data Set