Title: Spam Detection
1Spam Detection
2Spam Types
- Email Spam
- Unsolicited commercial email
- Blog Spam
- Unwanted comments in blogs
- Splogs
- Fake blogs to boost PageRank
3From Learning Point of View
- Spam Detection
- Classification problem (ham vs. spam)
- Feature Extraction
- A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
- Fast Classifier
- Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman
4A Learning Approach to Spam Detection based on
Social Networks
- H.Y. Lam and D.Y. Yeung
- CEAS 2007
5Problem Statement
- n Email Accounts
- Sender Set, Receiver Set
- Labeled Sender Set s.t.
- Goal
- Assign a score to each remaining (unlabeled) sender account
6System Flow Chart
7Social Network from Logs
- Directed Graph
- Directed Edge
- Email sent from one account (the sender) to another (the recipient)
- Edge Weight
- The number of emails sent from the sender to the recipient along that edge
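The graph construction described on this slide can be sketched as follows. This is a minimal illustration, assuming the log is available as a sequence of (sender, recipient) pairs; the function name `build_email_graph` and the sample log are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def build_email_graph(log):
    """Build a weighted directed graph from an email log.

    `log` is assumed to be an iterable of (sender, recipient) pairs,
    one pair per delivered email. The result maps each directed edge
    (sender, recipient) to its weight, i.e. the number of emails sent
    along that edge.
    """
    weights = defaultdict(int)
    for sender, recipient in log:
        weights[(sender, recipient)] += 1
    return dict(weights)


# Hypothetical log: "a" wrote to "b" twice, "b" replied once, "c" wrote to "a".
log = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "a")]
graph = build_email_graph(log)
```

Storing the graph as an edge-to-weight dictionary keeps the sketch short; a real system processing large logs would more likely use an adjacency structure.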
8System Flow Chart
9Features from Email Social Networks
- In-count / Out-count
- The sum of in-coming / out-going edge weights
- In-degree / Out-degree
- The number of email accounts that a node receives
emails from / sends emails to
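A small sketch of the four count and degree features, assuming the social network is stored as a dictionary from (sender, recipient) edges to email counts. The representation and the name `count_and_degree` are illustrative assumptions.

```python
def count_and_degree(weights, node):
    """Return in/out-count and in/out-degree of `node`.

    `weights` maps directed edges (sender, recipient) to the number
    of emails sent along that edge.
    """
    incoming = {s: w for (s, r), w in weights.items() if r == node}
    outgoing = {r: w for (s, r), w in weights.items() if s == node}
    return {
        "in_count": sum(incoming.values()),    # sum of in-coming edge weights
        "out_count": sum(outgoing.values()),   # sum of out-going edge weights
        "in_degree": len(incoming),            # accounts the node receives from
        "out_degree": len(outgoing),           # accounts the node sends to
    }


# Hypothetical network: a->b (2 emails), b->a (1), c->a (1).
weights = {("a", "b"): 2, ("b", "a"): 1, ("c", "a"): 1}
features_a = count_and_degree(weights, "a")
```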
10Features from Email Social Networks
- Communication Reciprocity (CR)
- The percentage of interactive neighbors that a node has
The set of accounts that sent emails to the node
The set of accounts that received emails from the node
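The slide's formula did not survive extraction, so the following is only a sketch under an assumed definition: CR is taken as the fraction of accounts the node sent email to that also sent email back. The choice of denominator and the name `communication_reciprocity` are assumptions.

```python
def communication_reciprocity(weights, node):
    """Assumed definition: |senders_to(node) & recipients_of(node)| / |recipients_of(node)|.

    `weights` maps directed edges (sender, recipient) to email counts.
    Returns 0.0 when the node never sent any email.
    """
    senders_to = {s for (s, r) in weights if r == node}
    recipients_of = {r for (s, r) in weights if s == node}
    if not recipients_of:
        return 0.0
    return len(senders_to & recipients_of) / len(recipients_of)


# Hypothetical network: "a" and "b" talk to each other; "c" only sends.
weights = {("a", "b"): 2, ("b", "a"): 1, ("c", "a"): 1}
cr_a = communication_reciprocity(weights, "a")
cr_c = communication_reciprocity(weights, "c")
```

Under this reading, an account that sends to many addresses and never gets a reply has a CR close to zero, which is the behavior one would expect from a spam account.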
11Features from Email Social Networks
- Communication Interaction Average (CIA)
- The level of interaction between a sender and
each of the corresponding recipients
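The CIA formula is missing from the extracted slide. The sketch below assumes one plausible reading: for each recipient of the sender, take the ratio of emails received back to emails sent, then average over recipients. Both this definition and the name `communication_interaction_average` are assumptions rather than the authors' stated formula.

```python
def communication_interaction_average(weights, node):
    """Assumed definition: mean over recipients r of w(r -> node) / w(node -> r).

    `weights` maps directed edges (sender, recipient) to email counts.
    Returns 0.0 when the node never sent any email.
    """
    recipients = [r for (s, r) in weights if s == node]
    if not recipients:
        return 0.0
    ratios = [
        weights.get((r, node), 0) / weights[(node, r)]
        for r in recipients
    ]
    return sum(ratios) / len(ratios)


# Hypothetical network: a->b (2 emails), b->a (1), c->a (1).
weights = {("a", "b"): 2, ("b", "a"): 1, ("c", "a"): 1}
cia_a = communication_interaction_average(weights, "a")
cia_c = communication_interaction_average(weights, "c")
```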
12Features from Email Social Networks
- Clustering Coefficient (CC)
- Friends-of-friends relationship between email
accounts
Number of connections between neighbors of the node
Number of neighbors of the node
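The two quantities named on this slide are the ingredients of the standard clustering coefficient, 2n / (k(k - 1)), where n is the number of connections between a node's neighbors and k is the number of neighbors. The sketch below assumes that form and treats the email graph as undirected; both choices are assumptions, since the slide's own formula was lost.

```python
from itertools import combinations


def clustering_coefficient(edges, node):
    """Standard undirected clustering coefficient of `node`.

    `edges` is an iterable of (u, v) pairs; direction is ignored.
    Returns 0.0 when the node has fewer than two neighbors.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    neighbors = adjacency.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count connected pairs among the node's neighbors.
    links = sum(
        1 for u, v in combinations(sorted(neighbors), 2) if v in adjacency[u]
    )
    return 2.0 * links / (k * (k - 1))


# Hypothetical network: "a" knows b, c, d; only b and c know each other.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
cc_a = clustering_coefficient(edges, "a")
```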
13System Flow Chart
14Preprocessing
- Sender Feature Vector
-
-
- Weighted Features
-
Problematic?
15System Flow Chart
16Assigning Spam Score
- Similarity Weighted k-NN method
- Gaussian similarity
- Similarity weighted mean k-NN scores
- Score scaling
The set of k nearest neighbors
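A minimal sketch of similarity-weighted k-NN scoring with a Gaussian similarity. The kernel width `sigma`, the label encoding (1 for spam, 0 for legitimate), and the function names are assumptions; the score scaling step from the slide is not reproduced because its formula is not recoverable from the text.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def knn_spam_score(query, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean label of the k most similar labeled senders.

    `labeled` is a list of (feature_vector, label) pairs, with label 1
    for spam and 0 for legitimate senders (assumed encoding).
    """
    scored = sorted(
        ((gaussian_similarity(query, x, sigma), label) for x, label in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0.0:
        return 0.5  # no usable neighbors: fall back to an uninformative score
    return sum(sim * label for sim, label in scored) / total


# Hypothetical two-dimensional feature vectors.
labeled = [
    ((0.0, 0.0), 0),
    ((0.1, 0.0), 0),
    ((1.0, 1.0), 1),
    ((1.1, 1.0), 1),
]
score_near_spam = knn_spam_score((1.0, 0.9), labeled, k=3)
score_near_ham = knn_spam_score((0.0, 0.1), labeled, k=3)
```

Weighting by similarity means a distant neighbor that happens to fall inside the top k contributes little to the score, which softens the effect of the choice of k.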
17Experiments
- Enron Dataset: 9150 Senders
- To Get
- Legitimate Enron senders: email transactions within the Enron email domain
- 5000 generated spam accounts
- 120 senders from each class
- Results Averaged over 100 Runs
18Number of Nearest Neighbors
19Feature Weights (CC)
20Feature Weights (CIA)
21Feature Weights (CR)
22Feature Weights
- In/Out-Count, In/Out-Degree
- The smaller the better
- Final Weights
- In/Out-count, In/Out-degree: 1
- CR: 1
- CIA: 10
- CC: 15
23Conclusion
- Legitimacy Score
- No content needed
- Can Be Combined with Content-Based Filters
- More Sophisticated Classifiers
- SVM, boosting, etc
- Classifiers Using Combined Features
24Relaxed Online SVMs for Spam Filtering
- D. Sculley and G.M. Wachman
- SIGIR 2007
25Anti-Spam Controversy
- Support Vector Machines (SVMs)
- Academic Researchers
- Statistically robust
- State-of-the-art performance
- Practitioners
- Quadratic in the number of training examples
- Impractical!
- Solution: Relaxed Online SVMs
26Background SVMs
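- Data Set: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
- Class Label: $y_i = +1$ for spam, $y_i = -1$ for ham
- Classifier: $f(x) = w \cdot x + b$
- To Find: $w$ and $b$
- Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
- Constraints: $y_i f(x_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
$C$: tradeoff parameter
$\xi_i$: slack variable
$\frac{1}{2}\|w\|^2$: maximizing the margin
$C \sum_{i} \xi_i$: minimizing the loss function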
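To make the roles of the tradeoff parameter and the slack variables concrete, the sketch below evaluates the standard soft-margin SVM objective, (1/2)||w||^2 + C * sum of slacks, for a given linear classifier. It assumes the usual formulation in which each slack equals the hinge loss max(0, 1 - y f(x)); the function name `svm_objective` and the toy data are illustrative.

```python
def svm_objective(w, b, data, C):
    """Soft-margin SVM objective for a linear classifier f(x) = w.x + b.

    `data` is a list of (x, y) pairs with y = +1 for spam and -1 for ham.
    The slack of each example is taken as its hinge loss.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)  # small ||w|| means a wide margin
    total_slack = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total_slack += max(0.0, 1.0 - y * score)  # slack variable for this example
    return margin_term + C * total_slack


# Hypothetical two-example data set.
data = [((1.0, 0.0), 1), ((0.0, 1.0), -1)]
separating = svm_objective((1.0, -1.0), 0.0, data, C=100.0)
trivial = svm_objective((0.0, 0.0), 0.0, data, C=100.0)
```

With a large C, the all-zero classifier is heavily penalized for its training loss even though its margin term is zero, which is the sense in which a large C puts training error ahead of margin width.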
27Online SVMs
28Tuning the Tradeoff Parameter C
- SpamAssassin data set: 6034 examples
Large C preferred
29Email Spam and SVMs
- TREC05P-1: 92189 messages
- TREC06P: 37822 messages
30Blog Comment Spam and SVMs
- Leave One Out Cross Validation
- 50 Blog Posts, 1024 Comments
31Splogs and SVMs
- Leave One Out Cross Validation
- 1380 Examples
32Computational Cost
- Online SVMs: Quadratic Training Time
33Relaxed Online SVMs (ROSVM)
- Objective Function of SVMs
- Large C Preferred
- Minimizing training error more important than maximizing the margin
- ROSVM
- Full margin maximization not necessary
- Relax this requirement
34Three Ways to Relax SVMs (1)
- Only Optimize Over the Recent p Examples
- Dual form of SVMs
- Constraints
The last value found for $\alpha_j$ when example $j$ was still among the most recent $p$ examples
35Three Ways to Relax SVMs (2)
- Only Update on Actual Errors
- Original online SVMs
- Update when $y_i f(x_i) < 1$
- ROSVM
- Update when $y_i f(x_i) \le M$, with $0 \le M \le 1$
- $M = 0$: mistake-driven online SVMs
- No significant degradation in performance
- Significantly reduced cost
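The difference between the two update rules can be sketched as two predicates on the signed margin y * f(x). The thresholds used here (strictly below 1 for the original online SVM, at or below a parameter M for the relaxed version, with M = 0 giving the mistake-driven case) are an assumed reading of the slide, whose formulas were lost; the function names are hypothetical.

```python
def online_svm_should_update(y, score):
    """Original rule (assumed): re-train whenever the example is inside the margin."""
    return y * score < 1.0


def rosvm_should_update(y, score, M=0.0):
    """Relaxed rule (assumed): re-train only when y * f(x) <= M, with 0 <= M <= 1.

    M = 0 corresponds to updating only on actual classification errors.
    """
    return y * score <= M


# Hypothetical stream of (label, classifier score) pairs.
stream = [(1, 0.4), (1, -0.2), (-1, -2.0), (-1, 0.3), (1, 0.9)]
original_updates = sum(online_svm_should_update(y, s) for y, s in stream)
relaxed_updates = sum(rosvm_should_update(y, s, M=0.0) for y, s in stream)
```

In this toy stream the relaxed rule triggers on the two misclassified examples only, while the original rule also triggers on correctly classified examples that fall inside the margin.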
36Three Ways to Relax SVMs (3)
- Reduce the Number of Iterations in Iterative SVMs
- SMO: repeated passes over the training set to minimize the objective function
- Parameter T: the maximum number of iterations
- T = 1: little impact on performance
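The three relaxations can be illustrated together in one toy online learner. This is not the authors' ROSVM: it swaps the SMO solver for plain hinge-loss subgradient steps on a linear model, and the names `relaxed_online_learner`, `lr`, and the toy stream are all assumptions. It only shows how a window of the recent p examples, an update threshold M, and an iteration cap T fit into one loop.

```python
from collections import deque


def relaxed_online_learner(stream, p=100, M=0.0, T=1, lr=0.5):
    """Toy online linear learner mimicking the three relaxations.

    - keeps only the most recent `p` examples (reduced size)
    - re-trains only when y * f(x) <= M (reduced updates)
    - makes at most `T` passes over the window per update (reduced iterations)
    Uses hinge-loss subgradient steps instead of SMO (a simplification).
    """
    def score(w, b, x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    w, b = None, 0.0
    window = deque(maxlen=p)
    updates = 0
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)
        prediction = score(w, b, x)   # classify before learning from the example
        window.append((x, y))
        if y * prediction <= M:
            updates += 1
            for _ in range(T):
                for xi, yi in window:
                    if yi * score(w, b, xi) < 1.0:  # inside the margin: take a step
                        w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                        b += lr * yi
    return w, b, updates


# Hypothetical separable stream: first coordinate marks spam, second marks ham.
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1)] * 5
w, b, updates = relaxed_online_learner(stream, p=4, M=0.0, T=1)
```

On this stream the learner updates on the first two examples and then leaves the model untouched, since every later example is classified correctly.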
37Testing Reduced Size
38Testing Reduced Iterations
39Testing Reduced Updates
40Online SVMs and ROSVM
Email Spam
Blog Comment Spam
Splog Data Set