Countering Spam Using Classification Techniques

1
Countering Spam Using Classification Techniques
  • Steve Webb
  • webb@cc.gatech.edu
  • Data Mining Guest Lecture
  • February 21, 2008

2
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

3
Introduction
  • The Internet has spawned numerous
    information-rich environments
  • Email Systems
  • World Wide Web
  • Social Networking Communities
  • Openness facilitates information sharing, but it
    also makes these environments vulnerable

4
Denial of Information (DoI) Attacks
  • Deliberate insertion of low quality information
    (or noise) into information-rich environments
  • The information analog of Denial of Service (DoS)
    attacks
  • Two goals
  • Promotion of ideals by means of deception
  • Denial of access to high quality information
  • Spam is currently the most prominent example
    of a DoI attack

5
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

6
Countering Email Spam
  • Close to 200 billion (yes, billion) emails are
    sent each day
  • Spam accounts for around 90% of that email
    traffic
  • 2 million spam messages every second

7
Old Email Spam Examples
8
Problem Description
  • Email spam detection can be modeled as a binary
    text classification problem
  • Two classes: spam and legitimate (non-spam)
  • Example of supervised learning
  • Build a model (classifier) based on training data
    to approximate the target function
  • Construct a function f : M → {spam, legitimate}
    such that it overlaps F : M → {spam, legitimate}
    as much as possible
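
As a rough illustration of this formulation, the following minimal sketch learns such a mapping from a handful of labeled messages. It assumes scikit-learn is available, and the tiny training set is made up purely for illustration.

```python
# Minimal sketch: email spam detection as supervised binary text classification.
# Assumes scikit-learn; the labeled training messages are purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "cheap meds buy now limited offer",          # spam
    "win a free prize click here",               # spam
    "meeting moved to 3pm see agenda attached",  # legitimate
    "lecture notes for the data mining class",   # legitimate
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate

# Build a model (classifier) from training data to approximate the target
# function F : M -> {spam, legitimate}.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
classifier = MultinomialNB().fit(X, labels)

# Apply the learned function f to a previously unseen message.
print(classifier.predict(vectorizer.transform(["free prize offer click now"])))
```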

9
Problem Description (cont.)
  • How do we represent a message?
  • How do we generate features?
  • How do we process features?
  • How do we evaluate performance?

10
How do we represent a message?
  • Classification algorithms require a consistent
    format
  • Salton's vector space model (bag of words) is
    the most popular representation
  • Each message m is represented as a feature vector
    f of n features ⟨f1, f2, …, fn⟩
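
A hand-rolled sketch of that representation; the vocabulary and message below are made up for illustration. Real filters use far larger vocabularies and sparse vectors, but the mapping is the same.

```python
# Sketch: represent a message m as a feature vector <f1, f2, ..., fn> where
# fi counts how often vocabulary term i occurs in m (bag of words).
from collections import Counter

vocabulary = ["free", "prize", "meeting", "agenda", "click"]  # illustrative

def to_feature_vector(message, vocab):
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocab]

print(to_feature_vector("Click here to claim your FREE free prize", vocabulary))
# -> [2, 1, 0, 0, 1]
```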

11
How do we generate features?
  • Sources of information
  • SMTP connections
  • Network properties
  • Email headers
  • Social networks
  • Email body
  • Textual parts
  • URLs
  • Attachments
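
A sketch of pulling features from several of these sources (headers, body text, embedded URLs) with Python's standard email module; the raw message and the feature names are hypothetical.

```python
# Sketch: generate features from multiple sources within one message
# (headers, textual body, URLs). The raw message is made up.
import email
import re
from email import policy

raw = (b"From: promo@example.com\r\n"
       b"Subject: Limited offer\r\n"
       b"\r\n"
       b"Buy now at http://cheap-meds.example.com before midnight!\r\n")

msg = email.message_from_bytes(raw, policy=policy.default)
body = msg.get_content()

features = {
    "subject_tokens": msg["Subject"].lower().split(),  # from headers
    "body_tokens": body.lower().split(),               # from textual parts
    "urls": re.findall(r"https?://\S+", body),         # from embedded URLs
    "sender_domain": msg["From"].split("@")[-1],       # crude header-derived hint
}
print(features)
```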

12
How do we process features?
  • Feature Tokenization
  • Alphanumeric tokens
  • N-grams
  • Phrases
  • Feature Scrubbing
  • Stemming
  • Stop word removal
  • Feature Selection
  • Simple feature removal
  • Information-theoretic algorithms
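
A sketch of one possible processing pipeline with scikit-learn: unigram and bigram tokenization, stop-word removal, and chi-squared scoring standing in for the information-theoretic selection criteria mentioned above. The training data is illustrative.

```python
# Sketch: tokenize into unigrams/bigrams, drop stop words, then keep only the
# k highest-scoring features. Chi-squared is used here as a simple stand-in
# for information-theoretic selection. Training data is illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

messages = ["free prize click now", "win free money fast",
            "project meeting agenda", "notes from the lecture"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(messages)

selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(vectorizer.get_feature_names_out()[selector.get_support()])
```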

13
How do we evaluate performance?
  • Traditional IR metrics
  • Precision vs. Recall
  • False positives vs. False negatives
  • Imbalanced error costs
  • ROC curves
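
A sketch of computing those metrics from a set of predictions; the labels and scores are made up. In spam filtering a false positive (a legitimate message lost to the spam folder) is usually costed far higher than a false negative.

```python
# Sketch: precision/recall, false positives vs. false negatives, and ROC AUC.
# True labels, predictions, and scores are made up for illustration.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]                    # 1 = spam, 0 = legitimate
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]                    # hard decisions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.95]  # classifier confidences

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (legitimate flagged as spam):", fp)
print("false negatives (spam that got through):", fn)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_score))    # area under the ROC curve
```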

14
Classification History
  • Sahami et al. (1998)
  • Used a Naïve Bayes classifier
  • Were the first to apply text classification
    research to the spam problem
  • Pantel and Lin (1998)
  • Also used a Naïve Bayes classifier
  • Found that Naïve Bayes outperforms RIPPER

15
Classification History (cont.)
  • Drucker et al. (1999)
  • Evaluated Support Vector Machines as a solution
    to spam
  • Found that SVM is more effective than RIPPER and
    Rocchio
  • Hidalgo and Lopez (2000)
  • Found that decision trees (C4.5) outperform Naïve
    Bayes and k-NN

16
Classification History (cont.)
  • Up to this point, private corpora were used
    exclusively in email spam research
  • Androutsopoulos et al. (2000a)
  • Created the first publicly available email spam
    corpus (Ling-spam)
  • Performed various feature set size, training set
    size, stemming, and stop-list experiments with a
    Naïve Bayes classifier

17
Classification History (cont.)
  • Androutsopoulos et al. (2000b)
  • Created another publicly available email spam
    corpus (PU1)
  • Confirmed previous research that Naïve Bayes
    outperforms a keyword-based filter
  • Carreras and Marquez (2001)
  • Used PU1 to show that AdaBoost is more effective
    than decision trees and Naïve Bayes

18
Classification History (cont.)
  • Androutsopoulos et al. (2004)
  • Created 3 more publicly available corpora (PU2,
    PU3, and PUA)
  • Compared Naïve Bayes, Flexible Bayes, Support
    Vector Machines, and LogitBoost; FB, SVM, and LB
    outperform NB
  • Zhang et al. (2004)
  • Used Ling-spam, PU1, and the SpamAssassin corpora
  • Compared Naïve Bayes, Support Vector Machines,
    and AdaBoost; SVM and AB outperform NB

19
Classification History (cont.)
  • CEAS (2004–present)
  • Focuses solely on email and anti-spam research
  • Generates a significant amount of academic and
    industry anti-spam research
  • Klimt and Yang (2004)
  • Published the Enron Corpus, the first
    large-scale corpus of legitimate email messages
  • TREC Spam Track (2005–present)
  • Produces new corpora every year
  • Provides a standardized platform to evaluate
    classification algorithms

20
Ongoing Research
  • Concept Drift
  • New Classification Approaches
  • Adversarial Classification
  • Image Spam

21
Concept Drift
  • Spam content is extremely dynamic
  • Topic drift (e.g., specific scams)
  • Technique drift (e.g., obfuscations)
  • How do we keep up with the Joneses?
  • Batch vs. Online Learning
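
One way to cope with drift is online (incremental) learning rather than periodic batch retraining. Below is a minimal sketch using scikit-learn's partial_fit; the message batches are made up, and a hashing vectorizer is used so new vocabulary never changes the feature space.

```python
# Sketch: online learning in the face of concept drift. partial_fit lets the
# filter absorb newly labeled messages without retraining from scratch.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# A fixed-size hashed feature space means drifting vocabulary (new scams,
# obfuscated tokens) never forces the vectorizer to be re-fit.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
classifier = MultinomialNB()

batches = [
    (["cheap meds buy now", "project meeting at noon"], [1, 0]),
    (["v1agra s@le today", "lecture slides attached"], [1, 0]),  # drifted spam
]
for texts, labels in batches:
    X = vectorizer.transform(texts)
    classifier.partial_fit(X, labels, classes=[0, 1])

print(classifier.predict(vectorizer.transform(["s@le on cheap meds"])))
```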

22
New Classification Approaches
  • Filter Fusion
  • Compression-based Filtering
  • Network behavioral clustering
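
As a rough illustration of the compression-based idea, a general-purpose compressor can stand in for a class-conditional language model: a message is assigned to the class whose training text lets it compress most cheaply. A minimal sketch with zlib; the corpora are made up.

```python
# Sketch: compression-based filtering with a general-purpose compressor.
# Assign the message to the class whose corpus makes it cheapest to compress
# (a rough proxy for cross-entropy). Corpora are illustrative.
import zlib

spam_corpus = b"free prize click now cheap meds limited offer win money"
ham_corpus = b"meeting agenda lecture notes project deadline slides attached"

def extra_bytes(corpus, message):
    # How many extra compressed bytes does the message add to the corpus?
    return len(zlib.compress(corpus + b" " + message)) - len(zlib.compress(corpus))

def classify(message):
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "legitimate"

print(classify(b"win a free prize now"))             # likely spam
print(classify(b"notes from the project meeting"))   # likely legitimate
```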

23
Adversarial Classification
  • Classifiers assume a clear distinction between
    spam and legitimate features
  • Camouflaged messages
  • Mask spam content with legitimate content
  • Disrupt decision boundaries for classifiers

24
Camouflage Attacks
  • Baseline performance
  • Accuracies consistently higher than 98%
  • Classifiers under attack
  • Accuracies degrade to between 50% and 70%
  • Retrained classifiers
  • Accuracies climb back to between 91% and 99%

25
Camouflage Attacks (cont.)
  • Retraining postpones the problem, but it doesn't
    solve it
  • We can identify features that are less
    susceptible to attack, but that's simply another
    stalling technique

26
Image Spam
  • What happens when an email does not contain
    textual features?
  • OCR is easily defeated
  • Classification using image properties
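
A sketch of extracting simple image properties that could feed such a classifier; it assumes the Pillow library is installed, and the file name is hypothetical.

```python
# Sketch: derive simple image properties for classifying image spam, where the
# message text lives inside an attached image. Assumes Pillow; the file name
# is hypothetical.
import os
from PIL import Image

def image_features(path):
    with Image.open(path) as img:
        width, height = img.size
        # getcolors returns None if the image has more than maxcolors colors.
        colors = img.convert("RGB").getcolors(maxcolors=2**16)
        return {
            "width": width,
            "height": height,
            "aspect_ratio": width / height,
            "file_size_bytes": os.path.getsize(path),
            "distinct_colors": len(colors) if colors else 2**16,
            "format": img.format,
        }

# print(image_features("suspicious_attachment.gif"))  # hypothetical file
```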

27
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

28
Countering Web Spam
  • What is web spam?
  • Traditional definition
  • Our definition
  • Between 13.8% and 22.1% of all web pages

29
Ad Farms
  • Only contain advertising links (usually ad
    listings)
  • Elaborate entry pages used to deceive visitors

30
Ad Farms (cont.)
  • Clicking on an entry page link leads to an ad
    listing
  • Ad syndicators provide the content
  • Web spammers create the HTML structures

31
Parked Domains
  • Domain parking services
  • Provide placeholders for newly registered
    domains
  • Allow ad listings to be used as placeholders to
    monetize a domain
  • Inevitably, web spammers abused these services

32
Parked Domains (cont.)
  • Functionally equivalent to Ad Farms
  • Both rely on ad syndicators for content
  • Both provide little to no value to their visitors
  • Unique Characteristics
  • Reliance on domain parking services (e.g.,
    apps5.oingo.com, searchportal.information.com,
    etc.)
  • Typically for sale by owner ("Offer To Buy This
    Domain")

33
Parked Domains (cont.)
34
Advertisements
  • Pages advertising specific products or services
  • Examples of the kinds of pages being advertised
    in Ad Farms and Parked Domains

35
Problem Description
  • Web spam detection can also be modeled as a
    binary text classification problem
  • Salton's vector space model is quite common
  • Feature processing and performance evaluation are
    also quite similar
  • But what about feature generation?

36
How do we generate features?
  • Sources of information
  • HTTP connections
  • Hosting IP addresses
  • Session headers
  • HTML content
  • Textual properties
  • Structural properties
  • URL linkage structure
  • PageRank scores
  • Neighbor properties
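
A sketch combining two of these sources, the HTTP session headers and the HTML content; the sample response is made up, and graph-based signals such as PageRank would come from a separate crawl.

```python
# Sketch: derive web spam features from an HTTP response's headers and its
# HTML content. The sample response is made up; link-graph features like
# PageRank or neighbor properties would be computed separately.
import re

headers = {"Server": "Apache", "X-Powered-By": "PHP/5.2", "Content-Type": "text/html"}
html = ("<html><body>Cheap pills <a href='http://a.example'>buy</a> "
        "<a href='http://b.example'>now</a></body></html>")

visible_text = re.sub(r"<[^>]+>", " ", html)
out_links = re.findall(r"href=['\"]([^'\"]+)['\"]", html)

features = {
    "num_headers": len(headers),                              # HTTP session
    "server": headers.get("Server", ""),
    "num_out_links": len(out_links),                          # linkage structure
    "visible_text_fraction": len(visible_text) / len(html),   # textual property
}
print(features)
```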

37
Classification History
  • Davison (2000)
  • Was the first to investigate link-based web spam
  • Built decision trees to successfully identify
    nepotistic links
  • Becchetti et al. (2005)
  • Revisited the use of decision trees to identify
    link-based web spam
  • Used link-based features such as PageRank and
    TrustRank scores

38
Classification History
  • Drost and Scheffer (2005)
  • Used Support Vector Machines to classify web spam
    pages
  • Relied on content-based features as well as
    link-based features
  • Ntoulas et al. (2006)
  • Built decision trees to classify web spam
  • Used content-based features (e.g., fraction of
    visible content, compressibility, etc.)

39
Classification History
  • Up to this point, previous web spam research was
    limited to small (on the order of a few
    thousand), private data sets
  • Webb et al. (2006)
  • Presented the Webb Spam Corpus, a
    first-of-its-kind large-scale, publicly available
    web spam corpus (almost 350K web spam pages)
  • http://www.webbspamcorpus.org
  • Castillo et al. (2006)
  • Presented the WEBSPAM-UK2006 corpus, a publicly
    available web spam corpus (only contains 1,924
    web spam pages)

40
Classification History
  • Castillo et al. (2007)
  • Created a cost-sensitive decision tree to
    identify web spam in the WEBSPAM-UK2006 data set
  • Used link-based features from Becchetti et al.
    (2005) and content-based features from Ntoulas
    et al. (2006)
  • Webb et al. (2008)
  • Compared various classifiers (e.g., SVM, decision
    trees, etc.) using HTTP session information
    exclusively
  • Used the Webb Spam Corpus, WebBase data, and the
    WEBSPAM-UK2006 data set
  • Found that these classifiers are comparable to
    (and in many cases, better than) existing
    approaches

41
Ongoing Research
  • Redirection
  • Phishing
  • Social Spam

42
Redirection
  • 144,801 unique redirect chains (1.54 HTTP
    redirects per chain, on average)
  • 43.9% of web spam pages use some form of HTML or
    JavaScript redirection
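
A sketch of flagging the HTML and JavaScript redirection tricks referred to above (meta refresh tags and window.location rewrites); the patterns are illustrative rather than an exhaustive detector.

```python
# Sketch: flag meta-refresh and JavaScript redirection in a page's HTML.
# Patterns and the sample page are illustrative, not exhaustive.
import re

META_REFRESH = re.compile(r"<meta[^>]*http-equiv=['\"]?refresh", re.IGNORECASE)
JS_REDIRECT = re.compile(r"(window\.location|document\.location)\s*(\.href)?\s*=",
                         re.IGNORECASE)

def uses_redirection(html):
    return bool(META_REFRESH.search(html) or JS_REDIRECT.search(html))

page = '<html><head><meta http-equiv="refresh" content="0;url=http://ads.example"></head></html>'
print(uses_redirection(page))  # True
```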

43
Phishing
  • Interesting form of deception that affects email
    and web users
  • Another form of adversarial classification

44
Social Spam
  • Comment spam
  • Bulletin spam
  • Message spam

45
Conclusions
  • Email and web spam are currently two of the
    largest information security problems
  • Classification techniques offer an effective way
    to filter this low quality information
  • Spammers are extremely dynamic, generating
    various areas of important future research

46
Questions