Title: Countering Spam Using Classification Techniques
1. Countering Spam Using Classification Techniques
- Steve Webb
- webb_at_cc.gatech.edu
- Data Mining Guest Lecture
- February 21, 2008
2. Overview
- Introduction
- Countering Email Spam
- Problem Description
- Classification History
- Ongoing Research
- Countering Web Spam
- Problem Description
- Classification History
- Ongoing Research
- Conclusions
3. Introduction
- The Internet has spawned numerous information-rich environments
- Email Systems
- World Wide Web
- Social Networking Communities
- Openness facilitates information sharing, but it also makes these environments vulnerable
4. Denial of Information (DoI) Attacks
- Deliberate insertion of low-quality information (or noise) into information-rich environments
- The information analog of Denial of Service (DoS) attacks
- Two goals
- Promotion of ideals by means of deception
- Denial of access to high-quality information
- Spam is currently the most prominent example of a DoI attack
5. Overview
- Introduction
- Countering Email Spam
- Problem Description
- Classification History
- Ongoing Research
- Countering Web Spam
- Problem Description
- Classification History
- Ongoing Research
- Conclusions
6. Countering Email Spam
- Close to 200 billion (yes, billion) emails are sent each day
- Spam accounts for around 90% of that email traffic
- That is roughly 2 million spam messages every second
7. Old Email Spam Examples
8. Problem Description
- Email spam detection can be modeled as a binary text classification problem
- Two classes: spam and legitimate (non-spam)
- Example of supervised learning
- Build a model (classifier) based on training data to approximate the target function
- Construct a function f: M → {spam, legitimate} such that it overlaps the true labeling F: M → {spam, legitimate} as much as possible (see the sketch below)
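As a hedged illustration of this setup (not from the original slides), here is a minimal supervised-learning sketch in Python; scikit-learn is assumed to be available, and the toy messages and labels are invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented): messages M with labels drawn from
# the two classes named above, spam and legitimate.
messages = ["cheap meds buy now", "lunch meeting at noon",
            "you won a free prize", "quarterly report attached"]
labels = ["spam", "legitimate", "spam", "legitimate"]

# Build a model f from the training data to approximate the true
# labeling F: M -> {spam, legitimate}.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(messages, labels)

print(classifier.predict(["free meds now"]))  # likely ['spam']
```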
9. Problem Description (cont.)
- How do we represent a message?
- How do we generate features?
- How do we process features?
- How do we evaluate performance?
10. How do we represent a message?
- Classification algorithms require a consistent format
- Salton's vector space model (bag of words) is the most popular representation
- Each message m is represented as a feature vector f of n features: <f1, f2, …, fn> (see the sketch below)
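A quick sketch of that representation (the vocabulary is invented for illustration): each message becomes a vector of term counts over a fixed vocabulary.

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    # Map message m to the feature vector <f1, f2, ..., fn>, where fi
    # is the number of times the i-th vocabulary term occurs in m.
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["free", "offer", "meeting", "report"]        # invented example
print(bag_of_words("FREE free offer inside", vocabulary))  # [2, 1, 0, 0]
```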
11. How do we generate features?
- Sources of information
- SMTP connections
- Network properties
- Email headers
- Social networks
- Email body
- Textual parts
- URLs
- Attachments
12. How do we process features?
- Feature Tokenization
- Alphanumeric tokens
- N-grams
- Phrases
- Feature Scrubbing
- Stemming
- Stop word removal
- Feature Selection
- Simple feature removal
- Information-theoretic algorithms (see the sketch below)
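A minimal sketch of the tokenization, scrubbing, and simple feature-removal steps above (the stop-word list and document-frequency threshold are invented; stemming and information-theoretic selection are omitted for brevity):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "to", "of"}  # illustrative subset

def tokenize(text):
    # Feature tokenization: lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def scrub(tokens):
    # Feature scrubbing: stop-word removal (stemming omitted here).
    return [t for t in tokens if t not in STOP_WORDS]

def select_features(tokenized_corpus, min_df=2):
    # Simple feature removal: keep tokens appearing in at least min_df
    # messages; an information-theoretic algorithm would then rank the
    # survivors (e.g., by information gain).
    document_frequency = Counter()
    for tokens in tokenized_corpus:
        document_frequency.update(set(tokens))
    return {t for t, df in document_frequency.items() if df >= min_df}
```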
13. How do we evaluate performance?
- Traditional IR metrics
- Precision vs. Recall
- False positives vs. False negatives
- Imbalanced error costs
- ROC curves (see the metrics sketch below)
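For concreteness, a sketch of these metrics (the cost weight lam is an invented example; weighted cost schemes along these lines appear in the spam-filtering literature):

```python
def precision_recall(tp, fp, fn):
    # Treating spam as the positive class:
    #   precision = fraction of flagged messages that really are spam
    #   recall    = fraction of actual spam that was caught
    return tp / (tp + fp), tp / (tp + fn)

def weighted_error(fp, fn, lam=9):
    # Imbalanced error costs: a false positive (legitimate mail lost to
    # the spam folder) is penalized lam times as heavily as a false
    # negative (spam reaching the inbox). lam=9 is illustrative only.
    return lam * fp + fn
```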
14. Classification History
- Sahami et al. (1998)
- Used a Naïve Bayes classifier
- Were the first to apply text classification research to the spam problem
- Pantel and Lin (1998)
- Also used a Naïve Bayes classifier
- Found that Naïve Bayes outperforms RIPPER
15. Classification History (cont.)
- Drucker et al. (1999)
- Evaluated Support Vector Machines as a solution to spam
- Found that SVM is more effective than RIPPER and Rocchio
- Hidalgo and Lopez (2000)
- Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
16. Classification History (cont.)
- Up to this point, private corpora were used exclusively in email spam research
- Androutsopoulos et al. (2000a)
- Created the first publicly available email spam corpus (Ling-spam)
- Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
17. Classification History (cont.)
- Androutsopoulos et al. (2000b)
- Created another publicly available email spam corpus (PU1)
- Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
- Carreras and Marquez (2001)
- Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
18. Classification History (cont.)
- Androutsopoulos et al. (2004)
- Created 3 more publicly available corpora (PU2, PU3, and PUA)
- Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost; FB, SVM, and LB outperform NB
- Zhang et al. (2004)
- Used Ling-spam, PU1, and the SpamAssassin corpora
- Compared Naïve Bayes, Support Vector Machines, and AdaBoost; SVM and AB outperform NB
19. Classification History (cont.)
- CEAS (2004–present)
- Focuses solely on email and anti-spam research
- Generates a significant amount of academic and industry anti-spam research
- Klimt and Yang (2004)
- Published the Enron Corpus, the first large-scale corpus of legitimate email messages
- TREC Spam Track (2005–present)
- Produces new corpora every year
- Provides a standardized platform to evaluate classification algorithms
20. Ongoing Research
- Concept Drift
- New Classification Approaches
- Adversarial Classification
- Image Spam
21. Concept Drift
- Spam content is extremely dynamic
- Topic drift (e.g., specific scams)
- Technique drift (e.g., obfuscations)
- How do we keep up with the Joneses?
- Batch vs. Online Learning (see the sketch below)
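As a hedged sketch of the online alternative (the class and method names are invented; a batch learner would instead retrain periodically on a fixed corpus), here is a Naive Bayes filter whose counts are updated one labeled message at a time, so it can track topic and technique drift:

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Incrementally trained Naive Bayes filter (illustrative sketch)."""

    def __init__(self):
        self.token_counts = {"spam": defaultdict(int), "legit": defaultdict(int)}
        self.token_totals = {"spam": 0, "legit": 0}
        self.message_counts = {"spam": 0, "legit": 0}

    def update(self, tokens, label):
        # Online learning: fold in each newly labeled message as it
        # arrives (e.g., from user feedback) rather than batch retraining.
        self.message_counts[label] += 1
        for t in tokens:
            self.token_counts[label][t] += 1
            self.token_totals[label] += 1

    def classify(self, tokens):
        # Assumes at least one message of each class has been seen.
        vocab_size = len(set(self.token_counts["spam"]) |
                         set(self.token_counts["legit"]))
        total_msgs = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "legit"):
            # Log prior plus Laplace-smoothed log likelihoods.
            score = math.log(self.message_counts[label] / total_msgs)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + 1) /
                                  (self.token_totals[label] + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)
```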
22. New Classification Approaches
- Filter Fusion
- Compression-based Filtering (see the sketch below)
- Network Behavioral Clustering
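To illustrate the compression-based idea (published work uses adaptive statistical models such as PPM; plain zlib here is only a stand-in, and the corpus arguments are invented): a message is assigned to the class whose training text it compresses best alongside.

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8")))

def classify_by_compression(message, spam_corpus, legit_corpus):
    # The message gets the label of the corpus it extends most cheaply:
    # fewer extra compressed bytes means the message looks statistically
    # like that class.
    spam_cost = compressed_size(spam_corpus + message) - compressed_size(spam_corpus)
    legit_cost = compressed_size(legit_corpus + message) - compressed_size(legit_corpus)
    return "spam" if spam_cost < legit_cost else "legitimate"
```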
23. Adversarial Classification
- Classifiers assume a clear distinction between spam and legitimate features
- Camouflaged messages
- Mask spam content with legitimate content
- Disrupt decision boundaries for classifiers
24. Camouflage Attacks
- Baseline performance
- Accuracies consistently higher than 98%
- Classifiers under attack
- Accuracies degrade to between 50% and 70%
- Retrained classifiers
- Accuracies climb back to between 91% and 99%
25. Camouflage Attacks (cont.)
- Retraining postpones the problem, but it doesn't solve it
- We can identify features that are less susceptible to attack, but that's simply another stalling technique
26. Image Spam
- What happens when an email does not contain textual features?
- OCR is easily defeated
- Classification using image properties
27. Overview
- Introduction
- Countering Email Spam
- Problem Description
- Classification History
- Ongoing Research
- Countering Web Spam
- Problem Description
- Classification History
- Ongoing Research
- Conclusions
28. Countering Web Spam
- What is web spam?
- Traditional definition
- Our definition
- Between 13.8% and 22.1% of all web pages
29. Ad Farms
- Only contain advertising links (usually ad listings)
- Elaborate entry pages used to deceive visitors
30. Ad Farms (cont.)
- Clicking on an entry page link leads to an ad listing
- Ad syndicators provide the content
- Web spammers create the HTML structures
31. Parked Domains
- Domain parking services
- Provide placeholders for newly registered domains
- Allow ad listings to be used as placeholders to monetize a domain
- Inevitably, web spammers abused these services
32. Parked Domains (cont.)
- Functionally equivalent to Ad Farms
- Both rely on ad syndicators for content
- Both provide little to no value to their visitors
- Unique Characteristics
- Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
- Typically for sale by owner ("Offer To Buy This Domain")
33. Parked Domains (cont.)
34. Advertisements
- Pages advertising specific products or services
- Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
35. Problem Description
- Web spam detection can also be modeled as a binary text classification problem
- Salton's vector space model is quite common
- Feature processing and performance evaluation are also quite similar
- But what about feature generation?
36. How do we generate features?
- Sources of information
- HTTP connections
- Hosting IP addresses
- Session headers
- HTML content
- Textual properties
- Structural properties
- URL linkage structure
- PageRank scores (see the sketch below)
- Neighbor properties
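Since PageRank scores appear as features here, a minimal power-iteration sketch may help (the link graph is an invented toy; dangling pages are ignored for brevity):

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: {page: [pages it links to]}; returns approximate scores.
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, targets in links.items():
            for q in targets:
                # Each page splits its rank evenly among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```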
37. Classification History
- Davison (2000)
- Was the first to investigate link-based web spam
- Built decision trees to successfully identify nepotistic links
- Becchetti et al. (2005)
- Revisited the use of decision trees to identify link-based web spam
- Used link-based features such as PageRank and TrustRank scores
38. Classification History
- Drost and Scheffer (2005)
- Used Support Vector Machines to classify web spam pages
- Relied on content-based features as well as link-based features
- Ntoulas et al. (2006)
- Built decision trees to classify web spam
- Used content-based features (e.g., fraction of visible content, compressibility, etc.)
39. Classification History
- Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets
- Webb et al. (2006)
- Presented the Webb Spam Corpus, a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
- http://www.webbspamcorpus.org
- Castillo et al. (2006)
- Presented the WEBSPAM-UK2006 corpus, a publicly available web spam corpus (only contains 1,924 web spam pages)
40. Classification History
- Castillo et al. (2007)
- Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
- Used link-based features from Becchetti et al. (2005) and content-based features from Ntoulas et al. (2006)
- Webb et al. (2008)
- Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
- Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
- Found that these classifiers are comparable to (and in many cases, better than) existing approaches
41. Ongoing Research
- Redirection
- Phishing
- Social Spam
42. Redirection
- 144,801 unique redirect chains (1.54 HTTP redirects on average)
- 43.9% of web spam pages use some form of HTML or JavaScript redirection (see the detection sketch below)
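A rough sketch of how such redirection might be flagged in a crawled page (the regular expressions are illustrative heuristics, not the detection method used in the study):

```python
import re

# Heuristic patterns for the HTML (meta refresh) and JavaScript
# (location assignment) redirection techniques mentioned above.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.IGNORECASE)
JS_REDIRECT = re.compile(r"\blocation(?:\.href)?\s*=", re.IGNORECASE)

def uses_redirection(html):
    return bool(META_REFRESH.search(html) or JS_REDIRECT.search(html))

print(uses_redirection('<meta http-equiv="refresh" content="0;url=x.com">'))  # True
```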
43. Phishing
- An interesting form of deception that affects both email and web users
- Another form of adversarial classification
44. Social Spam
- Comment spam
- Bulletin spam
- Message spam
45. Conclusions
- Email and web spam are currently two of the largest information security problems
- Classification techniques offer an effective way to filter this low-quality information
- Spammers are extremely dynamic, generating various areas of important future research
46. Questions