Countering Spam Using Classification Techniques

1
Countering Spam Using Classification Techniques
  • Steve Webb
  • webb@cc.gatech.edu
  • Data Mining Guest Lecture
  • February 21, 2008

2
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

3
Introduction
  • The Internet has spawned numerous
    information-rich environments
  • Email Systems
  • World Wide Web
  • Social Networking Communities
  • Openness facilitates information sharing, but it
    also makes these environments vulnerable

4
Denial of Information (DoI) Attacks
  • Deliberate insertion of low quality information
    (or noise) into information-rich environments
  • The information analog of Denial of Service (DoS)
    attacks
  • Two goals
  • Promotion of ideals by means of deception
  • Denial of access to high quality information
  • Spam is currently the most prominent example
    of a DoI attack

5
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

6
Countering Email Spam
  • Close to 200 billion (yes, billion) emails are
    sent each day
  • Spam accounts for around 90% of that email
    traffic
  • 2 million spam messages every second

7
Old Email Spam Examples
8
Problem Description
  • Email spam detection can be modeled as a binary
    text classification problem
  • Two classes: spam and legitimate (non-spam)
  • Example of supervised learning
  • Build a model (classifier) based on training data
    to approximate the target function
  • Construct a function f : M → {spam, legitimate}
    such that it overlaps F : M → {spam, legitimate}
    as much as possible
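
As a rough illustration of this formulation, the following minimal sketch learns such a mapping from a handful of labeled messages. It assumes scikit-learn is available, and the tiny training set is made up purely for illustration.

```python
# Minimal sketch: email spam detection as supervised binary text classification.
# Assumes scikit-learn; the labeled training messages are purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "cheap meds buy now limited offer",          # spam
    "win a free prize click here",               # spam
    "meeting moved to 3pm see agenda attached",  # legitimate
    "lecture notes for the data mining class",   # legitimate
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate

# Build a model (classifier) from training data to approximate the target
# function F : M -> {spam, legitimate}.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
classifier = MultinomialNB().fit(X, labels)

# Apply the learned function f to a previously unseen message.
print(classifier.predict(vectorizer.transform(["free prize offer click now"])))
```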

9
Problem Description (cont.)
  • How do we represent a message?
  • How do we generate features?
  • How do we process features?
  • How do we evaluate performance?

10
How do we represent a message?
  • Classification algorithms require a consistent
    format
  • Salton's vector space model (bag of words) is
    the most popular representation
  • Each message m is represented as a feature vector
    f of n features ⟨f1, f2, …, fn⟩
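
A hand-rolled sketch of that representation; the vocabulary and message below are made up for illustration. Real filters use far larger vocabularies and sparse vectors, but the mapping is the same.

```python
# Sketch: represent a message m as a feature vector <f1, f2, ..., fn> where
# fi counts how often vocabulary term i occurs in m (bag of words).
from collections import Counter

vocabulary = ["free", "prize", "meeting", "agenda", "click"]  # illustrative

def to_feature_vector(message, vocab):
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocab]

print(to_feature_vector("Click here to claim your FREE free prize", vocabulary))
# -> [2, 1, 0, 0, 1]
```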

11
How do we generate features?
  • Sources of information
  • SMTP connections
  • Network properties
  • Email headers
  • Social networks
  • Email body
  • Textual parts
  • URLs
  • Attachments
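
A sketch of pulling features from several of these sources (headers, body text, embedded URLs) with Python's standard email module; the raw message and the feature names are hypothetical.

```python
# Sketch: generate features from multiple sources within one message
# (headers, textual body, URLs). The raw message is made up.
import email
import re
from email import policy

raw = (b"From: promo@example.com\r\n"
       b"Subject: Limited offer\r\n"
       b"\r\n"
       b"Buy now at http://cheap-meds.example.com before midnight!\r\n")

msg = email.message_from_bytes(raw, policy=policy.default)
body = msg.get_content()

features = {
    "subject_tokens": msg["Subject"].lower().split(),  # from headers
    "body_tokens": body.lower().split(),               # from textual parts
    "urls": re.findall(r"https?://\S+", body),         # from embedded URLs
    "sender_domain": msg["From"].split("@")[-1],       # crude header-derived hint
}
print(features)
```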

12
How do we process features?
  • Feature Tokenization
  • Alphanumeric tokens
  • N-grams
  • Phrases
  • Feature Scrubbing
  • Stemming
  • Stop word removal
  • Feature Selection
  • Simple feature removal
  • Information-theoretic algorithms
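
A sketch of one possible processing pipeline with scikit-learn: unigram and bigram tokenization, stop-word removal, and chi-squared scoring standing in for the information-theoretic selection criteria mentioned above. The training data is illustrative.

```python
# Sketch: tokenize into unigrams/bigrams, drop stop words, then keep only the
# k highest-scoring features. Chi-squared is used here as a simple stand-in
# for information-theoretic selection. Training data is illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

messages = ["free prize click now", "win free money fast",
            "project meeting agenda", "notes from the lecture"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(messages)

selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(vectorizer.get_feature_names_out()[selector.get_support()])
```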

13
How do we evaluate performance?
  • Traditional IR metrics
  • Precision vs. Recall
  • False positives vs. False negatives
  • Imbalanced error costs
  • ROC curves
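
A sketch of computing those metrics from a set of predictions; the labels and scores are made up. In spam filtering a false positive (a legitimate message lost to the spam folder) is usually costed far higher than a false negative.

```python
# Sketch: precision/recall, false positives vs. false negatives, and ROC AUC.
# True labels, predictions, and scores are made up for illustration.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]                    # 1 = spam, 0 = legitimate
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]                    # hard decisions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.95]  # classifier confidences

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (legitimate flagged as spam):", fp)
print("false negatives (spam that got through):", fn)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_score))    # area under the ROC curve
```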

14
Classification History
  • Sahami et al. (1998)
  • Used a Naïve Bayes classifier
  • Were the first to apply text classification
    research to the spam problem
  • Pantel and Lin (1998)
  • Also used a Naïve Bayes classifier
  • Found that Naïve Bayes outperforms RIPPER

15
Classification History (cont.)
  • Drucker et al. (1999)
  • Evaluated Support Vector Machines as a solution
    to spam
  • Found that SVM is more effective than RIPPER and
    Rocchio
  • Hidalgo and Lopez (2000)
  • Found that decision trees (C4.5) outperform Naïve
    Bayes and k-NN

16
Classification History (cont.)
  • Up to this point, private corpora were used
    exclusively in email spam research
  • Androutsopoulos et al. (2000a)
  • Created the first publicly available email spam
    corpus (Ling-spam)
  • Performed various feature set size, training set
    size, stemming, and stop-list experiments with a
    Naïve Bayes classifier

17
Classification History (cont.)
  • Androutsopoulos et al. (2000b)
  • Created another publicly available email spam
    corpus (PU1)
  • Confirmed previous research that Naïve Bayes
    outperforms a keyword-based filter
  • Carreras and Marquez (2001)
  • Used PU1 to show that AdaBoost is more effective
    than decision trees and Naïve Bayes

18
Classification History (cont.)
  • Androutsopoulos et al. (2004)
  • Created 3 more publicly available corpora (PU2,
    PU3, and PUA)
  • Compared Naïve Bayes, Flexible Bayes, Support
    Vector Machines, and LogitBoost; FB, SVM, and LB
    outperform NB
  • Zhang et al. (2004)
  • Used Ling-spam, PU1, and the SpamAssassin corpora
  • Compared Naïve Bayes, Support Vector Machines,
    and AdaBoost; SVM and AB outperform NB

19
Classification History (cont.)
  • CEAS (2004–present)
  • Focuses solely on email and anti-spam research
  • Generates a significant amount of academic and
    industry anti-spam research
  • Klimt and Yang (2004)
  • Published the Enron Corpus, the first
    large-scale corpus of legitimate email messages
  • TREC Spam Track (2005–present)
  • Produces new corpora every year
  • Provides a standardized platform to evaluate
    classification algorithms

20
Ongoing Research
  • Concept Drift
  • New Classification Approaches
  • Adversarial Classification
  • Image Spam

21
Concept Drift
  • Spam content is extremely dynamic
  • Topic drift (e.g., specific scams)
  • Technique drift (e.g., obfuscations)
  • How do we keep up with the Joneses?
  • Batch vs. Online Learning
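
One way to cope with drift is online (incremental) learning rather than periodic batch retraining. Below is a minimal sketch using scikit-learn's partial_fit; the message batches are made up, and a hashing vectorizer is used so new vocabulary never changes the feature space.

```python
# Sketch: online learning in the face of concept drift. partial_fit lets the
# filter absorb newly labeled messages without retraining from scratch.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# A fixed-size hashed feature space means drifting vocabulary (new scams,
# obfuscated tokens) never forces the vectorizer to be re-fit.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
classifier = MultinomialNB()

batches = [
    (["cheap meds buy now", "project meeting at noon"], [1, 0]),
    (["v1agra s@le today", "lecture slides attached"], [1, 0]),  # drifted spam
]
for texts, labels in batches:
    X = vectorizer.transform(texts)
    classifier.partial_fit(X, labels, classes=[0, 1])

print(classifier.predict(vectorizer.transform(["s@le on cheap meds"])))
```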

22
New Classification Approaches
  • Filter Fusion
  • Compression-based Filtering
  • Network behavioral clustering
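
As a rough illustration of the compression-based idea, a general-purpose compressor can stand in for a class-conditional language model: a message is assigned to the class whose training text lets it compress most cheaply. A minimal sketch with zlib; the corpora are made up.

```python
# Sketch: compression-based filtering with a general-purpose compressor.
# Assign the message to the class whose corpus makes it cheapest to compress
# (a rough proxy for cross-entropy). Corpora are illustrative.
import zlib

spam_corpus = b"free prize click now cheap meds limited offer win money"
ham_corpus = b"meeting agenda lecture notes project deadline slides attached"

def extra_bytes(corpus, message):
    # How many extra compressed bytes does the message add to the corpus?
    return len(zlib.compress(corpus + b" " + message)) - len(zlib.compress(corpus))

def classify(message):
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "legitimate"

print(classify(b"win a free prize now"))             # likely spam
print(classify(b"notes from the project meeting"))   # likely legitimate
```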

23
Adversarial Classification
  • Classifiers assume a clear distinction between
    spam and legitimate features
  • Camouflaged messages
  • Mask spam content with legitimate content
  • Disrupt decision boundaries for classifiers

24
Camouflage Attacks
  • Baseline performance
  • Accuracies consistently higher than 98%
  • Classifiers under attack
  • Accuracies degrade to between 50% and 70%
  • Retrained classifiers
  • Accuracies climb back to between 91% and 99%

25
Camouflage Attacks (cont.)
  • Retraining postpones the problem, but it doesn't
    solve it
  • We can identify features that are less
    susceptible to attack, but that's simply another
    stalling technique

26
Image Spam
  • What happens when an email does not contain
    textual features?
  • OCR is easily defeated
  • Classification using image properties
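
A sketch of extracting simple image properties that could feed such a classifier; it assumes the Pillow library is installed, and the file name is hypothetical.

```python
# Sketch: derive simple image properties for classifying image spam, where the
# message text lives inside an attached image. Assumes Pillow; the file name
# is hypothetical.
import os
from PIL import Image

def image_features(path):
    with Image.open(path) as img:
        width, height = img.size
        # getcolors returns None if the image has more than maxcolors colors.
        colors = img.convert("RGB").getcolors(maxcolors=2**16)
        return {
            "width": width,
            "height": height,
            "aspect_ratio": width / height,
            "file_size_bytes": os.path.getsize(path),
            "distinct_colors": len(colors) if colors else 2**16,
            "format": img.format,
        }

# print(image_features("suspicious_attachment.gif"))  # hypothetical file
```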

27
Overview
  • Introduction
  • Countering Email Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Countering Web Spam
  • Problem Description
  • Classification History
  • Ongoing Research
  • Conclusions

28
Countering Web Spam
  • What is web spam?
  • Traditional definition
  • Our definition
  • Between 13.8% and 22.1% of all web pages

29
Ad Farms
  • Only contain advertising links (usually ad
    listings)
  • Elaborate entry pages used to deceive visitors

30
Ad Farms (cont.)
  • Clicking on an entry page link leads to an ad
    listing
  • Ad syndicators provide the content
  • Web spammers create the HTML structures

31
Parked Domains
  • Domain parking services
  • Provide placeholders for newly registered
    domains
  • Allow ad listings to be used as placeholders to
    monetize a domain
  • Inevitably, web spammers abused these services

32
Parked Domains (cont.)
  • Functionally equivalent to Ad Farms
  • Both rely on ad syndicators for content
  • Both provide little to no value to their visitors
  • Unique Characteristics
  • Reliance on domain parking services (e.g.,
    apps5.oingo.com, searchportal.information.com,
    etc.)
  • Typically for sale by owner ("Offer To Buy This
    Domain")

33
Parked Domains (cont.)
34
Advertisements
  • Pages advertising specific products or services
  • Examples of the kinds of pages being advertised
    in Ad Farms and Parked Domains

35
Problem Description
  • Web spam detection can also be modeled as a
    binary text classification problem
  • Salton's vector space model is quite common
  • Feature processing and performance evaluation are
    also quite similar
  • But what about feature generation?

36
How do we generate features?
  • Sources of information
  • HTTP connections
  • Hosting IP addresses
  • Session headers
  • HTML content
  • Textual properties
  • Structural properties
  • URL linkage structure
  • PageRank scores
  • Neighbor properties
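
A sketch combining two of these sources, the HTTP session headers and the HTML content; the sample response is made up, and graph-based signals such as PageRank would come from a separate crawl.

```python
# Sketch: derive web spam features from an HTTP response's headers and its
# HTML content. The sample response is made up; link-graph features like
# PageRank or neighbor properties would be computed separately.
import re

headers = {"Server": "Apache", "X-Powered-By": "PHP/5.2", "Content-Type": "text/html"}
html = ("<html><body>Cheap pills <a href='http://a.example'>buy</a> "
        "<a href='http://b.example'>now</a></body></html>")

visible_text = re.sub(r"<[^>]+>", " ", html)
out_links = re.findall(r"href=['\"]([^'\"]+)['\"]", html)

features = {
    "num_headers": len(headers),                              # HTTP session
    "server": headers.get("Server", ""),
    "num_out_links": len(out_links),                          # linkage structure
    "visible_text_fraction": len(visible_text) / len(html),   # textual property
}
print(features)
```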

37
Classification History
  • Davison (2000)
  • Was the first to investigate link-based web spam
  • Built decision trees to successfully identify
    nepotistic links
  • Becchetti et al. (2005)
  • Revisited the use of decision trees to identify
    link-based web spam
  • Used link-based features such as PageRank and
    TrustRank scores

38
Classification History
  • Drost and Scheffer (2005)
  • Used Support Vector Machines to classify web spam
    pages
  • Relied on content-based features as well as
    link-based features
  • Ntoulas et al. (2006)
  • Built decision trees to classify web spam
  • Used content-based features (e.g., fraction of
    visible content, compressibility, etc.)

39
Classification History
  • Up to this point, previous web spam research was
    limited to small (on the order of a few
    thousand), private data sets
  • Webb et al. (2006)
  • Presented the Webb Spam Corpus, a
    first-of-its-kind large-scale, publicly available
    web spam corpus (almost 350K web spam pages)
  • http://www.webbspamcorpus.org
  • Castillo et al. (2006)
  • Presented the WEBSPAM-UK2006 corpus, a publicly
    available web spam corpus (only contains 1,924
    web spam pages)

40
Classification History
  • Castillo et al. (2007)
  • Created a cost-sensitive decision tree to
    identify web spam in the WEBSPAM-UK2006 data set
  • Used link-based features from Becchetti et al.
    (2005) and content-based features from Ntoulas
    et al. (2006)
  • Webb et al. (2008)
  • Compared various classifiers (e.g., SVM, decision
    trees, etc.) using HTTP session information
    exclusively
  • Used the Webb Spam Corpus, WebBase data, and the
    WEBSPAM-UK2006 data set
  • Found that these classifiers are comparable to
    (and in many cases, better than) existing
    approaches

41
Ongoing Research
  • Redirection
  • Phishing
  • Social Spam

42
Redirection
  • 144,801 unique redirect chains (1.54 HTTP
    redirects per chain, on average)
  • 43.9% of web spam pages use some form of HTML or
    JavaScript redirection
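
A sketch of flagging the HTML and JavaScript redirection tricks referred to above (meta refresh tags and window.location rewrites); the patterns are illustrative rather than an exhaustive detector.

```python
# Sketch: flag meta-refresh and JavaScript redirection in a page's HTML.
# Patterns and the sample page are illustrative, not exhaustive.
import re

META_REFRESH = re.compile(r"<meta[^>]*http-equiv=['\"]?refresh", re.IGNORECASE)
JS_REDIRECT = re.compile(r"(window\.location|document\.location)\s*(\.href)?\s*=",
                         re.IGNORECASE)

def uses_redirection(html):
    return bool(META_REFRESH.search(html) or JS_REDIRECT.search(html))

page = '<html><head><meta http-equiv="refresh" content="0;url=http://ads.example"></head></html>'
print(uses_redirection(page))  # True
```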

43
Phishing
  • Interesting form of deception that affects email
    and web users
  • Another form of adversarial classification

44
Social Spam
  • Comment spam
  • Bulletin spam
  • Message spam

45
Conclusions
  • Email and web spam are currently two of the
    largest information security problems
  • Classification techniques offer an effective way
    to filter this low quality information
  • Spammers are extremely dynamic, generating
    various areas of important future research

46
Questions