Improving Spam Detection Based on Structural Similarity

1
Improving Spam Detection Based on Structural
Similarity
  • By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo
    B. Almeida,
  • Luis M. A. Bettencourt, Virgílio A. F. Almeida,
    Jussara M. Almeida
  • Presented at Steps to Reducing Unwanted Traffic
    on the Internet Workshop, 2005
  • Presented by Jared Bott

2
Outline
  • Overview
  • Concepts
  • Detecting Spam
  • Experimental Results
  • Analysis of Paper

3
Overview
  • New algorithm to detect spam messages
  • Uses email information that is harder to change
  • Works in conjunction with another spam classifier
  • E.g., SpamAssassin
  • Fewer false positives than the compared methods

4
Spam Detection Problem
  • Spam detection algorithms use some part of emails
    to determine if a message is spam
  • Spammers change messages so that they do not meet
    detection criteria for spam
  • It is very easy to change a spam message's content,
    usernames, domains, subjects, etc.

5
Key Idea
  • The lists that spammers and legitimate users send
    messages to and from can be used as the
    identifiers of classes of email traffic.
  • The lists of addresses spammers send to are
    unlikely to be similar to those of legitimate
    users.
  • Lists don't change that often

6
Using Lists
  • A user is not just an email address. It can be a
    domain, etc.
  • Represent email user as a vector in
    multi-dimensional conceptual space created with
    all possible contacts
  • Each sender and each recipient has their own
    vector
  • Model relationship between senders and recipients

7
Constructing Vectors
  • If there is at least one email sent from sender
    s_i to recipient r_n, then the value in the nth
    dimension of s_i's vector is 1. Otherwise, that
    value is 0.
  • If there is at least one email received by
    recipient r_i from sender s_n, then the value in
    the nth dimension of r_i's vector is 1. Otherwise,
    it is 0.
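
The construction rule above can be sketched in a few lines. This is only an illustration: the function name `build_contact_vectors`, the `(sender, recipient)` log format, and the sample addresses are assumptions, not taken from the paper. Each binary vector is stored sparsely as a set, where membership in the set stands for a 1 in that dimension.

```python
from collections import defaultdict


def build_contact_vectors(log):
    """Build binary contact vectors from (sender, recipient) pairs.

    A contact that is in the set corresponds to a 1 in that
    dimension; a contact that is absent corresponds to a 0.
    """
    sender_vectors = defaultdict(set)
    recipient_vectors = defaultdict(set)
    for sender, recipient in log:
        sender_vectors[sender].add(recipient)
        recipient_vectors[recipient].add(sender)
    return dict(sender_vectors), dict(recipient_vectors)


# Hypothetical log entries, for illustration only.
log = [
    ("alice@a.example", "bob@b.example"),
    ("alice@a.example", "carol@c.example"),
    ("alice@a.example", "bob@b.example"),  # a repeat does not change the vector
    ("dave@d.example", "bob@b.example"),
]
senders, recipients = build_contact_vectors(log)
```

Because the rule only asks whether at least one email exists, sending to the same recipient many times leaves the vector unchanged.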

8
Example Vectors

9
Similarity Between Senders
  • Similarity between senders s_i and s_k is the
    cosine of the angle between their vectors
  • cos(s_i, s_k)
  • 0 means no shared contacts
  • 1 means identical contact lists
  • Among legitimate senders, a 1 means that the
    senders operate in the same social group.
  • Among spammers, a 1 means that the senders use the
    same list or are the same person.
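
For binary vectors, the cosine reduces to the number of shared contacts divided by the square root of the product of the two list sizes. A minimal sketch, assuming each sender's vector is stored as a set of contacts (the function name and the set representation are illustrative choices, not from the paper):

```python
import math


def cosine_similarity(contacts_a, contacts_b):
    """Cosine of the angle between two binary contact vectors.

    Each vector is given as the set of contacts whose dimension is 1.
    """
    if not contacts_a or not contacts_b:
        return 0.0  # an empty vector has no direction; treat as no overlap
    shared = len(contacts_a & contacts_b)
    return shared / math.sqrt(len(contacts_a) * len(contacts_b))


same = cosine_similarity({"bob", "carol"}, {"bob", "carol"})
none = cosine_similarity({"bob", "carol"}, {"erin"})
partial = cosine_similarity({"bob", "carol"}, {"bob"})
```

Here `same` is 1.0 (identical contact lists), `none` is 0.0 (no shared contact), and `partial` falls in between.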

10
Grouping Users Into Clusters
  • Group users with similar vectors
  • Users with similar vectors are likely to have
    related roles, i.e. spammer or legitimate user
  • Each cluster is represented by a vector
  • This vector is the sum of all its component
    users' vectors
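
The cluster vector itself is simple to compute once the members are known. The sketch below covers only that summation step; how users are assigned to clusters is not shown, and the function name and the use of `Counter` are assumptions made for illustration.

```python
from collections import Counter


def cluster_vector(member_vectors):
    """Sum the binary contact vectors of all users in one cluster.

    Each member vector is a set of contacts; the result maps every
    contact to the number of cluster members that have it.
    """
    total = Counter()
    for contacts in member_vectors:
        total.update(contacts)  # every contact in the set adds 1
    return total


# Two hypothetical members of the same cluster.
sc = cluster_vector([{"bob", "carol"}, {"bob"}])
```

Unlike the user vectors, the cluster vector is no longer binary: a contact shared by several members gets a larger weight.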

11
Similarity Between a User and a Cluster
  • Similarity is derived from user to user
    similarity equation
  • If sender s_i is a member of cluster sc_k, then the
    similarity is cos(sc_k - s_i, s_i).
  • If sender s_i is not a member of cluster sc_k, then
    the similarity is cos(sc_k, s_i).
  • Similarity between a user and a cluster will
    change over time
  • Remove the user's vector from the cluster's
    vector when computing similarity and
    reclassifying a user
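
A sketch of the user-to-cluster similarity, including the step of removing the user's own vector from the cluster's vector when the user is already a member. Vectors are represented as `Counter` objects mapping contacts to weights; the function names and that representation are assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def user_cluster_similarity(user_vec, cluster_vec, is_member):
    """Similarity between a user and a cluster.

    If the user belongs to the cluster, the user's own vector is
    subtracted first so that it does not inflate the result.
    """
    if is_member:
        rest = Counter(cluster_vec)  # copy, leave the cluster untouched
        rest.subtract(user_vec)
        return cosine(rest, user_vec)
    return cosine(cluster_vec, user_vec)


alice = Counter({"bob": 1, "carol": 1})
dave = Counter({"bob": 1})
cluster = alice + dave  # cluster vector = sum of member vectors

as_member = user_cluster_similarity(alice, cluster, is_member=True)
as_outsider = user_cluster_similarity(alice, cluster, is_member=False)
```

In this toy case `as_member` is lower than `as_outsider`, because subtracting the user's own contribution removes the part of the cluster vector that matched trivially.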

12
Detecting Spam
  • Two probabilities to compute
  • P_s(m): probability of an email m being sent by a
    spammer
  • P_r(m): probability of an email m being addressed
    to users that receive spam

13
Detecting Spam
  • When an email arrives, classify it using some
    other method
  • Find the cluster (sc) the email's sender belongs
    in
  • If many users in the cluster send messages that
    are classified as spam by the auxiliary method, the
    probability of all the users in that cluster
    sending spam is high
  • Update sc's spam probability
  • P_s(m) ← sc's spam probability

14
Detecting Spam
  • For all recipients of the email, find the cluster
    (rc) each one belongs to
  • Update the spam probability for each cluster
  • P_r(m) ← P_r(m) + spam probability of each rc
  • P_r(m) ← P_r(m) / number of recipients
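
The sender-side and recipient-side steps can be sketched together. The slides do not say how a cluster's spam probability is updated, so this sketch assumes it is the fraction of the cluster's messages that the auxiliary classifier labeled as spam; the `ClusterStats` class and the function name are hypothetical.

```python
class ClusterStats:
    """Running spam statistics for one cluster (hypothetical helper)."""

    def __init__(self, spam=0, total=0):
        self.spam = spam
        self.total = total

    def update(self, aux_says_spam):
        """Record one message and the auxiliary classifier's verdict."""
        self.total += 1
        if aux_says_spam:
            self.spam += 1

    @property
    def spam_probability(self):
        # ASSUMPTION: fraction of the cluster's messages that the
        # auxiliary classifier labeled as spam.
        return self.spam / self.total if self.total else 0.0


def message_probabilities(sender_cluster, recipient_clusters, aux_says_spam):
    """Return (P_s, P_r) for one message after updating the clusters."""
    sender_cluster.update(aux_says_spam)
    p_s = sender_cluster.spam_probability

    p_r = 0.0
    for rc in recipient_clusters:
        rc.update(aux_says_spam)
        p_r += rc.spam_probability
    if recipient_clusters:
        p_r /= len(recipient_clusters)  # average over the recipients
    return p_s, p_r


sc = ClusterStats(spam=3, total=4)   # sender's cluster, mostly spam so far
rc1 = ClusterStats(spam=1, total=2)  # cluster of the first recipient
rc2 = ClusterStats()                 # cluster with no history yet
p_s, p_r = message_probabilities(sc, [rc1, rc2], aux_says_spam=True)
```

The recipient-side value is an average, so a message sent to many recipients is not penalized simply for having a long recipient list.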

15
Detecting Spam
  • Compute a spam rank for the email based upon
    P_r(m) and P_s(m)
  • If the spam rank is above some threshold (ω),
    label it as spam
  • If the spam rank is below 1 - ω, label it as
    legitimate
  • Otherwise, label the email with the auxiliary
    method's classification
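
The three-way decision can be sketched as follows. The slides do not give the formula that combines the two probabilities into a spam rank, so the plain average used here is a placeholder assumption, as are the function name and the default threshold of 0.85. The threshold must be above 0.5 for the two bands not to overlap.

```python
def classify(p_s, p_r, aux_label, threshold=0.85):
    """Label a message as "spam" or "legitimate".

    p_s, p_r  -- sender-side and recipient-side spam probabilities
    aux_label -- label from the auxiliary classifier, used as fallback
    threshold -- tuning parameter; 0.85 is an arbitrary example value
    """
    # ASSUMPTION: the spam rank is the plain average of the two
    # probabilities. The source does not specify the actual formula.
    spam_rank = (p_s + p_r) / 2

    if spam_rank > threshold:
        return "spam"
    if spam_rank < 1 - threshold:
        return "legitimate"
    # In the uncertain middle band, defer to the auxiliary method.
    return aux_label


high = classify(0.95, 0.90, aux_label="legitimate")
low = classify(0.05, 0.10, aux_label="spam")
middle = classify(0.50, 0.60, aux_label="spam")
```

In this sketch the structural evidence overrides the auxiliary classifier only when it is strong in either direction, which is how the approach can correct the auxiliary method's mistakes.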

17
Experimental Results
  • Tested on a log of eight days of email from a
    large Brazilian university
  • Tested on a 2.8 GHz Pentium 4 with 512 MB RAM
  • Able to classify 20 messages per second
  • Faster than the average message arrival peak rate

18
Results

19
Results
  • Manually checked false positives to see if they
    were spam or not
  • Auxiliary algorithm had more false positives

20
Strengths
  • Fewer false positives than SpamAssassin
  • Low-cost
  • Works with message information that doesn't
    change that much

21
Weaknesses
  • Needs an additional message classifier, e.g.
    SpamAssassin
  • Manual tuning of algorithm

22
Improvements
  • Time correlation of similar addresses
  • Collaborative filtering based upon user feedback