Improving Spam Detection Based on Structural Similarity

Slides: 23

Provided by: IR0157

Learn more at: http://www.cs.ucf.edu

Transcript and Presenter's Notes

1
Improving Spam Detection Based on Structural
Similarity

2
Outline

3
Overview

4
Spam Detection Problem

Spam detection algorithms use some part of emails
to determine if a message is spam
Spammers change messages so that they do not meet
detection criteria for spam
Very easy to change spam messages, usernames,
domains, subjects, etc.

5
Key Idea

The lists that spammers and legitimate users send
messages to and from can be used as the
identifiers of classes of email traffic.
The lists of addresses spammers send to are
unlikely to be similar to those of legitimate
users.
Lists don't change that often

6
Using Lists

A user is not just an email address. It can be a
domain, etc.
Represent email user as a vector in
multi-dimensional conceptual space created with
all possible contacts
Each sender and each recipient has their own
vector
Model relationship between senders and recipients

7
Constructing Vectors

If there is at least one email sent from sender
si to recipient rn, then the value in the nth
dimension of si's vector is 1. Otherwise, that
value is 0.
If there is at least one email received by
recipient ri from sender sn, then the value in the
nth dimension of ri's vector is 1. Otherwise, it is 0.
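
The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the presenters' implementation: the function name `build_sender_vectors` and the input format (a list of `(sender, recipient)` pairs) are assumptions made for the example.

```python
def build_sender_vectors(emails):
    """Build one binary contact vector per sender.

    `emails` is an iterable of (sender, recipient) pairs; this input
    format is an assumption made for the example.
    """
    emails = list(emails)
    # One dimension per distinct recipient, in a fixed (sorted) order.
    recipients = sorted({r for _, r in emails})
    index = {r: n for n, r in enumerate(recipients)}

    vectors = {}
    for sender, recipient in emails:
        vec = vectors.setdefault(sender, [0] * len(recipients))
        # At least one email from sender to recipient sets the value to 1.
        vec[index[recipient]] = 1
    return recipients, vectors
```

Recipient vectors can be built the same way by swapping the two elements of each pair before calling the function.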

8
Example Vectors

9
Similarity Between Senders

Similarity between senders si and sk is the
cosine of the angle between their vectors
cos(si, sk)
0 means no shared contact
1 means identical contact lists
In legitimate email, a 1 means that the senders
operate in the same social group.
Among spammers, a 1 means that the senders use the
same list or are the same person.
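
As a rough sketch, the cosine between two contact vectors can be computed as below. The function name `cosine` is hypothetical, and returning 0 for a sender with no contacts is an assumption the slides do not address.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length contact vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a user with no contacts is similar to nobody.
        return 0.0
    return dot / (norm_u * norm_v)
```

For binary vectors the dot product is simply the number of shared contacts, so senders with disjoint contact lists score 0 and senders with identical lists score 1 (up to floating-point rounding).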

10
Grouping Users Into Clusters

Group users with similar vectors
Users with similar vectors are likely to have
related roles, i.e. spammer or legitimate user
Each cluster is represented by a vector
This vector is the sum of all its component
users' vectors
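
A cluster's representative vector, as described above, is an element-wise sum. The following sketch assumes all member vectors have the same length; the name `cluster_vector` is chosen for illustration only.

```python
def cluster_vector(member_vectors):
    """Element-wise sum of the contact vectors of a cluster's members."""
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(column) for column in zip(*member_vectors)]
```

Each entry of the result counts how many members of the cluster have that contact, so the cluster vector is generally not binary.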

11
Similarity Between a User and a Cluster

Similarity is derived from the user-to-user
similarity equation
If sender si is a member of cluster sck, then the
similarity is cos(sck - si, si).
If sender si is not a member of cluster sck, then
the similarity is cos(sck, si).
Similarity between a user and a cluster will
change over time
Remove the user's vector from the cluster's
vector when computing similarity and
reclassifying a user
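
The two cases above can be sketched as follows. This is an illustrative reading of the slide, not the presenters' code: the function names and the `is_member` flag are assumptions, and a zero vector is treated as having similarity 0.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def user_cluster_similarity(user_vec, cluster_vec, is_member):
    """Similarity between a user and a cluster vector.

    If the user already belongs to the cluster, the user's own
    contribution is subtracted first, so the user is compared only
    against the other members.
    """
    if is_member:
        cluster_vec = [c - u for c, u in zip(cluster_vec, user_vec)]
    return cosine(cluster_vec, user_vec)
```

Without the subtraction, a member would always look somewhat similar to its own cluster simply because its vector is part of the sum.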

12
Detecting Spam

13
Detecting Spam

When an email arrives, classify it using some
other method
Find the cluster (sc) the email's sender belongs
to
If many users in the cluster send messages that
are classified as spam by the auxiliary method, the
probability of all the users in that cluster
sending spam is high
Update sc's spam probability
Ps(m) ← sc's spam probability
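
The steps on this slide can be sketched as below. The slide does not state how the cluster's spam probability is updated, so this example assumes a simple running fraction of messages that the auxiliary classifier flagged as spam. The class `SenderCluster`, the function `process_message`, and the lookup dictionaries are all hypothetical.

```python
class SenderCluster:
    """Tracks how often messages from a cluster are flagged as spam."""

    def __init__(self):
        self.spam_count = 0
        self.total_count = 0

    def update(self, auxiliary_says_spam):
        # Record the auxiliary classifier's verdict for one message.
        self.total_count += 1
        if auxiliary_says_spam:
            self.spam_count += 1

    @property
    def spam_probability(self):
        # Assumption: probability is the fraction of flagged messages.
        if self.total_count == 0:
            return 0.0
        return self.spam_count / self.total_count


def process_message(sender, cluster_of, clusters, auxiliary_says_spam):
    """Update the sender's cluster and return the message's Ps(m)."""
    cluster = clusters[cluster_of[sender]]
    cluster.update(auxiliary_says_spam)
    return cluster.spam_probability
```

Under this assumption, a message from a sender whose cluster has mostly produced flagged mail receives a high Ps(m) even when the auxiliary classifier passes that particular message.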

14
Detecting Spam