FILTERING E-MAILS
FOR SPAM
Alokika Dash
Electrical & Computer Engineering
University of Maryland

WHAT IS SPAMMING AND TEXT FILTERING?
Unsolicited e-mail.
Spam e-mail is 50% of all e-mail received by
organizations
How do we get spam?
Address leakage
Active snooping by spammers

TEXT FILTERING FOR SPAM EMAILS Alokika
Dash University of Maryland
3

PROPERTIES OF SPAMMING
The e-mail message is sent to a large number of
people advertising a product or service.
The e-mails are unwanted.
The sender of spam frequently attempts to hide
or obscure their identity.
The spammer is not inclined to stop.

4

WHY IS SPAM A CONCERN?
Direct IT cost
Delivering the message costs money in internet
bandwidth and disk storage.
Indirect cost
Employee productivity.
Business risk
Spam is unregulated.
No control over the content of e-mail.
What should we do?
Hire someone to read our mail and discard the
spam
Use machine learning techniques to automate this
process

5

IDEA BEHIND SPAM DETECTION
Is message x spam?
Feature Extraction
Each document is distilled into a set of features
such as words, phrases, meta-data, etc.
Tokenizing
This set of features can then be represented as a
vector whose components are boolean
(multivariate) or real values.
Classification algorithm
Uses the feature vector as the basis on which the
document is judged.
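
The feature-extraction and tokenizing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the sample message, and the helper names (tokenize, boolean_vector, count_vector) are hypothetical and do not come from the slides.

```python
import re

def tokenize(message: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())

def boolean_vector(message: str, vocabulary: list[str]) -> list[int]:
    """One component per vocabulary word: 1 if the word occurs, else 0."""
    present = set(tokenize(message))
    return [1 if word in present else 0 for word in vocabulary]

def count_vector(message: str, vocabulary: list[str]) -> list[int]:
    """One component per vocabulary word: how often the word occurs."""
    tokens = tokenize(message)
    return [tokens.count(word) for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only.
vocabulary = ["free", "money", "meeting", "offer"]
message = "FREE offer: claim your free money now"

print(boolean_vector(message, vocabulary))  # [1, 1, 0, 1]
print(count_vector(message, vocabulary))    # [2, 1, 0, 1]
```

Either vector can then be handed to a classification algorithm; the boolean form records only presence, while the count form keeps frequency information.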

6

DOMAIN SPECIFICATION
FEATURES
words in the body of the message,
headers (senders and message paths)
HTML code (like colors)
word pairs, phrases
meta information (e.g., publication date, document
type, publication source)
CLASSIFICATION METHODS
Rule Based or Key-word Filtering
Statistical or Machine learning
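
As a point of comparison with the statistical methods, a rule-based (keyword) filter might look like the sketch below. The phrases, weights, and threshold are hypothetical values chosen for illustration; real rule sets are hand-maintained and much larger.

```python
# Hypothetical rule set: phrase -> score. Illustrative values only.
RULES = {"free money": 2.0, "act now": 1.5, "unsubscribe": 0.5}

def rule_score(message: str, rules: dict[str, float] = RULES) -> float:
    """Add up the scores of every rule phrase found in the message."""
    text = message.lower()
    return sum(score for phrase, score in rules.items() if phrase in text)

def is_spam_by_rules(message: str, threshold: float = 2.0) -> bool:
    """Flag the message when its total rule score reaches the threshold."""
    return rule_score(message) >= threshold

print(is_spam_by_rules("Get FREE MONEY, act now!"))  # True
print(is_spam_by_rules("Meeting moved to 3pm"))      # False
```

The rules never change unless someone edits them, which is the main contrast with the learning-based methods.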

8

MACHINE LEARNING TECHNIQUES
Bayesian Techniques
Boosting Trees
Support Vector Machines
Decision Trees

9

TRAINING DATA FOR SPAM FILTERING
The ML approach relies on the availability of an
initial corpus
SpamAssassin Public Corpus
Total count 6047 messages, with about a 31% spam
ratio.
LingSpam corpus
Total count 2412 messages, with about a 16.6%
spam ratio.
Annexia corpus
Great Spam Archive
Total count 15369 spam messages.
The initial corpus is split into two sets
Training Set
Test Set
Common practice: 90% of the corpus should be used
for training while 10% should be used for testing.
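
A 90/10 split of a labeled corpus can be sketched as follows. The corpus here is synthetic placeholder data, and split_corpus is a hypothetical helper name; shuffling with a fixed seed is an assumption made so the split is reproducible.

```python
import random

def split_corpus(messages, train_fraction=0.9, seed=0):
    """Shuffle a copy of the corpus and cut it into training and test sets."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Synthetic placeholder corpus of (text, is_spam) pairs.
corpus = [(f"message {i}", i % 3 == 0) for i in range(100)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # 90 10
```

Shuffling before cutting matters when the corpus is stored with all spam first, since an unshuffled cut would leave the test set with only one class.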

10

BAYESIAN FILTERING
Messages e are first split into several types
of tokens: header tokens, body tokens and
synthesized tokens.
Count the number of times each token occurs in
each corpus
Map each token to the probability that an email
containing it is spam
Posterior probability of folder fi given e
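
The counting and mapping steps above can be sketched as below. This is one possible estimate, not necessarily the one used in the slides: it assumes equal priors for spam and good mail, counts how many messages contain each token, and applies add-one smoothing so unseen tokens do not produce a probability of exactly 0 or 1. The tiny corpora and the function names are hypothetical.

```python
from collections import Counter

def token_counts(messages):
    """Count, per token, how many messages of the corpus contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.lower().split()))
    return counts

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) by Bayes' rule, assuming equal priors."""
    # Smoothed fraction of spam / good messages that contain the token.
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

# Hypothetical miniature corpora, for illustration only.
spam = ["cheap promotion today", "promotion for madam", "enter promotion now"]
ham = ["sorry for the delay", "very sorry about today"]
spam_counts, ham_counts = token_counts(spam), token_counts(ham)

for token in ("promotion", "sorry", "today"):
    p = token_spam_probability(token, spam_counts, ham_counts, len(spam), len(ham))
    print(f"{token}: {p:.2f}")
```

Tokens seen mostly in spam get a probability above 0.5, tokens seen mostly in good mail get one below 0.5, and tokens seen equally often in both stay near the middle.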

11
EXAMPLE PROBABILITY ASSIGNMENT
wj            P( wj | fi )
madam         0.99
promotion     0.99
republic      0.99
shortest      0.047
mandatory     0.047
sorry         0.082
enter         0.907
very          0.147
investment    0.86
12

ADVANTAGES
They learn automatically from spam and from good
mail
Any message with a probability of over 90% is
spam, anything else is good
Results in a very robust and efficient anti-spam
approach that returns hardly any false positives.
Caught 99.5% of spam with less than 0.03% false
positives, which implies that instead of receiving 20
spams per day you might receive one spam every 10
days or so (20 × 0.005 = 0.1 missed spam per day)
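
To make the 90% threshold concrete, the sketch below combines per-token values into one message-level probability and applies the cutoff. It reuses the example token values listed earlier and reads them as the probability that a message containing the token is spam; that reading, the naive independence assumption, the equal priors, and the neutral 0.5 for unknown tokens are all assumptions of this sketch rather than details confirmed by the slides.

```python
from math import prod

# Example token values from the slides, read here as per-token spam probabilities.
TOKEN_SPAM_PROBABILITY = {
    "madam": 0.99, "promotion": 0.99, "republic": 0.99,
    "shortest": 0.047, "mandatory": 0.047, "sorry": 0.082,
    "enter": 0.907, "very": 0.147, "investment": 0.86,
}

def spam_probability(tokens, table=TOKEN_SPAM_PROBABILITY, unknown=0.5):
    """Combine token probabilities, treating tokens as independent evidence."""
    probs = [table.get(token, unknown) for token in set(tokens)]
    spam_side = prod(probs)
    ham_side = prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)

def is_spam(tokens, threshold=0.9):
    """Apply the 90% rule: above the threshold is spam, anything else is good."""
    return spam_probability(tokens) > threshold

print(is_spam(["madam", "promotion", "enter"]))  # True
print(is_spam(["sorry", "very", "shortest"]))    # False
```

Because the combination multiplies evidence, a few strongly indicative tokens push the result close to 0 or 1, which is why a high threshold still catches most spam.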

13

ADABOOST ALGORITHM
Run a given weak learner several times on
slightly altered training data
Use a decision tree algorithm as the weak learner
Assign each example of the given training set a
weight. At the beginning all weights are equal,
but in every round the weak learner returns a
hypothesis, and the weights of the examples that
hypothesis misclassifies are increased
The final hypothesis is a combination of the
hypotheses of all rounds, namely a weighted
majority vote, where hypotheses with lower
classification error have higher weight
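
The rounds described above can be sketched in plain Python. As a simplification, the weak learner here is a decision stump (a one-level decision tree on a single boolean feature) rather than a full decision tree, and the data set is a synthetic toy: three word-presence features with a message labeled spam (+1) when at least two of the words appear. All function names are hypothetical.

```python
import math
from itertools import product

def stump_predict(x, feature, polarity):
    """Predict `polarity` when the feature is present, otherwise the opposite."""
    return polarity if x[feature] else -polarity

def train_stump(X, y, w):
    """Return (weighted error, feature, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (+1, -1):
            err = sum(wi for wi, x, yi in zip(w, X, y)
                      if stump_predict(x, feature, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, feature, polarity)
    return best

def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n                  # at the beginning all weights are equal
    ensemble = []                      # list of (alpha, feature, polarity)
    for _ in range(rounds):
        err, feature, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # lower error, higher vote weight
        ensemble.append((alpha, feature, polarity))
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, feature, polarity))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote over the hypotheses of all rounds."""
    score = sum(alpha * stump_predict(x, f, p) for alpha, f, p in ensemble)
    return 1 if score >= 0 else -1

# Synthetic toy data: all 8 combinations of three word-presence features.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [1 if sum(x) >= 2 else -1 for x in X]

ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, x) for x in X] == y)  # True
```

No single stump can classify this toy set correctly (each one gets two of the eight examples wrong), but the weighted vote of three stumps does, which is the point of combining weak hypotheses.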

14-17
EXAMPLE
[Worked example spanning four slides; the slide figures are not reproduced in this transcript]
18

ADVANTAGES
Fast and Simple
Flexible: can combine with any weak learner
No prior knowledge needed about the weak learner
Not prone to overfitting

19

BAYESIAN NETWORK VS. ADABOOST
[Two comparison panels, not reproduced in this transcript: number of frequent words used = 50, and number of frequent words used = 100]

20

CHALLENGES IN DETECTING SPAM
The more aggressive a single filter becomes, the more
likely it is that a good message will be blocked
accidentally (FALSE POSITIVES)
For example
An entertainment conglomerate estimated that a
single lost email from an important customer
could cost them more than $100,000
Incorrectly blocked mail from a constituent could
cost the votes of one's family, friends and
neighbors.
Spam fighting accuracy declines in proportion
with the number of people covered by the filter.

21

CHALLENGES IN DETECTING SPAM
Spammers Get Past Filters
Changing message content
Increasing message volume
New delivery mechanisms
Attacking anti-spam groups

22

FUTURE WORK IN TEXT FILTERING
Filter based on word triples or longer phrases
rather than individual words
Add a second level of testing designed
specifically to avoid false positives.
The second level of filtering may or may not be
Bayesian
If a mail triggers this second level of filters
it will be accepted even if its spam probability
is above the threshold.
Focus extra attention on specific parts of the
email
Decompose domain names separately from the rest of the
text in an email, since they often consist of several
words stuck together.
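
Three of the ideas above (word triples, a second level of testing, and decomposing domain names) can be sketched as follows. Everything here is illustrative: the known-sender rule standing in for the second-level test, the small vocabulary, and all function names are hypothetical, since the slides do not specify how these steps would be implemented.

```python
def word_ngrams(text, n=3):
    """Return word triples (or longer phrases) to use as tokens."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def two_level_decision(spam_probability, message, second_level, threshold=0.9):
    """Accept mail below the threshold, or mail that triggers the second level."""
    if spam_probability <= threshold:
        return "accept"
    return "accept" if second_level(message) else "reject"

# Hypothetical second-level test: mail from an already known sender.
KNOWN_SENDERS = {"alice@example.com"}

def from_known_sender(message):
    return message.get("from") in KNOWN_SENDERS

def split_domain_label(label, vocabulary):
    """Split a run-together label into known words, or return None."""
    if not label:
        return []
    for end in range(len(label), 0, -1):      # try the longest prefix first
        if label[:end] in vocabulary:
            rest = split_domain_label(label[end:], vocabulary)
            if rest is not None:
                return [label[:end]] + rest
    return None

print(word_ngrams("click here to win now"))
# ['click here to', 'here to win', 'to win now']
print(two_level_decision(0.95, {"from": "alice@example.com"}, from_known_sender))
# accept
print(split_domain_label("cheapmedsonline", {"cheap", "meds", "online"}))
# ['cheap', 'meds', 'online']
```

The words recovered from a domain name can then be fed to the filter as ordinary tokens instead of one opaque string.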