Spam Filters

About This Presentation

Provided by: sharonm6

Transcript and Presenter's Notes

1
Spam Filters

2
What is Spam?

Unsolicited (legally, no existing relationship)
Automated
Bulk
Email
Not necessarily commercial (e.g. flaming, political)

3
Spam arriving in Michael's mailbox in August

You have won a lottery
Your bank needs your account details
Money transfer from Nigeria
On-line pharmaceuticals
Software for sale
Alarm systems
Looking for a safe, ethical secondary income?
Music and film downloads

4
Why send spam?

Email is fast, cheap, easy
Enormous address lists are available (or spammers
guess likely addresses from dictionaries, e.g.
ireland3_at_, or harvest them from the web)
7% of email users have bought something
100 responses to 10 million emails will produce a
profit
Illegal in the EU, but not in all US states

5
What's wrong with spam?

Wastes time deleting unwanted messages
User sees offensive material
Fills up file server storage space
Some people vulnerable to confidence tricks
BrightMail estimates that 8% of email was spam in 2001,
40% in 2002. May stall the internet altogether

6
Combating spam

Blacklisting: maintain a list of email addresses
of known spammers
Greylisting: challenge suspected spam emails,
e.g. the sender must answer a question which is
simple for a human but difficult for a computer,
e.g. how many animals are in this picture?
Munging: to defeat harvesters, e.g. post your
email as cormac at dublin dot com on the web
Litigation: e.g. anti-spam company Habeas puts a
copyrighted haiku in email headers ("winter into
spring, brightly anticipated, like Habeas SWE") so
that spammers who copy it can be sued. EU says all
bulk email should be opt-in unless there is an
existing relationship.

7
Spam filters

Spam filters are an example of text
classification (e.g. topic, language, author)
Which is worse: saying a legitimate email is spam,
or letting through a spam message?

8
Rule-based filters

Some systems allow users to handcraft rules.
Rather than yes/no, best to have an associated
probability, e.g. Barclays → 90%, Ivory Coast →
70%.
But this is time consuming and tedious
Users must be savvy enough to create them
They must be constantly refined as the nature of
spam changes
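A hand-written rule set with associated probabilities might be sketched as below. The slide does not say how several matching rules are combined, so taking the highest matching probability is an assumption made here for illustration, as are the names `RULES` and `rule_score`.

```python
# Hypothetical hand-written rules: phrase -> estimated probability of spam.
# The two entries mirror the slide's examples; the combination rule (max)
# is an assumption, not something the slides specify.
RULES = {"barclays": 0.90, "ivory coast": 0.70}


def rule_score(message, rules=RULES):
    """Return the highest spam probability among the rules that match."""
    text = message.lower()
    matches = [prob for phrase, prob in rules.items() if phrase in text]
    return max(matches, default=0.0)


if __name__ == "__main__":
    print(rule_score("Urgent: your Barclays account needs verification"))
    print(rule_score("Lunch on Friday?"))
```

Every new kind of spam needs another entry in `RULES`, which is the maintenance burden the slide describes.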

9
Adaptive filters

Learn directly from the data in the user's
mailbox
Which words are truly characteristic of spam?
Compare with automatic indexing (stemming,
mid-frequency words)

10
Training vs. test sets

1. Learn the rules on the training data
2. See if the rules work on the test data
E.g. use the LingSpam corpus (400 spams, 200
legitimate messages sent to the Linguist List)
Better to build your own corpus: spammers can
overcome filters built on just one corpus
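The two steps above assume the labelled messages have first been divided into a training set and a test set. A minimal sketch of such a split follows; the 20% test fraction, the fixed seed and the function name `split_corpus` are illustrative assumptions.

```python
import random


def split_corpus(messages, test_fraction=0.2, seed=0):
    """Shuffle labelled messages and split them into (training, test) lists."""
    shuffled = list(messages)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    corpus = [("message %d" % i, "spam" if i % 3 == 0 else "legitimate")
              for i in range(10)]
    training, test = split_corpus(corpus)
    print(len(training), "training messages,", len(test), "test messages")
```

The rules are learned only from `training`; `test` is held back so that the filter is judged on messages it has not seen.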

11
Chi-Squared Test

Find most characteristic words in spam / non-spam
by chi-squared test (also finds differences
between men's and women's speech)
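For a single word, the chi-squared statistic can be computed from a 2 x 2 table of message counts (word present or absent, spam or legitimate). The sketch below uses the standard shortcut formula for a 2 x 2 table; the counts in the usage example are invented for illustration.

```python
def chi_squared(spam_with, legit_with, spam_without, legit_without):
    """Chi-squared statistic for a 2x2 table of message counts.

    spam_with / legit_with:       messages that contain the word
    spam_without / legit_without: messages that do not contain it
    """
    a, b, c, d = spam_with, legit_with, spam_without, legit_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column carries no information
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Invented counts: the word occurs in 30 of 100 spams and 5 of 100 legitimate messages.
    print(round(chi_squared(30, 5, 70, 95), 3))
```

A large value means the word is distributed very differently in the two categories; the statistic alone does not say which category it favours, so the raw counts still need checking.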

12
Mutual Information (1)

count(word, category): e.g. how often is the word
download found in spam?
count(word): e.g. how many messages altogether
contain download?
count(category): e.g. how many messages altogether
are spam?
N: total number of messages

13
Mutual Information (2)

MI = log2( count(download, spam) × N /
( count(download) × count(spam) ) )
The higher the MI, the more download is typical
of spam
Now we have found which words are most typical of
spam and legitimate messages, we must use this
information to classify the unseen messages in
the test set
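The mutual information score for one word and one category can be sketched as follows, using the four counts defined on the previous slide. The function name and the example counts are illustrative assumptions.

```python
import math


def mutual_information(n_word_and_category, n_word, n_category, n_total):
    """log2( count(word, category) * N / (count(word) * count(category)) )."""
    if n_word_and_category == 0:
        return float("-inf")  # the word never occurs in this category
    return math.log2(n_word_and_category * n_total / (n_word * n_category))


if __name__ == "__main__":
    # Invented counts: 400 messages, 100 of them spam, 50 contain "download",
    # and 40 of those 50 are spam.
    print(mutual_information(40, 50, 100, 400))
```

A score of 0 means the word is no more common in the category than in the corpus as a whole; positive scores mark words typical of the category.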

14
Bayesian Modelling

Used in expert systems
We want to work out the probability of the
hypothesis given the evidence, P(H | E)
E.g. P(spam | contains NOW!)
P(not spam | contains NOW!)
Which is greater?
Bayes' rule: P(H | E) = P(E | H) P(H) /
P(E)
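A small worked example of Bayes' rule for one piece of evidence is sketched below. P(E) is expanded as P(E | H) P(H) + P(E | not H) P(not H); the probabilities in the usage example are invented for illustration.

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """P(H | E) by Bayes' rule, with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e


if __name__ == "__main__":
    # Invented figures: half of all mail is spam, "NOW!" appears in 40% of
    # spam and in 10% of legitimate mail.
    p_spam = posterior(0.4, 0.5, 0.1)
    print("P(spam | contains NOW!)     =", p_spam)
    print("P(not spam | contains NOW!) =", 1.0 - p_spam)
```

With these figures the first probability is the greater, so a message containing NOW! is more likely spam than not.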

15
Combining Evidence (1)

A Naïve Bayesian model assumes that multiple
pieces of evidence are conditionally independent.
Compare:
Toffee Vodka wins the 2:00 at Newmarket
All for Laura wins the 2:35 at Newmarket
Nebraska Tornado wins the 3:15 at Newmarket
Newcastle beat Birmingham
Newcastle lead Birmingham at half-time
Shearer scores a hat-trick

16
Combining Evidence (2)

In a Naïve Bayesian model,
P(cheap, v1agra, NOW! | spam) = P(cheap | spam)
× P(v1agra | spam) × P(NOW! | spam)
Now we can find
P(spam | cheap, v1agra, NOW!) = a
P(not spam | cheap, v1agra, NOW!) = b
Odds on spam given that the message contains
these three words = a / b
In real text, words are conditionally dependent,
e.g. click here
Only classify as spam if the odds are 100:1 on.
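The odds a / b can be computed without ever working out P(E), because it cancels when the two posteriors are divided. The sketch below multiplies the prior odds by one likelihood ratio per word, working in logarithms to avoid underflow on long messages. The per-word probabilities are invented, and the function names are assumptions made for illustration.

```python
import math


def spam_odds(words, p_word_given_spam, p_word_given_legit, p_spam=0.5):
    """Odds a / b on spam for a message containing `words`, assuming the
    words are conditionally independent (the Naive Bayes assumption)."""
    log_odds = math.log(p_spam / (1.0 - p_spam))  # prior odds
    for word in words:
        log_odds += math.log(p_word_given_spam[word] / p_word_given_legit[word])
    return math.exp(log_odds)


def classify_as_spam(odds, threshold=100.0):
    """Only call a message spam when the odds are at least 100:1 on."""
    return odds >= threshold


if __name__ == "__main__":
    # Invented estimates of P(word | spam) and P(word | not spam).
    in_spam = {"cheap": 0.2, "v1agra": 0.1, "NOW!": 0.3}
    in_legit = {"cheap": 0.05, "v1agra": 0.001, "NOW!": 0.1}
    odds = spam_odds(["cheap", "v1agra", "NOW!"], in_spam, in_legit)
    print(round(odds), classify_as_spam(odds))
```

The high threshold reflects the earlier point that marking a legitimate email as spam is the more costly mistake.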

17
Non-word indicators of spam

Phrases, e.g. free money, only $, over 21
Punctuation, e.g. !!!
Domain name of sender: .edu less likely to be
spam than .com
spam more likely to be sent at night than
legitimate email
If less than 9% non-alphanumeric characters, more
likely to be legitimate
Look for images, colours, HTML tags
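Some of these indicators can be extracted with a few lines of code. The sketch below covers punctuation, the share of non-alphanumeric characters, the sender's top-level domain and the presence of HTML tags; the feature names and the simple tag pattern are assumptions, and time of sending is left out because it needs the message headers.

```python
import re


def non_word_features(text, sender):
    """Extract a few non-word spam indicators from a message and its sender."""
    visible = [ch for ch in text if not ch.isspace()]
    non_alnum = sum(1 for ch in visible if not ch.isalnum())
    return {
        "exclamation_marks": text.count("!"),
        "non_alnum_percent": 100.0 * non_alnum / len(visible) if visible else 0.0,
        "sender_tld": sender.rsplit(".", 1)[-1].lower() if "." in sender else "",
        # Rough check for HTML markup: an opening or closing tag such as <b> or </b>.
        "has_html": bool(re.search(r"</?[a-zA-Z][^>]*>", text)),
    }


if __name__ == "__main__":
    print(non_word_features("FREE!!! <b>Click</b>", "offers@example.com"))
```

Each value can then be treated as one more piece of evidence alongside the words in the message.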

18
Evaluation of spam filters

Junk precision: percentage of messages in the
test data classified as junk which truly are junk
Junk recall: percentage of junk messages in the
test data classified as junk
Legitimate precision: percentage of messages in
the test data classified as legitimate which
truly are legitimate
Legitimate recall: percentage of legitimate
messages in the test data which are classified as
legitimate
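The four measures can be computed from two parallel lists, the true labels and the filter's decisions on the test data. This is a minimal sketch; the label strings and the five-message example are invented for illustration.

```python
def precision(actual, predicted, label):
    """Percentage of messages classified as `label` that truly are `label`."""
    truths = [a for a, p in zip(actual, predicted) if p == label]
    return 100.0 * truths.count(label) / len(truths) if truths else 0.0


def recall(actual, predicted, label):
    """Percentage of messages that truly are `label` and were classified as such."""
    decisions = [p for a, p in zip(actual, predicted) if a == label]
    return 100.0 * decisions.count(label) / len(decisions) if decisions else 0.0


if __name__ == "__main__":
    actual = ["junk", "junk", "junk", "legitimate", "legitimate"]
    predicted = ["junk", "junk", "legitimate", "legitimate", "junk"]
    for label in ("junk", "legitimate"):
        print(label, "precision:", round(precision(actual, predicted, label), 1),
              "recall:", round(recall(actual, predicted, label), 1))
```

Legitimate recall is the figure to watch most closely, since every point lost there is a genuine email wrongly filed as junk.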

19
Summary

The need to create spam filters automatically
Find words which are typical of spam, and words
which are typical of legitimate emails, using
training data
Use this knowledge to automatically classify new
emails
