Title: Spam Filters
1. Spam Filters
2. What is Spam?
- Unsolicited (legally, no existing relationship)
- Automated
- Bulk
- Email
- Not necessarily commercial (e.g. flaming, political)
3. Spam arriving in Michael's mailbox in August
- You have won a lottery
- Your bank needs your account details
- Money transfer from Nigeria
- On-line pharmaceuticals
- Software for sale
- Alarm systems
- Looking for a safe, ethical secondary income?
- Music and film downloads
4. Why send spam?
- Email is fast, cheap, easy
- Availability of enormous address lists (or guess likely addresses from dictionaries, e.g. ireland3_at_, or harvest them)
- 7% of email users have bought something
- 100 responses to 10 million emails will produce a profit
- Illegal in the EU, but not in all US states
5. What's wrong with spam?
- Wastes time deleting unwanted messages
- User sees offensive material
- Fills up file server storage space
- Some people vulnerable to confidence tricks
- BrightMail estimated that 8% of email was spam in 2001 and 40% in 2002. May stall the internet altogether
6. Combating spam
- Blacklisting: maintain a list of email addresses of known spammers
- Greylisting: challenge suspected spam emails, e.g. by asking the sender to answer a question which is simple for a human but difficult for a computer, e.g. how many animals are in this picture?
- Munging: to defeat harvesters, e.g. post your email as "cormac at dublin dot com" on the web
- Litigation: e.g. the anti-spam company Habeas and its haiku "winter into spring, brightly anticipated, like Habeas SWE"
- EU says all bulk email should be opt-in unless there is an existing relationship
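Blacklisting and munging are simple enough to sketch in a few lines. The following is only an illustration: the `BLACKLIST` set, the helper names `is_blacklisted` and `munge`, and the `.example` addresses are all made up for this sketch.

```python
# Hypothetical blacklist of known spammer addresses (made-up examples).
BLACKLIST = {"winner@lottery.example", "deals@pills.example"}


def is_blacklisted(sender: str) -> bool:
    """Return True if the sender's address is on the blacklist."""
    return sender.strip().lower() in BLACKLIST


def munge(address: str) -> str:
    """Rewrite an address so that simple harvesters do not recognise it."""
    return address.replace("@", " at ").replace(".", " dot ")


print(is_blacklisted("Winner@lottery.example"))  # True
print(munge("cormac@dublin.com"))  # cormac at dublin dot com
```

An exact-match blacklist like this is easy to evade by changing the sending address, which is one reason the later slides turn to filters that look at message content.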
7. Spam filters
- Spam filters are an example of text classification (other examples: by topic, language, author)
- What is worse: saying a legitimate email is spam, or letting through a spam message?
8. Rule-based filters
- Some systems allow users to handcraft rules
- Rather than a yes/no decision, it is best for each rule to have an associated probability, e.g. Barclays → 90%, Ivory Coast → 70%
- But this is time-consuming and tedious
- Users must be savvy enough to create them
- They must be constantly refined as the nature of spam changes
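A hand-crafted rule set with probabilities might look like the sketch below. The slide's 90 and 70 are read here as percentages. Taking the highest matching probability as the message score is an assumption made for illustration, not something the slides specify, and `RULES` and `rule_score` are hypothetical names.

```python
# Hand-crafted rules: phrase -> probability that a message containing it is spam.
# Assumption: the slide's 90 and 70 are percentages; the scoring scheme is illustrative.
RULES = {"barclays": 0.90, "ivory coast": 0.70}


def rule_score(message: str, rules: dict[str, float] = RULES) -> float:
    """Return the highest probability among the rules that match, or 0.0."""
    text = message.lower()
    matches = [p for phrase, p in rules.items() if phrase in text]
    return max(matches, default=0.0)


print(rule_score("Urgent: your Barclays account needs verification"))  # 0.9
print(rule_score("Minutes of Tuesday's meeting"))  # 0.0
```

Every new spam theme needs a new entry in `RULES`, which is the maintenance burden the bullets above describe.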
9. Adaptive filters
- Learn directly from the data in the user's mailbox
- Which words are truly characteristic of spam?
- Compare with automatic indexing (stemming, mid-frequency words)
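The first step in learning from a mailbox is counting how many spam and legitimate messages contain each word. A minimal sketch follows. It assumes messages arrive as (text, label) pairs with the labels "spam" and "legit". The tokenizer is deliberately crude and does no stemming, and the mailbox contents are made up.

```python
import re
from collections import Counter


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return the set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def document_frequencies(messages: list[tuple[str, str]]) -> dict[str, Counter]:
    """Count, per label, how many messages contain each word.

    `messages` is a list of (text, label) pairs with label "spam" or "legit".
    """
    counts = {"spam": Counter(), "legit": Counter()}
    for text, label in messages:
        counts[label].update(tokenize(text))
    return counts


# Made-up mailbox, for illustration only.
mailbox = [
    ("Download cheap software now", "spam"),
    ("Download music and film now", "spam"),
    ("Agenda for the syntax seminar", "legit"),
    ("Please download the seminar handout", "legit"),
]
freq = document_frequencies(mailbox)
print(freq["spam"]["download"], freq["legit"]["download"])  # 2 1
```

Counts of this kind are the raw input to statistical measures of how characteristic a word is of one class.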
10. Training vs. test sets
- 1. Learn the rules on the training data
- 2. See if the rules work on the test data
- E.g. use the LingSpam corpus (400 spams, 200 legitimate messages sent to the Linguist List)
- Better to build your own corpus: spammers can overcome filters built on just one corpus
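Holding out a test set can be sketched as below. The 25% test fraction, the fixed seed and the helper name `train_test_split` are assumptions for illustration. The stand-in corpus only mimics the 400/200 sizes mentioned above with placeholder strings.

```python
import random


def train_test_split(messages: list, test_fraction: float = 0.25, seed: int = 0):
    """Shuffle a copy of the messages and split it into (train, test)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in corpus: 600 labelled placeholders instead of real messages.
corpus = [(f"message {i}", "spam" if i < 400 else "legit") for i in range(600)]
train, test = train_test_split(corpus)
print(len(train), len(test))  # 450 150
```

The rules are learned from `train` only; `test` stays unseen until evaluation.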
11. Chi-Squared Test
- Find the most characteristic words in spam / non-spam using the chi-squared test (it also finds differences between men's and women's speech)
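For a single word, the test reduces to a 2x2 table: spam or legitimate against contains the word or does not. The sketch below uses the standard shortcut formula for a 2x2 table without continuity correction. The function name and the counts are made up for illustration.

```python
def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 contingency table.

    a: spam messages containing the word
    b: legitimate messages containing the word
    c: spam messages without the word
    d: legitimate messages without the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the word tells us nothing
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: the word appears in 20 of 25 spams and 10 of 35 legitimate messages.
print(round(chi_squared(20, 10, 5, 25), 2))  # 15.43
```

Ranking all words by this statistic puts the words whose distribution differs most between the two classes at the top.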
12. Mutual Information (1)
- $f(\text{word}, \text{category})$: e.g. how often is the word "download" found in spam?
- $f(\text{word})$: e.g. how many messages altogether contain "download"?
- $f(\text{category})$: e.g. how many messages altogether are spam?
- $N$: total number of messages
13. Mutual Information (2)
- $\mathrm{MI} = \log_2 \dfrac{f(\text{download}, \text{spam}) \cdot N}{f(\text{download}) \cdot f(\text{spam})}$
- The higher the MI, the more "download" is typical of spam
- Now that we have found which words are most typical of spam and legitimate messages, we must use this information to classify the unseen messages in the test set
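The mutual information score can be computed directly from the four counts. This is a minimal sketch: the parameter names are hypothetical, the counts are made up, and the function is undefined (it raises `ValueError`) when the word never occurs in the category.

```python
import math


def mutual_information(f_word_cat: int, f_word: int, f_cat: int, n: int) -> float:
    """Pointwise mutual information, in bits, between a word and a category."""
    return math.log2(f_word_cat * n / (f_word * f_cat))


# Made-up counts: 100 messages, 50 are spam, 20 contain "download", all 20 are spam.
print(mutual_information(20, 20, 50, 100))  # 1.0
```

A score of 0 means the word is no more frequent in the category than chance would predict; negative scores mean it is less frequent.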
14. Bayesian Modelling
- Used in expert systems
- We want to work out the probability of the hypothesis given the evidence, $P(H \mid E)$
- E.g. $P(\text{spam} \mid \text{contains NOW!})$
- $P(\text{not spam} \mid \text{contains NOW!})$
- Which is greater?
- Bayes' rule: $P(H \mid E) = P(E \mid H) \, P(H) / P(E)$
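A worked instance of Bayes' rule for a single piece of evidence is sketched below. The probabilities are made up, and $P(E)$ is expanded as $P(E \mid H) P(H) + P(E \mid \neg H) P(\neg H)$, which holds because a message is either spam or not.

```python
def posterior(p_e_given_h: float, p_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) by Bayes' rule, expanding P(E) over H and not-H."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e


# Made-up figures: 40% of mail is spam; "NOW!" occurs in 50% of spam
# and in 10% of legitimate mail.
p_spam = posterior(p_e_given_h=0.5, p_h=0.4, p_e_given_not_h=0.1)
p_legit = 1.0 - p_spam
print(round(p_spam, 3), round(p_legit, 3))  # 0.769 0.231
```

With these figures a message containing "NOW!" is more likely spam than not, even though most mail is legitimate.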
15. Combining Evidence (1)
- A Naïve Bayesian model assumes that multiple pieces of evidence are conditionally independent. Compare three independent events:
- Toffee Vodka wins the 2:00 at Newmarket
- All for Laura wins the 2:35 at Newmarket
- Nebraska Tornado wins the 3:15 at Newmarket
- with three events that depend on each other:
- Newcastle beat Birmingham
- Newcastle lead Birmingham at half-time
- Shearer scores a hat-trick
16. Combining Evidence (2)
- In a Naïve Bayesian model,
- $P(\text{cheap}, \text{v1agra}, \text{NOW!} \mid \text{spam}) = P(\text{cheap} \mid \text{spam}) \, P(\text{v1agra} \mid \text{spam}) \, P(\text{NOW!} \mid \text{spam})$
- Now we can find
- $P(\text{spam} \mid \text{cheap}, \text{v1agra}, \text{NOW!}) = a$
- $P(\text{not spam} \mid \text{cheap}, \text{v1agra}, \text{NOW!}) = b$
- Odds on spam given that the message contains these three words: $a / b$
- In real text, words are conditionally dependent, e.g. "click here"
- Only classify as spam if the odds are 100:1 on.
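The odds calculation can be sketched as follows. Because $P(E)$ appears in both $a$ and $b$, it cancels in the ratio, so the odds are the prior odds multiplied by one likelihood ratio per word. The per-word probabilities and the equal prior are made up, and the sketch only handles words present in the two tables.

```python
import math

# Made-up per-word likelihoods: P(word | spam) and P(word | not spam).
P_WORD_GIVEN_SPAM = {"cheap": 0.20, "v1agra": 0.10, "NOW!": 0.30}
P_WORD_GIVEN_LEGIT = {"cheap": 0.02, "v1agra": 0.001, "NOW!": 0.03}


def spam_odds(words: list[str], p_spam: float = 0.5) -> float:
    """Odds a / b on spam, multiplying per-word likelihood ratios (naive assumption)."""
    log_odds = math.log(p_spam / (1.0 - p_spam))
    for word in words:
        log_odds += math.log(P_WORD_GIVEN_SPAM[word] / P_WORD_GIVEN_LEGIT[word])
    return math.exp(log_odds)


def is_spam(words: list[str], threshold: float = 100.0) -> bool:
    """Classify as spam only if the odds are at least threshold to 1 on."""
    return spam_odds(words) >= threshold


print(is_spam(["cheap", "v1agra", "NOW!"]))  # True
print(is_spam(["cheap"]))  # False
```

Summing logarithms rather than multiplying raw probabilities is a common way to avoid numerical underflow on long messages. The high threshold reflects the view that wrongly discarding a legitimate message is worse than letting a spam through.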
17. Non-word indicators of spam
- phrases, e.g. "free money", "only $", "over 21"
- punctuation!!!
- domain name of sender: .edu less likely to be spam than .com
- spam more likely to be sent at night than legitimate email
- If less than 9 non-alphanumeric characters, more likely to be legitimate
- Look for images, colours, HTML tags
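Several of these indicators can be pulled out of a message with a few lines of code. The sketch below is illustrative only. The function name, the feature names, the definition of "night" as 00:00 to 06:00, and the choice not to count whitespace as non-alphanumeric are all assumptions.

```python
import re


def non_word_features(sender: str, hour_sent: int, body: str) -> dict:
    """Extract a few non-word indicators from one message."""
    return {
        # Runs of two or more exclamation marks, e.g. "!!!"
        "repeated_exclamations": len(re.findall(r"!{2,}", body)),
        # Top-level domain of the sender, e.g. "edu" or "com"
        "sender_tld": sender.rsplit(".", 1)[-1].lower(),
        # Assumption: "night" means midnight up to 06:00 in the recipient's time zone
        "sent_at_night": 0 <= hour_sent < 6,
        # Characters that are neither letters, digits nor whitespace
        "non_alphanumeric": sum(1 for ch in body if not (ch.isalnum() or ch.isspace())),
        # Crude check for HTML markup such as <img ...> or <font ...>
        "has_html": bool(re.search(r"<[a-zA-Z][^>]*>", body)),
    }


features = non_word_features(
    sender="offers@bargains.example.com",
    hour_sent=3,
    body='<font color="red">FREE MONEY!!!</font> Only $9.99',
)
print(features)
```

Each feature can then be treated as one more piece of evidence alongside the words, with its own probability in spam and in legitimate mail.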
18. Evaluation of spam filters
- Junk precision: percentage of messages in the test data classified as junk which truly are junk
- Junk recall: percentage of junk messages in the test data which are classified as junk
- Legitimate precision: percentage of messages in the test data classified as legitimate which truly are legitimate
- Legitimate recall: percentage of legitimate messages in the test data which are classified as legitimate
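All four measures come from comparing the filter's decisions with the true labels of the test messages. A minimal sketch, assuming the labels are the strings "junk" and "legit" and using a made-up test set of eight messages:

```python
def precision_recall(true_labels: list[str], predicted: list[str], target: str):
    """Return (precision, recall) as percentages for one class."""
    pairs = list(zip(true_labels, predicted))
    correct = sum(1 for t, p in pairs if t == target and p == target)
    classified_as = sum(1 for _, p in pairs if p == target)
    truly_are = sum(1 for t, _ in pairs if t == target)
    precision = 100.0 * correct / classified_as if classified_as else 0.0
    recall = 100.0 * correct / truly_are if truly_are else 0.0
    return precision, recall


# Made-up test set of eight messages.
truth = ["junk", "junk", "junk", "junk", "legit", "legit", "legit", "legit"]
guess = ["junk", "junk", "junk", "legit", "legit", "legit", "legit", "legit"]
print(precision_recall(truth, guess, "junk"))   # (100.0, 75.0)
print(precision_recall(truth, guess, "legit"))  # (80.0, 100.0)
```

In this example the filter never flags a legitimate message (junk precision 100%), at the cost of letting one spam through (junk recall 75%).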
19. Summary
- The need to create spam filters automatically
- Find words which are typical of spam, and words which are typical of legitimate emails, using training data
- Use this knowledge to automatically classify new emails