Spam Filters

1
Spam Filters

2
What is Spam?
  • Unsolicited (legally: no existing relationship)
  • Automated
  • Bulk
  • Email
  • Not necessarily commercial: also "flaming", political messages

3
Spam arriving in Michael's mailbox in August
  • You have won a lottery
  • Your bank needs your account details
  • Money transfer from Nigeria
  • On-line pharmaceuticals
  • Software for sale
  • Alarm systems
  • Looking for a safe, ethical secondary income?
  • Music and film downloads

4
Why send spam?
  • Email is fast, cheap, easy
  • Availability of enormous address lists (or guess
    likely addresses from dictionaries, e.g.
    ireland3@…, or harvest them from the web)
  • 7% of email users have bought something
  • 100 responses to 10 million emails will produce a
    profit
  • Illegal in the EU, but not in all US states

5
What's wrong with spam?
  • Wastes time deleting unwanted messages
  • User sees offensive material
  • Fills up file server storage space
  • Some people vulnerable to confidence tricks
  • BrightMail estimates that 8% of email was spam in
    2001 and 40% in 2002; it may stall the internet
    altogether

6
Combating spam
  • Blacklisting: maintain a list of email addresses
    of known spammers
  • Greylisting: challenge suspected spam emails,
    e.g. by asking a question which is simple for a
    human but difficult for a computer ("How many
    animals are in this picture?")
  • Munging: to defeat harvesters, e.g. post your
    email as "cormac at dublin dot com" on the web
    (see the sketch after this list)
  • Litigation: e.g. the anti-spam company Habeas
    embeds its copyrighted haiku ("winter into spring
    / brightly anticipated / like Habeas SWE") in
    legitimate mail and sues spammers who copy it
  • The EU says all bulk email should be opt-in
    unless there is an existing relationship
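
As a toy illustration of munging, here is a minimal Python sketch; the munge helper and its replacement scheme are assumptions for illustration, not a standard defence:

```python
def munge(address: str) -> str:
    """Rewrite an email address so that naive pattern-matching
    harvesters scanning web pages do not recognise it."""
    return address.replace("@", " at ").replace(".", " dot ")

print(munge("cormac@dublin.com"))  # -> cormac at dublin dot com
```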

7
Spam filters
  • Spam filters are an example of text
    classification (e.g. topic, language, author)
  • Which is worse: marking a legitimate email as
    spam, or letting a spam message through?

8
Rule-based filters
  • Some systems allow users to handcraft rules;
    rather than a plain yes/no, it is best to have an
    associated probability, e.g. "Barclays" → 90%,
    "Ivory Coast" → 70% (a sketch follows this list)
  • But this is time-consuming and tedious
  • Users must be savvy enough to create them
  • They must be constantly refined as the nature of
    spam changes
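
A minimal sketch of hand-crafted rules with associated probabilities, as described above; the rule table, the rule_score helper, and the take-the-maximum decision are all illustrative assumptions:

```python
# Hand-crafted rules: a trigger phrase and the probability that a
# message containing it is spam (values mirror the slide's examples).
RULES = {
    "barclays": 0.90,
    "ivory coast": 0.70,
}

def rule_score(message: str) -> float:
    """Return the highest spam probability of any rule that fires."""
    text = message.lower()
    return max((p for phrase, p in RULES.items() if phrase in text),
               default=0.0)

print(rule_score("Claim your Barclays refund now"))  # -> 0.9
```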

9
Adaptive filters
  • Learn directly from the data in the user's
    mailbox
  • Which words are truly characteristic of spam?
  • Compare with automatic indexing (stemming,
    mid-frequency words)

10
Training vs. test sets
  • 1. Learn the rules on the training data
  • 2. See if the rules work on the test data
  • E.g. use the LingSpam corpus (400 spams, 200
    legitimate messages sent to the Linguist List)
  • Better to build your own corpus: spammers can
    overcome filters built on just one corpus (a
    splitting sketch follows)
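
A minimal sketch of the train/test protocol, assuming a simple random shuffle split; the 2/3 : 1/3 ratio and the toy corpus are invented for illustration:

```python
import random

def train_test_split(messages, train_fraction=2/3, seed=0):
    """Shuffle the corpus; learn rules on the first part and test
    whether they still work on the held-out remainder."""
    shuffled = messages[:]                 # copy, leave corpus intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

corpus = [("you have won a lottery", "spam"),
          ("minutes of the staff meeting", "legitimate")] * 50
train, test = train_test_split(corpus)
print(len(train), "training messages,", len(test), "test messages")
```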

11
Chi-Squared Test
  • Find the most characteristic words in spam and
    non-spam with a chi-squared test (the same test
    also finds differences between men's and women's
    speech); a sketch follows
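
A sketch of the chi-squared test for one word, using scipy on a 2x2 contingency table; the counts are invented, and reading a large statistic as "characteristic of spam" follows the slide's use of the test:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table for the word "download" (invented counts):
#                     contains word   lacks word
# spam messages           40             160
# legitimate messages      5             395
table = [[40, 160],
         [5, 395]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, p = {p_value:.2g}")
# A large statistic (small p) means "download" is distributed very
# differently across spam and legitimate mail, i.e. it is characteristic.
```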

12
Mutual Information (1)
  • word, category e.g. how often is the word
    download found in spam?
  • word e.g. how many messages altogether contain
    download?
  • category e.g. how many messages altogether are
    spam?
  • N total number of messages

13
Mutual Information (2)
  • MI = log2( f(download, spam) × N
    / ( f(download) × f(spam) ) )
  • The higher the MI, the more "download" is typical
    of spam
  • Now that we have found which words are most
    typical of spam and of legitimate messages, we
    must use this information to classify the unseen
    messages in the test set (see the sketch below)
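
Slides 12 and 13 combined as code: a minimal mutual-information calculation from the four counts defined above (the counts themselves are invented for illustration):

```python
from math import log2

def mutual_information(f_word_cat, f_word, f_cat, n):
    """MI = log2( f(word, category) * N / (f(word) * f(category)) )."""
    return log2((f_word_cat * n) / (f_word * f_cat))

# Invented counts: "download" appears in 40 of 1000 messages;
# 35 of those 40 are among the 300 spam messages.
mi = mutual_information(f_word_cat=35, f_word=40, f_cat=300, n=1000)
print(f"MI(download, spam) = {mi:.2f}")  # positive: typical of spam
```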

14
Bayesian Modelling
  • Used in expert systems
  • We want to work out the probability of the
    hypothesis given the evidence, P(H | E)
  • E.g. P(spam | contains "NOW!")
  • P(not spam | contains "NOW!")
  • Which is greater?
  • Bayes' rule: P(H | E) = P(E | H) P(H) / P(E)
    (a worked example follows)
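
A worked instance of Bayes' rule with invented corpus counts, comparing the two probabilities the slide asks about:

```python
# Invented corpus counts.
n_messages = 1000
n_spam = 400
n_with_now = 120          # messages containing "NOW!"
n_spam_with_now = 100     # spam messages containing "NOW!"

p_now_given_spam = n_spam_with_now / n_spam    # P(E | H)
p_spam = n_spam / n_messages                   # P(H)
p_now = n_with_now / n_messages                # P(E)

# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)
p_spam_given_now = p_now_given_spam * p_spam / p_now
print(f"P(spam | contains 'NOW!') = {p_spam_given_now:.2f}")       # 0.83
print(f"P(not spam | contains 'NOW!') = {1 - p_spam_given_now:.2f}")
```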

15
Combining Evidence (1)
  • A Naïve Bayesian model assumes that multiple
    pieces of evidence are conditionally independent.
    Compare:
  • Toffee Vodka wins the 2.00 at Newmarket
  • All for Laura wins the 2.35 at Newmarket
  • Nebraska Tornado wins the 3.15 at Newmarket
  • Newcastle beat Birmingham
  • Newcastle lead Birmingham at half-time
  • Shearer scores a hat-trick
  • The three race results are (nearly) independent,
    but the three football statements strongly depend
    on one another

16
Combining Evidence (2)
  • In a Naïve Bayesian model,
  • P(cheap, v1agra, NOW! | spam) = P(cheap | spam)
    × P(v1agra | spam) × P(NOW! | spam)
  • Now we can find
  • P(spam | cheap, v1agra, NOW!) = a
  • P(not spam | cheap, v1agra, NOW!) = b
  • Odds on spam, given that the message contains
    these three words: a / b
  • In real text, words are conditionally dependent,
    e.g. "click here"
  • Only classify as spam if the odds are at least
    100 to 1 on (see the sketch below)
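
A minimal sketch of the Naïve Bayesian combination and the 100-to-1 odds threshold; the per-word likelihoods and priors are invented, and smoothing for unseen words is omitted for brevity:

```python
# Invented per-word likelihoods learned from training data.
P_WORD_GIVEN_SPAM = {"cheap": 0.20, "v1agra": 0.05, "now!": 0.30}
P_WORD_GIVEN_HAM  = {"cheap": 0.02, "v1agra": 0.001, "now!": 0.05}
P_SPAM, P_HAM = 0.4, 0.6   # invented priors

def odds_on_spam(words):
    """a / b: the naive-Bayes odds that a message is spam."""
    a, b = P_SPAM, P_HAM
    for w in words:
        a *= P_WORD_GIVEN_SPAM[w]   # P(words | spam) as a product
        b *= P_WORD_GIVEN_HAM[w]    # P(words | not spam) likewise
    return a / b

odds = odds_on_spam(["cheap", "v1agra", "now!"])
print(f"odds = {odds:.0f} to 1 on spam")
print("classify as spam" if odds >= 100 else "let it through")
```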

17
Non-word indicators of spam
  • phrases, e.g. "free money", "only $…", "over 21"
  • punctuation!!!
  • domain name of sender: .edu less likely to be
    spam than .com
  • spam is more likely to be sent at night than
    legitimate email
  • if a message has fewer than 9 non-alphanumeric
    characters, it is more likely to be legitimate
  • look for images, colours, HTML tags (a feature
    sketch follows)
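
A sketch of extracting a few of the non-word indicators above; the feature names, the night-time boundary, and the HTML regex are illustrative assumptions:

```python
import re

def non_word_features(message: str, sender: str, hour_sent: int) -> dict:
    """Extract some of the non-word spam indicators listed above."""
    return {
        "exclamations": message.count("!"),
        "non_alnum_chars": sum(not c.isalnum() and not c.isspace()
                               for c in message),
        "edu_sender": sender.endswith(".edu"),      # less likely spam
        "sent_at_night": hour_sent < 6 or hour_sent >= 22,
        "has_html_tag": bool(re.search(r"<[a-zA-Z][^>]*>", message)),
    }

print(non_word_features("FREE money NOW!!! <b>click</b>",
                        "offers@deals.com", hour_sent=3))
```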

18
Evaluation of spam filters
  • Junk precision: percentage of messages in the
    test data classified as junk which truly are junk
  • Junk recall: percentage of junk messages in the
    test data classified as junk
  • Legitimate precision: percentage of messages in
    the test data classified as legitimate which
    truly are legitimate
  • Legitimate recall: percentage of legitimate
    messages in the test data which are classified as
    legitimate (see the sketch below)
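
The four measures above, computed from invented confusion-matrix counts:

```python
# Confusion-matrix counts on the test data (invented numbers).
junk_as_junk = 180   # junk correctly classified as junk
ham_as_junk  = 5     # legitimate wrongly classified as junk
junk_as_ham  = 20    # junk wrongly let through
ham_as_ham   = 395   # legitimate correctly classified

junk_precision = junk_as_junk / (junk_as_junk + ham_as_junk)
junk_recall    = junk_as_junk / (junk_as_junk + junk_as_ham)
ham_precision  = ham_as_ham / (ham_as_ham + junk_as_ham)
ham_recall     = ham_as_ham / (ham_as_ham + ham_as_junk)

print(f"junk precision {junk_precision:.1%}, junk recall {junk_recall:.1%}")
print(f"legitimate precision {ham_precision:.1%}, recall {ham_recall:.1%}")
```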

19
Summary
  • The need to create spam filters automatically
  • Find words which are typical of spam, and words
    which are typical of legitimate emails, using
    training data
  • Use this knowledge to automatically classify new
    emails