FILTERING EMAILS - PowerPoint PPT Presentation

About This Presentation
Title:

FILTERING EMAILS

Slides: 24
Provided by: Rabi1
Tags: emails | filtering | dash

Transcript and Presenter's Notes

Title: FILTERING EMAILS


1
  • FILTERING E-MAILS
  • FOR SPAM
  • Alokika Dash
  • Electrical & Computer Engineering
  • University of Maryland

2
  • WHAT IS SPAMMING AND TEXT FILTERING?
  • Spam is unsolicited e-mail.
  • Spam e-mail is 50% of all e-mail received by
    organizations
  • How do we get spam?
  • Address leakage
  • Active snooping by spammers

TEXT FILTERING FOR SPAM EMAILS
Alokika Dash, University of Maryland
3
  • PROPERTIES OF SPAMMING
  • The e-mail message is sent to a large number of
    people advertising a product or service.
  • The e-mails are unwanted.
  • The sender of spam frequently attempts to hide
    or obscure their identity.
  • The spammer is not inclined to stop.

4
  • WHY IS SPAM A CONCERN?
  • Direct IT cost
  • Delivering the message costs money in Internet
    bandwidth and disk storage.
  • Indirect cost
  • Employee productivity.
  • Business risk
  • Spam is unregulated.
  • No control over the content of e-mail.
  • What should we do?
  • Hire someone to read our mail and discard the
    spam
  • Use machine learning techniques to automate this
    process

5
  • IDEA BEHIND SPAM DETECTION
  • Is message x spam?
  • Feature Extraction
  • Each document is distilled into a set of features
    such as words, phrases, meta-data, etc.
  • Tokenizing
  • This set of features can then be represented as a
    vector whose components are boolean
    (multivariate) or real values.
  • Classification algorithm
  • Uses the feature vector as the basis on which the
    document is judged.
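
The feature extraction step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a language, and the regex tokenizer, the four-word vocabulary, and the function names are hypothetical choices rather than the presenter's implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9$']+", text.lower())

def boolean_vector(tokens, vocabulary):
    """Boolean features: 1 if the vocabulary word occurs in the message, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Real-valued features: relative frequency of each vocabulary word."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty message
    return [counts[word] / total for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only.
vocabulary = ["free", "money", "meeting", "offer"]
tokens = tokenize("FREE offer: free money now")

print(boolean_vector(tokens, vocabulary))    # [1, 1, 0, 1]
print(frequency_vector(tokens, vocabulary))  # [0.4, 0.2, 0.0, 0.2]
```

Either vector can then be handed to a classifier; the Boolean form discards how often a word appears, while the real-valued form keeps it.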

6
  • DOMAIN SPECIFICATION
  • FEATURES
  • words in the body of the message,
  • headers (senders and message paths)
  • HTML code (like colors)
  • word pairs, phrases
  • meta information (e.g., publication date,
    document type, publication source)
  • CLASSIFICATION METHODS
  • Rule Based or Key-word Filtering
  • Statistical or Machine learning
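
To make the contrast between the two classification methods concrete, here is a minimal sketch of a rule-based keyword filter in Python. The keywords, scores, and threshold are invented for illustration and do not come from the slides; a statistical filter would instead learn such weights from a labeled corpus.

```python
# Hypothetical hand-written rules: (keyword, score). Not taken from the slides.
RULES = [("free money", 2.0), ("viagra", 3.0), ("click here", 1.5)]
THRESHOLD = 3.0  # assumed cut-off above which a message is flagged

def rule_score(message):
    """Sum the scores of all keywords that appear in the message."""
    text = message.lower()
    return sum(score for keyword, score in RULES if keyword in text)

def is_spam_by_rules(message):
    """Flag the message when its total rule score reaches the threshold."""
    return rule_score(message) >= THRESHOLD

print(is_spam_by_rules("Click here for FREE MONEY"))    # True
print(is_spam_by_rules("Lunch meeting moved to noon"))  # False
```

Rules like these are easy to read but must be maintained by hand, which is the main motivation for the statistical methods on the following slides.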

7

8
  • MACHINE LEARNING TECHNIQUES
  • Bayesian Techniques
  • Boosting Trees
  • Support Vector Machines
  • Decision Trees

9
  • TRAINING DATA FOR SPAM FILTERING
  • The ML approach relies on the availability of an
    initial corpus
  • SpamAssassin Public Corpus
  • Total count: 6,047 messages, with about a 31%
    spam ratio.
  • LingSpam corpus
  • Total count: 2,412 messages, with about a 16.6%
    spam ratio.
  • Annexia corpus
  • Great Spam Archive
  • Total count: 15,369 spam messages.
  • The initial corpus is split into two sets
  • Training set
  • Test set
  • Common practice: 90% of the corpus is used for
    training and 10% for testing.
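
A 90/10 split of the kind described above might look like the following Python sketch. The `split_corpus` helper, the fixed seed, and the toy corpus are assumptions made for illustration; the slides do not say how the split was actually performed.

```python
import random

def split_corpus(messages, train_fraction=0.9, seed=0):
    """Shuffle a labeled corpus and split it into training and test sets."""
    shuffled = list(messages)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy corpus of (text, label) pairs standing in for a real one.
corpus = [(f"message {i}", "spam" if i % 3 == 0 else "good") for i in range(100)]
train, test = split_corpus(corpus)

print(len(train), len(test))  # 90 10
```

Shuffling before cutting matters when the corpus is stored with all spam first or sorted by date, since an unshuffled cut would give the test set a different spam ratio from the training set.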

10
  • BAYESIAN FILTERING
  • A message e is first split into several types
    of tokens: header tokens, body tokens, and
    synthesized tokens.
  • Count the number of times each token occurs in
    each corpus
  • Map each token to the probability that an email
    containing it is spam
  • Compute the posterior probability of folder fi
    given e
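
The counting and mapping steps can be sketched as follows. This is one common way to estimate a per-token spam probability, assuming equal priors for spam and good mail, so that Bayes' rule reduces to the token's rate in spam divided by the sum of its rates in both corpora. Whitespace tokenization, counting messages that contain a token rather than raw occurrences, and clamping to the range 0.01 to 0.99 are simplifying assumptions, not details from the slides.

```python
from collections import Counter

def train_token_probabilities(spam_msgs, good_msgs):
    """Estimate, per token, the probability that a message containing it is spam.

    Assumes both corpora are non-empty and that spam and good mail
    are equally likely a priori.
    """
    # Count in how many messages of each corpus a token appears.
    spam_counts = Counter(tok for msg in spam_msgs for tok in set(msg.lower().split()))
    good_counts = Counter(tok for msg in good_msgs for tok in set(msg.lower().split()))
    n_spam, n_good = len(spam_msgs), len(good_msgs)

    probs = {}
    for tok in set(spam_counts) | set(good_counts):
        spam_rate = spam_counts[tok] / n_spam
        good_rate = good_counts[tok] / n_good
        p = spam_rate / (spam_rate + good_rate)
        probs[tok] = min(0.99, max(0.01, p))  # never fully certain either way
    return probs

# Toy corpora, for illustration only.
spam = ["cheap promotion today", "promotion for madam"]
good = ["sorry for the delay", "meeting today"]
probs = train_token_probabilities(spam, good)

print(probs["promotion"])  # 0.99
print(probs["sorry"])      # 0.01
print(probs["today"])      # 0.5
```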

11
EXAMPLE PROBABILITY ASSIGNMENT
  wj           P(wj | fi)
  madam        0.99
  promotion    0.99
  republic     0.99
  shortest     0.047
  mandatory    0.047
  sorry        0.082
  enter        0.907
  very         0.147
  investment   0.86
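
Per-token values like the ones in this example are usually combined into a single score for the whole message. The sketch below uses the standard naive Bayes combination, which assumes the tokens are independent. The neutral value of 0.5 for unseen tokens and the function name are assumptions, and the slide does not state which combination rule the presenter used.

```python
import math

# Token probabilities copied from the example slide.
TOKEN_PROBS = {
    "madam": 0.99, "promotion": 0.99, "republic": 0.99,
    "shortest": 0.047, "mandatory": 0.047, "sorry": 0.082,
    "enter": 0.907, "very": 0.147, "investment": 0.86,
}

def spam_probability(tokens, token_probs, unknown=0.5):
    """Combine per-token probabilities, assuming tokens are independent."""
    ps = [token_probs.get(tok, unknown) for tok in tokens]  # unseen tokens are neutral
    spam_side = math.prod(ps)
    good_side = math.prod(1 - p for p in ps)
    return spam_side / (spam_side + good_side)

likely_spam = spam_probability(["madam", "promotion", "sorry"], TOKEN_PROBS)
likely_good = spam_probability(["shortest", "mandatory", "very"], TOKEN_PROBS)

print(likely_spam > 0.9)  # True
print(likely_good < 0.1)  # True
```

Two strongly spam-leaning tokens outweigh a single good-leaning one here, which is why a few words such as "madam" or "promotion" can decide the verdict on their own.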
12
  • ADVANTAGES
  • Bayesian filters learn automatically from spam
    and from good mail
  • Any message with a spam probability of over 90%
    is classified as spam; anything else is good
  • The result is a very robust and efficient
    anti-spam approach that returns hardly any false
    positives.
  • Catching 99.5% of spam with less than 0.03% false
    positives means that instead of receiving 20
    spams per day you might receive one spam every 10
    days or so
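
The closing claim follows from simple arithmetic, shown here in Python. The figures are read from the slide as a catch rate of 99.5% and an incoming volume of 20 spams per day; the variable names are illustrative.

```python
spam_per_day = 20    # incoming spam volume quoted on the slide
catch_rate = 0.995   # the slide's 99.5, read as a percentage

missed_per_day = spam_per_day * (1 - catch_rate)
days_between_misses = 1 / missed_per_day

print(round(missed_per_day, 3))       # 0.1
print(round(days_between_misses, 1))  # 10.0
```

A filter that misses 0.5% of 20 daily spams lets through about 0.1 per day, which is one every 10 days.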

13
  • ADABOOST ALGORITHM
  • Run a given weak learner several times on
    slightly altered training data
  • Use a decision tree algorithm as the weak learner
  • Assign each example of the given training set a
    weight. At the beginning all weights are equal,
    but in every round the weak learner returns a
    hypothesis, and the weights of the examples that
    hypothesis misclassifies are increased
  • The final hypothesis is a combination of the
    hypotheses of all rounds, namely a weighted
    majority vote, where hypotheses with lower
    classification error have higher weight
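
The rounds described above can be sketched in plain Python. As a simplifying assumption, the weak learner here is a decision stump (a one-level decision tree on a single Boolean feature) rather than a full decision tree, labels are +1 for spam and -1 for good mail, and the six training examples with their three features are invented for illustration.

```python
import math

def stump_predict(x, j, polarity):
    """Predict `polarity` if feature j is present, otherwise the opposite label."""
    return polarity if x[j] else -polarity

def best_stump(X, y, weights):
    """Weak learner: the single-feature stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for polarity in (1, -1):
            err = sum(w for w, x, t in zip(weights, X, y)
                      if stump_predict(x, j, polarity) != t)
            if best is None or err < best[0]:
                best = (err, j, polarity)
    return best

def adaboost(X, y, rounds=3):
    n = len(X)
    weights = [1.0 / n] * n  # at the beginning all weights are equal
    ensemble = []            # (alpha, feature index, polarity) per round
    for _ in range(rounds):
        err, j, polarity = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # lower error, higher vote weight
        ensemble.append((alpha, j, polarity))
        # Misclassified examples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * t * stump_predict(x, j, polarity))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Final hypothesis: weighted majority vote over all rounds."""
    vote = sum(alpha * stump_predict(x, j, pol) for alpha, j, pol in ensemble)
    return 1 if vote >= 0 else -1

# Hypothetical features: [contains "free", contains "offer", contains "meeting"]
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
y = [1, 1, 1, -1, -1, -1]

ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, x) for x in X] == y)  # True
```

No single stump separates this toy data, but the weighted vote of three stumps classifies every training example correctly, which is the point of boosting a weak learner.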

14
EXAMPLE
15
EXAMPLE(CONTD)
16
EXAMPLE(CONTD)
17
EXAMPLE(CONTD)
18
  • ADVANTAGES
  • Fast and Simple
  • Flexible: can combine with any weak learner
  • No prior knowledge needed about the weak learner
  • Not prone to overfitting

19
  • BAYESIAN NETWORK VS. ADABOOST
  • Number of frequent words used: 50
  • Number of frequent words used: 100

20
  • CHALLENGES IN DETECTING SPAM
  • The more aggressive a single filter becomes, the
    more likely it is that a good message will be
    blocked accidentally (a FALSE POSITIVE)
  • For example
  • An entertainment conglomerate estimated that a
    single lost email from an important customer
    could cost them more than $100,000
  • Incorrectly blocked mail from a constituent could
    cost the votes of one's family, friends, and
    neighbors.
  • Spam-fighting accuracy declines in proportion
    to the number of people covered by the filter.

21
  • CHALLENGES IN DETECTING SPAM
  • Spammers Get Past Filters
  • Changing message content
  • Increasing message volume
  • New delivery mechanisms
  • Attacking anti-spam groups

22
  • FUTURE WORK IN TEXT FILTERING
  • Filter based on word triples or longer phrases
    rather than individual words
  • Add a second level of testing designed
    specifically to avoid false positives.
  • The second level of filtering may or may not be
    Bayesian
  • If a mail triggers this second level of filters
    it will be accepted even if its spam probability
    is above the threshold.
  • Focus extra attention on specific parts of the
    email
  • Decompose domain names separately from the rest
    of the text in an email, since they often consist
    of several words stuck together.
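
Two of these ideas, word-triple features and a second-level check that can override the spam threshold, are sketched below in Python. The trusted-sender rule is a hypothetical stand-in for whatever second-level test would actually be used, and the 0.9 threshold follows the 90% cut-off mentioned on the earlier advantages slide.

```python
def word_ngrams(tokens, n=3):
    """Return consecutive n-word phrases (word triples when n is 3)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical second-level rule: mail from these senders is always accepted.
TRUSTED_SENDERS = {"alice@example.com"}

def final_verdict(spam_probability, sender, threshold=0.9):
    """Accept mail that triggers the second-level filter, even above the threshold."""
    if sender in TRUSTED_SENDERS:
        return "good"
    return "spam" if spam_probability > threshold else "good"

print(word_ngrams("buy cheap pills online now".split()))
# ['buy cheap pills', 'cheap pills online', 'pills online now']
print(final_verdict(0.97, "alice@example.com"))    # good
print(final_verdict(0.97, "unknown@example.net"))  # spam
```

Because the second level can only rescue mail and never condemn it, it reduces false positives without making the first-level filter any less aggressive.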

23
QUESTIONS?