SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare - PowerPoint PPT Presentation


Transcript and Presenter's Notes



1
SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM
Rohan Malkhare
  • Committee
  • Dr. Eugene Fink
  • Dr. Dewey Rundus
  • Dr. Alan Hevner

2
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

3
Introduction
  • Anti-spam efforts
  • Legislation
  • Technology
  • Whitelisting of email addresses
  • Blacklisting of email addresses/domains
  • Challenge-response mechanisms
  • Content Filtering
  • Learning Techniques

4
Introduction
  • Learning techniques for Spam classification
  • Feature Extraction
  • Assignment of weights to individual features
    representing the predictive strength of a feature
  • Combining weights of extracted features during
    classification to numerically determine whether
    mail is spam/legitimate
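  As a rough illustration of this three-step pipeline, here is a minimal Python sketch; it is not code from the thesis, and all three callables are placeholders for the steps listed above.

```python
def classify_message(message, extract_features, feature_weight, combine):
    """Generic sketch of the learning-technique pipeline described above:
    extract features, look up each feature's predictive weight, and combine
    the weights into a spam/legitimate decision."""
    features = extract_features(message)              # step 1: feature extraction
    weights = [feature_weight(f) for f in features]   # step 2: per-feature weights
    return combine(weights)                           # step 3: True = spam, False = legitimate
```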

5
Introduction
  • Current algorithms
  • Word or phrases as features
  • Probabilities of occurrence in spam/legitimate
    collections as weights
  • Bayes rule or one of its variants for combining
    weights
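  For instance, a word-based filter of this kind typically combines the per-word spam probabilities with (naïve) Bayes' rule roughly as in the sketch below; the probability values in the usage comment are purely illustrative.

```python
def combine_bayes(word_spam_probs):
    """Combine per-word probabilities p(spam | word) under the naive-Bayes
    independence assumption: product of spam probabilities versus product
    of the complementary (legitimate) probabilities."""
    spam_prod, legit_prod = 1.0, 1.0
    for p in word_spam_probs:
        spam_prod *= p
        legit_prod *= 1.0 - p
    return spam_prod / (spam_prod + legit_prod)   # posterior probability of spam

# combine_bayes([0.99, 0.95, 0.2]) -> roughly 0.998 (illustrative values)
```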

6
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

7
Related Work
  • Cohen (1996)
  • RIPPER, a rule-learning system
  • Rules in a human-comprehensible format
  • Pantel and Lin (1998)
  • Naïve-Bayes with words as features
  • Microsoft Research (1998)
  • Naïve-Bayes with the mutual information measure
    to select features with strongest resolving power
  • Words and domain-specific attributes of spam used
    as features

8
Related Work
  • Paul Graham (2002): A Plan for Spam
  • Very popular algorithm, credited with starting the craze for Bayesian filters
  • Uses naïve Bayes with words as features
  • Bill Yerazunis (2002): CRM114, a sparse binary polynomial hashing algorithm
  • Most accurate algorithm to date (over 99.7% accuracy)
  • Distinctive because of its powerful feature extraction technique
  • Uses the Bayesian chain rule for combining weights

9
Related Work
  • CRM114 algorithm: feature extraction
  • Slide a window of 5 words over the incoming text
  • Generate order-preserving sub-phrases containing all combinations of windowed words
  • For one window, 2^4 = 16 features are generated
  • Very high computational complexity
  • E.g., for "Click here to buy Viagra", the features generated would be "Click", "Click here", "Click to", "Click buy", "Click Viagra", "Click here to", "Click here buy", etc.
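  A minimal sketch of this windowed sub-phrase generation; it is illustrative only, since the real CRM114 also hashes the sub-phrases and differs in other details.

```python
from itertools import combinations

def crm114_window_features(words, window=5):
    """Slide a window of `window` words over the text and, for each position,
    emit every order-preserving sub-phrase that keeps the window's first word:
    2**(window - 1) = 16 features per position when window = 5."""
    features = []
    for start in range(len(words) - window + 1):
        w = words[start:start + window]
        for r in range(window):  # choose 0 .. window-1 of the remaining words
            for combo in combinations(range(1, window), r):
                features.append(" ".join([w[0]] + [w[i] for i in combo]))
    return features

# crm114_window_features("Click here to buy Viagra".split())
# -> ['Click', 'Click here', 'Click to', 'Click buy', 'Click Viagra',
#     'Click here to', 'Click here buy', ..., 'Click here to buy Viagra']  (16 features)
```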

10
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

11
Algorithm
  • Feature Extraction
  • Sentences in a message are identified by using the delimiting characters ".", "?", "!", ",", "<", ">"
  • All possible word-pairings are formed from the sentences
  • Commonly occurring words are skipped
  • These word-pairings serve as features to be used for classification

12
Algorithm
  • Feature Extraction (continued)
  • If the number of words becomes greater than a constant K, then each series of K words is treated as a sentence
  • The value of K is set to 20
  • E.g., for the sentence "There is a problem in the tables that have been copied to the database", the features formed would include "problem tables", "tables problem", "problem copied", "copied problem", "problem database", "database problem", etc.
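  A minimal sketch of this feature-extraction step; the stop-word list and the function name are illustrative assumptions, not the author's code.

```python
import re
from itertools import combinations

STOP_WORDS = {"the", "is", "a", "in", "that", "have", "been", "to", "of"}  # illustrative list
K = 20  # maximum sentence length

def extract_features(text):
    """Split text into sentences on the delimiters . ? ! , < >, cap sentences
    at K words, skip common words, and emit every word-pairing (both orderings)
    from each sentence."""
    features = set()
    for sentence in re.split(r"[.?!,<>]", text):
        words = [w for w in sentence.split() if w.lower() not in STOP_WORDS]
        for start in range(0, len(words), K):       # treat each run of K words as a sentence
            chunk = words[start:start + K]
            for w1, w2 in combinations(chunk, 2):
                features.add(w1 + " " + w2)
                features.add(w2 + " " + w1)          # both orderings, as in the example above
    return features

# extract_features("There is a problem in the tables that have been copied to the database")
# -> {'problem tables', 'tables problem', 'problem copied', 'copied problem', ...}
```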

13
Algorithm
  • Feature Extraction (continued)
  • The entire subject line is treated as one sentence
  • For HTML, all content within "<" and ">" is treated as one sentence
  • For a sentence of n words, Scavenger creates (n-1)(n-2) features, as compared to the 2^(n-1) created by CRM114
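  For example, plugging n = 20 (the sentence cap) into these formulas gives (n-1)(n-2) = 342 features for Scavenger versus 2^(n-1) = 2^19 = 524,288 for the CRM114 scheme, which is the computational-complexity gap noted earlier.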

14
Algorithm
  • Weight Assignment
  • Weights represent predictive strength of features
  • Discretized values are assigned as weights to features, depending on whether the feature is a strong or a weak piece of evidence
  • Strong pieces of evidence should have high
    impact on the classification decision and weak
    pieces of evidence should have low impact on the
    classification decision

15
Algorithm
  • Weight Assignment (Continued)
  • Categorization of features into strong and
    weak pieces of evidence is done on the basis of
    frequency of occurrence of the feature in
    spam/legitimate collections, exclusivity of
    occurrence and on heuristic rules like distance
    between words in the word-pairing, whether the
    feature is from the subject or the body.
  • Only exclusively occurring features are assigned weights
  • Features occurring in both spam and legitimate collections are ignored.

16
Algorithm
  • Weight Assignment (Continued)
  • What weights should be selected for the strong and the weak evidence?
  • During classification, the class having more pieces of strong evidence should win, regardless of the number of pieces of weak evidence on either side.
  • In the absence of strong evidence on either side, the class having more pieces of weak evidence should win.

17
Algorithm
  • Weight Assignment (Continued)
  • Intuitively, we would like as much distance as possible between the values we choose for strong and weak evidence.
  • We select 0.9 as the weight for strong evidence and 0.1 as the weight for weak evidence.
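  A minimal sketch of the weight-assignment step under this 0.9/0.1 scheme; the is_strong() predicate stands in for the frequency and heuristic tests of the previous slide, whose exact thresholds are not given here.

```python
STRONG_WEIGHT, WEAK_WEIGHT = 0.9, 0.1

def assign_weights(spam_counts, legit_counts, is_strong):
    """Assign discretized weights to features that occur exclusively in one
    collection; features seen in both spam and legitimate mail are ignored.
    is_strong(feature, count) encapsulates the frequency and heuristic rules."""
    spam_weights, legit_weights = {}, {}
    for feature, count in spam_counts.items():
        if feature not in legit_counts:               # exclusive to the spam collection
            spam_weights[feature] = STRONG_WEIGHT if is_strong(feature, count) else WEAK_WEIGHT
    for feature, count in legit_counts.items():
        if feature not in spam_counts:                # exclusive to the legitimate collection
            legit_weights[feature] = STRONG_WEIGHT if is_strong(feature, count) else WEAK_WEIGHT
    return spam_weights, legit_weights
```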

18
Algorithm
  • Combining weights
  • Total spam evidence = sum of the spam weights of matching features
  • Total legitimate evidence = sum of the legitimate weights of matching features
  • If total spam evidence > M × total legitimate evidence, then the message is spam
  • M is the threshold parameter, which can be used as a tuning knob
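  A minimal sketch of this decision rule; extract_features and the two weight tables are assumed to come from the earlier steps.

```python
def is_spam(message, spam_weights, legit_weights, extract_features, M=1.0):
    """Sum the spam and legitimate weights of the matching features and
    compare them, with M as the tuning-knob threshold."""
    features = extract_features(message)
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return spam_evidence > M * legit_evidence
```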

19
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

20
Measurements
  • Precision and Recall used as parameters of
    measurement
  • Spam precision = messages correctly classified as spam / total messages classified as spam
  • Spam recall = messages correctly classified as spam / total spam messages in the testing set
  • Precision gives accuracy with respect to false
    positives
  • Recall gives capacity of filter to catch spam
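  In code, the two measurements reduce to two ratios; the counts in the usage comment are illustrative, not results from the thesis.

```python
def spam_precision_recall(true_positives, classified_as_spam, spam_in_test_set):
    """Spam precision = correctly classified spam / all messages classified as spam;
    spam recall = correctly classified spam / all spam in the testing set."""
    precision = true_positives / classified_as_spam
    recall = true_positives / spam_in_test_set
    return precision, recall

# spam_precision_recall(2790, 2792, 2800) -> (0.99928..., 0.99642...)  (illustrative counts)
```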

21
Measurements
  • Testing data
  • Downloaded around 5600 spam messages from http://www.spamarchive.org
  • Used around 960 legitimate mails from Dr. Fink's mailbox
  • Cross-validation
  • K-fold cross-validation for two values of K: K=2 and K=5
  • K=2: dividing the data into 2 equal-sized sets
  • K=5: dividing the data into 5 equal-sized sets
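  A minimal sketch of the K-fold splitting used for these measurements; the shuffling and the handling of equal-sized folds are assumptions about details the slides do not spell out.

```python
import random

def k_fold_splits(messages, k):
    """Shuffle the data, cut it into k equal-sized sets, and use each set
    in turn as the testing set while training on the rest."""
    data = list(messages)
    random.shuffle(data)
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

# for train, test in k_fold_splits(all_messages, k=5): train the filter on `train`, score it on `test`
```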

22
Measurements
  • Comparison with Paul Graham's naïve-Bayes algorithm
  • Implemented Graham's algorithm for two methods of feature extraction
  • Words/phrases as features
  • Feature extraction similar to Scavenger

23
Measurements
ALGORITHM | K=5 SPAM PRECISION (AVG %) | K=5 SPAM RECALL (AVG %) | K=2 SPAM PRECISION (AVG %) | K=2 SPAM RECALL (AVG %)
Scavenger (M=1) | 100 | 99.85 | 99.92 | 99.72
Naïve-Bayes (words/phrases) | 100 | 98.87 | 99.80 | 97.03
Naïve-Bayes (with Scavenger feature extraction method) | 100 | 99.15 | 99.65 | 98.68
24
Measurements
M | Missed Spam (%) | False Positives (%) | Spam Recall (%)
0.25 | 0 | 13.23 | 100
0.5 | 0.07 | 6.17 | 99.93
0.75 | 0.11 | 2.05 | 99.89
1 | 0.28 | 0.58 | 99.72
1.25 | 0.3 | 0.58 | 99.70
1.5 | 0.35 | 0.58 | 99.66
1.75 | 0.41 | 0.29 | 99.59
2 | 0.49 | 0 | 99.51
2.25 | 0.6 | 0 | 99.4
2.5 | 0.74 | 0 | 99.36
25
Measurements
26
Measurements
  • Why does Scavenger perform better than naïve Bayes?
  • Powerful feature extraction (as powerful as CRM114)
  • Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules

27
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

28
Implementation
  • Windows PC-based filter
  • Runs for individual email accounts on IMAP mail servers
  • Three Modules
  • Configuration
  • Training
  • Classification

29
Implementation
  • Classifier runs as a Windows Service
  • Connects to mail server every ten minutes
  • Downloads new messages, classifies them
  • Moves messages classified as spam to a
    pre-configured folder on the server
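  A minimal Python sketch of this poll-classify-move loop over IMAP; the original filter is a Windows service, so the host, folder name, and classify() callable here are placeholders, not the thesis implementation.

```python
import email
import imaplib

def poll_and_filter(host, user, password, classify, spam_folder="Spam"):
    """Fetch new messages over IMAP, classify each one, and move the messages
    classified as spam to a pre-configured folder on the server."""
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")               # new, not-yet-classified messages
    for msg_id in data[0].split():
        _, msg_data = imap.fetch(msg_id, "(RFC822)")
        message = email.message_from_bytes(msg_data[0][1])
        if classify(message):                           # True -> spam
            imap.copy(msg_id, spam_folder)              # copy to the spam folder ...
            imap.store(msg_id, "+FLAGS", "\\Deleted")   # ... and flag the original for removal
    imap.expunge()
    imap.logout()

# A service would call poll_and_filter(...) on a timer, e.g. every ten minutes.
```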

30
Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work

31
Future Work
  • Incorporating message headers during the feature extraction step
  • Incorporating domain-specific attributes of spam during the weight-combination step

32
Publications
  • Dr. William Yerazunis (inventor of CRM114) mentioned the Scavenger algorithm at the MIT Spam Conference on Jan 16, 2004
  • To be published at the First Conference on Email and Anti-Spam in Palo Alto, California, in July 2004

33
Acknowledgements
  • Dr. Eugene Fink
  • Dr. Dewey Rundus, Dr. Alan Hevner
  • Dr. Paul Graham, MIT, Boston
  • Dr. William Yerazunis, MERL, Boston