1
A Neural Network Classifier for Junk E-Mail
  • Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
  • CSIS Student/Faculty Research Day
  • May 7, 2004

2
Spam, spam, spam,
3
Fighting spam
  • Several commercial applications exist
  • Server-side: expensive
  • Client-side: time-consuming
  • No approach is 100% effective
  • Spammers are aggressive and adaptable
  • Best solutions are typically hybrids of different
    approaches and criteria

4
Common approaches
  • Simple filters
  • Common words or phrases
  • Unusual punctuation or capitalization
  • Blacklisting: just say NO (if you can)
  • Reject e-mail from known spammers
  • Whitelisting: friends only, please
  • Accept e-mail only from known correspondents
  • Classifiers: examine each e-mail and decide
  • Only a few publications on spam classifiers

5
Naïve Bayesian classifiers
  • Used in commercial classifiers
  • Assumes recognition features are independent
  • Maximum likelihood: product of the likelihoods of
    the features
  • E-mail classifier examines each word
  • Training assigns a probability to each word
  • Look up each word/probability in a dictionary
  • If the product of the probabilities exceeds a
    given threshold, it is spam
  • Challenge: creating the dictionary
  • We compare our Neural Network against two
    published Naïve Bayesian classifiers
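
As a rough illustration of the lookup-and-multiply scheme on this slide, here is a minimal Python sketch. The dictionary contents, the probability used for unknown words, and the 0.9 threshold are made-up values. Normalising the product into a single score is one common variant and not necessarily what the commercial classifiers do.

```python
import math

# Hypothetical dictionary: P(spam | word) as it might come out of training.
WORD_SPAM_PROB = {"refinance": 0.95, "guarantee": 0.90,
                  "tennis": 0.05, "schedule": 0.10}
UNKNOWN_WORD_PROB = 0.4  # assumed value for words not in the dictionary


def spam_score(text):
    """Multiply per-word probabilities (in log space) as if words were independent."""
    log_spam = 0.0
    log_legit = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, UNKNOWN_WORD_PROB)
        log_spam += math.log(p)
        log_legit += math.log(1.0 - p)
    # Normalise the two products into one probability, avoiding overflow.
    diff = log_legit - log_spam
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def is_spam(text, threshold=0.9):
    """Apply the (assumed) decision threshold to the combined score."""
    return spam_score(text) > threshold
```

The questions on the next slide (which words, what value for new words, what threshold) map directly onto the three hard-coded choices in this sketch.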

6
Naïve Bayesian classifier issues
  • How many features (words), which ones?
  • How is degradation avoided as spammers'
    vocabulary changes?
  • What values are assigned to new words?
  • What are the thresholds?
  • How to avoid sabotage of classifier?

7
Which one isn't spam? (subject headers)
  • 5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
  • Money Back Guarantee_HGH
  • kindle life pddez liw mzac
  • v a l i u m - D i a z e p a m used to relieve
    anxiety
  • Fairfield tennis schedule
  • Dramatic E,nhancement for .Men f"fumqid
  • ,Refina'nce now. Don't wait

9
Spammers make patterns
  • The more they try to hide, the easier it is to
    see them
  • Therefore, we use common spammer patterns
    (instead of vocabulary) as features for
    classification
  • Learn these patterns with a Neural Network

10
Neural Network features
  • Total of 17 features
  • 6 from the subject header
  • 2 from priority and content-type headers
  • 9 from the e-mail body

11
Features from subject header
  1. Number of words with no vowels
  2. Number of words with at least two of the letters J,
    K, Q, X, Z
  3. Number of words with at least 15 characters
  4. Number of words with non-English characters,
    special characters such as punctuation, or digits
    at beginning or middle of word
  5. Number of words with all letters in uppercase
  6. Binary feature indicating 3 or more repeated
    characters
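
The six subject-header features could be extracted along these lines. This is a sketch under several assumptions the slide does not settle: words are whitespace-separated, "y" is not a vowel, a word needs at least one letter to count as having "no vowels", and "repeated characters" means the same character three times in a row. The function name is illustrative.

```python
import re

VOWELS = set("aeiou")          # assumption: "y" is not treated as a vowel
RARE_LETTERS = set("jkqxz")


def subject_features(subject):
    """Return the six subject-header features as a list of numbers."""
    words = subject.split()
    no_vowels = rare = long_words = odd_chars = all_caps = 0
    for w in words:
        letters = [c for c in w.lower() if c.isalpha()]
        if letters and not VOWELS.intersection(letters):
            no_vowels += 1
        if sum(c in RARE_LETTERS for c in letters) >= 2:
            rare += 1
        if len(w) >= 15:
            long_words += 1
        # Anything other than an ASCII letter before the last character.
        if any(not (c.isascii() and c.isalpha()) for c in w[:-1]):
            odd_chars += 1
        if w.isupper():
            all_caps += 1
    # Assumption: "repeated" means the same character three times in a row.
    repeated = 1 if re.search(r"(\S)\1\1", subject) else 0
    return [no_vowels, rare, long_words, odd_chars, all_caps, repeated]
```

For the first subject line in the quiz slide, "5 Be a mighty warrior in bed! vcrhwt ygjztyjjh", this counts two vowel-free words and one word with at least two of J, K, Q, X, Z.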

12
Features from priority and content-type headers
  1. Binary feature indicating whether the priority
    had been set to any level besides normal or
    medium
  2. Binary feature indicating whether a content-type
    header appeared within the message headers or
    whether the content type had been set to
    text/html
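
A possible reading of these two header features, using Python's standard `email` package. The header names checked (`X-Priority`, `Priority`), the strings treated as "normal or medium", and the interpretation of the content-type feature are assumptions; the slide does not spell them out.

```python
from email import message_from_string


def header_features(raw_message):
    """Return [priority_flag, content_type_flag] for a raw RFC 822 message."""
    msg = message_from_string(raw_message)

    # Assumption: priority may appear as X-Priority ("3 (Normal)") or Priority.
    priority = (msg.get("X-Priority") or msg.get("Priority") or "").strip().lower()
    is_default = (priority == "" or priority.startswith("3")
                  or "normal" in priority or "medium" in priority)
    priority_flag = 0 if is_default else 1

    # Assumption: the flag fires on an explicit Content-Type header or on any
    # MIME part declared as text/html.
    has_header = msg.get("Content-Type") is not None
    has_html = any(part.get_content_type() == "text/html" for part in msg.walk())
    content_type_flag = 1 if (has_header or has_html) else 0

    return [priority_flag, content_type_flag]
```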

13
Features from message body
  • Proportion of alphabetic words with no vowels and
    at least 7 characters
  • Proportion of alphabetic words with at least two
    of the letters J, K, Q, X, Z
  • Proportion of alphabetic words at least 15
    characters long
  • Binary feature indicating whether the strings
    "From:" and "To:" were both present
  • Number of HTML opening comment tags
  • Number of hyperlinks (href)
  • Number of clickable images represented in HTML
  • Binary feature indicating whether a text color
    was set to white
  • Number of URLs in hyperlinks with digits or &,
    %, or @
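
The nine body features can be approximated with a few regular expressions. The sketch below is illustrative only. It assumes an "alphabetic word" is a whitespace-separated token made of letters, that the From/To strings carry a trailing colon, that a "clickable image" is an `<img>` directly inside an `<a>` tag, and that the special URL characters are `&`, `%` and `@` (the character list is garbled in this transcript).

```python
import re

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")


def body_features(body):
    """Return the nine body features as a list, in the order listed on the slide."""
    # Assumption: an alphabetic word is a whitespace-separated token of letters only.
    words = [w.lower() for w in body.split() if w.isalpha()]
    total = len(words) or 1  # avoid dividing by zero on empty bodies

    no_vowel = sum(len(w) >= 7 and not VOWELS.intersection(w) for w in words)
    rare = sum(sum(c in RARE_LETTERS for c in w) >= 2 for w in words)
    long_words = sum(len(w) >= 15 for w in words)

    from_and_to = int("From:" in body and "To:" in body)
    comments = body.count("<!--")
    urls = re.findall(r"href\s*=\s*[\"']?([^\"'\s>]+)", body, re.I)
    images = len(re.findall(r"<a\b[^>]*>\s*<img\b", body, re.I))
    white = int(bool(re.search(
        r"(?<![\w-])color\s*[=:]\s*[\"']?\s*(?:white|#fff(?:fff)?)\b", body, re.I)))
    odd_urls = sum(bool(re.search(r"[\d&%@]", u)) for u in urls)

    return [no_vowel / total, rare / total, long_words / total,
            from_and_to, comments, len(urls), images, white, odd_urls]
```

The first three values are proportions, unlike the raw counts used for the subject header, so they do not grow with message length.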

14
Neural Network spam classifier
  • 3-layer, feed-forward network (Perceptron)
  • 17 input units, variable hidden layer units, 1
    output unit
  • Data: 1,654 e-mails (854 spam, 800 legitimate)
  • Use half of each (spam/non-spam) for training,
    the other half for testing
  • Test with variations of hidden nodes (4 to 14)
    and epochs (100 to 500)
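
For readers who want to see the shape of such a network, here is a minimal pure-Python sketch of a 17-input, one-hidden-layer, one-output feed-forward net. The sigmoid activations, cross-entropy loss, learning rate, and weight initialisation are assumptions; the slides do not say which were used, and `TinyMLP` is a hypothetical name.

```python
import math
import random


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """17 inputs -> n_hidden sigmoid units -> 1 sigmoid output."""

    def __init__(self, n_inputs=17, n_hidden=12, seed=0):
        rng = random.Random(seed)
        # Each row holds one unit's input weights followed by its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, output) for one feature vector."""
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.1):
        """One stochastic gradient step on the cross-entropy loss."""
        hidden, out = self.forward(x)
        delta_out = out - target  # gradient at the output pre-activation
        for j, h in enumerate(hidden):
            delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
            row = self.w_hidden[j]
            for i, v in enumerate(x):
                row[i] -= lr * delta_h * v
            row[-1] -= lr * delta_h
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
```

An epoch is one pass of `train_step` over the training half of the data. Varying `n_hidden` from 4 to 14 and the number of epochs from 100 to 500 corresponds to the sweep described on the slide.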

15
Definitions used for classifier success measures
  • nSS = number of spam e-mails classified as spam
  • nSL = number of spam e-mails classified as legitimate
  • nLL = number of legitimate e-mails classified as
    legitimate
  • nLS = number of legitimate e-mails classified as spam

16
Measure of success: precision
  • Precision: the percentage of e-mails labeled
    spam (or legitimate) by the classifier that were
    labeled correctly
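
The formula itself did not survive in this transcript. Using the counts nSS, nSL, nLL and nLS defined on the previous slide, the description corresponds to the usual definition of precision, written here as a reconstruction. Assuming the test set was half of each class, it reproduces the precision figures on the results slide.

```latex
\[
\text{Spam precision} = \frac{n_{SS}}{n_{SS} + n_{LS}},
\qquad
\text{Legitimate precision} = \frac{n_{LL}}{n_{LL} + n_{SL}}
\]
```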

18
Measure of success: accuracy
  • Accuracy: the percentage of actual spam (or
    legitimate) e-mails that were correctly classified
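
The formula is also missing from the transcript here. With the counts nSS, nSL, nLL and nLS, the description corresponds to the per-class ratio below, which is more often called recall. This is a reconstruction, and it is consistent with the misclassification counts on the results slide.

```latex
\[
\text{Spam accuracy} = \frac{n_{SS}}{n_{SS} + n_{SL}},
\qquad
\text{Legitimate accuracy} = \frac{n_{LL}}{n_{LL} + n_{LS}}
\]
```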

20
Neural Network results
  • Best overall results with 12 hidden nodes at 500
    epochs
  • Spam precision: 92.45%
  • Legitimate precision: 91.32%
  • Spam accuracy: 91.80%
  • Legitimate accuracy: 92.00%
  • 35 spam e-mails misclassified (8.20%)
  • 32 legitimate e-mails misclassified (8.00%)
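
As a sanity check, these figures can be recomputed from the confusion counts. The sketch assumes the test set held half of each class (427 spam and 400 legitimate, from the 854/800 data set) and uses the standard precision and per-class accuracy ratios. The function and key names are illustrative.

```python
def success_measures(n_ss, n_sl, n_ll, n_ls):
    """Return the four success measures as percentages.

    n_ss: spam classified as spam        n_sl: spam classified as legitimate
    n_ll: legitimate classified as such  n_ls: legitimate classified as spam
    Assumes every denominator is non-zero.
    """
    return {
        "spam_precision": 100.0 * n_ss / (n_ss + n_ls),
        "legit_precision": 100.0 * n_ll / (n_ll + n_sl),
        "spam_accuracy": 100.0 * n_ss / (n_ss + n_sl),
        "legit_accuracy": 100.0 * n_ll / (n_ll + n_ls),
    }


# Assumed test-set sizes: half of 854 spam and half of 800 legitimate e-mails.
measures = success_measures(n_ss=427 - 35, n_sl=35, n_ll=400 - 32, n_ls=32)
for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumptions the four printed values round to the percentages listed on the slide.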

21
Misclassified e-mails
  • Most spam misclassified as legitimate were short
    in length, with few hyperlinks
  • Most legitimate e-mails misclassified as spam had
    unusual features for personal e-mail (that is,
    they were spam-like in appearance)

22
Comparing Neural Network and Naïve Bayesian Classifiers
  • Accuracy of the NN classifier is comparable to
    that reported for Naïve Bayesian classifiers
  • NN classifier required fewer features (17 versus
    100 in one study and 500 in another)
  • NN classifier uses descriptive qualities of words
    and messages similar to those used by human
    readers

23
Blacklisting Experiment
  • Manually entered the IP addresses of e-mails
    misclassified by the NN classifier
  • Entered the first (original) IP address and, when
    present, the second IP address (e.g., mail server or
    ISP)
  • Submitted them to a website that sends IP addresses
    to 173 working spam blacklists and returns the hits:
    http://www.declude.com/junkmail/support/ip4r.htm
  • Counted only hit counts greater than one as spam,
    since single-list hits were assumed to be anomalies
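
The experiment used a website by hand, but the "more than one hit" rule is easy to express in code. The sketch below swaps in the general DNS-blacklist convention (reverse the IPv4 octets and append the list's zone) and is not the site's actual mechanism. The zone names are placeholders, and the lookup is passed in as a function so that no real DNS query is made.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the name a DNS-based blacklist expects: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def blacklist_hits(ip, zones, is_listed):
    """Count the zones for which the injected lookup reports a listing."""
    return sum(1 for zone in zones if is_listed(dnsbl_query_name(ip, zone)))


def counts_as_spam(ip, zones, is_listed):
    # Rule from the slide: a single hit is treated as an anomaly.
    return blacklist_hits(ip, zones, is_listed) > 1


# Hypothetical zones and listings, for illustration only.
zones = ["bl1.example.org", "bl2.example.org", "bl3.example.org"]
listed = {"1.2.0.192.bl1.example.org", "1.2.0.192.bl2.example.org"}
print(counts_as_spam("192.0.2.1", zones, listed.__contains__))
```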

24
Blacklisting Experimental Results
  • Of the 32 legitimate e-mails misclassified by the
    NN, 53% were identified as spam
  • Of the 35 spam e-mails misclassified by the NN,
    97% were identified as spam
  • These poor results indicate that the blacklisting
    strategy, at least for these databases, is
    inadequate

25
Conclusions
  • NN is competitive with Naïve Bayesian studies despite
    using a much smaller feature set
  • Room for refinement of parsing for features
  • Use of descriptive, more human-like features
    makes NN less subject to degradation than Naïve
    Bayesian

26
Conclusions (cont.)
  • Neural Network approach is useful and accurate,
    but too many legitimate -> spam misclassifications
  • Should be powerful when used in conjunction with
    a whitelist to reduce legitimate -> spam (nLS),
    increasing spam precision and legitimate accuracy
  • Blacklisting strategy is not very helpful
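
The whitelist combination suggested above could look like the following sketch. The function name, the 0.5 threshold, and the idea of matching on a lower-cased sender address are assumptions for illustration, not something the slides specify.

```python
def classify(sender, nn_output, whitelist, threshold=0.5):
    """Let the whitelist override the network for known correspondents."""
    if sender.strip().lower() in whitelist:
        return "legitimate"
    return "spam" if nn_output > threshold else "legitimate"


# Hypothetical whitelist entry and network outputs.
whitelist = {"coach@example.org"}
print(classify("Coach@Example.org", 0.97, whitelist))
print(classify("unknown@example.com", 0.97, whitelist))
```

Because whitelisted senders never reach the network's decision, any legitimate e-mail from a known correspondent drops out of nLS, which is the error the conclusions single out.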