Title: A Neural Network Classifier for Junk E-Mail
1A Neural Network Classifier for Junk E-Mail
- Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
- CSIS Student/Faculty Research Day
- May 7, 2004
2Spam, spam, spam,
3Fighting spam
- Several commercial applications exist
- Server-side: expensive
- Client-side: time-consuming
- No approach is 100% effective
- Spammers are aggressive and adaptable
- Best solutions are typically hybrids of different
approaches and criteria
4Common approaches
- Simple filters
- Common words or phrases
- Unusual punctuation or capitalization
- Blacklisting: just say NO (if you can)
- Reject e-mail from known spammers
- Whitelisting: friends only, please
- Accept e-mail only from known correspondents
- Classifiers: examine each e-mail and decide
- Only a few publications on spam classifiers
5Naïve Bayesian classifiers
- Used in commercial classifiers
- Assumes recognition features are independent
- Max likelihood: product of likelihoods of
features
- E-mail classifier examines each word
- Training assigns a probability to each word
- Look up each word/probability in a dictionary
- If the product of the probabilities exceeds a
given threshold, it is spam
- Challenge: creating the dictionary
- We compare our Neural Network against two
published Naïve Bayesian classifiers
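The scoring rule on this slide can be sketched as follows. This is a minimal illustration, not either of the published classifiers: the per-word likelihoods, the default for unknown words, the prior, and the threshold are all made-up values, and the function names are hypothetical. The sketch compares the product of likelihoods under "spam" with the product under "legitimate", and sums logarithms instead of multiplying so that long messages do not underflow.

```python
import math

# Hypothetical per-word likelihoods P(word | class); a real dictionary
# would be estimated from a training corpus.
P_WORD_GIVEN_SPAM = {"guarantee": 0.20, "money": 0.15, "tennis": 0.01, "schedule": 0.02}
P_WORD_GIVEN_LEGIT = {"guarantee": 0.02, "money": 0.03, "tennis": 0.10, "schedule": 0.12}
UNKNOWN = 0.05  # assumed likelihood for words missing from the dictionary


def spam_log_odds(words, prior_spam=0.5):
    """Log of P(spam | words) / P(legit | words) under the independence assumption."""
    score = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for w in words:
        w = w.lower()
        score += math.log(P_WORD_GIVEN_SPAM.get(w, UNKNOWN))
        score -= math.log(P_WORD_GIVEN_LEGIT.get(w, UNKNOWN))
    return score


def is_spam(words, threshold=0.0):
    """Flag the message when the log-odds exceed the (assumed) threshold."""
    return spam_log_odds(words) > threshold


print(is_spam(["Money", "Back", "Guarantee"]))
print(is_spam(["Fairfield", "tennis", "schedule"]))
```

Unknown words contribute nothing here because both classes share the same default, which is one possible answer to the "new words" question; other choices are equally defensible.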
6Naïve Bayesian classifier issues
- How many features (words), which ones?
- How is degradation avoided as spammers'
vocabulary changes?
- What values are assigned to new words?
- What are the thresholds?
- How to avoid sabotage of classifier?
7Which one isn't spam? (subject headers)
- 5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
- Money Back Guarantee_HGH
- kindle life pddez liw mzac
- v a l i u m - D i a z e p a m used to relieve
anxiety
- Fairfield tennis schedule
- Dramatic E,nhancement for .Men f"fumqid
- ,Refina'nce now. Don't wait
9Spammers make patterns
- The more they try to hide, the easier it is to
see them
- Therefore, we use common spammer patterns
(instead of vocabulary) as features for
classification
- Learn these patterns with a Neural Network
10Neural Network features
- Total of 17 features
- 6 from the subject header
- 2 from priority and content-type headers
- 9 from the e-mail body
11Features from subject header
- Number of words with no vowels
- Number of words with at least two of the letters
J, K, Q, X, Z
- Number of words with at least 15 characters
- Number of words with non-English characters,
special characters such as punctuation, or digits
at beginning or middle of word
- Number of words with all letters in uppercase
- Binary feature indicating 3 or more repeated
characters
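The six subject-header features can be sketched in a few lines. The slide does not spell out the parsing rules, so the following choices are assumptions: words are split on whitespace, "y" does not count as a vowel, "at least two of J, K, Q, X, Z" counts occurrences rather than distinct letters, and "repeated characters" means three identical characters in a row. The function name is hypothetical.

```python
import re

VOWELS = set("aeiou")
RARE = set("jkqxz")


def subject_features(subject):
    """Return the six subject-header features as a list of numbers."""
    words = subject.split()
    no_vowels = rare = long_words = odd_chars = all_caps = 0
    for w in words:
        letters = [c for c in w.lower() if c.isalpha()]
        if letters and not VOWELS.intersection(letters):
            no_vowels += 1
        if sum(c in RARE for c in letters) >= 2:
            rare += 1
        if len(w) >= 15:
            long_words += 1
        # Anything other than an ASCII letter before the final character.
        if re.search(r"[^A-Za-z]", w[:-1]):
            odd_chars += 1
        if letters and w.isupper():
            all_caps += 1
    # Three identical characters in a row anywhere in the subject.
    repeated = 1 if re.search(r"(.)\1\1", subject) else 0
    return [no_vowels, rare, long_words, odd_chars, all_caps, repeated]


print(subject_features("5 Be a mighty warrior in bed! vcrhwt ygjztyjjh"))
# -> [2, 1, 0, 0, 0, 0]
print(subject_features("Fairfield tennis schedule"))
# -> [0, 0, 0, 0, 0, 0]
```

Under these assumptions an ordinary contraction such as "Don't" also trips the special-character count, which hints at the room for parser refinement mentioned in the conclusions.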
12Features from priority and content-type headers
- Binary feature indicating whether the priority
had been set to any level besides normal or
medium
- Binary feature indicating whether a content-type
header appeared within the message headers or
whether the content type had been set to
text/html
13Features from message body
- Proportion of alphabetic words with no vowels and
at least 7 characters
- Proportion of alphabetic words with at least two
of the letters J, K, Q, X, Z
- Proportion of alphabetic words at least 15
characters long
- Binary feature indicating whether the strings
From and To were both present
- Number of HTML opening comment tags
- Number of hyperlinks (href)
- Number of clickable images represented in HTML
- Binary feature indicating whether a text color
was set to white
- Number of URLs in hyperlinks with digits or ,
, or _at_
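A few of the body features are simple enough to sketch directly. The code below covers only the three proportions and the two simplest HTML counts; it assumes that an "alphabetic word" is a whitespace-separated token made only of letters, that an opening comment tag is the literal string `<!--`, and that a hyperlink is any `href=` attribute. The function name and dictionary keys are hypothetical.

```python
import re


def body_features(body):
    """Return a subset of the message-body features as a dictionary."""
    # "Alphabetic words": whitespace-separated tokens made only of letters (assumption).
    words = [w for w in body.split() if w.isalpha()]
    n = len(words) or 1  # avoid division by zero on bodies with no such words
    no_vowel_long = sum(
        1 for w in words if len(w) >= 7 and not set(w.lower()) & set("aeiou")
    )
    rare = sum(1 for w in words if sum(c in "jkqxz" for c in w.lower()) >= 2)
    very_long = sum(1 for w in words if len(w) >= 15)
    return {
        "prop_no_vowel_long": no_vowel_long / n,
        "prop_rare_letters": rare / n,
        "prop_very_long": very_long / n,
        "html_comments": body.count("<!--"),
        "hyperlinks": len(re.findall(r"href\s*=", body, flags=re.IGNORECASE)),
    }


sample = 'hello qzxjkvb <!-- x --> <a href="http://example.com">click</a> world'
print(body_features(sample))
```

Proportions rather than raw counts keep long and short messages comparable, which is presumably why the body features differ from the subject-header counts.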
14Neural Network spam classifier
- 3-layer, feed-forward network (Perceptron)
- 17 input units, variable hidden layer units, 1
output unit
- Data: 1,654 e-mails (854 spam, 800 legitimate)
- Use half of each (spam/non-spam) for training,
the other half for testing
- Test with variations of hidden nodes (4 to 14)
and epochs (100 to 500)
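The network shape and the data split described above can be sketched in plain Python. This shows only the forward pass and the half/half split per class; training is omitted. The sigmoid activation, the weight range, the 0.5 decision cut-off, and all function names are assumptions, since the slides do not state them. The default of 12 hidden units is the setting reported as best on the results slide.

```python
import math
import random


def init_network(n_in=17, n_hidden=12, seed=0):
    """Random small weights; each weight row carries a trailing bias term."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, features):
    """Map the feature values to one output in (0, 1); above 0.5 would mean spam."""
    hidden, output = network
    inputs = list(features) + [1.0]  # constant 1.0 feeds the bias weight
    h = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in hidden]
    return sigmoid(sum(w * x for w, x in zip(output, h + [1.0])))


def split_half(items, seed=0):
    """Shuffle one class and split it into training and test halves."""
    items = list(items)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]


net = init_network()
print(forward(net, [0.0] * 17))
spam_train, spam_test = split_half(range(854))
legit_train, legit_test = split_half(range(800))
print(len(spam_test), len(legit_test))  # 427 400
```

Splitting each class separately keeps the spam/legitimate ratio the same in the training and test halves.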
15Definitions used for classifier success measures
- nSS: number of spam classified as spam
- nSL: number of spam classified as legitimate
- nLL: number of legitimate classified as
legitimate
- nLS: number of legitimate classified as spam
16Measure of success: precision
- Precision: the percentage of labeled
spam/legitimate e-mail correctly classified
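In terms of the four counts defined on the previous slide, the two precision measures can presumably be written as follows. The formulas are a reconstruction from the verbal definition, not copied from the slide:

```latex
\text{Spam precision} = \frac{n_{SS}}{n_{SS} + n_{LS}}, \qquad
\text{Legitimate precision} = \frac{n_{LL}}{n_{LL} + n_{SL}}
```

Each denominator is everything the classifier labeled with that class, whether or not the label was right.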
18Measure of success: accuracy
- Accuracy: the percentage of actual
spam/legitimate e-mail correctly classified
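Using the same four counts, the two accuracy measures can presumably be written as follows. Again this is a reconstruction from the verbal definition:

```latex
\text{Spam accuracy} = \frac{n_{SS}}{n_{SS} + n_{SL}}, \qquad
\text{Legitimate accuracy} = \frac{n_{LL}}{n_{LL} + n_{LS}}
```

Each denominator is everything that actually belongs to the class. This per-class measure is often called recall elsewhere.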
20Neural Network results
- Best overall results with 12 hidden nodes at 500
epochs
- Spam precision: 92.45%
- Legitimate precision: 91.32%
- Spam accuracy: 91.80%
- Legitimate accuracy: 92.00%
- 35 spams misclassified: 8.20%
- 32 legitimates misclassified: 8.00%
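The figures on this slide can be checked from the counts alone. The sketch assumes the test set is exactly half of each class (427 spam, 400 legitimate), that precision divides by everything given a label, and that accuracy divides by everything actually in a class. The function name is hypothetical.

```python
def rates(n_ss, n_sl, n_ll, n_ls):
    """Per-class precision and accuracy, in percent, from the four counts."""
    return {
        "spam_precision": 100.0 * n_ss / (n_ss + n_ls),
        "legit_precision": 100.0 * n_ll / (n_ll + n_sl),
        "spam_accuracy": 100.0 * n_ss / (n_ss + n_sl),
        "legit_accuracy": 100.0 * n_ll / (n_ll + n_ls),
    }


# Test set: half of 854 spam and half of 800 legitimate e-mails.
n_spam_test, n_legit_test = 854 // 2, 800 // 2
n_sl, n_ls = 35, 32  # misclassified counts from the slide
result = rates(n_spam_test - n_sl, n_sl, n_legit_test - n_ls, n_ls)
for name, value in result.items():
    print(f"{name}: {value:.2f}%")
# spam_precision: 92.45%
# legit_precision: 91.32%
# spam_accuracy: 91.80%
# legit_accuracy: 92.00%
```

All four reported percentages follow from nSS = 392, nSL = 35, nLL = 368, and nLS = 32.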
21Misclassified e-mails
- Most spam misclassified as legitimate were short
in length, with few hyperlinks
- Most legitimate e-mails misclassified as spam had
unusual features for personal e-mail (that is,
they were spam-like in appearance)
22Comparing Neural Network and Naïve Bayesian
Classifiers
- Accuracy of the NN classifier is comparable to
that reported for Naïve Bayesian classifiers
- NN classifier required fewer features (17 versus
100 in one study and 500 in another)
- NN classifier uses descriptive qualities of words
and messages similar to those used by human
readers
23Blacklisting Experiment
- Manually entered IP addresses of e-mail
incorrectly tagged by NN classifier
- Entered first (original) IP address and, when
present, second IP address (e.g., mail server or
ISP)
- Into a website that sends IP addresses to 173
working spam blacklists and returns the hits,
http://www.declude.com/junkmail/support/ip4r.htm
- Counted only hit counts greater than one as spam,
since single-list hits were considered to be
anomalies
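The decision rule in the last bullet amounts to a one-line test. In this sketch, `hit_counts` holds the number of blacklists that listed each IP address checked for one e-mail. How hits for the first and second IP address were combined is not stated on the slide, so treating either address as sufficient is an assumption, and the function name is hypothetical.

```python
def blacklist_says_spam(hit_counts, min_hits=2):
    """True when any checked IP address appears on at least min_hits blacklists."""
    return any(hits >= min_hits for hits in hit_counts)


print(blacklist_says_spam([1]))     # False: a single-list hit is ignored
print(blacklist_says_spam([0, 3]))  # True: the second IP is on three lists
```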
24Blacklisting Experimental Results
- Of the 32 legitimate e-mails misclassified by the
NN, 53% were identified as spam
- Of the 35 spam e-mails misclassified by the NN,
97% were identified as spam
- These poor results indicate that the blacklisting
strategy, at least for these databases, is
inadequate
25Conclusions
- NN is competitive with Naïve Bayesian studies
despite using a much smaller feature set
- Room for refinement of parsing for features
- Use of descriptive, more human-like features
makes NN less subject to degradation than Naïve
Bayesian
26Conclusions (cont.)
- Neural Network approach is useful and accurate,
but too many legitimate -> spam
- Should be powerful when used in conjunction with
a whitelist to reduce legitimate -> spam (nLS),
increasing spam precision and legitimate accuracy
- Blacklisting strategy is not very helpful
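The whitelist combination suggested above could look like the following sketch. It was not part of the reported experiments. The function name, the 0.5 cut-off, and the example addresses are hypothetical, and `nn_score` stands for the network's single output value.

```python
def classify(sender, nn_score, whitelist, threshold=0.5):
    """Return "legitimate" or "spam" for one message."""
    if sender.lower() in whitelist:
        return "legitimate"  # known correspondent: skip the network's verdict
    return "spam" if nn_score > threshold else "legitimate"


known = {"coach@example.org"}
print(classify("Coach@example.org", 0.9, known))     # legitimate
print(classify("stranger@example.com", 0.9, known))  # spam
```

Because the whitelist only ever overrides a "spam" verdict, it can lower nLS without adding to it. Any spam that arrives from a whitelisted address would raise nSL instead.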