Title: FILTERING EMAILS
1- FILTERING E-MAILS
- FOR SPAM
- Alokika Dash
- Electrical and Computer Engineering
- University of Maryland
2- WHAT IS SPAMMING AND TEXT FILTERING?
- Unsolicited e-mail.
- Spam e-mail is 50% of all e-mail received by organizations
- How do we get spam?
- Address leakage
- Active snooping by spammers
TEXT FILTERING FOR SPAM EMAILS Alokika
Dash University of Maryland
3- PROPERTIES OF SPAMMING
- The e-mail message is sent to a large number of people, advertising a product or service.
- The e-mails are unwanted.
- The sender of spam frequently attempts to hide or obscure their identity.
- The spammer is not inclined to stop.
4- WHY IS SPAM A CONCERN?
- Direct IT cost
- Delivering the message costs money in internet bandwidth and disk storage.
- Indirect cost
- Employee productivity.
- Business risk
- Spam is unregulated.
- No control over the content of e-mail.
- What should we do?
- Hire someone to read our mail and discard the spam
- Use machine learning techniques to automate this process
5- IDEA BEHIND SPAM DETECTION
- Is message x spam?
- Feature Extraction
- Each document is distilled into a set of features such as words, phrases, meta-data, etc.
- Tokenizing
- This set of features can then be represented as a vector whose components are boolean (multivariate) or real values.
- Classification algorithm
- Uses the feature vector as a basis upon which the document is judged.
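A minimal sketch of the feature-extraction step described above, assuming simple lowercase word tokens and a vocabulary built from the training messages. The function names (`tokenize`, `build_vocabulary`, `boolean_vector`, `count_vector`) and the sample messages are illustrative, not taken from the slides.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed token rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(documents):
    """Sorted list of every distinct token seen in the documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def boolean_vector(text, vocabulary):
    """Boolean (multivariate) features: 1 if the word occurs, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

def count_vector(text, vocabulary):
    """Real-valued features: how many times each word occurs."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

# Hypothetical two-message corpus used only to build a vocabulary.
docs = ["Buy cheap pills now", "Meeting moved to noon"]
vocab = build_vocabulary(docs)
print(vocab)
print(boolean_vector("cheap pills, cheap prices", vocab))
print(count_vector("cheap pills, cheap prices", vocab))
```

Either vector can then be handed to the classification algorithm; words outside the vocabulary (here "prices") are simply ignored.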
6- DOMAIN SPECIFICATION
- FEATURES
- words in the body of the message
- headers (senders and message paths)
- HTML code (like colors)
- word pairs, phrases
- meta information (e.g. publication date, document type, publication source)
- CLASSIFICATION METHODS
- Rule-based or keyword filtering
- Statistical or machine learning
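For contrast with the statistical methods that follow, a rule-based keyword filter might look roughly like this. The keyword list and the `min_hits` parameter are hypothetical choices made for illustration only.

```python
import re

# Hypothetical keyword list; a real rule set would be hand-maintained.
SPAM_KEYWORDS = {"free", "winner", "promotion"}

def keyword_filter(subject, body, keywords=SPAM_KEYWORDS, min_hits=2):
    """Flag a message as spam when at least min_hits distinct keywords appear."""
    words = set(re.findall(r"[a-z0-9']+", (subject + " " + body).lower()))
    return len(words & set(keywords)) >= min_hits

print(keyword_filter("Free promotion", "You are a winner"))
print(keyword_filter("Lunch", "Are you free at noon?"))
```

The rules are fixed by hand, so they do not adapt when spammers change their wording, which is the motivation for the learning-based methods.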
8- MACHINE LEARNING TECHNIQUES
- Bayesian Techniques
- Boosting Trees
- Support Vector Machines
- Decision Trees
9- TRAINING DATA FOR SPAM FILTERING
- The ML approach relies on the availability of an initial corpus
- SpamAssassin Public Corpus
- Total count: 6047 messages, with about a 31% spam ratio.
- LingSpam corpus
- Total count: 2412 messages, with about a 16.6% spam ratio
- Annexia corpus
- Great Spam Archive
- Total count: 15369 spam messages.
- Initial corpus is split in two sets
- Training Set
- Test Set
- Common practice: 90% of the corpus should be used for training while 10% should be used for testing.
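The 90/10 split can be sketched as follows, assuming the corpus is an in-memory list of messages and that a shuffled random split is acceptable. The `split_corpus` helper and its `seed` parameter are illustrative names.

```python
import random

def split_corpus(messages, train_fraction=0.9, seed=0):
    """Shuffle a copy of the corpus and cut it into training and test sets."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-in corpus of 100 message ids.
corpus = list(range(100))
train_set, test_set = split_corpus(corpus)
print(len(train_set), len(test_set))
```

Shuffling before cutting avoids a test set made only of the most recently collected messages.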
10- BAYESIAN FILTERING
- Each message e is first split into several types of tokens: header tokens, body tokens and synthesized tokens.
- Count the number of times each token occurs in each corpus
- Map each token to the probability that an email containing it is spam
- Posterior probability of folder fi given e
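A minimal sketch of these steps, assuming the standard naive Bayes form in which P(fi | e) is proportional to P(fi) times the product of P(wj | fi) over the tokens wj of e. The slide's own formula is not preserved, so this form, the add-one smoothing, and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def train(corpus):
    """corpus maps a folder name to a list of tokenized messages."""
    token_counts = {f: Counter(tok for msg in msgs for tok in msg)
                    for f, msgs in corpus.items()}
    total_msgs = sum(len(msgs) for msgs in corpus.values())
    priors = {f: len(msgs) / total_msgs for f, msgs in corpus.items()}
    vocab = set().union(*token_counts.values())
    return token_counts, priors, vocab

def posterior(tokens, token_counts, priors, vocab):
    """Return P(folder | message) for every folder, computed in log space."""
    log_scores = {}
    for folder in priors:
        total = sum(token_counts[folder].values())
        score = math.log(priors[folder])
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            # Add-one smoothing keeps unseen (token, folder) pairs above zero.
            p = (token_counts[folder][tok] + 1) / (total + len(vocab))
            score += math.log(p)
        log_scores[folder] = score
    top = max(log_scores.values())
    unnormalised = {f: math.exp(s - top) for f, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {f: v / z for f, v in unnormalised.items()}

# Hypothetical toy corpus with two folders.
corpus = {
    "spam": [["cheap", "pills"], ["cheap", "promotion"]],
    "good": [["meeting", "at", "noon"], ["lunch", "at", "noon"]],
}
token_counts, priors, vocab = train(corpus)
result = posterior(["cheap", "promotion"], token_counts, priors, vocab)
print(result)
```

Working with log probabilities avoids numeric underflow when a message contains many tokens.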
11- EXAMPLE PROBABILITY ASSIGNMENT
wj           P(wj | fi)
madam        0.99
promotion    0.99
republic     0.99
shortest     0.047
mandatory    0.047
sorry        0.082
enter        0.907
very         0.147
investment   0.86
12- ADVANTAGES
- They learn automatically from spam and from good mail
- Any message with a probability of over 90% is spam, anything else is good
- Results in a very robust and efficient anti-spam approach that returns hardly any false positives.
- Caught 99.5% of spam with less than 0.03% false positives, which implies that instead of receiving 20 spams per day you might receive one spam every 10 days or so
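The "one spam every 10 days" figure follows from the catch rate: 20 spams per day times a 0.5% miss rate is about 0.1 missed spam per day. A small sketch of that arithmetic, with a helper name chosen for illustration:

```python
def days_between_missed_spam(spam_per_day, catch_rate):
    """Average number of days between spam messages that slip through."""
    missed_per_day = spam_per_day * (1.0 - catch_rate)
    return 1.0 / missed_per_day

# 20 spams per day at a 99.5% catch rate: roughly one missed spam every 10 days.
gap = days_between_missed_spam(20, 0.995)
print(round(gap, 2))
```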
13- ADABOOST ALGORITHM
- Run a given weak learner several times on slightly altered training data
- Use a decision tree algorithm as the weak learner
- Assign each example of the given training set a weight. At the beginning all weights are equal, but in every round the weak learner returns a hypothesis and the weights of the examples it misclassifies are increased
- The final hypothesis is a combination of the hypotheses of all rounds, namely a weighted majority vote, where hypotheses with lower classification error have higher weight
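The rounds described above can be sketched as follows. This assumes boolean word features, labels of +1 (spam) and -1 (good), and a one-level decision tree (a decision stump) as the weak learner; the slides only say "decision tree", so the stump, the function names, and the toy data are assumptions.

```python
import math

def train_stump(X, y, weights):
    """Pick the (feature, polarity) pair with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for polarity in (1, -1):
            preds = [polarity if x[j] else -polarity for x in X]
            err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
            if best is None or err < best[0]:
                best = (err, j, polarity)
    return best

def stump_predict(x, j, polarity):
    """Predict `polarity` when feature j is present, the opposite otherwise."""
    return polarity if x[j] else -polarity

def adaboost(X, y, rounds=3):
    n = len(X)
    weights = [1.0 / n] * n  # all examples start with equal weight
    ensemble = []
    for _ in range(rounds):
        err, j, polarity = train_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # lower error, higher vote
        ensemble.append((alpha, j, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * t * stump_predict(x, j, polarity))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote over all rounds."""
    score = sum(a * stump_predict(x, j, p) for a, j, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical features: [contains "free", contains "meeting", contains "promotion"]
X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y = [1, 1, 1, -1, -1, -1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, x) for x in X])
```

No single stump classifies this toy set perfectly, but the weighted vote of three stumps does, which is the point of combining weak hypotheses.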
14- EXAMPLE (continued on slides 15 to 17; figures not preserved)
18- ADVANTAGES
- Fast and simple
- Flexible: can combine with any weak learner
- No prior knowledge needed about the weak learner
- Not prone to overfitting
19- BAYESIAN NETWORK VS ADABOOST
- Number of frequent words used: 50
- Number of frequent words used: 100
20- CHALLENGES IN DETECTING SPAM
- The more aggressive a single filter becomes, the more likely it is that a good message will be blocked accidentally (FALSE POSITIVES)
- For example
- An entertainment conglomerate estimated that a single lost email from an important customer could cost them more than 100,000
- Incorrectly blocked mail from a constituent could cost the votes of one's family, friends and neighbors.
- Spam-fighting accuracy declines in proportion with the number of people covered by the filter.
21- CHALLENGES IN DETECTING SPAM
- Spammers Get Past Filters
- Changing message content
- Increasing message volume
- New delivery mechanisms
- Attacking anti-spam groups
22- FUTURE WORK IN TEXT FILTERING
- Filter based on word triples or longer phrases rather than individual words
- Add a second level of testing designed specifically to avoid false positives.
- Second level of filtering may or may not be Bayesian
- If a mail triggers this second level of filters it will be accepted even if its spam probability is above the threshold.
- Focus extra attention on specific parts of the email
- Decompose domain names from the rest of the text in an email, since they often consist of several words stuck together.
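Two of these ideas can be sketched briefly: word triples as features, and a second-level check that overrides the spam threshold. Both functions are hypothetical illustrations; the 0.9 threshold is taken from the 90% rule on the earlier slide, and how the second-level test decides is left open here.

```python
def word_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a single phrase feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def two_level_decision(spam_probability, second_level_triggered, threshold=0.9):
    """Accept mail below the threshold, or any mail the second level rescues."""
    if spam_probability <= threshold:
        return "accept"
    return "accept" if second_level_triggered else "reject"

print(word_ngrams("buy cheap pills now".split()))
print(two_level_decision(0.97, second_level_triggered=True))
print(two_level_decision(0.97, second_level_triggered=False))
```

The phrase features from `word_ngrams` can be counted and scored the same way as single-word tokens.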
23- QUESTIONS?