Title: SPAM
1. SPAM
Christian Loza, Srikanth Palla, Liqin Zhang
2. Overview
- Introduction
- Background
- Measurement
- Methods
- Compare different methods
- Conclusions
3. Introduction
- If you use email, it's likely that you've recently been visited by a piece of spam: an unsolicited, unwanted message sent to you without your permission. Sending spam violates the Acceptable Use Policy (AUP) of almost all ISPs and can lead to the termination of the sender's account. As the recipient directly bears the cost of delivery, storage, and processing, one could regard spam as the electronic equivalent of "postage-due" junk mail.
4. Introduction
- Spammers frequently engage in deliberate fraud to
send out their messages. Spammers often use false
names, addresses, phone numbers, and other
contact information to set up "disposable"
accounts at various Internet service providers.
They also often use falsified or stolen credit
card numbers to pay for these accounts. This
allows them to move quickly from one account to
the next as the host ISPs discover and shut down
each one.
5. Introduction
- In recent years, spam has shown no signs of slowing its growth.
- This is mainly because it works.
- For the sender, it is a cheap way to increase the customer base.
7.
- Spammers frequently go to great lengths to conceal the origin of their messages. They do this by spoofing e-mail addresses: the spammer abuses the email protocol SMTP so that a message appears to originate from another email address. Some ISPs and domains require the use of SMTP AUTH, allowing positive identification of the specific account from which an e-mail originates.
8.
- One cannot completely spoof an e-mail address chain, since the receiving mailserver records the actual connection from the last mailserver's IP address; however, spammers can forge the rest of the ostensible history of the mailservers the e-mail has traversed. Spammers frequently seek out and make use of vulnerable third-party systems such as open mail relays and open proxy servers.
9. Address Collection
- Spammers may harvest e-mail addresses from a number of sources. A popular method uses e-mail addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web for pages with addresses, such as corporate staff directories, can yield thousands of addresses, most of them deliverable.
10. Address Collection
- Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally crawled these resources for email addresses. Many spammers utilize programs called Web spiders to find email addresses on web pages. Because spammers offload the bulk of their costs onto others, they can use even more computationally expensive means to generate addresses.
11. Address Collection
- A dictionary attack consists of an exhaustive attempt to gain access to a resource by trying all possible credentials, usually usernames and passwords. Spammers have applied this principle to guessing email addresses, for example by taking common names and generating likely email addresses for them at each of thousands of domain names. Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site.
12. Terminology
- To better understand the concepts in this presentation, let us consider the following terminology.
- Mail User Agent (MUA). The program used by the client to send and receive e-mail. It is usually referred to as the "mail client." Examples are Pine and Eudora.
- Mail Transfer Agent (MTA). The program running on the server to store and forward e-mail messages. It is usually referred to as the "mail server program." Examples are sendmail and Microsoft Exchange server.
13. The Mail Queue
14.
- In a normal configuration, sendmail sits in the background waiting for new messages. When a new connection arrives, a child process is invoked to handle the connection, while the parent process goes back to listening for new connections.
- When a message is received, the sendmail child process puts it into the mail queue (usually stored in /var/spool/mqueue). If it is immediately deliverable, it is delivered and removed from the queue. If it is not immediately deliverable, it will be left in the queue and the process will terminate.
- Messages left in the queue will stay there until the next time the queue is processed. The parent sendmail will usually fork a child process at regular intervals to attempt to deliver anything left in the queue.
15. Structure of an E-mail Message
- Email messages are composed of two parts:
- 1. Headers: lines of the form "Field: value" which contain information about the message, such as "To", "From", "Date", and "Message-ID".
- 2. Body: the text of the message.
16. Example

From johndoe@students.uiuc.edu Mon Jul 5 23:46:19 1999
Received: (from johndoe@localhost)
    by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
    Mon, 5 Jul 1999 23:46:18 -0500
Received: from staff.uiuc.edu (staff.uiuc.edu [128.174.5.59])
    by students.uiuc.edu (8.9.3/8.9.3) id XAA24214;
    Mon, 5 Jul 1999 23:46:25 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <johndoe@students.uiuc.edu>
To: John Smith <jsmith@staff.uiuc.edu>
Message-Id: <199907052346.LAA05394@students.uiuc.edu>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank line.
The message body can span multiple lines.
17.
- Here is an example SMTP transaction:
- 1. Client connects to the server's SMTP port (25).
- 2. Server: 220 staff.uiuc.edu ESMTP Sendmail 8.10.0/8.10.0 ready Mon, 13 Mar 2000 14:54:08 -0600
- 3. Client: helo students.uiuc.edu
- 4. Server: 250 staff.uiuc.edu Hello root@students.uiuc.edu [128.174.5.62], pleased to meet you
- 5. Client: mail from: johndoe@students.uiuc.edu
- 6. Server: 250 2.1.0 johndoe@students.uiuc.edu... Sender ok
- 7. Client: rcpt to: jsmith@staff.uiuc.edu
- 8. Server: 250 2.1.5 jsmith@staff.uiuc.edu... Recipient ok
- 9. Client: data
- 10. Server: 354 Enter mail, end with "." on a line by itself
- 11. Client:
    Received: (from johndoe@localhost)
        by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
        Mon, 5 Jul 1999 23:46:18 -0500
    Date: Mon, 5 Jul 1999 23:46:18 -0500
    From: John Doe <johndoe@students.uiuc.edu>
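The client side of an SMTP dialogue like the one above can be sketched as a plain sequence of command strings (no network I/O here; the hostnames and addresses are the slide's examples, and the helper function is invented for illustration):

```python
# Sketch: the commands a minimal SMTP client would send, in order,
# mirroring the helo / mail from / rcpt to / data dialogue above.
def smtp_client_commands(helo_host, sender, recipient, body_lines):
    """Return the client-side SMTP command sequence as a list of strings."""
    cmds = [
        f"helo {helo_host}",
        f"mail from: {sender}",
        f"rcpt to: {recipient}",
        "data",
    ]
    cmds.extend(body_lines)
    cmds.append(".")          # a lone dot on its own line terminates DATA
    return cmds

cmds = smtp_client_commands(
    "students.uiuc.edu",
    "johndoe@students.uiuc.edu",
    "jsmith@staff.uiuc.edu",
    ["Subject: hello", "", "message body"],
)
print(cmds[0])
```

In practice Python's standard smtplib module carries out this same dialogue for you; the point of the sketch is only to make the command order explicit.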
18. Delivering Spam Messages
- Early on, spammers discovered that if they sent large quantities of spam directly from their ISP accounts, recipients would complain and ISPs would shut their accounts down. Thus, one of the basic techniques of sending spam has become to send it from someone else's computer and network connection. By doing this, spammers protect themselves in several ways: they hide their tracks, get others' systems to do most of the work of delivering messages, and direct the efforts of investigators towards those other systems rather than the spammers themselves.
19. Mail Filters
- A mail filter is a piece of software which takes an email message as input. For its output, it might pass the message through unchanged for delivery to the user's mailbox, it might redirect the message for delivery elsewhere, or it might even throw the message away. Some mail filters are also able to edit messages during processing.
20. Introduction
- An application of text categorization.
- Spam classification is defined as a binary problem: an email either is spam or is not spam.
- Automatic text categorization assigns emails to one of these two categories, using different methods.
- One of these methods is centroid-based classification.
[Diagram: incoming email routed into the two classes SPAM / NOT SPAM]
21. Background
- Text classification: classify documents into categories
  - spam
  - non-spam
- Classification process:
  - Preprocess the message
    - Remove tags
    - Stop-word removal
    - Word stemming
  - Training: build the classification model
  - Testing: evaluate the model
22. Methodologies
- Naïve Bayes
- Centroid-based
- Content-based
23. Bayesianism
- Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of a statement, that is, to the degree of belief a rational agent holds in the truth of a statement. When a statement's plausibility is updated using Bayes' theorem, the result is a Bayesian inference.
24. Bayes' Rule
- If A and B are two separate but possibly dependent random events, then:
- The probability of A and B occurring together: Pr(A, B)
- The conditional probability of A, given that B occurs: Pr(A|B)
- The conditional probability of B, given that A occurs: Pr(B|A)
25.
- From elementary rules of probability:

  Pr(A, B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A)

- Dividing the right-hand pair of expressions by Pr(B) gives Bayes' rule:

  Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
26.
- In problems of probabilistic inference, we are often trying to estimate the most probable underlying model for a random process, based on some observed data or evidence. If A represents a given set of model parameters, and B represents the set of observed data values, then the terms in the equation are given the following terminology:
- Pr(A) is the prior probability of the model A (in the absence of any evidence)
- Pr(B) is the probability of the evidence B
- Pr(B|A) is the likelihood that the evidence B was produced, given that the model was A
- Pr(A|B) is the posterior probability of the model being A, given that the evidence is B
27.
- Mathematically, Bayes' rule states:

  posterior = (likelihood × prior) / marginal likelihood
28. Representing E-mail for Statistical Algorithms
- All statistical algorithms for spam filtering begin with a vector representation of individual e-mail messages.
- The length of the term vector is the number of distinct words in all the e-mail messages in the training data. The entry for a particular word in the term vector for a particular e-mail message is usually the number of occurrences of the word in that e-mail message.
29. Training Data Comprising Four Labeled E-mail Messages
- The table below presents toy training data comprising four e-mail messages. These data contain ten distinct words: the, quick, brown, fox, rabbit, ran, and, run, at, rest.

  Message  Text                           Spam
  1        The quick brown fox            no
  2        The quick rabbit ran and ran   yes
  3        rabbit run run run             no
  4        rabbit at rest                 yes
30. Term Vectors Corresponding to the Training Data

     and  at  brown  fox  quick  rabbit  ran  rest  run  the
  1    0   0      1    1      1       0    0     0    0    1
  2    1   0      0    0      1       1    2     0    0    1
  3    0   0      0    0      0       1    0     0    3    0
  4    0   1      0    0      0       1    0     1    0    0
31.
- If the training data comprise thousands of e-mail messages, the number of distinct words often exceeds 10,000. Two simple strategies to reduce the size of the term vector somewhat are to remove stop words (words like and, of, the, etc.) and to reduce words to their root form, a process known as stemming (so, for example, ran and run reduce to run). The next table shows the reduced term vectors along with the spam label.
32. Term Vectors After Stemming and Stop-Word Removal (spam label coded as 0 = no, 1 = yes)

     X1     X2   X3     X4      X5    X6   Y
     brown  fox  quick  rabbit  rest  run  Spam
  1      1    1      1       0     0    0     0
  2      0    0      1       1     0    2     1
  3      0    0      0       1     0    3     0
  4      0    0      0       1     1    0     1
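The reduced term vectors above can be built mechanically from the four toy messages; a minimal sketch, where "stemming" is just the single ran-to-run merge the slides use:

```python
# Build stemmed, stop-word-free term-frequency vectors for the toy corpus.
STOP_WORDS = {"the", "and", "at"}
STEM = {"ran": "run"}             # toy stemmer: only the merge used in the slides

messages = [
    ("the quick brown fox", 0),
    ("the quick rabbit ran and ran", 1),
    ("rabbit run run run", 0),
    ("rabbit at rest", 1),
]

def tokens(text):
    """Lowercased words with stop words dropped and stems applied."""
    return [STEM.get(w, w) for w in text.split() if w not in STOP_WORDS]

vocab = sorted({w for text, _ in messages for w in tokens(text)})
vectors = [[tokens(text).count(w) for w in vocab] for text, _ in messages]
print(vocab)        # ['brown', 'fox', 'quick', 'rabbit', 'rest', 'run']
print(vectors[1])   # message 2: quick 1, rabbit 1, run 2
```

The resulting vocabulary and counts match the six-column table above.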
33. Naïve Bayes for Spam
- Let X = (X1, ..., Xd) denote the term vector for a random e-mail message, where d is the number of distinct words in the training data after stemming and stop-word removal. Let Y denote the corresponding spam label. The Naïve Bayes model seeks to build a model for

  Pr(Y = 1 | X1 = x1, ..., Xd = xd).

- From Bayes' theorem, we have

  Pr(Y = 1 | X1 = x1, ..., Xd = xd) = Pr(Y = 1) Pr(X1 = x1, ..., Xd = xd | Y = 1) / Pr(X1 = x1, ..., Xd = xd)
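A minimal multinomial Naïve Bayes sketch on the toy term vectors from the earlier table; the Laplace smoothing (adding 1 to every count) is an added assumption, there so that unseen words do not zero out the product:

```python
# Multinomial Naive Bayes on the toy data: Pr(Y=c) * prod_j Pr(word_j|c)^x_j,
# computed in log space, with Laplace-smoothed word probabilities.
import math

vocab = ["brown", "fox", "quick", "rabbit", "rest", "run"]
X = [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 0, 2], [0, 0, 0, 1, 0, 3], [0, 0, 0, 1, 1, 0]]
y = [0, 1, 0, 1]

def train(X, y):
    params = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        counts = [sum(col) + 1 for col in zip(*rows)]     # Laplace smoothing
        total = sum(counts)
        params[c] = (len(rows) / len(y),                  # class prior Pr(Y=c)
                     [cnt / total for cnt in counts])     # Pr(word_j | Y=c)
    return params

def predict(params, x):
    scores = {}
    for c, (prior, probs) in params.items():
        scores[c] = math.log(prior) + sum(n * math.log(p) for n, p in zip(x, probs))
    return max(scores, key=scores.get)

params = train(X, y)
print(predict(params, [0, 0, 1, 1, 0, 2]))   # same counts as spam message 2
```

The "naïve" step is the product over words: the joint likelihood Pr(X1, ..., Xd | Y) is approximated as a product of per-word probabilities, which is what makes the model tractable.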
34. Centroid-Based Method
- The documents are represented using a vector-space model.
- Each document is represented as a term-frequency (TF) vector.
[Figure: documents d1 to d4 plotted as vectors in term space, axes t1 and t2]
35. Centroid-Based Method
- A refinement of this model is the inverse document frequency (IDF).
- Its purpose is to limit the discriminating power of frequent terms and stop words, and to emphasize words that appear in specific documents.
- IDF_i = log(N / df_i), where N is the number of documents and df_i is the number of documents containing term i.
- The document vectors are normalized, so document size does not matter.
36. Centroid-Based Method
- The distance between two vectors is defined using the cosine function:

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

- Finally, one centroid vector C is defined for each category (spam / not spam), as the average of the category's document vectors:

  C = (1/|S|) Σ_{d ∈ S} d

37. Centroid-Based Method
- We can measure the similarity between a document d and the centroid C of a category with the same cosine function: sim(d, C) = cos(d, C).
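The two ingredients above, the per-category centroid and the cosine similarity, can be sketched in a few lines (the TF vectors here are invented for illustration):

```python
# Centroid of a set of document vectors, and cosine similarity to it.
import math

def centroid(docs):
    """Component-wise mean of a category's document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

spam_docs = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]   # made-up TF vectors
c_spam = centroid(spam_docs)
print(c_spam)                                     # [0.0, 2.0, 2.0]
print(round(cosine([0.0, 2.0, 2.0], c_spam), 3))  # 1.0: same direction
```

Because cosine depends only on direction, a document pointing the same way as a centroid scores 1 regardless of its length, which is exactly why the slides normalize document size away.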
38. Steps: Centroid-Based Method
- TRAINING
- Determine the document vectors using TF/IDF.
[Figure: documents d1 to d8 plotted in term space, axes t1 and t2]
39. Steps: Centroid-Based Method
- TRAINING
- Calculate the centroids for the categories SPAM and NOT SPAM.
[Figure: documents d1 to d8 with the two centroids C_SPAM and C_NOT-SPAM in term space]
40. Steps: Centroid-Based Method
- CLASSIFICATION
- Given a new document dn, calculate its document vector representation (as in the training stage).
[Figure: the new document dn plotted in term space]
41. Steps: Centroid-Based Method
- CLASSIFICATION
- Measure the distance between the vector dn and the centroids of the categories SPAM / NOT SPAM.
[Figure: dn compared against C_SPAM and C_NOT-SPAM in term space]
43. Steps: Centroid-Based Method
- FINAL RESULT
- Assign the document to the category whose centroid is most similar:

  class(dn) = argmax_i sim(dn, C_i),  for i = 1, 2 where 1 = SPAM and 2 = NOT SPAM
44. Analysis of Results
- The standard methodology for measuring the performance of text classification methods is precision and recall:

  P = (number of correctly predicted positives) / (number of predicted positive examples)

  R = (number of correctly predicted positives) / (number of all positive examples)
45. Analysis of Results
- Neither precision nor recall gives a good measure by itself. To get an idea of overall performance, we have to combine them:

  F = 2PR / (P + R)

[Figure: precision-recall trade-off, with "better" towards high P and high R]
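The three measures above, computed from made-up counts (tp = correctly predicted positives, fp = predicted positives that were wrong, fn = missed positives):

```python
# Precision, recall, and the F measure from a confusion-matrix corner.
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)        # correct positives / predicted positives
    r = tp / (tp + fn)        # correct positives / all actual positives
    f = 2 * p * r / (p + r)   # harmonic mean of P and R
    return p, r, f

p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(p, r, round(f, 3))      # 0.8 0.666... 0.727
```

Because F is a harmonic mean, it is dragged towards the smaller of the two, so a classifier cannot hide a poor recall behind a high precision (or vice versa).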
46. Some Results
- Compared against kNN and Naïve Bayes, the centroid method performs better.
47. Content-Based Approach
- Spam can be detected:
- Before reading the message (non-content based):
  - based on a special protocol, e.g. a VoIP protocol [3]
  - based on the address book: build an email network [1]
  - based on the IP address [4]
  - ...
- After processing the content of the email (content based).
48. Content-Based Approach
- Non-content-based approach:
  - removes spam messages containing viruses or worms before they are read
  - leaves some messages unlabeled
- Content-based method:
  - widely used
  - may need many pre-labeled messages
  - labels a message based on its content
- Zdziarski [5] said that it is possible to stop spam, and that content-based filters are the way to do it.
- We focus on content-based methods.
49. Content-Based Methods
- Bayesian-based method [6]
- Centroid-based method [7]
- Machine learning methods [8]
  - Latent Semantic Indexing (LSI)
  - Contextual Network Graphs (CNG)
- Rule-based method [9]
  - RIPPER rules: a list of predefined rules that can be changed by hand
- Memory-based method [10]
  - cost saving
50. Measurement
- Accuracy: the percentage of correctly classified messages, correct / (correct + incorrect).
- False positive: a legitimate (non-spam) message misclassified as spam.
- Goals:
  - Improve accuracy
  - Prevent false positives

            Correct   Incorrect
  Spam
  No spam             False positive
51. Measurement
52. Rule-Based Method
- A list of predefined rules that can be changed by hand (e.g. RIPPER rules).
- Each rule/test is associated with a score.
- If an email matches a rule, its score is increased.
- After applying all rules, if the score is above a certain threshold, the email is classified as spam.
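The score-and-threshold loop described above can be sketched directly; the specific rules, point values, and threshold below are invented for illustration:

```python
# Rule-based scoring: each matching rule adds its points; the total is
# compared against a threshold (rules and scores here are made up).
RULES = [
    (lambda m: "viagra" in m.lower(),      3.0),
    (lambda m: m.isupper(),                2.0),   # shouting: all-caps message
    (lambda m: "unsubscribe" in m.lower(), 1.5),
]
THRESHOLD = 4.0

def is_spam(message):
    score = sum(points for rule, points in RULES if rule(message))
    return score >= THRESHOLD

print(is_spam("BUY VIAGRA NOW"))   # True: 3.0 + 2.0 = 5.0 >= 4.0
print(is_spam("Lunch tomorrow?"))  # False: no rule matches
```

Real systems of this kind (SpamAssassin is the best-known example) ship hundreds of such tests, each with a hand-tuned score, which is exactly the maintenance burden the next slide lists as the method's disadvantage.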
53. Rule-Based Method
- Advantages:
  - able to employ diverse and specific rules to check for spam, e.g.:
    - check the size of the email
    - check the number of pictures it contains
  - no training messages are needed
- Disadvantage:
  - rules have to be entered and maintained by hand; this cannot be automated
54. Latent Semantic Indexing
- Keywords:
  - important words for text classification
  - high-frequency words in a message
  - can be used as indicators for the message
- Why LSI?
  - Polysemy: a word can be used in more than one category (e.g. "play")
  - Synonymy: two words can have identical meanings
- Based on a nearest-neighbors algorithm.
55. Latent Semantic Indexing
- Considers semantic links between words.
- Searches for keywords over the semantic space.
- Two words that have the same meaning are treated as one word:
  - eliminates synonymy
- Considers the overlap between different messages; this overlap may indicate:
  - polysemy or a stop word
  - two messages in the same category
56. Latent Semantic Indexing
- Step 1: build a term-document matrix X from the input documents.

  Doc1: computer science department
  Doc2: computer science and engineering science
  Doc3: engineering school

        computer  science  department  and  engineering  school
  Doc1         1        1           1    0            0       0
  Doc2         1        2           0    1            1       0
  Doc3         0        0           0    0            1       1
57. Latent Semantic Indexing
- Step 2: Singular Value Decomposition (SVD) is performed on matrix X
  - to extract a set of linearly independent factors that describe the matrix
  - to generalize over terms that have the same meaning
- Three new matrices T, S, D are produced, which can be truncated to reduce the vocabulary's size.
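Step 2 on the toy matrix from the previous slide can be sketched with NumPy; rows are documents and columns are the six terms, matching the table above (the choice of k = 2 retained factors is an illustrative assumption):

```python
# SVD of the toy term-document matrix: X = T S D^T, then keep the k largest
# factors to get the reduced "semantic" representation LSI works in.
import numpy as np

#             computer science dept and engineering school
X = np.array([[1, 1, 1, 0, 0, 0],     # Doc1
              [1, 2, 0, 1, 1, 0],     # Doc2
              [0, 0, 0, 0, 1, 1]],    # Doc3
             dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # rank-k approximation of X

print(s.shape)    # three singular values for a rank-3 matrix
print(X_k.shape)  # same shape as X, but only k independent factors remain
```

Documents are then compared in the k-dimensional factor space rather than over the raw vocabulary, which is where synonymous terms get merged.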
58. Latent Semantic Indexing
- Two documents can be compared by finding the distance between their document vectors in the reduced space.
- Text classification is done by finding the nearest neighbors of the test document and assigning it to the category with the most documents among those neighbors.
59. The nearest-neighbors method classifies the test message as UN-SPAM.
[Figure: the test message plotted among spam and un-spam training points]
60. Latent Semantic Indexing
- Advantages:
  - the entire training set can be learned at the same time
  - no intermediate model needs to be built
  - good when the training set is predefined
- Disadvantages:
  - when a new document is added, matrix X changes, and T, S, D need to be recalculated
  - time consuming
  - a real classifier needs the ability to change the training set
61. Contextual Network Graphs
- A weighted, bipartite, undirected graph of term nodes and document nodes.
- At any time, for each node, the sum of its edge weights is 1.
[Figure: term nodes t1, t2, t3 connected to document nodes d1, d2 by weighted edges w11 ... w23; e.g. w11 + w21 = 1 at t1, and w11 + w12 + w13 = 1 at d1]
62. Contextual Network Graphs
- When a new document d is added, the weights at node d are energized, and the weights at the connected nodes may need to be rebalanced.
- The document is classified to the class with the maximum average energy (weight).
63. Comparison: Bayesian, LSI, CNG, centroid, rule-based
64. Results
65. Results and Conclusion
- LSI and CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and negatives by up to 71%.
- LSI and CNG show better performance even with a small document set.
66. Comparison: Content-Based and Non-Content-Based
- Non-content based:
  - Disadvantages:
    - depends on special factors such as the email address, IP address, or a special protocol
    - leaves some messages unclassified
  - Advantage: detects spam before the message is read, with high accuracy
67.
- Content based:
  - Disadvantages:
    - needs some training messages
    - not 100% correctly classified, since spammers also know the anti-spam techniques
  - Advantage:
    - leaves no message unclassified
68. Improvements for Spam Filtering
- Combine both methods.
- [1] proposes an email-network-based algorithm with 100% accuracy that leaves 47% of messages unclassified; combined with a content-based method, its performance can be improved.
- Build up multiple layers [11].
- [11] Chris Miller, A Layered Approach to Enterprise Antispam
69. Data Sets for Spam
- Non-content based:
  - Email network: one author's email corpus, formed by 5,486 messages
  - IP address: none
- Content based:
70. Data Sets for Spam
- LSI and CNG:
  - corpora of varying size (250 to 4,000 messages)
  - spam and non-spam emails in equal amounts
- Bayesian based:
  - corpus of 1,789 emails
  - 211 spam, 1,578 non-spam
- Centroid based:
  - 200 email messages in total
  - 90 spam, 110 non-spam
71. Most Recently Used Benchmarks
- Reuters:
  - about 7,700 training and 3,000 test documents, 30,000 terms, 135 categories, 21 MB
  - each category has about 57 instances
  - a collection of newswire stories
- 20NG:
  - about 18,800 total documents, 94,000 terms, 20 topics, 25 MB
  - each category has about 1,000 instances
- WebKB:
  - about 8,300 documents, 7 categories, 26 MB
  - each category has about 1,200 instances
  - data from 4 university websites
- The above three are well known in recent IR research; they are small in size and are used to test classification performance and CPU scalability.
72. Benchmarks
- OHSUMED:
  - 348,566 documents, 230,000 terms and 308,511 topics, 400 MB
  - each category has about 1 instance
  - abstracts from medical journals
- Dmoz:
  - 482 topics, 300 training documents for each topic, 271 MB
  - each category has less than 1 instance
  - taken from the Dmoz (http://dmoz.org/) topic tree
  - a large dataset, used to test the memory scalability of a model
73. Some Facts
- Spam is a growing problem, and research on this topic has become more relevant in recent years.
- Spam grows because it works.
- Many commercial products try to fight spam. Most of them rely on the techniques presented here, or combinations of them.
- Spam damages the economy more than hackers or viruses do.
74. Some Facts
- Damages attributed to spam are estimated at around $10.4 billion in 2003 and $58 to $112 billion in 2004, and are projected to cross $200 billion worldwide in 2005.
- 1.6 trillion unsolicited messages were sent in 2004.
75. Conclusions
- Spam is a problem that has a great impact on global business.
- We presented three methods for spam classification.
- The benchmarks on these three methods suggest that combinations of the methods perform better than any single method alone.
76. Conclusions
- Spam classifiers can be content based or non-content based.
- Content based: rules, Naïve Bayes, centroid.
- Non-content-based classifiers work without reading the content of the mail.
77. Conclusions
- Researchers have found ways to increase the accuracy of all the methods, using heuristics and combining them.
- Spammers also learn how to avoid spam filters.
- No single method is perfect in all situations.
78. Sources
- Slide 1, image: http://www.ecommerce-guide.com
- Slide 1, image: http://www.email-firewall.jp/products/das.html
79. References
- Anti-Spam Filtering: A Centroid-Based Classification Approach, Nuanwan Soonthornphisaj, Kanokwan Chaikulseriwat, Piyan Tang-On, 2002
- Centroid-Based Document Classification: Analysis and Experimental Results, Eui-Hong (Sam) Han and George Karypis, 2000
- Multi-dimensional Text Classification, Thanaruk Theeramunkong, 2002
- Improving Centroid-Based Text Classification Using Term-Distribution-Based Weighting System and Clustering, Thanaruk Theeramunkong and Verayuth Lertnattee
- Combining Homogeneous Classifiers for Centroid-Based Text Classification, Verayuth Lertnattee and Thanaruk Theeramunkong
80. References
- [1] P. Oscar Boykin and Vwani Roychowdhury, Personal Email Networks: An Effective Anti-Spam Tool, IEEE Computer, volume 38, 2004
- [2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher, SpamRank: Fully Automatic Link Spam Detection, citeseer.ist.psu.edu/benczur05spamrank.html
- [3] R. Dantu, P. Kolan, Detecting Spam in VoIP Networks, Proceedings of the USENIX SRUTI (Steps for Reducing Unwanted Traffic on the Internet) Workshop, July 2005 (accepted)
- [4] IP addresses in email clients: http://www.ceas.cc/papers-2004/162.pdf
- [5] A Plan for Spam: http://www.paulgraham.com/spam.html
81. References
- [6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian Approach to Filtering Junk E-Mail, Learning for Text Categorization: Papers from the AAAI Workshop, pages 55-62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05
- [7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, Anti-Spam Filtering: A Centroid-Based Classification Approach, IEEE Proceedings ICSP '02
- [8] Spam Filtering Using Contextual Network Graphs: www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
- [9] W.W. Cohen, Learning Rules that Classify E-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996
- [10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, P. Stamatopoulos, A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists, Information Retrieval, 2003