Title: SPAM
1. SPAM
Christian Loza, Srikanth Palla, Liqin Zhang
2. Overview
- Introduction
- Background
- Measurement
- Methods
- Compare different methods
- Conclusions
3. Introduction
- If you use email, it's likely that you've recently been visited by a piece of spam: an unsolicited, unwanted message sent to you without your permission. Sending spam violates the Acceptable Use Policy (AUP) of almost all ISPs and can lead to the termination of the sender's account. As the recipient directly bears the cost of delivery, storage, and processing, one could regard spam as the electronic equivalent of "postage-due" junk mail.
4. Introduction
- Spammers frequently engage in deliberate fraud to
send out their messages. Spammers often use false
names, addresses, phone numbers, and other
contact information to set up "disposable"
accounts at various Internet service providers.
They also often use falsified or stolen credit
card numbers to pay for these accounts. This
allows them to move quickly from one account to
the next as the host ISPs discover and shut down
each one.
5. Introduction
- In recent years, spam has shown no signs of slowing its growth.
- This is mainly because it works.
- For the sender, it is a cheap way to increase the customer base.
7.
- Spammers frequently go to great lengths to conceal the origin of their messages. They do this by spoofing e-mail addresses: the spammer abuses the email protocol SMTP so that a message appears to originate from another email address. Some ISPs and domains require the use of SMTP AUTH, allowing positive identification of the specific account from which an e-mail originates.
8.
- One cannot completely spoof an e-mail address chain, since the receiving mailserver records the actual connection from the last mailserver's IP address; however, spammers can forge the rest of the ostensible history of the mailservers the e-mail has traversed. Spammers frequently seek out and make use of vulnerable third-party systems such as open mail relays and open proxy servers.
9. Address Collection
- Spammers may harvest e-mail addresses from a number of sources. A popular method uses e-mail addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web for pages with addresses, such as corporate staff directories, can yield thousands of addresses, most of them deliverable.
10. Address Collection
- Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally crawled these resources for email addresses. Many spammers utilize programs called Web spiders to find email addresses on web pages. Because spammers offload the bulk of their costs onto others, they can use even more computationally expensive means to generate addresses.
11. Address Collection
- A dictionary attack consists of an exhaustive attempt to gain access to a resource by trying all possible credentials, usually usernames and passwords. Spammers have applied this principle to guessing email addresses, for example by taking common names and generating likely email addresses for them at each of thousands of domain names. Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site.
12. Terminology
- To better understand the concepts in this presentation, let us consider the following terminology.
- Mail User Agent (MUA). The program used by the client to send and receive e-mail. It is usually referred to as the "mail client." Examples are Pine and Eudora.
- Mail Transfer Agent (MTA). The program running on the server to store and forward e-mail messages. It is usually referred to as the "mail server program." Examples are sendmail and Microsoft Exchange server.
13. The Mail Queue
14.
- In a normal configuration, sendmail sits in the background waiting for new messages. When a new connection arrives, a child process is invoked to handle the connection, while the parent process goes back to listening for new connections.
- When a message is received, the sendmail child process puts it into the mail queue (usually stored in /var/spool/mqueue). If it is immediately deliverable, it is delivered and removed from the queue. If it is not immediately deliverable, it will be left in the queue and the process will terminate.
- Messages left in the queue will stay there until the next time the queue is processed. The parent sendmail will usually fork a child process at regular intervals to attempt to deliver anything left in the queue.
15. Structure of an E-mail Message
- Email messages are composed of two parts:
- 1. Headers: lines of the form "Field: value" which contain information about the message, such as "To", "From", "Date", and "Message-ID".
- 2. Body: the text of the message.
16. Example

From johndoe@students.uiuc.edu Mon Jul 5 23:46:19 1999
Received: (from johndoe@localhost)
    by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
    Mon, 5 Jul 1999 23:46:18 -0500
Received: from staff.uiuc.edu (staff.uiuc.edu [128.174.5.59])
    by students.uiuc.edu (8.9.3/8.9.3) id XAA24214;
    Mon, 5 Jul 1999 23:46:25 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <johndoe@students.uiuc.edu>
To: John Smith <jsmith@staff.uiuc.edu>
Message-Id: <199907052346.LAA05394@students.uiuc.edu>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank line.
The message body can span multiple lines.
17.
- Here is an example SMTP transaction:
- 1. Client connects to the server's SMTP port (25).
- 2. Server: 220 staff.uiuc.edu ESMTP Sendmail 8.10.0/8.10.0 ready Mon, 13 Mar 2000 14:54:08 -0600
- 3. Client: helo students.uiuc.edu
- 4. Server: 250 staff.uiuc.edu Hello root@students.uiuc.edu [128.174.5.62], pleased to meet you
- 5. Client: mail from: johndoe@students.uiuc.edu
- 6. Server: 250 2.1.0 johndoe@students.uiuc.edu... Sender ok
- 7. Client: rcpt to: jsmith@staff.uiuc.edu
- 8. Server: 250 2.1.5 jsmith@staff.uiuc.edu... Recipient ok
- 9. Client: data
- 10. Server: 354 Enter mail, end with "." on a line by itself
- 11. Client:
    Received: (from johndoe@localhost)
        by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
        Mon, 5 Jul 1999 23:46:18 -0500
    Date: Mon, 5 Jul 1999 23:46:18 -0500
    From: John Doe <johndoe@students.uiuc.edu>
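The client side of an SMTP dialogue like the one above can be sketched as a plain sequence of command strings (no network I/O here; the hostnames and addresses are the slide's examples, and the helper function is invented for illustration):

```python
# Sketch: the commands a minimal SMTP client would send, in order,
# mirroring the helo / mail from / rcpt to / data dialogue above.
def smtp_client_commands(helo_host, sender, recipient, body_lines):
    """Return the client-side SMTP command sequence as a list of strings."""
    cmds = [
        f"helo {helo_host}",
        f"mail from: {sender}",
        f"rcpt to: {recipient}",
        "data",
    ]
    cmds.extend(body_lines)
    cmds.append(".")          # a lone dot on its own line terminates DATA
    return cmds

cmds = smtp_client_commands(
    "students.uiuc.edu",
    "johndoe@students.uiuc.edu",
    "jsmith@staff.uiuc.edu",
    ["Subject: hello", "", "message body"],
)
print(cmds[0])
```

In practice Python's standard smtplib module carries out this same dialogue for you; the point of the sketch is only to make the command order explicit.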
18. Delivering Spam Messages
- Early on, spammers discovered that if they sent large quantities of spam directly from their ISP accounts, recipients would complain and ISPs would shut their accounts down. Thus, one of the basic techniques of sending spam has become to send it from someone else's computer and network connection. By doing this, spammers protect themselves in several ways: they hide their tracks, get others' systems to do most of the work of delivering messages, and direct the efforts of investigators towards those other systems rather than the spammers themselves.
19. Mail Filters
- A mail filter is a piece of software which takes an email message as input. For its output, it might pass the message through unchanged for delivery to the user's mailbox, it might redirect the message for delivery elsewhere, or it might even throw the message away. Some mail filters are also able to edit messages during processing.
20. Introduction
- An application of text categorization.
- Spam classification is defined as a binary problem: an email either is spam or is not spam.
- Automatic text categorization assigns emails to one of these two categories, using different methods.
- One of these methods is centroid-based classification.
[Diagram: incoming email routed into the two classes SPAM / NOT SPAM]
21. Background
- Text classification: classify documents into categories
  - spam
  - non-spam
- Classification process:
  - Preprocess the message
    - Remove tags
    - Stop-word removal
    - Word stemming
  - Training: build the classification model
  - Testing: evaluate the model
22. Methodologies
- Naïve Bayes
- Centroid-based
- Content-based
23. Bayesianism
- Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of a statement, that is, to the degree of belief a rational agent holds in the truth of a statement. When a statement's plausibility is updated using Bayes' theorem, the result is a Bayesian inference.
24. Bayes' Rule
- If A and B are two separate but possibly dependent random events, then:
- The probability of A and B occurring together: Pr(A, B)
- The conditional probability of A, given that B occurs: Pr(A|B)
- The conditional probability of B, given that A occurs: Pr(B|A)
25.
- From elementary rules of probability:

  Pr(A, B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A)

- Dividing the right-hand pair of expressions by Pr(B) gives Bayes' rule:

  Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
26.
- In problems of probabilistic inference, we are often trying to estimate the most probable underlying model for a random process, based on some observed data or evidence. If A represents a given set of model parameters, and B represents the set of observed data values, then the terms in the equation are given the following terminology:
- Pr(A) is the prior probability of the model A (in the absence of any evidence)
- Pr(B) is the probability of the evidence B
- Pr(B|A) is the likelihood that the evidence B was produced, given that the model was A
- Pr(A|B) is the posterior probability of the model being A, given that the evidence is B
27.
- Mathematically, Bayes' rule states:

  posterior = (likelihood × prior) / marginal likelihood
28. Representing E-mail for Statistical Algorithms
- All statistical algorithms for spam filtering begin with a vector representation of individual e-mail messages.
- The length of the term vector is the number of distinct words in all the e-mail messages in the training data. The entry for a particular word in the term vector for a particular e-mail message is usually the number of occurrences of the word in that e-mail message.
29. Training Data Comprising Four Labeled E-mail Messages
- The table below presents toy training data comprising four e-mail messages. These data contain ten distinct words: the, quick, brown, fox, rabbit, ran, and, run, at, rest.

  Message  Text                           Spam
  1        The quick brown fox            no
  2        The quick rabbit ran and ran   yes
  3        rabbit run run run             no
  4        rabbit at rest                 yes
30. Term Vectors Corresponding to the Training Data

     and  at  brown  fox  quick  rabbit  ran  rest  run  the
  1    0   0      1    1      1       0    0     0    0    1
  2    1   0      0    0      1       1    2     0    0    1
  3    0   0      0    0      0       1    0     0    3    0
  4    0   1      0    0      0       1    0     1    0    0
31.
- If the training data comprise thousands of e-mail messages, the number of distinct words often exceeds 10,000. Two simple strategies to reduce the size of the term vector somewhat are to remove stop words (words like and, of, the, etc.) and to reduce words to their root form, a process known as stemming (so, for example, ran and run reduce to run). The next table shows the reduced term vectors along with the spam label.
32. Term Vectors After Stemming and Stop-Word Removal (spam label coded as 0 = no, 1 = yes)

     X1     X2   X3     X4      X5    X6   Y
     brown  fox  quick  rabbit  rest  run  Spam
  1      1    1      1       0     0    0     0
  2      0    0      1       1     0    2     1
  3      0    0      0       1     0    3     0
  4      0    0      0       1     1    0     1
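The reduced term vectors above can be built mechanically from the four toy messages; a minimal sketch, where "stemming" is just the single ran-to-run merge the slides use:

```python
# Build stemmed, stop-word-free term-frequency vectors for the toy corpus.
STOP_WORDS = {"the", "and", "at"}
STEM = {"ran": "run"}             # toy stemmer: only the merge used in the slides

messages = [
    ("the quick brown fox", 0),
    ("the quick rabbit ran and ran", 1),
    ("rabbit run run run", 0),
    ("rabbit at rest", 1),
]

def tokens(text):
    """Lowercased words with stop words dropped and stems applied."""
    return [STEM.get(w, w) for w in text.split() if w not in STOP_WORDS]

vocab = sorted({w for text, _ in messages for w in tokens(text)})
vectors = [[tokens(text).count(w) for w in vocab] for text, _ in messages]
print(vocab)        # ['brown', 'fox', 'quick', 'rabbit', 'rest', 'run']
print(vectors[1])   # message 2: quick 1, rabbit 1, run 2
```

The resulting vocabulary and counts match the six-column table above.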
33. Naïve Bayes for Spam
- Let X = (X1, ..., Xd) denote the term vector for a random e-mail message, where d is the number of distinct words in the training data after stemming and stop-word removal. Let Y denote the corresponding spam label. The Naïve Bayes model seeks to build a model for

  Pr(Y = 1 | X1 = x1, ..., Xd = xd).

- From Bayes' theorem, we have

  Pr(Y = 1 | X1 = x1, ..., Xd = xd) = Pr(Y = 1) Pr(X1 = x1, ..., Xd = xd | Y = 1) / Pr(X1 = x1, ..., Xd = xd)
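A minimal multinomial Naïve Bayes sketch on the toy term vectors from the earlier table; the Laplace smoothing (adding 1 to every count) is an added assumption, there so that unseen words do not zero out the product:

```python
# Multinomial Naive Bayes on the toy data: Pr(Y=c) * prod_j Pr(word_j|c)^x_j,
# computed in log space, with Laplace-smoothed word probabilities.
import math

vocab = ["brown", "fox", "quick", "rabbit", "rest", "run"]
X = [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 0, 2], [0, 0, 0, 1, 0, 3], [0, 0, 0, 1, 1, 0]]
y = [0, 1, 0, 1]

def train(X, y):
    params = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        counts = [sum(col) + 1 for col in zip(*rows)]     # Laplace smoothing
        total = sum(counts)
        params[c] = (len(rows) / len(y),                  # class prior Pr(Y=c)
                     [cnt / total for cnt in counts])     # Pr(word_j | Y=c)
    return params

def predict(params, x):
    scores = {}
    for c, (prior, probs) in params.items():
        scores[c] = math.log(prior) + sum(n * math.log(p) for n, p in zip(x, probs))
    return max(scores, key=scores.get)

params = train(X, y)
print(predict(params, [0, 0, 1, 1, 0, 2]))   # same counts as spam message 2
```

The "naïve" step is the product over words: the joint likelihood Pr(X1, ..., Xd | Y) is approximated as a product of per-word probabilities, which is what makes the model tractable.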
34. Centroid-Based Method
- The documents are represented using a vector-space model.
- Each document is represented as a term-frequency (TF) vector.
[Figure: documents d1 to d4 plotted as vectors in term space, axes t1 and t2]
35. Centroid-Based Method
- A refinement of this model is the inverse document frequency (IDF).
- Its purpose is to limit the discriminating power of frequent terms and stop words, and to emphasize words that appear in specific documents.
- IDF_i = log(N / df_i), where N is the number of documents and df_i is the number of documents containing term i.
- The document vectors are normalized, so document size does not matter.
36. Centroid-Based Method
- The distance between two vectors is defined using the cosine function:

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

- Finally, one centroid vector C is defined for each category (spam / not spam), as the average of the category's document vectors:

  C = (1/|S|) Σ_{d ∈ S} d

37. Centroid-Based Method
- We can measure the similarity between a document d and the centroid C of a category with the same cosine function: sim(d, C) = cos(d, C).
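The two ingredients above, the per-category centroid and the cosine similarity, can be sketched in a few lines (the TF vectors here are invented for illustration):

```python
# Centroid of a set of document vectors, and cosine similarity to it.
import math

def centroid(docs):
    """Component-wise mean of a category's document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

spam_docs = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]   # made-up TF vectors
c_spam = centroid(spam_docs)
print(c_spam)                                     # [0.0, 2.0, 2.0]
print(round(cosine([0.0, 2.0, 2.0], c_spam), 3))  # 1.0: same direction
```

Because cosine depends only on direction, a document pointing the same way as a centroid scores 1 regardless of its length, which is exactly why the slides normalize document size away.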
38. Steps: Centroid-Based Method
- TRAINING
- Determine the document vectors using TF/IDF.
[Figure: documents d1 to d8 plotted in term space, axes t1 and t2]
39. Steps: Centroid-Based Method
- TRAINING
- Calculate the centroids for the categories SPAM and NOT SPAM.
[Figure: documents d1 to d8 with the two centroids C_SPAM and C_NOT-SPAM in term space]
40. Steps: Centroid-Based Method
- CLASSIFICATION
- Given a new document dn, calculate its document vector representation (as in the training stage).
[Figure: the new document dn plotted in term space]
41. Steps: Centroid-Based Method
- CLASSIFICATION
- Measure the distance between the vector dn and the centroids of the categories SPAM / NOT SPAM.
[Figure: dn compared against C_SPAM and C_NOT-SPAM in term space]
43. Steps: Centroid-Based Method
- FINAL RESULT
- Assign the document to the category whose centroid is most similar:

  class(dn) = argmax_i sim(dn, C_i),  for i = 1, 2 where 1 = SPAM and 2 = NOT SPAM
44. Analysis of Results
- The standard methodology for measuring the performance of text classification methods is precision and recall:

  P = (number of correctly predicted positives) / (number of predicted positive examples)

  R = (number of correctly predicted positives) / (number of all positive examples)
45. Analysis of Results
- Neither precision nor recall gives a good measure by itself. To get an idea of overall performance, we have to combine them:

  F = 2PR / (P + R)

[Figure: precision-recall trade-off, with "better" towards high P and high R]
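The three measures above, computed from made-up counts (tp = correctly predicted positives, fp = predicted positives that were wrong, fn = missed positives):

```python
# Precision, recall, and the F measure from a confusion-matrix corner.
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)        # correct positives / predicted positives
    r = tp / (tp + fn)        # correct positives / all actual positives
    f = 2 * p * r / (p + r)   # harmonic mean of P and R
    return p, r, f

p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(p, r, round(f, 3))      # 0.8 0.666... 0.727
```

Because F is a harmonic mean, it is dragged towards the smaller of the two, so a classifier cannot hide a poor recall behind a high precision (or vice versa).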
46. Some Results
- Compared against kNN and Naïve Bayes, the centroid method performs better.
47. Content-Based Approach
- Spam can be detected:
- Before reading the message (non-content based):
  - based on a special protocol, e.g. a VoIP protocol [3]
  - based on the address book: build an email network [1]
  - based on the IP address [4]
  - ...
- After processing the content of the email (content based).
48. Content-Based Approach
- Non-content-based approach:
  - removes spam messages containing viruses or worms before they are read
  - leaves some messages unlabeled
- Content-based method:
  - widely used
  - may need many pre-labeled messages
  - labels a message based on its content
- Zdziarski [5] said that it is possible to stop spam, and that content-based filters are the way to do it.
- We focus on content-based methods.
49. Content-Based Methods
- Bayesian-based method [6]
- Centroid-based method [7]
- Machine learning methods [8]
  - Latent Semantic Indexing (LSI)
  - Contextual Network Graphs (CNG)
- Rule-based method [9]
  - RIPPER rules: a list of predefined rules that can be changed by hand
- Memory-based method [10]
  - cost saving
50. Measurement
- Accuracy: the percentage of correctly classified messages, correct / (correct + incorrect).
- False positive: a legitimate (non-spam) message misclassified as spam.
- Goals:
  - Improve accuracy
  - Prevent false positives

            Correct   Incorrect
  Spam
  No spam             False positive
51. Measurement
52. Rule-Based Method
- A list of predefined rules that can be changed by hand (e.g. RIPPER rules).
- Each rule/test is associated with a score.
- If an email matches a rule, its score is increased.
- After applying all rules, if the score is above a certain threshold, the email is classified as spam.
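The score-and-threshold loop described above can be sketched directly; the specific rules, point values, and threshold below are invented for illustration:

```python
# Rule-based scoring: each matching rule adds its points; the total is
# compared against a threshold (rules and scores here are made up).
RULES = [
    (lambda m: "viagra" in m.lower(),      3.0),
    (lambda m: m.isupper(),                2.0),   # shouting: all-caps message
    (lambda m: "unsubscribe" in m.lower(), 1.5),
]
THRESHOLD = 4.0

def is_spam(message):
    score = sum(points for rule, points in RULES if rule(message))
    return score >= THRESHOLD

print(is_spam("BUY VIAGRA NOW"))   # True: 3.0 + 2.0 = 5.0 >= 4.0
print(is_spam("Lunch tomorrow?"))  # False: no rule matches
```

Real systems of this kind (SpamAssassin is the best-known example) ship hundreds of such tests, each with a hand-tuned score, which is exactly the maintenance burden the next slide lists as the method's disadvantage.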
53. Rule-Based Method
- Advantages:
  - able to employ diverse and specific rules to check for spam, e.g.:
    - check the size of the email
    - check the number of pictures it contains
  - no training messages are needed
- Disadvantage:
  - rules have to be entered and maintained by hand; this cannot be automated
54. Latent Semantic Indexing
- Keywords:
  - important words for text classification
  - high-frequency words in a message
  - can be used as indicators for the message
- Why LSI?
  - Polysemy: a word can be used in more than one category (e.g. "play")
  - Synonymy: two words can have identical meanings
- Based on a nearest-neighbors algorithm.
55. Latent Semantic Indexing
- Considers semantic links between words.
- Searches for keywords over the semantic space.
- Two words that have the same meaning are treated as one word:
  - eliminates synonymy
- Considers the overlap between different messages; this overlap may indicate:
  - polysemy or a stop word
  - two messages in the same category
56. Latent Semantic Indexing
- Step 1: build a term-document matrix X from the input documents.

  Doc1: computer science department
  Doc2: computer science and engineering science
  Doc3: engineering school

        computer  science  department  and  engineering  school
  Doc1         1        1           1    0            0       0
  Doc2         1        2           0    1            1       0
  Doc3         0        0           0    0            1       1
57. Latent Semantic Indexing
- Step 2: Singular Value Decomposition (SVD) is performed on matrix X
  - to extract a set of linearly independent factors that describe the matrix
  - to generalize over terms that have the same meaning
- Three new matrices T, S, D are produced, which can be truncated to reduce the vocabulary's size.
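Step 2 on the toy matrix from the previous slide can be sketched with NumPy; rows are documents and columns are the six terms, matching the table above (the choice of k = 2 retained factors is an illustrative assumption):

```python
# SVD of the toy term-document matrix: X = T S D^T, then keep the k largest
# factors to get the reduced "semantic" representation LSI works in.
import numpy as np

#             computer science dept and engineering school
X = np.array([[1, 1, 1, 0, 0, 0],     # Doc1
              [1, 2, 0, 1, 1, 0],     # Doc2
              [0, 0, 0, 0, 1, 1]],    # Doc3
             dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # rank-k approximation of X

print(s.shape)    # three singular values for a rank-3 matrix
print(X_k.shape)  # same shape as X, but only k independent factors remain
```

Documents are then compared in the k-dimensional factor space rather than over the raw vocabulary, which is where synonymous terms get merged.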
58. Latent Semantic Indexing
- Two documents can be compared by finding the distance between their document vectors in the reduced space.
- Text classification is done by finding the nearest neighbors of the test document and assigning it to the category with the most documents among those neighbors.
59. The nearest-neighbors method classifies the test message as UN-SPAM.
[Figure: the test message plotted among spam and un-spam training points]
60. Latent Semantic Indexing
- Advantages:
  - the entire training set can be learned at the same time
  - no intermediate model needs to be built
  - good when the training set is predefined
- Disadvantages:
  - when a new document is added, matrix X changes, and T, S, D need to be recalculated
  - time consuming
  - a real classifier needs the ability to change the training set
61. Contextual Network Graphs
- A weighted, bipartite, undirected graph of term nodes and document nodes.
- At any time, for each node, the sum of its edge weights is 1.
[Figure: term nodes t1, t2, t3 connected to document nodes d1, d2 by weighted edges w11 ... w23; e.g. w11 + w21 = 1 at t1, and w11 + w12 + w13 = 1 at d1]
62. Contextual Network Graphs
- When a new document d is added, the weights at node d are energized, and the weights at the connected nodes may need to be rebalanced.
- The document is classified to the class with the maximum average energy (weight).
63. Comparison: Bayesian, LSI, CNG, centroid, rule-based
64. Results
65. Results and Conclusion
- LSI and CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and negatives by up to 71%.
- LSI and CNG show better performance even with a small document set.
66. Comparison: Content-Based and Non-Content-Based
- Non-content based:
  - Disadvantages:
    - depends on special factors such as the email address, IP address, or a special protocol
    - leaves some messages unclassified
  - Advantage: detects spam before the message is read, with high accuracy
67.
- Content based:
  - Disadvantages:
    - needs some training messages
    - not 100% correctly classified, since spammers also know the anti-spam techniques
  - Advantage:
    - leaves no message unclassified
68. Improvements for Spam Filtering
- Combine both methods.
- [1] proposes an email-network-based algorithm with 100% accuracy that leaves 47% of messages unclassified; combined with a content-based method, its performance can be improved.
- Build up multiple layers [11].
- [11] Chris Miller, A Layered Approach to Enterprise Antispam
69. Data Sets for Spam
- Non-content based:
  - Email network: one author's email corpus, formed by 5,486 messages
  - IP address: none
- Content based:
70. Data Sets for Spam
- LSI and CNG:
  - corpora of varying size (250 to 4,000 messages)
  - spam and non-spam emails in equal amounts
- Bayesian based:
  - corpus of 1,789 emails
  - 211 spam, 1,578 non-spam
- Centroid based:
  - 200 email messages in total
  - 90 spam, 110 non-spam
71. Most Recently Used Benchmarks
- Reuters:
  - about 7,700 training and 3,000 test documents, 30,000 terms, 135 categories, 21 MB
  - each category has about 57 instances
  - a collection of newswire stories
- 20NG:
  - about 18,800 total documents, 94,000 terms, 20 topics, 25 MB
  - each category has about 1,000 instances
- WebKB:
  - about 8,300 documents, 7 categories, 26 MB
  - each category has about 1,200 instances
  - data from 4 university websites
- The above three are well known in recent IR research; they are small in size and are used to test classification performance and CPU scalability.
72. Benchmarks
- OHSUMED:
  - 348,566 documents, 230,000 terms and 308,511 topics, 400 MB
  - each category has about 1 instance
  - abstracts from medical journals
- Dmoz:
  - 482 topics, 300 training documents for each topic, 271 MB
  - each category has less than 1 instance
  - taken from the Dmoz (http://dmoz.org/) topic tree
  - a large dataset, used to test the memory scalability of a model
73. Some Facts
- Spam is a growing problem, and research on this topic has become more relevant in recent years.
- Spam grows because it works.
- Many commercial products try to fight spam. Most of them rely on the techniques presented here, or combinations of them.
- Spam damages the economy more than hackers or viruses do.
74. Some Facts
- Damages attributed to spam are estimated at around $10.4 billion in 2003 and $58 to $112 billion in 2004, and are projected to cross $200 billion worldwide in 2005.
- 1.6 trillion unsolicited messages were sent in 2004.
75. Conclusions
- Spam is a problem that has a great impact on global business.
- We presented three methods for spam classification.
- The benchmarks on these three methods suggest that combinations of the methods perform better than any single method alone.
76. Conclusions
- Spam classifiers can be content based or non-content based.
- Content based: rules, Naïve Bayes, centroid.
- Non-content-based classifiers work without reading the content of the mail.
77. Conclusions
- Researchers have found ways to increase the accuracy of all the methods, using heuristics and combining them.
- Spammers also learn how to avoid spam filters.
- No single method is perfect in all situations.
78. Sources
- Slide 1, image: http://www.ecommerce-guide.com
- Slide 1, image: http://www.email-firewall.jp/products/das.html
79. References
- Anti-Spam Filtering: A Centroid-Based Classification Approach, Nuanwan Soonthornphisaj, Kanokwan Chaikulseriwat, Piyan Tang-On, 2002
- Centroid-Based Document Classification: Analysis and Experimental Results, Eui-Hong (Sam) Han and George Karypis, 2000
- Multi-dimensional Text Classification, Thanaruk Theeramunkong, 2002
- Improving Centroid-Based Text Classification Using Term-Distribution-Based Weighting System and Clustering, Thanaruk Theeramunkong and Verayuth Lertnattee
- Combining Homogeneous Classifiers for Centroid-Based Text Classification, Verayuth Lertnattee and Thanaruk Theeramunkong
80. References
- [1] P. Oscar Boykin and Vwani Roychowdhury, Personal Email Networks: An Effective Anti-Spam Tool, IEEE Computer, volume 38, 2004
- [2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher, SpamRank: Fully Automatic Link Spam Detection, citeseer.ist.psu.edu/benczur05spamrank.html
- [3] R. Dantu, P. Kolan, Detecting Spam in VoIP Networks, Proceedings of the USENIX SRUTI (Steps for Reducing Unwanted Traffic on the Internet) Workshop, July 2005 (accepted)
- [4] IP addresses in email clients: http://www.ceas.cc/papers-2004/162.pdf
- [5] A Plan for Spam: http://www.paulgraham.com/spam.html
81. References
- [6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian Approach to Filtering Junk E-Mail, Learning for Text Categorization: Papers from the AAAI Workshop, pages 55-62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05
- [7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, Anti-Spam Filtering: A Centroid-Based Classification Approach, IEEE Proceedings ICSP '02
- [8] Spam Filtering Using Contextual Network Graphs: www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
- [9] W.W. Cohen, Learning Rules that Classify E-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996
- [10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, P. Stamatopoulos, A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists, Information Retrieval, 2003