SPAM - PowerPoint PPT Presentation

Title:

SPAM

Slides: 82
Provided by: cse54
Transcript and Presenter's Notes

Title: SPAM


1
SPAM: Christian Loza, Srikanth Palla, Liqin Zhang
2
Overview
  • Introduction
  • Background
  • Measurement
  • Methods
  • Compare different methods
  • Conclusions

3
Introduction
  • If you use email, it's likely that you've
    recently been visited by a piece of spam: an
    unsolicited, unwanted message, sent to you
    without your permission. Sending spam violates the
    Acceptable Use Policy (AUP) of almost all ISPs
    and can lead to the termination of the sender's
    account. As the recipient directly bears the cost
    of delivery, storage, and processing, one could
    regard spam as the electronic equivalent of
    "postage-due" junk mail.

4
Introduction
  • Spammers frequently engage in deliberate fraud to
    send out their messages. Spammers often use false
    names, addresses, phone numbers, and other
    contact information to set up "disposable"
    accounts at various Internet service providers.
    They also often use falsified or stolen credit
    card numbers to pay for these accounts. This
    allows them to move quickly from one account to
    the next as the host ISPs discover and shut down
    each one.

5
Introduction
  • In recent years, spam has shown no sign of
    slowing its growth
  • This is mainly because it works
  • For the sender, it is a cheap way to increase
    the customer base.

7
  • Spammers frequently go to great lengths to
    conceal the origin of their messages. They do
    this by spoofing e-mail addresses. The spammer
    exploits the fact that plain SMTP does not verify
    the sender, so that a message appears to
    originate from another email address.
    Some ISPs and domains require the use of SMTP
    AUTH, allowing positive identification of the
    specific account from which an e-mail originates.

8
  • One cannot completely spoof an e-mail address
    chain, since the receiving mailserver records the
    actual connection from the last mailserver's IP
    address; however, spammers can forge the rest of
    the ostensible history of the mailservers the
    e-mail has traversed. Spammers
    frequently seek out and make use of vulnerable
    third-party systems such as open mail relays and
    open proxy servers.

9
Address Collection
  • Spammers may harvest e-mail addresses from a
    number of sources. A popular method uses e-mail
    addresses which their owners have published for
    other purposes. Usenet posts, especially those in
    archives such as Google Groups, frequently yield
    addresses. Simply searching the Web for pages
    with addresses (such as corporate staff
    directories) can yield thousands of addresses,
    most of them deliverable.

10
Address Collection
  • Spammers have also subscribed to discussion
    mailing lists for the purpose of gathering the
    addresses of posters. The DNS and WHOIS systems
    require the publication of technical contact
    information for all Internet domains; spammers
    have illegally crawled these resources for email
    addresses. Many spammers use programs called
    Web Spiders to find email addresses on web
    pages. Because spammers offload the bulk of their
    costs onto others, however, they can use even
    more computationally expensive means to generate
    addresses.

11
Address Collection
  • A dictionary attack consists of an exhaustive
    attempt to gain access to a resource by trying
    all possible credentials (usually, usernames and
    passwords). Spammers have applied this principle
    to guessing email addresses, for example by taking
    common names and generating likely email addresses
    for them at each of thousands of domain
    names. Spammers sometimes use various means to
    confirm addresses as deliverable. For instance,
    including a Web bug in a spam message written in
    HTML may cause the recipient's mail client to
    transmit the recipient's address, or any other
    unique key, to the spammer's Web site.

12
Terminology
  • To better understand the concepts in this
    presentation, let us consider the following
    terminology.
  • Mail User Agent (MUA). This refers to the program
    used by the client to send and receive e-mail.
    It is usually referred to as the "mail client."
    An example of this is Pine or Eudora.
  • Mail Transfer Agent (MTA). This refers to the
    program running on the server to store and
    forward e-mail messages. It is usually referred
    to as the "mail server program." An example of
    this is sendmail or the Microsoft Exchange server.

13
The Mail Queue
14
  • In a normal configuration, sendmail sits in the
    background waiting for new messages. When a new
    connection arrives, a child process is invoked to
    handle the connection, while the parent process
    goes back to listening for new connections.
  • When a message is received, the sendmail child
    process puts it into the mail queue (usually
    stored in /var/spool/mqueue). If it is
    immediately deliverable, it is delivered and
    removed from the queue. If it is not immediately
    deliverable, it will be left in the queue and the
    process will terminate.
  • Messages left in the queue will stay there until
    the next time the queue is processed. The parent
    sendmail will usually fork a child process to
    attempt to deliver anything left in the queue at
    regular intervals.
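
The queue behaviour described above can be sketched in a few lines of Python. This is only an illustrative single-process simulation, not sendmail's actual implementation; the function name `run_queue`, the `can_deliver` callback, and the example addresses are all hypothetical.

```python
from collections import deque


def run_queue(queue, can_deliver):
    """One queue run: deliver what is deliverable, keep the rest queued."""
    delivered = []
    still_queued = deque()
    while queue:
        message = queue.popleft()
        if can_deliver(message):
            # Deliverable now: hand it off and drop it from the queue.
            delivered.append(message)
        else:
            # Not deliverable yet: leave it for the next queue run.
            still_queued.append(message)
    return delivered, still_queued


# Hypothetical messages; "down.example" stands for an unreachable host.
queue = deque(["to:alice@up.example", "to:bob@down.example"])
delivered, queue = run_queue(queue, lambda m: "down.example" not in m)
```

Calling `run_queue` again later plays the role of the periodic queue run that the parent process schedules.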

15
Structure of E-mail Message
  • Email messages are composed of two parts
  • 1. Headers (lines of the form "field: value"
    which contain information about the
    message, such as "To", "From", "Date", and
    "Message-ID")
  • 2. Body (the text of the message)

16
Example
  • From johndoe@students.uiuc.edu Mon Jul 5
    23:46:19 1999
  • Received: (from johndoe@localhost)
  • by students.uiuc.edu (8.9.3/8.9.3) id
    LAA05394;
  • Mon, 5 Jul 1999 23:46:18 -0500
  • Received: from staff.uiuc.edu (staff.uiuc.edu
    [128.174.5.59])
  • by students.uiuc.edu (8.9.3/8.9.3) id
    XAA24214;
  • Mon, 5 Jul 1999 23:46:25 -0500
  • Date: Mon, 5 Jul 1999 23:46:18 -0500
  • From: John Doe <johndoe@students.uiuc.edu>
  • To: John Smith <jsmith@staff.uiuc.edu>
  • Message-Id: <199907052346.LAA05394@students.uiuc.edu>
  • Subject: This is a subject header.
  • This is the message body. It is separated from
    the headers by a blank
  • line.
  • The message body can span multiple lines.
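
As a minimal sketch of the header/body split, the message above can be parsed with Python's standard `email` package. The `RAW` string is a shortened copy of the example (the `Received` lines are omitted for brevity), and `split_message` is a hypothetical helper name.

```python
from email import message_from_string

# Shortened version of the example message: headers, blank line, body.
RAW = (
    "Date: Mon, 5 Jul 1999 23:46:18 -0500\n"
    "From: John Doe <johndoe@students.uiuc.edu>\n"
    "To: John Smith <jsmith@staff.uiuc.edu>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
    "The message body can span multiple lines.\n"
)


def split_message(raw):
    """Return (headers, body) for a raw RFC 822 style message."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # "field: value" pairs
    body = msg.get_payload()      # everything after the blank line
    return headers, body


headers, body = split_message(RAW)
```

The parser treats the first blank line as the boundary, which is exactly the rule stated in the example.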

17
  • Here is an example SMTP transaction
  • 1. Client connects to server's SMTP port (25).
  • 2. Server: 220 staff.uiuc.edu ESMTP Sendmail
    8.10.0/8.10.0 ready; Mon, 13 Mar 2000 14:54:08
    -0600
  • 3. Client: helo students.uiuc.edu
  • 4. Server: 250 staff.uiuc.edu Hello
    root@students.uiuc.edu [128.174.5.62], pleased to
    meet you
  • 5. Client: mail from: johndoe@students.uiuc.edu
  • 6. Server: 250 2.1.0 johndoe@students.uiuc.edu...
    Sender ok
  • 7. Client: rcpt to: jsmith@staff.uiuc.edu
  • 8. Server: 250 2.1.5 jsmith@staff.uiuc.edu...
    Recipient ok
  • 9. Client: data
  • 10. Server: 354 Enter mail, end with "." on a
    line by itself
  • 11. Client:
  • Received: (from johndoe@localhost)
  • by students.uiuc.edu (8.9.3/8.9.3) id
    LAA05394;
  • Mon, 5 Jul 1999 23:46:18 -0500
  • Date: Mon, 5 Jul 1999 23:46:18 -0500
  • From: John Doe <johndoe@students.uiuc.edu>
18
Delivering Spam messages
  • Early on, spammers discovered that if they sent
    large quantities of spam directly from their ISP
    accounts, recipients would complain and ISPs
    would shut their accounts down. Thus, one of the
    basic techniques of sending spam has become to
    send it from someone else's computer and network
    connection. By doing this, spammers protect
    themselves in several ways: they hide their
    tracks, get others' systems to do most of the
    work of delivering messages, and direct the
    efforts of investigators towards the other
    systems rather than the spammers themselves.

19
Mail filters
  • A mail filter is a piece of software which takes
    an email message as input. For its output, it
    might pass the message through unchanged for
    delivery to the user's mailbox, it might redirect
    the message for delivery elsewhere, or it might
    even throw the message away. Some mail filters
    are able to edit messages during processing.
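
A mail filter of this kind can be sketched as a function that maps a message to one of three outcomes. This is an illustrative toy, not any particular product's behaviour; `filter_message`, the blocked-sender list, and the redirect table are hypothetical.

```python
from email import message_from_string


def filter_message(raw, blocked_senders, redirects):
    """Return (action, mailbox): deliver, redirect, or discard a message."""
    msg = message_from_string(raw)
    sender = msg.get("From", "")
    subject = msg.get("Subject", "")
    # Throw the message away if the sender is on the block list.
    if any(addr in sender for addr in blocked_senders):
        return ("discard", None)
    # Redirect to another mailbox when a keyword appears in the subject.
    for keyword, mailbox in redirects.items():
        if keyword.lower() in subject.lower():
            return ("redirect", mailbox)
    # Otherwise pass the message through unchanged.
    return ("deliver", "inbox")


# Hypothetical messages and rules for illustration only.
spam = "From: offers@bulk.example\nSubject: Cheap pills\n\nBuy now.\n"
list_mail = "From: list@club.example\nSubject: [newsletter] June\n\nHello.\n"
plain = "From: jsmith@staff.uiuc.edu\nSubject: Meeting\n\nSee you at 3.\n"
rules = {"[newsletter]": "lists"}
blocked = ["bulk.example"]
```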

20
Introduction
  • Application of Text Categorization
  • Spam classification is defined as a binary
    problem: an email is spam or is not spam.
  • Automatic text categorization assigns emails to
    one of the above categories, using different
    methods
  • One of these methods is centroid-based
    classification

[Figure: an e-mail is classified as SPAM or NOT SPAM]
21
Background
  • Text classification: classify documents into
    categories
  • Spam
  • Un-spam
  • Classification process
  • Preprocess the message:
  • Tag removal
  • Stop-word removal
  • Word stemming
  • Training: build the classification model
  • Testing: evaluate the model
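
The preprocessing steps listed above can be sketched as follows. This is a toy illustration under stated assumptions: the stop-word set and the stem lookup table are hypothetical stand-ins (a real system would use a full stop list and a proper stemmer), and tags are removed with a simple regular expression.

```python
import re

# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"the", "and", "at", "of", "a", "to"}
STEMS = {"ran": "run", "runs": "run", "running": "run"}


def preprocess(text):
    """Tag removal, stop-word removal, then stemming via a lookup table."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove markup tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [STEMS.get(t, t) for t in tokens]


tokens = preprocess("<b>The quick rabbit</b> ran and ran")
```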

22
Methodologies
  • Naive Bayes
  • Centroid-Based
  • Content-based

23
Bayesianism
  • Bayesianism is the philosophical tenet that the
    mathematical theory of probability applies to the
    degree of plausibility of a statement, that is, to
    the degree of belief a rational agent holds in
    the truth of that statement.
    When Bayes' theorem is used to update this degree
    of belief in the light of evidence, the process
    is called Bayesian inference.

24
Bayes' Rule
  • If A and B are two separate but possibly
    dependent random events, then
  • Probability of A and B occurring together:
    $\Pr(A, B)$
  • The conditional probability of A, given that B
    occurs: $\Pr(A \mid B)$
  • The conditional probability of B, given that A
    occurs: $\Pr(B \mid A)$

25
  • From elementary rules of probability
  • $\Pr(A, B) = \Pr(A \mid B)\,\Pr(B) = \Pr(B \mid A)\,\Pr(A)$
  • Dividing the right-hand pair of expressions by
    $\Pr(B)$ gives Bayes' rule
  • $\Pr(A \mid B) = \dfrac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$


26
  • In problems of probabilistic inference, we are
    often trying to estimate the most probable
    underlying model for a random process, based on
    some observed data or evidence. If A represents
    a given set of model parameters, and B
    represents the set of observed data values, then
    the terms in Bayes' rule are given the following
    terminology
  • $\Pr(A)$ is the prior probability of the model A (in
    the absence of any evidence)
  • $\Pr(B)$ is the probability of the evidence B
  • $\Pr(B \mid A)$ is the likelihood that the evidence B was
    produced, given that the model was A
  • $\Pr(A \mid B)$ is the posterior probability of the model
    being A, given that the evidence is B.

27
  • Mathematically, Bayes' rule states
  • $\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$
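
A small worked example may help fix the terminology. The sketch below applies Bayes' rule to a two-hypothesis case (spam or not spam); the function name `posterior` and all the probabilities used are made-up illustration values, not figures from the presentation.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Pr(A|B) for a binary model A, using Bayes' rule.

    prior                 Pr(A)
    likelihood            Pr(B|A)
    likelihood_given_not  Pr(B|not A)
    """
    # Marginal likelihood Pr(B), summed over both hypotheses.
    marginal = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / marginal


# Assumed values: half of all mail is spam, and a given word appears in
# 80% of spam messages but only 20% of legitimate ones.
p_spam_given_word = posterior(prior=0.5, likelihood=0.8,
                              likelihood_given_not=0.2)
```

With these assumed inputs the posterior is 0.4 / (0.4 + 0.1) = 0.8, so seeing the word raises the probability of spam from 0.5 to 0.8.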

28
Representing E-mail for Statistical Algorithms
  • All statistical algorithms for spam filtering
    begin with a vector representation of individual
    e-mail messages.
  • The length of the term vector is the number of
    distinct words in all the e-mail messages in the
    training data. The entry for a particular word in
    the term vector for a particular e-mail message
    is usually the number of occurrences of the word in
    the e-mail message.

29
Training data comprising four labeled e-mail
messages
  • The table below presents toy training data comprising
    four e-mail messages. These data contain ten
    distinct words: the, quick, brown, fox, rabbit,
    ran, and, run, at, and rest.
  • #  Message                       Spam
  • 1  The quick brown fox           no
  • 2  The quick rabbit ran and ran  yes
  • 3  rabbit run run run            no
  • 4  rabbit at rest                yes

30
Term Vectors corresponding to training data
  • #  and  at  brown  fox  quick  rabbit  ran  rest  run  the
  • 1  0    0   1      1    1      0       0    0     0    1
  • 2  1    0   0      0    1      1       2    0     0    1
  • 3  0    0   0      0    0      1       0    0     3    0
  • 4  0    1   0      0    0      1       0    1     0    0
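
The term vectors can be derived mechanically from the four toy messages. The following sketch counts word occurrences against an alphabetically sorted vocabulary; `term_vectors` and `MESSAGES` are illustrative names, and tokenization is a plain whitespace split on lowercased text.

```python
# The four toy training messages with their spam labels.
MESSAGES = [
    ("The quick brown fox", "no"),
    ("The quick rabbit ran and ran", "yes"),
    ("rabbit run run run", "no"),
    ("rabbit at rest", "yes"),
]


def term_vectors(messages):
    """Return (vocabulary, vectors) with one count vector per message."""
    tokenized = [text.lower().split() for text, _ in messages]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]
    return vocab, vectors


vocab, vectors = term_vectors(MESSAGES)
```

Each vector has one entry per distinct word, so its length equals the vocabulary size (ten for this toy data).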

31
  • If the training data comprise thousands of e-mail
    messages, the number of distinct words often
    exceeds 10,000. Two simple strategies to reduce
    the size of the term vector somewhat are to
    remove stop words (words like and, of, the,
    etc.) and to reduce words to their root form, a
    process known as stemming (so, for example, ran
    and run reduce to run). Table 3 shows the
    reduced term vectors along with the spam label.

32
Term vectors after stemming and stop word
removal, spam label coded as 0 = no, 1 = yes
  •    X1     X2   X3     X4      X5    X6   Y
  • #  brown  fox  quick  rabbit  rest  run  Spam
  • 1  1      1    1      0       0     0    0
  • 2  0      0    1      1       0     2    1
  • 3  0      0    0      1       0     3    0
  • 4  0      0    0      1       1     0    1

33
Naive Bayes for Spam
  • Let X = (X1, ..., Xd) denote the term vector for
    a random e-mail message, where d is the number of
    distinct words in the training data, after
    stemming and stopword removal. Let Y denote the
    corresponding spam label. The Naive Bayes model
    seeks to build a model for
  • $\Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d)$.
  • From Bayes' theorem, we have
  • $\Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = \dfrac{\Pr(Y = 1)\,\Pr(X_1 = x_1, \ldots, X_d = x_d \mid Y = 1)}{\Pr(X_1 = x_1, \ldots, X_d = x_d)}$
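
To make the model concrete, here is a minimal sketch of a Naive Bayes classifier on the four toy messages after stemming and stop-word removal (columns: brown, fox, quick, rabbit, rest, run). Two things are assumptions not stated in the slides: the multinomial form of the class-conditional probabilities, and Laplace smoothing with `alpha = 1`. The names `train` and `prob_spam` are illustrative.

```python
import math

# Term counts per message: brown, fox, quick, rabbit, rest, run.
X = [
    [1, 1, 1, 0, 0, 0],  # the quick brown fox          -> not spam
    [0, 0, 1, 1, 0, 2],  # the quick rabbit ran and ran -> spam
    [0, 0, 0, 1, 0, 3],  # rabbit run run run           -> not spam
    [0, 0, 0, 1, 1, 0],  # rabbit at rest               -> spam
]
Y = [0, 1, 0, 1]


def train(X, Y, alpha=1.0):
    """Estimate log Pr(Y=c) and smoothed log Pr(word | Y=c) per class."""
    d = len(X[0])
    model = {}
    for c in sorted(set(Y)):
        rows = [x for x, y in zip(X, Y) if y == c]
        counts = [sum(row[j] for row in rows) for j in range(d)]
        total = sum(counts)
        model[c] = {
            "log_prior": math.log(len(rows) / len(Y)),
            "log_word": [math.log((counts[j] + alpha) / (total + alpha * d))
                         for j in range(d)],
        }
    return model


def prob_spam(model, x):
    """Pr(Y=1 | x): numerator over the sum of both class numerators."""
    scores = {c: m["log_prior"] + sum(xj * lw for xj, lw in zip(x, m["log_word"]))
              for c, m in model.items()}
    top = max(scores.values())                       # for numerical stability
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    return weights[1] / sum(weights.values())


model = train(X, Y)
p_rest = prob_spam(model, [0, 0, 0, 0, 1, 0])        # message: "rest"
p_brown_fox = prob_spam(model, [1, 1, 0, 0, 0, 0])   # message: "brown fox"
```

The "naive" part is the assumption that words are conditionally independent given the label, which is what lets the joint likelihood factor into per-word terms.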

34
Centroid-based method
  • The documents are represented using a
    Vector-space model.
  • Each document is represented as a Term Frequency
    vector (TF)

[Figure: documents d1 to d4 plotted as vectors in the term space with axes t1 and t2]
35
Centroid-based method
  • A refinement of this model is the inverse
    document frequency (IDF)
  • This is to limit the discrimination power of
    frequent terms and stop words, and to emphasize
    words that appear in specific documents.
  • IDF is $\log(N / df_i)$, where N is the number of
    documents and $df_i$ is the number of documents
    containing term i
  • The size of the document is normalized
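
The weighting described above can be sketched as follows: raw term frequency times log(N/df), followed by normalization to unit length. The function name `tfidf` and the two-document example are illustrative, and unit-length (L2) normalization is one common reading of "the size of the document is normalized", assumed here.

```python
import math


def tfidf(docs):
    """docs: list of token lists. Returns (vocab, unit-length TF-IDF vectors)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        weights = [doc.count(term) * math.log(n / df[term]) for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors


# "a" occurs in every document, so its IDF is log(2/2) = 0.
vocab, vectors = tfidf([["a", "b"], ["a", "c"]])
```

A term that appears in every document gets weight zero, which is how IDF suppresses frequent terms and stop words.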

36
Centroid-based method
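  • The distance between two vectors is defined using
    the cosine function:
    $\cos(d_i, d_j) = \dfrac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}$
  • Finally, one Centroid Vector C is defined for
    each category (spam/not spam) as the average of
    the document vectors in that category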

37
Centroid-based method
  • We can measure the similarity between one
    document and the Centroid of the category with
    the following function

38
Steps Centroid-based Method
  • TRAINING
  • Determine the document vectors using TF-IDF.

[Figure: document vectors d1 to d8 in the term space with axes t1 and t2]
39
Steps Centroid-based Method
  • TRAINING
  • Calculate the centroid for the categories SPAM
    and NOT SPAM

[Figure: document vectors d1 to d8 with the centroids C_SPAM and C_NOT SPAM, axes t1 and t2]
40
Steps Centroid-based Method
  • CLASSIFICATION
  • Given a new document dn, calculate the document
    vector representation (like in the training stage)

[Figure: new document vector dn in the term space with axes t1 and t2]
41
Steps Centroid-based Method
  • CLASSIFICATION
  • Measure the distance between the vector dn and
    the Centroids of the Categories SPAM / NOT SPAM

[Figure: dn with the centroids C_SPAM and C_NOT SPAM, axes t1 and t2]
42
Steps Centroid-based Method
  • CLASSIFICATION (cont.)
  • Measure the distance between the vector dn and
    the Centroids of the Categories SPAM / NOT SPAM

[Figure: dn with the centroids C_SPAM and C_NOT SPAM, axes t1 and t2]
43
Steps Centroid-based Method
  • FINAL RESULT
  • Obtain the maximum similarity between the
    document and the Centroids of SPAM and NOT SPAM
  • for i = 1, 2, where 1 = SPAM and 2 = NOT SPAM
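
The training and classification steps can be put together in a short sketch. It assumes that each centroid is the plain average of its category's document vectors and that similarity is the cosine of the angle between vectors; the two-dimensional training vectors are made-up illustration values, and `cosine`, `centroid`, and `classify` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise average of a list of document vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(doc, centroids):
    """Return the category whose centroid is most similar to doc."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# TRAINING: assumed document vectors over two terms (t1, t2).
training = {
    "SPAM": [[0.0, 1.0], [0.2, 0.9]],
    "NOT SPAM": [[1.0, 0.1], [0.9, 0.0]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

# CLASSIFICATION: compare a new document vector dn with both centroids.
label = classify([0.1, 0.8], centroids)
```

Classification only needs one similarity computation per category, which is what makes the centroid approach cheap at prediction time.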

44
Analysis of Results
  • The standard measures of performance for text
    classification methods are precision (P) and
    recall (R)

$P = \dfrac{\text{number of correctly predicted positives}}{\text{number of predicted positive examples}}$
$R = \dfrac{\text{number of correctly predicted positives}}{\text{number of all positive examples}}$
45
Analysis of Results
  • Neither precision nor recall gives a good measure
    by itself. To get an idea of the
    performance, we have to combine them.

$F = \dfrac{2PR}{P + R}$
[Figure: plot with axes P and R, with an arrow labeled "better"]
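
The three measures can be computed directly from true and predicted labels. The sketch below is illustrative; `precision_recall_f` is a hypothetical name, spam is coded as the positive class 1, and the label lists are made-up values.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Return (P, R, F) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    correct_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_pos = sum(1 for _, p in pairs if p == positive)
    actual_pos = sum(1 for t, _ in pairs if t == positive)
    precision = correct_pos / predicted_pos if predicted_pos else 0.0
    recall = correct_pos / actual_pos if actual_pos else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Assumed labels: 3 spam and 2 legitimate messages.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p, r, f = precision_recall_f(y_true, y_pred)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so P, R, and F all equal 2/3.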
46
Some results
  • Compared against kNN and Naïve Bayes, the Centroid
    method performs better

47
Content Based Approach
  • Spam can be detected
  • Before reading the message: non-content
    based
  • Based on a special protocol [3]: VoIP protocol
  • Based on the address book [1]: build an email network
  • Based on IP address [4]
  • ...
  • After processing the content of the email:
    content based

48
Content-based Approach
  • Non-content based approach
  • removes a spam message before it is read if it
    contains viruses or worms
  • leaves some messages unlabeled
  • Content based method
  • widely used method
  • may need many pre-labeled messages
  • labels a message based on its content
  • Zdziarski [5] said that it's possible to stop
    spam, and that content-based filters are the way
    to do it
  • Focus on content based method

49
Content-based methods
  • Bayesian based method [6]
  • Centroid-based method [7]
  • Machine learning method [8]
  • Latent Semantic Indexing (LSI)
  • Contextual Network Graphs (CNG)
  • Rule based method [9]
  • RIPPER rules: a list of predefined rules that can
    be changed by hand
  • Memory based method [10]
  • saving cost

50
Measurement
  • Accuracy: the percentage of messages classified
    correctly, correct/(correct + un-correct)
  • False positive: a message that is not spam, but
    is misclassified as spam.
  • Goal
  • Improve accuracy
  • Prevent false positives

[Table: rows Spam and No spam, columns Correct and Un-correct, with the false-positive cell marked]
51
measurement
52
Rule-based method
  • A list of predefined rules that can be changed by
    hand
  • RIPPER rules
  • Each rule/test is associated with a score
  • If an email fails a rule, its score is increased
  • After applying all rules, if the score is above a
    certain threshold, it is classified as spam
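
A scoring filter of this kind can be sketched in a few lines. The rules, their scores, the message fields, and the threshold below are all made-up illustration values (they are not RIPPER output or any product's rule set); a rule "fires" when its test finds the message suspicious.

```python
# Each rule: (description, test on the message, score added when it fires).
RULES = [
    ("subject in all caps", lambda m: m["subject"].isupper(), 2.0),
    ("many pictures", lambda m: m["images"] > 3, 1.5),
    ("large message", lambda m: m["size_kb"] > 100, 1.0),
]


def spam_score(message):
    """Sum the scores of all rules that fire for this message."""
    return sum(score for _, test, score in RULES if test(message))


def is_spam(message, threshold=3.0):
    """Classify as spam when the total score is above the threshold."""
    return spam_score(message) > threshold


# Hypothetical messages described by a few hand-picked features.
advert = {"subject": "BUY NOW", "images": 5, "size_kb": 10}
note = {"subject": "Lunch tomorrow?", "images": 0, "size_kb": 5}
```

Because rules and scores live in a plain list, they can be edited by hand without any training data, which is both the strength and the maintenance burden of the approach.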

53
Rule-based method
  • Advantage
  • able to employ diverse and specific rules to
    check for spam
  • Check the size of the email
  • Check the number of pictures it contains
  • no training messages are needed
  • Disadvantage
  • rules have to be entered and maintained by hand;
    this cannot be done automatically

54
Latent Semantic indexing
  • Keyword
  • an important word for text classification
  • a highly frequent word in a message
  • can be used as an indicator for the message
  • Why LSI?
  • Polysemy: a word can be used in more than one
    category
  • e.g. "play"
  • Synonymy: two words have identical meaning
  • Based on a nearest-neighbors algorithm

55
Latent semantic indexing
  • Considers semantic links between words
  • Searches for keywords over the semantic space
  • Two words that have the same meaning are treated
    as one word
  • eliminates synonyms
  • Considers the overlap between different messages;
    this overlap may indicate
  • polysemy or a stop-word
  • two messages in the same category

56
Latent semantic indexing
  • Step 1: build a term-document matrix X from the
    input documents

Doc1: computer science department
Doc2: computer science and engineering science
Doc3: engineering school

      computer  science  department  and  engineering  school
Doc1  1         1        1           0    0            0
Doc2  1         2        0           1    1            0
Doc3  0         0        0           0    1            1
57
Latent semantic indexing
  • Step 2: Singular Value Decomposition (SVD) is
    performed on matrix X
  • To extract a set of linearly independent FACTORS
    that describe the matrix
  • To generalize terms that have the same meaning
  • Three new matrices T, S, and D are produced to
    reduce the vocabulary's size

58
Latent Semantic indexing
  • Two documents can be compared by finding the
    distance between their two document vectors,
    stored in matrix X1
  • Text classification is done by finding the
    nearest neighbors and assigning the message to
    the category with the most documents among them
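
The nearest-neighbors step can be sketched independently of the SVD. The example below assumes the documents have already been reduced to short vectors; the training vectors, the choice of Euclidean distance, and `k = 3` are illustrative assumptions, and `knn_label` is a hypothetical name.

```python
import math
from collections import Counter


def distance(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_label(test_vec, training, k=3):
    """Majority label among the k training vectors nearest to test_vec."""
    nearest = sorted(training, key=lambda item: distance(test_vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Assumed reduced document vectors with their labels.
TRAINING = [
    ([0.9, 0.1], "spam"),
    ([0.8, 0.2], "spam"),
    ([0.1, 0.9], "un-spam"),
    ([0.2, 0.8], "un-spam"),
    ([0.3, 0.7], "un-spam"),
]

label = knn_label([0.25, 0.75], TRAINING)
```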

59
Nearest-neighbors method: the test message is
classified as UN-SPAM
[Figure legend: Spam, Un-spam, Test]
60
Latent Semantic Indexing
  • Advantage
  • The entire training set can be learned at the same time
  • No intermediate model needs to be built
  • Good when the training set is predefined
  • Disadvantage
  • When a new document is added, matrix X changes, and
    T, S, and D need to be re-calculated
  • Time consuming
  • A real classifier needs the ability to change its
    training set

61
Contextual network Graphs
  • A weighted, bipartite, undirected graph of term
    and document nodes

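[Figure: bipartite graph with term nodes t1, t2, t3 and document nodes d1, d2, joined by edges with weights w11, w12, w13, w21, w22, w23]
At t1: w11 + w21 = 1. At d1: w11 + w12 + w13 = 1.
At any time, for each node, the sum of the weights
is 1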
62
Contextual network graphs
  • When a new document d is added, the weights at
    node d are energized, and the weights at the
    connected nodes may need to be re-balanced
  • The document is assigned to the class with the
    maximum average energy (weight)

63
Comparison: Bayesian, LSI, CNG, centroid, rule-based
64
Result
65
Result and conclusion
  • LSI and CNG surpass the Bayesian approach by 5% in
    accuracy, and reduce false positives and
    negatives by up to 71%
  • LSI and CNG show better performance even with a
    small document set

66
Comparison: content based and non-content based
  • Non-content based
  • Disadvantage
  • depends on special factors like email address, IP
    address, or special protocol
  • leaves some messages unclassified
  • Advantage: detects spam before the message is
    read, with high accuracy

67
  • Content based
  • Disadvantage
  • needs some training messages
  • not 100% correctly classified, because spammers
    also know the anti-spam techniques
  • Advantage
  • leaves no message unclassified

68
Improvement for spam
  • Combine both methods
  • [1] proposes an email-network-based algorithm
    with 100% accuracy that leaves 47% of messages
    unclassified; combining it with a content based
    method can improve the performance.
  • Build up multiple layers [11]
  • [11] Chris Miller, A Layered Approach to
    Enterprise Antispam

69
Data set for spam
  • Non-content based
  • Email network
  • One author's email corpus, formed by 5,486
    messages
  • IP address: none
  • Content based

70
Data set for spam
  • LSI and CNG
  • Corpus of varying size (250 to 4000)
  • Spam and un-spam emails in equal amounts
  • Bayesian based
  • Corpus of 1789 emails
  • 211 spam, 1578 non-spam
  • Centroid based
  • 200 email messages in total
  • 90 spam, 110 non-spam

71
Most recently used Benchmarks
  • Reuters
  • About 7700 training and 3000 test documents,
    30000 terms, 135 categories, 21 MB.
  • Each category has about 57 instances
  • Collection of newswire stories
  • 20NG
  • About 18800 total documents, 94000 terms, 20
    topics, 25 MB.
  • Each category has about 1000 instances
  • WebKB
  • About 8300 documents, 7 categories, 26 MB.
  • Each category has about 1200 instances
  • Website data from 4 universities
  • The three above are well known in recent IR
    research; they are small and are used to test
    performance and CPU scalability

72
Benchmarks
  • OHSUMED
  • 348566 documents, 230000 terms and 308511 topics,
    400 MB.
  • Each category has about 1 instance
  • Abstracts from medical journals
  • Dmoz
  • 482 topics, 300 training documents for each topic,
    271 MB
  • Each category has less than 1 instance
  • Taken from the Dmoz (http://dmoz.org/) topic tree
  • Large datasets, used to test the memory
    scalability of a model

73
Some facts
  • Spam is a growing problem, and research on
    this topic has become more relevant in recent
    years
  • Spam grows because it works.
  • Many commercial products try to fight spam. Most
    of them rely on the techniques presented here, or
    on a combination of them
  • Spam damages the economy more than hackers or viruses

74
Some facts
  • Damages attributed to spam are estimated at around
    10.4 billion in 2003 and 58 to 112 billion in 2004,
    and are projected to exceed 200 billion worldwide
    in 2005.
  • 1.6 trillion unsolicited messages were sent in
    2004.

75
Conclusions
  • Spam is a problem that has a great impact on
    global business
  • We presented three methods for spam
    classification.
  • The benchmarks on these three methods suggest that
    a combination of the methods performs better than
    the methods alone

76
Conclusions
  • Spam classifiers can be content based or
    non-content based
  • Content based: rules, Naïve Bayes, centroid
  • Non-content based classifiers work without
    reading the content of the mail

77
Conclusions
  • Researchers have found ways to increase the
    accuracy of all the methods, using heuristics and
    combining them
  • Spammers also learn how to avoid spam filters
  • No single method is perfect in all situations

78
Sources
  • Slide 1, image: http://www.ecommerce-guide.com
  • Slide 1, image: http://www.email-firewall.jp/products/das.html

79
References
  • Anti-Spam Filtering: A Centroid-Based
    Classification Approach, Nuanwan Soonthornphisaj,
    Kanokwan Chaikulseriwat, Piyan Tang-On, 2002
  • Centroid-Based Document Classification: Analysis
    and Experimental Results, Eui-Hong (Sam) Han and
    George Karypis, 2000
  • Multi-dimensional Text Classification, Thanaruk
    Theeramunkong, 2002
  • Improving Centroid-Based Text Classification
    Using Term-Distribution-Based Weighting System
    and Clustering, Thanaruk Theeramunkong and
    Verayuth Lertnattee
  • Combining Homogeneous Classifiers for
    Centroid-Based Text Classification, Verayuth
    Lertnattee and Thanaruk Theeramunkong

80
References
  • [1] P. Oscar Boykin and Vwani Roychowdhury,
    Personal Email Networks: An Effective Anti-Spam
    Tool, IEEE Computer, volume 38, 2004
  • [2] Andras A. Benczur, Karoly Csalogany,
    Tamas Sarlos and Mate Uher, SpamRank: Fully
    Automatic Link Spam Detection,
    citeseer.ist.psu.edu/benczur05spamrank.html
  • [3] R. Dantu, P. Kolan, Detecting Spam in VoIP
    Networks, Proceedings of USENIX, SRUTI (Steps
    for Reducing Unwanted Traffic on the Internet)
    workshop, July 05 (accepted)
  • [4] IP addresses in email clients,
    http://www.ceas.cc/papers-2004/162.pdf
  • [5] Plan for Spam, http://www.paulgraham.com/spam.html

81
References
  • [6] M. Sahami, S. Dumais, D. Heckerman, and E.
    Horvitz, 1998, A Bayesian Approach to Filtering
    Junk E-Mail, Learning for Text Categorization:
    Papers from the AAAI Workshop, pages 55-62,
    Madison, Wisconsin. AAAI Technical Report WS-98-05
  • [7] N. Soonthornphisaj, K. Chaikulseriwat, P.
    Tang-On, Anti-Spam Filtering: A Centroid-Based
    Classification Approach, IEEE proceedings ICSP
    02
  • [8] Spam Filtering Using Contextual Networking
    Graphs, www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
  • [9] W.W. Cohen, Learning Rules that Classify
    E-mail, in Proceedings of the AAAI Spring
    Symposium on Machine Learning in Information
    Access, 1996
  • [10] G. Sakkis, I. Androutsopoulos, G. Paliouras,
    V. Karkaletsis, C.D. Spyropoulos, P.
    Stamatopoulos, A Memory-Based Approach to
    Anti-Spam Filtering for Mailing Lists,
    Information Retrieval, 2003