Title: Comment Spam Identification
1. Comment Spam Identification
- Eric Cheng, Eric Steinlauf
2. What is comment spam?
3-4. (No transcript)
5. Total spam: 1,226,026,178. Total ham: 62,723,306.
95% are spam!
Source: http://akismet.com/stats/, retrieved 4/22/2007
6. Countermeasures
7. Blacklisting
- 5yx.org
- 9kx.com
- aakl.com
- aaql.com
- aazl.com
- abcwaynet.com
- abgv.com
- abjg.com
- ablazeglass.com
- abseilextreme.net
- actionbenevole.com
- acvt.com
- adbx.com
- adhouseaz.com
- advantechmicro.com
- aeur.com
- aeza.com
- agentcom.com
- ailh.org
- globalplasticscrap.com
- gowest-veritas.com
- greenlightgo.org
- hadjimitsis.com
- healthcarefx.com
- herctrade.com
- hobbyhighway.com
- hominginc.com
- hongkongdivas.com
- hpspyacademy.com
- hzlr.com
- idlemindsonline.com
- internetmarketingserve.com
- jesh.org
- jfcp.com
- jfss.com
- jittersjapan.com
- jkjf.com
- jkmrw.com
- rockymountainair.org
- rstechresources.com
- samsung-integer.com
- sandiegonhs.org
- screwpile.org
- scvend.org
- sell-in-china.com
- sensationalwraps.com
- sevierdesign.com
- starbikeshop.com
- struthersinc.com
- swarangeet.com
- thecorporategroup.net
- thehawleyco.com
- thehumancrystal.com
- thinkaids.org
- thisandthatgiftshop.net
- thomsungroup.com
- ti0.org
- timeby.net
- tradewindswf.com
- tradingb2c.com
- turkeycogroup.net
- vassagospalace.com
- vyoung.net
- web-toggery.com
- webedgewars.com
- webshoponsalead.com
- webtoggery.com
- willman-paris.com
- worldwidegoans.com
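One way a blacklist like this is applied (a minimal, hedged sketch, not the presenters' code; the sample set and helper names below are illustrative): extract the URLs in an incoming comment and reject it if any link points at a blacklisted domain or one of its subdomains.

    import re
    from urllib.parse import urlparse

    # Illustrative sample; a real filter would load the full blacklist.
    BLACKLIST = {"5yx.org", "9kx.com", "aakl.com", "hzlr.com"}

    URL_RE = re.compile(r"https?://\S+")

    def is_blacklisted(comment):
        """Return True if any link in the comment targets a blacklisted domain."""
        for url in URL_RE.findall(comment):
            host = urlparse(url).hostname or ""
            if any(host == d or host.endswith("." + d) for d in BLACKLIST):
                return True
        return False

    print(is_blacklisted("Nice post! See http://www.hzlr.com/casino"))  # True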
8. CAPTCHAs
- "Completely Automated Public Turing test to tell Computers and Humans Apart"
9. Other ad-hoc/weak methods
- Authentication / registration
- Comment throttling
- Disallowing links in comments
- Moderation
10. Our Approach: Naïve Bayes
- Statistical
- Adaptive
- Automatic
- Scalable and extensible
- Works well for spam e-mail
11. Naïve Bayes
12-13. The pieces: P(A|B), P(B|A), P(A), P(B)
14. Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
15-16. Applied to comments: P(spam|comment) = P(comment|spam) P(spam) / P(comment)
17. With the naïve independence assumption:
P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
where P(w1|spam) is the probability of w1 occurring given a spam comment.
18. Corpus
Spam corpus: "Texas casino", "Texas gambling site"
Incoming comment: "Online Texas holdem"
P(Texas|spam) = 1 - (1 - 2/5)^3 = 0.784
In general, P(w1|spam) = 1 - (1 - x/y)^n, the probability of w1 occurring given a spam comment, where x is the number of times w1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.
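A quick check of this estimate in Python (a sketch; the function name is ours, not part of the presented system). It reproduces the 0.784 figure for "Texas":

    def p_word_given_spam(word, spam_corpus, comment_length):
        words = [w.lower() for msg in spam_corpus for w in msg.split()]
        x = words.count(word.lower())   # occurrences of the word in all spam messages
        y = len(words)                  # total number of words in all spam messages
        # P(w|spam) = 1 - (1 - x/y)^n, with n the length of the incoming comment
        return 1 - (1 - x / y) ** comment_length

    spam_corpus = ["Texas casino", "Texas gambling site"]
    comment = "Online Texas holdem"
    print(p_word_given_spam("Texas", spam_corpus, len(comment.split())))  # ~0.784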
19-21. P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
- P(wi|spam): probability of wi occurring given a spam comment
- P(spam): probability of something being spam
- P(comment): ??????
22. The same form for ham:
P(ham|comment) = P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham) / P(comment)
P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
The unknown P(comment) appears in both.
23. Since P(comment) is the same in both, drop it and compare proportionally:
P(ham|comment) ∝ P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham)
P(spam|comment) ∝ P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam)
24. Take logs (the products become sums; both sides differ only by the shared constant log(P(comment))):
log(P(ham|comment)) ∝ log(P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham))
log(P(spam|comment)) ∝ log(P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam))
25. log(P(ham|comment)) ∝ log(P(w1|ham)) + log(P(w2|ham)) + ... + log(P(wn|ham)) + log(P(ham))
log(P(spam|comment)) ∝ log(P(w1|spam)) + log(P(w2|spam)) + ... + log(P(wn|spam)) + log(P(spam))
26. Fact: P(spam|comment) = 1 - P(ham|comment)
Abuse of notation: P(s) = P(spam|comment), P(h) = P(ham|comment)
27. P(s) = 1 - P(h)
m = log(P(s)) - log(P(h)) = log(P(s)/P(h))
e^m = e^(log(P(s)/P(h))) = P(s)/P(h)
e^m · P(h) = P(s)
28. P(s) = 1 - P(h) and m = log(P(s)) - log(P(h)), so:
e^m · P(h) = P(s)
e^m · P(h) = 1 - P(h)
(e^m + 1) · P(h) = 1
P(h) = 1/(e^m + 1), P(s) = 1 - P(h)
29. P(h) = 1/(e^m + 1), P(s) = 1 - P(h), where m = log(P(s)) - log(P(h))
30. P(ham|comment) = 1/(e^m + 1), P(spam|comment) = 1 - P(ham|comment),
where m = log(P(spam|comment)) - log(P(ham|comment))
31. In practice, just compare log(P(ham|comment)) with log(P(spam|comment)).
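Putting slides 25-31 together, a minimal scoring sketch in Python (the log-probability tables, the priors, and the floor for unseen words are our assumptions, not the presenters' exact code):

    import math

    def classify(comment_words, log_p_word_spam, log_p_word_ham,
                 log_prior_spam, log_prior_ham, floor=-10.0):
        # Slide 25: sum the log word probabilities and add the log prior.
        s = log_prior_spam + sum(log_p_word_spam.get(w, floor) for w in comment_words)
        h = log_prior_ham + sum(log_p_word_ham.get(w, floor) for w in comment_words)
        # Slide 31: the decision only needs a comparison of the two log scores.
        is_spam = s > h
        # Slides 27-30: recover an actual probability from the margin m.
        m = max(min(s - h, 50.0), -50.0)   # clamp to avoid overflow in exp()
        p_ham = 1.0 / (math.exp(m) + 1.0)
        return is_spam, 1.0 - p_ham        # (decision, estimated P(spam|comment))

The comparison alone decides spam vs. ham; the 1/(e^m + 1) step is only needed when a calibrated probability is wanted.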
32. Implementation
33. Corpus
- A collection of 50 blog pages with 1024 comments
- Manually tagged as spam/non-spam
- 67% are spam
- Provided by the Informatics Institute at the University of Amsterdam
- G. Mishne, D. Carmel, and R. Lempel. "Blocking Blog Spam with Language Model Disagreement." In AIRWeb '05, First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.
34. Most popular spam words (word, spam score, ham score)
casino 0.999918 0.000082076
betting 0.999879 0.000120513
texas 0.999813 0.000187148
biz 0.999776 0.000223708
holdem 0.999738 0.000262111
poker 0.999551 0.000448675
pills 0.999527 0.000473407
pokerabc 0.999506 0.000493821
teen 0.999455 0.000544715
online 0.999455 0.000544715
bowl 0.999437 0.000562555
gambling 0.999437 0.000562555
sonneries 0.999353 0.000647359
blackjack 0.999346 0.000653516
pharmacy 0.999254 0.000745723
35. Clean words (word, spam score, ham score)
edu 0.00287339 0.997127
projects 0.00270528 0.997295
week 0.00270528 0.997295
etc 0.00270528 0.997295
went 0.00270528 0.997295
inbox 0.00270528 0.997295
bit 0.00270528 0.997295
someone 0.00255576 0.997444
bike 0.00230136 0.997699
already 0.00230136 0.997699
selling 0.00219225 0.997808
making 0.00209302 0.997907
squad 0.00184278 0.998157
left 0.00177216 0.998228
important 0.0013973 0.998603
pimps 0.000427782 0.999572
36. Implementation
- Corpus parsing and processing
- Naïve Bayes algorithm
- Randomly select 70% for training, 30% for testing (split sketched below)
- Stand-alone web service
- Written entirely in Python
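A minimal sketch of the random 70/30 split (assuming the corpus is a list of (comment, is_spam) pairs; not the presenters' code):

    import random

    def split_corpus(labeled_comments, train_fraction=0.7, seed=None):
        """Shuffle (comment, is_spam) pairs and split them into train and test sets."""
        rng = random.Random(seed)
        shuffled = list(labeled_comments)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # train, test = split_corpus(corpus)   # roughly 70% training, 30% testing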
37. It's showtime!
38. Configurations
- Separator used to tokenize comment
- Inclusion of words from header
- Classify based only on most significant words
- Double count non-spam comments
- Include article body as non-spam example
- Boosting
39. Minimum-Error Configuration
- Separator: a-z<> (see the tokenization sketch below)
- Header: both
- Significant words: all
- Double count: no
- Include body: no
- Boosting: no
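How the separator setting might be applied; a hedged sketch assuming "a-z<>" means that lowercase letters and angle brackets form tokens and every other character separates them (so traces of HTML markup survive as features). The original system's exact tokenizer may differ.

    import re

    # Tokens are maximal runs of characters in [a-z<>]; everything else is a separator.
    TOKEN_RE = re.compile(r"[a-z<>]+")

    def tokenize(comment):
        return TOKEN_RE.findall(comment.lower())

    print(tokenize("Visit our Texas holdem site!"))
    # ['visit', 'our', 'texas', 'holdem', 'site']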
40-41. Varying Configuration Parameters
42. Boosting
- Naïve Bayes is applied repeatedly to the data.
- Produces a weighted-majority model.

    bayesModels = []           # collected (model, error) pairs
    weights = vector(1)        # start with uniform example weights
    for i in 1 to M:
        model   = naiveBayes(examples, weights)
        error   = computeError(model, examples)
        weights = adjustWeights(examples, weights, error)
        bayesModels[i] = (model, error)
        if error == 0: break
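One plausible reading of adjustWeights (an AdaBoost-style reweighting sketched under our own assumptions, not the presenters' scheme; it also needs the model's predictions, so it takes the model as an extra argument here): misclassified comments gain relative weight so the next Naïve Bayes round concentrates on them.

    def adjust_weights(examples, weights, model, error):
        """Down-weight correctly classified examples, then renormalize (AdaBoost-style)."""
        beta = error / (1.0 - error) if 0.0 < error < 1.0 else 1e-9
        new_weights = []
        for (comment, label), w in zip(examples, weights):
            predicted = model.classify(comment)          # hypothetical model API
            new_weights.append(w * beta if predicted == label else w)
        total = sum(new_weights)
        return [w / total for w in new_weights]

In the weighted-majority model mentioned above, each round's classifier would then typically vote with weight log(1/beta).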
43. Boosting
44. Future work (or what we did not do)
45. Data Processing
- Follow links in the comment and include words from the target web page
- More sophisticated tokenization and URL handling (handling 100,000...)
- Word stemming
46. Features
- Ability to incorporate incoming comments into the corpus
- Ability to mark a comment as spam/non-spam
- Assign more weight to page content
- Adjust the probability table based on page content, providing content-sensitive filtering
47. Comments?
No spam, please.