Title: Comment Spam Identification
1. Comment Spam Identification
- Eric Cheng, Eric Steinlauf
2. What is comment spam?
3-4. (No transcript)
5. Total spam: 1,226,026,178. Total ham: 62,723,306.
95% are spam!
Source: http://akismet.com/stats/, retrieved 4/22/2007
6. Countermeasures
7. Blacklisting
- 5yx.org
- 9kx.com
- aakl.com
- aaql.com
- aazl.com
- abcwaynet.com
- abgv.com
- abjg.com
- ablazeglass.com
- abseilextreme.net
- actionbenevole.com
- acvt.com
- adbx.com
- adhouseaz.com
- advantechmicro.com
- aeur.com
- aeza.com
- agentcom.com
- ailh.org
- globalplasticscrap.com
- gowest-veritas.com
- greenlightgo.org
- hadjimitsis.com
- healthcarefx.com
- herctrade.com
- hobbyhighway.com
- hominginc.com
- hongkongdivas.com
- hpspyacademy.com
- hzlr.com
- idlemindsonline.com
- internetmarketingserve.com
- jesh.org
- jfcp.com
- jfss.com
- jittersjapan.com
- jkjf.com
- jkmrw.com
- rockymountainair.org
- rstechresources.com
- samsung-integer.com
- sandiegonhs.org
- screwpile.org
- scvend.org
- sell-in-china.com
- sensationalwraps.com
- sevierdesign.com
- starbikeshop.com
- struthersinc.com
- swarangeet.com
- thecorporategroup.net
- thehawleyco.com
- thehumancrystal.com
- thinkaids.org
- thisandthatgiftshop.net
- thomsungroup.com
- ti0.org
- timeby.net
- tradewindswf.com
- tradingb2c.com
- turkeycogroup.net
- vassagospalace.com
- vyoung.net
- web-toggery.com
- webedgewars.com
- webshoponsalead.com
- webtoggery.com
- willman-paris.com
- worldwidegoans.com
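One way a blacklist like this is applied (a minimal, hedged sketch, not the presenters' code; the sample set and helper names below are illustrative): extract the URLs in an incoming comment and reject it if any link points at a blacklisted domain or one of its subdomains.

    import re
    from urllib.parse import urlparse

    # Illustrative sample; a real filter would load the full blacklist.
    BLACKLIST = {"5yx.org", "9kx.com", "aakl.com", "hzlr.com"}

    URL_RE = re.compile(r"https?://\S+")

    def is_blacklisted(comment):
        """Return True if any link in the comment targets a blacklisted domain."""
        for url in URL_RE.findall(comment):
            host = urlparse(url).hostname or ""
            if any(host == d or host.endswith("." + d) for d in BLACKLIST):
                return True
        return False

    print(is_blacklisted("Nice post! See http://www.hzlr.com/casino"))  # True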
8. CAPTCHAs
- "Completely Automated Public Turing test to tell Computers and Humans Apart"
9. Other ad-hoc/weak methods
- Authentication / registration
- Comment throttling
- Disallowing links in comments
- Moderation
10. Our Approach: Naïve Bayes
- Statistical
- Adaptive
- Automatic
- Scalable and extensible
- Works well for spam e-mail
11. Naïve Bayes
12-13. The pieces: P(A|B), P(B|A), P(A), P(B)
14. Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
15-16. Applied to comments: P(spam|comment) = P(comment|spam) P(spam) / P(comment)
17. With the naïve independence assumption:
P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
where P(w1|spam) is the probability of w1 occurring given a spam comment.
18. Corpus
Spam corpus: "Texas casino", "Texas gambling site"
Incoming comment: "Online Texas holdem"
P(Texas|spam) = 1 - (1 - 2/5)^3 = 0.784
In general, P(w1|spam) = 1 - (1 - x/y)^n, the probability of w1 occurring given a spam comment, where x is the number of times w1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.
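A quick check of this estimate in Python (a sketch; the function name is ours, not part of the presented system). It reproduces the 0.784 figure for "Texas":

    def p_word_given_spam(word, spam_corpus, comment_length):
        words = [w.lower() for msg in spam_corpus for w in msg.split()]
        x = words.count(word.lower())   # occurrences of the word in all spam messages
        y = len(words)                  # total number of words in all spam messages
        # P(w|spam) = 1 - (1 - x/y)^n, with n the length of the incoming comment
        return 1 - (1 - x / y) ** comment_length

    spam_corpus = ["Texas casino", "Texas gambling site"]
    comment = "Online Texas holdem"
    print(p_word_given_spam("Texas", spam_corpus, len(comment.split())))  # ~0.784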
19-21. P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
- P(wi|spam): probability of wi occurring given a spam comment
- P(spam): probability of something being spam
- P(comment): ??????
22. The same form for ham:
P(ham|comment) = P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham) / P(comment)
P(spam|comment) = P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam) / P(comment)
The unknown P(comment) appears in both.
23. Since P(comment) is the same in both, drop it and compare proportionally:
P(ham|comment) ∝ P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham)
P(spam|comment) ∝ P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam)
24. Take logs (the products become sums; both sides differ only by the shared constant log(P(comment))):
log(P(ham|comment)) ∝ log(P(w1|ham) P(w2|ham) ... P(wn|ham) P(ham))
log(P(spam|comment)) ∝ log(P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam))
25. log(P(ham|comment)) ∝ log(P(w1|ham)) + log(P(w2|ham)) + ... + log(P(wn|ham)) + log(P(ham))
log(P(spam|comment)) ∝ log(P(w1|spam)) + log(P(w2|spam)) + ... + log(P(wn|spam)) + log(P(spam))
26. Fact: P(spam|comment) = 1 - P(ham|comment)
Abuse of notation: P(s) = P(spam|comment), P(h) = P(ham|comment)
27. P(s) = 1 - P(h)
m = log(P(s)) - log(P(h)) = log(P(s)/P(h))
e^m = e^(log(P(s)/P(h))) = P(s)/P(h)
e^m · P(h) = P(s)
28. P(s) = 1 - P(h) and m = log(P(s)) - log(P(h)), so:
e^m · P(h) = P(s)
e^m · P(h) = 1 - P(h)
(e^m + 1) · P(h) = 1
P(h) = 1/(e^m + 1), P(s) = 1 - P(h)
29. P(h) = 1/(e^m + 1), P(s) = 1 - P(h), where m = log(P(s)) - log(P(h))
30. P(ham|comment) = 1/(e^m + 1), P(spam|comment) = 1 - P(ham|comment),
where m = log(P(spam|comment)) - log(P(ham|comment))
31. In practice, just compare log(P(ham|comment)) with log(P(spam|comment)).
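Putting slides 25-31 together, a minimal scoring sketch in Python (the log-probability tables, the priors, and the floor for unseen words are our assumptions, not the presenters' exact code):

    import math

    def classify(comment_words, log_p_word_spam, log_p_word_ham,
                 log_prior_spam, log_prior_ham, floor=-10.0):
        # Slide 25: sum the log word probabilities and add the log prior.
        s = log_prior_spam + sum(log_p_word_spam.get(w, floor) for w in comment_words)
        h = log_prior_ham + sum(log_p_word_ham.get(w, floor) for w in comment_words)
        # Slide 31: the decision only needs a comparison of the two log scores.
        is_spam = s > h
        # Slides 27-30: recover an actual probability from the margin m.
        m = max(min(s - h, 50.0), -50.0)   # clamp to avoid overflow in exp()
        p_ham = 1.0 / (math.exp(m) + 1.0)
        return is_spam, 1.0 - p_ham        # (decision, estimated P(spam|comment))

The comparison alone decides spam vs. ham; the 1/(e^m + 1) step is only needed when a calibrated probability is wanted.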
32. Implementation
33. Corpus
- A collection of 50 blog pages with 1024 comments
- Manually tagged as spam/non-spam
- 67% are spam
- Provided by the Informatics Institute at the University of Amsterdam
- G. Mishne, D. Carmel, and R. Lempel. "Blocking Blog Spam with Language Model Disagreement." In AIRWeb '05, First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.
34. Most popular spam words (word, spam score, ham score)
casino 0.999918 0.000082076
betting 0.999879 0.000120513
texas 0.999813 0.000187148
biz 0.999776 0.000223708
holdem 0.999738 0.000262111
poker 0.999551 0.000448675
pills 0.999527 0.000473407
pokerabc 0.999506 0.000493821
teen 0.999455 0.000544715
online 0.999455 0.000544715
bowl 0.999437 0.000562555
gambling 0.999437 0.000562555
sonneries 0.999353 0.000647359
blackjack 0.999346 0.000653516
pharmacy 0.999254 0.000745723
35. Clean words (word, spam score, ham score)
edu 0.00287339 0.997127
projects 0.00270528 0.997295
week 0.00270528 0.997295
etc 0.00270528 0.997295
went 0.00270528 0.997295
inbox 0.00270528 0.997295
bit 0.00270528 0.997295
someone 0.00255576 0.997444
bike 0.00230136 0.997699
already 0.00230136 0.997699
selling 0.00219225 0.997808
making 0.00209302 0.997907
squad 0.00184278 0.998157
left 0.00177216 0.998228
important 0.0013973 0.998603
pimps 0.000427782 0.999572
36. Implementation
- Corpus parsing and processing
- Naïve Bayes algorithm
- Randomly select 70% for training, 30% for testing (split sketched below)
- Stand-alone web service
- Written entirely in Python
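A minimal sketch of the random 70/30 split (assuming the corpus is a list of (comment, is_spam) pairs; not the presenters' code):

    import random

    def split_corpus(labeled_comments, train_fraction=0.7, seed=None):
        """Shuffle (comment, is_spam) pairs and split them into train and test sets."""
        rng = random.Random(seed)
        shuffled = list(labeled_comments)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # train, test = split_corpus(corpus)   # roughly 70% training, 30% testing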
37. It's showtime!
38. Configurations
- Separator used to tokenize comment
- Inclusion of words from header
- Classify based only on most significant words
- Double count non-spam comments
- Include article body as non-spam example
- Boosting
39. Minimum-Error Configuration
- Separator: a-z<> (see the tokenization sketch below)
- Header: both
- Significant words: all
- Double count: no
- Include body: no
- Boosting: no
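How the separator setting might be applied; a hedged sketch assuming "a-z<>" means that lowercase letters and angle brackets form tokens and every other character separates them (so traces of HTML markup survive as features). The original system's exact tokenizer may differ.

    import re

    # Tokens are maximal runs of characters in [a-z<>]; everything else is a separator.
    TOKEN_RE = re.compile(r"[a-z<>]+")

    def tokenize(comment):
        return TOKEN_RE.findall(comment.lower())

    print(tokenize("Visit our Texas holdem site!"))
    # ['visit', 'our', 'texas', 'holdem', 'site']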
40-41. Varying Configuration Parameters
42. Boosting
- Naïve Bayes is applied repeatedly to the data.
- Produces a weighted-majority model.

    bayesModels = []           # collected (model, error) pairs
    weights = vector(1)        # start with uniform example weights
    for i in 1 to M:
        model   = naiveBayes(examples, weights)
        error   = computeError(model, examples)
        weights = adjustWeights(examples, weights, error)
        bayesModels[i] = (model, error)
        if error == 0: break
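One plausible reading of adjustWeights (an AdaBoost-style reweighting sketched under our own assumptions, not the presenters' scheme; it also needs the model's predictions, so it takes the model as an extra argument here): misclassified comments gain relative weight so the next Naïve Bayes round concentrates on them.

    def adjust_weights(examples, weights, model, error):
        """Down-weight correctly classified examples, then renormalize (AdaBoost-style)."""
        beta = error / (1.0 - error) if 0.0 < error < 1.0 else 1e-9
        new_weights = []
        for (comment, label), w in zip(examples, weights):
            predicted = model.classify(comment)          # hypothetical model API
            new_weights.append(w * beta if predicted == label else w)
        total = sum(new_weights)
        return [w / total for w in new_weights]

In the weighted-majority model mentioned above, each round's classifier would then typically vote with weight log(1/beta).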
43. Boosting
44. Future work (or what we did not do)
45. Data Processing
- Follow links in the comment and include words from the target web page
- More sophisticated tokenization and URL handling (handling 100,000...)
- Word stemming
46. Features
- Ability to incorporate incoming comments into the corpus
- Ability to mark a comment as spam/non-spam
- Assign more weight to page content
- Adjust the probability table based on page content, providing content-sensitive filtering
47. Comments?
No spam, please.