Title: A Quantitative Study of Forum Spamming Using Context-Based Analysis
1. A Quantitative Study of Forum Spamming Using Context-Based Analysis
Yuan Niu, Hao Chen, Francis Hsu
UC Davis, Microsoft Research
2. A Look at the Web
3. Why do we care about spam?
- Users want to
  - Look at quality pages on the web
  - Interact without the trouble of moderation
  - Surf safely
- Search engines want to
  - Provide good search results
  - Profit from ads
- We want to investigate the landscape of the problem
  - Popular battleground: web forums
4. Why Web Forums?
- Open communities: wiki, forums, blogs
- Increasingly easy to contribute
5. Why Web Forums?
6. How Spammers Operate
[Diagram labels: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain]
7. How to deal with the problem?
- Content-based approach
  - Constrained by content retrieved
  - May be deceived by tricks like cloaking and redirection
- We propose context-based analysis
8. Context-based Analysis
- Consisting of
  - Redirection analysis
  - Cloaking analysis
- See dynamic content not served to crawlers
  - Use the Strider URL Tracer
- Flag the large number of doorway pages that lead to spam domains
- Based on the intuition that
  - Publishing links is necessary to increase popularity
  - We must see the destination URL eventually
9. Doorways and Redirections
[Screenshot: Google search for "Coach handbag"]
10. Redirection Analysis
- Fed URLs to Strider URL Tracer, which records all pages visited
- Ranked top third-party domains by redirections
- Seed: known spammer domain
- Identified doorway pages based on association with spammer domains
- Manually investigated unknown domains to expand the blacklist
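The ranking and flagging steps above can be sketched in a few lines. This is a minimal illustration, not the Strider URL Tracer itself: it assumes the tracer output has already been reduced to a mapping from each start URL to the list of URLs visited while loading it, which is a hypothetical format. All `.example` hostnames are made up, and full hostnames stand in for "domains" to keep the sketch short.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lowercase hostname of a URL (empty string if missing)."""
    return (urlparse(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many start URLs led to it."""
    counts = Counter()
    for start_url, visited in traces.items():
        # Third-party hosts are those other than the start URL's own host.
        third_party = {host(u) for u in visited} - {host(start_url), ""}
        counts.update(third_party)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return start URLs whose trace touches a blacklisted host."""
    return sorted(
        start_url
        for start_url, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Hypothetical traces: start URL -> every URL visited while loading it.
traces = {
    "http://blog-a.example/post1": [
        "http://blog-a.example/post1",
        "http://redir.example/r",
        "http://spam-target.example/",
    ],
    "http://blog-b.example/x": [
        "http://blog-b.example/x",
        "http://spam-target.example/",
    ],
    "http://clean.example/": [
        "http://clean.example/",
        "http://cdn.example/lib.js",
    ],
}

ranking = rank_third_party_domains(traces)
# Seed blacklist with one known spammer host (assumed for illustration).
doorways = flag_doorways(traces, blacklist={"spam-target.example"})
```

In this sketch the top-ranked host is the one reached from the most start URLs. Hosts near the top that are not yet on the blacklist are the candidates for the manual investigation step.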
11. Cloaking Analysis
- Diff-based check
  - Run URL twice: once with anti-cloaking, once without
- Crawler-browser cloaking (User-agent, scripting-on/off)
- Click-through cloaking (Referer)
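A diff-based check of this kind might look like the sketch below. It is an assumption-laden illustration rather than the tool used in the study: the user-agent strings, the `fetch` helper, and the 0.5 similarity threshold are all hypothetical. It only covers the header-based variants (User-Agent and Referer), because toggling scripting requires a real browser.

```python
import difflib
import urllib.request

# Assumed user-agent strings, for illustration only.
BROWSER_UA = "Mozilla/5.0 (compatible; example-browser)"
CRAWLER_UA = "example-crawler/1.0"


def fetch(url, user_agent, referer=None, timeout=10):
    """Fetch a page body with a given User-Agent and optional Referer."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def similarity(page_a, page_b):
    """Line-based similarity ratio in [0, 1] between two page bodies."""
    matcher = difflib.SequenceMatcher(
        None, page_a.splitlines(), page_b.splitlines()
    )
    return matcher.ratio()


def looks_cloaked(page_a, page_b, threshold=0.5):
    """Flag a URL when its two views share too few lines (assumed cutoff)."""
    return similarity(page_a, page_b) < threshold


# Offline example with two hand-written views of the same URL.
crawler_view = "<h1>Forum archive</h1>\n<p>Keyword-stuffed text</p>"
browser_view = "<script>location.replace('http://spam.example/')</script>"
flagged = looks_cloaked(crawler_view, browser_view)
```

For a live check, the two views would come from calls such as `fetch(url, CRAWLER_UA)` and `fetch(url, BROWSER_UA)`, or from fetching with and without a search-engine `referer` for the click-through case. Pages with legitimately dynamic content also differ between fetches, so any threshold would need tuning.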
12. Crawler-Browser Cloaking
13. Crawler-Browser Cloaking
14. Click-Through Cloaking
15. Three Perspectives
[Diagram labels: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain]
16. Search User
17. Search User
- Chose 9 popular forum software packages written in Perl/PHP, hosted/unhosted
  - WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin
- Compiled a list of 190 keywords from popular tags and common spam terms
  - Myspace, jewelry, casino, shopping, baseball
- Searched for all <keyword, forum-software> pairs in Google and MSN
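To make the size of this search concrete, the pairing step can be sketched as a Cartesian product. The five keywords are the examples named on the slide, not the full 190-term list, and the query string format (keyword followed by the quoted software name) is an assumption for illustration.

```python
from itertools import product

# The nine forum software packages named on the slide.
forum_software = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]

# Only the five example keywords from the slide; the full list had 190.
sample_keywords = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, software):
    """Build one query per <keyword, forum-software> pair (assumed format)."""
    return [f'{keyword} "{name}"' for keyword, name in product(keywords, software)]


queries = build_queries(sample_keywords, forum_software)

# With the full 190-keyword list, each search engine receives this many queries.
total_for_full_list = 190 * len(forum_software)
```

With 190 keywords and 9 packages, that is 1,710 pairs per search engine.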
18. Search User
- Search terms returned spammed forums in top 20 results from both Google and MSN
  - Only exception is palm-texas-holdem-game
- Top 5 most spammed forums

Forum                                               Pages  Keywords
http://fs.fed.us/...mm/get/mmforumA.html            175    102
http://www.comm.fsu.edu/interactive/forum/          134    82
http://www.usra.edu/phorum                          119    94
http://classicauthors.net/messageboard/list.php?f1  117    97
http://samba.eecs.umich.edu/phorum/list.php?2       105    79
19. Honeyblogs
- Spammers
  - Create their own doorway pages, and
  - Promote the doorways by posting to other people's pages
- Honeyblogs lure the spammer in
  - No moderation, default accept-all policy
  - Pinged blog aggregators with every post
  - Abandoned within three months
20. Honeyblogs
- 41,100 comments collected over 339 days
- 19,297 comments received in the last month
- Ilium 930/1432
- Litlog 3734/5714
- Spammer activity got me kicked off my hosting server
21. Honeyblog Activity
22. Honeyblog Activity
23. Webhost Perspective

Blog Host    Examined URLs  Spam URLs      URLs Using Cloaking
Blogspot     13,389         1,091 (8.1%)   652
Blogspoint   4,714          3,535 (75%)    131
Blogstudio   369            198 (54%)      0
Blogsharing  99             82 (83%)       0

- Above numbers are lower bounds
  - Consider only pages using cloaking or redirection
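The percentages in the table above follow directly from the counts, as the short check below shows. The dictionary layout and function name are illustrative choices; the numbers are copied from the table.

```python
# (examined URLs, spam URLs, URLs using cloaking), copied from the table.
table = {
    "Blogspot": (13389, 1091, 652),
    "Blogspoint": (4714, 3535, 131),
    "Blogstudio": (369, 198, 0),
    "Blogsharing": (99, 82, 0),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of examined URLs that were spam, per blog host.
spam_share = {
    name: percent(spam, examined)
    for name, (examined, spam, _cloaked) in table.items()
}

# Share of Blogspot spam URLs that used cloaking.
blogspot_cloaking_share = percent(table["Blogspot"][2], table["Blogspot"][1])
```

The computed shares (8.1, 75.0, 53.7, and 82.8 percent) round to the figures shown in the table, and about 59.8 percent of the Blogspot spam URLs used cloaking.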
24. Webhost Perspective
- Blogspot: 1,091 splogs
  - Most popular
  - Randomly sampled 1% of profile pages created in July and extracted all blog links: 13,389
  - 60% of splogs used cloaking
  - 24% of splogs redirected to filldirect.com
25. Webhost Perspective
- Blogspoint: 3,535 splogs
  - 2,166 redirected to finance-web-search.com
  - 917 redirected to casino-web-search.com
- Blogstudio: 198 splogs
  - 130 redirected to finance-web-search.com
  - 54 redirected to casino-web-search.com
- Blogsharing: 82 splogs
  - Plumber-related link spamming in splogs
26. Also of note
- Malicious URLs
  - Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities
  - We tested 8 known malicious URLs for presence on the web
  - Found 5 spammed in forums, 2 in link farms, 1 in referrer logs
- Universal redirectors
  - Redirects user to any URL (sometimes destination is obfuscated)
  - www.rit.edu/ksa/cgi-bin/splinks/click.cgi?num2urlyour url here
  - http://tinyurl.com/3c7twl
  - http://www.canadianpharmacyltd.com/group.php?id59aid860
  - Could be used to serve malicious URLs, particularly those on .edu and .gov sites
[1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
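One simple way to spot candidate universal redirectors is to look for URLs that carry another absolute URL in a query parameter. The sketch below is a heuristic offered for illustration, not a method from the study. The `.example` URLs and parameter names are hypothetical, and the check misses obfuscated destinations and short-URL services.

```python
from urllib.parse import parse_qsl, urlparse


def embedded_targets(url):
    """Return query-parameter values that look like absolute http(s) URLs."""
    targets = []
    for _name, value in parse_qsl(urlparse(url).query, keep_blank_values=True):
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            targets.append(value)
    return targets


# Hypothetical redirector-style link and an ordinary forum link.
redirector_link = (
    "http://redirector.example/click.cgi"
    "?num=2&url=http://destination.example/page"
)
ordinary_link = "http://forum.example/list.php?f=1"

found = embedded_targets(redirector_link)
none_found = embedded_targets(ordinary_link)
```

A match only shows that the link names a destination in its query string. Confirming that the site actually redirects there still requires following the link.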
27. Related Work (Part 1)
- Diff-based cloaking
  - Wu and Davison: Diff-based cloaking combined with content-based analysis
  - Our approach detects click-through cloaking
- Content-based approaches
  - Fetterly, Manasse and Najork: URL properties, clustering pages of similar content
  - Mishne, Carmel, Lempel: Compared statistical models of comments and target pages against post content
  - Kolari, Finin and Joshi: Meta tag text, anchor text, URLs
  - Our approach is complementary to content-based approaches
28. Related Work (Part 2)
- Measurements of Trust
  - Metaxas et al.: Defined trust neighborhoods
  - Benczur et al.: SpamRank identifies outliers by looking at PageRank of the site and its supporters
  - Similarly, our approach propagates distrust by following redirections
- Plugins to aid moderating forums/blogs
  - Akismet
  - Bad Behavior, Spam Karma
  - Our approach does not require cooperation from forum owners
29. Conclusions
- Context-based approach successfully detects advanced cloaking- and redirection-based spam
- Spammers are pervasive
  - 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN
  - Same spammer redirecting to two domains on Blogspoint and Blogstudio
30. Future work
- There is hope!
- Economic solution
  - Identifies middlemen in online advertising
  - Read our WWW 2007 paper [1]
- http://wwwcsif.cs.ucdavis.edu/niu
- http://research.microsoft.com/csm/strider/
[1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.