A Quantitative Study of Forum Spamming Using Context-Based Analysis


1
A Quantitative Study of Forum Spamming Using
Context-Based Analysis
Yuan Niu, Hao Chen, Francis Hsu (UC Davis)
Yi-Min Wang, Ming Ma (Microsoft Research)
2
A Look at the Web
3
Why do we care about spam?
  • Users want to
    • Look at quality pages on the web
    • Interact without the trouble of moderation
    • Surf safely
  • Search engines want to
    • Provide good search results
    • Profit from ads
  • We want to investigate the landscape of the
    problem
  • Popular battleground: web forums

4
Why Web Forums?
  • Open communities: wikis, forums, blogs
  • Increasingly easy to contribute

5
Why Web Forums?
6
How Spammers Operate
[Diagram: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain]
7
How to deal with the problem?
  • Content-based approach
    • Constrained by content retrieved
    • May be deceived by tricks like cloaking and
      redirection
  • We propose context-based analysis

8
Context-based Analysis
  • Consisting of
    • Redirection analysis
    • Cloaking analysis
  • See dynamic content not served to crawlers
  • Use the Strider URL Tracer
  • Flag large number of doorway pages to spam
    domains
  • Based on intuition that
    • Publishing links is necessary to increase
      popularity
    • We must see the destination URL eventually

9
Doorways and Redirections
Google search: "Coach handbag"
10
Redirection Analysis
  • Fed URLs to Strider URL Tracer, which records all
    pages visited
  • Ranked top third-party domains by redirections
  • Seeded the blacklist with a known spammer domain
  • Identified doorway pages based on association
    with spammer domains
  • Manually investigated unknown domains to expand
    the blacklist
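
The ranking and doorway-identification steps above can be sketched roughly as follows. This is an illustration only, not the Strider URL Tracer: the trace format (each starting URL mapped to the list of URLs visited while loading it), the function names, and the example domains are all assumptions, and host names stand in for registered domains.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many traced URLs led to it.

    `traces` maps a starting URL to the URLs visited while loading it
    (hypothetical format; a real tracer's output may look different).
    """
    counts = Counter()
    for start_url, visited in traces.items():
        first_party = host(start_url)
        third_party = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_party)  # each host counted once per traced URL
    return counts.most_common()


def find_doorways(traces, blacklist):
    """Return starting URLs whose traffic reaches a blacklisted host."""
    return sorted(
        start for start, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Made-up traces, for illustration only.
traces = {
    "http://doorway-a.example/post1": ["http://spam-target.example/landing"],
    "http://doorway-b.example/post2": ["http://spam-target.example/landing",
                                       "http://ads.example/banner"],
    "http://benign.example/page": ["http://cdn.example/lib.js"],
}
ranking = rank_third_party_domains(traces)
doorways = find_doorways(traces, blacklist={"spam-target.example"})
```

Hosts near the top of `ranking` that are not yet blacklisted would be the candidates for the manual investigation step.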

11
Cloaking Analysis
  • Diff-based check
    • Run each URL twice: once with anti-cloaking, once
      without
  • Crawler-browser cloaking (User-Agent,
    scripting on/off)
  • Click-through cloaking (Referer)
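
A minimal sketch of a diff-based check, assuming the page can be fetched twice with different request headers. The `fetch` callable, the header values, the 0.9 similarity threshold, and the fake server below are all assumptions made for illustration; the study's own tooling is not shown in these slides.

```python
from difflib import SequenceMatcher

# Hypothetical header sets for the two kinds of comparison.
CRAWLER = {"User-Agent": "Googlebot/2.1"}
BROWSER = {"User-Agent": "Mozilla/5.0"}
BROWSER_VIA_SEARCH = {"User-Agent": "Mozilla/5.0",
                      "Referer": "http://www.google.com/search?q=handbag"}


def looks_cloaked(url, fetch, headers_a, headers_b, threshold=0.9):
    """Fetch `url` twice and flag it if the two responses differ a lot.

    `fetch(url, headers)` must return the page body as a string.
    """
    page_a = fetch(url, headers_a)
    page_b = fetch(url, headers_b)
    return SequenceMatcher(None, page_a, page_b).ratio() < threshold


def fake_fetch(url, headers):
    """Stand-in server that redirects only visitors arriving from a search."""
    if "google.com/search" in headers.get("Referer", ""):
        return "<html><script>location='http://spam.example/'</script></html>"
    return "<html>handbag reviews and forum discussion</html>"


url = "http://doorway.example/page"
crawler_browser = looks_cloaked(url, fake_fetch, CRAWLER, BROWSER)
click_through = looks_cloaked(url, fake_fetch, BROWSER, BROWSER_VIA_SEARCH)
```

In this made-up case the User-Agent comparison finds nothing, while the Referer comparison flags the page, which is the situation click-through cloaking creates.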

12
Crawler-Browser Cloaking
13
Crawler-Browser Cloaking
14
Click-Through Cloaking
15
Three Perspectives
[Diagram: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain]
16
Search User
17
Search User
  • Chose 9 popular forum software packages written in
    Perl/PHP, hosted/unhosted
    • WWWBoard, Hypernews, Ikonboard, Ezboard,
      Bravenet, Invision Board, Phpbb, Phorum, and
      VBulletin
  • Compiled a list of 190 keywords from popular tags
    and common spam terms
    • Myspace, jewelry, casino, shopping, baseball
  • Searched for all <keyword, forum-software> pairs
    in Google and MSN
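
Enumerating the <keyword, forum-software> pairs can be sketched as below. Only the five keywords named on the slide are used, and the query string format (keyword followed by the package name) is an assumption; the slides do not say how the queries were phrased.

```python
from itertools import product

FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]
# The five example keywords from the slide; the full list had 190.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, forums):
    """Return one query string per <keyword, forum-software> pair."""
    return [f"{keyword} {forum}" for keyword, forum in product(keywords, forums)]


queries = build_queries(SAMPLE_KEYWORDS, FORUM_SOFTWARE)
pairs_for_full_list = 190 * len(FORUM_SOFTWARE)
```

With the full keyword list this comes to 190 × 9 = 1,710 pairs per search engine.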

18
Search User
  • Search terms returned spammed forums in top 20
    results from both Google and MSN
    • Only exception is palm-texas-holdem-game
  • Top 5 most spammed forums

Forum                                               Pages  Keywords
http://fs.fed.us/...mm/get/mmforumA.html              175       102
http://www.comm.fsu.edu/interactive/forum/            134        82
http://www.usra.edu/phorum                            119        94
http://classicauthors.net/messageboard/list.php?f1    117        97
http://samba.eecs.umich.edu/phorum/list.php?2         105        79
19
Honeyblogs
  • Spammers
    • Create their own doorway pages, and
    • Promote the doorways by posting to other people's
      pages
  • Honeyblogs lure the spammer in
    • No moderation, default accept-all policy
    • Pinged blog aggregators with every post
    • Abandoned within three months

20
Honeyblogs
  • 41,100 comments collected over 339 days
  • 19,297 comments received in the last month
  • Ilium 930/1432
  • Litlog 3734/5714
  • Spammer activity got me kicked off my hosting
    server
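
As a quick worked calculation from the totals above (variable names are illustrative), the last month accounts for roughly 47% of all comments collected, against an overall average of about 121 comments per day:

```python
total_comments = 41_100
collection_days = 339
last_month_comments = 19_297

# Share of all comments that arrived in the final month.
share_last_month = last_month_comments / total_comments

# Average daily volume over the whole collection period.
average_per_day = total_comments / collection_days

print(f"{share_last_month:.0%} of comments arrived in the last month")
print(f"{average_per_day:.0f} comments per day on average")
```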

21
Honeyblog Activity
22
Honeyblog Activity
3142
23
Webhost Perspective
  • Focus on splog doorways

Blog Host    Examined URLs  Spam URLs      URLs Using Cloaking
Blogspot            13,389  1,091 (8.1%)                   652
Blogspoint           4,714  3,535 (75%)                    131
Blogstudio             369    198 (54%)                      0
Blogsharing             99     82 (83%)                      0
  • Above numbers are lower bounds
    • Consider only pages using cloaking/redirection
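
The percentages in the table follow directly from the counts. A small sketch that recomputes them (the dictionary layout and function name are illustrative only):

```python
# (examined URLs, spam URLs, spam URLs using cloaking) per blog host,
# copied from the table above.
HOSTS = {
    "Blogspot": (13_389, 1_091, 652),
    "Blogspoint": (4_714, 3_535, 131),
    "Blogstudio": (369, 198, 0),
    "Blogsharing": (99, 82, 0),
}


def spam_percentage(examined, spam):
    """Percentage of examined URLs that were classified as spam."""
    return 100 * spam / examined


spam_pct = {name: spam_percentage(examined, spam)
            for name, (examined, spam, _) in HOSTS.items()}

# Share of Blogspot splogs that used cloaking.
_, blogspot_spam, blogspot_cloaked = HOSTS["Blogspot"]
blogspot_cloaking_pct = 100 * blogspot_cloaked / blogspot_spam
```

The Blogspot cloaking share works out to about 60% of its splogs.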

24
Webhost Perspective
  • Blogspot: 1,091 splogs
    • Most popular
    • Randomly sampled 1% of profile pages created in
      July and extracted all blog links: 13,389
    • 60% of splogs used cloaking
    • 24% of splogs redirected to filldirect.com

25
Webhost Perspective
  • Blogspoint: 3,535 splogs
    • 2,166 redirected to finance-web-search.com
    • 917 redirected to casino-web-search.com
  • Blogstudio: 198 splogs
    • 130 redirected to finance-web-search.com
    • 54 redirected to casino-web-search.com
  • Blogsharing: 82 splogs
    • Plumber-related link spamming in splogs

26
Also of note
  • Malicious URLs
    • Previous work by MSR (Strider HoneyMonkey) [1]
      discovered sites that actively exploit browser
      vulnerabilities
    • We tested 8 known malicious URLs for presence on
      the web
    • Found 5 spammed in forums, 2 in link farms, 1 in
      referrer logs
  • Universal redirectors
    • Redirect the user to any URL (sometimes the
      destination is obfuscated)
    • www.rit.edu/ksa/cgi-bin/splinks/click.cgi?num2urlyour url here
    • http://tinyurl.com/3c7twl
    • http://www.canadianpharmacyltd.com/group.php?id59aid860
    • Could be used to serve malicious URLs,
      particularly those on .edu and .gov sites

[1] Yi-Min Wang et al. Automated Web Patrol with
Strider HoneyMonkeys: Finding Web Sites That
Exploit Browser Vulnerabilities. NDSS, 2006.
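
One simple way to spot a plain (unobfuscated) universal redirector is to look for query parameters whose value is itself an absolute URL. The sketch below is a heuristic written for illustration, not the method used in the study; the example URL and its parameter names are hypothetical, and obfuscated or shortened destinations such as the ones listed above would not be caught.

```python
from urllib.parse import parse_qsl, urlsplit


def redirect_targets(url):
    """Return (parameter, value) pairs whose value looks like an absolute URL."""
    targets = []
    for name, value in parse_qsl(urlsplit(url).query):
        parts = urlsplit(value)
        if parts.scheme in ("http", "https") and parts.netloc:
            targets.append((name, value))
    return targets


# Hypothetical redirector URL with a percent-encoded destination.
candidate = ("http://redirector.example/click.cgi"
             "?num=2&url=http%3A%2F%2Fspam.example%2F")
found = redirect_targets(candidate)
```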
27
Related Work (Part 1)
  • Diff-based cloaking
    • Wu and Davison: Diff-based cloaking combined with
      content-based analysis
    • Our approach detects click-through cloaking
  • Content-based approaches
    • Fetterly, Manasse and Najork: URL properties,
      clustering pages of similar content
    • Mishne, Carmel, Lempel: Compared statistical
      models of comments and target pages against post
      content
    • Kolari, Finin and Joshi: Meta tag text, anchor
      text, URLs
    • Our approach is complementary to content-based
      approaches

28
Related Work (Part 2)
  • Measurements of trust
    • Metaxas et al.: Defined trust neighborhoods
    • Benczur et al.: SpamRank identifies outliers by
      looking at PageRank of the site and its
      supporters
    • Similarly, our approach propagates distrust by
      following redirections
  • Plugins to aid moderating forums/blogs
    • Akismet
    • Bad Behavior, Spam Karma
    • Our approach does not require cooperation from
      forum owners

29
Conclusions
  • Context-based approach successfully detects
    advanced cloaking- and redirection-based spam
  • Spammers are pervasive
    • 189 of 190 search terms returned spammed forums
      in the top 20 search results from both Google and
      MSN
    • Same spammer redirecting to two domains on
      Blogspoint and Blogstudio

30
Future work
  • There is hope!
  • Economic solution
  • Identifies middlemen in online advertising
  • Read our WWW'07 paper [1]
  • http://wwwcsif.cs.ucdavis.edu/niu
  • http://research.microsoft.com/csm/strider/

[1] Yi-Min Wang et al. Spam Double-Funnel:
Connecting Web Spammers with Advertisers. WWW
2007.