Title: Blog Track Open Task: Spam Blog Detection
1 Blog Track Open Task Spam Blog Detection
NIST Blog Pre-Track, 14 Nov 2006
Pranam Kolari, Akshay Java, Tim Finin, Anupam
Joshi, Justin Martineau University of Maryland,
Baltimore County
James Mayfield Johns Hopkins University Applied
Physics Laboratory
http//ebiquity.umbc.edu/paper/html/id/318/
2Blogosphere Reputation at Stake!
3Spam in the Blogosphere
- Types comment spam, ping spam, spam blogs
- Akismet 87 of all comments are spam
- 75 of update pings are spam (ebiquity 2005)
- 20 of indexed blogs by popular blog search
engines is spam (Umbria 2006, ebiquity 2005) - Spam blogs, sometimes referred to by the
neologism splogs, are weblog sites which the
author uses only for promoting affiliated
websites 1 - Spings, or ping spam, are pings that are sent
from spam blogs 1
1Wikipedia
4Advertisements in Profitable Contexts
Auto-generated and/or Plagiarized Content
Link Farms to promote affiliates
5Why a problem?
- Blogosphere increasingly important segment of
Web 12 hours from post to Google index - Splog content provides no additional value
- Splog content is often plagiarized
- Splogs demote value of authentic content
- Splogs steal advertising (referral) revenue from
authentic content producers - Splogs stress the blogosphere infrastructure
- Splogs can skew Blog Analytics, as was observed
in TREC Blog Track 2006
6Nature of Splogs in TREC 2006
- Around 83K identifiable blog home-pages in the
collection, with 3.2M permalinks - 81K blogs could be processed
- We use splog detection models developed on blog
home-pages 87 accuracy - We identified 13,542 splogs
- Blacklisted 543K permalinks from these splogs
- 16 of the entire collection
- 17 splog posts injected into TREC dataset1
1The TREC Blog06 Collection Creating and
Analyzing a Blog Test Collection C. Macdonald,
I. Ounis
7Impact of Splogs in TREC Queries
Cholesterol
Hybrid Cars
American Idol
8Higher in Spam Prone Contexts
Card
Interest
Mortgage
Spam query terms based on analysis by McDonald et
al 2006 ..
9Splog Detection Task Proposal
- Motivation
- Detecting and eliminating spam is an essential
requirement for any blog analysis - Splog detection has characteristics that set it
appart from e-mail and web spam detection - Constraint
- Simulate how blog search systems operate
- Task Statement
- Is an input permalink (post) spam?
10Relation to E-mail Spam Detection
- TREC has an E-mail Spam Classification Task
- Similar in
- Fast online spam detection
- Different in
- Nature of spamming links, RSS feeds, web graph,
metadata - Users targeted indirectly through search engines,
e.g. N1ST not relevant for NIST query
11Relation to Web Spam Detection
- TREC does not have a web spam track
- Similar in
- Spamming web link structure
- Different in
- Coverage Blog Analytics Engines dont look
beyond blogosphere - Speed of detection is important, 150K posts/hour
- Presence of structured text through RSS feeds
presents new opportunities, and challenges
12Difficulty
- We have been experimenting with multiple
approaches starting mid 2005 - Data http//ebiquity.umbc.edu/resource/html/id/21
2
13Difficulty
- Evolving spamming techniques and splog creation
genres - Most basic technique spam techniques
- Generate content by stuffing key dictionary words
- Generate link to affiliates, through link dumps
on blogrolls, linkrolls or after post content - Evolving spam techniques
- Scrape contextually similar content to generate
posts - RSS hijacking
- Aggregation software, e.g. Planet X
- Intersperse links randomly
- Make link placement meaningful
- Add spam comments and then ping. Repeat.
14(No Transcript)
15Task Details - Dataset Creation
- Similar to TREC Blog 2006, a collection of feeds,
blog home-pages and permalinks - View dataset D as two sets Dbase , Dtest
- Dbase to span (n-x) days, and Dtest to span the
rest of x days for x1 - D could collected as a combination of
- D as collected in 2006
- Sample a subset of pings from a ping server over
the period that D is collected
16Task Details - Assessment
- Assessors classify spam post into one or more
classes based on the kind of spam this post, or
the blog hosting it features - Non-blog
- Keyword-stuffed
- Post-stitching
- Post-plagiarism
- Post-weaving
- Blog/link-roll spam
- Each assessment typically takes 1-2 minutes
- Detailed assessment will enable participants to
identify classes they handle well and where they
can improve
17- Non-Blog ping at weblogs.com
- No RSS Feeds
- No Dated Entry, no comments
- Possibly plagiarized content
18- Keyword Stuffed Blog
- coupon codes, casino
19- Post Stitching
- Excerpts scraped from other sources
20- Post Weaving
- Spam Links contextually placed in post
21- Link-roll spam
- With fully plagiarized text
22Evaluation
- Dbase distributed first, Dtest subsequently with
50 independent sets of permalinks - Dbase, Dtest division will mimic how blog search
engines operate - Build models to detect splogs using individual
posts, feeds or blog homepages of what is seen - Detect spam in an incoming stream of new blog
postings - Teams will be judged by how well they detect
spamminess for new posts
23Input/Output
- ltsetgt
- ltnumgt...lt/numgt
- lttestgt
- ltpermalinkgt
- lturlgt...lt/urlgt
- lthomepagegt...lt/homepagegt
- ltfeedgt...lt/feedgt
- ltwhengt... lt/whengt
- lt/permalinkgt
- ltpermalinkgt
- ...
- lt/permalinkgt
- ...
- lt/testgt
- lt/setgt
Each permalink to be judged by participants
Individual set of test input. 1 or y such sets
can be used, with each set biased to a specific
splog genre, blog Publishing host or TLD
Output format
set Q0 docno rank prob runtag
24Summary
- Spam Blogs present a major challenge to the
quality of blog mining/analytics - Splog Detection is different from spam in other
communication platforms - Development of TREC Task will help furthering
state of the art - Task requirements can be easily aligned with
existing task of opinion identification