Blog Track Open Task: Spam Blog Detection

1
Blog Track Open Task: Spam Blog Detection
NIST Blog Pre-Track, 14 Nov 2006
  • Tim Finin

Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau
University of Maryland, Baltimore County
James Mayfield
Johns Hopkins University Applied Physics Laboratory
http://ebiquity.umbc.edu/paper/html/id/318/
2
Blogosphere Reputation at Stake!
3
Spam in the Blogosphere
  • Types: comment spam, ping spam, spam blogs
  • Akismet: 87% of all comments are spam
  • 75% of update pings are spam (ebiquity 2005)
  • 20% of the blogs indexed by popular blog search
    engines are spam (Umbria 2006, ebiquity 2005)
  • Spam blogs, sometimes referred to by the
    neologism splogs, are weblog sites which the
    author uses only for promoting affiliated
    websites [1]
  • Spings, or ping spam, are pings that are sent
    from spam blogs [1]

[1] Wikipedia
4
Advertisements in Profitable Contexts
Auto-generated and/or Plagiarized Content
Link Farms to promote affiliates
5
Why a problem?
  • The blogosphere is an increasingly important segment
    of the Web: 12 hours from post to Google index
  • Splog content provides no additional value
  • Splog content is often plagiarized
  • Splogs demote the value of authentic content
  • Splogs steal advertising (referral) revenue from
    authentic content producers
  • Splogs stress the blogosphere infrastructure
  • Splogs can skew Blog Analytics, as was observed
    in TREC Blog Track 2006

6
Nature of Splogs in TREC 2006
  • Around 83K identifiable blog home-pages in the
    collection, with 3.2M permalinks
  • 81K blogs could be processed
  • We use splog detection models developed on blog
    home-pages: 87% accuracy
  • We identified 13,542 splogs
  • Blacklisted 543K permalinks from these splogs
  • 16% of the entire collection
  • 17% splog posts injected into TREC dataset [1]

[1] The TREC Blog06 Collection: Creating and
Analyzing a Blog Test Collection, C. Macdonald,
I. Ounis
7
Impact of Splogs in TREC Queries
Cholesterol
Hybrid Cars
American Idol
8
Higher in Spam Prone Contexts
Card
Interest
Mortgage
Spam query terms based on analysis by Macdonald et
al. 2006
9
Splog Detection Task Proposal
  • Motivation
      • Detecting and eliminating spam is an essential
        requirement for any blog analysis
      • Splog detection has characteristics that set it
        apart from e-mail and web spam detection
  • Constraint
      • Simulate how blog search systems operate
  • Task Statement
      • Is an input permalink (post) spam?

10
Relation to E-mail Spam Detection
  • TREC has an E-mail Spam Classification Task
  • Similar in
      • Fast online spam detection
  • Different in
      • Nature of spamming: links, RSS feeds, web graph,
        metadata
      • Users targeted indirectly through search engines,
        e.g. N1ST is not relevant for a NIST query

11
Relation to Web Spam Detection
  • TREC does not have a web spam track
  • Similar in
      • Spamming web link structure
  • Different in
      • Coverage: blog analytics engines don't look
        beyond the blogosphere
      • Speed of detection is important: 150K posts/hour
      • Presence of structured text through RSS feeds
        presents new opportunities and challenges

12
Difficulty
  • We have been experimenting with multiple
    approaches since mid-2005
  • Data: http://ebiquity.umbc.edu/resource/html/id/212

13
Difficulty
  • Evolving spamming techniques and splog creation
    genres
  • Most basic spam techniques
      • Generate content by stuffing key dictionary words
      • Generate links to affiliates, through link dumps
        on blogrolls, linkrolls or after post content
  • Evolving spam techniques
      • Scrape contextually similar content to generate
        posts
      • RSS hijacking
      • Aggregation software, e.g. Planet X
      • Intersperse links randomly
      • Make link placement meaningful
      • Add spam comments and then ping. Repeat.
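
Keyword stuffing, the most basic technique listed above, can be illustrated with a very small feature: the fraction of a post's tokens that fall in a spam-prone vocabulary. The sketch below is only an illustration and is not the detection model used by the authors. The function name `spam_term_ratio` and the tokenization rule are assumptions, and the vocabulary simply reuses terms that appear elsewhere in these slides (card, interest, mortgage, coupon, casino).

```python
import re

# Hypothetical vocabulary, taken from terms mentioned on the slides.
SPAM_PRONE_TERMS = {"card", "interest", "mortgage", "coupon", "casino"}


def spam_term_ratio(text, vocabulary=SPAM_PRONE_TERMS):
    """Return the fraction of alphabetic tokens found in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # an empty post carries no evidence either way
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


stuffed = "casino coupon casino mortgage card coupon codes casino"
normal = "We watched the finale last night and talked about it over dinner"

stuffed_ratio = spam_term_ratio(stuffed)
normal_ratio = spam_term_ratio(normal)
```

A single ratio like this would be easy to evade with the evolving techniques in the list (scraped or woven content), which is one reason it could at most be one feature among many.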

14
(No Transcript)
15
Task Details - Dataset Creation
  • Similar to TREC Blog 2006, a collection of feeds,
    blog home-pages and permalinks
  • View dataset D as two sets: Dbase and Dtest
  • Dbase to span (n-x) days, and Dtest to span the
    remaining x days for x1
  • D could be collected as a combination of
      • D as collected in 2006
      • Sample a subset of pings from a ping server over
        the period that D is collected
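
The Dbase/Dtest division described above can be sketched as a split by calendar day: the last x days of the collection window go to Dtest and everything earlier goes to Dbase. This is a minimal sketch under stated assumptions. The function name `split_base_test`, the `(timestamp, permalink)` tuple layout and the example.org URLs are hypothetical, and the constraint on x is garbled in the slide text, so the code assumes only that x is at least 1.

```python
from datetime import datetime, timedelta


def split_base_test(posts, x_days):
    """Split (timestamp, permalink) pairs into (Dbase, Dtest).

    Dtest holds posts dated within the last `x_days` calendar days of
    the collection; Dbase holds all earlier posts.
    """
    if x_days < 1:
        # Assumption: the slide's condition on x is unreadable; we only
        # require a non-empty test window.
        raise ValueError("x_days must be at least 1")
    if not posts:
        return [], []
    last_day = max(ts for ts, _ in posts).date()
    first_test_day = last_day - timedelta(days=x_days - 1)
    d_base = [p for p in posts if p[0].date() < first_test_day]
    d_test = [p for p in posts if p[0].date() >= first_test_day]
    return d_base, d_test


posts = [
    (datetime(2006, 11, 10, 9, 0), "http://example.org/a/1"),
    (datetime(2006, 11, 11, 12, 30), "http://example.org/b/7"),
    (datetime(2006, 11, 12, 8, 15), "http://example.org/a/2"),
    (datetime(2006, 11, 12, 23, 59), "http://example.org/c/3"),
]
d_base, d_test = split_base_test(posts, x_days=1)
```

Splitting on time rather than at random keeps the test posts strictly later than anything a model was built on, which is the property the task relies on to imitate a live blog search system.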

16
Task Details - Assessment
  • Assessors classify each spam post into one or more
    classes, based on the kind of spam that the post,
    or the blog hosting it, features
      • Non-blog
      • Keyword-stuffed
      • Post-stitching
      • Post-plagiarism
      • Post-weaving
      • Blog/link-roll spam
  • Each assessment typically takes 1-2 minutes
  • Detailed assessment will enable participants to
    identify classes they handle well and where they
    can improve

17
  • Non-Blog ping at weblogs.com
  • No RSS Feeds
  • No Dated Entry, no comments
  • Possibly plagiarized content

18
  • Keyword Stuffed Blog
  • coupon codes, casino

19
  • Post Stitching
  • Excerpts scraped from other sources

20
  • Post Weaving
  • Spam Links contextually placed in post

21
  • Link-roll spam
  • With fully plagiarized text

22
Evaluation
  • Dbase distributed first, Dtest subsequently with
    50 independent sets of permalinks
  • Dbase, Dtest division will mimic how blog search
    engines operate
  • Build models to detect splogs using individual
    posts, feeds or blog homepages of what is seen
  • Detect spam in an incoming stream of new blog
    postings
  • Teams will be judged by how well they detect
    spamminess for new posts

23
Input/Output
<set>
  <num>...</num>
  <test>
    <permalink>
      <url>...</url>
      <homepage>...</homepage>
      <feed>...</feed>
      <when>...</when>
    </permalink>
    <permalink>
      ...
    </permalink>
    ...
  </test>
</set>

Each permalink is to be judged by participants.
Each set is an individual set of test input. 1 or y such sets
can be used, with each set biased to a specific
splog genre, blog publishing host or TLD.
Output format:
set Q0 docno rank prob runtag
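
To make the input and output formats concrete, the sketch below reads one test set (a set element holding a num and a test element with permalink entries, each carrying url, homepage, feed and when) and writes one output line per permalink in the `set Q0 docno rank prob runtag` layout. Several details are assumptions, not part of the proposal: the permalink URL stands in for `docno`, ranks start at 1 and follow descending spam probability, the probabilities are hand-written placeholders where a real classifier would go, and the names `read_test_set`, `format_run` and `demoRun` are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<set>
  <num>1</num>
  <test>
    <permalink>
      <url>http://example.org/blog/post-1</url>
      <homepage>http://example.org/blog/</homepage>
      <feed>http://example.org/blog/feed.xml</feed>
      <when>2006-11-14T10:00:00</when>
    </permalink>
    <permalink>
      <url>http://example.org/blog/post-2</url>
      <homepage>http://example.org/blog/</homepage>
      <feed>http://example.org/blog/feed.xml</feed>
      <when>2006-11-14T11:30:00</when>
    </permalink>
  </test>
</set>"""

FIELDS = ("url", "homepage", "feed", "when")


def read_test_set(xml_text):
    """Return (set number, list of permalink dicts) for one <set>."""
    root = ET.fromstring(xml_text)
    set_num = (root.findtext("num") or "").strip()
    entries = [
        {tag: (node.findtext(tag) or "").strip() for tag in FIELDS}
        for node in root.iter("permalink")
    ]
    return set_num, entries


def format_run(set_num, scored, runtag):
    """Format (docno, prob) pairs as ranked output lines."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [
        f"{set_num} Q0 {docno} {rank} {prob:.4f} {runtag}"
        for rank, (docno, prob) in enumerate(ranked, start=1)
    ]


set_num, entries = read_test_set(SAMPLE)
# Placeholder spam probabilities; a real system would compute these.
placeholder_probs = [0.12, 0.91]
scored = [(e["url"], p) for e, p in zip(entries, placeholder_probs)]
lines = format_run(set_num, scored, runtag="demoRun")
```

With the sample above, the first output line is `1 Q0 http://example.org/blog/post-2 1 0.9100 demoRun`, since the second permalink received the higher placeholder probability.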
24
Summary
  • Spam blogs present a major challenge to the
    quality of blog mining/analytics
  • Splog detection is different from spam detection
    in other communication platforms
  • Development of a TREC task will help further the
    state of the art
  • Task requirements can be easily aligned with the
    existing task of opinion identification