Blog Track Open Task: Spam Blog Detection



1
Blog Track Open Task: Spam Blog Detection
NIST Blog Pre-Track, 14 Nov 2006
  • Tim Finin

Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau
University of Maryland, Baltimore County
James Mayfield
Johns Hopkins University Applied Physics Laboratory
http://ebiquity.umbc.edu/paper/html/id/318/
2
Blogosphere Reputation at Stake!
3
Spam in the Blogosphere
  • Types: comment spam, ping spam, spam blogs
  • Akismet: 87% of all comments are spam
  • 75% of update pings are spam (ebiquity 2005)
  • 20% of the blogs indexed by popular blog search
    engines are spam (Umbria 2006, ebiquity 2005)
  • Spam blogs, sometimes referred to by the
    neologism splogs, are weblog sites which the
    author uses only for promoting affiliated
    websites
  • Spings, or ping spam, are pings that are sent
    from spam blogs

Definitions: Wikipedia
4
Advertisements in Profitable Contexts
Auto-generated and/or Plagiarized Content
Link Farms to promote affiliates
5
Why a problem?
  • Splog content provides no additional value
  • Splog content is often plagiarized
  • Splogs demote value of authentic content
  • Splogs steal advertising (referral) revenue from
    authentic content producers
  • Splogs stress the blogosphere infrastructure
  • Splogs can skew Blog Analytics, as was observed
    in TREC Blog Track 2006

6
Nature of Splogs in TREC 2006
  • Around 83K identifiable blog home-pages in the
    collection, with 3.2M permalinks
  • 81K blogs could be processed
  • We use splog detection models developed on blog
    home-pages, 87% accurate
  • We identified 13,542 splogs
  • Blacklisted 543K permalinks from these splogs
  • This accounts for 16% of the entire collection
  • Results tally, in percentage terms, with the TREC
    dataset (Macdonald et al. 2006)
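
As a rough check of the shares above, the following sketch recomputes them from the rounded counts quoted on this slide. Because the inputs are rounded, the result is only approximate, and the variable names are illustrative.

```python
# Rounded counts as quoted on the slide; the exact counts are not given.
total_permalinks = 3_200_000       # "3.2M permalinks"
blacklisted_permalinks = 543_000   # "543K permalinks"
processed_blogs = 81_000           # "81K blogs could be processed"
detected_splogs = 13_542           # "13,542 splogs"

# Share of permalinks that come from blogs flagged as splogs.
permalink_share = blacklisted_permalinks / total_permalinks
# Share of processed blogs that were flagged as splogs.
blog_share = detected_splogs / processed_blogs

print(f"blacklisted permalinks: {permalink_share:.1%}")
print(f"blogs flagged as splogs: {blog_share:.1%}")
```

With these rounded inputs both shares come out near 17%, close to the 16% quoted on the slide.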

7
Impact of Splogs in TREC Queries
Cholesterol
Hybrid Cars
American Idol
8
Higher in Spam Prone Contexts
Card
Interest
Mortgage
Spam query terms based on analysis by Macdonald et
al. 2006
9
Splog Detection Task Proposal
  • Motivation
    • Detecting and eliminating spam forms a key
      competency requirement of any blog analysis
    • Splog detection has characteristics that set it
      apart from e-mail and web spam detection
  • Constraint
    • Simulate how a blog search system operates
  • Task Statement
    • Is an input permalink (post) spam?

10
Relation to E-mail Spam Detection
  • TREC has an E-mail Spam Classification Task
  • Similar in
    • Fast online spam detection
  • Different in
    • Nature of spamming: links, RSS feeds
    • Users targeted indirectly through search engines
      (e.g. "n1st" is not relevant for a "nist" query)

11
Relation to Web Spam Detection
  • TREC does not have a web spam track
  • Similar in
    • Spamming web link structure
  • Different in
    • Coverage of blog analytics engines, which does
      not extend beyond the blogosphere
    • Speed of detection is crucial
    • Presence of structured text through RSS feeds
      presents new opportunities and challenges

12
Difficulty
  • We have been experimenting with multiple
    approaches since mid-2005
  • Dataset available at
    http://ebiquity.umbc.edu/blogger/

13
Difficulty
  • Evolving spamming techniques and splog creation
    genres
  • Most basic technique
    • Generate content by stuffing random dictionary
      words
    • Generate links to affiliates through link dumps
      on blogrolls, linkrolls or after post content
  • Evolving techniques
    • Scrape contextually similar content to generate
      posts
    • Intersperse links randomly
    • Make link placement meaningful

15
Task Details - Dataset Creation
  • Similar to TREC Blog 2006: a collection of feeds,
    blog home-pages and permalinks
  • View dataset D as two sets, Dbase and Dtest
  • Dbase spans the first (n-x) days of an n-day
    collection period and Dtest the remaining x days,
    with x = 1 or less
  • D could be collected as a combination of
    • D as collected in 2006
    • A sample of pings from a ping server over
      the period in which D is collected
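
The Dbase/Dtest split described above can be sketched as follows. This is a minimal illustration that assumes each item in D carries a crawl date and that x is a whole number of days. The function name `split_collection`, the tuple layout and the example URLs are hypothetical, not part of the proposal.

```python
from datetime import date, timedelta

def split_collection(posts, start, n_days, x_days=1):
    """Split (url, crawl_date) pairs into (d_base, d_test).

    d_base covers the first n_days - x_days days from `start`;
    d_test covers the remaining x_days days.
    """
    if not 0 < x_days < n_days:
        raise ValueError("x_days must be between 1 and n_days - 1")
    cutoff = start + timedelta(days=n_days - x_days)  # first day of Dtest
    end = start + timedelta(days=n_days)              # day after the last day of D
    d_base = [p for p in posts if start <= p[1] < cutoff]
    d_test = [p for p in posts if cutoff <= p[1] < end]
    return d_base, d_test

# Hypothetical one-week collection with the last day held out as Dtest.
posts = [
    ("http://example.org/a", date(2006, 12, 1)),
    ("http://example.org/b", date(2006, 12, 6)),
    ("http://example.org/c", date(2006, 12, 7)),
]
d_base, d_test = split_collection(posts, start=date(2006, 12, 1), n_days=7, x_days=1)
print(len(d_base), len(d_test))
```

Splitting on time rather than at random keeps Dtest strictly newer than Dbase, which matches the stated aim of simulating how a blog search system sees new posts.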

16
Task Details - Assessment
  • Assessors classify each spam post by the kind of
    spam that the post, or the blog hosting it, features
    • Non-blog
    • Keyword-stuffed
    • Post-stitching
    • Post-plagiarism
    • Post-weaving
    • Blog/link-roll
  • Each assessment typically takes 1-2 mins
  • Detailed assessment will enable participants to
    find which classes they do well on and where they
    can improve
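
One way participants might use such class-level assessments is to break detection rates down by spam class. The sketch below assumes assessments arrive as a mapping from permalink to class label and that a run is reduced to the set of permalinks it flagged. Both representations, the lowercase label spellings and the name `recall_by_class` are assumptions for illustration only.

```python
from collections import defaultdict

# Class labels taken from the slide, lowercased for use as keys.
SPAM_CLASSES = [
    "non-blog", "keyword-stuffed", "post-stitching",
    "post-plagiarism", "post-weaving", "blog/link-roll",
]

def recall_by_class(assessments, flagged):
    """Fraction of assessed spam posts in each class that a run flagged.

    assessments: {permalink: spam class}; flagged: set of permalinks marked spam.
    Classes with no assessed posts are left out of the result.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for permalink, spam_class in assessments.items():
        total[spam_class] += 1
        if permalink in flagged:
            caught[spam_class] += 1
    return {c: caught[c] / total[c] for c in SPAM_CLASSES if total[c]}

# Hypothetical assessments and run output.
assessments = {"p1": "keyword-stuffed", "p2": "keyword-stuffed", "p3": "post-weaving"}
flagged = {"p1", "p3"}
print(recall_by_class(assessments, flagged))
```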

17
Evaluation
  • Dbase distributed first, Dtest subsequently with
    50 independent sets of permalinks
  • The Dbase/Dtest division will mimic how blog
    search engines operate
    • Build models to detect splogs using individual
      posts, feeds or blog homepages from what has
      been seen
    • Detect spam in an incoming stream of new blog
      postings
  • Teams will be judged by how well they detect
    spamminess for new posts
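
The slides do not name an evaluation measure. Purely as an illustration of scoring a run over independent test sets, the sketch below averages per-set accuracy at a fixed probability threshold. The metric, the 0.5 threshold and all names and data are assumptions, not part of the proposal.

```python
from statistics import mean

def set_accuracy(probs, truth, threshold=0.5):
    """Fraction of permalinks in one test set whose spam/non-spam call is correct.

    probs: {permalink: predicted spam probability}
    truth: {permalink: True if assessed as spam}
    """
    correct = sum((probs[url] >= threshold) == is_spam for url, is_spam in truth.items())
    return correct / len(truth)

def mean_accuracy(runs, judgments):
    """Average per-set accuracy over all judged test sets, keyed by set number."""
    return mean(set_accuracy(runs[s], judgments[s]) for s in judgments)

# Two hypothetical test sets instead of the proposed 50.
runs = {1: {"u1": 0.9, "u2": 0.2}, 2: {"u3": 0.7, "u4": 0.6}}
judgments = {1: {"u1": True, "u2": False}, 2: {"u3": True, "u4": False}}
print(mean_accuracy(runs, judgments))
```

Averaging per set, rather than pooling all permalinks, gives each set equal weight in the same way topics are usually weighted in retrieval tasks.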

18
Input/Output
    <set>
      <num>...</num>
      <test>
        <permalink>
          <url>...</url>
          <homepage>...</homepage>
          <feed>...</feed>
          <when>...</when>
        </permalink>
        <permalink>
          ...
        </permalink>
        ...
      </test>
    </set>

Each permalink is to be judged by participants.
Each set is an individual test input. 50 such sets
can be used, similar to how topics are used in the
opinion identification task.
Output format:
    set Q0 docno rank prob runtag
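
A minimal sketch of reading one input set and emitting run lines in the output format above. The element names follow the slide. Everything else is an assumption for illustration: the permalink URL is used as docno, rank is ordered by descending prob, and the sample URLs, the date format in `when`, the toy scorer and the run tag `demo` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input set using the element names from the slide.
SAMPLE = """<set>
  <num>1</num>
  <test>
    <permalink>
      <url>http://example.org/my-trip</url>
      <homepage>http://example.org/</homepage>
      <feed>http://example.org/feed</feed>
      <when>2006-12-07</when>
    </permalink>
    <permalink>
      <url>http://example.org/cheap-loans</url>
      <homepage>http://example.org/</homepage>
      <feed>http://example.org/feed</feed>
      <when>2006-12-07</when>
    </permalink>
  </test>
</set>"""

def toy_scorer(permalink):
    """Placeholder for a real splog model: returns a made-up spam probability."""
    url = permalink.findtext("url", default="")
    return 0.9 if "cheap" in url else 0.1

def run_lines(xml_text, scorer, runtag):
    """Return 'set Q0 docno rank prob runtag' lines, most spam-like first."""
    root = ET.fromstring(xml_text)
    set_num = root.findtext("num").strip()
    scored = []
    for permalink in root.iter("permalink"):
        url = permalink.findtext("url").strip()   # assumed to serve as docno
        scored.append((scorer(permalink), url))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        f"{set_num} Q0 {url} {rank} {prob:.4f} {runtag}"
        for rank, (prob, url) in enumerate(scored, start=1)
    ]

lines = run_lines(SAMPLE, toy_scorer, "demo")
print("\n".join(lines))
```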
19
Summary
  • Spam blogs present a major challenge to the
    quality of blog mining/analytics
  • Splog detection is different from spam detection
    in other communication platforms
  • Development of a TREC task will help further the
    state of the art
  • Task requirements can be easily aligned with the
    existing task of opinion identification