BlogVox: Separating Blog Wheat from Blog Chaff

1
BlogVox: Separating Blog Wheat from Blog Chaff
  • Akshay Java, Pranam Kolari, Tim Finin, Anupam
    Joshi, Justin Martineau (UMBC)
  • James Mayfield (JHU/APL)

2
Motivation: Cleaning the Harvest
  • BlogVox: a blog analytics engine developed for
    the TREC 2006 Blog Track.
  • The presence of spam blogs (splogs) and
    extraneous content waters down the quality of
    the index.
  • Narrowing down to the content of the post is
    essential in the absence of clearly demarcated
    opinion sentences (as on Epinions, IMDb, Amazon,
    etc.).
  • Noisy and unstructured text on the Blogosphere
    can skew blog analytics and business
    intelligence tools (as observed in TREC 2006).

3
BlogVox Opinion Extraction System
  • TREC 2006: finding opinionated posts, either
    positive or negative, about a query
  • 2006 TREC Blog corpus
      • 80K blogs
      • 300K posts
      • 50 test queries
  • BlogVox opinion extraction system
      • Document- and sentence-level scorers
      • Scores combined using an SVM meta-learner
      • Data cleaning: splogs and post identification
  • BlogVox challenges
      • Data cleaning and splog removal
      • Slang
      • Semantic orientation of words
      • Contradictions, sarcasm, ungrammatical text

4
Separating Blog Wheat from Blog Chaff
  • Data cleaning for:
      • Splog removal
      • Post content identification

5
Spam in the Blogosphere
  • Types: comment spam, ping spam, splogs
  • Akismet: 87% of all comments are spam
  • 75% of update pings are spam (ebiquity 2005)
  • 56% of blogs are spam (ebiquity 2005)
  • 20% of the blogs indexed by popular blog search
    engines are spam (Umbria 2006, ebiquity 2005)
  • Spam blogs (splogs) are weblogs used to promote
    affiliated websites or host ads
  • Spings, or ping spam, are pings that are sent
    from spam blogs

6
Motivation: host ads
7
Motivation: index affiliates, promote PageRank
8
Data Cleaning: Splogs
Figure labels: host ads; index affiliates, promote PageRank; plagiarized content
  • Splog detection using SVM
  • 700 blogs, 700 splogs used for training
  • Model based on blog homepage and local blog
    features
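
The slide names the classifier (an SVM trained on 700 blogs and 700 splogs) but not the exact features or solver. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses normalized bag-of-words counts of the homepage text as features and a Pegasos-style stochastic subgradient solver for a linear SVM. The function names, hyperparameters, and the six toy training texts are all invented for illustration.

```python
import random
import re
from collections import Counter


def bag_of_words(text):
    """Normalized term frequencies of the page text (an assumed feature set)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}


def score(weights, features):
    """Signed decision value w . x; words unseen in training contribute nothing."""
    return sum(weights.get(word, 0.0) * value for word, value in features.items())


def train_linear_svm(examples, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent on the regularized hinge loss.

    examples: list of (feature_dict, label), label +1 for splog, -1 for blog.
    """
    rng = random.Random(seed)
    weights = {}
    step = 0
    for _ in range(epochs):
        order = list(examples)
        rng.shuffle(order)
        for features, label in order:
            step += 1
            eta = 1.0 / (lam * step)
            violated = label * score(weights, features) < 1.0
            # Shrink all weights (gradient of the L2 regularizer).
            for word in weights:
                weights[word] *= 1.0 - eta * lam
            # Hinge-loss subgradient step, only when the margin is violated.
            if violated:
                for word, value in features.items():
                    weights[word] = weights.get(word, 0.0) + eta * label * value
    return weights


def is_splog(weights, text):
    """Classify a homepage text with the trained linear model."""
    return score(weights, bag_of_words(text)) > 0.0


# Toy stand-in for the 700 blog / 700 splog training set (invented examples).
SPLOGS = [
    "cheap casino coupon codes buy now casino bonus",
    "free coupon casino poker bonus cheap deals",
    "buy cheap pills coupon codes casino",
]
BLOGS = [
    "went to school today and had lunch with friends",
    "my friends and i talked about hybrid cars today",
    "we watched the show last night and chatted with friends",
]
training = [(bag_of_words(t), +1) for t in SPLOGS]
training += [(bag_of_words(t), -1) for t in BLOGS]
model = train_linear_svm(training)

print(is_splog(model, "casino coupon bonus"))
print(is_splog(model, "lunch with friends last night"))
```

In practice an off-the-shelf SVM package and the richer homepage and local blog features mentioned on the slide would replace both the toy solver and the toy features.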

Splog Detection Performance
9
Nature of Splogs in TREC 2006
  • Around 83K identifiable blog home-pages in the
    collection, with 3.2M permalinks
  • 81K blogs could be processed
  • We use splog detection models developed on blog
    home-pages: 87% accuracy
  • We identified 13,542 splogs
  • Blacklisted 543K permalinks from these splogs
  • 16% of the entire collection
  • 17% splog posts injected into the TREC dataset [1]

[1] The TREC Blog06 Collection: Creating and
Analyzing a Blog Test Collection. C. Macdonald,
I. Ounis
10
Impact of Splogs in TREC Queries
Figure panels: Cholesterol, Hybrid Cars, American Idol
11
Higher in Spam Prone Contexts
Figure panels: Card, Interest, Mortgage
Spam query terms based on the analysis by
Macdonald et al. (2006)
12
Separating Blog Wheat from Blog Chaff
  • Data cleaning for:
      • Splog removal
      • Post content identification

13
Data Cleaning: Content Identification
Figure labels: Navigation, Ads, Post content, Recent Posts
14
Data cleaning: Baseline heuristic
Figure labels: Navigational Links, Post Content, Sidebar, Ads
  • Eliminate link a if there exists a link b such
    that:
  • b is within a threshold distance of a
  • There are no title tags between the links
  • The average length of text-bearing nodes is
    less than a threshold
  • b is the nearest link to a
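
The rule above can be sketched as follows. This is a hedged illustration, not the BlogVox code: it flattens the page into document-order nodes instead of walking a real DOM tree, measures distance in nodes, treats headings and `<title>` as "title tags", and averages the length of the text nodes lying between the two links. The names `NodeCollector`, `flatten`, `extraneous_links`, and both threshold values are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: "title tags" means <title> and the heading tags.
TITLE_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}


class NodeCollector(HTMLParser):
    """Flatten a page into nodes: ('link', anchor text), ('text', text), ('title', '')."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._link_text = None  # list of text chunks while inside <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_text = []
        elif tag in TITLE_TAGS:
            self.nodes.append(("title", ""))

    def handle_endtag(self, tag):
        if tag == "a" and self._link_text is not None:
            self.nodes.append(("link", "".join(self._link_text).strip()))
            self._link_text = None

    def handle_data(self, data):
        if self._link_text is not None:
            self._link_text.append(data)
        elif data.strip():
            self.nodes.append(("text", data.strip()))


def flatten(html):
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extraneous_links(nodes, max_distance=3, max_avg_text_len=20):
    """Return the anchor text of every link a that the heuristic eliminates."""
    positions = [i for i, (kind, _) in enumerate(nodes) if kind == "link"]
    removed = []
    for idx, a in enumerate(positions):
        neighbours = positions[max(idx - 1, 0):idx] + positions[idx + 1:idx + 2]
        if not neighbours:
            continue
        b = min(neighbours, key=lambda p: abs(p - a))  # b is the nearest link to a
        low, high = sorted((a, b))
        between = nodes[low + 1:high]
        if high - low > max_distance:
            continue  # b is not within the distance threshold
        if any(kind == "title" for kind, _ in between):
            continue  # a title tag separates the two links
        lengths = [len(text) for kind, text in between if kind == "text"]
        average = sum(lengths) / len(lengths) if lengths else 0.0
        if average < max_avg_text_len:
            removed.append(nodes[a][1])
    return removed


PAGE = (
    '<div><a href="/">Home</a> | <a href="/archive">Archive</a> | '
    '<a href="/about">About</a></div>'
    "<h2>My trip</h2>"
    "<p>I went to school early so I would have time to grab some lunch. "
    'It was <a href="http://example.com/menu">a crappy sandwich</a> '
    "and a chai latte.</p>"
)
print(extraneous_links(flatten(PAGE)))
```

On this toy page the three tightly packed navigation links are eliminated, while the link inside the post body survives because its nearest link is too far away.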

An example DOM tree
15
Data cleaning: SVM cleaner
  • Random collection of 150 blog posts
  • Human evaluation of 400 links, tagged as content
    or extraneous links
  • We trained an SVM with a linear kernel for this
    analysis

DOM Features
Tag Features
Position Features
Word Features
Evaluation
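
The transcript keeps only the names of the four feature groups (DOM, tag, position, word), not the individual features. The sketch below shows what one feature row per link could look like before it is passed to a linear-kernel SVM. Every concrete feature here (`dom_depth`, `tag_parent_is_li`, `pos_relative`, `word_count`, `word_is_nav`), the `NAV_WORDS` list, and the class and function names are hypothetical choices for illustration, not the features used by BlogVox.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Hypothetical list of words typical of navigational anchors.
NAV_WORDS = {"home", "archive", "archives", "about", "next", "previous"}


class LinkFeatureExtractor(HTMLParser):
    """Record, for every <a>, where it sits in the tag structure of the page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.links = []      # one raw record per completed link
        self.tag_count = 0   # start tags seen so far
        self._current = None

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag in VOID_TAGS:
            return
        if tag == "a":
            self._current = {
                "depth": len(self.stack),
                "parent": self.stack[-1] if self.stack else "",
                "tag_index": self.tag_count,
                "text": "",
            }
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def link_features(html):
    """Return one feature dict per link, grouped by the slide's four categories."""
    parser = LinkFeatureExtractor()
    parser.feed(html)
    parser.close()
    total = max(parser.tag_count, 1)
    rows = []
    for link in parser.links:
        words = link["text"].lower().split()
        rows.append({
            "dom_depth": link["depth"],                              # DOM feature
            "tag_parent_is_li": 1 if link["parent"] == "li" else 0,  # tag feature
            "pos_relative": link["tag_index"] / total,               # position feature
            "word_count": len(words),                                # word feature
            "word_is_nav": 1 if any(w in NAV_WORDS for w in words) else 0,
        })
    return rows


PAGE = (
    '<html><body><ul><li><a href="/">Home</a></li>'
    '<li><a href="/archive">Archive</a></li></ul>'
    '<div><p>Lunch was <a href="http://example.com/review">a crappy sandwich</a> '
    "today.</p></div></body></html>"
)
for row in link_features(PAGE):
    print(row)
```

Each row would be paired with its human label (content or extraneous) from the 400 tagged links to form the training set.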
16
Data Cleaning: Effect of sidebar content
17
Related Work
  • Web Spam Detection
      • Coverage: blog analytics engines don't look
        beyond the Blogosphere
      • Speed of detection is important: 150K
        posts/hour
      • RSS feeds present new opportunities and
        challenges
  • Email Spam Detection
      • Nature of spamming: links, RSS feeds, web
        graph, metadata
      • Users are targeted indirectly through search
        engines, e.g. N1ST is not relevant for a NIST
        query
  • Template Detection
      • Repeated structural components detected via
        sampling
      • Customization and use of JavaScript and AJAX
        are increasing
      • Simple heuristics using DOM traversal work
        well in general cases
  • Sentiment Analysis
      • Open-domain opinion extraction is complex
      • Opinions are part of a narrative
      • The subject about which the opinion is
        expressed is not easy to detect

18
Conclusions
  • Noisy content on the Blogosphere presents a major
    challenge to the quality of blog analytics tools.
  • A combination of heuristics and ML can be used
    to clean the data effectively.
  • Ongoing work
      • DOM subtree elimination
      • Identifying the subject of the opinion
      • Slang
      • More training examples!

19
Thank you!
http://ebiquity.umbc.edu/
20
Backup Slides
21
Opinions in Social Media
Reader's perspective: Starbucks sandwiches are
bad!
  • I went to school early so I would have time to
    grab some lunch. Which ended up consisting of a
    crappy sandwich from starbucks and a chai latte.
    Lacey came into Starbucks while I was there so we
    chatted for a little bit and she thought that I
    might be in her class. After I finished eating I
    headed to school and checked the board.. [1]

Narrative
Expressed Opinions
Opinions can influence customers' buying
decisions
[1] http://annamay13x.livejournal.com/7061.html
22
  • Keyword Stuffed Blog
  • coupon codes, casino

23
  • Post Stitching
  • Excerpts scraped from other sources

24
  • Post Weaving
  • Spam Links contextually placed in post

25
  • Link-roll spam
  • With fully plagiarized text

26
Difficulty
  • We have been experimenting with multiple
    approaches since mid-2005
  • Data: http://ebiquity.umbc.edu/resource/html/id/212

27
Difficulty
  • Evolving spamming techniques and splog creation
    genres
  • Most basic spam techniques
      • Generate content by stuffing key dictionary
        words
      • Generate links to affiliates through link
        dumps on blogrolls, linkrolls or after post
        content
  • Evolving spam techniques
      • Scrape contextually similar content to
        generate posts
      • RSS hijacking
      • Aggregation software, e.g. Planet X
      • Intersperse links randomly
      • Make link placement meaningful
      • Add spam comments and then ping. Repeat.

28
TREC Submissions (Topic Relevance)
29
TREC Submissions (Opinion Extraction)