Title: BlogVox: Separating Blog Wheat from Blog Chaff
1. BlogVox: Separating Blog Wheat from Blog Chaff
- Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC)
- James Mayfield (JHU/APL)
2. Motivation: Cleaning the Harvest
- BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
- The presence of spam blogs (splogs) and extraneous content waters down the quality of the index.
- Narrowing down to the content of the post is essential in the absence of clearly demarcated opinion sentences (as found on Epinions, IMDB, Amazon, etc.).
- Noisy and unstructured text in the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).
3. BlogVox Opinion Extraction System
- TREC 06: finding opinionated posts, either positive or negative, about a query
- 2006 TREC Blog corpus
  - 80K blogs
  - 300K posts
  - 50 test queries
- BlogVox opinion extraction system
  - Document- and sentence-level scorers
  - Combined scores using an SVM meta-learner
  - Data cleaning: splogs and post identification
- BlogVox challenges
  - Data cleaning and splog removal
  - Slang
  - Semantic orientation of words
  - Contradictions, sarcasm, ungrammatical text
4. Separating Blog Wheat from Blog Chaff
- Data cleaning for
  - Splog removal
  - Post content identification
5. Spam in the Blogosphere
- Types: comment spam, ping spam, splogs
- Akismet: 87% of all comments are spam
- 75% of update pings are spam (ebiquity 2005)
- 56% of blogs are spam (ebiquity 2005)
- 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
- Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads
- Spings, or ping spam, are pings sent from spam blogs
6. Motivation: host ads
7. Motivation: index affiliates, promote PageRank
8. Data Cleaning: Splogs
Host ads
Index affiliates, promote PageRank
Plagiarized content
- Splog detection using SVM
  - 700 blogs and 700 splogs used for training
  - Model based on blog home-page and local blog features
Splog Detection Performance
9. Nature of Splogs in TREC 2006
- Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks
- 81K blogs could be processed
- We used splog detection models developed on blog home-pages (87% accuracy)
- We identified 13,542 splogs
- Blacklisted 543K permalinks from these splogs
- 16% of the entire collection
- 17% splog posts injected into the TREC dataset [1]
[1] The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection. C. Macdonald, I. Ounis.
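As a quick sanity check on the proportions quoted on this slide, the ratios can be recomputed from the rounded counts given there. The variable names are illustrative, and because the inputs are rounded (83K, 81K, 3.2M, 543K) the results are only approximate. The slide does not say which ratio its percentage refers to, so the sketch computes all three candidates.

```python
# Rounded counts as quoted on the slide; variable names are illustrative.
home_pages = 83_000        # identifiable blog home-pages
processed_blogs = 81_000   # blogs that could be processed
permalinks = 3_200_000     # permalinks in the collection
splogs = 13_542            # splogs identified
blacklisted = 543_000      # permalinks blacklisted from those splogs


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


splogs_of_home_pages = pct(splogs, home_pages)
splogs_of_processed = pct(splogs, processed_blogs)
blacklisted_of_permalinks = pct(blacklisted, permalinks)

print(splogs_of_home_pages, splogs_of_processed, blacklisted_of_permalinks)
```

All three ratios fall between roughly 16% and 17%, which is consistent with the figures on the slide.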
10. Impact of Splogs in TREC Queries
(Figure panels: Cholesterol, Hybrid Cars, American Idol)
11. Higher in Spam-Prone Contexts
(Figure panels: Card, Interest, Mortgage)
Spam query terms based on the analysis by Macdonald et al. 2006.
12. Separating Blog Wheat from Blog Chaff
- Data cleaning for
  - Splog removal
  - Post content identification
13. Data Cleaning: Content Identification
(Page regions labeled in the figure: Navigation, Ads, Post content, Recent Posts)
14. Data Cleaning: Baseline Heuristic
(Page regions labeled in the figure: Navigational Links, Post Content, Sidebar, Ads)
- Eliminate link a if there exists a link b such that:
  - b is within ? distance of a
  - there are no title tags between the links
  - the average length of text-bearing nodes is less than a threshold
  - b is the nearest link to a
An example DOM tree
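The link-elimination rule on this slide can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not in the slides: the page is already flattened into a document-order list of nodes, distance is counted in node positions, "title tags" are represented by nodes of kind "title", the average text length is taken over the text nodes lying between the two links, and both thresholds (`max_distance`, `max_avg_text_len`) are hypothetical values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    kind: str       # "link", "title", or "text" (assumed node categories)
    text: str = ""


def extraneous_links(nodes: List[Node], max_distance: int = 3,
                     max_avg_text_len: float = 20.0) -> List[int]:
    """Return indices of link nodes that the heuristic would eliminate."""
    link_positions = [i for i, n in enumerate(nodes) if n.kind == "link"]
    eliminated = []
    for idx, a in enumerate(link_positions):
        # Candidate b: the nearest other link in document order.
        neighbours = []
        if idx > 0:
            neighbours.append(link_positions[idx - 1])
        if idx + 1 < len(link_positions):
            neighbours.append(link_positions[idx + 1])
        if not neighbours:
            continue
        b = min(neighbours, key=lambda p: abs(p - a))
        if abs(b - a) > max_distance:
            continue  # b is too far away
        between = nodes[min(a, b) + 1:max(a, b)]
        if any(n.kind == "title" for n in between):
            continue  # a title separates the two links
        lengths = [len(n.text) for n in between if n.kind == "text"]
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        if avg_len < max_avg_text_len:
            eliminated.append(a)
    return eliminated


# Made-up page: three navigation links, then a titled post with one inline link.
page = [
    Node("link", "Home"), Node("link", "About"), Node("link", "Archive"),
    Node("title", "My trip"),
    Node("text", "We left early in the morning and drove north for hours."),
    Node("link", "a photo"),
    Node("text", "The weather held up until we reached the coast."),
]
removed = extraneous_links(page)
print(removed)
```

In this made-up page the three adjacent navigation links are flagged, while the link inside the post is kept because a title node lies between it and its nearest neighbouring link.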
15. Data Cleaning: SVM Cleaner
- Random collection of 150 blog posts
- Human evaluation of 400 links, tagged as content or extraneous links
- We trained an SVM with a linear kernel for this analysis
DOM Features
Tag Features
Position Features
Word Features
Evaluation
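To make the idea of a linear-kernel SVM over per-link features concrete, here is a small self-contained sketch. It is not the authors' implementation: the slides do not name a trainer or list the individual features, so the code uses a Pegasos-style stochastic sub-gradient method without a bias term as a stand-in, two hypothetical centred features (surrounding text amount and sibling link count, each scaled to [-1, 1]), and a synthetic toy data set invented purely for illustration.

```python
import random
from typing import List, Sequence


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, epochs: int = 50,
                     seed: int = 0) -> List[float]:
    """Pegasos-style stochastic sub-gradient descent for a bias-free linear SVM."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1.0:  # hinge loss is active: move towards the example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Return +1 (content link) or -1 (extraneous link)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1


# Synthetic toy data (not from the study).
# Feature 0: amount of text around the link; feature 1: number of sibling links.
X = [
    (0.8, -0.6), (0.5, -0.9), (0.9, -0.2), (0.3, -0.7),   # content links
    (-0.7, 0.9), (-0.4, 0.6), (-0.9, 0.3), (-0.2, 0.8),   # extraneous links
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w = train_linear_svm(X, y)
predictions = [predict(w, x) for x in X]
print(w, predictions)
```

The toy set is deliberately easy to separate, so it says nothing about the accuracy reported in the talk. With real data one would hold out part of the 400 labeled links for evaluation.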
16. Data Cleaning: Effect of Sidebar Content
17. Related Work
- Web Spam Detection
  - Coverage: blog analytics engines don't look beyond the Blogosphere
  - Speed of detection is important: 150K posts/hour
  - RSS feeds present new opportunities and challenges
- Email Spam Detection
  - Nature of spamming: links, RSS feeds, web graph, metadata
  - Users are targeted indirectly through search engines, e.g. N1ST is not relevant for a NIST query
- Template Detection
  - Repeated structural components detected via sampling
  - Customization and the use of JavaScript and AJAX are increasing
  - Simple heuristics using DOM traversal work well in general cases
- Sentiment Analysis
  - Open-domain opinion extraction is complex
  - Opinions are part of a narrative
  - The subject about which the opinion is expressed is not easy to detect
18. Conclusions
- Noisy content in the Blogosphere presents a major challenge to the quality of blog analytics tools.
- A combination of heuristics and ML can be used to clean the data effectively.
- Ongoing work
  - DOM subtree elimination
  - Identifying the subject of the opinion
  - Slang
  - More training examples!
19. Thank you!
http://ebiquity.umbc.edu/
20. Backup Slides
21. Opinions in Social Media
Reader's Perspective: Starbucks Sandwiches are bad!
- I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board.. [1]
Narrative
Expressed Opinions
Opinions can influence buying decisions of customers
[1] http://annamay13x.livejournal.com/7061.html
22. Keyword-Stuffed Blog
- Coupon codes, casino
23. Post Stitching
- Excerpts scraped from other sources
24. Post Weaving
- Spam links contextually placed in post
25. Link-roll Spam
- With fully plagiarized text
26. Difficulty
- We have been experimenting with multiple approaches starting mid-2005
- Data: http://ebiquity.umbc.edu/resource/html/id/212
27. Difficulty
- Evolving spamming techniques and splog creation genres
- Most basic spam techniques
  - Generate content by stuffing key dictionary words
  - Generate links to affiliates through link dumps on blogrolls, linkrolls, or after post content
- Evolving spam techniques
  - Scrape contextually similar content to generate posts
  - RSS hijacking
  - Aggregation software, e.g. Planet X
  - Intersperse links randomly
  - Make link placement meaningful
  - Add spam comments and then ping. Repeat.
28. TREC Submissions (Topic Relevance)
29. TREC Submissions (Opinion Extraction)