Title: The predictive power of online chatter
1. The predictive power of online chatter
- D. Gruhl, R. Guha, R. Kumar, J. Novak, A. Tomkins
Presented by R. Paiu
2. Motivation & Methodology
- Goal
- Demonstrate that online blog postings can be used to predict spikes in sales ranks
- Methodology
- Blog postings <-> Amazon book sales
- Create hand-crafted queries and use them for predictions
- Use queries to discover blog posts discussing a given product
- Plot sales ranks and number of blog posts as two time series
- Implement and test automated query generation
- Develop automated prediction algorithms
3. Data Sources
- IBM WebFountain Project
- Contains hundreds of thousands of postings
- Data sources include 300,000 blogs and 200,000 articles
- Approximately 200,000 new blog postings stored daily
- Amazon.com Sales Rank Data
- Collected rank data for 120 days (July to October 2004) for all books in the top 300
- In total, 2,430 books were included in the data set
4. Correlation of Blog Posts and Sales Rank
- Steps
- Detecting spikes in sales ranks
- Locating blog mentions about corresponding book
- Plotting and correlating time series
5. Detecting Spikes in Sales Ranks (Step 1)
- Minimal rank m must occur more than 2 weeks from the start and the end of the considered period (120 days)
- One week before and after m occurs, the rank value must be greater than max(m + 50, 1.5m)
- 50 books (out of 2,430) contain a spike
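The spike test on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `find_spike` and the parameters `margin_days` and `window_days` are hypothetical, and the bound is assumed to read max(m + 50, 1.5m).

```python
from typing import Optional, Sequence


def find_spike(ranks: Sequence[int],
               margin_days: int = 14,
               window_days: int = 7) -> Optional[int]:
    """Return the day index of the minimal rank m if it qualifies as a spike, else None.

    ranks holds one sales rank per day (lower number = better rank).
    """
    if not ranks:
        return None
    # Locate the minimal (best) rank m over the whole period.
    day = min(range(len(ranks)), key=lambda i: ranks[i])
    m = ranks[day]
    # m must occur more than two weeks from both the start and the end.
    if day <= margin_days or day >= len(ranks) - 1 - margin_days:
        return None
    # Assumed bound: max(m + 50, 1.5m).
    bound = max(m + 50, 1.5 * m)
    # One week before and one week after m, the rank must exceed the bound.
    if ranks[day - window_days] > bound and ranks[day + window_days] > bound:
        return day
    return None
```

Applied to each of the 2,430 rank series in turn, a check of this kind would keep only the books whose best rank is both well inside the observation period and clearly separated from the ranks a week earlier and later.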
6. Locating Blog Mentions about Books (Step 2)
- Retrieve blog mentions about the 50 books
- Hand-crafted queries produced by users
- Iteratively refined until results seem to match the desired book
7. Plotting and Correlating Time Series (Step 3)
- Plot the sales rank of the product and the number of corresponding postings vs. time
- y-axis is scaled so that values lie between 0 and 1
- Best lag k: the lag at which the cross-correlation is maximum
- Blog mentions lead the sales rank if the best lag is negative
- Blog mentions trail the sales rank if the best lag is positive
- Results
- Sales rank spikes potentially correlated with blog activity
- Spikes in sales rank may occur despite insignificant blog activity
- Causes include discount pricing, bulk buying of books, etc.
- Highly correlated spikes in sales ranks and blog mentions were obtained for 10 books
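The scaling and best-lag computation described on this slide might look roughly like the sketch below. It makes several assumptions that are not in the slides: both series are daily and equally long, `sales` is oriented so that higher values mean more sales (for example, a negated rank), the correlation measure is Pearson's, and lags are searched within a hypothetical `max_lag` of 14 days. A negative lag means blog mentions lead.

```python
import numpy as np


def scale01(series):
    """Rescale a series linearly so that its values lie in [0, 1]."""
    x = np.asarray(series, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def best_lag(sales, mentions, max_lag=14):
    """Return (lag, r) maximizing the correlation of sales[t] with mentions[t + lag].

    A negative lag pairs each sales value with earlier mentions (blogs lead);
    a positive lag pairs it with later mentions (blogs trail).
    """
    s, b = scale01(sales), scale01(mentions)
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            a, c = s[-lag:], b[:lag]
        elif lag > 0:
            a, c = s[:-lag], b[lag:]
        else:
            a, c = s, b
        # Correlation is undefined for a constant slice, so skip it.
        if a.std() == 0 or c.std() == 0:
            continue
        r = np.corrcoef(a, c)[0, 1]
        if r > best[1]:
            best = (lag, r)
    return best
```

For example, if a bump in mentions precedes an identically shaped bump in sales by five days, `best_lag` returns a lag of -5 with a correlation close to 1.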
8. Automatic Query Generation
- Generation algorithm
- Based on the names of the authors of the books
- 1900 US Census used for estimating name ambiguity
- Experimental results
- 182 automated queries generated
- 45 of queries have cross-correlation > 0.5
- Cross-correlations > 0.5 correspond to strong correlations between blog mentions and sales rank
- 35 of those queries have best lag < 2 weeks
- Blog posts are more likely to lead rank spikes than to trail them
9. Sales Rank Prediction
- Given the time series representing sales data up to a point t, does the addition of blog mention data for the same period improve the prediction of what will happen to the sales rank?
- Outcomes
- MOTION & VOLATILITY: Predicting whether tomorrow's sales rank for a particular book will be higher or lower appears to be hard
- SPIKES: Analysis of blog data up to a point t makes it possible to predict effectively when there will be a future spike in the sales rank, without recourse to information from the future and even without recourse to the history of sales ranks
10. Predicting Motion & Volatility
- Analysis of various predictor algorithms
- Volatility prediction: predict whether tomorrow's sales rank will differ from today's by more than a certain threshold value
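To make the two prediction targets concrete, the sketch below derives the ground-truth labels that a motion or volatility predictor would be scored against. The function names and the `threshold` parameter are illustrative; the slides do not give the threshold value or the predictors themselves.

```python
def motion_labels(ranks):
    """For each day t, +1 if tomorrow's rank number is higher, -1 if lower, 0 if unchanged."""
    return [(nxt > cur) - (nxt < cur) for cur, nxt in zip(ranks, ranks[1:])]


def volatility_labels(ranks, threshold):
    """For each day t, True if tomorrow's rank differs from today's by more than threshold."""
    return [abs(nxt - cur) > threshold for cur, nxt in zip(ranks, ranks[1:])]
```

With `ranks = [100, 105, 160, 158]` and a threshold of 10, only the jump from 105 to 160 counts as volatile.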
11. Predicting Spikes (1)
- Original data is processed as follows (truthed data)
- The point of the minimal sales rank, m, is located
- A threshold is set
- The spike region is taken to be the maximum interval containing the point of minimal sales rank and no point of sales rank greater than this threshold
- Steps
- A product and a time t are fixed
- The predictor is given as input the number of blog postings for this product, for all days up to and including t
- The predictor outputs a bit indicating whether it believes a spike in sales rank will occur in the near future
- Results of the predictor are evaluated against the truthed data
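The truthing step on this slide can be sketched as below. The threshold value does not appear in the slides, so it is left as a caller-supplied parameter; the function name `spike_region` is likewise hypothetical.

```python
def spike_region(ranks, threshold):
    """Return inclusive (start, end) day indices of the spike region, or None.

    The region is the largest contiguous interval that contains the day of the
    minimal rank and in which no rank is greater than threshold.
    """
    if not ranks:
        return None
    centre = min(range(len(ranks)), key=lambda i: ranks[i])
    if ranks[centre] > threshold:
        return None  # even the best rank is above the threshold
    start = end = centre
    # Grow the interval in both directions while ranks stay at or below threshold.
    while start > 0 and ranks[start - 1] <= threshold:
        start -= 1
    while end < len(ranks) - 1 and ranks[end + 1] <= threshold:
        end += 1
    return start, end
```

The resulting interval is what a predictor's output bit for day t would be compared against.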
12. Predicting Spikes (2)
- Goals
- Find spikes that appear to be biggest ever
- Find spikes that significantly exceed historical averages
- Find spikes that rise relatively quickly
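One way to combine the three goals above into a single yes/no predictor over blog-mention counts is sketched below. This is not the algorithm from the paper: the rule and every parameter (`k_sigma`, `rise_days`, `rise_factor`, `min_history`) are assumptions chosen only to illustrate the idea.

```python
from statistics import mean, pstdev


def predicts_spike(mentions, k_sigma=2.0, rise_days=3, rise_factor=2.0, min_history=14):
    """Return True if today's blog-mention count looks like the start of a spike.

    mentions holds daily counts for one product, up to and including day t.
    """
    if len(mentions) <= min_history:
        return False  # not enough history to judge
    history, today = mentions[:-1], mentions[-1]
    # Goal 1: biggest ever, i.e. today exceeds every earlier day.
    biggest_ever = today > max(history)
    # Goal 2: significantly above the historical average.
    exceeds_average = today > mean(history) + k_sigma * pstdev(history)
    # Goal 3: rose quickly compared with a few days ago.
    earlier = mentions[-1 - rise_days]
    rises_quickly = today >= rise_factor * max(earlier, 1)
    return biggest_ever and exceeds_average and rises_quickly
```

Because the function reads only counts up to day t, it uses no information from the future and no sales-rank history.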
13. Predicting Spikes (3)
- Evaluation
- Four categories
- Leading: a spike occurs after t, but within 2 weeks
- Trailing: a spike occurred within the past 2 weeks
- Inside: a spike is currently occurring
- Incorrect: a spike does not occur within 2 weeks of t
- 39 predictions made
- 2/3 of them (26) were leading or trailing
- No indication in the paper regarding accuracy of predictions!
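The four evaluation categories can be expressed as a small classifier. The sketch assumes each product has a single truthed spike region given as inclusive day indices, and that "within 2 weeks" means at most 14 days; both are interpretations rather than details stated in the slides.

```python
def classify_prediction(t, spike_start, spike_end, horizon=14):
    """Classify a positive prediction made on day t against the truthed spike region."""
    if spike_start <= t <= spike_end:
        return "inside"      # a spike is currently occurring
    if t < spike_start and spike_start - t <= horizon:
        return "leading"     # the spike starts after t, within the horizon
    if t > spike_end and t - spike_end <= horizon:
        return "trailing"    # the spike ended within the horizon before t
    return "incorrect"       # no spike within the horizon of t
```

Tallying these labels over all positive predictions gives counts of the kind reported on the slide.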
14. Conclusions & Future Work
- Conclusions
- Rank motion and volatility prediction is difficult
- The volume of blog postings related to a product can be used to predict sales rank spikes
- Automatic query generation for detecting blog post mentions of products produces good results
- Future work
- Develop a better understanding of when blogs are useful for prediction
- Create new tool features and new techniques for automated query generation and prediction
- Expand the area of prediction to other product sales ranks, electoral voting, and public opinion on important decisions
- Study and analyze the proposed Hidden Markov Model relating the blogger's discussion involvement to sales prediction lag