The predictive power of online chatter

Slides: 15
Provided by: raluc8

1
The predictive power of online chatter
  • D. Gruhl, R. Guha, R. Kumar, J. Novak, A. Tomkins

Presented by R. Paiu
2
Motivation & Methodology
  • Goal
  • Demonstrate that online blog postings can be used
    to predict spikes in sales ranks
  • Methodology
  • Blog postings vs. Amazon book sales
  • Create hand-crafted queries and use them for
    predictions
  • Use queries to discover blog posts discussing a
    given product
  • Plot sales ranks and numbers of blog posts as two
    time series
  • Implement and test automated query generation
  • Develop automated prediction algorithms

3
Data Sources
  • IBM WebFountain Project
  • Contains hundreds of thousands of postings
  • Data sources include 300,000 blogs and 200,000
    articles
  • Approximately 200,000 new blog postings stored
    daily
  • Amazon.com Sales Rank Data
  • Collected rank data for 120 days (July to October
    2004) for all books in top 300
  • In total, 2,430 books were included in the data
    set

4
Correlation of Blog Posts and Sales Rank
  • Steps
  • Detecting spikes in sales ranks
  • Locating blog mentions of the corresponding book
  • Plotting and correlating time series

5
Detecting Spikes in Sales Ranks (step 1)
  • Minimal rank m must occur more than 2 weeks from
    the start and the end of the considered period
    (120 days)
  • One week before and one week after m occurs, the
    rank value must be greater than max(m + 50, 1.5m)
  • 50 books (out of 2430) contain a spike
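
The spike rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `has_spike` and its parameters are hypothetical, the bound is read as max(m + 50, 1.5m), and "one week before and after" is interpreted as the ranks exactly 7 days on either side of the minimum.

```python
def has_spike(ranks, margin_days=14, window_days=7):
    """Return True if a daily sales-rank series contains a spike.

    `ranks` is a list of daily ranks (smaller number = better rank).
    The names and the exact reading of the rule are assumptions.
    """
    n = len(ranks)
    m = min(ranks)
    i = ranks.index(m)  # first day on which the minimal rank m occurs
    # The minimum must lie more than `margin_days` from both ends.
    if i <= margin_days or (n - 1 - i) <= margin_days:
        return False
    # Assumed reading of the bound on the slide: max(m + 50, 1.5 * m).
    bound = max(m + 50, 1.5 * m)
    # One week before and one week after the minimum, the rank must
    # be worse (numerically greater) than the bound.
    return ranks[i - window_days] > bound and ranks[i + window_days] > bound
```

For example, a 120-day series that sits at rank 300 and briefly reaches rank 10 on day 60 qualifies, while a flat series does not.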

6
Locating Blog Mentions about Books (Step 2)
  • Retrieve blog mentions about the 50 books
  • Hand-crafted queries produced by users
  • Iteratively refined until the results seem to
    match the desired book

7
Plotting and Correlating Time Series (Step 3)
  • Plot the sales rank of the product and the number
    of corresponding postings vs. time
  • y-axis is scaled so that values lie between 0 and 1
  • Best lag k where cross-correlation is maximum
  • Best lag
  • Leading, if negative
  • Trailing, if positive
  • Results
  • Sales rank spikes potentially correlated with blog
    activity
  • Spikes in sales rank may occur despite
    insignificant blog activity
  • Causes include discount pricing, bulk buying of
    books, etc.
  • Highly correlated spikes in sales ranks and blog
    mentions were obtained for 10 books
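
The scaling and best-lag search can be illustrated as follows. This is a sketch under stated assumptions: the helper names (`scale01`, `pearson`, `best_lag`) and the 14-day search range are hypothetical, Pearson correlation stands in for whatever cross-correlation measure the paper uses, and the sales series is assumed to be expressed so that larger means more sales (for example, a negated rank). A lag k pairs blog mentions on day t + k with sales on day t, so a negative k means the blog series leads.

```python
from math import sqrt

def scale01(xs):
    """Rescale a series to the range 0..1 for plotting."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)

def best_lag(posts, sales, max_lag=14):
    """Return (k, r): the lag with maximal correlation and that correlation.

    Pairs posts[t + k] with sales[t]; k < 0 means blog posts lead sales.
    """
    best_k, best_r = 0, float("-inf")
    for k in range(-max_lag, max_lag + 1):
        pairs = [(posts[t + k], sales[t])
                 for t in range(len(sales)) if 0 <= t + k < len(posts)]
        if len(pairs) < 2:
            continue
        r = pearson([p for p, _ in pairs], [s for _, s in pairs])
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r
```

Because Pearson correlation is unaffected by linear rescaling, `scale01` matters only for plotting the two series on a shared axis, not for the lag search itself.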

8
Automatic Query Generation
  • Generation algorithm
  • Based on the names of the authors of the books
  • 1900 US Census used for estimating name
    ambiguity
  • Experimental results
  • 182 automated queries generated
  • 45 of the queries have cross-correlation > 0.5
  • Cross-correlations > 0.5 correspond to strong
    correlations between blog mentions and sales rank
  • 35 of those queries have best lag < 2 weeks
  • Blog posts are more likely to lead rank spikes
    than to trail them
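
The slide does not spell out the generation algorithm, so the following is only one plausible shape for it, not the authors' method. It assumes a hypothetical table `surname_freq` mapping upper-case surnames to their relative frequency in census data, and a hypothetical cutoff `rare_below`: a rare surname is used on its own, while a common one is judged too ambiguous and the full author name is used as a quoted phrase.

```python
def author_query(author, surname_freq, rare_below=0.0001):
    """Build a blog-search query from an author's name.

    `surname_freq` and `rare_below` are illustrative assumptions;
    the slide only says that census data estimates name ambiguity.
    """
    surname = author.split()[-1]
    freq = surname_freq.get(surname.upper(), 0.0)
    if freq < rare_below:
        # Rare surname: unlikely to match unrelated people.
        return surname
    # Common surname: require the full name as an exact phrase.
    return '"' + author + '"'
```

With a table such as `{"BROWN": 0.006, "TOMKINS": 0.00001}`, the query for "Dan Brown" is the quoted full name, while "Andrew Tomkins" reduces to the surname alone.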

9
Sales Rank Prediction
  • Given the time series representing sales data up
    to a point t, does the addition of blog mention
    data for the same period improve the prediction
    of what will happen to the sales rank?
  • Outcomes
  • MOTION & VOLATILITY: Predicting whether
    tomorrow's sales rank for a particular book will
    be higher or lower appears to be hard
  • SPIKES: Analysis of blog data up to a point t
    makes it possible to effectively predict when
    there will be a future spike in the sales rank,
    without recourse to information from the future
    and even without recourse to history of sales
    ranks.

10
Predicting Motion & Volatility
  • Analysis of various predictor algorithms
  • Volatility prediction: predict whether tomorrow's
    sales rank will differ from today's by more than
    a certain threshold value.
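
The volatility target can be made concrete with a short sketch. The slide does not name the predictor algorithms that were analysed, so the persistence baseline below ("tomorrow behaves like today") is an assumption chosen for illustration, and both function names are hypothetical.

```python
def volatility_labels(ranks, threshold):
    """labels[t] is True when the rank changes by more than `threshold`
    between day t and day t + 1."""
    return [abs(ranks[t + 1] - ranks[t]) > threshold
            for t in range(len(ranks) - 1)]

def persistence_accuracy(ranks, threshold):
    """Accuracy of a naive baseline that predicts each day's volatility
    label to equal the previous day's label. Returns None when the
    series is too short to score."""
    labels = volatility_labels(ranks, threshold)
    if len(labels) < 2:
        return None
    hits = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return hits / (len(labels) - 1)
```

A baseline like this gives a reference point: a predictor that uses blog data is only interesting if it beats it.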

11
Predicting Spikes 1
  • Original data is processed as follows (truthed
    data)
  • The point of the minimal sales rank is located, m
  • A threshold is set,
  • The spike region is taken to be the maximum
    interval containing the point of minimal sales
    rank and no point of sales rank greater than the
    threshold
  • Steps
  • A product and a time t are fixed
  • The predictor is given as input the number of
    blog postings for this product, for all days up
    to and including t
  • The predictor outputs a bit indicating whether it
    believes a spike in sales rank will occur in the
    near future
  • Results of the predictor are evaluated against
    the truthed data
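
The truthing step at the top of this slide can be sketched as an interval expansion around the minimum. The threshold formula is not legible in the transcript, so the sketch takes the threshold as a caller-supplied value; the function name `spike_region` is hypothetical.

```python
def spike_region(ranks, threshold):
    """Return (start, end) day indices of the spike region, or None.

    The region is the longest contiguous interval that contains the
    day of minimal rank and no day whose rank exceeds `threshold`.
    `threshold` is supplied by the caller; its formula is not given here.
    """
    i = ranks.index(min(ranks))
    if ranks[i] > threshold:
        return None  # even the best rank is above the threshold
    start = end = i
    # Grow the interval left and right while ranks stay within the threshold.
    while start > 0 and ranks[start - 1] <= threshold:
        start -= 1
    while end < len(ranks) - 1 and ranks[end + 1] <= threshold:
        end += 1
    return start, end
```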

12
Predicting Spikes 2
  • Goals
  • Find spikes that appear to be the biggest ever
  • Find spikes that significantly exceed historical
    averages
  • Find spikes that rise relatively quickly
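
The three goals can be combined into a simple rule over blog-mention counts. This is a hedged sketch rather than the paper's predictor: the function name and the parameters `avg_factor`, `rise_days` and `rise_factor` are hypothetical values picked for illustration. The rule looks only at counts up to and including day t.

```python
def predict_spike(counts, t, avg_factor=3.0, rise_days=3, rise_factor=2.0):
    """Return True if blog mentions on day t suggest an upcoming spike.

    Uses only counts[0..t]. All thresholds are illustrative assumptions.
    """
    history = counts[:t]
    if len(history) < rise_days:
        return False  # not enough history to judge
    today = counts[t]
    # Goal 1: the biggest mention count seen so far.
    biggest_ever = today > max(history)
    # Goal 2: significantly above the historical average.
    mean = sum(history) / len(history)
    above_average = today > avg_factor * mean
    # Goal 3: a quick rise compared with a few days earlier.
    rises_quickly = today > rise_factor * counts[t - rise_days]
    return biggest_ever and above_average and rises_quickly
```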

13
Predicting Spikes 3
  • Evaluation
  • Four categories
  • Leading: a spike occurs after t, but within 2
    weeks
  • Trailing: a spike occurred within the past 2
    weeks
  • Inside: a spike is currently occurring
  • Incorrect: a spike does not occur within 2 weeks
    of t
  • 39 predictions made
  • 2/3 of them (26) were leading or trailing
  • No indication in the paper regarding accuracy of
    predictions!
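
The four evaluation categories can be expressed as a small classifier. This is an illustrative sketch: it assumes the truthed spike is given as a (start, end) pair of day indices, and the function name `categorize` and the 14-day `horizon` parameter are hypothetical.

```python
def categorize(t, spike, horizon=14):
    """Classify a spike prediction made on day t.

    `spike` is a (start, end) pair of day indices for the truthed
    spike region, or None if the product has no spike.
    """
    if spike is None:
        return "incorrect"
    start, end = spike
    if start <= t <= end:
        return "inside"      # a spike is currently occurring
    if t < start <= t + horizon:
        return "leading"     # spike begins within the next 2 weeks
    if end < t <= end + horizon:
        return "trailing"    # spike ended within the past 2 weeks
    return "incorrect"
```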

14
Conclusions & Future Work
  • Conclusions
  • Rank motion and volatility prediction is
    difficult
  • The volume of blog postings related to a product
    can be used to predict sales rank spikes
  • Automatic query generation for detecting blog
    post mentions of products produces good results
  • Future work
  • Develop better understanding of when blogs are
    useful for prediction
  • Create new tool features and new techniques for
    automated query generation and prediction
  • Expand the area of prediction to other product
    sales ranks, electoral voting, and public opinion
    on important decisions
  • Study and analyze proposed Hidden Markov Model
    relating discussion involvement of the blogger to
    sales prediction lag