Detecting Buzz from Time-Sequenced Document Streams - PowerPoint PPT Presentation

1
Detecting Buzz from Time-Sequenced Document
Streams
  • Jeonghee Yi
  • IBM Almaden Research Center
  • EEE'05

2
Introduction
  • Goal: detect emerging and changing interests
    that appear in document streams (e.g. email, news
    articles, and blogs).
  • We detect buzzwords in the documents over
    time.
  • Buzzwords are terms that occur with strong
    momentum for a relatively short period of time.
  • Our approach to buzz detection is based on the
    notion of bursts of activity proposed by
    Kleinberg.

3
Related Works
  • J. Kleinberg. Bursty and hierarchical structure
    in streams: burst event model.
  • Yahoo! Buzz Index: counts the percentage of the
    total number of people searching for a specific
    query term; does not take into account the
    duration of the buzz.
  • R. Kumar, J. Novak, P. Raghavan, and A. Tomkins.
    On the bursty evolution of blogspace: evolution
    of community in blogspace.
  • D. Gruhl, R. Guha, D. Liben-Nowell, and A.
    Tomkins. Information diffusion through blogspace:
    information and topic propagation using blogspace
    as an example domain.

4
Kleinberg's Burst Analysis
  • Models the stream using an infinite-state
    automaton.
  • The appearance of a burst corresponds to a state
    transition.
  • There is a cost associated with any state
    transition.
  • Given an event stream, the method finds a
    low-cost state sequence that is likely to
    generate that stream.

5
Two-State Automaton
  • Given a set of $n+1$ messages $D = (d_0, d_1, \ldots, d_n)$,
    the message inter-arrival gaps are $X = (x_1, x_2, \ldots, x_n)$.
  • Two states, $q_0$ (low) and $q_1$ (high).
  • For each state, the gaps follow an exponential
    density function $f(x) = \alpha e^{-\alpha x}$, $\alpha > 0$.
  • The probability that the gap exceeds $x$ is equal
    to $e^{-\alpha x}$.
  • $\alpha$ is the rate of message arrivals.
  • The expected value of the gap in this model is
    $\alpha^{-1}$.
  • The automaton changes states with probability
    $p \in (0, 1)$, remaining in its current state with
    probability $1-p$.
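
The exponential gap model above can be checked numerically. The following is a minimal sketch, not code from the presentation: the function names and the parameter name `alpha` (the message arrival rate) are chosen here for illustration only.

```python
import math
import random


def gap_density(x, alpha):
    """Exponential density f(x) = alpha * exp(-alpha * x) for a gap x >= 0."""
    return alpha * math.exp(-alpha * x)


def prob_gap_exceeds(x, alpha):
    """Probability that an inter-arrival gap is longer than x."""
    return math.exp(-alpha * x)


def expected_gap(alpha):
    """Mean inter-arrival gap under arrival rate alpha."""
    return 1.0 / alpha


def sample_gaps(n, alpha, seed=0):
    """Draw n gaps from the exponential model (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.expovariate(alpha) for _ in range(n)]
```

A higher rate means shorter expected gaps, which is why the high state is the one associated with a burst of arrivals.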

6
Finding Optimal State Sequences
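  • $Q = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$, where each
    $q_{i_t} \in \{q_0, q_1\}$
  • Each $Q$ induces a density function
    $f_Q(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t)$
  • $b$: number of state transitions in $Q$, so that
    $\Pr[Q] = p^b (1-p)^{n-b}$
  • $\Pr[Q \mid X] = \frac{\Pr[Q]\, f_Q(X)}{\sum_{Q'} \Pr[Q']\, f_{Q'}(X)}$
  • Maximizing $\Pr[Q \mid X]$ $\Leftrightarrow$ minimizing $-\ln \Pr[Q \mid X]$
  • Finding $Q$ that minimizes the cost function
    $c(Q \mid X) = b \ln\frac{1-p}{p} + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$
  • The first term favors small $b$; the second term
    favors $Q$ that conforms well to $X$.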
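
A minimum-cost state sequence for the two-state automaton can be found by dynamic programming (a Viterbi-style pass). The sketch below is an illustration under stated assumptions, not the presenter's implementation: it charges ln((1-p)/p) per state transition plus the negative log of the exponential density for each gap, assumes the automaton starts in the low state, and uses hypothetical parameter names `alpha0` and `alpha1` for the low and high arrival rates.

```python
import math


def two_state_bursts(gaps, alpha0, alpha1, p):
    """Return a minimum-cost state sequence (0 = low, 1 = high), one per gap."""
    if not (0 < alpha0 < alpha1) or not (0 < p < 1):
        raise ValueError("need 0 < alpha0 < alpha1 and 0 < p < 1")
    alphas = (alpha0, alpha1)
    switch_cost = math.log((1 - p) / p)  # charged once per state transition

    def emit_cost(state, x):
        # -ln(alpha * exp(-alpha * x)) = alpha * x - ln(alpha)
        a = alphas[state]
        return a * x - math.log(a)

    # best[s]: cheapest cost of any sequence that ends in state s.
    # Assumption: the automaton is in the low state before the first gap.
    best = [0.0, math.inf]
    back = []  # back[t][s]: predecessor of state s at gap t
    for x in gaps:
        new_best, choices = [], []
        for s in (0, 1):
            stay = best[s]
            move = best[1 - s] + switch_cost
            if stay <= move:
                prev, cost = s, stay
            else:
                prev, cost = 1 - s, move
            new_best.append(cost + emit_cost(s, x))
            choices.append(prev)
        best = new_best
        back.append(choices)

    # Trace back from the cheaper final state.
    state = 0 if best[0] <= best[1] else 1
    states = []
    for choices in reversed(back):
        states.append(state)
        state = choices[state]
    states.reverse()
    return states
```

With rates 0.1 and 10 and p = 0.1, a run of five short gaps between long ones is labelled high, while a single short gap is not, because two transitions cost more than the better fit gains.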
7
Burst Analysis Examples
8
Buzzword Detection
  • Not all bursty events found by Kleinberg's model
    can be considered buzz.
  • Take into account the relative duration and the
    mass of the bursts.
  • $I_i(w)$: high-interval of $w$; $I_i^-(w)$:
    low-interval of $w$
  • mass of $I_i(w)$: the number of documents
    containing $w$ that arrived during $I_i(w)$
  • Further measures defined for $I_i(w)$: arrival
    rate, momentum, span ratio, and concentration
    [formulas not preserved in this transcript]
  • $w$ qualifies as a buzzword for the time period
    $I_i(w)$ if [condition not preserved in this
    transcript]

9
Buzzword Detection Algorithm
  • 1. For each term $w$ in the stream, compute the
    optimal state sequence $Q(w)$ using Kleinberg's
    model.
  • 2. For each high interval $I_i(w)$ of the state
    sequence $Q(w)$, compute the degree of
    concentration.
  • 3. $w$ is a buzzword if the following holds
    [condition not preserved in this transcript]
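
Steps 2 and 3 can be sketched as follows, given a 0/1 state sequence for one term. This is an illustration only: the transcript does not preserve the concentration formula or the buzzword condition, so `concentration` is a caller-supplied hypothetical scoring function and the comparison against `threshold` is an assumed form of the test.

```python
def high_intervals(states):
    """Return maximal runs of state 1 as (start, end) index pairs, end inclusive."""
    intervals = []
    start = None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t  # a high interval begins
        elif s != 1 and start is not None:
            intervals.append((start, t - 1))  # the interval ended at t - 1
            start = None
    if start is not None:
        intervals.append((start, len(states) - 1))  # run reaches the end
    return intervals


def detect_buzz(states, concentration, threshold):
    """Keep the high intervals whose score reaches the threshold.

    `concentration` and `threshold` are placeholders for the measure and
    condition defined in the talk, which this transcript does not preserve.
    """
    return [iv for iv in high_intervals(states) if concentration(iv) >= threshold]
```

For example, scoring each interval by its length with a threshold of 2 keeps only the runs that span at least two positions.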

10
Dataset
  • 1M blog pages crawled by WebFountain crawlers.
  • Data preprocessing: template removal, page
    segmentation, and HTML detagging.
  • Topic page detection: Winnow classifier.
  • 29K blog postings with political content in the
    first quarter of 2004.

11
Empirical Results (1/2)
12
Empirical Results (2/2)
  • Ralph Nader
  • Feb. 20: Ralph Nader would enter the 2004
    presidential race
  • Feb. 22: formal announcement
  • John Kerry
  • Many bursts received a low span-ratio score, but
    they were associated with his victory in the Iowa
    caucuses and his victory on Super Tuesday.
  • Our algorithm was able to detect buzz on the
    individual events.
  • "John Kerry" itself can be considered a buzzword
    of much longer span.

13
Conclusion
  • We presented a formal model of buzz and proposed
    an algorithm to detect buzz in a text document
    stream.
  • We proposed a formal method of computing the
    degree of concentration.
  • Our algorithm was evaluated experimentally on
    blog postings, and the results show that the
    method is highly promising for detecting buzz.