Mining Document Streams - PowerPoint PPT Presentation

Provided by: KRe90
Transcript and Presenter's Notes

Title: Mining Document Streams


1
Mining Document Streams
  • Sudeep Biswas
  • Guided by
  • Dr. Sunita Sarawagi

2
Mathematical Models
  • What is a model?
  • Why models?
  • Classification of models
  • Deterministic: e.g. round-robin scheduling for a given job
    queue in an OS.
  • Probabilistic: e.g. tossing a coin to generate binary
    sequences.

3
Mining Document Streams
  • Extraction of meaningful structures from document
    streams using mathematical models.
  • Examples of document streams: email, news articles,
    conference papers.

4
Intuitive Idea
  • Appearance of a topic in a document stream is
    signaled by a burst of activity, with certain
    features rising sharply in frequency as the topic
    emerges and then fades away.

5
Objective for Modeling
  • To develop a formal approach for modeling bursts
  • The model should robustly and efficiently
    identify the bursts
  • The model should provide an organizational framework
    for analyzing the underlying content

6
Expected outcomes from our Model
  • Discovery of structures in the stream that allow it
    to be organized efficiently.
  • These structures often have a natural meaning in
    terms of the stream's content.

7
Well-known approaches to organize streams
  • Document streams can be classified and stored
    according to topics and subtopics.
  • Example: a large email stream on a topic such as
    "Grant Proposals" may contain announcements of new
    funding programs, planning of proposals,
    communications with co-authors, etc. This email
    stream can be divided into subtopics based on
    message content, such as particular people,
    programs, or funding agencies.

8
Our Approach
  • Exploring organizing structures based explicitly
    on the role of time in email and other document
    streams
  • Many of the large topics represented in document
    streams are naturally punctuated by bursts, with
    the flow of relevant items intensifying in
    certain key periods

9
Simplest Model
  • The arrival of email is modeled as a Poisson
    process.
  • The inter-arrival time (gap) therefore follows an
    exponential distribution with density
    $f(x) = \alpha e^{-\alpha x}$, and the expected
    value of the gap is $1/\alpha$.
  • The simplest model should be extended so that
    periods of lower rate can be interleaved with
    periods of higher rate, by adding some memory
    (state) to the model.
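
The relationship between the rate and the expected gap can be checked numerically. The following is a minimal sketch, assuming a rate of 2.0 arrivals per unit time; the function name `sample_gaps` and the chosen values are illustrative and not taken from the slides.

```python
import random

def sample_gaps(rate, n, seed=0):
    """Draw n inter-arrival gaps from an exponential distribution with the given rate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.expovariate(rate) for _ in range(n)]

# Assumed example values: rate 2.0, so the expected gap is 1 / 2.0 = 0.5.
gaps = sample_gaps(rate=2.0, n=100_000)
mean_gap = sum(gaps) / len(gaps)
print(f"empirical mean gap: {mean_gap:.3f} (expected about 0.5)")
```

With many samples the empirical mean gap settles close to the reciprocal of the rate, which is the property the simplest model relies on.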

10
A two-state Automaton Model
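[Figure: two-state automaton A with states S_0 and S_1.
In S_0 gaps are emitted with density $f_0(x) = \alpha_0 e^{-\alpha_0 x}$;
in S_1 gaps are emitted with density $f_1(x) = \alpha_1 e^{-\alpha_1 x}$,
where $\alpha_1 > \alpha_0$. The automaton changes state with
probability p and stays in its current state with probability 1-p.]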
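
A two-state source of this kind can be simulated to see how bursts appear in the gap sequence. This is a hedged sketch rather than the presenter's code: the function name `simulate_two_state`, the starting state, and the values of the two rates and of p are assumptions chosen for illustration.

```python
import random

def simulate_two_state(n, alpha0, alpha1, p, seed=0):
    """Generate n gaps from a two-state automaton.

    Before each gap is emitted the automaton switches state with
    probability p; state 0 uses rate alpha0 and state 1 uses rate alpha1.
    """
    rng = random.Random(seed)
    state = 0  # assumption: the automaton starts in the low-rate state
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:
            state = 1 - state
        rate = alpha1 if state == 1 else alpha0
        states.append(state)
        gaps.append(rng.expovariate(rate))
    return states, gaps

states, gaps = simulate_two_state(n=20_000, alpha0=1.0, alpha1=10.0, p=0.05)
mean_gap_low = sum(g for s, g in zip(states, gaps) if s == 0) / states.count(0)
mean_gap_high = sum(g for s, g in zip(states, gaps) if s == 1) / states.count(1)
print(mean_gap_low, mean_gap_high)
```

Gaps emitted in the high-rate state are much shorter on average, so a run of short gaps is evidence that the source is in its burst state.
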
11
Computations
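  • Gap sequence: $x = (x_1, x_2, \ldots, x_n)$, $x_i > 0$
  • State sequence: $q = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$,
    $i_t \in \{0, 1\}$
  • $f_q(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t)$
  • $b$ denotes the number of state transitions in the
    sequence $q$, i.e., the number of indices $i_t$ such
    that $q_{i_t} \neq q_{i_{t+1}}$
  • Our goal is to find a $q$ such that $\Pr[q \mid x]$ is
    maximized.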
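
The two quantities defined above, the density of the gaps under a state sequence and the number of transitions b, can be computed directly. The sketch below works with the logarithm of the density to avoid underflow; the function names `log_density` and `count_transitions` and the example numbers are illustrative assumptions.

```python
import math

def log_density(gaps, state_idx, rates):
    """ln f_q(x): sum over t of ln f_{i_t}(x_t), with f_i(x) = rates[i] * exp(-rates[i] * x)."""
    return sum(math.log(rates[i]) - rates[i] * x for x, i in zip(gaps, state_idx))

def count_transitions(state_idx):
    """b: number of positions where consecutive states differ."""
    return sum(1 for a, b in zip(state_idx, state_idx[1:]) if a != b)

# Assumed example: two short gaps in the middle of two long ones.
gaps = [1.0, 0.1, 0.1, 1.0]
rates = (1.0, 10.0)           # (alpha_0, alpha_1), hypothetical values
bursty = [0, 1, 1, 0]         # state indices i_1..i_4
flat = [0, 0, 0, 0]

print(count_transitions(bursty))                                        # 2
print(log_density(gaps, bursty, rates) > log_density(gaps, flat, rates))  # True
```

The bursty sequence explains the gaps better but pays for two transitions, which is the trade-off the optimization has to balance.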

12
Contd
Prior probability of a state sequence with $b$ transitions:
$\Pr[q] = p^{b} (1-p)^{n-b}$
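
The maximization of $\Pr[q \mid x]$ can be turned into a cost minimization. The derivation below is a sketch following the treatment in Kleinberg's paper (listed in the references), not a transcription of the lost slide content; the normalizing constant $Z$ and the cost symbol $c(q \mid x)$ are notation introduced here.

```latex
% Bayes' rule, with Z = \sum_{q'} \Pr[q'] f_{q'}(x) independent of q
\Pr[q \mid x] = \frac{\Pr[q]\, f_q(x)}{Z}
             = \frac{1}{Z}\, p^{b} (1-p)^{n-b} \prod_{t=1}^{n} f_{i_t}(x_t)

% Taking negative logarithms
-\ln \Pr[q \mid x]
  = b \ln\frac{1-p}{p} + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)
    - n \ln(1-p) + \ln Z

% The last two terms do not depend on q, so maximizing \Pr[q \mid x]
% is equivalent to minimizing
c(q \mid x) = b \ln\frac{1-p}{p} + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)
```

The first term penalizes state changes and the second rewards state sequences whose rates fit the observed gaps.
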
13
Contd
One expects the optimum to track the global
structure of bursts in the gap sequence while
holding to a single state through local periods of
non-uniformity.
14
Infinite State Automaton Model
Cost associated with a state transition from $q_i$ to $q_j$:
$\tau(i, j) = (j - i)\,\gamma \ln n$, with $\gamma > 0$, when $j > i$
$\tau(i, j) = 0$ when $j < i$
Generally, $\gamma = 1$.
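
The transition cost is simple enough to write down directly. This is a minimal sketch; the function name `transition_cost` is illustrative, and treating a self-transition (j equal to i) as free is an assumption consistent with the formula giving zero in that case.

```python
import math

def transition_cost(i, j, n, gamma=1.0):
    """Cost of moving from state q_i to state q_j for a sequence of n gaps.

    Moving up costs (j - i) * gamma * ln(n); moving down is free.
    Staying in place (j == i) is assumed to be free as well.
    """
    if j > i:
        return (j - i) * gamma * math.log(n)
    return 0.0

print(transition_cost(0, 2, n=100))  # two levels up
print(transition_cost(2, 0, n=100))  # moving down costs nothing
```

Because only upward moves are charged, the automaton is reluctant to enter a burst state but can leave it freely.
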
15
Computations
Theorem: Although the automaton has infinitely many
states, it can be shown that a state sequence which
is optimal for a finite-state automaton with a
suitably large number of states $k$ is also optimal
for the infinite-state automaton. The finite-state
automaton with $k$ states is denoted by
[symbol not preserved in the transcript].
16
Algorithm to compute optimal state sequence with
number of states k
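$C_j(t)$ = minimum cost of a state sequence for the
input $x_1, x_2, \ldots, x_t$ that ends in state $q_j$.
Input: sequence $x_1, x_2, \ldots, x_n$
Step 1: Set $C_0(0) = 0$ and $C_j(0) = \infty$ for $j > 0$.
Step 2: Compute
$C_j(t) = -\ln f_j(x_t) + \min_p \left( C_p(t-1) + \tau(p, j) \right)$
for all states $q_j$ ($j = 0, 1, \ldots, k-1$) and all
$t = 1, \ldots, n$, where $\tau(p, j)$ is the cost of a
transition from $q_p$ to $q_j$.
Step 3: Find the state $q_j$ for which $C_j(n)$ is minimum.
Step 4: Backtrack to recover the complete sequence of
states.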
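
The four steps can be sketched as a small dynamic program. This is an illustrative sketch under stated assumptions, not the presenter's implementation: the function name `optimal_states`, the list `rates` (one exponential rate per state, lowest first) and the example gap sequence are all hypothetical.

```python
import math

def optimal_states(gaps, rates, gamma=1.0):
    """Return the minimum-cost state index for each gap (Steps 1 to 4)."""
    if not gaps:
        return []
    n, k = len(gaps), len(rates)

    def tau(i, j):
        # Upward transitions cost (j - i) * gamma * ln(n); others are free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # Step 1: C_0(0) = 0 and C_j(0) = infinity for j > 0.
    cost = [0.0] + [math.inf] * (k - 1)
    back = []  # back[t][j] = best predecessor of state j at step t
    # Step 2: fill in C_j(t) for every state j and every gap x_t.
    for x in gaps:
        new_cost, prev = [], []
        for j in range(k):
            best = min(range(k), key=lambda p: cost[p] + tau(p, j))
            emit = -(math.log(rates[j]) - rates[j] * x)  # -ln f_j(x)
            new_cost.append(emit + cost[best] + tau(best, j))
            prev.append(best)
        cost = new_cost
        back.append(prev)
    # Step 3: pick the cheapest final state.
    j = min(range(k), key=lambda s: cost[s])
    # Step 4: backtrack through the stored predecessors.
    path = [j]
    for prev in reversed(back[1:]):
        j = prev[j]
        path.append(j)
    path.reverse()
    return path

# Assumed example: ten short gaps surrounded by long ones.
gaps = [1.0] * 5 + [0.05] * 10 + [1.0] * 5
path = optimal_states(gaps, rates=[1.0, 10.0])
print(path)
```

For this input the run of short gaps is labelled with the high-rate state and the long gaps on either side with the low-rate state. The running time is proportional to n times k squared, since every state considers every possible predecessor at each step.
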
17
Definition of Burst
18
Results of the Model
It follows from the model that bursts exhibit a
natural nested structure. A burst of intensity $j$
may contain one or more sub-intervals that are
bursts of intensity $j+1$; these in turn may contain
bursts of intensity $j+2$. These relationships can be
represented by a rooted tree, as shown in the example
below.
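
The nested structure can be read off an optimal state sequence. The sketch below assumes the definition used in Kleinberg's paper, where a burst of intensity j is a maximal interval in which the state index is at least j (the definition slide itself is not preserved in this transcript); the function name `extract_bursts` and the example sequence are illustrative.

```python
def extract_bursts(states):
    """Return (intensity, start, end) triples, with end inclusive.

    Assumed definition: a burst of intensity j is a maximal run of
    positions whose state index is at least j.
    """
    bursts = []
    max_level = max(states, default=0)
    for level in range(1, max_level + 1):
        start = None
        for t, s in enumerate(states):
            if s >= level and start is None:
                start = t                      # a burst at this level begins
            elif s < level and start is not None:
                bursts.append((level, start, t - 1))
                start = None
        if start is not None:                  # burst runs to the end
            bursts.append((level, start, len(states) - 1))
    return bursts

states = [0, 1, 1, 2, 2, 1, 0, 1, 0]
bursts = extract_bursts(states)
print(bursts)  # [(1, 1, 5), (1, 7, 7), (2, 3, 4)]
```

Every burst of intensity j+1 lies inside exactly one burst of intensity j, so linking each burst to its enclosing burst yields the rooted tree.
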
19
Example
Computation by
20
Choice of Parameters
  • The choice of $s$ controls the resolution. If $s$ is
    too small, more state transitions are required to
    capture a high-intensity burst. If $s$ is too large,
    the model fails to separate a few higher-intensity
    bursts from a low-intensity burst.
  • The choice of $\gamma$ controls the ease with which
    the automaton can change states. A larger $\gamma$
    prevents the model from recognizing small spurious
    bursts, but too high a value overlooks important
    bursts.
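
The effect of s on resolution can be illustrated with a short calculation. This sketch assumes, as in Kleinberg's paper but not stated on the slides, that the rate of state i is the base rate multiplied by s to the power i; the function name `levels_needed` and the example rate ratio of 50 are hypothetical.

```python
import math

def levels_needed(rate_ratio, s):
    """Smallest state index i with s**i >= rate_ratio.

    Assumes state i has rate alpha_0 * s**i, so rate_ratio is the burst
    rate divided by the base rate alpha_0.
    """
    return math.ceil(math.log(rate_ratio) / math.log(s))

# A burst 50 times more intense than the base rate:
for s in (1.5, 2.0, 10.0):
    print(s, levels_needed(50, s))
```

A small s needs many upward transitions to reach the burst rate, each of which is charged a transition cost, while a large s reaches it in a couple of steps but leaves few intermediate levels to distinguish bursts of different intensity.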

21
Conclusions
  • Mathematical models are often useful for modeling
    real-time systems.
  • Sequential data mining models can help us extract
    meaningful structures inherent to the document
    stream.
  • Our burst model gives us hierarchical structures
    and hence suggests how a large folder of email
    might naturally be divided into a hierarchical set
    of subfolders around certain key events, based
    only on the rate of message arrivals.

22
References
  • Jon Kleinberg, Bursty and Hierarchical Structure
    in Streams, 8th ACM SIGKDD International
    Conference on Knowledge Discovery and Data
    Mining, July 2002.
  • Irwin Miller, John E. Freund, Probability and
    Statistics for Engineers, PHI, 1977.