Title: Mining Document Streams
1. Mining Document Streams
- Sudeep Biswas
- Guided by Dr. Sunita Sarawagi
2. Mathematical Models
- What is a model?
- Why models?
- Classification of models
  - Deterministic: round-robin scheduling for a given job queue in an OS.
  - Probabilistic: tossing a coin to generate binary sequences.
3. Mining Document Streams
- Extraction of meaningful structures from document streams using mathematical models.
  - email
  - news articles
  - conference papers
4. Intuitive Idea
- Appearance of a topic in a document stream is
signaled by a burst of activity, with certain
features rising sharply in frequency as the topic
emerges and then fades away.
5. Objective for Modeling
- To develop a formal approach for modeling bursts
- The model should robustly and efficiently identify the bursts
- The model should provide an organizational framework for analyzing the underlying content
6. Expected Outcomes from Our Model
- Discovery of structures in the stream such that the stream can be organized efficiently.
- These structures often have a natural meaning in terms of the stream's content.
7. Well-known Approaches to Organize Streams
- Document streams can be classified and stored according to topics and subtopics.
- Example: a large email stream on a topic such as Grant Proposals may contain announcements of new funding programs, planning of proposals, communications with co-authors, etc. This email stream can be divided into subtopics based on message content, such as certain people, programs, or funding agencies.
8. Our Approach
- Exploring organizing structures based explicitly on the role of time in email and other document streams
- Many of the large topics represented in document streams are naturally punctuated by bursts, with the flow of relevant items intensifying in certain key periods
9. Simplest Model
- Arrival of email is considered to be a Poisson process
- The inter-arrival time hence follows an exponential distribution with density $f(x) = \alpha e^{-\alpha x}$, and the expected value of the gap is $1/\alpha$.
- The simplest model should be extended in such a way that periods of lower rates can be interleaved with periods of higher rates by adding some memory
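The claim that exponentially distributed gaps have an expected value equal to the reciprocal of the rate can be checked numerically. The following is a minimal sketch; the function names `sample_gaps` and `mean`, the seed, and the chosen rate are illustrative assumptions, not part of the original slides.

```python
import random


def sample_gaps(rate, count, seed=0):
    """Draw `count` inter-arrival gaps from an exponential distribution.

    `rate` is the arrival rate of the Poisson process; the seed is fixed
    only so that repeated runs give the same sample.
    """
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(count)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    rate = 2.0  # hypothetical rate: two messages per unit of time
    gaps = sample_gaps(rate, 100_000)
    print(f"empirical mean gap: {mean(gaps):.4f}")
    print(f"expected 1 / rate:  {1 / rate:.4f}")
```

With a single fixed rate, the sampled gaps show no sustained bursts, which is why the model needs to be extended with states of different rates.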
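10. A Two-State Automaton Model
- Rates of the two states: $\alpha_1 > \alpha_0$
- Gap sequence: $x = (x_1, x_2, \ldots, x_n)$, $x_i > 0$
- State sequence: $q = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$, $i_t \in \{0, 1\}$
- $f_q(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t)$
- $b$ denotes the number of state transitions in the sequence $q$, i.e., the number of indices $i_t$ such that $q_{i_t} \neq q_{i_{t+1}}$
- Our goal is to find a $q$ such that $\Pr[q \mid x]$ is maximized.
One expects the optimum to track the global structure of bursts in the gap sequence while holding to a single state through local periods of non-uniformity.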
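The two quantities defined on this slide, the joint density of a gap sequence under a given state sequence and the number of state transitions, can be computed directly. This is a minimal sketch working in log space; the function names and the example rates are assumptions for illustration only.

```python
import math


def exp_log_density(rate, gap):
    """Natural log of the exponential density rate * exp(-rate * gap)."""
    return math.log(rate) - rate * gap


def sequence_log_density(gaps, states, rates):
    """ln f_q(x): sum over t of the log density of gap x_t under state i_t.

    `states` holds the state index (0 or 1) used for each gap and
    `rates` holds the rate of each state, with rates[1] > rates[0].
    """
    if len(gaps) != len(states):
        raise ValueError("gaps and states must have the same length")
    return sum(exp_log_density(rates[s], x) for x, s in zip(gaps, states))


def count_transitions(states):
    """b: number of positions where the state differs from the next one."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


if __name__ == "__main__":
    rates = (1.0, 5.0)           # hypothetical low and high rates
    gaps = [1.2, 0.1, 0.1, 0.9]  # two short gaps in the middle
    for states in ([0, 0, 0, 0], [0, 1, 1, 0]):
        print(states,
              round(sequence_log_density(gaps, states, rates), 3),
              count_transitions(states))
```

A state sequence that switches to the high-rate state for the short gaps fits the data better but pays for it with more transitions, which is the trade-off the model balances.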
14. Infinite-State Automaton Model
Cost associated with a state transition from $q_i$ to $q_j$:
$$\tau(i, j) = \begin{cases} (j - i)\,\gamma \ln n, & j > i \\ 0, & j < i \end{cases} \qquad \gamma > 0$$
Generally, $\gamma = 1$.
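The transition cost above is simple enough to write as a single function. This is a sketch under one added assumption, labeled in the code: staying in the same state (j equal to i) is treated as free, which the slide does not state explicitly. The function name `transition_cost` is illustrative.

```python
import math


def transition_cost(i, j, n, gamma=1.0):
    """Cost of moving from state i to state j for a sequence of n gaps.

    Moving up costs (j - i) * gamma * ln(n); moving down costs nothing.
    Assumption: remaining in the same state (j == i) also costs nothing.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if j > i:
        return (j - i) * gamma * math.log(n)
    return 0.0


if __name__ == "__main__":
    n = 100  # hypothetical number of gaps
    print(transition_cost(0, 2, n))  # climbing two levels
    print(transition_cost(2, 0, n))  # dropping back down
```

Because the cost grows with the number of levels climbed, the automaton moves to a high-intensity state only when the gaps stay short long enough to repay it.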
Theorem: Although the automaton has infinitely many states, it can be shown that a state sequence that is optimal for a finite-state automaton with k states is also optimal for the infinite-state automaton. The finite state automaton with number of states k is denoted by
16. Algorithm to Compute the Optimal State Sequence with k States
$C_j(t)$ = minimum cost of a state sequence for the input $x_1, x_2, \ldots, x_t$ that must end in state $q_j$.
Input: sequence $x_1, x_2, \ldots, x_n$
Step 1: Set $C_0(0) = 0$ and $C_j(0) = \infty$ for $j > 0$.
Step 2: Compute $C_j(t) = -\ln f_j(x_t) + \min_p \left( C_p(t-1) + \tau(p, j) \right)$ for all $q_j$ ($j = 0, 1, 2, \ldots, k-1$) and all $t = 1$ to $n$, where $\tau(p, j)$ is the cost of a transition from $q_p$ to $q_j$.
Step 3: Find $q_j$ such that $C_j(n)$ is minimum.
Step 4: Backtrack to find the complete sequence of states.
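The four steps above can be sketched as a small dynamic program. The slides do not state the rate of each state, so this sketch assumes the geometric choice from Kleinberg's paper, where state j has rate (n / T) * s**j and T is the sum of all gaps. The function name `optimal_state_sequence` and the default parameter values are illustrative assumptions.

```python
import math


def optimal_state_sequence(gaps, k, s=2.0, gamma=1.0):
    """Return the minimum-cost state index for each gap (k states)."""
    n = len(gaps)
    if n == 0:
        return []
    total = sum(gaps)
    # Assumption: state j emits gaps at rate (n / T) * s**j.
    rates = [(n / total) * s ** j for j in range(k)]
    log_n = math.log(n)

    def tau(p, j):
        # Moving up costs (j - p) * gamma * ln n; otherwise free.
        return (j - p) * gamma * log_n if j > p else 0.0

    def emit(j, x):
        # -ln f_j(x) for the exponential density with rate rates[j].
        return rates[j] * x - math.log(rates[j])

    # Step 1: only state 0 is allowed before the first gap.
    cost = [0.0] + [math.inf] * (k - 1)
    back = []  # back[t][j] = best predecessor of state j at gap t

    # Step 2: fill in the cost table one gap at a time.
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(k):
            best_p = min(range(k), key=lambda p: cost[p] + tau(p, j))
            new_cost.append(emit(j, x) + cost[best_p] + tau(best_p, j))
            pointers.append(best_p)
        cost = new_cost
        back.append(pointers)

    # Step 3: cheapest final state.
    state = min(range(k), key=lambda j: cost[j])

    # Step 4: backtrack through the stored predecessors.
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states


if __name__ == "__main__":
    gaps = [10.0] * 20 + [0.1] * 20 + [10.0] * 20  # a burst in the middle
    print(optimal_state_sequence(gaps, k=5))
```

The table has n columns and k rows, and each entry scans k predecessors, so this straightforward version runs in time proportional to n times k squared.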
17. Definition of Burst
18. Results of the Model
It follows from the model that bursts exhibit a natural nested structure: a burst of intensity $j$ may contain one or more sub-intervals that are bursts of intensity $j+1$; these in turn may contain bursts of intensity $j+2$. These relationships can be represented by a rooted tree, as shown in the example.
Computation by
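The nested structure of bursts can be recovered from a state sequence with a short recursive pass. This sketch assumes the definition used in Kleinberg's paper, which is not spelled out in the text here: a burst of intensity j is a maximal interval in which the state index is at least j. The function names and the dictionary layout of the tree are illustrative.

```python
def bursts_of_intensity(states, j):
    """Maximal runs (start, end), inclusive, where the state is >= j."""
    intervals, start = [], None
    for t, state in enumerate(states):
        if state >= j and start is None:
            start = t
        elif state < j and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(states) - 1))
    return intervals


def burst_tree(states):
    """Forest of bursts; each node lists the stronger bursts inside it."""
    def build(j, lo, hi):
        nodes = []
        for start, end in bursts_of_intensity(states[lo:hi + 1], j):
            s, e = start + lo, end + lo  # shift back to global indices
            nodes.append({
                "intensity": j,
                "start": s,
                "end": e,
                "children": build(j + 1, s, e),
            })
        return nodes

    return build(1, 0, len(states) - 1) if states else []


if __name__ == "__main__":
    states = [0, 1, 1, 2, 2, 1, 0, 1, 0]  # hypothetical state sequence
    for node in burst_tree(states):
        print(node)
```

Each top-level node is a burst of intensity 1, and its children are the bursts of intensity 2 contained in it, which matches the rooted-tree view described above.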
20. Choice of Parameters
- The choice of $s$ controls the resolution. If $s$ is too small, more state transitions are required to capture a high-intensity burst. If $s$ is too large, the model fails to separate a few higher-intensity bursts from a low-intensity burst.
- The choice of $\gamma$ controls the ease with which the automaton can change states. $\gamma$ prevents the model from recognizing small spurious bursts. Too high a value overlooks important bursts.
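The effect of the scaling parameter on resolution can be illustrated with a small calculation. This sketch assumes, following Kleinberg's paper rather than anything stated on the slide, that the rate of state j is the base rate multiplied by s raised to the power j. The function name `steps_to_reach` and the example ratio are hypothetical.

```python
def steps_to_reach(rate_ratio, s):
    """Smallest number of upward steps j such that s**j >= rate_ratio.

    `rate_ratio` is how many times faster than the base rate the burst
    is; `s` must be greater than 1 for the rates to increase.
    """
    if s <= 1:
        raise ValueError("s must be greater than 1")
    if rate_ratio < 1:
        raise ValueError("rate_ratio must be at least 1")
    steps, level = 0, 1.0
    while level < rate_ratio:
        level *= s
        steps += 1
    return steps


if __name__ == "__main__":
    ratio = 64  # hypothetical burst: 64 times the base arrival rate
    for s in (1.2, 2.0, 8.0):
        print(f"s = {s}: {steps_to_reach(ratio, s)} upward steps")
```

A small scaling value needs many upward steps to reach the burst's rate, while a large one reaches it in very few steps and therefore offers few intermediate levels to tell bursts of different strength apart.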
- Mathematical models are often useful for modeling real-time systems.
- Sequential data mining models can help us obtain meaningful structures inherent to the document stream.
- Our burst model gives us hierarchical structures and hence suggests how a large folder of email might naturally be divided into a hierarchical set of subfolders around certain key events, based only on the rate of message arrivals.
- Jon Kleinberg, Bursty and Hierarchical Structure in Streams, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
- Irwin Miller, John E. Freund, Probability and Statistics for Engineers, PHI, 1977.