Data Association for Topic Intensity Tracking - PowerPoint PPT Presentation

Title: Slide 1 Author: School of Computer Science Last modified by: Andreas Krause Created Date: 11/28/2005 1:16:20 AM Document presentation format – PowerPoint PPT presentation

Slides: 34



1
Data Association for Topic Intensity Tracking
  • Andreas Krause
  • Jure Leskovec
  • Carlos Guestrin
  • School of Computer Science,
  • Carnegie Mellon University

2
Document classification
  • Emails from two topics: Conference and Hiking

"Will you go to ICML too?"  P(C | words) = 0.9  ⇒ Conference
"Let's go hiking on Friday!"  P(C | words) = 0.1  ⇒ Hiking
3
A more difficult example
  • Emails from two topics: Conference and Hiking
  • What if we had temporal information?
  • How about modeling emails as an HMM?

2:00 pm: "Let's have dinner after the talk."
2:03 pm: "Should we go on Friday?"
P(C | words) = 0.5
P(C | words) = 0.7
⇒ Conference
An HMM assumes equal time steps and smooth topic changes. Are these valid assumptions?
4
Typical email traffic (Enron data)
[Figure: email arrivals over time for Topic 1 and Topic 2, annotated with bursts and periods with no emails]
  • Email traffic is very bursty
  • Cannot model it with uniform time steps!
  • Topic intensities change over time, separately
    per topic
  • Bursts tell us how intensely a topic is pursued
  • ⇒ Bursts are potentially very interesting!

5
Identifying both topics and bursts
  • Given:
  • A stream of documents (emails)
  • d1, d2, d3, ...
  • and the corresponding document inter-arrival times
    (time between consecutive documents)
  • Δ1, Δ2, Δ3, ...
  • Simultaneously:
  • Classify (or cluster) documents into K topics
  • Predict the topic intensities, i.e. predict the time
    between consecutive documents from the same topic

6
Data association problem
[Figure: timeline of Conference and Hiking emails]
  • If we know the email topics, we can identify
    bursts
  • If we don't know the topics, we can't identify
    bursts!
  • Two-step solution: first classify documents, then
    identify bursts [Kleinberg 03]. Can fail
    badly!
  • This paper: simultaneously identify topics and
    bursts!

7
The Task
  • Have to solve a data association problem
  • We observe:
  • Message deltas: time between the arrivals of
    consecutive documents
  • We want to estimate:
  • Topic deltas: time between messages of the same
    topic
  • We can then compute the topic intensity L = E[1/Δ]
  • Therefore, we need to associate each document with a
    topic

Chicken-and-egg problem:
Need topics to identify intensity
Need intensity to classify (better)
8
How to reason about topic deltas?
  • Associate with each email a time vector τ with one entry per topic

Email 1, Conference, at 2:00 pm
Email 2, Hiking, at 2:30 pm
Email 3, Conference, at 4:15 pm
Expected arrival times τ per topic (for email 1):
C: 2:00 pm
H: 2:30 pm
Topic Δ = 2h 15min (consecutive messages of the same topic)
Message Δ = min τ_2 − min τ_1 (= 30 min)
Topic C = argmin τ_2 (= Hiking)
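
As a small illustration of the difference between message deltas and topic deltas, the three emails above can be run through a few lines of Python. This is only a sketch: the date is arbitrary, and the helper names `message_deltas` and `topic_deltas` are made up for this example and do not come from the presentation. It also assumes the topic labels are already known, which is exactly what the real task does not give us.

```python
from datetime import datetime

# Toy stream mirroring the three emails on the slide (the date is arbitrary).
emails = [
    ("Conference", datetime(2005, 1, 1, 14, 0)),
    ("Hiking", datetime(2005, 1, 1, 14, 30)),
    ("Conference", datetime(2005, 1, 1, 16, 15)),
]


def message_deltas(stream):
    """Minutes between consecutive messages, regardless of topic."""
    times = [t for _, t in stream]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def topic_deltas(stream):
    """Minutes between consecutive messages of the same topic."""
    last_seen = {}
    deltas = {}
    for topic, t in stream:
        if topic in last_seen:
            gap = (t - last_seen[topic]).total_seconds() / 60
            deltas.setdefault(topic, []).append(gap)
        last_seen[topic] = t
    return deltas


print(message_deltas(emails))  # [30.0, 105.0]
print(topic_deltas(emails))    # {'Conference': [135.0]}
```

The 135 minutes for Conference is the 2h 15min topic delta from the slide, while the observed message deltas (30 and 105 minutes) on their own do not reveal it.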
9
Generating message arrival times
  • Want a generative model for the time vectors τ

Intensities for Hiking: L_1^(H), L_2^(H), L_3^(H)
τ_1: C 2:00 pm, H 2:30 pm
τ_2: C 4:15 pm, H 2:30 pm
τ_3: C 4:15 pm, H 7:30 pm
C entry: incremented by a draw from an exponential distribution, Exp(L^(C)).
H entry: does not change, as the topic is not active.
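
The generative process described on this slide can be sketched in Python. This is a simplified illustration, not the authors' implementation: it assumes the intensities stay fixed over time (in the presentation they change over time), and the function name `simulate_stream` and the rate values are hypothetical.

```python
import random


def simulate_stream(rates, n_messages, seed=0):
    """Sample (arrival_time, topic) pairs from fixed per-topic intensities.

    rates: dict mapping topic -> intensity (expected messages per unit time).
    Assumption for this sketch: intensities do not change over time.
    """
    rng = random.Random(seed)
    # Initial ETA vector: one exponential draw per topic.
    eta = {topic: rng.expovariate(rate) for topic, rate in rates.items()}
    stream = []
    for _ in range(n_messages):
        topic = min(eta, key=eta.get)  # the topic with the earliest ETA fires
        now = eta[topic]
        stream.append((now, topic))
        # Only the active topic's ETA is incremented by an exponential draw;
        # the entries of the inactive topics do not change.
        eta[topic] = now + rng.expovariate(rates[topic])
    return stream


stream = simulate_stream({"Conference": 2.0, "Hiking": 0.5}, n_messages=10)
for time, topic in stream:
    print(f"{time:6.2f}  {topic}")
```

Each message is emitted by the topic whose expected arrival time is smallest, so the observed message delta is the difference between consecutive minima of the ETA vector.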
10
Generative Model (conceptual)
[Figure: graphical model with nodes for the intensity for Conference, the intensity for Hiking, the ETA per topic (τ_t), the message Δ, the topic, and the document]
Problem: need to reason about the entire history of time steps τ_t! (The domain of τ_t grows linearly with time.) This makes inference intractable, even for few topics!
11
Do we really need ETA vectors?
  • We know that message Δ_t = min τ_t − min τ_{t−1}.
  • Since topic Δ follows an exponential distribution,
    memorylessness implies
  • P(τ_{t+1}^(C) > 4pm | τ_t^(C) = 2pm, it's now 3pm) =
    P(τ_{t+1}^(C) > 4pm | τ_t^(C) = 3pm, it's now 3pm)
  • Hence Δ_t is distributed as min{Exp(L_t^(C)), Exp(L_t^(H))}
  • Closed form: Δ_t ~ Exp(L_t^(C) + L_t^(H))
  • Similarly, C_t ~ argmin{Exp(L_t^(C)), Exp(L_t^(H))}
  • Closed form: C_t ~ Bernoulli(L_t^(C) / (L_t^(C) +
    L_t^(H)))

Can discard the ETA vectors τ! Quite a general
modeling trick!
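
The two closed forms on this slide rest on a standard property of independent exponential variables: their minimum is exponential with the summed rate, and the probability that a given one is the smallest is its share of the total rate. A quick Monte Carlo check in Python makes this concrete. The rates 3.0 and 1.0 and the function names are arbitrary choices for this sketch.

```python
import random


def closed_form(rate_c, rate_h):
    """Mean message delta and P(topic = C) from the closed forms."""
    total = rate_c + rate_h
    return 1.0 / total, rate_c / total


def monte_carlo(rate_c, rate_h, n=100_000, seed=0):
    """Estimate the same two quantities by sampling both waiting times."""
    rng = random.Random(seed)
    sum_delta = 0.0
    c_first = 0
    for _ in range(n):
        wait_c = rng.expovariate(rate_c)
        wait_h = rng.expovariate(rate_h)
        sum_delta += min(wait_c, wait_h)  # message delta = earliest arrival
        c_first += wait_c < wait_h        # topic = argmin of the two waits
    return sum_delta / n, c_first / n


print(closed_form(3.0, 1.0))   # (0.25, 0.75)
print(monte_carlo(3.0, 1.0))   # should be close to the closed form
```

Because the simulation never needs the absolute arrival times, only the current rates, it mirrors the point of the slide: the ETA vectors can be dropped.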
12
Generative Model (conceptual)
Implicit Data Association (IDA) Model
[Figure: graphical model of the IDA model, including the document node D_t]
  • Turns the model (essentially) into a factorial HMM
  • Many efficient inference techniques available!

13
Exponential distribution appropriate?
  • Previous work on document streams (e.g.,
    Kleinberg 03)
  • Frequently used to model transition times
  • When adding hidden variables, can model arbitrary
    transition distributions (Nodelman et al.)

14
Experimental setup
  • Inference procedures:
  • Full (conceptual) model:
  • Particle filter
  • Simplified model:
  • Particle filter
  • Fully factorized mean field
  • Exact inference
  • Comparison to the two-step approach (first
    classify, then identify bursts)

15
Results (Synthetic data)
  • Periodic message arrivals (uninformative Δ) with
    noisy class assignments: ABBBABABABBB

Naïve Bayes misclassifies based on features
[Plot: topic delta (y-axis, 0 to 30) against message number (x-axis, 0 to 80). Curve shown: topic Δ]
16
Results (Synthetic data)
  • Periodic message arrivals (uninformative Δ) with
    noisy class assignments: ABBBABABABBB

Naïve Bayes misclassifies based on features
[Plot: topic delta (y-axis, 0 to 30) against message number (x-axis, 0 to 80). Curves shown: topic Δ; particle filter (full model)]
17
Results (Synthetic data)
  • Periodic message arrivals (uninformative Δ) with
    noisy class assignments: ABBBABABABBB

Naïve Bayes misclassifies based on features
[Plot: topic delta (y-axis, 0 to 30) against message number (x-axis, 0 to 80). Curves shown: topic Δ; particle filter (full model); exact inference]
18
Results (Synthetic data)
  • Periodic message arrivals (uninformative Δ) with
    noisy class assignments: ABBBABABABBB

Naïve Bayes misclassifies based on features
[Plot: topic delta (y-axis, 0 to 30) against message number (x-axis, 0 to 80). Curves shown: topic Δ; particle filter (full model); exact inference; weighted automaton (first classify, then bursts)]
Implicit Data Association gets both topics and
intensity right, in spite of severe (30%) label
noise. The memorylessness trick identifies the true
intensity. Separate topic and burst
identification fails badly.
19
Inference comparison (Synthetic data)
  • Two topics, with different frequency patterns

[Plot: topic Δ for topic 1 and message Δ (both topics combined); axis annotation: more bursty]
20
Inference comparison (Synthetic data)
  • Two topics, with different frequency patterns

[Plot: topic Δ for topic 1 and message Δ (both topics combined), with the exact inference estimate; axis annotation: more bursty]
21
Inference comparison (Synthetic data)
  • Two topics, with different frequency patterns

[Plot: topic Δ for topic 1 and message Δ (both topics combined), with the exact inference and particle filter estimates; axis annotation: more bursty]
22
Inference comparison (Synthetic data)
  • Two topics, with different frequency patterns

[Plot: topic Δ for topic 1 and message Δ (both topics combined), with the exact inference, particle filter and mean-field estimates; axis annotation: more bursty]
Implicit Data Association identifies the true
frequency parameters (it does not get distracted by
the observed Δ). In addition to exact inference (for
few topics), several approximate inference
techniques perform well.
23
Experiments on real document streams
  • ENRON Email corpus
  • 517,431 emails from 151 employees
  • Selected 554 messages from tech-memos and
    universities folders of Kaminski
  • Stream between December 1999 and May 2001
  • Reuters news archive
  • Contains 810,000 news articles
  • Selected 2,303 documents from four topics:
    wholesale prices, environment issues, fashion and
    obituaries

24
Intensity identification for Enron data
[Plot: topic Δ for the Enron data; axis annotation: more bursty]
25
Intensity identification for Enron data
[Plot: topic Δ for the Enron data, with the WAM estimate; axis annotation: more bursty]
26
Intensity identification for Enron data
[Plot: topic Δ for the Enron data, with the WAM and IDA-IT estimates; axis annotation: more bursty]
27
Intensity identification for Enron data
[Plot: topic Δ for the Enron data, with the WAM and IDA-IT estimates; axis annotation: more bursty]
Implicit Data Association identifies bursts
that are missed by the two-step approach
28
Reuters news archive
  • Again, simultaneous topic and burst
    identification outperforms separate approach

29
What about classification?
  • Temporal modeling effectively changes class prior
    over time.
  • Impact on classification accuracy?
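
One way to read the first bullet: by the Bernoulli closed form from the earlier slide, the prior probability of a topic is its share of the total intensity, so a word-based classifier's likelihoods get reweighted as the intensities change. The Python sketch below illustrates this under a strong simplifying assumption, namely that the current intensities are known (in the model they are hidden and must be inferred). The function name and all numbers are hypothetical.

```python
def posterior_with_intensity(likelihoods, intensities):
    """Combine word likelihoods P(words | topic) with an intensity-based prior.

    The prior for each topic is its intensity divided by the total intensity.
    Assumption for this sketch: the intensities are given, not inferred.
    """
    total_rate = sum(intensities.values())
    unnormalized = {
        topic: likelihoods[topic] * intensities[topic] / total_rate
        for topic in likelihoods
    }
    z = sum(unnormalized.values())
    return {topic: value / z for topic, value in unnormalized.items()}


# An ambiguous email: the words are equally likely under both topics.
likelihoods = {"Conference": 0.5, "Hiking": 0.5}

# Conference is currently in a burst (four times the Hiking intensity).
intensities = {"Conference": 4.0, "Hiking": 1.0}

posterior = posterior_with_intensity(likelihoods, intensities)
print(posterior)  # Conference ends up near 0.8, Hiking near 0.2
```

With equal intensities the posterior falls back to the word-based estimate, so the temporal model only shifts decisions when one topic is noticeably more active than the other.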

30
Classification performance
[Plot: IDA model versus Naïve Bayes; lower is better]
  • Modeling intensity leads to improved
    classification accuracy

31
Generalizations
  • Learning paradigms:
  • Not just the supervised setting, but also
  • Unsupervised / semi-supervised learning
  • Active learning (select most informative labels)
  • See paper for details.
  • Other document representations /
    classifiers (just need P(D_t | C_t))
  • Other applications
  • Fault detection
  • Activity recognition

32
Tracking topic drift over time
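[Figure: graphical model with a topic parameter node (mean of the topic in the LSI representation) and a document node D_t (LSI representation)]
The topic parameter tracks the topic means over time (Kalman filter)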
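
The slide only names the Kalman filter, so the following Python sketch is a generic one-dimensional version rather than the presenters' model. It assumes a random-walk drift of the topic mean, a single LSI coordinate, and documents already assigned to the topic. The function names and the noise variances are hypothetical.

```python
def kalman_step(mean, var, obs, process_var, obs_var):
    """One predict/update step of a scalar Kalman filter.

    Assumes the tracked mean follows a random walk with variance process_var
    and that each observation equals the mean plus noise of variance obs_var.
    """
    pred_var = var + process_var           # predict: drift adds uncertainty
    gain = pred_var / (pred_var + obs_var) # how much to trust the observation
    new_mean = mean + gain * (obs - mean)  # update towards the observation
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


def track(observations, mean=0.0, var=1.0, process_var=0.01, obs_var=0.5):
    """Run the filter over a sequence and return the tracked means."""
    means = []
    for obs in observations:
        mean, var = kalman_step(mean, var, obs, process_var, obs_var)
        means.append(mean)
    return means


# Hypothetical LSI coordinate of documents from one topic, drifting upwards.
print(track([0.9, 1.1, 1.0, 1.4, 1.6, 1.5]))
```

A larger `process_var` lets the estimate follow drift more quickly, at the cost of reacting more strongly to individual noisy documents.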
33
Conclusion
  • General model for data association in data
    streams
  • Exponential order statistics enable implicit data
    association and tractable exact inference
  • A principled model for changing class priors
    over time
  • Can be used in supervised, unsupervised and
    (semi-supervised) active learning settings
  • Synergetic effect between intensity estimation
    and classification on several real-world data
    sets