Automatic Blog Monitoring and Summarization - PowerPoint PPT Presentation

Slides: 43
Provided by: sia78
Learn more at: http://oak.cs.ucla.edu
Transcript and Presenter's Notes


1
Automatic Blog Monitoring and Summarization
  • Ka Cheung Richard Sia
  • PhD Prospectus

2
With/without organized access
3
Inaccessible?
By AskJeeves
4
Introduction
  • Organized access to blogs
  • Full coverage
  • Reflect changes quickly
  • Filtered and organized presentation
  • Intended Contributions
  • Efficient techniques to harvest blogs
  • Algorithms to monitor frequently changing data
    sources
  • Algorithms to reconstruct implicit networks and
    compose topic summaries

5
Modules
  • Monitoring
  • Collection (future work)
  • Topic detection and tracking (future work)
  • Conclusion

6
Monitoring
  • Preliminary results

7
Framework
  • A central server monitors data source changes and
    provides succinct summaries to users

8
Overview
  • New challenges
  • Content changes more rapidly, with recurring
    patterns
  • More time-sensitive requirements
  • Modeling of posting updates
  • Definition of delay
  • Strategies for allocation and scheduling

9
Characteristics
  • Homogeneous Poisson model: λ(t) = λ at any t
  • Periodic inhomogeneous Poisson model: λ(t) =
    λ(t − nT), n = 1, 2, …

10
Definition of metrics
  • Delay of a data source: sum of elapsed time for
    every post
  • Delay experienced by the aggregator
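
The delay metric above can be illustrated with a minimal sketch. It assumes that the "elapsed time" of a post is the time from its posting until the next retrieval of that source, and that the aggregator's delay is a weighted sum over sources (the weights are an assumption borrowed from the wi used under resource allocation; the slide's own formula is not preserved). The function names are hypothetical.

```python
from bisect import bisect_left

def source_delay(post_times, retrieval_times):
    """Sum, over all posts, of the time from posting until the next retrieval."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for p in post_times:
        idx = bisect_left(retrievals, p)  # first retrieval at or after the post
        if idx == len(retrievals):
            continue  # post not picked up yet; ignored in this sketch
        total += retrievals[idx] - p
    return total

def aggregator_delay(sources, weights=None):
    """Weighted sum of per-source delays (weights default to 1; an assumption).

    sources is a list of (post_times, retrieval_times) pairs.
    """
    if weights is None:
        weights = [1.0] * len(sources)
    return sum(w * source_delay(p, r) for w, (p, r) in zip(weights, sources))
```

For example, posts at times 1, 3 and 7 with retrievals at 4 and 8 wait 3, 1 and 1 time units respectively.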

11
Definition of metrics
  • tj: retrieval time; λ(t): posting rate
  • Expected delay
  • Homogeneous Poisson model
  • Inhomogeneous Poisson model
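
The formulas for this slide did not survive in the transcript. One standard way to write the expected delay accumulated between two consecutive retrievals, assuming a post made at time t waits until the next retrieval at tj, is sketched below; the notation t_{j-1} for the previous retrieval time is an assumption.

```latex
% Inhomogeneous Poisson model: posts arrive at rate \lambda(t),
% and a post made at time t waits (t_j - t) until the next retrieval.
E[D_j] = \int_{t_{j-1}}^{t_j} \lambda(t)\,(t_j - t)\,dt

% Homogeneous Poisson model: \lambda(t) = \lambda is constant,
% so the integral reduces to a closed form.
E[D_j] = \frac{\lambda\,(t_j - t_{j-1})^2}{2}
```

The homogeneous case shows that the expected delay grows quadratically with the gap between retrievals, which is why spacing matters as much as the number of retrievals.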

12
Problem formulation
  • Minimization of expected delay experienced by the
    aggregator under constraint of limited resources.

Schedule the retrieval times tj such that the expected delay is minimized.
13
Approach
  • Resource allocation
  • How often to contact data sources?
  • If O1 is more active than O2, how much more often
    should we contact O1 than O2?
  • Retrieval scheduling
  • When to contact a data source?
  • If 3 retrievals are allocated to O1, when should
    these 3 retrievals be scheduled?

14
Resource allocation
  • Consider n data sources O1, …, On
  • λi: posting rate of Oi
  • wi: weight of Oi
  • N: total number of retrievals per day
  • mi: number of retrievals per day allocated to Oi
  • Optimal allocation
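
The allocation formula itself is not preserved in the transcript. As a hedged sketch: if each source is modeled as homogeneous Poisson and its mi retrievals are evenly spaced, the weighted expected delay per day is the sum of wi·λi/(2·mi), and minimizing it subject to the mi summing to N (a Lagrange-multiplier argument) gives mi proportional to sqrt(wi·λi). The code below implements that derived rule; it is an assumption that this matches the slide, and the function names are hypothetical.

```python
import math

def allocate_retrievals(rates, weights, total):
    """Split `total` retrievals per day in proportion to sqrt(w_i * lambda_i)."""
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    norm = sum(scores)
    return [total * s / norm for s in scores]

def expected_daily_delay(rates, weights, alloc):
    """Weighted expected delay per day with evenly spaced retrievals.

    Each source contributes w_i * lambda_i / (2 * m_i) under the
    homogeneous Poisson assumption.
    """
    return sum(w * lam / (2.0 * m) for lam, w, m in zip(rates, weights, alloc))

# Hypothetical example: O1 posts 9 times a day, O2 once a day, N = 8.
rates, weights, total = [9.0, 1.0], [1.0, 1.0], 8
sqrt_alloc = allocate_retrievals(rates, weights, total)
proportional_alloc = [total * lam / sum(rates) for lam in rates]
```

In this example O1 is nine times as active as O2 but receives only three times as many retrievals, and the square-root split yields a lower expected delay than a split proportional to posting rate.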

15
Retrieval scheduling
  • If m retrieval(s) per day are allocated to a data
    source O, how should we schedule these m
    retrievals?
  • m = 1
  • m > 1

16
Single retrieval per period
  • λ(t) = 1 for t ∈ [0, 1]; λ(t) = 0 for t ∈ [1, 2]
  • Periodicity T = 2
  • t = 0.5, expected delay = 0.75
  • t = 1, expected delay = 0.5
  • t = 2, expected delay = 1.5
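
The example can be checked numerically with a small sketch. It assumes the schedule repeats every period, so that a single retrieval at time t picks up every post made during the preceding T time units, and the expected delay is the integral of λ(s)·(t − s) over s from t − T to t. Under this wrap-around assumption the sketch reproduces the values quoted for t = 1 and t = 2; values for other t depend on how posts made after the retrieval are treated, which the slide text does not spell out. The function names are hypothetical.

```python
def rate(t):
    """Posting rate from the example: 1 on [0, 1), 0 on [1, 2), period 2."""
    return 1.0 if (t % 2.0) < 1.0 else 0.0

def expected_delay_single(rate_fn, t, period, steps=20000):
    """Midpoint-rule approximation of the integral of rate_fn(s) * (t - s)
    for s in [t - period, t]."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        s = t - period + (k + 0.5) * h
        total += rate_fn(s) * (t - s) * h
    return total

delay_at_1 = expected_delay_single(rate, 1.0, 2.0)
delay_at_2 = expected_delay_single(rate, 2.0, 2.0)
```

Retrieving at t = 1, right after the active interval ends, gives the smaller delay because no post waits longer than one time unit.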

17
Single retrieval per period
  • For a data source with posting rate ?(t) and
    period T, the expected delay when retrieved at
    time t is given by

18
Multiple retrievals per period
  • When m retrievals per period are allocated and
    scheduled at times t1, …, tm, the expected delay
    is given by

19
Example
  • 6 retrievals for λ(t) = 2 + 2 sin(2πt)
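
The figure for this example is not preserved. As an illustration only, the sketch below evaluates the expected delay per period of a schedule for this posting rate and improves an evenly spaced schedule with a simple coordinate descent. The period T = 1 is an assumption (it is the period of sin(2πt)), and the coordinate descent is a generic local search swapped in for illustration, not the scheduling algorithm from the presentation. All function names are hypothetical.

```python
import math

PERIOD = 1.0  # assumed: the period of sin(2*pi*t)

def rate(t):
    """Posting rate from the example."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

def interval_delay(a, b, steps=200):
    """Expected delay of posts made in [a, b] that are all retrieved at b
    (midpoint-rule approximation)."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        s = a + (k + 0.5) * h
        total += rate(s) * (b - s) * h
    return total

def schedule_delay(times):
    """Expected delay per period for retrievals at `times`, repeated every PERIOD."""
    ts = sorted(times)
    prev = ts[-1] - PERIOD  # last retrieval of the previous period
    total = 0.0
    for t in ts:
        total += interval_delay(prev, t)
        prev = t
    return total

def improve(times, step=0.05, min_step=1e-4):
    """Coordinate descent: nudge one retrieval at a time while the delay drops,
    halving the step size when no nudge helps."""
    ts = sorted(times)
    best = schedule_delay(ts)
    while step >= min_step:
        moved = False
        for i in range(len(ts)):
            for delta in (-step, step):
                trial = ts[:]
                trial[i] = (trial[i] + delta) % PERIOD  # stay inside one period
                d = schedule_delay(trial)
                if d < best - 1e-12:
                    ts, best, moved = sorted(trial), d, True
        if not moved:
            step /= 2.0
    return ts, best

uniform = [k / 6.0 for k in range(6)]
uniform_delay = schedule_delay(uniform)
tuned, tuned_delay = improve(uniform)
```

Comparing `uniform_delay` with `tuned_delay` shows how much is gained by shifting retrievals relative to the posting pattern rather than spacing them evenly.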

20
Experiment
  • Data: 10k RSS feeds, October to December 2004

21
Performance
  • CGM03: optimizes for age
  • Ours: both resource allocation and retrieval
    scheduling

22
Size of estimation window
  • Resource constraint: 4 retrievals per day per
    feed on average
  • 2 weeks is an appropriate choice

23
Predictability of posting rate
  • 90% of the RSS feeds post consistently

24
Summaries and extensions
  • Resource allocation is more aggressive
  • Retrieval scheduling optimizes within each
    individual data source
  • Include user access patterns
  • Variable retrieval cost

25
Collection
  • Future work

26
Collection
Domain              Count     Category
spaces.msn.com      839,663   Blog
blogspot.com        362,957   Blog
wretch.cc           116,161   Blog
search-net101.com    89,750   Spam/ads
abalty.com           86,329   Spam/ads
search-now854.com    80,109   Spam/ads
bigebiz.org          79,059   Spam/ads
  • Blog hosting website
  • Central repository: 5.3M URLs from
    weblogs.com; limited and contaminated
  • Crawling: retrieve the maximum number of blogs while
    reducing the number of irrelevant pages downloaded

27
Collection
  • Blogs are inter-connected (blogrolls)
  • Selectively following links, discovering hubs for
    blogs

[1] Chakrabarti et al., "Focused Crawling: A New
Approach to Topic-specific Web Resource
Discovery," The International WWW Conference, 1999
28
Relinquishment of blogs
  • Detection of abandoned blogs to save resources

[2] D.R. Cox, "Regression Models and Life-Tables
(with discussion)," Journal of the Royal
Statistical Society, B(34), 1972
[3] Gina Venolia, "A Matter of Life or Death: Modeling Blog
Mortality," Technical report, Microsoft Research
29
Topic detection and tracking
  • Future work

30
Overview
  • Characteristics
  • Document stream
  • Traces of information propagation among blogs
  • Challenges
  • Modeling growth and death of a topic
  • Ranking of blog articles
  • Malicious content

31
Influence network in blogs
  • Information is diffused among blogs
  • Indicator of popularity
  • Social relationship among bloggers

32
Influence network in blogs
  • Four major patterns of propagation
  • Reconstruction of implicit network
  • Ranking (source authority)
  • Advertising campaign

33
Data characteristics
  • 97-98% of daily content is new

34
Data characteristics
  • The same content lasts for 8 days

35
Topics
  • Topics with different lifespan
  • Bursty
  • Mid-range
  • Sustaining
  • Evolution of topics

[4] J. Kleinberg, "Bursty and Hierarchical
Structure in Streams," in SIGKDD 2002
[5] J. Kleinberg, "Temporal Dynamics of On-Line
Information Streams," Data Stream Management:
Processing High-Speed Data Streams, Springer 2005
36
Document similarity
  • Sparse and diverse
  • 400 articles clustered into 21 clusters out of
    10,000 daily articles (by DBSCAN)

37
Framework
  • Document stream approach
  • Filtering
  • Aggregation

38
Problems
  • Selecting a representative subset of documents
    from a topic cluster
  • Coverage
  • Distinctiveness within the subset
  • Ranking of documents
  • Time
  • Source authority

39
Conclusion
  1. Efficient collection of blogs and modeling the
    relinquishment
  2. Monitoring and retrieval scheduling of rapidly
    changing data sources
  3. Composing topic summary
  4. Reconstruction of an implicit influence network
  5. Representative document selection problem

40
End
  • Questions?

41
More examples
42
Major posting patterns
  • K-means clustering