Title: Automatic Blog Monitoring and Summarization
1Automatic Blog Monitoring and Summarization
- Ka Cheung Richard Sia
- PhD Prospectus
2With/without organized access
3Inaccessible?
By AskJeeves
4Introduction
- Organized access to blogs
- Full coverage
- Reflect changes quickly
- Filtered and organized presentation
- Intended Contributions
- Efficient techniques to harvest blogs
- Algorithms to monitor frequently changing data sources
- Algorithms to reconstruct implicit networks and compose topic summaries
5Modules
- Monitoring
- Collection (future work)
- Topic detection and tracking (future work)
- Conclusion
6Monitoring
7Framework
- A central server monitors data source changes and
provides succinct summaries to users
8Overview
- New challenges
- Content changes more rapidly, with recurring patterns
- More time-sensitive requirements
- Modeling of posting update
- Definition of delay
- Strategies for allocation and scheduling
9Characteristics
- Homogeneous Poisson model: $\lambda(t) = \lambda$ at any $t$
- Periodic inhomogeneous Poisson model: $\lambda(t) = \lambda(t - nT)$, $n = 1, 2, \ldots$
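A minimal sketch of how posting times could be simulated under the periodic inhomogeneous Poisson model, using the standard thinning method. The function names, the sinusoidal `daily_rate`, and the bound `rate_max` are illustrative assumptions, not part of the slides.

```python
import math
import random


def sample_periodic_poisson(rate, rate_max, horizon, rng):
    """Sample event times on [0, horizon) by thinning.

    `rate(t)` must satisfy 0 <= rate(t) <= rate_max for all t.
    """
    times = []
    t = 0.0
    while True:
        # Candidate event from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() < rate(t) / rate_max:
            times.append(t)


def daily_rate(t):
    """Hypothetical posting rate with period T = 1 day."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)


# Thirty simulated days of posts for one data source.
posts = sample_periodic_poisson(daily_rate, 4.0, 30.0, random.Random(0))
```

Setting `rate` to a constant function gives the homogeneous model as a special case.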
10Definition of metrics
- Delay of a data source: sum of elapsed time for every post
- Delay experienced by the aggregator
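The two metrics can be sketched as follows, assuming the elapsed time of a post runs from its posting time to the next retrieval, and that the aggregator's delay is the plain sum over its data sources. Both readings, and the function names, are assumptions made for illustration.

```python
from bisect import bisect_left


def source_delay(post_times, retrieval_times):
    """Sum, over all posts, of the wait from posting to the next retrieval."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for p in post_times:
        k = bisect_left(retrievals, p)  # first retrieval at or after p
        if k == len(retrievals):
            raise ValueError("post at %r is never retrieved" % (p,))
        total += retrievals[k] - p
    return total


def aggregator_delay(sources):
    """Total delay over (post_times, retrieval_times) pairs, one per source."""
    return sum(source_delay(posts, retrievals) for posts, retrievals in sources)
```

For example, posts at times 1.0 and 2.5 with retrievals at 2.0 and 4.0 wait 1.0 and 1.5 respectively, for a source delay of 2.5.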
11Definition of metrics
- $t_j$: retrieval time; $\lambda(t)$: posting rate
- Expected delay
- Homogeneous Poisson model
- Inhomogeneous Poisson model
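Under the delay definition, a post generated at time $t$ between consecutive retrievals $t_{j-1}$ and $t_j$ waits $t_j - t$. Assuming Poisson posting, the expected delay over that interval would be $\lambda (t_j - t_{j-1})^2 / 2$ for a constant rate and $\int_{t_{j-1}}^{t_j} \lambda(t)(t_j - t)\,dt$ in general. The sketch below evaluates both; the function names and the midpoint-rule integration are illustrative choices.

```python
def expected_delay_homogeneous(rate, t_prev, t_next):
    """Expected delay of posts in (t_prev, t_next] for a constant rate."""
    gap = t_next - t_prev
    return rate * gap * gap / 2.0


def expected_delay_inhomogeneous(rate, t_prev, t_next, steps=10000):
    """Midpoint-rule estimate of the integral of rate(t) * (t_next - t)."""
    h = (t_next - t_prev) / steps
    total = 0.0
    for k in range(steps):
        t = t_prev + (k + 0.5) * h
        total += rate(t) * (t_next - t)
    return total * h
```

With a constant rate function the two agree, which is a quick sanity check on the numerical version.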
12Problem formulation
- Minimization of the expected delay experienced by the
aggregator under the constraint of limited resources.
Schedule the retrieval times $t_j$ such that the expected delay is minimized.
13Approach
- Resource allocation
- How often to contact data sources?
- $O_1$ is more active than $O_2$; how much more often should we contact $O_1$ than $O_2$?
- Retrieval scheduling
- When to contact a data source?
- 3 retrievals are allocated for $O_1$; when should these 3 retrievals be scheduled?
14Resource allocation
- Consider $n$ data sources $O_1, \ldots, O_n$
- $\lambda_i$: posting rate of $O_i$
- $w_i$: weight of $O_i$
- $N$: total number of retrievals per day
- $m_i$: number of retrievals per day allocated to $O_i$
- Optimal allocation
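A sketch of one allocation rule for this setup, assuming homogeneous Poisson posting and evenly spaced retrievals. Under those assumptions source $O_i$ contributes a weighted expected delay of $w_i \lambda_i / (2 m_i)$ per day, and minimizing the sum subject to $\sum_i m_i = N$ with a Lagrange multiplier gives $m_i \propto \sqrt{w_i \lambda_i}$. The function names are hypothetical, and rounding to whole retrievals is left out.

```python
import math


def allocate_retrievals(rates, weights, total):
    """Split `total` retrievals per day in proportion to sqrt(w_i * lambda_i)."""
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    norm = sum(scores)
    return [total * s / norm for s in scores]


def weighted_delay(rates, weights, alloc):
    """Expected weighted delay per day with alloc[i] evenly spaced retrievals."""
    return sum(w * lam / (2.0 * m) for lam, w, m in zip(rates, weights, alloc))


rates = [4.0, 1.0]     # O_1 posts four times as often as O_2
weights = [1.0, 1.0]
alloc = allocate_retrievals(rates, weights, 6.0)
proportional = [4.8, 1.2]  # allocation proportional to the posting rate
```

In this toy case a source that is four times as active is contacted only twice as often, and the resulting expected delay is lower than under allocation proportional to the posting rate.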
15Retrieval scheduling
- $m$ retrieval(s) per day are allocated to a data source $O$; how should we schedule these $m$ retrievals?
- $m = 1$
- $m > 1$
16Single retrieval per period
- $\lambda(t) = 1$ for $t \in [0, 1]$; $\lambda(t) = 0$ for $t \in [1, 2]$
- Periodicity $T = 2$
- $t = 0.5$: expected delay = 0.75
- $t = 1$: expected delay = 0.5
- $t = 2$: expected delay = 1.5
17Single retrieval per period
- For a data source with posting rate $\lambda(t)$ and
period $T$, the expected delay when retrieved at
time $t$ is given by
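A numerical sketch, assuming a single retrieval at time $\tau \in [0, T]$ repeated every period: posts made before $\tau$ wait $\tau - t$, and posts made after $\tau$ wait until the retrieval in the next period, $T + \tau - t$, so the expected delay per period would be $\int_0^{\tau} \lambda(t)(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)(T + \tau - t)\,dt$. The function names and the midpoint-rule integration are illustrative.

```python
def single_retrieval_delay(rate, period, tau, steps=20000):
    """Expected delay per period for one retrieval at time tau in [0, period]."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # Posts after tau are picked up by the retrieval in the next period.
        wait = tau - t if t <= tau else period + tau - t
        total += rate(t) * wait
    return total * h


def step_rate(t):
    """Rate 1 on [0, 1) and 0 on [1, 2), repeating with period 2."""
    return 1.0 if (t % 2.0) < 1.0 else 0.0


delay_at_1 = single_retrieval_delay(step_rate, 2.0, 1.0)
delay_at_2 = single_retrieval_delay(step_rate, 2.0, 2.0)
```

With the step-shaped rate of the preceding example, this gives 0.5 for a retrieval at $t = 1$ and 1.5 for a retrieval at $t = 2$.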
18Multiple retrievals per period
- $m$ retrievals per period are allocated; when
scheduled at times $t_1, \ldots, t_m$, the expected delay
is given by
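A sketch of how a candidate schedule could be scored, assuming each post waits for the next scheduled retrieval and that posts after the last retrieval of a period are picked up by the first retrieval of the following period. The function name and the brute-force numerical integration are illustrative assumptions.

```python
import math


def periodic_schedule_delay(rate, period, schedule, steps=20000):
    """Expected delay per period for retrieval times repeating every period."""
    taus = sorted(schedule)
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # Next retrieval at or after t, wrapping into the next period.
        nxt = next((tau for tau in taus if tau >= t), taus[0] + period)
        total += rate(t) * (nxt - t)
    return total * h


def sinusoidal_rate(t):
    """Example rate 2 + 2 sin(2 pi t), with period 1."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)


# Six evenly spaced retrievals as a baseline schedule to compare against.
uniform = [(j + 1) / 6.0 for j in range(6)]
uniform_delay = periodic_schedule_delay(sinusoidal_rate, 1.0, uniform)
```

Candidate schedules can then be compared by their returned delay, with the evenly spaced schedule serving as a reference point.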
19Example
- 6 retrievals for $\lambda(t) = 2 + 2\sin(2\pi t)$
20Experiment
- Data: 10k RSS feeds over Oct-Dec 2004
21Performance
- CGM03: optimizes for age
- Ours: both resource allocation and retrieval scheduling
22Size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks is an appropriate choice
23Predictability of posting rate
- 90% of the RSS feeds post consistently
24Summaries and extensions
- Resource allocation is more aggressive
- Retrieval scheduling optimizes within an individual data source
- Include user access pattern
- Variable retrieval cost
25Collection
26Collection
Domain             Count    Category
spaces.msn.com     839,663  Blog
blogspot.com       362,957  Blog
wretch.cc          116,161  Blog
search-net101.com  89,750   Spam/ads
abalty.com         86,329   Spam/ads
search-now854.com  80,109   Spam/ads
bigebiz.org        79,059   Spam/ads
- Blog hosting website
- Central repository: 5.3M URLs from weblogs.com, limited and contaminated
- Crawling: retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded
27Collection
- Blogs are inter-connected (blogrolls)
- Selectively following links, discovering hubs for
blogs
[1] Chakrabarti et al., Focused Crawling: A New Approach to Topic-specific Web Resource Discovery, The International WWW Conference, 1999
28Relinquishment of blogs
- Detection of abandoned blogs to save resources
[2] D.R. Cox, Regression models and life-tables (with discussion), Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, A Matter of Life or Death: Modeling Blog Mortality, Technical report, Microsoft Research
29Topic detection and tracking
30Overview
- Characteristics
- Document stream
- Traces of information propagation among blogs
- Challenges
- Modeling growth and death of a topic
- Ranking of blog articles
- Malicious content
31Influence network in blogs
- Information is diffused among blogs
- Indicator of popularity
- Social relationship among bloggers
32Influence network in blogs
- Four major patterns of propagation
- Reconstruction of implicit network
- Ranking (source authority)
- Advertising campaign
33Data characteristics
- 97-98% of daily content is new
34Data characteristics
- Same content lasts for 8 days
35Topics
- Topics with different lifespan
- Bursty
- Mid-range
- Sustaining
- Evolution of topics
[4] J. Kleinberg, Bursty and Hierarchical Structure in Streams, in SIGKDD 2002
[5] J. Kleinberg, Temporal Dynamics of On-Line Information Streams, Data Stream Management: Processing High-Speed Data Streams, Springer, 2005
36Document similarity
- Sparse and diverse
- 400 articles clustered into 21 clusters out of
10,000 daily articles (by DBSCAN)
37Framework
- Document stream approach
- Filtering
- Aggregation
38Problems
- Selecting a representative subset of documents from a topic cluster
- Coverage
- Distinctiveness among subset
- Ranking of documents
- Time
- Source authority
39Conclusion
- Efficient collection of blogs and modeling the relinquishment
- Monitoring and retrieval scheduling of rapidly changing data sources
- Composing topic summary
- Reconstruction of an implicit influence network
- Representative document selection problem
40End
41More examples
42Major posting patterns