Synchronizing a Database To Improve Freshness, Junghoo Cho, Hector Garcia-Molina, Stanford University

1
Synchronizing a Database To Improve Freshness
Junghoo Cho, Hector Garcia-Molina
Stanford University
2
Problem
  • Application
  • Web search engines/crawlers
  • Data warehouse
  • . . .

3
Challenge: How to keep pages fresh?
  • How does the web change over time?
  • Web evolution experiment
  • What does fresh page/database mean?
  • Change metrics
  • How can we increase freshness?
  • Crawl policy

4
Web Evolution Experiment
  • How often does a web page change?
  • How do we model web changes?
  • What is the lifespan of a page?
  • How long does it take for 50% of the web to change?

5
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest page rank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old
    pages)
  • ran only 9pm - 6am, 10 seconds between site
    requests

6
How Often Does a Page Change?
  • Example: 50 visits to page, 5 changes →
    average change interval = 50/5 = 10 days
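
The arithmetic in this example can be sketched in a few lines. This is only an illustration: it assumes one visit per day (as in the experimental setup), and the function name `average_change_interval` is hypothetical, not something from the slides.

```python
def average_change_interval(days_observed: int, changes_detected: int) -> float:
    """Estimate the average change interval in days.

    Assumes one visit per day, so the number of visits equals the
    number of days observed (an assumption taken from the daily crawl).
    """
    if changes_detected <= 0:
        raise ValueError("no changes detected; the interval cannot be estimated")
    return days_observed / changes_detected


# The slide's example: 50 daily visits, 5 detected changes.
print(average_change_interval(50, 5))  # 10.0
```

Because a page is sampled only once a day, several changes within one day are counted as one, so this estimate is at best a rough upper bound on how quickly a page really changes.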

7
Average Change Interval
[Figure: fraction of pages by average change interval]
8
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is time to next event
  • $f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
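
As a small illustration of the model above, the density of the time to the next change and the probability that a page has changed within a given time can be computed directly. The function names and the example rate of 0.1 changes per day (one change every 10 days on average) are assumptions chosen for this sketch.

```python
import math


def interval_density(t: float, lam: float) -> float:
    """Density f_T(t) = lam * exp(-lam * t) of the time to the next change (t > 0)."""
    return lam * math.exp(-lam * t) if t > 0 else 0.0


def prob_changed_within(t: float, lam: float) -> float:
    """Probability that at least one change occurs within time t: 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


# Hypothetical page that changes once every 10 days on average.
LAM = 0.1  # changes per day (assumed value)
print(interval_density(10.0, LAM))
print(prob_changed_within(10.0, LAM))
```

Under this model, the chance that such a page has changed after exactly one average interval is 1 - 1/e, roughly 63%.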

9
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with a given interval vs. interval in days, compared with the Poisson model]
10
Change Metrics
  • Freshness
  • Freshness of page ei at time t:
    $F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$

11
Change Metrics
  • Age
  • Age of page ei at time t:
    $A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$
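
The freshness and age definitions can be sketched as two small functions. This is a minimal sketch under stated assumptions: the source page's change times are known and sorted, and the "modification time" is read as the first change after the last refresh. The names `first_unsynced_change`, `freshness`, and `age` are hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def first_unsynced_change(change_times: Sequence[float],
                          last_refresh: float, t: float) -> Optional[float]:
    """Earliest source change in (last_refresh, t], or None if the copy is up to date.

    change_times must be sorted in ascending order.
    """
    i = bisect_right(change_times, last_refresh)
    if i < len(change_times) and change_times[i] <= t:
        return change_times[i]
    return None


def freshness(change_times: Sequence[float], last_refresh: float, t: float) -> int:
    """1 if the local copy is up to date at time t, 0 otherwise."""
    return 1 if first_unsynced_change(change_times, last_refresh, t) is None else 0


def age(change_times: Sequence[float], last_refresh: float, t: float) -> float:
    """0 if up to date; otherwise the time elapsed since the first unsynced change."""
    changed_at = first_unsynced_change(change_times, last_refresh, t)
    return 0.0 if changed_at is None else t - changed_at


# Hypothetical page: changed on days 3 and 8, local copy refreshed on day 5.
changes = [3.0, 8.0]
print(freshness(changes, 5.0, 7.0), age(changes, 5.0, 7.0))    # 1 0.0
print(freshness(changes, 5.0, 10.0), age(changes, 5.0, 10.0))  # 0 2.0
```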

12
Change Metrics
[Figure: F(ei), between 0 and 1, and A(ei), starting at 0, plotted against time, with update and refresh events marked]
13
Refresh Order
  • Fixed order
  • Example: explicit list of URLs to visit
  • Random order
  • Example: start from seed URLs and follow links
  • Purely random
  • Example: refresh pages on demand, as requested by the
    user

[Figure: pages ei on the web and their copies ei in the database]
14
Freshness vs. Order
$r = \lambda / f$ = average change frequency / average
revisit frequency
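
For two of the refresh orders, time-averaged freshness as a function of r can be derived in closed form under the Poisson change model. The sketch below is such a derivation, not a reading of the slide's plot. It assumes a fixed-order page is refreshed at regular intervals of 1/f, and that purely random refreshes of a page behave like a Poisson process with rate f. The function names are hypothetical.

```python
import math


def freshness_fixed_order(r: float) -> float:
    """Average freshness when a page is refreshed at regular intervals.

    Averaging exp(-lam * t) over one refresh interval of length 1/f
    gives (1 - exp(-r)) / r with r = lam / f.
    """
    if r == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-r)) / r


def freshness_purely_random(r: float) -> float:
    """Average freshness when refreshes arrive as a Poisson process with rate f.

    The copy is fresh when the most recent event was a refresh rather
    than a change, which happens with probability f / (f + lam) = 1 / (1 + r).
    """
    return 1.0 / (1.0 + r)


for r in (0.1, 1.0, 5.0):
    print(r, freshness_fixed_order(r), freshness_purely_random(r))
```

For every r > 0 the fixed-order value is the larger of the two, since regular spacing wastes fewer refreshes on pages that were just visited.
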
15
Trick Question
  • Two-page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit pages once a week
  • How should we visit pages?
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
  • ?

[Figure: pages e1 and e2 on the web and their copies in the database]
16
Proportional Often Not Good!
  • Visit fast-changing e1 → get 1/2 day of freshness
  • Visit slow-changing e2 → get 1/2 week of freshness
  • Visiting e2 is a better deal!
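
The comparison can be checked numerically for the two-page example. This is a sketch under assumptions that are not on the slides: pages change as Poisson processes (e1 at 7 changes per week, e2 at 1 per week), each page is refreshed at regular intervals, and the total budget is read as one visit per week across both pages. All names are hypothetical.

```python
import math


def page_freshness(change_rate: float, visit_rate: float) -> float:
    """Time-averaged freshness of one page refreshed at regular intervals."""
    if visit_rate <= 0:
        return 0.0  # never refreshed: the copy is stale in the long run
    if change_rate == 0:
        return 1.0  # never changes: the copy is always fresh
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r


def database_freshness(change_rates, visit_rates) -> float:
    """Average freshness over all pages in the database."""
    values = [page_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(values) / len(values)


CHANGE_RATES = [7.0, 1.0]  # changes per week for e1 and e2
BUDGET = 1.0               # total visits per week (assumed reading of the slide)

policies = {
    "uniform": [BUDGET / 2, BUDGET / 2],
    "proportional": [BUDGET * 7 / 8, BUDGET / 8],
    "only e2": [0.0, BUDGET],
}
results = {name: database_freshness(CHANGE_RATES, rates)
           for name, rates in policies.items()}
for name, value in results.items():
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the proportional policy gives the lowest average freshness of the three and visiting only e2 gives the highest, which agrees with the slide's conclusion.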

17
Selecting Optimal Refresh Frequency
  • Analysis is complex
  • Shape of curve is the same in all cases
  • Holds for any distribution g($\lambda$)
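
Although the general analysis is complex, the optimum for a tiny database can be found by brute force. The sketch below is not the authors' method: it grid-searches how to split a visit budget between two pages, assuming Poisson changes and regularly spaced refreshes. The function names, the grid resolution, and the example rates (7 and 1 changes per week, budget of 1 visit per week) are assumptions.

```python
import math


def page_freshness(change_rate: float, visit_rate: float) -> float:
    """Time-averaged freshness of one page refreshed at regular intervals."""
    if visit_rate <= 0:
        return 0.0  # never refreshed: stale in the long run
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r


def best_split(rate1: float, rate2: float, budget: float, steps: int = 1000):
    """Grid-search the visit rates (f1, f2) with f1 + f2 = budget.

    Returns (f1, f2, average freshness) for the best grid point found.
    Both change rates are assumed to be positive.
    """
    best = (0.0, budget, -1.0)
    for k in range(steps + 1):
        f1 = budget * k / steps
        f2 = budget - f1
        avg = (page_freshness(rate1, f1) + page_freshness(rate2, f2)) / 2
        if avg > best[2]:  # strict comparison keeps the first maximum
            best = (f1, f2, avg)
    return best


f1, f2, value = best_split(7.0, 1.0, budget=1.0)
print(f1, f2, round(value, 3))
```

With this tight budget the search spends every visit on the slowly changing page. Raising `budget` moves the optimum toward a mixed allocation.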

18
Optimal Refresh Frequency for Age
  • Analysis is also complex
  • Shape of curve is the same in all cases
  • Holds for any distribution g($\lambda$)

19
Comparing Policies
Based on statistics from the experiment and a revisit
frequency of once a month
20
Summary
  • Keeping the collection fresh
  • Web evolution experiment
  • Change metrics
  • Optimal policy
  • Intuitive policy does not always perform well
  • Should be careful in deciding revisit policy

21
Future work
  • Weighted freshness model
  • Non-Poisson process model
  • Change frequency estimation