WebBase and Stanford Digital Library Project - PowerPoint PPT Presentation

WebBase and Stanford Digital Library Project, by Junghoo Cho, Stanford University

Slides: 24
Learn more at: http://oak.cs.ucla.edu

Transcript and Presenter's Notes

Title: WebBase and Stanford Digital Library Project


1
WebBase and Stanford Digital Library Project
  • Junghoo Cho
  • Stanford University

2
Technologies for Digital Libraries
Physical Barriers
  • Mobile Access

Economic Weaknesses
  • IP Infrastructure

Information Loss
  • Archival Repository

Information Overload
  • Value Filtering

Service Heterogeneity
  • Interoperability

3
WebBase Architecture
[Architecture diagram. Components: WWW, Repository, Feature Repository, Retrieval Indexes, WebBase API, Multicast Engine, and multiple Clients]
4
What is a Crawler?
[Flowchart. Data stores: initial urls, to visit urls, visited urls, web pages. Steps: init, get next url, get page (from the web), extract urls]
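The loop named in the diagram labels can be sketched roughly as follows. This is a minimal illustration, not the WebBase crawler: `FAKE_WEB`, `get_page`, `extract_urls`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
import re

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="http://a.example/">home</a>',
    "http://b.example/": "no links here",
}

HREF = re.compile(r'href="([^"]+)"')


def get_page(url):
    """'get page': return the page text, or None if the URL is unknown."""
    return FAKE_WEB.get(url)


def extract_urls(page):
    """'extract urls': pull href targets out of the page text."""
    return HREF.findall(page)


def crawl(initial_urls):
    to_visit = deque(initial_urls)  # "to visit urls", seeded by "init"
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    while to_visit:
        url = to_visit.popleft()    # "get next url"; FIFO order gives breadth-first
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):
            if link not in visited:
                to_visit.append(link)
    return pages, visited
```

Calling `crawl(["http://a.example/"])` visits every page reachable from the seed exactly once.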
5
Crawling Issues (1)
  • Load at visited web sites
  • Load at crawlers
  • Scope of the crawl

6
Crawling Issues (2)
  • Keeping pages fresh
  • How does the web change over time?
  • What does it mean for a page or database to be fresh?
  • How can we increase freshness?

7
Outline
  • Web evolution experiments
  • Freshness metrics
  • Crawling policy

8
Web Evolution Experiment
  • How often does a web page change?
  • What is the lifespan of a page?
  • How long does it take for 50% of the web to
    change?

9
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest PageRank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old
    pages)
  • ran only 9pm - 6am, 10 seconds between site
    requests
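The last bullet describes two politeness rules: a nightly crawl window and a fixed delay between requests to a site. A small sketch of how such rules might be checked is below. The function and constant names are hypothetical, and treating 6am itself as outside the window is an assumption.

```python
from datetime import time

CRAWL_START = time(21, 0)    # 9pm, from the slide
CRAWL_END = time(6, 0)       # 6am, from the slide
REQUEST_DELAY_SECONDS = 10   # delay between requests to the same site


def in_crawl_window(now):
    """True if `now` (a datetime.time) falls in the overnight 9pm-6am window."""
    # The window wraps past midnight, so it is the union of two ranges.
    return now >= CRAWL_START or now < CRAWL_END


def next_allowed_request(last_request_ts):
    """Earliest timestamp (in seconds) at which the same site may be hit again."""
    return last_request_ts + REQUEST_DELAY_SECONDS
```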

10
Page Change
  • Example: 50 visits to a page, 5 changes
  • → average change interval = 50/5 = 10 days
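The arithmetic on this slide can be sketched as below, assuming one visit per day (as in the experimental setup) and assuming changes are detected by comparing content checksums between consecutive visits. The checksum approach and both function names are illustrative assumptions, not details given in the slides.

```python
import hashlib


def count_changes(snapshots):
    """Count visits whose content checksum differs from the previous visit."""
    digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in snapshots]
    return sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)


def average_change_interval(num_visits, num_changes, days_between_visits=1.0):
    """Observed period divided by the number of detected changes.

    Returns None when no change was detected, since the interval is then unknown.
    """
    if num_changes == 0:
        return None
    return num_visits * days_between_visits / num_changes
```

With 50 daily visits and 5 detected changes this gives 10 days, matching the slide. Note that with daily sampling, several changes between two visits are counted as one.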

11
Average Change Interval
[Figure: fraction of pages (y-axis) by average change interval]
12
Change Interval - By Domain
[Figure: fraction of pages (y-axis) by change interval, grouped by domain]
13
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is the time to the next event
  • $f_T(t) = \lambda e^{-\lambda t}$ for $t > 0$
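Under the Poisson model, the time to the next change is exponentially distributed, so the chance that a page has changed within a given time follows directly from the density. The sketch below evaluates both. The function names and the parameter name `rate` (the Poisson rate, in changes per day) are chosen here for illustration.

```python
import math


def interval_density(t, rate):
    """Density of the time to the next change: rate * exp(-rate * t) for t > 0."""
    if t <= 0:
        return 0.0
    return rate * math.exp(-rate * t)


def prob_changed_within(t, rate):
    """Probability of at least one change within t: 1 - exp(-rate * t)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-rate * t)
```

For a page that changes every 10 days on average, `rate` is 0.1 per day, and `prob_changed_within(10, 0.1)` equals `1 - exp(-1)`.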

14
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with a given interval (y-axis) against interval in days (x-axis), compared with the Poisson model]
15
Change Metrics
  • Freshness
  • Freshness of page $e_i$ at time $t$:
    $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$,
    $F(e_i; t) = 0$ otherwise

16
Change Metrics
  • Age
  • Age of page $e_i$ at time $t$:
    $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$,
    $A(e_i; t) = t - (\text{modification time of } e_i)$
    otherwise
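The freshness and age definitions can be sketched for a single page as below. The representation is an assumption made for illustration: the local copy was last synchronized at `last_sync_time`, the source page changed at the times in `change_times`, and "modification time" is read as the first change not yet picked up by the local copy.

```python
def freshness(last_sync_time, change_times, t):
    """1 if the local copy is up-to-date at time t, 0 otherwise."""
    stale = any(last_sync_time < c <= t for c in change_times)
    return 0 if stale else 1


def age(last_sync_time, change_times, t):
    """0 if up-to-date at time t, else time elapsed since the first missed change."""
    missed = [c for c in change_times if last_sync_time < c <= t]
    if not missed:
        return 0
    return t - min(missed)
```

For a copy synchronized at time 0 and a source that changes at times 3 and 5, the copy is fresh at time 2 and has age 3 at time 6.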

17
Change Metrics
[Figure: F(ei) (1 or 0) and A(ei) (starting at 0) plotted against time, with refresh and update events marked]
18
Trick Question
  • Two page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit only one page per week
  • How should we visit pages?
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ...
    (proportional)
  • ?

[Figure: pages e1 and e2 on the web and their copies e1 and e2 in the database]
19
Proportional Often Not Good!
  • Visit fast-changing e1 → get 1/2 day of freshness
  • Visit slow-changing e2 → get 1/2 week of
    freshness
  • Visiting e2 is a better deal!
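The comparison can be checked numerically under stated assumptions: each page changes as a Poisson process, each page is refreshed at fixed intervals, and the total budget is one visit per week (a reading of the two-page example, not something the slides spell out). Under those assumptions the time-averaged freshness of a page with change rate λ and refresh rate f is (1 - e^(-λ/f)) / (λ/f). All names in the sketch are illustrative.

```python
import math


def expected_freshness(change_rate, refresh_rate):
    """Time-averaged freshness for Poisson changes and fixed-interval refresh."""
    if refresh_rate <= 0:
        return 0.0   # never refreshed: treated as never fresh in the long run
    if change_rate == 0:
        return 1.0   # never changes: always fresh once fetched
    r = change_rate / refresh_rate
    return (1.0 - math.exp(-r)) / r


# Rates in events per week: e1 changes daily, e2 changes weekly.
CHANGE_RATES = {"e1": 7.0, "e2": 1.0}
BUDGET = 1.0  # assumed total visits per week


def average_freshness(allocation):
    """Mean freshness over both pages for a given visit allocation."""
    scores = [expected_freshness(CHANGE_RATES[p], allocation[p]) for p in CHANGE_RATES]
    return sum(scores) / len(scores)


uniform = {p: BUDGET / len(CHANGE_RATES) for p in CHANGE_RATES}
total_rate = sum(CHANGE_RATES.values())
proportional = {p: BUDGET * rate / total_rate for p, rate in CHANGE_RATES.items()}

for name, allocation in [("uniform", uniform), ("proportional", proportional)]:
    print(name, round(average_freshness(allocation), 4))
```

Under these assumptions the uniform allocation scores higher than the proportional one, which is the point the slide makes.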

20
Selecting Optimal Refresh Frequency
  • Analysis is complex
  • Shape of curve is the same in all cases
  • Holds for any distribution $g(\lambda)$
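The slides do not show the analysis itself. As a rough stand-in, a brute-force search over how to split a fixed visit budget between two pages gives a feel for the result. The sketch assumes Poisson changes, fixed-interval refresh, and the freshness formula (1 - e^(-λ/f)) / (λ/f). The grid search is a substitute for the closed-form analysis, and `best_split` is a hypothetical helper.

```python
import math


def expected_freshness(change_rate, refresh_rate):
    """Time-averaged freshness for Poisson changes and fixed-interval refresh."""
    if refresh_rate <= 0:
        return 0.0
    if change_rate == 0:
        return 1.0
    r = change_rate / refresh_rate
    return (1.0 - math.exp(-r)) / r


def best_split(rate_1, rate_2, budget=1.0, steps=100):
    """Grid-search the share of the budget given to page 1.

    Returns (refresh rate for page 1, resulting average freshness).
    """
    best = None
    for k in range(steps + 1):
        f1 = budget * k / steps
        f2 = budget - f1
        score = (expected_freshness(rate_1, f1) + expected_freshness(rate_2, f2)) / 2
        if best is None or score > best[1]:
            best = (f1, score)
    return best
```

For rates of 7 and 1 changes per week and a budget of one visit per week, `best_split(7.0, 1.0)` assigns the whole budget to the slower page, in line with the remark that visiting e2 is the better deal. Two pages with equal rates split the budget evenly.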

21
Optimal Refresh Frequency for Age
  • Analysis is also complex
  • Shape of curve is the same in all cases
  • Holds for any distribution $g(\lambda)$

22
Comparing Policies
Based on statistics from the experiment and a revisit
frequency of once a month
23
Summary
  • Keeping the collection fresh
  • Web evolution experiment
  • Change metrics
  • Optimal policy
  • Intuitive policy does not always perform well
  • Should be careful in deciding revisit policy