Crawling the Web - PowerPoint PPT Presentation

Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu
Transcript and Presenter's Notes

1
Crawling the Web
  • Discovery and Maintenance of Large-Scale Web Data

Junghoo Cho, Stanford University
2
What is a Crawler?
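[Diagram: the crawl loop. Initial URLs seed (init) the list of to-visit URLs; the crawler repeatedly gets the next URL, gets the page from the Web, stores it with the collected Web pages, records the URL among the visited URLs, and extracts new URLs back into the to-visit list.]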
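
The loop in the diagram above can be sketched in a few lines. This is a minimal illustration, not the WebBase crawler itself: the in-memory `TOY_WEB` dictionary and the `toy_links` helper are hypothetical stand-ins for fetching a page over HTTP and parsing its links, and breadth-first order is an assumption.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
# A real crawler would fetch the page over HTTP and parse its links instead.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(initial_urls, get_links):
    """Visit every URL reachable from initial_urls once, breadth first."""
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    order = []                       # pages in the order they were fetched

    while to_visit:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        order.append(url)            # stands in for "get page" and storing it
        for link in get_links(url):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return order


def toy_links(url):
    """Hypothetical link extractor backed by TOY_WEB."""
    return TOY_WEB.get(url, [])


print(crawl(["a"], toy_links))  # ['a', 'b', 'c', 'd']
```

The visited set is what keeps the loop from fetching a page twice when several pages link to it.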
3
Applications
  • Internet Search Engines
  • Google, AltaVista
  • Comparison Shopping Services
  • mySimon, BizRate
  • Data mining
  • Stanford WebBase, IBM WebFountain

4
Why at a University?
  • Not much scientific study
  • Little had been known
  • Freshness problems
  • Can we cope with growth?
  • Can focus on fundamental issues

5
Crawling at Stanford
  • WebBase Project
  • BackRub Crawler, PageRank
  • Google
  • New WebBase Crawler
  • 20,000 lines in C/C++
  • 130M pages collected

6
Crawling Issues (1)
  • Load at visited web sites
  • Space out requests to a site
  • Limit number of requests to a site per day
  • Limit depth of crawl
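
One way to enforce the three limits above is a small per-site bookkeeping class. This is a sketch under stated assumptions: the class name `PolitenessPolicy` and its parameters are hypothetical, the 10-second delay and 3,000-pages-per-day defaults are borrowed from the experimental setup reported later in this talk, and the depth limit of 5 is an arbitrary placeholder. Time is passed in explicitly so the logic is easy to test.

```python
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Tracks per-site request history and decides whether a fetch is allowed."""

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s
        self.max_per_day = max_per_day
        self.max_depth = max_depth
        self.last_request = {}   # site -> time of last request (seconds)
        self.count_today = {}    # site -> (day index, requests made that day)

    def allow(self, url, depth, now_s):
        """Return True and record the request if fetching url now is polite."""
        site = urlsplit(url).netloc
        if depth > self.max_depth:                 # limit depth of crawl
            return False
        last = self.last_request.get(site)
        if last is not None and now_s - last < self.min_delay_s:
            return False                           # space out requests
        day = int(now_s // 86400)
        seen_day, count = self.count_today.get(site, (day, 0))
        if seen_day != day:
            count = 0                              # a new day resets the counter
        if count >= self.max_per_day:              # limit requests per day
            return False
        self.last_request[site] = now_s
        self.count_today[site] = (day, count + 1)
        return True
```

A crawler would call `allow` before each fetch and put the URL back on its queue when the answer is False.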

7
Crawling Issues (2)
  • Load at crawler
  • Parallelize

[Diagram: the same crawl loop run in parallel. Several init / get next url / get page / extract urls pipelines share the to-visit URLs, the visited URLs, and the collected Web pages.]
8
Crawling Issues (3)
  • Scope of crawl
  • Not enough space for all pages
  • Not enough time to visit all pages

9
Crawling Issues (4)
  • Replication
  • Pages mirrored at multiple locations
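
The transcript does not say how mirrored pages are detected. As a rough illustration only, the simplest approach is to group URLs by a hash of their content; this catches exact copies (up to whitespace) and nothing subtler. The function names and the whitespace normalization below are assumptions made for the example.

```python
import hashlib


def content_key(html):
    """Hash of the page text with whitespace runs collapsed (assumed normalization)."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_mirrors(pages):
    """Group URLs whose content hashes to the same key.

    pages: dict mapping URL -> page text.
    Returns a list of sorted URL lists, one per group with more than one member.
    """
    groups = {}
    for url, html in pages.items():
        groups.setdefault(content_key(html), []).append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


pages = {
    "http://a.example/x": "<p>hi  there</p>",
    "http://b.example/x": "<p>hi there</p>\n",
    "http://c.example/y": "<p>other</p>",
}
print(group_mirrors(pages))
```

Here the first two URLs land in one group because their text differs only in whitespace.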

10
Crawling Issues (5)
  • Incremental crawling
  • How do we avoid crawling from scratch?
  • How do we keep pages fresh?

11
Summary of My Research
  • Load on sites [PAWS00]
  • Parallel crawler [Tech Report 01]
  • Page selection [WWW7]
  • Replicated page detection [SIGMOD00]
  • Page freshness [SIGMOD00]
  • Crawler architecture [VLDB00]

12
Outline of This Talk
  • How can we keep pages fresh?
  • How does the Web change?
  • What do we mean by fresh pages?
  • How should we refresh pages?

13
Web Evolution Experiment
  • How often does a Web page change?
  • How long does a page stay on the Web?
  • How long does it take for 50% of the Web to
    change?
  • How do we model Web changes?

14
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest PageRank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old
    pages)
  • ran only 9pm - 6am, 10 seconds between site
    requests

15
Average Change Interval
[Figure: fraction of pages versus average change interval]
16
Change Interval By Domain
[Figure: fraction of pages versus average change interval, by domain]
17
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is time to next event
  • $f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
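
To get a feel for the model, one can draw change intervals from the exponential density above and check them against its predictions. The sketch below assumes a rate of 0.1 changes per day (a page that changes every 10 days on average); the function name, sample size, and seed are arbitrary choices for the example.

```python
import math
import random


def simulate_change_intervals(rate_per_day, n, seed=0):
    """Draw n inter-change intervals (in days) for a Poisson process."""
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_day) for _ in range(n)]


RATE = 0.1  # assumed: one change every 10 days on average
intervals = simulate_change_intervals(RATE, n=100_000)

mean_interval = sum(intervals) / len(intervals)
short_fraction = sum(1 for t in intervals if t < 10) / len(intervals)

# The model predicts a mean of 1/rate and P(T < t) = 1 - exp(-rate * t).
print(f"mean interval: {mean_interval:.2f} days, model: {1 / RATE:.2f}")
print(f"P(T < 10 days): {short_fraction:.3f}, model: {1 - math.exp(-1):.3f}")
```

The sample mean should sit close to 10 days, and roughly 63% of intervals should be shorter than the mean, which is the signature of the exponential distribution.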

18
Change Interval of Pages
[Figure: fraction of changes with a given interval versus interval in days, for pages that change every 10 days on average, compared with the Poisson model]
19
Change Metrics
  • Freshness
  • Freshness of element $e_i$ at time $t$ is
    $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$,
    and $0$ otherwise

20
Change Metrics
  • Age
  • Age of element $e_i$ at time $t$ is
    $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$,
    and $t - (\text{modification time of } e_i)$
    otherwise
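
The freshness and age definitions translate directly into two small functions. This sketch makes one assumption the slides leave open: "modification time" is read as the time of the first change to the real page since the crawler last refreshed its copy, passed in as `first_change_since_sync` (or `None` if the page has not changed). The parameter name is hypothetical.

```python
def freshness(first_change_since_sync, t):
    """F(e_i; t): 1 if the local copy is up to date at time t, 0 otherwise."""
    stale = first_change_since_sync is not None and first_change_since_sync <= t
    return 0 if stale else 1


def age(first_change_since_sync, t):
    """A(e_i; t): 0 if up to date, else time elapsed since the page was modified."""
    if freshness(first_change_since_sync, t) == 1:
        return 0.0
    return t - first_change_since_sync


# Page refreshed at day 0 and changed on the Web at day 3.
print(freshness(3, t=2), age(3, t=2))   # still fresh before the change
print(freshness(3, t=5), age(3, t=5))   # stale, and 2 days old
```

Freshness only records whether the copy is stale, while age also records for how long, which is why the two metrics can favor different refresh policies.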

21
Change Metrics
[Figure: timelines of freshness F(ei), between 0 and 1, and age A(ei), starting from 0, over time, with update and refresh events marked]
22
Trick Question
  • Two page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit one page per week
  • How should we visit pages?
  • e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 (proportional)
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • ?

[Diagram: pages e1 and e2 on the Web and their copies e1 and e2 in the local database]
23
Proportional Often Not Good!
  • Visit fast changing e1
  • → get 1/2 day of freshness
  • Visit slow changing e2
  • → get 1/2 week of freshness
  • Visiting e2 is a better deal!
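
The same comparison can be run under the Poisson model from earlier in the talk. The sketch below rests on two assumptions not stated on this slide: each page is refreshed at regular intervals, and its time-averaged freshness is then (1 - e^(-r)) / r with r = change rate / refresh rate, which follows from averaging e^(-λs) over one refresh interval. The budget of one visit per week and the four candidate policies come from the trick question.

```python
import math


def avg_freshness(change_rate, refresh_rate):
    """Time-averaged freshness of one page under the Poisson change model,
    when the page is refreshed at regular intervals of 1 / refresh_rate."""
    if refresh_rate == 0:
        return 0.0              # never refreshed: stale in the long run
    r = change_rate / refresh_rate
    return (1 - math.exp(-r)) / r


CHANGE_RATES = {"e1": 1.0, "e2": 1.0 / 7}   # changes per day
BUDGET = 1.0 / 7                            # one visit per week, in visits per day

policies = {
    "uniform":      {"e1": BUDGET / 2, "e2": BUDGET / 2},
    "proportional": {"e1": BUDGET * 7 / 8, "e2": BUDGET * 1 / 8},
    "only e1":      {"e1": BUDGET, "e2": 0.0},
    "only e2":      {"e1": 0.0, "e2": BUDGET},
}

# Average the two pages' freshness for each policy.
scores = {
    name: sum(avg_freshness(CHANGE_RATES[p], f) for p, f in alloc.items()) / 2
    for name, alloc in policies.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

Under these assumptions, spending the whole budget on e2 scores highest, uniform comes second, and proportional falls below uniform, in line with the slide's conclusion.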

24
Optimal Refresh Frequency
  • Problem
  • Given change frequencies $\lambda_i$ and average refresh frequency $f$,
  • find refresh frequencies $f_i$
  • that maximize freshness

25
Solution
  • Compute
  • Lagrange multiplier method
  • All

26
Optimal Refresh Frequency
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

27
Optimal Refresh for Age
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

28
Comparing Policies
Based on statistics from the experiment and a revisit
frequency of once a month
29
Not Every Page is Equal!
  • Some pages are more important
  • e1: accessed by users 10 times/day
  • e2: accessed by users 20 times/day
30
Weighted Freshness
[Figure: refresh frequency f versus change frequency λ, with curves for weights w = 1 and w = 2]
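
The transcript does not give a formula for weighted freshness. A natural reading, assumed here, is a weighted average of per-page freshness with weights proportional to how often users access each page (10 and 20 times per day for e1 and e2). The function name is hypothetical.

```python
def weighted_freshness(freshness_by_page, weight_by_page):
    """Weighted average of per-page freshness values (assumed definition)."""
    total_weight = sum(weight_by_page.values())
    weighted_sum = sum(
        weight_by_page[page] * freshness_by_page[page]
        for page in freshness_by_page
    )
    return weighted_sum / total_weight


# Weights taken as user accesses per day.
weights = {"e1": 10, "e2": 20}
only_e1_fresh = weighted_freshness({"e1": 1, "e2": 0}, weights)
only_e2_fresh = weighted_freshness({"e1": 0, "e2": 1}, weights)
print(f"only e1 fresh: {only_e1_fresh:.3f}, only e2 fresh: {only_e2_fresh:.3f}")
```

With these weights, keeping e2 fresh is worth twice as much as keeping e1 fresh.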
31
Change Frequency Estimation
  • How to estimate change frequency?
  • Naïve Estimator: X/T
  • X: number of detected changes
  • T: monitoring period
  • 2 changes in 10 days: 0.2 times/day
  • Incomplete change history

32
Improved Estimator
  • Based on the Poisson model
  • X: number of detected changes
  • N: number of accesses
  • f: access frequency
  • 3 changes in 10 days: 0.36 times/day
  • → Accounts for missed changes
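
The transcript omits the formula for the improved estimator. One form that follows from the Poisson model and reproduces the slide's 0.36 figure is -f · ln(1 - X/N); treat it as an assumption, not as the formula from the talk. The sketch also assumes the page was accessed once a day for 10 days (N = 10, f = 1 per day), and the function names are hypothetical.

```python
import math


def naive_estimate(detected_changes, monitoring_days):
    """X / T: detected changes divided by the monitoring period."""
    return detected_changes / monitoring_days


def poisson_estimate(detected_changes, accesses, access_freq_per_day):
    """-f * ln(1 - X/N): corrects for changes missed between accesses.

    Assumed form; the transcript does not show the formula itself.
    Undefined when every access detected a change (X == N).
    """
    if detected_changes >= accesses:
        raise ValueError("estimate is undefined when X >= N")
    unchanged_fraction = 1 - detected_changes / accesses
    return -access_freq_per_day * math.log(unchanged_fraction)


print(round(naive_estimate(2, 10), 2))          # 0.2
print(round(poisson_estimate(3, 10, 1.0), 2))   # 0.36
```

For the same 3 detected changes the naïve estimator would give 0.3 times/day; the Poisson-based form is higher because an access can only reveal that a page changed at least once since the previous access, not how many times.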

33
Improved Estimator
  • Bias
  • Efficiency
  • Consistency

34
Improvement Significant?
  • Application to a Web crawler
  • Visit pages once every week for 5 weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Uniform: do not adjust
  • Naïve: based on the naïve estimator
  • Ours: based on our improved estimator

35
Improvement from Our Estimator
Policy    Detected changes   Ratio to uniform
Uniform   2,147,589          100%
Naïve     4,145,582          193%
Ours      4,892,116          228%
(9,200,000 visits in total)
36
Other Estimators
  • Irregular access interval
  • Last-modified date
  • Categorization

37
Summary
  • Web evolution experiment
  • Change metric
  • Refresh policy
  • Frequency estimator

38
Contribution
  • Freshness [SIGMOD00]
  • Page selection [WWW7]
  • Replicated page detection [SIGMOD00]
  • Load on sites [PAWS00]
  • Parallel crawler [Tech Report 01]
  • Crawler architecture [VLDB00]

39
What's Next?
  • New search paradigm
  • What is the middle name of Thomas Edison?
  • Thomas a-z Edison
  • Continuous data stream
  • Web logs, Network traffic engineering
  • How should the data model change?

40
The End
  • Thank you for your attention
  • For more information visit
  • http://www-db.stanford.edu/~cho/