Title: Keeping Up With The Changing Web
1 Keeping Up With The Changing Web
- Brian E. Brewington
- George Cybenko
2 Re-Crawling is Essential
- Usually, satisfying search results
- Reference documents that currently exist
- Are based on the current contents of those documents
- The web is constantly changing
- Re-crawling is essential to maintaining a fresh document collection
- How often should a new crawl be performed?
3 Figuring Out How Often to Crawl
- Develop a mathematical model of how often web pages change
- Decide how current you expect the document collection to be
- Determine how often the web must be crawled
- Using the change model and freshness expectation as inputs
4 Finding a Model for the Distribution of Change Rates
- Monitor the web
- Change: any alteration of a web page
- Lifetime: time between successive changes
- Age: time since the last change
5 Age to Model Change Rates
(Figure: Cumulative Distribution of Ages)
- Use the observed age distribution to approximate the rate of change of documents
- The web is young
- 1/5th of documents are less than 12 days old
6 Age to Model Change Rates
- Difficulties
- Web server doesn't tell you when a page was first created
- Can't know the age of a page from a single sample
- If a page is young
- It could be highly dynamic
- Or, it could be newly created
- Need a growth model to estimate which
- Used lifetime distributions instead
7 Lifetime to Model Change Rates
- Use lifetime to approximate the rate of change of documents
- Difficulties
- For quickly changing pages
- May not observe every change
- For slowly changing pages
- May not observe any change
- Need to correct for both effects
8 Lifetime to Model Change Rates
- Assumptions
- Pages change according to independent Poisson processes
- Each described by a rate λ
- A Weibull distribution of mean lifetimes for the Poisson processes
- The observation period for a page is independent of its change rate
- Overlook the bias against pages that change infrequently
9 Lifetime to Model Change Rates
- Using the assumptions
- Find the parameters that best fit the model to the observed lifetime distribution (see the sketch below)
- Result
- A distribution of mean web page lifetimes
- i.e. a model of the rate of web page changes
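As a rough illustration of this fitting step, the sketch below fits a Weibull distribution to a sample of mean page lifetimes with SciPy. It is a simplified stand-in for the paper's estimator: the lifetimes here are synthetic, the shape and scale values are arbitrary, and the real fit also has to correct for the observation effects listed on slide 7 (missed changes in fast pages, unseen changes in slow pages).

```python
# Simplified sketch: fit a Weibull distribution to observed mean page lifetimes.
# The data are synthetic; the paper's estimator additionally corrects for
# changes that are missed or never observed between visits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical "observed" mean lifetimes in days, drawn from an assumed Weibull.
true_shape, true_scale = 1.4, 120.0
observed_lifetimes = stats.weibull_min.rvs(true_shape, scale=true_scale,
                                           size=5000, random_state=rng)

# Fit Weibull shape and scale to the observations (location pinned at 0).
fit_shape, _, fit_scale = stats.weibull_min.fit(observed_lifetimes, floc=0)
print(f"fitted shape = {fit_shape:.2f}, scale = {fit_scale:.1f} days")

# The fitted distribution of mean lifetimes is the change-rate model:
# a page with mean lifetime t changes as a Poisson process with rate 1/t.
```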
10 The Meaning of Up-To-Date
- It is unreasonable to expect a search engine to be absolutely current
- i.e. completely current all of the time
- A practical/reasonable expectation
- Relaxes time
- Relaxes certainty
- (α, β)-currency
11 (α, β)-Current
- With probability at least α, a randomly chosen page in the collection is β-current
- β-current: the page has not changed between the last time it was checked and β time units ago
- (0.8, 3-week)-current: at least an 80% likelihood that a document in the collection has not changed from when it was checked until 3 weeks ago
12 Deciding How often to Crawl
- Assumption: rebuild the entire document collection periodically
- Every T time units (on average)
- Define
- Relative re-indexing rate λT
- λ is the rate of the page's Poisson change process
- Grace period percentage β/T
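To make these quantities concrete, the sketch below computes the probability α that a single page is β-current when it changes as a Poisson process with rate λ and the collection is rebuilt every T time units. The closed form comes from averaging e^(−λ·max(0, t − β)) over the time t since the last check, uniform on [0, T]; the function name and example numbers are illustrative, and the paper's full result further averages over the Weibull distribution of change rates.

```python
# Sketch: probability that one page is beta-current under periodic re-indexing.
# Assumes a Poisson change process with rate lam (changes per day) and a
# rebuild every T days; the example numbers are made up.
import math

def alpha_single_page(lam: float, T: float, beta: float) -> float:
    """Average of exp(-lam * max(0, t - beta)) over t uniform on [0, T]."""
    if beta >= T:
        return 1.0
    return (beta + (1.0 - math.exp(-lam * (T - beta))) / lam) / T

# A page with a 50-day mean lifetime, re-indexed every 30 days,
# judged with a 3-week (21-day) grace period:
print(alpha_single_page(lam=1 / 50, T=30.0, beta=21.0))   # ~0.97
```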
13 Deciding How often to Crawl
- Y axis: α (probability of being β-current)
- X axis: λT (relative re-indexing rate)
- Lines: β/T (grace period percentage)
(Figure annotations: "Index less often" / "Index more often")
14 Deciding How often to Crawl
- The surface: α as a function of β/T (grace period percentage) and T (re-indexing period)
- The plane: α = 0.95
- Intersect the surface with the plane
- Obtain a curve relating the re-indexing period to the grace period β for a given α
15 Deciding How often to Crawl
16 What to Do With the Results
- Decide how often to crawl in order to keep the index current to a specified level
- Estimate bandwidth requirements
- bandwidth ≈ (pages in index / days between crawls) × (kilobytes / page)
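As a worked example with entirely made-up figures (the index size, average page size, and re-crawl period below are placeholders, not numbers from the paper):

```python
# Back-of-the-envelope bandwidth estimate; all inputs are hypothetical.
pages_in_index = 800_000_000      # assumed index size
kb_per_page = 10                  # assumed average page size in kilobytes
days_between_crawls = 18          # re-indexing period chosen from the model

kb_per_day = pages_in_index * kb_per_page / days_between_crawls
mbit_per_s = kb_per_day * 8 / (24 * 3600) / 1000
print(f"{kb_per_day:.2e} KB/day, roughly {mbit_per_s:.0f} Mbit/s sustained")
```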
17 Future Work
- Consider
- Varying importance of pages
- Varying importance of changes to pages
- Varying the re-indexing period by page
- The next paper tries to address these issues
- The mathematical model in this paper fits into
the techniques described in the next paper
18 (No Transcript)
19 The Evolution of the Web and Implications for an Incremental Crawler
- Junghoo Cho
- Hector Garcia-Molina
20 Types Of Crawlers
- Assume that an initial document collection has
been built
- Periodic Crawler
- Crawls from time to time
- Creates a new document collection each time
- Incremental Crawler
- Crawls continually
- Updates changed documents
- Replaces old or less important documents with
new or more important documents
21 Goals of Incremental Crawling
- Improve freshness of the document collection
- Freshness: fraction of up-to-date documents
- Avoid delaying updates until the next crawl cycle
- Visit frequently changed pages more often
- Discover new/removed documents sooner
- Efficiency
- Try to revisit documents only as often as they change
- Distribute bandwidth usage over a longer time span
22 Developing an Incremental Crawler
- Study how the web changes over time
- Understand the system demands of an incremental crawler
- How they differ from those of a periodic crawler
- Make key design choices for an incremental crawler
23 Learning How the Web Changes
- Monitor part of the web for a time period
- 270 sites: 132 .coms, 78 .edus, 11 .nets, 19 .orgs, 28 .govs, and 2 .mils
- 4 month monitoring period
- Re-crawled each site daily
- BFS crawl from the root page of each site until 3000 documents are found
- Page window: the pages among those 3000 documents
- Pages can enter and leave the window
- Simulates new page creation / page removal
24 How Often Web Pages Change
- Average change interval of each page
- = days in window / number of times it changed
- Ignored error due to undetected page changes
- Over all domain types
- 20% of pages changed on every visit
- 30% of pages never changed
- By domain type
- Pages in .com domains change more frequently than
pages in other domains (e.g. .gov)
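A minimal sketch of that estimate, assuming a simple per-page log of daily content checksums (the data layout and function name are invented for illustration):

```python
# Sketch: average change interval of one page from daily observations.
# Assumes one content checksum per day while the page was inside the window;
# changes that happen and revert between visits go undetected, as noted above.
from typing import Optional, Sequence

def average_change_interval(daily_checksums: Sequence[str]) -> Optional[float]:
    """days in window / number of visits on which the content had changed."""
    days_in_window = len(daily_checksums)
    changes = sum(1 for prev, cur in zip(daily_checksums, daily_checksums[1:])
                  if prev != cur)
    if changes == 0:
        return None   # never observed to change during the monitoring period
    return days_in_window / changes

# A page observed for 8 days that changed on 2 of the revisits:
print(average_change_interval(["a", "a", "b", "b", "b", "c", "c", "c"]))  # 4.0
```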
25 The Lifespan of Web Pages
- Lifespan: length of time that a page exists
- Approximated by the number of days that the page was within the window
- Adjusted for error due to lifespans extending beyond the monitoring period
- Over all domain types
- Some pages are short lived
- Many pages are long lived
- By domain type
- Pages in .com domains have a shorter lifespan
than pages in other domains (e.g. .gov)
26 The Time for 50% of the Web to Change
- Considers the changes of the web as a whole
- 50 days for 50% of the web to change
- For .com domains, 11 days for 50% of the pages to change
27 Mathematical Model of Document Changes
- Motivations
- To compare crawling policies
- By comparing how up-to-date their document collections are
- A mathematical model can predict how many pages have changed since the crawler last crawled them
- To decide which pages to re-visit during incremental crawling
- Mathematical model
- Poisson process
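Under the Poisson model, the probability that a page with change rate λ has changed at all t days after it was last crawled is 1 − e^(−λt). The small sketch below uses the slide-26 figure (about 50 days for half the web to change) to back out a single overall rate; treating the whole web with one rate is only a rough illustration, since both papers stress that rates vary widely across pages.

```python
# Sketch: probability a page has changed since it was last crawled (Poisson).
# A single web-wide rate is assumed here purely for illustration.
import math

lam = math.log(2) / 50   # if half the web changes in ~50 days, lam ≈ 0.014/day

def prob_changed(days_since_crawl: float, rate: float = lam) -> float:
    """P(at least one change in the given number of days)."""
    return 1.0 - math.exp(-rate * days_since_crawl)

for days in (1, 7, 30, 90):
    print(f"{days:3d} days since crawl: {prob_changed(days):.2f}")
```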
28 Using the Observations
- Critical observations
- The change frequency of a web document varies
- The lifetime of a web document varies
- Portions of the web evolve faster than others
- Changes can be modeled as a Poisson process
- To keep up with the changes, incremental crawling
might be a better strategy than periodic crawling
29 Batch Mode vs. Steady
- Batch-mode Crawler
- Periodically update entire collection
- Steady Crawler
- Continually update portions of collection
- Crawling at the same average speed → same average freshness over time
- A steady crawler can achieve this with a lower peak speed
(Graphs assume that pages are immediately available to users)
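The sketch below is a toy simulation of this comparison; the collection size, change rate, and cycle length are all assumed values. The batch-mode crawler rebuilds the whole collection on one day per cycle, while the steady crawler spreads the same number of fetches evenly over the cycle. With in-place updates and pages immediately visible, the two end up with similar time-averaged freshness, but the steady crawler's peak crawl rate is far lower.

```python
# Toy simulation: batch-mode vs. steady crawling at the same average speed.
# All parameters are illustrative; freshness = fraction of up-to-date pages.
import math
import random

random.seed(0)
N = 2000        # pages in the collection
LAM = 0.02      # assumed per-day Poisson change rate for every page
CYCLE = 30      # days between batch crawls / length of one steady sweep
DAYS = 360

def simulate(steady: bool):
    stale = [False] * N
    freshness_sum, peak_rate = 0.0, 0
    for day in range(DAYS):
        # Each page may change today (independent Poisson processes).
        p_change = 1.0 - math.exp(-LAM)
        for i in range(N):
            if random.random() < p_change:
                stale[i] = True
        # How many pages does the crawler refresh today?
        if steady:
            refresh = N // CYCLE                      # spread evenly
        else:
            refresh = N if day % CYCLE == 0 else 0    # one big burst per cycle
        peak_rate = max(peak_rate, refresh)
        # Refresh pages round-robin, in place.
        start = (day * refresh) % N
        for k in range(refresh):
            stale[(start + k) % N] = False
        freshness_sum += 1.0 - sum(stale) / N
    return freshness_sum / DAYS, peak_rate

for label, steady in (("batch", False), ("steady", True)):
    avg, peak = simulate(steady)
    print(f"{label:6s} average freshness = {avg:.2f}, peak pages/day = {peak}")
```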
30 Shadowing vs. In-Place
- Shadowed updates
- Build a new collection separately
- Instantaneously replaces active collection
- Simpler to implement
- Acceptable for batch mode crawlers
- In-place updates
- Update the active collection directly
- Active collection is more fresh
- Nearly essential for steady crawlers
31 Shadowing vs. In-Place
- The crawler's collection is always as fresh as it can be
- For the active collection
- Freshness is lost by shadowing
- The lost freshness is more significant for a steady crawler
(Figure: freshness over time for a Batch Mode Crawler and a Steady Crawler; dashed line = in-place, solid line = shadowing)
32 Page Update Frequency
- Fixed frequency
- Visit all pages equally often
- A periodic crawler will most likely visit every
page in every crawl (fixed frequency)
- Variable frequency
- Revisit actively changed pages more often
- A steady crawler may elect to vary the frequency
33 Variable Frequency Visitation
- How often should the crawler visit pages?
- Intuition: visit a page more often if it changes more often
- But, there is a point of diminishing returns
- Optimal behavior (illustrated in the sketch below)
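A small numeric illustration of the diminishing-returns point, with assumed numbers: two pages share a budget of 5 visits per day, one changing 9 times per day and the other once per day. For a page revisited at regular intervals of length 1/f and changing with Poisson rate λ, the expected freshness is (f/λ)(1 − e^(−λ/f)); sweeping the split shows the best allocation gives the fast page noticeably fewer visits than a change-rate-proportional policy would, which is the intuition behind the optimal variable-frequency behavior.

```python
# Sketch: splitting a fixed visit budget between a fast- and a slow-changing
# page. The change rates and visit budget are assumed values for illustration.
import math

def freshness(f: float, lam: float) -> float:
    """Expected fraction of time a page is fresh when revisited every 1/f days
    and changing as a Poisson process with rate lam (changes per day)."""
    if f <= 0:
        return 0.0
    return (f / lam) * (1.0 - math.exp(-lam / f))

LAM_FAST, LAM_SLOW, BUDGET = 9.0, 1.0, 5.0    # changes/day and visits/day

candidates = [i * 0.05 for i in range(1, 100)]     # visits/day to the fast page
best_avg, best_fast = max(
    ((freshness(f, LAM_FAST) + freshness(BUDGET - f, LAM_SLOW)) / 2, f)
    for f in candidates)

prop_fast = BUDGET * LAM_FAST / (LAM_FAST + LAM_SLOW)   # proportional policy
prop_avg = (freshness(prop_fast, LAM_FAST)
            + freshness(BUDGET - prop_fast, LAM_SLOW)) / 2

print(f"optimal:      {best_fast:.2f} visits/day to the fast page, freshness {best_avg:.3f}")
print(f"proportional: {prop_fast:.2f} visits/day to the fast page, freshness {prop_avg:.3f}")
```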
34 Comparison of Features
- Incremental approach
- Steady Crawler
- In-place updates
- Variable visitation frequency
- Potentially higher freshness
- Lower demands on network and web sites
- Periodic approach
- Batch mode crawler
- Shadowed updates
- Fixed visitation frequency
- Easier to implement
35 Incremental Crawler Implementation
- Loop forever
- Either: replace a page in the collection with a new page
- Or: update a page in the collection
- Decisions must be made at run-time
- Replace a page in the collection?
- If so, which page should be replaced, and with which new page?
- Update a page in the collection?
- If so, which one?
36 Refinement decisions
- Discard a page with the lowest importance
- Add a page with higher importance
- Improves the collection quality by replacing less important pages with more important pages
- Judge importance using PageRank, Hubs and Authorities, etc.
37 Update decisions
- Choose a page and update it
- Based on a model of the change rates of the pages
- The one that will improve the freshness of the collection the most (variable frequency model)
- Update that page
- Improves the collection's freshness by updating pages in the collection (see the sketch below)
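Pulling slides 35–37 together, here is a rough sketch of the run-time decision loop. The data structures and selection logic are invented for illustration: importance stands in for a PageRank-style score, and the update choice simply re-fetches the page most likely to have changed under the Poisson model, which is one simple way to realize the variable-frequency idea rather than the paper's exact policy.

```python
# Illustrative sketch of an incremental crawler's decision loop.
# Each page record holds an importance score (e.g. PageRank-like), an estimated
# Poisson change rate, and the time it was last crawled; `frontier` holds
# candidate new pages that are not yet in the collection.
import math

def prob_changed(page, now):
    """Probability the page has changed since it was last fetched (Poisson)."""
    return 1.0 - math.exp(-page["rate"] * (now - page["last_crawled"]))

def crawl_step(collection, frontier, now):
    # Refinement decision: swap in a more important new page, if one exists.
    if frontier:
        worst = min(collection, key=lambda p: p["importance"])
        best_new = max(frontier, key=lambda p: p["importance"])
        if best_new["importance"] > worst["importance"]:
            collection.remove(worst)
            frontier.remove(best_new)
            best_new["last_crawled"] = now   # fetch and index the new page
            collection.append(best_new)
            return
    # Update decision: re-fetch the collection page most likely to be stale.
    stalest = max(collection, key=lambda p: prob_changed(p, now))
    stalest["last_crawled"] = now            # fetch and re-index it in place

# The crawler runs this step forever, spreading the work steadily over time:
#   while True:
#       crawl_step(collection, frontier, now=time.time())
#       sleep(delay_between_fetches)
```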
38 Suggested Crawler Architecture
- Omitted due to time constraints
39 Ideas to Discuss
- A hybrid approach
- Batch mode crawler with steady maintenance
- Higher complexity and bandwidth expense
- Further improves freshness
- Collection growth/shrinkage
- Assumed that the collection has a fixed size
- Must decide how/when to grow/shrink collection