Title: Keeping Up With The Changing Web
1 Keeping Up With The Changing Web
- Brian E. Brewington
- George Cybenko
2 Re-Crawling is Essential
- Usually, satisfying search results
- Reference documents that currently exist
- Are based on the current contents of those documents
- The web is constantly changing
- Re-crawling is essential to maintaining a fresh document collection
- How often should a new crawl be performed?
3 Figuring Out How Often to Crawl
- Develop a mathematical model of how often web pages change
- Decide how current you expect the document collection to be
- Determine how often the web must be crawled
- Using the change model and freshness expectation as inputs
4 Finding a Model for the Distribution of Change Rates
- Monitor the web
- Change: any alteration of a web page
- Lifetime: time between successive changes
- Age: time since the last change
5 Age to Model Change Rates
(Figure: Cumulative Distribution of Ages)
- Use the observed age distribution to approximate the rate of change of documents
- The web is young
- 1/5th of documents are less than 12 days old
6 Age to Model Change Rates
- Difficulties
- Web server doesn't tell you when a page was first created
- Can't know the age of a page from a single sample
- If a page is young
- It could be highly dynamic
- Or, it could be newly created
- Need a growth model to estimate which
- Used lifetime distributions instead
7 Lifetime to Model Change Rates
- Use lifetime to approximate the rate of change of documents
- Difficulties
- For quickly changing pages
- May not observe every change
- For slowly changing pages
- May not observe any change
- Need to correct for both effects
8 Lifetime to Model Change Rates
- Assumptions
- Pages change according to independent Poisson processes
- Each described by a rate λ
- A Weibull distribution of mean lifetimes for the Poisson processes
- The observation period for a page is independent of its change rate
- Overlook the bias against pages that change infrequently
9 Lifetime to Model Change Rates
- Using the assumptions
- Find the parameters that best fit the model to the observed lifetime distribution (see the sketch below)
- Result
- A distribution of mean web page lifetimes
- i.e. a model of the rate of web page changes
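As a rough illustration of this fitting step, the sketch below fits a Weibull distribution to a sample of mean page lifetimes with SciPy. It is a simplified stand-in for the paper's estimator: the lifetimes here are synthetic, the shape and scale values are arbitrary, and the real fit also has to correct for the observation effects listed on slide 7 (missed changes in fast pages, unseen changes in slow pages).

```python
# Simplified sketch: fit a Weibull distribution to observed mean page lifetimes.
# The data are synthetic; the paper's estimator additionally corrects for
# changes that are missed or never observed between visits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical "observed" mean lifetimes in days, drawn from an assumed Weibull.
true_shape, true_scale = 1.4, 120.0
observed_lifetimes = stats.weibull_min.rvs(true_shape, scale=true_scale,
                                           size=5000, random_state=rng)

# Fit Weibull shape and scale to the observations (location pinned at 0).
fit_shape, _, fit_scale = stats.weibull_min.fit(observed_lifetimes, floc=0)
print(f"fitted shape = {fit_shape:.2f}, scale = {fit_scale:.1f} days")

# The fitted distribution of mean lifetimes is the change-rate model:
# a page with mean lifetime t changes as a Poisson process with rate 1/t.
```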
10 The Meaning of Up-To-Date
- It is unreasonable to expect a search engine to be absolutely current
- i.e. completely current all of the time
- A practical/reasonable expectation
- Relaxes time
- Relaxes certainty
- (α, β)-currency
11 (α, β)-Current
- With probability at least α, a randomly chosen page in the collection is β-current
- β-current: the page has not changed between the last time it was checked and β time units ago
- (0.8, 3-week)-current: at least an 80% likelihood that a document in the collection has not changed from when it was checked until 3 weeks ago
12 Deciding How often to Crawl
- Assumption: rebuild the entire document collection periodically
- Every T time units (on average)
- Define
- Relative re-indexing rate λT
- λ is the rate of the page's Poisson change process
- Grace period percentage β/T
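To make these quantities concrete, the sketch below computes the probability α that a single page is β-current when it changes as a Poisson process with rate λ and the collection is rebuilt every T time units. The closed form comes from averaging e^(−λ·max(0, t − β)) over the time t since the last check, uniform on [0, T]; the function name and example numbers are illustrative, and the paper's full result further averages over the Weibull distribution of change rates.

```python
# Sketch: probability that one page is beta-current under periodic re-indexing.
# Assumes a Poisson change process with rate lam (changes per day) and a
# rebuild every T days; the example numbers are made up.
import math

def alpha_single_page(lam: float, T: float, beta: float) -> float:
    """Average of exp(-lam * max(0, t - beta)) over t uniform on [0, T]."""
    if beta >= T:
        return 1.0
    return (beta + (1.0 - math.exp(-lam * (T - beta))) / lam) / T

# A page with a 50-day mean lifetime, re-indexed every 30 days,
# judged with a 3-week (21-day) grace period:
print(alpha_single_page(lam=1 / 50, T=30.0, beta=21.0))   # ~0.97
```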
13 Deciding How often to Crawl
- Y axis: α (probability of being β-current)
- X axis: λT (relative re-indexing rate)
- Lines: β/T (grace period percentage)
(Figure annotations: "Index less often" / "Index more often")
14 Deciding How often to Crawl
- The surface: α as a function of β/T (grace period percentage) and T (re-indexing period)
- The plane: α = 0.95
- Intersect the surface with the plane
- Obtain a curve relating the re-indexing period to the grace period β for a given α
15 Deciding How often to Crawl
16 What to Do With the Results
- Decide how often to crawl in order to keep the index current to a specified level
- Estimate bandwidth requirements
- bandwidth ≈ (pages in index / days between crawls) × (kilobytes / page)
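As a worked example with entirely made-up figures (the index size, average page size, and re-crawl period below are placeholders, not numbers from the paper):

```python
# Back-of-the-envelope bandwidth estimate; all inputs are hypothetical.
pages_in_index = 800_000_000      # assumed index size
kb_per_page = 10                  # assumed average page size in kilobytes
days_between_crawls = 18          # re-indexing period chosen from the model

kb_per_day = pages_in_index * kb_per_page / days_between_crawls
mbit_per_s = kb_per_day * 8 / (24 * 3600) / 1000
print(f"{kb_per_day:.2e} KB/day, roughly {mbit_per_s:.0f} Mbit/s sustained")
```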
17 Future Work
- Consider
- Varying importance of pages
- Varying importance of changes to pages
- Varying the re-indexing period by page
- The next paper tries to address these issues
- The mathematical model in this paper fits into
the techniques described in the next paper
18 (No Transcript)
19 The Evolution of the Web and Implications for an Incremental Crawler
- Junghoo Cho
- Hector Garcia-Molina
20 Types Of Crawlers
- Assume that an initial document collection has
been built
- Periodic Crawler
- Crawls from time to time
- Creates a new document collection each time
- Incremental Crawler
- Crawls continually
- Updates changed documents
- Replaces old or less important documents with
new or more important documents
21 Goals of Incremental Crawling
- Improve freshness of the document collection
- Freshness: fraction of up-to-date documents
- Avoid delaying updates until the next crawl cycle
- Visit frequently changed pages more often
- Discover new/removed documents sooner
- Efficiency
- Try to revisit documents only as often as they change
- Distribute bandwidth usage over a longer time span
22 Developing an Incremental Crawler
- Study how the web changes over time
- Understand the system demands of an incremental crawler
- How they differ from those of a periodic crawler
- Make key design choices for an incremental crawler
23 Learning How the Web Changes
- Monitor part of the web for a time period
- 270 sites: 132 .coms, 78 .edus, 11 .nets, 19 .orgs, 28 .govs, and 2 .mils
- 4 month monitoring period
- Re-crawled each site daily
- BFS crawl from the root page of each site until 3000 documents are found
- Page window: the pages among those 3000 documents
- Pages can enter and leave the window
- Simulates new page creation / page removal
24 How Often Web Pages Change
- Average change interval of each page
- = days in window / number of times it changed
- Ignored error due to undetected page changes
- Over all domain types
- 20% of pages changed on every visit
- 30% of pages never changed
- By domain type
- Pages in .com domains change more frequently than
pages in other domains (e.g. .gov)
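A minimal sketch of that estimate, assuming a simple per-page log of daily content checksums (the data layout and function name are invented for illustration):

```python
# Sketch: average change interval of one page from daily observations.
# Assumes one content checksum per day while the page was inside the window;
# changes that happen and revert between visits go undetected, as noted above.
from typing import Optional, Sequence

def average_change_interval(daily_checksums: Sequence[str]) -> Optional[float]:
    """days in window / number of visits on which the content had changed."""
    days_in_window = len(daily_checksums)
    changes = sum(1 for prev, cur in zip(daily_checksums, daily_checksums[1:])
                  if prev != cur)
    if changes == 0:
        return None   # never observed to change during the monitoring period
    return days_in_window / changes

# A page observed for 8 days that changed on 2 of the revisits:
print(average_change_interval(["a", "a", "b", "b", "b", "c", "c", "c"]))  # 4.0
```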
25 The Lifespan of Web Pages
- Lifespan: length of time that a page exists
- Approximated by the number of days that the page was within the window
- Adjusted for error due to lifespans extending beyond the monitoring period
- Over all domain types
- Some pages are short lived
- Many pages are long lived
- By domain type
- Pages in .com domains have a shorter lifespan
than pages in other domains (e.g. .gov)
26 The Time for 50% of the Web to Change
- Considers the changes of the web as a whole
- 50 days for 50% of the web to change
- For .com domains, 11 days for 50% of the pages to change
27 Mathematical Model of Document Changes
- Motivations
- To compare crawling policies
- By comparing how up-to-date their document collections are
- A mathematical model can predict how many pages have changed since the crawler last crawled them
- To decide which pages to re-visit during incremental crawling
- Mathematical model
- Poisson process
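Under the Poisson model, the probability that a page with change rate λ has changed at all t days after it was last crawled is 1 − e^(−λt). The small sketch below uses the slide-26 figure (about 50 days for half the web to change) to back out a single overall rate; treating the whole web with one rate is only a rough illustration, since both papers stress that rates vary widely across pages.

```python
# Sketch: probability a page has changed since it was last crawled (Poisson).
# A single web-wide rate is assumed here purely for illustration.
import math

lam = math.log(2) / 50   # if half the web changes in ~50 days, lam ≈ 0.014/day

def prob_changed(days_since_crawl: float, rate: float = lam) -> float:
    """P(at least one change in the given number of days)."""
    return 1.0 - math.exp(-rate * days_since_crawl)

for days in (1, 7, 30, 90):
    print(f"{days:3d} days since crawl: {prob_changed(days):.2f}")
```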
28 Using the Observations
- Critical observations
- The change frequency of a web document varies
- The lifetime of a web document varies
- Portions of the web evolve faster than others
- Changes can be modeled as a Poisson process
- To keep up with the changes, incremental crawling
might be a better strategy than periodic crawling
29 Batch Mode vs. Steady
- Batch-mode Crawler
- Periodically update entire collection
- Steady Crawler
- Continually update portions of collection
- Crawling at the same average speed → same average freshness over time
- A steady crawler can achieve this with a lower peak speed
(Graphs assume that pages are immediately available to users)
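The sketch below is a toy simulation of this comparison; the collection size, change rate, and cycle length are all assumed values. The batch-mode crawler rebuilds the whole collection on one day per cycle, while the steady crawler spreads the same number of fetches evenly over the cycle. With in-place updates and pages immediately visible, the two end up with similar time-averaged freshness, but the steady crawler's peak crawl rate is far lower.

```python
# Toy simulation: batch-mode vs. steady crawling at the same average speed.
# All parameters are illustrative; freshness = fraction of up-to-date pages.
import math
import random

random.seed(0)
N = 2000        # pages in the collection
LAM = 0.02      # assumed per-day Poisson change rate for every page
CYCLE = 30      # days between batch crawls / length of one steady sweep
DAYS = 360

def simulate(steady: bool):
    stale = [False] * N
    freshness_sum, peak_rate = 0.0, 0
    for day in range(DAYS):
        # Each page may change today (independent Poisson processes).
        p_change = 1.0 - math.exp(-LAM)
        for i in range(N):
            if random.random() < p_change:
                stale[i] = True
        # How many pages does the crawler refresh today?
        if steady:
            refresh = N // CYCLE                      # spread evenly
        else:
            refresh = N if day % CYCLE == 0 else 0    # one big burst per cycle
        peak_rate = max(peak_rate, refresh)
        # Refresh pages round-robin, in place.
        start = (day * refresh) % N
        for k in range(refresh):
            stale[(start + k) % N] = False
        freshness_sum += 1.0 - sum(stale) / N
    return freshness_sum / DAYS, peak_rate

for label, steady in (("batch", False), ("steady", True)):
    avg, peak = simulate(steady)
    print(f"{label:6s} average freshness = {avg:.2f}, peak pages/day = {peak}")
```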
30 Shadowing vs. In-Place
- Shadowed updates
- Build a new collection separately
- Instantaneously replaces active collection
- Simpler to implement
- Acceptable for batch mode crawlers
- In-place updates
- Update the active collection directly
- Active collection is more fresh
- Nearly essential for steady crawlers
31 Shadowing vs. In-Place
- The crawler's collection is always as fresh as it can be
- For the active collection
- Freshness is lost by shadowing
- The lost freshness is more significant for a steady crawler
(Figure: freshness over time for a Batch Mode Crawler and a Steady Crawler; dashed line = in-place, solid line = shadowing)
32 Page Update Frequency
- Fixed frequency
- Visit all pages equally often
- A periodic crawler will most likely visit every
page in every crawl (fixed frequency)
- Variable frequency
- Revisit actively changed pages more often
- A steady crawler may elect to vary the frequency
33 Variable Frequency Visitation
- How often should the crawler visit pages?
- Intuition: visit a page more often if it changes more often
- But, there is a point of diminishing returns
- Optimal behavior (illustrated in the sketch below)
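A small numeric illustration of the diminishing-returns point, with assumed numbers: two pages share a budget of 5 visits per day, one changing 9 times per day and the other once per day. For a page revisited at regular intervals of length 1/f and changing with Poisson rate λ, the expected freshness is (f/λ)(1 − e^(−λ/f)); sweeping the split shows the best allocation gives the fast page noticeably fewer visits than a change-rate-proportional policy would, which is the intuition behind the optimal variable-frequency behavior.

```python
# Sketch: splitting a fixed visit budget between a fast- and a slow-changing
# page. The change rates and visit budget are assumed values for illustration.
import math

def freshness(f: float, lam: float) -> float:
    """Expected fraction of time a page is fresh when revisited every 1/f days
    and changing as a Poisson process with rate lam (changes per day)."""
    if f <= 0:
        return 0.0
    return (f / lam) * (1.0 - math.exp(-lam / f))

LAM_FAST, LAM_SLOW, BUDGET = 9.0, 1.0, 5.0    # changes/day and visits/day

candidates = [i * 0.05 for i in range(1, 100)]     # visits/day to the fast page
best_avg, best_fast = max(
    ((freshness(f, LAM_FAST) + freshness(BUDGET - f, LAM_SLOW)) / 2, f)
    for f in candidates)

prop_fast = BUDGET * LAM_FAST / (LAM_FAST + LAM_SLOW)   # proportional policy
prop_avg = (freshness(prop_fast, LAM_FAST)
            + freshness(BUDGET - prop_fast, LAM_SLOW)) / 2

print(f"optimal:      {best_fast:.2f} visits/day to the fast page, freshness {best_avg:.3f}")
print(f"proportional: {prop_fast:.2f} visits/day to the fast page, freshness {prop_avg:.3f}")
```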
34 Comparison of Features
- Incremental approach
- Steady Crawler
- In-place updates
- Variable visitation frequency
- Potentially higher freshness
- Lower demands on network and web sites
- Periodic approach
- Batch mode crawler
- Shadowed updates
- Fixed visitation frequency
- Easier to implement
35 Incremental Crawler Implementation
- Loop forever
- Either: replace a page in the collection with a new page
- Or: update a page in the collection
- Decisions must be made at run-time
- Replace a page in the collection?
- If so, which page should be replaced, and with which new page?
- Update a page in the collection?
- If so, which one?
36 Refinement decisions
- Discard a page with the lowest importance
- Add a page with higher importance
- Improves the collection quality by replacing less important pages with more important pages
- Judge importance using PageRank, Hubs and Authorities, etc.
37 Update decisions
- Choose a page and update it
- Based on a model of the change rates of the pages
- The one that will improve the freshness of the collection the most (variable frequency model)
- Update that page
- Improves the collection's freshness by updating pages in the collection (see the sketch below)
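Pulling slides 35–37 together, here is a rough sketch of the run-time decision loop. The data structures and selection logic are invented for illustration: importance stands in for a PageRank-style score, and the update choice simply re-fetches the page most likely to have changed under the Poisson model, which is one simple way to realize the variable-frequency idea rather than the paper's exact policy.

```python
# Illustrative sketch of an incremental crawler's decision loop.
# Each page record holds an importance score (e.g. PageRank-like), an estimated
# Poisson change rate, and the time it was last crawled; `frontier` holds
# candidate new pages that are not yet in the collection.
import math

def prob_changed(page, now):
    """Probability the page has changed since it was last fetched (Poisson)."""
    return 1.0 - math.exp(-page["rate"] * (now - page["last_crawled"]))

def crawl_step(collection, frontier, now):
    # Refinement decision: swap in a more important new page, if one exists.
    if frontier:
        worst = min(collection, key=lambda p: p["importance"])
        best_new = max(frontier, key=lambda p: p["importance"])
        if best_new["importance"] > worst["importance"]:
            collection.remove(worst)
            frontier.remove(best_new)
            best_new["last_crawled"] = now   # fetch and index the new page
            collection.append(best_new)
            return
    # Update decision: re-fetch the collection page most likely to be stale.
    stalest = max(collection, key=lambda p: prob_changed(p, now))
    stalest["last_crawled"] = now            # fetch and re-index it in place

# The crawler runs this step forever, spreading the work steadily over time:
#   while True:
#       crawl_step(collection, frontier, now=time.time())
#       sleep(delay_between_fetches)
```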
38 Suggested Crawler Architecture
- Omitted due to time constraints
39 Ideas to Discuss
- A hybrid approach
- Batch mode crawler with steady maintenance
- Higher complexity and bandwidth expense
- Further improves freshness
- Collection growth/shrinkage
- Assumed that the collection has a fixed size
- Must decide how/when to grow/shrink collection