Transcript and Presenter's Notes

Title: Keeping Up With The Changing Web


1
Keeping Up With The Changing Web
  • Brian E. Brewington
  • George Cybenko

2
Re-Crawling is Essential
  • Usually, satisfying search results
  • Reference documents that currently exist
  • Are based on the current contents of those
    documents
  • The web is constantly changing
  • Re-crawling is essential to maintaining a fresh
    document collection
  • How often should a new crawl be performed?

3
Figuring Out How Often to Crawl
  • Develop a mathematical model of how often web
    pages change
  • Decide how current you expect the document
    collection to be
  • Determine how often the web must be crawled
  • Using the change model and freshness expectation
    as inputs

4
Finding a Model for the Distribution of Change
Rates
  • Monitor the web
  • Change: any alteration of a web page
  • Lifetime: time between successive changes
  • Age: time since the last change

5
Age to Model Change Rates
Cumulative Distribution of Ages
  • To approximate the rate of change of documents
  • The web is young
  • 1/5 of documents are less than 12 days old

6
Age to Model Change Rates
  • Difficulties
  • Web server doesn't tell you when a page was first
    created
  • Can't know the age of a page from a single sample
  • If a page is young
  • It could be highly dynamic
  • Or, it could be newly created
  • Need a growth model to estimate which
  • Used lifetime distributions instead

7
Lifetime to Model Change Rates
  • Use lifetime to approximate the rate of change of
    documents
  • Difficulties
  • For quickly changing pages
  • May not observe every change
  • For slowly changing pages
  • May not observe any change
  • Need to correct for both effects

8
Lifetime to Model Change Rates
  • Assumptions
  • Pages change according to independent Poisson
    processes
  • Each described by a rate λ
  • A Weibull distribution of mean lifetimes for the
    Poisson processes
  • The observation period for a page is independent
    of its change rate
  • Overlook the bias against pages that change
    infrequently
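As a rough illustration of these assumptions (illustrative parameters, not the paper's fitted values), the sketch below draws mean page lifetimes from a Weibull distribution and then simulates each page's changes as an independent Poisson process.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative Weibull parameters, NOT the paper's fitted values:
    # mean page lifetimes follow a Weibull distribution, and each page
    # changes as a Poisson process with rate 1 / mean_lifetime.
    shape, scale_days, n_pages = 1.5, 100.0, 10_000
    mean_lifetimes = scale_days * rng.weibull(shape, size=n_pages)   # days
    rates = 1.0 / mean_lifetimes                                     # changes/day

    # Under these assumptions, the number of changes a page makes during
    # a 90-day observation window is Poisson(rate * window).
    window_days = 90
    changes = rng.poisson(rates * window_days)

    print("fraction with no observed change:", np.mean(changes == 0))
    print("fraction changing at least daily:", np.mean(rates >= 1.0))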

9
Lifetime to Model Change Rates
  • Using the assumptions
  • Find the parameters that best fit the model to
    the observed lifetime distribution
  • Result
  • A distribution of mean web page lifetimes
  • i.e. a model of the rate of web page changes

10
The Meaning of Up-To-Date
  • It is unreasonable to expect a search engine to
    be absolutely current
  • i.e. completely current all of the time
  • A practical/reasonable expectation
  • Relaxes time
  • Relaxes certainty
  • (α, β)-currency

11
(α, β) Current
  • With probability at least α, a randomly chosen
    page in the collection is β-current
  • β-current: the page has not changed between the
    last time it was checked and β time units ago
  • (0.8, 3-week)-current: ≥ 80% likelihood that a
    document in the collection has not changed from
    when it was checked until 3 weeks ago
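A minimal sketch of how β-currency can be computed under the Poisson-change assumption: for a page with change rate λ that is re-indexed every T time units with a uniformly random phase, averaging exp(−λ · max(0, t − β)) over t in [0, T] gives the closed form below, and α is the average of that probability over the collection's change rates. This is a reconstruction; the Weibull sample standing in for the rates is illustrative, not the paper's fitted distribution.

    import numpy as np

    def beta_current_prob(lam, T, beta):
        """Probability that a page with Poisson change rate `lam` is
        beta-current when the collection is re-indexed every T time units
        and the time since the page was last checked is uniform on [0, T].
        Only changes in the last max(0, t - beta) time units matter, so
        averaging exp(-lam * max(0, t - beta)) over t gives this form."""
        if beta >= T:
            return 1.0
        return beta / T + (1.0 - np.exp(-lam * (T - beta))) / (lam * T)

    # alpha for the collection: average the per-page probability over a
    # sample of change rates (illustrative Weibull draw of mean lifetimes).
    rng = np.random.default_rng(0)
    rates = 1.0 / (100.0 * rng.weibull(1.5, size=10_000))    # changes/day
    T, beta = 30.0, 7.0                                       # days
    alpha = float(np.mean([beta_current_prob(lam, T, beta) for lam in rates]))
    print(f"approximately ({alpha:.2f}, {beta:.0f}-day) current at T = {T:.0f} days")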

12
Deciding How often to Crawl
  • Assumption: rebuild the entire document
    collection periodically
  • Every T time units (on average)
  • Define:
  • Relative re-indexing rate: λT
  • λ is the rate of the Poisson process
  • Grace period percentage: β/T

13
Deciding How often to Crawl
  • Y axis: α (probability of being β-current)
  • X axis: λT (relative re-indexing rate)
  • Lines: β/T (grace period percentage)

(Graph labels: "Index less often" / "Index more often")
14
Deciding How often to Crawl
  • The surface: α as a function of β/T (grace
    period percentage) and T (re-indexing period)
  • The plane: α = 0.95
  • Intersect the surface with the plane
  • Obtain a curve relating the indexing period to
    β-currency for a given α
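A sketch of this intersection step under the same illustrative assumptions as above: fix α = 0.95 and a one-week grace period β, then bisect on the re-indexing period T until the collection-wide probability of β-currency reaches the target.

    import numpy as np

    def alpha_of(T, beta, rates):
        """Collection-wide probability of beta-currency at re-indexing
        period T, assuming independent Poisson page changes and a
        uniformly random re-indexing phase."""
        if beta >= T:
            return 1.0
        return float(np.mean(beta / T + (1.0 - np.exp(-rates * (T - beta))) / (rates * T)))

    rng = np.random.default_rng(0)
    rates = 1.0 / (100.0 * rng.weibull(1.5, size=10_000))   # illustrative change rates

    target_alpha, beta = 0.95, 7.0     # the alpha = 0.95 plane, 1-week grace period
    lo, hi = beta, 1000.0              # alpha shrinks as T grows, so bisect on T
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if alpha_of(mid, beta, rates) >= target_alpha:
            lo = mid
        else:
            hi = mid
    print(f"re-index roughly every {lo:.1f} days for (0.95, {beta:.0f}-day) currency")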

15
Deciding How often to Crawl
16
What to Do With the Results
  • Decide how often to crawl in order to keep the
    index current to a specified level
  • Estimate bandwidth requirements

bandwidth ≈ (pages in index) × (kilobytes per page) / (days between crawls)
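A toy instance of this estimate; the page count, average page size, and crawl interval below are hypothetical placeholders to show the arithmetic, not figures from the paper.

    # Hypothetical inputs -- substitute real index statistics.
    pages_in_index = 800_000_000          # pages
    kb_per_page = 12                      # average kilobytes per page
    days_between_crawls = 18              # period from the currency analysis

    kb_per_day = pages_in_index * kb_per_page / days_between_crawls
    mbit_per_sec = kb_per_day * 8 / 1000 / 86_400    # KB/day -> Mbit/s
    print(f"about {kb_per_day / 1e6:.0f} GB/day, or {mbit_per_sec:.0f} Mbit/s sustained")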
17
Future Work
  • Consider
  • Varying importance of pages
  • Varying importance of changes to pages
  • Varying the re-indexing period by page
  • The next paper tries to address these issues
  • The mathematical model in this paper fits into
    the techniques described in the next paper

18
(No Transcript)
19
The Evolution of the Web and Implications for an
Incremental Crawler
  • Junghoo Cho
  • Hector Garcia-Molina

20
Types Of Crawlers
  • Assume that an initial document collection has
    been built
  • Periodic Crawler
  • Crawls from time to time
  • Creates a new document collection each time
  • Incremental Crawler
  • Crawls continually
  • Updates changed documents
  • Replaces old or less important documents with
    new or more important documents

21
Goals of Incremental Crawling
  • Improves freshness of the document collection
  • Freshness: fraction of up-to-date documents
  • Avoid delaying updates until the next crawl cycle
  • Visit frequently changed pages more often
  • Discover new/removed documents sooner
  • Efficiency
  • Try to revisit documents only as often as they
    change
  • Distribute bandwidth usage over a longer time span

22
Developing an Incremental Crawler
  • Study how the web changes over time
  • Understand the system demands of an incremental
    crawler
  • How they differ from those of a periodic crawler
  • Make key design choices for an incremental crawler

23
Learning How the Web Changes
  • Monitor part of the web for a time period
  • 270 sites: 132 .com, 78 .edu, 11 .net, 19 .org,
    28 .gov, and 2 .mil sites
  • 4-month monitoring period
  • Re-crawled each site daily
  • BFS crawl from the root page of each site until
    3000 documents are found
  • Page window: the pages in the 3000 documents
  • Pages can enter and leave the window
  • Simulates new page creation / page removal
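A sketch of one day's per-site crawl as described above: breadth-first from the site's root page until 3000 documents are collected. `fetch` and `extract_links` are hypothetical helpers, not code from the study.

    from collections import deque

    def crawl_window(root_url, fetch, extract_links, limit=3000):
        """Breadth-first crawl from a site's root page until `limit`
        documents are collected; the returned mapping is that day's
        "page window" for the site. `fetch(url) -> html or None` and
        `extract_links(html, base) -> [urls]` are caller-supplied."""
        window, queue = {}, deque([root_url])
        while queue and len(window) < limit:
            url = queue.popleft()
            if url in window:
                continue
            html = fetch(url)
            if html is None:                 # unreachable or non-HTML page
                continue
            window[url] = html
            queue.extend(extract_links(html, base=url))
        return window

Re-crawling each site daily and diffing successive windows is what lets pages enter and leave the window.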

24
How Often Web Pages Change
  • Average change interval of each page
  • = days in window / times it changed
  • Ignored error due to undetected page changes
  • Over all domain types
  • 20% of pages changed on every visit
  • 30% of pages never changed
  • By domain type
  • Pages in .com domains change more frequently than
    pages in other domains (e.g. .gov)
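The average change interval above can be computed directly from daily snapshots; the checksum-per-day input below is an assumed representation, not the study's exact bookkeeping.

    def average_change_interval(daily_checksums):
        """Estimate a page's average change interval (in days) from one
        checksum per day the page was inside the window; changes that
        happen and revert between visits go undetected, as in the study."""
        days_in_window = len(daily_checksums)
        changes = sum(1 for prev, cur in zip(daily_checksums, daily_checksums[1:])
                      if cur != prev)
        return days_in_window / changes if changes else float("inf")

    print(average_change_interval(["a", "a", "b", "b", "c", "c"]))   # 3.0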

25
The Lifespan of Web Pages
  • Lifespan: length of time that a page exists
  • Approximated by the number of days that the page
    was within the window
  • Adjusted for error due to lifespan extending
    beyond the monitoring period
  • Over all domain types
  • Some pages are short lived
  • Many pages are long lived
  • By domain type
  • Pages in .com domains have a shorter lifespan
    than pages in other domains (e.g. .gov)

26
The Time for 50% of the Web to Change
  • Considers the changes of the web as a whole
  • 50 days for 50% of the web to change
  • For .com domains, 11 days for 50% of the pages to
    change

27
Mathematical Model of Document Changes
  • Motivations
  • To compare crawling policies
  • By comparing how up-to-date their document
    collections are
  • A mathematical model can predict how many pages
    have changed since the crawler last crawled them
  • To decide which pages to re-visit during
    incremental crawling
  • Mathematical model
  • Poisson process

28
Using the Observations
  • Critical observations
  • The change frequency of a web document varies
  • The lifetime of a web document varies
  • Portions of the web evolve faster than others
  • Changes can be modeled as a Poisson process
  • To keep up with the changes, incremental crawling
    might be a better strategy than periodic crawling

29
Batch Mode vs. Steady
  • Batch-mode Crawler
  • Periodically update entire collection
  • Steady Crawler
  • Continually update portions of collection

(Graph notes: crawling at the same average speed gives the same average
freshness over time; a steady crawler can achieve this with a lower peak
speed. The graphs assume that pages are immediately available to users.)
30
Shadowing vs. In-Place
  • Shadowed updates
  • Build a new collection separately
  • Instantaneously replaces active collection
  • Simpler to implement
  • Acceptable for batch mode crawlers
  • In-place updates
  • Update the active collection directly
  • Active collection is more fresh
  • Nearly essential for steady crawlers

31
Shadowing vs. In-Place
  • The crawler's collection is always as fresh as it
    can be
  • For the active collection
  • Freshness is lost by shadowing
  • The lost freshness is more significant for a
    steady crawler

(Graphs: batch-mode crawler and steady crawler; dashed line = in-place,
solid line = shadowing)
32
Page Update Frequency
  • Fixed frequency
  • Visit all pages equally often
  • A periodic crawler will most likely visit every
    page in every crawl (fixed frequency)
  • Variable frequency
  • Revisit actively changed pages more often
  • A steady crawler may elect to vary the frequency

33
Variable Frequency Visitation
  • How often should the crawler visit pages?
  • Intuition: visit a page more often if it changes
    more often
  • But, there is a point of diminishing returns
  • Optimal behavior
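One way to make the diminishing returns concrete (a reconstruction under the Poisson assumption, not a formula quoted from the slides): a page with change rate λ that is revisited every I time units has time-averaged freshness (1 − e^(−λI)) / (λI), so each doubling of the visit rate buys less and less.

    import numpy as np

    def average_freshness(lam, revisit_interval):
        """Time-averaged probability that a Poisson(lam) page is still
        unchanged since its last visit, when revisited every
        `revisit_interval` time units."""
        x = lam * revisit_interval
        return (1.0 - np.exp(-x)) / x

    lam = 1.0            # the page changes about once per day (illustrative)
    for visits_per_day in (0.5, 1, 2, 4, 8):
        f = average_freshness(lam, 1.0 / visits_per_day)
        print(f"{visits_per_day:>4} visits/day -> average freshness {f:.2f}")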

34
Comparison of Features
  • Incremental approach
  • Steady Crawler
  • In-place updates
  • Variable visitation frequency
  • Potentially higher freshness
  • Lower demands on network and web sites
  • Periodic approach
  • Batch mode crawler
  • Shadowed updates
  • Fixed visitation frequency
  • Easier to implement

35
Incremental Crawler Implementation
  • Loop forever
  • Either: replace a page in the collection with a
    new page
  • Or: update a page in the collection
  • Decisions must be made at run-time
  • Replace a page in the collection?
  • If so, which page should be replaced, and by
    which new page?
  • Update a page in the collection?
  • If so, which one?
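A sketch of one pass through that run-time loop; `importance`, `expected_staleness`, and `refresh` are hypothetical callbacks (for example a PageRank-style score, the Poisson staleness estimate, and a re-fetch), and the 50/50 split between refining and updating is an arbitrary illustrative policy, not the paper's.

    import random

    def incremental_crawl_step(collection, candidates, importance,
                               expected_staleness, refresh, max_size):
        """Either refine the collection (swap a low-importance page for a
        more important new one) or update the page whose refresh should
        improve freshness the most. `collection` and `candidates` are
        sets of page IDs; the callbacks are caller-supplied."""
        if candidates and random.random() < 0.5:
            best_new = max(candidates, key=importance)
            if len(collection) < max_size:
                collection.add(best_new)                 # room left: just add it
                candidates.remove(best_new)
            else:
                worst_old = min(collection, key=importance)
                if importance(best_new) > importance(worst_old):
                    collection.remove(worst_old)         # replacement decision
                    collection.add(best_new)
                    candidates.remove(best_new)
        elif collection:
            stalest = max(collection, key=expected_staleness)
            refresh(stalest)                             # update decision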

36
Refinement decisions
  • Discard a page with the lowest importance
  • Add a page with higher importance
  • Improves the collection quality by replacing
    less important pages with more important
    pages
  • Judge importance using PageRank, hub and
    authority scores, ...

37
Update decisions
  • Choose a page and update it
  • Based on a model of the change rates of the pages
  • The one that will improve the freshness of the
    collection the most (variable frequency model)
  • Update that page
  • Improves the collection's freshness by updating
    pages in the collection

38
Suggested Crawler Architecture
  • Omitted due to time constraints

39
Ideas to Discuss
  • A hybrid approach
  • Batch mode crawler with steady maintenance
  • Higher complexity and bandwidth expense
  • Further improves freshness
  • Collection growth/shrinkage
  • Assumed that the collection has a fixed size
  • Must decide how/when to grow/shrink collection