Internet Search Engine freshness by Web Server help - PowerPoint PPT Presentation

About This Presentation
Title:

Internet Search Engine freshness by Web Server help


Slides: 27
Provided by: Ale8298

Transcript and Presenter's Notes


1
Internet Search Engine freshness by Web Server
help
  • Presented by
  • Barilari Alessandro

2
Introduction
  • Search engines are an important source of
    information and keeping them up-to-date will
    result in more accurate answers to search
    queries.
  • Search engines create their databases by probing
    web servers on a per-URL basis, with little help
    from the web servers.

3
Main Problem
  • There is no standard for facilitating the push
    of updates from servers to search engines
  • It takes up to six months for a new page to be
    indexed by popular web search engines
  • The data which is indexed by the search engines
    is often stale.

4
Solution
  • Help from web servers in keeping search engines
    fresh benefits web sites, search engines and
    users alike.

5
and its problems
  • The number of updates per second is very large.
  • Must balance the number of interactions between
    web sites and search engines against the
    freshness of the search engines.

6
Page rank impact
  • Pages which are popular will have higher page
    ranks
  • Use popularity in addition to age and freshness
    to compute the mismatch between a web site and a
    search engine

7
Summary
  • Definitions and Cost Model
  • Algorithm
  • Analysis
  • Practical issues

8
Some definitions
  • Update: an update u to a file f is a modification
    to f that has been flushed to disk
  • Propagation of an update: an update is said to be
    propagated when the web site has informed the
    search engine about the update. A search engine
    (SE) may or may not retrieve that update
  • Meta-update propagation: at any time t, let U(t)
    be the set of unpropagated updates. The web site
    informs the search engine about all the updates
    in U(t)

9
Some definitions (2)
  • Weight of a file: given a content file f, its
    weight w_f (non-negative) denotes the importance
    of the file. The weights are chosen such that
    [constraint not preserved in the transcript]
  • last_modification_time(u,t): the last time before
    t when the file f(u) was updated.

10
The Cost Model
  • Components
  • Communication cost
  • Opportunity cost: represents the staleness of the
    search engine data as compared to the data on the
    web server.
  • CPU cost is ignored

11
Opportunity cost (OC)
  • Given an unpropagated update u to a content file
    f, the opportunity cost for update u at time t
    is
  • OC(u,t) = w_f(u) × (t - last_modification_time(u,t))
  • Definition for meta-update propagation
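
The per-update opportunity cost (file weight multiplied by the time elapsed since the file was last modified) can be sketched in a few lines of Python. The class and function names are illustrative, not from the presentation. The slide's formula for the meta-update case is not preserved in the transcript, so summing over the unpropagated set U(t) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Update:
    file_weight: float    # weight of the file f(u), non-negative
    last_modified: float  # last_modification_time(u, t), e.g. in seconds


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t): file weight times the time the update has gone unpropagated."""
    return u.file_weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated: Iterable[Update], t: float) -> float:
    # Assumption: the meta-update cost is the sum over all updates in U(t).
    return sum(opportunity_cost(u, t) for u in unpropagated)
```

For example, an update to a file of weight 0.5 that was modified at time 10 has an opportunity cost of 10.0 at time 30.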

12
Communication cost (CC)
  • size_f(u)(t): the size of file f(u) at time t

13
Potential Communication cost (PCC)
  • Represents the communication cost which would
    need to be incurred in case update u were to be
    propagated after time t

14
The Cost Function
  • Given that an update u is unpropagated at time t,
    the cost function for that update at time t is
    given by

15
Summary
  • Definition and Cost Model
  • Algorithm
  • Analysis
  • Practical issues

16
FreshFlow Algorithm
  • When OC_tot equals PCC_tot at any time t, the web
    server can inform the search engine about all the
    unpropagated updates.
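
A minimal sketch of this trigger rule follows, under stated assumptions. The transcript does not preserve the communication-cost formulas, so the potential communication cost is modelled here as a hypothetical constant price per kilobyte (`COST_PER_KB`) times the file size. The check uses `>=` rather than strict equality because a server polling at discrete times may step past the exact crossing point. All names are illustrative.

```python
from dataclasses import dataclass
from typing import List

COST_PER_KB = 0.5  # hypothetical price of sending one kilobyte


@dataclass
class PendingUpdate:
    weight: float         # weight of the updated file
    last_modified: float  # time of the last modification
    size_kb: float        # current size of the file


def oc_tot(pending: List[PendingUpdate], t: float) -> float:
    # Assumed total opportunity cost: sum of weight * staleness.
    return sum(p.weight * (t - p.last_modified) for p in pending)


def pcc_tot(pending: List[PendingUpdate]) -> float:
    # Assumed total potential communication cost: proportional to file size.
    return sum(COST_PER_KB * p.size_kb for p in pending)


def should_propagate(pending: List[PendingUpdate], t: float) -> bool:
    """True once the accumulated staleness cost reaches the cost of sending."""
    return bool(pending) and oc_tot(pending, t) >= pcc_tot(pending)
```

With a single pending update of weight 1.0 and size 10 KB modified at time 0, the sketch waits until time 5, when the opportunity cost reaches the assumed sending cost of 5.0.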

17
Summary
  • Definition and Cost Model
  • Algorithm
  • Analysis
  • Practical issues

18
Analysis
  • The cost of the FreshFlow algorithm (called FF)
    is compared with the cost of an optimal off-line
    algorithm (called ADV)

19
Analysis (2)
  • Lemma (1) OC(u,t) is monotonically
    non-decreasing
  • Lemma (2): consider an update u to a file f, and
    suppose FF transmits it but ADV does not. Then
    OC_ADV(u,t) ≥ OC_FF(u,t).
  • Lemma (3): if the update is transmitted by the
    adversary (ADV), then CC_ADV(u,t) ≥ CC_FF(u,t).

20
Theorem
  • FF is 2-competitive:
  • Cost_FF(u,t) ≤ 2 × Cost_ADV(u,t)
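
One informal way to read the factor of 2, offered as intuition rather than as the authors' proof: FreshFlow sends a meta-update at the moment the accumulated opportunity cost matches the potential communication cost. The derivation below assumes that the cost function is the sum of the two components named in the cost model, and that the communication cost actually paid at the trigger time t* equals PCC_tot(t*). Neither formula is preserved in the transcript.

```latex
% t^* denotes the (assumed) instant at which FreshFlow triggers a meta-update.
\begin{align*}
OC_{tot}(t^*)  &= PCC_{tot}(t^*)
               && \text{(FreshFlow trigger condition)} \\
Cost_{FF}(t^*) &= OC_{tot}(t^*) + CC(t^*)
               && \text{(assumed form of the cost function)} \\
               &= 2\, PCC_{tot}(t^*)
               && \text{(assuming } CC(t^*) = PCC_{tot}(t^*)\text{)}
\end{align*}
```

Under these assumptions FreshFlow never pays more than twice the sending cost for a batch, which is the same balance that underlies the classic ski-rental argument.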

21
Summary
  • Definition and Cost Model
  • Algorithm
  • Analysis
  • Practical issues

22
Practical issues
  • There are multiple search engines
  • Synchronization effect: pushing the updates would
    put pressure on the last-hop link to the web
    server
  • Search engine load: some search engines might
    refuse to receive updates.

23
The middleman approach
  • Each web server contacts only one middleman for
    sending its updates
  • Could be a group of middlemen.

24
Benefits
  • The middleman can solve some additional issues:
  • Verifying trustworthiness of web servers
  • Restricting the rate at which updates get
    transmitted to search engines
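
The rate-restriction role of a middleman can be illustrated with a small queue that forwards only a bounded number of notifications per time step. This is a hypothetical sketch: the `Middleman` class, its methods and the per-tick limit are assumptions for illustration, and the presentation does not specify a mechanism.

```python
from collections import deque
from typing import Deque, List, Tuple

Notification = Tuple[str, str]  # (web server name, updated URL)


class Middleman:
    """Buffers update notifications and releases them at a bounded rate."""

    def __init__(self, max_per_tick: int) -> None:
        self.max_per_tick = max_per_tick
        self.queue: Deque[Notification] = deque()

    def receive(self, server: str, url: str) -> None:
        # A web server reports an update; it is queued, not sent at once.
        self.queue.append((server, url))

    def forward(self) -> List[Notification]:
        """Return at most max_per_tick queued notifications, oldest first."""
        batch: List[Notification] = []
        while self.queue and len(batch) < self.max_per_tick:
            batch.append(self.queue.popleft())
        return batch
```

Calling `forward()` once per time step caps the load that reaches each search engine, no matter how many web servers push updates at the same moment.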

25
Limitations
  • The algorithm has not been used in practice
  • The search engines need the cooperation of the
    web servers to keep track of updates to their
    URLs. Whether web servers will incorporate such a
    service remains to be seen.

26
Conclusions
  • The FreshFlow algorithm is a solution that
    improves the freshness of search engine data
    while maintaining a high level of efficiency and
    performance
  • The authors are planning to implement the
    algorithm in a real system (and have a future
    publication!)