Adaptive Online Page Importance, Experiments and Applications
1
Adaptive On-line Page Importance, Experiments and Applications
  • Serge Abiteboul (INRIA & Xyleme)
  • with
  • Grégory Cobéna (INRIA, now Xyleme)
  • and Mihai Preda (Xyleme)

2
Motivations
  • Page importance
  • Notion introduced by Kleinberg
  • Popularized on the Web by Google
  • Applications of page importance
  • Rank the results of a search engine
  • Guide the frequency of page visits
  • For maintaining the index of a search engine
  • For monitoring the Web
  • For archiving the Web
  • For building a topic-specific warehouse of Web data

3
Organization
  • What is page importance?
  • Intuition
  • Formal model
  • The Adaptive OPIC algorithm
  • An on-line algorithm
  • Adaptive algorithm
  • Experiments
  • Crawling in Xyleme
  • Applications

4
What is page importance?
5
Intuition
  • Not all pages on the Web have the same importance
  • Le Louvre's homepage is more important than Mme Michu's homepage
  • Page importance information is valuable
  • To rank query results for display (e.g. Google)
  • Less and less so?
  • To decide which pages should be crawled first, or which pages should be refreshed next

6
Model
  • The Web as a graph/matrix
  • We view the Web as a directed graph G
  • Web pages are vertices, HTML links are edges
  • Let n be the number of pages
  • G is represented as a non-negative link matrix L[1..n, 1..n]
  • There are many ways to encode the Web as a matrix
  • Kleinberg sets L[i,j] = 1 if there is a link from i to j
  • Brin & Page set L[i,j] = 1/out(i) if there is a link from i to j, where out(i) is the out-degree of i
  • In both encodings, L[i,j] = 0 if there is no edge from i to j
  • The importance is represented as a vector x[1..n]

7
Importance (modulo details ;-))
  • Importance is defined as the unique fixpoint of the equation
  • x = L·x
  • Page importance can be computed iteratively (see the sketch below)
  • x_{k+1} = L·x_k
  • If normalized, this corresponds to the limit of the probability of being at a given page during a random walk on the Web
  • Start randomly on some page, then follow randomly some link of that page
  • Keep walking
  • This corresponds to an intuitive notion of importance
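
A minimal power-iteration sketch in Python (an illustration, not the authors' code). It uses the Brin & Page encoding of the previous slide, so the update computes x_j = sum_i L[i][j]·x_i, and it normalizes after every step (the next slide explains why):

```python
# Power iteration for the fixpoint x = Lx.
# Convention: L[i][j] = 1/out(i) if page i links to page j, 0 otherwise.

def power_iteration(L, iters=100):
    n = len(L)
    x = [1.0 / n] * n                        # start from the uniform vector
    for _ in range(iters):
        x = [sum(L[i][j] * x[i] for i in range(n)) for j in range(n)]
        s = sum(x)                           # normalize after each step so the
        x = [v / s for v in x]               # iterate neither diverges nor vanishes
    return x

# Brin & Page encoding of the 3-page example used later in the talk
# (Alice links to Bob and Georges, Bob to Alice, Georges to Bob):
L = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(power_iteration(L))                    # approximately [0.4, 0.4, 0.2]
```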

8
Some of the details, in brief
  • For a non-negative matrix L
  • Such a fixpoint always exists, but it may not be unique
  • Iterating over k will diverge, or converge to zero, in most cases
  • A normalization after each step is therefore necessary
  • Theorem
  • If the graph is strongly connected, there exists a unique fixpoint (up to normalization)
  • If the graph is aperiodic, the iteration converges

9
Strong connectivity: disjoint components
[Figure: two disjoint components, A-B and C-D]
  • The relative importance of A and B, as compared to C and D, depends on the initial value of x
  • One solution for (A,B) and one for (C,D) combine into many solutions for the whole system

10
Strong connectivity: sinks
[Figure: pages A and B link into the component of C and D, with no links back]
  • In the random walk model, the importance of A and B is zero
  • Only C and D accumulate some importance

11
A-periodicity
[Figure: a cycle through pages A, B and C]
  • On a periodic graph, the fixpoint iteration oscillates between several values

12
Situation for the Web
  • The Web is not strongly connected
  • Consider the bow-tie model of the Web graph
  • Google adds a small edge for every pair (i,j)
  • We instead add small edges to and from a virtual page
  • Intuition: users can also navigate the Web without following links (e.g. bookmarks, typed URLs)
  • The Web is reasonably aperiodic
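
Written out, the added small edges turn the iteration of slide 7 into the damped form below (a standard formulation; the damping factor d, commonly around 0.85, is an assumption here, as the slides do not give the constant):

```latex
x_{k+1} = d \, L \, x_k + (1 - d) \, \frac{1}{n} \, \mathbf{1}
```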

13
Adaptive OPIC: Adaptive Online Page Importance Computation
14
Online Computation: Motivations
  • Off-line algorithm
  • Crawls the Web and builds a link matrix
  • Stores the link matrix and updates it: very expensive
  • Starts an off-line computation on a frozen link matrix
  • On-line Page Importance Computation
  • Does not require storing the link matrix
  • Works continuously, together with crawling
  • Works independently of any crawling strategy
  • Provides an early estimate of page importance to guide crawling
  • Keeps improving and updating the estimate

15
Static Graphs: OPIC
  • We assign to each page a small amount of cash
  • When a page is read, its cash is distributed among its children
  • The total amount of cash across all pages does not change
  • The importance of a given page is computed using the history of cash of that page
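
A minimal sketch of the cash/history scheme, assuming the Greedy page selection of a later slide and a graph given as an adjacency dictionary (illustrative names, not the Xyleme implementation):

```python
# OPIC sketch on a static graph.  `graph` maps each page to the list of
# pages it links to; every page is assumed to have at least one out-link.

def opic(graph, reads):
    n = len(graph)
    cash = {p: 1.0 / n for p in graph}      # the total cash stays equal to 1
    history = {p: 0.0 for p in graph}       # cash a page held when it was read

    for _ in range(reads):
        p = max(cash, key=cash.get)         # Greedy: read the richest page
        history[p] += cash[p]               # record its cash in its history,
        share = cash[p] / len(graph[p])     # then distribute the cash
        cash[p] = 0.0                       # between its children
        for child in graph[p]:
            cash[child] += share

    total = sum(history.values())           # importance estimate: each page's
    return {p: history[p] / total for p in graph}   # share of the total history

# The 3-page example of the next slides:
web = {"Alice": ["Bob", "Georges"], "Bob": ["Alice"], "Georges": ["Bob"]}
print(opic(web, 1000))  # roughly {'Alice': 0.4, 'Bob': 0.4, 'Georges': 0.2}
```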

16
Example
  • A small Web of 3 pages: Alice, Bob, Georges
  • Alice has all the cash to start
  • The importance is independent of the original cash distribution
[Figure: Alice links to Bob and Georges, Bob links to Alice, Georges links to Bob]
Read sequence: A, B, A, G, B
17
What happened?
  • Cash-game history
  • Alice received 600 (200 + 400)
  • Bob received 600 (200 + 100 + 300)
  • Georges received 300 (200 + 100)
  • Solution
  • I(Alice) = 40%
  • I(Bob) = 40%
  • I(Georges) = 20%
  • It is the fixpoint

I(page) = History(page) / Sum(Histories)
18
Cash-History
  • Trace of a run, with values listed as (Alice, Bob, Georges) triples:

  Step           Cash (A, B, G)      History (A, B, G)
  t0             0.33, 0.33, 0.33    0, 0, 0
  Read Alice     0, 0.50, 0.50       0.33, 0, 0
  Read Bob       0.5, 0, 0.5         0.33, 0.50, 0
  Read Georges   0.5, 0.5, 0         0.33, 0.50, 0.50
  Read Bob       1, 0, 0             0.33, 1.0, 0.50
  Read Alice     0, 0.5, 0.5         1.33, 1.0, 0.50

19
Computing Page Importance
  • C_t[i] is the cash of page i at time t
  • H_t[i] is the history of page i: the sum of the cash it held at its previous reads
  • The total amount of cash is constant
  • For each page i, H_t[i] goes to infinity
  • For each page j, at each step:
  • H_t[j] + C_t[j] = C_0[j] + sum(i ancestor of j, H_t[i] / out(i))
  • Theorem: the limit of H_t[j] / sum_k H_t[k] is the importance of page j

20
The Web is a changing graph: the adaptive algorithm
  • The Web changes continuously, and so does the importance of pages
  • Our adaptive algorithm works by considering only the recent part of the cash history of each page
  • The time window corresponding to the recent history may be defined as
  • a fixed number of measures for each page (see the sketch below)
  • a fixed period of time for each page
  • a single value that interpolates the history over a specific period of time
  • Note that the definition of page importance assumes a fixed number of nodes
  • For instance, the importance of previously existing pages decreases automatically when new pages are added.
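
As a concrete illustration, here is a sketch of the first variant, keeping a fixed number of measures per page (the window size K is an arbitrary choice; the time-based and interpolated variants are not shown):

```python
from collections import deque

K = 8                                   # keep the K most recent measures per page

windows = {}                            # page -> its K most recent cash readings

def record_read(page, cash_at_read):
    w = windows.setdefault(page, deque(maxlen=K))
    w.append(cash_at_read)              # the oldest measure drops out automatically

def recent_history(page):
    # The recent history replaces the full H_t[page] in the importance estimate.
    return sum(windows.get(page, ()))
```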

21
Experiments
22
Crawling Strategies
  • Our algorithm works with any crawling strategy, as long as each page is visited infinitely often
  • It imposes no constraint on the order in which pages are visited
  • Simple crawling strategies are
  • Random: all pages have equal probability of being chosen
  • Greedy: choose the page with the largest amount of cash (see the snippet below)
  • Cycle: a systematic strategy that cycles through the set of pages
  • Convergence is faster with Greedy, since the pages that are read have more cash on average to distribute
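
For illustration, the three strategies written as interchangeable page-selection functions (hypothetical helpers; each takes the current cash dictionary and returns the next page to read):

```python
import random
from itertools import cycle

def pick_random(cash):
    return random.choice(list(cash))    # Random: uniform choice among all pages

def pick_greedy(cash):
    return max(cash, key=cash.get)      # Greedy: the page with the largest cash

def make_cycle_picker(pages):
    order = cycle(list(pages))          # Cycle: a fixed round-robin order
    return lambda cash: next(order)
```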

23
Experiment settings for synthetic data
  • Synthetic models of the Web
  • give more flexibility for studying variants of the algorithm
  • We build a graph that simulates the Web
  • We compute the reference page importance on this graph by running the off-line algorithm until the fixpoint is reached
  • We simulate Web crawling and compute page importance online
  • We compare our estimate of page importance with the reference

24
Experiments on synthetic data
  • Impact of the page selection strategy: Greedy is best

25
Experiments on synthetic data
  • Convergence on important pages: Greedy is brilliant on important pages

26
Experiments on synthetic data
  • Impact of the size of the window: difficult to fix; depends on the change rate

27
Experiments on synthetic data
  • Impact of the window policy: interpolated history is best

28
Xyleme Crawlers
29
Implementation
  • INRIA-Xyleme crawlers
  • Run on a cluster of Linux PCs (8 PCs at some point)
  • Code is in C; communications use CORBA
  • Each crawler is in charge of 100 million pages and crawls about 4 million pages per day
  • A difficulty is to assign a unique integer to each page and to provide an efficient translation from integer to URL (see the sketch below)
  • Continuously read pages on the Web (HTML, XML)
  • Use HTML and XML links to discover new pages
  • Monitor the Web, archive XML pages
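
One simple way to maintain such a mapping is sketched below (an assumption for illustration; the slides do not describe Xyleme's actual scheme):

```python
class UrlCatalog:
    """Assigns dense integer ids to URLs and translates ids back to URLs."""

    def __init__(self):
        self.id_of = {}    # URL -> integer id
        self.url_of = []   # integer id -> URL; the id is simply the list index

    def intern(self, url):
        # Assign the next free integer the first time a URL is seen.
        if url not in self.id_of:
            self.id_of[url] = len(self.url_of)
            self.url_of.append(url)
        return self.id_of[url]

    def url(self, page_id):
        return self.url_of[page_id]
```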

30
Implementation - continued
  • We implemented a distributed version of Adaptive OPIC
  • Crawling strategy
  • The crawling strategy in Xyleme was defined to optimize the knowledge of the Web
  • Intuition: refresh frequency proportional to importance
  • It turned out to be, on average, very close to Greedy

31
Overview of Crawler
[Diagram: WWW, a pool of robots inside the Crawler, and the Scheduler]
  • Pages are grouped by domain name and crawled by robots
  • The scheduler decides which pages will be read next, depending on their importance, change rate, client interests, etc.
  • New pages are discovered using links found in HTML pages
  • Management of metadata on known pages
32
Some numbers
  • Fetcher
  • Up to 100 robots running simultaneously on a single PC
  • Average of 50 pages/second on an (old) PC, i.e. about 4 million pages per day
  • The limiting factor is the number of random disk accesses
  • Performance and politeness
  • Pages are grouped by domain to minimize the cost of DNS (Domain Name System) resolution (for the next 10 million pages to be read)
  • To avoid rapid firing, we maintain a large number of accessible sites in memory (1 million domains)
  • Knowledge about visited pages: 100 million pages in main memory
  • For each page: the exact disk location of its info structure (4 bytes) and a counter that we use for page rank and for the crawling strategy
  • One disk access per page that is read

33
Experiments on Web data
  • Experiments were conducted using the crawlers of Xyleme
  • 8 PCs with 1.5 GB of memory each
  • The crawling strategy is close to Greedy (with a focus on XML)
  • History is managed using the interpolation policy
  • Experiments lasted for several months
  • We discovered close to one billion URLs
  • After several trials, we set the window size to 3 months
  • The importance of pages seems correct
  • Same as Google: validated by success
  • Experiments with BnF librarians over a few thousand Web pages: as good as an average librarian
  • Also gives estimates for pages that were never read (60%), hence guidelines for discovering the Web

34
Issues
  • Page importance is OK but not sufficient
  • Content level
  • Classification of pages by topic
  • Refined notion of authority (per domain)
  • Logical site vs. page or physical site
  • What is a logical site?

35
Open problems on Adaptive OPIC
  • More refined analysis of the convergence speed
  • More refined analysis of the adaptive algorithm
  • Selection of the best window
  • More refined study of importance in a changing graph
  • Discovery of new portions of the Web
  • Algorithms for personalized importance
  • Other graph properties computable in a similar
    way
  • Hub
  • Find other applications

36
Merci (thank you)