Title: Adaptive Online Page Importance, Experiments and Applications
1. Adaptive On-line Page Importance, Experiments and Applications
- Serge Abiteboul (INRIA & Xyleme)
- with Grégory Cobéna (INRIA, now Xyleme)
- and Mihai Preda (Xyleme)
2. Motivations
- Page importance
- Notion introduced by Kleinberg
- Popularized on the Web by Google
- Applications of page importance
- Rank the results of a search engine
- Guide the frequency of page visits
- For maintaining the index of a search engine
- For monitoring the Web
- For archiving the Web
- For building a topic-specific warehouse of Web data
3. Organization
- What is page importance?
- Intuition
- Formal model
- The Adaptive OPIC algorithm
- An on-line algorithm
- An adaptive algorithm
- Experiments
- Crawling in Xyleme
- Applications
4. What is page importance?
5. Intuition
- Not all pages on the Web have the same importance
- The Louvre homepage is more important than Mme Michu's homepage
- Page importance information is valuable
- To rank query results (e.g., Google) for display
- Less and less so?
- To decide which pages should be crawled first or which pages should be refreshed next
6. Model
- The Web as a graph/matrix
- We view the Web as a directed graph G
- Web pages are vertices, HTML links are edges
- Let n be the number of pages
- G is represented as a non-negative link matrix L[1..n, 1..n]
- There are many ways to encode the Web as a matrix (see the sketch below)
- Kleinberg sets L[i,j] = 1 if there is a link from i to j
- Brin & Page set L[i,j] = 1/out(i) if there is a link from i to j, where out(i) is the out-degree of i
- In both cases, L[i,j] = 0 if there is no edge from i to j
- The importance is represented as a vector x[1..n]
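To make the two encodings concrete, here is a small C sketch (the 3-page adjacency data and all variable names are made up for illustration) that fills a Kleinberg-style 0/1 matrix and a Brin & Page-style 1/out(i) matrix:

```c
#include <stdio.h>

#define N 3   /* toy graph with 3 pages */

int main(void) {
    /* adjacency: adj[i][j] == 1 iff page i links to page j (illustrative data) */
    int adj[N][N] = { {0, 1, 1},    /* page 0 links to pages 1 and 2 */
                      {1, 0, 0},    /* page 1 links to page 0        */
                      {0, 1, 0} };  /* page 2 links to page 1        */

    double L_kleinberg[N][N], L_brinpage[N][N];

    for (int i = 0; i < N; i++) {
        int out = 0;                                   /* out-degree out(i) */
        for (int j = 0; j < N; j++) out += adj[i][j];
        for (int j = 0; j < N; j++) {
            L_kleinberg[i][j] = adj[i][j];                     /* 1 if link, else 0 */
            L_brinpage[i][j]  = out ? (double)adj[i][j] / out  /* 1/out(i) if link  */
                                    : 0.0;
        }
    }

    printf("L_brinpage[0][1] = %.2f\n", L_brinpage[0][1]);     /* prints 0.50 */
    return 0;
}
```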
7. Importance (modulo details)
- Importance is defined as the unique fixpoint of the equation
- x = L·x
- Page importance can be computed inductively (sketched below)
- x_{k+1} = L·x_k
- If normalized, this corresponds to the limit of the probability of being at a page during a random walk on the Web
- Start randomly on some page; follow randomly some link of the page
- Keep walking
- This corresponds to an intuitive notion of importance
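A minimal C sketch of this off-line iteration, assuming the Brin & Page encoding, L1 normalization after each step, and the same illustrative 3-page graph as above (the update sums over in-links, which is how x = L·x is read here):

```c
#include <stdio.h>
#include <math.h>

#define N 3

int main(void) {
    /* Brin & Page link matrix for the toy 3-page graph:
       L[i][j] = 1/out(i) if page i links to page j, 0 otherwise. */
    double L[N][N] = { {0.0, 0.5, 0.5},
                       {1.0, 0.0, 0.0},
                       {0.0, 1.0, 0.0} };

    double x[N] = { 1.0 / N, 1.0 / N, 1.0 / N };   /* uniform starting vector */

    for (int k = 0; k < 100; k++) {
        double next[N] = { 0.0 }, norm = 0.0;

        /* x_{k+1}[j] = sum_i L[i][j] * x_k[i]: importance flows along links */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                next[j] += L[i][j] * x[i];

        /* normalize so the iteration neither diverges nor vanishes */
        for (int j = 0; j < N; j++) norm += fabs(next[j]);
        for (int j = 0; j < N; j++) x[j] = next[j] / norm;
    }

    printf("importance: %.2f %.2f %.2f\n", x[0], x[1], x[2]);   /* ~0.40 0.40 0.20 */
    return 0;
}
```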
8. Some of the details, in brief
- For each non-negative matrix L
- There always exists such a fixpoint, but it may not be unique
- Iterating over k will diverge or converge to zero in most cases
- A normalization after each step is necessary
- Theorem
- If the graph is strongly connected, there exists a unique fixpoint (up to normalization)
- If the graph is a-periodic, the iteration converges
9. Strong connectivity: disjoint components
[Figure: two disjoint components, one with pages A and B, one with pages C and D]
- The relative importance of A,B, as compared to C,D, depends on the initial value of x
- One solution for (A,B) and one for (C,D) gives many solutions for the system
10. Strong connectivity: sinks
[Figure: pages A and B link into a sink component formed by pages C and D]
- In the random walk model, the importance of A and B is zero
- Only C and D accumulate some importance
11. A-periodicity
[Figure: a periodic cycle over pages A, B and C]
- The fixpoint oscillates between several values
12. Situation for the Web
- The Web is not strongly connected
- Consider the bow-tie model of the Web graph
- Google adds a small edge for every pair (i,j) (see the sketch below)
- We add small edges to/from a virtual page
- Intuition: consider the possibility for users to navigate the Web without using links (e.g., bookmarks, URLs)
- The Web is reasonably a-periodic
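A possible C sketch of the "small edge for every pair" fix mentioned above; the mixing weight EPS is an arbitrary illustrative value, not one stated in the talk:

```c
#include <stdio.h>

#define N   3
#define EPS 0.15   /* illustrative mixing weight, not from the talk */

int main(void) {
    /* toy Brin & Page matrix, as in the previous sketches */
    double L[N][N] = { {0.0, 0.5, 0.5},
                       {1.0, 0.0, 0.0},
                       {0.0, 1.0, 0.0} };

    /* mix in a small uniform edge between every pair of pages, so the
       graph becomes strongly connected and a-periodic; each row still
       sums to 1, which keeps the random-walk interpretation */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            L[i][j] = EPS / N + (1.0 - EPS) * L[i][j];

    printf("L[1][2] went from 0.00 to %.2f\n", L[1][2]);   /* now positive */
    return 0;
}
```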
13. Adaptive OPIC: Adaptive On-line Page Importance Computation
14. On-line Computation: Motivations
- Off-line algorithm
- Crawls the Web and builds a link matrix
- Storing the link matrix and updating it is very expensive
- Starts an off-line computation on a frozen link matrix
- On-line Page Importance Computation
- Does not require storing the link matrix
- Works continuously, together with crawling
- Works independently of any crawling strategy
- Provides an early estimate of page importance to guide crawling
- Keeps improving and updating the estimate
15. Static Graphs: OPIC
- We assign to each page a small amount of cash
- When a page is read, its cash is distributed among its children
- The total amount of cash over all pages does not change
- The importance of a given page is computed using the cash history of that page (see the sketch below)
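A minimal C sketch of this cash/history bookkeeping on a toy 3-page graph; the graph, the number of steps and the systematic reading order are illustrative, while the distribution rule follows the description above:

```c
#include <stdio.h>

#define N     3
#define STEPS 10000

int main(void) {
    /* toy graph: page 0 -> {1,2}, page 1 -> {0}, page 2 -> {1} */
    int children[N][N] = { {1, 2}, {0}, {1} };
    int out[N]         = { 2, 1, 1 };

    double cash[N], hist[N];
    for (int i = 0; i < N; i++) { cash[i] = 1.0 / N; hist[i] = 0.0; }

    for (int t = 0; t < STEPS; t++) {
        int p = t % N;                     /* systematic order; any fair strategy works */

        hist[p] += cash[p];                /* record the cash seen at this read         */
        double share = cash[p] / out[p];   /* split the cash among the children         */
        for (int c = 0; c < out[p]; c++)
            cash[children[p][c]] += share;
        cash[p] = 0.0;                     /* total cash in the system is unchanged     */
    }

    double total = 0.0;
    for (int i = 0; i < N; i++) total += hist[i];
    for (int i = 0; i < N; i++)            /* estimate = history / sum of histories     */
        printf("page %d: %.2f\n", i, hist[i] / total);   /* ~0.40, 0.40, 0.20 */
    return 0;
}
```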
16. Example
- Small Web of 3 pages: Alice, Bob, Georges
- Alice has all the cash to start
- Importance is independent of the original position of the cash
[Figure: Alice links to Bob and to Georges, Bob links to Alice, Georges links to Bob]
17. What happened?
- Cash-game history
- Alice received 600 (200 + 400)
- Bob received 600 (200 + 100 + 300)
- Georges received 300 (200 + 100)
- Solution
- I(Alice) = 40%
- I(Bob) = 40%
- I(Georges) = 20%
- It is the fixpoint
- I(page) = History(page) / Sum(Histories)
18. Cash-History
- Evolution of cash and history for (Alice, Bob, Georges), starting with cash 1/3 each:

Step            Cash (Alice, Bob, Georges)   History (Alice, Bob, Georges)
t0              0.33, 0.33, 0.33             0, 0, 0
Read Alice      0, 0.50, 0.50                0.33, 0, 0
Read Bob        0.5, 0, 0.5                  0.33, 0.50, 0
Read Georges    0.5, 0.5, 0                  0.33, 0.50, 0.50
Read Bob        1, 0, 0                      0.33, 1.0, 0.50
Read Alice      0, 0.5, 0.5                  1.33, 1.0, 0.50
19. Computing Page Importance
- C_t[i] is the cash of page i at time t
- H_t[i] is the history (sum of previously recorded cash) of page i
- The total amount of cash is constant
- For each page i, H_t[i] goes to infinity
- For each page j, at each step (checked on the example below):
- H_t[j] + C_t[j] = C_0[j] + sum_{i ancestor of j} H_t[i] / out(i)
- Theorem: the limit of H_t[j] / sum_i(H_t[i]) is the importance of page j
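As a sanity check, the equation can be verified on the final state of the three-page example above (values rounded to hundredths, so thirds appear as 0.33 and 0.67):
- Alice: H + C = 1.33 + 0 = 1.33 ≈ 0.33 + H(Bob)/1 = 0.33 + 1.0
- Bob: H + C = 1.0 + 0.5 = 1.5 ≈ 0.33 + H(Alice)/2 + H(Georges)/1 = 0.33 + 0.67 + 0.5
- Georges: H + C = 0.5 + 0.5 = 1.0 ≈ 0.33 + H(Alice)/2 = 0.33 + 0.67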
20. The Web is a changing graph: the Adaptive Algorithm
- The Web changes continuously, and so does the importance of pages
- Our adaptive algorithm works by considering only the recent part of the cash history of each page
- The time window corresponding to the recent history may be defined as:
- A fixed number of measures for each page
- A fixed period of time for each page
- A single value that interpolates the history over a specific period of time (see the sketch below)
- Note that the definition of page importance considers a fixed number of nodes
- For instance, the page importance of previously existing pages decreases automatically when new pages are added
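One plausible implementation of the single interpolated value, sketched in C; the discount rule and the 90-day window are assumptions made for illustration, since the talk does not spell out the exact interpolation formula:

```c
#include <stdio.h>

#define WINDOW 90.0   /* window size in days (illustrative) */

typedef struct {
    double hist;        /* interpolated history over the recent window */
    double cash;        /* current cash of the page                    */
    double last_read;   /* time (in days) of the previous read         */
} page_t;

/* On each read, discount the part of the old history assumed to fall
   outside the window, then add the cash recorded at this read. */
void read_page(page_t *p, double now) {
    double dt = now - p->last_read;

    if (dt < WINDOW)
        p->hist = p->hist * (WINDOW - dt) / WINDOW + p->cash;
    else
        p->hist = p->cash;   /* previous reads are entirely too old */

    p->last_read = now;
    /* ... the cash would then be distributed to the children, as before ... */
    p->cash = 0.0;
}

int main(void) {
    page_t p = { .hist = 10.0, .cash = 2.0, .last_read = 0.0 };
    read_page(&p, 30.0);   /* read again 30 days later */
    printf("windowed history: %.2f\n", p.hist);   /* 10*(60/90) + 2 = 8.67 */
    return 0;
}
```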
21. Experiments
22. Crawling Strategies
- Our algorithm works with any crawling strategy, as long as each page is visited infinitely often
- It does not impose any constraint on the order in which pages are visited
- Simple crawling strategies are:
- Random: all pages have equal probability of being chosen
- Greedy: choose the page with the largest amount of cash (sketched below)
- Cycle: a systematic strategy that cycles over the set of pages
- Convergence is faster with Greedy, since the pages that are read have more cash on average to distribute
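A minimal C sketch of the Greedy choice (the function name is made up; in the real crawler the choice is also constrained by politeness, domain grouping, etc.):

```c
#include <stdio.h>
#include <stddef.h>

/* Greedy strategy: among the candidate pages, pick the one holding the most cash. */
static size_t pick_greedy(const double *cash, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (cash[i] > cash[best])
            best = i;
    return best;
}

int main(void) {
    double cash[] = { 0.1, 0.7, 0.2 };   /* illustrative cash values */
    printf("next page to read: %zu\n", pick_greedy(cash, 3));   /* prints 1 */
    return 0;
}
```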
23. Experiment settings for synthetic data
- Synthetic models of the Web give more flexibility for studying variants of the algorithm
- We build a graph that simulates the Web
- We compute the reference page importance on this graph using the off-line algorithm, iterating until the fixpoint is reached
- We simulate Web crawling and compute page importance on-line
- We compare our estimate of page importance with the reference
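The talk does not name the error measure used for the comparison; one simple possibility, sketched in C, is the L1 distance between the normalized on-line estimate and the off-line reference:

```c
#include <stdio.h>
#include <stddef.h>
#include <math.h>

/* L1 distance between two importance vectors of length n. */
static double l1_error(const double *estimate, const double *reference, size_t n) {
    double err = 0.0;
    for (size_t i = 0; i < n; i++)
        err += fabs(estimate[i] - reference[i]);
    return err;
}

int main(void) {
    double est[] = { 0.42, 0.38, 0.20 };   /* on-line estimate (illustrative)  */
    double ref[] = { 0.40, 0.40, 0.20 };   /* off-line fixpoint (illustrative) */
    printf("L1 error: %.2f\n", l1_error(est, ref, 3));   /* prints 0.04 */
    return 0;
}
```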
24. Experiments on synthetic data
- Impact of the page-selection strategy: Greedy is best
25. Experiments on synthetic data
- Convergence on important pages: Greedy does brilliantly on important pages
26. Experiments on synthetic data
- Impact of the size of the window: difficult to fix, depends on the change rate
27. Experiments on synthetic data
- Impact of the window policy: interpolated history is best
28. Xyleme Crawlers
29. Implementation
- INRIA-Xyleme crawlers
- Run on a cluster of Linux PCs (8 PCs at some point)
- Code is in C, communications use CORBA
- Each crawler is in charge of 100 million pages and crawls about 4 million pages per day
- A difficulty is to assign a unique integer to each page and to provide an efficient translation from integer to URL
- Continuously read pages on the Web (HTML, XML)
- Use HTML/XML links to discover new pages
- Monitor the Web, archive XML pages
30. Implementation (continued)
- We implemented a distributed version of Adaptive OPIC
- Crawling strategy
- The crawling strategy in Xyleme was defined to optimize the knowledge of the Web
- Intuition: refresh frequency proportional to importance (see the sketch below)
- It turned out to be, on average, very close to Greedy
31. Overview of the Crawler
[Figure: crawler architecture, with robots fetching pages from the WWW and a scheduler feeding the crawler]
- Pages are grouped by domain name and crawled by robots
- The scheduler decides which pages will be read next, depending on their importance, change rate, client interests, etc.
- New pages are discovered using links found in HTML pages
- Management of metadata on known pages
32. Some numbers
- Fetcher
- Up to 100 robots running simultaneously on a single PC
- Average of 50 pages/second on an (old) PC (4 million/day)
- The limiting factor is the number of random disk accesses
- Performance and politeness
- Pages are grouped by domain to minimize the cost of DNS (Domain Name System) resolution (for the next 10 million pages to be read)
- To avoid rapid firing, we maintain a large number of accessible sites in memory (1 million domains)
- Knowledge about visited pages: 100 million pages in main memory
- For each page: the exact disk location of its info structure (4 bytes) and a counter used for page rank and for the crawling strategy (see the sketch below)
- One disk access per page that is read
33. Experiments on Web data
- Experiments were conducted using the Xyleme crawlers
- 8 PCs with 1.5 GB of memory each
- The crawling strategy is close to Greedy (with a focus on XML)
- The history is managed using the interpolation policy
- Experiments lasted for several months
- We discovered close to one billion URLs
- After several trials, we set the window size to 3 months
- The importance of pages seems correct
- Same as Google, validated by success
- Experiments with BnF librarians over a few thousand Web pages: as good as an average librarian
- Also gives estimates for pages that were never read (60%), and so guidelines to discover the Web
34. Issues
- Page importance is OK but not sufficient
- Content level
- Classification of pages by topic
- Refined notion of authority (per domain)
- Logical site vs. page or physical site
- What is a logical site?
35. Open problems on Adaptive OPIC
- More refined analysis of the convergence speed
- More refined analysis of the adaptive algorithm
- Selection of the best window
- More refined study of importance in a changing graph
- Discovery of new portions of the Web
- Algorithms for personalized importance
- Other graph properties computable in a similar way
- Hubs
- Finding other applications
36. Thank you (Merci)