Title: Overview of Web Ranking Algorithms: HITS and PageRank
1Overview of Web Ranking Algorithms: HITS and PageRank
- April 6, 2006
- Presented by Bill Eberle
2Overview
- Problem
- Web as a Graph
- HITS
- PageRank
- Comparison
3Problem
- Specific queries (scarcity problem).
- Broad-topic queries (abundance problem).
- Goal: find the smallest set of authoritative sources.
4Web as a Graph
- Web pages as nodes of a graph.
- Links as directed edges.
[Figure: example pages (my page, www.uta.edu, www.google.com) drawn as graph nodes, with links as directed edges]
5Link Structure of the Web
- Forward links (out-edges).
- Backward links (in-edges).
- Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.
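The forward/backward link view can be sketched in a few lines of Python. This is only an illustration: the toy link structure and the names `forward_links` and `invert_links` are assumptions, not something taken from the slides.

```python
# Hypothetical toy web: each page maps to the pages it links to (out-edges).
forward_links = {
    "my_page": ["www.uta.edu", "www.google.com"],
    "www.uta.edu": ["www.google.com"],
    "www.google.com": [],
}

def invert_links(forward):
    """Build backward links (in-edges) from forward links (out-edges)."""
    backward = {page: [] for page in forward}
    for page, targets in forward.items():
        for target in targets:
            backward.setdefault(target, []).append(page)
    return backward

backward_links = invert_links(forward_links)
# A page with many backward links is a candidate for high importance.
in_degree = {page: len(sources) for page, sources in backward_links.items()}
```

Counting in-edges alone ignores the quality of the linking pages, which is the gap that HITS and PageRank both address.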
6HITS
- HITS (Hyperlink-Induced Topic Search)
- Authoritative Sources in a Hyperlinked
Environment, Jon Kleinberg, Cornell University.
1998.
7Authorities and Hubs
- An authority is a page which has relevant information about the topic.
- A hub is a page which has a collection of links to pages about that topic.
[Figure: a hub h and authorities a1, a2, a3, a4]
8Authorities and Hubs (cont.)
- Good hubs are the ones that point to good authorities.
- Good authorities are the ones that are pointed to by good hubs.
[Figure: hubs h1 through h5 and authorities a1 through a6]
9Finding Authorities and Hubs
- First, construct a focused sub-graph of the www.
- Second, compute Hubs and Authorities from the
sub-graph.
10Construction of Sub-graph
[Figure: sub-graph construction. Labels: Topic, Search Engine, Rootset, Rootset Pages, Crawler, Forward link pages, Expanded set Pages]
11Root Set and Base Set
- Use query term to collect a root set of pages
from text-based search engine (AltaVista).
12Root Set and Base Set (cont.)
- Expand root set into base set by including (up to a designated size cut-off):
- All pages linked to by pages in root set
- All pages that link to a page in root set
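A minimal sketch of the expansion step, assuming the link structure is already available as two dictionaries. The function name `build_base_set` and the parameter `max_in_per_page` are hypothetical; here the cut-off is applied only to in-linking pages, which is one possible reading of "designated size cut-off".

```python
def build_base_set(root_set, forward_links, backward_links, max_in_per_page=50):
    """Expand a root set into a base set.

    max_in_per_page (hypothetical name) caps how many in-linking pages
    are added for any single root page.
    """
    base = set(root_set)
    for page in root_set:
        # All pages linked to by a page in the root set.
        base.update(forward_links.get(page, []))
        # Pages that link to a page in the root set, up to the cut-off.
        base.update(backward_links.get(page, [])[:max_in_per_page])
    return base

# Toy data: r1 and r2 are the root set returned by the text search.
forward_links = {"r1": ["x"], "r2": ["x", "y"], "z": ["r1"], "w": ["r1"]}
backward_links = {"r1": ["z", "w"], "x": ["r1", "r2"], "y": ["r2"]}
base = build_base_set(["r1", "r2"], forward_links, backward_links, max_in_per_page=1)
```

With the cut-off set to 1, only the first in-linking page of r1 is kept, so w is left out of the base set.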
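13Hubs and Authorities Calculation
- Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
- Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
- Repeat the following two operations (and then re-normalize a and h to have unit norm):
- Authority update: $a(p) = \sum_{q \to p} h(q)$, e.g. $a(p) = h(v_1) + h(v_2) + h(v_3)$ when v1, v2, v3 link to p.
- Hub update: $h(p) = \sum_{p \to q} a(q)$, e.g. $h(p) = a(v_1) + a(v_2) + a(v_3)$ when p links to v1, v2, v3.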
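The iteration on this slide can be sketched as follows. This is an illustrative version under simple assumptions: a fixed number of rounds instead of a convergence test, and a graph given as a dictionary named `links` (a hypothetical name).

```python
import math

def normalize(weights):
    """Scale weights so that the sum of their squares is 1 (unit norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: (w / norm if norm else 0.0) for p, w in weights.items()}

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a(p) = sum of h(q) over pages q that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update: h(p) = sum of a(q) over pages q that p links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Toy graph: h1 points to two authorities, h2 to one.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In the toy graph, a1 ends with the larger authority weight because both hubs point to it, and h1 ends with the larger hub weight because it points to both authorities.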
14Example
[Figure: four-node example graph; every node is labeled with Hub 0.45, Authority 0.45]
15Example (cont.)
[Figure: the same graph after an update; node labels (hub, authority): 0.45, 0.9; 1.35, 0.9; 0.9, 0.45; 0.45, 0.9]
16Algorithmic Outcome
- Applying iterative multiplication (power iteration) to any non-degenerate initial vector converges to the principal eigenvector.
- Hub and authority weights are the outcome of this process.
- The largest entries of the principal eigenvectors identify the best hubs and authorities.
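The link to eigenvectors can be made explicit in matrix form. The notation below is an assumption added for illustration (it does not appear on the slides): A is the adjacency matrix of the base set, a and h are the authority and hub weight vectors, and h_0 is the initial hub vector.

```latex
% A_{ij} = 1 if page i links to page j, and 0 otherwise.
\begin{align*}
a &\leftarrow A^{\top} h, & h &\leftarrow A\,a \\
a_k &\propto (A^{\top}A)^{k-1} A^{\top} h_0, & h_k &\propto (A A^{\top})^{k} h_0
\end{align*}
```

Repeated multiplication by a fixed matrix followed by normalization is power iteration, so under the usual conditions a tends to the principal eigenvector of A^T A and h to the principal eigenvector of A A^T.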
17Results
- Although HITS is only link-based (it completely disregards page content), results are quite good in many tested queries.
- When the authors tested the query "search engines":
- The algorithm returned Yahoo!, Excite, Magellan, Lycos, AltaVista.
- However, none of these pages described themselves as a "search engine" (at the time of the experiment).
18Issues
- Starting from a narrow topic, HITS tends to end in a more general one.
- A property of hub pages: many links can cause the algorithm to drift, since they can point to authorities in different topics.
- Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.
19Possible Enhancements
- Use weighted sums for link calculation.
- Take advantage of anchor text (the text surrounding the link itself).
- Break hubs into smaller pieces and analyze each piece separately, instead of treating the whole hub page as one.
- Disregard or minimize the influence of links inside one domain.
- IBM expanded HITS into Clever, which was not seen as a viable real-time search engine.
20PageRank
- The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin, Stanford University. 1998.
21Basic Idea
- Back-links coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo home page, it may be just one link but it is a very important one.
- A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and when a page has a few highly ranked back-links.
22Definition
- My page's rank is equal to the sum of the ranks of all the pages pointing to me, each divided by that page's number of outbound links.
23Simplified PageRank Example
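- $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$, where R(u) is the rank of page u, B_u is the set of pages pointing to u, N_v is the number of outgoing links of page v, and c is a normalization constant (c < 1 to cover for pages with no outgoing links).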
24Expanded Definition
- $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$
- R(u) = page rank of page u
- c = factor used for normalization (< 1)
- B_u = set of pages pointing to u
- N_v = number of outbound links of v
- R(v) = page rank of page v that points to u
- E(u) = distribution over web pages that a random surfer periodically jumps to (set to 0.15)
25Problem 1 - Rank Sink
- A cycle of pages that is pointed to by some incoming link, but has no links leading out of the cycle.
- The loop will accumulate rank but never distribute it.
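A small illustration of a rank sink, assuming the simplified update without an escape term and with c = 1. The three-page graph is hypothetical: A links into a two-page loop (B and C) that has no links out.

```python
def simple_rank_step(ranks, links):
    """One simplified update: each page splits its rank evenly over its out-links."""
    new = {p: 0.0 for p in ranks}
    for v, targets in links.items():
        for u in targets:
            new[u] += ranks[v] / len(targets)
    return new

# A -> B, and B <-> C form a loop with no way out.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
ranks = {p: 1 / 3 for p in links}
for _ in range(10):
    ranks = simple_rank_step(ranks, links)
```

After the first step, A holds no rank at all and the loop holds all of it, passing it back and forth between B and C indefinitely.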
26Problem 2 - Dangling Links
- In general, many Web pages do not have either back links or forward links.
- Dangling links do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated.
27Random Surfer Model
- PageRank corresponds to the probability distribution of a random walk on the web graph.
28Solution Escape Term
- The escape term E(u) can be thought of as the random surfer getting bored periodically and jumping to a different page, rather than staying in the loop forever.
- We define E to be a vector over all the web pages that accounts for each page's escape probability (user-defined parameter).
29PageRank Computation
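- $R_0 \leftarrow S$: initialize vector over web pages
- Loop:
- $R_{i+1} \leftarrow A R_i$: new ranks = sum of normalized backlink ranks
- $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$: compute normalizing factor
- $R_{i+1} \leftarrow R_{i+1} + dE$: add escape term
- $\delta \leftarrow \|R_{i+1} - R_i\|_1$: control parameter
- While $\delta > \epsilon$: stop when converged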
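The loop on this slide can be sketched in Python as below. It is an illustration under stated assumptions, not the authors' implementation: the start vector is uniform, the escape vector `escape` is assumed to sum to 1, and the names `pagerank`, `epsilon`, and `max_iterations` are hypothetical. In this form the normalizing factor is the rank lost during the step (for example through pages without out-links), which is then handed back according to the escape vector.

```python
def pagerank(links, escape, epsilon=1e-8, max_iterations=1000):
    """links: page -> list of pages it links to (every page must be a key).
    escape: page -> E(page), assumed here to sum to 1."""
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}   # initialize (uniform start)
    for _ in range(max_iterations):
        # New ranks: sum of normalized backlink ranks.
        new = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new[u] += ranks[v] / len(targets)
        # Normalizing factor: rank lost in this step (ranks are non-negative,
        # so the L1 norm is just the sum).
        d = sum(ranks.values()) - sum(new.values())
        # Add the escape term.
        for p in pages:
            new[p] += d * escape[p]
        # Control parameter: how much the ranks changed.
        delta = sum(abs(new[p] - ranks[p]) for p in pages)
        ranks = new
        if delta <= epsilon:
            break
    return ranks

# Toy graph: C has no out-links, so its rank is redistributed through E.
links = {"A": ["B"], "B": ["A", "C"], "C": []}
escape = {p: 1 / 3 for p in links}
ranks = pagerank(links, escape)
```

Because the lost rank is always added back, the total rank stays at 1 across iterations.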
30Matrices
- A is a square matrix whose rows and columns correspond to web pages u and v.
- With A a matrix and R a vector over all the Web pages, R is the dominant eigenvector of A: the one associated with the maximal eigenvalue.
31Example
[Figure: matrix $A^T$ for the example graph]
32Example (cont.)
- $R = cAR = MR$: R is an eigenvector of A (with eigenvalue 1/c).
- $Ax = \lambda x \Rightarrow (A - \lambda I)x = 0$
- [Figure: matrix A and the resulting normalized vector R]
33Implementation
- 1. Map each URL to an id (URL -> id).
- 2. Store each hyperlink in a database.
- 3. Sort the link structure by parent id.
- 4. Remove dangling links.
- 5. Calculate the PR, giving each page an initial value.
- 6. Iterate until convergence.
- 7. Add the dangling links back.
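Steps 1 and 4 can be sketched as follows. This is a simplified in-memory illustration (the slide stores links in a database); the function names `assign_ids` and `remove_dangling` are hypothetical. Removal is repeated because dropping one dangling page can leave another page without out-links.

```python
def assign_ids(urls):
    """Step 1: map each distinct URL to a small integer id."""
    ids = {}
    for url in urls:
        ids.setdefault(url, len(ids))
    return ids

def remove_dangling(links):
    """Step 4: repeatedly drop pages with no out-links and the links to them.

    links: page id -> list of page ids it links to (every page is a key).
    Returns the pruned link structure and the removed ids, in removal order.
    """
    links = {p: list(t) for p, t in links.items()}
    removed = []
    while True:
        dangling = [p for p, targets in links.items() if not targets]
        if not dangling:
            return links, removed
        removed.extend(dangling)
        gone = set(dangling)
        links = {p: [q for q in t if q not in gone]
                 for p, t in links.items() if p not in gone}

ids = assign_ids(["a.example", "b.example", "a.example", "c.example"])
pruned, removed = remove_dangling({0: [1, 2], 1: [0], 2: [3], 3: []})
```

In the toy input, page 3 is dangling at first; once it is removed, page 2 becomes dangling as well, so both are set aside until the ranks of pages 0 and 1 have been computed.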
34Example
Which of these three has the highest page rank?
[Figure: three linked pages: Page A, Page B, Page C]
35Example (cont.)
[Figure: Page A, Page B, Page C]
36Example (cont.)
- Re-write the system of equations as a matrix-vector product.
- The PageRank vector is simply an eigenvector (scalar × vector = matrix × vector) of the coefficient matrix.
37Example (cont.)
[Figure: Page A, Page B, and Page C labeled with PageRank values 0.4, 0.2, and 0.4]
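The matrix-vector view of this example can be sketched as below. The link structure is an assumption, since the slide's figure is not preserved: A links to B and C, B links to C, and C links to A. This graph is chosen because it reproduces the values 0.4, 0.2, and 0.4 given in the example. The names `M` and `mat_vec` are hypothetical.

```python
# Coefficient matrix for the assumed links A -> B, A -> C, B -> C, C -> A.
# Entry M[i][j] is the share of page j's rank passed to page i
# (rows and columns are ordered A, B, C).
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Power iteration: start uniform and apply R <- M R repeatedly.
R = [1 / 3, 1 / 3, 1 / 3]
for _ in range(100):
    R = mat_vec(M, R)
```

The result satisfies R = M R, so it is an eigenvector of the coefficient matrix with eigenvalue 1, scaled so that the ranks sum to 1.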
38Example (cont.)
[Table: PR(A), PR(B), and PR(C) over iterations 0 through 12, with d = 0.5; the values are not preserved]
39Convergence
- The number of iterations PageRank needs to converge grows roughly as O(log |V|), where V is the set of pages.
40Other Applications
- Help user decide if a site is trustworthy.
- Estimate web traffic.
- Spam detection and prevention.
- Predict citation counts.
41Issues
- Users are not random walkers.
- Starting point distribution (actual usage data as starting vector).
- Bias towards main pages.
- Linkage spam.
- No query specific rank.
42PageRank vs. HITS
- HITS
- (CLEVER)
- performed on the set of retrieved web pages for each query
- computes authorities and hubs
- easy to compute, but real-time execution is hard
- PageRank
- (Google)
- computed for all web pages stored in the database prior to the query
- computes authorities only
- trivial and fast to compute
43References
- Authoritative Sources in a Hyperlinked Environment, Jon Kleinberg, Cornell University. 1998.
- The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin, Stanford University. 1998.