Overview of Web Ranking Algorithms: HITS and PageRank - PowerPoint PPT Presentation

Provided by: bille1
Learn more at: https://crystal.uta.edu
1
Overview of Web Ranking Algorithms: HITS and PageRank
  • April 6, 2006
  • Presented by Bill Eberle

2
Overview
  • Problem
  • Web as a Graph
  • HITS
  • PageRank
  • Comparison

3
Problem
  • Specific queries (scarcity problem).
  • Broad-topic queries (abundance problem).
  • Goal: find the smallest set of authoritative
    sources.

4
Web as a Graph
  • Web pages as nodes of a graph.
  • Links as directed edges.

[Figure: example web graph whose nodes are "my page", www.uta.edu, and www.google.com, connected by directed links]

5
Link Structure of the Web
  • Forward links (out-edges).
  • Backward links (in-edges).
  • Approximation of importance/quality: a page may
    be of high quality if it is referred to by many
    other pages, and by pages of high quality.
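
As a minimal sketch of forward and backward links, the snippet below inverts a forward-link map into a backward-link map. The function name `backward_links` and the edges in `web` are hypothetical; only the page names come from the slide.

```python
def backward_links(forward):
    """Invert {page: [pages it links to]} into {page: [pages linking to it]}."""
    back = {page: set() for page in forward}
    for page, targets in forward.items():
        for target in targets:
            back.setdefault(target, set()).add(page)
    return {page: sorted(sources) for page, sources in back.items()}

# Hypothetical edges between the page names used in the slides.
web = {
    "my page": ["www.uta.edu", "www.google.com"],
    "www.uta.edu": ["www.google.com"],
    "www.google.com": [],
}
back = backward_links(web)
```

In this made-up graph, www.google.com has two backward links and "my page" has none, so a pure in-link count would already rank www.google.com highest.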

6
HITS
  • HITS (Hyperlink-Induced Topic Search)
  • "Authoritative Sources in a Hyperlinked
    Environment", Jon Kleinberg, Cornell University,
    1998.

7
Authorities and Hubs
  • An authority is a page that has relevant
    information about the topic.
  • A hub is a page that has a collection of links to
    pages about that topic.

[Figure: a hub h and authorities a1, a2, a3, a4]

8
Authorities and Hubs (cont.)
  • Good hubs are the ones that point to good
    authorities.
  • Good authorities are the ones that are pointed to
    by good hubs.

[Figure: hubs h1 to h5 and authorities a1 to a6]

9
Finding Authorities and Hubs
  • First, construct a focused sub-graph of the WWW.
  • Second, compute Hubs and Authorities from the
    sub-graph.

10
Construction of Sub-graph
[Figure: a topic is sent to a search engine, which returns the root set pages; a crawler follows the forward links of the root set to produce the expanded set pages]

11
Root Set and Base Set
  • Use the query term to collect a root set of pages
    from a text-based search engine (AltaVista).

12
Root Set and Base Set (cont.)
  • Expand the root set into the base set by including
    (up to a designated size cut-off):
      • all pages linked to by pages in the root set
      • all pages that link to a page in the root set
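
The expansion step can be sketched as follows. This is an illustration under assumptions, not the presenter's code: the function name `expand_root_set`, the `forward`/`backward` link maps, and the `max_in` parameter are hypothetical, and capping the in-linking pages per root page is only one way to apply the size cut-off.

```python
def expand_root_set(root, forward, backward, max_in=50):
    """Grow a root set into a base set using link maps.

    forward[p]  -> pages that p links to
    backward[p] -> pages that link to p
    max_in caps how many in-linking pages are added per root page
    (an assumed form of the slide's "designated size cut-off").
    """
    base = set(root)
    for page in root:
        base.update(forward.get(page, []))
        base.update(sorted(backward.get(page, []))[:max_in])
    return base

# Hypothetical link data for two root pages.
forward = {"r1": ["x"], "r2": ["y"]}
backward = {"r1": ["b1", "b2", "b3"], "r2": []}
base = expand_root_set(["r1", "r2"], forward, backward, max_in=2)
```

With `max_in=2`, only two of the three pages pointing at `r1` enter the base set, which keeps a very popular root page from flooding the sub-graph.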

13
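Hubs & Authorities Calculation
  • Iterative algorithm on the base set: authority
    weights a(p) and hub weights h(p).
  • Set authority weights a(p) = 1 and hub weights
    h(p) = 1 for all p.
  • Repeat the following two operations (and then
    re-normalize a and h to have unit norm):
      • $a(p) = \sum_{q \to p} h(q)$, summing over pages q that link to p
      • $h(p) = \sum_{p \to q} a(q)$, summing over pages q that p links to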

[Figure: pages v1, v2, v3 link to p, so a(p) = h(v1) + h(v2) + h(v3); p links to v1, v2, v3, so h(p) = a(v1) + a(v2) + a(v3)]

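A minimal sketch of this iteration in plain Python is shown here. The names `hits` and `normalize`, the fixed iteration count, and the small `links` graph are assumptions made for illustration; a real run would iterate until the weights stop changing.

```python
import math

def normalize(weights):
    """Scale a weight map to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}

def hits(links, iterations=20):
    """links[p] lists the pages that p points to; returns (authority, hub)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over pages q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # h(p) = sum of a(q) over pages q that p links to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical graph: two hubs pointing at two authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(links)
```

In this toy graph `a1` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.
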
14
Example
[Figure: example graph in which every node starts with hub weight 0.45 and authority weight 0.45]

15
Example (cont.)
[Figure: the same graph after one update, with node labels (hub, authority) of (0.45, 0.9), (1.35, 0.9), (0.9, 0.45), and (0.45, 0.9)]

16
Algorithmic Outcome
  • Applying iterative multiplication (power
    iteration) to any non-degenerate initial vector
    converges to the principal eigenvector.
  • Hub and authority weights are the outcome of this
    process.
  • The largest entries of the principal eigenvectors
    identify the best hubs and authorities.
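
The link to eigenvectors can be written out as a short derivation. It assumes the notation that A is the adjacency matrix of the base set, with $A_{pq} = 1$ when page p links to page q, and that a and h are the authority and hub weight vectors; the symbols A, a, and h as vectors are introduced here for illustration.

```latex
% One round of the two update operations in matrix form
a \leftarrow A^{T} h, \qquad h \leftarrow A a
% Substituting one update into the other, k rounds give (up to normalization)
a^{(k)} \propto (A^{T} A)^{k} \, a^{(0)}, \qquad
h^{(k)} \propto (A A^{T})^{k} \, h^{(0)}
```

Repeated multiplication by a fixed matrix is power iteration, so a tends to the principal eigenvector of $A^{T}A$ and h to the principal eigenvector of $AA^{T}$.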

17
Results
  • Although HITS is only link-based (it completely
    disregards page content), results are quite good
    in many tested queries.
  • When the author tested the query "search
    engines":
      • The algorithm returned Yahoo!, Excite,
        Magellan, Lycos, and AltaVista.
      • However, none of these pages described
        themselves as a search engine (at the time of
        the experiment).

18
Issues
  • Starting from a narrow topic, HITS tends to drift
    toward a more general one.
  • Hub pages with many links can cause this
    drift, because they can point to authorities on
    different topics.
  • Pages from a single domain or website can dominate
    the result if they all point to one page, which is
    not necessarily a good authority.

19
Possible Enhancements
  • Use weighted sums for link calculation.
  • Take advantage of anchor text: the text
    surrounding the link itself.
  • Break hubs into smaller pieces and analyze each
    piece separately, instead of treating the whole
    hub page as one.
  • Disregard or minimize the influence of links
    inside one domain.
  • IBM expanded HITS into Clever, which was not seen
    as a viable real-time search engine.

20
PageRank
  • "The PageRank Citation Ranking: Bringing Order to
    the Web", Lawrence Page and Sergey Brin, Stanford
    University, 1998.

21
Basic Idea
  • Back-links coming from important pages convey
    more importance to a page. For example, if a web
    page has a link from the Yahoo home page, it may
    be just one link but it is a very important one.
  • A page has high rank if the sum of the ranks of
    its back-links is high. This covers both the case
    when a page has many back-links and when a page
    has a few highly ranked back-links.

22
Definition
  • My page's rank is equal to the sum of the ranks
    of all the pages pointing to me.

23
Simplified PageRank Example
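  • $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$, where
    R(u) is the rank of page u, B_u is the set of
    pages pointing to u, N_v is the number of
    outgoing links of v, and c is a normalization
    constant (c < 1 to cover for pages with no
    outgoing links).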

24
Expanded Definition
  • $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$
  • R(u): PageRank of page u
  • c: factor used for normalization (< 1)
  • B_u: set of pages pointing to u
  • N_v: number of outbound links of v
  • R(v): PageRank of page v that points to u
  • E(u): distribution over web pages to which a
    random surfer periodically jumps (set to 0.15)
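
To make the definition concrete, here is a small sketch that evaluates the rank of one page from the ranks of its back-links. The function name `rank_of`, the sample ranks, and the value c = 0.8 are illustrative assumptions; only the symbols and E(u) = 0.15 come from the slide.

```python
def rank_of(u, R, backlinks, out_count, E, c):
    """R(u) = c * sum(R(v) / N_v for v in B_u) + c * E(u)."""
    return c * sum(R[v] / out_count[v] for v in backlinks[u]) + c * E[u]

# Hypothetical numbers: u is linked from v1 (2 out-links) and v2 (1 out-link).
R = {"v1": 0.4, "v2": 0.2}
backlinks = {"u": ["v1", "v2"]}
out_count = {"v1": 2, "v2": 1}
E = {"u": 0.15}
rank_u = rank_of("u", R, backlinks, out_count, E, c=0.8)
```

Here v1 passes on half of its rank (0.2) and v2 all of its rank (0.2), so the link term is 0.8 × 0.4 and the escape term is 0.8 × 0.15, giving roughly 0.44.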

25
Problem 1 - Rank Sink
  • A rank sink is a cycle of pages that is pointed to
    by some incoming link but has no links leaving
    the cycle.
  • The loop will accumulate rank but never
    distribute it.

26
Problem 2 - Dangling Links
  • Many Web pages have no forward links; links that
    point to such pages are dangling links.
  • Dangling links do not affect the ranking of any
    other page directly, so they are removed until
    all the PageRanks are calculated.
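
One way to sketch the removal step is shown below. It treats a page with no forward links as the target of dangling links, and it repeats the removal because deleting one such page can leave another without forward links; the repetition, the function name `remove_dangling`, and the sample graph are assumptions of this sketch.

```python
def remove_dangling(forward):
    """Repeatedly drop pages with no forward links.

    Returns the pruned graph and the removed pages in removal order,
    so they can be added back once the ranks are computed.
    """
    graph = {p: set(targets) for p, targets in forward.items()}
    for targets in forward.values():
        for t in targets:
            graph.setdefault(t, set())  # pages seen only as link targets
    removed = []
    while True:
        dangling = {p for p, targets in graph.items() if not targets}
        if not dangling:
            break
        removed.extend(sorted(dangling))
        graph = {p: targets - dangling
                 for p, targets in graph.items() if p not in dangling}
    return graph, removed

# Hypothetical graph: d has no forward links; c only links to d.
forward = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
pruned, removed = remove_dangling(forward)
```

Note that a graph with no cycles would be pruned away entirely by this loop, so a practical version would likely limit the number of passes.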

27
Random Surfer Model
  • PageRank corresponds to the probability
    distribution of a random walk on the web graph.

28
Solution: Escape Term
  • The escape term E(u) can be thought of as the
    random surfer getting bored periodically and
    jumping to a different page, instead of staying
    in the loop forever.
  • E is a vector over all the web pages that
    accounts for each page's escape probability (a
    user-defined parameter).

29
PageRank Computation

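R_0 ← S                              - initialize vector over web pages
Loop:
    R_{i+1} ← A R_i                  - new ranks: sum of normalized backlink ranks
    d ← ||R_i||_1 − ||R_{i+1}||_1    - compute normalizing factor
    R_{i+1} ← R_{i+1} + d E          - add escape term
    δ ← ||R_{i+1} − R_i||_1          - control parameter
While δ > ε                          - stop when converged
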
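The annotated loop can be sketched in Python as follows. This is a hedged illustration rather than the presenter's implementation: the function name `pagerank`, the parameters `c`, `epsilon`, and `max_iter`, the choice to start from E, and the three-page graph are all assumptions, and applying the factor c inside the link sum is one common way to make the escape term take effect.

```python
def pagerank(forward, E, c=0.85, epsilon=1e-8, max_iter=1000):
    """Iterate ranks until the L1 change drops below epsilon.

    forward[v] lists the pages v links to; E maps every page to its
    escape probability and is assumed to sum to 1.
    """
    pages = list(E)
    R = dict(E)  # initialize vector over web pages (assumed start: E)
    for _ in range(max_iter):
        # new ranks: sum of normalized backlink ranks, scaled by c
        new = {p: 0.0 for p in pages}
        for v, targets in forward.items():
            for u in targets:
                new[u] += c * R[v] / len(targets)
        # normalizing factor: rank mass lost in this step
        d = sum(R.values()) - sum(new.values())
        # add escape term
        for p in pages:
            new[p] += d * E[p]
        # control parameter: L1 distance between successive vectors
        delta = sum(abs(new[p] - R[p]) for p in pages)
        R = new
        if delta <= epsilon:  # stop when converged
            break
    return R

# Hypothetical three-page graph.
forward = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
E = {p: 1.0 / 3 for p in "ABC"}
ranks = pagerank(forward, E)
```

Because the lost mass d is handed back through E, the ranks keep summing to 1, and pages without forward links are handled without special cases.
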
30
Matrices
  • A is a square matrix whose rows and columns
    correspond to the web pages u and v.
  • Given the matrix A and a vector R over all the
    Web pages, R = cAR, so R is an eigenvector of A;
    PageRank is the dominant eigenvector, the one
    associated with the maximal eigenvalue.
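
A small sketch of the matrix view follows. It assumes the usual definition A[u][v] = 1/N_u when page u links to page v (the slide does not spell this out), and the names `link_matrix`, `transpose_times`, and the three-page graph are hypothetical.

```python
def link_matrix(pages, forward):
    """Build A with A[u][v] = 1/N_u if u links to v, else 0."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for u, targets in forward.items():
        for v in targets:
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A

def transpose_times(A, R):
    """Compute A^T R: each page collects rank from its back-links."""
    n = len(R)
    return [sum(A[u][v] * R[u] for u in range(n)) for v in range(n)]

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
pages = ["A", "B", "C"]
A = link_matrix(pages, {"A": ["B", "C"], "B": ["C"], "C": ["A"]})
R = [0.4, 0.2, 0.4]
next_R = transpose_times(A, R)
```

For this graph the vector [0.4, 0.2, 0.4] is left unchanged by the multiplication, which is what being an eigenvector (here with eigenvalue 1) means.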

31
Example
[Figure: example matrix A^T]

32
Example (cont.)
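  • $R = cAR = MR$, where c is the eigenvalue and R is
    an eigenvector of A.
  • $Ax = \lambda x \;\Rightarrow\; (A - \lambda I)x = 0$
[Figure: numeric matrix A and the resulting normalized vector R]
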
33
Implementation
  • 1. Map each URL to an id (URL -> id).
  • 2. Store each hyperlink in a database.
  • 3. Sort the link structure by parent id.
  • 4. Remove dangling links.
  • 5. Initialize the calculation by giving each page
    an initial PR value.
  • 6. Iterate until convergence.
  • 7. Add the dangling links back.
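
Steps 1 to 3 can be sketched as below. An in-memory list stands in for the database, and the function name `assign_ids` and the example.com-style URLs are hypothetical.

```python
def assign_ids(links):
    """Map URLs to integer ids and return links sorted by parent id.

    links is an iterable of (source_url, target_url) pairs.
    """
    ids = {}
    edges = []
    for src, dst in links:
        for url in (src, dst):
            ids.setdefault(url, len(ids))  # step 1: URL -> id
        edges.append((ids[src], ids[dst]))  # step 2: store the hyperlink
    edges.sort()  # step 3: sort by parent (source) id
    return ids, edges

# Hypothetical hyperlinks.
links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
]
ids, edges = assign_ids(links)
```

Sorting by parent id keeps all out-links of a page adjacent, so the out-link count N_v can be read off in a single pass during the rank iteration.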

34
Example
Which of these three has the highest page rank?
[Figure: link graph over Page A, Page B, and Page C]

35
Example (cont.)
[Figure: link graph over Page A, Page B, and Page C]

36
Example (cont.)
  • Rewrite the system of equations as a
    matrix-vector product.
  • The PageRank vector is simply an eigenvector
    (scalar × vector = matrix × vector) of the
    coefficient matrix.

37
Example (cont.)
[Figure: Pages A, B, and C labeled with their PageRank values; two pages have PageRank 0.4 and one has PageRank 0.2]

38
Example (cont.)
[Table: PR(A), PR(B), and PR(C) at iterations 0 through 12, with d = 0.5]

39
Convergence
  • PageRank converges in roughly O(log |V|)
    iterations, where |V| is the number of pages.

40
Other Applications
  • Help user decide if a site is trustworthy.
  • Estimate web traffic.
  • Spam detection and prevention.
  • Predict citation counts.

41
Issues
  • Users are not random walkers.
  • Starting point distribution (actual usage data as
    starting vector).
  • Bias towards main pages.
  • Linkage spam.
  • No query specific rank.

42
PageRank vs. HITS
  • HITS (CLEVER)
      • performed on the set of retrieved web pages
        for each query
      • computes authorities and hubs
      • easy to compute, but real-time execution is
        hard
  • PageRank (Google)
      • computed for all web pages stored in the
        database prior to the query
      • computes authorities only
      • trivial and fast to compute

43
References
  • "Authoritative Sources in a Hyperlinked
    Environment", Jon Kleinberg, Cornell University,
    1998.
  • "The PageRank Citation Ranking: Bringing Order to
    the Web", Lawrence Page and Sergey Brin, Stanford
    University, 1998.