Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg - PowerPoint PPT Presentation

Slides: 28
Provided by: PKRE
Learn more at: https://crystal.uta.edu
Transcript and Presenter's Notes


1
Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
  • Presented By
  • Lekhendro

2
Outline
  • Introduction
  • Constructing focused Subgraph
  • Computing Hubs and Authorities
  • Conclusion

3
Introduction
  • How can the quality of search on the WWW be improved?
  • Judging the quality of search requires human
    evaluation, due to the subjectivity inherent in
    notions such as relevance.
  • The quality of search results is orthogonal to
    storage concerns.

4
Queries and Authoritative Sources
  • Types of queries
  • Specific queries, e.g. "Does Netscape support the
    JDK 1.1 code-signing API?"
  • Broad-topic queries, e.g. "Find information about
    the Java programming language."
  • Handling specific queries is difficult.
  • Scarcity problem: few pages contain the required
    information, and it is difficult to determine the
    identity of those pages.
  • For broad-topic queries, there are sometimes
    thousands of relevant pages.
  • Abundance problem: the number of pages that could
    reasonably be returned as relevant is far too
    large for a human user to digest.
  • One needs a way to filter a small set of the most
    authoritative or definitive pages out of a huge
    collection of relevant pages.

5
Limitations of text based analysis
  • Text-based ranking function
  • E.g. for the query "harvard", www.harvard.edu is
    the natural authoritative page, but many other
    web pages contain the term "harvard" more often.
  • The most popular pages are often not sufficiently
    self-descriptive.
  • The term "search engine" usually doesn't appear
    on the home pages of search engines such as
    Yahoo, AltaVista, or Excite.
  • The Honda and Toyota home pages hardly contain
    the term "automobile manufacturer".

6
Analysis of link structure
  • Hyperlinks encode a latent human judgment which
    can be used to formulate a notion of authority.
  • Creation of a link represents a concrete
    indication of the following type of judgment:
  • The creator of page p, by including a link to
    page q, has in some measure conferred authority
    on q.
  • This gives the user an opportunity to find
    potential authorities purely through the pages
    that point to them.
  • In this paper a link-based model for the
    conferral of authority has been proposed.
  • It has been shown that the proposed method
    consistently identifies relevant authoritative
    web pages for broad search topics.
  • However, this concept has pitfalls.
  • Most links are created for navigational purposes.
  • It is difficult to balance relevance against
    popularity.

7
Authorities and Hubs
  • Authorities are pages that are recognized as
    providing significant, trustworthy, and useful
    information on a topic.
  • Hubs are index pages that provide lots of useful
    links to relevant content pages (topic
    authorities).
  • In-degree: the number of links pointing to a
    page; one simple measure of authority.
  • Out-degree: the number of links from a page to
    other pages.
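
As a rough illustration of the two degree measures above, the following sketch counts in-degree and out-degree from a list of links. The page names and the `degrees` helper are hypothetical and are not part of the original slides.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) Counters for a list of (source, target) links."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree

# Hypothetical four-page web: three pages link to "d", and "a" also links to "b".
edges = [("a", "d"), ("b", "d"), ("c", "d"), ("a", "b")]
in_degree, out_degree = degrees(edges)
print(in_degree["d"], out_degree["a"])  # 3 2
```

Here "d" has the highest in-degree, which would make it the top authority under the simple in-degree measure.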

8
Overview
  • Discover authoritative WWW sources globally.
  • Determine hubs and authorities on a particular
    topic through analysis of a relevant sub-graph of
    the web.
  • Given a keyword query, assign a hub value and an
    authority value to each page.
  • Pages with high authority values are returned as
    the results of the query.

9
Hubs and Authorities
  • Mutually reinforcing relationship
  • Hubs point to lots of authorities.
  • Authorities are pointed to by lots of hubs.
  • A good hub is a page that points to many good
    authorities.
  • A good authority is a page pointed to by many
    good hubs.

10
Constructing a focused subgraph of WWW
  • Terms
  • A collection of hyperlinked pages can be viewed
    as a directed graph $G = (V, E)$: nodes correspond
    to pages, and a directed edge $(p, q) \in E$
    indicates the presence of a link from p to q.
  • Given a query string $\sigma$, determine the
    sub-graph $G_\sigma$ of the WWW.
  • The graph may include all the pages containing
    the query string.
  • This approach has the following drawbacks.
  • The set may contain millions of pages.
  • The best authorities may not belong to this set.
  • Focus instead on a set $S_\sigma$ of pages with
    the following properties.
  • (1) $S_\sigma$ is relatively small.
  • (2) $S_\sigma$ is rich in relevant pages.
  • (3) $S_\sigma$ contains most of the strongest
    authorities.

11
Hubs and Authorities
  • Together they tend to form a bipartite graph

[Figure: bipartite graph of Hubs and Authorities]

12
Root Set and Base Set
  • Collect a root set $R_\sigma$ of the top-ranked
    pages for the query from a text-based search
    engine (AltaVista).
  • $R_\sigma$ satisfies properties 1 and 2 (small,
    rich in relevant pages) but may not satisfy
    property 3 (containing most of the strongest
    authorities).
  • Every page in $R_\sigma$ contains the query
    string, hence $R_\sigma$ is a subset of
    $Q_\sigma$, the set of all pages containing the
    query string.
  • A strong authority on the query topic, although
    it may not be in the root set, is quite likely to
    be pointed to by at least one page in the root
    set.
  • The number of authorities can be increased by
    expanding the root set along the links that enter
    and leave it.

13
Root Set and Base Set (Contd)
  • Expand the root set into the base set by
    including (up to a designated size cut-off):
  • all pages linked to by pages in the root set
  • all pages that link to a page in the root set
  • A typical base set contains roughly 1000-5000
    pages.

[Figure: the Root Set contained inside the larger Base Set]

14
Subgraph construction algorithm
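The slide content for this algorithm did not survive in the transcript. The root-to-base expansion described on the previous slides can be sketched roughly as follows. This is a minimal sketch, not the presenter's code: the function name, the `out_links` / `in_links` dictionaries, and the choice to apply the cut-off `d` to the in-linking pages of each root page are assumptions made for illustration.

```python
def build_base_set(root_set, out_links, in_links, d=50):
    """Expand a root set into a base set.

    root_set  : iterable of page ids (top-ranked results of a text search)
    out_links : dict mapping a page to the pages it links to
    in_links  : dict mapping a page to the pages that link to it
    d         : assumed cut-off on in-linking pages added per root page
    """
    base_set = set(root_set)
    for page in root_set:
        # Add every page that the root page points to.
        base_set.update(out_links.get(page, ()))
        # Add at most d pages that point to the root page
        # (sorted only to make the choice deterministic).
        pointing_in = sorted(in_links.get(page, ()))
        base_set.update(pointing_in[:d])
    return base_set

# Hypothetical link data for two root pages.
root = ["r1", "r2"]
out_links = {"r1": ["x"], "r2": ["x", "y"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = build_base_set(root, out_links, in_links, d=2)
print(sorted(base))  # ['h1', 'h2', 'r1', 'r2', 'x', 'y']
```

The cut-off keeps a single very popular root page from flooding the base set with its in-links.
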
15
Heuristic
  • Two types of links.
  • Transverse: a link between pages with different
    domain names.
  • Intrinsic: a link between pages with the same
    domain name.
  • Delete all intrinsic links:
  • Most of them are for navigation purposes.
  • They are less informative or repeat information.
  • Alternatively, allow at most m (about 4 to 8)
    pages from a single domain to point to any given
    page.
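
The link-pruning heuristic above can be sketched as a simple filter. This sketch assumes that "domain name" means the host part of the URL as returned by `urllib.parse.urlparse`; the helper names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_intrinsic(source_url, target_url):
    """True if both URLs share the same host name (treated here as the domain name)."""
    return urlparse(source_url).hostname == urlparse(target_url).hostname

def keep_transverse(links):
    """Drop intrinsic links and keep only links that cross domains."""
    return [(s, t) for s, t in links if not is_intrinsic(s, t)]

# Hypothetical links: the first is navigational (same host), the second crosses hosts.
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://a.example/index.html", "http://b.example/"),
]
print(keep_transverse(links))
# [('http://a.example/index.html', 'http://b.example/')]
```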

16
Authorities and In-Degree
  • Even within the base set S for a given query, the
    nodes with highest in-degree are not necessarily
    authorities (may just be generally popular pages
    like Yahoo or Amazon).
  • True authority pages are pointed to by a number
    of hubs (i.e. pages that point to lots of
    authorities).

17
Iterative Algorithm
  • For each page $p \in S$, maintain:
  • Authority score $a_p$ (vector $a$)
  • Hub score $h_p$ (vector $h$)
  • Initialize all $a_p = h_p = 1$
  • Maintain normalized scores:
    $\sum_{p \in S} a_p^2 = 1$ and
    $\sum_{p \in S} h_p^2 = 1$

18
Computing Hubs and authorities
[Figure: page p with neighbors v1, v2, v3; one panel labeled with hub scores h(v1), h(v2), h(v3), the other with authority scores a(v1), a(v2), a(v3)]

19
Hubs and authorities computation (contd)
  • Authorities are pointed to by lots of good hubs
  • Hubs point to lots of good authorities

20
Iterative Algorithm
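  • Initialize: for all $p \in S$, $a_p = h_p = 1$
  • For $i = 1$ to $k$:
  • For all $p \in S$: $a_p = \sum_{q : q \to p} h_q$
    (update auth. scores)
  • For all $p \in S$: $h_p = \sum_{q : p \to q} a_q$
    (update hub scores)
  • For all $p \in S$: $a_p = a_p / c$, with $c$ such
    that $\sum_{p \in S} (a_p / c)^2 = 1$
    (normalize a)
  • For all $p \in S$: $h_p = h_p / c$, with $c$ such
    that $\sum_{p \in S} (h_p / c)^2 = 1$
    (normalize h)
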
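
The iterative algorithm on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the graph is given as a list of (p, q) pairs meaning "p links to q", the function names are hypothetical, and the four-page example graph is invented for illustration.

```python
import math

def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    c = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / c if c else 0.0) for p, v in scores.items()}

def hits(pages, links, k=20):
    """Run k rounds of authority and hub updates over the base set."""
    pages = list(pages)
    links = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of the pages pointing to q.
        new_auth = {p: 0.0 for p in pages}
        for p, q in links:
            new_auth[q] += hub[p]
        # Hub update: sum the updated authority scores of the pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for p, q in links:
            new_hub[p] += new_auth[q]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
pages = ["h1", "h2", "a1", "a2"]
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
auth, hub = hits(pages, links)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # a1 h1
```

In this toy graph a1 is pointed to by both hubs and therefore ends up with the highest authority score, while h1 points to both authorities and ends up as the best hub.
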
21
Example Mini Web
     


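[Figure: mini web of three pages X, Y, and Z; the link structure is not preserved in the transcript]
  • $A_i = M^T H_{i-1}$
  • $H_i = M A_{i-1}$
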
22
Example
 

[Table: scores of pages X, Y, and Z at iterations 0, 1, 2, and 3; values not preserved in the transcript]

23
Convergence
  • The algorithm converges to a fixed point if
    iterated indefinitely.
  • Define $A$ to be the adjacency matrix for the
    subgraph defined by $S$.
  • $A_{ij} = 1$ for $i \in S$, $j \in S$ iff
    $i \to j$ (page i links to page j); otherwise
    $A_{ij} = 0$.
  • The authority vector $a$ converges to the
    principal eigenvector of $A^T A$.
  • The hub vector $h$ converges to the principal
    eigenvector of $A A^T$.
  • In practice, 20 iterations produce fairly stable
    results.
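
The convergence claim can be checked numerically on a small example. The sketch below runs the matrix form of the updates on a hypothetical three-page adjacency matrix and then tests whether the resulting authority vector behaves like an eigenvector of A^T A; the matrix and helper names are assumptions made for illustration.

```python
import math

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical 3-page graph: A[i][j] = 1 iff page i links to page j.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
At = transpose(A)

a = [1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0]
for _ in range(50):
    a = unit(mat_vec(At, h))  # a <- A^T h
    h = unit(mat_vec(A, a))   # h <- A a

# If a is an eigenvector of A^T A, then (A^T A) a is a scalar multiple of a.
AtA_a = mat_vec(At, mat_vec(A, a))
eigenvalue = sum(x * y for x, y in zip(AtA_a, a))  # Rayleigh quotient, since |a| = 1
residual = max(abs(x - eigenvalue * y) for x, y in zip(AtA_a, a))
print(round(eigenvalue, 3), residual < 1e-9)  # 2.618 True
```

For this matrix the largest eigenvalue of A^T A is (3 + sqrt(5)) / 2, about 2.618, which matches the value recovered from the iterated vector.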

24
Results
  • Authorities for query "Java"
  • java.sun.com
  • comp.lang.java FAQ
  • Authorities for query "search engine"
  • Yahoo.com
  • Excite.com
  • Lycos.com
  • Altavista.com
  • Authorities for query "Gates"
  • Microsoft.com
  • roadahead.com

25
Conclusions
  • A technique for locating high-quality information
    related to a broad search topic, based on link
    analysis.
  • It is performed on the set of web pages retrieved
    for each query.
  • It computes authorities and hubs.
  • No indexing is needed; only an interface to
    existing search engines is needed.
  • IBM expanded HITS into CLEVER, but it was not
    seen as a viable search engine (computing the
    scores in real time is hard).

26
Basic knowledge of Matrix
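  • $M$: a symmetric $n \times n$ matrix; $\omega$: a
    vector; $\lambda$: a number.
  • If for some vector $\omega$,
    $M \omega = \lambda \omega$, we say $\lambda$ is
    an eigenvalue of $M$.
  • The set of all such $\omega$ is a subspace of
    $R^n$: the eigenspace associated with $\lambda$.
  • $\lambda_1(M), \lambda_2(M), \ldots$ are the
    eigenvalues, while
    $\omega_1(M), \omega_2(M), \ldots$ are the
    eigenvectors.
  • $\omega_i(M)$ belongs to the eigenspace of
    $\lambda_i(M)$.
  • If we assume $|\lambda_1(M)| > |\lambda_2(M)|$,
    we refer to $\omega_1(M)$ as the principal
    eigenvector, and all other $\omega_i(M)$ as
    non-principal eigenvectors.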
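
To make the definitions above concrete, the following sketch checks the eigenvalue equation for a small symmetric matrix. The 2 x 2 matrix and the helper names are hypothetical examples chosen so that the eigenpairs are easy to verify by hand.

```python
import math

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def is_eigenpair(M, lam, omega, tol=1e-12):
    """Check M * omega == lam * omega component by component."""
    return all(abs(lhs - lam * w) <= tol
               for lhs, w in zip(mat_vec(M, omega), omega))

# Hypothetical symmetric matrix with eigenvalues 3 and 1.
M = [[2.0, 1.0],
     [1.0, 2.0]]

s = 1 / math.sqrt(2)
pairs = [
    (3.0, [s, s]),   # largest eigenvalue with the principal eigenvector
    (1.0, [s, -s]),  # second eigenvalue with a non-principal eigenvector
]
checks = [is_eigenpair(M, lam, omega) for lam, omega in pairs]
print(checks)  # [True, True]
```

Since 3 is strictly larger than 1 in absolute value, the vector (s, s) is the principal eigenvector of this matrix.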

27
Convergence Proof of Iterate Procedure
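  • Theorem 1. The sequences $x_1, x_2, x_3, \ldots$
    and $y_1, y_2, y_3, \ldots$ converge to $x^*$ and
    $y^*$ respectively.
  • Proof: $G = (V, E)$,
    $V = \{p_1, p_2, \ldots, p_n\}$
  • $A$ is the adjacency matrix of graph $G$:
    $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of $G$,
    and 0 otherwise.
  • The I and O operations can be written as
  • $x \leftarrow A^T y$, $y \leftarrow A x$,
    repeated for $k$ loops,
  • So $x_1 = A^T z$ and
    $x_k \leftarrow A^T A \, x_{k-1}$, where $z$ is
    the initial all-ones vector.
  • Hence $x_k \propto (A^T A)^{k-1} A^T z$, and
    $x_k \to x^*$.
  • Likewise $y_k \propto (A A^T)^k z$, and
    $y_k \to y^*$.
  • If $v$ is a vector not orthogonal to the principal
    eigenvector $\omega_1(M)$, the unit vector in the
    direction of $M^k v$ converges to $\omega_1(M)$
    as $k$ increases without bound.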
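
The closed forms in the proof can be seen by unrolling the first few updates. The derivation below is a sketch that ignores the normalization constants (so $\propto$ means "in the direction of") and assumes $z$ is the initial all-ones vector used for $y_0$.

```latex
\begin{align*}
x_1 &\propto A^T z, \\
y_1 &\propto A x_1 \propto A A^T z, \\
x_2 &\propto A^T y_1 \propto (A^T A) A^T z, \\
y_2 &\propto A x_2 \propto (A A^T)^2 z, \\
    &\;\;\vdots \\
x_k &\propto (A^T A)^{k-1} A^T z, \qquad y_k \propto (A A^T)^k z.
\end{align*}
```

Both $A^T A$ and $A A^T$ are symmetric, so the statement about powers of a symmetric matrix $M$ applies to each sequence.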

28
Convergence Proof of Iterate Procedure (cont.)
  • $A$ is called an orthogonal matrix if
    $A A^T = A^T A = E$, where $E$ is the identity
    matrix.
  • Theorem 2. The limit $x^*$ is the principal
    eigenvector of $A^T A$, and the limit $y^*$ is
    the principal eigenvector of $A A^T$.
  • Experiments find that $k = 20$ is sufficient for
    the convergence of the vectors.

29
Reference
  • http://crystal.uta.edu/gdas/Courses/websitepages/spring06DBIR.htm
  • http://www.iiit.net/pkreddy