Title: Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg
1Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
2Outline
- Introduction
- Constructing a Focused Subgraph
- Computing Hubs and Authorities
- Conclusion
3Introduction
- How can the quality of search on the WWW be improved?
- Judging the quality of search requires human evaluation, due to the subjectivity inherent in notions such as relevance.
- The quality of search results and the storage of pages are orthogonal issues.
4Queries and Authoritative Sources
- Types of queries
- Specific queries, e.g. "Does Netscape support the JDK 1.1 code-signing API?"
- Broad-topic queries, e.g. "Find information about the Java programming language."
- Handling specific queries is difficult.
- Scarcity problem: few pages contain the required information, and it is difficult to determine the identity of those pages.
- For broad-topic queries, there are sometimes thousands of relevant pages.
- Abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest.
- One needs a way to filter a small set of the most authoritative or definitive pages from a huge collection of relevant pages.
5Limitations of text-based analysis
- Text-based ranking functions
- E.g. for the query "harvard", www.harvard.edu is the natural authoritative page, but many other web pages may contain the term "harvard" more often.
- The most popular pages are often not sufficiently self-descriptive.
- The term "search engine" usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite.
- The home pages of Honda or Toyota hardly contain the term "automobile manufacturer".
6Analysis of link structure
- Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority.
- The creation of a link represents a concrete indication of the following type of judgment:
- The creator of page p, by including a link to page q, has in some measure conferred authority on q.
- This gives the user an opportunity to find potential authorities purely through the pages that point to them.
- The paper proposes a link-based model for the conferral of authority.
- The proposed method is shown to consistently identify relevant, authoritative web pages for broad search topics.
- However, this concept has pitfalls.
- Most links are created for navigational purposes.
- It is difficult to strike the right balance between relevance and popularity.
7Authorities and Hubs
- Authorities are pages recognized as providing significant, trustworthy, and useful information on a topic.
- Hubs are index pages that provide many useful links to relevant content pages (topic authorities).
- In-degree: the number of links pointing to a page; it is one simple measure of authority.
- Out-degree: the number of links from a page to other pages.
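The two degree measures can be illustrated with a minimal sketch. The page names and the `degrees` helper below are hypothetical, chosen only for illustration; links are assumed to be given as (source, target) pairs.

```python
from collections import defaultdict

def degrees(links):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in set(links):  # count each distinct link once
        out_degree[source] += 1
        in_degree[target] += 1
    return dict(in_degree), dict(out_degree)

# Hypothetical toy web: two index pages pointing at three content pages.
links = [
    ("hub1", "pageA"), ("hub1", "pageB"),
    ("hub2", "pageA"), ("hub2", "pageB"), ("hub2", "pageC"),
]
in_degree, out_degree = degrees(links)
```

In this toy graph the content pages collect in-degree while the index pages collect out-degree, which is the raw signal the hub and authority scores later refine.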
8Overview
- Discover authoritative WWW sources globally.
- Determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
- Given a keyword query, assign a hub value and an authority value to each page.
- Pages with high authority values are returned as the results of the query.
9Hubs and Authorities
- Mutually reinforcing relationship
- Hubs point to lots of authorities.
- Authorities are pointed to by lots of hubs.
- A good hub is a page that points to many good authorities.
- A good authority is a page pointed to by many good hubs.
10Constructing a focused subgraph of WWW
- Terms
- A collection of hyperlinked pages can be viewed as a directed graph $G = (V, E)$: nodes correspond to pages, and a directed edge $(p, q) \in E$ indicates the presence of a link from $p$ to $q$.
- Given a query string $\sigma$, determine the subgraph $G_\sigma$ of the WWW.
- The graph could include all the pages containing the query string.
- This approach has the following drawbacks.
- The set may contain millions of pages.
- The best authorities may not belong to this set.
- Instead, the focus is on a set of pages $S_\sigma$ with the following properties.
- (1) $S_\sigma$ is relatively small.
- (2) $S_\sigma$ is rich in relevant pages.
- (3) $S_\sigma$ contains most of the strongest authorities.
11Hubs and Authorities
- Together they tend to form a bipartite graph.
[Figure: bipartite graph of hubs and authorities]
12Root Set and Base Set
- Collect a root set $R_\sigma$ of the top-ranked pages for the query, using a text-based search engine (AltaVista).
- $R_\sigma$ satisfies properties 1 and 2 (it is small and rich in relevant pages) but may not satisfy property 3 (containing most of the strongest authorities).
- Every page in $R_\sigma$ contains the query string, so $R_\sigma$ is a subset of $Q_\sigma$, the set of all pages containing the query.
- A strong authority for the query topic, although it may not be in the root set, is quite likely to be pointed to by at least one page in the root set.
- The number of strong authorities in the subgraph can be increased by expanding the root set along the links that enter and leave it.
13Root Set and Base Set (Contd)
- Expand the root set into the base set by including (up to a designated size cut-off):
- all pages linked to by pages in the root set
- all pages that link to a page in the root set
- A typical base set contains roughly 1000-5000 pages.
[Figure: the root set nested inside the larger base set]
14Subgraph construction algorithm
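The construction described on the preceding slides (start from a root set, then add pages it links to and a bounded number of pages linking into it) might be sketched as follows. This is only an illustration under stated assumptions: the link structure is a small in-memory dictionary rather than a crawl, the function names are hypothetical, and the default cut-off `d=50` is an assumed value.

```python
def invert(out_links):
    """Build page -> list of pages linking to it from page -> list of targets."""
    in_links = {}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(source)
    return in_links

def build_base_set(root_set, out_links, in_links, d=50):
    """Expand a root set into a base set.

    d caps how many in-linking pages are added per root page (assumed cut-off).
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))      # pages the root page points to
        base.update(in_links.get(page, [])[:d])   # at most d pages pointing to it
    return base

def induced_edges(base, out_links):
    """Keep only the links whose two endpoints both lie in the base set."""
    return {(p, q) for p in base for q in out_links.get(p, []) if q in base}

# Hypothetical toy link structure.
out_links = {
    "r1": ["a1"], "r2": ["a1", "a2"],
    "h1": ["r1", "a1"], "h2": ["r1"], "h3": ["r1"],
}
in_links = invert(out_links)
base = build_base_set(["r1", "r2"], out_links, in_links, d=2)
edges = induced_edges(base, out_links)
```

With `d=2`, only two of the three pages linking to `r1` are admitted, which is how the cut-off keeps a very popular root page from flooding the base set.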
15Heuristic
- Two types of links.
- Transverse: a link between pages with different domain names.
- Intrinsic: a link between pages with the same domain name.
- Delete all intrinsic links.
- Most of them exist for navigation purposes.
- They are less informative or merely repeat information.
- Or keep links from at most m (about 4 to 8) pages of the same domain.
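A minimal sketch of the intrinsic-link filter, assuming links are (source URL, target URL) pairs. Comparing full host names is a simplifying assumption made here; the slides only say "domain name". The URLs are placeholders.

```python
from urllib.parse import urlsplit

def domain(url):
    """Host name of a URL, lower-cased; empty string if it cannot be parsed."""
    return (urlsplit(url).hostname or "").lower()

def drop_intrinsic(links):
    """Keep only transverse links, i.e. links between different host names."""
    return [(s, t) for s, t in links if domain(s) != domain(t)]

# Placeholder URLs for illustration.
links = [
    ("http://www.example.com/index.html", "http://www.example.com/about.html"),
    ("http://www.example.com/links.html", "http://www.example.org/"),
]
transverse = drop_intrinsic(links)
```

Only the second link survives, since the first connects two pages on the same host and is therefore treated as navigational.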
16Authorities and In-Degree
- Even within the base set $S$ for a given query, the nodes with the highest in-degree are not necessarily authorities (they may just be generally popular pages such as Yahoo or Amazon).
- True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).
17Iterative Algorithm
- For each page $p \in S$ maintain:
- Authority score $a_p$ (vector $a$)
- Hub score $h_p$ (vector $h$)
- Initialize all $a_p = h_p = 1$.
- Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$.
18Computing Hubs and authorities
[Figure: two panels showing page p with neighboring pages v1, v2, v3, labeled with hub scores h(v1), h(v2), h(v3) and authority scores a(v1), a(v2), a(v3)]
19Hubs and authorities computation (contd)
- Authorities are pointed to by lots of good hubs: $a_p = \sum_{q:\, q \to p} h_q$
- Hubs point to lots of good authorities: $h_p = \sum_{q:\, p \to q} a_q$
20Iterative Algorithm
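- Initialize for all $p \in S$: $a_p = h_p = 1$
- For $i = 1$ to $k$:
- For all $p \in S$: $a_p = \sum_{q:\, q \to p} h_q$ (update authority scores)
- For all $p \in S$: $h_p = \sum_{q:\, p \to q} a_q$ (update hub scores)
- For all $p \in S$: $a_p = a_p / c$, with $c$ chosen so that $\sum_{p \in S} (a_p / c)^2 = 1$ (normalize $a$)
- For all $p \in S$: $h_p = h_p / c$, with $c$ chosen so that $\sum_{p \in S} (h_p / c)^2 = 1$ (normalize $h$)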
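The iterative algorithm on this slide can be sketched in a few lines. This is an illustrative implementation under assumptions, not a reference one: the graph is a hypothetical toy example given as a list of (source, target) links, and the function and variable names are chosen here.

```python
import math

def normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op for an all-zero vector)."""
    c = math.sqrt(sum(v * v for v in scores.values()))
    return scores if c == 0 else {p: v / c for p, v in scores.items()}

def hits(pages, edges, k=20):
    """Run k rounds of authority/hub updates; edges are (source, target) pairs."""
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of the pages pointing to p.
        new_a = {p: 0.0 for p in pages}
        for q, p in edges:
            new_a[p] += h[q]
        a = new_a
        # Hub update: sum the (new) authority scores of the pages p points to.
        new_h = {p: 0.0 for p in pages}
        for p, q in edges:
            new_h[p] += a[q]
        h = new_h
        a, h = normalize(a), normalize(h)
    return a, h

# Hypothetical toy graph: two hub-like pages and two authority-like pages.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
a, h = hits(pages, edges)
best_authority = max(a, key=a.get)
best_hub = max(h, key=h.get)
```

Here `a1` ends up with the highest authority score because both hub pages point to it, and `h1` is the strongest hub because it points to both authorities.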
21Example Mini Web
[Figure: example mini web with pages X, Y, and Z and link matrix M]
- $A_i = M^T H_{i-1}$
- $H_i = M A_{i-1}$
22Example
[Table: scores of pages X, Y, and Z at iterations 0, 1, 2, 3]
23Convergence
- The algorithm converges to a fixed point if iterated indefinitely.
- Define $A$ to be the adjacency matrix of the subgraph defined by $S$.
- $A_{ij} = 1$ for $i \in S$, $j \in S$ iff $i \to j$ (page $i$ links to page $j$).
- The authority vector $a$ converges to the principal eigenvector of $A^T A$.
- The hub vector $h$ converges to the principal eigenvector of $A A^T$.
- In practice, 20 iterations produce fairly stable results.
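The eigenvector claim can be checked numerically on a small case. The sketch below uses a hypothetical 3-page adjacency matrix and hand-written helpers (all names are assumptions made for this example): it repeatedly applies $A^T A$ to the authority vector, normalizes, and then measures how far the result is from satisfying $(A^T A)\,a = \lambda a$.

```python
import math

def transpose(M):
    return [list(row) for row in zip(*M)]

def mat_mul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
AtA = mat_mul(transpose(A), A)

a = [1.0, 1.0, 1.0]
for _ in range(50):
    a = unit(mat_vec(AtA, a))  # one authority update followed by one hub update

image = mat_vec(AtA, a)
lam = sum(x * y for x, y in zip(a, image))                # Rayleigh quotient
residual = max(abs(y - lam * x) for x, y in zip(a, image))
```

For this matrix the residual is negligible and `lam` matches the largest eigenvalue of $A^T A$, $(3 + \sqrt{5})/2$, so the iterated authority vector is (numerically) the principal eigenvector.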
24Results
- Authorities for the query "Java"
- java.sun.com
- comp.lang.java FAQ
- Authorities for the query "search engine"
- Yahoo.com
- Excite.com
- Lycos.com
- Altavista.com
- Authorities for the query "Gates"
- Microsoft.com
- roadahead.com
25Conclusions
- A technique for locating high-quality information related to a broad search topic, based on link analysis.
- It is run on the set of web pages retrieved for each query.
- It computes authorities and hubs.
- No indexing is needed; only an interface to existing search engines is required.
- IBM expanded HITS into CLEVER, but it was not seen as a viable search engine (the computation is hard to execute in real time).
26Basic knowledge of Matrix
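- $M$: a symmetric $n \times n$ matrix; $\omega$: a vector; $\lambda$: a number
- If $M\omega = \lambda\omega$ for some vector $\omega$, we say that $\lambda$ is an eigenvalue of $M$.
- The set of all such $\omega$ is a subspace of $\mathbb{R}^n$: the eigenspace associated with $\lambda$.
- $\lambda_1(M), \lambda_2(M), \ldots$ are the eigenvalues, while $\omega_1(M), \omega_2(M), \ldots$ are the eigenvectors.
- $\omega_i(M)$ belongs to the eigenspace of $\lambda_i(M)$.
- If we assume $|\lambda_1(M)| > |\lambda_2(M)|$, we refer to $\omega_1(M)$ as the principal eigenvector and to all other $\omega_i(M)$ as non-principal eigenvectors.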
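As a small worked instance of these definitions, consider the symmetric matrix $M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ (an example chosen here, not taken from the slides). The sketch checks the eigenvalue equation directly; the helper names are hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def is_eigenpair(M, w, lam, tol=1e-12):
    """True if M w equals lam * w up to a small tolerance."""
    return all(abs(y - lam * x) <= tol for x, y in zip(w, mat_vec(M, w)))

M = [[2.0, 1.0],
     [1.0, 2.0]]

# (1, 1) is an eigenvector with eigenvalue 3; (1, -1) is one with eigenvalue 1.
principal_ok = is_eigenpair(M, [1.0, 1.0], 3.0)
non_principal_ok = is_eigenpair(M, [1.0, -1.0], 1.0)
not_an_eigenvector = is_eigenpair(M, [1.0, 0.0], 2.0)
```

Since $|3| > |1|$, the direction $(1, 1)$ is the principal eigenvector of this matrix, while $(1, 0)$ is not an eigenvector at all.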
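27Convergence Proof of the Iterative Procedure
- Theorem 1. The sequences $x_1, x_2, x_3, \ldots$ and $y_1, y_2, y_3, \ldots$ converge to $x^*$ and $y^*$ respectively.
- Proof: $G = (V, E)$, $V = \{p_1, p_2, \ldots, p_n\}$
- $A$ is the adjacency matrix of graph $G$: $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of $G$, and $0$ otherwise.
- The $\mathcal{I}$ and $\mathcal{O}$ operations can be written as
- $x \leftarrow A^T y$, $y \leftarrow A x$, repeated for $k$ loops.
- So each loop gives $x^{(k)} \propto A^T A\, x^{(k-1)}$, starting from $x^{(1)} \propto A^T z$, where $z$ is the initial vector.
- $x^{(k)}$ is the unit vector in the direction of $(A^T A)^{k-1} A^T z$, and $x^{(k)} \to x^*$.
- $y^{(k)}$ is the unit vector in the direction of $(A A^T)^k z$, and $y^{(k)} \to y^*$.
- If $v$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, the unit vector in the direction of $M^k v$ converges to $\omega_1(M)$ as $k$ increases without bound.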
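The last statement can be justified by a short derivation. It is a sketch that assumes $M$ is symmetric with an orthonormal eigenbasis $\omega_1(M), \ldots, \omega_n(M)$ and $|\lambda_1(M)| > |\lambda_2(M)|$; the coefficients $c_i$ are notation introduced here.

```latex
\begin{align*}
v &= \sum_{i=1}^{n} c_i\, \omega_i(M), \qquad c_1 \neq 0
   \quad \text{(since $v$ is not orthogonal to $\omega_1(M)$)} \\
M^k v &= \sum_{i=1}^{n} c_i\, \lambda_i(M)^k\, \omega_i(M) \\
      &= \lambda_1(M)^k \left( c_1\, \omega_1(M)
         + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i(M)}{\lambda_1(M)} \right)^{k} \omega_i(M) \right) \\
\left| \frac{\lambda_i(M)}{\lambda_1(M)} \right| &< 1 \ \text{ for } i \geq 2
   \quad \Longrightarrow \quad
   \left( \frac{\lambda_i(M)}{\lambda_1(M)} \right)^{k} \to 0 \ \text{ as } k \to \infty
\end{align*}
```

After normalization only the $\omega_1(M)$ term survives, so the unit vector in the direction of $M^k v$ converges to $\omega_1(M)$ up to sign. For $M = A^T A$ or $M = A A^T$, which are positive semidefinite, $\lambda_1(M) > 0$ and the sign stays that of $c_1$ across iterations.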
28Convergence Proof of the Iterative Procedure (cont.)
- $A$ is called an orthogonal matrix if $A A^T = A^T A = E$, where $E$ is the identity matrix.
- Theorem 2. $x^*$ is the principal eigenvector of $A^T A$, and $y^*$ is the principal eigenvector of $A A^T$.
- Experiments find that $k = 20$ is sufficient for the convergence of the vectors.