Title: Authoritative Sources in a Hyperlinked Environment
1Authoritative Sources in a Hyperlinked Environment
- Hui Han
- CSE dept, PSU
- 10/15/01
2Main Idea
- Return more authoritative information, using the
algorithm based on link structure analysis, esp.
the relationship between a set of relevant
authoritative pages and the set of hub pages
that join them together in the link structure
3The types of queries
- Specific queries scarcity problem
- Does Netscape support the JDK 1.1 code-signing
API - Broad-topic queries abundance problem
- Find information about the Java programming
language. - Solution find a small set of the most
authoritative relevant pages. - Similar-page queries
- Find pages similar to java.sun.com
4Linked Based Analysis
- Limitations of text based analysis
- Text-based ranking function
- Eg. Could www.harvard.edu be recognized as one of
the most authoritative pages, since many other
webpages contain harvard more often. - Pages are not sufficiently self descriptive
- Usually the term search engine doesnt appear
on search engine web pages
5Limitations of Basic Link-based Analysis
Conferred Authority
P
Q
- Basic model Of all pages containing the query
string, return those with the greatest number of
in-links - Large number of links are created primarily for
navigational purposes - Difficulty in finding a balance between relevance
and popularity - Popular sites like www.yahoo.com would be
considered as highly authoritative on any query
strings it contained.
6Hub-Authority method
- A link-based model for the conferral of
authority, using the method that consistently
identifies relevant, authoritative WWW pages for
broad search topics. - Difference from Clustering
- Distinguishing pages related to different
meanings or senses of a query term
7Step1 Constructing a Focused Subgraph of the WWW
- Properties of a ideal collection S? of pages
- S? is relatively small
- S? is rich in relevant pages
- S? contains most(or many) of the strongest
authorities - Construction of S?
- Collect the t highest-ranked pages for the query
? from a text-based search engine (Altavista,
hotbot) as root set R? - Expand R? to base set S?
- a strong authority is quite likely to be pointed
to by at least one page in R? - Keep only transverse links
8Step2 Compute Hubs and Authorities
- Authoritative pages should
- have large in-degree
- have a considerable overlap in the sets of pages
that point to them, since they share a common
topic - Hub pages pull together(link to) authorities
on a common topic
9An Iterative Algorithm
- G? the subgraph induced on the pages in S?
- x(p) authority weight of Page P
- y(p) hub weight of Page P
- Normalize ?p?S? (x(p))2 1
- ?p?S? (y(p))2 1
- I operation
- x(p) ?q (q,p) ? Ey(p)
- O operation
- y(p) ?q (q,p) ? Ex(p)
- Hub and Authority exhibit
- mutually reinforcing relationship
10An Iterative Algorithm(cont.)
- x(p) is a vector x with authority weight for
each page in G? - y(p) is a vector y with hub weight for each
page in G? - Filter out the top c authorities and top c hubs
11Basic knowledge of Matrix
- M symmetric nn matrix ? vector ? a number
- If for some vector ?, M ? ? ?, we say,
- The set of all such ? is a subspace of Rn
Eigenspace associated with ? - These ?1(M), ?2(M), are eigenvalues, while
?1(M), ?2(M), are eigenvectors - ?i(M) belongs to the subspace of ?i(M)
- If we assume ?1(M) ?2(M), we refer to ?1(M)
as the principal eigenvector, and all other ?i(M)
as non-principal eigenvector.
12Convergence Proof of Iterate Procedure
- Theorem1. The sequences x1, x2, x3, and y1, y2,
y3, converge to x and y respectively. - Proof G(V,E) Vp1, p2, , pn
- A is the adjacency matrix of graph G Aij 1 if
(pi, pj) is an edge of G. - I O operations can be written as
- x ? ATy y ? Ax K loops,
- So, x (1)? AT Ax (0) x(0) AT z
- x ? x (k)? (AT A)k-1 AT z
- y ? y (k)? (AAT)k z
- if ? is a vector not orthogonal to the principle
eigenvector ?1(M), the unit vector in the
direction of Mk? converges to ?1(M) as k
increases without bound
13Convergence Proof of Iterate Procedure(cont.)
- A is called an orthogonal matrix if AAT AT A
E. - Theorem2 x is the principal eigenvector of
ATA, and y is the principal eigenvector of AAT. - Experiment finds that k20 is sufficient for the
convergence of vectors.
14Experimental results
- Most of these authorities do not contain any
occurrences of the initial query string , except
http//www.roadhead.com
15Similar-Page Queries
- The strongest authorities in the local region of
the link structure near p, can server as a
broad-topic summary of the pages related to p. - A small difference in initial request to the
search engine Find t pages pointing to p, to
get R?
16Multiple sets of Hubs and Authorities
- Why?
- The query string ? may have several very
different meanings. Eg.java - The string may arise as a term in the context of
multiple technical communities. Eg. randomized
algorithms - The string may refer to a highly polarized issue,
involving groups that are not likely to link to
one another. Eg. abortion - Idea
- The NON-principle eigenvectors of ATA and AAT
provide us with a natural way to extract
additional densely linked collections of hubs and
authorities from the base set S?.
17Multiple sets of Hubs and AuthoritiesExperimental
result1
- For the query jaguar, the strongest
collections of authoritative sources concerned
the Atari Jaguar product, the NFL football team
from Jacksonville, and the automobile.
18Multiple sets of Hubs and AuthoritiesExperimental
result2
- For the query randomized algorithms, none of
the strongest collections of hubs and authorities
are precisely on the query topic. They include
home pages of theoretical computer scientists,
compendia of mathesmatical software and pages on
wavelets.
19Multiple sets of Hubs and AuthoritiesExperimental
result3
- For the query abortion, there is a natural
separation.
20Conclusion
- A technique for locating high-quality information
related to a broad search topic on the www, based
on a structural analysis of the link topology
surrounding authoritative pages on the topic. - Related work.
- Standing, influence in social networks,
scientific citations, etc. - Hypertext and WWW rankings