Title: Authoritative Sources in a Hyperlinked Environment
1Authoritative Sources in a Hyperlinked
Environment
Author: Jon M. Kleinberg
- Presented by Gang Fang
- Slides made from [1], as well as copied and edited from [2] and [3], for Fall 2008 Course CSci 8363
- [1] HITS paper
- [2] Dr. Bill Eberle's slides, UTA
- [3] Dr. Tie-Yan Liu's slides, MSRA
2Outline
- Motivation (citation)
- The HITS Algorithm
- Extensions of HITS
- Comparison with PageRank
- Limits of Link Analysis
4Ranking for the Search Results
- Drawback of pure content-based ranking
- Specific queries (scarcity).
- "Does Netscape support the JDK 1.1 code-signing API?"
- Broad-topic queries (abundance).
- "Find information about the Java programming language"
- Relevance-based search engines may return millions of pages for a certain query
- It is definitely not possible for the user to preview all these results
- An appropriate ranking will be very helpful.
- Ranking on relevance
- Ranking on importance
6Traditional IR Ranking
- Text-based ranking function
- www.harvard.edu can hardly be recognized as one of the most authoritative pages for the query "harvard", since many other web pages contain "harvard" more often.
- The number of pages with the same relevance is still too large for the users to preview.
- Pages are not sufficiently self-descriptive
- Usually the term "search engine" doesn't appear on the web pages of search engines.
7What's More for Web Search
- In order to solve these problems
- We must leverage other information on the Web
- We must distinguish those pages with the same amount of relevance
- Link Analysis
- The web is not just a collection of pure-text documents; the hyperlinks are also very important!
- A link from page A to page B may indicate
- A is related to B, or
- A is recommending, citing, voting for or endorsing B
- Links affect the ranking of web pages and thus have commercial value.
8Web as a Graph
- Web pages as nodes of a graph.
- Links as directed edges.
[Figure: example graph whose nodes are "my page", www.uta.edu and www.google.com, connected by directed links]
Copied and edited from Bill Eberle's slides
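As a rough illustration of the graph view, pages and links can be held in two adjacency maps. The page names come from the slide's figure, but the specific links in `edges` are assumed for illustration only, and the names `out_links` and `in_links` are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list: the pages are named on the slide, but which page
# links to which is an assumption made for this sketch.
edges = [
    ("my page", "www.uta.edu"),
    ("my page", "www.google.com"),
    ("www.uta.edu", "www.google.com"),
]

out_links = defaultdict(set)  # page -> pages it links to (out-edges)
in_links = defaultdict(set)   # page -> pages linking to it (in-edges)
for src, dst in edges:
    out_links[src].add(dst)
    in_links[dst].add(src)
```

Keeping both maps makes it cheap to follow a link in either direction, which the later steps rely on.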
10Famous Link Analysis Methods
- HITS (Hyperlink-Induced Topic Search)
- "Authoritative Sources in a Hyperlinked Environment", Jon Kleinberg, Cornell University, 1998.
- PageRank
- "The PageRank Citation Ranking: Bringing Order to the Web", Lawrence Page and Sergey Brin, Stanford University, 1998.
11In the year 1996, also at Cornell!
The image retrieval community was also shifting from CH to CCV
CH: color histogram → CCV: color coherence vector
Pass and Zabih, 1996
12Motivation of Link Analysis (HITS)
- Motivation
- First search for a number of relevant pages
- Then find the smallest set of authoritative sources via ranking
- Make use of the rich linkage structure!
- Forward links (out-edges).
- Backward links (in-edges).
Copied and edited from Bill Eberle's slides
15Authorities and Hubs
- An authority is a page which has relevant information about the topic.
- A hub is a page which has a collection of links to pages about that topic.
[Figure: hub h linked to authorities a1, a2, a3, a4]
Copied and edited from Bill Eberle's slides
17Authorities and Hubs (cont.)
- Good hubs are the ones that point to good authorities.
- Good authorities are the ones that are pointed to by good hubs.
[Figure: hubs h1 to h5 linked to authorities a1 to a6]
Copied and edited from Bill Eberle's slides
18HITS: Two Steps
- First, construct a focused sub-graph of the WWW.
- Second, compute hubs and authorities from the sub-graph.
Copied and edited from Bill Eberle's slides
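19Construction of Sub-graph
[Figure: Topic → Search Engine → Root set pages; Root set → Crawler → Expanded set pages]
- The root set R is expanded into the set S, from which the sub-graph G is built.
- The expanded set adds to the root set all the children (forward link pages) and a fixed number of parents.
Copied and edited from Bill Eberle's slides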
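The expansion step can be sketched as follows. This is a minimal illustration, not the presenter's code: `build_base_set`, the dictionaries `out_links` and `in_links` (hypothetical in-memory stand-ins for a crawler), and the cap `max_parents=50` are all assumptions, and taking the first parents in sorted order is just one deterministic way of choosing "a fixed number".

```python
def build_base_set(root_set, out_links, in_links, max_parents=50):
    """Expand a root set with all children and a capped number of parents.

    out_links / in_links map a page to the pages it links to / the pages
    linking to it. Both are hypothetical stand-ins for crawler lookups.
    """
    base = set(root_set)
    for page in root_set:
        # All children: every page the root page points to.
        base.update(out_links.get(page, ()))
        # A fixed number of parents: at most max_parents pages pointing to it.
        parents = sorted(in_links.get(page, ()))
        base.update(parents[:max_parents])
    return base
```

Capping the parents keeps a single very popular root page from flooding the expanded set with its in-links.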
20Hubs and Authorities Calculation
- Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
- Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
- Repeat the following two operations (and then re-normalize a and h to have unit norm):
- a(p) = h(v1) + h(v2) + h(v3), where v1, v2, v3 are the pages pointing to p
- h(p) = a(v1) + a(v2) + a(v3), where v1, v2, v3 are the pages p points to
Copied and edited from Bill Eberle's slides
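The two operations and the re-normalization can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the function name `hits`, the dictionary graph format, and the fixed count of 20 iterations are choices made here, not taken from the slides.

```python
import math

def hits(out_links, iterations=20):
    """Return (authority, hub) weights for a graph given as page -> linked pages."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    auth = {p: 1.0 for p in nodes}
    hub = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # First operation: a(p) = sum of h(q) over pages q pointing to p.
        new_auth = {p: 0.0 for p in nodes}
        for q, targets in out_links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Second operation: h(p) = sum of a(q) over pages q that p points to.
        new_hub = {p: sum(new_auth[q] for q in out_links.get(p, ())) for p in nodes}
        # Re-normalize both vectors to unit Euclidean norm.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical toy graph: two hubs, two authorities.
example = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(example)
```

In this toy graph a1 is pointed to by both hubs and ends with the largest authority weight, while h1, which points to both authorities, ends with the largest hub weight.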
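21Example
[Figure: four-node example graph; every node starts with Hub 0.45, Authority 0.45]
Copied and edited from Bill Eberle's slides
22Example (cont.)
[Figure: the same graph after an update; the node labels, read as (hub, authority), are (0.45, 0.9), (1.35, 0.9), (0.9, 0.45) and (0.45, 0.9)]
Copied and edited from Bill Eberle's slides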
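23Iterative Update of Authority and Hubness
- Recursive dependency
- I step: $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
- O step: $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
- Here pa[v] denotes the parents of v (pages linking to v) and ch[v] its children (pages v links to)
- Normalization after each iteration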
24Convergence of Authority and Hubness - Assumptions
Copied from HITS paper
27Convergence of Authority and Hubness - Convergence
Copied from HITS paper
35Convergence of Authority and Hubness (Cont)
- Theorem 3.2 shows that one can use any eigenvector algorithm to compute the fixed points $x^*$ and $y^*$
- However, the HITS paper sticks to the exposition in terms of the Iterate procedure for two reasons.
- First, it emphasizes the underlying motivation for the approach in terms of the reinforcing I and O operations
- Second, one does not have to run the process of iterated I/O operations to convergence; one can compute weights $x^{\langle p \rangle}$ and $y^{\langle p \rangle}$ by starting from any initial vectors $x_0$ and $y_0$, and performing a fixed bounded number of I and O operations.
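For reference, the I and O operations can be written in matrix form. The notation is assumed here: A is the adjacency matrix of the sub-graph with $A_{ij} = 1$ when page i links to page j, x and y are the authority and hub weight vectors, and z is the all-ones starting vector.

```latex
\begin{align*}
x &\leftarrow A^{T} y, & y &\leftarrow A x, \\
x_k &\propto (A^{T} A)^{k-1} A^{T} z, & y_k &\propto (A A^{T})^{k} z.
\end{align*}
```

Read this way, the iteration is a power iteration on $A^{T} A$ and $A A^{T}$, so under the convergence assumptions $x^*$ and $y^*$ are their principal eigenvectors.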
37HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15 of an example graph]
Copied and edited from Tie-Yan Liu's slides
38Extensions of HITS
- Extensions of HITS
- Similar query pages
- Multiple Sets of Hubs and Authorities
39Similar query pages
- Using link structure to infer a notion of "similarity" among pages
- Suppose we have found a page p that is of interest (perhaps it is an authoritative page on a topic of interest), and
- We wish to ask the following type of question: What do users of the WWW consider to be related to p, when they create pages and hyperlinks?
- If p is a highly referenced page, we have a version of the Abundance Problem
- The surrounding link structure will implicitly represent an enormous number of independent opinions about the relation of p to other pages.
41Similar query pages
- Using the notion of hubs and authorities, we can provide an approach to the issue of page similarity, asking:
- In the local region of the link structure near p, what are the strongest authorities? Such authorities can potentially serve as a broad-topic summary of the pages related to p.
- The original HITS algorithm can be adapted to this situation with essentially no modification
- Previously, we initiated our search with a query string s; our request to the search engine was "Find t pages containing the string s."
- We now begin with a page p and pose the following request to the search engine: "Find t pages pointing to p."
45Similar query pages
- Superficially, the set of issues in working with a subgraph G_p is somewhat different from those involved in working with a subgraph defined by a query string.
- However, we find that most of the basic conclusions we drew in the previous two sections continue to apply.
- Ranking pages of G_p by their in-degrees is still not satisfactory
46Similar query pages
- To compare,
- Note the difficulties inherent in compiling such lists through text-based methods: many of these pages consist almost entirely of images, with very little text, and the text that they do contain has very little overlap.
- HITS, on the other hand, is determining, via the presence of links, what the creators of WWW pages tend to "classify" together with the given page (example: www.honda.com)
49Multiple Sets of Hubs and Authorities
- Original HITS finds the most densely linked collection of hubs and authorities in the subgraph G_s defined by a query string s.
- There are a number of settings, however, in which one may be interested in finding several densely linked collections of hubs and authorities among the same set S of pages.
- Each such collection could potentially be relevant to the query topic,
- but they could be well-separated from one another in the graph G for a variety of reasons. For example,
- The query string may have several very different meanings. E.g. "jaguar"
- The string may arise as a term in the context of multiple technical communities. E.g. "randomized algorithms".
- The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. "abortion"
54Multiple Sets of Hubs and Authorities
- In each of the above three examples, the relevant documents can be naturally grouped into several clusters.
- The issue in the setting of broad-topic queries, however, is not simply how to achieve a dissection into reasonable clusters; one must also deal with this in the presence of the Abundance Problem.
- Each cluster, in the context of the full WWW, is enormous.
- So, we require a way to distill a small set of hubs and authorities out of each one. We can thus view such collections of hubs and authorities as implicitly providing broad-topic summaries of a collection of large clusters that we never explicitly represent.
56Multiple Sets of Hubs and Authorities
- In the original HITS, the authorities and hubs we computed correspond to the principal eigenvectors of the matrices $A^T A$ and $A A^T$ respectively, where A is the adjacency matrix of G.
- The non-principal eigenvectors of $A^T A$ and $A A^T$ provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set S.
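One way to look at the non-principal eigenvectors is with a dense symmetric eigensolver. The sketch below uses NumPy's `numpy.linalg.eigh`; the function name `authority_eigenvectors` and the parameter `k` are hypothetical, and a dense solver is only practical for small base sets.

```python
import numpy as np

def authority_eigenvectors(A, k=2):
    """Return the k largest eigenvalues of A^T A and their eigenvectors."""
    A = np.asarray(A, dtype=float)
    # A^T A is symmetric, so eigh applies; it returns eigenvalues ascending.
    vals, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:k]  # indices of the k largest eigenvalues
    return vals[order], vecs[:, order]

# Hypothetical graph with two separate communities:
# pages 0 and 1 both link to pages 2 and 3; page 4 links to page 5.
A = np.zeros((6, 6))
A[0, 2] = A[0, 3] = A[1, 2] = A[1, 3] = 1.0
A[4, 5] = 1.0
vals, vecs = authority_eigenvectors(A, k=2)
```

In this toy graph the first eigenvector concentrates on pages 2 and 3 and the second on page 5. The sign of an eigenvector returned by the solver is arbitrary, so each non-principal eigenvector should be read by the coordinates of largest magnitude at both its positive and negative ends.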
57Algorithmic Outcome
- Applying iterative multiplication (power iteration) to any non-degenerate initial vector leads to the principal eigenvector.
- Hubs and authorities are the outcome of this process.
- The largest entries of the principal eigenvectors identify the top hubs and authorities.
Copied and edited from HITS paper
58Results
- Although HITS is only link-based (it completely disregards page content), results are quite good in many tested queries.
- When the authors tested the query "search engines",
- the algorithm returned Yahoo!, Excite, Magellan, Lycos, AltaVista
- However, none of these pages described themselves as a "search engine" (at the time of the experiment)
Copied and edited from HITS paper
59Issues
- Starting from a narrow topic, HITS tends to end up in a more general one.
- A peculiarity of hub pages: their many links can cause the algorithm to drift, because they can point to authorities on different topics.
- Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.
Copied and edited from HITS paper
65Possible Enhancements
- Use weighted sums for link calculation.
- Take advantage of anchor text: the text surrounding the link itself.
- Break hubs into smaller pieces. Analyze each piece separately, instead of the whole hub page as one.
- Disregard or minimize the influence of links inside one domain.
- IBM expanded HITS into Clever; it was not seen as a viable real-time search engine.
Copied and edited from HITS paper
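The "links inside one domain" enhancement could look roughly like the filter below, applied before the hub and authority weights are computed. The function name `drop_intra_domain_links` is hypothetical, and comparing full host names (rather than registered domains) is a simplifying assumption.

```python
from urllib.parse import urlparse

def drop_intra_domain_links(edges):
    """Keep only links whose source and target are on different hosts.

    edges is an iterable of (source_url, target_url) pairs with full URLs
    (scheme included). Treating "same host name" as "same domain" is an
    assumption of this sketch.
    """
    kept = []
    for src, dst in edges:
        if urlparse(src).netloc.lower() != urlparse(dst).netloc.lower():
            kept.append((src, dst))
    return kept
```

Dropping these links entirely is the "disregard" variant; a weighted scheme would instead keep them with a reduced weight.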
66Issues of PageRank
Essential difference
- Users are not random walkers.
- Starting point distribution (actual usage data as starting vector).
- Bias towards main pages.
- Linkage spam.
- No query-specific rank.
67PageRank vs. HITS
- HITS
- (CLEVER)
- performed on the set of retrieved web pages for each query
- computes authorities and hubs
- easy to compute, but real-time execution is hard
- PageRank
- (Google)
- computed for all web pages stored in the database prior to the query
- computes authorities only
- trivial and fast to compute
Copied and edited from Tie-Yan Liu's slides
68Case Study on PageRank vs. HITS
http://www.matalon.org/search-algorithms/
70Limits of Link Analysis
- Pay-for-place
- Search engine bias: organizations pay search engines for page rank
- Advertisements: organizations pay high-ranking pages for advertising space
- With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page
Copied and edited from Tie-Yan Liu's slides
71Limits of Link Analysis
- Stability
- Adding even a small number of nodes/edges to the graph has a significant impact
- Topic drift
- A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
- Content evolution
- Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks
Copied and edited from Tie-Yan Liu's slides
72 73PageRank vs. HITS - Stability
Copied and edited from Tie-Yan Liu's slides
- Are the link analysis algorithms based on eigenvectors stable, in the sense that results don't change significantly when the graph is perturbed slightly?
- General strategy for evaluating stability
- 1. Start with the original adjacency matrix, A
- 2. Perturb the matrix to get A': select k nodes in the graph to add or delete
- 3. Compute the distance d(r(A), r(A')), for some distance measure d and objective function r that measures the quality of the results of A somehow
- 4. Compute the amount of perturbation p(A, A'), for some distance function p that measures the amount of perturbation
- 5. Evaluate the conditions, if any, where small values for p generate large values for d
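The strategy above can be tried on a small graph with a sketch like the following. Everything specific here is an assumption made for illustration: r is taken to be the HITS authority vector, d the Euclidean distance between authority vectors, p the Frobenius norm of A - A', and "deleting" a node is implemented by zeroing its row and column. The function names are hypothetical.

```python
import numpy as np

def authority_scores(A, iterations=100):
    """Unit-norm authority vector obtained by power iteration on A^T A."""
    A = np.asarray(A, dtype=float)
    x = np.ones(A.shape[0])
    for _ in range(iterations):
        x = A.T @ (A @ x)
        norm = np.linalg.norm(x)
        if norm == 0:
            return x  # graph without links: no authority to assign
        x = x / norm
    return x

def stability_trial(A, removed_nodes):
    """Delete the given nodes' links; return (perturbation p, distance d)."""
    A = np.asarray(A, dtype=float)
    A_pert = A.copy()
    A_pert[removed_nodes, :] = 0.0  # drop out-links of the removed nodes
    A_pert[:, removed_nodes] = 0.0  # drop in-links of the removed nodes
    p = np.linalg.norm(A - A_pert)  # Frobenius norm of the change
    d = np.linalg.norm(authority_scores(A) - authority_scores(A_pert))
    return p, d

# Hypothetical graph: page 0 links to 2; page 1 links to 2 and 3.
A = np.zeros((4, 4))
A[0, 2] = A[1, 2] = A[1, 3] = 1.0
p, d = stability_trial(A, [0])
```

Repeating such trials over many random node choices and plotting d against p is one way to look for the cases in step 5.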
74Stability of HITS
Copied from Tie-Yan Liu's slides