Authoritative Sources in a Hyperlinked Environment

1
Authoritative Sources in a Hyperlinked Environment
Author: Jon M. Kleinberg
  • Presented by Gang Fang
  • Slides made from [1], as well as copied and edited
    from [2] and [3]
  • for Fall 2008 Course CSci 8363
  • [1] HITS paper
  • [2] Dr. Bill Eberle's slides, UTA
  • [3] Dr. Tie-Yan Liu's slides, MSRA

2
Outline
  • Motivation (citation)
  • The HITS Algorithm
  • Extensions of HITS
  • Comparison with PageRank
  • Limits of Link Analysis

4
Ranking for the Search Results
  • Drawback of pure content-based ranking
  • Specific queries (scarcity).
  • "Does Netscape support the JDK 1.1 code-signing
    API?"
  • Broad-topic queries (abundance).
  • "Find information about the Java programming
    language"
  • Relevance-based search engines may return
    millions of pages for a certain query
  • It is definitely not possible for the user to
    preview all these results
  • An appropriate ranking will be very helpful.
  • Ranking on relevance
  • Ranking on importance

6
Traditional IR Ranking
  • Text-based ranking function
  • www.harvard.edu can hardly be recognized as one
    of the most authoritative pages for the query
    "harvard", since many other web pages contain
    "harvard" more often.
  • The number of pages with the same relevance is
    still too large for the users to preview.
  • Pages are not sufficiently self-descriptive
  • Usually the term "search engine" doesn't appear
    on the web pages of search engines.

7
What's More for Web Search
  • In order to solve these problems
  • We must leverage other information on the Web
  • We must distinguish those pages with the same
    amount of relevance
  • Link Analysis
  • The web is not just a collection of pure-text
    documents
  • the hyperlinks are also very important!
  • A link from page A to page B may indicate
  • A is related to B, or
  • A is recommending, citing, voting for or
    endorsing B
  • Links affect the ranking of web pages and thus
    have commercial value.

8
Web as a Graph
  • Web pages as nodes of a graph.
  • Links as directed edges.

[Figure: the web as a directed graph, with nodes labeled "my page", www.uta.edu, and www.google.com]
Copied and edited from Bill Eberle's slides
10
Famous Link Analysis Methods
  • HITS (Hyperlink-Induced Topic Search)
  • "Authoritative Sources in a Hyperlinked
    Environment", Jon Kleinberg,
  • Cornell University. 1998.
  • PageRank
  • "The PageRank Citation Ranking: Bringing Order to
    the Web", Lawrence Page and Sergey Brin, Stanford
    University. 1998.

11
In the year 1996, also at Cornell!
The image retrieval community was also shifting from
CH to CCV
CH: color histogram → CCV: color coherence vector
(Pass and Zabih, 1996)
13
Motivation of Link Analysis (HITS)
  • Motivation
  • First search for a number of relevant pages
  • Then find the smallest set of authoritative
    sources via ranking
  • Make use of the rich linkage structure!
  • Forward links (out-edges).
  • Backward links (in-edges).

15
Authorities and Hubs
  • Authority is a page which has relevant
    information about the topic.
  • Hub is a page which has a collection of links to
    pages about that topic.

[Figure: hub node h and authority nodes a1 to a4]
17
Authorities and Hubs (cont.)
  • Good hubs are the ones that point to good
    authorities.
  • Good authorities are the ones that are pointed to
    by good hubs.

[Figure: hub nodes h1 to h5 and authority nodes a1 to a6]
18
HITS: Two Steps
  • First, construct a focused sub-graph of the WWW.
  • Second, compute hubs and authorities from the
    sub-graph.

19
Construction of Sub-graph
[Figure: sub-graph construction. Labels: Topic, Search Engine, Rootset Pages, Crawler, Forward link pages, Expanded set Pages. The root set R is expanded into the set S using all the children and a fixed number of parents of the root-set pages, and S gives the sub-graph G.]
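
The construction on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the names build_base_set, induced_edges, out_links, in_links and max_parents are hypothetical, the default of 50 parents is an arbitrary choice, and the link tables are toy data standing in for a real search engine and crawler.

```python
def build_base_set(root_set, out_links, in_links, max_parents=50):
    """Expand a root set R into a base set S.

    Adds all children of each root page and at most max_parents of its parents.
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, ()))      # all the children
        parents = sorted(in_links.get(page, ()))  # sorted only to keep the sketch deterministic
        base.update(parents[:max_parents])        # a fixed number of parents
    return base


def induced_edges(base, out_links):
    """Keep only the links whose two endpoints both lie in the base set."""
    return {(u, v) for u in base for v in out_links.get(u, ()) if v in base}


# Toy link tables (hypothetical data, not from the slides).
out_links = {"r1": ["c1", "c2"], "p1": ["r1"], "p2": ["r1"], "p3": ["r1"], "c1": ["x"]}
in_links = {"r1": ["p1", "p2", "p3"], "c1": ["r1"], "c2": ["r1"], "x": ["c1"]}

base = build_base_set({"r1"}, out_links, in_links, max_parents=2)
edges = induced_edges(base, out_links)
print(sorted(base))
print(sorted(edges))
```

Capping the number of parents keeps a very popular root page from flooding the base set with its in-links.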
20
Hubs and Authorities Calculation
  • Iterative algorithm on the base set: authority
    weights a(p) and hub weights h(p).
  • Set authority weights a(p) = 1 and hub weights
    h(p) = 1 for all p.
  • Repeat the following two operations (and then
    re-normalize a and h to have unit norm)

[Figure: two diagrams. In one, pages v1, v2, v3 point to p and contribute h(v1), h(v2), h(v3) to a(p); in the other, p points to v1, v2, v3, which contribute a(v1), a(v2), a(v3) to h(p).]
21
Example
[Figure: example graph in which every node is labeled hub 0.45, authority 0.45]
22
Example (cont.)
[Figure: example graph with node labels (hub, authority): (0.45, 0.9), (1.35, 0.9), (0.9, 0.45), (0.45, 0.9)]
23
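Iterative Update of Authority and Hubness
  • Recursive dependency
  • I step: $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • O step: $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  • Here pa[v] is the set of parents of v (pages
    linking to v) and ch[v] the set of children of v
    (pages v links to).
  • Normalization after each iteration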
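
A minimal sketch of the I and O steps with normalization, written in plain Python. The function names hits and normalize, the edge-list representation, the default of 20 iterations and the toy graph are all assumptions made for illustration; they do not come from the slides.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {page: w / norm for page, w in weights.items()}


def hits(edges, iterations=20):
    """Return (authority, hub) weights for directed edges (u, v), meaning u links to v."""
    pages = sorted({page for edge in edges for page in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I step: a(v) <- sum of h(w) over the parents w of v
        new_auth = {p: 0.0 for p in pages}
        for u, v in edges:
            new_auth[v] += hub[u]
        # O step: h(v) <- sum of a(w) over the children w of v,
        # using the authority weights just computed
        new_hub = {p: 0.0 for p in pages}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        # Normalization after each iteration
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Toy graph (hypothetical): three hub-like pages linking to two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
auth, hub = hits(edges)
print(max(auth, key=auth.get))  # page with the largest authority weight
print(max(hub, key=hub.get))    # page with the largest hub weight
```

In this toy graph a2 is linked from every hub, so it ends up with the largest authority weight, while h3, which links only to a2, gets a smaller hub weight than h1 and h2.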

24
Convergence of Authority and Hubness - Assumptions
Copied from HITS paper
27
Convergence of Authority and Hubness - Convergence
35
Convergence of Authority and Hubness (Cont.)
  • Theorem 3.2 shows that one can use any
    eigenvector algorithm to compute the fixed points
    x* and y*
  • However, the HITS paper sticks to the above
    exposition in terms of the Iterate procedure for
    two reasons.
  • First, it emphasizes the underlying motivation for
    the approach in terms of the reinforcing I and O
    operations
  • Second, one does not have to run the above
    process of iterated I/O operations to
    convergence; one can compute weights x^<p> and
    y^<p> by starting from any initial vectors x0
    and y0, and performing a fixed bounded number of I
    and O operations.
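
The link between the Iterate procedure and eigenvectors can be checked numerically on a small graph. The sketch below is an illustration only: the helper names (mat_vec, transpose, unit, iterate, eigen_residual) and the 5-page adjacency matrix are hypothetical. It runs k rounds of I and O operations and measures how far the authority vector x is from satisfying $(A^T A)x = \lambda x$.

```python
import math


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def unit(v):
    """Scale v to unit Euclidean norm (v must be nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def iterate(a, k):
    """Run k rounds of I and O operations; a[i][j] = 1 if page i links to page j."""
    at = transpose(a)
    y = [1.0] * len(a)
    x = [1.0] * len(a)
    for _ in range(k):
        x = mat_vec(at, y)  # I operation: authority from hub weights of parents
        y = mat_vec(a, x)   # O operation: hub from authority weights of children
        x, y = unit(x), unit(y)
    return x, y


def eigen_residual(a, x):
    """Norm of (A^T A)x - lambda*x for unit x, with lambda the Rayleigh quotient."""
    mx = mat_vec(transpose(a), mat_vec(a, x))
    lam = sum(p * q for p, q in zip(mx, x))
    return math.sqrt(sum((p - lam * q) ** 2 for p, q in zip(mx, x)))


# Hypothetical graph: pages 0, 1, 2 link to pages 3 and 4 (page 2 links only to 4).
A = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
x1, _ = iterate(A, 1)
x20, _ = iterate(A, 20)
print(eigen_residual(A, x1), eigen_residual(A, x20))
```

On this graph the residual shrinks quickly with k, which is consistent with the point that a bounded number of I and O operations can already give usable weights.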

37
HITS Example Results
Copied and edited from Tie-Yan Liu's slides
[Figure: authority and hubness weights plotted for indices 1 to 15]
38
Extensions of HITS
  • Extensions of HITS
  • Similar query pages
  • Multiple Sets of Hubs and Authorities

40
Similar query pages
  • Using link structure to infer a notion of
    "similarity" among pages
  • Suppose we have found a page p that is of
    interest (perhaps it is an authoritative page
    on a topic of interest), and
  • we wish to ask the following type of question:
    "What do users of the WWW consider to be related
    to p, when they create pages and hyperlinks?"
  • If p is a highly referenced page, we have a version
    of the Abundance Problem
  • The surrounding link structure will implicitly
    represent an enormous number of independent
    opinions about the relation of p to other pages.

41
Similar query pages
  • Using the notion of hubs and authorities, we can
    provide an approach to the issue of page
    similarity, asking:
  • In the local region of the link structure near
    p, what are the strongest authorities? Such
    authorities can potentially serve as a
    broad-topic summary of the pages related to p.
  • The original HITS algorithm can be adapted to this
    situation with essentially no modification
  • Previously, we initiated our search with a query
    string s; our request from the search engine was
    "Find t pages containing the string s."
  • We now begin with a page p and pose the following
    request to the search engine: "Find t pages
    pointing to p."
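
The only change for similar-page queries is how the root set is requested. A rough sketch of the two requests, assuming hypothetical in-memory tables (page_text, in_links) in place of a real search engine; the function names are invented for illustration:

```python
def root_set_from_query(page_text, s, t):
    """'Find t pages containing the string s' (stand-in for a search-engine call)."""
    matches = sorted(url for url, text in page_text.items() if s in text)
    return matches[:t]


def root_set_from_page(in_links, p, t):
    """'Find t pages pointing to p'."""
    return sorted(in_links.get(p, ()))[:t]


# Toy data (hypothetical).
page_text = {
    "u1": "java language tutorial",
    "u2": "coffee shop",
    "u3": "java api docs",
}
in_links = {"p": ["u3", "u1", "u2"]}

print(root_set_from_query(page_text, "java", 2))
print(root_set_from_page(in_links, "p", 2))
```

Either root set can then be expanded and ranked in the same way as before.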

45
Similar query pages
  • Superficially, the set of issues in working with a
    subgraph Gp are somewhat different from those
    involved in working with a subgraph defined by a
    query string.
  • However, we find that most of the basic
    conclusions we drew in the previous two sections
    continue to apply.
  • Ranking pages of Gp by their in-degrees is still
    not satisfactory

46
Similar query pages
  • To compare,
  • Note the difficulties inherent in compiling such
    lists through text-based methods:
  • many of the pages consist almost entirely of
    images, with very little text, and the text that
    they do contain has very little overlap.
  • HITS, on the other hand, is determining, via the
    presence of links, what the creators of WWW pages
    tend to "classify" together with the given page
    www.honda.com

49
Multiple Sets of Hubs and Authorities
  • Original HITS finds the most densely linked
    collection of hubs and authorities in the
    subgraph Gs defined by a query string s.
  • There are a number of settings, however, in which
    one may be interested in finding several densely
    linked collections of hubs and authorities among
    the same set S of pages.
  • Each such collection could potentially be
    relevant to the query topic,
  • but they could be well-separated from one another
    in the graph G for a variety of reasons. For
    example,
  • The query string may have several very different
    meanings. E.g. "jaguar"
  • The string may arise as a term in the context of
    multiple technical communities. E.g. "randomized
    algorithms".
  • The string may refer to a highly polarized issue,
    involving groups that are not likely to link to
    one another. E.g. "abortion"

54
Multiple Sets of Hubs and Authorities
  • In each of the above three examples, the relevant
    documents can be naturally grouped into several
    clusters.
  • The issue in the setting of broad-topic queries,
    however, is not simply how to achieve a
    dissection into reasonable clusters; one must
    also deal with this in the presence of the
    Abundance Problem.
  • Each cluster, in the context of the full WWW, is
    enormous.
  • So, we require a way to distill a small set of
    hubs and authorities out of each one. We can thus
    view such collections of hubs and authorities as
    implicitly providing broad-topic summaries of a
    collection of large clusters that we never
    explicitly represent.

56
Multiple Sets of Hubs and Authorities
  • In the original HITS, the authorities and hubs we
    computed correspond to the principal eigenvectors
    of the matrices $A^T A$ and $A A^T$, respectively,
    where A is the adjacency matrix of G.
  • The non-principal eigenvectors of $A^T A$ and $A A^T$
    provide us with a natural way to extract
    additional densely linked collections of hubs and
    authorities from the base set S.
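
One way to see this on a toy example is power iteration followed by deflation on $A^T A$. This is a sketch under assumptions, not the procedure used in the paper: the helper names, the deflation approach, ranking entries by absolute value, and the two-community graph are all hypothetical choices made for illustration.

```python
import math


def mat_vec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def authority_matrix(edges, n):
    """Build A^T A: entry (i, j) counts the pages that link to both i and j."""
    m = [[0.0] * n for _ in range(n)]
    for u, i in edges:
        for w, j in edges:
            if u == w:
                m[i][j] += 1.0
    return m


def power_iteration(m, steps=100):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of symmetric m."""
    v = [1.0 / math.sqrt(len(m))] * len(m)
    for _ in range(steps):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(p * q for p, q in zip(mat_vec(m, v), v))
    return lam, v


def deflate(m, lam, v):
    """Remove the found eigenpair so the next power iteration finds the next one."""
    n = len(m)
    return [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


def top_k(v, k):
    """Indices of the k entries with largest absolute value."""
    return sorted(range(len(v)), key=lambda i: -abs(v[i]))[:k]


# Two separate communities (hypothetical): pages 0-2 link to 3 and 4; pages 5-6 link to 7.
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (5, 7), (6, 7)]
M = authority_matrix(edges, 8)
lam1, v1 = power_iteration(M)
lam2, v2 = power_iteration(deflate(M, lam1, v1))
print(top_k(v1, 2), top_k(v2, 1))
```

Here the principal eigenvector picks out the authorities of the denser community, and the next eigenvector picks out the authority of the other one.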

57
Algorithmic Outcome
  • Applying iterative multiplication (power
    iteration) to any non-degenerate initial vector
    leads to the principal eigenvector.
  • Hubs and authorities are the outcome of this
    process.
  • The largest entries of the principal eigenvectors
    identify the top hubs and authorities.

Copied and edited from HITS paper
58
Results
  • Although HITS is only link-based (it completely
    disregards page content), results are quite good
    in many tested queries.
  • When the authors tested the query "search
    engines"
  • The algorithm returned Yahoo!, Excite, Magellan,
    Lycos, AltaVista
  • However, none of these pages described themselves
    as a "search engine" (at the time of the
    experiment)

59
Issues
  • Starting from a narrow topic, HITS tends to end up
    in a more general one.
  • Hub pages with many links can cause the
    algorithm to drift, because they can point to
    authorities in different topics.
  • Pages from a single domain / website can dominate
    the result if they all point to one page, which is
    not necessarily a good authority.

65
Possible Enhancements
  • Use weighted sums for link calculation.
  • Take advantage of anchor text: the text
    surrounding the link itself.
  • Break hubs into smaller pieces. Analyze each
    piece separately, instead of the whole hub page as
    one.
  • Disregard or minimize the influence of links inside
    one domain.
  • IBM expanded HITS into Clever; it was not seen as a
    viable real-time search engine.

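
As a small illustration of the "links inside one domain" enhancement, the sketch below drops links whose two endpoints share a host before any weights are computed. Treating the URL host as the "domain" is an assumption made here, the function name drop_intra_domain_links is hypothetical, and the URLs are made-up examples.

```python
from urllib.parse import urlparse


def drop_intra_domain_links(edges):
    """Keep only links whose source and target URLs have different hosts."""
    return [(u, v) for u, v in edges if urlparse(u).netloc != urlparse(v).netloc]


# Toy link list (hypothetical URLs).
edges = [
    ("http://a.example/x", "http://a.example/y"),  # same host: navigational link
    ("http://a.example/x", "http://b.example/"),
    ("http://b.example/", "http://a.example/y"),
]
print(drop_intra_domain_links(edges))
```

A softer variant would keep such links but give them a smaller weight, in line with the weighted-sums idea in the first bullet.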
66
Issues of PageRank
Essential difference
  • Users are not random walkers.
  • Starting point distribution (actual usage data as
    starting vector).
  • Bias towards main pages.
  • Linkage spam.
  • No query specific rank.

67
PageRank vs. HITS
  • HITS
  • (CLEVER)
  • performed on the set of retrieved web pages for
    each query
  • computes authorities and hubs
  • easy to compute, but real-time execution is hard
  • PageRank
  • (Google)
  • computed for all web pages stored in the database
    prior to the query
  • computes authorities only
  • Trivial and fast to compute
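
For contrast with the per-query HITS computation, here is a rough sketch of a query-independent PageRank-style iteration. The damping value 0.85, the uniform treatment of pages without out-links, the function name pagerank and the three-page graph are assumptions for illustration; none of them come from the slides.

```python
def pagerank(edges, pages, damping=0.85, iterations=50):
    """Return a rank for every page; ranks sum to 1."""
    n = len(pages)
    out_degree = {p: 0 for p in pages}
    for u, _ in edges:
        out_degree[u] += 1
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly (an assumption).
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for u, v in edges:
            new_rank[v] += damping * rank[u] / out_degree[u]
        rank = new_rank
    return rank


# Toy graph (hypothetical): a and b link to c, and c links back to a.
pages = ["a", "b", "c"]
edges = [("a", "c"), ("b", "c"), ("c", "a")]
rank = pagerank(edges, pages)
print(sorted(rank, key=rank.get, reverse=True))
```

Unlike the HITS sketches, this produces a single score per page and never looks at a query, which matches the "prior to the query" bullet above.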

68
Case Study on PageRank vs. HITS
http://www.matalon.org/search-algorithms/
70
Limits of Link Analysis
  • Pay-for-place
  • Search engine bias: organizations pay search
    engines and page rank
  • Advertisements: organizations pay high-ranking
    pages for advertising space
  • With a primary effect of increased visibility to
    end users and a secondary effect of increased
    respectability due to relevance to a high-ranking
    page

71
Limits of Link Analysis
  • Stability
  • Adding even a small number of nodes/edges to the
    graph has a significant impact
  • Topic drift
  • A top authority may be a hub of pages on a
    different topic, resulting in increased rank of
    the authority page
  • Content evolution
  • Adding/removing links/content can affect the
    intuitive authority rank of a page, requiring
    recalculation of page ranks

72
  • Thank you!

73
PageRank vs. HITS - Stability
  • Are the link analysis algorithms based on
    eigenvectors stable, in the sense that results
    don't change significantly?
  • General strategy for evaluating stability
  • 1. Start with the original adjacency matrix, A
  • 2. Perturb the matrix to get A': select k nodes
    in the graph to add or delete
  • 3. Compute the distance d(r(A), r(A')) for some
    distance measure d and objective function r that
    measures the quality of results of A somehow
  • 4. Compute the amount of perturbation p(A, A') for
    some distance function p that measures the amount
    of perturbation
  • 5. Evaluate the conditions, if any, where small
    values for p generate large values for d
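
The five steps can be sketched in code as follows. Several choices here are assumptions made only for illustration: r is taken to be the vector of HITS authority weights, d is the Euclidean distance between two such vectors, p counts the edges that differ, the perturbation only deletes nodes, and all function names and the toy graph are hypothetical.

```python
import math
import random


def unit(weights):
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {page: w / norm for page, w in weights.items()}


def authority_scores(edges, pages, iterations=50):
    """r: HITS authority weights over a fixed page set (deleted pages score 0)."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for u, v in edges:
            auth[v] += hub[u]
        auth = unit(auth)
        hub = {p: 0.0 for p in pages}
        for u, v in edges:
            hub[u] += auth[v]
        hub = unit(hub)
    return auth


def perturb(edges, pages, k, rng):
    """Step 2: delete k randomly chosen nodes together with their edges."""
    removed = set(rng.sample(sorted(pages), k))
    return [(u, v) for u, v in edges if u not in removed and v not in removed]


def result_distance(r1, r2):
    """Step 3: d, the Euclidean distance between two score vectors."""
    return math.sqrt(sum((r1[p] - r2[p]) ** 2 for p in r1))


def perturbation_size(edges1, edges2):
    """Step 4: p, the number of edges present in one graph but not the other."""
    return len(set(edges1) ^ set(edges2))


# Step 1: original graph (toy data).
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
pages = sorted({page for edge in edges for page in edge})

rng = random.Random(0)
perturbed = perturb(edges, pages, 1, rng)
d = result_distance(authority_scores(edges, pages), authority_scores(perturbed, pages))
p = perturbation_size(edges, perturbed)
print(p, d)  # Step 5: compare the size of the perturbation with the change in results
```

Repeating this over many random perturbations and graph sizes would give the data needed for step 5.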

74
Stability of HITS
Copied from Tie-Yan Liu's slides
76
Multiple Sets of Hubs and Authorities