Title: Vikrant Khosla Sridhar Kameswara Nemani
1 Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
- Presented by: Vikrant Khosla, Sridhar Kameswara Nemani
2 Outline
- The general problem of search on the WWW
- Overview of the authoritative approach proposed
by this paper
- Constructing a focused subgraph
- Computing Hubs and Authorities
- Similar page Queries
- Multiple Sets of Hubs and Authorities
- Diffusion and generalization
- Evaluation
- Conclusion
3 General Problem
- How can the quality of search on the WWW be improved?
- Judging the quality of search requires human evaluation
because notions such as relevance are inherently subjective.
- The WWW is a hypertext corpus of enormous
complexity and information.
- This paper aims to create a link-based model that
consistently identifies relevant, authoritative
WWW pages for broad search topics.
4 Understand Query Types
- There is more than one type of query, and the
handling of each may require different
techniques.
- Types of queries
- Specific queries, e.g. "Does Netscape support the
JDK 1.1 code-signing API?"
- Broad-topic queries, e.g. "Find information about
the Java programming language."
- Similar-page queries, e.g. "Find pages similar to
honda.com."
5 Difficulty in Handling Queries
- Specific queries
- Scarcity problem: few pages contain the required
information, and it is difficult to determine the
identity of those pages.
- Broad-topic queries
- Abundance problem: the number of pages that could
reasonably be returned as relevant is far too
large for a human user to digest.
- Goal: from a huge collection of relevant pages,
select a small set of the most authoritative or
definitive ones.
6 Authoritative Pages
- Given a particular page, how do we tell whether
it is authoritative?
- The problem is related to the limitations of
text-based analysis.
- Text-based ranking functions
- E.g. for the query "harvard", www.harvard.edu is the
natural authoritative page, but many other web pages
may contain the term "harvard" more often.
- The most popular pages are often not sufficiently
self-descriptive.
- The term "search engine" usually does not appear
on the home pages of search engines such as Yahoo,
AltaVista, and Excite.
- The Honda and Toyota home pages hardly contain the
term "automobile manufacturer".
7 Analysis of Link Structure
- Hyperlinks encode latent human judgment, which
can be used to formulate a notion of authority.
- The creation of a link is a concrete indication of
the following type of judgment:
- The creator of page p, by including a link to
page q, has in some measure conferred authority
on q.
- This gives the user an opportunity to find potential
authorities purely through the pages that point
to them.
- Potential pitfalls of this concept
- Many links are created for navigational purposes
(e.g. main menus) or as paid advertisements.
- It is difficult to balance relevance against
popularity (e.g. Yahoo).
8 Authorities and Hubs
- Authorities are pages that are recognized as
providing significant, trustworthy, and useful
information on a topic.
- Hubs are index pages that provide many useful
links to relevant content pages (topic
authorities).
- In-degree: the number of links pointing to a page;
it is one simple measure of authority.
- Out-degree: the number of links from a page to
other pages.
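The two degree measures can be illustrated with a short sketch. The edge list below is a hypothetical toy graph, not data from the paper, and the page names are placeholders.

```python
from collections import Counter

# Hypothetical toy link graph: each pair (p, q) means page p links to page q.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "d"), ("d", "a")]

out_degree = Counter(p for p, _ in edges)  # links leaving each page
in_degree = Counter(q for _, q in edges)   # links arriving at each page

# Rank pages by in-degree (ties broken by name), the simplest authority measure.
ranked = sorted(in_degree, key=lambda page: (-in_degree[page], page))
```

Here `ranked` lists only pages that receive at least one link; a page such as "b" that nobody links to has in-degree 0.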
9 Can We Operate over the Entire WWW?
- Local approaches: deal with an intranet, where the
amount of data is much smaller than on the WWW as a
whole.
- Clustering approach: dissects a heterogeneous
population into subpopulations that are in some way
more cohesive, but the underlying problem of
filtering a vast number of pages remains the same.
- Authoritative approach: global in nature
- Perform a search on a text-based WWW search engine.
- Distill the broad topic from the returned pages via
the discovery of authorities.
10 Overview of Search Steps
11 Overview
- Search string
- Text search engine
- Authoritative approach
- Constructing a focused subgraph
- Computing hubs and authorities
- Better-quality search results
12 Constructing Subgraph
- The collection V of hyperlinked pages can be
viewed as a directed graph G = (V, E): nodes
correspond to pages, and a directed edge (p, q) ∈ E
indicates the presence of a link from p to q.
- Construct a focused subgraph Sσ of the WWW for the
query string σ, with the following properties:
- (1) Sσ is relatively small (so that computation is
affordable).
- (2) Sσ is rich in relevant pages (so that it is
easier to find good authorities).
- (3) Sσ contains most (or many) of the strongest
authorities.
13 How to Find Sσ?
- Set Q: the set of all pages containing the query string.
- Root set Rσ: the t highest-ranked pages for the
query σ, obtained from a text-based search engine. It
satisfies properties 1 and 2.
- Problems with Rσ
- Rσ is a subset of the collection Q, and Q does not
satisfy property 3.
- There are extremely few links between pages in Rσ,
rendering it essentially structureless.
- However, a strong authority for the query is quite
likely to be pointed to by at least one page in Rσ.
- Construct the base set Sσ by extending the root set
Rσ to include:
- All pages linked to by pages in Rσ
- All pages that link to a page in Rσ, keeping at most
d such pages for each page in Rσ
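The expansion from root set to base set can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_base_set`, the dictionaries `out_links` and `in_links`, and the toy page names are all hypothetical, and taking the first d in-linking pages is just one possible way to choose "at most d".

```python
def build_base_set(root_set, out_links, in_links, d):
    """Expand a root set into a base set.

    out_links[p] lists the pages that p links to;
    in_links[p] lists the pages that link to p.
    """
    base = set(root_set)
    for page in root_set:
        # Add every page that a root page points to.
        base.update(out_links.get(page, []))
        # Add at most d pages that point to this root page, so that a
        # single very popular page cannot flood the base set.
        base.update(in_links.get(page, [])[:d])
    return base

# Hypothetical toy data.
root = ["r1", "r2"]
out_links = {"r1": ["a1"], "r2": ["a1", "a2"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = build_base_set(root, out_links, in_links, d=2)
```

With d = 2, only two of the three pages linking to "r1" are admitted, so "h3" stays outside the base set.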
14 Subgraph Algorithm
15 Observations and Heuristics
- Heuristic 1: delete all intrinsic links and keep
all transverse links.
- Intrinsic links
- A link is intrinsic if it is between pages with the
same domain name.
- These links are generally for navigation purposes.
- They are less informative and often contain
repetitive information.
- A link is transverse if it is between pages with
different domain names.
- Heuristic 2: limit collusion by keeping only a small
number of pages (about 4 to 8) from a single domain
that point to any given page p.
- Collusion: a large number of pages from a single
domain all point to a single page p.
- This is generally used for mass endorsement,
advertisement, etc.
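Both heuristics can be sketched as simple filters over a list of links. The helper names and URLs below are hypothetical, and comparing full host names via `urlparse(...).hostname` is a simplifying assumption about what counts as the "same domain name".

```python
from urllib.parse import urlparse

def host(url):
    """Host name of a URL, used here as a stand-in for its domain name."""
    return urlparse(url).hostname

def drop_intrinsic(links):
    """Heuristic 1: keep only transverse links (source and target hosts differ)."""
    return [(src, dst) for src, dst in links if host(src) != host(dst)]

def limit_per_domain(links, m):
    """Heuristic 2: keep at most m links from any one host to a given page."""
    kept, counts = [], {}
    for src, dst in links:
        key = (host(src), dst)
        counts[key] = counts.get(key, 0) + 1
        if counts[key] <= m:
            kept.append((src, dst))
    return kept

# Hypothetical links as (source URL, destination URL) pairs.
links = [
    ("http://www.example.com/index.html", "http://www.example.com/about.html"),
    ("http://www.example.com/a.html", "http://www.example.org/cars.html"),
    ("http://www.example.com/b.html", "http://www.example.org/cars.html"),
    ("http://www.example.com/c.html", "http://www.example.org/cars.html"),
    ("http://www.example.net/list.html", "http://www.example.org/cars.html"),
]
transverse = drop_intrinsic(links)
filtered = limit_per_domain(transverse, m=2)
```

The first link is intrinsic and is dropped; of the three links from www.example.com to the same page, only the first two survive the per-domain limit.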
17 Computing Hubs and Authorities
- The simplest approach would be to order pages by
in-degree.
- Problem: the nodes with the highest in-degree in the
base set
- might not be authorities at all, and may lack any
thematic unity;
- might simply be universally popular pages such as
Yahoo or Google.
18 Computing Hubs and Authorities
- Observation
- Good sources of content (authorities)
- Good sources of links (hubs)
- True authority pages are pointed to by a number of
good hubs.
- Mutually reinforcing relationship
- Hubs point to lots of authorities.
- Authorities are pointed to by lots of hubs.
- We will use an iterative algorithm to break this
circularity.
- Terms
- Good hub: a page that points to many good
authorities.
- Good authority: a page pointed to by many good
hubs.
19 Overview of Algorithm
20 Iterative Algorithm
- An iterative algorithm
- With each page p, we associate
- a non-negative authority weight x<p>
- a non-negative hub weight y<p>
- Weights of each type are normalized so that their
squares sum to 1.
- Pages with larger x-values and y-values are viewed
as better authorities and hubs, respectively.
21 Iterative Algorithm
- If p points to many pages with large x-values,
then it should receive a large y-value.
- If p is pointed to by many pages with large
y-values, then it should receive a large x-value.
- In-links operation I: x<p> ← sum of y<q> over all
pages q with (q, p) ∈ E
- Out-links operation O: y<p> ← sum of x<q> over all
pages q with (p, q) ∈ E
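The two operations and the normalization step can be sketched in a few lines. This is an illustrative sketch rather than the paper's exact pseudocode: the function names, the toy pages, and the edge list are assumptions, and every weight starts at 1.

```python
import math

def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()}

def iterate(pages, edges, k):
    """Apply the I and O operations k times, normalizing after each round."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # I operation: x<p> is the sum of y<q> over pages q that link to p.
        x = {p: sum(y[q] for q, r in edges if r == p) for p in pages}
        # O operation: y<p> is the sum of x<r> over pages r that p links to.
        y = {p: sum(x[r] for q, r in edges if q == p) for p in pages}
        x, y = normalize(x), normalize(y)
    return x, y

# Hypothetical toy graph: two hub-like pages and two authority-like pages.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges, k=20)
```

In this toy graph "a1" ends up with the largest authority weight because both hubs point to it, and "h1" ends up with the largest hub weight because it points to both authorities.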
22 Algorithm
23 Matrices Basics
24 Observations
- As one applies Iterate with arbitrarily large k,
the vectors x and y converge to fixed points x* and y*.
- Let G = (V, E), with V = {p1, p2, ..., pn},
and let A denote the adjacency matrix of the
graph G: the (i, j)th entry of A is 1 if
(pi, pj) is an edge of G, and is 0 otherwise.
- x* is the principal eigenvector of A^T A, and
- y* is the principal eigenvector of A A^T.
- The convergence of Iterate is quite rapid (k = 20
is sufficient).
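The eigenvector claim can be checked numerically on a small example. The sketch below uses a hypothetical 4-page adjacency matrix and plain Python lists; it runs 20 rounds of the I and O operations in matrix form and then measures how far the authority vector is from satisfying (A^T A) x = λ x.

```python
import math

# Hypothetical adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
n = len(A)

def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

At = transpose(A)
x = [1.0] * n
y = [1.0] * n
for _ in range(20):
    x = unit(mat_vec(At, y))  # I operation in matrix form: x = A^T y
    y = unit(mat_vec(A, x))   # O operation in matrix form: y = A x

# Eigenvector check: (A^T A) x should be a scalar multiple of x.
AtA_x = mat_vec(At, mat_vec(A, x))
eigenvalue = sum(a * b for a, b in zip(AtA_x, x))  # Rayleigh quotient
residual = max(abs(a - eigenvalue * b) for a, b in zip(AtA_x, x))
```

For this particular matrix the residual after 20 rounds is already negligible, which is consistent with the rapid convergence noted above; larger or less well-separated graphs may need more rounds.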
25 Observations
- Any eigenvector algorithm can be used to compute
the fixed points x* and y*.
- Iterating the I and O operations is used instead
because it emphasizes the underlying motivation of
the approach.
- It is not necessary to iterate I and O to convergence.
- One can start from initial vectors x0 and y0 and
compute using a fixed number of I and O
operations.
26 Example Mini Web
27 Example Mini Web (Cont.)
28 Basic Results
29 Observations
- The method is a pure analysis of link structure.
- We ignored the text when searching for
authoritative pages.
- I.e., the text-based search only supplies the
initial set.
- The pages found can legitimately be considered
authoritative in the context of the WWW, without
access to a large-scale index of the WWW.
- I.e., global analysis of the full WWW link
structure can be replaced by a local method over a
small focused subgraph.
- This approach can replace the local approaches used
in intranets.
30 Similar-Page Queries
- Example: find pages similar to honda.com.
- Use link structure to infer a notion of
similarity among pages.
- Suppose we have found a page p that is of interest
and is an authoritative page on a topic.
- Can this help in finding similar pages?
- What do users of the WWW consider to be related
to p when they create pages and links?
31 Similar-Page Queries
- Previously, our request to the search engine was:
- Find t pages containing the string σ.
- Now our request to the search engine is:
- Find t pages pointing to p.
- Rp: root set
- Sp: base set
- Gp: focused subgraph
- The strongest authorities in the local region of the
link structure near p can serve as a broad-topic
summary of the pages related to p.
32 Results: Similar-Page Queries
33 Multiple Sets of Hubs and Authorities
- There may be several densely linked collections of
hubs and authorities within the same base set.
- Examples
- "jaguar" has several different meanings.
- "randomized algorithms" arises in multiple
technical communities.
- "abortion" involves groups that may not be
linked to each other.
- Clustering is needed in the presence of the
abundance problem.
34 Multiple Sets of Hubs and Authorities
- The non-principal eigenvectors provide a way
to extract additional densely linked collections
of hubs and authorities.
- Non-principal eigenvectors have both
positive and negative entries.
- Often, the highly positive entries correspond to
one cluster of pages and the highly negative
entries to a different cluster.
- Typically the two clusters will not be tightly
intertwined.
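The way a non-principal eigenvector separates two clusters can be sketched on a tiny example. Everything here is an assumption made for illustration: the four page names, the matrix M (standing in for A^T A over two loosely connected communities), and the use of power iteration with deflation, which is one standard technique and not necessarily the method used in the paper.

```python
import math

# Hypothetical authority pages and a matrix M standing in for A^T A:
# M[i][j] counts the hubs that link to both page i and page j.
pages = ["a1", "a2", "b1", "b2"]
M = [
    [2, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 1, 3, 2],
    [0, 0, 2, 2],
]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def power_iteration(M, start, steps, orthogonal_to=()):
    """Power iteration that projects out already-found eigenvectors."""
    v = list(start)
    for _ in range(steps):
        for u in orthogonal_to:
            c = dot(v, u)
            v = [vi - c * ui for vi, ui in zip(v, u)]
        v = unit(mat_vec(M, v))
    return v

start = [1.0, 0.0, 0.0, 0.0]
principal = power_iteration(M, start, 200)
second = power_iteration(M, start, 200, orthogonal_to=[principal])

# The two ends of the second eigenvector pick out the two communities.
positive_end = [p for p, w in zip(pages, second) if w > 0]
negative_end = [p for p, w in zip(pages, second) if w < 0]
```

The principal eigenvector has only positive entries, while the second one splits the pages into the "a" community and the "b" community. Which community lands on the positive end depends on the starting vector, since an eigenvector's overall sign is arbitrary.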
35 Jaguar Example
- The principal eigenvector (authorities) is primarily
about the Atari product.
- At the positive end of the 2nd non-principal
eigenvector, the pages are primarily about the
Jacksonville Jaguars.
- At the positive end of the 3rd non-principal
eigenvector, the pages are primarily about the
car.
36 Randomized Algorithms Example
- The positive end of the first non-principal
eigenvector returns home pages of theoretical
computer scientists.
- The negative end of the first non-principal
eigenvector returns compendia of mathematical software.
- At the negative end of the fourth non-principal
eigenvector, the pages are primarily about
wavelets.
37 Diffusion and Generalization
- The query may not be sufficiently broad.
- In this case there will not be enough highly
relevant pages in the base set to extract a
sufficiently dense subgraph of relevant hubs and
authorities.
- When this occurs, the collection will often
represent a broader topic, and the results will
reflect a diffused version of the initial query.
- Example: WWW conferences -> WWW resource pages.
38 Evaluation
- In studies conducted in 1998 over 26 queries and
37 volunteers, Clever reported better authorities
than Yahoo!, which in turn was better than
AltaVista.
39 Conclusion
- A way is needed to distill a broad topic for which
there may be millions of relevant pages.
- The approach provides high-quality results in the
context of what is available on the WWW globally.
- It operates without maintaining an index of the WWW
or its link structure.
- It identifies complex patterns of social
organization on the WWW.