Vikrant Khosla Sridhar Kameswara Nemani - PowerPoint PPT Presentation

Provided by: Vik72
Learn more at: https://crystal.uta.edu
Transcript and Presenter's Notes

1
Authoritative Sources in a Hyperlinked
Environment, by Jon M. Kleinberg
  • Presented by Vikrant Khosla and Sridhar
    Kameswara Nemani

2
Outline
  • Search on the WWW: the problem in general
  • Overview of the authoritative approach proposed
    by this paper
  • Constructing a focused Subgraph
  • Computing Hubs and Authorities
  • Similar page Queries
  • Multiple Sets of Hubs and Authorities
  • Diffusion and generalization
  • Evaluation
  • Conclusion

3
General Problem
  • How to improve quality of search on WWW?
  • Quality of search requires human evaluation due
    to the subjectivity inherent in notions such as
    relevance.
  • The WWW is a hypertext corpus of enormous
    complexity and information.
  • This paper aims to create a link-based model that
    consistently identifies relevant, authoritative
    WWW pages for broad search topics.

4
Understand Query Types
  • There is more than one type of query and the
    handling of each may require different
    techniques.
  • Types of queries
  • Specific queries
  • E.g., "Does Netscape support the JDK 1.1
    code-signing API?"
  • Broad-topic queries
  • E.g., "Find information about the Java
    programming language."
  • Similar-page queries
  • E.g., "Find pages similar to honda.com."

5
Difficulty in Handling Queries
  • Specific queries
  • Scarcity problem: few pages contain the required
    information, and it is difficult to determine
    the identity of those pages.
  • Broad-topic queries
  • Abundance problem: the number of pages that could
    reasonably be returned as relevant is far too
    large for a human user to digest.
  • Goal: select a small set of the most
    authoritative or definitive pages from the huge
    collection of relevant pages.

6
Authoritative Pages
  • Given a particular page, how do we tell whether
    it is authoritative?
  • The problem is related to the limitations of
    text-based analysis.
  • Text-based ranking functions
  • E.g., for the query "harvard", www.harvard.edu is
    the proper authoritative page, but many other
    web pages may contain "harvard" more often.
  • The most popular pages are often not sufficiently
    self-descriptive
  • The term "search engine" usually doesn't appear
    on the home pages of search engines such as
    Yahoo, AltaVista, and Excite.
  • The Honda and Toyota home pages hardly contain
    the term "automobile manufacturer".

7
Analysis of link structure
  • Hyperlinks encode a latent human judgment which
    can be used to formulate a notion of authority.
  • Creation of a link represents a concrete
    indication of the following type of judgment
  • The creator of page p, by including a link to
    page q, has in some measure conferred authority
    on q.
  • Opportunity for the user to find potential
    authorities purely through the pages that point
    to them.
  • Potential pitfalls of this concept
  • Many links are created for navigational
    purposes (e.g., main menus) or as paid ads.
  • It is difficult to balance relevance against
    popularity (e.g., Yahoo).

8
Authorities and Hubs
  • Authorities are pages that are recognized as
    providing significant, trustworthy, and useful
    information on a topic.
  • Hubs are index pages that provide lots of useful
    links to relevant content pages (topic
    authorities).
  • In-degree: the number of links pointing to a
    page; one simple measure of authority.
  • Out-degree: the number of links from a page to
    other pages.

9
Can we operate over the entire WWW?
  • Local approaches: deal with intranets, where the
    amount of data is much smaller than the WWW as a
    whole.
  • Clustering approach: dissects a heterogeneous
    population into subpopulations that are in some
    way more cohesive, but the underlying problem of
    filtering a vast number of pages remains.
  • Authoritative approach: global in nature
  • Perform a search on a text-based WWW search engine
  • Distill the broad topic from these pages via the
    discovery of authorities.

10
Overview of search steps
11
Overview
  • Search string
  • Text Search Engine
  • Authoritative Approach
  • Constructing focused subgraph
  • Computing Hubs and Authorities
  • Better quality search result

12
Constructing Subgraph
  • The collection V of hyperlinked pages can be
    viewed as a directed graph G = (V, E): nodes
    correspond to pages, and a directed edge
    (p, q) ∈ E indicates the presence of a link from
    p to q.
  • Construct a focused subgraph S_σ of the WWW for
    the query string σ with the following properties:
  • 1. S_σ is relatively small (so that computation
    is affordable)
  • 2. S_σ is rich in relevant pages (so that it is
    easier to find good authorities)
  • 3. S_σ contains most (or many) of the strongest
    authorities

13
How to find S_σ?
  • Set Q_σ: the set of all pages containing the query
    string σ.
  • Root set R_σ: the t highest-ranked pages for the
    query σ returned by a text-based search engine.
    It satisfies properties 1 and 2.
  • Problems with R_σ
  • R_σ is a subset of Q_σ, and even Q_σ often does
    not satisfy property 3.
  • There are extremely few links between pages in
    R_σ, rendering it essentially structureless.
  • However, a strong authority for the query is
    quite likely to be pointed to by at least one
    page in R_σ.
  • Construct the base set S_σ by extending the root
    set R_σ to include:
  • All pages linked to by pages in R_σ
  • All pages that link to a page in R_σ, limited to
    at most d such pages per page in R_σ

14
Subgraph algorithm
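
The expansion from root set to base set described on the previous slide can be sketched as follows. This is a minimal illustration, not the slide's original figure: the lookup tables `out_links` and `in_links` and the page names are hypothetical stand-ins for whatever a search engine or crawler would supply, and taking the first d in-linking pages is an assumption (any d of them would do).

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand a root set R into a base set S.

    out_links[p]: pages that p links to (hypothetical lookup table).
    in_links[p]:  pages that link to p (hypothetical lookup table).
    d:            cap on in-linking pages added per root page.
    """
    base = set(root)
    for p in root:
        # Add every page that a root page points to.
        base.update(out_links.get(p, ()))
        # Add at most d pages that point to this root page.
        pointing = list(in_links.get(p, ()))
        base.update(pointing[:d])
    return base


# Hypothetical toy data: two root pages, a few neighbours.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}

base = build_base_set(["r1", "r2"], out_links, in_links, d=2)
print(sorted(base))
```

With d = 2, the page h3 is left out because r1 has already contributed two in-linking pages.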

15
Observations and Heuristics
  • Heuristic 1: delete all intrinsic links; keep
    all transverse links
  • Intrinsic links
  • A link is intrinsic if it is between pages with
    the same domain name.
  • Such links are generally for navigation purposes.
  • They are less informative and often contain
    repetitive information.
  • A link is transverse if it is between pages with
    different domain names.
  • Heuristic 2: limit collusion by allowing only
    about 4 to 8 pages from a single domain to point
    to any given page p
  • Collusion: a large number of pages from a single
    domain all point to a single page p.
  • Generally used for mass endorsement,
    advertisement, etc.
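
The two heuristics above can be sketched as a single link filter. This is an illustration under assumptions: the host part of the URL (via `urllib.parse.urlparse`) is used as the "domain name", the cap `m` stands for the 4 to 8 limit, and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    # Assumption: the URL's host stands in for the page's domain name.
    return urlparse(url).netloc.lower()


def prune_links(edges, m=4):
    """Filter (source_url, target_url) pairs.

    Heuristic 1: drop intrinsic links (same domain on both ends).
    Heuristic 2: keep at most m links from one domain to one target page.
    """
    kept = []
    per_domain = defaultdict(int)  # (source domain, target page) -> links kept
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue  # intrinsic link, likely navigational
        key = (domain(src), dst)
        if per_domain[key] >= m:
            continue  # this domain already endorses dst m times
        per_domain[key] += 1
        kept.append((src, dst))
    return kept


# Hypothetical example links.
edges = [
    ("http://a.example/index", "http://a.example/about"),  # intrinsic
    ("http://a.example/p1", "http://b.example/"),
    ("http://a.example/p2", "http://b.example/"),
    ("http://a.example/p3", "http://b.example/"),          # over the cap
    ("http://c.example/x", "http://b.example/"),
]
kept = prune_links(edges, m=2)
print(len(kept))
```

Here three links survive: two from a.example (the cap) and one from c.example.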

16
Overview
  • Search string
  • Text Search Engine
  • Authoritative Approach
  • Constructing focused subgraph
  • Computing Hubs and Authorities
  • Better quality search result

17
Computing Hubs and Authorities
  • The simplest approach would be to order pages by
    in-degree
  • Problem
  • The nodes with the highest in-degree in the base
    set
  • might not be authorities and may lack any
    thematic unity.
  • might simply be universally popular pages such as
    Yahoo or Google.
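
The in-degree baseline is easy to state in code, which also makes its weakness visible. The edge list and page names below are hypothetical; the sketch only shows that a universally popular page wins regardless of topic.

```python
from collections import Counter


def rank_by_in_degree(edges):
    """Order pages by the number of links pointing to them."""
    in_degree = Counter(dst for _, dst in edges)
    return in_degree.most_common()


# Hypothetical base set: "portal" is linked from everywhere,
# "topic_page" is the page actually about the query topic.
edges = [
    ("h1", "portal"), ("h2", "portal"), ("h3", "portal"),
    ("h1", "topic_page"), ("h2", "topic_page"),
]
ranking = rank_by_in_degree(edges)
print(ranking)
```

The generic portal outranks the on-topic page simply because more pages link to it, which is the problem the hub and authority weights are meant to address.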

18
Computing Hubs and Authorities
  • Observation
  • Good sources of content (authorities)
  • Good sources of links (hubs)
  • True authority pages are pointed to by a number
    of good hubs.
  • Mutually reinforcing relationship
  • Hubs point to lots of authorities.
  • Authorities are pointed to by lots of hubs.
  • We will use an iterative algorithm to break this
    circularity.
  • Terms
  • Good hub: a page that points to many good
    authorities.
  • Good authority: a page pointed to by many good
    hubs.

19
Overview of Algorithm
20
Iterative Algorithm
  • An iterative algorithm
  • With each page p, we associate
  • a non-negative authority weight x<p>
  • a non-negative hub weight y<p>
  • Weights of each type are normalized so that their
    squares sum to 1
  • Pages with larger x- and y-values are viewed as
    better authorities and hubs, respectively.

21
Iterative Algorithm
  • If p points to many pages with large x-values,
    then it should receive a large y-value
  • If p is pointed to by many pages with large
    y-values, then it should receive a large x-value
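  • I operation (in-links): x<p> ← sum of y<q> over
    all pages q that link to p
  • O operation (out-links): y<p> ← sum of x<q> over
    all pages q that p links to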
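
The I and O operations, together with the normalization step, can be sketched as a short loop. This is a minimal illustration assuming the graph is given as a list of (source, target) pairs; the function names and the four-page example graph are hypothetical.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)


def iterate(pages, edges, k=20):
    """Apply the I and O operations k times; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_x[dst] += y[src]
        # O operation: hub weight of p = sum of (updated) authority weights
        # of the pages p links to.
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges, k=20)
top_authority = max(x, key=x.get)
top_hub = max(y, key=y.get)
print(top_authority, top_hub)
```

In this toy graph a1 ends up as the strongest authority (two hubs point to it) and h1 as the strongest hub (it points to both authorities).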

22
Algorithm
23
Matrices Basics

24
Observations
  • As one applies Iterate with arbitrarily large k,
    the vectors x_k and y_k converge to fixed points
    x* and y*.
  • Let G = (V, E), with V = {p1, p2, ..., pn},
    and let A denote the adjacency matrix of the
    graph G: the (i, j)th entry of A is 1 if
    (pi, pj) is an edge of G, and 0 otherwise.
  • x* is the principal eigenvector of A^T A, and
  • y* is the principal eigenvector of A A^T
  • The convergence of Iterate is quite rapid (k = 20
    is sufficient)
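
The eigenvector view can be checked on a small example by running plain power iteration on A^T A and A A^T. The sketch below uses pure-Python matrix helpers and a hypothetical four-page graph (p1 links to p3 and p4, p2 links to p3); it is an illustration of the relationship, not the presenters' code.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def power_iteration(M, k=50):
    """Repeatedly apply M to a positive start vector and renormalize."""
    v = unit([1.0] * len(M))
    for _ in range(k):
        v = unit(matvec(M, v))
    return v


# Adjacency matrix: entry (i, j) is 1 if page i links to page j.
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
At = transpose(A)
AtA = matmul(At, A)

x = power_iteration(AtA)            # authority weights
y = power_iteration(matmul(A, At))  # hub weights

# Rayleigh quotient: the eigenvalue that x belongs to.
lam = sum(a * b for a, b in zip(x, matvec(AtA, x)))
print(x, y, lam)
```

Here the nonzero block of A^T A is [[2, 1], [1, 1]], so the eigenvalue found should be (3 + sqrt(5)) / 2, and p3 receives the largest authority weight.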

25
Observations
  • Any eigenvector algorithm can be used to compute
    the fixed points x* and y*
  • Iterating the I and O operations emphasizes the
    underlying motivation of the approach
  • It is not necessary to iterate I and O to
    convergence
  • One can start from initial vectors x0 and y0 and
    compute using a fixed number of I and O
    operations

26
Example Mini Web

27
Example Mini Web (Cont..)

28
Basic Results

29
Observations
  • Pure analysis of link structure
  • The text is ignored when searching for
    authoritative pages.
  • i.e., the text-based search only supplies the
    initial set
  • The pages found can legitimately be considered
    authoritative in the context of the WWW, without
    access to a large-scale index of the WWW
  • i.e., global analysis of the full WWW link
    structure can be replaced by a local method over
    a small focused subgraph
  • This approach can replace the local approaches
    used in intranets

30
Similar page queries
  • Example: Find pages similar to honda.com
  • Using link structure to infer a notion of
    similarity among pages
  • We have found a page p that is of interest, and
    it is an authoritative page on a topic.
  • Can this help in finding similar pages?
  • What do users of the WWW consider to be related
    to p when they create pages and links ?

31
Similar page queries
  • Previously, our request to the search engine was
  • Find t pages containing the string σ.
  • Now our request to the search engine is
  • Find t pages pointing to p.
  • R_p: root set
  • S_p: base set
  • G_p: focused subgraph
  • The strongest authorities in the local region of
    the link structure near p can serve as a
    broad-topic summary of the pages related to p.
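
The only change for a similar-page query is how the root set is chosen. A minimal sketch, assuming the link graph is available as (source, target) pairs; the function name and all page names other than honda.com are hypothetical.

```python
def root_set_for_page(p, edges, t=200):
    """Return up to t pages that point to p (the root set R_p)."""
    root = []
    for src, dst in edges:
        if dst == p and src not in root:
            root.append(src)
            if len(root) == t:
                break
    return root


# Hypothetical link data around honda.com.
edges = [
    ("fan_site", "honda.com"),
    ("car_reviews", "honda.com"),
    ("car_reviews", "toyota.com"),
    ("news_site", "weather_page"),
]
root = root_set_for_page("honda.com", edges, t=10)
print(root)
```

From R_p onward, the base set expansion and the hub and authority computation proceed exactly as for a text query.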

32
Results- Similar page queries
33
Multiple Sets of Hubs and Authorities
  • There may be several densely linked collections
    of hubs and authorities within the same set.
  • Examples
  • "jaguar" has several different meanings.
  • "randomized algorithms" arises in multiple
    technical communities.
  • "abortion" involves groups that may not be
    linked to each other.
  • Clustering in the presence of the abundance
    problem is needed.

34
Multiple Sets of Hubs and Authorities
  • The non-principal eigenvectors provide a way to
    extract additional densely linked collections
    of hubs and authorities.
  • Non-principal eigenvectors will have both
    positive and negative entries.
  • Often, the highly positive entries will
    correspond to one cluster of pages and the
    negative entries to a different cluster.
  • Typically the two clusters will not be tightly
    intertwined.
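
One way to see the positive and negative ends is to compute a second eigenvector by deflation: find the principal eigenvector of A^T A, subtract its contribution, and run power iteration again. The sketch below is an illustration under assumptions, not the paper's procedure: the hub lists are hypothetical (two communities joined by one "bridge" page), and the overall sign of an eigenvector is arbitrary, so only the split between the two groups is meaningful.

```python
import math


def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def power_iteration(M, k=100):
    v = unit([1.0] * len(M))
    for _ in range(k):
        v = unit(matvec(M, v))
    return v


def deflate(M, v):
    """Remove the component of symmetric M along unit eigenvector v."""
    lam = sum(a * b for a, b in zip(v, matvec(M, v)))
    n = len(M)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical hubs and the authority pages each one links to.
hubs = {
    "h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1", "a2"],
    "g1": ["b1", "b2"], "g2": ["b1", "b2"],
    "bridge": ["a2", "b1"],
}
pages = ["a1", "a2", "b1", "b2"]

# Entry (i, j) of A^T A counts the hubs that link to both page i and page j.
M = [[sum(1 for targets in hubs.values() if pi in targets and pj in targets)
      for pj in pages] for pi in pages]

first = power_iteration(M)                   # principal eigenvector
second = power_iteration(deflate(M, first))  # first non-principal eigenvector

for name, w1, w2 in zip(pages, first, second):
    print(name, round(w1, 3), round(w2, 3))
```

In this example the principal eigenvector has all positive entries and is dominated by a1 and a2, while in the second eigenvector the a-pages and the b-pages fall on opposite ends.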

35
Jaguar Example
  • The principal authority eigenvector is primarily
    about the Atari product.
  • At the positive end of the 2nd non-principal
    eigenvector, the pages are primarily about the
    Jacksonville Jaguars.
  • At the positive end of the 3rd non-principal
    eigenvector, the pages are primarily about the
    car.

36
Randomized Algorithms Example
  • The positive end of the first non-principal
    eigenvector returns home pages of theoretical
    computer scientists.
  • The negative end of the first non-principal
    eigenvector returns compendia of mathematical
    software.
  • At the negative end of the fourth non-principal
    eigenvector, the pages are primarily about
    wavelets.

37
Diffusion and Generalization
  • The query may not be sufficiently broad.
  • In this case there will not be enough highly
    relevant pages in the base set to extract a
    sufficiently dense sub-graph of relevant hubs and
    authorities.
  • When this occurs, the collection will often
    represent a broader topic, and the results will
    reflect a diffused version of the initial query.
  • Example: WWW conferences → WWW resource pages.

38
  • In studies conducted in 1998 over 26 queries and
    37 volunteers, Clever reported better authorities
    than Yahoo!, which in turn was better than Alta
    Vista.

39
Conclusion
  • We need a way to distill a broad topic for which
    there may be millions of relevant pages
  • The approach provides high-quality results in the
    context of what is available on the WWW globally
  • It operates without maintaining an index of the
    WWW or its link structure
  • It identifies complex patterns of social
    organization on the WWW.

41
  • Thank You