Mining the Web's Link Structure

Provided by: yuqf

1
Mining the Web's Link Structure
  • Soumen Chakrabarti, IIT Bombay
  • B. Dom, S. Kumar, A. Tomkins, P. Raghavan,
    S. Rajagopalan, IBM Almaden Research Center
  • D. Gibson, UC Berkeley
  • J. Kleinberg, Cornell

Qingfeng YU, Xinlei WANG
2
Outline
  • Brief on Hub/Authority and HITS
  • Search difficulties and improvements
  • Comparison with others
  • Applications
    • Constructing taxonomies semi-automatically
    • Categorization
    • Others
  • Summary

3
A more discerning search engine?
  • Hub/Authority concept
  • Hyperlink-Induced Topic Search (HITS)
    • Web as a directed graph
    • Sampling to get a subgraph rich in relevant
      pages
    • Associate a hub weight and an authority weight
      with each page in the base set
    • Define adjacency matrix A and hub/authority
      vectors x, y for pages 1, 2, ..., n to propagate
      weights
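
The weight propagation on this slide can be sketched as a simple power iteration. This is a minimal illustration, not the presenters' code: the function names `hits` and `normalize`, the fixed iteration count, and the L2 normalization are all assumptions made for the example. Each authority weight is the sum of the hub weights of the pages pointing to it, and each hub weight is the sum of the authority weights of the pages it points to.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit L2 length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


def hits(adj, iterations=50):
    """Hub/authority propagation on an adjacency matrix.

    adj[i][j] == 1 means page i links to page j.
    Returns (hub_weights, authority_weights).
    The iteration count is an arbitrary choice for this sketch.
    """
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hub weights of pages linking to j.
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub of page i: sum of authority weights of pages i links to.
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth
```

In a toy graph where pages 0 and 1 both link to page 2 and only page 0 links to page 3, page 2 ends up with the largest authority weight and page 0 with the largest hub weight.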

4
Problems with HITS
  • Problems
    • A narrow topic
      • HITS returns good resources only for a more
        general topic
    • When a hub discusses multiple topics
      • Authority of one topic is conferred on another
        topic
    • Pages from the same site have common topics or
      links
      • Inadvertent topic hijacking
  • Reason: content is ignored after the root set is
    assembled

5
System Heuristics
  • Solution: combine content with link analysis
    • Use weighted links
    • Use anchor text to boost link weight
    • Assign low weight to links within the same domain
    • Scale down the weights of multiple links from the
      same domain
    • Break a large hub into smaller units
6
Comparison with other search engines
  • Comparison of Clever with AltaVista and Yahoo
    • Yahoo rated best on 19% of topics, Clever on 50%,
      and the two rated the same on the other 31%
  • Comparison with Google
    • Google assigns each page a single PageRank
      score and focuses on authorities only
    • Hubs are useful when learning a new topic
    • Both show similar improvements

7
Applications: Constructing Taxonomies
  • Yahoo's large taxonomy of topics
    • A subject tree
    • Node → a particular topic
    • Populated by relevant pages
  • Use the Clever engine to populate a node
    • Describe each node as a query
    • Use the name or label of the node
    • Use the descriptions of other nodes on the path to
      the root
    • Example: Business/Real Estate/Regional/United
      States/Oregon
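
As a rough illustration of describing a node as a query, the sketch below turns a taxonomy path into query text. The slide does not say how the labels are combined, so the ordering (node label first, then ancestors from nearest to the root) and the function name `node_query` are assumptions.

```python
def node_query(path, separator="/"):
    """Build a query string from a taxonomy path.

    The node's own label comes first, followed by the labels of the
    nodes on the path back to the root (an assumed ordering).
    """
    labels = [part.strip() for part in path.split(separator) if part.strip()]
    node_label, ancestors = labels[-1], labels[:-1]
    return " ".join([node_label] + ancestors[::-1])


print(node_query("Business/Real Estate/Regional/United States/Oregon"))
# prints: Oregon United States Regional Real Estate Business
```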

8
Applications: Constructing Taxonomies
  • Provide exemplary authority or hub pages
    • Add an exemplary hub to the base set along with
      all pages that it points to
    • Add an exemplary authority
  • Can we add all pages that point to the authority
    page to the base set?
    • That would pull in too many irrelevant pages
    • Heuristic 1: add any page pointing to at least
      two exemplary authorities
    • Heuristic 2: set user-designated stop-sites to
      avoid ambiguity
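
The base-set expansion described on this slide might look like the following sketch. The data layout (`out_links` as a dictionary from page URL to the set of URLs it links to) and the function name `expand_base_set` are hypothetical, and treating stop-sites as hosts whose pages are dropped from the result is an assumption about how they are applied.

```python
from urllib.parse import urlparse


def expand_base_set(base, out_links, exemplary_hubs, exemplary_auths,
                    stop_sites=()):
    """Grow a base set using exemplary hubs and authorities.

    out_links maps each page URL to the set of URLs it points to.
    """
    expanded = set(base)

    # An exemplary hub is added along with every page it points to.
    for hub in exemplary_hubs:
        expanded.add(hub)
        expanded |= set(out_links.get(hub, ()))

    # Exemplary authorities are added directly.
    auths = set(exemplary_auths)
    expanded |= auths

    # Heuristic 1: add any page pointing to at least two exemplary
    # authorities, rather than every page pointing to one of them.
    for page, targets in out_links.items():
        if len(auths & set(targets)) >= 2:
            expanded.add(page)

    # Heuristic 2 (assumed behavior): drop pages hosted on stop-sites.
    stops = set(stop_sites)
    return {p for p in expanded if urlparse(p).netloc not in stops}
```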

9
Applications: Constructing Taxonomies
  • In conclusion, a topic node is described by
    • query terms, exemplary authority and hub pages,
      and (optionally) stop-sites
  • Advantages
    • Resources for each node can be refreshed as often
      as we please
    • Higher quality than with purely textual queries

10
Applications: Categorization
  • Assign web pages to categories
  • Hyperlinks contain high-quality semantic clues to
    a page's topic. However, the link information is
    highly noisy.
  • How does HyperClass solve it?
    • Use Markov random fields (MRF), because pages on
      related topics tend to be linked more
      frequently
    • Use a relaxation technique to iteratively adjust
      the category labels
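
The idea of iteratively adjusting category labels can be illustrated with a simplified relaxation-labeling loop. This is not the HyperClass implementation: the update rule (text-based class probability multiplied by the expected compatibility with each neighbor's current label distribution), the fixed number of rounds, and all names (`relax_labels`, `text_prob`, `neighbors`, `compat`) are assumptions chosen for the sketch.

```python
def relax_labels(text_prob, neighbors, compat, rounds=10):
    """Iteratively refine per-page class probabilities using links.

    text_prob[page][c]: probability of class c from the page text alone.
    neighbors[page]:    pages linked with this page.
    compat[c][d]:       how likely a class-c page is linked to a class-d page.
    """
    classes = list(compat)
    current = {page: dict(probs) for page, probs in text_prob.items()}

    for _ in range(rounds):
        updated = {}
        for page, prior in text_prob.items():
            scores = {}
            for c in classes:
                score = prior[c]
                # Each neighbor contributes its expected compatibility
                # with class c under its current label distribution.
                for nb in neighbors.get(page, ()):
                    score *= sum(compat[c][d] * current[nb][d] for d in classes)
                scores[c] = score
            total = sum(scores.values())
            if total > 0:
                updated[page] = {c: s / total for c, s in scores.items()}
            else:
                updated[page] = dict(prior)
        current = updated
    return current
```

A page whose text is ambiguous but which is linked to two pages that are confidently about one topic is pulled toward that topic after a few rounds.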

11
Other Applications
  • Discovering cybercommunities
  • Citation analysis

12
Summary
  • HITS: difficulties and improvements
  • Clever: comparison with others
  • Applications

 
13
Q & A