The PageRank Citation Ranking: Bringing Order to the Web

1
The PageRank Citation Ranking: Bringing Order to
the Web
by TANSEL ZENGINLER 2004720371
2
Contents
  • Introduction and Motivation
  • Related Work
  • PageRank
  • Implementation
  • Properties
  • Searching
  • Personalized PageRank
  • Applications
  • Conclusion

3
Introduction
  • World Wide Web: new challenges for information
    retrieval
  • Large (150 million web pages)
  • Heterogeneous
  • Hypertext (information on top of the text of web
    pages)
  • Objective:
    Take advantage of the link structure of the Web to
    produce a global importance ranking, PageRank, that
    helps search engines

4
Web Pages vs. Academic Publications
  • Academic Papers
  • Scrupulously reviewed
  • Well-defined units of work, roughly similar in
    quality and number of citations
  • Web Pages
  • Proliferate free of quality control or publishing
    costs
  • Vary on a much wider scale in quality, usage,
    citations and length

5
Related Work
  • Eugene Garfield
  • William Goffman: information flow in a scientific
    community is an epidemic process
  • James E. Pitkow: "Characterizing World Wide Web
    Ecologies"
  • Ron Weiss, Bienvenido Velez: clustering methods

6
Related Work
  • Ellen Spertus: information that can be obtained
    from link structure
  • Jon Kleinberg: hubs and authorities, based on an
    eigenvector calculation
  • Hope N. Tillman: what does quality mean on the net?
  • Result: backlinks indicate importance, like citations

7
Link Structure
  • The Web has 150 million pages and 1.7 billion links
  • Pages are nodes and links are edges (outedges are
    forward links, inedges are backlinks)

8
Link Structure
  • The Netscape homepage has 62,804 backlinks
  • Highly linked pages are more IMPORTANT
  • Simple citation counting has been used to
    speculate on the future winners of the Nobel
    Prize

9
Propagation of Ranking
  • PageRank: a page has high rank if the sum of the
    ranks of its backlinks is high

10
Simplified PageRank
11
Simplified PageRank
12
Simplified PageRank
13
Simplified PageRank
  • A is a square matrix with rows and columns
    corresponding to web pages
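
The formulas on the preceding slides did not survive in the transcript. As a minimal sketch, assuming the simplified definition from the original PageRank paper (each page splits its rank evenly over its forward links, and a page's rank is the normalized sum of the shares it receives), the iteration could look like this. The function name `simplified_pagerank`, the toy graph, and the iteration count are illustrative assumptions, not taken from the slides.

```python
def simplified_pagerank(links, iterations=100):
    """Power iteration for simplified PageRank on a small link graph.

    links maps each page to the list of pages it links to (forward links).
    Assumption: every page appears as a key and has at least one forward link.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        # Normalize so that the ranks sum to 1 (the role of the constant c).
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(toy_web)
```

On this toy graph the ranks settle near 0.4 for A and C and 0.2 for B: C collects rank from both A and B, and passes all of it on to A.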

14
Simplified PageRank
Objective: find the dominant eigenvector of A
Problem: two web pages point to each other and
to no other page, and another page points to one of
these two pages. What will happen? (Rank sink)
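
The rank sink can be reproduced on a toy graph. The sketch below is an illustration under assumptions: the four-page graph and the names `step` and `sink_share` are hypothetical. Pages X and Y link only to each other, and page T links into X, so rank flows into the loop and never comes back out.

```python
def step(rank, links):
    """One simplified PageRank step: each page splits its rank over its forward links."""
    new_rank = dict.fromkeys(rank, 0.0)
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


# Hypothetical graph: S <-> T, T -> X, and the closed loop X <-> Y.
links = {"S": ["T"], "T": ["S", "X"], "X": ["Y"], "Y": ["X"]}
rank = dict.fromkeys(links, 0.25)
for _ in range(100):
    rank = step(rank, links)

# Share of the total rank that has accumulated inside the loop.
sink_share = rank["X"] + rank["Y"]
```

After enough steps essentially all of the rank sits in the X/Y loop and S and T are left with almost nothing, which is the problem the rank source E is meant to fix.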
15
Source Rank
In matrix notation, in two equivalent forms (equations not preserved in the transcript)
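
For reference, the rank-source formulation as it appears in the original PageRank paper can be written as follows. This is a reconstruction from the paper, not from the slide itself. Here R' is the rank vector, B_u the set of backlinks of page u, N_v the number of forward links of page v, A the link matrix of the simplified model, E the rank-source vector, and c a constant that is maximized subject to ||R'||_1 = 1. The second matrix form follows from the first because the all-ones row vector times R' equals ||R'||_1 = 1.

```latex
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)
\qquad
R' = c\,(A R' + E)
\qquad
R' = c\,(A + E \times \mathbf{1})\,R'
```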
16
Random Surfer Model
  • Simplified PageRank: a random walk on
    the graph of the Web
  • PageRank: a random surfer who occasionally gets
    bored and jumps to a random page chosen according
    to the distribution in E

17
Computing PageRank
  • S: the starting vector, any vector over Web pages
  • The d factor increases the rate of convergence
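
A minimal sketch of the iteration this slide refers to, under stated assumptions: start from S, propagate rank along links, measure the rank lost in the step as d, hand d back through E, and stop when the L1 change is small. The damping value 0.85, the tolerance, and the names `pagerank`, `links`, `source`, and `start` are assumptions for illustration, not values taken from the slides.

```python
def pagerank(links, source, start=None, damping=0.85, epsilon=1e-10):
    """Iterate R <- damping * A R + d * E until the L1 change is at most epsilon.

    links:  page -> list of forward links (assumed non-empty for every page).
    source: page -> weight in E, the rank-source distribution (sums to 1).
    start:  the initial vector S; uniform if omitted.
    """
    pages = list(links)
    rank = dict(start) if start else {p: 1.0 / len(pages) for p in pages}
    while True:
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = damping * rank[v] / len(targets)
            for u in targets:
                new_rank[u] += share
        # d is the rank lost in this step; it is redistributed according to E.
        d = sum(rank.values()) - sum(new_rank.values())
        for p in pages:
            new_rank[p] += d * source.get(p, 0.0)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= epsilon:
            return rank


# Hypothetical graph with a closed loop X <-> Y and a uniform E.
web = {"S": ["T"], "T": ["S", "X"], "X": ["Y"], "Y": ["X"]}
uniform = {p: 0.25 for p in web}
ranks = pagerank(web, uniform)
```

With a uniform E every page keeps a nonzero rank even though X and Y form a loop, and the result does not depend on the choice of S.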

18
Dangling Links
  • Links that point to a page with no outgoing
    links, often a page that has not been downloaded yet
  • 24 million pages downloaded
  • 51 million pages not downloaded yet
  • Where should their weight be distributed???
  • Remove them from the system until all PageRanks
    are calculated
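
The removal step can be sketched as follows, assuming the link graph fits in memory as a dictionary of sets. Removing a dangling link can leave another page without forward links, so the sketch repeats until nothing changes. The name `remove_dangling` and the example graph are hypothetical.

```python
def remove_dangling(links):
    """Repeatedly drop links that point to pages with no forward links.

    links maps a page to the set of pages it links to. Pages that were never
    downloaded do not appear as keys, so links to them count as dangling.
    Returns the pruned graph and the list of removed (source, target) links.
    """
    graph = {page: set(targets) for page, targets in links.items()}
    removed = []
    while True:
        # Pages with no forward links left cannot pass rank on; drop them.
        graph = {page: targets for page, targets in graph.items() if targets}
        dangling = [(v, u) for v, targets in graph.items()
                    for u in targets if u not in graph]
        if not dangling:
            return graph, removed
        for v, u in dangling:
            graph[v].discard(u)
        removed.extend(dangling)


# Hypothetical crawl: D and E were linked to but never downloaded.
crawl = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"E"}}
core, removed_links = remove_dangling(crawl)
```

Here C loses its only link in the first pass and then becomes a dangling target itself, so the link from A to C goes in the second pass. The `removed_links` list is what would be added back once the ranks of the core graph have been computed.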

19
Implementation
  • Built as part of the Stanford WebBase project
  • Complete crawling and indexing system with a
    repository
  • 24 million web pages, processed at about 50 pages
    per second
  • About 11 links per page on average, so about 550
    links per second to process

20
Algorithm
  • Store each URL as a unique integer ID
  • Sort the link structure by parent ID
  • Remove dangling links
  • Assign initial values for ranks
  • Compute ranks
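
The first two steps can be sketched as below, assuming the links arrive as (parent URL, child URL) pairs and fit in memory. The function name `build_link_table` and the example URLs are hypothetical; a real crawl of this size would use on-disk sorting rather than a Python list.

```python
def build_link_table(url_links):
    """Map each URL to an integer ID and return the links sorted by parent ID."""
    ids = {}

    def url_id(url):
        # Assign the next free integer the first time a URL is seen.
        return ids.setdefault(url, len(ids))

    table = [(url_id(parent), url_id(child)) for parent, child in url_links]
    table.sort()  # sorts by parent ID first, then by child ID
    return ids, table


# Hypothetical crawl output as (parent URL, child URL) pairs.
raw_links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://a.example/", "http://b.example/"),
]
ids, table = build_link_table(raw_links)
```

Sorting by parent ID groups all forward links of a page together, so one pass over the table is enough to distribute each page's rank.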

21
Algorithm
  • Iterate until convergence
  • Add dangling links back in
  • Recompute ranks, iterating as many times as it
    took to remove the dangling links
  • The whole process takes about 5 hours!

22
Convergence Properties
23
Convergence Properties
  • Scales very well: the scaling factor is roughly
    linear in log n
  • The Web is an expander-like graph: every (not too
    large) subset of nodes S has a neighborhood larger
    than alpha times |S|
  • A graph expands well when its largest eigenvalue is
    sufficiently larger than the second-largest eigenvalue

24
Searching with PageRank
  • Title-based search engine
  • Full text search engine called Google

25
Title-based Search Engine
  • Repository of 16 million web pages
  • Algorithm:
  • Find pages whose titles contain all of
    the query words
  • Sort the results by PageRank
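
The two steps above can be sketched in a few lines. This is a sketch under assumptions: titles are matched case-insensitively on whitespace-separated words, and the name `title_search`, the example pages, and their scores are made up for illustration.

```python
def title_search(query, titles, scores):
    """Return URLs whose title contains every query word, highest PageRank first."""
    words = query.lower().split()
    hits = [url for url, title in titles.items()
            if all(word in title.lower().split() for word in words)]
    return sorted(hits, key=lambda url: scores[url], reverse=True)


# Hypothetical titles and PageRank scores.
titles = {
    "http://a.example/": "Stanford University Homepage",
    "http://b.example/": "University Rankings",
    "http://c.example/": "Stanford University Libraries",
}
scores = {
    "http://a.example/": 0.5,
    "http://b.example/": 0.3,
    "http://c.example/": 0.2,
}
results = title_search("stanford university", titles, scores)
```

The query only decides which pages qualify. The ordering comes entirely from the precomputed PageRank scores.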

26
Title-based Search Engine
27
Full text Search Engine
28
Personalized PageRank
  • E is an important component of PageRank
  • E is a vector over the web pages that a random
    surfer periodically jumps to
  • Choosing E to favor a user's own pages gives a
    personalized PageRank
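
As an illustration of the personalized case, the sketch below puts all of E on a single page, so every random jump lands there. The jump probability 0.15, the ring-shaped graph, and the name `personalized_pagerank` are assumptions for illustration.

```python
def personalized_pagerank(links, home, jump=0.15, iterations=200):
    """PageRank where every random jump lands on the single page `home`.

    links maps each page to its forward links (assumed non-empty for every page).
    """
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in links}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += (1 - jump) * rank[v] / len(targets)
        new_rank[home] += jump  # E puts all of its weight on one page
        rank = new_rank
    return rank


# Hypothetical ring of four pages: A -> B -> C -> D -> A.
ring = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
seen_from_a = personalized_pagerank(ring, "A")
seen_from_c = personalized_pagerank(ring, "C")
```

In the ring all pages are structurally identical, so the ranking is decided entirely by E: the chosen page ranks highest and rank falls off with link distance from it.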

29
Manipulation by Commercial Use
  • To get a high PageRank, a page must either
  • Convince an important page to link to it
  • Get a lot of non-important pages to link to it
  • Buying advertisements (links) is possible, but costs
    money
  • Compromise between uniform E and single-page E: let
    E consist of all root-level pages of all web
    servers. Manipulating it would require creating a
    large number of root-level servers

30
Applications
  • Estimating Web traffic
  • Backlink Predictor
  • User Navigation

31
Estimating Web traffic
  • Compare PageRank with usage data from the NLANR
    proxy cache
  • Not easy to compare NLANR data with PageRank
  • Differences point to things that users like to look
    at, but do not mention on their web pages
  • Usage data can serve as the initial vector for
    PageRank

32
Backlink Predictor
  • PageRank is a predictor of future citations
    (better than citation counts themselves)
  • Start with only one page and crawl the whole Web,
    fetching pages in order of rank
  • Citation counts are not known until the whole
    Web has been crawled
  • Use PageRank instead
  • PageRank avoids the local maxima that citation
    counting gets stuck in

33
PageRank Proxy
  • A proxy shows the PageRank of each link in the page
  • Helpful:
  • For looking at the results from other search
    engines
  • For deciding which links are interesting
  • For judging a page by its importance, or searching
    within a given rank range

34
PageRank Proxy
35
Conclusion
  • Global ranking of all web pages
  • Order search results
  • Use in applications
  • Traffic estimation
  • User Navigation

36
QUESTIONS??
37
Thank you
  • HAPPY NEW YEAR