CS246 - PowerPoint PPT Presentation

Provided by: junghoo

1
CS246
  • Page Ranking

2
Today's Topic
  • Page Ranking
  • TFIDF (Term frequency inverse document frequency)
    vector and cosine similarity
  • PageRank
  • Hub/Authority

3
Main Problem
  • What page should we return for the query "Stanford
    University"?
  • Any idea?

4
Traditional IR Measure
  • If a page mentions "Stanford" and "University"
    many times, the page is relevant
  • TF (Term frequency): number of times that a word
    occurs in a document
  • Page A: Stanford - 100, University - 100
  • Page B: Stanford - 10, University - 10
  • Are "Stanford" and "University" equally significant?
  • Page A: Stanford - 100, University - 10
  • Page B: Stanford - 10, University - 100

5
Inverse Document Frequency
  • Rare words are more significant than common words
  • IDF (Inverse document frequency): inverse of the
    number of documents containing the word
  • "Stanford" is considered more significant than
    "University"
  • TFIDF = TF × IDF
  • Pages with many rare words considered relevant

6
TFIDF Vector
  • Every unique word corresponds to one dimension in
    the TFIDF vector
  • Di = (TF1·IDF1, TF2·IDF2, …, TFn·IDFn)
  • n: total number of unique words in the document corpus
  • TFj = 0 if word j is not in the document
  • More precisely,
  • Similarly, we construct TFIDF vector Q for the
    query

7
Cosine Similarity
  • Examples
  • Q Stanford, Stanford ? Di
  • Q: "Stanford", D1: "Stanford", D2:
    "Stanford, MIT"
  • How do we compute cosine similarity efficiently?
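The Q, D1, D2 example can be worked through with a minimal sketch. It assumes IDF is simply 1 / (number of documents containing the word), as on the IDF slide, and a corpus of just these two documents. The helper names tfidf and cosine are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def tfidf(words, idf):
    """Sparse TFIDF vector: word -> TF * IDF (words unknown to idf get 0)."""
    tf = Counter(words)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Assumed toy corpus matching the slide example.
docs = {"D1": ["stanford"], "D2": ["stanford", "mit"]}

# Document frequency: number of documents containing each word.
df = Counter(w for words in docs.values() for w in set(words))
idf = {w: 1.0 / count for w, count in df.items()}

query = tfidf(["stanford"], idf)
scores = {name: cosine(query, tfidf(words, idf)) for name, words in docs.items()}
# D1 scores 1.0; D2 scores 1/sqrt(5), about 0.447, because "mit" pulls its
# vector away from the query direction.
```

Under these assumptions D1 ranks above D2, even though both contain the query word the same number of times.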

8
Inverted Index
  • Q · Di = 0 if Di has no query words
  • Consider only the documents with query words
  • Inverted index: Word → Document
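A minimal sketch of the idea, assuming documents are already tokenized into word lists. The function names build_inverted_index and candidates, and the toy documents, are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index

def candidates(index, query_words):
    """Documents containing at least one query word.

    Every other document has Q . Di = 0 and can be skipped.
    """
    result = set()
    for word in query_words:
        result |= index.get(word, set())
    return result

docs = {
    "D1": ["stanford"],
    "D2": ["stanford", "mit"],
    "D3": ["ucla"],
}
index = build_inverted_index(docs)
hits = candidates(index, ["stanford"])  # D3 is never touched
```

Cosine similarity then only needs to be computed for the documents returned by candidates.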

9
Problems of TFIDF Vector
  • Works well on small controlled corpus, but not on
    the Web
  • Top result for the "American Airlines" query:
    accident report of American Airlines flights
  • Do users really care how many times "American
    Airlines" is mentioned?
  • Easy to spam
  • Ranking purely based on page content
  • Authors can manipulate page content to get high
    ranking
  • Any idea?

10
Link-based Ranking
  • People expect to get the AA home page for the query
    "American Airlines"
  • Many pages point to the AA home page, but not to
    the accident report
  • Use link-count!

11
Simple Link Count
  • Still easy to spam
  • Create many pages and add links to a page
  • How to avoid spam?

12
PageRank
  • A page is important if it is pointed to by many
    important pages
  • PR(p) = PR(p1)/n1 + … + PR(pk)/nk, where pi is a page
    pointing to p and ni is the number of links in pi
  • PageRank of p is the sum of PageRanks of its
    parents
  • One equation for every page
  • N equations, N unknown variables

13
Example: Web of 1842
  • Netscape, Microsoft and Amazon

PR(n) = PR(n)/2 + PR(a)/2
PR(m) = PR(a)/2
PR(a) = PR(n)/2 + PR(m)
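The three equations can be solved directly as a linear system. The sketch below assumes the page order (n, m, a) and adds one normalization that the equations themselves do not fix: the PageRanks sum to 3, one unit of importance per page. Since the three equations are linearly dependent, one of them is replaced by that normalization.

```python
import numpy as np

# Row i holds the coefficients of the equation for page i; order is (n, m, a).
M = np.array([
    [0.5, 0.0, 0.5],  # PR(n) = PR(n)/2 + PR(a)/2
    [0.0, 0.0, 0.5],  # PR(m) = PR(a)/2
    [0.5, 1.0, 0.0],  # PR(a) = PR(n)/2 + PR(m)
])

# (M - I) PR = 0 has infinitely many solutions, so swap the last
# equation for the assumed normalization PR(n) + PR(m) + PR(a) = 3.
system = M - np.eye(3)
system[2, :] = 1.0
rhs = np.array([0.0, 0.0, 3.0])

pr = np.linalg.solve(system, rhs)
# pr is [1.2, 0.6, 1.2]: Netscape and Amazon tie, Microsoft gets half as much.
```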
14
PageRank: Matrix Notation
  • Web graph matrix M = [mij]
  • Each page i corresponds to row i and column i of
    the matrix M
  • mij = 1/n if page i is one of the n children of
    page j; mij = 0 otherwise
  • PageRank vector: PR = (PR(p1), …, PR(pN))
  • PageRank equation: PR = M · PR

15
PageRank: Iterative Computation
  • Initially every page has a unit of importance
  • At each round, each page shares its importance
    among its children and receives new importance
    from its parents
  • Eventually the importance of each page reaches a
    limit
  • Stochastic matrix
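The rounds described above can be sketched as repeated matrix-vector multiplication. This assumes the three-page Netscape/Microsoft/Amazon example with page order (n, m, a); the round count of 100 is an arbitrary choice.

```python
import numpy as np

# Column j says how page j splits its importance among its children.
M = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.0],
])

pr = np.ones(3)        # initially every page has one unit of importance
for _ in range(100):   # each round: share with children, receive from parents
    pr = M @ pr

# pr approaches the limit [1.2, 0.6, 1.2]; the total stays at 3 because
# every column of M sums to 1 (M is a stochastic matrix).
```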

16
Example: Web of 1842
[Figure: link graph of Netscape (Ne), Microsoft (MS), and Amazon (Am)]
17
PageRank: Eigenvector
  • PageRank equation: PR = M · PR
  • The PageRank vector PR is the principal eigenvector of M
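As a hedged illustration, the principal eigenvector can be read off with NumPy's general eigen-solver. The matrix is the assumed three-page example in page order (n, m, a), and scaling the result to a total of 3 is a convention chosen here, not something the eigenvector itself determines.

```python
import numpy as np

M = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.0],
])

eigvals, eigvecs = np.linalg.eig(M)
k = np.argmax(eigvals.real)      # principal eigenvalue (1 for a stochastic M)
v = eigvecs[:, k].real

# An eigenvector is only defined up to scale (and sign), so rescale it
# to one unit of importance per page.
pr = 3.0 * v / v.sum()
```

For a Web-sized matrix a full eigen-decomposition is impractical, which is why the iterative computation is used instead.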

18
PageRank Random Surfer Model
  • PageRank of a page: the probability that a Web surfer
    reaches the page after many clicks, following random links
[Figure: Random Click]
19
Problems on the Real Web
  • Dead end
  • A page with no links to send importance
  • All importance leaks out of the Web
  • Crawler trap
  • A group of one or more pages that have no links
    out of the group
  • Accumulates all the importance of the Web

20
Example: Dead End
  • No link from Microsoft

[Figure: link graph of Ne, MS, and Am, with MS marked as a dead end]
21
Example: Dead End
[Figure: link graph of Ne, MS, and Am]
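A small sketch of the leak, assuming the same three-page graph in page order (n, m, a) but with Microsoft's outgoing link removed, so its column in the matrix is all zeros.

```python
import numpy as np

# Microsoft (middle column) has no children: it passes importance to nobody.
M_dead = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
])

pr = np.ones(3)
totals = []
for _ in range(100):
    pr = M_dead @ pr
    totals.append(pr.sum())   # total importance left in the Web

# totals starts at 2.0 after the first round and keeps shrinking:
# whatever reaches Microsoft is lost in the next round.
```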
22
Solution to Dead End
  • Assume a surfer jumps to a random page at a
    dead end

[Figure: link graph of Ne, MS, and Am]
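One way to express the random jump in matrix form is to replace each all-zero column with a uniform column, so a dead end shares its importance equally with every page. This is a sketch under that assumption; fix_dead_ends is a hypothetical helper name, and the page order is (n, m, a).

```python
import numpy as np

def fix_dead_ends(M):
    """Return a copy of M where every all-zero column becomes uniform 1/N."""
    fixed = M.copy()
    n_pages = fixed.shape[0]
    dead = fixed.sum(axis=0) == 0   # columns of pages with no outgoing links
    fixed[:, dead] = 1.0 / n_pages
    return fixed

M_dead = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
])

M_fixed = fix_dead_ends(M_dead)
pr = np.ones(3)
for _ in range(100):
    pr = M_fixed @ pr

# Importance no longer leaks: the total stays at 3 and pr converges to
# [18/13, 9/13, 12/13].
```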
23
Example: Crawler Trap
  • Only self-link at Microsoft

[Figure: link graph of Ne, MS, and Am, with MS marked as a crawler trap]
24
Example: Crawler Trap
[Figure: link graph of Ne, MS, and Am]
25
Crawler Trap: Damping Factor
  • Tax each page some fraction of its importance
    and distribute it equally
  • Probability to jump to a random page
  • Assuming 20% tax
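The effect of the tax can be sketched on the crawler-trap example, assuming page order (n, m, a) and Microsoft linking only to itself. The function name pagerank and its parameters are illustrative.

```python
import numpy as np

def pagerank(M, tax=0.0, rounds=200):
    """Iterate PR = (1 - tax) * M * PR + (collected tax shared equally)."""
    n_pages = M.shape[0]
    pr = np.ones(n_pages)
    for _ in range(rounds):
        collected = tax * pr.sum()            # tax taken from all pages
        pr = (1.0 - tax) * (M @ pr) + collected / n_pages
    return pr

# Microsoft (middle column) keeps all of its importance for itself.
M_trap = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
])

no_tax = pagerank(M_trap, tax=0.0)    # Microsoft absorbs everything: [0, 3, 0]
with_tax = pagerank(M_trap, tax=0.2)  # [7/11, 21/11, 5/11]
```

With the 20% tax Microsoft still ranks highest, but Netscape and Amazon keep a nonzero share.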

26
Anti-Spamming at Search Engines
  • Anchor text
  • Consider what others think about your page
  • Give higher weights to anchors from high PageRank
    pages
  • More difficult to spam
  • PageRank
  • To gain importance, you need to convince many
    important people
  • More difficult to spam
  • Consider inter-site links with higher weight

27
Hub and Authority
  • More detailed evaluation of importance
  • A page is useful if
  • It has good contents or
  • It has links to useful pages (good bookmark)
  • Hub/Authority
  • Authority: pages with good contents
  • Hub: pages pointing to good content pages

28
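Hub/Authority: Definition
  • Recursive definition similar to PageRank
  • Authority pages are linked to by many hub pages
  • Hub pages link to many authority pages
  • H(p) = A(p1) + … + A(pk), where p1, …, pk are the
    pages that p points to
  • A(p) = H(p1) + … + H(pm), where p1, …, pm are the
    pages pointing to p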

29
Hub/Authority: Matrix Notation
  • Web graph matrix A = [aij]
  • Each page i corresponds to row i and column i of
    the matrix A
  • aij = 1 if page i points to page j; aij = 0
    otherwise
  • A is not a stochastic matrix
  • A^T is similar to the PageRank matrix M, without
    the stochastic restriction

30
Example: Web of 1842
  • n, m, a vector

31
Hub/Authority: Iterative Computation
  • Hub vector h and authority vector a: h = λ·A·a, a = μ·A^T·h
  • λ: scaling factor to prevent divergence
  • μ: scaling factor to prevent divergence
  • Compute h and a iteratively with scaling
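A minimal sketch of the scaled iteration, assuming the three-page example with page order (n, m, a) and links Netscape → {Netscape, Amazon}, Microsoft → {Amazon}, Amazon → {Netscape, Microsoft}. Scaling by the largest entry each round is one possible choice of scaling factor, not necessarily the one used in the lecture.

```python
import numpy as np

# link[i, j] = 1 if page i points to page j.
link = np.array([
    [1.0, 0.0, 1.0],  # Netscape  -> Netscape, Amazon
    [0.0, 0.0, 1.0],  # Microsoft -> Amazon
    [1.0, 1.0, 0.0],  # Amazon    -> Netscape, Microsoft
])

def hub_authority(link, rounds=50):
    """Alternate the hub and authority updates, rescaling after each one."""
    hub = np.ones(link.shape[0])
    auth = np.ones(link.shape[0])
    for _ in range(rounds):
        auth = link.T @ hub       # authority: sum of hubs pointing to the page
        auth = auth / auth.max()  # scaling keeps the values from diverging
        hub = link @ auth         # hub: sum of authorities the page points to
        hub = hub / hub.max()
    return hub, auth

hub, auth = hub_authority(link)
```

In this particular graph the link matrix happens to be symmetric, so the hub and authority scores coincide; on a general graph they differ.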

32
Hub/Authority: Eigenvector
  • Hub vector h: eigenvector of A·A^T; authority vector a:
    eigenvector of A^T·A

33
Example: Web of 1842
34
Hub/Authority and Root Set
  • Apply the equations on a small neighborhood graph
    (base set)
  • Start with, say, 100 pages on "bicycling" (root set)
  • Add pages pointing to the 100 pages
  • Add pages that the 100 pages are pointing to
  • Identified pages are good Hubs and Authorities
    on "bicycling"
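The expansion from the starting pages to the base set can be sketched as follows. It assumes the link structure is available as a dictionary from each page to the set of pages it points to; build_base_set and the page names are hypothetical.

```python
def build_base_set(root_set, links):
    """Root pages, plus pages pointing to them, plus pages they point to.

    links maps each page to the set of pages it points to.
    """
    root = set(root_set)
    base = set(root)
    for page, targets in links.items():
        # Add pages pointing to at least one root page.
        if root & set(targets):
            base.add(page)
    for page in root:
        # Add pages that a root page points to.
        base |= set(links.get(page, ()))
    return base

# Hypothetical toy Web: r1 is the only root page.
links = {
    "r1": {"x"},
    "y": {"r1"},
    "z": {"w"},
    "x": set(),
}
base = build_base_set({"r1"}, links)  # {"r1", "x", "y"}
```

The Hub/Authority equations are then applied only to the links among the pages in the base set.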

35
Hub/Authority and Web Community
  • Hub/Authority is often used to identify Web
    communities
  • Nice notion of Hub and Authority of the
    community
  • Often Hub and Authority are tightly linked to
    each other

36
Any Questions?
37
Questions
  • Can we apply Hub/Authority to the entire Web like
    PageRank?

38
Hub/Authority on the Entire Web?
  • Hub/Authority works well on a topic-specific
    subset, but works poorly for the whole Web
  • Easy to spam
  • Create a page pointing to many authority pages
    (e.g., Yahoo, Google, etc.) → the page becomes
    a good hub page
  • On the page, add a link to your home page

39
Questions
  • Can we apply PageRank to a small base set?

40
PageRank on a Small Subset
  • Maybe. No experiments yet. In general, PageRank
    works better for larger datasets
  • We may be able to compute topic-specific
    PageRank
  • Any other way for topic-specific PageRank?

41
Summary
  • TFIDF and cosine similarity
  • PageRank
  • Hub/Authority