CS246 - PowerPoint PPT Presentation

Provided by: junghoo

1
CS246

Page Ranking

2
Today's Topic

Page Ranking
TFIDF (Term frequency inverse document frequency)
vector and cosine similarity
PageRank
Hub/Authority

3
Main Problem

Which page should we return for the query "Stanford University"?
Any idea?

4
Traditional IR Measure

If a page mentions "Stanford" and "University"
many times, the page is relevant
TF (term frequency): number of times that a word
occurs in a document
Page A: Stanford - 100, University - 100
Page B: Stanford - 10, University - 10
Are "Stanford" and "University" equally significant?
Page A: Stanford - 100, University - 10
Page B: Stanford - 10, University - 100

5
Inverse Document Frequency

Rare words are more significant than common words
IDF (inverse document frequency): inverse of the
number of documents containing the word
"Stanford" is considered more significant than
"University"
TFIDF: TF × IDF
Pages with many rare words are considered relevant

6
TFIDF Vector

Every unique word corresponds to one dimension in
TFIDF vector
$D_i = (TF_1 \cdot IDF_1, TF_2 \cdot IDF_2, \ldots, TF_n \cdot IDF_n)$
n: total number of unique words in the document corpus
$TF_j = 0$ if word j is not in the document
More precisely,
Similarly, we construct TFIDF vector Q for the
query
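
A minimal sketch of building TFIDF vectors for a tiny corpus. The helper name `tfidf_vectors`, the sample documents, and the logarithmic IDF variant `log(N / df)` are assumptions for illustration; the slide only says IDF is the inverse of the document count.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TFIDF vectors) for a list of tokenized documents."""
    # One dimension per unique word in the corpus.
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    # Assumed IDF variant: log(N / df). Other variants exist.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # TF is 0 for words not in the document
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

# Hypothetical corpus: "university" appears everywhere, "stanford" is rare.
docs = [
    ["stanford", "university", "stanford"],
    ["university", "of", "california"],
    ["university", "ranking"],
]
vocab, vectors = tfidf_vectors(docs)
```

With this IDF variant, a word that occurs in every document (here "university") gets weight 0, while the rare word "stanford" dominates the first document's vector.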

7
Cosine Similarity

Examples
Q Stanford, Stanford ? Di
Q = {Stanford}, D1 = {Stanford}, D2 = {Stanford, MIT}
How do we compute cosine similarity efficiently?
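
A small sketch of cosine similarity on the example with Q = {Stanford}, D1 = {Stanford}, D2 = {Stanford, MIT}. The two-dimensional vectors and the unit weights are assumptions for illustration, not real TFIDF values.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_q * norm_d)

# Dimensions: (stanford, mit). Weights of 1 are assumed for simplicity.
q = [1.0, 0.0]
d1 = [1.0, 0.0]
d2 = [1.0, 1.0]
sim1 = cosine_similarity(q, d1)  # same direction as the query
sim2 = cosine_similarity(q, d2)  # diluted by the extra word
```

D1 scores higher than D2 because the extra word in D2 turns its vector away from the query.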

8
Inverted Index

$Q \cdot D_i = 0$ if $D_i$ has no query words
Consider only the documents with query words
Inverted index: word → documents
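
A minimal sketch of an inverted index that maps each word to the documents containing it, so that only candidate documents need a similarity computation. The function names and document ids are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)
    return index

def candidates(index, query_words):
    """Documents containing at least one query word."""
    result = set()
    for w in query_words:
        result |= index.get(w, set())
    return result

# Hypothetical documents.
docs = {"D1": ["stanford"], "D2": ["stanford", "mit"], "D3": ["berkeley"]}
index = build_inverted_index(docs)
hits_for_stanford = candidates(index, ["stanford"])
```

Documents outside the candidate set have a zero dot product with the query and can be skipped.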

9
Problems of TFIDF Vector

Works well on small controlled corpus, but not on
the Web
Top result for the query "American Airlines":
an accident report on American Airlines flights
Do users really care how many times "American
Airlines" is mentioned?
Easy to spam
Ranking purely based on page content
Authors can manipulate page content to get high
ranking
Any idea?

10
Link-based Ranking

People expect to get the AA home page for the query
"American Airlines"
Many pages point to the AA home page, but not to
the accident report
Use link-count!

11
Simple Link Count

Still easy to spam
Create many pages and add links to a page
How to avoid spam?

12
PageRank

A page is important if it is pointed to by many
important pages
$PR(p) = PR(p_1)/n_1 + \cdots + PR(p_k)/n_k$
$p_i$: page pointing to p; $n_i$: number of links in $p_i$
PageRank of p is the sum of the PageRank shares it
receives from its parents
One equation for every page
N equations, N unknown variables

13
Example Web of 1842

Netscape, Microsoft and Amazon

$PR(n) = PR(n)/2 + PR(a)/2$
$PR(m) = PR(a)/2$
$PR(a) = PR(n)/2 + PR(m)$
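
As a worked check, the three equations for this example can be solved by hand. They fix the PageRanks only up to a constant factor, so this sketch assumes the normalization PR(n) + PR(m) + PR(a) = 3 (one unit of importance per page, an assumption taken from the iterative formulation).

```latex
\begin{aligned}
PR(n) &= \tfrac{1}{2}PR(n) + \tfrac{1}{2}PR(a) &&\Rightarrow\; PR(n) = PR(a) \\
PR(m) &= \tfrac{1}{2}PR(a) \\
PR(a) &= \tfrac{1}{2}PR(n) + PR(m) = \tfrac{1}{2}PR(a) + \tfrac{1}{2}PR(a) &&\text{(consistent, so one equation is redundant)} \\
PR(n) + PR(m) + PR(a) &= \tfrac{5}{2}PR(a) = 3 &&\Rightarrow\; PR(a) = \tfrac{6}{5} \\
PR(n) &= \tfrac{6}{5}, \quad PR(m) = \tfrac{3}{5}, \quad PR(a) = \tfrac{6}{5}
\end{aligned}
```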
14
PageRank Matrix Notation

Web graph matrix $M = (m_{ij})$
Each page i corresponds to row i and column i of
the matrix M
$m_{ij} = 1/n$ if page i is one of the n children of
page j; $m_{ij} = 0$ otherwise
PageRank vector
PageRank equation
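
A short sketch of building M from a link list. The link structure below (Netscape links to itself and Amazon, Microsoft links to Amazon, Amazon links to Netscape and Microsoft) is read off the example equations and is an assumption about the missing figure; `link_matrix` is a hypothetical helper.

```python
def link_matrix(links, pages):
    """Build M where M[i][j] = 1/n if page i is one of the n children of page j."""
    idx = {p: k for k, p in enumerate(pages)}
    size = len(pages)
    M = [[0.0] * size for _ in range(size)]
    for parent, children in links.items():
        for child in children:
            M[idx[child]][idx[parent]] = 1.0 / len(children)
    return M

pages = ["n", "m", "a"]  # Netscape, Microsoft, Amazon
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}  # assumed link structure
M = link_matrix(links, pages)
# Every column sums to 1 because each page splits its importance among its children.
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

Each row of M then corresponds to one of the PageRank equations for the example.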

15
PageRank Iterative Computation

Initially every page has a unit of importance
At each round, each page shares its importance
among its children and receives new importance
from its parents
Eventually the importance of each page reaches a
limit
Stochastic matrix
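
A minimal sketch of the iterative computation on the three-page example. The matrix (page order Netscape, Microsoft, Amazon) is assumed from the example equations, and the round count is an arbitrary choice.

```python
def pagerank_iterate(M, rounds=100):
    """Repeatedly apply r <- M r, starting with one unit of importance per page."""
    size = len(M)
    r = [1.0] * size
    for _ in range(rounds):
        # Each page receives the shares sent by its parents.
        r = [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]
    return r

# Column j holds the shares that page j sends to its children.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]
r = pagerank_iterate(M)  # approaches (6/5, 3/5, 6/5)
```

Because every column of M sums to 1, the total importance stays at 3 in every round and is only redistributed.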

16
Example Web of 1842
[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
17
PageRank Eigenvector

PageRank equation: the PageRank vector
is the principal eigenvector of M

18
PageRank Random Surfer Model

PageRank is the probability that a Web surfer reaches a page
after many clicks, following random links
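
A rough simulation of the random surfer on the three-page example. The link structure is assumed from the example equations, and the seed and click count are arbitrary illustration choices.

```python
import random

def random_surfer(links, start, clicks, seed=0):
    """Follow random links and return the fraction of clicks landing on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {p: 0 for p in links}
    page = start
    for _ in range(clicks):
        page = rng.choice(links[page])  # pick one outgoing link uniformly
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}

# Assumed links: n = Netscape, m = Microsoft, a = Amazon.
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
freq = random_surfer(links, start="n", clicks=100000)
```

The visit frequencies should come out near 0.4, 0.2, and 0.4 for Netscape, Microsoft, and Amazon, which is the PageRank vector (6/5, 3/5, 6/5) scaled to sum to 1.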

Random Click
19
Problems on the Real Web

Dead end
A page with no outgoing links to pass its importance along
All importance leaks out of the Web
Crawler trap
A group of one or more pages that have no links
out of the group
The group accumulates all the importance of the Web
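
A small sketch showing both problems on the three-page example, with Microsoft turned into a dead end and then into a crawler trap. The matrices (page order Netscape, Microsoft, Amazon) are assumptions based on the examples in these slides.

```python
def iterate(M, rounds=100):
    """Repeatedly apply r <- M r, starting with one unit of importance per page."""
    size = len(M)
    r = [1.0] * size
    for _ in range(rounds):
        r = [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]
    return r

# Dead end: Microsoft has no outgoing links, so its column is all zeros.
dead_end = [[0.5, 0.0, 0.5],
            [0.0, 0.0, 0.5],
            [0.5, 0.0, 0.0]]
# Crawler trap: Microsoft links only to itself.
trap = [[0.5, 0.0, 0.5],
        [0.0, 1.0, 0.5],
        [0.5, 0.0, 0.0]]

r_dead = iterate(dead_end)  # every entry shrinks toward 0
r_trap = iterate(trap)      # Microsoft ends up with all 3 units
```

With the dead end, the importance sent to Microsoft is never passed on, so the total drains away; with the trap, Microsoft keeps everything it receives.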

20
Example Dead End

No link from Microsoft

Dead end
[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
21
Example Dead End
[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
22
Solution to Dead End

Assume the surfer jumps to a random page at a
dead end
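
One way to sketch this fix: replace the all-zero column of a dead-end page with a uniform column, so the surfer moves to any page with equal probability. The helper `patch_dead_ends` and the example matrix are assumptions for illustration.

```python
def patch_dead_ends(M):
    """Replace every all-zero column with a uniform column of 1/N."""
    size = len(M)
    patched = [row[:] for row in M]
    for j in range(size):
        if all(M[i][j] == 0 for i in range(size)):
            for i in range(size):
                patched[i][j] = 1.0 / size
    return patched

def iterate(M, rounds=100):
    """Repeatedly apply r <- M r, starting with one unit of importance per page."""
    size = len(M)
    r = [1.0] * size
    for _ in range(rounds):
        r = [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]
    return r

# Page order: Netscape, Microsoft, Amazon. Microsoft is a dead end.
dead_end = [[0.5, 0.0, 0.5],
            [0.0, 0.0, 0.5],
            [0.5, 0.0, 0.0]]
r = iterate(patch_dead_ends(dead_end))  # approaches (18/13, 9/13, 12/13)
```

After patching, every column sums to 1 again, so the total importance stays at 3 instead of leaking away.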

[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
23
Example Crawler Trap

Only self-link at Microsoft

Crawler trap
[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
24
Example Crawler Trap
[Figure: Ne (Netscape), MS (Microsoft), Am (Amazon)]
25
Crawler Trap Damping Factor

Tax each page some fraction of its importance
and distribute it equally among all pages
The tax rate is the probability that the surfer jumps to a random page
Assuming a 20% tax
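
A minimal sketch of the taxed iteration on the crawler-trap example, assuming a 20% tax. The function name `iterate_with_tax`, the matrix (page order Netscape, Microsoft, Amazon), and the round count are illustration choices.

```python
def iterate_with_tax(M, tax=0.2, rounds=200):
    """Apply r <- (1 - tax) * M r + equal share of the collected tax."""
    size = len(M)
    r = [1.0] * size
    for _ in range(rounds):
        share = tax * sum(r) / size  # tax collected from all pages, split equally
        r = [(1 - tax) * sum(M[i][j] * r[j] for j in range(size)) + share
             for i in range(size)]
    return r

# Crawler trap: Microsoft links only to itself.
trap = [[0.5, 0.0, 0.5],
        [0.0, 1.0, 0.5],
        [0.5, 0.0, 0.0]]
r = iterate_with_tax(trap)  # approaches (7/11, 21/11, 5/11)
```

Microsoft still ranks highest, but it no longer absorbs all the importance: the tax keeps returning some of it to the other pages.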

26
Anti-Spamming at Search Engines

Anchor text
Consider what others think about your page
Give higher weights to anchors from high PageRank
pages
More difficult to spam
PageRank
To gain importance, you need to convince many
important people
More difficult to spam
Consider inter-site links with higher weight

27
Hub and Authority

More detailed evaluation of importance
A page is useful if
It has good contents or
It has links to useful pages (good bookmark)
Hub/Authority
Authority pages with good contents
Hub pages pointing to good content pages

28
Hub/Authority Definition

Recursive definition similar to PageRank
Authority pages are linked to by many hub pages
Hub pages link to many authority pages
$H(p) = A(p_1) + \cdots + A(p_k)$
$A(p) = H(p_1) + \cdots + H(p_m)$

29
Hub/Authority Matrix Notation

Web graph matrix $A = (a_{ij})$
Each page i corresponds to row i and column i of
the matrix A
$a_{ij} = 1$ if page i points to page j; $a_{ij} = 0$
otherwise
A is not a stochastic matrix
$A^T$ is similar to the PageRank matrix M, without the
stochastic restriction

30
Example Web of 1842

n, m, a vector

31
Hub/Authority Iterative Computation

Hub/Authority vector
Each of the two updates uses a scaling factor to prevent divergence
Compute the hub and authority vectors iteratively with scaling
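
A minimal sketch of the iterative hub/authority computation. The graph is hypothetical, and scaling by the largest score is one assumed choice of scaling factor; the slide's own factors were lost in extraction.

```python
def scale(scores):
    """Divide by the largest score so the values do not diverge."""
    top = max(scores.values())
    return {p: v / top for p, v in scores.items()} if top > 0 else scores

def hits(links, rounds=50):
    """Return (hub, authority) scores for a graph given as page -> children."""
    pages = sorted(set(links) | {c for cs in links.values() for c in cs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # A(p): sum of hub scores of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for parent, children in links.items():
            for child in children:
                auth[child] += hub[parent]
        auth = scale(auth)
        # H(p): sum of authority scores of the pages p points to.
        hub = {p: sum(auth[c] for c in links.get(p, [])) for p in pages}
        hub = scale(hub)
    return hub, auth

# Hypothetical graph: two hub pages pointing at two content pages.
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"]}
hub, auth = hits(links)
```

Here `auth1` gets the top authority score because both hubs point to it, and `hub1` gets the top hub score because it points to both content pages.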

32
Hub/Authority Eigenvector

The hub vector is an eigenvector of $A A^T$; the authority vector is an eigenvector of $A^T A$

33
Example Web of 1842
34
Hub/Authority and Root Set

Apply the equations on a small neighborhood graph
(base set)
Start with, say, 100 pages on "bicycling"
Add pages pointing to the 100 pages
Add pages that the 100 pages are pointing to
The identified pages are good hubs and authorities
on "bicycling"
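
A small sketch of expanding a starting set of pages into the base set. The function name `base_set`, the page names, and the link graph are hypothetical.

```python
def base_set(root, links):
    """Expand the root set with its parents and children in the link graph."""
    base = set(root)
    for parent, children in links.items():
        if any(child in root for child in children):
            base.add(parent)       # pages pointing to the root set
        if parent in root:
            base.update(children)  # pages the root set points to
    return base

# Hypothetical graph: "x" points into the root set, "r1" points out of it,
# and "z" and "w" are unrelated pages.
links = {"x": ["r1"], "r1": ["y"], "z": ["w"]}
root = {"r1"}
expanded = base_set(root, links)
```

The hub/authority equations are then applied only to the links among the pages in the expanded set.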

35
Hub/Authority and Web Community

Hub/Authority is often used to identify Web
communities
Nice notion of Hub and Authority of the
community
Often Hub and Authority are tightly linked to
each other

36
Any Questions?
37
Questions

Can we apply Hub/Authority to the entire Web like
PageRank?

38
Hub/Authority on the Entire Web?

Hub/Authority works well on a topic-specific
subset, but works poorly for the whole Web
Easy to spam
Create a page pointing to many authority pages
(e.g., Yahoo, Google, etc.) → the page becomes
a good hub page
On the page, add a link to your home page

39
Questions