Title: CS246
1. CS246
2. Today's Topic
- Page Ranking
- TFIDF (term frequency, inverse document frequency) vector and cosine similarity
- PageRank
- Hub/Authority
3. Main Problem
- What page should we return for the query "Stanford University"?
- Any idea?
4. Traditional IR Measure
- If a page mentions "Stanford" and "University" many times, the page is relevant
- TF (term frequency): the number of times that a word occurs in a document
  - Page A: Stanford 100, University 100
  - Page B: Stanford 10, University 10
- Are "Stanford" and "University" equal?
  - Page A: Stanford 100, University 10
  - Page B: Stanford 10, University 100
5. Inverse Document Frequency
- Rare words are more significant than common words
- IDF (inverse document frequency): the inverse of the number of documents containing the word
- "Stanford" is considered more significant than "University"
- TFIDF: the product TF × IDF
- Pages with many rare words are considered relevant
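A minimal sketch of TF and IDF as defined above. The three-page corpus and the function names are hypothetical, chosen only for illustration. IDF is taken literally as 1 / (number of documents containing the word); many systems use a logarithmic variant such as log(N / df) instead.

```python
from collections import Counter

def term_frequency(document):
    """Count how many times each word occurs in one document."""
    return Counter(document.lower().split())

def inverse_document_frequency(corpus):
    """IDF as 1 / (number of documents that contain the word)."""
    containing = Counter()
    for document in corpus:
        # set() so that a word counts once per document
        containing.update(set(document.lower().split()))
    return {word: 1.0 / count for word, count in containing.items()}

# Hypothetical corpus: "university" is common, "stanford" is rare.
corpus = [
    "stanford university",
    "harvard university",
    "university rankings",
]
tf = term_frequency(corpus[0])
idf = inverse_document_frequency(corpus)
tfidf = {word: tf[word] * idf[word] for word in tf}
```

In this corpus "stanford" ends up with a higher TFIDF weight than "university", although both occur once in the first page.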
6. TFIDF Vector
- Every unique word corresponds to one dimension in the TFIDF vector
- $D_i = (TF_1 \cdot IDF_1, TF_2 \cdot IDF_2, \ldots, TF_n \cdot IDF_n)$
- $n$: total number of unique words in the document corpus
- $TF_j = 0$ if the word is not in the document
- More precisely,
- Similarly, we construct the TFIDF vector $Q$ for the query
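The vector construction can be sketched as follows. The helper names and the two-page corpus are assumptions made for this example, and IDF is again taken as 1 / (number of documents containing the word). Query words that are not in the corpus vocabulary are simply ignored.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Assign one dimension (index) to every unique word in the corpus."""
    words = sorted({word for text in corpus for word in text.lower().split()})
    return {word: j for j, word in enumerate(words)}

def inverse_document_frequency(corpus):
    """IDF as 1 / (number of documents that contain the word)."""
    containing = Counter()
    for text in corpus:
        containing.update(set(text.lower().split()))
    return {word: 1.0 / count for word, count in containing.items()}

def tfidf_vector(text, vocabulary, idf):
    """Return (TF_1*IDF_1, ..., TF_n*IDF_n); TF_j is 0 for absent words."""
    tf = Counter(text.lower().split())
    vector = [0.0] * len(vocabulary)
    for word, j in vocabulary.items():
        vector[j] = tf[word] * idf[word]  # Counter yields 0 for missing words
    return vector

# Hypothetical corpus; dimensions are ordered (mit, stanford, university).
corpus = ["stanford university", "mit university"]
vocabulary = build_vocabulary(corpus)
idf = inverse_document_frequency(corpus)
d_vectors = [tfidf_vector(text, vocabulary, idf) for text in corpus]
q_vector = tfidf_vector("stanford", vocabulary, idf)
```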
7. Cosine Similarity
- $\cos(Q, D_i) = \frac{Q \cdot D_i}{|Q| \, |D_i|}$
- Examples
- Q: Stanford, Stanford ? $D_i$
- Q: Stanford; $D_1$: Stanford; $D_2$: Stanford, MIT
- How do we compute cosine similarity efficiently?
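A small sketch of cosine similarity applied to the second example above. For simplicity it assumes every IDF is 1, so the vectors hold raw term counts over the two dimensions (Stanford, MIT); the variable names are illustrative.

```python
import math

def cosine_similarity(q, d):
    """Return (q . d) / (|q| |d|), or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Dimensions: (Stanford, MIT). Assumption: all IDF values are 1.
Q = [1, 0]    # query "Stanford"
D1 = [1, 0]   # document "Stanford"
D2 = [1, 1]   # document "Stanford, MIT"
sim_1 = cosine_similarity(Q, D1)
sim_2 = cosine_similarity(Q, D2)
```

D1 points in the same direction as Q, so it scores higher than D2, whose extra word MIT pulls its vector away from the query.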
8. Inverted Index
- $Q \cdot D_i = 0$ if $D_i$ has no query words
- Consider only the documents with query words
- Inverted index: Word → Document
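One way to sketch the idea: map each word to the set of documents containing it, then score only the documents returned for the query words. The corpus and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def candidate_documents(query, index):
    """Documents with at least one query word; all others have Q . D_i = 0."""
    candidates = set()
    for word in query.lower().split():
        candidates |= index.get(word, set())
    return candidates

# Hypothetical corpus of three tiny documents.
corpus = ["stanford university", "mit", "stanford mit"]
index = build_inverted_index(corpus)
hits = candidate_documents("stanford", index)
```

Only documents 0 and 2 need a cosine similarity computation for the query "stanford"; document 1 is never touched.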
9. Problems of TFIDF Vector
- Works well on a small controlled corpus, but not on the Web
  - Top result for the query "American Airlines": accident reports of American Airlines flights
  - Do users really care how many times "American Airlines" is mentioned?
- Easy to spam
  - Ranking is purely based on page content
  - Authors can manipulate page content to get a high ranking
- Any idea?
10. Link-based Ranking
- People expect to get the AA home page for the query "American Airlines"
- Many pages point to the AA home page, but not to the accident report
- Use link count!
11. Simple Link Count
- Still easy to spam
- Create many pages and add links to a page
- How to avoid spam?
12. PageRank
- A page is important if it is pointed to by many important pages
- $PR(p) = PR(p_1)/n_1 + \cdots + PR(p_k)/n_k$, where $p_i$ is a page pointing to $p$ and $n_i$ is the number of links in $p_i$
- The PageRank of $p$ is the sum of its parents' PageRanks, each divided by that parent's number of links
- One equation for every page
- N equations, N unknown variables
13. Example: Web of 1842
- Netscape, Microsoft and Amazon
- $PR(n) = PR(n)/2 + PR(a)/2$
- $PR(m) = PR(a)/2$
- $PR(a) = PR(n)/2 + PR(m)$
14. PageRank: Matrix Notation
- Web graph matrix $M = \{m_{ij}\}$
- Each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$
- $m_{ij} = 1/n$ if page $i$ is one of the $n$ children of page $j$; $m_{ij} = 0$ otherwise
- PageRank vector: $\vec{r} = (PR(p_1), \ldots, PR(p_N))^T$
- PageRank equation: $\vec{r} = M \vec{r}$
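A sketch of how the matrix M could be built for the Netscape/Microsoft/Amazon example. The link structure is read off the three example equations (n links to n and a, m links to a, a links to n and m); the function name and the page order (n, m, a) are assumptions.

```python
def build_matrix(links, pages):
    """m[i][j] = 1/n if page i is one of the n children of page j, else 0."""
    position = {page: i for i, page in enumerate(pages)}
    size = len(pages)
    m = [[0.0] * size for _ in range(size)]
    for parent, children in links.items():
        j = position[parent]
        for child in children:
            m[position[child]][j] = 1.0 / len(children)
    return m

pages = ["n", "m", "a"]  # Netscape, Microsoft, Amazon
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
M = build_matrix(links, pages)

# Each column sums to 1 because every page has at least one child.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Row i of M lists the shares page i receives, so reading row "n" gives back PR(n) = PR(n)/2 + PR(a)/2.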
15. PageRank: Iterative Computation
- Initially, every page has one unit of importance
- At each round, each page shares its importance among its children and receives new importance from its parents
- Eventually, the importance of each page reaches a limit
- Stochastic matrix: each column of $M$ sums to 1 (when every page has at least one outgoing link)
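The rounds described above amount to repeatedly multiplying the importance vector by M. A minimal sketch, assuming the three-page example with page order (n, m, a) and a fixed number of rounds rather than a convergence test:

```python
def pagerank_iterate(m, rounds=100):
    """Start with one unit per page and apply rank <- M * rank each round."""
    size = len(m)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [sum(m[i][j] * rank[j] for j in range(size)) for i in range(size)]
    return rank

# Matrix for n -> {n, a}, m -> {a}, a -> {n, m}; rows and columns are (n, m, a).
M = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.0],
]
rank = pagerank_iterate(M)
```

The total importance stays at 3, and the values approach the solution of the example equations: 6/5 for Netscape and Amazon, 3/5 for Microsoft.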
16. Example: Web of 1842
[Figure: Web graph with nodes Ne, MS and Am]
17. PageRank: Eigenvector
- PageRank equation: $\vec{r} = M \vec{r}$
- The PageRank vector $\vec{r}$ is the principal eigenvector of $M$
18. PageRank: Random Surfer Model
- PageRank is the probability that a Web surfer reaches a page after many clicks, following random links
[Figure: Random Click]
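The random surfer can be simulated directly. This is an illustrative sketch on the three-page example (n links to n and a, m links to a, a links to n and m); the function name, the seed and the number of clicks are arbitrary choices.

```python
import random

def random_surfer(links, start, clicks, seed=0):
    """Follow uniformly random links and return the visit frequency per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in links}
    page = start
    for _ in range(clicks):
        page = rng.choice(links[page])
        visits[page] += 1
    return {page: count / clicks for page, count in visits.items()}

links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
frequency = random_surfer(links, start="n", clicks=200_000)
```

The visit frequencies come out close to 0.4, 0.2 and 0.4 for n, m and a, which are the iterative PageRank values 6/5, 3/5 and 6/5 divided by the total of 3.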
19. Problems on the Real Web
- Dead end
  - A page with no links to send its importance to
  - All importance leaks out of the Web
- Crawler trap
  - A group of one or more pages that have no links out of the group
  - Accumulates all the importance of the Web
20. Example: Dead End
[Figure: Web graph with nodes Ne, MS and Am, one page marked "Dead end"]
21. Example: Dead End
[Figure: Web graph with nodes Ne, MS and Am]
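A sketch of how importance leaks out through a dead end. It assumes, for illustration only, that Microsoft is the dead end (the figure does not survive, so this choice is a guess), which makes its column in the matrix all zeros.

```python
def iterate(m, rounds):
    """Start with one unit per page and apply rank <- M * rank each round."""
    size = len(m)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [sum(m[i][j] * rank[j] for j in range(size)) for i in range(size)]
    return rank

# Rows and columns are (n, m, a). Assumption: m (Microsoft) has no links,
# so the middle column is all zeros and M is no longer stochastic.
M_dead_end = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]
total_after_1 = sum(iterate(M_dead_end, 1))
total_after_100 = sum(iterate(M_dead_end, 100))
```

The total importance drops from 3 to 2 after one round and keeps shrinking toward 0, so every page ends up with no importance.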
22. Solution to Dead End
- Assume that a surfer jumps to a random page at a dead end
[Figure: Web graph with nodes Ne, MS and Am]
23. Example: Crawler Trap
- Microsoft has only a self-link
[Figure: Web graph with nodes Ne, MS and Am, MS marked "Crawler trap"]
24. Example: Crawler Trap
[Figure: Web graph with nodes Ne, MS and Am]
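The crawler trap can be reproduced with the same iteration. This sketch assumes the other links are unchanged from the earlier example (n links to n and a, a links to n and m) and only Microsoft's link is replaced by a self-link.

```python
def iterate(m, rounds):
    """Start with one unit per page and apply rank <- M * rank each round."""
    size = len(m)
    rank = [1.0] * size
    for _ in range(rounds):
        rank = [sum(m[i][j] * rank[j] for j in range(size)) for i in range(size)]
    return rank

# Rows and columns are (n, m, a). Microsoft (m) links only to itself.
M_trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]
rank = iterate(M_trap, 100)
```

The matrix is still stochastic, so the total stays at 3, but Microsoft absorbs all of it while Netscape and Amazon fall toward 0.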
25. Crawler Trap: Damping Factor
- Tax each page some fraction of its importance and distribute it equally
- The tax rate is the probability that the surfer jumps to a random page
- Assuming a 20% tax
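A sketch of the taxed iteration with a 20% tax, applied to the crawler-trap matrix in which Microsoft has only a self-link. Applying it to that particular graph, and the function and parameter names, are assumptions for illustration.

```python
def pagerank_with_tax(m, tax=0.2, rounds=100):
    """Each round: keep (1 - tax) of the linked importance, share the tax equally."""
    size = len(m)
    rank = [1.0] * size
    for _ in range(rounds):
        shared = tax * sum(rank) / size  # equal share of the collected tax
        rank = [
            (1.0 - tax) * sum(m[i][j] * rank[j] for j in range(size)) + shared
            for i in range(size)
        ]
    return rank

# Rows and columns are (n, m, a). Microsoft (m) links only to itself.
M_trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]
rank = pagerank_with_tax(M_trap)
```

With the tax, Microsoft still gets the largest share (21/11) but no longer absorbs everything: Netscape keeps 7/11 and Amazon 5/11.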
26. Anti-Spamming at Search Engines
- Anchor text
  - Consider what others think about your page
  - Give higher weights to anchors from high-PageRank pages
  - More difficult to spam
- PageRank
  - To gain importance, you need to convince many important people
  - More difficult to spam
  - Give inter-site links a higher weight
27. Hub and Authority
- More detailed evaluation of importance
- A page is useful if
  - It has good contents, or
  - It has links to useful pages (good bookmark)
- Hub/Authority
  - Authority: pages with good contents
  - Hub: pages pointing to good content pages
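28. Hub/Authority: Definition
- Recursive definition similar to PageRank
- Authority pages are linked to by many hub pages
- Hub pages link to many authority pages
- $H(p) = A(p_1) + \cdots + A(p_k)$, where $p_1, \ldots, p_k$ are the pages that $p$ points to
- $A(p) = H(p_1) + \cdots + H(p_m)$, where $p_1, \ldots, p_m$ are the pages pointing to $p$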
29. Hub/Authority: Matrix Notation
- Web graph matrix $A = \{a_{ij}\}$
- Each page $i$ corresponds to row $i$ and column $i$ of the matrix $A$
- $a_{ij} = 1$ if page $i$ points to page $j$; $a_{ij} = 0$ otherwise
- $A$ is not a stochastic matrix
- $A^T$ is similar to the PageRank matrix $M$, without the stochastic restriction
30. Example: Web of 1842
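31. Hub/Authority: Iterative Computation
- Hub vector $\vec{h}$ and authority vector $\vec{a}$
- $\vec{h} = \lambda A \vec{a}$, where $\lambda$ is a scaling factor that prevents divergence
- $\vec{a} = \mu A^T \vec{h}$, where $\mu$ is a scaling factor that prevents divergence
- Compute $\vec{h}$ and $\vec{a}$ iteratively with scaling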
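A sketch of the iterative hub/authority computation. It reuses the three-page graph from the PageRank example (n links to n and a, m links to a, a links to n and m) as an assumed input, and it implements the scaling by dividing each vector by its largest entry, which is one possible choice of scaling factor.

```python
def scale(vector):
    """Divide by the largest entry so the values cannot diverge."""
    top = max(vector)
    return [x / top for x in vector] if top > 0 else vector

def hub_authority(adj, rounds=50):
    """adj[i][j] = 1 if page i points to page j. Returns (hub, authority)."""
    size = len(adj)
    hub = [1.0] * size
    auth = [1.0] * size
    for _ in range(rounds):
        # Authority of j: sum of hub scores of the pages pointing to j.
        auth = scale([sum(adj[i][j] * hub[i] for i in range(size)) for j in range(size)])
        # Hub of i: sum of authority scores of the pages i points to.
        hub = scale([sum(adj[i][j] * auth[j] for j in range(size)) for i in range(size)])
    return hub, auth

# Rows and columns are (n, m, a); assumed example graph.
A = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
hub, auth = hub_authority(A)
```

Because each round multiplies the hub vector by A and then by its transpose (in the reverse order for authorities), the scaled vectors settle on fixed directions instead of growing without bound.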
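32. Hub/Authority: Eigenvector
- $\vec{h} = \lambda A \vec{a}$
- $\vec{a} = \mu A^T \vec{h}$
- $\vec{h} = \lambda \mu A A^T \vec{h}$ and $\vec{a} = \lambda \mu A^T A \vec{a}$
- $\vec{h}$ is an eigenvector of $A A^T$; $\vec{a}$ is an eigenvector of $A^T A$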
33. Example: Web of 1842
34. Hub/Authority and Root Set
- Apply the equations on a small neighbor graph (base set)
  - Start with, say, 100 pages on bicycling (the root set)
  - Add pages pointing to the 100 pages
  - Add pages that the 100 pages are pointing to
- The identified pages are good hubs and authorities on bicycling
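The expansion from root set to base set can be sketched as follows. The page names and the link table are made up for illustration, and a real system would obtain the incoming links from a link index rather than by scanning every page.

```python
def build_base_set(root_set, links):
    """Expand a root set with pages that link to it and pages it links to."""
    root = set(root_set)
    base = set(root)
    for page, children in links.items():
        if page in root:
            base.update(children)  # pages that a root page points to
        elif any(child in root for child in children):
            base.add(page)         # pages pointing to a root page
    return base

# Hypothetical link table: page -> pages it links to.
links = {
    "bike-club": ["trail-map"],
    "bike-shop": ["bike-club"],
    "blog": ["bike-club", "news"],
    "news": ["weather"],
}
base = build_base_set({"bike-club"}, links)
```

Here "news" and "weather" stay outside the base set because they neither link to the root page nor are linked from it.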
35. Hub/Authority and Web Community
- Hub/Authority is often used to identify Web communities
- Nice notion of the hubs and authorities of a community
- Hubs and authorities are often tightly linked to each other
36. Any Questions?
37. Questions
- Can we apply Hub/Authority to the entire Web, as we do with PageRank?
38. Hub/Authority on the Entire Web?
- Hub/Authority works well on a topic-specific subset, but works poorly for the whole Web
- Easy to spam
  - Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.); the page then becomes a good hub page
  - On the page, add a link to your home page
39. Questions
- Can we apply PageRank to a small base set?
40. PageRank on a Small Subset
- Maybe. No experiments yet. In general, PageRank works better for a larger dataset
- We may be able to compute topic-specific PageRank
- Any other way to compute topic-specific PageRank?
41. Summary
- TFIDF and cosine similarity
- PageRank
- Hub/Authority