Title: Web Ranking
1Web Ranking
2Information Retrieval
- Input: a document collection
- Goal: retrieve documents or text whose information content is relevant to the user's information need
3Classic information retrieval
- Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
- This works because of the following assumptions in classical IR:
  - Queries are long and well specified, e.g. "What is the impact of the Falklands war on Anglo-Argentinean relations?"
  - Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
  - The vocabulary is small and relatively well understood
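A minimal sketch of tf-idf ranking, assuming whitespace tokenization, raw term counts for tf, and log(N/df) for idf (one common variant; the slides do not fix a formula). The function name `tf_idf_scores` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip query terms that occur in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


docs = [
    "falklands war impact on relations",
    "football results and weather",
    "war and peace",
]
scores = tf_idf_scores("falklands war", docs)
```

The first document matches both query terms and scores highest; the rarer term ("falklands") contributes more than the more common one ("war").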
4Web information retrieval
- None of these assumptions hold:
  - Queries are short: 2.35 terms on average
  - Huge variety in documents: language, quality, duplication
  - Huge vocabulary: hundreds of millions of terms
  - Deliberate misinformation (spam)
- Ranking is a function of the query terms and of the hyperlink structure
5Hyperlink analysis
- Idea: mine the structure of the web graph
  - Each web page is a node
  - Each hyperlink is a directed edge
- Related work:
  - Classic IR work (citations = links), a.k.a. bibliometrics [K63, G72, S73, ...]
  - Sociometrics [K53, MMSM86, ...]
  - Many Web-related papers use this approach [PPR96, AMM97, S97, CK97, K98, BP98, ...]
6 So...
- Our basic problem:
  - Given a digraph G of web documents, rank all documents relevant to query q
7Topics
- Eigenvectors review
- HITS, variants
- Pagerank, variants
- Rank aggregation
- Page Reputations
8Eigenvectors review
- Let's say we have a matrix $M$
- Now consider vectors $V_1$, $V_2$, $V_3$
- We compute $MV_1$, $MV_2$, $MV_3$
- In other words, $MV_1 = 0 \cdot V_1$, $MV_2 = -4 V_2$, $MV_3 = 3 V_3$
9Eigenvectors review
- $M V_\lambda = \lambda V_\lambda$
  - $V_\lambda$: eigenvector
  - $\lambda$: eigenvalue
- A matrix can have many of these.
10Eigenvectors review
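- Combine the eigenvectors $V_x$ as columns to form $P$
- Now $P^{-1} M P = \Lambda$, a diagonal matrix with the eigenvalues on its diagonal
- Or $M = P \Lambda P^{-1}$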
11Eigenvectors review
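- This implies
  - $M^n = P \Lambda^n P^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues
  - Or $M^n = P \,\mathrm{diag}(\lambda_1^n, \ldots, \lambda_k^n)\, P^{-1}$
- (We'll need this)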
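As a small numerical check of the identity $M^n = P \Lambda^n P^{-1}$, the sketch below uses a hypothetical 2x2 matrix (not the one from the slides) whose eigenvalues are 3 and 1 with eigenvectors (1, 1) and (1, -1). The helpers `matmul` and `matpow` are illustrative names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matpow(m, n):
    """Compute m**n by repeated multiplication (n >= 0)."""
    result = [[float(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(n):
        result = matmul(result, m)
    return result


M = [[2.0, 1.0], [1.0, 2.0]]
P = [[1.0, 1.0], [1.0, -1.0]]        # columns are the eigenvectors
P_inv = [[0.5, 0.5], [0.5, -0.5]]    # inverse of P
n = 5
Lambda_n = [[3.0 ** n, 0.0], [0.0, 1.0 ** n]]  # eigenvalues raised to n

via_eigen = matmul(matmul(P, Lambda_n), P_inv)
direct = matpow(M, n)
```

Both routes give the same matrix; the eigen route needs only scalar powers of the eigenvalues, which is why the largest eigenvalue dominates for large n.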
12Some definitions
- Non-negative matrix: $M_{ij} \ge 0$ for all $i, j$ (written $M \ge 0$)
- Irreducible matrix: square, nonnegative, and for every $i, j$ there exists $t$ s.t. $(M^t)_{ij} > 0$
  - For an adjacency matrix: the digraph is strongly connected
- Period of $i$: $\gcd\{t : (M^t)_{ii} > 0\}$
  - For an irreducible matrix, the period is the same for all $i$.
  - For an adjacency matrix: the period is the gcd of the cycle lengths
- Primitive matrix: there exists $t$ s.t. $M^t > 0$
  - Difference from irreducible: all entries are $> 0$ for the same $t$
  - For an adjacency matrix: the gcd of the cycle lengths is 1
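The definitions above can be checked mechanically on small adjacency matrices. This is a sketch using boolean matrix powers; the function names are illustrative. The primitivity test relies on Wielandt's bound: a primitive n x n matrix already has all entries positive at power n^2 - 2n + 2.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some k links i->k->j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_irreducible(adj):
    """True if the digraph is strongly connected, via (I + A)^(n-1) > 0."""
    n = len(adj)
    step = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    reach = step
    for _ in range(n - 2):
        reach = bool_matmul(reach, step)
    return all(all(row) for row in reach)


def is_primitive(adj):
    """True if some power A^t with t <= n^2 - 2n + 2 is entrywise positive."""
    n = len(adj)
    base = [[bool(x) for x in row] for row in adj]
    power = base
    for _ in range(n * n - 2 * n + 2):
        if all(all(row) for row in power):
            return True
        power = bool_matmul(power, base)
    return False


cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]         # one cycle of length 3
with_chord = [[0, 1, 0], [1, 0, 1], [1, 0, 0]]     # cycles of length 2 and 3
```

The pure 3-cycle is irreducible but has period 3, so it is not primitive; adding the edge 1 -> 0 creates a cycle of length 2, the gcd of the cycle lengths drops to 1, and the matrix becomes primitive.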
13Perron-Frobenius Theorem
- For a nonnegative, irreducible, primitive matrix $M$, there exists an eigenvalue $\lambda$ s.t.
  - $\lambda$ is real and positive, and $\lambda > |\alpha|$ for every other eigenvalue $\alpha \ne \lambda$
  - $\lambda$ corresponds to a strictly positive eigenvector
  - $\lambda$ is a simple root of the characteristic equation $\det(M - \alpha I_n) = 0$
- This property allows us to compute the dominant eigenvalue / eigenvector easily.
14Dominant Eigenvector
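- Since $M V_\lambda = \lambda V_\lambda$, and $(a_1, \ldots, a_n)$ are the coordinates of a vector $x$ in the basis formed by the eigenvectors:
  - $M^t x = \sum_i a_i \lambda_i^t V_i$
- Now since $\lambda_1 > |\lambda_i|$ for $i > 1$,
  - $M^t x \approx a_1 \lambda_1^t V_1$ for large $t$
- Since $V_1$ is strictly positive, any random positive vector will work as $x$
- $V_1$ is the dominant eigenvector!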
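The argument above is the basis of power iteration. A minimal sketch, assuming a positive starting vector and L1 normalization at each step (the normalization choice is ours); the example matrix is hypothetical, with eigenvalues 4 and -1.

```python
def power_iteration(m, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of matrix m."""
    n = len(m)
    x = [1.0] * n  # any strictly positive start vector works here
    for _ in range(iterations):
        y = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(v) for v in y)
        x = [v / norm for v in y]  # rescale so the entries sum to 1
    # With x normalized to sum 1, the entries of M x sum to the eigenvalue.
    mx = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(mx) / sum(x), x


M = [[1.0, 2.0], [3.0, 2.0]]
eigenvalue, eigenvector = power_iteration(M)
```

Here the iteration settles on the eigenvalue 4 with eigenvector proportional to (2, 3); the error shrinks by a factor of about |-1/4| per step.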
15(Contd)
- Special case: stochastic matrix, $\lambda_1 = 1$, and $M^t$ converges exponentially
  - $\lim_{t \to \infty} M^t = \mathbf{1}^T r$
  - where $r$ = stationary distribution of the Markov chain
- This is the random surfer model
16HITS
- Introduced by Jon M. Kleinberg (1998).
- Hypertext Induced Topic Selection
- Find a set of interesting pages
- Find a base subgraph (of Web) using this set
- Use hubness and authoritativeness to rank
- Recursive concept:
  - Good hubs point to good authorities
  - Good authorities are pointed to by good hubs
17HITS Authority and Hubness
[Figure: pages 2, 3 and 4 link to page 1; page 1 links to pages 5, 6 and 7]
- $h(1) = a(5) + a(6) + a(7)$
- $a(1) = h(2) + h(3) + h(4)$
18HITS Base Subgraph
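- BaseSubgraph($R$, $d$), where $ch[v]$ are the children of $v$ (pages $v$ links to) and $pa[v]$ its parents (pages linking to $v$)
  - $S \leftarrow R$
  - for each $v$ in $R$
    - do $S \leftarrow S \cup ch[v]$
    - $P \leftarrow pa[v]$
    - if $|P| > d$
      - then $P \leftarrow$ arbitrary subset of $P$ having size $d$
    - $S \leftarrow S \cup P$
  - return $S$
[Figure: the root set R contained in the base set S]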
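The base-subgraph construction can be sketched in Python as follows. The dictionaries `children` and `parents` (node to list of nodes) are an assumed representation of the web graph, and the "arbitrary subset" is drawn at random here; any other choice of d parents would do equally well.

```python
import random


def base_subgraph(root_set, children, parents, d, rng=random):
    """Expand a root set with all children and at most d parents per page."""
    base = set(root_set)
    for v in root_set:
        base |= set(children.get(v, ()))       # all pages v points to
        p = list(parents.get(v, ()))            # pages pointing to v
        if len(p) > d:
            p = rng.sample(p, d)                # arbitrary subset of size d
        base |= set(p)
    return base


# Hypothetical toy graph: root page "r" with two children and three parents.
children = {"r": ["c1", "c2"]}
parents = {"r": ["p1", "p2", "p3"]}
base = base_subgraph(["r"], children, parents, d=2, rng=random.Random(0))
```

Capping the parents at d keeps a very popular root page from pulling its entire in-neighbourhood into the base set.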
19HITS Algorithm
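- HubsAuthorities($G$)
  - $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  - $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  - $t \leftarrow 1$
  - repeat
    - for each $v$ in $V$
      - do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
      - $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
    - $a_t \leftarrow a_t / \|a_t\|$
    - $h_t \leftarrow h_t / \|h_t\|$
    - $t \leftarrow t + 1$
  - until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \epsilon$
  - return $(a_t, h_t)$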
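A runnable sketch of the hubs-and-authorities iteration, assuming the graph is given as a list of directed edges, the Euclidean norm for normalization, and an L1 difference for the stopping test (the slide does not say which norms are meant). Both updates use the previous iteration's values, as in the pseudocode. The toy graph is hypothetical.

```python
import math


def hits(edges, eps=1e-10, max_iter=1000):
    """Return (authority, hub) score dictionaries for a directed edge list."""
    nodes = sorted({u for edge in edges for u in edge})
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
        return {k: x / norm for k, x in vec.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Authority: sum of hub scores of parents; hub: sum over children.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h


# Hub A points to C and D; hub B points to D only.
a, h = hits([("A", "C"), ("A", "D"), ("B", "D")])
```

In this example D ends up as the better authority (two in-links) and A as the better hub (it points to both authorities); pages that are never linked to get authority 0.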
20HITS Ensuring Convergence
- Recursive dependency:
  - $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  - $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
a(v) and h(v) converge
21HITS Ensuring Convergence
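- $a_t = M^T h_{t-1}$ and $h_t = M a_{t-1}$
- Thus, after $t$ iterations:
  - $a_t = \alpha_t (M^T M)^{t-1} M^T \mathbf{1}$
  - $h_t = \beta_t (M M^T)^t \mathbf{1}$
- It can be shown that these converge, e.g. for a nonnegative symmetric matrix, to the principal eigenvectors:
  - $a$ converges to the principal eigenvector of $M^T M$, and $h$ to the principal eigenvector of $M M^T$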
22HITS (contd)
- Spamming
- Identical links
  - Distribute scores by normalizing effects from the same host (e.g. 1/n)
- Topic drift: many unrelated pages
  - Weight the edges of the graph according to the relevance of the source and destination (e.g. link text and its neighbourhood)
- Hub replication, clique attacks, link farms?
  - Solution?
23 demo!
- Intuition of Hubness / Authness
- Teoma.com
- foosball
- mountain dew
24SALSA
- SALSA (Lempel, Moran 2001)
- Probabilistic extension of the HITS algorithm
- A random walk is carried out by following hyperlinks both in the forward and in the backward direction
- Two separate random walks:
  - Hub walk
  - Authority walk
25SALSA (contd)
- Hub walk
  - Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then
  - Immediately traverse a backlink going from $w_a$ to $v_h$, where $(u,w) \in E$ and $(v,w) \in E$
- Authority walk
  - Follow a Web link backwards from a page $w_a$ to a page $u_h$ (a backward link) and then
  - Immediately traverse a forward link going from $u_h$ to $v_a$, where $(u,w) \in E$ and $(u,v) \in E$
26SALSA (contd)
- Hub weights are computed from sums of products of the inverse in-degree and the inverse out-degree of the links traversed
- This solves the clique attack / link farm problems
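A sketch of the authority walk as a Markov chain, under the assumptions that the graph is a list of distinct directed edges and that each backward or forward step picks uniformly among the available links. The function name `salsa_authority` and the toy graph are illustrative.

```python
def salsa_authority(edges, iterations=200):
    """Iterate the two-step authority walk: one backlink, then one forward link."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)
    authorities = sorted(parents)  # pages with at least one in-link
    score = {v: 1.0 / len(authorities) for v in authorities}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in authorities}
        for i in authorities:
            for k in parents[i]:        # backward link: authority i -> hub k
                for j in children[k]:   # forward link: hub k -> authority j
                    # Inverse in-degree of i times inverse out-degree of k.
                    nxt[j] += score[i] / (len(parents[i]) * len(children[k]))
        score = nxt
    return score


# Hub A points to C and D; hub B points to D only.
scores = salsa_authority([("A", "C"), ("A", "D"), ("B", "D")])
```

On this connected toy graph the scores settle at 1/3 for C and 2/3 for D, i.e. proportional to in-degree, so a tightly knit clique cannot reinforce itself the way it can under HITS.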
27PHITS
- Eigenvectors of the co-citation matrix correspond to communities
- An entry of an eigenvector gives the authority of a document in that community
- HITS uses only the dominant eigenvector, i.e. the principal community.
- What about smaller communities? (eigenvectors with smaller eigenvalues)
28PHITS Model
[Figure: generative model $d \to z \to c$, with $P(d)$, $P(z|d)$ and $P(c|z)$]
- Add communities $z$ between documents $d$ and citations $c$
- Describe the citation likelihood as
  - $P(d,c) = P(d) P(c|d)$, where
  - $P(c|d) = \sum_z P(c|z) P(z|d)$
- Total likelihood of the citation matrix $M$:
  - $L(M) = \prod_{(d,c) \in M} P(d,c)$
  - This becomes a maximum likelihood problem
- Note: this is factored (different from a mixture model)
29PHITS (contd)
- Open up the equation:
  - $P(d,c) = \sum_z P(z) P(c|z) P(d|z)$
- Alternate between
  - Computing $P(z|d,c)$
  - Re-estimating $P(z)$, $P(c|z)$ and $P(d|z)$
- Issues: not globally optimal, cannot guarantee fits (solution: restarts; start with the HITS / PCA model)
- How to decide the number of factors? (Topic hierarchy)
30PageRank
- Page et al., 1998
- Different from HITS
  - HITS uses hubness and authority weights
- The rank of a page is proportional to its parents' ranks, but inversely proportional to its parents' outdegrees
31PageRank Model
- Just measuring in-degree (citation count) doesn't account for the authority of the source of a link.
- Initial page rank equation for page $p$: $R(p) = c \sum_{q : q \to p} \frac{R(q)}{N_q}$
  - $N_q$ is the total number of out-links from page $q$.
  - A page, $q$, gives an equal fraction of its authority to all the pages it points to (e.g. $p$).
  - $c$ is a normalizing constant set so that the ranks of all pages always sum to 1.
32Algorithm
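- Iterate the rank-flowing process until convergence:
  - Let $S$ be the total set of pages.
  - Initialize $\forall p \in S$: $R(p) = 1/|S|$
  - Until ranks do not change (much) (convergence):
    - For each $p \in S$: $R'(p) = \sum_{q : q \to p} \frac{R(q)}{N_q}$
    - $c = 1 / \sum_{p \in S} R'(p)$
    - For each $p \in S$: $R(p) = c R'(p)$ (normalize)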
33Linear Algebra Version
- Treat $R$ as a vector over web pages.
- Let $M$ be a 2-d matrix over pages where
  - $M_{vu} = 1/N_u$ if $u \to v$, else $M_{vu} = 0$
- Then $R = cMR$
- $R$ converges to the principal eigenvector of $M$.
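A minimal sketch of the basic rank-flow iteration $R = cMR$ without any escape term, assuming every page has at least one out-link and the graph is given as a dictionary from each page to the list of pages it links to. The three-page graph is hypothetical.

```python
def simple_pagerank(links, iterations=100):
    """Repeat R <- c * M R, where page q passes R(q)/N_q to each page it links to."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for q in pages:
            for p in links[q]:
                new[p] += rank[q] / len(links[q])  # equal share per out-link
        c = 1.0 / sum(new.values())                # normalizing constant
        rank = {p: c * r for p, r in new.items()}
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = simple_pagerank(links)
```

Page c collects rank from both a and b and passes all of it to a, so a and c end at 0.4 each while b, fed by only half of a's rank, ends at 0.2.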
34Problems
- Dangling page problem
  - Many Web pages have no inlinks/outlinks
  - Results in dangling edges in the graph
  - E.g.
    - no parent → rank 0
      - $M^t$ converges to a matrix whose last column is all zero
    - no children → no solution
      - $M^t$ converges to the zero matrix
35Modifications
- Surfer will restart browsing by picking a new Web page at random
  - $M = (B + E)$
  - $E$: escape matrix
  - $M$: stochastic matrix
- Still:
  - It is not guaranteed that $M$ is primitive
  - If $M$ is stochastic and primitive, PageRank converges to the corresponding stationary distribution of $M$
36New Formula
- $E$: escape / damping vector. It can also be overloaded as a personalization vector.
37PageRank Algorithm
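- Let $S$ be the total set of pages.
- Let $\forall p \in S$: $E(p) = \alpha/|S|$ (for some $0 < \alpha < 1$, e.g. 0.15)
- Initialize $\forall p \in S$: $R(p) = 1/|S|$
- Until ranks do not change (much) (convergence):
  - For each $p \in S$: $R'(p) = (1-\alpha) \sum_{q : q \to p} \frac{R(q)}{N_q} + E(p)$
  - $c = 1 / \sum_{p \in S} R'(p)$
  - For each $p \in S$: $R(p) = c R'(p)$ (normalize)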
38Stochastic interpretation
- PageRank can be seen as modeling a "random surfer" that starts on a random page and then at each point:
  - With probability $E(p)$ randomly jumps to page $p$.
  - Otherwise, randomly follows a link on the current page.
- $R(p)$ models the probability that this random surfer will be on page $p$ at any given time.
- "E jumps" are needed to prevent the random surfer from getting "trapped" in web sinks with no outgoing links.
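A sketch of PageRank with a uniform escape vector, assuming the update $R'(p) = (1-\alpha)\sum_{q \to p} R(q)/N_q + E(p)$ with $E(p) = \alpha/|S|$ followed by renormalization; the $(1-\alpha)$ weight on the link term is our assumption, chosen to match the random-surfer reading. The chain graph with a sink page is hypothetical.

```python
def pagerank(links, alpha=0.15, tol=1e-12, max_iter=1000):
    """PageRank with uniform escape probability alpha; sinks are allowed."""
    pages = sorted(links)
    n = len(pages)
    escape = alpha / n                       # E(p), the random-jump share
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: escape for p in pages}
        for q in pages:
            for p in links[q]:               # a sink contributes nothing here
                new[p] += (1 - alpha) * rank[q] / len(links[q])
        c = 1.0 / sum(new.values())          # renormalize so ranks sum to 1
        new = {p: c * r for p, r in new.items()}
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# "c" is a sink: it has no outgoing links.
rank = pagerank({"a": ["b"], "b": ["c"], "c": []})
```

Without the escape term the rank flowing into the sink would simply be lost; with it, every page keeps a positive rank and the iteration converges.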
39PageRank (cont.)
- Simplifying and adding a damping factor $d$:
  - PageRank = stationary probability for this Markov chain, i.e. $P(p) = (1-d) \sum_{q : q \to p} \frac{P(q)}{N_q} + \frac{d}{n}$
  - where $n$ is the total number of nodes in the graph
40demo!
- JUNG demo
- General intuition
- Pagerank.xls
- Changes in initial values
- dangling pages zero PR
- Changes in damping factors
- Number of iterations
41 Damping factor
- $P(p) = (1-d) \sum_{q : q \to p} \frac{P(q)}{N_q} + \frac{d}{n}$
- In the two points below, "damping factor" refers to the weight on the link-following term, so a low value means much damping.
- A low damping factor (= much damping) makes calculations easier: since the flow of PageRank is dampened, the iterations converge quickly.
- A high damping factor (= little damping) results in the average page's PageRank growing higher: since there is little damping, PageRank received from external pages is passed around in the system. It will not grow forever, though: the maximum limit is (inbound PageRank) × d/(1-d), with d read here as that link-following weight.
42PageRank Communities
- Bianchini et al.
- Community-level interpretation
- $E$ (community energy) of subgraph $G_I$:
  - $E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$
  - where $E_I = \sum_{i \in I} x_i$, $x_i$ = stable PR, dp = dangling pages
- Implications:
  - Same content divided into small pages: good (larger $|I|$)
  - Dangling pages: loss in energy
43Stability
- Are link analysis algorithms based on eigenvectors stable, in the sense that their results don't change significantly under small changes to the graph?
- Suppose the connectivity of a portion of the graph is changed arbitrarily
- How will it affect the results of the algorithms?
44Stability of HITS
- Ng et al. (2001)
- A bound on the number of hyperlinks $k$ that can be added to or deleted from one page without affecting the authority or hubness weights
- It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ and that produces a constant perturbation of the dominant eigenvector
- $\delta$: eigengap $\lambda_1 - \lambda_2$; $d$: maximum outdegree of $G$
45Stability of PageRank
- Ng et al. (2001)
  - $V$: the set of vertices touched by the perturbation
- The parameter $\epsilon$ of the mixture model has a stabilization role
- If the set of pages affected by the perturbation has a small rank, the overall change will also be small
- Tighter bound by Bianchini et al. (2001)
  - $d(j) > 2$ depends on the edges incident on $j$
46PageRank vs. HITS
- PageRank
  - Computation: once for all documents and queries (offline)
  - Query-independent: requires combination with query-dependent criteria
  - Hard to spam
- HITS
  - Computation: required for each query
  - Query-dependent
  - Relatively easy to spam
  - Quality depends on quality of start set
  - Gives hubs as well as authorities
47PageRank vs. HITS
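- PageRank
  - Lempel: not rank-stable; $O(1)$ changes in the graph can change $O(N^2)$ order relations
  - Ng, Zheng, Jordan '01: value-stable; a change in $k$ nodes (with PR values $p_1, \ldots, p_k$) results in a new rank vector whose distance from the old one is bounded
- HITS
  - Not rank-stable
  - Value stability depends on the gap $g$ between the largest and second largest eigenvalues; a change of $O(g)$ nodes results in a new vector whose distance from the old one is bounded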
48PageRank variants
- ObjectRank
  - Hristidis et al.
  - Create a network of objects in databases
  - Additional processing step (thanks to size)
    - Create a PR vector for each word
    - Merge word lists at query time
  - Is this web-scalable?
    - Not all words are distinct (synonyms)
    - Popular queries: 100,000 (× 4B pages = $4 \times 10^{14}$ ints)
49PageRank variants
- Topic-Sensitive PageRank
  - Haveliwala, 2002
  - Pre-compute PPV($r_i$) for a topical basis $r_1, \ldots, r_k$, $k = 20$
  - Query: user submits a topic by
  - Query engine combines PPV($r_i$) vectors using personalization weights
50Rank Aggregation
- Why?
  - Metasearch (Dogpile = Y! + G + Ask)
  - Rank aggregation (PageRank + TF/IDF)
- How: given lists $A$ and $B$, $A(i)$ = rank of element $i$ in $A$.
  - Minimize distance measures:
    - Spearman footrule distance: sum of rank distances, $\sum_{i=1}^{|S|} |A(i) - B(i)|$ (linear time)
    - Kendall tau distance: number of pairwise disagreements, $|\{(i, j) : i < j,\ A(i) < A(j) \text{ but } B(i) > B(j)\}|$ ($n \log n$ time)
  - What about top-k lists? Take the union, and project.
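Both distances can be sketched directly from their definitions, assuming each list is a full ranking of the same elements with position 0 as the top rank. The Kendall tau version below is the straightforward quadratic one; the n log n variant (counting inversions with a merge sort) is not shown.

```python
def spearman_footrule(a, b):
    """Sum over elements of the absolute difference of their ranks in a and b."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(a))


def kendall_tau(a, b):
    """Number of element pairs ordered one way in a and the other way in b."""
    pos_b = {x: i for i, x in enumerate(b)}
    count = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            # a ranks a[i] above a[j]; count the pair if b disagrees.
            if pos_b[a[i]] > pos_b[a[j]]:
                count += 1
    return count


A = ["x", "y", "z"]
B = ["z", "x", "y"]
footrule = spearman_footrule(A, B)
tau = kendall_tau(A, B)
```

For these two lists the footrule distance is 4 and the Kendall tau distance is 2 (the pairs x/z and y/z are swapped).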
51Rank Aggregation
- Strategy: make a global list, minimize distance
- Kemeny aggregation (minimize Kendall distance): NP-hard, even with 4 lists.
  - This has a maximum likelihood interpretation:
    - Consider each candidate list as a noisy version of the global list.
    - Find the list most likely to produce the candidate lists.
- Kemeny satisfies the extended Condorcet criterion: if the global list can be partitioned so that every element of part A beats every element of part B by majority, then A is ranked above B.
- Good for spam: it is hard to spam a majority of search engines.
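For a handful of elements, Kemeny aggregation can be sketched by brute force: try every permutation and keep the one with the smallest total Kendall tau distance to the input lists. This only illustrates the objective; it is exponential in the number of elements, and the three input rankings are hypothetical.

```python
from itertools import permutations


def kendall_tau(a, b):
    """Number of element pairs that a and b order differently."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for i in range(len(a)) for j in range(i + 1, len(a))
               if pos_b[a[i]] > pos_b[a[j]])


def kemeny_brute_force(lists):
    """Return the permutation minimizing the summed Kendall tau distance."""
    items = sorted(lists[0])
    return min(permutations(items),
               key=lambda cand: sum(kendall_tau(cand, l) for l in lists))


rankings = [["x", "y", "z"], ["x", "z", "y"], ["y", "x", "z"]]
consensus = kemeny_brute_force(rankings)
```

Here x beats y and z, and y beats z, each by a majority of the three lists, so the consensus ranking is x, y, z, in line with the Condorcet criterion.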
52Page Reputations
- Penetration: $P_p(t) = I(p, t) / N(t)$; Focus: $F_t(p) = I(p, t) / In(p)$
  - $I(p, t)$: number of pages on $t$ pointing to $p$
  - $In(p)$: number of pages pointing to $p$
  - $N(t)$: number of pages on $t$
- $RM(p,t) = \frac{P_p(t) - L(p)}{L(p)} = \frac{N_w \, I(p, t)}{N(t) \, In(p)} - 1$
  - $L(p) = In(p) / N_w$
- $t$ is derived from snippets, pre-decided.
53fin.
54bibliography
- J. Kleinberg et al. HITS: Inferring Web communities from link topology.
- R. Lempel, S. Moran. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS), 2001.
- Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International Conference on the World Wide Web, pages 107-117, 1998. Elsevier Science B. V.
- Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank computation and the structure of the web: Experiments and algorithms. In Proceedings of the 11th International Conference on the World Wide Web, 2002. ACM Press.
- Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions on Internet Technology, 5(1):92-128, 2002. ACM Press.
- David Cohn and Huan Chang. Learning to probabilistically identify authoritative documents. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning, pages 167-174, 2000. Morgan Kaufmann.
- Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. Authority-Based Keyword Queries in Databases using ObjectRank.
- Taher Haveliwala. Topic Sensitive PageRank. In WWW 2002.
- Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. In Proceedings of the 10th International Conference on the World Wide Web, pages 613-622, 2001. ACM Press.
- Alberto O. Mendelzon and Davood Rafiei. What do the neighbours think? Computing web page reputations. IEEE Data Engineering Bulletin, 23(3):9-16, 2000.
- PageRank explained with bright colors.
- Monika Henzinger, Google Inc. Hyperlink Analysis of the Web (presentation).
- Pierre Baldi, Paolo Frasconi, Padhraic Smyth. Modeling the Internet and the Web.
- Soumen Chakrabarti. Mining the Web.