Web Ranking - PowerPoint PPT Presentation

Provided by: clairS

1
Web Ranking
2
Information Retrieval

Input: Document collection
Goal: Retrieve documents or text with information content that is relevant to the user's information need

3
Classic information retrieval

Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
This works because of the following assumptions in classical IR:
Queries are long and well specified, e.g. "What is the impact of the Falklands war on Anglo-Argentinean relations?"
Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
The vocabulary is small and relatively well understood
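A minimal sketch of tf/idf ranking as described above. The function name `tf_idf_scores`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; real systems use many variants of this weighting.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document for a query with a basic sum of tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term does not occur anywhere in the collection
            idf = math.log(n_docs / df[term])
            score += tf[term] * idf
        scores.append(score)
    return scores

docs = [
    "falklands war impact on relations",
    "football results and league tables",
    "war war history of the falklands",
]
scores = tf_idf_scores("falklands war", docs)
# Document indices, best match first.
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
```

The third document ranks first only because it repeats a query term, which hints at why this scheme is easy to manipulate on the Web.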

4
Web information retrieval

None of these assumptions hold:
Queries are short: 2.35 terms on average
Huge variety in documents: language, quality, duplication
Huge vocabulary: hundreds of millions of terms
Deliberate misinformation (spam)
Ranking is a function of the query terms and of the hyperlink structure
5
Hyperlink analysis

Idea: Mine the structure of the web graph
Each web page is a node
Each hyperlink is a directed edge
Related work:
Classic IR work (citations = links), a.k.a. bibliometrics: K63, G72, S73, ...
Sociometrics: K53, MMSM86, ...
Many Web-related papers use this approach: PPR96, AMM97, S97, CK97, K98, BP98, ...
6
So...

Our basic problem:
Given a digraph G of web documents, rank all documents relevant to query q
7
Topics

Eigenvectors review
HITS, variants
Pagerank, variants
Rank aggregation
Page Reputations

8
Eigenvectors review

Let's say we have a matrix $M$
Now consider vectors $V_1$, $V_2$, $V_3$
We compute $M V_1$, $M V_2$, $M V_3$
In other words, $M V_1 = 0 \cdot V_1$, $M V_2 = -4 V_2$, $M V_3 = 3 V_3$

9
Eigenvectors review

$M V_\lambda = \lambda V_\lambda$, where $V_\lambda$ is an eigenvector and $\lambda$ is the corresponding eigenvalue
A matrix can have many of these.
10
Eigenvectors review

Combine the $V_x$ (as columns) to form $P$
Now $P^{-1} M P = \Lambda$, a diagonal matrix
Or $M = P \Lambda P^{-1}$
11
Eigenvectors review

This implies
$M^n = (P \Lambda P^{-1})^n$
Or $M^n = P \Lambda^n P^{-1}$
(We'll need this)
12
Some definitions

Non-negative matrix: $M_{ij} \ge 0$ for all $i, j$ (written $M \ge 0$)
Irreducible matrix: square, nonnegative, and for every pair $(i, j)$ there exists $t$ s.t. $(M^t)_{ij} > 0$
For an adjacency matrix: strongly connected digraph
Period of $i$: $\gcd\{t : (M^t)_{ii} > 0\}$
For an irreducible matrix: the period is the same for all $i$
For an adjacency matrix: period = gcd of the cycle lengths
Primitive matrix: there exists $t$ s.t. $M^t > 0$
Difference from irreducible: a single $t$ makes all entries $> 0$
For an adjacency matrix: gcd of the cycle lengths = 1
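The definitions above can be checked mechanically on a small adjacency matrix. This is a sketch only: the function names are made up for illustration, and the primitivity test relies on Wielandt's bound (for an n x n primitive matrix, the power (n - 1)^2 + 1 is already strictly positive), which is not stated on the slide.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some k gives i -> k -> j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_irreducible(adj):
    """True if for every pair (i, j) some power t in 1..n has (M^t)_ij > 0."""
    n = len(adj)
    m = [[bool(x) for x in row] for row in adj]
    reach = [row[:] for row in m]   # union of the positive patterns of M^1 .. M^t
    power = [row[:] for row in m]
    for _ in range(n - 1):
        power = bool_matmul(power, m)
        reach = [[reach[i][j] or power[i][j] for j in range(n)] for i in range(n)]
    return all(all(row) for row in reach)

def is_primitive(adj):
    """True if one single power M^t is strictly positive (checked at Wielandt's bound)."""
    n = len(adj)
    m = [[bool(x) for x in row] for row in adj]
    power = [row[:] for row in m]
    for _ in range((n - 1) ** 2):   # ends with power = M^((n-1)^2 + 1)
        power = bool_matmul(power, m)
    return all(all(row) for row in power)

two_cycle = [[0, 1], [1, 0]]    # only cycle has length 2: period 2
with_loop = [[1, 1], [1, 0]]    # self-loop adds a cycle of length 1: gcd 1
chain = [[0, 1], [0, 0]]        # not strongly connected
```

The two-node cycle is irreducible but not primitive, while adding a self-loop makes it primitive, which matches the "gcd of cycle lengths = 1" criterion.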

13
Perron-Frobenius Theorem

For a nonnegative, irreducible, primitive matrix $M$, there exists an eigenvalue $\lambda$ s.t.
$\lambda$ is real and positive, and $\lambda > |\alpha|$ for every other eigenvalue $\alpha \ne \lambda$
$\lambda$ corresponds to a strictly positive eigenvector
$\lambda$ is a simple root of the characteristic equation $\det(M - \alpha I_n) = 0$
This property allows us to compute the dominant eigenvalue / eigenvector easily.

14
Dominant Eigenvector

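Since $M V_i = \lambda_i V_i$, let $(\alpha_1, \ldots, \alpha_n)$ be the coordinates of a vector $x$ in the basis formed by the eigenvectors. Then
$M^t x = \sum_i \alpha_i \lambda_i^t V_i$
Now since $\lambda_1 > |\lambda_i|$ for $i > 1$,
$M^t x \approx \alpha_1 \lambda_1^t V_1$ for large $t$: the dominant eigenvector!
Since $V_1$ is strictly positive, any random positive vector $x$ will work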
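The argument above is the basis of power iteration. The following is a minimal sketch, assuming a small dense matrix stored as nested lists; the name `power_iteration`, the fixed iteration count, and the example matrix are illustrative choices, not taken from the slides.

```python
import math

def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector by repeated multiplication."""
    n = len(matrix)
    x = [1.0] * n  # any strictly positive start vector works
    for _ in range(iterations):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]  # renormalize so the entries do not blow up
    # Rayleigh quotient x^T M x; x has unit length after normalization.
    mx = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(x, mx))
    return eigenvalue, x

M = [[2.0, 1.0], [1.0, 3.0]]   # eigenvalues are (5 +/- sqrt(5)) / 2
dominant_value, dominant_vector = power_iteration(M)
```

The error shrinks roughly like $(\lambda_2 / \lambda_1)^t$, so a large gap between the two largest eigenvalues means fast convergence.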
15
(Contd)

Special case: for a stochastic matrix, $\lambda_1 = 1$, and $M^t$ converges exponentially:
$\lim_{t \to \infty} M^t = \mathbf{1}^T r$
where $r$ is the stationary distribution of the Markov chain (the random surfer model)
16
HITS

Introduced by Jon M. Kleinberg (1998).
Hypertext Induced Topic Selection
Find a set of interesting pages
Find a base subgraph (of Web) using this set
Use hubness and authoritativeness to rank
Recursive concept:
Good hubs point to good authorities
Good authorities are pointed to by good hubs

17
HITS Authority and Hubness
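[Figure: example graph in which pages 2, 3 and 4 link to page 1, and page 1 links to pages 5, 6 and 7]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$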
18
HITS Base Subgraph

BaseSubgraph(R, d)
  S ← R
  for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
         then P ← arbitrary subset of P having size d
       S ← S ∪ P
  return S

(R: root set, S: base set containing R; ch[v] = children of v, pa[v] = parents of v)
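A sketch of the base-subgraph construction in Python. The dictionaries `children` and `parents` stand in for ch[v] and pa[v] and are assumed inputs; the "arbitrary subset" is taken here as the first d parents in sorted order, purely to keep the example deterministic.

```python
def base_subgraph(root_set, children, parents, d):
    """Expand the root set R into the base set S, adding at most d parents per page."""
    base = set(root_set)
    for v in root_set:
        base |= set(children.get(v, ()))       # S <- S U ch[v]
        preds = sorted(parents.get(v, ()))     # P <- pa[v]
        if len(preds) > d:
            preds = preds[:d]                  # arbitrary subset of size d (assumption: first d)
        base |= set(preds)                     # S <- S U P
    return base

children = {"r": ["x", "y"]}
parents = {"r": ["p1", "p2", "p3"]}
S = base_subgraph(["r"], children, parents, d=2)
```

Capping the parents at d keeps a very popular page from pulling an enormous number of in-linking pages into the base set.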
19
HITS Algorithm

HubsAuthorities(G)
  1 ← [1, ..., 1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 0
  repeat
    t ← t + 1
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
  until ||a_t - a_{t-1}|| + ||h_t - h_{t-1}|| < ε
  return (a_t, h_t)
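A runnable sketch of the hubs-and-authorities iteration. It assumes the graph is given as a list of directed `(source, target)` edges with at least one edge, uses the Euclidean norm for normalization, and updates both vectors from the previous iteration's values; the function name and the tiny example graph are illustrative only.

```python
import math

def hits(edges, eps=1e-10, max_iter=1000):
    """Return (authority, hub) score dictionaries for a directed edge list."""
    nodes = sorted({u for edge in edges for u in edge})
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec.values()))
        return {k: x / norm for k, x in vec.items()}

    def dist(p, q):
        return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))

    a = normalize({v: 1.0 for v in nodes})
    h = dict(a)
    for _ in range(max_iter):
        # Authorities sum the hub scores of their parents,
        # hubs sum the authority scores of their children.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        done = dist(new_a, a) + dist(new_h, h) < eps
        a, h = new_a, new_h
        if done:
            break
    return a, h

# Hub A links to X and Y; hub B links only to X.
authority, hub = hits([("A", "X"), ("A", "Y"), ("B", "X")])
```

In this example X is the stronger authority (two hubs point to it) and A is the stronger hub (it points to both authorities).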
20
HITS Ensuring Convergence

Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

We can prove that $a(v)$ and $h(v)$ converge
21
HITS Ensuring Convergence

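$a_t = M^T h_{t-1}$ and $h_t = M a_{t-1}$
Thus, after $t$ iterations (with normalization constants $\alpha_t$, $\beta_t$, and for the sequential variant in which the hub update uses the freshly computed authorities, i.e. $h_t = M a_t$):
$a_t = \alpha_t (M^T M)^{t-1} M^T \mathbf{1}$
$h_t = \beta_t (M M^T)^t \mathbf{1}$
It can be shown that these converge, e.g. for a nonnegative symmetric matrix $M$:
$a$ converges to the dominant eigenvector of $M^T M$, and $h$ to the dominant eigenvector of $M M^T$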

22
HITS (contd)

Spamming
Identical links: distribute scores by normalizing effects from the same host (e.g. 1/n)
Topic drift (many unrelated pages): weight the edges of the graph according to the relevance of the source and destination (e.g. the text in the neighbourhood of the link)
Hub replication, clique attacks, link farms? Solution?

23
demo!

Intuition of Hubness / Authness
Teoma.com
foosball
mountain dew

24
SALSA

SALSA (Lempel, Moran 2001)
Probabilistic extension of the HITS algorithm
Random walk is carried out by following
hyperlinks both in the forward and in the
backward direction
Two separate random walks
Hub walk
Authority walk

25
SALSA (contd)

Hub walk:
Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then
immediately traverse a backlink going from $w_a$ to $v_h$, where $(u,w) \in E$ and $(v,w) \in E$
Authority walk:
Follow a backward link from a page $w_a$ to a page $u_h$ and then
immediately traverse a forward link going from $u_h$ to a page $z_a$, where $(u,w) \in E$ and $(u,z) \in E$

26
SALSA (contd)

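The hub-walk transition weight between two hubs is the sum, over the authorities they both link to, of the product of the inverse out-degree of the starting hub and the inverse in-degree of the shared authority
This reduces the effect of clique attacks / link farms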
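A small sketch of the hub-walk transition probabilities, assuming the graph is given as a list of `(source, target)` edges. The function name `salsa_hub_matrix` and the nested-dictionary layout are illustrative choices.

```python
def salsa_hub_matrix(edges):
    """Transition probabilities of the SALSA hub walk.

    P[u][v] = sum over pages w with (u, w) in E and (v, w) in E of
              (1 / outdegree(u)) * (1 / indegree(w))
    """
    out_links, in_links = {}, {}
    for u, w in edges:
        out_links.setdefault(u, set()).add(w)
        in_links.setdefault(w, set()).add(u)
    hubs = sorted(out_links)
    matrix = {u: {v: 0.0 for v in hubs} for u in hubs}
    for u in hubs:
        for w in out_links[u]:          # forward link u -> w
            for v in in_links[w]:       # backlink from w to hub v
                matrix[u][v] += 1.0 / len(out_links[u]) / len(in_links[w])
    return matrix

# Hub A links to X and Y; hub B links only to X.
P = salsa_hub_matrix([("A", "X"), ("A", "Y"), ("B", "X")])
```

Each row sums to 1, so the matrix defines a Markov chain over hubs; in this example its stationary distribution is (2/3, 1/3), i.e. proportional to the hubs' out-degrees.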

27
PHITS

Eigenvectors of the co-citation matrix correspond to communities
A document's coefficient in an eigenvector reflects its authority within that community
HITS uses only the dominant eigenvector: the principal community
What about smaller communities? (eigenvectors with smaller eigenvalues)

28
PHITS Model

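[Figure: graphical model d → z → c, with edges labelled $P(d)$, $P(z \mid d)$, $P(c \mid z)$]
Add communities (latent factors $z$) between documents $d$ and citations $c$
Describe the citation likelihood as
$P(d, c) = P(d) P(c \mid d)$, where
$P(c \mid d) = \sum_z P(c \mid z) P(z \mid d)$
Total likelihood of the citation matrix $M$:
$L(M) = \prod_{(d,c) \in M} P(d, c)$
This becomes a maximum likelihood problem
Note: this is a factored model (different from a mixture model)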
29
PHITS (contd)

Rewrite the equation:
$P(d, c) = \sum_z P(z) P(c \mid z) P(d \mid z)$
Alternate between (an EM-style procedure):
Computing $P(z \mid d, c)$
Re-estimating $P(z)$, $P(c \mid z)$ and $P(d \mid z)$
Issues: not globally optimal, cannot guarantee a good fit (solutions: restarts, or start from the HITS / PCA model)
How to decide the number of factors? (Topic hierarchy)
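The alternation above can be sketched as plain EM on a list of (document, citation) pairs. This is an illustrative sketch under stated assumptions: random initialization with a fixed seed, one count per listed pair, and no tempering or restarts; the function name `phits_em` and its arguments are hypothetical.

```python
import math
import random

def phits_em(citations, n_docs, n_cited, n_factors, iters=50, seed=0):
    """Fit P(z), P(c|z), P(d|z) to (d, c) index pairs; returns them plus the log-likelihood trace."""
    rng = random.Random(seed)

    def rand_dist(n):
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]

    pz = rand_dist(n_factors)
    pc_z = [rand_dist(n_cited) for _ in range(n_factors)]
    pd_z = [rand_dist(n_docs) for _ in range(n_factors)]
    history = []
    for _ in range(iters):
        acc_z = [0.0] * n_factors
        acc_c = [[0.0] * n_cited for _ in range(n_factors)]
        acc_d = [[0.0] * n_docs for _ in range(n_factors)]
        loglik = 0.0
        for d, c in citations:
            # E-step: P(z | d, c) is proportional to P(z) P(c|z) P(d|z).
            joint = [pz[z] * pc_z[z][c] * pd_z[z][d] for z in range(n_factors)]
            total = sum(joint)
            loglik += math.log(total)
            for z in range(n_factors):
                post = joint[z] / total
                acc_z[z] += post
                acc_c[z][c] += post
                acc_d[z][d] += post
        history.append(loglik)
        # M-step: re-estimate the parameters from the expected counts.
        for z in range(n_factors):
            pc_z[z] = [x / acc_z[z] for x in acc_c[z]]
            pd_z[z] = [x / acc_z[z] for x in acc_d[z]]
            pz[z] = acc_z[z] / len(citations)
    return pz, pc_z, pd_z, history

# Two separate communities: documents 0-1 cite 0-1, documents 2-3 cite 2-3.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
pz, pc_z, pd_z, history = phits_em(pairs, n_docs=4, n_cited=4, n_factors=2)
```

The log-likelihood never decreases from one iteration to the next, but the solution reached depends on the starting point, which is the "not globally optimal" issue noted above.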

30
PageRank

Page et al., 1998
Different from HITS:
HITS uses hubness and authority weights
A page's rank is proportional to its parents' rank, but inversely proportional to its parents' outdegree

31
PageRank Model

Just measuring in-degree (citation count) doesn't account for the authority of the source of a link.
Initial page rank equation for page $p$:
$R(p) = c \sum_{q : q \to p} \frac{R(q)}{N_q}$
$N_q$ is the total number of out-links from page $q$.
A page, $q$, gives an equal fraction of its authority to all the pages it points to (e.g. $p$).
$c$ is a normalizing constant set so that the rank of all pages always sums to 1.

32
Algorithm

Iterate the rank-flowing process until convergence:
  Let S be the total set of pages.
  Initialize ∀p∈S: R(p) = 1/|S|
  Until ranks do not change (much) (convergence)
    For each p∈S: R′(p) = Σ_{q: q→p} R(q)/N_q
    c = 1 / Σ_{p∈S} R′(p)
    For each p∈S: R(p) = cR′(p) (normalize)
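A minimal sketch of this iteration, assuming the graph is a dictionary from each page to the list of pages it links to, that every page has at least one out-link, and that every link target is itself a key. The name `simple_pagerank`, the tolerance, and the L1 stopping rule are illustrative choices.

```python
def simple_pagerank(out_links, tol=1e-12, max_iter=1000):
    """Rank-flow iteration without any escape term (assumes no dangling pages)."""
    pages = sorted(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for q in pages:
            share = rank[q] / len(out_links[q])  # q gives R(q)/N_q to each page it links to
            for p in out_links[q]:
                new_rank[p] += share
        c = 1.0 / sum(new_rank.values())         # normalizing constant
        new_rank = {p: c * r for p, r in new_rank.items()}
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simple_pagerank(links)
```

For this three-page graph the ranks settle at 0.4, 0.2 and 0.4: page b receives only half of a's rank, while c collects the other half plus everything from b.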

33
Linear Algebra Version

Treat R as a vector over web pages.
Let $M$ be a 2-d matrix over pages where
$M_{vu} = 1/N_u$ if $u \to v$, else $M_{vu} = 0$
Then $R = cMR$
$R$ converges to the principal eigenvector of $M$.

34
Problems

Dangling page problem:
Many Web pages have no inlinks/outlinks
Results in dangling edges in the graph
E.g.
no parent ⇒ rank 0: $M^t$ converges to a matrix whose last column is all zero
no children ⇒ no solution: $M^t$ converges to the zero matrix

35
Modifications

Surfer will restart browsing by picking a new Web page at random
$M = (B + E)$
$E$: escape matrix
$M$: stochastic matrix
Still, it is not guaranteed that $M$ is primitive
If $M$ is stochastic and primitive, PageRank converges to the corresponding stationary distribution of $M$

36
New Formula

Hence we get
$R(p) = c \left( \sum_{q : q \to p} \frac{R(q)}{N_q} + E(p) \right)$
$E$: escape / damping vector. It can also be overloaded as a personalization vector
37
PageRank Algorithm
Let S be the total set of pages.
Let ∀p∈S: E(p) = α/|S| (for some 0 < α < 1, e.g. 0.15)
Initialize ∀p∈S: R(p) = 1/|S|
Until ranks do not change (much) (convergence)
  For each p∈S: R′(p) = Σ_{q: q→p} R(q)/N_q + E(p)
  c = 1 / Σ_{p∈S} R′(p)
  For each p∈S: R(p) = cR′(p) (normalize)
38
Stochastic interpretation

PageRank can be seen as modeling a "random surfer" that starts on a random page and then, at each step:
With probability E(p) randomly jumps to page p.
Otherwise, randomly follows a link on the
current page.
R(p) models the probability that this random
surfer will be on page p at any given time.
E jumps are needed to prevent the random surfer
from getting trapped in web sinks with no
outgoing links.

39
PageRank (cont.)

Simplifying and adding a damping factor $d$:
PageRank = stationary probability for this Markov chain, i.e.
$PR(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{PR(q)}{\mathrm{outdegree}(q)}$
where $n$ is the total number of nodes in the graph
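A sketch of PageRank with a random-jump term, under the convention that `d` is the probability of jumping to a random page, so that each page receives d/n plus (1 - d) times the rank flowing in over links. Treating a page without out-links as if it linked to every page is an assumption made here to keep the ranks summing to 1; the function name and the example graph are illustrative.

```python
def pagerank(out_links, d=0.15, tol=1e-10, max_iter=1000):
    """Return (ranks, iterations_used) for PR(p) = d/n + (1-d) * sum of PR(q)/outdegree(q)."""
    pages = sorted(set(out_links) | {p for targets in out_links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        new_rank = {p: d / n for p in pages}        # random-jump share
        for q in pages:
            # Assumption: a page with no out-links behaves as if it linked to all pages.
            targets = out_links.get(q) or pages
            share = (1.0 - d) * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank, iteration

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "x": ["c"]}
ranks, steps = pagerank(web, d=0.15)
```

Page x has no in-links, so its rank is exactly the jump share d/n. Raising the jump probability, for example `pagerank(web, d=0.5)`, reaches the tolerance in fewer iterations than `pagerank(web, d=0.05)`.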

40
demo!

JUNG demo
General intuition
Pagerank.xls
Changes in initial values
dangling pages zero PR
Changes in damping factors
Number of iterations

41
Damping factor

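$P(p) = (1 - d) \sum_{(q,p) \in E} \frac{P(q)}{\mathrm{outdegree}(q)} + \frac{d}{n}$
Note: the two remarks below use the other common convention, in which the "damping factor" $d$ is the weight on the link-following term (the role played by $1 - d$ in the formula above).
A low damping factor (= much damping) makes the calculations easier: since the flow of PageRank is dampened, the iterations converge quickly.
A high damping factor (= little damping) results in the average page's PageRank growing higher: since there is little damping, PageRank received from external pages is passed around in the system. It will not grow forever, though: the maximum limit is (inbound PageRank) × d/(1-d).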

42
PageRank Communities

Bianchini et al.
Community-level interpretation
Energy $E_I$ (community energy) of a subgraph $G_I$:
$E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$
where $E_I = \sum_{i \in I} x_i$, $x_i$ = stable PR, dp = dangling pages
Implications:
Same content divided into small pages: good (larger $|I|$)
Dangling pages: loss in energy

43
Stability

Are link analysis algorithms based on eigenvectors stable, in the sense that the results don't change significantly?
Suppose the connectivity of a portion of the graph is changed arbitrarily
How will this affect the results of the algorithms?

44
Stability of HITS

Ng et al. (2001)
A bound on the number of hyperlinks $k$ that can be added to or deleted from one page without affecting the authority or hubness weights
It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ and that produces a constant perturbation of the dominant eigenvector
$\delta$: eigengap $\lambda_1 - \lambda_2$; $d$: maximum outdegree of $G$
45
Stability of PageRank
Ng et al. (2001)
$V$: the set of vertices touched by the perturbation
The parameter $\epsilon$ of the mixture model has a stabilizing role
If the set of pages affected by the perturbation has a small rank, the overall change will also be small
Tighter bound by Bianchini et al. (2001)
$d(j) > 2$ depends on the edges incident on $j$
46
PageRank vs. HITS

PageRank:
Computation: once for all documents and queries (offline)
Query-independent: requires combination with query-dependent criteria
Hard to spam

HITS:
Computation: required for each query
Query-dependent
Relatively easy to spam
Quality depends on the quality of the start set
Gives hubs as well as authorities

47
PageRank vs. HITS

PageRank:
Lempel: not rank-stable; O(1) changes in the graph can change O(N²) order relations
Ng, Zheng, Jordan '01: value-stable; a change in k nodes (with PR values p_1, ..., p_k) results in p s.t. (bound not preserved in the transcript)

HITS:
Not rank-stable
Value stability depends on the gap g between the largest and second largest eigenvalues; a change of O(g) nodes results in p s.t. (bound not preserved in the transcript)

48
PageRank variants

ObjectRank
Hristidis, et al.
Create network of objects in databases
Additional processing step (thanks to size)
Create a PR vector for each word
Merge word lists at query time
Is this web-scalable?
Not all words are distinct (synonyms)
Popular queries: 100,000 (4B pages ⇒ $4 \times 10^{14}$ ints)

49
PageRank variants

Topic-Sensitive PageRank
Haveliwala, 2002
Pre-compute PPV($r_i$) for a topical basis $r_1, \ldots, r_k$, $k = 20$
Query: user submits a topic by ...
Query engine combines the PPV($r_i$) vectors using personalization weights

50
Rank Aggregation

Why?
Metasearch (Dogpile = Y! + G + Ask)
Rank aggregation (PageRank + TF/IDF)
How? Given lists A and B, let A(i) = rank of element i in A.
Minimize distance measures:
Spearman footrule distance: sum of rank distances, $\sum_{i=1}^{|S|} |A(i) - B(i)|$ (linear time)
Kendall tau distance: number of pairwise disagreements, $|\{(i, j) : i < j,\ A(i) < A(j) \text{ but } B(i) > B(j)\}|$ ($O(n \log n)$ time)
What about top-k lists? Take the union, and project.
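Both distances can be sketched in a few lines, assuming the two lists are full rankings of the same items. The helper names are illustrative, and the Kendall tau version below is the simple quadratic pair count rather than the $O(n \log n)$ merge-sort variant.

```python
from itertools import combinations

def ranks_of(ordering):
    """Map each item to its 1-based rank in the list."""
    return {item: pos for pos, item in enumerate(ordering, start=1)}

def spearman_footrule(a, b):
    """Sum over all items of |A(i) - B(i)|."""
    ra, rb = ranks_of(a), ranks_of(b)
    return sum(abs(ra[i] - rb[i]) for i in ra)

def kendall_tau(a, b):
    """Number of item pairs that the two lists order differently."""
    ra, rb = ranks_of(a), ranks_of(b)
    return sum(1 for i, j in combinations(ra, 2)
               if (ra[i] - ra[j]) * (rb[i] - rb[j]) < 0)

A = ["a", "b", "c", "d"]
B = ["b", "a", "d", "c"]
footrule = spearman_footrule(A, B)
tau = kendall_tau(A, B)
```

Here every item moves by one position (footrule 4) and two pairs are swapped (Kendall tau 2), consistent with the known relation that the footrule distance lies between one and two times the Kendall tau distance.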
51
Rank Aggregation

Strategy: make a global list that minimizes the distance to the input lists
Kemeny aggregation (minimize Kendall tau distance): NP-hard, even with 4 lists.
This has a maximum likelihood interpretation:
Consider each candidate list as a noisy version of the global list.
Find the list most likely to produce the candidate lists.
Kemeny aggregation satisfies the extended Condorcet criterion: if the items can be partitioned into parts A and B such that a majority ranks every item of A above every item of B, then the global list ranks A above B.
Good against spam: it is hard to spam a majority of search engines.
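For a handful of items, Kemeny aggregation can be illustrated by brute force over all permutations. This sketch assumes every input list ranks the same items; it is exponential in the number of items, in line with the NP-hardness noted above, and the function names are illustrative.

```python
from itertools import combinations, permutations

def kendall_tau(a, b):
    """Number of item pairs ordered differently by rankings a and b."""
    pos_a = {x: i for i, x in enumerate(a)}
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def kemeny_aggregate(lists):
    """Return (best_ranking, total_distance) by exhaustive search."""
    items = sorted(lists[0])
    best, best_cost = None, None
    for candidate in permutations(items):
        cost = sum(kendall_tau(candidate, lst) for lst in lists)
        if best_cost is None or cost < best_cost:
            best, best_cost = list(candidate), cost
    return best, best_cost

engines = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
consensus, cost = kemeny_aggregate(engines)
```

In the example a majority of the lists prefers a to b, a to c and b to c, and the aggregate respects all three majorities.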

52
Page Reputations

Penetration: $P_p(t) = I(p, t) / N(t)$
Focus: $F_t(p) = I(p, t) / In(p)$
$I(p, t)$: number of pages on topic $t$ pointing to $p$
$In(p)$: number of pages pointing to $p$
$N(t)$: number of pages on topic $t$
$RM(p, t) = \frac{P_p(t) - L(p)}{L(p)} = \frac{N_w \, I(p, t)}{N(t) \, In(p)} - 1$
$L(p) = In(p) / N_w$
$t$: derived from snippets, pre-decided.
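The measures above are simple ratios of four counts. The sketch below assumes the counts are already available as numbers; the function name, the argument names, and the example values are made up for illustration.

```python
def reputation_measures(i_pt, n_t, in_p, n_w):
    """Return (penetration, focus, RM) for page p and topic t.

    i_pt: pages on topic t pointing to p    n_t: pages on topic t
    in_p: pages pointing to p               n_w: the count N_w from the formulas above
    """
    penetration = i_pt / n_t          # P_p(t)
    focus = i_pt / in_p               # F_t(p)
    link_prob = in_p / n_w            # L(p)
    rm = (penetration - link_prob) / link_prob
    return penetration, focus, rm

penetration, focus, rm = reputation_measures(i_pt=30, n_t=1000, in_p=100, n_w=1_000_000)
```

A large positive RM means that pages on topic t link to p far more often than pages in general do.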

53
fin.
54
bibliography

J. Kleinberg et al. HITS: Inferring Web communities from link topology.
R. Lempel and S. Moran. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS), 2001.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International Conference on the World Wide Web, pages 107-117, 1998. Elsevier Science B. V.
Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank computation and the structure of the web: Experiments and algorithms. In Proceedings of the 11th International Conference on the World Wide Web, 2002. ACM Press.
Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions on Internet Technology, 5(1):92-128, 2002. ACM Press.
David Cohn and Huan Chang. Learning to probabilistically identify authoritative documents. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning, pages 167-174, 2000. Morgan Kaufmann.
Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. Authority-Based Keyword Queries in Databases using ObjectRank.
Taher Haveliwala. Topic-Sensitive PageRank. In WWW 2002.
Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. In Proceedings of the 10th International Conference on the World Wide Web, pages 613-622, 2001. ACM Press.
Alberto O. Mendelzon and Davood Rafiei. What do the neighbours think? Computing web page reputations. IEEE Data Engineering Bulletin, 23(3):9-16, 2000.
PageRank explained with bright colors.
Monika Henzinger, Google Inc. Hyperlink Analysis of the Web. Presentation.
Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. Modeling the Internet and the Web.
Soumen Chakrabarti. Mining the Web.