Web Ranking - PowerPoint PPT Presentation

Provided by: clairS

Transcript and Presenter's Notes

Title: Web Ranking


1
Web Ranking
2
Information Retrieval
  • Input: Document collection
  • Goal: Retrieve documents or text with information
    content that is relevant to the user's information
    need

3
Classic information retrieval
  • Ranking is a function of query term frequency
    within the document (tf) and across all documents
    (idf)
  • This works because of the following assumptions
    in classical IR:
  • Queries are long and well specified, e.g.
    "What is the impact of the Falklands war on
    Anglo-Argentinean relations?"
  • Documents (e.g., newspaper articles) are
    coherent, well authored, and are usually about
    one topic
  • The vocabulary is small and relatively well
    understood

4
Web information retrieval
  • None of these assumptions hold:
  • Queries are short: 2.35 terms on average
  • Huge variety in documents: language, quality,
    duplication
  • Huge vocabulary: hundreds of millions of terms
  • Deliberate misinformation (spam)
  • Ranking is a function of the query terms and of
    the hyperlink structure

5
Hyperlink analysis
  • Idea: Mine structure of the web graph
  • Each web page is a node
  • Each hyperlink is a directed edge
  • Related work:
  • Classic IR work (citations = links), a.k.a.
    Bibliometrics [K63, G72, S73, ...]
  • Socio-metrics [K53, MMSM86, ...]
  • Many Web related papers use this approach
    [PPR96, AMM97, S97, CK97, K98, BP98, ...]

6
So...
  • Our basic problem:
  • Given a digraph G of web documents, rank all
    documents relevant to query q

7
Topics
  • Eigenvectors review
  • HITS, variants
  • Pagerank, variants
  • Rank aggregation
  • Page Reputations

8
Eigenvectors review
  • Let's say we have a matrix $M$
  • Now consider vectors $V_1$, $V_2$, $V_3$
  • We have $MV_1$, $MV_2$, $MV_3$
  • In other words, $MV_1 = 0 \cdot V_1$, $MV_2 = -4V_2$, $MV_3 = 3V_3$

9
Eigenvectors review
  • $M V_\lambda = \lambda V_\lambda$, where $V_\lambda$ is an
    eigenvector and $\lambda$ is its eigenvalue
  • A matrix can have many of these.

10
Eigenvectors review
  • Combine the eigenvectors $V_x$ to form $P$
  • Now $P^{-1} M P = \Lambda$, a diagonal matrix
  • Or $M = P \Lambda P^{-1}$

11
Eigenvectors review
  • This implies, with $\Lambda$ the diagonal matrix of eigenvalues,
  • $M^n = P \Lambda^n P^{-1}$
  • Or $M^n P = P \Lambda^n$ (we'll need this)

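As a small worked check of the diagonalisation identity, here is a hand-picked 2x2 example. The matrix is an illustration chosen for easy arithmetic, not the matrix from the slides.

```latex
M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
V_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \ (\lambda_1 = 3), \qquad
V_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \ (\lambda_2 = 1)

P = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
P^{-1} = \tfrac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}

M^n = P \Lambda^n P^{-1}
    = \tfrac{1}{2} \begin{pmatrix} 3^n + 1 & 3^n - 1 \\ 3^n - 1 & 3^n + 1 \end{pmatrix}
```

Setting $n = 1$ recovers $M$, and for large $n$ the $3^n$ terms dominate, which is why repeated multiplication exposes the eigenvector with the largest eigenvalue.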
12
Some definitions
  • Non-negative matrix: $M_{ij} \ge 0$ for all $i, j$ (written $M \ge 0$)
  • Irreducible matrix: square, nonnegative, and for every
    pair $(i, j)$ there exists $t$ s.t. $(M^t)_{ij} > 0$
  • For an adjacency matrix: strongly connected digraph
  • Period of $i$: $\gcd\{t : (M^t)_{ii} > 0\}$
  • For an irreducible matrix: the period is the same for all $i$.
  • For an adjacency matrix: period = gcd of the cycle
    lengths
  • Primitive matrix: there exists $t$ s.t. $M^t > 0$
  • Difference from irreducible: all entries are $> 0$ for the same $t$
  • For an adjacency matrix: gcd of cycle lengths = 1
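The definitions above can be checked mechanically on small adjacency matrices. The sketch below is an illustration only: the function names are hypothetical, and the primitivity test relies on Wielandt's bound (a primitive n×n matrix satisfies $M^t > 0$ for some $t \le (n-1)^2 + 1$), which is not stated in the slides.

```python
def is_irreducible(adj):
    """True if the digraph of the nonnegative square matrix is strongly connected."""
    n = len(adj)

    def reachable(has_edge):
        # Depth-first search from node 0 using the supplied edge test.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in range(n):
                if v not in seen and has_edge(u, v):
                    seen.add(v)
                    stack.append(v)
        return seen

    forward = reachable(lambda u, v: adj[u][v] > 0)   # nodes node 0 can reach
    backward = reachable(lambda u, v: adj[v][u] > 0)  # nodes that can reach node 0
    return len(forward) == n and len(backward) == n


def is_primitive(adj):
    """True if some power M^t (t <= (n-1)^2 + 1) has only positive entries."""
    n = len(adj)
    base = [[x > 0 for x in row] for row in adj]
    power = base
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        # Boolean matrix product: tracks which entries of the next power are positive.
        power = [[any(power[i][k] and base[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return False
```

A 2-cycle is irreducible but has period 2, so it is not primitive; adding a self-loop introduces a cycle of length 1 and makes it primitive.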

13
Perron-Frobenius Theorem
  • For a nonnegative, irreducible, primitive matrix
    $M$, there exists an eigenvalue $\lambda_1$ s.t.
  • $\lambda_1$ is real and positive, and $\lambda_1 > |\lambda|$ for
    every other eigenvalue $\lambda \ne \lambda_1$
  • $\lambda_1$ corresponds to a strictly positive eigenvector
  • $\lambda_1$ is a simple root of the characteristic
    equation $\det(M - \lambda I_n) = 0$
  • This property allows us to compute the dominant
    eigenvalue / eigenvector easily.

14
Dominant Eigenvector
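  • Since $MV_\lambda = \lambda V_\lambda$, let $(a_1, \ldots, a_n)$ be the coordinates
    of a vector $x$ in the basis formed by the eigenvectors $V_1, \ldots, V_n$.
  • Then $M^t x = \sum_i a_i \lambda_i^t V_i$
  • Now since $\lambda_1 > |\lambda_i|$ for $i > 1$,
  • $M^t x \approx a_1 \lambda_1^t V_1$ for large $t$: the dominant eigenvector!
  • Since $V_1$ is strictly positive, any random
    positive vector $x$ will work
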
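The argument above is the basis of power iteration. A minimal sketch, assuming a nonnegative primitive matrix given as a list of rows; the function name and the choice to normalise the vector to sum to 1 are illustrative assumptions, not taken from the slides.

```python
def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of a
    nonnegative primitive matrix by repeated multiplication."""
    n = len(matrix)
    x = [1.0 / n] * n  # any strictly positive start vector works
    eigenvalue = 0.0
    for _ in range(iterations):
        # y = M x
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        # x sums to 1, so sum(y) estimates the dominant eigenvalue.
        eigenvalue = sum(y)
        x = [value / eigenvalue for value in y]
    return eigenvalue, x
```

For the matrix `[[1, 1], [1, 0]]` the iteration approaches the golden ratio, with the other eigenvalue's contribution shrinking geometrically.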
15
(Contd)
  • Special case: stochastic matrix, $\lambda_1 = 1$, and $M^t$
    converges exponentially
  • $\lim_{t \to \infty} M^t = \mathbf{1}^T r$
  • where $r$ = stationary distribution of the Markov chain
    (random surfer model)

16
HITS
  • Introduced by Jon M. Kleinberg (1998).
  • Hypertext Induced Topic Selection:
  • Find a set of interesting pages
  • Find a base subgraph (of the Web) using this set
  • Use hubness and authoritativeness to rank
  • Recursive concept:
  • Good hubs point to good authorities
  • Good authorities are pointed to by good hubs

17
HITS Authority and Hubness
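[Figure: two panels. Node 1 as a hub pointing to nodes 5, 6, 7; node 1 as an authority pointed to by nodes 2, 3, 4]
  • $h(1) = a(5) + a(6) + a(7)$
  • $a(1) = h(2) + h(3) + h(4)$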
18
HITS Base Subgraph
  • BaseSubgraph(R, d)
  • $S \leftarrow R$
  • for each $v$ in $R$
  •   do $S \leftarrow S \cup ch[v]$
  •     $P \leftarrow pa[v]$
  •     if $|P| > d$
  •       then $P \leftarrow$ arbitrary subset of $P$ having size $d$
  •     $S \leftarrow S \cup P$
  • return $S$
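The base-subgraph construction can be sketched in a few lines. This is an illustration under assumptions: `children` and `parents` are hypothetical dictionaries standing in for ch[v] and pa[v], and the "arbitrary subset" is drawn at random.

```python
import random


def base_subgraph(root_set, children, parents, d, rng=random):
    """Expand the root set R with all children and at most d parents per page."""
    s = set(root_set)
    for v in root_set:
        s |= set(children.get(v, ()))   # S <- S U ch[v]
        p = list(parents.get(v, ()))    # P <- pa[v]
        if len(p) > d:
            p = rng.sample(p, d)        # arbitrary subset of size d
        s |= set(p)                     # S <- S U P
    return s
```

Capping the number of parents keeps a very popular page from flooding the base set with its in-links.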

19
HITS Algorithm
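  • HubsAuthorities(G)
  • $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  • $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  • $t \leftarrow 1$
  • repeat
  •   for each $v$ in $V$
  •     do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
  •       $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  •   $a_t \leftarrow a_t / \|a_t\|$
  •   $h_t \leftarrow h_t / \|h_t\|$
  •   $t \leftarrow t + 1$
  • until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \epsilon$
  • return $(a_t, h_t)$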
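A minimal runnable sketch of the hubs-and-authorities iteration follows. The function signature, the Euclidean norm, and the iteration cap are assumptions made for illustration; the slides do not specify them.

```python
import math


def hits(nodes, edges, epsilon=1e-10, max_iter=1000):
    """Iterate authority and hub scores on a directed graph until they settle."""
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:  # edge u -> v
        children[u].append(v)
        parents[v].append(u)

    def normalise(scores):
        norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
        return {v: x / norm for v, x in scores.items()}

    def distance(new, old):
        return math.sqrt(sum((new[v] - old[v]) ** 2 for v in new))

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Authority of v: sum of hub scores of the pages pointing to v.
        new_a = normalise({v: sum(h[w] for w in parents[v]) for v in nodes})
        # Hub score of v: sum of authority scores of the pages v points to.
        new_h = normalise({v: sum(a[w] for w in children[v]) for v in nodes})
        change = distance(new_a, a) + distance(new_h, h)
        a, h = new_a, new_h
        if change < epsilon:
            break
    return a, h
```

On the toy graph 1→3, 2→3, 1→4, page 3 ends up the stronger authority and page 1 the stronger hub, since page 1 points to both authorities.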
20
HITS Ensuring Convergence
  • Recursive dependency
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  • We can prove that $a(v)$ and $h(v)$ converge
21
HITS Ensuring Convergence
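  • $a_t = M^T h_{t-1}$ and $h_t = M a_{t-1}$
  • Thus, after $t$ iterations
  • $a_t = \alpha_t (M^T M)^{t-1} M^T \mathbf{1}$
  • $h_t = \beta_t (M M^T)^t \mathbf{1}$
  • where $\alpha_t$ and $\beta_t$ are normalization constants
  • It can be shown that these converge, e.g. for a
    nonnegative symmetric matrix $M$, to
  • $a$ = principal eigenvector of $M^T M$ and $h$ = principal eigenvector of $M M^T$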

22
HITS (contd)
  • Spamming
  • Identical links:
  • distribute scores by normalizing effects from the
    same host (e.g. 1/n)
  • Topic drift (many unrelated pages):
  • weight the edges of the graph according to the
    relevance of the source and destination (e.g.
    link text neighborhood)
  • Hub replication, clique attacks, link farms?
  • Solution?

23
demo!
  • Intuition of Hubness / Authness
  • Teoma.com
  • foosball
  • mountain dew

24
SALSA
  • SALSA (Lempel, Moran 2001)
  • Probabilistic extension of the HITS algorithm
  • Random walk is carried out by following
    hyperlinks both in the forward and in the
    backward direction
  • Two separate random walks
  • Hub walk
  • Authority walk

25
SALSA (contd)
  • Hub walk
  • Follow a Web link from a page $u_h$ to a page $w_a$ (a
    forward link) and then
  • Immediately traverse a backlink going from $w_a$ to
    $v_h$, where $(u,w) \in E$ and $(v,w) \in E$
  • Authority walk
  • Follow a backlink from a page $w_a$ to a page $u_h$
    (a backward link) and then
  • Immediately traverse a forward link going from
    $u_h$ to a page $x_a$, where $(u,w) \in E$ and $(u,x) \in E$

26
SALSA (contd)
  • Hub weight computed from the sum of the product
    of the inverse degree of the in-links and the
    out-links
  • This solves the clique attack / link farm problems
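The authority walk can be sketched as a Markov chain over pages that have in-links. This is an illustration under assumptions: the function name is hypothetical, each edge is assumed to appear once, and the stationary distribution is approximated by plain repeated multiplication.

```python
from collections import defaultdict


def salsa_authority_scores(edges, iterations=200):
    """Approximate the stationary distribution of the SALSA authority walk."""
    parents = defaultdict(list)
    children = defaultdict(list)
    for u, v in edges:  # edge u -> v
        children[u].append(v)
        parents[v].append(u)

    authorities = sorted(parents)
    # One step: follow a random backlink to a hub k, then a random forward link.
    transition = {i: defaultdict(float) for i in authorities}
    for i in authorities:
        for k in parents[i]:
            for j in children[k]:
                transition[i][j] += (1.0 / len(parents[i])) * (1.0 / len(children[k]))

    score = {i: 1.0 / len(authorities) for i in authorities}
    for _ in range(iterations):
        nxt = defaultdict(float)
        for i, mass in score.items():
            for j, p in transition[i].items():
                nxt[j] += mass * p
        score = dict(nxt)
    return score
```

On the graph 1→3, 2→3, 1→4 the scores settle at 2/3 for page 3 and 1/3 for page 4, i.e. proportional to in-degree within this connected component, which is why tightly knit cliques gain less here than under HITS.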

27
PHITS
  • Co-citation matrix: community
  • Effect on eigenvector: authority of document in
    community
  • HITS uses only the dominant eigenvector: the principal
    community.
  • What about smaller communities? (eigenvectors with
    smaller eigenvalues)

28
PHITS Model
  • Model: d → z → c, with $P(d)$, $P(z|d)$ and $P(c|z)$
  • Add communities (z) between documents (d) and citations (c)
  • Describe citation likelihood as
  • $P(d,c) = P(d) P(c|d)$, where
  • $P(c|d) = \sum_z P(c|z) P(z|d)$
  • Total likelihood of citations matrix M
  • $L(M) = \prod_{(d,c) \in M} P(d,c)$
  • This becomes a max. likelihood problem
  • Note: this is factored. (Different from a mixture
    model)
29
PHITS (contd)
  • Open up the eqn
  • $P(d,c) = \sum_z P(z) P(c|z) P(d|z)$
  • Alternate between
  • Computing $P(z|d,c)$
  • Re-estimating $P(z)$, $P(c|z)$ and $P(d|z)$
  • Issues: not globally optimal, cannot guarantee
    fits (solution: restarts; start with HITS / PCA
    model)
  • How to decide the number of factors? (Topic hierarchy)
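The alternation above is an EM procedure. A sketch of the two steps in the standard form for this kind of latent-factor model, assuming $M_{dc}$ denotes the number of citations from document $d$ to $c$ (that notation is an assumption; the slides only name the quantities being estimated):

```latex
% E-step: posterior of the community z for each observed citation (d, c)
P(z \mid d, c) = \frac{P(z)\, P(c \mid z)\, P(d \mid z)}
                      {\sum_{z'} P(z')\, P(c \mid z')\, P(d \mid z')}

% M-step: re-estimate the factors from the expected counts
P(c \mid z) \propto \sum_{d} M_{dc}\, P(z \mid d, c)
\qquad
P(d \mid z) \propto \sum_{c} M_{dc}\, P(z \mid d, c)
\qquad
P(z) \propto \sum_{d, c} M_{dc}\, P(z \mid d, c)
```

Each distribution is normalised to sum to 1 after the update. Because the likelihood is not concave, different starting points can reach different local optima, which is the motivation for the restarts mentioned above.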

30
PageRank
  • Page et al., 1998
  • Different from HITS
  • HITS takes hubness and authority weights
  • The page rank is proportional to its parents'
    rank, but inversely proportional to its parents'
    outdegree

31
PageRank Model
  • Just measuring in-degree (citation count) doesn't
    account for the authority of the source of a
    link.
  • Initial page rank equation for page p:
    $R(p) = c \sum_{q : q \to p} R(q) / N_q$
  • $N_q$ is the total number of out-links from page q.
  • A page, q, gives an equal fraction of its
    authority to all the pages it points to (e.g. p).
  • c is a normalizing constant set so that the rank
    of all pages always sums to 1.

32
Algorithm
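  • Iterate rank-flowing process until convergence:
  • Let S be the total set of pages.
  • Initialize $\forall p \in S$: $R(p) = 1/|S|$
  • Until ranks do not change (much)
    (convergence)
  •   For each $p \in S$: $R'(p) = \sum_{q : q \to p} R(q) / N_q$
  •   $c = 1 / \sum_{p \in S} R'(p)$
  •   For each $p \in S$: $R(p) = c R'(p)$
    (normalize)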
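The rank-flowing loop can be sketched as follows. This is an illustration under assumptions: `links` is a hypothetical dictionary mapping each page to the pages it points to, every page has at least one out-link, and convergence is measured by the summed absolute change.

```python
def simple_pagerank(links, tol=1e-12, max_iter=1000):
    """Basic rank flow without an escape term; assumes no dangling pages."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for q in pages:
            share = rank[q] / len(links[q])  # q splits its rank over N_q out-links
            for p in links[q]:
                new_rank[p] += share
        c = 1.0 / sum(new_rank.values())     # normalising constant
        new_rank = {p: c * r for p, r in new_rank.items()}
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

On the graph A→B, A→C, B→C, C→A the ranks settle at 0.4, 0.2 and 0.4: B receives only half of A's rank, while C collects the other half plus all of B's.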

33
Linear Algebra Version
  • Treat R as a vector over web pages.
  • Let M be a 2-d matrix over pages where
  • $M_{vu} = 1/N_u$ if $u \to v$, else $M_{vu} = 0$
  • Then $R = cMR$
  • R converges to the principal eigenvector of M.

34
Problems
  • Dangling page problem
  • Many Web pages have no inlinks/outlinks
  • Results in dangling edges in the graph
  • E.g.
  • no parent → rank 0:
    $M^t$ converges to a matrix
    whose last column is all zero
  • no children → no solution:
    $M^t$ converges to the zero matrix

35
Modifications
  • Surfer will restart browsing by picking a new Web
    page at random
  • $M = (B + E)$
  • E: escape matrix
  • M: stochastic matrix
  • Still
  • It is not guaranteed that M is primitive
  • If M is stochastic and primitive, PageRank
    converges to corresponding stationary
    distribution of M

36
New Formula
  • Hence we get

  • Escape / damping vector E. Can also be overloaded
    as a personalization vector
37
PageRank Algorithm
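  • Let S be the total set of pages.
  • Let $\forall p \in S$: $E(p) = \alpha/|S|$ (for some $0 < \alpha < 1$, e.g. 0.15)
  • Initialize $\forall p \in S$: $R(p) = 1/|S|$
  • Until ranks do not change (much) (convergence)
  •   For each $p \in S$: compute $R'(p)$ from the ranks of
      the pages linking to $p$ plus the escape term $E(p)$
  •   For each $p \in S$: $R(p) = c R'(p)$ (normalize)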
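A sketch of the loop with the escape term added. The exact update is not recoverable from the transcript, so the form used here, $R'(p) = (1-\alpha) \sum_{q \to p} R(q)/N_q + E(p)$, is an assumption, as are the function name and the `links` dictionary format.

```python
def pagerank(links, alpha=0.15, tol=1e-12, max_iter=1000):
    """PageRank with a uniform escape vector E(p) = alpha / |S|."""
    pages = sorted(links)
    n = len(pages)
    escape = {p: alpha / n for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = dict(escape)  # every page receives its escape share
        for q in pages:
            out = links[q]
            for p in out:
                # Assumed update: damped rank flow along each out-link.
                new_rank[p] += (1 - alpha) * rank[q] / len(out)
        c = 1.0 / sum(new_rank.values())  # restores a total rank of 1
        new_rank = {p: c * r for p, r in new_rank.items()}
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

With the escape term, a page with no in-links still ends up with a small positive rank, and a page with no out-links no longer prevents the iteration from settling, because the normalisation redistributes the rank it absorbs.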
38
Stochastic interpretation
  • PageRank can be seen as modeling a "random
    surfer" that starts on a random page and then at
    each point:
  • With probability E(p) randomly jumps to page p.
  • Otherwise, randomly follows a link on the
    current page.
  • R(p) models the probability that this random
    surfer will be on page p at any given time.
  • E jumps are needed to prevent the random surfer
    from getting trapped in web sinks with no
    outgoing links.

39
PageRank (cont.)
  • Simplifying and adding a damping factor d
  • PageRank stationary probability for this Markov
    chain, i.e.
  • where n is the total number of nodes in the
    graph

40
demo!
  • JUNG demo
  • General intuition
  • Pagerank.xls
  • Changes in initial values
  • dangling pages zero PR
  • Changes in damping factors
  • Number of iterations

41
Damping factor
  • P (1-d)P d/n
  • A low damping factor (= much damping) will make
    calculations easier. Since the flow of PageRank
    is dampened, the iterations will quickly converge.
  • A high damping factor (= little damping) will
    result in the average page's PageRank growing
    higher. Since there is little damping, PageRank
    received from external pages will be passed
    around in the system. It will not grow forever
    though: the maximum limit is Inbound PageRank ×
    d/(1-d).

42
PageRank Communities
  • Bianchini et al.
  • Community level interpretation
  • E (community energy) of subgraph $G_I$
  • $E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$
  • where $E_I = \sum_{i \in I} x_i$, $x_i$ = stable PR, dp = dangling
    pages
  • Implications
  • Same content divided into small pages: good
    (larger $|I|$)
  • Dangling pages: loss in energy

43
Stability
  • Are the link analysis algorithms based on
    eigenvectors stable, in the sense that results
    don't change significantly?
  • Suppose the connectivity of a portion of the graph is
    changed arbitrarily
  • How will it affect the results of the algorithms?

44
Stability of HITS
  • Ng et al (2001)
  • A bound on the number of hyperlinks k that can be
    added or deleted from one page without affecting
    the authority or hubness weights
  • It is possible to perturb a symmetric matrix by
    a quantity that grows as $\delta$ that produces a
    constant perturbation of the dominant eigenvector
  • $\delta$: eigengap $\lambda_1 - \lambda_2$; $d$: maximum outdegree of G
45
Stability of PageRank
  • Ng et al (2001)
  • V: the set of vertices touched by the perturbation
  • The parameter $\epsilon$ of the mixture model has a
    stabilization role
  • If the set of pages affected by the perturbation
    have a small rank, the overall change will also
    be small
  • Tighter bound by Bianchini et al (2001)
  • d(j) > 2 depends on the edges incident on j
46
PageRank vs. HITS
  • PageRank
  •   Computation: once for all documents and queries (offline)
  •   Query-independent: requires combination with
      query-dependent criteria
  •   Hard to spam
  • HITS
  •   Computation: requires computation for each query
  •   Query-dependent
  •   Relatively easy to spam
  •   Quality depends on quality of start set
  •   Gives hubs as well as authorities

47
PageRank vs. HITS
  • PageRank
  •   Lempel: not rank-stable. O(1) changes in the graph
      can change $O(N^2)$ order-relations
  •   Ng, Zheng, Jordan '01: value-stable. A change in k
      nodes (with PR values $p_1, \ldots, p_k$) results in p s.t.
  • HITS
  •   Not rank-stable
  •   Value-stability depends on the gap g between the
      largest and second largest eigenvalue: a change of
      O(g) nodes results in p s.t.

48
PageRank variants
  • ObjectRank
  • Hristidis, et al.
  • Create network of objects in databases
  • Additional processing step (thanks to size)
  • Create a PR vector for each word
  • Merge word lists at query time
  • Is this web-scalable?
  • Not all words are distinct (synonyms)
  • Popular queries: 100,000 (4B pages → $4 \times 10^{14}$
    ints)

49
PageRank variants
  • Topic-Sensitive PageRank
  • Haveliwala, 2002
  • Pre-compute PPV($r_i$) for a topical basis $r_1, \ldots, r_k$,
    k = 20
  • Query: user submits a topic by
  • Query engine combines PPV($r_i$) vectors using
    personalization weights

50
Rank Aggregation
  • Why?
  • Metasearch (Dogpile = Y! + G + Ask)
  • Rank aggregation (PageRank + TF/IDF)
  • How: given lists A and B, A(i) = rank of element
    i in A.
  • Minimize distance measures
  • Spearman footrule distance: sum of rank distances,
    $\sum_{i=1}^{n} |A(i) - B(i)|$ (linear)
  • Kendall tau distance: number of pairwise disagreements,
    $|\{(i, j) : i < j,\ A(i) < A(j) \text{ but } B(i) > B(j)\}|$
    ($n \log n$)
  • What about top-k lists? Take union, and project.

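Both distances are short to compute once each list is stored as a mapping from element to rank. A minimal sketch; the dictionary representation and function names are assumptions, and the Kendall version below is the simple quadratic pair count rather than the $n \log n$ variant mentioned on the slide. It counts a disagreement in either direction.

```python
from itertools import combinations


def spearman_footrule(a, b):
    """Sum of absolute rank differences; a and b map element -> rank."""
    return sum(abs(a[x] - b[x]) for x in a)


def kendall_tau(a, b):
    """Number of element pairs that the two rankings order differently."""
    return sum(
        1
        for x, y in combinations(a, 2)
        if (a[x] - a[y]) * (b[x] - b[y]) < 0
    )
```

For example, reversing a three-element list gives a footrule distance of 4 and a Kendall tau distance of 3, while swapping only the top two elements gives 2 and 1.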
51
Rank Aggregation
  • Strategy: make a global list, minimize distance
  • Kemeny aggregation (minimize Kendall): NP-hard,
    even with 4 lists.
  • This has a max. likelihood interpretation:
  • Consider each candidate list as a noisy version of
    the global list.
  • Find the list most likely to produce the candidate
    lists.
  • Kemeny satisfies the extended Condorcet criterion: partition
    the global list, part A beats part B by majority.
  • Good for spam: hard to spam a majority of search
    engines.
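For a handful of items, Kemeny aggregation can be done by exhaustive search, which makes the objective concrete. This sketch is illustrative only: it tries every permutation, so it is usable only for very small inputs, consistent with the problem being NP-hard in general. The function names are hypothetical.

```python
from itertools import combinations, permutations


def kendall_distance(order_a, order_b):
    """Count pairs ordered differently by two full orderings of the same items."""
    pos_a = {x: i for i, x in enumerate(order_a)}
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_aggregate(lists):
    """Return the ordering with the smallest total Kendall distance to all lists."""
    items = sorted(lists[0])
    return min(
        permutations(items),
        key=lambda candidate: sum(kendall_distance(candidate, l) for l in lists),
    )
```

With the three input lists (a, b, c), (a, c, b) and (b, a, c), the aggregate is (a, b, c): a majority places a before b, a before c, and b before c.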

52
Page Reputations
  • Penetration: $P_p(t) = I(p, t) / N(t)$
  • Focus: $F_t(p) = I(p, t) / In(p)$
  • $I(p, t)$ = number of pages on t pointing to p
  • $In(p)$ = number of pages pointing to p
  • $N(t)$ = number of pages on t
  • $RM(p,t) = (P_p(t) - L(p)) / L(p) = (N_w \cdot I(p, t)) / (N(t) \cdot In(p)) - 1$
  • $L(p) = In(p) / N_w$
  • t derived from snippets, pre-decided.
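The reputation measures are simple ratios of link counts. A small sketch with made-up counts; the function and argument names are hypothetical, and $N_w$ is taken to be the total number of pages on the Web.

```python
def reputation(i_pt, in_p, n_t, n_w):
    """Penetration, focus and RM for page p on topic t.

    i_pt: pages on topic t pointing to p   (I(p, t))
    in_p: pages pointing to p              (In(p))
    n_t:  pages on topic t                 (N(t))
    n_w:  total pages considered           (N_w, assumed meaning)
    """
    penetration = i_pt / n_t
    focus = i_pt / in_p
    baseline = in_p / n_w                   # L(p)
    rm = (penetration - baseline) / baseline
    return penetration, focus, rm
```

For instance, with 50 on-topic in-links out of 200 in-links, 1,000 pages on the topic and 1,000,000 pages overall, penetration is 0.05, focus is 0.25 and RM is 249, which matches $N_w \cdot I(p,t) / (N(t) \cdot In(p)) - 1 = 250 - 1$.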

53
fin.
54
bibliography
  • J. Kleinberg et al. HITS: Inferring Web
    communities from link topology.
  • R. Lempel, S. Moran. SALSA: the stochastic
    approach for link-structure analysis. ACM
    Transactions on Information Systems (TOIS), 2001.
  • Sergey Brin and Lawrence Page. The anatomy of a
    large-scale hypertextual Web search engine. In
    Proceedings of the 7th International Conference
    on the World Wide Web, pages 107-117, 1998.
    Elsevier Science B. V.
  • Arvind Arasu, Jasmine Novak, Andrew Tomkins, and
    John Tomlin. PageRank computation and the
    structure of the web: Experiments and algorithms.
    In Proceedings of the 11th International
    Conference on the World Wide Web, 2002. ACM
    Press.
  • Monica Bianchini, Marco Gori, and Franco
    Scarselli. Inside PageRank. ACM Transactions on
    Internet Technology, 5(1):92-128, 2005. ACM
    Press.
  • David Cohn and Huan Chang. Learning to
    probabilistically identify authoritative
    documents. In Pat Langley, editor, Proceedings of
    the 17th International Conference on Machine
    Learning, pages 167-174, 2000. Morgan Kaufmann.
  • Andrey Balmin, Vagelis Hristidis, and Yannis
    Papakonstantinou. Authority-Based Keyword Queries
    in Databases using ObjectRank.
  • Taher Haveliwala. Topic-Sensitive PageRank. In WWW
    2002.
  • Cynthia Dwork, Ravi Kumar, Moni Naor, and D.
    Sivakumar. Rank aggregation methods for the Web.
    In Proceedings of the 10th International
    Conference on the World Wide Web, pages 613-622,
    2001. ACM Press.
  • Alberto O. Mendelzon and Davood Rafiei. What do
    the neighbours think? Computing web page
    reputations. IEEE Data Engineering Bulletin,
    23(3):9-16, 2000.
  • PageRank explained with bright colors.
  • Monika Henzinger, Google Inc. Hyperlink Analysis
    of the Web. Presentation.
  • Pierre Baldi, Paolo Frasconi, Padhraic Smyth.
    Modeling the Internet and the Web.
  • Soumen Chakrabarti. Mining the Web.