Linkage Analysis - PowerPoint PPT Presentation

Provided by: yxie

1
Linkage Analysis
  • Dr. Ying Xie

2
Acknowledgement
  • This lecture note cites, adapts, or refers to
    information from the following sources:
  • http://www.stanford.edu/class/cs276a/
  • http://www.stanford.edu/class/cs276b/
  • http://www-clips.imag.fr/mrim/essir03/PDF/10.Melucci.pdf
  • Diligenti et al., A unified probabilistic
    framework for web page scoring systems,
    IEEE Trans. Knowledge and Data Engineering,
    Vol. 16(1), Jan. 2004
  • T. H. Haveliwala, Topic-Sensitive PageRank: A
    context-sensitive ranking algorithm for web
    search, IEEE Trans. Knowledge and Data
    Engineering, Vol. 15(4), July/August 2003

3
Web Search Process
[Diagram of the web search process. Labels: Crawler; Indexing; Inverted index and web graph; Information Need; Formulation; Query Rep; Ranking; Ranked List; User Relevance Feedback; Learning]
4
Some challenges of web search
  • No central editorial control
  • Web page quality is highly heterogeneous
  • - Some pages are relevant to your query but
    are of very low quality.
  • Therefore, web search should take advantage of
    both the relevance and the quality/reputation
    of a page.
  • The question is how to evaluate the
    quality/reputation of a web page.
  • The answer is linkage analysis.

5
The web as a graph
  • The Web can be viewed as a complex graph G
  • Each page is a node $D_p$
  • Each hyperlink is an arc $e = (D_p, D_q)$
  • The topology of this graph G is the result of the
    behaviors of the community of all the web
    authors.
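
A minimal sketch of this graph view, assuming a hypothetical four-page web. The page names and the `parents` helper are illustrative only and do not come from the lecture.

```python
# Hypothetical toy web: each key is a page, each value lists the pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# Each hyperlink is an arc (source page, target page).
arcs = [(q, p) for q, targets in web_graph.items() for p in targets]

def parents(graph, page):
    """Return the pages that link to `page` (its in-links), sorted by name."""
    return sorted(q for q, targets in graph.items() if page in targets)

print(arcs)
print(parents(web_graph, "C"))
```

Storing out-links per page keeps the arc list implicit; in-links are recovered by scanning, which is fine for a toy graph but would need a reverse index at web scale.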

6
Web as a graph (2)
  • Therefore, the graph topology carries much
    information about the cooperative
    interaction of many agents.
  • Based on this, two assumptions can be made:
  • If a page is pointed to by a number of good
    pages, the page itself should be good.
  • Low-quality sites are unlikely to have many
    high-quality sites linking to them.

7
Model for linkage analysis: random walk theory
  • Based on the web graph G, random walk theory has
    been proposed as a framework for conducting
    linkage analysis.
  • Under random walk theory, the quality/reputation of
    a page can be computed as the probability of
    visiting that page in a random walk on the web
    graph.

8
  • Imagine a surfer surfing the WWW. At each step
    of the walk, the surfer performs one of the
    following three actions:
  • Randomly jump to any node/page in the graph
    (this action is denoted as j)
  • Follow a hyperlink from the current page
    (this action is denoted as l)
  • Follow a hyperlink in the inverse direction
    (this action is denoted as b)
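
The three actions above can be sketched as a small simulation. This is an illustration under stated assumptions, not the lecture's method: the action probabilities are taken to be the same on every page, targets are chosen uniformly within an action, and the surfer falls back to a random jump when the chosen action has no link available. The graph `toy_web` and the function `surf` are hypothetical names.

```python
import random

# Hypothetical toy web: page -> pages it links to.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def surf(graph, steps, p_jump, p_link, p_back, seed=0):
    """Simulate one surfer; return the fraction of steps spent on each page."""
    assert abs(p_jump + p_link + p_back - 1.0) < 1e-9
    rng = random.Random(seed)
    pages = sorted(graph)
    # Reverse graph: for each page, the pages that link to it.
    back = {p: [q for q in pages if p in graph[q]] for p in pages}
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        action = rng.choices(["j", "l", "b"], weights=[p_jump, p_link, p_back])[0]
        if action == "l" and graph[current]:
            current = rng.choice(graph[current])   # follow a hyperlink
        elif action == "b" and back[current]:
            current = rng.choice(back[current])    # follow a link backwards
        else:
            current = rng.choice(pages)            # jump (or fallback)
        visits[current] += 1
    return {p: visits[p] / steps for p in pages}

freq = surf(toy_web, 10_000, p_jump=0.2, p_link=0.6, p_back=0.2)
print(freq)
```

The visit frequencies are an empirical estimate of how likely the surfer is to be on each page, which is the quantity the random walk model uses as a quality score.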

9
Random walk theory: single surfer walk
  • Based on the random walk model, we can define two
    sets of probabilities
  • Set 1
  • - $x(j|q)$: probability that the surfer chooses
    to jump from page q.
  • - $x(l|q)$: probability that the surfer chooses
    to follow a hyperlink on page q.
  • - $x(b|q)$: probability that the surfer chooses
    to follow an inverse link from q.
  • The above probabilities must satisfy the following
    constraints
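
Since the surfer takes exactly one of the three actions at every step, a natural reading of the constraints, assuming no other action is available, is that the action probabilities on each page are nonnegative and sum to one:

```latex
x(j|q) + x(l|q) + x(b|q) = 1,
\qquad x(j|q),\; x(l|q),\; x(b|q) \ge 0
\quad \text{for every page } q .
```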

10
Random walk theory: single surfer walk (2)
  • Set 2
  • - $x(p|q, j)$: probability of jumping from
    page q to page p.
  • - $x(p|q, l)$: probability of following a
    hyperlink on page q to page p.
  • - $x(p|q, b)$: probability of following an
    inverse link back to page p from q.
  • The above probabilities should satisfy the
    following constraints
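
One plausible form of these constraints, assuming each action leads to exactly one target page, is that for a fixed page q and a fixed action the target probabilities form a distribution over pages p:

```latex
\sum_{p} x(p|q, j) = 1, \qquad
\sum_{p} x(p|q, l) = 1, \qquad
\sum_{p} x(p|q, b) = 1
\quad \text{for every page } q ,
```

where $x(p|q, l)$ is zero unless q links to p, and $x(p|q, b)$ is zero unless p links to q. Pages with no outgoing or incoming links would need separate handling, which this sketch does not cover.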
11
Random walk theory: single surfer walk (3)
  • Let $x_p(t)$ be the probability that the surfer is
    at page p at time t.
  • Then $x(t) = [x_1(t), \ldots, x_N(t)]$ is the probability
    distribution over all the pages (N is the total
    number of pages) at time t.
  • So, how do we calculate $x_p(t+1)$?
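
One way to answer this question is the law of total probability: the surfer is at page p at time t+1 if it was at some page q at time t, chose an action there, and that action led to p. Writing $x(j|q)$, $x(l|q)$, $x(b|q)$ for the action probabilities on page q and $x(p|q, \cdot)$ for the target probabilities, a sketch of the update is:

```latex
x_p(t+1) = \sum_{q=1}^{N}
  \Big[ x(p|q, j)\, x(j|q) + x(p|q, l)\, x(l|q) + x(p|q, b)\, x(b|q) \Big]\, x_q(t)
```

The bracketed term is the one-step probability of moving from q to p, so the update is a matrix-vector product with a fixed transition matrix.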

12
Random walk theory: single surfer walk (4)
  • So, if $x(0) = [x_1(0), \ldots, x_N(0)]$ is known, we can
    calculate $x_p(t)$.
  • So, how do we get $x(0) = [x_1(0), \ldots, x_N(0)]$?
  • Here is an interesting proposition

That means $x = [x_1, \ldots, x_N]$ can be used to
represent the quality/reputation of each page.
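
The following sketch assumes the proposition is the usual convergence statement: when the jump probability is positive, x(t) approaches a limit that does not depend on x(0). The code iterates the one-step update from two different starting distributions on a hypothetical three-page graph and compares the results. Action probabilities are assumed constant across pages, targets are chosen uniformly within each action, and every page in the toy graph has at least one in-link and one out-link; `step` and `iterate` are illustrative names.

```python
# Hypothetical toy web: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def step(x, graph, p_jump, p_link, p_back):
    """One update x(t) -> x(t+1), with uniform target choice within each action."""
    pages = list(graph)
    back = {p: [q for q in pages if p in graph[q]] for p in pages}
    new = {p: 0.0 for p in pages}
    for q in pages:
        for p in pages:        # action j: uniform over all pages
            new[p] += x[q] * p_jump / len(pages)
        for p in graph[q]:     # action l: uniform over out-links of q
            new[p] += x[q] * p_link / len(graph[q])
        for p in back[q]:      # action b: uniform over in-links of q
            new[p] += x[q] * p_back / len(back[q])
    return new

def iterate(x0, graph, steps=200, p_jump=0.2, p_link=0.6, p_back=0.2):
    """Apply `step` repeatedly, starting from the distribution x0."""
    x = dict(x0)
    for _ in range(steps):
        x = step(x, graph, p_jump, p_link, p_back)
    return x

from_a = iterate({"A": 1.0, "B": 0.0, "C": 0.0}, graph)
from_uniform = iterate({"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}, graph)
gap = max(abs(from_a[p] - from_uniform[p]) for p in graph)
print(from_a, gap)
```

Both runs end at essentially the same distribution, so the choice of x(0) does not matter in this example.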
13
Google's page ranking
  • The calculation of Google's page ranking can be
    modeled by a simpler version of the single surfer
    walk in which only two actions are allowed:
  • - Randomly jump to any node/page in the
    graph from page q, with probability $x(j|q) = 1-d$
  • - Follow a hyperlink from the current
    page q, with probability $x(l|q) = d$
  • - Given that the jump action is taken, the
    probability of jumping to page p is $x(p|j) = 1/N$,
    where N is the total number of web pages
  • - Given that the follow-a-link action is
    taken, the probability of reaching page p from q
    by following a link is $x(p|q, l) = 1/h_q$, where
    $h_q$ is the number of links on page q.

14
Google's page ranking (2)
  • So we can calculate the probability that the
    surfer will reach page p at time t.
  • $x(j|q) = 1-d > 0$ guarantees that the PageRank
    vector x(t) converges to a distribution of page
    scores that doesn't depend on the initial
    distribution.
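
Combining the two actions with the probabilities defined for this simplified walk gives the update $x_p(t+1) = (1-d)/N + d \sum_{q \to p} x_q(t)/h_q$, where the sum runs over pages q that link to p. A minimal sketch of iterating this update follows. It assumes every page has at least one outgoing link, uses d = 0.85 as a conventional default rather than a value from the lecture, and runs on a hypothetical three-page graph; `pagerank` is an illustrative name.

```python
# Hypothetical toy web: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(graph, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate x_p(t+1) = (1-d)/N + d * sum over q->p of x_q(t)/h_q."""
    pages = list(graph)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}                # x(0): uniform start
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}    # jump term
        for q in pages:
            for p in graph[q]:                     # link term, h_q = len(graph[q])
                new[p] += d * x[q] / len(graph[q])
        if max(abs(new[p] - x[p]) for p in pages) < tol:
            return new
        x = new
    return x

ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

In this toy graph, page C is linked from both A and B and ends up with the highest score, while B, linked only from A, ends up with the lowest.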

15
Google's page rank (3)
16
Google's page rank (4)
[Figure: example with three pages labeled A, B, and C]