Title: Linkage Analysis
1Linkage Analysis
2Acknowledgement
- This lecture note cites, adapts, or refers to information from the following sources:
- - http://www.stanford.edu/class/cs276a/
- - http://www.stanford.edu/class/cs276b/
- - http://www-clips.imag.fr/mrim/essir03/PDF/10.Melucci.pdf
- - Diligenti et al., "A unified probabilistic framework for web page scoring systems", IEEE Trans. Knowledge and Data Engineering, Vol. 16(1), Jan. 2004
- - T. H. Haveliwala, "Topic-Sensitive PageRank: A context-sensitive ranking algorithm for web search", IEEE Trans. Knowledge and Data Engineering, Vol. 15(4), July/August 2003
3Web Search Process
[Figure: web search process. Labeled components: Crawler, Information Need, Formulation, Indexing, Query Rep, Inverted index and web graph, Ranking, Ranked List, Learning, User Relevance Feedback]
4Some challenges of web search
- No central editorial control
- Web page quality is highly heterogeneous
- - Some pages are relevant to your query but of very low quality.
- Therefore web search should take advantage of both the relevance and the quality/reputation of a page.
- The question is how to evaluate the quality/reputation of a web page.
- The answer is linkage analysis.
5The web as a graph
- The Web can be viewed as a complex graph G
- Each page is a node $D_p$ of G
- Each hyperlink is an arc $e = (D_p, D_q)$
- The topology of this graph G is the result of the
behaviors of the community of all the web
authors.
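The graph view can be sketched in a few lines of Python. This is only an illustration: the three pages and the `links` dictionary are hypothetical, not taken from the lecture, and the helper names `arcs` and `in_links` are made up for this sketch.

```python
# Hypothetical toy web: each key is a page (node), each value lists the
# pages it links to (its outgoing arcs).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def arcs(graph):
    """Return the arc set: one (source, target) pair per hyperlink."""
    return {(q, p) for q, targets in graph.items() for p in targets}

def in_links(graph):
    """Invert the graph: for each page, list the pages that link to it."""
    incoming = {page: [] for page in graph}
    for q, targets in graph.items():
        for p in targets:
            incoming[p].append(q)
    return incoming
```

Here `in_links(links)` gives the pages pointing at each page, which is the information linkage analysis uses to judge a page's reputation.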
6Web as a graph (2)
- Therefore, the graph topology carries much information about the cooperative interaction of many agents
- Based on this, two assumptions can be made
- - If a page is pointed to by a number of good pages, the page itself should be good
- - Low-quality sites are unlikely to have many high-quality sites linking to them
7Model for linkage analysis: Random walk theory
- Based on the web graph G, random walk theory has been proposed as a framework for linkage analysis.
- Under random walk theory, the quality/reputation of a page is computed as the probability of visiting that page in a random walk on the web graph.
8
- Imagine a surfer browsing the WWW. At each step of the walk, the surfer performs one of the following three actions:
- - Randomly jump to any node/page in the graph (this action is denoted j)
- - Follow a hyperlink from the current page (this action is denoted l)
- - Follow a hyperlink in the inverse direction (this action is denoted b)
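The three actions can be simulated directly. The sketch below is a rough illustration under stated assumptions: the toy graph, the action probabilities (0.2 for j, 0.6 for l, the remaining 0.2 for b), and the function names are all hypothetical, and every page in the toy graph is assumed to have at least one outgoing and one incoming link.

```python
import random

# Hypothetical toy web graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)
# Inverse links: page -> pages that link to it.
back = {p: [q for q in pages if p in links[q]] for p in pages}

def step(page, rng, p_jump=0.2, p_link=0.6):
    """Perform one action from `page` and return the next page.

    The action probabilities are illustrative assumptions; the remaining
    mass (1 - p_jump - p_link) goes to the inverse-link action b.
    """
    u = rng.random()
    if u < p_jump:                    # action j: jump to any page
        return rng.choice(pages)
    if u < p_jump + p_link:           # action l: follow a hyperlink
        return rng.choice(links[page])
    return rng.choice(back[page])     # action b: follow a link backwards

def visit_frequencies(n_steps, seed=0):
    """Walk for n_steps and return the fraction of steps spent on each page."""
    rng = random.Random(seed)
    page = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for _ in range(n_steps):
        page = step(page, rng)
        counts[page] += 1
    return {p: c / n_steps for p, c in counts.items()}
```

For a long walk, the visit frequencies returned by `visit_frequencies` approximate the probability of finding the surfer on each page.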
9Random walk theory: single surfer walk
- Based on the random walk model, we can define two sets of probabilities
- Set 1
- - $x(j|q)$: probability that the surfer chooses to jump from page q.
- - $x(l|q)$: probability that the surfer chooses to follow a hyperlink in page q.
- - $x(b|q)$: probability that the surfer follows an inverse link from q.
- These probabilities must satisfy the following constraint for every page q:
$$x(j|q) + x(l|q) + x(b|q) = 1$$
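10Random walk theory: single surfer walk (2)
- Set 2
- - $x(p|q, j)$: probability of jumping from page q to page p.
- - $x(p|q, l)$: probability of following a hyperlink in page q to page p.
- - $x(p|q, b)$: probability of following an inverse link back to page p from q.
- These probabilities should satisfy the following constraints for every page q:
$$\sum_{p} x(p|q, j) = 1, \qquad \sum_{p} x(p|q, l) = 1, \qquad \sum_{p} x(p|q, b) = 1$$
where each sum runs over all N pages and a term is zero when the action cannot lead from q to p.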
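11Random walk theory: single surfer walk (3)
- Let $x_p(t)$ be the probability that the surfer is at page p at time t.
- Then $x(t) = [x_1(t), \ldots, x_N(t)]$ is the probability distribution over all the pages at time t (N is the total number of pages).
- So, how do we calculate $x_p(t+1)$?
- By the law of total probability over the surfer's previous page q and the action chosen there:
$$x_p(t+1) = \sum_{q=1}^{N} x(p|q)\, x_q(t), \qquad x(p|q) = x(p|q, j)\,x(j|q) + x(p|q, l)\,x(l|q) + x(p|q, b)\,x(b|q)$$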
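One update step of the single surfer walk can be sketched as follows. The step combines the action probabilities (Set 1) with the destination probabilities (Set 2) into a one-step probability x(p|q), then sums x(p|q) x_q(t) over all pages q. The toy graph, the values in `ACTION`, and the uniform destination choice are assumptions made for illustration only.

```python
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical toy graph
pages = sorted(links)
back = {p: [q for q in pages if p in links[q]] for p in pages}

# Set 1 (assumed values): x(j|q), x(l|q), x(b|q), identical for every page q.
ACTION = {"j": 0.2, "l": 0.6, "b": 0.2}

def dest_prob(p, q, a):
    """Set 2: x(p|q, a), uniform over the pages reachable from q by action a."""
    targets = {"j": pages, "l": links[q], "b": back[q]}[a]
    return 1.0 / len(targets) if p in targets else 0.0

def transition(p, q):
    """x(p|q): total probability of moving from q to p in one step."""
    return sum(dest_prob(p, q, a) * ACTION[a] for a in ACTION)

def update(x):
    """x_p(t+1) = sum over q of x(p|q) * x_q(t)."""
    return {p: sum(transition(p, q) * x[q] for q in pages) for p in pages}

x0 = {"A": 1.0, "B": 0.0, "C": 0.0}   # surfer starts on page A
x1 = update(x0)                        # distribution after one step
```

Because each of the underlying distributions sums to one, `x1` is again a probability distribution over the pages.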
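12Random walk theory: single surfer walk (4)
- So, if $x(0) = [x_1(0), \ldots, x_N(0)]$ is known, we can calculate $x_p(t)$ for any t.
- So, how do we get $x(0) = [x_1(0), \ldots, x_N(0)]$?
- Here is an interesting proposition
- - If $x(j|q) > 0$ and $x(p|q, j) > 0$ for all pages p and q, then $x(t)$ converges as $t \to \infty$ to a distribution $x = [x_1, \ldots, x_N]$ that does not depend on $x(0)$.
- That means $x = [x_1, \ldots, x_N]$ can be used to represent the quality/reputation of each page.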
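The independence from the starting distribution can be checked numerically. The sketch below repeats the update until the distribution stops changing, starting from two different choices of x(0). The 3x3 table `T` of one-step probabilities is hypothetical; all its entries are positive, which is what a positive jump probability to every page would guarantee.

```python
# Hypothetical one-step probabilities for three pages: T[p][q] is the
# probability of moving to page p from page q, so every column sums to 1.
T = [
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.1],
    [0.4, 0.6, 0.2],
]

def update(x):
    """One step of the walk: x_p(t+1) = sum over q of T[p][q] * x_q(t)."""
    n = len(x)
    return [sum(T[p][q] * x[q] for q in range(n)) for p in range(n)]

def stationary(x0, tol=1e-12, max_iter=10_000):
    """Iterate the update from x0 until successive distributions agree."""
    x = list(x0)
    for _ in range(max_iter):
        nxt = update(x)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

from_first_page = stationary([1.0, 0.0, 0.0])    # surfer starts on page 1
from_uniform = stationary([1 / 3, 1 / 3, 1 / 3])  # surfer starts anywhere
```

Both runs end at the same vector up to numerical tolerance, so in this setting any valid x(0) can be used.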
13Google's page ranking
- The calculation of Google's page ranking can be modeled by a simpler version of the single surfer walk
- Only two actions are allowed for the surfer
- - Randomly jump to any node/page in the graph from page q, with probability $x(j|q) = 1 - d$
- - Follow a hyperlink from the current page q, with probability $x(l|q) = d$
- Given that the jump action is taken, the probability of jumping to page p is $x(p|q, j) = 1/N$, where N is the total number of web pages
- Given that the follow-a-link action is taken, the probability of reaching page p from q by following a link is $x(p|q, l) = 1/h_q$, where $h_q$ is the number of links in page q (and 0 if q does not link to p).
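Combining the two actions gives a one-step probability of moving from q to p: (1-d)/N from the jump, plus d/h_q when q links to p. A minimal sketch, assuming a hypothetical toy graph in which every page has at least one outgoing link and no duplicate links; the value d = 0.85 is an assumption, since the lecture leaves d unspecified.

```python
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical toy graph
pages = sorted(links)
N = len(pages)
d = 0.85   # assumed value for illustration

def transition(p, q):
    """One-step probability of moving from page q to page p.

    Jump part:  x(p|q, j) * x(j|q) = (1/N) * (1 - d)
    Link part:  x(p|q, l) * x(l|q) = (1/h_q) * d, only if q links to p
    """
    h_q = len(links[q])
    follow = 1.0 / h_q if p in links[q] else 0.0
    return (1.0 - d) / N + d * follow
```

For each starting page q, the values `transition(p, q)` over all p sum to 1, so they form a valid probability distribution.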
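14Google's page ranking (2)
- So we can calculate the probability that the surfer will reach page p at time t+1 from the distribution at time t:
$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \to p} \frac{x_q(t)}{h_q}$$
where the sum runs over the pages q that link to p.
- $x(j|q) = 1 - d > 0$ guarantees that the PageRank vector $x(t)$ converges to a distribution of page scores that doesn't depend on the initial distribution.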
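The iteration can be sketched as a short power-iteration loop: each page receives (1-d)/N from the jump action plus d * x_q(t) / h_q from every page q that links to it. The function name `pagerank`, the toy graph, the value d = 0.85, and the stopping tolerance are assumptions for illustration; the sketch also assumes every page has at least one outgoing link and that all link targets appear as keys.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate the two-action surfer update until the scores stop changing."""
    pages = sorted(links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}                # x(0): uniform start
    for _ in range(max_iter):
        nxt = {p: (1.0 - d) / n for p in pages}    # contribution of jumps
        for q in pages:
            share = d * x[q] / len(links[q])       # d * x_q(t) / h_q
            for p in links[q]:
                nxt[p] += share                    # q passes its share to p
        if max(abs(nxt[p] - x[p]) for p in pages) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical toy graph: page -> pages it links to.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy)
```

In this toy graph, C is linked from both A and B and ends up with the highest score, followed by A (which receives all of C's link share) and then B.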
15Google's page rank (3)
16Google's page rank (4)
[Figure: example graph with nodes A, B, and C]