Title: Vincent Blondel and Paul Van Dooren CESAME, Universite Catholique de Louvain
2 Web searching and graph similarity
Vincent Blondel and Paul Van Dooren, CESAME, Université Catholique de Louvain
http://www.csam.ucl.ac.be/
- in collaboration with A. Gajardo, M. Heymans, P. Senellart
- To appear in SIAM Review
Bari, 6 September 2004
4 The web graph
- Nodes: web pages; edges: hyperlinks between pages
- 3 billion pages (Google searched 3,083,324,625 web pages in 2002)
- Average of 7 outgoing links
- Growth of a few % every month
5 Outline
- 1. Structure of the web
- 2. Methods for searching the web (Google PageRank and Kleinberg HITS)
- 3. Similarity in graphs
- 4. Application to synonym extraction
6 Structure of the web
- Experiments: two crawls in 1999 found a giant strongly connected component (core)
- It contains the most prominent sites
- It contains 30% of all pages
- Average distance between nodes is 16
- Small world
- Ref: Broder et al., Graph structure in the web, WWW9, 2000
7 The web is a bowtie
- Ref: The web is a bowtie, Nature, May 11, 2000
8 In- and out-degree distributions
- Power law distribution: the number of pages of in-degree $n$ is proportional to $1/n^{2.1}$ (Zipf law)
10 A score for every page
- The score of a page is high if the page has many incoming links coming from pages with high page score
- One browses from page to page by following outgoing links with equal probability. Score = frequency with which a page is visited.
- some pages may have no outgoing links
- many pages have zero in-degree
11 PageRank: teleporting random score
- The surfer follows a path by choosing an outgoing link with probability $p/d_{out}(i)$, or teleports to a random web page with probability $0 < 1-p < 1$.
- Put the transition probability of $i$ to $j$ in a matrix $M$ ($b_{ij} = 1$ if $i \to j$): $m_{ij} = p\, b_{ij}/d_{out}(i) + (1-p)/n$
- then the vector $x$ of probability distribution on the nodes of the graph is the steady state vector of the iteration $x_{k+1} = M^T x_k$, i.e. the dominant eigenvector of the matrix $M^T$ (unique because of Perron-Frobenius)
- PageRank of node $i$ is the (relative) size of element $i$ of this vector
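The iteration $x_{k+1} = M^T x_k$ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `pagerank`, the default `p = 0.85`, the stopping tolerance and the three-page example are not from the slides, and the treatment of pages without outgoing links (spreading their mass uniformly) is one common convention, not something the slide specifies.

```python
def pagerank(links, p=0.85, tol=1e-12, max_iter=1000):
    """Power iteration x_{k+1} = M^T x_k for the teleporting random surfer.

    links[i] lists the pages that page i points to (its outgoing links).
    """
    n = len(links)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # Teleport term: each page receives (1 - p) / n because sum(x) == 1.
        new_x = [(1.0 - p) / n] * n
        for i, outgoing in enumerate(links):
            if outgoing:
                # m_ij = p * b_ij / d_out(i): split p * x[i] over the links.
                share = p * x[i] / len(outgoing)
                for j in outgoing:
                    new_x[j] += share
            else:
                # Assumption (not on the slide): a page without outgoing
                # links spreads its link-following mass uniformly.
                for j in range(n):
                    new_x[j] += p * x[i] / n
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[1, 2], [2], [0]]
scores = pagerank(links)
```

In this toy graph page 2 has two incoming links and page 1 only one, so page 2 ends up with the larger score; the scores sum to 1 because each step maps a probability distribution to a probability distribution.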
12 Matlab News and Notes, October 2002
13 And my own PageRank?
- use Google toolbar
- some top pages:
Rank  URL                        PageRank  In-degree
1     http://www.yahoo.com       10        654,000
2     http://www.adobe.com       10        646,000
5     http://www.google.com      10        252,000
8     http://www.microsoft.com   10        129,000
12    http://www.nasa.gov        10        93,900
20    http://mit.edu             10        47,600
23    http://www.nsf.gov         10        39,400
26    http://www.inria.fr        10        17,400
72    http://www.stanford.edu    9         36,300
15 Kleinberg's structure graph
- The score of a page is high if the page has many incoming links
- The score is high if the incoming links are from pages that have high scores
- This inspired Kleinberg's structure graph: hub → authority
16 Good authorities for "University Belgium"
17 A good hub for "University Belgium"
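18 Hub and authority scores
- Web pages have a hub score $h_j$ and an authority score $a_j$ which are mutually reinforcing:
- pages with large $h_j$ point to pages with high $a_j$
- pages with large $a_j$ are pointed to by pages with high $h_j$
- $h_j \leftarrow \sum_{i:(j \to i)} a_i$
- $a_j \leftarrow \sum_{i:(i \to j)} h_i$
- or, using the adjacency matrix $B$ of the graph ($b_{ji} = 1$ if $j \to i$ is an edge):
  $\begin{pmatrix} h \\ a \end{pmatrix}_{k+1} = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \begin{pmatrix} h \\ a \end{pmatrix}_k, \quad \begin{pmatrix} h \\ a \end{pmatrix}_0 = \mathbf{1}$
- Use limiting vector $a$ (dominant eigenvector of $B^T B$) to rank pages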
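A minimal sketch of the mutually reinforcing updates, written directly from the two sums above. The function names, the fixed step count and the four-page example graph are illustrative assumptions. The hub and authority vectors are normalized separately here, which changes only their scale, not the ranking.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(value * value for value in v))
    return [value / norm for value in v] if norm > 0 else v


def hub_authority(edges, n, steps=100):
    """Iterate h <- B a and a <- B^T h, starting from all-ones vectors.

    edges is a list of pairs (j, i) meaning that page j links to page i.
    """
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(steps):
        new_h = [0.0] * n
        new_a = [0.0] * n
        for j, i in edges:
            new_h[j] += a[i]  # h_j <- sum of a_i over edges j -> i
            new_a[i] += h[j]  # a_i <- sum of h_j over edges j -> i
        h, a = normalize(new_h), normalize(new_a)
    return h, a


# Hypothetical example: pages 0 and 1 link to page 2, page 0 also links to 3.
edges = [(0, 2), (1, 2), (0, 3)]
hub, authority = hub_authority(edges, n=4)
```

Page 2 receives links from both hubs and gets the highest authority score, while page 0, which points to both authorities, is the best hub.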
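20 Extension to another structure graph
- Give three scores to each web page: begin $b$, center $c$, end $e$, for the structure graph $b \to c \to e$
- Use again mutual reinforcement to define the iteration:
- $b_j \leftarrow \sum_{i:(j \to i)} c_i$
- $c_j \leftarrow \sum_{i:(i \to j)} b_i + \sum_{i:(j \to i)} e_i$
- $e_j \leftarrow \sum_{i:(i \to j)} c_i$
- Defines a limiting vector for the iteration $x_{k+1} = M x_k$, $x_0 = \mathbf{1}$, where
  $x = \begin{pmatrix} b \\ c \\ e \end{pmatrix}, \quad M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$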
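21 Towards arbitrary graphs
- For the graph $\bullet \to \bullet$: $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ and $M = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$
- For the graph $\bullet \to \bullet \to \bullet$: $A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$ and $M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$
- Formula for $M$ for two arbitrary graphs $G_A$ and $G_B$:
- $M = A \otimes B + A^T \otimes B^T$
- With $x_k = \mathrm{vec}(X_k)$, the iteration $x_{k+1} = M x_k$ is equivalent to $X_{k+1} = B X_k A^T + B^T X_k A$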
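The equivalence between the Kronecker form and the matrix form can be checked numerically on a small example. The sketch below uses plain Python lists; the helper names and the matrices `A`, `B`, `X` are illustrative assumptions, and `vec` stacks the columns of its argument.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(column) for column in zip(*X)]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def kron(X, Y):
    """Kronecker product: block (i, j) of the result is X[i][j] * Y."""
    return [[X[i][j] * Y[k][l]
             for j in range(len(X[0])) for l in range(len(Y[0]))]
            for i in range(len(X)) for k in range(len(Y))]


def vec(X):
    """Stack the columns of X into one long vector."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


# A: the path graph with three vertices; B: a hypothetical 4-vertex graph.
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
B = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # n_B rows, n_A columns

# Left-hand side: M vec(X) with M = A (x) B + A^T (x) B^T.
M = add(kron(A, B), kron(transpose(A), transpose(B)))
x = vec(X)
lhs = [sum(m * v for m, v in zip(row, x)) for row in M]

# Right-hand side: vec(B X A^T + B^T X A).
rhs = vec(add(matmul(matmul(B, X), transpose(A)),
              matmul(matmul(transpose(B), X), A)))
```

With integer entries the two sides agree exactly. In practice the matrix form is the useful one, since it avoids building the $n_A n_B \times n_A n_B$ matrix $M$.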
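22 Convergence?
- The (normalized) sequence
- $Z_{k+1} = (B Z_k A^T + B^T Z_k A) / \|B Z_k A^T + B^T Z_k A\|_F$
- has two fixed points $Z_{even}$ and $Z_{odd}$ for every $Z_0 > 0$
- Similarity matrix: $S = \lim_{k \to \infty} Z_{2k}$, $Z_0 = \mathbf{1}$
- $S_{i,j}$ is the similarity score between $V_j(A)$ and $V_i(B)$
- Properties (generic):
- $\rho S = B S A^T + B^T S A$, $\rho = \|B S A^T + B^T S A\|_F$
- Fixed point of largest 1-norm
- Robust fixed point for $\rho_\varepsilon S_\varepsilon = B S_\varepsilon A^T + B^T S_\varepsilon A + \varepsilon \mathbf{1} S_\varepsilon \mathbf{1}$
- Linear convergence (power method for sparse $M$)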
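The normalized iteration can be sketched as follows, assuming adjacency matrices stored as lists of rows. The function name, the number of steps and the example graphs are illustrative assumptions; the loop always runs an even number of steps so that the result approximates the even limit $S$.

```python
import math


def similarity_matrix(A, B, double_steps=50):
    """Even iterate of Z <- (B Z A^T + B^T Z A) / ||.||_F, from Z_0 = all ones.

    Returns an n_B x n_A matrix: entry [i][j] compares vertex i of B
    with vertex j of A.
    """
    n_a, n_b = len(A), len(B)
    Z = [[1.0] * n_a for _ in range(n_b)]
    for _ in range(2 * double_steps):  # even number of steps
        new_Z = [[0.0] * n_a for _ in range(n_b)]
        for i in range(n_b):
            for j in range(n_a):
                total = 0.0
                for k in range(n_b):
                    for l in range(n_a):
                        # (B Z A^T)_ij uses B[i][k] * Z[k][l] * A[j][l]
                        # (B^T Z A)_ij uses B[k][i] * Z[k][l] * A[l][j]
                        total += (B[i][k] * A[j][l]
                                  + B[k][i] * A[l][j]) * Z[k][l]
                new_Z[i][j] = total
        norm = math.sqrt(sum(v * v for row in new_Z for v in row))
        if norm == 0.0:
            return new_Z  # no edges: nothing to compare
        Z = [[v / norm for v in row] for row in new_Z]
    return Z


# Structure graph A: b -> c -> e (columns 0, 1, 2 of the result).
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
# Hypothetical graph B: vertices 0 and 1 point to 2, which points to 3 and 4.
B = [[0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]]
S = similarity_matrix(A, B)
central = [row[1] for row in S]  # similarity of each vertex of B to c
```

In this example vertex 2 sits between its parents and its children, so it receives the largest central score, while vertices 0 and 1 score highest in the "begin" column. The four nested loops are written for readability; a sparse-matrix implementation would be preferable for graphs of realistic size.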
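23 Bow tie example
- Graph A: h → a
- Graph B: bow tie graph (vertex labels 1, 2, n+1, n+m+1) [figure not transcribed]
- Similarity matrix S with columns h and a, given for m > n and for n > m [matrices not transcribed]
- not satisfactory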
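24 Bow tie example
- Graph A: b → c → e
- Graph B: bow tie graph (vertex labels 1, 2, n+1, n+m+1) [figure not transcribed]
- Similarity matrix S with columns b, c and e [matrix not transcribed]
- central score is good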
25 Other properties
- Central score is a dominant eigenvector of $B B^T + B^T B$ (cfr. hub score of $B B^T$ and authority score of $B^T B$)
- Similarity matrix of a graph with itself is square and semi-definite.
- Examples: path graph, cycle graph [figures not transcribed]
26 The dictionary graph
- OPTED, based on Webster's unabridged dictionary
- http://msowww.anu.edu.au/ralph/OPTED
- Nodes: words present in the dictionary (112,169 nodes)
- Edge (u,v) if v appears in the definition of u (1,398,424 edges)
- Average of 12 edges per node
27 In- and out-degree distribution
- Very similar to web (power law)
- Words with highest in-degree: of, a, the, or, to, in
- Words with null out-degree: 14159, Fe3O4, Aaron, and some undefined or misspelled words
28 Neighborhood graph
- is the subset of vertices used for finding synonyms
- it contains all parents and children of the node
- [Figure: neighborhood graph of "likely"]
- Central uses this sub-graph to rank synonyms automatically
- Comparison with Vectors, ArcRank (automatic)
- Wordnet, Microsoft Word (manual)
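Extracting the neighborhood graph of a word can be sketched as below. The toy definitions are invented for illustration and are not taken from the dictionary; the function name is likewise an assumption. Following the dictionary graph convention, an edge (u, v) means that v appears in the definition of u.

```python
def neighborhood_graph(edges, word):
    """Return the vertices and edges of the neighborhood graph of word.

    The vertex set is the word itself, its parents (words whose definition
    uses it) and its children (words used in its definition); the edge set
    keeps every edge of the full graph between two of these vertices.
    """
    parents = {u for u, v in edges if v == word}
    children = {v for u, v in edges if u == word}
    nodes = parents | children | {word}
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, sub_edges


# Hypothetical toy dictionary: word -> words appearing in its definition.
definitions = {
    "likely": ["probable", "credible"],
    "probable": ["likely", "true"],
    "plausible": ["likely"],
    "true": ["real"],
}
edges = [(u, v) for u, words in definitions.items() for v in words]
nodes, sub_edges = neighborhood_graph(edges, "likely")
```

The adjacency matrix of this sub-graph would then play the role of graph B when the central score is computed against the structure graph b → c → e.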
29 Disappear
30 Science
31 Sugar
32 Conclusion
- New notion of similarity between vertices of a graph
- Easy to compute: start from $X_0 = \mathbf{1}$ and take even normalized iterates of $X_{k+1} = B X_k A^T + B^T X_k A$
- Potential use for data-mining, classification, clustering
- Successful implementation for the French dictionary Le petit Robert
- Applications in texts, internet, reference lists, telephone networks, bipartite graphs (Melnik, Widom, ...)
- Different from sub-graph problems!
34 Distribution of calls received
- [Figure: number of customers versus number of calls received]
- Example: 2000 people have received 100 calls