Vincent Blondel and Paul Van Dooren CESAME, Universite Catholique de Louvain - PowerPoint PPT Presentation

2
Web searching and graph similarity
Vincent Blondel and Paul Van Dooren
CESAME, Universite Catholique de Louvain
http://www.csam.ucl.ac.be/
  • In collaboration with A. Gajardo, M. Heymans, P.
    Senellart
  • To appear in SIAM Review
  • Bari, 6 September 2004

4
The web graph
  • Nodes: web pages; edges: hyperlinks between
    pages
  • About 3 billion pages (Google searched 3,083,324,625 web pages
    in 2002)
  • Average of 7 outgoing links per page
  • Growth of a few % every month

5
Outline
  • 1. Structure of the web
  • 2. Methods for searching the web
    (Google's PageRank and Kleinberg's HITS)
  • 3. Similarity in graphs
  • 4. Application to synonym extraction

6
Structure of the web
  • Experiments: two crawls in 1999 found a giant
    strongly connected component (the core)
  • It contains the most prominent sites
  • It contains 30% of all pages
  • Average distance between nodes is 16
  • Small world
  • Ref: Broder et al., Graph structure in the web,
    WWW9, 2000

7
The web is a bowtie
  • Ref: The web is a bowtie, Nature, May 11, 2000

8
In- and out-degree distributions
  • Power-law distribution: the number of pages of
    in-degree $n$ is proportional to $1/n^{2.1}$ (Zipf's law)

10
A score for every page
  • The score of a page is high if the page has many
    incoming links coming from pages with a high page score
  • One browses from page to page by following
    outgoing links with equal probability. Score = frequency
    with which a page is visited.
  • Some pages may have no outgoing links
  • Many pages have zero in-degree

11
PageRank: teleporting random score
  • The surfer follows a path by choosing each outgoing
    link of page $i$ with probability $p/d_{out}(i)$, or teleports
    to a random web page with probability $1-p$, where $0 < 1-p < 1$.
  • Put the transition probability from $i$ to $j$ in a
    matrix $M$ ($b_{ij} = 1$ if $i \to j$):
    $m_{ij} = p\, b_{ij}/d_{out}(i) + (1-p)/n$
  • Then the vector $x$ of the probability distribution on
    the nodes of the graph is the steady-state vector of the
    iteration $x_{k+1} = M^T x_k$, i.e. the dominant
    eigenvector of the matrix $M^T$ (unique because of
    Perron-Frobenius)
  • The PageRank of node $i$ is the (relative) size of
    element $i$ of this vector
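
A minimal sketch of the iteration $x_{k+1} = M^T x_k$ on a toy graph, in plain Python. The value p = 0.85, the three-page graph, the function name `pagerank`, and the rule that a page without outgoing links teleports uniformly are all assumptions made for illustration; the slides fix none of them.

```python
def pagerank(adj, p=0.85, iters=100):
    """Power iteration x_{k+1} = M^T x_k with
    m_ij = p * b_ij / d_out(i) + (1 - p) / n.

    adj[i] lists the pages that page i links to (no duplicates).
    Assumption: a page with no outgoing links teleports uniformly,
    so that every row of M still sums to 1.
    """
    n = len(adj)
    x = [1.0 / n] * n                    # start from the uniform distribution
    for _ in range(iters):
        nxt = [(1.0 - p) / n] * n        # teleport term, valid because sum(x) == 1
        for i, targets in enumerate(adj):
            if targets:
                share = p * x[i] / len(targets)
                for j in targets:
                    nxt[j] += share      # follow one of the d_out(i) links
            else:
                for j in range(n):
                    nxt[j] += p * x[i] / n   # dangling page (assumed rule)
        x = nxt
    return x


# Toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = [[1, 2], [2], [0]]
scores = pagerank(links)
print(scores)
```

On this toy graph page 2, which has two incoming links, ends up with the largest score and page 1 with the smallest.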

12
Matlab News and Notes, October 2002
13
And my own PageRank?
  • Use the Google toolbar
  • Some top pages:

    #    URL                        PageRank   In-degree
    1    http://www.yahoo.com       10         654,000
    2    http://www.adobe.com       10         646,000
    5    http://www.google.com      10         252,000
    8    http://www.microsoft.com   10         129,000
    12   http://www.nasa.gov        10         93,900
    20   http://mit.edu             10         47,600
    23   http://www.nsf.gov         10         39,400
    26   http://www.inria.fr        10         17,400
    72   http://www.stanford.edu    9          36,300

15
Kleinberg's structure graph
  • The score of a page is high if the page has
    many incoming links
  • The score is high if the incoming links are
    from pages that have high scores
  • This inspired Kleinberg's structure graph:
    hub → authority

16
Good authorities for University Belgium
17
A good hub for University Belgium
18
Hub and authority scores
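  • Web pages have a hub score $h_j$ and an authority
    score $a_j$ which are mutually reinforcing:
  • pages with large $h_j$ point to pages with high $a_j$
  • pages with large $a_j$ are pointed to by pages with
    high $h_j$
  • $h_j \leftarrow \sum_{i:(j \to i)} a_i$
  • $a_j \leftarrow \sum_{i:(i \to j)} h_i$
  • or, using the adjacency matrix $B$ of the graph
    ($b_{ji} = 1$ if $j \to i$ is an edge):
    $\begin{bmatrix} h \\ a \end{bmatrix}_{k+1} = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix} \begin{bmatrix} h \\ a \end{bmatrix}_k, \quad \begin{bmatrix} h \\ a \end{bmatrix}_0 = \mathbf{1}$
  • Use the limiting vector $a$ (dominant eigenvector of
    $B^T B$) to rank pages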
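
The mutually reinforcing updates can be sketched directly from the edge list. This is an illustrative sketch only: the function name `hits`, the toy graph, and the choice to rescale h and a to unit 2-norm after each step are assumptions (the slide does not state a normalization).

```python
import math


def hits(edges, n, iters=50):
    """Iterate h <- B a and a <- B^T h simultaneously, starting from ones.

    edges holds (j, i) pairs meaning page j links to page i.
    Assumptions: at least one edge, and h and a are rescaled to
    unit 2-norm after every step.
    """
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        new_h = [0.0] * n
        new_a = [0.0] * n
        for j, i in edges:
            new_h[j] += a[i]   # h_j <- sum of a_i over links j -> i
            new_a[i] += h[j]   # a_i <- sum of h_j over links j -> i
        norm_h = math.sqrt(sum(v * v for v in new_h))
        norm_a = math.sqrt(sum(v * v for v in new_a))
        h = [v / norm_h for v in new_h]
        a = [v / norm_a for v in new_a]
    return h, a


# Toy graph: page 0 links to pages 2 and 3, page 1 links to page 2.
edges = [(0, 2), (1, 2), (0, 3)]
hub, auth = hits(edges, 4)
print(hub, auth)
```

Here page 2 receives the larger authority score (it is pointed to by both hubs) and page 0 the larger hub score.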



20
Extension to another structure graph
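  • Give three scores to each web page: begin $b$,
    center $c$, end $e$
  • Structure graph: b → c → e
  • Use again mutual reinforcement to define the
    iteration
  • $b_j \leftarrow \sum_{i:(j \to i)} c_i$
  • $c_j \leftarrow \sum_{i:(i \to j)} b_i + \sum_{i:(j \to i)} e_i$
  • $e_j \leftarrow \sum_{i:(i \to j)} c_i$
  • Defines a limiting vector for the iteration
    $x_{k+1} = M x_k$, $x_0 = \mathbf{1}$, where
    $x = \begin{bmatrix} b \\ c \\ e \end{bmatrix}, \quad M = \begin{bmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{bmatrix}$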
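
As a rough illustration, the three coupled updates can be run on a tiny graph. The function name `begin_center_end`, the 2-norm rescaling of the stacked vector, and the use of an even number of steps are assumptions of this sketch, not details taken from the slide.

```python
import math


def begin_center_end(edges, n, iters=50):
    """Run x_{k+1} = M x_k for x = [b; c; e], starting from all ones.

    edges holds (j, i) pairs meaning j -> i. The stacked vector is
    rescaled to unit 2-norm after each step (an assumption), and iters
    should be even because even and odd iterates can differ.
    """
    b, c, e = [1.0] * n, [1.0] * n, [1.0] * n
    for _ in range(iters):
        nb, nc, ne = [0.0] * n, [0.0] * n, [0.0] * n
        for j, i in edges:
            nb[j] += c[i]   # b_j <- sum of c_i over j -> i
            nc[i] += b[j]   # c_i collects b_j from its parents j
            nc[j] += e[i]   # c_j collects e_i from its children i
            ne[i] += c[j]   # e_i <- sum of c_j over j -> i
        norm = math.sqrt(sum(v * v for v in nb + nc + ne))
        b = [v / norm for v in nb]
        c = [v / norm for v in nc]
        e = [v / norm for v in ne]
    return b, c, e


# Path graph 0 -> 1 -> 2: node 0 should look like a "begin" node,
# node 1 like a "center" node and node 2 like an "end" node.
path = [(0, 1), (1, 2)]
b, c, e = begin_center_end(path, 3)
print(b, c, e)
```

On the path graph the largest begin, center and end scores land on nodes 0, 1 and 2 respectively.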

21
Towards arbitrary graphs
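  • For the graph • → •: $A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$
    and $M = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$
  • For the graph • → • → •: $A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$
    and $M = \begin{bmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{bmatrix}$
  • Formula for $M$ for two arbitrary graphs $G_A$ and $G_B$:
    $M = A \otimes B + A^T \otimes B^T$
  • With $x_k = \mathrm{vec}(X_k)$, the iteration $x_{k+1} = M x_k$ is
    equivalent to $X_{k+1} = B X_k A^T + B^T X_k A$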
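
The equivalence between the vectorized and the matrix form can be checked numerically on a small case. The sketch below uses plain nested lists; the helper names (`kron`, `vec`, `big_matrix`, `matrix_step`) and the test matrices are illustrative choices, and `vec` is assumed to stack the columns of X.

```python
def transpose(P):
    return [list(col) for col in zip(*P)]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[P[i][j] * Q[k][l]
             for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]


def vec(X):
    """Stack the columns of X into one long vector."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


def big_matrix(A, B):
    """M = A (x) B + A^T (x) B^T."""
    K1 = kron(A, B)
    K2 = kron(transpose(A), transpose(B))
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(K1, K2)]


def matrix_step(A, B, X):
    """One step X -> B X A^T + B^T X A."""
    P = matmul(matmul(B, X), transpose(A))
    Q = matmul(matmul(transpose(B), X), A)
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(P, Q)]


A = [[0, 1], [0, 0]]                    # structure graph: one edge 0 -> 1
B = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]   # small 3-node graph
X = [[1, 2], [3, 4], [5, 6]]            # arbitrary 3 x 2 matrix

M = big_matrix(A, B)
lhs = [sum(m * x for m, x in zip(row, vec(X))) for row in M]
rhs = vec(matrix_step(A, B, X))
print(lhs == rhs)
```

For this two-node A, the matrix M built by `big_matrix` has the block form with B in the upper right and its transpose in the lower left.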

22
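Convergence?
  • The (normalized) sequence
    $Z_{k+1} = (B Z_k A^T + B^T Z_k A) / \|B Z_k A^T + B^T Z_k A\|_F$
  • has two fixed points $Z_{even}$ and $Z_{odd}$ (the limits of the
    even and odd iterates) for every $Z_0 > 0$
  • Similarity matrix: $S = \lim_{k \to \infty} Z_{2k}$, with $Z_0 = \mathbf{1}$
  • $S_{i,j}$ is the similarity score between $V_j(A)$ and
    $V_i(B)$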
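  • Properties (generic):
  • $\rho S = B S A^T + B^T S A$, $\rho = \|B S A^T + B^T S A\|_F$
  • Fixed point of largest 1-norm
  • Robust fixed point for $\rho_\varepsilon S_\varepsilon = B S_\varepsilon A^T + B^T S_\varepsilon A + \varepsilon \mathbf{1} S_\varepsilon \mathbf{1}$
  • Linear convergence (power method for sparse $M$)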
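
A small sketch of the normalized iteration, working on edge lists so that only existing edges are visited. The function name `similarity_matrix`, the parameter `double_steps`, and the toy graphs are assumptions for illustration; the limit is approximated by stopping after a fixed, even number of steps.

```python
import math


def similarity_matrix(edges_a, n_a, edges_b, n_b, double_steps=30):
    """Approximate S = lim Z_{2k} for Z_{k+1} = (B Z A^T + B^T Z A) / ||.||_F.

    edges_a and edges_b hold (source, target) pairs of G_A and G_B
    (each assumed to have at least one edge). Z has one row per vertex
    of G_B and one column per vertex of G_A.
    """
    Z = [[1.0] * n_a for _ in range(n_b)]       # Z_0 = all ones
    for _ in range(2 * double_steps):           # even number of steps
        new = [[0.0] * n_a for _ in range(n_b)]
        for i, k in edges_b:                    # edge i -> k in G_B
            for j, l in edges_a:                # edge j -> l in G_A
                new[i][j] += Z[k][l]            # term of (B Z A^T)[i][j]
                new[k][l] += Z[i][j]            # term of (B^T Z A)[k][l]
        norm = math.sqrt(sum(v * v for row in new for v in row))
        Z = [[v / norm for v in row] for row in new]
    return Z


# Self-similarity of the path graph 0 -> 1 -> 2.
path = [(0, 1), (1, 2)]
S = similarity_matrix(path, 3, path, 3)
for row in S:
    print([round(v, 4) for v in row])
```

For the path graph compared with itself, each vertex is most similar to itself, so the large entries sit on the diagonal.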

23
Bow tie example
  • Graph A: h → a
  • Graph B: bow tie (node labels 2, 1, n+1, n+m+1)
  • Similarity matrix S (columns h, a) shown for the
    cases m > n and n > m
  • Not satisfactory
24
Bow tie example
  • Graph A: b → c → e
  • Graph B: bow tie (node labels 2, 1, n+1, n+m+1)
  • Similarity matrix S (columns b, c, e)
  • Central score is good
25
Other properties
  • Central score is a dominant eigenvector of
    $B B^T + B^T B$
  • (cf. hub score of $B B^T$ and authority score of
    $B^T B$)
  • Similarity matrix of a graph with itself is
    square and semi-definite.
  • Examples: path graph, cycle graph

26
The dictionary graph
  • OPTED, based on Webster's unabridged dictionary
  • http://msowww.anu.edu.au/ralph/OPTED
  • Nodes: words present in the dictionary (112,169
    nodes)
  • Edge (u,v) if v appears in the definition of u
    (1,398,424 edges)
  • Average of 12 edges per node

27
In and out degree distribution
  • Very similar to the web (power law)
  • Words with highest in-degree:
    of, a, the, or, to, in
  • Words with null out-degree:
    14159, Fe3O4, Aaron,
    and some undefined or misspelled words

28
Neighborhood graph
  • The neighborhood graph of a word is the subgraph
    used for finding its synonyms
  • It contains all parents and children of the node
  • Figure: neighborhood graph of "likely"
  • Central uses this sub-graph to rank
    synonyms automatically
  • Comparison with Vectors and ArcRank (automatic),
    and with WordNet and Microsoft Word (manual)
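
One possible way to extract such a neighborhood graph from an edge list is sketched below. The function name `neighborhood_graph` and the toy dictionary edges are hypothetical, and the sketch assumes that the word itself is kept in the subgraph together with every edge between selected vertices.

```python
def neighborhood_graph(edges, word):
    """Return (vertices, edges) of the neighborhood graph of `word`.

    edges holds (u, v) pairs meaning v appears in the definition of u.
    Parents are words whose definition uses `word`; children are words
    used in the definition of `word`.
    Assumptions: the word itself is included, and all edges between
    selected vertices are kept.
    """
    nodes = {word}
    for u, v in edges:
        if u == word:
            nodes.add(v)      # child: appears in the definition of `word`
        if v == word:
            nodes.add(u)      # parent: uses `word` in its definition
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return sorted(nodes), sub_edges


# Hypothetical toy dictionary graph (not taken from OPTED).
toy_edges = [
    ("likely", "probable"),
    ("probable", "likely"),
    ("verisimilar", "likely"),
    ("probable", "truth"),
    ("truth", "fact"),
]
nodes, sub = neighborhood_graph(toy_edges, "likely")
print(nodes)
print(sub)
```

The resulting edge list can then be used as graph B in the similarity iteration, with the path b → c → e as graph A.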

29
Disappear
30
Science
31
Sugar
32
Conclusion
  • New notion of similarity between vertices of a
    graph
  • Easy to compute: start from $X_0 = \mathbf{1}$ and take the even
    normalized iterates of $X_{k+1} = B X_k A^T + B^T X_k A$
  • Potential use for data mining, classification,
    clustering
  • Successful implementation for the French
    dictionary Le Petit Robert
  • Applications in texts, internet, reference lists,
    telephone networks,
    bipartite graphs (Melnik, Widom, ...)
  • Different from sub-graph problems!

34
Distribution of calls received
[Plot: number of customers versus number of calls received]
Example: 2000 people have received 100 calls