The Web as a graph - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: The Web as a graph


1
The Web as a graph: measurements, models, and
methods

2
1. Introduction
  • The Web graph is a directed graph: pages are
    the nodes and hyperlinks are the directed edges
  • Roughly 4 billion surface-web pages vs. an
    estimated 550 billion deep-web pages
  • The average page carries about 7 hyperlinks

3
  • Reasons to study the Web graph
  • Improve Web search algorithms
  • Topic classification
  • Topic enumeration
  • Growth of the Web and behavior of users,
    now becoming a serious commercial interest

4
  • 2. Algorithms
  • HITS algorithm: searches for high-quality pages
    on a topic query
  • Topic enumeration algorithm: enumerates all
    topics (cyber communities) of the Web graph

5
Terminology
Hub pages contain links to relevant pages on the
topic
Authoritative pages are focused on a particular
topic
[Figure: hub pages linking to authority pages]
6
The HITS algorithm
Associate a non-negative authority weight x and a
non-negative hub weight y with each page
A page with a large authority weight is regarded
as an authority; a page with a large hub weight is
regarded as a hub
Initially all weights are equal
7
  • The HITS algorithm (continued)
  • Hypertext-induced topic selection
  • Reveals the most relevant pages on a search
    topic (a subgraph around the root set obtained
    by querying)
  • Sampling step (extending the root set to the
    base set)
  • Weight-propagation step

8
  • Sampling step
  • Construct a subgraph expected to be rich in
    relevant, authoritative pages
  • A keyword query collects the root set
    (about 200 pages)
  • Expand to the base set (1000-3000 pages) by
    including all pages that link to, or are linked
    to by, a page in the root set
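The one-hop expansion above can be sketched in a few lines of Python. The `links` map below is a made-up stand-in for the Web; in HITS the root set really comes from a keyword query against a search engine.

```python
# Toy sketch of the sampling step; `links` is a hypothetical
# page -> out-links map standing in for the Web graph.
links = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],
}

def base_set(root, links):
    """Expand the root set by one hop of in- and out-links."""
    base = set(root)
    for page in root:
        base.update(links.get(page, []))             # pages the root links to
        base.update(p for p, outs in links.items()
                    if page in outs)                 # pages linking to the root
    return base

print(sorted(base_set({"a"}, links)))  # ['a', 'b', 'c', 'e']
```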

9
Weight-propagation step
Extract good hubs and authorities from the base
set
Each page p has an authority weight x_p and a hub
weight y_p
Pages of large hub weight (good hubs) point to
pages of large authority weight (good authorities)

10
Updating weights
Increase the authority weight if a page is pointed
to by many good hubs:  x_p = Σ_{q : q→p} y_q
Increase the hub weight if a page points to many
good authorities:  y_p = Σ_{q : p→q} x_q
11
More mathematically...
Adjacency matrix A with entries A(i,j) = 1 if page
i links to page j, and 0 otherwise
In matrix form, the updates are x = A^T y and
y = A x
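A minimal, self-contained sketch of the iteration on a toy 4-page graph (plain Python; the normalization step is standard practice but not spelled out on the slide):

```python
# HITS sketch on a toy base set, using the adjacency matrix from the
# slide: A[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 1, 0],   # page 0 links to pages 1 and 2 (a hub)
    [0, 0, 0, 1],   # page 1 links to page 3
    [0, 0, 0, 1],   # page 2 links to page 3
    [0, 0, 0, 0],   # page 3 has no out-links (an authority)
]
n = len(A)
x = [1.0] * n   # authority weights, all equal initially
y = [1.0] * n   # hub weights, all equal initially

for _ in range(50):
    # x_p <- sum of hub weights of pages q that link to p
    x = [sum(y[q] for q in range(n) if A[q][p]) for p in range(n)]
    # y_p <- sum of authority weights of pages q that p links to
    y = [sum(x[q] for q in range(n) if A[p][q]) for p in range(n)]
    # normalize so the weights stay bounded
    nx = sum(v * v for v in x) ** 0.5 or 1.0
    ny = sum(v * v for v in y) ** 0.5 or 1.0
    x = [v / nx for v in x]
    y = [v / ny for v in y]

print(x, y)  # page 3 gets the largest authority weight
```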
12
  • Conclusion
  • The output list contains
  • the pages with the largest hub weights
  • the pages with the largest authority weights
  • After collecting the root set, the algorithm is
    a purely link-based computation
  • It provides good search results for a wide range
    of queries

13
  • Query example: "search engine"
  • Returns Yahoo!, Excite, Magellan, Lycos,
    AltaVista
  • None of these pages contains the phrase
    "search engine"

14
Topic enumeration (trawling algorithm)
Enumerates all topics (processes the entire Web
graph)
Bipartite core C(i,j): a graph on i + j nodes that
contains a complete bipartite clique K(i,j)
[Figure: a C(4,3) core]
Intuition: every well-represented topic will
contain a bipartite core C(i,j) for some
appropriate i and j
Such a subgraph is likely to be a cyber community
15
  • Naive algorithm
  • Two fatal problems
  • The size of the search space is too large
  • It requires random access to edges, so a large
    fraction of the graph must reside in memory

16
  • Elimination-generation algorithm
  • Makes a number of sequential passes over the
    graph
  • Each pass consists of elimination and generation
    steps
  • During each pass, the algorithm writes a modified
    version of the graph to disk

17
  • Elimination
  • Consider the example C(4,3)
  • Edges of nodes with out-degree smaller than 3
    can be deleted, because such a node cannot
    participate on the left side
  • Nodes with in-degree smaller than 4 cannot
    participate on the right side
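The pruning rule can be sketched as an in-memory loop. The real algorithm streams edges from disk; `eliminate` below is a simplified illustration for C(4,3), and the function name is mine:

```python
from collections import Counter

# Simplified in-memory sketch of the elimination step for C(4,3);
# the real trawling algorithm streams (src, dst) pairs from disk.
def eliminate(edges, i=4, j=3):
    """Drop edges whose endpoints cannot sit on either side of a core."""
    edges = set(edges)
    while True:
        out_deg = Counter(s for s, _ in edges)
        in_deg = Counter(d for _, d in edges)
        # left-side nodes need out-degree >= j,
        # right-side nodes need in-degree >= i
        kept = {(s, d) for s, d in edges
                if out_deg[s] >= j and in_deg[d] >= i}
        if kept == edges:
            return edges
        edges = kept

# A complete K(4,3) core plus two stray edges: the strays are pruned.
core = {(f"L{a}", f"R{b}") for a in range(4) for b in range(3)}
survivors = eliminate(core | {("x", "R0"), ("L0", "y")})
print(len(survivors))  # 12
```

Note that deleting one edge lowers other degrees, which is why the loop repeats until no further edges can be removed.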

18
  • Generation
  • Identify nodes u that barely qualify for a core
  • Either output the core or prove that u does not
    belong to a core, then drop u
  • Example: a node u with in-degree exactly 4 only
    belongs to a C(4,3) if the nodes that point to u
    have a neighborhood intersection of size at
    least 3
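The in-degree-exactly-4 test can be written as a set intersection. The helper name `qualifies` and the `out_links` map are illustrative, not from the slides:

```python
# Sketch of the generation test for a node u with in-degree exactly 4
# (the "barely qualifies" case for C(4,3)); `out_links` is a
# hypothetical page -> out-neighbors map.
def qualifies(u, pointers, out_links, j=3):
    """u can sit on the right of a core only if the nodes pointing to
    it share at least j common out-neighbors (u being one of them)."""
    common = set(out_links[pointers[0]])
    for v in pointers[1:]:
        common &= set(out_links[v])
    return u in common and len(common) >= j

fans = [f"v{k}" for k in range(4)]
good = {v: ["u", "a", "b"] for v in fans}                 # shared {u, a, b}
bad = {v: ["u", f"only{k}"] for k, v in enumerate(fans)}  # only u shared
print(qualifies("u", fans, good), qualifies("u", fans, bad))  # True False
```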

19
  • The in/out-degree of every node drops
    monotonically during each pass
  • After one pass, the remaining graph has fewer
    nodes than before, which may open new chances
    for elimination during the next pass
  • Continue the procedure until we cannot make
    significant progress
  • Possible results
  • At the end we have dropped all the nodes
  • The elimination/generation passes tail off as
    fewer and fewer nodes are eliminated (the
    dominating phenomenon in practice)

20
Observations
In experiments, over 90% of the cores found are
not coincidental: they correspond to communities
with a definite topic focus
21
  • 3. Measurements
  • Degree distributions
  • - follow a Zipfian (power-law) distribution
  • Number of bipartite cores (in a graph of 100
    million Web nodes)

22
  • Connectivity of the graph
  • - a giant component exists
  • - a giant biconnected component exists
  • - no giant strongly biconnected component
    (nodes reaching each other by directed paths)

23
  • 4. Model
  • 1. Model structural properties of the graph
  • 2. Predict the behaviour of algorithms on the
    Web
  • - find algorithms that are doomed to perform
    poorly on Web graphs
  • 3. Make predictions about the shape of the Web
    graph in the future

24
  • Requirements
  • The model should have an easy and natural
    description
  • Capture the aggregate formation of the graph; it
    cannot model detailed individual behaviour
  • No static topics required, since the Web is
    dynamic
  • Reflect the measurements we have seen

25
  • A class of random graph models
  • Some page creators link to other sites without
    regard to existing topics
  • Most page creators link to pages within existing
    topics of interest
  • Random copying as a mechanism to create Zipfian
    degree distributions

26
Stochastic processes
Creation processes C_v and C_e; deletion processes
D_v and D_e
C_v creates a node with probability α_c(t)
D_v removes a node with probability α_d(t), and
also deletes all incident edges
D_e deletes an edge with probability x(t)
27
  • Edge creation process
  • Determine a node v and a number of edges k
  • With probability b, add edges pointing to k
    uniformly chosen nodes
  • With probability 1 - b, copy k edges from a
    randomly chosen node u
  • If the out-degree of u is more than k, choose a
    random subset of size k
  • If the out-degree of u is less than k, copy all
    of its edges and choose another node u to copy
    from
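The branching above can be sketched directly. Everything here is a toy illustration: `graph` is an in-memory adjacency list, and `beta` stands for the probability b on the slide.

```python
import random

# Sketch of the edge-creation step; `graph` maps each node to its
# out-link list and `beta` is the slide's probability b (assumed name).
def create_edges(v, k, graph, beta=0.5, rng=random):
    others = [n for n in graph if n != v]
    if rng.random() < beta:
        # add edges to k uniformly chosen nodes
        graph[v] = rng.sample(others, k)
        return
    # otherwise copy k edges from randomly chosen nodes
    copied = []
    while len(copied) < k:
        u = rng.choice(others)
        outs = graph[u]
        need = k - len(copied)
        if len(outs) > need:
            copied.extend(rng.sample(outs, need))  # random subset of size k
        else:
            copied.extend(outs)  # copy all, then pick another node u
    graph[v] = copied

graph = {0: [1], 1: [2], 2: [0], 3: []}
create_edges(3, 2, graph, rng=random.Random(1))
print(graph[3])  # two out-links, drawn or copied from existing nodes
```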

28
  • A simple model
  • A new node is created at every time step
  • No deletions
  • Each new node gets one out-link: choose an
    existing node u uniformly at random
  • With probability x, the new edge points to u
  • With probability 1 - x, copy the out-link of u
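A toy run of this simple model shows the copying mechanism concentrating in-links on a few nodes. `alpha` stands in for the probability x on the slide, and the self-linking start node is my own initialization choice:

```python
import random

# Toy run of the simple model: one new node per step, no deletions,
# each node gets a single out-link; `alpha` plays the role of x.
def simulate(steps, alpha=0.1, seed=0):
    rng = random.Random(seed)
    out_link = {0: 0}        # node 0 starts by linking to itself (assumption)
    in_deg = {0: 1}
    for v in range(1, steps):
        u = rng.randrange(v)          # choose an existing node uniformly
        # point to u itself, or copy u's out-link
        target = u if rng.random() < alpha else out_link[u]
        out_link[v] = target
        in_deg[target] = in_deg.get(target, 0) + 1
        in_deg.setdefault(v, 0)
    return in_deg

deg = simulate(5000)
# copying concentrates links: a few nodes collect most of the in-degree
print(max(deg.values()), sorted(deg.values())[len(deg) // 2])
```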

29
  • Simulation
  • The resulting degree distribution follows a
    Zipfian distribution
  • The number of cores is significantly larger
    than in a traditional random graph
  • Challenges
  • Study the properties and evolution of the random
    graphs generated by the model
  • Efficient algorithms are needed to analyze such
    graphs