Title: The Web as a graph
The Web as a graph: measurements, models, and methods
1. Introduction
- The Web graph is a directed graph: pages are the nodes and hyperlinks are the directed edges
- Roughly 4 billion surface-web pages vs. about 550 billion deep-web pages
- The average page has about 7 hyperlinks
Reasons to study the Web graph
- Improve Web search algorithms
- Topic classification
- Topic enumeration
- Understand the growth of the Web and the behaviour of its users
- Now also becoming a serious commercial interest
2. Algorithms
- The HITS algorithm searches for high-quality pages on a topic query
- The topic enumeration algorithm enumerates all topics (cyber communities) of the Web graph
Terminology
- Hub pages contain links to relevant pages on the topic
- Authoritative pages are focused on a particular topic
(Figure: hubs pointing to authorities)
The HITS algorithm
- Associate a non-negative authority weight x and a non-negative hub weight y with each page
- A page with a large authority weight is regarded as an authority
- A page with a large hub weight is regarded as a hub
- Initially all weights are equal
The HITS algorithm (continued)
- Hypertext-Induced Topic Selection
- Reveals the most relevant pages (a subgraph grown around the root set) for a search topic, starting from a query
- Sampling step (extending the root set to a base set)
- Weight-propagation step
Sampling step
- Construct a subgraph expected to be rich in relevant, authoritative pages
- Issue a keyword query to collect a root set (about 200 pages)
- Expand to a base set (1,000-3,000 pages) by including all pages that link to, or are linked to by, a page in the root set (see the sketch below)
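A minimal sketch of the base-set expansion, assuming the relevant portion of the link graph is already in memory as dict-of-sets maps (out_links: page -> pages it links to; in_links: the reverse). These names, the cap on in-linking pages per root page, and the omission of the search-engine query that produces the root set are all illustrative assumptions.

```python
def build_base_set(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a ~200-page root set into a base set of roughly 1,000-3,000 pages."""
    base_set = set(root_set)
    for page in root_set:
        # Pages the root page links to.
        base_set.update(out_links.get(page, ()))
        # Pages linking to the root page, capped so very popular root pages
        # do not blow up the size of the base set.
        base_set.update(list(in_links.get(page, ()))[:max_in_per_page])
    return base_set
```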
Weight-propagation step
- Extract good hubs and authorities from the base set
- Each page p has an authority weight x_p and a hub weight y_p
- Pages with large hub weights (good hubs) point to pages with large authority weights (good authorities)
Updating weights
- Increase the authority weight of a page that is pointed to by many good hubs: $x_p \leftarrow \sum_{q \to p} y_q$
- Increase the hub weight of a page that points to many good authorities: $y_p \leftarrow \sum_{p \to q} x_q$
More mathematically...
- Adjacency matrix A with entries A(i,j) = 1 if page i links to page j, and 0 otherwise
- In matrix form the updates become $x \leftarrow A^T y$ and $y \leftarrow A x$
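A minimal sketch of the weight-propagation step in the matrix form above, written with NumPy. The 50-iteration cap, the Euclidean normalization, and the toy three-page graph are illustrative choices rather than prescriptions from the paper (the weights are known to converge to the principal eigenvectors of A^T A and A A^T).

```python
import numpy as np

def hits(A, iterations=50):
    n = A.shape[0]
    x = np.ones(n)  # authority weights, all equal initially
    y = np.ones(n)  # hub weights, all equal initially
    for _ in range(iterations):
        x = A.T @ y  # authority of p: sum of hub weights of pages pointing to p
        y = A @ x    # hub weight of p: sum of authority weights of pages p points to
        x /= np.linalg.norm(x)
        y /= np.linalg.norm(y)
    return x, y

# Tiny example: pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
authority, hub = hits(A)
print(authority.round(3), hub.round(3))  # page 2 is the authority, pages 0 and 1 are hubs
```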
Conclusion
- The output list contains
  - the pages with the largest hub weights
  - the pages with the largest authority weights
- After collecting the root set, the algorithm is a purely link-based computation
- It provides good search results for a wide range of queries
Query example
- Query: "search engine"
- Returns Yahoo!, Excite, Magellan, Lycos, AltaVista
- None of these pages actually contains the phrase "search engine"
Topic enumeration (trawling algorithm)
- Enumerates all topics (processes the entire Web graph)
- Bipartite core C_{i,j}: a graph on i + j nodes that contains a complete bipartite clique K_{i,j} (example: C_{4,3}); a membership check is sketched below
- Intuition: every well-represented topic contains a bipartite core C_{i,j} for some appropriate i and j, and such a subgraph is likely to correspond to a cyber community
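A small check of the definition above: candidate left-side pages ("fans") and right-side pages ("centers") form a complete bipartite clique K_{i,j} exactly when every fan links to every center. The dict-of-sets out_links representation is an assumption, reused in the later sketches.

```python
def is_complete_bipartite(fans, centers, out_links):
    """True iff every fan links to every center, i.e. the pair spans a K_{i,j}."""
    return all(center in out_links.get(fan, set())
               for fan in fans for center in centers)

# Example: four pages that all link to the same three pages form a C_{4,3}.
out_links = {f: {"c1", "c2", "c3"} for f in ("f1", "f2", "f3", "f4")}
print(is_complete_bipartite(["f1", "f2", "f3", "f4"], ["c1", "c2", "c3"], out_links))  # True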
Naive Algorithm
- Two fatal problems (see the brute-force sketch below):
  - the size of the search space is too large
  - it requires random access to edges, so a large fraction of the graph must reside in memory
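To make the first problem concrete, here is a hedged sketch of the naive search: it examines every candidate set of i left-side nodes, so the number of candidates explodes combinatorially, and the whole edge set must be randomly accessible in memory.

```python
from itertools import combinations

def find_cores_naively(nodes, out_links, i=4, j=3):
    """Brute-force enumeration of C_{i,j} cores -- only feasible on tiny graphs."""
    for fans in combinations(sorted(nodes), i):  # C(n, i) candidate fan sets
        # Pages linked to by every candidate fan.
        common = set.intersection(*(out_links.get(f, set()) for f in fans))
        for centers in combinations(sorted(common), j):
            yield fans, centers
```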
Elimination-generation Algorithm
- A number of sequential passes over the graph
- Each pass consists of an elimination step and a generation step
- During each pass, the algorithm writes a modified version of the graph to disk
Elimination
- Consider the example C_{4,3}
- Edges out of nodes with out-degree smaller than 3 can be deleted, because such a node cannot participate on the left side
- Nodes with in-degree smaller than 4 cannot participate on the right side, so their incoming edges can likewise be deleted
(see the pruning sketch below)
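A minimal in-memory sketch of the elimination step for C_{4,3}, assuming out_links and in_links are consistent reverse maps; the paper performs the equivalent pruning in sequential passes over a disk-resident graph rather than with an in-memory loop.

```python
def eliminate(out_links, in_links, i=4, j=3):
    """Repeatedly discard edges that cannot occur in any C_{i,j}."""
    changed = True
    while changed:
        changed = False
        for node, targets in list(out_links.items()):
            # Out-degree < j: the node cannot sit on the left side,
            # so its out-edges cannot occur in any core and may be deleted.
            if 0 < len(targets) < j:
                for dst in targets:
                    in_links[dst].discard(node)
                out_links[node] = set()
                changed = True
        for node, sources in list(in_links.items()):
            # In-degree < i: the node cannot sit on the right side,
            # so its in-edges may be deleted.
            if 0 < len(sources) < i:
                for src in sources:
                    out_links[src].discard(node)
                in_links[node] = set()
                changed = True
    return out_links, in_links
```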
Generation
- Identify nodes u that barely qualify for a core
- Either output the core or prove that u does not belong to a core, then drop u
- Example: a node u with in-degree exactly 4 belongs to a C_{4,3} only if the nodes that point to u have a neighbourhood intersection of size at least 3 (see the sketch below)
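A sketch of the generation test for a node that barely qualifies on the right side of a C_{4,3}, under the same in-memory assumptions as above: either a core containing u is emitted, or u is proven not to belong to any core and can be dropped.

```python
def generate_for(u, out_links, in_links, i=4, j=3):
    fans = in_links.get(u, set())
    if len(fans) != i:
        return None                                   # u does not barely qualify
    # Pages pointed to by every in-neighbour of u (this always includes u itself).
    common = set.intersection(*(out_links.get(f, set()) for f in fans))
    if len(common) >= j:
        centers = [u] + sorted(common - {u})[:j - 1]  # u plus j-1 other shared targets
        return sorted(fans), centers                  # a K_{4,3}: every fan links to every center
    return None                                       # u provably not in any C_{4,3}: drop it
```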
- The in/out-degree of every node drops monotonically during each pass
- After one pass, the remaining graph has fewer nodes than before, which may open new elimination opportunities in the next pass
- Continue the procedure until no significant progress can be made
- Possible results:
  - in the end, all nodes are dropped
  - elimination/generation tails off as fewer and fewer nodes are eliminated (the dominating phenomenon)
Observations
- In experiments, over 90% of the cores are not coincidental and correspond to communities with a definite topic focus
3. Measurements
- Degree distributions follow a Zipfian (power-law) distribution: the fraction of pages with degree i is roughly proportional to 1/i^a for some exponent a
- Number of bipartite cores (measured on a crawl of roughly 100 million web pages)
Connectivity of the graph
- Giant connected component
- Giant biconnected component
- No giant strongly biconnected component (nodes that can all reach each other by directed paths)
4. Model
1. Model structural properties of the graph
2. Predict the behaviour of algorithms on the Web, in particular identify algorithms that are doomed to perform poorly on Web graphs
3. Make predictions about the shape of the Web graph in the future
Requirements
- The model should have an easy and natural description
- It should capture the aggregate formation of the graph; it cannot model detailed individual behaviour
- No static topics are required, since the Web is dynamic
- It should reflect the measurements we have seen
A class of random graph models
- Some page creators link to other sites without regard to existing topics
- Most page creators link to pages within existing topics of interest
- Random copying as a mechanism to create Zipfian degree distributions
Stochastic processes
- Node creation process C_v and edge creation process C_e
- Node deletion process D_v and edge deletion process D_e
- C_v creates a node with probability α_c(t)
- D_v removes a node with probability α_d(t) and also deletes all of its incident edges
- D_e deletes an edge with probability x(t)
Edge creation process
- Determine a node v and a number of edges k
- With probability β, add k edges from v to k uniformly chosen nodes
- With probability 1 - β, copy k edges from a randomly chosen node u:
  - if the out-degree of u is more than k, choose a random subset of its out-edges of size k
  - if the out-degree of u is less than k, copy all of its edges and choose another node u for the remaining edges
(a sketch of this process is given below)
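A hedged sketch of this edge-creation process; the dict-of-sets graph, the handling of nodes with nothing to copy, and the neglect of duplicate targets are simplifying assumptions.

```python
import random

def create_edges(v, k, beta, out_links, nodes, rng=random):
    """Add k out-edges to the new node v (a sketch; multi-edges and graphs
    in which no node has out-links are not handled)."""
    if rng.random() < beta:
        # With probability beta: k destinations chosen uniformly at random.
        out_links[v].update(rng.sample(nodes, k))
        return
    # With probability 1 - beta: copy edges from randomly chosen nodes.
    remaining = k
    while remaining > 0:
        u = rng.choice(nodes)
        targets = list(out_links.get(u, ()))
        if not targets:
            continue                          # u has nothing to copy; pick another node
        if len(targets) > remaining:
            targets = rng.sample(targets, remaining)  # random subset of size `remaining`
        out_links[v].update(targets)          # copy u's (sampled) out-links
        remaining -= len(targets)
```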
A simple model
- A new node is created at every time step
- No deletions
- Each new node gets a single out-link:
  - choose an existing node u uniformly at random
  - with probability x, the new edge points to u
  - with probability 1 - x, copy the out-link of u, i.e. point to the node that u points to
(a small simulation is sketched below)
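A minimal simulation of this simple model; the copying probability x, the number of steps, and the self-loop used to bootstrap node 0 are illustrative assumptions. Tallying where the edges land shows a few heavily linked nodes dominating, the Zipf-like behaviour discussed on the next slide.

```python
import random
from collections import Counter

def simulate(steps=100_000, x=0.5, seed=0):
    rng = random.Random(seed)
    out_link = {0: 0}                # node 0 bootstraps the process by pointing to itself
    for v in range(1, steps):
        u = rng.randrange(v)         # choose an existing node uniformly at random
        # Point to u with probability x, otherwise copy u's out-link.
        out_link[v] = u if rng.random() < x else out_link[u]
    return out_link

in_degree = Counter(simulate().values())
print(in_degree.most_common(5))      # a handful of nodes attract most of the links
```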
Simulation
- The degree distribution follows a Zipfian distribution
- The number of bipartite cores is significantly larger than in a traditional random graph
- Challenges:
  - study the properties and evolution of the random graphs generated by the model
  - develop efficient algorithms to analyze such graphs