Title: The Web as a graph
The Web as a graph: measurements, models, and methods
1. Introduction
- The Web graph is a directed graph: pages are the nodes and hyperlinks are the directed edges
- Roughly 4 billion surface-web pages vs. about 550 billion deep-web pages
- The average page has about 7 hyperlinks
Reasons to study the Web graph
- Improve Web search algorithms
- Topic classification
- Topic enumeration
- Understand the growth of the Web and the behaviour of its users
- Now also becoming a serious commercial interest
2. Algorithms
- The HITS algorithm searches for high-quality pages on a topic query
- The topic enumeration algorithm enumerates all topics (cyber communities) of the Web graph
Terminology
- Hub pages contain links to relevant pages on the topic
- Authoritative pages are focused on a particular topic
(Figure: hubs pointing to authorities)
The HITS algorithm
- Associate a non-negative authority weight x and a non-negative hub weight y with each page
- A page with a large authority weight is regarded as an authority
- A page with a large hub weight is regarded as a hub
- Initially all weights are equal
The HITS algorithm (continued)
- Hypertext-Induced Topic Selection
- Reveals the most relevant pages (a subgraph grown around the root set) for a search topic, starting from a query
- Sampling step (extending the root set to a base set)
- Weight-propagation step
Sampling step
- Construct a subgraph expected to be rich in relevant, authoritative pages
- Issue a keyword query to collect a root set (about 200 pages)
- Expand to a base set (1,000-3,000 pages) by including all pages that link to, or are linked to by, a page in the root set (see the sketch below)
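A minimal sketch of the base-set expansion, assuming the relevant portion of the link graph is already in memory as dict-of-sets maps (out_links: page -> pages it links to; in_links: the reverse). These names, the cap on in-linking pages per root page, and the omission of the search-engine query that produces the root set are all illustrative assumptions.

```python
def build_base_set(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a ~200-page root set into a base set of roughly 1,000-3,000 pages."""
    base_set = set(root_set)
    for page in root_set:
        # Pages the root page links to.
        base_set.update(out_links.get(page, ()))
        # Pages linking to the root page, capped so very popular root pages
        # do not blow up the size of the base set.
        base_set.update(list(in_links.get(page, ()))[:max_in_per_page])
    return base_set
```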
Weight-propagation step
- Extract good hubs and authorities from the base set
- Each page p has an authority weight x_p and a hub weight y_p
- Pages with large hub weights (good hubs) point to pages with large authority weights (good authorities)
Updating weights
- Increase the authority weight of a page that is pointed to by many good hubs: $x_p \leftarrow \sum_{q \to p} y_q$
- Increase the hub weight of a page that points to many good authorities: $y_p \leftarrow \sum_{p \to q} x_q$
More mathematically...
- Adjacency matrix A with entries A(i,j) = 1 if page i links to page j, and 0 otherwise
- In matrix form the updates become $x \leftarrow A^T y$ and $y \leftarrow A x$
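A minimal sketch of the weight-propagation step in the matrix form above, written with NumPy. The 50-iteration cap, the Euclidean normalization, and the toy three-page graph are illustrative choices rather than prescriptions from the paper (the weights are known to converge to the principal eigenvectors of A^T A and A A^T).

```python
import numpy as np

def hits(A, iterations=50):
    n = A.shape[0]
    x = np.ones(n)  # authority weights, all equal initially
    y = np.ones(n)  # hub weights, all equal initially
    for _ in range(iterations):
        x = A.T @ y  # authority of p: sum of hub weights of pages pointing to p
        y = A @ x    # hub weight of p: sum of authority weights of pages p points to
        x /= np.linalg.norm(x)
        y /= np.linalg.norm(y)
    return x, y

# Tiny example: pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
authority, hub = hits(A)
print(authority.round(3), hub.round(3))  # page 2 is the authority, pages 0 and 1 are hubs
```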
Conclusion
- The output list contains
  - the pages with the largest hub weights
  - the pages with the largest authority weights
- After collecting the root set, the algorithm is a purely link-based computation
- It provides good search results for a wide range of queries
Query example
- Query: "search engine"
- Returns Yahoo!, Excite, Magellan, Lycos, AltaVista
- None of these pages actually contains the phrase "search engine"
Topic enumeration (trawling algorithm)
- Enumerates all topics (processes the entire Web graph)
- Bipartite core C_{i,j}: a graph on i + j nodes that contains a complete bipartite clique K_{i,j} (example: C_{4,3}); a membership check is sketched below
- Intuition: every well-represented topic contains a bipartite core C_{i,j} for some appropriate i and j, and such a subgraph is likely to correspond to a cyber community
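A small check of the definition above: candidate left-side pages ("fans") and right-side pages ("centers") form a complete bipartite clique K_{i,j} exactly when every fan links to every center. The dict-of-sets out_links representation is an assumption, reused in the later sketches.

```python
def is_complete_bipartite(fans, centers, out_links):
    """True iff every fan links to every center, i.e. the pair spans a K_{i,j}."""
    return all(center in out_links.get(fan, set())
               for fan in fans for center in centers)

# Example: four pages that all link to the same three pages form a C_{4,3}.
out_links = {f: {"c1", "c2", "c3"} for f in ("f1", "f2", "f3", "f4")}
print(is_complete_bipartite(["f1", "f2", "f3", "f4"], ["c1", "c2", "c3"], out_links))  # True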
Naive Algorithm
- Two fatal problems (see the brute-force sketch below):
  - the size of the search space is too large
  - it requires random access to edges, so a large fraction of the graph must reside in memory
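To make the first problem concrete, here is a hedged sketch of the naive search: it examines every candidate set of i left-side nodes, so the number of candidates explodes combinatorially, and the whole edge set must be randomly accessible in memory.

```python
from itertools import combinations

def find_cores_naively(nodes, out_links, i=4, j=3):
    """Brute-force enumeration of C_{i,j} cores -- only feasible on tiny graphs."""
    for fans in combinations(sorted(nodes), i):  # C(n, i) candidate fan sets
        # Pages linked to by every candidate fan.
        common = set.intersection(*(out_links.get(f, set()) for f in fans))
        for centers in combinations(sorted(common), j):
            yield fans, centers
```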
Elimination-generation Algorithm
- A number of sequential passes over the graph
- Each pass consists of an elimination step and a generation step
- During each pass, the algorithm writes a modified version of the graph to disk
Elimination
- Consider the example C_{4,3}
- Edges out of nodes with out-degree smaller than 3 can be deleted, because such a node cannot participate on the left side
- Nodes with in-degree smaller than 4 cannot participate on the right side, so their incoming edges can likewise be deleted
(see the pruning sketch below)
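A minimal in-memory sketch of the elimination step for C_{4,3}, assuming out_links and in_links are consistent reverse maps; the paper performs the equivalent pruning in sequential passes over a disk-resident graph rather than with an in-memory loop.

```python
def eliminate(out_links, in_links, i=4, j=3):
    """Repeatedly discard edges that cannot occur in any C_{i,j}."""
    changed = True
    while changed:
        changed = False
        for node, targets in list(out_links.items()):
            # Out-degree < j: the node cannot sit on the left side,
            # so its out-edges cannot occur in any core and may be deleted.
            if 0 < len(targets) < j:
                for dst in targets:
                    in_links[dst].discard(node)
                out_links[node] = set()
                changed = True
        for node, sources in list(in_links.items()):
            # In-degree < i: the node cannot sit on the right side,
            # so its in-edges may be deleted.
            if 0 < len(sources) < i:
                for src in sources:
                    out_links[src].discard(node)
                in_links[node] = set()
                changed = True
    return out_links, in_links
```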
Generation
- Identify nodes u that barely qualify for a core
- Either output the core or prove that u does not belong to a core, then drop u
- Example: a node u with in-degree exactly 4 belongs to a C_{4,3} only if the nodes that point to u have a neighbourhood intersection of size at least 3 (see the sketch below)
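A sketch of the generation test for a node that barely qualifies on the right side of a C_{4,3}, under the same in-memory assumptions as above: either a core containing u is emitted, or u is proven not to belong to any core and can be dropped.

```python
def generate_for(u, out_links, in_links, i=4, j=3):
    fans = in_links.get(u, set())
    if len(fans) != i:
        return None                                   # u does not barely qualify
    # Pages pointed to by every in-neighbour of u (this always includes u itself).
    common = set.intersection(*(out_links.get(f, set()) for f in fans))
    if len(common) >= j:
        centers = [u] + sorted(common - {u})[:j - 1]  # u plus j-1 other shared targets
        return sorted(fans), centers                  # a K_{4,3}: every fan links to every center
    return None                                       # u provably not in any C_{4,3}: drop it
```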
- The in/out-degree of every node drops monotonically during each pass
- After one pass, the remaining graph has fewer nodes than before, which may open new elimination opportunities in the next pass
- Continue the procedure until no significant progress can be made
- Possible results:
  - in the end, all nodes are dropped
  - elimination/generation tails off as fewer and fewer nodes are eliminated (the dominating phenomenon)
Observations
- In experiments, over 90% of the cores are not coincidental and correspond to communities with a definite topic focus
3. Measurements
- Degree distributions follow a Zipfian (power-law) distribution: the fraction of pages with degree i is roughly proportional to 1/i^a for some exponent a
- Number of bipartite cores (measured on a crawl of roughly 100 million web pages)
Connectivity of the graph
- Giant connected component
- Giant biconnected component
- No giant strongly biconnected component (nodes that can all reach each other by directed paths)
4. Model
1. Model structural properties of the graph
2. Predict the behaviour of algorithms on the Web, in particular identify algorithms that are doomed to perform poorly on Web graphs
3. Make predictions about the shape of the Web graph in the future
Requirements
- The model should have an easy and natural description
- It should capture the aggregate formation of the graph; it cannot model detailed individual behaviour
- No static topics are required, since the Web is dynamic
- It should reflect the measurements we have seen
A class of random graph models
- Some page creators link to other sites without regard to existing topics
- Most page creators link to pages within existing topics of interest
- Random copying as a mechanism to create Zipfian degree distributions
Stochastic processes
- Node creation process C_v and edge creation process C_e
- Node deletion process D_v and edge deletion process D_e
- C_v creates a node with probability α_c(t)
- D_v removes a node with probability α_d(t) and also deletes all of its incident edges
- D_e deletes an edge with probability x(t)
Edge creation process
- Determine a node v and a number of edges k
- With probability β, add k edges from v to k uniformly chosen nodes
- With probability 1 - β, copy k edges from a randomly chosen node u:
  - if the out-degree of u is more than k, choose a random subset of its out-edges of size k
  - if the out-degree of u is less than k, copy all of its edges and choose another node u for the remaining edges
(a sketch of this process is given below)
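A hedged sketch of this edge-creation process; the dict-of-sets graph, the handling of nodes with nothing to copy, and the neglect of duplicate targets are simplifying assumptions.

```python
import random

def create_edges(v, k, beta, out_links, nodes, rng=random):
    """Add k out-edges to the new node v (a sketch; multi-edges and graphs
    in which no node has out-links are not handled)."""
    if rng.random() < beta:
        # With probability beta: k destinations chosen uniformly at random.
        out_links[v].update(rng.sample(nodes, k))
        return
    # With probability 1 - beta: copy edges from randomly chosen nodes.
    remaining = k
    while remaining > 0:
        u = rng.choice(nodes)
        targets = list(out_links.get(u, ()))
        if not targets:
            continue                          # u has nothing to copy; pick another node
        if len(targets) > remaining:
            targets = rng.sample(targets, remaining)  # random subset of size `remaining`
        out_links[v].update(targets)          # copy u's (sampled) out-links
        remaining -= len(targets)
```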
A simple model
- A new node is created at every time step
- No deletions
- Each new node gets a single out-link:
  - choose an existing node u uniformly at random
  - with probability x, the new edge points to u
  - with probability 1 - x, copy the out-link of u, i.e. point to the node that u points to
(a small simulation is sketched below)
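A minimal simulation of this simple model; the copying probability x, the number of steps, and the self-loop used to bootstrap node 0 are illustrative assumptions. Tallying where the edges land shows a few heavily linked nodes dominating, the Zipf-like behaviour discussed on the next slide.

```python
import random
from collections import Counter

def simulate(steps=100_000, x=0.5, seed=0):
    rng = random.Random(seed)
    out_link = {0: 0}                # node 0 bootstraps the process by pointing to itself
    for v in range(1, steps):
        u = rng.randrange(v)         # choose an existing node uniformly at random
        # Point to u with probability x, otherwise copy u's out-link.
        out_link[v] = u if rng.random() < x else out_link[u]
    return out_link

in_degree = Counter(simulate().values())
print(in_degree.most_common(5))      # a handful of nodes attract most of the links
```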
Simulation
- The degree distribution follows a Zipfian distribution
- The number of bipartite cores is significantly larger than in a traditional random graph
- Challenges:
  - study the properties and evolution of the random graphs generated by the model
  - develop efficient algorithms to analyze such graphs