An algorithm for Enumerating SCCs in Web Graph

1
An algorithm for Enumerating SCCs in Web Graph

Jie Han, Yong Yu, Guowei Liu, and Guirong Xue
APWeb 2005, LNCS 3399, pp. 655-667, 2005

May 2, 2006, So Jeong Han
2
Content

3.2 Pages and Sites
3.3 Clustering the Sites
3.4 Efficiency

3
3.2 Pages and Sites

The web graph contains not only link information
but also URL information.
Pages in the same site are tightly connected to
each other.
Each HTML page belongs to its owner site.
Ex) The page with URL
http://apex.sjtu.edu.cn/home.zh-cn.gb.htm belongs
to the site http://apex.sjtu.edu.cn.
In the web graph, a large portion of the links
connect two pages from the same site.
Ex) In the web graph of China, more than two
thirds of the links point to a local page within
the same site, while only about one third of the
links are remote, that is, across different sites.
For most sites, a homepage is provided to guide
users to the pages they want.
The homepage points to the most important pages,
and those pages also link back to the homepage.
If we group the pages according to their owner
sites, most links stay inside a group, so the
split will not be too bad.
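
The grouping of pages by owner site can be sketched as follows. This is a minimal illustration, assuming that a site is identified by the scheme and host of the URL (the slide gives only one example of this); the function names site_of, group_by_site and local_fraction are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Return the owner site of a page URL as scheme://host (an assumption)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def group_by_site(urls):
    """Map each site to the list of its pages."""
    groups = defaultdict(list)
    for url in urls:
        groups[site_of(url)].append(url)
    return dict(groups)


def local_fraction(links):
    """Fraction of (source, target) links whose two pages share a site."""
    if not links:
        return 0.0
    local = sum(1 for src, dst in links if site_of(src) == site_of(dst))
    return local / len(links)
```

With the slide's example, site_of("http://apex.sjtu.edu.cn/home.zh-cn.gb.htm") gives "http://apex.sjtu.edu.cn". local_fraction measures how many links stay inside a group after such a split.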

4
3.3 Clustering the Sites(1/6)

If we regard each site as a group of nodes and
split the web graph accordingly,
each sub-graph will be small enough to fit into
main memory.
But after we decompose all the sub-graphs and
contract the graph G,
we cannot be sure that G can be wholly loaded
into main memory.

Split the web graph
Contract the graph G
5
3.3 Clustering the Sites(2/6)

The number of pages in a site follows a power law.
The number of sites which contain x pages is
proportional to 1/x^k for some k > 1.
China: the exponent k is about 1.74.
If we regard each of these sites as a sub-graph,
the small scale of the sub-graphs will make the
effect of the split unsatisfactory, because we can
only get a little information from many small
sub-graphs.
If we cluster the sites and regard the pages in
each cluster of sites as a sub-graph,
the effect will be better.

Most sites contain only several tens of pages.
The richest 10% of sites possess more than 90% of
the pages, while 90% of sites contain less than
10% of the pages.
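
As an illustration of the power law stated on this slide, the exponent k could be estimated from a histogram of site sizes. The sketch below uses a plain least-squares fit on the log-log histogram, which is a crude estimator chosen here only for brevity; it is not taken from the paper, and the name fit_power_law_exponent is hypothetical.

```python
import math


def fit_power_law_exponent(size_counts):
    """Estimate k in count(x) ~ 1/x^k by least squares on log-log points.

    size_counts maps x (pages per site) to the number of sites with x pages.
    """
    points = [(math.log(x), math.log(c))
              for x, c in size_counts.items() if x > 0 and c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two distinct site sizes")
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    # The slope of the fitted line is -k.
    return -sxy / sxx
```

On real data the tail of the histogram is noisy, so a fit of this kind should be read as a rough check only.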
6
3.3 Clustering the Sites(3/6)

Random Assignment
The easiest way to cluster the sites is random
assignment.
Sites are clustered at random, and little link
information is considered.
Each sub-graph is composed of the pages in several
randomly chosen sites.
Advantage
We can control the size of each cluster so that
it fits into memory.
Disadvantage
Sometimes the sites in a cluster are unrelated.
The connectivity among these sites is poor.
The effect is the same as if they were not
clustered.
To avoid this situation, we need another way
to cluster the sites.
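
Random assignment with a size limit can be sketched as below. This is one possible reading of the slide, assuming that cluster size is measured in pages and that a cluster is closed just before it would exceed the limit; random_assignment, site_sizes and memory_limit are hypothetical names.

```python
import random


def random_assignment(site_sizes, memory_limit, seed=0):
    """Shuffle the sites, then fill clusters greedily up to memory_limit pages."""
    sites = list(site_sizes)
    random.Random(seed).shuffle(sites)
    clusters, current, current_size = [], [], 0
    for site in sites:
        size = site_sizes[site]
        # Close the current cluster if this site would push it over the limit.
        if current and current_size + size > memory_limit:
            clusters.append(current)
            current, current_size = [], 0
        current.append(site)
        current_size += size
    if current:
        clusters.append(current)
    return clusters
```

No link information is used, which is exactly the weakness noted on the slide: the sites that end up together may have no links between them.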

7
3.3 Clustering the Sites(4/6)

Site Graph SCC
A method following the idea of hierarchical
clustering.
Sites and the links among them also form a
directed graph.
Nodes represent sites and edges represent links.
The weight of a node: the number of pages
which the site possesses.
The weight of an edge: the number of real
hyperlinks between the pages of the two sites.
Each SCC of the site graph: a cluster of sites.
The internal connectivity of each cluster could
be high.
Sites in the same component can reach each other
by directed hyperlinks between their pages.
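
The weighted site graph described on this slide can be built in one pass over the pages and links. The sketch assumes a hypothetical mapping page_site from each page to its owner site, and that links inside one site do not become edges of the site graph.

```python
from collections import Counter


def build_site_graph(page_site, links):
    """Return (node_weight, edge_weight) of the site graph.

    page_site: mapping page -> owner site.
    links: iterable of (source_page, target_page) hyperlinks.
    """
    # Node weight: number of pages the site possesses.
    node_weight = Counter(page_site.values())
    # Edge weight: number of hyperlinks from pages of one site to another.
    edge_weight = Counter()
    for src, dst in links:
        a, b = page_site[src], page_site[dst]
        if a != b:
            edge_weight[(a, b)] += 1
    return node_weight, edge_weight
```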

8
3.3 Clustering the Sites(5/6)

Site Graph SCC (cont.)
Advantage
We can get well-connected sub-graphs by using
site graph SCC.
Most well-connected sites can be clustered into
the same sub-graph.
We ignore the edges with small weight in the site
graph, decompose it into SCCs, and regard each
component as a well-connected cluster of sites.
Disadvantage
We cannot control the size of each sub-graph
precisely.
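
A sketch of the site graph SCC method: drop the light edges, then decompose what is left. The slides do not name an SCC algorithm, so Kosaraju's algorithm is used here as an assumption; treating "small weight" as "weight <= threshold" is also an assumption, as are the function names.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm with explicit stacks; returns a list of sets."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def cluster_sites(node_weight, edge_weight, threshold):
    """Ignore site-graph edges of weight <= threshold, then take the SCCs."""
    kept = [edge for edge, w in edge_weight.items() if w > threshold]
    return strongly_connected_components(list(node_weight), kept)
```

A higher threshold gives smaller, more tightly connected clusters, but the cluster sizes are whatever the SCCs happen to be, which matches the disadvantage listed above.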

9
3.3 Clustering the Sites(6/6)

Hierarchical Site SCC
Gradually increase the threshold of the edges,
starting from 0.
In the second method, the threshold of the edges
is fixed in each step.
When the threshold gets higher, small components
will be detached from the core.
Then we fill the current cluster with those
detached components.
Once the size of the current cluster is estimated
to reach the memory limit, we begin to construct
another cluster.
This procedure stops when the remaining graph
is estimated to be smaller than the memory limit.
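
The procedure on this slide leaves several details open, so the following is only a sketch under stated assumptions: the core is taken to be the heaviest component, thresholds step through the distinct edge weights, sizes are counted in pages, and the remaining core becomes the last cluster. The SCC helper uses naive reachability to keep the example short; it is not meant to be efficient. All names are hypothetical.

```python
def reachable(start, adj):
    """Nodes reachable from start by following adj (iterative DFS)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def naive_sccs(nodes, edges):
    """SCCs as forward-reachable set intersected with backward-reachable set."""
    fwd, rev = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        rev.setdefault(b, []).append(a)
    left, found = set(nodes), []
    while left:
        seed = next(iter(left))
        comp = reachable(seed, fwd) & reachable(seed, rev)
        found.append(comp)
        left -= comp
    return found


def hierarchical_site_scc(node_weight, edge_weight, memory_limit):
    """Raise the edge threshold step by step and pack detached components."""
    def size(sites):
        return sum(node_weight[s] for s in sites)

    remaining = set(node_weight)
    clusters, current, current_size = [], [], 0
    for threshold in [0] + sorted(set(edge_weight.values())):
        if size(remaining) <= memory_limit:
            break  # the remaining graph fits into memory
        edges = [(a, b) for (a, b), w in edge_weight.items()
                 if w > threshold and a in remaining and b in remaining]
        comps = naive_sccs(remaining, edges)
        core = max(comps, key=lambda c: (size(c), sorted(c)))
        # Components detached from the core fill the current cluster.
        for comp in sorted((c for c in comps if c is not core), key=sorted):
            if current and current_size + size(comp) > memory_limit:
                clusters.append(current)
                current, current_size = [], 0
            current.extend(sorted(comp))
            current_size += size(comp)
        remaining = core
    if current:
        clusters.append(current)
    clusters.append(sorted(remaining))
    return clusters
```

Unlike the fixed-threshold method, the cluster size is checked while the clusters are being filled, so it can be kept near the memory limit.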

10
3.4 Efficiency(1/4)

Split
Splitting the graph requires one extra copy of
the full graph each time.
It may take several hours to several days,
depending on the size of the graph.
As the web graph is very large, even reading the
whole graph from hard disk into memory takes
several hours.

11
3.4 Efficiency(2/4)

Decompose Sub-graph/G
Decomposing the sub-graphs (or G) can be finished
in O(n+e) time.
n: the number of nodes in the graph.
e: the number of edges in the graph.
Each edge in the original graph G
will be visited at most once, in one sub-graph or
in the contracted graph G.
It may take a few days in total, depending on the
size of the whole graph.

12
3.4 Efficiency(3/4)

Contract the graph
Graph contraction requires at most one full-graph
copy.
Only a small portion of the edges will be copied
if the split is appropriate.
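
Graph contraction can be sketched as follows, assuming a hypothetical mapping component_of from each node to the id of the component found for it in its sub-graph. Edges inside one component disappear, which is why only a small portion of the edges is copied when the split is good.

```python
def contract(edges, component_of):
    """Collapse every node into its component; keep only cross-component edges."""
    contracted = set()
    for a, b in edges:
        ca, cb = component_of[a], component_of[b]
        if ca != cb:
            contracted.add((ca, cb))
    return contracted
```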

13
3.4 Efficiency(4/4)

Merge the SCCs
The cost of merging SCCs is O(n log n).
Once we have the SCC information M1 from the
sub-graphs and the SCC information M2 from G, we
can sort the information in M1.
Then we can output each component in M2 using
binary search in M1.
It takes only a few hours to merge the SCCs, so
we can ignore this cost compared with the cost of
the other steps.
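
The merge step can be sketched with a sort followed by binary search. The slides do not give the layout of M1 and M2, so the layout here is an assumption: M1 is a list of (sub-graph component id, page) pairs, and M2 is a list of final components, each a list of sub-graph component ids. The name merge_sccs is hypothetical.

```python
from bisect import bisect_left


def merge_sccs(m1, m2):
    """Expand each final component in m2 into its pages by searching m1."""
    m1 = sorted(m1)  # the O(n log n) part
    keys = [cid for cid, _ in m1]
    merged = []
    for component in m2:
        pages = []
        for cid in component:
            i = bisect_left(keys, cid)  # first entry of this sub-component
            while i < len(keys) and keys[i] == cid:
                pages.append(m1[i][1])
                i += 1
        merged.append(pages)
    return merged
```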
The extra cost of the split-merge algorithm
depends only on how many times we need to split
the graph, and
comes only from the I/O operations during the
split and graph contraction.
If we split the graph just a few times, such a
cost is affordable.
Further optimizations are still available.
Ex) Eliminating the nodes with zero out-degree
(or in-degree) is a good idea if the final G
is a bit too large.
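
The optimization in the last example works because a node with zero out-degree or zero in-degree cannot lie on a cycle, so it is an SCC by itself and can be set aside before decomposition. A minimal single-pass sketch follows; the name trim_once is hypothetical.

```python
def trim_once(nodes, edges):
    """Remove nodes with zero in-degree or zero out-degree in a single pass.

    Returns (kept_nodes, kept_edges, trivial_nodes); each trivial node is
    a strongly connected component on its own.
    """
    has_out = {a for a, _ in edges}
    has_in = {b for _, b in edges}
    kept = {v for v in nodes if v in has_out and v in has_in}
    trivial = set(nodes) - kept
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges, trivial
```

Removing such nodes may expose new ones, so the pass could be repeated until nothing changes; whether the authors do so is not stated on the slide.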