Title: An Algorithm for Enumerating SCCs in Web Graph
- J. Han, Y. Yu, G. Liu, and G. Xue,
- Proc. 7th APWeb Conference, 2005
- Kyoung Hoon Kwak
- (May 02, 2006)
Contents
- 4. Experiments and Results
- Environments
- 4.1 Traditional Algorithm
- 4.2 The Split-Merge Algorithm
- 4.3 Efficiency
- 4.4 Result
- 5. Conclusions
4. Experiments and Results
- Object
- Discuss the detailed implementation of enumerating SCCs in the web graph in China
- Data
- The data was crawled by Peking Univ.'s Sky Net search engine in May 2003
- The graph contains around 140 million nodes and 4.3 billion edges (node = page, edge = link)
- Machine
- 2.4GHz Xeon CPU × 4 / 4GB SDRAM / 150GB HDD × 7, RAID5
- Windows 2003 / gcc ver. 2.95.3 for Windows
- 2.1GB of main memory is available
4.1 Traditional Algorithm
- The main program visits the edges in the graph through a cache
- The cache hit rate is critical for performance
- Due to the large scale of the graph, the actual hit rate is very low
- It may take several years to finish the work
- It is hard to increase the hit rate
- Therefore, a more efficient algorithm is needed
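The slide does not name the "traditional algorithm"; assuming it is a standard linear-time SCC routine such as Tarjan's, a minimal iterative Python sketch on a toy adjacency list might look like this (iterative to avoid deep recursion on a large graph, though the poor cache locality described above would remain):

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping node -> list of successor nodes.
    Iterative form of Tarjan's algorithm (no recursion).
    """
    index = {}        # discovery order of each visited node
    lowlink = {}      # lowest index reachable from the node
    on_stack = set()
    stack = []        # Tarjan's component stack
    sccs = []
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        # explicit DFS stack of (node, iterator over its successors)
        work = [(root, iter(graph[root]))]
        while work:
            node, succs = work[-1]
            advanced = False
            for w in succs:
                if w not in index:
                    # tree edge: descend into w
                    index[w] = lowlink[w] = counter
                    counter += 1
                    stack.append(w)
                    on_stack.add(w)
                    work.append((w, iter(graph.get(w, []))))
                    advanced = True
                    break
                elif w in on_stack:
                    # back/cross edge into the current component
                    lowlink[node] = min(lowlink[node], index[w])
            if not advanced:
                work.pop()
                if work:
                    parent = work[-1][0]
                    lowlink[parent] = min(lowlink[parent], lowlink[node])
                if lowlink[node] == index[node]:
                    # node is the root of an SCC: pop the component
                    comp = []
                    while True:
                        w = stack.pop()
                        on_stack.discard(w)
                        comp.append(w)
                        if w == node:
                            break
                    sccs.append(comp)
    return sccs
```

On a web-scale graph the successor lists would live on disk, and each `graph[node]` lookup is exactly the cache-unfriendly random access the slide complains about.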
4.2 The Split-Merge Algorithm (1)
- Split the graph into 100 sub-graphs
- Build a site graph
- The site graph contains around 470 thousand nodes and 18 million edges
- The sum of the weights of the edges that link nodes to themselves is about 2/3 of the total weight
- Connectivity within each site is good
- Cluster the sites
- Set a threshold for the site graph
- Ignore edges with weights less than the threshold
- Get SCCs with at least three sites
- Unclassified sites are viewed as a temporary cluster
<Threshold = 3>
<Site graph with threshold>
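The clustering step above (threshold the site graph, take SCCs with at least three sites, lump the rest into one temporary cluster) can be sketched as follows; the edge-list representation and the Kosaraju SCC routine are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def cluster_sites(weighted_edges, threshold, min_size=3):
    """Cluster a weighted site graph as on the slide: drop edges
    lighter than `threshold`, take SCCs of what remains, keep SCCs
    with at least `min_size` sites as clusters, and put every
    remaining site into one temporary cluster."""
    fwd, rev, sites = defaultdict(list), defaultdict(list), set()
    for u, v, w in weighted_edges:
        sites.update((u, v))
        if w >= threshold:                  # ignore light edges
            fwd[u].append(v)
            rev[v].append(u)

    # Kosaraju: order sites by DFS finish time, then DFS the transpose.
    seen, order = set(), []
    def dfs1(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)
    for s in sites:
        if s not in seen:
            dfs1(s)

    seen = set()
    def dfs2(u, comp):
        seen.add(u)
        comp.add(u)
        for v in rev[u]:
            if v not in seen:
                dfs2(v, comp)

    clusters, temporary = [], set()
    for s in reversed(order):
        if s not in seen:
            comp = set()
            dfs2(s, comp)
            if len(comp) >= min_size:
                clusters.append(comp)
            else:
                temporary |= comp           # unclassified sites
    return clusters, temporary
```

Recursive DFS is fine for this small sketch; at 470K sites an iterative DFS would be used instead.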
4.2 The Split-Merge Algorithm (2)
- Threshold = 1000
- 1249 sites are classified into 99 clusters; 469K sites are not classified
- Most classified sites are famous and have many pages
- Most unclassified sites are very small sites
- Some clusters are just parts of a big famous site
- Two clusters are still too large to load into memory
- The biggest cluster of sites (25%) and the temporary cluster (70%)
- Recursively apply the split-merge algorithm to them (using random assignment)
<Cluster: http://china.com, http://edu.china.com, http://business.china.com, http://news.china.com>
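The "random assignment" used to re-split the two oversized clusters is not detailed on the slide; one plausible reading is to assign each node to a part uniformly at random, keep per-part internal edges, and carry the cross-part edges forward to the later merge pass. A sketch under that assumption (all names here are hypothetical):

```python
import random

def random_split(nodes, edges, max_part_size, seed=0):
    """Randomly assign the nodes of an oversized cluster to parts
    small enough to fit in memory.

    Returns (part_of, internal, cross): a node -> part map, per-part
    internal edge lists, and the cross-part edges that a later merge
    pass must still process.
    """
    rng = random.Random(seed)               # deterministic for the sketch
    nodes = list(nodes)
    n_parts = -(-len(nodes) // max_part_size)   # ceiling division
    part_of = {u: rng.randrange(n_parts) for u in nodes}
    internal = [[] for _ in range(n_parts)]
    cross = []
    for u, v in edges:
        if part_of[u] == part_of[v]:
            internal[part_of[u]].append((u, v))
        else:
            cross.append((u, v))
    return part_of, internal, cross
```

Random assignment only bounds part sizes in expectation; a real implementation would cap parts explicitly.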
4.2 The Split-Merge Algorithm (3)
- After finding all SCCs in each sub-graph
- We can build the final graph G
- G contains less than 46 million edges
- Decompose G and merge the SCCs of each sub-graph
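Once G (each local SCC contracted to a single node, plus the cross-sub-graph edges) is decomposed by any standard SCC routine, the merge step is just a union of the local SCCs that land in the same SCC of G. A minimal sketch, where the id scheme and data shapes are assumptions:

```python
def merge_sccs(local_sccs, condensed_sccs):
    """Merge per-sub-graph SCCs using the SCCs of the condensed graph G.

    local_sccs: dict scc_id -> set of original nodes (from each sub-graph).
    condensed_sccs: list of lists of scc_ids, each list being one SCC of G
                    (produced by any standard SCC routine on G).
    Returns the final SCCs of the whole graph as sets of original nodes.
    """
    final = []
    for group in condensed_sccs:
        merged = set()
        for scc_id in group:
            merged |= local_sccs[scc_id]   # union the member node sets
        final.append(merged)
    return final
```

This works because two local SCCs belong to the same global SCC exactly when their contracted nodes are mutually reachable in G.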
4.3 Efficiency
- Total time cost: less than a full week
- I/O cost of building the site graph: less than 5 days
- Decomposing all sub-graphs and G: one day and a half
- Merging SCCs: less than 6 hours
- The other costs are negligible compared with those listed (e.g. decomposing the site graph)
4.4 Result (1)
- About 80% of web pages are in the maximum SCC
- If pages u and v are randomly chosen, the probability that there exists a path from u to v is about 4/5
<The bowtie structure of the web graph in China>
4.4 Result (2)
- The structure of the web graph in China is much different from the structure of the web
- Cultural differences
- Different styles in which people create HTML pages
- Some other reasons
<The bowtie structure of the web graph>
4.4 Result (3)
- The size of the SCCs follows a power law with an exponent around 2.3
<Distribution of SCC sizes of the web graph in China>
<Distribution of SCC sizes of the web graph>
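The slide does not say how the exponent was fit; one common choice is the continuous maximum-likelihood estimator for P(x) ∝ x^(−α), sketched here as an assumption rather than the paper's method:

```python
import math

def power_law_exponent(sizes, x_min=1):
    """Continuous MLE of the exponent alpha for a power-law
    distribution P(x) ~ x^(-alpha), fit over sizes >= x_min.

    Note: if every size equals x_min the sum of logs is zero and
    the estimator is undefined; real data (like SCC sizes spanning
    many orders of magnitude) avoids this.
    """
    xs = [x for x in sizes if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)
```

On a heavy-tailed sample of SCC sizes this would return a value near the slide's reported exponent of about 2.3; the toy data below is only for illustration.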
5. Conclusions
- With the basic idea of split-merge
- Take advantage of some useful properties of the graph
- Find a feasible method to enumerate SCCs
- Apply the algorithm to the web graph in China and accomplish the task in an affordable time
- Site graph
- Plays an important role
- Can be viewed as a folded version of the web graph
- Future work
- Find more potential relationships between sites and pages
- Try to predict some properties of the web graph from analysis of the site graph