Title: An Algorithm for Enumerating SCCs in Web Graph
- J. Han, Y. Yu, G. Liu, and G. Xue,
- Proc. 7th APWeb Conference, 2005
- Kyoung Hoon Kwak
- (May 02, 2006)
Contents
- 4. Experiments and Results
- Environments
- 4.1 Traditional Algorithm
- 4.2 The Split-Merge Algorithm
- 4.3 Efficiency
- 4.4 Result
- 5. Conclusions
4. Experiments and Results
- Object
- Discuss the detailed implementation of enumerating SCCs in the web graph in China
- Data
- The data was crawled by Peking Univ.'s Sky Net search engine in May 2003
- The graph contains around 140 million nodes and 4.3 billion edges (node = page, edge = link)
- Machine
- 2.4GHz Xeon CPU × 4 / 4GB SDRAM / 150GB HDD × 7, RAID5
- Windows 2003 / gcc ver. 2.95.3 for Windows
- 2.1GB of main memory is available
4.1 Traditional Algorithm
- The main program visits the edges in the graph through a cache
- The cache hit rate is critical for performance
- Due to the large scale of the graph, the actual hit rate is very low
- It may take several years to finish the work
- It is hard to increase the hit rate
- Therefore, a more efficient algorithm is needed
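The slide does not name the "traditional algorithm"; assuming it is a standard linear-time SCC routine such as Tarjan's, a minimal iterative Python sketch on a toy adjacency list might look like this (iterative to avoid deep recursion on a large graph, though the poor cache locality described above would remain):

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping node -> list of successor nodes.
    Iterative form of Tarjan's algorithm (no recursion).
    """
    index = {}        # discovery order of each visited node
    lowlink = {}      # lowest index reachable from the node
    on_stack = set()
    stack = []        # Tarjan's component stack
    sccs = []
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        # explicit DFS stack of (node, iterator over its successors)
        work = [(root, iter(graph[root]))]
        while work:
            node, succs = work[-1]
            advanced = False
            for w in succs:
                if w not in index:
                    # tree edge: descend into w
                    index[w] = lowlink[w] = counter
                    counter += 1
                    stack.append(w)
                    on_stack.add(w)
                    work.append((w, iter(graph.get(w, []))))
                    advanced = True
                    break
                elif w in on_stack:
                    # back/cross edge into the current component
                    lowlink[node] = min(lowlink[node], index[w])
            if not advanced:
                work.pop()
                if work:
                    parent = work[-1][0]
                    lowlink[parent] = min(lowlink[parent], lowlink[node])
                if lowlink[node] == index[node]:
                    # node is the root of an SCC: pop the component
                    comp = []
                    while True:
                        w = stack.pop()
                        on_stack.discard(w)
                        comp.append(w)
                        if w == node:
                            break
                    sccs.append(comp)
    return sccs
```

On a web-scale graph the successor lists would live on disk, and each `graph[node]` lookup is exactly the cache-unfriendly random access the slide complains about.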
4.2 The Split-Merge Algorithm (1)
- Split the graph into 100 sub-graphs
- Build a site graph
- The site graph contains around 470 thousand nodes and 18 million edges
- The sum of the weights of the edges that link nodes to themselves is about 2/3 of the total weight
- Connectivity within each site is good
- Cluster the sites
- Set a threshold for the site graph
- Ignore edges with weights less than the threshold
- Get SCCs with at least three sites
- Unclassified sites are viewed as a temporary cluster
<Threshold = 3>
<Site graph with threshold>
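The clustering step above (threshold the site graph, take SCCs with at least three sites, lump the rest into one temporary cluster) can be sketched as follows; the edge-list representation and the Kosaraju SCC routine are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def cluster_sites(weighted_edges, threshold, min_size=3):
    """Cluster a weighted site graph as on the slide: drop edges
    lighter than `threshold`, take SCCs of what remains, keep SCCs
    with at least `min_size` sites as clusters, and put every
    remaining site into one temporary cluster."""
    fwd, rev, sites = defaultdict(list), defaultdict(list), set()
    for u, v, w in weighted_edges:
        sites.update((u, v))
        if w >= threshold:                  # ignore light edges
            fwd[u].append(v)
            rev[v].append(u)

    # Kosaraju: order sites by DFS finish time, then DFS the transpose.
    seen, order = set(), []
    def dfs1(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)
    for s in sites:
        if s not in seen:
            dfs1(s)

    seen = set()
    def dfs2(u, comp):
        seen.add(u)
        comp.add(u)
        for v in rev[u]:
            if v not in seen:
                dfs2(v, comp)

    clusters, temporary = [], set()
    for s in reversed(order):
        if s not in seen:
            comp = set()
            dfs2(s, comp)
            if len(comp) >= min_size:
                clusters.append(comp)
            else:
                temporary |= comp           # unclassified sites
    return clusters, temporary
```

Recursive DFS is fine for this small sketch; at 470K sites an iterative DFS would be used instead.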
4.2 The Split-Merge Algorithm (2)
- Threshold = 1000
- 1249 sites are classified into 99 clusters; 469K sites are not classified
- Most classified sites are famous and have many pages
- Most unclassified sites are very small sites
- Some clusters are just parts of a big famous site
- Two clusters are still too large to load into memory
- The biggest cluster of sites (25%) and the temporary cluster (70%)
- Recursively apply the split-merge algorithm to them (using random assignment)
<Cluster: http://china.com, http://edu.china.com, http://business.china.com, http://news.china.com>
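The "random assignment" used to re-split the two oversized clusters is not detailed on the slide; one plausible reading is to assign each node to a part uniformly at random, keep per-part internal edges, and carry the cross-part edges forward to the later merge pass. A sketch under that assumption (all names here are hypothetical):

```python
import random

def random_split(nodes, edges, max_part_size, seed=0):
    """Randomly assign the nodes of an oversized cluster to parts
    small enough to fit in memory.

    Returns (part_of, internal, cross): a node -> part map, per-part
    internal edge lists, and the cross-part edges that a later merge
    pass must still process.
    """
    rng = random.Random(seed)               # deterministic for the sketch
    nodes = list(nodes)
    n_parts = -(-len(nodes) // max_part_size)   # ceiling division
    part_of = {u: rng.randrange(n_parts) for u in nodes}
    internal = [[] for _ in range(n_parts)]
    cross = []
    for u, v in edges:
        if part_of[u] == part_of[v]:
            internal[part_of[u]].append((u, v))
        else:
            cross.append((u, v))
    return part_of, internal, cross
```

Random assignment only bounds part sizes in expectation; a real implementation would cap parts explicitly.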
4.2 The Split-Merge Algorithm (3)
- After finding all SCCs in each sub-graph
- We can build the final graph G
- G contains less than 46 million edges
- Decompose G and merge the SCCs of each sub-graph
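Once G (each local SCC contracted to a single node, plus the cross-sub-graph edges) is decomposed by any standard SCC routine, the merge step is just a union of the local SCCs that land in the same SCC of G. A minimal sketch, where the id scheme and data shapes are assumptions:

```python
def merge_sccs(local_sccs, condensed_sccs):
    """Merge per-sub-graph SCCs using the SCCs of the condensed graph G.

    local_sccs: dict scc_id -> set of original nodes (from each sub-graph).
    condensed_sccs: list of lists of scc_ids, each list being one SCC of G
                    (produced by any standard SCC routine on G).
    Returns the final SCCs of the whole graph as sets of original nodes.
    """
    final = []
    for group in condensed_sccs:
        merged = set()
        for scc_id in group:
            merged |= local_sccs[scc_id]   # union the member node sets
        final.append(merged)
    return final
```

This works because two local SCCs belong to the same global SCC exactly when their contracted nodes are mutually reachable in G.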
4.3 Efficiency
- Total time cost: less than a full week
- I/O cost of building the site graph: less than 5 days
- Decomposing all sub-graphs and G: one day and a half
- Merging SCCs: less than 6 hours
- The other costs are negligible compared with those listed (e.g. decomposing the site graph)
4.4 Result (1)
- About 80% of web pages are in the maximum SCC
- If pages u and v are randomly chosen, the probability that there exists a path from u to v is about 4/5
<The bowtie structure of the web graph in China>
4.4 Result (2)
- The structure of the web graph in China is much different from the structure of the web
- Cultural differences
- Different styles in which people create HTML pages
- Some other reasons
<The bowtie structure of the web graph>
4.4 Result (3)
- The size of the SCCs follows a power law with an exponent around 2.3
<Distribution of SCC sizes of the web graph in China>
<Distribution of SCC sizes of the web graph>
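The slide does not say how the exponent was fit; one common choice is the continuous maximum-likelihood estimator for P(x) ∝ x^(−α), sketched here as an assumption rather than the paper's method:

```python
import math

def power_law_exponent(sizes, x_min=1):
    """Continuous MLE of the exponent alpha for a power-law
    distribution P(x) ~ x^(-alpha), fit over sizes >= x_min.

    Note: if every size equals x_min the sum of logs is zero and
    the estimator is undefined; real data (like SCC sizes spanning
    many orders of magnitude) avoids this.
    """
    xs = [x for x in sizes if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)
```

On a heavy-tailed sample of SCC sizes this would return a value near the slide's reported exponent of about 2.3; the toy data below is only for illustration.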
5. Conclusions
- With the basic idea of split-merge
- Take advantage of some useful properties of the graph
- Find a feasible method to enumerate SCCs
- Apply the algorithm to the web graph in China and accomplish the task in an affordable time
- Site graph
- Plays an important role
- Can be viewed as a folded version of the web graph
- Future work
- Find more potential relationships between sites and pages
- Try to predict some properties of the web graph from analysis of the site graph