Title: An Algorithm for Enumerating SCCs in Web Graph


1
An Algorithm for Enumerating SCCs in Web Graph
  • J. Han, Y. Yu, G. Liu, and G. Xue,
  • Proc. 7th APWeb Conference, 2005
  • Kyoung Hoon Kwak
  • (May 02, 2006)

2
Contents
  • 4. Experiments and Results
  • Environments
  • 4.1 Traditional Algorithm
  • 4.2 The Split-Merge Algorithm
  • 4.3 Efficiency
  • 4.4 Result
  • 5. Conclusions

3
4. Experiments and Results
  • Objective
  • Discuss the detailed implementation of enumerating
    SCCs in the web graph of China
  • Data
  • Data was crawled by Peking Univ.'s Sky Net search
    engine in May 2003
  • The graph contains around 140 million nodes
    and 4.3 billion edges (node = page, edge =
    link)
  • Machine
  • 2.4GHz x 4 Xeon CPU / 4GB SDRAM / 150GB x 7 HDD
    RAID5
  • Windows 2003 / gcc ver. 2.95.3 for Windows
  • 2.1GB of main memory is available

4
4.1 Traditional Algorithm
  • The main program visits the edges of the graph
    through a cache
  • The hit rate is critical for performance
  • Due to the large scale of the graph, the actual
    hit rate is very low
  • It may take several years to finish the work
  • It is hard to increase the hit rate
  • Therefore, a more efficient algorithm is needed
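
The slide does not say which traditional algorithm was used. As an illustration only, the sketch below assumes Tarjan's algorithm, written iteratively and run on an in-memory adjacency list. The function name `tarjan_scc` and the edge-list input format are assumptions made for this example. The depth-first traversal follows edges in an order that does not match how they are stored, which is consistent with the low cache hit rate described above once the graph no longer fits in memory.

```python
from collections import defaultdict


def tarjan_scc(num_nodes, edges):
    """Return the SCCs (lists of node ids) of a graph on nodes 0..num_nodes-1."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    index = [None] * num_nodes      # discovery order of each node
    low = [0] * num_nodes           # smallest index reachable from the node
    on_stack = [False] * num_nodes
    stack, sccs = [], []
    counter = 0

    for root in range(num_nodes):
        if index[root] is not None:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack[root] = True
        # Explicit DFS stack of (node, iterator over successors), so that
        # deep graphs do not hit Python's recursion limit.
        work = [(root, iter(adj[root]))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if index[nxt] is None:
                    index[nxt] = low[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack[nxt] = True
                    work.append((nxt, iter(adj[nxt])))
                    descended = True
                    break
                if on_stack[nxt]:
                    low[node] = min(low[node], index[nxt])
            if descended:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:
                # node is the root of an SCC: pop its members off the stack.
                component = []
                while True:
                    w = stack.pop()
                    on_stack[w] = False
                    component.append(w)
                    if w == node:
                        break
                sccs.append(component)
    return sccs


if __name__ == "__main__":
    demo_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3), (5, 5)]
    print(sorted(sorted(c) for c in tarjan_scc(6, demo_edges)))
```

The lookups `adj[node]` jump between arbitrary nodes, so with the edges on disk each lookup may miss the cache.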

5
4.2 The Split-Merge Algorithm (1)
  • Split the graph into 100 sub-graphs
  • Build a site graph
  • The site graph contains around 470 thousand
    nodes and 18 million edges
  • The total weight of edges that link a site
    to itself is about 2/3 of the total weight
  • The connectivity within each site is good
  • Cluster the sites
  • Set a threshold for the site graph
  • Ignore edges with weights less than the
    threshold
  • Get SCCs with at least three sites
  • Unclassified sites are viewed as a temporary
    cluster
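
The steps on this slide can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: a site is taken to be the host name of a page URL, the weight of a site edge is the number of page links it folds together, and SCCs are found by intersecting forward and backward reachability, which is only reasonable on a small graph. All function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Map a page URL to its site (assumption: the host name)."""
    return urlsplit(url).netloc


def build_site_graph(page_links):
    """Fold page-level links into weighted site-level edges."""
    weights = defaultdict(int)
    for src, dst in page_links:
        weights[(site_of(src), site_of(dst))] += 1
    return dict(weights)


def self_link_fraction(weights):
    """Share of the total weight carried by edges from a site to itself."""
    total = sum(weights.values())
    own = sum(w for (a, b), w in weights.items() if a == b)
    return own / total if total else 0.0


def apply_threshold(weights, threshold):
    """Drop edges lighter than the threshold; self-links are dropped too,
    since they do not affect which sites are strongly connected."""
    return {(a, b): w for (a, b), w in weights.items()
            if a != b and w >= threshold}


def reachable(start, adj):
    """All nodes reachable from start by following adj."""
    seen, frontier = {start}, [start]
    while frontier:
        for nxt in adj.get(frontier.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def cluster_sites(sites, kept_edges, min_size=3):
    """Return (clusters, temporary): SCCs with at least min_size sites,
    and the set of all remaining sites as one temporary cluster."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for a, b in kept_edges:
        fwd[a].append(b)
        bwd[b].append(a)
    clusters, assigned = [], set()
    for site in sites:
        if site in assigned:
            continue
        component = reachable(site, fwd) & reachable(site, bwd)
        assigned |= component
        if len(component) >= min_size:
            clusters.append(component)
    temporary = set(sites) - set().union(*clusters)
    return clusters, temporary
```

For example, with weights `{("a", "b"): 5, ("b", "c"): 5, ("c", "a"): 5, ("c", "d"): 1, ("d", "c"): 9}` and a threshold of 3, the edge from `c` to `d` is ignored, so `a`, `b`, and `c` form one cluster and `d` falls into the temporary cluster.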

<Site graph with threshold (Threshold = 3)>
6
4.2 The Split-Merge Algorithm (2)
  • Threshold = 1000
  • 1249 sites are classified into 99 clusters and
    469K sites are not classified
  • Most classified sites are famous and have many
    pages
  • Most unclassified sites are very small sites
  • Some clusters are just parts of a big famous site
  • Two clusters are still too large to load into memory
  • The biggest cluster of sites (25) and the temporary
    cluster (70)
  • Recursively apply the split-merge algorithm to
    them (using random assignment)

<Cluster: http://edu.china.com, http://business.china.com,
http://news.china.com, http://china.com>
7
4.2 The Split-Merge Algorithm (3)
  • After finding all SCCs in each sub-graph,
    we can get the final graph G
  • G contains fewer than 46 million edges
  • Decompose G and merge the SCCs of each
    sub-graph
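
The slides do not spell out how G is built. The sketch below assumes one common reading: every SCC found inside a sub-graph is contracted to a single supernode, G keeps every original edge whose endpoints land in different supernodes, and the SCCs of G tell us which local SCCs to merge. The names `scc_labels`, `split_merge_scc`, and `part_of` are hypothetical, and the reachability-based SCC routine is a simple stand-in that would be far too slow at web scale.

```python
from collections import defaultdict


def scc_labels(nodes, edges):
    """Label every node with a representative of its SCC.

    Uses forward/backward reachability; a stand-in for any SCC routine.
    """
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)

    def reach(start, adj):
        seen, frontier = {start}, [start]
        while frontier:
            for nxt in adj.get(frontier.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return seen

    label = {}
    for n in nodes:
        if n not in label:
            for m in reach(n, fwd) & reach(n, bwd):
                label[m] = n
    return label


def split_merge_scc(nodes, edges, part_of):
    """Enumerate SCCs by decomposing each part, then the contracted graph G."""
    # Split: each sub-graph keeps only edges with both endpoints in one part.
    inner = defaultdict(list)
    members = defaultdict(list)
    for n in nodes:
        members[part_of[n]].append(n)
    for u, v in edges:
        if part_of[u] == part_of[v]:
            inner[part_of[u]].append((u, v))

    # Decompose every sub-graph on its own; each local SCC is a supernode.
    super_of = {}
    for part, part_nodes in members.items():
        for n, rep in scc_labels(part_nodes, inner[part]).items():
            super_of[n] = (part, rep)

    # Build G from all edges that connect two different supernodes.
    g_nodes = set(super_of.values())
    g_edges = {(super_of[u], super_of[v]) for u, v in edges
               if super_of[u] != super_of[v]}
    g_label = scc_labels(g_nodes, g_edges)

    # Merge: nodes whose supernodes share an SCC of G share a final SCC.
    final = defaultdict(set)
    for n in nodes:
        final[g_label[super_of[n]]].add(n)
    return list(final.values())
```

G has to keep edges between different local SCCs of the same sub-graph as well as edges that cross sub-graphs. Otherwise a cycle that leaves a sub-graph and re-enters it at a different local SCC would be missed.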

8
4.3 Efficiency
  • Total time cost - less than a full week
  • I/O cost of building the site graph - less than
    5 days
  • Decomposing all sub-graphs and G - one day
    and a half
  • Merging SCCs - less than 6 hours
  • Other costs (e.g., decomposing the site graph)
    are negligible compared with those listed

9
4.4 Result (1)
  • About 80% of web pages are in the maximum SCC
  • If pages u and v are randomly chosen, the
    probability that there exists a path from u to v
    is about 4/5

<The bowtie structure of the web graph in China>
10
4.4 Result (2)
  • The structure of the web graph in China is very
    different from the structure of the web
  • Cultural differences
  • Different styles in which people create HTML pages
  • Some other reasons

<The bowtie structure of the web graph>
11
4.4 Result (3)
  • SCC sizes follow a power law with an
    exponent of around 2.3
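
A power law here means that the number of SCCs of size s is roughly proportional to s^(-2.3). The slides do not say how the exponent was estimated. The sketch below assumes the simplest approach, a least-squares line through the log-log histogram of SCC sizes. The function name `fit_power_law_exponent` is hypothetical, and this estimator is known to be crude on noisy tails.

```python
import math
from collections import Counter


def fit_power_law_exponent(size_counts):
    """Estimate a in count(s) ~ C * s**(-a) from a {size: count} mapping.

    Fits a straight line to (log size, log count) by ordinary least
    squares and returns the negated slope.
    """
    points = [(math.log(s), math.log(c))
              for s, c in size_counts.items() if s > 0 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct sizes")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


def size_histogram(sccs):
    """Count how many SCCs there are of each size."""
    return Counter(len(component) for component in sccs)
```

In use, `fit_power_law_exponent(size_histogram(sccs))` would be applied to the full list of SCCs. On the data described here, a result near 2.3 would match the slide.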

<Distribution of the size of SCCs of the web graph in China>
<Distribution of the size of SCCs of the web graph>
12
5. Conclusions
  • With a basic idea of split-merge
  • Take advantage of some useful properties of the
    graph
  • Find a feasible method to enumerate SCCs
  • Apply the algorithm to the web graph in China and
    finish in an affordable time
  • Site graph
  • Plays an important role
  • Can be viewed as a folded version of the web graph
  • Future work
  • Find more potential relationships between sites
    and pages
  • Try to predict some properties of the web graph
    by analyzing the site graph