An algorithm for Enumerating SCCs in Web Graph

1
An algorithm for Enumerating SCCs in Web Graph
  • Jie Han, Yong Yu, Guowei Liu, and Guirong Xue
  • APWeb 2005, LNCS 3399, pp. 655-667, 2005

May 2, 2006, So Jeong Han
2
Content
  • 3.2 Pages and Sites
  • 3.3 Clustering the Sites
  • 3.4 Efficiency

3
3.2 Pages and Sites
  • The web graph contains not only link information,
    but also URL information.
  • Pages in the same site are tightly connected
    with each other.
  • Each HTML page belongs to its owner site.
  • Ex) The page with URL
    http://apex.sjtu.edu.cn/home.zh-cn.gb.htm belongs
    to the site http://apex.sjtu.edu.cn.
  • In the web graph, a large portion of the links
    are between two pages from the same site.
  • Ex) On the web graph in China, more than two
    thirds of the links point to a local page
    within the same site, while only about one third
    of the links are remote, that is, across
    different sites.
  • For most sites, a homepage is provided to guide
    the user to the pages they want.
  • The homepage points to the most important pages
    and those pages also link back to the homepage.
  • If we group the pages according to their owner
    sites, the split should be reasonably good,
    because most links stay inside a group.
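
The grouping of pages by owner site can be sketched as follows. This is only an illustration, assuming that a "site" is identified by the scheme and host of the URL; the function names `owner_site`, `group_pages_by_site`, and `local_link_fraction` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def owner_site(url):
    """Return scheme://host for a page URL (assumed definition of a site)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def group_pages_by_site(urls):
    """Map each owner site to the list of pages that belong to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[owner_site(url)].append(url)
    return dict(groups)


def local_link_fraction(links):
    """Fraction of (source, target) links whose two pages share a site."""
    if not links:
        return 0.0
    local = sum(1 for src, dst in links if owner_site(src) == owner_site(dst))
    return local / len(links)
```

A high value of `local_link_fraction` on a crawl would support splitting by site, since few links would cross group boundaries.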

4
3.3 Clustering the Sites(1/6)
  • If we regard each site as a group of nodes and
    split the web graph accordingly,
  • each sub-graph will be small enough to fit into
    main memory.
  • But after we decompose all the sub-graphs and
    contract the graph G,
  • it is unclear whether G can be loaded wholly
    into main memory.

[Figure: split the web graph, then contract the graph G]
5
3.3 Clustering the Sites(2/6)
  • The number of pages in a site follows a power law.
  • The number of sites which contain x pages is
    proportional to 1/x^k for some k > 1.
  • For China, the exponent k is about 1.74.
  • If we regard each of these sites as a sub-graph,
  • the small scale of the sub-graphs makes the
    effect of the split unsatisfactory, because we
    get only a little information from each of the
    many small sub-graphs.
  • If we cluster the sites and regard the pages in
    each cluster of sites as a sub-graph,
  • the effect will be better.

Most sites contain only several tens of pages.
The richest 10% of sites possess more than 90% of the pages,
while 90% of sites contain less than 10% of the pages.
6
3.3 Clustering the Sites(3/6)
  • Random Assignment
  • The easiest way to cluster the sites is random
    assignment.
  • In this way, sites are clustered at random and
    little link information is considered.
  • Each sub-graph is composed of the pages in
    several randomly chosen sites.
  • Advantage
  • We can control the size of each cluster so that
    it fits into memory.
  • Disadvantage
  • Sometimes the sites in a cluster are unrelated.
  • The connectivity among these sites is poor.
  • The effect is the same as if they were not
    clustered.
  • To avoid this situation, we need another way
    to cluster the sites.
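
Random assignment could look roughly like the sketch below. It assumes that cluster size is measured in pages and that clusters are filled greedily after a shuffle; `random_assignment`, `site_sizes`, and `memory_limit` are illustrative names, and the slides do not specify the exact procedure.

```python
import random


def random_assignment(site_sizes, memory_limit, seed=0):
    """Shuffle the sites, then fill clusters up to memory_limit pages.

    site_sizes maps each site to its number of pages. A site that is
    larger than memory_limit ends up in a cluster of its own.
    """
    sites = list(site_sizes)
    random.Random(seed).shuffle(sites)  # link information is ignored
    clusters, current, current_size = [], [], 0
    for site in sites:
        size = site_sizes[site]
        # Start a new cluster when this site would overflow the current one.
        if current and current_size + size > memory_limit:
            clusters.append(current)
            current, current_size = [], 0
        current.append(site)
        current_size += size
    if current:
        clusters.append(current)
    return clusters
```

The size of every cluster is under control here, but nothing in the procedure looks at links, which is exactly the disadvantage listed above.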

7
3.3 Clustering the Sites(4/6)
  • Site Graph SCC
  • A method following the idea of hierarchical
    clustering.
  • Sites and the links among them also form a
    directed graph.
  • Nodes represent sites and edges represent links.
  • The weight of a node = the number of pages
    that the site possesses.
  • The weight of an edge = the number of real
    hyperlinks across the pages of the two sites.
  • Each SCC of the site graph = a cluster of sites.
  • The internal connectivity of each cluster can
    be high.
  • Sites in the same component can reach each other
    by directed hyperlinks between their pages.
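
A minimal sketch of how such a weighted site graph might be built from page-level data. It assumes a precomputed mapping `site_of` from page to owner site and a list of page-level `links`; both names, and the choice to drop links within one site, are assumptions made for illustration.

```python
from collections import Counter


def build_site_graph(site_of, links):
    """Build the weighted site graph from page-level data.

    site_of: dict mapping each page to its owner site.
    links:   iterable of (source_page, target_page) hyperlinks.
    Returns (node_weight, edge_weight), where node_weight[site] is the
    number of pages in the site and edge_weight[(s, t)] is the number
    of hyperlinks from pages of site s to pages of site t.
    """
    node_weight = Counter(site_of.values())
    edge_weight = Counter()
    for src, dst in links:
        s, t = site_of[src], site_of[dst]
        if s != t:  # links inside one site do not appear in the site graph
            edge_weight[(s, t)] += 1
    return dict(node_weight), dict(edge_weight)
```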

8
3.3 Clustering the Sites(5/6)
  • Site Graph SCC (cont.)
  • Advantage
  • We can get well-connected sub-graphs by using
    site graph SCC.
  • Most well-connected sites can be clustered into
    the same sub-graph.
  • We ignore the edges with small weight in the site
    graph and decompose it into SCCs, and regard each
    component as a well connected cluster of sites.
  • Disadvantage
  • We cannot control the size of each sub-graph
    precisely.

9
3.3 Clustering the Sites(6/6)
  • Hierarchical Site SCC
  • Gradually increase the edge-weight threshold,
    starting from 0.
  • In the second method (Site Graph SCC), the
    threshold of the edges is fixed in each step.
  • When the threshold gets higher, small components
    are detached from the core.
  • We then fill the current cluster with those
    detached components.
  • Once the size of the current cluster is estimated
    to reach the memory limit, we begin to construct
    another cluster.
  • The procedure stops when the remaining graph
    is estimated to be smaller than the memory limit.
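
The procedure above leaves several details open, so the following is only one possible reading, not the authors' implementation. It assumes that edges with weight at or below the threshold are ignored, that the "core" is the component holding the most pages, that the threshold steps through the distinct edge weights, and that cluster size is counted in pages. The SCC routine uses Kosaraju's two-pass algorithm; all function and parameter names are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's two-pass algorithm; returns a list of node lists."""
    nodes = list(nodes)
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: flood-fill the reversed graph in reverse completion order.
    comps, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, todo = [], [root]
        while todo:
            node = todo.pop()
            comp.append(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        comps.append(comp)
    return comps


def hierarchical_site_clusters(node_weight, edge_weight, memory_limit):
    """Raise the threshold step by step and pack detached components."""
    def pages(sites):
        return sum(node_weight[s] for s in sites)

    remaining = sorted(node_weight)
    clusters, current = [], []
    for threshold in [0] + sorted(set(edge_weight.values())):
        if pages(remaining) <= memory_limit:
            break  # the remaining graph fits into memory
        alive = set(remaining)
        edges = [(u, v) for (u, v), w in edge_weight.items()
                 if w > threshold and u in alive and v in alive]
        comps = strongly_connected_components(remaining, edges)
        core = max(comps, key=pages)  # assumed definition of the core
        for comp in comps:
            if comp is core:
                continue
            # Start a new cluster once the current one would overflow.
            if current and pages(current) + pages(comp) > memory_limit:
                clusters.append(sorted(current))
                current = []
            current.extend(comp)
        remaining = sorted(core)
    if current:
        clusters.append(sorted(current))
    clusters.append(remaining)  # what is left of the core is the last cluster
    return clusters
```

With a fixed threshold, a single call to `strongly_connected_components` on the filtered edges corresponds to the Site Graph SCC method; the loop adds the size control that method lacks.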

10
3.4 Efficiency(1/4)
  • Split
  • Splitting the graph requires one extra copy of
    the full graph each time.
  • It may take several hours to several days,
    depending on the size of the graph.
  • As the web graph is very large, even reading the
    whole graph from hard disk into memory takes
    several hours.

11
3.4 Efficiency(2/4)
  • Decompose Sub-graph/G
  • Decomposing the sub-graphs (or G) can be finished
    in O(n + e) time.
  • n: the number of nodes in the graph.
  • e: the number of edges in the graph.
  • Each edge in the original graph G
  • will be visited at most once in one sub-graph or
    in one contracted graph G.
  • It may take a few days in total, depending on the
    size of the whole graph.
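
The slides do not name the in-memory decomposition algorithm. As an illustration of a linear-time decomposition, here is a sketch of Tarjan's algorithm in iterative form (to avoid recursion limits on large graphs). The adjacency-dict input format and the name `tarjan_scc` are assumptions.

```python
def tarjan_scc(graph):
    """Tarjan's SCC algorithm, iterative form.

    graph: dict mapping every node to a list of its successors.
    Returns a list of components, each a list of nodes.
    """
    index = {}        # discovery number of each visited node
    low = {}          # smallest discovery number reachable from the node
    stack, on_stack = [], set()
    comps = []
    for root in graph:
        if root in index:
            continue
        index[root] = low[root] = len(index)
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph[root]))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:  # each edge is consumed exactly once
                if nxt not in index:
                    index[nxt] = low[nxt] = len(index)
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph[nxt])))
                    descended = True
                    break
                if nxt in on_stack:
                    low[node] = min(low[node], index[nxt])
            if descended:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:  # node is the root of a component
                comp = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    comp.append(member)
                    if member == node:
                        break
                comps.append(comp)
    return comps
```

Because every node is pushed once and every edge iterator advances once per edge, the running time is proportional to n + e.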

12
3.4 Efficiency(3/4)
  • Contract the graph
  • Graph contraction requires at most one full-graph
    copy.
  • Only a small portion of the edges will be copied
    if the split is appropriate.
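
Contraction can be sketched as below, assuming a mapping `component_of` from each original node to the component found for it in its sub-graph. The names are illustrative, and a real implementation would stream edges from disk rather than hold them in a list.

```python
def contract(edges, component_of):
    """Return the edges of the contracted graph.

    edges:        iterable of (u, v) edges of the original graph.
    component_of: dict mapping each node to its component label.
    Edges inside one component vanish; parallel edges collapse to one.
    """
    contracted = set()
    for u, v in edges:
        cu, cv = component_of[u], component_of[v]
        if cu != cv:
            contracted.add((cu, cv))
    return contracted
```

When the split keeps well-connected pages together, most edges fall inside a component and are dropped, which is why only a small portion needs to be copied.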

13
3.4 Efficiency(4/4)
  • Merge the SCCs
  • The cost of merging SCCs is O(n log n).
  • Once we have the SCC information M1 from the
    sub-graphs and the SCC information M2 from G, we
    can sort the information in M1.
  • Then we can output each component in M2 using
    binary search in M1.
  • Merging the SCCs takes only a few hours, which is
    negligible compared with the cost of the other
    steps.
  • The extra cost of the split-merge algorithm
  • depends only on how many times we need to split
    the graph.
  • comes only from the I/O operations during the
    split and graph contraction.
  • If we split the graph just a few times, this cost
    is affordable.
  • Further optimizations are still available.
  • Ex) Eliminating the nodes with zero out-degree
    (or in-degree) is a good idea if the final G
    is still a bit too large.
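
The merge step (sort M1, then binary search for each entry of M2) might be sketched as follows. The layout of M1 and M2 is an assumption: here M1 holds (page, super-node) pairs from the sub-graph decompositions and M2 holds (super-node, final component) pairs from the decomposition of G. The name `merge_sccs` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def merge_sccs(m1, m2):
    """Combine sub-graph SCC information with the SCCs of G.

    m1: list of (page, supernode) pairs (assumed layout of M1).
    m2: list of (supernode, final_component) pairs (assumed layout of M2).
    Returns a dict mapping each final component to its sorted pages.
    """
    by_super = sorted((s, p) for p, s in m1)  # the O(n log n) sort of M1
    keys = [s for s, _ in by_super]
    result = defaultdict(list)
    for supernode, final in m2:
        # Binary search for the run of pages that belong to this super-node.
        lo = bisect_left(keys, supernode)
        hi = bisect_right(keys, supernode)
        result[final].extend(p for _, p in by_super[lo:hi])
    return {comp: sorted(pages) for comp, pages in result.items()}
```

For example, with pages p1 and p2 in super-node s1, p3 in s2, and both super-nodes in the same final component, the three pages are reported together.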
