Web Communities - PowerPoint PPT Presentation

About This Presentation
Slides: 34
Provided by: saim7

Transcript and Presenter's Notes

Title: Web Communities


1
Web Communities
  • Prasanna Desikan
  • (06/13/2002)

2
Definition
  • Web community
  • Groups of individuals who share common interests,
    together with the web pages most popular among
    them.
  • Web page collections with a shared topic.

3
Types of Communities
  • Explicitly defined.
  • Communities that manifest themselves as
    newsgroups or as resource collections on
    directories such as Yahoo!
  • Implicitly defined.
  • Communities that result from the nature of
    content creation on the web.

4
Terms and Definitions
  • Directed Bipartite Graph: A graph whose node
    set can be partitioned into two sets F and C
    such that every directed edge in the graph goes
    from a node u in F to a node v in C.

5
Terms and Definitions
  • Complete Bipartite Graph: A bipartite graph that
    contains all possible edges between a vertex of F
    and a vertex of C.
  • Core: A complete bipartite sub-graph with at
    least i nodes from F and at least j nodes from C.
  • In the web world, the i pages that contain the
    links are referred to as fans, and the j pages
    that are referenced are called centers.
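
A minimal sketch of the core definition above, assuming the web graph is given as a list of (source, target) link pairs. The function name is_core and the page labels are illustrative only and do not come from the slides.

```python
from itertools import product


def is_core(edges, fans, centers):
    """Return True if every fan links to every center.

    That is, the fans and centers induce a complete bipartite
    sub-graph with all edges directed from fans to centers.
    """
    edge_set = set(edges)
    return all((f, c) in edge_set for f, c in product(fans, centers))


# Hypothetical example: f1 and f2 both link to c1 and c2; f3 links only to c1.
example_edges = [
    ("f1", "c1"), ("f1", "c2"),
    ("f2", "c1"), ("f2", "c2"),
    ("f3", "c1"),
]
two_by_two = is_core(example_edges, {"f1", "f2"}, {"c1", "c2"})
three_by_two = is_core(example_edges, {"f1", "f2", "f3"}, {"c1", "c2"})
```

Here two_by_two is True (a core with i = 2 fans and j = 2 centers), while three_by_two is False because f3 does not link to c2.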

6
Inferring Web Communities From Link Topology
  • A community is a core of central authoritative
    pages linked together by hub pages.
  • Identify communities corresponding to the
    principal and non-principal eigenvectors
    discovered by HITS.
  • For communities on broad topics, the grouping of
    pages discovered is relatively independent of the
    exact choice of root set.
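
As a rough illustration of the hub and authority idea, the sketch below runs the standard HITS power iteration on a small link list. It only approximates the principal eigenvector; extracting the non-principal eigenvectors mentioned above would need a full eigen-decomposition, which is not shown. The function name, the iteration count, and the sample pages are assumptions made for this example.

```python
import math


def hits(edges, iterations=50):
    """Approximate principal hub and authority scores by power iteration."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize both vectors to unit length to keep values bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
hub_scores, auth_scores = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
```

In this toy graph a1 ends up with the highest authority score and h1 with the highest hub score.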

7
Inferring Web Communities From Link Topology
  • Findings on Structure of Communities.
  • Robustness: For broad topics, HITS produces
    stable, robust communities.
  • Topic Generalization: HITS tends to generalize
    topics that are not broad.
  • The query "Michael Jordan" produces links to
    pages on Michael Jordan and his team.
  • The query "Dennis Ritchie" produces links that
    refer to the C programming language.

8
Inferring Web Communities From Link Topology
  • Other Generalization: HITS tends to converge on
    topics with greater density of linkage.
  • E.g., for a query on linguistics, the top
    authorities focus on the sub-topic of
    computational linguistics because of its
    greater density of linkage on the web.
  • Temporal Issues: To obtain the long-term core
    of a topic, we can superimpose the results of
    HITS runs on the same topic, spaced several
    months apart.

9
Trawling the Web for Emerging Web Communities
  • Trawling: Systematic enumeration of emerging
    communities from a web crawl.
  • Scan through a web crawl and identify all
    instances of graph structures that are indicative
    signatures of communities.

10
Trawling the Web for Emerging Web Communities
  • Data Source: A copy of the web from Alexa.
  • Pre-processing data.
  • Identify potential fan pages (pages that have
    links to at least six different websites). Out
    of 200 million pages, around 24 million were
    extracted.
  • Eliminate mirrors (out of 24 million, this removed
    around 60% of pages).
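
The fan filter can be sketched as follows. This is only an illustration: it assumes that a "website" is identified by the URL hostname and that links back to the page's own host do not count, neither of which the slides specify. The function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_potential_fan(page_url, out_links, min_sites=6):
    """Return True if the page links to at least min_sites distinct hosts."""
    own_host = urlparse(page_url).hostname
    sites = {urlparse(link).hostname for link in out_links}
    sites.discard(own_host)  # assumption: self-links are ignored
    sites.discard(None)      # relative or malformed links have no hostname
    return len(sites) >= min_sites


# Hypothetical page linking to six different hosts.
links = [f"http://site{n}.example/index.html" for n in range(1, 7)]
fan_flag = is_potential_fan("http://fan.example/list.html", links)
```

With six distinct hosts, fan_flag is True; a page whose links all point to one host would be rejected however many links it has.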

11
Trawling the Web for Emerging Web Communities
  • Prune by in-degree.
  • Eliminate all pages that have an in-degree
    greater than a threshold value k. k is set to 50
    in the experiments.
  • Iterative pruning.
  • When looking for (i,j) cores, any potential fan
    with out-degree smaller than j can be pruned and
    the corresponding edges deleted from the graph.
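
The two pruning steps above might look like this in code. The sketch assumes the graph is an in-memory set of (fan, center) pairs. It also prunes centers with in-degree smaller than i, the symmetric counterpart of the fan rule; that step is an assumption of this example, since the slide states only the fan rule. The function name prune is illustrative.

```python
from collections import Counter


def prune(edges, i, j, max_in_degree=50):
    """Apply in-degree pruning, then iterative pruning for (i, j) cores."""
    edges = set(edges)

    # Prune by in-degree: drop links into very popular pages.
    in_deg = Counter(v for _, v in edges)
    edges = {(u, v) for u, v in edges if in_deg[v] <= max_in_degree}

    # Iterative pruning: repeat until no more edges can be removed.
    while True:
        out_deg = Counter(u for u, _ in edges)
        in_deg = Counter(v for _, v in edges)
        kept = {
            (u, v)
            for u, v in edges
            # Fan rule from the slide, plus the assumed symmetric center rule.
            if out_deg[u] >= j and in_deg[v] >= i
        }
        if kept == edges:
            return kept
        edges = kept


# Hypothetical graph: a (2, 2) core plus one stray fan f3.
graph = [
    ("f1", "c1"), ("f1", "c2"),
    ("f2", "c1"), ("f2", "c2"),
    ("f3", "c1"),
]
remaining = prune(graph, i=2, j=2)
```

The stray fan f3 has out-degree 1, so its edge is removed and only the four edges of the (2, 2) core remain.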

12
Trawling the Web for Emerging Web Communities
  • Inclusion-exclusion pruning.
  • Let c1, c2, ..., cj be the centers adjacent to a
    fan x.
  • N(ct): the neighborhood of ct, i.e. the set of
    fans that point to ct.
  • x is part of a core if and only if the
    intersection of the sets N(ct) has size at least i.
  • Filter nepotistic cores.
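
A small sketch of the inclusion-exclusion test, assuming fan x has exactly j out-links so that its adjacent centers are the only candidate center set. The function name and the sample graph are hypothetical.

```python
from collections import defaultdict


def fan_in_core(x, edges, i):
    """Check whether at least i fans point to every center adjacent to x."""
    centers = {v for u, v in edges if u == x}
    if not centers:
        return False
    # N(c): the set of fans that point to center c.
    neighborhood = defaultdict(set)
    for u, v in edges:
        neighborhood[v].add(u)
    common_fans = set.intersection(*(neighborhood[c] for c in centers))
    return len(common_fans) >= i


# Hypothetical graph: f1 and f2 point to c1 and c2; f3 points to c1 only.
sample = [
    ("f1", "c1"), ("f1", "c2"),
    ("f2", "c1"), ("f2", "c2"),
    ("f3", "c1"),
]
in_two_core = fan_in_core("f1", sample, i=2)
in_three_core = fan_in_core("f1", sample, i=3)
```

For f1, the intersection of N(c1) and N(c2) is {f1, f2}, so in_two_core is True and in_three_core is False.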

13
Trawling the Web for Emerging Web Communities
  • Evaluation of Communities.
  • Fossilization: 30% of communities were
    fossilized.
  • A fossil is a community core not all of whose
    fans exist on the web today.
  • Reliability: Only 4% of the trawled cores were
    coincidental, i.e. a collection of fan pages
    without any cogent theme unifying them.

14
Trawling the Web for Emerging Web Communities
  • Quality: 56% were not in Yahoo as constructed
    from the crawl, and 29% were not in Yahoo at the
    time of the paper.
  • This indicates that trawling identifies emerging
    communities.

15
Self Organization and Identification of Web Communities
  • A web community is defined as a collection of web
    pages such that each member page has more
    hyperlinks (in either direction) within the
    community than outside of the community.
  • Approach: Maximum flow / minimum cut framework.
  • Benefits: Focused crawling, automatic population
    of portal categories.
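
The community definition above can be checked directly on a small graph. This sketch assumes links are given as (source, target) pairs and counts a link in either direction; the function name and the two-triangle example are illustrative.

```python
def is_community(members, edges):
    """Return True if every member has more links inside than outside."""
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in edges:
            if u == v or page not in (u, v):
                continue  # ignore self-links and unrelated links
            other = v if u == page else u
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Hypothetical graph: two triangles joined by the single link c -> d.
triangles = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
left_ok = is_community({"a", "b", "c"}, triangles)
pair_ok = is_community({"a", "b"}, triangles)
```

The left triangle satisfies the definition (page c has two links inside and one outside), while the pair {a, b} does not, because each of its pages has as many links outside as inside.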

16
A Simple Community Identification Example
Figure: Maximum flow methods will separate the
two subgraphs with any choice of s and t that
has s in the left subgraph and t in the right
subgraph, removing the three dashed links.

17
Approximate Flow Community

18
Exact Flow Community

19
Exact Flow Community
  • An artificial source s is added, with infinite
    capacity edges routed to all seed vertices in S.
  • Each pre-existing edge is made bi-directional and
    rescaled to a constant value k.

20
Exact Flow Community
  • All vertices except the source, sink, and seed
    vertices are routed to the artificial sink with
    unit capacity.
  • A residual flow graph is produced by a maximum
    flow procedure.
  • All vertices accessible from s through non-zero
    positive edges form the desired result and
    satisfy our definition of a community.
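
The construction described on these slides can be sketched with a plain breadth-first augmenting-path maximum flow (Edmonds-Karp). This is an illustrative implementation, not the authors' code: the node names "_source" and "_sink", the function name, and the choice of k in the example are assumptions.

```python
from collections import defaultdict, deque


def flow_community(edges, seeds, k):
    """Return the vertices reachable from the artificial source after max flow."""
    source, sink = "_source", "_sink"
    seeds = set(seeds)
    residual = defaultdict(lambda: defaultdict(float))
    nodes = {n for edge in edges for n in edge}
    for u, v in edges:
        # Each pre-existing edge becomes bidirectional with capacity k.
        residual[u][v] = k
        residual[v][u] = k
    for seed in seeds:
        residual[source][seed] = float("inf")
    for n in nodes - seeds:
        residual[n][sink] = 1.0  # unit capacity to the artificial sink

    def reachable_from_source():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, capacity in list(residual[u].items()):
                if capacity > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path is left: the source side is the community.
            return set(parent) - {source}
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push


# Hypothetical graph: two triangles joined by one link, seeded with a and b.
web = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
community = flow_community(web, seeds={"a", "b"}, k=2)
```

In this example the minimum cut removes the link between c and d together with the sink edge of c, so the community is {a, b, c}. The result depends on k: a larger k makes it more expensive to cut ordinary links and tends to produce larger communities.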

21
Sample Results From Community Identification
The scores are the total number of inbound and
outbound links that a web page has to other pages
that are also in the community.
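
A minimal sketch of this score, assuming links are (source, target) pairs; the function name and sample data are illustrative.

```python
def community_scores(members, edges):
    """Count, per member, its inbound plus outbound links to other members."""
    members = set(members)
    scores = {m: 0 for m in members}
    for u, v in edges:
        if u != v and u in members and v in members:
            scores[u] += 1  # outbound link of u inside the community
            scores[v] += 1  # inbound link of v inside the community
    return scores


# Hypothetical links; page x is outside the community {a, b, c}.
link_scores = community_scores(
    {"a", "b", "c"},
    [("a", "b"), ("b", "a"), ("a", "c"), ("c", "x")],
)
```

Here page a scores 3, page b scores 2, and page c scores 1; the link from c to x is ignored because x is not a member.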
22
Characterization of Communities
Table 3: The fifteen most significant text
features for each community, sorted in descending
order of the Kullback-Leibler metric.
23
Discovering Seeds of New Interest Spread From Premature Pages
  • A method for discovering topics that stimulate
    communities of people into earnest communication
    about the topic's meaning and grow into a trend of
    popular interest.
  • A community is a group of people sharing some value.

24
Agora Method on Links
  • Archive page - The page of highest rank according
    to Google in a community.
  • Agora pages - Pages that are linked from multiple
    archive pages but are not in any community
    themselves. They are taken as novel topics
    attracting multiple communities and are called
    agora-topic pages.

25
Agora Method on Links
  • Step 1: A query representing the user's interest
    domain is entered into a search engine (Google
    here, obtaining 10^5 to 10^6 pages).
  • Step 2: Communities are extracted from the pages
    obtained in Step 1, and archive pages are selected
    from the communities.

26
Agora Method on Links
  • Step 3: Pages that are not in the communities but
    are linked from multiple archive pages are obtained
    as agora pages. With all results obtained, the
    archive pages (black nodes), agora pages (red
    nodes), and the links between them are visualized
    as in Fig. 1.
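
Step 3 can be sketched as a simple filter over the link list. The sketch assumes the communities and their archive pages have already been found in Step 2; the function name, the min_archives parameter (set to 2 to mean "multiple"), and the page labels are hypothetical.

```python
from collections import defaultdict


def agora_pages(archive_pages, community_pages, edges, min_archives=2):
    """Find pages outside all communities that several archive pages link to."""
    archive_pages = set(archive_pages)
    in_some_community = set(community_pages) | archive_pages
    citing_archives = defaultdict(set)
    for u, v in edges:
        if u in archive_pages and v not in in_some_community:
            citing_archives[v].add(u)
    return {
        page
        for page, archives in citing_archives.items()
        if len(archives) >= min_archives
    }


# Hypothetical data: A1 and A2 are archive pages of two communities.
found = agora_pages(
    archive_pages={"A1", "A2"},
    community_pages={"A1", "m1", "A2", "m2"},
    edges=[("A1", "x"), ("A2", "x"), ("A1", "y"), ("A1", "m1"), ("A2", "m1")],
)
```

Only x qualifies: y is linked from a single archive page, and m1 already belongs to a community.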

27
Fig.: The output of Agora on Links for the domain
query "Human Genome"
28
Evaluation
  • Stage 1. An interest domain is fixed, a group of
    people relevant to the domain is gathered, and the
    domain name is input as a query (e.g.
    information retrieval).
  • Stage 2. The output graph, containing both real
    and fake red nodes presented as if all of them had
    been obtained as agora pages, is shown to the
    subjects. That is, some red nodes that were not
    really obtained were added with red links to black
    archive nodes. Subjects reported individual
    impressions and exchanged ideas in the group.

29
Sample Results
  • Institutes in red were the ones that hold data
    sources on human or mouse genomes; it is useful
    for researchers at other institutes to look at
    those data.
  • The subjects rated 8 of the 12 red nodes as
    interesting for thinking about future work.

30
References
  • [1] D. Gibson, J. Kleinberg, and P. Raghavan.
    Inferring web communities from link topology. In
    Proc. 9th ACM Conference on Hypertext and
    Hypermedia.
  • [2] Ravi Kumar, Prabhakar Raghavan, Sridhar
    Rajagopalan, and Andrew Tomkins. Trawling the web
    for emerging cyber-communities. In Proc. 8th Int.
    World Wide Web Conf., 1999.
  • [3] Gary William Flake, Steve Lawrence, and C. Lee
    Giles. Efficient Identification of Web
    Communities. Sixth ACM SIGKDD International
    Conference on Knowledge Discovery and Data
    Mining.
  • [4] Gary William Flake, Steve Lawrence, C. Lee
    Giles, and Frans M. Coetzee. Self-Organization and
    Identification of Web Communities. IEEE Computer,
    35(3), 66-71, 2002.

31
References
  • [5] Naohiro Matsumura, Yukio Ohsawa, and Mitsuru
    Ishizuka. Discovering Seeds of New Interest
    Spread from Premature Pages Cited by Multiple
    Communities. 2001 International Conference on Web
    Intelligence.

32
Kullback-Leibler Metric
  • Let p and q be probability distributions with
    support X and Y respectively. The relative
    entropy or Kullback-Leibler distance between two
    probability distributions p and q is defined as
    D(p || q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}
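
As a rough illustration, the relative entropy of two discrete distributions can be computed as below. The sketch assumes each distribution is a dictionary mapping outcomes to probabilities and uses the natural logarithm; the function name and the sample distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Hypothetical word distributions inside and outside a community.
inside = {"genome": 0.5, "protein": 0.5}
outside = {"genome": 0.25, "protein": 0.75}
divergence = kl_divergence(inside, outside)
```

The value is zero when the two distributions are identical and grows as they diverge. Note that it is not symmetric: D(p || q) and D(q || p) generally differ.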
