Title: Web Communities
1 Web Communities
- Prasanna Desikan
- (06/13/2002)
2 Definition
- Web community
- Groups of individuals who share common interests,
together with the web pages most popular among
them.
- Web page collections with a shared topic.
3 Types of Communities
- Explicitly defined.
- Communities that manifest themselves as
newsgroups or as resource collections on
directories such as Yahoo!
- Implicitly defined.
- Communities that result from the nature of
content creation on the web.
4 Terms and Definitions
- Directed Bipartite Graph: A graph whose node
set can be partitioned into two sets F and C,
such that every directed edge in the graph is
from a node u in F to a node v in C.
5 Terms and Definitions
- Complete Bipartite Graph: A bipartite graph that
contains all possible edges between a vertex of F
and a vertex of C.
- Core: A complete bipartite sub-graph with at
least i nodes from F and at least j nodes from C.
- In the web world, the i pages that contain the
links are referred to as fans and the j pages
that are referenced are referred to as centers.
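The core definition can be illustrated with a small sketch. It assumes the link graph is stored as a dictionary mapping each fan to the set of centers it links to; the names `is_core` and `find_cores` are hypothetical, and the brute-force enumeration (of cores with exactly i fans and j centers) is only practical for toy graphs.

```python
from itertools import combinations

def is_core(links, fans, centers):
    """True if every fan links to every center (complete bipartite)."""
    return all(c in links.get(f, set()) for f in fans for c in centers)

def find_cores(links, i, j):
    """Brute-force list of cores with exactly i fans and j centers."""
    cores = []
    for fans in combinations(sorted(links), i):
        # A usable center must be linked from all of the chosen fans.
        common = set.intersection(*(links[f] for f in fans))
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Toy graph (illustrative data): fan -> set of centers it links to.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2"},
}
print(is_core(links, ["f1", "f2"], ["c1", "c2", "c3"]))
print(find_cores(links, 3, 2))
```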
6 Inferring Web Communities From Link Topology
- Community is a core of central authoritative
pages linked together by hub pages.
- Identify communities corresponding to the
principal and non-principal eigenvectors
discovered by HITS.
- For communities on broad topics the grouping of
pages discovered is relatively independent of the
exact choice of root set.
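The hub and authority scores that HITS converges to can be sketched with a plain power iteration. This is a minimal illustration on an assumed edge list, not the authors' implementation: it recovers only the principal eigenvector (the non-principal eigenvectors would need an additional decomposition step), and the function name `hits` and the toy pages are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """Power iteration for hub and authority scores on a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        # Normalize both vectors to unit Euclidean length.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
        auth, hub = new_auth, new_hub
    return hub, auth

# Toy graph (illustrative data): three hub pages pointing at two authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
print(max(auth, key=auth.get))  # the page with the highest authority score
```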
7 Inferring Web Communities From Link Topology
- Findings on Structure of Communities.
- Robustness: For broad topics, HITS produces
stable, robust communities.
- Topic Generalization: HITS tends to generalize
topics that are not broad.
- A query on Michael Jordan produces links to
pages on MJ and his team.
- A query on Dennis Ritchie produces links that
reference the C Programming Language.
8 Inferring Web Communities From Link Topology
- Other Generalization: HITS tends to converge on
topics with greater density of linkage.
- E.g., for a query on linguistics, the top
authorities are focused on the sub-topic
computational linguistics because of its
greater density of linkage on the web.
- Temporal Issues: To obtain the long-term core
of a topic, we can superimpose the results of
HITS runs on the same topic, spaced several
months apart.
9 Trawling the Web for Emerging Web Communities
- Trawling: Systematic enumeration of emerging
communities from a web crawl.
- Scan through a web crawl and identify all
instances of graph structures that are indicative
signatures of communities.
10 Trawling the Web for Emerging Web Communities
- Data Source: A copy of the web from Alexa.
- Pre-processing data.
- Identify potential fan pages (a page that has
links to at least six different websites). Out
of 200 million pages, around 24 million were
extracted.
- Eliminate mirrors (this removed around 60% of
the 24 million pages).
11 Trawling the Web for Emerging Web Communities
- Prune by in-degree.
- Eliminate all pages that have an in-degree
greater than a threshold value k. k is set to 50
in the experiments.
- Iterative pruning.
- When looking for (i,j) cores, any potential fan
with out-degree smaller than j can be pruned and
the corresponding edges deleted from the graph.
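The two pruning rules can be sketched together as follows. The sketch assumes the graph is a dictionary from fan to the set of centers it links to, and the function name `prune` is hypothetical. The symmetric rule (dropping centers with in-degree smaller than i) is not stated on the slide and is included here as an assumption, since it follows from the same reasoning.

```python
def count_in_degrees(graph):
    """Number of fans pointing to each center."""
    in_deg = {}
    for centers in graph.values():
        for c in centers:
            in_deg[c] = in_deg.get(c, 0) + 1
    return in_deg

def prune(links, i, j, max_in_degree=50):
    """Return a pruned copy of links (fan -> set of centers)."""
    # In-degree pruning: drop centers with in-degree above the threshold.
    in_deg = count_in_degrees(links)
    graph = {f: {c for c in cs if in_deg[c] <= max_in_degree}
             for f, cs in links.items()}
    changed = True
    while changed:
        changed = False
        # A fan with out-degree below j cannot belong to an (i, j) core.
        for f in [f for f, cs in graph.items() if len(cs) < j]:
            del graph[f]
            changed = True
        # Assumed symmetric rule: a center needs at least i fans.
        weak = {c for c, d in count_in_degrees(graph).items() if d < i}
        if weak:
            for f in graph:
                graph[f] -= weak
            changed = True
    return graph

# Toy graph (illustrative data): f4 and f5 cannot be in a (3, 3) core.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "x"},
    "f5": {"y"},
}
pruned = prune(links, 3, 3)
print(sorted(pruned))
```

Removing one fan can lower a center's in-degree, which in turn can lower another fan's out-degree, so the loop repeats until nothing changes.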
12 Trawling the Web for Emerging Web Communities
- Inclusion-exclusion pruning.
- Let c1, c2, ..., cj be the centers adjacent to a
fan x.
- N(ct): the neighborhood of ct, the set of fans
that point to ct.
- x is part of a core if and only if the
intersection of the sets N(ct) has size at least i.
- Filter nepotistic cores.
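The inclusion-exclusion test can be written in a few lines. This sketch assumes the same fan-to-centers dictionary as before and a fan x that has exactly j centers; the function name `include_or_exclude` and the toy data are hypothetical.

```python
def include_or_exclude(links, x, i):
    """Return (fans, centers) if fan x lies in a core, otherwise None."""
    centers = links[x]
    if not centers:
        return None
    # N(c): the set of fans that point to center c.
    neighborhoods = [{f for f, cs in links.items() if c in cs} for c in centers]
    # Fans that point to every center adjacent to x (x itself is among them).
    common_fans = set.intersection(*neighborhoods)
    if len(common_fans) >= i:
        return common_fans, set(centers)
    return None

# Toy graph (illustrative data) with j = 3 centers per fan.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "c2", "c4"},
}
print(include_or_exclude(links, "f1", 3))
print(include_or_exclude(links, "f4", 3))
```

Either outcome settles the status of x, so each test removes one fan from further consideration.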
13 Trawling the Web for Emerging Web Communities
- Evaluation of Communities.
- Fossilization: 30% of communities were
fossilized.
- A fossil is a community core not all of whose
fans exist on the web today.
- Reliability: Only 4% of the trawled cores were
coincidental, i.e., a collection of fan pages
without any cogent theme unifying them.
14 Trawling the Web for Emerging Web Communities
- Quality: 56% were not in Yahoo as constructed
from the crawl, and 29% were not in Yahoo at the
time of the paper.
- This indicates that trawling identifies emerging
communities.
15 Self Organization and Identification of Web
Communities
- Web community is defined as a collection of web
pages such that each member page has more
hyperlinks (in either direction) within the
community than outside of the community.
- Approach: Maximal Flow Minimal Cut framework.
- Benefits: Focused crawling, automatic population
of portal categories.
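The community definition can be checked directly on a candidate set of pages. This is a minimal sketch assuming the web graph is given as a list of directed (source, target) links; the function name `is_community` and the toy graph are hypothetical.

```python
def is_community(edges, members):
    """True if every member has more links inside the set than outside it."""
    members = set(members)
    inside = {m: 0 for m in members}
    outside = {m: 0 for m in members}
    for u, v in edges:
        # Links count in either direction, so look at both endpoints.
        for page, other in ((u, v), (v, u)):
            if page in members:
                if other in members:
                    inside[page] += 1
                else:
                    outside[page] += 1
    return all(inside[m] > outside[m] for m in members)

# Toy graph (illustrative data): two triangles joined by the link c -> x.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "x"),
         ("x", "y"), ("y", "z"), ("z", "x")]
print(is_community(edges, {"a", "b", "c"}))
print(is_community(edges, {"a", "b"}))
```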
16 A Simple Community Identification Example
Figure: Maximum flow methods will separate the
two subgraphs for any choice of s and t that
has s in the left subgraph and t in the right
subgraph, removing the three dashed links.
17 Approximate Flow Community
18 Exact Flow Community
19 Exact Flow Community
- An artificial source s is added with infinite
capacity edges routed to all seed vertices in S.
- Each pre-existing edge is made bi-directional and
rescaled to a constant value k.
20 Exact Flow Community
- All vertices except the source, sink, and seed
vertices are routed to the artificial sink with
unit capacity.
- A residual flow graph is produced by a maximum
flow procedure.
- All vertices accessible from s through non-zero
positive edges form the desired result and
satisfy our definition of a community.
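The construction can be sketched end to end with a small breadth-first augmenting-path (Edmonds-Karp) maximum flow. The slides do not say which maximum flow procedure is used, so that choice is an assumption, as are the function name `flow_community`, the node labels `_source` and `_sink`, and the toy graph.

```python
from collections import defaultdict, deque

def flow_community(edges, seeds, k):
    """Vertices reachable from the artificial source in the residual graph."""
    s, t = "_source", "_sink"
    seeds = set(seeds)
    cap = defaultdict(lambda: defaultdict(float))
    # Each pre-existing edge becomes bidirectional with capacity k.
    for u, v in edges:
        cap[u][v] = k
        cap[v][u] = k
    # The artificial source feeds every seed with infinite capacity.
    for seed in seeds:
        cap[s][seed] = float("inf")
    # Every non-seed vertex drains to the artificial sink with unit capacity.
    for v in {n for edge in edges for n in edge} - seeds:
        cap[v][t] = 1.0

    def reachable():
        """Breadth-first search over edges with positive residual capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if t not in parent:
            break  # no augmenting path is left, so the flow is maximum
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
    return set(parent) - {s}

# Toy graph (illustrative data): two 4-cliques joined by the link a4 - b1.
a = ["a1", "a2", "a3", "a4"]
b = ["b1", "b2", "b3", "b4"]
edges = [(x, y) for grp in (a, b) for n, x in enumerate(grp) for y in grp[n + 1:]]
edges.append(("a4", "b1"))
print(sorted(flow_community(edges, {"a1"}, k=2)))
```

With these capacities the cheapest cut severs the single bridge link plus the sink edges of the seed's own clique, so the second clique is left out of the result.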
21 Sample Results From Community Identification
The scores are the total number of inbound and
outbound links that a web page has to other pages
that are also in the community.
22 Characterization of Communities
Table 3: The fifteen most significant text
features for each community, sorted in descending
order of the Kullback-Leibler metric.
23 Discovering Seeds of New Interest Spread From
Premature Pages.
- A method for discovering topics that stimulate
communities of people into earnest communication
about the topic's meaning, and that grow into a
trend of popular interest.
- Community is a group of people sharing some value.
24 Agora Method on Links
- Archive page - Page of highest rank according to
Google in a community.
- Agora Pages - Pages that are linked from multiple
archive-pages but are not in any community
themselves are taken as novel topics attracting
multiple communities, called agora-topic pages.
25 Agora Method on Links
- Step 1: A query representing the user's interest
domain is entered into a search engine (Google
here, obtaining 10^5 to 10^6 pages).
- Step 2: Communities of the pages obtained in
Step 1 are identified, and archive-pages are
selected from the communities.
26 Agora Method on Links
- Step 3: Pages not in the communities but linked
from multiple archive-pages are obtained as
agora-pages. With all results obtained at this
point, archive-pages (black nodes), agora-pages
(red nodes), and the links between them are
visualized as in Fig. 1.
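Steps 2 and 3 can be sketched once the communities are known. The sketch below assumes the communities are given as sets of pages, that a numeric score per page stands in for the Google rank (higher is better), and that outgoing links are stored per page; the function name `agora_pages`, the parameter `min_archives`, and the toy data are all hypothetical.

```python
def agora_pages(communities, rank, links, min_archives=2):
    """Return (archive pages, agora pages) for the given communities."""
    # Archive page: the highest-ranked member of each community.
    archives = [max(sorted(c), key=lambda p: rank.get(p, 0.0))
                for c in communities]
    in_community = set().union(*communities)
    # Count how many archive pages link to each page outside all communities.
    counts = {}
    for archive in archives:
        for target in links.get(archive, set()):
            if target not in in_community:
                counts[target] = counts.get(target, 0) + 1
    agora = {p for p, n in counts.items() if n >= min_archives}
    return archives, agora

# Toy data (illustrative): two communities and three outside pages.
communities = [{"a1", "a2"}, {"b1", "b2"}]
rank = {"a1": 0.9, "a2": 0.1, "b1": 0.2, "b2": 0.8}
links = {"a1": {"a2", "n1", "n2"}, "b2": {"n1", "b1"}, "a2": {"n3"}}
archives, agora = agora_pages(communities, rank, links)
print(archives, agora)
```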
27 Fig: The output of Agora on Links for the domain
query "Human Genome"
28 Evaluation
- Stage 1. An interest domain is fixed, a group of
people relevant to the domain is gathered, and the
domain name is input as a query (e.g.,
information retrieval).
- Stage 2. The output graph, with fake red nodes
added alongside the real ones as if they had all
been obtained as agora-pages, is shown to the
subjects. That is, some red nodes that were not
really obtained were added with red links to
black archive-nodes. Subjects reported individual
impressions and exchanged ideas in the group.
29 Sample Results
- Institutes in red were the ones that have data
sources of human or mouse genomes, and it is
useful for researchers in other institutes to
look at those data.
- 8 of the 12 red nodes were termed interesting
for thinking about future work by the subjects.
30 References
- [1] D. Gibson, J. Kleinberg, and P. Raghavan.
Inferring web communities from link topology. In
Proc. 9th ACM Conference on Hypertext and
Hypermedia.
- [2] Ravi Kumar, Prabhakar Raghavan, Sridhar
Rajagopalan, and Andrew Tomkins. Trawling the web
for emerging cyber-communities. In Proc. 8th Int.
World Wide Web Conf., 1999.
- [3] Gary William Flake, Steve Lawrence, C. Lee
Giles. Efficient Identification of Web
Communities. Sixth ACM SIGKDD International
Conference on Knowledge Discovery and Data
Mining.
- [4] Gary William Flake, Steve Lawrence, C. Lee
Giles, Frans M. Coetzee. Self-Organization and
Identification of Web Communities. IEEE Computer,
35(3), 66-71, 2002.
31 References
- [5] Naohiro Matsumura, Yukio Ohsawa, Mitsuru
Ishizuka. Discovering Seeds of New Interest
Spread from Premature Pages Cited by Multiple
Communities. 2001 International Conference on Web
Intelligence.
32 Kullback-Leibler Metric
- Let p and q be probability distributions with
support X and Y respectively. The relative
entropy or Kullback-Leibler distance between two
probability distributions p and q is defined as
$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
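For discrete distributions the relative entropy is the sum over x of p(x) log(p(x)/q(x)), which can be computed directly. This is a minimal sketch assuming each distribution is a dictionary from outcome to probability and natural logarithms are used; the function name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

# Toy distributions (illustrative data).
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that the quantity is not symmetric and so is not a distance in the strict sense.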