Title: Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies
1Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies
Akshay Java, Anupam Joshi, Tim Finin
University of Maryland, Baltimore County
KDD 2008 Workshop on Web Mining and Web Usage Analysis
2Outline
- Introduction
- Community Detection
- Clustering Approach
- Spectral Approach
- Co-Clustering
- Simultaneous Clustering
- Evaluation
- Future Work
- Conclusions
4Social Media
Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other. (Wikipedia)
5Social Media Graphs
G = (V, E): a graph describing the relationships between different entities (people, documents, etc.)
G = <V, T, R>: a tripartite graph that expresses how entities tag some resource
6What is a Community?
A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it.
[Figure panels: Political Blogs; Twitter Network; Facebook Network]
8Community Detection: Clustering Approach
- Clustering Approach
- Agglomerative/Hierarchical
- Topological overlap: the similarity of nodes i and j is measured in terms of the number of nodes that both i and j link to (Ravasz et al.)
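A minimal sketch of the topological overlap measure described above, assuming an undirected graph stored as a dictionary of neighbour sets. Counting a direct i-j link and dividing by the smaller degree are assumptions made here for illustration; the slide only states that similarity grows with the number of shared neighbours. The function name and the toy graph are hypothetical.

```python
def topological_overlap(adj):
    """adj maps each node to the set of its neighbours (undirected, no self-loops)."""
    nodes = sorted(adj)
    overlap = {}
    for idx, i in enumerate(nodes):
        for j in nodes[idx + 1:]:
            common = len(adj[i] & adj[j])          # nodes that both i and j link to
            direct = 1 if j in adj[i] else 0       # assumption: a direct link also counts
            denom = min(len(adj[i]), len(adj[j]))  # assumption: normalise by smaller degree
            overlap[(i, j)] = (common + direct) / denom if denom else 0.0
    return overlap


# Toy graph: two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
scores = topological_overlap(graph)
```

Pairs inside a triangle score higher than pairs that straddle the bridge, so an agglomerative procedure that merges the most similar pairs first would recover the two triangles before joining them.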
9Community Detection: Clustering Approach
- Clustering Approach
- Agglomerative/Hierarchical
- Divisive/Partition based
- Remove the edges that have the highest edge betweenness centrality (Girvan-Newman algorithm)
[Figure: Political Books]
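The divisive step can be sketched as follows, assuming an unweighted, undirected graph without self-loops stored as a dictionary of neighbour sets. This is an illustrative pure-Python version (Brandes-style betweenness, one split at a time), not the implementation used for the slides; all function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths passing through each undirected edge."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every path was counted once from each of its two endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains one component."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph: two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
parts = girvan_newman_split(graph)
```

On the toy graph the bridge c-d carries every shortest path between the two triangles, so it is removed first and the two triangles are returned as communities.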
10Community Detection: Spectral Approach
Graph Laplacian
- The graph can be partitioned using the eigenspectrum of the Laplacian (Shi and Malik).
- The eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian is the Fiedler vector.
- The graph can be recursively partitioned using the sign of the values in its Fiedler vector.
Normalized Cuts
Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)
- cut(A, B): cost of edges deleted to disconnect the graph
- assoc(B, V): total cost of all edges that start from B
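A minimal NumPy sketch of a single spectral bipartition and of the normalized cut value it achieves. It assumes a connected graph given as a symmetric weight matrix W, and for simplicity it uses the unnormalized Laplacian L = D - W, whereas Shi and Malik solve the generalized problem (D - W) y = λ D y. The function names and the toy graph are illustrative assumptions.

```python
import numpy as np


def fiedler_partition(W):
    """Boolean side assignment from the sign of the Fiedler vector."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian D - W
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues come back in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of the second smallest eigenvalue
    return fiedler >= 0


def ncut(W, in_a):
    """Normalized cut value of the partition (A, B) with A given by a boolean mask."""
    W = np.asarray(W, dtype=float)
    in_a = np.asarray(in_a, dtype=bool)
    cut = W[in_a][:, ~in_a].sum()         # weight of edges crossing the partition
    assoc_a = W[in_a].sum()               # weight of all edges that start from A
    assoc_b = W[~in_a].sum()              # weight of all edges that start from B
    return cut / assoc_a + cut / assoc_b


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0

mask = fiedler_partition(W)
value = ncut(W, mask)
```

Applying `fiedler_partition` again to each side gives the recursive partitioning described on the slide. The overall sign of an eigenvector is arbitrary, so only the grouping of nodes is meaningful, not which side is labelled True.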
11Community Detection: Co-Clustering
- Spectral graph bipartitioning
- Compute the graph Laplacian of the bipartite document-term graph, which is built from the document-by-term matrix
- (Dhillon et al.)
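A hedged sketch of spectral co-clustering on a small document-by-term matrix. It assumes the bipartite graph's adjacency is the block matrix [[0, A], [Aᵀ, 0]], where A (a name chosen here) is the document-by-term matrix, and it takes the Fiedler vector of the unnormalized Laplacian directly. Dhillon's published algorithm works with a normalized matrix and its singular vectors, so this is a simplification for illustration only.

```python
import numpy as np


def cocluster_bipartite(A):
    """Split documents and terms together; returns (doc_side, term_side) masks."""
    A = np.asarray(A, dtype=float)
    n_docs, n_terms = A.shape
    # Adjacency of the bipartite graph: documents link only to terms.
    M = np.block([
        [np.zeros((n_docs, n_docs)), A],
        [A.T, np.zeros((n_terms, n_terms))],
    ])
    L = np.diag(M.sum(axis=1)) - M        # Laplacian of the bipartite graph
    _, eigvecs = np.linalg.eigh(L)
    side = eigvecs[:, 1] >= 0             # sign of the Fiedler vector
    return side[:n_docs], side[n_docs:]


# Toy counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3,
# with one weak cross link (document 2, term 1).
A = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
    [0, 0, 4, 5],
])
doc_side, term_side = cocluster_bipartite(A)
```

Because documents and terms are vertices of the same graph, one cut assigns each term to the same cluster as the documents that use it most.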
13Social Media Graphs
[Figure labels: Links Between Nodes and Tags; Links Between Nodes; Simultaneous Cuts]
14Communities in Social Media
A community in the real world is identified in a
graph as a set of nodes that have more links
within the set than outside it and share similar
tags.
15Clustering Tags and Graphs
[Figure: block matrix with rows and columns labeled Nodes and Tags; Fiedler Vector Polarity]
β = 0 is like co-clustering; β = 1 gives equal importance to blog-blog and blog-tag links; β >> 1 is like NCut
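One way to read the role of β is sketched below. The block form [[βC, A], [Aᵀ, 0]] is an assumption chosen to match the slide's description, with C (node-node links) and A (node-tag links) as names introduced here; the slides themselves do not spell out the matrix. With β = 0 the node-node block vanishes and only the bipartite node-tag graph remains, and for large β the node-node links dominate the cut.

```python
import numpy as np


def simcut_partition(C, A, beta=1.0):
    """Jointly split nodes and tags; returns (node_side, tag_side) masks.

    C: symmetric node-by-node link matrix, A: node-by-tag matrix.
    """
    C = np.asarray(C, dtype=float)
    A = np.asarray(A, dtype=float)
    n_nodes, n_tags = A.shape
    # Assumed combined adjacency: weighted node-node links plus node-tag links.
    W = np.block([
        [beta * C, A],
        [A.T, np.zeros((n_tags, n_tags))],
    ])
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)
    side = eigvecs[:, 1] >= 0             # sign of the Fiedler vector
    return side[:n_nodes], side[n_nodes:]


# Toy data: a chain of blogs 0-1-2-3; blogs 0 and 1 share tag 0,
# blogs 2 and 3 share tag 1.
C = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    C[i, j] = C[j, i] = 1.0
A = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

node_side, tag_side = simcut_partition(C, A, beta=1.0)
```

In this toy case the tags reinforce the split between blogs {0, 1} and {2, 3}, and each tag lands on the same side as the blogs that use it.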
16Clustering Tags and Graphs
[Figure panels: Clustering Only Links; Clustering Links + Tags]
17Clustering Tags and Graphs
[Figure panels: Clustering Only Links; Clustering Links + Tags]
19Datasets
- Citeseer
- Agents, AI, DB, HCI, IR, ML
- Words used in place of tags
- Blog data
- Derived from the WWE/Buzzmetrics dataset
- Tags associated with blogs derived from del.icio.us
- For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)
- Pairwise similarity computed
- RBF kernel for Citeseer
- Cosine for blogs
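The two pairwise similarity measures named above can be sketched in plain Python. The kernel width `gamma` is an assumed parameter; the slides do not state the value that was used, and the helper names are hypothetical.

```python
import math


def rbf_similarity(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pairwise(vectors, similarity):
    """Full similarity matrix as a list of lists."""
    return [[similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical topic vectors for three blogs.
blogs = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
S = pairwise(blogs, cosine_similarity)
```

Cosine similarity ignores vector length, which suits topic proportions of blogs with different amounts of text, while the RBF kernel depends on absolute distance and therefore on the choice of `gamma`.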
20Citeseer Data
[Figure panels: Accuracy 36%; Accuracy 62%]
Higher accuracy by adding tag information
21Citeseer Data
[Figure panels: NCut; SimCut]
- SimCut results in
- Higher intra-cluster similarity
- Lower inter-cluster similarity
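Intra- and inter-cluster similarity can be summarised as two averages over a pairwise similarity matrix. The sketch below assumes a symmetric matrix `S` and one cluster label per item; the averaging scheme and the example numbers are illustrative assumptions, not the exact statistic or data behind the plots.

```python
def cluster_similarity(S, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is counted once.
            if labels[i] == labels[j]:
                intra.append(S[i][j])
            else:
                inter.append(S[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Made-up similarities for four items in two clusters.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
intra, inter = cluster_similarity(S, [0, 0, 1, 1])
```

A clustering is better in the sense used on the slide when the first value is high and the second is low.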
22Citeseer Data
[Figure panels: NCut; True; SimCut]
- Constrains cuts based on both
- Link structure
- Tags
23Blog Data
[Figure panels: NCut; SimCut]
- SimCut results in
- Higher intra-cluster similarity
- Lower inter-cluster similarity
24Blog Data
[Figure panels: NCut; SimCut]
35 Clusters
- NCut
- Few, large clusters with low intra-cluster similarity
- SimCut
- Moderate-size clusters with higher intra-cluster similarity
25Effect of Number of Tags, Clusters
Citeseer
Mutual information compares clusters to ground truth
- More tags help, to an extent
- Lower mutual information if only the graph is used
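A short sketch of mutual information between a clustering and reference labels, computed in nats from the joint label counts. Whether the reported scores were normalized is not stated on the slides, so the unnormalized form is used here as an assumption; the labels in the example are made up.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
    return mi


truth = ["AI", "AI", "DB", "DB"]   # reference categories
found = [1, 1, 0, 0]               # cluster ids; the names themselves do not matter
score = mutual_information(truth, found)
```

The score depends only on how the items are grouped, not on the cluster names, so a clustering that matches the reference up to relabeling gets the maximum value, while an unrelated clustering scores zero.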
26Effect of Number of Tags, Clusters
Blogs
Mutual information compares clusters to content-based clusters (no tags/graph)
- More tags help, to an extent
- Lower mutual information if only the graph is used
28Future Work
- Evaluating the SimCut algorithm on derived feature types like named entities, sentiments and opinions, and links to mainstream media
- For a dataset with ground truth, a comparison of graph-based, text-based and graph+tag-based clustering
- Evaluating the effect of varying β
30Conclusions
- Many social media sites allow users to tag resources
- Incorporating folksonomies in community detection can yield better results
- SimCut can be easily implemented and relates to NCut with two simultaneous objectives
- Minimize the number of node-node edges being cut
- Minimize the number of node-tag edges being cut
- Detected communities can be associated with meaningful, descriptive tags
31Thanks!
32 http://ebiquity.umbc.edu
http://socialmedia.typepad.com
33More Tags
Only Graph
SimCut
34Citeseer (Community Size, Similarity)
35Blogs (Community Size, Similarity)