Title: Graph mining techniques applied to blogs
1 Graph mining techniques applied to blogs
Mary McGlohon
Seminar on Social Media Analysis, Oct 2 2007
3 Last week: Lots of methods for graph mining and link analysis.
This week: A few examples of these methods applied to blogs.
4 Paper 1
- Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM 2007.
- What temporal and topological features do we observe in a large network of blogs?
7 Representing blogs as graphs
[Figure: four example blogs (slashdot, boingboing, MichelleMalkin, Dlisted) drawn two ways: as a blog network and as a post network.]
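One way to relate the two views is to collapse post-to-post links into blog-to-blog edges. The sketch below is only an illustration: the function name `blog_network`, the toy post ids, and the choice to drop links between posts of the same blog are assumptions, not details taken from the papers.

```python
from collections import Counter

def blog_network(post_links, post_to_blog):
    """Collapse post-to-post links into weighted blog-to-blog edges."""
    weights = Counter()
    for src_post, dst_post in post_links:
        src_blog = post_to_blog[src_post]
        dst_blog = post_to_blog[dst_post]
        if src_blog != dst_blog:  # assumption: ignore links inside one blog
            weights[(src_blog, dst_blog)] += 1
    return dict(weights)

# Hypothetical toy data: post ids and the blog each post belongs to.
post_to_blog = {"s1": "slashdot", "b1": "boingboing",
                "b2": "boingboing", "d1": "Dlisted"}
# Each pair is (linking post, linked-to post).
post_links = [("b1", "s1"), ("b2", "s1"), ("d1", "b1")]
edges = blog_network(post_links, post_to_blog)
```

Here the edge weight counts how many post-level links run from one blog to the other.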
10 Extracting subgraphs: Cascades
- We gather cascades using the following procedure:
  - Find all initiators (out-degree 0).
  - Follow in-links.
  - Produces directed acyclic graph.
[Figure: example post network with nodes a, b, c, d, e, and the cascades extracted from it.]
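The procedure above can be sketched as a backward breadth-first search from each initiator. This is a minimal illustration, assuming links are given as (linking post, linked-to post) pairs; `extract_cascades` and the toy links are hypothetical names and data, not the authors' implementation.

```python
from collections import defaultdict, deque

def extract_cascades(links):
    """Return {initiator: set of cascade edges} for (linking, linked-to) pairs."""
    in_links = defaultdict(list)  # linked-to post -> posts that link to it
    has_out = set()               # posts with at least one out-link
    nodes = set()
    for src, dst in links:
        in_links[dst].append(src)
        has_out.add(src)
        nodes.update((src, dst))
    # Initiators are posts with out-degree 0.
    initiators = sorted(n for n in nodes if n not in has_out)
    cascades = {}
    for root in initiators:
        seen, edges, queue = {root}, set(), deque([root])
        while queue:  # follow in-links away from the initiator
            node = queue.popleft()
            for src in in_links[node]:
                edges.add((src, node))
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
        cascades[root] = edges
    return cascades

# Hypothetical post network on nodes a-e.
links = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "d"), ("e", "c")]
cascades = extract_cascades(links)
```

Isolated posts never appear in `links`, so they produce no cascade in this sketch.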
11 Papers 1, 2: Dataset (Nielsen Buzzmetrics)
- Gathered from August-September 2005
- Used set of 44,362 blogs, traced cascades
- 2.4 million posts, 5 million out-links, 245,404 blog-to-blog links
[Figure: number of posts vs. time (time axis in units of 1 day).]
12 Temporal Observations
- Does blog traffic behave periodically?
- Posts show a weekend effect: less traffic on Saturday and Sunday.
15 Temporal Observations
- How does post popularity change over time?
- Post popularity dropoff follows a power law identical to that found in communication response times in Vazquez2006.
- The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$.
[Figure: Monday post dropoff. Number of in-links vs. days after post, on linear and log scales, with popularity on day 1 and day 40 marked.]
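As a rough illustration of how such an exponent can be read off a drop-off curve, the sketch below fits a straight line to log(count) vs. log(days). The data are synthetic and `fit_power_law_exponent` is a hypothetical helper. Least squares on log-log axes is a simple estimator that can be biased on noisy data, and it is not necessarily the fitting method used in the paper.

```python
import math

def fit_power_law_exponent(deltas, counts):
    """Least-squares slope of log(count) against log(delta)."""
    xs = [math.log(d) for d in deltas]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic drop-off: in-links decay exactly as days ** -1.5.
days = list(range(1, 41))
in_links = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, in_links)
```

On this noise-free input the fitted slope recovers the exponent used to generate the data.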
17 Topological Observations
- What graph properties does the blog network exhibit?
- 44,356 nodes, 122,153 edges
- Half of blogs belong to largest connected component.
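A statistic like the share of blogs in the largest connected component can be computed with a union-find structure. The sketch below assumes weak connectivity (edge direction ignored), which the slide does not specify. `largest_component_fraction` and the toy graph are hypothetical.

```python
from collections import Counter

def largest_component_fraction(nodes, edges):
    """Fraction of nodes in the largest weakly connected component."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two components

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)

# Hypothetical toy graph: components {a, b, c}, {d, e}, {f}.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
fraction = largest_component_fraction(nodes, edges)
```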
18 Topological Observations
- What power laws does the blog network exhibit?
[Figure: count vs. number of blog in-links, and count vs. number of blog out-links, both on log-log scales.]
Both in-degree and out-degree follow power-law distributions: the in-link exponent is -1.7 and the out-degree exponent is near -3. This suggests a strong rich-get-richer phenomenon.
20 Topological Observations
What graph properties does the post network exhibit?
- Very sparsely connected: 98% of posts are isolated.
- In-links/out-links also follow power laws.
21 Topological Observations
How do we measure how information flows through the network?
- Common cascade shapes are extracted using algorithms in Leskovec2006.
22 Topological Observations
How do we measure how information flows through the network?
- Number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.
[Figure: number of edges vs. cascade size (number of nodes), and effective diameter vs. cascade size.]
23 More on cascades
- Cascade sizes, including sizes of particular shapes (stars, chains), also follow power laws.
- This paper also presents a model for influence propagation that generates cascades based on the SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.
24 Paper 2
- Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM 2007.
- Do different kinds of blogs exhibit different properties?
- What tools can we use to describe the behavior of a blog over time?
25
- Suppose we wanted to characterize a blog based on the properties of its posts.
- Obtain a set of post features based on its role in a cascade.
- Use PCA for dimensionality reduction.
26 Post features
- There are several terms we use to describe cascades:
- In-link, out-link
  - Green node has one out-link.
  - Yellow node has one in-link.
- Depth downwards/upwards
  - Pink node has an upward depth of 1, downward depth of 2.
- Conversation mass upwards/downwards
  - Pink node has upward CM 1, downward CM 3.
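These six features can be sketched for a single post as follows. The definitions are assumptions made for illustration: "upward" follows out-links toward the initiator, "downward" follows in-links, conversation mass counts the posts reachable in that direction, and depth is the largest hop distance found by breadth-first search. The toy cascade is hypothetical and was built so that post `p` reproduces the pink-node values quoted above.

```python
from collections import defaultdict, deque

def reach(start, neighbors):
    """BFS from start: (number of other posts reached, largest hop distance)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return len(dist) - 1, max(dist.values())

def post_features(post, links):
    """links: (linking post, linked-to post) pairs within one cascade."""
    cites = defaultdict(list)     # post -> posts it links to (upward)
    cited_by = defaultdict(list)  # post -> posts linking to it (downward)
    for src, dst in links:
        cites[src].append(dst)
        cited_by[dst].append(src)
    cm_up, depth_up = reach(post, cites)
    cm_down, depth_down = reach(post, cited_by)
    return {"in_links": len(cited_by[post]), "out_links": len(cites[post]),
            "depth_up": depth_up, "depth_down": depth_down,
            "cm_up": cm_up, "cm_down": cm_down}

# Hypothetical cascade: p links to initiator r; x and y link to p; z links to x.
links = [("p", "r"), ("x", "p"), ("y", "p"), ("z", "x")]
features = post_features("p", links)
```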
29 Dimensionality reduction
- Post features may be correlated, so some information may be unnecessary.
- Principal Component Analysis is a method of dimensionality reduction.
[Figure: hypothetically, for each blog, a scatter plot of depth upwards vs. conversation mass upwards.]
30 Setting up the matrix
- Rows: 2,400,000 posts.
- Columns: log(# in-links), log(# out-links), log(CM up), log(CM down), log(depth up), log(depth down).
- Run PCA.
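A small pure-Python sketch of this setup is shown below: log-transform the features, center them, form the covariance matrix, and find the first principal component by power iteration. Using log(x + 1) to keep zero counts finite is an assumption, as are the function names and the toy rows. A real run over millions of posts would use a numerical library instead.

```python
import math

def log_features(rows):
    # Assumption: log(x + 1) so that zero counts stay finite.
    return [[math.log(x + 1.0) for x in row] for row in rows]

def center_and_covariance(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov

def first_component(cov, iterations=500):
    # Power iteration: apply the covariance matrix and renormalize.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical posts: in-links, out-links, CM up, CM down, depth up, depth down.
posts = [[0, 1, 1, 0, 1, 0], [3, 1, 1, 5, 1, 2], [1, 0, 0, 1, 0, 1],
         [10, 2, 3, 20, 2, 3], [0, 1, 2, 0, 2, 0]]
centered, cov = center_and_covariance(log_features(posts))
pc1 = first_component(cov)
# Project each post onto the first principal component.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

A second component could be obtained the same way after removing the first one from the covariance matrix (deflation).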
32 PostFeatures: Results
- Observation: Posts within a blog tend to retain similar network characteristics.
- PC1: CM upward
- PC2: CM downward
[Figure: PostFeatures plot with MichelleMalkin and Dlisted labeled.]
33
- Suppose we want to cluster blogs based on content. What features do we use?
- Get set of features based on cascade shapes.
- Run PCA to reduce dimensionality.
34 PCA on a sparse matrix
- This time, each blog is one row (44,000 blogs) and each cascade type is one column (9,000 cascade types).
- Use log(count + 1).
- Project onto 2 PCs.
36 CascadeType: Results
- Observation: Content of blogs and cascade behavior are often related.
- Distinct clusters for conservative and humorous blogs (hand-labeling).
37
- What about time series data? How can we deal with that?
- Problem: time series data is nonuniform and difficult to analyze.
[Figure: in-links over time.]
38 BlogTimeFractal: Definitions
- Fortunately, we find that behavior is often self-similar.
- The 80-20 law describes self-similarity.
- For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.
- Repeat recursively.
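The recursive 80-20 split can be sketched as a short generator of synthetic traffic. As a simplifying assumption, the larger share always goes to the first half here. A b-model would normally pick the heavier half at random. `b_model` is a hypothetical helper name.

```python
def b_model(total, levels, bias=0.8):
    """Split `total` over 2**levels bins, giving `bias` of it to one half."""
    if levels == 0:
        return [total]
    # Simplification: the first half always receives the larger share.
    heavy = b_model(total * bias, levels - 1, bias)
    light = b_model(total * (1.0 - bias), levels - 1, bias)
    return heavy + light

# 100 units of traffic spread over 2**3 = 8 time intervals.
traffic = b_model(100.0, 3)
```

At every scale, one half of any dyadic interval carries 80% of that interval's traffic, which is the self-similarity the slide describes.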
41 Self-similarity
- The bias factor for the 80-20 law is b = 0.8.
[Figure: a sequence split into two halves carrying 20% and 80% of the traffic.]
Q: How do we estimate b?
A: Entropy plots!
44 BlogTimeFractal
- An entropy plot plots entropy vs. resolution.
- From time series data, begin with resolution R = T/2.
- Record entropy H_R.
- Recursively take finer resolutions.
45 BlogTimeFractal: Definitions
- Entropy measures the non-uniformity of histogram at a given resolution.
- We define entropy of our sequence at given R:
  $H_R = -\sum_t p(t) \log_2 p(t)$
- where p(t) is percentage of posts from a blog on interval t, R is resolution and $2^R$ is number of intervals.
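A minimal sketch of an entropy plot follows, assuming the Shannon entropy in bits of the fraction of traffic per interval, and a series whose length is a power of two. `entropy_at_resolution` and `entropy_plot` are hypothetical helper names.

```python
import math

def entropy_at_resolution(series, resolution):
    """Entropy (bits) of the series aggregated into 2**resolution intervals."""
    bins = 2 ** resolution
    width = len(series) // bins  # assumes len(series) is a power of two
    total = float(sum(series))
    entropy = 0.0
    for i in range(bins):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:  # empty intervals contribute nothing
            entropy -= p * math.log2(p)
    return entropy

def entropy_plot(series):
    """List of (resolution, entropy) points, coarsest resolution first."""
    max_res = int(math.log2(len(series)))
    return [(r, entropy_at_resolution(series, r)) for r in range(max_res + 1)]

uniform = entropy_plot([1] * 8)                      # evenly spread traffic
point_mass = entropy_plot([5, 0, 0, 0, 0, 0, 0, 0])  # all traffic in one bin
```

For evenly spread traffic the entropy grows by one bit per resolution step, and for a single burst it stays at zero.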
46 BlogTimeFractal
- For a b-model (and self-similar cases), entropy plot is linear. The slope s will tell us the bias factor.
- Lemma: For traffic generated by a b-model, the bias factor b obeys the equation
  $s = -b \log_2 b - (1-b) \log_2 (1-b)$
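The lemma gives s as a function of b, so recovering b from a measured slope means inverting it numerically. The sketch below uses bisection on the interval [0.5, 1], where the slope falls monotonically from 1 to 0. The function names are hypothetical.

```python
import math

def slope_from_bias(b):
    """Entropy-plot slope of a b-model with bias factor b."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def bias_from_slope(s, tol=1e-10):
    """Invert the lemma by bisection on 0.5 <= b <= 1."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if slope_from_bias(mid) > s:
            lo = mid  # slope still too high: bias must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

b = bias_from_slope(0.85)  # a slope of 0.85 gives b of about 0.72
```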
49 Entropy Plots
- Linear plot -> Self-similarity
- Uniform: slope s = 1, bias = 0.5
- Point mass: s = 0, bias = 1
- Michelle Malkin in-links: s = 0.85. By Lemma 1, b = 0.72.
[Figure: entropy vs. resolution.]
50 BlogTimeFractal: Results
- Observation: Most time series of interest are self-similar.
- Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law).
[Figure: entropy plots for MichelleMalkin: in-links, b = 0.72; conversation mass, b = 0.76; number of posts, b = 0.70.]
51 Papers 1, 2: Conclusions
- There are several power laws observed in a network of blogs.
- We can extract cascades to help describe how information propagates through a network.
- We can use cascade properties to describe behavior of some blogs.
- We can also use self-similarity to describe behavior of blogs over time.
52 Paper 3
- Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit Structure and the Dynamics of Blogspace. WWW 2004.
- What are the large- and small-scale patterns of blog epidemics?
53 Large scale: Epidemic profiles
- Example: A popular website linking to a given blog may cause a popularity spike.
54 Large scale: Epidemic profiles
- Quantify popularity of a topic into a vector.
- Then, cluster the different topic profiles.
55 Large scale: Epidemic profiles
- Used k-means clustering on topic buzz to identify different ways ideas gain and lose popularity. Found k = 4 worked best.
[Figure: centroids of clusters identified.]
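To make the clustering step concrete, here is a bare-bones k-means sketch over popularity vectors. The initialization (first k points), the toy profiles, and the use of k = 2 in the toy run are illustrative assumptions. The study itself reports k = 4 on real topic profiles.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # assumption: naive init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical daily popularity profiles: early spikes vs. late spikes.
profiles = [[9, 1, 0, 0], [0, 0, 2, 8], [8, 2, 0, 0],
            [0, 1, 1, 9], [10, 0, 0, 0], [0, 0, 3, 7]]
labels, centroids = kmeans(profiles, k=2)
```

The centroids play the role of the cluster profiles shown on the slide: each is the average popularity curve of the topics assigned to it.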
56 Large scale: Epidemic profiles
- Catchall (48%): picked up by different communities, no major spike.
- Back page news (20%): delayed spike, broader popularity.
- Slashdot (14%): link picked up quickly, dies off quickly.
- Front page news (18%): immediate spike, broader popularity.
58 Small scale: link mining (link gathering)
- Links acquired by blogrolls or automated trackbacks.
- Posts sometimes give information on source of information (via).
Example post:
May 16 2003, 8:48a GIANTmicrobes http://www.giantmicrobes.com/ We make stuffed animals that look like tiny microbes only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache. (via BoingBoing)
[Figure labels: Epstein-Barr, Ebola.]
59 Small scale: link mining
- Unfortunately, since via information is rare (O(.1)), there needs to be a better way to infer infection paths.
- Solution: link prediction.
60 Link prediction
- Predict likelihood of 2 blogs linking to each other, using four features:
  - Blog similarity: common links to other blogs
  - Link similarity: common non-blog links
  - Textual similarity: text vector similarity
  - Timing of posts on certain topics.
- First three are cosine similarity; timing is likelihood based on observed distributions of link timings.
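The cosine similarity used for the first three features can be sketched over sparse vectors stored as dictionaries. The vector contents (link target mapped to a count) and the blog names below are hypothetical. The paper's exact weighting is not given on the slide.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * b[key] for key, w in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Hypothetical link vectors: which blogs each blog links to.
blog_1 = {"slashdot": 1, "boingboing": 1}
blog_2 = {"boingboing": 1, "Dlisted": 1}
similarity = cosine_similarity(blog_1, blog_2)
```

The same function applies to non-blog links or to term-weight vectors for the textual feature.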
61 Link prediction results
- Used SVMs to predict links.
- Undirected link prediction accuracy: 91%
- (Directed link prediction: 57%)
62 More goodies from Paper 3
- And:
  - Built Zoomgraph, a visualization tool (stay tuned next week).
  - Proposed iRank, a ranking based on infectiousness of blogs (stay tuned Oct. 23).
- A more in-depth slide show may be found here: http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf
63 Paper 4
- Noor Ali-Hasan and Lada Adamic. Expressing Social Relationships on the Blog through Links and Comments. ICWSM 2007.
- Do different blog communities exhibit certain structural properties?
64 Ali-Hasan and Adamic 2007
- Dataset of 3 blogging communities:
  - Dallas/Ft. Worth
  - United Arab Emirates (UAE)
  - Kuwait
- Analyzed 3 types of links:
  - Blogrolls (on a blog's webpage)
  - Citations (link in a post)
  - Comments (interaction in a post's discussion)
65 [Figure: citation link and blogroll link.]
66 [Figure: comment link.]
68 Link type analysis
- It is of interest to compare different types of links.
- Co-occurrences of different link types.
- Reciprocity among link types, between communities.
[Figures: co-occurrence of link types (Kuwait); link reciprocation rates.]
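A reciprocation rate can be sketched as the fraction of directed links whose reverse link also exists. This definition, the helper name `reciprocity`, and the toy links are assumptions for illustration. The paper may compute its rates differently.

```python
def reciprocity(links):
    """Fraction of distinct directed links (u, v) for which (v, u) also exists."""
    edges = {(u, v) for u, v in links if u != v}  # drop self-links, duplicates
    if not edges:
        return 0.0
    mutual = sum(1 for u, v in edges if (v, u) in edges)
    return mutual / len(edges)

# Hypothetical blogroll links: a and b link to each other, the rest do not.
links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
rate = reciprocity(links)
```

Running it separately on blogroll, citation, and comment links would give one rate per link type and community.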
70 Structural properties
- Centralization: to what extent links are not uniformly distributed (low in all communities, indicating hubs).
- Modularity: to what extent subcommunities have formed.
[Figures: links per blog; modularity.]
71 Comparing communities
- Kuwait: fewest links external to community (53%), highly centralized, much reciprocity.
- UAE: fewer links external to community, more centralization, obvious hub structure.
- Dallas-Fort Worth: most links are external to community (91%), low centralization, low reciprocity.
72 Paper 4: Conclusions
- Based on a survey, they suggest that these different network characteristics indicate different mindsets inside the community.
- Kuwait bloggers more often reported blogging in order to make new friends.
- DFW bloggers more often reported blogging to update friends/family on events.
73 Conclusions
- Link analysis has discovered patterns in several aspects of the blogosphere:
  - Observing general network characteristics.
  - Describing behavior of specific blogs, or blog topics.
  - Illustrating how influence propagates.
  - Comparing different blogging communities.