Graph mining techniques applied to blogs presentation

1
Graph mining techniques applied to blogs
Mary McGlohon, Seminar on Social Media Analysis,
Oct 2, 2007
3
Last week: Lots of methods for graph mining and
link analysis.
This week: A few examples of these methods applied to blogs.
4
Paper 1

Jure Leskovec, Mary McGlohon, Christos Faloutsos,
Natalie Glance, and Matthew Hurst. Patterns of
Cascading Behavior in Large Blog Graphs, SDM
2007.
What temporal and topological features do we
observe in a large network of blogs?

7
Representing blogs as graphs

[Figure: the blogosphere network (example blogs slashdot, boingboing, MichelleMalkin, Dlisted), drawn as a blog network and as a post network.]
10
Extracting subgraphs: Cascades

We gather cascades using the following procedure:
Find all initiators (out-degree 0).
Follow in-links.
Produces directed acyclic graph.

[Figure: example post network on nodes a-e and the cascades extracted from it.]
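The cascade-gathering procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `extract_cascades`, the `(src, dst)` link format (post `src` links to post `dst`), and the toy links are all assumptions made for the example.

```python
from collections import defaultdict, deque


def extract_cascades(links):
    """Group posts into cascades, one per initiator.

    links: iterable of (src, dst) pairs, meaning post src links to post dst
    (an assumed input format, chosen for this sketch).
    """
    links = list(links)
    in_links = defaultdict(set)  # dst -> posts that link to dst
    has_out_link = set()
    posts = set()
    for src, dst in links:
        in_links[dst].add(src)
        has_out_link.add(src)
        posts.update((src, dst))

    cascades = {}
    # Step 1: initiators are posts with out-degree 0.
    for initiator in sorted(posts - has_out_link):
        # Step 2: follow in-links breadth-first from the initiator.
        seen = {initiator}
        queue = deque([initiator])
        while queue:
            post = queue.popleft()
            for citing in in_links[post]:
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        # Step 3: keep the links among the collected posts. The result is
        # acyclic as long as posts only link to earlier posts.
        edges = sorted((s, d) for s, d in links if s in seen and d in seen)
        cascades[initiator] = {"nodes": seen, "edges": edges}
    return cascades


# Toy post network (hypothetical): b and e link to earlier posts, c links to b.
toy_links = [("b", "a"), ("c", "b"), ("e", "b"), ("e", "d")]
cascades = extract_cascades(toy_links)
```

In this toy network post e links into two conversations, so it appears in both the cascade started by a and the one started by d.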
11
Papers 1, 2: Dataset (Nielsen Buzzmetrics)

Gathered from August-September 2005.
Used set of 44,362 blogs, traced cascades.
2.4 million posts, 5 million out-links, 245,404
blog-to-blog links.

[Figure: number of posts vs. time (1 day).]
12
Temporal Observations

Does blog traffic behave periodically?
Posts show a weekend effect: less traffic on
Saturday/Sunday.

13
Temporal Observations

How does post popularity change over time?

[Figure: dropoff for Monday posts; number of in-links (log) vs. days after post, with popularity on day 1 and on day 40 marked.]
15
Temporal Observations

How does post popularity change over time?

Post popularity dropoff follows a power law identical
to that found in communication response times in
Vazquez2006.
The probability that a post written at time $t_p$
acquires a link at time $t_p + \Delta$ is
$p(t_p + \Delta) \propto \Delta^{-1.5}$.

[Figure: number of in-links vs. days after post.]
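One simple way to check a power-law dropoff is to fit a straight line to log(in-links) against log(days after post); the slope is the exponent. The sketch below is only an illustration on synthetic data that follows the -1.5 exponent exactly. The function name and the data are assumptions, and a least-squares fit on log-log axes is a basic estimator, not necessarily the one used in the paper.

```python
import math


def fit_power_law_exponent(deltas, counts):
    """Least-squares slope of log(count) against log(delta)."""
    xs = [math.log(d) for d in deltas]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic in-link counts for days 1..40 (illustrative, not real blog data).
days = list(range(1, 41))
inlinks = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, inlinks)
print(round(exponent, 3))
```

On real data the points would scatter around the line, and the weekend effect would have to be accounted for before fitting.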
17
Topological Observations

What graph properties does the blog network
exhibit?
44,356 nodes, 122,153 edges
Half of blogs belong to largest connected
component.

18
Topological Observations

What power laws does the blog network exhibit?

[Figures: count vs. number of blog in-links, and count vs. number of blog out-links (all axes log scale).]
Both in- and out-degree follow a power-law
distribution: the in-degree PL exponent is -1.7 and the
out-degree PL exponent is near -3. This suggests a
strong rich-get-richer phenomenon.
20
Topological Observations
What graph properties does the post network
exhibit?

Very sparsely connected: 98% of posts are
isolated.
In-links/out-links also follow power laws.

21
Topological Observations
How do we measure how information flows through
the network?

Common cascade shapes are extracted using
algorithms in Leskovec2006.

22
Topological Observations
How do we measure how information flows through
the network?

Number of edges increases linearly with
cascade size, while effective diameter increases
logarithmically, suggesting tree-like structures.

[Figures: number of edges vs. cascade size (# nodes); effective diameter vs. cascade size.]
23
More on cascades

Cascade sizes, including sizes of particular
shapes (stars, chains), also follow power laws.
This paper also presents a model for influence
propagation that generates cascades based on the SIS
model of epidemiology. The topic of influence
propagation has been reserved for a later date.

24
Paper 2

Mary McGlohon, Jure Leskovec, Christos
Faloutsos, Matthew Hurst, and Natalie Glance.
Finding patterns in blog shapes and blog
evolution, SDM 2007.
Do different kinds of blogs exhibit different
properties?
What tools can we use to describe the behavior of
a blog over time?

25

Suppose we wanted to characterize a blog based on
the properties of its posts.
Obtain a set of post features based on its role
in a cascade.
Use PCA for dimensionality reduction.

26
Post features

There are several terms we use to describe
cascades:
In-link, out-link: Green node has one out-link;
yellow node has one in-link.
Depth downwards/upwards: Pink node has an upward
depth of 1, downward depth of 2.
Conversation mass (CM) upwards/downwards: Pink node
has upward CM 1, downward CM 3.

[Figure: example cascade with green, yellow, and pink nodes.]
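The post features can be computed with two small graph traversals. The sketch below rests on assumptions the slide does not spell out: "upward" is taken to mean following a post's out-links toward the posts it cites, "downward" means following in-links, depth is the longest path length, and conversation mass is the number of distinct posts reachable. The toy cascade is built so that the node named pink reproduces the numbers quoted on the slide.

```python
from collections import defaultdict


def depth(node, neighbors, memo=None):
    """Length of the longest path starting at node (graph must be acyclic)."""
    memo = {} if memo is None else memo
    if node not in memo:
        memo[node] = max(
            (1 + depth(n, neighbors, memo) for n in neighbors.get(node, ())),
            default=0,
        )
    return memo[node]


def conversation_mass(node, neighbors):
    """Number of distinct posts reachable from node (excluding node itself)."""
    seen, stack = set(), [node]
    while stack:
        for n in neighbors.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen)


# Hypothetical cascade: (src, dst) means post src links to post dst.
links = [("pink", "root"), ("c1", "pink"), ("c2", "pink"), ("g", "c1")]
up = defaultdict(list)    # post -> posts it links to
down = defaultdict(list)  # post -> posts that link to it
for src, dst in links:
    up[src].append(dst)
    down[dst].append(src)

features = {
    "depth_up": depth("pink", up),
    "depth_down": depth("pink", down),
    "cm_up": conversation_mass("pink", up),
    "cm_down": conversation_mass("pink", down),
}
print(features)
```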
27
Dimensionality reduction

Post features may be correlated, so some
information may be unnecessary.
Principal Component Analysis is a method of
dimensionality reduction.

Hypothetically, for each blog...
[Figure: scatter plot with axes depth upwards and conversation mass upwards.]
30
Setting up the matrix
Matrix: 2,400,000 posts (rows) by 6 features (columns):
log(# in-links), log(# out-links), log(CM up),
log(CM down), log(depth up), log(depth down).
Run PCA.
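As a rough sketch of this step, PCA can be run on a posts-by-features matrix with a singular value decomposition. The example below uses NumPy and a small synthetic matrix in place of the 2,400,000-post data; the function name `pca_project` is an assumption, and the synthetic counts start at 1 so that the logarithm is defined (how zero counts were handled in the paper is not stated on the slide).

```python
import numpy as np


def pca_project(features, n_components=2):
    """Center the columns and project rows onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components


# Synthetic stand-in: 200 posts, 6 count-valued features (not real data).
rng = np.random.default_rng(0)
raw_counts = rng.integers(1, 50, size=(200, 6)).astype(float)
X = np.log(raw_counts)

scores, components = pca_project(X)
print(scores.shape, components.shape)
```

Each row of `scores` gives one post's coordinates on the first two principal components, which is what a two-dimensional PCA scatter plot shows.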
32
PostFeatures: Results

Observation: Posts within a blog tend to retain
similar network characteristics.

[Figure: posts from MichelleMalkin and Dlisted plotted on PC1 (CM upward) vs. PC2 (CM downward).]
33

Suppose we want to cluster blogs based on
content. What features do we use?
Get set of features based on cascade shapes.
Run PCA to reduce dimensionality.

34
PCA on a sparse matrix

This time, each blog is one row.
Use log(count + 1).
Project onto 2 PCs.

[Matrix: 44,000 blogs (rows) by 9,000 cascade types (columns).]
36
CascadeType: Results

Observation: Content of blogs and cascade
behavior are often related.

Distinct clusters for conservative and
humorous blogs (hand-labeling).
37

What about time series data? How can we deal
with that?
Problem: time series data is nonuniform and
difficult to analyze.

[Figure: in-links over time.]
38
BlogTimeFractal: Definitions

Fortunately, we find that behavior is often
self-similar.
The 80-20 law describes self-similarity.
For any sequence, we divide it into two
equal-length subsequences. 80% of traffic is in
one, 20% in the other.
Repeat recursively.
41
Self-similarity

The bias factor for the 80-20 law is b = 0.8.

[Figure: sequence split into halves carrying 20% and 80% of the traffic.]
Q: How do we estimate b?
A: Entropy plots!
44
BlogTimeFractal

An entropy plot plots entropy vs. resolution.
From time series data, begin with resolution r =
T/2.
Record entropy $H_r$.
Recursively take finer resolutions.
45
BlogTimeFractal: Definitions

Entropy measures the non-uniformity of the histogram
at a given resolution.
We define the entropy of our sequence at a given R as
$H_R = -\sum_{t=1}^{2^R} p(t) \log_2 p(t)$,
where p(t) is the fraction of posts from a blog in
interval t, R is the resolution, and $2^R$ is the number of
intervals.
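To make the entropy plot concrete, the sketch below generates synthetic 80-20 traffic and computes the entropy at each resolution. It assumes Shannon entropy in bits over $2^R$ equal-length intervals, a series length divisible by $2^R$, and a deterministic split that always puts the larger share in the first half (the side chosen does not change the entropy). The function names are assumptions made for this example.

```python
import math


def b_model(total, bias, levels):
    """Recursively split `total` in two: a `bias` share in one half, the rest in the other."""
    series = [float(total)]
    for _ in range(levels):
        series = [part for x in series for part in (x * bias, x * (1 - bias))]
    return series


def entropy_plot(series, max_resolution):
    """Return [(R, H_R)], aggregating the series into 2**R equal-length intervals."""
    total = sum(series)
    points = []
    for r in range(1, max_resolution + 1):
        width = len(series) // 2 ** r
        bins = [sum(series[i:i + width]) for i in range(0, len(series), width)]
        h = -sum((x / total) * math.log2(x / total) for x in bins if x > 0)
        points.append((r, h))
    return points


traffic = b_model(total=1000, bias=0.8, levels=10)  # 1024 synthetic time slots
plot = entropy_plot(traffic, max_resolution=10)
slope = (plot[-1][1] - plot[0][1]) / (plot[-1][0] - plot[0][0])
print(round(slope, 4))
```

For this synthetic traffic the points fall on a straight line, and the slope is the entropy of a single 80-20 split.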

46
BlogTimeFractal

For a b-model (and self-similar cases), the entropy
plot is linear. The slope s will tell us the
bias factor.
Lemma: For traffic generated by a b-model, the
bias factor b obeys the equation
$s = -b \log_2 b - (1-b) \log_2 (1-b)$
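Reading the lemma as $s = -b \log_2 b - (1-b) \log_2 (1-b)$, the bias factor has no closed form in terms of the slope, but it is easy to recover numerically. The sketch below uses bisection on the interval [0.5, 1), where the right-hand side decreases from 1 to 0. The function names are assumptions made for this illustration.

```python
import math


def binary_entropy(b):
    """Right-hand side of the lemma: entropy (in bits) of a b / (1 - b) split."""
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)


def bias_from_slope(s, tol=1e-9):
    """Solve s = binary_entropy(b) for b in [0.5, 1) by bisection."""
    lo, hi = 0.5, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > s:
            lo = mid  # entropy still too high, so the bias must be larger
        else:
            hi = mid
    return (lo + hi) / 2


b = bias_from_slope(0.85)
print(round(b, 2))
```

A slope of 0.85 gives a bias factor of roughly 0.72, and a slope of 1 gives 0.5, the uniform case.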
49
Entropy Plots

Linear plot → Self-similarity
Uniform: slope s = 1, bias = 0.5
Point mass: s = 0, bias = 1

Michelle Malkin in-links: s = 0.85. By Lemma 1,
b = 0.72.
[Figure: entropy vs. resolution.]
50
BlogTimeFractal: Results

Observation: Most time series of interest are
self-similar.
Observation: Bias factor is approximately 0.7;
that is, more bursty than uniform (70/30 law).

Entropy plots for MichelleMalkin:
in-links, b = 0.72; conversation mass, b = 0.76; number
of posts, b = 0.70.
51
Papers 1,2 conclusions

There are several power laws observed in a
network of blogs.
We can extract cascades to help describe how
information propagates through a network.
We can use cascade properties to describe
behavior of some blogs.
We can also use self-similarity to describe
behavior of blogs over time.

52
Paper 3

Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan
M. Lukose. Implicit Structure and the Dynamics of
Blogspace. WWW 2004.
What are the large- and small-scale patterns of
blog epidemics?
53
Large scale: Epidemic profiles

Example: Popular websites linking to a given blog
may cause popularity spikes.
54
Large scale: Epidemic profiles

Quantify the popularity of a topic as a vector.
Then, cluster the different topic profiles.

55
Large scale: Epidemic profiles

Used k-means clustering on topic buzz to identify
different ways ideas gain and lose popularity.
Found k = 4 worked best.

[Figure: centroids of the clusters identified.]
56
Large scale: Epidemic profiles

Catchall (48%): picked up by different communities,
no major spike.
Back page news (20%): delayed spike, broader
popularity.
Slashdot (14%): link picked up quickly, dies off
quickly.
Front page news (18%): immediate spike, broader
popularity.
57
Link gathering

Links acquired by blogrolls or automated
trackbacks.
Posts sometimes give information on source of
information (via).

May 16 2003, 8:48a GIANTmicrobes
http://www.giantmicrobes.com/ We make stuffed
animals that look like tiny microbes, only a
million times actual size! Now available: The
Common Cold, The Flu, Sore Throat, and Stomach
Ache. (via BoingBoing)
58
Small scale link mining

Links acquired by blogrolls or automated
trackbacks.
Posts sometimes give information on source of
information (via).

Unfortunately, since "via" information is rare
(O(.1)), there needs to be a better way to infer
infection paths.
Solution: link prediction.

60
Link prediction

Predict likelihood of 2 blogs linking to each
other, using:
Blog similarity: common links to other blogs.
Link similarity: common non-blog links.
Textual similarity: text vector similarity.
Timing of posts on certain topics.
The first three are cosine similarities; timing is a
likelihood based on observed distributions of
link timings.
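Cosine similarity, which underlies the first three features, can be sketched as follows. Each blog is represented as a sparse vector of counts, here the blogs it links to; the blog names and counts are made up for illustration, and the paper's exact vector construction and weighting are not given on the slide.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical link vectors: which blogs each blog links to, with counts.
blog_a = Counter(["boingboing", "slashdot", "slashdot"])
blog_b = Counter(["slashdot", "dlisted"])
sim = cosine_similarity(blog_a, blog_b)
print(round(sim, 4))
```

The same function applies to non-blog link vectors and to text (term-count) vectors; only the way the vectors are built changes.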
61
Link prediction results

Used SVMs to predict links.
Undirected link prediction accuracy: 91%
(directed link prediction: 57%).
62
More goodies from Paper 3

And:
Built Zoomgraph, a visualization tool (stay tuned
next week).
Proposed iRank, a ranking based on
infectiousness of blogs (stay tuned Oct. 23).
A more in-depth slide show may be found here:
http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf
63
Paper 4

Noor Ali-Hasan and Lada Adamic. Expressing
Social Relationships on the Blog through Links
and Comments. ICWSM 2007
Do different blog communities exhibit certain
structural properties?

64
Ali-Hasan and Adamic 2007

Dataset of 3 blogging communities:
Dallas/Ft. Worth
United Arab Emirates (UAE)
Kuwait
Analyzed 3 types of links:
Blogrolls (on a blog's webpage)
Citations (link in a post)
Comments (interaction in a post's discussion)

65
[Figure: example of a citation link and a blogroll link.]
66
[Figure: example of a comment link.]
68
Link type analysis

It is of interest to compare different types of
links:
Co-occurrences of different link types.
Reciprocity among link types, between communities.

[Figures: co-occurrence of link types (Kuwait); link reciprocation rates.]
70
Structural properties

Centralization: to what extent links are not
uniformly distributed (low in all communities,
indicating hubs).
Modularity: to what extent subcommunities have
formed.

[Figures: links per blog; modularity.]
71
Comparing communities
Kuwait: fewest links external to community
(53%); highly centralized; much reciprocity.
UAE: fewer links external to community; more
centralization; obvious hub structure.
Dallas-Fort Worth: most links are external to
community (91%); low centralization; low
reciprocity.
72
Paper 4 Conclusions

Based on a survey, they suggest that these
different network characteristics indicate
different mindsets within each community.
Kuwait bloggers more often reported blogging in
order to make new friends.
DFW more often reported blogging to update
friends/family on events.

73
Conclusions

Link analysis has discovered patterns in several
aspects of the blogosphere.
Observing general network characteristics.
Describing behavior of specific blogs, or blog
topics.
Illustrating how influence propagates.
Comparing different blogging communities.
