Graph mining techniques applied to blogs - PowerPoint PPT Presentation

Slides: 74
Provided by: csC76
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

Title: Graph mining techniques applied to blogs


1
Graph mining techniques applied to blogs
Mary McGlohon, Seminar on Social Media Analysis,
Oct 2, 2007
2
Last week: lots of methods for graph mining and
link analysis.
3
Last week: lots of methods for graph mining and
link analysis. This week: a few examples of
these methods applied to blogs.
4
Paper 1
  • Jure Leskovec, Mary McGlohon, Christos Faloutsos,
    Natalie Glance, and Matthew Hurst. Patterns of
    Cascading Behavior in Large Blog Graphs, SDM
    2007.
  • What temporal and topological features do we
    observe in a large network of blogs?

5
Representing blogs as graphs
[Figure: blogosphere network of the example blogs slashdot, boingboing, MichelleMalkin, and Dlisted]
6
Representing blogs as graphs
[Figure: blogosphere network and the corresponding blog network for slashdot, boingboing, MichelleMalkin, and Dlisted]
7
Representing blogs as graphs
[Figure: blogosphere network, blog network, and post network for slashdot, boingboing, MichelleMalkin, and Dlisted]
8
Extracting subgraphs: Cascades
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).

[Figure: example post network with nodes a, b, c, d, e]
9
Extracting subgraphs: Cascades
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).
  • Follow in-links.

[Figure: example post network with nodes a, b, c, d, e, and the cascades being traced from it]
10
Extracting subgraphs: Cascades
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).
  • Follow in-links.
  • Produces directed acyclic graph.

[Figure: example post network with nodes a, b, c, d, e, and the cascades extracted from it]
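The three steps of the procedure can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the post network is given as a list of hypothetical `(src, dst)` pairs meaning "post `src` links to post `dst`", and the function name `extract_cascades` is made up for this example.

```python
from collections import defaultdict, deque


def extract_cascades(edges):
    """Return {initiator: (nodes, edges)} for a post network.

    `edges` is an iterable of (src, dst) pairs meaning that post `src`
    links to post `dst` (an assumed input format).
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
        nodes.update((src, dst))

    cascades = {}
    for node in sorted(nodes):
        # Step 1: initiators are posts with out-degree 0.
        if out_links[node]:
            continue
        # Step 2: follow in-links breadth-first from the initiator.
        seen = {node}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for citing in in_links[current]:
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        # Step 3: keep only links between posts of this cascade. The result
        # is a DAG as long as the input post network has no cycles.
        cascade_edges = {(s, d) for s in seen for d in out_links[s] if d in seen}
        cascades[node] = (seen, cascade_edges)
    return cascades


# Hypothetical toy network: posts b..e lead back to a, and g links to f.
toy = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("e", "c"), ("g", "f")]
for initiator, (members, links) in extract_cascades(toy).items():
    print(initiator, sorted(members), sorted(links))
```

Isolated posts never appear in an edge list, so they produce no cascade here.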
11
Papers 1, 2: Dataset (Nielsen BuzzMetrics)
  • Gathered from August-September 2005
  • Used set of 44,362 blogs, traced cascades
  • 2.4 million posts, 5 million out-links, 245,404
    blog-to-blog links

[Figure: number of posts vs. time (1-day resolution)]
12
Temporal Observations
  • Does blog traffic behave periodically?
  • Posts have weekend effect, less traffic on
    Saturday/Sunday.

13
Temporal Observations
  • How does post popularity change over time?

[Figure: number of in-links (log scale) vs. days after post, for posts made on a Monday; popularity on day 1 and on day 40 is marked]
14
Temporal Observations
  • How does post popularity change over time?

Post popularity dropoff follows a power law identical
to that found in communication response times in
[Vazquez2006].

[Figure: number of in-links vs. days after post (Monday post dropoff), on linear and log scales]
15
Temporal Observations
  • How does post popularity change over time?

Post popularity dropoff follows a power law identical
to that found in communication response times in
[Vazquez2006].
The probability that a post written at time $t_p$
acquires a link at time $t_p + \Delta$ is
$p(t_p + \Delta) \propto \Delta^{-1.5}$.

[Figure: number of in-links vs. days after post]
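One common way to estimate a power-law exponent such as the roughly -1.5 reported on this slide is a least-squares line fit in log-log space. The sketch below does that on synthetic data. It is not necessarily the fitting method the authors used, and the function name and the numbers are made up for illustration.

```python
import math


def fit_power_law_exponent(deltas, counts):
    """Least-squares slope of log(count) against log(delta)."""
    xs = [math.log(d) for d in deltas]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic in-link counts that decay exactly as delta ** -1.5
# (illustrative values, not the data set from the paper).
days = list(range(1, 41))
links = [1000.0 * d ** -1.5 for d in days]
print(round(fit_power_law_exponent(days, links), 3))
```

On real data the counts are noisy and include zeros, so binning or a maximum-likelihood estimator is usually preferable to a raw log-log regression.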
16
Topological Observations
  • What graph properties does the blog network
    exhibit?

17
Topological Observations
  • What graph properties does the blog network
    exhibit?
  • 44,356 nodes, 122,153 edges
  • Half of blogs belong to largest connected
    component.
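The share of blogs in the largest connected component can be measured with a union-find structure. The sketch below ignores link direction (it computes weakly connected components), which is an assumption about how "connected component" is meant here. The function name and toy data are hypothetical.

```python
from collections import Counter


def largest_component_fraction(nodes, edges):
    """Fraction of nodes in the largest weakly connected component."""
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = Counter(find(n) for n in parent)
    return max(sizes.values()) / len(parent)


# Hypothetical toy blog network: three linked blogs, three isolated ones.
blogs = ["a", "b", "c", "d", "e", "f"]
links = [("a", "b"), ("b", "c")]
print(largest_component_fraction(blogs, links))
```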

18
Topological Observations
  • What power laws does the blog network exhibit?

[Figure: two log-log plots, count vs. number of blog in-links and count vs. number of blog out-links]
Both in- and out-degree follow a power law
distribution: the in-link PL exponent is -1.7 and the
out-degree PL exponent is near -3. This suggests
strong rich-get-richer phenomena.
19
Topological Observations
What graph properties does the post network
exhibit?
20
Topological Observations
What graph properties does the post network
exhibit?
  • Very sparsely connected: 98% of posts are
    isolated.
  • In-links/out-links also follow power laws.

21
Topological Observations
How do we measure how information flows through
the network?
  • Common cascade shapes are extracted using
    algorithms in [Leskovec2006].

22
Topological Observations
How do we measure how information flows through
the network?
  • Number of edges increases linearly with
    cascade size, while effective diameter increases
    logarithmically, suggesting tree-like structures.

[Figure: number of edges vs. cascade size (# nodes), and effective diameter vs. cascade size]
23
More on cascades
  • Cascade sizes, including sizes of particular
    shapes (stars, chains) also follow power laws.
  • This paper also presents a model for influence
    propagation that generates cascades based on the
    SIS model of epidemiology. The topic of influence
    propagation has been reserved for a later date.

24
Paper 2
  • Mary McGlohon, Jure Leskovec, Christos
    Faloutsos, Matthew Hurst, and Natalie Glance.
    Finding patterns in blog shapes and blog
    evolution, SDM 2007.
  • Do different kinds of blogs exhibit different
    properties?
  • What tools can we use to describe the behavior of
    a blog over time?

24
25
  • Suppose we wanted to characterize a blog based on
    the properties of its posts.
  • Obtain a set of post features based on its role
    in a cascade.
  • Use PCA for dimensionality reduction.

26
Post features
  • There are several terms we use to describe
    cascades:
  • In-link, out-link: the green node has one
    out-link; the yellow node has one in-link.
  • Depth downwards/upwards: the pink node has an
    upward depth of 1 and a downward depth of 2.
  • Conversation mass (CM) upwards/downwards: the
    pink node has upward CM 1 and downward CM 3.
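Depth and conversation mass can both be computed by walking a cascade in one direction. The sketch below rests on two assumptions that the slide does not spell out: depth is taken as the longest path from the post in the chosen direction, and conversation mass as the number of distinct posts reachable in that direction. The toy graph is hypothetical and was chosen only so that the "pink" post reproduces the numbers quoted above.

```python
def post_features(node, neighbors):
    """Return (depth, conversation_mass) for `node` in one direction.

    `neighbors` maps each post to the posts one step away in the chosen
    direction (upward or downward). The cascade is assumed to be a DAG.
    """
    # Conversation mass: count distinct posts reachable from `node`.
    reachable = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for nxt in neighbors.get(current, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)

    # Depth: length of the longest path starting at `node`.
    memo = {}

    def depth(n):
        if n not in memo:
            memo[n] = max((1 + depth(m) for m in neighbors.get(n, ())), default=0)
        return memo[n]

    return depth(node), len(reachable)


# Hypothetical cascade around a "pink" post.
upward = {"pink": ["root"]}
downward = {"pink": ["x", "y"], "x": ["z"]}
print(post_features("pink", upward))    # upward depth and upward CM
print(post_features("pink", downward))  # downward depth and downward CM
```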
27
Dimensionality reduction
  • Post features may be correlated, so some
    information may be unnecessary.
  • Principal Component Analysis is a method of
    dimensionality reduction.

[Figure: hypothetical scatter plot for each blog, with axes depth upwards and conversation mass upwards]
27
30
Setting up the matrix
Matrix: 2,400,000 posts (rows) by 6 features (columns):
log(# in-links), log(# out-links), log(CM up),
log(CM down), log(depth up), log(depth down).
Run PCA on this matrix.
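A minimal pure-Python sketch of the PCA step is shown below: it centers the log-transformed feature columns, builds the covariance matrix, and finds the first principal component by power iteration. This is one simple way to obtain PC1, not necessarily how the authors ran PCA. The two-feature toy rows are made up, and a real run would use all six features and a numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the unit-length first principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of `cov`,
    # provided the start vector is not orthogonal to it and cov is nonzero.
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical posts described by (# in-links, CM up); counts are positive
# so that the plain logarithm is defined.
counts = [(1, 1), (2, 2), (4, 4), (8, 8)]
rows = [[math.log(a), math.log(b)] for a, b in counts]
print(first_principal_component(rows))
```

Projecting a centered row onto the returned vector gives that post's PC1 coordinate.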
31
PostFeatures: Results
  • Observation: posts within a blog tend to retain
    similar network characteristics.
  • PC1 corresponds to CM upward
  • PC2 corresponds to CM downward

31
32
PostFeatures: Results
  • Observation: posts within a blog tend to retain
    similar network characteristics.
  • PC1 corresponds to CM upward
  • PC2 corresponds to CM downward

[Figure: posts projected onto PC1 and PC2, with MichelleMalkin and Dlisted labeled]
33
  • Suppose we want to cluster blogs based on
    content. What features do we use?
  • Get set of features based on cascade shapes.
  • Run PCA to reduce dimensionality.

34
PCA on a sparse matrix
  • This time, each blog is one row.
  • Use log(count + 1)
  • Project onto 2 PC

Matrix: 44,000 blogs (rows) by 9,000 cascade types (columns)
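Because most blogs take part in only a few cascade types, the blog-by-cascade-type matrix is naturally stored sparsely. The sketch below builds one sparse row per blog with the log(count + 1) transform. The input format (a list of hypothetical `(blog, cascade_type)` observations) and the cascade type names are assumptions made for illustration.

```python
import math
from collections import Counter


def cascade_type_rows(observations):
    """Map each blog to a sparse row {cascade_type: log(count + 1)}."""
    counts = {}
    for blog, cascade_type in observations:
        counts.setdefault(blog, Counter())[cascade_type] += 1
    return {
        blog: {ctype: math.log(c + 1) for ctype, c in row.items()}
        for blog, row in counts.items()
    }


# Hypothetical observations: one entry per cascade a blog took part in.
observations = [
    ("blogA", "star3"), ("blogA", "star3"), ("blogA", "chain2"),
    ("blogB", "chain2"),
]
print(cascade_type_rows(observations))
```

Cascade types a blog never produced are simply absent from its row, which corresponds to log(0 + 1) = 0 in the dense matrix.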
35
CascadeType: Results
  • Observation: content of blogs and cascade
    behavior are often related.
  • Distinct clusters for conservative and
    humorous blogs (hand-labeling).

35
37
  • What about time series data? How can we deal
    with that?
  • Problem: time series data is nonuniform and
    difficult to analyze.

[Figure: in-links over time]
38
BlogTimeFractal: Definitions
  • Fortunately, we find that behavior is often
    self-similar.
  • The 80-20 law describes self-similarity.
  • For any sequence, we divide it into two
    equal-length subsequences. 80% of traffic is in
    one, 20% in the other.
  • Repeat recursively.
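The recursive 80-20 split can be written as a short recursive function. In the sketch below the heavier share always goes to the first half, which is a simplification made so the output is deterministic. A generator that picks the heavy half at random would match bursty traffic more closely. The function name `b_model` and its parameters are chosen for this example.

```python
def b_model(total, levels, b=0.8):
    """Spread `total` traffic over 2**levels equal-length intervals.

    At every level the sequence is cut into two halves that receive
    fractions b and 1 - b of the traffic. Assumption: the heavier share
    always goes to the first half.
    """
    if levels == 0:
        return [total]
    first = b_model(total * b, levels - 1, b)
    second = b_model(total * (1 - b), levels - 1, b)
    return first + second


# 100 units of traffic split recursively over 4 intervals.
print([round(x, 2) for x in b_model(100.0, 2)])
```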

38
39
Self-similarity
  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
40
Self-similarity
  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
Q: How do we estimate b?
41
Self-similarity
  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
Q: How do we estimate b?
A: Entropy plots!
42
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution
    R = T/2.
  • Record entropy H_R.

42
43
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution
    R = T/2.
  • Record entropy H_R.
  • Recursively take finer resolutions.

43
44
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution
    R = T/2.
  • Record entropy H_R.
  • Recursively take finer resolutions.

44
45
BlogTimeFractal: Definitions
  • Entropy measures the non-uniformity of histogram
    at a given resolution.
  • We define entropy of our sequence at given R:
    $H_R = -\sum_t p(t) \log_2 p(t)$
  • where p(t) is percentage of posts from a blog on
    interval t, R is resolution and 2^R is number of
    intervals.
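An entropy plot can be computed directly from a series of per-tick counts. The sketch below assumes that the entropy at resolution R is the Shannon entropy (base 2) of the traffic shares p(t) over 2^R equal intervals. The function name and the toy series are hypothetical.

```python
import math


def entropy_plot(series, max_resolution):
    """Return [(R, H_R)] for R = 1..max_resolution.

    `series` holds per-tick counts. Its length is assumed to be divisible
    by 2**max_resolution so every interval covers the same number of ticks.
    """
    total = sum(series)
    points = []
    for resolution in range(1, max_resolution + 1):
        bins = 2 ** resolution
        width = len(series) // bins
        entropy = 0.0
        for i in range(bins):
            p = sum(series[i * width:(i + 1) * width]) / total
            if p > 0:  # empty intervals contribute nothing
                entropy -= p * math.log2(p)
        points.append((resolution, entropy))
    return points


# Uniform traffic gives a line of slope 1. A single burst gives slope 0.
print(entropy_plot([1] * 8, 3))
print(entropy_plot([8, 0, 0, 0, 0, 0, 0, 0], 3))
```

The slope of the points, when they fall on a line, is the quantity s used to estimate the bias factor.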

46
BlogTimeFractal
  • For a b-model (and self similar cases), entropy
    plot is linear. The slope s will tell us the
    bias factor.
  • Lemma: For traffic generated by a b-model, the
    bias factor b obeys the equation
  • $s = -b \log_2 b - (1-b) \log_2 (1-b)$
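Reading the lemma as s = -b log2 b - (1 - b) log2 (1 - b), the bias factor has no closed form in terms of s, but it can be recovered numerically. The sketch below uses bisection on the interval [0.5, 1], where the right-hand side falls monotonically from 1 to 0. The function names are chosen for this example.

```python
import math


def slope_from_bias(b):
    """Entropy-plot slope s predicted for bias factor b."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)


def bias_from_slope(s, tol=1e-9):
    """Invert slope_from_bias on [0.5, 1] by bisection."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if slope_from_bias(mid) > s:
            lo = mid  # slope still too high: the bias must be larger
        else:
            hi = mid
    return (lo + hi) / 2


print(round(bias_from_slope(0.85), 2))
```

A slope of 1 maps to b = 0.5 (uniform traffic) and a slope of 0 maps to b = 1 (a point mass).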

46
47
Entropy Plots
  • Linear plot → self-similarity

[Figure: entropy plot, entropy vs. resolution]
48
Entropy Plots
  • Linear plot → self-similarity
  • Uniform: slope s = 1, bias = 0.5
  • Point mass: slope s = 0, bias = 1

[Figure: entropy plot, entropy vs. resolution]
49
Entropy Plots
  • Linear plot → self-similarity
  • Uniform: slope s = 1, bias = 0.5
  • Point mass: slope s = 0, bias = 1

MichelleMalkin in-links: s = 0.85. By Lemma 1, b = 0.72.
[Figure: entropy plot, entropy vs. resolution]
50
BlogTimeFractal: Results
  • Observation: most time series of interest are
    self-similar.
  • Observation: bias factor is approximately 0.7,
    that is, more bursty than uniform (70/30 law).

Entropy plots for MichelleMalkin: in-links, b = 0.72;
conversation mass, b = 0.76; number of posts, b = 0.70
50
51
Papers 1, 2: Conclusions
  • There are several power laws observed in a
    network of blogs.
  • We can extract cascades to help describe how
    information propagates through a network.
  • We can use cascade properties to describe
    behavior of some blogs.
  • We can also use self-similarity to describe
    behavior of blogs over time.

52
Paper 3
  • Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan
    M. Lukose. Implicit Structure and the Dynamics of
    Blogspace. WWW 2004.
  • What are the large- and small- scale patterns of
    blog epidemics?

52
53
Large scale: Epidemic profiles
  • Example: popular websites linking to a given
    blog may cause popularity spikes.

53
54
Large scale: Epidemic profiles
  • Quantify popularity of a topic into a vector.
  • Then, cluster the different topic profiles.

55
Large scale: Epidemic profiles
  • Used k-means clustering on topic buzz to identify
    different ways ideas gain and lose popularity.
    Found k = 4 worked best.

[Figure: centroids of the clusters identified]
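The clustering step can be illustrated with a bare-bones version of Lloyd's k-means algorithm over popularity vectors. Everything specific in the sketch is an assumption: the toy profiles are invented, the first k vectors are used as seeds to keep the run deterministic, and k = 2 is used only because the toy data has two obvious shapes.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=50):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    # Assumption: seed with the first k vectors instead of random picks.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical daily popularity profiles: early spikes vs. delayed spikes.
profiles = [[9, 1, 0, 0], [0, 1, 9, 0], [8, 2, 0, 0], [0, 2, 8, 0]]
centroids, labels = k_means(profiles, k=2)
print(labels)
print(centroids)
```

Each centroid is itself a popularity profile, which is what makes the clusters interpretable as typical ways a topic gains and loses attention.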
56
Large scale: Epidemic profiles
  • Catchall (48%): picked up by different
    communities, no major spike.
  • Back page news (20%): delayed spike, broader
    popularity.
  • Slashdot (14%): link picked up quickly, dies off
    quickly.
  • Front page news (18%): immediate spike, broader
    popularity.
57
Link gathering
  • Links acquired by blogrolls or automated
    trackbacks.
  • Posts sometimes give information on source of
    information (via).

Example post:
May 16 2003, 8:48a. GIANTmicrobes
http://www.giantmicrobes.com/ We make stuffed
animals that look like tiny microbes, only a
million times actual size! Now available: The
Common Cold, The Flu, Sore Throat, and Stomach
Ache. (via BoingBoing)
57
58
Small scale link mining
  • Links acquired by blogrolls or automated
    trackbacks.
  • Posts sometimes give information on source of
    information (via).

Example post:
May 16 2003, 8:48a. GIANTmicrobes
http://www.giantmicrobes.com/ We make stuffed
animals that look like tiny microbes, only a
million times actual size! Now available: The
Common Cold, The Flu, Sore Throat, and Stomach
Ache. (via BoingBoing)
[Figure labels: Epstein-Barr, Ebola]
59
Small scale link mining
  • Unfortunately, since "via" information is rare
    (O(.1)), there needs to be a better way to infer
    infection paths.
  • Solution: link prediction.

60
Link prediction
  • Predict likelihood of 2 blogs linking to each
    other.
  • Blog similarity: common links to other blogs
  • Link similarity: common non-blog links
  • Textual similarity: text vector similarity
  • Timing of posts on certain topics.
  • The first three are cosine similarities; timing
    is a likelihood based on observed distributions
    of link timings.
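Cosine similarity between two blogs can be computed from sparse vectors, for example the sets of other blogs each one links to. The sketch below represents each blog as a hypothetical `{target: weight}` dictionary. How the original study weighted links or text terms is not stated in the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {key: weight}."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical link vectors: which blogs each blog links to.
blog_x = {"slashdot": 1, "boingboing": 1}
blog_y = {"boingboing": 1, "dlisted": 1}
print(round(cosine_similarity(blog_x, blog_y), 3))
```

The same function applies to link similarity and textual similarity by changing what the dictionary keys stand for (non-blog URLs or terms).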

60
61
Link prediction results
  • Used SVMs to predict links.
  • Undirected link prediction accuracy: 91%
  • (Directed link prediction: 57%)

61
62
More goodies from Paper 3
  • Built Zoomgraph, a visualization tool (stay tuned
    next week).
  • Proposed iRank, a ranking based on
    infectiousness of blogs (stay tuned Oct. 23).
  • A more in-depth slide show may be found here:
    http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf

62
63
Paper 4
  • Noor Ali-Hasan and Lada Adamic. Expressing
    Social Relationships on the Blog through Links
    and Comments. ICWSM 2007
  • Do different blog communities exhibit certain
    structural properties?

64
Ali-Hasan and Adamic 2007
  • Dataset of 3 blogging communities: Dallas/Ft.
    Worth, United Arab Emirates (UAE), Kuwait
  • Analyzed 3 types of links:
  • Blogrolls (on a blog's webpage)
  • Citations (link in a post)
  • Comments (interaction in a post's discussion)

65
[Figure: examples of a citation link and a blogroll link]
66
[Figure: example of a comment link]
67
Link type analysis
  • It is of interest to compare different types of
    links:
  • Co-occurrences of different link types.

[Figure: co-occurrence of link types (Kuwait)]

68
Link type analysis
  • It is of interest to compare different types of
    links:
  • Co-occurrences of different link types.
  • Reciprocity among link types, between communities.

[Figures: co-occurrence of link types (Kuwait); link reciprocation rates]
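A reciprocation rate can be computed from a list of directed links. The sketch below uses one common definition, the fraction of directed links whose reverse link also exists. The paper may define the rate differently, and the toy links are hypothetical.

```python
def reciprocity(edges):
    """Fraction of directed links (u, v) for which (v, u) also exists."""
    links = {(u, v) for u, v in edges if u != v}  # drop self-links
    if not links:
        return 0.0
    mutual = sum(1 for u, v in links if (v, u) in links)
    return mutual / len(links)


# Hypothetical blogroll links: a and b link to each other, and the other
# two links are one-way.
blogroll = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
print(reciprocity(blogroll))
```

Running the function separately on blogroll, citation, and comment links gives one rate per link type, which is the comparison the slide describes.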
69
Structural properties
  • Centralization: to what extent links are not
    uniformly distributed. (low in all communities,
    indicating hubs)

[Figure: links per blog]

70
Structural properties
  • Centralization: to what extent links are not
    uniformly distributed. (low in all communities,
    indicating hubs)
  • Modularity: to what extent subcommunities have
    formed.

[Figures: links per blog; modularity]
71
Comparing communities
  • Kuwait: fewest links external to community (53%),
    highly centralized, much reciprocity.
  • UAE: fewer links external to community, more
    centralization, obvious hub structure.
  • Dallas-Fort Worth: most links are external to
    community (91%), low centralization, low
    reciprocity.
72
Paper 4 Conclusions
  • Based on a survey, they suggest that these
    different network characteristics indicate
    different mindsets inside each community.
  • Kuwait bloggers more often reported blogging in
    order to make new friends.
  • DFW more often reported blogging to update
    friends/family on events.

73
Conclusions
  • Link analysis has discovered patterns in several
    aspects of the blogosphere:
  • Observing general network characteristics.
  • Describing behavior of specific blogs, or blog
    topics.
  • Illustrating how influence propagates.
  • Comparing different blogging communities.