Title: Studying Blogspace
1Studying Blogspace
- Ravi Kumar
- IBM Almaden Research Center
- ravi_at_almaden.ibm.com
2Etymology
- From the OED new ed. (draft entry, Mar 2003)
- blog intr. To write or maintain a weblog. Also
to read or browse through weblogs, esp.
habitually.
- weblog n. 2. A frequently updated web site
consisting of personal observations, excerpts
from other sources, etc., typically run by a
single person, and usually with hyperlinks to
other sites; an online journal or diary.
- From WWW 2003 (Kumar, Novak, Raghavan, Tomkins)
- blogspace n. The collection of weblogs; also
blogosphere, blogsphere, blogistan, ...
3Blogs 101
- Characteristics
- Pages with reverse chronological sequences of
dated entries
- Usually contain a persistent sidebar containing
a profile (and other blogs read by the author:
the blogroll)
- Usually maintained and published by one of the
common variants of public-domain blog software
- From Slashdot, 1999
- "a new, personal, and determinedly non-hostile
evolution of the electric community. They are
also the freshest example of how people use the
Net to make their own, radically different new
media"
4Look and feel
- Quirky
- Highly personal
- Consumed by a small number of regular repeat
visitors
- Often updated multiple times each day
- Highly interwoven into a network of small but
active micro-communities
- E.g., LiveJournal, Xanga, DeadJournal, Blogger,
Memepool, ...
5The blog era
- Blogs began in 1996, but exploded in popularity
in 1999
- Proliferation of authoring tools
- Newsweek 2002 estimates 500K
- LiveJournal 2005 estimates 3.5M
- Annual Blogathon for charity
- Bloggers update their blogs every 30 minutes for 24 hours
- Sponsors pay
- Impact of blogs
- "Miserable failure" on Google
6Structural study (Kumar, Novak, Raghavan,
Tomkins, CACM 2004)
7LiveJournal blogspace
- LiveJournal.com: a popular blog site
- 1.3M bloggers (Feb 2004)
- 3.5M bloggers (Apr 2005)
- Each blogger has a profile
- Name, age, ...
- Geographic information (city, state, zip, ...)
- Friends and friend of
- Interests/communities
8E.g., LiveJournal user bill
9LJ bloggers in US
10LJ bloggers world-wide
11Who are they?
Age Representative interests
12Friendship graph
- Directed
- 80% mutual
- Average degree 14
- Power law degrees
- Clustering coeff. 0.2
- Most friendships explained by age, location,
interest
[Figure: share of friendships explained by age, location, and interest.
Labels as extracted: Age 1, 5, 16; Location 20; Interest 16, 22]
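The statistics on this slide (fraction of mutual friendships, average degree, clustering coefficient) can be computed from a list of directed friend edges. The sketch below is only an illustration: the function name `friendship_stats` is hypothetical, and it assumes that "average degree" means average out-degree and that clustering is measured on the undirected version of the graph, neither of which the slides specify.

```python
from itertools import combinations


def friendship_stats(edges):
    """edges: iterable of (a, b) pairs meaning 'a lists b as a friend'."""
    edge_set = {(a, b) for a, b in edges if a != b}
    nodes = {u for edge in edge_set for u in edge}

    # Fraction of directed edges whose reverse edge is also present.
    mutual = sum((b, a) in edge_set for a, b in edge_set) / len(edge_set)

    # Average out-degree (assumption: the slide's "average degree").
    avg_degree = len(edge_set) / len(nodes)

    # Local clustering coefficient, averaged over nodes with >= 2 neighbors,
    # computed on the undirected version of the graph (assumption).
    nbrs = {u: set() for u in nodes}
    for a, b in edge_set:
        nbrs[a].add(b)
        nbrs[b].add(a)
    coeffs = []
    for u in nodes:
        k = len(nbrs[u])
        if k < 2:
            continue
        links = sum(1 for x, y in combinations(nbrs[u], 2) if y in nbrs[x])
        coeffs.append(links / (k * (k - 1) / 2))
    clustering = sum(coeffs) / len(coeffs) if coeffs else 0.0

    return mutual, avg_degree, clustering


if __name__ == "__main__":
    toy = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a")]
    print(friendship_stats(toy))
```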
13Evolutionary study (Kumar, Novak, Raghavan,
Tomkins, WWW 2003)
14Blogs and evolution
- Every blog contains a dated record of
- Every word ever written to the blog
- Every link ever added in the blog
- Blogs are an increasingly important medium, but
- Few systematic studies have been performed
- Such study should take an evolutionary
perspective
- Brewington et al.; Bharat et al.; Fetterly et
al.; Cho et al.
- Tools for understanding evolution are not fully
understood
15Time graphs
[Figure: an underlying graph on vertices v1-v4 and the corresponding
time graph, in which vertices and edges are placed along a monthly
time axis running from Jan to Aug]
16Community evolution in blogs
- What are the communities within the time graph?
- Community definition, extraction
- Graph-based methods (trawling)
- Kumar, Raghavan, Rajagopalan, Tomkins, WWW 99
- How active are these communities, and over what
timeframe?
- Burst analysis (Kleinberg, KDD 02)
17Community extraction
- Community analysis based on graph structure
- Idea: there are many subgraphs that would never
occur in a random graph; if we find such
subgraphs, there must be some reason
- In blogspace, we enumerate dense subgraphs using
a greedy heuristic
18Dense subgraph enumeration (heuristic)
- Scan edges, find triangles
- For each triangle, greedily grow its neighbor set
- Growth is allowable based on a measure of
connectivity to the current dense subgraph
- Extracted communities are not unique
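The heuristic above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not state the connectivity measure, so the code assumes a candidate may join when it is linked to at least a fraction `min_fraction` of the current members. The function name and that parameter are hypothetical, and node labels are assumed to be mutually comparable (for example all integers or all strings).

```python
from collections import defaultdict


def dense_subgraphs(edges, min_fraction=0.5):
    """Seed with triangles, then greedily grow each seed (undirected view)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    adj = dict(adj)

    found = set()
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue
            for w in adj[u] & adj[v]:
                if w <= v:
                    continue
                members = {u, v, w}  # seed triangle, visited once (u < v < w)
                while True:
                    candidates = set().union(*(adj[m] for m in members)) - members
                    # Greedy choice: the candidate with the most links into
                    # the current subgraph.
                    best = max(
                        candidates,
                        key=lambda c: len(adj[c] & members),
                        default=None,
                    )
                    if best is None:
                        break
                    # Assumed admission rule: enough links to current members.
                    if len(adj[best] & members) < min_fraction * len(members):
                        break
                    members.add(best)
                found.add(frozenset(members))
    return found


if __name__ == "__main__":
    clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    print(dense_subgraphs(clique + [(4, 5)]))
```

Different seed triangles can grow into different but overlapping sets, which is one way to read the remark that the extracted communities are not unique.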
19Bursts: Static to dynamic communities
- Phenomenon to characterize: a topic in a temporal
stream occurs in a burst of activity
- Model source as multi-state
- Each state has certain emission properties
- Traversal between states is controlled by a
Markov model
- Determine most likely underlying state sequence
over time, given observable output
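Recovering the most likely state sequence is a Viterbi-style dynamic program. The sketch below is a simplified stand-in, not Kleinberg's exact formulation: it assumes one observed count per time step, Poisson emissions with hypothetical rates, a single symmetric switching probability, and a start in the low-rate state.

```python
import math


def most_likely_states(counts, rates=(1.0, 5.0), switch_prob=0.01):
    """Return the most likely hidden state (0 = low, 1 = high) per time step."""

    def log_emit(k, rate):
        # Log of the Poisson probability of observing k events at this rate.
        return k * math.log(rate) - rate - math.lgamma(k + 1)

    n_states = len(rates)
    log_stay = math.log(1 - switch_prob)
    log_switch = math.log(switch_prob / (n_states - 1))

    # Assumption: the source starts in state 0 (the low-rate state).
    score = [0.0 if s == 0 else -math.inf for s in range(n_states)]
    back = []
    for t, k in enumerate(counts):
        new_score, pointers = [], []
        for s in range(n_states):
            if t == 0:
                best_prev, best = s, score[s]
            else:
                best_prev, best = max(
                    (
                        (p, score[p] + (log_stay if p == s else log_switch))
                        for p in range(n_states)
                    ),
                    key=lambda pair: pair[1],
                )
            new_score.append(best + log_emit(k, rates[s]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(most_likely_states([0, 1, 0, 6, 7, 5, 0, 1]))
```

A small switching probability makes the model reluctant to change state, so only a sustained run of high counts is labeled as a burst.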
20An example
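[Figure: two-state model. State 1: output rate very low. State 2: output
rate very high. Transition probabilities between the states: 0.01 and 0.005]
[Figure: a conversation laid out along a time axis, annotated with the
most likely hidden state sequence]
- "I've been thinking about your idea with the asparagus"
- "Uh huh"
- "I think I see"
- "Uh huh"
- "Yeah, that's what I'm saying"
- "So then I said 'Hey, let's give it a try'"
- "And anyway she said maybe, okay?"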
21Some experiments
- Crawled 24,109 blogs from popular sites (2003)
- Extract archive links from blogs
- Extract all dates on blog pages, and tag each
word and link with a date
- Simple heuristics to automatically extract
time-stamps from entries (regular expressions,
training, ...)
- Obtained dates for 90% of edges
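As an illustration of the regular-expression heuristics mentioned above, the following sketch pulls dates out of page text. The two patterns and the helper name `extract_dates` are assumptions made for this example; the slides do not list the actual patterns used in the study.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Each entry pairs a pattern with a function mapping a match to (year, month, day).
DATE_PATTERNS = [
    # ISO style, e.g. 2003-02-14
    (
        re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b"),
        lambda m: (int(m[1]), int(m[2]), int(m[3])),
    ),
    # Month name first, e.g. "February 14, 2003" or "Feb 14 2003"
    (
        re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),?\s+(\d{4})\b"),
        lambda m: (int(m[3]), MONTHS.get(m[1].lower()), int(m[2])),
    ),
]


def extract_dates(text):
    """Return the valid dates found in text, in order of appearance."""
    found = []
    for pattern, to_ymd in DATE_PATTERNS:
        for match in pattern.finditer(text):
            year, month, day = to_ymd(match)
            try:
                found.append((match.start(), date(year, month, day)))
            except (TypeError, ValueError):
                continue  # not a real month or day; ignore the match
    return [d for _, d in sorted(found)]


if __name__ == "__main__":
    print(extract_dates("Posted on February 14, 2003. Updated 2003-03-01."))
```

Each word or link in an entry could then be tagged with the nearest preceding date, which is one plausible reading of the tagging step.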
22Experiments (contd.)
- The time graph
- 22,299 nodes, 70,472 unique edges
- 0.77M multiedges (average edge multiplicity 11)
- Consider graphs formed by prefixes from Jan 1,
1999 to some later month; this generates 47 prefix
graphs for analysis
- Enumerate communities and analyze their
burstiness
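A prefix graph keeps only the edges that arrived before a given cutoff, and a quantity such as the size of the largest strongly connected component can then be tracked from one prefix to the next. The sketch below assumes edges are given as `(source, destination, date)` triples and uses the first day of each month as the cutoff. All names are hypothetical, and the SCC fraction is taken relative to the nodes present in each prefix, which is also an assumption.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from first to last, inclusive."""
    y, m = first.year, first.month
    while (y, m) <= (last.year, last.month):
        yield date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)


def prefix_graphs(timed_edges, first, last):
    """Yield (cutoff, edge set) with all edges dated before each cutoff."""
    for cutoff in month_starts(first, last):
        yield cutoff, {(u, v) for u, v, t in timed_edges if t < cutoff}


def largest_scc_fraction(edges):
    """Size of the largest SCC divided by the number of nodes (Kosaraju)."""
    nodes = {x for edge in edges for x in edge}
    if not nodes:
        return 0.0
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    best, assigned = 0, set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        size, stack = 0, [root]
        while stack:
            node = stack.pop()
            size += 1
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        best = max(best, size)
    return best / len(nodes)
```

For example, iterating `for cutoff, edges in prefix_graphs(timed_edges, date(1999, 2, 1), date(2002, 12, 1))` and calling `largest_scc_fraction(edges)` on each prefix gives one point per month of an SCC growth curve.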
23SCC growth
Largest SCC as fraction of all nodes
2nd and 3rd largest SCCs as fraction of all nodes
24Connectivity in Blogspace
Number of nodes participating in a community
Number of communities
Fraction of nodes participating in some community
25Burstiness of communities
Number of communities in high state during each
time period
26Are these results a fluke?
- Randomized Blogspace: a distribution over time
graphs that look very much like the time graph of
Blogspace, but remove some of the locality of the
true graph
- Vertices and edges arrive at the same times, each
edge has the same source, but a randomly-chosen
destination
- If randomized Blogspace behaves like Blogspace,
then the observed community structure is spurious
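The randomization can be sketched as below. The slides do not say how the random destination is drawn, so this example assumes it is chosen uniformly among the vertices that have already arrived by the edge's timestamp, excluding the source. The function name and the `(source, destination, time)` edge format are hypothetical.

```python
import random


def randomize_time_graph(timed_edges, seed=None):
    """Keep each edge's source and time; redraw its destination at random."""
    rng = random.Random(seed)
    edges = sorted(timed_edges, key=lambda e: e[2])

    # A vertex "arrives" the first time it appears on any edge.
    arrival = {}
    for u, v, t in edges:
        arrival.setdefault(u, t)
        arrival.setdefault(v, t)

    randomized = []
    for u, v, t in edges:
        # Assumption: destinations are drawn uniformly from vertices that
        # exist at time t, never the source itself.
        candidates = [w for w in arrival if arrival[w] <= t and w != u]
        new_v = rng.choice(candidates) if candidates else v
        randomized.append((u, new_v, t))
    return randomized


if __name__ == "__main__":
    toy = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("d", "a", 4)]
    print(randomize_time_graph(toy, seed=0))
```

Running the same community and SCC analyses on the randomized edges gives the baseline against which the real time graph is compared.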
27SCC evolution
Blogspace
Randomized Blogspace
Randomized Blogspace forms an SCC much earlier
28Community evolution
Blogspace
Randomized Blogspace
Blogspace has many more communities
29Exogenous events
Number of blog pages that belong to a community
Number of blog communities
Number of communities identified automatically as
exhibiting bursty behavior: a measure of
cohesiveness of the blogspace
Wired magazine publishes an article on weblogs
that impacts the tech community
Newsweek magazine publishes an article that
reaches the population at large, responding to
emergence, and triggering mainstream adoption
30Some questions
- Modeling
- Edge arrivals
- Interesting events
- Algorithms
- Prediction
- Information percolation
- Search
- o(t T(n))
- Studies
- Sociological
- Effect on search and ranking
31Prediction via blogs (Gruhl, Guha, Kumar, Novak,
Tomkins, 2005)
32Blogs as trend indicators
- Can blogs be used to predict trends?
- Data
- Amazon sales rank of some books
- Blog chatter in an index
- Questions
- How well do they correlate?
- Can sales rank be predicted using blogs?
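One simple way to ask how well two daily series correlate, and whether one leads the other, is to compute the Pearson correlation at a range of time lags. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it expects a "sales" series in which larger means more sales, so a raw sales rank (where smaller is better) would first need to be inverted or negated.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def cross_correlation(blog_mentions, sales, max_lag):
    """Map lag -> correlation of blog_mentions[t] with sales[t + lag].

    A peak at a positive lag suggests blog chatter leads sales.
    """
    n = min(len(blog_mentions), len(sales))
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = blog_mentions[: n - lag], sales[lag:n]
        else:
            xs, ys = blog_mentions[-lag:n], sales[: n + lag]
        if len(xs) >= 2:
            result[lag] = pearson(xs, ys)
    return result


if __name__ == "__main__":
    mentions = [0, 0, 5, 1, 0, 0, 0, 0]
    sales = [0, 0, 0, 0, 5, 1, 0, 0]  # same spike, two days later
    cc = cross_correlation(mentions, sales, max_lag=3)
    print(max(cc, key=cc.get))
```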
33The Lance Armstrong Performance Program
Query: "Lance Armstrong" OR "Tour de France"
34Vanity Fair
35Cross-correlation for Lance Armstrong
36Simple inferences
- How to formulate queries automatically
- Depends on the object (book, movie, ...)
- Simple heuristics work well
- Predicting sales motion is hard
- Predicting spikes appears relatively easier
- More to be done
37Blogs and social networks (Kumar, Liben-Nowell,
Novak, Raghavan, Tomkins, 2005)
38Social networks
- Blog friendship graph is a social network
- Is there a simple model to describe this network?
- Desiderata
- Fit experimental observations
- Exhibit six degrees of separation
- Theoretically tractable
39RBF: Rank-Based Friendship
- Population network model
- Each person has a geographic location
- $d(\cdot, \cdot)$ measures geographic distance
- $\mathrm{rank}_A(B) = |\{C : d(A, C) < d(A, B)\}|$
- $\Pr[A \text{ befriends } B] \propto 1/\mathrm{rank}_A(B)$
- Independent of distance
- Works with arbitrary population densities
- Plus local links to neighbors
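The rank-based rule can be illustrated with a small sketch. It assumes planar coordinates with Euclidean distance, counts A itself among the people closer to A than B (so the nearest other person has rank 1), and clamps the rank to at least 1 to cover co-located people. The function names are hypothetical and the local links to neighbors are left out.

```python
import math
import random


def rank_of(a, b, locations):
    """Number of people strictly closer to a than b is (a itself included)."""
    d_ab = math.dist(locations[a], locations[b])
    return sum(
        1 for c in locations if math.dist(locations[a], locations[c]) < d_ab
    )


def friend_probabilities(a, locations):
    """Probability that a befriends each other person, proportional to 1/rank."""
    weights = {
        b: 1.0 / max(rank_of(a, b, locations), 1)
        for b in locations
        if b != a
    }
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def sample_friends(a, locations, k, seed=None):
    """Draw k friends for a (with replacement) from the rank-based distribution."""
    probs = friend_probabilities(a, locations)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=k)


if __name__ == "__main__":
    people = {"a": (0, 0), "b": (1, 0), "c": (3, 0), "d": (7, 0)}
    print(friend_probabilities("a", people))
```

Because only the ordering of distances matters, doubling every distance leaves the probabilities unchanged, which is why the model adapts to arbitrary population densities.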
40RBF: Preliminary results
- Fits experimental LiveJournal friendship graph
data (using geo data in the profile)
- Greedy routing: able to route messages from
source to destination most of the time, just
using geographic information
- Theoretical analysis: one can show that
geographic routing works under this model
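Greedy geographic routing can be sketched as follows: each holder of the message forwards it to whichever friend is geographically closest to the target, and gives up if no friend is closer than the holder. The data layout (a `locations` dictionary of coordinates and a `friends` dictionary of adjacency lists) and the function name are assumptions made for this example.

```python
import math


def greedy_route(source, target, locations, friends):
    """Return (path, delivered) for greedy forwarding toward the target."""
    path = [source]
    current = source
    while current != target:
        here = math.dist(locations[current], locations[target])
        # Pick the friend geographically closest to the target.
        nxt = min(
            friends.get(current, ()),
            key=lambda f: math.dist(locations[f], locations[target]),
            default=None,
        )
        # Stop if there is no friend, or no friend makes progress.
        if nxt is None or math.dist(locations[nxt], locations[target]) >= here:
            return path, False
        path.append(nxt)
        current = nxt
    return path, True


if __name__ == "__main__":
    locations = {"a": (0, 0), "b": (1, 0), "c": (3, 0), "d": (7, 0)}
    friends = {"a": ["b"], "b": ["a", "c"], "c": ["d"]}
    print(greedy_route("a", "d", locations, friends))
```

The distance to the target strictly decreases at every hop, so the loop always terminates, either at the target or at a person with no closer friend.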