1
Studying Blogspace
  • Ravi Kumar
  • IBM Almaden Research Center
  • ravi_at_almaden.ibm.com

2
Etymology
  • From the OED, new ed. (draft entry, Mar 2003):
  • blog, intr.: To write or maintain a weblog. Also:
    to read or browse through weblogs, esp.
    habitually.
  • weblog, n. 2.: A frequently updated web site
    consisting of personal observations, excerpts
    from other sources, etc., typically run by a
    single person, and usually with hyperlinks to
    other sites; an online journal or diary.
  • From WWW 2003 (Kumar, Novak, Raghavan, Tomkins):
  • blogspace, n.: The collection of weblogs;
    blogosphere, blogsphere, blogistan, ...

3
Blogs 101
  • Characteristics
  • Pages with reverse chronological sequences of
    dated entries
  • Usually contain a persistent sidebar containing a
    profile (and other blogs read by the author: the
    blogroll)
  • Usually maintained and published by one of the
    common variants of public-domain blog software
  • From Slashdot, 1999
  • "a new, personal, and determinedly non-hostile
    evolution of the electric community. They are
    also the freshest example of how people use the
    Net to make their own, radically different new
    media"

4
Look and feel
  • Quirky
  • Highly personal
  • Consumed by a small number of regular repeat
    visitors
  • Often updated multiple times each day
  • Highly interwoven into a network of small but
    active micro-communities
  • E.g., LiveJournal, Xanga, DeadJournal, Blogger,
    Memepool, ...

5
The blog era
  • Blogs began in 1996, but exploded in popularity
    in 1999
  • Proliferation of authoring tools
  • Newsweek (2002) estimates 500K
  • LiveJournal (2005) estimates 3.5M
  • Annual Blogathon for charity
  • Bloggers update their blogs every 30 minutes for
    24 hours
  • Sponsors pay
  • Impact of blogs
  • "Miserable failure" on Google

6
Structural study (Kumar, Novak, Raghavan, Tomkins, CACM 2004)
7
LiveJournal blogspace
  • LiveJournal.com: a popular blog site
  • 1.3M bloggers (Feb 2004)
  • 3.5M bloggers (Apr 2005)
  • Each blogger has a profile
  • Name, age, ...
  • Geographic information (city, state, zip, ...)
  • Friends and "friend of" lists
  • Interests/communities

8
E.g., LiveJournal user bill
9
LJ bloggers in US
10
LJ bloggers world-wide
11
Who are they?
[Table: age group vs. representative interests]
12
Friendship graph
  • Directed
  • 80% mutual
  • Average degree 14
  • Power law degrees
  • Clustering coeff. 0.2
  • Most friendships explained by age, location,
    interest
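
The statistics listed above can be computed from a directed edge list. The following is a minimal sketch, not the authors' code: the function name `friendship_stats` is hypothetical, "average degree" is taken to mean average out-degree, and the clustering coefficient is averaged over nodes with at least two neighbors in the undirected version of the graph. All three conventions are assumptions.

```python
from collections import defaultdict


def friendship_stats(edges):
    """Return (mutual_fraction, average_out_degree, clustering_coefficient).

    `edges` is an iterable of directed (source, target) friendship pairs.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    nodes = {x for edge in edge_set for x in edge}

    # Fraction of directed edges whose reverse edge is also present.
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    mutual_fraction = mutual / len(edge_set) if edge_set else 0.0

    # Assumption: "average degree" means average out-degree.
    avg_degree = len(edge_set) / len(nodes) if nodes else 0.0

    # Assumption: clustering is measured on the undirected graph.
    nbrs = defaultdict(set)
    for u, v in edge_set:
        nbrs[u].add(v)
        nbrs[v].add(u)

    coeffs = []
    for u in nodes:
        ns = list(nbrs[u])
        k = len(ns)
        if k < 2:
            continue  # no neighbor pairs to test
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if ns[j] in nbrs[ns[i]]
        )
        coeffs.append(2 * links / (k * (k - 1)))
    clustering = sum(coeffs) / len(coeffs) if coeffs else 0.0
    return mutual_fraction, avg_degree, clustering
```

On a three-node cycle with one reciprocated edge, half of the directed edges are mutual and every node has clustering coefficient 1.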

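[Figure: Venn diagram of friendships explained by Age, Location, and Interest. Region values as extracted: Age 1; 5; 16; Location 20; Interest 16; 22]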
13
Evolutionary study (Kumar, Novak, Raghavan, Tomkins, WWW 2003)
14
Blogs and evolution
  • Every blog contains a dated record of:
  • Every word ever written to the blog
  • Every link ever added in the blog
  • Blogs are an increasingly important medium, but:
  • Few systematic studies have been performed
  • Such a study should take an evolutionary
    perspective
  • Brewington et al.; Bharat et al.; Fetterly et
    al.; Cho et al.
  • Tools for understanding evolution are not fully
    understood

15
Time graphs
[Figure: an underlying graph and its time graph on vertices v1 to v4, drawn against a time axis labeled Jan through Aug]
16
Community evolution in blogs
  • What are the communities within the time graph?
  • Community definition, extraction
  • Graph-based methods (trawling)
  • Kumar, Raghavan, Rajagopalan, Tomkins, WWW 1999
  • How active are these communities, and over what
    timeframe?
  • Burst analysis (Kleinberg, KDD 2002)

17
Community extraction
  • Community analysis based on graph structure
  • Idea: there are many subgraphs that would never
    occur in a random graph; if we find such
    subgraphs, there must be some reason
  • In blogspace, we enumerate dense subgraphs using
    a greedy heuristic

18
Dense subgraph enumeration (heuristic)
  • Scan edges, find triangles
  • For each triangle, greedily grow its neighbor set
  • Growth is allowable based on a measure of
    connectivity to the current dense subgraph
  • Extracted communities are not unique
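
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the heuristic used in the study: the slide does not give the connectivity measure, so the sketch admits a candidate only if it is linked to at least `min_fraction` of the current members (a hypothetical parameter), and it treats links as undirected.

```python
from collections import defaultdict


def dense_subgraphs(edges, min_fraction=0.5):
    """Seed with triangles, then greedily grow each seed.

    Assumption: a vertex may join a community when it is adjacent to at
    least `min_fraction` of the current members. The real connectivity
    measure is not specified in the source.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)

    found = set()
    for u, v in edges:  # scan edges
        if u == v:
            continue
        for w in nbrs[u] & nbrs[v]:  # each common neighbor closes a triangle
            community = {u, v, w}
            while True:
                frontier = set()
                for x in community:
                    frontier |= nbrs[x]
                frontier -= community
                if not frontier:
                    break
                # Pick the best-connected candidate; sort for determinism.
                best = max(
                    sorted(frontier, key=repr),
                    key=lambda c: len(nbrs[c] & community),
                )
                if len(nbrs[best] & community) < min_fraction * len(community):
                    break
                community.add(best)
            found.add(frozenset(community))
    return found
```

Different seeds can grow into overlapping but distinct sets, which is one way the extracted communities end up non-unique; only exact duplicates are merged here.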

19
Bursts: static to dynamic communities
  • Phenomenon to characterize: a topic in a temporal
    stream occurs in a burst of activity
  • Model source as multi-state
  • Each state has certain emission properties
  • Traversal between states is controlled by a
    Markov model
  • Determine most likely underlying state sequence
    over time, given observable output
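
A minimal sketch of the last step, assuming a simplified two-state version of the model: each time step emits a count drawn from a Poisson distribution whose rate depends on the hidden state, both states are equally likely at the start, and the chain switches state with probability `p_switch`. These choices, and the function name `most_likely_states`, are illustrative assumptions and not the exact formulation in Kleinberg's paper. The standard Viterbi recursion then recovers the most likely state sequence.

```python
import math


def most_likely_states(counts, low_rate, high_rate, p_switch=0.1):
    """Viterbi decoding for a two-state source (0 = low, 1 = high).

    Assumptions: Poisson emissions with the given positive rates, a
    uniform prior over the initial state, and a symmetric switching
    probability `p_switch` strictly between 0 and 1.
    """
    if not counts:
        return []
    rates = (low_rate, high_rate)
    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch)

    def log_emit(state, k):
        rate = rates[state]
        return k * math.log(rate) - rate - math.lgamma(k + 1)

    score = [math.log(0.5) + log_emit(s, counts[0]) for s in (0, 1)]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [
                score[p] + (log_stay if p == s else log_move) for p in (0, 1)
            ]
            best_prev = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_prev)
            new_score.append(cands[best_prev] + log_emit(s, k))
        back.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Time steps decoded as state 1 form the bursts; a community can then be called bursty during the periods in which its link activity is decoded as high.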

20
An example
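[Figure: a two-state source. State 1: output rate very low. State 2: output rate very high. The transitions between the two states are labeled 0.01 and 0.005]
[Figure: an example output stream plotted against time, with the most likely hidden state sequence. The utterances in the stream:]
  • "I've been thinking about your idea with the
    asparagus"
  • "Uh huh"
  • "I think I see"
  • "Uh huh"
  • "Yeah, that's what I'm saying"
  • "So then I said, 'Hey, let's give it a try'"
  • "And anyway she said maybe, okay?"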
21
Some experiments
  • Crawled 24,109 blogs from popular sites (2003)
  • Extract archive links from blogs
  • Extract all dates on blog pages, and tag each
    word and link with a date
  • Simple heuristics to automatically extract
    time-stamps from entries (regular expressions,
    training, ...)
  • Obtained dates for 90% of edges
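
As a rough illustration of the regular-expression part of this step, the sketch below recognizes two date formats and returns them with their character offsets, so that nearby words and links could be tagged with the closest preceding date. The patterns and the name `extract_dates` are assumptions made for this example; the formats actually handled in the study are not listed in the source.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Hypothetical patterns: "2003-04-01" and "March 14, 2003" / "Mar. 14, 2003".
ISO_DATE = re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b")
LONG_DATE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),\s*(\d{4})\b")


def extract_dates(text):
    """Return a list of (offset, date) pairs found in `text`, in order."""
    hits = []
    for m in ISO_DATE.finditer(text):
        hits.append((m.start(), int(m.group(1)), int(m.group(2)), int(m.group(3))))
    for m in LONG_DATE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month is not None:
            hits.append((m.start(), int(m.group(3)), month, int(m.group(2))))

    dates = []
    for offset, year, month, day in sorted(hits):
        try:
            dates.append((offset, date(year, month, day)))
        except ValueError:
            continue  # looked like a date but is not a valid calendar day
    return dates
```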

22
Experiments (contd.)
  • The time graph
  • 22,299 nodes, 70,472 unique edges
  • 0.77M multiedges (average edge multiplicity 11)
  • Consider graphs formed by prefixes from Jan 1,
    1999 to some later month; generate 47 prefix
    graphs for analysis
  • Enumerate communities and analyze their
    burstiness
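
A minimal sketch of how prefix graphs could be generated from dated edges. It assumes each edge is a `(source, target, date)` triple, that a prefix graph contains every edge dated strictly before its cutoff, and that vertices enter the graph with their first edge; the helper names `month_starts` and `prefix_graphs` are hypothetical.

```python
from datetime import date


def month_starts(year, month, count):
    """Return `count` consecutive first-of-month dates, starting at year/month."""
    starts = []
    for _ in range(count):
        starts.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return starts


def prefix_graphs(timed_edges, cutoffs):
    """Yield (cutoff, nodes, edges) for each cutoff in increasing order.

    Assumption: the prefix graph at a cutoff holds all edges dated
    strictly before that cutoff, with multi-edges collapsed.
    """
    ordered = sorted(timed_edges, key=lambda e: e[2])
    nodes, edges = set(), set()
    i = 0
    for cutoff in sorted(cutoffs):
        while i < len(ordered) and ordered[i][2] < cutoff:
            u, v, _ = ordered[i]
            nodes.update((u, v))
            edges.add((u, v))
            i += 1
        yield cutoff, set(nodes), set(edges)
```

Each yielded graph can then be passed to community enumeration and burst analysis in turn.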

23
SCC growth
Largest SCC as fraction of all nodes
2nd and 3rd largest SCCs as fraction of all nodes
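
The quantity plotted here, the largest strongly connected component (SCC) as a fraction of all nodes, can be computed for any one prefix graph with a standard algorithm. The sketch below uses an iterative form of Kosaraju's two-pass algorithm; it is a generic illustration and assumes every edge endpoint appears in `nodes`.

```python
from collections import defaultdict


def largest_scc_fraction(nodes, edges):
    """Size of the largest SCC divided by the number of nodes."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: depth-first search on the graph, recording finish order.
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finish order;
    # each search tree is exactly one SCC.
    assigned, best = set(), 0
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        size, stack = 0, [root]
        while stack:
            node = stack.pop()
            size += 1
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        best = max(best, size)
    return best / len(nodes) if nodes else 0.0
```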
24
Connectivity in Blogspace
Number of nodes participating in a community
Number of communities
Fraction of nodes participating in some community
25
Burstiness of communities
Number of communities in high state during each
time period
26
Are these results a fluke?
  • Randomized Blogspace: a distribution over time
    graphs that look very much like the time graph of
    Blogspace, but with some of the locality of the
    true graph removed
  • Vertices and edges arrive at the same times; each
    edge has the same source, but a randomly chosen
    destination
  • If randomized Blogspace behaves like Blogspace,
    then the community structure is an artifact
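
One way to draw a sample from such a randomized model is sketched below. The slide does not say how the random destination is chosen, so this sketch assumes it is uniform over the vertices that have already arrived by the edge's time, excluding the source, and that a vertex arrives with its first edge. The function name `randomize_destinations` is hypothetical.

```python
import random


def randomize_destinations(timed_edges, seed=None):
    """Keep each edge's source and time; redraw its destination.

    Assumptions: a vertex arrives at the time of its first edge, and the
    new destination is uniform over vertices present at the edge's time
    (other than the source).
    """
    rng = random.Random(seed)
    ordered = sorted(timed_edges, key=lambda e: e[2])

    first_seen = {}
    for u, v, t in ordered:
        first_seen.setdefault(u, t)
        first_seen.setdefault(v, t)
    arrivals = sorted(first_seen.items(), key=lambda item: item[1])

    present, i, rewired = [], 0, []
    for u, v, t in ordered:
        while i < len(arrivals) and arrivals[i][1] <= t:
            present.append(arrivals[i][0])
            i += 1
        choices = [w for w in present if w != u]
        # Fall back to the original destination if nothing else exists yet.
        rewired.append((u, rng.choice(choices) if choices else v, t))
    return rewired
```

Running the same SCC and community analyses on the rewired edges gives the baseline against which the real time graph is compared.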

27
SCC evolution
Blogspace
Randomized Blogspace
Randomized Blogspace forms an SCC much earlier
28
Community evolution
Blogspace
Randomized Blogspace
Blogspace has many more communities
29
Exogenous events
Number of blog pages that belong to a community
Number of blog communities
Number of communities identified automatically as
exhibiting bursty behavior: a measure of the
cohesiveness of blogspace
Wired magazine publishes an article on weblogs
that impacts the tech community
Newsweek magazine publishes an article that
reaches the population at large, responding to
emergence, and triggering mainstream adoption
30
Some questions
  • Modeling
  • Edge arrivals
  • Interesting events
  • Algorithms
  • Prediction
  • Information percolation
  • Search
  • o(t T(n))
  • Studies
  • Sociological
  • Effect on search and ranking

31
Prediction via blogs (Gruhl, Guha, Kumar, Novak, Tomkins, 2005)
32
Blogs as trend indicators
  • Can blogs be used to predict trends?
  • Data
  • Amazon sales rank of some books
  • Blog chatter in an index
  • Questions
  • How well do they correlate?
  • Can sales rank be predicted using blogs?
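
The correlation question can be made concrete with a lagged cross-correlation between two equally long daily series. The sketch below is an illustration only: it assumes the sales-rank series has already been transformed so that larger values mean more sales, and it uses the plain Pearson coefficient at each lag. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cross_correlation(mentions, sales, max_lag):
    """Map each lag in [-max_lag, max_lag] to a correlation.

    A positive lag k pairs mentions[t] with sales[t + k], so a peak at a
    positive lag suggests that blog chatter leads sales by k steps.
    """
    n = len(mentions)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = mentions[: n - lag], sales[lag:]
        else:
            xs, ys = mentions[-lag:], sales[: n + lag]
        result[lag] = pearson(xs, ys)
    return result
```

The lag with the highest correlation, `max(result, key=result.get)`, indicates how far one series leads the other.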

33
The Lance Armstrong Performance Program
Query: Lance Armstrong OR Tour de France
34
Vanity Fair
35
Cross-correlation for Lance Armstrong
36
Simple inferences
  • How to formulate queries automatically
  • Depends on the object (book, movie, ...)
  • Simple heuristics work well
  • Predicting sales motion is hard
  • Predicting spikes appears relatively easier
  • More to be done

37
Blogs and social networks (Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)
38
Social networks
  • Blog friendship graph is a social network
  • Is there a simple model to describe this network?
  • Desiderata
  • Fit experimental observations
  • Exhibit six degrees of separation
  • Theoretically tractable

39
RBF: Rank-Based Friendship
  • Population network model
  • Each person has a geographic location
  • $d(\cdot, \cdot)$ measures geographic distance
  • $\mathrm{rank}_A(B) = |\{C : d(A, C) < d(A, B)\}|$
  • $\Pr[A \text{ befriends } B] \propto 1/\mathrm{rank}_A(B)$
  • Independent of distance
  • Works with arbitrary population densities
  • Plus local links to neighbors
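
A small sketch of the rank-based probabilities, assuming points in the plane with Euclidean distance and taking the rank of B with respect to A as the number of people (A included) strictly closer to A than B is. The function name `befriend_probabilities` and the normalization over all other people are assumptions made for illustration.

```python
import math


def befriend_probabilities(a, locations):
    """Return {b: Pr[a befriends b]} with probability proportional to 1/rank.

    `locations` maps each person to an (x, y) point. Assumption: the rank
    of b counts every person strictly closer to a than b is, including a
    itself, and is clamped to at least 1 for co-located people.
    """
    def d(p, q):
        return math.dist(locations[p], locations[q])

    weights = {}
    for b in locations:
        if b == a:
            continue
        rank = sum(1 for c in locations if d(a, c) < d(a, b))
        weights[b] = 1.0 / max(rank, 1)

    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}
```

Because only the ordering of distances matters, rescaling all coordinates leaves the probabilities unchanged, which is why the model copes with uneven population densities.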

40
RBF: Preliminary results
  • Fits LiveJournal friendship experimental graph
    data (using geo data in the profile)
  • Greedy routing: is able to route messages from
    source to destination most of the time, just
    using geographic information
  • Theoretical analysis: can show that this model
    guarantees that geographic routing works
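
Greedy geographic routing can be sketched as follows: at each hop the message is forwarded to whichever friend of the current holder is geographically closest to the destination, and routing gives up when no friend is strictly closer than the current holder. The data layout (`friends` as a dict of lists, `locations` as a dict of points) and the function name are assumptions for this example.

```python
import math


def greedy_route(source, target, friends, locations):
    """Return (path, delivered) for greedy geographic routing.

    Each hop moves to the friend nearest the target. The distance to the
    target strictly decreases at every hop, so the loop always terminates.
    """
    def to_target(person):
        return math.dist(locations[person], locations[target])

    path = [source]
    current = source
    while current != target:
        nxt = min(friends.get(current, ()), key=to_target, default=None)
        if nxt is None or to_target(nxt) >= to_target(current):
            return path, False  # stuck: no friend is closer to the target
        path.append(nxt)
        current = nxt
    return path, True
```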