Title: Lada Adamic, HP Labs, Palo Alto, CA
1 Information dynamics in the networked world
2 Talk outline
Information flow through blogs
Information flow through email
Search through email networks
Search within the enterprise
Search in an online community
3 Implicit Structure and Dynamics of BlogSpace
Eytan Adar, Li Zhang, Lada Adamic, Rajan Lukose
- Blog use
- Record real-world and virtual experiences
- Note and discuss things seen on the net
- Blog structure: blog-to-blog linking
- Use Structure
- Great to track memes (catchy ideas)
4 Approaches and uses of blog analysis
- Patterns of information flow
- How does the popularity of a topic evolve over time?
- Who is getting information from whom?
- Ranking algorithms that take advantage of
transmission patterns
5 Tracking popularity over time
[Figure: popularity (y-axis) versus time (x-axis)]
Blogdex, BlogPulse, etc. track the most popular
links/phrases of the day
6 Different kinds of information have different popularity profiles
[Figure: four panels: major-news site (editorial content), back of the paper; products, etc.; Slashdot postings; front-page news. x-axis: days since first appearance (5, 10, 15); y-axis: proportion of hits received on each day since first appearance (0 to 1)]
7 Micro example: Giant Microbes
8 Microscale Dynamics
- What do we need to track specific info epidemics?
- Timings
- Underlying network
[Diagram: blog b1 on a timeline of infection times from t0 to t1]
9 Microscale Dynamics
- Challenges
- Root may be unknown
- Multiple possible paths
- Uncrawled space, alternate media (email, voice)
- No links
[Diagram: blogs b1 ... bn with unknown (?) transmission paths, on a timeline of infection times from t0 to t1]
10 Microscale Dynamics: who is getting info from whom
- Explicit blog-to-blog links (easy)
- "Via" links are even better
- Implicit/Inferred transfer (harder)
- Use ML algorithm for link inference problem
- Support Vector Machine (SVM)
- Logistic Regression
- What we can use
- Full text
- Blogs in common
- Links in common
- History of infection
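The link-inference step above can be sketched as a small classifier. This is a minimal illustration, not the authors' implementation: the feature set (link overlap, blogroll overlap, order of infection), the dictionary layout of a blog record, and the toy training data are all assumptions, and a hand-rolled logistic regression stands in for whichever SVM or logistic-regression package was actually used.

```python
import math

def jaccard(a, b):
    """Overlap of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def features(src, dst):
    """Features for the hypothesis 'dst got the URL from src' (hypothetical choice)."""
    return [
        jaccard(src["links"], dst["links"]),        # links in common
        jaccard(src["blogroll"], dst["blogroll"]),  # blogs in common
        1.0 if src["day"] < dst["day"] else 0.0,    # src was infected first
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of transfer; the last weight is the bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x + [1.0])))

def train_logistic(rows, labels, lr=1.0, epochs=3000):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            for i, xi in enumerate(x + [1.0]):
                grad[i] += err * xi
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Toy data (invented for illustration): a and b are similar, c is unrelated.
a = {"links": {"u1", "u2", "u3"}, "blogroll": {"b7", "b8"}, "day": 1}
b = {"links": {"u1", "u2", "u4"}, "blogroll": {"b7", "b8"}, "day": 2}
c = {"links": {"u9"}, "blogroll": {"b5"}, "day": 0}

rows = [features(a, b), features(c, b), features(b, a), features(c, a)]
labels = [1, 0, 0, 0]  # only "b got it from a" is a true transfer
w = train_logistic(rows, labels)
```

With these toy labels the model has to combine similarity with timing: `b` resembles `a` in both directions, but only the pair where `a` was infected first counts as a transfer.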
11 Visualization
http://www-idl.hpl.hp.com/blogstuff
- Zoomgraph tool
- Using GraphViz (by AT&T) layouts
- Simple algorithm
- If single, explicit link exists, draw it
- Otherwise use ML algorithm
- Pick the most likely explicit link
- Pick the most likely possible link
- Tool lets you zoom around space, control
threshold, link types, etc.
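The "simple algorithm" on this slide reads as a three-way choice per infected blog. A minimal sketch follows, assuming a hypothetical `score` callable that returns the classifier's likelihood for a candidate source; the function name and return format are illustrative only.

```python
def choose_parent(explicit, candidates, score):
    """Pick the blog from which an infected blog most plausibly got a URL.

    explicit:   earlier-infected blogs this blog links to explicitly
    candidates: other earlier-infected blogs (possible inferred sources)
    score:      callable giving a likelihood for a candidate source (assumed)
    Returns (source, link_type), or (None, None) when nothing is available.
    """
    if len(explicit) == 1:
        return explicit[0], "explicit"           # single explicit link: draw it
    if explicit:
        return max(explicit, key=score), "explicit"   # most likely explicit link
    if candidates:
        return max(candidates, key=score), "inferred"  # most likely possible link
    return None, None

# Invented scores for illustration.
scores = {"b1": 0.2, "b2": 0.9, "b3": 0.6}
print(choose_parent(["b1", "b3"], ["b2"], scores.get))
```

A display threshold, as mentioned for the tool, could be applied afterwards by dropping inferred links whose score falls below a chosen cutoff.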
12 Giant Microbes epidemic visualization
[Figure: epidemic graph; legend: blog, explicit link, via link, inferred link]
13 iRank
- Find early sources of good information
- using inferred information paths or timing
[Diagram: blogs b1 (true source), b2 (popular site), b3, b4, b5, ..., bn]
14 iRank Algorithm
- Draw a weighted edge for all pairs of blogs that cite the same URL
- Higher weight for mentions closer together in time
- Run PageRank
- Control for spam
[Diagram: timeline of infection times from t0 to t1]
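The iRank steps can be sketched in a few lines of plain Python. Several choices here are assumptions, not taken from the slide: edges point from the later citer to the earlier one (so rank flows toward early sources), the weight is 1/|time difference|, same-day mentions are skipped, and spam control is left out entirely.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(citations):
    """citations: {url: [(blog, day), ...]} -> {(later, earlier): weight}."""
    edges = defaultdict(float)
    for mentions in citations.values():
        for (b1, t1), (b2, t2) in combinations(mentions, 2):
            if b1 == b2 or t1 == t2:
                continue  # no clear direction between these two mentions
            later, earlier = (b1, b2) if t1 > t2 else (b2, b1)
            edges[(later, earlier)] += 1.0 / abs(t1 - t2)  # closer => heavier
    return dict(edges)

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; dangling rank is spread evenly."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = defaultdict(float)
    for (src, _), w in edges.items():
        out_weight[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        for (src, dst), w in edges.items():
            new[dst] += damping * rank[src] * w / out_weight[src]
        rank = new
    return rank

# Invented example: b1 mentions both URLs first.
citations = {
    "url1": [("b1", 0), ("b2", 1), ("b3", 2)],
    "url2": [("b1", 0), ("b3", 1)],
}
rank = pagerank(build_edges(citations))
```

In this toy input `b1` cites both URLs before anyone else, so it collects rank from `b2` and `b3` and ends up on top.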
15 Do Bloggers Kill Kittens?
- 02:00 AM Friday Mar. 05, 2004 PST: Wired publishes "Warning: Blogs Can Be Infectious."
- 7:25 AM Friday Mar. 05, 2004 PST: Slashdot posts "Bloggers' Plagiarism Scientifically Proven"
- 9:55 AM Friday Mar. 05, 2004 PST: Metafilter announces "A good amount of bloggers are outright thieves."
16 Information flow in social groups
Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler
17 Spread of disease is affected by the underlying network
[Diagram: mike's contacts: co-workers, mom, college friend]
18 Spread of computer viruses is affected by the underlying network
[Diagram: mike's contacts: co-workers, mom, college friend]
19 Difference between information flow and disease/virus spread
- Viruses (computer and otherwise) are shared indiscriminately (involuntarily)
- Information is passed selectively from one host to another, based on knowledge of the recipient's interests
20 Spread of information is affected by its content, potential recipients, and network topology
[Diagram: mike's contacts: co-workers, mom, college friend]
21 Homophily: individuals with like interests associate with one another
[Figure: personal homepages at Stanford; x-axis: distance between personal homepages]
22 The Model
Decay in transmission probability as a function of the distance m between the potential target and the originating node:
$T(m) = (m+1)^{-\beta}\, T$
Power-law implies slowest decay
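A short numerical illustration of the decay model, reading the slide's formula as T(m) = (m+1)^(-β) T. The plus sign and the symbol β are reconstructed from the garbled text, and the parameter values below are arbitrary.

```python
def transmission_probability(m, T, beta):
    """Transmission probability at distance m from the originating node.

    T is the probability at distance 0; beta controls how fast it decays.
    """
    return (m + 1) ** (-beta) * T

# Arbitrary illustration values: T = 0.5, beta = 1.
decay = [transmission_probability(m, 0.5, 1) for m in range(4)]
print(decay)
```

Setting `beta = 0` recovers the indiscriminate, virus-like case in which distance from the source does not matter.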
23 Virus, information transmission on a scale-free network
[Figure: degree distribution of all senders of email passing through the HP email server; x-axis: outdegree k; y-axis: P(k)]
24 Epidemics on scale-free graphs
$10^6$ nodes; epidemic if 1% ($10^4$) infected
[Figure: critical threshold (y-axis, 0 to 1) versus $\alpha$ (x-axis, 1 to 4); curves for $\kappa = 100$ with $\beta = 0$ and with $\beta = 1$, plus a third curve with $\beta = 0$]
25 Study of the spread of URLs and attachments
- 40 participants (30 within HPL, 10 elsewhere in HP & other orgs)
- 6370 URLs and 3401 attachments, cryptographically hashed
- Question: how many recipients in our sample did each item reach?
- Caveats: messages are deleted (still, the median number of messages is 2000); non-uniform sample
26 Only forwarded messages are counted
27 Results
Average number of recipients reached: 1.1 for attachments and 1.2 for URLs
[Figure annotation: ads at the bottom of Hotmail & Yahoo messages]
28 Simulate transmission on email log
Each message has a probability p of transmitting information from an infected individual to the recipient

02/19/2003 15:45:33 I-1 I-2
02/19/2003 15:45:33 I-1 I-3
02/19/2003 15:45:40 E-1 I-4
02/19/2003 15:45:52 I-5 E-2
02/19/2003 15:45:55 E-3 I-6
02/19/2003 15:45:58 I-7 I-8
02/19/2003 15:46:00 E-4 I-9
02/19/2003 15:46:05 I-10 I-11
02/19/2003 15:46:10 I-12 I-13
02/19/2003 15:46:10 I-12 I-14
02/19/2003 15:46:10 I-12 I-15
02/19/2003 15:46:14 I-16 E-5
...

I: internal node, E: external node
29 Simulation of information transmission on the actual HP Labs email graph
- An individual is infected if they receive a particular piece of information
- Individuals remain infected for 24 hours
- Start by infecting one individual at random
- Every time an infected individual sends an email, they have a probability p of infecting the recipient
- Track the epidemic over the course of a week; most run their course in 1-2 days
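The simulation rules above can be replayed over a time-ordered log in a single pass. This is a minimal sketch under stated assumptions: the log is a list of (timestamp, sender, recipient) tuples, the seed is passed in rather than drawn at random, and a person who has been infected once is never reinfected. The function and variable names are hypothetical.

```python
import random
from datetime import datetime, timedelta

def simulate(log, p, seed_person, rng, infectious_for=timedelta(hours=24)):
    """Replay the log once; return {person: time of infection}."""
    infected_at = {seed_person: log[0][0]}  # seed is infected at the log start
    for when, sender, recipient in log:
        since = infected_at.get(sender)
        if since is None or when - since > infectious_for:
            continue  # sender was never infected, or is no longer infectious
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = when
    return infected_at

def ts(hms):
    """Parse a time of day on the date used in the sample log."""
    return datetime.strptime("02/19/2003 " + hms, "%m/%d/%Y %H:%M:%S")

# A few rows in the style of the sample log.
log = [
    (ts("15:45:33"), "I-1", "I-2"),
    (ts("15:45:33"), "I-1", "I-3"),
    (ts("15:45:40"), "E-1", "I-4"),
    (ts("15:45:52"), "I-5", "E-2"),
]
outbreak = simulate(log, p=1.0, seed_person="I-1", rng=random.Random(0))
```

Repeating the run over many random seeds and values of p gives the distribution of epidemic sizes; a distance-dependent p could be substituted for the constant one.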
30 Introduce a decay in the transmission probability based on the hierarchical distance
[Diagram: organizational hierarchy with individuals A and B, two branches each labelled "distance 2"; $h_{AB} = 5$]
31 7119 potential recipients
[Figure: axis label $p_0$]
32 Conclusions on info flow in social groups
- Information spread typically does not reach epidemic proportions
- Information is passed on to individuals with matching properties
- The likelihood that properties match decreases with distance from the source
- Model gives a finite threshold
- Results are consistent with observed URL & attachment frequencies in a sample
- Simulations following real email patterns are also consistent
33 How to search in a small world
Milgram's experiment: given a target individual and a particular property, pass the message to a person you correspond with who is closest to the target.
34 Small world experiment at Columbia
Dodds, Muhamad, Watts, Science 301 (2003)
- Email experiment conducted in 2002
- 18 targets in 13 different countries
- 24,163 message chains
- 384 reached their targets
- Average path length 4.0
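For scale, the completion rate implied by the two chain counts on this slide works out as follows (the rounding is ours):

```latex
\[
\frac{384}{24{,}163} \approx 0.0159 \approx 1.6\%
\]
```

So the average path length of 4.0 describes only the small fraction of chains that were completed.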
35 Why study small world phenomena?
- Curiosity: Why is the world small? How are people able to route messages?
- Social networking as a business: Friendster, Orkut, MySpace, LinkedIn, Spoke, VisiblePath
36 Six degrees of separation - to be expected
- Pool and Kochen (1978): average person has 500-1500 acquaintances
- Ignoring clustering and other redundancy: $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors
- But networks are clustered: my friends' friends tend to be my friends
- Watts & Strogatz (1998): a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph
37 But how are people able to find short paths?
- How to choose among hundreds of acquaintances?
- Strategy: simple greedy algorithm - each participant chooses the correspondent who is closest to the target with respect to the given property
- Models:
- Geography: Kleinberg (2000)
- Hierarchical groups: Watts, Dodds, Newman (2001), Kleinberg (2001)
- High degree nodes: Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003)
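The greedy strategy can be written once and reused with different notions of "closest". In this minimal sketch the graph format, the `distance` callable, and the rule of never revisiting a node are assumptions made for illustration; the toy network is a ring with one shortcut.

```python
def greedy_route(graph, distance, source, target, max_steps=100):
    """Forward to the unvisited neighbour closest to the target.

    graph:    {node: set of neighbours}
    distance: callable(a, b) measuring separation in the chosen property
    Returns the path as a list, or None if the message gets stuck.
    """
    path, visited, current = [source], {source}, source
    while current != target and len(path) <= max_steps:
        options = [n for n in graph[current] if n not in visited]
        if not options:
            return None  # dead end: every neighbour has already been used
        current = min(options, key=lambda n: distance(n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None

# Toy network: a ring of 10 nodes plus one long-range shortcut between 0 and 5.
n = 10
graph = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
graph[0].add(5)
graph[5].add(0)

def ring_distance(a, b):
    return min((a - b) % n, (b - a) % n)

print(greedy_route(graph, ring_distance, 0, 6))
```

Swapping `ring_distance` for a geographic or organizational distance gives the geography and hierarchy variants of the same strategy.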
38 Spatial search
Kleinberg (2000)
"The geographic movement of the message from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain." S. Milgram, "The small world problem", Psychology Today 1, 61 (1967)
- Nodes are placed on a lattice and connect to nearest neighbors
- Additional links are placed with probability $f(d) \propto d(u,v)^{-r}$
- If $r = 2$, can search in polylog time
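The long-range link rule in this lattice model can be illustrated by computing the link distribution explicitly. This sketch assumes a small square grid with Manhattan distance and one extra link per node; the grid size and function names are illustrative.

```python
import random

def lattice_distance(u, v):
    """Manhattan distance between two grid points."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def long_range_distribution(u, size, r):
    """Probability that u's extra link lands on each other node, ~ d(u,v)^-r."""
    weights = {}
    for x in range(size):
        for y in range(size):
            v = (x, y)
            if v != u:
                weights[v] = lattice_distance(u, v) ** -r
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def sample_long_range(u, size, r, rng):
    """Draw one long-range contact for u."""
    dist = long_range_distribution(u, size, r)
    nodes = list(dist)
    return rng.choices(nodes, weights=[dist[v] for v in nodes], k=1)[0]

probs = long_range_distribution((0, 0), size=5, r=2)
contact = sample_long_range((0, 0), size=5, r=2, rng=random.Random(1))
```

With `r = 2` a node at distance 1 is nine times as likely to be picked as a node at distance 3, while `r = 0` gives uniformly random shortcuts.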
39 Kleinberg: searching hierarchical structures
"Small-World Phenomena and the Dynamics of Information", NIPS 14, 2001
- Hierarchical network models: h is the distance between two individuals in a hierarchy with branching b; $f(h) \propto b^{-\alpha h}$. If $\alpha = 1$, can search in O(log n) steps
- Group structure models: q is the size of the smallest group that two individuals belong to; $f(q) \propto q^{-\alpha}$. If $\alpha = 1$, can achieve in O(log n) steps
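A small sketch of the hierarchical model, assuming the individuals are the leaves of a complete b-ary tree numbered left to right and taking h as the height of their lowest common ancestor. The normalisation into a probability distribution and the function names are our additions.

```python
def tree_distance(u, v, b):
    """Height of the lowest common ancestor of leaves u and v in a b-ary tree."""
    h = 0
    while u != v:
        u //= b  # move both leaves up one level
        v //= b
        h += 1
    return h

def link_distribution(u, n_leaves, b, alpha):
    """Link probabilities from leaf u, proportional to b^(-alpha * h)."""
    weights = {
        v: b ** (-alpha * tree_distance(u, v, b))
        for v in range(n_leaves)
        if v != u
    }
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

# Binary tree with 8 leaves, alpha = 1.
p = link_distribution(0, n_leaves=8, b=2, alpha=1)
```

At `alpha = 1` each level of the hierarchy receives the same total link probability from leaf 0 (one third each in this example), which is the balance across scales that makes greedy search effective.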
40 Identity and search in social networks
Watts, Dodds, Newman (2001)
- Individuals belong to hierarchically nested groups
- Multiple independent hierarchies coexist
- $p_{ij} \propto \exp(-\alpha x)$
41 Identity and search in social networks
Watts, Dodds, Newman (2001)
- There is an attrition rate r
- Network is searchable if a fraction q of messages reach the target
[Figure: curves for N = 102400, N = 204800, N = 409600]
42 High degree search
Adamic et al., Phys. Rev. E 64, 46135 (2001)
[Cartoon: Mary, Bob and Jane; "Who could introduce me to Richard Gere?"]
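One way to read "high degree search" is: ask your best-connected acquaintance, who asks theirs, until someone knows the target directly. The sketch below follows that reading and is not the published algorithm; the stopping rule, the fallback to revisiting nodes, and the toy graph are all assumptions.

```python
def high_degree_search(graph, source, target, max_steps=1000):
    """Pass the query to the highest-degree unvisited neighbour.

    graph: {node: set of neighbours}. Each node can see its own neighbours,
    so the search ends one step after reaching a neighbour of the target.
    Returns the number of steps taken, or None if the search gives up.
    """
    current, visited, steps = source, {source}, 0
    while steps < max_steps:
        if current == target:
            return steps
        if target in graph[current]:
            return steps + 1
        unvisited = [n for n in graph[current] if n not in visited]
        pool = unvisited or list(graph[current])  # revisit if nothing is new
        if not pool:
            return None
        current = max(pool, key=lambda n: len(graph[n]))
        visited.add(current)
        steps += 1
    return None

# Toy graph: a hub H with four contacts, and s reachable only through a.
graph = {
    "H": {"a", "b", "c", "d"},
    "a": {"H", "s"},
    "b": {"H"},
    "c": {"H"},
    "d": {"H"},
    "s": {"a"},
}
print(high_degree_search(graph, "s", "d"))
```

The strategy pays off when a few nodes have very many contacts, which is why the degree distribution of the network matters for its performance.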
43 Power-law graph
[Figure: number of nodes found; labels 94, 6, 2]
44 Poisson graph
[Figure: number of nodes found; label 93]
45 Scaling of search time with size of graph
Sharp cutoff at $k \sim N^{1/\alpha}$, 2nd degree neighbors
[Figure: cover time for half the nodes (y-axis) versus size of graph (x-axis); curves: random walk ($\alpha = 0.37$ fit) and degree sequence ($\alpha = 0.24$ fit)]
46 Testing the models on social networks (w/ Eytan Adar)
- Use a well defined network: HP Labs email correspondence over 3.5 months
- Edges are between individuals who sent at least 6 email messages each way
- Node properties specified: degree, geographical location, position in organizational hierarchy
- Can greedy strategies work?
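The edge definition above (at least 6 messages each way) translates directly into a filter over sender/recipient pairs. A minimal sketch, assuming the log has already been reduced to (sender, recipient) tuples; the function name and toy log are illustrative.

```python
from collections import Counter

def build_network(log, threshold=6):
    """Keep an undirected edge only when both directions meet the threshold.

    log: iterable of (sender, recipient) pairs, one per message.
    Returns {person: set of neighbours}.
    """
    sent = Counter((s, r) for s, r in log if s != r)
    graph = {}
    for (s, r), count in sent.items():
        if count >= threshold and sent[(r, s)] >= threshold:
            graph.setdefault(s, set()).add(r)
            graph.setdefault(r, set()).add(s)
    return graph

# Invented log: a and b correspond heavily in both directions; a and c do not.
log = (
    [("a", "b")] * 6
    + [("b", "a")] * 6
    + [("a", "c")] * 10
    + [("c", "a")] * 5
)
network = build_network(log)
```

Requiring reciprocity removes one-way traffic such as announcements and mailing lists, leaving ties where both people actively correspond.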
47 Strategy 1: High degree search
[Figure: degree distribution of all senders of email passing through the HP email server; x-axis: outdegree]
48 Filtered network (6 messages sent each way)
Degree distribution no longer power-law, but Poisson
- 450 users
- Median degree 10, mean degree 13
- Average shortest path 3
- High degree search performance (poor): median steps 16, mean 40
49 Strategy 2: Geography
50 Communication across corporate geography
87% of the 4000 links are between individuals on the same floor
[Figure: building floor map with sections 1U, 1L, 2U, 2L, 3U, 3L, 4U]
51 Cubicle distance vs. probability of being linked
52 Finding someone in a sea of cubicles
Median steps 7, mean 12
53 Strategy 3: Organizational hierarchy
54 Email correspondence scrambled
55 Actual email correspondence
56 Example of search path
[Diagram: branches labelled "distance 2" and "distance 1"; hierarchical distance 5, search path distance 4]
57 Probability of linking vs. distance in hierarchy
in the searchable regime 0
58 Results
59 Group size vs. probability of linking
60 Group size and probability of linking
[Figure: x-axis: group size g]
61 Search Conclusions
- Individuals associate on different levels into groups.
- Group structure facilitates decentralized search using social ties.
- HP Labs as a social network is searchable but not quite optimal.
- Searching using the organizational hierarchy is faster than using physical location
- A fraction of important individuals are easily findable
- Humans may be much more resourceful in executing search tasks: making use of weak ties, using more sophisticated strategies
62 PeopleFinder2: a search engine for HP people
- Extract & disambiguate names from publicly available documents
- Enrich information available about individuals
- Search for them by topic
- Identify knowledge communities from co-occurrence of names
Live Demo
If live demo fails:
- Current PeopleFinder functionality
- PeopleFinder2 info on a person
- Extracted topics for a person
- Social network
- Social network visualization
- Search for individuals by topic
- Visualize knowledge network
- Find social network paths to experts
63 To find out more (papers, slides, other research in the group)
- Information dynamics group (IDL) at HP Labs: http://www.hpl.hp.com/research/idl
- List of publications: http://www.hpl.hp.com/personal/Lada_Adamic/research.html