Title: Structural Analysis in Large Networks: Observations and Applications
1Structural Analysis in Large Networks: Observations and Applications
- Mary McGlohon
- Committee
- Christos Faloutsos, co-chair
- Alan Montgomery, co-chair
- Geoffrey Gordon
- David Jensen, University of Massachusetts, Amherst
2Motivation
- Network (a.k.a. graph, relational, social network) data has become ubiquitous. We want to know:
  - 1. How do networks form and structure themselves?
  - 2. How does information propagate through networks?
  - 3. How do sub-communities form?
(Figure: example networks: computer networks, Facebook, IMDB actor-movie)
3Outline for thesis
4Motivation Topology
- How do these network structures form?
- Example: identify topological properties common to many different types of graphs (citations, friendships, etc.).
- Developing models of these properties allows for forecasting.
5Motivation Cascades
- Once the networks form, how does information propagate through the graph?
- Example: extract, analyze, and model cascades.
6Motivation Community
- How do we compare communities, or sub-networks?
- Example: for a set of online groups (Usenet), which ones continue to thrive over time?
7Thesis statement
- We propose to
  - investigate how interactions in graphs occur and how these interactions lead to diffusion and community behavior, and
  - model these behaviors and apply these findings to real-world problems.
9Impact
- Understanding the relations found in networks has many applications, such as:
- Fraud/anomaly detection
  - Given typical behavior and information about nodes/edges, how suspicious is a node or group of nodes?
- Ad personalization/recommendation systems
  - Given some information about an individual and their friends, which ads should be displayed?
- Resource allocation
  - Given typical patterns of network growth, how can we allocate resources (hardware, advertising budget, etc.)?
10Completed Work
SDM07
- to appear
11Proposed Work
P1a How do cascades compare across network
structures?
P1b Can we use cascades to model product
adoption?
P2 Can we predict success/failure of groups?
12The rest of the talk
- Motivation and thesis statement
- Completed work
- Proposed work
- Conclusions and impact
- Audience participation!
13Completed Work
- What patterns are common to networks?
14Topological Observations
(Kevin Bacon)
15Topological Observations Data
- Analyze unipartite and bipartite networks
  - Unipartite: citations, blogs, router traffic
  - Bipartite: IMDB actor-movie, campaign contributions
- Networks are evolving over time
- Networks may be weighted
  - Repeated edges
  - Edge weights
16Topological Observations Gelling Point
- When does a graph begin displaying expected
patterns, such as the giant connected component?
How can we tell when this happens?
17Topological Observations Gelling Point
- Observation: Most real graphs display a gelling point, where the graph begins to come together and the giant connected component forms. After that point, they exhibit typical behavior.
(Figure: diameter over time for IMDB; gelling point at t = 1914)
18Topological Observations NLCCs
- In graphs, a giant connected component emerges.
- We look at the sizes of the next-largest connected components (NLCCs).
  - After the gelling point, do they continue to grow? Do they shrink?
19Topological Observations NLCCs
- Observation: After the gelling point, the giant connected component takes off, but the next-largest connected components remain constant or oscillate.
(Figure: sizes of the 2nd and 3rd connected components over time for IMDB; gelling point at t = 1914)
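One way to observe this behavior on a time-stamped edge list is to add edges in time order and record the sizes of the largest components after each time step. The following is a minimal sketch using a union-find structure; the class `UnionFind`, the function `component_sizes_over_time`, and the toy edge list are assumptions made for illustration and are not taken from the thesis.

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}  # maps each current root to its component size

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size.pop(rb)


def component_sizes_over_time(timed_edges, top_k=3):
    """Map each timestamp t to the top_k component sizes once all
    edges stamped t (and earlier) have been added."""
    uf = UnionFind()
    history = {}
    for t, u, v in sorted(timed_edges):
        uf.union(u, v)
        sizes = sorted(uf.size.values(), reverse=True)[:top_k]
        history[t] = sizes + [0] * (top_k - len(sizes))
    return history


# Hypothetical toy data: (time, node, node)
edges = [(1, "a", "b"), (1, "c", "d"), (2, "b", "c"),
         (3, "e", "f"), (4, "g", "h"), (5, "a", "e")]
history = component_sizes_over_time(edges)
for t in sorted(history):
    print(t, history[t])
```

In each row the first entry is the giant component and the remaining entries are the NLCCs; plotting the second and third entries against time would give a curve of the kind described on this slide.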
20Topological Observations Weights
- How are edges in a graph repeated, or otherwise weighted?
- As the number of edges increases, does the total edge weight grow linearly?
21Topological Observations Weights
- Observation: Weight additions follow a power law with respect to the number of edges:
  - $W(t) \propto E(t)^{w}$
  - $W(t)$: total weight of the graph at time $t$
  - $E(t)$: total number of edges of the graph at time $t$
  - $w$ is the power-law exponent ($w > 1$)
- Many other weighted laws
  - see KDD08, ICDM08
(Figure: log(Weights) vs. log(Edges) for Orgs-Candidates; slope = 1.3)
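The exponent $w$ corresponds to the slope of a straight line fitted to log(Weights) against log(Edges). A minimal sketch of such a fit with ordinary least squares follows; the function name `fit_power_law_exponent` and the synthetic snapshot values are assumptions for illustration only (the data is generated with an exponent of 1.3 to mirror the slope reported for Orgs-Candidates, it is not the real dataset).

```python
import math


def fit_power_law_exponent(edge_counts, total_weights):
    """Least-squares fit of log(W) = w * log(E) + c; returns (w, c)."""
    xs = [math.log(e) for e in edge_counts]
    ys = [math.log(w) for w in total_weights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic snapshots (assumed): W(t) = 2 * E(t)^1.3
edge_counts = [10, 100, 1000, 10000]
total_weights = [2.0 * e ** 1.3 for e in edge_counts]

slope, intercept = fit_power_law_exponent(edge_counts, total_weights)
print(f"estimated exponent w = {slope:.3f}")
```

A slope above 1 indicates that total weight grows super-linearly in the number of edges.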
22Completed Work
- What patterns are common to networks?
23Completed Work
- Gelling point, CCs
- Weighted laws
24Completed Work
- Gelling point, CCs
- Weighted laws
- Can we develop generative models?
25Topological Models Butterfly
- Goals are to generate
- Constant/oscillating NLCCs
- Densification power law Leskovec05
- Shrinking diameter (after gelling point)
- Power-law degree distribution
- Emergent, local, intuitive behavior
26Topological Models Butterfly
- Main idea: uses 3 parameters
  - Curiosity: how much to explore the local network (U(0,1), creates power-law degree distribution)
  - Flyout: how many local networks to explore (global, joins components)
  - Friendliness: how often to connect (global, allows new components)
- Details: see KDD08
27Topological Models Butterfly
30Completed Work
- Gelling point, CCs
- Weighted laws
- What are patterns of cascades in networks?
31Cascade Observations Data
- Gathered from August-September 2005
- Used set of 44,362 blogs, traced cascades
- 2.4 million posts
- 245,404 blog-to-blog links
(Figure: number of posts per day over time; dates marked: Jul 4, Aug 1, Sep 29)
32Cascade Observations Prelims
(Figure: a blogosphere with posts a-e, and example star and chain cascades)
- How quickly does a link to a post occur?
- What size do cascades typically reach?
- What are the typical shapes? How often do stars and chains occur?
33Temporal Observations
- How quickly does a link to a post occur?
- Does popularity decay at a constant rate?
- With an exponential (half life)?
(Figure: link popularity over time on linear-linear, log-linear, and log-log scales)
34Cascade Observations Link Popularity
- Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is
  - $p(t_p + \Delta) \propto \Delta^{-1.5}$
- Similar to Vazquez06
35Cascade Observations Cascade Size
- Q: What size distribution do cascades follow? Are large cascades frequent?
- Observation: The probability of observing a cascade of $n$ blog posts follows a Zipf distribution:
  - $p(n) \propto n^{-2}$
(Figure: log(Count) vs. log(Cascade size) (number of nodes); slope = -2)
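Besides reading the slope off a log-log plot, a power-law exponent can be estimated by maximum likelihood. The sketch below uses the standard continuous estimator $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$ on synthetic samples. Cascade sizes are integers, so the continuous form is only an approximation here; the helper names, the seed, and the sample size are assumptions for illustration, not the fitting procedure used in SDM07.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n samples from a continuous power law p(x) ~ x^(-alpha),
    x >= x_min, by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def mle_exponent(samples, x_min):
    """Continuous maximum-likelihood estimate of the exponent alpha."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


rng = random.Random(0)  # fixed seed so the run is repeatable
sizes = sample_power_law(alpha=2.0, x_min=1.0, n=20000, rng=rng)
alpha_hat = mle_exponent(sizes, x_min=1.0)
print(f"estimated exponent: {alpha_hat:.3f}")
```

The estimate should land close to 2, the magnitude of the slope reported for cascade sizes.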
36Cascade Observations Cascade Size
- Q: What is the distribution of particular cascade shapes?
- Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star: -3.1, chain: -8.5).
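To count stars and chains, each extracted cascade first has to be labeled by shape. The following is a minimal sketch under stated assumptions: a cascade is given as a tree of (child, parent) link pairs, a star is a root that every other post links to directly, and a chain is a single path. The function `classify_cascade` and these working definitions are illustrative and may differ from the definitions in SDM07.

```python
from collections import defaultdict


def classify_cascade(edges):
    """Label a cascade tree as 'star', 'chain', or 'other'.

    edges: list of (child, parent) pairs, where the child post links
    to the parent post. Cascades with fewer than 3 nodes are 'other'.
    """
    children = defaultdict(list)
    nodes, has_parent = set(), set()
    for child, parent in edges:
        children[parent].append(child)
        nodes.update((child, parent))
        has_parent.add(child)
    roots = nodes - has_parent
    # Require a tree with exactly one root and at least 3 nodes.
    if len(roots) != 1 or len(nodes) < 3 or len(edges) != len(nodes) - 1:
        return "other"
    root = next(iter(roots))
    # Star: every non-root node links directly to the root.
    if len(children[root]) == len(nodes) - 1:
        return "star"
    # Chain: follow the only child at each step; must visit every node.
    visited, node = 1, root
    while len(children[node]) == 1:
        node = children[node][0]
        visited += 1
    if not children[node] and visited == len(nodes):
        return "chain"
    return "other"


print(classify_cascade([("b", "a"), ("c", "a"), ("d", "a")]))
print(classify_cascade([("b", "a"), ("c", "b"), ("d", "c")]))
print(classify_cascade([("b", "a"), ("c", "a"), ("d", "b")]))
```

Tallying the labels by cascade size would give the per-shape counts whose distributions are summarized above.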
38Completed Work
- Gelling point, CCs
- Weighted laws
- Cascades laws
- Cascades as features
39Completed Work
- Gelling point, CCs
- Weighted laws
- Cascades laws
- Cascades as features
- Can we develop predictive models for cascades?
40Cascade Models CGM
- Cascade Generation Model
- Overview: Produce realistic cascades through an emergent viral model
- Details: See SDM07
41Cascade Models CGM
(Figure: most frequent cascades, model vs. data)
43Completed Work
- Gelling point, CCs
- Weighted laws
- Cascades laws
- Cascades as features
- Cascade generation model
- ZC model
44Completed Work
- Gelling point, CCs
- Weighted laws
- Cascades laws
- Cascades as features
- Cascade generation model
- ZC model
- How can we compare communities?
47Community Tools SNARE
- Problem: Given a network and some domain knowledge about suspicious nodes (flags), determine which nodes are most risky.
- Data: Accounting transaction data. Nodes are accounts; edges are transactions between accounts.
(Figure: transaction graph linking Accounts Payable, Revenue Accts, and Accounts Receivable)
48Community Tools SNARE
- Example: channel stuffing
  - Some accounts are overstated,
  - but other accounts are also involved.
  - Since many accounts are slightly affected, it is easy to cover up activity.
(Figure: accounts graph (Accounts Payable, Revenue Accts, Accounts Receivable) with a risk scale from "Very risky" to "Not risky")
49Community Tools SNARE
- Social Network Analytic Risk Evaluation
- Use domain knowledge to flag certain nodes.
- Assume homophily between nodes (guilt by association).
- Then, using initial risk as initial node potentials, use belief propagation (message passing between nodes) to determine end risk scores.
50Community Tools SNARE
- Belief Propagation
  - Flags are node potentials, or initial risk scores.
  - All nodes send messages back and forth with beliefs.
  - Upon convergence, the end result will reflect the riskiest nodes.
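The message-passing steps above can be sketched as generic sum-product belief propagation over two states (not risky, risky). This is an illustration under stated assumptions, not SNARE's actual implementation: the function `belief_propagation`, the symmetric edge potential controlled by a hypothetical `homophily` parameter, and the three-account example are all invented for this sketch.

```python
def belief_propagation(edges, priors, homophily=0.8, iterations=20):
    """Loopy sum-product BP for binary risk labels.

    edges:  iterable of undirected (u, v) pairs.
    priors: {node: initial P(risky)} for every node (from flags).
    Returns {node: final belief that the node is risky}.
    """
    neighbors = {node: set() for node in priors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Node potentials phi[node][state], state 0 = not risky, 1 = risky.
    phi = {n: (1.0 - p, p) for n, p in priors.items()}
    # Edge potential (assumed): neighbors tend to share a state.
    psi = ((homophily, 1.0 - homophily), (1.0 - homophily, homophily))
    msgs = {(u, v): (0.5, 0.5) for u in neighbors for v in neighbors[u]}
    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= msgs[(k, i)][xi]
                    total += phi[i][xi] * psi[xi][xj] * incoming
                out.append(total)
            z = out[0] + out[1]
            new_msgs[(i, j)] = (out[0] / z, out[1] / z)  # normalize
        msgs = new_msgs
    beliefs = {}
    for i in neighbors:
        b = [phi[i][0], phi[i][1]]
        for k in neighbors[i]:
            for x in (0, 1):
                b[x] *= msgs[(k, i)][x]
        beliefs[i] = b[1] / (b[0] + b[1])
    return beliefs


# Hypothetical example: account "a" is flagged; "b" and "c" are not.
beliefs = belief_propagation(
    edges=[("a", "b"), ("b", "c")],
    priors={"a": 0.9, "b": 0.5, "c": 0.5},
)
for node in sorted(beliefs):
    print(node, round(beliefs[node], 3))
```

In this toy chain the flagged account keeps its high score, and the unflagged accounts inherit risk that decreases with distance from the flag, which is the guilt-by-association effect described above.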
51Community Tools SNARE
- Produces improvement over simply using flags
  - Lift factor up to 6.5
  - Improvement especially at low false positive rates
(Figure: ROC curve for accounts data, true positive rate vs. false positive rate, comparing Ideal, SNARE, and Baseline (flags only))
52Community Tools SNARE
- Accurate: Produces large improvement over simply using flags
- Flexible: Can be applied to other domains
- Scalable: One iteration of BP runs in time linear in the number of edges
- Robust: Works on a large range of parameters
53Completed Work
- Gelling point, CCs
- Weighted laws
- Cascades laws
- Cascades as features
- Cascade generation model
- ZC model
55The rest of the talk
- Motivation and thesis statement
- Completed work
- Proposed work
- Conclusions and impact
- Audience participation!
56Proposed Work
- 2 main problems
- P1: Cascades and product adoption
  - How do cascades vary according to network structure?
  - Can we use cascades to model product adoption?
- P2: Predicting success/failure of online groups
57- P1a How do cascades compare across network
structures?
- P1b Can we use cascades to model product
adoption?
- P2 Can we predict success/failure of groups?
59P1a Cascades Network Structure
- In different networks, how does the starting point of an epidemic affect the epidemic size?
- Which modifications to the current model change the cascades (weights, self-infection)?
- Can we reverse-engineer network properties based on observed cascades?
60- P1a How do cascades compare across network
structures?
- P1b Can we use cascades to model product
adoption?
- P2 Can we predict success/failure of groups?
61P1b Cascades Product Adoption
- Examine adoption of Caller Ringback Tones (CRBT)
  - User buys ringtone
  - Friend calls user, hears CRBT
- Phone call data
  - Nodes: user ID, DOB, salutation (Mr/Ms), date of joining, data plan
  - Call edges: src/dest ID, call time, duration
  - SMS edges: src/dest ID, time
  - CRBT purchases: purchase date, song name, cost
62P1b Cascades Product Adoption
- Can we fit the Bass Model for different CRBTs?
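For reference, the Bass model gives the cumulative fraction of adopters at time $t$ as $F(t) = \frac{1 - e^{-(p+q)t}}{1 + (q/p)\,e^{-(p+q)t}}$, where $p$ is the coefficient of innovation and $q$ the coefficient of imitation. The sketch below only evaluates this curve; the function names, the market size, and the parameter values are hypothetical and are not fitted to any CRBT data.

```python
import math


def bass_cumulative_fraction(t, p, q):
    """Cumulative fraction of eventual adopters at time t (Bass model)."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_adoptions(m, p, q, periods):
    """New adopters in each of periods 1..periods for market size m."""
    cumulative = [m * bass_cumulative_fraction(t, p, q)
                  for t in range(periods + 1)]
    return [cumulative[t] - cumulative[t - 1] for t in range(1, periods + 1)]


# Hypothetical parameters for one ringback tone.
m, p, q = 1000, 0.03, 0.38
per_period = bass_adoptions(m, p, q, periods=10)
for t, new in enumerate(per_period, start=1):
    print(t, round(new, 1))
```

Fitting the model to a particular CRBT would then amount to choosing m, p, and q so that these per-period values match the observed purchase counts, for example by nonlinear least squares.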
63P1b Cascades Product Adoption
- Are some CRBTs more viral than others? Does the footprint follow a skewed distribution?
- How long after purchase is a CRBT infective?
(Figure: survival function $P(X > x)$ vs. number of downloads (per song))
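The survival function in the figure can be computed directly from per-song download counts. A minimal sketch follows; the function `empirical_survival` and the toy counts are assumptions for illustration.

```python
def empirical_survival(values):
    """Return (x, P(X > x)) for each distinct observed value x, ascending."""
    n = len(values)
    ordered = sorted(values)
    points = []
    i = 0
    while i < n:
        x = ordered[i]
        while i < n and ordered[i] == x:
            i += 1  # skip past all copies of x
        points.append((x, (n - i) / n))
    return points


# Hypothetical download counts for four songs.
downloads = [1, 1, 2, 5]
survival = empirical_survival(downloads)
print(survival)
```

Plotting these points on log-log axes makes a skewed (heavy-tailed) distribution show up as a roughly straight line.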
64P1b Cascades Product Adoption
- How does the weight of a link, homophily, or other factors affect the likelihood of transmission?
- Can we explicitly test whether a purchase is a result of basic similarity of neighbors or a result of viral propagation?
- How can we build and verify a model for this propagation?
65- P1a How do cascades compare across network
structures?
- P1b Can we use cascades to model product
adoption?
- P2 Can we predict success/failure of groups?
66P2 Success Failure of Online Groups
- Use data over 4 years from nearly 200 newsgroups (political Usenet).
- Many discussion groups stop posting by the third year.
  - Why?
67P2 Success Failure of Online Groups
- P2 Questions
  - If structural network characteristics can be traced to success or failure, which features are most predictive?
  - Can we test causality in the predictive characteristics?
68Timeline
- May 09: P1 preliminaries
- Jun 09: Internship at Google
- Sep 09: P1a Cascades and network structure
- Nov 09: P1b Cascades and product adoption
- Mar 10: P2 Success/failure of online groups
- Jul 10: Complete document
- Aug 10: Defend
69Related work
- Topology
  - Heavy-tailed degree distributions: Faloutsos99, Albert02, Kleinberg99
  - Shrinking diameter, densification: Leskovec05
  - Random graphs model: Erdos60
  - Forest Fire model: Leskovec05
  - Winners do not take all model: Pennock02
- Cascades
  - Recommendations: Leskovec06
  - Diffusion in blogs: Adar03, Gruhl04, Kempe03, Kumar03
  - Marketing: product adoption Bass69, word-of-mouth Godes04
  - Virus propagation: populations Hethcote; networks Boguna, Pastor-Satorras, Chakrabarti
- Communities and other applications
  - Securities fraud detection: Neville05, Fast07
  - Author identification: Hill04
  - Online group behavior: Backstrom08
70Conclusions Completed
- Demonstrated several properties common to networks in a wide range of domains.
  - Oscillating sizes of next-largest connected components
  - Power laws for weighted graphs
- Butterfly model generates these properties
71Conclusions Completed
- Studied and modeled cascades in blogs
- Several power laws for cascade shapes and size
- Cascade Generation Model
- Devised SNARE for anomaly detection for
accounting data (lift factor up to 6.5)
72Conclusions Proposed
- P1a: Continue cascade studies across network structures
- P1b: Use cascades to model purchases in phone-call graph
- P2: Build predictive models for success and failure in online groups
73References
- Topology
  - KDD08: M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD. Las Vegas, Nev., August 2008.
  - ICDM08: L. Akoglu, M. McGlohon, and C. Faloutsos. RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs. ICDM. Pisa, Italy, Dec. 2008.
- Cascades
  - SDM07: J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Patterns of Cascading Behavior in Large Blog Graphs. SDM. Minneapolis, Minn., April 2007.
  - ICWSM07: M. McGlohon, J. Leskovec, C. Faloutsos, N. Glance, and M. Hurst. Finding patterns in blog shapes and blog evolution. ICWSM. Boulder, Colo., March 2007.
  - ICWSM09-1: M. Goetz, J. Leskovec, M. McGlohon, and C. Faloutsos. Modeling Blog Dynamics. ICWSM. San Jose, Calif., May 2009.
74References
- Community
  - KDD09: M. McGlohon, S. Bay, M. Anderle, D. Steier, and C. Faloutsos. SNARE: A Link Analytic System for Evaluating Fraud Risk. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIG-KDD). Paris, France, June 2009.
  - ICWSM09-2: M. McGlohon and M. Hurst. Community Structure and Information Flow in Usenet: Improving analysis with a thread ownership model. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009.
  - ICWSM09-3: M. McGlohon and M. Hurst. Considering the Sources: Comparing linking patterns in Usenet and blogs. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009.
75- Acknowledgments
- Leman Akoglu
- Markus Anderle
- Stephen Bay
- Polo Chau
- Christos Faloutsos
- Natalie Glance
- Mila Goetz
- Geoff Gordon
- Matthew Hurst
- i-Lab
- David Jensen
- Ramayya Krishnan
- Jure Leskovec
- Austin McDonald
- Alan Montgomery
- Chris Neff
- Nachi Sahoo
- Purna Sarkar
- Support
- PricewaterhouseCoopers
- Microsoft Live Labs
- NSF Graduate Research Fellowship
- Yahoo! Key Technical Challenges Grant, Pennsylvania Infrastructure Technology Alliance (PITA)
- Hewlett-Packard
- NSF Grants No. IIS-0705359, IIS-0534205, CNS-0721736, 0209107, SENSOR-0329549, EF-0331657, and IIS-0326322
- U.S. Department of Energy, Lawrence Livermore National Laboratory, contract No. W-7405-ENG-48.
76Audience participation!
78Talk expansion pack
79P1b Other Cascade Data
- Post data from corporate blogs
  - Demographic data on bloggers (employee ID, location, job description)
  - Read data (timestamped)
  - Write data (timestamped)
- CRBT adoption in general
  - Perhaps people do not adopt particular songs, but the CRBT mechanism
- More public blog data (spinn3r)
  - Also use edge information from blogrolls/comments
80P2 Potential features to examine
- Posting behavior
  - Which users are posting, how often are they posting, and how skewed is the distribution?
- Linking behavior
  - How long are cascades (threads), in terms of posts and time?
- Content
  - Topics, keywords, sentence length, other textual features, sentiment analysis
81Unipartite Networks
- Postnet: Posts in blogs, hyperlinks between them
- Blognet: Aggregated Postnet, repeated edges
- Patent: Patent citations
- NIPS: Academic citations
- Arxiv: Academic citations
- NetTraffic: Packets, repeated edges
- Autonomous Systems (AS): Packets, repeated edges
4 million nodes, 8 million edges, 17 years
82Bipartite Networks
- IMDB: Actor-movie network
- Netflix: User-movie ratings
- DBLP conference: repeated edges
  - Author-Keyword
  - Keyword-Conference
  - Author-Conference
- US Election Donations: weights, repeated edges
  - Orgs-Candidates
  - Individuals-Orgs
6 million nodes, 10 million edges, 22 years
83Topological Models Butterfly
84Topological Models Butterfly
- Nodes may have multiple hosts ( ).
- Joins components
85Topological Models RTM
- Recursive Tensor Model
- Goal: introduce time and burstiness
- Main idea: Begin with a core tensor (multidimensional array), and use self-similarity to reproduce observed power laws.
86Topological Models RTM
- Self-similarity arises from the Kronecker product
- 2D: Leskovec06
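In the 2D case, repeatedly taking the Kronecker product of a small initiator matrix with itself yields a self-similar adjacency matrix. The following sketch shows only this mechanism in plain Python; the function names and the 2x2 initiator are assumptions for illustration and are not the parameters used in Leskovec06 or ICDM08.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


def kronecker_power(initiator, k):
    """k-fold Kronecker product of the initiator with itself (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kronecker(result, initiator)
    return result


# Hypothetical 2x2 initiator adjacency matrix.
initiator = [[1, 1],
             [1, 0]]
graph = kronecker_power(initiator, 2)
for row in graph:
    print(row)
```

Each block of the result is either a copy of the initiator or all zeros, arranged according to the initiator itself, which is where the self-similarity comes from. The 3D variant applies the same idea to a core tensor, with time as the third dimension.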
87Topological Models RTM
- 3D: Use the Kronecker product on a core tensor
- Reproduced power laws as found in ICDM08
(Figure: adjacency matrix)
88Topological Models RTM
- 3D: Use the Kronecker product on a core tensor
- Reproduced power laws as found in ICDM08
(Figure: the 3rd dimension is time)
89Topological Applications Oddball
- Main ideas
  - Use the local neighborhood of a node
  - Find common patterns
  - Score how much a node deviates from common patterns
- Results
  - Identified anomalous nodes such as Ken Lay in Enron, and particularly different blog posts
90Cascade Models CGM
91Cascade Models Zero-crossing
- Main ideas
  - Models blogs in both network growth and network diffusion
  - Choose to post based on a random walk (produces burstiness)
  - Link based on recency and popularity (reproduces the -1.5 law and skewed degree)
  - Improvement over CGM because the network is generated
92Community Observations Newsgroups
- Observation: Threads introduced to a group later in the thread tended to have more activity from that group.
- Observation: Discussions tended to flow from main groups (can.politics) into subgroups (ab.politics, bc.politics).
93Community Observations Newsgroups
- 189 newsgroups ("polit" in name), January 2004-June 2008
- 37 million posts
- Includes many countries, provinces, states, topical groups (alt.politics.guns)
- Major issue: over half are cross-posted to multiple groups. Where is conversation truly occurring?
94Community Observations Newsgroups
- Solution: Introduce thread ownership, by assigning threads according to where authors exclusively post.
95Community Observations Newsgroups
- Observation: Discussions tended to flow from main groups (can.politics) into subgroups (ab.politics, bc.politics).
96Completed Work
- What patterns are common to networks?
- Can we develop generative models and detect
anomalies?
- What are patterns of cascades in networks?
- Can we develop predictive models for cascades?
- How can we compare communities?
- Can we detect anomalies, and predict group
behavior?