Structural Analysis in Large Networks Observations and Applications

Provided by: marymc7

Learn more at: http://www.cs.cmu.edu

1
Structural Analysis in Large Networks: Observations
and Applications

Mary McGlohon
Committee
Christos Faloutsos, co-chair
Alan Montgomery, co-chair
Geoffrey Gordon
David Jensen, University of Massachusetts, Amherst

2
Motivation

Network (a.k.a. graph, relational, social
network) data has become ubiquitous. We want to
know:
1. How do networks form and structure themselves?
2. How does information propagate through networks?
3. How do sub-communities form?

(Figure: example networks: computer networks, Facebook, IMDB actor-movie)
3
Outline for thesis
4
Motivation: Topology

How do these network structures form?
Example: identify topological properties common
to many different types of graphs (citations,
friendships, etc.)
Developing models of these properties allows for
forecasting.
5
Motivation: Cascades

Once the networks form, how does information
propagate through the graph?
Example: Extract, analyze, and model cascades.
6
Motivation: Community

How do we compare communities, or sub-networks?
Example: For a set of online groups (Usenet),
which ones continue to thrive over time?
7
Thesis statement

We propose to investigate how interactions in
graphs occur, how these interactions lead to
diffusion and community behavior, and to model
these behaviors and apply these findings to
real-world problems.
9
Impact

Understanding the relations found in networks has
many applications, such as:
Fraud/anomaly detection: Given typical behavior and
information about nodes/edges, how suspicious is a
node or group of nodes?
Ad personalization/recommendation systems: Given
some information about an individual and their
friends, which ads to display?
Resource allocation: Given typical patterns of
network growth, how can we allocate resources
(hardware, advertising budget, etc.)?

10
Completed Work

KDD08
ICDM08

SDM07

ICWSM07

ICWSM09-1

ICWSM09-2
ICWSM09-3

KDD09

- to appear
11
Proposed Work
P1a How do cascades compare across network
structures?
P1b Can we use cascades to model product
adoption?
P2 Can we predict success/failure of groups?
12
The rest of the talk

Motivation and thesis statement
Completed work
Proposed work
Conclusions and impact
Audience participation!

13
Completed Work

What patterns are common to networks?

14
Topological Observations

Diameter over time

Connected components

(Kevin Bacon)

Edge weights

15
Topological Observations Data

Analyze unipartite and bipartite networks
Networks are evolving over time
Networks may be weighted
- Repeated edges
- Edge weights
Unipartite: Citations, Blogs, Router traffic
Bipartite: IMDB Actor-Movie, Campaign contributions

(Figure: example graphs with nodes n1-n4 and m1-m3)
16
Topological Observations Gelling Point

When does a graph begin displaying expected
patterns, such as the giant connected component?
How can we tell when this happens?

17
Topological Observations Gelling Point

Observation: Most real graphs display a gelling
point, where the graph begins to come together
and the giant connected component forms. After
that point, they exhibit typical behavior.

(Figure: diameter over time for IMDB; gelling point at t = 1914)
18
Topological Observations NLCCs

In graphs a giant connected component emerges.
We look at sizes of the next-largest connected
components (NLCCs)
After gelling point, do they continue to grow? Do
they shrink?

19
Topological Observations NLCCs

Observation: After the gelling point, the giant
connected component takes off, but the next-largest
connected components remain constant or oscillate.

(Figure: sizes of the 2nd and 3rd connected components over time for IMDB; gelling point at t = 1914)
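
A minimal sketch of how component sizes could be tracked as edges arrive, assuming the graph is given as a time-ordered list of undirected edges. The function name and the union-find approach are illustrative assumptions, not the method used in the thesis.

```python
def component_sizes_over_time(edges, top_k=3):
    """Return the top_k connected-component sizes after each edge arrival."""
    parent = {}   # union-find parent pointers
    size = {}     # size of each component, keyed by its root

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    history = []
    for u, v in edges:
        for node in (u, v):
            if node not in parent:
                parent[node] = node
                size[node] = 1
        ru, rv = find(u), find(v)
        if ru != rv:
            # Attach the smaller component to the larger one.
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size.pop(rv)
        # Sorting every step is simple but slow; fine for a small sketch.
        history.append(sorted(size.values(), reverse=True)[:top_k])
    return history


history = component_sizes_over_time([(1, 2), (3, 4), (2, 3), (5, 6)])
print(history)
```

The first entry of each snapshot is the giant connected component; the remaining entries are the next-largest components whose sizes the observation describes.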
20
Topological Observations Weights

How are edges in a graph repeated, or otherwise
weighted?
As the number of edges increases, does the total
edge weight grow linearly?

21
Topological Observations Weights

Observation: Weight additions follow a power law
with respect to the number of edges:
$W(t) \propto E(t)^{w}$
$W(t)$: total weight of graph at time $t$
$E(t)$: total edges of graph at time $t$
$w$: power-law exponent ($w > 1$)
Many other weighted laws:
see KDD08, ICDM08

(Figure: log(Weights) vs. log(Edges) for Orgs-Candidates; slope = 1.3)
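
A power law between total weight and edge count appears as a straight line on log-log axes, and the slope of that line is the exponent. The sketch below estimates the slope by ordinary least squares on the logged values. The function name and the synthetic data are assumptions for illustration; least squares on logs is a simple estimator and not necessarily the one used in KDD08.

```python
import math


def fit_power_law_exponent(edge_counts, total_weights):
    """Estimate w in W = c * E**w from the log-log least-squares slope."""
    xs = [math.log(e) for e in edge_counts]
    ys = [math.log(w) for w in total_weights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots that follow W = E**1.3 exactly (not real data).
edge_counts = [10, 100, 1000, 10000]
total_weights = [e ** 1.3 for e in edge_counts]
print(fit_power_law_exponent(edge_counts, total_weights))
```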
23
Completed Work

Gelling point, CCs
Weighted laws

24
Completed Work

Gelling point, CCs
Weighted laws

Can we develop generative models?

25
Topological Models Butterfly

Goals are to generate
Constant/oscillating NLCCs
Densification power law Leskovec05
Shrinking diameter (after gelling point)
Power-law degree distribution
Emergent, local, intuitive behavior

26
Topological Models Butterfly

Main idea: uses 3 parameters
Curiosity: how much to explore the local network
(U(0,1), creates power-law degree distribution)
Flyout: how many local networks to explore
(global, joins components)
Friendliness: how often to connect (global,
allows new components)
Details: see KDD08

27
Topological Models Butterfly
29
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

30
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

What are patterns of cascades in networks?

31
Cascade Observations Data

Gathered from August-September 2005
Used set of 44,362 blogs, traced cascades
2.4 million posts
245,404 blog-to-blog links

(Figure: number of posts over time, in 1-day bins; axis marks at Jul 4, Aug 1, Sep 29)
32
Cascade Observations Prelims
(Figure: blogosphere example with star and chain cascade shapes)

How quickly does a link to a post occur?
What size do cascades typically reach?
What are typical shapes? How often do stars
and chains occur?
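
A minimal sketch of how cascades could be grouped and labeled as stars or chains. It assumes each linking post points to exactly one earlier post, so every cascade is a tree. That simplification, the function name, and the labels "pair" and "other" are assumptions for illustration and do not come from SDM07.

```python
from collections import defaultdict


def classify_cascades(parent_of):
    """Map each cascade root to 'pair', 'star', 'chain', or 'other'.

    parent_of maps a linking post to the single earlier post it links to.
    """
    def root(post):
        # Walk up the links until a post that links to nothing.
        while post in parent_of:
            post = parent_of[post]
        return post

    members = defaultdict(set)
    for child in parent_of:
        r = root(child)
        members[r].update((child, r))

    shapes = {}
    for r, nodes in members.items():
        # Count how many posts link directly to each post.
        children = defaultdict(int)
        for n in nodes:
            if n != r:
                children[parent_of[n]] += 1
        if len(nodes) == 2:
            shapes[r] = "pair"    # two posts: both a star and a chain
        elif children[r] == len(nodes) - 1:
            shapes[r] = "star"    # every other post links to the root
        elif max(children.values()) == 1:
            shapes[r] = "chain"   # no post is linked to more than once
        else:
            shapes[r] = "other"
    return shapes


links = {"b": "a", "c": "a", "d": "a", "y": "x", "z": "y"}
print(classify_cascades(links))
```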

33
Temporal Observations

How quickly does a link to a post occur?
Does popularity decay at a constant rate?
With an exponential (half life)?

(Figure: panels on linear-linear, log-linear, and log-log scales)
34
Cascade Observations Link Popularity

Observation: The probability that a post written
at time $t_p$ acquires a link at time $t_p + \Delta$ is
$p(t_p + \Delta) \propto \Delta^{-1.5}$
Similar to Vazquez06

35
Cascade Observations Cascade Size

Q: What size distribution do cascades follow? Are
large cascades frequent?
Observation: The probability of observing a
cascade of $n$ blog posts follows a Zipf
distribution:
$p(n) \propto n^{-2}$

(Figure: log(Count) vs. log(Cascade size) (# of nodes); slope = -2)
36
Cascade Observations Cascade Size

Q: What is the distribution of particular cascade
shapes?
Observation: Stars and chains in blog cascades
also follow a power law, with different exponents
(star -3.1, chain -8.5).

38
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

39
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Can we develop predictive models for cascades?

40
Cascade Models CGM

Cascade Generation Model
Overview: Produce realistic cascades through an
emergent viral model
Details: See SDM07

41
Cascade Models CGM
(Figure: most frequent cascades, model vs. data)
43
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Cascade generation model
ZC model

44
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Cascade generation model
ZC model

How can we compare communities?

45
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Cascade generation model
ZC model

Political Usenet study

46
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Cascade generation model
ZC model

Political Usenet study

Can we detect anomalies?

47
Community Tools SNARE

Problem: Given a network and some domain
knowledge about suspicious nodes (flags),
determine which nodes are most risky.
Data: Accounting transaction data. Nodes are
accounts, edges are transactions between
accounts.

(Figure: account network with Accounts Payable, Revenue Accts, Accounts Receivable)
48
Community Tools SNARE

Example: Channel stuffing
Some accounts are overstated, but other accounts
are also involved.
Since many accounts are slightly affected, it is
easy to cover up activity.

(Figure: Accounts Payable, Revenue Accts, and Accounts Receivable, with risk ranging from Not risky to Very risky)
49
Community Tools SNARE

Social Network Analytic Risk Evaluation
Use domain knowledge to flag certain nodes.
Assume homophily between nodes (guilt by
association)
Then, using initial risk as initial node
potentials, use belief propagation (message
passing between nodes) to determine end risk
scores.

50
Community Tools SNARE

Belief Propagation
Flags are node potentials, or initial risk
scores
All nodes send messages back and forth with
beliefs
Upon convergence, the end result will reflect
the riskiest nodes.
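
The message-passing steps above can be sketched with loopy sum-product belief propagation over two states (not risky, risky). This is a minimal illustration under stated assumptions: the homophily strength `eps`, the fixed iteration count, the example priors, and the function name are all hypothetical and are not taken from the SNARE system. Priors should lie strictly between 0 and 1.

```python
def belief_propagation(edges, prior_risk, eps=0.1, iters=20):
    """Return P(risky) for every node after loopy sum-product BP."""
    nbrs = {n: set() for n in prior_risk}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # Node potentials: (not risky, risky), taken from the initial flags.
    phi = {n: (1.0 - p, p) for n, p in prior_risk.items()}
    # Edge potential encoding homophily: neighbours tend to share a state.
    psi = ((0.5 + eps, 0.5 - eps), (0.5 - eps, 0.5 + eps))
    # One message per directed edge, initialised to uniform.
    msg = {(u, v): (0.5, 0.5) for u in nbrs for v in nbrs[u]}

    for _ in range(iters):
        new_msg = {}
        for (u, v) in msg:
            out = []
            for state_v in (0, 1):
                total = 0.0
                for state_u in (0, 1):
                    term = phi[u][state_u] * psi[state_u][state_v]
                    # Combine messages into u from all neighbours except v.
                    for w in nbrs[u]:
                        if w != v:
                            term *= msg[(w, u)][state_u]
                    total += term
                out.append(total)
            z = out[0] + out[1]
            new_msg[(u, v)] = (out[0] / z, out[1] / z)
        msg = new_msg

    beliefs = {}
    for n in nbrs:
        b0, b1 = phi[n]
        for w in nbrs[n]:
            b0 *= msg[(w, n)][0]
            b1 *= msg[(w, n)][1]
        beliefs[n] = b1 / (b0 + b1)
    return beliefs


# Account A is flagged; B and C start with no evidence either way.
risk = belief_propagation([("A", "B"), ("B", "C")],
                          {"A": 0.9, "B": 0.5, "C": 0.5})
print(risk)
```

In this toy chain, risk spreads from the flagged account to its neighbour and, more weakly, to the neighbour's neighbour, which is the guilt-by-association effect the slides describe.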

51
Community Tools SNARE

Produces improvement over simply using flags
Lift factor of up to 6.5
Improvement especially at low false positive
rates

(Figure: ROC curve for accounts data, true positive rate vs. false positive rate; curves: Ideal, SNARE, Baseline (flags only))
52
Community Tools SNARE

Accurate: Produces large improvement over simply
using flags
Flexible: Can be applied to other domains
Scalable: One iteration of BP runs in time linear
in the number of edges
Robust: Works over a large range of parameters

54
Completed Work

Gelling point, CCs
Weighted laws

Butterfly
RTM
Oddball

Cascades laws
Cascades as features

Cascade generation model
ZC model

Political Usenet study

SNARE

56
Proposed Work

2 main problems
P1 Cascades and product adoption
How do cascades vary according to network
structure?
Can we use cascades to model product adoption?
P2 Predicting success/failure of online groups


59
P1a Cascades Network Structure

In different networks, how does the starting point
of an epidemic affect the epidemic size?
Which modifications to the current model change
the cascades (weights, self-infection)?
Can we reverse-engineer network properties based
on observed cascades?

61
P1b Cascades Product Adoption

Examine adoption of Caller Ringback Tones (CRBT):
User buys ringtone
Friend calls user, hears CRBT
Phone call data:
Nodes: User ID, DOB, salutation (Mr/Ms), date of
joining, data plan
Call edges: src/dest ID, call time, duration
SMS edges: src/dest ID, time
CRBT purchases: purchase date, song name, cost

62
P1b Cascades Product Adoption

Can we fit the Bass Model for different CRBTs?
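
For context, the Bass diffusion model (Bass69 in the related work) gives the cumulative fraction of eventual adopters at time t in closed form, using an innovation coefficient p and an imitation coefficient q. The sketch below evaluates that curve. The parameter values are illustrative assumptions only and are not fitted to any CRBT data.

```python
import math


def bass_cumulative_adoption(t, p, q):
    """Fraction of eventual adopters who have adopted by time t.

    Closed form: F(t) = (1 - e^{-(p+q)t}) / (1 + (q/p) e^{-(p+q)t}).
    """
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


# Illustrative coefficients (assumed, not estimated from data).
p, q = 0.03, 0.38
for t in (0, 5, 10, 20):
    print(t, round(bass_cumulative_adoption(t, p, q), 3))
```

Fitting the model to a particular CRBT would mean choosing p and q (and a market size) so that this curve matches the observed cumulative purchases.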

63
P1b Cascades Product Adoption

Are some CRBTs more viral than others? Does
the footprint follow a skewed distribution?
How long after purchase is a CRBT infective?

(Figure: survival function $P(X > x)$ vs. number of downloads (per song))
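
The survival function P(X > x) is the fraction of songs with more than x downloads; a skewed distribution shows up as a long, slowly decaying tail. A minimal sketch of the empirical version follows, with a hypothetical function name and made-up download counts.

```python
def empirical_survival(values):
    """Return (x, P(X > x)) for each distinct x, in increasing order."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points


# Made-up download counts for four songs.
print(empirical_survival([1, 1, 2, 5]))
```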
64
P1b Cascades Product Adoption

How does the weight of a link, homophily, or
other factors affect the likelihood of
transmission?
Can we explicitly test whether a purchase is a
result of basic similarity of neighbors or a
result of viral propagation?
How can we build and verify a model for this
propagation?

66
P2 Success Failure of Online Groups

Use data over 4 years from nearly 200 newsgroups.
(Political Usenet)
Many discussion groups stop posting by the third
year.
Why?

67
P2 Success Failure of Online Groups

P2 Questions
If structural network characteristics can be
traced to success or failure, which features are
most predictive?
Can we test causality in the predictive
characteristics?

68
Timeline
May 09: P1 preliminaries
Jun 09: Internship at Google
Sep 09: P1a Cascades and network structure
Nov 09: P1b Cascades and product adoption
Mar 10: P2 Success/failure of online groups
Jul 10: Complete document
Aug 10: Defend
69
Related work

Topology
Heavy-tailed degree distributions Faloutsos99
Albert02 Kleinberg99
Shrinking diameter, densification Leskovec05
Random graphs model Erdos60
Forest Fire model Leskovec05
Winners do not take all model Pennock02
Cascades
Recommendations Leskovec06
Diffusion in blogs Adar03 Gruhl04
Kempe03 Kumar03
Marketing Product adoption Bass69,
Word-of-mouth Godes04
Virus propagation Populations Hethcote,
Networks Boguna, Pastor-Satorras, Chakrabarti
Communities and other applications
Securities fraud detection Neville05 Fast07
Author identification Hill04
Online group behavior Backstrom08

70
Conclusions Completed

Demonstrated several properties common to
networks in a wide range of domains.
Oscillating sizes of next-largest connected
components
Power laws for weighted graphs
Butterfly model generates properties

71
Conclusions Completed

Studied and modeled cascades in blogs
Several power laws for cascade shapes and size
Cascade Generation Model
Devised SNARE for anomaly detection for
accounting data (lift factor up to 6.5)

72
Conclusions Proposed

P1a Continue cascade studies across network
structures
P1b Use cascades to model purchases in
phone-call graph
P2 Build predictive models for success and
failure in online groups

73
References

Topology
KDD08 M. McGlohon, L. Akoglu, and C. Faloutsos.
Weighted Graphs and Disconnected Components:
Patterns and a Generator. SIG-KDD. Las Vegas,
Nev., August 2008.
ICDM08 L. Akoglu, M. McGlohon, and C.
Faloutsos. RTM: Laws and a Recursive Generator
for Weighted Time-Evolving Graphs. ICDM. Pisa,
Italy, Dec. 2008.
Cascades
SDM07 J. Leskovec, M. McGlohon, C.
Faloutsos, N. Glance, and M. Hurst. Patterns of
Cascading Behavior in Large Blog Graphs. SDM.
Minneapolis, Minn., April 2007.
ICWSM07 M. McGlohon, J. Leskovec, C. Faloutsos,
N. Glance, and M. Hurst. Finding patterns in blog
shapes and blog evolution. ICWSM. Boulder, Colo.,
March 2007.
ICWSM09-1 M. Goetz, J. Leskovec, M. McGlohon,
and C. Faloutsos. Modeling Blog Dynamics. ICWSM.
San Jose, Calif., May 2009.

74
References

Community
KDD09 M. McGlohon, S. Bay, M. Anderle, D.
Steier, and C. Faloutsos. SNARE: A Link Analytic
System for Evaluating Fraud Risk. ACM Special
Interest Group on Knowledge Discovery and Data
Mining (SIG-KDD). Paris, France, June 2009.
ICWSM09-2 M. McGlohon and M. Hurst. Community
Structure and Information Flow in Usenet:
Improving analysis with a thread ownership model.
International Conference on Weblogs and Social
Media (ICWSM). San Jose, Calif., May 2009.
ICWSM09-3 M. McGlohon and M. Hurst. Considering
the Sources: Comparing linking patterns in Usenet
and blogs. International Conference on Weblogs
and Social Media (ICWSM). San Jose, Calif., May
2009.

Acknowledgments
Leman Akoglu
Markus Anderle
Stephen Bay
Polo Chau
Christos Faloutsos
Natalie Glance
Mila Goetz
Geoff Gordon
Matthew Hurst
i-Lab
David Jensen
Ramayya Krishnan
Jure Leskovec
Austin McDonald
Alan Montgomery
Chris Neff
Nachi Sahoo
Purna Sarkar

Support
PricewaterhouseCoopers
Microsoft Live Labs
NSF Graduate Research Fellowship
Yahoo! Key Technical Challenges Grant
Pennsylvania Infrastructure Technology Alliance
(PITA)
Hewlett-Packard
NSF Grants No. IIS- 0705359, IIS-0534205, and
CNS-0721736, 0209107, SENSOR-0329549, EF-0331657,
IIS-0326322
U.S. Department of Energy Lawrence Livermore
National Laboratory contract No.W-7405-ENG-48.

76
Audience participation!
78
Talk expansion pack
79
P1b Other Cascade Data

Post data from corporate blogs
Demographic data on bloggers (employee ID,
location, job description)
Read data (timestamped)
Write data (timestamped)
CRBT adoption in general
Perhaps people do not adopt particular songs, but
the CRBT mechanism
More public blog data (spinn3r)
Also use edge information from blogrolls/comments

80
P2 Potential features to examine

Posting behavior
Which users are posting, how often are they
posting, and how skewed is the distribution?
Linking behavior
How long are cascades (threads), in terms of post
and time?
Content
Topics, keywords, sentence length, other textual
features, sentiment analysis

81
Unipartite Networks

Postnet: Posts in blogs, hyperlinks between
Blognet: Aggregated Postnet, repeated edges
Patent: Patent citations
NIPS: Academic citations
Arxiv: Academic citations
NetTraffic: Packets, repeated edges
Autonomous Systems (AS): Packets, repeated edges

4 million nodes, 8 million edges, 17 years
82
Bipartite Networks

IMDB: Actor-movie network
Netflix: User-movie ratings
DBLP conference- repeated edges
Author-Keyword
Keyword-Conference
Author-Conference
US Election Donations: weights, repeated edges
Orgs-Candidates
Individuals-Orgs

6 million nodes, 10 million edges, 22 years
83
Topological Models Butterfly
84
Topological Models Butterfly

Nodes may have multiple hosts.
Joins components

85
Topological Models RTM

Recursive Tensor Model
Goal: to introduce time and burstiness
Main idea: Begin with a core tensor
(multidimensional array), and use self-similarity
to reproduce observed power laws.

86
Topological Models RTM

Self similarity arises from Kronecker product
2D

Leskovec06
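
To illustrate how the Kronecker product yields self-similarity in the 2D case, the sketch below repeatedly multiplies a small seed adjacency matrix with itself, so each block of the result is a scaled copy of the seed. The seed matrix and function names are assumptions chosen for illustration; RTM applies the same idea to a 3D core tensor, which is not shown here.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def kronecker_power(seed, k):
    """Kronecker product of the seed with itself, k factors in total."""
    result = seed
    for _ in range(k - 1):
        result = kronecker(result, seed)
    return result


seed = [[1, 1],
        [1, 0]]
for row in kronecker_power(seed, 2):
    print(row)
```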
87
Topological Models RTM

3D: Use Kronecker product on a core tensor
Reproduced power laws as found in ICDM08

(Figure: adjacency matrix, with time as the 3rd dimension)
89
Topological Applications Oddball

Main ideas
Use local neighborhood of node
Find common patterns
Score how much a node deviates from common
patterns
Results
Identified anomalous nodes, such as Ken Lay in the
Enron data and unusually different blog posts
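
A minimal sketch of the first step, extracting features from each node's local neighborhood. It assumes the neighborhood is the 1-step egonet (the node, its neighbors, and the edges among them) and that node and edge counts are the features of interest. Both choices and the function name are assumptions for illustration; the scoring of deviations from common patterns is not shown.

```python
def egonet_features(edges):
    """Return {node: (egonet node count, egonet edge count)}."""
    nbrs = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)

    features = {}
    for ego, neighbours in nbrs.items():
        members = neighbours | {ego}
        # Each undirected edge inside the egonet is seen from both ends.
        edge_count = sum(len(nbrs[m] & members) for m in members) // 2
        features[ego] = (len(members), edge_count)
    return features


print(egonet_features([("c", "a"), ("c", "b"), ("c", "d"), ("a", "b")]))
```

A node whose feature pair lies far from the trend followed by most other nodes would be a candidate anomaly under this reading.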

90
Cascade Models CGM
91
Cascade Models Zero-crossing

Main ideas
Models blogs in both network growth and network
diffusion
Choose to post based on random walk (produces
burstiness)
Link based on recency and popularity (reproduces
-1.5 law and skewed degree)
Improvement over CGM because network is generated

92
Community Observations Newsgroups

Observation Threads introduced to a group later
in the thread tended to have more activity from
that group.
Observation Discussions tended to flow from
main groups (can.politics) into subgroups
(ab.politics, bc.politics)

93
Community Observations Newsgroups

189 newsgroups (polit in name), January
2004-June 2008
37 million posts
Includes many countries, provinces, states,
topical groups (alt.politics.guns)

Major issue: over half are cross-posted to
multiple groups. Where is conversation truly
occurring?
94
Community Observations Newsgroups

Solution: Introduce thread ownership, by
assigning threads according to where authors
exclusively post.

96
Completed Work

What patterns are common to networks?

Can we develop generative models and detect
anomalies?

What are patterns of cascades in networks?

Can we develop predictive models for cascades?

How can we compare communities?

Can we detect anomalies, and predict group
behavior?
