Modeling massive dynamic graphs

1
Modeling massive dynamic graphs
  • Chris Volinsky
  • AT&T Research
  • Statistics Research Department
  • Along with:
  • Deepak Agarwal, Bob Bell, Corinna Cortes,
    Shawndra Hill, Daryl Pregibon,

2
Outline
  • Defining a Dynamic Graph, and our objectives
  • A motivating example: Repetitive fraud in
    telecommunications
  • Our approach: representation and approximation of
    dynamic graphs
  • Revisiting Repetitive Fraud
  • Parameter setting and applications to other
    domains
  • Fraud revisited: applying our methods
  • Other applications, conclusions, further work
  • Summary
  • Analyzing large dynamic graphs of transactional
    data is hard!
  • By storing and analyzing a massive graph as an
    indexed set of small local graphs, we have
    developed a generic framework for dynamic graph
    representation, which facilitates fast and
    accurate analysis.

3
  • Defining a Dynamic Graph, and our objectives

4
Defining Dynamic Graphs
  • Dynamic Graphs represent transactional data
  • Credit card data
  • Web connectivity data
  • Web logs
  • Telecommunications network traffic
  • Online auction data
  • Transactional data can be represented as a
    directed graph

5
Defining Dynamic Graphs
  • Dynamic Graphs
  • Nodes represent transactors
  • Edges are directed transactions
  • All edges have a time stamp
  • All edges have a weight (?)
  • May contain
  • Other attributes on nodes
  • Other attributes on edges
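
As an illustration of the definition above, here is a minimal sketch of timestamped, weighted, directed transactions and of the graph they induce over one time period. The `Transaction` class, the `period_graph` helper, and the sample values are hypothetical names and data chosen for this example, not part of the original talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One directed, timestamped, weighted edge (hypothetical layout)."""
    source: str          # transactor that initiates the transaction
    target: str          # transactor that receives it
    timestamp: float     # any monotone time unit
    weight: float = 1.0  # e.g. call count or call duration


def period_graph(transactions, start, end):
    """Sum the weights of all transactions with start <= timestamp < end."""
    edges = defaultdict(float)
    for tx in transactions:
        if start <= tx.timestamp < end:
            edges[(tx.source, tx.target)] += tx.weight
    return dict(edges)


calls = [
    Transaction("A", "B", timestamp=10.0, weight=2.0),
    Transaction("A", "B", timestamp=20.0, weight=3.0),
    Transaction("B", "C", timestamp=30.0),
    Transaction("A", "C", timestamp=95.0),
]
day_one = period_graph(calls, start=0.0, end=86.0)
```

Keeping edges in a dictionary keyed by `(source, target)` is only one possible layout; it makes it easy to attach further node or edge attributes later.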

6
Analysis of dynamic graphs
  • Why is it hard?
  • Scale
  • Often tens or hundreds of millions of nodes and
    edges
  • Doesn't fit in main memory
  • Dynamic
  • Large numbers of nodes coming and going
    continuously
  • Accounting for temporal component of changing
    graphs is a challenge
  • Heterogeneous nodes
  • Superactive nodes dominate visualization and
    analysis
  • Zipf's law holds: most users have few connections

7
Overview: Our Objectives
  • Define a representation for a dynamic graph
  • What is the graph at time t, G_t?
  • How does one account for addition and attrition
    of nodes?
  • Graphs can be used for
  • Global properties - learning about the graph
    properties
  • Clustering coefficient
  • Power law coefficient
  • Graph depth
  • Local representation - learning about individual
    nodes in the graph
  • Entities: the local subgraph of a node of
    interest
  • Fraud signatures, anomaly (intrusion) detection,
    matching, etc.
  • Our Goal: Methods for entity-based analysis in
    large dynamic graphs

8
  • A motivating example: Repetitive fraud in
    telecommunications

9
Motivating Example: Repetitive Fraud
  • Lots of people can't pay their bill, but they want
    phone service anyway

Name: Ted Hanley
Address: 14 Pearl Dr, St Peters, MN
Balance: 208.00
Disconnected: 2/19/04 (nonpayment)

Name: Debra Handley
Address: 14 Pearl Dr, St Peters, MN
Balance: 142.00
Connected: 2/22/04

Name: Elizabeth Harmon
Address: APT 1045, 4301 ST JOHN RD, SCOTTSDALE, AZ
Balance: 149.00
Disconnected: 2/19/04 (nonpayment)

Name: Elizabeth Harmon
Address: 180 N 40TH PL APT 40, PHOENIX, AZ
Balance: 72.00
Connected: 1/31/04
10
Motivating Example: Repetitive Fraud
How can we identify that it is the same person
behind both accounts?

Field      Old account             New account
Account    67855232344             4215554597
Date       2003-02-25              2003-02-13
Name       DAVID ATKINS            DAVID WATKINS
Address    10 NIGHT WAY APT 114    10 HATSWORTH DR
City       FAYVILLE                BONDALE
State      AL                      AL
Zip        302141798               300021530
II Code    5512127609901           5312074639501
Balance    284.62                  5.83

11
Motivating Example: Challenges
  • This is a problem of record linkage and graph
    matching, but because of obfuscation, we can only
    count on the graph matching!
  • But the problem is huge
  • Q: How can we do efficient matching to put a dent
    in this problem?
  • A: Efficient representation of entities

12
Motivating Example: Our data
  • Our graph is large
  • 350M telephone numbers (TNs) currently active on
    our Long Distance network, 300M calls/day
  • Our graph is dynamic
  • 4 million TNs appear per week
  • 4 million TNs disappear per week

13
Motivating Example: Our data
Our graph is sparse. For one year of long
distance data
14
Motivating Example: Our data
Quite sparse.
Now onto two examples
15
  • Our Approach to Dynamic Graphs
  • Definition
  • Representation
  • Approximation

16
Our Approach: Defining dynamic graphs
  • Q: For transactional data, what does the graph
    at time t (G_t) mean?

- Let g_t be the collection of nodes and edges
during the time period t
Too narrow!
Too broad!
Too many!
17
Our Approach: Defining dynamic graphs
We adopt an Exponentially Weighted Moving Average
(EWMA)
i.e. today's graph is defined recursively as a
convex combination of yesterday's graph and
today's data:
G_t = q G_{t-1} + (1 - q) g_t
Alternatively, starting from an empty graph, this is
G_t = \sum_{i=1}^{t} (1 - q) q^{t-i} g_i
Through time, edge weights decay with decay rate
q
  • Advantages
  • recent data has most influence
  • only one most recent graph need be stored
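
The recursive definition on this slide (a convex combination of yesterday's graph and today's data, with decay rate q) can be sketched as follows. This is a minimal illustration, assuming each graph is stored as a dictionary from directed edge to weight; the function name `ewma_update` and the sample weights are hypothetical.

```python
def ewma_update(yesterdays_graph, todays_data, q):
    """Return today's graph as q * yesterday's graph + (1 - q) * today's data.

    Both inputs map a directed edge (source, target) to a weight.
    Assumes 0 <= q <= 1, so the result is a convex combination.
    """
    # Decay every edge that was already in yesterday's graph.
    todays_graph = {edge: q * w for edge, w in yesterdays_graph.items()}
    # Blend in today's transactions; unseen edges start from zero.
    for edge, w in todays_data.items():
        todays_graph[edge] = todays_graph.get(edge, 0.0) + (1.0 - q) * w
    return todays_graph


yesterday = {("A", "B"): 10.0, ("A", "C"): 4.0}
today = {("A", "B"): 2.0, ("A", "D"): 6.0}
graph = ewma_update(yesterday, today, q=0.9)
```

Only `graph` has to be kept for tomorrow's update, which is the storage advantage listed above: the raw daily data can be discarded once it has been blended in.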

18
Our Approach: Defining dynamic graphs
Selecting q
  • q closer to 1
  • calls decay slower
  • more historical data included
  • smoother
  • q closer to 0
  • faster decay
  • recent calls count more
  • more power to detect changes
  • less smooth

q = 1 - 1/n means an edge's weight reduces to about 1/e times its
original weight in n days, since (1 - 1/n)^n is approximately 1/e
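
A quick numerical check of the 1/e rule of thumb, assuming the decay rate is chosen as q = 1 - 1/n so that a weight is multiplied by q once per day. The helper names are hypothetical.

```python
import math


def decay_rate_for_horizon(n_days):
    """Decay rate q = 1 - 1/n (assumed form of the rule of thumb)."""
    return 1.0 - 1.0 / n_days


def remaining_fraction(q, days):
    """Fraction of an edge's original weight left after `days` daily updates."""
    return q ** days


n = 30
q = decay_rate_for_horizon(n)
left_after_n_days = remaining_fraction(q, n)
gap_to_one_over_e = abs(left_after_n_days - 1.0 / math.e)
```

The approximation tightens as n grows, because (1 - 1/n)^n converges to 1/e.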
19
Our Approach: Representation
  • Since we are interested in entities, we represent
    the graph as a union of entity graphs
  • These entity graphs are the atomic units of
    analysis, a signature of the behavior of the node
  • As it turns out, storing hundreds of millions of
    small graphs is much more efficient than storing
    one massive graph. We use Hancock, a language
    developed at AT&T for signatures, which allows
    for easy indexed storage of
  • In short, we are approximating a huge graph with
    many little graphs, which are easily accessible
    (via indexed storage) for analysis.

20
Our Approach: Representation
Update the graph by updating all of the atomic
units daily, so any time we access the data we
have the most recent representation.
Yesterday's graph
Today's data
Today's graph
1111111111 20.0
2122121212 10.0
9991119999 5.0

2222222222 100.3
1111111111 90.1
3213232423 27.0
9098765453 11.3
8876457326 5.4
2122121212 3.0
9908989898 0.9
8887878787 0.1


21
Our Approach: Approximation
  • We also employ two types of approximation of the
    graph, by pruning.
  • Local pruning of edges: designate a maximal in
    and out degree (k) for each node, and assign an
    overflow bin
  • Global pruning of edges: an overall threshold (e)
    below which edges are removed from the graph
  • Removes stale edges
  • Reduces effect of supernodes

22
Our Approach: Approximation
  • Defending k
  • Most entities have the vast majority of their
    weight in a fraction of their nodes

23
Our Approach: Parameter Selection
  • We have now defined a representation of a dynamic
    graph by three parameters
  • q - controls the decay of edges and edge weights
  • e - global pruning parameter
  • k - local pruning parameter
  • For a given application, we choose the parameters
    by optimizing predictive performance, selecting
    the parameters which optimize a loss function
  • Two loss functions we have used
  • Weighted Dice
  • Hellinger Distance

24
Our Approach: Parameter Selection
  • Let A and B be two entities.
  • Weighted Dice
  • Hellinger Distance
  • For each value
  • Set e to be a low tolerance value
  • For a range of k, optimize q
  • Look at the plot to select parameters
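
The formulas for the two measures did not survive in this transcript. The sketch below uses the standard Hellinger distance between normalized edge weights, and one common weighted generalization of the Dice coefficient; the exact weighted Dice form used in the talk is an assumption here. Entities A and B are dictionaries from neighbor to weight, and the function names are hypothetical.

```python
import math


def weighted_dice(a, b):
    """Share of total weight that sits on neighbors common to both entities.

    Assumed form: sum over shared neighbors of (a_j + b_j), divided by
    the total weight of both entities. 1 means identical neighbor sets.
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = set(a) & set(b)
    return sum(a[n] + b[n] for n in shared) / total


def hellinger_distance(a, b):
    """Standard Hellinger distance between the normalized weight vectors."""
    sum_a, sum_b = sum(a.values()), sum(b.values())
    if sum_a == 0 or sum_b == 0:
        raise ValueError("both entities need positive total weight")
    # Bhattacharyya coefficient: only shared neighbors contribute.
    bc = sum(math.sqrt((a[n] / sum_a) * (b[n] / sum_b)) for n in set(a) & set(b))
    return math.sqrt(max(0.0, 1.0 - bc))


A = {"x": 3.0, "y": 1.0}
B = {"x": 3.0, "y": 1.0}
C = {"z": 2.0}
```

Weighted Dice is a similarity (higher is better) while Hellinger is a distance (lower is better), so a parameter search would maximize the first or minimize the second.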

25
Our Approach: Parameter Selection
26
Our Approach: Summary
  • This method allows us to keep, for all 350 million
    phone numbers we see on the network, an
    up-to-date representation of the entity
    associated with each number. These entities are
    stored in an indexed database for easy storage
    and retrieval.
  • Parameters are set to maximize the
    self-similarity of a node with itself through
    time, to allow us to match entities.
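
One way to read "maximize the self-similarity of a node with itself through time" is a grid search: for a fixed k, replay an entity's daily data, score how well yesterday's signature predicts today's data, and keep the q with the best average score. The sketch below assumes that reading; the similarity measure (a weighted Dice variant), the helper names, and the sample data are all hypothetical.

```python
def ewma_update(signature, today, q):
    """Blend today's data into the signature with decay rate q."""
    updated = {tn: q * w for tn, w in signature.items()}
    for tn, w in today.items():
        updated[tn] = updated.get(tn, 0.0) + (1.0 - q) * w
    return updated


def keep_top_k(signature, k):
    """Local pruning: keep the k heaviest edges."""
    ranked = sorted(signature.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def weighted_dice(a, b):
    """Assumed similarity: share of total weight on shared neighbors."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return sum(a[tn] + b[tn] for tn in set(a) & set(b)) / total


def self_similarity(daily_data, q, k):
    """Average similarity between the running signature and the next day's data."""
    signature, scores = {}, []
    for today in daily_data:
        if signature:
            scores.append(weighted_dice(signature, today))
        signature = keep_top_k(ewma_update(signature, today, q), k)
    return sum(scores) / len(scores) if scores else 0.0


def select_q(daily_data, k, q_grid):
    """Return the q in q_grid with the highest self-similarity for this k."""
    return max(q_grid, key=lambda q: self_similarity(daily_data, q, k))


days = [
    {"X": 5.0, "Y": 1.0},
    {"X": 4.0, "Z": 2.0},
    {"X": 6.0, "Y": 1.0},
]
best_q = select_q(days, k=2, q_grid=[0.5, 0.7, 0.9])
```

In practice the score would be averaged over many entities and repeated for a range of k, which gives the kind of plot the parameter selection slides refer to.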

27
(No Transcript)
28
  • Fraud Revisited: Applying our methods

29
Fraud Revisited: Applying our methods
  • Fraudsters don't get caught, so they go elsewhere
  • Can we find them, based on their calling
    patterns?
  • Intuition: a fraudster may change network identity,
    but the calling pattern will be similar
  • Find overlap in calling pattern between recently
    connected lines and lines shut down for fraud.

30
Fraud Revisited: Applying our methods
  • Repetitive Fraud Process

Connect pool
T
Restrict pool
  • Anchor on a few restrict dates in the pool
  • Compare to all connects in the connect pool
  • For all of those that have at least one overlap,
    collect more information
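
The comparison step above (restricted lines against the connect pool, keeping pairs with at least one overlap) can be sketched with an inverted index, so that each restricted line is only compared with connected lines that share a called number. This is an illustrative sketch; the function name, the pool layout (line -> signature dictionary), and the sample numbers are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(restrict_pool, connect_pool):
    """Find (restricted line, connected line) pairs sharing at least one TN.

    Each pool maps a line to its signature, a dict of neighbor TN -> weight.
    Returns a dict mapping each candidate pair to its set of shared TNs.
    """
    # Inverted index: neighbor TN -> connected lines whose signature has it.
    index = defaultdict(set)
    for line, signature in connect_pool.items():
        for tn in signature:
            index[tn].add(line)

    pairs = defaultdict(set)
    for old_line, signature in restrict_pool.items():
        for tn in signature:
            for new_line in index.get(tn, ()):
                pairs[(old_line, new_line)].add(tn)
    return dict(pairs)


restrict_pool = {"old1": {"T1": 5.0, "T2": 1.0}, "old2": {"T9": 2.0}}
connect_pool = {"new1": {"T1": 3.0, "T2": 4.0}, "new2": {"T7": 1.0}}
matches = candidate_pairs(restrict_pool, connect_pool)
```

The index avoids comparing every restricted line with every new connect, which matters when both pools are large.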

31
Fraud Revisited: Applying our methods
  • For each potential matched pair
  • Calculate a matching score, based on the
    characteristics of the overlap
  • Incorporate other information, including
  • Name, address
  • Payment history
  • Tenure

32
Fraud Revisited: Applying our methods
  • Results
  • We identify 50-100 of these cases a day
  • 95% match rate
  • 85% block rate
  • Credited with saving AT&T $5M in reduced costs
    and uncollectables in 2002-2003
  • By far the most reliable matching criterion is
    entity-based matching

33
Other applications, conclusions, further work
  • We have a representation of a dynamic graph which
    can be applied to any problem where entity
    modeling over time is of interest
  • Other fraud GBA
  • Language models
  • Email
  • Web pages
  • Terrorism
  • Viral Marketing
  • Targeting
  • Understanding

34
Other slides
35
Matching Algorithm
  • What cases will we present to the reps?
  • A combination of
  • COI Overlap measures
  • At least two, and strength determined by
    uniqueness of overlap TNs
  • Name/address overlap
  • Edit distance no more than 50% of the longest
    name or address
  • $ owed
  • Most interested in the ones that will generate
    the most $
  • 500-1000 cases a day become 100-150 that we
    present to the reps
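
The name/address criterion can be sketched with a plain Levenshtein edit distance, assuming the threshold is read as "distance no more than 50% of the length of the longer string". The function names and the `max_fraction` parameter are hypothetical; the talk does not say which edit distance variant was used.

```python
def edit_distance(s, t):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_s
                current[j - 1] + 1,                    # insert char_t
                previous[j - 1] + (char_s != char_t),  # substitute or match
            ))
        previous = current
    return previous[-1]


def fields_overlap(a, b, max_fraction=0.5):
    """True if the edit distance is at most max_fraction of the longer field."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) <= max_fraction * longest


name_distance = edit_distance("DAVID ATKINS", "DAVID WATKINS")
```

For the name pair from the earlier slide a single inserted letter separates the two records, so the pair would pass this filter.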

36
Motivating Example: Repetitive Fraud
  • When we catch a fraudster, we rarely catch the
    person, we simply shut down the line
  • They will likely move on to another attempt at
    defrauding us, from a different network location
  • Idea: record linkage. Network identity has
    changed, but network behavior is the same
  • We can use network behavior to indicate that the
    new line has the same owner as an old line

37
COI Signatures to COI
  • To construct a COI from a COI signature
  • Often the signature contains things we don't
    want
  • Businesses
  • High weight nodes
  • Often the signature doesn't contain things we do
    want
  • Local calls
  • Other carrier calls
  • To combat this, create a COI by
  • Recursively expanding the COI signature
  • Adding edges
  • Pruning edges

Here's an example
38
COI signature
other
me
other
39
Extended COI
other
me
other
40
Enhanced COI
other
me
other
41
Pruned COI
other
me
other
42
A likely case of the same fraudster showing up as
a new number
Pink nodes exist in both COI
43
Fraud Revisited: Applying our methods
  • Calculate the informative overlap score
  • Where
  • w_ao: weight of edge from a to o
  • w_ob: weight of edge from o to b
  • w_o: sum of weights of edges to o
  • d_ao, d_ob: the graph distances from a and b to
    o
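
The score formula itself is missing from this transcript. The sketch below shows one plausible form built only from the quantities defined above: each overlap node o contributes w_ao * w_ob / w_o, discounted by the product of the distances d_ao and d_ob. This exact combination is an assumption made for illustration, not the formula from the slide, and the function name and record layout are hypothetical.

```python
def informative_overlap(overlap_nodes):
    """Sum an assumed per-node contribution over all overlap nodes o.

    Each record holds w_ao, w_ob, w_o, d_ao and d_ob as defined above.
    Assumed form: (w_ao * w_ob / w_o) / (d_ao * d_ob), so nodes that
    receive many edges (large w_o) or are far away contribute less.
    """
    score = 0.0
    for o in overlap_nodes:
        shared_weight = o["w_ao"] * o["w_ob"] / o["w_o"]
        score += shared_weight / (o["d_ao"] * o["d_ob"])
    return score


direct = {"w_ao": 2.0, "w_ob": 3.0, "w_o": 6.0, "d_ao": 1, "d_ob": 1}
distant = {"w_ao": 2.0, "w_ob": 3.0, "w_o": 6.0, "d_ao": 2, "d_ob": 1}
score = informative_overlap([direct, distant])
```

Dividing by w_o reflects the earlier remark that the strength of an overlap depends on the uniqueness of the overlapping TNs: a number that everyone calls says little about who is calling it.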
