Title: Byron Marshall:
1USING IMPORTANCE FLOODING
TO IDENTIFY INTERESTING
NETWORKS OF CRIMINAL
ACTIVITY
- Byron Marshall
- Oregon State University
- Hsinchun Chen
- University of Arizona
ISI 2006 IEEE Intelligence and Security
Informatics Conference May 23-24 San Diego, CA
2The Case for Importance Flooding Analysis in Law
Enforcement (LE)
- The need for feasible, cross-jurisdictional intelligence analysis tools
- The promise of network methodologies
- Helping analysts create link charts
- Criminal Activity Networks (CANs), importance flooding, and path-based importance heuristics
- Importance flooding for LE and more
3We Need Feasible Cross-Jurisdictional Analysis
Methodologies
- Events of the past few years both highlight the need for cross-jurisdictional sharing and demonstrate the difficulty of establishing feasible systems.
- Focusing on investigational usefulness
- Beyond criminal justice data
- Respecting privacy and security issues
- Good sharing systems should support investigations.
4Network-Based Methodologies: The Tool of Choice
- Criminal conspirators receive longer sentences.
- Criminal association networks are understandable
and actionable.
5Network- or Graph-Based Analysis is Well Known in
Law Enforcement (LE)
- From an LE research perspective
- (Sparrow 1991) discussed the investigational implications of social network measures and identified some network properties that would impact real-world analysis applications: size, incompleteness, fuzzy boundaries, and dynamism.
- Social network analysis measures usually evaluate networks based on a single association weight (Coffman 2004, Xu and Chen 2003).
- Shortest path analysis in CrimeLink Explorer (Schroeder et al. 2003).
6Network-Based Analysis in Real-World
Investigations
- Link charts
- combine many cases into an overall picture of criminal activity related to crime types or localities,
- or focus on a particular case and particular suspects.
- They can be used to
- focus an investigation,
- communicate within law enforcement agencies,
- or present data in court.
- Link chart creation is valuable, manual, and expensive.
7CAN Analysis: A Fraud/Meth Link Chart
- An analyst spent 6 weeks in 2003 charting relationships between fraud and methamphetamines.
- Start with target individuals
- Research known associates
- Consider patterns of relationships
- To inform officers and focus investigations
[Link chart figure; node labels include Dealer, Leg Breaker, Check Washer]
8From a Data-Mining Perspective
- The interestingness (or importance) issue is a well-recognized problem in the association rule mining field (Silberschatz and Tuzhilin 1996).
- Beliefs such as expected patterns and known information can be used to guide data-mining algorithms to unexpected or actionable items (Padmanabhan and Tuzhilin 1999).
9Network Interestingness
- In particular, interestingness as discovered in a network of relationships has received some attention.
- (White and Smyth 2003) implement a generalized spreading activation model building from a root set of nodes.
- Lin and Chalupsky (2003) detect novel network paths (not just nodes or links) to reveal interesting information.
10Improving the Link Chart Creation Process
- Algorithmic: can be applied to larger (e.g. cross-jurisdictional) data sets
- Easier: used in more investigations
- More systematic: less training
- Faster: cheaper
11Research Gaps
- Previous research has not directly addressed link chart creation.
- Previous analysis uses only weighted associations (network structure) for analysis. But practitioners relied heavily on individual and network-based activity heuristics.
- Emulating crime analysts, we want to use both associations (network structure) and importance heuristics (node semantics) in our system.
12Research Questions
- How can we effectively identify interesting sub-networks
- from associations found in a large collection of criminal incidents
- employing domain knowledge
- to generate useful investigational leads and support criminal conspiracy investigations?
- Does the use of path-based importance heuristics and importance flooding improve upon link-weight-only methodologies?
13Important Considerations (Solution Constraints)
- Criminal record datasets
- May miss personal associations (e.g. family)
- May miss key individuals who appear unimportant until caught or linked to investigational targets
- Are ambiguous: very different associations look the same in the records
14Thus Design Goals
- Selection algorithm design goals
- Be target focused
- Use query-specific information to fill in the gaps
- Tolerate missing and ambiguous data
- Incorporate adjustable heuristics (or beliefs)
- These goals are appropriate for large-scale cross-jurisdictional analysis and local investigations.
15Importance Flooding
- Basic Intuition
- Both a person's past activity and their involvement in interesting association patterns establish initial importance.
- Interestingness is partly path-based. That is, we improve the analysis by considering patterns of association.
- Associates of interesting people become relatively more interesting.
16Identifying Interesting Sub-networks of Criminal
Associations
The importance flooding module assigns a relative importance score to nodes in the network.
[Diagram labels: Target List, Link Weight Heuristics, Importance Heuristics, Simple Filtering (Path Distance), Importance Flooding, Importance Ranking, Importance Ranked Network]
17How Does Importance Flooding Work?
- Step 1: Assign weights to network links based on
- the role of each actor in each incident
- and the frequency of association
- Step 2: Assign initial importance values to nodes given
- involvement in a specified type of incident (e.g., fraud)
- involvement in a set of incident types (e.g., fraud and drugs)
- participation in an identified path (e.g. Fraud-Drugs-Assault)
- Step 3: Recursively pass importance to neighbors
- Step 4: Start with targets, best first search
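The flooding step (Step 3) can be illustrated with a minimal sketch. The slides do not give the update rule, so everything here is an assumption: the function name `flood_importance`, the `decay` and `rounds` parameters, and the rule that each node keeps its initial importance and receives a decayed, link-weighted share of each neighbor's current score. It is meant only to show the idea that associates of interesting people become relatively more interesting.

```python
def flood_importance(links, initial, decay=0.5, rounds=3):
    """Spread importance over weighted links (illustrative rule, not the authors').

    links:   dict mapping (node_a, node_b) -> link weight in [0, 1], undirected
    initial: dict mapping node -> initial importance (Step 2 output)
    """
    # Build an undirected adjacency map from the link list.
    neighbors = {}
    for (a, b), w in links.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    score = {n: float(initial.get(n, 0.0)) for n in set(neighbors) | set(initial)}
    for _ in range(rounds):
        # Each node passes a decayed, weight-scaled share of its score to neighbors.
        passed = {n: 0.0 for n in score}
        for node, value in score.items():
            for nb, w in neighbors.get(node, {}).items():
                passed[nb] += decay * w * value
        # Assumed rule: keep the initial importance and add what was received.
        score = {n: initial.get(n, 0.0) + passed[n] for n in score}

    # Normalize so the top node scores 1.0 (relative importance).
    top = max(score.values(), default=0.0) or 1.0
    return {n: s / top for n, s in score.items()}


# Hypothetical chain a - b - c - d where only "a" starts out important.
links = {("a", "b"): 0.99, ("b", "c"): 0.5, ("c", "d"): 0.3}
scores = flood_importance(links, {"a": 5.0})
```

In this toy chain the scores fall off with distance from the initially important node, and weaker links pass less importance along.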
18Importance Flooding
- 1. Assign weights to network links
4. Start with targets, best first search
Once all the ranks are assigned, we select the
highest ranked node with a direct connection to a
previously selected node.
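The selection rule stated above (pick the highest ranked node with a direct connection to a previously selected node) can be sketched with a priority queue. The function name `best_first_select`, the data layout, and the `limit` parameter are assumptions made for illustration; node identifiers are assumed to be comparable (e.g. strings) so that ties can be broken.

```python
import heapq


def best_first_select(neighbors, rank, targets, limit):
    """Grow a selection outward from the targets, best-ranked frontier node first.

    neighbors: dict mapping node -> iterable of directly connected nodes
    rank:      dict mapping node -> importance score (higher is better)
    targets:   starting nodes, always selected first
    limit:     maximum number of nodes to return
    """
    selected = list(targets)
    seen = set(targets)
    frontier = []  # min-heap on negated rank, i.e. highest rank pops first

    def push_neighbors(node):
        for nb in neighbors.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(frontier, (-rank.get(nb, 0.0), nb))

    for t in targets:
        push_neighbors(t)
    while frontier and len(selected) < limit:
        _, node = heapq.heappop(frontier)
        selected.append(node)
        push_neighbors(node)  # newly selected node exposes its own neighbors
    return selected


# Hypothetical network: "c" ranks highest but is reachable only through "a".
neighbors = {"t": ["a", "b"], "a": ["t", "c"], "b": ["t"], "c": ["a"]}
rank = {"t": 1.0, "a": 0.2, "b": 0.9, "c": 1.0}
order = best_first_select(neighbors, rank, ["t"], limit=4)
```

A high-ranked node is not selected until one of its neighbors has been, which keeps the resulting chart connected to the targets.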
19Could An Analyst Make A Link Chart Faster Using
Importance Flooding?
- The Fraud/Meth Chart
- 110 people
- Bronze standard
- Start with 4 targets
The basic notion of the methodology is to help
the analyst build an association network around
the target individuals. We measure success by
computing the ratio of correct suggestions
(people included in the manually constructed
chart) to incorrect suggestions (people not
selected in the original chart).
20Our Testbed
- A network of person-incident-person triples from incident reports with date, crime type, and role (e.g. suspect, arrestee, victim) for each individual.
- For the Fraud/Meth chart we included 4,877 people, which includes 73/110 targets.
21Compare Selection Methods
- BFS: Breadth first search
- approximates a manual approach
- CA: Closest Associate, link weights only
- choose the unselected node which is most closely associated with a previously selected node
- a link chart implementation of previous ideas
- IMP: Importance flooding, path-based importance
- Path heuristics with no flooding (PATH)
- Node-only importance flooding (NO: no path heuristics)
- PIF: Perfect Importance Flooding
- Approximate flooding with perfect information
- correct nodes = 1, all other nodes = 0
- this would be the theoretical limit for our methodology
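The two baseline orderings can be sketched as follows, assuming a hypothetical adjacency layout (`weights[node][neighbor]` holds the link weight). `bfs_order` ranks by hop distance from the targets; `closest_associate_order` repeatedly takes the unselected node with the strongest link to any already selected node. Both function names are illustrative, not taken from the original system.

```python
import heapq
from collections import deque


def bfs_order(weights, targets):
    """Breadth first search: nodes in order of hop distance from the targets."""
    order, seen, queue = list(targets), set(targets), deque(targets)
    while queue:
        node = queue.popleft()
        for nb in weights.get(node, {}):
            if nb not in seen:
                seen.add(nb)
                order.append(nb)
                queue.append(nb)
    return order


def closest_associate_order(weights, targets):
    """Closest associate: strongest link to a previously selected node goes next."""
    selected, chosen, frontier = list(targets), set(targets), []

    def push_edges(node):
        for nb, w in weights.get(node, {}).items():
            if nb not in chosen:
                heapq.heappush(frontier, (-w, nb))  # negate: strongest link first

    for t in targets:
        push_edges(t)
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in chosen:
            continue  # already reached through a stronger link
        chosen.add(node)
        selected.append(node)
        push_edges(node)
    return selected


# Hypothetical weighted network around a single target "t".
weights = {
    "t": {"a": 0.3, "b": 0.99},
    "a": {"t": 0.3, "c": 0.5},
    "b": {"t": 0.99, "d": 0.99},
    "c": {"a": 0.5},
    "d": {"b": 0.99},
}
bfs = bfs_order(weights, ["t"])
ca = closest_associate_order(weights, ["t"])
```

On this toy network BFS visits both one-hop neighbors before anything else, while the closest-associate ordering follows the strong t-b-d chain first.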
22Results Fraud/Meth
How many nodes did we look at to find one
included in the manually-drawn link chart?
23Conclusions
- Our analysis shows the algorithm's promise.
- All the intelligent methods outperformed breadth first search.
- Importance analysis seems to improve on a link-weight-only approach.
- Path-based heuristics helped as compared to solely node-based heuristics.
- Still, we should be cautious: our data set is limited and we can't really say that one chart is correct while all others are incorrect. More study is needed.
24Why Does It Fit the Domain?
- We can encode the kind of heuristics used by investigators: path heuristics, association weights, and target focus.
- Inquiry-specific information can be leveraged by the algorithm.
- Heuristics can be tuned to a particular investigation.
- The data we use is shareable.
- We use relationships, not complete reports.
- Different entity matching rules can be used for different applications.
- But we still move beyond criminal justice data.
25Importance Flooding Not Just for CANs
- We believe that this kind of algorithm can be applied to other informal node-link knowledge representations.
- When the desired output is a network, this algorithm is designed to overcome link and identifier ambiguity by leveraging both the structure and the semantics of the underlying network.
- For example, we plan to explore the use of this algorithm in selecting interesting subsets of a network of biomedical pathway relations extracted from the text of journal abstracts.
26Acknowledgement
- NSF, Knowledge Discovery and Dissemination (KDD) 9983304.
- NSF, ITR "COPLINK Center for Intelligence and Security Informatics Research - A Crime Data Mining Approach to Developing Border Safe Research."
- Department of Homeland Security (DHS) / Corporation for National Research Initiatives (CNRI) "Border Safe."
- Thanks to the Tucson Police Department, Kathy Martinjak, Tim Petersen, and Chuck Violette.
27Hypotheses Tested on Fraud/Meth Data
A(technique) = Average (nodes selected / correct nodes selected)
Techniques
- BFS: breadth first (rank by number of hops)
- CA: closest associates
- IMP: importance flooding
- PATH: path rules, no flooding
- NO: node rules, flooding
- PIF: perfect importance flooding
Results
Claim | Hypothesis | Result
All techniques improve on BFS | H1a: A(IMP) < A(BFS) | Accepted
All techniques improve on BFS | H1b: A(CA) < A(BFS) | Accepted
Importance flooding outperforms closest associates | H2: A(IMP) < A(CA) | Accepted
Importance flooding outperforms heuristics with no flooding | H3: A(IMP) < A(PATH) | Accepted at 500, 1000, 2000 but NOT for 100, 250
Importance flooding with path rules outperforms flooding with only node rules | H4: A(IMP) < A(NO) | Accepted
Given perfect information, flooding outperforms other techniques | H5a: A(PIF) < A(IMP) | Accepted
Given perfect information, flooding outperforms other techniques | H5b: A(PIF) < A(CA) | Accepted
Hypotheses should hold for 100, 250, 500, 1000, and 2000 selected nodes, significant at the .01 level for all levels of selected nodes.
28Association Strength
- Association Strength is based on
- the role of each actor in each incident
- and the frequency of association
- (Schroeder et al. 2003)
- Example (as used in this work)
- Person/Role
- Suspect/Suspect Relationships: .99
- Suspect/Not Suspect: .5
- Not Suspect/Not Suspect: .3
- Frequency Adjustment
- 4 or more associations: weight 1
- else: (strongest .6, 2nd .2, and 3rd .2)
[Figure: nodes 1, 2, and 3] Nodes 1 and 2 may both be selected before 3 because of the Association Strength.
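The role weights and frequency adjustment listed on this slide can be combined in a short sketch. The slide does not spell out how the .6/.2/.2 factors are applied; the reading used here (a weighted sum over the three strongest role-based relations) is an assumption, as are the function names and the input layout.

```python
def role_weight(a_is_suspect, b_is_suspect):
    """Role-based weight for one shared incident, using the slide's example values."""
    if a_is_suspect and b_is_suspect:
        return 0.99  # Suspect/Suspect
    if a_is_suspect or b_is_suspect:
        return 0.5   # Suspect/Not Suspect
    return 0.3       # Not Suspect/Not Suspect


def association_strength(shared_incidents):
    """Link weight for a pair of people.

    shared_incidents: list of (a_is_suspect, b_is_suspect) tuples,
                      one per incident in which both people appear.
    """
    if len(shared_incidents) >= 4:
        return 1.0  # 4 or more associations: weight 1
    # Assumed reading of the frequency adjustment: weight the strongest
    # relation by .6 and the second and third strongest by .2 each.
    strengths = sorted((role_weight(a, b) for a, b in shared_incidents), reverse=True)
    return sum(f * s for f, s in zip((0.6, 0.2, 0.2), strengths))


one_shared = association_strength([(True, True)])
three_shared = association_strength([(True, True), (True, False), (False, False)])
four_shared = association_strength([(False, False)] * 4)
```

Under this reading a single suspect/suspect incident gives 0.6 × 0.99, and additional shared incidents raise the weight toward 1.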
29Activity and Path-Based Initial Importance
System Design
Group Rule Example: Assign this node to the Fraud Group if they were ever a suspect in a fraud incident. Optionally, add 2 to the importance value of any node assigned to the Fraud Group.
Multi-Group Rule Example: Assign this node to the Fraud/Drug group if they are a member of both the Fraud and Drug Groups.
Path Rule Example: F = in Fraud, D = in Drug Sales, A = in Assault, s/s = two individuals who are both suspects in a recent incident. Add 5 to nodes participating in an F-s/s-D-s/s-A path.
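A minimal sketch of how the group rule and the path rule might assign initial importance. The incident layout, the `GROUP_OF` mapping, and the function name are hypothetical; the bonuses (2 for the Fraud Group, 5 for an F-s/s-D-s/s-A path) come from the examples on this slide. The "recent incident" date filter is omitted, and each node receives the path bonus at most once, which is an assumption.

```python
from itertools import combinations

# Hypothetical mapping from incident type to group letter.
GROUP_OF = {"fraud": "F", "drug sales": "D", "assault": "A"}


def initial_importance(incidents, group_bonus=2, path_bonus=5):
    """incidents: list of {"type": str, "suspects": [person, ...]} records."""
    groups = {}       # person -> set of group letters
    co_suspects = {}  # person -> people who were suspects in the same incident (s/s)
    for inc in incidents:
        letter = GROUP_OF.get(inc["type"])
        for person in inc["suspects"]:
            groups.setdefault(person, set())
            co_suspects.setdefault(person, set())
            if letter:
                groups[person].add(letter)
        for a, b in combinations(inc["suspects"], 2):
            co_suspects[a].add(b)
            co_suspects[b].add(a)

    # Group rule: ever a suspect in a fraud incident.
    importance = {p: (group_bonus if "F" in g else 0) for p, g in groups.items()}

    # Path rule: F -s/s- D -s/s- A, anchored on the middle (D) node.
    on_path = set()
    for d, d_groups in groups.items():
        if "D" not in d_groups:
            continue
        fraud_side = [p for p in co_suspects[d] if "F" in groups[p]]
        assault_side = [p for p in co_suspects[d] if "A" in groups[p]]
        for f in fraud_side:
            for a in assault_side:
                if f != a:
                    on_path.update((f, d, a))
    for person in on_path:
        importance[person] += path_bonus
    return importance


incidents = [
    {"type": "fraud", "suspects": ["ann", "bob"]},
    {"type": "drug sales", "suspects": ["bob", "cy"]},
    {"type": "assault", "suspects": ["cy", "dee"]},
    {"type": "burglary", "suspects": ["eve"]},
]
importance = initial_importance(incidents)
```

Here ann-bob-cy and bob-cy-dee both match the path pattern, so all four people receive the path bonus while eve receives nothing.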
30The Arrow Key Investigation
Testbed
- Depicts 110 key people (coincidentally the same size as the Fraud/Meth chart)
- 23 original starting points (targets) were identified
31BorderSafe Research Testbed
- 14 years of Local Law Enforcement (LE) data (1990-2004): 2.2 million people, 5.2 million incidents, 1 million vehicles
- Tucson Police Department (TPD)
- Pima County Sheriff's Department (PCSD)
[Diagram labels: People, Incident Reports]
32Data Integration Framework: 2 Steps, 3 Classes of Data
(Marshall et al. 2004)
33Results Arrow Key Investigation
Results
How many nodes did we look at to find one
included in the manually-drawn link chart?
34Average Nodes Selected Per Correct Node
- As more nodes are selected, a higher proportion
of incorrect nodes are selected.
Node Ranking Method
35Raw Results Fraud/Meth Data
Avg = Average Number of Nodes Selected Per Correct Node
SD = Standard Deviation of Number of Nodes Per Correct Node
In each cell, the top number is the number of correct nodes selected. The second row is the Avg and (SD).
Ranking Methodology | Perfect | Importance Flooding | Path Importance No Flooding | Node Importance With Flooding | Closest Associate | Breadth First
1 to 100 Nodes Selected | 68 | 27 | 24 | 19 | 15 | 0
1 to 100 Nodes Selected | 1.08 (0.14) | 3.40 (0.62) | 3.39 (0.68) | 3.86 (1.07) | 4.98 (1.51) | N/A
1 to 250 Nodes Selected | 69 | 48 | 49 | 42 | 31 | 23
1 to 250 Nodes Selected | N/A | 3.95 (0.71) | 3.96 (0.69) | 4.70 (1.00) | 6.20 (1.43) | 47.63 (21.62)
1 to 500 Nodes Selected | 69 | 56 | 51 | 47 | 38 | 35
1 to 500 Nodes Selected | N/A | 5.53 (1.82) | 5.76 (2.10) | 5.76 (2.10) | 8.68 (2.93) | 29.68 (23.61)
1 to 1000 Nodes Selected | 69 | 62 | 59 | 53 | 51 | 38
1 to 1000 Nodes Selected | N/A | 8.99 (3.97) | 9.58 (4.32) | 10.92 (5.06) | 12.47 (4.52) | 25.06 (17.48)
1 to 2000 Nodes Selected | 69 | 68 | 67 | 63 | 64 | 45
1 to 2000 Nodes Selected | N/A | 15.71 (7.77) | 16.37 (7.93) | 18.23 (8.57) | 19.41 (7.98) | 30.04 (13.76)
Until All 69 Correct Nodes Were Selected | 101 | 2158 | 3058 | 4407 | 4140 | 4828
Until All 69 Correct Nodes Were Selected | 1.08 (0.14) | 16.80 (8.43) | 23.69 (12.21) | 35.03 (17.88) | 33.05 (15.59) | 46.92 (17.76)
36We Need Feasible Cross-Jurisdictional Analysis
Methodologies
- Vast Volume
- Tucson alone has Law Enforcement data for 2 million people, 5 million incidents, and 1 million vehicles in 14 years
- Privacy Policies
- What is buried in the reports?
- Medical? Personal? Sensitive?
- Entity Equivalence
- No unique identifier is available
- Task-dependent accuracy requirements
- Sharing data across agencies is difficult and expensive. Are simple queries the answer?
37Numeric Measure
- We expect that in networks ranked by a better algorithm, an analyst would have to look at fewer nodes to find a correct node.
- A(technique) = Average (nodes selected / correct nodes selected).
- It can be measured at various selected node levels.
- For example, A(importance flooding) at 250 is the
- average ratio of selected nodes to correct nodes,
- selected by the importance flooding algorithm,
- when the number of selected nodes is 1, 2, 3, ..., 250.
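The measure can be computed directly from a ranked selection order. In this sketch the function name and the handling of steps where no correct node has been found yet (they are skipped, and `None` is returned if no correct node is ever found) are assumptions; the slides do not say how an undefined ratio is treated.

```python
def average_nodes_per_correct(ranked, correct, n):
    """A(technique) at level n: mean of (nodes selected / correct nodes selected).

    ranked:  nodes in the order the technique selects them
    correct: set of nodes that appear in the manually drawn chart
    n:       number of selected nodes to evaluate (1, 2, ..., n)
    """
    ratios = []
    hits = 0
    for k, node in enumerate(ranked[:n], start=1):
        if node in correct:
            hits += 1
        if hits:  # ratio is undefined until the first correct node is found
            ratios.append(k / hits)
    return sum(ratios) / len(ratios) if ratios else None


# Hypothetical ranking: ratios are 1/1, 2/1, 3/2, 4/2.
measure = average_nodes_per_correct(["a", "x", "b", "y"], {"a", "b"}, n=4)
```

Lower values mean the analyst inspects fewer suggestions per correct node; a ranking that lists only correct nodes would score 1.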
38Network- or Graph-Based Analysis is Well Known in
Law Enforcement (LE)
- From an LE research perspective
- (Sparrow 1991) discussed the investigational implications of social network measures and identified some network properties that would impact real-world analysis applications: size, incompleteness, fuzzy boundaries, and dynamism.
- Social network analysis measures usually evaluate networks based on a single association weight (Coffman 2004, Xu and Chen 2003).
- Shortest path analysis in CrimeLink Explorer (Schroeder et al. 2003).
39Association Closeness and Importance (chosen
based on analyst input)
- Link Weight Heuristics
- Suspect/Suspect Relationships: .99
- Suspect/Not Suspect: .5
- Not Suspect/Not Suspect: .3
- Frequency Adjustment
- 4 or more associations: weight 1
- else: ? (1st strongest relation .6, 2nd .2, and 3rd .2)
- Importance
- Groups: Aggravated Assault (A), Drug Sales (S), Drug Possession (P), Fraud (F)
- (A), (D), or (F): 3
- Path Rules (all applied only to crimes after 01/01/2001)
- (A)-(D)-(F): 5
- (A)-(D), (A)-(F), (D)-(F), (P)-(F): 3
- Nodes with any 2 of (A), (D), (F): 3; (A), (D), (F): 5