1
USING IMPORTANCE FLOODING
TO IDENTIFY INTERESTING
NETWORKS OF CRIMINAL
ACTIVITY

Byron Marshall
Oregon State University
Hsinchun Chen
University of Arizona

ISI 2006: IEEE Intelligence and Security
Informatics Conference, May 23-24, San Diego, CA
2
The Case for Importance Flooding Analysis in Law
Enforcement (LE)

The need for feasible, cross-jurisdictional
intelligence analysis tools
The promise of network methodologies
Helping analysts create link charts
Criminal Activity Networks (CANs), importance
flooding, and path-based importance heuristics
Importance flooding for LE and more

3
We Need Feasible Cross-Jurisdictional Analysis
Methodologies

Events of the past few years both highlight the
need for cross-jurisdictional sharing and
demonstrate the difficulty of establishing
feasible systems.
Focusing on investigational usefulness
Beyond criminal justice data
Respecting privacy and security issues
Good sharing systems should support
investigations.

4
Network-Based Methodologies: The Tool of Choice
Criminal conspirators receive longer sentences.

Criminal association networks are understandable
and actionable.

5
Network- or Graph-Based Analysis is Well Known in
Law Enforcement (LE)

From an LE research perspective:
Sparrow (1991) discussed the investigational
implications of social network measures and
identified some network properties that would
impact real-world analysis applications: size,
incompleteness, fuzzy boundaries, and dynamism.
Social network analysis measures usually evaluate
networks based on a single association weight
(Coffman 2004, Xu and Chen 2003).
Shortest path analysis in CrimeLink Explorer
(Schroeder et al. 2003).

6
Network-Based Analysis in Real-World
Investigations

Link charts
combine many cases into an overall picture of
criminal activity related to crime types or
localities,
or focus on a particular case and particular
suspects.
They can be used to
focus an investigation,
communicate within law enforcement agencies,
or present data in court.
Link chart creation is valuable, manual, and
expensive.

7
CAN Analysis: A Fraud/Meth Link Chart

An analyst spent 6 weeks in 2003 charting
relationships between fraud and methamphetamines.

Start with target individuals
Research known associates
Consider patterns of relationships
To inform officers and focus investigations
[Link chart labels: Dealer, Leg Breaker, Check Washer]
8
From a Data-Mining Perspective

The interestingness (or importance) issue is a
well-recognized problem in the association rule
mining field (Silberschatz and Tuzhilin 1996).
Beliefs such as expected patterns and known
information can be used to guide data-mining
algorithms to unexpected or actionable items
(Padmanabhan and Tuzhilin 1999).

9
Network Interestingness

In particular, interestingness as discovered in a
network of relationships has received some
attention:
White and Smyth (2003) implement a generalized
spreading activation model building from a root
set of nodes.
Lin and Chalupsky (2003) detect novel network
paths (not just nodes or links) to reveal
interesting information.

10
Improving the Link Chart Creation Process

Algorithmic: can be applied to larger (e.g.
cross-jurisdictional) data sets
Easier: used in more investigations

More systematic: less training
Faster: cheaper

11
Research Gaps

Previous research has not directly addressed link
chart creation.
Previous approaches use only weighted associations
(network structure) for analysis, but
practitioners relied heavily on individual and
network-based activity heuristics.
Emulating crime analysts, we want to use both
associations (network structure) and importance
heuristics (node semantics) in our system.

12
Research Questions

How can we effectively identify interesting
sub-networks
from associations found in a large collection of
criminal incidents
employing domain knowledge
to generate useful investigational leads and
support criminal conspiracy investigations?
Does the use of path-based importance heuristics
and importance flooding improve upon
link-weight-only methodologies?

13
Important Considerations (Solution Constraints)

Criminal record datasets
May miss personal associations (e.g. family)
May miss key individuals who appear unimportant
until caught or linked to investigational targets
Are ambiguous: very different associations look
the same in the records

14
Thus: Design Goals

Selection algorithm design goals
Be target focused
Use query-specific information to fill in the
gaps
Tolerate missing and ambiguous data
Incorporate adjustable heuristics (or beliefs)
These goals are appropriate for both large-scale
cross-jurisdictional analysis and local
investigations.

15
Importance Flooding

Basic Intuition
Both a person's past activity and their
involvement in interesting association patterns
establish initial importance.
Interestingness is partly path-based. That is, we
improve the analysis by considering patterns of
association.
Associates of interesting people become
relatively more interesting.

16
Identifying Interesting Sub-networks of Criminal
Associations
[Diagram labels: Target List, Link Weight Heuristics, Importance Heuristics, Simple Filtering (Path Distance), Importance Flooding, Importance Ranking, Importance Ranked Network]
The importance flooding module assigns a relative
importance score to nodes in the network.
17
How Does Importance Flooding Work?

Step 1: Assign weights to network links based on
the role of each actor in each incident
and the frequency of association

Step 2: Assign initial importance values to nodes
given
involvement in a specified type of incident
(e.g., fraud)
involvement in a set of incident types (e.g., fraud
and drugs)
participation in an identified path (e.g.
Fraud-Drugs-Assault)

Step 3: Recursively pass importance to neighbors
Step 4: Start with targets, best-first search
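
A minimal sketch of how Steps 1 to 3 might fit together. The slides do not give the flooding formula, so the update rule below (each node receives a decayed, weight-scaled share of its neighbors' scores, followed by normalization) is an assumption made for illustration, as are the names flood_importance, decay, and rounds.

```python
def flood_importance(links, initial, decay=0.5, rounds=2):
    """Spread importance across weighted links (illustrative only).

    links:   {(person_a, person_b): weight in (0, 1]}, undirected (Step 1)
    initial: {person: initial importance from heuristics} (Step 2)
    decay and rounds are hypothetical tuning knobs, not from the slides.
    """
    neighbors = {}
    for (a, b), weight in links.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    nodes = set(neighbors) | set(initial)
    score = {n: float(initial.get(n, 0.0)) for n in nodes}

    # Step 3, unrolled into a fixed number of passes (an assumption).
    for _ in range(rounds):
        updated = {}
        for n in nodes:
            received = sum(w * score[m] for m, w in neighbors.get(n, {}).items())
            updated[n] = score[n] + decay * received
        # Normalize so the most important node scores 1.0.
        top = max(updated.values(), default=0.0) or 1.0
        score = {n: value / top for n, value in updated.items()}
    return score


# Hypothetical toy network: one target, three associates.
links = {("target", "a"): 0.99, ("a", "b"): 0.5, ("target", "c"): 0.3}
initial = {"target": 5, "b": 2}
scores = flood_importance(links, initial)
```

In this toy network, associate "a" starts with no importance of its own but ends up ranked above "c" because it is strongly linked to the target and also linked to "b", which matches the intuition that associates of interesting people become relatively more interesting.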
18
Importance Flooding

1. Assign weights to network links

4. Start with targets, best first search
Once all the ranks are assigned, we select the
highest ranked node with a direct connection to a
previously selected node.
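
The selection rule just described can be sketched with a priority queue. This is an illustrative reading, not the authors' implementation: the function name best_first_selection, the limit parameter, and the alphabetical tie-breaking are assumptions.

```python
import heapq


def best_first_selection(neighbors, rank, targets, limit):
    """Repeatedly pick the highest-ranked node that is directly connected
    to an already selected node, starting from the targets.

    neighbors: {node: iterable of adjacent nodes}
    rank:      {node: importance score}
    """
    selected = list(dict.fromkeys(targets))  # keep order, drop duplicates
    chosen = set(selected)
    frontier = []  # min-heap of (-rank, node); ties fall back to node name

    def push_neighbors(node):
        for m in neighbors.get(node, ()):
            if m not in chosen:
                heapq.heappush(frontier, (-rank.get(m, 0.0), m))

    for t in selected:
        push_neighbors(t)

    while frontier and len(selected) < limit:
        _, node = heapq.heappop(frontier)
        if node in chosen:
            continue  # stale heap entry
        chosen.add(node)
        selected.append(node)
        push_neighbors(node)
    return selected


# Hypothetical example: "b" is highly ranked but only reachable through "a".
neighbors = {"t": ["a", "c"], "a": ["t", "b"], "b": ["a"], "c": ["t"]}
rank = {"t": 1.0, "a": 0.2, "b": 0.9, "c": 0.5}
order = best_first_selection(neighbors, rank, ["t"], limit=4)
```

Note that "b" cannot be selected until "a" is, even though "b" outranks "c": the connectivity requirement keeps the suggested network attached to the targets.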
19
Could An Analyst Make A Link Chart Faster Using
Importance Flooding?

The Fraud/Meth Chart
110 people
Bronze standard
Start with 4 targets

The basic notion of the methodology is to help
the analyst build an association network around
the target individuals. We measure success by
computing the ratio of correct suggestions
(people included in the manually constructed
chart) to incorrect suggestions (people not
selected in the original chart).
20
Our Testbed

A network of person-incident-person triples from
incident reports with date, crime type, and role
(e.g. suspect, arrestee, victim) for each
individual.
For the Fraud/Meth chart we included 4,877 people
which includes 73/110 targets.

21
Compare Selection Methods

BFS: Breadth-first search
approximates a manual approach
CA: Closest Associate - link weights only
choose the unselected node which is most closely
associated with a previously selected node
a link chart implementation of previous ideas
IMP: Importance flooding - path-based importance
Path heuristics with no flooding (PATH)
Node-only importance flooding (NO: no path
heuristics)
PIF: Perfect Importance Flooding
Approximate flooding with perfect information:
correct nodes = 1, all other nodes = 0
this would be the theoretical limit for our
methodology
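
For comparison, the two baseline orderings (BFS and CA) can be sketched as below. This is a hedged illustration: the function names, the nested-dict graph layout, and the alphabetical tie-breaking are assumptions rather than details from the presentation.

```python
import heapq
from collections import deque


def bfs_order(neighbors, targets):
    """BFS baseline: rank nodes by number of hops from the targets."""
    order = list(dict.fromkeys(targets))
    seen = set(order)
    queue = deque(order)
    while queue:
        node = queue.popleft()
        for m in sorted(neighbors.get(node, {})):
            if m not in seen:
                seen.add(m)
                order.append(m)
                queue.append(m)
    return order


def closest_associate_order(neighbors, targets):
    """CA baseline: next pick is the unselected node with the strongest
    link to any previously selected node (link weights only)."""
    order = list(dict.fromkeys(targets))
    seen = set(order)
    heap = []  # min-heap of (-link weight, node)

    def push(node):
        for m, weight in neighbors.get(node, {}).items():
            if m not in seen:
                heapq.heappush(heap, (-weight, m))

    for t in order:
        push(t)
    while heap:
        _, node = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        push(node)
    return order


# Hypothetical weighted graph: {node: {neighbor: link weight}}
graph = {
    "t": {"a": 0.3, "b": 0.99},
    "a": {"t": 0.3},
    "b": {"t": 0.99, "c": 0.5},
    "c": {"b": 0.5},
}
bfs = bfs_order(graph, ["t"])
ca = closest_associate_order(graph, ["t"])
```

Here CA reaches "c" (two hops away, through a strong link) before the weakly linked direct neighbor "a", while BFS exhausts all one-hop neighbors first.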

22
Results: Fraud/Meth
How many nodes did we look at to find one
included in the manually-drawn link chart?
23
Conclusions

Our analysis shows the algorithm's promise.
All the intelligent methods outperformed
breadth-first search.
Importance analysis seems to improve on a
link-weight-only approach.
Path-based heuristics helped as compared to
solely node-based heuristics.
Still, we should be cautious: our data set is
limited and we can't really say that one chart is
correct while all others are incorrect. More
study is needed.

24
Why Does It Fit the Domain?

We can encode the kind of heuristics used by
investigators: path heuristics, association
weights, and target focus.
Inquiry-specific information can be leveraged by
the algorithm.
Heuristics can be tuned to a particular
investigation.
The data we use is shareable.
We use relationships, not complete reports.
Different entity matching rules can be used for
different applications.
But we still move beyond criminal justice data.

25
Importance Flooding: Not Just for CANs

We believe that this kind of algorithm can be
applied to other informal node-link knowledge
representations.
When the desired output is a network, this
algorithm is designed to overcome link and
identifier ambiguity by leveraging both the
structure and the semantics of the underlying
network.
For example, we plan to explore the use of this
algorithm in selecting interesting subsets of a
network of biomedical pathway relations extracted
from the text of journal abstracts.

26
Acknowledgement

NSF, Knowledge Discovery and Dissemination (KDD)
9983304.
NSF, ITR "COPLINK Center for Intelligence and
Security Informatics Research - A Crime Data
Mining Approach to Developing Border Safe
Research."
Department of Homeland Security (DHS) /
Corporation for National Research Initiatives
(CNRI) "Border Safe."
Thanks to the Tucson Police Department: Kathy
Martinjak, Tim Petersen, and Chuck Violette.

27
Hypotheses Tested on Fraud/Meth Data
A(technique) = Average nodes selected / correct
nodes selected
Techniques
BFS: breadth first (rank by number of hops)
CA: closest associates
IMP: importance flooding
PATH: path rules, no flooding
NO: node rules, flooding
PIF: perfect importance flooding
Results
All techniques improve on BFS
H1a: A(IMP) < A(BFS) Accepted
H1b: A(CA) < A(BFS) Accepted
Importance flooding outperforms closest associates
H2: A(IMP) < A(CA) Accepted
Importance flooding outperforms heuristics with no flooding
H3: A(IMP) < A(PATH) Accepted at 500, 1000, 2000 but NOT for 100, 250
Importance flooding with path rules outperforms flooding with only node rules
H4: A(IMP) < A(NO) Accepted
Given perfect information, flooding outperforms other techniques
H5a: A(PIF) < A(IMP) Accepted
H5b: A(PIF) < A(CA) Accepted
Hypotheses should hold for 100, 250, 500, 1000, and 2000 selected nodes; significant at the .01 level for all levels of selected nodes
28
Association Strength

Association Strength is based on
the role of each actor in each incident
and the frequency of association
(Schroeder et al. 2003)

Nodes 1 and 2 may both be selected before node 3
because of the Association Strength

Example (as used in this work)
Person/Role
Suspect/Suspect relationships: .99
Suspect/Not Suspect: .5
Not Suspect/Not Suspect: .3
Frequency Adjustment
4 or more associations: weight 1
else: (strongest .6, 2nd .2, and 3rd .2)
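
One possible reading of the example values above, sketched in code. The slide does not spell out how the role weights and the frequency adjustment combine, so the blend used here (0.6 times the strongest role weight plus 0.2 each for the second and third, with missing associations counted as zero) is an assumption, as are the function names.

```python
# Role-pair weights from the slide, keyed by how many of the two
# people were suspects in the shared incident.
ROLE_WEIGHT = {2: 0.99, 1: 0.5, 0: 0.3}


def pair_weight(role_a, role_b):
    """Weight of one shared incident given each person's role."""
    suspects = (role_a == "suspect") + (role_b == "suspect")
    return ROLE_WEIGHT[suspects]


def association_strength(incident_roles):
    """incident_roles: list of (role_a, role_b), one per shared incident.

    ASSUMPTION: with fewer than 4 associations, blend the three strongest
    incident weights as 0.6 / 0.2 / 0.2; absent ones count as 0.
    """
    if len(incident_roles) >= 4:
        return 1.0
    weights = sorted((pair_weight(a, b) for a, b in incident_roles), reverse=True)
    weights += [0.0] * (3 - len(weights))
    return 0.6 * weights[0] + 0.2 * weights[1] + 0.2 * weights[2]


one_incident = association_strength([("suspect", "suspect")])
three_incidents = association_strength([("suspect", "suspect")] * 3)
four_incidents = association_strength([("victim", "witness")] * 4)
```

Under this reading a single suspect/suspect incident yields a weaker link than three such incidents, and any pair with four or more shared incidents gets the maximum weight regardless of role.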

[Figure: example network with nodes 1, 2, and 3]
29
Activity and Path-Based Initial Importance
System Design
Group Rule Example: Assign this node to the
Fraud Group if they were ever a suspect in a
fraud incident. Optionally, add 2 to the
importance value of any node assigned to the
Fraud Group.
Multi-Group Rule Example: Assign this node to the
Fraud/Drug Group if they are a member of both the
Fraud and Drug Groups.
Path Rule Example: F = In Fraud, D = In Drug Sales,
A = In Assault, s/s = Two individuals who are both
suspects in a recent incident. Add 5 to nodes
participating in an F-s/s-D-s/s-A path.
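
The group rule and the path rule can be sketched as follows. This is a simplified illustration under stated assumptions: the data layout, the function name apply_rules, and the choice to award the path bonus once per node (rather than once per matching path) are not from the slides, and the multi-group rule is omitted for brevity.

```python
def apply_rules(suspect_in, ss_links, group_bonus=2, path_bonus=5):
    """Assign initial importance from one group rule and one path rule.

    suspect_in: {person: set of crime types in which they were a suspect}
    ss_links:   iterable of (a, b) pairs who were both suspects in a
                recent incident (the "s/s" relation)
    """
    importance = {person: 0.0 for person in suspect_in}

    # Group rule: members of the Fraud Group get the group bonus.
    for person, crimes in suspect_in.items():
        if "fraud" in crimes:
            importance[person] += group_bonus

    # Build the s/s adjacency.
    adjacent = {}
    for a, b in ss_links:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)

    def in_group(person, crime):
        return crime in suspect_in.get(person, set())

    # Path rule: find F-s/s-D-s/s-A paths by scanning each middle (D) node.
    on_path = set()
    for d, linked in adjacent.items():
        if not in_group(d, "drug sales"):
            continue
        frauds = [p for p in linked if in_group(p, "fraud")]
        assaults = [p for p in linked if in_group(p, "assault")]
        for f in frauds:
            for a in assaults:
                if f != a:
                    on_path.update((f, d, a))

    for person in on_path:  # ASSUMPTION: bonus applied once per node
        importance[person] = importance.get(person, 0.0) + path_bonus
    return importance


# Hypothetical people p1..p4.
suspect_in = {
    "p1": {"fraud"},
    "p2": {"drug sales"},
    "p3": {"assault"},
    "p4": {"fraud"},
}
ss_links = [("p1", "p2"), ("p2", "p3")]
initial_importance = apply_rules(suspect_in, ss_links)
```

In this example p1 collects both bonuses, p2 and p3 collect only the path bonus, and p4 (a fraud suspect with no qualifying associations) collects only the group bonus.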
30
The Arrow Key Investigation
Testbed

Depicts 110 key people (coincidentally the same
size as the Fraud/Meth chart)
23 original starting points (targets) were
identified

31
BorderSafe Research Testbed
14 years of local law enforcement (LE) data
(1990-2004): 2.2 million people, 5.2 million
incidents, 1 million vehicles
Tucson Police Department (TPD)
Pima County Sheriff's Department (PCSD)
[Diagram labels: People, Incident Reports]
32
Data Integration Framework: 2 Steps, 3 Classes of
Data
(Marshall et al. 2004)
33
Results: Arrow Key Investigation
How many nodes did we look at to find one
included in the manually-drawn link chart?
34
Average Nodes Selected Per Correct Node

As more nodes are selected, a higher proportion
of incorrect nodes are selected.

Node Ranking Method
35
Raw Results: Fraud/Meth Data
Avg = Average Number of Nodes Selected Per Correct Node
SD = Standard Deviation of Number of Nodes Per Correct Node

In each cell, the top number is the number of correct nodes selected
The second row is the Avg and (SD)
Ranking Methodology | Perfect | Importance Flooding | Path Importance No Flooding | Node Importance With Flooding | Closest Associate | Breadth First
1 to 100 Nodes Selected | 68 | 27 | 24 | 19 | 15 | 0
1 to 100 Nodes Selected | 1.08 (0.14) | 3.40 (0.62) | 3.39 (0.68) | 3.86 (1.07) | 4.98 (1.51) | N/A
1 to 250 Nodes Selected | 69 | 48 | 49 | 42 | 31 | 23
1 to 250 Nodes Selected | N/A | 3.95 (0.71) | 3.96 (0.69) | 4.70 (1.00) | 6.20 (1.43) | 47.63 (21.62)
1 to 500 Nodes Selected | 69 | 56 | 51 | 47 | 38 | 35
1 to 500 Nodes Selected | N/A | 5.53 (1.82) | 5.76 (2.10) | 5.76 (2.10) | 8.68 (2.93) | 29.68 (23.61)
1 to 1000 Nodes Selected | 69 | 62 | 59 | 53 | 51 | 38
1 to 1000 Nodes Selected | N/A | 8.99 (3.97) | 9.58 (4.32) | 10.92 (5.06) | 12.47 (4.52) | 25.06 (17.48)
1 to 2000 Nodes Selected | 69 | 68 | 67 | 63 | 64 | 45
1 to 2000 Nodes Selected | N/A | 15.71 (7.77) | 16.37 (7.93) | 18.23 (8.57) | 19.41 (7.98) | 30.04 (13.76)
Until All 69 Correct Nodes Were Selected | 101 | 2158 | 3058 | 4407 | 4140 | 4828
Until All 69 Correct Nodes Were Selected | 1.08 (0.14) | 16.80 (8.43) | 23.69 (12.21) | 35.03 (17.88) | 33.05 (15.59) | 46.92 (17.76)
36
We Need Feasible Cross-Jurisdictional Analysis
Methodologies

Vast Volume
Tucson alone has Law Enforcement data for 2
million people, 5 million incidents, and 1
million vehicles in 14 years
Privacy Policies
What is buried in the reports?
Medical? Personal? Sensitive?
Entity Equivalence
No unique identifier is available
Task-dependent accuracy requirements
Sharing data across agencies is difficult and
expensive. Are simple queries the answer?

37
Numeric Measure

We expect that in networks ranked by a better
algorithm, an analyst would have to look at fewer
nodes to find a correct node.
A(technique) = Average(nodes selected / correct
nodes selected).
It can be measured at various selected node
levels.
For example, A(importance flooding) at 250 is the
average ratio of selected nodes to correct nodes,
selected by the importance flooding algorithm,
when the number of selected nodes is 1, 2, 3, ..., 250.
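
A short sketch of how this measure could be computed from a ranked list. The function name is hypothetical, and skipping the positions where no correct node has been found yet (where the ratio is undefined) is an assumption; the slides do not say how those positions were handled.

```python
def average_nodes_per_correct(ranked, correct, n):
    """A(technique) at level n: the mean, over k = 1..n, of
    k / (number of correct nodes among the first k selected).

    ranked:  nodes in the order the technique selects them
    correct: nodes that appear in the manually drawn chart
    Returns None when no correct node appears in the first n.
    """
    correct = set(correct)
    found = 0
    ratios = []
    for k, node in enumerate(ranked[:n], start=1):
        if node in correct:
            found += 1
        if found:  # ASSUMPTION: undefined ratios (found == 0) are skipped
            ratios.append(k / found)
    return sum(ratios) / len(ratios) if ratios else None


# Hypothetical ranking: "a" and "b" are on the chart, "x" and "y" are not.
measure = average_nodes_per_correct(["a", "x", "b", "y"], {"a", "b"}, n=4)
perfect = average_nodes_per_correct(["a", "b"], {"a", "b"}, n=2)
```

A ranking that lists every correct node first scores 1.0, and larger values mean the analyst looks at more nodes per correct suggestion.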

39
Association Closeness and Importance (chosen
based on analyst input)