1
USING IMPORTANCE FLOODING
TO IDENTIFY INTERESTING
NETWORKS OF CRIMINAL
ACTIVITY
  • Byron Marshall
  • Oregon State University
  • Hsinchun Chen
  • University of Arizona

ISI 2006: IEEE Intelligence and Security
Informatics Conference, May 23-24, San Diego, CA
2
The Case for Importance Flooding Analysis in Law
Enforcement (LE)
  • The need for feasible, cross-jurisdictional
    intelligence analysis tools
  • The promise of network methodologies
  • Helping analysts create link charts
  • Criminal Activity Networks (CANs), importance
    flooding, and path-based importance heuristics
  • Importance flooding for LE and more

3
We Need Feasible Cross-Jurisdictional Analysis
Methodologies
  • Events of the past few years both highlight the
    need for cross-jurisdictional sharing and
    demonstrate the difficulty of establishing
    feasible systems.
  • Focusing on investigational usefulness
  • Beyond criminal justice data
  • Respecting privacy and security issues
  • Good sharing systems should support
    investigations.

4
Network-Based Methodologies: The Tool of Choice
Criminal conspirators receive longer sentences.
  • Criminal association networks are understandable
    and actionable.

5
Network- or Graph-Based Analysis is Well Known in
Law Enforcement (LE)
  • From an LE research perspective:
  • Sparrow (1991) discussed the investigational
    implications of social network measures and
    identified some network properties that would
    impact real-world analysis applications: size,
    incompleteness, fuzzy boundaries, and dynamism.
  • Social network analysis measures usually evaluate
    networks based on a single association weight
    (Coffman 2004, Xu and Chen 2003).
  • Shortest path analysis in CrimeLink Explorer
    (Schroeder et al. 2003).

6
Network-Based Analysis in Real-World
Investigations
  • Link charts
  • combine many cases into an overall picture of
    criminal activity related to crime types or
    localities,
  • or focus on a particular case and particular
    suspects.
  • They can be used to
  • focus an investigation,
  • communicate within law enforcement agencies,
  • or present data in court.
  • Link chart creation is valuable, manual, and
    expensive.

7
CAN Analysis: A Fraud/Meth Link Chart
  • An analyst spent 6 weeks in 2003 charting
    relationships between fraud and methamphetamines.

Start with target individuals
Research known associates
Consider patterns of relationships
To inform officers and focus investigations
(Roles labeled on the chart: Dealer, Leg Breaker, Check Washer)
8
From a Data-Mining Perspective
  • The interestingness (or importance) issue is a
    well-recognized problem in the association rule
    mining field (Silberschatz and Tuzhilin 1996).
  • Beliefs such as expected patterns and known
    information can be used to guide data-mining
    algorithms to unexpected or actionable items
    (Padmanabhan and Tuzhilin 1999).

9
Network Interestingness
  • In particular, interestingness as discovered in a
    network of relationships has received some
    attention:
  • White and Smyth (2003) implement a generalized
    spreading activation model building from a root
    set of nodes.
  • Lin and Chalupsky (2003) detect novel network
    paths (not just nodes or links) to reveal
    interesting information.

10
Improving the Link Chart Creation Process
  • Algorithmic: can be applied to larger (e.g.,
    cross-jurisdictional) data sets
  • Easier: used in more investigations
  • More systematic: less training
  • Faster: cheaper

11
Research Gaps
  • Previous research has not directly addressed link
    chart creation.
  • Previous analysis uses only weighted associations
    (network structure), but practitioners rely
    heavily on individual and network-based activity
    heuristics.
  • Emulating crime analysts, we want to use both
    associations (network structure) and importance
    heuristics (node semantics) in our system.

12
Research Questions
  • How can we effectively identify interesting
    sub-networks
  • from associations found in a large collection of
    criminal incidents
  • employing domain knowledge
  • to generate useful investigational leads and
    support criminal conspiracy investigations?
  • Does the use of path-based importance heuristics
    and importance flooding improve upon
    link-weight-only methodologies?

13
Important Considerations (Solution Constraints)
  • Criminal record datasets:
  • May miss personal associations (e.g., family)
  • May miss key individuals who appear unimportant
    until caught or linked to investigational targets
  • Are ambiguous: very different associations look
    the same in the records

14
Thus: Design Goals
  • Selection algorithm design goals:
  • Be target focused
  • Use query-specific information to fill in the
    gaps
  • Tolerate missing and ambiguous data
  • Incorporate adjustable heuristics (or beliefs)
  • These goals are appropriate for a large scale
    cross-jurisdictional analysis and local
    investigations.

15
Importance Flooding
  • Basic Intuition
  • Both a person's past activity and their
    involvement in interesting association patterns
    establish initial importance.
  • Interestingness is partly path-based. That is, we
    improve the analysis by considering patterns of
    association.
  • Associates of interesting people become
    relatively more interesting.

16
Identifying Interesting Sub-networks of Criminal
Associations
The importance flooding module assigns a relative
importance score to nodes in the network.
(Diagram labels: Target List, Link Weight Heuristics,
Importance Heuristics, Simple Filtering (Path Distance),
Importance Flooding, Importance Ranking, Importance
Ranked Network)
17
How Does Importance Flooding Work?
  • Step 1: Assign weights to network links based on
  • the role of each actor in each incident
  • and the frequency of association
  • Step 2: Assign initial importance values to nodes
    given
  • involvement in a specified type of incident
    (e.g., fraud)
  • involvement in a set of incident types (e.g., fraud
    and drugs)
  • participation in an identified path (e.g.,
    Fraud-Drugs-Assault)

Step 3: Recursively pass importance to neighbors
Step 4: Start with targets, best first search
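
Step 3 can be illustrated with a minimal Python sketch. The slides do not say how much importance is passed along each link or when the passing stops, so the decay factor, the fixed number of rounds, and the max-normalization below are assumptions made for illustration only; `flood_importance` and the toy network are hypothetical.

```python
def flood_importance(weights, initial, decay=0.5, rounds=3):
    """Pass importance to neighbors for a fixed number of rounds.

    weights: dict node -> {neighbor: link weight}; every node must be a key.
    initial: dict node -> initial importance (the output of Step 2).
    decay and rounds are illustrative assumptions, not values from the slides.
    """
    scores = {node: initial.get(node, 0.0) for node in weights}
    for _ in range(rounds):
        passed = {node: 0.0 for node in weights}
        for node, neighbors in weights.items():
            for neighbor, w in neighbors.items():
                # Each node sends a share of its current score along each link.
                passed[neighbor] += decay * w * scores[node]
        scores = {node: scores[node] + passed[node] for node in weights}
        top = max(scores.values()) or 1.0
        # Rescale so the most important node scores 1.0 (assumed normalization).
        scores = {node: s / top for node, s in scores.items()}
    return scores


# Toy network: "a" starts out important, "b" is closely tied to "a",
# and "c" is only weakly tied to "b".
weights = {
    "a": {"b": 0.99},
    "b": {"a": 0.99, "c": 0.3},
    "c": {"b": 0.3},
}
scores = flood_importance(weights, {"a": 3.0})
```

In this toy case "b" ends up more important than "c" even though neither had any initial importance, which matches the intuition that associates of interesting people become relatively more interesting.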
18
Importance Flooding
  • 1. Assign weights to network links

4. Start with targets, best first search
Once all the ranks are assigned, we select the
highest ranked node with a direct connection to a
previously selected node.
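
The selection rule just described (take the highest ranked node that is directly connected to an already selected node) can be sketched with a priority queue. This is a minimal sketch, not the authors' implementation; `best_first_selection`, its parameters, and the toy data are hypothetical names chosen for illustration.

```python
import heapq


def best_first_selection(neighbors, scores, targets, limit):
    """Return nodes in selection order, starting from the targets.

    neighbors: dict node -> iterable of directly connected nodes.
    scores: dict node -> importance rank score (higher is better).
    limit: maximum number of nodes to return, targets included.
    """
    selected = list(targets)
    seen = set(targets)
    frontier = []  # max-heap emulated by pushing negated scores

    def push_neighbors(node):
        for nb in neighbors.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(frontier, (-scores.get(nb, 0.0), nb))

    for target in targets:
        push_neighbors(target)
    while frontier and len(selected) < limit:
        _, node = heapq.heappop(frontier)
        selected.append(node)
        push_neighbors(node)
    return selected


neighbors = {"t": ["x", "y"], "x": ["t", "z"], "y": ["t"], "z": ["x"]}
scores = {"t": 1.0, "x": 0.2, "y": 0.6, "z": 0.9}
order = best_first_selection(neighbors, scores, ["t"], limit=4)
```

Here "z" has the second-highest score but is selected last, because it only becomes reachable once "x" has been selected.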
19
Could An Analyst Make A Link Chart Faster Using
Importance Flooding?
  • The Fraud/Meth Chart
  • 110 people
  • Bronze standard
  • Start with 4 targets

The basic notion of the methodology is to help
the analyst build an association network around
the target individuals. We measure success by
computing the ratio of correct suggestions
(people included in the manually constructed
chart) to incorrect suggestions (people not
selected in the original chart).
20
Our Testbed
  • A network of person-incident-person triples from
    incident reports with date, crime type, and role
    (e.g. suspect, arrestee, victim) for each
    individual.
  • For the Fraud/Meth chart we included 4,877 people
    which includes 73/110 targets.

21
Compare Selection Methods
  • BFS: Breadth first search
  • approximates a manual approach
  • CA: Closest Associate, link weights only
  • choose the unselected node which is most closely
    associated with a previously selected node
  • a link chart implementation of previous ideas
  • IMP: Importance flooding, path-based importance
  • Path heuristics with no flooding (PATH)
  • Node-only importance flooding (NO: no path
    heuristics)
  • PIF: Perfect Importance Flooding
  • Approximate flooding with perfect information
  • correct nodes = 1, all other nodes = 0
  • this would be the theoretical limit for our
    methodology
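
To make the two baseline methods concrete, the sketch below contrasts a breadth first ordering (rank by hops from the targets) with a closest associate ordering (always take the unselected node with the strongest link to a selected node). It is an illustrative reading of the slide, not the authors' code; the function names and the toy weights are hypothetical.

```python
import heapq
from collections import deque


def bfs_order(weights, targets):
    """Order nodes by number of hops from the targets; link weights are ignored."""
    order, seen, queue = [], set(targets), deque(targets)
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in weights.get(node, {}):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order


def closest_associate_order(weights, targets):
    """Repeatedly pick the unselected node with the strongest link to a selected node."""
    order, selected, frontier = list(targets), set(targets), []

    def push_links(node):
        for nb, w in weights.get(node, {}).items():
            if nb not in selected:
                heapq.heappush(frontier, (-w, nb))  # negate for max-heap behavior

    for target in targets:
        push_links(target)
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in selected:
            continue  # already picked through a stronger link
        selected.add(node)
        order.append(node)
        push_links(node)
    return order


weights = {
    "t": {"a": 0.3, "b": 0.99},
    "a": {"t": 0.3},
    "b": {"t": 0.99, "c": 0.5},
    "c": {"b": 0.5},
}
```

On this toy network the breadth first order is t, a, b, c, while the closest associate order is t, b, c, a: the weakly linked "a" is deferred until the stronger associations have been followed.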

22
Results: Fraud/Meth
How many nodes did we look at to find one
included in the manually-drawn link chart?
23
Conclusions
  • Our analysis shows the algorithm's promise.
  • All the intelligent methods outperformed breadth
    first search.
  • Importance analysis seems to improve on a
    link-weight-only approach.
  • Path-based heuristics helped as compared to
    solely node-based heuristics.
  • Still, we should be cautious: our data set is
    limited and we can't really say that one chart is
    correct while all others are incorrect. More
    study is needed.

24
Why Does It Fit the Domain?
  • We can encode the kind of heuristics used by
    investigators: path heuristics, association
    weights, and target focus.
  • Inquiry-specific information can be leveraged by
    the algorithm.
  • Heuristics can be tuned to a particular
    investigation.
  • The data we use is shareable:
  • We use relationships, not complete reports.
  • Different entity matching rules can be used for
    different applications.
  • But we still move beyond criminal justice data.

25
Importance Flooding: Not Just for CANs
  • We believe that this kind of algorithm can be
    applied to other informal node-link knowledge
    representations.
  • When the desired output is a network, this
    algorithm is designed to overcome link and
    identifier ambiguity by leveraging both the
    structure and the semantics of the underlying
    network.
  • For example, we plan to explore the use of this
    algorithm in selecting interesting subsets of a
    network of biomedical pathway relations extracted
    from the text of journal abstracts.

26
Acknowledgement
  • NSF, Knowledge Discovery and Dissemination (KDD)
    9983304.
  • NSF, ITR "COPLINK Center for Intelligence and
    Security Informatics Research - A Crime Data
    Mining Approach to Developing Border Safe
    Research."
  • Department of Homeland Security (DHS) /
    Corporation for National Research Initiatives
    (CNRI) "Border Safe."
  • Thanks to the Tucson Police Department: Kathy
    Martinjak, Tim Petersen, and Chuck Violette.

27
Hypotheses Tested on Fraud/Meth Data
A(technique) = Average (nodes selected / correct
nodes selected)
Techniques
  • BFS: breadth first (rank by number of hops)
  • CA: closest associates
  • IMP: importance flooding
  • PATH: path rules, no flooding
  • NO: node rules, flooding
  • PIF: perfect importance flooding
Results
  • All techniques improve on BFS
  • H1a: A(IMP) < A(BFS). Accepted
  • H1b: A(CA) < A(BFS). Accepted
  • Importance flooding outperforms closest associates
  • H2: A(IMP) < A(CA). Accepted
  • Importance flooding outperforms heuristics with no
    flooding
  • H3: A(IMP) < A(PATH). Accepted at 500, 1000, 2000
    but NOT for 100, 250
  • Importance flooding with path rules outperforms
    flooding with only node rules
  • H4: A(IMP) < A(NO). Accepted
  • Given perfect information, flooding outperforms
    other techniques
  • H5a: A(PIF) < A(IMP). Accepted
  • H5b: A(PIF) < A(CA). Accepted
Hypotheses should hold for 100, 250, 500, 1000, and
2000 selected nodes; significant at p = .01 for all
levels of selected nodes.
28
Association Strength
  • Association Strength is based on
  • the role of each actor in each incident
  • and the frequency of association
  • (Schroeder et al. 2003)

Nodes 1 and 2 may both be selected before node 3
because of the Association Strength.
  • Example (as used in this work)
  • Person/Role
  • Suspect/Suspect Relationships: .99
  • Suspect/Not Suspect: .5
  • Not Suspect/Not Suspect: .3
  • Frequency Adjustment
  • 4 or more associations: weight = 1
  • else, (strongest .6, 2nd .2, and 3rd .2)
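
One way to read these numbers is sketched below in Python. The slide lists the factors but not the exact formula, so treating the frequency adjustment as a weighted sum of the three strongest role-based weights is an assumption, as are the function name and the input format (one pair of role strings per shared incident).

```python
# Role-based weight, keyed by how many of the two people were suspects.
ROLE_WEIGHTS = {2: 0.99, 1: 0.5, 0: 0.3}
# Assumed reading of "strongest .6, 2nd .2, and 3rd .2".
FREQUENCY_FACTORS = (0.6, 0.2, 0.2)


def association_strength(incidents):
    """incidents: list of (role_a, role_b) pairs, one per shared incident."""
    if len(incidents) >= 4:
        return 1.0  # 4 or more associations: weight = 1
    per_incident = sorted(
        (ROLE_WEIGHTS[(a == "suspect") + (b == "suspect")] for a, b in incidents),
        reverse=True,
    )
    # Weighted sum over the (up to three) strongest associations.
    return sum(f * w for f, w in zip(FREQUENCY_FACTORS, per_incident))


once = association_strength([("suspect", "suspect")])  # 0.6 * 0.99
twice = association_strength([("suspect", "victim"), ("suspect", "suspect")])
often = association_strength([("victim", "witness")] * 4)  # 1.0
```

Under this reading, three suspect/suspect incidents give .99, just below the weight of 1 assigned at four or more associations.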

29
Activity and Path-Based Initial Importance
System Design
Group Rule Example: Assign this node to the
Fraud Group if they were ever a suspect in a
fraud incident. Optionally, add 2 to the
importance value of any node assigned to the
Fraud Group.
Multi-Group Rule Example: Assign this node to the
Fraud/Drug group if they are a member of both the
Fraud and Drug Groups.
Path Rule Example: F = in Fraud, D = in
Drug Sales, A = in Assault, s/s = two
individuals who are both suspects in a recent
incident. Add 5 to nodes participating in an
F-s/s-D-s/s-A path.
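
The group rule and path rule examples can be sketched as follows. This is a hedged illustration of the rules as worded on the slide, not the system's actual rule engine: the function name, the data layout, and the choice to award the path bonus at most once per node are assumptions.

```python
def initial_importance(groups, suspect_links, group_bonus=2, path_bonus=5):
    """Assign initial importance from one group rule and one path rule.

    groups: dict node -> set of group labels ("F", "D", "A").
    suspect_links: dict node -> set of nodes joined by an s/s relation.
    """
    # Group rule: add group_bonus to every member of the Fraud group.
    importance = {n: (group_bonus if "F" in g else 0) for n, g in groups.items()}

    # Path rule: find F -s/s- D -s/s- A paths and collect their members.
    on_path = set()
    for f, f_groups in groups.items():
        if "F" not in f_groups:
            continue
        for d in suspect_links.get(f, ()):
            if "D" not in groups.get(d, ()):
                continue
            for a in suspect_links.get(d, ()):
                if a not in (f, d) and "A" in groups.get(a, ()):
                    on_path.update((f, d, a))

    # Assumption: a node gets the path bonus once, however many paths it is on.
    for n in on_path:
        importance[n] = importance.get(n, 0) + path_bonus
    return importance


groups = {"p1": {"F"}, "p2": {"D"}, "p3": {"A"}, "p4": {"F"}}
suspect_links = {"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2"}, "p4": set()}
importance = initial_importance(groups, suspect_links)
```

In the toy data, "p1" receives both bonuses (7), "p2" and "p3" receive only the path bonus (5), and "p4" receives only the Fraud group bonus (2).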
30
The Arrow Key Investigation
Testbed
  • Depicts 110 key people (coincidentally the same
    size as the Fraud/Meth chart)
  • 23 original starting points (targets) were
    identified

31
BorderSafe Research Testbed
14 years of local law enforcement (LE) data
(1990-2004): 2.2 million people, 5.2 million
incidents, 1 million vehicles
Sources shown: Tucson Police Department (TPD) and
Pima County Sheriff's Department (PCSD)
(Diagram labels: People, Incident Reports)
32
Data Integration Framework: 2 Steps, 3 Classes of
Data
(Marshall et al. 2004)
33
Results: Arrow Key Investigation
How many nodes did we look at to find one
included in the manually-drawn link chart?
34
Average Nodes Selected Per Correct Node
  • As more nodes are selected, a higher proportion
    of incorrect nodes are selected.

(Chart omitted; axis label: Node Ranking Method)
35
Raw Results: Fraud/Meth Data
Avg = Average Number of Nodes Selected Per Correct Node
SD = Standard Deviation of Number of Nodes Per Correct Node
In each cell, the top number is the number of correct nodes selected.
The second row is the Avg and (SD).
Ranking methodology columns: PIF = Perfect, IMP = Importance Flooding,
PATH = Path Importance No Flooding, NO = Node Importance With Flooding,
CA = Closest Associate, BFS = Breadth First

Nodes Selected              PIF             IMP             PATH            NO              CA              BFS
1 to 100                    68              27              24              19              15              0
                            1.08 (0.14)     3.40 (0.62)     3.39 (0.68)     3.86 (1.07)     4.98 (1.51)     N/A
1 to 250                    69              48              49              42              31              23
                            N/A             3.95 (0.71)     3.96 (0.69)     4.70 (1.00)     6.20 (1.43)     47.63 (21.62)
1 to 500                    69              56              51              47              38              35
                            N/A             5.53 (1.82)     5.76 (2.10)     5.76 (2.10)     8.68 (2.93)     29.68 (23.61)
1 to 1000                   69              62              59              53              51              38
                            N/A             8.99 (3.97)     9.58 (4.32)     10.92 (5.06)    12.47 (4.52)    25.06 (17.48)
1 to 2000                   69              68              67              63              64              45
                            N/A             15.71 (7.77)    16.37 (7.93)    18.23 (8.57)    19.41 (7.98)    30.04 (13.76)
Until All 69 Correct Nodes Were Selected
                            101             2158            3058            4407            4140            4828
                            1.08 (0.14)     16.80 (8.43)    23.69 (12.21)   35.03 (17.88)   33.05 (15.59)   46.92 (17.76)
36
We Need Feasible Cross-Jurisdictional Analysis
Methodologies
  • Vast Volume
  • Tucson alone has Law Enforcement data for 2
    million people, 5 million incidents, and 1
    million vehicles in 14 years
  • Privacy Policies
  • What is buried in the reports?
  • Medical? Personal? Sensitive?
  • Entity Equivalence
  • No unique identifier is available
  • Task-dependent accuracy requirements
  • Sharing data across agencies is difficult and
    expensive. Are simple queries the answer?

37
Numeric Measure
  • We expect that in networks ranked by a better
    algorithm, an analyst would have to look at fewer
    nodes to find a correct node.
  • A(technique) = Average (nodes selected / correct
    nodes selected).
  • It can be measured at various selected node
    levels.
  • For example, A(importance flooding) at 250 is the
  • average ratio of selected nodes to correct
    nodes,
  • selected by the importance flooding algorithm,
  • when the number of selected nodes is 1, 2, 3, ..., 250.
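
The measure can be computed as in the minimal sketch below. The slide does not say how levels with zero correct nodes are handled (the ratio is undefined there), so skipping them and returning `None` when no correct node is ever found are assumptions; the function name and toy ranking are hypothetical.

```python
def average_nodes_per_correct(ranking, correct, level):
    """A(technique) at a given level.

    ranking: nodes in the order the technique selects them.
    correct: set of nodes that appear in the manually drawn chart.
    level: number of selected nodes to average over (e.g., 250).
    """
    ratios = []
    found = 0
    for n, node in enumerate(ranking[:level], start=1):
        if node in correct:
            found += 1
        if found:  # assumption: skip levels where the ratio is undefined
            ratios.append(n / found)
    return sum(ratios) / len(ratios) if ratios else None


ranking = ["a", "x", "b", "y"]
correct = {"a", "b"}
# Ratios at n = 1, 2, 3, 4 are 1/1, 2/1, 3/2, 4/2, so the average is 1.625.
a_value = average_nodes_per_correct(ranking, correct, level=4)
```

A lower value means an analyst would look at fewer suggested nodes per correct node, so a better ranking yields a smaller A.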

39
Association Closeness and Importance (chosen
based on analyst input)
  • Link Weight Heuristics
  • Suspect/Suspect Relationships: .99
  • Suspect/Not Suspect: .5
  • Not Suspect/Not Suspect: .3
  • Frequency Adjustment
  • 4 or more associations: weight = 1
  • else, (1st strongest relation .6, 2nd .2,
    and 3rd .2)
  • Importance
  • Groups: Aggravated Assault (A), Drug Sales (D),
    Drug Possession (P), Fraud (F)
  • (A), (D), or (F): 3
  • Path Rules (all applied only to crimes after
    01/01/2001)
  • (A)-(D)-(F): 5
  • (A)-(D), (A)-(F), (D)-(F), (P)-(F): 3
  • Nodes with any 2 of (A), (D), (F): 3; with (A),
    (D), and (F): 5