Title: Investigative Data Mining in Fraud Detection
1 Investigative Data Mining in Fraud Detection
2 Overview (1)
- Investigative Data Mining and Problems in Fraud Detection
- Definitions
- Technical and Practical Problems
- Existing Fraud Detection Methods
- Widely used methods
- The Crime Detection Method
- Comparisons with Minority Report
- Classifiers as Precogs
- Combining Output as Integration Mechanisms
- Cluster Detection as Analytical Machinery
- Visualisation Techniques as Visual Symbols
3 Overview (2)
- Implementing the Crime Detection System - Preparation Component
- Investigation objectives
- Collected data
- Preparation of collected data to achieve objectives
- Implementing the Crime Detection System - Action Component
- Which experiments generate the best predictions?
- Which is the best insight?
- How can the new models and insights be deployed within an organisation?
- Contributions and Recommendations
- Significant research contributions
- Proposed solutions
4 Literature and Acknowledgements
Dick P K (1956) Minority Report, Orion Publishing Group, London, Great Britain.
Abagnale F (2001) The Art of the Steal: How to Protect Yourself and Your Business from Fraud, Transworld Publishers, NSW, Australia.
Mena J (2003) Investigative Data Mining for Security and Criminal Detection, Butterworth Heinemann, MA, USA.
Elkan C (2001) Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA.
Prodromidis A (1999) Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA.
Berry M and Linoff G (2000) Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA.
Han J and Kamber M (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
Witten I and Frank E (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, CA, USA.
5 Investigative Data Mining and Problems in Fraud Detection
6 Investigative Data Mining - Definitions
- Investigative
- Official attempt to extract some truth, or insights, about criminal activity from data
- Data Mining
- Process of discovering, extracting, and analysing meaningful patterns, structure, models, and rules from large quantities of data.
- Spans several research areas such as databases, machine learning, neural networks, data visualisation, statistics, and distributed data mining.
- Investigative Data Mining
- Applied to law enforcement,
- Industry, and
- Private databases
7 Fraud Detection - Definitions
- Fraud
- Criminal deception, use of false representations to obtain an unjust advantage, or to injure the rights and interests of another
- Diversity of Fraud
- Against organisations, governments, and individuals
- Committed by external parties, internal management, and non-management employees
- Caused by customers, service providers, and suppliers
- Prevalent in insurance, credit card, and telecommunications
- Most common in automobile, travel, and household contents
- Cost of Fraud
- Automobile insurance fraud alone: AUD 32 million for nine Australian companies
8 Fraud Detection Problems - Technical
- Imperfect data
- Usually not collected for data mining
- Inaccurate, incomplete, and irrelevant data attributes
- Highly skewed data
- Many more legitimate than fraudulent examples
- Higher chances of overfitting
- Black-box predictions
- Numerical outputs incomprehensible to people
9 Fraud Detection Problems - Practical
- Lack of domain knowledge
- Important attributes, likely relationships, and known patterns
- Three types of fraud offenders and their modus operandi
- Great variety of fraud scenarios over time
- Soft fraud: Cost of investigation > Cost of fraud
- Hard fraud: Circumvents anti-fraud measures
- Assessing data mining potential
- Predictive accuracy is useless for skewed data sets
10 Existing Fraud Detection Methods
11 Widely Used Methods in Fraud Detection
- Insurance Fraud
- Cluster detection -> decision tree induction -> domain knowledge, statistical summaries, and visualisations
- Special case: neural network classification -> cluster detection
- Credit Card Fraud
- Decision tree and naive Bayesian classification -> stacking
- Telecommunications Fraud
- Cluster detection -> scores and rules
12 The Crime Detection Method
13 Comparisons with Minority Report
- Precogs
- Foresee and prevent crime
- Each precog contains multiple classifiers
- Integration Mechanisms
- Combine predictions
- Analytical Machinery
- Record, study, compare, and represent predictions in simple terms
- Single computer
- Visual Symbols
- Explain the final predictions
- Graphical visualisations, numerical scores, and descriptive rules
14 The Crime Detection Method
15 Classifiers as Precogs
- Precog One: Naive Bayesian Classifiers
- Statistical paradigm
- Simple and fast
- Redundant and not normally distributed attributes
- Precog Two: C4.5 Classifiers
- Computer metaphor
- Explain patterns and quite fast
- Scalability and efficiency issues
- Precog Three: Backpropagation Classifiers
- Brain metaphor
- Long training times and extensive parameter tuning
- Advantages and disadvantages
- For details on how the problems were tackled, please refer to the thesis
16 Combining Output as Integration Mechanisms
- Cross Validation
- Divides training data into eleven data partitions
- Each data partition used for training, testing, and evaluation once
- Slightly better success rate
- Bagging
- Unweighted majority voting on each example or instance
- Combine predictions from same algorithm or different algorithms
- Increases success rate
- For details on how the technique works, please refer to the thesis
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Main Prediction
fraud | fraud | legal | fraud | legal | fraud | legal | fraud | fraud | legal | fraud | fraud
fraud | fraud | fraud | legal | legal | fraud | legal | legal | legal | fraud | legal | legal
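The unweighted majority vote in the table above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `majority_vote` is an assumption, and the two input rows are copied from the table.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted most often (unweighted majority voting)."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# The two example rows from the table (one prediction per classifier, 11 in total).
row_1 = ["fraud", "fraud", "legal", "fraud", "legal", "fraud",
         "legal", "fraud", "fraud", "legal", "fraud"]
row_2 = ["fraud", "fraud", "fraud", "legal", "legal", "fraud",
         "legal", "legal", "legal", "fraud", "legal"]

main_1 = majority_vote(row_1)  # 7 fraud votes against 4 legal votes
main_2 = majority_vote(row_2)  # 5 fraud votes against 6 legal votes
print(main_1, main_2)          # fraud legal
```

With eleven voters and two classes a tie cannot occur, which is one practical reason to use an odd number of partitions.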
17 Combining Output as Integration Mechanisms
- Stacking
- Meta-classifier
- Base classifiers present predictions to meta-classifier
- Determines the most reliable classifiers
- For details on how the technique works, please refer to the thesis
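To make the stacking idea concrete, here is a minimal sketch in which the meta-classifier is a simple lookup table learned from held-out base predictions. The lookup-table meta-learner, the function names, and the toy data are all assumptions for illustration; the slides do not say which meta-classifier the thesis used.

```python
from collections import Counter, defaultdict

def train_meta_classifier(base_predictions, true_labels):
    """Learn, for each combination of base predictions, the most common true label."""
    table = defaultdict(Counter)
    for preds, label in zip(base_predictions, true_labels):
        table[tuple(preds)][label] += 1
    return {combo: counts.most_common(1)[0][0] for combo, counts in table.items()}

def meta_predict(meta, preds, default="legal"):
    """Final prediction for one instance; unseen combinations fall back to `default`."""
    return meta.get(tuple(preds), default)

# Hypothetical held-out predictions from three base classifiers
# (order: naive Bayesian, C4.5, backpropagation) with their true labels.
held_out_preds = [
    ("fraud", "fraud", "legal"),
    ("legal", "fraud", "legal"),
    ("fraud", "legal", "legal"),
    ("legal", "fraud", "legal"),
    ("fraud", "legal", "legal"),
    ("legal", "legal", "legal"),
]
held_out_labels = ["fraud", "fraud", "legal", "fraud", "legal", "legal"]

meta = train_meta_classifier(held_out_preds, held_out_labels)
final = meta_predict(meta, ("legal", "fraud", "legal"))
print(final)  # fraud
```

In this toy data the second base classifier is always right, so the meta-classifier learns to follow it even when a plain majority vote would say "legal".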
18 Combining Output as Integration Mechanisms
19 Cluster Detection as Analytical Machinery / Visualisation Techniques as Visual Symbols
- Analytical Machinery: Self Organising Maps
- Clusters high-dimensional elements into simpler, low-dimensional maps
- Automatically groups similar instances together
- Do not specify an easy-to-understand model
- Visual Symbols: Classification and Clustering Visualisations
- Classification visualisation - confusion matrix
- - naive Bayesian visualisation
- Clustering visualisation - column graph
- For details on how the problems were tackled,
please refer to the thesis
20 Steps in the Crime Detection Method
21 Implementing the Crime Detection System - Preparation Component
22 The Crime Detection System
23 The Crime Detection System - Preparation Component
- Problem Understanding
- Determine investigation objectives
- - Choose
- - Explain
- Assess situation
- - Available tools
- - Available data set
- - Cost model
- Determine data mining objectives
- - Max hits/Min false alarms
- Produce project plan
- - Time
- - Tools
- For details, refer to the thesis
24 The Crime Detection System - Preparation Component
- Data Understanding
- Describe data
- - 11550 examples (1994 and 1995)
- - 3870 instances (1996)
- - 33 attributes
- - 6% fraudulent
- Explore data
- - Claim trends by month
- - Age of vehicles
- - Age of policy holder
- Verify data
- - Good data quality
- - Duplicate attribute, highly skewed attributes
25 The Crime Detection System - Preparation Component
- Data Preparation
- Select data
- - All, except one attribute, are retained for analysis
- Clean data
- - Missing values replaced
- - Spelling mistakes corrected
- Format data
- - All characters converted to lowercase
- - Underscore symbol
26 The Crime Detection System - Preparation Component
- Data Preparation
- Construct data
- - Derived attributes
- - weeks_past
- - is_holidayweek_claim
- - age_price_wsum
- - Numerical input
- - 14 attributes scaled between 0 and 1
- - 19 attributes represented by one-of-N or binary encoding
- For details, refer to the thesis
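The two numerical-input steps above (scaling to the range 0 to 1, and one-of-N encoding) can be sketched as follows. This is a minimal illustration assuming plain min-max scaling; the attribute names come from the slides, but the values and function names are made up.

```python
def min_max_scale(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_of_n(values):
    """One-of-N encoding: one binary indicator per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Made-up example values for two attributes named in the slides.
weeks_past = [0, 5, 10, 20]
vehicle_category = ["sedan", "sport", "utility", "sedan"]

scaled = min_max_scale(weeks_past)
encoded = one_of_n(vehicle_category)
print(scaled)   # [0.0, 0.25, 0.5, 1.0]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```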
27 The Crime Detection System - Preparation Component
- Data Preparation
- Partition data
- - Data multiplication or oversampling
- - For example, 50/50 distribution
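One way to reach a target distribution such as 50/50 is to duplicate fraud examples at random until they form the desired share. The sketch below assumes this random-duplication approach; the function name, the `minority_share` parameter, and the toy data are hypothetical, and the thesis may have partitioned or multiplied the data differently.

```python
import random

def oversample(rows, minority="fraud", minority_share=0.5, seed=0):
    """Duplicate minority-class rows until they make up `minority_share` of the data.

    `rows` is a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[1] == minority]
    majority_rows = [r for r in rows if r[1] != minority]
    if not minority_rows:
        raise ValueError("no minority-class rows to duplicate")
    # Solve m / (m + M) = share for m, the required number of minority rows.
    target = round(minority_share * len(majority_rows) / (1 - minority_share))
    extra = [rng.choice(minority_rows)
             for _ in range(max(0, target - len(minority_rows)))]
    balanced = majority_rows + minority_rows + extra
    rng.shuffle(balanced)
    return balanced

# Toy data with the 6% fraud rate quoted in the slides: 6 fraud, 94 legal.
rows = [({"id": i}, "fraud" if i < 6 else "legal") for i in range(100)]
balanced = oversample(rows, minority_share=0.5)
fraud_count = sum(1 for _, label in balanced if label == "fraud")
print(len(balanced), fraud_count)  # 188 94
```

Calling `oversample(rows, minority_share=0.4)` gives a 40/60 split in the same way.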
28 Implementing the Crime Detection System - Action Component
29 The Crime Detection System - Action Component
- Modelling
- Generate experiment design (1)
Experiment Number | Technique or Algorithm | Data Distribution
I | Naive Bayes | 50/50
II | Naive Bayes | 40/60
III | Naive Bayes | 30/70
IV | Backpropagation | Determined by Experiments I, II, III
V | C4.5 | Determined by Experiments I, II, III
VI | Bagging | -
VII | Stacking | -
VIII | Stacking and Bagging | -
IX | Backpropagation | 5/95
X | Self Organising Map | 5/95
30 The Crime Detection System - Action Component
- Modelling
- Generate experiment design (2)
Test | A | B | C | D | E | F | G | H | I | J | K | Overall Success Rate
Training Set Partition | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | -
Testing Set Partition | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 1 | -
Evaluation Set Partition | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 1 | 2 | -
Evaluating Success Rate | A | B | C | D | E | F | G | H | I | J | K | Average W
Bagging Predictions | A | B | C | D | E | F | G | H | I | J | K | Bagged X
Producing Classifier | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | -
Scoring Set Success Rate | A | B | C | D | E | F | G | H | I | J | K | Average Y
Bagging Main Score Predictions | A | B | C | D | E | F | G | H | I | J | K | Bagged Z
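The rotation of training, testing, and evaluation partitions in the table above follows a simple cyclic pattern, sketched below. The function name `rotation_schedule` is an assumption; the partition numbers it produces match the first three rows of the table.

```python
def rotation_schedule(k=11):
    """For each test, return the (training, testing, evaluation) partition numbers."""
    schedule = []
    for i in range(k):
        training = i + 1
        testing = (i + 1) % k + 1     # the next partition, wrapping around after k
        evaluation = (i + 2) % k + 1  # the partition after that
        schedule.append((training, testing, evaluation))
    return schedule

tests = "ABCDEFGHIJK"
schedule = rotation_schedule()
for name, (train, test, evaluate) in zip(tests, schedule):
    print(name, train, test, evaluate)
```

Each partition therefore serves exactly once in each of the three roles.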
31 The Crime Detection System - Action Component
- Modelling
- Build models (1)
- - Bagged X outperformed Averaged W
- - Bagged Z performed marginally better than Averaged Y
- - Experiment II achieved higher cost savings than I and III
- - 40/60 distribution most appropriate under the cost model
- - Experiment V achieved higher cost savings than II and IV
- - C4.5 algorithm is the best algorithm for the data set
32 The Crime Detection System - Action Component
- Modelling
- Build models (2)
- - Experiment VIII achieved slightly better cost savings than V
- - Combining models from different algorithms is better than the single algorithm
- - The top 15 classifiers from stacking consisted of 9 C4.5, 4 backpropagation, and 2 naive Bayesian classifiers
- For details, refer to the thesis
33 The Crime Detection System - Action Component
- Modelling
- Build models (3)
- - No scores from D2K software
- - Experiment IX demonstrates that sorted scores and predefined thresholds result in focused investigations
- - Satisfies Pareto's Law
- - Rules did not provide insights
- - Already in domain knowledge and data attribute exploration
- - Experiment X requires 5 clusters for visualisation
- - age_of_policyholder
- - weeks_past, is_holidayweek_claim
- - make, accident_area, vehicle_category, age_price_wsum, number_of_cars, base_policy
- For details, refer to the thesis
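The "sorted scores and predefined thresholds" idea from Experiment IX can be illustrated with a short sketch. The claim identifiers, score values, and the 0.75 threshold are all hypothetical; the slides do not state the score range or thresholds actually used.

```python
def flag_for_investigation(scores, threshold):
    """Sort claims by descending score and keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [claim_id for claim_id, score in ranked if score >= threshold]

# Hypothetical fraud scores in the range 0 to 1 for four claims.
scores = {"claim_a": 0.12, "claim_b": 0.91, "claim_c": 0.47, "claim_d": 0.78}

flagged = flag_for_investigation(scores, threshold=0.75)
print(flagged)  # ['claim_b', 'claim_d']
```

Investigators then work down the list from the top, so effort goes to the small share of claims that carries most of the risk.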
34 The Crime Detection System - Action Component
- Modelling
- Assess models (1)
- - Training and score data sets too small
- - Student's t-test with k-1 degrees of freedom
- - McNemar's hypothesis test
- For details, refer to the thesis
Rank | Experiment Number | Technique or Algorithm | Cost Savings | Overall Success Rate (%) | Percentage Saved (%)
1 | VIII | Stacking and Bagging | 167,069 | 60 | 29.71
2 | V | C4.5 40/60 | 165,242 | 60 | 29.38
3 | VI | Bagging | 127,454 | 64 | 22.66
4 | VII | Stacking | 104,887 | 70 | 18.65
5 | II | Naive Bayes 40/60 | 94,734 | 70 | 16.85
6 | IX | Backpropagation 5/95 | 89,232 | 75 | 15.87
7 | IV | Backpropagation 40/60 | -6,488 | 92 | -1.15
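McNemar's test, listed among the assessment methods above, compares two classifiers on the same instances by looking only at the cases where exactly one of them is correct. The sketch below uses the standard continuity-corrected statistic; the function name and the toy correctness flags are assumptions, not data from the thesis.

```python
def mcnemar_statistic(correct_a, correct_b):
    """Continuity-corrected McNemar chi-squared statistic from paired correctness flags."""
    # b: classifier A right and B wrong; c: classifier B right and A wrong.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    if b + c == 0:
        return 0.0  # the classifiers never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical results on 100 instances: both right on 80,
# only A right on 15, only B right on 5.
correct_a = [True] * 80 + [True] * 15 + [False] * 5
correct_b = [True] * 80 + [False] * 15 + [True] * 5

statistic = mcnemar_statistic(correct_a, correct_b)
print(statistic)  # 4.05
```

A value above 3.841 (the 5% critical value of the chi-squared distribution with one degree of freedom) suggests the two classifiers differ by more than chance.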
35 The Crime Detection System - Action Component
- Modelling
- Assess models (2)
- - Clusters 1, 2, and 3 have higher occurrences of fraud in 1996
- - Clusters 1, 3, and 5 consist of several makes of inexpensive cars
- - Utility vehicles, rural areas, and liability policies
- - Clusters 2 and 4 contain claims submitted many weeks after the accidents
- - Toyota, sport cars, and multiple policies
Cluster | Number of instances | Descriptive Cluster Profile
1 | 215 | Cluster 1 contains a large number of 21 to 25 year olds. The insured vehicles are relatively new.
2 | 166 | Cluster 2 also contains a large number of 21 to 25 year olds. The claims are usually reported 10 weeks past the accident. The insured vehicles are usually sport cars.
3 | 268 | Cluster 3 has almost all 16 to 17 year old fraudsters. The insured vehicles are mainly Acuras, Chevrolets, and Hondas. The insured vehicles are usually utility cars.
4 | 103 | Cluster 4 claims are usually reported 20 weeks past the accident. Almost all insured cars are Toyotas and the fraudster has a high probability of getting 3 to 4 cars insured. Claims are unlikely to be submitted during holiday periods.
5 | 171 | Cluster 5 consists of mainly Fords, Mazdas, and Pontiacs. Higher chances of rural accidents, and the base policy type is likely to be liability.
36 The Crime Detection System - Action Component
- Modelling
- Assess models (3)
- - Statistical evaluation of descriptive cluster profiles
- - Cluster 4
- - 3121 Toyota car claims, 6% or 187 fraudulent
- - 2148 Toyota sedan car claims, expect 6% or 129 to be fraudulent with a standard deviation of 10
- - Actual 171 fraudulent Toyota sedan car claims, z-score of 3.8 standard deviations
- - This is an insight because it is statistically reliable, not known previously, and actionable
Cluster | Group | Claims | No. and % of Fraud | Sub-Group | Claims | Expected No. of Fraud | Actual No. of Fraud | z-Score
1 | All claims | 15420 | 923 (6%) | 21 to 25 year olds | 108 | 2 | 16 | 5
2 | Sport cars | 5358 | 84 (1.6%) | 21 to 25 year olds, Sport cars | 32 | 1 | 10 | 9.5
3 | 16 to 17 year olds | 320 | 31 (9.7%) | Honda, 16 to 17 year olds | 31 | 3 | 31 | 9.3
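The Cluster 4 check above can be reproduced approximately by treating the number of fraudulent claims in a sub-group as binomial. That binomial model, and the function name, are assumptions made for this sketch; the slides do not say how the standard deviation was derived. Under this assumption the standard deviation comes out near 11 rather than the quoted 10, while the expected count and z-score match the quoted 129 and 3.8.

```python
import math

def binomial_z_score(n_claims, base_rate, actual_frauds):
    """Expected frauds, standard deviation, and z-score under a binomial model."""
    expected = n_claims * base_rate
    std_dev = math.sqrt(n_claims * base_rate * (1 - base_rate))
    z = (actual_frauds - expected) / std_dev
    return expected, std_dev, z

# Toyota sedan claims: 2148 claims, 6% base fraud rate, 171 actual frauds.
expected, std_dev, z = binomial_z_score(2148, 0.06, 171)
print(round(expected), round(std_dev, 1), round(z, 1))  # 129 11.0 3.8
```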
37 The Crime Detection System - Action Component
- Modelling
- Assess models (4)
- - Append main predictions from 3 algorithms and final predictions from bagging to 615 fraudulent instances
- - 25 cannot be detected by any algorithm, highest lift in Clusters 1 and 2
- - All can be detected by at least 1 algorithm in Cluster 3
- - Not all fraudulent instances can be detected
- - Domain knowledge, cluster detection, and statistics offer explanation
- - 101 cannot be detected by 2 algorithms
- - Weakness of bagging
- - Other alternatives
38 The Crime Detection System - Action Component
- Evaluation
- Evaluate results
- - Experiment VIII generates the best predictions with cost savings of about 168,000. This is almost 30% of total cost savings possible
- - Most statistically reliable insight is the knowledge of 21 to 25 year olds who drive sport cars
- Review process
- - Unsupervised learning to derive clusters first
- - More training data partitions
- - More skewed distributions
- - Cost model too simplistic
- - Probabilistic Neural Networks
39 The Crime Detection System - Action Component
- Deployment
- Plan deployment
- - Manage geographically distributed databases using distributed data mining
- - Take time into account
- Plan monitoring and maintenance
- - Determined by rate of change in external environment and organisational requirements
- - Rebuild models when cost savings are below a certain percentage of maximum cost savings possible
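The rebuild rule above amounts to a simple threshold check, sketched here. The 25% cut-off and the monetary figures are hypothetical; the slides leave the "certain percentage" unspecified.

```python
def needs_rebuild(current_savings, max_possible_savings, min_fraction=0.25):
    """True when savings fall below `min_fraction` of the maximum possible savings."""
    return current_savings < min_fraction * max_possible_savings

# Hypothetical monitoring figures for two review periods.
rebuild_now = needs_rebuild(100_000, 500_000)    # 20% of maximum: below the cut-off
rebuild_later = needs_rebuild(150_000, 500_000)  # 30% of maximum: above the cut-off
print(rebuild_now, rebuild_later)  # True False
```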
40 Contributions and Recommendations
41 Contributions
- New Crime Detection Method
- Crime Detection System
- Cost Model
- Visualisations
- Statistics
- Score-based Feature
- Extensive Literature Review
- In-depth Analysis of Algorithms
42 Recommendations - Technical Problems
- Imperfect data
- Statistical evaluation and confidence intervals
- Preparation component of crime detection system
- Derived attributes
- Cross validation
- Highly skewed data
- Partitioned data with most appropriate distribution
- Cost model
- Black-box predictions
- Classification and clustering visualisation
- Sorted scores and predefined thresholds, rules
43 Recommendations - Practical Problems
- Lack of domain knowledge
- Action component of crime detection system
- Extensive literature review
- Great variety of fraud scenarios over time
- SOM
- Crime detection method
- Choice of algorithms
- Assessing data mining potential
- Quality and quantity of data
- Cost model
- z-scores
44 Transforming Minority Report from Science Fiction to Science Fact
INVESTIGATIVE DATA MINING IN FRAUD DETECTION
- 1 INTRODUCTION
- The world is overwhelmed with terabytes of data, but there are only a few effective and efficient ways to analyse and interpret it.
- The purpose of the research is to simulate the Precrime System from the science fiction novel, Minority Report, using data mining methods and techniques, to extract insights from enormous amounts of data to detect white-collar crime
- The application is in uncovering fraudulent claims in automobile insurance
- The objectives are to overcome the technical and practical problems of data mining in fraud detection
- 3 RESULTS ON AUTOMOBILE INSURANCE DATA
- Through the use of integration mechanisms, the highest cost savings is achieved
- The analytical machinery facilitated the interesting discovery of 21 to 25 year old fraudsters who used sport cars as their crime tool
- 4 DISCUSSION
- The black-box approach of the precogs is transformed into a semi-transparent approach by using analytical machinery and visual symbols to analyse and interpret the predictions
- Precogs can be shared between organisations to increase the accuracy of the predictions, without violating competitive and legal requirements
- The analytical machinery transforms multidimensional data into two-dimensional clusters which contain similar data, to enable the data analyst to easily differentiate the groups of fraud. It also allows the data analyst to assess the algorithms' ability to cope with evolving fraud
- The crime detection method provides a flexible step-by-step approach to generating predictions from any three algorithms, and uses some form of integration mechanism to increase the likelihood of correct final predictions
- 2 THE CRIME DETECTION METHOD
- Precogs, or precognitive elements, are entities which have the knowledge to predict that something will happen. Figure 1 uses three precogs to foresee and prevent crime by stopping potentially guilty criminals
- Each precog contains multiple classification models, or classifiers, trained with one data mining technique to extrapolate the future
- The three precogs are different from each other because they are trained by different data mining algorithms. For example, the first, second, and third precogs are trained using the naive Bayesian, C4.5, and backpropagation algorithms respectively.
- The precogs require numerical inputs of past examples to output corresponding predictions for new instances
- Integration Mechanisms are needed. As each precog outputs its many predictions for each instance, all are counted and the class with the highest tally is chosen as the main prediction
- Figure 1 shows that the main predictions can be combined either by majority count (bagging), or fed back into one of the precogs (stacking), to derive a final prediction
- 5 CONCLUSION
- Other possible applications of this crime detection method are
- Anti-terrorism
- Burglary
- Customs declaration fraud
- Drug-related homicides
- Drug smuggling
- Government financial transactions
- Sexual offences
Figure 1: Predictions using Precogs, Analytical Machinery, and Visual Symbols
- Analytical Machinery, or cluster detection, records, studies, compares, and represents the precogs' predictions in easily understood terms
- The analytical machinery is represented by the Self Organising Map (SOM), which clusters similar data into groups
- Figure 1 demonstrates that main predictions and final predictions are appended to the clustered data to determine the fraud characteristics which cannot be detected, and the most important attributes are selected for visualisation
- Scores are numbers within a specified range, which indicate the relative risk that a particular data instance may be fraudulent, and are used to rank instances
- Rules are expressions in the form of Body -> Head, where Body describes the conditions under which the rule is generated and Head is the class label
- Visual Symbols, or visualisations, integrate human perceptual abilities in the data analysis process by presenting the data in some visual and interactive form
- The naive Bayesian and C4.5 visualisations facilitate analysis of classifier predictions and performance, and column graphs aid the interpretation of clustering results
- REFERENCES
- Dick P K (1956) Minority Report, Orion Publishing Group, London, Great Britain.
- Done by Clifton Phua for Honours 2003
- Supervised by Dr. Damminda Alahakoon
45 Questions?