Title: Rule Discovery for Fraud Detection
1 KDD Cup 2000 Question 1
2 Overview
- Objective
  - Given a set of page views, predict whether the visitor will view another page or not
- Data
  - Raw Data - Clicks
  - Aggregated Data - Sessions
  - Some sessions clipped in the middle
  - Indicator - Session continues
- Methods and Tools
  - Exploratory Data Analysis - SAS
  - Classification Tree Amdocs Business Insight Tool - Decision tree
  - Rules Extraction
  - Modeling
  - Combining models
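The data preparation listed above (raw clicks aggregated into sessions, some sessions clipped in the middle, and an indicator of whether the session continues) can be sketched as follows. This is only an illustration: the click records, the field layout, and the fixed clip point are assumptions, not the competition's actual format.

```python
from collections import defaultdict

# Hypothetical click log: (session_id, page), already in time order.
clicks = [
    ("s1", "home"), ("s1", "search"), ("s1", "product"),
    ("s2", "home"),
    ("s3", "home"), ("s3", "cart"),
]

def build_sessions(clicks):
    """Group raw clicks into one ordered page list per session."""
    sessions = defaultdict(list)
    for session_id, page in clicks:
        sessions[session_id].append(page)
    return dict(sessions)

def clip_session(pages, clip_at):
    """Keep the first `clip_at` page views and flag whether more followed."""
    observed = pages[:clip_at]
    continues = len(pages) > clip_at
    return observed, continues

sessions = build_sessions(clicks)
# Assumed clip point of 2 page views for every session.
examples = {sid: clip_session(pages, 2) for sid, pages in sessions.items()}
for sid, (observed, continues) in examples.items():
    print(sid, observed, continues)
```

The `continues` flag plays the role of the prediction target: it is true only when page views existed beyond the clip point.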
3 The Winning Model - Introduction
This model combines artificial intelligence
(automated procedures) with human intuition
(decisions based on domain knowledge).
4 The Winning Model - general scheme
5 Building Main Model
[Diagram: Decision Tree (5 trees, built on 34000 cases)]
6 Description of sub-models
Each model captures a different aspect of the
overall behavior in the data. Combining
(ensembling) the models provides the best
prediction results.
Best Rule
Chooses the most accurate rule satisfied by each
record.
Hybrid Model
Logistic regression on the rule set and raw field
values, which combine to define a score for each
record.
Merged Rules
Logistic regression on the rule set defines a
score for each record as a combination of the
rules the record satisfies.
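As a rough sketch of how the three sub-models differ, the example below applies them to a single record. The rules, field names, accuracies, and logistic weights are all hypothetical; in practice the weights would be fitted by logistic regression rather than written by hand, and "stop" is used here as shorthand for "not continue".

```python
import math

# Hypothetical rules: (name, predicate, predicted label, training accuracy).
RULES = [
    ("many_pages", lambda r: r["pages"] >= 5, "continue", 0.80),
    ("short_visit", lambda r: r["seconds"] < 30, "stop", 0.70),
    ("any", lambda r: True, "stop", 0.60),
]

def best_rule(record):
    """Best Rule: predict with the most accurate rule the record satisfies."""
    satisfied = [rule for rule in RULES if rule[1](record)]
    name, _, label, _ = max(satisfied, key=lambda rule: rule[3])
    return name, label

def rule_indicators(record):
    """Merged Rules input: one 0/1 feature per rule."""
    return [1.0 if pred(record) else 0.0 for _, pred, _, _ in RULES]

def hybrid_features(record):
    """Hybrid Model input: rule indicators plus raw field values."""
    return rule_indicators(record) + [float(record["pages"]), float(record["seconds"])]

def logistic_score(features, weights, bias):
    """Score in (0, 1) from a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

record = {"pages": 6, "seconds": 20}
print(best_rule(record))
# Assumed (not fitted) weights, one per rule indicator.
merged = logistic_score(rule_indicators(record), [1.2, -0.8, 0.0], -0.5)
print(round(merged, 3))
```

The only difference between the Merged Rules and Hybrid Model inputs in this sketch is whether the raw fields are appended to the rule indicators.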
7 Applying Main Model
[Diagram: Decision Tree (5 trees, built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); sub-models: Best Rule, Hybrid Model, Merged Rules]
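The slides do not say how the rule generator derives rules from the decision trees. One common approach, shown here purely as an assumption, is to emit one rule per root-to-leaf path and then keep the rules that predict "continue". The tree, field names, and class counts below are made up for illustration.

```python
# Hypothetical tree: internal nodes test `field < threshold`;
# leaves hold training class counts.
TREE = {
    "field": "pages", "threshold": 3,
    "left": {"leaf": {"continue": 10, "stop": 90}},
    "right": {
        "field": "seconds", "threshold": 60,
        "left": {"leaf": {"continue": 30, "stop": 20}},
        "right": {"leaf": {"continue": 45, "stop": 5}},
    },
}

def extract_rules(node, conditions=()):
    """Yield (conditions, predicted label, accuracy) for each root-to-leaf path."""
    if "leaf" in node:
        counts = node["leaf"]
        label = max(counts, key=counts.get)
        accuracy = counts[label] / sum(counts.values())
        yield list(conditions), label, accuracy
        return
    field, threshold = node["field"], node["threshold"]
    yield from extract_rules(node["left"], conditions + (f"{field} < {threshold}",))
    yield from extract_rules(node["right"], conditions + (f"{field} >= {threshold}",))

rules = list(extract_rules(TREE))
continue_rules = [rule for rule in rules if rule[1] == "continue"]
for conditions, label, accuracy in rules:
    print(" AND ".join(conditions), "=>", label, round(accuracy, 2))
```

Each rule carries the accuracy of its leaf, which is the quantity a "most accurate satisfied rule" selection would compare.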
8 The Winning Model - general scheme
9 Small Whitebox
10 Small Whitebox
[Diagram labels: Decision Tree; Applying The Model]
11 The prediction
The prediction is not that much better than
choosing the majority class. But it is enough to
win first place!
12 Final Considerations
- Since both types of errors (false positives and
false negatives) are given the same weight, a
segment must have a very high probability of
continuing to justify not being classified as the
majority class.
- The ratio of continue / not continue in the test
set must be estimated as accurately as possible.
- The cutoff point (the score threshold that divides
the two classes) must be carefully chosen.
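To illustrate the cutoff consideration, the sketch below picks the score threshold that minimizes the number of misclassified records, with both error types weighted equally, and compares it with always predicting the majority class. The validation scores and labels are invented, and the search over observed scores is one simple way to choose a cutoff, not necessarily the one used for the winning model.

```python
def error_count(scores, labels, cutoff):
    """Misclassified records when score >= cutoff predicts 'continue' (1)."""
    return sum((s >= cutoff) != bool(y) for s, y in zip(scores, labels))

def best_cutoff(scores, labels):
    """Try each observed score, plus a cutoff above all of them,
    and keep the candidate with the fewest errors."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda c: error_count(scores, labels, c))

# Hypothetical validation data: 1 = continue, 0 = not continue (majority).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

cutoff = best_cutoff(scores, labels)
model_errors = error_count(scores, labels, cutoff)
# Errors made by always predicting the more frequent class.
majority_errors = min(sum(labels), len(labels) - sum(labels))
print(cutoff, model_errors, majority_errors)
```

Because the cutoff is tuned on a sample, it only transfers to the test set if the continue / not continue ratio there is similar, which is why that ratio has to be estimated carefully.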
13 The End