Title: Neo Metrics Template
1A mathematical model for fraud prediction and
control planning of electrical company clients
June 2010
- Camilla Bruni (Italy), Università degli Studi di Firenze
- Herberth Espinoza (Peru), Universidad Complutense de Madrid
- Antoine Levitt (France), University of Oxford
- María Loreto Luque (Spain), Universidad Complutense de Madrid
- Carlos Parra (Spain), Universidad Complutense de Madrid
- Aranzazu Pérez (Spain), Universidad Complutense de Madrid
- Elisa Pérez (Spain), Universidad Complutense de Madrid
Coordinators
- Dr. Benjamin Ivorra (Universidad Complutense de Madrid)
- Dr. Juan Tejada (Universidad Complutense de Madrid)
- D. Jorge Juan Suerias (Neo Metrics)
- D. Fernando Fernández (Neo Metrics)
- Pilar Gómez (Neo Metrics)
2Outline
- Problem Description
- Data Analysis
- Mathematical Modeling
- Numerical Validation
- Conclusions and Perspectives
3 4Context
- An electrical company keeps a crew of inspectors in Chile to check whether customers are manipulating their electrical meters.
- Each check has a cost, and the company wishes to identify the customers with a higher risk of fraud in order to maximize benefits in a possible control campaign.
5Objectives of the work
- Currently, the company's policy of checking customers at random detects 6.6% of fraud.
- Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we:
- Explore and treat the data using SAS.
- Fit a logistic regression model using SAS and calculate the lift-chart/ROC model performance indicators.
- Find the optimal control campaign according to the model, using Matlab, and validate the results.
6 7DATA ANALYSIS
- What did Neo Metrics supply?
- We received the database train.csv from Neo Metrics.
- Database characteristics:
- 79,459 records
- 49 variables
- 30 MB
8DATA DESCRIPTION
- We have 49 explanatory variables, which can be divided into several groups.
- The variables describe each customer's characteristics (economic, geographical, technical).
- We select and simplify the most representative variables. To do so:
- We categorize non-linear continuous variables according to the fraud proportion in the population.
- For groups that contain both categorical and quantitative variables, we use discrimination techniques (binary trees).
- For groups that contain only quantitative variables, we apply principal component analysis (PCA) to examine their correlations.
9Variable simplification
- To identify the non-linear continuous variables and their representative values, we calculate:
- The population proportion in each group
- The fraud proportion in each group
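The two proportions above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS procedure used in the study; the function name `group_proportions` and the toy data are hypothetical.

```python
from collections import defaultdict

def group_proportions(groups, fraud_flags):
    """Return {group: (population share, fraud rate within the group)}."""
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for group, flag in zip(groups, fraud_flags):
        counts[group] += 1
        frauds[group] += flag
    total = len(groups)
    return {g: (counts[g] / total, frauds[g] / counts[g]) for g in counts}

# Hypothetical toy data: a binned continuous variable and a 0/1 fraud flag.
bins = ["low", "low", "low", "mid", "mid", "high", "high", "high", "high", "high"]
fraud = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

props = group_proportions(bins, fraud)
for name, (share, rate) in props.items():
    print(f"{name}: population share {share:.2f}, fraud rate {rate:.2f}")
```

Bins whose fraud rate differs clearly from their neighbours are candidates for separate categories; bins with similar rates can be merged.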
14VARIABLES SELECTION
- Principal component analysis (PCA) is applied to the Debt and Payment variable group.
- The aim of PCA is to reduce the number of observed variables per individual while keeping as much of the variability as possible.
- PCA is based on the spectral analysis of the correlation matrix.
- This process was performed using SAS.
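As a rough illustration of the spectral analysis mentioned above, the sketch below builds a correlation matrix and extracts only its leading eigenpair by power iteration. It is not the SAS procedure used in the study; the column names and values are hypothetical, and a full PCA would compute all eigenpairs.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]

def leading_component(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector for a symmetric matrix (power iteration)."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, vec

# Hypothetical debt and payment columns for five clients.
debt = [1.0, 2.0, 3.0, 4.0, 5.0]
max_debt = [2.0, 4.0, 6.0, 8.0, 11.0]
payments = [5.0, 3.0, 4.0, 1.0, 2.0]

corr = correlation_matrix([debt, max_debt, payments])
eigenvalue, direction = leading_component(corr)
# The trace of a correlation matrix equals the number of variables,
# so eigenvalue / p is the share of variance explained by the component.
share = eigenvalue / len(corr)
print(f"first component explains {share:.0%} of the variance")
```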
15SELECTION OF VARIABLES
- We start with the principal component analysis of the group of calculated Debt and Payment variables.
- The most relevant variables are:
- deuda_ult_mean
- max_deuda
- dif_ult_mean_deuda
- mean_dif_deuda
- After finishing the procedure, we select three variables, accounting for 90% of the information. We then repeat the process.
16Study Of Significance
- We carry out a study of the most significant variables that we might include in our model, for example: geographic variables, connection identifiers, customer characteristic groups.
- proc logistic with selection=stepwise
- Pearson correlation coefficient
17Study Of Multicollinearity
- VIF: variance inflation factor
- The VIF represents the increase in variance due to the presence of multicollinearity.
- The VIF takes values from a minimum of 1, reached when there is no multicollinearity, upward.
- Other procedures: discriminant analysis (DISCRIM)
- max_deuda
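The VIF of variable j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on the other explanatory variables; equivalently, it is the j-th diagonal entry of the inverse correlation matrix. The sketch below uses that second form. It is only an illustration, assuming a small set of hypothetical columns, and is not the SAS output from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """Variance inflation factor of each column: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(columns))]

# Hypothetical columns: the first two are nearly collinear.
debt = [1.0, 2.0, 3.0, 4.0, 5.0]
max_debt = [2.0, 4.0, 6.0, 8.0, 11.0]
payments = [5.0, 3.0, 4.0, 1.0, 2.0]
for name, value in zip(["debt", "max_debt", "payments"], vifs([debt, max_debt, payments])):
    print(f"VIF({name}) = {value:.1f}")
```

A large VIF for two nearly collinear columns suggests keeping only one of them in the model.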
18Classification trees
- To obtain a good predictive model
- Decision support models that can be applied in
the identification of fraudsters
19- III MATHEMATICAL MODELING
20Regression
- Predict a random variable from a set of explanatory random variables
- In our case, predict fraud probability using client information
- Well-studied problem with numerous applications (spam filters, image classification, insurance policies...)
- Simplest idea: linear regression
21Logistic regression
- The quantity to predict, $p$, is a probability: $0 \le p \le 1$. Linear regression, $p = a x + b$, does not respect this constraint.
- Map the result of the linear regression to a probability with the logistic curve:
- $p = \dfrac{1}{1 + e^{-(a x + b)}}$
- We calibrate the model parameters on the training set (find optimal a and b), and test on the validation set.
- Use an optimisation procedure to fit the model
- Other possibilities: add terms to the model, such as interactions (products of variables), nonlinearities, etc.
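The calibration step can be sketched as follows for a single explanatory variable. This is a minimal example assuming plain gradient ascent on the log-likelihood as the optimisation procedure; the study itself fitted the model in SAS, and the data, learning rate, and function names here are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic curve mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p = sigmoid(a * x + b) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [y - sigmoid(a * x + b) for x, y in zip(xs, ys)]
        a += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b += learning_rate * sum(errors) / n
    return a, b

# Hypothetical training set: one explanatory variable and a 0/1 fraud outcome.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [0, 0, 1, 0, 0, 1, 1, 1]
a, b = fit_logistic(train_x, train_y)

# Hypothetical validation clients: predicted fraud probabilities.
validation_x = [0.5, 3.5, 6.5]
predictions = [sigmoid(a * x + b) for x in validation_x]
print([round(p, 3) for p in predictions])
```

With several explanatory variables, `a * x` becomes a dot product, and interaction terms are added as extra columns containing products of variables.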
22Model evaluation
- In order to validate our model:
- We compare estimated fraud probabilities to measured outcomes on the validation set.
- Use different metrics: ROC curve, lift chart, etc.
- Use the results to establish the optimal control policy
23ROC curve
- X-axis: false positive rate; Y-axis: true positive rate
- For a given allowed false positive rate (1 - specificity), determine the success rate (sensitivity)
- Perfect model: Y = 1; random model: Y = X; bad model: Y = 0
- Goal: get above Y = X
- Area under the curve: a good indication of the quality of the model (c-value)
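A minimal sketch of how the ROC points and the area under the curve can be computed from predicted probabilities and observed outcomes. The scores and labels below are hypothetical, and the area is obtained with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """ROC points as (false positive rate, true positive rate), one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last client with this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical predicted fraud probabilities and observed outcomes (1 = fraud).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to the random model (Y = X) and 1.0 to the perfect model.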
24LIFT CHART
We sort the clients by decreasing fraud probability. The X-axis represents the percentage of the population, taken in that order. The Y-axis represents a rate calculated as
Rate (a)
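One common definition of the lift rate, assumed here because the exact formula is not given in the text above, is the fraud rate among the top fraction a of ranked clients divided by the fraud rate in the whole population. The sketch below follows that assumed definition; the function name and data are hypothetical.

```python
def lift_rate(probabilities, outcomes, fraction):
    """Fraud rate in the top `fraction` of ranked clients over the overall fraud rate.

    Assumed definition; the original formula may differ.
    """
    ranked = sorted(zip(probabilities, outcomes), key=lambda pair: -pair[0])
    top_n = max(1, round(fraction * len(ranked)))
    top_frauds = sum(outcome for _, outcome in ranked[:top_n])
    overall_rate = sum(outcomes) / len(outcomes)
    return (top_frauds / top_n) / overall_rate

# Hypothetical predicted probabilities and observed outcomes (1 = fraud).
probabilities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(f"Rate({fraction:.0%}) = {lift_rate(probabilities, outcomes, fraction):.2f}")
```

Under this definition a random ranking gives a rate close to 1 at every fraction, so values above 1 indicate that the model concentrates fraud cases at the top of the list.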
25 26Results
- Use only a subset of the available parameters
- Zone, type of line, past fraud history, past payment history
- Categorise continuous variables
- Final model: 8 categorical variables, with or without pairwise interactions
- Model trained on 40k clients, validated on 40k others.
27LIFT-Chart graph and ROC values
28Cost-benefit analysis: best model
[Figure legend: Validation, Train]
29Cost-benefit analysis: extreme models
[Figure legend: Random model, Best model]
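The idea behind the cost-benefit analysis can be sketched as choosing how many of the highest-risk clients to inspect. This is a simplified illustration, not the Matlab optimisation used in the study: it assumes a fixed cost per check and a fixed gain per detected fraud, both hypothetical values, and treats the predicted probabilities as exact.

```python
def best_campaign_size(probabilities, gain_per_fraud, cost_per_check):
    """Number of top-ranked clients to inspect that maximises expected net benefit."""
    ranked = sorted(probabilities, reverse=True)
    best_size, best_benefit = 0, 0.0
    running = 0.0
    for size, p in enumerate(ranked, start=1):
        # Expected net benefit of inspecting one more client.
        running += p * gain_per_fraud - cost_per_check
        if running > best_benefit:
            best_size, best_benefit = size, running
    return best_size, best_benefit

# Hypothetical predicted fraud probabilities, gain, and cost.
predicted = [0.05, 0.6, 0.3, 0.1, 0.8, 0.02]
size, benefit = best_campaign_size(predicted, gain_per_fraud=100.0, cost_per_check=20.0)
print(f"inspect the top {size} clients, expected net benefit {benefit:.1f}")
```

Under these assumptions it pays to inspect a client only while the predicted probability times the gain exceeds the cost of the check.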
30- V CONCLUSIONS AND PERSPECTIVES
31Summary
- Data analysis to isolate interesting variables
- Logistic regression to predict fraud probabilities
- Evaluation of the model (ROC, lift)
- Use in a cost-benefit analysis
- Concrete results of use to the client
32Future Work
- Automatic data analysis (binary trees, principal component analysis, etc.)
- Carefully choose variables to include
- Other types of prediction (neural networks, decision trees, etc.)
- More complex optimization processes
33THANK YOU