Neo Metrics Template - PowerPoint PPT Presentation

About This Presentation
Title:

Neo Metrics Template

Description:

Title: Neo Metrics Template Author: Neo Metrics Last modified by: alumno Document presentation format: Personalizado Other titles: Times New Roman Arial Lucida Sans ... – PowerPoint PPT presentation

Slides: 34
Provided by: NeoMe

Transcript and Presenter's Notes


1
A mathematical model for fraud prediction and
control planning of electrical company clients
June 2010
  • Camilla Bruni (Italy), Università degli Studi di
    Firenze
  • Herberth Espinoza (Peru), Universidad
    Complutense de Madrid
  • Antoine Levitt (France), University of Oxford
  • María Loreto Luque (Spain), Universidad
    Complutense de Madrid
  • Carlos Parra (Spain), Universidad Complutense de
    Madrid
  • Aranzazu Pérez (Spain), Universidad Complutense
    de Madrid
  • Elisa Pérez (Spain), Universidad Complutense de
    Madrid
Coordinators
  • Dr. Benjamin Ivorra (Universidad Complutense de
    Madrid)
  • Dr. Juan Tejada (Universidad Complutense de
    Madrid)
  • D. Jorge Juan Suerias (Neo Metrics)
  • D. Fernando Fernández (Neo Metrics)
  • Pilar Gómez (Neo Metrics)
2
Outline
  • Problem Description
  • Data Analysis
  • Mathematical Modeling
  • Numerical Validation
  • Conclusions and Perspectives

3
  • I PROBLEM DESCRIPTION

4
Context
  • An electrical company keeps a crew of inspectors
    in Chile to check whether customers are
    manipulating their electrical meters.
  • Each check has a cost and the company wishes to
    identify the customers with higher risk of fraud
    in order to maximize benefits in a possible
    control campaign.

5
Objectives of the work
  • Currently, the company's policy of checking
    customers at random detects a fraud rate of 6.6%.
  • Using the data set provided by the company, we
    created a model to design an improved control
    campaign. To do so, we:
  • Explore and treat the data using SAS.
  • Fit a logistic regression model using SAS and
    calculate the lift-chart and ROC model
    performance indicators.
  • Find the optimal control campaign according to
    the model, using Matlab, and validate the
    results.

6
  • II DATA ANALYSIS

7
DATA ANALYSIS
  • What did Neo Metrics supply?
  • We received the database train.csv from
    Neo Metrics.
  • Database characteristics:
  • 79,459 records
  • 49 variables
  • 30 MB

8
DATA DESCRIPTION
  • We have 49 explanatory variables, which can be
    divided into several groups.
  • The variables describe each customer's
    characteristics (economic, geographical,
    technical).
  • We select and simplify the most representative
    variables. To do so:
  • We categorize non-linear continuous variables
    according to the fraud proportion in the
    population.
  • For groups that have both categorical and
    quantitative variables, we use discrimination
    techniques (binary tree).
  • For groups containing only quantitative
    variables, we apply principal component analysis
    (PCA) to examine their correlations.

9
Variable simplification
  • To identify the non-linear continuous variables
    and their representative values, we calculate:
  • The population proportion in each group
  • The fraud proportion in each group
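
The two proportions listed above can be computed per group once a continuous variable has been cut into bins. The following is a minimal Python sketch, not the SAS procedure used in the project; the function name `group_proportions`, the bin edges and the toy data are all made up for illustration.

```python
from collections import defaultdict

def group_proportions(values, is_fraud, bin_edges):
    """Return, per bin, the share of the population and the fraud rate."""
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for value, fraud in zip(values, is_fraud):
        # Bin index = number of edges the value is greater than or equal to.
        bin_index = sum(value >= edge for edge in bin_edges)
        counts[bin_index] += 1
        frauds[bin_index] += int(fraud)
    total = len(values)
    return {
        b: {"population_share": counts[b] / total,
            "fraud_rate": frauds[b] / counts[b]}
        for b in sorted(counts)
    }

# Toy data (hypothetical): a debt-like variable and a fraud flag per client.
debt = [0, 5, 12, 30, 45, 80, 120, 200]
fraud = [0, 0, 0, 1, 0, 1, 1, 1]
table = group_proportions(debt, fraud, bin_edges=[10, 100])
for bin_index, row in table.items():
    print(bin_index, row)
```

Bins whose fraud rate differs clearly from the overall rate are natural candidates for the categories of the simplified variable.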

10
Variable simplification
11
Variable simplification
12
Variable simplification
13
Variable simplification
14
VARIABLES SELECTION
  • Principal component analysis (PCA) is applied
    to the Debt and Payment group of variables.
  • The aim of PCA is to reduce the number of
    variables observed for each individual while
    keeping as much of the variability as possible.
  • PCA is based on the spectral analysis of the
    correlation matrix.
  • This process was performed using SAS.
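
For reference, the spectral analysis mentioned above can be written out as follows. This is the standard textbook formulation, given here as a sketch; the symbols ($R$, $V$, $\Lambda$, $p$, $m$) are notation introduced for this note and do not come from the slides.

```latex
% R: p x p correlation matrix of the standardized variables \tilde{x}
R = V \Lambda V^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% k-th principal component: projection onto the k-th eigenvector v_k
z_k = v_k^{\top} \tilde{x}

% Share of the total variability kept by the first m components
% (the trace of a correlation matrix equals p)
\frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{p} \lambda_k}
  = \frac{1}{p} \sum_{k=1}^{m} \lambda_k
```

Keeping the smallest $m$ for which this share exceeds a chosen threshold is one common way to decide how many components to retain.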

15
SELECTION OF VARIABLES
  • We start with the principal component analysis
    of the group of calculated Debt and Payment
    variables.
  • The most relevant variables are:
  • deuda_ult_mean,
  • max_deuda,
  • dif_ult_mean_deuda,
  • mean_dif_deuda.
  • After finishing the procedure, we select three
    variables, accounting for 90% of the information.

Repeating the process
16
Study Of Significance
  • We carry out a study of the most significant
    variables that we might include in our model,
    for example: Geographic Variables, Connection
    Identifiers, Customer Characteristic Groups.
  • proc logistic with stepwise selection
  • Pearson correlation coefficient

17
Study Of Multicollinearity
  • VIF: variance inflation factor
  • The VIF represents the increase in variance
    due to the presence of multicollinearity.
  • The VIF takes its minimum value of 1 when there
    is no multicollinearity.
  • Other procedures: discriminant analysis (DISCRIM)
  • max_deuda

18
Classification trees
  • To obtain a good predictive model
  • Decision support models that can be applied in
    the identification of fraudsters

19
  • III MATHEMATICAL MODELING

20
Regression
  • Predict a random variable from a set of
    explanatory random variables.
  • In our case, predict the fraud probability using
    client information.
  • Well-studied problem with numerous applications
    (spam filters, image classification, insurance
    policies...).
  • Simplest idea: linear regression

21
Logistic regression
  • The quantity to predict, $p$, is a probability:
    $0 \le p \le 1$. Linear regression does not
    respect this constraint.
  • Map the result of the linear regression to a
    probability with the logistic curve:
    $p = \frac{1}{1 + e^{-(a x + b)}}$
  • We calibrate the model parameters on the
    training set (find the optimal a and b), and
    test on the validation set.
  • Use an optimisation procedure to fit the model.
  • Other possibilities: add terms to the model, such
    as interactions (products of variables, etc.)
    and nonlinearities.
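
As an illustration of the fitting step, here is a minimal Python sketch with a single explanatory variable and parameters a and b, fitted by plain gradient ascent on the log-likelihood. The project itself used SAS; the optimiser, learning rate, iteration count and toy data below are assumptions chosen only to keep the example small.

```python
import math

def sigmoid(z):
    """Logistic curve: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit p = sigmoid(a*x + b) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [y - sigmoid(a * x + b) for x, y in zip(xs, ys)]
        grad_a = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        a += learning_rate * grad_a
        b += learning_rate * grad_b
    return a, b

# Toy training set (hypothetical): one explanatory variable, fraud flag 0/1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_logistic(xs, ys)

# Predicted fraud probabilities at the two ends of the range.
p_low = sigmoid(a * 0 + b)
p_high = sigmoid(a * 7 + b)
print(a, b, p_low, p_high)
```

Whatever the fitted values of a and b, the output always lies between 0 and 1, which is the constraint that linear regression fails to respect.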

22
Model evaluation
  • In order to validate our model:
  • We compare the estimated fraud probabilities to
    the measured outcomes on the validation set.
  • We use different metrics: ROC curve, lift-chart,
    etc.
  • We use the results to establish the optimal
    control policy.

23
ROC curve
  • X-axis: false positive rate; Y-axis: true
    positive rate
  • For a given allowed false positive rate
    (1 - specificity), determine the success rate
    (sensitivity).
  • Perfect model: Y = 1; random model: Y = X; bad
    model: Y = 0
  • Goal: get above Y = X
  • The area under the curve (the c-value) is a good
    indication of the quality of the model.
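
The ROC curve and the area under it can be computed directly from predicted probabilities and observed outcomes. The sketch below is a small pure-Python illustration with made-up scores and labels; the function names are hypothetical and this is not the SAS output used in the project.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last client of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve (the c-value)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy validation set (hypothetical): predicted probabilities and outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_points(scores, labels))
print(auc)
```

A model that ranks clients at random gives an area close to 0.5 (the line Y = X), and a perfect ranking gives 1.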

24
LIFT CHART
We sort the clients by decreasing fraud
probability. The X-axis represents the percentage
of the population, taken in that order. The Y-axis
represents a rate calculated as
Rate(a)
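
The formula for the rate did not survive in this transcript. The Python sketch below uses one common definition of lift as an assumption: the fraud rate among the top fraction a of the sorted clients, divided by the fraud rate of the whole population. The function name and the toy data are hypothetical.

```python
def lift_curve(scores, labels):
    """Return (population fraction, lift) pairs for clients sorted by decreasing score."""
    overall_rate = sum(labels) / len(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    curve = []
    frauds_so_far = 0
    for k, (_, label) in enumerate(ranked, start=1):
        frauds_so_far += label
        rate_in_top_k = frauds_so_far / k
        # Assumed definition: fraud rate in the top k clients over the overall rate.
        curve.append((k / len(ranked), rate_in_top_k / overall_rate))
    return curve

# Toy validation set (hypothetical): predicted probabilities and outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
curve = lift_curve(scores, labels)
for fraction, lift in curve:
    print(fraction, lift)
```

Under this definition a random ordering gives a lift close to 1 everywhere, and the curve always ends at 1 when the whole population is checked.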
25
  • IV NUMERICAL VALIDATION

26
Results
  • Use only a subset of the available parameters:
  • Zone, type of line, past fraud history, past
    payment history
  • Categorise continuous variables
  • Final model: 8 categorical variables, with or
    without pairwise interactions
  • Model trained on 40k clients, validated on 40k
    others.

27
LIFT-Chart graph and ROC values
28
Cost-benefit analysis: Best model
[Figure; labels: Validation, Train]
29
Cost-benefit analysis: Extreme models
[Figure; labels: Random model, Best model]
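
The slides do not give the cost model behind these figures. As a rough illustration of how fraud probabilities can drive a control campaign, the Python sketch below assumes a fixed cost per check and a fixed gain per detected fraud (both hypothetical parameters), checks clients in order of decreasing probability, and keeps the campaign size with the highest expected net benefit. The project used Matlab for this step.

```python
def best_campaign_size(probabilities, gain_per_fraud, cost_per_check):
    """Return (number of clients to check, expected net benefit)."""
    ranked = sorted(probabilities, reverse=True)
    best_n, best_benefit = 0, 0.0
    benefit = 0.0
    for n, p in enumerate(ranked, start=1):
        # Expected gain from checking this client minus the cost of the check.
        benefit += p * gain_per_fraud - cost_per_check
        if benefit > best_benefit:
            best_n, best_benefit = n, benefit
    return best_n, best_benefit

# Toy fraud probabilities and made-up monetary values.
probs = [0.9, 0.5, 0.25, 0.1, 0.05]
n_checks, benefit = best_campaign_size(probs, gain_per_fraud=100.0, cost_per_check=20.0)
print(n_checks, benefit)
```

With these assumptions a client is worth checking only while the predicted probability times the gain exceeds the cost of a check, so a better ranking of clients translates directly into a higher net benefit.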
30
  • V CONCLUSIONS AND PERSPECTIVES

31
Summary
  • Data analysis to isolate interesting variables
  • Logistic regression to predict fraud
    probabilities
  • Evaluation of the model (ROC, lift)
  • Use in a cost-benefit analysis
  • Concrete results of use to the client

32
Future Work
  • Automatic data analysis (binary trees, principal
    components analysis, etc.)
  • Carefully choose variables to include
  • Other types of prediction (neural networks,
    decision trees, etc.)
  • More complex optimization processes

33
THANK YOU