ITCS 6265/8265 Project

1
ITCS 6265/8265 Project

Group 5

Gabriel Njock Tanusree Pai Ke Wang
2
Outline

Domain
Problem Statement and Objective
Data Description
Problem Characteristics and Method Used
Implementation
Data Formatting
Feature Selection
Boosting and Derived Attributes
Testing Results
References

3
Domain

COIL CHALLENGE
Direct mailings to a company's potential
customers - "junk mail" to many - can be a very
effective way for them to market a product or a
service. However, as we all know, much of this
junk mail is really of no interest to the people
that receive it. Most of it ends up thrown away,
not only wasting the money that the company spent
on it, but also filling up landfill waste sites
or needing to be recycled. If the company had a
better understanding of who their potential
customers were, they would know more accurately
who to send it to, so some of this waste and
expense could be reduced.
Motivation for data mining: cost reduction,
realized by targeting only a portion of the
potential customers.

4
Problem Statement

The data used in this problem represents a
frequently occurring problem: analysis of data
about customers of a company, in this case an
insurance company.
Information about customers consists of 86
variables and includes product usage data and
socio-demographic data derived from zip codes.
The data was supplied by the Dutch data mining
company Sentient Machine Research, and is based
on real world business data.

5
Coil Challenge objective

The competition consists of two tasks:
1. Predict which customers are potentially
interested in a caravan insurance policy.
2. Describe the actual or potential customers and
possibly explain why these customers buy a
caravan policy.

6
Project Objective

Propose a solution that will allow us to predict
whether a customer is interested in a caravan
insurance policy.
Find the subset of customers with a probability
of purchasing a caravan insurance policy above
some boundary probability.

7
Data Description

TRAINING SET
5822 customer records
86 attributes. The attributes can be broadly
categorized as follows:
Socio-demographic (43)
Insurance policy related (42):
Contribution-per-policy type (21)
Number-of-policies (21)
Decision attribute: Caravan

8
Data Description

TEST SET
4000 customer records
85 attributes. The Caravan attribute is missing,
so its value needs to be predicted.
Note: Attribute values in both sets were
pre-discretized.

9
Problem Characteristics

The problem reduces to a classification analysis
of customers. The two classes are those who are
interested in purchasing a Caravan policy and
those who are not.
The learning of the classification model is
supervised because the training set provides the
decision attribute values.

10
Method Used
Naive Bayesian Classification: Bayesian
classifiers are statistical classifiers [1] used
to predict class membership probabilities.
Bayes Theorem: If X is a data sample whose class
label is unknown and H is some hypothesis such
that X belongs to class C, then the probability
that the hypothesis H holds given the observed
data sample X is denoted as P(H|X) and given by
Bayes theorem as
P(H|X) = P(X|H) P(H) / P(X)
11
Naive Bayes Classification

The naive Bayesian classifier is based on Bayes
theorem:
1. Each data sample is represented by an
n-dimensional feature vector, X = (x1, x2, ...,
xn), depicting n measurements made on the sample
from n attributes, respectively A1, A2, ..., An.
For our problem we have 86 attributes, one of
which is the decision attribute, so n = 85.
2. If there are m classes, C1, C2, ..., Cm, then
given an unknown data sample, X (i.e. having no
class label), the classifier will predict that X
belongs to the class having the highest posterior
probability, conditioned on X. That is, the naive
Bayesian classifier assigns an unknown sample X
to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
Thus we maximize P(Ci|X). The class Ci for which
P(Ci|X) is maximized is called the maximum
a posteriori hypothesis. By Bayes theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

12
Naive Bayes Classification

3. As P(X) is constant for all classes, only
P(X|Ci) P(Ci) need be maximized. If the class
prior probabilities are not known, then it is
commonly assumed that the classes are equally
likely, i.e., P(C1) = P(C2) = ... = P(Cm), and we
would therefore maximize P(X|Ci). Otherwise we
maximize P(X|Ci) P(Ci).
4. In order to reduce computation in evaluating
P(X|Ci), the naive assumption of class
conditional independence is made. This presumes
that the values of the attributes are
conditionally independent of one another given
the class, so that
P(X|Ci) = P(x1|Ci) P(x2|Ci) ... P(xn|Ci)
5. To classify an unknown sample X, P(X|Ci)
P(Ci) is evaluated for each class Ci. Sample X is
then assigned to the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m,
j ≠ i
In other words, it is assigned to the class Ci
for which P(X|Ci) P(Ci) is the maximum.
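
Steps 1 to 5 can be sketched in Python roughly as follows. This is a minimal illustration, not the Weka implementation used in the project: the function names, the toy data and the Laplace (add-one) smoothing are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] = number of training rows
    value_counts = defaultdict(Counter)
    # Distinct values seen per attribute (used by the smoothing below).
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(k, label)][value] += 1
            domains[k].add(value)
    return class_counts, value_counts, domains


def posteriors(model, x):
    """Return P(Ci|X) for every class Ci, normalised to sum to 1."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(Ci)
        for k, value in enumerate(x):
            # Smoothed estimate of P(xk|Ci); add-one smoothing is an assumption.
            score *= (value_counts[(k, c)][value] + 1) / (n_c + len(domains[k]))
        scores[c] = score  # P(X|Ci) P(Ci) under the independence assumption
    evidence = sum(scores.values())  # plays the role of P(X)
    return {c: s / evidence for c, s in scores.items()}


def classify(model, x):
    """Assign X to the class with the highest posterior probability."""
    post = posteriors(model, x)
    return max(post, key=post.get)


# Toy data (made up): two discretized attributes, class 1 = interested.
toy_rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0)]
toy_labels = [1, 1, 0, 0, 1, 0]
model = train_naive_bayes(toy_rows, toy_labels)
predicted = classify(model, (1, 1))
```

Because P(X) is the same for every class, `classify` would pick the same class without the normalisation; it is kept here only so that the scores can be read as probabilities.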

13
Naive Bayes Classification

Advantages
1. Comparable in performance with decision trees
and neural networks.
2. High accuracy and speed when applied to large
databases.
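3. Theoretically, Bayesian classifiers have the
minimum error rate in comparison to all other
classifiers, provided their assumptions hold.
Disadvantages
1. It assumes that the attributes are
conditionally independent given the class, which
may lead to inaccuracy when attributes are
correlated.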

14
Implementation

Data Formatting and Tools Used
The data set (training as well as test) available
was a simple text file with values separated by
spaces.
A database was created (COILDB.mdb) with two
tables (COILDB_TRAIN and COILDB_TEST) having
appropriate column definitions for all
attributes. Data from the text file was populated
into the tables.
Software used: MS Access
Purpose: Allow executing SQL queries for data
analysis

15
Implementation

Data from the text file was populated into
spreadsheets.
Software used: MS Excel, Analyse-It
Purpose: Allow statistical analysis
(histogram, correlation) and have
graphical output.
.arff files were created.
Software used: Weka
Purpose: Use Bayesian classification for
machine learning as well as testing.
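
A conversion from the space-separated text format to an .arff file could look roughly like the sketch below. It is an illustration only: the helper names, the relation name and the three sample rows are made up, and declaring every attribute as nominal is an assumption that fits pre-discretized values.

```python
def parse_rows(text):
    """Split whitespace-separated text into rows of string values."""
    return [line.split() for line in text.splitlines() if line.strip()]


def to_arff(relation, names, rows):
    """Build ARFF text, declaring each attribute as nominal over its seen values."""
    lines = [f"@relation {relation}", ""]
    for k, name in enumerate(names):
        # Collect the distinct values of column k, sorted numerically.
        values = sorted({row[k] for row in rows}, key=int)
        domain = ",".join(values)
        lines.append(f"@attribute {name} {{{domain}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Made-up sample: three records with three attributes each.
sample_text = "33 1 0\n37 2 1\n33 2 0\n"
sample_names = ["MOSTYPE", "MAANTHUI", "CARAVAN"]
arff_text = to_arff("coil_train", sample_names, parse_rows(sample_text))
```

Writing `arff_text` to a file with an .arff extension gives a file in the layout Weka reads: a relation line, one attribute declaration per column, then comma-separated data rows.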

16
Implementation

Feature selection
The analysis to determine the relevance of each
attribute was carried out in two steps:
1. Analyze the relevance of demographical
attributes.
2. Analyze the relevance of the
non-demographical attributes.

17
Feature Selection Demographical Attributes

The attribute selection feature in Weka was used
to rank the demographical attributes according to
information gain.
The 4 demographical attributes that had the
highest information gain values were
Customer Subtype (Mostype),
Customer Main Type (Moshoofd),
Average Income (Minkgem) and
Purchasing Power Class (Mkoopkla).
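
For reference, information gain for one discretized attribute can be computed as in the sketch below. This is a generic illustration of the measure, not Weka's code; the function names and the toy columns are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy columns (made up): one attribute separates the classes, one does not.
toy_labels = [1, 1, 0, 0]
gain_informative = information_gain(["a", "a", "b", "b"], toy_labels)
gain_uninformative = information_gain(["a", "b", "a", "b"], toy_labels)
```

Ranking attributes then amounts to sorting them by this value in descending order.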

18
Feature Selection Demographical Attributes

Simple Naive Bayesian classification was used
with different combinations of the 4
demographical attributes along with the
non-demographical attributes to determine which
combination of these four attributes would yield
the best accuracy.
The percentage of correctly classified instances
and percentage of incorrectly classified
instances were then compared for all the combined
attribute groups.

19
Feature Selection Demographical Attributes

A correlation analysis was also conducted on the
4 demographical attributes.
Figure: the Customer Type and Customer Subtype
attributes have a Pearson correlation coefficient
of 0.99.
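
The Pearson correlation coefficient between two attribute columns can be computed as in this sketch. The function name and the sample columns are made up for illustration; the project itself used spreadsheet tools for this step.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up columns: the second is an exact linear function of the first.
r_linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_inverse = pearson([1, 2, 3], [3, 2, 1])
```

A value close to 1 or -1 indicates that one of the two attributes carries little information beyond the other, which is the argument for keeping only one of them.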

20
Feature Selection Demographical Attributes

Based on the correlation analysis between the 4
demographical attributes and the comparison of
the percentages of correctly and incorrectly
classified instances, it was decided to retain
only 2 of the 43 demographical attributes:
Average Income (Minkgem)
Accuracy: 92.2363%
Purchasing Power Class (Mkoopkla)
Accuracy: 92.0818%
All attributes
Accuracy: 83.3047%

21
Feature Selection Policy Attributes

The 42 non-demographical attributes were mainly
insurance policy related and of two types:
Contribution-per-policy attributes
Number-of-policies attributes
Preliminary Analysis
1. The contribution-per-policy and
number-of-policies attributes were highly
correlated.
2. For 37 out of the 43 policy related
attributes (including the caravan policy
ownership attribute), more than 90% of the
records have a single value (mainly 0), i.e.
these are sparsely used attributes.
3. The vast majority of customers buy mostly
the fire, car and third party insurance policies.
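
A check for sparsely used attributes, in the sense of point 2, can be sketched as follows. The function names, the toy columns and the strict "more than 90%" comparison are assumptions made for the example.

```python
from collections import Counter


def dominant_share(column):
    """Return the most common value of a column and the share of rows holding it."""
    value, count = Counter(column).most_common(1)[0]
    return value, count / len(column)


def sparse_attributes(columns, threshold=0.9):
    """Names of attributes where a single value covers more than `threshold` of rows."""
    return [name for name, column in columns.items()
            if dominant_share(column)[1] > threshold]


# Made-up columns: A is 0 in 19 of 20 rows, B is evenly split.
toy_columns = {"A": [0] * 19 + [1], "B": [0, 1] * 10}
sparse = sparse_attributes(toy_columns)
```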

22
Boosting and Deriving Attributes

For each pair of attributes
(contribution-per-policy, number-of-policies) we
performed two kinds of analysis:
1. Determine the correlation factor among the
attributes.
2. Derive the Total Contribution attribute,
which was the product of the contribution from a
policy and the number of those policies.
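
Deriving such a product attribute is a one-line computation per record, as the sketch below shows. The function name, the derived attribute name TOTAL_PERSAUT and the sample values are hypothetical; since the data is pre-discretized, the product here is a product of level codes rather than of raw amounts.

```python
def derive_total_contribution(record, contribution_key, count_key, derived_key):
    """Replace a (contribution, number-of-policies) pair by their product."""
    # Copy every attribute except the two being combined.
    derived = {key: value for key, value in record.items()
               if key not in (contribution_key, count_key)}
    derived[derived_key] = record[contribution_key] * record[count_key]
    return derived


# Made-up record using the car policy pair of the data set.
customer = {"PPERSAUT": 6, "APERSAUT": 2}
combined = derive_total_contribution(
    customer, "PPERSAUT", "APERSAUT", "TOTAL_PERSAUT")
```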

23
Boosting and Deriving Attributes

Simple Naive Bayes classification was conducted
using the derived attributes, and it was found
that the product attribute gave a higher accuracy
than the attributes individually. The three
derived attributes which made a significant
difference were
1. Car policies (PPERSAUT, APERSAUT)
2. Fire policies (PBRAND, ABRAND)
3. Private third party insurance policies
(PWAPART, AWAPART)

24
Boosting and Deriving Attributes

The classification was performed again, using all
combinations of the derived attributes to
determine which of the three derived attributes
had to be retained.
We compared the percentage of correctly
classified instances and percentage of
incorrectly classified instances.
The highest accuracy and lowest error was found
when using all three derived attributes.

25
Feature Selection Policy Attributes

The number of non-demographical attributes was
then reduced from 42 to 39 by replacing 6
attributes with the 3 derived attributes.
These attributes were combined with the Average
Income and Purchasing Power Class attributes,
individually as well as together.
Accuracy in each case:
Avg. Income + Policy Attributes (with 3
derived): 93.181%
Purchasing Power Class + Policy Attributes (with
3 derived): 93.1982%
Avg. Income + Purchasing Power Class + Policy
Attributes (with 3 derived): 93.0608%

26
Feature Selection Policy Attributes

We ranked the remaining attributes again
according to information gain to test their
relevance.
The attributes with significant information gain
were related to Boat policies and SSN policies.
Correlation analysis as well as accuracy analysis
with classification was used to determine the
final set of attributes.
The accuracy reached with 6 attributes
(Purchasing Power Class, 3 derived attributes
representing contribution from Car, Fire and
Private Third Party policies, Number of Boat
policies and Number of SSN policies) was
93.9711%.

27
Testing Results

After training the model, the test data was
used to predict the subset of customers likely to
purchase the CARAVAN policy.
A cut-off probability of 80% was used.
We obtained 115 records with 80% or higher
probability of purchasing the CARAVAN policy.
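
Applying the cut-off amounts to keeping the records whose predicted probability reaches the boundary, as in this sketch. The function names and the four sample probabilities are made up; only the counts 115 and 4000 come from the results above.

```python
def select_targets(probabilities, cutoff=0.80):
    """Indices of records whose predicted probability is at least the cut-off."""
    return [i for i, p in enumerate(probabilities) if p >= cutoff]


def mailing_fraction(selected, total):
    """Share of the customer base that would still receive a mailing."""
    return selected / total


# Made-up predicted probabilities for four customers.
targets = select_targets([0.95, 0.10, 0.80, 0.79])

# Share of the test set selected in the project: 115 of 4000 records.
share = mailing_fraction(115, 4000)
```

With 115 of 4000 records selected, `share` is about 2.9% of the test set.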

28
Conclusion

From a set of 4000 customer records, our
classification analysis predicted only 115 with a
probability of purchasing the Caravan policy
of 80% or higher.
A significant reduction in cost can be obtained
by targeting only those customers.

29
References

1. The Insurance Company (TIC) Benchmark: The
CoIL Challenge Report,
http://www.liacs.nl/~putten/library/cc2000/
2. Sentient Machine Research,
http://www.smr.nl/
