ITCS 6265/8265 Project - Presentation Transcript

1
ITCS 6265/8265 Project
  • Group 5: Gabriel Njock, Tanusree Pai, Ke Wang
2
Outline
  • Domain
  • Problem Statement and Objective
  • Data Description
  • Problem Characteristics and Method Used
  • Implementation
  • Data Formatting
  • Feature Selection
  • Boosting and Derived Attributes
  • Testing Results
  • References

3
Domain
  • COIL CHALLENGE
  • Direct mailings to a company's potential customers - "junk mail" to many - can be a very effective way to market a product or a service. However, much of this junk mail is of no interest to the people who receive it. Most of it ends up thrown away, not only wasting the money that the company spent on it, but also filling up landfill sites or needing to be recycled. If the company had a better understanding of who its potential customers were, it would know more accurately whom to send the mail to, and some of this waste and expense could be reduced.
  • Motivation for data mining: cost reduction, realized by targeting only a portion of the potential customers.

4
Problem Statement
  • The data used in this problem represents a frequently occurring problem: the analysis of data about customers of a company, in this case an insurance company.
  • Information about customers consists of 86
    variables and includes product usage data and
    socio-demographic data derived from zip codes.
  • The data was supplied by the Dutch data mining
    company Sentient Machine Research, and is based
    on real world business data. 

5
CoIL Challenge Objective
  • The competition consists of two tasks:
  • Predict which customers are potentially interested in a caravan insurance policy.
  • Describe the actual or potential customers, and possibly explain why these customers buy a caravan policy.

6
Project Objective
  • Propose a solution that will allow us to predict
    whether a customer is interested in a caravan
    insurance policy.
  • Find the subset of customers with a probability
    of purchasing a caravan insurance policy above
    some boundary probability.

7
Data Description
  • TRAINING SET
  • 5822 customer records
  • 86 attributes, broadly categorized as follows:
  • Socio-demographic (43)
  • Insurance Policy Related (42): Contribution-per-policy type (21) and Number-of-policies (21)
  • Decision Attribute: Caravan

8
Data Description
  • TEST SET
  • 4000 customer records
  • 85 attributes; the Caravan attribute is missing and its value must be predicted.
  • Note: Attribute values in both sets were pre-discretized.

9
Problem Characteristics
  • The problem reduces to a classification analysis of customers: the two classes are those who are interested in purchasing a Caravan policy and those who are not.
  • The learning of the classification model is
    supervised because the training set provides the
    decision attribute values.

10
Method Used
Naive Bayesian Classification: Bayesian classifiers are statistical classifiers [1] used to predict class membership probabilities. Bayes Theorem: If X is a data sample whose class label is unknown and H is some hypothesis such that X belongs to class C, then the probability that the hypothesis H holds given the observed data sample X is denoted P(H|X) and is given by Bayes theorem as

P(H|X) = P(X|H) P(H) / P(X)
11
Naive Bayes Classification
  • The naive Bayesian classifier is based on Bayes theorem:
  • 1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively A1, A2, ..., An. For our problem we have 86 attributes, so n = 86.
  • 2. If there are m classes C1, C2, ..., Cm, then given an unknown data sample X (i.e. having no class label), the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if
  • P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
  • Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X).

12
Naive Bayes Classification
  • 3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, i.e., P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise we maximize P(X|Ci) P(Ci).
  • 4. In order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class.
  • 5. To classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci. Sample X is then assigned to the class Ci if and only if
  • P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
  • In other words, it is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum (see the sketch below).
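A minimal sketch of this decision rule in Python, assuming the pre-discretized attributes sit in a pandas DataFrame with a CARAVAN class column. The project itself used Weka's built-in classifier, so the function and column names here are illustrative only:

```python
import pandas as pd

def train_naive_bayes(train: pd.DataFrame, target: str = "CARAVAN"):
    """Estimate the priors P(Ci) and likelihoods P(xk|Ci) by counting."""
    priors = train[target].value_counts(normalize=True).to_dict()
    likelihoods = {}  # attr -> {(class, value): P(value | class)}
    for attr in train.columns.drop(target):
        counts = train.groupby([target, attr]).size()
        totals = train.groupby(target).size()
        likelihoods[attr] = counts.div(totals, level=target).to_dict()
    return priors, likelihoods

def classify(x: dict, priors, likelihoods, eps: float = 1e-9):
    """Assign x to the class Ci maximizing P(X|Ci) P(Ci)."""
    best_class, best_score = None, -1.0
    for ci, p_ci in priors.items():
        score = p_ci
        for attr, value in x.items():
            # Class conditional independence: multiply per-attribute terms.
            score *= likelihoods[attr].get((ci, value), eps)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class
```

A record is then classified by passing its attribute-value dict to classify(); the eps floor stands in for the smoothing a production classifier would apply to unseen attribute values.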

13
Naive Bayes Classification
  • Advantages:
  • 1. Comparable in performance with decision trees
    and neural networks.
  • 2. High accuracy and speed when applied to large
    databases.
  • 3. Theoretically, Bayesian classifiers have the
    minimum error rate in comparison to all other
    classifiers.
  • Disadvantages:
  • 1. It rests on the class conditional independence assumption, which may not hold in practice and can lead to inaccuracy.

14
Implementation
  • Data Formatting and Tools Used
  • The data set (training as well as test) was available as a simple text file with space-separated values.
  • A database (COILDB.mdb) was created with two tables (COILDB_TRAIN and COILDB_TEST) having appropriate column definitions for all attributes. Data from the text file was loaded into the tables.
  • Software used: MS Access
  • Purpose: Allow executing SQL queries for data analysis
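As a rough Python equivalent of this loading step (a sketch only: the project used MS Access, and the file name ticdata2000.txt follows the COIL distribution, so treat both as assumptions):

```python
import sqlite3
import pandas as pd

# The training file is plain text with space-separated attribute values.
train = pd.read_csv("ticdata2000.txt", sep=r"\s+", header=None)

# Load the records into a database table for SQL-based analysis.
with sqlite3.connect("COILDB.sqlite") as conn:
    train.to_sql("COILDB_TRAIN", conn, if_exists="replace", index=False)
```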

15
Implementation
  • Data from the text file was also loaded into spreadsheets.
  • Software used: MS Excel, Analyse-It
  • Purpose: Allow statistical analysis (histograms, correlation) with graphical output.
  • .arff files were created.
  • Software used: Weka
  • Purpose: Use Bayesian classification for machine learning as well as testing.
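A minimal sketch of the .arff export, reusing the DataFrame from the loading sketch and treating every attribute as numeric; the project's actual files may have declared nominal attributes, and the attr-prefixed names are invented here:

```python
import pandas as pd

def write_arff(df: pd.DataFrame, path: str, relation: str = "coil_train"):
    """Write a DataFrame as a minimal Weka .arff file."""
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for col in df.columns:
            f.write(f"@ATTRIBUTE attr{col} NUMERIC\n")
        f.write("\n@DATA\n")
        for _, row in df.iterrows():
            f.write(",".join(str(v) for v in row) + "\n")
```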

16
Implementation
  • Feature selection
  • The analysis to determine the relevance of each attribute was carried out in two steps:
  • 1. Analyze the relevance of demographical
    attributes.
  • 2. Analyze the relevance of the
    non-demographical attributes.

17
Feature Selection Demographical Attributes
  • The attribute selection feature in Weka was used
    to rank the demographical attributes according to
    information gain.
  • The 4 demographical attributes with the highest information gain values were:
  • Customer Subtype (Mostype),
  • Customer Main Type (Moshoofd),
  • Average Income (Minkgem), and
  • Purchasing Power Class (Mkoopkla).
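Information gain for a discretized attribute can be computed directly; the following is a sketch of the ranking that Weka's attribute-selection feature performs, with DataFrame and column names carried over as assumptions from the earlier sketches:

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label distribution, in bits."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, attr: str, target: str = "CARAVAN") -> float:
    """Gain = H(class) - sum over values v of P(attr=v) * H(class | attr=v)."""
    conditional = sum(
        len(g) / len(df) * entropy(g[target]) for _, g in df.groupby(attr)
    )
    return entropy(df[target]) - conditional

# Rank the demographic attributes, highest gain first, e.g.:
# ranking = sorted(demo_cols, key=lambda a: information_gain(train, a), reverse=True)
```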

18
Feature Selection Demographical Attributes
  • Simple Naive Bayesian classification was used with different combinations of the 4 demographical attributes along with the non-demographical attributes, to determine which combination of these four attributes would yield the best accuracy.
  • The percentage of correctly classified instances
    and percentage of incorrectly classified
    instances were then compared for all the combined
    attribute groups.

19
Feature Selection Demographical Attributes
  • A correlation analysis was also conducted on the
    4 demographical attributes.
  • The figure showed the Customer Subtype (Mostype) and Customer Main Type (Moshoofd) attributes with a Pearson correlation coefficient of 0.99.
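The same check is a one-liner in Python, assuming the columns have been renamed per the COIL data dictionary (illustrative, not the project's actual tooling):

```python
# Pearson correlation between Customer Subtype and Customer Main Type.
r = train[["MOSTYPE", "MOSHOOFD"]].corr(method="pearson").iloc[0, 1]
print(f"Pearson correlation: {r:.2f}")  # the project reported 0.99
```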

20
Feature Selection Demographical Attributes
  • Based on the correlation analysis between the 4 demographical attributes, and on the comparison of the percentages of correctly and incorrectly classified instances, it was decided to retain only 2 of the 43 demographical attributes:
  • Average Income (Minkgem): accuracy 92.2363%
  • Purchasing Power Class (Mkoopkla): accuracy 92.0818%
  • All attributes (for comparison): accuracy 83.3047%

21
Feature Selection Policy Attributes
  • The 42 non-demographical attributes were mainly insurance policy related and of two types:
  • Contribution-per-policy attributes
  • Number-of-policies attributes
  • Preliminary Analysis:
  • 1. The contribution-per-policy and number-of-policies attributes were highly correlated.
  • 2. For 37 out of the 43 policy related attributes (including the caravan policy ownership attribute), more than 90% of the records had only one value (mainly 0): sparsely used attributes.
  • 3. The vast majority of customers buy mostly the fire, car, and third party insurance policies.

22
Boosting and Deriving Attributes
  • For each pair of attributes (contribution-per-policy, number-of-policies) we performed two kinds of analysis:
  • 1. Determine the correlation factor between the attributes.
  • 2. Derive a Total Contribution attribute, the product of the contribution from a policy and the number of those policies (see the sketch below).
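A sketch of both analyses for one policy pair; PPERSAUT/APERSAUT follow the COIL data dictionary (car policy contribution and count), while the derived name CAR_TOTAL is invented here:

```python
# Correlation between the members of the pair.
r = train[["PPERSAUT", "APERSAUT"]].corr().iloc[0, 1]

# Derived Total Contribution attribute: contribution x number of policies.
# The same pattern applies to every (P*, A*) pair in the data set.
train["CAR_TOTAL"] = train["PPERSAUT"] * train["APERSAUT"]
```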

23
Boosting and Deriving Attributes
  • Simple Naive Bayes classification was conducted using the derived attributes, and it was found that the product attribute gave a higher accuracy than the attributes individually. The three derived attributes that made a significant difference were:
  • 1. CAR Policies (PPERSAUT, APERSAUT)
  • 2. FIRE Policies (PBRAND, ABRAND)
  • 3. Private Third Party Insurance Policies (PWAPART, AWAPART)

24
Boosting and Deriving Attributes
  • The classification was performed again, using all combinations of the derived attributes, to determine which of the three derived attributes should be retained (see the sketch after this list).
  • We compared the percentage of correctly
    classified instances and percentage of
    incorrectly classified instances.
  • The highest accuracy and lowest error was found
    when using all three derived attributes.
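A sketch of that subset search, where evaluate() is a hypothetical helper returning classification accuracy for a given attribute subset (e.g. via Weka or the naive Bayes sketch above), and the derived column names are the invented ones from earlier:

```python
from itertools import combinations

derived = ["CAR_TOTAL", "FIRE_TOTAL", "TPI_TOTAL"]  # invented names
subsets = [
    s for r in range(1, len(derived) + 1) for s in combinations(derived, r)
]
# evaluate() stands in for a train/test accuracy measurement.
best = max(subsets, key=lambda s: evaluate(train, list(s)))
```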

25
Feature Selection Policy Attributes
  • The number of non-demographical attributes was then reduced from 42 to 39 by replacing 6 attributes with the 3 derived attributes.
  • These attributes were combined with the Average Income and Purchasing Power Class attributes, individually as well as together.
  • Accuracy in each case:
  • Avg. Income + Policy Attributes (with 3 derived): 93.181%
  • Purchasing Power Class + Policy Attributes (with 3 derived): 93.1982%
  • Avg. Income + Purchasing Power Class + Policy Attributes (with 3 derived): 93.0608%

26
Feature Selection Policy Attributes
  • We ranked the remaining attributes again according to information gain to test their relevance.
  • The attributes with significant information gain were related to Boat policies and SSN policies.
  • Correlation analysis as well as accuracy analysis with classification was used to determine the final set of attributes.
  • The accuracy reached with 6 attributes (Purchasing Power Class; the 3 derived attributes representing contributions from Car, Fire, and Private Third Party policies; Number of Boat Policies; and Number of SSN Policies) was 93.9711%.

27
Testing Results
  • After training the model, the test data was used to predict the subset of customers likely to purchase the CARAVAN policy.
  • A cut-off probability of 80% was used (see the sketch below).
  • We obtained 115 records with an 80% or higher probability of purchasing the CARAVAN policy.
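A sketch of the cut-off step, reusing the priors and likelihoods from the naive Bayes sketch above; the class label 1 for CARAVAN buyers and the test_records variable are assumptions here:

```python
def posterior_caravan(x: dict, priors, likelihoods, eps: float = 1e-9):
    """P(CARAVAN=1 | x), obtained by normalizing the class scores."""
    scores = {}
    for ci, p_ci in priors.items():
        s = p_ci
        for attr, value in x.items():
            s *= likelihoods[attr].get((ci, value), eps)
        scores[ci] = s
    total = sum(scores.values())
    return scores.get(1, 0.0) / total if total else 0.0

# Keep only the test records at or above the 80% cut-off.
likely_buyers = [
    x for x in test_records
    if posterior_caravan(x, priors, likelihoods) >= 0.80
]
```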

28
Conclusion
  • From a set of 4000 customer records, our classification analysis predicted only 115 with a probability of purchasing the Caravan policy higher than 80%.
  • A significant reduction in cost can be obtained by targeting only those customers.

29
References
  • 1. The Insurance Company (TIC) Benchmark: The CoIL Challenge Report, http://www.liacs.nl/~putten/library/cc2000/
  • 2. Sentient Machine Research, http://www.smr.nl/