Title: MKT 700 Business Intelligence and Decision Models
1. MKT 700 Business Intelligence and Decision Models: Algorithms and Customer Profiling (1)
2. Classification and Prediction
3. Classification: Unsupervised Learning
4. Prediction: Supervised Learning
5. SPSS Direct Marketing

                       Classification                                 Predictive
Unsupervised Learning  RFM, Cluster analysis, Postal Code Responses   NA
Supervised Learning    Customer Profiling                             Propensity to buy
6. SPSS Analysis

                       Classification                                           Predictive
Unsupervised Learning  Hierarchical Cluster, Two-Step Cluster, K-Means Cluster  NA
Supervised Learning    Classification Trees (CHAID, CART)                       Linear Regression, Logistic Regression, Artificial Neural Nets
7. Major Algorithms

                       Classification                                                                    Predictive
Unsupervised Learning  Euclidean Distance, Log Likelihood                                                NA
Supervised Learning    Chi-square Statistics, Log Likelihood, GINI Impurity Index, F-Statistics (ANOVA)  Log Likelihood, F-Statistics (ANOVA)

Nominal variables: Chi-square, Log Likelihood
Continuous variables: F-Statistics, Log Likelihood
8. Euclidean Distance

9. Euclidean Distance for Continuous Variables
- Pythagorean distance: d = √(a² + b²)
- Euclidean space: d = √(a² + b² + c²)
- Euclidean distance: d = (Σ dᵢ²)^(1/2) (cluster analysis with continuous variables)
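As a small illustration of the formulas above (Python is used here for illustration only; the course itself works in SPSS and Excel), a minimal Euclidean distance function that handles the 2-D Pythagorean case, 3-D space, and the general n-dimensional case used in cluster analysis:

```python
import math

def euclidean(p, q):
    """Euclidean distance d = sqrt(sum_i (p_i - q_i)^2) between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2-D (Pythagorean) case: legs 3 and 4 give hypotenuse 5
print(euclidean((0, 0), (3, 4)))     # -> 5.0
# 3-D (Euclidean space) case
print(euclidean((0, 0, 0), (1, 2, 2)))  # -> 3.0
```

The same function works for any number of continuous variables, which is exactly how distance-based clustering (e.g. K-Means) compares observations.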
10. Pearson's Chi-Square

11. Contingency Table

        North  South  East  West  Tot.
Yes      68     75     57    79   279
No       32     45     33    31   141
Tot.    100    120     90   110   420
12. Observed and Theoretical Frequencies

Observed frequencies, with expected (theoretical) frequencies in parentheses:

        North     South     East      West      Tot.
Yes     68 (66)   75 (80)   57 (60)   79 (73)   279 (66%)
No      32 (34)   45 (40)   33 (30)   31 (37)   141 (34%)
Tot.    100       120       90        110       420
13. Chi-Square

Cell  fo   fe   fo-fe  (fo-fe)²  (fo-fe)²/fe
1,1   68   66     2      4       .0606
1,2   75   80    -5     25       .3125
1,3   57   60    -3      9       .1500
1,4   79   73     6     36       .4932
2,1   32   34    -2      4       .1176
2,2   45   40     5     25       .6250
2,3   33   30     3      9       .3000
2,4   31   37    -6     36       .9730

χ² = Σ (fo-fe)²/fe = 3.032
14Statistical Inference
- DF (4 col 1) (2 rows 1) 3
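The chi-square test above can be reproduced with SciPy (an illustrative sketch outside the course's SPSS/Excel toolset). Note that the slide's χ² = 3.032 was computed from expected frequencies rounded to integers; using exact expected frequencies gives a slightly smaller statistic, with the same conclusion:

```python
from scipy.stats import chi2_contingency

# Observed frequencies from the Region x Response contingency table
observed = [[68, 75, 57, 79],   # Yes
            [32, 45, 33, 31]]   # No

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), df)  # chi-square ~ 2.76 with exact expected values, df = 3
```

Because p > .05, independence between region and response is not rejected, matching the inference on slide 14.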
15. Log-Likelihood Chi-Square

16. Log Likelihood
- Based on probability distributions rather than contingency (frequency) tables.
- Applicable to both categorical and continuous variables, unlike chi-square, which requires variables to be discretized.
17. Contingency Table (Observed Frequencies)

        Cluster 1  Cluster 2  Total
Male    10         30         40

18. Contingency Table (Expected Frequencies)

        Cluster 1  Cluster 2  Total
Male    10 (20)    30 (20)    40 (40)
19. Chi-Square

Cell  fo   fe   fo-fe  (fo-fe)²  (fo-fe)²/fe
1,1   10   20   -10    100       5.00
1,2   30   20    10    100       5.00

χ² = 10.00
p < 0.05 (DF = 1, critical value 3.84)
20. Log-Likelihood Distance (Probability)

             Cluster 1       Cluster 2
O            10              30
E            20              20
O/E          10/20 = 0.50    30/20 = 1.50
ln(O/E)      -.693           .405
O·ln(O/E)    -6.93           12.16

G² = 2 × Σ O·ln(O/E) = 2 × (-6.93 + 12.16) = 10.46
p < 0.05 (critical value 3.84)
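The G statistic from slide 20 can be checked with SciPy's `power_divergence`, whose `lambda_="log-likelihood"` option computes exactly G = 2 Σ O·ln(O/E) (again an illustrative Python sketch, not part of the SPSS workflow):

```python
from scipy.stats import power_divergence

# Observed vs. expected male counts in the two clusters (slide 20)
observed = [10, 30]
expected = [20, 20]

# lambda_="log-likelihood" gives the G statistic: G = 2 * sum(O * ln(O/E))
g, p = power_divergence(observed, f_exp=expected, lambda_="log-likelihood")
print(round(g, 2))  # ~ 10.46, in line with the hand calculation above
```

As with the hand calculation, G ≈ 10.46 exceeds the critical value of 3.84 (DF = 1), so p < 0.05.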
21. Variance, ANOVA, and F-Statistics

22. F-Statistics
- For metric or continuous variables
- Compares explained variance (in the model) with unexplained variance (errors)
23. Variance

VALUE   MEAN   SQUARED DIFFERENCE
 20     43.6   556.96
 34     43.6    92.16
 34     43.6    92.16
 38     43.6    31.36
 38     43.6    31.36
 40     43.6    12.96
 41     43.6     6.76
 41     43.6     6.76
 41     43.6     6.76
 42     43.6     2.56
 43     43.6     0.36
 47     43.6    11.56
 47     43.6    11.56
 48     43.6    19.36
 49     43.6    29.16
 49     43.6    29.16
 55     43.6   129.96
 55     43.6   129.96
 55     43.6   129.96
 55     43.6   129.96

COUNT = 20   MEAN = 43.6   SS = 1460.8   DF = 19   VAR = 76.88   SD = 8.768

SS is the Sum of Squares; DF = N - 1; VAR = SS/DF; SD = √VAR
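The variance table above maps directly onto Python's standard-library `statistics` module (shown for illustration; the course computes the same quantities in Excel or SPSS), which also uses the sample formulas SS/(N-1) and √VAR:

```python
import statistics

# The 20 observations from the variance table above
values = [20, 34, 34, 38, 38, 40, 41, 41, 41, 42,
          43, 47, 47, 48, 49, 49, 55, 55, 55, 55]

mean = statistics.mean(values)             # MEAN = 43.6
ss = sum((x - mean) ** 2 for x in values)  # SS (sum of squares) ~ 1460.8
var = statistics.variance(values)          # VAR = SS / (N - 1) ~ 76.88
sd = statistics.stdev(values)              # SD = sqrt(VAR) ~ 8.768
print(mean, round(ss, 1), round(var, 2), round(sd, 3))
```

`statistics.variance` and `statistics.stdev` divide by N - 1 (the DF = 19 of the slide); `statistics.pvariance` would divide by N instead.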
24. ANOVA
- Two groups: t-test
- Three or more groups: are the errors (discrepancies between observations and the overall mean) explained by group membership or by some other (random) effect?
25. One-way ANOVA

Grand mean = 5.042

Group 1: 6, 5, 4, 5, 4, 6, 5, 4   (group mean = 4.875)
Group 2: 8, 9, 7, 8, 9, 7, 8, 9   (group mean = 8.125)
Group 3: 3, 2, 1, 3, 2, 1, 3, 2   (group mean = 2.125)

Within-group squared deviations (X - group mean)², e.g. (6 - 4.875)² = 1.266, (5 - 4.875)² = 0.016, (4 - 4.875)² = 0.766: each group contributes 4.875, so SS Within = 14.625.

Total squared deviations (X - grand mean)², e.g. (6 - 5.042)² = 0.918, (8 - 5.042)² = 8.752, (3 - 5.042)² = 4.168: summed over all 24 observations, Total SS = 158.958.
26. F = MSS(Between) / MSS(Within)

                 SS        DF           Mean SS
Within Groups    14.625    24 - 3 = 21   0.696
Between Groups   144.333   3 - 1 = 2    72.167
Total (Errors)   158.958   24 - 1 = 23   6.911

F = Between Groups Mean SS / Within Groups Mean SS = 72.167 / 0.696 = 103.624, p < .05
27. ONEWAY (Excel or SPSS)

Anova: Single Factor

SUMMARY
Groups    Count  Sum  Average  Variance
Group 1   8      39   4.875    0.696
Group 2   8      65   8.125    0.696
Group 3   8      17   2.125    0.696

ANOVA
Source of Variation   SS        df   MS       F         P-value     F crit
Between Groups        144.333   2    72.167   103.624   1.318E-11   3.467
Within Groups         14.625    21   0.696
Total                 158.958   23
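The same one-way ANOVA can be reproduced outside Excel or SPSS; a minimal SciPy sketch using the three groups from slides 25-27:

```python
from scipy.stats import f_oneway

# The three groups from slides 25-27
group1 = [6, 5, 4, 5, 4, 6, 5, 4]
group2 = [8, 9, 7, 8, 9, 7, 8, 9]
group3 = [3, 2, 1, 3, 2, 1, 3, 2]

# One-way ANOVA: F = MSS(Between) / MSS(Within)
f, p = f_oneway(group1, group2, group3)
print(round(f, 3))  # ~ 103.624, matching the Excel output above
```

The tiny p-value (on the order of 1E-11) says the between-group differences are far too large to attribute to random error.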
28. Profiling

29. Customer Profiling: Documenting or Describing
- Who is likely to buy or not respond?
- Who is likely to buy what product or service?
- Who is in danger of lapsing?
30. CHAID or CART
- CHAID (Chi-Square Automatic Interaction Detector)
  - Based on chi-square
  - All variables discretized
  - Dependent variable nominal
- CART (Classification and Regression Tree)
  - Variables can be discrete or continuous
  - Based on GINI or F-test
  - Dependent variable nominal or continuous
31. Use of Decision Trees
- Classify observations from a target binary or nominal variable → Segmentation
- Predictive response analysis from a target numerical variable → Behaviour
- Decision support rules → Processing
32. Decision Tree

33. Example: dmdata.sav
34CHAID AlgorithmSelecting Variables
- Example
- Regions (4), Gender (3, including Missing)Age
(6, including Missing) - For each variable, collapse categories to
maximize chi-square test of independence
Ex Region (N, S, E, W,) ? (WSE, N) - Select most significant variable
- Go to next branch and next level
- Stop growing if estimated X2 lt theoretical X2
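The "select the most significant variable" step can be sketched in a few lines. This is not SPSS's full CHAID (no category merging, no Bonferroni adjustment, no tree growing) and the records below are hypothetical toy data, not dmdata.sav; it only illustrates picking the predictor with the smallest chi-square p-value against the target:

```python
from collections import Counter
from scipy.stats import chi2_contingency

# Hypothetical toy records: (region, gender, responded)
records = [
    ("N", "M", "yes"), ("N", "F", "yes"), ("N", "M", "yes"), ("N", "F", "no"),
    ("S", "M", "no"),  ("S", "F", "no"),  ("S", "M", "yes"), ("S", "F", "no"),
    ("E", "M", "no"),  ("E", "F", "no"),  ("W", "M", "no"),  ("W", "F", "yes"),
]

def chi2_p(records, var_idx, target_idx=2):
    """Cross-tabulate one predictor against the target; return the chi-square p-value."""
    counts = Counter((r[var_idx], r[target_idx]) for r in records)
    rows = sorted({r[var_idx] for r in records})
    cols = sorted({r[target_idx] for r in records})
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return chi2_contingency(table)[1]

# CHAID-style selection: the predictor with the smallest p-value wins the split
p_values = {name: chi2_p(records, i) for i, name in enumerate(["region", "gender"])}
best = min(p_values, key=p_values.get)
print(best, p_values)
```

Real CHAID would first try merging categories of each predictor (the Region (N, S, E, W) → (WSE, N) step above) before comparing p-values.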
35. CART (Nominal Target)
- Nominal targets
- GINI (impurity reduction or entropy)
  - Squared probability of node membership
  - Gini = 0 when targets are perfectly classified
  - Gini index = 1 - Σ pᵢ²
- Example
  - Prob: Bus = 0.4, Car = 0.3, Train = 0.3
  - Gini = 1 - (0.4² + 0.3² + 0.3²) = 0.660
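The Gini index is a one-liner; a small Python sketch reproducing the bus/car/train example above:

```python
def gini(probs):
    """Gini impurity: 1 - sum(p_i^2); 0 when one class holds all the probability."""
    return 1 - sum(p * p for p in probs)

print(round(gini([0.4, 0.3, 0.3]), 2))  # slide example -> 0.66
print(gini([1.0, 0.0, 0.0]))            # perfectly classified node -> 0.0
```

CART chooses the split that most reduces this impurity between a parent node and its children.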
36. CART (Metric Target)
- Continuous variables
- Variance reduction (F-test)
37. Comparative Advantages (From Wikipedia)
- Simple to understand and interpret
- Requires little data preparation
- Able to handle both numerical and categorical data
- Uses a white-box model easily explained by Boolean logic
- Possible to validate a model using statistical tests
- Robust