Data Mining in Market Research

About This Presentation

Title:

Data Mining in Market Research

Description:

Data Mining in Market Research What is data mining? Methods for finding interesting structure in large databases E.g. patterns, prediction rules, unusual cases – PowerPoint PPT presentation

Number of Views:548

Avg rating:3.0/5.0

Slides: 56

Provided by: JamesR132

Category:

more less

Transcript and Presenter's Notes

Title: Data Mining in Market Research

1
Data Mining in Market Research

What is data mining?
Methods for finding interesting structure in
large databases
E.g. patterns, prediction rules, unusual cases
Focus on efficient, scalable algorithms
Contrasts with emphasis on correct inference in
statistics
Related to data warehousing, machine learning
Why is data mining important?
Well marketed now a large industry pays well
Handles large databases directly
Can make data analysis more accessible to end
users
Semi-automation of analysis
Results can be easier to interpret than e.g.
regression models
Strong focus on decisions and their implementation

2
CRISP-DM Process Model
3
Data Mining Software

Many providers of data mining software
SAS Enterprise Miner, SPSS Clementine, Statistica
Data Miner, MS SQL Server, Polyanalyst,
KnowledgeSTUDIO,
See http//www.kdnuggets.com/software/suites.html
for a list
Good algorithms important, but also need good
facilities for handling data and meta-data
Well use
WEKA (Waikato Environment for Knowledge Analysis)
Free (GPLed) Java package with GUI
Online at www.cs.waikato.ac.nz/ml/weka
Witten and Frank, 2000. Data Mining Practical
Machine Learning Tools and Techniques with Java
Implementations.
R packages
E.g. rpart, class, tree, nnet, cclust, deal,
GeneSOM, knnTree, mlbench, randomForest, subselect

4
Data Mining Terms

Different names for familiar statistical
concepts, from database and AI communities
Observation case, record, instance
Variable field, attribute
Analysis of dependence vs interdependence
Supervised vs unsupervised learning
Relationship association, concept
Dependent variable response, output
Independent variable predictor, input

5
Common Data Mining Techniques

Predictive modeling
Classification
Derive classification rules
Decision trees
Numeric prediction
Regression trees, model trees
Association rules
Meta-learning methods
Cross-validation, bagging, boosting
Other data mining methods include
artificial neural networks, genetic algorithms,
density estimation, clustering, abstraction,
discretisation, visualisation, detecting changes
in data or models

6
Classification

Methods for predicting a discrete response
One kind of supervised learning
Note in biological and other sciences,
classification has long had a different meaning,
referring to cluster analysis
Applications include
Identifying good prospects for specific marketing
or sales efforts
Cross-selling, up-selling when to offer
products
Customers likely to be especially profitable
Customers likely to defect
Identifying poor credit risks
Diagnosing customer problems

7
Weather/Game-Playing Data

Small dataset
14 instances
5 attributes
Outlook - nominal
Temperature - numeric
Humidity - numeric
Wind - nominal
Play
Whether or not a certain game would be played
This is what we want to understand and predict

8
ARFF file for the weather data.
9
German Credit Risk Dataset

1000 instances (people), 21 attributes
class attribute describes people as good or bad
credit risks
Other attributes include financial information
and demographics
E.g. checking_status, duration, credit_history,
purpose, credit_amount, savings_status,
employment, Age, housing, job, num_dependents,
own_telephone, foreign_worker
Want to predict credit risk
Data available at UCI machine learning data
repository
http//www.ics.uci.edu/mlearn/MLRepository.html
and on 747 web page
http//www.stat.auckland.ac.nz/reilly/credit-g.ar
ff

10
Classification Algorithms

Many methods available in WEKA
0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM,
Instance-based learner (IB1, IBk), C4.5 (J48),
PART, Support vector machine (SMO)
Usually train on part of the data, test on the
rest
Simple method Zero-rule, or 0R
Predict the most common category
Class ZeroR in WEKA
Too simple for practical use, but a useful
baseline for evaluating performance of more
complex methods

11
1-Rule (1R) Algorithm

Based on single predictor
Predict mode within each value of that predictor
Look at error rate for each predictor on training
dataset, and choose best predictor
Called OneR in WEKA
Must group numerical predictor values for this
method
Common method is to split at each change in the
response
Collapse buckets until each contains at least 6
instances

12
1R Algorithm (continued)

Biased towards predictors with more categories
These can result in over-fitting to the training
data
But found to perform surprisingly well
Study on 16 widely used datasets
Holte (1993), Machine Learning 11, 63-91
Often error rate only a few percentages points
higher than more sophisticated methods (e.g.
decision trees)
Produced rules that were much simpler and more
easily understood

13
Naïve Bayes Method

Calculates probabilities of each response value,
assuming independence of attribute effects
Response value with highest probability is
predicted
Numeric attributes are assumed to follow a normal
distribution within each response value
Contribution to probability calculated from
normal density function
Instead can use kernel density estimate, or
simply discretise the numerical attributes

14
Naïve Bayes Calculations

Observed counts and probabilities above
Temperature and humidity have been discretised
Consider new day
Outlooksunny, temperaturecool, humidityhigh,
windytrue
Probability(playyes) a 2/9 x 3/9 x 3/9 x 3/9 x
9/14 0.0053
Probability(playno) a 3/5 x 1/5 x 4/5 x 3/5 x
5/14 0.0206
Probability(playno) 0.0206/(0.00530.0206)
79.5
no four times more likely than yes

15
Naïve Bayes Method

If any of the component probabilities are zero,
the whole probability is zero
Effectively a veto on that response value
Add one to each cells count to get around this
problem
Corresponds to weak positive prior information
Naïve Bayes effectively assumes that attributes
are equally important
Several highly correlated attributes could drown
out an important variable that would add new
information
However this method often works well in practice

16
Decision Trees

Classification rules can be expressed in a tree
structure
Move from the top of the tree, down through
various nodes, to the leaves
At each node, a decision is made using a simple
test based on attribute values
The leaf you reach holds the appropriate
predicted value
Decision trees are appealing and easily used
However they can be verbose
Depending on the tests being used, they may
obscure rather than reveal the true pattern
More info online at http//recursive-partitioning.
com/

17
Decision tree with a replicated subtree
If x1 and y1 then class a If z1 and w1
then class a Otherwise class b
18
Problems with Univariate Splits
19
Constructing Decision Trees

Develop tree recursively
Start with all data in one root node
Need to choose attribute that defines first split
For now, we assume univariate splits are used
For accurate predictions, want leaf nodes to be
as pure as possible
Choose the attribute that maximises the average
purity of the daughter nodes
The measure of purity used is the entropy of the
node
This is the amount of information needed to
specify the value of an instance in that node,
measured in bits

20
Tree stumps for the weather data
(a)
(b)
(c)
(d)
21
Weather Example

First node from outlook split is for sunny,
with entropy 2/5 log2(2/5) 3/5 log2(3/5)
0.971
Average entropy of nodes from outlook split is
5/14 x 0.971 4/14 x 0 5/14 x 0.971 0.693
Entropy of root node is 0.940 bits
Gain of 0.247 bits
Other splits yield
Gain(temperature)0.029 bits
Gain(humidity)0.152 bits
Gain(windy)0.048 bits
So outlook is the best attribute to split on

22
Expanded tree stumps for weather data
(a)
(b)
(c)
23
Decision tree for the weather data
24
Decision Tree Algorithms

The algorithm described in the preceding slides
is known as ID3
Due to Quinlan (1986)
Tends to choose attributes with many values
Using information gain ratio helps solve this
problem
Several more improvements have been made to
handle numeric attributes (via univariate
splits), missing values and noisy data (via
pruning)
Resulting algorithm known as C4.5
Described by Quinlan (1993)
Widely used (as is the commercial version C5.0)
WEKA has a version called J4.8

25
Classification Trees

Described (along with regression trees) in
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J.
Stone, 1984. Classification and Regression Trees.
More sophisticated method than ID3
However Quinlans (1993) C4.5 method caught up
with CART in most areas
CART also incorporates methods for pruning,
missing values and numeric attributes
Multivariate splits are possible, as well as
univariate
Split on linear combination Scjxj gt d
CART typically uses Gini measure of node purity
to determine best splits
This is of the form Sp(1-p)
But information/entropy measure also available

26
Regression Trees

Trees can also be used to predict numeric
attributes
Predict using average value of the response in
the appropriate node
Implemented in CART and C4.5 frameworks
Can use a model at each node instead
Implemented in Wekas M5 algorithm
Harder to interpret than regression trees
Classification and regression trees are
implemented in Rs rpart package
See Ch 10 in Venables and Ripley, MASS 3rd Ed.

27
Problems with Trees

Can be unnecessarily verbose
Structure often unstable
Greedy hierarchical algorithm
Small variations can change chosen splits at high
level nodes, which then changes subtree below
Conclusions about attribute importance can be
unreliable
Direct methods tend to overfit training dataset
This problem can be reduced by pruning the tree
Another approach that often works well is to fit
the tree, remove all training cases that are not
correctly predicted, and refit the tree on the
reduced dataset
Typically gives a smaller tree
This usually works almost as well on the training
data
But generalises better, e.g. works better on test
data
Bagging the tree algorithm also gives more stable
results
Will discuss bagging later

28
Classification Tree Example

Use Wekas J4.8 algorithm on German credit data
(with default options)
1000 instances, 21 attributes
Produces a pruned tree with 140 nodes, 103 leaves

Run information
Scheme weka.classifiers.j48.J48 -C 0.25 -M
2
Relation german_credit
Instances 1000
Attributes 21
Number of Leaves 103
Size of the tree 140
Stratified cross-validation
Summary
Correctly Classified Instances 739
73.9
Incorrectly Classified Instances 261
26.1
Kappa statistic 0.3153
Mean absolute error 0.3241
Root mean squared error 0.4604

30
Cross-Validation

Due to over-fitting, cannot estimate prediction
error directly on the training dataset
Cross-validation is a simple and widely used
method for estimating prediction error
Simple approach
Set aside a test dataset
Train learner on the remainder (the training
dataset)
Estimate prediction error by using the resulting
prediction model on the test dataset
This is only feasible where there is enough data
to set aside a test dataset and still have enough
to reliably train the learning algorithm

31
k-fold Cross-Validation

For smaller datasets, use k-fold cross-validation
Split dataset into k roughly equal parts
For each part, train on the other k-1 parts and
use this part as the test dataset
Do this for each of the k parts, and average the
resulting prediction errors
This method measures the prediction error when
training the learner on a fraction (k-1)/k of the
data
If k is small, this will overestimate the
prediction error
k10 is usually enough

Tr
Tr
Tr
Tr
Tr
Tr
Tr
Tr
Test
32
Regression Tree Example

data(car.test.frame)
z.auto lt- rpart(Mileage Weight, car.test.frame)
post(z.auto,FILE)
summary(z.auto)

33
(No Transcript)
34

Call
rpart(formula Mileage Weight, data
car.test.frame)
n 60
CP nsplit rel error xerror
xstd
1 0.59534912 0 1.0000000 1.0322233
0.17981796
2 0.13452819 1 0.4046509 0.6081645
0.11371656
3 0.01282843 2 0.2701227 0.4557341
0.09178782
4 0.01000000 3 0.2572943 0.4659556
0.09134201
Node number 1 60 observations, complexity
param0.5953491
mean24.58333, MSE22.57639
left son2 (45 obs) right son3 (15 obs)
Primary splits
Weight lt 2567.5 to the right,
improve0.5953491, (0 missing)
Node number 2 45 observations, complexity
param0.1345282
mean22.46667, MSE8.026667
left son4 (22 obs) right son5 (23 obs)

Node number 3 15 observations
mean30.93333, MSE12.46222
Node number 4 22 observations
mean20.40909, MSE2.78719
Node number 5 23 observations, complexity
param0.01282843
mean24.43478, MSE5.115312
left son10 (15 obs) right son11 (8 obs)
Primary splits
Weight lt 2747.5 to the right,
improve0.1476996, (0 missing)
Node number 10 15 observations
mean23.8, MSE4.026667
Node number 11 8 observations
mean25.625, MSE4.984375

36
Regression Tree Example (continued)

plotcp(z.auto)
z2.auto lt- prune(z.auto,cp0.1)
post(z2.auto, file"", cex1)

37
Complexity Parameter Plot
38
(No Transcript)
39
Pruned Regression Tree
40
Classification Methods

Project the attribute space into decision regions
Decision trees piecewise constant approximation
Logistic regression linear log-odds
approximation
Discriminant analysis and neural nets linear
non-linear separators
Density estimation coupled with a decision rule
E.g. Naïve Bayes
Define a metric space and decide based on
proximity
One type of instance-based learning
K-nearest neighbour methods
IBk algorithm in Weka
Would like to drop noisy and unnecessary points
Simple algorithm based on success rate confidence
intervals available in Weka
Compares naïve prediction with predictions using
that instance
Must choose suitable acceptance and rejection
confidence levels
Many of these approaches can produce probability
distributions as well as predictions
Depending on the application, this information
may be useful
Such as when results reported to expert (e.g.
loan officer) as input to their decision

41
Numeric Prediction Methods

Linear regression
Splines, including smoothing splines and
multivariate adaptive regression splines (MARS)
Generalised additive models (GAM)
Locally weighted regression (lowess, loess)
Regression and Model Trees
CART, C4.5, M5
Artificial neural networks (ANNs)

42
Artificial Neural Networks (ANNs)

An ANN is a network of many simple processors (or
units), that are connected by communication
channels that carry numeric data
ANNs are very flexible, encompassing nonlinear
regression models, discriminant models, and data
reduction models
They do require some expertise to set up
An appropriate architecture needs to be selected
and tuned for each application
They can be useful tools for learning from
examples to find patterns in data and predict
outputs
However on their own, they tend to overfit the
training data
Meta-learning tools are needed to choose the best
fit
Various network architectures in common use
Multilayer perceptron (MLR)
Radial basis functions (RBF)
Self-organising maps (SOM)
ANNs have been applied to data editing and
imputation, but not widely

43
Meta-Learning Methods - Bagging

General methods for improving the performance of
most learning algorithms
Bootstrap aggregation, bagging for short
Select B bootstrap samples from the data
Selected with replacement, same of instances
Can use parametric or non-parametric bootstrap
Fit the model/learner on each bootstrap sample
The bagged estimate is the average prediction
from all these B models
E.g. for a tree learner, the bagged estimate is
the average prediction from the resulting B trees
Note that this is not a tree
In general, bagging a model or learner does not
produce a model or learner of the same form
Bagging reduces the variance of unstable
procedures like regression trees, and can greatly
improve prediction accuracy
However it does not always work for poor 0-1
predictors

44
Meta-Learning Methods - Boosting

Boosting is a powerful technique for improving
accuracy
The AdaBoost.M1 method (for classifiers)
Give each instance an initial weight of 1/n
For m1 to M
Fit model using the current weights, store
resulting model m
If prediction error rate err is zero or gt 0.5,
terminate loop.
Otherwise calculate amlog((1-err)/err)
This is the log odds of success
Then adjust weights for incorrectly classified
cases by multiplying them by exp(am), and repeat
Predict using a weighted majority vote SamGm(x),
where Gm(x) is the prediction from model m

45
Meta-Learning Methods - Boosting

For example, for the German credit dataset
using 100 iterations of AdaBoost.M1 with the
DecisionStump algorithm,
10-fold cross-validation gives an error rate of
24.9 (compared to 26.1 for J4.8)

46
Association Rules

Data on n purchase baskets in form (id, item1,
item2, , itemk)
For example, purchases from a supermarket
Association rules are statements of the form
When people buy tea, they also often buy
coffee.
May be useful for product placement decisions or
cross-selling recommendations
We say there is an association rule i1 -gti2 if
i1 and i2 occur together in at least s of the n
baskets (the support)
And at least c of the baskets containing item i1
also contain i2 (the confidence)
The confidence criterion ensures that often is
a large enough proportion of the antecedent cases
to be interesting
The support criterion should be large enough that
the resulting rules have practical importance
Also helps to ensure reliability of the
conclusions

47
Association rules

The support/confidence approach is widely used
Efficiently implemented in the Apriori algorithm
First identify item sets with sufficient support
Then turn each item set into sets of rules with
sufficient confidence
This method was originally developed in the
database community, so there has been a focus on
efficient methods for large databases
Large means up to around 100 million instances,
and about ten thousand binary attributes
However this approach can find a vast number of
rules, and it can be difficult to make sense of
these
One useful extension is to identify only the
rules with high enough lift (or odds ratio)

48
Classification vs Association Rules

Classification rules predict the value of a
pre-specified attribute, e.g.
If outlooksunny and humidityhigh then play no
Association rules predict the value of an
arbitrary attribute (or combination of
attributes)
E.g. If temperaturecool then humiditynormal
If humiditynormal and playno then windytrue
If temperaturehigh and humidityhigh then playno

49
Clustering EM Algorithm

Assume that the data is from a mixture of normal
distributions
I.e. one normal component for each cluster
For simplicity, consider one attribute x and two
components or clusters
Model has five parameters (p, µ1, s1, µ2, s2)
?
Log-likelihood
This is hard to maximise directly
Use the expectation-maximisation (EM) algorithm
instead

50
Clustering EM Algorithm

Think of data as being augmented by a latent 0/1
variable di indicating membership of cluster 1
If the values of this variable were known, the
log-likelihood would be
Starting with initial values for the parameters,
calculate the expected value of di
Then substitute this into the above
log-likelihood and maximise to obtain new
parameter values
This will have increased the log-likelihood
Repeat until the log-likelihood converges

51
Clustering EM Algorithm

Resulting estimates may only be a local maximum
Run several times with different starting points
to find global maximum (hopefully)
With parameter estimates, can calculate segment
membership probabilities for each case

52
Clustering EM Algorithm

Extending to more latent classes is easy
Information criteria such as AIC and BIC are
often used to decide how many are appropriate
Extending to multiple attributes is easy if we
assume they are independent, at least
conditioning on segment membership
It is possible to introduce associations, but
this can rapidly increase the number of
parameters required
Nominal attributes can be accommodated by
allowing different discrete distributions in each
latent class, and assuming conditional
independence between attributes
Can extend this approach to a handle joint
clustering and prediction models, as mentioned in
the MVA lectures

53
Clustering - Scalability Issues

k-means algorithm is also widely used
However this and the EM-algorithm are slow on
large databases
So is hierarchical clustering - requires O(n2)
time
Iterative clustering methods require full DB scan
at each iteration
Scalable clustering algorithms are an area of
active research
A few recent algorithms
Distance-based/k-Means
Multi-Resolution kd-Tree for K-Means PM99
CLIQUE AGGR98
Scalable K-Means BFR98a
CLARANS NH94
Probabilistic/EM
Multi-Resolution kd-Tree for EM Moore99
Scalable EM BRF98b
CF Kernel Density Estimation ZRL99

54
Ethics of Data Mining

Data mining and data warehousing raise ethical
and legal issues
Combining information via data warehousing could
violate Privacy Act
Must tell people how their information will be
used when the data is obtained
Data mining raises ethical issues mainly during
application of results
E.g. using ethnicity as a factor in loan approval
decisions
E.g. screening job applications based on age or
sex (where not directly relevant)
E.g. declining insurance coverage based on
neighbourhood if this is related to race
(red-lining is illegal in much of the US)
Whether something is ethical depends on the
application
E.g. probably ethical to use ethnicity to
diagnose and choose treatments for a medical
problem, but not to decline medical insurance