Title: Data Mining in Market Research
1. Data Mining in Market Research
- What is data mining?
- Methods for finding interesting structure in large databases
- E.g. patterns, prediction rules, unusual cases
- Focus on efficient, scalable algorithms
- Contrasts with the emphasis on correct inference in statistics
- Related to data warehousing, machine learning
- Why is data mining important?
- Well marketed; now a large industry that pays well
- Handles large databases directly
- Can make data analysis more accessible to end users
- Semi-automation of analysis
- Results can be easier to interpret than e.g. regression models
- Strong focus on decisions and their implementation
2. CRISP-DM Process Model
3. Data Mining Software
- Many providers of data mining software
- SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, ...
- See http://www.kdnuggets.com/software/suites.html for a list
- Good algorithms are important, but so are good facilities for handling data and meta-data
- We'll use
- WEKA (Waikato Environment for Knowledge Analysis)
- Free (GPLed) Java package with a GUI
- Online at www.cs.waikato.ac.nz/ml/weka
- Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.
- R packages
- E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect
4. Data Mining Terms
- Different names for familiar statistical concepts, from the database and AI communities
- Observation = case, record, instance
- Variable = field, attribute
- Analysis of dependence vs interdependence = supervised vs unsupervised learning
- Relationship = association, concept
- Dependent variable = response, output
- Independent variable = predictor, input
5. Common Data Mining Techniques
- Predictive modeling
- Classification
- Derive classification rules
- Decision trees
- Numeric prediction
- Regression trees, model trees
- Association rules
- Meta-learning methods
- Cross-validation, bagging, boosting
- Other data mining methods include
- Artificial neural networks, genetic algorithms, density estimation, clustering, abstraction, discretisation, visualisation, detecting changes in data or models
6. Classification
- Methods for predicting a discrete response
- One kind of supervised learning
- Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis
- Applications include
- Identifying good prospects for specific marketing or sales efforts
- Cross-selling, up-selling: when to offer products
- Customers likely to be especially profitable
- Customers likely to defect
- Identifying poor credit risks
- Diagnosing customer problems
7. Weather/Game-Playing Data
- Small dataset
- 14 instances
- 5 attributes
- Outlook - nominal
- Temperature - numeric
- Humidity - numeric
- Wind - nominal
- Play
- Whether or not a certain game would be played
- This is what we want to understand and predict
8. ARFF file for the weather data.
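Since the slide itself only shows the file, here is a hedged sketch of what a WEKA ARFF file for this dataset typically looks like (attribute declarations follow the weather data described above; the data rows shown are illustrative):

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes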
9. German Credit Risk Dataset
- 1000 instances (people), 21 attributes
- The class attribute describes people as good or bad credit risks
- Other attributes include financial information and demographics
- E.g. checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, Age, housing, job, num_dependents, own_telephone, foreign_worker
- Want to predict credit risk
- Data available at the UCI machine learning data repository
- http://www.ics.uci.edu/mlearn/MLRepository.html
- and on the 747 web page
- http://www.stat.auckland.ac.nz/reilly/credit-g.arff
10. Classification Algorithms
- Many methods available in WEKA
- 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, instance-based learners (IB1, IBk), C4.5 (J48), PART, support vector machine (SMO)
- Usually train on part of the data, test on the rest
- Simple method: Zero-rule, or 0R
- Predict the most common category
- Class ZeroR in WEKA
- Too simple for practical use, but a useful baseline for evaluating the performance of more complex methods (sketched below)
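To make the baseline concrete, here is a minimal R sketch of 0R (not WEKA's ZeroR class itself); it assumes a data frame weather with a factor response play:

    # 0R / ZeroR baseline: always predict the most common class.
    # Assumes a data frame `weather` with a factor response `play`.
    zero_r <- function(response) {
      tab <- table(response)
      names(tab)[which.max(tab)]            # the modal class
    }
    majority <- zero_r(weather$play)        # "yes" for the 14-instance weather data
    baseline_error <- mean(weather$play != majority)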
11. 1-Rule (1R) Algorithm
- Based on a single predictor
- Predict the mode within each value of that predictor
- Look at the error rate for each predictor on the training dataset, and choose the best predictor
- Called OneR in WEKA
- Must group numerical predictor values for this method
- A common method is to split at each change in the response
- Then collapse buckets until each contains at least 6 instances
12. 1R Algorithm (continued)
- Biased towards predictors with more categories
- These can result in over-fitting to the training data
- But found to perform surprisingly well
- Study on 16 widely used datasets
- Holte (1993), Machine Learning 11, 63-91
- Often the error rate was only a few percentage points higher than for more sophisticated methods (e.g. decision trees)
- Produced rules that were much simpler and more easily understood
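As a hedged illustration (not WEKA's OneR implementation), a minimal R sketch of the 1R idea for nominal attributes; numeric attributes would need to be discretised first:

    # 1R: for each predictor, predict the modal class within each of its values,
    # then keep the single predictor with the lowest training error rate.
    one_r <- function(data, response) {
      y <- data[[response]]
      predictors <- setdiff(names(data), response)
      errors <- sapply(predictors, function(p) {
        x <- data[[p]]
        # rule: value of x -> most common class of y for that value
        rule <- tapply(y, x, function(cl) names(which.max(table(cl))))
        mean(rule[as.character(x)] != y)    # training error rate of this one-variable rule
      })
      errors[which.min(errors)]             # best predictor and its error rate
    }
    # one_r(weather, "play")  # assumes the weather data frame exists, all attributes nominal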
13. Naïve Bayes Method
- Calculates the probability of each response value, assuming independence of attribute effects
- The response value with the highest probability is predicted
- Numeric attributes are assumed to follow a normal distribution within each response value
- Their contribution to the probability is calculated from the normal density function
- Alternatively, use a kernel density estimate, or simply discretise the numerical attributes
14. Naïve Bayes Calculations
- Observed counts and probabilities above
- Temperature and humidity have been discretised
- Consider a new day
- Outlook = sunny, temperature = cool, humidity = high, windy = true
- Probability(play = yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
- Probability(play = no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
- Probability(play = no) = 0.0206/(0.0053 + 0.0206) = 79.5%
- So no is about four times more likely than yes
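To make the arithmetic explicit, a short R sketch of the same calculation, using the conditional counts quoted above:

    # Naive Bayes for the new day (outlook=sunny, temperature=cool, humidity=high, windy=true)
    p_yes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # = 0.0053
    p_no  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # = 0.0206
    p_no / (p_yes + p_no)                             # = 0.795, so "no" is about 4x as likely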
15. Naïve Bayes Method
- If any of the component probabilities are zero, the whole probability is zero
- Effectively a veto on that response value
- Add one to each cell's count to get around this problem
- Corresponds to weak positive prior information
- Naïve Bayes effectively assumes that attributes are equally important
- Several highly correlated attributes could drown out an important variable that would add new information
- However this method often works well in practice
16. Decision Trees
- Classification rules can be expressed in a tree structure
- Move from the top of the tree, down through various nodes, to the leaves
- At each node, a decision is made using a simple test based on attribute values
- The leaf you reach holds the appropriate predicted value
- Decision trees are appealing and easily used
- However they can be verbose
- Depending on the tests being used, they may obscure rather than reveal the true pattern
- More info online at http://recursive-partitioning.com/
17. Decision tree with a replicated subtree
- If x = 1 and y = 1 then class = a
- If z = 1 and w = 1 then class = a
- Otherwise class = b
18. Problems with Univariate Splits
19. Constructing Decision Trees
- Develop the tree recursively
- Start with all data in one root node
- Need to choose the attribute that defines the first split
- For now, we assume univariate splits are used
- For accurate predictions, we want leaf nodes to be as pure as possible
- Choose the attribute that maximises the average purity of the daughter nodes
- The measure of purity used is the entropy of the node
- This is the amount of information needed to specify the value of an instance in that node, measured in bits
- entropy = -Σ p log2(p), summed over the classes in the node, where p is the proportion of the node's instances in each class
20. Tree stumps for the weather data (figure, panels (a)-(d))
21. Weather Example
- The first node from the outlook split is for sunny, with entropy -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971 bits
- The average entropy of the nodes from the outlook split is
- 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits
- The entropy of the root node is 0.940 bits
- So the outlook split gains 0.940 - 0.693 = 0.247 bits
- Other splits yield
- Gain(temperature) = 0.029 bits
- Gain(humidity) = 0.152 bits
- Gain(windy) = 0.048 bits
- So outlook is the best attribute to split on
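As a check on these numbers, a small R sketch that reproduces the entropy and information-gain calculation (class counts typed in from the weather data):

    # Entropy of a vector of class counts, in bits
    entropy <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]                   # treat 0 log 0 as 0
      -sum(p * log2(p))
    }
    # Overall: 9 yes, 5 no.  Outlook splits this into
    # sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no).
    root   <- entropy(c(9, 5))                              # 0.940 bits
    leaves <- c(entropy(c(2, 3)), entropy(c(4, 0)), entropy(c(3, 2)))
    avg    <- sum(c(5, 4, 5) / 14 * leaves)                 # 0.693 bits
    gain   <- root - avg                                    # 0.247 bits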
22. Expanded tree stumps for the weather data (figure, panels (a)-(c))
23. Decision tree for the weather data
24. Decision Tree Algorithms
- The algorithm described in the preceding slides is known as ID3
- Due to Quinlan (1986)
- Tends to choose attributes with many values
- Using the information gain ratio helps solve this problem
- Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning)
- The resulting algorithm is known as C4.5
- Described by Quinlan (1993)
- Widely used (as is the commercial version C5.0)
- WEKA has a version called J4.8
25. Classification Trees
- Described (along with regression trees) in
- L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees.
- A more sophisticated method than ID3
- However Quinlan's (1993) C4.5 method caught up with CART in most areas
- CART also incorporates methods for pruning, missing values and numeric attributes
- Multivariate splits are possible, as well as univariate
- Split on a linear combination: Σ cj xj > d
- CART typically uses the Gini measure of node purity to determine the best splits
- This is of the form Σ p(1-p), summed over the classes
- But an information/entropy measure is also available
26. Regression Trees
- Trees can also be used to predict numeric attributes
- Predict using the average value of the response in the appropriate node
- Implemented in the CART and C4.5 frameworks
- Can use a model at each node instead
- Implemented in Weka's M5 algorithm
- Harder to interpret than regression trees
- Classification and regression trees are implemented in R's rpart package
- See Ch 10 in Venables and Ripley, MASS 3rd Ed.
27. Problems with Trees
- Can be unnecessarily verbose
- Structure often unstable
- Greedy hierarchical algorithm
- Small variations can change the chosen splits at high-level nodes, which then changes the subtree below
- Conclusions about attribute importance can be unreliable
- Direct methods tend to overfit the training dataset
- This problem can be reduced by pruning the tree
- Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset
- Typically gives a smaller tree
- This usually works almost as well on the training data
- But generalises better, e.g. works better on test data
- Bagging the tree algorithm also gives more stable results
- Will discuss bagging later
28. Classification Tree Example
- Use Weka's J4.8 algorithm on the German credit data (with default options)
- 1000 instances, 21 attributes
- Produces a pruned tree with 140 nodes, of which 103 are leaves
29. Run information
- Scheme: weka.classifiers.j48.J48 -C 0.25 -M 2
- Relation: german_credit
- Instances: 1000
- Attributes: 21
- Number of Leaves: 103
- Size of the tree: 140
- Stratified cross-validation
- Summary
- Correctly Classified Instances: 739 (73.9%)
- Incorrectly Classified Instances: 261 (26.1%)
- Kappa statistic: 0.3153
- Mean absolute error: 0.3241
- Root mean squared error: 0.4604
30. Cross-Validation
- Due to over-fitting, we cannot estimate prediction error directly on the training dataset
- Cross-validation is a simple and widely used method for estimating prediction error
- Simple approach:
- Set aside a test dataset
- Train the learner on the remainder (the training dataset)
- Estimate prediction error by using the resulting prediction model on the test dataset
- This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm
31. k-fold Cross-Validation
- For smaller datasets, use k-fold cross-validation
- Split the dataset into k roughly equal parts
- For each part, train on the other k-1 parts and use this part as the test dataset
- Do this for each of the k parts, and average the resulting prediction errors
- This method measures the prediction error when training the learner on a fraction (k-1)/k of the data
- If k is small, this will overestimate the prediction error
- k = 10 is usually enough (see the R sketch below)
[Figure: the dataset shown as k blocks, k-1 marked "Tr" (training folds) and one marked "Test"]
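A hedged R sketch of 10-fold cross-validation for a classification tree; it assumes a data frame credit whose factor column class is the response (e.g. the German credit data):

    library(rpart)
    # 10-fold cross-validation of a classification tree.
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(credit)))   # random fold assignment
    errors <- sapply(1:k, function(i) {
      train <- credit[folds != i, ]
      test  <- credit[folds == i, ]
      fit   <- rpart(class ~ ., data = train, method = "class")
      pred  <- predict(fit, newdata = test, type = "class")
      mean(pred != test$class)                             # error rate on the held-out fold
    })
    mean(errors)                                           # cross-validated error estimate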
32. Regression Tree Example
- library(rpart)
- data(car.test.frame)
- z.auto <- rpart(Mileage ~ Weight, car.test.frame)
- post(z.auto, file = "")
- summary(z.auto)
34. Output from summary(z.auto):

    Call:
    rpart(formula = Mileage ~ Weight, data = car.test.frame)
      n = 60

              CP nsplit rel error    xerror       xstd
    1 0.59534912      0 1.0000000 1.0322233 0.17981796
    2 0.13452819      1 0.4046509 0.6081645 0.11371656
    3 0.01282843      2 0.2701227 0.4557341 0.09178782
    4 0.01000000      3 0.2572943 0.4659556 0.09134201

    Node number 1: 60 observations, complexity param=0.5953491
      mean=24.58333, MSE=22.57639
      left son=2 (45 obs) right son=3 (15 obs)
      Primary splits:
          Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

    Node number 2: 45 observations, complexity param=0.1345282
      mean=22.46667, MSE=8.026667
      left son=4 (22 obs) right son=5 (23 obs)
35. Output from summary(z.auto), continued:

    Node number 3: 15 observations
      mean=30.93333, MSE=12.46222

    Node number 4: 22 observations
      mean=20.40909, MSE=2.78719

    Node number 5: 23 observations, complexity param=0.01282843
      mean=24.43478, MSE=5.115312
      left son=10 (15 obs) right son=11 (8 obs)
      Primary splits:
          Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

    Node number 10: 15 observations
      mean=23.8, MSE=4.026667

    Node number 11: 8 observations
      mean=25.625, MSE=4.984375
36. Regression Tree Example (continued)
- plotcp(z.auto)
- z2.auto <- prune(z.auto, cp = 0.1)
- post(z2.auto, file = "", cex = 1)
37. Complexity Parameter Plot
39. Pruned Regression Tree
40. Classification Methods
- Project the attribute space into decision regions
- Decision trees: piecewise constant approximation
- Logistic regression: linear log-odds approximation
- Discriminant analysis and neural nets: linear and non-linear separators
- Density estimation coupled with a decision rule
- E.g. Naïve Bayes
- Define a metric space and decide based on proximity
- One type of instance-based learning
- k-nearest neighbour methods
- IBk algorithm in Weka (an R sketch follows this list)
- Would like to drop noisy and unnecessary points
- A simple algorithm based on success-rate confidence intervals is available in Weka
- Compares the naïve prediction with predictions using that instance
- Must choose suitable acceptance and rejection confidence levels
- Many of these approaches can produce probability distributions as well as predictions
- Depending on the application, this information may be useful
- Such as when results are reported to an expert (e.g. a loan officer) as input to their decision
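As a hedged illustration of the proximity-based approach (using R's class package rather than Weka's IBk), a k-nearest-neighbour sketch; train, test and cl are assumed to exist as numeric attribute matrices and a class factor, with attributes on comparable scales:

    library(class)
    # k-nearest-neighbour classification (cf. Weka's IBk).
    # `train`, `test`: matrices of numeric attributes; `cl`: classes of the training rows.
    pred <- knn(train, test, cl, k = 5)        # majority vote among the 5 nearest neighbours
    fit  <- knn(train, test, cl, k = 5, prob = TRUE)
    attr(fit, "prob")                          # proportion of votes for the winning class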
41. Numeric Prediction Methods
- Linear regression
- Splines, including smoothing splines and multivariate adaptive regression splines (MARS)
- Generalised additive models (GAM)
- Locally weighted regression (lowess, loess)
- Regression and model trees
- CART, C4.5, M5
- Artificial neural networks (ANNs)
42. Artificial Neural Networks (ANNs)
- An ANN is a network of many simple processors (or units), connected by communication channels that carry numeric data
- ANNs are very flexible, encompassing nonlinear regression models, discriminant models, and data reduction models
- They do require some expertise to set up
- An appropriate architecture needs to be selected and tuned for each application
- They can be useful tools for learning from examples to find patterns in data and predict outputs
- However on their own, they tend to overfit the training data
- Meta-learning tools are needed to choose the best fit
- Various network architectures are in common use
- Multilayer perceptron (MLP)
- Radial basis functions (RBF)
- Self-organising maps (SOM)
- ANNs have been applied to data editing and imputation, but not widely
43. Meta-Learning Methods - Bagging
- General methods for improving the performance of most learning algorithms
- Bootstrap aggregation, "bagging" for short
- Select B bootstrap samples from the data
- Selected with replacement, same number of instances
- Can use a parametric or non-parametric bootstrap
- Fit the model/learner on each bootstrap sample
- The bagged estimate is the average prediction from all these B models
- E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees
- Note that this is not a tree
- In general, bagging a model or learner does not produce a model or learner of the same form
- Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy
- However it does not always work for poor 0-1 predictors
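A hedged R sketch of bagging a regression tree with rpart, reusing the car.test.frame example from earlier (B = 100 is an illustrative choice):

    library(rpart)
    data(car.test.frame)
    # Bagging: fit the tree on B bootstrap samples and average the predictions.
    B <- 100
    preds <- replicate(B, {
      boot <- car.test.frame[sample(nrow(car.test.frame), replace = TRUE), ]  # bootstrap sample
      fit  <- rpart(Mileage ~ Weight, data = boot)
      predict(fit, newdata = car.test.frame)       # predictions for the original cases
    })
    bagged <- rowMeans(preds)   # bagged estimate: an average over B trees, not itself a tree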
44. Meta-Learning Methods - Boosting
- Boosting is a powerful technique for improving accuracy
- The AdaBoost.M1 method (for classifiers):
- Give each instance an initial weight of 1/n
- For m = 1 to M:
- Fit the model using the current weights, and store the resulting model m
- If the prediction error rate err is zero or > 0.5, terminate the loop
- Otherwise calculate αm = log((1 - err)/err)
- This is the log odds of success
- Then adjust the weights for incorrectly classified cases by multiplying them by exp(αm), and repeat
- Predict using a weighted majority vote Σ αm Gm(x), where Gm(x) is the prediction from model m
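A simplified, hedged R sketch of AdaBoost.M1 with rpart decision stumps as the base learner; it follows the description above rather than Weka's code, and the use of a weighted error rate, the data frame d and its factor response name are assumptions of the sketch:

    library(rpart)
    # Simplified AdaBoost.M1 for a two-class factor response in data frame `d`.
    ada_m1 <- function(d, response, M = 20) {
      n <- nrow(d); w <- rep(1 / n, n)
      models <- list(); alpha <- numeric(0)
      form <- as.formula(paste(response, "~ ."))
      for (m in 1:M) {
        fit  <- rpart(form, data = d, weights = w, method = "class",
                      control = rpart.control(maxdepth = 1))       # a "decision stump"
        pred <- predict(fit, d, type = "class")
        miss <- pred != d[[response]]
        err  <- sum(w[miss]) / sum(w)            # weighted error rate
        if (err == 0 || err >= 0.5) break        # stop as on the slide
        a <- log((1 - err) / err)                # log odds of success
        w[miss] <- w[miss] * exp(a)              # up-weight misclassified cases
        models[[m]] <- fit; alpha[m] <- a
      }
      # new cases are then predicted by a weighted majority vote of the stored stumps
      list(models = models, alpha = alpha)
    }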
45. Meta-Learning Methods - Boosting
- For example, for the German credit dataset:
- Using 100 iterations of AdaBoost.M1 with the DecisionStump algorithm,
- 10-fold cross-validation gives an error rate of 24.9% (compared to 26.1% for J4.8)
46. Association Rules
- Data on n purchase baskets in the form (id, item1, item2, ..., itemk)
- For example, purchases from a supermarket
- Association rules are statements of the form
- "When people buy tea, they also often buy coffee."
- May be useful for product placement decisions or cross-selling recommendations
- We say there is an association rule i1 -> i2 if
- i1 and i2 occur together in at least s% of the n baskets (the support)
- And at least c% of the baskets containing item i1 also contain i2 (the confidence)
- The confidence criterion ensures that "often" is a large enough proportion of the antecedent cases to be interesting
- The support criterion should be large enough that the resulting rules have practical importance
- It also helps to ensure the reliability of the conclusions
47. Association Rules
- The support/confidence approach is widely used
- Efficiently implemented in the Apriori algorithm
- First identify item sets with sufficient support
- Then turn each item set into sets of rules with sufficient confidence
- This method was originally developed in the database community, so there has been a focus on efficient methods for large databases
- "Large" means up to around 100 million instances, and about ten thousand binary attributes
- However this approach can find a vast number of rules, and it can be difficult to make sense of these
- One useful extension is to identify only the rules with a high enough lift (or odds ratio)
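A hedged sketch of this approach using R's arules package (rather than Weka); Groceries is a transactions dataset shipped with arules, and the thresholds are illustrative choices:

    library(arules)
    data(Groceries)    # example market-basket transactions included with arules
    # Apriori: keep rules meeting minimum support and confidence, then rank by lift.
    rules <- apriori(Groceries,
                     parameter = list(support = 0.01, confidence = 0.5))
    inspect(sort(rules, by = "lift")[1:5])   # the 5 rules with the highest lift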
48. Classification vs Association Rules
- Classification rules predict the value of a pre-specified attribute, e.g.
- If outlook = sunny and humidity = high then play = no
- Association rules predict the value of an arbitrary attribute (or combination of attributes)
- E.g. If temperature = cool then humidity = normal
- If humidity = normal and play = no then windy = true
- If temperature = high and humidity = high then play = no
49. Clustering: EM Algorithm
- Assume that the data come from a mixture of normal distributions
- I.e. one normal component for each cluster
- For simplicity, consider one attribute x and two components or clusters
- The model has five parameters: (p, µ1, σ1, µ2, σ2)
- Log-likelihood: Σi log[ p φ(xi; µ1, σ1) + (1 - p) φ(xi; µ2, σ2) ], where φ is the normal density
- This is hard to maximise directly
- Use the expectation-maximisation (EM) algorithm instead
50. Clustering: EM Algorithm
- Think of the data as being augmented by a latent 0/1 variable di indicating membership of cluster 1
- If the values of this variable were known, the log-likelihood would be
- Σi [ di log(p φ(xi; µ1, σ1)) + (1 - di) log((1 - p) φ(xi; µ2, σ2)) ]
- Starting with initial values for the parameters, calculate the expected value of di
- Then substitute this into the above log-likelihood and maximise to obtain new parameter values
- This will have increased the log-likelihood
- Repeat until the log-likelihood converges
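A hedged R sketch of this two-component, one-attribute EM algorithm (plain R; the starting values are crude illustrative choices, and x is assumed to be a numeric data vector):

    # EM for a two-component univariate normal mixture.
    em_mix2 <- function(x, max_iter = 200, tol = 1e-8) {
      p <- 0.5
      mu1 <- mean(x[x < median(x)]); mu2 <- mean(x[x >= median(x)])
      s1 <- s2 <- sd(x)
      loglik <- -Inf
      for (it in 1:max_iter) {
        # E-step: expected cluster-1 membership d_i under the current parameters
        f1 <- p * dnorm(x, mu1, s1); f2 <- (1 - p) * dnorm(x, mu2, s2)
        d  <- f1 / (f1 + f2)
        # M-step: weighted estimates of the five parameters (p, mu1, s1, mu2, s2)
        p   <- mean(d)
        mu1 <- sum(d * x) / sum(d);  mu2 <- sum((1 - d) * x) / sum(1 - d)
        s1  <- sqrt(sum(d * (x - mu1)^2) / sum(d))
        s2  <- sqrt(sum((1 - d) * (x - mu2)^2) / sum(1 - d))
        new_loglik <- sum(log(f1 + f2))
        if (new_loglik - loglik < tol) break    # stop when the log-likelihood converges
        loglik <- new_loglik
      }
      list(p = p, mu = c(mu1, mu2), sigma = c(s1, s2), membership = d)
    }

The returned membership values are the segment membership probabilities referred to on the next slide.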
51. Clustering: EM Algorithm
- The resulting estimates may only be a local maximum
- Run several times with different starting points to find the global maximum (hopefully)
- With the parameter estimates, we can calculate segment membership probabilities for each case
52. Clustering: EM Algorithm
- Extending to more latent classes is easy
- Information criteria such as AIC and BIC are often used to decide how many are appropriate
- Extending to multiple attributes is easy if we assume they are independent, at least conditional on segment membership
- It is possible to introduce associations, but this can rapidly increase the number of parameters required
- Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes
- This approach can be extended to handle joint clustering and prediction models, as mentioned in the MVA lectures
53. Clustering - Scalability Issues
- The k-means algorithm is also widely used
- However this and the EM algorithm are slow on large databases
- So is hierarchical clustering, which requires O(n²) time
- Iterative clustering methods require a full DB scan at each iteration
- Scalable clustering algorithms are an area of active research
- A few recent algorithms:
- Distance-based / k-means
- Multi-resolution kd-tree for k-means [PM99]
- CLIQUE [AGGR98]
- Scalable k-means [BFR98a]
- CLARANS [NH94]
- Probabilistic / EM
- Multi-resolution kd-tree for EM [Moore99]
- Scalable EM [BRF98b]
- CF kernel density estimation [ZRL99]
54. Ethics of Data Mining
- Data mining and data warehousing raise ethical and legal issues
- Combining information via data warehousing could violate the Privacy Act
- Must tell people how their information will be used when the data is obtained
- Data mining raises ethical issues mainly during the application of results
- E.g. using ethnicity as a factor in loan approval decisions
- E.g. screening job applications based on age or sex (where not directly relevant)
- E.g. declining insurance coverage based on neighbourhood if this is related to race ("red-lining" is illegal in much of the US)
- Whether something is ethical depends on the application
- E.g. probably ethical to use ethnicity to diagnose and choose treatments for a medical problem, but not to decline medical insurance