Data Mining in Market Research
1
Data Mining in Market Research
  • What is data mining?
  • Methods for finding interesting structure in
    large databases
  • E.g. patterns, prediction rules, unusual cases
  • Focus on efficient, scalable algorithms
  • Contrasts with emphasis on correct inference in
    statistics
  • Related to data warehousing, machine learning
  • Why is data mining important?
  • Well marketed; now a large industry that pays well
  • Handles large databases directly
  • Can make data analysis more accessible to end
    users
  • Semi-automation of analysis
  • Results can be easier to interpret than e.g.
    regression models
  • Strong focus on decisions and their implementation

2
CRISP-DM Process Model
3
Data Mining Software
  • Many providers of data mining software
  • SAS Enterprise Miner, SPSS Clementine, Statistica
    Data Miner, MS SQL Server, PolyAnalyst,
    KnowledgeSTUDIO, ...
  • See http://www.kdnuggets.com/software/suites.html
    for a list
  • Good algorithms important, but also need good
    facilities for handling data and meta-data
  • We'll use
  • WEKA (Waikato Environment for Knowledge Analysis)
  • Free (GPLed) Java package with GUI
  • Online at www.cs.waikato.ac.nz/ml/weka
  • Witten and Frank, 2000. Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations.
  • R packages
  • E.g. rpart, class, tree, nnet, cclust, deal,
    GeneSOM, knnTree, mlbench, randomForest, subselect

4
Data Mining Terms
  • Different names for familiar statistical
    concepts, from database and AI communities
  • Observation: case, record, instance
  • Variable: field, attribute
  • Analysis of dependence vs interdependence:
    supervised vs unsupervised learning
  • Relationship: association, concept
  • Dependent variable: response, output
  • Independent variable: predictor, input

5
Common Data Mining Techniques
  • Predictive modeling
  • Classification
  • Derive classification rules
  • Decision trees
  • Numeric prediction
  • Regression trees, model trees
  • Association rules
  • Meta-learning methods
  • Cross-validation, bagging, boosting
  • Other data mining methods include
  • artificial neural networks, genetic algorithms,
    density estimation, clustering, abstraction,
    discretisation, visualisation, detecting changes
    in data or models

6
Classification
  • Methods for predicting a discrete response
  • One kind of supervised learning
  • Note in biological and other sciences,
    classification has long had a different meaning,
    referring to cluster analysis
  • Applications include
  • Identifying good prospects for specific marketing
    or sales efforts
  • Cross-selling, up-selling: when to offer
    products
  • Customers likely to be especially profitable
  • Customers likely to defect
  • Identifying poor credit risks
  • Diagnosing customer problems

7
Weather/Game-Playing Data
  • Small dataset
  • 14 instances
  • 5 attributes
  • Outlook - nominal
  • Temperature - numeric
  • Humidity - numeric
  • Wind - nominal
  • Play
  • Whether or not a certain game would be played
  • This is what we want to understand and predict

8
ARFF file for the weather data.
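The slide shows the file itself; as a sketch from memory of the standard weather.arff shipped with WEKA, it begins with attribute declarations followed by one comma-separated line per instance:

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    ...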
9
German Credit Risk Dataset
  • 1000 instances (people), 21 attributes
  • class attribute describes people as good or bad
    credit risks
  • Other attributes include financial information
    and demographics
  • E.g. checking_status, duration, credit_history,
    purpose, credit_amount, savings_status,
    employment, Age, housing, job, num_dependents,
    own_telephone, foreign_worker
  • Want to predict credit risk
  • Data available at UCI machine learning data
    repository
  • http://www.ics.uci.edu/~mlearn/MLRepository.html
  • and on 747 web page
  • http://www.stat.auckland.ac.nz/~reilly/credit-g.arff

10
Classification Algorithms
  • Many methods available in WEKA
  • 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM,
    Instance-based learner (IB1, IBk), C4.5 (J48),
    PART, Support vector machine (SMO)
  • Usually train on part of the data, test on the
    rest
  • Simple method: Zero-rule, or 0R (sketched below)
  • Predict the most common category
  • Class ZeroR in WEKA
  • Too simple for practical use, but a useful
    baseline for evaluating performance of more
    complex methods
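As an illustration, a minimal 0R in R, using the play values of the weather data described earlier:

    # 0R predicts the modal class, here "yes" (9 of the 14 instances)
    zero_r <- function(y) names(which.max(table(y)))
    play <- c("no","no","yes","yes","yes","no","yes",
              "no","yes","yes","yes","yes","yes","no")
    zero_r(play)   # "yes"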

11
1-Rule (1R) Algorithm
  • Based on single predictor
  • Predict mode within each value of that predictor
  • Look at error rate for each predictor on training
    dataset, and choose best predictor
  • Called OneR in WEKA
  • Must group numerical predictor values for this
    method
  • Common method is to split at each change in the
    response
  • Collapse buckets until each contains at least 6
    instances
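A sketch of the 1R search in R for nominal predictors (numeric attributes would first be grouped as above); the weather data frame in the usage comment is hypothetical:

    one_r <- function(X, y) {
      # training error when predicting the modal class within each value
      errs <- sapply(X, function(x) {
        rules <- tapply(y, x, function(cl) names(which.max(table(cl))))
        mean(rules[as.character(x)] != y)
      })
      errs[which.min(errs)]   # best single predictor and its error rate
    }
    # e.g. one_r(weather[c("outlook", "windy")], weather$play)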

12
1R Algorithm (continued)
  • Biased towards predictors with more categories
  • These can result in over-fitting to the training
    data
  • But found to perform surprisingly well
  • Study on 16 widely used datasets
  • Holte (1993), Machine Learning 11, 63-91
  • Often error rate only a few percentage points
    higher than more sophisticated methods (e.g.
    decision trees)
  • Produced rules that were much simpler and more
    easily understood

13
Naïve Bayes Method
  • Calculates probabilities of each response value,
    assuming independence of attribute effects
  • Response value with highest probability is
    predicted
  • Numeric attributes are assumed to follow a normal
    distribution within each response value
  • Contribution to probability calculated from
    normal density function
  • Instead can use kernel density estimate, or
    simply discretise the numerical attributes

14
Naïve Bayes Calculations
  • Observed counts and probabilities above
  • Temperature and humidity have been discretised
  • Consider a new day with
  • Outlook = sunny, temperature = cool,
    humidity = high, windy = true
  • Probability(play = yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x
    9/14 = 0.0053
  • Probability(play = no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x
    5/14 = 0.0206
  • Probability(play = no) = 0.0206/(0.0053 + 0.0206)
    = 79.5%
  • no is about four times more likely than yes
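These calculations can be verified with a few lines of R:

    p_yes <- 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # proportional to P(yes, new day)
    p_no  <- 3/5 * 1/5 * 4/5 * 3/5 * 5/14   # proportional to P(no, new day)
    p_no / (p_yes + p_no)                   # 0.795, i.e. 79.5% for "no"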

15
Naïve Bayes Method
  • If any of the component probabilities are zero,
    the whole probability is zero
  • Effectively a veto on that response value
  • Add one to each cell's count to get around this
    problem
  • Corresponds to weak positive prior information
  • Naïve Bayes effectively assumes that attributes
    are equally important
  • Several highly correlated attributes could drown
    out an important variable that would add new
    information
  • However this method often works well in practice

16
Decision Trees
  • Classification rules can be expressed in a tree
    structure
  • Move from the top of the tree, down through
    various nodes, to the leaves
  • At each node, a decision is made using a simple
    test based on attribute values
  • The leaf you reach holds the appropriate
    predicted value
  • Decision trees are appealing and easily used
  • However they can be verbose
  • Depending on the tests being used, they may
    obscure rather than reveal the true pattern
  • More info online at
    http://recursive-partitioning.com/

17
Decision tree with a replicated subtree
If x = 1 and y = 1 then class a
If z = 1 and w = 1 then class a
Otherwise class b
18
Problems with Univariate Splits
19
Constructing Decision Trees
  • Develop tree recursively
  • Start with all data in one root node
  • Need to choose attribute that defines first split
  • For now, we assume univariate splits are used
  • For accurate predictions, want leaf nodes to be
    as pure as possible
  • Choose the attribute that maximises the average
    purity of the daughter nodes
  • The measure of purity used is the entropy of the
    node
  • This is the amount of information needed to
    specify the value of an instance in that node,
    measured in bits

20
Tree stumps for the weather data
21
Weather Example
  • First node from outlook split is for "sunny",
    with entropy -2/5 log2(2/5) - 3/5 log2(3/5) =
    0.971
  • Average entropy of nodes from outlook split is
  • 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693
  • Entropy of root node is 0.940 bits
  • Gain of 0.940 - 0.693 = 0.247 bits
  • Other splits yield
  • Gain(temperature) = 0.029 bits
  • Gain(humidity) = 0.152 bits
  • Gain(windy) = 0.048 bits
  • So outlook is the best attribute to split on
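These figures are easy to reproduce; a small R sketch using the class counts of the weather data:

    entropy <- function(counts) {            # entropy of a node, in bits
      p <- counts / sum(counts)
      p <- p[p > 0]                          # treat 0 log 0 as 0
      -sum(p * log2(p))
    }
    root  <- entropy(c(9, 5))                # root node: 0.940 bits
    after <- (5 * entropy(c(2, 3)) + 4 * entropy(c(4, 0)) +
              5 * entropy(c(3, 2))) / 14     # after outlook split: 0.693 bits
    root - after                             # gain for outlook: 0.247 bits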

22
Expanded tree stumps for weather data
23
Decision tree for the weather data
24
Decision Tree Algorithms
  • The algorithm described in the preceding slides
    is known as ID3
  • Due to Quinlan (1986)
  • Tends to choose attributes with many values
  • Using information gain ratio helps solve this
    problem
  • Several more improvements have been made to
    handle numeric attributes (via univariate
    splits), missing values and noisy data (via
    pruning)
  • Resulting algorithm known as C4.5
  • Described by Quinlan (1993)
  • Widely used (as is the commercial version C5.0)
  • WEKA has a version called J4.8

25
Classification Trees
  • Described (along with regression trees) in
  • L. Breiman, J.H. Friedman, R.A. Olshen, and C.J.
    Stone, 1984. Classification and Regression Trees.
  • More sophisticated method than ID3
  • However Quinlan's (1993) C4.5 method caught up
    with CART in most areas
  • CART also incorporates methods for pruning,
    missing values and numeric attributes
  • Multivariate splits are possible, as well as
    univariate
  • Split on a linear combination: Σ cj xj > d
  • CART typically uses Gini measure of node purity
    to determine best splits
  • This is of the form Σ p(1 - p)
  • But information/entropy measure also available

26
Regression Trees
  • Trees can also be used to predict numeric
    attributes
  • Predict using average value of the response in
    the appropriate node
  • Implemented in CART and C4.5 frameworks
  • Can use a model at each node instead
  • Implemented in Weka's M5 algorithm
  • Harder to interpret than regression trees
  • Classification and regression trees are
    implemented in R's rpart package
  • See Ch 10 in Venables and Ripley, MASS 3rd Ed.

27
Problems with Trees
  • Can be unnecessarily verbose
  • Structure often unstable
  • Greedy hierarchical algorithm
  • Small variations can change chosen splits at high
    level nodes, which then changes subtree below
  • Conclusions about attribute importance can be
    unreliable
  • Direct methods tend to overfit the training
    dataset
  • This problem can be reduced by pruning the tree
  • Another approach that often works well is to fit
    the tree, remove all training cases that are not
    correctly predicted, and refit the tree on the
    reduced dataset
  • Typically gives a smaller tree
  • This usually works almost as well on the training
    data
  • But generalises better, e.g. works better on test
    data
  • Bagging the tree algorithm also gives more stable
    results
  • Will discuss bagging later

28
Classification Tree Example
  • Use Weka's J4.8 algorithm on German credit data
    (with default options)
  • 1000 instances, 21 attributes
  • Produces a pruned tree with 140 nodes, 103 leaves
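The run summarised on the next slide can be reproduced from the command line; a sketch, assuming the data is saved as credit-g.arff and the Weka classes are on the Java classpath (given -t alone, Weka reports a stratified 10-fold cross-validation by default):

    java weka.classifiers.j48.J48 -t credit-g.arff -C 0.25 -M 2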

29
  • Run information
  • Scheme: weka.classifiers.j48.J48 -C 0.25 -M 2
  • Relation: german_credit
  • Instances: 1000
  • Attributes: 21
  • Number of Leaves: 103
  • Size of the tree: 140
  • Stratified cross-validation
  • Summary
  • Correctly Classified Instances 739 (73.9%)
  • Incorrectly Classified Instances 261 (26.1%)
  • Kappa statistic 0.3153
  • Mean absolute error 0.3241
  • Root mean squared error 0.4604

30
Cross-Validation
  • Due to over-fitting, cannot estimate prediction
    error directly on the training dataset
  • Cross-validation is a simple and widely used
    method for estimating prediction error
  • Simple approach
  • Set aside a test dataset
  • Train learner on the remainder (the training
    dataset)
  • Estimate prediction error by using the resulting
    prediction model on the test dataset
  • This is only feasible where there is enough data
    to set aside a test dataset and still have enough
    to reliably train the learning algorithm

31
k-fold Cross-Validation
  • For smaller datasets, use k-fold cross-validation
  • Split dataset into k roughly equal parts
  • For each part, train on the other k-1 parts and
    use this part as the test dataset
  • Do this for each of the k parts, and average the
    resulting prediction errors
  • This method measures the prediction error when
    training the learner on a fraction (k-1)/k of the
    data
  • If k is small, this will overestimate the
    prediction error
  • k = 10 is usually enough

(Diagram: the dataset is divided into k blocks; each block in turn serves as the test set, with the remaining k - 1 blocks used for training)
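A minimal sketch of k-fold cross-validation in R with a tree learner; dat is a hypothetical data frame with factor response y:

    library(rpart)
    k    <- 10
    fold <- sample(rep(1:k, length.out = nrow(dat)))  # random fold labels
    errs <- sapply(1:k, function(i) {
      fit  <- rpart(y ~ ., data = dat[fold != i, ])   # train on k-1 parts
      pred <- predict(fit, dat[fold == i, ], type = "class")
      mean(pred != dat$y[fold == i])                  # error on held-out part
    })
    mean(errs)                                        # cross-validated estimate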
32
Regression Tree Example
  • library(rpart)   # also provides the car.test.frame data
  • data(car.test.frame)
  • z.auto <- rpart(Mileage ~ Weight, car.test.frame)
  • post(z.auto, file = "")
  • summary(z.auto)

33
(No Transcript)
34
  • Call:
  • rpart(formula = Mileage ~ Weight,
    data = car.test.frame)
  • n = 60
  •     CP        nsplit  rel error  xerror     xstd
  • 1 0.59534912    0     1.0000000  1.0322233  0.17981796
  • 2 0.13452819    1     0.4046509  0.6081645  0.11371656
  • 3 0.01282843    2     0.2701227  0.4557341  0.09178782
  • 4 0.01000000    3     0.2572943  0.4659556  0.09134201
  • Node number 1: 60 observations, complexity
    param = 0.5953491
  • mean = 24.58333, MSE = 22.57639
  • left son = 2 (45 obs), right son = 3 (15 obs)
  • Primary splits:
  • Weight < 2567.5 to the right,
    improve = 0.5953491, (0 missing)
  • Node number 2: 45 observations, complexity
    param = 0.1345282
  • mean = 22.46667, MSE = 8.026667
  • left son = 4 (22 obs), right son = 5 (23 obs)

35
  • Node number 3: 15 observations
  • mean = 30.93333, MSE = 12.46222
  • Node number 4: 22 observations
  • mean = 20.40909, MSE = 2.78719
  • Node number 5: 23 observations, complexity
    param = 0.01282843
  • mean = 24.43478, MSE = 5.115312
  • left son = 10 (15 obs), right son = 11 (8 obs)
  • Primary splits:
  • Weight < 2747.5 to the right,
    improve = 0.1476996, (0 missing)
  • Node number 10: 15 observations
  • mean = 23.8, MSE = 4.026667
  • Node number 11: 8 observations
  • mean = 25.625, MSE = 4.984375

36
Regression Tree Example (continued)
  • plotcp(z.auto)
  • z2.auto <- prune(z.auto, cp = 0.1)
  • post(z2.auto, file = "", cex = 1)

37
Complexity Parameter Plot
38
(No Transcript)
39
Pruned Regression Tree
40
Classification Methods
  • Project the attribute space into decision regions
  • Decision trees: piecewise constant approximation
  • Logistic regression: linear log-odds
    approximation
  • Discriminant analysis and neural nets: linear and
    non-linear separators
  • Density estimation coupled with a decision rule
  • E.g. Naïve Bayes
  • Define a metric space and decide based on
    proximity
  • One type of instance-based learning
  • K-nearest neighbour methods
  • IBk algorithm in Weka
  • Would like to drop noisy and unnecessary points
  • Simple algorithm based on success rate confidence
    intervals available in Weka
  • Compares naïve prediction with predictions using
    that instance
  • Must choose suitable acceptance and rejection
    confidence levels
  • Many of these approaches can produce probability
    distributions as well as predictions
  • Depending on the application, this information
    may be useful
  • Such as when results reported to expert (e.g.
    loan officer) as input to their decision
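For example, k-nearest-neighbour classification is available in R's class package (listed earlier); a sketch using the built-in iris data purely for illustration:

    library(class)
    data(iris)
    set.seed(1)
    train <- sample(nrow(iris), 100)          # 100 training cases
    pred  <- knn(train = iris[train, 1:4],    # the four numeric attributes
                 test  = iris[-train, 1:4],
                 cl    = iris$Species[train], # known classes of training set
                 k     = 3)                   # use 3 nearest neighbours
    mean(pred == iris$Species[-train])        # accuracy on held-out cases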

41
Numeric Prediction Methods
  • Linear regression
  • Splines, including smoothing splines and
    multivariate adaptive regression splines (MARS)
  • Generalised additive models (GAM)
  • Locally weighted regression (lowess, loess)
  • Regression and Model Trees
  • CART, C4.5, M5
  • Artificial neural networks (ANNs)

42
Artificial Neural Networks (ANNs)
  • An ANN is a network of many simple processors (or
    units) that are connected by communication
    channels carrying numeric data
  • ANNs are very flexible, encompassing nonlinear
    regression models, discriminant models, and data
    reduction models
  • They do require some expertise to set up
  • An appropriate architecture needs to be selected
    and tuned for each application
  • They can be useful tools for learning from
    examples to find patterns in data and predict
    outputs
  • However on their own, they tend to overfit the
    training data
  • Meta-learning tools are needed to choose the best
    fit
  • Various network architectures in common use
  • Multilayer perceptron (MLP)
  • Radial basis functions (RBF)
  • Self-organising maps (SOM)
  • ANNs have been applied to data editing and
    imputation, but not widely
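For illustration, a single-hidden-layer network fitted with R's nnet package (listed earlier), on the car.test.frame data used later in the regression tree example; scaling the input is one of the setup choices such networks require:

    library(nnet)
    library(rpart)                      # provides the car.test.frame data
    data(car.test.frame)
    car.test.frame$WeightS <- scale(car.test.frame$Weight)  # scale the input
    set.seed(1)                         # fit depends on random start weights
    fit <- nnet(Mileage ~ WeightS, data = car.test.frame,
                size = 3,               # 3 hidden units
                linout = TRUE,          # linear output for numeric prediction
                decay = 0.01, maxit = 500)
    predict(fit, car.test.frame)[1:5]   # fitted mileages for first five cars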

43
Meta-Learning Methods - Bagging
  • General methods for improving the performance of
    most learning algorithms
  • Bootstrap aggregation, bagging for short
  • Select B bootstrap samples from the data
  • Selected with replacement, same number of
    instances
  • Can use parametric or non-parametric bootstrap
  • Fit the model/learner on each bootstrap sample
  • The bagged estimate is the average prediction
    from all these B models
  • E.g. for a tree learner, the bagged estimate is
    the average prediction from the resulting B trees
  • Note that this is not a tree
  • In general, bagging a model or learner does not
    produce a model or learner of the same form
  • Bagging reduces the variance of unstable
    procedures like regression trees, and can greatly
    improve prediction accuracy
  • However it does not always work for poor 0-1
    predictors
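A minimal sketch of bagging a classification tree in R by majority vote; dat and its factor response y are placeholders:

    library(rpart)
    bag_tree <- function(dat, newdata, B = 50) {
      votes <- replicate(B, {
        boot <- dat[sample(nrow(dat), replace = TRUE), ]  # bootstrap sample
        fit  <- rpart(y ~ ., data = boot)
        as.character(predict(fit, newdata, type = "class"))
      })
      # majority vote over the B trees; average instead for numeric response
      apply(votes, 1, function(v) names(which.max(table(v))))
    }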

44
Meta-Learning Methods - Boosting
  • Boosting is a powerful technique for improving
    accuracy
  • The AdaBoost.M1 method (for classifiers)
  • Give each instance an initial weight of 1/n
  • For m = 1 to M:
  • Fit a model using the current weights, and store
    the resulting model Gm
  • If the prediction error rate err is zero or
    > 0.5, terminate the loop
  • Otherwise calculate am = log((1 - err)/err)
  • This is the log odds of success
  • Then adjust the weights of incorrectly classified
    cases by multiplying them by exp(am), and repeat
  • Predict using a weighted majority vote Σ am Gm(x),
    where Gm(x) is the prediction from model m
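A sketch of AdaBoost.M1 in R, with rpart stumps (maxdepth = 1) as the weak learner; dat is a hypothetical data frame with a two-class factor response y:

    library(rpart)

    ada_boost <- function(dat, M = 100) {
      n <- nrow(dat)
      w <- rep(1/n, n)                        # initial weight 1/n per instance
      models <- list(); alpha <- numeric(0)
      for (m in 1:M) {
        fit  <- rpart(y ~ ., data = dat, weights = w,
                      control = rpart.control(maxdepth = 1))  # decision stump
        pred <- predict(fit, dat, type = "class")
        err  <- sum(w * (pred != dat$y)) / sum(w)
        if (err == 0 || err > 0.5) break      # stopping rule from the slide
        alpha[m] <- log((1 - err) / err)      # am: log odds of success
        models[[m]] <- fit
        w <- w * exp(alpha[m] * (pred != dat$y))  # up-weight the mistakes
      }
      list(models = models, alpha = alpha)
    }

    # Weighted majority vote over the stored models
    ada_predict <- function(obj, newdata) {
      votes <- sapply(obj$models, function(fm)
        as.character(predict(fm, newdata, type = "class")))
      apply(votes, 1, function(v)
        names(which.max(tapply(obj$alpha, v, sum))))
    }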

45
Meta-Learning Methods - Boosting
  • For example, for the German credit dataset
  • using 100 iterations of AdaBoost.M1 with the
    DecisionStump algorithm,
  • 10-fold cross-validation gives an error rate of
    24.9% (compared to 26.1% for J4.8)

46
Association Rules
  • Data on n purchase baskets in the form (id, item1,
    item2, ..., itemk)
  • For example, purchases from a supermarket
  • Association rules are statements of the form
  • "When people buy tea, they also often buy
    coffee."
  • May be useful for product placement decisions or
    cross-selling recommendations
  • We say there is an association rule i1 -> i2 if:
  • i1 and i2 occur together in at least s% of the n
    baskets (the support)
  • And at least c% of the baskets containing item i1
    also contain i2 (the confidence)
  • The confidence criterion ensures that "often" is
    a large enough proportion of the antecedent cases
    to be interesting
  • The support criterion should be large enough that
    the resulting rules have practical importance
  • Also helps to ensure reliability of the
    conclusions
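In R, Apriori is implemented in the arules package (an assumption beyond the slide, which does not name R packages); a sketch on toy baskets:

    library(arules)
    baskets <- list(c("tea", "coffee", "milk"),
                    c("tea", "coffee"),
                    c("bread", "milk"),
                    c("tea", "coffee", "bread"))
    trans <- as(baskets, "transactions")
    rules <- apriori(trans, parameter = list(support = 0.5,
                                             confidence = 0.75))
    inspect(sort(rules, by = "lift"))   # e.g. {tea} => {coffee}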

47
Association rules
  • The support/confidence approach is widely used
  • Efficiently implemented in the Apriori algorithm
  • First identify item sets with sufficient support
  • Then turn each item set into sets of rules with
    sufficient confidence
  • This method was originally developed in the
    database community, so there has been a focus on
    efficient methods for large databases
  • Large means up to around 100 million instances,
    and about ten thousand binary attributes
  • However this approach can find a vast number of
    rules, and it can be difficult to make sense of
    these
  • One useful extension is to identify only the
    rules with high enough lift (or odds ratio)

48
Classification vs Association Rules
  • Classification rules predict the value of a
    pre-specified attribute, e.g.
  • If outlook = sunny and humidity = high then
    play = no
  • Association rules predict the value of an
    arbitrary attribute (or combination of
    attributes), e.g.
  • If temperature = cool then humidity = normal
  • If humidity = normal and play = no then
    windy = true
  • If temperature = high and humidity = high then
    play = no

49
Clustering EM Algorithm
  • Assume that the data is from a mixture of normal
    distributions
  • I.e. one normal component for each cluster
  • For simplicity, consider one attribute x and two
    components or clusters
  • Model has five parameters: (p, μ1, σ1, μ2, σ2)
  • Log-likelihood (with φ the normal density):
    ℓ = Σi log[ p φ(xi; μ1, σ1) +
    (1 - p) φ(xi; μ2, σ2) ]
  • This is hard to maximise directly
  • Use the expectation-maximisation (EM) algorithm
    instead

50
Clustering EM Algorithm
  • Think of data as being augmented by a latent 0/1
    variable di indicating membership of cluster 1
  • If the values of this variable were known, the
    log-likelihood would be
    ℓ = Σi [ di log(p φ(xi; μ1, σ1)) +
    (1 - di) log((1 - p) φ(xi; μ2, σ2)) ]
  • Starting with initial values for the parameters,
    calculate the expected value of di:
    E[di] = p φ(xi; μ1, σ1) / [ p φ(xi; μ1, σ1) +
    (1 - p) φ(xi; μ2, σ2) ]
  • Then substitute this into the above
    log-likelihood and maximise to obtain new
    parameter values
  • This will have increased the log-likelihood
  • Repeat until the log-likelihood converges
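A minimal sketch of these two steps in R for a two-component univariate mixture; starting values are deliberately crude (see the next slide on restarts):

    # EM for a two-component normal mixture of a single attribute x
    em_mix2 <- function(x, p = 0.5, mu = range(x), s = rep(sd(x), 2),
                        tol = 1e-8) {
      loglik <- function() sum(log(p * dnorm(x, mu[1], s[1]) +
                                   (1 - p) * dnorm(x, mu[2], s[2])))
      ll <- loglik()
      repeat {
        # E-step: expected membership of component 1, given current parameters
        d1 <- p * dnorm(x, mu[1], s[1])
        e  <- d1 / (d1 + (1 - p) * dnorm(x, mu[2], s[2]))
        # M-step: maximise the expected complete-data log-likelihood
        p  <- mean(e)
        mu <- c(weighted.mean(x, e), weighted.mean(x, 1 - e))
        s  <- c(sqrt(sum(e * (x - mu[1])^2) / sum(e)),
                sqrt(sum((1 - e) * (x - mu[2])^2) / sum(1 - e)))
        ll_new <- loglik()
        if (ll_new - ll < tol) break   # stop once log-likelihood converges
        ll <- ll_new
      }
      list(p = p, mu = mu, s = s, posterior = e)
    }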

51
Clustering EM Algorithm
  • Resulting estimates may only be a local maximum
  • Run several times with different starting points
    to find global maximum (hopefully)
  • With parameter estimates, can calculate segment
    membership probabilities for each case

52
Clustering EM Algorithm
  • Extending to more latent classes is easy
  • Information criteria such as AIC and BIC are
    often used to decide how many are appropriate
  • Extending to multiple attributes is easy if we
    assume they are independent, at least
    conditioning on segment membership
  • It is possible to introduce associations, but
    this can rapidly increase the number of
    parameters required
  • Nominal attributes can be accommodated by
    allowing different discrete distributions in each
    latent class, and assuming conditional
    independence between attributes
  • Can extend this approach to handle joint
    clustering and prediction models, as mentioned in
    the MVA lectures

53
Clustering - Scalability Issues
  • k-means algorithm is also widely used
  • However this and the EM-algorithm are slow on
    large databases
  • So is hierarchical clustering, which requires
    O(n²) time
  • Iterative clustering methods require full DB scan
    at each iteration
  • Scalable clustering algorithms are an area of
    active research
  • A few recent algorithms
  • Distance-based/k-Means
  • Multi-Resolution kd-Tree for K-Means [PM99]
  • CLIQUE [AGGR98]
  • Scalable K-Means [BFR98a]
  • CLARANS [NH94]
  • Probabilistic/EM
  • Multi-Resolution kd-Tree for EM [Moore99]
  • Scalable EM [BFR98b]
  • CF Kernel Density Estimation [ZRL99]

54
Ethics of Data Mining
  • Data mining and data warehousing raise ethical
    and legal issues
  • Combining information via data warehousing could
    violate Privacy Act
  • Must tell people how their information will be
    used when the data is obtained
  • Data mining raises ethical issues mainly during
    application of results
  • E.g. using ethnicity as a factor in loan approval
    decisions
  • E.g. screening job applications based on age or
    sex (where not directly relevant)
  • E.g. declining insurance coverage based on
    neighbourhood if this is related to race
    (red-lining is illegal in much of the US)
  • Whether something is ethical depends on the
    application
  • E.g. probably ethical to use ethnicity to
    diagnose and choose treatments for a medical
    problem, but not to decline medical insurance

55
(No Transcript)