Title: Classification and Supervised Learning
1. Classification and Supervised Learning
- Credits
- Hand, Mannila and Smyth
- Cook and Swayne
- Padhraic Smyth's notes
- Shawndra Hill's notes
2. Outline
- Supervised Learning Overview
- Linear discriminant analysis
- Tree models
- Probability-based and Bayes models
3. Classification
- Classification, or supervised learning
- prediction for a categorical response
- for a binary (T/F) response, can be used as an alternative to logistic regression
- the response is often a quantized real value or a non-scaled numeric
- can be used with categorical predictors
- handles missing data well
- can be a response in itself!
- methods for fitting can be
- parametric
- algorithmic
4.
- Because labels are known, you can build parametric models for the classes
- Can also define decision regions and decision boundaries
5. Examples of classifiers
- Generative / class-conditional / probabilistic, based on p(x | ck)
- Naïve Bayes (simple, but often effective in high dimensions)
- Parametric generative models, e.g., Gaussian
- Linear discriminant analysis
- Regression-based, based on p(ck | x)
- Logistic regression: simple, linear in odds space
- Neural network: non-linear extension of logistic regression
- Discriminative models, which focus on locating optimal decision boundaries
- Decision trees: swiss army knife, often effective in high dimensions
- Linear discriminants
- Support vector machines (SVM): generalization of linear discriminants, can be quite effective, though computational complexity is an issue
- Nearest neighbor: simple, but can scale poorly in high dimensions
6. Evaluation of Classifiers
- Already seen some of this
- Assume the output is a probability vector over the classes
- Classification error
- P(true Y ≠ predicted Y)
- ROC area
- area under the ROC curve
- Top-k analysis
- sometimes all you care about is how well you can do at the top of the list
- plan A: top 50 candidates have 44 sales, top 500 have 300 sales
- plan B: top 50 have 48 sales, top 500 have 270 sales
- which do you choose?
- often used with imbalanced class distributions, where good classification error is easy! (fraud, etc.)
- Calibration is sometimes important
- if you say something has a 90% chance, does it?
(A small R sketch of these measures follows this slide.)
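A minimal R sketch of these measures, assuming a hypothetical vector p of predicted probabilities for the positive class and a 0/1 label vector y; the ROC area is computed from the rank-sum identity rather than from a package:

mean((p > 0.5) != y)                      # classification error at a 0.5 threshold
r <- rank(p)                              # AUC = P(random positive outranks random negative)
(sum(r[y == 1]) - sum(y) * (sum(y) + 1) / 2) / (sum(y) * sum(1 - y))
k <- 50                                   # top-k analysis: successes among the
sum(y[order(p, decreasing = TRUE)][1:k])  # k highest-scored cases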
7. Linear Discriminant Analysis
- LDA: parametric classification
- Fisher (1936), Rao (1948)
- finds a linear combination of variables separating two classes, by comparing the difference between the class means with the variance within each class
- assumes a multivariate normal distribution for each class (cluster)
- Pros
- easy to define likelihood
- easy to define the boundary
- easy to measure goodness of fit
- easy interpretation
- Cons
- very rare for data to come close to multivariate normality!
- works only on numeric predictors
8.
- Painters data: 54 painters rated on a score of 0-20 for composition, drawing, colour and expression, and classified into 8 schools

                Composition Drawing Colour Expression School
Da Udine                 10       8     16          3      A
Da Vinci                 15      16      4         14      A
Del Piombo                8      13     16          7      A
Del Sarto                12      16      9          8      A
Fr. Penni                 0      15      8          0      A
Guilio Romano            15      16      4         14      A
Michelangelo              8      17      4          8      A
Perino del Vaga          15      16      7          6      A
Perugino                  4      12     10          4      A
Raphael                  17      18     12         18      A
library(MASS)
lda1 <- lda(School ~ ., data = painters)
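Slides 9-10 were figure-only (no transcript). One way to reproduce that kind of view is MASS's plot method for lda fits, a minimal sketch that projects the data onto the fitted discriminant directions:

plot(lda1)   # data plotted on the first linear discriminants, by school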
11. LDA - predictions
- to check how good the model is, you can see how
well it predicts what actually happened
> predict(lda1)$class
 [1] D H D A A H A C A A A A A C A B B E C C B E D D D D G D D D D D E D G H
     E E E F G A F D G A G G E
[50] G C H H H
Levels: A B C D E F G H

> predict(lda1)$posterior
                      A            B            C            D           E            F
Da Udine   0.0153311094 0.0059952857 0.0105980288 6.717937e-01 0.124938731 2.913817e-03
Da Vinci   0.1023448947 0.1963312180 0.1155149000 4.444461e-05 0.016182391 1.942920e-02
Del Piombo 0.1763906259 0.0142589568 0.0064792116 6.351212e-01 0.102924883 9.080713e-03
Del Sarto  0.4549047647 0.2079127774 0.1459033415 2.166203e-02 0.146171796 3.716302e-03

> table(predict(lda1)$class, painters$Sch)

    A B C D E F G H
  A 5 4 0 0 0 1 1 0
  B 0 1 2 0 0 0 0 0
  C 1 1 2 0 0 0 0 1
  D 2 0 0 9 1 0 1 0
  E 0 0 2 0 4 0 1 0
  F 0 0 0 0 0 2 0 0
  G 0 0 0 1 1 1 4 0
  H 2 0 0 0 1 0 0 3
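From the confusion table above, the training accuracy is the sum of the diagonal over the total (30 of 54 here):

tab <- table(predict(lda1)$class, painters$School)
sum(diag(tab)) / sum(tab)   # about 0.56: better than chance, far from perfect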
12. Classification (Decision) Trees
- Trees are one of the most popular and useful of all data mining models
- Algorithmic version of classification
- no distributional assumptions
- Competing algorithms: CART, C4.5, DBMiner
- Pros
- no distributional assumptions
- can handle real and nominal inputs
- speed and scalability
- robustness to outliers and missing values
- interpretability
- compactness of classification rules
- Cons
- interpretability?
- several tuning parameters to set, with little guidance
- decision boundary is discontinuous
13.-17. Decision Tree Example
(figure sequence across five slides: a tree is grown on two variables, Income and Debt, one split at a time: first Income > t1, then Debt > t2, then Income > t3; each slide shows the growing tree alongside the corresponding rectangular decision regions in the Income-Debt plane)
Note: tree boundaries are piecewise linear and axis-parallel.
18. Example: Titanic Data
- On the Titanic
- 1313 passengers
- 34% survived
- was it a random sample?
- or did survival depend on features of the individual?
- sex
- age
- class

  pclass survived                                            name     age    embarked    sex
1    1st        1                    Allen, Miss Elisabeth Walton 29.0000 Southampton female
2    1st        0                     Allison, Miss Helen Loraine  2.0000 Southampton female
3    1st        0             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton   male
4    1st        0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton female
5    1st        1                   Allison, Master Hudson Trevor  0.9167 Southampton   male
6    2nd        1                              Anderson, Mr Harry 47.0000 Southampton   male
19. Decision trees
- At the first split, decide which variable creates the best separation between the survivor and non-survivor cases
(figure: the first split, on sex, with the Female branch shown)
- Goodness of split is determined by the "purity" of the leaves
20. Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- the tree is constructed in a top-down, recursive, divide-and-conquer manner
- at the start, all the training examples are at the root
- examples are partitioned recursively to create pure subgroups
- purity is measured by information gain, Gini index, entropy, etc.
- Conditions for stopping partitioning
- all samples for a given node belong to the same class
- all leaf nodes are smaller than a specified threshold
- BUT building too big a tree will overfit the data, and it will predict poorly
- Predictions
- each leaf carries class probability estimates (CPE), based on the training data that ended up in that leaf
- majority voting is employed for classifying all members of the leaf (see the sketch below)
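A minimal induction sketch in R with rpart; the data frame titanic and its column names are assumed to match the data shown on slide 18:

library(rpart)
fit <- rpart(factor(survived) ~ sex + age + pclass, data = titanic,
             method = "class")        # greedy, top-down recursive partitioning
fit                                   # prints the splits and each leaf's CPE
predict(fit, type = "prob")[1:5, ]    # class probability estimates (CPE)
predict(fit, type = "class")[1:5]     # majority-vote labels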
21. Purity in tree building
- Why do we care about pure subgroups?
- purity of the subgroup gives us confidence that
new cases that fall into this leaf have a given
label
22. Purity measures
- If a data set T contains examples from n classes, the Gini index gini(T) is defined as
  gini(T) = 1 - (p1^2 + p2^2 + ... + pn^2)
  where pj is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, the Gini index of the split data is
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- For the Titanic split on sex: 850/1313 x (1 - 0.16^2 - 0.84^2) + 463/1313 x (1 - 0.66^2 - 0.34^2) ≈ 0.33 (see the sketch below)
- The attribute providing the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
- Another often-used measure: entropy
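The Titanic calculation above as a small R sketch (the class proportions, 0.16/0.84 for the 850 males and 0.66/0.34 for the 463 females, are the slide's numbers):

gini <- function(p) 1 - sum(p^2)     # p: vector of class proportions
(850/1313) * gini(c(0.16, 0.84)) +   # weighted Gini across the two children
(463/1313) * gini(c(0.66, 0.34))     # about 0.33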
23. Calculating Information Gain
Information Gain = Impurity(parent) - Impurity(children)
(figure: the entire population, 30 instances, splits on Balance > 50K / Balance < 50K into children of 17 and 13 instances)
(Weighted) average impurity of children = 0.615
Information Gain = Entropy(parent) - Entropy(children) = 0.996 - 0.615 = 0.38 (see the sketch below)
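A sketch reproducing these numbers; the class counts (14 of 30 at the parent, 13 of 17 and 1 of 13 in the children) are hypothetical values chosen to match the impurities on the slide:

entropy <- function(k, n) {      # k cases of one class out of n
  p <- c(k, n - k) / n
  p <- p[p > 0]                  # 0 log 0 = 0 by convention
  -sum(p * log2(p))
}
parent   <- entropy(14, 30)                                       # 0.996
children <- (17/30) * entropy(13, 17) + (13/30) * entropy(1, 13)  # 0.615
parent - children                                                 # gain = 0.38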
24. Information Gain
Information Gain = Impurity(parent) - Impurity(children)
(figure: node A, the entire population, splits on Balance > 50K / Balance < 50K into nodes B and C; node B then splits on Age > 45 / Age < 45 into nodes D and E; leaves are labeled Bad risk (Default) or Good risk (Not default))
Impurity(A) = 0.996
Impurity(B) = 0.787, Impurity(C) = 0.39; weighted Impurity(B,C) = 0.61
Impurity(D) = -0 log2(0) - 1 log2(1) = 0
Impurity(E) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985; weighted Impurity(D,E) = 0.405
Gains shown: Gain = 0.38 and Gain = 0.205
25. Information Gain
- At each node, choose first the attribute that obtains the maximum information gain, i.e., provides the maximum information
(figure: the same tree as on the previous slide: the entire population A splits on Balance into B and C, and B splits on Age into D and E; Impurity(A) = 0.996, weighted Impurity(B,C) = 0.61, weighted Impurity(D,E) = 0.405, gains 0.38 and 0.205; leaves labeled Bad risk (Default) / Good risk (Not default))
26. Avoiding Overfitting in Classification
- The generated tree may overfit the training data
- too many branches, some of which may reflect anomalies due to noise or outliers
- the result is poor accuracy on unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- difficult to choose an appropriate threshold
- Postpruning: remove branches from a "fully grown" tree, to get a sequence of progressively pruned trees (see the sketch below)
- use a set of data different from the training data to decide which is the best pruned tree
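A post-pruning sketch with rpart (one possible implementation, not necessarily what the slides used); titanic is the assumed data frame from before:

library(rpart)
big <- rpart(factor(survived) ~ sex + age + pclass, data = titanic,
             method = "class", cp = 0)   # cp = 0: grow a deliberately large tree
printcp(big)                             # cross-validated error for each subtree
cp_best <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned <- prune(big, cp = cp_best)       # best of the nested pruned trees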
27. Which attribute to split over?
- Brute-force search
- at each node, examine splits over each of the attributes
- select the attribute for which the maximum information gain is obtained
(figure: a candidate split on Balance, with branches > 50K and < 50K)
28. Finding the right size
- Use a hold-out sample (n-fold cross-validation)
- overfit a tree, with many leaves
- snip the tree back, and use the hold-out sample for prediction; calculate the predictive error
- record the error rate for each tree size
- repeat for n folds
- plot the average error rate as a function of tree size
- fit the optimal tree size to the entire data set
R note: can use cv.tree() from the tree package (see the sketch below)
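A sketch of this procedure with cv.tree(); a data frame oils holding the factor response area and the fatty-acid measurements (the olive oil data introduced on the next slide) is assumed:

library(tree)
fit <- tree(area ~ ., data = oils)                 # overfit a large tree
cv  <- cv.tree(fit, FUN = prune.misclass, K = 10)  # 10-fold CV over tree sizes
plot(cv$size, cv$dev, type = "b")                  # error vs. number of leaves
best.size <- cv$size[which.min(cv$dev)]
pruned <- prune.misclass(fit, best = best.size)    # refit at the optimal size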
29. Olive oil data
- classification of Italian olive oils by their components
- 9 areas, from 3 regions

               X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic
1 1.North-Apulia      1    1     1075          75     226  7823      672        36        60
2 2.North-Apulia      1    1     1088          73     224  7709      781        31        61
3 3.North-Apulia      1    1      911          54     246  8113      549        31        63
4 4.North-Apulia      1    1      966          57     240  7952      619        50        78
5 5.North-Apulia      1    1     1051          67     259  7771      672        50        80
6 6.North-Apulia      1    1      911          49     268  7924      678        51        70
33. Regression Trees
- Trees can also be used for regression, when the response is real-valued
- the leaf prediction is the mean value, instead of class probability estimates (CPE)
- helpful with categorical predictors
34. Tips data
35. Treating Missing Data in Trees
- Missing values are common in practice
- Approaches to handling missing values
- during training
- ignore rows with missing values (inefficient)
- during testing
- send the example being classified down both branches and average the predictions
- replace missing values with an imputed value
- Other approaches
- treat "missing" as a unique value (useful if missing values are correlated with the class)
- surrogate splits method
- search for and store surrogate variables/splits during training
36. Other Issues with Classification Trees
- Can use non-binary splits
- multi-way
- linear combinations
- these tend to increase complexity substantially, and don't improve performance
- binary splits are interpretable, even by non-experts
- easy to compute and visualize
- Model instability
- a small change in the data can lead to a completely different tree
- model averaging techniques (like bagging) can be useful
- Restricted to splits along coordinate axes
- Discontinuities in prediction space
37. Why Trees are Widely Used in Practice
- Can handle high-dimensional data
- builds a model using one dimension at a time
- Can handle any type of input variable
- categorical, real-valued, etc.
- Invariant to monotonic transformations of the input variables
- e.g., using x, 10x + 2, log(x), or 2x will not change the tree
- so scaling is not a factor, and the user can be sloppy!
- Trees are (somewhat) interpretable
- a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test
38. Limitations of Trees
- Representational bias
- classification: piecewise linear boundaries, parallel to the axes
- regression: piecewise constant surfaces
- High variance
- trees can be unstable as a function of the sample
- e.g., a small change in the data -> a completely different tree
- this causes two problems
- 1. high variance contributes to prediction error
- 2. high variance reduces interpretability
- Trees are good candidates for model combining
- often used with boosting and bagging
39. Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! Trees lack stability against small perturbations of the data.
Figure from Duda, Hart & Stork, Chap. 8
40. Random Forests
- Another con for trees
- trees are sensitive to the primary split, which can lead the tree in inappropriate directions
- one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
- Solution
- random forests: an ensemble of unpruned decision trees
- each tree is built on a random subset of the training data
- at each split point, only a random subset of the predictors is considered
- many parameters to fiddle with!
- prediction is simply the majority vote of the trees (or the mean prediction of the trees)
- has the advantages of trees, with more robustness and a smoother decision rule (see the sketch below)
- also, they are trendy!
41. Other Models: k-NN
- k-Nearest Neighbors (kNN)
- To classify a new point
- find the kth nearest neighbor from the training set
- look at the circle of radius r that includes this point
- what is the class distribution within this circle?
- Advantages
- simple to understand
- simple to implement
- Disadvantages
- what is k?
- k = 1: high variance, sensitive to the data
- large k: robust, reduces variance, but blends everything together, including far-away points
- what is "near"?
- Euclidean distance assumes all inputs are equally important
- how do you deal with categorical data?
- no interpretable model
- Best to use cross-validation and visualization techniques to pick k (see the sketch below).
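Picking k by leave-one-out cross-validation, sketched with the class package; X (a numeric predictor matrix) and y (class labels) are hypothetical:

library(class)
X <- scale(X)        # put inputs on a common scale before using Euclidean distance
cv.err <- sapply(1:15, function(k)
  mean(knn.cv(train = X, cl = y, k = k) != y))   # leave-one-out error for each k
plot(1:15, cv.err, type = "b", xlab = "k", ylab = "CV error rate")
which.min(cv.err)    # candidate value of k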
42. Probabilistic (Bayesian) Models for Classification
If you belong to class ck, you have a distribution p(x | ck) over input vectors.
Then, given priors p(ck), we can get the posterior distribution on classes:
p(ck | x) ∝ p(x | ck) p(ck)
At each point in the x space we have a predicted class vector, allowing for decision boundaries.
43.-44. Example of Probabilistic Classification
(figures: the class-conditional densities p(x | c1) and p(x | c2), and below them the posterior p(c1 | x) on a 0 / 0.5 / 1 scale)
45. Decision Regions and Bayes Error Rate
(figure: p(x | c1) and p(x | c2) overlaid; the x axis is partitioned into alternating regions labeled Class c2, Class c1, Class c2, Class c1, Class c2)
Optimal decision regions: regions where one class is more likely.
Optimal decision regions => optimal decision boundaries.
46. Decision Regions and Bayes Error Rate
(same figure, with the overlap between the class-conditional densities shaded)
Optimal decision regions: regions where one class is more likely.
Optimal decision regions => optimal decision boundaries.
Bayes error rate: the fraction of examples misclassified by the optimal classifier (the shaded area above). If max_k p(ck | x) = 1 everywhere, then there is no error. Hence
p(error) = 1 - Ex[ max_k p(ck | x) ]
47. Procedure for the optimal Bayes classifier
- For each class, learn a model p(x | ck)
- e.g., each class is multivariate Gaussian with its own mean and covariance
- Use Bayes' rule to obtain p(ck | x)
- => this yields the optimal decision regions/boundaries
- => use these decision regions/boundaries for classification
- Correct in theory, but practical problems include:
- how do we model p(x | ck)?
- even if we know the model for p(x | ck), modeling a distribution or density is very difficult in high dimensions (e.g., p = 100)
- Alternative approach: model the decision boundaries directly (a sketch of the generative recipe follows below)
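In R, the Gaussian version of this generative recipe is what MASS's qda() implements; a sketch on a hypothetical data frame d with factor response y:

library(MASS)
fit <- qda(y ~ ., data = d)                    # one Gaussian (mean, covariance) per class
post <- predict(fit, newdata = d)$posterior    # p(ck | x) via Bayes' rule
pred <- predict(fit, newdata = d)$class        # class with the largest posterior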
48. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
49. Naïve Bayes Classifiers
- Generative probabilistic model with a conditional independence assumption on p(x | ck), i.e.
  p(x | ck) = Πj p(xj | ck)
- Typically used with nominal variables
- real-valued variables are discretized to create nominal versions
- Comments
- simple to train (just estimate the conditional probabilities for each feature-class pair; see the sketch below)
- often works surprisingly well in practice
- e.g., state of the art for text classification, and the basis of many widely used spam filters
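A naïve Bayes sketch with the e1071 package; a data frame tennis with factor columns matching the play-tennis example two slides ahead is assumed:

library(e1071)
nb <- naiveBayes(play ~ outlook + temperature + humidity + windy, data = tennis)
new <- data.frame(outlook = "rain", temperature = "hot",
                  humidity = "high", windy = "false")
predict(nb, new, type = "raw")     # posterior class probabilities
predict(nb, new, type = "class")   # most probable class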
50. Naïve Bayes
- When all variables are categorical, classification should be easy (since all possible x's can be enumerated)
But remember the curse of dimensionality!
51. Naïve Bayes Classification
Recall: p(ck | x) ∝ p(x | ck) p(ck). Now assume the variables are conditionally independent given the classes:
p(x | ck) = Πj p(xj | ck)
- Is this a valid assumption? Probably not, but it may still be useful
- example: symptoms and diseases
52. Naïve Bayes
Under independence, the estimate of the probability that a point x belongs to ck is
p(ck | x) ∝ p(ck) Πj p(xj | ck)
With two classes, the log-odds decompose into additive "weights of evidence":
log[ p(c1 | x) / p(c2 | x) ] = log[ p(c1) / p(c2) ] + Σj log[ p(xj | c1) / p(xj | c2) ]
53. Play-tennis example: estimating P(xi | C)

outlook:
  P(sunny | y)    = 2/9    P(sunny | n)    = 3/5
  P(overcast | y) = 4/9    P(overcast | n) = 0
  P(rain | y)     = 3/9    P(rain | n)     = 2/5
temperature:
  P(hot | y)  = 2/9    P(hot | n)  = 2/5
  P(mild | y) = 4/9    P(mild | n) = 2/5
  P(cool | y) = 3/9    P(cool | n) = 1/5
humidity:
  P(high | y)   = 3/9    P(high | n)   = 4/5
  P(normal | y) = 6/9    P(normal | n) = 2/5
windy:
  P(true | y)  = 3/9    P(true | n)  = 3/5
  P(false | y) = 6/9    P(false | n) = 2/5

P(y) = 9/14
P(n) = 5/14
54. Play-tennis example: classifying X
- An unseen sample X = (rain, hot, high, false)
- P(X | y) P(y) = P(rain | y) P(hot | y) P(high | y) P(false | y) P(y) = 3/9 x 2/9 x 3/9 x 6/9 x 9/14 = 0.010582
- P(X | n) P(n) = P(rain | n) P(hot | n) P(high | n) P(false | n) P(n) = 2/5 x 2/5 x 4/5 x 2/5 x 5/14 = 0.018286
- Sample X is classified in class n (you'll lose!) (see the numeric check below)
55. The independence hypothesis
- makes computation possible
- yields optimal classifiers when satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated
- yet, empirically, naïve Bayes performs really well in practice
56. Lab 5
- Olive Oil Data
- from the Cook and Swayne book
- consists of the composition of fatty acids found in the lipid fraction of Italian olive oils; the study was done to determine the authenticity of olive oils
- region (North, South, and Sardinia)
- area (nine areas)
- 9 fatty acids, in %s
57. Lab 5
- Spam Data
- collected at Iowa State University in 2003 (Cook and Swayne)
- 2171 cases
- 21 variables
- be careful: 3 variables (spampct, category, and spam) were determined by spam models, so do not use these for fitting!
- Goal: determine spam from valid mail