Title: Supervised Learning
1. Supervised Learning
2. Introduction
- Key idea
- Known target concept (predict certain attribute)
- Find out how other attributes can be used
- Algorithms
- Rudimentary Rules (e.g., 1R)
- Statistical Modeling (e.g., Naïve Bayes)
- Divide and Conquer (e.g., Decision Trees)
- Instance-Based Learning
- Neural Networks
- Support Vector Machines
3. 1-Rule (1R)
- Generate a one-level decision tree
- One attribute
- Performs quite well!
- Basic idea
- Rules testing a single attribute
- Classify according to frequency in training data
- Evaluate error rate for each attribute
- Choose the best attribute
- That's all, folks!
4. The Weather Data (again)
5. Apply 1R
  Attribute      Rule              Errors   Total errors
  1 outlook      sunny → no        2/5      4/14
                 overcast → yes    0/4
                 rainy → yes       2/5
  2 temperature  hot → no          2/4      5/14
                 mild → yes        2/6
                 cool → yes        1/4
  3 humidity     high → no         3/7      4/14
                 normal → yes      1/7
  4 windy        false → yes       2/8      5/14
                 true → no         3/6
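The whole procedure fits in a few lines of Python. The sketch below is illustrative only (the function and variable names are mine, not from the lecture); ties between attributes are broken by attribute order:

from collections import Counter, defaultdict

def one_r(rows, labels):
    """1R: for each attribute, build one rule per value (the majority class),
    count training errors, and keep the attribute with the fewest errors."""
    best = None
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)        # attribute value -> class counts
        for row, y in zip(rows, labels):
            by_value[row[a]][y] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best                                # (errors, attribute index, rules)

# On the weather data (attribute 0 = outlook), this yields 4/14 errors with
# rules {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}.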
6. Other Features
- Numeric Values
- Discretization
- Sort training data
- Split range into categories
- Missing Values
- Dummy attribute
7. Naïve Bayes Classifier
- Allow all attributes to contribute equally
- Assumes
- All attributes equally important
- All attributes independent
- Realistic?
- Selection of attributes
8. Bayes' Theorem

P(H | E) = P(E | H) · P(H) / P(E)

- H: the hypothesis; E: the evidence
- P(H | E): the posterior probability, i.e., the conditional probability of H given E
- P(H): the prior probability of H
- P(E): the probability of the evidence
9. Maximum a Posteriori (MAP)

h_MAP = argmax_h P(h | E) = argmax_h P(E | h) · P(h)

Maximum Likelihood (ML)

h_ML = argmax_h P(E | h)   (MAP with all hypotheses equally likely a priori)
10. Classification
- Want to classify a new instance (a1, a2, ..., an) into a finite number of categories from the set V.
- Bayesian approach: assign the most probable category vMAP given (a1, a2, ..., an),
  vMAP = argmax_{v in V} P(v | a1, a2, ..., an)
- Can we estimate the probabilities from the training data?
11. Naïve Bayes Classifier
- By Bayes' theorem, vMAP = argmax_{v in V} P(a1, a2, ..., an | v) · P(v)
- The second probability is easy to estimate
  - How? The relative frequency of class v in the training data
- The first probability is difficult to estimate
  - Why? Too many attribute-value combinations; most never occur in the training data
- Assume independence (this is the naïve bit):
  P(a1, a2, ..., an | v) = P(a1 | v) · P(a2 | v) · ... · P(an | v)
12. The Weather Data (yet again)
13. Estimation
- Given a new instance with
  - outlook = sunny,
  - temperature = cool,
  - humidity = high,
  - windy = true
14. Calculations continued
15. Normalization
- Note that we can normalize the two scores to get the probabilities: divide each by their sum (see the sketch below)
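A compact sketch of the whole calculation in Python, assuming the standard weather-data counts (9 yes, 5 no; e.g., P(sunny | yes) = 2/9); the function name is mine:

from math import prod

def nb_score(prior, cond_probs):
    """Unnormalized Naive Bayes score: P(v) * product of P(a_i | v)."""
    return prior * prod(cond_probs)

# Instance: outlook=sunny, temperature=cool, humidity=high, windy=true
score_yes = nb_score(9/14, [2/9, 3/9, 3/9, 3/9])   # ~0.0053
score_no  = nb_score(5/14, [3/5, 1/5, 4/5, 3/5])   # ~0.0206
total = score_yes + score_no
p_yes, p_no = score_yes / total, score_no / total  # ~0.205 and ~0.795 -> predict no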
16. Problems...
- Suppose we had training data in which some attribute value never occurs together with one of the classes
- Now what? The corresponding estimate P(a_i | v) is zero, so the whole product is zero no matter what the other attributes say
17. Laplace Estimator
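A standard form of the estimator (assuming the usual add-one version):

P(a_i = x | v) = (n_x + 1) / (n + k)

where n_x is the number of class-v training instances with attribute value x, n is the total number of class-v instances, and k is the number of distinct values of the attribute. No estimated probability is then ever exactly zero.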
18. Numeric Values
- Assume a probability distribution for the numeric attributes → density f(x)
  - normal distribution (easiest)
  - fit a distribution (better)
- Then proceed similarly as before, with f(x) taking the role of the frequency estimates
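For the normal case, the standard choice is the density

f(x) = 1 / (σ · sqrt(2π)) · exp( -(x - μ)² / (2σ²) )

with μ and σ estimated separately for each attribute and class from the training data; f(x) then plays the role of P(a_i | v) in the Naïve Bayes product.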
19. Discussion
- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: Bayesian networks
20. Decision Tree Learning
- Basic Algorithm
  - Select an attribute to be tested
  - If classification achieved, return classification
  - Otherwise, branch by setting the attribute to each of its possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes
21. Deciding on Branching
- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best
  - all same class
- Maximum diversity is worst
  - all classes equally likely
- Goal: select attributes to reduce impurity
22. Measuring Impurity/Diversity
- Let's say we only have two classes
- Minimum
- Gini index/Simpson diversity index
- Entropy
23. Impurity Functions

[Plot of the three impurity functions for two classes: Entropy, Gini index, and Minimum]
24. Entropy

Entropy(S) = - Σ_{i=1..c} p_i log2 p_i

- c: number of classes; S: training data (instances); p_i: proportion of S classified as i
- Entropy is a measure of impurity in the training data S
- Measured in bits: the information needed to encode a member of S
- Extreme cases
  - All members have the same classification → entropy 0 (note: 0 log 0 = 0)
  - All classifications equally frequent → entropy is maximal
25. Expected Information Gain

Gain(S, a) = Entropy(S) - Σ_{v ∈ Values(a)} (|S_v| / |S|) · Entropy(S_v)

- Values(a): all possible values for attribute a; S_v: the subset of S where a = v
- Gain(S, a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed); see the sketch below
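These two formulas translate directly into code; a minimal sketch (function names are mine, not from the lecture):

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits (0 log 0 taken to be 0)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    """Expected information gain from splitting on attribute index a."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[a], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())

# On the weather data, gain(rows, labels, 0) with attribute 0 = outlook
# evaluates to about 0.247.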
26. The Weather Data (yet again)
27. Decision Tree: Root Node

Outlook
├─ Sunny:    Yes Yes No No No
├─ Overcast: Yes Yes Yes Yes
└─ Rainy:    Yes Yes Yes No No
28. Calculating the Entropy
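For the Outlook split on the weather data, the standard values are:

Entropy(S)          = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
Entropy(S_sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971
Entropy(S_overcast) = 0   (all yes)
Entropy(S_rainy)    = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971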
29. Calculating the Gain
- Outlook gives the largest gain → Select!
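Worked out from the weather data, the gains at the root are:

Gain(S, Outlook)     = 0.940 - (5/14 · 0.971 + 4/14 · 0 + 5/14 · 0.971) ≈ 0.247
Gain(S, Temperature) ≈ 0.029
Gain(S, Humidity)    ≈ 0.152
Gain(S, Windy)       ≈ 0.048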
30. Next Level

Outlook
├─ Sunny: Temperature
│   ├─ Hot:  No No
│   ├─ Mild: Yes No
│   └─ Cool: Yes
├─ Overcast: ...
└─ Rainy: ...
31. Calculating the Entropy
32. Calculating the Gain
- Humidity gives the largest gain → Select!
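Within the Sunny branch (5 instances, entropy ≈ 0.971), the gains worked out from the weather data are:

Gain(S_sunny, Humidity)    ≈ 0.971   (a perfect split: high → no, normal → yes)
Gain(S_sunny, Temperature) ≈ 0.571
Gain(S_sunny, Windy)       ≈ 0.020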
33. Final Tree

Outlook
├─ Sunny: Humidity
│   ├─ High:   No
│   └─ Normal: Yes
├─ Overcast: Yes
└─ Rainy: Windy
    ├─ True:  No
    └─ False: Yes
34. What's in a Tree?
- Our final decision tree correctly classifies every instance
- Is this good?
- Two important concepts
  - Overfitting
  - Pruning
35. Overfitting
- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree
36. Pruning
- Prepruning
  - Halt construction of the decision tree early
  - Use the same measure as in determining attributes, e.g., halt if InfoGain < K
  - Most frequent class becomes the leaf node
- Postpruning
  - Construct the complete decision tree
  - Prune it back
  - Prune to minimize expected error rates
  - Prune to minimize bits of encoding (Minimum Description Length principle)
37. Scalability
- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident
38. Discussion: Decision Trees
- The most popular methods
- Quite effective
- Relatively simple
- Have discussed in detail the ID3 algorithm
- Information gain to select attributes
- No pruning
- Only handles nominal attributes
39. Selecting Split Attributes
- Other univariate splits
  - Gain Ratio: C4.5 algorithm (J48 in Weka)
  - CART (not in Weka)
- Multivariate splits
  - May be possible to obtain better splits by considering two or more attributes simultaneously
40. Instance-Based Learning
- Classification
  - Do not construct an explicit description of how to classify
  - Store all training data (the learning step)
  - For a new example, find the most similar instances
  - Computing is done at the time of classification
  - k-nearest neighbor
41. k-Nearest Neighbor
- Each instance lives in n-dimensional space
- Distance between instances: typically Euclidean,
  d(x, y) = sqrt( Σ_{i=1..n} (x_i - y_i)² )
42. Example: nearest neighbor

[Scatter plot of positive and negative instances around a query point xq]

- 1-Nearest neighbor? 6-Nearest neighbor?
43. Normalizing
- Some attributes may take large values and others small
- Normalize (see the sketch below)
- All attributes on an equal footing
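A minimal sketch of k-NN with min-max normalization (the names are mine; in practice the query must be scaled with the training set's minima and maxima):

from collections import Counter
from math import dist

def min_max_normalize(rows):
    """Scale each attribute to [0, 1] so all attributes are on equal footing."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [[(x - l) / (h - l) if h > l else 0.0
             for x, l, h in zip(row, lo, hi)] for row in rows]

def knn_classify(train_rows, train_labels, query, k=1):
    """Majority vote among the k training instances closest to the query."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]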
44. Other Methods for Supervised Learning
- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach
45. Evaluating the Learning
- Measure of performance
- Classification error rate
- Resubstitution error
- Performance on training set
- Poor predictor of future performance
- Overfitting
- Useless for evaluation
46. Test Set
- Need a set of test instances
  - Independent of training set instances
  - Representative of underlying structure
- Sometimes also: validation data
  - Fine-tune parameters
  - Independent of training and test data
- Plentiful data: no problem!
47. Holdout Procedures
- Common case: data set large but limited
- Usual procedure
  - Reserve some data for testing
  - Use remaining data for training
- Problems
  - Want both sets as large as possible
  - Want both sets to be representative
48"Smart" Holdout
- Simple check Are the proportions of classes
about the same in each data set? - Stratified holdout
- Guarantee that classes are (approximately)
proportionally represented - Repeated holdout
- Randomly select holdout set several times and
average the error rate estimates
49. Holdout w/ Cross-Validation
- Cross-validation
  - Fixed number of partitions of the data (folds)
  - In turn, each partition is used for testing and the remaining instances for training
  - May use stratification and randomization
- Standard practice
  - Stratified tenfold cross-validation
  - Instances divided randomly into the ten partitions
50. Cross-Validation

Fold 1: train a model on 90% of the data, test on the remaining 10% → error rate e1
Fold 2: train a model on 90% of the data, test on the remaining 10% → error rate e2
... and so on for each of the ten folds
51. Cross-Validation
- Final estimate of error: the average over the folds,
  e = (1/k) Σ_{i=1..k} e_i   (k = 10 for tenfold)
- Quality of estimate: judged from the variance of the fold estimates e_i
52. Leave-One-Out Holdout
- n-fold cross-validation (n = the number of instances)
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample
53. Bootstrap
- Sample with replacement n times
- Use as training data
- Use instances not in training data for testing
- How many test instances are there?
54. 0.632 Bootstrap
- On average, e⁻¹ · n ≈ 0.368 · n instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate the error rate as (see the sketch below)
  e = 0.632 · e_test + 0.368 · e_train
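A sketch of one bootstrap estimate; the train_and_error callback is a placeholder for any learner plus error measurement, not a real API:

import random

def bootstrap_632(rows, labels, train_and_error):
    """One 0.632-bootstrap replicate: train_and_error(train, test) -> error rate."""
    n = len(rows)
    picked = [random.randrange(n) for _ in range(n)]   # sample with replacement
    chosen = set(picked)
    train = [(rows[i], labels[i]) for i in picked]
    test = [(rows[i], labels[i]) for i in range(n) if i not in chosen]
    e_test = train_and_error(train, test)     # error on unseen instances
    e_train = train_and_error(train, train)   # resubstitution error
    return 0.632 * e_test + 0.368 * e_train

In practice the estimate is averaged over many such replicates.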
55. Accuracy of Our Estimate?
- Suppose we observe s successes in a test set of n_test instances ...
- We then estimate the success rate R_success = s / n_test
- Each instance is either a success or a failure (Bernoulli trial with success probability p)
  - Mean: p
  - Variance: p(1 - p)
56. Properties of Estimate
- We have
  - E[R_success] = p
  - Var[R_success] = p(1 - p) / n_test
- If n_test is large enough, the Central Limit Theorem (CLT) states that, approximately,
  R_success ~ Normal(p, p(1 - p) / n_test)
57. Confidence Interval

R_success ± z · sqrt( R_success (1 - R_success) / n_test )

- z is looked up in a table for the desired confidence level
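As a worked example (the numbers are illustrative, not from the lecture): with s = 75 successes out of n_test = 100 instances, R_success = 0.75, and the approximate 95% interval (z = 1.96) is 0.75 ± 1.96 · sqrt(0.75 · 0.25 / 100) ≈ 0.75 ± 0.085, i.e., roughly (0.66, 0.83).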
58. Comparing Algorithms
- Know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
  - Evaluate each algorithm
  - Rank
  - Select the best one
- Don't know if this ranking is reliable
59. Assessing Other Learning
- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles
    - use an independent test set and hold-out procedures
    - cross-validation or bootstrap
60. Measures of Effectiveness
- Need to compare
  - Predicted values p1, p2, ..., pn
  - Actual values a1, a2, ..., an
- Most common measure: mean-squared error,
  MSE = (1/n) Σ_{i=1..n} (p_i - a_i)²
61. Other Measures
- Mean absolute error
- Relative squared error
- Relative absolute error
- Correlation
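The standard definitions of these measures (with ā the mean of the actual values) are:

Mean absolute error:     MAE = (1/n) Σ_i |p_i - a_i|
Relative squared error:  RSE = Σ_i (p_i - a_i)² / Σ_i (a_i - ā)²
Relative absolute error: RAE = Σ_i |p_i - a_i| / Σ_i |a_i - ā|
Correlation: the sample correlation coefficient between the p_i and the a_i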
62. What to Do?
- Large amounts of data
  - Hold out 1/3 of the data for testing
  - Train a model on 2/3 of the data
  - Estimate the error (or success) rate and calculate a CI
- Moderate amounts of data
  - Estimate the error rate
    - Use 10-fold cross-validation with stratification,
    - or use the bootstrap
  - Train the model on the entire data set
63. Predicting Probabilities
- Classification into k classes
- Predict probabilities p1, p2, ..., pk, one for each class
- Actual values a1, a2, ..., ak, where a_j = 1 for the correct class and 0 otherwise
- No longer a 0-1 error
- Quadratic loss function:
  Σ_{j=1..k} (p_j - a_j)²
64. Information Loss Function
- Instead of the quadratic function, use -log2 p_j, where the j-th prediction is the correct one
- This is the information required to communicate which class is correct
  - in bits
  - with respect to the predicted probability distribution
65. Occam's Razor
- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining
  - theory = predictive model
  - good theory = good prediction
- What is good? Do we minimize the error rate?
66. Minimum Description Length
- MDL principle
  - Minimize: size of theory + info needed to specify the exceptions
- Suppose training set E is mined, resulting in a theory T
- Want to minimize
  L(T) + L(E | T)   (description lengths in bits)
67. Most Likely Theory
- Suppose we want to maximize P(T | E)
- Bayes' rule:
  P(T | E) = P(E | T) · P(T) / P(E)
- Take logarithms:
  log P(T | E) = log P(E | T) + log P(T) - log P(E)
68. Information Function
- Maximizing P(T | E) is equivalent to minimizing
  -log P(E | T) - log P(T)
  (P(E) is the same for every theory)
- -log P(E | T): the number of bits it takes to transmit the exceptions
- -log P(T): the number of bits it takes to transmit the theory
- That is, the MDL principle!
69. Applications to Learning
- Classification, association, numeric prediction
- Several predictive models with 'similar' error rates (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use the MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use the MDL principle
70. Comparing Mining Algorithms
- Know how to evaluate the results
- Suppose we have two algorithms
  - Obtain two different models
  - Estimate the error rates e(1) and e(2)
  - Compare estimates
  - Select the better one
- Problem?
71. Weather Data Example
- Suppose we learn the rule
  - If outlook = rainy then play = yes
  - Otherwise play = no
- Test it on the following test set
- Have a zero error rate
72. Different Test Set 2
- Again, suppose we learn the rule
  - If outlook = rainy then play = yes
  - Otherwise play = no
- Test it on a different test set
- Have a 100% error rate!
73. Comparing Random Estimates
- The estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic (sketched below) for H0: difference = 0,
  t = d̄ / sqrt( σ̂² / k )
  where d̄ is the average of the differences in error rates over the k folds or repetitions, and σ̂ is their estimated standard deviation
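A sketch of the statistic on matched per-fold error rates (the function name is mine):

from math import sqrt

def paired_t_statistic(errors1, errors2):
    """t statistic for H0: the mean difference in error rates is 0,
    given matched per-fold error rates from two algorithms."""
    d = [e1 - e2 for e1, e2 in zip(errors1, errors2)]
    k = len(d)
    d_bar = sum(d) / k
    s2 = sum((x - d_bar) ** 2 for x in d) / (k - 1)   # sample variance
    return d_bar / sqrt(s2 / k)   # compare against t with k-1 degrees of freedom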
74. Discussion
- Now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has a 'comparable' error rate
- Is it really better?
- Minimizing the error rate can be misleading
75. Examples of 'Good Models'
- Application: loan approval
  - Model: no applicants default on loans
  - Evaluation: simple, low error rate
- Application: cancer diagnosis
  - Model: all tumors are benign
  - Evaluation: simple, low error rate
- Application: information assurance
  - Model: all visitors to the network are well intentioned
  - Evaluation: simple, low error rate
76. What's Going On?
- Many (most) data mining applications can be thought about as detecting exceptions
- Ignoring the exceptions does not significantly increase the error rate!
- Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but that completely misses the point
- Need to account for the cost of error types
77. Accounting for Cost of Errors
- Explicit modeling of the cost of each error
  - costs may not be known
  - often not practical
- Look at trade-offs
  - visual inspection
  - semi-automated learning
- Cost-sensitive learning
  - assign costs to classes a priori
78. Explicit Modeling of Cost

Confusion matrix (displayed in Weka)
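A generic two-class layout (an illustration, not the slide's own example):

                 Predicted: yes    Predicted: no
Actual: yes      true positives    false negatives
Actual: no       false positives   true negatives

With explicit error costs, each off-diagonal cell is weighted by its cost, and models are compared by total expected cost rather than raw error rate.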
79. Cost-Sensitive Learning
- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea
  - Duplicate (or up-weight) instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies to any learning algorithm
80. Discussion
- Evaluate learning
  - Estimate the error rate
  - Minimum description length principle / Occam's razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure the difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate into learning
81. Engineering the Output
- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available → new model
- Combine models
  - Bagging
  - Boosting
  - Stacking
- Improve prediction, but complicate structure
82. Bagging
- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging
  - Assume we have several data sets
  - Apply the learning algorithm to each set
  - Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?
83. Bootstrap Aggregating
- In practice: only one training data set
- Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work? (see the sketch below)
  - Often gives improvements in predictive performance
  - Never degrades performance
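A sketch of the procedure; the learn callback stands in for any learning algorithm and is not a real API:

import random
from collections import Counter

def bagging_predict(rows, labels, query, learn, m=10):
    """Train m models on bootstrap replicates and vote on the prediction.
    learn(rows, labels) must return a classifier: a function query -> class."""
    n = len(rows)
    votes = []
    for _ in range(m):
        picked = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        model = learn([rows[i] for i in picked], [labels[i] for i in picked])
        votes.append(model(query))
    return Counter(votes).most_common(1)[0][0]            # majority vote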
84. Boosting
- Assume a stable learning procedure
  - Low variance
  - Bagging does very little
- Combine structurally different models
- Intuitive motivation
  - Any given model may be good for a subset of the training data
  - Encourage models to explain part of the data
85. AdaBoost.M1
- Generate models
  - Assign equal weight to each training instance
  - Iterate
    - Apply the learning algorithm and store the model
    - e ← error
    - If e = 0 or e ≥ 0.5, terminate
    - For every instance
      - If classified correctly, multiply its weight by e / (1 - e)
    - Normalize weights
  - Until STOP
86. AdaBoost.M1
- Classification
  - Assign zero weight to each class
  - For every model
    - Add -log( e / (1 - e) )
    - to the class predicted by that model
  - Return the class with the highest weight (see the sketch below)
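The two slides fit together as in the following sketch; learn_weighted stands in for any learner that accepts instance weights and is not a real API:

from collections import defaultdict
from math import log

def adaboost_m1(rows, labels, learn_weighted, rounds=10):
    """Model generation: learn_weighted(rows, labels, weights) -> classifier."""
    n = len(rows)
    w = [1.0 / n] * n                                     # equal initial weights
    models = []
    for _ in range(rounds):
        model = learn_weighted(rows, labels, w)
        wrong = [model(r) != y for r, y in zip(rows, labels)]
        e = sum(wi for wi, bad in zip(w, wrong) if bad)   # weighted error
        if e == 0 or e >= 0.5:
            break
        w = [wi * (e / (1 - e)) if not bad else wi        # shrink correct ones
             for wi, bad in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]                      # normalize
        models.append((model, e))
    return models

def adaboost_classify(models, query):
    """Classification: each model votes with weight -log(e / (1 - e))."""
    scores = defaultdict(float)
    for model, e in models:
        scores[model(query)] += -log(e / (1 - e))
    return max(scores, key=scores.get)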
87. Performance Analysis
- The error of the combined classifier converges to zero at an exponential rate (very fast)
- Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - Classifier is more complex than the training data justifies
  - Training error becomes too large too quickly
- Must achieve a balance between model complexity and the fit to the data
88. Fitting versus Overfitting
- Overfitting is very difficult to assess here
  - Assume we have reached zero error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than with bagging
  - Can degrade performance
    - Never happens with bagging
89. Stacking
- Models of different types
- Meta learner
  - Learn which learning algorithms are good
  - Combine learning algorithms intelligently

[Diagram: Level-0 models (Decision Tree, Naïve Bayes, Instance-Based) feed their predictions into a Level-1 model, the Meta Learner]
90. Meta Learning
- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
- Comments
  - Level-1 learning: use a very simple algorithm (e.g., a linear model)
  - Can use cross-validation to allow the level-1 algorithm to train on all the data
91. Supervised Learning
- Two types of learning
- Classification
- Numerical prediction
- Classification learning algorithms
- Decision trees
- Naïve Bayes
- Instance-based learning
- Many others are part of Weka, browse!
92. Other Issues in Supervised Learning
- Evaluation
  - Accuracy: hold-out, bootstrap, cross-validation
  - Simplicity: MDL principle
  - Usefulness: cost-sensitive learning
- Metalearning
  - Bagging, Boosting, Stacking