Title: Supervised Learning
1. Supervised Learning
2. Introduction
- Key idea
- Known target concept (predict certain attribute)
- Find out how other attributes can be used
- Algorithms
- Rudimentary Rules (e.g., 1R)
- Statistical Modeling (e.g., Naïve Bayes)
- Divide and Conquer (e.g., Decision Trees)
- Instance-Based Learning
- Neural Networks
- Support Vector Machines
3. 1-Rule (1R)
- Generate a one-level decision tree
- One attribute
- Performs quite well!
- Basic idea
- Rules testing a single attribute
- Classify according to frequency in training data
- Evaluate error rate for each attribute
- Choose the best attribute
- That's all, folks!
4. The Weather Data (again)
5. Apply 1R
  Attribute      Rule              Errors   Total errors
  1 outlook      sunny → no        2/5      4/14
                 overcast → yes    0/4
                 rainy → yes       2/5
  2 temperature  hot → no          2/4      5/14
                 mild → yes        2/6
                 cool → yes        1/4
  3 humidity     high → no         3/7      4/14
                 normal → yes      1/7
  4 windy        false → yes       2/8      5/14
                 true → no         3/6
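The whole procedure fits in a few lines of Python. The sketch below is illustrative only (the function and variable names are mine, not from the lecture); ties between attributes are broken by attribute order:

from collections import Counter, defaultdict

def one_r(rows, labels):
    """1R: for each attribute, build one rule per value (the majority class),
    count training errors, and keep the attribute with the fewest errors."""
    best = None
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)        # attribute value -> class counts
        for row, y in zip(rows, labels):
            by_value[row[a]][y] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best                                # (errors, attribute index, rules)

# On the weather data (attribute 0 = outlook), this yields 4/14 errors with
# rules {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}.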
6. Other Features
- Numeric Values
- Discretization
- Sort training data
- Split range into categories
- Missing Values
- Dummy attribute
7. Naïve Bayes Classifier
- Allow all attributes to contribute equally
- Assumes
- All attributes equally important
- All attributes independent
- Realistic?
- Selection of attributes
8. Bayes' Theorem

P(H | E) = P(E | H) · P(H) / P(E)

- H: the hypothesis; E: the evidence
- P(H | E): the posterior probability, i.e., the conditional probability of H given E
- P(H): the prior probability of H
- P(E): the probability of the evidence
9. Maximum a Posteriori (MAP)

h_MAP = argmax_h P(h | E) = argmax_h P(E | h) · P(h)

Maximum Likelihood (ML)

h_ML = argmax_h P(E | h)   (MAP with all hypotheses equally likely a priori)
10. Classification
- Want to classify a new instance (a1, a2, ..., an) into a finite number of categories from the set V.
- Bayesian approach: assign the most probable category vMAP given (a1, a2, ..., an),
  vMAP = argmax_{v in V} P(v | a1, a2, ..., an)
- Can we estimate the probabilities from the training data?
11. Naïve Bayes Classifier
- By Bayes' theorem, vMAP = argmax_{v in V} P(a1, a2, ..., an | v) · P(v)
- The second probability is easy to estimate
  - How? The relative frequency of class v in the training data
- The first probability is difficult to estimate
  - Why? Too many attribute-value combinations; most never occur in the training data
- Assume independence (this is the naïve bit):
  P(a1, a2, ..., an | v) = P(a1 | v) · P(a2 | v) · ... · P(an | v)
12. The Weather Data (yet again)
13. Estimation
- Given a new instance with
  - outlook = sunny,
  - temperature = cool,
  - humidity = high,
  - windy = true
14. Calculations continued
15. Normalization
- Note that we can normalize the two scores to get the probabilities: divide each by their sum (see the sketch below)
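A compact sketch of the whole calculation in Python, assuming the standard weather-data counts (9 yes, 5 no; e.g., P(sunny | yes) = 2/9); the function name is mine:

from math import prod

def nb_score(prior, cond_probs):
    """Unnormalized Naive Bayes score: P(v) * product of P(a_i | v)."""
    return prior * prod(cond_probs)

# Instance: outlook=sunny, temperature=cool, humidity=high, windy=true
score_yes = nb_score(9/14, [2/9, 3/9, 3/9, 3/9])   # ~0.0053
score_no  = nb_score(5/14, [3/5, 1/5, 4/5, 3/5])   # ~0.0206
total = score_yes + score_no
p_yes, p_no = score_yes / total, score_no / total  # ~0.205 and ~0.795 -> predict no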
16. Problems...
- Suppose we had training data in which some attribute value never occurs together with one of the classes
- Now what? The corresponding estimate P(a_i | v) is zero, so the whole product is zero no matter what the other attributes say
17. Laplace Estimator
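A standard form of the estimator (assuming the usual add-one version):

P(a_i = x | v) = (n_x + 1) / (n + k)

where n_x is the number of class-v training instances with attribute value x, n is the total number of class-v instances, and k is the number of distinct values of the attribute. No estimated probability is then ever exactly zero.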
18. Numeric Values
- Assume a probability distribution for the numeric attributes → density f(x)
  - normal distribution (easiest)
  - fit a distribution (better)
- Then proceed similarly as before, with f(x) taking the role of the frequency estimates
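For the normal case, the standard choice is the density

f(x) = 1 / (σ · sqrt(2π)) · exp( -(x - μ)² / (2σ²) )

with μ and σ estimated separately for each attribute and class from the training data; f(x) then plays the role of P(a_i | v) in the Naïve Bayes product.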
19. Discussion
- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: Bayesian networks
20. Decision Tree Learning
- Basic Algorithm
  - Select an attribute to be tested
  - If classification achieved, return classification
  - Otherwise, branch by setting the attribute to each of its possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes
21. Deciding on Branching
- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best
  - all same class
- Maximum diversity is worst
  - all classes equally likely
- Goal: select attributes to reduce impurity
22. Measuring Impurity/Diversity
- Let's say we only have two classes
- Minimum
- Gini index/Simpson diversity index
- Entropy
23. Impurity Functions

[Plot of the three impurity functions for two classes: Entropy, Gini index, and Minimum]
24. Entropy

Entropy(S) = - Σ_{i=1..c} p_i log2 p_i

- c: number of classes; S: training data (instances); p_i: proportion of S classified as i
- Entropy is a measure of impurity in the training data S
- Measured in bits: the information needed to encode a member of S
- Extreme cases
  - All members have the same classification → entropy 0 (note: 0 log 0 = 0)
  - All classifications equally frequent → entropy is maximal
25. Expected Information Gain

Gain(S, a) = Entropy(S) - Σ_{v ∈ Values(a)} (|S_v| / |S|) · Entropy(S_v)

- Values(a): all possible values for attribute a; S_v: the subset of S where a = v
- Gain(S, a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed); see the sketch below
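These two formulas translate directly into code; a minimal sketch (function names are mine, not from the lecture):

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits (0 log 0 taken to be 0)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    """Expected information gain from splitting on attribute index a."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[a], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())

# On the weather data, gain(rows, labels, 0) with attribute 0 = outlook
# evaluates to about 0.247.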
26. The Weather Data (yet again)
27. Decision Tree: Root Node

Outlook
├─ Sunny:    Yes Yes No No No
├─ Overcast: Yes Yes Yes Yes
└─ Rainy:    Yes Yes Yes No No
28. Calculating the Entropy
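For the Outlook split on the weather data, the standard values are:

Entropy(S)          = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
Entropy(S_sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971
Entropy(S_overcast) = 0   (all yes)
Entropy(S_rainy)    = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971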
29. Calculating the Gain
- Outlook gives the largest gain → Select!
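Worked out from the weather data, the gains at the root are:

Gain(S, Outlook)     = 0.940 - (5/14 · 0.971 + 4/14 · 0 + 5/14 · 0.971) ≈ 0.247
Gain(S, Temperature) ≈ 0.029
Gain(S, Humidity)    ≈ 0.152
Gain(S, Windy)       ≈ 0.048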
30. Next Level

Outlook
├─ Sunny: Temperature
│   ├─ Hot:  No No
│   ├─ Mild: Yes No
│   └─ Cool: Yes
├─ Overcast: ...
└─ Rainy: ...
31. Calculating the Entropy
32. Calculating the Gain
- Humidity gives the largest gain → Select!
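Within the Sunny branch (5 instances, entropy ≈ 0.971), the gains worked out from the weather data are:

Gain(S_sunny, Humidity)    ≈ 0.971   (a perfect split: high → no, normal → yes)
Gain(S_sunny, Temperature) ≈ 0.571
Gain(S_sunny, Windy)       ≈ 0.020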
33. Final Tree

Outlook
├─ Sunny: Humidity
│   ├─ High:   No
│   └─ Normal: Yes
├─ Overcast: Yes
└─ Rainy: Windy
    ├─ True:  No
    └─ False: Yes
34. What's in a Tree?
- Our final decision tree correctly classifies every instance
- Is this good?
- Two important concepts
  - Overfitting
  - Pruning
35. Overfitting
- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree
36. Pruning
- Prepruning
  - Halt construction of the decision tree early
  - Use the same measure as in determining attributes, e.g., halt if InfoGain < K
  - Most frequent class becomes the leaf node
- Postpruning
  - Construct the complete decision tree
  - Prune it back
  - Prune to minimize expected error rates
  - Prune to minimize bits of encoding (Minimum Description Length principle)
37. Scalability
- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident
38. Discussion: Decision Trees
- The most popular methods
- Quite effective
- Relatively simple
- Have discussed in detail the ID3 algorithm
- Information gain to select attributes
- No pruning
- Only handles nominal attributes
39. Selecting Split Attributes
- Other univariate splits
  - Gain Ratio: C4.5 algorithm (J48 in Weka)
  - CART (not in Weka)
- Multivariate splits
  - May be possible to obtain better splits by considering two or more attributes simultaneously
40. Instance-Based Learning
- Classification
  - Do not construct an explicit description of how to classify
  - Store all training data (the learning step)
  - For a new example, find the most similar instances
  - Computing is done at the time of classification
  - k-nearest neighbor
41. k-Nearest Neighbor
- Each instance lives in n-dimensional space
- Distance between instances: typically Euclidean,
  d(x, y) = sqrt( Σ_{i=1..n} (x_i - y_i)² )
42. Example: nearest neighbor

[Scatter plot of positive and negative instances around a query point xq]

- 1-Nearest neighbor? 6-Nearest neighbor?
43. Normalizing
- Some attributes may take large values and others small
- Normalize (see the sketch below)
- All attributes on an equal footing
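A minimal sketch of k-NN with min-max normalization (the names are mine; in practice the query must be scaled with the training set's minima and maxima):

from collections import Counter
from math import dist

def min_max_normalize(rows):
    """Scale each attribute to [0, 1] so all attributes are on equal footing."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [[(x - l) / (h - l) if h > l else 0.0
             for x, l, h in zip(row, lo, hi)] for row in rows]

def knn_classify(train_rows, train_labels, query, k=1):
    """Majority vote among the k training instances closest to the query."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]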
44. Other Methods for Supervised Learning
- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach
45. Evaluating the Learning
- Measure of performance
- Classification error rate
- Resubstitution error
- Performance on training set
- Poor predictor of future performance
- Overfitting
- Useless for evaluation
46. Test Set
- Need a set of test instances
  - Independent of training set instances
  - Representative of underlying structure
- Sometimes also: validation data
  - Fine-tune parameters
  - Independent of training and test data
- Plentiful data: no problem!
47. Holdout Procedures
- Common case: data set large but limited
- Usual procedure
  - Reserve some data for testing
  - Use remaining data for training
- Problems
  - Want both sets as large as possible
  - Want both sets to be representative
48"Smart" Holdout
- Simple check Are the proportions of classes
about the same in each data set? - Stratified holdout
- Guarantee that classes are (approximately)
proportionally represented - Repeated holdout
- Randomly select holdout set several times and
average the error rate estimates
49. Holdout w/ Cross-Validation
- Cross-validation
  - Fixed number of partitions of the data (folds)
  - In turn, each partition is used for testing and the remaining instances for training
  - May use stratification and randomization
- Standard practice
  - Stratified tenfold cross-validation
  - Instances divided randomly into the ten partitions
50. Cross-Validation

Fold 1: train a model on 90% of the data, test on the remaining 10% → error rate e1
Fold 2: train a model on 90% of the data, test on the remaining 10% → error rate e2
... and so on for each of the ten folds
51. Cross-Validation
- Final estimate of error: the average over the folds,
  e = (1/k) Σ_{i=1..k} e_i   (k = 10 for tenfold)
- Quality of estimate: judged from the variance of the fold estimates e_i
52. Leave-One-Out Holdout
- n-fold cross-validation (n = the number of instances)
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample
53. Bootstrap
- Sample with replacement n times
- Use as training data
- Use instances not in training data for testing
- How many test instances are there?
54. 0.632 Bootstrap
- On average, e⁻¹ · n ≈ 0.368 · n instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate the error rate as (see the sketch below)
  e = 0.632 · e_test + 0.368 · e_train
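A sketch of one bootstrap estimate; the train_and_error callback is a placeholder for any learner plus error measurement, not a real API:

import random

def bootstrap_632(rows, labels, train_and_error):
    """One 0.632-bootstrap replicate: train_and_error(train, test) -> error rate."""
    n = len(rows)
    picked = [random.randrange(n) for _ in range(n)]   # sample with replacement
    chosen = set(picked)
    train = [(rows[i], labels[i]) for i in picked]
    test = [(rows[i], labels[i]) for i in range(n) if i not in chosen]
    e_test = train_and_error(train, test)     # error on unseen instances
    e_train = train_and_error(train, train)   # resubstitution error
    return 0.632 * e_test + 0.368 * e_train

In practice the estimate is averaged over many such replicates.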
55. Accuracy of Our Estimate?
- Suppose we observe s successes in a test set of n_test instances ...
- We then estimate the success rate R_success = s / n_test
- Each instance is either a success or a failure (Bernoulli trial with success probability p)
  - Mean: p
  - Variance: p(1 - p)
56. Properties of Estimate
- We have
  - E[R_success] = p
  - Var[R_success] = p(1 - p) / n_test
- If n_test is large enough, the Central Limit Theorem (CLT) states that, approximately,
  R_success ~ Normal(p, p(1 - p) / n_test)
57. Confidence Interval

R_success ± z · sqrt( R_success (1 - R_success) / n_test )

- z is looked up in a table for the desired confidence level
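As a worked example (the numbers are illustrative, not from the lecture): with s = 75 successes out of n_test = 100 instances, R_success = 0.75, and the approximate 95% interval (z = 1.96) is 0.75 ± 1.96 · sqrt(0.75 · 0.25 / 100) ≈ 0.75 ± 0.085, i.e., roughly (0.66, 0.83).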
58. Comparing Algorithms
- Know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
  - Evaluate each algorithm
  - Rank
  - Select the best one
- Don't know if this ranking is reliable
59. Assessing Other Learning
- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles
    - use an independent test set and hold-out procedures
    - cross-validation or bootstrap
60. Measures of Effectiveness
- Need to compare
  - Predicted values p1, p2, ..., pn
  - Actual values a1, a2, ..., an
- Most common measure: mean-squared error,
  MSE = (1/n) Σ_{i=1..n} (p_i - a_i)²
61. Other Measures
- Mean absolute error
- Relative squared error
- Relative absolute error
- Correlation
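The standard definitions of these measures (with ā the mean of the actual values) are:

Mean absolute error:     MAE = (1/n) Σ_i |p_i - a_i|
Relative squared error:  RSE = Σ_i (p_i - a_i)² / Σ_i (a_i - ā)²
Relative absolute error: RAE = Σ_i |p_i - a_i| / Σ_i |a_i - ā|
Correlation: the sample correlation coefficient between the p_i and the a_i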
62. What to Do?
- Large amounts of data
  - Hold out 1/3 of the data for testing
  - Train a model on 2/3 of the data
  - Estimate the error (or success) rate and calculate a CI
- Moderate amounts of data
  - Estimate the error rate
    - Use 10-fold cross-validation with stratification,
    - or use the bootstrap
  - Train the model on the entire data set
63. Predicting Probabilities
- Classification into k classes
- Predict probabilities p1, p2, ..., pk, one for each class
- Actual values a1, a2, ..., ak, where a_j = 1 for the correct class and 0 otherwise
- No longer a 0-1 error
- Quadratic loss function:
  Σ_{j=1..k} (p_j - a_j)²
64. Information Loss Function
- Instead of the quadratic function, use -log2 p_j, where the j-th prediction is the correct one
- This is the information required to communicate which class is correct
  - in bits
  - with respect to the predicted probability distribution
65. Occam's Razor
- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining
  - theory = predictive model
  - good theory = good prediction
- What is good? Do we minimize the error rate?
66. Minimum Description Length
- MDL principle
  - Minimize: size of theory + info needed to specify the exceptions
- Suppose training set E is mined, resulting in a theory T
- Want to minimize
  L(T) + L(E | T)   (description lengths in bits)
67. Most Likely Theory
- Suppose we want to maximize P(T | E)
- Bayes' rule:
  P(T | E) = P(E | T) · P(T) / P(E)
- Take logarithms:
  log P(T | E) = log P(E | T) + log P(T) - log P(E)
68. Information Function
- Maximizing P(T | E) is equivalent to minimizing
  -log P(E | T) - log P(T)
  (P(E) is the same for every theory)
- -log P(E | T): the number of bits it takes to transmit the exceptions
- -log P(T): the number of bits it takes to transmit the theory
- That is, the MDL principle!
69. Applications to Learning
- Classification, association, numeric prediction
- Several predictive models with 'similar' error rates (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use the MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use the MDL principle
70. Comparing Mining Algorithms
- Know how to evaluate the results
- Suppose we have two algorithms
  - Obtain two different models
  - Estimate the error rates e(1) and e(2)
  - Compare estimates
  - Select the better one
- Problem?
71. Weather Data Example
- Suppose we learn the rule
  - If outlook = rainy then play = yes
  - Otherwise play = no
- Test it on the following test set
- Have a zero error rate
72. Different Test Set 2
- Again, suppose we learn the rule
  - If outlook = rainy then play = yes
  - Otherwise play = no
- Test it on a different test set
- Have a 100% error rate!
73. Comparing Random Estimates
- The estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic (sketched below) for H0: difference = 0,
  t = d̄ / sqrt( σ̂² / k )
  where d̄ is the average of the differences in error rates over the k folds or repetitions, and σ̂ is their estimated standard deviation
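A sketch of the statistic on matched per-fold error rates (the function name is mine):

from math import sqrt

def paired_t_statistic(errors1, errors2):
    """t statistic for H0: the mean difference in error rates is 0,
    given matched per-fold error rates from two algorithms."""
    d = [e1 - e2 for e1, e2 in zip(errors1, errors2)]
    k = len(d)
    d_bar = sum(d) / k
    s2 = sum((x - d_bar) ** 2 for x in d) / (k - 1)   # sample variance
    return d_bar / sqrt(s2 / k)   # compare against t with k-1 degrees of freedom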
74. Discussion
- Now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has a 'comparable' error rate
- Is it really better?
- Minimizing the error rate can be misleading
75. Examples of 'Good Models'
- Application: loan approval
  - Model: no applicants default on loans
  - Evaluation: simple, low error rate
- Application: cancer diagnosis
  - Model: all tumors are benign
  - Evaluation: simple, low error rate
- Application: information assurance
  - Model: all visitors to the network are well intentioned
  - Evaluation: simple, low error rate
76. What's Going On?
- Many (most) data mining applications can be thought about as detecting exceptions
- Ignoring the exceptions does not significantly increase the error rate!
- Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but that completely misses the point
- Need to account for the cost of error types
77. Accounting for Cost of Errors
- Explicit modeling of the cost of each error
  - costs may not be known
  - often not practical
- Look at trade-offs
  - visual inspection
  - semi-automated learning
- Cost-sensitive learning
  - assign costs to classes a priori
78. Explicit Modeling of Cost

Confusion matrix (displayed in Weka)
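A generic two-class layout (an illustration, not the slide's own example):

                 Predicted: yes    Predicted: no
Actual: yes      true positives    false negatives
Actual: no       false positives   true negatives

With explicit error costs, each off-diagonal cell is weighted by its cost, and models are compared by total expected cost rather than raw error rate.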
79. Cost-Sensitive Learning
- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea
  - Duplicate (or up-weight) instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies to any learning algorithm
80. Discussion
- Evaluate learning
  - Estimate the error rate
  - Minimum description length principle / Occam's razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure the difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate into learning
81. Engineering the Output
- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available → new model
- Combine models
  - Bagging
  - Boosting
  - Stacking
- Improve prediction, but complicate structure
82. Bagging
- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging
  - Assume we have several data sets
  - Apply the learning algorithm to each set
  - Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?
83. Bootstrap Aggregating
- In practice: only one training data set
- Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work? (see the sketch below)
  - Often gives improvements in predictive performance
  - Never degrades performance
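A sketch of the procedure; the learn callback stands in for any learning algorithm and is not a real API:

import random
from collections import Counter

def bagging_predict(rows, labels, query, learn, m=10):
    """Train m models on bootstrap replicates and vote on the prediction.
    learn(rows, labels) must return a classifier: a function query -> class."""
    n = len(rows)
    votes = []
    for _ in range(m):
        picked = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        model = learn([rows[i] for i in picked], [labels[i] for i in picked])
        votes.append(model(query))
    return Counter(votes).most_common(1)[0][0]            # majority vote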
84. Boosting
- Assume a stable learning procedure
  - Low variance
  - Bagging does very little
- Combine structurally different models
- Intuitive motivation
  - Any given model may be good for a subset of the training data
  - Encourage models to explain part of the data
85. AdaBoost.M1
- Generate models
  - Assign equal weight to each training instance
  - Iterate
    - Apply the learning algorithm and store the model
    - e ← error
    - If e = 0 or e ≥ 0.5, terminate
    - For every instance
      - If classified correctly, multiply its weight by e / (1 - e)
    - Normalize weights
  - Until STOP
86. AdaBoost.M1
- Classification
  - Assign zero weight to each class
  - For every model
    - Add -log( e / (1 - e) )
    - to the class predicted by that model
  - Return the class with the highest weight (see the sketch below)
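The two slides fit together as in the following sketch; learn_weighted stands in for any learner that accepts instance weights and is not a real API:

from collections import defaultdict
from math import log

def adaboost_m1(rows, labels, learn_weighted, rounds=10):
    """Model generation: learn_weighted(rows, labels, weights) -> classifier."""
    n = len(rows)
    w = [1.0 / n] * n                                     # equal initial weights
    models = []
    for _ in range(rounds):
        model = learn_weighted(rows, labels, w)
        wrong = [model(r) != y for r, y in zip(rows, labels)]
        e = sum(wi for wi, bad in zip(w, wrong) if bad)   # weighted error
        if e == 0 or e >= 0.5:
            break
        w = [wi * (e / (1 - e)) if not bad else wi        # shrink correct ones
             for wi, bad in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]                      # normalize
        models.append((model, e))
    return models

def adaboost_classify(models, query):
    """Classification: each model votes with weight -log(e / (1 - e))."""
    scores = defaultdict(float)
    for model, e in models:
        scores[model(query)] += -log(e / (1 - e))
    return max(scores, key=scores.get)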
87. Performance Analysis
- The error of the combined classifier converges to zero at an exponential rate (very fast)
- Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - Classifier is more complex than the training data justifies
  - Training error becomes too large too quickly
- Must achieve a balance between model complexity and the fit to the data
88. Fitting versus Overfitting
- Overfitting is very difficult to assess here
  - Assume we have reached zero error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than with bagging
  - Can degrade performance
    - Never happens with bagging
89. Stacking
- Models of different types
- Meta learner
  - Learn which learning algorithms are good
  - Combine learning algorithms intelligently

[Diagram: Level-0 models (Decision Tree, Naïve Bayes, Instance-Based) feed their predictions into a Level-1 model, the Meta Learner]
90. Meta Learning
- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
- Comments
  - Level-1 learning: use a very simple algorithm (e.g., a linear model)
  - Can use cross-validation to allow the level-1 algorithm to train on all the data
91. Supervised Learning
- Two types of learning
- Classification
- Numerical prediction
- Classification learning algorithms
- Decision trees
- Naïve Bayes
- Instance-based learning
- Many others are part of Weka, browse!
92. Other Issues in Supervised Learning
- Evaluation
  - Accuracy: hold-out, bootstrap, cross-validation
  - Simplicity: MDL principle
  - Usefulness: cost-sensitive learning
- Metalearning
  - Bagging, Boosting, Stacking