Supervised Learning - PowerPoint Presentation Transcript

1
Supervised Learning
2
Introduction
  • Key idea
  • Known target concept (predict certain attribute)
  • Find out how other attributes can be used
  • Algorithms
  • Rudimentary Rules (e.g., 1R)
  • Statistical Modeling (e.g., Naïve Bayes)
  • Divide and Conquer: Decision Trees
  • Instance-Based Learning
  • Neural Networks
  • Support Vector Machines

3
1-Rule
  • Generate a one-level decision tree
  • One attribute
  • Performs quite well!
  • Basic idea
  • Rules testing a single attribute
  • Classify according to frequency in training data
  • Evaluate error rate for each attribute
  • Choose the best attribute
  • That's all, folks!
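
A minimal 1R sketch in Python; the dataset is assumed to be a list of dicts mapping attribute names to values, with class_attr naming the attribute to predict (names illustrative, not from the slides):

    from collections import Counter, defaultdict

    def one_r(instances, class_attr):
        """For each attribute build a one-level rule set; keep the one with fewest errors."""
        best = None
        attributes = [a for a in instances[0] if a != class_attr]
        for attr in attributes:
            counts = defaultdict(Counter)             # attribute value -> class frequencies
            for inst in instances:
                counts[inst[attr]][inst[class_attr]] += 1
            # Each value predicts its most frequent class in the training data.
            rules = {val: cnt.most_common(1)[0][0] for val, cnt in counts.items()}
            errors = sum(sum(cnt.values()) - max(cnt.values()) for cnt in counts.values())
            if best is None or errors < best[2]:
                best = (attr, rules, errors)
        return best                                   # (attribute, rules, training errors)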

4
The Weather Data (again)
5
Apply 1R
  Attribute        Rules               Errors   Total errors
  1 outlook        sunny → no          2/5      4/14
                   overcast → yes      0/4
                   rainy → yes         2/5
  2 temperature    hot → no            2/4      5/14
                   mild → yes          2/6
                   cool → yes          1/4
  3 humidity       high → no           3/7      4/14
                   normal → yes        1/7
  4 windy          false → yes         2/8      5/14
                   true → no           3/6

6
Other Features
  • Numeric Values
  • Discretization
  • Sort training data
  • Split range into categories
  • Missing Values
  • Dummy attribute

7
Naïve Bayes Classifier
  • Allow all attributes to contribute equally
  • Assumes
  • All attributes equally important
  • All attributes independent
  • Realistic?
  • Selection of attributes

8
Bayes Theorem
P(H | E) = P(E | H) · P(H) / P(E)
  • H: hypothesis; E: evidence
  • P(H | E): posterior probability, the conditional probability of H given E
  • P(H): prior
  • P(E): probability of the evidence
9
Maximum a Posteriori (MAP)
Maximum Likelihood (ML)
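These two criteria are standard; writing H for the set of hypotheses and E for the evidence (notation carried over from the previous slide), they can be stated as:

    h_{MAP} = \arg\max_{h \in H} P(h \mid E) = \arg\max_{h \in H} P(E \mid h)\,P(h)
    h_{ML}  = \arg\max_{h \in H} P(E \mid h)

ML is the special case of MAP in which all priors P(h) are taken to be equal.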
10
Classification
  • Want to classify a new instance (a1, a2, ..., an)
    into a finite number of categories from the set V.
  • Bayesian approach: assign the most probable
    category vMAP given (a1, a2, ..., an).
  • Can we estimate the probabilities from the
    training data?

11
Naïve Bayes Classifier
  • The second probability, P(vj), is easy to estimate
  • How?
  • The first probability, P(a1, ..., an | vj), is
    difficult to estimate
  • Why?
  • Assume independence (this is the naïve bit)
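
In standard notation (symbols assumed rather than taken from the slide), the independence assumption turns

    v_{MAP} = \arg\max_{v_j \in V} P(a_1, \dots, a_n \mid v_j)\,P(v_j)

into the naïve Bayes classifier

    v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)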

12
The Weather Data (yet again)
13
Estimation
  • Given a new instance with
  • outlook = sunny,
  • temperature = high,
  • humidity = high,
  • windy = true

14
Calculations continued
  • Similarly
  • Thus

15
Normalization
  • Note that we can normalize to get the
    probabilities
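
A worked version of the calculation, assuming the standard weather-data counts (9 yes and 5 no instances; among the yes instances 2 are sunny, 2 hot, 3 high humidity, 3 windy; among the no instances 3 are sunny, 2 hot, 4 high humidity, 3 windy):

    \text{likelihood}(yes) = \tfrac{2}{9}\cdot\tfrac{2}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9}\cdot\tfrac{9}{14} \approx 0.0035
    \text{likelihood}(no)  = \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{4}{5}\cdot\tfrac{3}{5}\cdot\tfrac{5}{14} \approx 0.0411
    P(yes \mid E) = \tfrac{0.0035}{0.0035 + 0.0411} \approx 0.08, \qquad P(no \mid E) \approx 0.92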

16
Problems ...
  • Suppose we had training data in which some attribute
    value never occurs together with one of the classes
  • The corresponding probability estimate is then zero,
    so the entire product is zero
  • Now what?

17
Laplace Estimator
  • Replace the frequency-based estimates
  • with smoothed estimates, as below
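
A minimal statement of the estimator, with n_c the number of training instances of the class that have the attribute value, n the number of training instances of the class, and k the number of values of the attribute (symbol names assumed):

    P(a_i \mid v) \approx \frac{n_c + 1}{n + k}

A common generalization is the m-estimate (n_c + m p) / (n + m) with prior p and weight m.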

18
Numeric Values
  • Assume a probability distribution for the numeric
    attributes → density f(x)
  • e.g., the normal distribution
  • or fit a distribution to the data (better)
  • Then proceed as before
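
For the normal choice, the density substituted for the frequency estimate is, with \mu and \sigma estimated per class from the training data:

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}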

19
Discussion
  • Simple methodology
  • Powerful - good results in practice
  • Missing values: no problem
  • Not so good if the independence assumption is
    severely violated
  • Extreme case: multiple attributes with the same
    values
  • Solutions
  • Preselect which attributes to use
  • Non-naïve Bayesian methods: Bayesian networks

20
Decision Tree Learning
  • Basic Algorithm
  • Select an attribute to be tested
  • If classification achieved, return the classification
  • Otherwise, branch by setting attribute to each of
    the possible values
  • Repeat with branch as your new tree
  • Main issue: how to select attributes

21
Deciding on Branching
  • What do we want to accomplish?
  • Make good predictions
  • Obtain simple-to-interpret rules
  • No diversity (impurity) is best
  • best case: all instances in the same class
  • worst case: all classes equally likely
  • Goal: select attributes to reduce impurity

22
Measuring Impurity/Diversity
  • Let's say we only have two classes
  • Minimum
  • Gini index/Simpson diversity index
  • Entropy
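
For two classes with proportions p and 1 - p, these measures are usually written as:

    \text{minimum:}\ \min(p, 1 - p) \qquad \text{Gini:}\ 2p(1 - p) \qquad \text{entropy:}\ -p\log_2 p - (1 - p)\log_2(1 - p)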

23
Impurity Functions
(Plot comparing the three impurity functions for two classes: entropy, Gini index, and minimum.)
24
Entropy
Notation: c, the number of classes; S, the training data (instances); pi, the proportion of S classified as class i
  • Entropy is a measure of impurity in the training
    data S
  • Measured in bits of information needed to encode
    a member of S
  • Extreme cases
  • All members have the same classification: entropy 0
    (note: 0 log 0 = 0)
  • All classifications equally frequent: entropy is maximal
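
In this notation (symbol names assumed):

    \text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

which is 0 in the first extreme case and \log_2 c, the maximum, in the second.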

25
Expected Information Gain
Values(a) denotes the set of all possible values for attribute a
Gain(S,a) is the expected information provided
about the classification from knowing the value
of attribute a (Reduction in number of bits
needed)
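In the usual ID3 notation, with S_v the subset of S for which attribute a takes value v (symbols assumed):

    \text{Gain}(S, a) = \text{Entropy}(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)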
26
The Weather Data (yet again)
27
Decision Tree Root Node
Outlook
  Sunny: Yes Yes No No No
  Overcast: Yes Yes Yes Yes
  Rainy: Yes Yes Yes No No
28
Calculating the Entropy
29
Calculating the Gain
Select outlook (largest gain)!
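
A worked version of these two slides, using the standard weather-data counts (9 yes, 5 no overall, with the branch counts shown on the root-node slide):

    \text{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}
    \text{Gain}(S, \text{outlook}) = 0.940 - \tfrac{5}{14}(0.971) - \tfrac{4}{14}(0) - \tfrac{5}{14}(0.971) \approx 0.247
    \text{Gain}(S, \text{temperature}) \approx 0.029 \qquad \text{Gain}(S, \text{humidity}) \approx 0.152 \qquad \text{Gain}(S, \text{windy}) \approx 0.048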
30
Next Level
Outlook = Sunny branch, candidate split on Temperature
  Hot: No No
  Mild: Yes No
  Cool: Yes
31
Calculating the Entropy
32
Calculating the Gain
Select humidity (largest gain)
33
Final Tree
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rainy → Windy
    True → No
    False → Yes
34
What's in a Tree?
  • Our final decision tree correctly classifies
    every instance
  • Is this good?
  • Two important concepts
  • Overfitting
  • Pruning

35
Overfitting
  • Two sources of abnormalities
  • Noise (randomness)
  • Outliers (measurement errors)
  • Chasing every abnormality causes overfitting
  • Tree too large and complex
  • Does not generalize to new data
  • Solution: prune the tree

36
Pruning
  • Prepruning
  • Halt construction of decision tree early
  • Use the same measure as when determining attributes,
    e.g., halt if InfoGain < K
  • Most frequent class becomes the leaf node
  • Postpruning
  • Construct complete decision tree
  • Prune it back
  • Prune to minimize expected error rates
  • Prune to minimize bits of encoding (Minimum
    Description Length principle)

37
Scalability
  • Need to design for large amounts of data
  • Two things to worry about
  • Large number of attributes
  • Leads to a large tree (prepruning?)
  • Takes a long time
  • Large amounts of data
  • Can the data be kept in memory?
  • Some new algorithms do not require all the data
    to be memory resident

38
Discussion Decision Trees
  • The most popular methods
  • Quite effective
  • Relatively simple
  • Have discussed in detail the ID3 algorithm
  • Information gain to select attributes
  • No pruning
  • Only handles nominal attributes

39
Selecting Split Attributes
  • Other univariate splits
  • Gain ratio: C4.5 algorithm (J48 in Weka)
  • CART (not in Weka)
  • Multivariate splits
  • May be possible to obtain better splits by
    considering two or more attributes simultaneously

40
Instance-Based Learning
  • Classification
  • Do not construct an explicit description of how to
    classify
  • Store all training data (learning)
  • New example: find the most similar instance
  • computing done at the time of classification
  • k-nearest neighbor

41
K-Nearest Neighbor
  • Each instance lives in n-dimensional space
  • Distance between instances
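
The distance is usually the Euclidean distance between instances x = (x_1, ..., x_n) and y = (y_1, ..., y_n):

    d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}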

42
Example: nearest neighbor
(Figure: query point xq surrounded by positive and negative training instances.)
1-Nearest neighbor? 6-Nearest neighbor?
43
Normalizing
  • Some attributes may take large values and others
    small
  • Normalize
  • so that all attributes are on an equal footing
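
One common choice (assumed here) is min-max normalization of each attribute value v:

    v' = \frac{v - \min v}{\max v - \min v}

with the minimum and maximum taken over the training values of that attribute.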

44
Other Methods for Supervised Learning
  • Neural networks
  • Support vector machines
  • Optimization
  • Rough set approach
  • Fuzzy set approach

45
Evaluating the Learning
  • Measure of performance
  • Classification error rate
  • Resubstitution error
  • Performance on training set
  • Poor predictor of future performance
  • Overfitting
  • Useless for evaluation

46
Test Set
  • Need a set of test instances
  • Independent of training set instances
  • Representative of underlying structure
  • Sometimes validation data
  • Fine-tune parameters
  • Independent of training and test data
  • Plentiful data - no problem!

47
Holdout Procedures
  • Common case data set large but limited
  • Usual procedure
  • Reserve some data for testing
  • Use remaining data for training
  • Problems
  • Want both sets as large as possible
  • Want both sets to be representative

48
"Smart" Holdout
  • Simple check Are the proportions of classes
    about the same in each data set?
  • Stratified holdout
  • Guarantee that classes are (approximately)
    proportionally represented
  • Repeated holdout
  • Randomly select holdout set several times and
    average the error rate estimates

49
Holdout w/ Cross-Validation
  • Cross-validation
  • Fixed number of partitions of the data (folds)
  • In turn, each partition is used for testing and the
    remaining instances for training
  • May use stratification and randomization
  • Standard practice
  • Stratified tenfold cross-validation
  • Instances divided randomly into the ten partitions
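
A minimal sketch of stratified tenfold cross-validation in Python; scikit-learn is assumed as the tool (the slides use Weka), and X, y are assumed to be numpy arrays:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def stratified_cv_error(X, y, n_splits=10, seed=0):
        """Estimate the error rate by stratified k-fold cross-validation."""
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        errors = []
        for train_idx, test_idx in skf.split(X, y):
            # Train on k-1 folds, test on the held-out fold.
            model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
            errors.append(1.0 - model.score(X[test_idx], y[test_idx]))
        return np.mean(errors), np.std(errors, ddof=1)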

50
Cross Validation
Fold 1: train a model on 90% of the data, test it on the remaining 10% → error rate e1
Fold 2: train a model on 90% of the data, test it on the remaining 10% → error rate e2
(and so on for each fold)
51
Cross-Validation
  • Final estimate of error
  • Quality of estimate
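
With fold error rates e_1, ..., e_k (k = 10 here), one standard choice is:

    \hat{e} = \frac{1}{k}\sum_{i=1}^{k} e_i \qquad \widehat{\text{Var}}(\hat{e}) \approx \frac{1}{k}\cdot\frac{1}{k-1}\sum_{i=1}^{k}(e_i - \hat{e})^2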

52
Leave-One-Out Holdout
  • n-Fold Cross-Validation (n = number of instances)
  • Use all but one instance for training
  • Maximum use of the data
  • Deterministic
  • High computational cost
  • Non-stratified sample

53
Bootstrap
  • Sample with replacement n times
  • Use as training data
  • Use instances not in training data for testing
  • How many test instances are there?
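
The reasoning behind the answer: each of the n draws misses any particular instance with probability 1 - 1/n, so

    P(\text{instance never drawn}) = \left(1 - \frac{1}{n}\right)^{n} \longrightarrow e^{-1} \approx 0.368 \quad (n \to \infty)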

54
0.632 Bootstrap
  • On average, e^(-1) · n ≈ 0.368 · n instances will be
    in the test set
  • Thus, on average, 63.2% of the instances are in the
    training set
  • Estimate the error rate as
  • e = 0.632 · e_test + 0.368 · e_train

55
Accuracy of our Estimate?
  • Suppose we observe s successes in a test set
    of n_test instances ...
  • We then estimate the success rate
  • R_success = s / n_test
  • Each instance is either a success or a failure
    (Bernoulli trial with success probability p)
  • Mean: p
  • Variance: p(1 - p)

56
Properties of Estimate
  • We have
  • E[R_success] = p
  • Var[R_success] = p(1 - p) / n_test
  • If n_test is large enough, the Central Limit
    Theorem (CLT) states that, approximately,
  • R_success ~ Normal(p, p(1 - p) / n_test)

57
Confidence Interval
  • CI for a standard normal variable: look up z in a
    table for the desired confidence level
  • Transform it into a CI for p
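
Using the simple normal-approximation interval (one common choice; z is the table value for the chosen confidence level):

    p \approx R_{\text{success}} \pm z\,\sqrt{\frac{R_{\text{success}}\,(1 - R_{\text{success}})}{n_{\text{test}}}}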
58
Comparing Algorithms
  • Know how to evaluate the results of our data
    mining algorithms (classification)
  • How should we compare different algorithms?
  • Evaluate each algorithm
  • Rank
  • Select best one
  • Don't know if this ranking is reliable

59
Assessing Other Learning
  • Developed procedures for classification
  • Association rules
  • Evaluated based on accuracy
  • Same methods as for classification
  • Numerical prediction
  • Error rate no longer applies
  • Same principles
  • use independent test set and hold-out procedures
  • cross-validation or bootstrap

60
Measures of Effectiveness
  • Need to compare
  • Predicted values p1, p2,..., pn.
  • Actual values a1, a2,..., an.
  • Most common measure
  • Mean-squared error
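
Over the n test instances:

    \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2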

61
Other Measures
  • Mean absolute error
  • Relative squared error
  • Relative absolute error
  • Correlation
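
Standard definitions of these measures, with \bar{a} and \bar{p} the means of the actual and predicted values (one common convention; normalizations vary):

    \text{MAE} = \frac{1}{n}\sum_i |p_i - a_i| \qquad \text{RSE} = \frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2} \qquad \text{RAE} = \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|}
    \text{correlation} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{\sqrt{\sum_i (p_i - \bar{p})^2 \sum_i (a_i - \bar{a})^2}}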

62
What to Do?
  • Large amounts of data
  • Hold-out 1/3 of data for testing
  • Train a model on 2/3 of data
  • Estimate error (or success) rate and calculate CI
  • Moderate amounts of data
  • Estimate error rate
  • Use 10-fold cross-validation with stratification,
  • or use bootstrap.
  • Train model on the entire data set

63
Predicting Probabilities
  • Classification into k classes
  • Predict probabilities p1, p2, ..., pn for each
    class.
  • Actual values a1, a2,..., an.
  • No longer 0-1 error
  • Quadratic loss function, where the target value is
    1 for the correct class and 0 for the others
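
With a_j = 1 for the correct class c and a_j = 0 otherwise, the loss for one instance is:

    \sum_j (p_j - a_j)^2 = \sum_{j \ne c} p_j^2 + (1 - p_c)^2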
64
Information Loss Function
  • Instead of the quadratic loss function, use the loss
  • where the j-th class is the correct one.
  • Information required to communicate which class
    is correct
  • in bits
  • with respect to the probability distribution
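
If class j is the correct one, the information loss for the instance is the number of bits needed to communicate that fact under the predicted distribution:

    -\log_2 p_j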

65
Occam's Razor
  • Given a choice of theories that are equally good
    the simplest theory should be chosen
  • Physical sciences: any theory should be
    consistent with all empirical observations
  • Data mining
  • theory = predictive model
  • good theory = good prediction
  • What is good? Do we minimize the error rate?

66
Minimum Description Length
  • MDL principle
  • Minimize
  • size of theory + info needed to specify
    exceptions
  • Suppose training set E is mined, resulting in a
    theory T
  • Want to minimize

67
Most Likely Theory
  • Suppose we want to maximize P(T | E)
  • Bayes' rule
  • Take logarithms

68
Information Function
  • Maximizing P(T | E) is equivalent to minimizing
  • [number of bits it takes to transmit the theory]
    + [number of bits it takes to transmit the exceptions]
  • That is, the MDL principle!
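
Spelling out the derivation across these slides, with L(·) denoting description length in bits (notation assumed):

    P(T \mid E) = \frac{P(E \mid T)\,P(T)}{P(E)}
    \log_2 P(T \mid E) = \log_2 P(E \mid T) + \log_2 P(T) - \log_2 P(E)
    \max_T P(T \mid E) \iff \min_T \big[-\log_2 P(E \mid T) - \log_2 P(T)\big] = \min_T \big[L(E \mid T) + L(T)\big]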
69
Applications to Learning
  • Classification, association, numeric prediction
  • Several predictive models with 'similar' error
    rate (usually as small as possible)
  • Select between them using Occam's razor
  • Simplicity subjective
  • Use MDL principle
  • Clustering
  • Important learning that is difficult to evaluate
  • Can use MDL principle

70
Comparing Mining Algorithms
  • Know how to evaluate the results
  • Suppose we have two algorithms
  • Obtain two different models
  • Estimate the error rates e(1) and e(2).
  • Compare estimates
  • Select the better one
  • Problem?

71
Weather Data Example
  • Suppose we learn the rule
  • If outlook = rainy then play = yes
  • Otherwise play = no
  • Test it on the following test set
  • Have zero error rate

72
Different Test Set 2
  • Again, suppose we learn the rule
  • If outlook = rainy then play = yes
  • Otherwise play = no
  • Test it on a different test set
  • Have 100% error rate!

73
Comparing Random Estimates
  • Estimated error rate is just an estimate (random)
  • Need variance as well as point estimates
  • Construct a t-test statistic

H0: the difference in error rates is 0
The statistic uses the average of the differences in error rates and their estimated standard deviation
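
With paired differences d_i = e_i^{(1)} - e_i^{(2)} over k folds or runs (notation assumed):

    t = \frac{\bar{d}}{\sqrt{s_d^2 / k}} \qquad \bar{d} = \frac{1}{k}\sum_i d_i \qquad s_d^2 = \frac{1}{k-1}\sum_i (d_i - \bar{d})^2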
74
Discussion
  • Now know how to compare two learning algorithms
    and select the one with the better error rate
  • We also know how to select the simplest model that
    has a 'comparable' error rate
  • Is it really better?
  • Minimising error rate can be misleading

75
Examples of 'Good Models'
  • Application: loan approval
  • Model: no applicants default on loans
  • Evaluation: simple, low error rate
  • Application: cancer diagnosis
  • Model: all tumors are benign
  • Evaluation: simple, low error rate
  • Application: information assurance
  • Model: all visitors to the network are well
    intentioned
  • Evaluation: simple, low error rate

76
What's Going On?
  • Many (most) data mining applications can be
    thought about as detecting exceptions
  • Ignoring the exceptions does not significantly
    increase the error rate!
  • Ignoring the exceptions often leads to a simple
    model!
  • Thus, we can find a model that we evaluate as
    good but completely misses the point
  • Need to account for the cost of error types

77
Accounting for Cost of Errors
  • Explicit modeling of the cost of each error
  • costs may not be known
  • often not practical
  • Look at trade-offs
  • visual inspection
  • semi-automated learning
  • Cost-sensitive learning
  • assign costs to classes a priori

78
Explicit Modeling of Cost
Confusion Matrix (Displayed in Weka)
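
For two classes the matrix has the familiar layout below (cell names are the conventional ones, not taken from the slide); explicit cost modeling weights each cell count by its cost and sums:

                   predicted yes      predicted no
    actual yes     true positive      false negative
    actual no      false positive     true negative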
79
Cost Sensitive Learning
  • Have used cost information to evaluate learning
  • Better: use cost information to learn
  • Simple idea
  • Increase the number (or weight) of instances that
    demonstrate important behavior (e.g., those classified
    as exceptions)
  • Applies for any learning algorithm

80
Discussion
  • Evaluate learning
  • Estimate error rate
  • Minimum description length principle / Occam's razor
  • Comparison of algorithms
  • Based on evaluation
  • Make sure difference is significant
  • Cost of making errors may differ
  • Use evaluation procedures with caution
  • Incorporate into learning

81
Engineering the Output
  • Prediction based on one model
  • Model performs well on one training set, but
    poorly on others
  • New data becomes available → new model
  • Combine models
  • Bagging
  • Boosting
  • Stacking


Improves prediction but complicates structure
82
Bagging
  • Bias: error despite all the data in the world!
  • Variance: error due to limited data
  • Intuitive idea of bagging
  • Assume we have several data sets
  • Apply learning algorithm to each set
  • Vote on the prediction (classification/numeric)
  • What type of error does this reduce?
  • When is this beneficial?

83
Bootstrap Aggregating
  • In practice only one training data set
  • Create many sets from one
  • Sample with replacement (remember the bootstrap)
  • Does this work?
  • Often gives improvements in predictive
    performance
  • Never degrades performance
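
A minimal bagging sketch in Python, assuming scikit-learn decision trees as the base learner and numpy arrays X, y with integer class labels (all names illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_models=25, seed=0):
        """Train one model per bootstrap replicate of the training data."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)      # sample n instances with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Classify each instance by majority vote over the bagged models."""
        votes = np.array([m.predict(X) for m in models])   # shape (n_models, n_instances)
        return np.array([np.bincount(col).argmax() for col in votes.T])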

84
Boosting
  • Assume a stable learning procedure
  • Low variance
  • Bagging does very little
  • Combine structurally different models
  • Intuitive motivation
  • Any given model may be good for a subset of the
    training data
  • Encourage models to explain part of the data

85
AdaBoost.M1
  • Generate models
  • Assign equal weight to each training instance
  • Iterate
  • Apply the learning algorithm and store the model
  • e = error on the weighted training instances
  • If e = 0 or e ≥ 0.5, terminate
  • For every instance
  • If classified correctly, multiply its weight by
    e / (1 - e)
  • Normalize the weights
  • Until STOP

86
AdaBoost.M1
  • Classification
  • Assign zero weight to each class
  • For every model
  • Add log((1 - e) / e), the model's weight,
  • to the class predicted by the model
  • Return class with highest weight
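
A minimal AdaBoost.M1 sketch following the two slides above, assuming scikit-learn decision stumps as the weak learner and numpy arrays X, y with integer class labels 0..n_classes-1 (all names illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1_fit(X, y, n_rounds=10):
        """Generate models: reweight the training instances after each round."""
        n = len(X)
        w = np.full(n, 1.0 / n)                       # equal weight to each instance
        ensemble = []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            wrong = stump.predict(X) != y
            e = w[wrong].sum() / w.sum()              # error on the weighted data
            if e == 0 or e >= 0.5:                    # termination condition
                break
            beta = e / (1 - e)
            w[~wrong] *= beta                         # shrink weights of correct instances
            w /= w.sum()                              # normalize the weights
            ensemble.append((stump, np.log(1.0 / beta)))   # voting weight log((1-e)/e)
        return ensemble

    def adaboost_m1_predict(ensemble, X, n_classes):
        """Classification: add each model's weight to the class it predicts."""
        scores = np.zeros((len(X), n_classes))
        for model, weight in ensemble:
            scores[np.arange(len(X)), model.predict(X)] += weight
        return scores.argmax(axis=1)                  # class with the highest total weight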

87
Performance Analysis
  • Error of combined classifier converges to zero at
    an exponential rate (very fast)
  • Questionable value due to possible overfitting
  • Must use independent test data
  • Fails on test data if
  • Classifier more complex than training data
    justifies
  • Training error becomes too large too quickly
  • Must achieve balance between model complexity and
    the fit to the data

88
Fitting versus Overfitting
  • Overfitting very difficult to assess here
  • Assume we have reached zero error
  • May be beneficial to continue boosting!
  • Occam's razor?
  • Build complex models from simple ones
  • Boosting offers very significant improvement
  • Can hope for more improvement than bagging
  • Can degrade performance
  • Never happens with bagging

89
Stacking
  • Models of different types
  • Meta learner
  • Learn which learning algorithms are good
  • Combine learning algorithms intelligently

Level-0 models: decision tree, naïve Bayes, instance-based learner
Level-1 model: meta learner
90
Meta Learning
  • Holdout part of the training set
  • Use remaining data for training level-0 methods
  • Use holdout data to train level-1 learning
  • Retrain level-0 algorithms with all the data
  • Comments
  • Level-1 learning: use a very simple algorithm
    (e.g., a linear model)
  • Can use cross-validation to allow level-1
    algorithms to train on all the data

91
Supervised Learning
  • Two types of learning
  • Classification
  • Numerical prediction
  • Classification learning algorithms
  • Decision trees
  • Naïve Bayes
  • Instance-based learning
  • Many others are part of Weka, browse!

92
Other Issues in Supervised Learning
  • Evaluation
  • Accuracy: hold-out, bootstrap, cross-validation
  • Simplicity: MDL principle
  • Usefulness: cost-sensitive learning
  • Metalearning
  • Bagging, Boosting, Stacking