Title: Learning Algorithm Evaluation
Outline
- Why?
  - Overfitting
- How?
  - Train/test split vs. cross-validation
- What?
  - Evaluation measures
- Who wins?
  - Statistical significance
Introduction
- A model should perform well on unseen data drawn
from the same distribution
Classification accuracy
- A performance measure
- Success: the instance's class is predicted correctly
- Error: the instance's class is predicted incorrectly
- Error rate: errors / instances
- Accuracy: successes / instances
- Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate? (see the sketch below)
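A minimal sketch of both measures, using the quiz numbers above (plain Python, no libraries):

```python
# Quiz numbers from the slide: 50 examples, 10 classified incorrectly.
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                 # errors / instances
accuracy = (n_instances - n_errors) / n_instances   # successes / instances

print(f"Accuracy:   {accuracy:.0%}")    # 80%
print(f"Error rate: {error_rate:.0%}")  # 20%
```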
Evaluation
Rule 1: Never evaluate on training data!
Train and Test
Step 1: Randomly split the data into a training set and a test set (e.g. 2/3 - 1/3); the test set is also known as the holdout set
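A minimal sketch of the holdout split, assuming scikit-learn; the iris dataset here is just a placeholder for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # example dataset, stands in for your own data

# Step 1: random 2/3 - 1/3 split into training and test (holdout) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=1/3,     # hold out one third for testing
    stratify=y,        # keep class proportions similar in both sets
    random_state=42,   # reproducible split
)
```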
Train and Test
Step 2: Train the model on the training data
Train and Test
Step 3: Evaluate the model on the test data
Train and Test
Quiz: Can I retry with other parameter settings?
Evaluation
Rule 1: Never evaluate on training data!
Rule 2: Never train on test data! (that includes parameter setting or feature selection)
Train and Test
Step 4: Optimize parameters on a separate validation set
Test data leakage
- Never use test data to create the classifier
- This can be tricky, e.g. with social network data
- The proper procedure uses three sets (see the sketch below):
  - training set: train models
  - validation set: optimize algorithm parameters
  - test set: evaluate the final model
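A minimal sketch of the three-set procedure, assuming scikit-learn; the dataset, the decision-tree learner, and the candidate depth values are hypothetical stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out the test set first -- it is never touched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Use the validation set to pick a parameter value (here: tree depth).
best_depth, best_acc = None, 0.0
for depth in (1, 2, 3, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Only the final, chosen model is evaluated on the test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```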
Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Trade-off: performance vs. evaluation accuracy
  - More training data: better model (but returns diminish)
  - More test data: more accurate error estimate
Train and Test
Step 5: Build the final model on ALL the data (more data, better model)
Cross-Validation
k-fold Cross-validation
- Split the data (stratified) into k folds
- Use k-1 folds for training, 1 for testing
- Repeat k times
- Average the results
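A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn; the dataset and classifier are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10 stratified folds: train on 9, test on 1, repeat 10 times, average.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy value per fold

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```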
Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10? Enough to reduce sampling bias
- Experimentally determined
Leave-One-Out Cross-validation
- A particular form of cross-validation
- The number of folds equals the number of instances
- With n instances, build the classifier n times
- Makes the best use of the data, no sampling bias
- Computationally expensive
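A minimal sketch of leave-one-out cross-validation, again assuming scikit-learn with a placeholder dataset and learner:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: the classifier is rebuilt n times,
# which can be slow on large datasets.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())  # fraction of held-out instances predicted correctly
```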
ROC Analysis
- Stands for Receiver Operating Characteristic
- From signal processing: the trade-off between hit rate and false alarm rate over a noisy channel
- Compute FPR and TPR and plot them in ROC space
- Every classifier is a point in ROC space
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold
  - Or, make the algorithm cost-sensitive and vary the costs (see below)
Confusion Matrix

                 actual +              actual -
predicted +      TP (true positive)    FP (false positive)
predicted -      FN (false negative)   TN (true negative)

TP rate (sensitivity) = TP / (TP + FN)
FP rate (fall-out)    = FP / (FP + TN)
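A minimal sketch computing the four counts and the two rates from hypothetical binary predictions (1 = positive, 0 = negative):

```python
# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

tpr = tp / (tp + fn)   # TP rate (sensitivity)
fpr = fp / (fp + tn)   # FP rate (fall-out)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}  TPR={tpr:.2f} FPR={fpr:.2f}")
```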
ROC space
[Figure: ROC space with individual classifiers plotted as points: J48 (parameters fitted), J48, and OneR]
ROC curves
- Obtain a curve by changing the prediction threshold t (predict positive if P(+) > t)
[Figure: ROC curve with Area Under Curve (AUC) = 0.75]
ROC curves
- Alternative method (easier, but less intuitive): rank the probabilities (see the sketch below)
  - Start the curve at (0,0) and move down the ranked probability list
  - If the instance is positive, move up; if negative, move right
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
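A minimal sketch of the ranking method just described; the labels and predicted probabilities are hypothetical:

```python
def roc_points(y_true, scores):
    """Walk the ROC curve: sort by decreasing probability of the positive
    class, then move up for each positive and right for each negative."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    x, y = 0.0, 0.0
    points = [(x, y)]
    for _, label in ranked:
        if label == 1:
            y += 1 / pos   # true positive: move up
        else:
            x += 1 / neg   # false positive: move right
        points.append((x, y))
    return points

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.2]
print(roc_points(y_true, scores))  # jagged curve from a single test set
```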
ROC curves: Method selection
- Overall: use the method with the largest Area Under the ROC Curve (AUROC)
- If you aim to cover just 40% of the true positives in a sample: use method A
- For a large sample: use method B
- In between: choose between A and B with appropriate probabilities
ROC Space and Costs
Different Costs
- In practice, FP and FN errors incur different costs
- Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve mortgage for X?
  - Promotional mailing: will X buy the product?
- Add a cost matrix to the evaluation that weighs TP, FP, ...

                 pred +        pred -
actual +         c_TP = 0      c_FN = 1
actual -         c_FP = 1      c_TN = 0
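A minimal sketch of cost-sensitive evaluation: weigh the confusion-matrix counts by the cost matrix. The counts and the cost values below are hypothetical placeholders you would set per application:

```python
def expected_cost(tp, fp, fn, tn, c_tp=0.0, c_fp=1.0, c_fn=1.0, c_tn=0.0):
    """Average cost per classified instance, given counts and a cost matrix."""
    n = tp + fp + fn + tn
    total = tp * c_tp + fp * c_fp + fn * c_fn + tn * c_tn
    return total / n

# e.g. a loan-decision setting where a missed default (FN) is 5x as costly as a FP
print(expected_cost(tp=30, fp=10, fn=5, tn=55, c_fn=5.0))  # 0.35
```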
Statistical Significance
Comparing data mining schemes
- Which of two learning algorithms performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- Problem: variance in the estimates
- Variance can be reduced using repeated CV (see the sketch below)
- However, we still don't know whether the results are reliable
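A minimal sketch of comparing two learners with repeated stratified 10-fold CV, assuming scikit-learn; the dataset and the two learners are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 10 repetitions of 10-fold CV: 100 scores per learner, lower-variance estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores_a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scores_b = cross_val_score(GaussianNB(), X, y, cv=cv)

print("A: %.3f +/- %.3f" % (scores_a.mean(), scores_a.std()))
print("B: %.3f +/- %.3f" % (scores_b.mean(), scores_b.std()))
```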
Significance tests
- Significance tests tell us how confident we can be that there really is a difference
- Null hypothesis: there is no real difference
- Alternative hypothesis: there is a difference
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis
- E.g., given 10 cross-validation scores: is B better than A?
[Figure: score distributions P(perf) for Algorithm A and Algorithm B plotted over perf, with mean A and mean B marked]
Paired t-test
- Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
- Use a paired t-test when the individual samples are paired, i.e., they use the same randomization
  - The same CV folds are used for both algorithms (see the sketch below)
William Gosset: born 1876 in Canterbury, died 1937 in Beaconsfield, England. He worked as a chemist in the Guinness brewery in Dublin from 1899, invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student".
Performing the test
- Fix a significance level α
- A significant difference at the α% level implies a (100-α)% chance that there really is a difference
- Scientific work: 5% or smaller (>95% certainty)
- Divide α by two (two-tailed test)
- Look up the z-value corresponding to α/2
- If t ≥ z or t ≤ -z, the difference is significant
  - the null hypothesis can be rejected
α (%)    z
0.1      4.3
0.5      3.25
1        2.82
5        1.83
10       1.38
20       0.88
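A minimal sketch of the procedure above: compute the paired t statistic by hand and compare it against the two-tailed critical value for 9 degrees of freedom (SciPy is used only for the critical-value lookup; the scores are the hypothetical ones from the previous sketch):

```python
import math
from scipy.stats import t as t_dist

scores_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
scores_b = [0.84, 0.82, 0.86, 0.83, 0.81, 0.85, 0.84, 0.83, 0.82, 0.84]

# Paired t statistic over the per-fold differences.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
k = len(diffs)                                    # number of folds
mean_d = sum(diffs) / k
var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
t_stat = mean_d / math.sqrt(var_d / k)

alpha = 0.05
crit = t_dist.ppf(1 - alpha / 2, df=k - 1)        # two-tailed critical value, 9 df
print(f"t = {t_stat:.2f}, critical value = {crit:.2f}")
if abs(t_stat) >= crit:
    print("significant: reject the null hypothesis")
else:
    print("not significant")
```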