Title: Evaluation of Learning Models
Evaluation of Learning Models
- Literature
- T. Mitchell, Machine Learning, chapter 5
- I.H. Witten and E. Frank, Data Mining, chapter 5
Fayyad's KDD Methodology
- [Figure: the KDD process, starting from data]
Contents
- Estimation of errors for one hypothesis
- Comparison of hypotheses
- Comparison of learning models
- Practical aspects
Two definitions of error
- True error of hypothesis h with respect to target
function f and distribution D is the probability
that h will misclassify an instance drawn at
random according to D.
Two definitions of error (2)
- Sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:
  $\text{error}_S(h) = \frac{1}{n}\sum_{x \in S}\delta(f(x) \neq h(x))$
  where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.
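As a minimal illustration of the sample error (the names sample_error, h, f, and S are placeholders for any classifier, target function, and sample, not part of the original slides):

```python
def sample_error(h, f, S):
    """Sample error: the proportion of examples in S that h misclassifies."""
    mistakes = sum(1 for x in S if h(x) != f(x))
    return mistakes / len(S)

# Toy usage: target is "x is positive", hypothesis always answers True
S = [-2, -1, 0, 1, 2, 3]
print(sample_error(lambda x: True, lambda x: x > 0, S))  # 0.5
```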
Two definitions of error (3)
- How well does errorS(h) estimate errorD(h)?
Problems estimating error
- 1. Bias: if S is the training set, errorS(h) is optimistically biased. For an unbiased estimate, h and S must be chosen independently.
- 2. Variance: even with an unbiased S, errorS(h) may still vary from errorD(h).
Example
- Hypothesis h misclassifies 12 of the 40 examples in S. What is errorD(h)?
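The sample error in this example can be computed directly; errorD(h) itself cannot be read off the sample, which is what the estimators and confidence intervals on the following slides address:

```latex
\text{error}_S(h) = \frac{12}{40} = 0.30
```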
Estimators
- Experiment:
  1. choose sample S of size n according to distribution D
  2. measure errorS(h)
- errorS(h) is a random variable (i.e., the result of an experiment)
- errorS(h) is an unbiased estimator for errorD(h)
- Given the observed errorS(h), what can we conclude about errorD(h)?
Confidence intervals
- If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
- Then
- with approximately 95% probability, errorD(h) lies in the interval
  $\text{error}_S(h) \pm 1.96\sqrt{\text{error}_S(h)(1 - \text{error}_S(h))/n}$
Confidence intervals (2)
- If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
- Then
- with approximately N% probability, errorD(h) lies in the interval
  $\text{error}_S(h) \pm z_N\sqrt{\text{error}_S(h)(1 - \text{error}_S(h))/n}$
  where

  N%   50    68    80    90    95    98    99
  zN   0.67  1.00  1.28  1.64  1.96  2.33  2.58
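A small sketch of computing this interval (the z-values are taken from the table above; the helper name and the usage numbers are illustrative):

```python
import math

# z-values for common confidence levels, as in the table above
Z = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64,
     0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(error_s, n, level=0.95):
    """Approximate N% confidence interval for errorD(h),
    given sample error errorS(h) measured on n >= 30 test examples."""
    z = Z[level]
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# The earlier example: 12 errors on 40 examples
print(error_confidence_interval(12 / 40, 40))   # roughly (0.16, 0.44)
```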
errorS(h) is a random variable
- Rerun the experiment with a different randomly drawn S (of size n)
- Probability of observing r misclassified examples:
  $P(r) = \binom{n}{r}\,\text{error}_D(h)^r\,(1 - \text{error}_D(h))^{n-r}$
Binomial probability distribution
- Probability P(r) of r heads in n coin flips, if p = Pr(heads):
  $P(r) = \binom{n}{r}\,p^r\,(1 - p)^{n-r}$
- [Figure: Binomial distribution for n = 10 and p = 0.3]
Binomial probability distribution (2)
- Expected, or mean value of X, E[X], is $E[X] = np$
- Variance of X is $Var(X) = np(1 - p)$
- Standard deviation of X, $\sigma_X$, is $\sigma_X = \sqrt{np(1 - p)}$
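The mean and variance formulas can be checked directly against the binomial probabilities; a minimal sketch (function name is illustrative):

```python
from math import comb

def binom_pmf(r, n, p):
    """P(r): probability of exactly r successes (e.g. misclassifications) in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The distribution shown on the previous slide: n = 10, p = 0.3
probs = [binom_pmf(r, 10, 0.3) for r in range(11)]
mean = sum(r * pr for r, pr in zip(range(11), probs))                # np = 3.0
var = sum((r - mean) ** 2 * pr for r, pr in zip(range(11), probs))   # np(1-p) = 2.1
```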
Normal distribution approximates binomial
- errorS(h) follows a Binomial distribution, with
- mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
- standard deviation $\sigma_{\text{error}_S(h)} = \sqrt{\text{error}_D(h)(1 - \text{error}_D(h))/n}$
- Approximate this by a Normal distribution with
- mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
- standard deviation $\sigma_{\text{error}_S(h)} \approx \sqrt{\text{error}_S(h)(1 - \text{error}_S(h))/n}$
Normal probability distribution
- $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$
- Expected, or mean value of X, E[X], is $E[X] = \mu$
- Variance of X is $Var(X) = \sigma^2$
- Standard deviation of X, $\sigma_X$, is $\sigma_X = \sigma$
Normal probability distribution (2)
- 80% of the area (probability) lies in $\mu \pm 1.28\sigma$
- N% of the area (probability) lies in $\mu \pm z_N\sigma$

  N%   50    68    80    90    95    98    99
  zN   0.67  1.00  1.28  1.64  1.96  2.33  2.58
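The table can be reproduced from the Normal distribution's error function; a quick check, not part of the original slides:

```python
from math import erf, sqrt

def normal_mass_within(z):
    """Fraction of a Normal distribution's probability mass within mu +/- z*sigma."""
    return erf(z / sqrt(2))

for z in (0.67, 1.00, 1.28, 1.64, 1.96, 2.33, 2.58):
    print(f"z = {z:.2f}: {100 * normal_mass_within(z):.0f}%")   # 50%, 68%, ..., 99%
```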
Confidence intervals, more correctly
- If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
- Then
- with approximately 95% probability, errorS(h) lies in the interval
  $\text{error}_D(h) \pm 1.96\sqrt{\text{error}_D(h)(1 - \text{error}_D(h))/n}$
- and errorD(h) approximately lies in the interval
  $\text{error}_S(h) \pm 1.96\sqrt{\text{error}_S(h)(1 - \text{error}_S(h))/n}$
Central Limit Theorem
- Consider a set of independent, identically distributed random variables Y1...Yn, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.
- Central Limit Theorem: as $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution, with mean $\mu$ and variance $\sigma^2/n$.
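A small simulation illustrating the theorem; the exponential distribution is just an arbitrary choice with finite variance (mean 1, variance 1), not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # size of each sample
samples = rng.exponential(scale=1.0, size=(10_000, n))
sample_means = samples.mean(axis=1)      # one sample mean per repeated experiment
print(sample_means.mean())               # close to mu = 1
print(sample_means.std())                # close to sigma / sqrt(n) ~= 0.14
```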
Difference between hypotheses
- Test h1 on sample S1, test h2 on S2
- 1. Pick the parameter to estimate: $d \equiv \text{error}_D(h_1) - \text{error}_D(h_2)$
- 2. Choose an estimator: $\hat{d} \equiv \text{error}_{S_1}(h_1) - \text{error}_{S_2}(h_2)$
- 3. Determine the probability distribution that governs the estimator: $\hat{d}$ is approximately Normal, with mean $d$ and
  $\sigma_{\hat{d}} \approx \sqrt{\frac{\text{error}_{S_1}(h_1)(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)(1-\text{error}_{S_2}(h_2))}{n_2}}$
Difference between hypotheses (2)
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
  $\hat{d} \pm z_N\,\sigma_{\hat{d}}$
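Putting steps 1-4 together as a sketch under the Normal approximation from the earlier slides (the function name and the numbers in the usage line are made up):

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Approximate confidence interval for errorD(h1) - errorD(h2),
    given sample errors e1 (on n1 examples) and e2 (on n2 examples)."""
    d_hat = e1 - e2                                   # step 2: the estimator
    sigma = math.sqrt(e1 * (1 - e1) / n1 +            # step 3: its approximate
                      e2 * (1 - e2) / n2)             # standard deviation
    return d_hat - z * sigma, d_hat + z * sigma       # step 4: interval (L, U)

print(difference_interval(0.30, 100, 0.20, 100))      # roughly (-0.02, 0.22)
```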
Paired t test to compare hA, hB
- 1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do $\delta_i \leftarrow \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
- 3. Return the value $\bar{\delta}$, where $\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$
Paired t test to compare hA, hB (2)
- N% confidence interval estimate for d:
  $\bar{\delta} \pm t_{N,k-1}\,s_{\bar{\delta}}$, where
  $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$
- Note: $\bar{\delta}$ is approximately normally distributed
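The same procedure as a minimal sketch (errors_a, errors_b, and t_value are illustrative names; the t-value would be looked up for the desired confidence level and k-1 degrees of freedom):

```python
import math

def paired_t_interval(errors_a, errors_b, t_value):
    """Confidence interval for the mean error difference between hA and hB.

    errors_a[i] and errors_b[i] are the error rates of hA and hB on the same
    test set Ti; t_value is t(N, k-1) for the chosen confidence level N."""
    k = len(errors_a)
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return d_bar - t_value * s, d_bar + t_value * s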
Comparing learning algorithms LA and LB
- What we'd like to estimate:
  $E_{S \subset D}\big[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))\big]$,
  where L(S) is the hypothesis output by learner L using training set S
- I.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.
Comparing learning algorithms LA and LB (2)
- But, given limited data D0, what is a good estimator?
- Could partition D0 into training set S0 and test set T0, and measure
  $\text{error}_{T_0}(L_A(S_0)) - \text{error}_{T_0}(L_B(S_0))$
- Even better: repeat this many times and average the results
Comparing learning algorithms LA and LB (3)
k-fold cross-validation
- 1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do: use Ti for the test set, and the remaining data for the training set Si
- 3. Return the average of the errors on the test sets
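A sketch of this procedure applied to two learners, averaging the per-fold error difference; the fit/predict interface is an assumption (scikit-learn style), not part of the original slides:

```python
import numpy as np

def cv_error_difference(learner_a, learner_b, X, y, k=10, seed=0):
    """k-fold estimate of the error difference between learners LA and LB."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    deltas = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors = []
        for learner in (learner_a, learner_b):
            learner.fit(X[train], y[train])
            errors.append(np.mean(learner.predict(X[test]) != y[test]))
        deltas.append(errors[0] - errors[1])      # error of LA minus error of LB on Ti
    return float(np.mean(deltas))                 # average over the k test sets
```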
Practical Aspects: A note on parameter tuning
- It is important that the test data is not used in any way to create the classifier
- Some learning schemes operate in two stages:
- Stage 1 builds the basic structure
- Stage 2 optimizes parameter settings
- The test data can't be used for parameter tuning!
- The proper procedure uses three sets: training data, validation data, and test data
- Validation data is used to optimize parameters (see the split sketch below)
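A minimal sketch of the three-set procedure; the split fractions and function name are arbitrary example values, not prescribed by the slides:

```python
import numpy as np

def three_way_split(X, y, frac_valid=0.2, frac_test=0.2, seed=0):
    """Split data into training, validation (parameter tuning) and test sets.

    The test set must stay untouched until the final evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(frac_test * len(y))
    n_valid = int(frac_valid * len(y))
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])
```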
Holdout estimation, stratification
- What shall we do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
- Usually one third for testing, the rest for training
- Problem: the samples might not be representative
- Example: a class might be missing in the test data
- Advanced version uses stratification
- Ensures that each class is represented with approximately equal proportions in both subsets
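One way to obtain a stratified holdout split, assuming scikit-learn is available and X, y hold the features and class labels:

```python
from sklearn.model_selection import train_test_split

# Stratified holdout: one third for testing, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```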
More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
- There is also some theoretical evidence for this
- Stratification reduces the estimate's variance
- Even better: repeated stratified cross-validation
- E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
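Repeated stratified ten-fold cross-validation as a sketch with scikit-learn; classifier, X, and y are assumed to exist:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Ten-fold stratified cross-validation, repeated ten times; results are averaged
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(classifier, X, y, cv=cv)
print(scores.mean(), scores.std())
```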
Issues in evaluation
- Statistical reliability of estimated differences in performance
- Choice of performance measure:
- Number of correct classifications
- Accuracy of probability estimates
- Error in numeric predictions
- Costs assigned to different types of errors
- Many practical applications involve costs
Counting the costs
- In practice, different types of classification errors often incur different costs
- Examples:
- Predicting when cows are in heat ("in estrus")
- "Not in estrus" is correct 97% of the time
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing
Taking costs into account
- The confusion matrix (a worked example follows below)
- There are many other types of costs!
- E.g. costs of collecting training data
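A small numeric illustration of combining a confusion matrix with a cost matrix; all counts and costs below are made up:

```python
import numpy as np

# Rows: actual class (yes, no); columns: predicted class (yes, no)
confusion = np.array([[60,  10],     # 60 true positives, 10 false negatives
                      [20, 910]])    # 20 false positives, 910 true negatives

# Same layout: correct predictions cost nothing; here a false negative is
# assumed to cost 10 units and a false positive 1 unit
cost = np.array([[0, 10],
                 [1,  0]])

total_cost = int((confusion * cost).sum())     # 10*10 + 20*1 = 120
average_cost = total_cost / confusion.sum()    # cost per classified instance
print(total_cost, average_cost)
```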
Lift charts
- In practice, costs are rarely known
- Decisions are usually made by comparing possible scenarios
- Example: promotional mailout
- Situation 1: classifier predicts that 0.1% of all households will respond
- Situation 2: classifier predicts that 0.4% of the 10000 most promising households will respond
- A lift chart allows for a visual comparison
Generating a lift chart
- Instances are sorted according to their predicted probability of being a true positive:

  Rank  Predicted probability  Actual class
  1     0.95                   Yes
  2     0.93                   Yes
  3     0.93                   No
  4     0.88                   Yes

- In the lift chart, the x axis is the sample size and the y axis is the number of true positives
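A sketch of computing the points of such a chart; predicted_prob and actual are illustrative names, and the toy data mirrors the table above:

```python
import numpy as np

def lift_chart_points(predicted_prob, actual):
    """Sort instances by predicted probability of being positive, then count
    true positives in increasingly large samples taken from the top."""
    order = np.argsort(-np.asarray(predicted_prob))    # highest probability first
    positives = (np.asarray(actual)[order] == 1)
    sample_size = np.arange(1, len(order) + 1)         # x axis
    true_positives = np.cumsum(positives)              # y axis
    return sample_size, true_positives

print(lift_chart_points([0.95, 0.93, 0.93, 0.88], [1, 1, 0, 1]))  # 1 = Yes, 0 = No
```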
A hypothetical lift chart
Summary of measures
Model selection criteria
- Model selection criteria attempt to find a good compromise between
- A. The complexity of a model
- B. Its prediction accuracy on the training data
- Reasoning: a good model is a simple model that achieves high accuracy on the given data
- Also known as Occam's Razor: the best theory is the smallest one that describes all the facts
Warning
- Suppose you are gathering hypotheses that each have a probability of 95% of having an error level below 10%
- What if you have found 100 hypotheses satisfying this condition?
- Then the probability that all of them have an error below 10% is equal to (0.95)^100 ≈ 0.006, i.e. about 0.6%. So the probability of having at least one hypothesis with an error above 10% is about 99.4%!
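A quick numeric check of this multiple-testing effect (a minimal sketch; the variable names are illustrative):

```python
p_each_good = 0.95          # each hypothesis: 95% chance its error is below 10%
k = 100                     # number of hypotheses found
p_all_good = p_each_good ** k
print(p_all_good)           # ~0.006: probability that all 100 stay below 10% error
print(1 - p_all_good)       # ~0.994: probability that at least one exceeds 10% error
```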