Title: NoFreeLunch
1. No-Free-Lunch
2. Question
- Will there ever be a single best learning algorithm?
3. Statement I
- "... all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions." (Wolpert and Macready, 1995)
4. Statement II
- A general-purpose universal optimization strategy is theoretically impossible; the only way one strategy can outperform another is if it is specialized to the specific problem under consideration
5. How to measure a classifier?
- The main interest of the NFL theory is off-training-set error
- The off-training set is a test set which doesn't overlap the training set
- This is not encompassed by standard Bayesian analysis, sampling theory statistics, or other computational learning theory approaches
6. Wolpert
- Extended Bayesian Formalism (EBF)
- Capable of addressing the off-training-set error
- Normal Bayesian learning allows for one hypothesis, the classifier
- The EBF extends the Bayesian learning framework so that it allows for all possible hypotheses, subsuming PAC, the VC framework and sampling theory statistics
7. A Metric
- The averaged performance of a classifier over all possible cost functions can be used as a metric to determine the similarity between two classifiers
- To calculate this distance between two classifiers, the EBF is employed
8. Extended Bayesian Formalism 1/3
- A theoretical framework
- d: a training set of m elements
- f: the target input-output relation, which generates d
- c: the generalization error
- h: a hypothesis, the algorithm's guess for f
- Traditional Bayes learner
- find f such that P(f|d) ∝ P(d|f) P(f) is maximized
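As a concrete sketch of the traditional Bayes learner above (my own toy construction: the three-point domain, noise-free likelihood and uniform prior are all illustrative assumptions, not from the slides), the Python below computes P(f|d) ∝ P(d|f) P(f) over every Boolean target function on a tiny input space:

```python
from itertools import product

X = [0, 1, 2]                                   # tiny input space (assumed)
FS = list(product([0, 1], repeat=len(X)))       # all Boolean target functions f
d = {0: 1, 1: 0}                                # training set of m=2 pairs

def likelihood(f, d):
    """P(d|f): 1 if f reproduces the training labels, else 0 (noise-free)."""
    return 1.0 if all(f[x] == y for x, y in d.items()) else 0.0

prior = {f: 1.0 / len(FS) for f in FS}          # uniform P(f)
post = {f: likelihood(f, d) * prior[f] for f in FS}
z = sum(post.values())
post = {f: p / z for f, p in post.items()}      # P(f|d) ∝ P(d|f) P(f)

for f, p in sorted(post.items()):
    if p > 0:
        print(f, p)   # every f consistent with d gets equal posterior mass
```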
9. Extended Bayesian Formalism 2/3
- Presume an input space X and an output space Y, which are mapped by a function f
- Let h_A and h_B be the two hypothesis input-output distributions generated by our two learning algorithms in response to a (random variable) training set d consisting of m pairs of X-Y values
- The behavior of learners A and B is captured by the distributions P(h_A|d) and P(h_B|d)
10. Extended Bayesian Formalism 3/3
- Any specific learning algorithm can now be given by the conditional probability distribution P(h|d)
- The conventional Bayesian framework is recovered by fixing P(h|d) to be the Bayes-optimal choice associated with P(f|d)
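To make the P(h|d) view concrete, here is a rough sketch (my own construction, not from the slides) of two deterministic learners; each induces a point-mass distribution P(h|d) concentrated on the single hypothesis it outputs, while a stochastic learner would spread that mass over several hypotheses:

```python
def learner_a(d, X):
    """Predicts the majority training label on every unseen x."""
    maj = round(sum(d.values()) / len(d))
    return tuple(d.get(x, maj) for x in X)

def learner_b(d, X):
    """Anti-learner: predicts the opposite of the majority label off d."""
    maj = round(sum(d.values()) / len(d))
    return tuple(d.get(x, 1 - maj) for x in X)

def p_h_given_d(learner, h, d, X):
    """P(h|d) for a deterministic learner: a point mass on its output."""
    return 1.0 if h == learner(d, X) else 0.0

X = [0, 1, 2, 3]
d = {0: 1, 1: 1}                                     # training set: x -> label
print(learner_a(d, X))                               # (1, 1, 1, 1)
print(p_h_given_d(learner_a, (1, 1, 1, 1), d, X))    # 1.0
```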
11. Generalization performance
- c(f,h,d) = Σ_{x ∉ d} P(x) [1 − δ(f(x), h(x))]
- where c(f,h,d) is the expected off-training-set error for zero-one loss and a uniform sampling distribution P(x)
- Now we can use the generalization performance as a measure of similarity/distance
- E(K(C_A − C_B)|f,d), where K is a distance function
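A minimal sketch of c(f,h,d) as defined above, assuming a finite domain, zero-one loss and a uniform sampling distribution over the points outside the training set (the concrete f, h and d are illustrative):

```python
def ots_error(f, h, d, X):
    """Zero-one loss of h against f, averaged uniformly over the x not in d."""
    off = [x for x in X if x not in d]
    return sum(f[x] != h[x] for x in off) / len(off)

f = (1, 0, 1, 1)          # target, as a tuple of labels indexed by x
h = (1, 0, 0, 1)          # hypothesis
d = {0: 1, 1: 0}          # training set: x -> label
print(ots_error(f, h, d, [0, 1, 2, 3]))   # 0.5: one disagreement on the two off-training points
```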
12. Theorem I
- The performance of h is a measure of how well P(h|d) and P(f|d) are aligned
- This is expressed as the inner product E(C|d) = Σ_{h,f} P(h|d) P(f|d) c(f,h,d)
13. Theorem II
- For any two learning algorithms P1(h|d) and P2(h|d), independent of the sampling distribution, the following holds for the off-training-set error:
- Uniformly averaged over all f, E1(C|f,m) − E2(C|f,m) = 0
- Uniformly averaged over all f, for any training set d, E1(C|f,d) − E2(C|f,d) = 0
- Uniformly averaged over all P(f), E1(C|m) − E2(C|m) = 0
- Uniformly averaged over all P(f), for any training set d, E1(C|d) − E2(C|d) = 0
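Theorem II can be checked by brute force on a toy problem. The sketch below (my own construction, not from the slides) uniformly averages the off-training-set error over all targets f consistent with a fixed training set d, for two deliberately opposite learners, and both come out exactly equal:

```python
from itertools import product

X = [0, 1, 2, 3]
d = {0: 1, 1: 0}                                  # fixed training set

def ots_error(f, h):
    """Zero-one loss averaged uniformly over the points outside d."""
    off = [x for x in X if x not in d]
    return sum(f[x] != h[x] for x in off) / len(off)

def learner_1(d):
    """Reproduces the training labels, predicts 0 everywhere else."""
    return tuple(d.get(x, 0) for x in X)

def learner_2(d):
    """Reproduces the training labels, predicts 1 everywhere else."""
    return tuple(d.get(x, 1) for x in X)

targets = [f for f in product([0, 1], repeat=len(X))
           if all(f[x] == y for x, y in d.items())]  # all f consistent with d

for learner in (learner_1, learner_2):
    h = learner(d)
    avg = sum(ots_error(f, h) for f in targets) / len(targets)
    print(learner.__name__, avg)                  # both print 0.5
```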
14. Theorem II
- In other words, by any of the posed measures, any two learning algorithms are the same: there are just as many situations (appropriately weighted) in which algorithm 1 is superior to algorithm 2 as vice versa
- There is no learning algorithm that has better performance for all f; however, some learning algorithms have higher performance for a particular (class of) f, which is a very small part of all possible fs
15. Intuition Behind Theorem II
- If no assumption (bias) is made about the outcome of the learner, then P(f) is uniform over all f
- We are concerned with E(C|d) for some particular d
- Since P(f) is uniform, all f agreeing with d are equally probable
- So all patterns of f outside the training set are equally probable
- Therefore training a classifier on the training-set pattern gains no information about off-training-set examples (see the sketch after this list)
- → A bias is required to train a classifier!
16. [Figure] An illustration of the no-free-lunch theorem, showing the performance of a highly specialized algorithm (red) and a general-purpose one (blue). Both algorithms perform equally well, on average, over all problems
17.
- The theorem is used as an argument against generic search algorithms such as genetic algorithms and simulated annealing when they are employed with as little domain knowledge as possible.
18. Conclusion 1/2
- There is no classifier that is better than the others for all f, because:
- To be able to make judgements over off-training-set examples, a bias is needed
- And while the bias will improve the classification of some f, there will be just as many fs on which the classifier will fail
- So all classifiers have the same generalization power, applied to different f
19. Conclusion 2/2
- Use as much domain knowledge and as many custom optimization routines constructed for particular domains as possible!
20. Open Questions
- What is the difference between a test set and off-training-set examples?
- Do humans have a bias for learning?
- Yes → what is it?
- No → how are we then able to generalize over new examples?
- It is not our classifiers we have to improve, but our prior knowledge