1
No-Free-Lunch
  • By Nicolaas Heyning

2
Question
  • Will there ever be a single best learning
    algorithm?

3
Statement I
  • "... all algorithms that search for an extremum
    of a cost function perform exactly the same, when
    averaged over all possible cost functions."
    (Wolpert and Macready, 1995)

4
Statement II
  • A general-purpose universal optimization strategy
    is theoretically impossible; one strategy can
    outperform another only if it is specialized to
    the specific problem under consideration

5
How to measure a classifier?
  • The main interest of the NFL theory is the
    off-training-set error
  • The off-training set is a test set that does not
    overlap the training set
  • This is not encompassed by standard Bayesian
    analysis, sampling theory statistics, or other
    computational learning theory approaches

6
Wolpert
  • Extended Bayesian Formalism (EBF)
  • Capable of addressing the off-training-set error
  • Conventional Bayesian learning allows for one
    hypothesis, the classifier
  • The EBF extends the Bayesian learning framework so
    that it allows for all possible hypotheses,
    subsuming PAC, the VC framework, and sampling
    theory statistics

7
A Metric
  • The averaged performance of a classifier over all
    possible cost functions can be used as a metric to
    determine the similarity between two classifiers
  • To calculate this distance between two
    classifiers, the EBF is employed

8
Extended Bayesian Formalism 1/3
  • A theoretical framework
  • d: a training set of m elements
  • f: the target input-output relation; it generates d
  • c: the generalization error
  • h: a hypothesis, the algorithm's guess for f
  • Traditional Bayes learner:
  • find f such that P(f|d) ∝ P(d|f) P(f) is
    maximized (a toy version is sketched below)
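A minimal sketch of such a Bayes learner, assuming a toy hypothesis space (all Boolean functions on three inputs), a uniform prior P(f), and a noise-free likelihood; none of these choices come from the slides:

  from itertools import product

  X = [0, 1, 2]                       # toy input space
  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # every f: X -> {0, 1}

  def bayes_learner(d):
      """Traditional Bayes learner: maximize P(f|d) ~ P(d|f) * P(f), with a
      uniform prior and a likelihood of 1 if f agrees with every pair in d,
      else 0 (noise-free data)."""
      prior = 1 / len(fs)
      post = [(1.0 if all(f[x] == y for x, y in d) else 0.0) * prior for f in fs]
      z = sum(post)
      post = [p / z for p in post]
      i = max(range(len(fs)), key=lambda j: post[j])
      return fs[i], post[i]

  f_map, p = bayes_learner([(0, 1), (1, 0)])  # m = 2 training pairs
  print(f_map, p)                             # one of two tied MAP guesses, P(f|d) = 0.5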

9
Extended Bayesian Formalism 2/3
  • Presume an input space X and an output space Y,
    related by a target function f
  • Let Ha and Hb be the two hypothesis input-output
    distributions generated by our two learning
    algorithms in response to a (random variable)
    training set d consisting of m pairs of X-Y values
  • The behavior of learners A and B is captured by
    the distributions P(Ha|d) and P(Hb|d)

10
Extended Bayesian Formalism 3/3
  • Any specific learning algorithm can now be given
    by the conditional probability distribution P(h|d)
    (illustrated below)
  • The conventional Bayesian framework corresponds to
    fixing P(h|d) to the Bayesian-optimal choice
    associated with P(f|d)
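A small sketch of this point, with an assumed (Gibbs-style) learner that is uniform over all hypotheses consistent with d; a deterministic learner would instead put all its mass on a single h:

  from itertools import product

  X = [0, 1, 2]
  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # hypothesis space

  def p_h_given_d(d):
      """A learning algorithm as the distribution P(h|d): here, uniform over
      the hypotheses consistent with the training set d."""
      consistent = [h for h in fs if all(h[x] == y for x, y in d)]
      return [(h, 1 / len(consistent)) for h in consistent]

  for h, p in p_h_given_d([(0, 1), (1, 0)]):
      print(p, h)                     # two consistent hypotheses, mass 0.5 each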

11
Generalization performance
  • c(f,h,d) is the expected off-training-set error
    for zero-one loss and a uniform sampling
    distribution
  • Now we can use the generalization performance as a
    measure of similarity/distance between two
    learners (a sketch follows this list)
  • E(K(Ca − Cb)|f,d), where K is a distance function
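The slide's formula for c(f,h,d) did not survive the transcript; under zero-one loss and a uniform sampling distribution it reduces to the average disagreement between f and h over the inputs not in d. A minimal sketch under that assumption:

  def ots_error(f, h, d, X):
      """Off-training-set zero-one error: the fraction of inputs outside the
      training set d on which the hypothesis h disagrees with the target f,
      assuming uniform sampling over those inputs."""
      seen = {x for x, _ in d}
      ots = [x for x in X if x not in seen]
      return sum(f[x] != h[x] for x in ots) / len(ots)

  # Distance between learners A and B on (f, d), taking K = absolute value:
  # K(Ca - Cb) = abs(ots_error(f, h_a, d, X) - ots_error(f, h_b, d, X))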

12
Theorem I
  • The performance of h is a measure of how well
    P(h|d) and P(f|d) are aligned
  • This is expressed as an inner product between the
    two distributions (reconstructed below)
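The equation itself is missing from the transcript. In Wolpert's derivation (also presented in Duda, Hart and Stork, ch. 9), the expected off-training-set zero-one error takes roughly the form

  E(C|d) = Σ over h,f  Σ over x∉d  P(x) [1 − δ(f(x), h(x))] P(h|d) P(f|d)

so the error is low exactly when the learner's P(h|d) places its mass where the posterior P(f|d) does.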

13
Theorem II
  • For any two learning algorithms P1(h|d) and
    P2(h|d), independent of the sampling distribution,
    the following holds for the off-training-set error
    (a numerical check is sketched after this list)
  • Uniformly averaged over all f,
    E1(C|f,m) − E2(C|f,m) = 0
  • Uniformly averaged over all f, for any training
    set d, E1(C|f,d) − E2(C|f,d) = 0
  • Uniformly averaged over all P(f),
    E1(C|m) − E2(C|m) = 0
  • Uniformly averaged over all P(f), for any training
    set d, E1(C|d) − E2(C|d) = 0
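A minimal numerical check of the second statement (the toy domain, training set, and learners are assumptions for illustration): for one fixed d, average the off-training-set error of two very different learners uniformly over every Boolean target f on a five-point input space. The averages coincide:

  from itertools import product

  X = list(range(5))
  d = [(0, 1), (1, 0)]                # one fixed training set (m = 2)
  ots = [x for x in X if x not in {x for x, _ in d}]

  def majority_learner(d):
      """Predict the training set's majority label everywhere."""
      lab = 1 if 2 * sum(y for _, y in d) >= len(d) else 0
      return {x: lab for x in X}

  def zero_learner(d):
      """Ignore the data entirely; always predict 0."""
      return {x: 0 for x in X}

  def ots_error(f, h):
      return sum(f[x] != h[x] for x in ots) / len(ots)

  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # all 32 targets

  for name, learner in [("majority", majority_learner), ("zero", zero_learner)]:
      avg = sum(ots_error(f, learner(d)) for f in fs) / len(fs)
      print(name, avg)                # both print 0.5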

14
Theorem II
  • In other words, by any of the posed measures, any
    two learning algorithms are the same: there are
    just as many situations (appropriately weighted)
    in which algorithm 1 is superior to algorithm 2 as
    vice versa
  • No learning algorithm performs better for all f;
    however, some learning algorithms perform better
    on a particular (class of) f, which is a very
    small part of all possible f

15
Intuition Behind Theorem II
  • If no assumption (bias) is made about the outcome
    of the learner, then P(f) is uniform over all f
  • We are concerned with E(C|d) for some particular d
  • Since P(f) is uniform, all f agreeing with d are
    equally probable
  • So all patterns of f outside the training set are
    equally probable
  • Therefore training a classifier on the
    training-set patterns gains no information about
    off-training-set examples (see the enumeration
    below)
  • → A bias is required to train a classifier!
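A toy enumeration of that step (the three-point domain is an assumed example): among all f on {0, 1, 2} that agree with a training set covering x = 0 and x = 1, exactly half label the off-training-set point x = 2 with 0 and half with 1, so d carries no information about f(2):

  from itertools import product

  X = [0, 1, 2]
  d = [(0, 1), (1, 0)]                # training set covers x = 0 and x = 1
  fs = [f for f in (dict(zip(X, ys)) for ys in product([0, 1], repeat=3))
        if all(f[x] == y for x, y in d)]
  print([f[2] for f in fs])           # [0, 1]: both labels are equally probable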

16
(Figure) An illustration of the no-free-lunch theorem,
showing the performance of a highly specialized
algorithm (red) and a general-purpose one (blue).
Both algorithms perform equally well, on average,
over all problems.
17
  • The theorem is used as an argument against generic
    search algorithms such as genetic algorithms and
    simulated annealing when they are employed with as
    little domain knowledge as possible

18
Conclusion 1/2
  • There is no classifier that is better on all f
    than other classifiers, because
  • To be able to make judgements about
    off-training-set examples, a bias is needed
  • And while the bias will improve the classification
    of some f, there will be just as many f on which
    the classifier fails
  • So all classifiers have the same generalization
    power, applied to different f

19
Conclusion 2/2
  • Use as much domain knowledge, and custom
    optimization routines constructed for particular
    domains, as possible!

20
Open Questions
  • What is the difference between a test set and
    off-training-set examples?
  • Do humans have a bias for learning?
  • If yes: what is it?
  • If no: how are we then able to generalize over new
    examples?
  • It is not our classifiers we have to improve, but
    our prior knowledge