1
No-Free-Lunch
  • By Nicolaas Heyning

2
Question
  • Will there ever be a single best learning
    algorithm?

3
Statement I
  • "... all algorithms that search for an extremum
    of a cost function perform exactly the same, when
    averaged over all possible cost functions."
    (Wolpert and Macready, 1995)

4
Statement II
  • A general-purpose universal optimization strategy
    is theoretically impossible; one strategy can
    outperform another only if it is specialized to
    the specific problem under consideration

5
How to measure a classifier?
  • The main interest of the NFL theory is the
    off-training-set error
  • The off-training set is a test set that does not
    overlap the training set
  • This is not encompassed by standard Bayesian
    analysis, sampling theory statistics, or other
    computational learning theory approaches

6
Wolpert
  • Extended Bayesian Formalism (EBF)
  • Capable of addressing the off-training-set error
  • Conventional Bayesian learning allows for one
    hypothesis, the classifier
  • The EBF extends the Bayesian learning framework so
    that it allows for all possible hypotheses,
    subsuming PAC, the VC framework, and sampling
    theory statistics

7
A Metric
  • The averaged performance of a classifier over all
    possible cost functions can be used as a metric to
    determine the similarity between two classifiers
  • To calculate this distance between two
    classifiers, the EBF is employed

8
Extended Bayesian Formalism 1/3
  • A theoretical framework
  • d: a training set of m elements
  • f: the target input-output relation; it generates d
  • c: the generalization error
  • h: a hypothesis, the algorithm's guess for f
  • Traditional Bayes learner:
  • find f such that P(f|d) ∝ P(d|f) P(f) is
    maximized (a toy version is sketched below)
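A minimal sketch of such a Bayes learner, assuming a toy hypothesis space (all Boolean functions on three inputs), a uniform prior P(f), and a noise-free likelihood; none of these choices come from the slides:

  from itertools import product

  X = [0, 1, 2]                       # toy input space
  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # every f: X -> {0, 1}

  def bayes_learner(d):
      """Traditional Bayes learner: maximize P(f|d) ~ P(d|f) * P(f), with a
      uniform prior and a likelihood of 1 if f agrees with every pair in d,
      else 0 (noise-free data)."""
      prior = 1 / len(fs)
      post = [(1.0 if all(f[x] == y for x, y in d) else 0.0) * prior for f in fs]
      z = sum(post)
      post = [p / z for p in post]
      i = max(range(len(fs)), key=lambda j: post[j])
      return fs[i], post[i]

  f_map, p = bayes_learner([(0, 1), (1, 0)])  # m = 2 training pairs
  print(f_map, p)                             # one of two tied MAP guesses, P(f|d) = 0.5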

9
Extended Bayesian Formalism 2/3
  • Presume an input space X and an output space Y,
    related by a target function f
  • Let Ha and Hb be the two hypothesis input-output
    distributions generated by our two learning
    algorithms in response to a (random variable)
    training set d consisting of m pairs of X-Y values
  • The behavior of learners A and B is captured by
    the distributions P(Ha|d) and P(Hb|d)

10
Extended Bayesian Formalism 3/3
  • Any specific learning algorithm can now be given
    by the conditional probability distribution P(h|d)
    (illustrated below)
  • The conventional Bayesian framework corresponds to
    fixing P(h|d) to the Bayesian-optimal choice
    associated with P(f|d)
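A small sketch of this point, with an assumed (Gibbs-style) learner that is uniform over all hypotheses consistent with d; a deterministic learner would instead put all its mass on a single h:

  from itertools import product

  X = [0, 1, 2]
  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # hypothesis space

  def p_h_given_d(d):
      """A learning algorithm as the distribution P(h|d): here, uniform over
      the hypotheses consistent with the training set d."""
      consistent = [h for h in fs if all(h[x] == y for x, y in d)]
      return [(h, 1 / len(consistent)) for h in consistent]

  for h, p in p_h_given_d([(0, 1), (1, 0)]):
      print(p, h)                     # two consistent hypotheses, mass 0.5 each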

11
Generalization performance
  • c(f,h,d) is the expected off-training-set error
    for zero-one loss and a uniform sampling
    distribution
  • Now we can use the generalization performance as a
    measure of similarity/distance between two
    learners (a sketch follows this list)
  • E(K(Ca − Cb)|f,d), where K is a distance function
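The slide's formula for c(f,h,d) did not survive the transcript; under zero-one loss and a uniform sampling distribution it reduces to the average disagreement between f and h over the inputs not in d. A minimal sketch under that assumption:

  def ots_error(f, h, d, X):
      """Off-training-set zero-one error: the fraction of inputs outside the
      training set d on which the hypothesis h disagrees with the target f,
      assuming uniform sampling over those inputs."""
      seen = {x for x, _ in d}
      ots = [x for x in X if x not in seen]
      return sum(f[x] != h[x] for x in ots) / len(ots)

  # Distance between learners A and B on (f, d), taking K = absolute value:
  # K(Ca - Cb) = abs(ots_error(f, h_a, d, X) - ots_error(f, h_b, d, X))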

12
Theorem I
  • The performance of h is a measure of how well
    P(h|d) and P(f|d) are aligned
  • This is expressed as an inner product between the
    two distributions (reconstructed below)
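The equation itself is missing from the transcript. In Wolpert's derivation (also presented in Duda, Hart and Stork, ch. 9), the expected off-training-set zero-one error takes roughly the form

  E(C|d) = Σ over h,f  Σ over x∉d  P(x) [1 − δ(f(x), h(x))] P(h|d) P(f|d)

so the error is low exactly when the learner's P(h|d) places its mass where the posterior P(f|d) does.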

13
Theorem II
  • For any two learning algorithms P1(h|d) and
    P2(h|d), independent of the sampling distribution,
    the following holds for the off-training-set error
    (a numerical check is sketched after this list)
  • Uniformly averaged over all f,
    E1(C|f,m) − E2(C|f,m) = 0
  • Uniformly averaged over all f, for any training
    set d, E1(C|f,d) − E2(C|f,d) = 0
  • Uniformly averaged over all P(f),
    E1(C|m) − E2(C|m) = 0
  • Uniformly averaged over all P(f), for any training
    set d, E1(C|d) − E2(C|d) = 0
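A minimal numerical check of the second statement (the toy domain, training set, and learners are assumptions for illustration): for one fixed d, average the off-training-set error of two very different learners uniformly over every Boolean target f on a five-point input space. The averages coincide:

  from itertools import product

  X = list(range(5))
  d = [(0, 1), (1, 0)]                # one fixed training set (m = 2)
  ots = [x for x in X if x not in {x for x, _ in d}]

  def majority_learner(d):
      """Predict the training set's majority label everywhere."""
      lab = 1 if 2 * sum(y for _, y in d) >= len(d) else 0
      return {x: lab for x in X}

  def zero_learner(d):
      """Ignore the data entirely; always predict 0."""
      return {x: 0 for x in X}

  def ots_error(f, h):
      return sum(f[x] != h[x] for x in ots) / len(ots)

  fs = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]  # all 32 targets

  for name, learner in [("majority", majority_learner), ("zero", zero_learner)]:
      avg = sum(ots_error(f, learner(d)) for f in fs) / len(fs)
      print(name, avg)                # both print 0.5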

14
Theorem II
  • In other words, by any of the posed measures, any
    two learning algorithms are the same: there are
    just as many situations (appropriately weighted)
    in which algorithm 1 is superior to algorithm 2 as
    vice versa
  • No learning algorithm performs better for all f;
    however, some learning algorithms perform better
    on a particular (class of) f, which is a very
    small part of all possible f

15
Intuition Behind Theorem II
  • If no assumption (bias) is made about the outcome
    of the learner, then P(f) is uniform over all f
  • We are concerned with E(C|d) for some particular d
  • Since P(f) is uniform, all f agreeing with d are
    equally probable
  • So all patterns of f outside the training set are
    equally probable
  • Therefore training a classifier on the
    training-set patterns gains no information about
    off-training-set examples (see the enumeration
    below)
  • → A bias is required to train a classifier!
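A toy enumeration of that step (the three-point domain is an assumed example): among all f on {0, 1, 2} that agree with a training set covering x = 0 and x = 1, exactly half label the off-training-set point x = 2 with 0 and half with 1, so d carries no information about f(2):

  from itertools import product

  X = [0, 1, 2]
  d = [(0, 1), (1, 0)]                # training set covers x = 0 and x = 1
  fs = [f for f in (dict(zip(X, ys)) for ys in product([0, 1], repeat=3))
        if all(f[x] == y for x, y in d)]
  print([f[2] for f in fs])           # [0, 1]: both labels are equally probable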

16
(Figure) An illustration of the no-free-lunch theorem,
showing the performance of a highly specialized
algorithm (red) and a general-purpose one (blue).
Both algorithms perform equally well, on average,
over all problems.
17
  • The theorem is used as an argument against generic
    search algorithms such as genetic algorithms and
    simulated annealing when they are employed with as
    little domain knowledge as possible

18
Conclusion 1/2
  • There is no classifier that is better on all f
    than other classifiers, because
  • To be able to make judgements about
    off-training-set examples, a bias is needed
  • And while the bias will improve the classification
    of some f, there will be just as many f on which
    the classifier fails
  • So all classifiers have the same generalization
    power, applied to different f

19
Conclusion 2/2
  • Use as much domain knowledge, and custom
    optimization routines constructed for particular
    domains, as possible!

20
Open Questions
  • What is the difference between a test set and
    off-training-set examples?
  • Do humans have a bias for learning?
  • If yes: what is it?
  • If no: how are we then able to generalize over new
    examples?
  • It is not our classifiers we have to improve, but
    our prior knowledge