Suboptimality of Bayes and MDL in Classification

Transcript and Presenter's Notes
1
Suboptimality of Bayes and MDL in Classification
Peter Grünwald, CWI/EURANDOM, www.grunwald.nl
Joint work with John Langford, TTI Chicago.
A preliminary version appeared in the Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004).
2
Our Result
  • We study Bayesian and Minimum Description Length
    (MDL) inference in classification problems
  • Bayes and MDL are supposed to deal with overfitting automatically
  • We show that there exist classification domains where Bayes and MDL,
  • when applied in a standard manner,
  • perform suboptimally (overfit!) even as the sample size tends to infinity

3
Why is this interesting?
  • Practical viewpoint
  • Bayesian methods
  • are used a lot in practice
  • are sometimes claimed to be universally optimal
  • MDL methods
  • were designed specifically to deal with overfitting
  • Yet MDL and Bayes can fail even with infinite data
  • Theoretical viewpoint
  • How can the result be reconciled with various strong Bayesian consistency theorems?

4
Menu
  • Classification
  • Abstract statement of main result
  • Precise statement of result
  • Discussion

5
Classification
  • Given
  • Feature space $\mathcal{X}$
  • Label space $\mathcal{Y} = \{-1, 1\}$
  • Sample $S = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$
  • Set of hypotheses (classifiers) $\mathcal{C}$, each $c: \mathcal{X} \to \mathcal{Y}$
  • Goal: find a $c \in \mathcal{C}$ that makes few mistakes on future data from the same source
  • We say such a c has small generalization error / classification risk

6
Classification Models
  • Types of Classifiers
  • hard classifiers (-1/1-output)
  • decision trees, stumps, forests
  • soft classifiers (real-valued output)
  • support vector machines
  • neural networks
  • probabilistic classifiers
  • Naïve Bayes/Bayesian network classifiers
  • Logistic regression

(initial focus: hard classifiers)
7
Generalization Error
  • As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label) space $\mathcal{X} \times \mathcal{Y}$
  • The performance of a classifier is measured by its generalization error (classification risk), defined as
    $\mathrm{err}_D(c) := P_{(X,Y) \sim D}\left(c(X) \neq Y\right)$
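A minimal Python sketch of this definition (our illustration; the toy distribution and all names here are ours, not the talk's):

```python
import random

def sample_from_D():
    """Toy stand-in for the unknown distribution D (illustrative only):
    one binary feature that agrees with the label 80% of the time."""
    y = random.choice([-1, 1])
    x = y if random.random() < 0.8 else -y
    return x, y

def generalization_error(c, n_samples=100_000):
    """Monte-Carlo estimate of err_D(c) = P_{(X,Y)~D}( c(X) != Y )."""
    mistakes = sum(c(x) != y
                   for _ in range(n_samples)
                   for x, y in [sample_from_D()])
    return mistakes / n_samples

print(generalization_error(lambda x: x))   # ~0.2 by construction
```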

8
Learning Algorithms
  • A learning algorithm LA, based on a set of candidate classifiers $\mathcal{C}$, is a function that, for each sample S of arbitrary length, outputs a classifier $\mathrm{LA}(S) \in \mathcal{C}$

9
Consistent Learning Algorithms
  • Suppose $(X_1, Y_1), (X_2, Y_2), \ldots$ are i.i.d. according to the true distribution D
  • A learning algorithm LA is consistent or asymptotically optimal if, no matter what the true distribution D is, $\mathrm{err}_D(\mathrm{LA}(S)) \to \mathrm{err}_D(c^*)$
  • in D-probability, as $n \to \infty$,
  • where $\mathrm{LA}(S)$ is the learned classifier and $c^* := \arg\min_{c \in \mathcal{C}} \mathrm{err}_D(c)$ is the best classifier in $\mathcal{C}$
11
Main Result
  • There exists
  • an input domain $\mathcal{X}$
  • a prior P, non-zero on a countable set of classifiers $\mathcal{C}$
  • a true distribution D
  • a constant $K > 0$
  • such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, $\liminf_{n \to \infty}\, \mathrm{err}_D(c_{\mathrm{Bayes}}(S)) \geq \mathrm{err}_D(c^*) + K$
  • The same holds for the MDL learning algorithm

13
Remainder of Talk
  • How is the Bayes learning algorithm defined?
  • What is the scenario?
  • what do $\mathcal{C}$, the true distribution D, and the prior P look like?
  • How dramatic is the result?
  • How large is K?
  • How strange are the choices for $\mathcal{C}$, D, and P?
  • How bad can Bayes get?
  • Why is the result surprising?
  • can it be reconciled with Bayesian consistency results?

14
Bayesian Learning of Classifiers
  • Problem: Bayesian inference is defined for models that are sets of probability distributions
  • In our scenario, the models are sets of classifiers $\mathcal{C}$, i.e. functions $c: \mathcal{X} \to \{-1, 1\}$
  • How can we find a posterior over classifiers using Bayes' rule?
  • Standard answer: convert each $c \in \mathcal{C}$ to a corresponding distribution $p_{\eta,c}$ and apply Bayes to the set of distributions thus obtained

15
classifiers → probability distributions
  • Standard conversion method from $\mathcal{C}$ to a set of distributions: the logistic (sigmoid) transformation
  • For each $c \in \mathcal{C}$ and $\eta > 0$, set
    $p_{\eta,c}(y \mid x) := \frac{1}{1 + e^{-\eta\, y\, c(x)}}$
  • Define priors P on $\mathcal{C}$ and on $\eta$, and set $P(c, \eta) := P(c) \cdot P(\eta)$
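A short Python sketch of this conversion (our illustration; the helper names are ours):

```python
import math

def p_logistic(c, eta, x, y):
    """p_{eta,c}(y | x) = 1 / (1 + exp(-eta * y * c(x)))
    for a hard classifier c: X -> {-1, +1} and a label y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-eta * y * c(x)))

def log_likelihood(c, eta, sample):
    """Log-likelihood of sample = [(x_1, y_1), ..., (x_n, y_n)] under p_{eta,c}.
    Equals -n*log(1 + e^-eta) - eta * (#mistakes of c); see the next slide."""
    return sum(math.log(p_logistic(c, eta, x, y)) for x, y in sample)
```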

16
Logistic transformation - intuition
  • Consider hard classifiers $c: \mathcal{X} \to \{-1, 1\}$
  • For each $(c, \eta)$,
    $-\log p_{\eta,c}(y^n \mid x^n) = n \log(1 + e^{-\eta}) + \eta \cdot n\,\hat{\mathrm{err}}_S(c)$
  • Here
  • $\hat{\mathrm{err}}_S(c)$ is the empirical error that c makes on the data,
  • and $n\,\hat{\mathrm{err}}_S(c)$ is the number of mistakes c makes on the data

17
Logistic transformation - intuition
  • For fixed $\eta$:
  • the log-likelihood is a linear function of the number of mistakes c makes on the data,
  • so the (log-)likelihood is maximized for the c that is optimal for the observed data
  • For fixed c, the likelihood is maximized at $\hat{\eta} = \ln \frac{1 - \hat{\mathrm{err}}_S(c)}{\hat{\mathrm{err}}_S(c)}$
  • Maximizing the likelihood over $\eta$ also makes sense
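The value of $\hat\eta$ follows from the previous display by a one-line computation (our reconstruction):

\[
\frac{\partial}{\partial \eta}\left[-n\log(1+e^{-\eta}) - \eta\, n\,\hat{\mathrm{err}}_S(c)\right] = \frac{n\,e^{-\eta}}{1+e^{-\eta}} - n\,\hat{\mathrm{err}}_S(c) = 0 \quad\Longleftrightarrow\quad \hat{\eta} = \ln\frac{1-\hat{\mathrm{err}}_S(c)}{\hat{\mathrm{err}}_S(c)},
\]

i.e. the fitted $\hat\eta$ matches the model's noise rate $e^{-\hat\eta}/(1+e^{-\hat\eta})$ to the empirical error of c.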

18
Logistic transformation - intuition
  • In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without giving any motivation or explanation
  • We did not find it in Bayesian textbooks,
  • but tested it with three well-known Bayesians!
  • It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise

The transformation expresses $Y = c(X) \cdot Z$, where $Z \in \{-1, +1\}$ is an independent noise bit with $P(Z = 1) = \frac{1}{1 + e^{-\eta}}$
19
Main Result
(Grünwald & Langford, COLT 2004)
  • There exists
  • an input domain $\mathcal{X}$
  • a prior P on a countable set of classifiers $\mathcal{C}$
  • a true distribution D
  • a constant $K > 0$
  • such that the Bayesian learning algorithm is asymptotically K-suboptimal

holds both for full Bayes and for Bayes (S)MAP
20
Definition of $c_{\mathrm{Bayes}}$
  • Posterior: $P(c, \eta \mid S) \propto p_{\eta,c}(y^n \mid x^n) \, P(c) \, P(\eta)$
  • Predictive distribution: $p(y \mid x, S) := \sum_{c} \int P(c, \eta \mid S)\, p_{\eta,c}(y \mid x)\, d\eta$
  • Full Bayes learning algorithm: $c_{\mathrm{Bayes}}(S)(x) := \arg\max_{y \in \{-1,1\}} p(y \mid x, S)$
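A self-contained Python sketch of these three definitions (our illustration; a finite grid of η-values stands in for the continuous prior, and the naive exponentiation may underflow on large samples):

```python
import math

def posterior(classifiers, etas, prior_c, prior_eta, sample):
    """P(c, eta | S) ∝ p_{eta,c}(y^n | x^n) * P(c) * P(eta), on a finite grid."""
    post = {}
    for i, c in enumerate(classifiers):
        for j, eta in enumerate(etas):
            loglik = sum(-math.log(1.0 + math.exp(-eta * y * c(x)))
                         for x, y in sample)
            post[(i, j)] = math.exp(loglik) * prior_c[i] * prior_eta[j]
    total = sum(post.values())
    return {key: w / total for key, w in post.items()}

def c_bayes(classifiers, etas, post, x):
    """Full Bayes: predict argmax_y of
    p(y | x, S) = sum_{c,eta} P(c,eta | S) * p_{eta,c}(y | x)."""
    def predictive(y):
        return sum(w / (1.0 + math.exp(-etas[j] * y * classifiers[i](x)))
                   for (i, j), w in post.items())
    return max([-1, 1], key=predictive)
```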

21
Issues/Remainder of Talk
  • (agenda shown again; next up: what is the scenario?)

22
Scenario
  • Definition of Y, X and $\mathcal{C}$: labels $Y \in \{-1, 1\}$, features $X = (X_1, X_2, \ldots)$, classifiers $c_j(x) := x_j$ for $j = 1, 2, \ldots$
  • Definition of the prior:
  • for some small $\epsilon > 0$, for all large n, $P(c_n) \geq 2^{-\epsilon n}$
  • $P(\eta)$ can be any strictly positive smooth prior

(or a discrete prior with sufficient precision)
23
Scenario II: Definition of the true D
  • Toss a fair coin to determine the value of Y
  • Toss a coin Z with bias $\theta$
  • If Z = 1 (easy example), then for all j, set $X_j = Y$
  • If Z = 0 (hard example), then set $X_1 = \ldots$,
  • and, for all other j, independently set $X_j = \ldots$
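The exact biases on this slide are not preserved in the transcript, so the sketch below samples from a distribution of this shape with made-up placeholder values; only the easy/hard structure, not the numbers, reflects the talk:

```python
import random

THETA = 0.5          # P(easy example) -- placeholder value, not the talk's
HARD_X1_RATE = 0.6   # P(X_1 = Y) on hard examples -- placeholder
HARD_XJ_RATE = 0.5   # P(X_j = Y), j >= 2, on hard examples -- placeholder

def sample_example(n_features=10):
    """One draw (X, Y): easy examples copy Y into every feature;
    hard examples keep X_1 somewhat informative and the rest barely so."""
    y = random.choice([-1, 1])
    if random.random() < THETA:                        # easy example
        return [y] * n_features, y
    x = [y if random.random() < HARD_X1_RATE else -y]  # X_1 on a hard example
    x += [y if random.random() < HARD_XJ_RATE else -y
          for _ in range(n_features - 1)]              # X_2, X_3, ... independently
    return x, y
```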

24
Result
  • All features are informative of Y, but $X_1$ is more informative than all the others, so $c_1$ is the best classifier
  • Nevertheless, with true-D probability 1, as $n \to \infty$, the Bayes classifier remains suboptimal: $\mathrm{err}_D(c_{\mathrm{Bayes}}(S)) \geq \mathrm{err}_D(c_1) + K$

(but note that, for each fixed j, $\ldots$)
25
Issues/Remainder of Talk
  • (agenda shown again; next up: how dramatic is the result?)

26
Theorem 1
(Grünwald & Langford, COLT 2004)
  • There exists
  • an input domain $\mathcal{X}$
  • a prior P on a countable set of classifiers $\mathcal{C}$
  • a true distribution D
  • a constant $K > 0$
  • such that the Bayesian learning algorithm is asymptotically K-suboptimal

holds both for full Bayes and for Bayes MAP
27
Theorem 1, extended
[Figure: suboptimality as a function of the x-axis parameter, showing the maximum attainable for Bayes MAP/MDL and the maximum for full Bayes (binary entropy); the maximum difference is achieved at the point marked on the plot.]


28
How natural is the scenario?
  • The basic scenario is quite unnatural
  • We chose it because we could prove something about it! But:
  • The priors are natural (take e.g. Rissanen's universal prior)
  • Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context
  • Bayesian inference is consistent under very weak conditions. So even if the scenario is unnatural, the result is still interesting!

29
Issues/Remainder of Talk
  • (agenda shown again; next up: why is the result surprising, and can it be reconciled with Bayesian consistency results?)

30
Bayesian Consistency Results
  • Doob (1949; special case)
  • Suppose
  • the model is countable
  • the model contains the true conditional distribution $D(Y \mid X)$
  • Then, with D-probability 1, the posterior converges to the true distribution,

weakly / in Hellinger distance
31
Bayesian Consistency Results
  • If the true conditional distribution were in the model $\mathcal{M}$,
  • then we must also have $\mathrm{err}_D(c_{\mathrm{Bayes}}(S)) \to \mathrm{err}_D(c^*)$
  • Our result says this does not happen in our scenario. Hence the (countable!) $\mathcal{M}$ we constructed must be misspecified
  • The model is homoskedastic, while the true D is heteroskedastic: each $p_{\eta,c}$ has the same noise rate for every x, whereas under D the noise level depends on whether an example is easy or hard!

32
Bayesian consistency under misspecification
  • Suppose we use Bayesian inference based on a model $\mathcal{M}$
  • If $D \notin \mathcal{M}$, then under mild generality conditions, Bayes still converges to the distribution $\tilde{p} \in \mathcal{M}$ that is closest to D in KL-divergence (relative entropy)
  • The logistic transformation ensures that the minimum KL-divergence over $(c, \eta)$ is
  • achieved for the c that also achieves $\min_{c \in \mathcal{C}} \mathrm{err}_D(c)$
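Why the transformation has this property (our addition; it follows by taking expectations in the linearity identity from the "Logistic transformation" slides):

\[
E_D\!\left[-\log p_{\eta,c}(Y \mid X)\right] = \log\left(1+e^{-\eta}\right) + \eta\,\mathrm{err}_D(c),
\]

so for every fixed $\eta$, minimizing $D_{\mathrm{KL}}(D \,\|\, p_{\eta,c})$ over c is equivalent to minimizing $\mathrm{err}_D(c)$.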

33
Bayesian consistency under misspecification
  • In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence
  • Apparently, the mild generality conditions for Bayesian consistency under misspecification are violated
  • The conditions for consistency under misspecification are much stronger than the conditions for standard consistency!
  • $\mathcal{M}$ must either be convex or simple (e.g. parametric)

34
Is consistency achievable at all?
  • Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent
  • Vapnik's methods (based on VC-dimension etc.)
  • McAllester's PAC-Bayes methods
  • These methods invariably punish complex (low-prior) classifiers much more than ordinary Bayes does;
  • in the simplest version of PAC-Bayes, the bound shown below applies
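One common form of the simplest such ("Occam") bound — our addition, not a formula preserved from the slides: with probability at least $1-\delta$ over the sample, simultaneously for all $c \in \mathcal{C}$,

\[
\mathrm{err}_D(c) \;\leq\; \widehat{\mathrm{err}}_S(c) + \sqrt{\frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{2n}},
\]

so a low-prior classifier pays a penalty of order $\sqrt{-\ln P(c)/n}$, which for $-\ln P(c) \ll n$ is much larger than the $-\ln P(c)/n$ per-example penalty implicit in Bayes/MDL.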

35
Consistency and Data Compression - I
  • Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm
  • MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors
  • however:

36
Consistency and Data Compression - II
  • There already exist (in)famous inconsistency results for Bayesian inference by Diaconis and Freedman
  • For some highly non-parametric models $\mathcal{M}$, even if the true D is in $\mathcal{M}$, Bayes may not converge to it
  • These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data
  • With MDL priors, if the true D is in $\mathcal{M}$, then consistency is guaranteed under no further conditions at all (Barron '98)

37
Issues/Remainder of Talk
  • (agenda shown again; next up: how bad can Bayes get?)

38
Thm 2: the full Bayes result is tight
[Figure: the same plot as in "Theorem 1, extended": suboptimality as a function of the x-axis parameter, with the maximum for Bayes MAP/MDL and the maximum for full Bayes (binary entropy); the maximum difference is achieved at the point marked on the plot.]


39
Theorem 2

40
Proof Sketch
  • The log loss of Bayes upper-bounds its 0/1-loss
  • For every sequence, the log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a log-term
  • (law of large numbers / Hoeffding)
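Our reconstruction of the two displayed inequalities, using the definitions from earlier slides (a sketch; the slide formulas themselves were not preserved):

\[
\sum_{i=1}^{n} \mathbb{1}\!\left[\hat{y}_i \neq y_i\right] \;\leq\; \sum_{i=1}^{n} -\log_2 p\!\left(y_i \mid x_i, S_{i-1}\right) = -\log_2 p_{\mathrm{Bayes}}\!\left(y^n \mid x^n\right),
\]

since whenever the Bayes prediction $\hat y_i$ is wrong, the predictive probability of the true label is at most $1/2$. Moreover, for every $(c, \eta)$ in the (countable) support of the prior,

\[
-\log_2 p_{\mathrm{Bayes}}\!\left(y^n \mid x^n\right) \;\leq\; -\log_2 p_{\eta,c}\!\left(y^n \mid x^n\right) + \log_2\frac{1}{P(c)\,P(\eta)},
\]

and the first term on the right is linear in the number of mistakes of c, which concentrates around $n \cdot \mathrm{err}_D(c)$ (law of large numbers / Hoeffding).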
44
Wait a minute
  • The accumulated log loss of the sequential Bayesian predictions is always within $\log \frac{1}{P(c^*)P(\eta^*)}$ of the accumulated log loss of the optimal $p_{\eta^*, c^*}$
  • So Bayes is good with respect to log loss / KL-divergence
  • But Bayes is bad with respect to 0/1-loss
  • How is this possible?
  • The Bayesian posterior effectively becomes a mixture of bad distributions (a different mixture at different sample sizes m)
  • The mixture is closer to the true distribution D than $p_{\eta^*, c^*}$ in KL-divergence / log loss prediction
  • But it performs worse than $c^*$ in terms of 0/1 error
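A tiny numerical illustration of this phenomenon (our toy numbers, not the talk's scenario): a heteroskedastic truth, one well-calibrated single distribution, and a mixture of confident components that split 50/50 on the hard inputs:

```python
import math

# Toy heteroskedastic truth (illustrative values only):
#   easy x (prob 3/4): Y = +1 always;  hard x (prob 1/4): Y = +1 with prob 0.6
MU, BIAS = 0.75, 0.6

# Single distribution "c*": predict +1 everywhere with confidence q
q = MU + (1 - MU) * BIAS                      # q = 0.9, its overall accuracy
logloss_cstar = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))   # ~0.469 bits
err01_cstar = (1 - MU) * (1 - BIAS)           # errs only on hard x: 0.1

# Mixture of confident components agreeing on easy x, splitting on hard x
logloss_mix = MU * 0.0 + (1 - MU) * 1.0       # 0 bits on easy, 1 bit (p = 1/2) on hard
err01_mix = (1 - MU) * 0.5                    # a fair-coin decision on hard x: 0.125

print(f"log loss:  c* = {logloss_cstar:.3f} bits,  mixture = {logloss_mix:.3f} bits")
print(f"0/1 error: c* = {err01_cstar:.3f},         mixture = {err01_mix:.3f}")
# The mixture predicts better in log loss yet classifies worse: exactly the
# gap the slide describes.
```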

45
Bayes predicts too well
  • Let $\mathcal{M}$ be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to $\mathcal{M}$
  • One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense
  • are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in $\mathcal{M}$

46
Conclusion
  • Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification
  • A Bayesian may argue that the Bayesian machinery was never intended for misspecified models
  • Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time
  • In that case, Bayes may overfit even in the limit of an infinite amount of data

47
Thank you for your attention!