Title: Suboptimality of Bayes and MDL in Classification
Slide 1: Suboptimality of Bayes and MDL in Classification
Peter Grünwald, CWI/EURANDOM, www.grunwald.nl
Joint work with John Langford, TTI Chicago
A preliminary version appeared in Proceedings of the 17th Annual Conference On Learning Theory (COLT 2004).
Slide 2: Our Result
- We study Bayesian and Minimum Description Length (MDL) inference in classification problems.
- Bayes and MDL should automatically deal with overfitting.
- We show there exist classification domains where Bayes and MDL, when applied in a standard manner, perform suboptimally (overfit!) even as the sample size tends to infinity.
Slide 3: Why is this interesting?
- Practical viewpoint
  - Bayesian methods: used a lot in practice; sometimes claimed to be universally optimal.
  - MDL methods: even designed to deal with overfitting.
  - Yet MDL and Bayes can fail even with infinite data.
- Theoretical viewpoint
  - How can the result be reconciled with various strong Bayesian consistency theorems?
Slide 4: Menu
- Classification
- Abstract statement of main result
- Precise statement of result
- Discussion
Slide 5: Classification
- Given:
  - Feature space X
  - Label space Y = {-1, +1}
  - Sample S = ((x_1, y_1), ..., (x_n, y_n))
  - Set of hypotheses (classifiers) C; each c in C maps features to labels
- Goal: find a c in C that makes few mistakes on future data from the same source.
- We then say c has small generalization error / classification risk.
Slide 6: Classification Models
- Types of classifiers:
  - Hard classifiers (-1/+1 output): decision trees, stumps, forests  [our initial focus]
  - Soft classifiers (real-valued output): support vector machines, neural networks
  - Probabilistic classifiers: naïve Bayes / Bayesian network classifiers, logistic regression
Slide 7: Generalization Error
- As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label) space.
- The performance of a classifier c is measured in terms of its generalization error (classification risk), defined as the probability under D that c misclassifies a new example (written out below).
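In standard notation (with labels in {-1, +1}), the generalization error and its empirical counterpart on a sample S of size n are:

```latex
% Generalization error (classification risk) of a classifier c under D:
\mathrm{err}_D(c) = P_{(X,Y) \sim D}\bigl(c(X) \neq Y\bigr)

% Empirical error of c on the sample S = ((x_1,y_1),\ldots,(x_n,y_n)):
\widehat{\mathrm{err}}_S(c) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{c(x_i) \neq y_i\}
```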
Slide 8: Learning Algorithms
- A learning algorithm LA, based on a set of candidate classifiers C, is a function that, for each sample S of arbitrary length, outputs a classifier LA(S) in C.
Slides 9-10: Consistent Learning Algorithms
- Suppose the data (X_1, Y_1), (X_2, Y_2), ... are i.i.d. according to D.
- A learning algorithm LA is consistent or asymptotically optimal if, no matter what the true distribution D is,
  err_D(LA(S)) -> err_D(c*) in D-probability as the sample size tends to infinity,
  where LA(S) is the learned classifier and c* is the best classifier in C (the one minimizing err_D over C).
Slides 11-12: Main Result
- There exist
  - an input domain,
  - a prior P, non-zero on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in C by at least K, even as the sample size tends to infinity.
- The same holds for the MDL learning algorithm.
Slide 13: Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 14: Bayesian Learning of Classifiers
- Problem: Bayesian inference is defined for models that are sets of probability distributions.
- In our scenario, models are sets of classifiers C, i.e. functions from features to labels.
- How can we find a posterior over classifiers using Bayes' rule?
- Standard answer: convert each c in C to a corresponding distribution p_c and apply Bayes to the set of distributions thus obtained.
Slide 15: Classifiers -> probability distributions
- Standard conversion method from classifiers to distributions: the logistic (sigmoid) transformation.
- For each c in C and each eta > 0, define a conditional distribution p_{c,eta}(y | x) whose value depends only on whether c(x) = y, with eta controlling how peaked it is.
- Define priors on C and on eta, and take the joint prior to be their product.
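For concreteness, here is a minimal sketch of one common parameterization of this transformation for hard classifiers with labels in {-1, +1}; the precise form and the treatment of eta in the paper may differ.

```python
import numpy as np

def logistic_likelihood(c, x, y, eta):
    """One common form of the logistic transformation of a hard classifier.

    c   : function mapping a feature vector x to a label in {-1, +1}
    eta : positive "confidence" parameter
    Returns p_{c,eta}(y | x), which depends only on whether c classifies x correctly.
    """
    margin = y * c(x)                      # +1 if c is correct on (x, y), -1 otherwise
    return np.exp(eta * margin) / (np.exp(eta) + np.exp(-eta))

# Example: a classifier that predicts the sign of the first feature.
c1 = lambda x: 1 if x[0] >= 0 else -1
x = np.array([0.7, -0.2])
print(logistic_likelihood(c1, x, 1, eta=1.0))    # > 0.5: c1 is correct on (x, +1)
print(logistic_likelihood(c1, x, -1, eta=1.0))   # < 0.5: c1 is wrong on (x, -1)
```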
Slide 16: Logistic transformation - intuition
- Consider hard classifiers c mapping features to {-1, +1}.
- For each c and eta, the likelihood of the sample under p_{c,eta} depends on the data only through the number of mistakes c makes.
- Here err_S(c) denotes the empirical error that c makes on the data, so n * err_S(c) is the number of mistakes c makes on the data.
Slide 17: Logistic transformation - intuition
- For fixed eta:
  - The log-likelihood is a (decreasing) linear function of the number of mistakes c makes on the data.
  - The (log-)likelihood is therefore maximized by the c that is optimal for the observed data.
- For fixed c:
  - Maximizing the likelihood over eta also makes sense: it fits the noise level to c's empirical error.
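To make the "linear in the number of mistakes" claim concrete, here is a short derivation under the parameterization sketched above (labels and classifier outputs in {-1, +1}); the constants differ under other, equivalent parameterizations.

```latex
% Log-likelihood of an i.i.d. sample under p_{c,\eta}(y \mid x) = e^{\eta y c(x)} / (e^{\eta} + e^{-\eta}),
% with m = number of mistakes c makes on (x_1,y_1),\ldots,(x_n,y_n):
\log p_{c,\eta}(y^n \mid x^n)
  = \eta \sum_{i=1}^{n} y_i c(x_i) - n \log\bigl(e^{\eta} + e^{-\eta}\bigr)
  = \eta (n - 2m) - n \log\bigl(e^{\eta} + e^{-\eta}\bigr).

% For fixed \eta > 0 this is a decreasing linear function of m, so maximizing over c
% minimizes the number of mistakes. For fixed c, setting the derivative in \eta to zero
% gives \tanh(\hat\eta) = 1 - 2m/n, i.e. \hat\eta = \tfrac{1}{2}\log\frac{n-m}{m},
% so the fitted \eta encodes c's empirical error rate.
```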
Slide 18: Logistic transformation - intuition
- In Bayesian practice, the logistic transformation is a standard tool, nowadays often performed without any motivation or explanation.
- We did not find it in Bayesian textbooks, but we tested it with three well-known Bayesians!
- It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise.
- The transformation expresses the label as the classifier's output corrupted by an independent noise bit Z.
Slide 19: Main Result
(Grünwald and Langford, COLT 2004)
- There exist
  - an input domain,
  - a prior P on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal.
- This holds both for full Bayes and for Bayes (S)MAP.
Slide 20: Definition of the full Bayes learning algorithm
- Posterior: obtained via Bayes' rule from the prior on (c, eta) and the likelihoods p_{c,eta}(y^n | x^n).
- Predictive distribution: the posterior-weighted mixture of the p_{c,eta}.
- Full Bayes learning algorithm: classify a new x with the label that the predictive distribution considers most likely (a sketch follows below).
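A minimal sketch of this pipeline under simplifying assumptions: a finite set of classifiers, a discretized grid of eta values, and the logistic likelihood from the earlier sketch. The paper's construction uses a countably infinite class and specific priors.

```python
import numpy as np

def bayes_predict(classifiers, prior_c, etas, prior_eta, X, y, x_new):
    """Full-Bayes 0/1 prediction over (classifier, eta) pairs.

    classifiers : list of functions x -> {-1, +1}
    prior_c     : prior weights over classifiers (sums to 1)
    etas        : grid of eta values; prior_eta their prior weights (sums to 1)
    X, y        : observed sample; x_new : point to classify
    """
    # Log-posterior (up to a constant) for each (c, eta) pair.
    log_post = np.empty((len(classifiers), len(etas)))
    for i, c in enumerate(classifiers):
        margins = np.array([yi * c(xi) for xi, yi in zip(X, y)])
        for j, eta in enumerate(etas):
            loglik = np.sum(eta * margins - np.log(np.exp(eta) + np.exp(-eta)))
            log_post[i, j] = np.log(prior_c[i]) + np.log(prior_eta[j]) + loglik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Predictive probability that the new label is +1, then threshold at 1/2.
    p_plus = 0.0
    for i, c in enumerate(classifiers):
        for j, eta in enumerate(etas):
            m = c(x_new)                       # +1 or -1
            p_plus += post[i, j] * np.exp(eta * m) / (np.exp(eta) + np.exp(-eta))
    return 1 if p_plus >= 0.5 else -1
```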
Slide 21: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 22: Scenario
- Definition of Y, X and C: labels Y in {-1, +1}, a sequence of binary features X = (X_1, X_2, ...), and classifiers c_1, c_2, ..., where c_j predicts the label using feature X_j.
- Definition of the prior: for some small constant and all large n, the prior on the classifiers satisfies a mild decay condition (a natural choice such as Rissanen's universal prior works).
- The prior on eta can be any strictly positive smooth prior (or a discrete prior with sufficient precision).
Slide 23: Scenario II - Definition of true D
- Toss a fair coin to determine the value of Y.
- Toss a coin Z with a fixed bias.
- If Z indicates an easy example, then for all j, set X_j = Y.
- If Z indicates a hard example, then set X_1 to agree with Y with one probability, and for all j >= 2, independently set X_j to agree with Y with a smaller probability.
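A sketch of a data generator with this structure. The bias of Z and the agreement probabilities on hard examples are hypothetical placeholders chosen only to satisfy the qualitative description (X_1 more informative than X_j for j >= 2); the paper's actual constants differ and are tuned to make the inconsistency argument go through.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (NOT the paper's constants):
P_EASY = 0.5    # bias of the coin Z (probability of an "easy" example)
P_X1   = 0.7    # on hard examples: probability that X_1 agrees with Y
P_XJ   = 0.5    # on hard examples: probability that X_j (j >= 2) agrees with Y
N_FEAT = 50     # truncate the feature vector for simulation purposes

def sample(n):
    """Draw n examples (X, Y) with the scenario's qualitative structure."""
    X = np.empty((n, N_FEAT), dtype=int)
    Y = rng.choice([-1, 1], size=n)                  # fair coin for the label
    for i in range(n):
        if rng.random() < P_EASY:                    # easy example: all features copy Y
            X[i, :] = Y[i]
        else:                                        # hard example: noisy copies of Y
            X[i, 0] = Y[i] if rng.random() < P_X1 else -Y[i]
            agree = rng.random(N_FEAT - 1) < P_XJ
            X[i, 1:] = np.where(agree, Y[i], -Y[i])
    return X, Y

X, Y = sample(1000)
# Empirical error of the "predict with feature j" classifiers: c_1 should be best.
print([(X[:, j] != Y).mean() for j in range(5)])
```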
Slide 24: Result
- All features are informative of Y, but X_1 is more informative than all the others, so c_1 is the best classifier in C.
- Nevertheless, with true-D probability 1, the Bayes classifier remains suboptimal as the sample size tends to infinity (although, for each fixed j, the pairwise comparison between c_1 and c_j alone does come out in c_1's favour).
Slide 25: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 26: Theorem 1
(Grünwald and Langford, COLT 2004)
- There exist
  - an input domain,
  - a prior P on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal.
- This holds both for full Bayes and for Bayes MAP.
Slide 27: Theorem 1, extended
[Figure: plot of the suboptimality bounds, with one curve (and its maximum) for Bayes MAP/MDL and one for full Bayes (binary entropy); the point of maximum difference is marked.]
Slide 28: How natural is the scenario?
- The basic scenario is quite unnatural.
- We chose it because we could prove something about it! But:
  - The priors are natural (take e.g. Rissanen's universal prior).
  - Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context.
  - Bayesian inference is consistent under very weak conditions, so even if the scenario is unnatural, the result is still interesting!
Slide 29: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 30: Bayesian Consistency Results
- Doob (1949; special case):
  - Suppose the model is countable and contains the true conditional distribution.
  - Then, with D-probability 1, the posterior concentrates on the true distribution (convergence holds weakly / in Hellinger distance).
Slide 31: Bayesian Consistency Results
- If the posterior predictive distribution converges to the true D (as in Doob's theorem), then the 0/1 risk of the Bayes classifier must also converge to the optimal one.
- Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified: the true D is not among the distributions obtained from the classifiers.
- Indeed: the model is homoskedastic (the same noise level for every x), while the true D is heteroskedastic (easy examples are noise-free, hard examples are noisy)!
Slide 32: Bayesian consistency under misspecification
- Suppose we use Bayesian inference based on some model M.
- If the true D is not in M, then, under mild generality conditions, Bayes still converges to the distribution in M that is closest to D in KL-divergence (relative entropy).
- The logistic transformation ensures that this minimum KL-divergence is achieved for a distribution whose classifier c also achieves the minimum generalization error over C.
Slide 33: Bayesian consistency under misspecification
- In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence.
- Apparently, the "mild generality conditions" for Bayesian consistency under misspecification are violated.
- Conditions for consistency under misspecification are much stronger than conditions for standard consistency: the model must either be convex or simple (e.g. parametric).
Slide 34: Is consistency achievable at all?
- Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent:
  - Vapnik's methods (based on VC-dimension etc.)
  - McAllester's PAC-Bayes methods
- These methods invariably penalize complex (low-prior) classifiers much more than ordinary Bayes does; the simplest version of a PAC-Bayes-style bound for a countable class is sketched below.
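The bound below is one standard "simplest version" of this kind: an Occam's-razor / union bound over a countable class with prior P, via Hoeffding's inequality. The exact form used on the original slide may differ.

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for every classifier c with prior mass P(c) > 0:
\mathrm{err}_D(c) \;\le\; \widehat{\mathrm{err}}_S(c)
   + \sqrt{\frac{\ln \frac{1}{P(c)} + \ln \frac{1}{\delta}}{2n}} .

% Selecting the c that minimizes the right-hand side penalizes a low-prior (complex)
% classifier by an additive term of order \sqrt{\ln(1/P(c))/n} on the error rate,
% whereas the Bayesian criterion trades \ln(1/P(c)) off against n times the empirical
% log loss, so its relative penalty shrinks like \ln(1/P(c))/n -- a weaker penalty.
```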
Slide 35: Consistency and Data Compression - I
- Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm.
- MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors.
- ... however:
Slide 36: Consistency and Data Compression - II
- There already exist (in)famous inconsistency results for Bayesian inference by Diaconis and Freedman.
- For some highly non-parametric models, even if the true D is in the model, Bayes may not converge to it.
- These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data.
- With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron '98).
Slide 37: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get? (what actually happens)
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 38: Theorem 2 - the full Bayes result is tight
[Figure (same plot as on slide 27): the suboptimality bounds, with one curve (and its maximum) for Bayes MAP/MDL and one for full Bayes (binary entropy); the point of maximum difference is marked.]
Slide 39: Theorem 2
Slides 40-43: Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss.
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a log-term.
- Combining the two bounds with the law of large numbers / Hoeffding's inequality yields the result (see the derivation below).
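The second bullet is an instance of the standard bound on the cumulative log loss of a Bayesian mixture. A sketch under the logistic parameterization used in the earlier examples (the slides' exact constants, and the continuous prior on eta, are handled slightly differently in the paper):

```latex
% For any classifier c with P(c) > 0 and any \eta with prior mass P(\eta) > 0
% (\eta discretized for simplicity), and for EVERY sequence (x_1,y_1),\ldots,(x_n,y_n):
-\log P_{\mathrm{Bayes}}(y^n \mid x^n)
  \;\le\; \log\frac{1}{P(c)\,P(\eta)} - \log p_{c,\eta}(y^n \mid x^n)
  \;=\; \log\frac{1}{P(c)\,P(\eta)} + 2\eta\, m_c + n \log\bigl(1 + e^{-2\eta}\bigr),

% where m_c is the number of mistakes c makes on the sequence. Taking c to be the
% 0/1-optimal classifier bounds the accumulated log loss of Bayes by (a rescaling of)
% the optimal number of mistakes plus an O(\log)-term. Combined with the first bullet
% (the log loss of Bayes upper bounds its 0/1-loss) and Hoeffding's inequality / the
% law of large numbers, this limits how far above the optimal error the Bayes
% classifier can end up, which is what makes the result of Theorem 1 tight.
```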
Slide 44: Wait a minute...
- The accumulated log loss of the sequential Bayesian predictions is always within a small (logarithmic) term of the accumulated log loss of the optimal distribution in the model.
- So Bayes is good with respect to log loss / KL-divergence.
- But Bayes is bad with respect to 0/1-loss. How is this possible?
- The Bayesian posterior effectively becomes a mixture of bad distributions (a different mixture at different sample sizes m).
- The mixture is closer to the true distribution D than the best single distribution in the model, in terms of KL-divergence / log loss prediction.
- But it performs worse than the best single classifier in terms of 0/1 error.
Slide 45: Bayes predicts too well
- Let M be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to M.
- One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in M.
Slide 46: Conclusion
- Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification.
- A Bayesian may argue that the Bayesian machinery was never intended for misspecified models.
- Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time.
- In this case, Bayes may overfit even in the limit of an infinite amount of data.
Slide 47: Thank you for your attention!