Title: Inconsistency of Bayes and MDL under Misspecification
1 Inconsistency of Bayes and MDL under Misspecification
Peter Grünwald, CWI, Amsterdam, www.grunwald.nl
Extension of joint work with John Langford, TTI Chicago (COLT 2004). Also presented at the Bayesian VALENCIA 2006 meeting.
2 Suboptimality of Bayes and MDL in Classification (original title)
3 Our Result
- We study Bayesian and Minimum Description Length (MDL) inference in classification problems
- Bayes and MDL should automatically deal with overfitting
- We show that there exist classification domains where standard versions of Bayes and MDL perform suboptimally (overfit!) even as the sample size tends to infinity
4 Why is this interesting?
- Practical viewpoint
- Bayesian methods are used a lot in practice and are sometimes claimed to be universally optimal
- MDL methods are even designed to deal with overfitting
- Yet MDL and Bayes can fail even with infinite data
- Theoretical viewpoint
- How can the result be reconciled with various strong Bayesian consistency theorems?
5 Menu
- Classification
- Abstract statement of main result
- Precise statement of result
- Discussion: classification vs. misspecification
6 Classification
- Given:
- a feature space
- a label space
- a sample
- a set of hypotheses (classifiers)
- Goal: find a classifier that makes few mistakes on future data from the same source
- We then say the classifier has small generalization error / classification risk
7 Classification Models
- Types of classifiers:
- hard classifiers (-1/+1 output): decision trees, stumps, forests
- soft classifiers (real-valued output): support vector machines, neural networks
- probabilistic classifiers: naïve Bayes / Bayesian network classifiers, logistic regression
- Initial focus: hard classifiers
8 Generalization Error
- As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution on the joint (input, label) space
- The performance of a classifier is measured in terms of its generalization error (classification risk), defined as follows.
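A sketch of this risk in standard notation; the symbols D for the joint distribution, c for a classifier, and err for the risk are notational assumptions, not the slide's own:

```latex
% Generalization error (classification risk) of a classifier c under
% the unknown joint distribution D on (input, label) pairs:
\[
  \mathrm{err}_D(c)
    \;=\; P_{(X,Y)\sim D}\bigl(c(X) \neq Y\bigr)
    \;=\; \mathbb{E}_{(X,Y)\sim D}\bigl[\mathbf{1}\{c(X)\neq Y\}\bigr].
\]
```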
9 Learning Algorithms
- A learning algorithm based on a set of candidate classifiers is a function that, for each sample of arbitrary length, outputs a classifier from that set
10 Consistent Learning Algorithms
- Suppose the examples are i.i.d.
- A learning algorithm is consistent, or asymptotically optimal, if, no matter what the true distribution is, the risk of the learned classifier converges to the risk of the best classifier in the set, in probability, as the sample size tends to infinity
11 Consistent Learning Algorithms
- Same statement, with the two quantities annotated: the learned classifier on one side, the best classifier in the set on the other (see the formal sketch below)
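In the same assumed notation, a minimal sketch of the consistency requirement; C denotes the set of candidate classifiers and \hat{c}_n the output of the learning algorithm on a sample of size n:

```latex
% Consistency / asymptotic optimality of a learning algorithm:
% for every true distribution D, as n -> infinity,
\[
  \mathrm{err}_D(\hat{c}_n) \;-\; \inf_{c \in \mathcal{C}} \mathrm{err}_D(c)
    \;\xrightarrow{\;P\;}\; 0 .
\]
```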
12 Main Result
- There exists:
- an input domain
- a prior, non-zero on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
13 Main Result
- The same holds for the MDL algorithm
14 Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
15 Bayesian Learning of Classifiers
- Problem: Bayesian inference is defined for models that are sets of probability distributions
- In our scenario, models are sets of classifiers, i.e. functions from features to labels
- How can we find a posterior over classifiers using Bayes' rule?
- Standard answer: convert each classifier to a corresponding distribution and apply Bayes to the set of distributions thus obtained
16 Classifiers to probability distributions
- Standard conversion method from classifiers to distributions: the logistic (sigmoid) transformation
- For each classifier and each value of a scale parameter, define a conditional distribution of the label given the features (sketched below)
- Define priors on the classifiers and on the scale parameter, and combine them into a joint prior
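One standard way to write this conversion, used here as an assumed parameterization (labels and classifier outputs in {-1,+1}, scale parameter eta > 0):

```latex
% Logistic (sigmoid) conversion of a hard classifier c into a
% conditional distribution over labels y in {-1,+1}, scale eta > 0:
\[
  P_{c,\eta}(y \mid x) \;=\; \frac{1}{1 + e^{-\eta\, y\, c(x)}} .
\]
% Joint prior: combine a prior on classifiers with a prior on eta:
\[
  \pi(c,\eta) \;=\; \pi(c)\,\pi(\eta).
\]
```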
17 Logistic transformation - intuition
- Consider hard classifiers
- For each classifier and scale parameter, the likelihood of the data can be written in terms of the empirical error the classifier makes on the data, i.e. the number of mistakes it makes on the data
18 Logistic transformation - intuition
- For a fixed scale parameter, the log-likelihood is a linear function of the number of mistakes the classifier makes on the data
- The (log-)likelihood is maximized for the classifier that is optimal for the observed data
- For a fixed classifier, maximizing the likelihood over the scale parameter also makes sense (see the derivation below)
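Under the parameterization sketched above, a short derivation of why the log-likelihood is affine in the number of mistakes; m_n(c) is assumed notation for the empirical mistake count:

```latex
% Log loss of (c, eta) on a sample (x_1,y_1),...,(x_n,y_n), where
% m_n(c) is the number of mistakes c makes on the data:
\[
  -\log P_{c,\eta}(y^n \mid x^n)
    \;=\; \sum_{i=1}^{n} \log\bigl(1 + e^{-\eta\, y_i c(x_i)}\bigr)
    \;=\; n \log\bigl(1 + e^{-\eta}\bigr) \;+\; \eta\, m_n(c).
\]
% For fixed eta, maximizing the likelihood over c minimizes m_n(c);
% for fixed c, the likelihood can also be maximized over eta.
```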
19 Logistic transformation - intuition
- In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without giving any motivation or explanation
- We did not find it in Bayesian textbooks, but tested it with three well-known Bayesians!
20 Logistic transformation - intuition
- Analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise
- The transformation expresses the label as the classifier's output corrupted by Z, where Z is an independent noise bit
21 Main Result (Grünwald & Langford, COLT 2004)
- There exists:
- an input domain
- a prior on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
- This holds both for full Bayes and for Bayes (S)MAP
22 Definition of the full Bayes learning algorithm
- Posterior
- Predictive distribution
- Full Bayes learning algorithm
(standard forms sketched below)
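A sketch of these three objects, in the notation assumed earlier; the symbols are placeholders rather than the slide's own:

```latex
% Posterior over (classifier, scale parameter) given the sample:
\[
  \pi(c,\eta \mid x^n, y^n) \;\propto\; P_{c,\eta}(y^n \mid x^n)\,\pi(c)\,\pi(\eta).
\]
% Posterior predictive distribution for a new label:
\[
  P(y \mid x,\, x^n, y^n)
    \;=\; \sum_{c} \int P_{c,\eta}(y \mid x)\, \pi(c,\eta \mid x^n, y^n)\, d\eta.
\]
% Full Bayes learning algorithm: classify by the more probable label:
\[
  \hat{c}_{\mathrm{Bayes}}(x) \;=\; \arg\max_{y \in \{-1,+1\}} P(y \mid x,\, x^n, y^n).
\]
```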
23 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
24 Scenario
- Definition of the feature space and the classifier set
- Definition of the prior: for some small constant, the prior on the classifiers decays accordingly for all large indices
- The prior on the scale parameter can be any strictly positive smooth prior (or a discrete prior with sufficient precision)
25 Scenario II: Definition of the true distribution
- Toss a fair coin to determine the value of the label
- Toss a coin with a certain bias to determine whether the example is easy or hard
- If the example is easy, then all features are set equal to the label
- If the example is hard, then the first feature is set according to a specified rule, and all other features are set independently
(see the simulation sketch below)
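A minimal simulation sketch of a data-generating process of this shape. The concrete numbers below (the easy/hard bias, how often the first feature agrees with the label on hard examples, the number of features) are hypothetical placeholders chosen to be consistent with the description above, not the values used in the actual construction:

```python
import random

# Hypothetical parameters -- placeholders, not the construction's actual values.
P_EASY = 0.5            # bias of the easy/hard coin
P_FIRST_AGREES = 0.75   # on hard examples: chance the first feature equals the label
NUM_FEATURES = 20       # finite stand-in for the countable set of features


def sample_example():
    """Draw one (features, label) pair from the sketched distribution."""
    y = random.choice([-1, +1])                 # fair coin for the label
    if random.random() < P_EASY:
        x = [y] * NUM_FEATURES                  # easy example: every feature copies the label
    else:
        # Hard example: the first feature agrees with the label with some
        # probability > 1/2; all remaining features are independent fair coins.
        x = [y if random.random() < P_FIRST_AGREES else -y]
        x += [random.choice([-1, +1]) for _ in range(NUM_FEATURES - 1)]
    return x, y


# Classifier c_j simply outputs feature j; under this process the first
# classifier is the best one, since all features are informative but the
# first is the most informative.
def classify(j, x):
    return x[j]
```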
26 Result
- All features are informative of the label, but the first feature is more informative than all the others, so the corresponding classifier is the best classifier in the set
- Nevertheless, with probability 1 under the true distribution, the learned classifier remains suboptimal as the sample size tends to infinity (but note that for each fixed ..., ...)
27 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
28 Theorem 1 (Grünwald & Langford, COLT 2004)
- There exists:
- an input domain
- a prior on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
- This holds both for full Bayes and for Bayes MAP
29 Theorem 1 (Grünwald & Langford, COLT 2004)
- As above; the parameters of the construction are interdependent
30 Theorem 1, extended
- Figure: asymptotic performance as a function of the quantity on the X-axis, with maxima indicated for Bayes MAP/MDL and for full Bayes (binary entropy); the maximum difference is achieved at a particular point, with probability 1, for all large n
31 Theorem 1, extended
- Same figure
32 Theorem 1, extended
- Same figure
- Bayes can get much worse than random guessing!
33 How can Bayes get so bad?
- Consider what happens on the hard examples
- The MAP value is achieved by a large set of classifiers
- Since these all err independently on hard examples, by the Law of Large Numbers the fraction of MAP classifiers making a wrong prediction will be about one half
- Therefore, if the current example is hard, the predictive distribution puts roughly equal weight on both labels. But then Bayes essentially predicts at random on hard examples! (see the illustration below)
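A small numerical illustration of the Law of Large Numbers step, assuming (hypothetically) a roughly uniform posterior over a large set of classifiers that err independently with probability 1/2 on a hard example:

```python
import random

random.seed(0)
num_classifiers = 10_000      # hypothetical size of the set of MAP classifiers
y_true = +1                   # the correct label of one hard example

# Each classifier independently predicts the correct label with probability 1/2.
votes = [random.choice([y_true, -y_true]) for _ in range(num_classifiers)]

# Under a (roughly) uniform posterior over these classifiers, the posterior
# predictive probability of the correct label is just the fraction of
# correct votes -- which concentrates around 1/2.
p_correct = votes.count(y_true) / num_classifiers
print(f"predictive probability of the correct label: {p_correct:.3f}")
```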
34 Theorem 2: the full Bayes result is tight
- Figure: same plot as before, with curves for Bayes MAP/MDL and full Bayes (binary entropy); now the maximum is taken over all ...
35 How natural is the scenario?
- The basic scenario is quite unnatural, but:
- Although it may not happen in real life, describing the worst that could happen is interesting in itself
- The priors are natural (take e.g. Rissanen's universal prior)
- Clarke (2003) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context
- Bayesian inference is consistent under very weak conditions, so even if unnatural, the result is still interesting!
36 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
37 Is the result surprising - I?
- Methods proposed in the statistical learning theory literature are consistent
- Vapnik's SRM, McAllester's PAC-Bayes methods
- These methods punish complex (low prior) classifiers much more than ordinary Bayes, e.g. in the simplest version of PAC-Bayes
- They are based on generalization bounds that suggest Bayes is inconsistent in classification
- Our result is still interesting: we exhibit a concrete scenario that shows the worst that can happen
38 Is the result surprising - II?
- There exist various strong consistency results for Bayesian inference
- Superficially, it seems our result contradicts these
- How can we reconcile the two?
39 Bayesian Consistency Results
- Doob (1949), Blackwell & Dubins (1962), Barron (1985)
- Suppose the model is countable and contains the true conditional distribution
- Then, with probability 1 under the true distribution, as the sample size tends to infinity, the posterior concentrates on the true distribution (see below)
40 Bayesian Consistency Results
- If the posterior concentrates on the true distribution, then we must also have that the resulting classifications become optimal
- Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified
- The model is homoskedastic, the truth heteroskedastic!
41 Bayesian consistency under misspecification
- Suppose we use Bayesian inference based on a misspecified model
- If the true distribution is not in the model, then under mild generality conditions the Bayes predictive distribution still converges to the distribution in the model that is closest to the true distribution in KL-divergence (relative entropy)
- The logistic transformation ensures that the minimum KL-divergence is achieved for a classifier that also achieves the minimum classification risk (see below)
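In assumed notation, the limit point referred to here is the KL-projection of the true distribution D onto the model M:

```latex
% KL-projection of the true distribution D onto the model M:
\[
  \tilde{p} \;=\; \arg\min_{p \in M} D(D \,\|\, p),
  \qquad
  D(D \,\|\, p) \;=\; \mathbb{E}_{(X,Y)\sim D}\!\left[
      \log \frac{D(Y \mid X)}{p(Y \mid X)} \right].
\]
% The logistic transformation is set up so that the minimizer corresponds
% to the classifier with the smallest classification risk.
```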
42 Bayes consistency under misspecification
- In our case, Bayes does not converge to the distribution with the smallest classification risk, so it also does not converge to the distribution closest to the truth in KL-divergence
- Apparently, the mild generality conditions for Bayesian consistency under misspecification are violated
- Conditions for consistency under misspecification are much stronger than conditions for standard consistency!
- The model must either be convex or simple (e.g. parametric)
43 Misspecification Inconsistency
- Our inconsistency theorem is fundamentally different from earlier ones such as Barron (1998) and Diaconis/Freedman (1986)
- We can choose the model to be a countable set of i.i.d. distributions. Then, if the true distribution were in the model, consistency would be guaranteed
- MDL is immune to the Diaconis/Freedman inconsistency, but not to the misspecification inconsistency
- Diaconis/Freedman use priors such that the Bayesian universal code does not compress the data. Such priors make no sense from an MDL point of view.
44 Misspecification Reformulation
- For all ..., there exists a distribution on ... and a prior on a countable set of distributions such that, for some element of that set with ...
- Yet for all ..., with probability 1 under the true distribution, ...
45 Bayes predicts too well
- Theorem 3: Let the model be a set of distributions. Under mild regularity conditions (e.g. if the model is countable), the only true distributions D for which the Bayesian posterior is inconsistent (i.e., for some ..., almost surely, ...)
- are those under which the posterior predictive distribution becomes strictly closer in KL-divergence to D than the best single distribution in the model
- i.e., there exists a constant such that, almost surely, the predictive distribution is closer by at least that constant, for infinitely many m
46 Conclusion
- Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification
- A Bayesian may argue that the Bayesian machinery was never intended for misspecified models
- Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time
- In this case, Bayes may overfit even in the limit of an infinite amount of data
47 Thank you for your attention!
48 Wait a minute!
49 Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term (see the bound below)
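A sketch of the bound behind this step, in the notation assumed earlier, stated for a pair (c, eta) with positive prior mass (e.g. under a discrete prior on the scale parameter):

```latex
% The Bayesian marginal likelihood dominates each single term of the
% mixture, so for every sequence and every (c, eta) with pi(c, eta) > 0:
\[
  -\log P_{\mathrm{Bayes}}(y^n \mid x^n)
    \;\le\; -\log P_{c,\eta}(y^n \mid x^n) \;-\; \log \pi(c,\eta).
\]
% Combined with the affine relation between log loss and the number of
% mistakes, this bounds the log loss of Bayes by the log loss of the
% 0/1-optimal classifier plus a logarithmic/constant term.
```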
50 Wait a minute
- The accumulated log loss of sequential Bayesian predictions is always within a constant of the accumulated log loss of the optimal single distribution
- So Bayes is good with respect to log loss / KL-divergence
- But Bayes is bad with respect to 0/1-loss
- How is this possible?
- The Bayesian posterior effectively becomes a mixture of bad distributions (a different mixture at different m)
- The mixture is closer to the true distribution D than the best single distribution in terms of KL-divergence / log loss prediction
- But it performs worse than the best single distribution in terms of 0/1 error
51 Is consistency achievable at all?
- Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent
- Vapnik's methods (based on VC-dimension etc.)
- McAllester's PAC-Bayes methods
- These methods invariably punish complex (low prior) classifiers much more than ordinary Bayes, e.g. in the simplest version of PAC-Bayes
52 Consistency and Data Compression - I
- Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm
- MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors
- ...however:
53 Consistency and Data Compression - II
- There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman
- For some non-parametric models, even if the true distribution D is in the model, Bayes may not converge to it
- These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data
- With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron 1998)
54 Proof Sketch
55 Theorem 2
56 Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term (Law of Large Numbers / Hoeffding)