Title: Suboptimality of Bayes and MDL in Classification
Slide 1: Suboptimality of Bayes and MDL in Classification
Peter Grünwald, CWI/EURANDOM, www.grunwald.nl
Joint work with John Langford, TTI Chicago
A preliminary version appeared in Proceedings of the 17th Annual Conference On Learning Theory (COLT 2004).
Slide 2: Our Result
- We study Bayesian and Minimum Description Length (MDL) inference in classification problems.
- Bayes and MDL should automatically deal with overfitting.
- We show there exist classification domains where Bayes and MDL, when applied in a standard manner, perform suboptimally (overfit!) even as the sample size tends to infinity.
Slide 3: Why is this interesting?
- Practical viewpoint
  - Bayesian methods: used a lot in practice; sometimes claimed to be universally optimal.
  - MDL methods: even designed to deal with overfitting.
  - Yet MDL and Bayes can fail even with infinite data.
- Theoretical viewpoint
  - How can the result be reconciled with various strong Bayesian consistency theorems?
Slide 4: Menu
- Classification
- Abstract statement of main result
- Precise statement of result
- Discussion
Slide 5: Classification
- Given:
  - Feature space X
  - Label space Y = {-1, +1}
  - Sample S = ((x_1, y_1), ..., (x_n, y_n))
  - Set of hypotheses (classifiers) C; each c in C maps features to labels
- Goal: find a c in C that makes few mistakes on future data from the same source.
- We then say c has small generalization error / classification risk.
Slide 6: Classification Models
- Types of classifiers:
  - Hard classifiers (-1/+1 output): decision trees, stumps, forests  [our initial focus]
  - Soft classifiers (real-valued output): support vector machines, neural networks
  - Probabilistic classifiers: naïve Bayes / Bayesian network classifiers, logistic regression
Slide 7: Generalization Error
- As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label) space.
- The performance of a classifier c is measured in terms of its generalization error (classification risk), defined as the probability under D that c misclassifies a new example (written out below).
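In standard notation (with labels in {-1, +1}), the generalization error and its empirical counterpart on a sample S of size n are:

```latex
% Generalization error (classification risk) of a classifier c under D:
\mathrm{err}_D(c) = P_{(X,Y) \sim D}\bigl(c(X) \neq Y\bigr)

% Empirical error of c on the sample S = ((x_1,y_1),\ldots,(x_n,y_n)):
\widehat{\mathrm{err}}_S(c) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{c(x_i) \neq y_i\}
```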
Slide 8: Learning Algorithms
- A learning algorithm LA, based on a set of candidate classifiers C, is a function that, for each sample S of arbitrary length, outputs a classifier LA(S) in C.
Slides 9-10: Consistent Learning Algorithms
- Suppose the data (X_1, Y_1), (X_2, Y_2), ... are i.i.d. according to D.
- A learning algorithm LA is consistent or asymptotically optimal if, no matter what the true distribution D is,
  err_D(LA(S)) -> err_D(c*) in D-probability as the sample size tends to infinity,
  where LA(S) is the learned classifier and c* is the best classifier in C (the one minimizing err_D over C).
Slides 11-12: Main Result
- There exist
  - an input domain,
  - a prior P, non-zero on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in C by at least K, even as the sample size tends to infinity.
- The same holds for the MDL learning algorithm.
Slide 13: Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 14: Bayesian Learning of Classifiers
- Problem: Bayesian inference is defined for models that are sets of probability distributions.
- In our scenario, models are sets of classifiers C, i.e. functions from features to labels.
- How can we find a posterior over classifiers using Bayes' rule?
- Standard answer: convert each c in C to a corresponding distribution p_c and apply Bayes to the set of distributions thus obtained.
Slide 15: Classifiers -> probability distributions
- Standard conversion method from classifiers to distributions: the logistic (sigmoid) transformation.
- For each c in C and each eta > 0, define a conditional distribution p_{c,eta}(y | x) whose value depends only on whether c(x) = y, with eta controlling how peaked it is.
- Define priors on C and on eta, and take the joint prior to be their product.
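For concreteness, here is a minimal sketch of one common parameterization of this transformation for hard classifiers with labels in {-1, +1}; the precise form and the treatment of eta in the paper may differ.

```python
import numpy as np

def logistic_likelihood(c, x, y, eta):
    """One common form of the logistic transformation of a hard classifier.

    c   : function mapping a feature vector x to a label in {-1, +1}
    eta : positive "confidence" parameter
    Returns p_{c,eta}(y | x), which depends only on whether c classifies x correctly.
    """
    margin = y * c(x)                      # +1 if c is correct on (x, y), -1 otherwise
    return np.exp(eta * margin) / (np.exp(eta) + np.exp(-eta))

# Example: a classifier that predicts the sign of the first feature.
c1 = lambda x: 1 if x[0] >= 0 else -1
x = np.array([0.7, -0.2])
print(logistic_likelihood(c1, x, 1, eta=1.0))    # > 0.5: c1 is correct on (x, +1)
print(logistic_likelihood(c1, x, -1, eta=1.0))   # < 0.5: c1 is wrong on (x, -1)
```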
Slide 16: Logistic transformation - intuition
- Consider hard classifiers c mapping features to {-1, +1}.
- For each c and eta, the likelihood of the sample under p_{c,eta} depends on the data only through the number of mistakes c makes.
- Here err_S(c) denotes the empirical error that c makes on the data, so n * err_S(c) is the number of mistakes c makes on the data.
Slide 17: Logistic transformation - intuition
- For fixed eta:
  - The log-likelihood is a (decreasing) linear function of the number of mistakes c makes on the data.
  - The (log-)likelihood is therefore maximized by the c that is optimal for the observed data.
- For fixed c:
  - Maximizing the likelihood over eta also makes sense: it fits the noise level to c's empirical error.
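To make the "linear in the number of mistakes" claim concrete, here is a short derivation under the parameterization sketched above (labels and classifier outputs in {-1, +1}); the constants differ under other, equivalent parameterizations.

```latex
% Log-likelihood of an i.i.d. sample under p_{c,\eta}(y \mid x) = e^{\eta y c(x)} / (e^{\eta} + e^{-\eta}),
% with m = number of mistakes c makes on (x_1,y_1),\ldots,(x_n,y_n):
\log p_{c,\eta}(y^n \mid x^n)
  = \eta \sum_{i=1}^{n} y_i c(x_i) - n \log\bigl(e^{\eta} + e^{-\eta}\bigr)
  = \eta (n - 2m) - n \log\bigl(e^{\eta} + e^{-\eta}\bigr).

% For fixed \eta > 0 this is a decreasing linear function of m, so maximizing over c
% minimizes the number of mistakes. For fixed c, setting the derivative in \eta to zero
% gives \tanh(\hat\eta) = 1 - 2m/n, i.e. \hat\eta = \tfrac{1}{2}\log\frac{n-m}{m},
% so the fitted \eta encodes c's empirical error rate.
```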
Slide 18: Logistic transformation - intuition
- In Bayesian practice, the logistic transformation is a standard tool, nowadays often performed without any motivation or explanation.
- We did not find it in Bayesian textbooks, but we tested it with three well-known Bayesians!
- It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise.
- The transformation expresses the label as the classifier's output corrupted by an independent noise bit Z.
Slide 19: Main Result
(Grünwald and Langford, COLT 2004)
- There exist
  - an input domain,
  - a prior P on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal.
- This holds both for full Bayes and for Bayes (S)MAP.
Slide 20: Definition of the full Bayes learning algorithm
- Posterior: obtained via Bayes' rule from the prior on (c, eta) and the likelihoods p_{c,eta}(y^n | x^n).
- Predictive distribution: the posterior-weighted mixture of the p_{c,eta}.
- Full Bayes learning algorithm: classify a new x with the label that the predictive distribution considers most likely (a sketch follows below).
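A minimal sketch of this pipeline under simplifying assumptions: a finite set of classifiers, a discretized grid of eta values, and the logistic likelihood from the earlier sketch. The paper's construction uses a countably infinite class and specific priors.

```python
import numpy as np

def bayes_predict(classifiers, prior_c, etas, prior_eta, X, y, x_new):
    """Full-Bayes 0/1 prediction over (classifier, eta) pairs.

    classifiers : list of functions x -> {-1, +1}
    prior_c     : prior weights over classifiers (sums to 1)
    etas        : grid of eta values; prior_eta their prior weights (sums to 1)
    X, y        : observed sample; x_new : point to classify
    """
    # Log-posterior (up to a constant) for each (c, eta) pair.
    log_post = np.empty((len(classifiers), len(etas)))
    for i, c in enumerate(classifiers):
        margins = np.array([yi * c(xi) for xi, yi in zip(X, y)])
        for j, eta in enumerate(etas):
            loglik = np.sum(eta * margins - np.log(np.exp(eta) + np.exp(-eta)))
            log_post[i, j] = np.log(prior_c[i]) + np.log(prior_eta[j]) + loglik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Predictive probability that the new label is +1, then threshold at 1/2.
    p_plus = 0.0
    for i, c in enumerate(classifiers):
        for j, eta in enumerate(etas):
            m = c(x_new)                       # +1 or -1
            p_plus += post[i, j] * np.exp(eta * m) / (np.exp(eta) + np.exp(-eta))
    return 1 if p_plus >= 0.5 else -1
```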
Slide 21: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 22: Scenario
- Definition of Y, X and C: labels Y in {-1, +1}, a sequence of binary features X = (X_1, X_2, ...), and classifiers c_1, c_2, ..., where c_j predicts the label using feature X_j.
- Definition of the prior: for some small constant and all large n, the prior on the classifiers satisfies a mild decay condition (a natural choice such as Rissanen's universal prior works).
- The prior on eta can be any strictly positive smooth prior (or a discrete prior with sufficient precision).
Slide 23: Scenario II - Definition of true D
- Toss a fair coin to determine the value of Y.
- Toss a coin Z with a fixed bias.
- If Z indicates an easy example, then for all j, set X_j = Y.
- If Z indicates a hard example, then set X_1 to agree with Y with one probability, and for all j >= 2, independently set X_j to agree with Y with a smaller probability.
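A sketch of a data generator with this structure. The bias of Z and the agreement probabilities on hard examples are hypothetical placeholders chosen only to satisfy the qualitative description (X_1 more informative than X_j for j >= 2); the paper's actual constants differ and are tuned to make the inconsistency argument go through.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (NOT the paper's constants):
P_EASY = 0.5    # bias of the coin Z (probability of an "easy" example)
P_X1   = 0.7    # on hard examples: probability that X_1 agrees with Y
P_XJ   = 0.5    # on hard examples: probability that X_j (j >= 2) agrees with Y
N_FEAT = 50     # truncate the feature vector for simulation purposes

def sample(n):
    """Draw n examples (X, Y) with the scenario's qualitative structure."""
    X = np.empty((n, N_FEAT), dtype=int)
    Y = rng.choice([-1, 1], size=n)                  # fair coin for the label
    for i in range(n):
        if rng.random() < P_EASY:                    # easy example: all features copy Y
            X[i, :] = Y[i]
        else:                                        # hard example: noisy copies of Y
            X[i, 0] = Y[i] if rng.random() < P_X1 else -Y[i]
            agree = rng.random(N_FEAT - 1) < P_XJ
            X[i, 1:] = np.where(agree, Y[i], -Y[i])
    return X, Y

X, Y = sample(1000)
# Empirical error of the "predict with feature j" classifiers: c_1 should be best.
print([(X[:, j] != Y).mean() for j in range(5)])
```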
Slide 24: Result
- All features are informative of Y, but X_1 is more informative than all the others, so c_1 is the best classifier in C.
- Nevertheless, with true-D probability 1, the Bayes classifier remains suboptimal as the sample size tends to infinity (although, for each fixed j, the pairwise comparison between c_1 and c_j alone does come out in c_1's favour).
Slide 25: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 26: Theorem 1
(Grünwald and Langford, COLT 2004)
- There exist
  - an input domain,
  - a prior P on a countable set of classifiers,
  - a true distribution D,
  - a constant K > 0,
- such that the Bayesian learning algorithm is asymptotically K-suboptimal.
- This holds both for full Bayes and for Bayes MAP.
Slide 27: Theorem 1, extended
[Figure: plot of the suboptimality bounds, with one curve (and its maximum) for Bayes MAP/MDL and one for full Bayes (binary entropy); the point of maximum difference is marked.]
Slide 28: How natural is the scenario?
- The basic scenario is quite unnatural.
- We chose it because we could prove something about it! But:
  - The priors are natural (take e.g. Rissanen's universal prior).
  - Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context.
  - Bayesian inference is consistent under very weak conditions, so even if the scenario is unnatural, the result is still interesting!
Slide 29: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get?
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 30: Bayesian Consistency Results
- Doob (1949; special case):
  - Suppose the model is countable and contains the true conditional distribution.
  - Then, with D-probability 1, the posterior concentrates on the true distribution (convergence holds weakly / in Hellinger distance).
Slide 31: Bayesian Consistency Results
- If the posterior predictive distribution converges to the true D (as in Doob's theorem), then the 0/1 risk of the Bayes classifier must also converge to the optimal one.
- Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified: the true D is not among the distributions obtained from the classifiers.
- Indeed: the model is homoskedastic (the same noise level for every x), while the true D is heteroskedastic (easy examples are noise-free, hard examples are noisy)!
Slide 32: Bayesian consistency under misspecification
- Suppose we use Bayesian inference based on some model M.
- If the true D is not in M, then, under mild generality conditions, Bayes still converges to the distribution in M that is closest to D in KL-divergence (relative entropy).
- The logistic transformation ensures that this minimum KL-divergence is achieved for a distribution whose classifier c also achieves the minimum generalization error over C.
Slide 33: Bayesian consistency under misspecification
- In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence.
- Apparently, the "mild generality conditions" for Bayesian consistency under misspecification are violated.
- Conditions for consistency under misspecification are much stronger than conditions for standard consistency: the model must either be convex or simple (e.g. parametric).
Slide 34: Is consistency achievable at all?
- Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent:
  - Vapnik's methods (based on VC-dimension etc.)
  - McAllester's PAC-Bayes methods
- These methods invariably penalize complex (low-prior) classifiers much more than ordinary Bayes does; the simplest version of a PAC-Bayes-style bound for a countable class is sketched below.
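The bound below is one standard "simplest version" of this kind: an Occam's-razor / union bound over a countable class with prior P, via Hoeffding's inequality. The exact form used on the original slide may differ.

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for every classifier c with prior mass P(c) > 0:
\mathrm{err}_D(c) \;\le\; \widehat{\mathrm{err}}_S(c)
   + \sqrt{\frac{\ln \frac{1}{P(c)} + \ln \frac{1}{\delta}}{2n}} .

% Selecting the c that minimizes the right-hand side penalizes a low-prior (complex)
% classifier by an additive term of order \sqrt{\ln(1/P(c))/n} on the error rate,
% whereas the Bayesian criterion trades \ln(1/P(c)) off against n times the empirical
% log loss, so its relative penalty shrinks like \ln(1/P(c))/n -- a weaker penalty.
```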
Slide 35: Consistency and Data Compression - I
- Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm.
- MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors.
- ... however:
Slide 36: Consistency and Data Compression - II
- There already exist (in)famous inconsistency results for Bayesian inference by Diaconis and Freedman.
- For some highly non-parametric models, even if the true D is in the model, Bayes may not converge to it.
- These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data.
- With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron '98).
Slide 37: Issues/Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
  - What do C, the true distribution D, and the prior P look like?
- How dramatic is the result?
  - How large is K?
  - How strange are the choices for C, D, and P?
  - How bad can Bayes get? (what actually happens)
- Why is the result surprising?
  - Can it be reconciled with Bayesian consistency results?
Slide 38: Theorem 2 - the full Bayes result is tight
[Figure (same plot as on slide 27): the suboptimality bounds, with one curve (and its maximum) for Bayes MAP/MDL and one for full Bayes (binary entropy); the point of maximum difference is marked.]
Slide 39: Theorem 2
Slides 40-43: Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss.
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a log-term.
- Combining the two bounds with the law of large numbers / Hoeffding's inequality yields the result (see the derivation below).
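The second bullet is an instance of the standard bound on the cumulative log loss of a Bayesian mixture. A sketch under the logistic parameterization used in the earlier examples (the slides' exact constants, and the continuous prior on eta, are handled slightly differently in the paper):

```latex
% For any classifier c with P(c) > 0 and any \eta with prior mass P(\eta) > 0
% (\eta discretized for simplicity), and for EVERY sequence (x_1,y_1),\ldots,(x_n,y_n):
-\log P_{\mathrm{Bayes}}(y^n \mid x^n)
  \;\le\; \log\frac{1}{P(c)\,P(\eta)} - \log p_{c,\eta}(y^n \mid x^n)
  \;=\; \log\frac{1}{P(c)\,P(\eta)} + 2\eta\, m_c + n \log\bigl(1 + e^{-2\eta}\bigr),

% where m_c is the number of mistakes c makes on the sequence. Taking c to be the
% 0/1-optimal classifier bounds the accumulated log loss of Bayes by (a rescaling of)
% the optimal number of mistakes plus an O(\log)-term. Combined with the first bullet
% (the log loss of Bayes upper bounds its 0/1-loss) and Hoeffding's inequality / the
% law of large numbers, this limits how far above the optimal error the Bayes
% classifier can end up, which is what makes the result of Theorem 1 tight.
```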
Slide 44: Wait a minute...
- The accumulated log loss of the sequential Bayesian predictions is always within a small (logarithmic) term of the accumulated log loss of the optimal distribution in the model.
- So Bayes is good with respect to log loss / KL-divergence.
- But Bayes is bad with respect to 0/1-loss. How is this possible?
- The Bayesian posterior effectively becomes a mixture of bad distributions (a different mixture at different sample sizes m).
- The mixture is closer to the true distribution D than the best single distribution in the model, in terms of KL-divergence / log loss prediction.
- But it performs worse than the best single classifier in terms of 0/1 error.
Slide 45: Bayes predicts too well
- Let M be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to M.
- One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in M.
Slide 46: Conclusion
- Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification.
- A Bayesian may argue that the Bayesian machinery was never intended for misspecified models.
- Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time.
- In this case, Bayes may overfit even in the limit of an infinite amount of data.
Slide 47: Thank you for your attention!