Title: Inconsistency of Bayes and MDL under Misspecification
1 Inconsistency of Bayes and MDL under Misspecification
Peter Grünwald, CWI, Amsterdam, www.grunwald.nl
Extension of joint work with John Langford, TTI Chicago (COLT 2004). Also presented at the Bayesian VALENCIA 2006 meeting.
2 Suboptimality of Bayes and MDL in Classification (original title)
3 Our Result
- We study Bayesian and Minimum Description Length (MDL) inference in classification problems
- Bayes and MDL should automatically deal with overfitting
- We show that there exist classification domains where standard versions of Bayes and MDL perform suboptimally (overfit!) even as the sample size tends to infinity
4 Why is this interesting?
- Practical viewpoint
- Bayesian methods are used a lot in practice and are sometimes claimed to be universally optimal
- MDL methods are even designed to deal with overfitting
- Yet MDL and Bayes can fail even with infinite data
- Theoretical viewpoint
- How can the result be reconciled with various strong Bayesian consistency theorems?
5 Menu
- Classification
- Abstract statement of main result
- Precise statement of result
- Discussion: classification vs. misspecification
6 Classification
- Given:
- a feature space
- a label space
- a sample
- a set of hypotheses (classifiers)
- Goal: find a classifier that makes few mistakes on future data from the same source
- We then say the classifier has small generalization error / classification risk
7 Classification Models
- Types of classifiers:
- hard classifiers (-1/+1 output): decision trees, stumps, forests
- soft classifiers (real-valued output): support vector machines, neural networks
- probabilistic classifiers: naïve Bayes / Bayesian network classifiers, logistic regression
- Initial focus: hard classifiers
8 Generalization Error
- As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution on the joint (input, label) space
- The performance of a classifier is measured in terms of its generalization error (classification risk), defined as follows.
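A sketch of this risk in standard notation; the symbols D for the joint distribution, c for a classifier, and err for the risk are notational assumptions, not the slide's own:

```latex
% Generalization error (classification risk) of a classifier c under
% the unknown joint distribution D on (input, label) pairs:
\[
  \mathrm{err}_D(c)
    \;=\; P_{(X,Y)\sim D}\bigl(c(X) \neq Y\bigr)
    \;=\; \mathbb{E}_{(X,Y)\sim D}\bigl[\mathbf{1}\{c(X)\neq Y\}\bigr].
\]
```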
9 Learning Algorithms
- A learning algorithm based on a set of candidate classifiers is a function that, for each sample of arbitrary length, outputs a classifier from that set
10 Consistent Learning Algorithms
- Suppose the examples are i.i.d.
- A learning algorithm is consistent, or asymptotically optimal, if, no matter what the true distribution is, the risk of the learned classifier converges to the risk of the best classifier in the set, in probability, as the sample size tends to infinity
11 Consistent Learning Algorithms
- Same statement, with the two quantities annotated: the learned classifier on one side, the best classifier in the set on the other (see the formal sketch below)
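In the same assumed notation, a minimal sketch of the consistency requirement; C denotes the set of candidate classifiers and \hat{c}_n the output of the learning algorithm on a sample of size n:

```latex
% Consistency / asymptotic optimality of a learning algorithm:
% for every true distribution D, as n -> infinity,
\[
  \mathrm{err}_D(\hat{c}_n) \;-\; \inf_{c \in \mathcal{C}} \mathrm{err}_D(c)
    \;\xrightarrow{\;P\;}\; 0 .
\]
```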
12 Main Result
- There exists:
- an input domain
- a prior, non-zero on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
13 Main Result
- The same holds for the MDL algorithm
14 Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
15 Bayesian Learning of Classifiers
- Problem: Bayesian inference is defined for models that are sets of probability distributions
- In our scenario, models are sets of classifiers, i.e. functions from features to labels
- How can we find a posterior over classifiers using Bayes' rule?
- Standard answer: convert each classifier to a corresponding distribution and apply Bayes to the set of distributions thus obtained
16 Classifiers to probability distributions
- Standard conversion method from classifiers to distributions: the logistic (sigmoid) transformation
- For each classifier and each value of a scale parameter, define a conditional distribution of the label given the features (sketched below)
- Define priors on the classifiers and on the scale parameter, and combine them into a joint prior
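One standard way to write this conversion, used here as an assumed parameterization (labels and classifier outputs in {-1,+1}, scale parameter eta > 0):

```latex
% Logistic (sigmoid) conversion of a hard classifier c into a
% conditional distribution over labels y in {-1,+1}, scale eta > 0:
\[
  P_{c,\eta}(y \mid x) \;=\; \frac{1}{1 + e^{-\eta\, y\, c(x)}} .
\]
% Joint prior: combine a prior on classifiers with a prior on eta:
\[
  \pi(c,\eta) \;=\; \pi(c)\,\pi(\eta).
\]
```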
17 Logistic transformation - intuition
- Consider hard classifiers
- For each classifier and scale parameter, the likelihood of the data can be written in terms of the empirical error the classifier makes on the data, i.e. the number of mistakes it makes on the data
18 Logistic transformation - intuition
- For a fixed scale parameter, the log-likelihood is a linear function of the number of mistakes the classifier makes on the data
- The (log-)likelihood is maximized for the classifier that is optimal for the observed data
- For a fixed classifier, maximizing the likelihood over the scale parameter also makes sense (see the derivation below)
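Under the parameterization sketched above, a short derivation of why the log-likelihood is affine in the number of mistakes; m_n(c) is assumed notation for the empirical mistake count:

```latex
% Log loss of (c, eta) on a sample (x_1,y_1),...,(x_n,y_n), where
% m_n(c) is the number of mistakes c makes on the data:
\[
  -\log P_{c,\eta}(y^n \mid x^n)
    \;=\; \sum_{i=1}^{n} \log\bigl(1 + e^{-\eta\, y_i c(x_i)}\bigr)
    \;=\; n \log\bigl(1 + e^{-\eta}\bigr) \;+\; \eta\, m_n(c).
\]
% For fixed eta, maximizing the likelihood over c minimizes m_n(c);
% for fixed c, the likelihood can also be maximized over eta.
```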
19 Logistic transformation - intuition
- In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without giving any motivation or explanation
- We did not find it in Bayesian textbooks, but tested it with three well-known Bayesians!
20 Logistic transformation - intuition
- Analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise
- The transformation expresses the label as the classifier's output corrupted by Z, where Z is an independent noise bit
21 Main Result (Grünwald & Langford, COLT 2004)
- There exists:
- an input domain
- a prior on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
- This holds both for full Bayes and for Bayes (S)MAP
22 Definition of the full Bayes learning algorithm
- Posterior
- Predictive distribution
- Full Bayes learning algorithm
(standard forms sketched below)
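A sketch of these three objects, in the notation assumed earlier; the symbols are placeholders rather than the slide's own:

```latex
% Posterior over (classifier, scale parameter) given the sample:
\[
  \pi(c,\eta \mid x^n, y^n) \;\propto\; P_{c,\eta}(y^n \mid x^n)\,\pi(c)\,\pi(\eta).
\]
% Posterior predictive distribution for a new label:
\[
  P(y \mid x,\, x^n, y^n)
    \;=\; \sum_{c} \int P_{c,\eta}(y \mid x)\, \pi(c,\eta \mid x^n, y^n)\, d\eta.
\]
% Full Bayes learning algorithm: classify by the more probable label:
\[
  \hat{c}_{\mathrm{Bayes}}(x) \;=\; \arg\max_{y \in \{-1,+1\}} P(y \mid x,\, x^n, y^n).
\]
```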
23 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
24 Scenario
- Definition of the feature space and the classifier set
- Definition of the prior: for some small constant, the prior on the classifiers decays accordingly for all large indices
- The prior on the scale parameter can be any strictly positive smooth prior (or a discrete prior with sufficient precision)
25 Scenario II: Definition of the true distribution
- Toss a fair coin to determine the value of the label
- Toss a coin with a certain bias to determine whether the example is easy or hard
- If the example is easy, then all features are set equal to the label
- If the example is hard, then the first feature is set according to a specified rule, and all other features are set independently
(see the simulation sketch below)
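A minimal simulation sketch of a data-generating process of this shape. The concrete numbers below (the easy/hard bias, how often the first feature agrees with the label on hard examples, the number of features) are hypothetical placeholders chosen to be consistent with the description above, not the values used in the actual construction:

```python
import random

# Hypothetical parameters -- placeholders, not the construction's actual values.
P_EASY = 0.5            # bias of the easy/hard coin
P_FIRST_AGREES = 0.75   # on hard examples: chance the first feature equals the label
NUM_FEATURES = 20       # finite stand-in for the countable set of features


def sample_example():
    """Draw one (features, label) pair from the sketched distribution."""
    y = random.choice([-1, +1])                 # fair coin for the label
    if random.random() < P_EASY:
        x = [y] * NUM_FEATURES                  # easy example: every feature copies the label
    else:
        # Hard example: the first feature agrees with the label with some
        # probability > 1/2; all remaining features are independent fair coins.
        x = [y if random.random() < P_FIRST_AGREES else -y]
        x += [random.choice([-1, +1]) for _ in range(NUM_FEATURES - 1)]
    return x, y


# Classifier c_j simply outputs feature j; under this process the first
# classifier is the best one, since all features are informative but the
# first is the most informative.
def classify(j, x):
    return x[j]
```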
26 Result
- All features are informative of the label, but the first feature is more informative than all the others, so the corresponding classifier is the best classifier in the set
- Nevertheless, with probability 1 under the true distribution, the learned classifier remains suboptimal as the sample size tends to infinity (but note that for each fixed ..., ...)
27 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
- Can it be reconciled with Bayesian consistency results?
28 Theorem 1 (Grünwald & Langford, COLT 2004)
- There exists:
- an input domain
- a prior on a countable set of classifiers
- a true distribution
- a constant
- such that the Bayesian learning algorithm is asymptotically suboptimal by at least that constant
- This holds both for full Bayes and for Bayes MAP
29 Theorem 1 (Grünwald & Langford, COLT 2004)
- As above; the parameters of the construction are interdependent
30 Theorem 1, extended
- Figure: asymptotic performance as a function of the quantity on the X-axis, with maxima indicated for Bayes MAP/MDL and for full Bayes (binary entropy); the maximum difference is achieved at a particular point, with probability 1, for all large n
31 Theorem 1, extended
- Same figure
32 Theorem 1, extended
- Same figure
- Bayes can get much worse than random guessing!
33 How can Bayes get so bad?
- Consider what happens on the hard examples
- The MAP value is achieved by a large set of classifiers
- Since these all err independently on hard examples, by the Law of Large Numbers the fraction of MAP classifiers making a wrong prediction will be about one half
- Therefore, if the current example is hard, the predictive distribution puts roughly equal weight on both labels. But then Bayes essentially predicts at random on hard examples! (see the illustration below)
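A small numerical illustration of the Law of Large Numbers step, assuming (hypothetically) a roughly uniform posterior over a large set of classifiers that err independently with probability 1/2 on a hard example:

```python
import random

random.seed(0)
num_classifiers = 10_000      # hypothetical size of the set of MAP classifiers
y_true = +1                   # the correct label of one hard example

# Each classifier independently predicts the correct label with probability 1/2.
votes = [random.choice([y_true, -y_true]) for _ in range(num_classifiers)]

# Under a (roughly) uniform posterior over these classifiers, the posterior
# predictive probability of the correct label is just the fraction of
# correct votes -- which concentrates around 1/2.
p_correct = votes.count(y_true) / num_classifiers
print(f"predictive probability of the correct label: {p_correct:.3f}")
```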
34 Theorem 2: the full Bayes result is tight
- Figure: same plot as before, with curves for Bayes MAP/MDL and full Bayes (binary entropy); now the maximum is taken over all ...
35 How natural is the scenario?
- The basic scenario is quite unnatural, but:
- Although it may not happen in real life, describing the worst that could happen is interesting in itself
- The priors are natural (take e.g. Rissanen's universal prior)
- Clarke (2003) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context
- Bayesian inference is consistent under very weak conditions, so even if unnatural, the result is still interesting!
36 Issues / Remainder of Talk
- How is the Bayes learning algorithm defined?
- What is the scenario?
- What do the hypothesis set, the true distribution, and the prior look like?
- How dramatic is the result?
- How large is the constant?
- How strange are the choices made in the construction?
- Why is the result surprising?
37 Is the result surprising - I?
- Methods proposed in the statistical learning theory literature are consistent
- Vapnik's SRM, McAllester's PAC-Bayes methods
- These methods punish complex (low prior) classifiers much more than ordinary Bayes, e.g. in the simplest version of PAC-Bayes
- They are based on generalization bounds that suggest Bayes is inconsistent in classification
- Our result is still interesting: we exhibit a concrete scenario that shows the worst that can happen
38 Is the result surprising - II?
- There exist various strong consistency results for Bayesian inference
- Superficially, it seems our result contradicts these
- How can we reconcile the two?
39 Bayesian Consistency Results
- Doob (1949), Blackwell & Dubins (1962), Barron (1985)
- Suppose the model is countable and contains the true conditional distribution
- Then, with probability 1 under the true distribution, as the sample size tends to infinity, the posterior concentrates on the true distribution (see below)
40 Bayesian Consistency Results
- If the posterior concentrates on the true distribution, then we must also have that the resulting classifications become optimal
- Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified
- The model is homoskedastic, the truth heteroskedastic!
41 Bayesian consistency under misspecification
- Suppose we use Bayesian inference based on a misspecified model
- If the true distribution is not in the model, then under mild generality conditions the Bayes predictive distribution still converges to the distribution in the model that is closest to the true distribution in KL-divergence (relative entropy)
- The logistic transformation ensures that the minimum KL-divergence is achieved for a classifier that also achieves the minimum classification risk (see below)
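In assumed notation, the limit point referred to here is the KL-projection of the true distribution D onto the model M:

```latex
% KL-projection of the true distribution D onto the model M:
\[
  \tilde{p} \;=\; \arg\min_{p \in M} D(D \,\|\, p),
  \qquad
  D(D \,\|\, p) \;=\; \mathbb{E}_{(X,Y)\sim D}\!\left[
      \log \frac{D(Y \mid X)}{p(Y \mid X)} \right].
\]
% The logistic transformation is set up so that the minimizer corresponds
% to the classifier with the smallest classification risk.
```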
42 Bayes consistency under misspecification
- In our case, Bayes does not converge to the distribution with the smallest classification risk, so it also does not converge to the distribution closest to the truth in KL-divergence
- Apparently, the mild generality conditions for Bayesian consistency under misspecification are violated
- Conditions for consistency under misspecification are much stronger than conditions for standard consistency!
- The model must either be convex or simple (e.g. parametric)
43 Misspecification Inconsistency
- Our inconsistency theorem is fundamentally different from earlier ones such as Barron (1998) and Diaconis/Freedman (1986)
- We can choose the model to be a countable set of i.i.d. distributions. Then, if the true distribution were in the model, consistency would be guaranteed
- MDL is immune to the Diaconis/Freedman inconsistency, but not to the misspecification inconsistency
- Diaconis/Freedman use priors such that the Bayesian universal code does not compress the data. Such priors make no sense from an MDL point of view.
44 Misspecification Reformulation
- For all ..., there exists a distribution on ... and a prior on a countable set of distributions such that, for some element of that set with ...
- Yet for all ..., with probability 1 under the true distribution, ...
45 Bayes predicts too well
- Theorem 3: Let the model be a set of distributions. Under mild regularity conditions (e.g. if the model is countable), the only true distributions D for which the Bayesian posterior is inconsistent (i.e., for some ..., almost surely, ...)
- are those under which the posterior predictive distribution becomes strictly closer in KL-divergence to D than the best single distribution in the model
- i.e., there exists a constant such that, almost surely, the predictive distribution is closer by at least that constant, for infinitely many m
46 Conclusion
- Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification
- A Bayesian may argue that the Bayesian machinery was never intended for misspecified models
- Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time
- In this case, Bayes may overfit even in the limit of an infinite amount of data
47 Thank you for your attention!
48 Wait a minute!
49 Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term (see the bound below)
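A sketch of the bound behind this step, in the notation assumed earlier, stated for a pair (c, eta) with positive prior mass (e.g. under a discrete prior on the scale parameter):

```latex
% The Bayesian marginal likelihood dominates each single term of the
% mixture, so for every sequence and every (c, eta) with pi(c, eta) > 0:
\[
  -\log P_{\mathrm{Bayes}}(y^n \mid x^n)
    \;\le\; -\log P_{c,\eta}(y^n \mid x^n) \;-\; \log \pi(c,\eta).
\]
% Combined with the affine relation between log loss and the number of
% mistakes, this bounds the log loss of Bayes by the log loss of the
% 0/1-optimal classifier plus a logarithmic/constant term.
```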
50 Wait a minute
- The accumulated log loss of sequential Bayesian predictions is always within a constant of the accumulated log loss of the optimal single distribution
- So Bayes is good with respect to log loss / KL-divergence
- But Bayes is bad with respect to 0/1-loss
- How is this possible?
- The Bayesian posterior effectively becomes a mixture of bad distributions (a different mixture at different m)
- The mixture is closer to the true distribution D than the best single distribution in terms of KL-divergence / log loss prediction
- But it performs worse than the best single distribution in terms of 0/1 error
51 Is consistency achievable at all?
- Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent
- Vapnik's methods (based on VC-dimension etc.)
- McAllester's PAC-Bayes methods
- These methods invariably punish complex (low prior) classifiers much more than ordinary Bayes, e.g. in the simplest version of PAC-Bayes
52 Consistency and Data Compression - I
- Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm
- MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors
- ...however:
53 Consistency and Data Compression - II
- There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman
- For some non-parametric models, even if the true distribution D is in the model, Bayes may not converge to it
- These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data
- With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron 1998)
54 Proof Sketch
55 Theorem 2
56 Proof Sketch
- The log loss of Bayes upper bounds the 0/1-loss
- For every sequence, the log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term (Law of Large Numbers / Hoeffding)