1
Inconsistency of Bayes and MDL under Misspecification
Peter Grünwald, CWI Amsterdam (www.grunwald.nl)
Extension of joint work with John Langford, TTI Chicago (COLT 2004). Also presented at the Bayesian VALENCIA 2006 meeting.
2
Suboptimality of Bayes and MDL in Classification
(original title)
3
Our Result
  • We study Bayesian and Minimum Description Length
    (MDL) inference in classification problems
  • Bayes and MDL should automatically deal with
    overfitting
  • We show that there exist classification domains where
    standard versions of Bayes and MDL perform
    suboptimally (overfit!) even as the sample size tends
    to infinity

4
Why is this interesting?
  • Practical viewpoint
  • Bayesian methods
  • are used a lot in practice
  • are sometimes claimed to be universally optimal
  • MDL methods
  • are even designed specifically to deal with overfitting
  • Yet MDL and Bayes can fail even with infinite data
  • Theoretical viewpoint
  • How can the result be reconciled with the various strong
    Bayesian consistency theorems?

5
Menu
  • Classification
  • Abstract statement of main result
  • Precise statement of result
  • Discussion: classification vs. misspecification

6
Classification
  • Given:
  • a feature space
  • a label space
  • a sample of labelled examples
  • a set of hypotheses (classifiers)
  • Goal: find a classifier that makes few mistakes on
    future data from the same source
  • We then say that the classifier has small generalization
    error / classification risk

7
Classification Models
  • Types of Classifiers
  • hard classifiers (-1/1-output)
  • decision trees, stumps, forests
  • soft classifiers (real-valued output)
  • support vector machines
  • neural networks
  • probabilistic classifiers
  • Naïve Bayes/Bayesian network classifiers
  • Logistic regression

initial focus: hard classifiers
8
Generalization Error
  • As is customary in statistical learning theory,
    we analyze classification by postulating some
    (unknown) distribution on the joint
    (input, label) space
  • The performance of a classifier is measured in terms of
    its generalization error (classification risk),
    defined as the probability that the classifier mislabels
    an example drawn from that distribution (a sketch follows below)
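
As an illustration (not from the slides), a minimal Python sketch of this definition: the generalization error of a hard classifier is the probability of a mistake under the true distribution, which can be approximated by the error frequency on a fresh sample. All names below are hypothetical.

import numpy as np

def zero_one_risk(classifier, xs, ys):
    # Monte Carlo estimate of err(c) = P( c(X) != Y ),
    # using a sample (xs, ys) drawn from the (unknown) true distribution.
    predictions = np.array([classifier(x) for x in xs])
    return float(np.mean(predictions != np.array(ys)))

# Hypothetical example: the classifier that copies the first feature.
c1 = lambda x: x[0]
xs = [(1, -1), (1, 1), (-1, -1), (-1, 1)]
ys = [1, 1, -1, -1]
print(zero_one_risk(c1, xs, ys))  # fraction of mistakes on this sample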

9
Learning Algorithms
  • A learning algorithm based on a set of
    candidate classifiers is a function that,
    for each sample of arbitrary length, outputs a
    classifier from that set

10
Consistent Learning Algorithms
  • Suppose the examples are i.i.d.
  • A learning algorithm is consistent or
    asymptotically optimal if, no matter what the
    true distribution is, the classification risk of the
    learned classifier converges to the best risk
    achievable within the model
  • in probability, as the sample size tends to infinity

11
Consistent Learning Algorithms
  • Suppose the examples are i.i.d.
  • A learning algorithm is consistent or
    asymptotically optimal if, no matter what the
    true distribution is, the classification risk of the
    learned classifier converges to that of the best
    classifier in the model
  • in probability, as the sample size tends to infinity
    (see the display below)
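
The display dropped from this slide is presumably the standard consistency condition; a reconstruction in LaTeX with assumed notation (err for classification risk, \hat{c}_n for the learned classifier, \mathcal{C} for the set of candidate classifiers):

% Consistency / asymptotic optimality of a learning algorithm
% (notation assumed; not preserved in the transcript):
\[
  \mathrm{err}(\hat{c}_n) \;\longrightarrow\; \inf_{c \in \mathcal{C}} \mathrm{err}(c)
  \qquad \text{in probability, as } n \to \infty .
\]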
12
Main Result
  • There exists
  • an input domain
  • a prior, non-zero on a countable set of classifiers
  • a true distribution
  • a constant
  • such that the Bayesian learning algorithm
    is asymptotically suboptimal by at least that constant

13
Main Result
  • There exists
  • an input domain
  • a prior, non-zero on a countable set of classifiers
  • a true distribution
  • a constant
  • such that the Bayesian learning algorithm
    is asymptotically suboptimal by at least that constant
  • The same holds for the MDL algorithm

14
Remainder of Talk
  • How is the Bayes learning algorithm defined?
  • What is the scenario?
  • what do the model, the true distribution, and the prior
    look like?
  • How dramatic is the result?
  • How large is the suboptimality constant?
  • How strange are the choices of model, true distribution,
    and prior?
  • Why is the result surprising?
  • can it be reconciled with Bayesian consistency
    results?

15
Bayesian Learning of Classifiers
  • Problem: Bayesian inference is defined for models
    that are sets of probability distributions
  • In our scenario, models are sets of classifiers,
    i.e. functions from inputs to labels
  • How can we find a posterior over classifiers using
    Bayes' rule?
  • Standard answer: convert each classifier to a
    corresponding conditional distribution and apply
    Bayes to the set of distributions thus obtained

16
Classifiers → probability distributions
  • Standard conversion method from classifiers to
    distributions: the logistic (sigmoid) transformation
  • For each classifier and each value of the noise parameter,
    define a conditional distribution of the label given the
    input (see the sketch below)
  • Define priors on the set of classifiers and on the noise
    parameter, and combine them into a prior on the resulting
    set of distributions
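
A sketch of one standard form of this conversion (the exact parametrization on the slide is not preserved; the symbols c, η and the error indicator are assumptions): each hard classifier c and noise parameter η ≥ 0 is mapped to a conditional distribution, and the resulting log-likelihood is linear in the number of mistakes.

% One standard logistic conversion of a hard classifier c into a
% conditional distribution (parametrization assumed):
\[
  p_{c,\eta}(y \mid x) \;=\;
  \frac{\exp\!\bigl(-\eta\,\mathbf{1}[\,c(x) \neq y\,]\bigr)}{1 + e^{-\eta}},
  \qquad y \in \{-1,+1\},
\]
% so that, writing k for the number of mistakes c makes on (x^n, y^n),
\[
  \log p_{c,\eta}(y^n \mid x^n) \;=\; -\,\eta\, k \;-\; n \log\bigl(1 + e^{-\eta}\bigr).
\]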

17
Logistic transformation - intuition
  • Consider hard classifiers
  • For each such classifier, the log-likelihood of the data
    can be written in terms of
  • the empirical error that the classifier makes on the data,
  • i.e. the number of mistakes the classifier makes
    on the data

18
Logistic transformation - intuition
  • For a fixed noise parameter
  • the log-likelihood is a linear function of the number of
    mistakes the classifier makes on the data
  • so the (log-)likelihood is maximized by the classifier that is
    optimal for the observed data
  • For a fixed classifier,
  • maximizing the likelihood over the noise parameter also makes
    sense (see the sketch below)
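
Under the parametrization sketched earlier (an assumption, since the slide's formula is not preserved), maximizing the likelihood over the noise parameter for a fixed classifier has a closed form: the model's error probability is matched to the empirical error rate. A minimal Python sketch, with hypothetical names:

import numpy as np

def log_likelihood(eta, n_mistakes, n):
    # Log-likelihood under the assumed logistic model:
    # linear in the number of mistakes.
    return -eta * n_mistakes - n * np.log1p(np.exp(-eta))

def fit_eta(classifier, xs, ys):
    # ML estimate of eta for a fixed classifier: the model's error
    # probability exp(-eta)/(1+exp(-eta)) is matched to the empirical
    # error rate e, giving eta_hat = log((1 - e) / e)  (assumes 0 < e < 1).
    e = np.mean([classifier(x) != y for x, y in zip(xs, ys)])
    return float(np.log((1.0 - e) / e))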

19
Logistic transformation - intuition
  • In Bayesian practice, the logistic transformation is a
    standard tool, nowadays applied without any
    motivation or explanation being given
  • We did not find it in Bayesian textbooks,
  • but we did check it with three well-known Bayesians!

20
Logistic transformation - intuition
  • In Bayesian practice, the logistic transformation is a
    standard tool, nowadays applied without any
    motivation or explanation being given
  • We did not find it in Bayesian textbooks,
  • but we did check it with three well-known Bayesians!
  • Analogous to turning a set of predictors with
    squared error into conditional distributions with
    normally distributed noise

the transformed distribution expresses the label as the
classifier's output flipped (or not) by an independent noise bit Z
21
Main Result
Grünwald & Langford, COLT 2004
  • There exists
  • an input domain
  • a prior on a countable set of classifiers
  • a true distribution
  • a constant
  • such that the Bayesian learning algorithm
    is asymptotically suboptimal by at least that constant

holds both for full Bayes and for Bayes (S)MAP
22
Definition of the full Bayes learning algorithm
  • Posterior: obtained from the prior and the likelihood by Bayes' rule
  • Predictive distribution: the posterior-weighted mixture of the
    converted classifier distributions
  • Full Bayes learning algorithm: predict the label that is most
    probable under the predictive distribution (a sketch follows below)
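
A minimal sketch of what such a definition could look like for a finite set of classifiers, assuming the logistic conversion above and, for simplicity, a single fixed noise parameter eta (the slides also put a prior on it); all function and variable names are hypothetical:

import numpy as np

def bayes_predict(classifiers, prior, eta, xs, ys, x_new):
    # Posterior over classifiers: prior(c) * exp(-eta * #mistakes(c)),
    # up to a normalizing constant (the n*log(1+e^-eta) term is common
    # to all classifiers for fixed eta and cancels).
    mistakes = np.array([sum(c(x) != y for x, y in zip(xs, ys))
                         for c in classifiers], dtype=float)
    log_post = np.log(np.asarray(prior, dtype=float)) - eta * mistakes
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Predictive probability of label +1 for the new input, mixing the
    # converted classifier distributions with posterior weights.
    p_correct = 1.0 / (1.0 + np.exp(-eta))
    p_plus = sum(w * (p_correct if c(x_new) == 1 else 1.0 - p_correct)
                 for w, c in zip(post, classifiers))
    # Full Bayes learning algorithm: predict the more probable label.
    return 1 if p_plus >= 0.5 else -1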

23
Issues/Remainder of Talk
  • How is the Bayes learning algorithm defined?
  • What is the scenario?
  • what do the model, the true distribution, and the prior
    look like?
  • How dramatic is the result?
  • How large is the suboptimality constant?
  • How strange are the choices of model, true distribution,
    and prior?
  • Why is the result surprising?
  • can it be reconciled with Bayesian consistency
    results?

24
Scenario
  • Definition of the input domain and the set of classifiers
  • Definition of the prior
  • for some small constant and all large indices, the prior on the
    classifiers obeys a mild lower bound
  • the prior on the noise parameter can be any strictly positive
    smooth prior

(or a discrete prior with sufficient precision)
25
Scenario II: definition of the true distribution
  • Toss a fair coin to determine the value of the label
  • Toss a biased coin to determine whether the example is
    easy or hard
  • If the example is easy, then set every feature equal to the label
  • If the example is hard, then set the first feature according to a
    fixed rule, and set every other feature independently at random
    (see the sketch below)
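
A simulation skeleton of this construction. The transcript does not preserve the actual coin biases or the rule for the first feature, so the parameters P_EASY and Q_OTHER and the choice "first feature equals the label on hard examples" below are placeholders, not the values used in the talk:

import numpy as np

rng = np.random.default_rng(0)

P_EASY = 0.5      # placeholder: probability of an "easy" example
Q_OTHER = 0.5     # placeholder: P(feature_j = label) on hard examples, j >= 2
N_FEATURES = 20

def sample_example():
    # Fair coin for the label.
    y = int(rng.choice([-1, 1]))
    if rng.random() < P_EASY:
        # Easy example: every feature simply copies the label.
        x = np.full(N_FEATURES, y)
    else:
        # Hard example: the remaining features are set independently at
        # random; the rule for the first feature is a placeholder.
        x = np.where(rng.random(N_FEATURES) < Q_OTHER, y, -y)
        x[0] = y
    return x, y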

26
Result
  • All features are informative of the label, but the first
    feature is more informative than all the others, so the
    corresponding classifier is the best classifier
  • Nevertheless, with true probability 1, the Bayes learning
    algorithm remains suboptimal as the sample size tends to infinity
(but note: for each fixed …, …)
27
Issues/Remainder of Talk
  • How is the Bayes learning algorithm defined?
  • What is the scenario?
  • what do the model, the true distribution, and the prior
    look like?
  • How dramatic is the result?
  • How large is the suboptimality constant?
  • How strange are the choices of model, true distribution,
    and prior?
  • Why is the result surprising?
  • can it be reconciled with Bayesian consistency
    results?

28
Theorem 1
Grünwald & Langford, COLT 2004
  • There exists
  • an input domain
  • a prior on a countable set of classifiers
  • a true distribution
  • a constant
  • such that the Bayesian learning algorithm
    is asymptotically suboptimal by at least that constant

holds both for full Bayes and for Bayes MAP
29
Theorem 1
Grünwald & Langford, COLT 2004
  • There exists
  • an input domain
  • a prior on a countable set of classifiers
  • a true distribution
  • a constant
  • such that the Bayesian learning algorithm
    is asymptotically suboptimal by at least that constant

interdependent parameters
30
Theorem 1, extended
[Figure: maximum suboptimality curves for Bayes MAP/MDL and for full Bayes (binary entropy), with the maximum difference between them marked]
achieved with probability 1, for all large n

31
Theorem 1, extended
[Figure: same plot as the previous slide]

32
Theorem 1, extended
[Figure: same plot, with the maximum difference highlighted]

Bayes can get much worse than random guessing!

33
How can Bayes get so bad?
  • Consider what happens on the hard examples
  • The MAP is achieved by a large set of classifiers
  • Since these all err independently on the hard examples,
    by the Law of Large Numbers the fraction of MAP
    classifiers making a wrong prediction concentrates
    around a fixed value
  • Therefore, on a hard example, the predictive distribution
    puts too little probability on the correct label

But then Bayes predicts the wrong label!
34
Theorem 2: the full Bayes result is tight
[Figure: same plot; the maximum is now taken over all choices, showing that the full Bayes (binary entropy) curve is attained]

35
How natural is the scenario?
  • The basic scenario is quite unnatural, but:
  • Although it may not happen in real life,
    describing the worst that could happen is
    interesting in itself
  • The priors are natural (take e.g. Rissanen's
    universal prior)
  • Clarke (2003) reports practical evidence that
    Bayes performs suboptimally with large yet
    misspecified models in a regression context
  • Bayesian inference is consistent under very weak
    conditions. So even if the scenario is unnatural,
    the result is still interesting!

36
Issues/Remainder of Talk
  • How is the Bayes learning algorithm defined?
  • What is the scenario?
  • what do the model, the true distribution, and the prior
    look like?
  • How dramatic is the result?
  • How large is the suboptimality constant?
  • How strange are the choices of model, true distribution,
    and prior?
  • Why is the result surprising?

37
Is the result surprising? (I)
  • Methods proposed in the statistical learning theory
    literature are consistent
  • Vapnik's SRM, McAllester's PAC-Bayes methods
  • These methods punish complex (low prior)
    classifiers much more than ordinary Bayes does
  • for instance, in the simplest version of PAC-Bayes
  • they are based on generalization bounds that suggest Bayes
    is inconsistent in classification
  • Our result is still interesting: we exhibit a
    concrete scenario that shows the worst that can happen

38
Is the result surprising? (II)
  • There exist various strong consistency results
    for Bayesian inference
  • Superficially, it seems our result contradicts
    these
  • How can we reconcile the two?

39
Bayesian Consistency Results
  • Doob (1949), Blackwell & Dubins (1962),
    Barron (1985)
  • Suppose the model
  • is countable, and
  • contains the true conditional distribution
  • Then, with probability 1, as the sample size tends to
    infinity, the Bayes predictive distribution converges to
    the true conditional distribution (see the display below)
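
One standard way of writing such a statement (the slide's exact display is not preserved; the notation, with p* the true conditional distribution and Z^n the first n examples, is an assumption):

% Countable model containing the true conditional distribution p^*:
% the Bayes predictive distribution merges with the truth.
\[
  D\Bigl(p^{*}(\cdot \mid X)\,\Big\|\,p_{\mathrm{Bayes}}(\cdot \mid X, Z^{n})\Bigr)
  \;\longrightarrow\; 0
  \qquad \text{with probability 1, as } n \to \infty .
\]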

40
Bayesian Consistency Results
  • If the predictive distribution converges to the true
    conditional distribution,
  • then we must also have that the classification risk of
    Bayes converges to the optimal risk
  • Our result says that this does not happen in our
    scenario. Hence the (countable!) model we
    constructed must be misspecified
  • The model is homoskedastic, the truth is heteroskedastic!

41
Bayesian consistency under misspecification
  • Suppose we use Bayesian inference based on a model of
    conditional distributions
  • If the true distribution is not in the model, then under
    mild generality conditions, the Bayes predictive
    distribution still converges to the distribution in the
    model that is closest to the true one in
    KL-divergence (relative entropy)
  • The logistic transformation ensures that this minimum
    KL-divergence is
  • achieved for a classifier c that also achieves the minimum
    classification risk (see the display below)
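
In symbols (a reconstruction with assumed notation, writing P* for the true distribution and \mathcal{M} for the set of converted distributions):

% KL-projection of the true distribution onto the (misspecified) model:
\[
  \tilde{p} \;=\; \arg\min_{p \in \mathcal{M}} D\bigl(P^{*}\,\big\|\,p\bigr),
\]
% and the logistic transformation is such that this minimum is achieved by
% a p_{c,\eta} whose classifier c also minimizes the classification risk:
\[
  c \;\in\; \arg\min_{c' \in \mathcal{C}} \; P^{*}\bigl(c'(X) \neq Y\bigr).
\]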

42
Bayes consistency under misspecification
  • In our case, Bayes does not converge to the
    distribution with the smallest classification risk,
    so it also does not converge to the distribution
    closest to the truth in KL-divergence
  • Apparently, the mild generality conditions for
    Bayesian consistency under misspecification are
    violated
  • Conditions for consistency under misspecification
    are much stronger than conditions for standard
    consistency!
  • the model must either be convex or simple (e.g.
    parametric)

43
Misspecification Inconsistency
  • Our inconsistency theorem is fundamentally
    different from earlier ones such as Barron (1998)
    and Diaconis/Freedman (1986)
  • We can choose the model to be a countable set of
    i.i.d. distributions. Then, if the true distribution
    were in the model, consistency would be guaranteed
  • MDL is immune to the Diaconis/Freedman inconsistency,
    but not to misspecification inconsistency
  • Diaconis/Freedman use priors such that the Bayesian
    universal code does not compress the data. Such
    priors make no sense from an MDL point of view.

44
Misspecification Reformulation
  • For every (arbitrarily small) constant, there exist a true
    distribution and a prior on a countable set of candidate
    distributions, such that some candidate comes within that
    constant of the true distribution
  • Yet, for all (large) sample sizes, with probability 1 under
    the true distribution, Bayesian prediction remains
    suboptimal by a fixed larger amount

45
Bayes predicts too well
  • Theorem 3: Let M be a set of distributions.
    Under mild regularity conditions (e.g. if M
    is countable),
  • the only true distributions D for which the Bayesian
    posterior is inconsistent
  • i.e., for which, almost surely, prediction remains
    suboptimal by some fixed amount
  • are those under which the posterior predictive
    distribution becomes strictly closer in
    KL-divergence to D than the best single
    distribution in M
  • i.e. almost surely, the predictive distribution after m
    examples is strictly closer to D than any single element
    of M, for infinitely many m (see the display below)
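
The "predicts too well" condition, written out under assumed notation (D the true distribution, \mathcal{M} the model, Z^m the first m examples):

% Inconsistency can only occur if, almost surely, for infinitely many m,
% the posterior predictive distribution is strictly closer to D than the
% best single element of the model:
\[
  D\bigl(D \,\big\|\, p_{\mathrm{Bayes}}(\cdot \mid Z^{m})\bigr)
  \;<\;
  \min_{p \in \mathcal{M}} D\bigl(D \,\big\|\, p\bigr)
  \qquad \text{for infinitely many } m .
\]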

46
Conclusion
  • Our result applies to hard classifiers and
    (equivalently) to probabilistic classifiers under
    slight misspecification
  • A Bayesian may argue that the Bayesian machinery
    was never intended for misspecified models
  • Yet, computational resources and human
    imagination being limited, in practice Bayesian
    inference is applied to misspecified models all
    the time
  • In this case, Bayes may overfit, even in the limit
    of an infinite amount of data

47
Thank you for your attention!
48
Wait a minute!

49
Proof Sketch
  • The log loss of Bayes upper-bounds its 0/1-loss
  • this holds for every sequence
  • The log loss of Bayes is upper-bounded by the log loss of the
    0/1-optimal classifier plus a logarithmic term (see the display below)
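
The two displays behind this sketch are not preserved; a reconstruction of the standard bounds they presumably correspond to, assuming the logistic parametrization from earlier and a discrete prior π on classifier–noise-parameter pairs:

% (1) Each 0/1 mistake of the Bayes prediction costs at least log 2 in log
%     loss, since the predicted label is wrong only if the predictive
%     probability of the true label is at most 1/2:
\[
  \sum_{i=1}^{n} \mathbf{1}\bigl[\hat{y}_i \neq y_i\bigr]
  \;\le\;
  \frac{1}{\log 2}\sum_{i=1}^{n} -\log p_{\mathrm{Bayes}}\bigl(y_i \mid x_i, z^{i-1}\bigr).
\]
% (2) For every sequence, the cumulative Bayesian log loss is within the
%     prior cost of the best classifier / noise parameter in the model:
\[
  \sum_{i=1}^{n} -\log p_{\mathrm{Bayes}}\bigl(y_i \mid x_i, z^{i-1}\bigr)
  \;\le\;
  \min_{c,\eta}\Bigl[\, -\log \pi(c,\eta) \;+\; \eta\, k_n(c) \;+\; n \log\bigl(1+e^{-\eta}\bigr) \Bigr],
\]
% where k_n(c) is the number of mistakes c makes on the first n examples.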

50
Wait a minute
  • The accumulated log loss of sequential Bayesian
    predictions is always within a constant of the
    accumulated log loss of the optimal single distribution
  • So Bayes is good with respect to log
    loss / KL-divergence
  • But Bayes is bad with respect to 0/1-loss
  • How is this possible?
  • The Bayesian posterior effectively becomes a
    mixture of bad distributions (a different mixture
    at each sample size m)
  • The mixture is closer to the true distribution D
    than the best single distribution is, in terms of
    KL-divergence / log loss prediction
  • But it performs worse in terms of 0/1 error

51
Is consistency achievable at all?
  • Methods for avoiding overfitting proposed in the
    statistical and computational learning theory
    literature are consistent
  • Vapnik's methods (based on VC-dimension etc.)
  • McAllester's PAC-Bayes methods
  • These methods invariably punish complex (low
    prior) classifiers much more than ordinary Bayes does
    (for instance, in the simplest version of PAC-Bayes)

52
Consistency and Data Compression - I
  • Our inconsistency result also holds for (various
    incarnations of) the MDL learning algorithm
  • MDL is a learning method based on data
    compression; in practice it closely resembles
    Bayesian inference with certain special priors
  • ...however:

53
Consistency and Data Compression - II
  • There already exist (in)famous inconsistency
    results for Bayesian inference, due to Diaconis and
    Freedman
  • For some non-parametric models, even if the true
    distribution D is in the model, Bayes may not converge to it
  • These types of inconsistency results do not apply
    to MDL, since Diaconis and Freedman use priors
    that do not compress the data
  • With MDL priors, if the true D is in the model, then
    consistency is guaranteed under no further
    conditions at all (Barron '98)

54
Proof Sketch

55
Theorem 2

56
Proof Sketch
  • The log loss of Bayes upper-bounds its 0/1-loss
  • this holds for every sequence
  • The log loss of Bayes is upper-bounded by the log loss of the
    0/1-optimal classifier plus a logarithmic term

(Law of Large Numbers / Hoeffding)