1
Bayesian Information Criterion
  • ENEE698A Elements of Statistical Learning
  • Joshua Broadwater
  • 10/08/2003

2
Motivation
  • We need a way to select the appropriate dimension
    of a statistical model (e.g., polynomial
    regression, multi-step Markov chains, etc.)
  • Simply using the maximum likelihood principle
    leads to choosing the highest possible dimension
  • Akaike proposed subtracting the dimension of the
    model from the log likelihood [1]
  • The AIC was never proven consistent; in fact,
    Kashyap proved it is inconsistent in his 1980
    paper [2]
  • The AIC tends to overestimate the true model order

3
Motivation (cont.)
  • Another approach, based on Bayesian statistics,
    was proposed by Kashyap [3] to explain the
    Principle of Parsimony
  • It was the first documented Bayesian methodology,
    and Kashyap's method was consistent
  • It maximizes the posterior distribution for a
    class of models, but required 13 assumptions
  • The Bayesian Information Criterion is officially
    credited to Schwarz [4]
  • Variants of the BIC with better convergence
    properties were published in the following years
    by Hannan and Quinn [5] and by Fine and Hwang [6]

4
Schwarz BIC Definition
  • Assumptions: large-sample statistics, with
    a-priori distributions assumed to be of a certain
    form (conjugate priors) but not known exactly
  • The leading term in the Bayes estimator is simply
    the maximum likelihood estimator
  • The second term, however, is affected by the
    Bayesian approach because it reflects the
    singularities of the a-priori distribution

5
Schwarz's Bayes Procedure
  • Maximize the function S to choose the optimal
    dimension j for iid samples:
    $$ S(Y, n, j) = \log \alpha_j + \log \int \exp\big( n\,Y \cdot \theta - n\,b(\theta) \big)\, d\mu_j(\theta) $$
  • Here $\exp(y \cdot \theta - b(\theta))$ is the
    Koopman-Darmois (exponential family) density, Y is
    the averaged sufficient statistic, $\mu_j$ is the
    a-priori distribution $\Pr(\theta_j)$ over the
    parameters of model j, and $\alpha_j$ is the
    a-priori probability of model j
6
Asymptotics
  • Given S as defined above, it can be proved for
    fixed Y and j, as n tends to ∞, that R is bounded
    in n and
    $$ S(Y, n, j) = n\big(Y \cdot \hat{\theta}_j - b(\hat{\theta}_j)\big) - \tfrac{1}{2}\,k_j \log n + R $$
    where the first term is the maximized log
    likelihood and $k_j$ is the dimension of model j;
    the leading terms constitute the BIC
  • Note: this only holds for the Koopman-Darmois
    family of distributions with associated a-priori
    distributions; consistency was never proved beyond
    this family [6]

7
Book Definition
  • The Bayesian Information Criterion (BIC) is a way
    to estimate the best model using only an in-sample
    estimate
  • The BIC is based on maximization of a log
    likelihood function [7]:
    $$ \mathrm{BIC} = -2 \cdot \mathrm{loglik} + (\log N) \cdot d $$
    where loglik is the maximized log likelihood, N is
    the number of samples, and d is the number of
    parameters
  • When the above equation is multiplied by −1/2, it
    matches the original derivation by Schwarz
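
As a quick illustration, here is a minimal Python sketch of this formula; the bic helper and the mean-only Gaussian model are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def bic(loglik, n_params, n_samples):
    """Generic BIC = -2 * loglik + log(N) * d; smaller is better."""
    return -2.0 * loglik + np.log(n_samples) * n_params

# Hypothetical data fit with a mean-only Gaussian model.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
residuals = y - y.mean()          # one location parameter: the mean
sigma2 = residuals.var()          # MLE of the variance
loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
print(bic(loglik, n_params=2, n_samples=len(y)))  # d = 2: mean + variance
```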

8
Background Definitions
  • Training error is the average loss over the
    training sample:
    $$ \overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big) $$
  • $x_i$ and $y_i$ are the input and output samples
    of the training set
  • Loss functions can be defined, for example, as
    squared error, $L(y, \hat f(x)) = (y - \hat f(x))^2$,
    or as log-likelihood (deviance) loss
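
A minimal sketch of these definitions in Python; the function names are illustrative assumptions.

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared-error loss: L(y, f_hat(x)) = (y - f_hat(x))^2."""
    return (y - y_hat) ** 2

def training_error(y, y_hat, loss=squared_error):
    """Average loss over the training sample: (1/N) * sum_i L(y_i, f_hat(x_i))."""
    return float(np.mean(loss(np.asarray(y), np.asarray(y_hat))))
```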

9
BIC with Gaussian Assumption
  • Using a Gaussian model with iid samples and known
    variance $\sigma_\varepsilon^2$, the BIC can be
    written under squared-error loss as
    $$ \mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2} \left[ \overline{\mathrm{err}} + (\log N) \cdot \frac{d}{N}\, \sigma_\varepsilon^2 \right] $$
  • This is the AIC with the factor 2 replaced by
    log N, so the BIC penalizes model complexity more
    heavily
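
A sketch of this Gaussian-assumption form, assuming the training error (e.g., from the previous sketch) and the noise variance are supplied by the caller:

```python
import numpy as np

def bic_gaussian(train_err, sigma2, n_params, n_samples):
    """BIC under squared-error loss with known noise variance sigma2.
    Replacing log(N) with 2 here would recover the AIC form."""
    return (n_samples / sigma2) * (
        train_err + np.log(n_samples) * n_params / n_samples * sigma2
    )
```
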
10
Development of BIC
  • Problem: suppose we have M candidate models
    $M_m$, m = 1, ..., M, with parameters $\theta_m$,
    from which we want to find the best model for our
    data Z
  • Assuming prior distributions $\Pr(\theta_m \mid M_m)$,
    the posterior probability can be found as
    $$ \Pr(M_m \mid Z) \propto \Pr(M_m) \int \Pr(Z \mid \theta_m, M_m)\, \Pr(\theta_m \mid M_m)\, d\theta_m $$

11
Development (cont.)
  • Compare models using the posterior odds:
    $$ \frac{\Pr(M_m \mid Z)}{\Pr(M_\ell \mid Z)} = \frac{\Pr(M_m)}{\Pr(M_\ell)} \cdot \mathrm{BF}(Z), \qquad \mathrm{BF}(Z) = \frac{\Pr(Z \mid M_m)}{\Pr(Z \mid M_\ell)} $$
  • BF(Z) is the Bayes factor, which defines the
    contribution of the data to the posterior odds
  • Assume $\Pr(M_m)$ is constant across models,
    leaving us to evaluate BF(Z)

12
Development (cont.)
  • To derive $\Pr(Z \mid M_m)$, we apply a Laplace
    approximation to the integral to arrive at
    $$ \log \Pr(Z \mid M_m) = \log \Pr(Z \mid \hat{\theta}_m, M_m) - \frac{d_m}{2} \log N + O(1) $$
    where $\hat{\theta}_m$ is the maximum likelihood
    estimate and $d_m$ is the number of free
    parameters of model $M_m$
  • Defining the loss function to be
    $-2 \log \Pr(Z \mid \hat{\theta}_m, M_m)$ provides
    us with the BIC
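
As a sanity check on the Laplace step, the sketch below compares the exact log marginal likelihood of a Bernoulli model under a Beta(1,1) prior (where the integral has a closed form) with its Laplace and BIC-style approximations; the model, prior, and seed are assumptions chosen for illustration, not the presentation's own example.

```python
import numpy as np
from scipy.special import betaln

# Bernoulli data with a Beta(1,1) (uniform) prior on theta.
rng = np.random.default_rng(1)
N = 50
z = rng.random(N) < 0.3
k = int(z.sum())                       # number of successes (assume 0 < k < N)

# Exact log marginal likelihood: the integral of theta^k (1-theta)^(N-k)
# over [0, 1] equals the Beta function B(k+1, N-k+1).
exact = betaln(k + 1, N - k + 1)

# Laplace approximation around the mode theta_hat = k/N (the MLE,
# which is also the MAP under the uniform prior).
theta = k / N
loglik = k * np.log(theta) + (N - k) * np.log(1 - theta)
neg_hess = N / (theta * (1 - theta))   # -(d^2/dtheta^2) log-likelihood at the mode
laplace = loglik + 0.5 * np.log(2 * np.pi / neg_hess)  # + log prior = 0

# The BIC-style approximation keeps only loglik - (d/2) * log N, with d = 1.
bic_style = loglik - 0.5 * np.log(N)

print(exact, laplace, bic_style)
```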

13
Application
  • Given the previous development, the best model is
    the one with the minimum BIC
  • The BIC also provides a measure of the posterior
    probability of each model for assessment purposes:
    $$ \Pr(M_m \mid Z) \approx \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{\ell=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_\ell}} $$
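
A minimal sketch of this posterior-probability computation, assuming equal prior model probabilities; the input BIC values are hypothetical.

```python
import numpy as np

def model_posteriors(bic_values):
    """Posterior model probabilities from BIC values, assuming equal priors:
    Pr(M_m | Z) is proportional to exp(-BIC_m / 2)."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # shift by the minimum for stability
    return w / w.sum()

print(model_posteriors([100.0, 102.3, 107.9]))  # hypothetical BIC values
```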

14
Examples
  • The following examples highlight the usefulness
    of the BIC as well as its drawbacks (a sketch of
    the model-selection loop appears after this list)
  • First model: a mixture of four 2-D Gaussian
    distributions; the BIC should correctly identify
    4 mixture components
  • Second model: a mixture of eight 2-D Gaussian
    distributions; the BIC will tend to underestimate
    the number of components
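
A sketch of this experiment using scikit-learn's GaussianMixture, whose bic method implements the same criterion; the synthetic four-cluster data are an illustrative stand-in for the slide's model, not the original data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: samples from a 4-component 2-D mixture.
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in centers])

# Fit candidate orders and keep the BIC of each; the smallest BIC wins.
bics = []
for m in range(1, 9):
    gmm = GaussianMixture(n_components=m, random_state=0).fit(X)
    bics.append(gmm.bic(X))
print(1 + int(np.argmin(bics)), bics)   # expected to select 4 components
```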

15
4 Gaussian Mixture Model
[Figure: samples from the 4-component 2-D GMM and the fitted model, with the BIC computed using d = M(1 + D + D(D−1)/2) − 1 parameters]
16
8 Gaussian Mixture Model
[Figure: samples from the 8-component 2-D GMM and the fitted model, with the BIC computed using d = M(1 + D + D(D−1)/2) − 1 parameters]
17
Summary
  • Unlike the AIC, the BIC is a consistent estimator
    of model order
  • The BIC tends to choose models that are too
    simple, due to its heavy penalty on complexity
  • The BIC is regarded as an approximation to the
    minimum description length (MDL) criterion,
    despite being derived independently

18
References
  • [1] H. Akaike, "A New Look at the Statistical
    Model Identification," IEEE Trans. Automatic
    Control, vol. AC-19, no. 6, December 1974,
    pp. 716-723.
  • [2] R. L. Kashyap, "Inconsistency of the AIC Rule
    for Estimating the Order of Autoregressive
    Models," IEEE Trans. Automatic Control, vol.
    AC-25, no. 5, October 1980, pp. 996-998.
  • [3] R. L. Kashyap, "A Bayesian Comparison of
    Different Classes of Dynamic Models Using
    Empirical Data," IEEE Trans. Automatic Control,
    vol. AC-22, no. 5, October 1977, pp. 715-727.
  • [4] G. Schwarz, "Estimating the Dimension of a
    Model," The Annals of Statistics, vol. 6, no. 2,
    1978, pp. 461-464.
  • [5] E. J. Hannan and B. G. Quinn, "The
    Determination of the Order of an Autoregression,"
    Journal of the Royal Statistical Society, Series
    B, vol. 41, 1979, pp. 190-195.
  • [6] T. L. Fine and W. G. Hwang, "Consistent
    Estimation of System Order," IEEE Trans. Automatic
    Control, vol. AC-24, no. 3, June 1979, pp. 387-402.
  • [7] T. Hastie, R. Tibshirani, and J. Friedman,
    The Elements of Statistical Learning,
    Springer-Verlag, New York, 2001, pp. 193-208.