Minimum Description Length Principle

1
Minimum Description Length Principle for Statistical Inference
Presented by Vikas C. Raykar, University of Maryland, College Park
2
Contents
  • Statistical Modelling: the traditional approach.
  • Algorithmic theory of complexity: Kolmogorov complexity.
  • MDL principle.
  • Coding and information theory.
  • Formalization of the MDL principle and examples.

3
Statistical Modelling: the traditional approach
  • Assumes that the data has been generated as a sample from a population.
  • Parametric
  • Non-parametric
  • The unknown distribution is then estimated using the data.
  • Minimization of some mean loss function.
  • Maximum likelihood
  • Least squares
  • Works well when we understand the physics of the problem, i.e., we know that there is some law generating the data plus instrument noise.
  • If we do not understand the data-generating process, there is no way we can determine whether the given data set is sampled from a given distribution.
  • Examples: data mining, image processing, DNA modelling.

4
Getting around the curse of false assumptions
  • If we are estimating probability distributions from data, there is no rational way to compare different sets of distributions.
  • The best-fitting model is then the most complex one.
  • First, believe that no model can capture all the regular features in the data.
  • Then look for a model, within a collection of models, that does its best.
  • Akaike (1973), robust estimation, cross-validation.
  • Bayesian view of modelling:
  • P(theta | Y) = P(Y | theta) P(theta) / P(Y)
  • Meaningful if one of the thetas is true.
  • Jeffreys' interpretation of probability as a degree of belief.

5
Contents
  • Statistical Modelling: the traditional approach.
  • Algorithmic theory of complexity: Kolmogorov complexity.
  • MDL principle.
  • Coding and information theory.
  • Formalization of the MDL principle.
  • Simple examples.

6
Algorithmic theory of information
  • Introduced by Solomonoff, Kolmogorov, and Chaitin.
  • Data need not be regarded as a sample from any distribution or metaphysical true model.
  • The idea of a model is a computer program that describes/encodes the data.
  • Two notions:
  • Complexity: how long is the program?
  • Information: what properties can it express (as opposed to uninteresting information)?

7
Regularity and compression
  • Any regularity in given data can be used to compress the data.
  • 01010101010101010101
  • A simple rule describes it; the description length is about log n.
  • 00100010000000010000
  • 17 zeros and 3 ones.
  • There are N = 20!/(17! 3!) = 1140 such strings.
  • log2 N ≈ 11 bits (checked in the sketch below).
  • Bernoulli model: flip a coin 20 times.
  • Maximally complex.
  • Non-regular sequences cannot be compressed.
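
A quick Python check of the counting argument above, added here for illustration (it is not part of the original slides):

    from math import ceil, comb, log2

    n, k = 20, 3                          # a 20-bit string with 3 ones, as on the slide
    num_strings = comb(n, k)              # 20!/(17! 3!) = 1140 strings share this composition
    index_bits = ceil(log2(num_strings))  # bits needed to say which one of them we have
    print(num_strings, index_bits)        # 1140 11 -- about 11 bits instead of 20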

8
Kolmogorov complexity
  • U – a computer; a program for U prints the desired binary string as output and halts.
  • The program, after printing, halts.
  • The complexity is measured by the length of the program.
  • Once a program has halted it cannot start again by itself.
  • No program can be a prefix of a longer program.
  • The Kolmogorov complexity of a string is the length of the shortest program in the language of U that generates the string and then halts.
  • The shortest program can be considered an optimal model for the data.
  • Occam's Razor (principle of parsimony): one should not increase, beyond what is necessary, the number of entities required to explain anything.
  • Does it depend on U, the programming language? The invariance theorem: it does, but only up to an additive constant.

9
Kolmogorov complexity is noncomputable
  • Proof:
  • Suppose we have a program Q which can compute the Kolmogorov complexity of any string.
  • We can then write a program P, which uses Q as a subroutine, to find the shortest string whose Kolmogorov complexity is greater than the length of the program P.
  • But since P prints out such a string, the Kolmogorov complexity of that string must be less than or equal to the length of P.
  • Contradiction. Q.E.D.
  • No algorithm can find the physical laws.
  • The MDL principle scales down the idea of Kolmogorov complexity.
  • Focus on a class of models M, because we can never find the true model.
  • Encode the data based on a hypothesis H.
  • First encode H and then encode the data on the basis of H.
  • Choose the H that minimizes the total codelength.

10
Contents
  • Statistical Modelling: the traditional approach.
  • Algorithmic theory of complexity: Kolmogorov complexity.
  • MDL principle.
  • Coding and information theory.
  • Formalization of the MDL principle and examples.

11
Two-Part Codes: the MDL Principle
  • Among a set of candidate hypotheses M, the best hypothesis to explain the data is the one which minimizes the sum of:
  • the length, in bits, of the description of the hypothesis, and
  • the length, in bits, of the description of the data when encoded using the hypothesis.
  • This is a trade-off between model complexity and goodness of fit.
  • The better a hypothesis fits the data, the more information it gives about the data; the more the information, the fewer the bits we need to encode the data.

12
Example: Under- and Over-Fitting
13
Contents
  • Statistical Modelling: the traditional approach.
  • Algorithmic theory of complexity: Kolmogorov complexity.
  • MDL principle.
  • Coding and information theory.
  • Formalization of the MDL principle.
  • Simple examples.

14
Prefix Codes: Kraft's Inequality
  • A – the alphabet.
  • A message is a sequence of symbols from the alphabet.
  • Code: a mapping from A into B, the union of all n-fold Cartesian products of {0, 1} (i.e., the set of finite binary strings).
  • Prefix code: no codeword is the prefix of another.
  • This gives unique decodability.
  • An integer-valued function L() corresponds to the codelengths of a binary prefix code if and only if it satisfies Kraft's inequality: the sum over x of 2^(-L(x)) is at most 1.
  • Proof: binary tree construction / induction.
  • Given a prefix code C on A with length function L(), we can define a distribution Q on A.
  • Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)) (see the sketch below).
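
A minimal Python sketch of these two facts, added for illustration (the distribution Q is a made-up example):

    from math import ceil, log2

    # A hypothetical distribution Q on a small alphabet A.
    Q = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Shannon code lengths L(x) = ceil(-log2 Q(x)) satisfy Kraft's inequality,
    # so a binary prefix code with exactly these lengths exists.
    L = {x: ceil(-log2(q)) for x, q in Q.items()}
    kraft_sum = sum(2 ** -length for length in L.values())

    print(L)               # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
    assert kraft_sum <= 1  # Kraft's inequality: sum over x of 2^(-L(x)) <= 1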

15
Code Design: Shannon's source coding theorem
  • Huffman's algorithm.
  • The entropy gives a lower bound on the mean codelength.
  • The entropy can therefore be used as a measure of complexity (see the sketch below).
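
A small Python sketch, added for illustration, that builds a Huffman code with the standard heap-merge construction and compares its mean codelength with the entropy lower bound; the source distribution P is an assumed example:

    import heapq
    from math import log2

    # A hypothetical source distribution.
    P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

    # Huffman's algorithm: repeatedly merge the two least probable nodes,
    # prepending one more bit to every codeword inside the merged nodes.
    heap = [(p, i, {x: ""}) for i, (x, p) in enumerate(P.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {x: "0" + code for x, code in c1.items()}
        merged.update({x: "1" + code for x, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    code = heap[0][2]

    mean_len = sum(P[x] * len(code[x]) for x in P)    # expected codelength
    entropy = -sum(p * log2(p) for p in P.values())   # Shannon's lower bound
    print(code, mean_len, entropy)                    # mean_len >= entropy (1.9 vs. about 1.85)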

16
Connecting code lengths and probability distributions
  • A short codelength corresponds to a high probability, and vice versa.
  • Given a prefix code C on A with length function L(), we can define a distribution Q on A.
  • Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)).
  • This does not necessarily mean that we assume our data is drawn according to the probability distribution.
  • The probability distribution is just a mathematical object.

17
Contents
  • Statistical Modelling: the traditional approach.
  • Algorithmic theory of complexity: Kolmogorov complexity.
  • MDL principle.
  • Coding and information theory.
  • Formalization of the MDL principle and examples.

18
Formalizing the two-part code
  • M – the class of models or hypotheses.
  • D – the given data sequence.
  • H – a hypothesis belonging to M.
  • For a probabilistic class of models, the probability of the data given the model is used.
  • Two-part code: first code theta, and then code the data given theta.
  • We know that there exists a code C such that L(D | theta) = ceil(-log2 P(D | theta)).
  • Use this code for the second part.
  • This reduces to the ML estimator if we neglect the complexity of theta (see the sketch below).
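
A Python sketch of the two-part score for a small Bernoulli model class, added here for illustration (the data and the per-hypothesis codelengths are assumptions; a theta specified to d binary digits is charged d bits):

    from math import log2

    def data_bits(data, theta):
        # Second part of the code: L(D | theta) = -log2 P(D | theta) bits
        # for a Bernoulli(theta) model (rounding up is ignored here).
        k, n = sum(data), len(data)
        return -(k * log2(theta) + (n - k) * log2(1 - theta))

    def mdl_score(data, theta, theta_bits):
        # Two-part codelength: first code theta, then code the data given theta.
        return theta_bits + data_bits(data, theta)

    # Hypothetical data (14 ones out of 20) and candidate hypotheses: theta -> L(theta) in bits.
    data = [1] * 14 + [0] * 6
    hypotheses = {0.5: 1, 0.75: 2, 0.6875: 4, 0.703125: 7}

    scores = {t: mdl_score(data, t, b) for t, b in hypotheses.items()}
    best = min(scores, key=scores.get)
    ml = min(hypotheses, key=lambda t: data_bits(data, t))  # ML: ignore L(theta)
    print(best, ml)  # MDL prefers the cheap theta = 0.75; ML prefers 0.703125, closest to 14/20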

19
Bernoulli Example..
20
Bernoulli Example..
  • We have to truncate theta to a finite precision.
  • If we use a fixed precision d, we need d bits to send one of the 2^d possible truncated parameter values.
  • We want a precision that is not fixed, so first send d, encoding d using a prefix code.
  • d can be any natural number. How do we prefix-code a natural number?
  • Trivial code: 1 -> 1, 2 -> 01, 3 -> 001, 4 -> 0001, ...; it needs d bits.
  • Consider the binary standard form: 1 -> 0, 2 -> 1, 3 -> 10, ...; its length is ceil(log d).
  • So first encode the length of the binary standard form using the trivial code; this requires ceil(log d) bits.
  • Total: 2 ceil(log d) bits.
  • Repeat the trick: encode the length of the length of d.
  • Length: ceil(log d) + 2 ceil(log(ceil(log d))) ≈ log d + 2 log log d.
  • So L_C1(theta) ≈ d + log d + 2 log log d.
  • It can be shown asymptotically that the optimal precision d for encoding a sample of size n is given by d(n) = 0.5 log(n) + c.
  • L_C2 (the data part) grows linearly in n, while L_C1 (the parameter part) grows only logarithmically (see the sketch below).
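
A short Python sketch of these integer codes, added for illustration (bit_length stands in for ceil(log d)); it also prints the 0.5 log2(n) rule of thumb for the optimal precision:

    from math import log2

    def trivial_bits(d):
        # Trivial prefix code for a natural number: 1 -> 1, 2 -> 01, 3 -> 001, ... (d bits).
        return d

    def two_stage_bits(d):
        # First trick: send the length of d's binary form with the trivial code,
        # then d itself in binary, for roughly 2 log2(d) bits in total.
        return 2 * d.bit_length()

    def repeated_trick_bits(d):
        # Repeat the trick on the length itself: roughly log2(d) + 2 log2(log2(d)) bits.
        return d.bit_length() + 2 * d.bit_length().bit_length()

    for d in (1, 5, 100, 10000):
        print(d, trivial_bits(d), two_stage_bits(d), repeated_trick_bits(d))

    # The slide's asymptotic claim: the optimal precision for a sample of size n
    # grows like d(n) = 0.5 * log2(n) + c.
    for n in (100, 10000, 1000000):
        print(n, round(0.5 * log2(n), 1))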

21
Non-probabilistic models
  • Consider the case of polynomials.
  • Measure goodness of fit by the total squared error.
  • Construct a probability distribution such that -log(P) = total squared error + constant:
  • a Gaussian distribution of a specific variance.
  • This does not mean the underlying model is Gaussian.
  • Encode the polynomial coefficients in a similar way as before.
  • For any model with k parameters and sample size n, the total codelength is -log(p_model) + (k/2) log n + o(1).
  • If the data are truly generated by some polynomial plus noise, then MDL will converge to the true one as the sample size increases.
  • If the true degree is high and the number of samples is small, MDL will underfit (see the sketch below).
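
A Python sketch of MDL-style polynomial degree selection, added for illustration (the data, the noise level, and the exact constants are assumptions): the data part of the codelength is the Gaussian term (n/2) log2(RSS/n), up to an additive constant, and the parameter part is (k/2) log2(n):

    import numpy as np

    def mdl_poly_score(x, y, degree):
        # Two-part codelength (in bits, up to constants shared by all degrees):
        # data coded with a Gaussian of variance RSS/n, plus (k/2) log2(n) for k coefficients.
        n = len(x)
        coeffs = np.polyfit(x, y, degree)
        rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
        k = degree + 1
        return 0.5 * n * np.log2(rss / n) + 0.5 * k * np.log2(n)

    # Hypothetical data: a cubic plus Gaussian noise.
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 100)
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0, 0.1, x.size)

    scores = {d: mdl_poly_score(x, y, d) for d in range(1, 9)}
    print(min(scores, key=scores.get))  # MDL should prefer a low degree (ideally 3) here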

22
MDL is looking for a good model, not for a true model
  • If enough data is not available, MDL picks a model which is too simple.
  • This does not mean that simple models are a priori more likely to be true, or that nature prefers simplicity, or anything like that.
  • The rationale is that the data set is too small to identify a complex model with any reliability.
  • Is it OK to use simple models for prediction?
  • A simple model is safe: it will give a correct impression of the error; the model itself tells us that it is not very accurate.
  • What if, in the previous Bernoulli example, the data were generated by a first-order Markov chain?
  • This explains why modelling errors with a Gaussian distribution generally leads to good results even though the distribution of the errors is not Gaussian.
  • Probabilities are monotone transforms of codelengths, not frequencies.

23
References
  • Peter Grünwald's thesis, "The Minimum Description Length Principle and Reasoning under Uncertainty".
  • http://homepages.cwi.nl/pdg/thesispage.html
  • The first two chapters give an introduction.
  • Jorma Rissanen, "Lectures on Statistical Modelling Theory".
  • http://www.cs.tut.fi/rissanen/

24
Thank You! Questions?