Title: A(n) (extremely) brief/crude introduction to minimum description length principle
2. Outline
- Conceptual/non-technical introduction
- Probabilities and Codelengths
- Crude MDL
- Refined MDL
- Other topics
4. Introduction
- Example: data compression
- Description methods
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
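The link between regularity and compressibility can be tried out with any off-the-shelf compressor. The sketch below is only an illustration, not the example from the cited source: it uses Python's `zlib` as a stand-in "description method", and the helper name `compressed_size`, the sequence length, and the fixed random seed are all assumptions made here.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to describe `text` (level 9 = best effort)."""
    return len(zlib.compress(text.encode("ascii"), 9))


n = 10_000
# A highly regular binary sequence: the pattern 0001 repeated.
regular = "0001" * (n // 4)
# An irregular sequence: simulated fair coin flips (seed fixed for repeatability).
rng = random.Random(0)
irregular = "".join(rng.choice("01") for _ in range(n))

print("regular:  ", compressed_size(regular), "bytes")
print("irregular:", compressed_size(irregular), "bytes")
```

Both sequences have the same length, but the regular one admits a much shorter description, which is the sense in which MDL equates learning with finding regularity.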
5. Introduction
- Example: regression
- Model selection and overfitting
- Complexity of the model vs. goodness of fit
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
6. Introduction
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
7. Introduction
- Crude 2-part version of MDL
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
9. Probabilities and Codelengths
- Let $\mathcal{X}$ be a finite or countable set
- A code $C(x)$ for $\mathcal{X}$
  - 1-to-1 mapping from $\mathcal{X}$ to $\bigcup_{n>0} \{0,1\}^n$
  - $L_C(x)$: number of bits needed to encode $x$ using $C$
- $P$: probability distribution defined on $\mathcal{X}$
  - $P(x)$: the probability of $x$
- A sequence of (usually i.i.d.) observations $x_1, x_2, \ldots, x_n$, written $x^n$
10. Probabilities and Codelengths
- Prefix codes as examples of uniquely decodable codes: no codeword is a prefix of any other
a 0
b 111
c 1011
d 1010
r 110
! 100
Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
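The code table above can be checked mechanically. The following sketch verifies the prefix property and shows why it makes decoding unambiguous: the decoder reads bits left to right and emits a symbol as soon as the buffer matches a codeword. The function names and the test message `abracadabra!` are assumptions chosen here for illustration (the message simply uses every symbol in the table).

```python
# Code table copied from the slide.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}


def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(text, code=CODE):
    """Concatenate the codewords of the symbols in `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits, code=CODE):
    """Greedy left-to-right decoding; valid because the code is prefix-free."""
    inverse = {word: sym for sym, word in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


message = "abracadabra!"
bits = encode(message)
print(is_prefix_free(CODE), len(bits), decode(bits) == message)
```

No separators are needed between codewords, so the 12-symbol message costs 28 bits in total.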
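11. Probabilities and Codelengths
- Expected codelength of a code $C$: $E_P[L_C(X)] = \sum_{x} P(x) L_C(x)$
- Lower bound: the entropy $H(P) = -\sum_{x} P(x) \log P(x)$
- Optimal code
  - A code is optimal if it has minimum expected codelength over all uniquely decodable codes
- How to design one given $P$?
  - Huffman coding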
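A minimal sketch of Huffman's construction, assuming the distribution is given as a Python dict of symbol probabilities. The names `huffman_code`, `expected_length`, and `entropy`, and the example distribution, are illustrative choices made here and do not come from the slides.

```python
import heapq
import math


def huffman_code(probs):
    """Return {symbol: codeword} built by repeatedly merging the two least likely nodes."""
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    # Heap entries: (probability, unique tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


def expected_length(probs, code):
    """Expected codelength in bits: sum of P(x) * L_C(x)."""
    return sum(p * len(code[s]) for s, p in probs.items())


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(P)
print(code, expected_length(P, code), entropy(P))
```

For this distribution every probability is a power of 1/2, so the expected codelength meets the entropy bound exactly. In general a Huffman code stays within one bit of it.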
12. Probabilities and Codelengths
Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
13. Probabilities and Codelengths
- How to design a code for $\{1, 2, \ldots, M\}$?
- Assuming a uniform distribution $1/M$ for each number: about $\log M$ bits
14. Probabilities and Codelengths
- How to design a code for all the positive integers?
- For each $k$
  - Describe its length with $\lceil \log k \rceil$ 0s
  - Followed by a 1
  - Then encode $k$ using the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$
- In total, $2\lceil \log k \rceil + 1$ bits
- Can be refined
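The uniform code and the integer code built on top of it can be sketched as follows. This is one possible realization under stated assumptions: logarithms are base 2, the uniform code writes the index `i - 1` in fixed-width binary, and all function names are hypothetical.

```python
def ceil_log2(m):
    """Ceiling of log2(m) for a positive integer m, using integer arithmetic."""
    return (m - 1).bit_length()


def encode_uniform(i, m):
    """Fixed-width code for i in {1, ..., m}: ceil(log2 m) bits."""
    width = ceil_log2(m)
    return format(i - 1, "b").zfill(width) if width else ""


def encode_int(k):
    """Code for any positive integer: w zeros, a 1, then k in the uniform code for {1..2^w}."""
    w = ceil_log2(k)
    return "0" * w + "1" + encode_uniform(k, 2 ** w)


def decode_int(bits):
    """Decode one integer from the front of `bits`; return (k, number of bits consumed)."""
    w = bits.index("1")               # number of leading zeros
    payload = bits[w + 1 : 2 * w + 1]
    k = int(payload, 2) + 1 if w else 1
    return k, 2 * w + 1


for k in (1, 2, 5, 100):
    print(k, encode_int(k), len(encode_int(k)))
```

Because the run of zeros announces how many payload bits follow, the decoder knows where each codeword ends, so the code is a prefix code over all positive integers.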
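15. Probabilities and Codelengths
- Let $P$ be a probability distribution over $\mathcal{X}$; then there exists a code $C$ for $\mathcal{X}$ such that $L_C(x) = \lceil -\log P(x) \rceil$ for all $x \in \mathcal{X}$
- Let $C$ be a uniquely decodable code over $\mathcal{X}$; then there exists a probability distribution $P$ (possibly defective, i.e. with total mass less than 1) such that $-\log P(x) = L_C(x)$ for all $x \in \mathcal{X}$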
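Both directions of this correspondence rest on the Kraft inequality, $\sum_x 2^{-L(x)} \le 1$, which can be checked numerically. The sketch below assumes base-2 logarithms. The function names and the example distribution are illustrative, and the second set of lengths is taken from the prefix code table shown earlier in the deck.

```python
import math


def shannon_lengths(probs):
    """Codelengths L(x) = ceil(-log2 P(x)) obtained from a distribution."""
    return {x: math.ceil(-math.log2(p)) for x, p in probs.items()}


def kraft_sum(lengths):
    """Sum of 2^(-L(x)); at most 1 for any uniquely decodable code."""
    return sum(2.0 ** -l for l in lengths.values())


def implied_distribution(lengths):
    """Distribution P(x) = 2^(-L(x)) obtained from codelengths (may sum to less than 1)."""
    return {x: 2.0 ** -l for x, l in lengths.items()}


# Direction 1: distribution -> codelengths that satisfy Kraft.
P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
L = shannon_lengths(P)
print(L, kraft_sum(L))

# Direction 2: codelengths of the earlier prefix code -> distribution.
slide_lengths = {"a": 1, "b": 3, "c": 4, "d": 4, "r": 3, "!": 3}
Q = implied_distribution(slide_lengths)
print(Q, sum(Q.values()))
```

The lengths derived from `P` leave some slack in the Kraft sum, whereas the earlier prefix code uses it up completely, so its implied distribution is a proper one.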
16. Probabilities and Codelengths
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
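18. Crude MDL
- Preliminary: k-th order Markov chain on $\mathcal{X} = \{0,1\}$
- A sequence $X_1, X_2, \ldots, X_N$
- Special case: 0-th order, the Bernoulli model (biased coin)
- Maximum likelihood estimator: $\hat{\theta} = n_{[1]}/n$, the fraction of 1s among the $n$ observations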
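19. Crude MDL
- Preliminary: k-th order Markov chain on $\mathcal{X} = \{0,1\}$
- Special case: first-order Markov chain $B^{(1)}$
- MLE: $\hat{\theta}_{[1|0]} = n_{[1|0]}/n_{[0]}$ and $\hat{\theta}_{[1|1]} = n_{[1|1]}/n_{[1]}$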
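20. Crude MDL
- Preliminary: k-th order Markov chain on $\mathcal{X} = \{0,1\}$
- $2^k$ parameters, one per context (shown here for $k = 6$)
  - $\theta_{[1|000000]} = n_{[1|000000]} / n_{[000000]}$
  - $\theta_{[1|000001]}$
  - ...
  - $\theta_{[1|111110]}$
  - $\theta_{[1|111111]}$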
- Log likelihood function
- MLE
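The count-based estimates above can be computed in a few lines. This is a sketch under assumptions made here: the data is a string of `0`/`1` characters, the first k symbols are treated as given (the likelihood is conditional on them), logarithms are base 2, and all function names are hypothetical.

```python
import math
from collections import Counter


def context_counts(x, k):
    """counts[(context, symbol)]: how often `symbol` follows the k-symbol `context` in x."""
    counts = Counter()
    for i in range(k, len(x)):
        counts[(x[i - k : i], x[i])] += 1
    return counts


def markov_mle(x, k):
    """MLE of theta[context] = P(next symbol is 1 | context), for contexts seen in x."""
    counts = context_counts(x, k)
    theta = {}
    for ctx in {c for c, _ in counts}:
        n1 = counts[(ctx, "1")]
        n0 = counts[(ctx, "0")]
        theta[ctx] = n1 / (n0 + n1)
    return theta


def max_log_likelihood(x, k):
    """log2 P(x | k, MLE), conditional on the first k symbols."""
    counts = context_counts(x, k)
    totals = Counter()
    for (ctx, _), n in counts.items():
        totals[ctx] += n
    return sum(n * math.log2(n / totals[ctx]) for (ctx, _), n in counts.items())


x = "0110100110"
print(markov_mle(x, 0), max_log_likelihood(x, 0))
print(markov_mle(x, 1), max_log_likelihood(x, 1))
```

With k = 0 there is a single empty context and the estimate reduces to the Bernoulli case, the fraction of 1s in the sequence.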
21. Crude MDL
- Question: given data $D = x^n$, find the Markov chain that best explains $D$
- We do not want to restrict ourselves to chains of fixed order
- How to avoid overfitting?
- Obviously, an (n-1)-th order Markov model would always fit the data the best
22. Crude MDL
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
23. Crude MDL
- Description length of data given hypothesis: $L(D \mid H) = -\log P(D \mid k, \theta)$
24. Crude MDL
- Description length of hypothesis
- The code should not change with the sample size $n$
- Different codes will lead to preferences for different hypotheses
- How to design a code that leads to good inferences with small, practically relevant sample sizes?
25. Crude MDL
- An intuitive and reasonable code for a k-th order Markov chain
  - First describe $k$ using $2\lceil \log k \rceil + 1$ bits
  - Then describe the $d = 2^k$ parameters
    - Assume $n$ is given in advance
    - For each $\theta$ in the MLE $\{\theta_{[1|000000]}, \ldots, \theta_{[1|111111]}\}$, the best precision we can achieve by counting is $1/(n+1)$
    - Describe each $\theta$ with $\log(n+1)$ bits
- $L(H) = 2\lceil \log k \rceil + 1 + d \log(n+1)$
- $L(H) + L(D \mid H) = 2\lceil \log k \rceil + 1 + d \log(n+1) - \log P(D \mid k, \theta)$
- For a given $k$, only the MLE $\theta$ needs to be considered
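The two-part codelength above can be turned into a small order-selection routine. This is a sketch under several assumptions that are not in the slides: logarithms are base 2, the first k symbols are treated as given, and, because the integer code only covers positive integers, the order is encoded as k + 1 so that k = 0 is allowed. All function names are hypothetical.

```python
import math
from collections import Counter


def integer_code_length(m):
    """Length 2*ceil(log2 m) + 1 of the integer code for a positive integer m."""
    return 2 * (m - 1).bit_length() + 1


def neg_log_likelihood(x, k):
    """L(D|H) = -log2 P(x | k, MLE), conditional on the first k symbols."""
    counts = Counter((x[i - k : i], x[i]) for i in range(k, len(x)))
    totals = Counter()
    for (ctx, _), n in counts.items():
        totals[ctx] += n
    return -sum(n * math.log2(n / totals[ctx]) for (ctx, _), n in counts.items())


def two_part_length(x, k):
    """L(H) + L(D|H) for a k-th order Markov chain with d = 2^k parameters."""
    n, d = len(x), 2 ** k
    l_h = integer_code_length(k + 1) + d * math.log2(n + 1)  # assumption: encode k + 1
    return l_h + neg_log_likelihood(x, k)


def best_order(x, max_k):
    """Order in 0..max_k that minimizes the total two-part codelength."""
    return min(range(max_k + 1), key=lambda k: two_part_length(x, k))


data = "01" * 100  # strictly alternating sequence
for k in range(4):
    print(k, round(two_part_length(data, k), 2))
print("selected order:", best_order(data, 3))
```

On the alternating sequence every order k >= 1 fits the data perfectly, so the likelihood term alone cannot separate them. The parameter cost $d \log(n+1)$ grows with $2^k$ and makes the first-order chain the shortest description.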
26. Crude MDL
- Good news
  - We have found a principled manner to encode data $D$ using $H$
- Bad news
  - We have not found clear guidelines to design codes for $H$
28. Refined MDL
- Universal codes and universal distributions
  - The maximum likelihood code depends on the data
  - How to describe the data in an unambiguous manner?
  - Design a code such that, for every possible observation, its codelength corresponds to its maximum likelihood?
    - Impossible
29. Refined MDL
- Worst-case regret
- Optimal universal model
30. Refined MDL
- Normalized maximum likelihood (NML)
- Minimizing $-\log \mathrm{NML}$
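For the Bernoulli model the NML distribution can be computed exactly, which makes a convenient sanity check. The sketch assumes the standard definition $P_{\mathrm{nml}}(x^n) = P(x^n \mid \hat{\theta}(x^n)) / \sum_{y^n} P(y^n \mid \hat{\theta}(y^n))$, base-2 logarithms, and data given as a `0`/`1` string. The function names are hypothetical.

```python
import itertools
import math


def max_likelihood(n1, n):
    """P(x^n | theta_hat) for a binary sequence of length n containing n1 ones."""
    theta = n1 / n
    return theta ** n1 * (1 - theta) ** (n - n1)


def bernoulli_complexity(n):
    """log2 of the NML normalizer: sum of maximized likelihoods over all 2^n sequences."""
    # Sequences with the same number of ones share the same maximized likelihood.
    total = sum(math.comb(n, j) * max_likelihood(j, n) for j in range(n + 1))
    return math.log2(total)


def nml_codelength(x):
    """-log2 P_nml(x) = -log2 P(x | theta_hat) + complexity term."""
    n, n1 = len(x), x.count("1")
    return -math.log2(max_likelihood(n1, n)) + bernoulli_complexity(n)


n = 6
mass = sum(2 ** -nml_codelength("".join(s)) for s in itertools.product("01", repeat=n))
print(bernoulli_complexity(n), mass)
```

The codelength exceeds the maximum likelihood codelength by the same amount for every sequence, namely the log of the normalizer, and the implied probabilities sum to one.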
31. Refined MDL
- Complexity of a model
  - The more sequences that can be fit well by an element of $\mathcal{M}$, the larger $\mathcal{M}$'s complexity
  - Would it lead to the right balance between complexity and fit?
    - Hopefully
32. Refined MDL
Source: Grünwald et al. (2005), Advances in Minimum Description Length Theory and Applications.
34. Other topics
- Mixture code
- Resolvability
35. References
- Barron, A., Rissanen, J., and Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.
- Grünwald, P.D., Myung, I.J., and Pitt, M.A. (2005), Advances in Minimum Description Length Theory and Applications (Neural Information Processing), The MIT Press.
- Hall, P. and Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.