Title: Information geometry
1. Information geometry
2. Learning a distribution
- Toy motivating example: we want to model the distribution of English words (in a corpus, in natural speech, ...)
- Domain S = English words
- Default choice (without any knowledge): uniform over S.
- But suppose we are able to collect some simple statistics:
- Pr(length > 5) = 0.3
- Pr(ends in e) = 0.45
- Pr(starts with s) = 0.08
- etc.
- Now what distribution should we choose?
3. Outline
- The solution (and this talk) involves three intimately related concepts:
- Entropy
- Exponential families
- Information projection
4. Part I: Entropy
5. Formulating the problem
- Domain S = English words
- Measured features for words x ∈ S:
  - T1(x) = 1(length > 5): 1 if length > 5, 0 otherwise
  - T2(x) = 1(x ends in e)
  - etc.
- Find a distribution which satisfies the constraints
  - E[T1(x)] = 0.3, E[T2(x)] = 0.45, etc.
- but which is otherwise as random as possible (i.e. makes no assumptions other than these constraints)
6. What is randomness?
- Let X be a discrete-valued random variable: what is its randomness content?
- Intuitively:
  - 1. A fair coin has one bit of randomness
  - 2. A biased coin is less random
  - 3. Two independent fair coins have two bits of randomness
  - 4. Two dependent fair coins have less randomness
  - 5. A uniform distribution over 32 possible outcomes has 5 bits of randomness: each outcome can be described using 5 bits
7. Entropy
- If X has distribution p(·), its entropy (in bits, with log base 2) is
  H(X) = Σ_x p(x) log (1/p(x))
- Examples:
  - (i) Fair coin: H = ½ log 2 + ½ log 2 = 1
  - (ii) Coin with bias ¾: H = ¾ log 4/3 + ¼ log 4 ≈ 0.81
  - (iii) Coin with bias 0.99: H = 0.99 log (1/0.99) + 0.01 log (1/0.01) ≈ 0.08
  - (iv) Uniform distribution over k outcomes: H = log k
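A minimal numerical sketch (not from the original deck) checking these examples; the helper name entropy is mine.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                            # convention: 0 * log(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))                  # fair coin: 1 bit
print(entropy([0.75, 0.25]))                # bias-3/4 coin: ~0.81 bits
print(entropy([0.99, 0.01]))                # bias-0.99 coin: ~0.08 bits
print(entropy([1 / 32] * 32))               # uniform over 32 outcomes: 5 bits
```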
8. Entropy is concave
[Figure: the binary entropy H(p) of a coin with bias p, plotted for p from 0 to 1; it is concave and peaks at 1 bit when p = ½.]
9. Properties of entropy
- Many properties which we intuitively expect from a notion of randomness:
  - 1. Expansibility. If X has distribution (p1, p2, ..., pn) and Y has (p1, p2, ..., pn, 0), then H(X) = H(Y)
  - 2. Symmetry. E.g. the distribution (p, 1−p) has the same entropy as (1−p, p).
  - 3. Additivity. If X and Y are independent then H(X,Y) = H(X) + H(Y).
10. Additivity
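The body of this slide is an image in the original; for reference, here is the standard computation it refers to (my reconstruction, not a transcription of the slide).

```latex
% Additivity for independent X and Y with distributions p and q:
\begin{align*}
H(X,Y) &= \sum_{x,y} p(x)\,q(y)\,\log\frac{1}{p(x)\,q(y)}\\
       &= \sum_{x,y} p(x)\,q(y)\left(\log\frac{1}{p(x)} + \log\frac{1}{q(y)}\right)\\
       &= \sum_{x} p(x)\log\frac{1}{p(x)} + \sum_{y} q(y)\log\frac{1}{q(y)} \;=\; H(X) + H(Y).
\end{align*}
```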
11. Properties of entropy, cont'd
- 4. Subadditivity. H(X,Y) ≤ H(X) + H(Y)
- 5. Normalization. A fair coin has entropy one
- 6. Small for small probability. The entropy of a coin with bias p goes to zero as p goes to 0
- In fact:
  - Entropy is the only measure which satisfies these six properties! [Aczél-Forte-Ng 1975]
12. KL divergence
- Kullback-Leibler divergence (relative entropy): a distance measure between two probability distributions. If p, q have the same domain S,
  K(p, q) = Σ_x p(x) log (p(x)/q(x))
- A very fundamental and widely-used distance measure in statistics and machine learning.
- Warnings:
  - K(p,q) is not the same as K(q,p)
  - K(p,q) could be infinite!
  - But at least K(p,q) ≥ 0, with equality iff p = q
13. Entropy and KL divergence
- Say random variable X has distribution p, and u is the uniform distribution over domain S. Then
  K(p, u) = Σ_x p(x) log (p(x) |S|) = log |S| − H(X)
- Therefore, entropy tells us the distance to the uniform distribution! Also note H(X) ≤ log |S|.
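A small sketch (again, not from the deck) computing K(p, q) and checking the identity K(p, u) = log |S| − H(p); the distribution p below is made up for illustration.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0                                        # terms with p(x) = 0 contribute 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p = np.array([0.7, 0.1, 0.1, 0.1])
u = np.full(4, 0.25)                                 # uniform over |S| = 4 outcomes
H = float(np.sum(p * np.log2(1.0 / p)))
print(kl(p, u), np.log2(4) - H)                      # equal: ~0.643
print(kl(p, u), kl(u, p))                            # not equal: KL is asymmetric
```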
14. Another justification of entropy
- Let X1, X2, ..., Xn be i.i.d. (independent, identically distributed) random variables. Consider the joint distribution over sequences (X1, ..., Xn).
- For large n, put these sequences into two groups:
  - (I) sequences whose probability is roughly 2^{−nH}, where H = H(Xi)
  - (II) all other sequences
- Then group I contains almost all the probability mass!
- This is the asymptotic equipartition property (AEP).
15. Asymptotic equipartition
[Figure: the space of possible sequences, with the typical set (sequences with probability about 2^{−nH}) shown as a small region inside it.]
For large n, the distribution over sequences (X1, ..., Xn) looks a bit like a uniform distribution over 2^{nH} possible outcomes. Entropy tells us the volume of the typical set.
16. AEP examples
- For large n, the distribution over sequences (X1, ..., Xn) looks a bit like a uniform distribution over 2^{nH} possible outcomes.
- Example: Xi = fair coin
  - Then (X1, ..., Xn) is uniform over 2^n outcomes
  - (Group I is everything)
- Example: Xi = coin with bias ¾
  - A typical sequence has about 75% heads, and therefore probability around q = (3/4)^{(3/4)n} (1/4)^{(1/4)n}
  - Notice log q = ¾ n log ¾ + ¼ n log ¼ = −n H(¼), so the probability of a typical sequence is indeed 2^{−nH}
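A quick simulation (not from the deck) illustrating this for the bias-¾ coin: the per-symbol log-probability of a random sequence concentrates around H ≈ 0.81. The sequence length and number of samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bias = 10_000, 0.75
H = bias * np.log2(1 / bias) + (1 - bias) * np.log2(1 / (1 - bias))   # ~0.811 bits

for _ in range(5):
    x = rng.random(n) < bias                       # one sequence of n biased coin flips
    heads = int(x.sum())
    log2_prob = heads * np.log2(bias) + (n - heads) * np.log2(1 - bias)
    print(-log2_prob / n, "vs H =", H)             # each sequence has probability ~ 2^(-nH)
```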
17. Proof of AEP
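The derivation on this slide is an image in the original; for reference, the standard one-line argument (weak AEP via the law of large numbers, my reconstruction):

```latex
% Since the X_i are i.i.d., the random variables log(1/p(X_i)) are too, so
\[
  -\frac{1}{n}\log p(X_1,\dots,X_n)
  \;=\; \frac{1}{n}\sum_{i=1}^{n}\log\frac{1}{p(X_i)}
  \;\longrightarrow\; \mathbb{E}\!\left[\log\frac{1}{p(X_1)}\right] \;=\; H(X_1)
  \quad\text{in probability,}
\]
% hence with probability tending to 1 the observed sequence has probability $2^{-n(H\pm\epsilon)}$.
```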
18. Back to our main question
- S = English words
- For x ∈ S, we have features T1(x), ..., Tk(x)
  - (e.g. T1(x) = 1(length > 5))
- Find a distribution p over S which
  - 1. satisfies certain constraints:
    - E[Ti(x)] = bi
    - (e.g. the fraction of words with length > 5 is 0.3)
  - 2. has maximum entropy
- The maximum entropy principle.
19. Maximum entropy
- Think of p as a vector of length |S|:
  maximize H(p) subject to Σ_x p(x) Ti(x) = bi for each i, Σ_x p(x) = 1, p ≥ 0
- Maximizing a concave function subject to linear constraints: a convex optimization problem!
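A minimal sketch (not from the deck) of this convex program, using scipy's generic SLSQP solver. The six-word domain and the target moments are made up for illustration (they are feasible for this tiny domain).

```python
import numpy as np
from scipy.optimize import minimize

words = ["entropy", "code", "sun", "word", "sample", "apple"]      # hypothetical tiny domain S
T = np.array([[float(len(w) > 5), float(w.endswith("e")), float(w.startswith("s"))]
              for w in words])                                     # one row of features per word
b = np.array([0.3, 0.45, 0.08])                                    # target expectations E[T_i(x)]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))          # minimizing -H(p) maximizes entropy

constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},     # p is a distribution
               {"type": "eq", "fun": lambda p: T.T @ p - b}]       # moment constraints
p0 = np.full(len(words), 1.0 / len(words))                         # start at the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(words),
               constraints=constraints, method="SLSQP")
print(res.x)        # the maximum-entropy distribution over the toy domain
print(T.T @ res.x)  # should be close to b
```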
20. Alternative formulation
- Suppose we have a prior π, and we want the distribution closest to it (in KL distance) which satisfies the constraints:
  minimize K(p, π) subject to E_p[Ti(x)] = bi for each i
- A more general convex optimization problem: to get maximum entropy, choose π to be uniform.
21. A projection operation
[Figure: this page is the probability simplex, i.e. the space of valid probability distributions over S, viewed as |S|-vectors. The prior π is a point inside it, L is the affine subspace given by the constraints, and p is the point of L closest to π.]
p is the I-projection (information projection) of π onto the subspace L.
22. Solution by calculus
Use Lagrange multipliers. Solution:
p(x) = (1/Z) π(x) exp( λ1 T1(x) + ... + λk Tk(x) )
(Z is a normalizer)
This is familiar: the exponential family generated by π!
23. Form of the solution
- Back to our toy problem:
  - p(x) ∝ exp( λ1 1(length > 5) + λ2 1(x ends in e) + ... )
- For instance, if λ2 = 0.81, this says that a word ending in e is e^0.81 ≈ 2.25 times as likely as one which doesn't.
24. Part II: Exponential families
25. Exponential families
- Many of the most common and widely-used probability distributions, such as the Gaussian, Poisson, Bernoulli, and multinomial, are exponential families
- To define an exponential family, start with:
  - Input space S ⊆ R^r
  - Base measure h: R^r → R
  - Features T(x) = (T1(x), ..., Tk(x))
- The exponential family generated by h and T consists of the log-linear models parametrized by θ ∈ R^k:
  - p_θ(x) ∝ e^{θ·T(x)} h(x)
26. Natural parameter space
- Input space S ⊆ R^r, base measure h: R^r → R,
- features T(x) = (T1(x), ..., Tk(x))
- Log-linear model with parameter θ ∈ R^k: p_θ(x) ∝ e^{θ·T(x)} h(x)
- Normalize these models to integrate to one:
  - p_θ(x) = e^{θ·T(x) − G(θ)} h(x)
  - where G(θ) = ln Σ_x e^{θ·T(x)} h(x) (or the appropriate integral) is the log partition function.
- This sum or integral need not always exist, so define
  - N = { θ ∈ R^k : −∞ < G(θ) < ∞ },
  - the natural parameter space.
27. Example: Bernoulli
- S = {0, 1}
- Base measure h = 1
- Features: T(x) = x
- Functional form: p_θ(x) ∝ e^{θx}
- Log partition function: G(θ) = ln(1 + e^θ)
  is defined for all θ ∈ R, so the natural parameter space is N = R.
- Distribution with parameter θ: p_θ(x) = e^{θx} / (1 + e^θ)
- We are more accustomed to the parametrization p ∈ [0,1], with p = e^θ / (1 + e^θ).
28. Parametrization of Bernoulli
[Figure: the map θ → e^θ / (1 + e^θ) takes the natural parameter θ ∈ R to the usual parameter in [0,1], aka the expectation parameter.]
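A tiny sketch (not from the deck) of this 1-to-1 correspondence, mapping the Bernoulli natural parameter to the expectation parameter and back; the helper names are mine.

```python
import numpy as np

def natural_to_expectation(theta):       # G'(theta) = e^theta / (1 + e^theta)
    return 1.0 / (1.0 + np.exp(-theta))

def expectation_to_natural(p):           # the inverse map: theta = ln(p / (1 - p))
    return np.log(p / (1.0 - p))

for theta in [-2.0, 0.0, 1.5]:
    p = natural_to_expectation(theta)
    print(theta, p, expectation_to_natural(p))   # the round trip recovers theta
```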
29. Example: Poisson
- The Poisson(λ) distribution over Z₊ = {0, 1, 2, ...} is given by p(x) = e^{−λ} λ^x / x!
- Here S = Z₊
- Base measure h(x) = 1/x!
- Feature: T(x) = x
- Functional form: p_θ(x) ∝ e^{θx} / x!
- Log partition function: G(θ) = ln Σ_{x≥0} e^{θx} / x! = ln e^{e^θ} = e^θ
- Therefore N = R. Notice θ = ln λ.
30. Example: Gaussian
- Gaussian with mean μ and variance σ²: p(x) = (2πσ²)^{−1/2} exp( −(x − μ)² / (2σ²) )
- In exponential family form: features T(x) = (x, x²), natural parameters θ = (μ/σ², −1/(2σ²)), base measure h(x) = 1 (the remaining constants are absorbed into G).
31. Properties of exponential families
- A lot of information is in the log-partition function
  - G(θ) = ln Σ_x e^{θ·T(x)} h(x)
- G is strictly convex
  - e.g. recall the Poisson: G(θ) = e^θ
- This implies, among other things, that G′ is 1-to-1
- G′(θ) = E_θ[T(x)], the mean of the feature values
  - check: G′(θ) = Σ_x T(x) e^{θ·T(x)} h(x) / Σ_x e^{θ·T(x)} h(x) = Σ_x T(x) e^{θ·T(x) − G(θ)} h(x) = E_θ[T(x)]
- G″(θ) = var_θ[T(x)], the variance of the feature values
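A small numerical check (not from the deck) of these two facts for the Poisson family, comparing finite differences of G(θ) = e^θ against the exact Poisson mean and variance; the evaluation point and step size are arbitrary.

```python
import numpy as np

def G(theta):                      # log-partition function of the Poisson family
    return np.exp(theta)

theta, eps = 1.3, 1e-5
G1 = (G(theta + eps) - G(theta - eps)) / (2 * eps)              # finite-difference G'(theta)
G2 = (G(theta + eps) - 2 * G(theta) + G(theta - eps)) / eps**2  # finite-difference G''(theta)

lam = np.exp(theta)                # Poisson(lam) has mean lam and variance lam, with T(x) = x
print(G1, lam)                     # G'(theta)  ~ mean of T(x)
print(G2, lam)                     # G''(theta) ~ variance of T(x)
```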
32. Maximum likelihood estimation
- Exponential family generated by h, T:
  - p_θ(x) = e^{θ·T(x) − G(θ)} h(x)
- Given data x1, x2, ..., xm, find the maximum likelihood θ:
  log-likelihood LL(θ) = Σ_j ( θ·T(xj) − G(θ) + ln h(xj) )
Setting derivatives to zero: G′(θ) = (1/m) Σ_j T(xj)
But recall G′(θ) = mean of T(x) under p_θ, so just pick the distribution which matches the sample average!
33. Maximum likelihood, cont'd
- Example. Fit a Poisson to integer data with mean 7.5.
- Simple: choose the Poisson with mean 7.5.
- But what is the natural parameter of this Poisson?
  - Recall that for a Poisson, G(θ) = e^θ
  - So the mean is G′(θ) = e^θ.
  - Choose θ = ln 7.5
- Inverting G′ is not always so easy...
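A short sketch (not from the deck) of this recipe on synthetic data: match the sample mean, then invert G′ to get the natural parameter. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=7.5, size=100_000)   # synthetic integer data with mean ~ 7.5

mean = data.mean()                          # MLE in the expectation parametrization
theta_hat = np.log(mean)                    # invert G'(theta) = e^theta
print(mean, theta_hat, np.log(7.5))         # theta_hat ~ ln 7.5 ~ 2.015
```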
34. Our toy problem
- Form of the solution:
  - p(x) ∝ exp( λ1 1(length > 5) + λ2 1(x ends in e) + ... )
- We know the expectation parameters, and we need the natural parameters
35. The two spaces
[Figure: two spaces connected by the 1-1 map G′: on one side N, the natural parameter space (containing θ); on the other, R^k, the expectation parameter space (containing μ).]
Given data, finding the maximum likelihood distribution is trivial under the expectation parametrization, and is a convex optimization problem under the natural parametrization.
36. Part III: Information projection
37. Back to maximum entropy
- Recall that
  - exponential families are maximum entropy distributions!
- Q: Given a prior π, and empirical averages E[Ti(x)] = bi, what is the distribution closest to π that satisfies these constraints?
- A: It is the unique member of the exponential family generated by π and T which has the given expectations.
38. Maximum entropy example
- Q: We are told that a distribution over R has E[X] = 0 and E[X²] = 10. What distribution should we pick?
- A: Features are T(x) = (x, x²). These define the family of Gaussians, so pick the Gaussian N(0, 10).
39. Maximum entropy: restatement
- Choose
  - any sample space S ⊆ R^r
  - features T(x) = (T1(x), ..., Tk(x))
  - constraints E[T(x)] = b
  - and reference prior π: R^r → R.
- If there is a distribution of the form
  - p*(x) = e^{θ·T(x) − G(θ)} π(x)
- satisfying the constraints, then it is the unique minimizer of K(p, π) over distributions p satisfying these constraints.
40. Proof
- Consider any other distribution p which satisfies the constraints. We will show K(p, π) > K(p*, π).
- Hmm, an interesting relation: K(p, π) = K(p, p*) + K(p*, π). Since K(p, p*) > 0 whenever p ≠ p*, this gives the claim.
41. Geometric interpretation
[Figure: this page is the probability simplex (the space of valid probability |S|-vectors), showing the prior π, the affine subspace L given by the constraints, the I-projection p* of π onto L, and another point p in L.]
p* is the I-projection of π onto L.
K(p, π) = K(p, p*) + K(p*, π): the Pythagorean theorem!
42. More geometry
[Figure: the simplex again, showing the prior π, the affine subspace L given by the constraints, the exponential family Q = { e^{θ·T(x) − G(θ)} π(x) : θ ∈ R^k }, and p*, the I-projection of π onto L, which lies in both L and Q.]
dim(simplex) = |S| − 1, dim(L) = |S| − k − 1, dim(Q) = k
43. Max entropy vs. max likelihood
- Given data x1, ..., xm, which define constraints
  - E[Ti(x)] = bi
- the following are equivalent:
  - 1. p* is the I-projection of π onto L, that is, p* minimizes K(p, π) over p ∈ L
  - 2. p* is the maximum likelihood distribution in Q
  - 3. p* ∈ L ∩ Q
44. An algorithm for I-projection
- Goal: project the prior π onto the constraint-satisfying affine subspace
  - L = { p : E_p[Ti(x)] = bi, i = 1, 2, ..., k }
- Define Li = { p : E_p[Ti(x)] = bi } (just the ith constraint)
- Algorithm (Csiszár):
  - Let p0 = π
  - Loop until convergence:
    - p_{t+1} = I-projection of p_t onto L_{(t mod k) + 1}
- Reduce a multidimensional problem to a series of one-dimensional problems.
45. One-dimensional I-projection
- Projecting p_t onto Li: find λi such that
  - q(x) ∝ p_t(x) e^{λi Ti(x)}
  has E_q[Ti(x)] = bi.
- Equivalently, find λi such that G′(λi) = bi, where G(λi) = ln Σ_x p_t(x) e^{λi Ti(x)}, e.g. by line search.
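A minimal sketch (not from the deck) of Csiszár's cyclic algorithm on a finite domain, using bisection for each one-dimensional projection. The toy domain and target moments are the same made-up ones as in the earlier scipy sketch (with a uniform prior), so the two results should approximately agree.

```python
import numpy as np

words = ["entropy", "code", "sun", "word", "sample", "apple"]   # hypothetical domain S
T = np.array([[float(len(w) > 5), float(w.endswith("e")), float(w.startswith("s"))]
              for w in words])
b = np.array([0.3, 0.45, 0.08])                     # target moments b_i = E[T_i(x)]

def project_onto_Li(p, t, target, lo=-30.0, hi=30.0, iters=60):
    """I-project p onto L_i = {q : E_q[t(x)] = target}: the projection has the form
    q(x) proportional to p(x) * exp(lam * t(x)), and E_q[t] is increasing in lam
    (its derivative is a variance), so bisect on lam."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        q = p * np.exp(lam * t)
        q /= q.sum()
        if q @ t < target:
            lo = lam
        else:
            hi = lam
    return q

p = np.full(len(words), 1.0 / len(words))           # p_0 = the (uniform) prior
for sweep in range(500):
    for i in range(T.shape[1]):                     # cycle through L_1, ..., L_k
        p = project_onto_Li(p, T[:, i], b[i])

print(p)        # approximate I-projection of the prior onto L
print(T.T @ p)  # should be close to b
```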
46. Proof of convergence
[Figure: the current iterate p_t, its I-projection p_{t+1} onto the constraint set Li, and the solution p*, which also lies in Li.]
We get closer to p* on each iteration: K(p*, p_{t+1}) = K(p*, p_t) − K(p_{t+1}, p_t).
47. Other methods for I-projection
- Csiszár's method is sequential: one λi at a time
- Iterative scaling (Darroch and Ratcliff):
  - parallel: all λi updated in each step
- Many variants of iterative scaling
- Gradient methods
48. Postscript: Bregman divergences
- Projections with respect to KL divergence have much in common with projections with respect to squared Euclidean distance, e.g. the Pythagorean theorem.
- Q: What other distance measures also share these properties?
- A: Bregman divergences.
- Each exponential family has a natural distance measure associated with it, its Bregman divergence:
  - Gaussian: squared Euclidean distance
  - Multinomial: KL divergence
49. Postscript: Bregman divergences, cont'd
Can define projections with respect to arbitrary Bregman divergences. Many machine learning tasks can then be seen as information projection: boosting, iterative scaling, ...
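A small sketch (not from the deck) of the general definition D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩, checked against the two cases mentioned above; the specific vectors are made up.

```python
import numpy as np

def bregman(F, gradF, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - gradF(y) @ (x - y)

# F(x) = ||x||^2 generates squared Euclidean distance (the Gaussian case).
F_sq, grad_sq = lambda x: x @ x, lambda x: 2 * x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(F_sq, grad_sq, x, y), np.sum((x - y) ** 2))          # both 2.0

# F(p) = sum p log p (negative entropy) generates KL divergence (the multinomial case).
F_ne, grad_ne = lambda p: np.sum(p * np.log(p)), lambda p: np.log(p) + 1.0
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(bregman(F_ne, grad_ne, p, q), np.sum(p * np.log(p / q)))     # both ~0.218 (nats)
```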