Title: Information geometry
1. Information geometry
2. Learning a distribution
- Toy motivating example: we want to model the distribution of English words (in a corpus, in natural speech, ...)
- Domain S = English words
- Default choice (without any knowledge): uniform over S.
- But suppose we are able to collect some simple statistics:
- Pr(length > 5) = 0.3
- Pr(ends in e) = 0.45
- Pr(starts with s) = 0.08
- etc.
- Now what distribution should we choose?
3. Outline
- The solution (and this talk) involves three intimately related concepts:
- Entropy
- Exponential families
- Information projection
4. Part I: Entropy
5. Formulating the problem
- Domain S = English words
- Measured features for words x ∈ S:
  - T1(x) = 1(length > 5): 1 if length > 5, 0 otherwise
  - T2(x) = 1(x ends in e)
  - etc.
- Find a distribution which satisfies the constraints
  - E[T1(x)] = 0.3, E[T2(x)] = 0.45, etc.
- but which is otherwise as random as possible (i.e. makes no assumptions other than these constraints)
6. What is randomness?
- Let X be a discrete-valued random variable: what is its randomness content?
- Intuitively:
  - 1. A fair coin has one bit of randomness
  - 2. A biased coin is less random
  - 3. Two independent fair coins have two bits of randomness
  - 4. Two dependent fair coins have less randomness
  - 5. A uniform distribution over 32 possible outcomes has 5 bits of randomness: each outcome can be described using 5 bits
7. Entropy
- If X has distribution p(·), its entropy (in bits, with log base 2) is
  H(X) = Σ_x p(x) log (1/p(x))
- Examples:
  - (i) Fair coin: H = ½ log 2 + ½ log 2 = 1
  - (ii) Coin with bias ¾: H = ¾ log 4/3 + ¼ log 4 ≈ 0.81
  - (iii) Coin with bias 0.99: H = 0.99 log (1/0.99) + 0.01 log (1/0.01) ≈ 0.08
  - (iv) Uniform distribution over k outcomes: H = log k
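A minimal numerical sketch (not from the original deck) checking these examples; the helper name entropy is mine.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                            # convention: 0 * log(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))                  # fair coin: 1 bit
print(entropy([0.75, 0.25]))                # bias-3/4 coin: ~0.81 bits
print(entropy([0.99, 0.01]))                # bias-0.99 coin: ~0.08 bits
print(entropy([1 / 32] * 32))               # uniform over 32 outcomes: 5 bits
```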
8. Entropy is concave
[Figure: the binary entropy H(p) of a coin with bias p, plotted for p from 0 to 1; it is concave and peaks at 1 bit when p = ½.]
9. Properties of entropy
- Many properties which we intuitively expect from a notion of randomness:
  - 1. Expansibility. If X has distribution (p1, p2, ..., pn) and Y has (p1, p2, ..., pn, 0), then H(X) = H(Y)
  - 2. Symmetry. E.g. the distribution (p, 1−p) has the same entropy as (1−p, p).
  - 3. Additivity. If X and Y are independent then H(X,Y) = H(X) + H(Y).
10. Additivity
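The body of this slide is an image in the original; for reference, here is the standard computation it refers to (my reconstruction, not a transcription of the slide).

```latex
% Additivity for independent X and Y with distributions p and q:
\begin{align*}
H(X,Y) &= \sum_{x,y} p(x)\,q(y)\,\log\frac{1}{p(x)\,q(y)}\\
       &= \sum_{x,y} p(x)\,q(y)\left(\log\frac{1}{p(x)} + \log\frac{1}{q(y)}\right)\\
       &= \sum_{x} p(x)\log\frac{1}{p(x)} + \sum_{y} q(y)\log\frac{1}{q(y)} \;=\; H(X) + H(Y).
\end{align*}
```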
11. Properties of entropy, cont'd
- 4. Subadditivity. H(X,Y) ≤ H(X) + H(Y)
- 5. Normalization. A fair coin has entropy one
- 6. Small for small probability. The entropy of a coin with bias p goes to zero as p goes to 0
- In fact:
  - Entropy is the only measure which satisfies these six properties! [Aczél-Forte-Ng 1975]
12. KL divergence
- Kullback-Leibler divergence (relative entropy): a distance measure between two probability distributions. If p, q have the same domain S,
  K(p, q) = Σ_x p(x) log (p(x)/q(x))
- A very fundamental and widely-used distance measure in statistics and machine learning.
- Warnings:
  - K(p,q) is not the same as K(q,p)
  - K(p,q) could be infinite!
  - But at least K(p,q) ≥ 0, with equality iff p = q
13. Entropy and KL divergence
- Say random variable X has distribution p, and u is the uniform distribution over domain S. Then
  K(p, u) = Σ_x p(x) log (p(x) |S|) = log |S| − H(X)
- Therefore, entropy tells us the distance to the uniform distribution! Also note H(X) ≤ log |S|.
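A small sketch (again, not from the deck) computing K(p, q) and checking the identity K(p, u) = log |S| − H(p); the distribution p below is made up for illustration.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0                                        # terms with p(x) = 0 contribute 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p = np.array([0.7, 0.1, 0.1, 0.1])
u = np.full(4, 0.25)                                 # uniform over |S| = 4 outcomes
H = float(np.sum(p * np.log2(1.0 / p)))
print(kl(p, u), np.log2(4) - H)                      # equal: ~0.643
print(kl(p, u), kl(u, p))                            # not equal: KL is asymmetric
```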
14. Another justification of entropy
- Let X1, X2, ..., Xn be i.i.d. (independent, identically distributed) random variables. Consider the joint distribution over sequences (X1, ..., Xn).
- For large n, put these sequences into two groups:
  - (I) sequences whose probability is roughly 2^{−nH}, where H = H(Xi)
  - (II) all other sequences
- Then group I contains almost all the probability mass!
- This is the asymptotic equipartition property (AEP).
15. Asymptotic equipartition
[Figure: the space of possible sequences, with the typical set (sequences with probability about 2^{−nH}) shown as a small region inside it.]
For large n, the distribution over sequences (X1, ..., Xn) looks a bit like a uniform distribution over 2^{nH} possible outcomes. Entropy tells us the volume of the typical set.
16. AEP examples
- For large n, the distribution over sequences (X1, ..., Xn) looks a bit like a uniform distribution over 2^{nH} possible outcomes.
- Example: Xi = fair coin
  - Then (X1, ..., Xn) is uniform over 2^n outcomes
  - (Group I is everything)
- Example: Xi = coin with bias ¾
  - A typical sequence has about 75% heads, and therefore probability around q = (3/4)^{(3/4)n} (1/4)^{(1/4)n}
  - Notice log q = ¾ n log ¾ + ¼ n log ¼ = −n H(¼), so the probability of a typical sequence is indeed 2^{−nH}
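A quick simulation (not from the deck) illustrating this for the bias-¾ coin: the per-symbol log-probability of a random sequence concentrates around H ≈ 0.81. The sequence length and number of samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bias = 10_000, 0.75
H = bias * np.log2(1 / bias) + (1 - bias) * np.log2(1 / (1 - bias))   # ~0.811 bits

for _ in range(5):
    x = rng.random(n) < bias                       # one sequence of n biased coin flips
    heads = int(x.sum())
    log2_prob = heads * np.log2(bias) + (n - heads) * np.log2(1 - bias)
    print(-log2_prob / n, "vs H =", H)             # each sequence has probability ~ 2^(-nH)
```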
17. Proof of AEP
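The derivation on this slide is an image in the original; for reference, the standard one-line argument (weak AEP via the law of large numbers, my reconstruction):

```latex
% Since the X_i are i.i.d., the random variables log(1/p(X_i)) are too, so
\[
  -\frac{1}{n}\log p(X_1,\dots,X_n)
  \;=\; \frac{1}{n}\sum_{i=1}^{n}\log\frac{1}{p(X_i)}
  \;\longrightarrow\; \mathbb{E}\!\left[\log\frac{1}{p(X_1)}\right] \;=\; H(X_1)
  \quad\text{in probability,}
\]
% hence with probability tending to 1 the observed sequence has probability $2^{-n(H\pm\epsilon)}$.
```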
18. Back to our main question
- S = English words
- For x ∈ S, we have features T1(x), ..., Tk(x)
  - (e.g. T1(x) = 1(length > 5))
- Find a distribution p over S which
  - 1. satisfies certain constraints:
    - E[Ti(x)] = bi
    - (e.g. the fraction of words with length > 5 is 0.3)
  - 2. has maximum entropy
- The maximum entropy principle.
19. Maximum entropy
- Think of p as a vector of length |S|:
  maximize H(p) subject to Σ_x p(x) Ti(x) = bi for each i, Σ_x p(x) = 1, p ≥ 0
- Maximizing a concave function subject to linear constraints: a convex optimization problem!
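A minimal sketch (not from the deck) of this convex program, using scipy's generic SLSQP solver. The six-word domain and the target moments are made up for illustration (they are feasible for this tiny domain).

```python
import numpy as np
from scipy.optimize import minimize

words = ["entropy", "code", "sun", "word", "sample", "apple"]      # hypothetical tiny domain S
T = np.array([[float(len(w) > 5), float(w.endswith("e")), float(w.startswith("s"))]
              for w in words])                                     # one row of features per word
b = np.array([0.3, 0.45, 0.08])                                    # target expectations E[T_i(x)]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))          # minimizing -H(p) maximizes entropy

constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},     # p is a distribution
               {"type": "eq", "fun": lambda p: T.T @ p - b}]       # moment constraints
p0 = np.full(len(words), 1.0 / len(words))                         # start at the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(words),
               constraints=constraints, method="SLSQP")
print(res.x)        # the maximum-entropy distribution over the toy domain
print(T.T @ res.x)  # should be close to b
```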
20. Alternative formulation
- Suppose we have a prior π, and we want the distribution closest to it (in KL distance) which satisfies the constraints:
  minimize K(p, π) subject to E_p[Ti(x)] = bi for each i
- A more general convex optimization problem: to get maximum entropy, choose π to be uniform.
21. A projection operation
[Figure: this page is the probability simplex, i.e. the space of valid probability distributions over S, viewed as |S|-vectors. The prior π is a point inside it, L is the affine subspace given by the constraints, and p is the point of L closest to π.]
p is the I-projection (information projection) of π onto the subspace L.
22. Solution by calculus
Use Lagrange multipliers. Solution:
p(x) = (1/Z) π(x) exp( λ1 T1(x) + ... + λk Tk(x) )
(Z is a normalizer)
This is familiar: the exponential family generated by π!
23. Form of the solution
- Back to our toy problem:
  - p(x) ∝ exp( λ1 1(length > 5) + λ2 1(x ends in e) + ... )
- For instance, if λ2 = 0.81, this says that a word ending in e is e^0.81 ≈ 2.25 times as likely as one which doesn't.
24. Part II: Exponential families
25. Exponential families
- Many of the most common and widely-used probability distributions, such as the Gaussian, Poisson, Bernoulli, and multinomial, are exponential families
- To define an exponential family, start with:
  - Input space S ⊆ R^r
  - Base measure h: R^r → R
  - Features T(x) = (T1(x), ..., Tk(x))
- The exponential family generated by h and T consists of the log-linear models parametrized by θ ∈ R^k:
  - p_θ(x) ∝ e^{θ·T(x)} h(x)
26. Natural parameter space
- Input space S ⊆ R^r, base measure h: R^r → R,
- features T(x) = (T1(x), ..., Tk(x))
- Log-linear model with parameter θ ∈ R^k: p_θ(x) ∝ e^{θ·T(x)} h(x)
- Normalize these models to integrate to one:
  - p_θ(x) = e^{θ·T(x) − G(θ)} h(x)
  - where G(θ) = ln Σ_x e^{θ·T(x)} h(x) (or the appropriate integral) is the log partition function.
- This sum or integral need not always exist, so define
  - N = { θ ∈ R^k : −∞ < G(θ) < ∞ },
  - the natural parameter space.
27. Example: Bernoulli
- S = {0, 1}
- Base measure h = 1
- Features: T(x) = x
- Functional form: p_θ(x) ∝ e^{θx}
- Log partition function: G(θ) = ln(1 + e^θ)
  is defined for all θ ∈ R, so the natural parameter space is N = R.
- Distribution with parameter θ: p_θ(x) = e^{θx} / (1 + e^θ)
- We are more accustomed to the parametrization p ∈ [0,1], with p = e^θ / (1 + e^θ).
28. Parametrization of Bernoulli
[Figure: the map θ → e^θ / (1 + e^θ) takes the natural parameter θ ∈ R to the usual parameter in [0,1], aka the expectation parameter.]
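A tiny sketch (not from the deck) of this 1-to-1 correspondence, mapping the Bernoulli natural parameter to the expectation parameter and back; the helper names are mine.

```python
import numpy as np

def natural_to_expectation(theta):       # G'(theta) = e^theta / (1 + e^theta)
    return 1.0 / (1.0 + np.exp(-theta))

def expectation_to_natural(p):           # the inverse map: theta = ln(p / (1 - p))
    return np.log(p / (1.0 - p))

for theta in [-2.0, 0.0, 1.5]:
    p = natural_to_expectation(theta)
    print(theta, p, expectation_to_natural(p))   # the round trip recovers theta
```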
29. Example: Poisson
- The Poisson(λ) distribution over Z₊ = {0, 1, 2, ...} is given by p(x) = e^{−λ} λ^x / x!
- Here S = Z₊
- Base measure h(x) = 1/x!
- Feature: T(x) = x
- Functional form: p_θ(x) ∝ e^{θx} / x!
- Log partition function: G(θ) = ln Σ_{x≥0} e^{θx} / x! = ln e^{e^θ} = e^θ
- Therefore N = R. Notice θ = ln λ.
30. Example: Gaussian
- Gaussian with mean μ and variance σ²: p(x) = (2πσ²)^{−1/2} exp( −(x − μ)² / (2σ²) )
- In exponential family form: features T(x) = (x, x²), natural parameters θ = (μ/σ², −1/(2σ²)), base measure h(x) = 1 (the remaining constants are absorbed into G).
31. Properties of exponential families
- A lot of information is in the log-partition function
  - G(θ) = ln Σ_x e^{θ·T(x)} h(x)
- G is strictly convex
  - e.g. recall the Poisson: G(θ) = e^θ
- This implies, among other things, that G′ is 1-to-1
- G′(θ) = E_θ[T(x)], the mean of the feature values
  - check: G′(θ) = Σ_x T(x) e^{θ·T(x)} h(x) / Σ_x e^{θ·T(x)} h(x) = Σ_x T(x) e^{θ·T(x) − G(θ)} h(x) = E_θ[T(x)]
- G″(θ) = var_θ[T(x)], the variance of the feature values
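A small numerical check (not from the deck) of these two facts for the Poisson family, comparing finite differences of G(θ) = e^θ against the exact Poisson mean and variance; the evaluation point and step size are arbitrary.

```python
import numpy as np

def G(theta):                      # log-partition function of the Poisson family
    return np.exp(theta)

theta, eps = 1.3, 1e-5
G1 = (G(theta + eps) - G(theta - eps)) / (2 * eps)              # finite-difference G'(theta)
G2 = (G(theta + eps) - 2 * G(theta) + G(theta - eps)) / eps**2  # finite-difference G''(theta)

lam = np.exp(theta)                # Poisson(lam) has mean lam and variance lam, with T(x) = x
print(G1, lam)                     # G'(theta)  ~ mean of T(x)
print(G2, lam)                     # G''(theta) ~ variance of T(x)
```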
32. Maximum likelihood estimation
- Exponential family generated by h, T:
  - p_θ(x) = e^{θ·T(x) − G(θ)} h(x)
- Given data x1, x2, ..., xm, find the maximum likelihood θ:
  log-likelihood LL(θ) = Σ_j ( θ·T(xj) − G(θ) + ln h(xj) )
Setting derivatives to zero: G′(θ) = (1/m) Σ_j T(xj)
But recall G′(θ) = mean of T(x) under p_θ, so just pick the distribution which matches the sample average!
33. Maximum likelihood, cont'd
- Example. Fit a Poisson to integer data with mean 7.5.
- Simple: choose the Poisson with mean 7.5.
- But what is the natural parameter of this Poisson?
  - Recall that for a Poisson, G(θ) = e^θ
  - So the mean is G′(θ) = e^θ.
  - Choose θ = ln 7.5
- Inverting G′ is not always so easy...
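A short sketch (not from the deck) of this recipe on synthetic data: match the sample mean, then invert G′ to get the natural parameter. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=7.5, size=100_000)   # synthetic integer data with mean ~ 7.5

mean = data.mean()                          # MLE in the expectation parametrization
theta_hat = np.log(mean)                    # invert G'(theta) = e^theta
print(mean, theta_hat, np.log(7.5))         # theta_hat ~ ln 7.5 ~ 2.015
```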
34. Our toy problem
- Form of the solution:
  - p(x) ∝ exp( λ1 1(length > 5) + λ2 1(x ends in e) + ... )
- We know the expectation parameters, and we need the natural parameters
35. The two spaces
[Figure: two spaces connected by the 1-1 map G′: on one side N, the natural parameter space (containing θ); on the other, R^k, the expectation parameter space (containing μ).]
Given data, finding the maximum likelihood distribution is trivial under the expectation parametrization, and is a convex optimization problem under the natural parametrization.
36. Part III: Information projection
37. Back to maximum entropy
- Recall that
  - exponential families are maximum entropy distributions!
- Q: Given a prior π, and empirical averages E[Ti(x)] = bi, what is the distribution closest to π that satisfies these constraints?
- A: It is the unique member of the exponential family generated by π and T which has the given expectations.
38. Maximum entropy example
- Q: We are told that a distribution over R has E[X] = 0 and E[X²] = 10. What distribution should we pick?
- A: Features are T(x) = (x, x²). These define the family of Gaussians, so pick the Gaussian N(0, 10).
39. Maximum entropy: restatement
- Choose
  - any sample space S ⊆ R^r
  - features T(x) = (T1(x), ..., Tk(x))
  - constraints E[T(x)] = b
  - and reference prior π: R^r → R.
- If there is a distribution of the form
  - p*(x) = e^{θ·T(x) − G(θ)} π(x)
- satisfying the constraints, then it is the unique minimizer of K(p, π) over distributions p satisfying these constraints.
40. Proof
- Consider any other distribution p which satisfies the constraints. We will show K(p, π) > K(p*, π).
- Hmm, an interesting relation: K(p, π) = K(p, p*) + K(p*, π). Since K(p, p*) > 0 whenever p ≠ p*, this gives the claim.
41. Geometric interpretation
[Figure: this page is the probability simplex (the space of valid probability |S|-vectors), showing the prior π, the affine subspace L given by the constraints, the I-projection p* of π onto L, and another point p in L.]
p* is the I-projection of π onto L.
K(p, π) = K(p, p*) + K(p*, π): the Pythagorean theorem!
42. More geometry
[Figure: the simplex again, showing the prior π, the affine subspace L given by the constraints, the exponential family Q = { e^{θ·T(x) − G(θ)} π(x) : θ ∈ R^k }, and p*, the I-projection of π onto L, which lies in both L and Q.]
dim(simplex) = |S| − 1, dim(L) = |S| − k − 1, dim(Q) = k
43. Max entropy vs. max likelihood
- Given data x1, ..., xm, which define constraints
  - E[Ti(x)] = bi
- the following are equivalent:
  - 1. p* is the I-projection of π onto L, that is, p* minimizes K(p, π) over p ∈ L
  - 2. p* is the maximum likelihood distribution in Q
  - 3. p* ∈ L ∩ Q
44. An algorithm for I-projection
- Goal: project the prior π onto the constraint-satisfying affine subspace
  - L = { p : E_p[Ti(x)] = bi, i = 1, 2, ..., k }
- Define Li = { p : E_p[Ti(x)] = bi } (just the ith constraint)
- Algorithm (Csiszár):
  - Let p0 = π
  - Loop until convergence:
    - p_{t+1} = I-projection of p_t onto L_{(t mod k) + 1}
- Reduce a multidimensional problem to a series of one-dimensional problems.
45. One-dimensional I-projection
- Projecting p_t onto Li: find λi such that
  - q(x) ∝ p_t(x) e^{λi Ti(x)}
  has E_q[Ti(x)] = bi.
- Equivalently, find λi such that G′(λi) = bi, where G(λi) = ln Σ_x p_t(x) e^{λi Ti(x)}, e.g. by line search.
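A minimal sketch (not from the deck) of Csiszár's cyclic algorithm on a finite domain, using bisection for each one-dimensional projection. The toy domain and target moments are the same made-up ones as in the earlier scipy sketch (with a uniform prior), so the two results should approximately agree.

```python
import numpy as np

words = ["entropy", "code", "sun", "word", "sample", "apple"]   # hypothetical domain S
T = np.array([[float(len(w) > 5), float(w.endswith("e")), float(w.startswith("s"))]
              for w in words])
b = np.array([0.3, 0.45, 0.08])                     # target moments b_i = E[T_i(x)]

def project_onto_Li(p, t, target, lo=-30.0, hi=30.0, iters=60):
    """I-project p onto L_i = {q : E_q[t(x)] = target}: the projection has the form
    q(x) proportional to p(x) * exp(lam * t(x)), and E_q[t] is increasing in lam
    (its derivative is a variance), so bisect on lam."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        q = p * np.exp(lam * t)
        q /= q.sum()
        if q @ t < target:
            lo = lam
        else:
            hi = lam
    return q

p = np.full(len(words), 1.0 / len(words))           # p_0 = the (uniform) prior
for sweep in range(500):
    for i in range(T.shape[1]):                     # cycle through L_1, ..., L_k
        p = project_onto_Li(p, T[:, i], b[i])

print(p)        # approximate I-projection of the prior onto L
print(T.T @ p)  # should be close to b
```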
46. Proof of convergence
[Figure: the current iterate p_t, its I-projection p_{t+1} onto the constraint set Li, and the solution p*, which also lies in Li.]
We get closer to p* on each iteration: K(p*, p_{t+1}) = K(p*, p_t) − K(p_{t+1}, p_t).
47. Other methods for I-projection
- Csiszár's method is sequential: one λi at a time
- Iterative scaling (Darroch and Ratcliff):
  - parallel: all λi updated in each step
- Many variants of iterative scaling
- Gradient methods
48. Postscript: Bregman divergences
- Projections with respect to KL divergence have much in common with projections with respect to squared Euclidean distance, e.g. the Pythagorean theorem.
- Q: What other distance measures also share these properties?
- A: Bregman divergences.
- Each exponential family has a natural distance measure associated with it, its Bregman divergence:
  - Gaussian: squared Euclidean distance
  - Multinomial: KL divergence
49. Postscript: Bregman divergences, cont'd
Can define projections with respect to arbitrary Bregman divergences. Many machine learning tasks can then be seen as information projection: boosting, iterative scaling, ...
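A small sketch (not from the deck) of the general definition D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩, checked against the two cases mentioned above; the specific vectors are made up.

```python
import numpy as np

def bregman(F, gradF, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - gradF(y) @ (x - y)

# F(x) = ||x||^2 generates squared Euclidean distance (the Gaussian case).
F_sq, grad_sq = lambda x: x @ x, lambda x: 2 * x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(F_sq, grad_sq, x, y), np.sum((x - y) ** 2))          # both 2.0

# F(p) = sum p log p (negative entropy) generates KL divergence (the multinomial case).
F_ne, grad_ne = lambda p: np.sum(p * np.log(p)), lambda p: np.log(p) + 1.0
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(bregman(F_ne, grad_ne, p, q), np.sum(p * np.log(p / q)))     # both ~0.218 (nats)
```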