1
Information geometry
2
Learning a distribution
  • Toy motivating example: we want to model the
    distribution of English words (in a corpus, in
    natural speech, ...)
  • Domain S = English words
  • Default choice (without any knowledge): uniform
    over S.
  • But suppose we are able to collect some simple
    statistics:
  • Pr(length > 5) = 0.3
  • Pr(ends in "e") = 0.45
  • Pr(starts with "s") = 0.08
  • etc.
  • Now what distribution should we choose?

3
Outline
  • The solution (and this talk) involves three
    intimately related concepts:
  • Entropy
  • Exponential families
  • Information projection

4
Part I: Entropy
5
Formulating the problem
  • Domain S = English words
  • Measured features for words x ∈ S:
  • T1(x) = 1(length > 5): 1 if length > 5, 0
    otherwise
  • T2(x) = 1(x ends in "e")
  • etc.
  • Find a distribution which satisfies the
    constraints
  • E[T1(x)] = 0.3, E[T2(x)] = 0.45, etc.
  • but which is otherwise as random as possible (i.e.
    makes no assumptions other than these constraints)

6
What is randomness?
  • Let X be a discrete-valued random variable. What
    is its randomness content?
  • Intuitively:
  • 1. A fair coin has one bit of randomness
  • 2. A biased coin is less random
  • 3. Two independent fair coins have two bits of
    randomness
  • 4. Two dependent fair coins have less randomness
  • 5. A uniform distribution over 32 possible
    outcomes has 5 bits of randomness: each outcome
    can be described using 5 bits

7
Entropy
  • If X has distribution p(·), its entropy is
  • H(X) = Σx p(x) log 1/p(x)   (log base 2)
  • Examples:
  • (i) Fair coin: H = ½ log 2 + ½ log 2 = 1
  • (ii) Coin with bias ¾: H = ¾ log 4/3 + ¼ log 4 ≈ 0.81
  • (iii) Coin with bias 0.99:
  • H = 0.99 log 1/0.99 + 0.01 log 1/0.01 ≈ 0.08
  • (iv) Uniform distribution over k outcomes: H =
    log k
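A quick numerical check of these examples (a minimal Python sketch of ours, not part of the original slides; the entropy helper is our own):

    # Compute Shannon entropy in bits and check the coin examples above.
    import math

    def entropy(p):
        """Shannon entropy (in bits) of a probability vector p."""
        return sum(-pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([0.5, 0.5]))     # fair coin: 1.0
    print(entropy([0.75, 0.25]))   # bias 3/4: ~0.811
    print(entropy([0.99, 0.01]))   # bias 0.99: ~0.081
    print(entropy([1/32] * 32))    # uniform over 32 outcomes: 5.0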

8
Entropy is concave
[Plot: the binary entropy H(p) for p ∈ [0, 1]; it is concave, equals 0 at p = 0 and p = 1, and peaks at 1 bit when p = ½.]
9
Properties of entropy
  • Many properties which we intuitively expect from
    a notion of randomness:
  • 1. Expansibility. If X has distribution (p1, p2,
    ..., pn) and Y has (p1, p2, ..., pn, 0), then H(X) =
    H(Y)
  • 2. Symmetry. e.g. distribution (p, 1-p) has the
    same entropy as (1-p, p).
  • 3. Additivity. If X and Y are independent then
    H(X,Y) = H(X) + H(Y).

10
Additivity
  • Quick check: if X ~ p and Y ~ q are independent,
    H(X,Y) = Σx,y p(x)q(y) log 1/(p(x)q(y))
           = Σx,y p(x)q(y) [log 1/p(x) + log 1/q(y)]
           = H(X) + H(Y)

11
Properties of entropy, cont'd
  • 4. Subadditivity. H(X,Y) ≤ H(X) + H(Y)
  • 5. Normalization. A fair coin has entropy one
  • 6. Small for small probability. The entropy of
    a coin with bias p goes to zero as p goes to 0
  • In fact, entropy is the only measure which
    satisfies these six properties! [Aczel-Forte-Ng
    1975]

12
KL divergence
  • Kullback-Leibler divergence (relative entropy): a
    distance measure between two probability
    distributions. If p, q have the same domain S,
  • K(p, q) = Σx p(x) log p(x)/q(x)
  • A very fundamental and widely-used distance
    measure in statistics and machine learning.
  • Warnings:
  • K(p,q) is not the same as K(q,p)
  • K(p,q) could be infinite!
  • But at least K(p,q) ≥ 0, with equality iff p = q
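A small sketch of ours (not from the slides) illustrating these warnings on two coin distributions:

    # KL divergence in bits; shows asymmetry and nonnegativity.
    import math

    def kl(p, q):
        """K(p, q) = sum_x p(x) log2 p(x)/q(x); infinite if q(x) = 0 < p(x)."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.75, 0.25]
    q = [0.5, 0.5]
    print(kl(p, q), kl(q, p))   # two different positive numbers
    print(kl(p, p))             # 0.0: K(p,q) = 0 iff p = q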

13
Entropy and KL divergence
  • Say random variable X has distribution p, and u is
    the uniform distribution over domain S. Then
  • K(p, u) = Σx p(x) log(p(x)·|S|) = log |S| - H(X)
  • Therefore, entropy tells us the distance to the
    uniform distribution! Also note H(X) ≤ log |S|.

14
Another justification of entropy
  • Let X1, X2, ..., Xn be i.i.d. (independent,
    identically distributed) random variables.
    Consider the joint distribution over sequences
    (X1, ..., Xn).
  • For large n, put these sequences into two groups:
  • (I) sequences whose probability is roughly 2^-nH,
    where H = H(Xi)
  • (II) all other sequences
  • Then group I contains almost all the probability
    mass!
  • This is the asymptotic equipartition property
    (AEP).

15
Asymptotic equipartition
[Diagram: the space of possible sequences, with the typical set (sequences of probability about 2^-nH) highlighted inside it.]
For large n, the distribution over sequences (X1,
..., Xn) looks a bit like a uniform distribution
over 2^nH possible outcomes. Entropy tells us the
volume of the typical set.
16
AEP examples
  • For large n, the distribution over sequences (X1,
    ..., Xn) looks a bit like a uniform distribution
    over 2^nH possible outcomes.
  • Example: Xi = fair coin
  • Then (X1, ..., Xn) = uniform distribution over 2^n
    outcomes
  • (Group I is everything)
  • Example: Xi = coin with bias ¾
  • A typical sequence has about 75% heads, and
    therefore probability around q = (3/4)^(3n/4)
    (1/4)^(n/4)
  • Notice log q = ¾ n log ¾ + ¼ n log ¼ = -n H(1/4),
    so the probability of a typical sequence is
    indeed 2^-nH
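A small simulation of ours (not from the slides) showing this concentration: for i.i.d. flips of a bias-¾ coin, the per-symbol quantity -log2 p(sequence)/n settles near H ≈ 0.81 as n grows.

    import math, random

    random.seed(0)
    H = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))   # ~0.811 bits

    def neg_log_prob_rate(n, bias=0.75):
        """Draw one length-n sequence and return -log2(probability)/n."""
        flips = [random.random() < bias for _ in range(n)]
        logp = sum(math.log2(bias if f else 1 - bias) for f in flips)
        return -logp / n

    for n in (10, 100, 10000):
        print(n, round(neg_log_prob_rate(n), 3), "vs H =", round(H, 3))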

17
Proof of AEP
  • By the law of large numbers, (1/n) log 1/p(X1, ...,
    Xn) = (1/n) Σi log 1/p(Xi) → E[log 1/p(X)] = H(X),
    so with high probability a sequence has probability
    close to 2^-nH.
18
Back to our main question
  • S = English words
  • For x ∈ S, we have features T1(x), ..., Tk(x)
  • (e.g. T1(x) = 1(length > 5))
  • Find a distribution p over S which
  • 1. satisfies certain constraints:
  • E[Ti(x)] = bi
  • (e.g. fraction of words with length > 5 is 0.3)
  • 2. has maximum entropy
  • This is the maximum entropy principle.

19
Maximum entropy
  • Think of p as a vector of length |S|
  • Maximizing a concave function (entropy) subject to
    linear constraints: a convex optimization problem!
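For concreteness, here is a toy version of this convex program (a sketch of ours using scipy, not the talk's code; the word list and target values are illustrative):

    import numpy as np
    from scipy.optimize import minimize

    words = ["the", "house", "example", "sun", "little", "table"]
    # two binary features: length > 5, ends in "e"
    T = np.array([[len(w) > 5, w.endswith("e")] for w in words], dtype=float)
    b = np.array([0.3, 0.45])          # target feature expectations

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return np.sum(p * np.log(p))   # minimize -H(p)

    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
            {"type": "eq", "fun": lambda p: T.T @ p - b}]
    p0 = np.full(len(words), 1.0 / len(words))
    res = minimize(neg_entropy, p0, bounds=[(0, 1)] * len(words),
                   constraints=cons, method="SLSQP")
    print(res.x, T.T @ res.x)          # max-entropy p and its feature means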

20
Alternative formulation
  • Suppose we have a prior π, and we want the
    distribution closest to it (in KL distance) which
    satisfies the constraints:
  • minimize K(p, π) subject to Ep[Ti(x)] = bi

A more general convex optimization problem; to
get maximum entropy, choose π to be uniform.
21
A projection operation
Think of this page as the probability simplex
(i.e. the space of valid probability distributions
over S, viewed as |S|-vectors)
prior π
L = affine subspace given by the constraints
p = the I-projection (information projection) of
π onto the subspace L
22
Solution by calculus
Use Lagrange multipliers. Solution:
p(x) = (1/Z) e^(θ·T(x)) π(x)
(Z is a normalizer)
This is familiar: the exponential family
generated by π!
23
Form of the solution
  • Back to our toy problem:
  • p(x) ∝ exp( θ1·1(length > 5)
  •            + θ2·1(x ends in "e") ) π(x)
  • For instance, if θ2 = 0.81, this says that a word
    ending in "e" is e^0.81 ≈ 2.25 times as likely as
    one which doesn't.

24
Part II: Exponential families
25
Exponential families
  • Many of the most common and widely-used
    probability distributions, such as the Gaussian,
    Poisson, Bernoulli, and multinomial, are exponential
    families
  • To define an exponential family, start with:
  • Input space S ⊆ Rr
  • Base measure h: Rr → R
  • Features T(x) = (T1(x), ..., Tk(x))
  • The exponential family generated by h and T
    consists of the log-linear models parametrized by
    θ ∈ Rk:
  • pθ(x) ∝ e^(θ·T(x)) h(x)

26
Natural parameter space
  • Input space S ⊆ Rr, base measure h: Rr → R,
  • features T(x) = (T1(x), ..., Tk(x))
  • Log-linear model with parameter θ ∈ Rk: pθ(x) ∝
    e^(θ·T(x)) h(x)
  • Normalize these models to integrate to one:
  • pθ(x) = e^(θ·T(x) - G(θ)) h(x)
  • where G(θ) = ln Σx e^(θ·T(x)) h(x), or the
    appropriate integral, is the log partition
    function.
  • This integral need not always exist, so define
  • N = {θ ∈ Rk : -∞ < G(θ) < ∞},
  • the natural parameter space.

27
Example: Bernoulli
  • S = {0, 1}
  • Base measure h = 1
  • Features: T(x) = x
  • Functional form: pθ(x) ∝ e^(θx)
  • Log partition function:
    G(θ) = ln(e^(θ·0) + e^(θ·1)) = ln(1 + e^θ)

G(θ) is defined for all θ ∈ R, so the natural parameter
space is N = R. The distribution with parameter θ is
pθ(1) = e^θ/(1 + e^θ), pθ(0) = 1/(1 + e^θ).
We are more accustomed to the parametrization μ ∈
[0, 1], with μ = e^θ/(1 + e^θ).
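A small sketch of ours (not from the slides) of the two Bernoulli parametrizations and the map between them:

    import math

    def G(theta):               # log partition function ln(1 + e^θ)
        return math.log1p(math.exp(theta))

    def mean(theta):            # μ = G'(θ) = e^θ / (1 + e^θ)  (sigmoid)
        return 1.0 / (1.0 + math.exp(-theta))

    def natural(mu):            # inverse map: θ = ln(μ / (1 - μ))  (logit)
        return math.log(mu / (1.0 - mu))

    print(mean(0.0))            # θ = 0 is the fair coin, μ = 0.5
    print(natural(0.75))        # natural parameter of the bias-3/4 coin
    print(mean(natural(0.75)))  # round trip back to 0.75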
28
Parametrization of Bernoulli
[Plot: the map θ → e^θ/(1 + e^θ) from the natural
parameter θ ∈ R to the usual parameter in [0, 1],
aka the expectation parameter.]
29
Example: Poisson
  • The Poisson(λ) distribution over {0, 1, 2, ...} is
    given by p(x) = λ^x e^-λ / x!
  • Here S = {0, 1, 2, ...}
  • Base measure h(x) = 1/x!
  • Feature T(x) = x
  • Functional form: pθ(x) ∝ e^(θx)/x!
  • Log partition function:
    G(θ) = ln Σx e^(θx)/x! = ln e^(e^θ) = e^θ
  • Therefore N = R. Notice θ = ln λ.
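A quick check of ours (not from the slides) that the exponential-family form e^(θx - G(θ))/x! with G(θ) = e^θ matches the usual Poisson pmf with λ = e^θ:

    import math

    lam = 2.5
    theta = math.log(lam)                      # θ = ln λ
    for x in range(6):
        expfam = math.exp(theta * x - math.exp(theta)) / math.factorial(x)
        usual = lam ** x * math.exp(-lam) / math.factorial(x)
        print(x, round(expfam, 6), round(usual, 6))   # identical columns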

30
Example: Gaussian
  • A Gaussian with mean μ and variance σ² fits this form
    with features T(x) = (x, x²): its density is
    proportional to exp(θ1 x + θ2 x²) with θ1 = μ/σ²
    and θ2 = -1/(2σ²).
31
Properties of exponential families
  • A lot of information in the log-partition
    function:
  • G(θ) = ln Σx e^(θ·T(x)) h(x)
  • G is strictly convex
  • (e.g. recall for the Poisson, G(θ) = e^θ)
  • This implies, among other things, that G' is
    1-to-1
  • G'(θ) = Eθ[T(x)], the mean of the feature values
    (check!)
  • G''(θ) = varθ[T(x)], the variance of the feature
    values
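A numeric check of ours for these two identities, using the Poisson family where G(θ) = e^θ, so both the mean and the variance should equal λ = e^θ:

    import math

    theta = math.log(4.0)                       # λ = 4
    lam = math.exp(theta)

    # truncate the sum over x = 0, 1, 2, ...; the tail past 60 is negligible
    xs = range(60)
    pmf = [math.exp(theta * x - lam) / math.factorial(x) for x in xs]
    mean = sum(x * p for x, p in zip(xs, pmf))
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, pmf))

    print(mean, lam)   # G'(θ)  = e^θ: both ~4
    print(var, lam)    # G''(θ) = e^θ: both ~4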

32
Maximum likelihood estimation
  • Exponential family generated by h, T:
  • pθ(x) = e^(θ·T(x) - G(θ)) h(x)
  • Given data x1, x2, ..., xm, find the maximum
    likelihood θ: maximize Σj (θ·T(xj) - G(θ)).

Setting the derivative to zero: G'(θ) = (1/m) Σj T(xj).
But recall G'(θ) = mean of T(x) under pθ, so just
pick the distribution which matches the sample
average!
33
Maximum likelihood, cont'd
  • Example: fit a Poisson to integer data with mean
    7.5.
  • Simple: choose the Poisson with mean 7.5.
  • But what is the natural parameter of this
    Poisson?
  • Recall for a Poisson, G(θ) = e^θ
  • So the mean is G'(θ) = e^θ.
  • Choose θ = ln 7.5
  • Inverting G' is not always so easy
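A sketch of ours of this example: the maximum likelihood Poisson matches the sample mean, and its natural parameter comes from inverting G' in closed form (the data values are made up).

    import math

    data = [3, 9, 7, 12, 6, 8]              # made-up integer data, mean 7.5
    mean = sum(data) / len(data)            # sample average of T(x) = x
    theta_hat = math.log(mean)              # invert G'(θ) = e^θ
    lam_hat = math.exp(theta_hat)           # back to the usual rate parameter
    print(mean, theta_hat, lam_hat)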

34
Our toy problem
  • Form of the solution:
  • p(x) ∝ exp( θ1·1(length > 5)
  •            + θ2·1(x ends in "e") ) π(x)
  • We know the expectation parameters, and we need
    the natural parameters.

35
The two spaces
[Diagram: N, the natural parameter space, and the
expectation parameter space in Rk, linked by the
1-to-1 map G': θ ↦ μ.]
Given data, finding the maximum likelihood
distribution is trivial under the expectation
parametrization, and is a convex optimization
problem under the natural parametrization.
36
Part III: Information projection
37
Back to maximum entropy
  • Recall that exponential families are maximum
    entropy distributions!
  • Q: Given a prior π and empirical averages
    E[Ti(x)] = bi, what is the distribution closest to π
    that satisfies these constraints?
  • A: It is the unique member of the exponential
    family generated by π and T which has the given
    expectations.

38
Maximum entropy example
  • Q: We are told that a distribution over R has E[X] =
    0 and E[X²] = 10. What distribution should we
    pick?
  • A: The features are T(x) = (x, x²). These define the
    family of Gaussians, so pick the Gaussian
    N(0, 10).

39
Maximum entropy restatement
  • Choose
  • any sample space S ⊆ Rr,
  • features T(x) = (T1(x), ..., Tk(x)),
  • constraints E[T(x)] = b,
  • and a reference prior π: Rr → R.
  • If there is a distribution of the form
  • p*(x) = e^(θ·T(x) - G(θ)) π(x)
  • satisfying the constraints, then it is the unique
    minimizer of K(p, π) subject to these
    constraints.

40
Proof
  • Consider any other distribution p which satisfies
    the constraints. We will show K(p, π) > K(p*, π).

Hmm, an interesting relation: K(p, π) = K(p, p*) +
K(p*, π)
41
Geometric interpretation
This page is the probability simplex (the space of
valid probability vectors over S).
prior π
L = affine subspace given by the constraints
p* = the I-projection of π onto L
p = any other distribution satisfying the constraints
K(p, π) = K(p, p*) + K(p*, π): a Pythagorean theorem!
42
More geometry
Let Q = the set of distributions {e^(θ·T(x) - G(θ)) π(x) :
θ ∈ Rk}
L = affine subspace given by the constraints
p* = I-projection of π onto L; it lies in L ∩ Q
dim(simplex) = |S| - 1,  dim(L) = |S| - k - 1,  dim(Q) = k
[Diagram: the simplex containing the prior π, the subspace
L, the curved family Q, and p* at their intersection.]
43
Max entropy vs. max likelihood
  • Given data x1, ..., xm which define constraints
  • E[Ti(x)] = bi,
  • the following are equivalent:
  • 1. p* is the I-projection of π onto L, that is,
    p* minimizes K(p, π) over p ∈ L
  • 2. p* is the maximum likelihood distribution in Q
  • 3. p* ∈ L ∩ Q

44
An algorithm for I-projection
  • Goal: project the prior π onto the
    constraint-satisfying affine subspace
  • L = {p : Ep[Ti(x)] = bi, i = 1, 2, ..., k}
  • Define Li = {p : Ep[Ti(x)] = bi} (just the ith
    constraint)
  • Algorithm (Csiszár):
  • Let p0 = π
  • Loop until convergence:
  • pt+1 = I-projection of pt onto Li, cycling
    through i = 1, ..., k
  • This reduces a multidimensional problem to a series
    of one-dimensional problems.

45
One-dimensional I-projection
  • Projecting pt onto Li: find θi such that
  • pt+1(x) ∝ pt(x) e^(θi Ti(x))
  • has E[Ti(x)] = bi.
  • Equivalently, find θi such that G'(θi) = bi,
    e.g. by a line search (see the sketch below).
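A sketch of ours of this procedure on the toy problem (not the talk's code; the bisection line search and the word features are illustrative):

    import numpy as np

    def project_1d(p, t, b, lo=-30.0, hi=30.0, iters=80):
        """Find θ so that q(x) ∝ p(x) e^(θ t(x)) has E_q[t] = b (bisection)."""
        def tilt(theta):
            q = p * np.exp(theta * t)
            q /= q.sum()
            return q, q @ t
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            q, m = tilt(mid)
            if m < b:
                lo = mid          # E_q[t] is increasing in θ (it equals G'(θ))
            else:
                hi = mid
        return q

    def i_projection(prior, T, b, sweeps=200):
        p = prior.copy()
        for s in range(sweeps):
            i = s % len(b)        # cycle through the constraints L1, ..., Lk
            p = project_1d(p, T[:, i], b[i])
        return p

    # toy problem: 6 "words", two binary features, targets (0.3, 0.45)
    T = np.array([[0,1],[0,1],[1,1],[0,0],[1,1],[0,1]], dtype=float)
    b = np.array([0.3, 0.45])
    prior = np.full(6, 1/6)
    p = i_projection(prior, T, b)
    print(p, T.T @ p)             # feature expectations should be ~(0.3, 0.45)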

46
Proof of convergence
We get closer to p* on each iteration:
K(p*, pt+1) ≤ K(p*, pt) - K(pt+1, pt)
[Diagram: pt, its I-projection pt+1 onto Li, and the
target p*, which also lies in Li.]
47
Other methods for I-projection
  • Csiszár's method is sequential: one θi at a time
  • Iterative scaling (Darroch and Ratcliff):
    parallel, all θi updated in each step
  • Many variants on iterative scaling
  • Gradient methods

48
Postscript: Bregman divergences
  • Projections with respect to KL divergence have much
    in common with projections in squared Euclidean
    distance, e.g. the Pythagorean theorem.
  • Q: What other distance measures also share these
    properties?
  • A: Bregman divergences.
  • Each exponential family has a natural distance
    measure associated with it, its Bregman
    divergence:
  • Gaussian: squared Euclidean distance
  • Multinomial: KL divergence
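A minimal numerical sketch of ours of the Bregman divergence D_F(x, y) = F(x) - F(y) - ∇F(y)·(x - y), checking these two cases:

    import numpy as np

    def bregman(F, grad_F, x, y):
        return F(x) - F(y) - grad_F(y) @ (x - y)

    # F(x) = ||x||^2 gives squared Euclidean distance
    F_sq = lambda x: x @ x
    grad_sq = lambda x: 2 * x
    x, y = np.array([1.0, 2.0]), np.array([0.0, 4.0])
    print(bregman(F_sq, grad_sq, x, y), np.sum((x - y) ** 2))   # both 5.0

    # F(p) = Σ p log p (negative entropy) gives KL divergence on the simplex
    F_ent = lambda p: np.sum(p * np.log(p))
    grad_ent = lambda p: np.log(p) + 1
    p, q = np.array([0.75, 0.25]), np.array([0.5, 0.5])
    print(bregman(F_ent, grad_ent, p, q), np.sum(p * np.log(p / q)))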

49
Postscript: Bregman divergences, cont'd
Can define projections with respect to arbitrary
Bregman divergences. Many machine learning tasks
can then be seen as information projection:
boosting, iterative scaling, ...