Title: Information Bottleneck EM
1. Information Bottleneck EM
- Gal Elidan and Nir Friedman
School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
2. Learning with Hidden Variables
- Input: data over X1 ... XN with a hidden variable T whose values are unobserved (marked "?" in the data)
- Output: a model P(X,T)
[Figure: network structure with hidden T over observed X1 ... XN; data table with missing entries for T]
- Problem: no closed-form solution for ML estimation → use Expectation Maximization (EM)
- Problem: EM gets stuck in inferior local maxima
  - Random restarts
  - Deterministic annealing
  - Simulated annealing
- This talk: EM + information regularization for learning parameters
3. Learning Parameters
- Input: fully observed data over X1 ... XN
- Output: a model P_θ(X)
[Figure: network structure over X1 ... XN and the data table]
- Empirical distribution Q(X)
- Parametrization θ of P is read off Q in closed form, e.g.
  P_θ(X1) = Q(X1),  P_θ(X2|X1) = Q(X2|X1),  P_θ(X3|X1) = Q(X3|X1)
  (a minimal sketch of this estimation follows below)
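As a concrete illustration (not from the slides), here is a minimal sketch of this closed-form estimation for a small discrete Bayesian network. It assumes a pandas DataFrame of complete data; the variable names, the toy data, and the helper `ml_cpd` are all hypothetical.

```python
import pandas as pd

def ml_cpd(data: pd.DataFrame, child: str, parents: list[str]) -> pd.Series:
    """Closed-form ML estimate P_theta(child | parents) = Q(child | parents),
    i.e. conditional frequencies in the empirical distribution Q."""
    if not parents:
        return data[child].value_counts(normalize=True)
    return data.groupby(parents)[child].value_counts(normalize=True)

# Hypothetical fully observed data over X1, X2, X3 (structure X1 -> X2, X1 -> X3)
data = pd.DataFrame({
    "X1": [0, 0, 1, 1, 1, 0],
    "X2": [1, 0, 1, 1, 0, 0],
    "X3": [0, 0, 1, 0, 1, 1],
})
print(ml_cpd(data, "X1", []))        # P_theta(X1)      = Q(X1)
print(ml_cpd(data, "X2", ["X1"]))    # P_theta(X2 | X1) = Q(X2 | X1)
print(ml_cpd(data, "X3", ["X1"]))    # P_theta(X3 | X1) = Q(X3 | X1)
```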
4. Learning with Hidden Variables
- Input: the desired structure over X1 ... XN and T, plus data in which T is unobserved
[Figure: the structure and the data table with "?" entries for T]
EM iterations:
- Start from a guess of θ
- The empirical distribution Q(X,T) is not directly available, since T is unobserved
- For each instance ID y, complete the value of T, giving Q(T|Y)
- Form the completed empirical distribution Q(X,T,Y) = Q(X,Y) Q(T|Y)
- Re-estimate the parametrization θ of P, and repeat
5. EM Functional
The EM algorithm:
- E-Step: generate the completed empirical distribution Q
- M-Step: maximize the expected log-likelihood using Q
- EM is equivalent to optimizing a single functional of Q and P_θ (written out below)
- Each step increases the value of this functional
(Neal and Hinton, 1998)
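The functional referred to here is the Neal and Hinton (1998) free-energy / lower-bound form; writing Y for the instance ID as in the rest of the talk (notation may differ slightly from the original slide):

```latex
\mathcal{F}_{\text{EM}}[Q, \theta]
  \;=\; \mathbb{E}_{Q}\!\big[\log P_\theta(X, T)\big] \;+\; \mathbb{H}_Q(T \mid Y)
  \;=\; \log P_\theta(\mathcal{D})
        \;-\; \mathrm{KL}\big(Q(T \mid Y)\,\big\|\,P_\theta(T \mid X)\big)
```

The E-step maximizes F over Q (setting Q(T|y) = P_θ(T | x[y])) and the M-step maximizes it over θ, so each step increases F, which lower-bounds the log-likelihood.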
6. Information Bottleneck EM
- Target: trade off the EM target against the information between the hidden variable and the instance ID (written out below)
- In the rest of the talk:
  - Understanding this objective
  - How to use it to learn better models
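Combining the two labeled terms above (and the weighting that appears explicitly on the multiple-hidden-variables slide later in the deck), the target has the following form; γ ∈ [0,1] is the trade-off parameter and Y the instance ID. The exact sign and scaling convention in the paper may differ, so treat this as a reconstruction:

```latex
\mathcal{L}_{\text{IB-EM}}[Q, \theta]
  \;=\; \gamma\, \mathcal{F}_{\text{EM}}[Q, \theta] \;-\; (1 - \gamma)\, I_Q(T; Y)
```

At γ = 1 this is exactly the EM functional; at γ = 0 only the compression term I_Q(T;Y) remains.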
7. Information Regularization
- Motivating idea
  - Fit training data: set T to be the instance ID, so it fully predicts X
  - Generalization: forget the ID and keep only the essence of X
- Objective: a parameter-free regularization of Q, trading a (lower bound of the) likelihood of P_θ against compression of the instance ID
(Tishby et al., 1999)
8. Clustering Example
[Figure: the EM target plotted against the compression measure]
9Clustering example
?1
EMTarget
Compressionmeasure
1
5
6
total preservation
11
4
7
3
?1
10
2
8
9
T ? ID
10. Clustering Example
- At a smaller γ: the desired solution, with T compressed down to |T| = 2 clusters
[Figure: the EM target vs. the compression measure at the desired trade-off]
11. Information Bottleneck EM
- Formal equivalence with the Information Bottleneck: the target above can be rewritten in EM-functional form
- At γ = 1, EM and the Information Bottleneck coincide
- This generalizes the result of Slonim and Weiss for the univariate case
12. Information Bottleneck EM
- Formal equivalence with the Information Bottleneck
- The maximum over Q(T|Y) is obtained at a self-consistent point combining the prediction of T using P_θ, the marginal of T in Q, and a normalization term (reconstructed below)
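Matching the three labels on this slide (prediction of T using P_θ, marginal of T in Q, normalization), the self-consistent optimum can be written as below; this is reconstructed under the target convention given earlier, so the exact exponents are an assumption:

```latex
Q(t \mid y) \;=\; \frac{1}{Z(y, \gamma)}\; Q(t)^{1-\gamma}\;
                  \big(P_\theta(t \mid \mathbf{x}[y])\big)^{\gamma},
\qquad
Q(t) \;=\; \sum_y Q(y)\, Q(t \mid y)
```

where x[y] denotes the observed values of instance y and Z(y, γ) normalizes over t. At γ = 1 this reduces to the standard E-step posterior.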
13. The IB-EM Algorithm for fixed γ
- Iterate until convergence:
  - E-Step: maximize L_IB-EM by optimizing Q (the self-consistent update above)
  - M-Step: maximize L_IB-EM by optimizing P_θ (same as the standard M-step)
- Each step improves L_IB-EM
- Guaranteed to converge
(a minimal code sketch follows below)
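A minimal runnable sketch of this fixed-γ loop for a naive Bayes model with a single hidden variable T over binary features; all names, the toy data, and the single-pass E-step update are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))          # toy binary data: 100 instances, 5 features
K = 3                                          # cardinality of the hidden variable T

# Parameters theta of a naive Bayes model: P(T) and P(X_j = 1 | T)
p_t = np.full(K, 1.0 / K)
p_x = rng.uniform(0.3, 0.7, size=(K, X.shape[1]))

def posterior(X, p_t, p_x):
    """P_theta(T | x[y]) for every instance y; returns an (M, K) array."""
    log_lik = (X[:, None, :] * np.log(p_x)[None]
               + (1 - X[:, None, :]) * np.log(1 - p_x)[None]).sum(-1)
    log_post = np.log(p_t)[None] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def ib_em_fixed_gamma(X, p_t, p_x, gamma, iters=50):
    """E/M iterations at a fixed gamma (one pass of the self-consistent
    E-step update per iteration; the paper may iterate it to convergence)."""
    q = np.full((X.shape[0], len(p_t)), 1.0 / len(p_t))   # Q(T | Y)
    for _ in range(iters):
        # E-step: Q(t|y) proportional to Q(t)^(1-gamma) * P_theta(t | x[y])^gamma
        q = q.mean(0)[None] ** (1 - gamma) * posterior(X, p_t, p_x) ** gamma
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta from the completed empirical distribution
        p_t = q.mean(0)
        p_x = (q.T @ X + 1e-3) / (q.sum(0)[:, None] + 2e-3)   # light smoothing
    return q, p_t, p_x

q, p_t, p_x = ib_em_fixed_gamma(X, p_t, p_x, gamma=0.5)
```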
14. Information Bottleneck EM
- Target: trade off the EM target against the information between the hidden variable and the instance ID
- In the rest of the talk:
  - Understanding this objective
  - How to use it to learn better models
15. Continuation
- Follow the ridge of L_IB-EM from the optimum at γ = 0 (easy) towards γ = 1 (hard)
[Figure: surface of L_IB-EM over (Q, γ) with γ from 0 to 1, showing the ridge from the easy optimum to the hard one]
16. Continuation
- Recall: if Q is a local maximum of L_IB-EM, then the gradient of L_IB-EM with respect to Q(t|y) vanishes for all t and y
- We want to follow a path in (Q, γ) space along which this stationarity holds for every γ, i.e. a path that stays at a local maximum for all γ (spelled out below)
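Spelled out (the equations are missing from the extracted slide, so this is the standard continuation condition, stated as an assumption about what the slide showed): keeping L_IB-EM stationary in Q while γ changes means the step direction (dQ, dγ) must satisfy

```latex
\frac{\partial \mathcal{L}_{\text{IB-EM}}}{\partial Q(t \mid y)} = 0
\quad \text{for all } t, y
\qquad\Longrightarrow\qquad
\frac{\partial^2 \mathcal{L}_{\text{IB-EM}}}{\partial Q\, \partial Q}\, dQ
 \;+\; \frac{\partial^2 \mathcal{L}_{\text{IB-EM}}}{\partial Q\, \partial \gamma}\, d\gamma \;=\; 0
```

i.e. the total derivative of each stationarity condition along the path must vanish, which is what makes the path a "local maximum for all γ".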
17. Continuation Step
- Start at (Q, γ) where the gradient with respect to Q vanishes
- Compute the gradient of the stationarity conditions
- Take the direction orthogonal to it (tangent to the solution path)
- Take a step in the desired direction
[Figure: the path in (Q, γ) space with the starting point and the step direction]
18. Staying on the Ridge
- Potential problem: the direction is only tangent to the path, so a finite step can miss the optimum
- Solution: use EM steps (at the new γ) to regain the path
[Figure: the tangent step drifting off the ridge and EM steps pulling it back]
19. The IB-EM Algorithm
- Set γ = 0 (start at the easy solution)
- Iterate until γ = 1 (the EM solution is reached):
  - Iterate (stay on the ridge):
    - E-Step: maximize L_IB-EM by optimizing Q
    - M-Step: maximize L_IB-EM by optimizing P_θ
  - Step (follow the ridge):
    - Compute the gradient and the step direction
    - Take the step by changing γ and Q
(a schematic outer loop is sketched below)
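A schematic of the outer loop, reusing the fixed-γ routine sketched on slide 13 as the "regain the ridge" corrector. The γ schedule here is just a naive fixed grid, which replaces the gradient-based direction and calibrated step size of the actual algorithm, so it illustrates the control flow only.

```python
import numpy as np

rng = np.random.default_rng(1)

def ib_em(X, K, gammas=None):
    """Anneal gamma from 0 towards 1; at each gamma, re-optimize (Q, theta)
    with ib_em_fixed_gamma (slide 13) so we stay on the ridge."""
    if gammas is None:
        gammas = np.linspace(0.0, 1.0, 21)     # naive grid; the paper calibrates the steps
    p_t = np.full(K, 1.0 / K)
    p_x = rng.uniform(0.45, 0.55, size=(K, X.shape[1]))   # small asymmetry breaks ties
    for gamma in gammas:
        q, p_t, p_x = ib_em_fixed_gamma(X, p_t, p_x, gamma)
    return q, p_t, p_x
```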
20. Calibrating the Step Size
- Potential problem
  - Step size too small → too slow
  - Step size too large → overshoot the target
21. Calibrating the Step Size
- Recall that I(T;Y) measures compression of the instance ID; when I(T;Y) rises, more of the data is captured
- Use the change in I(T;Y) to calibrate the step (see the sketch below)
[Figure: a naive fixed schedule of γ values is too sparse in the interesting area; I(T;Y)-calibrated steps concentrate there]
- Non-parametric: involves only Q
- Can be bounded: I(T;Y) ≤ log2 |T|
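Because Q(T|Y) is represented explicitly, I(T;Y) is cheap to compute from Q alone, and the change it undergoes across a step can be used to accept, shrink, or enlarge the step. A sketch; the thresholds are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def mutual_info_t_y(q):
    """I(T;Y) in bits under the empirical Q: Y is uniform over the M instances
    and q[y, t] = Q(t | y).  Always bounded above by log2 |T|."""
    q_t = q.mean(axis=0)                                  # marginal Q(T)
    return (q * (np.log2(q + 1e-12) - np.log2(q_t + 1e-12)[None])).sum(axis=1).mean()

def calibrate_step(q_old, q_new, d_gamma, target_bits=0.05):
    """Shrink the gamma step if I(T;Y) jumped by more than target_bits
    (overshoot), enlarge it if almost nothing changed (too slow)."""
    delta = abs(mutual_info_t_y(q_new) - mutual_info_t_y(q_old))
    if delta > target_bits:
        return d_gamma * 0.5
    if delta < target_bits / 10:
        return d_gamma * 2.0
    return d_gamma
```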
22. The IB-EM Algorithm
- Set γ = 0
- Iterate until γ = 1 (the EM solution is reached):
  - Iterate (stay on the ridge):
    - E-Step: maximize L_IB-EM by optimizing Q
    - M-Step: maximize L_IB-EM by optimizing P_θ
  - Step (follow the ridge):
    - Compute the gradient and the step direction
    - Calibrate the step size using I(T;Y)
    - Take the step by changing γ and Q
23. The Stock Dataset
- Naive Bayes model
- Daily changes of 20 NASDAQ stocks; 1213 train instances, 303 test instances
- IB-EM outperforms the best of the EM solutions
- I(T;Y) follows the changes in likelihood
- Continuation follows the region of change
[Figure: train likelihood (roughly -23 to -19) as a function of γ from 0 to 1; the IB-EM trajectory against the best-of-EM baseline; marks show the evaluated values of γ]
(Boyen et al., 1999)
24. Multiple Hidden Variables
- We want to learn a model with many hidden variables
- Naive approach: potentially exponential in the number of hidden variables
- Variational approximation: use a factorized (Mean Field) form for the completion Q(T|Y), keeping the structure of P_θ (see below)
- L_IB-EM = γ (variational EM functional) − (1 − γ) (information regularization)
(Friedman et al., 2002)
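The factorized form referred to here is the standard mean-field restriction on the completion distribution; with T = (T_1, ..., T_L) the hidden variables and Y the instance ID (notation as in the rest of the talk, details hedged):

```latex
Q(\mathbf{T} \mid Y) \;\approx\; \prod_{i=1}^{L} Q(T_i \mid Y)
```

so the E-step only has to maintain one small table Q(T_i | Y) per hidden variable, instead of a joint table that is exponential in the number of hidden variables, while P_θ keeps its full structure.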
25. The USPS Digits Dataset
- 400 samples, 21 hidden variables
- Superior to all Mean Field EM runs
- Running time ≈ a single exact EM run
  - single IB-EM: 27 min; exact EM: 25 min/run
  - 3/50 EM runs are ≥ IB-EM; EM needs ≈ x17 the time for similar results
- Offers good value for your time!
26. Yeast Stress Response
- 173 experiments (variables), 6152 genes (samples), 25 hidden variables
- Superior to all Mean Field EM runs
- An order of magnitude faster than exact EM
  - IB-EM: 6 hours; exact EM: >60 hours (5-24 experiments)
- Effective when the exact solution becomes intractable!
27. Summary
- New framework for learning hidden variables
- Formal relation between the Information Bottleneck and EM
- Continuation for bypassing local maxima
- Flexible structure / variational approximation
Future Work
- Learn the optimal γ (≤ 1) for better generalization
- Explore other approximations of Q(T|Y)
- Model selection: learning cardinality and enriching structure
28. Relation to Weight Annealing
Weight Annealing (WA):
- Init: temperature hot
- Iterate until temperature is cold:
  - Perturb the instance weights w ∝ temperature
  - Use the reweighted Q_W and optimize
  - Cool down
[Figure: the data table of M instances with per-instance weights W over X1 ... XN]
- Similarities
  - Change the empirical distribution Q
  - Morph towards the EM solution
- Differences
  - IB-EM uses information regularization
  - IB-EM uses continuation
  - WA requires a cooling policy
  - WA is applicable to a wider range of problems
(Elidan et al., 2002)
29. Relation to Deterministic Annealing
Deterministic Annealing (DA):
- Init: temperature hot
- Iterate until temperature is cold:
  - Insert entropy ∝ temperature into the model
  - Optimize the noisy model
  - Cool down
[Figure: the model over X1 ... XN and the data table of M instances]
- Similarities
  - Use an information measure
  - Morph towards the EM solution
- Differences
  - DA is parameterization dependent
  - IB-EM uses continuation
  - DA requires a cooling policy
  - DA is applicable to a wider range of problems