Information Bottleneck EM - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Information Bottleneck EM


1
Information Bottleneck EM
  • Gal Elidan and Nir Friedman

School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel
2
Learning with Hidden Variables
Input: data over X1 ... XN with the values of the hidden variable T missing
Output: a model P(X,T)
[Figure: network over X1 ... XN with hidden variable T; data table with missing T values shown as ?]
  • Problem: no closed-form solution for ML estimation
  • Use Expectation Maximization (EM)
  • Problem: EM gets stuck in inferior local maxima
  • Random restarts
  • Deterministic annealing
  • Simulated annealing

EM with information regularization for learning parameters
3
Learning Parameters
Input: complete DATA over X1 ... XN, i.e., the empirical distribution Q(X)
Output: a model Pθ(X)
[Figure: network over X1 ... XN and the data table]
Parametrization θ of P:
Pθ(X1) = Q(X1), Pθ(X2|X1) = Q(X2|X1), Pθ(X3|X1) = Q(X3|X1)
4
Learning with Hidden Variables
Input: DATA over X1 ... XN with the values of T missing, and a desired structure
[Figure: network over X1 ... XN with hidden variable T; data table with missing T values shown as ?]
EM iterations:
Start with a guess of θ
For each instance ID, complete the value of T, giving the empirical distribution Q(X,T,Y) = Q(X,Y) Q(T|Y)
Re-estimate the parametrization θ for P
5
EM Functional
The EM Algorithm:
E-Step: Generate the empirical distribution Q
M-Step: Maximize with respect to θ
EM is equivalent to optimizing a functional of Q and Pθ; each step increases the value of the functional.
Neal and Hinton, 1998
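As a reference for this slide, the functional can be written in the standard Neal-Hinton form; this is a LaTeX reconstruction of a formula that appears only as an image in the original slides, not a verbatim copy:

    % EM as coordinate ascent on a single functional of Q and \theta
    F(Q, \theta) \;=\; \mathbb{E}_{Q(T)}\!\left[ \log P_\theta(X, T) \right] \;+\; H_Q(T)
    % E-step: maximize F over Q with \theta fixed; M-step: maximize F over \theta with Q fixed.
    % Each step can only increase F, which lower-bounds the log-likelihood.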
6
Information Bottleneck EM
  • Target: a trade-off between the EM target and the information between the hidden variable and the instance ID (a reconstruction of this target appears below)
  • In the rest of the talk:
  • Understanding this objective
  • How to use it to learn better models
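A hedged LaTeX reconstruction of the target, pieced together from the slide's own labels ("EM target", "information between hidden and ID") and the weighted form quoted later on the multiple-hidden-variables slide; the actual expression on the slide is an image:

    % Trade-off between the EM functional and compression of the instance ID Y,
    % controlled by a parameter \gamma in [0, 1] (reconstruction, not verbatim).
    \mathcal{L}_{\mathrm{IB\text{-}EM}}
        \;=\; \gamma \, \mathcal{F}_{\mathrm{EM}}(Q, \theta)
        \;-\; (1 - \gamma)\, I_Q(T; Y)
    % \gamma = 1 recovers the EM functional; \gamma = 0 rewards only compression of Y.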
7
Information Regularization
  • Motivating idea:
  • Fit training data: set T to be the instance ID to predict X
  • Generalization: forget the ID and keep the essence of X

Objective: a parameter-free regularization of Q that trades off (a lower bound of) the likelihood of Pθ vs. compression of the instance ID.
Tishby et al., 1999
8
Clustering example
[Figure: the EM target plotted against the compression measure]
9
Clustering example
At γ = 1: total preservation, T ≡ instance ID. [Figure: EM target vs. compression measure; each of the 11 instances forms its own cluster]
10
Clustering example
At an intermediate γ: the desired solution, with T taking 2 values (clusters). [Figure: EM target vs. compression measure]
11
Information Bottleneck EM
EM functional
  • Formal equivalence with the Information Bottleneck
  • At γ = 1, EM and the Information Bottleneck coincide
  • Generalizes a result of Slonim and Weiss for the univariate case

12
Information Bottleneck EM
EM functional
  • Formal equivalence with the Information Bottleneck

The maximum of Q(T|Y) is obtained at a fixed point that combines the prediction of T using Pθ, the marginal of T in Q, and a normalization term.
13
The IB-EM Algorithm for fixed γ
  • Iterate until convergence:
  • E-Step: Maximize L_IB-EM by optimizing Q
  • M-Step: Maximize L_IB-EM by optimizing Pθ (same as the standard M-step)
  • Each step improves L_IB-EM
  • Guaranteed to converge (a structural sketch follows below)
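A minimal structural sketch of this inner loop in Python; e_step, m_step, and ib_em_objective are hypothetical callables standing in for the model-specific computations, not functions from the paper:

    def ib_em_fixed_gamma(Q, theta, gamma, e_step, m_step, ib_em_objective, tol=1e-6):
        """Alternate E- and M-steps at a fixed gamma until L_IB-EM stops improving."""
        prev = ib_em_objective(Q, theta, gamma)
        while True:
            Q = e_step(Q, theta, gamma)        # maximize L_IB-EM over Q, theta fixed
            theta = m_step(Q)                  # maximize L_IB-EM over theta (standard M-step)
            curr = ib_em_objective(Q, theta, gamma)
            if curr - prev < tol:              # each step can only increase the objective
                return Q, theta, curr
            prev = curr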

14
Information Bottleneck EM
  • Target: a trade-off between the EM target and the information between the hidden variable and the instance ID
  • In the rest of the talk:
  • Understanding this objective
  • How to use it to learn better models
15
Continuation
Follow the ridge of L_IB-EM from the (easy) optimum at γ = 0 to the (hard) problem at γ = 1. [Figure: L_IB-EM surface over (Q, γ)]
16
Continuation
  • Recall: if Q is a local maximum of L_IB-EM, then the gradient with respect to Q(t|y) vanishes
  • We want to follow a path in (Q, γ) space so that this condition keeps holding for all t and y
  • i.e., a local maximum for all γ
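A hedged LaTeX reconstruction of the two conditions referenced on this slide (the actual equations are images and are not preserved in this transcript); this is the standard continuation condition under the stated stationarity assumption:

    % Stationarity of Q at a local maximum, for every t and y:
    \frac{\partial \mathcal{L}_{\mathrm{IB\text{-}EM}}}{\partial Q(t \mid y)} \;=\; 0
    % Following the ridge: pick a direction (dQ, d\gamma) that preserves stationarity:
    \frac{\partial^2 \mathcal{L}_{\mathrm{IB\text{-}EM}}}{\partial Q^2}\, dQ
        \;+\; \frac{\partial^2 \mathcal{L}_{\mathrm{IB\text{-}EM}}}{\partial Q\,\partial\gamma}\, d\gamma
        \;=\; 0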
17
Continuation Step
  1. Start at (Q, γ) where the gradient with respect to Q vanishes
  2. Compute the gradient
  3. Take the direction orthogonal to the gradient
  4. Take a step in the desired direction
[Figure: the path in (Q, γ) space with the starting point marked]

18
Staying on the ridge
  • Potential problem: the direction is tangent to the path, so a step may miss the optimum
  • Solution: use EM steps to regain the path
[Figure: a tangent step drifting off the ridge, corrected back onto the path by EM steps]
19
The IB-EM Algorithm
  • Set γ = 0 (start at the easy solution)
  • Iterate until γ = 1 (the EM solution is reached):
  • Iterate (stay on the ridge):
  • E-Step: Maximize L_IB-EM by optimizing Q
  • M-Step: Maximize L_IB-EM by optimizing Pθ
  • Step (follow the ridge):
  • Compute the gradient and the orthogonal direction
  • Take the step by changing γ and Q (an outer-loop sketch follows below)
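A minimal sketch of this outer loop in Python, continuing the hypothetical helpers from the fixed-γ sketch above; ridge_direction is likewise a placeholder for the gradient-based direction computation, not a function from the paper:

    def ib_em_continuation(Q, theta, e_step, m_step, ib_em_objective,
                           ridge_direction, step_size=0.05):
        """Follow the ridge of L_IB-EM from gamma = 0 (easy) to gamma = 1 (EM)."""
        gamma = 0.0
        while gamma < 1.0:
            # Stay on the ridge: re-optimize Q and theta at the current gamma.
            Q, theta, _ = ib_em_fixed_gamma(Q, theta, gamma,
                                            e_step, m_step, ib_em_objective)
            # Follow the ridge: direction in (Q, gamma) space, then a step along it.
            dQ, dgamma = ridge_direction(Q, theta, gamma)
            gamma = min(1.0, gamma + step_size * dgamma)
            Q = Q + step_size * dQ             # sketch: next E-step re-normalizes Q
        return ib_em_fixed_gamma(Q, theta, 1.0, e_step, m_step, ib_em_objective)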

20
Calibrating the step size
  • Potential problem:
  • Step size too small → too slow
  • Step size too large → overshoot the target

21
Calibrating the step size
Recall that I(T;Y) measures the compression of the instance ID: when I(T;Y) rises, more of the data is captured. Use the change in I(T;Y) to calibrate the step size.
[Figure: evaluation points along γ; a naive uniform schedule is too sparse in the interesting area]
  • Non-parametric: involves only Q
  • Can be bounded: I(T;Y) ≤ log2|T| (a computation sketch follows below)
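A minimal sketch of the I(T;Y) computation in Python, assuming Q(T|Y) is stored as a (num_instances x num_clusters) NumPy array with the instance IDs Y weighted uniformly; this is an illustrative reconstruction, not code from the paper:

    import numpy as np

    def mutual_information_bits(q_t_given_y):
        """I(T;Y) in bits, with a uniform empirical distribution over the instance IDs Y."""
        m, k = q_t_given_y.shape
        eps = 1e-12                              # guard against log(0); sketch-level accuracy
        q = np.clip(q_t_given_y, eps, 1.0)
        p_y = np.full(m, 1.0 / m)                # Q(Y): uniform over the m instances
        p_t = p_y @ q                            # Q(T) = sum_y Q(Y) Q(T|Y)
        i_ty = float(p_y @ np.sum(q * np.log2(q / p_t), axis=1))
        return i_ty                              # bounded above by log2(k) = log2(|T|)

During continuation, the step size can then be chosen so that successive evaluation points change I(T;Y) by a roughly constant amount, which concentrates evaluations in the interesting area.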

22
The IB-EM Algorithm
  • Set γ = 0
  • Iterate until γ = 1 (the EM solution is reached):
  • Iterate (stay on the ridge):
  • E-Step: Maximize L_IB-EM by optimizing Q
  • M-Step: Maximize L_IB-EM by optimizing Pθ
  • Step (follow the ridge):
  • Compute the gradient and the orthogonal direction
  • Calibrate the step size using I(T;Y)
  • Take the step by changing γ and Q

23
The Stock Dataset
  • Naive Bayes model
  • Daily changes of 20 NASDAQ stocks; 1213 training and 303 test instances
  • IB-EM outperforms the best of the EM solutions
  • I(T;Y) follows the changes in likelihood
  • Continuation follows the region of change (marks show the evaluated values of γ)

[Figure: train likelihood as a function of γ from 0 to 1; the IB-EM curve ends above the best-of-EM baseline]
Boyen et al., 1999
24
Multiple Hidden Variables
  • We want to learn a model with many hidden variables
  • Naive approach: potentially exponential in the number of hiddens
  • Variational approximation: use a factorized form for Q(T|Y) (Mean Field)

L_IB-EM = γ·(Variational EM) − (1 − γ)·Regularization
Friedman et al., 2002
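A hedged note on the factorized form assumed here, written as the standard mean-field factorization over the hidden variables; the exact expression on the slide is not preserved in this transcript:

    % For each instance y, the joint posterior over the hidden variables
    % T = (T_1, ..., T_K) is approximated by a product of marginals:
    Q(T \mid y) \;\approx\; \prod_{i=1}^{K} Q(T_i \mid y)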
25
The USPS Digits dataset
  • 400 samples, 21 hidden variables
  • Superior to all Mean Field EM runs
  • Time ≈ a single exact EM run

A single IB-EM run: 27 min; exact EM: 25 min/run
Only 3 of 50 EM runs are ≥ IB-EM; EM needs ≈ 17x the time for similar results
Offers good value for your time!
26
Yeast Stress Response
  • 173 experiments (variables)
  • 6152 genes (samples)
  • 25 hidden variables
  • Superior to all Mean Field EM runs
  • An order of magnitude faster than exact EM

IB-EM: 6 hours
Exact EM: > 60 hours (5-24 experiments)
Effective when exact solution becomes intractable!
27
Summary
  • New framework for learning hidden variables
  • Formal relation of Bottleneck and EM
  • Continuation for bypassing local maxima
  • Flexible structure / variational approximation

Future Work
  • Learn an optimal γ ≠ 1 for better generalization
  • Explore other approximations of Q(T|Y)
  • Model selection: learning cardinality and enriching the structure

28
Relation to Weight Annealing
  • Init: temp = hot
  • Iterate until temp = cold:
  • Perturb the instance weights w ∝ temp
  • Use the reweighted Q^w and optimize
  • Cool down

[Figure: data over X1 ... XN and Y with per-instance weights W for instances 1 ... M]
  • Similarities
  • Change in the empirical Q
  • Morph towards the EM solution
  • Differences
  • IB-EM uses information regularization
  • IB-EM uses continuation
  • WA requires a cooling policy
  • WA is applicable to a wider range of problems

Elidan et al., 2002
29
Relation to Deterministic Annealing
  • Init: temp = hot
  • Iterate until temp = cold:
  • Insert entropy ∝ temp into the model
  • Optimize the noisy model
  • Cool down

[Figure: data over X1 ... XN and Y for instances 1 ... M]
  • Similarities
  • Use an information measure
  • Morph towards the EM solution
  • Differences
  • DA is parameterization dependent
  • IB-EM uses continuation
  • DA requires a cooling policy
  • DA is applicable to a wider range of problems