Title: Using Fast Weights to Improve Persistent Contrastive Divergence
1. Using Fast Weights to Improve Persistent Contrastive Divergence
- Tijmen Tieleman and Geoffrey Hinton, Department of Computer Science, University of Toronto
ICML 2009
Presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University
2. Problems of interest: Density Estimation and Classification using RBMs
- RBM = Restricted Boltzmann Machine, a stochastic version of a Hopfield network (i.e., recurrent neural network), often used as an associative memory
- Can also be seen as a particular case of a Deep Belief Network (DBN)
- Why "restricted"? Because we restrict connectivity: no intra-layer connections
[Figure: bipartite RBM architecture. The hidden units form the internal, or hidden, representations; the visible units carry the data pattern (a binary vector). Adapted from www.iro.montreal.ca]
(Hinton, 2002; Smolensky, 1986)
3. Notation
- Define the following energy function:
  E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i a_i - \sum_j h_j b_j
  where v is the visible state (v_i is the state of the i-th visible unit), h is the hidden state (h_j is the state of the j-th hidden unit), w_{ij} is the weight of the i-j connection, and a_i, b_j are the biases
- The joint probability P(v,h) and the marginal P(v) are
  P(v, h) = e^{-E(v,h)} / Z,   P(v) = \sum_h e^{-E(v,h)} / Z,   with partition function Z = \sum_{v,h} e^{-E(v,h)}
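As a concrete illustration of this notation, here is a minimal numpy sketch of the energy and the (brute-force) joint and marginal probabilities for a tiny binary RBM; the variable names (`W`, `a`, `b`) and the toy sizes are my own choices, not from the slides:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3  # tiny model, so the partition function Z can be enumerated exactly
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # w_ij: weight of the i-j connection
a = np.zeros(n_visible)  # visible biases
b = np.zeros(n_hidden)   # hidden biases

def energy(v, h):
    """E(v,h) = -sum_ij v_i h_j w_ij - sum_i v_i a_i - sum_j h_j b_j."""
    return -(v @ W @ h) - a @ v - b @ h

# Enumerate all binary states to compute Z (only feasible for toy sizes).
all_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_visible)]
all_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_hidden)]
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)

v = np.array([1, 0, 1, 0])
h = np.array([0, 1, 1])
print("P(v,h) =", np.exp(-energy(v, h)) / Z)                        # joint
print("P(v)   =", sum(np.exp(-energy(v, hh)) for hh in all_h) / Z)  # marginal
```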
4. Training with gradient descent
- Training data likelihood (using just one datum for simplicity)
- The positive (data-dependent) part of the gradient is easy
- But the negative (model-dependent) part of the gradient is intractable
- We can't even sample exactly from the model, so no Monte Carlo (MC) approximation is directly available
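For reference, these two gradient terms take the standard form below (written for a single weight w_{ij}; the angle brackets denote expectations, as in Hinton, 2002):

```latex
\frac{\partial \log P(v)}{\partial w_{ij}}
  \;=\; \langle v_i h_j \rangle_{\text{data}}
  \;-\; \langle v_i h_j \rangle_{\text{model}}
```

The data expectation (positive term) only requires P(h | v), which factorizes over the hidden units in an RBM; the model expectation (negative term) requires the full model distribution, which is why approximate sampling schemes such as CD and PCD are used.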
5. Contrastive Divergence (CD)
- However, we can approximately sample from the model; the existing Contrastive Divergence (CD) algorithm is one way to do it
- CD gets the direction of the gradient approximately right, though not the magnitude
- The rough idea behind CD is to:
  - start a Markov chain at one of the training points used to estimate the gradient
  - perform one Gibbs update, i.e., sample h from P(h|v) and then v from P(v|h)
  - treat the resulting configuration (v, h) as a sample from the model
- What about Persistent CD?
(Hinton, 2002)
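A minimal numpy sketch of one CD-1 parameter update for a binary RBM, following the recipe above; the helper names and the learning rate value are my own illustrative choices, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(v_data, W, a, b, lr=0.05):
    """One CD-1 step: start the Markov chain at a training point,
    do a single Gibbs update, and treat the result as a model sample."""
    # Positive phase: hidden probabilities driven by the data.
    h_data = sigmoid(v_data @ W + b)
    h_sample = sample_bernoulli(h_data)
    # One Gibbs update: reconstruct the visibles, then the hidden probabilities.
    v_model = sample_bernoulli(sigmoid(h_sample @ W.T + a))
    h_model = sigmoid(v_model @ W + b)
    # Approximate gradient: <v h>_data - <v h>_model (direction roughly right).
    W += lr * (np.outer(v_data, h_data) - np.outer(v_model, h_model))
    a += lr * (v_data - v_model)
    b += lr * (h_data - h_model)
    return W, a, b
```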
6. Persistent Contrastive Divergence (PCD)
- Use a persistent Markov chain that is not reinitialized each time the parameters are changed
- The learning rate should be small compared to the mixing rate of the Markov chain
- Many persistent chains can be run in parallel; the corresponding (v, h) pairs are called fantasy particles
- For a fixed amount of computation, RBMs can learn better models using PCD
- Again, PCD is a previously existing algorithm
(Neal, 1992; Tieleman, 2008)
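For contrast with CD, here is a sketch of one PCD update, in which the negative-phase state persists across parameter updates instead of being restarted at the data. It reuses numpy and the `sigmoid` / `sample_bernoulli` helpers from the CD-1 sketch above and is again only illustrative:

```python
def pcd_update(v_data, v_fantasy, W, a, b, lr=0.05):
    """One PCD step: the fantasy particle v_fantasy is advanced by a single
    Gibbs update from its previous state and is never reinitialized."""
    # Assumes numpy (np) and the sigmoid / sample_bernoulli helpers above.
    # Positive phase, as in CD.
    h_data = sigmoid(v_data @ W + b)
    # Negative phase: advance the persistent chain by one Gibbs update.
    h_fantasy = sample_bernoulli(sigmoid(v_fantasy @ W + b))
    v_fantasy = sample_bernoulli(sigmoid(h_fantasy @ W.T + a))
    h_model = sigmoid(v_fantasy @ W + b)
    # Gradient estimate: <v h>_data - <v h>_fantasy.
    W += lr * (np.outer(v_data, h_data) - np.outer(v_fantasy, h_model))
    a += lr * (v_data - v_fantasy)
    b += lr * (h_data - h_model)
    return v_fantasy, W, a, b
```

In practice many fantasy particles are advanced in parallel and their statistics averaged; a single particle is shown here for readability.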
7. Contributions and outline
- Theoretical: show the interaction between the mixing rates and the weight updates in PCD
- Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff
- Outline for the rest of the talk:
  - Mixing rates vs. weight updates
  - Fast weights
  - PCD algorithm with fast weights (FPCD)
  - Experiments
8. Mixing rates vs. weight updates
- Consider M persistent chains
- The states (v, h) of the chains define a distribution R consisting of M point masses
- Assume M is large enough that we can ignore sampling noise
- The weights are updated in the direction of the negative gradient of the objective shown below, where P is the data distribution, Q_\theta is the intractable model distribution (being approximated by R), and \theta is the vector of parameters (weights)
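The objective itself did not survive extraction; based on the paper's analysis, it is (as I read it) the difference of two KL divergences, with R held fixed during each update:

```latex
f(\theta) \;=\; \mathrm{KL}\!\left(P \,\|\, Q_{\theta}\right) \;-\; \mathrm{KL}\!\left(R \,\|\, Q_{\theta}\right)
```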
9. Mixing rates vs. weight updates
- Terms in the objective function:
  - the KL(P || Q_\theta) term is the negative log-likelihood (minus the fixed entropy of P)
  - the KL(R || Q_\theta) term is being maximized w.r.t. \theta
- The weight updates increase KL(R || Q_\theta) (which is bad), but
- this is compensated by an increase in the mixing rates, making KL(R || Q_\theta) decrease rapidly (which is good)
- Essentially, the fantasy particles quickly rule out large portions of the search space where Q is negligible
10. Fast weights
- In addition to the regular weights \theta, the paper introduces fast weights \theta_fast
- Fast weights are only used for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay in the sense of a ridge-regression penalty)
- The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster
- This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction in learning rates, which is otherwise desirable as learning progresses
- The learning rate of the fast weights stays constant, but the fast weights themselves decay quickly, so their effect is temporary
(Bharath & Borkar, 1999)
11. PCD algorithm with fast weights (FPCD)
[Algorithm figure: FPCD pseudocode; annotation: weight decay]
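Since only the slide title and a "weight decay" annotation survive here, below is a minimal sketch of the FPCD update as described on the previous slide: the fantasy particles are sampled with the combined weights W + W_fast, while the fast weights learn with a larger, constant rate and decay strongly each step. It reuses the helpers from the CD/PCD sketches; the constants (learning rates, the decay factor) and the restriction of fast weights to the connection weights only are my own simplifications, not taken from the slides:

```python
def fpcd_update(v_data, v_fantasy, W, W_fast, a, b,
                lr=0.05, lr_fast=0.25, fast_decay=19.0 / 20.0):
    """One FPCD step (sketch): the data statistics and the model itself use
    only the regular weights W; the fantasy particles mix using W + W_fast."""
    # Assumes numpy (np) and the sigmoid / sample_bernoulli helpers above.
    # Positive phase with the regular weights.
    h_data = sigmoid(v_data @ W + b)
    # Negative phase: advance the persistent chain with the COMBINED weights,
    # which raises the (combined) energy near the fantasy particles and
    # therefore speeds up their mixing.
    W_mix = W + W_fast
    h_fantasy = sample_bernoulli(sigmoid(v_fantasy @ W_mix + b))
    v_fantasy = sample_bernoulli(sigmoid(h_fantasy @ W_mix.T + a))
    h_model = sigmoid(v_fantasy @ W_mix + b)
    grad_W = np.outer(v_data, h_data) - np.outer(v_fantasy, h_model)
    # Regular weights: ordinary (small, possibly decaying) learning rate.
    W += lr * grad_W
    a += lr * (v_data - v_fantasy)
    b += lr * (h_data - h_model)
    # Fast weights: larger constant learning rate plus strong decay,
    # so their effect on the energy surface is temporary.
    W_fast = fast_decay * W_fast + lr_fast * grad_W
    return v_fantasy, W, W_fast, a, b
```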
12. Experiments: MNIST dataset
- Small-scale task: density estimation using an RBM with 25 hidden units
- Larger task: classification using an RBM with 500 hidden units
- In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types.
- In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time
- Performance is measured on a held-out test set
- The learning rate (for the regular weights) decays linearly to zero over the computation time; for the fast weights it is constant
(Hinton et al., 2006; Larochelle & Bengio, 2008)
13. Experiments: MNIST dataset (fixed RBM size)
14. Experiments: MNIST dataset (optimized RBM size)
- FPCD: 1200 hidden units
- PCD: 700 hidden units
15. Experiments: Micro-NORB dataset
- Classification task on 96x96 images, downsampled to 32x32
- MNORB dimensionality (before downsampling) is 18432, while MNIST's is 784
- The learning rate decays as 1/t for the regular weights
(LeCun et al., 2004)
16. Experiments: Micro-NORB dataset
[Results figure; annotation: non-monotonicity indicates overfitting problems]
17. Conclusion
- FPCD outperforms PCD, especially when the number of weight updates is small
- FPCD allows more flexible learning rate schedules than PCD
- Results on the MNORB data also indicate outperformance on datasets where overfitting is a concern
- Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced dataset
- Future work: run FPCD for a longer time on an established dataset