Title: Using Fast Weights to Improve Persistent Contrastive Divergence
1. Using Fast Weights to Improve Persistent Contrastive Divergence
- Tijmen Tieleman and Geoffrey Hinton, Department of Computer Science, University of Toronto
ICML 2009
Presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University
2. Problems of interest: Density Estimation and Classification using RBMs
- RBM = Restricted Boltzmann Machine, a stochastic version of a Hopfield network (i.e., recurrent neural network), often used as an associative memory
- Can also be seen as a particular case of a Deep Belief Network (DBN)
- Why "restricted"? Because we restrict connectivity: no intra-layer connections
[Figure: bipartite RBM architecture. The hidden units form the internal, or hidden, representations; the visible units carry the data pattern (a binary vector). Adapted from www.iro.montreal.ca]
(Hinton, 2002; Smolensky, 1986)
3. Notation
- Define the following energy function:
  E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i a_i - \sum_j h_j b_j
  where v is the visible state (v_i is the state of the i-th visible unit), h is the hidden state (h_j is the state of the j-th hidden unit), w_{ij} is the weight of the i-j connection, and a_i, b_j are the biases
- The joint probability P(v,h) and the marginal P(v) are
  P(v, h) = e^{-E(v,h)} / Z,   P(v) = \sum_h e^{-E(v,h)} / Z,   with partition function Z = \sum_{v,h} e^{-E(v,h)}
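As a concrete illustration of this notation, here is a minimal numpy sketch of the energy and the (brute-force) joint and marginal probabilities for a tiny binary RBM; the variable names (`W`, `a`, `b`) and the toy sizes are my own choices, not from the slides:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3  # tiny model, so the partition function Z can be enumerated exactly
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # w_ij: weight of the i-j connection
a = np.zeros(n_visible)  # visible biases
b = np.zeros(n_hidden)   # hidden biases

def energy(v, h):
    """E(v,h) = -sum_ij v_i h_j w_ij - sum_i v_i a_i - sum_j h_j b_j."""
    return -(v @ W @ h) - a @ v - b @ h

# Enumerate all binary states to compute Z (only feasible for toy sizes).
all_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_visible)]
all_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_hidden)]
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)

v = np.array([1, 0, 1, 0])
h = np.array([0, 1, 1])
print("P(v,h) =", np.exp(-energy(v, h)) / Z)                        # joint
print("P(v)   =", sum(np.exp(-energy(v, hh)) for hh in all_h) / Z)  # marginal
```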
4. Training with gradient descent
- Training data likelihood (using just one datum for simplicity)
- The positive (data-dependent) part of the gradient is easy
- But the negative (model-dependent) part of the gradient is intractable
- We can't even sample exactly from the model, so no Monte Carlo (MC) approximation is directly available
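For reference, these two gradient terms take the standard form below (written for a single weight w_{ij}; the angle brackets denote expectations, as in Hinton, 2002):

```latex
\frac{\partial \log P(v)}{\partial w_{ij}}
  \;=\; \langle v_i h_j \rangle_{\text{data}}
  \;-\; \langle v_i h_j \rangle_{\text{model}}
```

The data expectation (positive term) only requires P(h | v), which factorizes over the hidden units in an RBM; the model expectation (negative term) requires the full model distribution, which is why approximate sampling schemes such as CD and PCD are used.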
5. Contrastive Divergence (CD)
- However, we can approximately sample from the model; the existing Contrastive Divergence (CD) algorithm is one way to do it
- CD gets the direction of the gradient approximately right, though not the magnitude
- The rough idea behind CD is to:
  - start a Markov chain at one of the training points used to estimate the gradient
  - perform one Gibbs update, i.e., sample h from P(h|v) and then v from P(v|h)
  - treat the resulting configuration (v, h) as a sample from the model
- What about Persistent CD?
(Hinton, 2002)
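A minimal numpy sketch of one CD-1 parameter update for a binary RBM, following the recipe above; the helper names and the learning rate value are my own illustrative choices, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(v_data, W, a, b, lr=0.05):
    """One CD-1 step: start the Markov chain at a training point,
    do a single Gibbs update, and treat the result as a model sample."""
    # Positive phase: hidden probabilities driven by the data.
    h_data = sigmoid(v_data @ W + b)
    h_sample = sample_bernoulli(h_data)
    # One Gibbs update: reconstruct the visibles, then the hidden probabilities.
    v_model = sample_bernoulli(sigmoid(h_sample @ W.T + a))
    h_model = sigmoid(v_model @ W + b)
    # Approximate gradient: <v h>_data - <v h>_model (direction roughly right).
    W += lr * (np.outer(v_data, h_data) - np.outer(v_model, h_model))
    a += lr * (v_data - v_model)
    b += lr * (h_data - h_model)
    return W, a, b
```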
6. Persistent Contrastive Divergence (PCD)
- Use a persistent Markov chain that is not reinitialized each time the parameters are changed
- The learning rate should be small compared to the mixing rate of the Markov chain
- Many persistent chains can be run in parallel; the corresponding (v, h) pairs are called fantasy particles
- For a fixed amount of computation, RBMs can learn better models using PCD
- Again, PCD is a previously existing algorithm
(Neal, 1992; Tieleman, 2008)
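For contrast with CD, here is a sketch of one PCD update, in which the negative-phase state persists across parameter updates instead of being restarted at the data. It reuses numpy and the `sigmoid` / `sample_bernoulli` helpers from the CD-1 sketch above and is again only illustrative:

```python
def pcd_update(v_data, v_fantasy, W, a, b, lr=0.05):
    """One PCD step: the fantasy particle v_fantasy is advanced by a single
    Gibbs update from its previous state and is never reinitialized."""
    # Assumes numpy (np) and the sigmoid / sample_bernoulli helpers above.
    # Positive phase, as in CD.
    h_data = sigmoid(v_data @ W + b)
    # Negative phase: advance the persistent chain by one Gibbs update.
    h_fantasy = sample_bernoulli(sigmoid(v_fantasy @ W + b))
    v_fantasy = sample_bernoulli(sigmoid(h_fantasy @ W.T + a))
    h_model = sigmoid(v_fantasy @ W + b)
    # Gradient estimate: <v h>_data - <v h>_fantasy.
    W += lr * (np.outer(v_data, h_data) - np.outer(v_fantasy, h_model))
    a += lr * (v_data - v_fantasy)
    b += lr * (h_data - h_model)
    return v_fantasy, W, a, b
```

In practice many fantasy particles are advanced in parallel and their statistics averaged; a single particle is shown here for readability.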
7. Contributions and outline
- Theoretical: show the interaction between the mixing rates and the weight updates in PCD
- Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff
- Outline for the rest of the talk:
  - Mixing rates vs. weight updates
  - Fast weights
  - PCD algorithm with fast weights (FPCD)
  - Experiments
8. Mixing rates vs. weight updates
- Consider M persistent chains
- The states (v, h) of the chains define a distribution R consisting of M point masses
- Assume M is large enough that we can ignore sampling noise
- The weights are updated in the direction of the negative gradient of the objective shown below, where P is the data distribution, Q_\theta is the intractable model distribution (being approximated by R), and \theta is the vector of parameters (weights)
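The objective itself did not survive extraction; based on the paper's analysis, it is (as I read it) the difference of two KL divergences, with R held fixed during each update:

```latex
f(\theta) \;=\; \mathrm{KL}\!\left(P \,\|\, Q_{\theta}\right) \;-\; \mathrm{KL}\!\left(R \,\|\, Q_{\theta}\right)
```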
9. Mixing rates vs. weight updates
- Terms in the objective function:
  - the KL(P || Q_\theta) term is the negative log-likelihood (minus the fixed entropy of P)
  - the KL(R || Q_\theta) term is being maximized w.r.t. \theta
- The weight updates increase KL(R || Q_\theta) (which is bad), but
- this is compensated by an increase in the mixing rates, making KL(R || Q_\theta) decrease rapidly (which is good)
- Essentially, the fantasy particles quickly rule out large portions of the search space where Q is negligible
10. Fast weights
- In addition to the regular weights \theta, the paper introduces fast weights \theta_fast
- Fast weights are only used for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay in the sense of a ridge-regression penalty)
- The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster
- This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction in learning rates, which is otherwise desirable as learning progresses
- The learning rate of the fast weights stays constant, but the fast weights themselves decay quickly, so their effect is temporary
(Bharath & Borkar, 1999)
11. PCD algorithm with fast weights (FPCD)
[Algorithm figure: FPCD pseudocode; annotation: weight decay]
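Since only the slide title and a "weight decay" annotation survive here, below is a minimal sketch of the FPCD update as described on the previous slide: the fantasy particles are sampled with the combined weights W + W_fast, while the fast weights learn with a larger, constant rate and decay strongly each step. It reuses the helpers from the CD/PCD sketches; the constants (learning rates, the decay factor) and the restriction of fast weights to the connection weights only are my own simplifications, not taken from the slides:

```python
def fpcd_update(v_data, v_fantasy, W, W_fast, a, b,
                lr=0.05, lr_fast=0.25, fast_decay=19.0 / 20.0):
    """One FPCD step (sketch): the data statistics and the model itself use
    only the regular weights W; the fantasy particles mix using W + W_fast."""
    # Assumes numpy (np) and the sigmoid / sample_bernoulli helpers above.
    # Positive phase with the regular weights.
    h_data = sigmoid(v_data @ W + b)
    # Negative phase: advance the persistent chain with the COMBINED weights,
    # which raises the (combined) energy near the fantasy particles and
    # therefore speeds up their mixing.
    W_mix = W + W_fast
    h_fantasy = sample_bernoulli(sigmoid(v_fantasy @ W_mix + b))
    v_fantasy = sample_bernoulli(sigmoid(h_fantasy @ W_mix.T + a))
    h_model = sigmoid(v_fantasy @ W_mix + b)
    grad_W = np.outer(v_data, h_data) - np.outer(v_fantasy, h_model)
    # Regular weights: ordinary (small, possibly decaying) learning rate.
    W += lr * grad_W
    a += lr * (v_data - v_fantasy)
    b += lr * (h_data - h_model)
    # Fast weights: larger constant learning rate plus strong decay,
    # so their effect on the energy surface is temporary.
    W_fast = fast_decay * W_fast + lr_fast * grad_W
    return v_fantasy, W, W_fast, a, b
```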
12. Experiments: MNIST dataset
- Small-scale task: density estimation using an RBM with 25 hidden units
- Larger task: classification using an RBM with 500 hidden units
- In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types.
- In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time
- Performance is measured on a held-out test set
- The learning rate (for the regular weights) decays linearly to zero over the computation time; for the fast weights it is constant
(Hinton et al., 2006; Larochelle & Bengio, 2008)
13. Experiments: MNIST dataset (fixed RBM size)
14. Experiments: MNIST dataset (optimized RBM size)
- FPCD: 1200 hidden units
- PCD: 700 hidden units
15. Experiments: Micro-NORB dataset
- Classification task on 96x96 images, downsampled to 32x32
- MNORB dimensionality (before downsampling) is 18432, while MNIST's is 784
- The learning rate decays as 1/t for the regular weights
(LeCun et al., 2004)
16. Experiments: Micro-NORB dataset
[Results figure; annotation: non-monotonicity indicates overfitting problems]
17. Conclusion
- FPCD outperforms PCD, especially when the number of weight updates is small
- FPCD allows more flexible learning rate schedules than PCD
- Results on the MNORB data also indicate outperformance on datasets where overfitting is a concern
- Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced dataset
- Future work: run FPCD for a longer time on an established dataset