Transcript and Presenter's Notes

Title: Using Fast Weights to Improve Persistent Contrastive Divergence


1
Using Fast Weights to Improve Persistent Contrastive Divergence
  • Tijmen Tieleman and Geoffrey Hinton, Department of Computer Science, University of Toronto

ICML 2009
presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University
2
Problems of interest: Density Estimation and Classification using RBMs
  • RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., recurrent neural network), often used as an associative memory
  • Can also be seen as a particular case of a Deep Belief Network (DBN)
  • Why "restricted"? Because we restrict connectivity: no intra-layer connections

[Figure: bipartite RBM graph. The hidden units carry the internal, or hidden, representations; the visible units hold the data pattern (a binary vector). Adapted from www.iro.montreal.ca]
(Hinton, 2002; Smolensky, 1986)
3
Notation
  • Define the following energy function over the visible state v and hidden state h:
    E(v, h) = - \sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i w_{ij} h_j
    where v_i is the state of the i-th visible unit, h_j is the state of the j-th hidden unit, w_{ij} is the weight of the i-j connection, and b_i, c_j are the biases
  • The joint probability P(v, h) and the marginal P(v) are
    P(v, h) = e^{-E(v, h)} / Z,    P(v) = \sum_h e^{-E(v, h)} / Z,    Z = \sum_{v, h} e^{-E(v, h)}
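Because there are no intra-layer connections, the conditional distributions factorize over units; this is what makes the Gibbs updates used in the following slides cheap. From the energy function above (same symbols as on this slide):
    p(h_j = 1 | v) = \sigma(c_j + \sum_i v_i w_{ij}),    p(v_i = 1 | h) = \sigma(b_i + \sum_j w_{ij} h_j),    where \sigma(x) = 1 / (1 + e^{-x})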
4
Training with gradient descent
  • Training data likelihood (using just one datum v for simplicity): log P(v)
  • The positive gradient is easy
  • But the negative gradient is intractable (spelled out below)
  • We can't even sample from the model exactly, so no Monte Carlo approximation
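Spelled out for a single weight (a standard identity for the energy on the Notation slide), the two gradient terms are
    \partial \log P(v) / \partial w_{ij} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
The positive (data) term is a conditional expectation given the training datum and is easy to compute; the negative (model) term is an expectation under the model distribution itself, which we can neither compute nor sample from exactly.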

5
Contrastive Divergence (CD)
  • However, we can approximately sample from the model. The existing Contrastive Divergence (CD) algorithm is one way to do it
  • CD gets the direction of the gradient approximately right, though not the magnitude
  • The rough idea behind CD is to:
  • start a Markov chain at one of the training points used to estimate the positive (data) term
  • perform one Gibbs update, i.e., sample h given v and then v given h
  • treat the resulting configuration (v, h) as a sample from the model (a code sketch follows this slide)
  • What about Persistent CD?

(Hinton, 2002)
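A minimal NumPy sketch of one CD-1 gradient estimate for a binary RBM is given below; the variable names and the choice of using hidden probabilities rather than binary samples in the statistics are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, b, c, rng=None):
    """One CD-1 gradient estimate for a binary RBM (illustrative sketch).

    v0: training vector, shape (n_visible,); W: weights, shape (n_visible, n_hidden);
    b, c: visible and hidden biases.
    """
    rng = rng or np.random.default_rng()

    # Positive phase: hidden probabilities given the training vector.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One Gibbs update: reconstruct the visibles, then recompute the hiddens.
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)

    # Approximate gradient: <v h>_data minus <v h> at the one-step sample.
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc
```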
6
Persistent Contrastive Divergence (PCD)
  • Use a persistent Markov chain that is not reinitialized each time the parameters are changed
  • The learning rate should be small compared to the mixing rate of the Markov chain
  • Many persistent chains can be run in parallel; the corresponding (v, h) pairs are called fantasy particles (see the sketch below)
  • For a fixed amount of computation, RBMs can learn better models using PCD
  • Again, PCD is a previously existing algorithm

(Neal, 1992; Tieleman, 2008)
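In the same illustrative style, a minimal sketch of one PCD update: the persistent fantasy particles are passed in, advanced by a single Gibbs sweep under the current parameters, and returned for the next call instead of being reset. Mini-batch handling and the learning rate are assumptions.

```python
import numpy as np

def pcd_step(batch, chains, W, b, c, lr=1e-3, rng=None):
    """One PCD parameter update for a binary RBM (illustrative sketch).

    batch:  (n_batch, n_visible) training vectors
    chains: (n_chains, n_visible) persistent fantasy particles, never reset here
    """
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: statistics from the data.
    ph_data = sigmoid(c + batch @ W)

    # Negative phase: advance the persistent chains by one Gibbs sweep
    # using the current parameters, then keep them for the next call.
    ph = sigmoid(c + chains @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(b + h @ W.T)
    chains = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(c + chains @ W)

    # Stochastic gradient step: <v h>_data minus <v h>_fantasy.
    W += lr * (batch.T @ ph_data / len(batch) - chains.T @ ph_model / len(chains))
    b += lr * (batch.mean(0) - chains.mean(0))
    c += lr * (ph_data.mean(0) - ph_model.mean(0))
    return chains
```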
7
Contributions and outline
  • Theoretical: show the interaction between the mixing rates and the weight updates in PCD
  • Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff
  • Outline for the rest of the talk:
  • Mixing rates vs weight updates
  • Fast weights
  • PCD algorithm with fast weights (FPCD)
  • Experiments

8
Mixing rates vs weight updates
  • Consider M persistent chains
  • The states (v,h) of the chains define a
    distribution R consisting of M point masses
  • Assume M is large enough that we can ignore
    sampling noise
  • The weights are updated in the direction of the negative gradient of KL(P || Q) - KL(R || Q) (derivation below)
  • P is the data distribution and Q is the intractable model distribution (being approximated by R)
  • \theta is the vector of parameters (weights)
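To see why this is the objective being followed, differentiate each KL term with respect to a weight, holding R fixed (a standard identity for the RBM):
    \partial KL(P || Q_\theta) / \partial w_{ij} = \langle v_i h_j \rangle_{Q_\theta} - \langle v_i h_j \rangle_P
    \partial KL(R || Q_\theta) / \partial w_{ij} = \langle v_i h_j \rangle_{Q_\theta} - \langle v_i h_j \rangle_R
The intractable \langle v_i h_j \rangle_{Q_\theta} term cancels in the difference, so the PCD update \langle v_i h_j \rangle_P - \langle v_i h_j \rangle_R is exactly the negative gradient of KL(P || Q_\theta) - KL(R || Q_\theta).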

9
Mixing rates vs weight updates
  • Terms in the objective function KL(P || Q) - KL(R || Q):
  • KL(P || Q) is the negative log-likelihood (minus the fixed entropy of P; expanded below)
  • the second term enters with a minus sign, so KL(R || Q) is being maximized w.r.t. \theta
  • The weight updates therefore increase KL(R || Q) (which is bad), but
  • This is compensated by an increase in the mixing rates, making KL(R || Q) decrease rapidly (which is good)
  • Essentially, the fantasy particles quickly rule out large portions of the search space where Q is negligible
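The first point follows from expanding the KL divergence:
    KL(P || Q_\theta) = \sum_v P(v) \log( P(v) / Q_\theta(v) ) = -H(P) - E_{v \sim P}[ \log Q_\theta(v) ]
Since the entropy H(P) of the data distribution does not depend on \theta, minimizing KL(P || Q_\theta) is equivalent to minimizing the negative log-likelihood.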
10
Fast weights
  • In addition to the regular weights \theta, the paper introduces a second set of fast weights
  • Fast weights are only used for the fantasy particles; their learning rate is larger and their weight-decay is much stronger (weight-decay, i.e., an L2 penalty as in ridge regression)
  • The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster
  • This way, the fantasy particles can escape low-energy local modes; this counteracts the slower mixing caused by the progressive reduction in the regular learning rate, which is otherwise desirable as learning progresses
  • The learning rate of the fast weights stays constant, but the fast weights themselves decay quickly, so their effect is temporary

(Bharath & Borkar, 1999)
11
PCD algorithm with fast weights (FPCD)
[Algorithm pseudocode shown on slide; the annotation highlights the strong weight decay applied to the fast weights. A code sketch follows.]
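A minimal sketch of one FPCD weight update, in the same illustrative NumPy style as before: consistent with the previous slide, the fast weights only affect how the fantasy particles are sampled, they share the regular gradient, and they are decayed strongly at every step. The learning rates, the decay factor, and the omission of bias updates are assumptions for illustration, not values from the paper.

```python
import numpy as np

def fpcd_step(batch, chains, W, W_fast, b, c, lr, lr_fast=0.02, fast_decay=0.95,
              rng=None):
    """One FPCD update of the weight matrices of a binary RBM (illustrative)."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: data statistics under the regular weights.
    ph_data = sigmoid(c + batch @ W)
    pos = batch.T @ ph_data / len(batch)

    # Negative phase: advance the persistent chains with the COMBINED weights,
    # so the fast weights raise the energy around the fantasy particles and
    # push them out of low-energy modes.
    Wc = W + W_fast
    ph = sigmoid(c + chains @ Wc)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(b + h @ Wc.T)
    chains = (rng.random(pv.shape) < pv).astype(float)
    neg = chains.T @ sigmoid(c + chains @ Wc) / len(chains)

    grad = pos - neg
    W = W + lr * grad                              # slow, decaying learning rate
    W_fast = fast_decay * W_fast + lr_fast * grad  # large rate, strong decay
    return chains, W, W_fast
```

Because W_fast is shrunk by a constant factor at every update, its contribution fades within a few steps once the gradient stops pushing in the same direction, which is why its effect on the energy surface is only temporary.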
12
Experiments: MNIST dataset
  • Small-scale task: density estimation using an RBM with 25 hidden units
  • Larger task: classification using an RBM with 500 hidden units
  • In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types.
  • In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time
  • Performance is measured on a held-out test set
  • The learning rate (for regular weights) decays linearly to zero over the computation time; for fast weights it is constant (1/e)

(Hinton et al., 2006; Larochelle & Bengio, 2008)
13
Experiments: MNIST dataset (fixed RBM size)
14
Experiments: MNIST dataset (optimized RBM size)
  • FPCD: 1200 hidden units
  • PCD: 700 hidden units

15
Experiments: Micro-NORB dataset
  • Classification task on 96x96 images, downsampled
    to 32x32
  • MNORB dimensionality (before downsampling) is
    18432, while MNIST is 784
  • Learning rate decays as 1/t for regular weights

(LeCun et al., 2004)
16
Experiments: Micro-NORB dataset
[Plot; annotation: non-monotonicity indicates overfitting problems]
17
Conclusion
  • FPCD outperforms PCD, especially when the number of weight updates is small
  • FPCD allows more flexible learning rate schedules than PCD
  • Results on the MNORB data also indicate that FPCD outperforms PCD on datasets where overfitting is a concern
  • Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced (downsampled) dataset
  • Future work: run FPCD for a longer time on an established dataset