Learning Lateral Connections between Hidden Units

Transcript and Presenter's Notes

1
Learning Lateral Connections between Hidden
Units
  • Geoffrey Hinton
  • University of Toronto
  • in collaboration with
  • Kejie Bao
  • University of Toronto

2
Overview of the talk
  • Causal Model: Learns to represent images using
    multiple, simultaneous, hidden, binary causes.
  • Introduce the variational approximation trick
  • Boltzmann Machines: Learning to model the
    probabilities of binary vectors.
  • Introduce the brief Monte Carlo trick
  • Hybrid model: Use a Boltzmann machine to model
    the prior distribution over configurations of
    binary causes. Uses both tricks.
  • Causal hierarchies of MRFs: Generalize the
    hybrid model to many hidden layers.
  • The causal connections act as insulators that
    keep the local partition functions separate.

3
Bayes Nets: Hierarchies of causes
  • It is easy to generate an unbiased example at the
    leaf nodes.
  • It is typically hard to compute the posterior
    distribution over all possible configurations of
    hidden causes.
  • Given samples from the posterior, it is easy to
    learn the local interactions

(Figure: a Bayes net with hidden causes above and visible effects below)
4
A simple set of images
(Figure: two of the training images, the probabilities of turning on the binary hidden units, and the reconstructions of the images)
5
The generative model
  • To generate a datavector
  • first generate a code from the prior
    distribution
  • then generate an ideal datavector from the code
  • then add Gaussian noise.

The value that code c predicts for the ith component of the data vector is

  \hat d^c_i = \sum_j s^c_j w_{ji} + b_i

where s^c_j is the binary state of hidden unit j in code vector c, w_{ji} is the weight from hidden unit j to pixel i, and b_i is a bias.
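As an illustration, the three generative steps above can be sketched in NumPy (all names and sizes here are hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

n_hidden, n_pixels = 8, 16
W = rng.normal(0.0, 0.1, size=(n_hidden, n_pixels))  # weight from hidden unit j to pixel i
b = np.zeros(n_pixels)                               # per-pixel generative bias
prior_p = np.full(n_hidden, 0.2)                     # factorial prior over the binary causes

def generate(noise_sd=0.1):
    # 1. generate a code from the prior distribution
    s = (rng.random(n_hidden) < prior_p).astype(float)
    # 2. generate an ideal datavector from the code
    ideal = s @ W + b
    # 3. add Gaussian noise
    return ideal + rng.normal(0.0, noise_sd, size=n_pixels), s

datavector, code = generate()
```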
6
Learning the model
  • For each image in the training set we ought to
    consider all possible codes. This is
    exponentially expensive.

With p(c) the prior probability of code c, and E_c the prediction error of code c, the posterior probability of code c is

  p(c \mid d) = \frac{p(c)\, e^{-E_c}}{\sum_{c'} p(c')\, e^{-E_{c'}}}
7
How to beat the exponential explosion of possible
codes
  • Instead of considering each code separately, we
    could use an approximation to the true posterior
    distribution. This makes it tractable to consider
    all the codes at once.
  • Instead of computing a separate prediction error
    for each binary code, we compute the expected
    squared error given the approximate posterior
    distribution over codes
  • Then we just change the weights to minimize this
    expected squared error.

8
A factorial approximation
  • For a given datavector, assume that each code
    unit has a probability of being on, but that the
    code units are conditionally independent of each
    other.

Taking a product over all code units,

  q(c) = \prod_j q_j^{s^c_j} (1 - q_j)^{1 - s^c_j}

the factor q_j is used if code unit j is on in code vector c, and the factor (1 - q_j) otherwise.
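A minimal numerical check of this factorial form (the q values are hypothetical):

```python
import numpy as np
from itertools import product

def factorial_prob(q, s):
    """Probability of binary code vector s when each code unit j is
    independently on with probability q[j]."""
    return float(np.prod(np.where(s == 1, q, 1.0 - q)))

q = np.array([0.9, 0.1, 0.5])
# the probabilities of all 2^3 possible code vectors sum to one
total = sum(factorial_prob(q, np.array(s)) for s in product([0, 1], repeat=3))
```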
9
The expected squared prediction error
The squared error of the expected prediction, plus the additional squared error caused by the variance in the prediction:

  \langle (d_i - \hat d_i)^2 \rangle = \Big(d_i - \sum_j q_j w_{ji} - b_i\Big)^2 + \sum_j q_j (1 - q_j) w_{ji}^2

The variance term prevents the model from cheating by using the precise real-valued q values to make precise predictions.
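The identity for the expected squared error can be verified numerically: averaging over all binary codes weighted by the factorial distribution matches the closed form (all values hypothetical):

```python
import numpy as np
from itertools import product

def expected_sq_error(d, q, W, b):
    mean_pred = q @ W + b                   # expected prediction
    var_term = (q * (1.0 - q)) @ (W ** 2)   # extra error from the variance of the binary states
    return float(np.sum((d - mean_pred) ** 2 + var_term))

def brute_force(d, q, W, b):
    # average the squared error over every binary code, weighted by
    # its probability under the factorial distribution
    total = 0.0
    for bits in product([0, 1], repeat=len(q)):
        s = np.array(bits, dtype=float)
        p = np.prod(np.where(s == 1, q, 1.0 - q))
        total += p * np.sum((d - s @ W - b) ** 2)
    return total

d = np.array([1.0, -0.5])
q = np.array([0.3, 0.7])
W = np.array([[1.0, 2.0], [3.0, -1.0]])
b = np.array([0.5, 0.0])
```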
10
Approximate inference
  • We use an approximation to the posterior
    distribution over hidden configurations.
  • Assume the posterior factorizes into a product
    of distributions, one for each hidden cause.
  • If we use the approximation for learning, there
    is no guarantee that learning will increase the
    probability that the model would generate the
    observed data.
  • But maybe we can find a different and sensible
    objective function that is guaranteed to improve
    at each update.

11
A trade-off between how well the model fits the
data and the tractability of inference
With Q the approximating posterior distribution, P the true posterior distribution, θ the parameters, and d the data, the new objective function is

  F = -\log p(d \mid \theta) + \mathrm{KL}\big( Q(c \mid d) \,\|\, P(c \mid d, \theta) \big)

The first term measures how well the model fits the data; the second measures the inaccuracy of inference.
  • This makes it feasible to fit models that are
    so complicated that we cannot figure out how the
    model would generate the data, even if we know
    the parameters of the model.
12
Where does the approximate posterior come from?
Assume that the prior over codes also factors, so
it can be represented by generative biases.
  • We have a tractable cost function expressed in
    terms of the approximating probabilities, q.
  • So we can use the gradient of the cost function
    w.r.t. the q values to train a recognition
    network to produce good q values.

13
Two types of density model
  • Stochastic generative model using directed
    acyclic graph (e.g. Bayes Net)
  • Generation from model is easy
  • Inference can be hard
  • Learning is easy after inference
  • Energy-based models that associate an energy
    with each data vector
  • Generation from model is hard
  • Inference can be easy
  • Is learning hard?

14
A simple energy-based model
  • Connect a set of binary stochastic units together
    using symmetric connections. Define the energy of
    a binary configuration, alpha, to be

  E_\alpha = -\sum_{i<j} s^\alpha_i s^\alpha_j w_{ij}

  • The energy of a binary vector determines its
    probability via the Boltzmann distribution:

  p(\alpha) = \frac{e^{-E_\alpha}}{\sum_\beta e^{-E_\beta}}
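A brute-force sketch of the energy and the Boltzmann distribution for a tiny network (the weights are hypothetical; enumerating all states is only feasible for toy sizes):

```python
import numpy as np
from itertools import product

def energy(s, W):
    # symmetric weights, zero diagonal: E = -sum over pairs i<j of s_i s_j w_ij
    return -0.5 * float(s @ W @ s)

def boltzmann_probs(W):
    states = [np.array(s, dtype=float) for s in product([0, 1], repeat=W.shape[0])]
    unnorm = np.array([np.exp(-energy(s, W)) for s in states])
    return states, unnorm / unnorm.sum()  # divide by the partition function

W = np.array([[0.0, 2.0, -1.0],
              [2.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
states, probs = boltzmann_probs(W)
```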

15
Maximum likelihood learning is hard in
energy-based models
  • To get high probability for d we need low energy
    for d and high energy for its main rivals, r

It is easy to lower the energy of d
We need to find the serious rivals to d and raise
their energy. This seems hard.
16
Markov chain Monte Carlo
Sample rivals r with probability p(r) = e^{-E_r} / \sum_\beta e^{-E_\beta}?
  • It is easy to set up a Markov chain so that it
    finds the rivals to the data with just the right
    probability

17
A picture of the learning rule for a fully
visible Boltzmann machine
(Figure: states of the fully visible network over time, from a training vector at t = 0 through t = 1, t = 2, ... to a fantasy at t = ∞)
Start with a training vector. Then pick units at
random and update their states stochastically
using the rule

  p(s_i = 1) = \sigma\Big( \sum_j s_j w_{ij} \Big)

The maximum likelihood learning rule is then

  \Delta w_{ij} \propto \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^\infty

where the first average is over the data (t = 0) and the second over fantasies at equilibrium (t = ∞).
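Both rules can be sketched in NumPy for a tiny fully visible network (the data and sizes are hypothetical; the fantasies here come from only a few sampling steps rather than true equilibrium):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(s, W):
    # pick units in random order and update their states stochastically
    for i in rng.permutation(len(s)):
        s[i] = float(rng.random() < sigmoid(W[i] @ s - W[i, i] * s[i]))
    return s

def ml_update(data, fantasies, W, lr=0.05):
    # delta w_ij proportional to <s_i s_j> on data minus <s_i s_j> on fantasies
    pos = data.T @ data / len(data)
    neg = fantasies.T @ fantasies / len(fantasies)
    W = W + lr * (pos - neg)
    np.fill_diagonal(W, 0.0)   # no self-connections
    return W

W = np.zeros((4, 4))
data = rng.integers(0, 2, size=(20, 4)).astype(float)
fantasies = np.array([gibbs_step(rng.integers(0, 2, size=4).astype(float), W)
                      for _ in range(20)])
W = ml_update(data, fantasies, W)
```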
18
A surprising shortcut
  • Instead of taking the negative samples from the
    equilibrium distribution, use slight corruptions
    of the datavectors. Only run the Markov chain
    for a few steps.
  • Much less variance because a datavector and its
    confabulation form a matched pair.
  • Seems to be very biased, but maybe it is
    optimizing a different objective function.
  • If the model is perfect and there is an infinite
    amount of data, the confabulations will be
    equilibrium samples. So the shortcut will not
    cause learning to mess up a perfect model.

19
Intuitive motivation
  • It is silly to run the Markov chain all the way
    to equilibrium if we can get the information
    required for learning in just a few steps.
  • The way in which the model systematically
    distorts the data distribution in the first few
    steps tells us a lot about how the model is
    wrong.
  • But the model could have strong modes far from
    any data. These modes will not be sampled by
    brief Monte Carlo. Is this a problem in practice?
    Apparently not.

20
Mean field Boltzmann machines
  • Instead of using binary units with stochastic
    updates, approximate the Markov chain by using
    deterministic units with real-valued states, q,
    that represent a distribution over binary states.
  • We can then run a deterministic approximation to
    the brief Markov chain
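A deterministic mean-field step of this kind might look like the following (the weights are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_step(q, W, top_down=0.0):
    # each real-valued q_i is set to its expected value given the other
    # units' current q values: a deterministic analogue of Gibbs sampling
    return sigmoid(W @ q + top_down)

W = np.array([[0.0, 1.5],
              [1.5, 0.0]])
q = np.full(2, 0.5)
for _ in range(3):   # run the brief deterministic chain for a few steps
    q = mean_field_step(q, W)
```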

21
The hybrid model
  • We can use the same factored distribution over
    code units in a causal model and in a mean field
    Boltzmann machine that learns to model the prior
    distribution over codes.
  • The stochastic generative model is:
  • First sample a binary vector from the prior
    distribution that is specified by the lateral
    connections between code units
  • Then use this code vector to produce an ideal
    data vector
  • Then add Gaussian noise.

22
A hybrid model
(Figure: the hybrid model, with a recognition model mapping the data to the code units. The cost contains the expected energy of the code under the lateral connections minus the entropy of the approximating distribution; the partition function is independent of the causal model.)
23
The learning procedure
  • Do a forward pass through the recognition model
    to compute q values for the code units
  • Use the q values to compute top-down predictions
    of the data and use the expected prediction
    errors to compute
  • derivatives for the generative weights
  • likelihood derivatives for the q values
  • Run the code units for a few steps ignoring the
    data to get the q- values. Use these q- values to
    compute
  • The derivatives for the lateral weights.
  • The derivatives for the q values that come from
    the prior.
  • Combine the likelihood and prior derivatives of
    the q values and backpropagate through the
    recognition net.
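The procedure above, minus the final backpropagation through the recognition net, can be sketched as follows (all weights, sizes, and the single-layer recognition model are hypothetical simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_h, n_v = 6, 12
R = rng.normal(0.0, 0.1, (n_v, n_h))   # recognition weights: data -> q values
G = rng.normal(0.0, 0.1, (n_h, n_v))   # generative weights: code -> data
L = np.zeros((n_h, n_h))               # lateral weights among the code units

def train_step(d, G, L, lr=0.01, mf_steps=3):
    # 1. forward pass through the recognition model
    q = sigmoid(d @ R)
    # 2. top-down prediction; its error drives the generative weights
    err = d - q @ G
    G = G + lr * np.outer(q, err)
    # 3. brief mean-field chain, ignoring the data, gives the q- values
    q_neg = q.copy()
    for _ in range(mf_steps):
        q_neg = sigmoid(L @ q_neg)
    # 4. lateral weights learn from the difference of pairwise statistics
    dL = np.outer(q, q) - np.outer(q_neg, q_neg)
    np.fill_diagonal(dL, 0.0)
    return G, L + lr * dL

d = rng.normal(0.0, 1.0, n_v)
G, L = train_step(d, G, L)
```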

24
Simulation by Kejie Bao
25
Generative weights of hidden units
26
(No Transcript)
27
Generative weights of hidden units
28
(No Transcript)
29
Adding more hidden layers
(Figure: a stack of hidden layers, with a recognition model between each successive pair of layers)
30
The cost function for a multilayer model
(Equation: the multilayer cost function contains, for each layer, a conditional partition function that depends on the current top-down inputs to each unit.)
31
The learning procedure for multiple hidden layers
  • The top-down inputs control the conditional
    partition function of a layer, but all the
    required derivatives can still be found using the
    differences between the q and the q- statistics.
  • The learning procedure is just the same except
    that the top-down inputs to a layer from the
    layer above must be frozen in place while each
    layer separately runs its brief Markov chain.

32
Advantages of a causal hierarchy of Markov Random
Fields
  • Allows clean-up at each stage of generation in a
    multilayer generative model. This makes it easy
    to maintain constraints.
  • The lateral connections implement a prior that
    squeezes the redundancy out of each hidden layer
    by making most possible configurations very
    unlikely. This creates a bottleneck of the
    appropriate size.
  • The causal connections between layers separate
    the partition functions so that the whole net
    does not have to settle. Each layer can settle
    separately.
  • This solves Terry's problem.

33
THE END
34
Energy-Based Models with deterministic hidden
units
  • Use multiple layers of deterministic hidden units
    with non-linear activation functions.
  • Hidden activities contribute additively to the
    global energy, E.

(Figure: layers of deterministic hidden units j and k above the data; their activities contribute energies E_j and E_k additively to the global energy E.)
35
Contrastive divergence
  • Aim is to minimize the amount by which a step
    toward equilibrium improves the data distribution.

Let P^0 be the data distribution, P^1 the distribution after one step of the Markov chain, and P^∞ the model's distribution. The contrastive divergence is

  CD = \mathrm{KL}(P^0 \,\|\, P^\infty) - \mathrm{KL}(P^1 \,\|\, P^\infty)

Minimizing CD minimizes the divergence between the data distribution and the model's distribution while maximizing the divergence between the confabulations and the model's distribution.
36
Contrastive divergence
  • Changing the parameters also changes the
    distribution of confabulations, which introduces
    awkward derivative terms; contrastive divergence
    makes the awkward terms cancel.