Title: Learning Representations
1. Learning Representations
2. Maximum likelihood
[Diagram: the world produces a stimulus s, which gives rise to neural activity r; the question is how to infer s from r.]
4. Maximum likelihood
[Diagram, neural version: the world produces a stimulus s; the generative model is a probabilistic model of neuronal firing r as a function of s; the goal is to infer s from the activity.]
[Diagram, learning version: the world is described by parameters W; the generative model is a probabilistic model of how the data D are generated given W; the goal is to infer W from the observations.]
5. Maximum likelihood
[Diagram: the world is described by parameters W; the generative model is y = f(x, W) + n, where n is noise; the observations are D = {(x_i, y_i)}; the goal is to infer W.]
6. Maximum likelihood learning
- To learn the optimal parameters W, we seek to maximize the likelihood of the data, which can be done through gradient descent or, in some special cases, analytically.
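In symbols, the objective described above can be written as follows (a standard formulation, not taken verbatim from the slides):

```latex
W^{*} \;=\; \arg\max_{W}\, p(D \mid W)
      \;=\; \arg\max_{W}\, \prod_{i} p(y_i \mid x_i, W)
      \;=\; \arg\min_{W}\, -\sum_{i} \log p(y_i \mid x_i, W).
```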
7. Maximum likelihood
[Diagram, linear case: the world is described by parameters W; the generative model is y = Wx + n, with noise n; the observations are D = {(x_i, y_i)}; the goal is to infer W.]
8. Maximum likelihood
[Diagram: the same linear model, y = Wx + n, with observations D = {(x_i, y_i)} and unknown W.]
Note that the y's are treated as corrupted data.
The likelihood is a product over all examples: p(D | W) = ∏_i p(y_i | x_i, W).
9. Maximum likelihood learning
- Minimizing the quadratic (squared-error) distance is equivalent to maximizing a Gaussian likelihood function.
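The equivalence in one line (a standard derivation, filled in here because the slide's equations did not survive extraction): with Gaussian noise of variance σ²,

```latex
p(y_i \mid x_i, W) \propto \exp\!\left(-\frac{\lVert y_i - W x_i \rVert^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(D \mid W) \;=\; \frac{1}{2\sigma^2} \sum_i \lVert y_i - W x_i \rVert^2 + \text{const},
```

so maximizing the likelihood and minimizing the summed squared error pick the same W.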
10. Maximum likelihood learning
- Analytical solution
- Gradient descent
- Delta rule
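A minimal numerical sketch contrasting the two routes for the linear-Gaussian model y = Wx + n (illustrative only; the synthetic data, sizes, and learning rate are my placeholder choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed generative model y = Wx + n
W_true = rng.normal(size=(2, 3))
X = rng.normal(size=(500, 3))
Y = X @ W_true.T + 0.1 * rng.normal(size=(500, 2))

# Analytical solution: ordinary least squares
W_analytic = np.linalg.lstsq(X, Y, rcond=None)[0].T

# Gradient descent on the squared error; each step is the delta rule:
# change the weights in proportion to the error (y - Wx) times the input x.
W = np.zeros((2, 3))
lr = 0.01
for _ in range(2000):
    err = Y - X @ W.T              # prediction errors for all examples
    W += lr * err.T @ X / len(X)   # delta-rule update, averaged over examples

print(np.allclose(W, W_analytic, atol=1e-2))  # both recover W_true (approximately)
```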
11. Maximum likelihood learning
- Example: training a two-layer network.
- Very important: you need to cross-validate (a bare-bones sketch follows).
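To make the cross-validation point concrete, here is a generic k-fold sketch; `fit_ridge` is a hypothetical stand-in for whatever model is being trained (the slides' two-layer network would slot in the same way):

```python
import numpy as np

def fit_ridge(X, Y, lam):
    # Regularized least squares; lam is the hyperparameter to be cross-validated.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def cv_error(X, Y, lam, k=5):
    # k-fold cross-validation: train on k-1 folds, test on the held-out fold.
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)
        B = fit_ridge(X[train], Y[train], lam)
        errors.append(np.mean((Y[fold] - X[fold] @ B) ** 2))
    return np.mean(errors)   # pick the lam with the smallest held-out error
```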
12. Maximum likelihood learning
- Supervised learning
- The data consist of pairs of input/output vectors (x_i, y_i).
- Assume that the data were generated by a network and then corrupted by Gaussian noise.
- Learning: adjust the parameters of your network to increase the likelihood that the data were indeed generated by your network.
- Note: if your network is nothing like the system that generated the data, you could be in trouble.
13. Maximum likelihood learning
- Unsupervised learning
- The data consist of input vectors x_i only.
- Causal models assume that the data are due to some hidden causes plus noise. This is the generative model.
- Goal of learning: given a set of observations, find the parameters of the generative model.
- As usual, we will find the parameters by maximizing the likelihood of the observations.
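In symbols (a standard way of writing what the slide describes, with y the hidden causes, x the observations, and w the parameters):

```latex
w^{*} \;=\; \arg\max_{w} \prod_{i} p(x_i \mid w),
\qquad
p(x \mid w) \;=\; \int p(x \mid y, w)\, p(y \mid w)\, dy .
```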
14. Maximum likelihood learning
- Example: unsupervised learning in a two-layer network.
[Diagram: causes in the top layer connected to sensory stimuli in the bottom layer by the generative model. The network represents the joint distribution.]
15. Maximum likelihood learning
- Wait! The network is upside down! Aren't we doing things the wrong way around?
- No: the idea is that what is responsible for the sensory stimuli are high-order causes, like the presence of physical objects in the world, their identity, their location, their color, and so on.
- The generative model goes from the causes/objects to the sensory stimuli.
- Recognition will go from stimuli to objects/causes.
16. Maximum likelihood learning
- The network represents the joint distribution P(x, y). Given the joint distribution, inferences are easy. We use Bayes' rule to compute P(y|x) (recognition) or P(x|y) (expectations).
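For reference, the recognition computation written out (standard Bayes' rule, not transcribed from the slide):

```latex
P(y \mid x) \;=\; \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')} .
```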
17. Mixture of Gaussians
[Scatter plot of data points in the (x1, x2) plane.]
18. Maximum likelihood learning
- Example: mixture of Gaussians.
- Generative model: 5 parameters, the mixing probabilities p(y = 1) and p(y = 2), the two cluster means, and the variance σ².
[Diagram: the cause y (cluster label) at the top, the sensory stimulus x at the bottom.]
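In equation form (a standard mixture-of-Gaussians density consistent with those parameters, written out here because the slide's equation is missing):

```latex
p(x) \;=\; \sum_{k=1}^{2} p(y = k)\, \mathcal{N}\!\left(x;\, \mu_k,\, \sigma^2 I\right).
```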
19. Maximum likelihood learning
- Example: mixture of Gaussians.
- Recognition model: given a data point x = (x1, x2), which cluster did it come from?
[Diagram: the cause y (cluster label) at the top, the sensory stimulus x at the bottom.]
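The recognition step is again Bayes' rule (standard form, filled in here):

```latex
p(y = k \mid x) \;=\; \frac{p(y = k)\, \mathcal{N}(x;\, \mu_k, \sigma^2 I)}{\sum_{j} p(y = j)\, \mathcal{N}(x;\, \mu_j, \sigma^2 I)} .
```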
20. Maximum likelihood learning
- Example: unsupervised learning in a two-layer network.
[Diagram: causes connected to sensory stimuli by the generative model.]
21. Maximum likelihood learning
[Diagram: recognition runs in the opposite direction, from the sensory stimuli up to the causes.]
22. Maximum likelihood learning
- Learning consists in adjusting the weights to maximize the likelihood of the data.
- Problem: the generative model specifies the distribution over causes and stimuli, but the data set does not specify the hidden causes y_i, which we would need for a learning rule like the delta rule.
23. Maximum likelihood learning
- Fix 1. You don't know y_i? Estimate it! Pick the MAP estimate of y_i on each trial (Olshausen and Field, as we will see later).
- Note that the weights are presumably incorrect (otherwise, there would be no need for learning). As a result, the y's obtained this way are also incorrect. Hopefully, they're good enough.
- Main problem: this breaks down if p(y_i | x_i, w) is multimodal.
24. Maximum likelihood learning
- Fix 2. Sample y_i from P(y | x_i, w) using Gibbs sampling.
- Slow, and again we're sampling from the wrong distribution. However, this is a much better idea for multimodal distributions.
25. Maximum likelihood learning
- Fix 3. Marginalization.
- Use gradient descent to adjust the parameters of the likelihood and prior (very slow).
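The quantity being differentiated is the log marginal likelihood; a standard identity (added here since the slide's equation is missing) shows why this couples the parameters to the posterior over causes:

```latex
\frac{\partial}{\partial w} \log p(x \mid w)
\;=\; \mathbb{E}_{p(y \mid x, w)}\!\left[ \frac{\partial}{\partial w} \log p(x, y \mid w) \right].
```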
26. Maximum likelihood learning
- Fix 3. Marginalization: gradient descent for a mixture of two Gaussians.
Even when the p(x|y)'s are Gaussian, the resulting likelihood function is not quadratic.
27. Maximum likelihood learning
- Fix 3. Marginalization.
- Rarely feasible in practice.
If y is a binary vector of dimension N, the sum over y contains 2^N terms.
28. Maximum likelihood learning
- Fix 4. The expectation-maximization (EM) algorithm.
29. Maximum likelihood learning
- EM: how can we optimize p(y|w)? Let g1 be p(y = 1|w).
- The update for g1 averages the true posterior p_true(y = 1|x) over the data distribution: we have samples of the data, but we don't know the true posterior. Trick: use an approximation.
30. Maximum likelihood learning
- E-step: use the current parameters to approximate p_true(y|x) with p(y|x, w).
31. Maximum likelihood learning
[Equation: the update for g1 averages the approximate posteriors from the E step over the data, g1 ← (1/K) Σ_k p(y = 1 | x_k, w).]
32. Maximum likelihood learning
- EM M step. For the mean of p(x|y = 1), use the posterior-weighted average of the data, μ1 ← Σ_k p(y = 1|x_k, w) x_k / Σ_k p(y = 1|x_k, w), where the posteriors come from the E step.
33. Maximum likelihood learning
- EM: iterate the E and M steps. Guaranteed to converge, but possibly to a local optimum.
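A compact numerical sketch of these E and M steps for a mixture of two one-dimensional Gaussians with a shared variance (illustrative; the synthetic data and initial values are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn from two Gaussian clusters
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for the parameters: g = p(y=1), the two means, shared variance
g, mu1, mu2, var = 0.5, -1.0, 1.0, 1.0

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E step: posterior probability that each point came from cluster 1
    p1 = g * gauss(x, mu1, var)
    p2 = (1 - g) * gauss(x, mu2, var)
    r = p1 / (p1 + p2)
    # M step: re-estimate every parameter with posterior-weighted averages
    g = r.mean()
    mu1 = (r * x).sum() / r.sum()
    mu2 = ((1 - r) * x).sum() / (1 - r).sum()
    var = (r * (x - mu1) ** 2 + (1 - r) * (x - mu2) ** 2).sum() / len(x)

print(g, mu1, mu2, var)  # approaches 0.3, -2, 3, 1 (up to cluster relabeling)
```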
34. Maximum likelihood learning
- Fix 5. Model the recognition distribution and use EM for training: the wake-sleep algorithm (Helmholtz machine).
35. Maximum likelihood learning
[Diagram: causes y at the top, sensory stimuli x at the bottom; the generative model P(x|y, w) maps causes to stimuli, and the recognition model Q(y|x, v) maps stimuli back to causes.]
36. Maximum likelihood learning
- Fix 5. Model the recognition distribution and use EM for training: the wake-sleep algorithm (Helmholtz machine).
- (Wake) M step: use the x's to generate y according to Q(y|x, v), and adjust the w in P(x|y, w).
38. Maximum likelihood learning
- Fix 5. Model the recognition distribution and use EM for training: the wake-sleep algorithm (Helmholtz machine).
- (Wake) M step: use the x's to generate y according to Q(y|x, v), and adjust the w in P(x|y, w).
- (Sleep) E step: generate y with P(y|w), use it to generate x according to P(x|y, w), then adjust the v in Q(y|x, v).
40. Maximum likelihood learning
- Fix 5. Model the recognition distribution and use EM for training: the wake-sleep algorithm (Helmholtz machine).
- Advantage: after several approximations, you can get both learning rules to look like the delta rule (see the sketch below).
- Usable for hierarchical architectures.
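A toy sketch of the wake and sleep phases for a one-hidden-layer Helmholtz machine with binary units; this is my own minimal rendering of the scheme above, and the layer sizes, learning rate, and random "stimuli" are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny, lr = 16, 4, 0.05

# Generative model: prior bias over causes y, weights from y down to stimuli x
b_gen = np.zeros(ny)
W_gen = 0.1 * rng.normal(size=(nx, ny))
# Recognition model: weights from x up to y
W_rec = 0.1 * rng.normal(size=(ny, nx))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(x):
    global b_gen, W_gen, W_rec
    # Wake (M step): recognize causes for a real stimulus, train the generative model
    y = sample(sigmoid(W_rec @ x))
    x_pred = sigmoid(W_gen @ y)
    W_gen += lr * np.outer(x - x_pred, y)   # delta rule on the reconstruction error
    b_gen += lr * (y - sigmoid(b_gen))      # delta rule on the prior over causes
    # Sleep (E step): dream a (y, x) pair from the generative model, train recognition
    y_dream = sample(sigmoid(b_gen))
    x_dream = sample(sigmoid(W_gen @ y_dream))
    y_pred = sigmoid(W_rec @ x_dream)
    W_rec += lr * np.outer(y_dream - y_pred, x_dream)  # delta rule on cause prediction

# Toy binary "sensory stimuli" as a stand-in for real data
for x in sample(np.full((1000, nx), 0.3)):
    wake_sleep_step(x)
```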
41. Sparse representations in V1
- Example (Olshausen and Field): a natural image is generated according to a two-step process.
- A set of coefficients a_i is drawn according to a sparse prior (Cauchy or related).
- The image is the result of combining a set of basis functions weighted by the coefficients a_i and corrupted by Gaussian noise.
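In equation form (the standard Olshausen and Field generative model implied by these bullets):

```latex
I(x, y) \;=\; \sum_i a_i\, \phi_i(x, y) + n(x, y),
\qquad a_i \sim p_{\text{sparse}}(a), \quad n \sim \mathcal{N}(0, \sigma^2).
```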
42. Sparse representations in V1
[Diagram: the coefficients a_i project onto the image plane (x, y) through the generative weights, i.e. the basis functions φ_i(x, y).]
43. Sparse representations in V1
- The sparse prior favors solutions with most coefficients set to zero and a few with a high value.
- Why a sparse prior? Because the responses of neurons to natural images are non-Gaussian and tend to be sparse.
44. Sparse representations in V1
45. Sparse representations in V1
- The generative model is a model of the joint distribution P(I, a | {φ_i}).
46. Sparse representations in V1
- Learning
- Given a set of natural images, how do you learn the basis functions?
- Answer: find the basis functions maximizing the likelihood of the images, P(I_k | {φ_i}). Sure, but where do you get the a's?
- Olshausen and Field: for each image, pick the a's maximizing the posterior over a, P(a | I_k, {φ_i}) (Fix 1); see the sketch below.
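A toy sketch of this alternation: MAP inference of the coefficients for each patch, then a gradient step on the basis functions. The patch size, learning rates, and the Cauchy-style sparsity penalty are my placeholder choices, not values from Olshausen and Field:

```python
import numpy as np

rng = np.random.default_rng(3)
npix, nbasis = 64, 32                 # 8x8 patches, 32 basis functions
Phi = rng.normal(size=(npix, nbasis))
Phi /= np.linalg.norm(Phi, axis=0)

def map_coefficients(I, Phi, lam=0.1, steps=200, lr=0.05):
    # Fix 1: gradient ascent on log P(a | I, Phi) with a Cauchy-like sparse prior
    a = np.zeros(Phi.shape[1])
    for _ in range(steps):
        residual = I - Phi @ a
        a += lr * (Phi.T @ residual - lam * 2 * a / (1 + a ** 2))
    return a

def learning_step(patches, Phi, lr=0.01):
    # Adjust the basis functions to increase the likelihood of the patches
    dPhi = np.zeros_like(Phi)
    for I in patches:
        a = map_coefficients(I, Phi)
        dPhi += np.outer(I - Phi @ a, a)       # residual times coefficients
    Phi = Phi + lr * dPhi / len(patches)
    return Phi / np.linalg.norm(Phi, axis=0)   # keep the basis functions normalized

# Random stand-ins for whitened natural image patches
patches = rng.normal(size=(100, npix))
for _ in range(10):
    Phi = learning_step(patches, Phi)
```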
47. Network implementation
48. Network implementation
[Diagram: the image plane (x, y) projects onto the coefficients a_i through recognition weights.]
49. Sparse representations in V1
- The sparse prior favors patterns of activity for
which most neurons are silent and a few are
highly active.
50. Projective fields
51. Sparse representations in V1
- The true receptive fields are input-dependent (because of the lateral interactions) in a way that seems somewhat consistent with experimental data.
[Figure panels: receptive fields measured with dots vs. with gratings.]
52. Infomax idea
- Represent the world in a format that maximizes mutual information given the limited information capacity of neurons.
- Is this simply about packing bits into neuronal firing?
- What if the code is undecipherable?
53. Information theory and learning
- The features extracted by infomax algorithms are often meaningful because high-level features are often good for compression.
- Example of scanned text: a page of text can be dramatically compressed if one treats it as a sequence of characters rather than as pixels (e.g., this page: 800×700×8 bits as pixels vs. 200×8 bits as characters, roughly a 2,800-fold compression).
- General idea of unsupervised learning: compress the image and hope to discover a high-order description of the image.
54. Information theory and learning
- Example: decorrelation in the retina leads to center-surround receptive fields.
- Example: ICA (a factorial code) leads to oriented receptive fields.
- Problem: what can you do beyond ICA?
- How can you extract features that simplify computation?
- We need other constraints.
55. Sparse Coding
- Example: sparseness. Why sparseness?
- Grandmother cells: very easy to decode and very easy to use for further computation.
- Sparse codes are non-Gaussian, which often corresponds to high-level features (because it goes against the law of large numbers).
56. Learning Representations
- The main challenges for the future
- Representing hierarchical structure
- Learning hierarchical structure