1
Learning Representations
2
Maximum likelihood
[Diagram: a world state s gives rise to neural activity r; decoding recovers an estimate ŝ of s.]
3
Maximum likelihood
(Same diagram as the previous slide.)
4
Maximum likelihood
Generative Model
[Diagram, left: a world state s produces neural activity r through a probabilistic model of neuronal firing as a function of s; decoding recovers an estimate ŝ. Right: world parameters W produce observed data D through a probabilistic model of how the data are generated given W; learning recovers an estimate Ŵ.]
5
Maximum likelihood
Generative Model
[Diagram: world parameters W generate observations D = {xi, yi} according to y = f(x, W) + n, where n is noise; learning recovers an estimate Ŵ.]
6
Maximum likelihood learning
  • To learn the optimal parameters W, we seek to
    maximize the likelihood of the data, which can be
    done through gradient descent or, in some special
    cases, analytically.

7
Maximum likelihood
Generative Model
[Diagram: world parameters W generate observations D = {xi, yi} according to the linear model y = Wx + n; learning recovers an estimate Ŵ.]
8
Maximum likelihood
Generative Model
[Diagram: the same linear generative model y = Wx + n with data D = {xi, yi}. Note that the y's are treated as corrupted data; the likelihood is a product over all examples.]
9
Maximum likelihood learning
  • Minimizing quadratic distance is equivalent to
    maximizing a Gaussian likelihood function.
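In symbols (a standard identity, written out here because the slide's equation image is not in the transcript): for Gaussian noise with variance sigma^2,

```latex
-\log P(D \mid W) \;=\; \frac{1}{2\sigma^2} \sum_i \bigl(y_i - f(x_i, W)\bigr)^2 \;+\; \text{const},
```

so maximizing the Gaussian likelihood is the same as minimizing the summed squared error.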

10
Maximum likelihood learning
  • Analytical Solution
  • Gradient descent
  • Delta rule
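A minimal sketch of these routes for the linear-Gaussian model y = Wx + n: the analytical (least-squares) solution and the delta rule (online gradient descent on squared error). The data, sizes, and learning rate below are made up for illustration, not taken from the slides.

```python
# Maximum-likelihood fitting of y = W x + noise, analytically and with the delta rule.
import numpy as np

rng = np.random.default_rng(0)
W_true = np.array([[2.0, -1.0]])              # 1 output, 2 inputs (made-up ground truth)
X = rng.standard_normal((500, 2))
Y = X @ W_true.T + 0.1 * rng.standard_normal((500, 1))

# Analytical solution: W = (X^T X)^{-1} X^T Y, transposed to match y = W x
W_analytic = np.linalg.solve(X.T @ X, X.T @ Y).T

# Delta rule: W <- W + lr * (y - W x) x^T, one example at a time
W = np.zeros((1, 2))
lr = 0.01
for x, y in zip(X, Y):
    err = y - W @ x                           # prediction error for this example
    W += lr * np.outer(err, x)

print(W_analytic, W, sep="\n")
```

Both estimates should land close to W_true; the delta rule is simply a stochastic way of descending the same squared-error surface.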

11
Maximum likelihood learning
  • Example: training a two-layer network

Very important: you need to cross-validate (see the sketch below).
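A minimal sketch of that point: train a small two-layer network on part of the data and track the error on a held-out set, keeping the parameters that generalize best. The architecture, synthetic data, and learning rate are made up for illustration; this is not the presenter's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = sin(3x) + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X) + 0.1 * rng.standard_normal(X.shape)

# Hold out a validation set
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

H = 20                                    # hidden units
W1 = 0.5 * rng.standard_normal((1, H))    # input -> hidden weights
W2 = 0.5 * rng.standard_normal((H, 1))    # hidden -> output weights
lr = 0.05

def forward(X, W1, W2):
    h = np.tanh(X @ W1)                   # hidden activations
    return h, h @ W2                      # network prediction

best_va, best_params = np.inf, (W1.copy(), W2.copy())
for epoch in range(2000):
    h, pred = forward(X_tr, W1, W2)
    err = pred - y_tr                     # gradient of the squared error
    W2 -= lr * h.T @ err / len(X_tr)
    W1 -= lr * X_tr.T @ ((err @ W2.T) * (1 - h**2)) / len(X_tr)

    _, pred_va = forward(X_va, W1, W2)
    va = np.mean((pred_va - y_va) ** 2)   # validation error
    if va < best_va:                      # keep the best-generalizing parameters
        best_va, best_params = va, (W1.copy(), W2.copy())

print("best validation MSE:", best_va)
```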
12
Maximum likelihood learning
  • Supervised learning
  • The data consist of pairs of input/output
    vectors {xi, yi}.
  • Assume that the data were generated by a network
    and then corrupted by Gaussian noise.
  • Learning: adjust the parameters of your network to
    increase the likelihood that the data were indeed
    generated by your network.
  • Note: if your network is nothing like the system
    that generated the data, you could be in trouble.

13
Maximum likelihood learning
  • Unsupervised learning
  • The data consist of input vectors only, {xi}.
  • Causal models assume that the data are due to
    some hidden causes plus noise. This is the
    generative model.
  • Goal of learning: given a set of observations,
    find the parameters of the generative model.
  • As usual, we will find the parameters by
    maximizing the likelihood of the observations.

14
Maximum likelihood learning
  • Example: unsupervised learning in a two-layer
    network

[Diagram: causes y (top layer) generate sensory stimuli x (bottom layer); the network represents the joint distribution P(x, y).]
15
Maximum likelihood learning
  • Wait! The network is upside down! Aren't we
    doing things the wrong way around?
  • No: the idea is that what's responsible for the
    sensory stimuli are high-order causes, like the
    presence of physical objects in the world, their
    identity, their location, their color and so on.
  • The generative model goes from the causes/objects to
    the sensory stimuli.
  • Recognition will go from stimuli to objects/causes.

16
Maximum likelihood learning
  • The network represents the joint distribution
    P(x, y). Given the joint distribution, inferences
    are easy. We use Bayes' rule to compute P(y|x)
    (recognition) or P(x|y) (expectations).
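Written out (standard Bayes' rule; the slide's own equation image is not in the transcript):

```latex
P(y \mid x) \;=\; \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}.
```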

17
Mixture of Gaussians
[Scatter plot of data points in the (x1, x2) plane.]
18
Maximum likelihood learning
  • Example: mixture of Gaussians

[Diagram: the generative model has 5 parameters: p(y=1), p(y=2), the two means, and the variance σ². Causes y generate the sensory stimuli x.]
19
Maximum likelihood learning
  • Example: mixture of Gaussians

Recognition model: given a data point x, what was
the cluster y? (A sketch follows.)

[Diagram: causes y (clusters) generate the sensory stimuli x; recognition runs in the reverse direction.]
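A minimal sketch of this example: sample (y, x) pairs from the mixture-of-Gaussians generative model, then do recognition with Bayes' rule. The particular parameter values (0.3/0.7 mixing, means at -2 and +2, unit variance) are made up, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
p_y = np.array([0.3, 0.7])            # p(y=1), p(y=2)
mu = np.array([-2.0, 2.0])            # means of p(x | y)
sigma = 1.0                           # shared standard deviation

# Generative model: pick a cluster, then emit x with Gaussian noise around its mean
y = rng.choice(2, size=1000, p=p_y)
x = mu[y] + sigma * rng.standard_normal(1000)

def posterior(x):
    """Recognition: P(y | x) via Bayes' rule."""
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma**2))   # unnormalized p(x | y)
    joint = lik * p_y                                         # p(x | y) p(y)
    return joint / joint.sum(axis=1, keepdims=True)

print(posterior(np.array([-1.0, 0.0, 3.0])))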
20
Maximum likelihood learning
  • Example: unsupervised learning in a two-layer
    network

[Diagram: causes y generate sensory stimuli x through the generative model, as on slide 14.]
21
Maximum likelihood learning
[Diagram: recognition maps the sensory stimuli x back to the causes y.]
22
Maximum likelihood learning
  • Learning consists of adjusting the weights to
    maximize the likelihood of the data.
  • Problem: the generative model specifies the joint
    distribution over causes and data.
  • The data set does not specify the hidden causes
    yi, which we would need for a learning rule like
    the delta rule.

23
Maximum likelihood learning
  • Fix 1. You don't know yi? Estimate it! Pick the
    MAP estimate of yi on each trial (Olshausen and
    Field, as we will see later).
  • Note that the weights are presumably incorrect
    (otherwise, there would be no need for learning).
    As a result, the y's obtained this way are also
    incorrect. Hopefully, they're good enough.
  • Main problem: this breaks down if p(yi|xi, w) is
    multimodal.

24
Maximum likelihood learning
  • Fix 2. Sample yi from P(y|xi, w) using Gibbs
    sampling (a sketch follows).
  • Slow, and again we're sampling from the wrong
    distribution. However, this is a much better idea
    for multimodal distributions.
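A minimal sketch of Fix 2 in the two-Gaussian mixture setting, where sampling each hidden cause from its conditional is a single categorical draw (the simplest possible sampling sweep): alternate between sampling the yi and re-estimating the parameters from the sampled causes. Not the presenter's code; the data and initial values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])  # observations

mu = np.array([-1.0, 1.0])      # initial means
p_y = np.array([0.5, 0.5])      # initial mixing proportions
sigma = 1.0                     # known, for simplicity

for sweep in range(50):
    # Sample hidden causes from their conditional given the current parameters
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma**2)) * p_y
    post = lik / lik.sum(axis=1, keepdims=True)
    y = (rng.random(len(x)) < post[:, 1]).astype(int)   # sample each y_i in {0, 1}

    # Re-estimate parameters from the sampled (x_i, y_i) pairs
    for k in (0, 1):
        if np.any(y == k):
            mu[k] = x[y == k].mean()
    p_y = np.array([(y == 0).mean(), (y == 1).mean()])

print(mu, p_y)
```

In richer models the hidden variables are not conditionally independent and a true Gibbs sweep (one variable at a time, others fixed) is needed, which is where the "slow" caveat comes from.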

25
Maximum likelihood learning
  • Fix 3. Marginalization
  • Use gradient descent to adjust the parameters of
    the likelihood and prior (very slow).

26
Maximum likelihood learning
  • Fix 3. Marginalization: gradient descent for a
    mixture of two Gaussians.

Even when the p(x|y)'s are Gaussian, the resulting
likelihood function is not quadratic.
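For reference (standard form; the slide's equation image is not in the transcript), the marginal likelihood of a single point under the two-Gaussian mixture is

```latex
p(x \mid w) \;=\; p(y{=}1)\,\mathcal{N}(x;\,\mu_1,\sigma^2) \;+\; p(y{=}2)\,\mathcal{N}(x;\,\mu_2,\sigma^2),
```

and its logarithm is not quadratic in the parameters, hence the need for gradient descent.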
27
Maximum likelihood learning
  • Fix 3. Marginalization
  • Rarely feasible in practice

If y is a binary vector of dimension N, the sum over
the hidden causes contains 2^N terms.
28
Maximum likelihood learning
  • Fix 4. The expectation-maximization algorithm
    (EM)

29
Maximum likelihood learning
  • EM: How can we optimize p(y|w)? Let g1 be
    p(y=1|w).

[Equation omitted from the transcript. Annotations: "we have samples of this"; "but we don't know this; trick: use an approximation".]
30
Maximum likelihood learning
  • E-step: use the current parameters to approximate
    p_true(y|x) with p(y|x, w).

31
Maximum likelihood learning
  • EM, M-step: optimize p(y), using the posteriors
    from the E-step.
32
Maximum likelihood learning
  • EM, M-step: for the mean of p(x|y=1), use the
    average of the data weighted by the posteriors
    from the E-step.
33
Maximum likelihood learning
  • EM: iterate the E and M steps. Guaranteed to
    converge, but possibly only to a local optimum.
    (A sketch of the full loop for the two-Gaussian
    case follows.)
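A minimal sketch of EM for the two-Gaussian mixture, using the standard updates rather than code from the slides; the data and initialization are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

mu = np.array([-1.0, 1.0])      # initial means
p_y = np.array([0.5, 0.5])      # initial mixing proportions
sigma = 1.0                     # known, for simplicity

for it in range(100):
    # E-step: posterior responsibilities under the current parameters
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma**2)) * p_y
    r = lik / lik.sum(axis=1, keepdims=True)          # shape (N, 2)

    # M-step: responsibility-weighted re-estimates
    Nk = r.sum(axis=0)
    p_y = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk

print(mu, p_y)
```

Unlike Fix 2, nothing is sampled: the soft responsibilities replace the unknown causes, which is what guarantees the monotone improvement of the likelihood.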

34
Maximum likelihood learning
  • Fix 5. Model the recognition distribution and
    use EM for training. Wake-Sleep algorithm
    (Helmholtz machine).

35
Maximum likelihood learning
  • Helmholtz machine

[Diagram: causes y and sensory stimuli x; the generative model P(x|y, w) maps causes to stimuli, and the recognition model Q(y|x, v) maps stimuli to causes.]
36
Maximum likelihood learning
  • Fix 5. Model the recognition distribution and
    use EM for training. Wake-Sleep algorithm
    (Helmholtz machine).
  • (Wake) M-step: use the x's to generate y according to
    Q(y|x, v), and adjust the w in P(x|y, w).

37
Maximum likelihood learning
  • Helmholtz machine

[Same diagram as slide 35: generative model P(x|y, w), recognition model Q(y|x, v).]
38
Maximum likelihood learning
  • Fix 5. Model the recognition distribution and
    use EM for training. Wake-Sleep algorithm
    (Helmholtz machine).
  • (Wake) M-step: use the x's to generate y according to
    Q(y|x, v), and adjust the w in P(x|y, w).
  • (Sleep) E-step: generate y with P(y|w), and use
    it to generate x according to P(x|y, w). Then
    adjust the v in Q(y|x, v).

39
Maximum likelihood learning
  • Helmholtz machine

[Same diagram as slide 35: generative model P(x|y, w), recognition model Q(y|x, v).]
40
Maximum likelihood learning
  • Fix 5. Model the recognition distribution and
    use EM for training. Wake-Sleep algorithm
    (Helmholtz machine).
  • Advantage: after several approximations, you can
    get both learning rules to look like the delta
    rule (see the sketch below).
  • Usable for hierarchical architectures.
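A minimal sketch of wake-sleep updates for a one-hidden-layer Helmholtz machine with binary units. This is a generic simplification, not the presenter's implementation; the layer sizes, learning rate, and binary data below are made up. Both phases reduce to delta-rule-like updates, which is the advantage noted above.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_x, n_y, lr = 16, 4, 0.05
V, bV = 0.1 * rng.standard_normal((n_x, n_y)), np.zeros(n_y)   # recognition Q(y|x,v)
W, bW = 0.1 * rng.standard_normal((n_y, n_x)), np.zeros(n_x)   # generative P(x|y,w)
bY = np.zeros(n_y)                                             # generative prior P(y|w)

data = (rng.random((500, n_x)) < 0.2).astype(float)            # made-up binary stimuli

for x in data:
    # Wake phase (M-step-like): recognize y from x, then train the generative
    # weights with a delta rule so that P(x|y, w) reproduces x.
    y = (rng.random(n_y) < sigmoid(x @ V + bV)).astype(float)
    x_pred = sigmoid(y @ W + bW)
    W += lr * np.outer(y, x - x_pred)
    bW += lr * (x - x_pred)
    bY += lr * (y - sigmoid(bY))

    # Sleep phase (E-step-like): dream (y, x) from the generative model, then train
    # the recognition weights with a delta rule so that Q(y|x, v) recovers y.
    y_dream = (rng.random(n_y) < sigmoid(bY)).astype(float)
    x_dream = (rng.random(n_x) < sigmoid(y_dream @ W + bW)).astype(float)
    y_pred = sigmoid(x_dream @ V + bV)
    V += lr * np.outer(x_dream, y_dream - y_pred)
    bV += lr * (y_dream - y_pred)
```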

41
Sparse representations in V1
  • Ex: Olshausen and Field. A natural image is
    generated according to a two-step process:
  • a set of coefficients {ai} is drawn according to
    a sparse prior (Cauchy or related);
  • the image is the result of combining a set of
    basis functions weighted by the coefficients ai
    and corrupted by Gaussian noise.

42
Sparse representations in V1
  • Network representation

[Diagram: coefficients ai at the top are combined through the generative weights, the basis functions φi(x, y), to produce the image pixels below.]
43
Sparse representations in V1
  • The sparse prior favors solutions with most
    coefficients set to zero and a few with a high
    value.
  • Why a sparse prior? Because the responses of
    neurons to natural images are non-Gaussian and
    tend to be sparse.

44
Sparse representations in V1
  • The likelihood function
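The equation itself is an image missing from the transcript; the standard Olshausen and Field form, stated here as a reconstruction, is

```latex
P(I \mid \{a_i\}, \{\phi_i\}) \;\propto\; \exp\!\Bigl(-\frac{1}{2\sigma^2}\,\bigl\| I(x,y) - \sum_i a_i\,\phi_i(x,y) \bigr\|^2\Bigr),
\qquad
P(\{a_i\}) \;\propto\; \prod_i e^{-S(a_i)},
```

with S a sparsity-inducing penalty derived from the Cauchy (or related) prior.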

45
Sparse representations in V1
  • The generative model is a model of the joint
    distribution P(I, a | {φi}).

46
Sparse representations in V1
  • Learning
  • Given a set of natural images, how do you learn
    the basis functions?
  • Answer: find the basis functions maximizing the
    likelihood of the images, P(Ik | {φi}). Sure, but
    where do you get the a's?
  • Olshausen and Field: for each image, pick the
    a's maximizing the posterior over a, P(a | Ik, {φi})
    (Fix 1). A sketch of the loop follows.
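A minimal sketch of that learning loop: for each patch, infer the MAP coefficients by gradient ascent on the log posterior with a smooth sparse penalty, then nudge the basis functions along the likelihood gradient. The random "patches", the penalty log(1 + a^2), and the step sizes are illustrative choices, not the authors' exact settings; real natural-image patches are needed for Gabor-like basis functions to emerge.

```python
import numpy as np

rng = np.random.default_rng(4)
n_pix, n_basis = 64, 32                      # 8x8 patches, 32 basis functions
phi = rng.standard_normal((n_pix, n_basis))
phi /= np.linalg.norm(phi, axis=0)           # keep basis functions normalized

lam, lr_a, lr_phi = 0.2, 0.1, 0.01

def map_coefficients(I, phi, n_steps=100):
    """Gradient ascent on log P(a | I, phi) with the smooth sparse penalty log(1 + a^2)."""
    a = np.zeros(phi.shape[1])
    for _ in range(n_steps):
        resid = I - phi @ a                          # reconstruction error
        grad = phi.T @ resid - lam * 2 * a / (1 + a**2)
        a += lr_a * grad
    return a

for step in range(1000):
    I = rng.standard_normal(n_pix)                   # stand-in for a natural image patch
    a = map_coefficients(I, phi)                     # Fix 1: MAP estimate of the causes
    resid = I - phi @ a
    phi += lr_phi * np.outer(resid, a)               # Hebbian-like likelihood gradient
    phi /= np.linalg.norm(phi, axis=0)               # renormalize to prevent blow-up

print(phi.shape)
```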

47
Network implementation
48
Network implementation
[Diagram: the image pixels drive the coefficient units ai through recognition weights.]
49
Sparse representations in V1
  • The sparse prior favors patterns of activity for
    which most neurons are silent and a few are
    highly active.

50
Projective fields
51
Sparse representations in V1
  • The true receptive fields are input dependent
    (because of the lateral interactions), in a way
    that seems somewhat consistent with experimental
    data.

[Figure: receptive fields estimated with dots vs. with gratings.]
52
Infomax idea
  • Represent the world in a format that maximizes
    mutual information given the limited information
    capacity of neurons.
  • Is this simply about packing bits in neuronal
    firing?
  • What if the code is undecipherable?

53
Information theory and learning
  • The features extracted by infomax algorithms are
    often meaningful because high-level features are
    often good for compression.
  • Example of scanned text: a page of text can be
    dramatically compressed if one treats it as a
    sequence of characters as opposed to pixels (e.g.
    this page: 800x700 pixels x 8 bits vs. 200
    characters x 8 bits, a 2800x compression factor).
  • General idea of unsupervised learning: compress
    the image and hope to discover a high-order
    description of the image.

54
Information theory and learning
  • Ex: Decorrelation in the retina (sketched below)
    leads to center-surround receptive fields.
  • Ex: ICA (factorial code) leads to oriented
    receptive fields.
  • Problem: what can you do beyond ICA?
  • How can you extract features that simplify
    computation?
  • We need other constraints.
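A minimal sketch of the decorrelation step: whiten image patches with the inverse square root of their covariance (ZCA whitening). The random patches below are stand-ins; real natural-image patches are needed for the center-surround structure of the resulting filters to appear.

```python
import numpy as np

rng = np.random.default_rng(5)
patches = rng.standard_normal((5000, 64))          # stand-in for 8x8 image patches

X = patches - patches.mean(axis=0)                 # remove the mean
C = X.T @ X / len(X)                               # covariance matrix
evals, evecs = np.linalg.eigh(C)                   # eigendecomposition of C
W_white = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T   # C^{-1/2} (ZCA)

Z = X @ W_white                                    # decorrelated (whitened) patches
print(np.allclose(Z.T @ Z / len(Z), np.eye(64), atol=1e-2))
```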

55
Sparse Coding
  • Ex: sparseness. Why sparseness?
  • Grandmother cells: very easy to decode and very
    easy to use for further computation.
  • Sparse codes are non-Gaussian, which often
    corresponds to high-level features (because mixing
    many independent causes pushes distributions toward
    Gaussian, by the central limit theorem).

56
Learning Representations
  • The main challenges for the future
  • Representing hierarchical structure
  • Learning hierarchical structure