1
Connectionist Computing CS4018
  • Gianluca Pollastri
  • office: CS A1.07
  • email: gianluca.pollastri@ucd.ie

2
Credits
  • Geoffrey Hinton, University of Toronto.
  • borrowed some of his slides from his Neural
    Networks and Computation in Neural Networks
    courses.
  • Ronan Reilly, NUI Maynooth.
  • slides from his CS4018.
  • Paolo Frasconi, University of Florence.
  • slides from his tutorial on Machine Learning for
    structured domains.

3
Lecture notes
  • http://gruyere.ucd.ie/2007_courses/4018/
  • Strictly confidential...

4
Books
  • No book covers large fractions of this course.
  • Parts of chapters 4, 6, (7), 13 of Tom Mitchell's
    Machine Learning.
  • Parts of chapter V of MacKay's Information
    Theory, Inference, and Learning Algorithms,
    available online at
  • http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
  • Chapter 20 of Russell and Norvig's Artificial
    Intelligence: A Modern Approach, also available
    at
  • http://aima.cs.berkeley.edu/newchap20.pdf
  • More materials later...

5
Paper 2
  • Read the paper NETtalk: a parallel network that
    learns to read aloud, by Sejnowski and Rosenberg
    (1986).
  • The paper is linked from the course website.
  • Email me (gianluca.pollastri@ucd.ie) a 250-word
    MAX summary by Feb the 26th at midnight in any
    time zone of your choice.
  • Worth 5%; 1% off for each day late.
  • You are responsible for making sure I get it, etc
    etc.

6
MLP applications: matching words and sounds
  • Sejnowski and Rosenberg, NETtalk: a parallel
    network that learns to read aloud, Cognitive
    Science, 14, 179-211 (1986)
  • Teaching an MLP how to pronounce English by
    backprop.
  • The network was given a stream of words, together
    with the corresponding phonemes (see the
    input-encoding sketch below).
  • Once the network had learned, it was possible to
    make it read aloud.
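  • A minimal sketch of the kind of sliding-window,
    one-hot input encoding NETtalk uses (the paper uses
    a 7-letter window; the reduced symbol set and the
    function name here are illustrative only):

```python
import numpy as np

# Simplified alphabet: 26 letters plus space (NETtalk itself uses 29 symbols).
SYMBOLS = "abcdefghijklmnopqrstuvwxyz "
WINDOW = 7  # the network sees 7 characters at a time, centred on the target letter

def encode_window(text, centre):
    """One-hot encode the 7-character window centred on position `centre`."""
    vec = np.zeros((WINDOW, len(SYMBOLS)))
    for i in range(WINDOW):
        pos = centre - WINDOW // 2 + i
        ch = text[pos] if 0 <= pos < len(text) else " "  # pad beyond the edges
        vec[i, SYMBOLS.index(ch)] = 1.0
    return vec.ravel()  # 7 x 27 = 189 inputs in this simplified version

x = encode_window("we can read", centre=4)  # window around the 'a' in "can"
print(x.shape)  # (189,)
```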

7
MLP applications: protein secondary structure
prediction
  • Proteins are strings
  • FEFHGYARSGVIMNDSGASTKS
  • GAYITPAGETGGAIGRLGNQAD
  • TYVEMNLEHKQTLDNG
  • Structures too

8
Deep network
  • Le Cun et al.'s digit-recognition network:
    4 hidden layers.

9
Feature maps
  • Hidden layers 1 and 3 implement feature maps.
  • Layer 1: the input is the 16x16 image, with borders
    added for technical reasons -> 28x28. The output is
    composed of 4 maps of 24x24 units. This is really
    implemented with 4 neurons, each taking 5x5
    inputs, each replicated in every possible position
    on the input map.
  • This sounds complicated but is fairly easy:
    instead of a (28x28) -> (24x24x4) full connectivity,
    only 5x5 inputs are connected to each output
    unit. Moreover, there are just 4 neurons/sets
    of weights (weight sharing).
  • So, a (5x5) -> 4 full connectivity that sweeps the
    whole input. Only 104 weights including biases!
    (A minimal sketch follows.)
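  • A minimal numpy sketch of the idea (sizes taken
    from this slide; the tanh output function and the
    names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))          # 16x16 digit padded to 28x28

n_maps, k = 4, 5
W = rng.standard_normal((n_maps, k, k)) * 0.1  # 4 shared 5x5 kernels
b = np.zeros(n_maps)                           # one bias per map
print(W.size + b.size)                         # 104 weights including biases

def feature_maps(image, W, b):
    """Sweep each shared 5x5 kernel over every position of the input."""
    out = np.zeros((n_maps, 24, 24))
    for m in range(n_maps):
        for r in range(24):
            for c in range(24):
                patch = image[r:r + k, c:c + k]
                out[m, r, c] = np.tanh(np.sum(W[m] * patch) + b[m])
    return out

print(feature_maps(image, W, b).shape)         # (4, 24, 24)
```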

10
Averaging/subsampling layer
  • Hidden layers 2 and 4 implement
    averaging/subsampling stages.
  • Layer 2: 24x24x4 -> 12x12x4.
  • This is performed using 4 units, each one doing a
    2x2 -> 1 mapping. The weights are constrained to be
    all the same (see the sketch below).
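  • A sketch of the 2x2 -> 1 averaging stage, continuing
    the feature_maps example above (one shared
    coefficient and bias per map is an assumption about
    the details):

```python
import numpy as np

def subsample(maps, w, b):
    """Average/subsample: each 24x24 map -> 12x12, one shared weight per map."""
    n, h, _ = maps.shape
    out = np.zeros((n, h // 2, h // 2))
    for m in range(n):
        for r in range(h // 2):
            for c in range(h // 2):
                block = maps[m, 2*r:2*r + 2, 2*c:2*c + 2]
                out[m, r, c] = w[m] * block.sum() + b[m]  # same weight for all 4 inputs
    return out

# e.g. subsample(feature_maps(image, W, b), w=0.25 * np.ones(4), b=np.zeros(4))
# turns (4, 24, 24) into (4, 12, 12)
```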

11
Layers 3, 4 and 5
  • Layer 3: more feature maps. 12 maps of 8x8 units.
    Each map is produced by a single neuron (always
    the same one) mapping a 5x5 area into one unit.
  • Layer 4: same as layer 2, (8x8x12) -> (4x4x12).
  • Layer 5: 10 output units fully connected to layer
    4. This is where most of the weights are.

12
Overall
  • 5 layers, position invariance encoded in the
    architecture, a lot of weights shared.
  • 100k connections -> 2k independent parameters:
    every weight is shared on average by 50
    connections (counts are sketched below).
  • Training complexity is still O(100k), though.
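  • Back-of-the-envelope counts from the layer sizes
    quoted above (a sketch; the exact totals in the
    paper differ slightly):

```python
layer1_params      = 4 * (5 * 5 + 1)            # 104 shared weights + biases
layer1_connections = 4 * 24 * 24 * (5 * 5 + 1)  # every output unit has 26 incoming links
layer5_params      = 10 * (4 * 4 * 12 + 1)      # fully connected output layer

print(layer1_params, layer1_connections)  # 104 and ~60k: heavy weight sharing
print(layer5_params)                      # 1930: most of the ~2k free parameters
```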

13
Training the network
  • Training is by gradient descent, using
    backpropagation.
  • For each copy j of a shared weight there will be
    a Δwj computed by backprop. They are simply added
    together (a sketch follows).
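  • A sketch of what this means for one shared 5x5
    kernel (names and shapes follow the earlier
    feature-map sketch, not the original code):

```python
import numpy as np

def shared_kernel_gradient(image, deltas, k=5):
    """Gradient for one shared 5x5 kernel: sum the gradients of all its copies.

    image: 28x28 input; deltas: backpropagated errors for its 24x24 feature map.
    """
    grad = np.zeros((k, k))
    for r in range(deltas.shape[0]):
        for c in range(deltas.shape[1]):
            # contribution of the copy applied at position (r, c)
            grad += deltas[r, c] * image[r:r + k, c:c + k]
    return grad
```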

14
Results
  • After 30 epochs the error on the training set was
    1.1% and the squared error 0.017.
  • On the test set: 3.4% and 0.024.
  • To get 1% error: 5.7% rejection (9% on just
    handwritten).
  • A lot of these errors were actually caused by
    preprocessing. Some of those that weren't were
    ambiguous even to humans.

15
Invariances
  • In Le Cun's paper we saw that translation
    invariance was introduced into the network by
    weight sharing.
  • Teaching neural networks invariances is a general
    problem.

16
The invariance problem
  • Our perceptual systems are very good at dealing
    with invariances:
  • translation, rotation, scaling
  • deformation, contrast, lighting, rate
  • We are so good at this that it's hard to
    appreciate how difficult it is.
  • It's one of the main difficulties in making
    computers perceive.
  • We still don't have generally accepted solutions.

17
Invariances using features
  • Instead of representing an object directly, first
    extract features that are invariant to whatever
    transformation matters.
  • For instance, if we want roto-translational
    invariance, represent an object by the distances
    between its parts instead of their xyz coordinates
    (a sketch follows).
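  • A minimal sketch of such a representation (the
    helper name is illustrative):

```python
import numpy as np

def pairwise_distances(points):
    """points: (n, 3) array of xyz coordinates -> vector of all pairwise distances."""
    n = len(points)
    return np.array([np.linalg.norm(points[i] - points[j])
                     for i in range(n) for j in range(i + 1, n)])

pts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 2., 0.]])
shifted = pts + np.array([5., -3., 2.])  # translate the whole object
# the distance representation is unchanged by the translation
print(np.allclose(pairwise_distances(pts), pairwise_distances(shifted)))  # True
```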

18
Invariances: normalisation
  • For instance, put a box around an object, then
    scale it to a fixed size, the same as the
    preprocessing of the digits in Le Cun et al.
    (see the sketch below).
  • This eliminates degrees of freedom.
  • It is not always obvious how to choose the box.
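  • A rough sketch of box-and-rescale normalisation for
    a binary digit image (nearest-neighbour resampling
    is just for illustration):

```python
import numpy as np

def normalise(img, size=16):
    """Crop a binary image to the bounding box of its ink, then rescale."""
    rows = np.where(img.any(axis=1))[0]
    cols = np.where(img.any(axis=0))[0]
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]  # assumes some ink exists
    # crude nearest-neighbour resampling to size x size
    r_idx = np.arange(size) * crop.shape[0] // size
    c_idx = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(r_idx, c_idx)]
```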

19
Invariances: brute force
  • We can tackle invariances by:
  • constraining network weights
  • using features
  • normalising
  • But computer scientists should be lazy and
    impatient. Wouldn't it be great if we could let
    the network do all the work?
  • Brute force: to create invariance to
    transformation X, for each example generate a lot
    of new examples by applying X to it. Then train
    a large network on a fast computer.

20
Invariances: brute force
  • For example, translate and rotate a digit in a
    lot of different ways and train a large network
    to recognise it (a data-augmentation sketch
    follows).
  • It generally works well if the transformations
    aren't too large: do approximate (easy)
    normalisation first.
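  • A data-augmentation sketch along these lines
    (scipy's rotate/shift are used here; the parameter
    ranges are arbitrary):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, label, n_copies=20, seed=0):
    """Generate randomly translated and rotated copies of a digit, same label."""
    rng = np.random.default_rng(seed)
    examples = []
    for _ in range(n_copies):
        angle = rng.uniform(-15, 15)          # small rotations only
        dy, dx = rng.integers(-2, 3, size=2)  # shifts of up to 2 pixels
        examples.append((shift(rotate(img, angle, reshape=False), (dy, dx)), label))
    return examples
```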

21
Summary: invariances
  • Handling invariances is often as tough a problem
    as the learning that remains once they are
    tackled.
  • Possible solutions:
  • network design
  • features
  • normalisation
  • brute force

22
Problems with squared error
  • So far, for gradient descent, we used:
  • error: squared
  • output function: linear or sigmoid (binary won't
    work)
  • There are tricky problems with squared error. For
    instance, if the desired output is 1 and the
    actual output is very close to 0, there is almost
    no gradient (see the numerical sketch below).
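  • A quick numerical illustration of the problem
    (a sketch, assuming a sigmoid output and
    E = 1/2 (y - t)^2):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target = 1.0
for net in (-6.0, -2.0, 0.0, 2.0):
    y = sigmoid(net)
    grad = (y - target) * y * (1 - y)   # dE/dnet for squared error + sigmoid
    print(f"net={net:+.1f}  y={y:.4f}  dE/dnet={grad:+.6f}")
# At net = -6 the output is badly wrong (y ~ 0.0025) yet the gradient is only
# about -0.002: learning has almost stalled.
```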

23
Problems with squared error
  • These are the deltas for the output layer with
    squared error: δk = (dk - yk) f'(netk).
  • And these are f and f': for a sigmoid,
    f(x) = 1/(1 + e^(-x)) and f'(x) = f(x)(1 - f(x)),
    which is nearly zero when the output saturates
    near 0 or 1.

24
Alternatives: softmax and relative entropy
  • Non-local non-linearity.
  • The outputs add up to 1 (they can be interpreted
    as the probability of the output given the input);
    a sketch follows.
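  • A minimal sketch of softmax outputs with the
    relative-entropy (cross-entropy) error, where the
    output-layer gradient reduces to y - t, with no f'
    factor to vanish:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-6.0, 2.0, 1.0])  # net inputs to the output units
t = np.array([1.0, 0.0, 0.0])   # target: the correct class is unit 0
y = softmax(z)
print(y.sum())   # 1.0 -- the outputs can be read as probabilities
print(y - t)     # gradient wrt the net inputs: large exactly where the net is wrong
```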